Page 1: Ch 2. Probability Distributions (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.

Summarized by Yung-Kyun Noh and Joo-kyung Kim
Biointelligence Laboratory, Seoul National University
http://bi.snu.ac.kr/

Page 2: Contents

2.1. Binary Variables
  2.1.1. The beta distribution

2.2. Multinomial Variables
  2.2.1. The Dirichlet distribution

2.3. The Gaussian Distribution
  2.3.1. Conditional Gaussian distributions
  2.3.2. Marginal Gaussian distributions
  2.3.3. Bayes' theorem for Gaussian variables
  2.3.4. Maximum likelihood for the Gaussian
  2.3.5. Sequential estimation

Page 3: Density Estimation

Modeling the probability distribution p(x) of a random variable x, given a finite set x1,…,xN of observations. We will assume that the data points are i.i.d.

Fundamentally ill-posed: there are infinitely many probability distributions that could have given rise to the observed finite data set. The issue of choosing an appropriate distribution relates to the problem of model selection.

Begins by considering parametric distributions: binomial, multinomial, and Gaussian. These are governed by a small number of adaptive parameters, such as the mean and variance in the case of a Gaussian.

Page 4: Frequentist and Bayesian Treatments for Density Estimation

Frequentist: choose specific values for the parameters by optimizing some criterion, such as the likelihood function.

Bayesian: introduce prior distributions over the parameters and use Bayes' theorem to compute the corresponding posterior distribution given the observed data.

Page 5: Bernoulli Distribution

Considering a single binary r.v. x ∈ {0,1}.

Frequentist treatment: the likelihood function.

Suppose we have a data set D = {x1,…,xN} of observed values of x.

Maximum likelihood estimator

If we flip a coin 3 times and happen to observe 3 heads, the ML estimator is 1, an extreme example of the overfitting associated with ML.

Bernoulli distribution:
$\mathrm{Bern}(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}, \qquad \mathbb{E}[x] = \mu, \qquad \mathrm{var}[x] = \mu(1-\mu)$

Likelihood function:
$p(D \mid \mu) = \prod_{n=1}^{N} p(x_n \mid \mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n}$

$\ln p(D \mid \mu) = \sum_{n=1}^{N} \ln p(x_n \mid \mu) = \sum_{n=1}^{N} \{ x_n \ln\mu + (1-x_n)\ln(1-\mu) \}$

Maximum likelihood estimator:
$\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n$
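Not part of the original slides: a minimal NumPy sketch of the ML estimator above, including the three-heads coin example; the function name and example data are made up for illustration.

```python
import numpy as np

def bernoulli_mle(x):
    """ML estimate of mu for i.i.d. Bernoulli observations x in {0,1}."""
    x = np.asarray(x)
    return x.mean()  # mu_ML = (1/N) * sum_n x_n

# Three coin flips, all heads: the ML estimate is 1.0,
# i.e. the model predicts heads with certainty (overfitting).
print(bernoulli_mle([1, 1, 1]))        # 1.0
print(bernoulli_mle([1, 0, 1, 1, 0]))  # 0.6
```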

Page 6: Binomial Distribution

Binomial distribution: the distribution of the number m of observations of x=1, given that the data set has size N.

Histogram plot of the binomial distribution (N=10, μ=0.25).

$\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m}\mu^{m}(1-\mu)^{N-m}$

$\mathbb{E}[m] = N\mu, \qquad \mathrm{var}[m] = N\mu(1-\mu)$

Page 7: Beta Distribution

$\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1}$, where $\Gamma(x) = \int_0^{\infty} u^{x-1} e^{-u}\,du$ and $\int_0^1 \mathrm{Beta}(\mu \mid a, b)\,d\mu = 1$.

a and b are often called hyperparameters.

$\mathbb{E}[\mu] = \frac{a}{a+b}, \qquad \mathrm{var}[\mu] = \frac{ab}{(a+b)^2(a+b+1)}$

Page 8: Bernoulli & Binomial Distribution - Bayesian Treatment (1/3)

We need to introduce a prior distribution.

Conjugacy

The posterior distribution has the same functional form as the prior:
Beta (prior) × Binomial (likelihood) → Beta (posterior)
Dirichlet (prior) × Multinomial (likelihood) → Dirichlet (posterior)
Gaussian (prior) × Gaussian (likelihood) → Gaussian (posterior)

When the beta distribution is the prior, the posterior distribution of μ is obtained by multiplying the beta prior by the binomial likelihood function and normalizing.

It has the same functional dependence on μ as the prior distribution, reflecting the conjugacy property.

$p(\mu \mid m, l, a, b) \propto \mu^{m+a-1}(1-\mu)^{l+b-1}$, where $l = N - m$.

Page 9: Bernoulli & Binomial Distribution - Bayesian Treatment (2/3)

Because of the beta distribution's normalization property, the posterior is simple to normalize: it is simply another beta distribution.

The effect of observing a data set of m observations of x=1 and l observations of x=0 has been to increase the value of a by m and the value of b by l.

This allows us to provide a simple interpretation of the hyperparameters a and b in the prior as effective numbers of observations of x=1 and x=0, respectively.

$p(\mu \mid m, l, a, b) = \frac{\Gamma(m+a+l+b)}{\Gamma(m+a)\Gamma(l+b)}\mu^{m+a-1}(1-\mu)^{l+b-1}$

Page 10: Bernoulli & Binomial Distribution - Bayesian Treatment (3/3)

The posterior distribution can act as the prior if we subsequently observe additional data.

Prediction of the outcome of the next trial:

If m, l → ∞, the result reduces to the maximum likelihood result. The Bayesian and maximum likelihood (frequentist) results will agree in the limit of an infinitely large data set.

For a finite data set, the posterior mean for μ always lies between the prior mean and the maximum likelihood estimate μML, which corresponds to the relative frequencies of events.

As the number of observations increases, the posterior distribution becomes more sharply peaked (its variance is reduced).

Illustration of one step of sequential Bayesian inference

$p(x=1 \mid D) = \int_0^1 p(x=1 \mid \mu)\,p(\mu \mid D)\,d\mu = \int_0^1 \mu\,p(\mu \mid D)\,d\mu = \mathbb{E}[\mu \mid D] = \frac{m+a}{m+a+l+b}$
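Not in the slides: a small sketch of this conjugate beta-binomial update and the predictive probability, with made-up prior hyperparameters and counts for illustration.

```python
def beta_binomial_update(a, b, m, l):
    """Posterior hyperparameters after observing m ones and l zeros
    with a Beta(a, b) prior: Beta(a + m, b + l)."""
    return a + m, b + l

def predictive_prob_one(a, b):
    """p(x=1 | D) = E[mu | D] = a / (a + b) for a Beta(a, b) posterior."""
    return a / (a + b)

# Prior Beta(2, 2); observe m=3 heads and l=1 tail.
a_post, b_post = beta_binomial_update(2, 2, m=3, l=1)
print(a_post, b_post)                        # 5, 3
print(predictive_prob_one(a_post, b_post))   # 0.625, between prior mean 0.5 and mu_ML = 0.75
```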

Page 11: Multinomial Variables (1/2)

We will use the 1-of-K scheme: the variable is represented by a K-dimensional vector x in which one of the elements xk equals 1 and all remaining elements equal 0.

Ex) x = (0,0,1,0,0,0)^T

Considering a data set D of N independent observations.

$p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}$, with $\sum_{k=1}^{K} \mu_k = 1$

$\mathbb{E}[\mathbf{x} \mid \boldsymbol{\mu}] = \boldsymbol{\mu}$

$p(D \mid \boldsymbol{\mu}) = \prod_{n=1}^{N}\prod_{k=1}^{K} \mu_k^{x_{nk}} = \prod_{k=1}^{K} \mu_k^{\sum_n x_{nk}} = \prod_{k=1}^{K} \mu_k^{m_k}$

Page 12: Multinomial Variables (2/2)

Maximizing the log-likelihood using a Lagrange multiplier:

Multinomial distribution: the joint distribution of the quantities m1,…,mK, conditioned on the parameters μ and on the total number N of observations.

$\sum_{k=1}^{K} m_k \ln\mu_k + \lambda\left(\sum_{k=1}^{K}\mu_k - 1\right)$, which gives $\mu_k^{\mathrm{ML}} = \frac{m_k}{N}$, where $m_k = \sum_n x_{nk}$.

$\mathrm{Mult}(m_1, m_2, \ldots, m_K \mid \boldsymbol{\mu}, N) = \frac{N!}{m_1!\,m_2!\cdots m_K!}\prod_{k=1}^{K}\mu_k^{m_k}$, where $\sum_{k=1}^{K} m_k = N$.
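Not in the slides: a short sketch of the multinomial ML estimate $\mu_k^{\mathrm{ML}} = m_k/N$ computed from 1-of-K encoded data; the example matrix is made up.

```python
import numpy as np

def multinomial_mle(X):
    """ML estimate mu_k = m_k / N from N one-of-K encoded rows X (shape N x K)."""
    X = np.asarray(X)
    m = X.sum(axis=0)          # counts m_k = sum_n x_nk
    return m / X.shape[0]      # mu_ML,k = m_k / N

X = np.array([[0, 0, 1],
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 0]])
print(multinomial_mle(X))      # [0.25 0.25 0.5 ]
```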

Page 13: Dirichlet Distribution (1/2)

Dirichlet distribution: the relation of the multinomial and Dirichlet distributions is the same as that of the binomial and beta distributions.

Prior:
$\mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\prod_{k=1}^{K}\mu_k^{\alpha_k-1}$, where $\alpha_0 = \sum_{k=1}^{K}\alpha_k$.

Posterior:
$p(\boldsymbol{\mu} \mid D, \boldsymbol{\alpha}) = \mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha} + \mathbf{m}) = \frac{\Gamma(\alpha_0 + N)}{\Gamma(\alpha_1 + m_1)\cdots\Gamma(\alpha_K + m_K)}\prod_{k=1}^{K}\mu_k^{\alpha_k + m_k - 1}$
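Not in the slides: a minimal sketch of the Dirichlet-multinomial conjugate update, where the prior hyperparameters and counts are made-up example values.

```python
import numpy as np

def dirichlet_posterior(alpha, m):
    """Dirichlet prior Dir(mu | alpha) with multinomial counts m
    gives the posterior Dir(mu | alpha + m)."""
    return np.asarray(alpha, dtype=float) + np.asarray(m, dtype=float)

def dirichlet_mean(alpha):
    """E[mu_k] = alpha_k / alpha_0 for a Dirichlet distribution."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()

alpha_prior = np.array([1.0, 1.0, 1.0])    # symmetric prior over K = 3 states
m = np.array([2, 5, 3])                    # observed counts m_k
alpha_post = dirichlet_posterior(alpha_prior, m)
print(alpha_post)                  # [3. 6. 4.]
print(dirichlet_mean(alpha_post))  # posterior mean of mu
```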

Page 14: Dirichlet Distribution (2/2)

The Dirichlet distribution over three variables is confined to a simplex because of the constraints $0 \le \mu_k \le 1$ and $\sum_k \mu_k = 1$.

The two horizontal axes are coordinates in the simplex and the vertical axis corresponds to the density (αk = 0.1, 1, 10, respectively).

Page 15: The Gaussian Distribution

In the case of a single variable:
$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}$, where $\mu$ is the mean and $\sigma^2$ is the variance.

For a D-dimensional vector x:
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}$, where $\boldsymbol{\mu}$ is a D-dimensional mean vector and $\boldsymbol{\Sigma}$ is a D×D covariance matrix.

The Gaussian is the distribution that maximizes the entropy (for given first and second moments).

The central limit theorem: the sum of a set of random variables has a distribution that becomes increasingly Gaussian as the number of terms in the sum increases.
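Not in the slides: a small sketch that evaluates the multivariate Gaussian density directly from the formula above; the example mean and covariance are made up.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of N(x | mu, Sigma) for a D-dimensional x, computed from the formula."""
    x, mu, Sigma = np.asarray(x, float), np.asarray(mu, float), np.asarray(Sigma, float)
    D = mu.shape[0]
    diff = x - mu
    maha_sq = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha_sq) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
print(gaussian_pdf([0.5, -0.5], mu, Sigma))
```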

Page 16: The Geometrical Form of the Gaussian Distribution (1/3)

Functional dependence of the Gaussian on x:

Δ is called the Mahalanobis distance; it is the Euclidean distance when Σ is I.

Σ can be taken to be symmetric, because any antisymmetric component would disappear from the exponent.

The eigenvector equation: we choose the eigenvectors to form an orthonormal set.

$\Delta^2 = (\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$

$\boldsymbol{\Sigma}\mathbf{u}_i = \lambda_i\mathbf{u}_i, \qquad \mathbf{u}_i^{\mathrm{T}}\mathbf{u}_j = I_{ij}$

Page 17: The Geometrical Form of the Gaussian Distribution (2/3)

The covariance matrix can be expressed as an expansion in terms of its eigenvectors

The functional dependence becomes

We can interpret {yi} as a new coordinate system defined by the orthonormal vectors ui that are shifted and rotated.

$\boldsymbol{\Sigma} = \sum_{i=1}^{D}\lambda_i\mathbf{u}_i\mathbf{u}_i^{\mathrm{T}}, \qquad \boldsymbol{\Sigma}^{-1} = \sum_{i=1}^{D}\frac{1}{\lambda_i}\mathbf{u}_i\mathbf{u}_i^{\mathrm{T}}$

$\Delta^2 = \sum_{i=1}^{D}\frac{y_i^2}{\lambda_i}$, where $y_i = \mathbf{u}_i^{\mathrm{T}}(\mathbf{x}-\boldsymbol{\mu})$
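Not in the slides: a sketch checking numerically that the Mahalanobis distance computed directly equals the sum of $y_i^2/\lambda_i$ in the eigenvector coordinates; the covariance, mean, and point are made-up values.

```python
import numpy as np

# A symmetric positive-definite covariance matrix (example values).
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.0, 0.5])

# Eigenvector equation: Sigma u_i = lambda_i u_i, with orthonormal u_i.
lam, U = np.linalg.eigh(Sigma)          # columns of U are the eigenvectors u_i

# New coordinates y_i = u_i^T (x - mu).
y = U.T @ (x - mu)

# Mahalanobis distance computed two ways should agree:
delta_sq_direct = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
delta_sq_eigen = np.sum(y ** 2 / lam)   # sum_i y_i^2 / lambda_i
print(delta_sq_direct, delta_sq_eigen)
```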

Page 18: The Geometrical Form of the Gaussian Distribution (3/3)

y = U(x − μ), where U is an orthogonal matrix whose rows are given by $\mathbf{u}_i^{\mathrm{T}}$.

To be well defined, Σ should be positive definite.

The determinant |∑| of the covariance matrix can be written as the product of its eigenvalues.

In the y coordinates, the Gaussian distribution takes the form of a product of D independent univariate Gaussian distributions.

Normalization of p(y) confirms that the multivariate Gaussian is indeed normalized.

$|\boldsymbol{\Sigma}|^{1/2} = \prod_{j=1}^{D}\lambda_j^{1/2}$

$p(\mathbf{y}) = p(\mathbf{x})\,|\mathbf{J}| = \prod_{j=1}^{D}\frac{1}{(2\pi\lambda_j)^{1/2}}\exp\left\{-\frac{y_j^2}{2\lambda_j}\right\}$

Page 19: 1st Moment

1st moment of the multivariate Gaussian:

$\mathbb{E}[\mathbf{x}] = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\int\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}\mathbf{x}\,d\mathbf{x}
= \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\int\exp\left\{-\frac{1}{2}\mathbf{z}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\mathbf{z}\right\}(\mathbf{z}+\boldsymbol{\mu})\,d\mathbf{z}$, with $\mathbf{z} = \mathbf{x}-\boldsymbol{\mu}$.

Integrate this (the term in z vanishes by symmetry) to get $\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu}$.

Page 20: 2nd Moment

2nd moment of multivariate Gaussian

$\mathbb{E}[\mathbf{x}\mathbf{x}^{\mathrm{T}}] = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\int\exp\left\{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}\mathbf{x}\mathbf{x}^{\mathrm{T}}\,d\mathbf{x}
= \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\int\exp\left\{-\frac{1}{2}\mathbf{z}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\mathbf{z}\right\}(\mathbf{z}+\boldsymbol{\mu})(\mathbf{z}+\boldsymbol{\mu})^{\mathrm{T}}\,d\mathbf{z}$, with $\mathbf{z} = \mathbf{x}-\boldsymbol{\mu}$.

The cross terms $\boldsymbol{\mu}\mathbf{z}^{\mathrm{T}}$ and $\mathbf{z}\boldsymbol{\mu}^{\mathrm{T}}$ vanish by integration (symmetry).

Writing $\mathbf{z} = \sum_{j=1}^{D} y_j\mathbf{u}_j$:

$\frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\int\exp\left\{-\frac{1}{2}\sum_{k=1}^{D}\frac{y_k^2}{\lambda_k}\right\}\mathbf{z}\mathbf{z}^{\mathrm{T}}\,d\mathbf{z} = \sum_{i=1}^{D}\mathbf{u}_i\mathbf{u}_i^{\mathrm{T}}\lambda_i = \boldsymbol{\Sigma}$

$\mathbb{E}[\mathbf{x}\mathbf{x}^{\mathrm{T}}] = \boldsymbol{\mu}\boldsymbol{\mu}^{\mathrm{T}} + \boldsymbol{\Sigma}, \qquad \mathrm{cov}[\mathbf{x}] = \mathbb{E}[(\mathbf{x}-\mathbb{E}[\mathbf{x}])(\mathbf{x}-\mathbb{E}[\mathbf{x}])^{\mathrm{T}}] = \boldsymbol{\Sigma}$

Page 21: Covariance Matrix Form for the Gaussian Distribution

For large D, the total number of parameters (D(D+3)/2) grows quadratically with D.

One way to reduce computation cost is to restrict the form of the covariance matrix. (a) general form (b) diagonal (c) isotropic (proportional to the identity matrix)

Page 22: Conditional & Marginal Gaussian Distributions (1/2)

If two sets of variables are jointly Gaussian, then the conditional distribution of one set conditioned on the other is again Gaussian.

The mean of the conditional distribution p(xa|xb) is a linear function of xb and the covariance is independent of xa: an example of a linear-Gaussian model.

If a joint distribution p(xa, xb) is Gaussian, then the marginal distribution is also Gaussian. This can be shown using

$p(\mathbf{x}_a) = \int p(\mathbf{x}_a, \mathbf{x}_b)\,d\mathbf{x}_b$

$-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = -\frac{1}{2}\mathbf{x}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\mathbf{x} + \mathbf{x}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \mathrm{const}$

$\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}(\mathbf{x}_b-\boldsymbol{\mu}_b)$

$\boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba}$

Page 23: Conditional & Marginal Gaussian Distributions (2/2)

The contours of a Gaussian distribution p(xa, xb) over two variables.

The marginal distribution p(xa) and the conditional distribution p(xa|xb).

Page 24: Conditional Gaussian Distributions (1/3)

Find p(xa|xb) from the joint distribution: rearrange the multivariate Gaussian w.r.t. xa.

Define:
$\mathbf{x} = \begin{pmatrix}\mathbf{x}_a \\ \mathbf{x}_b\end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix}\boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b\end{pmatrix}, \quad \boldsymbol{\Sigma} = \begin{pmatrix}\boldsymbol{\Sigma}_{aa} & \boldsymbol{\Sigma}_{ab} \\ \boldsymbol{\Sigma}_{ba} & \boldsymbol{\Sigma}_{bb}\end{pmatrix}, \quad \boldsymbol{\Lambda} \equiv \boldsymbol{\Sigma}^{-1} = \begin{pmatrix}\boldsymbol{\Lambda}_{aa} & \boldsymbol{\Lambda}_{ab} \\ \boldsymbol{\Lambda}_{ba} & \boldsymbol{\Lambda}_{bb}\end{pmatrix}$

Caution: $\boldsymbol{\Lambda}_{aa} \neq (\boldsymbol{\Sigma}_{aa})^{-1}$.

Partitioning the quadratic part of the Gaussian again gives a quadratic form w.r.t. xa:

$-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu})
= -\frac{1}{2}(\mathbf{x}_a-\boldsymbol{\mu}_a)^{\mathrm{T}}\boldsymbol{\Lambda}_{aa}(\mathbf{x}_a-\boldsymbol{\mu}_a)
 -\frac{1}{2}(\mathbf{x}_a-\boldsymbol{\mu}_a)^{\mathrm{T}}\boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)
 -\frac{1}{2}(\mathbf{x}_b-\boldsymbol{\mu}_b)^{\mathrm{T}}\boldsymbol{\Lambda}_{ba}(\mathbf{x}_a-\boldsymbol{\mu}_a)
 -\frac{1}{2}(\mathbf{x}_b-\boldsymbol{\mu}_b)^{\mathrm{T}}\boldsymbol{\Lambda}_{bb}(\mathbf{x}_b-\boldsymbol{\mu}_b)$

Page 25: Conditional Gaussian Distributions (2/3)

Fix xb and use the general form
$-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = -\frac{1}{2}\mathbf{x}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\mathbf{x} + \mathbf{x}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \mathrm{const}$

2nd order terms in xa:
$-\frac{1}{2}\mathbf{x}_a^{\mathrm{T}}\boldsymbol{\Lambda}_{aa}\mathbf{x}_a \;\Rightarrow\; \boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Lambda}_{aa}^{-1}$

Linear terms in xa:
$\mathbf{x}_a^{\mathrm{T}}\{\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)\} \;\Rightarrow\; \boldsymbol{\mu}_{a|b} = \boldsymbol{\Sigma}_{a|b}\{\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)\} = \boldsymbol{\mu}_a - \boldsymbol{\Lambda}_{aa}^{-1}\boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)$

Page 26: Conditional Gaussian Distributions (3/3)

Use the identity for the inverse of a partitioned matrix:
$\begin{pmatrix}\mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D}\end{pmatrix}^{-1} = \begin{pmatrix}\mathbf{M} & -\mathbf{M}\mathbf{B}\mathbf{D}^{-1} \\ -\mathbf{D}^{-1}\mathbf{C}\mathbf{M} & \mathbf{D}^{-1} + \mathbf{D}^{-1}\mathbf{C}\mathbf{M}\mathbf{B}\mathbf{D}^{-1}\end{pmatrix}$, where $\mathbf{M} = (\mathbf{A} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1}$.

Then p(xa|xb) has

Conditional mean: $\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a + \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}(\mathbf{x}_b-\boldsymbol{\mu}_b)$
Conditional covariance: $\boldsymbol{\Sigma}_{a|b} = \boldsymbol{\Sigma}_{aa} - \boldsymbol{\Sigma}_{ab}\boldsymbol{\Sigma}_{bb}^{-1}\boldsymbol{\Sigma}_{ba}$

Note that the mean of the conditional distribution is a linear function of xb (a linear-Gaussian model), and the covariance is independent of xa.
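Not in the slides: a sketch that applies the conditional mean and covariance formulas above to a partitioned covariance matrix; the partitioning indices and numerical values are made up for illustration.

```python
import numpy as np

def conditional_gaussian(mu, Sigma, idx_a, idx_b, x_b):
    """Mean and covariance of p(x_a | x_b) for a jointly Gaussian x, using
    mu_{a|b} = mu_a + Sigma_ab Sigma_bb^{-1} (x_b - mu_b) and
    Sigma_{a|b} = Sigma_aa - Sigma_ab Sigma_bb^{-1} Sigma_ba."""
    mu, Sigma, x_b = np.asarray(mu), np.asarray(Sigma), np.asarray(x_b)
    mu_a, mu_b = mu[idx_a], mu[idx_b]
    S_aa = Sigma[np.ix_(idx_a, idx_a)]
    S_ab = Sigma[np.ix_(idx_a, idx_b)]
    S_ba = Sigma[np.ix_(idx_b, idx_a)]
    S_bb = Sigma[np.ix_(idx_b, idx_b)]
    gain = S_ab @ np.linalg.inv(S_bb)
    mu_cond = mu_a + gain @ (x_b - mu_b)      # linear in x_b
    Sigma_cond = S_aa - gain @ S_ba           # independent of x_a and x_b
    return mu_cond, Sigma_cond

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.3, 0.2],
                  [0.3, 1.0, 0.4],
                  [0.2, 0.4, 1.5]])
print(conditional_gaussian(mu, Sigma, idx_a=[0], idx_b=[1, 2], x_b=[1.5, -0.5]))
```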

Page 27: Marginal Gaussian Distributions (1/2)

Find $p(\mathbf{x}_a) = \int p(\mathbf{x}_a, \mathbf{x}_b)\,d\mathbf{x}_b$. Again, start from the partitioned quadratic form:

$-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu})
= -\frac{1}{2}(\mathbf{x}_a-\boldsymbol{\mu}_a)^{\mathrm{T}}\boldsymbol{\Lambda}_{aa}(\mathbf{x}_a-\boldsymbol{\mu}_a)
 -\frac{1}{2}(\mathbf{x}_a-\boldsymbol{\mu}_a)^{\mathrm{T}}\boldsymbol{\Lambda}_{ab}(\mathbf{x}_b-\boldsymbol{\mu}_b)
 -\frac{1}{2}(\mathbf{x}_b-\boldsymbol{\mu}_b)^{\mathrm{T}}\boldsymbol{\Lambda}_{ba}(\mathbf{x}_a-\boldsymbol{\mu}_a)
 -\frac{1}{2}(\mathbf{x}_b-\boldsymbol{\mu}_b)^{\mathrm{T}}\boldsymbol{\Lambda}_{bb}(\mathbf{x}_b-\boldsymbol{\mu}_b)$

Terms that involve xb:
$-\frac{1}{2}\mathbf{x}_b^{\mathrm{T}}\boldsymbol{\Lambda}_{bb}\mathbf{x}_b + \mathbf{x}_b^{\mathrm{T}}\mathbf{m} = -\frac{1}{2}(\mathbf{x}_b-\boldsymbol{\Lambda}_{bb}^{-1}\mathbf{m})^{\mathrm{T}}\boldsymbol{\Lambda}_{bb}(\mathbf{x}_b-\boldsymbol{\Lambda}_{bb}^{-1}\mathbf{m}) + \frac{1}{2}\mathbf{m}^{\mathrm{T}}\boldsymbol{\Lambda}_{bb}^{-1}\mathbf{m}$,
where $\mathbf{m} = \boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba}(\mathbf{x}_a-\boldsymbol{\mu}_a)$.

The first (quadratic in xb) term is integrated out when marginalizing over xb.

Page 28: Marginal Gaussian Distributions (2/2)

After marginalization, the terms remaining in xa are

$\frac{1}{2}[\boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba}(\mathbf{x}_a-\boldsymbol{\mu}_a)]^{\mathrm{T}}\boldsymbol{\Lambda}_{bb}^{-1}[\boldsymbol{\Lambda}_{bb}\boldsymbol{\mu}_b - \boldsymbol{\Lambda}_{ba}(\mathbf{x}_a-\boldsymbol{\mu}_a)]
 -\frac{1}{2}\mathbf{x}_a^{\mathrm{T}}\boldsymbol{\Lambda}_{aa}\mathbf{x}_a + \mathbf{x}_a^{\mathrm{T}}(\boldsymbol{\Lambda}_{aa}\boldsymbol{\mu}_a + \boldsymbol{\Lambda}_{ab}\boldsymbol{\mu}_b) + \mathrm{const}$
$= -\frac{1}{2}\mathbf{x}_a^{\mathrm{T}}(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba})\mathbf{x}_a + \mathbf{x}_a^{\mathrm{T}}(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba})\boldsymbol{\mu}_a + \mathrm{const}$

Again compare with
$-\frac{1}{2}\mathbf{x}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\mathbf{x} + \mathbf{x}^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu} + \mathrm{const}$,
and with $(\boldsymbol{\Lambda}_{aa} - \boldsymbol{\Lambda}_{ab}\boldsymbol{\Lambda}_{bb}^{-1}\boldsymbol{\Lambda}_{ba})^{-1} = \boldsymbol{\Sigma}_{aa}$.

We obtain the intuitively satisfying result that the marginal distribution p(xa) has mean and covariance given by
$\mathbb{E}[\mathbf{x}_a] = \boldsymbol{\mu}_a, \qquad \mathrm{cov}[\mathbf{x}_a] = \boldsymbol{\Sigma}_{aa}$

Page 29: Bayes' Theorem for Gaussian Variables (1/2)

Setting:
$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$
$p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\mathbf{x}+\mathbf{b}, \mathbf{L}^{-1})$
– The mean of y is a linear function of x.

Definition: $\mathbf{z} = \begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}$

$\ln p(\mathbf{z}) = \ln p(\mathbf{x}) + \ln p(\mathbf{y} \mid \mathbf{x})
= -\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Lambda}(\mathbf{x}-\boldsymbol{\mu})
 -\frac{1}{2}(\mathbf{y}-\mathbf{A}\mathbf{x}-\mathbf{b})^{\mathrm{T}}\mathbf{L}(\mathbf{y}-\mathbf{A}\mathbf{x}-\mathbf{b}) + \mathrm{const}
= -\frac{1}{2}\mathbf{z}^{\mathrm{T}}\mathbf{R}\mathbf{z} + \mathrm{const}$,
where $\mathbf{R} = \begin{pmatrix}\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A} & -\mathbf{A}^{\mathrm{T}}\mathbf{L} \\ -\mathbf{L}\mathbf{A} & \mathbf{L}\end{pmatrix}$.

Page 30: Bayes' Theorem for Gaussian Variables (2/2)

With similar steps,
$\mathbb{E}[\mathbf{z}] = \begin{pmatrix}\boldsymbol{\mu} \\ \mathbf{A}\boldsymbol{\mu}+\mathbf{b}\end{pmatrix}, \qquad \mathrm{cov}[\mathbf{z}] = \mathbf{R}^{-1} = \begin{pmatrix}\boldsymbol{\Lambda}^{-1} & \boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}} \\ \mathbf{A}\boldsymbol{\Lambda}^{-1} & \mathbf{L}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}\end{pmatrix}$

Note that
$\mathbb{E}[\mathbf{y}] = \mathbf{A}\boldsymbol{\mu}+\mathbf{b}, \qquad \mathrm{cov}[\mathbf{y}] = \mathbf{L}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}$

From the above equations and the mean and covariance equations of the conditional Gaussian distribution, we can also obtain
$\mathbb{E}[\mathbf{x} \mid \mathbf{y}] = (\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}\{\mathbf{A}^{\mathrm{T}}\mathbf{L}(\mathbf{y}-\mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\}$
$\mathrm{cov}[\mathbf{x} \mid \mathbf{y}] = (\boldsymbol{\Lambda}+\mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1}$
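Not in the slides: a sketch of the posterior p(x|y) for this linear-Gaussian model using the two formulas just above; the 1D example numbers (prior, A, b, noise precision, observation) are made up.

```python
import numpy as np

def linear_gaussian_posterior(mu, Lam, A, b, L, y):
    """Posterior p(x | y) for p(x) = N(x | mu, Lam^{-1}), p(y | x) = N(y | A x + b, L^{-1}):
    cov[x|y] = (Lam + A^T L A)^{-1},
    E[x|y]   = cov[x|y] { A^T L (y - b) + Lam mu }."""
    mu, y, b = map(np.asarray, (mu, y, b))
    prec_post = Lam + A.T @ L @ A
    cov_post = np.linalg.inv(prec_post)
    mean_post = cov_post @ (A.T @ L @ (y - b) + Lam @ mu)
    return mean_post, cov_post

# A tiny 1D example: scalar x observed through y = 2x + 1 + noise.
mu = np.array([0.0]); Lam = np.array([[1.0]])   # prior N(0, 1)
A = np.array([[2.0]]); b = np.array([1.0])
L = np.array([[4.0]])                           # observation noise variance 0.25
print(linear_gaussian_posterior(mu, Lam, A, b, L, y=np.array([3.0])))
```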

Page 31: Maximum Likelihood for the Gaussian (1/2)

Log-likelihood:
$\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol{\mu})^{\mathrm{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu})$

ML w.r.t. μ:
$\frac{\partial}{\partial\boldsymbol{\mu}}\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu}) = 0 \;\Rightarrow\; \boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n$

ML w.r.t. Σ (imposing the symmetry and positive definiteness constraints):
$\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}$

Page 32: Maximum Likelihood for the Gaussian (2/2)

Evaluating the expectations of the ML solutions under the true distribution.

The ML estimate for the covariance has an expectation that is less than the true value. Using the following estimator, this bias can be corrected.

$\mathbb{E}[\boldsymbol{\mu}_{\mathrm{ML}}] = \boldsymbol{\mu}, \qquad \mathbb{E}[\boldsymbol{\Sigma}_{\mathrm{ML}}] = \frac{N-1}{N}\boldsymbol{\Sigma}$

$\tilde{\boldsymbol{\Sigma}} = \frac{1}{N-1}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n-\boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}$
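Not in the slides: a sketch of the ML mean and covariance and the bias-corrected covariance estimator above; the sampled data set is made up for illustration.

```python
import numpy as np

def gaussian_mle(X, unbiased=False):
    """ML estimates of the mean and covariance for rows of X (shape N x D).
    With unbiased=True the covariance is divided by N-1 instead of N."""
    X = np.asarray(X, float)
    N = X.shape[0]
    mu = X.mean(axis=0)                 # mu_ML = (1/N) sum_n x_n
    diff = X - mu
    denom = N - 1 if unbiased else N
    Sigma = diff.T @ diff / denom       # sum_n (x_n - mu)(x_n - mu)^T / denom
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 2.0], [[1.0, 0.3], [0.3, 0.5]], size=200)
mu_ml, Sigma_ml = gaussian_mle(X)
_, Sigma_unbiased = gaussian_mle(X, unbiased=True)
print(mu_ml)
print(Sigma_ml)         # expectation is (N-1)/N times the true covariance
print(Sigma_unbiased)   # bias-corrected estimate
```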

Page 33: Sequential Estimation (1/4)

Allow data points to be processed one at a time and then discarded.

Robbins-Monro algorithm: a more general formulation of sequential learning. Consider a pair of random variables θ and z governed by a joint distribution p(z,θ).

Sequential update of the ML estimate of the Gaussian mean:

$\boldsymbol{\mu}_{\mathrm{ML}}^{(N)} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n
= \frac{1}{N}\mathbf{x}_N + \frac{1}{N}\sum_{n=1}^{N-1}\mathbf{x}_n
= \frac{1}{N}\mathbf{x}_N + \frac{N-1}{N}\boldsymbol{\mu}_{\mathrm{ML}}^{(N-1)}
= \boldsymbol{\mu}_{\mathrm{ML}}^{(N-1)} + \frac{1}{N}\left(\mathbf{x}_N - \boldsymbol{\mu}_{\mathrm{ML}}^{(N-1)}\right)$
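Not in the slides: a minimal sketch of this sequential mean update, checked against the batch mean; the data values are made up.

```python
import numpy as np

def sequential_mean(xs):
    """Update the ML mean one observation at a time:
    mu^(N) = mu^(N-1) + (1/N) * (x_N - mu^(N-1))."""
    mu = 0.0
    for n, x in enumerate(xs, start=1):
        mu = mu + (x - mu) / n      # each data point is used once and then discarded
    return mu

data = [2.0, 4.0, 6.0, 8.0]
print(sequential_mean(data))        # 5.0, identical to the batch mean
print(np.mean(data))
```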

Page 34: Sequential Estimation (2/4)

Regression function:
$f(\theta) \equiv \mathbb{E}[z \mid \theta] = \int z\,p(z \mid \theta)\,dz$

Our goal is to find the root θ* at which f(θ*) = 0. We observe z one at a time and wish to find a corresponding sequential estimation scheme for θ*.

Page 35: Sequential Estimation (3/4)

In the case of a Gaussian distribution, θ corresponds to μ.

The Robbins-Monro procedure defines a sequence of successive estimates of the root θ*:
$\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\,z\left(\theta^{(N-1)}\right)$,
where $\lim_{N\to\infty} a_N = 0$, $\sum_{N=1}^{\infty} a_N = \infty$, and $\sum_{N=1}^{\infty} a_N^2 < \infty$.

Page 36: Sequential Estimation (4/4)

A general maximum likelihood problem

Finding the maximum likelihood solution corresponds to finding the root of a regression function.

For the Gaussian, the maximum likelihood solution for μ is the value of μ that makes E[z|μ] = 0.

$\frac{\partial}{\partial\theta}\left\{\frac{1}{N}\sum_{n=1}^{N}\ln p(x_n \mid \theta)\right\}\bigg|_{\theta_{\mathrm{ML}}} = 0$

$\lim_{N\to\infty}\frac{1}{N}\sum_{n=1}^{N}\frac{\partial}{\partial\theta}\ln p(x_n \mid \theta) = \mathbb{E}_x\left[\frac{\partial}{\partial\theta}\ln p(x \mid \theta)\right]$

$\theta^{(N)} = \theta^{(N-1)} + a_{N-1}\,\frac{\partial}{\partial\theta^{(N-1)}}\ln p\left(x_N \mid \theta^{(N-1)}\right)$

For the Gaussian:
$z = \frac{\partial}{\partial\mu_{\mathrm{ML}}}\ln p(x \mid \mu_{\mathrm{ML}}, \sigma^2) = \frac{1}{\sigma^2}(x - \mu_{\mathrm{ML}})$
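Not in the slides: a sketch of the Robbins-Monro update applied to the Gaussian mean, using the step size $a_N = \sigma^2/(N+1)$ so that the update coincides with the sequential ML formula; the true mean, variance, and random seed are made-up example values.

```python
import numpy as np

rng = np.random.default_rng(1)
true_mu, sigma2 = 3.0, 2.0

# Robbins-Monro update for the Gaussian mean:
# theta^(N) = theta^(N-1) + a_{N-1} * z,  with z = (x_N - theta^(N-1)) / sigma^2.
# Choosing a_{N-1} = sigma^2 / N makes this identical to
# mu^(N) = mu^(N-1) + (x_N - mu^(N-1)) / N.
theta = 0.0
for N in range(1, 5001):
    x = rng.normal(true_mu, np.sqrt(sigma2))   # observe one data point
    z = (x - theta) / sigma2                   # d/d_theta ln p(x | theta)
    a = sigma2 / N                             # step sizes satisfying the Robbins-Monro conditions
    theta = theta + a * z
print(theta)   # close to true_mu = 3.0
```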

