Model Inference and Averaging

Prof. Liqing Zhang

Dept. Computer Science & Engineering, Shanghai Jiaotong University

Contents
• The Bootstrap and Maximum Likelihood Methods
• Bayesian Methods
• Relationship Between the Bootstrap and Bayesian Inference
• The EM Algorithm
• MCMC for Sampling from the Posterior
• Bagging
• Model Averaging and Stacking
• Stochastic Search: Bumping


Bootstrap by Basis Expansions

• Consider a linear expansion

  $\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$

• The least squares solution

  $\hat\beta = (H^T H)^{-1} H^T y$

• The covariance of $\hat\beta$

  $\widehat{\mathrm{Cov}}(\hat\beta) = (H^T H)^{-1}\,\hat\sigma^2, \qquad \hat\sigma^2 = \sum_{i=1}^{N} \bigl(y_i - \hat\mu(x_i)\bigr)^2 / N$
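A minimal numpy sketch of these three formulas; the simple polynomial basis (columns 1, x, x², x³) and the simulated data are illustrative assumptions, not from the slides:

```python
import numpy as np

# Hypothetical data and a simple polynomial basis (for illustration only).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)

H = np.vander(x, N=4, increasing=True)        # N x p basis matrix, rows h(x_i)^T

# Least-squares solution: beta_hat = (H^T H)^{-1} H^T y
beta_hat = np.linalg.solve(H.T @ H, H.T @ y)

# Plug-in noise variance: sigma2_hat = sum_i (y_i - mu_hat(x_i))^2 / N
resid = y - H @ beta_hat
sigma2_hat = np.sum(resid ** 2) / len(y)

# Estimated covariance of beta_hat: (H^T H)^{-1} * sigma2_hat
cov_beta = np.linalg.inv(H.T @ H) * sigma2_hat
print(beta_hat, sigma2_hat, cov_beta.shape)
```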


Parametric Model

• Assume a parameterized probability density (parametric model) for the observations: $z_i \sim g_\theta(z), \; i = 1, \ldots, N$


Maximum Likelihood Inference

• Suppose we are trying to measure the true value of some quantity (xT).

– We make repeated measurements of this quantity {x1, x2, … xn}.

– The standard way to estimate xT from our measurements is to calculate the mean value

  $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$

  and set $\hat{x}_T = \bar{x}$.

DOES THIS PROCEDURE MAKE SENSE?

The maximum likelihood method (MLM) answers this question and provides a general method for estimating parameters of interest from data.

The Maximum Likelihood Method

• Statement of the Maximum Likelihood Method
  – Assume we have made N measurements of x: {x1, x2, …, xN}.
  – Assume we know the probability distribution function that describes x: f(x, α).
  – Assume we want to determine the parameter α.
• MLM: pick α to maximize the probability of getting the measurements (the xi's) we did!

The MLM Implementation

• The probability of measuring x1 is f(x1, α)dx
• The probability of measuring x2 is f(x2, α)dx
• The probability of measuring xn is f(xn, α)dx
• If the measurements are independent, the probability of getting the measurements we did is:

  $L = f(x_1,\alpha)dx \cdot f(x_2,\alpha)dx \cdots f(x_n,\alpha)dx = f(x_1,\alpha)\,f(x_2,\alpha)\cdots f(x_n,\alpha)\,[dx]^n$

• We can drop the $[dx]^n$ term as it is only a proportionality constant.

  $L(\alpha) = \prod_{i=1}^{N} f(x_i,\alpha)$ is called the Likelihood Function.

Log Maximum Likelihood Method

• We want to pick the α that maximizes L:

  $\frac{\partial L}{\partial \alpha}\bigg|_{\alpha=\alpha^*} = 0$

  – It is often easier to maximize ln L.
  – L and ln L are both maximum at the same location.
• We maximize ln L rather than L itself because ln L converts the product into a summation:

  $\ln L = \sum_{i=1}^{N} \ln f(x_i, \alpha)$

Log Maximum Likelihood Method

• The new maximization condition is:

  $\frac{\partial \ln L}{\partial \alpha}\bigg|_{\alpha=\alpha^*} = \sum_{i=1}^{N} \frac{\partial \ln f(x_i,\alpha)}{\partial \alpha}\bigg|_{\alpha=\alpha^*} = 0$

• α could be an array of parameters (e.g. slope and intercept) or just a single variable.
• The equations that determine α range from simple linear equations to coupled non-linear equations.
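A small illustrative sketch (not from the slides) of this recipe: maximize ln L numerically by minimizing -ln L with scipy. The Gaussian model, the simulated data, and the use of log σ as a parameter are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical measurements; in practice these are the observed x_i.
rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.5, size=200)

def neg_log_likelihood(params, data):
    """-ln L(mu, sigma) = -sum_i ln f(x_i; mu, sigma) for a Gaussian model."""
    mu, log_sigma = params            # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(x,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)   # should be close to the sample mean and sample std
```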

An Example: Gaussian

• Let f(x, α) be given by a Gaussian distribution function.
• Let α = μ be the mean of the Gaussian. We want to use our data + MLM to find the mean.
• We want the best estimate of α from our set of n measurements {x1, x2, …, xn}.
• Let's assume that σ is the same for each measurement.

An Example: Gaussian

• Gaussian PDF

  $f(x_i, \mu) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$

• The likelihood function for this problem is:

  $L = \prod_{i=1}^{n} f(x_i,\mu) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n} e^{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}}$

An Example: Gaussian

• We want to find the μ that maximizes the log likelihood function:

  $\ln L = \sum_{i=1}^{n} \ln\!\left[\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}\right] = n \ln\!\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$

  $\frac{\partial \ln L}{\partial \mu} = \sum_{i=1}^{n}\frac{x_i-\mu}{\sigma^2} = 0 \;\Rightarrow\; \sum_{i=1}^{n}(x_i-\mu) = 0 \;\Rightarrow\; \mu = \frac{1}{n}\sum_{i=1}^{n} x_i$

An Example: Gaussian

• If the σi are different for each data point, then μ is just the weighted average:

  $\mu = \frac{\sum_{i=1}^{n} x_i/\sigma_i^2}{\sum_{i=1}^{n} 1/\sigma_i^2}$   (Weighted Average)
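A quick numerical check of the weighted-average formula, with made-up measurements and per-point uncertainties:

```python
import numpy as np

x = np.array([4.9, 5.2, 5.1, 4.7])        # hypothetical measurements
sigma = np.array([0.1, 0.3, 0.2, 0.5])    # per-measurement standard deviations

w = 1.0 / sigma**2                         # inverse-variance weights
mu_hat = np.sum(w * x) / np.sum(w)         # weighted-average estimate of mu
print(mu_hat)
```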

An Example: Poisson

• Let f(x, α) be given by a Poisson distribution.
• Let α = μ be the mean of the Poisson.
• We want the best estimate of α from our set of n measurements {x1, x2, … xn}.
• Poisson PDF:

  $f(x, \mu) = \frac{e^{-\mu}\,\mu^{x}}{x!}$

An Example: Poisson

• The likelihood function for this problem is:

  $L = \prod_{i=1}^{n} f(x_i,\mu) = \prod_{i=1}^{n} \frac{e^{-\mu}\,\mu^{x_i}}{x_i!} = \frac{e^{-n\mu}\,\mu^{\sum_{i=1}^{n} x_i}}{x_1!\,x_2!\cdots x_n!}$

An Example: Poisson

• Find the μ that maximizes the log likelihood function:

  $\ln L = -n\mu + \left(\sum_{i=1}^{n} x_i\right)\ln\mu - \ln(x_1!\,x_2!\cdots x_n!)$

  $\frac{d \ln L}{d\mu} = -n + \frac{1}{\mu}\sum_{i=1}^{n} x_i = 0 \;\Rightarrow\; \mu = \frac{1}{n}\sum_{i=1}^{n} x_i \quad\text{(Average)}$
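A small check of this result: evaluate the Poisson log-likelihood on a grid of μ values and confirm that the maximizer is (approximately) the sample mean. The count data and the grid are made up for the illustration.

```python
import numpy as np
from scipy.stats import poisson

counts = np.array([3, 7, 4, 6, 5, 2, 8, 5])     # hypothetical count data
mu_grid = np.linspace(0.5, 15, 2000)             # candidate values of mu

# ln L(mu) = sum_i ln f(x_i; mu) for the Poisson model
log_lik = np.array([poisson.logpmf(counts, mu).sum() for mu in mu_grid])

mu_mle = mu_grid[np.argmax(log_lik)]
print(mu_mle, counts.mean())   # the grid maximizer should roughly equal the sample mean
```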

General properties of MLM

• For large data samples (large n) the likelihood function, L, approaches a Gaussian distribution.
• Maximum likelihood estimates are usually consistent.
  – For large n the estimates converge to the true values of the parameters we wish to determine.
• Maximum likelihood estimates are asymptotically unbiased.
  – Any bias shrinks as the sample size grows; for small samples the estimate can be biased (the ML estimate of a Gaussian variance, which divides by n rather than n - 1, is the classic example).

General properties of MLM

• The maximum likelihood estimate is (asymptotically) efficient: it attains the smallest possible variance.
• The maximum likelihood estimate is sufficient: it uses all the information in the observations (the xi's).
• For well-behaved problems the solution from MLM is unique.
• Bad news: we must know the correct probability distribution for the problem at hand!

Maximum Likelihood

• We maximize the likelihood function

• Log-likelihood function


Score Function

• Assess the precision of $\hat\theta$ using the likelihood function
• The score function is the gradient of the log-likelihood,

  $\dot\ell(\theta; Z) = \sum_{i=1}^{N} \frac{\partial \ell(\theta; z_i)}{\partial \theta}$

• Assume that the likelihood takes its maximum in the interior of the parameter space. Then

  $\dot\ell(\hat\theta; Z) = 0$


Likelihood Function

• We maximize the likelihood function

  $L(\theta; Z) = \prod_{i=1}^{N} g_\theta(z_i)$

• We omit the normalization, since it only adds a constant factor
• Think of L as a function of θ with Z fixed
• Log-likelihood function

  $\ell(\theta; Z) = \sum_{i=1}^{N} \ell(\theta; z_i) = \sum_{i=1}^{N} \ln g_\theta(z_i)$


Fisher Information

• The negative sum of second derivatives is the information matrix

  $\mathbf{I}(\theta) = -\sum_{i=1}^{N} \frac{\partial^2 \ell(\theta; z_i)}{\partial\theta\,\partial\theta^T}$

• $\mathbf{I}(\theta)$ evaluated at $\theta = \hat\theta$ is called the observed information; it should be greater than 0.
• The Fisher information (expected information) is $\mathbf{i}(\theta) = \mathrm{E}_\theta[\mathbf{I}(\theta)]$.


Sampling Theory

• The basic result of sampling theory: the sampling distribution of the maximum-likelihood estimator approaches the following normal distribution as $N \to \infty$,

  $\hat\theta \;\longrightarrow\; N\bigl(\theta_0,\; \mathbf{i}(\theta_0)^{-1}\bigr),$

  when we sample independently from $g_{\theta_0}(z)$.
• This suggests approximating the sampling distribution of $\hat\theta$ by $N\bigl(\hat\theta,\; \mathbf{i}(\hat\theta)^{-1}\bigr)$ or $N\bigl(\hat\theta,\; \mathbf{I}(\hat\theta)^{-1}\bigr)$.


Error Bound

• The corresponding error (standard-error) estimates are obtained from

  $\sqrt{\mathbf{i}(\hat\theta)^{-1}_{jj}} \qquad\text{and}\qquad \sqrt{\mathbf{I}(\hat\theta)^{-1}_{jj}}$

• The confidence points have the form

  $\hat\theta_j - z^{(1-\alpha)}\sqrt{\mathbf{I}(\hat\theta)^{-1}_{jj}} \qquad\text{and}\qquad \hat\theta_j + z^{(1-\alpha)}\sqrt{\mathbf{I}(\hat\theta)^{-1}_{jj}}$

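A minimal numerical sketch of these error estimates under stated assumptions: a Gaussian model with known σ = 1, the observed information approximated by a finite-difference second derivative of ln L at θ̂, and a 95% interval using z = 1.96.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
z = rng.normal(2.0, 1.0, size=100)         # hypothetical sample, sigma assumed known (=1)

def log_lik(theta):
    return np.sum(norm.logpdf(z, loc=theta, scale=1.0))

theta_hat = z.mean()                        # MLE of the mean for this model

# Observed information I(theta_hat) via a central finite-difference second derivative
h = 1e-4
d2 = (log_lik(theta_hat + h) - 2 * log_lik(theta_hat) + log_lik(theta_hat - h)) / h**2
obs_info = -d2                              # should be approximately N / sigma^2 = 100

se = np.sqrt(1.0 / obs_info)                # standard error from the observed information
z95 = 1.96                                  # z^{(1-alpha)} for a 95% interval
print(theta_hat - z95 * se, theta_hat + z95 * se)
```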

Simplified form of the Fisher information

Suppose, in addition, that the operations of integration and differentiation can be swapped for the second derivative of f(x;θ) as well, i.e.,

  $\frac{\partial^2}{\partial\theta^2}\int f(x;\theta)\,dx = \int \frac{\partial^2}{\partial\theta^2} f(x;\theta)\,dx = 0$

In this case, it can be shown that the Fisher information equals

  $I(\theta) = -\,\mathrm{E}\!\left[\frac{\partial^2}{\partial\theta^2} \ln f(X;\theta)\right]$

The Cramér–Rao bound can then be written as

  $\mathrm{var}(\hat\theta) \;\ge\; \frac{1}{I(\theta)} \;=\; \frac{1}{-\,\mathrm{E}\!\left[\frac{\partial^2}{\partial\theta^2} \ln f(X;\theta)\right]}$


Single-parameter proof

If the expectation of T is denoted by ψ(θ), then, for all θ, the bound states that

  $\mathrm{var}(T) \;\ge\; \frac{[\psi'(\theta)]^2}{I(\theta)}$

Let X be a random variable with probability density function f(x;θ). Here T = t(X) is a statistic, which is used as an estimator for ψ(θ). If V is the score, i.e.

  $V = \frac{\partial}{\partial\theta} \ln f(X;\theta),$

then the expectation of V, written E(V), is zero. If we consider the covariance cov(V,T) of V and T, we have cov(V,T) = E(VT), because E(V) = 0. Expanding this expression we have

  $\mathrm{cov}(V,T) = \mathrm{E}\!\left[T \cdot \frac{\partial}{\partial\theta}\ln f(X;\theta)\right]$

Using the chain rule and the definition of expectation gives, after cancelling f(x;θ),

  $\mathrm{cov}(V,T) = \int t(x)\,\frac{\partial}{\partial\theta} f(x;\theta)\, dx = \frac{\partial}{\partial\theta}\int t(x)\, f(x;\theta)\, dx = \psi'(\theta),$

because the integration and differentiation operations commute (second condition). The Cauchy-Schwarz inequality shows that

  $\sqrt{\mathrm{var}(T)\,\mathrm{var}(V)} \;\ge\; |\mathrm{cov}(V,T)| = |\psi'(\theta)|$

Therefore

  $\mathrm{var}(T) \;\ge\; \frac{[\psi'(\theta)]^2}{\mathrm{var}(V)} \;=\; \frac{[\psi'(\theta)]^2}{I(\theta)}$


An Example

• Consider a linear expansion

  $\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$

• The least squares solution

  $\hat\beta = (H^T H)^{-1} H^T y$

• The covariance of $\hat\beta$

  $\widehat{\mathrm{Cov}}(\hat\beta) = (H^T H)^{-1}\,\hat\sigma^2, \qquad \hat\sigma^2 = \sum_{i=1}^{N} \bigl(y_i - \hat\mu(x_i)\bigr)^2 / N$

An Example

• The confidence region

  Consider the prediction from the model $\hat\mu(x) = \sum_{j=1}^{N} \hat\beta_j h_j(x)$.

  The standard deviation is

  $\widehat{se}[\hat\mu(x)] = \hat\sigma\,\bigl[\,h(x)^T (H^T H)^{-1} h(x)\,\bigr]^{1/2}$

  and the approximate 95% pointwise confidence band is

  $\hat\mu(x) \;\pm\; 1.96\,\widehat{se}[\hat\mu(x)]$
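A numpy sketch of this pointwise band, reusing the least-squares quantities from the earlier sketch (the basis and data are again illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)

H = np.vander(x, N=4, increasing=True)        # rows are h(x_i)^T (illustrative basis)
HtH_inv = np.linalg.inv(H.T @ H)
beta_hat = HtH_inv @ H.T @ y
sigma_hat = np.sqrt(np.sum((y - H @ beta_hat) ** 2) / len(y))

# Pointwise standard error: se[mu_hat(x)] = sigma_hat * sqrt(h(x)^T (H^T H)^{-1} h(x))
se = sigma_hat * np.sqrt(np.einsum("ij,jk,ik->i", H, HtH_inv, H))

mu_hat = H @ beta_hat
lower, upper = mu_hat - 1.96 * se, mu_hat + 1.96 * se
print(lower[:3], upper[:3])
```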

Bayesian Methods

• Given a sampling model Pr(Z | θ) and a prior Pr(θ) for the parameters, estimate the posterior probability

  $\Pr(\theta \mid Z) = \frac{\Pr(Z \mid \theta)\,\Pr(\theta)}{\int \Pr(Z \mid \theta)\,\Pr(\theta)\, d\theta}$

• Explore it by drawing samples or by estimating its mean or mode
• Differences from mere counting (the frequentist approach)
  – Prior: allows for uncertainties present before seeing the data
  – Posterior: allows for uncertainties present after seeing the data

Bayesian Methods

• The posterior distribution also affords a predictive distribution for seeing future values $z^{new}$:

  $\Pr(z^{new} \mid Z) = \int \Pr(z^{new} \mid \theta)\,\Pr(\theta \mid Z)\, d\theta$

• In contrast, the maximum-likelihood approach would predict future data on the basis of $\Pr(z^{new} \mid \hat\theta)$, not accounting for the uncertainty in the parameters.

An Example

• Consider a linear expansion

  $\mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)$

• The least squares solution

  $\hat\beta = (H^T H)^{-1} H^T y$

• Assume a Gaussian prior for the coefficients, $\beta \sim N(0, \tau\Sigma)$, i.e.

  $\Pr(\beta) = \frac{1}{(2\pi\tau)^{p/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2\tau}\,\beta^T \Sigma^{-1} \beta\right)$

• The posterior distribution for β is also Gaussian, with mean and covariance

  $\mathrm{E}(\beta \mid Z) = \Bigl(H^T H + \tfrac{\sigma^2}{\tau}\Sigma^{-1}\Bigr)^{-1} H^T y, \qquad \mathrm{cov}(\beta \mid Z) = \Bigl(H^T H + \tfrac{\sigma^2}{\tau}\Sigma^{-1}\Bigr)^{-1}\sigma^2$

• The corresponding posterior values for μ(x) are

  $\mathrm{E}(\mu(x) \mid Z) = h(x)^T\, \mathrm{E}(\beta \mid Z), \qquad \mathrm{cov}[\mu(x), \mu(x') \mid Z] = h(x)^T\, \mathrm{cov}(\beta \mid Z)\, h(x')$

  (a small numerical sketch of these posterior formulas follows below)
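A small numpy sketch of the posterior formulas above; the basis H, the noise variance σ², the prior scale τ, and Σ = I are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.shape)

H = np.vander(x, N=4, increasing=True)   # illustrative basis matrix
sigma2 = 0.3 ** 2                         # assume the noise variance is known
tau = 10.0                                # prior scale for beta ~ N(0, tau * Sigma)
Sigma = np.eye(H.shape[1])                # illustrative prior correlation, Sigma = I

# Posterior mean and covariance of beta given Z
A = H.T @ H + (sigma2 / tau) * np.linalg.inv(Sigma)
beta_mean = np.linalg.solve(A, H.T @ y)
beta_cov = np.linalg.inv(A) * sigma2

# Posterior mean of mu(x) at the observed points
mu_mean = H @ beta_mean
print(beta_mean, mu_mean[:5])
```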


Bootstrap vs Bayesian

• The bootstrap mean is an approximate posterior average
• Simple example:
  – A single observation z drawn from a normal distribution: $z \sim N(\theta, 1)$
  – Assume a normal prior for θ: $\theta \sim N(0, \tau)$
  – Resulting posterior distribution:

    $\theta \mid z \;\sim\; N\!\left(\frac{z}{1 + 1/\tau},\; \frac{1}{1 + 1/\tau}\right)$

Bootstrap vs Bayesian

• Three ingredients make this work
  – The choice of a noninformative prior for θ
  – The dependence of the log-likelihood $\ell(\theta; Z)$ on Z only through the maximum-likelihood estimate $\hat\theta$; thus $\ell(\theta; Z) = \ell(\theta; \hat\theta)$
  – The symmetry of the log-likelihood in θ and $\hat\theta$, i.e. $\ell(\theta; \hat\theta) = \ell(\hat\theta; \theta)$

Bootstrap vs Bayesian

• The bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for our parameter.
• But this bootstrap distribution is obtained painlessly, without having to formally specify a prior and without having to sample from the posterior distribution.
• Hence we might think of the bootstrap distribution as a "poor man's" Bayes posterior. By perturbing the data, the bootstrap approximates the Bayesian effect of perturbing the parameters, and is typically much simpler to carry out.

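As a rough illustration of this "poor man's Bayes" idea, a nonparametric bootstrap sketch for the mean of a sample; the data and the number of replicates are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.normal(loc=5.0, scale=2.0, size=100)   # hypothetical observed data

n_boot = 2000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Resample the data with replacement and recompute the statistic
    sample = rng.choice(z, size=z.size, replace=True)
    boot_means[b] = sample.mean()

# The spread of the bootstrap replicates approximates the posterior spread
print(boot_means.mean(), boot_means.std(), np.percentile(boot_means, [2.5, 97.5]))
```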

The EM Algorithm

• The EM algorithm for two-component Gaussian mixtures
  1. Take initial guesses for the parameters $\hat\mu_1, \hat\sigma_1^2, \hat\mu_2, \hat\sigma_2^2, \hat\pi$
  2. Expectation Step: compute the responsibilities

     $\hat\gamma_i = \frac{\hat\pi\,\phi_{\hat\theta_2}(y_i)}{(1-\hat\pi)\,\phi_{\hat\theta_1}(y_i) + \hat\pi\,\phi_{\hat\theta_2}(y_i)}, \qquad i = 1, \ldots, N$

The EM Algorithm

  3. Maximization Step: compute the weighted means and variances

     $\hat\mu_1 = \frac{\sum_i (1-\hat\gamma_i)\, y_i}{\sum_i (1-\hat\gamma_i)}, \quad \hat\sigma_1^2 = \frac{\sum_i (1-\hat\gamma_i)(y_i-\hat\mu_1)^2}{\sum_i (1-\hat\gamma_i)}, \quad \hat\mu_2 = \frac{\sum_i \hat\gamma_i\, y_i}{\sum_i \hat\gamma_i}, \quad \hat\sigma_2^2 = \frac{\sum_i \hat\gamma_i (y_i-\hat\mu_2)^2}{\sum_i \hat\gamma_i}$

     and the mixing probability $\hat\pi = \sum_i \hat\gamma_i / N$.

  4. Iterate steps 2 and 3 until convergence (a code sketch of these steps follows below).
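A compact, illustrative implementation of the two-component EM steps above; the initialization, iteration cap, and stopping tolerance are arbitrary choices for the sketch.

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(y, n_iter=200, tol=1e-8):
    """EM for a two-component 1-D Gaussian mixture, following the E/M steps above."""
    # 1. Initial guesses
    mu1, mu2 = y.min(), y.max()
    s1 = s2 = y.var()
    pi = 0.5
    ll_old = -np.inf
    for _ in range(n_iter):
        # 2. E-step: responsibilities of component 2
        p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(s1))
        p2 = pi * norm.pdf(y, mu2, np.sqrt(s2))
        gamma = p2 / (p1 + p2)
        # 3. M-step: weighted means, variances, and mixing probability
        mu1 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        s1 = np.sum((1 - gamma) * (y - mu1) ** 2) / np.sum(1 - gamma)
        mu2 = np.sum(gamma * y) / np.sum(gamma)
        s2 = np.sum(gamma * (y - mu2) ** 2) / np.sum(gamma)
        pi = gamma.mean()
        # 4. Iterate until the observed-data log-likelihood stops improving
        ll = np.sum(np.log(p1 + p2))
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll
    return mu1, s1, mu2, s2, pi

# Hypothetical bimodal data for the demo
rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(1, 0.5, 150), rng.normal(5, 1.0, 100)])
print(em_two_gaussians(y))
```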

The EM Algorithm in General

• Also known as the Baum-Welch algorithm in the context of hidden Markov models.
• Applicable to problems for which maximizing the log-likelihood is difficult, but which are simplified by enlarging the sample with unobserved (latent) data (data augmentation).



The EM Algorithm in General

1. Start with initial parameters $\hat\theta^{(0)}$
2. Expectation Step: at the j-th step compute

   $Q(\theta', \hat\theta^{(j)}) = \mathrm{E}\bigl[\ell_0(\theta'; \mathbf{T}) \mid \mathbf{Z}, \hat\theta^{(j)}\bigr]$

   as a function of the dummy argument $\theta'$ (here $\ell_0$ is the log-likelihood of the complete data $\mathbf{T}$, observed plus latent)
3. Maximization Step: determine the new parameters $\hat\theta^{(j+1)}$ by maximizing $Q(\theta', \hat\theta^{(j)})$ over $\theta'$
4. Iterate steps 2 and 3 until convergence

Model Averaging and Stacking

• Given predictions $\hat f_1(x), \hat f_2(x), \ldots, \hat f_M(x)$
• Under squared-error loss, seek weights $w = (w_1, w_2, \ldots, w_M)$ such that

  $\hat w = \arg\min_{w} \; \mathrm{E}_{\mathcal{P}}\Bigl[Y - \sum_{m=1}^{M} w_m \hat f_m(x)\Bigr]^2$

• Here the input x is fixed and the N observations in Z are distributed according to $\mathcal{P}$

Model Averaging and Stacking

• The solution is the population linear regression of Y on $\hat F(x)^T \equiv [\hat f_1(x), \ldots, \hat f_M(x)]$, namely

  $\hat w = \mathrm{E}_{\mathcal{P}}\bigl[\hat F(x)\hat F(x)^T\bigr]^{-1} \mathrm{E}_{\mathcal{P}}\bigl[\hat F(x)\, Y\bigr]$

• The full regression has smaller error than any single model, namely

  $\mathrm{E}_{\mathcal{P}}\Bigl[Y - \sum_{m=1}^{M} \hat w_m \hat f_m(x)\Bigr]^2 \;\le\; \mathrm{E}_{\mathcal{P}}\bigl[Y - \hat f_m(x)\bigr]^2 \quad \text{for each } m$

• The population linear regression is not available, so we replace it by a linear regression over the training set (stacking uses cross-validated predictions for this step); see the sketch below.
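An illustrative sketch of estimating stacking weights by regressing Y on cross-validated predictions from two base models; the models, the simulated data, and the 5-fold split are assumptions made for the example (using scikit-learn).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=200)

# Two hypothetical base models
models = [LinearRegression(), DecisionTreeRegressor(max_depth=3, random_state=0)]

# Cross-validated predictions f_m^{-i}(x_i), stacked as columns
F = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in models])

# Stacking weights: least-squares regression of y on the cross-validated predictions
w, *_ = np.linalg.lstsq(F, y, rcond=None)
print("stacking weights:", w)

# Combined prediction on the training inputs (refit base models on all data first)
preds = np.column_stack([m.fit(X, y).predict(X) for m in models])
print("stacked fit MSE:", np.mean((y - preds @ w) ** 2))
```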

