Page 1: Ch 3. Linear Models for Regression (1/2)

Pattern Recognition and Machine Learning, C. M. Bishop, 2006.

Previously summarized by Yung-Kyun Noh

Modified and presented by Rhee, Je-Keun

Biointelligence Laboratory, Seoul National University

http://bi.snu.ac.kr/

Page 2: Contents

3.1 Linear Basis Function Models
  3.1.1 Maximum likelihood and least squares
  3.1.2 Geometry of least squares
  3.1.3 Sequential learning
  3.1.4 Regularized least squares
  3.1.5 Multiple outputs

3.2 The Bias-Variance Decomposition

3.3 Bayesian Linear Regression
  3.3.1 Parameter distribution
  3.3.2 Predictive distribution
  3.3.3 Equivalent kernel

Page 3: Linear Basis Function Models

Linear regression

- Linear model: linearity in the parameters.
- Using basis functions allows nonlinear functions of the input vector x.
- This simplifies the analysis of this class of models, but also brings some significant limitations.

The simplest linear model:
$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_D x_D.$$

Linear basis function model:
$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}),$$
where M is the total number of parameters, the $\phi_j(\mathbf{x})$ are basis functions ($\phi_0(\mathbf{x}) = 1$ is a dummy basis function), $\mathbf{w} = (w_0, \dots, w_{M-1})^T$, and $\boldsymbol{\phi} = (\phi_0, \dots, \phi_{M-1})^T$.

Page 4: Basis Functions

Polynomial basis functions: $\phi_j(x) = x^j$. These are global functions of the input variable; spline functions avoid this by dividing the input space into regions and fitting a separate polynomial in each region.

Gaussian basis functions:
$$\phi_j(x) = \exp\left\{ -\frac{(x - \mu_j)^2}{2 s^2} \right\}.$$

Sigmoidal basis functions:
$$\phi_j(x) = \sigma\!\left( \frac{x - \mu_j}{s} \right), \qquad \text{with the logistic sigmoid } \sigma(a) = \frac{1}{1 + \exp(-a)}.$$

Other choices include the Fourier basis and wavelets.
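As a minimal sketch (not part of the original slides), the three basis-function families above can be evaluated with NumPy; the centres `mu_j` and width `s` below are arbitrary illustrative choices.

```python
import numpy as np

def polynomial_basis(x, degree):
    """phi_j(x) = x^j for j = 0..degree (phi_0 = 1 is the dummy basis)."""
    return np.vstack([x**j for j in range(degree + 1)]).T

def gaussian_basis(x, centres, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centres[None, :])**2 / (2.0 * s**2))

def sigmoidal_basis(x, centres, s):
    """phi_j(x) = sigma((x - mu_j) / s), with the logistic sigmoid."""
    a = (x[:, None] - centres[None, :]) / s
    return 1.0 / (1.0 + np.exp(-a))

# Example: evaluate each family on a grid of inputs.
x = np.linspace(-1.0, 1.0, 5)
centres = np.linspace(-1.0, 1.0, 3)              # arbitrary choice of mu_j
print(polynomial_basis(x, degree=2).shape)       # (5, 3)
print(gaussian_basis(x, centres, s=0.3).shape)   # (5, 3)
print(sigmoidal_basis(x, centres, s=0.3).shape)  # (5, 3)
```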

Page 5: Maximum Likelihood and Least Squares (1/3)

Assumption: Gaussian noise model
$$t = y(\mathbf{x}, \mathbf{w}) + \epsilon,$$
where $\epsilon$ is a zero-mean Gaussian random variable with precision (inverse variance) $\beta$, so that
$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}).$$

Result: the conditional mean is
$$\mathbb{E}[t \mid \mathbf{x}] = \int t\, p(t \mid \mathbf{x})\, dt = y(\mathbf{x}, \mathbf{w}) \quad \text{(unimodal)}.$$

For a dataset $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ with targets $\mathbf{t} = \{t_1, \dots, t_N\}$, the likelihood (dropping the explicit x) is
$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right).$$

Page 6: Maximum Likelihood and Least Squares (2/3)

Log-likelihood:
$$\ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \sum_{n=1}^{N} \ln \mathcal{N}\!\left(t_n \mid \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\right) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w}),$$
where
$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \}^2.$$

Maximization of the likelihood function under a conditional Gaussian noise distribution for a linear model is therefore equivalent to minimizing the sum-of-squares error function.

Page 7: Maximum Likelihood and Least Squares (3/3)

The gradient of the log likelihood function:
$$\nabla \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \}\, \boldsymbol{\phi}(\mathbf{x}_n)^T.$$

Setting this gradient to zero and solving for w gives
$$\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t},$$
where $\boldsymbol{\Phi}$ is the N×M design matrix with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$:
$$\boldsymbol{\Phi} = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix}.$$
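A hedged illustration of this solution: the sketch below builds a Gaussian-basis design matrix for a synthetic sin(2πx) data set (an assumed example, echoing the chapter's running example) and solves for w_ML with a least-squares solve rather than an explicit inverse.

```python
import numpy as np

def design_matrix(x, centres, s):
    """N x M design matrix: a dummy bias column phi_0 = 1 followed by Gaussian basis functions."""
    gauss = np.exp(-(x[:, None] - centres[None, :])**2 / (2.0 * s**2))
    return np.hstack([np.ones((x.size, 1)), gauss])

rng = np.random.default_rng(0)
N = 25
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)   # noisy sin(2*pi*x) targets

centres = np.linspace(0.0, 1.0, 9)                    # arbitrary basis centres
Phi = design_matrix(x, centres, s=0.1)

# w_ML = (Phi^T Phi)^{-1} Phi^T t, computed as a least-squares solve for stability.
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w_ml.shape)   # (10,) = M parameters including the bias
```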

Page 8: Bias and Precision Parameter by ML

Other quantities can be obtained by setting the corresponding derivative of the log likelihood to zero.

The bias that maximizes the log likelihood:
$$w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j \bar{\phi}_j, \qquad \bar{t} = \frac{1}{N}\sum_{n=1}^{N} t_n, \qquad \bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N} \phi_j(\mathbf{x}_n).$$

The bias compensates for the difference between the average (over the training set) of the target values and the weighted sum of the averages of the basis function values.

The noise precision parameter that maximizes the log likelihood:
$$\frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^{N} \{ t_n - \mathbf{w}_{ML}^T \boldsymbol{\phi}(\mathbf{x}_n) \}^2.$$
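A small sketch of the noise-precision estimate, under an assumed synthetic data set with a known noise level so the answer can be checked.

```python
import numpy as np

def noise_precision_ml(Phi, t, w_ml):
    """1 / beta_ML = (1/N) * sum_n (t_n - w_ML^T phi(x_n))^2, returned as beta_ML."""
    residuals = t - Phi @ w_ml
    return 1.0 / np.mean(residuals**2)

# Synthetic check: linear data plus Gaussian noise of std 0.5, so true beta = 1/0.25 = 4.
rng = np.random.default_rng(1)
Phi = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 3))])
w_true = np.array([0.5, 1.0, -2.0, 0.3])
t = Phi @ w_true + rng.normal(0.0, 0.5, 200)

w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(noise_precision_ml(Phi, t, w_ml))   # roughly 4
```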

Page 9: Geometry of Least Squares

If the number M of basis functions is smaller than the number N of data points, then the M vectors $\boldsymbol{\varphi}_j$ span a linear subspace S of dimensionality M.

Here $\boldsymbol{\varphi}_j$ denotes the jth column of $\boldsymbol{\Phi}$, with elements $\phi_j(\mathbf{x}_n)$ for $n = 1, \dots, N$.

y is a linear combination of the $\boldsymbol{\varphi}_j$. The least-squares solution for w corresponds to the choice of y that lies in the subspace S and is closest to t.

Page 10: Sequential Learning

On-line learning: the technique of stochastic gradient descent (also called sequential gradient descent),
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n.$$

For the sum-of-squares error function this gives the least-mean-squares (LMS) algorithm:
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta \left( t_n - \mathbf{w}^{(\tau)T} \boldsymbol{\phi}_n \right) \boldsymbol{\phi}_n.$$
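A minimal sketch of the LMS update, assuming a synthetic linear data set and an arbitrary learning rate η = 0.05.

```python
import numpy as np

def lms_update(w, phi_n, t_n, eta):
    """One LMS step: w <- w + eta * (t_n - w^T phi_n) * phi_n."""
    return w + eta * (t_n - w @ phi_n) * phi_n

# Stream the data points one at a time through the sequential update.
rng = np.random.default_rng(0)
Phi = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])
w_true = np.array([1.0, -0.5, 2.0])
t = Phi @ w_true + rng.normal(0.0, 0.1, 500)

w = np.zeros(3)
for phi_n, t_n in zip(Phi, t):
    w = lms_update(w, phi_n, t_n, eta=0.05)
print(w)   # approaches w_true as more points are presented
```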

Page 11: Regularized Least Squares

Regularized least squares controls over-fitting. The total error function is
$$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w}),$$
with the sum-of-squares data term
$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \}^2$$
and the quadratic regularizer
$$E_W(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}.$$

Closed-form solution (setting the gradient to zero):
$$\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}.$$
This represents a simple extension of the least-squares solution.

A more general regularizer gives the total error
$$\frac{1}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q.$$
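A sketch of the closed-form regularized solution for a few values of λ, using an assumed polynomial basis and synthetic sin(2πx) data; larger λ shrinks the weight vector.

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    """w = (lambda*I + Phi^T Phi)^{-1} Phi^T t, computed via a linear solve."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, 25)
Phi = np.vstack([x**j for j in range(10)]).T          # 9th-order polynomial basis

for lam in (1e-8, 1e-3, 1.0):
    w = ridge_solution(Phi, t, lam)
    print(lam, np.linalg.norm(w))    # larger lambda gives a smaller weight norm
```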

Page 12: General Regularizer

The case q = 1 in the general regularizer is known as the 'lasso' in the statistical literature.

If λ is sufficiently large, some of the coefficients $w_j$ are driven exactly to zero, giving a sparse model in which the corresponding basis functions play no role.

This is equivalent to minimizing the unregularized sum-of-squares error subject to the constraint
$$\sum_{j=1}^{M} |w_j|^q \le \eta$$
for an appropriate value of the parameter η.

[Figure: contours of the regularization term for different q; the lasso gives a sparse solution.]
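The slides do not give an algorithm for the q = 1 case; as one illustrative option, the sketch below minimizes the lasso objective by iterative soft-thresholding (ISTA) on synthetic data with only two relevant basis functions, showing coefficients driven exactly to zero.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Proximal operator of kappa * |v|: shrink each entry towards zero."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def lasso_ista(Phi, t, lam, n_iter=5000):
    """Minimize 0.5*||t - Phi w||^2 + lam*||w||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(Phi, 2)**2     # 1 / Lipschitz constant of the gradient
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ w - t)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)     # only two relevant basis functions
t = Phi @ w_true + rng.normal(0.0, 0.1, 50)

print(np.round(lasso_ista(Phi, t, lam=5.0), 2))   # most coefficients are exactly zero
```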

Page 13: Regularization & complexity

Regularization allows complex models to be trained on data sets of limited size without severe over-fitting, essentially by limiting the effective model complexity.

However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient λ.

Page 14: Multiple Outputs

For K > 1 target variables:
1. Introduce a different set of basis functions for each component of t.
2. Use the same set of basis functions to model all of the components of the target vector (W is an M×K matrix of parameters):
$$\mathbf{y}(\mathbf{x}, \mathbf{W}) = \mathbf{W}^T \boldsymbol{\phi}(\mathbf{x}), \qquad p(\mathbf{t} \mid \mathbf{x}, \mathbf{W}, \beta) = \mathcal{N}(\mathbf{t} \mid \mathbf{W}^T \boldsymbol{\phi}(\mathbf{x}), \beta^{-1} \mathbf{I}).$$

For each target variable $t_k$,
$$\mathbf{w}_k = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}_k = \boldsymbol{\Phi}^{\dagger} \mathbf{t}_k,$$
where $\boldsymbol{\Phi}^{\dagger}$ is the pseudo-inverse of $\boldsymbol{\Phi}$.

Page 15: The Bias-Variance Decomposition (1/4)

Frequentist viewpoint of the model complexity issue: bias-variance trade-off.

Expected squared loss:
$$\mathbb{E}[L] = \int \{ y(\mathbf{x}) - h(\mathbf{x}) \}^2 p(\mathbf{x})\, d\mathbf{x} + \iint \{ h(\mathbf{x}) - t \}^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt,$$
where the second term arises from the intrinsic noise on the data, and
$$h(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}] = \int t\, p(t \mid \mathbf{x})\, dt.$$

Bayesian: the uncertainty in our model is expressed through a posterior distribution over w.

Frequentist: make a point estimate of w based on the data set D. The first term then depends on the particular data set D, so we consider its average over data sets,
$$\mathbb{E}_D\!\left[\{ y(\mathbf{x}; D) - h(\mathbf{x}) \}^2\right].$$

Page 16: The Bias-Variance Decomposition (2/4)

Bias: the extent to which the average prediction over all data sets differs from the desired regression function.

Variance: the extent to which the solutions for individual data sets vary around their average, i.e. the extent to which the function y(x;D) is sensitive to the particular choice of data set.

Expected loss = (bias)² + variance + noise

$$\mathbb{E}_D\!\left[\{ y(\mathbf{x}; D) - h(\mathbf{x}) \}^2\right] = \underbrace{\{ \mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x}) \}^2}_{(\text{bias})^2} + \underbrace{\mathbb{E}_D\!\left[\{ y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)] \}^2\right]}_{\text{variance}}$$

Page 17: The Bias-Variance Decomposition (3/4)

Bias-variance trade-off: averaging many solutions for the complex model (M = 25) is a beneficial procedure.

A weighted averaging of multiple solutions (although with respect to the posterior distribution of the parameters, not with respect to multiple data sets) lies at the heart of the Bayesian approach.

(Example regression function in the figure: $h(x) = \sin(2\pi x)$.)

Page 18: The Bias-Variance Decomposition (4/4)

The average prediction:
$$\bar{y}(x) = \frac{1}{L} \sum_{l=1}^{L} y^{(l)}(x).$$

Bias and variance:
$$(\text{bias})^2 = \frac{1}{N} \sum_{n=1}^{N} \{ \bar{y}(x_n) - h(x_n) \}^2, \qquad \text{variance} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{L} \sum_{l=1}^{L} \{ y^{(l)}(x_n) - \bar{y}(x_n) \}^2.$$

The bias-variance decomposition is based on averages with respect to ensembles of data sets (a frequentist perspective). If we actually had multiple data sets, we would be better off combining them into a single large training set.
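A hedged simulation of these formulas: L data sets are drawn from h(x) = sin(2πx) plus noise, a regularized Gaussian-basis model is fitted to each, and the empirical (bias)² and variance are evaluated on a test grid. The data sizes, basis, and λ values are arbitrary choices, but the trade-off should be visible.

```python
import numpy as np

def design_matrix(x, centres, s=0.1):
    gauss = np.exp(-(x[:, None] - centres[None, :])**2 / (2.0 * s**2))
    return np.hstack([np.ones((x.size, 1)), gauss])

rng = np.random.default_rng(0)
L, N = 100, 25                        # number of data sets and points per set
centres = np.linspace(0.0, 1.0, 24)
x_test = np.linspace(0.0, 1.0, 100)
h_test = np.sin(2 * np.pi * x_test)   # the true regression function h(x)
Phi_test = design_matrix(x_test, centres)

for lam in (1e-6, 1.0, 100.0):
    preds = np.empty((L, x_test.size))
    for l in range(L):
        x = rng.uniform(0.0, 1.0, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, N)
        Phi = design_matrix(x, centres)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds[l] = Phi_test @ w
    y_bar = preds.mean(axis=0)                 # average prediction over data sets
    bias2 = np.mean((y_bar - h_test)**2)       # (bias)^2
    variance = np.mean(preds.var(axis=0))      # variance around the average
    print(f"lambda={lam:g}  bias^2={bias2:.3f}  variance={variance:.3f}")
```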

Page 19: Bayesian Linear Regression

The appropriate model complexity cannot be decided simply by maximizing the likelihood function, because this always leads to excessively complex models and over-fitting.

Independent hold-out data can be used to determine model complexity, but this can be both computationally expensive and wasteful of valuable data.

A Bayesian treatment of linear regression avoids the over-fitting problem of maximum likelihood, and also leads to automatic methods of determining model complexity using the training data alone.

Page 20: Parameter distribution (1/3)

Conjugate prior of the likelihood:
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0).$$

Posterior:
$$p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N),$$
$$\mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^T \mathbf{t}), \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}.$$

Because the posterior is Gaussian, its mode coincides with its mean, so the maximum posterior weight vector is $\mathbf{w}_{MAP} = \mathbf{m}_N$.

If $\mathbf{S}_0 = \alpha^{-1} \mathbf{I}$ with $\alpha \to 0$, the mean $\mathbf{m}_N$ reduces to $\mathbf{w}_{ML}$ given by (3.15):
$$\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}.$$
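A minimal sketch of the posterior update for a general Gaussian prior, assuming a straight-line model and arbitrary values of α and β.

```python
import numpy as np

def posterior(Phi, t, m0, S0, beta):
    """Gaussian posterior N(w | m_N, S_N) for a Gaussian prior N(w | m0, S0):
    S_N^{-1} = S0^{-1} + beta * Phi^T Phi,
    m_N     = S_N (S0^{-1} m0 + beta * Phi^T t)."""
    S0_inv = np.linalg.inv(S0)
    S_N = np.linalg.inv(S0_inv + beta * Phi.T @ Phi)
    m_N = S_N @ (S0_inv @ m0 + beta * Phi.T @ t)
    return m_N, S_N

# Straight-line example y(x, w) = w0 + w1*x with a broad isotropic prior.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 20)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, 20)
Phi = np.vstack([np.ones_like(x), x]).T

alpha, beta = 2.0, 25.0                                   # illustrative hyperparameters
m_N, S_N = posterior(Phi, t, m0=np.zeros(2), S0=np.eye(2) / alpha, beta=beta)
print(m_N)    # approaches the data-generating parameters (-0.3, 0.5)
```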

Page 21: Parameter distribution (2/3)

Consider the prior
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}).$$

The corresponding posterior has
$$\mathbf{m}_N = \beta \mathbf{S}_N \boldsymbol{\Phi}^T \mathbf{t}, \qquad \mathbf{S}_N^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}.$$

Log of the posterior:
$$\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ t_n - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}_n) \}^2 - \frac{\alpha}{2} \mathbf{w}^T \mathbf{w} + \text{const}.$$

Maximization of this posterior distribution with respect to w is equivalent to minimizing the sum-of-squares error function with the addition of a quadratic regularization term, with λ = α/β.

Page 22: Parameter distribution (3/3)

Other forms of prior over the parameters (a generalization of the Gaussian prior):
$$p(\mathbf{w} \mid \alpha) = \left[ \frac{q}{2} \left( \frac{\alpha}{2} \right)^{1/q} \frac{1}{\Gamma(1/q)} \right]^M \exp\left( -\frac{\alpha}{2} \sum_{j=1}^{M} |w_j|^q \right).$$

(Illustrative model in the accompanying figure: the straight line $y(x, \mathbf{w}) = w_0 + w_1 x$.)

Page 23: Predictive Distribution (1/2)

Our real interest is the predictive distribution:
$$p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, d\mathbf{w},$$
with the likelihood $p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$ and the posterior $p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$.

This gives
$$p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\!\left(t \mid \mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x})\right),$$
$$\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}).$$
The first term represents the noise on the data; the second term reflects the uncertainty associated with the parameters w and goes to 0 as N → ∞.

[Figure: mean of the Gaussian predictive distribution (red line) and predictive uncertainty (shaded region) as the number of data points increases.]
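A sketch of the predictive mean and variance for the zero-mean isotropic prior, with an assumed Gaussian basis and arbitrary α, β.

```python
import numpy as np

def gaussian_design(x, centres, s=0.1):
    gauss = np.exp(-(x[:, None] - centres[None, :])**2 / (2.0 * s**2))
    return np.hstack([np.ones((x.size, 1)), gauss])

def predictive(Phi_train, t, phi_star, alpha, beta):
    """Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x)."""
    M = Phi_train.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi_train.T @ Phi_train)
    m_N = beta * S_N @ Phi_train.T @ t
    mean = phi_star @ m_N
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', phi_star, S_N, phi_star)
    return mean, var

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 10)
centres = np.linspace(0.0, 1.0, 9)

x_star = np.linspace(0.0, 1.0, 5)
mean, var = predictive(gaussian_design(x, centres), t,
                       gaussian_design(x_star, centres), alpha=2.0, beta=25.0)
print(np.round(mean, 2))
print(np.round(np.sqrt(var), 2))   # predictive standard deviation per test point
```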

Page 24: Predictive Distribution (2/2)

Draw samples of w from the posterior distribution; each sample defines a function y(x, w), as in the sketch below.
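A minimal sketch of that procedure, assuming the straight-line model and arbitrary α, β: each sampled w defines one candidate function on a grid of inputs.

```python
import numpy as np

def sample_posterior_functions(Phi_train, t, Phi_grid, alpha, beta, n_samples=5, seed=0):
    """Draw w ~ N(m_N, S_N) and return the corresponding curves y(x, w) on a grid."""
    rng = np.random.default_rng(seed)
    M = Phi_train.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi_train.T @ Phi_train)
    m_N = beta * S_N @ Phi_train.T @ t
    w_samples = rng.multivariate_normal(m_N, S_N, size=n_samples)
    return Phi_grid @ w_samples.T          # one column of function values per sample

# Straight-line model: each sampled w gives one plausible line through the data.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 15)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, 15)
Phi = np.vstack([np.ones_like(x), x]).T
x_grid = np.linspace(-1.0, 1.0, 4)
Phi_grid = np.vstack([np.ones_like(x_grid), x_grid]).T

print(sample_posterior_functions(Phi, t, Phi_grid, alpha=2.0, beta=25.0).shape)  # (4, 5)
```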

Page 25: Equivalent Kernel (1/2)

If we substitute (3.53) into the expression (3.3), we see that the predictive mean can be written in the form
$$y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x}) = \beta \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\Phi}^T \mathbf{t} = \sum_{n=1}^{N} \beta \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}_n)\, t_n.$$

Thus the mean of the predictive distribution at a point x is
$$y(\mathbf{x}, \mathbf{m}_N) = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n)\, t_n, \qquad k(\mathbf{x}, \mathbf{x}') = \beta \boldsymbol{\phi}(\mathbf{x})^T \mathbf{S}_N \boldsymbol{\phi}(\mathbf{x}'),$$
where $k(\mathbf{x}, \mathbf{x}')$ is known as the smoother matrix or equivalent kernel.

[Figure: equivalent kernels for polynomial and sigmoidal basis functions.]
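A sketch of the equivalent kernel for an assumed Gaussian basis: the weights k(x, x_n) at a query point x = 0.5 are localized around nearby training inputs and sum to roughly one.

```python
import numpy as np

def equivalent_kernel(x_eval, x_train, centres, alpha, beta, s=0.1):
    """k(x, x_n) = beta * phi(x)^T S_N phi(x_n) for every training point x_n."""
    def design(x):
        gauss = np.exp(-(x[:, None] - centres[None, :])**2 / (2.0 * s**2))
        return np.hstack([np.ones((x.size, 1)), gauss])
    Phi = design(x_train)
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    return beta * design(x_eval) @ S_N @ Phi.T      # shape (len(x_eval), N)

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 1.0, 50))
t = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 50)
centres = np.linspace(0.0, 1.0, 11)

K = equivalent_kernel(np.array([0.5]), x_train, centres, alpha=2.0, beta=25.0)
# The predictive mean at x = 0.5 is the kernel-weighted sum of the training targets;
# the weights are localized around x_n = 0.5 and sum to approximately one.
print(np.round(K @ t, 3), np.round(K.sum(), 3))
```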

Page 26: Equivalent Kernel (2/2)

Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can instead define a localized kernel directly and use it to make predictions for a new input vector x, given the observed training set. This leads to a practical framework for regression (and classification) called Gaussian processes.

The equivalent kernel satisfies an important property shared by kernel functions in general: it can be expressed as an inner product with respect to a vector ψ(x) of nonlinear functions,
$$k(\mathbf{x}, \mathbf{z}) = \boldsymbol{\psi}(\mathbf{x})^T \boldsymbol{\psi}(\mathbf{z}), \qquad \boldsymbol{\psi}(\mathbf{x}) = \beta^{1/2} \mathbf{S}_N^{1/2} \boldsymbol{\phi}(\mathbf{x}).$$

