Semi-supervised learning. Guido Sanguinetti, Department of Computer Science, University of Sheffield. EPSRC Winter School 01/08.
Transcript
Page 1: Semi-supervised learning

Guido Sanguinetti
Department of Computer Science, University of Sheffield
EPSRC Winter School 01/08

Page 2: Programme

• Different ways of learning
  – Unsupervised learning
    • EM algorithm
  – Supervised learning
• Different ways of going semi-supervised
  – Generative models
    • Reweighting
  – Discriminative models
    • Regularisation: clusters and manifolds

Page 3: Disclaimer

• This is going to be a high-level introduction rather than a technical talk
• We will spend some time talking about supervised and unsupervised learning as well
• We will be unashamedly probabilistic, if not fully Bayesian
• Main references: C. Bishop's book "Pattern Recognition and Machine Learning" and M. Seeger's chapter in "Semi-Supervised Learning", Chapelle, Schölkopf and Zien, eds.

Page 4: Different ways to learn

• Reinforcement learning: learn a strategy to optimise a reward.
• Closely related to decision theory and control theory.
• Supervised learning: learn a map.
• Unsupervised learning: estimate a density.

Page 5: Unsupervised learning

• We are given data x.
• We want to estimate the density that generated the data.
• Generally, assume a latent variable y as in the graphical model below.
• A continuous y leads to dimensionality reduction (cf. previous lecture), a discrete one to clustering.

[Graphical model: latent variable y (with prior parameters π) generates observation x (with parameters θ)]
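Marginalising the latent variable gives the density we are estimating; in standard notation (assumed here), for a discrete latent y,

p(x \mid \pi, \theta) = \sum_y p(y \mid \pi) \, p(x \mid y, \theta)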

Page 6: Example: mixture of Gaussians

• Data: vector measurements x_i, i = 1, ..., N.
• Latent variable: K-dimensional binary vectors (class membership) y_i; y_ij = 1 means point i belongs to class j.
• The θ parameters are in this case the covariances and the means of each component.
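In the standard notation of Bishop's book (assumed here), the marginal density of each data point is the Gaussian mixture

p(x_i \mid \pi, \theta) = \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)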

Page 7: Estimating mixtures

• Two objectives: estimate the parameters π_j, μ_j and Σ_j, and estimate the posterior probabilities of class membership (see below).
• The γ_ij are the responsibilities.
• Could use gradient descent.
• Better to use EM.
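In the usual mixture-of-Gaussians notation (assumed here), the responsibilities are the posterior class-membership probabilities

\gamma_{ij} \equiv p(y_{ij} = 1 \mid x_i) = \frac{\pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}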

Page 8: Expectation-Maximization

• Iterative procedure to estimate the maximum likelihood values of parameters in models with latent variables.
• We want to maximise the log-likelihood of the model (see below).
• Notice that the log of a sum is not nice.
• The key mathematical tool is Jensen's inequality.
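For the mixture example the log-likelihood takes the standard form (notation as above), which makes the awkward log of a sum explicit:

\log p(X \mid \pi, \theta) = \sum_{i=1}^{N} \log \sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)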

Page 9: Jensen's inequality

• For every concave function f, random variable x and probability distribution q(x), the inequality below holds.
• A cartoon of the proof relies on the centre of mass of a convex polygon lying inside the polygon.
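The inequality referred to in the first bullet is, for a concave f,

f\big(\mathbb{E}_{q(x)}[x]\big) \;\ge\; \mathbb{E}_{q(x)}[f(x)]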

Page 10: Bound on the log-likelihood

• Jensen's inequality leads to the lower bound given below.
• H[q] is the entropy of the distribution q. Notice the absence of the nasty log of a sum.
• If and only if q(c) is the posterior p(c|x, θ), the bound is saturated.
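With q(c) an arbitrary distribution over the latent labels, applying Jensen's inequality to the concave log gives the standard bound (assumed notation):

\log p(x \mid \theta) = \log \sum_c q(c) \, \frac{p(x, c \mid \theta)}{q(c)} \;\ge\; \sum_c q(c) \log p(x, c \mid \theta) + H[q]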

Page 11: EM

• For a fixed value of the parameters, the posterior will saturate the bound.
• In the M-step, optimise the bound with respect to the parameters.
• In the E-step, recompute the posterior with the new value of the parameters.
• Exercise: EM for mixture of Gaussians (a sketch is given below).

[Figure: cartoon of EM alternating E-steps and M-steps to increase the bound over the parameters θ]
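As a concrete illustration of the exercise, here is a minimal EM sketch for a mixture of Gaussians in Python. It is a sketch only: the variable names, the initialisation and the small ridge added to the covariances are choices made here, not taken from the talk.

# Minimal EM for a Gaussian mixture (illustrative sketch).
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100):
    N, D = X.shape
    rng = np.random.default_rng(0)
    pi = np.full(K, 1.0 / K)                      # mixing weights pi_j
    mu = X[rng.choice(N, K, replace=False)]       # means mu_j, initialised at data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma_ij = p(y_ij = 1 | x_i) under current parameters
        dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                for j in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi, mu, Sigma from the responsibilities
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nk[j] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, gamma

The returned gamma plays the role of the responsibilities of Page 7; each E-step saturates the bound of Page 10 for the current parameters, and each M-step maximises it.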

Page 12: Supervised learning

• The data consists of a set of inputs x and a set of output (target) values y.
• The goal is to learn the functional relation f: x → y.
• Evaluated using reconstruction error (classification accuracy), usually on a separate test set.
• Continuous y leads to regression, discrete to classification.

Page 13: Classification-Generative

• The generative approach starts with modelling p(x|c), as in unsupervised learning.
• The model parameters are estimated (e.g. using maximum likelihood).
• The assignment is based on the posterior probabilities p(c|x) (see below).
• Requires estimation of many parameters.

[Graphical model: class c (with prior parameters π) generates observation x (with parameters θ)]
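The posterior used for the assignment follows from Bayes' theorem; writing π_j for the class priors (notation assumed here),

p(c_j \mid x) = \frac{\pi_j \, p(x \mid c_j, \theta)}{\sum_k \pi_k \, p(x \mid c_k, \theta)}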

Page 14: Example: discriminant analysis

• Assume class-conditional distributions to be Gaussian (see below).
• Estimate means and covariances using maximum likelihood (O(KD²) parameters).
• Classify novel (test) data using the posterior.
• Exercise: rewrite the posterior in terms of sigmoids.
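Concretely, the assumed model is the standard Gaussian class-conditional one, with the posterior again given by Bayes' theorem:

p(x \mid c_k) = \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad p(c_k \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}

For two classes, dividing through by the numerator writes the posterior as a logistic sigmoid of the log-odds, which is the point of the exercise.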

Page 15: Classification-Discriminative

• Also called diagnostic.
• Modelling the class-conditional distributions involves significant overheads.
• Discriminative techniques avoid this by modelling directly the posterior distribution p(c_j|x).
• Closely related to the concept of transductive learning.

[Graphical model: observation x generated with parameters μ; class c depends on x and parameters θ]

Page 16: Example: logistic regression

• Restrict to the two-class case.
• Model the posteriors as shown below, where we have introduced the logistic sigmoid.
• Notice the similarity with the discriminant function in generative models.
• Number of parameters to be estimated is D.
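A standard way to write the model (parameterisation assumed here; a bias term can be absorbed into w by appending a constant feature to x):

p(c = 1 \mid x) = \sigma(w^{\top} x), \qquad \sigma(a) = \frac{1}{1 + e^{-a}}, \qquad p(c = 0 \mid x) = 1 - \sigma(w^{\top} x)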

Page 17: Estimating logistic regression

• Let c_i be 0 or 1.
• The likelihood for the data (x_i, c_i) is given below.
• The gradient of the negative log-likelihood is also given below.
• Overfitting: may need a prior on w.
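Writing y_i = σ(wᵀ x_i) for the predicted probability of class 1 (notation assumed here), the standard expressions are

p(c_1, \ldots, c_N \mid w) = \prod_{i=1}^{N} y_i^{c_i} (1 - y_i)^{1 - c_i}, \qquad -\nabla_w \log p(c_1, \ldots, c_N \mid w) = \sum_{i=1}^{N} (y_i - c_i) \, x_i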

Page 18: Semi-supervised learning

• In many practical applications, the labelling process is expensive and time consuming.
• There are plenty of examples of inputs x, but few of the corresponding targets c.
• The goal is still to predict p(c|x), and it is evaluated using classification accuracy.
• Semi-supervised learning is, in this sense, a special case of supervised learning where we use the extra unlabelled data to improve the predictive power.

Page 19: Notation

• We have a labelled or complete data set D_l, comprising N_l pairs (x, c).
• We have an unlabelled or incomplete data set D_u, comprising N_u vectors x.
• The interesting (and common) case is when N_u >> N_l.

Page 20: Baselines

• One attack on semi-supervised learning would be to ignore the labels and estimate an unsupervised clustering model.
• Another attack is to ignore the unlabelled data and just use the complete data.
• No free lunch: it is possible to construct data distributions for which either of the baselines outperforms a given SSL method.
• The art in SSL is designing good models for specific applications, rather than theory.

Page 21: Generative SSL

• Generative SSL methods are the most intuitive.
• The graphical model for generative classification and unsupervised learning is the same.
• The likelihoods are combined (see below).
• Parameters are estimated by EM.

[Graphical model: class c (with prior parameters π) generates observation x (with parameters θ)]
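Combining the labelled and unlabelled contributions gives a joint log-likelihood of the standard form (notation assumed here; for the unlabelled points the class label is treated as a latent variable and summed out):

\log p(D_l, D_u \mid \pi, \theta) = \sum_{i \in D_l} \log \big[ \pi_{c_i} \, p(x_i \mid c_i, \theta) \big] + \sum_{j \in D_u} \log \sum_k \pi_k \, p(x_j \mid c_k, \theta)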

Page 22: Discriminant analysis

• McLachlan (1977) considered SSL for a generative model with two Gaussian classes.
• Assume the covariance to be known; the means need to be estimated from the data.
• Assume we have N_u = M_1 + M_2 << N_l.
• Assume also that the labelled data is split evenly between the two classes, so N_l = 2n.

Page 23: A surprising result

• Consider the expected misclassification error.
• Let R be the expected error obtained using only the labelled data, and R_1 the expected error obtained using all the data (estimating parameters with EM).
• Under an additional condition, the first-order expansion (in N_u/N_l) of R - R_1 is positive only for N_u < M*, with M* finite.
• In other words, unlabelled data helps only up to a point.

Page 24: A way out

• Perhaps considering labelled and unlabelled data on the same footing is not ideal.
• McLachlan went on to propose to estimate the class means as a weighted combination of labelled and unlabelled contributions (a sketch is given below).
• He then showed that, for a suitable choice of the weight, the expected error obtained in this way is, to first order, always lower than the one obtained without the unlabelled data.
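As an illustration only (the weight w, the hard assignments ĉ_j of the unlabelled points and the counts n_k, m_k are notational choices made here; the exact estimator in McLachlan's paper may differ), an estimator of this general shape is

\hat{\mu}_k = \frac{\sum_{i \in D_l,\, c_i = k} x_i \;+\; w \sum_{j \in D_u,\, \hat{c}_j = k} x_j}{n_k + w\, m_k}

with n_k labelled and m_k assigned unlabelled points in class k.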

Page 25: A hornets' nest

• In general, it seems sensible to reweight the log-likelihood (see below).
• This is partly because, in the important case when N_l is small, we do not want the label information to be swamped.
• But how do you choose the weight λ?
• Cross-validation on the labels?
• Unfeasible for small N_l.
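One way to write the reweighted objective, with a weight λ ∈ [0, 1] (both the symbol and the parameterisation are assumptions made here, chosen so that λ = 1 recovers the purely labelled case and λ = 0 the purely unlabelled one, matching the next slide):

L_{\lambda}(\theta) = \lambda \sum_{i \in D_l} \log p(x_i, c_i \mid \theta) + (1 - \lambda) \sum_{j \in D_u} \log p(x_j \mid \theta)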

Page 26: Stability

• An elegant solution was proposed by Corduneanu and Jaakkola (2002).
• In the limit when all the data is labelled, the likelihood is unimodal.
• In the limit when all the data is unlabelled, the likelihood is multimodal (e.g. because of permutations).
• When moving from λ = 1 to λ = 0, we must encounter a critical λ* at which the likelihood becomes multimodal.
• That is the "optimal" choice.

Page 27: Discriminative SSL

• The alternative paradigm for SSL is the discriminative approach.
• As in supervised learning, the discriminative approach models directly p(c|x), using the graphical model below.
• The total likelihood factorises (see below).

[Graphical model: observation x generated with parameters μ; class c depends on x and parameters θ]
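With μ parameterising the input distribution and θ the conditional (notation assumed here, in the spirit of Seeger's chapter), the factorisation reads

p(\{c_i\}_{i \in D_l}, \{x_n\}_{n=1}^{N} \mid \theta, \mu) = \prod_{i \in D_l} p(c_i \mid x_i, \theta) \; \prod_{n=1}^{N} p(x_n \mid \mu)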

Page 28: Surprise!

• Using Bayes' theorem we obtain the posterior over θ (see below).
• This is seriously worrying!
• Why?
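Assuming a priori independent θ and μ (consistent with the factorisation above), Bayes' theorem gives

p(\theta \mid D_l, D_u) \;\propto\; p(\theta) \prod_{i \in D_l} p(c_i \mid x_i, \theta)

The factors involving the unlabelled inputs depend only on μ and cancel, so the unlabelled data has no effect at all on the posterior over θ; this is the worry, and the next slide picks it up.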

Page 29: Regularization

• Bayesian methods cannot be employed in a straight discriminative SSL setting.
• There has been some limited success with non-Bayesian methods (e.g. Anderson 1978 for logistic regression over discrete variables).
• The most promising avenue is to modify the graphical model used.
• This is known as regularization.

[Graphical model with nodes x and c, and parameters μ and θ]

Page 30: Discriminative vs Generative

• The idea behind regularization is that information about the generative structure of x feeds into the discriminative process.
• Does it still make sense to talk about a discriminative approach?
• Strictly speaking, perhaps not.
• In practice, the hypotheses used for regularization are much weaker than modelling the full class-conditional distribution.

Page 31: Cluster assumption

• Perhaps the most widely used form of regularization.
• It states that the boundary between classes should cross areas of low data density.
• It is often reasonable, particularly in high dimensions.
• Implemented in a host of Bayesian and non-Bayesian methods.
• Can be understood in terms of smoothness of the discriminant function.

Page 32: Manifold assumption

• Another convenient assumption is that the data lies on a low-dimensional sub-manifold of a high-dimensional space.
• Often the starting point is to map the data to a high-dimensional space via a feature map.
• The cluster assumption is then applied in the high-dimensional space.
• The key concept is the graph Laplacian (see below).
• Ideas pioneered by Belkin and Niyogi.
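For reference, the graph Laplacian built on the data points is L = D - W, where W is a symmetric matrix of pairwise similarities (a Gaussian kernel is one common choice, assumed here) and D is the diagonal degree matrix:

W_{ij} = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right), \qquad D_{ii} = \sum_j W_{ij}, \qquad L = D - W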

Page 33: Manifolds cont.

• We want discriminant functions which vary little in regions of high data density.
• Mathematically, this is equivalent to the smoothness functional below being very small, where Δ is the Laplace-Beltrami operator.
• The finite-sample approximation to Δ is given by the graph Laplacian.
• The objective function for SSL is a combination of the error given by f and of its norm under Δ (see below).
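In the usual notation (assumed here, with a regularisation weight γ introduced for illustration), the smoothness functional and the resulting finite-sample objective are

\int_{\mathcal{M}} f \, \Delta f \; dp(x) \;\approx\; \mathbf{f}^{\top} L \, \mathbf{f}, \qquad J(f) = \sum_{i \in D_l} \ell\big(f(x_i), c_i\big) + \gamma \, \mathbf{f}^{\top} L \, \mathbf{f}

where f = (f(x_1), ..., f(x_N)) collects the function values at all labelled and unlabelled points and ℓ is a loss on the labelled data.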

