Page 1

Co-Training for Semi-supervised Learning (cont.)

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University

April 23rd, 2007

©2005-2007 Carlos Guestrin

Page 2

Exploiting redundant information in semi-supervised learning

Want to predict Y from features X: f(X) → Y
- have some labeled data L
- lots of unlabeled data U

Co-training assumption: X is very expressive, X = (X1, X2), and we can learn
- g1(X1) → Y
- g2(X2) → Y

Page 3

Co-Training Algorithm [Blum & Mitchell '99]
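
The algorithm itself appears only as a figure in this transcript. Below is a minimal sketch in the spirit of Blum & Mitchell '99, under assumptions the slide does not fix: two scikit-learn-style classifiers that support predict_proba, numpy arrays X1/X2 for the two feature views, binary {0, 1} labels, and per-round counts p and n of confident positive/negative examples moved out of the unlabeled pool.

```python
import numpy as np

def co_train(clf1, clf2, X1_lab, X2_lab, y_lab, X1_unl, X2_unl,
             rounds=10, p=1, n=3):
    """Simplified co-training: each view labels its most confident pool examples
    and both views' training sets grow with them."""
    X1_l, X2_l, y_l = X1_lab.copy(), X2_lab.copy(), y_lab.copy()
    pool = np.arange(len(X1_unl))
    for _ in range(rounds):
        if len(pool) == 0:
            break
        clf1.fit(X1_l, y_l)
        clf2.fit(X2_l, y_l)
        for clf, X_view in ((clf1, X1_unl), (clf2, X2_unl)):
            if len(pool) == 0:
                break
            probs = clf.predict_proba(X_view[pool])[:, 1]
            pos = pool[np.argsort(probs)[-p:]]      # most confident positives
            neg = pool[np.argsort(probs)[:n]]       # most confident negatives
            chosen = np.unique(np.concatenate([pos, neg]))
            labels = np.where(np.isin(chosen, pos), 1, 0)
            # each view labels examples for BOTH views' training sets
            X1_l = np.vstack([X1_l, X1_unl[chosen]])
            X2_l = np.vstack([X2_l, X2_unl[chosen]])
            y_l = np.concatenate([y_l, labels])
            pool = np.setdiff1d(pool, chosen)
    return clf1.fit(X1_l, y_l), clf2.fit(X2_l, y_l)
```

The published algorithm also replenishes the pool from a larger unlabeled set each round; that detail is omitted here.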

Page 4

Understanding Co-Training: A simple setting

Suppose X1 and X2 are discrete: |X1| = |X2| = N
No label noise
Without unlabeled data, how hard is it to learn g1 (or g2)?

Page 5

Co-Training in simple setting – Iteration 0

Page 6

Co-Training in simple setting – Iteration 1

Page 7

Co-Training in simple setting – after convergence

Page 8

Co-Training in simple setting – Connected components

Suppose infinite unlabeled data
Co-training must have at least one labeled example in each connected component of the L+U graph
What's the probability of making an error?
For k connected components, how much labeled data do we need?
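
One way to make the labeled-data question concrete (not worked out on the slide): if we assume each labeled example falls into one of the k components uniformly at random, a coupon-collector argument gives

```latex
\Pr[\text{some component has no labeled example after } m \text{ draws}]
  \;\le\; k\left(1-\tfrac{1}{k}\right)^{m} \;\le\; k\,e^{-m/k}
```

so m ≥ k ln(k/δ) labeled examples suffice with probability at least 1 − δ, and the expected number needed to cover all k components is k·H_k ≈ k ln k. With non-uniform component probabilities, the rare components dominate the requirement.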

Page 9

How much unlabeled data?

Page 10

Co-Training theory

Want to predict Y from features X: f(X) → Y
Co-training assumption: X is very expressive, X = (X1, X2); want to learn g1(X1) → Y and g2(X2) → Y
Assumption: ∃ g1, g2 such that ∀ x: g1(x1) = f(x) and g2(x2) = f(x)

One co-training result [Blum & Mitchell '99]:
If (X1 ⊥ X2 | Y) and g1 & g2 are PAC learnable from noisy data (and thus f),
then f is PAC learnable from a weak initial classifier plus unlabeled data.

Page 11

What you need to know about co-training

Unlabeled data can help supervised learning (a lot) when there are (mostly) independent, redundant features
One theoretical result: if (X1 ⊥ X2 | Y) and g1 & g2 are PAC learnable from noisy data (and thus f), then f is PAC learnable from a weak initial classifier plus unlabeled data
Disagreement between g1 and g2 provides a bound on the error of the final classifier
Applied in many real-world settings:
- Semantic lexicon generation [Riloff, Jones 99], [Collins, Singer 99], [Jones 05]
- Web page classification [Blum, Mitchell 99]
- Word sense disambiguation [Yarowsky 95]
- Speech recognition [de Sa, Ballard 98]
- Visual classification of cars [Levin, Viola, Freund 03]

Page 12

Transductive SVMs

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University

April 23rd, 2007

Page 13

Semi-supervised learning and discriminative models

We have seen semi-supervised learning for generative models: EM
What can we do for discriminative models?
- Not regular EM: we can't compute P(x)
- But there are discriminative versions of EM: Co-Training!
- Many other tricks… let's see an example

Page 14

Linear classifiers – Which line is better?

Data: example i
w·x = ∑_j w(j) x(j)

Page 15

Support vector machines (SVMs)

[Figure: separating hyperplane w·x + b = 0 with margin boundaries w·x + b = +1 and w·x + b = -1; margin γ]

Solve efficiently by quadratic programming (QP): well-studied solution algorithms
Hyperplane defined by the support vectors

Page 16

What if we have unlabeled data?

n_L labeled data points: example i
w·x = ∑_j w(j) x(j)
n_U unlabeled data points

Page 17

Transductive support vector machines (TSVMs)

[Figure: hyperplane w·x + b = 0 with margin boundaries w·x + b = ±1; margin γ]

Page 18

Transductive support vector machines (TSVMs)

[Figure: same setup – hyperplane w·x + b = 0, margins w·x + b = ±1, margin γ]

Page 19

What's the difference between transductive learning and semi-supervised learning? Not much, and a lot!!!

Semi-supervised learning: labeled and unlabeled data → learn w, then use w on test data
Transductive learning: same algorithms for labeled and unlabeled data, but… the unlabeled data is the test data!!!
You are learning on the test data!!!
- OK, because you never look at the labels of the test data
- can get better classification
- but be very, very careful!!!
- never use test data prediction accuracy to tune parameters, select kernels, etc.

Page 20

Adding slack variables

[Figure: hyperplane w·x + b = 0 with margins w·x + b = ±1, margin γ]

Page 21

Transductive SVMs – now with slack variables! [Vapnik 98]

[Figure: hyperplane w·x + b = 0 with margins w·x + b = ±1, margin γ]
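
The optimization problem behind this picture is not written out in the transcript; a standard statement of the soft-margin transductive SVM (roughly the Vapnik '98 formulation, with C and C* as assumed names for the labeled and unlabeled penalty parameters) is

```latex
\min_{\mathbf{w},\,b,\;y^{*}_{1},\dots,y^{*}_{n_U}\in\{\pm 1\}}\;
  \tfrac{1}{2}\|\mathbf{w}\|^{2}
  + C\sum_{i=1}^{n_L}\xi_{i}
  + C^{*}\sum_{j=1}^{n_U}\xi^{*}_{j}
\quad\text{s.t.}\quad
  y_{i}\,(\mathbf{w}\cdot\mathbf{x}_{i}+b)\ge 1-\xi_{i},\;\;
  y^{*}_{j}\,(\mathbf{w}\cdot\mathbf{x}^{*}_{j}+b)\ge 1-\xi^{*}_{j},\;\;
  \xi_{i},\,\xi^{*}_{j}\ge 0.
```

The guessed labels y*_j of the unlabeled points are themselves optimization variables, which is what makes this an integer program (next slide).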

Page 22

Learning Transductive SVMs is hard!

[Figure: hyperplane w·x + b = 0 with margins w·x + b = ±1, margin γ]

Integer program – NP-hard!!!
Well-studied solution algorithms, but they will not scale up to very large problems

Page 23

A (heuristic) learning algorithm for Transductive SVMs [Joachims 99]

[Figure: hyperplane w·x + b = 0 with margins w·x + b = ±1, margin γ]

If you set the penalty on the unlabeled examples to zero → ignore unlabeled data
Intuition of algorithm:
- start with a small penalty on the unlabeled examples
- add labels to some unlabeled data based on classifier prediction
- slowly increase the penalty
- keep on labeling unlabeled data and re-running the classifier
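
A simplified sketch of that intuition in Python. This is an approximation, not the exact SVM-light procedure (which additionally maintains a class-balance constraint and swaps pairs of guessed labels); LinearSVC, C_unl_max, and the linear ramp on the unlabeled weight are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def tsvm_heuristic(X_lab, y_lab, X_unl, C=1.0, C_unl_max=1.0, steps=10):
    """Self-labeling TSVM heuristic: guess labels for the unlabeled data,
    retrain with a slowly increasing weight on those guesses."""
    clf = LinearSVC(C=C).fit(X_lab, y_lab)
    y_unl = clf.predict(X_unl)              # initial guessed labels
    for step in range(1, steps + 1):
        C_unl = C_unl_max * step / steps    # slowly increase unlabeled weight
        X = np.vstack([X_lab, X_unl])
        y = np.concatenate([y_lab, y_unl])
        # weight unlabeled examples less while C_unl is small
        w = np.concatenate([np.full(len(y_lab), 1.0),
                            np.full(len(y_unl), C_unl / C)])
        clf = LinearSVC(C=C).fit(X, y, sample_weight=w)
        y_unl = clf.predict(X_unl)          # re-label and repeat
    return clf, y_unl
```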

Page 24

Some results classifying news articles – from [Joachims 99]

Page 25

What you need to know about transductive SVMs

What transductive vs. semi-supervised learning means
The formulation for transductive SVMs can also be used for semi-supervised learning
Optimization is hard! Integer program
There are simple heuristic solution methods that work well here

Page 26

Dimensionality reduction

Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University

April 23rd, 2007

Page 27

Dimensionality reduction

Input data may have thousands or millions of dimensions! e.g., text data has …
Dimensionality reduction: represent data with fewer dimensions
- easier learning – fewer parameters
- visualization – hard to visualize more than 3D or 4D
- discover "intrinsic dimensionality" of data: high-dimensional data that is truly lower dimensional

Page 28

Feature selection

Want to learn f: X → Y, X = <X1,…,Xn>, but some features are more important than others
Approach: select a subset of features to be used by the learning algorithm
- Score each feature (or sets of features)
- Select the set of features with the best score

Page 29

Simple greedy forward feature selection algorithm

Pick a dictionary of features, e.g., polynomials for linear regression
Greedy heuristic:
- Start from an empty (or simple) set of features F0 = ∅
- Run the learning algorithm for the current set of features Ft, obtaining ht
- Select the next best feature Xi, e.g., the Xj that results in the lowest cross-validation error when learning with Ft ∪ {Xj}
- Ft+1 ← Ft ∪ {Xi}
- Recurse
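
A minimal sketch of this greedy forward procedure, assuming a scikit-learn estimator and 5-fold cross-validated score as the selection criterion (the slide leaves both choices open):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, estimator, max_features=None):
    """Greedily add the feature that most improves cross-validated score."""
    n = X.shape[1]
    selected, remaining = [], list(range(n))
    best_score = -np.inf
    while remaining and (max_features is None or len(selected) < max_features):
        # score every candidate feature added to the current set F_t
        scores = {j: cross_val_score(estimator, X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:     # stop when no candidate helps
            break
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected
```

Backward elimination (next slide) is the mirror image: start from all features and repeatedly drop the one whose removal gives the best cross-validated score.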

Page 30

Simple greedy backward feature selection algorithm

Pick a dictionary of features, e.g., polynomials for linear regression
Greedy heuristic:
- Start from all features F0 = F
- Run the learning algorithm for the current set of features Ft, obtaining ht
- Select the next worst feature Xi, e.g., the Xj that results in the lowest cross-validation error when learning with Ft - {Xj}
- Ft+1 ← Ft - {Xi}
- Recurse

Page 31

Impact of feature selection on classification of fMRI data [Pereira et al. '05]

Page 32

Lower dimensional projections

Rather than picking a subset of the features, we can create new features that are combinations of existing features
Let's see this in the unsupervised setting: just X, but no Y

Page 33

Linear projection and reconstruction

[Figure: 2-D data (x1, x2) projected into one dimension z1; reconstruction: knowing only z1, what was (x1, x2)?]

Page 34

Principal component analysis – basic idea

Project n-dimensional data into a k-dimensional space while preserving information:
- e.g., project a space of 10000 words into 3 dimensions
- e.g., project 3-D into 2-D
Choose the projection with minimum reconstruction error

Page 35

Linear projections, a review

Project a point into a (lower dimensional) space:
- point: x = (x1,…,xn)
- select a basis – a set of basis vectors – (u1,…,uk)
  - we consider an orthonormal basis: ui·ui = 1, and ui·uj = 0 for i ≠ j
- select a center – x̄, which defines the offset of the space
- the best coordinates in the lower dimensional space are defined by dot-products: (z1,…,zk), zi = (x - x̄)·ui – this gives minimum squared error

Page 36

PCA finds the projection that minimizes reconstruction error

Given m data points: x^i = (x_1^i, …, x_n^i), i = 1…m
Will represent each point as a projection (the equations on the slide are reconstructed below)
PCA: Given k ≪ n, find (u1,…,uk) minimizing the reconstruction error

[Figure: 2-D data over axes x1, x2 with its projection direction]
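
The equations referred to on this slide are images in the original; a standard reconstruction of them, consistent with the notation on the surrounding slides, is

```latex
\bar{\mathbf{x}}=\frac{1}{m}\sum_{i=1}^{m}\mathbf{x}^{i},\qquad
z^{i}_{j}=(\mathbf{x}^{i}-\bar{\mathbf{x}})\cdot\mathbf{u}_{j},\qquad
\hat{\mathbf{x}}^{i}=\bar{\mathbf{x}}+\sum_{j=1}^{k}z^{i}_{j}\,\mathbf{u}_{j},
\qquad
\text{PCA:}\;\;
\min_{\mathbf{u}_{1},\dots,\mathbf{u}_{k}\ \text{orthonormal}}\;
\sum_{i=1}^{m}\bigl\|\mathbf{x}^{i}-\hat{\mathbf{x}}^{i}\bigr\|^{2}.
```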

Page 37

Understanding the reconstruction error

Given k ≪ n, find (u1,…,uk) minimizing the reconstruction error
Note that x^i can be represented exactly by its n-dimensional projection (using all n basis vectors)
Rewriting the error:
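
The rewrite carried out in the slide's equation images is, in the standard derivation (using the exact n-term representation and the orthonormality of the basis),

```latex
\sum_{i=1}^{m}\bigl\|\mathbf{x}^{i}-\hat{\mathbf{x}}^{i}\bigr\|^{2}
 =\sum_{i=1}^{m}\Bigl\|\sum_{j=k+1}^{n} z^{i}_{j}\,\mathbf{u}_{j}\Bigr\|^{2}
 =\sum_{i=1}^{m}\sum_{j=k+1}^{n}\bigl(z^{i}_{j}\bigr)^{2}
 =\sum_{j=k+1}^{n}\mathbf{u}_{j}^{\top}
   \Bigl(\sum_{i=1}^{m}(\mathbf{x}^{i}-\bar{\mathbf{x}})(\mathbf{x}^{i}-\bar{\mathbf{x}})^{\top}\Bigr)\mathbf{u}_{j},
```

which is where the covariance matrix of the next slide enters.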

Page 38

Reconstruction error and covariance matrix

Page 39

Minimizing reconstruction error and eigenvectors

Minimizing the reconstruction error is equivalent to picking an orthonormal basis (u1,…,un) minimizing the error sum above
Eigenvector: (the defining equation is reconstructed below)
Minimizing the reconstruction error is equivalent to picking (uk+1,…,un) to be the eigenvectors with the smallest eigenvalues
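
The eigenvector equation left as an image on the slide is the standard one; writing Σ for the (unnormalized) covariance matrix from the rewrite above,

```latex
\Sigma\,\mathbf{u}_{j}=\lambda_{j}\,\mathbf{u}_{j},
\qquad
\sum_{i=1}^{m}\bigl\|\mathbf{x}^{i}-\hat{\mathbf{x}}^{i}\bigr\|^{2}
  =\sum_{j=k+1}^{n}\mathbf{u}_{j}^{\top}\Sigma\,\mathbf{u}_{j}
  =\sum_{j=k+1}^{n}\lambda_{j},
```

so discarding the directions with the smallest eigenvalues – equivalently, keeping the top-k eigenvectors as principal components – minimizes the error.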

Page 40

Basic PCA algorithm

- Start from the m-by-n data matrix X
- Recenter: subtract the mean from each row of X: Xc ← X – X̄
- Compute the covariance matrix: Σ ← Xcᵀ Xc
- Find the eigenvectors and eigenvalues of Σ
- Principal components: the k eigenvectors with the highest eigenvalues
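
A minimal numpy sketch of this recipe (the function name is mine; following the slide, Σ = XcᵀXc without a 1/m factor, which only rescales the eigenvalues):

```python
import numpy as np

def basic_pca(X, k):
    """Basic PCA via the covariance matrix, following the slide's recipe."""
    Xc = X - X.mean(axis=0)                     # recenter: subtract the mean from each row
    Sigma = Xc.T @ Xc                           # (unnormalized) covariance matrix, n x n
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues in decreasing order
    U = eigvecs[:, order[:k]]                   # principal components: top-k eigenvectors
    Z = Xc @ U                                  # coordinates z_j = (x - mean) . u_j
    return U, Z, eigvals[order[:k]]
```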

Page 41

PCA example

Page 42

PCA example – reconstruction

only used the first principal component

Page 43

Eigenfaces [Turk, Pentland '91]

Input images: [figure]
Principal components: [figure]

Page 44

Eigenfaces reconstruction

Each image corresponds to adding 8 principal components: [figure]

Page 45

Relationship to Gaussians

PCA assumes the data is Gaussian: x ~ N(x̄; Σ)
Equivalent to a weighted sum of simple Gaussians
Selecting the top k principal components is equivalent to a lower dimensional Gaussian approximation:
ε ~ N(0; σ²), where σ² is defined by error_k

[Figure: 2-D Gaussian over (x1, x2) and its lower dimensional approximation]

Page 46

Scaling up

The covariance matrix can be really big! Σ is n by n: with 10000 features, Σ has 10⁸ entries, and finding eigenvectors is very slow…
Use the singular value decomposition (SVD):
- finds the top k eigenvectors
- great implementations available, e.g., Matlab svd

Page 47

SVD

Write X = U S Vᵀ
- X ← data matrix, one row per data point
- U ← weight matrix, one row per data point – the coordinates of x^i in eigenspace
- S ← singular value matrix, a diagonal matrix; in our setting each entry s_j corresponds to an eigenvalue λ_j = s_j² of XcᵀXc
- Vᵀ ← singular vector matrix; in our setting each row is an eigenvector v_j
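
The link back to the covariance-based PCA of the previous slides, spelled out here (it follows directly from Xc = U S Vᵀ, though the slide does not show it):

```latex
\Sigma \;=\; X_c^{\top}X_c \;=\; V S U^{\top} U S V^{\top} \;=\; V S^{2} V^{\top},
\qquad
\Sigma\,\mathbf{v}_{j} \;=\; s_{j}^{2}\,\mathbf{v}_{j},
```

so the right singular vectors v_j are the eigenvectors of Σ, the squared singular values are its eigenvalues, and the projection coordinates are Z = Xc V_k = U_k S_k.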

Page 48

PCA using SVD algorithm

- Start from the m-by-n data matrix X
- Recenter: subtract the mean from each row of X: Xc ← X – X̄
- Call an SVD algorithm on Xc – ask for the top k singular vectors
- Principal components: the k singular vectors with the highest singular values (rows of Vᵀ)
- Coefficients become: (reconstructed in the sketch below)
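
A minimal numpy sketch of the SVD route (np.linalg.svd and the function name are my choices; the coefficients are computed as Xc Vk, which equals Uk Sk):

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD, following the slide's recipe."""
    Xc = X - X.mean(axis=0)                       # recenter the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                           # principal components: top-k rows of V^T
    Z = Xc @ components.T                         # coefficients; same as U[:, :k] * s[:k]
    return components, Z, s[:k]

# Usage sketch: components, Z, s = pca_svd(X, k=2); a k-component reconstruction
# of the data is X.mean(axis=0) + Z @ components.
```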

Page 49

Using PCA for dimensionality reduction in classification

Want to learn f: X → Y, X = <X1,…,Xn>, but some features are more important than others
Approach: use PCA on X to select a few important features

Page 50

PCA for classification can lead to problems…

The direction of maximum variation may be unrelated to the "discriminative" directions
PCA often works very well, but sometimes you must use more advanced methods, e.g., Fisher's linear discriminant

Page 51

What you need to know

Dimensionality reduction: why and when it's important
Simple feature selection
Principal component analysis:
- minimizing reconstruction error
- relationship to the covariance matrix and eigenvectors
- using SVD
- problems with PCA

