Page 1: Semi-supervised Learning - Computer Science · 2011. 3. 16. · Semi-supervised Learning COMP 790COMP 790-90 Seminar90 Seminar Spring 2011 The UNIVERSITY of NORTH CAROLINA at CHAPEL

Semi-supervised Learning

COMP 790-90 Seminar

Spring 2011

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Overview

Semi-supervised learning
  Semi-supervised classification
  Semi-supervised clustering
    Search based methods
      COP K-Means
      Seeded K-Means
      Constrained K-Means
    Similarity based methods

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

Supervised Classification Example

[Figure: point plot; not recoverable from the transcript.]

Unsupervised Clustering Example

[Figure: point plot; not recoverable from the transcript.]

Semi-Supervised Learning

Combines labeled and unlabeled data during training to improve performance:

Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.

Semi-supervised clustering: Uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

Semi-Supervised Classification Example

[Figure: point plot; not recoverable from the transcript.]

Semi-Supervised Classification

Algorithms:

Semisupervised EM [Ghahramani:NIPS94, Nigam:ML00].

Co-training [Blum:COLT98].

Transductive SVMs [Vapnik:98, Joachims:ICML99].

Assumptions:

Known, fixed set of categories given in the labeled data.

Goal is to improve classification of examples into these known categories.

Semi-Supervised Clustering Example

[Figure: point plot; not recoverable from the transcript.]

Second Semi-Supervised Clustering Example

[Figure: point plot; not recoverable from the transcript.]

Semi-Supervised Clustering

Can group data using the categories in the initial labeled data.

Can also extend and modify the existing set of categories as needed to reflect other regularities in the data.

Can cluster a disjoint set of unlabeled data using the labeled data as a “guide” to the type of clusters desired.

Problem definition

Input:

A set of unlabeled objects

Some domain knowledge

Output:

A partitioning of the objects into clusters

Objective:

Maximum intra-cluster similarity

Minimum inter-cluster similarity

High consistency between the partitioning and the domain knowledge

What is Domain Knowledge?

Must-link and cannot-link

Class labels

Ontology
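Class labels can be turned into the pairwise form of domain knowledge: two objects with the same label must link, and two objects with different labels cannot link. The sketch below illustrates that conversion. The function name and the dict-based input are assumptions made for this example, not something the slides specify.

```python
from itertools import combinations

def labels_to_constraints(labels):
    """Derive pairwise constraints from a partial labeling.

    labels: dict mapping object index -> class label (labeled objects only).
    Returns (must_link, cannot_link) as sets of index pairs (i, j) with i < j.
    """
    must_link, cannot_link = set(), set()
    for i, j in combinations(sorted(labels), 2):
        if labels[i] == labels[j]:
            must_link.add((i, j))    # same label: must share a cluster
        else:
            cannot_link.add((i, j))  # different labels: must be separated
    return must_link, cannot_link

# Hypothetical example: objects 0 and 3 are labeled "a", object 7 is labeled "b".
ml, cl = labels_to_constraints({0: "a", 3: "a", 7: "b"})
```

Here `ml` holds the single pair (0, 3) and `cl` holds the pairs (0, 7) and (3, 7). The reverse direction does not hold in general: a set of pairwise constraints need not determine class labels.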

Why semi-supervised clustering?

Why not clustering?

Plain clustering cannot incorporate prior knowledge into the clustering process.

Why not classification?

Sometimes there is insufficient labeled data.

Potential applications

Bioinformatics (gene and protein clustering)

Document hierarchy construction

News/email categorization

Image categorization

Semi-Supervised Clustering Approaches

Search-based Semi-Supervised Clustering

Alter the clustering algorithm using the constraints

Similarity-based Semi-Supervised Clustering

Alter the similarity measure based on the constraints

Combination of both

Search-Based Semi-Supervised Clustering

Alter the clustering algorithm that searches for a good partitioning by:

Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99].

Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].

Using the labeled data to initialize clusters in an iterative refinement algorithm (KMeans, EM) [Basu:ICML02].

Unsupervised KMeans Clustering

KMeans iteratively partitions a dataset into K clusters.

Algorithm:

Initialize K cluster centers $\{\mu_l\}_{l=1}^{K}$ randomly. Repeat until convergence:

Cluster Assignment Step: Assign each data point x to the cluster $X_l$ such that the L2 distance of x from $\mu_l$ (the center of $X_l$) is minimum.

Center Re-estimation Step: Re-estimate each cluster center $\mu_l$ as the mean of the points in that cluster.
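The two alternating steps of unsupervised KMeans can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the course's implementation: the function name `kmeans`, the choice to initialize from K randomly chosen data points, the iteration cap, and the handling of empty clusters are all choices made for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain KMeans sketch. Returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    # Assumed initialization: k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = None
    for _ in range(n_iter):
        # Cluster assignment step: nearest center in L2 distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # converged: no point changed cluster
        assign = new_assign
        # Center re-estimation step: mean of the points in each cluster.
        for l in range(k):
            members = X[assign == l]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[l] = members.mean(axis=0)
    return centers, assign

# Two well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, assign = kmeans(X, k=2)
```

On this toy data the two groups end up in separate clusters. Which group receives index 0 depends on the random initialization.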

KMeans Objective Function

Locally minimizes sum of squared distance between the data points and their corresponding cluster centers:

$$\sum_{l=1}^{K} \sum_{x_i \in X_l} \| x_i - \mu_l \|^2$$

Initialization of K cluster centers:

Totally random

Random perturbation from global mean

Heuristic to ensure well-separated centers etc.
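As a rough illustration, the objective and one of the listed initialization options can be written as follows. The function names, the `scale` parameter, and the use of the per-feature standard deviation to size the perturbation are assumptions for this sketch. The slides name the strategy but give no details.

```python
import numpy as np

def kmeans_objective(X, centers, assign):
    """Sum of squared L2 distances from each point to its assigned center."""
    diffs = X - centers[assign]
    return float((diffs ** 2).sum())

def init_perturbed_global_mean(X, k, scale=0.1, seed=0):
    """Assumed variant of 'random perturbation from global mean':
    k centers drawn as small Gaussian perturbations of the data mean."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    spread = X.std(axis=0)
    return mean + scale * spread * rng.standard_normal((k, X.shape[1]))

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
centers = np.array([[1.0, 0.0], [11.0, 0.0]])
assign = np.array([0, 0, 1, 1])
print(kmeans_objective(X, centers, assign))  # 4.0
init = init_perturbed_global_mean(X, k=3)
```

Every point in the example lies at distance 1 from its center, so the objective is 4.0. Because KMeans only reaches a local minimum of this quantity, different initializations can yield different final values.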

K Means Example

[Figure sequence; the plots are not recoverable from the transcript. Panel titles, in order:]

Randomly Initialize Means

Assign Points to Clusters

Re-estimate Means

Re-assign Points to Clusters

Re-estimate Means

Re-assign Points to Clusters

Re-estimate Means and Converge

Semi-Supervised K-Means

Constraints (Must-link, Cannot-link)

COP K-Means

Partial label information is given

Seeded K-Means (Basu, ICML’02)

Constrained K-Means

COP K-Means

COP K-Means is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points.

Initialization: Cluster centers are chosen randomly, subject to no must-link constraint being violated.

Algorithm: During the cluster assignment step in COP K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.

Based on Wagstaff et al.: ICML01

COP K-Means Algorithm

[Figure: algorithm listing; not recoverable from the transcript.]
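The constrained assignment step can be sketched as below. This is an illustrative reading of the description above, not a reproduction of the listing in Wagstaff et al.: the names `violates` and `cop_kmeans`, the pair-list representation of constraints, the plain random initialization, and the sequential point order are assumptions for this example.

```python
import numpy as np

def violates(i, cluster, assign, must_link, cannot_link):
    """True if placing point i in `cluster` breaks a constraint, judged
    against points already assigned in this pass (assign[j] >= 0)."""
    for a, b in must_link:
        if i in (a, b):
            j = b if i == a else a
            if assign[j] >= 0 and assign[j] != cluster:
                return True   # must-link partner sits in another cluster
    for a, b in cannot_link:
        if i in (a, b):
            j = b if i == a else a
            if assign[j] == cluster:
                return True   # cannot-link partner sits in this cluster
    return False

def cop_kmeans(X, k, must_link, cannot_link, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    prev = None
    for _ in range(n_iter):
        assign = np.full(len(X), -1)
        for i in range(len(X)):
            # Try clusters from nearest to farthest; take the first feasible one.
            order = np.argsort(np.linalg.norm(centers - X[i], axis=1))
            for c in order:
                if not violates(i, c, assign, must_link, cannot_link):
                    assign[i] = c
                    break
            else:
                raise ValueError(f"no feasible cluster for point {i}")  # abort
        if prev is not None and np.array_equal(assign, prev):
            break
        prev = assign
        for c in range(k):
            if np.any(assign == c):
                centers[c] = X[assign == c].mean(axis=0)
    return centers, assign

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers, assign = cop_kmeans(X, k=2, must_link=[(2, 3)], cannot_link=[(0, 1)])
```

In the example, points 0 and 1 are close together but end up in different clusters because of the cannot-link constraint, while points 2 and 3 stay together. The `ValueError` branch corresponds to the "abort" case, which arises when a point's must-link and cannot-link partners already occupy the only clusters it could join.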

Illustration

[Figure: determine the label of a point x that has a must-link constraint. Result: assign to the red class.]

Illustration

[Figure: determine the label of a point x that has a cannot-link constraint. Result: assign to the red class.]

Illustration

[Figure: determine the label of a point x that has both a must-link and a cannot-link constraint. Result: the clustering algorithm fails.]

Evaluation

Rand index: measures the agreement between two partitions, P1 and P2, of the same data set D.

Each partition is viewed as a collection of n(n-1)/2 pairwise decisions, where n is the size of D.

a is the number of decisions where P1 and P2 put a pair of objects into the same cluster.

b is the number of decisions where two instances are placed in different clusters in both partitions.

Total agreement can then be calculated using

$$\mathrm{Rand}(P_1, P_2) = \frac{a + b}{n(n-1)/2}$$
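The Rand index definition translates directly into a pair-counting loop. The sketch below assumes each partition is given as a list of cluster labels, one per object. The function name is chosen for this example.

```python
from itertools import combinations

def rand_index(p1, p2):
    """Rand index between two partitions given as equal-length label lists."""
    n = len(p1)
    a = b = 0
    for i, j in combinations(range(n), 2):
        same1 = p1[i] == p1[j]
        same2 = p2[i] == p2[j]
        if same1 and same2:
            a += 1   # pair placed together in both partitions
        elif not same1 and not same2:
            b += 1   # pair separated in both partitions
    return (a + b) / (n * (n - 1) / 2)

# Six pairs in total: a = 1, b = 4, and one disagreement on the pair (2, 3).
score = rand_index([0, 0, 1, 1], [0, 0, 1, 2])  # 5/6
```

Cluster names do not matter, only co-membership: `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` is 1.0. The loop is quadratic in n, which is acceptable for small data sets.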

Evaluation

[Figure: not recoverable from the transcript.]

Semi-Supervised K-Means

Seeded K-Means:

Labeled data provided by user are used for initialization: initial center for cluster i is the mean of the seed points having label i.

Seed points are only used for initialization, and not in subsequent steps.

Constrained K-Means:

Labeled data provided by user are used to initialize K-Means algorithm.

Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.

Based on Basu et al., ICML’02.
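The difference between the two variants comes down to one line in the assignment step, as the following sketch tries to show. It is an illustration under assumptions, not the code of Basu et al.: the single function with a `constrained` flag, the dict-based seed format, and the requirement that every cluster has at least one seed are choices made for this example.

```python
import numpy as np

def seeded_kmeans(X, k, seeds, constrained=False, n_iter=100):
    """seeds: dict index -> label in range(k); each label needs >= 1 seed.
    constrained=False: Seeded K-Means (seed labels may change).
    constrained=True:  Constrained K-Means (seed labels stay fixed)."""
    seed_idx = np.array(sorted(seeds))
    seed_lab = np.array([seeds[i] for i in seed_idx])
    # Initial center for cluster l is the mean of the seed points labeled l.
    centers = np.stack([X[seed_idx[seed_lab == l]].mean(axis=0)
                        for l in range(k)])
    assign = None
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if constrained:
            new_assign[seed_idx] = seed_lab  # clamp seeds to their given labels
        if assign is not None and np.array_equal(new_assign, assign):
            break
        assign = new_assign
        for l in range(k):
            if np.any(assign == l):
                centers[l] = X[assign == l].mean(axis=0)
    return centers, assign

# One-dimensional toy data; the seed at x = 2 is labeled 1 although it lies
# close to the points labeled 0.
X = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
seeds = {0: 0, 1: 0, 2: 1, 5: 1}
_, assign_seeded = seeded_kmeans(X, 2, seeds)
_, assign_constrained = seeded_kmeans(X, 2, seeds, constrained=True)
print(assign_seeded.tolist())       # [0, 0, 0, 1, 1, 1]
print(assign_constrained.tolist())  # [0, 0, 1, 1, 1, 1]
```

In the seeded run the point at x = 2 is reassigned to cluster 0, so its label changes. In the constrained run it keeps label 1 and pulls the center of cluster 1 toward it.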

Seeded K-Means

Use labeled data to find the initial centroids and then run K-Means.

The labels for seeded points may change.

Seeded K-Means Example

[Figure sequence; the plots are not recoverable from the transcript. Panel titles, in order:]

Initialize Means Using Labeled Data

Assign Points to Clusters

Re-estimate Means

Assign points to clusters and Converge (annotation on one seed point: the label is changed)

Constrained K-Means

Use labeled data to find the initial centroids and then run K-Means.

The labels for seeded points will not change.

Constrained K-Means Example

[Figure sequence; the plots are not recoverable from the transcript. Panel titles, in order:]

Initialize Means Using Labeled Data

Assign Points to Clusters

Re-estimate Means and Converge

Datasets

Data sets:

UCI Iris (3 classes; 150 instances)

CMU 20 Newsgroups (20 classes; 20,000 instances)

Yahoo! News (20 classes; 2,340 instances)

Data subsets created for experiments:

Small-20 newsgroup: random sample of 100 documents from each newsgroup, created to study effect of data size on algorithms.

Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study effect of data separability on algorithms.

Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).

EvaluationEvaluation

Objective function

Mutual information
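For reference, mutual information between a clustering and the true class labels can be computed from co-occurrence counts. The slide names the measure without a formula, so the sketch below shows plain (unnormalized) mutual information in nats. Whether the experiments used a normalized variant is not recoverable from the transcript, and the function name is an assumption.

```python
import math
from collections import Counter

def mutual_information(clusters, classes):
    """I(C; K) = sum over (c, k) of p(c, k) * log(p(c, k) / (p(c) * p(k)))."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))  # co-occurrence counts n_ck
    pc = Counter(clusters)                   # cluster sizes
    pk = Counter(classes)                    # class sizes
    mi = 0.0
    for (c, k), n_ck in joint.items():
        # p(c,k) / (p(c) p(k)) simplifies to n_ck * n / (n_c * n_k).
        mi += (n_ck / n) * math.log(n_ck * n / (pc[c] * pk[k]))
    return mi

perfect = mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"])      # log 2
independent = mutual_information([0, 1, 0, 1], ["a", "a", "b", "b"])  # 0.0
```

A clustering that matches two balanced classes exactly scores log 2, and a clustering that is independent of the classes scores 0.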


Results: MI and Seeding

[Figure: not recoverable from the transcript.]

Zero noise in seeds [Small-20 NewsGroup]

Semi-Supervised KMeans substantially better than unsupervised KMeans

Results: Objective function and Seeding

[Figure: not recoverable from the transcript.]

User-labeling consistent with KMeans assumptions [Small-20 NewsGroup]

Obj. function of data partition increases exponentially with seed fraction

Results: Objective Function and Seeding

[Figure: not recoverable from the transcript.]

User-labeling inconsistent with KMeans assumptions [Yahoo! News]

Objective function of constrained algorithms decreases with seeding