Semi-supervised Learning
COMP 790-90 Seminar
Spring 2011
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Overview
Semi-supervised learning
  Semi-supervised classification
  Semi-supervised clustering
    Search-based methods
      COP K-Means
      Seeded K-Means
      Constrained K-Means
    Similarity-based methods

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Supervised Classification Example
[Figure sequence: fully labeled training points from two classes, and the decision boundary learned from them]
Unsupervised Clustering Example
[Figure sequence: unlabeled points grouped into clusters without any label information]
Semi-Supervised Learning
Combines labeled and unlabeled data during training to improve performance:
  Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
  Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
Semi-Supervised Classification Example
[Figure sequence: a few labeled points plus many unlabeled points; the unlabeled data help refine the decision boundary]
Semi-Supervised Classification
Algorithms:
  Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]
  Co-training [Blum:COLT98]
  Transductive SVMs [Vapnik:98, Joachims:ICML99]
Assumptions:
  A known, fixed set of categories is given in the labeled data.
  The goal is to improve classification of examples into these known categories.
Semi-Supervised Clustering Example
[Figure sequence: a few labeled points bias the clustering of the unlabeled points toward the given categories]
Second Semi-Supervised Clustering Example
[Figure sequence: the same labeled points, but the unlabeled data are grouped differently to reflect additional structure]
Semi-Supervised Clustering
Can group data using the categories in the initial labeled data.
Can also extend and modify the existing set of categories as needed to reflect other regularities in the data.
Can cluster a disjoint set of unlabeled data, using the labeled data as a "guide" to the type of clusters desired.
Problem Definition
Input:
  A set of unlabeled objects
  Some domain knowledge
Output:
  A partitioning of the objects into clusters
Objective:
  Maximum intra-cluster similarity
  Minimum inter-cluster similarity
  High consistency between the partitioning and the domain knowledge
What is Domain Knowledge?
Must-link and cannot-link constraints
Class labels
Ontology
Why Semi-Supervised Clustering?
Why not clustering?
  It cannot incorporate prior knowledge into the clustering process.
Why not classification?
  Sometimes there is insufficient labeled data.
Potential applications:
  Bioinformatics (gene and protein clustering)
  Document hierarchy construction
  News/email categorization
  Image categorization
Semi-Supervised Clustering Approaches
Search-based semi-supervised clustering
  Alter the clustering algorithm using the constraints.
Similarity-based semi-supervised clustering
  Alter the similarity measure based on the constraints.
Combination of both
Search-Based Semi-Supervised Clustering
Alter the clustering algorithm that searches for a good partitioning by:
  Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99].
  Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].
  Using the labeled data to initialize clusters in an iterative refinement algorithm (KMeans, EM) [Basu:ICML02].
Unsupervised KMeans Clustering
KMeans iteratively partitions a dataset into K clusters {X_l}, l = 1, ..., K.
Algorithm:
  Initialize K cluster centers randomly. Repeat until convergence:
    Cluster Assignment Step: assign each data point x to the cluster X_l such that the L2 distance of x from mu_l (the center of X_l) is minimum.
    Center Re-estimation Step: re-estimate each cluster center mu_l as the mean of the points in that cluster.
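The two alternating steps can be sketched in a few lines of Python (a minimal illustration, not the course's code; function and variable names are my own):

```python
import random

def kmeans(points, k, iters=100):
    # Initialize K cluster centers randomly from the data points.
    centers = random.sample(points, k)
    for _ in range(iters):
        # Cluster Assignment Step: each point goes to the cluster
        # whose center is nearest in (squared) L2 distance.
        clusters = [[] for _ in range(k)]
        for x in points:
            l = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            clusters[l].append(x)
        # Center Re-estimation Step: each center becomes the mean of its cluster.
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: assignments stopped changing centers
            break
        centers = new_centers
    return centers, clusters
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)` separates the two obvious groups regardless of which two points are drawn as initial centers.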
KMeans Objective Function
Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:

  J = sum_{l=1}^{K} sum_{x_i in X_l} ||x_i - mu_l||^2

Initialization of the K cluster centers:
  Totally random
  Random perturbation from the global mean
  Heuristic to ensure well-separated centers, etc.
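The objective J above is straightforward to compute for a given partition and set of centers (a sketch with illustrative names; `clusters[l]` holds the points of X_l and `centers[l]` is mu_l):

```python
def kmeans_objective(clusters, centers):
    # J = sum over clusters l, over points x in cluster l,
    # of the squared L2 distance from x to center mu_l.
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, centers[l]))
        for l, cluster in enumerate(clusters)
        for x in cluster
    )
```

Each KMeans iteration can only decrease (or leave unchanged) this quantity, which is why the algorithm converges, though only to a local minimum.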
K-Means Example
  1. Randomly initialize means
  2. Assign points to clusters
  3. Re-estimate means
  4. Re-assign points to clusters
  5. Re-estimate means
  6. Re-assign points to clusters
  7. Re-estimate means and converge
[Figure sequence: two cluster centers (x) move toward the true group means over successive iterations]
Semi-Supervised K-Means
Constraints (must-link, cannot-link):
  COP K-Means
Partial label information is given:
  Seeded K-Means (Basu, ICML'02)
  Constrained K-Means
COP K-Means
COP K-Means is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
Initialization: cluster centers are chosen randomly, but so that no must-link constraints are violated.
Algorithm: during the cluster assignment step in COP-K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.
Based on Wagstaff et al., ICML01.
COP K-Means Algorithm
[Figure: pseudocode of the COP-KMeans algorithm]
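The constrained assignment step can be sketched as follows (an assumed implementation in the spirit of Wagstaff et al.; all names are illustrative, and constraints are pairs of points):

```python
def violates(point, cluster_id, assignment, must_link, cannot_link):
    # A placement violates a constraint only w.r.t. points already assigned.
    for a, b in must_link:
        other = b if a == point else (a if b == point else None)
        if other is not None and other in assignment and assignment[other] != cluster_id:
            return True  # must-link partner sits in a different cluster
    for a, b in cannot_link:
        other = b if a == point else (a if b == point else None)
        if other is not None and other in assignment and assignment[other] == cluster_id:
            return True  # cannot-link partner sits in this cluster
    return False

def cop_assign(points, centers, must_link, cannot_link):
    assignment = {}
    for x in points:
        # Try clusters in order of increasing squared L2 distance and take
        # the first one that violates no constraint; if all violate, abort.
        order = sorted(range(len(centers)),
                       key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
        for j in order:
            if not violates(x, j, assignment, must_link, cannot_link):
                assignment[x] = j
                break
        else:
            raise RuntimeError("COP-KMeans: no consistent assignment exists")
    return assignment
```

For example, with centers at (0, 0) and (5, 5) and a cannot-link between (0, 0) and (0, 1), the point (0, 1) is forced into the second cluster even though the first center is nearer.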
Illustration
Determine the label of a new point (x). A must-link constraint ties it to a point in the red class, so it is assigned to the red class.
Illustration
Determine the label of a new point (x). A cannot-link constraint rules out the other class, so the point is assigned to the red class.
Illustration
Determine the label of a new point (x). Its must-link and cannot-link constraints conflict: no cluster satisfies both, so the clustering algorithm fails.
Evaluation
Rand index: measures the agreement between two partitions, P1 and P2, of the same data set D.
  Each partition is viewed as a collection of n(n-1)/2 pairwise decisions, where n is the size of D.
  a is the number of decisions where P1 and P2 put a pair of objects into the same cluster.
  b is the number of decisions where the two instances are placed in different clusters in both partitions.
  Total agreement can then be calculated as Rand(P1, P2) = (a + b) / (n(n-1)/2).
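The definition above translates directly into code (a sketch; partitions are given as label lists over the same n objects):

```python
from itertools import combinations

def rand_index(labels1, labels2):
    n = len(labels1)
    a = b = 0
    # One pairwise decision per unordered pair of objects: n(n-1)/2 in total.
    for i, j in combinations(range(n), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1          # both partitions put the pair together
        elif not same1 and not same2:
            b += 1          # both partitions keep the pair apart
    return (a + b) / (n * (n - 1) / 2)
```

Identical partitions score 1.0; the score degrades as the partitions disagree on more pairs.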
Semi-Supervised K-Means
Seeded K-Means:
  Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
  Seed points are only used for initialization, not in subsequent steps.
Constrained K-Means:
  Labeled data provided by the user are used to initialize the K-Means algorithm.
  Cluster labels of the seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated.
Based on Basu et al., ICML'02.
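The two variants share almost all of their code: both initialize each center from the seed points carrying that label, and they differ only in whether seed labels are pinned during assignment. A sketch (an assumed implementation of the idea in Basu et al.; names are illustrative):

```python
def seeded_kmeans(points, seeds, k, keep_seed_labels=False, iters=100):
    # seeds maps a seed point to its user-provided label in {0, ..., k-1}.
    # Initialization: center i is the mean of the seed points labeled i.
    centers = [None] * k
    for l in range(k):
        members = [x for x, lab in seeds.items() if lab == l]
        centers[l] = tuple(sum(d) / len(members) for d in zip(*members))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            if keep_seed_labels and x in seeds:
                l = seeds[x]   # Constrained K-Means: seed labels never change
            else:
                l = min(range(k),
                        key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            clusters[l].append(x)
        new_centers = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

With `keep_seed_labels=False` this is Seeded K-Means (seeds only shape the starting centers); with `keep_seed_labels=True` it is Constrained K-Means (a mis-seeded point stays in its assigned cluster even if another center is nearer).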
Seeded K-Means
Use the labeled data to find the initial centroids, then run K-Means.
The labels of the seed points may change.
Seeded K-Means Example
  1. Initialize means using the labeled data
  2. Assign points to clusters
  3. Re-estimate means
  4. Assign points to clusters and converge (the label of one seed point is changed)
[Figure sequence: centers initialized from seed points; one seed point ends up in the other cluster]
Constrained K-Means
Use the labeled data to find the initial centroids, then run K-Means.
The labels of the seed points will not change.
Constrained K-Means Example
  1. Initialize means using the labeled data
  2. Assign points to clusters
  3. Re-estimate means and converge
[Figure sequence: seed labels stay fixed throughout]
Datasets
Data sets:
  UCI Iris (3 classes; 150 instances)
  CMU 20 Newsgroups (20 classes; 20,000 instances)
  Yahoo! News (20 classes; 2,340 instances)
Data subsets created for the experiments:
  Small-20 newsgroup: a random sample of 100 documents from each newsgroup, created to study the effect of data size on the algorithms.
  Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study the effect of data separability on the algorithms.
  Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).
Evaluation
  Objective function
  Mutual information
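Mutual information between a clustering and the reference labels can be computed from the empirical joint distribution of the two label lists (a sketch using natural logarithms; the slides do not specify a log base):

```python
from collections import Counter
from math import log

def mutual_information(labels1, labels2):
    n = len(labels1)
    joint = Counter(zip(labels1, labels2))   # joint counts over label pairs
    p1 = Counter(labels1)                    # marginal counts, partition 1
    p2 = Counter(labels2)                    # marginal counts, partition 2
    # MI = sum over cells of p(u, v) * log( p(u, v) / (p(u) * p(v)) )
    return sum((c / n) * log((c / n) / ((p1[u] / n) * (p2[v] / n)))
               for (u, v), c in joint.items())
```

Identical partitions give MI equal to the entropy of the labeling; independent partitions give MI near zero. In practice a normalized variant is often reported so that scores are comparable across different numbers of clusters.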
Results: MI and Seeding
Zero noise in seeds [Small-20 NewsGroup]: semi-supervised KMeans is substantially better than unsupervised KMeans.
Results: Objective Function and Seeding
User labeling consistent with the KMeans assumptions [Small-20 NewsGroup]: the objective function of the data partition increases exponentially with the seed fraction.
Results: Objective Function and Seeding
User labeling inconsistent with the KMeans assumptions [Yahoo! News]: the objective function of the constrained algorithms decreases with seeding.