Semi-supervised Learning
COMP 790-90 Seminar
Spring 2011
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Overview
Semi-supervised learning
  Semi-supervised classification
  Semi-supervised clustering
    Search-based methods
      COP K-Means
      Seeded K-Means
      Constrained K-Means
    Similarity-based methods

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Supervised Classification Example
[Figure sequence: fully labeled training points from two classes, and the decision boundary learned from them]
Unsupervised Clustering Example
[Figure sequence: unlabeled points grouped into clusters without any label information]
Semi-Supervised Learning
Combines labeled and unlabeled data during training to improve performance:
  Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
  Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
Semi-Supervised Classification Example
[Figure sequence: a few labeled points plus many unlabeled points; the unlabeled data help refine the decision boundary]
Semi-Supervised Classification
Algorithms:
  Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]
  Co-training [Blum:COLT98]
  Transductive SVMs [Vapnik:98, Joachims:ICML99]
Assumptions:
  A known, fixed set of categories is given in the labeled data.
  The goal is to improve classification of examples into these known categories.
Semi-Supervised Clustering Example
[Figure sequence: a few labeled points bias the clustering of the unlabeled points toward the given categories]
Second Semi-Supervised Clustering Example
[Figure sequence: the same labeled points, but the unlabeled data are grouped differently to reflect additional structure]
Semi-Supervised Clustering
Can group data using the categories in the initial labeled data.
Can also extend and modify the existing set of categories as needed to reflect other regularities in the data.
Can cluster a disjoint set of unlabeled data, using the labeled data as a "guide" to the type of clusters desired.
Problem Definition
Input:
  A set of unlabeled objects
  Some domain knowledge
Output:
  A partitioning of the objects into clusters
Objective:
  Maximum intra-cluster similarity
  Minimum inter-cluster similarity
  High consistency between the partitioning and the domain knowledge
What is Domain Knowledge?
Must-link and cannot-link constraints
Class labels
Ontology
Why Semi-Supervised Clustering?
Why not clustering?
  It cannot incorporate prior knowledge into the clustering process.
Why not classification?
  Sometimes there is insufficient labeled data.
Potential applications:
  Bioinformatics (gene and protein clustering)
  Document hierarchy construction
  News/email categorization
  Image categorization
Semi-Supervised Clustering Approaches
Search-based semi-supervised clustering
  Alter the clustering algorithm using the constraints.
Similarity-based semi-supervised clustering
  Alter the similarity measure based on the constraints.
Combination of both
Search-Based Semi-Supervised Clustering
Alter the clustering algorithm that searches for a good partitioning by:
  Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99].
  Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].
  Using the labeled data to initialize clusters in an iterative refinement algorithm (KMeans, EM) [Basu:ICML02].
Unsupervised KMeans Clustering
KMeans iteratively partitions a dataset into K clusters {X_l}, l = 1, ..., K.
Algorithm:
  Initialize K cluster centers randomly. Repeat until convergence:
    Cluster Assignment Step: assign each data point x to the cluster X_l such that the L2 distance of x from mu_l (the center of X_l) is minimum.
    Center Re-estimation Step: re-estimate each cluster center mu_l as the mean of the points in that cluster.
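The two alternating steps can be sketched in a few lines of Python (a minimal illustration, not the course's code; function and variable names are my own):

```python
import random

def kmeans(points, k, iters=100):
    # Initialize K cluster centers randomly from the data points.
    centers = random.sample(points, k)
    for _ in range(iters):
        # Cluster Assignment Step: each point goes to the cluster
        # whose center is nearest in (squared) L2 distance.
        clusters = [[] for _ in range(k)]
        for x in points:
            l = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            clusters[l].append(x)
        # Center Re-estimation Step: each center becomes the mean of its cluster.
        new_centers = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: assignments stopped changing centers
            break
        centers = new_centers
    return centers, clusters
```

For example, `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)` separates the two obvious groups regardless of which two points are drawn as initial centers.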
KMeans Objective Function
Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:

  J = sum_{l=1}^{K} sum_{x_i in X_l} ||x_i - mu_l||^2

Initialization of the K cluster centers:
  Totally random
  Random perturbation from the global mean
  Heuristic to ensure well-separated centers, etc.
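The objective J above is straightforward to compute for a given partition and set of centers (a sketch with illustrative names; `clusters[l]` holds the points of X_l and `centers[l]` is mu_l):

```python
def kmeans_objective(clusters, centers):
    # J = sum over clusters l, over points x in cluster l,
    # of the squared L2 distance from x to center mu_l.
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, centers[l]))
        for l, cluster in enumerate(clusters)
        for x in cluster
    )
```

Each KMeans iteration can only decrease (or leave unchanged) this quantity, which is why the algorithm converges, though only to a local minimum.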
K-Means Example
  1. Randomly initialize means
  2. Assign points to clusters
  3. Re-estimate means
  4. Re-assign points to clusters
  5. Re-estimate means
  6. Re-assign points to clusters
  7. Re-estimate means and converge
[Figure sequence: two cluster centers (x) move toward the true group means over successive iterations]
Semi-Supervised K-Means
Constraints (must-link, cannot-link):
  COP K-Means
Partial label information is given:
  Seeded K-Means (Basu, ICML'02)
  Constrained K-Means
COP K-Means
COP K-Means is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
Initialization: cluster centers are chosen randomly, but so that no must-link constraints are violated.
Algorithm: during the cluster assignment step in COP-K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.
Based on Wagstaff et al., ICML01.
COP K-Means Algorithm
[Figure: pseudocode of the COP-KMeans algorithm]
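The constrained assignment step can be sketched as follows (an assumed implementation in the spirit of Wagstaff et al.; all names are illustrative, and constraints are pairs of points):

```python
def violates(point, cluster_id, assignment, must_link, cannot_link):
    # A placement violates a constraint only w.r.t. points already assigned.
    for a, b in must_link:
        other = b if a == point else (a if b == point else None)
        if other is not None and other in assignment and assignment[other] != cluster_id:
            return True  # must-link partner sits in a different cluster
    for a, b in cannot_link:
        other = b if a == point else (a if b == point else None)
        if other is not None and other in assignment and assignment[other] == cluster_id:
            return True  # cannot-link partner sits in this cluster
    return False

def cop_assign(points, centers, must_link, cannot_link):
    assignment = {}
    for x in points:
        # Try clusters in order of increasing squared L2 distance and take
        # the first one that violates no constraint; if all violate, abort.
        order = sorted(range(len(centers)),
                       key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
        for j in order:
            if not violates(x, j, assignment, must_link, cannot_link):
                assignment[x] = j
                break
        else:
            raise RuntimeError("COP-KMeans: no consistent assignment exists")
    return assignment
```

For example, with centers at (0, 0) and (5, 5) and a cannot-link between (0, 0) and (0, 1), the point (0, 1) is forced into the second cluster even though the first center is nearer.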
Illustration
Determine the label of a new point (x). A must-link constraint ties it to a point in the red class, so it is assigned to the red class.
Illustration
Determine the label of a new point (x). A cannot-link constraint rules out the other class, so the point is assigned to the red class.
Illustration
Determine the label of a new point (x). Its must-link and cannot-link constraints conflict: no cluster satisfies both, so the clustering algorithm fails.
Evaluation
Rand index: measures the agreement between two partitions, P1 and P2, of the same data set D.
  Each partition is viewed as a collection of n(n-1)/2 pairwise decisions, where n is the size of D.
  a is the number of decisions where P1 and P2 put a pair of objects into the same cluster.
  b is the number of decisions where the two instances are placed in different clusters in both partitions.
  Total agreement can then be calculated as Rand(P1, P2) = (a + b) / (n(n-1)/2).
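The definition above translates directly into code (a sketch; partitions are given as label lists over the same n objects):

```python
from itertools import combinations

def rand_index(labels1, labels2):
    n = len(labels1)
    a = b = 0
    # One pairwise decision per unordered pair of objects: n(n-1)/2 in total.
    for i, j in combinations(range(n), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        if same1 and same2:
            a += 1          # both partitions put the pair together
        elif not same1 and not same2:
            b += 1          # both partitions keep the pair apart
    return (a + b) / (n * (n - 1) / 2)
```

Identical partitions score 1.0; the score degrades as the partitions disagree on more pairs.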
Semi-Supervised K-Means
Seeded K-Means:
  Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
  Seed points are only used for initialization, not in subsequent steps.
Constrained K-Means:
  Labeled data provided by the user are used to initialize the K-Means algorithm.
  Cluster labels of the seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated.
Based on Basu et al., ICML'02.
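The two variants share almost all of their code: both initialize each center from the seed points carrying that label, and they differ only in whether seed labels are pinned during assignment. A sketch (an assumed implementation of the idea in Basu et al.; names are illustrative):

```python
def seeded_kmeans(points, seeds, k, keep_seed_labels=False, iters=100):
    # seeds maps a seed point to its user-provided label in {0, ..., k-1}.
    # Initialization: center i is the mean of the seed points labeled i.
    centers = [None] * k
    for l in range(k):
        members = [x for x, lab in seeds.items() if lab == l]
        centers[l] = tuple(sum(d) / len(members) for d in zip(*members))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            if keep_seed_labels and x in seeds:
                l = seeds[x]   # Constrained K-Means: seed labels never change
            else:
                l = min(range(k),
                        key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            clusters[l].append(x)
        new_centers = [tuple(sum(d) / len(c) for d in zip(*c)) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

With `keep_seed_labels=False` this is Seeded K-Means (seeds only shape the starting centers); with `keep_seed_labels=True` it is Constrained K-Means (a mis-seeded point stays in its assigned cluster even if another center is nearer).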
Seeded K-Means
Use the labeled data to find the initial centroids, then run K-Means.
The labels of the seed points may change.
Seeded K-Means Example
  1. Initialize means using the labeled data
  2. Assign points to clusters
  3. Re-estimate means
  4. Assign points to clusters and converge (the label of one seed point is changed)
[Figure sequence: centers initialized from seed points; one seed point ends up in the other cluster]
Constrained K-Means
Use the labeled data to find the initial centroids, then run K-Means.
The labels of the seed points will not change.
Constrained K-Means Example
  1. Initialize means using the labeled data
  2. Assign points to clusters
  3. Re-estimate means and converge
[Figure sequence: seed labels stay fixed throughout]
Datasets
Data sets:
  UCI Iris (3 classes; 150 instances)
  CMU 20 Newsgroups (20 classes; 20,000 instances)
  Yahoo! News (20 classes; 2,340 instances)
Data subsets created for the experiments:
  Small-20 newsgroup: a random sample of 100 documents from each newsgroup, created to study the effect of data size on the algorithms.
  Different-3 newsgroup: 3 very different newsgroups (alt.atheism, rec.sport.baseball, sci.space), created to study the effect of data separability on the algorithms.
  Same-3 newsgroup: 3 very similar newsgroups (comp.graphics, comp.os.ms-windows, comp.windows.x).
Evaluation
  Objective function
  Mutual information
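Mutual information between a clustering and the reference labels can be computed from the empirical joint distribution of the two label lists (a sketch using natural logarithms; the slides do not specify a log base):

```python
from collections import Counter
from math import log

def mutual_information(labels1, labels2):
    n = len(labels1)
    joint = Counter(zip(labels1, labels2))   # joint counts over label pairs
    p1 = Counter(labels1)                    # marginal counts, partition 1
    p2 = Counter(labels2)                    # marginal counts, partition 2
    # MI = sum over cells of p(u, v) * log( p(u, v) / (p(u) * p(v)) )
    return sum((c / n) * log((c / n) / ((p1[u] / n) * (p2[v] / n)))
               for (u, v), c in joint.items())
```

Identical partitions give MI equal to the entropy of the labeling; independent partitions give MI near zero. In practice a normalized variant is often reported so that scores are comparable across different numbers of clusters.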
Results: MI and Seeding
Zero noise in seeds [Small-20 NewsGroup]: semi-supervised KMeans is substantially better than unsupervised KMeans.
Results: Objective Function and Seeding
User labeling consistent with the KMeans assumptions [Small-20 NewsGroup]: the objective function of the data partition increases exponentially with the seed fraction.
Results: Objective Function and Seeding
User labeling inconsistent with the KMeans assumptions [Yahoo! News]: the objective function of the constrained algorithms decreases with seeding.