Page 1: Title

COMP3503: Automated Discovery and Clustering Methods
Daniel L. Silver
CogNova Technologies

Page 2: Agenda

• Automated Exploration/Discovery (unsupervised clustering methods)
• K-Means Clustering Method
• Kohonen Self-Organizing Maps (SOM)

Page 3: Overview of Modeling Methods

Automated Exploration/Discovery
• e.g. discovering new market segments
• distance and probabilistic clustering algorithms

Prediction/Classification
• e.g. forecasting gross sales given current factors
• statistics (regression, K-nearest neighbour)
• artificial neural networks, genetic algorithms

Explanation/Description
• e.g. characterizing customers by demographics
• inductive decision trees/rules
• rough sets, Bayesian belief nets

[Slide figures: a scatter of clusters A and B in the (x1, x2) plane; a fitted curve f(x) over x; an induced rule such as "if age > 35 and income < $35k then ..."]

Page 4: Automated Exploration/Discovery Through Unsupervised Learning

Objective: To induce a model without use of a target (supervisory) variable, such that similar examples are grouped into self-organized clusters or categories.

This can be considered a method of unsupervised concept learning. There is no explicit teaching signal.

[Slide figure: clusters A, B, and C in the (x1, x2) plane]

Page 5: Classification Systems and Inductive Learning

Basic Framework for Inductive Learning

[Slide diagram: the Environment supplies Training Examples (x, f(x)) to an Inductive Learning System, which produces an induced model (classifier); Testing Examples (x, h(x)) yield the Output Classification, and the question is whether h(x) ≈ f(x).]

A problem of representation and search for the best hypothesis, h(x).

THIS FRAMEWORK IS NOT APPLICABLE TO UNSUPERVISED LEARNING

Page 6: Automated Exploration/Discovery Through Unsupervised Learning

Multi-dimensional Feature Space

[Slide figure: a 3-D feature space with axes Age, Income $, and Education]

Page 7: Automated Exploration/Discovery Through Unsupervised Learning

Common Uses
• Market segmentation
• Population categorization
• Product/service categorization
• Automated subject indexing (WEBSOM)
• Multi-variable (vector) quantization: reduce several variables to one

Page 8: Clustering (WHF 3.6 & 4.8)

• Clustering techniques apply when there is no class to be predicted
• Aim: divide instances into "natural" groups
• Clusters can be:
  • disjoint vs. overlapping
  • deterministic vs. probabilistic
  • flat vs. hierarchical

Page 9: Representing Clusters I

[Slide figures: a simple 2-D representation of disjoint clusters; a Venn diagram showing overlapping clusters]

Page 10: Representing Clusters II

Probabilistic assignment:

      1    2    3
a   0.4  0.1  0.5
b   0.1  0.8  0.1
c   0.3  0.3  0.4
d   0.1  0.1  0.8
e   0.4  0.2  0.4
f   0.1  0.4  0.5
g   0.7  0.2  0.1
h   0.5  0.4  0.1
…

[Slide figure: a dendrogram]

NB: dendron is the Greek word for tree

Page 11: Clustering (WHF 3.6 & 4.8)

• The classic clustering algorithm: k-means
• k-means clusters are disjoint, deterministic, and flat

Page 12: K-Means Clustering Method

• Consider m examples, each with 2 attributes [x, y] in a 2D input space
• Method depends on storing all examples
• Set the number of clusters, K
• The centroid of each cluster is initially the average coordinates of the first K examples, or randomly chosen coordinates

Page 13: K-Means Clustering Method

Until cluster boundaries stop changing:
• Assign each example to the cluster whose centroid is nearest, using some distance measure (e.g. Euclidean distance)
• Recalculate the centroid of each cluster, e.g. [mean(x), mean(y)] over all examples currently in cluster K
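To make the loop concrete, here is a minimal NumPy sketch. The convergence tolerance, iteration cap, and toy data are illustrative assumptions; only the assign/update loop and initialize-with-first-K-examples choice come from the slides.

```python
import numpy as np

def k_means(X, K, max_iters=100, tol=1e-6):
    """Cluster the rows of X (m examples, n attributes) into K groups."""
    centroids = X[:K].copy()  # initialize with the first K examples
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned examples
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.linalg.norm(new_centroids - centroids) < tol:  # boundaries stable
            break
        centroids = new_centroids
    return centroids, labels

# Example: two obvious groups in 2D
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
centroids, labels = k_means(X, K=2)
```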

Page 14: Clustering (WHF 3.6 & 4.8)

DEMO: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html

Page 15: K-Means Clustering Method

Advantages:
• simple algorithm to implement
• can form very interesting groupings
• clusters can be characterized by attributes/values
• sequential learning method

Disadvantages:
• all variables must be ordinal in nature
• problems transforming categorical variables
• requires lots of memory and computation time
• does poorly for large numbers of attributes (curse of dimensionality)

Page 16: Discussion

• The algorithm minimizes the squared distance to cluster centers
• The result can vary significantly based on the initial choice of seeds
• Can get trapped in a local minimum
  • [Slide figure: example instances and initial cluster centres]
• To increase the chance of finding the global optimum: restart with different random seeds
• Can be applied recursively with k = 2

Page 17: Kohonen SOM (Self-Organizing Feature Map)

• Implements a version of K-Means
• Two-layer feed-forward neural network
• Input layer fully connected to an output layer of N outputs arranged in 2D; weights initialized to small random values
• Objective is to arrange the outputs into a map that is topologically organized according to the features presented in the data

Page 18: Kohonen SOM: The Training Algorithm

• Present an example to the inputs
• The winning output is the one whose weights are "closest" to the input values (e.g. Euclidean distance)
• Adjust the winner's weights slightly to make it more like the example
• Adjust the weights of the neighbouring output nodes relative to their proximity to the chosen output node
• Reduce the neighbourhood size
• Repeat for I iterations through the examples
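A compact sketch of that loop. The square grid, Gaussian neighbourhood, and linearly decaying learning rate and radius are assumptions; the slide leaves those choices open.

```python
import numpy as np

def train_som(X, grid=(10, 10), iters=1000, lr0=0.5, radius0=5.0):
    """Train a 2D self-organizing map on the rows of X."""
    rng = np.random.default_rng(0)
    n_features = X.shape[1]
    weights = rng.uniform(-0.1, 0.1, size=(grid[0], grid[1], n_features))
    # (row, col) coordinates of every output node, for neighbourhood distances
    coords = np.stack(np.meshgrid(np.arange(grid[0]), np.arange(grid[1]),
                                  indexing="ij"), axis=-1)
    for t in range(iters):
        frac = t / iters
        lr, radius = lr0 * (1 - frac), max(radius0 * (1 - frac), 0.5)
        x = X[rng.integers(len(X))]                 # present one example
        # Winner: node whose weight vector is closest to x (Euclidean)
        dists = np.linalg.norm(weights - x, axis=2)
        winner = np.unravel_index(dists.argmin(), dists.shape)
        # Gaussian neighbourhood around the winner on the 2D grid
        grid_dist2 = ((coords - np.array(winner)) ** 2).sum(axis=-1)
        h = np.exp(-grid_dist2 / (2 * radius ** 2))[..., None]
        # Move the winner (and, more weakly, its neighbours) toward x
        weights += lr * h * (x - weights)
    return weights
```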

Page 19: SOM Demos

http://www.eee.metu.edu.tr/~alatan/Courses/Demo/Kohonen.htm
http://davis.wpi.edu/~matt/courses/soms/applet.html

Page 20: Kohonen SOM

• In the end, the network effectively quantizes the input vector of each example to a single output node
• The weights to that node indicate the feature values that characterize the cluster
• The topology of the map shows the association/proximity of clusters
• Biological justification: evidence of localized topological mapping in the neocortex

Page 21: Faster Distance Calculations

• Can we use kD-trees or ball trees to speed up the process? Yes (see the sketch below):
  • First, build the tree, which remains static, for all the data points
  • At each node, store the number of instances and the sum of all instances
  • In each iteration, descend the tree and find out which cluster each node belongs to
    • Can stop descending as soon as we find out that a node belongs entirely to a particular cluster
    • Use the statistics stored at the nodes to compute the new cluster centers
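A sketch of the pruning idea, in the spirit of the Pelleg-Moore filtering algorithm. The node statistics and early stopping match the slide; the widest-dimension splitting rule, midpoint-based owner guess, and corner-based domination test are assumptions, not the slide's code.

```python
import numpy as np

class Node:
    def __init__(self, points):
        self.count = len(points)                 # statistics stored at the node
        self.vector_sum = points.sum(axis=0)
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)  # bounding box
        if len(points) <= 1:
            self.left = self.right = None
        else:
            d = np.argmax(self.hi - self.lo)     # split on the widest dimension
            order = points[:, d].argsort()
            mid = len(points) // 2
            self.left, self.right = Node(points[order[:mid]]), Node(points[order[mid:]])

def dominates(c_star, c, lo, hi):
    """True if centroid c_star is closer than c to EVERY point in the box."""
    # Worst case for c_star: the box corner pulled as far toward c as possible
    corner = np.where(c > c_star, hi, lo)
    return np.sum((corner - c_star) ** 2) <= np.sum((corner - c) ** 2)

def update(node, centroids, sums, counts):
    """One k-means pass: accumulate per-cluster sums and counts via the tree."""
    mid = (node.lo + node.hi) / 2
    owner = np.argmin(((centroids - mid) ** 2).sum(axis=1))
    if node.left is None or all(dominates(centroids[owner], c, node.lo, node.hi)
                                for j, c in enumerate(centroids) if j != owner):
        # Whole subtree belongs to one cluster: use the stored statistics
        # and stop descending.
        sums[owner] += node.vector_sum
        counts[owner] += node.count
    else:
        update(node.left, centroids, sums, counts)
        update(node.right, centroids, sums, counts)

# One full k-means iteration using a prebuilt tree root:
#   sums, counts = np.zeros_like(centroids), np.zeros(len(centroids))
#   update(root, centroids, sums, counts)
#   centroids = sums / counts[:, None]
```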

Page 22: Example

[Slide figure only; not preserved in this transcript]

Page 23: 6.8 Clustering: How Many Clusters?

• How to choose k in k-means? Possibilities (a sketch of the first follows):
  • Choose the k that minimizes the cross-validated squared distance to cluster centers
  • Use penalized squared distance on the training data (e.g. using an MDL criterion)
  • Apply k-means recursively with k = 2 and use a stopping criterion (e.g. based on MDL)
    • Seeds for subclusters can be chosen by seeding along the direction of greatest variance in the cluster (one standard deviation away in each direction from the center of the parent cluster)
    • Implemented in an algorithm called X-means (using the Bayesian Information Criterion instead of MDL)
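A sketch of the first option using scikit-learn. The candidate range and fold count are arbitrary assumptions; note that held-out squared distance can keep shrinking slowly as k grows, which is one reason the penalized criteria above exist.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def choose_k(X, k_values=range(2, 11), n_splits=5):
    """Pick the k minimizing average held-out squared distance to centers."""
    scores = {}
    for k in k_values:
        fold_scores = []
        for train_idx, test_idx in KFold(n_splits, shuffle=True,
                                         random_state=0).split(X):
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train_idx])
            # Squared distance from each held-out point to its nearest center
            d2 = ((X[test_idx, None, :] - km.cluster_centers_[None, :, :]) ** 2
                  ).sum(axis=2).min(axis=1)
            fold_scores.append(d2.mean())
        scores[k] = np.mean(fold_scores)
    return min(scores, key=scores.get), scores
```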

Page 24: Hierarchical Clustering

• Recursively splitting clusters produces a hierarchy that can be represented as a dendrogram
• Could also be represented as a Venn diagram of sets and subsets (without intersections)
• The height of each node in the dendrogram can be made proportional to the dissimilarity between its children

Page 25: Agglomerative Clustering

• Bottom-up approach
• Simple algorithm (see the sketch below):
  • Requires a distance/similarity measure
  • Start by considering each instance to be a cluster
  • Find the two closest clusters and merge them
  • Continue merging until only one cluster is left
  • The record of mergings forms a hierarchical clustering structure: a binary dendrogram

Page 26: Distance Measures

• Single-linkage
  • Minimum distance between the two clusters
  • Distance between the clusters' two closest members
  • Can be sensitive to outliers
• Complete-linkage
  • Maximum distance between the two clusters
  • Two clusters are considered close only if all instances in their union are relatively similar
  • Also sensitive to outliers
  • Seeks compact clusters

Page 27: Distance Measures (cont.)

• Compromises between the extremes of minimum and maximum distance:
  • Represent clusters by their centroid, and use the distance between centroids: centroid linkage
    • Works well for instances in multidimensional Euclidean space
    • Not so good if all we have is pairwise similarity between instances
  • Calculate the average distance between each pair of members of the two clusters: average-linkage
• Technical deficiency of both: results depend on the numerical scale on which distances are measured

Page 28: More Distance Measures

• Group-average clustering
  • Uses the average distance between all members of the merged cluster
  • Differs from average-linkage because it includes pairs from the same original cluster
• Ward's clustering method
  • Calculates the increase in the sum of squares of the distances of the instances from the centroid before and after fusing two clusters
  • Minimize the increase in this squared distance at each clustering step
• All measures will produce the same result if the clusters are compact and well separated

Page 29: Example Hierarchical Clustering

• 50 examples of different creatures from the zoo data
• Complete-linkage

[Slide figures: a dendrogram and a polar plot of the resulting hierarchy]

Page 30: Example Hierarchical Clustering 2

• Single-linkage

[Slide figure]

Page 31: Incremental Clustering

• Heuristic approach (COBWEB/CLASSIT)
• Forms a hierarchy of clusters incrementally
• Start: the tree consists of an empty root node
• Then:
  • add instances one by one
  • update the tree appropriately at each stage
  • to update, find the right leaf for an instance
  • this may involve restructuring the tree
• Base update decisions on category utility

Page 32: Clustering Weather Data

ID  Outlook   Temp.  Humidity  Windy
A   Sunny     Hot    High      False
B   Sunny     Hot    High      True
C   Overcast  Hot    High      False
D   Rainy     Mild   High      False
E   Rainy     Cool   Normal    False
F   Rainy     Cool   Normal    True
G   Overcast  Cool   Normal    True
H   Sunny     Mild   High      False
I   Sunny     Cool   Normal    False
J   Rainy     Mild   Normal    False
K   Sunny     Mild   Normal    True
L   Overcast  Mild   High      True
M   Overcast  Hot    Normal    False
N   Rainy     Mild   High      True

[Slide figures 1 to 3: successive cluster trees as the first instances are added one by one]

Page 33: Clustering Weather Data

[Same weather table as on the previous slide.]

[Slide figures 4 and 5: stage 4 merges the best host and runner-up; stage 5 considers splitting the best host if merging doesn't help.]

Page 34: Final Hierarchy

[Slide figure: the final hierarchy for the weather data]

Page 35: Example: The Iris Data (Subset)

[Slide figure: hierarchy produced for a subset of the iris data]

Page 36: Clustering with Cutoff

[Slide figure]

Page 37: Category Utility

• Category utility: a quadratic loss function defined on conditional probabilities (reconstructed below)
• If every instance is placed in a different category, the numerator reaches its maximum value: n minus the sum of squared attribute-value probabilities, where n is the number of attributes
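The formula itself did not survive the transcript. Reconstructed from the Witten & Frank text these slides follow (treat it as a reconstruction, not a verbatim copy of the slide):

```latex
CU(C_1, C_2, \ldots, C_k)
  = \frac{1}{k} \sum_{l} \Pr[C_l] \sum_{i} \sum_{j}
    \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)
```

With every instance in its own category, each conditional probability is 0 or 1, so the numerator reduces to n minus the sum of the squared unconditional probabilities, its maximum.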

Page 38: Numeric Attributes

[The slide's formulas did not survive the transcript; it covers how category utility handles numeric attributes.]
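In the WHF treatment these slides track, numeric attributes are handled by assuming a normal distribution within each cluster and replacing the sum over discrete values with an integral over the density. The following is a reconstruction on that assumption, not the slide's own text:

```latex
\sum_j \Pr[a_i = v_{ij}]^2 \;\longleftrightarrow\; \int f(a_i)^2 \, da_i
  = \frac{1}{2\sqrt{\pi}\,\sigma_i}
\quad\Rightarrow\quad
CU = \frac{1}{2\sqrt{\pi}\,k} \sum_l \Pr[C_l] \sum_i
     \left( \frac{1}{\sigma_{il}} - \frac{1}{\sigma_i} \right)
```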

Page 39: Probability-Based Clustering

• Problems with the heuristic approach:
  • Division by k?
  • Order of examples?
  • Are the restructuring operations sufficient?
  • Is the result at least a local minimum of category utility?
• Probabilistic perspective: seek the most likely clusters given the data
• Also: an instance belongs to a particular cluster with a certain probability

Page 40: Finite Mixtures

• Model the data using a mixture of distributions
• One cluster, one distribution
  • governs the probabilities of attribute values in that cluster
• Finite mixtures: finite number of clusters
• Individual distributions are normal (Gaussian)
• Combine distributions using cluster weights

Page 41: Two-Class Mixture Model

data:
A 51  A 43  B 62  B 64  A 45  A 42  A 46  A 45  A 45
B 62  A 47  A 52  B 64  A 51  B 65  A 48  A 49  A 46
B 64  A 51  A 52  B 62  A 49  A 48  B 62  A 43  A 40
A 48  B 64  A 51  B 63  A 43  B 65  B 66  B 65  A 46
A 39  B 62  B 64  A 52  B 63  B 64  A 48  B 64  A 48
A 51  A 48  B 64  A 42  A 48  A 41

model:
μA = 50, σA = 5, pA = 0.6
μB = 65, σB = 2, pB = 0.4

Page 42: Using the Mixture Model

[The slide's formulas did not survive the transcript; it shows how to compute the probability that a given instance belongs to each cluster.]
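Per the WHF text, what the lost slide computes is the posterior probability of each cluster for an instance x via Bayes' rule with normal densities. A reconstruction:

```latex
\Pr[A \mid x] = \frac{\Pr[x \mid A]\,\Pr[A]}{\Pr[x]},
\qquad
\Pr[x \mid A] = f(x; \mu_A, \sigma_A)
  = \frac{1}{\sqrt{2\pi}\,\sigma_A}
    \exp\!\left(-\frac{(x-\mu_A)^2}{2\sigma_A^2}\right)
```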

Page 43: Learning the Clusters

• Assume: we know there are k clusters
• Learn the clusters: determine their parameters, i.e. means and standard deviations
• Performance criterion: probability of the training data given the clusters
• EM algorithm: finds a local maximum of the likelihood

Page 44: EM Algorithm

• EM = Expectation-Maximization
• Generalizes k-means to a probabilistic setting
• Iterative procedure:
  • E "expectation" step: calculate the cluster probability for each instance
  • M "maximization" step: estimate the distribution parameters from the cluster probabilities
• Store cluster probabilities as instance weights
• Stop when the improvement is negligible
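A minimal sketch of this procedure for the two-cluster, one-attribute mixture from the earlier slide. The initialization and fixed iteration count are assumptions; the E and M steps follow the bullets above.

```python
import numpy as np
from scipy.stats import norm

def em_two_gaussians(x, iters=50):
    """EM for a two-component 1D Gaussian mixture (as in the A/B example)."""
    # Crude initialization (an assumption; any reasonable start works)
    mu_a, mu_b = x.min(), x.max()
    sd_a = sd_b = x.std()
    p_a = 0.5
    for _ in range(iters):
        # E step: probability that each instance came from cluster A
        like_a = p_a * norm.pdf(x, mu_a, sd_a)
        like_b = (1 - p_a) * norm.pdf(x, mu_b, sd_b)
        w = like_a / (like_a + like_b)          # instance weights
        # M step: weighted means, standard deviations, and cluster weight
        mu_a, mu_b = np.average(x, weights=w), np.average(x, weights=1 - w)
        sd_a = np.sqrt(np.average((x - mu_a) ** 2, weights=w))
        sd_b = np.sqrt(np.average((x - mu_b) ** 2, weights=1 - w))
        p_a = w.mean()
    return mu_a, sd_a, mu_b, sd_b, p_a
```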

Page 45: More on EM

• Estimate the parameters from weighted instances
• Stop when the log-likelihood saturates
• Log-likelihood (reconstructed below)
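The slide's formulas were lost. Under the usual weighted-instance convention for the two-cluster case, they read (a reconstruction):

```latex
\mu_A = \frac{\sum_i w_i x_i}{\sum_i w_i},
\qquad
\sigma_A^2 = \frac{\sum_i w_i (x_i - \mu_A)^2}{\sum_i w_i},
\qquad
\log L = \sum_i \log\bigl(p_A \Pr[x_i \mid A] + p_B \Pr[x_i \mid B]\bigr)
```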

Page 46: Extending the Mixture Model

• More than two distributions: easy
• Several attributes: easy, assuming independence!
• Correlated attributes: difficult
  • Joint model: bivariate normal distribution with a (symmetric) covariance matrix
  • n attributes: need to estimate n + n(n+1)/2 parameters

Page 47: More Mixture Model Extensions

• Nominal attributes: easy if independent
• Correlated nominal attributes: difficult
  • Two correlated attributes give v1 × v2 parameters
• Missing values: easy
• Can use distributions other than normal:
  • "log-normal" if a predetermined minimum is given
  • "log-odds" if bounded from above and below
  • Poisson for attributes that are integer counts
• Use cross-validation to estimate k!

Page 48: Bayesian Clustering

• Problem: many parameters, so EM overfits
• Bayesian approach: give every parameter a prior probability distribution
  • Incorporate the prior into the overall likelihood figure
  • Penalizes the introduction of parameters
• E.g.: Laplace estimator for nominal attributes
• Can also have a prior on the number of clusters!
• Implementation: NASA's AUTOCLASS

Page 49: Discussion

• Can interpret clusters by using supervised learning
  • post-processing step
• Decrease dependence between attributes?
  • pre-processing step
  • e.g. use principal component analysis
• Can be used to fill in missing values
• Key advantage of probabilistic clustering:
  • can estimate the likelihood of the data
  • use it to compare different models objectively

Page 50: WEKA Tutorial

K-means, EM, and Cobweb on the Weather data:
http://www.cs.ccsu.edu/~markov/ccsu_courses/DataMining-Ex3.html

Page 51: [blank slide]

Page 52: Semisupervised Learning
(Data Mining: Practical Machine Learning Tools and Techniques, Chapter 6)

• Semisupervised learning: attempts to use unlabeled data as well as labeled data
  • The aim is to improve classification performance
• Why try to do this? Unlabeled data is often plentiful and labeling data can be expensive
  • Web mining: classifying web pages
  • Text mining: identifying names in text
  • Video mining: classifying people in the news
• Leveraging the large pool of unlabeled examples would be very attractive

Page 53: Clustering for Classification
(Data Mining: Practical Machine Learning Tools and Techniques, Chapter 7)

• Idea: use naïve Bayes on the labeled examples and then apply EM (a sketch follows)
  • First, build a naïve Bayes model on the labeled data
  • Second, label the unlabeled data based on class probabilities ("expectation" step)
  • Third, train a new naïve Bayes model based on all the data ("maximization" step)
  • Fourth, repeat the 2nd and 3rd steps until convergence
• Essentially the same as EM for clustering, with fixed cluster membership probabilities for the labeled data and #clusters = #classes
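A rough sketch of that loop using scikit-learn. GaussianNB stands in for the text-oriented naïve Bayes model used in the literature, and the hard-label-with-confidence-weight E step is a simplification of true soft EM; both are assumptions, not the slide's method.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def em_naive_bayes(X_lab, y_lab, X_unl, iters=10):
    """Semisupervised naive Bayes via an EM-style loop."""
    nb = GaussianNB().fit(X_lab, y_lab)            # step 1: labeled data only
    for _ in range(iters):
        proba = nb.predict_proba(X_unl)            # E step: class probabilities
        y_unl = nb.classes_[proba.argmax(axis=1)]  # most probable class...
        w_unl = proba.max(axis=1)                  # ...weighted by its probability
        # M step: retrain on all data; labeled instances keep weight 1
        X_all = np.vstack([X_lab, X_unl])
        y_all = np.concatenate([y_lab, y_unl])
        w_all = np.concatenate([np.ones(len(X_lab)), w_unl])
        nb = GaussianNB().fit(X_all, y_all, sample_weight=w_all)
    return nb
```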

Page 54: Comments
(Data Mining: Practical Machine Learning Tools and Techniques, Chapter 7)

• Has been applied successfully to document classification
  • Certain phrases are indicative of classes
  • Some of these phrases occur only in the unlabeled data, some in both sets
  • EM can generalize the model by taking advantage of co-occurrence of these phrases
• Refinement 1: reduce the weight of the unlabeled data
• Refinement 2: allow multiple clusters per class

Page 55: Co-Training
(Data Mining: Practical Machine Learning Tools and Techniques, Chapter 7)

• Method for learning from multiple views (multiple sets of attributes), e.g.:
  • First set of attributes describes the content of a web page
  • Second set of attributes describes the links that point to the web page
• Step 1: build a model from each view
• Step 2: use the models to assign labels to unlabeled data
• Step 3: select those unlabeled examples that were most confidently predicted (ideally, preserving the ratio of classes)
• Step 4: add those examples to the training set
• Step 5: go to Step 1 until the data is exhausted
• Assumption: views are independent

(a sketch follows)
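A rough sketch of the five steps with scikit-learn. GaussianNB, the batch size, and the round count are assumptions, and class-ratio preservation (step 3's ideal) is omitted for brevity; where the two models pick the same example, the second model's label wins, a simplification of the classic scheme.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, X1_unl, X2_unl, n_add=5, rounds=20):
    """Co-training sketch: X1/X2 are the two views of the labeled data,
    X1_unl/X2_unl the corresponding views of the unlabeled pool."""
    for _ in range(rounds):
        # Step 1: build one model per view
        m1 = GaussianNB().fit(X1, y)
        m2 = GaussianNB().fit(X2, y)
        if len(X1_unl) == 0:
            break                                  # step 5: data exhausted
        # Steps 2-3: each model labels the pool; take its most confident picks
        picks, labels = [], []
        for m, Xv in ((m1, X1_unl), (m2, X2_unl)):
            proba = m.predict_proba(Xv)
            idx = np.argsort(proba.max(axis=1))[-n_add:]
            picks.append(idx)
            labels.append(m.classes_[proba[idx].argmax(axis=1)])
        label_map = dict(zip(np.concatenate(picks), np.concatenate(labels)))
        chosen = np.unique(np.concatenate(picks))
        # Step 4: move the chosen examples (with predicted labels) into training
        X1 = np.vstack([X1, X1_unl[chosen]])
        X2 = np.vstack([X2, X2_unl[chosen]])
        y = np.concatenate([y, np.array([label_map[i] for i in chosen])])
        keep = np.setdiff1d(np.arange(len(X1_unl)), chosen)
        X1_unl, X2_unl = X1_unl[keep], X2_unl[keep]
    return m1, m2
```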

Page 56: EM and Co-Training
(Data Mining: Practical Machine Learning Tools and Techniques, Chapter 7)

• Like EM for semisupervised learning, but the view is switched in each iteration of EM
  • Uses all the unlabeled data (probabilistically labeled) for training
• Has also been used successfully with support vector machines
  • using logistic models fit to the output of the SVMs
• Co-training also seems to work when the views are chosen randomly!
  • Why? Possibly because the co-trained classifier is more robust

Page 57: THE END

[email protected]