K-Means Clustering - Swarthmore College
K-Means Clustering 3/3/17
Transcript
Page 1:

K-Means Clustering 3/3/17

Page 2:

Unsupervised Learning
• We have a collection of unlabeled data points.
• We want to find underlying structure in the data.

Examples:
• Identify groups of similar data points.
  • Clustering

• Find a better basis to represent the data.
  • Principal component analysis

• Compress the data to a shorter representation.
  • Auto-encoders

Page 3:

Unsupervised Learning
• We have a collection of unlabeled data points.
• We want to find underlying structure in the data.

Applications:
• Generating the input representation for another AI or ML algorithm.
  • Clusters could lead to states in a state space search or MDP model.
  • A new basis could be the input to a classification or regression algorithm.

• Making data easier to understand, by identifying what's important and/or discarding what isn't.

Page 4:

The Goal of Clustering
Given a bunch of data, we want to come up with a representation that will simplify future reasoning.

Key idea: group similar points into clusters.

Examples:
• Identifying objects in sensor data
• Detecting communities in social networks
• Constructing phylogenetic trees of species
• Making recommendations from similar users

Page 5:

EM Algorithm
E step: "expectation" … terrible name
• Classify the data using the current model.

M step: "maximization" … slightly less terrible name
• Generate the best model using the current classification of the data.

Initialize the model, then alternate E and M steps until convergence.

Note: The EM algorithm has many variations, including some that have nothing to do with clustering.
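In skeleton form, the alternation can be sketched as a generic loop (an illustrative sketch: the `e_step` and `m_step` callables stand in for whatever model-specific steps a particular EM variant uses, and the stopping test assumes the E step returns a discrete classification):

```python
def em(model, data, e_step, m_step, max_iters=100):
    """Alternate E and M steps until the classification stops changing."""
    labels = None
    for _ in range(max_iters):
        # E step: classify the data using the current model.
        new_labels = e_step(model, data)
        # Convergence: the classification did not change.
        if new_labels == labels:
            break
        labels = new_labels
        # M step: generate the best model for the current classification.
        model = m_step(data, labels)
    return model, labels
```

K-means (next slide) and EM for Gaussian mixtures both fit this template, differing only in what "model", "classify", and "best" mean.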

Page 6:

K-Means Algorithm
Model: k clusters, each represented by a centroid.

E step:
• Assign each point to the closest centroid.

M step:
• Move each centroid to the mean of the points assigned to it.

Convergence: we ran an E step in which no point's assignment changed.
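The E step, M step, and convergence test above can be written in a few lines of plain Python (a minimal sketch; the function and argument names are illustrative, and points are assumed to be tuples of floats):

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Alternate E and M steps until no assignment changes."""
    assignments = None
    for _ in range(max_iters):
        # E step: assign each point to the index of the closest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Convergence: an E step where no point's assignment changed.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # M step: move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # leave a centroid with no members where it is
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assignments
```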

Page 7:

K-Means Example

Page 8:

Initializing K-Means
Reasonable options:

1. Start with a random E step.
  • Randomly assign each point to a cluster in {1, 2, …, k}.

2. Start with a random M step.
  a) Pick random centroids within the maximum range of the data.
  b) Pick random data points to use as initial centroids.
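The three options might be sketched as follows (illustrative helper names, assuming points are tuples of floats; note the slide uses clusters {1, …, k}, while this sketch uses Python's zero-based {0, …, k-1}):

```python
import random

def init_random_assignment(points, k):
    """Option 1: random E step — each point gets a cluster in {0, ..., k-1}."""
    return [random.randrange(k) for _ in points]

def init_random_centroids(points, k):
    """Option 2a: random centroids within the range of the data, per coordinate."""
    dims = list(zip(*points))  # one tuple per coordinate axis
    return [[random.uniform(min(d), max(d)) for d in dims] for _ in range(k)]

def init_sampled_centroids(points, k):
    """Option 2b: pick k distinct random data points as initial centroids."""
    return [list(p) for p in random.sample(points, k)]
```

Option 1 feeds directly into an M step; options 2a and 2b feed into an E step.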

Page 9:

K-Means in Action

https://www.youtube.com/watch?v=BVFG7fd1H30

Page 10:

Another EM Example: GMMs

GMM: Gaussian mixture model
• A Gaussian distribution is a multivariate generalization of a normal distribution (the classic bell curve).
• A Gaussian mixture is a distribution composed of several independent Gaussians.
• If we model our data as a Gaussian mixture, we're saying that each data point was a random draw from one of several Gaussian distributions (but we may not know which).

Page 11:

EM for Gaussian Mixture Models
Model: data drawn from a mixture of k Gaussians

E step:
• Compute the (log) likelihood of the data.
  • Each point's probability of being drawn from each Gaussian.

M step:
• Update the mean and covariance of each Gaussian.
  • Weighted by how responsible that Gaussian was for each data point.
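For one-dimensional data the two steps reduce to something quite short (a minimal sketch where covariance reduces to variance; mixing weights are included, and all names are illustrative):

```python
import math

def gauss_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_e_step(data, params):
    """E step: each point's probability of being drawn from each Gaussian."""
    resp = []
    for x in data:
        p = [w * gauss_pdf(x, mu, var) for (w, mu, var) in params]
        total = sum(p)
        resp.append([pi / total for pi in p])
    return resp

def gmm_m_step(data, resp, k):
    """M step: update each Gaussian's weight, mean, and variance,
    weighted by how responsible it was for each data point."""
    params = []
    for j in range(k):
        rj = [r[j] for r in resp]           # responsibilities for Gaussian j
        nj = sum(rj)                        # effective number of points
        mu = sum(r * x for r, x in zip(rj, data)) / nj
        var = sum(r * (x - mu) ** 2 for r, x in zip(rj, data)) / nj
        params.append((nj / len(data), mu, max(var, 1e-6)))
    return params
```

Unlike k-means, the E step here is soft: each point gets a probability of membership in every Gaussian rather than a single hard assignment.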

Page 12:

How do we pick k?
There's no hard rule.

• Sometimes the application for which the clusters will be used dictates k.
• If k can be flexible, then we need to consider the tradeoffs:
  • Higher k will always decrease the error (increase the likelihood).
  • Lower k will always produce a simpler model.
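One common heuristic for navigating this tradeoff (standard practice, though not named on the slide) is to compute the within-cluster error for several values of k and look for an "elbow" where the decrease levels off. A sketch of the error being compared:

```python
import math

def within_cluster_error(points, centroids):
    """Total squared distance from each point to its closest centroid.
    Increasing k can only lower this, so we look for the k past which
    the improvement becomes marginal rather than minimizing outright."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
```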

Page 13:

Hierarchical Clustering
• Organizes data points into a hierarchy.
• Every level of the binary tree splits the points into two subsets.
• Points in a subset should be more similar than points in different subsets.
• The resulting clustering can be represented by a dendrogram.

Page 14:

Direction of Clustering
Agglomerative (bottom-up)
• Each point starts in its own cluster.
• Repeatedly merge the two most-similar clusters until only one remains.

Divisive (top-down)
• All points start in a single cluster.
• Repeatedly split the data into the two most self-similar subsets.

Either version can stop early if a specific number of clusters is desired.
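The agglomerative direction can be sketched as follows (an illustrative sketch: the slides don't fix a similarity measure, so single linkage, the smallest inter-point distance, is used here as one common choice; the merge record is what a dendrogram would be drawn from):

```python
import math

def agglomerative(points, target_clusters=1):
    """Bottom-up: each point starts in its own cluster; repeatedly merge
    the two most-similar clusters until target_clusters remain."""
    clusters = [[p] for p in points]
    merges = []  # record of merges, in order (the dendrogram's structure)
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest single-linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(
                math.dist(p, q) for p in clusters[ab[0]] for q in clusters[ab[1]]
            ),
        )
        merges.append((clusters[i][:], clusters[j][:]))
        clusters[i] = clusters[i] + clusters.pop(j)  # j > i, so i is unaffected
    return clusters, merges
```

Stopping early (target_clusters > 1) corresponds to cutting the dendrogram at a fixed number of clusters, as the last line of the slide describes.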

