Post on 14-Jul-2020
transcript
K-Means Clustering 3/3/17
Unsupervised Learning
• We have a collection of unlabeled data points.
• We want to find underlying structure in the data.
Examples:
• Identify groups of similar data points. (Clustering)
• Find a better basis to represent the data. (Principal component analysis)
• Compress the data to a shorter representation. (Auto-encoders)
Unsupervised Learning
• We have a collection of unlabeled data points.
• We want to find underlying structure in the data.
Applications:
• Generating the input representation for another AI or ML algorithm.
  • Clusters could lead to states in a state space search or MDP model.
  • A new basis could be the input to a classification or regression algorithm.
• Making data easier to understand, by identifying what's important and/or discarding what isn't.
The Goal of Clustering
Given a bunch of data, we want to come up with a representation that will simplify future reasoning.
Key idea: group similar points into clusters.
Examples:
• Identifying objects in sensor data
• Detecting communities in social networks
• Constructing phylogenetic trees of species
• Making recommendations from similar users
EM Algorithm
E step: "expectation" … terrible name
• Classify the data using the current model.
M step: "maximization" … slightly less terrible name
• Generate the best model using the current classification of the data.
Initialize the model, then alternate E and M steps until convergence.
Note: The EM algorithm has many variations, including some that have nothing to do with clustering.
K-Means Algorithm
Model: k clusters, each represented by a centroid.
E step:
• Assign each point to the closest centroid.
M step:
• Move each centroid to the mean of the points assigned to it.
Convergence: we ran an E step in which no point's assignment changed.
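The E/M loop above fits in a few lines. This is an illustrative NumPy sketch (all names are my own; it initializes by picking random data points as centroids), not the lecture's code:

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Plain k-means: alternate E and M steps until assignments stop changing."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k random data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignments = None
    for _ in range(max_iters):
        # E step: assign each point to the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(assignments, new_assignments):
            break  # convergence: an E step changed no assignments
        assignments = new_assignments
        # M step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[assignments == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, assignments

# Two well-separated blobs should come back as two clusters.
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
centroids, labels = kmeans(pts, k=2)
```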
K-Means Example
Initializing K-Means
Reasonable options:
1. Start with a random E step.
   • Randomly assign each point to a cluster in {1, 2, …, k}.
2. Start with a random M step.
   a) Pick random centroids within the maximum range of the data.
   b) Pick random data points to use as initial centroids.
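The three options might look like this on made-up data (the variable names and seeding are illustrative choices, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(20, 2))
k = 3

# Option 1: random E step — assign each point a random cluster in {0, ..., k-1}.
random_labels = rng.integers(0, k, size=len(points))

# Option 2a: random centroids within the range (bounding box) of the data.
lo, hi = points.min(axis=0), points.max(axis=0)
centroids_2a = rng.uniform(lo, hi, size=(k, points.shape[1]))

# Option 2b: random data points used as initial centroids.
centroids_2b = points[rng.choice(len(points), size=k, replace=False)]
```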
K-Means in Action
https://www.youtube.com/watch?v=BVFG7fd1H30
Another EM Example: GMMs
GMM: Gaussian mixture model
• A Gaussian distribution is a multivariate generalization of a normal distribution (the classic bell curve).
• A Gaussian mixture is a distribution made up of several independent Gaussians.
• If we model our data as a Gaussian mixture, we're saying that each data point was a random draw from one of several Gaussian distributions (but we may not know which).
EM for Gaussian Mixture Models
Model: data drawn from a mixture of k Gaussians
E step:
• Compute the (log) likelihood of the data: each point's probability of being drawn from each Gaussian.
M step:
• Update the mean and covariance of each Gaussian, weighted by how responsible that Gaussian was for each data point.
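The two steps can be sketched for a one-dimensional mixture (the lecture's setting is multivariate; 1-D keeps the updates readable). The function name and the quantile-based initialization are illustrative choices, not the lecture's:

```python
import numpy as np

def gmm_em_1d(x, k, iters=50):
    """EM for a 1-D Gaussian mixture: a minimal sketch of the E and M steps."""
    # Initialize means at spread-out quantiles so components start distinct.
    means = np.quantile(x, (np.arange(k) + 0.5) / k)
    variances = np.full(k, x.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E step: each point's responsibility (posterior probability)
        # under each Gaussian.
        densities = (weights / np.sqrt(2 * np.pi * variances)
                     * np.exp(-(x[:, None] - means) ** 2 / (2 * variances)))
        resp = densities / densities.sum(axis=1, keepdims=True)
        # M step: update mean, variance, and weight of each Gaussian,
        # weighted by its responsibility for each point.
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-9
        weights = nk / len(x)
    return means, variances, weights

# Two tight groups near 0 and near 10 should yield means near 0 and 10.
x = np.concatenate([0.1 * np.arange(50) / 50,
                    10.0 + 0.1 * np.arange(50) / 50])
means, variances, weights = gmm_em_1d(x, k=2)
```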
How do we pick k?
There's no hard rule.
• Sometimes the application for which the clusters will be used dictates k.
• If k can be flexible, then we need to consider the tradeoffs:
  • Higher k will always decrease the error (increase the likelihood).
  • Lower k will always produce a simpler model.
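The "higher k always decreases the error" claim has a one-line justification: adding a centroid can only shrink each point's distance to its nearest centroid. A quick numeric check on made-up data, using nested centroid sets rather than full k-means runs:

```python
import numpy as np

def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))

# Use the first k data points as centroids for k = 1..5; each added
# centroid can only shrink a point's nearest-centroid distance.
errors = [sse(points, points[:k]) for k in range(1, 6)]
```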
Hierarchical Clustering
• Organizes data points into a hierarchy.
• Every level of the binary tree splits the points into two subsets.
• Points in a subset should be more similar than points in different subsets.
• The resulting clustering can be represented by a dendrogram.
Direction of Clustering
Agglomerative (bottom-up)
• Each point starts in its own cluster.
• Repeatedly merge the two most-similar clusters until only one remains.
Divisive (top-down)
• All points start in a single cluster.
• Repeatedly split the data into the two most self-similar subsets.
Either version can stop early if a specific number of clusters is desired.
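The agglomerative loop, with the early stop at a desired cluster count, can be sketched as follows. This toy version measures cluster similarity by single linkage (distance between the closest pair of members), one of several reasonable choices not specified on the slide:

```python
import numpy as np

def agglomerative(points, target_k):
    """Bottom-up clustering: start with singleton clusters, repeatedly merge
    the two closest clusters, and stop when target_k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > target_k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the two closest
        del clusters[b]
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
clusters = agglomerative(pts, target_k=3)
```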