Introduction to Data Mining
Large-scale data is everywhere!
• There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies.
• New mantra:
  – Gather whatever data you can, whenever and wherever possible.
• Expectations:
  – Gathered data will have value, either for the purpose for which it was collected or for a purpose not originally envisioned.
Examples: computational simulations, business data, sensor networks, geo-spatial data, homeland security.
Why data mining? (Commercial viewpoint)
• Lots of data is being collected and warehoused:
  – Web data: Yahoo has petabytes of web data; Facebook has ~2B active users.
  – Purchases at department/grocery stores and e-commerce: Amazon records 1.1B orders a year.
  – Bank/credit card transactions.
• Computers have become cheaper and more powerful.
• Competitive pressure is strong: provide better, customized services for an edge (e.g., in Customer Relationship Management).
Why data mining? (Scientific viewpoint)
• Data is collected and stored at enormous speeds:
  – Remote sensors on a satellite: NASA EOSDIS archives over 1 petabyte of earth science data per year.
  – Telescopes scanning the skies: sky survey data.
  – High-throughput biological data.
  – Scientific simulations: terabytes of data generated in a few hours.
• Data mining helps scientists:
  – In automated analysis of massive data sets.
  – In hypothesis formation.
What is data mining?
Many definitions:
– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
Origins of data mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems.
• Traditional techniques may be unsuitable due to data that is:
  – Large-scale
  – High dimensional
  – Heterogeneous
  – Complex
  – Distributed
• Key distinction: data-driven vs. hypothesis-driven.
Data mining tasks
• Prediction tasks: use some variables to predict unknown or future values of other variables.
• Description tasks: find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Example data for these tasks:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Data mining methods
Predictive modeling: Classification
Find a model for the class attribute as a function of the values of the other attributes.
Training data:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

(Figure: a decision-tree model for predicting credit worthiness, with splits on Employed, Level of Education, and Number of years at present address, and Yes/No leaf labels.)
Examples of classification
• Predicting tumor cells as benign or malignant.
• Classifying credit card transactions as legitimate or fraudulent.
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil.
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying intruders in cyberspace.
Clustering
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Inter-cluster distances are maximized.
• Intra-cluster distances are minimized.
Applications of clustering
• Understanding:
  – Customer profiling for targeted marketing.
  – Grouping related documents for browsing.
  – Grouping genes and proteins that have similar functionality.
  – Grouping stocks with similar price fluctuations.
• Summarization:
  – Reducing the size of large data sets.
Clusters for Raw SST and Raw NPP
(Figure: a longitude-latitude world map partitioned into Sea Cluster 1, Sea Cluster 2, Ice or No NPP, Land Cluster 1, and Land Cluster 2.)
Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.
Courtesy: Michael Eisen
Association rule discovery
Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Association analysis: Applications
• Market-basket analysis: rules are used for sales promotion, shelf management, and inventory management.
• Telecommunication alarm diagnosis: rules are used to find combinations of alarms that occur together frequently in the same time period.
• Medical informatics: rules are used to find combinations of patient symptoms and test results associated with certain diseases.
Motivating challenges
• Scalability.
• High dimensionality.
• Heterogeneous and complex data.
• Data ownership and distribution.
• Non-traditional analysis.
The 4 V’s of “Big Data”
Pattern Mining
ASSOCIATION RULES
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} → {Beer},  {Milk, Bread} → {Eggs, Coke},  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset
• Itemset: a collection of one or more items.
  – Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items.
• Support count (σ): frequency of occurrence of an itemset.
  – E.g., σ({Milk, Bread, Diaper}) = 2.
• Support (s): fraction of transactions that contain an itemset.
  – E.g., s({Milk, Bread, Diaper}) = 2/5.
• Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Association Rule
• Association rule: an implication expression of the form X → Y, where X and Y are itemsets.
  – Example: {Milk, Diaper} → {Beer}
• Rule evaluation metrics:
  – Support (s): fraction of transactions that contain both X and Y.
  – Confidence (c): measures how often items in Y appear in transactions that contain X. It is nothing more than P(Y | X).

Example: {Milk, Diaper} ⇒ {Beer}
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
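To make the two metrics concrete, here is a minimal Python sketch (the helper names are illustrative, not from the slides) that reproduces the numbers above on these five transactions:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # sigma(itemset): number of transactions that contain the itemset
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y, transactions):
    # support and confidence of the rule X -> Y
    s = support_count(X | Y, transactions) / len(transactions)
    c = support_count(X | Y, transactions) / support_count(X, transactions)
    return s, c

print(rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions))   # (0.4, 0.666...)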
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having:
1) support ≥ minsup threshold, and
2) confidence ≥ minconf threshold.

Examples of rules:
{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
An approach…
1. List all possible association rules.
2. Compute the support and confidence for each rule.
3. Prune rules that fail the minsup and minconf thresholds.
Computational Complexity
Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules.
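The closed form can be checked with a few lines of Python (standard library only):

from math import comb

def num_rules(d):
    # brute-force count of all rules X -> Y over d items
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(num_rules(d), 3**d - 2**(d + 1) + 1)   # 602 602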
Mining Association Rules

Examples of rules:
{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Observations:
• All of the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}.
• Rules originating from the same itemset have identical support but can have different confidence.
• Thus, we may decouple the support and confidence requirements.
Mining association rules
Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

Frequent itemset generation is still expensive.
Frequent itemset generation strategies
• Reduce the number of candidates (M):
  – Complete search: M = 2^d.
  – Use pruning techniques to reduce M.
• Reduce the number of transactions (N):
  – Reduce the size of N as the size of the itemsets increases.
  – Used by DHP and vertical-based mining algorithms.
• Reduce the number of comparisons (NM):
  – Use efficient data structures to store the candidates or transactions.
  – No need to match every candidate against every transaction.
Pattern Lattice
(Figure: the itemset lattice over items {A, B, C, D, E}, from the empty set (null) up to ABCDE.)
Given d items, there are 2^d possible candidate itemsets.
Reducing the number of candidates
• Observation:
  – If an itemset is frequent, then all of its subsets must also be frequent.
• This holds due to the following property of the support measure:
  – The support of an itemset never exceeds the support of its subsets.
  – This is known as the anti-monotone property of support.

  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Illustrating support's anti-monotonicity
(Figure: the itemset lattice over {A, B, C, D, E}; once an itemset is found to be infrequent, all of its supersets (the "pruned supersets") can be eliminated from the search space.)
Illustrating support's anti-monotonicity
Minimum support = 3

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1
Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
Itemset        Count
Bread, Milk    3
Bread, Beer    2
Bread, Diaper  3
Milk, Beer     2
Milk, Diaper   3
Beer, Diaper   3
Triplets (3-itemsets):
Itemset              Count
Beer, Diaper, Milk   2
Beer, Bread, Diaper  2
Bread, Diaper, Milk  2
Beer, Bread, Milk    1

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidate itemsets.
With support-based pruning: 6 + 6 + 4 = 16; generating triplet candidates only from the frequent pairs leaves 6 + 6 + 1 = 13.
APRIORI

Apriori algorithm
– F_k: frequent k-itemsets
– L_k: candidate k-itemsets

Algorithm:
– Let k = 1.
– Generate F_1 = frequent 1-itemsets.
– Repeat until F_k is empty:
  1. Candidate generation: generate L_{k+1} from F_k.
  2. Candidate pruning: prune candidate itemsets in L_{k+1} containing subsets of length k that are infrequent.
  3. Support counting: count the support of each candidate in L_{k+1} by scanning the DB.
  4. Candidate elimination: eliminate candidates in L_{k+1} that are infrequent, leaving only those that are frequent, giving F_{k+1}.
(Figure: the itemset lattice over {A, B, C, D, E}.)
Apriori performs a level-by-level traversal of the lattice.
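As a rough illustration of this loop, here is a minimal, unoptimized Apriori sketch in Python; the helper logic is illustrative, and real implementations use the F_{k-1} x F_{k-1} candidate generation and the hash-tree support counting described next:

from itertools import combinations

def apriori(transactions, minsup_count):
    # transactions: list of frozensets; returns {frequent itemset: support count}
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {s: support(s) for s in items if support(s) >= minsup_count}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Candidate generation: join pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Candidate pruning: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Support counting and candidate elimination (one scan of the database).
        frequent = {c: support(c) for c in candidates if support(c) >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

db = [frozenset(t) for t in (["Bread", "Milk"], ["Beer", "Bread", "Diaper", "Eggs"],
      ["Beer", "Coke", "Diaper", "Milk"], ["Beer", "Bread", "Diaper", "Milk"],
      ["Bread", "Coke", "Diaper", "Milk"])]
print(apriori(db, minsup_count=3))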
Candidate generation: the F_{k−1} × F_{k−1} method
• Merge two frequent (k−1)-itemsets if their first (k−2) items are identical.
• F_3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE}
  – Merge(ABC, ABD) = ABCD
  – Merge(ABC, ABE) = ABCE
  – Merge(ABD, ABE) = ABDE
  – Do not merge (ABD, ACD) because they share only a prefix of length 1 instead of length 2.
Candidate pruning
• Let F_3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets.
• L_4 = {ABCD, ABCE, ABDE} is the set of candidate 4-itemsets generated (from the previous slide).
• Candidate pruning:
  – Prune ABCE because ACE and BCE are infrequent.
  – Prune ABDE because ADE is infrequent.
• After candidate pruning: L_4 = {ABCD}.
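The merge-and-prune steps above can be written compactly when itemsets are kept as sorted tuples (function names are illustrative):

from itertools import combinations

def generate_candidates(freq_kminus1):
    # F_{k-1} x F_{k-1}: merge itemsets (sorted tuples) sharing the first k-2 items
    freq = sorted(freq_kminus1)
    candidates = []
    for i in range(len(freq)):
        for j in range(i + 1, len(freq)):
            a, b = freq[i], freq[j]
            if a[:-1] == b[:-1]:                  # identical (k-2)-prefix
                candidates.append(tuple(sorted(set(a) | set(b))))
    return candidates

def prune_candidates(candidates, freq_kminus1):
    # keep a candidate only if all of its (k-1)-subsets are frequent
    freq = set(freq_kminus1)
    return [c for c in candidates
            if all(s in freq for s in combinations(c, len(c) - 1))]

F3 = [("A","B","C"), ("A","B","D"), ("A","B","E"), ("A","C","D"),
      ("B","C","D"), ("B","D","E"), ("C","D","E")]
L4 = generate_candidates(F3)       # ABCD, ABCE, ABDE
print(prune_candidates(L4, F3))    # [('A', 'B', 'C', 'D')]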
Support counting of candidate itemsets
Scan the database of transactions to determine the support of each candidate itemset.
– A naive approach must match every candidate itemset against every transaction, which is an expensive operation.

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Candidate itemsets: {Beer, Diaper, Milk}, {Beer, Bread, Diaper}, {Bread, Diaper, Milk}, {Beer, Bread, Milk}

Q: How should we perform this operation?

To reduce the number of comparisons, store the candidate itemsets in a hash structure.
– Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets.
(Figure: N transactions are matched against a hash structure with k buckets of candidates.)
Support counting: An example
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

How many of these itemsets are supported by the transaction {1, 2, 3, 5, 6}?

(Figure: a three-level enumeration of all 3-item subsets of the transaction {1, 2, 3, 5, 6}: 1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6.)

This is a "full" n-ary tree, where n is the number of items.
Q: Can we reduce the storage requirements?
Support counting using a hash tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
• A hash function (here: items 1, 4, 7 hash to one branch; 2, 5, 8 to another; 3, 6, 9 to the third).
• A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node).
(Figure: the hash tree built over the 15 candidates with this hash function.)
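As a highly simplified stand-in for the hash tree, the sketch below enumerates the k-subsets of each transaction and looks them up in the stored candidate set; the structure and names are illustrative only:

from collections import defaultdict
from itertools import combinations

def count_support(transactions, candidates, k):
    # the stored candidate set plays the role of the hash-tree leaves
    stored = {frozenset(c) for c in candidates}
    counts = defaultdict(int)
    for t in transactions:
        for subset in combinations(sorted(t), k):
            if frozenset(subset) in stored:
                counts[frozenset(subset)] += 1
    return counts

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6), (2,3,4),
              (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
print(count_support([(1, 2, 3, 5, 6)], candidates, k=3))
# the transaction supports three of the candidates: {1,2,5}, {1,3,6}, {3,5,6}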
Factors affecting the complexity of Apriori
MAXIMAL & CLOSED ITEMSETS

Maximal frequent itemset
(Figure: the itemset lattice over {A, B, C, D, E} with a border separating the frequent itemsets from the infrequent ones; the maximal itemsets lie just inside the border.)
An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.
Closed itemsets
• An itemset X is closed if all of its immediate supersets have a lower support than X.
• Itemset X is not closed if at least one of its immediate supersets has the same support as X.

TID  Items
1    A, B
2    B, C, D
3    A, B, C, D
4    A, B, D
5    A, B, C, D

Itemset  Support        Itemset     Support
A        4              A, B, C     2
B        5              A, B, D     3
C        3              A, C, D     2
D        4              B, C, D     2
A, B     4              A, B, C, D  2
A, C     2
A, D     3
B, C     3
B, D     4
C, D     3
Maximal vs. closed frequent itemsets
(Figure: the itemset lattice over {A, B, C, D, E}, annotated with the transaction IDs supporting each itemset; closed and maximal itemsets are marked.)

Minimum support = 2
# Closed = 9
# Maximal = 4

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE
Frequent, maximal, and closed itemsets
(Figure: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.)

Q1: What if, instead of finding the frequent itemsets, we find only the maximal frequent itemsets or the closed frequent itemsets?
Q2: Does knowledge of just the maximal frequent itemsets allow us to generate all required association rules?
Q3: Does knowledge of just the closed frequent itemsets allow us to generate all required association rules?
BEYOND LEVEL-BY-LEVEL EXPLORATION

(Figure: the itemset lattice over {A, B, C, D} organized (a) as a prefix tree and (b) as a suffix tree.)

Traversing the pattern lattice
• Patterns starting with A (patterns that contain A and any other item).
• Patterns starting with B (patterns that contain B and any other item except A).
• Patterns ending with D (patterns that contain D and any other item).
• Patterns ending with C (patterns that contain C and any other item except D).
Breadth-first vs. depth-first
(Figure: the itemset lattice over {a, b, c, d} explored breadth-first, level by level, and depth-first, one branch at a time.)
Plusses and minuses?
PROJECTION METHODS

Projection-based methods
(Figure: the itemset lattice over {A, B, C, D, E}, with a projected database attached to each node.)

Initial database:
TID  Items
1    A, B
2    B, C, D
3    A, C, D, E
4    A, D, E
5    A, B, C
6    A, B, C, D
7    B, C
8    A, B, C
9    A, B, D
10   B, C, E

Database associated with node A (the "projected DB"):
TID  Items
1    B
3    C, D, E
4    D, E
5    B, C
6    B, C, D
8    B, C
9    B, D

Database associated with node C:
TID  Items
2    D
3    D, E
6    D
10   E

A projected DB on prefix pattern X is obtained as follows:
• Eliminate any transactions that do not contain X.
• From the transactions that are left, retain only the items that are lexicographically greater than the items in X.
Projection-based method
• Items are listed in lexicographic order.
• Let P and DB(P) be a node's pattern and its associated projected database.
• Mining is performed by recursively calling the function TP(P, DB(P)):
  1. Determine the frequent items in DB(P), and denote them by E(P).
  2. Eliminate from DB(P) any items not in E(P).
  3. For each item x in E(P), call TP(Px, DB(Px)).
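A minimal Python sketch of this recursion (the names follow the slide's TP / DB(P) / E(P) notation; everything else is illustrative):

from collections import Counter

def project(db, x):
    # projected DB on item x: keep transactions containing x,
    # then keep only items lexicographically greater than x
    return [[i for i in t if i > x] for t in db if x in t]

def tp(P, db, minsup, results):
    # recursively mine all frequent patterns extending the prefix pattern P
    counts = Counter(i for t in db for i in set(t))
    EP = {i for i, c in counts.items() if c >= minsup}       # frequent items E(P)
    db = [[i for i in t if i in EP] for t in db]             # eliminate items not in E(P)
    for x in sorted(EP):
        results[tuple(P + [x])] = counts[x]                  # support of P extended by x
        tp(P + [x], project(db, x), minsup, results)

db = [["A","B"], ["B","C","D"], ["A","C","D","E"], ["A","D","E"], ["A","B","C"],
      ["A","B","C","D"], ["B","C"], ["A","B","C"], ["A","B","D"], ["B","C","E"]]
results = {}
tp([], db, minsup=3, results=results)
print(results)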
BEYOND TRANSACTIONS

Beyond transaction datasets
• The concept of frequent patterns and association rules has been generalized to different types of datasets:
  – Sequential datasets: sequences of purchasing transactions, web pages visited, articles read, biological sequences, event logs, etc.
  – Relational/graph datasets: social networks, chemical compounds, web graphs, information networks, etc.
• There is an extensive set of approaches and algorithms for these, many of which follow ideas similar to those developed for transaction datasets.
Clustering (Unsupervised learning)

What is cluster analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Inter-cluster distances are maximized.
• Intra-cluster distances are minimized.
Notion of a cluster can be ambiguous
How many clusters?
(Figure: the same set of points interpreted as two clusters, four clusters, or six clusters.)
Clustering formulations
A number of clustering formulations have been developed:
1. We need to find a fixed number of clusters.
   – Well-suited for compression-like applications.
2. We need to find clusters of fixed size.
   – Well-suited for neighborhood discovery (e.g., recommendation engines).
3. We need to find the smallest number of clusters that satisfy certain quality criteria.
   – Well-suited for applications in which cluster quality is important.
4. We need to find the natural number of clusters.
   – This is clustering's holy grail!
   • Extremely hard, problem dependent, and "quite supervised".
Types of clusterings
• A clustering is a set of clusters.
• Important distinction between hierarchical and partitional sets of clusters.
• Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
• Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
Partitional clustering
(Figure: original points and a partitional clustering of them.)

Hierarchical clustering
(Figure: a hierarchical clustering of points p1–p4 and the corresponding dendrogram.)
Other distinctions between sets of clusters
• Exclusive versus non-exclusive:
  – In non-exclusive clusterings, points may belong to multiple clusters.
  – Can represent multiple classes or "border" points.
• Fuzzy versus non-fuzzy:
  – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1.
  – Weights must sum to 1.
  – Probabilistic clustering has similar characteristics.
• Partial versus complete:
  – In some cases, we only want to cluster some of the data.
• Heterogeneous versus homogeneous:
  – Clusters of widely different sizes, shapes, and densities.
Types of clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or conceptual
• Described by an objective function

Types of clusters: Well-separated
Well-separated clusters: a cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
(Figure: three well-separated clusters.)

Types of clusters: Center-based
Center-based: a cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of the cluster.
(Figure: four center-based clusters.)

Types of clusters: Contiguity-based
Contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
(Figure: eight contiguous clusters.)

Types of clusters: Density-based
Density-based: a cluster is a dense region of points, which is separated by low-density regions from other regions of high density. Used when the clusters are irregular or intertwined, and when noise and outliers are present.
(Figure: six density-based clusters.)

Types of clusters: Conceptual clusters
Shared property or conceptual clusters: find clusters that share some common property or represent a particular concept.
(Figure: two overlapping circles.)

Types of clusters: Objective function
Clusters defined by an objective function:
– Find clusters that minimize or maximize an objective function.
– Enumerate all possible ways of dividing the points into clusters and evaluate the "goodness" of each potential set of clusters using the given objective function. (NP-hard)
– Can have global or local objectives:
  • Hierarchical clustering algorithms typically have local objectives.
  • Partitional algorithms typically have global objectives.
– A variation of the global-objective approach is to fit the data to a parameterized model:
  • Parameters for the model are determined from the data.
  • Mixture models assume that the data is a "mixture" of a number of statistical distributions.
Clustering requirements
The fundamental requirement for clustering is the availability of a function to determine the similarity or distance between objects in the database.
The user must be able to answer some of the following questions:
1. When should two objects belong to the same cluster?
2. What should the clusters look like (i.e., what types of objects should they contain)?
3. What are the object-related characteristics of good clusters?
Data characteristics & clustering
• Type of proximity or density measure:
  – Central to clustering.
  – Depends on data and application.
• Data characteristics that affect proximity and/or density:
  – Dimensionality (sparseness).
  – Attribute type.
  – Special relationships in the data (for example, autocorrelation).
  – Distribution of the data.
• Noise and outliers:
  – Often interfere with the operation of the clustering algorithm.
BASIC CLUSTERING ALGORITHMS
1. K-means
2. Hierarchical clustering
3. Density-based clustering

K-means clustering
• Partitional clustering approach.
• The number of clusters, K, must be specified.
• Each cluster is associated with a centroid (center point/object).
• Each point is assigned to the cluster with the closest centroid.
• The basic algorithm is very simple.
Example of K-means clustering
(Figure: iterations 1–6 of K-means on a two-dimensional dataset; the centroids and cluster assignments are updated at each iteration until they stabilize.)
K-means clustering – Details
• Initial centroids are often chosen randomly.
  – Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• "Closeness" is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
  – Often the stopping condition is changed to "until relatively few points change clusters".
• Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
K-means clustering – Objective
Let o_1, …, o_n be the set of objects to be clustered, k the number of desired clusters, p the clustering indicator vector such that p_i is the cluster number that the i-th object belongs to, and c_i the centroid of the i-th cluster.

In the case of Euclidean distance, the K-means clustering algorithm solves the following optimization problem:

  minimize_p  f(p) = Σ_{i=1}^{n} ‖o_i − c_{p_i}‖₂²

Function f(·) is the objective or clustering criterion function of K-means.
K-means clustering – Objective
Let o_1, …, o_n be the set of objects to be clustered, k the number of desired clusters, p the clustering indicator vector such that p_i is the cluster number that the i-th object belongs to, and r_i a vector associated with the i-th cluster.

In the case of Euclidean distance, the K-means clustering algorithm solves the following optimization problem:

  minimize_{p, r_1, …, r_k}  g(p, r_1, …, r_k) = Σ_{i=1}^{n} ‖o_i − r_{p_i}‖₂²

Note that p and r_1, …, r_k are the variables of the optimization problem that need to be estimated such that the value of g(·) is minimized.
K-means clustering – Objective
The solution to

  minimize_p  f(p) = Σ_{i=1}^{n} ‖o_i − c_{p_i}‖₂²

is the same as the solution to

  minimize_{p, r_1, …, r_k}  g(p, r_1, …, r_k) = Σ_{i=1}^{n} ‖o_i − r_{p_i}‖₂²,

and ∀i, r_i = c_i.
The r_i vectors can be thought of as representatives of the objects that are assigned to the i-th cluster. The r_i vectors represent a compressed view of the data.
K-means clustering – Objective

  minimize_p  Σ_{i=1}^{n} ‖o_i − c_{p_i}‖₂²        minimize_{p, r_1, …, r_k}  Σ_{i=1}^{n} ‖o_i − r_{p_i}‖₂²

These are non-convex optimization problems.
• The K-means clustering algorithm is a way of solving the optimization problem.
• It uses an iterative, alternating least-squares optimization strategy:
  a. Optimize the cluster assignments p, given r_i for i = 1, …, k.
  b. Optimize r_i for i = 1, …, k, given the cluster assignments p.
• It guarantees convergence to a local minimum. However, due to the non-convexity of the problem, this may not be the global minimum.
• Run K-means multiple times with different initial centroids and return the solution that has the best objective value.
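A minimal NumPy sketch of this alternating scheme (random initialization, a fixed iteration cap, and the convergence check are assumptions of the sketch; production implementations such as scikit-learn's add smarter seeding and stopping rules):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Lloyd's iteration: alternate between the assignment (p) and centroid (r) updates.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial r_1 .. r_k
    for _ in range(n_iter):
        # Step (a): optimize the assignments p given the centroids.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        p = dists.argmin(axis=1)
        # Step (b): optimize the centroids r_i given the assignments.
        new_centroids = np.array([X[p == j].mean(axis=0) if np.any(p == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[p]) ** 2).sum()      # the objective f(p)
    return p, centroids, sse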
Two different K-means clusterings
(Figure: for the same original points, one K-means run finds the optimal clustering while another converges to a sub-optimal clustering.)
Limitations of K-means
• Here "problem" means that the clustering solution you get is not the best, natural, insightful, etc.
• K-means has problems when clusters are of differing:
  – Sizes
  – Densities
  – Non-globular shapes
• K-means has problems when the data contains outliers.
Limitations of K-means: Differing sizes
(Figure: original points vs. K-means with 3 clusters.)

Limitations of K-means: Differing density
(Figure: original points vs. K-means with 3 clusters.)

Limitations of K-means: Non-globular shapes
(Figure: original points vs. K-means with 2 clusters.)

Overcoming K-means limitations
(Figure: original points vs. K-means clusters.)
One solution is to use many clusters: this finds parts of clusters, and we may need to put them back together.
Importance of choosing initial centroids
(Figure: one choice of initial centroids and the resulting K-means iterations 1–6.)
Importance of choosing initial centroids …
(Figure: a different choice of initial centroids and the resulting K-means iterations 1–5.)
Solutions to the initial centroids problem
• Multiple runs:
  – Helps, but probability is not on your side.
• Sample and use hierarchical clustering to determine initial centroids.
• Select more than k initial centroids and then select among these initial centroids:
  – Select the most widely separated.
• Generate a larger number of clusters and then perform a hierarchical clustering.
• Bisecting K-means:
  – Not as susceptible to initialization issues.
Outliers
• A principled way of dealing with outliers is to do so directly during the optimization process.
• Robust K-means algorithms, as part of the optimization process, in addition to determining the clustering solution also identify a set of outlier objects that are not clustered by the algorithm.
• The non-clustered objects are treated as a penalty component of the objective function (in supervised learning, such penalty components are often called regularizers), e.g.,

  minimize_p  Σ_{i : p_i ≠ −1} ‖o_i − c_{p_i}‖₂²  +  λ Σ_{i : p_i = −1} q(i),

where p_i = −1 marks an object left unclustered, λ is a user-specified parameter that controls the penalty associated with not clustering an object, and q(i) is a cost function associated with the i-th object. A simple q(·) = 1 is such a cost function.
K-means and the "curse of dimensionality"
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.
  – Randomly generate 500 points.
  – Compute the difference between the max and min distance between any pair of points.

Asymmetric attributes
If we met a friend in the grocery store, would we ever say the following?
"I see our purchases are very similar since we didn't buy most of the same things."
Spherical K-means clustering
Let d_1, …, d_n be the unit-length vectors of the set of objects to be clustered, k the number of desired clusters, p the clustering indicator vector such that p_i is the cluster number that the i-th object belongs to, and c_i the centroid of the i-th cluster.

The spherical K-means clustering algorithm solves the following optimization problem:

  maximize_p  Σ_{i=1}^{n} cos(d_i, c_{p_i}).
Spherical K-means & text
In high-dimensional data, clusters exist in lower-dimensional sub-spaces.
HIERARCHICAL CLUSTERING

Hierarchical clustering
• Produces a set of nested clusters organized as a hierarchical tree.
• Can be visualized as a dendrogram:
  – A tree-like diagram that records the sequences of merges or splits.
(Figure: a nested clustering of six points and the corresponding dendrogram.)
Advantages of hierarchical clustering
• Do not have to assume any particular number of clusters:
  – Any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level.
• The clusters may correspond to meaningful taxonomies:
  – Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …).
(Figure: the dendrogram of the six-point example.)
Hierarchical clustering
• Two main ways of obtaining hierarchical clusterings:
  – Agglomerative:
    • Start with the points as individual clusters.
    • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
  – Divisive:
    • Start with one, all-inclusive cluster.
    • At each step, split a cluster until each cluster contains a point (or there are k clusters).
• Traditional hierarchical algorithms use a similarity or distance matrix:
  – Merge or split one cluster at a time.
Agglomerative clustering algorithm
• The more popular hierarchical clustering technique.
• The basic algorithm is straightforward:
  1. Compute the proximity matrix.
  2. Let each data point be a cluster.
  3. Repeat:
  4.   Merge the two closest clusters.
  5.   Update the proximity matrix.
  6. Until only a single cluster remains (or k clusters remain).
• The key operation is the computation of the proximity of two clusters.
  – Different approaches to defining the distance between clusters distinguish the different algorithms.
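For practical use, SciPy already implements the agglomerative procedure; a brief usage sketch (toy data and an arbitrary choice of linkage):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.random.rand(20, 2)                  # 20 points in 2-D (toy data)
proximity = pdist(X, metric="euclidean")   # condensed proximity matrix

Z = linkage(proximity, method="average")   # 'single', 'complete', 'average', ...
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)
# dendrogram(Z) can be used to plot the merge sequence (e.g., with matplotlib).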
Starting situation
Start with clusters of individual points and a proximity matrix.
(Figure: points p1, p2, … and the corresponding proximity matrix.)
Intermediate situation
After some merging steps, we have some clusters.
(Figure: clusters C1–C5 and the proximity matrix between them.)
Intermediate situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
(Figure: clusters C1–C5 and their proximity matrix, with C2 and C5 about to be merged.)
After merging
How do we update the proximity matrix?
(Figure: clusters C1, C3, C4 and the merged cluster C2 ∪ C5; the proximity-matrix entries involving C2 ∪ C5 are marked with "?".)
Defining inter-cluster proximity
(Figure: two clusters of points and their proximity matrix; how should proximity between clusters be defined?)
Options: minimum distance, maximum distance, average distance, distance between centroids, objective-driven selection, etc.
Defining inter-cluster proximity
(Figures: the same clusters and proximity matrix, with the inter-cluster proximity defined as)
• the minimum distance,
• the maximum distance,
• the average distance, or
• the distance between centroids.
Strength of minimum distance
• Can handle non-elliptical shapes.
(Figure: original points and the six clusters found.)

Limitations of minimum distance
• Sensitive to noise and outliers.
(Figure: the original points split into two clusters and into three clusters.)

Strength of maximum distance
• Less susceptible to noise and outliers.
(Figure: original points and two clusters.)

Limitations of maximum distance
• Tends to break large clusters.
• Biased towards globular clusters.
(Figure: original points and two clusters.)

Group average
• A compromise between single and complete link.
• Strengths: less susceptible to noise and outliers.
• Limitations: biased towards globular clusters.
Hierarchical clustering: Time and space requirements
• O(N²) space, since it uses the proximity matrix (N is the number of points).
• O(N³) time in many cases:
  – There are N steps, and at each step the proximity matrix (with on the order of N² entries) must be updated and searched.
  – Complexity can be reduced to O(N² log N) time with some cleverness.
Hierarchical clustering: Problems and limitations
• Once a decision is made to combine two clusters, it cannot be undone.
• The objective function is optimized only locally.
• Different schemes have problems with one or more of the following:
  – Sensitivity to noise and outliers.
  – Difficulty handling different-sized clusters and convex shapes.
  – Breaking large clusters.
DENSITY-BASED CLUSTERING

DBSCAN
• DBSCAN is a density-based algorithm:
  – The density is the number of points within a specified radius (Eps).
  – A point is a core point if it has more than a specified number of points (MinPts) within Eps.
    • These are points that are in the interior of a cluster.
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
  – A noise point is any point that is not a core point or a border point.
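For experimentation, scikit-learn provides a DBSCAN implementation; a minimal usage sketch (the data and the Eps/MinPts values here are arbitrary):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2) * 100             # toy 2-D data
db = DBSCAN(eps=10, min_samples=4).fit(X)     # Eps = 10, MinPts = 4

labels = db.labels_                           # cluster id per point; -1 marks noise points
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True     # core points
print(set(labels))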
DBSCAN: core, border, and noise points

DBSCAN algorithm:

Algorithm DBSCAN(Data: D, Radius: Eps, Density: τ)
begin
  Determine core, border and noise points of D at level (Eps, τ);
  Create graph in which core points are connected if they are within Eps of one another;
  Determine connected components in graph;
  Assign each border point to the connected component with which it is best connected;
  return points in each connected component as a cluster;
end

A noise point is a data point that is neither a core point nor a border point. For example, with τ = 10: point A is a core point because it contains 10 data points within the radius Eps; point B contains only 6 points within a radius of Eps, but it contains the core point A, so it is a border point; point C is a noise point because it contains only 4 points within a radius of Eps and does not contain any core point.

After the core, border, and noise points have been determined, DBSCAN proceeds as follows. First, a connectivity graph is constructed over the core points, in which each node corresponds to a core point and an edge is added between a pair of core points if and only if they are within a distance of Eps of one another. All connected components of this graph are identified; these correspond to the clusters constructed on the core points. The border points are then assigned to the cluster with which they have the highest level of connectivity. The resulting groups are reported as clusters, and noise points are reported as outliers. Notably, the first step of this graph-based clustering is identical to single-linkage agglomerative clustering with a termination criterion of Eps-distance, applied only to the core points. Therefore, DBSCAN may be viewed as an enhancement of single-linkage agglomerative clustering that treats marginal (border) and noisy points specially. This special treatment can reduce the outlier-sensitive chaining behavior of single-linkage algorithms without losing the ability to create clusters of arbitrary shape: if Eps and τ are selected appropriately, a bridge of noisy data points will not be used in the agglomerative process, and DBSCAN will discover the correct clusters in spite of the noise.

Practical issues: DBSCAN is very similar to grid-based methods, except that it uses circular regions as building blocks, which generally gives a smoother contour to the discovered clusters; at more detailed levels of granularity, the two methods tend to become similar.

DBSCAN: core, border and noise points
(Figure: original points and the point types (core, border, and noise) for Eps = 10, MinPts = 4.)
DBSCAN clustering
(Figure: the clusters found by DBSCAN.)

DBSCAN clustering
(Figure: some very small groups are also reported as clusters; they are usually eliminated by imposing a minimum cluster-size threshold.)

DBSCAN clustering
(Figure: original points and the DBSCAN clusters.)
• Resistant to (some) noise.
• Can handle clusters of different shapes and sizes.
DBSCAN: How much noise?

When DBSCAN does not work well
(Figure: the original points clustered with (MinPts=4, Eps=9.75) and with (MinPts=4, Eps=9.92).)
• Varying densities.
• High-dimensional data.
DBSCAN: Determining Eps and MinPts
• The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance.
• Noise points have their kth nearest neighbor at a farther distance.
• So, plot the sorted distance of every point to its kth nearest neighbor.
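A small sketch of that k-distance plot using scikit-learn's nearest-neighbor search (k = 4 and the data are arbitrary choices); the knee of the sorted curve is a common heuristic for Eps:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 2)
k = 4                                         # MinPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)                   # dists[:, 0] is the point itself (distance 0)
kth_dist = np.sort(dists[:, k])               # distance of every point to its kth neighbor

plt.plot(kth_dist)
plt.xlabel("points sorted by k-dist")
plt.ylabel(f"distance to {k}th nearest neighbor")
plt.show()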
CLUSTER VALIDITY

Different aspects of cluster validation
• Determining the clustering tendency of a set of data:
  – Is there a non-random structure in the data?
• Comparing the results of a cluster analysis to externally known results:
  – Do the clusters contain objects of mostly a single class label?
• Evaluating how well the results of a cluster analysis fit the data without reference to external information:
  – Look at various intra- and inter-cluster data-derived properties.
• Comparing the results of two different sets of cluster analyses to determine which is better.
• The evaluation can be done for the entire clustering solution or just for selected clusters.
Measures of cluster validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
– Internal index (II): used to measure the goodness of a clustering structure without respect to external information.
  • Sum of Squared Error (SSE) (or any other of the objective functions that we discussed).
– External index (EI): used to measure the extent to which cluster labels match externally supplied class labels.
  • Entropy, purity, F-score, etc.
– Relative index (RI): used to compare two different clusterings or clusters.
  • Often an external or internal index is used for this function, e.g., SSE or entropy.
II: Measuring cluster validity via correlation
• Two matrices:
  – The proximity (distance) matrix of the data (e.g., pairwise cosine similarity or Euclidean distance).
  – The ideal proximity matrix implied by the clustering solution:
    • One row and one column for each data point.
    • An entry is 1 if the associated pair of points belongs to the same cluster.
    • An entry is 0 if the associated pair of points belongs to different clusters.
• Compute the correlation between the two matrices:
  – i.e., the correlation between the vectorized matrices.
  – (Make sure that the ordering of the data points is the same in both matrices.)
• High (low) correlation indicates that points that belong to the same cluster are close to each other.
• Not a good measure for some density- or contiguity-based clusters.
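A minimal NumPy/SciPy sketch of this check (the labels and data are placeholders); note that with a distance matrix the "good" direction is a strongly negative correlation, as in the example values on the next slide:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def validity_correlation(X, labels):
    # correlation between the data's distance matrix and the ideal cluster-incidence matrix
    D = squareform(pdist(X))                                       # pairwise distances
    ideal = (labels[:, None] == labels[None, :]).astype(float)     # 1 if same cluster
    iu = np.triu_indices_from(D, k=1)                              # vectorize upper triangle
    return np.corrcoef(D[iu], ideal[iu])[0, 1]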
II: Measuring cluster validity via correlation
Correlation of the ideal similarity and proximity matrices for the K-means clusterings of two datasets:
(Figure: the two 2-D datasets; Corr = -0.9235 for the first and Corr = -0.5810 for the second.)
II: Using similarity matrix for cluster validation
Order the similarity matrix with respect to cluster labels and inspect visually.
(Figure: a dataset of points and its similarity matrix with rows/columns ordered by cluster label.)
Clusters found in random data
(Figure: a set of random points and the clusterings produced on it by DBSCAN, K-means, and complete link.)
II: Using similarity matrix for cluster validation
Clusters in random data are not so crisp.
(Figure: the DBSCAN clustering of the random points and its reordered similarity matrix.)
II: Using similarity matrix for cluster validation
Clusters in random data are not so crisp.
(Figure: the K-means clustering of the random points and its reordered similarity matrix.)
II: Using similarity matrix for cluster validation
Clusters in random data are not so crisp.
(Figure: the complete-link clustering of the random points and its reordered similarity matrix.)
II: Using similarity matrix for cluster validation
(Figure: a DBSCAN clustering with seven labeled clusters and the corresponding reordered similarity matrix.)
II: Framework for cluster validity
• We need a framework to interpret any measure:
  – For example, if our measure of evaluation has a value of 10, is that good, fair, or poor?
• Statistics provide a framework for cluster validity:
  – The more "atypical" a clustering result is, the more likely it represents valid structure in the data.
  – We can compare the values of an index that result from random data or clusterings to those of a clustering result.
    • If the value of the index is unlikely, then the cluster results are valid.
  – These approaches are more complicated and harder to understand.
• For comparing the results of two different sets of cluster analyses, a framework is less necessary.
  – However, there is the question of whether the difference between two index values is significant.
II: Statistical framework for SSE
Example:
– Compare an SSE of 0.005 against three clusters in random data.
– The histogram shows the SSE of three clusters in 500 sets of random data points of size 100, distributed over the range 0.2–0.8 for the x and y values.
(Figure: the random points and the histogram of SSE values, which range roughly from 0.016 to 0.034, far above 0.005.)
II: Statistical framework for correlation
Correlation of the ideal similarity and proximity matrices for the K-means clusterings of the two datasets:
(Figure: Corr = -0.9235 for the first dataset and Corr = -0.5810 for the second.)
Final comment on cluster validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Classification (Supervised learning)

BASIC CONCEPTS

Classification: Definition
• We are given a collection of records (training set):
  – Each record is characterized by a tuple (x, y), where x is a set of attributes and y is the class label.
    • x: set of attributes, predictors, independent variables, inputs.
    • y: class, response, dependent variable, or output.
• Task:
  – Learn a model that maps each set of attributes x into one of the predefined class labels y.
Examples of classification tasks

Task                          Attribute set, x                                          Class label, y
Categorizing email messages   Features extracted from email message header and content  spam or non-spam
Identifying tumor cells       Features extracted from MRI scans                         malignant or benign cells
Cataloging galaxies           Features extracted from telescope images                  elliptical, spiral, or irregular-shaped galaxies
Building and using a classification model
(Figure: induction, in which a learning algorithm learns a model from the training set, and deduction, in which the model is applied to the test set.)

Training set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Classification techniques
• Base classifiers:
  – Decision tree-based methods.
  – Rule-based methods.
  – Nearest-neighbor.
  – Neural networks.
  – Naïve Bayes and Bayesian belief networks.
  – Support vector machines.
  – … and others.
• Ensemble classifiers:
  – Boosting, bagging, random forests, etc.
DECISION TREES
We will use this method to illustrate various concepts and issues associated with the classification task.
Example of a decision tree

Training data:
ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model: decision tree (splitting attributes: Home Owner at the root, then MarSt, then Income)
Home Owner = Yes → NO
Home Owner = No → MarSt:
    MarSt = Married → NO
    MarSt = Single, Divorced → Income:
        Income < 80K → NO
        Income > 80K → YES
Example of a decision tree
MarSt = Married → NO
MarSt = Single, Divorced → Home Owner:
    Home Owner = Yes → NO
    Home Owner = No → Income:
        Income < 80K → NO
        Income > 80K → YES

There could be more than one tree that fits the same data!
(Same training data as above.)
Decision tree classification task
(Figure: the same induction/deduction workflow, with a tree induction algorithm learning a decision tree from the training set; the tree is then applied to the test set.)
Apply model to test data
Start from the root of the tree.

Test data:
Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Routing the record down the tree: Home Owner = No → MarSt = Married → leaf NO, so Defaulted is assigned "No".
Decision tree classification task
(Figure: the induction/deduction workflow again: the tree induction algorithm learns a decision tree from the training set, and the tree is applied to the test set.)
Building the decision tree — Tree induction
• Let D_t be the set of training records that reach a node t.
• General procedure:
  – If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
  – If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
    • Recursively apply the procedure to each subset.
(Training data: the ten loan records above; D_t is the set of records reaching the current node, and "?" marks the split still to be chosen.)
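A toy sketch of this recursive procedure in Python (the choose_split function is a placeholder for the attribute-test selection discussed in the following slides; everything here is illustrative):

from collections import Counter

def build_tree(records, labels, choose_split, depth=0, max_depth=3):
    # Hunt-style recursion: stop when the node is pure (or a depth cap is reached).
    if len(set(labels)) == 1 or depth == max_depth:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    attr, test = choose_split(records, labels)    # e.g. ("Annual Income", lambda v: v < 80)
    left = [i for i, r in enumerate(records) if test(r[attr])]
    right = [i for i in range(len(records)) if i not in left]
    if not left or not right:                     # degenerate split: make a leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    return {"attr": attr, "test": test,
            "left": build_tree([records[i] for i in left], [labels[i] for i in left],
                               choose_split, depth + 1, max_depth),
            "right": build_tree([records[i] for i in right], [labels[i] for i in right],
                                choose_split, depth + 1, max_depth)}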
Hunt's algorithm
(Figure: the tree is grown in four steps on the loan data, with the class counts (No, Yes) at each node:
 (a) a single leaf: Defaulted = No (7, 3);
 (b) split on Home Owner: Yes → Defaulted = No (3, 0); No → Defaulted = No (4, 3);
 (c) the Home Owner = No branch is split on Marital Status: Married → Defaulted = No (3, 0); Single, Divorced → Defaulted = Yes (1, 3);
 (d) the Single, Divorced branch is split on Annual Income: < 80K → Defaulted = No (1, 0); >= 80K → Defaulted = Yes (0, 3).)

Building the decision tree: Example
(The ten training records with Home Owner, Marital Status, Annual Income, and Defaulted Borrower shown above are used throughout this example.)
Design issues of decision tree induction
• How should the training records be split?
  – Method for specifying the test condition.
    • This depends on the attribute types.
  – Method for selecting which attribute and split condition to choose.
    • Need a measure for evaluating the goodness of a test condition.
• When should the splitting procedure stop?
  – Stop splitting if all the records belong to the same class or have identical attribute values.
  – Early termination.
Methods for expressing test conditions
• Depends on attribute types:
  – Binary
  – Nominal
  – Ordinal
  – Continuous
• Depends on number of ways to split:
  – 2-way split
  – Multi-way split
Test condition for nominal attributes
• Multi-way split: use as many partitions as there are distinct values.
  – E.g., Marital Status → Single | Divorced | Married.
• Binary split: divide the values into two subsets.
  – E.g., {Single} vs. {Married, Divorced}, or {Married} vs. {Single, Divorced}, or {Single, Married} vs. {Divorced}.
Test condition for ordinal attributes
• Multi-way split: use as many partitions as there are distinct values.
  – E.g., Shirt Size → Small | Medium | Large | Extra Large.
• Binary split: divide the values into two subsets, preserving the order property among attribute values.
  – E.g., {Small} vs. {Medium, Large, Extra Large}, or {Small, Medium} vs. {Large, Extra Large}.
  – The grouping {Small, Large} vs. {Medium, Extra Large} violates the order property.
Test condition for continuous attributes
(i) Binary split: e.g., Annual Income > 80K? (Yes / No).
(ii) Multi-way split: e.g., Annual Income ∈ {< 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K}.
How to determine the best split?
Before splitting: 10 records of class 0 and 10 records of class 1.
(Figure: three candidate test conditions:
 Gender: two children with class distributions {C0: 6, C1: 4} and {C0: 4, C1: 6};
 Car Type (Family, Sports, Luxury): children with {C0: 1, C1: 3}, {C0: 8, C1: 0}, and {C0: 1, C1: 7};
 Customer ID (c1, …, c20): one child per customer, each containing a single record.)
Which test condition is the best?
How to determine the best split?
• Greedy approach:
  – Nodes with purer class distribution are preferred.
• Need a measure of node purity/impurity:
  – {C0: 5, C1: 5}: high degree of impurity; {C0: 9, C1: 1}: low degree of impurity.
Measures of node impurity
• Gini index:
    GINI(t) = 1 − Σ_j [ p(j | t) ]²
• Entropy:
    Entropy(t) = − Σ_j p(j | t) log p(j | t)
• Misclassification error:
    Error(t) = 1 − max_i P(i | t)
(Figure: the three measures plotted for a two-class problem as a function of the class distribution.)
Finding the best split
1. Compute the impurity measure (P) before splitting.
2. Compute the impurity measure (M) after splitting:
   • Compute the impurity measure of each child node.
   • M is the size-weighted impurity of the children.
3. Choose the attribute test condition that produces the highest gain:
   Gain = P − M,
   or, equivalently, the lowest impurity measure after splitting (M).
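For concreteness, a small Python sketch of the three impurity measures and of the gain computation (the example split values are arbitrary):

from math import log2

def gini(class_counts):
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def entropy(class_counts):
    n = sum(class_counts)
    return -sum((c / n) * log2(c / n) for c in class_counts if c > 0)

def error(class_counts):
    n = sum(class_counts)
    return 1.0 - max(class_counts) / n

def weighted_impurity(children, impurity=gini):
    # size-weighted impurity M of the children produced by a split
    total = sum(sum(ch) for ch in children)
    return sum(sum(ch) / total * impurity(ch) for ch in children)

parent = [10, 10]                      # P: impurity before splitting
children = [[6, 4], [4, 6]]            # a candidate split
gain = gini(parent) - weighted_impurity(children)
print(gini(parent), weighted_impurity(children), gain)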
Decision tree based classification
• Advantages:
  – Inexpensive to construct.
  – Extremely fast at classifying unknown records.
  – Easy to interpret for small-sized trees.
  – Robust to noise (especially when methods to avoid overfitting are employed).
  – Can easily handle redundant or irrelevant attributes (unless the attributes are interacting).
• Disadvantages:
  – The space of possible decision trees is exponentially large; greedy approaches are often unable to find the best tree.
  – Does not take into account interactions between attributes.
  – Each decision boundary involves only a single attribute.
OVERFITTING

Classification errors
• Training errors (apparent errors): errors committed on the training set.
• Test errors: errors committed on the test set.
• Generalization errors: the expected error of a model on a randomly selected subset of records from the same distribution.
Example dataset
Two-class problem:
• "+": 5400 instances
  – 5000 instances generated from a Gaussian centered at (10, 10).
  – 400 noisy instances added.
• "o": 5400 instances
  – Generated from a uniform distribution.
10% of the data is used for training and 90% of the data is used for testing.
Increasing the number of nodes in the decision tree
(Figure: decision boundaries on the training data for a decision tree with 4 nodes and a decision tree with 50 nodes.)
Which tree is better?
Model overfitting
• Underfitting: when the model is too simple, both training and test errors are large.
• Overfitting: when the model is too complex, the training error is small but the test error is large.
(Figure: training and test error as a function of the number of nodes in the tree.)
Using twice the number of data instances
• If the training data is under-representative, test errors increase and training errors decrease as the number of nodes increases.
• Increasing the size of the training data reduces the difference between training and test errors at a given number of nodes.
Reasons for model overfitting
• Presence of noise.
• Lack of representative samples.
• Multiple comparison procedure.
Effect of multiple comparison procedure
• Consider the task of predicting whether the stock market will rise or fall in each of the next 10 trading days.
• Random guessing: P(correct) = 0.5.
• Make 10 random guesses in a row; the probability of getting at least 8 correct is

  P(# correct ≥ 8) = [ C(10, 8) + C(10, 9) + C(10, 10) ] / 2^10 = 0.0547.

(Example guess sequence: Day 1 Up, Day 2 Down, Day 3 Down, Day 4 Up, Day 5 Down, Day 6 Down, Day 7 Up, Day 8 Up, Day 9 Up, Day 10 Down.)
Effect of multiple comparison procedure
• Approach:
– Get 50 analysts.
– Each analyst makes 10 random guesses.
– Choose the analyst that makes the largest number of correct predictions.
• Probability that at least one analyst makes at least 8 correct predictions:
  P(# correct ≥ 8) = 1 − (1 − 0.0547)⁵⁰ = 0.9399
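A quick check of both probabilities (Python; the code itself is not part of the slides):

```python
# Binomial-tail check of the multiple-comparison example.
from math import comb

# P(# correct >= 8) for 10 random guesses
p_single = sum(comb(10, k) for k in (8, 9, 10)) / 2 ** 10
print(round(p_single, 4))                    # 0.0547

# P(at least one of 50 analysts gets >= 8 correct)
print(round(1 - (1 - p_single) ** 50, 4))    # 0.9399
```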
Effect of multiple comparison procedure
• Many algorithms employ the following greedy strategy:
– Initial model: M.
– Alternative model: M' = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree).
– Keep M' if the improvement Δ(M, M') > α.
• Often, γ is chosen as the best of a set of alternative components, Γ = {γ1, γ2, …, γk}.
• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting.
Effect of multiple comparisons: Example
[Figures compare two settings]
– Using only X and Y as attributes.
– Using 100 additional noisy variables generated from a uniform distribution along with X and Y as attributes.
Use 30% of the data for training and 70% of the data for testing.
Notes on overfitting
• Overfitting results in decision trees that are more complex than necessary.
• Training error does not provide a good estimate of how well the tree will perform on previously unseen records.
• We need ways of estimating generalization errors.
Handling overfitting in decision trees
Pre-pruning (early stopping rule):
– Stop the algorithm before it becomes a fully-grown tree.
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class.
• Stop if all the attribute values are the same.
– More restrictive conditions:
• Stop if the number of instances is less than some user-specified threshold.
• Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test).
• Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
• Stop if the estimated generalization error falls below a certain threshold.
Handling overfitting in decision trees
Post-pruning:
– Grow the decision tree to its entirety.
– Subtree replacement:
• Trim the nodes of the decision tree in a bottom-up fashion.
• If the generalization error improves after trimming, replace the sub-tree with a leaf node.
• The class label of the leaf node is determined from the majority class of instances in the sub-tree.
– Subtree raising:
• Replace a subtree with its most frequently used branch.
Examples of post-pruning
Decision Tree:
depth = 1 :
|   breadth > 7 : class 1
|   breadth <= 7 :
|   |   breadth <= 3 :
|   |   |   ImagePages > 0.375 : class 0
|   |   |   ImagePages <= 0.375 :
|   |   |   |   totalPages <= 6 : class 1
|   |   |   |   totalPages > 6 :
|   |   |   |   |   breadth <= 1 : class 1
|   |   |   |   |   breadth > 1 : class 0
|   |   width > 3 :
|   |   |   MultiIP = 0 :
|   |   |   |   ImagePages <= 0.1333 : class 1
|   |   |   |   ImagePages > 0.1333 :
|   |   |   |   |   breadth <= 6 : class 0
|   |   |   |   |   breadth > 6 : class 1
|   |   |   MultiIP = 1 :
|   |   |   |   TotalTime <= 361 : class 0
|   |   |   |   TotalTime > 361 : class 1
depth > 1 :
|   MultiAgent = 0 :
|   |   depth > 2 : class 0
|   |   depth <= 2 :
|   |   |   MultiIP = 1 : class 0
|   |   |   MultiIP = 0 :
|   |   |   |   breadth <= 6 : class 0
|   |   |   |   breadth > 6 :
|   |   |   |   |   RepeatedAccess <= 0.0322 : class 0
|   |   |   |   |   RepeatedAccess > 0.0322 : class 1
|   MultiAgent = 1 :
|   |   totalPages <= 81 : class 0
|   |   totalPages > 81 : class 1

Simplified Decision Tree:
depth = 1 :
|   ImagePages <= 0.1333 : class 1
|   ImagePages > 0.1333 :
|   |   breadth <= 6 : class 0
|   |   breadth > 6 : class 1
depth > 1 :
|   MultiAgent = 0 : class 0
|   MultiAgent = 1 :
|   |   totalPages <= 81 : class 0
|   |   totalPages > 81 : class 1
[Figures: subtree raising and subtree replacement]
ENSEMBLE METHODS
Ensemble methods
• Construct a set of classifiers from the training data.
• Predict the class label of test records by combining the predictions made by multiple classifiers.
Why do ensemble methods work?
Suppose there are 25 base classifiers:
– Each classifier has error rate ε = 0.35.
– Assume the errors made by the classifiers are uncorrelated.
– Probability that the ensemble classifier makes a wrong prediction (a majority of the 25 are wrong):
  P(X ≥ 13) = Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^(25−i) = 0.06
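A quick check of this figure (Python; the code is not part of the slides):

```python
# Error of a majority vote over 25 independent base classifiers, each with
# error rate 0.35: probability that 13 or more of them are wrong.
from math import comb

eps = 0.35
p_ensemble_wrong = sum(
    comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26)
)
print(round(p_ensemble_wrong, 3))   # ~0.06
```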
General approach
[Figure: from the original training data D, Step 1 creates multiple data sets D1, D2, …, Dt; Step 2 builds multiple classifiers C1, C2, …, Ct; Step 3 combines the classifiers into C*]
Types of ensemble methods
• Manipulate the data distribution.
– Resampling methods.
• Bagging and boosting.
• Manipulate the input features.
– Feature subset selection.
• Random forest: randomly select feature subsets and build decision trees.
• Manipulate the class labels.
– Randomly partition the classes into two subsets, treat them as +ve and −ve, and learn a binary classifier. Repeat this many times. At classification time, use all binary classifiers and give credit to the constituent classes.
• Use different models.
– E.g., different ANN topologies.
Bagging
• Sampling with replacement.
• Build a classifier on each bootstrap sample.
• Use a majority-voting prediction approach:
– Predict an unlabeled instance using all classifiers and return the most frequently predicted class as the prediction.

Original Data      1  2  3  4  5  6  7  8  9  10
Bagging (Round 1)  7  8 10  8  2  5 10 10  5   9
Bagging (Round 2)  1  4  9  1  2  3  2  7  3   2
Bagging (Round 3)  1  8  5 10  5  5  9  6  3   7
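A minimal bagging sketch is shown below. It assumes scikit-learn's DecisionTreeClassifier as the base learner and NumPy arrays for X and y; these are illustrative choices, and any base classifier with fit/predict would do.

```python
# Bagging sketch: bootstrap samples + majority vote over the resulting models.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    models, n = [], len(y)
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)            # sample n records with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])        # (rounds, samples)
    preds = []
    for col in votes.T:                                      # one column per instance
        vals, cnts = np.unique(col, return_counts=True)
        preds.append(vals[cnts.argmax()])                    # most frequent class
    return np.array(preds)
```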
Boosting
• An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records.
– Initially, all N records are assigned equal weights.
– Unlike bagging, the weights may change at the end of each boosting round.
• The weights can be used to create a weighted loss function or to bias the selection of the sample.

Boosting
• Records that are wrongly classified will have their weights increased.
• Records that are classified correctly will have their weights decreased.

Original Data       1  2  3  4  5  6  7  8  9  10
Boosting (Round 1)  7  3  2  8  7  9  4 10  6   3
Boosting (Round 2)  5  4  9  4  2  5  1  7  4   2
Boosting (Round 3)  4  4  8 10  4  5  4  6  3   4

Example 4 is hard to classify.
Its weight is increased, so it is more likely to be chosen again in subsequent rounds.
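The sketch below illustrates the reweighting idea only; the multiplicative factor is an arbitrary illustrative choice, not the exact update rule of any particular boosting algorithm (AdaBoost, for instance, derives its factors from each round's error rate).

```python
# Boosting idea in miniature: increase weights of misclassified records,
# decrease weights of correct ones, then sample the next round by weight.
import numpy as np

def update_weights(weights, y_true, y_pred, factor=2.0):
    weights = weights.copy()
    wrong = (y_true != y_pred)
    weights[wrong] *= factor           # misclassified records get heavier
    weights[~wrong] /= factor          # correctly classified records get lighter
    return weights / weights.sum()     # re-normalize to a distribution

def sample_round(weights, rng):
    n = len(weights)
    return rng.choice(n, size=n, replace=True, p=weights)   # weighted bootstrap
```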
ARTIFICIAL NEURAL NETWORKS
Consider the following
X1  X2  X3   Y
 1   0   0  -1
 1   0   1   1
 1   1   0   1
 1   1   1   1
 0   0   1  -1
 0   1   0  -1
 0   1   1   1
 0   0   0  -1
[Figure: a black box with inputs X1, X2, X3 and output Y]
Output Y is 1 if at least two of the three inputs are equal to 1.
Consider the following
[Figure: the same truth table; the black box is now an output node that sums the inputs with weights 0.3, 0.3, 0.3 and compares the sum against the threshold t = 0.4]
Y = sign(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4)
where sign(x) = +1 if x ≥ 0, and −1 if x < 0.
Perceptron
• The model is an assembly of inter-connected nodes and weighted links.
• The output node sums up each of its input values according to the weights of its links.
• The output node's sum is compared against some threshold t.
[Figure: perceptron model with input nodes X1, X2, X3, weights w1, w2, w3, and an output node with threshold t]
Y = sign(Σ_{i=1}^{d} wi Xi − t) = sign(Σ_{i=0}^{d} wi Xi),
where the threshold is folded into the sum by setting X0 = 1 and w0 = −t.
Perceptron
• Single-layer network:
– Contains only input and output nodes.
• Activation function: f(w, x) = sign(⟨w, x⟩)
• Applying the model is straightforward. For Y = sign(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4):
– X1 = 1, X2 = 0, X3 = 1  =>  y = sign(0.2) = 1
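A quick check (Python/NumPy, an assumption of these notes) that the weights (0.3, 0.3, 0.3) with threshold t = 0.4 reproduce the Y column of the truth table above:

```python
# Apply the perceptron Y = sign(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4) to the table.
import numpy as np

X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],
              [0,0,1],[0,1,0],[0,1,1],[0,0,0]])
w, t = np.array([0.3, 0.3, 0.3]), 0.4

y = np.where(X @ w - t >= 0, 1, -1)
print(y)   # [-1  1  1  1 -1 -1  1 -1]: Y = 1 iff at least two inputs are 1
```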
Perceptron learning rule
• Initialize the weights (w0, w1, …, wd).
• Repeat:
– For each training example (xi, yi):
• Compute f(w, xi).
• Update the weights: w^(k+1) = w^(k) + λ [yi − f(w^(k), xi)] xi
• Until the stopping condition is met.
• The above is an example of a stochastic gradient descent optimization method.
Perceptron learning rule
• Weight update formula: w^(k+1) = w^(k) + λ [yi − f(w^(k), xi)] xi, where λ is the learning rate.
• Intuition: update the weight based on the error e = yi − f(w^(k), xi):
– If y = f(x, w), e = 0: no update is needed.
– If y > f(x, w), e = 2: the weight must be increased so that f(x, w) will increase.
– If y < f(x, w), e = −2: the weight must be decreased so that f(x, w) will decrease.
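A minimal sketch of this rule in Python/NumPy follows; the learning rate, epoch limit, and stopping test are illustrative choices, not prescribed by the slides.

```python
# Perceptron learning rule: w(k+1) = w(k) + lam * (yi - f(w, xi)) * xi
import numpy as np

def train_perceptron(X, y, lam=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    X1 = np.hstack([np.ones((len(X), 1)), X])    # prepend bias input x0 = 1
    w = rng.normal(scale=0.01, size=X1.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X1, y):
            f = 1 if xi @ w >= 0 else -1         # f(w, xi) = sign(<w, xi>)
            w = w + lam * (yi - f) * xi          # update only when yi != f
        if np.all(np.where(X1 @ w >= 0, 1, -1) == y):
            break                                # all training examples correct
    return w

# The "at least two of three inputs" data above is linearly separable,
# so the rule converges to a perfect classifier:
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],[0,0,1],[0,1,0],[0,1,1],[0,0,0]])
y = np.array([-1, 1, 1, 1, -1, -1, 1, -1])
w = train_perceptron(X, y)
print(np.where(np.hstack([np.ones((8, 1)), X]) @ w >= 0, 1, -1))   # matches y
```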
Perceptron learning rule
• Since f(w, x) is a linear combination of the input variables, the decision boundary is linear.
• For nonlinearly separable problems, the perceptron learning algorithm will fail because no linear hyperplane can separate the data perfectly.
Nonlinearly separable data
XOR data: y = x1 ⊕ x2
x1  x2   y
 0   0  -1
 1   0   1
 0   1   1
 1   1  -1
Multilayer artificial neural networks (ANN)
[Figure: neuron i receives inputs I1, I2, I3 with weights wi1, wi2, wi3, forms the weighted sum Si, and applies an activation function g(Si) with threshold t to produce the output Oi]
[Figure: a multilayer network with an input layer (x1, …, x5), a hidden layer, and an output layer producing y]
Training an ANN means learning the weights of the neurons.
Artificial neural networks
• Various types of neural network topologies:
– Single-layered network (perceptron) versus multi-layered network.
– Feed-forward versus recurrent network.
• Various types of activation functions f:
  Y = f(Σ_i wi Xi)
Artificial neural networks
A multi-layer neural network can solve any type of classification task involving nonlinear decision surfaces.
[Figure: a 2-2-1 network for the XOR data, with input nodes n1, n2 (x1, x2), hidden nodes n3, n4 (weights w31, w32, w41, w42), and output node n5 (weights w53, w54) producing y]
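As a concrete illustration of why one hidden layer suffices for XOR, the sketch below uses hand-picked weights (not learned, and not the weights from the figure) and 0/1 outputs rather than the ±1 labels of the table: one hidden unit computes OR, the other AND, and the output unit fires only when OR is true and AND is false.

```python
# A 2-2-1 threshold network that computes XOR with hand-picked weights.
def step(s):                                       # threshold activation
    return 1 if s >= 0 else 0

def xor_net(x1, x2):
    h_or  = step(1.0 * x1 + 1.0 * x2 - 0.5)        # hidden node: x1 OR x2
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)        # hidden node: x1 AND x2
    return step(1.0 * h_or - 1.0 * h_and - 0.5)    # OR and not AND == XOR

for x in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x, xor_net(*x))                          # 0, 1, 1, 0
```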
Design issues of ANN
• Number of nodes in the input layer:
– One input node per binary/continuous attribute.
– k or ⌈log2 k⌉ nodes for each categorical attribute with k values.
• Number of nodes in the output layer:
– One output node for a binary class problem.
– k or ⌈log2 k⌉ nodes for a k-class problem.
• Number of nodes in the hidden layer.
• Initial weights and biases.
Characteristics of ANN
• Multilayer ANNs are universal function approximators but could suffer from overfitting if the network is too large.
• Gradient descent may converge to a local minimum.
• Model building can be very time consuming, but applying the model can be very fast.
• Can handle redundant attributes because the weights are automatically learnt.
• Sensitive to noise in the training data.
• Difficult to handle missing attributes.
Recent noteworthy developments in ANN
• Use in deep learning and unsupervised feature learning.
– Seek to automatically learn a good representation of the input from unlabeled data.
• Google Brain project:
– Learned the concept of a 'cat' by looking at unlabeled pictures from YouTube.
– One-billion-connection network.
Purpose-built neural networks
• Convolutional neural networks:
– Deep networks that are designed to extract successively more complicated features from 1D, 2D, and 3D signals (i.e., audio, images, video).

Purpose-built neural networks
• Networks that are specifically designed to model arbitrary-length sequences and non-local dependencies:
– Recurrent neural networks.
– Bi-directional recurrent neural networks.
– Long short-term memory.
• Good for language modeling and various biological applications.
SUPPORT VECTOR MACHINES
Separating hyperplanes
Find a linear hyperplane (decision boundary) that separates the data.
[Figures: one possible solution B1, another possible solution B2, and other possible solutions]
Separating hyperplanes
• Which one is better, B1 or B2?
• How do we define "better"?
Support Vector Machines (SVM)
Find the hyperplane that maximizes the margin: B1 is better than B2.
[Figure: B1 with margin hyperplanes b11 and b12, and B2 with margin hyperplanes b21 and b22; B1 has the wider margin]
Support vector machines
[Figure: hyperplane B1, wxᵀ + b = 0, with margin hyperplanes b11 and b12 and normal vector w]
The vector w is normal to the separating hyperplane. Let x and y be two points on the hyperplane. Then
  wxᵀ + b = 0  and  wyᵀ + b = 0,
and therefore
  w(x − y)ᵀ = 0,
which indicates that w is orthogonal to the vector x − y, which lies on the hyperplane.
Classification is performed as follows:
  f(x) = +1 if wxᵀ + b ≥ 0, and −1 if wxᵀ + b < 0.
Model estimation
• The goal is to find the parameters w and b (i.e., the model's parameters) such that the hyperplane separates the classes and maximizes the margin.
• We know how to measure classification accuracy, but how do we measure the margin?
• Let (w, b) be the parameters of a hyperplane that is in the "middle" between the two classes. We can scale (w, b) so that
  f(x) = +1 if wxᵀ + b ≥ +1, and −1 if wxᵀ + b ≤ −1.
• Let x and y be two points such that
  wxᵀ + b = +1  and  wyᵀ + b = −1,
that is, these points are the positive and negative instances closest to the hyperplane, respectively. Then
  w(x − y)ᵀ = 2
  ||w|| ||x − y|| cos(w, x − y) = 2
  ||w|| (margin) = 2
  margin = 2 / ||w||,
since ||x − y|| cos(w, x − y) is the length of the projection of x − y onto w, which is exactly the margin.
Support Vector Machines
[Figure: hyperplane B1 (wxᵀ + b = 0) with margin hyperplanes b11 (wxᵀ + b = +1) and b12, and normal vector w]
Model estimation
• The optimization problem is formulated as follows:
  maximize_{w,b}  2 / ||w||
  subject to  wxiᵀ + b ≥ +1  if xi is +ve
              wxiᵀ + b ≤ −1  if xi is -ve
• If yi is +1 or −1 when xi is +ve or -ve, respectively, then the above can be concisely written in a standard minimization form:
  minimize_{w,b}  ||w||² / 2
  subject to  yi (wxiᵀ + b) ≥ 1  for all xi
• This is a constrained quadratic optimization problem, which is convex and can be solved efficiently using Lagrange multipliers by minimizing the following function:
  Lp = ||w||² / 2 − Σ_i λi (yi (wxiᵀ + b) − 1),
where the λi ≥ 0 are called Lagrange multipliers.
Model estimation
• The dual Lagrangian is used for solving this problem, which can be shown to be
  LD = Σ_i λi − (1/2) Σ_{i,j} λi λj yi yj xi xjᵀ.
Since this is the dual of the primal optimization problem, the problem now becomes a maximization problem.
• At the optimal solution of the primal/dual problem we have that
  w = Σ_i λi yi xi.
• Most of the λi's are 0, and the non-zero λi's are those that define the vector w. They correspond to the training examples that lie on the margin hyperplanes wxᵀ + b = +1 or wxᵀ + b = −1. These training examples are called the support vectors.
• A test instance z is classified as +ve or -ve based on
  f(z) = sign(wzᵀ + b) = sign(Σ_i λi yi xi zᵀ + b).
Example of linear SVM
  x1      x2       y   λ
  0.3858  0.4687   1   65.5261
  0.4871  0.6110  -1   65.5261
  0.9218  0.4103  -1    0
  0.7382  0.8936  -1    0
  0.1763  0.0579   1    0
  0.4057  0.3529   1    0
  0.9355  0.8132  -1    0
  0.2146  0.0099   1    0
The two instances with non-zero λ are the support vectors.
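A small sketch of this example using scikit-learn (an assumption of these notes; the slides do not prescribe a library). A large C approximates the hard-margin formulation, and the two points with non-zero λ in the table should come out as the support vectors.

```python
# Fit a linear SVM to the eight points above and inspect the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.3858, 0.4687], [0.4871, 0.6110], [0.9218, 0.4103],
              [0.7382, 0.8936], [0.1763, 0.0579], [0.4057, 0.3529],
              [0.9355, 0.8132], [0.2146, 0.0099]])
y = np.array([1, -1, -1, -1, 1, 1, -1, 1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard-margin behaviour
print(svm.support_)                            # indices of the support vectors
print(svm.coef_, svm.intercept_)               # w and b of the separating hyperplane
```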
Support vector machines
What if the problem is not linearly separable?
Non-separable case
• Non-linearly separable cases are handled by introducing a slack variable ξi for each training instance and solving the following optimization problem:
  minimize_{w,b,ξ}  ||w||² / 2 + c Σ_i ξi
  subject to  wxiᵀ + b ≥ +1 − ξi  if xi is +ve
              wxiᵀ + b ≤ −1 + ξi  if xi is -ve
              ξi ≥ 0
• … or by using a non-linear hyperplane.
• … or by doing both.
Nonlinear support vector machines
What if the decision boundary is not linear?
Nonlinear support vector machines
Transform the data into a higher-dimensional space.
Decision boundary: Φ(x) wᵀ + b = 0
Nonlinear SVMs
Mapping from the original space to a different space can make the classes separable.
Learning non-linear SVMs
• The dual Lagrangian
  LD = Σ_i λi − (1/2) Σ_{i,j} λi λj yi yj xi xjᵀ
now becomes
  LD = Σ_i λi − (1/2) Σ_{i,j} λi λj yi yj Φ(xi) Φ(xj)ᵀ.
• A test instance z is classified as +ve or -ve based on
  f(z) = sign(Σ_i λi yi Φ(xi) Φ(z)ᵀ + b).
• The matrix K such that K(xi, xj) = Φ(xi) Φ(xj)ᵀ is called the kernel matrix.
• Non-linear SVMs only require such a kernel matrix. One can derive interesting kernel matrices that correspond to extremely high-dimensional functions Φ while operating only on the original space. This is called the kernel trick.
Kernel trick
Examples:
[Figure: example kernel functions; the annotation notes that one of them corresponds to an infinite-dimension polynomial.]
Example of nonlinear SVM
SVM with polynomial degree 2 kernel
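A small sketch of the kernel trick in practice using scikit-learn (an assumption of these notes): a degree-2 polynomial kernel, as in the figure, applied to synthetic concentric-circle data rather than the figure's dataset.

```python
# Degree-2 polynomial kernel SVM; Phi(x) is never computed explicitly.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)
clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1).fit(X, y)  # K(x, z) = (x.z + 1)^2
print(clf.score(X, y))   # close to 1.0 on this nonlinearly separable data
```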
Learning nonlinear SVMs
• Advantages of using kernels:
– We do not have to know the mapping function Φ.
– Computing the dot product Φ(xi) · Φ(xj) in the original space avoids the curse of dimensionality.
• The kernel function can be considered as a measure of similarity between objects and used to encode key information about the classification problem.
• Not all functions can be kernels:
– We must make sure there is a corresponding Φ in some high-dimensional space.
– Mercer's theorem.
Characteristics of SVM
• The learning problem is formulated as a convex optimization problem; efficient algorithms are available to find the global minimum of the objective function.
• Overfitting is addressed by maximizing the margin of the decision boundary, but the user still needs to provide the type of kernel function and the cost function.
• Difficult to handle missing values.
• Robust to noise.
• High computational complexity for building the model.
RIDGE REGRESSION & COORDINATE DESCENT
Linear regression task
• We are given a collection of records (training set).
– Each record is characterized by a tuple (x, y), where x is a set of numerical attributes and y is a value.
• Goal:
– We want to learn a vector w such that ⟨x, w⟩ approximates y in a least-squares sense.
Linear regression and normal equations
Let X be an n × m matrix whose rows correspond to the records and whose columns correspond to the attributes. Let y be an n × 1 vector of the known target values of the records in X. The solution to the linear regression problem is the vector w such that
  minimize_w  ||Xw − y||².
The solution to the above problem is given by
  w = (XᵀX)⁻¹ Xᵀy.
However, this is not how we usually solve it.
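A quick illustration of the normal-equations solution on synthetic data (Python/NumPy, an assumption of these notes; the data is made up for the example):

```python
# Solve the normal equations (X^T X) w = X^T y on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

w = np.linalg.solve(X.T @ X, X.T @ y)   # w = (X^T X)^{-1} X^T y
print(w)                                # close to [2.0, -1.0, 0.5]
```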
Ridge regression
In order to prevent overfitting, we add a regularization penalty and estimate w as follows:
  minimize_w  ||Xw − y||² + λ||w||²,
where λ is a user-supplied parameter that controls overfitting. This type of regression is called ridge regression.
Estimating w
• There are many ways to solve the optimization problem for estimating w. Coordinate descent is probably the simplest method.
• It consists of a set of outer iterations. In each outer iteration, it performs m steps (one for each of the dimensions of w). During the i-th step, it optimizes the value of the objective function by fixing all but the wi variable. This optimization is performed by taking the partial derivative of the objective function with respect to wi, setting it to 0, and solving for wi. That value of wi is the new value for that variable. The entire process converges when the error does not decrease substantially between successive outer iterations.
• Non-negativity in the model can be enforced by setting any negative wi values to 0 during the inner iterations.
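A minimal sketch of this procedure (Python/NumPy, an assumption of these notes): setting the partial derivative of ||Xw − y||² + λ||w||² with respect to wi to zero gives the single-coordinate update wi = xiᵀri / (xiᵀxi + λ), where ri is the residual with feature i removed.

```python
# Coordinate descent for ridge regression with an optional non-negativity constraint.
import numpy as np

def ridge_coordinate_descent(X, y, lam=1.0, tol=1e-8, max_outer=1000, nonneg=False):
    n, m = X.shape
    w = np.zeros(m)
    r = y - X @ w                            # current residual y - Xw
    prev_err = np.inf
    for _ in range(max_outer):               # outer iterations
        for i in range(m):                   # one inner step per coordinate
            xi = X[:, i]
            r_i = r + xi * w[i]              # residual with feature i removed
            wi_new = (xi @ r_i) / (xi @ xi + lam)   # d/dw_i objective = 0
            if nonneg:
                wi_new = max(0.0, wi_new)    # clamp negative coordinates to 0
            r = r_i - xi * wi_new            # update residual for new w_i
            w[i] = wi_new
        err = r @ r + lam * (w @ w)
        if prev_err - err < tol:             # stop when the error stops decreasing
            break
        prev_err = err
    return w

# Usage: w = ridge_coordinate_descent(X, y, lam=0.1)
```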