Introduction to Data Mining - University of Minnesota
Page 1:

Introduction to Data Mining

1

Page 2:

Large-scale data is everywhere!

• There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies.

• New mantra:
– Gather whatever data you can whenever and wherever possible.

• Expectations:
– Gathered data will have value either for the purpose collected or for a purpose not envisioned.

Examples: computational simulations, business data, sensor networks, geo-spatial data, homeland security.

2

Page 3:

Why data mining?

Commercial viewpoint:
– Lots of data is being collected and warehoused.
  • Web data:
    – Yahoo has petabytes of web data.
    – Facebook has ~2B active users.
  • Purchases at department/grocery stores, e-commerce:
    – Amazon records 1.1B orders a year.
    – Bank/credit card transactions.
– Computers have become cheaper and more powerful.
– Competitive pressure is strong.
  • Provide better, customized services for an edge (e.g., in Customer Relationship Management).

3

Page 4:

Why data mining?

Scientific viewpoint:
– Data collected and stored at enormous speeds.
  • Remote sensors on a satellite.
    – NASA EOSDIS archives over 1 petabyte of earth science data per year.
  • Telescopes scanning the skies.
    – Sky survey data.
  • High-throughput biological data.
  • Scientific simulations.
    – Terabytes of data generated in a few hours.
– Data mining helps scientists:
  • In automated analysis of massive datasets.
  • In hypothesis formation.

4

Page 5:

What is data mining?

Many definitions:
– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.

5

Page 6:

Origins of data mining

• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems.

• Traditional techniques may be unsuitable due to data that is:
– Large-scale
– High dimensional
– Heterogeneous
– Complex
– Distributed

Key distinction: data driven vs. hypothesis driven.

6

Page 7:

Data mining tasks

• Prediction task:
– Use some variables to predict unknown or future values of other variables.

• Description task:
– Find human-interpretable patterns that describe the data.

From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996.

7

Page 8:

Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

[Figure: the data table above feeding into data mining methods.]

8

Page 9:

Predictive modeling: Classification

Find a model for a class attribute as a function of the values of other attributes.

9

Tid | Employed | Level of Education | # years at present address | Credit Worthy
1 | Yes | Graduate | 5 | Yes
2 | Yes | High School | 2 | No
3 | No | Undergrad | 1 | No
4 | Yes | High School | 10 | Yes
… | … | … | … | …

Model for predicting credit worthiness:

[Figure: decision tree — the root tests Employed (No/Yes); branches test Level of Education (Graduate vs. High school/Undergrad) and Number of years at present address (thresholds 7 and 3 years) before assigning class Yes or No.]

Page 10:

Examples of classification

• Predicting tumor cells as benign or malignant.

• Classifying credit card transactions as legitimate or fraudulent.

• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil.

• Categorizing news stories as finance, weather, entertainment, sports, etc.

• Identifying intruders in cyberspace.

Page 11:

Clustering

Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

11

Inter-cluster distances are maximized.
Intra-cluster distances are minimized.

Page 12:

Applications of clustering

• Understanding
– Custom profiling for targeted marketing.
– Group related documents for browsing.
– Group genes and proteins that have similar functionality.
– Group stocks with similar price fluctuations.

• Summarization
– Reduce the size of large datasets.

12

[Figure: "Clusters for Raw SST and Raw NPP" — a latitude/longitude world map with legend: Sea Cluster 1, Sea Cluster 2, Ice or No NPP, Land Cluster 1, Land Cluster 2.]

Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.

Courtesy: Michael Eisen

Page 13:

Association rule discovery

Given a set of records, each of which contains some number of items from a given collection:
– Produce dependency rules that will predict the occurrence of an item based on occurrences of other items.

TID | Items
1 | Bread, Coke, Milk
2 | Beer, Bread
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

13

Page 14:

Association analysis: Applications

• Market-basket analysis.
– Rules are used for sales promotion, shelf management, and inventory management.

• Telecommunication alarm diagnosis.
– Rules are used to find combinations of alarms that occur together frequently in the same time period.

• Medical informatics.
– Rules are used to find combinations of patient symptoms and test results associated with certain diseases.

14

Page 15:

Motivating challenges

• Scalability.

• High dimensionality.

• Heterogeneous and complex data.

• Data ownership and distribution.

• Non-traditional analysis.

15

Page 16:

The 4 V's of “Big Data”

16

Page 17:

Pattern Mining

Page 18:

ASSOCIATION RULES

Page 19:

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

Example of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Page 20:

Definition: Frequent Itemset

Itemset
– A collection of one or more items.
  • Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items.

Support count (σ)
– Frequency of occurrence of an itemset.
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset.
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold.

TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke
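The support count and support definitions above can be checked with a short sketch over the five transactions on this slide:

```python
# Compute support count sigma(X) and support s(X) for an itemset
# over the five market-basket transactions shown above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions containing every item in X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

sigma = support_count({"Milk", "Bread", "Diaper"}, transactions)
s = support({"Milk", "Bread", "Diaper"}, transactions)
print(sigma, s)  # 2 0.4 — matching sigma = 2 and s = 2/5 from the slide
```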

Page 21:

Definition: Association Rule

Association rule
– An implication expression of the form X → Y, where X and Y are itemsets.
– Example: {Milk, Diaper} → {Beer}

Rule evaluation metrics
– Support (s)
  • Fraction of transactions that contain both X and Y.
– Confidence (c)
  • Measures how often items in Y appear in transactions that contain X.
  • It is nothing more than P(Y|X).

Example: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

Page 22:

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having:
1) support ≥ minsup threshold, and
2) confidence ≥ minconf threshold.

Example of rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke
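The (s, c) values listed for these rules can be reproduced with a small sketch over the same five transactions:

```python
# Support (s) and confidence (c) of an association rule X -> Y over the
# five market-basket transactions above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y):
    """s = sigma(X u Y) / |T|;  c = sigma(X u Y) / sigma(X) = P(Y|X)."""
    s = support_count(X | Y) / len(transactions)
    c = support_count(X | Y) / support_count(X)
    return s, c

for X, Y in [({"Milk", "Diaper"}, {"Beer"}),
             ({"Milk", "Beer"}, {"Diaper"}),
             ({"Diaper"}, {"Milk", "Beer"})]:
    s, c = rule_metrics(X, Y)
    # prints s=0.4, c=0.67 / c=1.00 / c=0.50 — matching the slide
    print(sorted(X), "->", sorted(Y), f"s={s:.1f}, c={c:.2f}")
```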

Page 23:

An approach…

1. List all possible association rules.
2. Compute the support and confidence for each rule.
3. Prune rules that fail the minsup and minconf thresholds.
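The three steps above can be sketched as a brute-force miner over the five-transaction dataset from the earlier slides (minsup = 0.4 matches the slide examples; minconf = 0.6 is an assumed threshold for illustration):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))
minsup, minconf = 0.4, 0.6  # minconf is an assumed value for this sketch

def sup(itemset):
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

rules = []
for k in range(2, len(items) + 1):
    for itemset in combinations(items, k):      # step 1: every itemset ...
        s = sup(itemset)
        if s < minsup:                          # step 3: prune on support
            continue
        for r in range(1, k):                   # ... split into X -> Y
            for X in combinations(itemset, r):
                Y = tuple(i for i in itemset if i not in X)
                c = s / sup(X)                  # step 2: confidence
                if c >= minconf:                # step 3: prune on confidence
                    rules.append((frozenset(X), frozenset(Y), s, c))

for X, Y, s, c in rules:
    print(sorted(X), "->", sorted(Y), f"(s={s:.1f}, c={c:.2f})")
```

Listing every rule first is exactly what makes this approach prohibitively expensive, which motivates the complexity count on the next slide.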

Page 24:

Computational Complexity

Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules.
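The closed form and the double sum above agree, as a quick check confirms:

```python
from math import comb

def total_rules(d):
    """Closed form from the slide: R = 3^d - 2^(d+1) + 1."""
    return 3**d - 2**(d + 1) + 1

def total_rules_by_sum(d):
    """Direct double sum: choose k items for the left-hand side of the rule,
    then j of the remaining d-k items for the right-hand side."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(total_rules(6), total_rules_by_sum(6))  # 602 602
```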

Page 25:

Mining Association Rules

Example of rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}.
• Rules originating from the same itemset have identical support but can have different confidence.
• Thus, we may decouple the support and confidence requirements.

Page 26:

Mining association rules

Two-step approach:
1. Frequent itemset generation
– Generate all itemsets whose support ≥ minsup.
2. Rule generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

Frequent itemset generation is still expensive.

Page 27:

Frequent itemset generation strategies

• Reduce the number of candidates (M)
– Complete search: M = 2^d.
– Use pruning techniques to reduce M.

• Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases.
– Used by DHP and vertical-based mining algorithms.

• Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions.
– No need to match every candidate against every transaction.

Page 28:

Pattern Lattice

[Figure: the itemset lattice over items A–E, from the null itemset down to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.

Page 29:

Reducing the number of candidates

• Observation:
– If an itemset is frequent, then all of its subsets must also be frequent.

• This holds due to the following property of the support measure:
– Support of an itemset never exceeds the support of its subsets.
– This is known as the anti-monotone property of support:

∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
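The anti-monotone property can be verified exhaustively on the small dataset used in the following slides (an illustration on one dataset, not a proof):

```python
from itertools import combinations

# Check s(X) >= s(Y) for every X that is an immediate subset of Y,
# over the five-transaction dataset from these slides.
transactions = [
    {"Bread", "Milk"},
    {"Beer", "Bread", "Diaper", "Eggs"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Bread", "Coke", "Diaper", "Milk"},
]

def support_count(X):
    return sum(1 for t in transactions if X <= t)

items = sorted(set().union(*transactions))
for k in range(1, len(items) + 1):
    for Y in combinations(items, k):
        Y = set(Y)
        for x in Y:
            X = Y - {x}  # dropping any one item gives an immediate subset
            assert support_count(X) >= support_count(Y)
print("anti-monotone property verified for all", 2**len(items) - 1, "itemsets")
```

Checking only immediate subsets suffices: the property for arbitrary X ⊆ Y follows by chaining one-item drops.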

Page 30:

Illustrating support's anti-monotonicity

[Figure: two copies of the itemset lattice over A–E — in the first, an itemset is found to be infrequent; in the second, all of its supersets are pruned.]

Page 31:

Illustrating support's anti-monotonicity

Minimum Support = 3

TID | Items
1 | Bread, Milk
2 | Beer, Bread, Diaper, Eggs
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Bread, Coke, Diaper, Milk

Page 32:

Illustrating support's anti-monotonicity

Minimum Support = 3

TID | Items
1 | Bread, Milk
2 | Beer, Bread, Diaper, Eggs
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Bread, Coke, Diaper, Milk

Items (1-itemsets):
Item | Count
Bread | 4
Coke | 2
Milk | 4
Beer | 3
Diaper | 4
Eggs | 1


Page 34:

Illustrating support's anti-monotonicity

Minimum Support = 3

Items (1-itemsets):
Item | Count
Bread | 4
Coke | 2
Milk | 4
Beer | 3
Diaper | 4
Eggs | 1

Pairs (2-itemsets):
{Bread, Milk}
{Bread, Beer}
{Bread, Diaper}
{Beer, Milk}
{Diaper, Milk}
{Beer, Diaper}

(No need to generate candidates involving Coke or Eggs.)

Page 35:

Illustrating support's anti-monotonicity

Minimum Support = 3

Items (1-itemsets):
Item | Count
Bread | 4
Coke | 2
Milk | 4
Beer | 3
Diaper | 4
Eggs | 1

Pairs (2-itemsets):
Itemset | Count
{Bread, Milk} | 3
{Beer, Bread} | 2
{Bread, Diaper} | 3
{Beer, Milk} | 2
{Diaper, Milk} | 3
{Beer, Diaper} | 3

(No need to generate candidates involving Coke or Eggs.)

Page 36:

Illustrating support's anti-monotonicity

Minimum Support = 3

Items (1-itemsets):
Item | Count
Bread | 4
Coke | 2
Milk | 4
Beer | 3
Diaper | 4
Eggs | 1

Pairs (2-itemsets):
Itemset | Count
{Bread, Milk} | 3
{Bread, Beer} | 2
{Bread, Diaper} | 3
{Milk, Beer} | 2
{Milk, Diaper} | 3
{Beer, Diaper} | 3

(No need to generate candidates involving Coke or Eggs.)

Triplets (3-itemsets):
{Beer, Diaper, Milk}
{Beer, Bread, Diaper}
{Bread, Diaper, Milk}
{Beer, Bread, Milk}

Page 37:

Illustrating support's anti-monotonicity

Minimum Support = 3

Items (1-itemsets):
Item | Count
Bread | 4
Coke | 2
Milk | 4
Beer | 3
Diaper | 4
Eggs | 1

Pairs (2-itemsets):
Itemset | Count
{Bread, Milk} | 3
{Bread, Beer} | 2
{Bread, Diaper} | 3
{Milk, Beer} | 2
{Milk, Diaper} | 3
{Beer, Diaper} | 3

(No need to generate candidates involving Coke or Eggs.)

Triplets (3-itemsets):
Itemset | Count
{Beer, Diaper, Milk} | 2
{Beer, Bread, Diaper} | 2
{Bread, Diaper, Milk} | 2
{Beer, Bread, Milk} | 1

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41.
With support-based pruning: 6 + 6 + 4 = 16.

Page 38:

Illustrating support's anti-monotonicity

Minimum Support = 3

Items (1-itemsets):
Item | Count
Bread | 4
Coke | 2
Milk | 4
Beer | 3
Diaper | 4
Eggs | 1

Pairs (2-itemsets):
Itemset | Count
{Bread, Milk} | 3
{Bread, Beer} | 2
{Bread, Diaper} | 3
{Milk, Beer} | 2
{Milk, Diaper} | 3
{Beer, Diaper} | 3

(No need to generate candidates involving Coke or Eggs.)

Triplets (3-itemsets):
Itemset | Count
{Beer, Diaper, Milk} | 2
{Beer, Bread, Diaper} | 2
{Bread, Diaper, Milk} | 2
{Beer, Bread, Milk} | 1

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41.
With support-based pruning: 6 + 6 + 4 = 16.
Generating only triplets all of whose pairs are frequent (just {Bread, Diaper, Milk}): 6 + 6 + 1 = 13.

Page 39:

APRIORI

Page 40:

Apriori algorithm
– F_k: frequent k-itemsets
– L_k: candidate k-itemsets

Algorithm
– Let k = 1.
– Generate F_1 = frequent 1-itemsets.
– Repeat until F_k is empty:
1. Candidate Generation: Generate L_{k+1} from F_k.
2. Candidate Pruning: Prune candidate itemsets in L_{k+1} containing subsets of length k that are infrequent.
3. Support Counting: Count the support of each candidate in L_{k+1} by scanning the DB.
4. Candidate Elimination: Eliminate candidates in L_{k+1} that are infrequent, leaving only those that are frequent, leading to F_{k+1}.

[Figure: itemset lattice over A–E.] Level-by-level traversal of the lattice.
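The four steps can be sketched compactly on the minsup = 3 dataset from the earlier slides. Candidate generation here uses simple pairwise unions for brevity; the F_{k−1} × F_{k−1} prefix-merge variant is described on the next slide:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Beer", "Bread", "Diaper", "Eggs"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Bread", "Coke", "Diaper", "Milk"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def apriori(minsup):
    items = sorted(set().union(*transactions))
    Fk = {frozenset([i]) for i in items if support_count(frozenset([i])) >= minsup}
    frequent = set(Fk)
    k = 1
    while Fk:
        # 1. candidate generation (pairwise unions of frequent k-itemsets)
        Lk1 = {a | b for a in Fk for b in Fk if len(a | b) == k + 1}
        # 2. candidate pruning: every k-subset of a candidate must be frequent
        Lk1 = {c for c in Lk1
               if all(frozenset(s) in Fk for s in combinations(c, k))}
        # 3./4. support counting and elimination of infrequent candidates
        Fk = {c for c in Lk1 if support_count(c) >= minsup}
        frequent |= Fk
        k += 1
    return frequent

result = apriori(3)
for f in sorted(result, key=lambda f: (len(f), sorted(f))):
    print(sorted(f), support_count(f))
```

On this dataset the miner stops at k = 2: the only candidate triplet, {Bread, Diaper, Milk}, has support 2 < 3, matching the slide's walkthrough.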

Page 41:

Candidate generation: the F_{k−1} × F_{k−1} method

• Merge two frequent (k−1)-itemsets if their first (k−2) items are identical.

• F_3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE}
– Merge(ABC, ABD) = ABCD
– Merge(ABC, ABE) = ABCE
– Merge(ABD, ABE) = ABDE
– Do not merge(ABD, ACD) because they share only a prefix of length 1 instead of length 2.

Page 42:

Candidate pruning

• Let F_3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets.

• L_4 = {ABCD, ABCE, ABDE} is the set of candidate 4-itemsets generated (from the previous slide).

• Candidate pruning:
– Prune ABCE because ACE and BCE are infrequent.
– Prune ABDE because ADE is infrequent.

• After candidate pruning: L_4 = {ABCD}.
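The generation and pruning steps for this F_3 can be sketched directly:

```python
from itertools import combinations

# Frequent 3-itemsets from the slide, each a sorted tuple of items.
F3 = [("A", "B", "C"), ("A", "B", "D"), ("A", "B", "E"), ("A", "C", "D"),
      ("B", "C", "D"), ("B", "D", "E"), ("C", "D", "E")]

def generate(Fk):
    """F(k-1) x F(k-1): merge two itemsets whose first k-2 items agree."""
    merged = []
    for i, a in enumerate(Fk):
        for b in Fk[i + 1:]:
            if a[:-1] == b[:-1]:  # identical prefix of length k-2
                merged.append(tuple(sorted(set(a) | set(b))))
    return merged

def prune(candidates, Fk):
    """Drop any candidate that has an infrequent k-subset."""
    k = len(Fk[0])
    fk = set(Fk)
    return [c for c in candidates if all(s in fk for s in combinations(c, k))]

L4 = generate(F3)
print(L4)             # [('A','B','C','D'), ('A','B','C','E'), ('A','B','D','E')]
print(prune(L4, F3))  # [('A','B','C','D')] — ABCE and ABDE are pruned
```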

Page 43:

Support counting of candidate itemsets

Scan the database of transactions to determine the support of each candidate itemset.
– Must match every candidate itemset against every transaction, which is an expensive operation.

TID | Items
1 | Bread, Milk
2 | Beer, Bread, Diaper, Eggs
3 | Beer, Coke, Diaper, Milk
4 | Beer, Bread, Diaper, Milk
5 | Bread, Coke, Diaper, Milk

Candidate itemsets: {Beer, Diaper, Milk}, {Beer, Bread, Diaper}, {Bread, Diaper, Milk}, {Beer, Bread, Milk}

Q: How should we perform this operation?

Page 44:

Support counting of candidate itemsets

To reduce the number of comparisons, store the candidate itemsets in a hash structure.
– Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets.

[Figure: N transactions hashed into a structure of k buckets of candidates.]

Page 45:

Support counting: An example

Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

How many of these itemsets are supported by transaction {1, 2, 3, 5, 6}?

[Figure: three-level enumeration tree of all 3-item subsets of transaction {1, 2, 3, 5, 6}: 1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6.]

This is a "full" n-ary tree, where n is the number of items.

Q: Can we reduce storage requirements?

Page 46:

Support counting using a hash tree

Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
• A hash function (here items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 hash to the three branches).
• A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node).

[Figure: the hash tree storing the 15 candidate itemsets in its leaves.]

Page 47:

Factors affecting the complexity of Apriori

Page 48:

MAXIMAL & CLOSED ITEMSETS

Page 49:

Maximal frequent itemset

An itemset is maximal frequent if it is frequent and none of its immediate supersets are frequent.

[Figure: itemset lattice over A–E with the border separating frequent from infrequent itemsets; the maximal itemsets are the frequent itemsets just below the border.]

Page 50:

Closed itemsets

• An itemset X is closed if all of its immediate supersets have a lower support than X.

• Itemset X is not closed if at least one of its immediate supersets has the same support as X.

TID | Items
1 | A, B
2 | B, C, D
3 | A, B, C, D
4 | A, B, D
5 | A, B, C, D

Itemset | Support
{A} | 4
{B} | 5
{C} | 3
{D} | 4
{A,B} | 4
{A,C} | 2
{A,D} | 3
{B,C} | 3
{B,D} | 4
{C,D} | 3
{A,B,C} | 2
{A,B,D} | 3
{A,C,D} | 2
{B,C,D} | 2
{A,B,C,D} | 2

Page 51:

Maximal vs closed frequent itemsets

TID | Items
1 | ABC
2 | ABCD
3 | BCE
4 | ACDE
5 | DE

Minimum support = 2
# Closed = 9
# Maximal = 4

[Figure: itemset lattice over A–E, each node annotated with the TIDs of the transactions containing it; closed-and-maximal and closed-but-not-maximal itemsets are marked.]

Page 52:

Frequent, maximal, and closed itemsets

[Figure: nested sets — Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets.]

Page 53:

Frequent, maximal, and closed itemsets

[Figure: nested sets — Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets.]

Q1: What if, instead of finding the frequent itemsets, we find the maximal frequent itemsets or the closed frequent itemsets?

Q2: Does knowledge of just the maximal frequent itemsets allow us to generate all required association rules?

Q3: Does knowledge of just the closed frequent itemsets allow us to generate all required association rules?

Page 54:

BEYOND LEVEL-BY-LEVEL EXPLORATION

Page 55:

Traversing the pattern lattice

[Figure: (a) prefix tree and (b) suffix tree over the lattice of items A–D.]

• Prefix tree:
– Patterns starting with A (patterns that contain A and any other item).
– Patterns starting with B (patterns that contain B and any other item except A).

• Suffix tree:
– Patterns ending with D (patterns that contain D and any other item).
– Patterns ending with C (patterns that contain C and any other item except D).

Page 56:

Breadth-first vs depth-first

[Figure: four snapshots of the itemset lattice over items a–d, traversed breadth-first (level by level) and depth-first (branch by branch).]

Plusses and minuses?

Page 57:

PROJECTION METHODS

Page 58:

Projection-based methods

[Figure: the itemset lattice over items A–E.]

Page 59:

Projection-based methods

Initial database:
TID | Items
1 | A, B
2 | B, C, D
3 | A, C, D, E
4 | A, D, E
5 | A, B, C
6 | A, B, C, D
7 | B, C
8 | A, B, C
9 | A, B, D
10 | B, C, E

Database associated with node A ("projected DB"):
TID | Items
1 | B
2 |
3 | C, D, E
4 | D, E
5 | B, C
6 | B, C, D
7 |
8 | B, C
9 | B, D
10 |

Database associated with node C:
TID | Items
1 |
2 | D
3 | D, E
4 |
5 |
6 | D
7 |
8 |
9 |
10 | E

A projected DB on prefix pattern X is obtained as follows:
• Eliminate any transactions that do not contain X.
• From the transactions that are left, retain only the items that are lexicographically greater than the items in X.

Page 60:

Projection-based method

• Items are listed in lexicographic order.
• Let P and DB(P) be a node's pattern and its associated projected database.
• Mining is performed by recursively calling the function TP(P, DB(P)):
1. Determine the frequent items in DB(P), and denote them by E(P).
2. Eliminate from DB(P) any items not in E(P).
3. For each item x in E(P), call TP(Px, DB(Px)).
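A sketch of this recursive miner, run on the 10-transaction database from the previous slide (minsup = 3 is an assumed threshold for illustration; step 2's elimination of infrequent items is an optimization omitted here, since recursing only on frequent items already yields the same patterns):

```python
# Recursive projection-based pattern mining, TP(P, DB(P)).
transactions = [
    ["A", "B"], ["B", "C", "D"], ["A", "C", "D", "E"], ["A", "D", "E"],
    ["A", "B", "C"], ["A", "B", "C", "D"], ["B", "C"], ["A", "B", "C"],
    ["A", "B", "D"], ["B", "C", "E"],
]
minsup = 3  # assumed support-count threshold for this sketch

def project(db, x):
    """DB(Px): drop transactions without x, keep only items > x."""
    return [[i for i in t if i > x] for t in db if x in t]

def TP(P, db, out):
    # 1. determine the frequent items E(P) in DB(P)
    counts = {}
    for t in db:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    # 3. recurse on each frequent item x with the projected database DB(Px)
    for x in sorted(i for i, c in counts.items() if c >= minsup):
        out.append(tuple(P + [x]))
        TP(P + [x], project(db, x), out)
    return out

patterns = TP([], transactions, [])
print(len(patterns), patterns)
```

For example, projecting on A produces exactly the "DB associated with node A" shown on the previous slide, and the pattern (A, B, C) is emitted because C still occurs 3 times inside DB(AB).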

Page 61:

BEYOND TRANSACTIONS

Page 62:

Beyond transaction datasets

• The concept of frequent patterns and association rules has been generalized to different types of datasets:
– Sequential datasets:
  • Sequences of purchasing transactions, web pages visited, articles read, biological sequences, event logs, etc.
– Relational/graph datasets:
  • Social networks, chemical compounds, web graphs, information networks, etc.

• There is an extensive set of approaches and algorithms for them, many of which follow ideas similar to those developed for transaction datasets.

Page 63:

Clustering (Unsupervised learning)

Page 64:

What is cluster analysis?

Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Inter-cluster distances are maximized; intra-cluster distances are minimized.

Notion of a cluster can be ambiguous

How many clusters?


Notion of a cluster can be ambiguous

How many clusters?

Two Clusters · Four Clusters · Six Clusters


Clustering formulations

A number of clustering formulations have been developed:

1. We need to find a fixed number of clusters.
– Well-suited for compression-like applications.

2. We need to find clusters of fixed size.
– Well-suited for neighborhood discovery (recommendation engines).

3. We need to find the smallest number of clusters that satisfy certain quality criteria.
– Well-suited for applications in which cluster quality is important.

4. We need to find the natural number of clusters.
– This is clustering's holy grail!

• Extremely hard, problem dependent, and “quite supervised”.


Types of clusterings

• A clustering is a set of clusters.

• Important distinction between hierarchical and partitional sets of clusters.

• Partitional clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

• Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree.


Partitional clustering

Original Points · A Partitional Clustering


Hierarchical clustering

[Figure: a hierarchical clustering of points p1–p4 and the corresponding dendrogram]


Other distinctions between sets of clusters

• Exclusive versus non-exclusive
– In non-exclusive clusterings, points may belong to multiple clusters.
– Can represent multiple classes or “border” points.

• Fuzzy versus non-fuzzy
– In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1.
– Weights must sum to 1.
– Probabilistic clustering has similar characteristics.

• Partial versus complete
– In some cases, we only want to cluster some of the data.

• Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities.


Types of clusters

• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or conceptual
• Described by an objective function


Types of clusters: Well-separated

Well-separated clusters:
– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

Three well-separated clusters


Types of clusters: Center-based

Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of a cluster than to the center of any other cluster.
– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster.

Four center-based clusters


Types of clusters: Contiguity-based

Contiguous cluster (nearest neighbor or transitive)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.

Eight contiguous clusters


Types of clusters: Density-based

Density-based
– A cluster is a dense region of points, which is separated by low-density regions from other regions of high density.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.

Six density-based clusters


Types of clusters: Conceptual clusters

Shared Property or Conceptual Clusters
– Finds clusters that share some common property or represent a particular concept.

Two overlapping circles


Types of clusters: Objective function

Clusters defined by an objective function
– Finds clusters that minimize or maximize an objective function.
– Enumerate all possible ways of dividing the points into clusters and evaluate the “goodness” of each potential set of clusters by using the given objective function. (NP-hard)
– Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives.
• Partitional algorithms typically have global objectives.
– A variation of the global objective function approach is to fit the data to a parameterized model.
• Parameters for the model are determined from the data.
• Mixture models assume that the data is a ‘mixture' of a number of statistical distributions.


Clustering requirements

The fundamental requirement for clustering is the availability of a function to determine the similarity or distance between objects in the database.

The user must be able to answer some of the following questions:

1. When should two objects belong to the same cluster?
2. What should the clusters look like (i.e., what type of objects should they contain)?
3. What are the object-related characteristics of good clusters?


Data characteristics & clustering

• Type of proximity or density measure
– Central to clustering.
– Depends on data and application.

• Data characteristics that affect proximity and/or density are
– Dimensionality
• Sparseness
– Attribute type
– Special relationships in the data
• For example, autocorrelation
– Distribution of the data

• Noise and outliers
– Often interfere with the operation of the clustering algorithm.


BASIC CLUSTERING ALGORITHMS

1. K-means
2. Hierarchical clustering
3. Density-based clustering


K-means clustering

• Partitional clustering approach.
• Number of clusters, K, must be specified.
• Each cluster is associated with a centroid (center point/object).
• Each point is assigned to the cluster with the closest centroid.
• The basic algorithm is very simple.
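The basic algorithm can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions: Euclidean distance, random initial centroids drawn from the points, and termination when assignments stop changing.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Basic K-means sketch: pick k random points as initial centroids,
    then alternate assignment and centroid-update steps until no change."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins the cluster with the closest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        # update step: each centroid becomes the mean of its assigned points
        new = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:
            break  # converged: assignments can no longer change
        centroids = new
    return centroids, clusters
```

On two well-separated blobs, any initialization from the data recovers the natural two-way split.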


Example of K-means clustering

[Figure: K-means on a 2-D dataset (x–y plane), showing cluster assignments and centroid positions over Iterations 1–6]

K-means clustering – Details

• Initial centroids are often chosen randomly.
– Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• “Closeness” is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
– Often the stopping condition is changed to “until relatively few points change clusters”.
• Complexity is O(n*K*I*d)
– n = number of points, K = number of clusters, I = number of iterations, d = number of attributes.


K-means clustering – Objective

Let o1, . . . , on be the set of objects to be clustered, k be the number of desired clusters, p be the clustering indicator vector such that pi is the cluster number that the ith object belongs to, and ci be the centroid of the ith cluster.

In the case of Euclidean distance, the K-means clustering algorithm solves the following optimization problem:

$$\min_{p} \; f(p) = \sum_{i=1}^{n} \lVert o_i - c_{p_i} \rVert_2^2.$$

Function f() is the objective or clustering criterion function of K-means.


K-means clustering – Objective

Let o1, . . . , on be the set of objects to be clustered, k be the number of desired clusters, p be the clustering indicator vector such that pi is the cluster number that the ith object belongs to, and ri be a vector associated with the ith cluster.

In the case of Euclidean distance, the K-means clustering algorithm solves the following optimization problem:

$$\min_{p,\, r_1, \dots, r_k} \; g(p, r_1, \dots, r_k) = \sum_{i=1}^{n} \lVert o_i - r_{p_i} \rVert_2^2.$$

Note that p and r1, . . . , rk are the variables of the optimization problem that need to be estimated such that the value of g() is minimized.


K-means clustering – Objective

The solution to

$$\min_{p} \; f(p) = \sum_{i=1}^{n} \lVert o_i - c_{p_i} \rVert_2^2$$

is the same as the solution to

$$\min_{p,\, r_1, \dots, r_k} \; g(p, r_1, \dots, r_k) = \sum_{i=1}^{n} \lVert o_i - r_{p_i} \rVert_2^2,$$

and ∀i, ri = ci.


K-means clustering – Objective

The solution to

$$\min_{p} \; f(p) = \sum_{i=1}^{n} \lVert o_i - c_{p_i} \rVert_2^2$$

is the same as the solution to

$$\min_{p,\, r_1, \dots, r_k} \; g(p, r_1, \dots, r_k) = \sum_{i=1}^{n} \lVert o_i - r_{p_i} \rVert_2^2,$$

and ∀i, ri = ci.

The ri vectors can be thought of as representatives of the objects that are assigned to the ith cluster. The ri vectors represent a compressed view of the data.


K-means clustering – Objective

$$\min_{p} \sum_{i=1}^{n} \lVert o_i - c_{p_i} \rVert_2^2 \qquad\qquad \min_{p,\, r_1, \dots, r_k} \sum_{i=1}^{n} \lVert o_i - r_{p_i} \rVert_2^2$$

These are non-convex optimization problems.

• The 𝐾-means clustering algorithm is a way of solving the optimization problem.
• It uses an iterative, alternating least-squares optimization strategy:
a. Optimize cluster assignments 𝑝, given 𝑟ᵢ for 𝑖 = 1, …, 𝑘.
b. Optimize 𝑟ᵢ for 𝑖 = 1, …, 𝑘, given cluster assignments 𝑝.
• It guarantees convergence to a local minimum. However, due to the non-convexity of the problem, this may not be the global minimum.
• Run 𝐾-means multiple times with different initial centroids and return the solution that has the best objective value.
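The multiple-restart strategy in the last bullet can be sketched generically. The hooks `fit` and `objective` are hypothetical caller-supplied functions (one K-means run from a random initialization, and its criterion-function value); only the selection logic is shown.

```python
import random

def best_of_restarts(fit, objective, n_runs=10, seed=0):
    """Run `fit` n_runs times from independent random states and keep the
    solution with the smallest `objective` value (lower is better)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_runs):
        # each restart gets its own RNG so initial centroids differ
        sol = fit(random.Random(rng.random()))
        if best is None or objective(sol) < objective(best):
            best = sol
    return best
```

A trivial usage: with `fit` returning a random number and `objective` the identity, the wrapper returns the minimum over the restarts.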


Two different K-means clusterings

[Figure: the same original points clustered two ways: a sub-optimal clustering and the optimal clustering]


Limitations of K-means

• Definition of the problem: the clustering solution that you get is not the best, most natural, or most insightful one.

• K-means has problems when clusters are of differing
– Sizes
– Densities
– Non-globular shapes

• K-means has problems when the data contains outliers.


Limitations of K-means: Differing sizes

Original Points · K-means (3 Clusters)


Limitations of K-means: Differing density

Original Points · K-means (3 Clusters)


Limitations of K-means: Non-globular shapes

Original Points · K-means (2 Clusters)


Overcoming K-means limitations

Original Points · K-means Clusters

One solution is to use many clusters. This finds parts of clusters, which we may then need to put back together.


Overcoming K-means limitations

Original Points · K-means Clusters


Importance of choosing initial centroids

[Figure: K-means progress from one choice of initial centroids, Iterations 1–6]

Importance of choosing initial centroids…

[Figure: K-means progress from a different choice of initial centroids, Iterations 1–5]

Solutions to initial centroids problem

• Multiple runs
– Helps, but probability is not on your side.

• Sample and use hierarchical clustering to determine initial centroids.

• Select more than 𝑘 initial centroids and then select among these initial centroids.
– Select the most widely separated.

• Generate a larger number of clusters and then perform a hierarchical clustering.

• Bisecting 𝐾-means
– Not as susceptible to initialization issues.
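Bisecting 𝐾-means can be sketched as follows. This is a hedged outline, not the canonical implementation: it repeatedly splits the cluster with the largest SSE using a plain 2-means step; the split-selection criterion and the random initialization inside `two_means` are choices of this illustration.

```python
import random

def two_means(points, rng, iters=50):
    """2-means used as the bisection step (random init from the points)."""
    c = rng.sample(points, 2)
    for _ in range(iters):
        parts = ([], [])
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, cj)) for cj in c]
            parts[d.index(min(d))].append(p)
        new = [tuple(sum(x) / len(g) for x in zip(*g)) if g else c[j]
               for j, g in enumerate(parts)]
        if new == c:
            break
        c = new
    return [g for g in parts if g]

def sse(cluster):
    """Sum of squared errors of a cluster around its centroid."""
    m = [sum(x) / len(cluster) for x in zip(*cluster)]
    return sum(sum((a - b) ** 2 for a, b in zip(p, m)) for p in cluster)

def bisecting_kmeans(points, k, seed=0):
    """Repeatedly split the cluster with the largest SSE until k clusters."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=sse)
        clusters.remove(worst)
        clusters.extend(two_means(worst, rng))
    return clusters
```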


Outliers

• A principled way of dealing with outliers is to do so directly during the optimization process.

• Robust K-means algorithms, as part of the optimization process, in addition to determining the clustering solution also identify a set of outlier objects that are not clustered by the algorithm.

• The non-clustered objects are treated as a penalty component of the objective function (in supervised learning, these penalty components are often called regularizers):

$$\min_{p} \sum_{i\,:\,p_i \neq -1} \lVert o_i - c_{p_i} \rVert_2^2 \;+\; \lambda \sum_{i\,:\,p_i = -1} q(i),$$

where λ is a user-specified parameter that controls the penalty associated with not clustering an object, and q(i) is a cost function associated with the ith object. A simple q() = 1 is such a cost function.
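With q() = 1, the assignment step implied by this penalized objective can be sketched as follows. The name `lam` stands in for λ, and labeling outliers as −1 is a convention of this illustration.

```python
def robust_assign(points, centroids, lam):
    """One assignment step of a robust K-means variant with q() = 1:
    a point is marked an outlier (label -1) whenever paying the fixed
    penalty lam is cheaper than its squared distance to the nearest
    centroid, which minimizes the penalized objective for fixed centroids."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        j = min(range(len(centroids)), key=dists.__getitem__)
        labels.append(j if dists[j] <= lam else -1)
    return labels
```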


K-means and the “Curse of dimensionality”

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies.

• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.

Experiment:
• Randomly generate 500 points.
• Compute the difference between the max and min distance between any pair of points.
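The experiment above can be sketched directly; the function name and the relative-contrast ratio returned here are choices of this illustration.

```python
import math
import random

def max_min_gap(dim, n=500, seed=1):
    """Generate n random points in [0,1]^dim and return the relative gap
    (max pairwise distance - min pairwise distance) / min pairwise distance.
    The gap shrinks as dim grows: distances become nearly indistinguishable."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    dmin, dmax = float("inf"), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(pts[i], pts[j])
            dmin, dmax = min(dmin, d), max(dmax, d)
    return (dmax - dmin) / dmin
```

Comparing a low and a high dimensionality (even with fewer points for speed) shows the contrast collapsing.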


Asymmetric attributes

If we met a friend in the grocery store, would we ever say the following?

“I see our purchases are very similar since we didn’t buy most of the same things.”



Spherical K-means clustering

Let d1, . . . , dn be the unit-length vectors of the set of objects to be clustered, k be the number of desired clusters, p be the clustering indicator vector such that pi is the cluster number that the ith object belongs to, and ci be the centroid of the ith cluster.

The spherical K-means clustering algorithm solves the following optimization problem:

$$\max_{p} \sum_{i=1}^{n} \cos(d_i, c_{p_i}).$$


Spherical K-means & Text

In high-dimensional data, clusters exist in lower-dimensional sub-spaces.


HIERARCHICAL CLUSTERING


Hierarchical clustering

• Produces a set of nested clusters organized as a hierarchical tree.

• Can be visualized as a dendrogram.
– A tree-like diagram that records the sequences of merges or splits.

[Figure: a six-point dataset and its dendrogram; the vertical axis records merge heights]

Advantages of hierarchical clustering

• Do not have to assume any particular number of clusters.
– Any desired number of clusters can be obtained by “cutting” the dendrogram at the proper level.

• They may correspond to meaningful taxonomies.
– Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …).

[Figure: dendrograms for the six-point example]

Hierarchical clustering

• Two main ways of obtaining hierarchical clusterings:
– Agglomerative:
• Start with the points as individual clusters.
• At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
– Divisive:
• Start with one, all-inclusive cluster.
• At each step, split a cluster until each cluster contains a point (or there are k clusters).

• Traditional hierarchical algorithms use a similarity or distance matrix.
– Merge or split one cluster at a time.


Agglomerative clustering algorithm

• More popular hierarchical clustering technique.

• Basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat:
4. Merge the two closest clusters.
5. Update the proximity matrix.
6. Until only a single cluster remains (or k clusters remain).

• Key operation is the computation of the proximity of two clusters.
– Different approaches to defining the distance between clusters distinguish the different algorithms.
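The basic algorithm above can be sketched with minimum (single-link) inter-cluster distance. For brevity this sketch recomputes proximities on the fly instead of storing and updating a proximity matrix, and it uses squared Euclidean distance, which preserves the closest-pair ordering.

```python
def single_link_agglomerative(points, k):
    """Start with singleton clusters; repeatedly merge the closest pair
    (single-link distance) until k clusters remain."""
    clusters = [[p] for p in points]

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def cluster_dist(c1, c2):
        # minimum distance between any pair of points across the clusters
        return min(dist2(a, b) for a in c1 for b in c2)

    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```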


Starting situation

Start with clusters of individual points and a proximity matrix.

[Figure: points p1–p5 as singleton clusters, alongside the proximity matrix]


Intermediate situation

After some merging steps, we have some clusters.

[Figure: clusters C1–C5 and the corresponding C1–C5 proximity matrix]


Intermediate situation

We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1–C5, with C2 and C5 marked for merging, and the proximity matrix]


After merging

How do we update the proximity matrix?

[Figure: the merged cluster C2 ∪ C5; its rows and columns in the proximity matrix are marked “?” and must be recomputed]


Defining inter-cluster proximity

[Figure: two clusters of points and their proximity matrix, with the inter-cluster proximity marked “Proximity?”]

Candidate definitions: minimum distance, maximum distance, average distance, distance between centroids, objective-driven selection, etc.
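The first four alternatives listed above can be written down directly; the helper names here are illustrative.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.dist(a, b)

def single_link(c1, c2):
    """Minimum distance over all cross-cluster point pairs."""
    return min(dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    """Maximum distance over all cross-cluster point pairs."""
    return max(dist(a, b) for a in c1 for b in c2)

def group_average(c1, c2):
    """Average distance over all cross-cluster point pairs."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def centroid_dist(c1, c2):
    """Distance between the two cluster centroids (means)."""
    m1 = [sum(x) / len(c1) for x in zip(*c1)]
    m2 = [sum(x) / len(c2) for x in zip(*c2)]
    return dist(m1, m2)
```

For any pair of clusters, single link is a lower bound and complete link an upper bound on the group average.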


Defining inter-cluster proximity

Using minimum distance.


Defining inter-cluster proximity

Using maximum distance.


Defining inter-cluster proximity

Using average distance.


Defining inter-cluster proximity

Using distance between centroids.


Strength of minimum distance

Can handle non-elliptical shapes.

Original Points · Six Clusters


Limitations of minimum distance

Original Points · Two Clusters · Three Clusters

Sensitive to noise and outliers.


Strength of maximum distance

Less susceptible to noise and outliers.

Original Points · Two Clusters


Limitations of maximum distance

Tends to break large clusters.

Biased towards globular clusters.

Original Points · Two Clusters


Group average

• Compromise between single and complete link.

• Strengths:
– Less susceptible to noise and outliers.

• Limitations:
– Biased towards globular clusters.


Hierarchical clustering: Time and space requirements

• O(𝑁²) space, since it uses the proximity matrix.
– 𝑁 is the number of points.

• O(𝑁³) time in many cases.
– There are 𝑁 steps, and at each step the proximity matrix (with O(𝑁²) entries) must be updated and searched.
– Complexity can be reduced to O(𝑁² log 𝑁) time with some cleverness.


Hierarchical clustering: Problems and limitations

• Once a decision is made to combine two clusters, it cannot be undone.

• Objective function is optimized only locally.

• Different schemes have problems with one or more of the following:
– Sensitivity to noise and outliers.
– Difficulty handling different-sized clusters and convex shapes.
– Breaking large clusters.


DENSITY-BASED CLUSTERING


DBSCAN

• DBSCAN is a density-based algorithm.
– The density is the number of points within a specified radius (Eps).
– A point is a core point if it has more than a specified number of points (MinPts) within Eps.
• These are points that are in the interior of a cluster.
– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
– A noise point is any point that is not a core point or a border point.
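The definitions above translate into a short sketch. As one assumption of this illustration, a point is treated as core when it has at least MinPts neighbors within Eps (counting itself), a slight variation on the slide's “more than” wording.

```python
import math

def dbscan(points, eps, min_pts):
    """DBSCAN sketch: find core points, connect cores within eps of each
    other into clusters, attach border points to a neighboring core's
    cluster, and label everything else noise (-1)."""
    n = len(points)

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    core = set(i for i in range(n) if len(neighbors(i)) >= min_pts)
    labels = [-1] * n
    cluster = 0
    for i in core:
        if labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:  # grow the connected component of core points
            j = stack.pop()
            for m in neighbors(j):
                if labels[m] == -1:
                    labels[m] = cluster
                    if m in core:  # only core points propagate the cluster
                        stack.append(m)
        cluster += 1
    return labels
```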


DBSCAN: core, border, and noise points


DBSCANalgorithm182 CHAPTER 6. CLUSTER ANALYSIS

Algorithm DBSCAN(Data: D, Radius: Eps, Density: τ )begin

Determine core, border and noise points of D at level (Eps, τ);Create graph in which core points are connected

if they are within Eps of one another;Determine connected components in graph;Assign each border point to connected component

with which it is best connected;return points in each connected component as a cluster;

end

Figure 6.15: Basic DBSCAN algorithm

3. Noise point: A data point that is neither a core point nor a border point is defined asa noise point.

Examples of core points, border points, and noise points are illustrated in Fig. 6.16 forτ = 10. The data point A is a core point because it contains 10 data points within theillustrated radius Eps. On the other hand, data point B contains only 6 points within aradius of Eps, but it contains the core point A. Therefore, it is a border point. The datapoint C is a noise point because it contains only 4 points within a radius of Eps, and itdoes not contain any core point.

After the core, border, and noise points have been determined, the DBSCAN clusteringalgorithm proceeds as follows. First, a connectivity graph is constructed with respect to thecore points, in which each node corresponds to a core point, and an edge is added betweena pair of core points, if and only if they are within a distance of Eps from one another. Notethat the graph is constructed on the data points rather than on partitioned regions, as ingrid-based algorithms. All connected components of this graph are identified. These corre-spond to the clusters constructed on the core points. The border points are then assigned tothe cluster with which they have the highest level of connectivity. The resulting groups arereported as clusters and noise points are reported as outliers. The basic DBSCAN algorithmis illustrated in Fig. 6.15. It is noteworthy that the first step of graph-based clustering isidentical to a single-linkage agglomerative clustering algorithm with termination-criterionof Eps-distance, which is applied only to the core points. Therefore, the DBSCAN algorithmmay be viewed as an enhancement of single-linkage agglomerative clustering algorithms bytreating marginal (border) and noisy points specially. This special treatment can reduce theoutlier-sensitive chaining characteristics of single-linkage algorithms without losing the abil-ity to create clusters of arbitrary shape. For example, in the pathological case of Fig. 6.9(b),the bridge of noisy data points will not be used in the agglomerative process if Eps and τare selected appropriately. In such cases, DBSCAN will discover the correct clusters in spiteof the noise in the data.

Practical Issues

The DBSCAN approach is very similar to grid-based methods, except that it uses circular regions as building blocks. The use of circular regions generally provides a smoother contour to the discovered clusters. Nevertheless, at more detailed levels of granularity, the two methods will tend to become similar. The strengths and weaknesses of DBSCAN are also similar to those of grid-based methods.

Page 130: Introduction to Data Mining - University of Minnesota

DBSCAN: core, border and noise points

[Figure: original points (left); point types: core, border and noise (right). Eps = 10, MinPts = 4.]


DBSCAN clustering

Clusters


DBSCAN clustering

Clusters

These are also clusters. They are usually eliminated by putting a minimum cluster-size threshold.


DBSCAN clustering

Original Points Clusters

• Resistant to (some) noise.

• Can handle clusters of different shapes and sizes.


DBSCAN: How much noise?


When DBSCAN does not work well

Original points; clusterings with (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92).

• Varying densities

• High-dimensional data


DBSCAN: Determining Eps and MinPts

• Idea is that for points in a cluster, their kth nearest neighbors are roughly at the same distance.

• Noise points have their kth nearest neighbor at a farther distance.

• So, plot the sorted distance of every point to its kth nearest neighbor.
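The k-distance heuristic above can be sketched as follows (the function name is mine, and this is a toy O(n²) version; the "knee" of the resulting curve is the usual candidate for Eps):

```python
def kth_nn_distances(points, k):
    """Sorted distance of every point to its kth nearest neighbor.
    A sharp bend ('knee') in this sorted curve is a common heuristic
    for choosing Eps, with MinPts = k."""
    out = []
    for i, p in enumerate(points):
        d = sorted(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
                   for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])      # distance to the kth nearest neighbor
    return sorted(out)
```

Points inside a cluster produce small, similar values at the front of the list; noise points show up as the large values at the tail.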


CLUSTER VALIDITY


Different aspects of cluster validation

• Determining the clustering tendency of a set of data:
 – Is there a non-random structure in the data?

• Comparing the results of a cluster analysis to externally known results:
 – Do the clusters contain objects of mostly a single class label?

• Evaluating how well the results of a cluster analysis fit the data without reference to external information:
 – Look at various intra- and inter-cluster data-derived properties.

• Comparing the results of two different sets of cluster analyses to determine which is better.

• The evaluation can be done for the entire clustering solution or just for selected clusters.


Measures of cluster validity

• Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
 – Internal Index (II): Used to measure the goodness of a clustering structure without respect to external information.
 • Sum of Squared Error (SSE) (or any other of the objective functions that we discussed).

 – External Index (EI): Used to measure the extent to which cluster labels match externally supplied class labels.
 • Entropy, purity, F-score, etc.

 – Relative Index (RI): Used to compare two different clusterings or clusters.
 • Often an external or internal index is used for this function, e.g., SSE or entropy.


II: Measuring cluster validity via correlation

• Two matrices:
 – Proximity (distance) matrix of the data (e.g., pair-wise cosine similarity or Euclidean distance).
 – Ideal proximity matrix that is implied by the clustering solution:
 • One row and one column for each data point.
 • An entry is 1 if the associated pair of points belongs to the same cluster.
 • An entry is 0 if the associated pair of points belongs to different clusters.

• Compute the correlation between the two matrices:
 – i.e., the correlation between the vectorized matrices.
 – (Make sure that the ordering of the data points is the same in both matrices.)

• High (low) correlation indicates that points that belong to the same cluster are close to each other.

• Not a good measure for some density- or contiguity-based clusters.
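A minimal sketch of this correlation computation (the function name is mine; it assumes a precomputed distance matrix and a list of cluster labels, and uses only the upper triangle since both matrices are symmetric):

```python
def matrix_correlation(dist, labels):
    """Pearson correlation between the vectorized distance matrix and the
    ideal membership matrix (1 if a pair shares a cluster, else 0).
    With a *distance* matrix, a good clustering yields a strong
    negative correlation (same cluster -> small distance)."""
    n = len(labels)
    xs, ys = [], []
    for i in range(n):
        for j in range(i + 1, n):          # upper triangle only
            xs.append(dist[i][j])
            ys.append(1.0 if labels[i] == labels[j] else 0.0)
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

For two perfectly separated clusters (within-cluster distance 1, across-cluster distance 10) the correlation is exactly -1.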


II: Measuring cluster validity via correlation

Correlation of ideal similarity and proximity matrices for the K-means clusterings of the following two data sets.

[Two scatter plots (x vs. y) on the unit square.]

Corr = -0.9235 (left)   Corr = -0.5810 (right)


II: Using similarity matrix for cluster validation

Order the similarity matrix with respect to cluster labels and inspect visually.
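The reordering step can be sketched as follows (the function name is mine). After permuting rows and columns so that points in the same cluster are adjacent, a good clustering shows up as bright blocks along the diagonal:

```python
def reorder_by_cluster(sim, labels):
    """Permute rows and columns of a similarity matrix so that points
    with the same cluster label become adjacent."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    return [[sim[i][j] for j in order] for i in order]
```

For example, with labels [1, 0, 1, 0] the two label-0 points move to the first two rows/columns, so within-cluster similarities form the leading block.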

[Scatter plot of the data (x vs. y) and its similarity matrix with points ordered by cluster label; similarity scale 0–1.]


Clusters found in random data

[Four panels on the unit square: Random Points, and the K-means, DBSCAN, and Complete Link clusterings of those points.]


II: Using similarity matrix for cluster validation

Clusters in random data are not so crisp.

[Similarity matrix (points ordered by cluster label) and scatter plot for the DBSCAN clustering of the random points.]


II: Using similarity matrix for cluster validation

Clusters in random data are not so crisp.

[Similarity matrix and scatter plot for the K-means clustering of the random points.]


II: Using similarity matrix for cluster validation

Clusters in random data are not so crisp.

[Similarity matrix and scatter plot for the Complete Link clustering of the random points.]


II: Using similarity matrix for cluster validation

[DBSCAN clustering with clusters labeled 1–7 and the corresponding similarity matrix, ordered by cluster label.]


II: Framework for cluster validity

• Need a framework to interpret any measure.
 – For example, if our measure of evaluation has a value of 10, is that good, fair, or poor?

• Statistics provide a framework for cluster validity.
 – The more "atypical" a clustering result is, the more likely it represents valid structure in the data.
 – Can compare the values of an index that result from random data or clusterings to those of a clustering result.
 • If the value of the index is unlikely, then the cluster results are valid.
 – These approaches are more complicated and harder to understand.

• For comparing the results of two different sets of cluster analyses, a framework is less necessary.
 – However, there is the question of whether the difference between two index values is significant.


II: Statistical framework for SSE

Example:
 – Compare SSE of 0.005 against three clusters in random data.
 – Histogram shows the SSE of three clusters in 500 sets of random data points of size 100, distributed over the range 0.2–0.8 for x and y values.
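The Monte Carlo baseline described above can be sketched as follows. All names, the tiny Lloyd's k-means, the parameter defaults, and the per-point normalization of the SSE are my own assumptions for illustration, not the exact experiment behind the slide's histogram:

```python
import random

def kmeans_sse(points, k, iters=20, seed=0):
    """Tiny Lloyd's k-means; returns the per-point SSE of the final
    clustering (normalization is an assumption, not from the slide)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2 +
                                  (p[1] - centers[c][1]) ** 2)
            groups[j].append(p)
        centers = [(sum(p[0] for p in g) / len(g),
                    sum(p[1] for p in g) / len(g)) if g else centers[j]
                   for j, g in enumerate(groups)]
    sse = sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
              for p in points)
    return sse / len(points)

def sse_reference_distribution(n_sets=500, n_points=100, k=3, seed=1):
    """SSE of k clusters over many random data sets, uniform on
    [0.2, 0.8] x [0.2, 0.8]. An observed SSE far below this
    distribution suggests non-random structure."""
    rng = random.Random(seed)
    sses = []
    for s in range(n_sets):
        pts = [(0.2 + 0.6 * rng.random(), 0.2 + 0.6 * rng.random())
               for _ in range(n_points)]
        sses.append(kmeans_sse(pts, k, seed=s))
    return sses
```

An observed value like 0.005 sitting far to the left of the reference distribution is evidence that the clusters are not an artifact of random data.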

[Histogram of SSE values (x-axis roughly 0.016–0.034) and one of the random data sets on the unit square.]


II: Statistical framework for correlation

Correlation of ideal similarity and proximity matrices for the K-means clusterings of the following two data sets.

[Two scatter plots (x vs. y) on the unit square.]

Corr = -0.9235 (left)   Corr = -0.5810 (right)


"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."

Algorithms for Clustering Data, Jain and Dubes

Final comment on cluster validity


Classification (Supervised learning)


BASIC CONCEPTS


Classification: Definition

• We are given a collection of records (training set).
 – Each record is characterized by a tuple (x, y), where x is a set of attributes and y is the class label.
 • x: set of attributes, predictors, independent variables, inputs.
 • y: class, response, dependent variable, or output.

• Task:
 – Learn a model that maps each set of attributes x into one of the predefined class labels y.


Examples of classification tasks

Task                        | Attribute set, x                                          | Class label, y
Categorizing email messages | Features extracted from email message header and content  | spam or non-spam
Identifying tumor cells     | Features extracted from MRI scans                         | malignant or benign cells
Cataloging galaxies         | Features extracted from telescope images                  | elliptical, spiral, or irregular-shaped galaxies


Building and using a classification model

Apply Model

Induction

Deduction

Learn Model

Model

Tid Attrib1 Attrib2 Attrib3 Class

1 Yes Large 125K No

2 No Medium 100K No

3 No Small 70K No

4 Yes Medium 120K No

5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No

8 No Small 85K Yes

9 No Medium 75K No

10 No Small 90K Yes

Tid Attrib1 Attrib2 Attrib3 Class

11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?

14 No Small 95K ?

15 No Large 67K ?

Test Set

Learning algorithm

Training Set


Classification techniques

• Base classifiers:
 – Decision tree-based methods.
 – Rule-based methods.
 – Nearest-neighbor.
 – Neural networks.
 – Naïve Bayes and Bayesian belief networks.
 – Support vector machines.
 – … and others.

• Ensemble classifiers:
 – Boosting, bagging, random forests, etc.


DECISION TREES

We will use this method to illustrate various concepts and issues associated with the classification task.


Example of a decision tree

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes

Training data (left); model: decision tree (right). Splitting attributes:

Home Owner?
 Yes → NO
 No → MarSt
  Married → NO
  Single, Divorced → Income
   < 80K → NO
   > 80K → YES


Example of a decision tree

MarSt?
 Married → NO
 Single, Divorced → Home Owner
  Yes → NO
  No → Income
   < 80K → NO
   > 80K → YES

There could be more than one tree that fits the same data!

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes


Decision tree classification task

Apply Model

Induction

Deduction

Learn Model

Model

Tid Attrib1 Attrib2 Attrib3 Class

1 Yes Large 125K No

2 No Medium 100K No

3 No Small 70K No

4 Yes Medium 120K No

5 No Large 95K Yes

6 No Medium 60K No

7 Yes Large 220K No

8 No Small 85K Yes

9 No Medium 75K No

10 No Small 90K Yes

Tid Attrib1 Attrib2 Attrib3 Class

11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ?

14 No Small 95K ?

15 No Large 67K ?

Test Set

Tree Induction algorithm

Training Set
Decision Tree


Apply model to test data

Test data: Home Owner = No, Marital Status = Married, Annual Income = 80K, Defaulted Borrower = ?

Start from the root of the tree:

Home Owner?
 Yes → NO
 No → MarSt
  Married → NO
  Single, Divorced → Income
   < 80K → NO
   > 80K → YES

Assign Defaulted to "No".



Building the decision tree: Tree induction

• Let D_t be the set of training records that reach a node t.

• General procedure:
 – If D_t contains records that belong to the same class y_t, then t is a leaf node labeled as y_t.
 – If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
 • Recursively apply the procedure to each subset.
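The general procedure can be sketched recursively. This toy version (names are mine) handles categorical attributes only and picks attributes in a fixed order rather than by an impurity measure, which the real algorithm would use:

```python
from collections import Counter

def hunts(records, attrs):
    """Recursive sketch of Hunt's procedure for categorical attributes.
    Each record is (attribute_dict, class_label). Returns either a leaf
    class label, or a tuple (attribute, {value: subtree})."""
    labels = [y for _, y in records]
    if len(set(labels)) == 1:          # pure node -> leaf
        return labels[0]
    if not attrs:                      # no tests left -> majority class
        return Counter(labels).most_common(1)[0][0]
    a = attrs[0]                       # naive choice of test attribute
    tree = {}
    for v in set(x[a] for x, _ in records):
        subset = [(x, y) for x, y in records if x[a] == v]
        tree[v] = hunts(subset, attrs[1:])
    return (a, tree)
```

On a tiny three-record version of the borrower data, the root splits on the first attribute and the mixed branch recurses on the second.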

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes



Hunt’s'algorithm

Stages (a)–(d), with (class No, class Yes) counts:

(a) Single leaf: Defaulted = No (7,3).
(b) Split on Home Owner: Yes → Defaulted = No (3,0); No → Defaulted = No (4,3).
(c) Then split Home Owner = No on Marital Status: Single, Divorced → Defaulted = Yes (1,3); Married → Defaulted = No (3,0).
(d) Then split Single, Divorced on Annual Income: < 80K → Defaulted = No (1,0); >= 80K → Defaulted = Yes (0,3).

Building the decision tree: Example

ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower

1 Yes Single 125K No

2 No Married 100K No

3 No Single 70K No

4 Yes Married 120K No

5 No Divorced 95K Yes

6 No Married 60K No

7 Yes Divorced 220K No

8 No Single 85K Yes

9 No Married 75K No

10 No Single 90K Yes





Design issues of decision tree induction

• How should the training records be split?
 – Method for specifying the test condition.
 • This depends on the attribute types.
 – Method for selecting which attribute and split condition to choose.
 • Need a measure for evaluating the goodness of a test condition.

• When should the splitting procedure stop?
 – Stop splitting if all the records belong to the same class or have identical attribute values.
 – Early termination.


Methods for expressing test conditions

• Depends on attribute types:
 – Binary
 – Nominal
 – Ordinal
 – Continuous

• Depends on number of ways to split:
 – 2-way split
 – Multi-way split


Test condition for nominal attributes

• Multi-way split: Use as many partitions as distinct values.
• Binary split: Divide values into two subsets.

Marital Status → {Single} {Divorced} {Married} (multi-way)

Marital Status → {Single} vs. {Married, Divorced}, OR {Married} vs. {Single, Divorced}, OR {Single, Married} vs. {Divorced} (binary)


Test condition for ordinal attributes

• Multi-way split: Use as many partitions as distinct values, e.g., Shirt Size → {Small} {Medium} {Large} {Extra Large}.

• Binary split: Divide values into two subsets while preserving the order property among attribute values, e.g., {Small, Medium} vs. {Large, Extra Large}.

• The grouping {Small, Large} vs. {Medium, Extra Large} violates the order property.


Test condition for continuous attributes

(i) Binary split: Annual Income > 80K? (Yes / No)

(ii) Multi-way split: Annual Income ∈ { < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K }


How to determine the best split?

Before splitting: 10 records of class 0 and 10 records of class 1. Which test condition is the best?

Gender (Yes / No): (C0: 6, C1: 4) vs. (C0: 4, C1: 6)
Car Type (Family / Sports / Luxury): (C0: 1, C1: 3), (C0: 8, C1: 0), (C0: 1, C1: 7)
Customer ID (c1 … c20): each leaf holds a single record, e.g., c1 (C0: 1, C1: 0), …, c20 (C0: 0, C1: 1)


How to determine the best split?

• Greedy approach:
 – Nodes with purer class distribution are preferred.

• Need a measure of node purity/impurity:

(C0: 5, C1: 5): high degree of impurity.   (C0: 9, C1: 1): low degree of impurity.


Measures of node impurity

• Gini Index:

 $GINI(t) = 1 - \sum_j [p(j|t)]^2$

• Entropy:

 $Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)$

• Misclassification error:

 $Error(t) = 1 - \max_i P(i|t)$

(Figure: the three measures as a function of class probability for a two-class problem.)
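The three impurity measures can be computed directly from the class counts at a node (function names are mine):

```python
from math import log2

def gini(counts):
    """Gini index: 1 - sum of squared class proportions."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits; empty classes contribute 0 by convention."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

def class_error(counts):
    """Misclassification error: 1 - proportion of the majority class."""
    return 1 - max(counts) / sum(counts)
```

For the examples on the previous slide, the (5,5) node is maximally impure under all three measures, while (9,1) is much purer.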


Finding the best split

1. Compute the impurity measure (P) before splitting.
2. Compute the impurity measure (M) after splitting:
 • Compute the impurity measure of each child node.
 • M is the size-weighted impurity of the children.
3. Choose the attribute test condition that produces the highest gain,

 Gain = P − M,

or equivalently, the lowest impurity measure after splitting (M).
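These three steps can be sketched as follows (function names are mine; the Gini index is used here, but any impurity function with the same signature works):

```python
def gini(counts):
    """Gini index from a list of per-class counts at a node."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def split_gain(parent, children, impurity=gini):
    """Gain = P - M, where P is the parent impurity and M is the
    size-weighted impurity of the child nodes.
    parent: class counts at the parent; children: list of child counts."""
    n = sum(parent)
    m = sum(sum(ch) / n * impurity(ch) for ch in children)
    return impurity(parent) - m
```

A perfect split of a (10,10) parent into pure children has gain 0.5 (the parent's entire Gini impurity), while a split that merely copies the parent's distribution has gain 0.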


Decision tree based classification

• Advantages:
 – Inexpensive to construct.
 – Extremely fast at classifying unknown records.
 – Easy to interpret for small-sized trees.
 – Robust to noise (especially when methods to avoid overfitting are employed).
 – Can easily handle redundant or irrelevant attributes (unless the attributes are interacting).

• Disadvantages:
 – The space of possible decision trees is exponentially large; greedy approaches are often unable to find the best tree.
 – Does not take into account interactions between attributes.
 – Each decision boundary involves only a single attribute.


OVERFITTING


Classification errors

• Training errors (apparent errors):
 – Errors committed on the training set.

• Test errors:
 – Errors committed on the test set.

• Generalization errors:
 – Expected error of a model on a randomly selected subset of records from the same distribution.


Example data set

Two-class problem:

+ : 5400 instances
 • 5000 instances generated from a Gaussian centered at (10,10).
 • 400 noisy instances added.

o : 5400 instances
 • Generated from a uniform distribution.

10% of the data used for training and 90% of the data used for testing.


Increasing number of nodes in the decision tree


Decision tree with 4 nodes

[Decision tree (left); decision boundaries on training data (right).]


Decision tree with 50 nodes

[Decision tree (left); decision boundaries on training data (right).]


Increasing number of nodes in decision trees

Decision Tree with 4 nodes

Decision Tree with 50 nodes

Which tree is better?


Model overfitting

Underfitting: when the model is too simple, both training and test errors are large.

Overfitting: when the model is too complex, the training error is small but the test error is large.


Model overfitting

Using twice the number of data instances

• If the training data is under-representative, test errors increase and training errors decrease as the number of nodes increases.

• Increasing the size of the training data reduces the difference between training and test errors at a given number of nodes.


Reasons for model overfitting

• Presence of noise.

• Lack of representative samples.

• Multiple comparison procedure.


Effect of multiple comparison procedure

• Consider the task of predicting whether the stock market will rise/fall in the next 10 trading days.

• Random guessing: P(correct) = 0.5

• Make 10 random guesses in a row:

$P(\#\text{correct} \ge 8) = \dfrac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0547$

Day 1: Up, Day 2: Down, Day 3: Down, Day 4: Up, Day 5: Down, Day 6: Down, Day 7: Up, Day 8: Up, Day 9: Up, Day 10: Down.


Effect of multiple comparison procedure

• Approach:
 – Get 50 analysts.
 – Each analyst makes 10 random guesses.
 – Choose the analyst that makes the largest number of correct predictions.

• Probability that at least one analyst makes at least 8 correct predictions:

$P(\#\text{correct} \ge 8) = 1 - (1 - 0.0547)^{50} = 0.9399$
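Both probabilities can be checked directly (a short sketch using Python's `math.comb`):

```python
from math import comb

# Probability that a single analyst gets at least 8 of 10 fair guesses right:
# (C(10,8) + C(10,9) + C(10,10)) / 2^10.
p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2 ** 10   # 0.0546875 ~= 0.0547

# Probability that at least one of 50 independent analysts does so.
p_any = 1 - (1 - p_one) ** 50                            # ~= 0.9399
```

So a "best of 50" selection turns a 5% fluke into a near-certainty, which is exactly the multiple comparison effect the slide describes.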


Effect of multiple comparison procedure

• Many algorithms employ the following greedy strategy:
 – Initial model: M.
 – Alternative model: M′ = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree).
 – Keep M′ if the improvement Δ(M, M′) > α.

• Oftentimes, γ is chosen from a set of alternative components, Γ = {γ1, γ2, …, γk}.

• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting.


Effect of multiple comparison: Example

• Using only 𝑋 and 𝑌 as attributes.

• Using an additional 100 noisy variables generated from a uniform distribution along with 𝑋 and 𝑌 as attributes.

• Use 30% of the data for training and 70% of the data for testing.

Notes on overfitting

• Overfitting results in decision trees that are more complex than necessary.

• Training error does not provide a good estimate of how well the tree will perform on previously unseen records.

• We need ways of estimating generalization errors.

Handling overfitting in decision trees

Pre-pruning (early stopping rule):
– Stop the algorithm before it becomes a fully-grown tree.
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class.
• Stop if all the attribute values are the same.
– More restrictive conditions:
• Stop if the number of instances is less than some user-specified threshold.
• Stop if the class distribution of instances is independent of the available features (e.g., using a 𝜒² test).
• Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
• Stop if the estimated generalization error falls below a certain threshold.

Handling overfitting in decision trees

Post-pruning:
– Grow the decision tree to its entirety.
– Subtree replacement:
• Trim the nodes of the decision tree in a bottom-up fashion.
• If the generalization error improves after trimming, replace the subtree by a leaf node.
• The class label of the leaf node is determined from the majority class of instances in the subtree.
– Subtree raising:
• Replace a subtree with its most frequently used branch.

Examples of post-pruning

Decision tree:

depth = 1 :
| breadth > 7 : class 1
| breadth <= 7 :
| | breadth <= 3 :
| | | ImagePages > 0.375 : class 0
| | | ImagePages <= 0.375 :
| | | | totalPages <= 6 : class 1
| | | | totalPages > 6 :
| | | | | breadth <= 1 : class 1
| | | | | breadth > 1 : class 0
| | breadth > 3 :
| | | MultiIP = 0 :
| | | | ImagePages <= 0.1333 : class 1
| | | | ImagePages > 0.1333 :
| | | | | breadth <= 6 : class 0
| | | | | breadth > 6 : class 1
| | | MultiIP = 1 :
| | | | TotalTime <= 361 : class 0
| | | | TotalTime > 361 : class 1
depth > 1 :
| MultiAgent = 0 :
| | depth > 2 : class 0
| | depth <= 2 :
| | | MultiIP = 1 : class 0
| | | MultiIP = 0 :
| | | | breadth <= 6 : class 0
| | | | breadth > 6 :
| | | | | RepeatedAccess <= 0.0322 : class 0
| | | | | RepeatedAccess > 0.0322 : class 1
| MultiAgent = 1 :
| | totalPages <= 81 : class 0
| | totalPages > 81 : class 1

Simplified decision tree (after subtree raising and subtree replacement):

depth = 1 :
| ImagePages <= 0.1333 : class 1
| ImagePages > 0.1333 :
| | breadth <= 6 : class 0
| | breadth > 6 : class 1
depth > 1 :
| MultiAgent = 0 : class 0
| MultiAgent = 1 :
| | totalPages <= 81 : class 0
| | totalPages > 81 : class 1

ENSEMBLE METHODS

Ensemble methods

• Construct a set of classifiers from the training data.

• Predict the class label of test records by combining the predictions made by multiple classifiers.

Why do ensemble methods work?

Suppose there are 25 base classifiers:
– Each classifier has error rate 𝜀 = 0.35.
– Assume the errors made by the classifiers are uncorrelated.
– The probability that the (majority-vote) ensemble classifier makes a wrong prediction, i.e., that at least 13 base classifiers err:

$$P(X \ge 13) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^i (1-\varepsilon)^{25-i} = 0.06$$
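The ensemble error rate above can be verified numerically; a minimal sketch:

```python
from math import comb

eps = 0.35  # error rate of each of the 25 uncorrelated base classifiers

# The majority-vote ensemble errs only when 13 or more base classifiers err.
p_ensemble_err = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i)
                     for i in range(13, 26))

print(round(p_ensemble_err, 2))  # 0.06
```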

General approach

Step 1: Create multiple data sets D1, D2, …, Dt-1, Dt from the original training data D.

Step 2: Build a classifier Ci on each data set Di, giving C1, C2, …, Ct-1, Ct.

Step 3: Combine the classifiers into a single ensemble classifier C*.

Types of ensemble methods

• Manipulate the data distribution.
– Resampling methods.
• Bagging and boosting.

• Manipulate the input features.
– Feature subset selection.
• Random forest: randomly select feature subsets and build decision trees.

• Manipulate the class labels.
– Randomly partition the classes into two subsets, treat them as +ve and -ve, and learn a binary classifier. Do that many times. At classification time, apply all binary classifiers and give credit to the constituent classes.

• Use different models.
– E.g., different ANN topologies.

Bagging

• Sampling with replacement.

• Build a classifier on each bootstrap sample.

• Use a majority-voting prediction approach:
– Predict an unlabeled instance using all classifiers and return the most frequently predicted class as the prediction.

Original data:      1  2  3  4  5  6  7  8  9 10
Bagging (Round 1):  7  8 10  8  2  5 10 10  5  9
Bagging (Round 2):  1  4  9  1  2  3  2  7  3  2
Bagging (Round 3):  1  8  5 10  5  5  9  6  3  7
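The procedure can be sketched as follows; the `build_classifier` interface and the majority-stump base learner are hypothetical choices made only for this illustration:

```python
import random
from collections import Counter

def bagging_predict(train, build_classifier, x, rounds=3, seed=0):
    """Train one classifier per bootstrap sample and majority-vote on x.

    `build_classifier(sample)` must return a function mapping an
    instance to a predicted label (an assumed interface for this sketch).
    """
    rng = random.Random(seed)
    votes = []
    for _ in range(rounds):
        # Bootstrap: draw |train| records with replacement.
        sample = [rng.choice(train) for _ in train]
        clf = build_classifier(sample)
        votes.append(clf(x))
    # Return the most frequently predicted class.
    return Counter(votes).most_common(1)[0][0]

# Toy base learner: a stump that predicts the majority label of its
# bootstrap sample, ignoring the instance itself.
def majority_stump(sample):
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

data = [(i, "+") for i in range(6)] + [(6, "-")]
print(bagging_predict(data, majority_stump, x=0))
```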

Boosting

• An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records.
– Initially, all 𝑁 records are assigned equal weights.
– Unlike in bagging, the weights may change at the end of each boosting round.

• The weights can be used to create a weighted loss function or to bias the selection of the sample.

Boosting

• Records that are wrongly classified will have their weights increased.

• Records that are classified correctly will have their weights decreased.

Original data:       1  2  3  4  5  6  7  8  9 10
Boosting (Round 1):  7  3  2  8  7  9  4 10  6  3
Boosting (Round 2):  5  4  9  4  2  5  1  7  4  2
Boosting (Round 3):  4  4  8 10  4  5  4  6  3  4

Example 4 is hard to classify: its weight is increased, so it is more likely to be chosen again in subsequent rounds.
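One concrete realization of this reweighting is the AdaBoost-style update; a minimal single-round sketch (the function name and interface are illustrative assumptions, not the full algorithm):

```python
import math

def adaboost_reweight(weights, misclassified, eps):
    """One AdaBoost-style reweighting round: misclassified records are
    up-weighted, correctly classified ones down-weighted, and the
    weights are then renormalized to sum to 1.  `eps` is the weighted
    error of the current round's classifier."""
    alpha = 0.5 * math.log((1 - eps) / eps)  # classifier importance
    new_w = [w * math.exp(alpha if miss else -alpha)
             for w, miss in zip(weights, misclassified)]
    total = sum(new_w)
    return [w / total for w in new_w]

w = [0.25] * 4
# Suppose only record 0 is misclassified: weighted error eps = 0.25.
w = adaboost_reweight(w, [True, False, False, False], eps=0.25)
print([round(x, 3) for x in w])  # [0.5, 0.167, 0.167, 0.167]
```

Note how the single hard record ends up carrying half of the total weight after one round.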

ARTIFICIAL NEURAL NETWORKS

Consider the following

X1 X2 X3 | Y
 1  0  0 | -1
 1  0  1 |  1
 1  1  0 |  1
 1  1  1 |  1
 0  0  1 | -1
 0  1  0 | -1
 0  1  1 |  1
 0  0  0 | -1

A black box maps the inputs X1, X2, X3 to the output Y.

Output Y is 1 if at least two of the three inputs are equal to 1.

Consider the following

The same black box (and truth table) can be realized by three input nodes X1, X2, X3 feeding an output node, with weight 0.3 on each link and threshold t = 0.4:

Y = sign(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4),

where sign(x) = +1 if x ≥ 0 and −1 if x < 0.
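This weighted-threshold model can be checked against the truth table; a minimal sketch:

```python
from itertools import product

def sign(x):
    return 1 if x >= 0 else -1

def perceptron(x1, x2, x3):
    # Weight 0.3 on each input, threshold 0.4 (bias -0.4).
    return sign(0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4)

# Y should be 1 exactly when at least two of the three inputs are 1.
for x in product((0, 1), repeat=3):
    assert perceptron(*x) == (1 if sum(x) >= 2 else -1)
print("perceptron matches the truth table")
```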

Perceptron

• The model is an assembly of inter-connected nodes and weighted links.

• The output node sums up each of its input values according to the weights of its links.

• The sum at the output node is compared against some threshold t:

$$Y = \operatorname{sign}\Big(\sum_{i=1}^{d} w_i X_i - t\Big) = \operatorname{sign}\Big(\sum_{i=0}^{d} w_i X_i\Big),$$

where the second form absorbs the threshold into a bias weight w0 = −t with constant input X0 = 1.

Perceptron

• Single-layer network:
– Contains only input and output nodes.

• Activation function: f(w, x) = sign(⟨x, w⟩).

• Applying the model is straightforward. For Y = sign(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4), where sign(x) = +1 if x ≥ 0 and −1 if x < 0:
– X1 = 1, X2 = 0, X3 = 1 ⇒ Y = sign(0.2) = 1.

Perceptron learning rule

• Initialize the weights (w0, w1, …, wd).
• Repeat:
– For each training example (xi, yi):
• Compute f(w, xi).
• Update the weights:

$$w^{(k+1)} = w^{(k)} + \lambda \left[ y_i - f(w^{(k)}, x_i) \right] x_i$$

• Until a stopping condition is met.
• The above is an example of a stochastic gradient descent optimization method.
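The learning rule above can be sketched as a pure-Python training loop, with the bias folded in as w0 via a constant input of 1:

```python
from itertools import product

def sign(x):
    return 1 if x >= 0 else -1

def train_perceptron(data, lam=0.1, epochs=1000):
    """Perceptron learning rule: w <- w + lam * (y - f(w, x)) * x.
    Each instance is augmented with a constant 1 so that w[0] acts
    as the bias (the negated threshold)."""
    d = len(data[0][0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        updated = False
        for x, y in data:
            xa = [1] + list(x)                       # bias input
            f = sign(sum(wi * xi for wi, xi in zip(w, xa)))
            if f != y:                               # error e = y - f is +/-2
                w = [wi + lam * (y - f) * xi for wi, xi in zip(w, xa)]
                updated = True
        if not updated:                              # converged: no mistakes
            break
    return w

# Learn the linearly separable "at least two of three inputs" function.
data = [(x, 1 if sum(x) >= 2 else -1) for x in product((0, 1), repeat=3)]
w = train_perceptron(data)
assert all(sign(sum(wi * xi for wi, xi in zip(w, [1] + list(x)))) == y
           for x, y in data)
print("converged:", [round(v, 2) for v in w])
```

Because the data is linearly separable, the perceptron convergence theorem guarantees the loop stops after a finite number of updates.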

Perceptron learning rule

• Weight update formula:

$$w^{(k+1)} = w^{(k)} + \lambda \left[ y_i - f(w^{(k)}, x_i) \right] x_i, \qquad \lambda: \text{learning rate}$$

• Intuition: update the weights based on the error e = yi − f(w(k), xi).
– If y = f(x, w), e = 0: no update needed.
– If y > f(x, w), e = 2: the weight must be increased so that f(x, w) will increase.
– If y < f(x, w), e = −2: the weight must be decreased so that f(x, w) will decrease.

Perceptron learning rule

• Since f(w, x) is a linear combination of the input variables, the decision boundary is linear.

• For nonlinearly separable problems, the perceptron learning algorithm will fail because no linear hyperplane can separate the data perfectly.

Nonlinearly separable data

XOR data: y = x1 ⊕ x2.

x1 x2 | y
 0  0 | -1
 1  0 |  1
 0  1 |  1
 1  1 | -1

Multilayer artificial neural networks (ANN)

Each neuron i receives inputs I1, I2, I3 over links with weights wi1, wi2, wi3, computes the weighted sum Si, and applies an activation function g (with threshold t) to produce its output Oi = g(Si).

Neurons are organized into an input layer (x1, x2, x3, x4, x5), a hidden layer, and an output layer (y).

Training an ANN means learning the weights of the neurons.

Artificial neural networks

• Various types of neural network topologies:
– Single-layered network (perceptron) versus multi-layered network.
– Feed-forward versus recurrent network.

• Various types of activation functions f:

$$Y = f\Big(\sum_i w_i X_i\Big)$$

Artificial neural networks

A multi-layer neural network can solve any type of classification task involving nonlinear decision surfaces. For example, the XOR data can be separated by a network with input nodes n1 and n2 (for x1 and x2), hidden nodes n3 and n4 (reached over weights w31, w32, w41, w42), and an output node n5 (reached over weights w53, w54) that produces y.
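That a single hidden layer suffices for XOR can be demonstrated with hand-picked weights (an illustrative choice, not weights learned by training):

```python
def step(x):
    return 1 if x >= 0 else 0

def xor_net(x1, x2):
    """A two-layer network with hand-picked weights: hidden node n3
    fires for OR, n4 fires for AND, and the output node fires for
    OR-but-not-AND, i.e., XOR."""
    n3 = step(x1 + x2 - 0.5)   # x1 OR x2
    n4 = step(x1 + x2 - 1.5)   # x1 AND x2
    y = step(n3 - n4 - 0.5)    # n3 AND NOT n4
    return 1 if y else -1

for (a, b), want in [((0, 0), -1), ((1, 0), 1), ((0, 1), 1), ((1, 1), -1)]:
    assert xor_net(a, b) == want
print("XOR solved with one hidden layer")
```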

Design issues of ANN

• Number of nodes in the input layer:
– One input node per binary/continuous attribute.
– 𝑘 or ⌈log2 𝑘⌉ nodes for each categorical attribute with 𝑘 values.

• Number of nodes in the output layer:
– One output node for a binary class problem.
– 𝑘 or ⌈log2 𝑘⌉ nodes for a 𝑘-class problem.

• Number of nodes in the hidden layer.
• Initial weights and biases.

Characteristics of ANN

• Multilayer ANNs are universal function approximators but could suffer from overfitting if the network is too large.

• Gradient descent may converge to a local minimum.

• Model building can be very time consuming, but applying the model can be very fast.

• Can handle redundant attributes because the weights are automatically learned.

• Sensitive to noise in the training data.

• Difficult to handle missing attributes.

Recent noteworthy developments in ANN

• Use in deep learning and unsupervised feature learning.
– Seek to automatically learn a good representation of the input from unlabeled data.

• Google Brain project:
– Learned the concept of a 'cat' by looking at unlabeled pictures from YouTube.
– A one-billion-connection network.

Purpose-built neural networks

• Convolutional neural networks:
– Deep networks that are designed to extract successively more complicated features from 1D, 2D, and 3D signals (i.e., audio, images, video).

Purpose-built neural networks

• Networks that are specifically designed to model arbitrary-length sequences and non-local dependencies:
– Recurrent neural networks.
– Bi-directional recurrent neural networks.
– Long short-term memory (LSTM) networks.

• Good for language modeling and various biological applications.

SUPPORT VECTOR MACHINES

Separating hyperplanes

Find a linear hyperplane (decision boundary) that separates the data.

Separating hyperplanes

One possible solution: hyperplane B1.

Separating hyperplanes

Another possible solution: hyperplane B2.

Separating hyperplanes

Other possible solutions.

Separating hyperplanes

• Which one is better, B1 or B2?
• How do we define "better"?

Support vector machines (SVM)

Find the hyperplane that maximizes the margin: 𝐵1 (with margin boundaries b11 and b12) is better than 𝐵2 (with margin boundaries b21 and b22).

Support vector machines

The decision boundary B1 is the hyperplane wxᵀ + b = 0. The vector w is normal to the separating hyperplane: let x and y be two points on the hyperplane. Then

$$wx^T + b = 0 \quad \text{and} \quad wy^T + b = 0,$$

and therefore

$$w(x - y)^T = 0,$$

which indicates that w is orthogonal to the vector x − y, which lies on the hyperplane.

Classification is performed as follows:

$$f(x) = \begin{cases} +1 & \text{if } wx^T + b \ge 0 \\ -1 & \text{if } wx^T + b < 0 \end{cases}$$

Model estimation

• The goal is to find the parameters w and b (i.e., the model's parameters) such that the hyperplane separates the classes and maximizes the margin.

• We know how to measure classification accuracy, but how do we measure the margin?

• Let (w, b) be the parameters of a hyperplane that is in the "middle" between the two classes. We can rescale (w, b) so that

$$f(x) = \begin{cases} +1 & \text{if } wx^T + b \ge +1 \\ -1 & \text{if } wx^T + b \le -1 \end{cases}$$

• Let x and y be two points such that

$$wx^T + b = +1 \quad \text{and} \quad wy^T + b = -1,$$

that is, these points are the positive and negative instances closest to the hyperplane, respectively. Then

$$w(x - y)^T = 2 \;\Rightarrow\; \|w\|\,\|x - y\|\cos(w, x - y) = 2 \;\Rightarrow\; \|w\| \cdot \text{margin} = 2 \;\Rightarrow\; \text{margin} = \frac{2}{\|w\|}.$$

Support vector machines

For the decision boundary B1 given by wxᵀ + b = 0, the margin boundaries b11 and b12 are the parallel hyperplanes wxᵀ + b = +1 and wxᵀ + b = −1 on either side of it.

Model estimation

• The optimization problem is formulated as follows:

$$\max_{w,b}\; \frac{2}{\|w\|} \quad \text{s.t.}\quad wx_i^T + b \ge +1 \text{ if } x_i \text{ is +ve}, \qquad wx_i^T + b \le -1 \text{ if } x_i \text{ is -ve}$$

• If we let yi be +1 or −1 when xi is +ve or -ve, respectively, then the above can be concisely written in a standard minimization form:

$$\min_{w,b}\; \|w\|^2 \quad \text{s.t.}\quad y_i(wx_i^T + b) \ge 1 \;\; \forall x_i$$

• This is a constrained quadratic optimization problem, which is convex and can be solved efficiently using Lagrange multipliers by minimizing the following function:

$$L_p = \|w\|^2 - \sum_i \alpha_i \big( y_i(wx_i^T + b) - 1 \big),$$

where the αi ≥ 0 are called Lagrange multipliers.

Model estimation

• The dual Lagrangian is used for solving this problem, which can be shown to be:

$$L_D = \sum_i \alpha_i - \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i x_j^T.$$

Since this is the dual of the primal optimization problem, the problem now becomes a maximization problem.

• At the optimal solution of the primal/dual problem we have:

$$w = \sum_i \alpha_i y_i x_i.$$

• Most of the αi's are 0, and the non-zero αi's are those that define the w vector. They correspond to the training examples that lie exactly on the margin hyperplanes, i.e., for which yi(wxiᵀ + b) = 1. These training examples are called the support vectors.

• A test instance z is classified as +ve or -ve based on

$$f(z) = \operatorname{sign}(wz^T + b) = \operatorname{sign}\Big( \sum_i \alpha_i y_i x_i z^T + b \Big).$$

Example of linear SVM

x1      x2      y    λ
0.3858  0.4687  +1   65.5261
0.4871  0.6110  -1   65.5261
0.9218  0.4103  -1   0
0.7382  0.8936  -1   0
0.1763  0.0579  +1   0
0.4057  0.3529  +1   0
0.9355  0.8132  -1   0
0.2146  0.0099  +1   0

The two records with non-zero multipliers λ are the support vectors.

Support vector machines

What if the problem is not linearly separable?

Non-separable case

• Non-linearly separable cases are handled by introducing a slack variable ξi for each training instance and solving the following optimization problem:

$$\min_{w,b,\xi_i}\; \|w\|^2 + c \sum_i \xi_i \quad \text{s.t.}\quad wx_i^T + b \ge +1 - \xi_i \text{ if } x_i \text{ is +ve}, \quad wx_i^T + b \le -1 + \xi_i \text{ if } x_i \text{ is -ve}, \quad \xi_i \ge 0$$

• … or by using a non-linear hyperplane.

• … or by doing both.

Nonlinear support vector machines

What if the decision boundary is not linear?

Nonlinear support vector machines

Transform the data into a higher-dimensional space.

Decision boundary:

$$\Phi(x)w^T + b = 0$$

Nonlinear SVMs

Mapping from the original space to a different space can make things separable.

Learning non-linear SVMs

• The dual Lagrangian

$$L_D = \sum_i \alpha_i - \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i x_j^T$$

now becomes

$$L_D = \sum_i \alpha_i - \sum_{i,j} \alpha_i \alpha_j y_i y_j \Phi(x_i)\Phi(x_j)^T.$$

• A test instance z is classified as +ve or -ve based on

$$f(z) = \operatorname{sign}\Big( \sum_i \alpha_i y_i \Phi(x_i)\Phi(z)^T + b \Big).$$

• The matrix K such that K(xi, xj) = Φ(xi)Φ(xj)ᵀ is called the kernel matrix.

• Non-linear SVMs only require such a kernel matrix. One can derive kernel matrices that correspond to extremely high-dimensional mappings Φ while operating entirely in the original space. This is called the kernel trick.

Kernel trick

Examples: the polynomial kernel K(x, z) = (xzᵀ + 1)ᵖ and the Gaussian (RBF) kernel K(x, z) = exp(−‖x − z‖²/(2σ²)); the latter corresponds to an infinite-dimension polynomial expansion.
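The kernel trick can be illustrated with the degree-2 polynomial kernel, whose explicit feature map for 2-D inputs is small enough to write out (the map below is one standard choice of Φ, used only for this sketch):

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (x.z + 1)^2."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** 2

def phi(x):
    """Explicit 6-D feature map for the 2-D degree-2 polynomial kernel,
    chosen so that phi(x).phi(z) = K(x, z)."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

x, z = (0.5, -1.0), (2.0, 0.25)
k_direct = poly_kernel(x, z)                           # works in 2-D space
k_mapped = sum(a * b for a, b in zip(phi(x), phi(z)))  # works in 6-D space
assert abs(k_direct - k_mapped) < 1e-12
print(k_direct)  # 3.0625
```

The kernel evaluates the 6-dimensional dot product without ever leaving the original 2-dimensional space.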

Example of nonlinear SVM

SVM with polynomial degree 2 kernel

Learning nonlinear SVM

• Advantages of using kernels:
– We do not have to know the mapping function Φ.
– Computing the dot product Φ(xi) • Φ(xj) in the original space avoids the curse of dimensionality.

• The kernel function can be considered as a measure of similarity between objects and used to encode key information about the classification problem.

• Not all functions can be kernels:
– We must make sure there is a corresponding Φ in some high-dimensional space.
– Mercer's theorem.

Characteristics of SVM

• The learning problem is formulated as a convex optimization problem, so efficient algorithms are available to find the global minimum of the objective function.

• Overfitting is addressed by maximizing the margin of the decision boundary, but the user still needs to provide the type of kernel function and the cost function.

• Difficult to handle missing values.

• Robust to noise.

• High computational complexity for building the model.

RIDGE REGRESSION & COORDINATE DESCENT

Linear regression task

• We are given a collection of records (training set).
– Each record is characterized by a tuple (x, y), where x is a set of numerical attributes and y is a value.

• Goal:
– We want to learn a vector w such that ⟨x, w⟩ approximates y in a least-squares sense.

Linear regression and normal equations

Let X be an n × m matrix whose rows correspond to the records and whose columns correspond to the attributes. Let y be an n × 1 vector of the known target values of the records in X. The solution to the linear regression problem is the vector w such that

$$\min_{w}\; \|Xw - y\|^2.$$

The solution to the above problem is given by

$$w = (X^T X)^{-1} X^T y.$$

However, this is not how we usually solve it.
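For a tiny problem the normal equations can be solved by hand; a minimal sketch with m = 2 attributes:

```python
# Solve a tiny least-squares problem via the normal equations
# w = (X^T X)^{-1} X^T y, written out explicitly for m = 2 attributes.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]   # column 0 is a constant term
y = [1.0, 3.0, 5.0]                        # exactly y = 1 + 2*x

# Form A = X^T X (2x2) and b = X^T y (2x1).
A = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(2)]

# Invert the symmetric 2x2 system directly.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
w = [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
     (A[0][0] * b[1] - A[0][1] * b[0]) / det]
print([round(v, 6) for v in w])  # [1.0, 2.0]
```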

Ridge regression

In order to prevent overfitting, we add a regularization penalty and estimate w as follows:

$$\min_{w}\; \|Xw - y\|^2 + \lambda \|w\|^2,$$

where λ is a user-supplied parameter that controls overfitting. This type of regression is called ridge regression.

Estimating w

• There are many ways to solve the optimization problem for estimating w. Coordinate descent is probably the simplest method.

• It consists of a set of outer iterations. In each outer iteration, it performs m steps (one for each of the dimensions in w). During the i-th step, it optimizes the value of the objective function by fixing all but the wi variable. This optimization is performed by taking the partial derivative of the objective function with respect to wi, setting it to 0, and solving for wi. That value of wi is the new value for that variable. The entire process converges when the error does not decrease substantially between successive outer iterations.

• Non-negativity in the model can be enforced by setting any negative wi values to 0 during the inner iterations.
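The procedure can be sketched as follows (a pure-Python illustration of the per-coordinate update obtained by zeroing the partial derivative, as described above; not production code):

```python
def ridge_coordinate_descent(X, y, lam=0.1, outer_iters=200, tol=1e-10):
    """Coordinate descent for ridge regression.  Zeroing the partial
    derivative of ||Xw - y||^2 + lam*||w||^2 with respect to w_i gives
        w_i <- sum_r x_ri * r_r / (sum_r x_ri^2 + lam),
    where r_r = y_r - sum_{j != i} x_rj * w_j is the partial residual."""
    n, m = len(X), len(X[0])
    w = [0.0] * m

    def error():
        return sum((sum(X[r][j] * w[j] for j in range(m)) - y[r]) ** 2
                   for r in range(n)) + lam * sum(v * v for v in w)

    prev = error()
    for _ in range(outer_iters):
        for i in range(m):
            # Partial residuals with coordinate i held out.
            resid = [y[r] - sum(X[r][j] * w[j] for j in range(m) if j != i)
                     for r in range(n)]
            num = sum(X[r][i] * resid[r] for r in range(n))
            den = sum(X[r][i] ** 2 for r in range(n)) + lam
            w[i] = num / den
        cur = error()
        if prev - cur < tol:   # stop when the error no longer decreases
            break
        prev = cur
    return w

X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
w = ridge_coordinate_descent(X, y, lam=0.01)
print([round(v, 3) for v in w])  # close to [1.0, 2.0] for small lam
```

Because the ridge objective is strictly convex, each per-coordinate step can only decrease the error, so the outer loop converges to the unique minimizer.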
