Introduction to Data Mining
Large-scale data is everywhere!
• There has been enormous data growth in both commercial and scientific databases due to advances in data generation and collection technologies.
• New mantra:
  – Gather whatever data you can, whenever and wherever possible.
• Expectations:
  – Gathered data will have value, either for the purpose for which it was collected or for a purpose not originally envisioned.
Examples: computational simulations, business data, sensor networks, geo-spatial data, homeland security.
Why data mining? (Commercial viewpoint)
• Lots of data is being collected and warehoused:
  – Web data: Yahoo has petabytes of web data; Facebook has ~2B active users.
  – Purchases at department/grocery stores and e-commerce: Amazon records 1.1B orders a year.
  – Bank/credit card transactions.
• Computers have become cheaper and more powerful.
• Competitive pressure is strong: provide better, customized services for an edge (e.g., in Customer Relationship Management).
Why data mining? (Scientific viewpoint)
• Data is collected and stored at enormous speeds:
  – Remote sensors on a satellite: NASA EOSDIS archives over 1 petabyte of earth science data per year.
  – Telescopes scanning the skies: sky survey data.
  – High-throughput biological data.
  – Scientific simulations: terabytes of data generated in a few hours.
• Data mining helps scientists:
  – In automated analysis of massive data sets.
  – In hypothesis formation.
What is data mining?
Many definitions:
– Non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns.
Origins of data mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems.
• Traditional techniques may be unsuitable due to data that is:
  – Large-scale
  – High dimensional
  – Heterogeneous
  – Complex
  – Distributed
• Key distinction: data-driven vs. hypothesis-driven.
Data mining tasks
• Prediction tasks: use some variables to predict unknown or future values of other variables.
• Description tasks: find human-interpretable patterns that describe the data.
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Example data for these tasks:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Data mining methods
Predictive modeling: Classification
Find a model for the class attribute as a function of the values of the other attributes.
Training data:

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

(Figure: a decision-tree model for predicting credit worthiness, with splits on Employed, Level of Education, and Number of years at present address, and Yes/No leaf labels.)
Examples of classification
• Predicting tumor cells as benign or malignant.
• Classifying credit card transactions as legitimate or fraudulent.
• Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil.
• Categorizing news stories as finance, weather, entertainment, sports, etc.
• Identifying intruders in cyberspace.
Clustering
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Inter-cluster distances are maximized.
• Intra-cluster distances are minimized.
Applications of clustering
• Understanding:
  – Customer profiling for targeted marketing.
  – Grouping related documents for browsing.
  – Grouping genes and proteins that have similar functionality.
  – Grouping stocks with similar price fluctuations.
• Summarization:
  – Reducing the size of large data sets.
Clusters for Raw SST and Raw NPP
(Figure: a longitude-latitude world map partitioned into Sea Cluster 1, Sea Cluster 2, Ice or No NPP, Land Cluster 1, and Land Cluster 2.)
Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.
Courtesy: Michael Eisen
Association rule discovery
Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
Association analysis: Applications
• Market-basket analysis: rules are used for sales promotion, shelf management, and inventory management.
• Telecommunication alarm diagnosis: rules are used to find combinations of alarms that occur together frequently in the same time period.
• Medical informatics: rules are used to find combinations of patient symptoms and test results associated with certain diseases.
Motivating challenges
• Scalability.
• High dimensionality.
• Heterogeneous and complex data.
• Data ownership and distribution.
• Non-traditional analysis.
The 4 V’s of “Big Data”
Pattern Mining
ASSOCIATION RULES
Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Examples of association rules:
{Diaper} → {Beer},  {Milk, Bread} → {Eggs, Coke},  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset
• Itemset: a collection of one or more items.
  – Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items.
• Support count (σ): frequency of occurrence of an itemset.
  – E.g., σ({Milk, Bread, Diaper}) = 2.
• Support (s): fraction of transactions that contain an itemset.
  – E.g., s({Milk, Bread, Diaper}) = 2/5.
• Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Association Rule
• Association rule: an implication expression of the form X → Y, where X and Y are itemsets.
  – Example: {Milk, Diaper} → {Beer}
• Rule evaluation metrics:
  – Support (s): fraction of transactions that contain both X and Y.
  – Confidence (c): measures how often items in Y appear in transactions that contain X. It is nothing more than P(Y | X).

Example: {Milk, Diaper} ⇒ {Beer}
  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
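To make the two metrics concrete, here is a minimal Python sketch (the helper names are illustrative, not from the slides) that reproduces the numbers above on these five transactions:

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    # sigma(itemset): number of transactions that contain the itemset
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y, transactions):
    # support and confidence of the rule X -> Y
    s = support_count(X | Y, transactions) / len(transactions)
    c = support_count(X | Y, transactions) / support_count(X, transactions)
    return s, c

print(rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions))   # (0.4, 0.666...)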
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all rules having:
1) support ≥ minsup threshold, and
2) confidence ≥ minconf threshold.

Examples of rules:
{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
An approach…
1. List all possible association rules.
2. Compute the support and confidence for each rule.
3. Prune rules that fail the minsup and minconf thresholds.
Computational Complexity
Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

  R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules.
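The closed form can be checked with a few lines of Python (standard library only):

from math import comb

def num_rules(d):
    # brute-force count of all rules X -> Y over d items
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(num_rules(d), 3**d - 2**(d + 1) + 1)   # 602 602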
Mining Association Rules

Examples of rules:
{Milk, Diaper} → {Beer}  (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}  (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}  (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}  (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}  (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}  (s=0.4, c=0.5)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Observations:
• All of the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}.
• Rules originating from the same itemset have identical support but can have different confidence.
• Thus, we may decouple the support and confidence requirements.
Mining association rules
Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

Frequent itemset generation is still expensive.
Frequent itemset generation strategies
• Reduce the number of candidates (M):
  – Complete search: M = 2^d.
  – Use pruning techniques to reduce M.
• Reduce the number of transactions (N):
  – Reduce the size of N as the size of the itemsets increases.
  – Used by DHP and vertical-based mining algorithms.
• Reduce the number of comparisons (NM):
  – Use efficient data structures to store the candidates or transactions.
  – No need to match every candidate against every transaction.
Pattern Lattice
(Figure: the itemset lattice over items {A, B, C, D, E}, from the empty set (null) up to ABCDE.)
Given d items, there are 2^d possible candidate itemsets.
Reducing the number of candidates
• Observation:
  – If an itemset is frequent, then all of its subsets must also be frequent.
• This holds due to the following property of the support measure:
  – The support of an itemset never exceeds the support of its subsets.
  – This is known as the anti-monotone property of support.

  ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Illustrating support's anti-monotonicity
(Figure: the itemset lattice over {A, B, C, D, E}; once an itemset is found to be infrequent, all of its supersets (the "pruned supersets") can be eliminated from the search space.)
Illustrating support's anti-monotonicity
Minimum support = 3

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1
Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:
Itemset        Count
Bread, Milk    3
Bread, Beer    2
Bread, Diaper  3
Milk, Beer     2
Milk, Diaper   3
Beer, Diaper   3
Triplets (3-itemsets):
Itemset              Count
Beer, Diaper, Milk   2
Beer, Bread, Diaper  2
Bread, Diaper, Milk  2
Beer, Bread, Milk    1

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidate itemsets.
With support-based pruning: 6 + 6 + 4 = 16; generating triplet candidates only from the frequent pairs leaves 6 + 6 + 1 = 13.
APRIORI

Apriori algorithm
– F_k: frequent k-itemsets
– L_k: candidate k-itemsets

Algorithm:
– Let k = 1.
– Generate F_1 = frequent 1-itemsets.
– Repeat until F_k is empty:
  1. Candidate generation: generate L_{k+1} from F_k.
  2. Candidate pruning: prune candidate itemsets in L_{k+1} containing subsets of length k that are infrequent.
  3. Support counting: count the support of each candidate in L_{k+1} by scanning the DB.
  4. Candidate elimination: eliminate candidates in L_{k+1} that are infrequent, leaving only those that are frequent, giving F_{k+1}.
(Figure: the itemset lattice over {A, B, C, D, E}.)
Apriori performs a level-by-level traversal of the lattice.
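As a rough illustration of this loop, here is a minimal, unoptimized Apriori sketch in Python; the helper logic is illustrative, and real implementations use the F_{k-1} x F_{k-1} candidate generation and the hash-tree support counting described next:

from itertools import combinations

def apriori(transactions, minsup_count):
    # transactions: list of frozensets; returns {frequent itemset: support count}
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {s: support(s) for s in items if support(s) >= minsup_count}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Candidate generation: join pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Candidate pruning: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Support counting and candidate elimination (one scan of the database).
        frequent = {c: support(c) for c in candidates if support(c) >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

db = [frozenset(t) for t in (["Bread", "Milk"], ["Beer", "Bread", "Diaper", "Eggs"],
      ["Beer", "Coke", "Diaper", "Milk"], ["Beer", "Bread", "Diaper", "Milk"],
      ["Bread", "Coke", "Diaper", "Milk"])]
print(apriori(db, minsup_count=3))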
Candidate generation: the F_{k−1} × F_{k−1} method
• Merge two frequent (k−1)-itemsets if their first (k−2) items are identical.
• F_3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE}
  – Merge(ABC, ABD) = ABCD
  – Merge(ABC, ABE) = ABCE
  – Merge(ABD, ABE) = ABDE
  – Do not merge (ABD, ACD) because they share only a prefix of length 1 instead of length 2.
Candidate pruning
• Let F_3 = {ABC, ABD, ABE, ACD, BCD, BDE, CDE} be the set of frequent 3-itemsets.
• L_4 = {ABCD, ABCE, ABDE} is the set of candidate 4-itemsets generated (from the previous slide).
• Candidate pruning:
  – Prune ABCE because ACE and BCE are infrequent.
  – Prune ABDE because ADE is infrequent.
• After candidate pruning: L_4 = {ABCD}.
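The merge-and-prune steps above can be written compactly when itemsets are kept as sorted tuples (function names are illustrative):

from itertools import combinations

def generate_candidates(freq_kminus1):
    # F_{k-1} x F_{k-1}: merge itemsets (sorted tuples) sharing the first k-2 items
    freq = sorted(freq_kminus1)
    candidates = []
    for i in range(len(freq)):
        for j in range(i + 1, len(freq)):
            a, b = freq[i], freq[j]
            if a[:-1] == b[:-1]:                  # identical (k-2)-prefix
                candidates.append(tuple(sorted(set(a) | set(b))))
    return candidates

def prune_candidates(candidates, freq_kminus1):
    # keep a candidate only if all of its (k-1)-subsets are frequent
    freq = set(freq_kminus1)
    return [c for c in candidates
            if all(s in freq for s in combinations(c, len(c) - 1))]

F3 = [("A","B","C"), ("A","B","D"), ("A","B","E"), ("A","C","D"),
      ("B","C","D"), ("B","D","E"), ("C","D","E")]
L4 = generate_candidates(F3)       # ABCD, ABCE, ABDE
print(prune_candidates(L4, F3))    # [('A', 'B', 'C', 'D')]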
Support counting of candidate itemsets
Scan the database of transactions to determine the support of each candidate itemset.
– A naive approach must match every candidate itemset against every transaction, which is an expensive operation.

TID  Items
1    Bread, Milk
2    Beer, Bread, Diaper, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Bread, Coke, Diaper, Milk

Candidate itemsets: {Beer, Diaper, Milk}, {Beer, Bread, Diaper}, {Bread, Diaper, Milk}, {Beer, Bread, Milk}

Q: How should we perform this operation?

To reduce the number of comparisons, store the candidate itemsets in a hash structure.
– Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets.
(Figure: N transactions are matched against a hash structure with k buckets of candidates.)
Support counting: An example
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

How many of these itemsets are supported by the transaction {1, 2, 3, 5, 6}?

(Figure: a three-level enumeration of all 3-item subsets of the transaction {1, 2, 3, 5, 6}: 1 2 3, 1 2 5, 1 2 6, 1 3 5, 1 3 6, 1 5 6, 2 3 5, 2 3 6, 2 5 6, 3 5 6.)

This is a "full" n-ary tree, where n is the number of items.
Q: Can we reduce the storage requirements?
Support counting using a hash tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
• A hash function (here: items 1, 4, 7 hash to one branch; 2, 5, 8 to another; 3, 6, 9 to the third).
• A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node).
(Figure: the hash tree built over the 15 candidates with this hash function.)
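As a highly simplified stand-in for the hash tree, the sketch below enumerates the k-subsets of each transaction and looks them up in the stored candidate set; the structure and names are illustrative only:

from collections import defaultdict
from itertools import combinations

def count_support(transactions, candidates, k):
    # the stored candidate set plays the role of the hash-tree leaves
    stored = {frozenset(c) for c in candidates}
    counts = defaultdict(int)
    for t in transactions:
        for subset in combinations(sorted(t), k):
            if frozenset(subset) in stored:
                counts[frozenset(subset)] += 1
    return counts

candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6), (2,3,4),
              (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
print(count_support([(1, 2, 3, 5, 6)], candidates, k=3))
# the transaction supports three of the candidates: {1,2,5}, {1,3,6}, {3,5,6}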
Factors affecting the complexity of Apriori
MAXIMAL & CLOSED ITEMSETS

Maximal frequent itemset
(Figure: the itemset lattice over {A, B, C, D, E} with a border separating the frequent itemsets from the infrequent ones; the maximal itemsets lie just inside the border.)
An itemset is maximal frequent if it is frequent and none of its immediate supersets is frequent.
Closed itemsets
• An itemset X is closed if all of its immediate supersets have a lower support than X.
• Itemset X is not closed if at least one of its immediate supersets has the same support as X.

TID  Items
1    A, B
2    B, C, D
3    A, B, C, D
4    A, B, D
5    A, B, C, D

Itemset  Support        Itemset     Support
A        4              A, B, C     2
B        5              A, B, D     3
C        3              A, C, D     2
D        4              B, C, D     2
A, B     4              A, B, C, D  2
A, C     2
A, D     3
B, C     3
B, D     4
C, D     3
Maximal vs. closed frequent itemsets
(Figure: the itemset lattice over {A, B, C, D, E}, annotated with the transaction IDs supporting each itemset; closed and maximal itemsets are marked.)

Minimum support = 2
# Closed = 9
# Maximal = 4

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE
Frequent, maximal, and closed itemsets
(Figure: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.)

Q1: What if, instead of finding the frequent itemsets, we find only the maximal frequent itemsets or the closed frequent itemsets?
Q2: Does knowledge of just the maximal frequent itemsets allow us to generate all required association rules?
Q3: Does knowledge of just the closed frequent itemsets allow us to generate all required association rules?
BEYOND LEVEL-BY-LEVEL EXPLORATION

(Figure: the itemset lattice over {A, B, C, D} organized (a) as a prefix tree and (b) as a suffix tree.)

Traversing the pattern lattice
• Patterns starting with A (patterns that contain A and any other item).
• Patterns starting with B (patterns that contain B and any other item except A).
• Patterns ending with D (patterns that contain D and any other item).
• Patterns ending with C (patterns that contain C and any other item except D).
Breadth-first vs. depth-first
(Figure: the itemset lattice over {a, b, c, d} explored breadth-first, level by level, and depth-first, one branch at a time.)
Plusses and minuses?
PROJECTION METHODS

Projection-based methods
(Figure: the itemset lattice over {A, B, C, D, E}, with a projected database attached to each node.)

Initial database:
TID  Items
1    A, B
2    B, C, D
3    A, C, D, E
4    A, D, E
5    A, B, C
6    A, B, C, D
7    B, C
8    A, B, C
9    A, B, D
10   B, C, E

Database associated with node A (the "projected DB"):
TID  Items
1    B
3    C, D, E
4    D, E
5    B, C
6    B, C, D
8    B, C
9    B, D

Database associated with node C:
TID  Items
2    D
3    D, E
6    D
10   E

A projected DB on prefix pattern X is obtained as follows:
• Eliminate any transactions that do not contain X.
• From the transactions that are left, retain only the items that are lexicographically greater than the items in X.
Projection-based method
• Items are listed in lexicographic order.
• Let P and DB(P) be a node's pattern and its associated projected database.
• Mining is performed by recursively calling the function TP(P, DB(P)):
  1. Determine the frequent items in DB(P), and denote them by E(P).
  2. Eliminate from DB(P) any items not in E(P).
  3. For each item x in E(P), call TP(Px, DB(Px)).
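A minimal Python sketch of this recursion (the names follow the slide's TP / DB(P) / E(P) notation; everything else is illustrative):

from collections import Counter

def project(db, x):
    # projected DB on item x: keep transactions containing x,
    # then keep only items lexicographically greater than x
    return [[i for i in t if i > x] for t in db if x in t]

def tp(P, db, minsup, results):
    # recursively mine all frequent patterns extending the prefix pattern P
    counts = Counter(i for t in db for i in set(t))
    EP = {i for i, c in counts.items() if c >= minsup}       # frequent items E(P)
    db = [[i for i in t if i in EP] for t in db]             # eliminate items not in E(P)
    for x in sorted(EP):
        results[tuple(P + [x])] = counts[x]                  # support of P extended by x
        tp(P + [x], project(db, x), minsup, results)

db = [["A","B"], ["B","C","D"], ["A","C","D","E"], ["A","D","E"], ["A","B","C"],
      ["A","B","C","D"], ["B","C"], ["A","B","C"], ["A","B","D"], ["B","C","E"]]
results = {}
tp([], db, minsup=3, results=results)
print(results)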
BEYOND TRANSACTIONS

Beyond transaction datasets
• The concept of frequent patterns and association rules has been generalized to different types of datasets:
  – Sequential datasets: sequences of purchasing transactions, web pages visited, articles read, biological sequences, event logs, etc.
  – Relational/graph datasets: social networks, chemical compounds, web graphs, information networks, etc.
• There is an extensive set of approaches and algorithms for these, many of which follow ideas similar to those developed for transaction datasets.
Clustering (Unsupervised learning)

What is cluster analysis?
Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
• Inter-cluster distances are maximized.
• Intra-cluster distances are minimized.
Notion of a cluster can be ambiguous
How many clusters?
(Figure: the same set of points interpreted as two clusters, four clusters, or six clusters.)
Clustering formulations
A number of clustering formulations have been developed:
1. We need to find a fixed number of clusters.
   – Well-suited for compression-like applications.
2. We need to find clusters of fixed size.
   – Well-suited for neighborhood discovery (e.g., recommendation engines).
3. We need to find the smallest number of clusters that satisfy certain quality criteria.
   – Well-suited for applications in which cluster quality is important.
4. We need to find the natural number of clusters.
   – This is clustering's holy grail!
   • Extremely hard, problem dependent, and "quite supervised".
Types of clusterings
• A clustering is a set of clusters.
• Important distinction between hierarchical and partitional sets of clusters.
• Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
• Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.
Partitional clustering
(Figure: original points and a partitional clustering of them.)

Hierarchical clustering
(Figure: a hierarchical clustering of points p1–p4 and the corresponding dendrogram.)
Other distinctions between sets of clusters
• Exclusive versus non-exclusive:
  – In non-exclusive clusterings, points may belong to multiple clusters.
  – Can represent multiple classes or "border" points.
• Fuzzy versus non-fuzzy:
  – In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1.
  – Weights must sum to 1.
  – Probabilistic clustering has similar characteristics.
• Partial versus complete:
  – In some cases, we only want to cluster some of the data.
• Heterogeneous versus homogeneous:
  – Clusters of widely different sizes, shapes, and densities.
Types of clusters
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or conceptual
• Described by an objective function

Types of clusters: Well-separated
Well-separated clusters: a cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
(Figure: three well-separated clusters.)

Types of clusters: Center-based
Center-based: a cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of the cluster.
(Figure: four center-based clusters.)

Types of clusters: Contiguity-based
Contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
(Figure: eight contiguous clusters.)

Types of clusters: Density-based
Density-based: a cluster is a dense region of points, which is separated by low-density regions from other regions of high density. Used when the clusters are irregular or intertwined, and when noise and outliers are present.
(Figure: six density-based clusters.)

Types of clusters: Conceptual clusters
Shared property or conceptual clusters: find clusters that share some common property or represent a particular concept.
(Figure: two overlapping circles.)

Types of clusters: Objective function
Clusters defined by an objective function:
– Find clusters that minimize or maximize an objective function.
– Enumerate all possible ways of dividing the points into clusters and evaluate the "goodness" of each potential set of clusters using the given objective function. (NP-hard)
– Can have global or local objectives:
  • Hierarchical clustering algorithms typically have local objectives.
  • Partitional algorithms typically have global objectives.
– A variation of the global-objective approach is to fit the data to a parameterized model:
  • Parameters for the model are determined from the data.
  • Mixture models assume that the data is a "mixture" of a number of statistical distributions.
Clustering requirements
The fundamental requirement for clustering is the availability of a function to determine the similarity or distance between objects in the database.
The user must be able to answer some of the following questions:
1. When should two objects belong to the same cluster?
2. What should the clusters look like (i.e., what types of objects should they contain)?
3. What are the object-related characteristics of good clusters?
Data characteristics & clustering
• Type of proximity or density measure:
  – Central to clustering.
  – Depends on data and application.
• Data characteristics that affect proximity and/or density:
  – Dimensionality (sparseness).
  – Attribute type.
  – Special relationships in the data (for example, autocorrelation).
  – Distribution of the data.
• Noise and outliers:
  – Often interfere with the operation of the clustering algorithm.
BASIC CLUSTERING ALGORITHMS
1. K-means
2. Hierarchical clustering
3. Density-based clustering

K-means clustering
• Partitional clustering approach.
• The number of clusters, K, must be specified.
• Each cluster is associated with a centroid (center point/object).
• Each point is assigned to the cluster with the closest centroid.
• The basic algorithm is very simple.
Example of K-means clustering
(Figure: iterations 1–6 of K-means on a two-dimensional dataset; the centroids and cluster assignments are updated at each iteration until they stabilize.)
K-means clustering – Details
• Initial centroids are often chosen randomly.
  – Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• "Closeness" is measured by Euclidean distance, cosine similarity, correlation, etc.
• K-means will converge for the common similarity measures mentioned above.
• Most of the convergence happens in the first few iterations.
  – Often the stopping condition is changed to "until relatively few points change clusters".
• Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
K-means clustering – Objective
Let o_1, …, o_n be the set of objects to be clustered, k the number of desired clusters, p the clustering indicator vector such that p_i is the cluster number that the i-th object belongs to, and c_i the centroid of the i-th cluster.

In the case of Euclidean distance, the K-means clustering algorithm solves the following optimization problem:

  minimize_p  f(p) = Σ_{i=1}^{n} ‖o_i − c_{p_i}‖₂²

Function f(·) is the objective or clustering criterion function of K-means.
K-means clustering – Objective
Let o_1, …, o_n be the set of objects to be clustered, k the number of desired clusters, p the clustering indicator vector such that p_i is the cluster number that the i-th object belongs to, and r_i a vector associated with the i-th cluster.

In the case of Euclidean distance, the K-means clustering algorithm solves the following optimization problem:

  minimize_{p, r_1, …, r_k}  g(p, r_1, …, r_k) = Σ_{i=1}^{n} ‖o_i − r_{p_i}‖₂²

Note that p and r_1, …, r_k are the variables of the optimization problem that need to be estimated such that the value of g(·) is minimized.
K-means clustering – Objective
The solution to

  minimize_p  f(p) = Σ_{i=1}^{n} ‖o_i − c_{p_i}‖₂²

is the same as the solution to

  minimize_{p, r_1, …, r_k}  g(p, r_1, …, r_k) = Σ_{i=1}^{n} ‖o_i − r_{p_i}‖₂²,

and ∀i, r_i = c_i.
The r_i vectors can be thought of as representatives of the objects that are assigned to the i-th cluster. The r_i vectors represent a compressed view of the data.
K-means clustering – Objective

  minimize_p  Σ_{i=1}^{n} ‖o_i − c_{p_i}‖₂²        minimize_{p, r_1, …, r_k}  Σ_{i=1}^{n} ‖o_i − r_{p_i}‖₂²

These are non-convex optimization problems.
• The K-means clustering algorithm is a way of solving the optimization problem.
• It uses an iterative, alternating least-squares optimization strategy:
  a. Optimize the cluster assignments p, given r_i for i = 1, …, k.
  b. Optimize r_i for i = 1, …, k, given the cluster assignments p.
• It guarantees convergence to a local minimum. However, due to the non-convexity of the problem, this may not be the global minimum.
• Run K-means multiple times with different initial centroids and return the solution that has the best objective value.
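A minimal NumPy sketch of this alternating scheme (random initialization, a fixed iteration cap, and the convergence check are assumptions of the sketch; production implementations such as scikit-learn's add smarter seeding and stopping rules):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Lloyd's iteration: alternate between the assignment (p) and centroid (r) updates.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial r_1 .. r_k
    for _ in range(n_iter):
        # Step (a): optimize the assignments p given the centroids.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        p = dists.argmin(axis=1)
        # Step (b): optimize the centroids r_i given the assignments.
        new_centroids = np.array([X[p == j].mean(axis=0) if np.any(p == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = ((X - centroids[p]) ** 2).sum()      # the objective f(p)
    return p, centroids, sse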
Two different K-means clusterings
(Figure: for the same original points, one K-means run finds the optimal clustering while another converges to a sub-optimal clustering.)
Limitations of K-means
• Here "problem" means that the clustering solution you get is not the best, natural, insightful, etc.
• K-means has problems when clusters are of differing:
  – Sizes
  – Densities
  – Non-globular shapes
• K-means has problems when the data contains outliers.
Limitations of K-means: Differing sizes
(Figure: original points vs. K-means with 3 clusters.)

Limitations of K-means: Differing density
(Figure: original points vs. K-means with 3 clusters.)

Limitations of K-means: Non-globular shapes
(Figure: original points vs. K-means with 2 clusters.)

Overcoming K-means limitations
(Figure: original points vs. K-means clusters.)
One solution is to use many clusters: this finds parts of clusters, and we may need to put them back together.
Importance of choosing initial centroids
(Figure: one choice of initial centroids and the resulting K-means iterations 1–6.)
Importance of choosing initial centroids …
(Figure: a different choice of initial centroids and the resulting K-means iterations 1–5.)
Solutions to the initial centroids problem
• Multiple runs:
  – Helps, but probability is not on your side.
• Sample and use hierarchical clustering to determine initial centroids.
• Select more than k initial centroids and then select among these initial centroids:
  – Select the most widely separated.
• Generate a larger number of clusters and then perform a hierarchical clustering.
• Bisecting K-means:
  – Not as susceptible to initialization issues.
Outliers
• A principled way of dealing with outliers is to do so directly during the optimization process.
• Robust K-means algorithms, as part of the optimization process, in addition to determining the clustering solution also identify a set of outlier objects that are not clustered by the algorithm.
• The non-clustered objects are treated as a penalty component of the objective function (in supervised learning, such penalty components are often called regularizers), e.g.,

  minimize_p  Σ_{i : p_i ≠ −1} ‖o_i − c_{p_i}‖₂²  +  λ Σ_{i : p_i = −1} q(i),

where p_i = −1 marks an object left unclustered, λ is a user-specified parameter that controls the penalty associated with not clustering an object, and q(i) is a cost function associated with the i-th object. A simple q(·) = 1 is such a cost function.
K-means and the "curse of dimensionality"
• When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.
  – Randomly generate 500 points.
  – Compute the difference between the max and min distance between any pair of points.

Asymmetric attributes
If we met a friend in the grocery store, would we ever say the following?
"I see our purchases are very similar since we didn't buy most of the same things."
Spherical K-means clustering
Let d_1, …, d_n be the unit-length vectors of the set of objects to be clustered, k the number of desired clusters, p the clustering indicator vector such that p_i is the cluster number that the i-th object belongs to, and c_i the centroid of the i-th cluster.

The spherical K-means clustering algorithm solves the following optimization problem:

  maximize_p  Σ_{i=1}^{n} cos(d_i, c_{p_i}).
Spherical K-means & text
In high-dimensional data, clusters exist in lower-dimensional sub-spaces.
HIERARCHICAL CLUSTERING

Hierarchical clustering
• Produces a set of nested clusters organized as a hierarchical tree.
• Can be visualized as a dendrogram:
  – A tree-like diagram that records the sequences of merges or splits.
(Figure: a nested clustering of six points and the corresponding dendrogram.)
Advantages of hierarchical clustering
• Do not have to assume any particular number of clusters:
  – Any desired number of clusters can be obtained by "cutting" the dendrogram at the proper level.
• The clusters may correspond to meaningful taxonomies:
  – Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, …).
(Figure: the dendrogram of the six-point example.)
Hierarchical clustering
• Two main ways of obtaining hierarchical clusterings:
  – Agglomerative:
    • Start with the points as individual clusters.
    • At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
  – Divisive:
    • Start with one, all-inclusive cluster.
    • At each step, split a cluster until each cluster contains a point (or there are k clusters).
• Traditional hierarchical algorithms use a similarity or distance matrix:
  – Merge or split one cluster at a time.
Agglomerative clustering algorithm
• The more popular hierarchical clustering technique.
• The basic algorithm is straightforward:
  1. Compute the proximity matrix.
  2. Let each data point be a cluster.
  3. Repeat:
  4.   Merge the two closest clusters.
  5.   Update the proximity matrix.
  6. Until only a single cluster remains (or k clusters remain).
• The key operation is the computation of the proximity of two clusters.
  – Different approaches to defining the distance between clusters distinguish the different algorithms.
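For practical use, SciPy already implements the agglomerative procedure; a brief usage sketch (toy data and an arbitrary choice of linkage):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.random.rand(20, 2)                  # 20 points in 2-D (toy data)
proximity = pdist(X, metric="euclidean")   # condensed proximity matrix

Z = linkage(proximity, method="average")   # 'single', 'complete', 'average', ...
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(labels)
# dendrogram(Z) can be used to plot the merge sequence (e.g., with matplotlib).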
Starting situation
Start with clusters of individual points and a proximity matrix.
(Figure: points p1, p2, … and the corresponding proximity matrix.)
Intermediate situation
After some merging steps, we have some clusters.
(Figure: clusters C1–C5 and the proximity matrix between them.)
Intermediate situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
(Figure: clusters C1–C5 and their proximity matrix, with C2 and C5 about to be merged.)
After merging
How do we update the proximity matrix?
(Figure: clusters C1, C3, C4 and the merged cluster C2 ∪ C5; the proximity-matrix entries involving C2 ∪ C5 are marked with "?".)
Defining inter-cluster proximity
(Figure: two clusters of points and their proximity matrix; how should proximity between clusters be defined?)
Options: minimum distance, maximum distance, average distance, distance between centroids, objective-driven selection, etc.
Defining inter-cluster proximity
(Figures: the same clusters and proximity matrix, with the inter-cluster proximity defined as)
• the minimum distance,
• the maximum distance,
• the average distance, or
• the distance between centroids.
Strength of minimum distance
• Can handle non-elliptical shapes.
(Figure: original points and the six clusters found.)

Limitations of minimum distance
• Sensitive to noise and outliers.
(Figure: the original points split into two clusters and into three clusters.)

Strength of maximum distance
• Less susceptible to noise and outliers.
(Figure: original points and two clusters.)

Limitations of maximum distance
• Tends to break large clusters.
• Biased towards globular clusters.
(Figure: original points and two clusters.)

Group average
• A compromise between single and complete link.
• Strengths: less susceptible to noise and outliers.
• Limitations: biased towards globular clusters.
Hierarchical clustering: Time and space requirements
• O(N²) space, since it uses the proximity matrix (N is the number of points).
• O(N³) time in many cases:
  – There are N steps, and at each step the proximity matrix (with on the order of N² entries) must be updated and searched.
  – Complexity can be reduced to O(N² log N) time with some cleverness.
Hierarchical clustering: Problems and limitations
• Once a decision is made to combine two clusters, it cannot be undone.
• The objective function is optimized only locally.
• Different schemes have problems with one or more of the following:
  – Sensitivity to noise and outliers.
  – Difficulty handling different-sized clusters and convex shapes.
  – Breaking large clusters.
DENSITY-BASED CLUSTERING

DBSCAN
• DBSCAN is a density-based algorithm:
  – The density is the number of points within a specified radius (Eps).
  – A point is a core point if it has more than a specified number of points (MinPts) within Eps.
    • These are points that are in the interior of a cluster.
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
  – A noise point is any point that is not a core point or a border point.
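For experimentation, scikit-learn provides a DBSCAN implementation; a minimal usage sketch (the data and the Eps/MinPts values here are arbitrary):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2) * 100             # toy 2-D data
db = DBSCAN(eps=10, min_samples=4).fit(X)     # Eps = 10, MinPts = 4

labels = db.labels_                           # cluster id per point; -1 marks noise points
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True     # core points
print(set(labels))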
DBSCAN: core, border, and noise points

DBSCAN algorithm:

Algorithm DBSCAN(Data: D, Radius: Eps, Density: τ)
begin
  Determine core, border and noise points of D at level (Eps, τ);
  Create graph in which core points are connected if they are within Eps of one another;
  Determine connected components in graph;
  Assign each border point to the connected component with which it is best connected;
  return points in each connected component as a cluster;
end

A noise point is a data point that is neither a core point nor a border point. For example, with τ = 10: point A is a core point because it contains 10 data points within the radius Eps; point B contains only 6 points within a radius of Eps, but it contains the core point A, so it is a border point; point C is a noise point because it contains only 4 points within a radius of Eps and does not contain any core point.

After the core, border, and noise points have been determined, DBSCAN proceeds as follows. First, a connectivity graph is constructed over the core points, in which each node corresponds to a core point and an edge is added between a pair of core points if and only if they are within a distance of Eps of one another. All connected components of this graph are identified; these correspond to the clusters constructed on the core points. The border points are then assigned to the cluster with which they have the highest level of connectivity. The resulting groups are reported as clusters, and noise points are reported as outliers. Notably, the first step of this graph-based clustering is identical to single-linkage agglomerative clustering with a termination criterion of Eps-distance, applied only to the core points. Therefore, DBSCAN may be viewed as an enhancement of single-linkage agglomerative clustering that treats marginal (border) and noisy points specially. This special treatment can reduce the outlier-sensitive chaining behavior of single-linkage algorithms without losing the ability to create clusters of arbitrary shape: if Eps and τ are selected appropriately, a bridge of noisy data points will not be used in the agglomerative process, and DBSCAN will discover the correct clusters in spite of the noise.

Practical issues: DBSCAN is very similar to grid-based methods, except that it uses circular regions as building blocks, which generally gives a smoother contour to the discovered clusters; at more detailed levels of granularity, the two methods tend to become similar.

DBSCAN: core, border and noise points
(Figure: original points and the point types (core, border, and noise) for Eps = 10, MinPts = 4.)
DBSCAN clustering
(Figure: the clusters found by DBSCAN.)

DBSCAN clustering
(Figure: some very small groups are also reported as clusters; they are usually eliminated by imposing a minimum cluster-size threshold.)

DBSCAN clustering
(Figure: original points and the DBSCAN clusters.)
• Resistant to (some) noise.
• Can handle clusters of different shapes and sizes.
DBSCAN: How much noise?

When DBSCAN does not work well
(Figure: the original points clustered with (MinPts=4, Eps=9.75) and with (MinPts=4, Eps=9.92).)
• Varying densities.
• High-dimensional data.
DBSCAN: Determining Eps and MinPts
• The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance.
• Noise points have their kth nearest neighbor at a farther distance.
• So, plot the sorted distance of every point to its kth nearest neighbor.
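A small sketch of that k-distance plot using scikit-learn's nearest-neighbor search (k = 4 and the data are arbitrary choices); the knee of the sorted curve is a common heuristic for Eps:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 2)
k = 4                                         # MinPts
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)                   # dists[:, 0] is the point itself (distance 0)
kth_dist = np.sort(dists[:, k])               # distance of every point to its kth neighbor

plt.plot(kth_dist)
plt.xlabel("points sorted by k-dist")
plt.ylabel(f"distance to {k}th nearest neighbor")
plt.show()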
CLUSTER VALIDITY

Different aspects of cluster validation
• Determining the clustering tendency of a set of data:
  – Is there a non-random structure in the data?
• Comparing the results of a cluster analysis to externally known results:
  – Do the clusters contain objects of mostly a single class label?
• Evaluating how well the results of a cluster analysis fit the data without reference to external information:
  – Look at various intra- and inter-cluster data-derived properties.
• Comparing the results of two different sets of cluster analyses to determine which is better.
• The evaluation can be done for the entire clustering solution or just for selected clusters.
Measures of cluster validity
Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types:
– Internal index (II): used to measure the goodness of a clustering structure without respect to external information.
  • Sum of Squared Error (SSE) (or any other of the objective functions that we discussed).
– External index (EI): used to measure the extent to which cluster labels match externally supplied class labels.
  • Entropy, purity, F-score, etc.
– Relative index (RI): used to compare two different clusterings or clusters.
  • Often an external or internal index is used for this function, e.g., SSE or entropy.
II: Measuring cluster validity via correlation
• Two matrices:
  – The proximity (distance) matrix of the data (e.g., pairwise cosine similarity or Euclidean distance).
  – The ideal proximity matrix implied by the clustering solution:
    • One row and one column for each data point.
    • An entry is 1 if the associated pair of points belongs to the same cluster.
    • An entry is 0 if the associated pair of points belongs to different clusters.
• Compute the correlation between the two matrices:
  – i.e., the correlation between the vectorized matrices.
  – (Make sure that the ordering of the data points is the same in both matrices.)
• High (low) correlation indicates that points that belong to the same cluster are close to each other.
• Not a good measure for some density- or contiguity-based clusters.
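A minimal NumPy/SciPy sketch of this check (the labels and data are placeholders); note that with a distance matrix the "good" direction is a strongly negative correlation, as in the example values on the next slide:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def validity_correlation(X, labels):
    # correlation between the data's distance matrix and the ideal cluster-incidence matrix
    D = squareform(pdist(X))                                       # pairwise distances
    ideal = (labels[:, None] == labels[None, :]).astype(float)     # 1 if same cluster
    iu = np.triu_indices_from(D, k=1)                              # vectorize upper triangle
    return np.corrcoef(D[iu], ideal[iu])[0, 1]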
II: Measuring cluster validity via correlation
Correlation of the ideal similarity and proximity matrices for the K-means clusterings of two datasets:
(Figure: the two 2-D datasets; Corr = -0.9235 for the first and Corr = -0.5810 for the second.)
II: Using similarity matrix for cluster validation
Order the similarity matrix with respect to cluster labels and inspect visually.
(Figure: a dataset of points and its similarity matrix with rows/columns ordered by cluster label.)
Clusters found in random data
(Figure: a set of random points and the clusterings produced on it by DBSCAN, K-means, and complete link.)
II: Using similarity matrix for cluster validation
Clusters in random data are not so crisp.
(Figure: the DBSCAN clustering of the random points and its reordered similarity matrix.)
II: Using similarity matrix for cluster validation
Clusters in random data are not so crisp.
(Figure: the K-means clustering of the random points and its reordered similarity matrix.)
II: Using similarity matrix for cluster validation
Clusters in random data are not so crisp.
(Figure: the complete-link clustering of the random points and its reordered similarity matrix.)
II: Using similarity matrix for cluster validation
(Figure: a DBSCAN clustering with seven labeled clusters and the corresponding reordered similarity matrix.)
II: Framework for cluster validity
• We need a framework to interpret any measure:
  – For example, if our measure of evaluation has a value of 10, is that good, fair, or poor?
• Statistics provide a framework for cluster validity:
  – The more "atypical" a clustering result is, the more likely it represents valid structure in the data.
  – We can compare the values of an index that result from random data or clusterings to those of a clustering result.
    • If the value of the index is unlikely, then the cluster results are valid.
  – These approaches are more complicated and harder to understand.
• For comparing the results of two different sets of cluster analyses, a framework is less necessary.
  – However, there is the question of whether the difference between two index values is significant.
II: Statistical framework for SSE
Example:
– Compare an SSE of 0.005 against three clusters in random data.
– The histogram shows the SSE of three clusters in 500 sets of random data points of size 100, distributed over the range 0.2–0.8 for the x and y values.
(Figure: the random points and the histogram of SSE values, which range roughly from 0.016 to 0.034, far above 0.005.)
II: Statistical framework for correlation
Correlation of the ideal similarity and proximity matrices for the K-means clusterings of the two datasets:
(Figure: Corr = -0.9235 for the first dataset and Corr = -0.5810 for the second.)
Final comment on cluster validity
"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
(Algorithms for Clustering Data, Jain and Dubes)
Classification (Supervised learning)

BASIC CONCEPTS

Classification: Definition
• We are given a collection of records (training set):
  – Each record is characterized by a tuple (x, y), where x is a set of attributes and y is the class label.
    • x: set of attributes, predictors, independent variables, inputs.
    • y: class, response, dependent variable, or output.
• Task:
  – Learn a model that maps each set of attributes x into one of the predefined class labels y.
Examples of classification tasks

Task                          Attribute set, x                                          Class label, y
Categorizing email messages   Features extracted from email message header and content  spam or non-spam
Identifying tumor cells       Features extracted from MRI scans                         malignant or benign cells
Cataloging galaxies           Features extracted from telescope images                  elliptical, spiral, or irregular-shaped galaxies
Building and using a classification model
(Figure: induction, in which a learning algorithm learns a model from the training set, and deduction, in which the model is applied to the test set.)

Training set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Classification techniques
• Base classifiers:
  – Decision tree-based methods.
  – Rule-based methods.
  – Nearest-neighbor.
  – Neural networks.
  – Naïve Bayes and Bayesian belief networks.
  – Support vector machines.
  – … and others.
• Ensemble classifiers:
  – Boosting, bagging, random forests, etc.
DECISION TREES
We will use this method to illustrate various concepts and issues associated with the classification task.
Example of a decision tree

Training data:
ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Model: decision tree (splitting attributes: Home Owner at the root, then MarSt, then Income)
Home Owner = Yes → NO
Home Owner = No → MarSt:
    MarSt = Married → NO
    MarSt = Single, Divorced → Income:
        Income < 80K → NO
        Income > 80K → YES
Example of a decision tree
MarSt = Married → NO
MarSt = Single, Divorced → Home Owner:
    Home Owner = Yes → NO
    Home Owner = No → Income:
        Income < 80K → NO
        Income > 80K → YES

There could be more than one tree that fits the same data!
(Same training data as above.)
Decision tree classification task
(Figure: the same induction/deduction workflow, with a tree induction algorithm learning a decision tree from the training set; the tree is then applied to the test set.)
Apply model to test data
Start from the root of the tree.

Test data:
Home Owner  Marital Status  Annual Income  Defaulted Borrower
No          Married         80K            ?

Routing the record down the tree: Home Owner = No → MarSt = Married → leaf NO, so Defaulted is assigned "No".
Decision tree classification task
(Figure: the induction/deduction workflow again: the tree induction algorithm learns a decision tree from the training set, and the tree is applied to the test set.)
Building the decision tree — Tree induction
• Let D_t be the set of training records that reach a node t.
• General procedure:
  – If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
  – If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
    • Recursively apply the procedure to each subset.
(Training data: the ten loan records above; D_t is the set of records reaching the current node, and "?" marks the split still to be chosen.)
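A toy sketch of this recursive procedure in Python (the choose_split function is a placeholder for the attribute-test selection discussed in the following slides; everything here is illustrative):

from collections import Counter

def build_tree(records, labels, choose_split, depth=0, max_depth=3):
    # Hunt-style recursion: stop when the node is pure (or a depth cap is reached).
    if len(set(labels)) == 1 or depth == max_depth:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    attr, test = choose_split(records, labels)    # e.g. ("Annual Income", lambda v: v < 80)
    left = [i for i, r in enumerate(records) if test(r[attr])]
    right = [i for i in range(len(records)) if i not in left]
    if not left or not right:                     # degenerate split: make a leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    return {"attr": attr, "test": test,
            "left": build_tree([records[i] for i in left], [labels[i] for i in left],
                               choose_split, depth + 1, max_depth),
            "right": build_tree([records[i] for i in right], [labels[i] for i in right],
                                choose_split, depth + 1, max_depth)}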
Hunt's algorithm
(Figure: the tree is grown in four steps on the loan data, with the class counts (No, Yes) at each node:
 (a) a single leaf: Defaulted = No (7, 3);
 (b) split on Home Owner: Yes → Defaulted = No (3, 0); No → Defaulted = No (4, 3);
 (c) the Home Owner = No branch is split on Marital Status: Married → Defaulted = No (3, 0); Single, Divorced → Defaulted = Yes (1, 3);
 (d) the Single, Divorced branch is split on Annual Income: < 80K → Defaulted = No (1, 0); >= 80K → Defaulted = Yes (0, 3).)

Building the decision tree: Example
(The ten training records with Home Owner, Marital Status, Annual Income, and Defaulted Borrower shown above are used throughout this example.)
Design issues of decision tree induction
• How should the training records be split?
  – Method for specifying the test condition.
    • This depends on the attribute types.
  – Method for selecting which attribute and split condition to choose.
    • Need a measure for evaluating the goodness of a test condition.
• When should the splitting procedure stop?
  – Stop splitting if all the records belong to the same class or have identical attribute values.
  – Early termination.
Methods for expressing test conditions
• Depends on attribute types:
  – Binary
  – Nominal
  – Ordinal
  – Continuous
• Depends on number of ways to split:
  – 2-way split
  – Multi-way split
Test condition for nominal attributes
• Multi-way split: use as many partitions as there are distinct values.
  – E.g., Marital Status → Single | Divorced | Married.
• Binary split: divide the values into two subsets.
  – E.g., {Single} vs. {Married, Divorced}, or {Married} vs. {Single, Divorced}, or {Single, Married} vs. {Divorced}.
Test condition for ordinal attributes
• Multi-way split: use as many partitions as there are distinct values.
  – E.g., Shirt Size → Small | Medium | Large | Extra Large.
• Binary split: divide the values into two subsets, preserving the order property among attribute values.
  – E.g., {Small} vs. {Medium, Large, Extra Large}, or {Small, Medium} vs. {Large, Extra Large}.
  – The grouping {Small, Large} vs. {Medium, Extra Large} violates the order property.
Test condition for continuous attributes
(i) Binary split: e.g., Annual Income > 80K? (Yes / No).
(ii) Multi-way split: e.g., Annual Income ∈ {< 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K}.
How to determine the best split?
Before splitting: 10 records of class 0 and 10 records of class 1.
(Figure: three candidate test conditions:
 Gender: two children with class distributions {C0: 6, C1: 4} and {C0: 4, C1: 6};
 Car Type (Family, Sports, Luxury): children with {C0: 1, C1: 3}, {C0: 8, C1: 0}, and {C0: 1, C1: 7};
 Customer ID (c1, …, c20): one child per customer, each containing a single record.)
Which test condition is the best?
How to determine the best split?
• Greedy approach:
  – Nodes with purer class distribution are preferred.
• Need a measure of node purity/impurity:
  – {C0: 5, C1: 5}: high degree of impurity; {C0: 9, C1: 1}: low degree of impurity.
Measures of node impurity
• Gini index:
    GINI(t) = 1 − Σ_j [ p(j | t) ]²
• Entropy:
    Entropy(t) = − Σ_j p(j | t) log p(j | t)
• Misclassification error:
    Error(t) = 1 − max_i P(i | t)
(Figure: the three measures plotted for a two-class problem as a function of the class distribution.)
Finding the best split
1. Compute the impurity measure (P) before splitting.
2. Compute the impurity measure (M) after splitting:
   • Compute the impurity measure of each child node.
   • M is the size-weighted impurity of the children.
3. Choose the attribute test condition that produces the highest gain:
   Gain = P − M,
   or, equivalently, the lowest impurity measure after splitting (M).
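For concreteness, a small Python sketch of the three impurity measures and of the gain computation (the example split values are arbitrary):

from math import log2

def gini(class_counts):
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def entropy(class_counts):
    n = sum(class_counts)
    return -sum((c / n) * log2(c / n) for c in class_counts if c > 0)

def error(class_counts):
    n = sum(class_counts)
    return 1.0 - max(class_counts) / n

def weighted_impurity(children, impurity=gini):
    # size-weighted impurity M of the children produced by a split
    total = sum(sum(ch) for ch in children)
    return sum(sum(ch) / total * impurity(ch) for ch in children)

parent = [10, 10]                      # P: impurity before splitting
children = [[6, 4], [4, 6]]            # a candidate split
gain = gini(parent) - weighted_impurity(children)
print(gini(parent), weighted_impurity(children), gain)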
Decision tree based classification
• Advantages:
  – Inexpensive to construct.
  – Extremely fast at classifying unknown records.
  – Easy to interpret for small-sized trees.
  – Robust to noise (especially when methods to avoid overfitting are employed).
  – Can easily handle redundant or irrelevant attributes (unless the attributes are interacting).
• Disadvantages:
  – The space of possible decision trees is exponentially large; greedy approaches are often unable to find the best tree.
  – Does not take into account interactions between attributes.
  – Each decision boundary involves only a single attribute.
OVERFITTING

Classification errors
• Training errors (apparent errors): errors committed on the training set.
• Test errors: errors committed on the test set.
• Generalization errors: the expected error of a model on a randomly selected subset of records from the same distribution.
Example dataset
Two-class problem:
• "+": 5400 instances
  – 5000 instances generated from a Gaussian centered at (10, 10).
  – 400 noisy instances added.
• "o": 5400 instances
  – Generated from a uniform distribution.
10% of the data is used for training and 90% of the data is used for testing.
Increasing the number of nodes in the decision tree
(Figure: decision boundaries on the training data for a decision tree with 4 nodes and a decision tree with 50 nodes.)
Which tree is better?
Model overfitting
• Underfitting: when the model is too simple, both training and test errors are large.
• Overfitting: when the model is too complex, the training error is small but the test error is large.
(Figure: training and test error as a function of the number of nodes in the tree.)
Using twice the number of data instances
• If the training data is under-representative, test errors increase and training errors decrease as the number of nodes increases.
• Increasing the size of the training data reduces the difference between training and test errors at a given number of nodes.
Reasons for model overfitting
• Presence of noise.
• Lack of representative samples.
• Multiple comparison procedure.
Effect of multiple comparison procedure
• Consider the task of predicting whether the stock market will rise or fall in each of the next 10 trading days.
• Random guessing: P(correct) = 0.5.
• Make 10 random guesses in a row; the probability of getting at least 8 correct is

  P(# correct ≥ 8) = [ C(10, 8) + C(10, 9) + C(10, 10) ] / 2^10 = 0.0547.

(Example guess sequence: Day 1 Up, Day 2 Down, Day 3 Down, Day 4 Up, Day 5 Down, Day 6 Down, Day 7 Up, Day 8 Up, Day 9 Up, Day 10 Down.)
Effect of multiple comparison procedure
• Approach:
– Get 50 analysts.
– Each analyst makes 10 random guesses.
– Choose the analyst that makes the largest number of correct predictions.
• Probability that at least one analyst makes at least 8 correct predictions:
  P(# correct ≥ 8) = 1 − (1 − 0.0547)⁵⁰ = 0.9399
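A quick check of both probabilities (Python; the code itself is not part of the slides):

```python
# Binomial-tail check of the multiple-comparison example.
from math import comb

# P(# correct >= 8) for 10 random guesses
p_single = sum(comb(10, k) for k in (8, 9, 10)) / 2 ** 10
print(round(p_single, 4))                    # 0.0547

# P(at least one of 50 analysts gets >= 8 correct)
print(round(1 - (1 - p_single) ** 50, 4))    # 0.9399
```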
Effect of multiple comparison procedure
• Many algorithms employ the following greedy strategy:
– Initial model: M.
– Alternative model: M' = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree).
– Keep M' if the improvement Δ(M, M') > α.
• Often, γ is chosen as the best of a set of alternative components, Γ = {γ1, γ2, …, γk}.
• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting.
Effect of multiple comparisons: Example
[Figures compare two settings]
– Using only X and Y as attributes.
– Using 100 additional noisy variables generated from a uniform distribution along with X and Y as attributes.
Use 30% of the data for training and 70% of the data for testing.
Notes on overfitting
• Overfitting results in decision trees that are more complex than necessary.
• Training error does not provide a good estimate of how well the tree will perform on previously unseen records.
• We need ways of estimating generalization errors.
Handling overfitting in decision trees
Pre-pruning (early stopping rule):
– Stop the algorithm before it becomes a fully-grown tree.
– Typical stopping conditions for a node:
• Stop if all instances belong to the same class.
• Stop if all the attribute values are the same.
– More restrictive conditions:
• Stop if the number of instances is less than some user-specified threshold.
• Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test).
• Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
• Stop if the estimated generalization error falls below a certain threshold.
Handling overfitting in decision trees
Post-pruning:
– Grow the decision tree to its entirety.
– Subtree replacement:
• Trim the nodes of the decision tree in a bottom-up fashion.
• If the generalization error improves after trimming, replace the sub-tree with a leaf node.
• The class label of the leaf node is determined from the majority class of instances in the sub-tree.
– Subtree raising:
• Replace a subtree with its most frequently used branch.
Examples of post-pruning
Decision Tree:
depth = 1 :
|   breadth > 7 : class 1
|   breadth <= 7 :
|   |   breadth <= 3 :
|   |   |   ImagePages > 0.375 : class 0
|   |   |   ImagePages <= 0.375 :
|   |   |   |   totalPages <= 6 : class 1
|   |   |   |   totalPages > 6 :
|   |   |   |   |   breadth <= 1 : class 1
|   |   |   |   |   breadth > 1 : class 0
|   |   width > 3 :
|   |   |   MultiIP = 0 :
|   |   |   |   ImagePages <= 0.1333 : class 1
|   |   |   |   ImagePages > 0.1333 :
|   |   |   |   |   breadth <= 6 : class 0
|   |   |   |   |   breadth > 6 : class 1
|   |   |   MultiIP = 1 :
|   |   |   |   TotalTime <= 361 : class 0
|   |   |   |   TotalTime > 361 : class 1
depth > 1 :
|   MultiAgent = 0 :
|   |   depth > 2 : class 0
|   |   depth <= 2 :
|   |   |   MultiIP = 1 : class 0
|   |   |   MultiIP = 0 :
|   |   |   |   breadth <= 6 : class 0
|   |   |   |   breadth > 6 :
|   |   |   |   |   RepeatedAccess <= 0.0322 : class 0
|   |   |   |   |   RepeatedAccess > 0.0322 : class 1
|   MultiAgent = 1 :
|   |   totalPages <= 81 : class 0
|   |   totalPages > 81 : class 1

Simplified Decision Tree:
depth = 1 :
|   ImagePages <= 0.1333 : class 1
|   ImagePages > 0.1333 :
|   |   breadth <= 6 : class 0
|   |   breadth > 6 : class 1
depth > 1 :
|   MultiAgent = 0 : class 0
|   MultiAgent = 1 :
|   |   totalPages <= 81 : class 0
|   |   totalPages > 81 : class 1
[Figures: subtree raising and subtree replacement]
ENSEMBLE METHODS
Ensemble methods
• Construct a set of classifiers from the training data.
• Predict the class label of test records by combining the predictions made by multiple classifiers.
Why do ensemble methods work?
Suppose there are 25 base classifiers:
– Each classifier has error rate ε = 0.35.
– Assume the errors made by the classifiers are uncorrelated.
– Probability that the ensemble classifier makes a wrong prediction (a majority of the 25 are wrong):
  P(X ≥ 13) = Σ_{i=13}^{25} C(25, i) ε^i (1 − ε)^(25−i) = 0.06
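A quick check of this figure (Python; the code is not part of the slides):

```python
# Error of a majority vote over 25 independent base classifiers, each with
# error rate 0.35: probability that 13 or more of them are wrong.
from math import comb

eps = 0.35
p_ensemble_wrong = sum(
    comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26)
)
print(round(p_ensemble_wrong, 3))   # ~0.06
```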
General approach
[Figure: from the original training data D, Step 1 creates multiple data sets D1, D2, …, Dt; Step 2 builds multiple classifiers C1, C2, …, Ct; Step 3 combines the classifiers into C*]
Types of ensemble methods
• Manipulate the data distribution.
– Resampling methods.
• Bagging and boosting.
• Manipulate the input features.
– Feature subset selection.
• Random forest: randomly select feature subsets and build decision trees.
• Manipulate the class labels.
– Randomly partition the classes into two subsets, treat them as +ve and −ve, and learn a binary classifier. Repeat this many times. At classification time, use all binary classifiers and give credit to the constituent classes.
• Use different models.
– E.g., different ANN topologies.
Bagging
• Sampling with replacement.
• Build a classifier on each bootstrap sample.
• Use a majority-voting prediction approach:
– Predict an unlabeled instance using all classifiers and return the most frequently predicted class as the prediction.

Original Data      1  2  3  4  5  6  7  8  9  10
Bagging (Round 1)  7  8 10  8  2  5 10 10  5   9
Bagging (Round 2)  1  4  9  1  2  3  2  7  3   2
Bagging (Round 3)  1  8  5 10  5  5  9  6  3   7
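A minimal bagging sketch is shown below. It assumes scikit-learn's DecisionTreeClassifier as the base learner and NumPy arrays for X and y; these are illustrative choices, and any base classifier with fit/predict would do.

```python
# Bagging sketch: bootstrap samples + majority vote over the resulting models.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_rounds=10, seed=0):
    rng = np.random.default_rng(seed)
    models, n = [], len(y)
    for _ in range(n_rounds):
        idx = rng.integers(0, n, size=n)            # sample n records with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([m.predict(X) for m in models])        # (rounds, samples)
    preds = []
    for col in votes.T:                                      # one column per instance
        vals, cnts = np.unique(col, return_counts=True)
        preds.append(vals[cnts.argmax()])                    # most frequent class
    return np.array(preds)
```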
Boosting
• An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records.
– Initially, all N records are assigned equal weights.
– Unlike bagging, the weights may change at the end of each boosting round.
• The weights can be used to create a weighted loss function or to bias the selection of the sample.

Boosting
• Records that are wrongly classified will have their weights increased.
• Records that are classified correctly will have their weights decreased.

Original Data       1  2  3  4  5  6  7  8  9  10
Boosting (Round 1)  7  3  2  8  7  9  4 10  6   3
Boosting (Round 2)  5  4  9  4  2  5  1  7  4   2
Boosting (Round 3)  4  4  8 10  4  5  4  6  3   4

Example 4 is hard to classify.
Its weight is increased, so it is more likely to be chosen again in subsequent rounds.
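The sketch below illustrates the reweighting idea only; the multiplicative factor is an arbitrary illustrative choice, not the exact update rule of any particular boosting algorithm (AdaBoost, for instance, derives its factors from each round's error rate).

```python
# Boosting idea in miniature: increase weights of misclassified records,
# decrease weights of correct ones, then sample the next round by weight.
import numpy as np

def update_weights(weights, y_true, y_pred, factor=2.0):
    weights = weights.copy()
    wrong = (y_true != y_pred)
    weights[wrong] *= factor           # misclassified records get heavier
    weights[~wrong] /= factor          # correctly classified records get lighter
    return weights / weights.sum()     # re-normalize to a distribution

def sample_round(weights, rng):
    n = len(weights)
    return rng.choice(n, size=n, replace=True, p=weights)   # weighted bootstrap
```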
ARTIFICIAL NEURAL NETWORKS
Consider the following
X1  X2  X3   Y
 1   0   0  -1
 1   0   1   1
 1   1   0   1
 1   1   1   1
 0   0   1  -1
 0   1   0  -1
 0   1   1   1
 0   0   0  -1
[Figure: a black box with inputs X1, X2, X3 and output Y]
Output Y is 1 if at least two of the three inputs are equal to 1.
Consider the following
[Figure: the same truth table; the black box is now an output node that sums the inputs with weights 0.3, 0.3, 0.3 and compares the sum against the threshold t = 0.4]
Y = sign(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4)
where sign(x) = +1 if x ≥ 0, and −1 if x < 0.
Perceptron
• The model is an assembly of inter-connected nodes and weighted links.
• The output node sums up each of its input values according to the weights of its links.
• The output node's sum is compared against some threshold t.
[Figure: perceptron model with input nodes X1, X2, X3, weights w1, w2, w3, and an output node with threshold t]
Y = sign(Σ_{i=1}^{d} wi Xi − t) = sign(Σ_{i=0}^{d} wi Xi),
where the threshold is folded into the sum by setting X0 = 1 and w0 = −t.
Perceptron
• Single-layer network:
– Contains only input and output nodes.
• Activation function: f(w, x) = sign(⟨w, x⟩)
• Applying the model is straightforward. For Y = sign(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4):
– X1 = 1, X2 = 0, X3 = 1  =>  y = sign(0.2) = 1
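A quick check (Python/NumPy, an assumption of these notes) that the weights (0.3, 0.3, 0.3) with threshold t = 0.4 reproduce the Y column of the truth table above:

```python
# Apply the perceptron Y = sign(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4) to the table.
import numpy as np

X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],
              [0,0,1],[0,1,0],[0,1,1],[0,0,0]])
w, t = np.array([0.3, 0.3, 0.3]), 0.4

y = np.where(X @ w - t >= 0, 1, -1)
print(y)   # [-1  1  1  1 -1 -1  1 -1]: Y = 1 iff at least two inputs are 1
```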
Perceptron learning rule
• Initialize the weights (w0, w1, …, wd).
• Repeat:
– For each training example (xi, yi):
• Compute f(w, xi).
• Update the weights: w^(k+1) = w^(k) + λ [yi − f(w^(k), xi)] xi
• Until the stopping condition is met.
• The above is an example of a stochastic gradient descent optimization method.
Perceptron learning rule
• Weight update formula: w^(k+1) = w^(k) + λ [yi − f(w^(k), xi)] xi, where λ is the learning rate.
• Intuition: update the weight based on the error e = yi − f(w^(k), xi):
– If y = f(x, w), e = 0: no update is needed.
– If y > f(x, w), e = 2: the weight must be increased so that f(x, w) will increase.
– If y < f(x, w), e = −2: the weight must be decreased so that f(x, w) will decrease.
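A minimal sketch of this rule in Python/NumPy follows; the learning rate, epoch limit, and stopping test are illustrative choices, not prescribed by the slides.

```python
# Perceptron learning rule: w(k+1) = w(k) + lam * (yi - f(w, xi)) * xi
import numpy as np

def train_perceptron(X, y, lam=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    X1 = np.hstack([np.ones((len(X), 1)), X])    # prepend bias input x0 = 1
    w = rng.normal(scale=0.01, size=X1.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X1, y):
            f = 1 if xi @ w >= 0 else -1         # f(w, xi) = sign(<w, xi>)
            w = w + lam * (yi - f) * xi          # update only when yi != f
        if np.all(np.where(X1 @ w >= 0, 1, -1) == y):
            break                                # all training examples correct
    return w

# The "at least two of three inputs" data above is linearly separable,
# so the rule converges to a perfect classifier:
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1],[0,0,1],[0,1,0],[0,1,1],[0,0,0]])
y = np.array([-1, 1, 1, 1, -1, -1, 1, -1])
w = train_perceptron(X, y)
print(np.where(np.hstack([np.ones((8, 1)), X]) @ w >= 0, 1, -1))   # matches y
```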
Perceptron learning rule
• Since f(w, x) is a linear combination of the input variables, the decision boundary is linear.
• For nonlinearly separable problems, the perceptron learning algorithm will fail because no linear hyperplane can separate the data perfectly.
Nonlinearly separable data
XOR data: y = x1 ⊕ x2
x1  x2   y
 0   0  -1
 1   0   1
 0   1   1
 1   1  -1
Multilayer artificial neural networks (ANN)
[Figure: neuron i receives inputs I1, I2, I3 with weights wi1, wi2, wi3, forms the weighted sum Si, and applies an activation function g(Si) with threshold t to produce the output Oi]
[Figure: a multilayer network with an input layer (x1, …, x5), a hidden layer, and an output layer producing y]
Training an ANN means learning the weights of the neurons.
Artificial neural networks
• Various types of neural network topologies:
– Single-layered network (perceptron) versus multi-layered network.
– Feed-forward versus recurrent network.
• Various types of activation functions f:
  Y = f(Σ_i wi Xi)
Artificial neural networks
A multi-layer neural network can solve any type of classification task involving nonlinear decision surfaces.
[Figure: a 2-2-1 network for the XOR data, with input nodes n1, n2 (x1, x2), hidden nodes n3, n4 (weights w31, w32, w41, w42), and output node n5 (weights w53, w54) producing y]
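As a concrete illustration of why one hidden layer suffices for XOR, the sketch below uses hand-picked weights (not learned, and not the weights from the figure) and 0/1 outputs rather than the ±1 labels of the table: one hidden unit computes OR, the other AND, and the output unit fires only when OR is true and AND is false.

```python
# A 2-2-1 threshold network that computes XOR with hand-picked weights.
def step(s):                                       # threshold activation
    return 1 if s >= 0 else 0

def xor_net(x1, x2):
    h_or  = step(1.0 * x1 + 1.0 * x2 - 0.5)        # hidden node: x1 OR x2
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)        # hidden node: x1 AND x2
    return step(1.0 * h_or - 1.0 * h_and - 0.5)    # OR and not AND == XOR

for x in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x, xor_net(*x))                          # 0, 1, 1, 0
```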
Design issues of ANN
• Number of nodes in the input layer:
– One input node per binary/continuous attribute.
– k or ⌈log2 k⌉ nodes for each categorical attribute with k values.
• Number of nodes in the output layer:
– One output node for a binary class problem.
– k or ⌈log2 k⌉ nodes for a k-class problem.
• Number of nodes in the hidden layer.
• Initial weights and biases.
Characteristics of ANN
• Multilayer ANNs are universal function approximators but could suffer from overfitting if the network is too large.
• Gradient descent may converge to a local minimum.
• Model building can be very time consuming, but applying the model can be very fast.
• Can handle redundant attributes because the weights are automatically learnt.
• Sensitive to noise in the training data.
• Difficult to handle missing attributes.
Recent noteworthy developments in ANN
• Use in deep learning and unsupervised feature learning.
– Seek to automatically learn a good representation of the input from unlabeled data.
• Google Brain project:
– Learned the concept of a 'cat' by looking at unlabeled pictures from YouTube.
– One-billion-connection network.
Purpose-built neural networks
• Convolutional neural networks:
– Deep networks that are designed to extract successively more complicated features from 1D, 2D, and 3D signals (i.e., audio, images, video).

Purpose-built neural networks
• Networks that are specifically designed to model arbitrary-length sequences and non-local dependencies:
– Recurrent neural networks.
– Bi-directional recurrent neural networks.
– Long short-term memory.
• Good for language modeling and various biological applications.
SUPPORT VECTOR MACHINES
Separating hyperplanes
Find a linear hyperplane (decision boundary) that separates the data.
[Figures: one possible solution B1, another possible solution B2, and other possible solutions]
Separating hyperplanes
• Which one is better, B1 or B2?
• How do we define "better"?
Support Vector Machines (SVM)
Find the hyperplane that maximizes the margin: B1 is better than B2.
[Figure: B1 with margin hyperplanes b11 and b12, and B2 with margin hyperplanes b21 and b22; B1 has the wider margin]
Support vector machines
[Figure: hyperplane B1, wxᵀ + b = 0, with margin hyperplanes b11 and b12 and normal vector w]
The vector w is normal to the separating hyperplane. Let x and y be two points on the hyperplane. Then
  wxᵀ + b = 0  and  wyᵀ + b = 0,
and therefore
  w(x − y)ᵀ = 0,
which indicates that w is orthogonal to the vector x − y, which lies on the hyperplane.
Classification is performed as follows:
  f(x) = +1 if wxᵀ + b ≥ 0, and −1 if wxᵀ + b < 0.
Model estimation
• The goal is to find the parameters w and b (i.e., the model's parameters) such that the hyperplane separates the classes and maximizes the margin.
• We know how to measure classification accuracy, but how do we measure the margin?
• Let (w, b) be the parameters of a hyperplane that is in the "middle" between the two classes. We can scale (w, b) so that
  f(x) = +1 if wxᵀ + b ≥ +1, and −1 if wxᵀ + b ≤ −1.
• Let x and y be two points such that
  wxᵀ + b = +1  and  wyᵀ + b = −1,
that is, these points are the positive and negative instances closest to the hyperplane, respectively. Then
  w(x − y)ᵀ = 2
  ||w|| ||x − y|| cos(w, x − y) = 2
  ||w|| (margin) = 2
  margin = 2 / ||w||,
since ||x − y|| cos(w, x − y) is the length of the projection of x − y onto w, which is exactly the margin.
Support Vector Machines
[Figure: hyperplane B1 (wxᵀ + b = 0) with margin hyperplanes b11 (wxᵀ + b = +1) and b12, and normal vector w]
Model estimation
• The optimization problem is formulated as follows:
  maximize_{w,b}  2 / ||w||
  subject to  wxiᵀ + b ≥ +1  if xi is +ve
              wxiᵀ + b ≤ −1  if xi is -ve
• If yi is +1 or −1 when xi is +ve or -ve, respectively, then the above can be concisely written in a standard minimization form:
  minimize_{w,b}  ||w||² / 2
  subject to  yi (wxiᵀ + b) ≥ 1  for all xi
• This is a constrained quadratic optimization problem, which is convex and can be solved efficiently using Lagrange multipliers by minimizing the following function:
  Lp = ||w||² / 2 − Σ_i λi (yi (wxiᵀ + b) − 1),
where the λi ≥ 0 are called Lagrange multipliers.
Model estimation
• The dual Lagrangian is used for solving this problem, which can be shown to be
  LD = Σ_i λi − (1/2) Σ_{i,j} λi λj yi yj xi xjᵀ.
Since this is the dual of the primal optimization problem, the problem now becomes a maximization problem.
• At the optimal solution of the primal/dual problem we have that
  w = Σ_i λi yi xi.
• Most of the λi's are 0, and the non-zero λi's are those that define the vector w. They correspond to the training examples that lie on the margin hyperplanes wxᵀ + b = +1 or wxᵀ + b = −1. These training examples are called the support vectors.
• A test instance z is classified as +ve or -ve based on
  f(z) = sign(wzᵀ + b) = sign(Σ_i λi yi xi zᵀ + b).
Example of linear SVM
  x1      x2       y   λ
  0.3858  0.4687   1   65.5261
  0.4871  0.6110  -1   65.5261
  0.9218  0.4103  -1    0
  0.7382  0.8936  -1    0
  0.1763  0.0579   1    0
  0.4057  0.3529   1    0
  0.9355  0.8132  -1    0
  0.2146  0.0099   1    0
The two instances with non-zero λ are the support vectors.
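A small sketch of this example using scikit-learn (an assumption of these notes; the slides do not prescribe a library). A large C approximates the hard-margin formulation, and the two points with non-zero λ in the table should come out as the support vectors.

```python
# Fit a linear SVM to the eight points above and inspect the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.3858, 0.4687], [0.4871, 0.6110], [0.9218, 0.4103],
              [0.7382, 0.8936], [0.1763, 0.0579], [0.4057, 0.3529],
              [0.9355, 0.8132], [0.2146, 0.0099]])
y = np.array([1, -1, -1, -1, 1, 1, -1, 1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard-margin behaviour
print(svm.support_)                            # indices of the support vectors
print(svm.coef_, svm.intercept_)               # w and b of the separating hyperplane
```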
Support vector machines
What if the problem is not linearly separable?
Non-separable case
• Non-linearly separable cases are handled by introducing a slack variable ξi for each training instance and solving the following optimization problem:
  minimize_{w,b,ξ}  ||w||² / 2 + c Σ_i ξi
  subject to  wxiᵀ + b ≥ +1 − ξi  if xi is +ve
              wxiᵀ + b ≤ −1 + ξi  if xi is -ve
              ξi ≥ 0
• … or by using a non-linear hyperplane.
• … or by doing both.
Nonlinear support vector machines
What if the decision boundary is not linear?
Nonlinear support vector machines
Transform the data into a higher-dimensional space.
Decision boundary: Φ(x) wᵀ + b = 0
Nonlinear SVMs
Mapping from the original space to a different space can make the classes separable.
Learning non-linear SVMs
• The dual Lagrangian
  LD = Σ_i λi − (1/2) Σ_{i,j} λi λj yi yj xi xjᵀ
now becomes
  LD = Σ_i λi − (1/2) Σ_{i,j} λi λj yi yj Φ(xi) Φ(xj)ᵀ.
• A test instance z is classified as +ve or -ve based on
  f(z) = sign(Σ_i λi yi Φ(xi) Φ(z)ᵀ + b).
• The matrix K such that K(xi, xj) = Φ(xi) Φ(xj)ᵀ is called the kernel matrix.
• Non-linear SVMs only require such a kernel matrix. One can derive interesting kernel matrices that correspond to extremely high-dimensional functions Φ while operating only on the original space. This is called the kernel trick.
Kernel trick
Examples:
[Figure: example kernel functions; the annotation notes that one of them corresponds to an infinite-dimension polynomial.]
Example of nonlinear SVM
SVM with polynomial degree 2 kernel
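A small sketch of the kernel trick in practice using scikit-learn (an assumption of these notes): a degree-2 polynomial kernel, as in the figure, applied to synthetic concentric-circle data rather than the figure's dataset.

```python
# Degree-2 polynomial kernel SVM; Phi(x) is never computed explicitly.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.5, noise=0.05, random_state=0)
clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1).fit(X, y)  # K(x, z) = (x.z + 1)^2
print(clf.score(X, y))   # close to 1.0 on this nonlinearly separable data
```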
Learning nonlinear SVMs
• Advantages of using kernels:
– We do not have to know the mapping function Φ.
– Computing the dot product Φ(xi) · Φ(xj) in the original space avoids the curse of dimensionality.
• The kernel function can be considered as a measure of similarity between objects and used to encode key information about the classification problem.
• Not all functions can be kernels:
– We must make sure there is a corresponding Φ in some high-dimensional space.
– Mercer's theorem.
Characteristics of SVM
• The learning problem is formulated as a convex optimization problem; efficient algorithms are available to find the global minimum of the objective function.
• Overfitting is addressed by maximizing the margin of the decision boundary, but the user still needs to provide the type of kernel function and the cost function.
• Difficult to handle missing values.
• Robust to noise.
• High computational complexity for building the model.
RIDGE REGRESSION & COORDINATE DESCENT
Linear regression task
• We are given a collection of records (training set).
– Each record is characterized by a tuple (x, y), where x is a set of numerical attributes and y is a value.
• Goal:
– We want to learn a vector w such that ⟨x, w⟩ approximates y in a least-squares sense.
Linear regression and normal equations
Let X be an n × m matrix whose rows correspond to the records and whose columns correspond to the attributes. Let y be an n × 1 vector of the known target values of the records in X. The solution to the linear regression problem is the vector w such that
  minimize_w  ||Xw − y||².
The solution to the above problem is given by
  w = (XᵀX)⁻¹ Xᵀy.
However, this is not how we usually solve it.
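A quick illustration of the normal-equations solution on synthetic data (Python/NumPy, an assumption of these notes; the data is made up for the example):

```python
# Solve the normal equations (X^T X) w = X^T y on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)

w = np.linalg.solve(X.T @ X, X.T @ y)   # w = (X^T X)^{-1} X^T y
print(w)                                # close to [2.0, -1.0, 0.5]
```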
Ridge regression
In order to prevent overfitting, we add a regularization penalty and estimate w as follows:
  minimize_w  ||Xw − y||² + λ||w||²,
where λ is a user-supplied parameter that controls overfitting. This type of regression is called ridge regression.
Estimating w
• There are many ways to solve the optimization problem for estimating w. Coordinate descent is probably the simplest method.
• It consists of a set of outer iterations. In each outer iteration, it performs m steps (one for each of the dimensions of w). During the i-th step, it optimizes the value of the objective function by fixing all but the wi variable. This optimization is performed by taking the partial derivative of the objective function with respect to wi, setting it to 0, and solving for wi. That value of wi is the new value for that variable. The entire process converges when the error does not decrease substantially between successive outer iterations.
• Non-negativity in the model can be enforced by setting any negative wi values to 0 during the inner iterations.
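A minimal sketch of this procedure (Python/NumPy, an assumption of these notes): setting the partial derivative of ||Xw − y||² + λ||w||² with respect to wi to zero gives the single-coordinate update wi = xiᵀri / (xiᵀxi + λ), where ri is the residual with feature i removed.

```python
# Coordinate descent for ridge regression with an optional non-negativity constraint.
import numpy as np

def ridge_coordinate_descent(X, y, lam=1.0, tol=1e-8, max_outer=1000, nonneg=False):
    n, m = X.shape
    w = np.zeros(m)
    r = y - X @ w                            # current residual y - Xw
    prev_err = np.inf
    for _ in range(max_outer):               # outer iterations
        for i in range(m):                   # one inner step per coordinate
            xi = X[:, i]
            r_i = r + xi * w[i]              # residual with feature i removed
            wi_new = (xi @ r_i) / (xi @ xi + lam)   # d/dw_i objective = 0
            if nonneg:
                wi_new = max(0.0, wi_new)    # clamp negative coordinates to 0
            r = r_i - xi * wi_new            # update residual for new w_i
            w[i] = wi_new
        err = r @ r + lam * (w @ w)
        if prev_err - err < tol:             # stop when the error stops decreasing
            break
        prev_err = err
    return w

# Usage: w = ridge_coordinate_descent(X, y, lam=0.1)
```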