
3 Classification: Basic Concepts and Techniques

Humans have an innate ability to classify things into categories, e.g., mundane tasks such as filtering spam email messages or more specialized tasks such as recognizing celestial objects in telescope images (see Figure 3.1). While manual classification often suffices for small and simple data sets with only a few attributes, larger and more complex data sets require an automated solution.

Figure 3.1. Classification of galaxies from telescope images taken from the NASA website.

This chapter introduces the basic concepts of classification and describes some of its key issues such as model overfitting, model selection, and model evaluation. While these topics are illustrated using a classification technique known as decision tree induction, most of the discussion in this chapter is also applicable to other classification techniques, many of which are covered in Chapter 4.

3.1 Basic Concepts

Figure 3.2 illustrates the general idea behind classification. The data for a classification task consists of a collection of instances (records). Each such instance is characterized by the tuple $(\mathbf{x}, y)$, where $\mathbf{x}$ is the set of attribute values that describe the instance and $y$ is the class label of the instance. The attribute set $\mathbf{x}$ can contain attributes of any type, while the class label $y$ must be categorical.

Figure 3.2. A schematic illustration of a classification task.

A classification model is an abstract representation of the relationship between the attribute set and the class label. As will be seen in the next two chapters, the model can be represented in many ways, e.g., as a tree, a probability table, or simply, a vector of real-valued parameters. More formally, we can express it mathematically as a target function $f$ that takes as input the attribute set $\mathbf{x}$ and produces an output corresponding to the predicted class label. The model is said to classify an instance $(\mathbf{x}, y)$ correctly if $f(\mathbf{x}) = y$.

Table 3.1 shows examples of attribute sets and class labels for various classification tasks. Spam filtering and tumor identification are examples of binary classification problems, in which each data instance can be categorized into one of two classes. If the number of classes is larger than 2, as in the galaxy classification example, then it is called a multiclass classification problem.

Table 3.1. Examples of classification tasks.

| Task | Attribute set | Class label |
|---|---|---|
| Spam filtering | Features extracted from email message header and content | spam or non-spam |
| Tumor identification | Features extracted from magnetic resonance imaging (MRI) scans | malignant or benign |
| Galaxy classification | Features extracted from telescope images | elliptical, spiral, or irregular-shaped |

We illustrate the basic concepts of classification in this chapter with the following two examples.

Example 3.1. Vertebrate Classification

Table 3.2 shows a sample data set for classifying vertebrates into mammals, reptiles, birds, fishes, and amphibians. The attribute set includes characteristics of the vertebrate such as its body temperature, skin cover, and ability to fly. The data set can also be used for a binary classification task such as mammal classification, by grouping the reptiles, birds, fishes, and amphibians into a single category called non-mammals.

Table 3.2. A sample data set for the vertebrate classification problem.

| Vertebrate Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label |
|---|---|---|---|---|---|---|---|---|
| human | warm-blooded | hair | yes | no | no | yes | no | mammal |
| python | cold-blooded | scales | no | no | no | no | yes | reptile |
| salmon | cold-blooded | scales | no | yes | no | no | no | fish |
| whale | warm-blooded | hair | yes | yes | no | no | no | mammal |
| frog | cold-blooded | none | no | semi | no | yes | yes | amphibian |
| komodo dragon | cold-blooded | scales | no | no | no | yes | no | reptile |
| bat | warm-blooded | hair | yes | no | yes | yes | yes | mammal |
| pigeon | warm-blooded | feathers | no | no | yes | yes | no | bird |
| cat | warm-blooded | fur | yes | no | no | yes | no | mammal |
| leopard shark | cold-blooded | scales | yes | yes | no | no | no | fish |
| turtle | cold-blooded | scales | no | semi | no | yes | no | reptile |
| penguin | warm-blooded | feathers | no | semi | no | yes | no | bird |
| porcupine | warm-blooded | quills | yes | no | no | yes | yes | mammal |
| eel | cold-blooded | scales | no | yes | no | no | no | fish |
| salamander | cold-blooded | none | no | semi | no | yes | yes | amphibian |

Example 3.2. Loan Borrower Classification

Consider the problem of predicting whether a loan borrower will repay the loan or default on the loan payments. The data set used to build the classification model is shown in Table 3.3. The attribute set includes personal information of the borrower such as marital status and annual income, while the class label indicates whether the borrower had defaulted on the loan payments.

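Table 3.3. A sample data set for the loan borrower classification problem.

| ID | Home Owner | Marital Status | Annual Income | Defaulted? |
|---|---|---|---|---|
| 1 | Yes | Single | 125000 | No |
| 2 | No | Married | 100000 | No |
| 3 | No | Single | 70000 | No |
| 4 | Yes | Married | 120000 | No |
| 5 | No | Divorced | 95000 | Yes |
| 6 | No | Single | 60000 | No |
| 7 | Yes | Divorced | 220000 | No |
| 8 | No | Single | 85000 | Yes |
| 9 | No | Married | 75000 | No |
| 10 | No | Single | 90000 | Yes |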

A classification model serves two important roles in data mining. First, it is used as a predictive model to classify previously unlabeled instances. A good classification model must provide accurate predictions with a fast response time. Second, it serves as a descriptive model to identify the characteristics that distinguish instances from different classes. This is particularly useful for critical applications, such as medical diagnosis, where it is insufficient to have a model that makes a prediction without justifying how it reaches such a decision.

For example, a classification model induced from the vertebrate data set shown in Table 3.2 can be used to predict the class label of the following vertebrate:

| Vertebrate Name | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label |
|---|---|---|---|---|---|---|---|---|
| gila monster | cold-blooded | scales | no | no | no | yes | yes | ? |

In addition, it can be used as a descriptive model to help determine characteristics that define a vertebrate as a mammal, a reptile, a bird, a fish, or an amphibian. For example, the model may identify mammals as warm-blooded vertebrates that give birth to their young.

There are several points worth noting regarding the previous example. First, although all the attributes shown in Table 3.2 are qualitative, there are no restrictions on the type of attributes that can be used as predictor variables. The class label, on the other hand, must be of nominal type. This distinguishes classification from other predictive modeling tasks such as regression, where the predicted value is often quantitative. More information about regression can be found in Appendix D.

Another point worth noting is that not all attributes may be relevant to the classification task. For example, the average length or weight of a vertebrate may not be useful for classifying mammals, as these attributes can show the same value for both mammals and non-mammals. Such an attribute is typically discarded during preprocessing. The remaining attributes might not be able to distinguish the classes by themselves, and thus, must be used in concert with other attributes. For instance, the Body Temperature attribute is insufficient to distinguish mammals from other vertebrates. When it is used together with Gives Birth, the classification of mammals improves significantly. However, when additional attributes, such as Skin Cover, are included, the model becomes overly specific and no longer covers all mammals. Finding the optimal combination of attributes that best discriminates instances from different classes is the key challenge in building classification models.

3.2 General Framework for Classification

Classification is the task of assigning labels to unlabeled data instances, and a classifier is used to perform such a task. A classifier is typically described in terms of a model as illustrated in the previous section. The model is created using a given set of instances, known as the training set, which contains attribute values as well as class labels for each instance. The systematic approach for learning a classification model given a training set is known as a learning algorithm. The process of using a learning algorithm to build a classification model from the training data is known as induction. This process is also often described as “learning a model” or “building a model.” The process of applying a classification model on unseen test instances to predict their class labels is known as deduction. Thus, the process of classification involves two steps: applying a learning algorithm to training data to learn a model, and then applying the model to assign labels to unlabeled instances. Figure 3.3 illustrates the general framework for classification.

Figure 3.3. General framework for building a classification model.

A classification technique refers to a general approach to classification, e.g., the decision tree technique that we will study in this chapter. This classification technique, like most others, consists of a family of related models and a number of algorithms for learning these models. In Chapter 4, we will study additional classification techniques, including neural networks and support vector machines.

A couple of notes on terminology. First, the terms “classifier” and “model” are often taken to be synonymous. If a classification technique builds a single, global model, then this is fine. However, while every model defines a classifier, not every classifier is defined by a single model. Some classifiers, such as k-nearest neighbor classifiers, do not build an explicit model (Section 4.3), while other classifiers, such as ensemble classifiers, combine the output of a collection of models (Section 4.10). Second, the term “classifier” is often used in a more general sense to refer to a classification technique. Thus, for example, “decision tree classifier” can refer to the decision tree classification technique or a specific classifier built using that technique. Fortunately, the meaning of “classifier” is usually clear from the context.

In the general framework shown in Figure 3.3, the induction and deduction steps should be performed separately. In fact, as will be discussed later in Section 3.6, the training and test sets should be independent of each other to ensure that the induced model can accurately predict the class labels of instances it has never encountered before. Models that deliver such predictive insights are said to have good generalization performance. The performance of a model (classifier) can be evaluated by comparing the predicted labels against the true labels of instances. This information can be summarized in a table called a confusion matrix. Table 3.4 depicts the confusion matrix for a binary classification problem. Each entry $f_{ij}$ denotes the number of instances from class $i$ predicted to be of class $j$. For example, $f_{01}$ is the number of instances from class 0 incorrectly predicted as class 1. The number of correct predictions made by the model is $(f_{11} + f_{00})$ and the number of incorrect predictions is $(f_{10} + f_{01})$.

Table 3.4. Confusion matrix for a binary classification problem.

| | Predicted Class = 1 | Predicted Class = 0 |
|---|---|---|
| Actual Class = 1 | $f_{11}$ | $f_{10}$ |
| Actual Class = 0 | $f_{01}$ | $f_{00}$ |

Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information into a single number makes it more convenient to compare the relative performance of different models. This can be done using an evaluation metric such as accuracy, which is computed in the following way:

\[
\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}. \tag{3.1}
\]

For binary classification problems, the accuracy of a model is given by

\[
\text{Accuracy} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}. \tag{3.2}
\]

Error rate is another related metric, which is defined as follows for binary classification problems:

\[
\text{Error rate} = \frac{\text{Number of wrong predictions}}{\text{Total number of predictions}} = \frac{f_{10} + f_{01}}{f_{11} + f_{10} + f_{01} + f_{00}}. \tag{3.3}
\]

The learning algorithms of most classification techniques are designed to learn models that attain the highest accuracy, or equivalently, the lowest error rate when applied to the test set. We will revisit the topic of model evaluation in Section 3.6.
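The bookkeeping behind Equations 3.1 to 3.3 can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the ten label pairs are hypothetical and do not come from the text; the counter `f` plays the role of the entries $f_{ij}$ in Table 3.4.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Return a Counter f where f[(i, j)] is the number of instances
    of actual class i that were predicted as class j."""
    return Counter(zip(actual, predicted))


def accuracy(f):
    """Equation 3.1: correct predictions divided by all predictions."""
    total = sum(f.values())
    correct = sum(n for (i, j), n in f.items() if i == j)
    return correct / total


def error_rate(f):
    """Equation 3.3: wrong predictions divided by all predictions."""
    total = sum(f.values())
    wrong = sum(n for (i, j), n in f.items() if i != j)
    return wrong / total


# Hypothetical labels for ten test instances (not taken from the text).
actual = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

f = confusion_counts(actual, predicted)
print("f11, f10, f01, f00 =", f[(1, 1)], f[(1, 0)], f[(0, 1)], f[(0, 0)])
print("accuracy =", accuracy(f), "error rate =", error_rate(f))
```

Because every prediction is either correct or wrong, the two metrics always sum to 1, which is why maximizing accuracy and minimizing error rate are equivalent goals.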

3.3 Decision Tree Classifier

This section introduces a simple classification technique known as the decision tree classifier. To illustrate how a decision tree works, consider the classification problem of distinguishing mammals from non-mammals using the vertebrate data set shown in Table 3.2. Suppose a new species is discovered by scientists. How can we tell whether it is a mammal or a non-mammal? One approach is to pose a series of questions about the characteristics of the species. The first question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it is definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a follow-up question: Do the females of the species give birth to their young? Those that do give birth are definitely mammals, while those that do not are likely to be non-mammals (with the exception of egg-laying mammals such as the platypus and spiny anteater).

The previous example illustrates how we can solve a classification problem by asking a series of carefully crafted questions about the attributes of the test instance. Each time we receive an answer, we could ask a follow-up question until we can conclusively decide on its class label. The series of questions and their possible answers can be organized into a hierarchical structure called a decision tree. Figure 3.4 shows an example of the decision tree for the mammal classification problem. The tree has three types of nodes:

- A root node, with no incoming links and zero or more outgoing links.
- Internal nodes, each of which has exactly one incoming link and two or more outgoing links.
- Leaf or terminal nodes, each of which has exactly one incoming link and no outgoing links.

Every leaf node in the decision tree is associated with a class label. The non-terminal nodes, which include the root and internal nodes, contain attribute test conditions that are typically defined using a single attribute. Each possible outcome of the attribute test condition is associated with exactly one child of this node. For example, the root node of the tree shown in Figure 3.4 uses the attribute Body Temperature to define an attribute test condition that has two outcomes, warm and cold, resulting in two child nodes.

Figure 3.4. A decision tree for the mammal classification problem.

Given a decision tree, classifying a test instance is straightforward. Starting from the root node, we apply its attribute test condition and follow the appropriate branch based on the outcome of the test. This will lead us either to another internal node, for which a new attribute test condition is applied, or to a leaf node. Once a leaf node is reached, we assign the class label associated with the node to the test instance. As an illustration, Figure 3.5 traces the path used to predict the class label of a flamingo. The path terminates at a leaf node labeled as Non-mammals.

Figure 3.5. Classifying an unlabeled vertebrate. The dashed lines represent the outcomes of applying various attribute test conditions on the unlabeled vertebrate. The vertebrate is eventually assigned to the Non-mammals class.
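The root-to-leaf traversal described above can be sketched as follows. The nested-tuple representation and the names `tree` and `classify` are assumptions made for illustration; the tree itself follows the two questions posed at the start of this section (body temperature first, then whether the species gives birth).

```python
# A node is either a class label (a leaf, stored as a string) or a pair
# (attribute, {outcome: child}) for a root or internal node.
tree = ("Body Temperature", {
    "cold": "Non-mammals",
    "warm": ("Gives Birth", {
        "yes": "Mammals",
        "no": "Non-mammals",
    }),
})


def classify(node, instance):
    """Follow attribute test outcomes from the root until a leaf is reached."""
    while not isinstance(node, str):
        attribute, children = node
        node = children[instance[attribute]]
    return node


# A flamingo is warm-blooded and does not give birth to its young.
flamingo = {"Body Temperature": "warm", "Gives Birth": "no"}
label = classify(tree, flamingo)
print(label)
```

Each loop iteration applies one attribute test condition, so the cost of classifying an instance is bounded by the depth of the tree.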

3.3.1 A Basic Algorithm to Build a Decision Tree

Many possible decision trees can be constructed from a particular data set. While some trees are better than others, finding an optimal one is computationally expensive due to the exponential size of the search space. Efficient algorithms have been developed to induce a reasonably accurate, albeit suboptimal, decision tree in a reasonable amount of time. These algorithms usually employ a greedy strategy to grow the decision tree in a top-down fashion by making a series of locally optimal decisions about which attribute to use when partitioning the training data. One of the earliest methods is Hunt's algorithm, which is the basis for many current implementations of decision tree classifiers, including ID3, C4.5, and CART. This subsection presents Hunt's algorithm and describes some of the design issues that must be considered when building a decision tree.

Hunt's Algorithm

In Hunt's algorithm, a decision tree is grown in a recursive fashion. The tree initially contains a single root node that is associated with all the training instances. If a node is associated with instances from more than one class, it is expanded using an attribute test condition that is determined using a splitting criterion. A child leaf node is created for each outcome of the attribute test condition and the instances associated with the parent node are distributed to the children based on the test outcomes. This node expansion step can then be recursively applied to each child node, as long as it has labels of more than one class. If all the instances associated with a leaf node have identical class labels, then the node is not expanded any further. Each leaf node is assigned a class label that occurs most frequently in the training instances associated with the node.

To illustrate how the algorithm works, consider the training set shown in Table 3.3 for the loan borrower classification problem. Suppose we apply Hunt's algorithm to fit the training data. The tree initially contains only a single leaf node as shown in Figure 3.6(a). This node is labeled as Defaulted = No, since the majority of the borrowers did not default on their loan payments. The training error of this tree is 30% as three out of the ten training instances have the class label Defaulted = Yes. The leaf node can therefore be further expanded because it contains training instances from more than one class.

Figure 3.6. Hunt's algorithm for building decision trees.

Let Home Owner be the attribute chosen to split the training instances. The justification for choosing this attribute as the attribute test condition will be discussed later. The resulting binary split on the Home Owner attribute is shown in Figure 3.6(b). All the training instances for which Home Owner = Yes are propagated to the left child of the root node and the rest are propagated to the right child. Hunt's algorithm is then recursively applied to each child. The left child becomes a leaf node labeled Defaulted = No, since all instances associated with this node have the identical class label Defaulted = No. The right child has instances from each class label. Hence, we split it further. The resulting subtrees after recursively expanding the right child are shown in Figures 3.6(c) and (d).

Hunt's algorithm, as described above, makes some simplifying assumptions that are often not true in practice. In the following, we describe these assumptions and briefly discuss some of the possible ways for handling them.

1. Some of the child nodes created in Hunt's algorithm can be empty if none of the training instances have the particular attribute values. One way to handle this is by declaring each of them as a leaf node with a class label that occurs most frequently among the training instances associated with their parent nodes.

2. If all training instances associated with a node have identical attribute values but different class labels, it is not possible to expand this node any further. One way to handle this case is to declare it a leaf node and assign it the class label that occurs most frequently in the training instances associated with this node.
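The recursion and the two special cases above can be sketched in Python. This is a simplified sketch under stated assumptions, not the full algorithm: it handles only categorical attributes with multiway splits, and it uses a placeholder splitting criterion (the first attribute that still takes more than one value at the node) instead of an impurity-based one. The names `hunt`, `majority`, and `domains` are hypothetical; the data are the Home Owner and Marital Status columns of Table 3.3.

```python
from collections import Counter


def majority(labels):
    """Most frequent class label among the given labels."""
    return Counter(labels).most_common(1)[0][0]


def hunt(rows, labels, attributes, domains, parent_majority=None):
    if not rows:                      # case 1: empty child node
        return parent_majority
    if len(set(labels)) == 1:         # pure node: stop expanding
        return labels[0]
    usable = [a for a in attributes if len({r[a] for r in rows}) > 1]
    if not usable:                    # case 2: identical attribute values
        return majority(labels)
    attr = usable[0]                  # placeholder splitting criterion
    children = {}
    for value in domains[attr]:       # one child per possible outcome
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        children[value] = hunt([rows[i] for i in idx],
                               [labels[i] for i in idx],
                               attributes, domains, majority(labels))
    return (attr, children)


# Home Owner, Marital Status, Defaulted? for borrowers 1..10 of Table 3.3.
data = [("Yes", "Single", "No"), ("No", "Married", "No"),
        ("No", "Single", "No"), ("Yes", "Married", "No"),
        ("No", "Divorced", "Yes"), ("No", "Single", "No"),
        ("Yes", "Divorced", "No"), ("No", "Single", "Yes"),
        ("No", "Married", "No"), ("No", "Single", "Yes")]
rows = [{"Home Owner": h, "Marital Status": m} for h, m, _ in data]
labels = [d for _, _, d in data]
domains = {"Home Owner": ["Yes", "No"],
           "Marital Status": ["Single", "Married", "Divorced"]}

tree = hunt(rows, labels, ["Home Owner", "Marital Status"], domains)
print(tree)
```

With Annual Income left out, the non-home-owning single borrowers share identical attribute values but differ in class, so that node falls under case 2 and becomes a leaf labeled with its majority class.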

Design Issues of Decision Tree Induction

Hunt's algorithm is a generic procedure for growing decision trees in a greedy fashion. To implement the algorithm, there are two key design issues that must be addressed.

1. What is the splitting criterion? At each recursive step, an attribute must be selected to partition the training instances associated with a node into smaller subsets associated with its child nodes. The splitting criterion determines which attribute is chosen as the test condition and how the training instances should be distributed to the child nodes. This will be discussed in Sections 3.3.2 and 3.3.3.

2. What is the stopping criterion? The basic algorithm stops expanding a node only when all the training instances associated with the node have the same class labels or have identical attribute values. Although these conditions are sufficient, there are reasons to stop expanding a node much earlier even if the leaf node contains training instances from more than one class. This process is called early termination and the condition used to determine when a node should be stopped from expanding is called a stopping criterion. The advantages of early termination are discussed in Section 3.4.

3.3.2 Methods for Expressing Attribute Test Conditions

Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for different attribute types.

Binary Attributes

The test condition for a binary attribute generates two potential outcomes, as shown in Figure 3.7.

Figure 3.7. Attribute test condition for a binary attribute.

Nominal Attributes

Since a nominal attribute can have many values, its attribute test condition can be expressed in two ways, as a multiway split or a binary split as shown in Figure 3.8. For a multiway split (Figure 3.8(a)), the number of outcomes depends on the number of distinct values for the corresponding attribute. For example, if an attribute such as marital status has three distinct values (single, married, or divorced), its test condition will produce a three-way split. It is also possible to create a binary split by partitioning all values taken by the nominal attribute into two groups. For example, some decision tree algorithms, such as CART, produce only binary splits by considering all $2^{k-1} - 1$ ways of creating a binary partition of $k$ attribute values. Figure 3.8(b) illustrates three different ways of grouping the attribute values for marital status into two subsets.

Figure 3.8. Attribute test conditions for nominal attributes.
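The count $2^{k-1} - 1$ can be checked by enumerating the partitions directly. The sketch below is illustrative only; the function name `binary_partitions` is hypothetical. It pins the first value to the left group so that each unordered partition is produced exactly once, and it never puts every value on the left, so the right group is never empty.

```python
from itertools import combinations


def binary_partitions(values):
    """Enumerate every way to split the values into two non-empty groups."""
    values = list(values)
    first, rest = values[0], values[1:]
    partitions = []
    # r is the number of other values that join `first` in the left group;
    # r stops at len(rest) - 1 so the right group keeps at least one value.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(values) - left
            partitions.append((left, right))
    return partitions


marital_status = ["Single", "Married", "Divorced"]
splits = binary_partitions(marital_status)
for left, right in splits:
    print(sorted(left), "|", sorted(right))
```

For the three marital status values this yields three groupings, and the count grows exponentially with $k$, which is why exhaustive enumeration is only practical for attributes with few distinct values.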

Ordinal Attributes

Ordinal attributes can also produce binary or multiway splits. Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values. Figure 3.9 illustrates various ways of splitting training records based on the Shirt Size attribute. The groupings shown in Figures 3.9(a) and (b) preserve the order among the attribute values, whereas the grouping shown in Figure 3.9(c) violates this property because it combines the attribute values Small and Large into the same partition while Medium and Extra Large are combined into another partition.

Figure 3.9. Different ways of grouping ordinal attribute values.

Continuous Attributes

For continuous attributes, the attribute test condition can be expressed as a comparison test (e.g., $A < v$) producing a binary split, or as a range query of the form $v_i \le A < v_{i+1}$, for $i = 1, \ldots, k$, producing a multiway split. The difference between these approaches is shown in Figure 3.10. For the binary split, any possible value $v$ between the minimum and maximum attribute values in the training data can be used for constructing the comparison test $A < v$. However, it is sufficient to only consider distinct attribute values in the training set as candidate split positions. For the multiway split, any possible collection of attribute value ranges can be used, as long as they are mutually exclusive and cover the entire range of attribute values between the minimum and maximum values observed in the training set. One approach for constructing multiway splits is to apply the discretization strategies described in Section 2.3.6 on page 63. After discretization, a new ordinal value is assigned to each discretized interval, and the attribute test condition is then defined using this newly constructed ordinal attribute.

Figure 3.10. Test condition for continuous attributes.
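As a small sketch of the two forms of test condition, the code below evaluates a comparison test and a range query for a continuous value. The function names and the income boundaries are hypothetical and chosen only for illustration; intervals are indexed from 0 in the code, whereas the text numbers them from 1.

```python
from bisect import bisect_right


def binary_outcome(a, v):
    """Comparison test A < v: True selects one branch, False the other."""
    return a < v


def multiway_outcome(a, boundaries):
    """Range query: return i such that boundaries[i] <= a < boundaries[i + 1].

    `boundaries` must be sorted; the intervals are half-open, so they are
    mutually exclusive and together cover [boundaries[0], boundaries[-1]).
    """
    i = bisect_right(boundaries, a) - 1
    if i < 0 or i >= len(boundaries) - 1:
        raise ValueError("value lies outside the covered range")
    return i


# Hypothetical interval boundaries for an annual income attribute.
boundaries = [0, 50_000, 100_000, 250_000]
print(binary_outcome(85_000, 97_500))
print(multiway_outcome(85_000, boundaries))
```

Using half-open intervals guarantees that a value sitting exactly on a boundary falls into one interval only.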

3.3.3 Measures for Selecting an Attribute Test Condition

There are many measures that can be used to determine the goodness of an attribute test condition. These measures try to give preference to attribute test conditions that partition the training instances into purer subsets in the child nodes, which mostly have the same class labels. Having purer nodes is useful since a node that has all of its training instances from the same class does not need to be expanded further. In contrast, an impure node containing training instances from multiple classes is likely to require several levels of node expansions, thereby increasing the depth of the tree considerably. Larger trees are less desirable as they are more susceptible to model overfitting, a condition that may degrade the classification performance on unseen instances, as will be discussed in Section 3.4. They are also difficult to interpret and incur more training and test time as compared to smaller trees.

In the following, we present different ways of measuring the impurity of a node and the collective impurity of its child nodes, both of which will be used to identify the best attribute test condition for a node.

Impurity Measure for a Single Node

The impurity of a node measures how dissimilar the class labels are for the data instances belonging to a common node. Following are examples of measures that can be used to evaluate the impurity of a node $t$:

\[
\text{Entropy} = -\sum_{i=0}^{c-1} p_i(t) \log_2 p_i(t), \tag{3.4}
\]

\[
\text{Gini index} = 1 - \sum_{i=0}^{c-1} p_i(t)^2, \tag{3.5}
\]

\[
\text{Classification error} = 1 - \max_i [p_i(t)], \tag{3.6}
\]

where $p_i(t)$ is the relative frequency of training instances that belong to class $i$ at node $t$, $c$ is the total number of classes, and $0 \log_2 0 = 0$ in entropy calculations. All three measures give a zero impurity value if a node contains instances from a single class and maximum impurity if the node has equal proportion of instances from multiple classes.

Figure 3.11 compares the relative magnitude of the impurity measures when applied to binary classification problems. Since there are only two classes, $p_0(t) + p_1(t) = 1$. The horizontal axis $p$ refers to the fraction of instances that belong to one of the two classes. Observe that all three measures attain their maximum value when the class distribution is uniform (i.e., $p_0(t) = p_1(t) = 0.5$) and minimum value when all the instances belong to a single class (i.e., either $p_0(t)$ or $p_1(t)$ equals 1). The following examples illustrate how the values of the impurity measures vary as we alter the class distribution.

Figure 3.11. Comparison among the impurity measures for binary classification problems.

Node $N_1$ (Class = 0: 0 instances; Class = 1: 6 instances):

\[
\begin{aligned}
\text{Gini} &= 1 - (0/6)^2 - (6/6)^2 = 0 \\
\text{Entropy} &= -(0/6)\log_2(0/6) - (6/6)\log_2(6/6) = 0 \\
\text{Error} &= 1 - \max[0/6, 6/6] = 0
\end{aligned}
\]

Node $N_2$ (Class = 0: 1 instance; Class = 1: 5 instances):

\[
\begin{aligned}
\text{Gini} &= 1 - (1/6)^2 - (5/6)^2 = 0.278 \\
\text{Entropy} &= -(1/6)\log_2(1/6) - (5/6)\log_2(5/6) = 0.650 \\
\text{Error} &= 1 - \max[1/6, 5/6] = 0.167
\end{aligned}
\]

Node $N_3$ (Class = 0: 3 instances; Class = 1: 3 instances):

\[
\begin{aligned}
\text{Gini} &= 1 - (3/6)^2 - (3/6)^2 = 0.5 \\
\text{Entropy} &= -(3/6)\log_2(3/6) - (3/6)\log_2(3/6) = 1 \\
\text{Error} &= 1 - \max[3/6, 3/6] = 0.5
\end{aligned}
\]

Based on these calculations, node $N_1$ has the lowest impurity value, followed by $N_2$ and $N_3$. This example, along with Figure 3.11, shows the consistency among the impurity measures, i.e., if a node $N_1$ has lower entropy than node $N_2$, then the Gini index and error rate of $N_1$ will also be lower than that of $N_2$. Despite their agreement, the attribute chosen as splitting criterion by the impurity measures can still be different (see Exercise 6 on page 187).

Collective Impurity of Child Nodes

Consider an attribute test condition that splits a node containing $N$ training instances into $k$ children, $\{v_1, v_2, \cdots, v_k\}$, where every child node represents a partition of the data resulting from one of the $k$ outcomes of the attribute test condition. Let $N(v_j)$ be the number of training instances associated with a child node $v_j$, whose impurity value is $I(v_j)$. Since a training instance in the parent node reaches node $v_j$ for a fraction of $N(v_j)/N$ times, the collective impurity of the child nodes can be computed by taking a weighted sum of the impurities of the child nodes, as follows:

\[
I(\text{children}) = \sum_{j=1}^{k} \frac{N(v_j)}{N} I(v_j). \tag{3.7}
\]

Example 3.3. Weighted Entropy

Consider the candidate attribute test conditions shown in Figures 3.12(a) and (b) for the loan borrower classification problem. Splitting on the Home Owner attribute will generate two child nodes whose weighted entropy can be calculated as follows:

\[
\begin{aligned}
I(\text{Home Owner = yes}) &= -\tfrac{0}{3}\log_2\tfrac{0}{3} - \tfrac{3}{3}\log_2\tfrac{3}{3} = 0 \\
I(\text{Home Owner = no}) &= -\tfrac{3}{7}\log_2\tfrac{3}{7} - \tfrac{4}{7}\log_2\tfrac{4}{7} = 0.985 \\
I(\text{Home Owner}) &= \tfrac{3}{10} \times 0 + \tfrac{7}{10} \times 0.985 = 0.690
\end{aligned}
\]

Splitting on Marital Status, on the other hand, leads to three child nodes with a weighted entropy given by

\[
\begin{aligned}
I(\text{Marital Status = Single}) &= -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971 \\
I(\text{Marital Status = Married}) &= -\tfrac{0}{3}\log_2\tfrac{0}{3} - \tfrac{3}{3}\log_2\tfrac{3}{3} = 0 \\
I(\text{Marital Status = Divorced}) &= -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1.000 \\
I(\text{Marital Status}) &= \tfrac{5}{10} \times 0.971 + \tfrac{3}{10} \times 0 + \tfrac{2}{10} \times 1 = 0.686
\end{aligned}
\]

Thus, Marital Status has a lower weighted entropy than Home Owner.

Figure 3.12. Examples of candidate attribute test conditions.

Identifying the best attribute test condition

To determine the goodness of an attribute test condition, we need to compare the degree of impurity of the parent node (before splitting) with the weighted degree of impurity of the child nodes (after splitting). The larger their difference, the better the test condition. This difference, $\Delta$, also termed as the gain in purity of an attribute test condition, can be defined as follows:

\[
\Delta = I(\text{parent}) - I(\text{children}), \tag{3.8}
\]

where $I(\text{parent})$ is the impurity of a node before splitting and $I(\text{children})$ is the weighted impurity measure after splitting. It can be shown that the gain is non-negative since $I(\text{parent}) \ge I(\text{children})$ for any reasonable measure such as those presented above. The higher the gain, the purer are the classes in the child nodes relative to the parent node. The splitting criterion in the decision tree learning algorithm selects the attribute test condition that shows the maximum gain. Note that maximizing the gain at a given node is equivalent to minimizing the weighted impurity measure of its children since $I(\text{parent})$ is the same for all candidate attribute test conditions. Finally, when entropy is used as the impurity measure, the difference in entropy is commonly known as information gain, $\Delta_{\text{info}}$.

Figure 3.13. Splitting criteria for the loan borrower classification problem using Gini index.
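Equations 3.4 to 3.8 translate almost directly into code. The sketch below is illustrative; the function names are hypothetical, and each node is represented simply by its list of per-class instance counts. The class counts for the Home Owner and Marital Status splits are read off Table 3.3 in the order (Defaulted = Yes, Defaulted = No).

```python
from math import log2


def proportions(counts):
    n = sum(counts)
    return [c / n for c in counts]


def entropy(counts):
    """Equation 3.4, using the convention 0 * log2(0) = 0."""
    return -sum(p * log2(p) for p in proportions(counts) if p > 0)


def gini(counts):
    """Equation 3.5."""
    return 1.0 - sum(p * p for p in proportions(counts))


def classification_error(counts):
    """Equation 3.6."""
    return 1.0 - max(proportions(counts))


def weighted_impurity(children, impurity):
    """Equation 3.7: child impurities weighted by N(v_j) / N."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * impurity(c) for c in children)


def gain(parent, children, impurity):
    """Equation 3.8: impurity before splitting minus impurity after."""
    return impurity(parent) - weighted_impurity(children, impurity)


parent = [3, 7]                              # 3 defaulted, 7 did not
home_owner = [[0, 3], [3, 4]]                # Yes, No
marital_status = [[2, 3], [0, 3], [1, 1]]    # Single, Married, Divorced

for name, split in [("Home Owner", home_owner),
                    ("Marital Status", marital_status)]:
    print(name,
          round(weighted_impurity(split, entropy), 3),
          round(gain(parent, split, entropy), 3))
```

Because the code keeps full floating-point precision, its third decimal place can differ by one unit from hand calculations that round intermediate values such as 0.971.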

In the following, we present illustrative approaches for identifying the best attribute test condition given qualitative or quantitative attributes.

Splitting of Qualitative Attributes

Consider the first two candidate splits shown in Figure 3.12 involving qualitative attributes Home Owner and Marital Status. The initial class distribution at the parent node is (0.3, 0.7), since there are 3 instances of class Yes and 7 instances of class No in the training data. Thus,

\[
I(\text{parent}) = -\tfrac{3}{10}\log_2\tfrac{3}{10} - \tfrac{7}{10}\log_2\tfrac{7}{10} = 0.881
\]

The information gains for Home Owner and Marital Status are each given by

\[
\begin{aligned}
\Delta_{\text{info}}(\text{Home Owner}) &= 0.881 - 0.690 = 0.191 \\
\Delta_{\text{info}}(\text{Marital Status}) &= 0.881 - 0.686 = 0.195
\end{aligned}
\]

The information gain for Marital Status is thus higher due to its lower weighted entropy, so Marital Status will be chosen for splitting.

Binary Splitting of Qualitative Attributes

Consider building a decision tree using only binary splits and the Gini index as the impurity measure. Figure 3.13 shows examples of four candidate splitting criteria for the Home Owner and Marital Status attributes. Since there are 3 borrowers in the training set who defaulted and 7 others who repaid their loan (see Table in Figure 3.13), the Gini index of the parent node before splitting is

\[
1 - \left(\tfrac{3}{10}\right)^2 - \left(\tfrac{7}{10}\right)^2 = 0.420.
\]

If Home Owner is chosen as the splitting attribute, the Gini index for the child nodes $N_1$ and $N_2$ are 0 and 0.490, respectively. The weighted average Gini index for the children is

\[
(3/10) \times 0 + (7/10) \times 0.490 = 0.343,
\]

where the weights represent the proportion of training instances assigned to each child. The gain using Home Owner as splitting attribute is $0.420 - 0.343 = 0.077$. Similarly, we can apply a binary split on the Marital Status attribute. However, since Marital Status is a nominal attribute with three outcomes, there are three possible ways to group the attribute values into a binary split. The weighted average Gini index of the children for each candidate binary split is shown in Figure 3.13. Based on these results, Home Owner and the last binary split using Marital Status are clearly the best candidates, since they both produce the lowest weighted average Gini index. Binary splits can also be used for ordinal attributes, if the binary partitioning of the attribute values does not violate the ordering property of the values.

Binary Splitting of Quantitative Attributes

Consider the problem of identifying the best binary split $\text{Annual Income} \le \tau$ for the preceding loan approval classification problem. As discussed previously, even though $\tau$ can take any value between the minimum and maximum values of annual income in the training set, it is sufficient to only consider the annual income values observed in the training set as candidate split positions. For each candidate $\tau$, the training set is scanned once to count the number of borrowers with annual income less than or greater than $\tau$ along with their class proportions. We can then compute the Gini index at each candidate split position and choose the $\tau$ that produces the lowest value. Computing the Gini index at each candidate split position requires $O(N)$ operations, where $N$ is the number of training instances. Since there are at most $N$ possible candidates, the overall complexity of this brute-force method is $O(N^2)$. It is possible to reduce the complexity of this problem to $O(N \log N)$ by using a method described as follows (see illustration in Figure 3.14). In this method, we first sort the training instances based on their annual income, a one-time cost that requires $O(N \log N)$ operations. The candidate split positions are given by the midpoints between every two adjacent sorted values: $55,000, $65,000, $72,500, and so on. For the first candidate, since none of the instances has an annual income less than or equal to $55,000, the Gini index for the child node with $\text{Annual Income} \le \$55{,}000$ is equal to zero. In contrast, there are 3 training instances of class Yes and 7 instances of class No with annual income greater than $55,000. The Gini index for this node is 0.420. The weighted average Gini index for the first candidate split position, $\tau = \$55{,}000$, is equal to $0 \times 0 + 1 \times 0.420 = 0.420$.

Figure 3.14. Splitting continuous attributes.

For the next candidate, $\tau = \$65{,}000$, the class distribution of its child nodes can be obtained with a simple update of the distribution for the previous candidate. This is because, as $\tau$ increases from $55,000 to $65,000, there is only one training instance affected by the change. By examining the class label of the affected training instance, the new class distribution is obtained. For example, as $\tau$ increases to $65,000, there is only one borrower in the training set, with an annual income of $60,000, affected by this change. Since the class label for the borrower is No, the count for class No increases from 0 to 1 (for $\text{Annual Income} \le \$65{,}000$) and decreases from 7 to 6 (for $\text{Annual Income} > \$65{,}000$), as shown in Figure 3.14. The distribution for the Yes class remains unaffected. The updated Gini index for this candidate split position is 0.400.

This procedure is repeated until the Gini index for all candidates is found. The best split position corresponds to the one that produces the lowest Gini index, which occurs at $\tau = \$97{,}500$. Since the Gini index at each candidate split position can be computed in $O(1)$ time, the complexity of finding the best split position is $O(N)$ once all the values are kept sorted, a one-time operation that takes $O(N \log N)$ time. The overall complexity of this method is thus $O(N \log N)$, which is much smaller than the $O(N^2)$ time taken by the brute-force method. The amount of computation can be further reduced by considering only candidate split positions located between two adjacent sorted instances with different class labels. For example, we do not need to consider candidate split positions located between $60,000 and $75,000 because all three instances with annual income in this range ($60,000, $70,000, and $75,000) have the same class labels. Choosing a split position within this range only increases the degree of impurity, compared to a split position located outside this range. Therefore, the candidate split positions at $\tau = \$65{,}000$ and $\tau = \$72{,}500$ can be ignored. Similarly, we do not need to consider the candidate split positions at $87,500, $92,500, $110,000, $122,500, and $172,500 because they are located between two adjacent instances with the same labels. This strategy reduces the number of candidate split positions to consider from 9 to 2 (excluding the two boundary cases $\tau = \$55{,}000$ and $\tau = \$230{,}000$).
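The sort-then-scan procedure can be sketched as follows. This is a minimal illustration under a few assumptions: the function names are hypothetical, only the interior midpoints are evaluated (the two boundary candidates are skipped because they leave one child empty), and the optional pruning of candidates between same-class neighbors is omitted for brevity. The data are the Annual Income and Defaulted? columns of Table 3.3.

```python
def gini(counts):
    counts = list(counts)
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def best_split(values, labels):
    """Return (tau, weighted Gini) of the best binary split value <= tau."""
    pairs = sorted(zip(values, labels))          # one-time O(N log N) sort
    n = len(pairs)
    left = {c: 0 for c in set(labels)}           # class counts, value <= tau
    right = {c: 0 for c in set(labels)}          # class counts, value > tau
    for _, y in pairs:
        right[y] += 1
    best_tau, best_gini = None, float("inf")
    for i in range(n - 1):
        v, y = pairs[i]
        left[y] += 1                             # O(1) update: one instance
        right[y] -= 1                            # moves from right to left
        nxt = pairs[i + 1][0]
        if nxt == v:
            continue                             # no midpoint between ties
        tau = (v + nxt) / 2
        n_left = i + 1
        w = (n_left / n) * gini(left.values()) \
            + ((n - n_left) / n) * gini(right.values())
        if w < best_gini:
            best_tau, best_gini = tau, w
    return best_tau, best_gini


income = [125000, 100000, 70000, 120000, 95000,
          60000, 220000, 85000, 75000, 90000]
defaulted = ["No", "No", "No", "No", "Yes",
             "No", "No", "Yes", "No", "Yes"]

tau, g = best_split(income, defaulted)
print(tau, round(g, 3))
```

The running `left` and `right` counts are what make each candidate cost $O(1)$; recomputing them from scratch for every candidate would bring back the $O(N^2)$ behavior of the brute-force method.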

Gain Ratio

One potential limitation of impurity measures such as entropy and Gini index is that they tend to favor qualitative attributes with a large number of distinct values. Figure 3.12 shows three candidate attributes for partitioning the data set given in Table 3.3. As previously mentioned, the attribute Marital Status is a better choice than the attribute Home Owner, because it provides a larger information gain. However, if we compare them against Customer ID, the latter produces the purest partitions with the maximum information gain, since the weighted entropy and Gini index are equal to zero for its children. Yet, Customer ID is not a good attribute for splitting because it has a unique value for each instance. Even though a test condition involving Customer ID will accurately classify every instance in the training data, we cannot use such a test condition on new test instances with Customer ID values that haven't been seen before during training. This example suggests that having a low impurity value alone is insufficient to find a good attribute test condition for a node. As we will see later in Section 3.4, having a larger number of child nodes can make a decision tree more complex and consequently more susceptible to overfitting. Hence, the number of children produced by the splitting attribute should also be taken into consideration while deciding the best attribute test condition.

There are two ways to overcome this problem. One way is to generate only binary decision trees, thus avoiding the difficulty of handling attributes with varying number of partitions. This strategy is employed by decision tree classifiers such as CART. Another way is to modify the splitting criterion to take into account the number of partitions produced by the attribute. For example, in the C4.5 decision tree algorithm, a measure known as gain ratio is used to compensate for attributes that produce a large number of child nodes. This measure is computed as follows:

where $N(v_i)$ is the number of instances assigned to node $v_i$ and $k$ is the total number of splits. The split information measures the entropy of splitting a node into its child nodes and evaluates if the split results in a larger number of equally-sized child nodes or not. For example, if every partition has the same number of instances, then $N(v_i)/N = 1/k$ and the split information would be equal to $\log_2 k$. Thus, if an attribute produces a large number of splits, its split information is also large, which in turn, reduces the gain ratio.

3.4.ExampleGainRatioConsiderthedatasetgiveninExercise2onpage185.Wewanttoselectthebestattributetestconditionamongthefollowingthreeattributes:

, ,and .Theentropybeforesplittingis

If isusedasattributetestcondition:

If isusedasattributetestcondition:

Finally,if isusedasattributetestcondition:

Gainratio=ΔinfoSplitInfo=Entropy(Parent)−∑i=1kN(vi)NEntropy(vi)−∑i=1kN(vi)Nlog2N(vi)N

(3.9)

N(vi) vi

∀i:N(vi)/N=1/k2

Entropy(parent)=−1020log21020−1020log21020=1.

Entropy(children)=1020[−610log2610−410log2410]×2=0.971GainRatio=1−0.971−1020log21020−1020log21020=0.0291=0.029

Entropy(children)=420[−14log214−34log234]+820×0+820[−18log218−78log278]=0.380GainRatio=1−0.380−420log2420−820log2820−820log2820=0.6201.52

Thus,eventhough hasthehighestinformationgain,itsgainratioislowerthan sinceitproducesalargernumberofsplits.
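The arithmetic of the gain ratio example can be checked with a short Python sketch. The helper names `entropy` and `gain_ratio` are hypothetical, and the per-child class counts below are assumptions read off the fractions in the calculations above (for instance, two children of ten instances with a 6/4 split for the two-way attribute).

```python
from math import log2

def entropy(counts):
    """Entropy (base 2) of a node described by its class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent_counts, children_counts):
    """Information gain divided by split information, as in Equation (3.9)."""
    n = sum(parent_counts)
    weights = [sum(child) / n for child in children_counts]
    info_gain = entropy(parent_counts) - sum(
        w * entropy(child) for w, child in zip(weights, children_counts)
    )
    # Split information is zero when there is a single child; this sketch
    # assumes at least two non-empty children.
    split_info = -sum(w * log2(w) for w in weights if w > 0)
    return info_gain / split_info

parent = [10, 10]                                  # 10 instances of each class
two_way = [[6, 4], [4, 6]]                         # two children of 10 instances
three_way = [[1, 3], [8, 0], [1, 7]]               # children of sizes 4, 8, 8
unique_id = [[1, 0]] * 10 + [[0, 1]] * 10          # 20 children of one instance each

ratios = {
    "two_way": gain_ratio(parent, two_way),
    "three_way": gain_ratio(parent, three_way),
    "unique_id": gain_ratio(parent, unique_id),
}
```

Rounded, the three ratios come out at 0.029, 0.41, and 0.23, so the attribute with one child per instance is penalized by its large split information even though its children are perfectly pure.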

3.3.4 Algorithm for Decision Tree Induction

Algorithm 3.1 presents pseudocode for a decision tree induction algorithm. The input to this algorithm is a set of training instances E along with the attribute set F. The algorithm works by recursively selecting the best attribute to split the data (Step 7) and expanding the nodes of the tree (Steps 11 and 12) until the stopping criterion is met (Step 1). The details of this algorithm are explained below.

Algorithm 3.1 A skeleton decision tree induction algorithm.

1. The createNode() function extends the decision tree by creating a new node. A node in the decision tree either has a test condition, denoted as node.testcond, or a class label, denoted as node.label.

2. The find_best_split() function determines the attribute test condition for partitioning the training instances associated with a node. The splitting attribute chosen depends on the impurity measure used. The popular measures include entropy and the Gini index.

3. The Classify() function determines the class label to be assigned to a leaf node. For each leaf node t, let p(i|t) denote the fraction of training instances from class i associated with the node t. The label assigned to the leaf node is typically the one that occurs most frequently in the training instances that are associated with this node:

\[
\text{leaf.label} = \operatorname*{argmax}_{i} \; p(i|t), \tag{3.10}
\]

where the argmax operator returns the class i that maximizes p(i|t). Besides providing the information needed to determine the class label of a leaf node, p(i|t) can also be used as a rough estimate of the probability that an instance assigned to the leaf node t belongs to class i. Sections 4.11.2 and 4.11.4 in the next chapter describe how such probability estimates can be used to determine the performance of a decision tree under different cost functions.

4. The stopping_cond() function is used to terminate the tree-growing process by checking whether all the instances have identical class labels or attribute values. Since decision tree classifiers employ a top-down, recursive partitioning approach for building a model, the number of training instances associated with a node decreases as the depth of the tree increases. As a result, a leaf node may contain too few training instances to make a statistically significant decision about its class label. This is known as the data fragmentation problem. One way to avoid this problem is to disallow splitting of a node when the number of instances associated with the node falls below a certain threshold. A more systematic way to control the size of a decision tree (number of leaf nodes) will be discussed in Section 3.5.4.
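The four functions described above can be put together as a minimal Python sketch. It is not a transcription of Algorithm 3.1: it assumes numeric attributes, binary threshold splits, and the Gini index as the impurity measure, and the `min_size` threshold and the `predict` helper are additions made for illustration.

```python
from collections import Counter

class Node:
    """A tree node: internal nodes carry test_cond, leaves carry label."""
    def __init__(self):
        self.test_cond = None    # (attribute index, threshold) at internal nodes
        self.label = None        # class label at leaf nodes
        self.children = {}       # test outcome (True/False) -> child Node

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def stopping_cond(E, F, min_size):
    """Stop on identical class labels, identical attribute values, or too few instances."""
    same_class = len({y for _, y in E}) == 1
    same_attrs = all(x[a] == E[0][0][a] for x, _ in E for a in F)
    return same_class or same_attrs or len(E) < min_size

def classify(E):
    """Majority class of the instances at a leaf (the argmax of p(i|t))."""
    return Counter(y for _, y in E).most_common(1)[0][0]

def find_best_split(E, F):
    """Return the (attribute, threshold) pair with the lowest weighted Gini index."""
    best, best_score = None, float("inf")
    for a in F:
        values = sorted({x[a] for x, _ in E})
        for lo, hi in zip(values, values[1:]):
            tau = (lo + hi) / 2
            left = [y for x, y in E if x[a] <= tau]
            right = [y for x, y in E if x[a] > tau]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(E)
            if score < best_score:
                best, best_score = (a, tau), score
    return best

def tree_growth(E, F, min_size=2):
    """Recursively grow a tree from instances E = [(x, y), ...] over attribute indices F."""
    node = Node()
    if stopping_cond(E, F, min_size):
        node.label = classify(E)
        return node
    node.test_cond = find_best_split(E, F)
    a, tau = node.test_cond
    for outcome in (True, False):
        E_v = [(x, y) for x, y in E if (x[a] <= tau) == outcome]
        node.children[outcome] = tree_growth(E_v, F, min_size)
    return node

def predict(node, x):
    """Follow test conditions from the root to a leaf; cost is bounded by the tree depth."""
    while node.label is None:
        a, tau = node.test_cond
        node = node.children[x[a] <= tau]
    return node.label

# Tiny illustrative data set (hypothetical): an XOR pattern over two attributes.
E = [((0, 0), "o"), ((0, 1), "+"), ((1, 0), "+"), ((1, 1), "o")]
tree = tree_growth(E, F=[0, 1])
```

Raising `min_size` is one simple way to express the threshold mentioned in the discussion of data fragmentation: nodes with fewer instances than the threshold become leaves labeled with their majority class.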

3.3.5 Example Application: Web Robot Detection

Consider the task of distinguishing the access patterns of web robots from those generated by human users. A web robot (also known as a web crawler) is a software program that automatically retrieves files from one or more websites by following the hyperlinks extracted from an initial set of seed URLs. These programs have been deployed for various purposes, from gathering web pages on behalf of search engines to more malicious activities such as spamming and committing click fraud in online advertisements.

Figure 3.15. Input data for web robot detection.

The web robot detection problem can be cast as a binary classification task. The input data for the classification task is a web server log, a sample of which is shown in Figure 3.15(a). Each line in the log file corresponds to a request made by a client (i.e., a human user or a web robot) to the web server. The fields recorded in the web log include the client's IP address, timestamp of the request, URL of the requested file, size of the file, and user agent, which is a field that contains identifying information about the client.

For human users, the user agent field specifies the type of web browser or mobile device used to fetch the files, whereas for web robots, it should technically contain the name of the crawler program. However, web robots may conceal their true identities by declaring their user agent fields to be identical to known browsers. Therefore, user agent is not a reliable field to detect web robots.

The first step toward building a classification model is to precisely define a data instance and associated attributes. A simple approach is to consider each log entry as a data instance and use the appropriate fields in the log file as its attribute set. This approach, however, is inadequate for several reasons. First, many of the attributes are nominal-valued and have a wide range of domain values. For example, the number of unique client IP addresses, URLs, and referrers in a log file can be very large. These attributes are undesirable for building a decision tree because their split information is extremely high (see Equation (3.9)). In addition, it might not be possible to classify test instances containing IP addresses, URLs, or referrers that are not present in the training data. Finally, by considering each log entry as a separate data instance, we disregard the sequence of web pages retrieved by the client, a critical piece of information that can help distinguish web robot accesses from those of a human user.

A better alternative is to consider each web session as a data instance. A web session is a sequence of requests made by a client during a given visit to the website. Each web session can be modeled as a directed graph, in which the nodes correspond to web pages and the edges correspond to hyperlinks connecting one web page to another. Figure 3.15(b) shows a graphical representation of the first web session given in the log file. Every web session can be characterized using some meaningful attributes about the graph that contain discriminatory information. Figure 3.15(c) shows some of the attributes extracted from the graph, including the depth and breadth of its corresponding tree rooted at the entry point to the website. For example, the depth and breadth of the tree shown in Figure 3.15(b) are both equal to two.

The derived attributes shown in Figure 3.15(c) are more informative than the original attributes given in the log file because they characterize the behavior of the client at the website. Using this approach, a data set containing 2916 instances was created, with equal numbers of sessions due to web robots (class 1) and human users (class 0). Of these, 10% were reserved for training while the remaining 90% were used for testing. The induced decision tree is shown in Figure 3.16, which has an error rate equal to 3.8% on the training set and 5.3% on the test set. In addition to its low error rate, the tree also reveals some interesting properties that can help discriminate web robots from human users:

1. Accesses by web robots tend to be broad but shallow, whereas accesses by human users tend to be more focused (narrow but deep).

2. Web robots seldom retrieve the image pages associated with a web page.

3. Sessions due to web robots tend to be long and contain a large number of requested pages.

4. Web robots are more likely to make repeated requests for the same web page than human users since the web pages retrieved by human users are often cached by the browser.

3.3.6 Characteristics of Decision Tree Classifiers

The following is a summary of the important characteristics of decision tree induction algorithms.

1. Applicability: Decision trees are a nonparametric approach for building classification models. This approach does not require any prior assumption about the probability distribution governing the class and attributes of the data, and thus, is applicable to a wide variety of data sets. It is also applicable to both categorical and continuous data without requiring the attributes to be transformed into a common representation via binarization, normalization, or standardization. Unlike some binary classifiers described in Chapter 4, it can also deal with multiclass problems without the need to decompose them into multiple binary classification tasks. Another appealing feature of decision tree classifiers is that the induced trees, especially the shorter ones, are relatively easy to interpret. The accuracies of the trees are also quite comparable to other classification techniques for many simple data sets.

2. Expressiveness: A decision tree provides a universal representation for discrete-valued functions. In other words, it can encode any function of discrete-valued attributes. This is because every discrete-valued function can be represented as an assignment table, where every unique combination of discrete attributes is assigned a class label. Since every combination of attributes can be represented as a leaf in the decision tree, we can always find a decision tree whose label assignments at the leaf nodes match the assignment table of the original function. Decision trees can also help in providing compact representations of functions when some of the unique combinations of attributes can be represented by the same leaf node. For example, Figure 3.17 shows the assignment table of the Boolean function (A∧B)∨(C∧D) involving four binary attributes, resulting in a total of 2⁴ = 16 possible assignments. The tree shown in Figure 3.17 shows a compressed encoding of this assignment table. Instead of requiring a fully-grown tree with 16 leaf nodes, it is possible to encode the function using a simpler tree with only 7 leaf nodes. Nevertheless, not all decision trees for discrete-valued attributes can be simplified. One notable example is the parity function, whose value is 1 when there is an even number of true values among its Boolean attributes, and 0 otherwise. Accurate modeling of such a function requires a full decision tree with 2^d nodes, where d is the number of Boolean attributes (see Exercise 1 on page 185).

Figure 3.16. Decision tree model for web robot detection.

Figure 3.17. Decision tree for the Boolean function (A∧B)∨(C∧D).

3. Computational Efficiency: Since the number of possible decision trees can be very large, many decision tree algorithms employ a heuristic-based approach to guide their search in the vast hypothesis space. For example, the algorithm presented in Section 3.3.4 uses a greedy, top-down, recursive partitioning strategy for growing a decision tree. For many data sets, such techniques quickly construct a reasonably good decision tree even when the training set size is very large. Furthermore, once a decision tree has been built, classifying a test record is extremely fast, with a worst-case complexity of O(w), where w is the maximum depth of the tree.

4. Handling Missing Values: A decision tree classifier can handle missing attribute values in a number of ways, both in the training and the test sets. When there are missing values in the test set, the classifier must decide which branch to follow if the value of a splitting node attribute is missing for a given test instance. One approach, known as the probabilistic split method, which is employed by the C4.5 decision tree classifier, distributes the data instance to every child of the splitting node according to the probability that the missing attribute has a particular value. In contrast, the CART algorithm uses the surrogate split method, where the instance whose splitting attribute value is missing is assigned to one of the child nodes based on the value of another non-missing surrogate attribute whose splits most resemble the partitions made by the missing attribute. Another approach, known as the separate class method, is used by the CHAID algorithm, where the missing value is treated as a separate categorical value distinct from other values of the splitting attribute. Figure 3.18 shows an example of the three different ways for handling missing values in a decision tree classifier. Other strategies for dealing with missing values are based on data preprocessing, where the instance with missing value is either imputed with the mode (for categorical attribute) or mean (for continuous attribute) value or discarded before the classifier is trained.

Figure 3.18. Methods for handling missing attribute values in decision tree classifier.

During training, if an attribute v has missing values in some of the training instances associated with a node, we need a way to measure the gain in purity if v is used for splitting. One simple way is to exclude instances with missing values of v in the counting of instances associated with every child node, generated for every possible outcome of v. Further, if v is chosen as the attribute test condition at a node, training instances with missing values of v can be propagated to the child nodes using any of the methods described above for handling missing values in test instances.

5. Handling Interactions among Attributes: Attributes are considered interacting if they are able to distinguish between classes when used together, but individually they provide little or no information. Due to the greedy nature of the splitting criteria in decision trees, such attributes could be passed over in favor of other attributes that are not as useful. This could result in more complex decision trees than necessary. Hence, decision trees can perform poorly when there are interactions among attributes. To illustrate this point, consider the three-dimensional data shown in Figure 3.19(a), which contains 2000 data points from one of two classes, denoted as + and ∘ in the diagram. Figure 3.19(b) shows the distribution of the two classes in the two-dimensional space involving attributes X and Y, which is a noisy version of the XOR Boolean function. We can see that even though the two classes are well-separated in this two-dimensional space, neither of the two attributes contains sufficient information to distinguish between the two classes when used alone. For example, the entropies of the attribute test conditions X ≤ 10 and Y ≤ 10 are close to 1, indicating that neither X nor Y provides any reduction in the impurity measure when used individually. X and Y thus represent a case of interaction among attributes. The data set also contains a third attribute, Z, in which both classes are distributed uniformly, as shown in Figures 3.19(c) and 3.19(d), and hence, the entropy of any split involving Z is close to 1. As a result, Z is as likely to be chosen for splitting as the interacting but useful attributes, X and Y. For further illustration of this issue, readers are referred to Example 3.7 in Section 3.4.1 and Exercise 7 at the end of this chapter.

Figure 3.19. Example of XOR data involving X and Y, along with an irrelevant attribute Z.

6. Handling Irrelevant Attributes: An attribute is irrelevant if it is not useful for the classification task. Since irrelevant attributes are poorly associated with the target class labels, they will provide little or no gain in purity and thus will be passed over in favor of other more relevant features. Hence, the presence of a small number of irrelevant attributes will not impact the decision tree construction process. However, not all attributes that provide little to no gain are irrelevant (see Figure 3.19). Hence, if the classification problem is complex (e.g., involving interactions among attributes) and there are a large number of irrelevant attributes, then some of these attributes may be accidentally chosen during the tree-growing process, since they may provide a better gain than a relevant attribute just by random chance. Feature selection techniques can help to improve the accuracy of decision trees by eliminating the irrelevant attributes during preprocessing. We will investigate the issue of too many irrelevant attributes in Section 3.4.1.

7. Handling Redundant Attributes: An attribute is redundant if it is strongly correlated with another attribute in the data. Since redundant attributes show similar gains in purity if they are selected for splitting, only one of them will be selected as an attribute test condition in the decision tree algorithm. Decision trees can thus handle the presence of redundant attributes.

8. Using Rectilinear Splits: The test conditions described so far in this chapter involve using only a single attribute at a time. As a consequence, the tree-growing procedure can be viewed as the process of partitioning the attribute space into disjoint regions until each region contains records of the same class. The border between two neighboring regions of different classes is known as a decision boundary. Figure 3.20 shows the decision tree as well as the decision boundary for a binary classification problem. Since the test condition involves only a single attribute, the decision boundaries are rectilinear; i.e., parallel to the coordinate axes. This limits the expressiveness of decision trees in representing decision boundaries of data sets with continuous attributes. Figure 3.21 shows a two-dimensional data set involving binary classes that cannot be perfectly classified by a decision tree whose attribute test conditions are defined based on single attributes. The binary classes in the data set are generated from two skewed Gaussian distributions, centered at (8, 8) and (12, 12), respectively. The true decision boundary is represented by the diagonal dashed line, whereas the rectilinear decision boundary produced by the decision tree classifier is shown by the thick solid line. In contrast, an oblique decision tree may overcome this limitation by allowing the test condition to be specified using more than one attribute. For example, the binary classification data shown in Figure 3.21 can be easily represented by an oblique decision tree with a single root node with test condition x + y < 20.

Figure 3.20. Example of a decision tree and its decision boundaries for a two-dimensional data set.

Figure 3.21. Example of data set that cannot be partitioned optimally using a decision tree with single attribute test conditions. The true decision boundary is shown by the dashed line.

Although an oblique decision tree is more expressive and can produce more compact trees, finding the optimal test condition is computationally more expensive.

9. Choice of Impurity Measure: It should be noted that the choice of impurity measure often has little effect on the performance of decision tree classifiers since many of the impurity measures are quite consistent with each other, as shown in Figure 3.11 on page 129. Instead, the strategy used to prune the tree has a greater impact on the final tree than the choice of impurity measure.
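The expressiveness argument in item 2 can be illustrated with a small Python sketch. The nested conditionals below are one possible 7-leaf tree for the Boolean function (A∧B)∨(C∧D); the exact shape of the tree in Figure 3.17 is not reproduced here, so the layout and the names `target` and `tree_predict` are assumptions.

```python
from itertools import product

def target(a, b, c, d):
    """The Boolean function (A and B) or (C and D)."""
    return (a and b) or (c and d)

def tree_predict(a, b, c, d):
    """One possible decision tree for the target function, with 7 leaves."""
    if a:
        if b:
            return True        # leaf 1: A and B both true
        if c:
            if d:
                return True    # leaf 2
            return False       # leaf 3
        return False           # leaf 4
    if c:
        if d:
            return True        # leaf 5
        return False           # leaf 6
    return False               # leaf 7

# The full assignment table has 2**4 = 16 rows.
assignments = list(product([False, True], repeat=4))
agrees = all(tree_predict(*v) == target(*v) for v in assignments)
```

The tree agrees with the function on all 16 assignments while using only 7 leaves, because several assignments share a leaf (for example, every assignment with A and B true ends at leaf 1).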

3.4 Model Overfitting

Methods presented so far try to learn classification models that show the lowest error on the training set. However, as we will show in the following example, even if a model fits well over the training data, it can still show poor generalization performance, a phenomenon known as model overfitting.

Figure 3.22. Examples of training and test sets of a two-dimensional classification problem.

Figure 3.23. Effect of varying tree size (number of leaf nodes) on training and test errors.

Example 3.5. Overfitting and Underfitting of Decision Trees

Consider the two-dimensional data set shown in Figure 3.22(a). The data set contains instances that belong to two separate classes, represented as + and ∘, respectively, where each class has 5400 instances. All instances belonging to the ∘ class were generated from a uniform distribution. For the + class, 5000 instances were generated from a Gaussian distribution centered at (10, 10) with unit variance, while the remaining 400 instances were sampled from the same uniform distribution as the ∘ class. We can see from Figure 3.22(a) that the + class can be largely distinguished from the ∘ class by drawing a circle of appropriate size centered at (10, 10). To learn a classifier using this two-dimensional data set, we randomly sampled 10% of the data for training and used the remaining 90% for testing. The training set, shown in Figure 3.22(b), looks quite representative of the overall data. We used Gini index as the impurity measure to construct decision trees of increasing sizes (number of leaf nodes), by recursively expanding a node into child nodes till every leaf node was pure, as described in Section 3.3.4.

Figure 3.23(a) shows changes in the training and test error rates as the size of the tree varies from 1 to 8. Both error rates are initially large when the tree has only one or two leaf nodes. This situation is known as model underfitting. Underfitting occurs when the learned decision tree is too simplistic, and thus, incapable of fully representing the true relationship between the attributes and the class labels. As we increase the tree size from 1 to 8, we can observe two effects. First, both the error rates decrease since larger trees are able to represent more complex decision boundaries. Second, the training and test error rates are quite close to each other, which indicates that the performance on the training set is fairly representative of the generalization performance. As we further increase the size of the tree from 8 to 150, the training error continues to steadily decrease till it eventually reaches zero, as shown in Figure 3.23(b). However, in a striking contrast, the test error rate ceases to decrease any further beyond a certain tree size, and then it begins to increase. The training error rate thus grossly under-estimates the test error rate once the tree becomes too large. Further, the gap between the training and test error rates keeps on widening as we increase the tree size. This behavior, which may seem counter-intuitive at first, can be attributed to the phenomenon of model overfitting.

3.4.1 Reasons for Model Overfitting

Model overfitting is the phenomenon where, in the pursuit of minimizing the training error rate, an overly complex model is selected that captures specific patterns in the training data but fails to learn the true nature of relationships between attributes and class labels in the overall data. To illustrate this, Figure 3.24 shows decision trees and their corresponding decision boundaries (shaded rectangles represent regions assigned to the + class) for two trees of sizes 5 and 50. We can see that the decision tree of size 5 appears quite simple and its decision boundaries provide a reasonable approximation to the ideal decision boundary, which in this case corresponds to a circle centered around the Gaussian distribution at (10, 10). Although its training and test error rates are non-zero, they are very close to each other, which indicates that the patterns learned in the training set should generalize well over the test set. On the other hand, the decision tree of size 50 appears much more complex than the tree of size 5, with complicated decision boundaries. For example, some of its shaded rectangles (assigned the + class) attempt to cover narrow regions in the input space that contain only one or two + training instances. Note that the prevalence of + instances in such regions is highly specific to the training set, as these regions are mostly dominated by ∘ instances in the overall data. Hence, in an attempt to perfectly fit the training data, the decision tree of size 50 starts fine-tuning itself to specific patterns in the training data, leading to poor performance on an independently chosen test set.

Figure 3.24. Decision trees with different model complexities.

Figure 3.25. Performance of decision trees using 20% data for training (twice the original training size).

There are a number of factors that influence model overfitting. In the following, we provide brief descriptions of two of the major factors: limited training size and high model complexity. Though they are not exhaustive, the interplay between them can help explain most of the common model overfitting phenomena in real-world applications.

Limited Training Size

Note that a training set consisting of a finite number of instances can only provide a limited representation of the overall data. Hence, it is possible that the patterns learned from a training set do not fully represent the true patterns in the overall data, leading to model overfitting. In general, as we increase the size of a training set (number of training instances), the patterns learned from the training set start resembling the true patterns in the overall data. Hence, the effect of overfitting can be reduced by increasing the training size, as illustrated in the following example.

Example 3.6. Effect of Training Size

Suppose that we use twice as many training instances as in the experiments conducted in Example 3.5. Specifically, we use 20% of the data for training and use the remainder for testing. Figure 3.25(b) shows the training and test error rates as the size of the tree is varied from 1 to 150. There are two major differences in the trends shown in this figure and those shown in Figure 3.23(b) (using only 10% of the data for training). First, even though the training error rate decreases with increasing tree size in both figures, its rate of decrease is much smaller when we use twice the training size. Second, for a given tree size, the gap between the training and test error rates is much smaller when we use twice the training size. These differences suggest that the patterns learned using 20% of data for training are more generalizable than those learned using 10% of data for training.

Figure 3.25(a) shows the decision boundaries for the tree of size 50, learned using 20% of data for training. In contrast to the tree of the same size learned using 10% data for training (see Figure 3.24(d)), we can see that the decision tree is not capturing specific patterns of noisy instances in the training set. Instead, the high model complexity of 50 leaf nodes is being effectively used to learn the boundaries of the + instances centered at (10, 10).

High Model Complexity

Generally, a more complex model has a better ability to represent complex patterns in the data. For example, decision trees with a larger number of leaf nodes can represent more complex decision boundaries than decision trees with fewer leaf nodes. However, an overly complex model also has a tendency to learn specific patterns in the training set that do not generalize well over unseen instances. Models with high complexity should thus be judiciously used to avoid overfitting.

One measure of model complexity is the number of "parameters" that need to be inferred from the training set. For example, in the case of decision tree induction, the attribute test conditions at internal nodes correspond to the parameters of the model that need to be inferred from the training set. A decision tree with a larger number of attribute test conditions (and consequently more leaf nodes) thus involves more "parameters" and hence is more complex.

Given a class of models with a certain number of parameters, a learning algorithm attempts to select the best combination of parameter values that maximizes an evaluation metric (e.g., accuracy) over the training set. If the number of parameter value combinations (and hence the complexity) is large, the learning algorithm has to select the best combination from a large number of possibilities, using a limited training set. In such cases, there is a high chance for the learning algorithm to pick a spurious combination of parameters that maximizes the evaluation metric just by random chance. This is similar to the multiple comparisons problem (also referred to as the multiple testing problem) in statistics.

As an illustration of the multiple comparisons problem, consider the task of predicting whether the stock market will rise or fall in the next ten trading days. If a stock analyst simply makes random guesses, the probability that her prediction is correct on any trading day is 0.5. However, the probability that she will predict correctly at least nine out of ten times is

\[
\frac{\binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0107,
\]

which is extremely low.

Suppose we are interested in choosing an investment advisor from a pool of 200 stock analysts. Our strategy is to select the analyst who makes the largest number of correct predictions in the next ten trading days. The flaw in this strategy is that even if all the analysts make their predictions in a random fashion, the probability that at least one of them makes at least nine correct predictions is

\[
1 - (1 - 0.0107)^{200} = 0.8847,
\]

which is very high. Although each analyst has a low probability of predicting at least nine times correctly, considered together, we have a high probability of finding at least one analyst who can do so. However, there is no guarantee in the future that such an analyst will continue to make accurate predictions by random guessing.

How does the multiple comparisons problem relate to model overfitting? In the context of learning a classification model, each combination of parameter values corresponds to an analyst, while the number of training instances corresponds to the number of days. Analogous to the task of selecting the best analyst who makes the most accurate predictions on consecutive days, the task of a learning algorithm is to select the best combination of parameters that results in the highest accuracy on the training set. If the number of parameter combinations is large but the training size is small, it is highly likely for the learning algorithm to choose a spurious parameter combination that provides high training accuracy just by random chance. In the following example, we illustrate the phenomenon of overfitting due to multiple comparisons in the context of decision tree induction.
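The two probabilities in the stock analyst illustration can be checked numerically. The sketch below uses only the Python standard library; the variable names are chosen for illustration, and the second value is computed from the unrounded single-analyst probability 11/1024 rather than from 0.0107.

```python
from math import comb

n_days = 10
n_analysts = 200

# Probability that one analyst guessing at random is right on at least 9 of 10 days.
p_single = (comb(n_days, 9) + comb(n_days, 10)) / 2 ** n_days

# Probability that at least one of 200 independent random guessers achieves this.
p_any = 1 - (1 - p_single) ** n_analysts
```

Rounded to four decimal places, these evaluate to 0.0107 and 0.8847, which agrees with the figures quoted above and shows how quickly a rare event becomes likely once many candidates are compared.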

Figure 3.26. Example of a two-dimensional (X-Y) data set.

Figure 3.27. Training and test error rates illustrating the effect of multiple comparisons problem on model overfitting.

Example 3.7. Multiple Comparisons and Overfitting

Consider the two-dimensional data set shown in Figure 3.26 containing 500 + and 500 ∘ instances, which is similar to the data shown in Figure 3.19. In this data set, the distributions of both classes are well-separated in the two-dimensional (X-Y) attribute space, but neither of the two attributes (X or Y) is sufficiently informative to be used alone for separating the two classes. Hence, splitting the data set based on any value of an X or Y attribute will provide close to zero reduction in an impurity measure. However, if X and Y attributes are used together in the splitting criterion (e.g., splitting X at 10 and Y at 10), the two classes can be effectively separated.

Figure 3.28. Decision tree with 6 leaf nodes using X and Y as attributes. Splits have been numbered from 1 to 5 in order of their occurrence in the tree.

Figure 3.27(a) shows the training and test error rates for learning decision trees of varying sizes, when 30% of the data is used for training and the remainder of the data for testing. We can see that the two classes can be separated using a small number of leaf nodes. Figure 3.28 shows the decision boundaries for the tree with six leaf nodes, where the splits have been numbered according to their order of appearance in the tree. Note that even though splits 1 and 3 provide trivial gains, their consequent splits (2, 4, and 5) provide large gains, resulting in effective discrimination of the two classes.

Assume we add 100 irrelevant attributes to the two-dimensional X-Y data. Learning a decision tree from this resultant data will be challenging because the number of candidate attributes to choose for splitting at every internal node will increase from two to 102. With such a large number of candidate attribute test conditions to choose from, it is quite likely that spurious attribute test conditions will be selected at internal nodes because of the multiple comparisons problem. Figure 3.27(b) shows the training and test error rates after adding 100 irrelevant attributes to the training set. We can see that the test error rate remains close to 0.5 even after using 50 leaf nodes, while the training error rate keeps on declining and eventually becomes 0.
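The effect described in this example can be imitated on a much smaller scale. The sketch below is not the experiment behind Figure 3.27: it assumes a noise-free XOR pattern over two binary attributes, 60 training instances, and 100 irrelevant binary attributes drawn from a seeded random generator, and it only compares information gains at the root node.

```python
import random
from math import log2

def entropy(labels):
    """Entropy (base 2) of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def info_gain(attr, labels):
    """Information gain of splitting on a binary attribute."""
    n = len(labels)
    left = [y for a, y in zip(attr, labels) if a]
    right = [y for a, y in zip(attr, labels) if not a]
    children = (len(left) * entropy(left) + len(right) * entropy(right)) / n
    return entropy(labels) - children

# Assumed training set: 15 instances in each of the four XOR quadrants.
x = [0] * 30 + [1] * 30
y = ([0] * 15 + [1] * 15) * 2
labels = [a ^ b for a, b in zip(x, y)]

gain_x = info_gain(x, labels)     # relevant attribute, but uninformative on its own
gain_y = info_gain(y, labels)

rng = random.Random(0)            # seeded so that the run is repeatable
irrelevant = [[rng.randint(0, 1) for _ in labels] for _ in range(100)]
best_irrelevant = max(info_gain(attr, labels) for attr in irrelevant)
```

Here both relevant attributes have zero gain when considered alone, so a greedy learner that picks the attribute with the largest gain at the root would choose one of the irrelevant attributes, whose best gain is positive purely by chance.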

3.5 Model Selection

There are many possible classification models with varying levels of model complexity that can be used to capture patterns in the training data. Among these possibilities, we want to select the model that shows the lowest generalization error rate. The process of selecting a model with the right level of complexity, which is expected to generalize well over unseen test instances, is known as model selection. As described in the previous section, the training error rate cannot be reliably used as the sole criterion for model selection. In the following, we present three generic approaches to estimate the generalization performance of a model that can be used for model selection. We conclude this section by presenting specific strategies for using these approaches in the context of decision tree induction.

3.5.1 Using a Validation Set

Note that we can always estimate the generalization error rate of a model by using "out-of-sample" estimates, i.e., by evaluating the model on a separate validation set that is not used for training the model. The error rate on the validation set, termed the validation error rate, is a better indicator of generalization performance than the training error rate, since the validation set has not been used for training the model. The validation error rate can be used for model selection as follows.

Given a training set D.train, we can partition D.train into two smaller subsets, D.tr and D.val, such that D.tr is used for training while D.val is used as the validation set. For example, two-thirds of D.train can be reserved as D.tr for training, while the remaining one-third is used as D.val for computing the validation error rate. For any choice of classification model m that is trained on D.tr, we can estimate its validation error rate on D.val, err_val(m). The model that shows the lowest value of err_val(m) can then be selected as the preferred choice of model.

The use of a validation set provides a generic approach for model selection. However, one limitation of this approach is that it is sensitive to the sizes of D.tr and D.val, obtained by partitioning D.train. If the size of D.tr is too small, it may result in the learning of a poor classification model with sub-standard performance, since a smaller training set will be less representative of the overall data. On the other hand, if the size of D.val is too small, the validation error rate might not be reliable for selecting models, as it would be computed over a small number of instances.

Figure 3.29. Class distribution of validation data for the two decision trees shown in Figure 3.30.
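The validation-set procedure of this section can be sketched as follows. Everything in the block is illustrative: `select_model`, `fit_majority`, and `fit_threshold` are hypothetical names, the two learners are deliberately simple stand-ins for real classifiers, and the one-dimensional data set is made up.

```python
import random

def error_rate(model, data):
    """Fraction of (x, y) pairs in data that the model misclassifies."""
    return sum(model(x) != y for x, y in data) / len(data)

def select_model(learners, d_train, frac_tr=2 / 3, seed=0):
    """Train each learner on D.tr and pick the one with the lowest error on D.val."""
    shuffled = list(d_train)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * frac_tr)
    d_tr, d_val = shuffled[:cut], shuffled[cut:]
    err_val = {name: error_rate(fit(d_tr), d_val) for name, fit in learners.items()}
    best = min(err_val, key=err_val.get)
    return best, err_val

def fit_majority(d_tr):
    """Hypothetical learner: always predict the majority class of D.tr."""
    ones = sum(y for _, y in d_tr)
    label = int(2 * ones >= len(d_tr))
    return lambda x: label

def fit_threshold(d_tr):
    """Hypothetical learner: predict 1 if x exceeds the best training threshold."""
    thresholds = sorted({x for x, _ in d_tr})
    best_t = min(thresholds, key=lambda t: error_rate(lambda x: int(x > t), d_tr))
    return lambda x: int(x > best_t)

# Made-up one-dimensional data: the class is 1 exactly when x > 5.
d_train = [(x, int(x > 5)) for x in range(12)]
learners = {"majority": fit_majority, "threshold": fit_threshold}
best, err_val = select_model(learners, d_train)
```

With only four validation instances in this toy split, the estimate of err_val(m) is coarse, which is the sensitivity to the size of D.val noted above.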

Example 3.8. Validation Error

In the following example, we illustrate one possible approach for using a validation set in decision tree induction. Figure 3.29 shows the predicted labels at the leaf nodes of the decision trees generated in Figure 3.30. The counts given beneath the leaf nodes represent the proportion of data instances in the validation set that reach each of the nodes. Based on the predicted labels of the nodes, the validation error rate for the left tree is err_val(T_L) = 6/16 = 0.375, while the validation error rate for the right tree is err_val(T_R) = 4/16 = 0.25. Based on their validation error rates, the right tree is preferred over the left one.

3.5.2 Incorporating Model Complexity

Since the chance for model overfitting increases as the model becomes more complex, a model selection approach should not only consider the training error rate but also the model complexity. This strategy is inspired by a well-known principle known as Occam's razor or the principle of parsimony, which suggests that given two models with the same errors, the simpler model is preferred over the more complex model. A generic approach to account for model complexity while estimating generalization performance is formally described as follows.

Given a training set D.train, let us consider learning a classification model m that belongs to a certain class of models, ℳ. For example, if ℳ represents the set of all possible decision trees, then m can correspond to a specific decision tree learned from the training set. We are interested in estimating the generalization error rate of m, gen.error(m). As discussed previously, the training error rate of m, train.error(m, D.train), can under-estimate gen.error(m) when the model complexity is high. Hence, we represent gen.error(m) as a function of not just the training error rate but also the model complexity of ℳ as follows:

where isahyper-parameterthatstrikesabalancebetweenminimizingtrainingerrorandreducingmodelcomplexity.Ahighervalueof givesmoreemphasistothemodelcomplexityintheestimationofgeneralizationperformance.Tochoosetherightvalueof ,wecanmakeuseofthevalidationsetinasimilarwayasdescribedin3.5.1 .Forexample,wecaniteratethrougharangeofvaluesof andforeverypossiblevalue,wecanlearnamodelonasubsetofthetrainingset,D.tr,andcomputeitsvalidationerrorrateonaseparatesubset,D.val.Wecanthenselectthevalueof thatprovidesthelowestvalidationerrorrate.

Equation3.11 providesonepossibleapproachforincorporatingmodelcomplexityintotheestimateofgeneralizationperformance.Thisapproachisattheheartofanumberoftechniquesforestimatinggeneralizationperformance,suchasthestructuralriskminimizationprinciple,theAkaike'sInformationCriterion(AIC),andtheBayesianInformationCriterion(BIC).Thestructuralriskminimizationprincipleservesasthebuildingblockforlearningsupportvectormachines,whichwillbediscussedlaterinChapter4 .FormoredetailsonAICandBIC,seetheBibliographicNotes.

Inthefollowing,wepresenttwodifferentapproachesforestimatingthecomplexityofamodel, .Whiletheformerisspecifictodecisiontrees,thelatterismoregenericandcanbeusedwithanyclassofmodels.

M,complexity(M),

gen.error(m)=train.error(m,D.train)+α×complexity(M), (3.11)

αα

α

α

α

complexity(M)

Estimating the Complexity of Decision Trees

In the context of decision trees, the complexity of a decision tree can be estimated as the ratio of the number of leaf nodes to the number of training instances. Let k be the number of leaf nodes and $N_{train}$ be the number of training instances. The complexity of a decision tree can then be described as $k/N_{train}$. This reflects the intuition that for a larger training size, we can learn a decision tree with a larger number of leaf nodes without it becoming overly complex. The generalization error rate of a decision tree T can then be computed using Equation 3.11 as follows:

$$err_{gen}(T) = err(T) + \Omega \times \frac{k}{N_{train}},$$

where err(T) is the training error of the decision tree and $\Omega$ is a hyper-parameter that makes a trade-off between reducing training error and minimizing model complexity, similar to the use of $\alpha$ in Equation 3.11. $\Omega$ can be viewed as the cost of adding a leaf node relative to incurring a training error. In the literature on decision tree induction, the above approach for estimating generalization error rate is also termed the pessimistic error estimate. It is called pessimistic as it assumes the generalization error rate to be worse than the training error rate (by adding a penalty term for model complexity). On the other hand, simply using the training error rate as an estimate of the generalization error rate is called the optimistic error estimate or the resubstitution estimate.

Example 3.9. Generalization Error Estimates

Consider the two binary decision trees, $T_L$ and $T_R$, shown in Figure 3.30. Both trees are generated from the same training data and $T_L$ is generated by expanding three leaf nodes of $T_R$. The counts shown in the leaf nodes of the trees represent the class distribution of the training instances. If each leaf node is labeled according to the majority class of training instances that reach the node, the training error rate for the left tree will be $err(T_L) = 4/24 = 0.167$, while the training error rate for the right tree will be $err(T_R) = 6/24 = 0.25$. Based on their training error rates alone, $T_L$ would be preferred over $T_R$, even though $T_L$ is more complex (contains a larger number of leaf nodes) than $T_R$.

Figure 3.30. Example of two decision trees generated from the same training data.

Now, assume that the cost associated with each leaf node is $\Omega = 0.5$. Then, the generalization error estimate for $T_L$ will be

$$err_{gen}(T_L) = \frac{4}{24} + 0.5 \times \frac{7}{24} = \frac{7.5}{24} = 0.3125$$

and the generalization error estimate for $T_R$ will be

$$err_{gen}(T_R) = \frac{6}{24} + 0.5 \times \frac{4}{24} = \frac{8}{24} = 0.3333.$$

Since $T_L$ has a lower generalization error rate, it will still be preferred over $T_R$. Note that $\Omega = 0.5$ implies that a node should always be expanded into its two child nodes if it improves the prediction of at least one training instance, since expanding a node is less costly than misclassifying a training instance. On the other hand, if $\Omega = 1$, then the generalization error rate for $T_L$ is $err_{gen}(T_L) = 11/24 = 0.458$ and for $T_R$ is $err_{gen}(T_R) = 10/24 = 0.417$. In this case, $T_R$ will be preferred over $T_L$ because it has a lower generalization error rate. This example illustrates that different choices of $\Omega$ can change our preference of decision trees based on their generalization error estimates. However, for a given choice of $\Omega$, the pessimistic error estimate provides an approach for modeling the generalization performance on unseen test instances. The value of $\Omega$ can be selected with the help of a validation set.

Minimum Description Length Principle

Another way to incorporate model complexity is based on an information-theoretic approach known as the minimum description length or MDL principle. To illustrate this approach, consider the example shown in Figure 3.31. In this example, both person A and person B are given a set of instances with known attribute values x. Assume person A knows the class label y for every instance, while person B has no such information. A would like to share the class information with B by sending a message containing the labels. The message would contain $\Theta(N)$ bits of information, where N is the number of instances.

Figure 3.31. An illustration of the minimum description length principle.

Alternatively, instead of sending the class labels explicitly, A can build a classification model from the instances and transmit it to B. B can then apply the model to determine the class labels of the instances. If the model is 100% accurate, then the cost for transmission is equal to the number of bits required to encode the model. Otherwise, A must also transmit information about which instances are misclassified by the model so that B can reproduce the same class labels. Thus, the overall transmission cost, which is equal to the total description length of the message, is

$$\text{Cost}(\text{model}, \text{data}) = \text{Cost}(\text{data} \mid \text{model}) + \alpha \times \text{Cost}(\text{model}), \qquad (3.12)$$

where the first term on the right-hand side is the number of bits needed to encode the misclassified instances, while the second term is the number of bits required to encode the model. There is also a hyper-parameter $\alpha$ that trades off the relative costs of the misclassified instances and the model.

Notice the similarity between this equation and the generic equation for generalization error rate presented in Equation 3.11. A good model must have a total description length less than the number of bits required to encode the entire sequence of class labels. Furthermore, given two competing models, the model with the lower total description length is preferred. An example showing how to compute the total description length of a decision tree is given in Exercise 10 on page 189.
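Of the two complexity-based estimates in this subsection, the pessimistic error estimate is simple enough to check by hand or in a few lines of code. The following is a minimal sketch that recomputes the numbers of Example 3.9. The function names `pessimistic_error` and `preferred_tree` and the dictionary `TREES` are illustrative choices, not part of the text. The leaf and error counts (7 leaves and 4 errors for $T_L$, 4 leaves and 6 errors for $T_R$, 24 training instances) are taken from the example.

```python
from fractions import Fraction

def pessimistic_error(train_errors, n_leaves, n_train, omega):
    """err_gen(T) = err(T) + omega * k / N_train (pessimistic error estimate)."""
    return Fraction(train_errors, n_train) + Fraction(omega) * Fraction(n_leaves, n_train)

# (training errors, leaf nodes) for the two trees of Example 3.9
TREES = {"T_L": (4, 7), "T_R": (6, 4)}
N_TRAIN = 24

def preferred_tree(omega):
    """Return the tree with the lowest estimate, plus all estimates."""
    estimates = {
        name: pessimistic_error(errors, leaves, N_TRAIN, omega)
        for name, (errors, leaves) in TREES.items()
    }
    return min(estimates, key=estimates.get), estimates

if __name__ == "__main__":
    for omega in (Fraction(1, 2), 1):
        best, estimates = preferred_tree(omega)
        rounded = {name: round(float(v), 4) for name, v in estimates.items()}
        print(f"omega={float(omega)}: {rounded} -> prefer {best}")
```

Exact fractions are used so that the comparison between the two trees is not affected by floating-point rounding. The preference switches from $T_L$ to $T_R$ as $\Omega$ goes from 0.5 to 1, as in the example.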

3.5.3 Estimating Statistical Bounds

Instead of using Equation 3.11 to estimate the generalization error rate of a model, an alternative way is to apply a statistical correction to the training error rate of the model that is indicative of its model complexity. This can be done if the probability distribution of training error is available or can be assumed. For example, the number of errors committed by a leaf node in a decision tree can be assumed to follow a binomial distribution. We can thus compute an upper bound on the observed training error rate that can be used for model selection, as illustrated in the following example.

Example 3.10. Statistical Bounds on Training Error

Consider the left-most branch of the binary decision trees shown in Figure 3.30. Observe that the left-most leaf node of $T_R$ has been expanded into two child nodes in $T_L$. Before splitting, the training error rate of the node is $2/7 = 0.286$. By approximating a binomial distribution with a normal distribution, the following upper bound of the training error rate e can be derived:

$$e_{upper}(N, e, \alpha) = \frac{e + \dfrac{z_{\alpha/2}^2}{2N} + z_{\alpha/2}\sqrt{\dfrac{e(1-e)}{N} + \dfrac{z_{\alpha/2}^2}{4N^2}}}{1 + \dfrac{z_{\alpha/2}^2}{N}}, \qquad (3.13)$$

where $\alpha$ is the confidence level, $z_{\alpha/2}$ is the standardized value from a standard normal distribution, and N is the total number of training instances used to compute e. By replacing $\alpha = 25\%$, $N = 7$, and $e = 2/7$, the upper bound for the error rate is $e_{upper}(7, 2/7, 0.25) = 0.503$, which corresponds to $7 \times 0.503 = 3.521$ errors. If we expand the node into its child nodes as shown in $T_L$, the training error rates for the child nodes are $1/4 = 0.250$ and $1/3 = 0.333$, respectively. Using Equation 3.13, the upper bounds of these error rates are $e_{upper}(4, 1/4, 0.25) = 0.537$ and $e_{upper}(3, 1/3, 0.25) = 0.650$, respectively. The overall training error of the child nodes is $4 \times 0.537 + 3 \times 0.650 = 4.098$, which is larger than the estimated error for the corresponding node in $T_R$, suggesting that it should not be split.

3.5.4 Model Selection for Decision Trees

Building on the generic approaches presented above, we present two commonly used model selection strategies for decision tree induction.

Prepruning (Early Stopping Rule)

In this approach, the tree-growing algorithm is halted before generating a fully grown tree that perfectly fits the entire training data. To do this, a more restrictive stopping condition must be used; e.g., stop expanding a leaf node when the observed gain in the generalization error estimate falls below a certain threshold. This estimate of the generalization error rate can be computed using any of the approaches presented in the preceding three subsections, e.g., by using pessimistic error estimates, by using validation error estimates, or by using statistical bounds. The advantage of prepruning is that it avoids the computations associated with generating overly complex subtrees that overfit the training data. However, one major drawback of this method is that, even if no significant gain is obtained using one of the existing splitting criteria, subsequent splitting may result in better subtrees. Such subtrees would not be reached if prepruning is used because of the greedy nature of decision tree induction.
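As one illustration of such a stopping test, the statistical bound of Equation 3.13 can be evaluated before and after a candidate split. The sketch below reuses the node counts of Example 3.10. The function names `e_upper` and `should_split` are illustrative. Obtaining $z_{\alpha/2}$ from `statistics.NormalDist` is an implementation choice, and it gives about 1.15 for $\alpha = 0.25$.

```python
from math import sqrt
from statistics import NormalDist

def e_upper(n, e, alpha):
    """Upper bound of Equation 3.13 for a node with n instances and error rate e."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    z2 = z * z
    numerator = e + z2 / (2 * n) + z * sqrt(e * (1 - e) / n + z2 / (4 * n * n))
    return numerator / (1 + z2 / n)

def should_split(parent, children, alpha=0.25):
    """Compare estimated error counts before and after a split.

    parent and each child are (n_instances, n_training_errors) pairs.
    Returns (split?, estimated errors before, estimated errors after).
    """
    n, errors = parent
    before = n * e_upper(n, errors / n, alpha)
    after = sum(m * e_upper(m, r / m, alpha) for m, r in children)
    return after < before, before, after

if __name__ == "__main__":
    decision, before, after = should_split((7, 2), [(4, 1), (3, 1)])
    print(f"before={before:.2f} after={after:.2f} split={decision}")
```

For the node of Example 3.10 the estimated error count rises after the split (roughly 3.5 before, 4.1 after), so under this criterion the node would be left unexpanded.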

Post-pruning

In this approach, the decision tree is initially grown to its maximum size. This is followed by a tree-pruning step, which proceeds to trim the fully grown tree in a bottom-up fashion. Trimming can be done by replacing a subtree with (1) a new leaf node whose class label is determined from the majority class of instances affiliated with the subtree (an approach known as subtree replacement), or (2) the most frequently used branch of the subtree (an approach known as subtree raising). The tree-pruning step terminates when no further improvement in the generalization error estimate is observed beyond a certain threshold. Again, the estimates of generalization error rate can be computed using any of the approaches presented in the previous three subsections. Post-pruning tends to give better results than prepruning because it makes pruning decisions based on a fully grown tree, unlike prepruning, which can suffer from premature termination of the tree-growing process. However, for post-pruning, the additional computations needed to grow the full tree may be wasted when the subtree is pruned.

Figure 3.32 illustrates the simplified decision tree model for the web robot detection example given in Section 3.3.5. Notice that the subtree rooted at depth = 1 has been replaced by one of its branches corresponding to breadth <= 7, width > 3, and MultiP = 1, using subtree raising. On the other hand, the subtree corresponding to depth > 1 and MultiAgent = 0 has been replaced by a leaf node assigned to class 0, using subtree replacement. The subtree for depth > 1 and MultiAgent = 1 remains intact.

Figure 3.32. Post-pruning of the decision tree for web robot detection.

3.6 Model Evaluation

The previous section discussed several approaches for model selection that can be used to learn a classification model from a training set D.train. Here we discuss methods for estimating its generalization performance, i.e., its performance on unseen instances outside of D.train. This process is known as model evaluation.

Note that the model selection approaches discussed in Section 3.5 also compute an estimate of the generalization performance using the training set D.train. However, these estimates are biased indicators of the performance on unseen instances, since they were used to guide the selection of the classification model. For example, if we use the validation error rate for model selection (as described in Section 3.5.1), the resulting model would be deliberately chosen to minimize the errors on the validation set. The validation error rate may thus under-estimate the true generalization error rate, and hence cannot be reliably used for model evaluation.

A correct approach for model evaluation would be to assess the performance of a learned model on a labeled test set that has not been used at any stage of model selection. This can be achieved by partitioning the entire set of labeled instances D into two disjoint subsets: D.train, which is used for model selection, and D.test, which is used for computing the test error rate, $err_{test}$. In the following, we present two different approaches for partitioning D into D.train and D.test, and computing the test error rate, $err_{test}$.

3.6.1 Holdout Method

The most basic technique for partitioning a labeled data set is the holdout method, where the labeled set D is randomly partitioned into two disjoint sets, called the training set D.train and the test set D.test. A classification model is then induced from D.train using the model selection approaches presented in Section 3.5, and its error rate on D.test, $err_{test}$, is used as an estimate of the generalization error rate. The proportion of data reserved for training and for testing is typically at the discretion of the analysts, e.g., two-thirds for training and one-third for testing.

Similar to the trade-off faced while partitioning D.train into D.tr and D.val in Section 3.5.1, choosing the right fraction of labeled data to be used for training and testing is non-trivial. If the size of D.train is small, the classification model may be learned from an insufficient number of training instances, resulting in a biased estimation of generalization performance. On the other hand, if the size of D.test is small, $err_{test}$ may be less reliable as it would be computed over a small number of test instances. Moreover, $err_{test}$ can have a high variance as we change the random partitioning of D into D.train and D.test.

The holdout method can be repeated several times to obtain a distribution of the test error rates, an approach known as random subsampling or the repeated holdout method. This method produces a distribution of the error rates that can be used to understand the variance of $err_{test}$.

3.6.2 Cross-Validation

Cross-validation is a widely-used model evaluation method that aims to make effective use of all labeled instances in D for both training and testing. To illustrate this method, suppose that we are given a labeled set that we have randomly partitioned into three equal-sized subsets, $S_1$, $S_2$, and $S_3$, as shown in Figure 3.33. For the first run, we train a model using subsets $S_2$ and $S_3$ (shown as empty blocks) and test the model on subset $S_1$. The test error rate on $S_1$, denoted as $err(S_1)$, is thus computed in the first run. Similarly, for the second run, we use $S_1$ and $S_3$ as the training set and $S_2$ as the test set, to compute the test error rate, $err(S_2)$, on $S_2$. Finally, we use $S_1$ and $S_2$ for training in the third run, while $S_3$ is used for testing, thus resulting in the test error rate $err(S_3)$ for $S_3$. The overall test error rate is obtained by summing up the number of errors committed in each test subset across all runs and dividing it by the total number of instances. This approach is called three-fold cross-validation.

Figure 3.33. Example demonstrating the technique of 3-fold cross-validation.

The k-fold cross-validation method generalizes this approach by segmenting the labeled data D (of size N) into k equal-sized partitions (or folds). During the $i^{th}$ run, one of the partitions of D is chosen as D.test(i) for testing, while the rest of the partitions are used as D.train(i) for training. A model m(i) is learned using D.train(i) and applied on D.test(i) to obtain the sum of test errors, $err_{sum}(i)$. This procedure is repeated k times. The total test error rate, $err_{test}$, is then computed as

$$err_{test} = \frac{\sum_{i=1}^{k} err_{sum}(i)}{N}. \qquad (3.14)$$

Every instance in the data is thus used for testing exactly once and for training exactly $(k-1)$ times. Also, every run uses a $(k-1)/k$ fraction of the data for training and a $1/k$ fraction for testing.

The right choice of k in k-fold cross-validation depends on a number of characteristics of the problem. A small value of k will result in a smaller training set at every run, which will result in a larger estimate of generalization error rate than what is expected of a model trained over the entire labeled set. On the other hand, a high value of k results in a larger training set at every run, which reduces the bias in the estimate of generalization error rate. In the extreme case, when $k = N$, every run uses exactly one data instance for testing and the remainder of the data for training. This special case of k-fold cross-validation is called the leave-one-out approach. This approach has the advantage of utilizing as much data as possible for training. However, leave-one-out can produce quite misleading results in some special scenarios, as illustrated in Exercise 11. Furthermore, leave-one-out can be computationally expensive for large data sets as the cross-validation procedure needs to be repeated N times. For most practical applications, a choice of k between 5 and 10 provides a reasonable approach for estimating the generalization error rate, because each run is able to make use of 80% to 90% of the labeled data for training.

The k-fold cross-validation method, as described above, produces a single estimate of the generalization error rate, without providing any information about the variance of the estimate. To obtain this information, we can run k-fold cross-validation for every possible partitioning of the data into k partitions, and obtain a distribution of test error rates computed for every such partitioning. The average test error rate across all possible partitionings serves as a more robust estimate of generalization error rate. This approach of estimating the generalization error rate and its variance is known as the complete cross-validation approach. Even though such an estimate is quite robust, it is usually too expensive to consider all possible partitionings of a large data set into k partitions. A more practical solution is to repeat the cross-validation approach multiple times, using a different random partitioning of the data into k partitions every time, and use the average test error rate as the estimate of generalization error rate. Note that since there is only one possible partitioning for the leave-one-out approach, it is not possible to estimate the variance of generalization error rate, which is another limitation of this method.
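The k-fold procedure of Equation 3.14, together with its repeated variant, can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than a reference implementation. The data set is assumed to be a list of `(x, y)` pairs, `learner` is any function that maps a training list to a prediction function, and `majority_learner` is a deliberately trivial placeholder used only to make the sketch runnable. The folds are near-equal in size when N is not a multiple of k.

```python
import random
from collections import Counter

def majority_learner(train):
    """Placeholder learner: always predicts the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def k_fold_error(data, k, learner, seed=0):
    """Equation 3.14: test errors summed over the k runs, divided by N."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)            # random partitioning of D
    folds = [idx[i::k] for i in range(k)]       # k near-equal folds
    err_sum = 0
    for i in range(k):
        test_ids = set(folds[i])
        d_train = [data[j] for j in idx if j not in test_ids]
        model = learner(d_train)                # m(i) learned on D.train(i)
        err_sum += sum(model(data[j][0]) != data[j][1] for j in folds[i])
    return err_sum / len(data)

def repeated_k_fold(data, k, learner, repeats=5):
    """Repeat k-fold CV with different random partitionings; return mean and all rates."""
    rates = [k_fold_error(data, k, learner, seed=r) for r in range(repeats)]
    return sum(rates) / len(rates), rates

if __name__ == "__main__":
    # Toy labeled set: 24 instances of class 0 and 6 of class 1.
    data = [(i, 0) for i in range(24)] + [(i, 1) for i in range(24, 30)]
    print(k_fold_error(data, 3, majority_learner))
    print(repeated_k_fold(data, 5, majority_learner))
```

Setting `k = len(data)` gives the leave-one-out approach. In that case every seed produces the same partitioning, so the repeated variant yields no spread, in line with the limitation noted above.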

The k-fold cross-validation does not guarantee that the fraction of positive and negative instances in every partition of the data is equal to the fraction observed in the overall data. A simple solution to this problem is to perform a stratified sampling of the positive and negative instances into k partitions, an approach called stratified cross-validation.

In k-fold cross-validation, a different model is learned at every run and the performance of every one of the k models on their respective test folds is then aggregated to compute the overall test error rate, $err_{test}$. Hence, $err_{test}$ does not reflect the generalization error rate of any of the k models. Instead, it reflects the expected generalization error rate of the model selection approach, when applied on a training set of the same size as one of the training folds, $N(k-1)/k$. This is different from the $err_{test}$ computed in the holdout method, which exactly corresponds to the specific model learned over D.train. Hence, although effectively utilizing every data instance in D for training and testing, the $err_{test}$ computed in the cross-validation method does not represent the performance of a single model learned over a specific D.train.

Nonetheless, in practice, $err_{test}$ is typically used as an estimate of the generalization error of a model built on D. One motivation for this is that when the size of the training folds is closer to the size of the overall data (when k is large), then $err_{test}$ resembles the expected performance of a model learned over a data set of the same size as D. For example, when k is 10, every training fold is 90% of the overall data. The $err_{test}$ then should approach the expected performance of a model learned over 90% of the overall data, which will be close to the expected performance of a model learned over D.

3.7 Presence of Hyper-parameters

Hyper-parameters are parameters of learning algorithms that need to be determined before learning the classification model. For instance, consider the hyper-parameter $\alpha$ that appeared in Equation 3.11, which is repeated here for convenience. This equation was used for estimating the generalization error for a model selection approach that used an explicit representation of model complexity. (See Section 3.5.2.)

$$\text{gen.error}(m) = \text{train.error}(m, D.train) + \alpha \times \text{complexity}(\mathcal{M})$$

For other examples of hyper-parameters, see Chapter 4.

Unlike regular model parameters, such as the test conditions in the internal nodes of a decision tree, hyper-parameters such as $\alpha$ do not appear in the final classification model that is used to classify unlabeled instances. However, the values of hyper-parameters need to be determined during model selection (a process known as hyper-parameter selection) and must be taken into account during model evaluation. Fortunately, both tasks can be effectively accomplished via slight modifications of the cross-validation approach described in the previous section.

3.7.1 Hyper-parameter Selection

In Section 3.5.2, a validation set was used to select $\alpha$, and this approach is generally applicable for hyper-parameter selection. Let p be the hyper-parameter that needs to be selected from a finite range of values, $P = \{p_1, p_2, \ldots, p_n\}$. Partition D.train into D.tr and D.val. For every choice of hyper-parameter value $p_i$, we can learn a model $m_i$ on D.tr, and apply this model on D.val to obtain the validation error rate $err_{val}(p_i)$. Let $p^*$ be the hyper-parameter value that provides the lowest validation error rate. We can then use the model $m^*$ corresponding to $p^*$ as the final choice of classification model.

The above approach, although useful, uses only a subset of the data, D.tr, for training and a subset, D.val, for validation. The framework of cross-validation, presented in Section 3.6.2, addresses both of those issues, albeit in the context of model evaluation. Here we indicate how to use a cross-validation approach for hyper-parameter selection. To illustrate this approach, let us partition D.train into three folds as shown in Figure 3.34. At every run, one of the folds is used as D.val for validation, and the remaining two folds are used as D.tr for learning a model, for every choice of hyper-parameter value $p_i$. The overall validation error rate corresponding to each $p_i$ is computed by summing the errors across all three folds. We then select the hyper-parameter value $p^*$ that provides the lowest validation error rate, and use it to learn a model $m^*$ on the entire training set D.train.

Figure 3.34. Example demonstrating the 3-fold cross-validation framework for hyper-parameter selection using D.train.

Algorithm 3.2 generalizes the above approach using a k-fold cross-validation framework for hyper-parameter selection. At the $i^{th}$ run of cross-validation, the data in the $i^{th}$ fold is used as D.val(i) for validation (Step 4), while the remainder of the data in D.train is used as D.tr(i) for training (Step 5). Then for every choice of hyper-parameter value $p_i$, a model is learned on D.tr(i) (Step 7), which is applied on D.val(i) to compute its validation error (Step 8). This is used to compute the validation error rate corresponding to models learned using $p_i$ over all the folds (Step 11). The hyper-parameter value $p^*$ that provides the lowest validation error rate (Step 12) is now used to learn the final model $m^*$ on the entire training set D.train (Step 13). Hence, at the end of this algorithm, we obtain the best choice of the hyper-parameter value as well as the final classification model (Step 14), both of which are obtained by making an effective use of every data instance in D.train.

Algorithm 3.2 Procedure model-select(k, $P$, D.train)
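The listing of Algorithm 3.2 is not reproduced here, so the following is only a sketch of the procedure as described in the preceding paragraph, not the original pseudocode. The learner `learn_binned` is a hypothetical stand-in, chosen because its hyper-parameter p (the number of equal-width bins on [0, 1)) controls model complexity much as the number of leaves does for a tree. `make_data` generates a synthetic noisy data set purely for demonstration.

```python
import random
from collections import Counter

def learn_binned(p, train):
    """Hypothetical learner: p equal-width bins on [0, 1), majority label per bin."""
    default = Counter(y for _, y in train).most_common(1)[0][0]
    votes = [Counter() for _ in range(p)]
    for x, y in train:
        votes[min(int(x * p), p - 1)][y] += 1
    labels = [v.most_common(1)[0][0] if v else default for v in votes]
    return lambda x: labels[min(int(x * p), p - 1)]

def model_select(k, P, d_train, learn=learn_binned):
    """Select p* by k-fold cross-validation on d_train, then refit on all of d_train."""
    folds = [d_train[i::k] for i in range(k)]        # assumes d_train is in random order
    errors = {p: 0 for p in P}
    for i in range(k):
        d_val = folds[i]                             # D.val(i)
        d_tr = [inst for j in range(k) if j != i for inst in folds[j]]  # D.tr(i)
        for p in P:
            m = learn(p, d_tr)
            errors[p] += sum(m(x) != y for x, y in d_val)
    err_val = {p: errors[p] / len(d_train) for p in P}
    p_star = min(P, key=err_val.get)                 # lowest validation error rate
    m_star = learn(p_star, d_train)                  # final model on the entire D.train
    return p_star, m_star, err_val

def make_data(n, noise=0.1, seed=0):
    """Synthetic data: y = 1 if x > 0.5, with a fraction of labels flipped."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

if __name__ == "__main__":
    p_star, m_star, err_val = model_select(5, [1, 2, 8, 64], make_data(300))
    print(p_star, err_val)
```

On this synthetic data a single bin underfits and a large number of bins tends to fit the label noise, so the validation error rate is expected to favor an intermediate value of p.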

3.7.2 Nested Cross-Validation

The approach of the previous section provides a way to effectively use all the instances in D.train to learn a classification model when hyper-parameter selection is required. This approach can be applied over the entire data set D to learn the final classification model. However, applying Algorithm 3.2 on D would only return the final classification model $m^*$ but not an estimate of its generalization performance, $err_{test}$. Recall that the validation error rates used in Algorithm 3.2 cannot be used as estimates of generalization performance, since they are used to guide the selection of the final model $m^*$. However, to compute $err_{test}$, we can again use a cross-validation framework for evaluating the performance on the entire data set D, as described originally in Section 3.6.2. In this approach, D is partitioned into D.train (for training) and D.test (for testing) at every run of cross-validation. When hyper-parameters are involved, we can use Algorithm 3.2 to train a model using D.train at every run, thus “internally” using cross-validation for model selection. This approach is called nested cross-validation or double cross-validation. Algorithm 3.3 describes the complete approach for estimating $err_{test}$ using nested cross-validation in the presence of hyper-parameters.

As an illustration of this approach, see Figure 3.35, where the labeled set D is partitioned into D.train and D.test using a 3-fold cross-validation method.

Figure 3.35. Example demonstrating 3-fold nested cross-validation for computing $err_{test}$.

At the $i^{th}$ run of this method, one of the folds is used as the test set, D.test(i), while the remaining two folds are used as the training set, D.train(i). This is represented in Figure 3.35 as the $i^{th}$ “outer” run. In order to select a model using D.train(i), we again use an “inner” 3-fold cross-validation framework that partitions D.train(i) into D.tr and D.val at every one of the three inner runs (iterations). As described in Section 3.7, we can use the inner cross-validation framework to select the best hyper-parameter value $p^*(i)$ as well as its resulting classification model $m^*(i)$ learned over D.train(i). We can then apply $m^*(i)$ on D.test(i) to obtain the test error at the $i^{th}$ outer run. By repeating this process for every outer run, we can compute the average test error rate, $err_{test}$, over the entire labeled set D. Note that in the above approach, the inner cross-validation framework is being used for model selection while the outer cross-validation framework is being used for model evaluation.

Algorithm 3.3 The nested cross-validation approach for computing $err_{test}$.
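The listing of Algorithm 3.3 is likewise not reproduced, so the sketch below follows the prose description only: an outer loop for evaluation and an inner cross-validation for hyper-parameter selection. All names are illustrative. `learn_binned` is a hypothetical learner whose hyper-parameter p is a number of equal-width bins, and the data are synthetic.

```python
import random
from collections import Counter

def learn_binned(p, train):
    """Hypothetical learner: p equal-width bins on [0, 1), majority label per bin."""
    default = Counter(y for _, y in train).most_common(1)[0][0]
    votes = [Counter() for _ in range(p)]
    for x, y in train:
        votes[min(int(x * p), p - 1)][y] += 1
    labels = [v.most_common(1)[0][0] if v else default for v in votes]
    return lambda x: labels[min(int(x * p), p - 1)]

def split_folds(data, k):
    return [data[i::k] for i in range(k)]            # assumes data is in random order

def cv_error(p, data, k):
    """k-fold validation error rate of hyper-parameter value p on data."""
    folds = split_folds(data, k)
    errors = 0
    for i in range(k):
        d_tr = [inst for j in range(k) if j != i for inst in folds[j]]
        m = learn_binned(p, d_tr)
        errors += sum(m(x) != y for x, y in folds[i])
    return errors / len(data)

def model_select(k, P, d_train):
    """Inner loop: choose p* by cross-validation, refit on all of d_train."""
    p_star = min(P, key=lambda p: cv_error(p, d_train, k))
    return p_star, learn_binned(p_star, d_train)

def nested_cv(k_outer, k_inner, P, data):
    """Outer loop: estimate err_test; the inner selection never sees D.test(i)."""
    folds = split_folds(data, k_outer)
    errors, chosen = 0, []
    for i in range(k_outer):
        d_test = folds[i]
        d_train = [inst for j in range(k_outer) if j != i for inst in folds[j]]
        p_star, m_star = model_select(k_inner, P, d_train)
        chosen.append(p_star)
        errors += sum(m_star(x) != y for x, y in d_test)
    return errors / len(data), chosen

def make_data(n, noise=0.1, seed=0):
    """Synthetic data: y = 1 if x > 0.5, with a fraction of labels flipped."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    return [(x, int(x > 0.5) ^ int(rng.random() < noise)) for x in xs]

if __name__ == "__main__":
    err_test, chosen = nested_cv(3, 3, [1, 2, 8, 64], make_data(300))
    print(err_test, chosen)
```

The selected value $p^*(i)$ may differ from one outer run to the next. The returned error rate therefore describes the whole selection procedure rather than any single model.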

3.8 Pitfalls of Model Selection and Evaluation

Model selection and evaluation, when used effectively, serve as excellent tools for learning classification models and assessing their generalization performance. However, when using them in practical settings, there are several pitfalls that can result in improper and often misleading conclusions. Some of these pitfalls are simple to understand and easy to avoid, while others are quite subtle in nature and difficult to catch. In the following, we present two of these pitfalls and discuss best practices to avoid them.

3.8.1 Overlap between Training and Test Sets

One of the basic requirements of a clean model selection and evaluation setup is that the data used for model selection (D.train) must be kept separate from the data used for model evaluation (D.test). If there is any overlap between the two, the test error rate $err_{test}$ computed over D.test cannot be considered representative of the performance on unseen instances. Comparing the effectiveness of classification models using $err_{test}$ can then be quite misleading, as an overly complex model can show an inaccurately low value of $err_{test}$ due to model overfitting (see Exercise 12 at the end of this chapter).

To illustrate the importance of ensuring no overlap between D.train and D.test, consider a labeled data set where all the attributes are irrelevant, i.e., they have no relationship with the class labels. Using such attributes, we should expect no classification model to perform better than random guessing. However, if the test set involves even a small number of data instances that were used for training, there is a possibility for an overly complex model to show better performance than random, even though the attributes are completely irrelevant. As we will see later in Chapter 10, this scenario can actually be used as a criterion to detect overfitting due to an improper setup of the experiment. If a model shows better performance than a random classifier even when the attributes are irrelevant, it is an indication of a potential feedback between the training and test sets.

3.8.2 Use of Validation Error as Generalization Error

The validation error rate $err_{val}$ serves an important role during model selection, as it provides “out-of-sample” error estimates of models on D.val, which is not used for training the models. Hence, $err_{val}$ serves as a better metric than the training error rate for selecting models and hyper-parameter values, as described in Sections 3.5.1 and 3.7, respectively. However, once the validation set has been used for selecting a classification model $m^*$, $err_{val}$ no longer reflects the performance of $m^*$ on unseen instances.

To realize the pitfall in using the validation error rate as an estimate of generalization performance, consider the problem of selecting a hyper-parameter value p from a range of values $P$, using a validation set D.val. If the number of possible values in $P$ is quite large and the size of D.val is small, it is possible to select a hyper-parameter value $p^*$ that shows favorable performance on D.val just by random chance. Notice the similarity of this problem with the multiple comparisons problem discussed in Section 3.4.1. Even though the classification model $m^*$ learned using $p^*$ would show a low validation error rate, it would lack generalizability on unseen test instances.

The correct approach for estimating the generalization error rate of a model $m^*$ is to use an independently chosen test set D.test that hasn't been used in any way to influence the selection of $m^*$. As a rule of thumb, the test set should never be examined during model selection, to ensure the absence of any form of overfitting. If the insights gained from any portion of a labeled data set help in improving the classification model even in an indirect way, then that portion of data must be discarded during testing.
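The pitfall of Section 3.8.2 can be made concrete with a small simulation. In the sketch below, which is an illustrative construction and not taken from the text, the class labels are random and every candidate "model" simply guesses, so no candidate can truly do better than a 50% error rate. Selecting the best of many candidates on a small validation set still produces a validation error rate that looks much better than chance, while the error on a separate test set stays near 50%. The names `random_model`, `error_rate`, and `demo`, and the set sizes, are assumptions made for the demonstration.

```python
import random

def random_model(seed):
    """A 'model' that ignores the instance and guesses a label (fixed per instance)."""
    def predict(x):
        return random.Random(seed * 1_000_003 + x).randrange(2)
    return predict

def error_rate(model, instances, labels):
    return sum(model(x) != labels[x] for x in instances) / len(instances)

def demo(n_models=200, n_val=20, n_test=2000, seed=0):
    rng = random.Random(seed)
    # Labels are unrelated to the instance ids, i.e., the attributes are irrelevant.
    labels = [rng.randrange(2) for _ in range(n_val + n_test)]
    d_val = range(n_val)
    d_test = range(n_val, n_val + n_test)
    models = [random_model(s) for s in range(1, n_models + 1)]
    # "Hyper-parameter selection": keep the candidate with the lowest validation error.
    best = min(models, key=lambda m: error_rate(m, d_val, labels))
    return error_rate(best, d_val, labels), error_rate(best, d_test, labels)

if __name__ == "__main__":
    val_err, test_err = demo()
    print(f"validation error of selected model: {val_err:.3f}")
    print(f"test error of selected model:       {test_err:.3f}")
```

The gap between the two printed rates is produced entirely by the selection step, which is why the validation error of the selected model cannot be reported as its generalization error.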

3.9 Model Comparison

One difficulty when comparing the performance of different classification models is whether the observed difference in their performance is statistically significant. For example, consider a pair of classification models, $M_A$ and $M_B$. Suppose $M_A$ achieves 85% accuracy when evaluated on a test set containing 30 instances, while $M_B$ achieves 75% accuracy on a different test set containing 5000 instances. Based on this information, is $M_A$ a better model than $M_B$? This example raises two key questions regarding the statistical significance of a performance metric:

1. Although $M_A$ has a higher accuracy than $M_B$, it was tested on a smaller test set. How much confidence do we have that the accuracy for $M_A$ is actually 85%?

2. Is it possible to explain the difference in accuracies between $M_A$ and $M_B$ as a result of variations in the composition of their test sets?

The first question relates to the issue of estimating the confidence interval of model accuracy. The second question relates to the issue of testing the statistical significance of the observed deviation. These issues are investigated in the remainder of this section.

3.9.1 Estimating the Confidence Interval for Accuracy

To determine its confidence interval, we need to establish the probability distribution for sample accuracy. This section describes an approach for deriving the confidence interval by modeling the classification task as a binomial random experiment. The following describes characteristics of such an experiment:

1. The random experiment consists of N independent trials, where each trial has two possible outcomes: success or failure.

2. The probability of success, p, in each trial is constant.

An example of a binomial experiment is counting the number of heads that turn up when a coin is flipped N times. If X is the number of successes observed in N trials, then the probability that X takes a particular value is given by a binomial distribution with mean $Np$ and variance $Np(1-p)$:

$$P(X = v) = \binom{N}{v} p^{v} (1-p)^{N-v}.$$

For example, if the coin is fair ($p = 0.5$) and is flipped fifty times, then the probability that the head shows up 20 times is

$$P(X = 20) = \binom{50}{20}\, 0.5^{20} (1-0.5)^{30} = 0.0419.$$

If the experiment is repeated many times, then the average number of heads expected to show up is $50 \times 0.5 = 25$, while its variance is $50 \times 0.5 \times 0.5 = 12.5$.

The task of predicting the class labels of test instances can also be considered as a binomial experiment. Given a test set that contains N instances, let X be the number of instances correctly predicted by a model and p be the true accuracy of the model. If the prediction task is modeled as a binomial experiment, then X has a binomial distribution with mean $Np$ and variance $Np(1-p)$. It can be shown that the empirical accuracy, $acc = X/N$, also has a binomial distribution with mean p and variance $p(1-p)/N$ (see Exercise 14). The binomial distribution can be approximated by a normal distribution when N is sufficiently large. Based on the normal distribution, the confidence interval for acc can be derived as follows:

$$P\left(-Z_{\alpha/2} \le \frac{acc - p}{\sqrt{p(1-p)/N}} \le Z_{1-\alpha/2}\right) = 1 - \alpha, \qquad (3.15)$$

where $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ are the upper and lower bounds obtained from a standard normal distribution at confidence level $(1-\alpha)$. Since a standard normal distribution is symmetric around $Z = 0$, it follows that $Z_{\alpha/2} = Z_{1-\alpha/2}$. Rearranging this inequality leads to the following confidence interval for p:

$$\frac{2 \times N \times acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4N\,acc - 4N\,acc^2}}{2\left(N + Z_{\alpha/2}^2\right)}. \qquad (3.16)$$

The following table shows the values of $Z_{\alpha/2}$ at different confidence levels:

$1-\alpha$        0.99   0.98   0.95   0.9    0.8    0.7    0.5
$Z_{\alpha/2}$    2.58   2.33   1.96   1.65   1.28   1.04   0.67

Example 3.11. Confidence Interval for Accuracy

Consider a model that has an accuracy of 80% when evaluated on 100 test instances. What is the confidence interval for its true accuracy at a 95% confidence level? The confidence level of 95% corresponds to $Z_{\alpha/2} = 1.96$ according to the table given above. Inserting this term into Equation 3.16 yields a confidence interval between 71.1% and 86.7%. The following table shows the confidence interval when the number of instances, N, increases:

N                     20             50             100            500            1000           5000
Confidence interval   0.584 - 0.919  0.670 - 0.888  0.711 - 0.867  0.763 - 0.833  0.774 - 0.824  0.789 - 0.811

Note that the confidence interval becomes tighter when N increases.

3.9.2 Comparing the Performance of Two Models

Consider a pair of models, $M_1$ and $M_2$, which are evaluated on two independent test sets, $D_1$ and $D_2$. Let $n_1$ denote the number of instances in $D_1$ and $n_2$ denote the number of instances in $D_2$. In addition, suppose the error rate for $M_1$ on $D_1$ is $e_1$ and the error rate for $M_2$ on $D_2$ is $e_2$. Our goal is to test whether the observed difference between $e_1$ and $e_2$ is statistically significant.

Assuming that $n_1$ and $n_2$ are sufficiently large, the error rates $e_1$ and $e_2$ can be approximated using normal distributions. If the observed difference in the error rate is denoted as $d = e_1 - e_2$, then d is also normally distributed with mean $d_t$, its true difference, and variance $\sigma_d^2$. The variance of d can be computed as follows:

$$\sigma_d^2 \simeq \hat{\sigma}_d^2 = \frac{e_1(1-e_1)}{n_1} + \frac{e_2(1-e_2)}{n_2}, \qquad (3.17)$$

where $e_1(1-e_1)/n_1$ and $e_2(1-e_2)/n_2$ are the variances of the error rates. Finally, at the $(1-\alpha)$ confidence level, it can be shown that the confidence interval for the true difference $d_t$ is given by the following equation:

$$d_t = d \pm z_{\alpha/2}\, \hat{\sigma}_d. \qquad (3.18)$$

3.12.ExampleSignificanceTestingConsidertheproblemdescribedatthebeginningofthissection.Modelhasanerrorrateof whenappliedto testinstances,whilemodel hasanerrorrateof whenappliedto testinstances.Theobserveddifferenceintheirerrorratesis

.Inthisexample,weareperformingatwo-sidedtesttocheckwhether or .Theestimatedvarianceoftheobserveddifferenceinerrorratescanbecomputedasfollows:

or .InsertingthisvalueintoEquation3.18 ,weobtainthefollowingconfidenceintervalfor at95%confidencelevel:

Astheintervalspansthevaluezero,wecanconcludethattheobserveddifferenceisnotstatisticallysignificantata95%confidencelevel.

Atwhatconfidencelevelcanwerejectthehypothesisthat ?Todothis,weneedtodeterminethevalueof suchthattheconfidenceintervalfordoesnotspanthevaluezero.Wecanreversetheprecedingcomputationandlookforthevalue suchthat .Replacingthevaluesofdand

gives .Thisvaluefirstoccurswhen (foratwo-sidedtest).Theresultsuggeststhatthenullhypothesiscanberejectedatconfidencelevelof93.6%orlower.

dt=d±zα/2σ^d. (3.18)

MAe1=0.15 N1=30

MB e2=0.25 N2=5000

d=|0.15−0.25|=0.1dt=0 dt≠0

σ^d2=0.15(1−0.15)30+0.25(1−0.25)5000=0.0043

σ^d=0.0655dt

dt=0.1±1.96×0.0655=0.1±0.128.

dt=0Zα/2 dt

Zα/2 d>Zσ/2σ^dσ^d Zσ/2<1.527 (1−α)<~0.936
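The arithmetic of Example 3.12 can be checked with a short Python sketch. It assumes the normal approximation used in Equations 3.17 and 3.18. The helper name `difference_interval` is hypothetical, and the standard normal CDF is taken from the standard library's `statistics.NormalDist`. Both the cumulative probability $\Phi(Z)$ and the two-sided level $2\Phi(Z) - 1$ are printed so that the two quantities are not confused.

```python
import math
from statistics import NormalDist

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Return (d, sigma_d, lower, upper) for the difference of two error rates."""
    d = abs(e1 - e2)
    var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2  # Equation 3.17
    sigma_d = math.sqrt(var_d)
    return d, sigma_d, d - z * sigma_d, d + z * sigma_d  # Equation 3.18

# Values from Example 3.12.
d, sigma_d, lower, upper = difference_interval(0.15, 30, 0.25, 5000)
significant = not (lower <= 0 <= upper)  # the interval must exclude zero

# Largest Z for which the interval still excludes zero.
z_max = d / sigma_d
phi = NormalDist().cdf(z_max)   # cumulative probability Phi(z_max)
two_sided_level = 2 * phi - 1   # confidence level (1 - alpha) for a two-sided test

print(f"sigma_d = {sigma_d:.4f}, interval = [{lower:.3f}, {upper:.3f}]")
print(f"significant at 95%: {significant}")
print(f"z_max = {z_max:.3f}, Phi(z_max) = {phi:.4f}, two-sided level = {two_sided_level:.3f}")
```

Because $n_1 = 30$ is small, the first term of Equation 3.17 dominates the variance, which is why the interval is wide despite the large second test set.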
