CPSC 340: Machine Learning and Data Mining
Fundamentals of Learning (Fall 2019)
Admin
• Assignment 1 is due Wednesday: you should be almost done.
• Waiting list people: everyone should be in soon?
• Course webpage:
  – https://www.cs.ubc.ca/~fwood/CS340/
• Auditors:
  – Bring your forms at the end of class Friday, assuming we clear the waitlist.
• Exchange students:
  – If you are still having trouble registering, bring your forms Friday.
  – Contact us on Piazza about getting registered for Gradescope.
• Midterm confirmed (14/Feb, 6pm-8pm, Wesbrook 100).
Last Time: Supervised Learning Notation
• Feature matrix 'X' has rows as examples, columns as features.
  – x_ij is feature 'j' for example 'i' (quantity of food 'j' on day 'i').
  – x_i is the list of all features for example 'i' (all the quantities on day 'i').
  – x_j is column 'j' of the matrix (the value of feature 'j' across all examples).
• Label vector 'y' contains the labels of the examples.
  – y_i is the label of example 'i' (1 for "sick", 0 for "not sick").
  Egg   Milk   Fish   Wheat   Shellfish   Peanuts     Sick?
  0     0.7    0      0.3     0           0           1
  0.3   0.7    0      0.6     0           0.01        1
  0     0      0      0.8     0           0           0
  0.3   0.7    1.2    0       0.10        0.01        1
  0.3   0      1.2    0.3     0.10        0.01        1
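Purely as a sketch, the feature matrix and label vector above can be stored as plain Python lists, with X[i][j] playing the role of x_ij (the variable names here are illustrative, not part of any course code):

```python
# Feature matrix X: one row per day, one column per food quantity.
# Columns: Egg, Milk, Fish, Wheat, Shellfish, Peanuts.
X = [
    [0,   0.7, 0,   0.3, 0,    0],
    [0.3, 0.7, 0,   0.6, 0,    0.01],
    [0,   0,   0,   0.8, 0,    0],
    [0.3, 0.7, 1.2, 0,   0.10, 0.01],
    [0.3, 0,   1.2, 0.3, 0.10, 0.01],
]
# Label vector y: 1 for "sick", 0 for "not sick".
y = [1, 1, 0, 1, 1]

x_ij = X[1][3]                 # feature j=3 (Wheat) on day i=1
x_i = X[1]                     # all features for example i=1
x_j = [row[3] for row in X]    # column j=3 across all examples
```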
Supervised Learning Application
• We motivated supervised learning by the "food allergy" example.
• But we can use supervised learning for any input:output mapping:
  – E-mail spam filtering.
  – Optical character recognition on scanners.
  – Recognizing faces in pictures.
  – Recognizing tumours in medical images.
  – Speech recognition on phones.
  – Your problem in industry/research?
Motivation: Determine Home City
• We are given data from 248 homes.
• For each home/example, we have these features:
  – Elevation.
  – Year.
  – Bathrooms.
  – Bedrooms.
  – Price.
  – Square feet.
• Goal is to build a program that predicts SF or NY.
This example and images of it come from: http://www.r2d3.us/visual-intro-to-machine-learning-part-1
Plotting Elevation
Simple Decision Stump
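The stump in the figure is not reproduced here, but the idea can be sketched from scratch on made-up elevation data (the values and labels below are hypothetical, not the r2d3 dataset): try every threshold on one feature and keep the one with the fewest training errors.

```python
def fit_stump(x, y):
    """Fit a one-feature decision stump: pick a threshold t and a pair of
    labels (one for x <= t, one for x > t) minimizing training errors."""
    best = None
    for t in sorted(set(x)):
        for lo, hi in [(0, 1), (1, 0)]:
            errors = sum((lo if xi <= t else hi) != yi
                         for xi, yi in zip(x, y))
            if best is None or errors < best[0]:
                best = (errors, t, lo, hi)
    return best  # (training errors, threshold, label below, label above)

# Hypothetical data: elevation in metres, label 1 = SF, 0 = NY.
elevation = [5, 10, 15, 150, 200, 250]
city      = [0,  0,  0,   1,   1,   1]
errs, t, lo, hi = fit_stump(elevation, city)
```

On this toy data the stump separates the cities perfectly; on the real 248 homes a single split cannot, which is what motivates deeper trees.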
Scatterplot Array
Plotting Elevation and Price/SqFt
Simple Decision Tree Classification
How does the depth affect accuracy?
• This is a good start (>75% accuracy).
• Start splitting the data recursively...
• Accuracy keeps increasing as we add depth.
• Eventually, we can perfectly classify all of our data.
Training vs. Testing Error
• With this decision tree, 'training accuracy' is 1.
  – It perfectly labels the data we used to make the tree.
• We are now given features for 217 new homes.
• What is the 'testing accuracy' on the new data?
  – How does it do on data not used to make the tree?
• Overfitting: lower accuracy on new data.
  – Our rules got too specific to our exact training dataset.
  – Some of the "deep" splits only use a few examples (bad "coupon collecting").
Supervised Learning Notation
• We are given training data where we know labels:

         Egg   Milk   Fish   Wheat   Shellfish   Peanuts   ...          Sick?
         0     0.7    0      0.3     0           0                      1
   X =   0.3   0.7    0      0.6     0           0.01            y =    1
         0     0      0      0.8     0           0                      0
         0.3   0.7    1.2    0       0.10        0.01                   1
         0.3   0      1.2    0.3     0.10        0.01                   1

• But there is also testing data we want to label:

         Egg   Milk   Fish   Wheat   Shellfish   Peanuts   ...          Sick?
   X̃ =   0.5   0      1      0.6     2           1               ỹ =    ?
         0     0.7    0      1       0           0                      ?
         3     1      0      0.5     0           0                      ?
Supervised Learning Notation
• Typical supervised learning steps:
  1. Build model based on training data X and y (training phase).
  2. Model makes predictions ŷ on test data X̃ (testing phase).
• Instead of training error, consider test error:
  – Are predictions ŷ similar to true unseen labels ỹ?
Goal of Machine Learning
• In machine learning:
  – What we care about is the test error!
• Midterm analogy:
  – The training error is the practice midterm.
  – The test error is the actual midterm.
  – Goal: do well on the actual midterm, not the practice one.
• Memorization vs learning:
  – Can do well on training data by memorizing it.
  – You've only learned if you can do well in new situations.
Golden Rule of Machine Learning
• Even though what we care about is test error:
  – THE TEST DATA CANNOT INFLUENCE THE TRAINING PHASE IN ANY WAY.
• We're measuring test error to see how well we do on new data:
  – If used during training, it doesn't measure this.
  – You can start to overfit if you use it during training.
  – Midterm analogy: you are cheating on the test.
  – http://www.technologyreview.com/view/538111/why-and-how-baidu-cheated-an-artificial-intelligence-test/
• You also shouldn't change the test set to get the result you want.
  – http://blogs.sciencemag.org/pipeline/archives/2015/01/14/the_dukepotti_scandal_from_the_inside
  – https://www.cbsnews.com/news/deception-at-duke-fraud-in-cancer-care/
Digression: Golden Rule and Hypothesis Testing
• Note the golden rule applies to hypothesis testing in scientific studies.
  – Data that you collect can't influence the hypotheses that you test.
• EXTREMELY COMMON and a MAJOR PROBLEM, coming in many forms:
  – Collect more data until you coincidentally get the significance level you want.
  – Try different ways to measure performance, choose the one that looks best.
  – Choose a different type of model/hypothesis after looking at the test data.
• If you want to modify your hypotheses, you need to test on new data.
  – Or at least be aware and honest about this issue when reporting results.
• For more on this problem (and some solutions):
  – "Replication crisis in Science".
  – "Why Most Published Research Findings are False".
  – "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant".
  – "HARKing: Hypothesizing After the Results are Known".
  – "Hack Your Way To Scientific Glory".
  – "Psychology's Replication Crisis Has Made The Field Better" (some solutions).
Is Learning Possible?
• Does training error say anything about test error?
  – In general, NO: test data might have nothing to do with training data.
  – E.g., an "adversary" takes the training data and flips all labels.
• In order to learn, we need assumptions:
  – The training and test data need to be related in some way.
  – Most common assumption: independent and identically distributed (IID).
         Egg   Milk   Fish          Sick?
   X =   0     0.7    0       y =   1
         0.3   0.7    1             1
         0.3   0      0             0

         Egg   Milk   Fish          Sick?
   X̃ =   0     0.7    0       ỹ =   0
         0.3   0.7    1             0
         0.3   0      0             1
IID Assumption
• Training/test data is independent and identically distributed (IID) if:
  – All examples come from the same distribution (identically distributed).
  – The examples are sampled independently (order doesn't matter).
• Examples in terms of cards:
  – Pick a card, put it back in the deck, re-shuffle, repeat.
  – Pick a card, put it back in the deck, repeat.
  – Pick a card, don't put it back, re-shuffle, repeat.
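A quick sketch of the first and third card schemes (helper names are hypothetical), showing why drawing with replacement is IID while drawing without replacement is not:

```python
import random

def draw_with_replacement(deck, n, rng):
    # Put the card back and re-shuffle each time: every draw comes from the
    # same uniform distribution, independently of earlier draws (IID).
    return [rng.choice(deck) for _ in range(n)]

def draw_without_replacement(deck, n, rng):
    # Earlier draws change what is available later: draws are not independent,
    # and the distribution changes as the deck shrinks (not IID).
    remaining = list(deck)
    rng.shuffle(remaining)
    return [remaining.pop() for _ in range(n)]

deck = list(range(52))
rng = random.Random(0)
iid_draws = draw_with_replacement(deck, 52, rng)
dependent_draws = draw_without_replacement(deck, 52, rng)
```

Drawing all 52 cards without replacement always yields 52 distinct cards; with replacement, repeats are overwhelmingly likely, exactly because each draw ignores the ones before it.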
  Age   Job?   City   Rating   Income
  23    Yes    Van    A        22,000.00
  23    Yes    Bur    BBB      21,000.00
  22    No     Van    CC       0.00
  25    Yes    Sur    AAA      57,000.00
IID Assumption and Food Allergy Example
• Is the food allergy data IID?
  – Do all the examples come from the same distribution?
  – Does the order of the examples matter?
• No!
  – Being sick might depend on what you ate yesterday (not independent).
  – Your eating habits might change over time (not identically distributed).
• What can we do about this?
  – Just ignore that the data isn't IID and hope for the best?
  – For each day, maybe add the features from the previous day?
  – Maybe add time as an extra feature?
Learning Theory
• Why does the IID assumption make learning possible?
  – Patterns in training examples are likely to be the same in test examples.
• The IID assumption is rarely true:
  – But it is often a good approximation.
  – There are other possible assumptions.
• Also, we're assuming IID across examples but not across features.
• Learning theory explores how training error is related to test error.
• We'll look at a simple example, using this notation:
  – E_train is the error on training data.
  – E_test is the error on testing data.
Fundamental Trade-Off
• Start with E_test = E_test, then add and subtract E_train on the right:

    E_test = E_test = (E_test - E_train) + E_train = E_approx + E_train,

  where we define E_approx = E_test - E_train.
• How does this help?
  – If E_approx is small, then E_train is a good approximation to E_test.
• What does E_approx (the "amount of overfitting") depend on?
  – It tends to get smaller as 'n' gets larger.
  – It tends to grow as the model gets more "complicated".
Fundamental Trade-Off
• This leads to a fundamental trade-off:
  1. E_train: how small you can make the training error.
     vs.
  2. E_approx: how well training error approximates the test error.
• Simple models (like decision stumps):
  – E_approx is low (not very sensitive to training set).
  – But E_train might be high.
• Complex models (like deep decision trees):
  – E_train can be low.
  – But E_approx might be high (very sensitive to training set).
Fundamental Trade-Off
• Training error vs. test error for choosing depth:
  – Training error is high for low depth (underfitting).
  – Training error gets better with depth.
  – Test error initially goes down, but eventually increases (overfitting).
Validation Error
• How do we decide decision tree depth?
• We care about test error.
• But we can't look at test data.
• So what do we do?
• One answer: use part of the training data to approximate test error.
• Split training examples into a training set and a validation set:
  – Train model based on the training data.
  – Test model based on the validation data.
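A minimal sketch of such a split (the helper name and data are hypothetical): shuffle the indices, keep one part for training, and hold out the rest for validation.

```python
import random

def train_valid_split(X, y, train_frac, rng):
    """Randomly split (X, y) into a training set and a validation set."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_train = int(train_frac * len(X))
    train_idx, valid_idx = idx[:n_train], idx[n_train:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_valid = [X[i] for i in valid_idx]
    y_valid = [y[i] for i in valid_idx]
    return X_train, y_train, X_valid, y_valid

# Tiny hypothetical dataset: 10 examples, 80/20 split.
X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, y_train, X_valid, y_valid = train_valid_split(X, y, 0.8, random.Random(0))
```

Shuffling before splitting matters: if the examples are ordered (say, by date), taking the first 80% as-is would make the two sets come from different distributions.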
Validation Error
• IID data: validation error is an unbiased approximation of test error.
• Midterm analogy:
  – You have 2 practice midterms.
  – You hide one midterm, and spend a lot of time working through the other.
  – You then take the practice midterm you hid, to see how well you'll do on the test.
• We typically use validation error to choose "hyper-parameters"...
Notation: Parameters and Hyper-Parameters
• The decision tree rule values are called "parameters".
  – Parameters control how well we fit a dataset.
  – We "train" a model by trying to find the best parameters on training data.
• The decision tree depth is called a "hyper-parameter".
  – Hyper-parameters control how complex our model is.
  – We can't "train" a hyper-parameter.
    • You can always fit training data better by making the model more complicated.
  – We "validate" a hyper-parameter using a validation score.
• ("Hyper-parameter" is sometimes used for any parameter "not fit with the data".)
Choosing Hyper-Parameters with a Validation Set
• So to choose a good value of depth (a "hyper-parameter"), we could:
  – Try a depth-1 decision tree, compute validation error.
  – Try a depth-2 decision tree, compute validation error.
  – Try a depth-3 decision tree, compute validation error.
  – ...
  – Try a depth-20 decision tree, compute validation error.
  – Return the depth with the lowest validation error.
• After you choose the hyper-parameter, we usually re-train on the full training set with the chosen hyper-parameter.
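As a concrete sketch of this loop, the snippet below uses a deliberately tiny one-feature decision tree (the learner, helper names, and data are all hypothetical, not the course's code); the depth-selection loop and the final re-training step are the parts the slide describes.

```python
from collections import Counter

def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def fit_tree(x, y, depth):
    """Greedy decision tree on a single feature; depth is the hyper-parameter."""
    if depth == 0 or len(set(y)) == 1 or len(set(x)) == 1:
        return ("leaf", majority(y))
    best_t, best_err = None, None
    for t in sorted(set(x))[:-1]:  # candidate thresholds (both sides non-empty)
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        err = (len(left) - left.count(majority(left))
               + len(right) - right.count(majority(right)))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    lx = [xi for xi in x if xi <= best_t]
    ly = [yi for xi, yi in zip(x, y) if xi <= best_t]
    rx = [xi for xi in x if xi > best_t]
    ry = [yi for xi, yi in zip(x, y) if xi > best_t]
    return ("split", best_t,
            fit_tree(lx, ly, depth - 1), fit_tree(rx, ry, depth - 1))

def predict(tree, xi):
    if tree[0] == "leaf":
        return tree[1]
    _, t, left, right = tree
    return predict(left, xi) if xi <= t else predict(right, xi)

def error(tree, x, y):
    return sum(predict(tree, xi) != yi for xi, yi in zip(x, y)) / len(y)

# Hypothetical training/validation split (already made).
x_train, y_train = [1, 2, 3, 4, 5, 6, 7, 8], [0, 0, 1, 1, 0, 0, 1, 1]
x_valid, y_valid = [1.5, 3.5, 5.5, 7.5], [0, 1, 0, 1]

# Try each depth, keep the one with the lowest validation error.
best_depth, best_val_err = None, None
for depth in range(1, 21):
    tree = fit_tree(x_train, y_train, depth)
    val_err = error(tree, x_valid, y_valid)
    if best_val_err is None or val_err < best_val_err:
        best_depth, best_val_err = depth, val_err

# Re-train on the full training set with the chosen hyper-parameter.
final_tree = fit_tree(x_train + x_valid, y_train + y_valid, best_depth)
```

Note that only validation error ever influences the choice of depth; the test data never appears in this loop, per the golden rule.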
Digression: Optimization Bias
• Another name for overfitting is "optimization bias":
  – How biased is an "error" that we optimized over many possibilities?
• Optimization bias of parameter learning:
  – During learning, we could search over tons of different decision trees.
  – So we can get "lucky" and find one with low training error by chance.
    • "Overfitting of the training error".
• Optimization bias of hyper-parameter tuning:
  – Here, we might optimize the validation error over 20 values of "depth".
  – One of the 20 trees might have low validation error by chance.
    • "Overfitting of the validation error".
Digression: Example of Optimization Bias
• Consider a multiple-choice (a,b,c,d) "test" with 10 questions:
  – If you choose answers randomly, expected grade is 25% (no bias).
  – If you fill out two tests randomly and pick the best, expected grade is 33%.
    • Optimization bias of ~8%.
  – If you take the best among 10 random tests, expected grade is ~47%.
  – If you take the best among 100, expected grade is ~62%.
  – If you take the best among 1000, expected grade is ~73%.
  – If you take the best among 10000, expected grade is ~82%.
• You have so many "chances" that you expect to do well.
• But on new questions the "random choice" accuracy is still 25%.
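Numbers like these are easy to approximate by simulation (a hypothetical sketch, not the source of the slide's figures): fill out k random tests, grade them all, and keep the best.

```python
import random

def best_of_k_grade(n_questions, k, trials, rng):
    """Average grade of the best among k randomly-filled 4-choice tests."""
    total = 0.0
    for _ in range(trials):
        best = max(sum(rng.random() < 0.25 for _ in range(n_questions))
                   for _ in range(k))
        total += best / n_questions
    return total / trials

rng = random.Random(0)
one_test = best_of_k_grade(10, 1, 5000, rng)      # near 0.25: no optimization bias
best_of_ten = best_of_k_grade(10, 10, 5000, rng)  # well above 0.25: large bias
```

The selected test is still a random guesser, so on new questions its expected accuracy stays 25%; only the score we optimized over looks good.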
Factors Affecting Optimization Bias
• If we instead used a 100-question test then:
  – Expected grade from best over 1 randomly-filled test is 25%.
  – Expected grade from best over 2 randomly-filled tests is ~27%.
  – Expected grade from best over 10 randomly-filled tests is ~32%.
  – Expected grade from best over 100 randomly-filled tests is ~36%.
  – Expected grade from best over 1000 randomly-filled tests is ~40%.
  – Expected grade from best over 10000 randomly-filled tests is ~47%.
• The optimization bias grows with the number of things we try.
  – "Complexity" of the set of models we search over.
• But optimization bias shrinks quickly with the number of examples.
  – But it's still non-zero and growing if you over-use your validation set!
Summary
• Training error vs. testing error:
  – What we care about in machine learning is the testing error.
• Golden rule of machine learning:
  – The test data cannot influence training the model in any way.
• Independent and identically distributed (IID):
  – One assumption that makes learning possible.
• Fundamental trade-off:
  – Trade-off between getting low training error and having training error approximate test error.
• Validation set:
  – We can save part of our training data to approximate test error.
• Hyper-parameters:
  – Parameters that control model complexity, typically set with a validation set.
• Next time:
  – We discuss the "best" machine learning method.
"test error" vs. "test set error" vs. "validation error"
Approximation Error for Selecting Hyper-Parameters
• From the 2019 EasyMarkit AI Hackathon:
  – "We ended up selecting the hyperparameters that gave us the lowest approximation error (gap between train and validation) as opposed to the lowest validation error. This was quite a difficult decision for our team since we were only allowed one submission. However, the model with the lowest validation error had a very high approximation error, which felt too risky, so we went with a model with a slightly higher validation error and much lower approximation error. When the results were announced, the reported test accuracy was within 0.1% of what our model predicted with the validation set."
• This is the type of reasoning you want to do.
  – A high approximation error could indicate low validation error by chance.
"A Visual Introduction to Machine Learning"
• The "housing prices" example is taken from this website:
  – http://www.r2d3.us/visual-intro-to-machine-learning-part-1
• They also have a "Part 2" here:
  – http://www.r2d3.us/visual-intro-to-machine-learning-part-2
• Part 2 covers similar topics to what we covered in this lecture.
Bounding E_approx
• Let's assume we have a fixed model 'h' (like a decision tree), and then we collect a training set of 'n' examples.
• What is the probability that the error on this training set (E_train) is within some small number ε of the test error (E_test)?
• From "Hoeffding's inequality" we have:

    p(|E_test - E_train| > ε) <= 2 exp(-2nε²).

• This is great! In this setting the probability that our training error is far from our test error goes down exponentially in terms of the number of samples 'n'.
Bounding E_approx
• Unfortunately, the last slide gets it backwards:
  – We usually don't pick a model and then collect a dataset.
  – We usually collect a dataset and then pick the model based on the data.
• We now picked the model that did best on the data, and Hoeffding's inequality doesn't account for the optimization bias of this procedure.
• One way to get around this is to bound (E_test - E_train) for all models in the space of models we are optimizing over.
  – If we bound it for all models, then we bound it for the best model.
  – This gives looser but correct bounds.
Bounding E_approx
• If we only optimize over a finite number of models 'k', we can use the "union bound", which for events {A1, A2, ..., Ak} says:

    p(A1 or A2 or ... or Ak) <= p(A1) + p(A2) + ... + p(Ak).

• Combining Hoeffding's inequality and the union bound gives a bound over all 'k' models at once.
• So, with the optimization bias of setting h* to the best 'h' among 'k' models, the probability that (E_test - E_train) is bigger than ε satisfies:

    p(|E_test(h*) - E_train(h*)| > ε) <= 2k exp(-2nε²).
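Plugging numbers into this combined bound makes the trade-off concrete (a hypothetical helper; with k = 1 it reduces to plain Hoeffding):

```python
import math

def hoeffding_union_bound(n, k, eps):
    """Upper bound on the probability that the best of k models has
    |E_test - E_train| > eps after n IID training examples:
    2 * k * exp(-2 * n * eps^2)."""
    return 2 * k * math.exp(-2 * n * eps * eps)

few_models = hoeffding_union_bound(n=10000, k=20, eps=0.02)       # small
many_models = hoeffding_union_bound(n=10000, k=10**6, eps=0.02)   # vacuous (> 1)
```

With 10,000 examples and 20 depths the bound is tiny, but with a million candidate models it exceeds 1 and says nothing, matching the point that trying lots of models can make (E_test - E_train) large.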
• So optimizing over a few models is ok if we have lots of examples.
• If we try lots of models then (E_test - E_train) could be very large.
• Later in the course we'll be searching over continuous models where k = infinity, so this bound is useless.
• To handle continuous models, one way is via the VC-dimension.
  – Simpler models will have lower VC-dimension.
Refined Fundamental Trade-Off
• Let E_best be the irreducible error (lowest possible error for any model).
  – For example, the irreducible error for predicting fair coin flips is 0.5.
• Some learning theory results use E_best to further decompose E_test:

    E_test = (E_test - E_train) + (E_train - E_best) + E_best.

• This is similar to the bias-variance decomposition:
  – Term 1: a measure of variance (how sensitive we are to training data).
  – Term 2: a measure of bias (how low we can make the training error).
  – Term 3: a measure of noise (how low any model can make test error).
Refined Fundamental Trade-Off
• Decision tree with high depth:
  – Very likely to fit data well, so bias is low.
  – But the model changes a lot if you change the data, so variance is high.
• Decision tree with low depth:
  – Less likely to fit data well, so bias is high.
  – But the model doesn't change much if you change the data, so variance is low.
• And depth does not affect the irreducible error.
  – Irreducible error comes from the best possible model.
Bias-Variance Decomposition
• You may have seen the "bias-variance decomposition" in other classes:
  – Assumes ỹ_i = ȳ_i + ε, where ε has mean 0 and variance σ².
  – Assumes we have a "learner" that can take 'n' training examples and use these to make predictions ŷ_i.
• The expected squared test error in this setting is

    E[(ŷ_i - ỹ_i)²] = bias² + variance + σ²,

  – where expectations are taken over possible training sets of 'n' examples.
  – Bias is the expected error due to having the wrong model.
  – Variance is the expected error due to sensitivity to the training set.
  – Noise (irreducible error) is the best we can hope for given the noise (E_best).
Bias-Variance vs. Fundamental Trade-Off
• Both decompositions serve the same purpose:
  – Trying to evaluate how different factors affect test error.
• They both lead to the same 3 conclusions:
  1. Simple models can have high E_train/bias, low E_approx/variance.
  2. Complex models can have low E_train/bias, high E_approx/variance.
  3. As you increase 'n', E_approx/variance goes down (for fixed complexity).
• So why focus on the fundamental trade-off and not bias-variance?
  – It is the simplest viewpoint that gives these 3 conclusions.
  – No assumptions like being restricted to squared error.
  – You can measure E_train but not E_approx (1 known and 1 unknown).
    • If E_train is low and you expect E_approx to be low, then you are happy.
      – E.g., you fit a very simple model or you used a huge independent validation set.
  – You can't measure bias, variance, or noise (3 unknowns).
    • If E_train is low, the bias-variance decomposition doesn't say anything about test error.
      – You only have your training set, not a distribution over possible datasets.
      – It doesn't say if high E_test is due to bias or variance or noise.
Learning Theory
• The bias-variance decomposition is a bit weird compared to our previous decompositions of E_test:
  – It considers the expectation over possible training sets.
  – But it doesn't say anything about test error with your training set.
• Some keywords if you want to learn about learning theory:
  – Bias-variance decomposition, sample complexity, probably approximately correct (PAC) learning, Vapnik-Chervonenkis (VC) dimension, Rademacher complexity.
• A gentle place to start is the "Learning from Data" book:
  – https://work.caltech.edu/telecourse.html
A Theoretical Answer to "How Much Data?"
• Assume we have a source of IID examples and a fixed class of parametric models.
  – Like "all depth-5 decision trees".
• Under some nasty assumptions, with 'n' training examples it holds that:

    E[test error of best model on training set] - (best test error in class) = O(1/n).

• You rarely know the constant factor, but this gives some guidelines:
  – Adding more data helps more on small datasets than on large datasets.
    • Going from 10 training examples to 20, the difference with the best possible error gets cut in half.
      – If the best possible error is 15% you might go from 20% to 17.5% (this does not mean 20% to 10%).
    • Going from 110 training examples to 120, the error only goes down by ~10%.
    • Going from 1M training examples to 1M+10, you won't notice a change.
  – Doubling the data size cuts the error in half:
    • Going from 1M training examples to 2M training examples, the error gets cut in half.
    • If you double the data size and your test error doesn't improve, more data might not help.
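The O(1/n) guideline can be turned into a tiny back-of-the-envelope calculator (the helper and its constant are hypothetical, calibrated here to the slide's 15%/20% example):

```python
def predicted_test_error(best_err, gap_at_n0, n0, n):
    """O(1/n) heuristic: the gap to the best possible error shrinks like 1/n,
    so error(n) is roughly best_err + gap_at_n0 * (n0 / n)."""
    return best_err + gap_at_n0 * (n0 / n)

# Slide example: best possible error 15%, observed 20% error at n0 = 10.
err_at_20 = predicted_test_error(0.15, 0.05, 10, 20)  # doubling n halves the gap
```

This is only a heuristic: the unknown constant means the absolute numbers are made up, but the halving-of-the-gap behaviour is what the bound predicts.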
Can You Test the IID Assumption?
• In general, testing the IID assumption is not easy.
  – Usually, you need background knowledge to decide if it's reasonable.
• Some tests do exist, like shuffling the order of the data and then measuring if some basic statistics agree.
  – It's reasonable to check if summary statistics of the train and test data agree.
    • If not, your trained model may not be so useful.
• Some discussion here:
  – https://stats.stackexchange.com/questions/28715/test-for-iid-sampling