CPSC 340: Machine Learning and Data Mining
Least Squares (Fall 2020)
Admin
• Assignment 3 is up:
– Start early, this is usually the longest assignment.
• We’re going to start using calculus and linear algebra a lot.
– You should start reviewing these ASAP if you are rusty.
– A review of relevant calculus concepts is here.
– A review of relevant linear algebra concepts is here.
Supervised Learning Round 2: Regression
• We’re going to revisit supervised learning:
• Previously, we considered classification:
– We assumed yi was discrete: yi = ‘spam’ or yi = ‘not spam’.
• Now we’re going to consider regression:
– We allow yi to be numerical: yi = 10.34cm.
Example: Dependent vs. Explanatory Variables
• We want to discover relationships between numerical variables:
– Does number of lung cancer deaths change with number of cigarettes?
– Does number of skin cancer deaths change with latitude?
http://www.cvgs.k12.va.us:81/digstats/main/inferant/d_regrs.html
https://onlinecourses.science.psu.edu/stat501/node/11
Example: Dependent vs. Explanatory Variables
• We want to discover relationships between numerical variables:
– Do people in big cities walk faster?
– Is the universe expanding or shrinking or staying the same size?
http://hosting.astro.cornell.edu/academics/courses/astro201/hubbles_law.htm
https://www.nature.com/articles/259557a0.pdf
Example: Dependent vs. Explanatory Variables
• We want to discover relationships between numerical variables:
– Does number of gun deaths change with gun ownership?
– Does number of violent crimes change with violent video games?
http://www.vox.com/2015/10/3/9444417/gun-violence-united-states-america
https://www.soundandvision.com/content/violence-and-video-games
Example: Dependent vs. Explanatory Variables
• We want to discover relationships between numerical variables:
– Does a higher gender equality index lead to more women STEM grads?
• Note that we’re doing supervised learning:
– Trying to predict the value of 1 variable (the ‘yi’ values), instead of measuring correlation between 2.
• Supervised learning does not give causality:
– OK: “Higher index is correlated with lower grad %”.
– OK: “Higher index helps predict lower grad %”.
– BAD: “Higher index leads to lower grad %”.
• People/media get these confused all the time, be careful!
• There are lots of potential reasons for this correlation.
https://www.weforum.org/agenda/2018/02/does-gender-equality-result-in-fewer-female-stem-grads/
Handling Numerical Labels
• One way to handle numerical yi: discretize (a small code sketch follows below).
– E.g., for ‘age’ we could use {‘age ≤ 20’, ‘20 < age ≤ 30’, ‘age > 30’}.
– Now we can apply methods for classification to do regression.
– But coarse discretization loses resolution.
– And fine discretization requires lots of data.
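A minimal sketch (not from the course slides) of this discretization idea in Python. The bin edges 20 and 30 match the ‘age’ example above; the ages themselves are made up.

import numpy as np

ages = np.array([15, 22, 28, 41, 19, 33])     # hypothetical numerical yi values
bins = np.array([20, 30])                     # edges for age<=20, 20<age<=30, age>30
classes = np.digitize(ages, bins, right=True) # map each age to its bin index
print(classes)                                # [0 1 1 2 0 2], now usable by a classifier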
• There exist regression versions of classification methods:
– Regression trees, probabilistic models, non-parametric models.
• Today: one of the oldest, but still most popular/important methods:
– Linear regression based on squared error.
– Interpretable and the building block for more-complex methods.
Linear Regression in 1 Dimension
• Assume we only have 1 feature (d = 1):
– E.g., xi is number of cigarettes and yi is number of lung cancer deaths.
• Linear regression makes predictions ŷi using a linear function of xi:
ŷi = w xi
• The parameter ‘w’ is the weight or regression coefficient of xi.
– We’re temporarily ignoring the y-intercept.
• As xi changes, slope ‘w’ affects the rate that ŷi increases/decreases (a tiny numerical sketch follows below):
– Positive ‘w’: ŷi increases as xi increases.
– Negative ‘w’: ŷi decreases as xi increases.
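A minimal sketch (not from the course slides) of the 1D prediction ŷi = w xi on made-up numbers, showing how the sign of ‘w’ changes the trend.

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])  # hypothetical feature values
print(2.0 * x)                      # positive w: predictions increase with x -> [0. 2. 4. 6.]
print(-2.0 * x)                     # negative w: predictions decrease with x -> [-0. -2. -4. -6.]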
Linear Regression in 1 Dimension
Aside: terminology woes
• Different fields use different terminology and symbols.
– Data points = objects = examples = rows = observations.
– Inputs = predictors = features = explanatory variables = regressors = independent variables = covariates = columns.
– Outputs = outcomes = targets = response variables = dependent variables (also called a “label” if it’s categorical).
– Regression coefficients = weights = parameters = betas.
• With linear regression, the symbols are inconsistent too:
– In ML, the data is X and y, and the weights are w.
– In statistics, the data is X and y, and the weights are β.
– In optimization, the data is A and b, and the weights are x.
Least Squares Objective
• Our linear model is given by:
ŷi = w xi
• So we make predictions for a new example x̃i by using:
ŷi = w x̃i
• But we can’t use the same error as before:
– It is unlikely to find a line where ŷi = yi exactly for many points.
• Due to noise, the relationship not being quite linear, or just floating-point issues.
– The “best” model may have |ŷi − yi| small but not exactly 0.
Least Squares Objective
• Instead of “exact yi”, we evaluate the “size” of the error in the prediction.
• The classic way is setting the slope ‘w’ to minimize the sum of squared errors:
f(w) = Σ_{i=1}^n (w xi − yi)²
• There are some justifications for this choice.
– A probabilistic interpretation is coming later in the course.
• But usually, it is done because it is easy to minimize (a small numerical sketch follows below).
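A minimal sketch (not from the course slides) that evaluates the objective f(w) = Σ (w xi − yi)² on made-up data for a few candidate slopes; smaller f(w) means a better-fitting line.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical features
y = np.array([2.1, 3.9, 6.2, 8.1])   # hypothetical targets (roughly y = 2x)

def f(w):
    return np.sum((w * x - y) ** 2)  # sum of squared errors

for w in [1.0, 2.0, 3.0]:
    print(w, f(w))                   # f(w) is smallest near w = 2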
Least Squares Objective
• The classic way to set the slope ‘w’ is minimizing the sum of squared errors: the squared vertical distances between the line ŷi = w xi and the observed yi values.
Minimizing a Differentiable Function
• Math 101 approach to minimizing a differentiable function ‘f’ (a worked example follows below):
1. Take the derivative of ‘f’.
2. Find points ‘w’ where the derivative f’(w) is equal to 0.
3. Choose the ‘w’ with the smallest f(w) (and check that f’’(w) is positive).
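As a worked illustration (not from the course slides), applying these three steps to a simple quadratic, written in LaTeX:

\[
f(w) = (w - 3)^2, \qquad
f'(w) = 2(w - 3), \qquad
f'(w) = 0 \;\Rightarrow\; w = 3, \qquad
f''(w) = 2 > 0,
\]

so w = 3 is the unique minimizer of f.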
Digression: Multiplying by a Positive Constant
• Note that this problem:
minimize over ‘w’: Σ_{i=1}^n (w xi − yi)²
• Has the same set of minimizers as this problem:
minimize over ‘w’: ½ Σ_{i=1}^n (w xi − yi)²
• And these also have the same minimizers:
minimize over ‘w’: (1/n) Σ_{i=1}^n (w xi − yi)², or (1/2n) Σ_{i=1}^n (w xi − yi)²
• I can multiply ‘f’ by any positive constant and not change the solution.
– The derivative will still be zero at the same locations.
– We’ll use this trick a lot!
(Quora trolling on the ethics of this)
Finding Least Squares Solution
• Finding the ‘w’ that minimizes the sum of squared errors:
f(w) = ½ Σ_{i=1}^n (w xi − yi)²
• Taking the derivative and setting it equal to 0:
f’(w) = Σ_{i=1}^n (w xi − yi) xi = w Σ_{i=1}^n xi² − Σ_{i=1}^n xi yi = 0
• Solving for ‘w’ gives the closed-form solution (a code sketch follows below):
w = (Σ_{i=1}^n xi yi) / (Σ_{i=1}^n xi²)
• Let’s check that this is a minimizer by checking the second derivative:
f’’(w) = Σ_{i=1}^n xi²
– Since (anything)² is non-negative and (anything non-zero)² > 0, if we have one non-zero feature then f’’(w) > 0 and this is a minimizer.
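A minimal sketch (not from the course slides) of this closed-form solution on made-up data.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical features
y = np.array([2.1, 3.9, 6.2, 8.1])  # hypothetical targets

w = np.sum(x * y) / np.sum(x * x)   # minimizer of (1/2) * sum((w*x - y)**2)
print(w)                            # about 2.03: slope of the best line through the origin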
Least Squares Objective/Solution (Another View)
• Least squares minimizes a quadratic that is a sum of quadratics:
f(w) = ½ Σ_{i=1}^n (w xi − yi)²
– Each term (w xi − yi)² is a quadratic in ‘w’ that opens upward (or is constant if xi = 0), so with at least one non-zero feature the sum is an upward-opening quadratic and the critical point is the global minimum.
(pause)
Motivation: Combining Explanatory Variables
• Smoking is not the only contributor to lung cancer.
– For example, there are environmental factors like exposure to asbestos.
• How can we model the combined effect of smoking and asbestos?
• A simple way is with a 2-dimensional linear function:
ŷi = w1 xi1 + w2 xi2
• We have a weight w1 for feature ‘1’ and w2 for feature ‘2’:
Least Squares in 2-Dimensions
• Linear model:
ŷi = w1 xi1 + w2 xi2
• This defines a two-dimensional plane.
• Not just a line!
Different Notations for Least Squares
• If we have ‘d’ features, the d-dimensional linear model is:
ŷi = w1 xi1 + w2 xi2 + … + wd xid
– In words, our model is that the output is a weighted sum of the inputs.
• We can re-write this in summation notation:
ŷi = Σ_{j=1}^d wj xij
• We can also re-write this in vector notation:
ŷi = wᵀxi
Notation Alert (again)
• In this course, all vectors are assumed to be column-vectors:
w = [w1, w2, …, wd]ᵀ, xi = [xi1, xi2, …, xid]ᵀ
• So wᵀxi is a scalar:
wᵀxi = Σ_{j=1}^d wj xij
• So rows of ‘X’ are actually transposes of the column-vectors xi (a numpy sketch follows below):
row ‘i’ of X is xiᵀ
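A minimal sketch (not from the course slides) of this notation in numpy, with made-up numbers: w and xi are vectors, wᵀxi is a scalar, and row i of X is xiᵀ.

import numpy as np

w  = np.array([0.5, -1.0, 2.0])     # weights w1, w2, w3 (d = 3)
xi = np.array([1.0, 2.0, 3.0])      # one example's features
print(w @ xi)                       # w^T xi = 0.5 - 2.0 + 6.0 = 4.5 (a scalar)

X = np.array([[1.0, 2.0, 3.0],      # row i of X is xi^T (n = 2 examples)
              [4.0, 5.0, 6.0]])
print(X @ w)                        # all predictions at once: [4.5 9. ]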
Least Squares in d-Dimensions
• The linear least squares model in d-dimensions minimizes:
f(w) = ½ Σ_{i=1}^n (wᵀxi − yi)²
• Dates back to 1801: Gauss used it to predict the location of Ceres.
• How do we find the best vector ‘w’ in ‘d’ dimensions?
– Can we set the “partial derivative” of each variable to 0?
Partial Derivatives
http://msemac.redwoods.edu/~darnold/math50c/matlab/pderiv/index.xhtml
Least Squares Partial Derivatives (1 Example)
• The linear least squares model in d-dimensions for 1 example ‘i’:
f(w1, w2, …, wd) = ½ (wᵀxi − yi)² = ½ (Σ_{j=1}^d wj xij − yi)²
• Computing the partial derivative for variable ‘1’:
∂f/∂w1 = (wᵀxi − yi) xi1
Least Squares Partial Derivatives (‘n’ Examples)
• Linear least squares partial derivative for variable 1 on example ‘i’:
∂fi/∂w1 = (wᵀxi − yi) xi1
• For a generic variable ‘j’ we would have:
∂fi/∂wj = (wᵀxi − yi) xij
• And if ‘f’ is summed over all ‘n’ examples we would have (a code sketch follows below):
∂f/∂wj = Σ_{i=1}^n (wᵀxi − yi) xij
• Unfortunately, the partial derivative for wj depends on all of {w1, w2, …, wd}:
– I can’t just “set equal to 0 and solve for wj”.
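A minimal sketch (not from the course slides) computing all ‘d’ partial derivatives at once on made-up data; stacking ∂f/∂wj = Σ_i (wᵀxi − yi) xij for every ‘j’ is exactly the matrix product Xᵀ(Xw − y).

import numpy as np

X = np.array([[1.0, 2.0],           # n = 3 examples, d = 2 features
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.1, 0.2])

residual = X @ w - y                # (w^T xi - yi) for each example i
grad = X.T @ residual               # partial derivative for wj in position j
print(grad)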
Gradient and Critical Points in d-Dimensions
• Generalizing “set the derivative to 0 and solve” in d-dimensions:
– Find ‘w’ where the gradient vector equals the zero vector (a code sketch follows below).
• The gradient is the vector with partial derivative ‘j’ in position ‘j’:
∇f(w) = [∂f/∂w1, ∂f/∂w2, …, ∂f/∂wd]ᵀ
http://msemac.redwoods.edu/~darnold/math50c/matlab/pderiv/index.xhtml
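A sketch (not from these slides) of finding such a ‘w’ for least squares: with the gradient Xᵀ(Xw − y) from the previous sketch, setting it equal to the zero vector means solving the linear system XᵀX w = Xᵀy.

import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])

w = np.linalg.solve(X.T @ X, X.T @ y)  # w where the gradient is the zero vector
print(X.T @ (X @ w - y))               # gradient at w: approximately [0 0]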
Summary
• Regression considers the case of a numerical yi.
• Least squares is a classic method for fitting linear models.
– With 1 feature, it has a simple closed-form solution.
– Can be generalized to ‘d’ features.
• Gradient is the vector containing the partial derivatives of all variables.
• Next time: