
Machine Learning

Support Vector Machines: Training with Stochastic Gradient Descent

Support vector machines

• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels

SVM objective function

Regularization term:
• Maximizes the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms that impose other preferences

Empirical loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions that impose other preferences

A hyper-parameter (C) controls the tradeoff between a large margin and a small hinge loss.
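The objective itself is not reproduced in this transcript; in the form consistent with the C-weighted updates used later in the deck, it is the regularized hinge loss:

\[
J(\mathbf{w}) \;=\; \frac{1}{2}\,\mathbf{w}^{T}\mathbf{w} \;+\; C \sum_{i=1}^{N} \max\!\big(0,\; 1 - y_i\,\mathbf{w}^{T}\mathbf{x}_i\big)
\]

where the first term is the regularization term and the second is the empirical (hinge) loss described above.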

Outline: Training SVM by optimization

1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron


Solving the SVM optimization problem

This function is convex in w.

Recall: Convex functions

A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1], we have

f(λu + (1 − λ)v) ≤ λf(u) + (1 − λ)f(v)

From a geometric perspective, every tangent plane lies below the function.

[Figure: a convex function with points u, v and values f(u), f(v), illustrating the definition.]

Convex functions

Linear functions are convex; the max of linear functions is also convex.

Some ways to show that a function is convex:

1. Using the definition of convexity
2. Showing that the second derivative is non-negative (for one-dimensional functions)
3. Showing that the Hessian, the matrix of second derivatives, is positive semi-definite (for functions of a vector)
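For illustration (not in the original slides), two quick checks: one using the second-derivative test, and one using the fact that a max of linear functions is convex.

\[
f(x) = x^{2}: \quad f''(x) = 2 > 0 \ \text{for all } x, \ \text{so } f \text{ is convex.}
\]
\[
\ell(\mathbf{w}) = \max\!\big(0,\; 1 - y\,\mathbf{w}^{T}\mathbf{x}\big) \ \text{is a pointwise max of two linear functions of } \mathbf{w}, \ \text{hence convex.}
\]

The second check is the one that matters for this lecture: it is why the SVM objective is convex.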

Not all functions are convex

[Figure: examples of concave functions and of functions that are neither convex nor concave.]

Concave functions satisfy the reversed inequality:

f(λu + (1 − λ)v) ≥ λf(u) + (1 − λ)f(v)

Convex functions are convenient

A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1], we have

f(λu + (1 − λ)v) ≤ λf(u) + (1 − λ)f(v)

In general: a necessary condition for x to be a minimum of the function f is ∇f(x) = 0.

For convex functions, this is both necessary and sufficient.

Solving the SVM optimization problem

This function is convex in w.

• This is a quadratic optimization problem because the objective is quadratic
• Older methods used techniques from quadratic programming, which is very slow
• There are no constraints, so we can use gradient descent, but that is still very slow!

Gradient descent

General strategy for minimizing a function J(w):

• Start with an initial guess for w, say w_0
• Iterate till convergence:
  – Compute the gradient of J at w_t
  – Update w_t to get w_{t+1} by taking a step in the opposite direction of the gradient

Intuition: the gradient is the direction of steepest increase of the function. To get to the minimum, go in the opposite direction.

[Figure: J(w) plotted against w, with successive iterates w_0, w_1, w_2, w_3 stepping downhill toward the minimum.]

Gradient descent for SVM

We are trying to minimize the SVM objective J(w).

1. Initialize w_0
2. For t = 0, 1, 2, …:
   1. Compute the gradient of J(w) at w_t. Call it ∇J(w_t)
   2. Update w as follows: w_{t+1} ← w_t − r ∇J(w_t)

r: called the learning rate.

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Gradient descent for SVM

We are trying to minimize the SVM objective J(w).

1. Initialize w_0
2. For t = 0, 1, 2, …:
   1. Compute the gradient of J(w) at w_t. Call it ∇J(w_t)
   2. Update w as follows: w_{t+1} ← w_t − r ∇J(w_t)

r: called the learning rate.

The gradient of the SVM objective requires summing over the entire training set. Slow, does not really scale.
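To make that cost concrete, here is a minimal Python sketch (not from the slides; the function name, parameters, and the NumPy dependency are illustrative) of one full-batch sub-gradient step on the regularized hinge-loss objective. Every step sums over all N examples, which is why this does not scale:

import numpy as np

def batch_gradient_step(w, X, y, C, lr):
    """One full-batch sub-gradient step on J(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w.x_i).

    X: (N, d) array of examples, y: (N,) array of labels in {-1, +1}.
    Every call sums over all N examples, which is what makes plain
    gradient descent slow on large datasets.
    """
    margins = y * (X @ w)                       # y_i * w^T x_i for every example
    violated = margins < 1                      # examples where the hinge loss is active
    # Sub-gradient of the hinge term: -y_i x_i for violated examples, 0 otherwise
    hinge_grad = -(X[violated] * y[violated][:, None]).sum(axis=0)
    grad = w + C * hinge_grad                   # gradient of the regularizer is w itself
    return w - lr * grad

The stochastic version on the next slides replaces the sum over all N examples with a single randomly chosen example.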

Stochastic gradient descent for SVM

Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}

1. Initialize w_0 = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. Pick a random example (x_i, y_i) from the training set S
   2. Treat (x_i, y_i) as a full dataset and take the derivative of the SVM objective at the current w_{t−1} to be ∇J_t(w_{t−1})
   3. Update: w_t ← w_{t−1} − γ_t ∇J_t(w_{t−1})
3. Return final w


What is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)

This algorithm is guaranteed to converge to the minimum of J if γ_t is small enough. Why? Because the objective J(w) is a convex function.

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Gradient Descent vs SGD

[Figure: a sequence of animation frames contrasting the steps taken by gradient descent with the noisier steps taken by stochastic gradient descent.]

Many more updates than gradient descent, but each individual update is less computationally expensive.

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron


Hinge loss is not differentiable!

What is the derivative of the hinge loss with respect to w?

Detour: Sub-gradients

A generalization of gradients to non-differentiable functions. (Remember that, for convex functions, every tangent is a hyperplane that lies below the function.)

Informally, a sub-tangent at a point is any hyperplane that lies below the function at that point. A sub-gradient is the slope of that hyperplane.

Sub-gradients [Example from Boyd]

[Figure: a convex function f; f is differentiable at x_1 and the tangent at that point has slope g_1, so g_1 is the gradient at x_1; g_2 and g_3 are both subgradients at x_2.]

Formally, a vector g is a subgradient of f at a point x if

f(z) ≥ f(x) + gᵀ(z − x) for every z
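As a concrete instance (added for illustration), applying this definition to the hinge loss on a single example (x, y) gives its sub-gradients:

\[
\ell(\mathbf{w}) = \max\!\big(0,\; 1 - y\,\mathbf{w}^{T}\mathbf{x}\big), \qquad
g = \begin{cases}
-\,y\,\mathbf{x} & \text{if } y\,\mathbf{w}^{T}\mathbf{x} < 1\\
\mathbf{0} & \text{if } y\,\mathbf{w}^{T}\mathbf{x} > 1\\
-\,\lambda\, y\,\mathbf{x},\ \ \lambda \in [0,1] & \text{if } y\,\mathbf{w}^{T}\mathbf{x} = 1
\end{cases}
\]

At the kink, any of these choices is a valid sub-gradient; the SVM update later simply folds the boundary case into the first one (the condition y wᵀx ≤ 1).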

Sub-gradient of the SVM objective

General strategy: first solve the max, then compute the gradient for each case.
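Carrying out this case analysis for the per-example objective J_t(w) = ½ wᵀw + C max(0, 1 − y_i wᵀx_i) (the form consistent with the updates on the following slides, and an assumption of this reconstruction) gives a sub-gradient:

\[
\nabla J_t(\mathbf{w}) =
\begin{cases}
\mathbf{w} - C\,y_i\,\mathbf{x}_i & \text{if } y_i\,\mathbf{w}^{T}\mathbf{x}_i \le 1\\
\mathbf{w} & \text{otherwise}
\end{cases}
\]

Substituting into w ← w − γ_t ∇J_t(w) gives w ← (1 − γ_t)w + γ_t C y_i x_i when the margin is violated, and w ← (1 − γ_t)w otherwise, which is exactly the update used in the algorithm that follows.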

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Stochastic sub-gradient descent for SVM

Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}

1. Initialize w = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. For each training example (x_i, y_i) ∈ S:
      If y_i wᵀx_i ≤ 1: w ← (1 − γ_t) w + γ_t C y_i x_i
      else: w ← (1 − γ_t) w
3. Return w

Both branches above are the generic SGD step w ← w − γ_t ∇J_t, for the corresponding sub-gradient case.

γ_t: learning rate, many tweaks possible.

It is important to shuffle the examples at the start of each epoch.

Stochastic sub-gradient descent for SVM

Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}

1. Initialize w = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      If y_i wᵀx_i ≤ 1: w ← (1 − γ_t) w + γ_t C y_i x_i
      else: w ← (1 − γ_t) w
3. Return w

γ_t: learning rate, many tweaks possible.
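A minimal runnable Python sketch of this algorithm (not the authors' code; the function name, the decaying learning rate γ_t = γ_0/(1 + t), and the NumPy dependency are illustrative choices):

import numpy as np

def svm_sgd(X, y, C=1.0, gamma0=0.01, epochs=10, seed=0):
    """Stochastic sub-gradient descent for the SVM objective
    1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w.x_i).
    X: (N, d) array of examples, y: (N,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)                        # 1. Initialize w = 0
    t = 0
    for _ in range(epochs):                # 2. For epoch = 1 ... T
        order = rng.permutation(N)         #    shuffle the training set
        for i in order:                    #    for each training example
            gamma_t = gamma0 / (1 + t)     # one possible decaying learning rate
            if y[i] * (w @ X[i]) <= 1:     # margin violated: hinge loss is active
                w = (1 - gamma_t) * w + gamma_t * C * y[i] * X[i]
            else:                          # margin satisfied: only the regularizer acts
                w = (1 - gamma_t) * w
            t += 1
    return w                               # 3. Return w

Predictions are then made with sign(X @ w); there is no bias term, matching the wᵀx form used throughout these slides.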

Convergence and learning rates

With enough iterations, it will converge in expectation, provided the step sizes are "square summable, but not summable":

• The step sizes γ_t are positive
• The sum of squares of the step sizes over t = 1 to ∞ is finite
• The sum of the step sizes over t = 1 to ∞ is infinite

• Some examples: γ_t = γ_0 / (1 + γ_0 t / C), or γ_t = γ_0 / (1 + t)
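As a quick check (added for illustration), the schedule γ_t = γ_0/(1 + t) satisfies all three conditions:

\[
\sum_{t=1}^{\infty} \frac{\gamma_0}{1+t} = \infty \quad \text{(the harmonic series diverges)},
\qquad
\sum_{t=1}^{\infty} \Big(\frac{\gamma_0}{1+t}\Big)^{2} \;\le\; \gamma_0^{2} \sum_{k=1}^{\infty} \frac{1}{k^{2}} \;=\; \frac{\gamma_0^{2}\pi^{2}}{6} \;<\; \infty.
\]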


Convergence and learning rates

• Number of iterations to get to accuracy within ε:
  – For strongly convex functions, with N examples in d dimensions:
    Gradient descent: O(Nd ln(1/ε))
    Stochastic gradient descent: O(d/ε)

• More subtleties are involved, but SGD is generally preferable when the data size is huge.

• Recently, many variants based on this general strategy have been proposed, targeting multilayer neural networks. Examples: Adagrad, momentum, Nesterov's accelerated gradient, Adam, RMSProp, etc.


Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Stochastic sub-gradient descent for SVM

Recall the per-example update: if y_i wᵀx_i ≤ 1, then w ← (1 − γ_t) w + γ_t C y_i x_i; else w ← (1 − γ_t) w.

Compare with the Perceptron update: if y wᵀx ≤ 0, update w ← w + r y x.

Perceptron vs. SVM

• Perceptron: stochastic sub-gradient descent for a different loss, with no regularization
• SVM: optimizes the hinge loss, with regularization
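A side-by-side sketch of the two per-example updates (illustrative Python; w and x are assumed to be NumPy arrays, r and gamma_t are the respective learning rates, C the SVM tradeoff parameter):

def perceptron_update(w, x, y, r):
    # Mistake-driven: update only when the prediction is wrong (y w.x <= 0)
    if y * (w @ x) <= 0:
        w = w + r * y * x
    return w

def svm_sgd_update(w, x, y, gamma_t, C):
    # Margin-driven: update whenever the example is within the margin (y w.x <= 1),
    # and always shrink w (the effect of the regularization term)
    if y * (w @ x) <= 1:
        w = (1 - gamma_t) * w + gamma_t * C * y * x
    else:
        w = (1 - gamma_t) * w
    return w

The perceptron never shrinks w and only reacts to mistakes; the SVM update shrinks w on every step and reacts to any example inside the margin.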

SVM summary from optimization perspective

• Minimize regularized hinge loss
• Solve using stochastic gradient descent
  – Very fast; the cost of each update does not depend on the number of examples
  – Compare with the Perceptron algorithm: the Perceptron does not maximize the margin width, though Perceptron variants can force a margin
  – The convergence criterion is an issue: SGD can be aggressive at the beginning and gets to a reasonably good solution fast, but convergence to a very accurate weight vector is slow
• Other successful optimization algorithms exist
  – E.g. dual coordinate descent, implemented in liblinear

Questions?