
Machine Learning

Support Vector Machines: Training with Stochastic Gradient Descent

Support vector machines

• Training by maximizing margin
• The SVM objective
• Solving the SVM optimization problem
• Support vectors, duals and kernels

SVM objective function

Regularization term:
• Maximizes the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms that impose other preferences

Empirical loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions that impose other preferences

A hyper-parameter (C) controls the tradeoff between a large margin and a small hinge loss.
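The objective itself is not reproduced in this transcript; in the form consistent with the C-weighted updates used later in the deck, it is the regularized hinge loss:

\[
J(\mathbf{w}) \;=\; \frac{1}{2}\,\mathbf{w}^{T}\mathbf{w} \;+\; C \sum_{i=1}^{N} \max\!\big(0,\; 1 - y_i\,\mathbf{w}^{T}\mathbf{x}_i\big)
\]

where the first term is the regularization term and the second is the empirical (hinge) loss described above.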

Outline: Training SVM by optimization

1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron


Solving the SVM optimization problem

This function is convex in w.

Recall: Convex functions

A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1], we have

f(λu + (1 − λ)v) ≤ λf(u) + (1 − λ)f(v)

From a geometric perspective, every tangent plane lies below the function.

[Figure: a convex function with points u, v and values f(u), f(v), illustrating the definition.]

Convex functions

Linear functions are convex; the max of linear functions is also convex.

Some ways to show that a function is convex:

1. Using the definition of convexity
2. Showing that the second derivative is non-negative (for one-dimensional functions)
3. Showing that the Hessian, the matrix of second derivatives, is positive semi-definite (for functions of a vector)
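For illustration (not in the original slides), two quick checks: one using the second-derivative test, and one using the fact that a max of linear functions is convex.

\[
f(x) = x^{2}: \quad f''(x) = 2 > 0 \ \text{for all } x, \ \text{so } f \text{ is convex.}
\]
\[
\ell(\mathbf{w}) = \max\!\big(0,\; 1 - y\,\mathbf{w}^{T}\mathbf{x}\big) \ \text{is a pointwise max of two linear functions of } \mathbf{w}, \ \text{hence convex.}
\]

The second check is the one that matters for this lecture: it is why the SVM objective is convex.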

Not all functions are convex

[Figure: examples of concave functions and of functions that are neither convex nor concave.]

Concave functions satisfy the reversed inequality:

f(λu + (1 − λ)v) ≥ λf(u) + (1 − λ)f(v)

Convex functions are convenient

A function f is convex if for every u, v in the domain, and for every λ ∈ [0, 1], we have

f(λu + (1 − λ)v) ≤ λf(u) + (1 − λ)f(v)

In general: a necessary condition for x to be a minimum of the function f is ∇f(x) = 0.

For convex functions, this is both necessary and sufficient.

Solving the SVM optimization problem

This function is convex in w.

• This is a quadratic optimization problem because the objective is quadratic
• Older methods used techniques from quadratic programming, which is very slow
• There are no constraints, so we can use gradient descent, but that is still very slow!

Gradient descent

General strategy for minimizing a function J(w):

• Start with an initial guess for w, say w_0
• Iterate till convergence:
  – Compute the gradient of J at w_t
  – Update w_t to get w_{t+1} by taking a step in the opposite direction of the gradient

Intuition: the gradient is the direction of steepest increase of the function. To get to the minimum, go in the opposite direction.

[Figure: J(w) plotted against w, with successive iterates w_0, w_1, w_2, w_3 stepping downhill toward the minimum.]

Gradient descent for SVM

We are trying to minimize the SVM objective J(w).

1. Initialize w_0
2. For t = 0, 1, 2, …:
   1. Compute the gradient of J(w) at w_t. Call it ∇J(w_t)
   2. Update w as follows: w_{t+1} ← w_t − r ∇J(w_t)

r: called the learning rate.

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Gradient descent for SVM

We are trying to minimize the SVM objective J(w).

1. Initialize w_0
2. For t = 0, 1, 2, …:
   1. Compute the gradient of J(w) at w_t. Call it ∇J(w_t)
   2. Update w as follows: w_{t+1} ← w_t − r ∇J(w_t)

r: called the learning rate.

The gradient of the SVM objective requires summing over the entire training set. Slow, does not really scale.
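To make that cost concrete, here is a minimal Python sketch (not from the slides; the function name, parameters, and the NumPy dependency are illustrative) of one full-batch sub-gradient step on the regularized hinge-loss objective. Every step sums over all N examples, which is why this does not scale:

import numpy as np

def batch_gradient_step(w, X, y, C, lr):
    """One full-batch sub-gradient step on J(w) = 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w.x_i).

    X: (N, d) array of examples, y: (N,) array of labels in {-1, +1}.
    Every call sums over all N examples, which is what makes plain
    gradient descent slow on large datasets.
    """
    margins = y * (X @ w)                       # y_i * w^T x_i for every example
    violated = margins < 1                      # examples where the hinge loss is active
    # Sub-gradient of the hinge term: -y_i x_i for violated examples, 0 otherwise
    hinge_grad = -(X[violated] * y[violated][:, None]).sum(axis=0)
    grad = w + C * hinge_grad                   # gradient of the regularizer is w itself
    return w - lr * grad

The stochastic version on the next slides replaces the sum over all N examples with a single randomly chosen example.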

Stochastic gradient descent for SVM

Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}

1. Initialize w_0 = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. Pick a random example (x_i, y_i) from the training set S
   2. Treat (x_i, y_i) as a full dataset and take the derivative of the SVM objective at the current w_{t−1} to be ∇J_t(w_{t−1})
   3. Update: w_t ← w_{t−1} − γ_t ∇J_t(w_{t−1})
3. Return final w


What is the gradient of the hinge loss with respect to w? (The hinge loss is not a differentiable function!)

This algorithm is guaranteed to converge to the minimum of J if γ_t is small enough. Why? Because the objective J(w) is a convex function.

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Gradient Descent vs SGD

[Figure: a sequence of animation frames contrasting the steps taken by gradient descent with the noisier steps taken by stochastic gradient descent.]

Many more updates than gradient descent, but each individual update is less computationally expensive.

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron


Hinge loss is not differentiable!

What is the derivative of the hinge loss with respect to w?

Detour: Sub-gradients

A generalization of gradients to non-differentiable functions. (Remember that, for convex functions, every tangent is a hyperplane that lies below the function.)

Informally, a sub-tangent at a point is any hyperplane that lies below the function at that point. A sub-gradient is the slope of that hyperplane.

Sub-gradients [Example from Boyd]

[Figure: a convex function f; f is differentiable at x_1 and the tangent at that point has slope g_1, so g_1 is the gradient at x_1; g_2 and g_3 are both subgradients at x_2.]

Formally, a vector g is a subgradient of f at a point x if

f(z) ≥ f(x) + gᵀ(z − x) for every z
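As a concrete instance (added for illustration), applying this definition to the hinge loss on a single example (x, y) gives its sub-gradients:

\[
\ell(\mathbf{w}) = \max\!\big(0,\; 1 - y\,\mathbf{w}^{T}\mathbf{x}\big), \qquad
g = \begin{cases}
-\,y\,\mathbf{x} & \text{if } y\,\mathbf{w}^{T}\mathbf{x} < 1\\
\mathbf{0} & \text{if } y\,\mathbf{w}^{T}\mathbf{x} > 1\\
-\,\lambda\, y\,\mathbf{x},\ \ \lambda \in [0,1] & \text{if } y\,\mathbf{w}^{T}\mathbf{x} = 1
\end{cases}
\]

At the kink, any of these choices is a valid sub-gradient; the SVM update later simply folds the boundary case into the first one (the condition y wᵀx ≤ 1).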

Sub-gradient of the SVM objective

General strategy: first solve the max, then compute the gradient for each case.
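Carrying out this case analysis for the per-example objective J_t(w) = ½ wᵀw + C max(0, 1 − y_i wᵀx_i) (the form consistent with the updates on the following slides, and an assumption of this reconstruction) gives a sub-gradient:

\[
\nabla J_t(\mathbf{w}) =
\begin{cases}
\mathbf{w} - C\,y_i\,\mathbf{x}_i & \text{if } y_i\,\mathbf{w}^{T}\mathbf{x}_i \le 1\\
\mathbf{w} & \text{otherwise}
\end{cases}
\]

Substituting into w ← w − γ_t ∇J_t(w) gives w ← (1 − γ_t)w + γ_t C y_i x_i when the margin is violated, and w ← (1 − γ_t)w otherwise, which is exactly the update used in the algorithm that follows.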

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Stochastic sub-gradient descent for SVM

Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}

1. Initialize w = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. For each training example (x_i, y_i) ∈ S:
      If y_i wᵀx_i ≤ 1: w ← (1 − γ_t) w + γ_t C y_i x_i
      else: w ← (1 − γ_t) w
3. Return w

Both branches above are the generic SGD step w ← w − γ_t ∇J_t, for the corresponding sub-gradient case.

γ_t: learning rate, many tweaks possible.

It is important to shuffle the examples at the start of each epoch.

Stochastic sub-gradient descent for SVM

Given a training set S = {(x_i, y_i)}, x ∈ ℝⁿ, y ∈ {−1, 1}

1. Initialize w = 0 ∈ ℝⁿ
2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (x_i, y_i) ∈ S:
      If y_i wᵀx_i ≤ 1: w ← (1 − γ_t) w + γ_t C y_i x_i
      else: w ← (1 − γ_t) w
3. Return w

γ_t: learning rate, many tweaks possible.
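A minimal runnable Python sketch of this algorithm (not the authors' code; the function name, the decaying learning rate γ_t = γ_0/(1 + t), and the NumPy dependency are illustrative choices):

import numpy as np

def svm_sgd(X, y, C=1.0, gamma0=0.01, epochs=10, seed=0):
    """Stochastic sub-gradient descent for the SVM objective
    1/2 ||w||^2 + C * sum_i max(0, 1 - y_i w.x_i).
    X: (N, d) array of examples, y: (N,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)                        # 1. Initialize w = 0
    t = 0
    for _ in range(epochs):                # 2. For epoch = 1 ... T
        order = rng.permutation(N)         #    shuffle the training set
        for i in order:                    #    for each training example
            gamma_t = gamma0 / (1 + t)     # one possible decaying learning rate
            if y[i] * (w @ X[i]) <= 1:     # margin violated: hinge loss is active
                w = (1 - gamma_t) * w + gamma_t * C * y[i] * X[i]
            else:                          # margin satisfied: only the regularizer acts
                w = (1 - gamma_t) * w
            t += 1
    return w                               # 3. Return w

Predictions are then made with sign(X @ w); there is no bias term, matching the wᵀx form used throughout these slides.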

Convergence and learning rates

With enough iterations, it will converge in expectation, provided the step sizes are "square summable, but not summable":

• The step sizes γ_t are positive
• The sum of squares of the step sizes over t = 1 to ∞ is finite
• The sum of the step sizes over t = 1 to ∞ is infinite

• Some examples: γ_t = γ_0 / (1 + γ_0 t / C), or γ_t = γ_0 / (1 + t)
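As a quick check (added for illustration), the schedule γ_t = γ_0/(1 + t) satisfies all three conditions:

\[
\sum_{t=1}^{\infty} \frac{\gamma_0}{1+t} = \infty \quad \text{(the harmonic series diverges)},
\qquad
\sum_{t=1}^{\infty} \Big(\frac{\gamma_0}{1+t}\Big)^{2} \;\le\; \gamma_0^{2} \sum_{k=1}^{\infty} \frac{1}{k^{2}} \;=\; \frac{\gamma_0^{2}\pi^{2}}{6} \;<\; \infty.
\]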


Convergence and learning rates

• Number of iterations to get to accuracy within ε:
  – For strongly convex functions, with N examples in d dimensions:
    Gradient descent: O(Nd ln(1/ε))
    Stochastic gradient descent: O(d/ε)

• More subtleties are involved, but SGD is generally preferable when the data size is huge.

• Recently, many variants based on this general strategy have been proposed, targeting multilayer neural networks. Examples: Adagrad, momentum, Nesterov's accelerated gradient, Adam, RMSProp, etc.


Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

Stochastic sub-gradient descent for SVM

Recall the per-example update: if y_i wᵀx_i ≤ 1, then w ← (1 − γ_t) w + γ_t C y_i x_i; else w ← (1 − γ_t) w.

Compare with the Perceptron update: if y wᵀx ≤ 0, update w ← w + r y x.

Perceptron vs. SVM

• Perceptron: stochastic sub-gradient descent for a different loss, with no regularization
• SVM: optimizes the hinge loss, with regularization
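A side-by-side sketch of the two per-example updates (illustrative Python; w and x are assumed to be NumPy arrays, r and gamma_t are the respective learning rates, C the SVM tradeoff parameter):

def perceptron_update(w, x, y, r):
    # Mistake-driven: update only when the prediction is wrong (y w.x <= 0)
    if y * (w @ x) <= 0:
        w = w + r * y * x
    return w

def svm_sgd_update(w, x, y, gamma_t, C):
    # Margin-driven: update whenever the example is within the margin (y w.x <= 1),
    # and always shrink w (the effect of the regularization term)
    if y * (w @ x) <= 1:
        w = (1 - gamma_t) * w + gamma_t * C * y * x
    else:
        w = (1 - gamma_t) * w
    return w

The perceptron never shrinks w and only reacts to mistakes; the SVM update shrinks w on every step and reacts to any example inside the margin.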

SVM summary from optimization perspective

• Minimize regularized hinge loss
• Solve using stochastic gradient descent
  – Very fast; the cost of each update does not depend on the number of examples
  – Compare with the Perceptron algorithm: the Perceptron does not maximize the margin width, though Perceptron variants can force a margin
  – The convergence criterion is an issue: SGD can be aggressive at the beginning and gets to a reasonably good solution fast, but convergence to a very accurate weight vector is slow
• Other successful optimization algorithms exist
  – E.g. dual coordinate descent, implemented in liblinear

Questions?