Page 1

Support Vector Machines: Training with Stochastic Gradient Descent

Machine Learning, Spring 2018

The slides are mainly from Vivek Srikumar

1

Page 2

Support vector machines

• Training by maximizing margin

• The SVM objective

• Solving the SVM optimization problem

• Support vectors, duals and kernels

2

Page 3

SVM objective function

3

Regularization term:
• Maximize the margin
• Imposes a preference over the hypothesis space and pushes for better generalization
• Can be replaced with other regularization terms which impose other preferences

Empirical loss:
• Hinge loss
• Penalizes weight vectors that make mistakes
• Can be replaced with other loss functions which impose other preferences

The hyper-parameter C controls the tradeoff between a large margin and a small hinge loss.

min_{w,b} ½ wᵀw    s.t. ∀i, yi (wᵀxi + b) ≥ 1

min_{w,b} ½ wᵀw + C ∑i ξi    s.t. ∀i, yi (wᵀxi + b) ≥ 1 − ξi,  ξi ≥ 0

min_{w,b} ½ wᵀw + C ∑i max(0, 1 − yi (wᵀxi + b))

LHinge(y, x, w, b) = max(0, 1 − y (wᵀx + b))
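A minimal Python sketch (not from the slides; function names are illustrative) of this objective:

    import numpy as np

    def hinge_loss(w, b, x, y):
        # LHinge(y, x, w, b) = max(0, 1 - y (w^T x + b)) for a single example
        return max(0.0, 1.0 - y * (np.dot(w, x) + b))

    def svm_objective(w, b, X, Y, C):
        # (1/2) w^T w  +  C * sum over examples of the hinge loss
        reg = 0.5 * np.dot(w, w)
        loss = sum(hinge_loss(w, b, x, y) for x, y in zip(X, Y))
        return reg + C * loss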

Page 4

Outline: Training SVM by optimization

1. Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

4


Page 6

Solving the SVM optimization problem

This function is convex in w, b. For convenience, use simplified notation:

w0 ← w
w ← [w0, b]
xi ← [xi, 1]

6

min_{w,b} ½ wᵀw + C ∑i max(0, 1 − yi (wᵀxi + b))

becomes, in the simplified notation,

min_w ½ w0ᵀw0 + C ∑i max(0, 1 − yi wᵀxi)
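A small Python sketch (illustrative) of this notational trick: append a constant-1 feature to every example so that the bias b is folded into w:

    import numpy as np

    def augment(X):
        # Append a constant 1 to every row, so that w^T [x, 1] = w0^T x + b with w = [w0, b]
        ones = np.ones((X.shape[0], 1))
        return np.hstack([X, ones])

    # After augmentation, only w0 (all but the last entry of w) is regularized.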


Page 7

Recall: Convex functions

A function f is convex if for every u, v in the domain, and for every λ ∈ [0,1] we have

f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v)

From a geometric perspective: every tangent plane lies below the function

7

Page 8

Recall: Convex functions

A function f is convex if for every u, v in the domain, and for every λ ∈ [0,1] we have

f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v)

From a geometric perspective, every tangent plane lies below the function:

f(x) ≥ f(u) + ∇f(u)ᵀ(x − u)

8

Page 9

Convex functions

9

Linear functions are convex; the max of convex functions is convex.

Some ways to show that a function is convex:

1. Using the definition of convexity

2. Showing that the second derivative is nonnegative (for one-dimensional functions)

3. Showing that the matrix of second derivatives (the Hessian) is positive semi-definite (for vector-input functions)
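A quick numeric check (illustrative, not from the slides) of the convexity definition for the hinge-style function f(z) = max(0, 1 − z), at one choice of u, v, λ:

    def f(z):
        return max(0.0, 1.0 - z)            # a 1-D hinge-style function

    u, v, lam = -2.0, 3.0, 0.25
    lhs = f(lam * u + (1 - lam) * v)        # f(lambda*u + (1-lambda)*v)
    rhs = lam * f(u) + (1 - lam) * f(v)     # lambda*f(u) + (1-lambda)*f(v)
    assert lhs <= rhs                       # the convexity inequality holds here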

Page 10

Not all functions are convex

10

These are concave. These are neither.

For concave functions the inequality is flipped: f(λu + (1 − λ)v) ≥ λ f(u) + (1 − λ) f(v)

Page 11

Convex functions are convenient

A function f is convex if for every u, v in the domain, and for every λ ∈ [0,1] we have

f(λu + (1 − λ)v) ≤ λ f(u) + (1 − λ) f(v)

In general: a necessary condition for x to be a minimum of the function f is ∇f(x) = 0

For convex functions, this is both necessary and sufficient

11

Page 12

Solving the SVM optimization problem

This function is convex in w

• This is a quadratic optimization problem because the objective is quadratic

• Older methods: used techniques from Quadratic Programming
  – Very slow

• No constraints, can use gradient descent
  – Still very slow!

12

min_w ½ w0ᵀw0 + C ∑i max(0, 1 − yi wᵀxi)

Page 13

Gradient descent

General strategy for minimizing a function J(w)

• Start with an initial guess for w, say w0

• Iterate till convergence:
  – Compute the gradient of J at wt
  – Update wt to get wt+1 by taking a step in the opposite direction of the gradient

13

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction.

We are trying to minimize:

J(w) = ½ w0ᵀw0 + C ∑i max(0, 1 − yi wᵀxi)
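A generic Python sketch (illustrative) of this loop for any differentiable objective with gradient grad_J; the step size r and the toy example are placeholders:

    import numpy as np

    def gradient_descent(grad_J, w_init, r=0.01, n_steps=1000):
        # w_{t+1} = w_t - r * grad J(w_t)
        w = np.array(w_init, dtype=float)
        for _ in range(n_steps):
            w = w - r * grad_J(w)
        return w

    # Toy example: J(w) = ||w||^2 / 2 has gradient w, so the iterates shrink toward 0
    w_min = gradient_descent(lambda w: w, w_init=[3.0, -2.0])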


Page 17

Gradient descent for SVM

1. Initialize w0

2. For t = 0, 1, 2, ….
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
   2. Update w as follows: wt+1 ← wt − r ∇J(wt)

17

r: called the learning rate

We are trying to minimize:

J(w) = ½ w0ᵀw0 + C ∑i max(0, 1 − yi wᵀxi)

Page 18

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
2. Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

18

Page 19

Gradient descent for SVM

1. Initialize w0

2. For t = 0, 1, 2, ….
   1. Compute the gradient of J(w) at wt. Call it ∇J(wt)
   2. Update w as follows: wt+1 ← wt − r ∇J(wt)

19

r: called the learning rate

The gradient of the SVM objective requires summing over the entire training set

Slow, does not really scale

We are trying to minimize:

J(w) = ½ w0ᵀw0 + C ∑i max(0, 1 − yi wᵀxi)

Page 20

Stochastic gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}

1. Initialize w0 = 0 ∈ ℜn

2. For epoch = 1 … T:
   1. Pick a random example (xi, yi) from the training set S
   2. Treat (xi, yi) as a full dataset and take the derivative of the SVM objective at the current wt-1 to be ∇Jt(wt-1)
   3. Update: wt ← wt-1 − γt ∇Jt(wt-1)

3. Return final w

20

J(w) = ½ w0ᵀw0 + C ∑i max(0, 1 − yi wᵀxi)


Page 24

Stochastic gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}

1. Initialize w0 = 0 ∈ ℜn

2. For epoch = 1 … T:
   1. Pick a random example (xi, yi) from the training set S
   2. Repeat (xi, yi) to make a full dataset and take the derivative of the SVM objective at the current wt-1 to be ∇Jt(wt-1)
   3. Update: wt ← wt-1 − γt ∇Jt(wt-1)

3. Return final w

24

Jt(w) = ½ w0ᵀw0 + C·N max(0, 1 − yi wᵀxi)

(N: number of training examples)
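A small Python sketch (illustrative; names are assumptions) of this per-example objective Jt, where x_i is the augmented example [xi, 1] and w = [w0, b]:

    import numpy as np

    def J_t(w, x_i, y_i, C, N):
        # Treat the single example as if it were the whole dataset of size N:
        # Jt(w) = (1/2) w0^T w0 + C * N * max(0, 1 - y_i w^T x_i)
        w0 = w[:-1]                                  # all weights except the bias entry
        margin_loss = max(0.0, 1.0 - y_i * np.dot(w, x_i))
        return 0.5 * np.dot(w0, w0) + C * N * margin_loss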


Page 26

Stochastic gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}

1. Initialize w0 = 0 ∈ ℜn

2. For epoch = 1 … T:
   1. Pick a random example (xi, yi) from the training set S
   2. Repeat (xi, yi) to make a full dataset and take the derivative of the SVM objective at the current wt-1 to be ∇Jt(wt-1)
   3. Update: wt ← wt-1 − γt ∇Jt(wt-1)

3. Return final w

What is the gradient of the hinge loss with respect to w?
(The hinge loss is not a differentiable function!)

26

This algorithm is guaranteed to converge to the minimum of J if γt is small enough.


Page 27

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
3. Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

27

Page 28

Gradient Descent vs SGD

28

Gradient descent

Page 29

Gradient Descent vs SGD

29

Stochastic Gradient descent


Page 46

Gradient Descent vs SGD

46

Stochastic Gradient descent

Many more updates than gradient descent, but each individual update is less computationally expensive

Page 47

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
4. Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

47


Page 49

Hinge loss is not differentiable!

What is the derivative of the hinge loss with respect to w?

49

Jt(w) = ½ w0ᵀw0 + C·N max(0, 1 − yi wᵀxi)

Page 50

Detour: Sub-gradients

Generalization of gradients to non-differentiable functions. (Remember that every tangent lies below the function for convex functions.)

Informally, a sub-tangent at a point is any line that lies below the function at that point. A sub-gradient is the slope of that line.

50

Page 51

Sub-gradients

51 [Example from Boyd]

g1 is a gradient at x1

g2 and g3 are both subgradients at x2

f is differentiable at x1; tangent at this point

Formally, g is a subgradient to f at x if f(z) ≥ f(x) + gᵀ(z − x) for all z
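A small Python check (illustrative) that, for the one-dimensional hinge f(z) = max(0, 1 − z), a slope g between −1 and 0 satisfies this subgradient inequality at the kink x = 1:

    def f(z):
        return max(0.0, 1.0 - z)

    x, g = 1.0, -0.5                        # the kink of the hinge, and one candidate slope
    for z in [-3.0, 0.0, 0.5, 1.0, 2.0, 10.0]:
        assert f(z) >= f(x) + g * (z - x)   # f(z) >= f(x) + g (z - x) at these test points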


Page 54

Sub-gradient of the SVM objective

54

General strategy: First solve the max and compute the gradient for each case

Jt(w) = ½ w0ᵀw0 + C·N max(0, 1 − yi wᵀxi)

∇Jt = [w0; 0]                  if max(0, 1 − yi wᵀxi) = 0
∇Jt = [w0; 0] − C·N yi xi      otherwise
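A direct Python transcription (illustrative) of this case analysis, assuming x_i is the augmented example [xi, 1] and w = [w0, b]:

    import numpy as np

    def subgrad_Jt(w, x_i, y_i, C, N):
        # Sub-gradient of Jt(w) = (1/2) w0^T w0 + C*N*max(0, 1 - y_i w^T x_i)
        w0_part = np.append(w[:-1], 0.0)      # [w0; 0]: the bias entry is not regularized
        if y_i * np.dot(w, x_i) >= 1:         # hinge term is zero (no margin violation)
            return w0_part
        return w0_part - C * N * y_i * x_i    # otherwise include the loss term's contribution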


Page 56

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
5. Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

56

Page 57

Stochastic sub-gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}

1. Initialize w0 = 0 ∈ ℜn

2. For epoch = 1 … T:
   1. For each training example (xi, yi) ∈ S:
      If yi wTxi ≤ 1: w ← (1 − γt) w + γt C yi xi
      else: w ← (1 − γt) w

3. Return w

57

This is the update w ← w − γt ∇Jt

Jt(w) = ½ w0ᵀw0 + C·N max(0, 1 − yi wᵀxi)

∇Jt = [w0; 0]                  if max(0, 1 − yi wᵀxi) = 0
∇Jt = [w0; 0] − C·N yi xi      otherwise


Page 62

Stochastic sub-gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}

1. Initialize w0 = 0 ∈ ℜn

2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (xi, yi) ∈ S:
      If yi wTxi ≤ 1: w ← (1 − γt) [w0; 0] + γt C N yi xi
      else: w0 ← (1 − γt) w0

3. Return w

62

γt: learning rate, many tweaks possible
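A compact Python sketch (illustrative, not the slides' code) of this training loop, written as the sub-gradient step w ← w − γt ∇Jt with ∇Jt as on the earlier slide; C, gamma0, and T are example settings:

    import numpy as np

    def svm_sgd(X, Y, C=1.0, gamma0=0.01, T=10):
        # X: N x n data matrix, Y: labels in {-1, +1}
        N, n = X.shape
        Xa = np.hstack([X, np.ones((N, 1))])        # augment each x with 1, so w = [w0, b]
        w = np.zeros(n + 1)
        t = 0
        rng = np.random.default_rng(0)
        for epoch in range(T):
            for i in rng.permutation(N):            # shuffle the training set each epoch
                gamma = gamma0 / (1 + gamma0 * t)   # one possible decaying learning rate
                w0_part = np.append(w[:-1], 0.0)    # [w0; 0]: the bias is not regularized
                if Y[i] * np.dot(w, Xa[i]) <= 1:    # margin violated: include the loss term
                    grad = w0_part - C * N * Y[i] * Xa[i]
                else:
                    grad = w0_part
                w = w - gamma * grad                # w <- w - gamma_t * grad Jt(w)
                t += 1
        return w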

Page 63

Convergence and learning rates

With enough iterations, it will converge in expectation

Provided the step sizes are “square summable, but not summable”

• Step sizes γt are positive
• Sum of squares of step sizes over t = 1 to ∞ is finite: ∑t γt² < ∞
• Sum of step sizes over t = 1 to ∞ is infinite: ∑t γt = ∞

• Some examples: γt = γ0 / (1 + γ0 t) or γt = γ0 / (1 + t)

63
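A quick numeric illustration (my own example, not from the slides) that a 1/t-style schedule such as γt = γ0/(1 + γ0 t) meets both conditions:

    gamma0 = 0.1
    gammas = [gamma0 / (1 + gamma0 * t) for t in range(1, 100001)]
    print(sum(g * g for g in gammas))   # the sum of squares stays bounded as more terms are added
    print(sum(gammas))                  # this sum keeps growing without bound as more terms are added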

Page 64

Convergence and learning rates

• Number of iterations to get to accuracy within ε

• For strongly convex functions, N examples, d dimensions:
  – Gradient descent: O(N d ln(1/ε))
  – Stochastic gradient descent: O(d/ε)

• More subtleties involved, but SGD is generally preferable when the data size is huge

64

Page 65

Outline: Training SVM by optimization

✓ Review of convex functions and gradient descent
✓ Stochastic gradient descent
✓ Gradient descent vs stochastic gradient descent
✓ Sub-derivatives of the hinge loss
✓ Stochastic sub-gradient descent for SVM
6. Comparison to perceptron

65

Page 66

Stochastic sub-gradient descent for SVM

Given a training set S = {(xi, yi)}, x ∈ ℜn, y ∈ {-1, 1}

1. Initialize w0 = 0 ∈ ℜn

2. For epoch = 1 … T:
   1. Shuffle the training set
   2. For each training example (xi, yi) ∈ S:
      If yi wTxi ≤ 1: w ← (1 − γt) [w0; 0] + γt C N yi xi
      else: w0 ← (1 − γt) w0

3. Return w

66

Compare with the Perceptron update: If yi wTxi ≤ 0, update w ← w + r yi xi
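A side-by-side Python sketch (illustrative) of the two update rules for a single example (x, y), highlighting the different margin condition and the extra shrinkage term in the SVM update:

    import numpy as np

    def perceptron_update(w, x, y, r):
        # Perceptron: update only on a mistake (y w^T x <= 0); no shrinkage of w
        if y * np.dot(w, x) <= 0:
            w = w + r * y * x
        return w

    def svm_sgd_update(w, x, y, gamma, C, N):
        # SVM-SGD: update on any margin violation (y w^T x <= 1); w0 also shrinks
        w0_part = np.append(w[:-1], 0.0)      # [w0; 0], the unregularized bias is last
        if y * np.dot(w, x) <= 1:
            return w - gamma * (w0_part - C * N * y * x)
        return w - gamma * w0_part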

Page 67

Perceptron vs. SVM

• Perceptron: stochastic sub-gradient descent for a different loss
  – No regularization though

• SVM optimizes the hinge loss
  – With regularization

67

Page 68

SVM summary from optimization perspective

• Minimize regularized hinge loss

• Solve using stochastic gradient descent
  – Very fast, run time does not depend on the number of examples
  – Compare with the Perceptron algorithm: similar framework with different objectives!
  – Compare with the Perceptron algorithm: the Perceptron does not maximize the margin width
    • Perceptron variants can force a margin

• Other successful optimization algorithms exist
  – E.g., dual coordinate descent, implemented in liblinear

68

Questions?

