Page 1: Vishy Lec2

Optimization for Machine Learning
Lecture 2: Support Vector Machine Training

S.V.N. (vishy) Vishwanathan

Purdue University
[email protected]

July 11, 2012

Page 2: Vishy Lec2

Linear Support Vector Machines

Outline

1 Linear Support Vector Machines

2 Stochastic Optimization

3 Implicit Updates

4 Dual Problem


Page 3: Vishy Lec2

Linear Support Vector Machines

Binary Classification

[Figure: binary classification data with labels yi = −1 and yi = +1, the separating hyperplane {x | ⟨w, x⟩ + b = 0}, and the margin hyperplanes {x | ⟨w, x⟩ + b = −1} and {x | ⟨w, x⟩ + b = +1}, with points x1 and x2 lying on the two margin hyperplanes.]

From ⟨w, x1⟩ + b = +1 and ⟨w, x2⟩ + b = −1 it follows that ⟨w, x1 − x2⟩ = 2, and hence

⟨w/‖w‖, x1 − x2⟩ = 2/‖w‖

Page 7: Vishy Lec2

Linear Support Vector Machines

Linear Support Vector Machines

Optimization Problem

min over w, b, ξ:   (λ/2)‖w‖² + (1/m) ∑i ξi

s.t.  yi (⟨w, xi⟩ + b) ≥ 1 − ξi   for all i

      ξi ≥ 0

Page 8: Vishy Lec2

Linear Support Vector Machines

Linear Support Vector Machines

Optimization Problem

min over w, b:   (λ/2)‖w‖² + (1/m) ∑i max(0, 1 − yi (⟨w, xi⟩ + b))

Page 9: Vishy Lec2

Linear Support Vector Machines

Linear Support Vector Machines

Optimization Problem

min over w, b:   (λ/2)‖w‖² + (1/m) ∑i max(0, 1 − yi (⟨w, xi⟩ + b))

The first term is the regularizer λΩ(w); the second is the empirical risk Remp(w).
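
To make the two terms concrete, here is a small NumPy sketch that evaluates this objective for a given (w, b); the function name and signature are my own, not from the slides.

```python
import numpy as np

def svm_objective(w, b, X, y, lam):
    """(lam/2)*||w||^2 + (1/m) * sum_i max(0, 1 - y_i * (<w, x_i> + b))."""
    margins = y * (X @ w + b)                # y_i * (<w, x_i> + b) for each row of X
    hinge = np.maximum(0.0, 1.0 - margins)   # per-example hinge loss
    return 0.5 * lam * np.dot(w, w) + hinge.mean()
```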

Page 10: Vishy Lec2

Stochastic Optimization

Outline

1 Linear Support Vector Machines

2 Stochastic Optimization

3 Implicit Updates

4 Dual Problem


Page 11: Vishy Lec2

Stochastic Optimization

Stochastic Optimization Algorithms

Optimization Problem (with no bias)

min over w:   (λ/2)‖w‖² + (1/m) ∑i max(0, 1 − yi ⟨w, xi⟩)

Here Ω(w) = ‖w‖²/2 is the regularizer and Remp(w) is the empirical risk. The problem is:

Unconstrained

Nonsmooth

Convex

Page 12: Vishy Lec2

Stochastic Optimization

Pegasos: Stochastic Gradient Descent

Require: T
w0 ← 0
for t = 1, . . . , T do
  ηt ← 1/(λt)
  if yt ⟨wt, xt⟩ < 1 then
    w′t ← (1 − ηtλ) wt + ηt yt xt
  else
    w′t ← (1 − ηtλ) wt
  end if
  wt+1 ← min{ 1, (1/√λ) / ‖w′t‖ } · w′t
end for
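
A minimal NumPy sketch of the loop above, assuming one example is drawn uniformly at random at each step (the sampling scheme and function name are my own choices):

```python
import numpy as np

def pegasos(X, y, lam, T, seed=0):
    """Pegasos sketch: subgradient step with eta_t = 1/(lam*t),
    then projection onto the ball {w : ||w|| <= 1/sqrt(lam)}."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(m)                    # pick one example at random
        eta = 1.0 / (lam * t)
        if y[i] * X[i].dot(w) < 1:             # hinge term is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
        norm = np.linalg.norm(w)
        if norm > 0:
            w *= min(1.0, (1.0 / np.sqrt(lam)) / norm)   # projection step
    return w
```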

Page 13: Vishy Lec2

Stochastic Optimization

Understanding Pegasos

Objective Function Revisited

J(w) = (λ/2)‖w‖² + (1/m) ∑i max(0, 1 − yi ⟨w, xi⟩)

Subgradient

If yt ⟨w, xt⟩ < 1 then

∂wJt(w) = λw − yt xt

else

∂wJt(w) = λw

Page 14: Vishy Lec2

Stochastic Optimization

Understanding Pegasos

Objective Function Revisited

J(w) ≈ Jt(w) = (λ/2)‖w‖² + max(0, 1 − yt ⟨w, xt⟩)

Subgradient

If yt ⟨w, xt⟩ < 1 then

∂wJt(w) = λw − yt xt

else

∂wJt(w) = λw
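
The case distinction above translates directly into code; a tiny sketch (the function name is mine):

```python
import numpy as np

def pegasos_subgradient(w, x_t, y_t, lam):
    """Subgradient of J_t(w) = (lam/2)*||w||^2 + max(0, 1 - y_t*<w, x_t>)."""
    if y_t * np.dot(w, x_t) < 1:   # margin violated: hinge term contributes
        return lam * w - y_t * x_t
    return lam * w                 # hinge inactive (at the kink we pick the 0 element)
```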

Page 16: Vishy Lec2

Stochastic Optimization

Understanding Pegasos

Explicit Update

If yt ⟨wt, xt⟩ < 1 then

w′t = wt − ηt ∂wJt(wt) = (1 − ληt) wt + ηt yt xt

else

w′t = wt − ηt ∂wJt(wt) = (1 − ληt) wt

Projection

Project w′t onto the set B = { w : ‖w‖ ≤ 1/√λ }
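
The projection has a simple closed form: rescale w whenever it lies outside the ball. A one-line sketch (the function name is mine):

```python
import numpy as np

def project_onto_ball(w, lam):
    """Project w onto B = {w : ||w|| <= 1/sqrt(lam)}."""
    radius = 1.0 / np.sqrt(lam)
    norm = np.linalg.norm(w)
    return w if norm <= radius else (radius / norm) * w
```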

Page 18: Vishy Lec2

Stochastic Optimization

Motivating Stochastic Gradient Descent

How are the Updates Derived?

Minimize the following objective function:

wt+1 = argmin over w of  (1/2)‖w − wt‖² + ηt Jt(w)

This gives us

wt+1 = wt − ηt ∂wJt(wt+1)
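
To see where this comes from (a sketch using the first-order optimality condition for the nonsmooth objective): at the minimizer wt+1 we need 0 ∈ (wt+1 − wt) + ηt ∂wJt(wt+1), which rearranges to wt+1 = wt − ηt ∂wJt(wt+1). Note that wt+1 appears on both sides; the following slides either approximate the subgradient at wt (explicit update) or solve this relation exactly (implicit update).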

Page 20: Vishy Lec2

Stochastic Optimization

Motivating Stochastic Gradient Descent

How are the Updates Derived?

Minimize the following objective function:

wt+1 = argmin over w of  (1/2)‖w − wt‖² + ηt Jt(w)

This gives us (approximating the subgradient at wt rather than at wt+1)

wt+1 ≈ wt − ηt ∂wJt(wt)

Page 21: Vishy Lec2

Implicit Updates

Outline

1 Linear Support Vector Machines

2 Stochastic Optimization

3 Implicit Updates

4 Dual Problem


Page 22: Vishy Lec2

Implicit Updates

Implicit Updates

What if we did not approximate ∂wJt(wt+1)?

wt+1 = wt − ηt∂wJt(wt+1)

Subgradient

∂wJt(w) = λw − γytxt

If yt 〈w , xt〉 < 1 then γ = 1

If yt 〈w , xt〉 = 1 then γ ∈ [0, 1]

If yt 〈w , xt〉 > 1 then γ = 0


Page 24: Vishy Lec2

Implicit Updates

Implicit Updates

What if we did not approximate ∂wJt(wt+1)?

wt+1 = wt − ηtλwt+1 + γηtytxt

Subgradient

∂wJt(w) = λw − γytxt

If yt 〈w , xt〉 < 1 then γ = 1

If yt 〈w , xt〉 = 1 then γ ∈ [0, 1]

If yt 〈w , xt〉 > 1 then γ = 0


Page 25: Vishy Lec2

Implicit Updates

Implicit Updates

What if we did not approximate ∂wJt(wt+1)?

(1 + ηtλ)wt+1 = wt + γηtytxt

Subgradient

∂wJt(w) = λw − γytxt

If yt 〈w , xt〉 < 1 then γ = 1

If yt 〈w , xt〉 = 1 then γ ∈ [0, 1]

If yt 〈w , xt〉 > 1 then γ = 0


Page 26: Vishy Lec2

Implicit Updates

Implicit Updates

What if we did not approximate ∂wJt(wt+1)?

wt+1 = [wt + γ ηt yt xt] / (1 + ηt λ)

Subgradient

∂wJt(w) = λw − γ yt xt

If yt ⟨w, xt⟩ < 1 then γ = 1

If yt ⟨w, xt⟩ = 1 then γ ∈ [0, 1]

If yt ⟨w, xt⟩ > 1 then γ = 0

Page 27: Vishy Lec2

Implicit Updates

Implicit Updates: Case 1

The Implicit Update Condition

wt+1 = [wt + γ ηt yt xt] / (1 + ηt λ)

Case 1

Suppose 1 + ηtλ < yt ⟨wt, xt⟩. Set

wt+1 = wt / (1 + ηt λ)

Verify that yt ⟨wt+1, xt⟩ > 1, which implies γ = 0, so the implicit update condition is satisfied.

Page 28: Vishy Lec2

Implicit Updates

Implicit Updates: Case 2

The Implicit Update Condition

wt+1 = [wt + γ ηt yt xt] / (1 + ηt λ)

Case 2

Suppose yt ⟨wt, xt⟩ < 1 + ηtλ − ηt ⟨xt, xt⟩. Set

wt+1 = [wt + ηt yt xt] / (1 + ηt λ)

Verify that yt ⟨wt+1, xt⟩ < 1, which implies γ = 1, so the implicit update condition is satisfied.

Page 29: Vishy Lec2

Implicit Updates

Implicit Updates: Case 3

The Implicit Update Condition

wt+1 = [wt + γ ηt yt xt] / (1 + ηt λ)

Case 3

Suppose 1 + ηtλ − ηt ⟨xt, xt⟩ ≤ yt ⟨wt, xt⟩ ≤ 1 + ηtλ. Set

γ = (1 + ηtλ − yt ⟨wt, xt⟩) / (ηt ⟨xt, xt⟩)

wt+1 = [wt + γ ηt yt xt] / (1 + ηt λ)

Verify that γ ∈ [0, 1] and yt ⟨wt+1, xt⟩ = 1.

Page 30: Vishy Lec2

Implicit Updates

Implicit Updates: Summary

Summary

wt+1 = [wt + γ ηt yt xt] / (1 + ηt λ)

If 1 + ηtλ < yt ⟨wt, xt⟩ then γ = 0

If 1 + ηtλ − ηt ⟨xt, xt⟩ ≤ yt ⟨wt, xt⟩ ≤ 1 + ηtλ then γ = (1 + ηtλ − yt ⟨wt, xt⟩) / (ηt ⟨xt, xt⟩)

If yt ⟨wt, xt⟩ < 1 + ηtλ − ηt ⟨xt, xt⟩ then γ = 1

Page 31: Vishy Lec2

Implicit Updates

Implicit Updates: Summary

Summary

wt+1 = [wt + γ ηt yt xt] / (1 + ηt λ)

γ = min( 1, max( 0, (1 + ηtλ − yt ⟨wt, xt⟩) / (ηt ⟨xt, xt⟩) ) )
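
A minimal NumPy sketch of this closed-form implicit update (the function name is mine):

```python
import numpy as np

def implicit_update(w, x_t, y_t, eta, lam):
    """w_new = (w + gamma*eta*y_t*x_t) / (1 + eta*lam),
    with gamma = clip((1 + eta*lam - y_t*<w, x_t>) / (eta*<x_t, x_t>), 0, 1)."""
    gamma = (1 + eta * lam - y_t * np.dot(w, x_t)) / (eta * np.dot(x_t, x_t))
    gamma = min(1.0, max(0.0, gamma))    # cases 1-3 collapse into a single clip
    return (w + gamma * eta * y_t * x_t) / (1 + eta * lam)
```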

Page 32: Vishy Lec2

Dual Problem

Outline

1 Linear Support Vector Machines

2 Stochastic Optimization

3 Implicit Updates

4 Dual Problem


Page 33: Vishy Lec2

Dual Problem

Deriving the Dual

Lagrangian

Recall the primal problem without bias

min over w, ξ:   (λ/2)‖w‖² + (1/m) ∑i ξi

s.t.  yi ⟨w, xi⟩ ≥ 1 − ξi   for all i

      ξi ≥ 0

Introduce non-negative dual variables α and β:

L(w, ξ, α, β) = (λ/2)‖w‖² + (1/m) ∑i ξi − ∑i αi (yi ⟨w, xi⟩ − 1 + ξi) − ∑i βi ξi

Page 35: Vishy Lec2

Dual Problem

Deriving the Dual

Take Gradients and Set to Zero

Write the gradients

∇w L(w, ξ, α, β) = λw − ∑i αi yi xi = 0

∇ξi L(w, ξ, α, β) = 1/m − βi − αi = 0

Conclude that

w = (1/λ) ∑i αi yi xi

0 ≤ αi ≤ 1/m   (since βi = 1/m − αi ≥ 0)

Page 36: Vishy Lec2

Dual Problem

Deriving the Dual

Plug back into Lagrangian

Plug w = (1/λ) ∑i αi yi xi and βi + αi = 1/m into the Lagrangian:

max over α:   −D(α) := −(1/2λ) ∑i,j αi αj yi yj ⟨xi, xj⟩ + ∑i αi

s.t.  0 ≤ αi ≤ 1/m

Page 38: Vishy Lec2

Dual Problem

Deriving the Dual

Plug back into Lagrangian

Plug w = (1/λ) ∑i αi yi xi and βi + αi = 1/m into the Lagrangian:

min over α:   D(α) := (1/2λ) ∑i,j αi αj yi yj ⟨xi, xj⟩ − ∑i αi

s.t.  0 ≤ αi ≤ 1/m

Quadratic objective

Linear (box) constraints
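
Since w = (1/λ) ∑i αi yi xi, the quadratic form in D(α) is just a squared norm, which makes the dual cheap to evaluate. A small NumPy sketch (the function name is mine):

```python
import numpy as np

def dual_objective(alpha, X, y, lam):
    """D(alpha) = (1/(2*lam)) * sum_ij a_i a_j y_i y_j <x_i, x_j> - sum_i a_i."""
    v = X.T @ (alpha * y)                 # v = sum_i alpha_i y_i x_i, so w = v / lam
    return v.dot(v) / (2.0 * lam) - alpha.sum()
```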

Page 39: Vishy Lec2

Dual Problem

Coordinate Descent in the Dual

One dimensional function

D(αt) = ((αt)²/2λ) ⟨xt, xt⟩ + (1/λ) ∑i αt αi yi yt ⟨xi, xt⟩ − αt + const.

s.t.  0 ≤ αt ≤ 1/m

Take Gradients and set to Zero

∇D(αt) = (αt/λ) ⟨xt, xt⟩ + (1/λ) ∑i αi yi yt ⟨xi, xt⟩ − 1 = 0

Page 40: Vishy Lec2

Dual Problem

Coordinate Descent in the Dual

One dimensional function

D(αt) = ((αt)²/2λ) ⟨xt, xt⟩ + (1/λ) ∑i αt αi yi yt ⟨xi, xt⟩ − αt + const.

s.t.  0 ≤ αt ≤ 1/m

Take Gradients and set to Zero

∇D(αt) = (αt/λ) ⟨xt, xt⟩ + (1/λ) yt ⟨wt, xt⟩ − 1 = 0,   where wt := ∑i yi αi xi

Page 41: Vishy Lec2

Dual Problem

Coordinate Descent in the Dual

One dimensional function

D(αt) = ((αt)²/2λ) ⟨xt, xt⟩ + (1/λ) ∑i αt αi yi yt ⟨xi, xt⟩ − αt + const.

s.t.  0 ≤ αt ≤ 1/m

Take Gradients and set to Zero

αt = min( max( 0, (λ − yt ⟨wt, xt⟩) / ⟨xt, xt⟩ ), 1/m )
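
A minimal sketch of dual coordinate descent in this spirit. Here w is maintained as (1/λ) ∑i αi yi xi including the coordinate being updated, so the new αt is the old value plus a correction; this is equivalent to the slide's formula, in which wt excludes coordinate t. The function name and the cyclic sweep order are my own choices.

```python
import numpy as np

def dual_coordinate_descent(X, y, lam, n_sweeps=10):
    """min_a (1/(2*lam)) * sum_ij a_i a_j y_i y_j <x_i, x_j> - sum_i a_i,
    subject to 0 <= a_i <= 1/m, optimized one coordinate at a time."""
    m, d = X.shape
    C = 1.0 / m                            # box upper bound
    alpha = np.zeros(m)
    w = np.zeros(d)                        # w = (1/lam) * sum_i alpha_i y_i x_i
    for _ in range(n_sweeps):
        for t in range(m):
            grad = y[t] * X[t].dot(w) - 1.0        # dD/d(alpha_t)
            curv = X[t].dot(X[t]) / lam            # curvature of the 1-D subproblem
            if curv == 0.0:
                continue                           # x_t = 0: nothing to update
            new_at = float(np.clip(alpha[t] - grad / curv, 0.0, C))
            w += (new_at - alpha[t]) * y[t] * X[t] / lam
            alpha[t] = new_at
    return w, alpha
```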

Page 42: Vishy Lec2

Dual Problem

Contrast with Implicit Updates

Coordinate Descent in the Dual

wt = (1/λ) ∑i αi yi xi

αt = min( 1/m, max( 0, (λ − yt ⟨wt, xt⟩) / ⟨xt, xt⟩ ) )

Implicit Updates

wt+1 = [wt + γ ηt yt xt] / (1 + ηt λ)

γ = min( 1, max( 0, (1 + ηtλ − yt ⟨wt, xt⟩) / (ηt ⟨xt, xt⟩) ) )

Page 43: Vishy Lec2

Scaling Things Up

Outline

1 Linear Support Vector Machines

2 Stochastic Optimization

3 Implicit Updates

4 Dual Problem


Page 44: Vishy Lec2

Scaling Things Up

What if Data Does not Fit in Memory?

Idea 1: Block Minimization [Yu et al., KDD 2010]

Split data into blocks B1, B2 . . . such that Bj fits in memory

Compress and store each block separately

Load one block of data at a time and optimize only those αi ’s

Idea 2: Selective Block Minimization [Chang and Roth, KDD 2011]

Split data into blocks B1, B2 . . . such that Bj fits in memory

Compress and store each block separately

Load one block of data at a time and optimize only those αi ’s

Retain informative samples from each block in main memory


Page 45: Vishy Lec2

Scaling Things Up

What are Informative Samples?


Page 46: Vishy Lec2

Scaling Things Up

Some Observations

SBM and BM are wasteful

Both split data into blocks and compress the blocks

This requires reading the entire data at least once (expensive)

Both pause optimization while a block is loaded into memory

Hardware 101

Disk I/O is slower than CPU (sometimes by a factor of 100)

Random access on HDD is terrible

sequential access is reasonably fast (factor of 10)

Multi-core processors are becoming commonplace

How can we exploit this?


Page 47: Vishy Lec2

Scaling Things Up

Dual Cached Loops [Matsushima, Vishwanathan, Smola]

[Diagram: a Reader thread streams Data from the HDD into a Working Set held in RAM, while a Trainer thread iterates over the Working Set and updates the Weight Vector, also held in RAM.]

Page 48: Vishy Lec2

Scaling Things Up

Underlying Philosophy

Iterate over the data in main memory while streaming data from disk. Primarily evict those examples from main memory that are “uninformative”.

Page 49: Vishy Lec2

Scaling Things Up

Reader

for k = 1, . . . , max_iter do
  for i = 1, . . . , n do
    if |A| = Ω then
      randomly select i′ ∈ A
      A ← A \ {i′}
      delete yi′, Qi′i′, xi′ from RAM
    end if
    read yi, xi from Disk
    calculate Qii = ⟨xi, xi⟩
    store yi, Qii, xi in RAM
    A ← A ∪ {i}
  end for
  if stopping criterion is met then
    exit
  end if
end for

Page 50: Vishy Lec2

Scaling Things Up

Trainer

α ← 0, w ← 0, ε ← 9, εnew ← 0, β ← 0.9
while stopping criterion is not met do
  for t = 1, . . . , n do
    if |A| > 0.9 × Ω then ε ← β ε
    randomly select i ∈ A and read yi, Qii, xi from RAM
    compute ∇iD := yi ⟨w, xi⟩ − 1
    if (αi = 0 and ∇iD > ε) or (αi = C and ∇iD < −ε) then
      A ← A \ {i} and delete yi, Qii, xi from RAM
      continue
    end if
    αi_new ← median(0, C, αi − ∇iD / Qii);   w ← w + (αi_new − αi) yi xi;   αi ← αi_new
    εnew ← max(εnew, |∇iD|)
  end for
  update stopping criterion
  ε ← εnew
end while
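
The eviction test inside the trainer is the part that decides which examples are “uninformative”; pulled out as a tiny helper for clarity (the function name is mine):

```python
def should_evict(alpha_i, grad_i, C, eps):
    """Evict example i when its dual variable is pinned at a bound and the
    gradient pushes it further into that bound, so it is unlikely to change soon."""
    return (alpha_i == 0.0 and grad_i > eps) or (alpha_i == C and grad_i < -eps)
```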

Page 51: Vishy Lec2

Scaling Things Up

Experiments

dataset      n        d        s(%)    n+ : n−   Datasize
ocr          3.5 M    1156     100     0.96      45.28 GB
dna          50 M     800      25      3e−3      63.04 GB
webspam-t    0.35 M   16.61 M  0.022   1.54      20.03 GB
kddb         20.01 M  29.89 M  1e−4    6.18      4.75 GB

Page 52: Vishy Lec2

Scaling Things Up

Does Active Eviction Work?

[Figure: relative function value difference (log scale) vs. wall clock time (sec) for a linear SVM on webspam-t with C = 1.0, comparing Random and Active eviction.]

Page 53: Vishy Lec2

Scaling Things Up

Comparison with Block Minimization

[Figure: relative function value difference (log scale) vs. wall clock time (sec) on ocr with C = 1.0, comparing StreamSVM, SBM, and BM.]

Page 54: Vishy Lec2

Scaling Things Up

Comparison with Block Minimization

[Figure: relative function value difference (log scale) vs. wall clock time (sec) on webspam-t with C = 1.0, comparing StreamSVM, SBM, and BM.]

Page 55: Vishy Lec2

Scaling Things Up

Comparison with Block Minimization

[Figure: relative function value difference (log scale) vs. wall clock time (sec) on kddb with C = 1.0, comparing StreamSVM, SBM, and BM.]

Page 56: Vishy Lec2

Scaling Things Up

Comparison with Block Minimization

[Figure: relative objective function value (log scale) vs. wall clock time (sec) on dna with C = 1.0, comparing StreamSVM, SBM, and BM.]

Page 57: Vishy Lec2

Scaling Things Up

Expanding Features

[Figure: relative function value difference (log scale) vs. wall clock time (sec) on dna with expanded features, C = 1.0, comparing runs with 16 GB and 32 GB of RAM.]

Page 58: Vishy Lec2

Bringing in the Bias

Outline

1 Linear Support Vector Machines

2 Stochastic Optimization

3 Implicit Updates

4 Dual Problem


Page 59: Vishy Lec2

Bringing in the Bias

Let us Bring Back the Bias

Lagrangian

Recall the primal problem

min over w, b, ξ:   (λ/2)‖w‖² + (1/m) ∑i ξi

s.t.  yi (⟨w, xi⟩ + b) ≥ 1 − ξi   for all i

      ξi ≥ 0

Introduce non-negative dual variables α and β:

L(w, b, ξ, α, β) = (λ/2)‖w‖² + (1/m) ∑i ξi − ∑i αi (yi (⟨w, xi⟩ + b) − 1 + ξi) − ∑i βi ξi

Page 61: Vishy Lec2

Bringing in the Bias

Let us Bring Back the Bias

Take Gradients and Set to Zero

Write the gradients

∇w L(w, b, ξ, α, β) = λw − ∑i αi yi xi = 0

∇b L(w, b, ξ, α, β) = −∑i αi yi = 0

∇ξi L(w, b, ξ, α, β) = 1/m − βi − αi = 0

Conclude that

w = (1/λ) ∑i αi yi xi

∑i αi yi = 0   and   0 ≤ αi ≤ 1/m

Page 62: Vishy Lec2

Bringing in the Bias

Let us Bring Back the Bias

Plug back into Lagrangian

Plug w = (1/λ) ∑i αi yi xi and βi + αi = 1/m into the Lagrangian:

max over α:   −D(α) := −(1/2λ) ∑i,j αi αj yi yj ⟨xi, xj⟩ + ∑i αi

s.t.  ∑i αi yi = 0

      0 ≤ αi ≤ 1/m

Page 63: Vishy Lec2

Bringing in the Bias

Let us Bring Back the Bias

Plug back into Lagrangian

Plug w = (1/λ) ∑i αi yi xi and βi + αi = 1/m into the Lagrangian:

min over α:   D(α) := (1/2λ) ∑i,j αi αj yi yj ⟨xi, xj⟩ − ∑i αi

s.t.  ∑i αi yi = 0

      0 ≤ αi ≤ 1/m

Page 64: Vishy Lec2

Bringing in the Bias

Coordinate Descent in the Dual

One Dimensional Function

Cannot pick one coordinate so pick two!

Call the two coordinates t1 and t2

D(ηt1, ηt2) = ((ηt1)²/2λ) ⟨xt1, xt1⟩ + ((ηt2)²/2λ) ⟨xt2, xt2⟩
            + (ηt1/λ) ∑i αi ⟨xi, xt1⟩ + (ηt2/λ) ∑i αi ⟨xi, xt2⟩
            + (ηt1 ηt2 / λ) ⟨xt1, xt2⟩ − ηt1 − ηt2 + const.

s.t.  yt1 ηt1 + yt2 ηt2 = 0

      0 ≤ αt1 + ηt1 ≤ 1/m

      0 ≤ αt2 + ηt2 ≤ 1/m

Page 65: Vishy Lec2

Bringing in the Bias

Coordinate Descent in the Dual

One Dimensional Function

Cannot pick one coordinate so pick two!

Call the two coordinates t1 and t2

From the equality constraint yt1 ηt1 + yt2 ηt2 = 0:

ηt1 = −(yt2/yt1) ηt2 =: η,   so   ηt2 = −(yt1/yt2) η

Page 66: Vishy Lec2

Bringing in the Bias

Coordinate Descent in the Dual

One Dimensional Function

Cannot pick one coordinate so pick two!

Call the two coordinates t1 and t2

D(ηt1, ηt2) = (η²/2λ) ⟨xt1, xt1⟩ + (η²/2λ) ⟨xt2, xt2⟩
            + (η/λ) ∑i αi ⟨xi, xt1⟩ − (η yt1 / λ yt2) ∑i αi ⟨xi, xt2⟩
            − (η² yt1 / λ yt2) ⟨xt1, xt2⟩ − η + (yt1/yt2) η + const.

s.t.  0 ≤ αt1 + η ≤ 1/m

      0 ≤ αt2 − (yt1/yt2) η ≤ 1/m

Page 67: Vishy Lec2

Bringing in the Bias

Software

LibSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/

LibLinear: http://www.csie.ntu.edu.tw/~cjlin/liblinear/


Page 68: Vishy Lec2

Bringing in the Bias

References (Incomplete)

Implicit Updates

Kivinen and Warmuth, Exponentiated Gradient Versus Gradient Descent for Linear Predictors, Information and Computation, 1997.

Kivinen, Warmuth, and Hassibi, The p-norm generalization of the LMS algorithm for adaptive filtering, IEEE Transactions on Signal Processing, 2006.

Cheng, Vishwanathan, Schuurmans, Wang, and Caelli, Implicit Online Learning With Kernels, NIPS 2006.

Hsieh, Chang, Lin, Keerthi, and Sundararajan, A Dual Coordinate Descent Method for Large-scale Linear SVM, ICML 2008.

SMO

Platt, Fast Training of Support Vector Machines using Sequential Minimal Optimization, Advances in Kernel Methods — Support Vector Learning, 1999.

Dual Cached Loops

Matsushima, Vishwanathan, and Smola, Linear Support Vector Machines via Dual Cached Loops, KDD 2012.

Page 69: Vishy Lec2

Bringing in the Bias

References (Incomplete)

Slides are loosely based on lecture notes from

http://learning.stat.purdue.edu/wiki/courses/sp2011/598a/lectures

http://www.ee.ucla.edu/~vandenbe/shortcourses.html

