Stochastic Optimization for Big Data Analytics
Tianbao Yang‡, Rong Jin†, Shenghuo Zhu‡
Tutorial @ SDM 2014, Philadelphia, Pennsylvania
‡NEC Laboratories America, †Michigan State University
April 26, 2014
The updated slides are available here:
http://www.cse.msu.edu/~yangtia1/sdm14-tutorial.pdf
Thanks.
Some Claims
No:
- This tutorial is not an exhaustive literature survey
- The algorithms are not necessarily the best for small data
- The theories may not carry over to non-convex optimization

Yes:
- State-of-the-art stochastic optimization for SVM, logistic regression, least squares regression, and LASSO
- A generic distributed library
Outline
1. Machine Learning and STochastic OPtimization (STOP): Introduction; Motivation; Warm-up
2. STOP Algorithms for Big Data Classification and Regression: Classification and Regression; Algorithms
3. General Strategies for Stochastic Optimization: Stochastic Gradient Descent and Accelerated Variants; Parallel and Distributed Optimization; Other Effective Strategies
4. Implementations and A Distributed Library
1. Machine Learning and STochastic OPtimization (STOP)
Introduction

- Machine learning problems and stochastic optimization: classification and regression in different forms
- Motivation to employ STochastic OPtimization (STOP)
- Basic convex optimization knowledge
Three Steps for Machine Learning and Pattern Recognition

[Figure: the three steps, Data, Model, and Optimization, illustrated with a plot of distance to the optimal objective vs. iterations for rates 0.5^T, 1/T^2, and 1/T.]
Learning as Optimization

Least Squares Regression:

  \min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n} \sum_{i=1}^n (y_i - w^\top x_i)^2}_{\text{Empirical Loss}} + \underbrace{\frac{\lambda}{2} \|w\|_2^2}_{\text{Regularization}}

- x_i \in \mathbb{R}^d: d-dimensional feature vector
- y_i \in \mathbb{R}: target variable
- w \in \mathbb{R}^d: model parameters
- n: number of data points
Learning as Optimization

Classification Problems:

  \min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2} \|w\|_2^2

- y_i \in \{+1, -1\}: label
- Loss function \ell(z) with z = y w^\top x:
  1. SVMs: (squared) hinge loss \ell(z) = \max(0, 1 - z)^p, where p = 1, 2
  2. Logistic Regression: \ell(z) = \log(1 + \exp(-z))
Learning as Optimization

Feature Selection:

  \min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda \|w\|_1

- \ell_1 regularization: \|w\|_1 = \sum_{i=1}^d |w_i|
- \lambda controls the sparsity level
Learning as Optimization

Feature Selection using Elastic Net:

  \min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda \left( \|w\|_1 + \gamma \|w\|_2^2 \right)

- The elastic net regularizer is more robust than the \ell_1 regularizer alone
Learning as Optimization

Multi-class / Multi-task Learning:

  \min_{W} \frac{1}{n} \sum_{i=1}^n \ell(W x_i, y_i) + \lambda R(W), \qquad W \in \mathbb{R}^{K \times d}

- R(W) = \|W\|_F^2 = \sum_{k=1}^K \sum_{j=1}^d W_{kj}^2: Frobenius norm
- R(W) = \|W\|_* = \sum_i \sigma_i: nuclear norm (sum of singular values)
- R(W) = \|W\|_{1,\infty} = \sum_{j=1}^d \|W_{:j}\|_\infty: \ell_{1,\infty} mixed norm
- Extensions to matrix cases are possible
Big Data Challenge

Huge amounts of data are generated every day:
- Facebook users upload 3 million photos
- Google receives 3 billion queries
- YouTube users upload over 1,700 hours of video
- The global internet population is 2.1 billion people
- 247 billion emails are sent

(Source: http://www.visualnews.com/2012/06/19/how-much-data-created-every-minute/)
Why is Learning from Big Data Hard?

  \min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \lambda R(w)}_{\text{empirical loss + regularizer}}

Too many data points
- Issue: cannot afford to go through the data set many times
- Solution: Stochastic Optimization

High-dimensional data
- Issue: cannot afford second-order optimization (Newton's method)
- Solution: first-order methods (i.e., gradient-based methods)

Data distributed over many machines
- Issue: expensive (if not impossible) to move data
- Solution: Distributed Optimization
Stochastic Optimization

  \min_{w \in \mathcal{W}} F(w) = \mathbb{E}_\xi[f(w; \xi)]

- f(w; \xi) is convex, hence F(w) is convex
- \xi is a random variable

Methods:
1. Sample Average Approximation: draw \xi_1, \ldots, \xi_n and solve \min_{w \in \mathcal{W}} \frac{1}{n} \sum_{i=1}^n f(w; \xi_i)
2. Stochastic Approximation: iterate with stochastic gradients \nabla f(w; \xi)
Machine Learning is Stochastic Optimization

Goal:

  \min_{w \in \mathcal{W}} \mathbb{E}_{\xi=(x,y)}[\mathrm{Loss}(w^\top x, y)]

Empirical Regularized Loss Minimization:

  \min_{w \in \mathcal{W}} \underbrace{\frac{1}{n} \sum_{i=1}^n}_{\mathbb{E}_{\xi=i}} \left[ \mathrm{Loss}(w^\top x_i, y_i) + \lambda R(w) \right]
The Simplest Method for Stochastic Optimization

Stochastic Optimization:

  \min_{w \in \mathbb{R}^d} F(w) = \mathbb{E}_\xi[f(w; \xi)]

Stochastic Gradient Descent (Nemirovski & Yudin, 1978):

  w_t = w_{t-1} - \gamma_t \nabla f(w_{t-1}; \xi_t)

- \gamma_t is the step size
- \nabla f(w; \xi_t) is a stochastic gradient: \mathbb{E}_{\xi_t}[\nabla f(w; \xi_t)] = \nabla F(w)
Stochastic Gradient in Machine Learning

  F(w) = \frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2

- Let i_t \in \{1, \ldots, n\} be uniformly randomly sampled
- Key equation: \mathbb{E}_{i_t}[\nabla \ell(w^\top x_{i_t}, y_{i_t}) + \lambda w] = \nabla F(w)
- Update: w_t = (1 - \gamma_t \lambda) w_{t-1} - \gamma_t \nabla \ell(w_{t-1}^\top x_{i_t}, y_{i_t})
- Computation is very cheap: O(d) per iteration, compared with O(nd) for the full gradient (see the sketch below)
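To make the update concrete, here is a minimal numpy sketch of this O(d)-per-iteration step for the regularized least squares objective; the function name, step-size constant c, and iteration budget T are illustrative choices, not part of the tutorial.

```python
import numpy as np

def sgd_ridge(X, y, lam, T=10000, c=1.0, seed=0):
    """Basic SGD sketch for (1/n) sum_i (y_i - w'x_i)^2 + (lam/2)||w||_2^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                      # uniformly sampled example
        gamma = c / np.sqrt(t)                   # step size for the general convex case
        grad = 2.0 * (w @ X[i] - y[i]) * X[i]    # gradient of (w'x_i - y_i)^2, O(d) work
        w = (1.0 - gamma * lam) * w - gamma * grad
    return w
```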
Warm-up: Vector, Norm, Inner Product, Dual Norm

- Bold letters x \in \mathbb{R}^d (data vector) and w \in \mathbb{R}^d (model parameters) denote d-dimensional vectors; y_i denotes the response variable of the i-th example; x, y \in \mathcal{X}, a finite-dimensional normed space
- Norm \|w\|: \mathbb{R}^d \to \mathbb{R}_+, e.g.
  1. \ell_2 norm \|w\|_2 = \sqrt{\sum_{i=1}^d w_i^2}
  2. \ell_1 norm \|w\|_1 = \sum_{i=1}^d |w_i|
  3. \ell_\infty norm \|w\|_\infty = \max_i |w_i|
- Inner product \langle x, w \rangle = x^\top w = \sum_{i=1}^d x_i w_i
- Dual norm \|w\|_* = \max_{\|x\| \le 1} x^\top w; dual pairs:
  1. \|x\|_2 \Longleftrightarrow \|w\|_2
  2. \|x\|_1 \Longleftrightarrow \|w\|_\infty
Warm-up: Convex Optimization

  \min_{x \in \mathcal{X}} f(x)

- \mathcal{X} is a convex domain
- f(x) is a convex function

[Figure: the smooth convex function x^2 with a gradient (tangent) line.]
Warm-up: Convex Function

Characterization of a convex function:

  f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y), \qquad \forall x, y \in \mathcal{X}, \; \alpha \in [0, 1]

  f(x) \ge f(y) + \nabla f(y)^\top (x - y), \qquad \forall x, y \in \mathcal{X}

[Figures: chords lying above the function, and tangent lines lying below it.]
Warm-up: Convergence Measure

Most optimization algorithms are iterative:

  x_{t+1} = x_t + \Delta x_t

- Iteration Complexity: the number of iterations T(\epsilon) needed to have f(x_T) - \min_{x \in \mathcal{X}} f(x) \le \epsilon (with \epsilon \ll 1)
- Convergence Rate: after T iterations, how good is the solution: f(x_T) - \min_{x \in \mathcal{X}} f(x) \le \epsilon(T)

[Figure: objective vs. iterations, with the iteration count T needed to reach accuracy \epsilon marked.]

Total Runtime = Per-iteration Cost × Iteration Complexity
More on Convergence Measure

Big-O notation makes the dependence on T or \epsilon explicit:

  Rate class    Convergence Rate            Iteration Complexity
  linear        O(\mu^T), \mu < 1           O(\log(1/\epsilon))
  sub-linear    O(1/T^\alpha), \alpha > 0   O(1/\epsilon^{1/\alpha})

Why are we interested in bounds?
[Figure: distance to optimum vs. iterations T for the rates 0.5^T, 1/T, and 1/T^{0.5}; reaching the same accuracy takes on the order of seconds, minutes, and hours, respectively.]

Theoretically, we order the rates as

  O(\mu^T) \prec O(1/T^2) \prec O(1/T) \prec O(1/\sqrt{T})
Factors that Affect Iteration Complexity

- Property of the function: e.g., smoothness
- Domain \mathcal{X}: size and geometry
- Size of the problem: dimension and number of data points
Warm-up: Non-smooth Function

Lipschitz continuous: e.g., the absolute loss f(x) = |x|

  |f(x) - f(y)| \le G \|x - y\|_2, with G the Lipschitz constant

Subgradient: f(x) \ge f(y) + \partial f(y)^\top (x - y)

[Figure: |x| is non-smooth at 0 and admits many subgradients there.]
Warm-up: Smooth Convex Function

Smooth: e.g., the logistic loss f(x) = \log(1 + \exp(-x))

  \|\nabla f(x) - \nabla f(y)\|_2 \le \beta \|x - y\|_2, with \beta > 0 the smoothness constant

The second-order derivative is upper bounded: \|\nabla^2 f(x)\|_2 \le \beta

[Figure: the logistic loss lies above its tangent line f(y) + f'(y)(x - y) and below a quadratic function.]
Warm-up: Strongly Convex Function

Strongly convex: e.g., the squared Euclidean norm f(x) = \frac{1}{2} \|x\|_2^2

  \|\nabla f(x) - \nabla f(y)\|_2 \ge \lambda \|x - y\|_2, with \lambda > 0 the strong convexity constant

The second-order derivative is lower bounded: \|\nabla^2 f(x)\|_2 \ge \lambda

[Figure: x^2 with its gradient line.]
Warm-up: Smooth and Strongly Convex Function

Smooth and strongly convex: e.g., the quadratic function f(z) = \frac{1}{2}(z - 1)^2

  \lambda \|x - y\|_2 \le \|\nabla f(x) - \nabla f(y)\|_2 \le \beta \|x - y\|_2, \qquad \beta \ge \lambda > 0
2. STOP Algorithms for Big Data Classification and Regression
STOP Algorithms for Big Data Classification and Regression

- Stochastic Gradient Descent (Pegasos) for SVM
- Stochastic Average Gradient (SAG) for logistic regression and least squares regression
- Stochastic Dual Coordinate Ascent (SDCA)
- Stochastic optimization for Lasso
Classification and Regression

  \min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2

Classification (smooth losses marked):
1. SVM (hinge loss): \ell(w^\top x, y) = \max(0, 1 - y w^\top x)
2. Smooth SVM (squared hinge loss; smooth): \ell(w^\top x, y) = \max(0, 1 - y w^\top x)^2
3. Equivalent to the C-SVM formulation with C = 1/(n\lambda)
4. Logistic Regression (logistic loss; smooth): \ell(w^\top x, y) = \log(1 + \exp(-y w^\top x))

Regression:
1. Least Squares Regression (square loss; smooth): \ell(w^\top x, y) = (w^\top x - y)^2
2. Least Absolute Deviation (absolute loss): \ell(w^\top x, y) = |w^\top x - y|
Timeline of Stochastic Optimization in Machine Learning

  \min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2

- 90's: basic SGD
- 2003: general convex, O(1/\sqrt{T}) [Zinkevich03; Kivinen04]
- 2007: strongly convex, O(1/T): Pegasos [Hazan07; Shalev-Shwartz07]
- 2012: smooth & strongly convex, O(\mu^T): SAG, SDCA [Roux12; Shalev-Shwartz13; Zhang13]

Base update: w_t = (1 - \gamma_t \lambda) w_{t-1} - \gamma_t \nabla \ell(w_{t-1}^\top x_{i_t}, y_{i_t})
Basic SGD

Leveraging only convexity:

  \min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2

- Update: w_t = (1 - \gamma_t \lambda) w_{t-1} - \gamma_t \nabla \ell(w_{t-1}^\top x_{i_t}, y_{i_t}), with \gamma_t = c/\sqrt{t}
- Output solution: \bar{w}_T = \frac{1}{T} \sum_{t=1}^T w_t \;\Rightarrow\; O(1/\sqrt{T}) convergence
Pegasos (Shalev-Shwartz et al., 2007)

Leveraging the strongly convex regularizer:

  \min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n} \sum_{i=1}^n \max(0, 1 - y_i w^\top x_i) + \underbrace{\frac{\lambda}{2} \|w\|_2^2}_{\text{strongly convex}}

- Update: w_t = (1 - \gamma_t \lambda) w_{t-1} - \gamma_t \partial \ell(w_{t-1}^\top x_{i_t}, y_{i_t}), with \gamma_t = \frac{1}{\lambda t}
- Output solution: \bar{w}_T = \frac{1}{T} \sum_{t=1}^T w_t \;\Rightarrow\; O(1/(\lambda T)) convergence
Pegasos: Subgradient Computation

For a non-smooth loss, e.g., the hinge loss (SVM) or the absolute loss (Least Absolute Deviation), use a subgradient; for the hinge loss:

  \partial \ell(w^\top x_{i_t}, y_{i_t}) = \begin{cases} -y_{i_t} x_{i_t}, & 1 - y_{i_t} w^\top x_{i_t} > 0 \\ 0, & \text{otherwise} \end{cases}

- Computation cost per iteration: O(d) (see the sketch below)
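A minimal numpy sketch of the Pegasos update with the hinge-loss subgradient above, using \gamma_t = 1/(\lambda t) and iterate averaging; the optional projection step of the original algorithm is omitted, and the names and defaults are illustrative.

```python
import numpy as np

def pegasos_svm(X, y, lam, T=100000, seed=0):
    """Pegasos-style SGD sketch for the SVM objective; labels y in {-1,+1}.
    Returns the averaged iterate, as on the slide."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        gamma = 1.0 / (lam * t)
        active = y[i] * (w @ X[i]) < 1.0   # evaluate the subgradient at w_{t-1}
        w = (1.0 - gamma * lam) * w        # shrinkage from the l2 regularizer
        if active:
            w += gamma * y[i] * X[i]       # subtract gamma * (-y_i x_i)
        w_bar += (w - w_bar) / t           # running average of the iterates
    return w_bar
```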
SAG (Roux et al., 2012)

Leveraging smoothness of the loss:

  \min_{w \in \mathbb{R}^d} F(w) = \underbrace{\frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2} \|w\|_2^2}_{\text{smooth and strongly convex}}

Estimated average gradient:

  G_t = \frac{1}{n} \sum_{i=1}^n g_i^t, \qquad g_i^t = \begin{cases} \partial \ell(w_t^\top x_i, y_i), & \text{if } i = i_t \text{ is selected} \\ g_i^{t-1}, & \text{otherwise} \end{cases}

- Update: w_t = (1 - \gamma_t \lambda) w_{t-1} - \gamma_t G_t, with \gamma_t = c/\beta
- Output solution: w_T \;\Rightarrow\; O(\mu^T) linear convergence
SAG: Efficient Update of the Averaged Gradient

For logistic regression, least squares regression, and smooth SVM, each individual gradient has the form

  g_i = \partial \ell(w^\top x_i, y_i) = \alpha_i x_i

so the averaged gradient can be maintained incrementally:

  G_t = \frac{1}{n} \sum_{i=1}^n g_i^t = \frac{1}{n} \sum_{i=1}^n \alpha_i^t x_i = G_{t-1} + \frac{1}{n} (\alpha_{i_t}^t - \alpha_{i_t}^{t-1}) x_{i_t}

- Computation cost per iteration: O(d) (see the sketch below)
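A sketch of SAG for logistic regression that exploits the \alpha_i x_i structure above, so the average gradient is maintained in O(d) per iteration; the step size derived from a crude smoothness estimate is an illustrative assumption, not a tuned choice.

```python
import numpy as np

def sag_logistic(X, y, lam, T=100000, seed=0):
    """SAG sketch for l2-regularized logistic regression; labels y in {-1,+1}.
    Stores one scalar alpha_i per example (grad_i = alpha_i * x_i): memory O(d + n)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = 0.25 * np.max(np.sum(X * X, axis=1)) + lam  # crude smoothness constant
    step = 1.0 / beta
    w = np.zeros(d)
    alpha = np.zeros(n)   # stored loss-gradient scalars
    G = np.zeros(d)       # running average (1/n) sum_i alpha_i x_i
    for _ in range(T):
        i = rng.integers(n)
        a_new = -y[i] / (1.0 + np.exp(y[i] * (w @ X[i])))  # log-loss gradient is a_new * x_i
        G += (a_new - alpha[i]) * X[i] / n                 # O(d) update of the average
        alpha[i] = a_new
        w = (1.0 - step * lam) * w - step * G
    return w
```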
SDCA (Shalev-Shwartz & Zhang, 2013)

Stochastic Dual Coordinate Ascent (as implemented in liblinear (Hsieh et al., 2008)): O(1/\epsilon) iterations for non-smooth losses and O(\log(1/\epsilon)) for smooth losses.

Dual problem:

  \max_{\alpha \in Q} D(\alpha) = \frac{1}{n} \sum_{i=1}^n -\phi_i^*(-\alpha_i) - \frac{\lambda}{2} \left\| \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i x_i \right\|_2^2

Primal solution: w_t = \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i^t x_i

Dual coordinate update:

  \Delta\alpha_i = \arg\max_{\alpha_i^t + \Delta\alpha_i \in Q} -\phi_i^*(-\alpha_i^t - \Delta\alpha_i) - \frac{\lambda n}{2} \left\| w_t + \frac{1}{\lambda n} \Delta\alpha_i x_i \right\|_2^2
SDCA Updates

- Closed-form solutions exist for the hinge loss, squared hinge loss, absolute loss, and square loss (Shalev-Shwartz & Zhang, 2013); e.g., for the square loss:

    \Delta\alpha_i^t = \frac{y_i - w_t^\top x_i - \alpha_i^{t-1}}{1 + \|x_i\|_2^2 / (\lambda n)}

- Computation cost per iteration: O(d) (see the sketch below)
- The logistic loss requires an approximate solution (Shalev-Shwartz & Zhang, 2013)
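A numpy sketch of SDCA with the closed-form square-loss update above, keeping the primal solution w_t = \frac{1}{\lambda n} \sum_i \alpha_i x_i in sync incrementally; the epoch count and naming are illustrative.

```python
import numpy as np

def sdca_ridge(X, y, lam, epochs=20, seed=0):
    """SDCA sketch for regularized least squares using the closed-form
    coordinate update for the square loss shown above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                       # w = (1/(lam*n)) * sum_i alpha_i x_i
    sqnorms = np.sum(X * X, axis=1)
    for _ in range(epochs * n):
        i = rng.integers(n)
        delta = (y[i] - w @ X[i] - alpha[i]) / (1.0 + sqnorms[i] / (lam * n))
        alpha[i] += delta
        w += (delta / (lam * n)) * X[i]   # O(d) primal update per coordinate step
    return w
```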
Summary

                        Pegasos        SAG                      SDCA
  strongly convex       yes            yes                      yes
  smooth                no             yes                      yes/no
  loss                  hinge, abs.    logistic, square, sqh    all of the left
  memory cost           O(d)           O(d + n)                 O(d + n)
  computation cost*     O(d)           O(d)                     O(d)
  iteration complexity  O(1/(\lambda\epsilon))  O(\log(1/\epsilon))  O(1/(\lambda\epsilon)) / O(\log(1/\epsilon))
  parameter             no             step size                no
  averaging             yes            no need                  no need

Table: sqh = squared hinge loss; abs. = absolute loss; * = per-iteration cost. For SDCA the iteration complexity is O(1/(\lambda\epsilon)) for non-smooth losses and O(\log(1/\epsilon)) for smooth losses.
What about \ell_1 Regularization?

  \min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \underbrace{\sigma \sum_{g=1}^K \|w_g\|_1}_{\text{Lasso or Group Lasso}}

Issue: the regularizer is not strongly convex.
Adding \ell_2 Regularization (Shalev-Shwartz & Zhang, 2012)

Issue: the \ell_1 regularizer is not strongly convex. Solution: add an \ell_2^2 term:

  \min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n \ell(w^\top x_i, y_i) + \sigma \sum_{g=1}^K \|w_g\|_1 + \frac{\lambda}{2} \|w\|_2^2

Setting \lambda = \Theta(\epsilon), SDCA yields O(1/\epsilon^2) iterations for general convex losses and O(1/\epsilon) for smooth losses.
Other Algorithms for \ell_1 Regularization

Proximal Stochastic Gradient Descent (Langford et al., 2009; Shalev-Shwartz & Tewari, 2009; Duchi & Singer, 2009):
- Sparsity can be achieved at each iteration (see the sketch below)
- O(1/\epsilon^2) iteration complexity

Stochastic Coordinate Descent (Shalev-Shwartz & Tewari, 2009; Bradley et al., 2011; Richtarik & Takac, 2013):
- Needs to compute the full gradient
- O(n/\epsilon) iteration complexity for smooth losses
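A sketch of proximal stochastic gradient descent for the Lasso objective with the square loss: each step takes a stochastic gradient step on the loss and then applies soft-thresholding, the proximal operator of \sigma\|\cdot\|_1, so iterates can be exactly sparse. Function names and constants are illustrative.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (component-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_sgd_lasso(X, y, sigma, T=100000, c=1.0, seed=0):
    """Proximal SGD sketch for (1/n) sum_i (w'x_i - y_i)^2 + sigma * ||w||_1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        gamma = c / np.sqrt(t)
        g = 2.0 * (w @ X[i] - y[i]) * X[i]                # stochastic gradient of the loss
        w = soft_threshold(w - gamma * g, gamma * sigma)  # prox step keeps w sparse
    return w
```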
3. General Strategies for Stochastic Optimization
General Strategies for Stochastic Optimization

- SGD and its variants for different objectives
- Parallel and distributed optimization
- Other effective strategies
Stochastic Gradient Descent

  \min_{x \in \mathcal{X}} f(x)

- Stochastic gradient \nabla f(x; \xi), where \xi is a random variable
- Basic SGD update: x_t \leftarrow \Pi_{\mathcal{X}}[x_{t-1} - \gamma_t \nabla f(x_{t-1}; \xi_t)]
- Projection: \Pi_{\mathcal{X}}[\hat{x}] = \arg\min_{x \in \mathcal{X}} \|x - \hat{x}\|_2^2 (a small sketch follows)
- Issue: how do we determine the learning rate (step size) \gamma_t?
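A generic projected-SGD sketch, specialized to an \ell_2-ball domain where the projection has a closed form; `stoc_grad` is a user-supplied stochastic gradient oracle, and all names and defaults are illustrative.

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    nrm = np.linalg.norm(v)
    return v if nrm <= radius else (radius / nrm) * v

def projected_sgd(stoc_grad, d, radius, T=1000, c=1.0, seed=0):
    """Projected SGD sketch: x_t = Pi_X[x_{t-1} - gamma_t * g_t]."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for t in range(1, T + 1):
        g = stoc_grad(x, rng)  # stochastic gradient at x_{t-1}
        x = project_l2_ball(x - (c / np.sqrt(t)) * g, radius)
    return x
```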
Convergence of the Final Solution

Iterative updates:

  x_t = \Pi_{\mathcal{X}}[x_{t-1} - \gamma_t \Delta_t]

To obtain convergence, intuitively we need \gamma_t \Delta_t \to 0.
Convergence of (S)GD

Iterative updates: x_t = \Pi_{\mathcal{X}}[x_{t-1} - \gamma_t \Delta_t]

- GD: x_t = x_{t-1} - \gamma_t \nabla f(x_{t-1}); here \nabla f(x_{t-1}) \to 0 as x_t \to x_*, so a constant step size can work
- SGD: x_t = x_{t-1} - \gamma_t \nabla f(x_{t-1}; \xi_t); the stochastic gradient does not vanish at x_*, so we need \gamma_t \nabla f(x_{t-1}; \xi_t) \to 0 via a decaying step size
Three Schemes of Step Size

- General convex optimization: \gamma_t \propto 1/\sqrt{t} \to 0
- Strongly convex optimization: \gamma_t \propto 1/t \to 0
- Smooth optimization: constant \gamma_t = c, with \Delta_t \to 0
SGD for General Convex Functions

- Step size: \gamma_t = c/\sqrt{t}, where c usually needs to be tuned
- Convergence rate of the final solution x_T (Shamir & Zhang, 2013):

    \mathbb{E}[f(x_T) - f(x_*)] \le O\left( \frac{D G \log T}{\sqrt{T}} \right)

  where \|x - y\|_2 \le D and \|\partial f(x; \xi)\|_2 \le G for all x, y \in \mathcal{X}
- Close to the optimal rate O(DG/\sqrt{T})
SGD for Strongly Convex Functions

- f(x) is \lambda-strongly convex
- Step size: \gamma_t = \frac{1}{\lambda t}
- Convergence rate of x_T (Shamir & Zhang, 2013):

    \mathbb{E}[f(x_T) - f(x_*)] \le O\left( \frac{G^2 \log T}{\lambda T} \right)

- Close to the optimal rate O(G^2/(\lambda T))
SGD for Smooth Convex Functions

- Smooth functions are a sub-class of general convex functions
- SGD with \gamma_t \propto 1/\sqrt{t} achieves O(\log T/\sqrt{T})
- Gradient Descent with a constant step size \gamma_t = c achieves O(1/T) (Nesterov, 2004)
- In general, SGD cannot bridge this gap (Lan, 2012); a special case that can is the finite sum

    f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x), \quad \nabla f(x_t) = \frac{1}{n} \sum_{i=1}^n \nabla f_i(x_t), \quad \nabla f(x_t; \xi_t) = \nabla f_{i_t}(x_t)

- The constant step size of GD is possible because \nabla f(x_*) = 0
Accelerated SGD for Smooth Functions (Johnson & Zhang, 2013; Mahdavi et al., 2013)

Iterate over stages s = 1, 2, \ldots:
  Iterate t = 1, \ldots, m:
    x_t^s = x_{t-1}^s - \gamma \underbrace{\left( \nabla f_{i_t}(x_{t-1}^s) - \nabla f_{i_t}(\tilde{x}^{s-1}) + \nabla f(\tilde{x}^{s-1}) \right)}_{\Delta_t = \text{StoGrad} - \text{StoGrad} + \text{Grad}}
  Update the snapshot: \tilde{x}^s = x_m^s or \tilde{x}^s = \frac{1}{m} \sum_{t=1}^m x_t^s, with m = O(n)

- A constant step size works because \Delta_t \to 0: if \tilde{x}^{s-1} \to x_*, then \nabla f(\tilde{x}^{s-1}) \to 0 and \nabla f_{i_t}(x_{t-1}^s) - \nabla f_{i_t}(\tilde{x}^{s-1}) \to 0
- Smooth functions: O(1/\epsilon) iteration complexity
- Smooth and strongly convex functions: O(\log(1/\epsilon))
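A numpy sketch of this variance-reduced scheme (in the style of SVRG, Johnson & Zhang, 2013) for \ell_2-regularized logistic regression, with inner-loop length m = n and a constant step size; the step size and stage count are illustrative assumptions.

```python
import numpy as np

def svrg_logistic(X, y, lam, stages=20, gamma=0.1, seed=0):
    """SVRG-style sketch: one full gradient per stage at the snapshot,
    then m = n cheap corrected stochastic steps with a constant step size."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    def grad_i(w, i):  # gradient of log(1 + exp(-y_i w'x_i)) + (lam/2)||w||^2
        return -y[i] * X[i] / (1.0 + np.exp(y[i] * (w @ X[i]))) + lam * w

    w = np.zeros(d)
    for _ in range(stages):
        w_snap = w.copy()
        full = sum(grad_i(w_snap, i) for i in range(n)) / n  # full gradient at snapshot
        for _ in range(n):
            i = rng.integers(n)
            w -= gamma * (grad_i(w, i) - grad_i(w_snap, i) + full)  # Delta_t above
    return w
```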
Averaged Stochastic Gradient Descent

Averaging usually speeds up convergence:

  \bar{x}_t = \left( 1 - \frac{1+\eta}{t+\eta} \right) \bar{x}_{t-1} + \frac{1+\eta}{t+\eta} x_t, \qquad \eta \ge 0

- \eta = 0 gives simple averaging: \bar{x}_T = (x_1 + \cdots + x_T)/T
- General convex optimization (Nemirovski et al., 2009): \eta = 0 \Rightarrow O(1/\sqrt{T}) vs. O(\log T/\sqrt{T}) for the last iterate
- Strongly convex optimization (Shamir & Zhang, 2013; Zhu, 2013): \eta > 0 \Rightarrow O(1/(\lambda T)) vs. O(\log T/(\lambda T))
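A small sketch of the averaging rule above; since \rho_1 = 1, the recursion starts correctly from \bar{x}_1 = x_1, and \eta = 0 recovers the plain average of the iterates. The generator-style interface is an illustrative choice; it works on floats or numpy arrays alike.

```python
def eta_average(iterates, eta=3.0):
    """Running average xbar_t = (1 - rho_t) * xbar_{t-1} + rho_t * x_t
    with rho_t = (1 + eta) / (t + eta); eta = 0 gives (x_1 + ... + x_T)/T."""
    xbar = 0.0
    for t, x in enumerate(iterates, start=1):
        rho = (1.0 + eta) / (t + eta)  # rho_1 = 1, so xbar_1 = x_1
        xbar = (1.0 - rho) * xbar + rho * x
    return xbar
```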
Parallel and Distributed Optimization

- Parallel (shared memory) vs. distributed (no shared memory); both aim to speed up convergence
- Data are often distributed over many machines; moving them to a single machine suffers from low network bandwidth and limited disk or memory
- Benefits come from clusters of machines, multi-core machines, and GPUs
A Simple Solution: Average Runs

Run the optimizer independently on k partitions of the data (on a multi-core machine or a cluster of machines) and average the resulting models:

  \bar{w} = \frac{1}{k} \sum_{i=1}^k w_i

Issue: \bar{w} is not the optimal solution.
Parallel SGD: Average Gradients

Mini-batch SGD (multi-core or cluster): workers compute stochastic gradients that are averaged at a synchronization step (see the sketch below).

- Good: reduced variance, faster convergence
- Bad: synchronization is expensive
- Solutions: asynchronous updates (HogWild!); fewer synchronizations (DisDCA)
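A single-process sketch of mini-batch SGD for the regularized square loss: the batch-averaged gradient has lower variance, and in a parallel setting each worker would compute part of the batch before a synchronized update. The batch size and constants are illustrative.

```python
import numpy as np

def minibatch_sgd(X, y, lam, batch=64, T=2000, c=1.0, seed=0):
    """Mini-batch SGD sketch; the batch-averaged gradient mimics what
    workers would jointly compute before each synchronized update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        idx = rng.integers(n, size=batch)
        g = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / batch  # averaged square-loss gradient
        gamma = c / np.sqrt(t)
        w = (1.0 - gamma * lam) * w - gamma * g
    return w
```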
Lock-free Parallel SGD: HOGWILD! (Niu et al., 2011)

  \min_x \sum_{e \in E} f_e(x_e)

- Multi-core with shared-memory access; each e is a small subset of [d]
- Examples: sparse SVM, matrix completion, graph cuts
- Robust 1/T convergence rate for strongly convex objectives
Distributed SDCA (Yang, 2013)

Running the local SDCA update unchanged on each of K machines,

  \Delta\alpha_i = \arg\max_{\Delta\alpha_i} -\phi_i^*(-\alpha_i^t - \Delta\alpha_i) - \frac{\lambda n}{2} \left\| w_t + \frac{1}{\lambda n} \Delta\alpha_i x_i \right\|_2^2

does not guarantee convergence: data are correlated across machines. The more conservative update

  \Delta\alpha_i = \arg\max_{\Delta\alpha_i} -\phi_i^*(-\alpha_i^t - \Delta\alpha_i) - \frac{\lambda n}{2K} \left\| w_t + \frac{K}{\lambda n} \Delta\alpha_i x_i \right\|_2^2

guarantees convergence, but gives only limited speed-up.
DisDCA: Trading Computation for Communication

Between communications, each machine performs several local dual updates, maintaining a local copy u_j^t of the primal vector:

  \Delta\alpha_{i_j} = \arg\max_{\Delta\alpha_{i_j}} -\phi_{i_j}^*(-\alpha_{i_j}^t - \Delta\alpha_{i_j}) - \frac{\lambda n}{2K} \left\| u_j^t + \frac{K}{\lambda n} \Delta\alpha_{i_j} x_{i_j} \right\|_2^2

  u_{j+1}^t = u_j^t + \frac{K}{\lambda n} \Delta\alpha_{i_j} x_{i_j}
DisDCA: Experiments

- Increasing the number of local updates m can lead to nearly linear speed-up
- Increasing the number of machines K leads to parallel speed-up

[Figure (a): optimization error \epsilon(t, m) vs. t on 1 million synthetic regression examples, for m = 10, 100, 1000.]
[Figure (b): time to reach \epsilon(T) \in \{0.1, 0.01, 0.001, 0.0001\} on synthetic classification data with up to n = 10^9 examples (400 GB, 50 x 2 processors); DisDCA takes on the order of 1 to 12 minutes, compared against Liblinear.]

The distributed library: Birds
Factors that Affect Iteration Complexity

- Property of the function: smoothness
- Size of the problem: dimension and number of data points
- Domain \mathcal{X}: size and geometry

Reducing the problem size: screening for Lasso and Support Vector Machines
Screening for Lasso

  Lasso: \min_{w \in \mathbb{R}^d} \frac{1}{2} \|y - X w\|_2^2 + \lambda \|w\|_1

- y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n, X = (x_1, \cdots, x_d) \in \mathbb{R}^{n \times d}
- Let I_0 = \{i : w_i^* = 0\} and I = [d] \setminus I_0; screening rules identify (a subset of) I_0 before solving, leaving the smaller problem

    \min_{w_I} \frac{1}{2} \|y - X_I w_I\|_2^2 + \lambda \|w_I\|_1

- Rules: SAFE (Ghaoui et al., 2010), DPP (Wang et al., 2012)
Screening for Support Vector Machines

  Dual SVM: \max_{\alpha \in [0,1]^n} \frac{1}{n} \sum_{i=1}^n \alpha_i - \frac{\lambda}{2} \left\| \frac{1}{\lambda n} \sum_{i=1}^n \alpha_i y_i x_i \right\|_2^2

- y_i w_*^\top x_i < 1 \Rightarrow \alpha_i^* = 1, and y_i w_*^\top x_i > 1 \Rightarrow \alpha_i^* = 0
- Ball test (Ogawa et al., 2014; Wang et al., 2013)
Factors that Affect Iteration Complexity

- Property of the function: smoothness
- Size of the problem: dimension and number of data points
- Domain \mathcal{X}: size and geometry

Adapting to the geometry of the domain: Stochastic Mirror Descent
Reducing G and D

The iteration complexity of SGD grows with both D and G, where \|x - x_*\|_2 \le D and \|\nabla f(x; \xi)\|_2 \le G.

Interpretation of (stochastic) gradient descent:

  x_t = \Pi_{\mathcal{X}}[x_{t-1} - \gamma_t \nabla f(x_{t-1}; \xi_t)]
      = \arg\min_{x \in \mathcal{X}} \underbrace{f(x_{t-1}) + (x - x_{t-1})^\top \nabla f(x_{t-1}; \xi_t)}_{\text{linear approximation}} + \underbrace{\frac{1}{2\gamma_t} \|x - x_{t-1}\|_2^2}_{\text{distance to the last solution}}

  (x - x_{t-1})^\top \nabla f(x_{t-1}; \xi_t) \le \|x - x_{t-1}\|_2 \|\nabla f(x_{t-1}; \xi_t)\|_2 \le D G
Stochastic Mirror Descent (Nemirovski et al., 2009)

  x_t = \arg\min_{x \in \mathcal{X}} \underbrace{f(x_{t-1}) + (x - x_{t-1})^\top \nabla f(x_{t-1}; \xi_t)}_{\text{linear approximation}} + \frac{1}{\gamma_t} \underbrace{B(x, x_{t-1})}_{\text{Bregman divergence}}

- B(x, x_t) = \omega(x) - \omega(x_t) - \nabla\omega(x_t)^\top (x - x_t)
- B(x, x_t) \ge \frac{\alpha}{2} \|x - x_t\|^2: \omega is strongly convex w.r.t. a general norm
- (x - x_{t-1})^\top \nabla f(x_{t-1}; \xi_t) \le \|x - x_{t-1}\| \, \|\nabla f(x_{t-1}; \xi_t)\|_*
- \mathbb{E}[f(\bar{x}_T)] - f(x_*) \le O(DG/\sqrt{T}), where B(x, x_*) \le D and \|\nabla f(x; \xi)\|_* \le G (a sketch on the simplex follows)
General Strategies for Stochastic Optimization Other Effective Strategies
Reducing Projections

Factors that affect Iteration Complexity:
Property of function: smoothness of function
Size of problem: dimension and number of data points
Domain X : size and geometry

Each SGD step requires a projection:

x_t = Π_X[x_{t−1} − γ_t∇f(x_{t−1}; ξ_t)]

A complex domain X makes the projection expensive, e.g., projecting onto the PSD cone requires a full eigendecomposition.
Reducing Projections
Linear Optimization over the Domain: Frank-Wolfe Algorithm (Jaggi, 2013; Lacoste-Julien et al., 2013; Hazan, 2008)

s_t = arg min_{s ∈ X} ⟨s, ∇f(x_{t−1})⟩ : linear optimization instead of projection
x_t = (1 − η_t)x_{t−1} + η_t s_t : a convex combination, so x_t ∈ X automatically
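A minimal sketch over the ℓ1 ball {x : ‖x‖₁ ≤ r} (our example domain), where the linear subproblem has a closed-form vertex solution, so no projection is ever computed.

```python
import numpy as np

def frank_wolfe_l1(grad_fn, r, d, T):
    """Projection-free Frank-Wolfe over the l1 ball {x : ||x||_1 <= r}.

    grad_fn(x) returns the (possibly stochastic) gradient at x. The linear
    subproblem min_{||s||_1 <= r} <s, g> is solved in closed form by a
    signed, scaled coordinate vector.
    """
    x = np.zeros(d)
    for t in range(1, T + 1):
        g = grad_fn(x)
        j = int(np.argmax(np.abs(g)))   # coordinate with largest |gradient|
        s = np.zeros(d)
        s[j] = -r * np.sign(g[j])       # minimizing vertex of the l1 ball
        eta = 2.0 / (t + 2)             # standard Frank-Wolfe step size
        x = (1 - eta) * x + eta * s     # convex combination stays feasible
    return x
```

The same template applies whenever linear optimization over X is cheap, e.g., on the spectrahedron the subproblem needs only a leading eigenvector rather than a full eigendecomposition.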
Reducing Projections
Few Projections: SGD with only one or O(log T) projections (Mahdavi et al., 2012; Yang & Zhang, 2013)

Write the domain as a functional constraint: x ∈ X ⇐⇒ g(x) ≤ 0

Run SGD on the min-max problem

min_x max_{λ≥0} f(x) + λ g(x)

where f(x) is the objective and λ g(x) penalizes violation of the constraint, so intermediate iterates need not be projected.

Final projection: x̄_T = Π_X[(1/T) ∑_{t=1}^T x_t]
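A schematic sketch of the primal-dual idea (the precise updates, step sizes, and dual domain in Mahdavi et al. (2012) differ; all names below are ours):

```python
import numpy as np

def sgd_one_projection(grad_f, g, grad_g, project, x0, T, gamma):
    """SGD on min_x max_{lam >= 0} f(x) + lam * g(x), projecting once.

    grad_f(x): stochastic gradient of the objective f
    g, grad_g: constraint defining X = {x : g(x) <= 0}, and its gradient
    project(x): projection onto X, invoked only once after the loop
    """
    x, lam = x0.copy(), 0.0
    running_sum = np.zeros_like(x0)
    for _ in range(T):
        x = x - gamma * (grad_f(x) + lam * grad_g(x))  # primal descent step
        lam = max(0.0, lam + gamma * g(x))             # dual ascent on violation
        running_sum += x
    return project(running_sum / T)                    # the only projection
```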
How about kernel methods?
Linearization + STOP for linear methods:
the Nystrom method (Drineas & Mahoney, 2005)
Random Fourier Features (Rahimi & Recht, 2007)

Comparison of the two (Yang et al., 2012):
the Nystrom method: data-dependent sampling; better approximation error under a large eigen-gap and a power-law eigenvalue distribution
Random Fourier Features: data-independent sampling
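As an illustration, a minimal sketch of Random Fourier Features for the Gaussian (RBF) kernel k(x, z) = exp(−‖x − z‖²/(2σ²)) (our parameterization); after the mapping, any of the linear STOP methods above applies unchanged.

```python
import numpy as np

def random_fourier_features(X, m, sigma, seed=0):
    """Map X (n x d) to m random features approximating the RBF kernel.

    E[phi(x) . phi(z)] = exp(-||x - z||^2 / (2 sigma^2)), so a linear
    model on phi(X) approximates a kernel model (Rahimi & Recht, 2007).
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, m))  # spectral-density samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)       # random phases
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)
```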
Implementations and A Distributed Library
Efficient implementations and a practical library:
Efficient averaging
Gradient sparsification
Distributed (parallel) optimization library
Efficient Averaging
Update rules (iterate and running average):

x_t = (1 − γ_t λ) x_{t−1} − γ_t g_t
x̄_t = (1 − α_t) x̄_{t−1} + α_t x_t

When the gradient g_t is sparse, updating x_t and x̄_t directly still costs O(d) per step because of the dense rescalings. Instead, maintain scaled copies y_t, ȳ_t with (x_t, x̄_t)^⊤ = S_t (y_t, ȳ_t)^⊤, where

S_t = ( 1 − λγ_t            0
        α_t(1 − λγ_t)    1 − α_t ) S_{t−1},   S_1 = I

y_t = y_{t−1} − [S_t^{−1}]_{11} γ_t g_t
ȳ_t = ȳ_{t−1} − ([S_t^{−1}]_{21} + [S_t^{−1}]_{22} α_t) γ_t g_t

Recover the solutions once at the end:

x_T = [S_T]_{11} y_T
x̄_T = [S_T]_{21} y_T + [S_T]_{22} ȳ_T

Each update now touches only the nonzero coordinates of g_t.
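A minimal sketch of this bookkeeping (our function names; the step-size and averaging schedules are passed as callables). The entries of S shrink geometrically over time, so a production implementation would rescale periodically for numerical stability.

```python
import numpy as np

def sparse_avg_sgd(grads, d, lam, gamma, alpha):
    """Regularized SGD with a running average, touching only the nonzero
    gradient coordinates at each step (sketch of the slide's recursion).

    grads: iterable of sparse stochastic gradients as (indices, values)
    gamma(t), alpha(t): step-size and averaging-weight schedules
    """
    y, y_bar = np.zeros(d), np.zeros(d)
    S = np.eye(2)                                   # S_0 = I; S_t = A_t S_{t-1}
    for t, (idx, val) in enumerate(grads, start=1):
        g, a = gamma(t), alpha(t)
        A = np.array([[1.0 - lam * g, 0.0],
                      [a * (1.0 - lam * g), 1.0 - a]])
        S = A @ S
        Sinv = np.linalg.inv(S)                     # 2x2 inverse, O(1) cost
        y[idx] -= Sinv[0, 0] * g * val              # sparse update of y_t
        y_bar[idx] -= (Sinv[1, 0] + Sinv[1, 1] * a) * g * val
    x = S[0, 0] * y                                 # recover x_T
    x_bar = S[1, 0] * y + S[1, 1] * y_bar           # recover the average
    return x, x_bar
```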
Gradient sparsification
Sparsification by importance sampling: draw R_ti ~ unif(0, 1) and fix per-coordinate thresholds ḡ_i > 0; then

g̃_ti = g_ti · 1[|g_ti| ≥ ḡ_i] + ḡ_i sign(g_ti) · 1[ḡ_i R_ti ≤ |g_ti| < ḡ_i]

Unbiased sample: E[g̃_t] = g_t. The scheme trades an increase in variance for cheaper computation.

Especially useful for Logistic Regression
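A minimal NumPy sketch of the sampler (vectorized; names are ours). Coordinates at least as large as their threshold are kept exactly; a smaller coordinate survives with probability |g_ti|/ḡ_i at magnitude ḡ_i, which is exactly what makes the estimate unbiased.

```python
import numpy as np

def sparsify_gradient(g, g_bar, rng=None):
    """Unbiased gradient sparsification by importance sampling.

    g: dense gradient; g_bar: per-coordinate thresholds (> 0).
    E[result] = g, but most small coordinates are zeroed out.
    """
    if rng is None:
        rng = np.random.default_rng()
    r = rng.uniform(size=g.shape)
    big = np.abs(g) >= g_bar                         # keep large entries exactly
    kept_small = (~big) & (g_bar * r <= np.abs(g))   # keep w.p. |g_i| / g_bar_i
    return np.where(big, g, np.where(kept_small, g_bar * np.sign(g), 0.0))
```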
Distributed Optimization Library: Birds
The Birds library implements distributed stochastic dual coordinate ascent (DisDCA) for classification and regression with broad support. For technical details see:

"Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent." Tianbao Yang. NIPS 2013.
"Analysis of Distributed Stochastic Dual Coordinate Ascent." Tianbao Yang et al. Tech report, arXiv, 2013.

The code is distributed under the GNU General Public License (see license.txt for details).
Distributed Optimization Library: Birds
What problems does it solve? Classification and regression.

Loss:
1 Hinge loss and squared hinge loss (SVM)
2 Logistic loss (Logistic Regression)
3 Least squares loss (Ridge Regression)

Regularizer:
1 ℓ2 norm: SVM, Logistic Regression, Ridge Regression
2 ℓ1 norm: Lasso, SVM, LR with ℓ1 norm

Multi-class: one-vs-all
Distributed Optimization Library: Birds
What data does it support? Dense and sparse data, in txt or binary format.

What environment does it support? Prerequisites: the Boost.MPI and Boost.Serialization libraries. Tested on a cluster of Linux machines (up to hundreds of processors).
Thank You!
References
Bradley, Joseph K., Kyrola, Aapo, Bickson, Danny, and Guestrin, Carlos. Parallel coordinate descent for l1-regularized loss minimization. CoRR, 2011.
Drineas, Petros and Mahoney, Michael W. On the Nystrom method for approximating a Gram matrix for improved kernel-based learning. J. Mach. Learn. Res., 6:2153–2175, 2005.
Duchi, John and Singer, Yoram. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res., 10:2899–2934, 2009.
Ghaoui, Laurent El, Viallon, Vivian, and Rabbani, Tarek. Safe feature elimination in sparse supervised learning. CoRR, abs/1009.3515, 2010.
Hazan, Elad. Sparse approximate solutions to semidefinite programs. In LATIN, pp. 306–316, 2008.
Hsieh, Cho-Jui, Chang, Kai-Wei, Lin, Chih-Jen, Keerthi, S. Sathiya, and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pp. 408–415, 2008.
Jaggi, Martin. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, ICML '13, 2013.
Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.
Lacoste-Julien, Simon, Jaggi, Martin, Schmidt, Mark W., and Pletscher, Patrick. Block-coordinate Frank-Wolfe optimization for structural SVMs. In ICML, volume 28, pp. 53–61, 2013.
Lan, Guanghui. An optimal method for stochastic composite optimization. Math. Program., 133(1-2):365–397, 2012.
Langford, John, Li, Lihong, and Zhang, Tong. Sparse online learning via truncated gradient. J. Mach. Learn. Res., 10:777–801, June 2009.
Mahdavi, Mehrdad, Yang, Tianbao, Jin, Rong, Zhu, Shenghuo, and Yi, Jinfeng. Stochastic gradient descent with only one projection. In NIPS, pp. 503–511, 2012.
Mahdavi, Mehrdad, Zhang, Lijun, and Jin, Rong. Mixed optimization for smooth functions. In NIPS, pp. 674–682, 2013.
Nemirovski, A. and Yudin, D. On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions. Soviet Math. Dokl., 19:341–362, 1978.
Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM J. on Optimization, pp. 1574–1609, 2009.
Nesterov, Yurii. Introductory Lectures on Convex Optimization: A Basic Course (Applied Optimization). Springer Netherlands, 2004.
Niu, Feng, Recht, Benjamin, Re, Christopher, and Wright, Stephen J. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. CoRR, 2011.
Ogawa, Kohei, Suzuki, Yoshiki, Suzumura, Shinya, and Takeuchi, Ichiro. Safe sample screening for support vector machines. CoRR, 2014.
Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In NIPS, 2007.
Richtarik, Peter and Takac, Martin. Distributed coordinate descent method for learning with big data. CoRR, abs/1310.2059, 2013.
Roux, Nicolas Le, Schmidt, Mark, and Bach, Francis. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. CoRR, 2012.
Shalev-Shwartz, Shai and Tewari, Ambuj. Stochastic methods for l1-regularized loss minimization. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pp. 929–936, 2009.
Shalev-Shwartz, Shai and Zhang, Tong. Proximal stochastic dual coordinate ascent. CoRR, abs/1211.2717, 2012.
Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14:567–599, 2013.
Shalev-Shwartz, Shai, Singer, Yoram, and Srebro, Nathan. Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the 24th International Conference on Machine Learning, pp. 807–814, 2007.
Shalev-Shwartz, Shai, Singer, Yoram, Srebro, Nathan, and Cotter, Andrew. Pegasos: Primal estimated sub-gradient solver for SVM. Math. Program., 127(1):3–30, 2011.
Shamir, Ohad and Zhang, Tong. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, pp. 71–79, 2013.
Wang, Jie, Lin, Binbin, Gong, Pinghua, Wonka, Peter, and Ye, Jieping. Lasso screening rules via dual polytope projection. CoRR, abs/1211.3966, 2012.
Wang, Jie, Wonka, Peter, and Ye, Jieping. Scaling SVM and least absolute deviations via exact data reduction. CoRR, abs/1310.7048, 2013. URL http://dblp.uni-trier.de/db/journals/corr/corr1310.html#WangWY13.
Yang, Tianbao. Trading computation for communication: Distributed stochastic dual coordinate ascent. In NIPS, 2013.
Yang, Tianbao and Zhang, Lijun. Efficient stochastic gradient descent for strongly convex optimization. CoRR, abs/1304.5504, 2013.
Yang, Tianbao, Li, Yu-Feng, Mahdavi, Mehrdad, Jin, Rong, and Zhou, Zhi-Hua. Nystrom method vs random Fourier features: A theoretical and empirical comparison. In NIPS, pp. 485–493, 2012.
Zhu, Shenghuo. Stochastic gradient descent algorithms for strongly convex functions at O(1/T) convergence rates. CoRR, abs/1305.2218, 2013.