Page 1: Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimizationand Randomization

Tianbao Yang†, Qihang Lin\, Rong Jin∗‡

Tutorial@SIGKDD 2015Sydney, Australia

†Department of Computer Science, The University of Iowa, IA, USA\Department of Management Sciences, The University of Iowa, IA, USA

∗Department of Computer Science and Engineering, Michigan State University, MI, USA‡Institute of Data Science and Technologies at Alibaba Group, Seattle, USA

August 10, 2015

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 1 / 234

Page 2: Big Data Analytics: Optimization and Randomization

URL

http://www.cs.uiowa.edu/˜tyng/kdd15-tutorial.pdf

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 2 / 234

Page 3: Big Data Analytics: Optimization and Randomization

Some Claims

No:
  This tutorial is not an exhaustive literature survey.
  It is not a survey on different machine learning/data mining algorithms.

Yes:
  It is about how to efficiently solve machine learning/data mining (formulated as optimization) problems for big data.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 3 / 234

Page 4: Big Data Analytics: Optimization and Randomization

Outline

Part I: Basics
Part II: Optimization
Part III: Randomization

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 4 / 234

Page 5: Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization

Part I: Basics

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 5 / 234

Page 6: Big Data Analytics: Optimization and Randomization

Basics Introduction

Outline

1 Basics
  Introduction
  Notations and Definitions

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 6 / 234

Page 7: Big Data Analytics: Optimization and Randomization

Basics Introduction

Three Steps for Machine Learning

Model · Optimization · Data

[Figure: convergence plot — distance to optimal objective vs. iterations, with curves 0.5^T, 1/T^2, and 1/T]

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 7 / 234

Page 8: Big Data Analytics: Optimization and Randomization

Basics Introduction

Big Data Challenge

Big Data

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 8 / 234

Page 9: Big Data Analytics: Optimization and Randomization

Basics Introduction

Big Data Challenge

Big Model

60 million parameters

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 9 / 234

Page 10: Big Data Analytics: Optimization and Randomization

Basics Introduction

Learning as Optimization

Ridge Regression Problem:

min_{w ∈ R^d}  (1/n) ∑_{i=1}^n ( y_i − w^T x_i )^2 + (λ/2) ‖w‖_2^2

x_i ∈ R^d: d-dimensional feature vector
y_i ∈ R: target variable
w ∈ R^d: model parameters
n: number of data points

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 10 / 234

Page 11: Big Data Analytics: Optimization and Randomization

Basics Introduction

Learning as Optimization

Ridge Regression Problem:

min_{w ∈ R^d}  (1/n) ∑_{i=1}^n ( y_i − w^T x_i )^2 + (λ/2) ‖w‖_2^2,   where the first term is the Empirical Loss

x_i ∈ R^d: d-dimensional feature vector
y_i ∈ R: target variable
w ∈ R^d: model parameters
n: number of data points

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 11 / 234

Page 12: Big Data Analytics: Optimization and Randomization

Basics Introduction

Learning as Optimization

Ridge Regression Problem:

min_{w ∈ R^d}  (1/n) ∑_{i=1}^n ( y_i − w^T x_i )^2 + (λ/2) ‖w‖_2^2,   where the second term is the Regularization

x_i ∈ R^d: d-dimensional feature vector
y_i ∈ R: target variable
w ∈ R^d: model parameters
n: number of data points

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 12 / 234

Page 13: Big Data Analytics: Optimization and Randomization

Basics Introduction

Learning as Optimization

Classification Problems:

min_{w ∈ R^d}  (1/n) ∑_{i=1}^n ℓ( y_i w^T x_i ) + (λ/2) ‖w‖_2^2

y_i ∈ {+1, −1}: label
Loss function ℓ(z), with z = y w^T x:

1. SVMs: (squared) hinge loss ℓ(z) = max(0, 1 − z)^p, where p = 1, 2

2. Logistic Regression: ℓ(z) = log(1 + exp(−z))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 13 / 234

Page 14: Big Data Analytics: Optimization and Randomization

Basics Introduction

Learning as Optimization

Feature Selection:

min_{w ∈ R^d}  (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + λ‖w‖_1

ℓ1 regularization: ‖w‖_1 = ∑_{i=1}^d |w_i|
λ controls the sparsity level

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 14 / 234

Page 15: Big Data Analytics: Optimization and Randomization

Basics Introduction

Learning as Optimization

Feature Selection using Elastic Net:

min_{w ∈ R^d}  (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + λ( ‖w‖_1 + γ‖w‖_2^2 )

The elastic net regularizer is more robust than the ℓ1 regularizer.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 15 / 234

Page 16: Big Data Analytics: Optimization and Randomization

Basics Introduction

Learning as Optimization

Multi-class/Multi-task Learning:

min_W  (1/n) ∑_{i=1}^n ℓ( W x_i, y_i ) + λ r(W),   W ∈ R^{K×d}

r(W) = ‖W‖_F^2 = ∑_{k=1}^K ∑_{j=1}^d W_{kj}^2: Frobenius norm
r(W) = ‖W‖_* = ∑_i σ_i: nuclear norm (sum of singular values)
r(W) = ‖W‖_{1,∞} = ∑_{j=1}^d ‖W_{:j}‖_∞: ℓ_{1,∞} mixed norm

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 16 / 234

Page 17: Big Data Analytics: Optimization and Randomization

Basics Introduction

Learning as Optimization

Regularized Empirical Loss Minimization

min_{w ∈ R^d}  (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + R(w)

Both ℓ and R are convex functions.
Extensions to matrix cases are possible (sometimes straightforward).
Extensions to kernel methods can be combined with randomized approaches.
Extensions to non-convex problems (e.g., deep learning) are in progress.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 17 / 234

Page 18: Big Data Analytics: Optimization and Randomization

Basics Introduction

Data Matrices and Machine Learning

The Instance-feature Matrix: X ∈ Rn×d

X = [ x_1^T ; x_2^T ; · · · ; x_n^T ]   (one instance per row)

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 18 / 234

Page 19: Big Data Analytics: Optimization and Randomization

Basics Introduction

Data Matrices and Machine Learning

The output vector: y = [ y_1 ; y_2 ; · · · ; y_n ] ∈ R^n

continuous y_i ∈ R: regression (e.g., house price)
discrete, e.g., y_i ∈ {1, 2, 3}: classification (e.g., species of iris)

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 19 / 234

Page 20: Big Data Analytics: Optimization and Randomization

Basics Introduction

Data Matrices and Machine Learning

The Instance-Instance Matrix: K ∈ R^{n×n}

Similarity matrix
Kernel matrix

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 20 / 234

Page 21: Big Data Analytics: Optimization and Randomization

Basics Introduction

Data Matrices and Machine Learning

Some machine learning tasks are formulated on the kernel matrix:

Clustering
Kernel methods

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 21 / 234

Page 22: Big Data Analytics: Optimization and Randomization

Basics Introduction

Data Matrices and Machine Learning

The Feature-Feature Matrix: C ∈ Rd×d

Covariance matrix
Distance metric matrix

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 22 / 234

Page 23: Big Data Analytics: Optimization and Randomization

Basics Introduction

Data Matrices and Machine Learning

Some machine learning tasks require the covariance matrix:

Principal Component Analysis
Top-k singular value (eigenvalue) decomposition of the covariance matrix

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 23 / 234

Page 24: Big Data Analytics: Optimization and Randomization

Basics Introduction

Why is Learning from Big Data Challenging?

High per-iteration cost

High memory cost

High communication cost

Large iteration complexity

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 24 / 234

Page 25: Big Data Analytics: Optimization and Randomization

Basics Notations and Definitions

Outline

1 Basics
  Introduction
  Notations and Definitions

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 25 / 234

Page 26: Big Data Analytics: Optimization and Randomization

Basics Notations and Definitions

Norms

Vector x ∈ R^d

Euclidean vector norm: ‖x‖_2 = sqrt( x^T x ) = sqrt( ∑_{i=1}^d x_i^2 )

ℓ_p-norm of a vector: ‖x‖_p = ( ∑_{i=1}^d |x_i|^p )^{1/p}, where p ≥ 1

1. ℓ_2 norm: ‖x‖_2 = sqrt( ∑_{i=1}^d x_i^2 )
2. ℓ_1 norm: ‖x‖_1 = ∑_{i=1}^d |x_i|
3. ℓ_∞ norm: ‖x‖_∞ = max_i |x_i|

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 26 / 234
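These norms map directly onto NumPy. Below is a small illustrative sketch (not from the tutorial; the array x is arbitrary) that checks the definitions above against numpy.linalg.norm.

```python
# Minimal sketch: the vector norms defined above, computed from their definitions
# and cross-checked against numpy.linalg.norm.
import numpy as np

x = np.array([3.0, -4.0, 1.0])

l2 = np.sqrt(x @ x)                              # Euclidean / l2 norm
l1 = np.sum(np.abs(x))                           # l1 norm
linf = np.max(np.abs(x))                         # l_infinity norm
l3 = np.sum(np.abs(x) ** 3) ** (1.0 / 3.0)       # general l_p norm with p = 3

assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(linf, np.linalg.norm(x, np.inf))
assert np.isclose(l3, np.linalg.norm(x, 3))
print(l2, l1, linf, l3)
```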

Page 29: Big Data Analytics: Optimization and Randomization

Basics Notations and Definitions

Matrix Factorization

Matrix X ∈ R^{n×d}

Singular Value Decomposition: X = U Σ V^T
1. U ∈ R^{n×r}: orthonormal columns (U^T U = I), spanning the column space
2. Σ ∈ R^{r×r}: diagonal matrix with Σ_{ii} = σ_i > 0, σ_1 ≥ σ_2 ≥ . . . ≥ σ_r
3. V ∈ R^{d×r}: orthonormal columns (V^T V = I), spanning the row space
4. r ≤ min(n, d): the largest value such that σ_r > 0, i.e., the rank of X
5. U_k Σ_k V_k^T: top-k approximation

Pseudo-inverse: X^† = V Σ^{−1} U^T

QR factorization: X = QR (n ≥ d)
  Q ∈ R^{n×d}: orthonormal columns
  R ∈ R^{d×d}: upper triangular matrix

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 27 / 234

Page 32: Big Data Analytics: Optimization and Randomization

Basics Notations and Definitions

Norms

Matrix X ∈ R^{n×d}

Frobenius norm: ‖X‖_F = sqrt( tr(X^T X) ) = sqrt( ∑_{i=1}^n ∑_{j=1}^d X_{ij}^2 )

Spectral (induced) norm: ‖X‖_2 = max_{‖u‖_2 = 1} ‖Xu‖_2 = σ_1 (the maximum singular value)

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 28 / 234

Page 34: Big Data Analytics: Optimization and Randomization

Basics Notations and Definitions

Convex Optimization

min_{x ∈ X} f(x)

X is a convex domain: for any x, y ∈ X, their convex combination αx + (1 − α)y ∈ X
f(x) is a convex function

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 29 / 234

Page 35: Big Data Analytics: Optimization and Randomization

Basics Notations and Definitions

Convex Function

Characterization of Convex Function

f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y),   ∀ x, y ∈ X, α ∈ [0, 1]

f(x) ≥ f(y) + ∇f(y)^T (x − y),   ∀ x, y ∈ X

local optimum is global optimum

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 30 / 234

Page 37: Big Data Analytics: Optimization and Randomization

Basics Notations and Definitions

Convex vs Strongly Convex

Convex function:

f(x) ≥ f(y) + ∇f(y)^T (x − y),   ∀ x, y ∈ X

λ-Strongly convex function (λ is the strong convexity constant):

f(x) ≥ f(y) + ∇f(y)^T (x − y) + (λ/2) ‖x − y‖_2^2,   ∀ x, y ∈ X

The global optimum is unique.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 31 / 234

Page 39: Big Data Analytics: Optimization and Randomization

Basics Notations and Definitions

Non-smooth function vs Smooth function

Non-smooth function
  Lipschitz continuous (G is the Lipschitz constant), e.g. absolute loss f(x) = |x|:
    |f(x) − f(y)| ≤ G ‖x − y‖_2
  Subgradient: f(x) ≥ f(y) + ∂f(y)^T (x − y)
  [Figure: f(x) = |x|, a non-smooth function, with a sub-gradient drawn at the kink]

Smooth function
  e.g. logistic loss f(x) = log(1 + exp(−x)):
    ‖∇f(x) − ∇f(y)‖_2 ≤ L ‖x − y‖_2   (L is the smoothness constant)
  [Figure: log(1 + exp(−x)) with its tangent line f(y) + f'(y)(x − y) and a quadratic upper bound]

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 32 / 234

Page 42: Big Data Analytics: Optimization and Randomization

Basics Notations and Definitions

Next ...

min_{w ∈ R^d}  (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + R(w)

Part II: Optimization
  stochastic optimization
  distributed optimization

Reduce Iteration Complexity: utilizing properties of functions

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 33 / 234

Page 43: Big Data Analytics: Optimization and Randomization

Basics Notations and Definitions

Next ...

Part III: Randomization
  Classification, Regression
  SVD, K-means, Kernel methods

Reduce Data Size: utilizing properties of data

Please stay tuned!

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 34 / 234

Page 44: Big Data Analytics: Optimization and Randomization

Optimization

Big Data Analytics: Optimization and Randomization

Part II: Optimization

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 35 / 234

Page 45: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Outline

2 Optimization
  (Sub)Gradient Methods
  Stochastic Optimization Algorithms for Big Data
    Stochastic Optimization
    Distributed Optimization

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 36 / 234

Page 46: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Learning as Optimization

Regularized Empirical Loss Minimization

min_{w ∈ R^d}  F(w),   where F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + R(w)

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 37 / 234

Page 47: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Convergence Measure

Most optimization algorithms are iterative

w_{t+1} = w_t + Δw_t

Iteration Complexity: the number of iterations T(ε) needed to have
  F(w_T) − min_w F(w) ≤ ε   (ε ≪ 1)

Convergence Rate: after T iterations, how good is the solution
  F(w_T) − min_w F(w) ≤ ε(T)

[Figure: objective vs. iterations, illustrating the accuracy ε reached after T iterations]

Total Runtime = Per-iteration Cost × Iteration Complexity

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 38 / 234

Page 51: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

More on Convergence Measure

Big O(·) notation: explicit dependence on T or ε

Convergence Rate            | Iteration Complexity
linear: O(μ^T), μ < 1       | O( log(1/ε) )
sub-linear: O(1/T^α), α > 0 | O( 1/ε^{1/α} )

Why are we interested in bounds?

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 39 / 234

Page 56: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

More on Convergence Measure

Convergence Rate            | Iteration Complexity
linear: O(μ^T), μ < 1       | O( log(1/ε) )
sub-linear: O(1/T^α), α > 0 | O( 1/ε^{1/α} )

[Figure: distance to optimum vs. iterations T for the rates 0.5^T (seconds), 1/T (minutes), and 1/T^0.5 (hours)]

Theoretically, we consider

O(μ^T) ≺ O(1/T^2) ≺ O(1/T) ≺ O(1/√T),   i.e.,   log(1/ε) ≺ 1/√ε ≺ 1/ε ≺ 1/ε^2

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 43 / 234

Page 57: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Non-smooth V.S. Smooth

Non-smooth ℓ(z)
  hinge loss: ℓ(w^T x, y) = max(0, 1 − y w^T x)
  absolute loss: ℓ(w^T x, y) = |w^T x − y|

Smooth ℓ(z)
  squared hinge loss: ℓ(w^T x, y) = max(0, 1 − y w^T x)^2
  logistic loss: ℓ(w^T x, y) = log(1 + exp(−y w^T x))
  square loss: ℓ(w^T x, y) = (w^T x − y)^2

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 44 / 234
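The losses above translate directly into code. The following sketch (illustrative only; function names are ours, not the tutorial's) writes each loss in terms of the margin z = y·w^T x for classification and the prediction z = w^T x for regression.

```python
# Minimal sketch of the non-smooth and smooth losses listed above.
import numpy as np

def hinge(z):                 # non-smooth, z = y * w'x
    return np.maximum(0.0, 1.0 - z)

def absolute(z, y):           # non-smooth, z = w'x
    return np.abs(z - y)

def squared_hinge(z):         # smooth
    return np.maximum(0.0, 1.0 - z) ** 2

def logistic(z):              # smooth; logaddexp for numerical stability
    return np.logaddexp(0.0, -z)

def square(z, y):             # smooth
    return (z - y) ** 2

def hinge_subgradient(z):     # a subgradient of the hinge loss w.r.t. z
    return np.where(1.0 - z > 0.0, -1.0, 0.0)
```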

Page 58: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Strongly Convex V.S. Non-strongly Convex

λ-strongly convex R(w)
  ℓ2 regularizer: (λ/2)‖w‖_2^2
  Elastic net regularizer: τ‖w‖_1 + (λ/2)‖w‖_2^2

Non-strongly convex R(w)
  unregularized problem: R(w) ≡ 0
  ℓ1 regularizer: τ‖w‖_1

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 45 / 234

Page 59: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Gradient Method in Machine Learning

F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + (λ/2) ‖w‖_2^2

Suppose ℓ(z) is smooth.
Full gradient: ∇F(w) = (1/n) ∑_{i=1}^n ∇ℓ( w^T x_i, y_i ) + λw
Per-iteration cost: O(nd)

Gradient Descent (γ_t is the step size):

w_t = w_{t−1} − γ_t ∇F(w_{t−1})

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 46 / 234
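As a concrete instance of the update above, here is a minimal gradient-descent sketch for ridge regression (square loss). The step size, data, and all names are illustrative assumptions, not part of the tutorial.

```python
# Minimal sketch: w_t = w_{t-1} - gamma * grad F(w_{t-1}) for
# F(w) = (1/n) sum_i (y_i - w'x_i)^2 + (lam/2)||w||_2^2.
import numpy as np

def gradient_descent(X, y, lam=0.1, gamma=0.1, T=200):
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        grad = (2.0 / n) * X.T @ (X @ w - y) + lam * w   # full gradient, O(nd) per iteration
        w -= gamma * grad                                # fixed step size (assumed)
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)
print(gradient_descent(X, y))
```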

Page 61: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Gradient Method in Machine Learning

F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + (λ/2) ‖w‖_2^2

If λ = 0: R(w) is non-strongly convex; iteration complexity O( 1/ε )
If λ > 0: R(w) is λ-strongly convex; iteration complexity O( (1/λ) log(1/ε) )

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 47 / 234

Page 62: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Accelerated Gradient Method

Accelerated Gradient Descent

w_t = v_{t−1} − γ_t ∇F(v_{t−1})
v_t = w_t + η_t ( w_t − w_{t−1} )   (momentum step)

w_t is the output and v_t is an auxiliary sequence.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 48 / 234
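A minimal sketch of the accelerated scheme above, assuming a smooth objective, a fixed step size, and the simple momentum schedule η_t = (t − 1)/(t + 2) (our choice for the non-strongly-convex case, not specified by the tutorial).

```python
# Minimal sketch of accelerated (Nesterov-style) gradient descent:
#   w_t = v_{t-1} - gamma * grad F(v_{t-1});  v_t = w_t + eta_t (w_t - w_{t-1}).
import numpy as np

def accelerated_gd(grad_F, w0, gamma=0.1, T=200):
    w_prev = w0.copy()
    v = w0.copy()
    for t in range(1, T + 1):
        w = v - gamma * grad_F(v)        # gradient step at the auxiliary point v
        eta = (t - 1.0) / (t + 2.0)      # assumed momentum schedule
        v = w + eta * (w - w_prev)       # momentum step
        w_prev = w
    return w_prev

# Example usage with the ridge-regression gradient from the earlier sketch.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.arange(1.0, 6.0)
grad = lambda w: (2.0 / 100) * X.T @ (X @ w - y) + 0.1 * w
print(accelerated_gd(grad, np.zeros(5)))
```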

Page 64: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Accelerated Gradient Method

F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + (λ/2) ‖w‖_2^2

If λ = 0: R(w) is non-strongly convex; iteration complexity O( 1/√ε ), better than O( 1/ε )
If λ > 0: R(w) is λ-strongly convex; iteration complexity O( (1/√λ) log(1/ε) ), better than O( (1/λ) log(1/ε) ) for small λ

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 49 / 234

Page 65: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Deal with `1 regularizer

Consider a more general case

min_{w ∈ R^d}  F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + R′(w) + τ‖w‖_1,   with R(w) = R′(w) + τ‖w‖_1

R′(w): λ-strongly convex and smooth

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 50 / 234

Page 66: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Deal with `1 regularizer

Consider a more general case

min_{w ∈ R^d}  F(w) = F′(w) + τ‖w‖_1,   where F′(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + R′(w)

R(w) = R′(w) + τ‖w‖_1
R′(w): λ-strongly convex and smooth

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 51 / 234

Page 67: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Deal with `1 regularizer

Accelerated Gradient Descent

w_t = argmin_{w ∈ R^d}  ∇F′(v_{t−1})^T w + (1/(2γ_t)) ‖w − v_{t−1}‖_2^2 + τ‖w‖_1   (proximal mapping)
v_t = w_t + η_t ( w_t − w_{t−1} )

The proximal mapping has a closed-form solution: soft-thresholding.
Iteration complexity and runtime remain unchanged.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 52 / 234
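The soft-thresholding solution of the proximal mapping above is short enough to write out; the sketch below (illustrative names) applies it to one accelerated proximal-gradient step.

```python
# Minimal sketch: soft-thresholding solves
#   argmin_w  g'w + (1/(2*gamma))||w - v||^2 + tau*||w||_1  =  soft(v - gamma*g, gamma*tau).
import numpy as np

def soft_threshold(z, thresh):
    # elementwise sign(z) * max(|z| - thresh, 0)
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def proximal_step(v, grad_v, gamma, tau):
    # one proximal-gradient step at the auxiliary point v
    return soft_threshold(v - gamma * grad_v, gamma * tau)

print(soft_threshold(np.array([0.3, -1.2, 0.05]), 0.1))   # -> [ 0.2 -1.1  0. ]
```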

Page 69: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Sub-Gradient Method in Machine Learning

F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + (λ/2) ‖w‖_2^2

Suppose ℓ(z) is non-smooth.
Full sub-gradient: ∂F(w) = (1/n) ∑_{i=1}^n ∂ℓ( w^T x_i, y_i ) + λw

Sub-Gradient Descent:

w_t = w_{t−1} − γ_t ∂F(w_{t−1})

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 53 / 234

Page 71: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Sub-Gradient Method

F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + (λ/2) ‖w‖_2^2

If λ = 0: R(w) is non-strongly convex; iteration complexity O( 1/ε^2 )
If λ > 0: R(w) is λ-strongly convex; iteration complexity O( 1/(λε) )

No efficient acceleration scheme in general

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 54 / 234

Page 72: Big Data Analytics: Optimization and Randomization

Optimization (Sub)Gradient Methods

Problem Classes and Iteration Complexity

min_{w ∈ R^d}  (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + R(w)

Iteration complexity, with ℓ(z) ≡ ℓ(z, y):

                              | Non-smooth ℓ | Smooth ℓ
R(w) non-strongly convex      | O( 1/ε^2 )   | O( 1/√ε )
R(w) λ-strongly convex        | O( 1/(λε) )  | O( (1/√λ) log(1/ε) )

Per-iteration cost: O(nd), too high if n or d are large.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 55 / 234

Page 73: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Outline

2 Optimization
  (Sub)Gradient Methods
  Stochastic Optimization Algorithms for Big Data
    Stochastic Optimization
    Distributed Optimization

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 56 / 234

Page 74: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Stochastic First-Order Method by Data Sampling

Stochastic Gradient Descent (SGD)

Stochastic Variance Reduced Gradient (SVRG)

Stochastic Average Gradient Algorithm (SAGA)

Stochastic Dual Coordinate Ascent (SDCA)

Accelerated Proximal Coordinate Gradient (APCG)

Assumption: ‖xi‖ ≤ 1 for any i

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 57 / 234

Page 75: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Basic SGD (Nemirovski & Yudin (1978))

F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + (λ/2) ‖w‖_2^2

Full sub-gradient: ∂F(w) = (1/n) ∑_{i=1}^n ∂ℓ( w^T x_i, y_i ) + λw
Randomly sample i ∈ {1, . . . , n}
Stochastic sub-gradient: ∂ℓ( w^T x_i, y_i ) + λw, with E_i[ ∂ℓ( w^T x_i, y_i ) + λw ] = ∂F(w)

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 58 / 234

Page 76: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Basic SGD (Nemirovski & Yudin (1978))

Applicable in all settings!

min_{w ∈ R^d}  F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + (λ/2) ‖w‖_2^2

sample: i_t ∈ {1, . . . , n}
update: w_t = w_{t−1} − γ_t ( ∂ℓ( w_{t−1}^T x_{i_t}, y_{i_t} ) + λ w_{t−1} )
output: ŵ_T = (1/T) ∑_{t=1}^T w_t

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 59 / 234
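A minimal SGD sketch matching the sample/update/output steps above, using the hinge-loss subgradient and the step size γ_t = 1/(λt) (a standard choice for the strongly convex case; all names are illustrative).

```python
# Minimal sketch of basic SGD for (1/n) sum_i hinge(y_i w'x_i) + (lam/2)||w||_2^2.
import numpy as np

def sgd(X, y, lam=1e-2, T=10000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                                     # sample i_t
        gamma = 1.0 / (lam * t)                                 # assumed step size
        margin = y[i] * (w @ X[i])
        g_loss = -y[i] * X[i] if margin < 1.0 else np.zeros(d)  # hinge-loss subgradient
        w -= gamma * (g_loss + lam * w)                         # O(d) update
        w_avg += (w - w_avg) / t                                # running average -> output
    return w_avg
```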

Page 78: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Basic SGD (Nemirovski & Yudin (1978))

F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + (λ/2) ‖w‖_2^2

If λ = 0: R(w) is non-strongly convex; iteration complexity O( 1/ε^2 )
If λ > 0: R(w) is λ-strongly convex; iteration complexity O( 1/(λε) )

Exactly the same as sub-gradient descent!

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 60 / 234

Page 79: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Total Runtime

Per-iteration cost: O(d)

Much lower than the full gradient method, e.g. hinge loss (SVM):

stochastic gradient: ∂ℓ( w^T x_{i_t}, y_{i_t} ) = −y_{i_t} x_{i_t} if 1 − y_{i_t} w^T x_{i_t} > 0, and 0 otherwise

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 61 / 234

Page 80: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Total Runtime

min_{w ∈ R^d}  (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + R(w)

Iteration complexity of SGD, with ℓ(z) ≡ ℓ(z, y):

                              | Non-smooth ℓ | Smooth ℓ
R(w) non-strongly convex      | O( 1/ε^2 )   | O( 1/ε^2 )
R(w) λ-strongly convex        | O( 1/(λε) )  | O( 1/(λε) )

For SGD, only strong convexity helps; smoothness does not make any difference!

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 62 / 234

Page 81: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Full Gradient V.S. Stochastic Gradient

The full gradient method needs fewer iterations; the stochastic gradient method has a lower cost per iteration.
For small ε, use the full gradient; if satisfied with large ε, use the stochastic gradient.
The full gradient method can be accelerated; the stochastic gradient method cannot.
The full gradient method's iteration complexity depends on smoothness and strong convexity; the stochastic gradient method's iteration complexity depends only on strong convexity.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 63 / 234

Page 82: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

min_{w ∈ R^d}  F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + (λ/2) ‖w‖_2^2

Applicable when ℓ(z) is smooth and R(w) is λ-strongly convex.

Stochastic gradient: g_{i_t}(w) = ∇ℓ( w^T x_{i_t}, y_{i_t} ) + λw
E_{i_t}[ g_{i_t}(w) ] = ∇F(w), but Var[ g_{i_t}(w) ] ≠ 0 even if w = w*

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 64 / 234

Page 84: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

Compute the full gradient at a reference point w̄:

∇F(w̄) = (1/n) ∑_{i=1}^n ∇ℓ( w̄^T x_i, y_i ) + λw̄

Stochastic variance reduced gradient:

g̃_{i_t}(w) = ∇F(w̄) − g_{i_t}(w̄) + g_{i_t}(w)

E_{i_t}[ g̃_{i_t}(w) ] = ∇F(w)
Var[ g̃_{i_t}(w) ] −→ 0 as w, w̄ → w*

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 65 / 234

Page 85: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

At the optimal solution w*: ∇F(w*) = 0. This does not mean that

g_{i_t}(w) −→ 0   as w → w*

However, we have

g̃_{i_t}(w) = ∇F(w̄) − g_{i_t}(w̄) + g_{i_t}(w) −→ 0   as w, w̄ → w*

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 66 / 234

Page 86: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

Iterate s = 1, . . . , T − 1:
  Let w_0 = w̄_s and compute ∇F(w̄_s)
  Iterate t = 1, . . . , K:
    g̃_{i_t}(w_{t−1}) = ∇F(w̄_s) − g_{i_t}(w̄_s) + g_{i_t}(w_{t−1})
    w_t = w_{t−1} − γ_t g̃_{i_t}(w_{t−1})
  w̄_{s+1} = (1/K) ∑_{t=1}^K w_t
Output: w̄_T

K = O(1/λ)

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 67 / 234
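A minimal sketch of the SVRG loop above, instantiated with the square loss so that the per-example gradient is explicit; the step size, inner-loop length, and names are our assumptions.

```python
# Minimal SVRG sketch for F(w) = (1/n) sum_i (w'x_i - y_i)^2 + (lam/2)||w||_2^2.
import numpy as np

def svrg(X, y, lam=1e-2, gamma=0.05, T=20, K=None):
    n, d = X.shape
    K = K or 2 * n                                    # inner-loop length (assumed)
    rng = np.random.default_rng(0)
    grad_i = lambda w, i: 2.0 * (w @ X[i] - y[i]) * X[i] + lam * w
    w_ref = np.zeros(d)
    for _ in range(T):
        full_grad = (2.0 / n) * X.T @ (X @ w_ref - y) + lam * w_ref   # grad F at reference point
        w = w_ref.copy()
        w_sum = np.zeros(d)
        for _ in range(K):
            i = rng.integers(n)
            g = full_grad - grad_i(w_ref, i) + grad_i(w, i)           # variance-reduced gradient
            w -= gamma * g
            w_sum += w
        w_ref = w_sum / K                             # new reference point = inner average
    return w_ref
```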

Page 87: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

Per-iteration cost: O( d(n + 1/λ) )

Iteration complexity, with ℓ(z) ≡ ℓ(z, y):

                              | Non-smooth ℓ | Smooth ℓ
R(w) non-strongly convex      | N.A.         | N.A.
R(w) λ-strongly convex        | N.A.         | O( log(1/ε) )

Total Runtime: O( d(n + 1/λ) log(1/ε) )
Use the proximal mapping for the ℓ1 regularizer.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 68 / 234

Page 89: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SAGA (Defazio et al. (2014))

min_{w ∈ R^d}  F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + (λ/2) ‖w‖_2^2

A new version of SAG (Roux et al. (2012)).
Applicable when ℓ(z) is smooth; strong convexity is not required.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 69 / 234

Page 90: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SAGA (Defazio et al. (2014))

SAGA also reduces the variance of the stochastic gradient, but with a different technique.

SVRG uses gradients at the same reference point w̄:

g̃_{i_t}(w) = ∇F(w̄) − g_{i_t}(w̄) + g_{i_t}(w),   where ∇F(w̄) = (1/n) ∑_{i=1}^n ∇ℓ( w̄^T x_i, y_i ) + λw̄

SAGA uses gradients at different points w̄_1, w̄_2, · · · , w̄_n:

g̃_{i_t}(w) = Ḡ − g_{i_t}(w̄_{i_t}) + g_{i_t}(w),   where Ḡ (the average gradient) = (1/n) ∑_{i=1}^n ∇ℓ( w̄_i^T x_i, y_i ) + λw̄_{i_t}

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 70 / 234

Page 92: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SAGA (Defazio et al. (2014))

Initialize the average gradient Ḡ_0:

Ḡ_0 = (1/n) ∑_{i=1}^n ḡ_i,   ḡ_i = ∇ℓ( w_0^T x_i, y_i ) + λw_0

Stochastic variance reduced gradient:

g̃_{i_t}(w_{t−1}) = Ḡ_{t−1} − ḡ_{i_t} + ( ∇ℓ( w_{t−1}^T x_{i_t}, y_{i_t} ) + λw_{t−1} )
w_t = w_{t−1} − γ_t g̃_{i_t}(w_{t−1})

Update the average gradient:

Ḡ_t = (1/n) ∑_{i=1}^n ḡ_i,   with ḡ_{i_t} = ∇ℓ( w_{t−1}^T x_{i_t}, y_{i_t} ) + λw_{t−1}

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 71 / 234
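A matching SAGA sketch (square loss again for concreteness): it stores one gradient per example, which is the O(nd) memory cost discussed on the next slide; the step size and names are illustrative.

```python
# Minimal SAGA sketch for F(w) = (1/n) sum_i (w'x_i - y_i)^2 + (lam/2)||w||_2^2.
import numpy as np

def saga(X, y, lam=1e-2, gamma=0.05, T=20000, seed=0):
    n, d = X.shape
    grad_i = lambda w, i: 2.0 * (w @ X[i] - y[i]) * X[i] + lam * w
    w = np.zeros(d)
    g_table = np.array([grad_i(w, i) for i in range(n)])   # stored gradients g_i, O(nd) memory
    g_avg = g_table.mean(axis=0)                           # average gradient G
    rng = np.random.default_rng(seed)
    for _ in range(T):
        i = rng.integers(n)
        g_new = grad_i(w, i)
        g = g_avg - g_table[i] + g_new                     # SAGA variance-reduced gradient
        w -= gamma * g
        g_avg += (g_new - g_table[i]) / n                  # O(d) update of the average
        g_table[i] = g_new
    return w
```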

Page 95: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SAGA: efficient update of averaged gradient

Ḡ_t and Ḡ_{t−1} differ only in ḡ_i for i = i_t. Before we update ḡ_{i_t}, we update

Ḡ_t = (1/n) ∑_{i=1}^n ḡ_i = Ḡ_{t−1} − (1/n) ḡ_{i_t} + (1/n) ( ∇ℓ( w_{t−1}^T x_{i_t}, y_{i_t} ) + λw_{t−1} )

Computation cost: O(d)

To implement SAGA, we have to store and update all of ḡ_1, ḡ_2, . . . , ḡ_n.
This requires extra memory of size O(nd).

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 72 / 234

Page 97: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SAGA (Defazio et al. (2014))

Per-iteration cost: O(d)

Iteration complexity, with ℓ(z) ≡ ℓ(z, y):

                              | Non-smooth ℓ | Smooth ℓ
R(w) non-strongly convex      | N.A.         | O( n/ε )
R(w) λ-strongly convex        | N.A.         | O( (n + 1/λ) log(1/ε) )

Total Runtime (strongly convex): O( d(n + 1/λ) log(1/ε) ). Same as SVRG!
Use the proximal mapping for the ℓ1 regularizer.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 73 / 234

Page 99: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Compare the Runtime of SGD and SVRG/SAGA

Smooth but non-strongly convex:
  SGD: O( d/ε^2 )
  SAGA: O( dn/ε )

Smooth and strongly convex:
  SGD: O( d/(λε) )
  SVRG/SAGA: O( d(n + 1/λ) log(1/ε) )

For small ε, use SVRG/SAGA; if satisfied with large ε, use SGD.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 74 / 234

Page 100: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Conjugate Duality

Define ℓ_i(z) ≡ ℓ(z, y_i)

Conjugate function: ℓ_i^*(α) ⇐⇒ ℓ_i(z)

ℓ_i(z) = max_{α ∈ R} [ αz − ℓ_i^*(α) ],   ℓ_i^*(α) = max_{z ∈ R} [ αz − ℓ_i(z) ]

E.g. hinge loss: ℓ_i(z) = max(0, 1 − y_i z)
  ℓ_i^*(α) = α y_i if −1 ≤ α y_i ≤ 0, and +∞ otherwise

E.g. squared hinge loss: ℓ_i(z) = max(0, 1 − y_i z)^2
  ℓ_i^*(α) = α^2/4 + α y_i if α y_i ≤ 0, and +∞ otherwise

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 75 / 234
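For completeness, a short derivation of the hinge-loss conjugate quoted above (using y_i ∈ {+1, −1}, so y_i^2 = 1); this worked step is ours, not the slide's.

```latex
\[
\ell_i(z) = \max(0, 1 - y_i z), \qquad
\ell_i^*(\alpha) = \sup_{z}\big[\alpha z - \max(0, 1 - y_i z)\big]
                 = \sup_{u}\big[\beta u - \max(0, 1 - u)\big],
\quad u = y_i z,\ \beta = \alpha y_i .
\]
\[
u \le 1:\ \beta u - (1 - u) = (\beta + 1)u - 1
\ \Rightarrow\ \text{bounded above iff } \beta \ge -1, \text{ with supremum } \beta \text{ at } u = 1.
\]
\[
u \ge 1:\ \beta u
\ \Rightarrow\ \text{bounded above iff } \beta \le 0, \text{ with supremum } \beta \text{ at } u = 1.
\]
\[
\Rightarrow\quad \ell_i^*(\alpha) = \alpha y_i \ \text{if } -1 \le \alpha y_i \le 0,
\qquad \ell_i^*(\alpha) = +\infty \ \text{otherwise.}
\]
```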

Page 102: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SDCA (Shalev-Shwartz & Zhang (2013))

Stochastic Dual Coordinate Ascent (liblinear (Hsieh et al., 2008))
Applicable when R(w) is λ-strongly convex; smoothness is not required.

From the primal problem to the dual problem:

min_w  (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + (λ/2) ‖w‖_2^2
  = min_w  (1/n) ∑_{i=1}^n max_{α_i ∈ R} [ α_i (w^T x_i) − ℓ_i^*(α_i) ] + (λ/2) ‖w‖_2^2
  = max_{α ∈ R^n}  (1/n) ∑_{i=1}^n −ℓ_i^*(α_i) − (λ/2) ‖ (1/(λn)) ∑_{i=1}^n α_i x_i ‖_2^2

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 76 / 234

Page 104: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SDCA (Shalev-Shwartz & Zhang (2013))

Solve Dual Problem:

max_{α ∈ R^n}  (1/n) ∑_{i=1}^n −ℓ_i^*(α_i) − (λ/2) ‖ (1/(λn)) ∑_{i=1}^n α_i x_i ‖_2^2

Sample i_t ∈ {1, . . . , n}. Optimize α_{i_t} while fixing the other coordinates.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 77 / 234

Page 105: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SDCA (Shalev-Shwartz & Zhang (2013))

Maintain a primal solution: w_t = (1/(λn)) ∑_{i=1}^n α_i^t x_i

Change of variable α_i −→ α_i + Δα_i:

max_{Δα ∈ R^n}  (1/n) ∑_{i=1}^n −ℓ_i^*( α_i^t + Δα_i ) − (λ/2) ‖ (1/(λn)) ( ∑_{i=1}^n α_i^t x_i + ∑_{i=1}^n Δα_i x_i ) ‖_2^2

⇐⇒ max_{Δα ∈ R^n}  (1/n) ∑_{i=1}^n −ℓ_i^*( α_i^t + Δα_i ) − (λ/2) ‖ w_t + (1/(λn)) ∑_{i=1}^n Δα_i x_i ‖_2^2

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 78 / 234

Page 106: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SDCA (Shalev-Shwartz & Zhang (2013))

Dual Coordinate Updates

Δα_{i_t} = argmax_{Δα_{i_t}}  −(1/n) ℓ_{i_t}^*( −α_{i_t}^t − Δα_{i_t} ) − (λ/2) ‖ w_t + (1/(λn)) Δα_{i_t} x_{i_t} ‖_2^2

α_{i_t}^{t+1} = α_{i_t}^t + Δα_{i_t}

w_{t+1} = w_t + (1/(λn)) Δα_{i_t} x_{i_t}

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 79 / 234

Page 107: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SDCA updates

Closed-form solution for Δα_i: hinge loss, squared hinge loss, absolute loss and square loss (Shalev-Shwartz & Zhang (2013)), e.g. square loss:

Δα_i = ( y_i − w_t^T x_i − α_i^t ) / ( 1 + ‖x_i‖_2^2 / (λn) )

Per-iteration cost: O(d)

Approximate solution: logistic loss (Shalev-Shwartz & Zhang (2013))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 80 / 234
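Using the closed-form Δα_i for the square loss quoted above, SDCA for ridge regression fits in a few lines; the loop length, names, and data layout are our assumptions.

```python
# Minimal SDCA sketch for ridge regression, using the closed-form coordinate update
#   delta = (y_i - w'x_i - alpha_i) / (1 + ||x_i||^2 / (lam * n)).
import numpy as np

def sdca_ridge(X, y, lam=1e-2, epochs=20, seed=0):
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                                # maintained primal: w = (1/(lam*n)) sum_i alpha_i x_i
    sq_norms = np.einsum('ij,ij->i', X, X)         # ||x_i||^2, precomputed
    rng = np.random.default_rng(seed)
    for _ in range(epochs * n):
        i = rng.integers(n)
        delta = (y[i] - w @ X[i] - alpha[i]) / (1.0 + sq_norms[i] / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)              # keep the primal iterate consistent, O(d)
    return w
```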

Page 108: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SDCA

Iteration complexity, with ℓ(z) ≡ ℓ(z, y):

                              | Non-smooth ℓ     | Smooth ℓ
R(w) non-strongly convex      | N.A.             | N.A.
R(w) λ-strongly convex        | O( n + 1/(λε) )  | O( (n + 1/λ) log(1/ε) )

Total Runtime (smooth loss): O( d(n + 1/λ) log(1/ε) ). The same as SVRG and SAGA!

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 81 / 234

Page 110: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

SVRG V.S. SDCA V.S. SGD

`2-regularized logistic regression with λ = 10−4

MNIST data (Johnson & Zhang (2013))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 82 / 234

Page 111: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

APCG (Lin et al. (2014))

Recall the acceleration scheme for the full gradient method: an auxiliary sequence (β^t) and a momentum step.

Maintain a primal solution: w_t = (1/(λn)) ∑_{i=1}^n β_i^t x_i

Dual Coordinate Updates:

Δβ_{i_t} = argmax_{Δβ_{i_t}}  −(1/n) ℓ_{i_t}^*( −β_{i_t}^t − Δβ_{i_t} ) − (λ/2) ‖ w_t + (1/(λn)) Δβ_{i_t} x_{i_t} ‖_2^2

α_{i_t}^{t+1} = β_{i_t}^t + Δβ_{i_t}

β^{t+1} = α^{t+1} + η_t ( α^{t+1} − α^t )   (momentum step)

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 83 / 234

Page 113: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

APCG (Lin et al. (2014))

Per-iteration cost: O(d)

Iteration complexity, with ℓ(z) ≡ ℓ(z, y):

                              | Non-smooth ℓ        | Smooth ℓ
R(w) non-strongly convex      | N.A.                | N.A.
R(w) λ-strongly convex        | O( n + √(n/(λε)) )  | O( (n + √(n/λ)) log(1/ε) )

Compared to SDCA, APCG has a shorter runtime when λ is very small.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 84 / 234

Page 114: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

APCG V.S. SDCA

squared hinge loss SVM, real data

datasets | number of samples n | number of features d | sparsity
rcv1     | 20,242              | 47,236               | 0.16%
covtype  | 581,012             | 54                   | 22%
news20   | 19,996              | 1,355,191            | 0.04%

F(w_t) − F(w*) V.S. the number of passes over the data

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 85 / 234

Page 115: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

APCG V.S. SDCA (Lin et al. (2014))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 86 / 234

Page 116: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

For general R(w)

Dual Problem:

max_{α ∈ R^n}  (1/n) ∑_{i=1}^n −ℓ_i^*(α_i) − R^*( (1/(λn)) ∑_{i=1}^n α_i x_i )

R^* is the conjugate of R.
Sample i_t ∈ {1, . . . , n} and optimize α_{i_t} while fixing the other coordinates.
The update can still be done in O(d) in many cases (Shalev-Shwartz & Zhang (2013)).
The iteration complexity and runtime of SDCA and APCG remain unchanged.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 87 / 234

Page 117: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

APCG for primal problem

min_{w ∈ R^d}  F(w) = (1/n) ∑_{i=1}^n ℓ( w^T x_i, y_i ) + (λ/2) ‖w‖_2^2 + τ‖w‖_1

Suppose d ≫ n. The per-iteration cost O(d) is too high.
Apply APCG to the primal problem instead of the dual problem.
Sample over features instead of data points.
The per-iteration cost becomes O(n).

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 88 / 234

Page 118: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

APCG for primal problem

min_{w ∈ R^d}  F(w) = (1/2) ‖Xw − y‖_2^2 + (λ/2) ‖w‖_2^2 + τ‖w‖_1,   X = [x_1, x_2, · · · , x_n]

Full gradient: ∇F(w) = X^T (Xw − y) + λw
Partial gradient (for coordinate i): ∇_i F(w) = x_i^T (Xw − y) + λw_i

Proximal Coordinate Gradient (PCG) (Nesterov (2012)):

w_i^t = argmin_{w_i ∈ R}  ∇_i F(w^{t−1}) w_i + (1/(2γ_t)) ( w_i − w_i^{t−1} )^2 + τ|w_i|   (proximal mapping),  if i = i_t
w_i^t = w_i^{t−1}   otherwise

∇_i F(w^t) can be updated in O(n).

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 89 / 234

Page 121: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

APCG for primal problem

APCG accelerates PCG using an auxiliary sequence (v^t) and a momentum step.

APCG:

w_i^t = argmin_{w_i ∈ R}  ∇_i F(v^{t−1}) w_i + (1/(2γ_t)) ( w_i − v_i^{t−1} )^2 + τ|w_i|  if i = i_t,  and w_i^t = w_i^{t−1} otherwise
v^t = w^t + η_t ( w^t − w^{t−1} )

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 90 / 234
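A minimal sketch of the un-accelerated PCG building block above for the primal objective (1/2)‖Xw − y‖_2^2 + (λ/2)‖w‖_2^2 + τ‖w‖_1; APCG would add the auxiliary sequence v^t and the momentum step shown on this slide. The coordinate-wise step size and all names are our assumptions.

```python
# Minimal proximal coordinate gradient (PCG) sketch: sample a feature i, take a
# prox step on that coordinate, and keep the residual Xw - y updated in O(n).
import numpy as np

def soft_threshold(z, thresh):
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def pcg(X, y, lam=1e-2, tau=0.1, epochs=50, seed=0):
    n, d = X.shape
    w = np.zeros(d)
    r = X @ w - y                                   # residual Xw - y
    col_sq = np.einsum('ij,ij->j', X, X)            # per-coordinate curvature ||X[:, i]||^2
    rng = np.random.default_rng(seed)
    for _ in range(epochs * d):
        i = rng.integers(d)                         # sample a coordinate (feature)
        grad_i = X[:, i] @ r + lam * w[i]           # partial gradient, O(n)
        gamma = 1.0 / (col_sq[i] + lam)             # coordinate-wise step size (assumed)
        w_new = soft_threshold(w[i] - gamma * grad_i, gamma * tau)
        r += (w_new - w[i]) * X[:, i]               # O(n) residual update
        w[i] = w_new
    return w
```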

Page 122: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

APCG for primal problem

Per-iteration cost: O(n)

Iteration complexity`(z) ≡ `(z , y)

Non-smooth Smooth

R(w)Non-strongly convex N.A. O

(d√ε

)λ-strongly convex N.A. O

((d√λ

)log(

))n >> d : Apply APCG to the dual problem.d >> n: Apply APCG to the primal problem.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 91 / 234

Page 123: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Which Algorithm to Use

Satisfied with large ε: SGD.
For small ε, with ℓ(z) ≡ ℓ(z, y):

                              | Non-smooth ℓ | Smooth ℓ
R(w) non-strongly convex      | SGD          | SAGA
R(w) λ-strongly convex        | APCG         | APCG

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 92 / 234

Page 124: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Summary

smooth | str-cvx | SGD     | SAGA               | SDCA               | APCG
No     | No      | 1/ε^2   | N.A.               | N.A.               | N.A.
Yes    | No      | 1/ε^2   | n/ε                | N.A.               | N.A.
No     | Yes     | 1/(λε)  | N.A.               | n + 1/(λε)         | n + √(n/(λε))
Yes    | Yes     | 1/(λε)  | (n + 1/λ) log(1/ε) | (n + 1/λ) log(1/ε) | (n + √(n/λ)) log(1/ε)

Table: Per-iteration cost: O(d)

smooth | str-cvx | SVRG
No     | No      | N.A.
Yes    | No      | N.A.
No     | Yes     | N.A.
Yes    | Yes     | log(1/ε)

Table: Per-iteration cost: O( d(n + 1/λ) )

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 93 / 234

Page 129: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Summary

               | SGD  | SVRG   | SAGA  | SDCA | APCG
Memory         | O(d) | O(d)   | O(dn) | O(d) | O(d)
Parameters     | γ_t  | γ_t, K | γ_t   | None | η_t
absolute loss  | ✓    | ✗      | ✗     | ✓    | ✓
hinge loss     | ✓    | ✗      | ✗     | ✓    | ✓
square loss    | ✓    | ✓      | ✓     | ✓    | ✓
squared hinge  | ✓    | ✓      | ✓     | ✓    | ✓
logistic loss  | ✓    | ✓      | ✓     | ✓    | ✓
λ > 0          | ✓    | ✓      | ✓     | ✓    | ✓
λ = 0          | ✓    | ✓      | ✓     | ✗    | ✗
Primal         | ✓    | ✓      | ✓     | ✗    | ✓
Dual           | ✗    | ✗      | ✗     | ✓    | ✓

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 94 / 234

Page 130: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Outline

2 Optimization
  (Sub)Gradient Methods
  Stochastic Optimization Algorithms for Big Data
    Stochastic Optimization
    Distributed Optimization

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 95 / 234

Page 131: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Big Data and Distributed Optimization

Distributed Optimization
  data distributed over a cluster of multiple machines
  moving it to a single machine suffers from low network bandwidth and limited disk or memory

communication V.S. computation
  RAM: 100 nanoseconds
  standard network connection: 250,000 nanoseconds

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 96 / 234

Page 132: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Distributed Data

N data points are partitioned and distributed to m machines:

[x_1, x_2, . . . , x_N] = S_1 ∪ S_2 ∪ · · · ∪ S_m

Machine j only has access to S_j. W.L.O.G.: |S_j| = n = N/m

S_1  S_2  S_3  S_4  S_5  S_6

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 97 / 234

Page 133: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

A simple solution: Average Solution

Global problem:

w* = argmin_{w ∈ R^d}  F(w) = (1/N) ∑_{i=1}^N ℓ( w^T x_i, y_i ) + R(w)

Machine j solves a local problem:

ŵ_j = argmin_{w ∈ R^d}  f_j(w) = (1/n) ∑_{i ∈ S_j} ℓ( w^T x_i, y_i ) + R(w)

S_1  S_2  S_3  S_4  S_5  S_6
ŵ_1  ŵ_2  ŵ_3  ŵ_4  ŵ_5  ŵ_6

The center computes: w̄ = (1/m) ∑_{j=1}^m ŵ_j.   Issue: this will not converge to w*.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 98 / 234

Page 135: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Mini-Batch SGD: Average Stochastic Gradient

Machine j samples it ∈ Sj and constructs a stochastic gradient

    gj(wt−1) = ∂ℓ(w⊤t−1 xit , yit) + ∂R(wt−1)

S1 S2 S3 S4 S5 S6
g1(wt−1) g2(wt−1) g3(wt−1) g4(wt−1) g5(wt−1) g6(wt−1)

Center computes: wt = wt−1 − γt · (1/m) Σ_{j=1}^m gj(wt−1)   (mini-batch stochastic gradient)
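A sketch of one round of this scheme (simulating the m machines in a single process; logistic loss and ℓ2 regularization are assumptions made for the example):

    import numpy as np

    def logistic_grad(w, x, y):
        """Gradient of log(1 + exp(-y * w^T x)) with respect to w."""
        z = y * (w @ x)
        return -y * x / (1.0 + np.exp(z))

    def minibatch_sgd_step(w, partitions, lam, step_size, rng):
        """One round: each machine returns a stochastic gradient on its own sample;
        the center averages them and takes a gradient step."""
        grads = []
        for X_j, y_j in partitions:                 # loop simulates the m machines
            i = rng.integers(len(y_j))              # machine j samples i_t from S_j
            grads.append(logistic_grad(w, X_j[i], y_j[i]) + lam * w)
        return w - step_size * np.mean(grads, axis=0)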

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 99 / 234

Page 136: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Total Runtime

Single machine:
    Total Runtime = Per-iteration Cost × Iteration Complexity

Distributed optimization:
    Total Runtime = (Communication Time Per-round + Local Runtime Per-round) × Rounds of Communication

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 100 / 234

Page 137: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Mini-Batch SGD

Applicable in all settings!
    Communication Time Per-round: increases with m in a complicated way.
    Local Runtime Per-round: O(1)

Suppose R(w) is λ-strongly convex
    Rounds of Communication: O(1/(mλε))

Suppose R(w) is non-strongly convex
    Rounds of Communication: O(1/(mε²))

More machines reduce the rounds of communication but increase the communication time per-round.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 101 / 234


Page 139: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Distributed SDCA (Yang, 2013; Ma et al., 2015)

Only works when R(w) = (λ/2)‖w‖²₂ with λ > 0. (No ℓ1)

Global dual problem

    max_{α∈R^N} (1/N) Σ_{i=1}^N −ℓ*i(αi) − (λ/2) ‖ (1/(λN)) Σ_{i=1}^N αi xi ‖²₂

    α = [αS1 , αS2 , · · · , αSm ]

Machine j solves a local dual problem only over αSj

    max_{αSj∈R^n} (1/N) Σ_{i∈Sj} −ℓ*i(αi) − (λ/2) ‖ (1/(λN)) Σ_{i=1}^N αi xi ‖²₂

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 102 / 234


Page 141: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DisDCA (Yang, 2013), CoCoA+ (Ma et al., 2015)

Center maintains a primal solution: wt = (1/(λN)) Σ_{i=1}^N αti xi

Change of variable: αi −→ ∆αi

    max_{∆αSj∈R^n} (1/N) Σ_{i∈Sj} −ℓ*i(αti + ∆αi) − (λ/2) ‖ (1/(λN)) ( Σ_{i=1}^N αti xi + Σ_{i∈Sj} ∆αi xi ) ‖²₂

    ⇐⇒ max_{∆αSj∈R^n} (1/N) Σ_{i∈Sj} −ℓ*i(αti + ∆αi) − (λ/2) ‖ wt + (1/(λN)) Σ_{i∈Sj} ∆αi xi ‖²₂

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 103 / 234

Page 142: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DisDCA (Yang, 2013), CoCoA+ (Ma et al., 2015)

Machine j approximately solves

    ∆αtSj ≈ arg max_{∆αSj∈R^n} (1/N) Σ_{i∈Sj} −ℓ*i(αti + ∆αi) − (λ/2) ‖ wt + (1/(λN)) Σ_{i∈Sj} ∆αi xi ‖²₂

    αt+1Sj = αtSj + ∆αtSj ,   ∆wtj = (1/(λN)) Σ_{i∈Sj} ∆αti xi

S1 S2 S3 S4 S5 S6
∆wt1 ∆wt2 ∆wt3 ∆wt4 ∆wt5 ∆wt6

Center computes: wt+1 = wt + Σ_{j=1}^m ∆wtj
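A schematic of this communication pattern (the local dual solver is deliberately left abstract; this is a sketch of the update flow under our own naming, not the DisDCA/CoCoA+ implementation):

    import numpy as np

    def cocoa_round(w, alpha, partitions, lam, N, local_solver):
        """One round of the DisDCA/CoCoA+ pattern: every machine solves its local
        dual subproblem around the shared primal point w, returns delta_alpha_j and
        the induced primal update delta_w_j; the center sums the updates."""
        delta_w_total = np.zeros_like(w)
        for X_j, idx_j in partitions:
            # local_solver returns an (approximate) maximizer of the local subproblem
            delta_alpha_j = local_solver(w, alpha[idx_j], X_j, lam, N)
            alpha[idx_j] += delta_alpha_j
            delta_w_total += X_j.T @ delta_alpha_j / (lam * N)   # Δw_j = (1/λN) Σ Δα_i x_i
        return w + delta_w_total, alpha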

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 104 / 234


Page 144: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

CoCoA+ (Ma et al., 2015)

Local objective value

    Gj(∆αSj , wt) = (1/N) Σ_{i∈Sj} −ℓ*i(αti + ∆αi) − (λ/2) ‖ wt + (1/(λN)) Σ_{i∈Sj} ∆αi xi ‖²₂

Solve ∆αtSj by any local solver as long as

    ( max_{∆αSj} Gj(∆αSj , wt) − Gj(∆αtSj , wt) ) ≤ Θ ( max_{∆αSj} Gj(∆αSj , wt) − Gj(0, wt) )

where 0 < Θ < 1.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 105 / 234

Page 145: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

CoCoA+ (Ma et al., 2015)

Suppose ℓ(z) is smooth, R(w) is λ-strongly convex, and SDCA is the local solver
    Local Runtime Per-round: O((1/λ + N/m) log(1/Θ))
    Rounds of Communication: O((1/(1−Θ)) · (1/λ) · log(1/ε))

Suppose ℓ(z) is non-smooth, R(w) is λ-strongly convex, and SDCA is the local solver
    Local Runtime Per-round: O((1/λ + N/m) · (1/Θ))
    Rounds of Communication: O((1/(1−Θ)) · (1/(λε)))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 106 / 234


Page 147: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Distributed SDCA in Practice

Choice of Θ (how long do we run the local solver?)
Choice of m (how many machines to use?)
    Fast machines but slow network: use small Θ and small m
    Fast network but slow machines: use large Θ and large m

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 107 / 234


Page 149: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DiSCO (Zhang & Xiao, 2015)

The rounds of communication of Distributed SDCA do not depend on m.

DiSCO: Distributed Second-Order method (Zhang & Xiao, 2015)
    The rounds of communication of DiSCO depend on m.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 108 / 234

Page 150: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DiSCO (Zhang & Xiao, 2015)

Global problem

    min_{w∈R^d} F(w) = (1/N) Σ_{i=1}^N ℓ(w⊤xi, yi) + R(w)

Local problem

    min_{w∈R^d} fj(w) = (1/n) Σ_{i∈Sj} ℓ(w⊤xi, yi) + R(w)

Global problem can be written as

    min_{w∈R^d} F(w) = (1/m) Σ_{j=1}^m fj(w)

Applicable when fj(w) is smooth, λ-strongly convex, and self-concordant

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 109 / 234

Page 151: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Newton Direction

At the optimal solution w⋆: ∇F(w⋆) = 0

We hope moving wt along −vt leads to ∇F(wt − vt) = 0. Taylor expansion:

    ∇F(wt − vt) ≈ ∇F(wt) − ∇²F(wt)vt = 0

Such a vt is called a Newton direction.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 110 / 234

Page 152: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Newton Method

Newton Method
    Find a Newton direction vt by solving

        ∇²F(wt)vt = ∇F(wt)

    Then update
        wt+1 = wt − γt vt

Requires solving a d × d linear system. Costly when d > 1000.
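A minimal sketch of one Newton update via a direct d × d solve (grad_fn and hess_fn are assumed user-supplied callbacks):

    import numpy as np

    def newton_step(w, grad_fn, hess_fn, step_size=1.0):
        """Solve the d x d linear system  H(w) v = g(w)  for the Newton direction,
        then update w <- w - step_size * v.  A direct solve costs O(d^3)."""
        g = grad_fn(w)
        H = hess_fn(w)
        v = np.linalg.solve(H, g)
        return w - step_size * v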

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 111 / 234

Page 153: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DiSCO (Zhang & Xiao, 2015)

Inexact Newton Method
    Find an inexact Newton direction vt using Preconditioned Conjugate Gradient (PCG) (Golub & Ye, 1997):

        ‖∇²F(wt)vt − ∇F(wt)‖2 ≤ εt

    Then update
        wt+1 = wt − γt vt

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 112 / 234

Page 154: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DiSCO (Zhang & Xiao, 2015)

PCG
    Keep computing

        v ←− P⁻¹ × ∇²F(wt) × v

    iteratively until v becomes an inexact Newton direction.

Preconditioner: P = ∇²f1(wt) + µI
    µ: a tuning parameter such that

        ‖∇²f1(wt) − ∇²F(wt)‖2 ≤ µ

    f1 is “similar” to F, so P is a good local preconditioner.
    µ = O(1/√n) = O(√(m/N))
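A generic preconditioned conjugate gradient sketch for finding such an inexact Newton direction (hess_vec and prec_solve are assumed callbacks; in DiSCO the Hessian-vector product would be computed distributedly and the preconditioner solve done on machine 1):

    import numpy as np

    def pcg(hess_vec, grad, prec_solve, tol=1e-6, max_iter=100):
        """Preconditioned conjugate gradient for H v = g.
        hess_vec(v): returns H v (the Hessian-vector product)
        prec_solve(r): returns P^{-1} r (the preconditioner solve)"""
        v = np.zeros_like(grad)
        r = grad - hess_vec(v)            # residual g - H v
        z = prec_solve(r)
        p = z.copy()
        rz = r @ z
        for _ in range(max_iter):
            Hp = hess_vec(p)
            alpha = rz / (p @ Hp)
            v += alpha * p
            r -= alpha * Hp
            if np.linalg.norm(r) <= tol:  # ||H v - g|| small: inexact Newton direction found
                break
            z = prec_solve(r)
            rz_new = r @ z
            p = z + (rz_new / rz) * p
            rz = rz_new
        return v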

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 113 / 234


Page 156: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DiSCO (Zhang & Xiao, 2015)

DiSCO computes ∇²F(wt) × v distributedly:

    ∇²F(wt) × v = (1/m) [ ∇²f1(wt) × v (machine 1) + ∇²f2(wt) × v (machine 2) + · · · + ∇²fm(wt) × v (machine m) ]

Then, compute P⁻¹ × ∇²F(wt) × v only on machine 1.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 114 / 234

[Figure-only slides (pages 157-160): DiSCO (Zhang & Xiao, 2015) illustrations]

Page 161: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DiSCO (Zhang & Xiao, 2015)

For high-dimensional data (e.g., d ≥ 1000), computing P⁻¹ × ∇²F(wt) × v is costly.
Instead, use SDCA on machine 1 to solve

    P⁻¹ × ∇²F(wt) × v ≈ arg min_{u∈R^d} (1/2) u⊤Pu − u⊤∇²F(wt)v

Local runtime: O(N/m + (1+µ)/(λ+µ))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 119 / 234

Page 162: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DiSCO (Zhang & Xiao, 2015)

Suppose SDCA is the local solver
    Local Runtime Per-round: O(N/m + (1+µ)/(λ+µ))
    Rounds of Communication: O(√(µ/λ) log(1/ε))

Choice of m (how many machines to use?)  (recall µ = O(√(m/N)))
    Fast machines but slow network: use small m
    Fast network but slow machines: use large m

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 120 / 234


Page 164: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DSVRG (Lee et al., 2015)

A distributed version of SVRG using a “round-robin” scheme
    Assumes the user can control how the data are distributed before the algorithm starts.
    Applicable when fj(w) is smooth and λ-strongly convex.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 121 / 234

Page 165: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DSVRG (Lee et al., 2015)

Iterate s = 1, . . . , T − 1
    Let w0 = w̃s and compute ∇F(w̃s)                          [easy to distribute]
    Iterate t = 1, . . . , K
        g̃it(wt−1) = ∇F(w̃s) − git(w̃s) + git(wt−1)            [hard to distribute]
        wt = wt−1 − γt g̃it(wt−1)
    w̃s+1 = (1/K) Σ_{t=1}^K wt
Output: w̃T

Each machine can only sample from its own data. However, E_{it∈Sj}[g̃it(wt−1)] ≠ ∇F(wt−1).

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 122 / 234


Page 169: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DSVRG (Lee et al., 2015)

Solution:
    Store a second set of data Rj on machine j, sampled with replacement from x1, x2, . . . , xN before the algorithm starts.
    Construct the stochastic gradient g̃it(wt−1) by sampling it ∈ Rj and removing it from Rj afterwards.

        E_{it∈Rj}[g̃it(wt−1)] = ∇F(wt−1)

    When Rj = ∅, pass wt to the next machine.
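A sketch of the resulting inner loop on one machine (grad_i is an assumed callback returning the gradient of the i-th loss term; the averaging of iterates and the round-robin hand-off logic are omitted):

    import numpy as np

    def dsvrg_inner_pass(w, w_snapshot, full_grad, R_j, grad_i, step_size):
        """Run variance-reduced steps on one machine, consuming its auxiliary set R_j
        (indices sampled with replacement from the full dataset beforehand)."""
        for i in R_j:                      # each index is used once, then discarded
            g = full_grad - grad_i(w_snapshot, i) + grad_i(w, i)   # unbiased for the full gradient at w
            w = w - step_size * g
        return w                           # when R_j is exhausted, pass w to the next machine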

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 123 / 234

[Figure-only slides (pages 170-174): DSVRG "Full Gradient Step" and "Stochastic Gradient Step" illustrations]

Page 175: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DSVRG (Lee et al., 2015)

Suppose |Rj| = r for all j.
    Local Runtime Per-round: O(N/m + r)
    Rounds of Communication: O((1/(rλ)) log(1/ε))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 129 / 234

Page 176: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

DSVRG (Lee et al., 2015)

Choice of m (how many machines to use?)
    Fast machines but slow network: use small m
    Fast network but slow machines: use large m

Choice of r (how many data points to pre-sample into Rj?)
    The larger, the better.
    Required machine memory: |Sj| + |Rj| = N/m + r

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 130 / 234

Page 177: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Other Distributed Optimization Methods

ADMM (Boyd et al., 2011; Ozdaglar, 2015)
    Rounds of Communication: O(Network Graph Dependency Term × (1/√λ) log(1/ε))

DANE (Shamir et al., 2014)
    Approximates the Newton direction with a different approach from DiSCO

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 131 / 234

Page 178: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Summary

ℓ(z) is smooth and R(w) is λ-strongly convex.

    alg.                  Mini-SGD      Dist-SDCA            DiSCO
    Runtime Per-round     O(1)          O(1/λ + N/m)         O(N/m)
    Rounds of Comm.       O(1/(λmε))    O((1/λ) log(1/ε))    O((m^{1/4}/(√λ N^{1/4})) log(1/ε))

    Table: assume µ = O(√(m/N))

    alg.                  DSVRG (Full Grad.)    DSVRG (Stoch. Grad.)
    Runtime Per-round     O(N/m)                O(r)
    Rounds of Comm.       O(log(1/ε))           O((1/(rλ)) log(1/ε))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 132 / 234

Page 179: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Summary

ℓ(z) is non-smooth and R(w) is λ-strongly convex.

    alg.                  Mini-SGD      Dist-SDCA        DiSCO    DSVRG
    Runtime Per-round     O(1)          O(1/λ + N/m)     N.A.     N.A.
    Rounds of Comm.       O(1/(λmε))    O(1/(λε))        N.A.     N.A.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 133 / 234

Page 180: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Summary

R(w) is non-strongly convex.

    alg.                  Mini-SGD      Dist-SDCA    DiSCO    DSVRG
    Runtime Per-round     O(1)          N.A.         N.A.     N.A.
    Rounds of Comm.       O(1/(mε²))    N.A.         N.A.     N.A.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 134 / 234

Page 181: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Which Algorithm to Use

Algorithm to use, for ℓ(z) ≡ ℓ(z, y):

                                     Non-smooth ℓ      Smooth ℓ
    R(w) non-strongly convex         Mini-SGD          DSVRG
    R(w) λ-strongly convex           Dist-SDCA         DiSCO/DSVRG

Between DiSCO and DSVRG:
    Use DiSCO for small λ, e.g., λ < 10⁻⁵
    Use DSVRG for large λ, e.g., λ > 10⁻⁵

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 135 / 234

Page 182: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Distributed Machine Learning Systems and Library

Petuum: http://petuum.github.io

Apache Spark: http://spark.apache.org/

Parameter Server: http://parameterserver.org/

Birds: http://cs.uiowa.edu/˜tyng/software.html

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 136 / 234

Page 183: Big Data Analytics: Optimization and Randomization

Optimization Stochastic Optimization Algorithms for Big Data

Thank You! Questions?

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 137 / 234

Page 184: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Big Data Analytics: Optimization and Randomization

Part III: Randomization

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 138 / 234

Page 185: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Outline

1 Basics

2 Optimization

3 Randomized Dimension Reduction

4 Randomized Algorithms

5 Concluding Remarks

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 139 / 234

Page 186: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Random Sketch

Approximate a large data matrix

by a much smaller sketch

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 140 / 234

[Figure-only slides (pages 187-190): "The Framework of Randomized Algorithms" illustrations]

Page 191: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Why randomized dimension reduction?

Efficient

Robust (e.g., dropout)

Formal Guarantees

Can explore parallel algorithms

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 145 / 234

Page 192: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Randomized Dimension Reduction

Johnson-Lindenstrauss (JL) transforms

Subspace embeddings

Column sampling

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 146 / 234

Page 193: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

JL Lemma

JL Lemma (Johnson & Lindenstrauss, 1984)
For any 0 < ε, δ < 1/2, there exists a probability distribution on m × d real matrices A such that there exists a small universal constant c > 0 and, for any fixed x ∈ R^d, with probability at least 1 − δ we have

    | ‖Ax‖²₂ − ‖x‖²₂ | ≤ c √(log(1/δ)/m) ‖x‖²₂

or, for m = Θ(ε⁻² log(1/δ)), with probability at least 1 − δ,

    | ‖Ax‖²₂ − ‖x‖²₂ | ≤ ε ‖x‖²₂

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 147 / 234

Page 194: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Embedding a set of points into low dimensional space

Given a set of points x1, . . . , xn ∈ R^d, we can embed them into a low-dimensional space Ax1, . . . , Axn ∈ R^m such that the pairwise distance between any two points is well preserved in the low-dimensional space:

    ‖Axi − Axj‖²₂ = ‖A(xi − xj)‖²₂ ≤ (1 + ε) ‖xi − xj‖²₂
    ‖Axi − Axj‖²₂ = ‖A(xi − xj)‖²₂ ≥ (1 − ε) ‖xi − xj‖²₂

In other words, to preserve all pairwise Euclidean distances up to 1 ± ε, only m = Θ(ε⁻² log(n²/δ)) dimensions are necessary.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 148 / 234

Page 195: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

JL transforms: Gaussian Random Projection

Gaussian Random Projection (Dasgupta & Gupta, 2003): A ∈ R^{m×d}

    Aij ∼ N(0, 1/m),    m = Θ(ε⁻² log(1/δ))

Computational cost of AX, where X ∈ R^{d×n}:
    mnd for dense matrices
    nnz(X)·m for sparse matrices

The computational cost is very high (could be as high as solving many problems).
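A minimal numpy sketch of the Gaussian sketch itself (names are ours):

    import numpy as np

    def gaussian_sketch(X, m, rng):
        """Gaussian JL transform: A has i.i.d. N(0, 1/m) entries, so E||Ax||^2 = ||x||^2.
        X is d x n (columns are data points); returns the m x n sketch A @ X."""
        d = X.shape[0]
        A = rng.normal(scale=1.0 / np.sqrt(m), size=(m, d))
        return A @ X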

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 149 / 234

Page 196: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Accelerate JL transforms: using discrete distributions

Using Discrete Distributions (Achlioptas, 2003):

    Pr(Aij = ±1/√m) = 0.5,   or
    Pr(Aij = ±√(3/m)) = 1/6,  Pr(Aij = 0) = 2/3

Database friendly: replaces multiplications by additions and subtractions.
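A hedged sketch of both discrete variants (illustrative only):

    import numpy as np

    def achlioptas_sketch(X, m, rng, sparse=True):
        """Database-friendly JL transform (Achlioptas, 2003).
        sparse=False: entries are +/- 1/sqrt(m) with probability 1/2 each.
        sparse=True : entries are +/- sqrt(3/m) with probability 1/6 each and 0 with probability 2/3."""
        d = X.shape[0]
        if sparse:
            vals = rng.choice([np.sqrt(3.0 / m), 0.0, -np.sqrt(3.0 / m)],
                              size=(m, d), p=[1 / 6, 2 / 3, 1 / 6])
        else:
            vals = rng.choice([1.0, -1.0], size=(m, d)) / np.sqrt(m)
        return vals @ X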

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 150 / 234

Page 197: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Accelerate JL transforms: using the Hadamard transform (I)

Fast JL transform based on the randomized Hadamard transform:

Motivation: can we simply use a random sampling matrix P ∈ R^{m×d} that randomly selects m coordinates out of d (scaled by √(d/m))?

Unfortunately, by the Chernoff bound,

    | ‖Px‖²₂ − ‖x‖²₂ | ≤ (√d ‖x‖∞ / ‖x‖2) √(3 log(2/δ)/m) ‖x‖²₂

Unless √d ‖x‖∞/‖x‖2 ≤ c, random sampling does not work.

The remedy is given by the randomized Hadamard transform.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 151 / 234


Page 200: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Randomized Hadamard transform

Hadamard transform: H ∈ R^{d×d}, H = √(1/d) H_{2^k}

    H1 = [1],   H2 = [ 1   1
                       1  −1 ],   H_{2^k} = [ H_{2^{k−1}}   H_{2^{k−1}}
                                              H_{2^{k−1}}  −H_{2^{k−1}} ]

    ‖Hx‖2 = ‖x‖2 and H is orthogonal
    Computational cost of Hx: d log(d)

Randomized Hadamard transform: HD
    D ∈ R^{d×d}: a diagonal matrix with Pr(Dii = ±1) = 0.5
    HD is orthogonal and ‖HDx‖2 = ‖x‖2

Key property:  √d ‖HDx‖∞ / ‖HDx‖2 ≤ √(log(d/δ))  w.h.p. 1 − δ
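A small sketch of the fast (Walsh-)Hadamard transform and its randomized version (assumes d is a power of 2; names are ours):

    import numpy as np

    def fwht(x):
        """Fast Walsh-Hadamard transform in O(d log d); d must be a power of 2.
        Dividing by sqrt(d) makes the transform orthogonal (norm preserving)."""
        x = x.copy()
        d = len(x)
        h = 1
        while h < d:
            for i in range(0, d, 2 * h):
                a = x[i:i + h].copy()
                b = x[i + h:i + 2 * h].copy()
                x[i:i + h] = a + b
                x[i + h:i + 2 * h] = a - b
            h *= 2
        return x / np.sqrt(d)

    def randomized_hadamard(x, signs):
        """Apply H D x with D = diag(signs), signs being random +/-1 entries.
        The result has the same norm as x but spread-out coordinates."""
        return fwht(signs * x)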

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 152 / 234


Page 203: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Accelerate JL transforms: using the Hadamard transform (I)

Fast JL transform based on the randomized Hadamard transform (Tropp, 2011):

    A = √(d/m) P H D

yields

    | ‖Ax‖²₂ − ‖x‖²₂ | ≤ √(3 log(2/δ) log(d/δ) / m) ‖x‖²₂

    m = Θ(ε⁻² log(1/δ) log(d/δ)) suffices for 1 ± ε
    the additional factor log(d/δ) can be removed
    computational cost of AX: O(nd log(m))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 153 / 234

Page 204: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Accelerate JL transforms: using a sparse matrix (I)

Random hashing (Dasgupta et al., 2010)

    A = H D,  where D ∈ R^{d×d} and H ∈ R^{m×d}

    random hash function h(j): {1, . . . , d} → {1, . . . , m}
    Hij = 1 if h(j) = i: a sparse matrix (each column has only one non-zero entry)
    D ∈ R^{d×d}: a diagonal matrix with Pr(Dii = ±1) = 0.5
    [Ax]j = Σ_{i: h(i)=j} xi Dii

Technically speaking, random hashing does not satisfy the JL lemma.
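A minimal sketch of this hashing sketch (a dense numpy implementation for clarity; for sparsely stored X the cost is O(nnz(X))):

    import numpy as np

    def random_hashing_sketch(X, m, rng):
        """Random hashing A = H D: each coordinate j gets a bucket h(j) and a random
        sign D_jj; [Ax]_b = sum over j with h(j)=b of D_jj * x_j.  X is d x n."""
        d, n = X.shape
        h = rng.integers(m, size=d)          # hash bucket for each coordinate
        s = rng.choice([-1.0, 1.0], size=d)  # random sign for each coordinate
        sketch = np.zeros((m, n))
        for j in range(d):                   # one pass over the rows of X
            sketch[h[j]] += s[j] * X[j]
        return sketch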

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 154 / 234


Page 206: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Accelerate JL transforms: using a sparse matrix (I)

Key properties:
    E[⟨HDx1, HDx2⟩] = ⟨x1, x2⟩
    norm preserving, | ‖HDx‖²₂ − ‖x‖²₂ | ≤ ε‖x‖²₂, only when ‖x‖∞/‖x‖2 ≤ 1/√c

Apply a randomized Hadamard transform P first (Θ(c log(c/δ)) blocks of randomized Hadamard transforms):

    ‖Px‖∞/‖Px‖2 ≤ 1/√c

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 155 / 234

Page 207: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Accelerate JL transforms: using a sparse matrix (II)

Sparse JL transform based on block random hashing (Kane & Nelson, 2014)

    A = [ (1/√s) Q1 ; . . . ; (1/√s) Qs ]   (s stacked blocks)

    Each Qi ∈ R^{v×d} is an independent random hashing (HD) matrix
    Set v = Θ(ε⁻¹) and s = Θ(ε⁻¹ log(1/δ))

Computational cost of AX: O((nnz(X)/ε) log(1/δ))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 156 / 234

Page 208: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Randomized Dimension Reduction

Johnson-Lindenstrauss (JL) transforms

Subspace embeddings

Column sampling

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 157 / 234

Page 209: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Subspace Embeddings

Definition: a subspace embedding, given parameters 0 < ε, δ < 1 and k ≤ d, is a distribution D over matrices A ∈ R^{m×d} such that for any fixed linear subspace W ⊆ R^d with dim(W) = k it holds that

    Pr_{A∼D} ( ∀x ∈ W, ‖Ax‖2 ∈ (1 ± ε)‖x‖2 ) ≥ 1 − δ

It implies: if U ∈ R^{d×k} is an orthogonal matrix (containing the orthonormal bases of W), then
    AU ∈ R^{m×k} has full column rank
    the singular values of AU lie in (1 ± ε), so (1 − ε)² ≤ σ(U⊤A⊤AU) ≤ (1 + ε)²

These are key properties in the theoretical analysis of many algorithms (e.g., low-rank matrix approximation, randomized least-squares regression, randomized classification).

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 158 / 234


Page 212: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Subspace Embeddings

From a JL transform to a Subspace Embedding (Sarlos, 2006).
Let A ∈ R^{m×d} be a JL transform. If

    m = O( k log(k/(δε)) / ε² )

then w.h.p. 1 − δk, A ∈ R^{m×d} is a subspace embedding w.r.t. a k-dimensional subspace of R^d.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 159 / 234

Page 213: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Subspace Embeddings

Making block random hashing a Subspace Embedding (Nelson & Nguyen, 2013).

    A = [ (1/√s) Q1 ; . . . ; (1/√s) Qs ]

    Each Qi ∈ R^{v×d} is an independent random hashing (HD) matrix
    Set v = Θ(k ε⁻¹ log⁵(k/δ)) and s = Θ(ε⁻¹ log³(k/δ))

w.h.p. 1 − δ, A ∈ R^{m×d} with m = Θ(k log⁸(k/δ)/ε²) is a subspace embedding w.r.t. a k-dimensional subspace of R^d.

Computational cost of AX: O((nnz(X)/ε) log³(k/δ))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 160 / 234

Page 214: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Sparse Subspace Embedding (SSE)

Random hashing is an SSE with a constant probability (Nelson & Nguyen, 2013)

    A = H D,  where D ∈ R^{d×d} and H ∈ R^{m×d}

    m = Ω(k²/ε²) suffices for a subspace embedding with probability 2/3
    Computational cost of AX: O(nnz(X))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 161 / 234

Page 215: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Randomized Dimensionality Reduction

Johnson-Lindenstrauss (JL) transforms

Subspace embeddings

Column (Row) sampling

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 162 / 234

Page 216: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Column sampling

Column subset selection (feature selection)
    More interpretable
    Uniform sampling usually does not work (it is not a JL transform)
    Non-oblivious sampling (data-dependent sampling):
        leverage-score sampling

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 163 / 234


Page 219: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Leverage-score sampling (Drineas et al., 2006)

Let X ∈ Rd×n be a rank-k matrixX = UΣV>: U ∈ Rd×k , Σ ∈ Rk×k

Leverage scores ‖Ui∗‖22, i = 1, . . . , d

Let pi =‖Ui∗‖2

2∑di=1 ‖Ui∗‖2

2, i = 1, . . . , d

Let i1, . . . , im ∈ 1, . . . , d denote m indices selected by following pi

Let A ∈ Rm×d be sampling-and-rescaling matrix:

Aij =

1√mpj

if j = ij

0 otherwise

AX ∈ Rm×n is a small sketch of X
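A small sketch of leverage-score sampling (names are ours; it computes an exact SVD, which is the expensive step the next slide warns about):

    import numpy as np

    def leverage_score_sketch(X, k, m, rng):
        """Sample m rows of X (d x n) with probabilities proportional to the
        rank-k leverage scores ||U_i*||^2, and rescale each by 1/sqrt(m*p_i)."""
        U, _, _ = np.linalg.svd(X, full_matrices=False)
        U_k = U[:, :k]
        scores = np.sum(U_k ** 2, axis=1)          # leverage scores, sum to k
        p = scores / scores.sum()
        idx = rng.choice(X.shape[0], size=m, replace=True, p=p)
        scale = 1.0 / np.sqrt(m * p[idx])
        return scale[:, None] * X[idx]             # the m x n sketch A X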

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 164 / 234


Page 224: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

Properties of Leverage-score sampling

When m = Θ((k/ε²) log(2k/δ)), w.h.p. 1 − δ,
    AU ∈ R^{m×k} has full column rank
    σ²i(AU) ≥ 1 − ε ≥ (1 − ε)²
    σ²i(AU) ≤ 1 + ε ≤ (1 + ε)²

Leverage-score sampling performs like a subspace embedding (only for U, the top singular vector matrix of X).
Computational cost: computing the top-k SVD of X is expensive.
Randomized algorithms can compute approximate leverage scores.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 165 / 234


Page 226: Big Data Analytics: Optimization and Randomization

Randomized Dimension Reduction

When uniform sampling makes sense?

Coherence measure

    µk = (d/k) max_{1≤i≤d} ‖Ui∗‖²₂

Uniform sampling is valid when the coherence measure is small (some real data mining datasets have small coherence measures).
The Nystrom method usually uses uniform sampling (Gittens, 2011).

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 166 / 234

Page 227: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Classification (Regression)

Outline

4 Randomized Algorithms
    Randomized Classification (Regression)
    Randomized Least-Squares Regression
    Randomized K-means Clustering
    Randomized Kernel methods
    Randomized Low-rank Matrix Approximation

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 167 / 234

Page 228: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Classification (Regression)

Classification

Classification problems:

    min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(yi w⊤xi) + (λ/2)‖w‖²₂

    yi ∈ {+1, −1}: label
    Loss function ℓ(z), z = y w⊤x:
        1. SVMs: (squared) hinge loss ℓ(z) = max(0, 1 − z)^p, where p = 1, 2
        2. Logistic regression: ℓ(z) = log(1 + exp(−z))

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 168 / 234

Page 229: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Classification (Regression)

Randomized Classification

For large-scale high-dimensional problems, the computational cost of optimization is O((nd + dκ) log(1/ε)).

Use a random reduction A ∈ R^{d×m} (m ≪ d) to reduce X ∈ R^{n×d} to X̃ = XA ∈ R^{n×m}. Then solve

    min_{u∈R^m} (1/n) Σ_{i=1}^n ℓ(yi u⊤x̃i) + (λ/2)‖u‖²₂

    JL transforms
    Sparse subspace embeddings

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 169 / 234

Page 230: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Classification (Regression)

Randomized Classification

Two questions:
    Is there any performance guarantee?
        Margin is preserved if the data are linearly separable (Balcan et al., 2006), as long as m ≥ (12/ε²) log(6m/δ)
        Generalization performance is preserved if the data matrix is of low rank and m = Ω(k · poly(log(k/(δε)))/ε²) (Paul et al., 2013)
    How to recover an accurate model in the original high-dimensional space?
        Dual Recovery (Zhang et al., 2014) and Dual Sparse Recovery (Yang et al., 2015)

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 170 / 234


Page 235: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Classification (Regression)

The Dual probelm

Using the Fenchel conjugate

    ℓ*i(αi) = max_z { αi z − ℓ(z, yi) }

Primal:
    w∗ = arg min_{w∈R^d} (1/n) Σ_{i=1}^n ℓ(w⊤xi, yi) + (λ/2)‖w‖²₂

Dual:
    α∗ = arg max_{α∈R^n} −(1/n) Σ_{i=1}^n ℓ*i(αi) − (1/(2λn²)) α⊤XX⊤α

From dual to primal:
    w∗ = −(1/(λn)) X⊤α∗

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 171 / 234

Page 236: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Classification (Regression)

Dual Recovery for Randomized Reduction

From the dual formulation: w∗ lies in the row space of the data matrix X ∈ R^{n×d}

Dual Recovery: ŵ∗ = −(1/(λn)) X⊤α̃∗, where

    α̃∗ = arg max_{α∈R^n} −(1/n) Σ_{i=1}^n ℓ*i(αi) − (1/(2λn²)) α⊤X̃X̃⊤α

and X̃ = XA ∈ R^{n×m}

Subspace embedding A with m = Θ(r log(r/δ) ε⁻²)

Guarantee: under a low-rank assumption on the data matrix X (e.g., rank(X) = r), with high probability 1 − δ,

    ‖ŵ∗ − w∗‖2 ≤ (ε/(1 − ε)) ‖w∗‖2

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 172 / 234


Page 239: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Classification (Regression)

Dual Sparse Recovery for Randomized Reduction

Assume the optimal dual solution α∗ is sparse (i.e., the number of support vectors is small)

Dual Sparse Recovery: ŵ∗ = −(1/(λn)) X⊤α̃∗, where

    α̃∗ = arg max_{α∈R^n} −(1/n) Σ_{i=1}^n ℓ*i(αi) − (1/(2λn²)) α⊤X̃X̃⊤α − (τ/n)‖α‖1

and X̃ = XA ∈ R^{n×m}

JL transform A with m = Θ(s log(n/δ) ε⁻²)

Guarantee: if α∗ is s-sparse, with high probability 1 − δ,

    ‖ŵ∗ − w∗‖2 ≤ ε‖w∗‖2

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 173 / 234


Page 242: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Classification (Regression)

Dual Sparse Recovery

RCV1 text data, n = 677,399 and d = 47,236.

[Figure: relative dual error (left panel) and relative primal error (right panel), measured in ℓ2-norm, versus the regularization parameter τ, for λ = 0.001 and sketch sizes m = 1024, 2048, 4096, 8192]

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 174 / 234

Page 243: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Least-Squares Regression

Outline

4 Randomized Algorithms
    Randomized Classification (Regression)
    Randomized Least-Squares Regression
    Randomized K-means Clustering
    Randomized Kernel methods
    Randomized Low-rank Matrix Approximation

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 175 / 234

Page 244: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Least-Squares Regression

Least-squares regression

Let X ∈ R^{n×d} with d ≪ n and b ∈ R^n. The least-squares regression problem is to find w∗ such that

    w∗ = arg min_{w∈R^d} ‖Xw − b‖2

Computational cost: O(nd²)
Goal of randomized algorithms: o(nd²)

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 176 / 234

Page 245: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Least-Squares Regression

Randomized Least-squares regression

Let A ∈ R^{m×n} be a random reduction matrix. Solve

    ŵ∗ = arg min_{w∈R^d} ‖A(Xw − b)‖2 = ‖AXw − Ab‖2

Computational cost: O(md²) + reduction time

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 177 / 234

Page 246: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Least-Squares Regression

Randomized Least-squares regression

Theoretical guarantees (Sarlos, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):

    ‖Xŵ∗ − b‖2 ≤ (1 + ε)‖Xw∗ − b‖2

Total time: O(nnz(X) + d³ log(d/ε) ε⁻²)
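A minimal sketch of sketch-and-solve least squares with a random-hashing sketch (illustrative only, not the exact construction analyzed in the cited papers):

    import numpy as np

    def sketched_least_squares(X, b, m, rng):
        """Solve min_w ||A X w - A b||_2 with a random-hashing sketch A in R^{m x n};
        an O(md^2) solve on the sketch replaces the O(nd^2) solve on the full data."""
        n, d = X.shape
        h = rng.integers(m, size=n)
        s = rng.choice([-1.0, 1.0], size=n)
        SX = np.zeros((m, d))
        Sb = np.zeros(m)
        np.add.at(SX, h, s[:, None] * X)          # A X
        np.add.at(Sb, h, s * b)                   # A b
        w_hat, *_ = np.linalg.lstsq(SX, Sb, rcond=None)
        return w_hat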

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 178 / 234

Page 247: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized K-means Clustering

Outline

4 Randomized Algorithms
    Randomized Classification (Regression)
    Randomized Least-Squares Regression
    Randomized K-means Clustering
    Randomized Kernel methods
    Randomized Low-rank Matrix Approximation

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 179 / 234

Page 248: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized K-means Clustering

K-means Clustering

Let x1, . . . , xn ∈ R^d be a set of data points.

K-means clustering aims to solve

    min_{C1,...,Ck} Σ_{j=1}^k Σ_{xi∈Cj} ‖xi − µj‖²₂

Computational cost: O(ndkt), where t is the number of iterations.

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 180 / 234

Page 249: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized K-means Clustering

Randomized Algorithms for K-means Clustering

Let X = (x1, . . . , xn)⊤ ∈ R^{n×d} be the data matrix.
High-dimensional data: random sketch X̃ = XA ∈ R^{n×m}, m ≪ d

Approximate K-means:

    min_{C1,...,Ck} Σ_{j=1}^k Σ_{x̃i∈Cj} ‖x̃i − µ̃j‖²₂

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 181 / 234


Page 251: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized K-means Clustering

Randomized Algorithms for K-means Clustering

For the random sketch, JL transforms and sparse subspace embeddings all work:
    JL transform: m = O(k log(k/(εδ))/ε²)
    Sparse subspace embedding: m = O(k²/(ε²δ))
    ε relates to the approximation accuracy

The approximation error analysis for K-means can be formulated as constrained low-rank approximation (Cohen et al., 2015):

    min_{Q⊤Q=I} ‖X − QQ⊤X‖²_F

where Q is orthonormal.
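A toy sketch of sketched K-means: project the data with a Gaussian JL transform, then run ordinary Lloyd iterations on the sketch (illustrative only):

    import numpy as np

    def sketched_kmeans(X, k, m, rng, iters=20):
        """Run Lloyd's iterations on a Gaussian sketch X A (n x m) of the data X (n x d);
        the resulting cluster labels can be reused on the original points."""
        n, d = X.shape
        A = rng.normal(scale=1.0 / np.sqrt(m), size=(d, m))
        Xs = X @ A                                       # n x m sketched data
        centers = Xs[rng.choice(n, size=k, replace=False)]
        for _ in range(iters):
            dists = ((Xs[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = dists.argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = Xs[labels == j].mean(axis=0)
        return labels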

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 182 / 234

Page 252: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Kernel methods

Outline

4 Randomized Algorithms
    Randomized Classification (Regression)
    Randomized Least-Squares Regression
    Randomized K-means Clustering
    Randomized Kernel methods
    Randomized Low-rank Matrix Approximation

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 183 / 234

Page 253: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Kernel methods

Kernel methods

Kernel function: κ(·, ·); a set of examples x1, . . . , xn
Kernel matrix: K ∈ R^{n×n} with Kij = κ(xi, xj)
    K is a PSD matrix
    Computational and memory costs: Ω(n²)

Approximation methods:
    The Nystrom method
    Random Fourier features

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 184 / 234


Page 255: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Kernel methods

The Nystrom method

Let A ∈ R^{n×ℓ} be a uniform sampling matrix.
    B = KA ∈ R^{n×ℓ}
    C = A⊤B = A⊤KA

The Nystrom approximation (Drineas & Mahoney, 2005):

    K̃ = B C† B⊤

Computational cost: O(ℓ³ + nℓ²)
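A small sketch of the Nystrom construction (K_fn is an assumed kernel callback; the RBF kernel below is only an example):

    import numpy as np

    def nystrom(K_fn, X, ell, rng):
        """Nystrom approximation K ~ B C^+ B^T from ell uniformly sampled columns.
        K_fn(A, B) returns the kernel matrix between the rows of A and the rows of B."""
        n = X.shape[0]
        idx = rng.choice(n, size=ell, replace=False)     # uniform column sampling
        B = K_fn(X, X[idx])                              # n x ell
        C = B[idx]                                       # ell x ell
        return B, np.linalg.pinv(C)                      # K ~ B @ pinv(C) @ B.T

    def rbf(A, B, gamma=1.0):
        """Example kernel: exp(-||a - b||^2 / (2 gamma^2))."""
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * gamma ** 2))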

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 185 / 234


Page 257: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Kernel methods

The Nystrom based kernel machine

The dual problem:

    arg max_{α∈R^n} −(1/n) Σ_{i=1}^n ℓ*i(αi) − (1/(2λn²)) α⊤ B C† B⊤ α

Solve it as a linear method with X̃ = B C^{−1/2} ∈ R^{n×ℓ}:

    arg max_{α∈R^n} −(1/n) Σ_{i=1}^n ℓ*i(αi) − (1/(2λn²)) α⊤ X̃ X̃⊤ α

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 186 / 234

[Figure-only slide (page 258): the Nystrom-based kernel machine, illustration]

Page 259: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Kernel methods

Random Fourier Features (RFF)

Bochner's theorem
A shift-invariant kernel κ(x, y) = κ(x − y) is a valid kernel if and only if κ(δ) is the Fourier transform of a non-negative measure, i.e.,

    κ(x − y) = ∫ p(ω) e^{−jω⊤(x−y)} dω

RFF (Rahimi & Recht, 2008): generate ω1, . . . , ωm ∈ R^d following p(ω). For an example x ∈ R^d, construct

    x̃ = (cos(ω⊤1 x), sin(ω⊤1 x), . . . , cos(ω⊤m x), sin(ω⊤m x))⊤ ∈ R^{2m}

RBF kernel exp(−‖x − y‖²₂/(2γ²)): p(ω) = N(0, γ⁻² I)
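A minimal sketch of the RFF feature map for the RBF kernel above (names are ours; the 1/√m scaling makes inner products of feature vectors approximate kernel values):

    import numpy as np

    def rff_features(X, m, gamma, rng):
        """Random Fourier features for exp(-||x - y||^2 / (2 gamma^2)):
        draw omega_1..omega_m ~ N(0, (1/gamma^2) I) and map each row x of X to
        (cos(omega_i^T x), sin(omega_i^T x))_i / sqrt(m)."""
        n, d = X.shape
        W = rng.normal(scale=1.0 / gamma, size=(d, m))
        Z = X @ W
        return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)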

Yang, Lin, Jin Tutorial for KDD’15 August 10, 2015 188 / 234


Page 262: Big Data Analytics: Optimization and Randomization

Randomized Algorithms Randomized Kernel methods

The Nystrom method vs RFF (Yang et al., 2012)

Functional approximation framework
The Nystrom method: data-dependent bases
RFF: data-independent bases
In certain cases (e.g., large eigen-gap, skewed eigenvalue distribution): the generalization performance of the Nystrom method is better than RFF


Outline

4 Randomized Algorithms
  Randomized Classification (Regression)
  Randomized Least-Squares Regression
  Randomized K-means Clustering
  Randomized Kernel methods
  Randomized Low-rank Matrix Approximation


Randomized low-rank matrix approximation

Let X ∈ R^{n×d}. The goal is to obtain

UΣV^⊤ ≈ X

where U ∈ R^{n×k} and V ∈ R^{d×k} have orthonormal columns and Σ ∈ R^{k×k} is a diagonal matrix with nonnegative entries
k is the target rank
The best rank-k approximation is X_k = U_kΣ_kV_k^⊤
Approximation error:

‖UΣV^⊤ − X‖_ξ ≤ (1 + ε)‖U_kΣ_kV_k^⊤ − X‖_ξ

where ξ = F or ξ = 2


Why low-rank approximation?

Applications in data mining and machine learning: PCA, spectral clustering, · · ·


Why randomized algorithms?

Deterministic algorithms:
Truncated SVD: O(nd min(n, d))
Rank-revealing QR factorization: O(ndk)
Krylov subspace methods (e.g., the Lanczos algorithm): O(kT_mult + (n + d)k²), where T_mult denotes the cost of a matrix-vector product

Randomized algorithms:
Speed can be faster (e.g., O(nd log(k)))
Output is more robust (e.g., Lanczos requires sophisticated modifications)
Can be pass-efficient
Can exploit parallel algorithms


Randomized algorithms for low-rank matrix approximation

The basic randomized algorithm for approximating X ∈ R^{n×d} (Halko et al., 2011):

1. Obtain a small sketch Y = XA ∈ R^{n×m}
2. Compute Q ∈ R^{n×m} that contains an orthonormal basis of col(Y)
3. Compute the SVD of Q^⊤X = ÛΣV^⊤
4. Approximation: X ≈ UΣV^⊤, where U = QÛ

Explanation: if col(XA) captures the top-k column space of X well, i.e.,

‖X − QQ^⊤X‖ ≤ ε,

then ‖X − UΣV^⊤‖ ≤ ε
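A minimal NumPy sketch of steps 1–4 with a Gaussian test matrix A (the Gaussian choice and the oversampling p = 5 are assumptions for illustration; the slides also discuss structured transforms that are faster):

import numpy as np

def randomized_low_rank(X, k, p=5, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(d, k + p))              # random test matrix, m = k + p columns
    Y = X @ A                                    # step 1: small sketch of the column space
    Q, _ = np.linalg.qr(Y)                       # step 2: orthonormal basis of col(Y)
    Uh, S, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)   # step 3: SVD of the m x d matrix
    U = Q @ Uh                                   # step 4: lift back, X ~= U diag(S) V^T
    return U[:, :k], S[:k], Vt[:k]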


Randomized algorithms for low-rank matrix approximation

Three questions:

1. What is the value of m?
   m = k + p, where p is the oversampling parameter. In practice p = 5 or 10 gives superb results
2. What is the computational cost?
   With a subsampled randomized Hadamard transform, it can be as fast as O(nd log(k) + k²(n + d))
3. What is the quality?
   Theoretical guarantees hold (see the appendix); practically, very accurate



Randomized algorithms for low-rank matrix approximation

Other things:
Use power iteration to reduce the error: use (XX^⊤)^q X in place of X (a sketch follows below)
Can use sparse JL transform / subspace embedding matrices (Frobenius norm guarantee only)
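A hedged sketch of the power scheme mentioned above: the sketch Y = XA is replaced by Y = (XX^⊤)^q XA, computed pass by pass, with QR re-orthonormalization between passes as a standard numerical safeguard (the re-orthonormalization is an assumption, not spelled out on the slide):

import numpy as np

def power_sketch(X, m, q=2, seed=0):
    # basis for col((X X^T)^q X A), computed without ever forming X X^T
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(X @ rng.normal(size=(X.shape[1], m)))
    for _ in range(q):
        Q, _ = np.linalg.qr(X.T @ Q)     # d x m
        Q, _ = np.linalg.qr(X @ Q)       # back to n x m
    return Q                             # use Q in place of the basis from the plain sketch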


Concluding Remarks

Outline

1 Basics

2 Optimization

3 Randomized Dimension Reduction

4 Randomized Algorithms

5 Concluding Remarks


How to address the big data challenge?

Optimization perspective: improve convergence rates, exploring properties of functions
  stochastic optimization (e.g., SDCA, SVRG, SAGA)
  distributed optimization (e.g., DisDCA)

Randomization perspective: reduce data size, exploring properties of data
  randomized feature reduction (e.g., reduce the number of features)
  randomized instance reduction (e.g., reduce the number of instances)


How can we address the big data challenge?

Optimization perspective: improve convergence rates, exploring properties of functions
  Pro: can obtain the optimal solution
  Con: high computational/communication costs

Randomization perspective: reduce data size, exploring properties of data
  Pro: fast
  Con: recovery error still exists

Can we combine the benefits of the two techniques?


Combine Randomization and Optimization (Yang et al., 2015)

Use randomization (Dual Sparse Recovery) to obtain a good initial solution

Initialize distributed optimization (DisDCA) with it to reduce the cost of computation/communication

Observe that 1 or 2 epochs of computation (1 or 2 communications) suffice to obtain the same performance as pure optimization


Big Data Experiments

KDDcup data: n = 8,407,752, d = 29,890,095, 10 machines, m = 1024

[Figure: on the kdd data, testing error (%) and training time (s) for DSRR, DSRR−Rec, DSRR−DisDCA−1, DSRR−DisDCA−2, and DisDCA]


Thank You! Questions?


References I

Achlioptas, Dimitris. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.

Balcan, Maria-Florina, Blum, Avrim, and Vempala, Santosh. Kernels as features: on kernels, margins, and low-dimensional mappings. Machine Learning, 65(1):79–94, 2006.

Boyd, S., Parikh, N., Chu, E., Peleato, B., and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.


References II

Cohen, Michael B., Elder, Sam, Musco, Cameron, Musco, Christopher, and Persu, Madalina. Dimensionality reduction for k-means clustering and low rank approximation. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC), pp. 163–172, 2015.

Dasgupta, Anirban, Kumar, Ravi, and Sarlos, Tamas. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC '10, pp. 341–350, 2010.

Dasgupta, Sanjoy and Gupta, Anupam. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, 2014.


References III

Drineas, Petros and Mahoney, Michael W. On the Nystrom method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6, 2005.

Drineas, Petros, Mahoney, Michael W., and Muthukrishnan, S. Sampling algorithms for ℓ2 regression and applications. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1127–1136, 2006.

Drineas, Petros, Mahoney, Michael W., Muthukrishnan, S., and Sarlos, Tamas. Faster least squares approximation. Numerische Mathematik, 117(2):219–249, February 2011.

Gittens, Alex. The spectral norm error of the naive Nystrom extension. CoRR, 2011.

Golub, Gene H. and Ye, Qiang. Inexact preconditioned conjugate gradient method with inner-outer iteration. SIAM J. Sci. Comput., 21:1305–1320, 1997.


References IV

Halko, Nathan, Martinsson, Per-Gunnar, and Tropp, Joel A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, May 2011.

Hsieh, Cho-Jui, Chang, Kai-Wei, Lin, Chih-Jen, Keerthi, S. Sathiya, and Sundararajan, S. A dual coordinate descent method for large-scale linear SVM. In ICML, pp. 408–415, 2008.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pp. 315–323, 2013.

Johnson, William and Lindenstrauss, Joram. Extensions of Lipschitz mappings into a Hilbert space. In Conference in Modern Analysis and Probability (New Haven, Conn., 1982), volume 26, pp. 189–206, 1984.

Kane, Daniel M. and Nelson, Jelani. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61:4:1–4:23, 2014.


References V

Lee, Jason, Ma, Tengyu, and Lin, Qihang. Distributed stochastic variance reduced gradient methods. Technical report, UC Berkeley, 2015.

Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated proximal coordinate gradient method and its application to regularized empirical risk minimization. In NIPS, 2014.

Ma, Chenxin, Smith, Virginia, Jaggi, Martin, Jordan, Michael I., Richtarik, Peter, and Takac, Martin. Adding vs. averaging in distributed primal-dual optimization. In ICML, 2015.

Nelson, Jelani and Nguyen, Huy L. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. CoRR, abs/1211.1002, 2012.


References VI

Nelson, Jelani and Nguyen, Huy L. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. In 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 117–126, 2013.

Nemirovski, A. and Yudin, D. On Cezari's convergence of the steepest descent method for approximating saddle points of convex-concave functions. Soviet Math. Dokl., 19:341–362, 1978.

Nesterov, Yurii. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22:341–362, 2012.

Ozdaglar, Asu. Distributed multiagent optimization: linear convergence rate of ADMM. Technical report, MIT, 2015.


References VII

Paul, Saurabh, Boutsidis, Christos, Magdon-Ismail, Malik, and Drineas, Petros. Random projections for support vector machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 498–506, 2013.

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, pp. 1177–1184, 2008.

Recht, Benjamin. A simpler approach to matrix completion. Journal of Machine Learning Research (JMLR), pp. 3413–3430, 2011.

Roux, Nicolas Le, Schmidt, Mark, and Bach, Francis. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. CoRR, 2012.


References VIII

Sarlos, Tamas. Improved approximation algorithms for large matrices via random projections. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pp. 143–152, 2006.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 14:567–599, 2013.

Shamir, Ohad, Srebro, Nathan, and Zhang, Tong. Communication-efficient distributed optimization using an approximate Newton-type method. In ICML, 2014.

Tropp, Joel A. Improved analysis of the subsampled randomized Hadamard transform. Advances in Adaptive Data Analysis, 3(1-2):115–126, 2011.

Tropp, Joel A. User-friendly tail bounds for sums of random matrices. Found. Comput. Math., 12(4):389–434, August 2012.


References IX

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Yang, Tianbao. Trading computation for communication: Distributed stochastic dual coordinate ascent. In NIPS, 2013.

Yang, Tianbao, Li, Yu-Feng, Mahdavi, Mehrdad, Jin, Rong, and Zhou, Zhi-Hua. Nystrom method vs random Fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems (NIPS), pp. 485–493, 2012.

Yang, Tianbao, Zhang, Lijun, Jin, Rong, and Zhu, Shenghuo. Theory of dual-sparse regularized randomized reduction. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pp. 305–314, 2015.


References X

Zhang, Lijun, Mahdavi, Mehrdad, and Jin, Rong. Linear convergence with condition number independent access of full gradients. In NIPS, pp. 980–988, 2013.

Zhang, Lijun, Mahdavi, Mehrdad, Jin, Rong, Yang, Tianbao, and Zhu, Shenghuo. Random projections for classification: A recovery approach. IEEE Transactions on Information Theory (IEEE TIT), 60(11):7300–7316, 2014.

Zhang, Yuchen and Xiao, Lin. Communication-efficient distributed optimization of self-concordant empirical loss. In ICML, 2015.


Appendix

Examples of Convex functions

ax + b, Ax + b
x², ‖x‖_2²
exp(ax), exp(w^⊤x)
log(1 + exp(ax)), log(1 + exp(w^⊤x))
x log(x), ∑_i x_i log(x_i)
‖x‖_p for p ≥ 1, ‖x‖_p²
max_i(x_i)


Operations that preserve convexity

Nonnegative scale: a · f(x) where a ≥ 0
Sum: f(x) + g(x)
Composition with an affine function: f(Ax + b)
Point-wise maximum: max_i f_i(x)

Examples:
Least-squares regression: ‖Ax − b‖_2²
SVM: (1/n) ∑_{i=1}^n max(0, 1 − y_i w^⊤x_i) + (λ/2)‖w‖_2²


Smooth Convex function

Smooth: e.g., the logistic loss f(x) = log(1 + exp(−x))

‖∇f(x) − ∇f(y)‖_2 ≤ L‖x − y‖_2

where L > 0 is the smoothness constant
The second-order derivative is upper bounded: ‖∇²f(x)‖_2 ≤ L

[Figure: f(x) = log(1 + exp(−x)), its tangent f(y) + f′(y)(x − y) at a point y, and an upper-bounding quadratic function]


Strongly Convex function

Strongly convex: e.g., the squared Euclidean norm f(x) = (1/2)‖x‖_2²

‖∇f(x) − ∇f(y)‖_2 ≥ λ‖x − y‖_2

where λ > 0 is the strong convexity constant
The second-order derivative is lower bounded: ‖∇²f(x)‖_2 ≥ λ

[Figure: illustration on f(x) = x²; curve labels: x², gradient, smooth]


Smooth and Strongly Convex function

Smooth and strongly convex: e.g., the quadratic function f(z) = (1/2)(z − 1)²

λ‖x − y‖_2 ≤ ‖∇f(x) − ∇f(y)‖_2 ≤ L‖x − y‖_2, with L ≥ λ > 0


Chernoff bound

Let X_1, . . . , X_n be independent random variables. Assume 0 ≤ X_i ≤ 1. Let X = X_1 + . . . + X_n and µ = E[X]. Then

Pr(X ≥ (1 + ε)µ) ≤ exp(−ε²µ/(2 + ε))

Pr(X ≤ (1 − ε)µ) ≤ exp(−ε²µ/2)

or

Pr(|X − µ| ≥ εµ) ≤ 2 exp(−ε²µ/(2 + ε)) ≤ 2 exp(−ε²µ/3)

where the last inequality holds when 0 < ε ≤ 1
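A small numerical sanity check of the simplified two-sided bound; the Bernoulli(1/2) example, n = 200, and ε = 0.2 are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 200, 0.2, 20000
X = rng.binomial(1, 0.5, size=(trials, n)).sum(axis=1)   # X = X_1 + ... + X_n, X_i ~ Bernoulli(1/2)
mu = 0.5 * n
empirical = np.mean(np.abs(X - mu) >= eps * mu)
bound = 2 * np.exp(-eps ** 2 * mu / 3)                   # valid since 0 < eps <= 1
print(empirical, bound)                                  # the empirical tail should lie below the bound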


Theoretical Guarantee of RA for low-rank approximation

Write the SVD of the target matrix in block form:

X = U · diag(Σ_1, Σ_2) · [V_1, V_2]^⊤

X ∈ R^{m×n}: the target matrix
Σ_1 ∈ R^{k×k}, V_1 ∈ R^{n×k}
A ∈ R^{n×ℓ}: random reduction matrix
Y = XA ∈ R^{m×ℓ}: the small sketch
Key inequality (with Ω_1 = V_1^⊤A and Ω_2 = V_2^⊤A):

‖(I − P_Y)X‖_2 ≤ ‖Σ_2‖_2 + ‖Σ_2 Ω_2 Ω_1^†‖_2


Gaussian Matrices

G is a standard Gaussian matrix; U and V are orthonormal matrices
U^⊤GV follows the standard Gaussian distribution
E[‖SGT‖_F²] = ‖S‖_F²‖T‖_F²
E[‖SGT‖] ≤ ‖S‖‖T‖_F + ‖S‖_F‖T‖
Concentration for a function of a Gaussian matrix: suppose h is a Lipschitz function on matrices,

h(X) − h(Y) ≤ L‖X − Y‖_F

Then

Pr(h(G) ≥ E[h(G)] + Lt) ≤ e^{−t²/2}


Analysis for Randomized Least-square regression

Let X = UΣV^⊤ and w_∗ = arg min_{w∈R^d} ‖Xw − b‖_2

Let Z = ‖Xw_∗ − b‖_2, ω = b − Xw_∗, and Xw_∗ = Uα

ŵ_∗ = arg min_{w∈R^d} ‖A(Xw − b)‖_2

Since b − Xw_∗ = b − X(X^⊤X)^†X^⊤b = (I − UU^⊤)b, we can write Xŵ_∗ − Xw_∗ = Uβ. Then

‖Xŵ_∗ − b‖_2² = ‖Xw_∗ − b‖_2² + ‖Xŵ_∗ − Xw_∗‖_2² = Z² + ‖β‖_2²


Analysis for Randomized Least-square regression

AU(α + β) = AXŵ_∗ = AX(AX)^†Ab = P_{AX}(Ab) = P_{AU}(Ab)

P_{AU}(Ab) = P_{AU}(A(ω + Uα)) = AUα + P_{AU}(Aω)

Hence

U^⊤A^⊤AUβ = (AU)^⊤(AU)(AU)^†Aω = (AU)^⊤(AU)((AU)^⊤AU)^{−1}(AU)^⊤Aω

where we use that AU has full column rank. Then

U^⊤A^⊤AUβ = U^⊤A^⊤Aω

‖β‖_2²/2 ≤ ‖U^⊤A^⊤AUβ‖_2² = ‖U^⊤A^⊤Aω‖_2² ≤ ε′²‖U‖_F²‖ω‖_2²

where the last inequality uses the approximate matrix products result on the next slide. Since ‖U‖_F² ≤ d, setting ε′ = √(ε/d) suffices.


Approximate Matrix Products

Given X ∈ R^{n×d} and Y ∈ R^{d×p}, let A ∈ R^{m×d} be one of the following matrices:

a JL transform matrix with m = Θ(ε^{−2} log((n + p)/δ))

a sparse subspace embedding with m = Θ(ε^{−2})

a leverage-score sampling matrix based on p_i ≥ ‖X_{i∗}‖_2²/(2‖X‖_F²) and m = Θ(ε^{−2})

Then w.h.p. 1 − δ,

‖XA^⊤AY − XY‖_F ≤ ε‖X‖_F‖Y‖_F
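A quick empirical check of this inequality with a dense Gaussian JL matrix; the sizes and the Gaussian choice are illustrative (the sparse embeddings above give the same kind of guarantee with faster application):

import numpy as np

rng = np.random.default_rng(0)
n, d, p, m = 500, 200, 100, 2000
X, Y = rng.normal(size=(n, d)), rng.normal(size=(d, p))
A = rng.normal(size=(m, d)) / np.sqrt(m)                 # JL transform applied to the shared dimension d
err = np.linalg.norm(X @ A.T @ (A @ Y) - X @ Y)          # Frobenius norm by default
print(err / (np.linalg.norm(X) * np.linalg.norm(Y)))     # small relative error, as the lemma predicts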


Analysis for Randomized Least-square regression

A ∈ R^{m×n} needs to satisfy:
1. Subspace embedding: AU has full column rank
2. Matrix product approximation with ε′ = √(ε/d)

Order of m:
JL transforms: 1. O(d log(d)), 2. O(d log(d)ε^{−1}) ⇒ O(d log(d)ε^{−1})
Sparse subspace embedding: 1. O(d²), 2. O(dε^{−1}) ⇒ O(d²ε^{−1})

If we use an SSE A_1 ∈ R^{m_1×n} and a JL transform A_2 ∈ R^{m_2×m_1}:

‖A_2A_1(Xw_∗^{(2)} − b)‖_2 ≤ (1 + ε)‖A_1(Xw_∗^{(1)} − b)‖_2 ≤ (1 + ε)‖A_1(Xw_∗ − b)‖_2 ≤ (1 + ε)²‖Xw_∗ − b‖_2

with m_1 = O(d²ε^{−2}) and m_2 = O(d log(d)ε^{−1}), where w_∗^{(2)} is the optimal solution using A_2A_1, w_∗^{(1)} is the optimal solution using A_1, and w_∗ is the original optimal solution.


Randomized Least-squares regression

Theoretical guarantees (Sarlos, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):

‖Xŵ_∗ − b‖_2 ≤ (1 + ε)‖Xw_∗ − b‖_2

If A is a fast JL transform with m = Θ(ε^{−1}d log(d)): total time O(nd log(m) + d³ log(d)ε^{−1})

If A is a sparse subspace embedding with m = Θ(d²ε^{−1}): total time O(nnz(X) + d⁴ε^{−1})

If A = A_1A_2 combines a fast JL transform (m_1 = Θ(ε^{−1}d log(d))) and an SSE (m_2 = Θ(d²ε^{−2})): total time O(nnz(X) + d³ log(d/ε)ε^{−2})
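A hedged sketch of the overall recipe with a dense Gaussian sketching matrix standing in for the fast JL / sparse subspace embeddings analyzed above (so the running times quoted on this slide do not apply to this toy version):

import numpy as np

def sketched_least_squares(X, b, m, seed=0):
    # solve min_w ||A(Xw - b)||_2 for a random A with m << n rows
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(m, X.shape[0])) / np.sqrt(m)
    w, *_ = np.linalg.lstsq(A @ X, A @ b, rcond=None)
    return w

# usage: with n >> d, m on the order of d log(d) / eps already gives a (1 + eps)-approximation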


Matrix Chernoff bound

Lemma (Matrix Chernoff (Tropp, 2012))

Let 𝒳 be a finite set of PSD matrices with dimension k, and suppose that max_{X∈𝒳} λ_max(X) ≤ B. Sample X_1, . . . , X_ℓ independently from 𝒳. Compute

µ_max = ℓ·λ_max(E[X_1]),   µ_min = ℓ·λ_min(E[X_1])

Then

Pr{ λ_max(∑_{i=1}^ℓ X_i) ≥ (1 + δ)µ_max } ≤ k·[e^δ / (1 + δ)^{1+δ}]^{µ_max/B}

Pr{ λ_min(∑_{i=1}^ℓ X_i) ≤ (1 − δ)µ_min } ≤ k·[e^{−δ} / (1 − δ)^{1−δ}]^{µ_min/B}


To simplify the usage of the Matrix Chernoff bound, we note that

[e^{−δ} / (1 − δ)^{1−δ}]^µ ≤ exp(−µδ²/2)

[e^δ / (1 + δ)^{1+δ}]^µ ≤ exp(−µδ²/3),  δ ≤ 1

[e^δ / (1 + δ)^{1+δ}]^µ ≤ exp(−µδ log(δ)/2),  δ > 1


Noncommutative Bernstein Inequality

Lemma (Noncommutative Bernstein Inequality (Recht, 2011))

Let Z_1, . . . , Z_L be independent zero-mean random matrices of dimension d_1 × d_2. Suppose τ_j² = max{‖E[Z_jZ_j^⊤]‖_2, ‖E[Z_j^⊤Z_j]‖_2} and ‖Z_j‖_2 ≤ M almost surely for all j. Then, for any ε > 0,

Pr( ‖∑_{j=1}^L Z_j‖_2 > ε ) ≤ (d_1 + d_2) exp[ −(ε²/2) / (∑_{j=1}^L τ_j² + Mε/3) ]


Randomized Algorithms for K-means Clustering

K-means:

∑_{j=1}^k ∑_{x_i∈C_j} ‖x_i − µ_j‖_2² = ‖X − CC^⊤X‖_F²

where C ∈ R^{n×k} is the scaled cluster indicator matrix such that C^⊤C = I.

Constrained low-rank approximation (Cohen et al., 2015):

min_{P∈S} ‖X − PX‖_F²

where S = {QQ^⊤} is any set of rank-k orthogonal projection matrices with orthonormal Q ∈ R^{n×k}

Low-rank approximation: S is the set of all rank-k orthogonal projection matrices; P_∗ = U_kU_k^⊤
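A hedged sketch of the sketch-and-solve idea behind these results: compress the feature dimension with a random projection, then run ordinary Lloyd iterations on the compressed data. The Gaussian projection, the value of m, and the tiny Lloyd loop are illustrative stand-ins for the reductions analyzed by Cohen et al. (2015):

import numpy as np

def sketch_then_kmeans(X, k, m, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    Xs = X @ (rng.normal(size=(X.shape[1], m)) / np.sqrt(m))   # reduce d features to m
    centers = Xs[rng.choice(Xs.shape[0], size=k, replace=False)]
    for _ in range(iters):                                     # plain Lloyd iterations on Xs
        labels = np.argmin(((Xs[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Xs[labels == j].mean(axis=0)
    return labels    # cluster indicator, to be evaluated on the original X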


Randomized Algorithms for K-means Clustering

Define

P_∗ = arg min_{P∈S} ‖X − PX‖_F²

P̃_∗ = arg min_{P∈S} ‖X̃ − PX̃‖_F²   (the same constrained problem solved on a sketch X̃ of X)

Guarantee on the approximation:

‖X − P̃_∗X‖_F² ≤ (1 + ε)/(1 − ε) · ‖X − P_∗X‖_F²


Properties of Leverage-score sampling

We prove the properties using the Matrix Chernoff bound. Let Ω = AU.

Ω^⊤Ω = (AU)^⊤(AU) = ∑_{j=1}^m (1/(m p_{i_j})) u_{i_j} u_{i_j}^⊤

Let X_i = (1/(m p_i)) u_i u_i^⊤. Then E[X_i] = (1/m) I_k, so λ_max(E[X_i]) = λ_min(E[X_i]) = 1/m.
Moreover, λ_max(X_i) ≤ max_i ‖u_i‖_2²/(m p_i) = k/m. Applying the Matrix Chernoff bound to the minimum and maximum eigenvalues, we have

Pr(λ_min(Ω^⊤Ω) ≤ 1 − ε) ≤ k exp(−mε²/(2k)) ≤ k exp(−mε²/(3k))

Pr(λ_max(Ω^⊤Ω) ≥ 1 + ε) ≤ k exp(−mε²/(3k))
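For concreteness, a hedged sketch of the sampling scheme whose properties are proved above: compute exact leverage scores from a thin SVD, sample m rows with probability p_i = ‖U_{i∗}‖_2²/k, and rescale by 1/√(m p_i). Exact scores are used here only for illustration; in practice fast approximate scores are what make this attractive:

import numpy as np

def leverage_score_sample(X, m, seed=0):
    U, _, _ = np.linalg.svd(X, full_matrices=False)       # thin SVD, U is n x k
    p = (U ** 2).sum(axis=1) / U.shape[1]                 # leverage scores normalized to sum to 1
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=True, p=p)
    return X[idx] / np.sqrt(m * p[idx])[:, None]          # rows of A X with the 1/sqrt(m p_i) scaling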


When does uniform sampling make sense?

Coherence measure:

µ_k = (d/k) max_{1≤i≤d} ‖U_{i∗}‖_2²

When µ_k ≤ τ and m = Θ((kτ/ε²) log(2k/δ)), w.h.p. 1 − δ, for A formed by uniform sampling (and scaling):
AU ∈ R^{m×k} is of full column rank
σ_i²(AU) ≥ (1 − ε) ≥ (1 − ε)²
σ_i²(AU) ≤ (1 + ε) ≤ (1 + ε)²

Valid when the coherence measure is small (some real data mining datasets have small coherence measures)
The Nystrom method usually uses uniform sampling (Gittens, 2011)
