
Big Data Analytics: Optimization and Randomization

Tianbao Yang

Tutorial @ ACML 2015, Hong Kong

†Department of Computer Science, The University of Iowa, IA, USA

Nov. 20, 2015

Yang Tutorial for ACML’15 Nov. 20, 2015 1 / 210

URL

http://www.cs.uiowa.edu/˜tyng/acml15-tutorial.pdf

Yang Tutorial for ACML’15 Nov. 20, 2015 2 / 210

Some Claims

No:
This tutorial is not an exhaustive literature survey
It is not a survey on different machine learning algorithms

Yes:
It is about how to efficiently solve machine learning problems (formulated as optimization) for big data

Yang Tutorial for ACML’15 Nov. 20, 2015 3 / 210

Outline

Part I: Basics
Part II: Optimization
Part III: Randomization

Yang Tutorial for ACML’15 Nov. 20, 2015 4 / 210

Big Data Analytics: Optimization and Randomization

Part I: Basics

Yang Tutorial for ACML’15 Nov. 20, 2015 5 / 210

Basics Introduction

Outline

1 Basics
Introduction
Notations and Definitions

Yang Tutorial for ACML’15 Nov. 20, 2015 6 / 210

Basics Introduction

Three Steps for Machine Learning

Data, Model, Optimization

[Figure: convergence curves of distance to optimal objective vs. iterations, for rates 0.5^T, 1/T^2, and 1/T]

Yang Tutorial for ACML’15 Nov. 20, 2015 7 / 210

Basics Introduction

Big Data Challenge

Big Data

Yang Tutorial for ACML’15 Nov. 20, 2015 8 / 210

Basics Introduction

Big Data Challenge

Big Model

60 million parameters

Yang Tutorial for ACML’15 Nov. 20, 2015 9 / 210

Basics Introduction

Learning as Optimization

Ridge Regression Problem:

min_{w∈R^d} (1/n) ∑_{i=1}^n (y_i − w^⊤x_i)^2 + (λ/2)‖w‖_2^2

The first term is the empirical loss; the second term is the regularization.

x_i ∈ R^d: d-dimensional feature vector
y_i ∈ R: target variable
w ∈ R^d: model parameters
n: number of data points

Yang Tutorial for ACML’15 Nov. 20, 2015 10 / 210
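Purely as an illustration (not part of the slides), a minimal NumPy sketch of this objective and its closed-form minimizer; the names ridge_objective, X, y, lam are my own placeholders.

import numpy as np

def ridge_objective(w, X, y, lam):
    # (1/n) * sum_i (y_i - w^T x_i)^2 + (lam/2) * ||w||_2^2
    n = X.shape[0]
    residual = y - X @ w
    return residual @ residual / n + 0.5 * lam * w @ w

def ridge_closed_form(X, y, lam):
    # setting the gradient to zero gives (2/n) X^T X w + lam w = (2/n) X^T y
    n, d = X.shape
    A = 2.0 / n * X.T @ X + lam * np.eye(d)
    b = 2.0 / n * X.T @ y
    return np.linalg.solve(A, b)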


Basics Introduction

Learning as Optimization

Classification Problems:

min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(y_i w^⊤x_i) + (λ/2)‖w‖_2^2

y_i ∈ {+1, −1}: label
Loss function ℓ(z), with z = y w^⊤x:
1. SVMs: (squared) hinge loss ℓ(z) = max(0, 1 − z)^p, where p = 1, 2
2. Logistic Regression: ℓ(z) = log(1 + exp(−z))

Yang Tutorial for ACML’15 Nov. 20, 2015 13 / 210

Basics Introduction

Learning as Optimization

Feature Selection:

min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + λ‖w‖_1

ℓ_1 regularization: ‖w‖_1 = ∑_{i=1}^d |w_i|
λ controls the sparsity level

Yang Tutorial for ACML’15 Nov. 20, 2015 14 / 210

Basics Introduction

Learning as Optimization

Feature Selection using Elastic Net:

min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + λ(‖w‖_1 + γ‖w‖_2^2)

The elastic net regularizer is more robust than the ℓ_1 regularizer

Yang Tutorial for ACML’15 Nov. 20, 2015 15 / 210

Basics Introduction

Learning as Optimization

Multi-class/Multi-task Learning:

min_W (1/n) ∑_{i=1}^n ℓ(Wx_i, y_i) + λ r(W),   W ∈ R^{K×d}

r(W) = ‖W‖_F^2 = ∑_{k=1}^K ∑_{j=1}^d W_{kj}^2: Frobenius norm
r(W) = ‖W‖_* = ∑_i σ_i: nuclear norm (sum of singular values)
r(W) = ‖W‖_{1,∞} = ∑_{j=1}^d ‖W_{:j}‖_∞: ℓ_{1,∞} mixed norm

Yang Tutorial for ACML’15 Nov. 20, 2015 16 / 210

Basics Introduction

Learning as Optimization

Regularized Empirical Loss Minimization

min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R(w)

Both ℓ and R are convex functions
Extensions to matrix cases are possible (sometimes straightforward)
Extensions to kernel methods can be combined with randomized approaches
Extensions to non-convex problems (e.g., deep learning) are in progress

Yang Tutorial for ACML’15 Nov. 20, 2015 17 / 210

Basics Introduction

Data Matrices and Machine Learning

The Instance-Feature Matrix: X ∈ R^{n×d}

X = [x_1^⊤; x_2^⊤; … ; x_n^⊤]   (one row per instance)

Yang Tutorial for ACML’15 Nov. 20, 2015 18 / 210

Basics Introduction

Data Matrices and Machine Learning

The output vector: y = (y_1, y_2, …, y_n)^⊤ ∈ R^n

continuous y_i ∈ R: regression (e.g., house price)
discrete, e.g., y_i ∈ {1, 2, 3}: classification (e.g., species of iris)

Yang Tutorial for ACML’15 Nov. 20, 2015 19 / 210

Basics Introduction

Data Matrices and Machine Learning

The Instance-Instance Matrix: K ∈ R^{n×n}
Similarity matrix
Kernel matrix

Yang Tutorial for ACML’15 Nov. 20, 2015 20 / 210

Basics Introduction

Data Matrices and Machine Learning

Some machine learning tasks are formulated on the kernel matrix:
Clustering
Kernel methods

Yang Tutorial for ACML’15 Nov. 20, 2015 21 / 210

Basics Introduction

Data Matrices and Machine Learning

The Feature-Feature Matrix: C ∈ R^{d×d}

Covariance matrix
Distance metric matrix

Yang Tutorial for ACML’15 Nov. 20, 2015 22 / 210

Basics Introduction

Data Matrices and Machine Learning

Some machine learning tasks require the covariance matrix:
Principal Component Analysis
Top-k singular value (eigenvalue) decomposition of the covariance matrix

Yang Tutorial for ACML’15 Nov. 20, 2015 23 / 210

Basics Introduction

Why is Learning from Big Data Challenging?

High per-iteration cost

High memory cost

High communication cost

Large iteration complexity

Yang Tutorial for ACML’15 Nov. 20, 2015 24 / 210

Basics Notations and Definitions

Outline

1 Basics
Introduction
Notations and Definitions

Yang Tutorial for ACML’15 Nov. 20, 2015 25 / 210

Basics Notations and Definitions

Norms

Vector x ∈ R^d

Euclidean vector norm: ‖x‖_2 = √(x^⊤x) = √(∑_{i=1}^d x_i^2)
ℓ_p-norm of a vector: ‖x‖_p = (∑_{i=1}^d |x_i|^p)^{1/p}, where p ≥ 1
1. ℓ_2 norm ‖x‖_2 = √(∑_{i=1}^d x_i^2)
2. ℓ_1 norm ‖x‖_1 = ∑_{i=1}^d |x_i|
3. ℓ_∞ norm ‖x‖_∞ = max_i |x_i|

Yang Tutorial for ACML’15 Nov. 20, 2015 26 / 210


Basics Notations and Definitions

Matrix Factorization

Matrix X ∈ R^{n×d}

Singular Value Decomposition: X = UΣV^⊤
1. U ∈ R^{n×r}: orthonormal columns (U^⊤U = I); spans the column space
2. Σ ∈ R^{r×r}: diagonal matrix with Σ_{ii} = σ_i > 0, σ_1 ≥ σ_2 ≥ … ≥ σ_r
3. V ∈ R^{d×r}: orthonormal columns (V^⊤V = I); spans the row space
4. r ≤ min(n, d): the largest value such that σ_r > 0, i.e., the rank of X
5. U_k Σ_k V_k^⊤: top-k approximation

Pseudo-inverse: X^† = VΣ^{−1}U^⊤

QR factorization: X = QR (n ≥ d)
Q ∈ R^{n×d}: orthonormal columns
R ∈ R^{d×d}: upper triangular matrix

Yang Tutorial for ACML’15 Nov. 20, 2015 27 / 210
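An illustrative NumPy sketch of these factorizations (my own example, not from the tutorial): thin SVD, the top-k approximation U_k Σ_k V_k^⊤, the pseudo-inverse, and QR.

import numpy as np

X = np.random.randn(100, 20)                       # example data matrix, n x d

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U diag(s) V^T

k = 5
X_k = U[:, :k] * s[:k] @ Vt[:k, :]                 # top-k approximation U_k Sigma_k V_k^T

X_pinv = Vt.T * (1.0 / s) @ U.T                    # pseudo-inverse V Sigma^{-1} U^T
Q, R = np.linalg.qr(X)                             # QR factorization (n >= d)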


Basics Notations and Definitions

Norms

Matrix X ∈ R^{n×d}

Frobenius norm: ‖X‖_F = √(tr(X^⊤X)) = √(∑_{i=1}^n ∑_{j=1}^d X_{ij}^2)
Spectral norm (induced norm): ‖X‖_2 = max_{‖u‖_2=1} ‖Xu‖_2 = σ_1 (maximum singular value)

Yang Tutorial for ACML’15 Nov. 20, 2015 28 / 210


Basics Notations and Definitions

Convex Optimization

min_{x∈X} f(x)

X is a convex domain: for any x, y ∈ X, their convex combination αx + (1−α)y ∈ X
f(x) is a convex function

Yang Tutorial for ACML’15 Nov. 20, 2015 29 / 210

Basics Notations and Definitions

Convex Function

Characterization of Convex Function

f(αx + (1−α)y) ≤ αf(x) + (1−α)f(y),   ∀x, y ∈ X, α ∈ [0, 1]

f(x) ≥ f(y) + ∇f(y)^⊤(x − y),   ∀x, y ∈ X

A local optimum is a global optimum

Yang Tutorial for ACML’15 Nov. 20, 2015 30 / 210


Basics Notations and Definitions

Convex vs Strongly Convex

Convex function:

f(x) ≥ f(y) + ∇f(y)^⊤(x − y),   ∀x, y ∈ X

Strongly convex function:

f(x) ≥ f(y) + ∇f(y)^⊤(x − y) + (λ/2)‖x − y‖_2^2,   ∀x, y ∈ X

where λ is the strong convexity constant

The global optimum is unique
e.g., (λ/2)‖w‖_2^2 is λ-strongly convex

Yang Tutorial for ACML’15 Nov. 20, 2015 31 / 210


Basics Notations and Definitions

Non-smooth function vs Smooth function

Non-smooth function
Lipschitz continuous: e.g., absolute loss f(x) = |x|
|f(x) − f(y)| ≤ G‖x − y‖_2, where G is the Lipschitz constant
Subgradient: f(x) ≥ f(y) + ∂f(y)^⊤(x − y)
[Figure: f(x) = |x|, a non-smooth function, with a sub-gradient at the kink]

Smooth function
e.g., logistic loss f(x) = log(1 + exp(−x))
‖∇f(x) − ∇f(y)‖_2 ≤ L‖x − y‖_2, where L is the smoothness constant
[Figure: log(1 + exp(−x)) with its tangent f(y) + f′(y)(x − y) and a quadratic function]

Yang Tutorial for ACML’15 Nov. 20, 2015 32 / 210


Basics Notations and Definitions

Next ...

min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R(w)

Part II: Optimization
stochastic optimization
distributed optimization

Reduce Iteration Complexity: utilize properties of the functions and the structure of the problem

Yang Tutorial for ACML’15 Nov. 20, 2015 33 / 210

Basics Notations and Definitions

Next ...

Part III: Randomization
Classification, Regression
SVD, K-means, Kernel methods

Reduce Data Size: utilize properties of the data

Please stay tuned!

Yang Tutorial for ACML’15 Nov. 20, 2015 34 / 210

Optimization

Big Data Analytics: Optimization and Randomization

Part II: Optimization

Yang Tutorial for ACML’15 Nov. 20, 2015 35 / 210

Optimization (Sub)Gradient Methods

Outline

2 Optimization
(Sub)Gradient Methods
Stochastic Optimization Algorithms for Big Data
Stochastic Optimization
Distributed Optimization

Yang Tutorial for ACML’15 Nov. 20, 2015 36 / 210

Optimization (Sub)Gradient Methods

Learning as Optimization

Regularized Empirical Loss Minimization

min_{w∈R^d} F(w) := (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R(w)

Yang Tutorial for ACML’15 Nov. 20, 2015 37 / 210

Optimization (Sub)Gradient Methods

Convergence Measure

Most optimization algorithms are iterative:

w_{t+1} = w_t + Δw_t

Iteration Complexity: the number of iterations T(ε) needed to have

F(w_T) − min_w F(w) ≤ ε   (ε ≪ 1)

Convergence Rate: after T iterations, how good is the solution

F(w_T) − min_w F(w) ≤ ε(T)

[Figure: objective vs. iterations, illustrating T and ε]

Total Runtime = Per-iteration Cost × Iteration Complexity

Yang Tutorial for ACML’15 Nov. 20, 2015 38 / 210


Optimization (Sub)Gradient Methods

More on Convergence Measure

Big O(·) notation: explicit dependence on T or ε

Convergence Rate / Iteration Complexity:
linear: O(μ^T) (μ < 1) / O(log(1/ε))
sub-linear: O(1/T^α), α > 0 / O(1/ε^{1/α})

Why are we interested in bounds?

Yang Tutorial for ACML’15 Nov. 20, 2015 39 / 210



Optimization (Sub)Gradient Methods

More on Convergence Measure

Convergence Rate / Iteration Complexity:
linear: O(μ^T) (μ < 1) / O(log(1/ε))
sub-linear: O(1/T^α), α > 0 / O(1/ε^{1/α})

[Figure: distance to optimum vs. iterations for rates 0.5^T (seconds), 1/T (minutes), and 1/T^0.5 (hours)]

Theoretically, we consider log(1/ε) ≺ 1/√ε ≺ 1/ε ≺ 1/ε^2

Yang Tutorial for ACML’15 Nov. 20, 2015 43 / 210

Optimization (Sub)Gradient Methods

Non-smooth V.S. Smooth

Smooth ℓ(z):
squared hinge loss: ℓ(w^⊤x, y) = max(0, 1 − y w^⊤x)^2
logistic loss: ℓ(w^⊤x, y) = log(1 + exp(−y w^⊤x))
square loss: ℓ(w^⊤x, y) = (w^⊤x − y)^2

Non-smooth ℓ(z):
hinge loss: ℓ(w^⊤x, y) = max(0, 1 − y w^⊤x)
absolute loss: ℓ(w^⊤x, y) = |w^⊤x − y|

Yang Tutorial for ACML’15 Nov. 20, 2015 44 / 210
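For concreteness, a small sketch (my own, not from the slides) of these losses as functions of z = w^⊤x and the label y; the hinge subgradient returned at the kink is one valid choice.

import numpy as np

def logistic_loss(z, y):        # smooth
    return np.log1p(np.exp(-y * z))

def logistic_grad(z, y):        # derivative with respect to z
    return -y / (1.0 + np.exp(y * z))

def squared_hinge_loss(z, y):   # smooth
    return np.maximum(0.0, 1.0 - y * z) ** 2

def hinge_loss(z, y):           # non-smooth
    return np.maximum(0.0, 1.0 - y * z)

def hinge_subgrad(z, y):        # a subgradient (0 is valid at the kink)
    return np.where(y * z < 1.0, -y, 0.0)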

Optimization (Sub)Gradient Methods

Strongly convex V.S. Non-strongly convex

λ-strongly convex R(w):
ℓ_2 regularizer: (λ/2)‖w‖_2^2
elastic net regularizer: τ‖w‖_1 + (λ/2)‖w‖_2^2

Non-strongly convex R(w):
unregularized problem: R(w) ≡ 0
ℓ_1 regularizer: τ‖w‖_1

Yang Tutorial for ACML’15 Nov. 20, 2015 45 / 210

Optimization (Sub)Gradient Methods

Gradient Method in Machine Learning

F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2

Suppose ℓ(z, y) is smooth
Full gradient: ∇F(w) = (1/n) ∑_{i=1}^n ∇ℓ(w^⊤x_i, y_i) + λw
Per-iteration cost: O(nd)

Gradient Descent:

w_t = w_{t−1} − γ_t ∇F(w_{t−1})

step size γ_t = constant, e.g., 1/L

Yang Tutorial for ACML’15 Nov. 20, 2015 46 / 210
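A minimal gradient-descent sketch for this setup (illustrative only; the loss is taken to be the square loss, and the step size 1/L uses a smoothness constant of F estimated from the data).

import numpy as np

def full_gradient(w, X, y, lam):
    # (1/n) sum_i grad_w (w^T x_i - y_i)^2 + lam * w
    n = X.shape[0]
    return 2.0 / n * X.T @ (X @ w - y) + lam * w

def gradient_descent(X, y, lam, T=100):
    n, d = X.shape
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n + lam   # smoothness constant of F
    w = np.zeros(d)
    for _ in range(T):                              # per-iteration cost O(nd)
        w -= (1.0 / L) * full_gradient(w, X, y, lam)
    return w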


Optimization (Sub)Gradient Methods

Gradient Method in Machine Learning

F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2, with R(w) = (λ/2)‖w‖_2^2

If λ = 0: R(w) is non-strongly convex; iteration complexity O(1/ε)
If λ > 0: R(w) is λ-strongly convex; iteration complexity O((1/λ) log(1/ε))

Yang Tutorial for ACML’15 Nov. 20, 2015 47 / 210

Optimization (Sub)Gradient Methods

Accelerated Full Gradient (AFG)

Nesterov's Accelerated Gradient Descent:

w_t = v_{t−1} − γ_t ∇F(v_{t−1})
v_t = w_t + η_t (w_t − w_{t−1})   (momentum step)

w_t is the output and v_t is an auxiliary sequence.

Yang Tutorial for ACML’15 Nov. 20, 2015 48 / 210
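A sketch of the accelerated scheme (illustrative; it reuses the full_gradient helper and square-loss F from the previous snippet, and η_t = (t−1)/(t+2) is one standard momentum choice, not necessarily the one used in the tutorial).

import numpy as np

def accelerated_gradient(X, y, lam, T=100):
    n, d = X.shape
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n + lam
    w_prev = np.zeros(d)
    v = np.zeros(d)
    for t in range(1, T + 1):
        w = v - (1.0 / L) * full_gradient(v, X, y, lam)   # gradient step at the auxiliary point
        v = w + (t - 1.0) / (t + 2.0) * (w - w_prev)      # momentum step
        w_prev = w
    return w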


Optimization (Sub)Gradient Methods

Accelerated Full Gradient (AFG)

F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2

If λ = 0: R(w) is non-strongly convex; iteration complexity O(1/√ε), better than O(1/ε)
If λ > 0: R(w) is λ-strongly convex; iteration complexity O((1/√λ) log(1/ε)), better than O((1/λ) log(1/ε)) for small λ

Yang Tutorial for ACML’15 Nov. 20, 2015 49 / 210

Optimization (Sub)Gradient Methods

Deal with non-smooth regularizer

Consider ℓ_1-norm regularization:

min_{w∈R^d} F(w) = f(w) + R(w), where f(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) and R(w) = τ‖w‖_1

f(w): smooth
R(w): non-smooth and non-strongly convex

Yang Tutorial for ACML’15 Nov. 20, 2015 50 / 210

Optimization (Sub)Gradient Methods

Accelerated Proximal Gradient (APG)

Accelerated Gradient Descent:

w_t = argmin_{w∈R^d} ∇f(v_{t−1})^⊤w + (1/(2γ_t))‖w − v_{t−1}‖_2^2 + τ‖w‖_1   (proximal mapping)
v_t = w_t + η_t (w_t − w_{t−1})

The proximal mapping has a closed-form solution: soft-thresholding
Iteration complexity and runtime remain the same as for the smooth, non-strongly convex case, i.e., O(1/√ε)

Yang Tutorial for ACML’15 Nov. 20, 2015 51 / 210
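A sketch of the proximal step: the argmin above reduces to soft-thresholding applied to a gradient step (illustrative code assuming the smooth part f is the square loss).

import numpy as np

def soft_threshold(z, tau):
    # prox of tau*||.||_1: sign(z) * max(|z| - tau, 0)
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def accelerated_proximal_gradient(X, y, tau, T=100):
    n, d = X.shape
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n          # smoothness constant of f(w)
    w_prev = np.zeros(d)
    v = np.zeros(d)
    for t in range(1, T + 1):
        grad = 2.0 / n * X.T @ (X @ v - y)           # gradient of the smooth part at v
        w = soft_threshold(v - grad / L, tau / L)    # proximal mapping (soft-thresholding)
        v = w + (t - 1.0) / (t + 2.0) * (w - w_prev) # momentum step
        w_prev = w
    return w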


Optimization (Sub)Gradient Methods

Deal with non-smooth but strongly convex regularizer

Consider the elastic net regularization:

min_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2 + τ‖w‖_1, with R(w) = (λ/2)‖w‖_2^2 + τ‖w‖_1

R(w): non-smooth but strongly convex

Regroup the terms:

min_{w∈R^d} F(w) = f(w) + R′(w), where f(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2 and R′(w) = τ‖w‖_1

f(w): smooth and strongly convex
R′(w): non-smooth and non-strongly convex
Iteration Complexity: O((1/√λ) log(1/ε))

Yang Tutorial for ACML’15 Nov. 20, 2015 52 / 210

Optimization (Sub)Gradient Methods

Sub-Gradient Method in Machine Learning

F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2

Suppose ℓ(z, y) is non-smooth
Full sub-gradient: ∂F(w) = (1/n) ∑_{i=1}^n ∂ℓ(w^⊤x_i, y_i) + λw

Sub-Gradient Descent:

w_t = w_{t−1} − γ_t ∂F(w_{t−1})

step size γ_t → 0

Yang Tutorial for ACML’15 Nov. 20, 2015 53 / 210


Optimization (Sub)Gradient Methods

Sub-Gradient Method

F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2

If λ = 0: R(w) is non-strongly convex (generalizes to the ℓ_1 norm and other non-strongly convex regularizers); iteration complexity O(1/ε^2)
If λ > 0: R(w) is λ-strongly convex (generalizes to the elastic net and other strongly convex regularizers); iteration complexity O(1/(λε))

No efficient acceleration scheme in general

Yang Tutorial for ACML’15 Nov. 20, 2015 54 / 210

Optimization (Sub)Gradient Methods

Problem Classes and Iteration Complexity

min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R(w)

Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) non-strongly convex: O(1/ε^2) for non-smooth ℓ, O(1/√ε) for smooth ℓ
R(w) λ-strongly convex: O(1/(λε)) for non-smooth ℓ, O((1/√λ) log(1/ε)) for smooth ℓ

Per-iteration cost: O(nd), too high if n or d are large.

Yang Tutorial for ACML’15 Nov. 20, 2015 55 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Outline

2 Optimization
(Sub)Gradient Methods
Stochastic Optimization Algorithms for Big Data
Stochastic Optimization
Distributed Optimization

Yang Tutorial for ACML’15 Nov. 20, 2015 56 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Stochastic First-Order Method by Sampling

Randomly sample examples:
1. Stochastic Gradient Descent (SGD)
2. Stochastic Variance Reduced Gradient (SVRG)
3. Stochastic Average Gradient Algorithm (SAGA)
4. Stochastic Dual Coordinate Ascent (SDCA)

Randomly sample features:
1. Randomized Coordinate Descent (RCD)
2. Accelerated Proximal Coordinate Gradient (APCG)

Yang Tutorial for ACML’15 Nov. 20, 2015 57 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Basic SGD (Nemirovski & Yudin (1978))

F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2

Full sub-gradient: ∂F(w) = (1/n) ∑_{i=1}^n ∂ℓ(w^⊤x_i, y_i) + λw
Randomly sample i ∈ {1, . . . , n}
Stochastic sub-gradient: ∂ℓ(w^⊤x_i, y_i) + λw

E_i[∂ℓ(w^⊤x_i, y_i) + λw] = ∂F(w)

Yang Tutorial for ACML’15 Nov. 20, 2015 58 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Basic SGD (Nemirovski & Yudin (1978))

Applicable in all settings!

min_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2

sample: i_t ∈ {1, . . . , n}
update: w_t = w_{t−1} − γ_t (∂ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t}) + λw_{t−1})
output: w̄_T = (1/T) ∑_{t=1}^T w_t
step size: γ_t → 0

Yang Tutorial for ACML’15 Nov. 20, 2015 59 / 210
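A minimal SGD sketch for the ℓ2-regularized hinge loss, following the sampled sub-gradient update above (illustrative; γ_t = 1/(λt) is one common decreasing step size, and the averaged iterate is returned).

import numpy as np

def sgd_svm(X, y, lam, T=10000, rng=np.random.default_rng(0)):
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)                       # sample i_t uniformly
        g = lam * w                               # subgradient of (lam/2)||w||^2
        if y[i] * (w @ X[i]) < 1.0:               # hinge loss is active
            g -= y[i] * X[i]
        w -= g / (lam * t)                        # step size gamma_t = 1/(lam*t)
        w_avg += (w - w_avg) / t                  # running average of the iterates
    return w_avg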


Optimization Stochastic Optimization Algorithms for Big Data

Basic SGD (Nemirovski & Yudin (1978))

F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2

If λ = 0: R(w) is non-strongly convex (generalizes to the ℓ_1 norm and other non-strongly convex regularizers); iteration complexity O(1/ε^2)
If λ > 0: R(w) is λ-strongly convex (generalizes to the elastic net and other strongly convex regularizers); iteration complexity O(1/(λε))

Exactly the same as sub-gradient descent!

Yang Tutorial for ACML’15 Nov. 20, 2015 60 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Total Runtime

Per-iteration cost: O(d), much lower than the full gradient method

e.g., hinge loss (SVM):
stochastic gradient: ∂ℓ(w^⊤x_{i_t}, y_{i_t}) = −y_{i_t} x_{i_t} if 1 − y_{i_t} w^⊤x_{i_t} > 0, and 0 otherwise

Yang Tutorial for ACML’15 Nov. 20, 2015 61 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Total Runtime

min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R(w)

Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) non-strongly convex: O(1/ε^2) for non-smooth ℓ, O(1/ε^2) for smooth ℓ
R(w) λ-strongly convex: O(1/(λε)) for non-smooth ℓ, O(1/(λε)) for smooth ℓ

For SGD, only strong convexity helps; smoothness makes no difference!
The reason: the step size has to decrease because the stochastic gradient does not approach 0

Yang Tutorial for ACML’15 Nov. 20, 2015 62 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Variance Reduction

Stochastic Variance Reduced Gradient (SVRG)

Stochastic Average Gradient Algorithm (SAGA)

Stochastic Dual Coordinate Ascent (SDCA)

Yang Tutorial for ACML’15 Nov. 20, 2015 63 / 210

Optimization Stochastic Optimization Algorithms for Big Data

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

min_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2

Applicable when ℓ(z) is smooth and R(w) is λ-strongly convex

Stochastic gradient: g_{i_t}(w) = ∇ℓ(w^⊤x_{i_t}, y_{i_t}) + λw

E_{i_t}[g_{i_t}(w)] = ∇F(w), but Var[g_{i_t}(w)] ≠ 0 even if w = w*

Yang Tutorial for ACML’15 Nov. 20, 2015 64 / 210


Optimization Stochastic Optimization Algorithms for Big Data

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

Compute the full gradient at a reference point w̃:

∇F(w̃) = (1/n) ∑_{i=1}^n g_i(w̃)

Stochastic variance reduced gradient:

ĝ_{i_t}(w) = g_{i_t}(w) − g_{i_t}(w̃) + ∇F(w̃)

E_{i_t}[ĝ_{i_t}(w)] = ∇F(w)
Var[ĝ_{i_t}(w)] → 0 as w, w̃ → w*

Yang Tutorial for ACML’15 Nov. 20, 2015 65 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Variance Reduction (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

At the optimal solution w*: ∇F(w*) = 0
This does not mean g_{i_t}(w) → 0 as w → w*
However, we have ĝ_{i_t}(w) = g_{i_t}(w) − g_{i_t}(w̃) + ∇F(w̃) → 0 as w, w̃ → w*

Yang Tutorial for ACML’15 Nov. 20, 2015 66 / 210

Optimization Stochastic Optimization Algorithms for Big Data

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

Iterate s = 1, . . . , T−1:
  Let w_0 = w̃_s and compute ∇F(w̃_s)
  Iterate t = 1, . . . , m:
    ĝ_{i_t}(w_{t−1}) = ∇F(w̃_s) − g_{i_t}(w̃_s) + g_{i_t}(w_{t−1})
    w_t = w_{t−1} − γ_t ĝ_{i_t}(w_{t−1})
  w̃_{s+1} = (1/m) ∑_{t=1}^m w_t
Output: w̃_T

m = O(1/λ), γ_t = constant

Yang Tutorial for ACML’15 Nov. 20, 2015 67 / 210
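A sketch of the SVRG loop above for ℓ2-regularized logistic regression (illustrative; gi below denotes the per-example gradient ∇ℓ(w^⊤x_i, y_i) + λw, and m = n is a practical inner-loop length rather than the theoretical O(1/λ)).

import numpy as np

def svrg_logistic(X, y, lam, gamma=0.1, m=None, epochs=20, rng=np.random.default_rng(0)):
    n, d = X.shape
    m = m or n                                           # inner loop length
    def gi(w, i):                                        # one-example gradient + regularizer
        return -y[i] * X[i] / (1.0 + np.exp(y[i] * (w @ X[i]))) + lam * w
    w_ref = np.zeros(d)
    for _ in range(epochs):
        full = sum(gi(w_ref, i) for i in range(n)) / n   # full gradient at the reference point
        w = w_ref.copy()
        iterates = np.zeros(d)
        for _ in range(m):
            i = rng.integers(n)
            g = gi(w, i) - gi(w_ref, i) + full           # variance reduced gradient
            w -= gamma * g
            iterates += w
        w_ref = iterates / m                             # new reference point (average)
    return w_ref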

Optimization Stochastic Optimization Algorithms for Big Data

SVRG (Johnson & Zhang, 2013; Zhang et al., 2013; Xiao & Zhang, 2014)

Per-iteration cost: O(d)

Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) non-strongly convex: N.A. for non-smooth ℓ, N.A. [1] for smooth ℓ
R(w) λ-strongly convex: N.A. for non-smooth ℓ, O((n + 1/λ) log(1/ε)) for smooth ℓ

Total Runtime: O(d(n + 1/λ) log(1/ε)), better than AFG: O((nd/√λ) log(1/ε))
Use the proximal mapping for the elastic net regularizer

[1] A small trick can fix this

Yang Tutorial for ACML’15 Nov. 20, 2015 68 / 210


Optimization Stochastic Optimization Algorithms for Big Data

SAGA (Defazio et al. (2014))

min_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2

A new version of SAG (Roux et al. (2012))
Applicable when ℓ(z) is smooth
Strong convexity is not necessary.

Yang Tutorial for ACML’15 Nov. 20, 2015 69 / 210

Optimization Stochastic Optimization Algorithms for Big Data

SAGA (Defazio et al. (2014))

SAGA also reduces the variance of the stochastic gradient, but with a different technique.

SVRG uses gradients at the same reference point w̃:

ĝ_{i_t}(w) = g_{i_t}(w) − g_{i_t}(w̃) + ∇F(w̃),   ∇F(w̃) = (1/n) ∑_{i=1}^n g_i(w̃)

SAGA uses gradients at different points w̃_1, w̃_2, . . . , w̃_n:

ĝ_{i_t}(w) = g_{i_t}(w) − g_{i_t}(w̃_{i_t}) + G,   G = (1/n) ∑_{i=1}^n g_i(w̃_i)

Yang Tutorial for ACML’15 Nov. 20, 2015 70 / 210

Optimization Stochastic Optimization Algorithms for Big Data

SAGA (Defazio et al. (2014))

Initialize the average gradient G_0:

G_0 = (1/n) ∑_{i=1}^n g_i,   g_i = ∇ℓ(w_0^⊤x_i, y_i) + λw_0

At step t:
average gradient: G_{t−1} = (1/n) ∑_{i=1}^n g_i
stochastic variance reduced gradient:
ĝ_{i_t}(w_{t−1}) = ∇ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t}) + λw_{t−1} − g_{i_t} + G_{t−1}
w_t = w_{t−1} − γ_t ĝ_{i_t}(w_{t−1})
Update the selected component of the average gradient:
G_t = (1/n) ∑_{i=1}^n g_i,   with g_{i_t} ← ∇ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t}) + λw_{t−1}

Yang Tutorial for ACML’15 Nov. 20, 2015 71 / 210


Optimization Stochastic Optimization Algorithms for Big Data

SAGA: efficient update of averaged gradient

G_t and G_{t−1} differ only in g_i for i = i_t
Before we overwrite g_{i_t}, we update

G_t = (1/n) ∑_{i=1}^n g_i = G_{t−1} − (1/n) g_{i_t} + (1/n)(∇ℓ(w_{t−1}^⊤x_{i_t}, y_{i_t}) + λw_{t−1})

computation cost: O(d)

Yang Tutorial for ACML’15 Nov. 20, 2015 72 / 210
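A SAGA sketch with the O(d) average-gradient update just described (illustrative; it stores one loss gradient per example, so O(nd) memory, and handles the λw term outside the stored table, which leaves the variance-reduction argument unchanged).

import numpy as np

def saga_logistic(X, y, lam, gamma=0.1, T=100000, rng=np.random.default_rng(0)):
    n, d = X.shape
    w = np.zeros(d)
    def loss_grad(w, i):                      # gradient of the loss term only
        return -y[i] * X[i] / (1.0 + np.exp(y[i] * (w @ X[i])))
    table = np.array([loss_grad(w, i) for i in range(n)])   # stored gradients g_i
    G = table.mean(axis=0)                                   # running average
    for _ in range(T):
        i = rng.integers(n)
        g_new = loss_grad(w, i)
        g = g_new - table[i] + G + lam * w    # variance reduced gradient (+ regularizer)
        w -= gamma * g
        G += (g_new - table[i]) / n           # O(d) update of the average
        table[i] = g_new                      # overwrite the selected component
    return w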

Optimization Stochastic Optimization Algorithms for Big Data

SAGA (Defazio et al. (2014))

Per-iteration cost: O(d)

Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) non-strongly convex: N.A. for non-smooth ℓ, O(n/ε) for smooth ℓ
R(w) λ-strongly convex: N.A. for non-smooth ℓ, O((n + 1/λ) log(1/ε)) for smooth ℓ

Total Runtime (strongly convex): O(d(n + 1/λ) log(1/ε)). Same as SVRG!
Use the proximal mapping for the ℓ_1 regularizer

Yang Tutorial for ACML’15 Nov. 20, 2015 73 / 210


Optimization Stochastic Optimization Algorithms for Big Data

Compare the Runtime of SGD and SVRG/SAGA

Smooth but non-strongly convex:
SGD: O(d/ε^2)
SAGA: O(dn/ε)

Smooth and strongly convex:
SGD: O(d/(λε))
SVRG/SAGA: O(d(n + 1/λ) log(1/ε))

For small ε, use SVRG/SAGA; if a large ε suffices, use SGD.

Yang Tutorial for ACML’15 Nov. 20, 2015 74 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Conjugate Duality

Define ℓ_i(z) ≡ ℓ(z, y_i)

Conjugate function: ℓ_i*(α) ⇔ ℓ_i(z)

ℓ_i(z) = max_{α∈R} [αz − ℓ_i*(α)],   ℓ_i*(α) = max_{z∈R} [αz − ℓ_i(z)]

E.g., hinge loss: ℓ_i(z) = max(0, 1 − y_i z)
ℓ_i*(α) = αy_i if −1 ≤ αy_i ≤ 0, and +∞ otherwise

E.g., squared hinge loss: ℓ_i(z) = max(0, 1 − y_i z)^2
ℓ_i*(α) = α^2/4 + αy_i if αy_i ≤ 0, and +∞ otherwise

Yang Tutorial for ACML’15 Nov. 20, 2015 75 / 210


Optimization Stochastic Optimization Algorithms for Big Data

The Dual Problem

From the primal problem to the dual problem:

min_w (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + (λ/2)‖w‖_2^2
 = min_w (1/n) ∑_{i=1}^n max_{α_i∈R} [−α_i (w^⊤x_i) − ℓ_i*(−α_i)] + (λ/2)‖w‖_2^2
 = max_{α∈R^n} (1/n) ∑_{i=1}^n −ℓ_i*(−α_i) − (λ/2)‖(1/(λn)) ∑_{i=1}^n α_i x_i‖_2^2

Primal solution: w = (1/(λn)) ∑_{i=1}^n α_i x_i

Yang Tutorial for ACML’15 Nov. 20, 2015 76 / 210

Optimization Stochastic Optimization Algorithms for Big Data

SDCA (Shalev-Shwartz & Zhang (2013))

Stochastic Dual Coordinate Ascent (liblinear (Hsieh et al., 2008))
Applicable when R(w) is λ-strongly convex
Smoothness is not required
Solve the dual problem:

max_{α∈R^n} (1/n) ∑_{i=1}^n −ℓ_i*(−α_i) − (λ/2)‖(1/(λn)) ∑_{i=1}^n α_i x_i‖_2^2

Sample i_t ∈ {1, . . . , n} and optimize α_{i_t} while fixing the other coordinates

Yang Tutorial for ACML’15 Nov. 20, 2015 77 / 210

Optimization Stochastic Optimization Algorithms for Big Data

SDCA (Shalev-Shwartz & Zhang (2013))

Maintain a primal solution: w_t = (1/(λn)) ∑_{i=1}^n α_i^t x_i

Optimize the increment Δα_{i_t}:

max_{Δα∈R} (1/n)(−ℓ_{i_t}*(−(α_{i_t}^t + Δα_{i_t}))) − (λ/2)‖(1/(λn))(∑_{i=1}^n α_i^t x_i + Δα_{i_t} x_{i_t})‖_2^2
 ⇔ max_{Δα∈R} (1/n)(−ℓ_{i_t}*(−(α_{i_t}^t + Δα_{i_t}))) − (λ/2)‖w_t + (1/(λn)) Δα_{i_t} x_{i_t}‖_2^2

Yang Tutorial for ACML’15 Nov. 20, 2015 78 / 210

Optimization Stochastic Optimization Algorithms for Big Data

SDCA (Shalev-Shwartz & Zhang (2013))

Dual Coordinate Updates:

Δα_{i_t} = argmax_{Δα∈R} −(1/n) ℓ_{i_t}*(−(α_{i_t}^t + Δα)) − (λ/2)‖w_t + (1/(λn)) Δα x_{i_t}‖_2^2

α_{i_t}^{t+1} = α_{i_t}^t + Δα_{i_t}

w_{t+1} = w_t + (1/(λn)) Δα_{i_t} x_{i_t}

Yang Tutorial for ACML’15 Nov. 20, 2015 79 / 210

Optimization Stochastic Optimization Algorithms for Big Data

SDCA updates

Closed-form solution for Δα_i: hinge loss, squared hinge loss, absolute loss, and square loss (Shalev-Shwartz & Zhang (2013))
e.g., square loss:

Δα_i = (y_i − w_t^⊤x_i − α_i^t) / (1 + ‖x_i‖_2^2/(λn))

Per-iteration cost: O(d)

Approximate solution: logistic loss (Shalev-Shwartz & Zhang (2013))

Yang Tutorial for ACML’15 Nov. 20, 2015 80 / 210
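A sketch of SDCA for ridge regression using the closed-form Δα_i above (illustrative code, not the liblinear implementation).

import numpy as np

def sdca_ridge(X, y, lam, T=100000, rng=np.random.default_rng(0)):
    n, d = X.shape
    alpha = np.zeros(n)                       # dual variables
    w = np.zeros(d)                           # maintained primal w = (1/(lam*n)) sum_i alpha_i x_i
    for _ in range(T):
        i = rng.integers(n)
        # closed-form dual coordinate step for the square loss
        delta = (y[i] - w @ X[i] - alpha[i]) / (1.0 + X[i] @ X[i] / (lam * n))
        alpha[i] += delta
        w += delta * X[i] / (lam * n)         # keep the primal solution in sync, O(d)
    return w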

Optimization Stochastic Optimization Algorithms for Big Data

SDCA

Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) non-strongly convex: N.A. [2] for non-smooth ℓ, N.A. [2] for smooth ℓ
R(w) λ-strongly convex: O(n + 1/(λε)) for non-smooth ℓ, O((n + 1/λ) log(1/ε)) for smooth ℓ

Total Runtime (smooth loss): O(d(n + 1/λ) log(1/ε)). The same as SVRG and SAGA!
Also equivalent to a kind of variance reduction
Proximal variant for the elastic net regularizer
Wang & Lin (2014) show that linear convergence is achievable for non-smooth losses

[2] A small trick can fix this

Yang Tutorial for ACML’15 Nov. 20, 2015 81 / 210


Optimization Stochastic Optimization Algorithms for Big Data

SDCA vs SVRG/SAGA

Advantages of SDCA:
Can handle non-smooth loss functions
Can exploit data sparsity for efficient updates
Parameter free

Yang Tutorial for ACML’15 Nov. 20, 2015 82 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Randomized Coordinate Updates

Randomized Coordinate Descent

Accelerated Proximal Coordinate Gradient

Yang Tutorial for ACML’15 Nov. 20, 2015 83 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Randomized Coordinate Updates

min_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R(w)

Suppose d ≫ n: the per-iteration cost O(d) is too high
Sample over features instead of data; the per-iteration cost becomes O(n)
Applicable when ℓ(z, y) is smooth and R(w) is decomposable
Strong convexity is not necessary

Yang Tutorial for ACML’15 Nov. 20, 2015 84 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Randomized Coordinate Descent (Nesterov (2012))

min_{w∈R^d} F(w) = (1/2)‖Xw − y‖_2^2 + (λ/2)‖w‖_2^2,   X = [x_1, x_2, . . . , x_d] ∈ R^{n×d}

Partial gradient: ∇_i F(w) = x_i^⊤(Xw − y) + λw_i
Randomly sample i_t ∈ {1, . . . , d}

Randomized Coordinate Descent (RCD):

w_i^t = w_i^{t−1} − γ_t ∇_i F(w^{t−1}) if i = i_t, and w_i^t = w_i^{t−1} otherwise

step size γ_t: constant
∇_i F(w_t) can be updated in O(n)

Yang Tutorial for ACML’15 Nov. 20, 2015 85 / 210


Optimization Stochastic Optimization Algorithms for Big Data

Randomized Coordinate Descent (Nesterov (2012))

Partial gradient: ∇_i F(w) = x_i^⊤(Xw − y) + λw_i
Randomly sample i_t ∈ {1, . . . , d}

RCD update:

w_i^t = w_i^{t−1} − γ_t ∇_i F(w^{t−1}) if i = i_t, and w_i^t = w_i^{t−1} otherwise

Maintain and update u = Xw − y ∈ R^n in O(n):

u_t = u_{t−1} + x_{i_t}(w_{i_t}^t − w_{i_t}^{t−1}) = u_{t−1} + x_{i_t} Δw

The partial gradient can then be computed in O(n):

∇_i F(w_t) = x_i^⊤u_t + λw_{t,i}

Yang Tutorial for ACML’15 Nov. 20, 2015 86 / 210
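An RCD sketch for the ridge objective that maintains the residual u = Xw − y so each coordinate step costs O(n) (illustrative; the coordinate step uses 1/L_i with L_i = ‖x_i‖_2^2 + λ, which here amounts to exact coordinate minimization).

import numpy as np

def rcd_ridge(X, y, lam, T=100000, rng=np.random.default_rng(0)):
    n, d = X.shape
    w = np.zeros(d)
    u = -y.astype(float)                       # residual u = X w - y, maintained in O(n)
    col_sq = (X ** 2).sum(axis=0)              # ||x_i||^2 for each feature column
    for _ in range(T):
        i = rng.integers(d)                    # sample a coordinate
        grad_i = X[:, i] @ u + lam * w[i]      # partial gradient, O(n)
        delta = -grad_i / (col_sq[i] + lam)    # coordinate step with step size 1/L_i
        w[i] += delta
        u += delta * X[:, i]                   # update the residual, O(n)
    return w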

Optimization Stochastic Optimization Algorithms for Big Data

RCD

Per-iteration cost: O(n)

Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) non-strongly convex: N.A. for non-smooth ℓ, O(d/ε) for smooth ℓ
R(w) λ-strongly convex: N.A. for non-smooth ℓ, O((d/λ) log(1/ε)) for smooth ℓ

Total Runtime (strongly convex): O((nd/λ) log(1/ε)). The same as the Gradient Descent Method! In practice, it could be much faster.

Yang Tutorial for ACML’15 Nov. 20, 2015 87 / 210


Optimization Stochastic Optimization Algorithms for Big Data

Accelerated Proximal Coordinate Gradient (APCG)

min_{w∈R^d} F(w) = (1/2)‖Xw − y‖_2^2 + (λ/2)‖w‖_2^2 + τ‖w‖_1

Uses acceleration
Uses the proximal mapping

APCG (Lin et al., 2014):

w_i^t = argmin_{w_i∈R} ∇_i F(v^{t−1}) w_i + (1/(2γ_t))(w_i − v_i^{t−1})^2 + τ|w_i| if i = i_t, and w_i^t = w_i^{t−1} otherwise
v^t = w^t + η_t(w^t − w^{t−1})

Yang Tutorial for ACML’15 Nov. 20, 2015 88 / 210

Optimization Stochastic Optimization Algorithms for Big Data

APCG

Per-iteration cost: O(n)

Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) non-strongly convex: N.A. for non-smooth ℓ, O(d/√ε) for smooth ℓ
R(w) λ-strongly convex: N.A. for non-smooth ℓ, O((d/√λ) log(1/ε)) for smooth ℓ

Total Runtime (strongly convex): O((nd/√λ) log(1/ε)). The same as APG! In practice, it could be much faster.

Yang Tutorial for ACML’15 Nov. 20, 2015 89 / 210

Optimization Stochastic Optimization Algorithms for Big Data

APCG applied to the Dual

Recall the acceleration scheme for the full gradient method: an auxiliary sequence (β_t) and a momentum step.

Maintain a primal solution: w_t = (1/(λn)) ∑_{i=1}^n β_i^t x_i

Dual Coordinate Updates: sample i_t ∈ {1, . . . , n}

Δβ_{i_t} = argmax_{Δβ∈R} −(1/n) ℓ_{i_t}*(−β_{i_t}^t − Δβ) − (λ/2)‖w_t + (1/(λn)) Δβ x_{i_t}‖_2^2

α_{i_t}^{t+1} = β_{i_t}^t + Δβ_{i_t}

β^{t+1} = α^{t+1} + η_t(α^{t+1} − α^t)   (momentum step)

Yang Tutorial for ACML’15 Nov. 20, 2015 90 / 210


Optimization Stochastic Optimization Algorithms for Big Data

APCG applied to the Dual

Per-iteration cost: O(d)

Iteration complexity, ℓ(z) ≡ ℓ(z, y):
R(w) non-strongly convex: N.A. [3] for non-smooth ℓ, N.A. [3] for smooth ℓ
R(w) λ-strongly convex: O(n + √(n/(λε))) for non-smooth ℓ, O((n + √(n/λ)) log(1/ε)) for smooth ℓ

Total Runtime (smooth): O(d(n + √(n/λ)) log(1/ε)), which could be faster than SDCA's O(d(n + 1/λ) log(1/ε)) when λ ≤ 1/n

[3] A small trick can fix this

Yang Tutorial for ACML’15 Nov. 20, 2015 91 / 210

Optimization Stochastic Optimization Algorithms for Big Data

APCG vs. SDCA (Lin et al. (2014))

Yang Tutorial for ACML’15 Nov. 20, 2015 92 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Summary

ℓ(z) ≡ ℓ(z, y):
R(w) non-strongly convex: SGD for non-smooth ℓ; RCD, APCG, SAGA for smooth ℓ
R(w) strongly convex: SDCA, APCG for non-smooth ℓ; RCD, APCG, SDCA, SVRG, SAGA for smooth ℓ

Red: stochastic gradient, primal
Blue: randomized coordinate, primal
Green: stochastic coordinate, dual

Yang Tutorial for ACML’15 Nov. 20, 2015 93 / 210


Optimization Stochastic Optimization Algorithms for Big Data

Summary

                     SGD    SVRG     SAGA   SDCA   APCG
Parameters           γ_t    γ_t, m   γ_t    None   η_t
Non-smooth loss      yes    no       no     yes    yes
Smooth loss          yes    yes      yes    yes    yes
Strongly cvx         yes    yes      yes    yes    yes
Non-strongly cvx     yes    no       yes    no     no
Primal               yes    yes      yes    no     yes
Dual                 no     no       no     yes    yes

Yang Tutorial for ACML’15 Nov. 20, 2015 94 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Trick for generalizing to non-strongly convex regularizers (Shalev-Shwartz & Zhang, 2012)

min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + τ‖w‖_1

Issue: not strongly convex. Solution: add an ℓ_2^2 regularization:

min_{w∈R^d} (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + τ‖w‖_1 + (λ/2)‖w‖_2^2

If ‖w*‖_2 ≤ B, we can set λ = ε/B^2.
An ε/2-suboptimal solution for the new problem is ε-suboptimal for the original problem.

Yang Tutorial for ACML’15 Nov. 20, 2015 95 / 210


Optimization Stochastic Optimization Algorithms for Big Data

Outline

2 Optimization
(Sub)Gradient Methods
Stochastic Optimization Algorithms for Big Data
Stochastic Optimization
Distributed Optimization

Yang Tutorial for ACML’15 Nov. 20, 2015 96 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Big Data and Distributed Optimization

Distributed Optimization: data distributed over a cluster of multiple machines

Moving the data to a single machine suffers from low network bandwidth and limited disk or memory

Communication vs. computation:
RAM: ~100 nanoseconds
standard network connection: ~250,000 nanoseconds

Yang Tutorial for ACML’15 Nov. 20, 2015 97 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Distributed Data

n data points are partitioned and distributed to K machines:
[x_1, x_2, . . . , x_n] = S_1 ∪ S_2 ∪ · · · ∪ S_K
Machine j only has access to S_j. W.l.o.g., |S_j| = n_k = n/K

[Figure: data partitioned into blocks S_1, . . . , S_6]

Yang Tutorial for ACML’15 Nov. 20, 2015 98 / 210

Optimization Stochastic Optimization Algorithms for Big Data

A simple solution: Average Solution

Global problem:

w* = argmin_{w∈R^d} F(w) = (1/n) ∑_{i=1}^n ℓ(w^⊤x_i, y_i) + R(w)

Machine j solves a local problem:

ŵ_j = argmin_{w∈R^d} f_j(w) = (1/n_k) ∑_{i∈S_j} ℓ(w^⊤x_i, y_i) + R(w)

[Figure: blocks S_1, . . . , S_6 with local solutions ŵ_1, . . . , ŵ_6]

The center computes: w̄ = (1/K) ∑_{j=1}^K ŵ_j.   Issue: w̄ will not converge to w*

Yang Tutorial for ACML’15 Nov. 20, 2015 99 / 210


Optimization Stochastic Optimization Algorithms for Big Data

Total Runtime

Single machine:
Total Runtime = Per-iteration Cost × Iteration Complexity

Distributed optimization:
Total Runtime = (Communication Time per Round + Local Runtime per Round) × Rounds of Communication

Trading computation for communication: increase local computation and balance it against communication to reduce the rounds of communication

Yang Tutorial for ACML’15 Nov. 20, 2015 100 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Distributed SDCA (DisDCA) (Yang, 2013), CoCoA+ (Ma et al., 2015)

Applicable when R(w) is strongly convex, e.g., R(w) = (λ/2)‖w‖_2^2

Global dual problem:

max_{α∈R^n} (1/n) ∑_{i=1}^n −ℓ_i*(−α_i) − (λ/2)‖(1/(λn)) ∑_{i=1}^n α_i x_i‖_2^2

Incremental variable Δα_i:

max_{Δα} (1/n)(−ℓ_i*(−(α_i^t + Δα_i))) − (λ/2)‖w_t + (1/(λn)) ∑_{i=1}^n Δα_i x_i‖_2^2

Primal solution: w_t = (1/(λn)) ∑_{i=1}^n α_i^t x_i

Yang Tutorial for ACML’15 Nov. 20, 2015 101 / 210


Optimization Stochastic Optimization Algorithms for Big Data

DisDCA: Trading Computation for Communication

Δα_{i_j} = argmax −ℓ_{i_j}*(−α_{i_j}^t − Δα_{i_j}) − (λn/(2K))‖u_j^t + (K/(λn)) Δα_{i_j} x_{i_j}‖_2^2

u_j^{t+1} = u_j^t + (K/(λn)) Δα_{i_j} x_{i_j}

Yang Tutorial for ACML’15 Nov. 20, 2015 102 / 210

Optimization Stochastic Optimization Algorithms for Big Data

DisDCA: Trading Computation for Communication

Yang Tutorial for ACML’15 Nov. 20, 2015 103 / 210

Optimization Stochastic Optimization Algorithms for Big Data

DisDCA: Trading Computation for Communication

Yang Tutorial for ACML’15 Nov. 20, 2015 104 / 210

Optimization Stochastic Optimization Algorithms for Big Data

CoCoA+ (Ma et al., 2015)

Machine j approximately solves

Δα_{S_j}^t ≈ argmax_{Δα_{S_j}∈R^n} ∑_{i∈S_j} −ℓ_i*(−(α_i^t + Δα_i)) − ⟨w_t, ∑_{i∈S_j} Δα_i x_i⟩ − (K/(2λn))‖∑_{i∈S_j} Δα_i x_i‖_2^2

α_{S_j}^{t+1} = α_{S_j}^t + Δα_{S_j}^t,   Δw_j^t = (1/(λn)) ∑_{i∈S_j} Δα_i^t x_i

The center computes: w_{t+1} = w_t + ∑_{j=1}^K Δw_j^t

Yang Tutorial for ACML’15 Nov. 20, 2015 105 / 210

Optimization Stochastic Optimization Algorithms for Big Data

CoCoA+ (Ma et al., 2015)

Local objective value:

G_j(Δα_{S_j}, w_t) = (1/n) ∑_{i∈S_j} −ℓ_i*(−(α_i^t + Δα_i)) − (1/n)⟨w_t, ∑_{i∈S_j} Δα_i x_i⟩ − (K/(2λn^2))‖∑_{i∈S_j} Δα_i x_i‖_2^2

Solve Δα_{S_j}^t by any local solver, as long as

(max_{Δα_{S_j}} G_j(Δα_{S_j}, w_t) − G_j(Δα_{S_j}^t, w_t)) ≤ Θ (max_{Δα_{S_j}} G_j(Δα_{S_j}, w_t) − G_j(0, w_t)),   0 < Θ < 1

CoCoA+ is equivalent to DisDCA when employing SDCA to solve the local problems with m iterations

Yang Tutorial for ACML’15 Nov. 20, 2015 106 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Distributed SDCA in Practice

Choice of m (i.e., the number of inner iterations): the larger m, the higher the local computation cost and the lower the communication cost

Choice of K (i.e., the number of machines): the larger K, the lower the local computation cost and the higher the communication cost

Yang Tutorial for ACML’15 Nov. 20, 2015 107 / 210

Optimization Stochastic Optimization Algorithms for Big Data

DisDCA is implemented:

http://cs.uiowa.edu/˜tyng/software.html
Classification and Regression
Losses:
1. Hinge loss and squared hinge loss (SVM)
2. Logistic loss (Logistic Regression)
3. Square loss (Ridge Regression/LASSO)
Regularizers:
1. ℓ_2 norm
2. mixture of ℓ_1 norm and ℓ_2 norm
Multi-class: one-vs-all

Yang Tutorial for ACML’15 Nov. 20, 2015 108 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Alternating Direction Method of Multipliers (ADMM)

min_{w∈R^d} F(w) = ∑_{k=1}^K f_k(w) + (λ/2)‖w‖_2^2, where f_k(w) = (1/n) ∑_{i∈S_k} ℓ(w^⊤x_i, y_i)

each f_k(w) resides on an individual machine, but the machines are coupled through the shared w

Introduce local copies:

min_{w_1,...,w_K, w∈R^d} F(w) = ∑_{k=1}^K f_k(w_k) + (λ/2)‖w‖_2^2
s.t. w_k = w, k = 1, . . . , K

Yang Tutorial for ACML’15 Nov. 20, 2015 109 / 210

Optimization Stochastic Optimization Algorithms for Big Data

The Augmented Lagrangian Function

min_{w_1,...,w_K, w∈R^d} ∑_{k=1}^K f_k(w_k) + (λ/2)‖w‖_2^2
s.t. w_k = w, k = 1, . . . , K

The Augmented Lagrangian function

L({w_k}, {z_k}, w) = ∑_{k=1}^K f_k(w_k) + (λ/2)‖w‖_2^2 + ∑_{k=1}^K z_k^⊤(w_k − w) + (ρ/2) ∑_{k=1}^K ‖w_k − w‖_2^2

(the z_k are the Lagrangian multipliers) is the Lagrangian function of

min_{w_1,...,w_K, w∈R^d} ∑_{k=1}^K f_k(w_k) + (λ/2)‖w‖_2^2 + (ρ/2) ∑_{k=1}^K ‖w_k − w‖_2^2
s.t. w_k = w, k = 1, . . . , K

Yang Tutorial for ACML’15 Nov. 20, 2015 110 / 210


Optimization Stochastic Optimization Algorithms for Big Data

ADMM

L({w_k}, {z_k}, w) = ∑_{k=1}^K f_k(w_k) + (λ/2)‖w‖_2^2 + ∑_{k=1}^K z_k^⊤(w_k − w) + (ρ/2) ∑_{k=1}^K ‖w_k − w‖_2^2

Update from (w_k^t, z_k^t, w^t) to (w_k^{t+1}, z_k^{t+1}, w^{t+1}):

w_k^{t+1} = argmin_{w_k} f_k(w_k) + (z_k^t)^⊤(w_k − w^t) + (ρ/2)‖w_k − w^t‖_2^2, k = 1, . . . , K   (optimize on individual machines)

w^{t+1} = argmin_w (λ/2)‖w‖_2^2 − ∑_{k=1}^K (z_k^t)^⊤w + (ρ/2) ∑_{k=1}^K ‖w_k^{t+1} − w‖_2^2   (aggregate and update on one machine)

z_k^{t+1} = z_k^t + ρ(w_k^{t+1} − w^{t+1})   (update on individual machines)

Yang Tutorial for ACML’15 Nov. 20, 2015 111 / 210


Optimization Stochastic Optimization Algorithms for Big Data

ADMM

w_k^{t+1} = argmin_{w_k} f_k(w_k) + (z_k^t)^⊤(w_k − w^t) + (ρ/2)‖w_k − w^t‖_2^2, k = 1, . . . , K

Each local problem can be solved by a local solver (e.g., SDCA)
Optimization can be inexact (trading computation for communication)

Yang Tutorial for ACML’15 Nov. 20, 2015 112 / 210
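A consensus-ADMM sketch following the three updates above, with the local subproblems taken to be ridge-type and solved in closed form (illustrative; the K blocks are simulated in a single process rather than on separate machines).

import numpy as np

def admm_consensus_ridge(blocks, lam, rho, T=50):
    # blocks: list of (X_k, y_k); local f_k(w) = (1/n) * ||X_k w - y_k||^2, n = total number of points
    K = len(blocks)
    n = sum(Xk.shape[0] for Xk, _ in blocks)
    d = blocks[0][0].shape[1]
    W = np.zeros((K, d)); Z = np.zeros((K, d)); w = np.zeros(d)
    for _ in range(T):
        for k, (Xk, yk) in enumerate(blocks):
            # w_k update: argmin f_k(w_k) + z_k^T (w_k - w) + (rho/2)||w_k - w||^2
            A = 2.0 / n * Xk.T @ Xk + rho * np.eye(d)
            b = 2.0 / n * Xk.T @ yk - Z[k] + rho * w
            W[k] = np.linalg.solve(A, b)
        # w update: argmin (lam/2)||w||^2 - sum_k z_k^T w + (rho/2) sum_k ||w_k - w||^2
        w = (Z.sum(axis=0) + rho * W.sum(axis=0)) / (lam + rho * K)
        Z += rho * (W - w)                     # multiplier update on each block
    return w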

Optimization Stochastic Optimization Algorithms for Big Data

Complexity of ADMM

Assume the local problems are solved exactly.

Communication Complexity: O(log(1/ε)), due to the strong convexity of R(w)

Applicable to a non-strongly convex regularizer R(w) = ‖w‖_1:

min_{w∈R^d} F(w) = ∑_{k=1}^K (1/n) ∑_{i∈S_k} ℓ(w^⊤x_i, y_i) + τ‖w‖_1

Communication Complexity: O(1/ε)

Yang Tutorial for ACML’15 Nov. 20, 2015 113 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Thank You! Questions?

Yang Tutorial for ACML’15 Nov. 20, 2015 114 / 210

Optimization Stochastic Optimization Algorithms for Big Data

Research Assistant Positions Available for PhD Candidates! Start Fall’16
Optimization and Randomization
Online Learning
Deep Learning
Machine Learning
send email to tianbao-yang@uiowa.edu

Yang Tutorial for ACML’15 Nov. 20, 2015 115 / 210

Randomized Dimension Reduction

Big Data Analytics: Optimization and Randomization

Part III: Randomization

Yang Tutorial for ACML’15 Nov. 20, 2015 116 / 210

Randomized Dimension Reduction

Outline

1 Basics

2 Optimization

3 Randomized Dimension Reduction

4 Randomized Algorithms

5 Concluding Remarks

Yang Tutorial for ACML’15 Nov. 20, 2015 117 / 210

Randomized Dimension Reduction

Random Sketch

Approximate a large data matrix

by a much smaller sketch

Yang Tutorial for ACML’15 Nov. 20, 2015 118 / 210

Randomized Dimension Reduction

The Framework of Randomized Algorithms

(Figure-only slides 119–122.)

Yang Tutorial for ACML’15 Nov. 20, 2015 119–122 / 210

Randomized Dimension Reduction

Why randomized dimension reduction?

Efficient

Robust (e.g., dropout)

Formal Guarantees

Can explore parallel algorithms

Yang Tutorial for ACML’15 Nov. 20, 2015 123 / 210

Randomized Dimension Reduction

Randomized Dimension Reduction

Johnson-Lindenstrauss (JL) transforms

Subspace embeddings

Column sampling

Yang Tutorial for ACML’15 Nov. 20, 2015 124 / 210

Randomized Dimension Reduction

JL Lemma

JL Lemma (Johnson & Lindenstrauss, 1984)
For any 0 < ε, δ < 1/2, there exists a probability distribution on m × d real matrices A and a small universal constant c > 0 such that for any fixed x ∈ R^d, with probability at least 1 − δ,

\left| \|Ax\|_2^2 - \|x\|_2^2 \right| \le c\,\sqrt{\frac{\log(1/\delta)}{m}}\,\|x\|_2^2

or, for m = Θ(ε^{-2} log(1/δ)), with probability at least 1 − δ,

\left| \|Ax\|_2^2 - \|x\|_2^2 \right| \le \epsilon\,\|x\|_2^2

Yang Tutorial for ACML’15 Nov. 20, 2015 125 / 210

Randomized Dimension Reduction

Embedding a set of points into low dimensional space

Given a set of points x_1, . . . , x_n ∈ R^d, we can embed them into a low-dimensional space Ax_1, . . . , Ax_n ∈ R^m such that the pairwise distance between any two points is well preserved:

(1-\epsilon)\,\|x_i - x_j\|_2^2 \;\le\; \|Ax_i - Ax_j\|_2^2 = \|A(x_i - x_j)\|_2^2 \;\le\; (1+\epsilon)\,\|x_i - x_j\|_2^2

In other words, to preserve all pairwise Euclidean distances up to 1 ± ε, m = Θ(ε^{-2} log(n^2/δ)) dimensions suffice.

Yang Tutorial for ACML’15 Nov. 20, 2015 126 / 210

Randomized Dimension Reduction

JL transforms: Gaussian Random Projection

Gaussian Random Projection (Dasgupta & Gupta, 2003): A ∈ R^{m×d}

A_{ij} ∼ N(0, 1/m),  m = Θ(ε^{-2} log(1/δ))

Computational cost of AX, where X ∈ R^{d×n}:
  mnd for dense matrices
  nnz(X)·m for sparse matrices

Computational cost is very high (could be as high as solving many problems)

Yang Tutorial for ACML’15 Nov. 20, 2015 127 / 210
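A minimal numpy sketch of the Gaussian random projection above; the particular ε, δ, and synthetic data are illustrative assumptions, and the universal constant in m = Θ(ε^{-2} log(1/δ)) is simply dropped.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 10_000, 50                   # ambient dimension, number of points
eps, delta = 0.2, 0.01
m = int(np.ceil(eps ** -2 * np.log(1.0 / delta)))   # m ~ eps^-2 log(1/delta)

X = rng.standard_normal((d, n))     # columns are the points x_1, ..., x_n
A = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, d))  # A_ij ~ N(0, 1/m)

AX = A @ X                          # costs m*n*d for a dense X

# Check how well squared norms are preserved.
ratios = np.linalg.norm(AX, axis=0) ** 2 / np.linalg.norm(X, axis=0) ** 2
print("min/max of ||Ax||^2 / ||x||^2:", ratios.min(), ratios.max())
```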

Randomized Dimension Reduction

Accelerate JL transforms: using discrete distributions

Using Discrete Distributions (Achlioptas, 2003): either

Pr\left(A_{ij} = \pm\tfrac{1}{\sqrt{m}}\right) = \tfrac{1}{2}

or

Pr\left(A_{ij} = \pm\sqrt{\tfrac{3}{m}}\right) = \tfrac{1}{6}, \qquad Pr(A_{ij} = 0) = \tfrac{2}{3}

Database friendly: replaces multiplications by additions and subtractions

Yang Tutorial for ACML’15 Nov. 20, 2015 128 / 210
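A sketch of the database-friendly construction, assuming the second (sparse) distribution above; the dimensions are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10_000, 200

# A_ij = sqrt(3/m) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6}
signs = rng.choice([-1.0, 0.0, 1.0], size=(m, d), p=[1 / 6, 2 / 3, 1 / 6])
A = np.sqrt(3.0 / m) * signs

x = rng.standard_normal(d)
print("norm ratio:", np.dot(A @ x, A @ x) / np.dot(x, x))   # close to 1
```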

Randomized Dimension Reduction

Accelerate JL transforms: using Hadmard transform (I)

Fast JL transform based on the randomized Hadamard transform:

Motivation: can we simply use a random sampling matrix P ∈ R^{m×d} that randomly selects m coordinates out of d coordinates (scaled by \sqrt{d/m})?

Unfortunately, by the Chernoff bound,

\left| \|Px\|_2^2 - \|x\|_2^2 \right| \;\le\; \frac{\sqrt{d}\,\|x\|_\infty}{\|x\|_2}\,\sqrt{\frac{3\log(2/\delta)}{m}}\,\|x\|_2^2

Unless \frac{\sqrt{d}\,\|x\|_\infty}{\|x\|_2} \le c, random sampling does not work

The remedy is given by the randomized Hadamard transform

Yang Tutorial for ACML’15 Nov. 20, 2015 129 / 210


Randomized Dimension Reduction

Randomized Hadamard transform

Hadamard transform: H ∈ R^{d×d}, H = \sqrt{\tfrac{1}{d}}\,H_{2^k}, where

H_1 = [1], \quad H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \quad H_{2^k} = \begin{bmatrix} H_{2^{k-1}} & H_{2^{k-1}} \\ H_{2^{k-1}} & -H_{2^{k-1}} \end{bmatrix}

‖Hx‖_2 = ‖x‖_2 and H is orthogonal; computational cost of Hx: d log(d)

Randomized Hadamard transform: HD, where D ∈ R^{d×d} is a diagonal matrix with Pr(D_{ii} = ±1) = 0.5
HD is orthogonal and ‖HDx‖_2 = ‖x‖_2

Key property: \frac{\sqrt{d}\,\|HDx\|_\infty}{\|HDx\|_2} \le \sqrt{\log(d/\delta)} w.h.p. 1 − δ

Yang Tutorial for ACML’15 Nov. 20, 2015 130 / 210


Randomized Dimension Reduction

Accelerate JL transforms: using Hadamard transform (I)

Fast JL transform based on the randomized Hadamard transform (Tropp, 2011):

A = \sqrt{\tfrac{d}{m}}\, P H D

yields

\left| \|Ax\|_2^2 - \|x\|_2^2 \right| \;\le\; \sqrt{\frac{3\log(2/\delta)\log(d/\delta)}{m}}\,\|x\|_2^2

m = Θ(ε^{-2} log(1/δ) log(d/δ)) suffices for 1 ± ε
the additional factor log(d/δ) can be removed
Computational cost of AX: O(nd log(m))

Yang Tutorial for ACML’15 Nov. 20, 2015 131 / 210
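A sketch of the subsampled randomized Hadamard transform A = √(d/m) PHD above, using a hand-rolled fast Walsh-Hadamard transform; d being a power of 2 and sampling coordinates uniformly without replacement for P are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht(x):
    # Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of 2.
    x = x.copy()
    h, n = 1, len(x)
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

d, m = 2 ** 12, 256                       # d is a power of 2 here
D = rng.choice([-1.0, 1.0], size=d)       # Pr(D_ii = +/-1) = 0.5
S = rng.choice(d, size=m, replace=False)  # P: uniformly sampled coordinates

def srht(x):
    # A x with A = sqrt(d/m) * P * H * D, where H = (1/sqrt(d)) * H_{2^k}.
    hdx = fwht(D * x) / np.sqrt(d)
    return np.sqrt(d / m) * hdx[S]

x = rng.standard_normal(d)
print("norm ratio:", np.linalg.norm(srht(x)) ** 2 / np.linalg.norm(x) ** 2)
```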

Randomized Dimension Reduction

Accelerate JL transforms: using a sparse matrix (I)

Random hashing (Dasgupta et al., 2010):

A = HD, where D ∈ R^{d×d} and H ∈ R^{m×d}

random hashing: h(j) : {1, . . . , d} → {1, . . . , m}
H_{ij} = 1 if h(j) = i: sparse matrix (each column has only one non-zero entry)
D ∈ R^{d×d}: a diagonal matrix with Pr(D_{ii} = ±1) = 0.5

[Ax]_j = \sum_{i\,:\,h(i)=j} x_i D_{ii}

Technically speaking, random hashing does not satisfy the JL lemma

Yang Tutorial for ACML’15 Nov. 20, 2015 132 / 210
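A minimal sketch of the random hashing map [Ax]_j = Σ_{i: h(i)=j} x_i D_ii; representing the hash function by a uniformly random lookup table is an assumption made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 10_000, 512

h = rng.integers(0, m, size=d)            # hash h: {1,...,d} -> {1,...,m}
D = rng.choice([-1.0, 1.0], size=d)       # Pr(D_ii = +/-1) = 0.5

def hash_project(x):
    # [Ax]_j = sum_{i: h(i)=j} D_ii * x_i   (each column of A has one nonzero)
    out = np.zeros(m)
    np.add.at(out, h, D * x)
    return out

x = rng.standard_normal(d)
print("norm ratio:", np.linalg.norm(hash_project(x)) ** 2 / np.linalg.norm(x) ** 2)
```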


Randomized Dimension Reduction

Accelerate JL transforms: using a sparse matrix (I)

Key properties: E[⟨HDx_1, HDx_2⟩] = ⟨x_1, x_2⟩, and norm preserving,

\left| \|HDx\|_2^2 - \|x\|_2^2 \right| \le \epsilon\,\|x\|_2^2, \quad \text{only when } \frac{\|x\|_\infty}{\|x\|_2} \le \frac{1}{\sqrt{c}}

Apply a randomized Hadamard transform P first (Θ(c log(c/δ)) blocks of randomized Hadamard transforms) so that

\frac{\|Px\|_\infty}{\|Px\|_2} \le \frac{1}{\sqrt{c}}

Yang Tutorial for ACML’15 Nov. 20, 2015 133 / 210

Randomized Dimension Reduction

Accelerate JL transforms: using a sparse matrix (II)

Sparse JL transform based on block random hashing (Kane & Nelson, 2014):

A = \begin{bmatrix} \frac{1}{\sqrt{s}} Q_1 \\ \vdots \\ \frac{1}{\sqrt{s}} Q_s \end{bmatrix}

Each Q_i ∈ R^{v×d} is an independent random hashing (HD) matrix
Set v = Θ(ε^{-1}) and s = Θ(ε^{-1} log(1/δ))

Computational cost of AX: O\!\left(\frac{nnz(X)}{\epsilon}\log\frac{1}{\delta}\right)

Yang Tutorial for ACML’15 Nov. 20, 2015 134 / 210

Randomized Dimension Reduction

Randomized Dimension Reduction

Johnson-Lindenstrauss (JL) transforms

Subspace embeddings

Column sampling

Yang Tutorial for ACML’15 Nov. 20, 2015 135 / 210

Randomized Dimension Reduction

Subspace Embeddings

Definition: a subspace embedding, given parameters 0 < ε, δ < 1 and k ≤ d, is a distribution D over matrices A ∈ R^{m×d} such that for any fixed linear subspace W ⊆ R^d with dim(W) = k,

\Pr_{A\sim\mathcal{D}}\left( \forall x \in W,\; \|Ax\|_2 \in (1 \pm \epsilon)\|x\|_2 \right) \ge 1 - \delta

It implies: if U ∈ R^{d×k} is an orthogonal matrix (contains the orthonormal bases of W), then
  AU ∈ R^{m×k} is of full column rank
  the singular values of AU lie in (1 ± ε), i.e., (1 − ε)^2 ≤ λ(U^⊤A^⊤AU) ≤ (1 + ε)^2

These are key properties in the theoretical analysis of many algorithms (e.g., low-rank matrix approximation, randomized least-squares regression, randomized classification)

Yang Tutorial for ACML’15 Nov. 20, 2015 136 / 210


Randomized Dimension Reduction

Subspace Embeddings

From a JL transform to a Subspace Embedding (Sarlos, 2006). Let A ∈ R^{m×d} be a JL transform. If

m = O\!\left(\frac{k\log\left[\frac{k}{\delta\epsilon}\right]}{\epsilon^2}\right)

then w.h.p. 1 − δ, A ∈ R^{m×d} is a subspace embedding w.r.t. a k-dimensional subspace of R^d

Yang Tutorial for ACML’15 Nov. 20, 2015 137 / 210

Randomized Dimension Reduction

Subspace Embeddings

Making block random hashing a Subspace Embedding (Nelson & Nguyen, 2013):

A = \begin{bmatrix} \frac{1}{\sqrt{s}} Q_1 \\ \vdots \\ \frac{1}{\sqrt{s}} Q_s \end{bmatrix}

Each Q_i ∈ R^{v×d} is an independent random hashing (HD) matrix
Set v = Θ(k ε^{-1} log^5(k/δ)) and s = Θ(ε^{-1} log^3(k/δ))

w.h.p. 1 − δ, A ∈ R^{m×d} with m = Θ\!\left(\frac{k\log^8(k/\delta)}{\epsilon^2}\right) is a subspace embedding w.r.t. a k-dimensional subspace of R^d

Computational cost of AX: O\!\left(\frac{nnz(X)}{\epsilon}\log^3\left[\frac{k}{\delta}\right]\right)

Yang Tutorial for ACML’15 Nov. 20, 2015 138 / 210

Randomized Dimension Reduction

Sparse Subspace Embedding (SSE)

Random hashing is an SSE with a constant probability (Nelson & Nguyen, 2013):

A = HD, where D ∈ R^{d×d} and H ∈ R^{m×d}

m = Ω(k^2/ε^2) suffices for a subspace embedding with probability 2/3
Computational cost of AX: O(nnz(X))

Yang Tutorial for ACML’15 Nov. 20, 2015 139 / 210

Randomized Dimension Reduction

Randomized Dimensionality Reduction

Johnson-Lindenstrauss (JL) transforms

Subspace embeddings

Column (Row) sampling

Yang Tutorial for ACML’15 Nov. 20, 2015 140 / 210

Randomized Dimension Reduction

Column sampling

Column subset selection (feature selection)
More interpretable
Uniform sampling usually does not work (not a JL transform)
Non-oblivious sampling (data-dependent sampling):
  leverage-score sampling

Yang Tutorial for ACML’15 Nov. 20, 2015 141 / 210


Randomized Dimension Reduction

Leverage-score sampling (Drineas et al., 2006)

Let X ∈ R^{d×n} be a rank-k matrix with SVD X = UΣV^⊤, U ∈ R^{d×k}, Σ ∈ R^{k×k}

Leverage scores: ‖U_{i*}‖_2^2, i = 1, . . . , d

Let p_i = \frac{\|U_{i*}\|_2^2}{\sum_{i=1}^d \|U_{i*}\|_2^2}, \quad i = 1,\ldots,d

Let i_1, . . . , i_m ∈ {1, . . . , d} denote m indices sampled according to {p_i}

Let A ∈ R^{m×d} be the sampling-and-rescaling matrix:

A_{j\ell} = \begin{cases} \frac{1}{\sqrt{m\,p_{i_j}}} & \text{if } \ell = i_j \\ 0 & \text{otherwise} \end{cases}

AX ∈ R^{m×n} is a small sketch of X

Yang Tutorial for ACML’15 Nov. 20, 2015 142 / 210
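A sketch of leverage-score sampling for a low-rank X; the matrix sizes are illustrative, and the scores are computed here via an exact top-k SVD, which (as the next slide notes) is the expensive route.

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-k data matrix X in R^{d x n} (low rank by construction here).
d, n, k = 2_000, 500, 20
X = rng.standard_normal((d, k)) @ rng.standard_normal((k, n))

U = np.linalg.svd(X, full_matrices=False)[0][:, :k]   # top-k left singular vectors
lev = np.sum(U ** 2, axis=1)                          # leverage scores ||U_{i*}||_2^2
p = lev / lev.sum()                                   # sampling probabilities

m = 400
idx = rng.choice(d, size=m, replace=True, p=p)
scale = 1.0 / np.sqrt(m * p[idx])                     # sampling-and-rescaling
AX = scale[:, None] * X[idx, :]                       # the sketch A X in R^{m x n}

# AU should behave like an approximate isometry on the column space of U.
AU = scale[:, None] * U[idx, :]
print("singular values of AU:", np.linalg.svd(AU, compute_uv=False).round(3))
```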


Randomized Dimension Reduction

Properties of Leverage-score sampling

When m = Θ\!\left(\frac{k}{\epsilon^2}\log\left[\frac{2k}{\delta}\right]\right), w.h.p. 1 − δ:

AU ∈ R^{m×k} is of full column rank
σ_i^2(AU) ≥ 1 − ε ≥ (1 − ε)^2
σ_i^2(AU) ≤ 1 + ε ≤ (1 + ε)^2

Leverage-score sampling performs like a subspace embedding (only for U, the top singular vector matrix of X)
Computational cost: computing the top-k SVD of X is expensive
Randomized algorithms can compute approximate leverage scores

Yang Tutorial for ACML’15 Nov. 20, 2015 143 / 210


Randomized Dimension Reduction

When uniform sampling makes sense?

Coherence measure:

\mu_k = \frac{d}{k}\max_{1\le i\le d} \|U_{i*}\|_2^2

Valid when the coherence measure is small (some real data mining datasets have small coherence measures)
The Nystrom method usually uses uniform sampling (Gittens, 2011)

Yang Tutorial for ACML’15 Nov. 20, 2015 144 / 210

Randomized Algorithms Randomized Classification (Regression)

Outline

4 Randomized Algorithms
    Randomized Classification (Regression)
    Randomized Least-Squares Regression
    Randomized K-means Clustering
    Randomized Kernel methods
    Randomized Low-rank Matrix Approximation

Yang Tutorial for ACML’15 Nov. 20, 2015 145 / 210

Randomized Algorithms Randomized Classification (Regression)

Classification

Classification problems:

\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2

y_i ∈ {+1, −1}: label
Loss function ℓ(z), z = y w^⊤ x:
1. SVMs: (squared) hinge loss ℓ(z) = max(0, 1 − z)^p, where p = 1, 2
2. Logistic Regression: ℓ(z) = log(1 + exp(−z))

Yang Tutorial for ACML’15 Nov. 20, 2015 146 / 210

Randomized Algorithms Randomized Classification (Regression)

Randomized Classification

For large-scale high-dimensional problems, the computational cost of optimization is O((nd + dκ) log(1/ε)).

Use a random reduction A ∈ R^{d×m} (m ≪ d) to reduce X ∈ R^{n×d} to \hat{X} = XA ∈ R^{n×m}. Then solve

\min_{u\in\mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \ell(y_i u^\top \hat{x}_i) + \frac{\lambda}{2}\|u\|_2^2

JL transforms
Sparse subspace embeddings

Yang Tutorial for ACML’15 Nov. 20, 2015 147 / 210

Randomized Algorithms Randomized Classification (Regression)

Randomized Classification

Two questions:

Is there any performance guarantee?
  margin is preserved: if the data is linearly separable (Balcan et al., 2006), as long as m ≥ \frac{12}{\epsilon^2}\log\frac{6m}{\delta}
  generalization performance is preserved: if the data matrix is of low rank and m = Ω\!\left(\frac{k\,\mathrm{poly}(\log(k/(\delta\epsilon)))}{\epsilon^2}\right) (Paul et al., 2013)

How to recover an accurate model in the original high-dimensional space?
  Dual Recovery (Zhang et al., 2014) and Dual Sparse Recovery (Yang et al., 2015)

Yang Tutorial for ACML’15 Nov. 20, 2015 148 / 210


Randomized Algorithms Randomized Classification (Regression)

The Dual problem

Using the Fenchel conjugate

\ell_i^*(\alpha_i) = \max_{z} \; \alpha_i z - \ell(z, y_i)

Primal:

w_* = \arg\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \ell(w^\top x_i, y_i) + \frac{\lambda}{2}\|w\|_2^2

Dual:

\alpha_* = \arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\,\alpha^\top X X^\top \alpha

From dual to primal:

w_* = -\frac{1}{\lambda n} X^\top \alpha_*

Yang Tutorial for ACML’15 Nov. 20, 2015 149 / 210

Randomized Algorithms Randomized Classification (Regression)

Dual Recovery for Randomized Reduction

From the dual formulation: w_* lies in the row space of the data matrix X ∈ R^{n×d}

Dual Recovery: \hat{w}_* = -\frac{1}{\lambda n} X^\top \hat{\alpha}_*, where

\hat{\alpha}_* = \arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\,\alpha^\top \hat{X}\hat{X}^\top \alpha

and \hat{X} = XA ∈ R^{n×m}

Subspace embedding A with m = Θ(r log(r/δ) ε^{-2})

Guarantee: under a low-rank assumption on the data matrix X (e.g., rank(X) = r), with high probability 1 − δ,

\|\hat{w}_* - w_*\|_2 \le \frac{\epsilon}{1-\epsilon}\,\|w_*\|_2

Yang Tutorial for ACML’15 Nov. 20, 2015 150 / 210
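A numeric sketch of the reduce, solve, recover pipeline above. The squared loss ℓ(z, y) = ½(z − y)² is assumed so that both the full and the reduced duals have closed forms, α_* = −(I + XX^⊤/(λn))^{-1} y; the low-rank synthetic data, the Gaussian reduction, and all sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Low-rank data so that the dual-recovery guarantee applies.
n, d, r, m = 300, 5_000, 10, 200
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
y = np.sign(rng.standard_normal(n))
lam = 0.1

def dual_solution(Z):
    # Closed-form maximizer of -(1/n) sum_i (0.5*a_i^2 + a_i*y_i) - a^T ZZ^T a/(2*lam*n^2)
    K = Z @ Z.T
    return -np.linalg.solve(np.eye(n) + K / (lam * n), y)

alpha_full = dual_solution(X)
w_full = -X.T @ alpha_full / (lam * n)          # optimal model in R^d

A = rng.normal(0.0, np.sqrt(1.0 / m), size=(d, m))   # Gaussian reduction
Xh = X @ A                                           # reduced data, n x m
alpha_hat = dual_solution(Xh)                        # dual of the reduced problem
w_rec = -X.T @ alpha_hat / (lam * n)                 # recover in the original space

print("relative recovery error:",
      np.linalg.norm(w_rec - w_full) / np.linalg.norm(w_full))
```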


Randomized Algorithms Randomized Classification (Regression)

Dual Sparse Recovery for Randomized Reduction

Assume the optimal dual solution α_* is sparse (i.e., the number of support vectors is small)

Dual Sparse Recovery: \hat{w}_* = -\frac{1}{\lambda n} X^\top \hat{\alpha}_*, where

\hat{\alpha}_* = \arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\,\alpha^\top \hat{X}\hat{X}^\top \alpha - \frac{\tau}{n}\|\alpha\|_1

where \hat{X} = XA ∈ R^{n×m}

JL transform A with m = Θ(s log(n/δ) ε^{-2})

Guarantee: if α_* is s-sparse, with high probability 1 − δ,

\|\hat{w}_* - w_*\|_2 \le \epsilon\,\|w_*\|_2

Yang Tutorial for ACML’15 Nov. 20, 2015 151 / 210


Randomized Algorithms Randomized Classification (Regression)

Dual Sparse Recovery

RCV1 text data, n = 677,399 and d = 47,236

(Figure: relative dual error and relative primal error (L2 norm) vs. the regularization parameter τ, for λ = 0.001 and m ∈ {1024, 2048, 4096, 8192}.)

Yang Tutorial for ACML’15 Nov. 20, 2015 152 / 210

Randomized Algorithms Randomized Least-Squares Regression

Outline

4 Randomized Algorithms
    Randomized Classification (Regression)
    Randomized Least-Squares Regression
    Randomized K-means Clustering
    Randomized Kernel methods
    Randomized Low-rank Matrix Approximation

Yang Tutorial for ACML’15 Nov. 20, 2015 153 / 210

Randomized Algorithms Randomized Least-Squares Regression

Least-squares regression

Let X ∈ R^{n×d} with d ≪ n and b ∈ R^n. The least-squares regression problem is to find w_* such that

w_* = \arg\min_{w\in\mathbb{R}^d} \|Xw - b\|_2

Computational cost: O(nd^2)

Goal of RA: o(nd^2)

Yang Tutorial for ACML’15 Nov. 20, 2015 154 / 210

Randomized Algorithms Randomized Least-Squares Regression

Randomized Least-squares regression

Let A ∈ R^{m×n} be a random reduction matrix. Solve

\tilde{w}_* = \arg\min_{w\in\mathbb{R}^d} \|A(Xw - b)\|_2 = \arg\min_{w\in\mathbb{R}^d} \|AXw - Ab\|_2

Computational cost: O(md^2) + reduction time

Yang Tutorial for ACML’15 Nov. 20, 2015 155 / 210

Randomized Algorithms Randomized Least-Squares Regression

Randomized Least-squares regression

Theoretical Guarantees (Sarlos, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):

\|X\tilde{w}_* - b\|_2 \le (1+\epsilon)\,\|Xw_* - b\|_2

Total time: O(nnz(X) + d^3 \log(d/\epsilon)\,\epsilon^{-2})

Yang Tutorial for ACML’15 Nov. 20, 2015 156 / 210
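A sketch of the sketch-and-solve scheme above using a dense Gaussian A for simplicity; in practice a fast JL transform or a sparse subspace embedding would be used so that forming AX does not itself cost O(mnd). All problem sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 20_000, 50
X = rng.standard_normal((n, d))
b = X @ rng.standard_normal(d) + rng.standard_normal(n)

# Exact least-squares solution: O(n d^2)
w_star, *_ = np.linalg.lstsq(X, b, rcond=None)

# Sketch-and-solve: solve the m x d problem instead.
m = 1_000
A = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))
w_tilde, *_ = np.linalg.lstsq(A @ X, A @ b, rcond=None)

res_star = np.linalg.norm(X @ w_star - b)
res_tilde = np.linalg.norm(X @ w_tilde - b)
print("residual ratio (should be <= 1 + eps):", res_tilde / res_star)
```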

Randomized Algorithms Randomized K-means Clustering

Outline

4 Randomized Algorithms
    Randomized Classification (Regression)
    Randomized Least-Squares Regression
    Randomized K-means Clustering
    Randomized Kernel methods
    Randomized Low-rank Matrix Approximation

Yang Tutorial for ACML’15 Nov. 20, 2015 157 / 210

Randomized Algorithms Randomized K-means Clustering

K-means Clustering

Let x_1, . . . , x_n ∈ R^d be a set of data points.

K-means clustering aims to solve

\min_{C_1,\ldots,C_k} \sum_{j=1}^k \sum_{x_i\in C_j} \|x_i - \mu_j\|_2^2

Computational cost: O(ndkt), where t is the number of iterations.

Yang Tutorial for ACML’15 Nov. 20, 2015 158 / 210

Randomized Algorithms Randomized K-means Clustering

Randomized Algorithms for K-means Clustering

Let X = (x_1, . . . , x_n)^⊤ ∈ R^{n×d} be the data matrix.
High-dimensional data: random sketch \hat{X} = XA ∈ R^{n×m}, m ≪ d

Approximate K-means:

\min_{C_1,\ldots,C_k} \sum_{j=1}^k \sum_{\hat{x}_i\in C_j} \|\hat{x}_i - \hat{\mu}_j\|_2^2

Yang Tutorial for ACML’15 Nov. 20, 2015 159 / 210


Randomized Algorithms Randomized K-means Clustering

Randomized Algorithms for K-means Clustering

For the random sketch, JL transforms and sparse subspace embeddings all work:
  JL transform: m = O\!\left(\frac{k\log(k/(\epsilon\delta))}{\epsilon^2}\right)
  Sparse subspace embedding: m = O\!\left(\frac{k^2}{\epsilon^2\delta}\right)

ε relates to the approximation accuracy
The analysis of the approximation error for K-means can be formulated as Constrained Low-rank Approximation (Cohen et al., 2015):

\min_{Q^\top Q = I} \|X - QQ^\top X\|_F^2

where Q is orthonormal.

Yang Tutorial for ACML’15 Nov. 20, 2015 160 / 210
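A sketch comparing K-means on a random sketch X̂ = XA with K-means on the original data, using a plain Lloyd's loop; the Gaussian sketch, the well-separated synthetic clusters, and m = 50 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def lloyd(Z, k, iters=30, seed=1):
    # Plain Lloyd's algorithm; returns the cluster labels.
    r = np.random.default_rng(seed)
    centers = Z[r.choice(len(Z), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        centers = np.array([Z[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

def kmeans_cost(Z, labels, k):
    # Sum of squared distances to the cluster means, evaluated on Z.
    return sum(((Z[labels == j] - Z[labels == j].mean(0)) ** 2).sum()
               for j in range(k) if np.any(labels == j))

# High-dimensional points drawn around k well-separated centers.
n, d, k, m = 1_000, 500, 5, 50
means = 10 * rng.standard_normal((k, d))
X = means[rng.integers(0, k, size=n)] + rng.standard_normal((n, d))

A = rng.normal(0.0, np.sqrt(1.0 / m), size=(d, m))   # random sketch X_hat = X A
labels_sketch = lloyd(X @ A, k)                      # cluster the m-dim sketch
labels_full = lloyd(X, k)

# Evaluate both clusterings on the ORIGINAL high-dimensional data.
print("cost(sketched) / cost(full):",
      kmeans_cost(X, labels_sketch, k) / kmeans_cost(X, labels_full, k))
```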

Randomized Algorithms Randomized Kernel methods

Outline

4 Randomized Algorithms
    Randomized Classification (Regression)
    Randomized Least-Squares Regression
    Randomized K-means Clustering
    Randomized Kernel methods
    Randomized Low-rank Matrix Approximation

Yang Tutorial for ACML’15 Nov. 20, 2015 161 / 210

Randomized Algorithms Randomized Kernel methods

Kernel methods

Kernel function: κ(·, ·); a set of examples x_1, . . . , x_n

Kernel matrix: K ∈ R^{n×n} with K_{ij} = κ(x_i, x_j)

K is a PSD matrix
Computational and memory costs: Ω(n^2)

Approximation methods:
  The Nystrom method
  Random Fourier features

Yang Tutorial for ACML’15 Nov. 20, 2015 162 / 210


Randomized Algorithms Randomized Kernel methods

The Nystrom method

Let A ∈ R^{n×ℓ} be a uniform sampling matrix.
B = KA ∈ R^{n×ℓ}
C = A^⊤B = A^⊤KA

The Nystrom approximation (Drineas & Mahoney, 2005):

\tilde{K} = B\,C^{\dagger}\,B^\top

Computational cost: O(ℓ^3 + nℓ^2)

Yang Tutorial for ACML’15 Nov. 20, 2015 163 / 210
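A sketch of the Nystrom approximation K̃ = BC†B^⊤ with uniform column sampling, applied to an RBF kernel matrix; the kernel choice, bandwidth, and ℓ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# RBF kernel matrix for n points (illustrative bandwidth).
n, d, l = 1_000, 20, 100
gamma = np.sqrt(d)
X = rng.standard_normal((n, d))
sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / (2 * gamma ** 2))

# Nystrom approximation with uniform column sampling.
idx = rng.choice(n, size=l, replace=False)   # A selects l columns uniformly
B = K[:, idx]                                # B = K A      (n x l)
C = K[np.ix_(idx, idx)]                      # C = A^T K A  (l x l)
K_nys = B @ np.linalg.pinv(C) @ B.T          # K_approx = B C^+ B^T

print("relative Frobenius error:",
      np.linalg.norm(K - K_nys, "fro") / np.linalg.norm(K, "fro"))
```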


Randomized Algorithms Randomized Kernel methods

The Nystrom based kernel machine

The dual problem:

\arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\,\alpha^\top B C^{\dagger} B^\top \alpha

Solve it like a linear method with \hat{X} = B\,C^{-1/2} ∈ R^{n×ℓ}:

\arg\max_{\alpha\in\mathbb{R}^n} -\frac{1}{n}\sum_{i=1}^n \ell_i^*(\alpha_i) - \frac{1}{2\lambda n^2}\,\alpha^\top \hat{X}\hat{X}^\top \alpha

Yang Tutorial for ACML’15 Nov. 20, 2015 164 / 210

Randomized Algorithms Randomized Kernel methods

The Nystrom based kernel machine

Yang Tutorial for ACML’15 Nov. 20, 2015 165 / 210

Randomized Algorithms Randomized Kernel methods

Random Fourier Features (RFF)

Bochner’s theorem: a shift-invariant kernel κ(x, y) = κ(x − y) is a valid kernel if and only if κ(δ) is the Fourier transform of a non-negative measure, i.e.,

\kappa(x - y) = \int p(\omega)\, e^{-j\omega^\top (x-y)}\, d\omega

RFF (Rahimi & Recht, 2008): generate ω_1, . . . , ω_m ∈ R^d following p(ω). For an example x ∈ R^d, construct

\hat{x} = \left(\cos(\omega_1^\top x), \sin(\omega_1^\top x), \ldots, \cos(\omega_m^\top x), \sin(\omega_m^\top x)\right)^\top \in \mathbb{R}^{2m}

RBF kernel \exp\!\left(-\frac{\|x-y\|_2^2}{2\gamma^2}\right): p(ω) = N(0, γ^{-2} I)

Yang Tutorial for ACML’15 Nov. 20, 2015 166 / 210
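A sketch of random Fourier features for the RBF kernel; a 1/√m scaling of the feature vector is added here (beyond what the slide writes) so that the inner product of two feature vectors approximates κ(x, y). The dimension, bandwidth, and m are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, gamma = 20, 2_000, 4.0
x, y = rng.standard_normal(d), rng.standard_normal(d)

# RBF kernel exp(-||x - y||^2 / (2 gamma^2)); sample omega ~ N(0, gamma^{-2} I).
W = rng.normal(0.0, 1.0 / gamma, size=(m, d))

def rff(v):
    # 1/sqrt(m) scaling so that <rff(x), rff(y)> approximates kappa(x, y).
    return np.concatenate([np.cos(W @ v), np.sin(W @ v)]) / np.sqrt(m)

exact = np.exp(-np.sum((x - y) ** 2) / (2 * gamma ** 2))
print("exact kernel:", exact, " RFF estimate:", float(rff(x) @ rff(y)))
```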


Randomized Algorithms Randomized Kernel methods

The Nystrom method vs RFF (Yang et al., 2012)

Functional approximation framework:
  The Nystrom method: data-dependent bases
  RFF: data-independent bases
In certain cases (e.g., large eigen-gap, skewed eigenvalue distribution), the generalization performance of the Nystrom method is better than RFF

Yang Tutorial for ACML’15 Nov. 20, 2015 167 / 210

Randomized Algorithms Randomized Kernel methods

The Nystrom method vs RFF

Yang Tutorial for ACML’15 Nov. 20, 2015 168 / 210

Randomized Algorithms Randomized Low-rank Matrix Approximation

Outline

4 Randomized Algorithms
    Randomized Classification (Regression)
    Randomized Least-Squares Regression
    Randomized K-means Clustering
    Randomized Kernel methods
    Randomized Low-rank Matrix Approximation

Yang Tutorial for ACML’15 Nov. 20, 2015 169 / 210

Randomized Algorithms Randomized Low-rank Matrix Approximation

Randomized low-rank matrix approximation

Let X ∈ R^{n×d}. The goal is to obtain

U\Sigma V^\top \approx X

where U ∈ R^{n×k}, V ∈ R^{d×k} have orthonormal columns and Σ ∈ R^{k×k} is a diagonal matrix with nonnegative entries

k is the target rank
The best rank-k approximation is X_k = U_k\Sigma_k V_k^\top
Approximation error:

\|U\Sigma V^\top - X\|_\xi \le (1+\epsilon)\,\|U_k\Sigma_k V_k^\top - X\|_\xi

where ξ = F or ξ = 2

Yang Tutorial for ACML’15 Nov. 20, 2015 170 / 210

Randomized Algorithms Randomized Low-rank Matrix Approximation

Why low-rank approximation?

Applications in data mining and machine learning:
  PCA
  Spectral clustering
  · · ·

Yang Tutorial for ACML’15 Nov. 20, 2015 171 / 210

Randomized Algorithms Randomized Low-rank Matrix Approximation

Why randomized algorithms?

Deterministic Algorithms:
  Truncated SVD: O(nd min(n, d))
  Rank-Revealing QR factorization: O(ndk)
  Krylov subspace methods (e.g., the Lanczos algorithm): O(ndk + (n + d)k^2)

Randomized Algorithms:
  Speed can be faster (e.g., O(nd log(k)))
  Output more robust (e.g., Lanczos requires sophisticated modifications)
  Can be pass efficient
  Can exploit parallel algorithms

Yang Tutorial for ACML’15 Nov. 20, 2015 172 / 210


Randomized Algorithms Randomized Low-rank Matrix Approximation

Randomized algorithms for low-rank matrix approximation

The Basic Randomized Algorithm for Approximating X ∈ R^{n×d} (Halko et al., 2011):

1. Obtain a small sketch Y = XA ∈ R^{n×m}
2. Compute Q ∈ R^{n×m} that contains an orthonormal basis of col(Y)
3. Compute the SVD of Q^⊤X = \tilde{U}\Sigma V^\top
4. Approximate X ≈ U\Sigma V^\top, where U = Q\tilde{U}

Explanation: if col(XA) captures the top-k column space of X well, i.e.,

\|X - QQ^\top X\| \le \epsilon

then

\|X - U\Sigma V^\top\| \le \epsilon

Yang Tutorial for ACML’15 Nov. 20, 2015 173 / 210
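A sketch of the four steps above (sketch, orthonormalize, small SVD, lift back), compared against the best rank-k approximation; the synthetic matrix with decaying spectrum and the oversampling p = 10 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, k, p = 2_000, 1_000, 10, 10
# A matrix with rapidly decaying spectrum.
U0 = np.linalg.qr(rng.standard_normal((n, 60)))[0]
V0 = np.linalg.qr(rng.standard_normal((d, 60)))[0]
X = U0 @ np.diag(0.7 ** np.arange(60)) @ V0.T

m = k + p                                   # target rank + oversampling
A = rng.standard_normal((d, m))             # random test matrix
Y = X @ A                                   # 1) small sketch
Q, _ = np.linalg.qr(Y)                      # 2) orthonormal basis of col(Y)
Ut, s, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)   # 3) SVD of Q^T X
U = Q @ Ut                                  # 4) X ~ U diag(s) Vt

Xk = (U[:, :k] * s[:k]) @ Vt[:k]
s_exact = np.linalg.svd(X, compute_uv=False)
best = np.sqrt((s_exact[k:] ** 2).sum())    # Frobenius error of the best rank-k approx
print("randomized error / best possible:",
      np.linalg.norm(X - Xk, "fro") / best)
```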

Randomized Algorithms Randomized Low-rank Matrix Approximation

Randomized algorithms for low-rank matrix approximation

Three questions:
1. What is the value of m?
   m = k + p, where p is the oversampling parameter. In practice p = 5 or 10 gives superb results
2. What is the computational cost?
   With the Subsampled Randomized Hadamard Transform: can be as fast as O(nd log(k) + k^2(n + d))
3. What is the quality?
   Theoretical guarantees; practically, very accurate

Yang Tutorial for ACML’15 Nov. 20, 2015 174 / 210


Randomized Algorithms Randomized Low-rank Matrix Approximation

Randomized algorithms for low-rank matrix approximation

Yang Tutorial for ACML’15 Nov. 20, 2015 175 / 210

Randomized Algorithms Randomized Low-rank Matrix Approximation

Randomized algorithms for low-rank matrix approximation

Other things:
  Use power iterations to reduce the error: use (XX^⊤)^q X
  Can use sparse JL transform / subspace embedding matrices (Frobenius-norm guarantee only)

Yang Tutorial for ACML’15 Nov. 20, 2015 176 / 210

Concluding Remarks

Outline

1 Basics

2 Optimization

3 Randomized Dimension Reduction

4 Randomized Algorithms

5 Concluding Remarks

Yang Tutorial for ACML’15 Nov. 20, 2015 177 / 210

Concluding Remarks

How to address big data challenge?

Optimization perspective: improve convergence rates, exploring properties of functions
  stochastic optimization (e.g., SDCA, SVRG, SAGA)
  distributed optimization (e.g., DisDCA)

Randomization perspective: reduce the data size, exploring properties of data
  randomized feature reduction (e.g., reduce the number of features)
  randomized instance reduction (e.g., reduce the number of instances)

Yang Tutorial for ACML’15 Nov. 20, 2015 178 / 210

Concluding Remarks

How can we address big data challenge?

Optimization perspective: improve convergence rates, exploring properties of functions
  Pro: can obtain the optimal solution
  Con: high computational/communication costs

Randomization perspective: reduce the data size, exploring properties of data
  Pro: fast
  Con: a recovery error still exists

Can we combine the benefits of the two techniques?

Yang Tutorial for ACML’15 Nov. 20, 2015 179 / 210

Concluding Remarks

Research Assistant Positions Available for PhD Candidates! Start Fall’16
Optimization and Randomization, Online Learning, Deep Learning, Machine Learning
Send email to tianbao-yang@uiowa.edu

Yang Tutorial for ACML’15 Nov. 20, 2015 180 / 210

Concluding Remarks

Thank You! Questions?

Yang Tutorial for ACML’15 Nov. 20, 2015 181 / 210

Concluding Remarks

References I

Achlioptas, Dimitris. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.

Balcan, Maria-Florina, Blum, Avrim, and Vempala, Santosh. Kernels asfeatures: on kernels, margins, and low-dimensional mappings. MachineLearning, 65(1):79–94, 2006.

Cohen, Michael B., Elder, Sam, Musco, Cameron, Musco, Christopher,and Persu, Madalina. Dimensionality reduction for k-means clusteringand low rank approximation. In Proceedings of the Forty-SeventhAnnual ACM on Symposium on Theory of Computing (STOC), pp.163–172, 2015.

Dasgupta, Anirban, Kumar, Ravi, and Sarlos, Tamas. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC ’10, pp. 341–350, 2010.

Yang Tutorial for ACML’15 Nov. 20, 2015 182 / 210

Concluding Remarks

References II

Dasgupta, Sanjoy and Gupta, Anupam. An elementary proof of a theoremof Johnson and Lindenstrauss. Random Structures & Algorithms, 22(1):60–65, 2003.

Defazio, Aaron, Bach, Francis, and Lacoste-Julien, Simon. Saga: A fastincremental gradient method with support for non-strongly convexcomposite objectives. In NIPS, 2014.

Drineas, Petros and Mahoney, Michael W. On the nystrom method forapproximating a gram matrix for improved kernel-based learning.Journal of Machine Learning Research, 6:2005, 2005.

Drineas, Petros, Mahoney, Michael W., and Muthukrishnan, S. Samplingalgorithms for l2 regression and applications. In ACM-SIAM Symposiumon Discrete Algorithms (SODA), pp. 1127–1136, 2006.

Yang Tutorial for ACML’15 Nov. 20, 2015 183 / 210

Concluding Remarks

References III

Drineas, Petros, Mahoney, Michael W., Muthukrishnan, S., and Sarlos,Tamas. Faster least squares approximation. Numerische Mathematik,117(2):219–249, February 2011.

Gittens, Alex. The spectral norm error of the naive nystrom extension.CoRR, 2011.

Halko, Nathan, Martinsson, Per Gunnar., and Tropp, Joel A. Findingstructure with randomness: Probabilistic algorithms for constructingapproximate matrix decompositions. SIAM Review, 53(2):217–288, May2011.

Hsieh, Cho-Jui, Chang, Kai-Wei, Lin, Chih-Jen, Keerthi, S. Sathiya, andSundararajan, S. A dual coordinate descent method for large-scale linearsvm. In ICML, pp. 408–415, 2008.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descentusing predictive variance reduction. In NIPS, pp. 315–323, 2013.

Yang Tutorial for ACML’15 Nov. 20, 2015 184 / 210

Concluding Remarks

References IV

Johnson, William and Lindenstrauss, Joram. Extensions of Lipschitzmappings into a Hilbert space. In Conference in modern analysis andprobability (New Haven, Conn., 1982), volume 26, pp. 189–206. 1984.

Kane, Daniel M. and Nelson, Jelani. Sparser johnson-lindenstrausstransforms. Journal of the ACM, 61:4:1–4:23, 2014.

Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An accelerated proximalcoordinate gradient method and its application to regularized empiricalrisk minimization. In NIPS, 2014.

Ma, Chenxin, Smith, Virginia, Jaggi, Martin, Jordan, Michael I., Richtarik,Peter, and Takac, Martin. Adding vs. averaging in distributedprimal-dual optimization. In ICML, 2015.

Nelson, Jelani and Nguyen, Huy L. OSNAP: faster numerical linear algebraalgorithms via sparser subspace embeddings. CoRR, abs/1211.1002,2012.

Yang Tutorial for ACML’15 Nov. 20, 2015 185 / 210

Concluding Remarks

References V

Nelson, Jelani and Nguyen, Huy L. OSNAP: faster numerical linear algebraalgorithms via sparser subspace embeddings. In 54th Annual IEEESymposium on Foundations of Computer Science (FOCS), pp. 117–126,2013.

Nemirovski, A. and Yudin, D. On Cezari’s convergence of the steepest descent method for approximating saddle points of convex-concave functions. Soviet Math. Dokl., 19:341–362, 1978.

Nesterov, Yurii. Efficiency of coordinate descent methods on huge-scaleoptimization problems. SIAM Journal on Optimization, 22:341–362,2012.

Paul, Saurabh, Boutsidis, Christos, Magdon-Ismail, Malik, and Drineas,Petros. Random projections for support vector machines. In Proceedingsof the International Conference on Artificial Intelligence and Statistics(AISTATS), pp. 498–506, 2013.

Yang Tutorial for ACML’15 Nov. 20, 2015 186 / 210

Concluding Remarks

References VI

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernelmachines. In Advances in Neural Information Processing Systems 20,pp. 1177–1184, 2008.

Recht, Benjamin. A simpler approach to matrix completion. JournalMachine Learning Research (JMLR), pp. 3413–3430, 2011.

Roux, Nicolas Le, Schmidt, Mark, and Bach, Francis. A stochasticgradient method with an exponential convergence rate forstrongly-convex optimization with finite training sets. CoRR, 2012.

Sarlos, Tamas. Improved approximation algorithms for large matrices viarandom projections. In 47th Annual IEEE Symposium on Foundations ofComputer Science (FOCS), pp. 143–152, 2006.

Shalev-Shwartz, Shai and Zhang, Tong. Proximal stochastic dualcoordinate ascent. CoRR, abs/1211.2717, 2012.

Yang Tutorial for ACML’15 Nov. 20, 2015 187 / 210

Concluding Remarks

References VII

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic dual coordinate ascentmethods for regularized loss. Journal of Machine Learning Research, 14:567–599, 2013.

Tropp, Joel A. Improved analysis of the subsampled randomized hadamardtransform. Advances in Adaptive Data Analysis, 3(1-2):115–126, 2011.

Tropp, Joel A. User-friendly tail bounds for sums of random matrices.Found. Comput. Math., 12(4):389–434, August 2012. ISSN 1615-3375.

Wang, Po-Wei and Lin, Chih-Jen. Iteration complexity of feasible descentmethods for convex optimization. Journal of Machine LearningResearch, 15(1):1523–1548, 2014.

Xiao, L. and Zhang, T. A proximal stochastic gradient method withprogressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Yang Tutorial for ACML’15 Nov. 20, 2015 188 / 210

Concluding Remarks

References VIII

Yang, Tianbao. Trading computation for communication: Distributedstochastic dual coordinate ascent. NIPS’13, pp. –, 2013.

Yang, Tianbao, Li, Yu-Feng, Mahdavi, Mehrdad, Jin, Rong, and Zhou, Zhi-Hua. Nystrom method vs random fourier features: A theoretical and empirical comparison. In Advances in Neural Information Processing Systems (NIPS), pp. 485–493, 2012.

Yang, Tianbao, Zhang, Lijun, Jin, Rong, and Zhu, Shenghuo. Theory ofdual-sparse regularized randomized reduction. In Proceedings of the32nd International Conference on Machine Learning, (ICML), pp.305–314, 2015.

Zhang, Lijun, Mahdavi, Mehrdad, and Jin, Rong. Linear convergence withcondition number independent access of full gradients. In NIPS, pp.980–988. 2013.

Yang Tutorial for ACML’15 Nov. 20, 2015 189 / 210

Concluding Remarks

References IX

Zhang, Lijun, Mahdavi, Mehrdad, Jin, Rong, Yang, Tianbao, and Zhu,Shenghuo. Random projections for classification: A recovery approach.IEEE Transactions on Information Theory (IEEE TIT), 60(11):7300–7316, 2014.

Yang Tutorial for ACML’15 Nov. 20, 2015 190 / 210

Appendix

Examples of Convex functions

ax + b,  Ax + b
x^2,  ‖x‖_2^2
exp(ax),  exp(w^⊤x)
log(1 + exp(ax)),  log(1 + exp(w^⊤x))
x log(x),  Σ_i x_i log(x_i)
‖x‖_p (p ≥ 1),  ‖x‖_p^2
max_i (x_i)

Yang Tutorial for ACML’15 Nov. 20, 2015 191 / 210

Appendix

Operations that preserve convexity

Nonnegative scaling: a · f(x), where a ≥ 0
Sum: f(x) + g(x)
Composition with an affine function: f(Ax + b)
Point-wise maximum: max_i f_i(x)

Examples:
  Least-squares regression: ‖Ax − b‖_2^2
  SVM: \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i w^\top x_i) + \frac{\lambda}{2}\|w\|_2^2

Yang Tutorial for ACML’15 Nov. 20, 2015 192 / 210

Appendix

Smooth Convex function

Smooth: e.g., the logistic loss f(x) = log(1 + exp(−x))

\|\nabla f(x) - \nabla f(y)\|_2 \le L\,\|x - y\|_2

where L > 0 is the smoothness constant

The second-order derivative is upper bounded: ‖∇^2 f(x)‖_2 ≤ L

(Figure: the logistic loss log(1 + exp(−x)), its tangent f(y) + f'(y)(x − y) at y, and a quadratic upper bound.)

Yang Tutorial for ACML’15 Nov. 20, 2015 193 / 210


Appendix

Strongly Convex function

Strongly convex: e.g., the squared Euclidean norm f(x) = \frac{1}{2}\|x\|_2^2

\|\nabla f(x) - \nabla f(y)\|_2 \ge \lambda\,\|x - y\|_2

where λ > 0 is the strong convexity constant

The second-order derivative is lower bounded: ‖∇^2 f(x)‖_2 ≥ λ

(Figure: illustration on f(x) = x^2; legend: gradient, smooth.)

Yang Tutorial for ACML’15 Nov. 20, 2015 194 / 210


Appendix

Smooth and Strongly Convex function

Smooth and strongly convex: e.g., the quadratic function f(z) = \frac{1}{2}(z-1)^2

\lambda\,\|x - y\|_2 \le \|\nabla f(x) - \nabla f(y)\|_2 \le L\,\|x - y\|_2, \quad L \ge \lambda > 0

Yang Tutorial for ACML’15 Nov. 20, 2015 195 / 210

Appendix

Chernoff bound

Let X_1, . . . , X_n be independent random variables with 0 ≤ X_i ≤ 1. Let X = X_1 + · · · + X_n and µ = E[X]. Then

\Pr(X \ge (1+\epsilon)\mu) \le \exp\left(-\frac{\epsilon^2}{2+\epsilon}\,\mu\right)

\Pr(X \le (1-\epsilon)\mu) \le \exp\left(-\frac{\epsilon^2}{2}\,\mu\right)

or

\Pr(|X - \mu| \ge \epsilon\mu) \le 2\exp\left(-\frac{\epsilon^2}{2+\epsilon}\,\mu\right) \le 2\exp\left(-\frac{\epsilon^2}{3}\,\mu\right)

where the last inequality holds when 0 < ε ≤ 1

Yang Tutorial for ACML’15 Nov. 20, 2015 196 / 210

Appendix

Theoretical Guarantee of RA for low-rank approximation

X = U \begin{bmatrix} \Sigma_1 & \\ & \Sigma_2 \end{bmatrix} \begin{bmatrix} V_1^\top \\ V_2^\top \end{bmatrix}

X ∈ R^{m×n}: the target matrix
Σ_1 ∈ R^{k×k}, V_1 ∈ R^{n×k}
A ∈ R^{n×ℓ}: random reduction matrix
Y = XA ∈ R^{m×ℓ}: the small sketch

Key inequality (with Ω_1 = V_1^⊤A and Ω_2 = V_2^⊤A, following Halko et al., 2011):

\|(I - P_Y)X\|_2 \le \|\Sigma_2\|_2 + \|\Sigma_2\,\Omega_2\,\Omega_1^{\dagger}\|_2

Yang Tutorial for ACML’15 Nov. 20, 2015 197 / 210

Appendix

Gaussian Matrices

G is a standard Gaussian matrix; U and V are orthonormal matrices
U^⊤GV follows the standard Gaussian distribution
E[‖SGT‖_F^2] = ‖S‖_F^2 ‖T‖_F^2
E[‖SGT‖] ≤ ‖S‖‖T‖_F + ‖S‖_F‖T‖
Concentration for a function of a Gaussian matrix: suppose h is a Lipschitz function on matrices,

h(X) - h(Y) \le L\,\|X - Y\|_F

Then

\Pr(h(G) \ge \mathbb{E}[h(G)] + Lt) \le e^{-t^2/2}

Yang Tutorial for ACML’15 Nov. 20, 2015 198 / 210

Appendix

Analysis for Randomized Least-square regression

Let X = UΣV^⊤ and

w_* = \arg\min_{w\in\mathbb{R}^d} \|Xw - b\|_2

Let Z = ‖Xw_* − b‖_2, ω = b − Xw_*, and Xw_* = Uα. Consider

\tilde{w}_* = \arg\min_{w\in\mathbb{R}^d} \|A(Xw - b)\|_2

Since b − Xw_* = b − X(X^⊤X)^{\dagger}X^⊤b = (I − UU^⊤)b, we can write X\tilde{w}_* − Xw_* = Uβ. Then

\|X\tilde{w}_* - b\|_2^2 = \|Xw_* - b\|_2^2 + \|X\tilde{w}_* - Xw_*\|_2^2 = Z^2 + \|\beta\|_2^2

Yang Tutorial for ACML’15 Nov. 20, 2015 199 / 210

Appendix

Analysis for Randomized Least-square regression

AU(\alpha + \beta) = AX\tilde{w}_* = AX(AX)^{\dagger}Ab = P_{AX}(Ab) = P_{AU}(Ab)

P_{AU}(Ab) = P_{AU}(A(\omega + U\alpha)) = AU\alpha + P_{AU}(A\omega)

Hence

U^\top A^\top AU\beta = (AU)^\top (AU)(AU)^{\dagger}A\omega = (AU)^\top (AU)\left((AU)^\top AU\right)^{-1}(AU)^\top A\omega

where we use the fact that AU has full column rank. Then

U^\top A^\top AU\beta = U^\top A^\top A\omega

\|\beta\|_2^2/2 \le \|U^\top A^\top AU\beta\|_2^2 = \|U^\top A^\top A\omega\|_2^2 \le \epsilon'^2\,\|U\|_F^2\,\|\omega\|_2^2

where the last inequality uses the matrix product approximation shown on the next slide. Since ‖U‖_F^2 ≤ d, setting ε′ = \sqrt{\epsilon/d} suffices.

Yang Tutorial for ACML’15 Nov. 20, 2015 200 / 210

Appendix

Approximate Matrix Products

Given X ∈ R^{n×d} and Y ∈ R^{d×p}, let A ∈ R^{m×d} be one of the following matrices:

  a JL transform matrix with m = Θ(ε^{-2} log((n + p)/δ))
  the sparse subspace embedding with m = Θ(ε^{-2})
  a leverage-score sampling matrix based on p_i \ge \frac{\|X_{i*}\|_2^2}{2\|X\|_F^2} and m = Θ(ε^{-2})

Then w.h.p. 1 − δ,

\|XA^\top AY - XY\|_F \le \epsilon\,\|X\|_F\,\|Y\|_F

Yang Tutorial for ACML’15 Nov. 20, 2015 201 / 210

Appendix

Analysis for Randomized Least-square regression

A ∈ R^{m×n} must satisfy:
1. Subspace embedding: AU has full column rank
2. Matrix product approximation with ε′ = \sqrt{\epsilon/d}

Order of m:
  JL transforms: 1. O(d log(d)), 2. O(d log(d) ε^{-1}) ⇒ O(d log(d) ε^{-1})
  Sparse subspace embedding: 1. O(d^2), 2. O(d ε^{-1}) ⇒ O(d^2 ε^{-1})

If we use an SSE A_1 ∈ R^{m_1×n} followed by a JL transform A_2 ∈ R^{m_2×m_1}:

\|A_2A_1(Xw_*^2 - b)\|_2 \le (1+\epsilon)\|A_1(Xw_*^1 - b)\|_2 \le (1+\epsilon)\|A_1(Xw_* - b)\|_2 \le (1+\epsilon)^2\|Xw_* - b\|_2

with m_1 = O(d^2\epsilon^{-2}) and m_2 = d\log(d)\epsilon^{-1}, where w_*^2 is the optimal solution using A_2A_1, w_*^1 is the optimal solution using A_1, and w_* is the original optimal solution.

Yang Tutorial for ACML’15 Nov. 20, 2015 202 / 210

Appendix

Randomized Least-squares regression

Theoretical Guarantees (Sarlos, 2006; Drineas et al., 2011; Nelson & Nguyen, 2012):

\|X\tilde{w}_* - b\|_2 \le (1+\epsilon)\,\|Xw_* - b\|_2

If A is a fast JL transform with m = Θ(ε^{-1} d log(d)): total time O(nd log(m) + d^3 log(d) ε^{-1})
If A is a Sparse Subspace Embedding with m = Θ(d^2 ε^{-1}): total time O(nnz(X) + d^4 ε^{-1})
If A = A_1A_2 combines a fast JL transform (m_1 = Θ(ε^{-1} d log(d))) and an SSE (m_2 = Θ(d^2 ε^{-2})): total time O(nnz(X) + d^3 log(d/ε) ε^{-2})

Yang Tutorial for ACML’15 Nov. 20, 2015 203 / 210

Appendix

Matrix Chernoff bound

Lemma (Matrix Chernoff (Tropp, 2012))

Let X be a finite set of PSD matrices with dimension k, and suppose that max_{X∈X} λ_max(X) ≤ B. Sample X_1, . . . , X_ℓ independently from X. Compute

\mu_{\max} = \ell\,\lambda_{\max}(\mathbb{E}[X_1]), \qquad \mu_{\min} = \ell\,\lambda_{\min}(\mathbb{E}[X_1])

Then

\Pr\left\{\lambda_{\max}\left(\sum_{i=1}^{\ell} X_i\right) \ge (1+\delta)\mu_{\max}\right\} \le k\left[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right]^{\mu_{\max}/B}

\Pr\left\{\lambda_{\min}\left(\sum_{i=1}^{\ell} X_i\right) \le (1-\delta)\mu_{\min}\right\} \le k\left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\mu_{\min}/B}

Yang Tutorial for ACML’15 Nov. 20, 2015 204 / 210

Appendix

To simplify the usage of the Matrix Chernoff bound, we note that

\left[\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right]^{\mu} \le \exp\left(-\frac{\delta^2}{2}\mu\right)

\left[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right]^{\mu} \le \exp\left(-\mu\delta^2/3\right), \quad \delta \le 1

\left[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right]^{\mu} \le \exp\left(-\mu\delta\log(\delta)/2\right), \quad \delta > 1

Yang Tutorial for ACML’15 Nov. 20, 2015 205 / 210

Appendix

Noncommutative Bernstein Inequality

Lemma (Noncommutative Bernstein Inequality (Recht, 2011))

Let Z_1, . . . , Z_L be independent zero-mean random matrices of dimension d_1 × d_2. Suppose \tau_j^2 = \max\{\|\mathbb{E}[Z_jZ_j^\top]\|_2, \|\mathbb{E}[Z_j^\top Z_j]\|_2\} and ‖Z_j‖_2 ≤ M almost surely for all j. Then, for any ε > 0,

\Pr\left[\left\|\sum_{j=1}^{L} Z_j\right\|_2 > \epsilon\right] \le (d_1 + d_2)\exp\left[\frac{-\epsilon^2/2}{\sum_{j=1}^{L}\tau_j^2 + M\epsilon/3}\right]

Yang Tutorial for ACML’15 Nov. 20, 2015 206 / 210

Appendix

Randomized Algorithms for K-means Clustering

K-means:

\sum_{j=1}^k \sum_{x_i\in C_j} \|x_i - \mu_j\|_2^2 = \|X - CC^\top X\|_F^2

where C ∈ R^{n×k} is the scaled cluster indicator matrix such that C^⊤C = I.

Constrained Low-rank Approximation (Cohen et al., 2015):

\min_{P\in\mathcal{S}} \|X - PX\|_F^2

where S = {QQ^⊤} is any set of rank-k orthogonal projection matrices with orthonormal Q ∈ R^{n×k}

Low-rank Approximation: S is the set of all rank-k orthogonal projection matrices; P_* = U_kU_k^\top

Yang Tutorial for ACML’15 Nov. 20, 2015 207 / 210


Appendix

Randomized Algorithms for K-means Clustering

Define

P_* = \arg\min_{P\in\mathcal{S}} \|X - PX\|_F^2, \qquad \tilde{P}_* = \arg\min_{P\in\mathcal{S}} \|\tilde{X} - P\tilde{X}\|_F^2

where \tilde{X} = XA is the random sketch. Guarantee on the approximation:

\|X - \tilde{P}_*X\|_F^2 \le \frac{1+\epsilon}{1-\epsilon}\,\|X - P_*X\|_F^2

Yang Tutorial for ACML’15 Nov. 20, 2015 208 / 210

Appendix

Properties of Leverage-score sampling

We prove the properties using the Matrix Chernoff bound. Let Ω = AU.

\Omega^\top\Omega = (AU)^\top(AU) = \sum_{j=1}^m \frac{1}{m\,p_{i_j}}\,u_{i_j}u_{i_j}^\top

Let X_i = \frac{1}{m\,p_i}\,u_iu_i^\top. Then \mathbb{E}[X_i] = \frac{1}{m}I_k, so \lambda_{\max}(\mathbb{E}[X_i]) = \lambda_{\min}(\mathbb{E}[X_i]) = \frac{1}{m}, and \lambda_{\max}(X_i) \le \max_i \frac{\|u_i\|_2^2}{m\,p_i} = \frac{k}{m}.

Applying the Matrix Chernoff bound to the minimum and maximum eigenvalues, we have

\Pr\left(\lambda_{\min}(\Omega^\top\Omega) \le (1-\epsilon)\right) \le k\exp\left(-\frac{m\epsilon^2}{2k}\right) \le k\exp\left(-\frac{m\epsilon^2}{3k}\right)

\Pr\left(\lambda_{\max}(\Omega^\top\Omega) \ge (1+\epsilon)\right) \le k\exp\left(-\frac{m\epsilon^2}{3k}\right)

Yang Tutorial for ACML’15 Nov. 20, 2015 209 / 210

Appendix

When uniform sampling makes sense?

Coherence measure:

\mu_k = \frac{d}{k}\max_{1\le i\le d} \|U_{i*}\|_2^2

When \mu_k \le \tau and m = \Theta\!\left(\frac{k\tau}{\epsilon^2}\log\left[\frac{2k}{\delta}\right]\right), w.h.p. 1 − δ, for A formed by uniform sampling (and rescaling):

  AU ∈ R^{m×k} is of full column rank
  σ_i^2(AU) ≥ 1 − ε ≥ (1 − ε)^2
  σ_i^2(AU) ≤ 1 + ε ≤ (1 + ε)^2

Valid when the coherence measure is small (some real data mining datasets have small coherence measures)
The Nystrom method usually uses uniform sampling (Gittens, 2011)

Yang Tutorial for ACML’15 Nov. 20, 2015 210 / 210