Recap: Batch Versus Online
• Incremental gradient / stochastic approximation / stochastic gradient descent: fast approximate solution
• Batch gradient descent: slow exact solution

For strongly convex objectives:
• Gradient Descent (batch): linear rate of convergence, but expensive iterations
• Stochastic Gradient Descent (online): sub-linear rate of convergence, but cheap iterations

Can we combine the two methods? SDCA, SAGA, SVRG, …
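To make the trade-off concrete, here is a minimal sketch (not from the slides) comparing the two update rules on a noiseless least-squares problem; the data sizes and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                       # noiseless linear regression

def batch_gd_step(w, lr=0.1):
    # full gradient: O(n d) work per step, linear convergence
    return w - lr * X.T @ (X @ w - y) / n

def sgd_step(w, i, lr=0.01):
    # single-example gradient: O(d) work per step, sub-linear convergence in general
    return w - lr * X[i] * (X[i] @ w - y[i])

w = np.zeros(d)
for _ in range(500):                 # few, expensive iterations
    w = batch_gd_step(w)

v = np.zeros(d)
for _ in range(5000):                # many, cheap iterations
    v = sgd_step(v, rng.integers(n))
```

Each batch step costs n times more than an SGD step, which is exactly the tension the variance-reduced methods below try to resolve.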
Recap: Stochastic Variance Reduced Gradient
[Johnson-Zhang 2013] and [Reddi et al. 2016]

“Asymptotically noiseless gradients”
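A minimal SVRG sketch on least squares, following the snapshot/correction structure of [Johnson-Zhang 2013]; the problem sizes, step size, and epoch length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

def grad_i(v, i):
    # gradient of the i-th squared-loss term at v
    return X[i] * (X[i] @ v - y[i])

w, lr, m = np.zeros(d), 0.01, 2000
for _ in range(40):                     # outer loop: take a snapshot
    w_snap = w.copy()
    mu = X.T @ (X @ w_snap - y) / n     # one full gradient at the snapshot
    for _ in range(m):                  # inner loop: cheap corrected steps
        i = rng.integers(n)
        # unbiased gradient estimate whose variance vanishes near the optimum
        w = w - lr * (grad_i(w, i) - grad_i(w_snap, i) + mu)
```

The correction term `grad_i(w_snap, i) - mu` is what makes the stochastic gradients "asymptotically noiseless": as both `w` and `w_snap` approach the optimum, the estimate's variance goes to zero.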
Recap: SVRG Convergence Rate

Assume the objective has Lipschitz gradients and is strongly convex, with condition number κ = L/μ.

Theorem [Johnson-Zhang 2013]: with a suitable step size and epoch length, SVRG converges linearly.

Indicative case (gradient evaluations to reach accuracy ε):
• SVRG: O((n + κ) log(1/ε))
• Gradient Descent: O(n κ log(1/ε))
Parallel Stochastic Gradient Descent
Assume shared memory among threads and a finite number of data points.

• Synchronous: locking processors/threads
• Asynchronous (lock-free): need to be careful, may diverge!
Asynchronous Stochastic Gradient Descent
Hypergraph representation: each data point touches only a few coordinates of w (its sparse feature vector), which induces a hypergraph over the coordinates.

• Example 1: regression with sparse feature vectors
• Example 2: sparse SVM
Example of Asynchronous SGD
Hogwild! algorithm (update for individual processors): each thread repeatedly samples a data point and applies its sparse gradient update directly to w on the shared memory, without locking.

(Timeline figure: threads 1, 2, …, p interleave reads and writes of w; the gap between a thread's read and its corresponding write is the delay.)

[Niu-Recht-Ré-Wright 2011], [De Sa-Zhang-Olukotun-Ré 2016]
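A sketch of the lock-free access pattern using Python threads on a sparse regression problem; all sizes and the step size are illustrative assumptions. (CPython's GIL serializes the threads, so this illustrates the update pattern and its race conditions, not a real parallel speedup.)

```python
import threading
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 20
# sparse feature vectors: each example touches only ~20% of the coordinates
X = rng.normal(size=(n, d)) * (rng.random((n, d)) < 0.2)
w_true = rng.normal(size=d)
y = X @ w_true

w = np.zeros(d)                     # shared parameter vector, updated without locks

def worker(seed, steps=4000, lr=0.01):
    r = np.random.default_rng(seed)
    for _ in range(steps):
        i = r.integers(n)
        nz = np.nonzero(X[i])[0]                 # coordinates this example touches
        g = X[i, nz] * (X[i] @ w - y[i])         # reads shared w with no lock
        w[nz] -= lr * g                          # writes only those coordinates

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each example updates only its own nonzero coordinates, conflicting writes between threads are rare when the hypergraph is sparse, which is the intuition behind the Hogwild! analysis.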
Convergence of Hogwild!
Convergence is guaranteed for a small enough step size:
• A sparser hypergraph allows a larger choice of step size
• A larger delay requires a smaller step size

Different versions:
• Dogwild (for deep learning)
• Frogwild (for PageRank)

[Niu-Recht-Ré-Wright 2011], [De Sa-Zhang-Olukotun-Ré 2016]
Recap
• SAA
• SA / SGD / IG; cyclic vs. randomized; robust versions
• Acceleration: SVRG, SDCA, …
• Asynchronous: Hogwild!, Dogwild, …
• Effect of the regularizer: utilizing prior knowledge, simple hypothesis class, interpretable predictor
• What regularizer to choose?
• What structure to impose? We will see some examples today.
Imposing Sparse Structure: Interpretable Regression
The original (ℓ0-constrained) formulation is non-convex and NP-hard; its ℓ1 relaxation is a convex problem. Under some suitable conditions, the solutions of the two problems are “close”!
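The convex relaxation can be solved by proximal gradient descent (ISTA); here is a sketch on a synthetic noiseless sparse-recovery instance, where the sizes, λ, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 80, 200, 5                  # under-determined: fewer equations than unknowns
A = rng.normal(size=(n, d)) / np.sqrt(n)
w_true = np.zeros(d)
w_true[rng.choice(d, k, replace=False)] = rng.normal(size=k) + 3.0
y = A @ w_true                        # noiseless measurements of a k-sparse signal

def soft_threshold(v, t):
    # prox of t * ||.||_1: shrinks every entry toward zero by t
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# ISTA: proximal gradient descent on (1/2)||A w - y||^2 + lam * ||w||_1
lam, step = 0.01, 0.1
w = np.zeros(d)
for _ in range(5000):
    w = soft_threshold(w - step * A.T @ (A @ w - y), step * lam)
```

Even though there are far fewer measurements than unknowns, the ℓ1 solution lands close to the true sparse vector, which is the "closeness" claim above in action.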
Imposing Sparse Structure: Prior Knowledge

Image compression example [Rauhut 2009]:

(Figure: the original picture next to a reconstruction in which 98% of the wavelet coefficients are set to zero; only the largest coefficients are retained.)
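The compression idea (keep only the largest coefficients) can be sketched on synthetic, rapidly decaying coefficients, used here as a stand-in for real wavelet coefficients of a natural image; the decay model and the 2% budget mirror the figure's setup but are otherwise assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
# synthetic stand-in for sorted wavelet coefficients: magnitudes decay like 1/i
N = 10_000
coeffs = rng.normal(size=N) / np.arange(1, N + 1)

keep = int(0.02 * N)                               # retain only the largest 2%
thresh = np.partition(np.abs(coeffs), -keep)[-keep]
compressed = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)

# most of the energy survives because the discarded coefficients are tiny
err = np.linalg.norm(coeffs - compressed) / np.linalg.norm(coeffs)
```

The relative error stays small despite zeroing 98% of the entries, which is why sparse representations compress so well.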
Other Structures
• Group sparsity (example: multi-task learning)
• TV regularizer (example: image denoising)
• Fused lasso (example: micro-array analysis / ordered features)
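Group sparsity is typically imposed via the group-lasso penalty, whose proximal operator zeroes out entire groups at once; a sketch with illustrative example values:

```python
import numpy as np

def group_soft_threshold(w, groups, t):
    # prox of t * sum_g ||w_g||_2: shrinks each group's norm by t,
    # setting the whole group to zero when its norm is below t
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        out[g] = 0.0 if norm <= t else (1.0 - t / norm) * w[g]
    return out

w = np.array([3.0, 4.0, 0.1, -0.1, 0.0, 2.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
shrunk = group_soft_threshold(w, groups, 1.0)
```

The small middle group is eliminated entirely while the other groups are merely shrunk, which is exactly the behavior wanted in multi-task learning: a feature is dropped for all tasks simultaneously or kept for all.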
Low Rank Structure
• Example: Finding a hidden partition in a graph
Example: Hidden Partition Problem
(Figure: adjacency matrix vs. permuted adjacency matrix)
Stochastic Block Model
(Figure: adjacency matrix A)
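A sketch of sampling an adjacency matrix from a two-community stochastic block model; here `p` and `q` denote the assumed within- and across-community edge probabilities, and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, q = 100, 0.7, 0.1                      # assumed edge probabilities, p > q
labels = np.repeat([0, 1], n // 2)           # two equal hidden communities
P = np.where(labels[:, None] == labels[None, :], p, q)
A = np.triu((rng.random((n, n)) < P).astype(int), 1)
A = A + A.T                                  # symmetric, no self-loops

perm = rng.permutation(n)                    # a random relabeling of vertices
A_perm = A[np.ix_(perm, perm)]               # hides the two diagonal blocks
```

With the vertices sorted by community, `A` shows two dense diagonal blocks; after the permutation the same graph looks unstructured, which is the hidden partition problem.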
Maximum Likelihood Estimation in Stochastic Block Model
Low Rank Structure in Hidden Partition Problem
• Allowing a little bit of imbalance in the partitions
• Change of variables
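After the change of variables, E[A] is (up to its diagonal) a rank-two matrix: a multiple of the all-ones matrix plus a multiple of the partition indicator's outer product. So the partition can be read off a leading eigenvector. A spectral sketch under these assumptions (this is a relaxation, not the maximum likelihood estimator itself; all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, q = 200, 0.6, 0.1                      # assumed sizes and edge probabilities
labels = np.repeat([1, -1], n // 2)          # hidden balanced partition, +/-1 coded
P = np.where(np.outer(labels, labels) > 0, p, q)
A = np.triu((rng.random((n, n)) < P).astype(float), 1)
A = A + A.T                                  # symmetric adjacency, no self-loops

# E[A] = ((p+q)/2) * 11^T + ((p-q)/2) * labels labels^T off the diagonal,
# so the eigenvector of the second-largest eigenvalue aligns with the labels
vals, vecs = np.linalg.eigh(A)
guess = np.sign(vecs[:, -2])
accuracy = max(np.mean(guess == labels), np.mean(guess == -labels))
```

This is the low-rank structure being exploited directly: the noisy adjacency matrix concentrates around a rank-two matrix, and the signal eigenvector survives the noise when p and q are well separated.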