Large Scale Optimization for Machine Learning · Meisam Razaviyayn · Lecture 16 · [email protected]
Transcript
Page 1:

Large Scale Optimization for Machine Learning

Meisam Razaviyayn, Lecture 16

[email protected]

Page 2:

Recap: Batch Versus Online

• Stochastic gradient descent (online), also known as incremental gradient or stochastic approximation: a fast, approximate solution
• Gradient descent (batch): a slow, exact solution

For strongly convex objectives:
• Gradient Descent (Batch): linear rate of convergence, but expensive iterations
• Stochastic Gradient Descent (Online): sub-linear rate of convergence, but cheap iterations

Can we combine the two methods? SDCA, SAGA, SVRG, ...
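As a minimal illustration of this batch vs. online trade-off (a sketch, not from the slides; the data X, y and the step sizes are placeholders), the two update rules for a least-squares objective look like this:

    import numpy as np

    def batch_gradient_step(w, X, y, eta):
        """One (expensive) full-gradient step: touches all n samples."""
        grad = X.T @ (X @ w - y) / len(y)
        return w - eta * grad

    def sgd_step(w, X, y, eta, rng):
        """One (cheap) stochastic step: touches a single random sample."""
        i = rng.integers(len(y))
        grad_i = (X[i] @ w - y[i]) * X[i]
        return w - eta * grad_i

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 20))
    y = X @ rng.standard_normal(20)

    w_gd = np.zeros(20)
    w_sgd = np.zeros(20)
    for k in range(1000):
        w_gd = batch_gradient_step(w_gd, X, y, eta=0.1)            # n gradients per iteration
        w_sgd = sgd_step(w_sgd, X, y, eta=0.1 / (k + 1), rng=rng)  # 1 gradient per iteration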

Page 3:

Recap: Stochastic Variance Reduced Gradient

[Johnson-Zhang 2013] and [Reddi et al. 2016]

“Asymptotically noiseless gradients”

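A sketch of the SVRG update, following the Johnson-Zhang (2013) construction (the gradient oracle grad_i, the step size eta, and the epoch length m are placeholders):

    import numpy as np

    def svrg(w0, grad_i, n, eta, m, num_epochs, rng=None):
        """grad_i(w, i) returns the gradient of the i-th component function at w."""
        if rng is None:
            rng = np.random.default_rng(0)
        w_tilde = w0.copy()
        for _ in range(num_epochs):
            mu = sum(grad_i(w_tilde, i) for i in range(n)) / n  # full gradient at the snapshot
            w = w_tilde.copy()
            for _ in range(m):
                i = rng.integers(n)
                # Variance-reduced estimator: unbiased, and its variance vanishes as
                # w and w_tilde both approach the optimum ("asymptotically noiseless").
                g = grad_i(w, i) - grad_i(w_tilde, i) + mu
                w = w - eta * g
            w_tilde = w  # one common choice: use the last inner iterate as the next snapshot
        return w_tilde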


Page 4:

Recap: SVRG Convergence Rate

Assume Lipschitz gradient and strong convexity:

Theorem:

Indicative Case:

SVRG:

Gradient Descent:
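For reference, the standard guarantee from Johnson and Zhang (2013) can be stated as follows (a sketch; the constants and the exact form on the slide may differ). With F mu-strongly convex and each component gradient L-Lipschitz, for a step size eta and inner-loop length m such that

\[
\rho \;=\; \frac{1}{\mu\eta(1-2L\eta)\,m} \;+\; \frac{2L\eta}{1-2L\eta} \;<\; 1,
\]

the outer iterates satisfy

\[
\mathbb{E}\big[F(\tilde{w}_s) - F(w^\star)\big] \;\le\; \rho^{\,s}\,\big[F(\tilde{w}_0) - F(w^\star)\big].
\]

In terms of total component-gradient evaluations to reach accuracy \(\varepsilon\), this gives roughly \(O\big((n + L/\mu)\log(1/\varepsilon)\big)\) for SVRG versus \(O\big(n\,(L/\mu)\log(1/\varepsilon)\big)\) for gradient descent.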

Page 5:

Parallel Stochastic Gradient Descent

Assume shared memory among threads and a finite number of data points.

Synchronous
• Locking processors/threads

Asynchronous (lock-free)
• Need to be careful, may diverge!
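A minimal sketch of the synchronous (locking) variant, where every read-modify-write of the shared iterate is serialized by a lock (all names and parameters here are illustrative, and Python's GIL limits true parallelism in this toy version):

    import threading
    import numpy as np

    def synchronous_parallel_sgd(X, y, eta=0.05, num_threads=4, num_iters=200):
        n, d = X.shape
        w = np.zeros(d)                 # shared iterate
        lock = threading.Lock()

        def worker(seed):
            nonlocal w
            rng = np.random.default_rng(seed)
            for _ in range(num_iters):
                i = rng.integers(n)
                with lock:              # synchronous: no two threads touch w at once
                    grad_i = (X[i] @ w - y[i]) * X[i]
                    w = w - eta * grad_i

        threads = [threading.Thread(target=worker, args=(s,)) for s in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return w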


Page 6:

Asynchronous Stochastic Gradient Descent

Hypergraph representation

Example 1: Regression with sparse feature vectors

Sparse feature vectors

Example 2: Sparse SVM
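The reason sparsity matters: with a sparse feature vector, the per-sample gradient reads and writes only the coordinates in that sample's support (its hyperedge in the hypergraph representation), so different threads rarely touch the same coordinates. A hypothetical sketch for the squared loss:

    import numpy as np

    def sparse_sgd_update(w, x_indices, x_values, y_i, eta):
        """SGD step for one sample whose feature vector is stored in sparse
        (indices, values) form; only the support of x_i is read or written."""
        pred = np.dot(x_values, w[x_indices])        # reads only the support
        residual = pred - y_i
        w[x_indices] -= eta * residual * x_values    # writes only the support
        return w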


Page 7:

Example of Asynchronous SGD

Hogwild! algorithm (update for individual processors):

w is kept on the shared memory

[Timeline figure: threads 1, 2, ..., p read and write w on the shared memory at different times; the gap between a thread's read and its corresponding write is the delay.]

[Niu-Recht-Ré-Wright 2011] [De Sa-Zhang-Olukotun-Ré 2016]
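A lock-free sketch in the spirit of Hogwild! (illustrative only: the sample format, parameters, and the use of Python threads are assumptions, and the GIL serializes much of the work in practice; real implementations rely on atomic component-wise updates):

    import threading
    import numpy as np

    def hogwild_sgd(samples, d, eta=0.01, num_threads=4, num_iters=1000):
        """samples: list of (indices, values, label) triples with sparse features.
        All threads update the shared vector w without any locking."""
        w = np.zeros(d)  # w on the shared memory

        def worker(seed):
            rng = np.random.default_rng(seed)
            for _ in range(num_iters):
                idx, vals, y_i = samples[rng.integers(len(samples))]
                residual = np.dot(vals, w[idx]) - y_i   # may read stale coordinates
                w[idx] -= eta * residual * vals         # lock-free write to the support only

        threads = [threading.Thread(target=worker, args=(s,)) for s in range(num_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return w

    # hypothetical usage: three sparse samples in dimension 6
    samples = [(np.array([0, 3]), np.array([1.0, -2.0]), 1.0),
               (np.array([1, 4]), np.array([0.5, 0.5]), -1.0),
               (np.array([2, 5]), np.array([2.0, 1.0]), 0.5)]
    w = hogwild_sgd(samples, d=6)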


Page 8:

Convergence of Hogwild!

Convergence is guaranteed for a small enough step-size.

• A sparser graph allows a larger choice of step-size.

• A larger delay requires a smaller step-size.

Different versions:
• Dogwild! (for deep learning)
• Frogwild! (for PageRank)

[Niu-Recht-Ré-Wright 2011]
[De Sa-Zhang-Olukotun-Ré 2016]


Page 9:

Recap

SAA (sample average approximation)

• SA/SGD/IG, cyclic/randomized, robust version

• Acceleration: SVRG, SDCA, ...

• Asynchronous: Hogwild!, Dogwild!, ...

• Effect of the regularizer
  • Utilizing prior knowledge
  • Simple hypothesis class
  • Interpretable predictor

• What regularizer to choose?

• What structure to impose? We will see some examples today.

Page 10:

Imposing Sparse Structure: Interpretable Regression

Non-convex and NP-hard problem vs. convex problem

Under some suitable conditions, the solutions of the two problems are “close”!
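A common way to write the two problems being contrasted (a sketch; the slide's exact formulation may differ):

\[
\min_{w}\ \tfrac{1}{2}\,\|Xw - y\|_2^2 \quad \text{s.t.}\quad \|w\|_0 \le k
\qquad \text{(non-convex, NP-hard)}
\]

\[
\min_{w}\ \tfrac{1}{2}\,\|Xw - y\|_2^2 \;+\; \lambda\,\|w\|_1
\qquad \text{(convex: the Lasso)}
\]

Under suitable conditions on X (e.g., restricted-isometry or restricted-eigenvalue type assumptions), the solutions of the two problems are provably close.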


Page 11:

Imposing Sparse Structure: Prior Knowledge

Image compression example [Rauhut 2009, “Compressive sensing and structured random matrices”]: 98% of the wavelet coefficients are set to zero; only the largest coefficients are retained.

[Figures: the original picture vs. the picture obtained after setting 98% of its wavelet coefficients to zero.]
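A small sketch of this experiment, keeping only the largest 2% of wavelet coefficients (uses PyWavelets; the wavelet, decomposition level, and the grayscale image img are placeholders):

    import numpy as np
    import pywt

    def compress_keep_largest(img, keep_frac=0.02, wavelet="db4", level=4):
        """Zero out all but the largest-magnitude `keep_frac` fraction of the
        wavelet coefficients of a 2D grayscale image, then reconstruct."""
        coeffs = pywt.wavedec2(img, wavelet, level=level)
        arr, slices = pywt.coeffs_to_array(coeffs)
        k = max(1, int(keep_frac * arr.size))
        threshold = np.partition(np.abs(arr).ravel(), -k)[-k]   # k-th largest magnitude
        arr = np.where(np.abs(arr) >= threshold, arr, 0.0)
        coeffs = pywt.array_to_coeffs(arr, slices, output_format="wavedec2")
        return pywt.waverec2(coeffs, wavelet)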


Page 12:

Other Structures

• Group sparsity
  • Example: Multi-task learning

• TV regularizer
  • Example: Image denoising

• Fused Lasso
  • Example: Micro-array analysis / ordered features
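For reference, common forms of these regularizers (a sketch in generic notation, not taken from the slides):

\[
\text{Group sparsity (group Lasso):}\quad \Omega(w) = \sum_{g \in \mathcal{G}} \|w_g\|_2
\]

\[
\text{Total variation (1D signal):}\quad \Omega(w) = \sum_{i} |w_{i+1} - w_i|
\]

\[
\text{Fused Lasso:}\quad \Omega(w) = \lambda_1 \|w\|_1 + \lambda_2 \sum_{i} |w_i - w_{i-1}|
\]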


Page 13:

Low Rank Structure

• Example: Finding a hidden partition in a graph



Page 16:

Example: Hidden Partition Problem

[Figures: Adjacency Matrix vs. Permuted Adjacency Matrix; the hidden partition is visible as blocks only before the vertices are permuted.]
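A sketch of how such a pair of matrices can be generated (all parameters are illustrative):

    import numpy as np

    def two_block_adjacency(n=100, p=0.8, q=0.1, seed=0):
        """Graph with two hidden communities of size n/2: edges appear with
        probability p inside a community and q across communities."""
        rng = np.random.default_rng(seed)
        labels = np.repeat([0, 1], n // 2)
        probs = np.where(labels[:, None] == labels[None, :], p, q)
        upper = (rng.random((n, n)) < probs).astype(int)
        A = np.triu(upper, 1)
        return A + A.T, labels

    A, labels = two_block_adjacency()
    perm = np.random.default_rng(1).permutation(A.shape[0])
    A_permuted = A[np.ix_(perm, perm)]   # same graph, block structure hidden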


Page 17:

Stochastic Block Model


Adjacency matrix A
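A standard way to write the two-community symmetric stochastic block model, which appears to be the setting here (a sketch; the slide's parameter names may differ):

\[
A_{ij} \sim
\begin{cases}
\mathrm{Bernoulli}(p) & \text{if } i \text{ and } j \text{ are in the same community},\\
\mathrm{Bernoulli}(q) & \text{otherwise},
\end{cases}
\qquad p > q,
\]

with A symmetric, zero diagonal, and independent entries above the diagonal. The goal is to recover the hidden communities from a single observation of A.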

Page 18:

Maximum Likelihood Estimation in Stochastic Block Model
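For the balanced two-community case with p > q, the maximum-likelihood estimator reduces (up to constants) to a combinatorial optimization over the partition labels; a standard way to write it is (a sketch):

\[
\hat{x}_{\mathrm{ML}} \;=\; \arg\max_{x}\ x^{\top} A\, x
\quad \text{s.t.}\quad x \in \{-1,+1\}^n,\ \ \mathbf{1}^{\top} x = 0,
\]

i.e., find the balanced labeling that maximizes within-community edges minus cross-community edges. This problem is non-convex and NP-hard in general (it is essentially minimum bisection).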


Page 19:

Low Rank Structure in Hidden Partition Problem

Allowing a little bit of imbalance in the partitions

Change of variables:
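A sketch of the change of variables the slide refers to: setting X = x x^T turns the combinatorial problem into a rank-constrained one, and dropping the rank constraint gives a convex semidefinite relaxation that exposes the low-rank structure (the exact form of the balance constraint is an assumption here):

\[
\max_{X}\ \langle A, X\rangle
\quad\text{s.t.}\quad X \succeq 0,\quad X_{ii} = 1 \ \forall i,\quad \mathrm{rank}(X) = 1
\quad (\text{plus, e.g., } \langle \mathbf{1}\mathbf{1}^{\top}, X\rangle \text{ small to allow slightly imbalanced partitions}).
\]

Dropping the rank-one constraint yields an SDP whose optimal solution is, under suitable conditions on p, q, and n, exactly or approximately the rank-one matrix \(x^\star (x^\star)^{\top}\) of the true partition.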


