Recap: Batch Versus Online
• Incremental gradient / stochastic approximation / stochastic gradient descent: fast approximate solution
• Batch gradient descent: slow exact solution

For strongly convex objectives:
• Gradient Descent (batch): linear rate of convergence, but expensive iterations
• Stochastic Gradient Descent (online): sub-linear rate of convergence, but cheap iterations

Can we combine the two methods? SDCA, SAGA, SVRG, …
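To make the trade-off concrete, here is a minimal sketch (not from the slides) comparing the two update rules on a noiseless least-squares problem; the data sizes and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                       # noiseless linear regression

def batch_gd_step(w, lr=0.1):
    # full gradient: O(n d) work per step, linear convergence
    return w - lr * X.T @ (X @ w - y) / n

def sgd_step(w, i, lr=0.01):
    # single-example gradient: O(d) work per step, sub-linear convergence in general
    return w - lr * X[i] * (X[i] @ w - y[i])

w = np.zeros(d)
for _ in range(500):                 # few, expensive iterations
    w = batch_gd_step(w)

v = np.zeros(d)
for _ in range(5000):                # many, cheap iterations
    v = sgd_step(v, rng.integers(n))
```

Each batch step costs n times more than an SGD step, which is exactly the tension the variance-reduced methods below try to resolve.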
Recap: Stochastic Variance Reduced Gradient
[Johnson-Zhang 2013] and [Reddi et al. 2016]

“Asymptotically noiseless gradients”
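A minimal SVRG sketch on least squares, following the snapshot/correction structure of [Johnson-Zhang 2013]; the problem sizes, step size, and epoch length are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

def grad_i(v, i):
    # gradient of the i-th squared-loss term at v
    return X[i] * (X[i] @ v - y[i])

w, lr, m = np.zeros(d), 0.01, 2000
for _ in range(40):                     # outer loop: take a snapshot
    w_snap = w.copy()
    mu = X.T @ (X @ w_snap - y) / n     # one full gradient at the snapshot
    for _ in range(m):                  # inner loop: cheap corrected steps
        i = rng.integers(n)
        # unbiased gradient estimate whose variance vanishes near the optimum
        w = w - lr * (grad_i(w, i) - grad_i(w_snap, i) + mu)
```

The correction term `grad_i(w_snap, i) - mu` is what makes the stochastic gradients "asymptotically noiseless": as both `w` and `w_snap` approach the optimum, the estimate's variance goes to zero.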
Recap: SVRG Convergence Rate

Assume the objective has Lipschitz gradients and is strongly convex, with condition number κ = L/μ.

Theorem [Johnson-Zhang 2013]: with a suitable step size and epoch length, SVRG converges linearly.

Indicative case (gradient evaluations to reach accuracy ε):
• SVRG: O((n + κ) log(1/ε))
• Gradient Descent: O(n κ log(1/ε))
Parallel Stochastic Gradient Descent
Assume shared memory among threads and a finite number of data points.

• Synchronous: locking processors/threads
• Asynchronous (lock-free): need to be careful, may diverge!
Asynchronous Stochastic Gradient Descent
Hypergraph representation: each data point touches only a few coordinates of w (its sparse feature vector), which induces a hypergraph over the coordinates.

• Example 1: regression with sparse feature vectors
• Example 2: sparse SVM
Example of Asynchronous SGD
Hogwild! algorithm (update for individual processors): each thread repeatedly samples a data point and applies its sparse gradient update directly to w on the shared memory, without locking.

(Timeline figure: threads 1, 2, …, p interleave reads and writes of w; the gap between a thread's read and its corresponding write is the delay.)

[Niu-Recht-Ré-Wright 2011], [De Sa-Zhang-Olukotun-Ré 2016]
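A sketch of the lock-free access pattern using Python threads on a sparse regression problem; all sizes and the step size are illustrative assumptions. (CPython's GIL serializes the threads, so this illustrates the update pattern and its race conditions, not a real parallel speedup.)

```python
import threading
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 20
# sparse feature vectors: each example touches only ~20% of the coordinates
X = rng.normal(size=(n, d)) * (rng.random((n, d)) < 0.2)
w_true = rng.normal(size=d)
y = X @ w_true

w = np.zeros(d)                     # shared parameter vector, updated without locks

def worker(seed, steps=4000, lr=0.01):
    r = np.random.default_rng(seed)
    for _ in range(steps):
        i = r.integers(n)
        nz = np.nonzero(X[i])[0]                 # coordinates this example touches
        g = X[i, nz] * (X[i] @ w - y[i])         # reads shared w with no lock
        w[nz] -= lr * g                          # writes only those coordinates

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each example updates only its own nonzero coordinates, conflicting writes between threads are rare when the hypergraph is sparse, which is the intuition behind the Hogwild! analysis.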
Convergence of Hogwild!
Convergence is guaranteed for a small enough step size:
• A sparser hypergraph allows a larger choice of step size
• A larger delay requires a smaller step size

Different versions:
• Dogwild (for deep learning)
• Frogwild (for PageRank)

[Niu-Recht-Ré-Wright 2011], [De Sa-Zhang-Olukotun-Ré 2016]
Recap
• SAA
• SA / SGD / IG; cyclic vs. randomized; robust versions
• Acceleration: SVRG, SDCA, …
• Asynchronous: Hogwild!, Dogwild, …
• Effect of the regularizer: utilizing prior knowledge, simple hypothesis class, interpretable predictor
• What regularizer to choose?
• What structure to impose? We will see some examples today.
Imposing Sparse Structure: Interpretable Regression
The original (ℓ0-constrained) formulation is non-convex and NP-hard; its ℓ1 relaxation is a convex problem. Under some suitable conditions, the solutions of the two problems are “close”!
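The convex relaxation can be solved by proximal gradient descent (ISTA); here is a sketch on a synthetic noiseless sparse-recovery instance, where the sizes, λ, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 80, 200, 5                  # under-determined: fewer equations than unknowns
A = rng.normal(size=(n, d)) / np.sqrt(n)
w_true = np.zeros(d)
w_true[rng.choice(d, k, replace=False)] = rng.normal(size=k) + 3.0
y = A @ w_true                        # noiseless measurements of a k-sparse signal

def soft_threshold(v, t):
    # prox of t * ||.||_1: shrinks every entry toward zero by t
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# ISTA: proximal gradient descent on (1/2)||A w - y||^2 + lam * ||w||_1
lam, step = 0.01, 0.1
w = np.zeros(d)
for _ in range(5000):
    w = soft_threshold(w - step * A.T @ (A @ w - y), step * lam)
```

Even though there are far fewer measurements than unknowns, the ℓ1 solution lands close to the true sparse vector, which is the "closeness" claim above in action.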
Imposing Sparse Structure: Prior Knowledge

Image compression example [Rauhut 2009]:

(Figure: the original picture next to a reconstruction in which 98% of the wavelet coefficients are set to zero; only the largest coefficients are retained.)
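The compression idea (keep only the largest coefficients) can be sketched on synthetic, rapidly decaying coefficients, used here as a stand-in for real wavelet coefficients of a natural image; the decay model and the 2% budget mirror the figure's setup but are otherwise assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
# synthetic stand-in for sorted wavelet coefficients: magnitudes decay like 1/i
N = 10_000
coeffs = rng.normal(size=N) / np.arange(1, N + 1)

keep = int(0.02 * N)                               # retain only the largest 2%
thresh = np.partition(np.abs(coeffs), -keep)[-keep]
compressed = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)

# most of the energy survives because the discarded coefficients are tiny
err = np.linalg.norm(coeffs - compressed) / np.linalg.norm(coeffs)
```

The relative error stays small despite zeroing 98% of the entries, which is why sparse representations compress so well.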
Other Structures
• Group sparsity (example: multi-task learning)
• TV regularizer (example: image denoising)
• Fused lasso (example: micro-array analysis / ordered features)
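Group sparsity is typically imposed via the group-lasso penalty, whose proximal operator zeroes out entire groups at once; a sketch with illustrative example values:

```python
import numpy as np

def group_soft_threshold(w, groups, t):
    # prox of t * sum_g ||w_g||_2: shrinks each group's norm by t,
    # setting the whole group to zero when its norm is below t
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        out[g] = 0.0 if norm <= t else (1.0 - t / norm) * w[g]
    return out

w = np.array([3.0, 4.0, 0.1, -0.1, 0.0, 2.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
shrunk = group_soft_threshold(w, groups, 1.0)
```

The small middle group is eliminated entirely while the other groups are merely shrunk, which is exactly the behavior wanted in multi-task learning: a feature is dropped for all tasks simultaneously or kept for all.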
Low Rank Structure
• Example: Finding a hidden partition in a graph
Example: Hidden Partition Problem
(Figure: adjacency matrix vs. permuted adjacency matrix)
Stochastic Block Model
(Figure: adjacency matrix A)
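A sketch of sampling an adjacency matrix from a two-community stochastic block model; here `p` and `q` denote the assumed within- and across-community edge probabilities, and the sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, q = 100, 0.7, 0.1                      # assumed edge probabilities, p > q
labels = np.repeat([0, 1], n // 2)           # two equal hidden communities
P = np.where(labels[:, None] == labels[None, :], p, q)
A = np.triu((rng.random((n, n)) < P).astype(int), 1)
A = A + A.T                                  # symmetric, no self-loops

perm = rng.permutation(n)                    # a random relabeling of vertices
A_perm = A[np.ix_(perm, perm)]               # hides the two diagonal blocks
```

With the vertices sorted by community, `A` shows two dense diagonal blocks; after the permutation the same graph looks unstructured, which is the hidden partition problem.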
Maximum Likelihood Estimation in Stochastic Block Model
Low Rank Structure in Hidden Partition Problem
• Allowing a little bit of imbalance in the partitions
• Change of variables
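After the change of variables, E[A] is (up to its diagonal) a rank-two matrix: a multiple of the all-ones matrix plus a multiple of the partition indicator's outer product. So the partition can be read off a leading eigenvector. A spectral sketch under these assumptions (this is a relaxation, not the maximum likelihood estimator itself; all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, q = 200, 0.6, 0.1                      # assumed sizes and edge probabilities
labels = np.repeat([1, -1], n // 2)          # hidden balanced partition, +/-1 coded
P = np.where(np.outer(labels, labels) > 0, p, q)
A = np.triu((rng.random((n, n)) < P).astype(float), 1)
A = A + A.T                                  # symmetric adjacency, no self-loops

# E[A] = ((p+q)/2) * 11^T + ((p-q)/2) * labels labels^T off the diagonal,
# so the eigenvector of the second-largest eigenvalue aligns with the labels
vals, vecs = np.linalg.eigh(A)
guess = np.sign(vecs[:, -2])
accuracy = max(np.mean(guess == labels), np.mean(guess == -labels))
```

This is the low-rank structure being exploited directly: the noisy adjacency matrix concentrates around a rank-two matrix, and the signal eigenvector survives the noise when p and q are well separated.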