Page 1

Stochastic Composition Optimization: Algorithms and Sample Complexities

Mengdi Wang
Joint work with Ethan X. Fang, Han Liu, and Ji Liu

ORFE@Princeton

ICCOPT, Tokyo, August 8-11, 2016

1 / 24

Page 2

Collaborators

• M. Wang, X. Fang, and H. Liu. Stochastic Compositional Gradient Descent: Algorithms for Minimizing Compositions of Expected-Value Functions. Mathematical Programming, submitted in 2014, to appear in 2016.

• M. Wang and J. Liu. Accelerating Stochastic Composition Optimization. 2016.

• M. Wang and J. Liu. A Stochastic Compositional Subgradient Method Using Markov Samples. 2016.

2 / 24

Page 3

Outline

1 Background: Why is SGD a good method?

2 A New Problem: Stochastic Composition Optimization

3 Stochastic Composition Algorithms: Convergence and Sample Complexity

4 Acceleration via Smoothing-Extrapolation

3 / 24

Page 4

Background: Why is SGD a good method?

Outline

1 Background: Why is SGD a good method?

2 A New Problem: Stochastic Composition Optimization

3 Stochastic Composition Algorithms: Convergence and Sample Complexity

4 Acceleration via Smoothing-Extrapolation

4 / 24

Page 5

Background: Why is SGD a good method?

Background

• Machine learning is optimization

Learning from batch data:
$$\min_{x\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n} \ell(x; A_i, b_i) + \rho(x)$$

Learning from online data:
$$\min_{x\in\mathbb{R}^d}\ \mathbb{E}_{A,b}\big[\ell(x; A, b)\big] + \rho(x)$$

• Both problems can be formulated as Stochastic Convex Optimization

$$\min_{x}\ \underbrace{\mathbb{E}\big[f(x,\xi)\big]}_{\text{expectation over a batch data set or an unknown distribution}}$$

A general framework that encompasses likelihood estimation, online learning, empirical risk minimization, multi-armed bandits, and online MDPs

• Stochastic gradient descent (SGD) updates by taking sample gradients:

$$x_{k+1} = x_k - \alpha\, \nabla f(x_k, \xi_k)$$

A special case of stochastic approximation, with a long history (Robbins and Monro, Kushner and Yin, Polyak and Juditsky, Benveniste et al., Ruszcynski, Borkar, Bertsekas and Tsitsiklis, and many others)
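To make the SGD update concrete, here is a minimal runnable sketch for regularized least squares; the data generator, quadratic loss, step size, and ridge penalty are illustrative assumptions standing in for $\ell(x; A, b)$ and $\rho(x)$, not choices made in the slides.

```python
import numpy as np

def sgd_least_squares(sample, d, alpha=0.01, rho=1e-3, iters=1000, seed=0):
    """Plain SGD for min_x E[(a^T x - b)^2] + rho * ||x||^2 (illustrative objective)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(d)
    for k in range(iters):
        a, b = sample(rng)
        # Sample gradient of (a^T x - b)^2 + rho * ||x||^2 at the current iterate.
        grad = 2.0 * (a @ x - b) * a + 2.0 * rho * x
        x = x - alpha * grad  # x_{k+1} = x_k - alpha * grad f(x_k, xi_k)
    return x

# Usage with a synthetic linear model (assumed purely for illustration).
def sample_pair(rng, x_true=np.array([1.0, -2.0, 0.5])):
    a = rng.normal(size=x_true.size)
    return a, a @ x_true + 0.1 * rng.normal()

x_hat = sgd_least_squares(sample_pair, d=3)
```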

5 / 24

Page 6

Background: Why is SGD a good method?

Background: Stochastic first-order methods

• Stochastic gradient descent (SGD) updates by taking sample gradients:

$$x_{k+1} = x_k - \alpha\, \nabla f(x_k, \xi_k)$$

(1,410,000 results on Google Scholar Search and 24,400 since 2016!)

Why is SGD a good method in practice?

• When processing either batch or online data, a scalable algorithm needs to update using partial information (a small subset of all data)

• Answer: We have no other choice

Why is SGD a good method beyond practical reasons?

• SGD achieves optimal convergence after processing k samples:

• $\mathbb{E}[F(x_k) - F^*] = O(1/\sqrt{k})$ for convex minimization

• $\mathbb{E}[F(x_k) - F^*] = O(1/k)$ for strongly convex minimization

(Nemirovski and Yudin 1983, Agarwal et al. 2012, Rakhlin et al. 2012, Ghadimi and Lan 2012, 2013, Shamir and Zhang 2013, and many more)

• Beyond convexity: nearly optimal online PCA (Li, Wang, Liu, Zhang 2015)

• Answer: Strong theoretical guarantees for data-driven problems

6 / 24

Page 7

A New Problem: Stochastic Composition Optimization

Outline

1 Background: Why is SGD a good method?

2 A New Problem: Stochastic Composition Optimization

3 Stochastic Composition Algorithms: Convergence and Sample Complexity

4 Acceleration via Smoothing-Extrapolation

7 / 24

Page 8

A New Problem: Stochastic Composition Optimization

Stochastic Composition Optimization

Consider the problem

$$\min_{x\in X}\ \Big\{\, F(x) := (f\circ g)(x) = f\big(g(x)\big) \,\Big\},$$

where the outer and inner functions $f:\mathbb{R}^m\to\mathbb{R}$ and $g:\mathbb{R}^n\to\mathbb{R}^m$ are expected-value functions,

$$f(y) = \mathbb{E}\big[f_v(y)\big], \qquad g(x) = \mathbb{E}\big[g_w(x)\big],$$

and $X$ is a closed and convex set in $\mathbb{R}^n$.

• We focus on the case where the overall problem is convex (for now)

• No structural assumptions on $f$, $g$ (nonconvex/nonmonotone/nondifferentiable)

• We may not know the distributions of $v$, $w$.

8 / 24

Page 9

A New Problem: Stochastic Composition Optimization

Expectation Minimization vs. Stochastic Composition Optimization

Recall the classical problem:

$$\min_{x\in X}\ \underbrace{\mathbb{E}\big[f(x,\xi)\big]}_{\text{linear w.r.t.\ the distribution of } \xi}$$

In stochastic composition optimization, the objective is no longer a linear functional of the $(v, w)$ distribution:

$$\min_{x\in X}\ \underbrace{\mathbb{E}\big[f_v\big(\mathbb{E}[g_w(x)]\big)\big]}_{\text{nonlinear w.r.t.\ the distribution of } (w,v)}$$

• In the classical problem, nice properties come from linearity w.r.t. data distribution

• In stochastic composition optimization, they are all lost

A little nonlinearity goes a long way.
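To see concretely what is lost, compare the true gradient of the composition with what a single sample can provide (a short derivation added here for clarity; it is not spelled out on the slides):

$$\nabla F(x) = \nabla g(x)^\top \nabla f\big(g(x)\big), \qquad \text{while} \qquad \mathbb{E}\Big[\nabla g_w(x)^\top \nabla f_v\big(g_w(x)\big)\Big] \neq \nabla g(x)^\top \nabla f\big(g(x)\big) \ \text{in general},$$

because $f_v$ is evaluated at the random point $g_w(x)$ rather than at its mean $g(x)$; unless $f$ is linear, the inner expectation cannot be pulled inside $\nabla f_v(\cdot)$. This bias is exactly what the algorithms in the later sections have to work around.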

9 / 24

Page 10

A New Problem: Stochastic Composition Optimization

Motivating Example: High-Dimensional Nonparametric Estimation

• Sparse Additive Model (SpAM):
$$y_i = \sum_{j=1}^{d} h_j(x_{ij}) + \varepsilon_i.$$

• High-dimensional feature space with relatively few data samples: the data matrix has $n$ samples and $d$ features, with $n \ll d$.

• Optimization model for SpAM¹:
$$\min_{x}\ \mathbb{E}\big[f_v\big(\mathbb{E}[g_w(x)]\big)\big] \quad \leftrightarrow \quad \min_{h_j\in\mathcal{H}_j}\ \mathbb{E}\Big[Y - \sum_{j=1}^{d} h_j(X_j)\Big]^2 + \lambda \sum_{j=1}^{d} \sqrt{\mathbb{E}\big[h_j^2(X_j)\big]}$$

• The term $\lambda \sum_{j=1}^{d} \sqrt{\mathbb{E}\big[h_j^2(X_j)\big]}$ induces sparsity in the feature space.

¹ P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society: Series B, 71(5):1009–1030, 2009.

10 / 24

Page 11

A New Problem: Stochastic Composition Optimization

Motivating Example: Risk-Averse Learning

Consider the mean-variance minimization problem
$$\min_{x}\ \mathbb{E}_{a,b}\big[\ell(x; a, b)\big] + \lambda\,\mathrm{Var}_{a,b}\big[\ell(x; a, b)\big].$$

Its batch version is
$$\min_{x}\ \frac{1}{N}\sum_{i=1}^{N} \ell(x; a_i, b_i) + \frac{\lambda}{N}\sum_{i=1}^{N}\Big(\ell(x; a_i, b_i) - \frac{1}{N}\sum_{j=1}^{N}\ell(x; a_j, b_j)\Big)^2.$$

• The variance $\mathrm{Var}[Z] = \mathbb{E}\big[(Z - \mathbb{E}[Z])^2\big]$ is a composition of two expected-value functions (one concrete representation is sketched below)

• Many other risk functions are equivalent to compositions of multiple expected-value functions (Shapiro, Dentcheva, Ruszcynski 2014)

• A central limit theorem for compositions of multiple smooth functions has been established for risk metrics (Dentcheva, Penev, Ruszcynski 2016)

• No good way to optimize a risk-averse objective while learning from online data
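One compositional representation of the mean-variance objective (a sketch; the slides do not fix a particular splitting, and other equivalent choices exist) takes the inner function to return the decision together with the mean loss, and the outer function to evaluate the loss plus the penalized squared deviation:

$$g_w(x) = \big(x,\ \ell(x; a_w, b_w)\big), \qquad f_v(x, y) = \ell(x; a_v, b_v) + \lambda\big(\ell(x; a_v, b_v) - y\big)^2,$$

so that $g(x) = \mathbb{E}[g_w(x)] = \big(x,\ \mathbb{E}[\ell(x;a,b)]\big)$ and

$$\mathbb{E}_v\big[f_v\big(g(x)\big)\big] = \mathbb{E}\big[\ell(x;a,b)\big] + \lambda\,\mathbb{E}\Big[\big(\ell(x;a,b) - \mathbb{E}[\ell(x;a,b)]\big)^2\Big] = \mathbb{E}[\ell] + \lambda\,\mathrm{Var}[\ell].$$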

11 / 24

Page 12

A New Problem: Stochastic Composition Optimization

Motivating Example: Reinforcement Learning

On-policy reinforcement learning learns the value per state of a stochastic system.

• We want to solve a (huge) system of Bellman equations
$$\gamma P^{\pi} V^{\pi} + r^{\pi} = V^{\pi},$$
where $P^{\pi}$ is the transition probability matrix and $r^{\pi}$ is the reward vector, both unknown.

• On-policy learning aims to solve the Bellman equation via black-box simulation. It becomes a special stochastic composition optimization problem (an explicit splitting is sketched below):
$$\min_{x}\ \mathbb{E}\big[f_v\big(\mathbb{E}[g_w(x)]\big)\big] \quad \leftrightarrow \quad \min_{x\in\mathbb{R}^{S}}\ \big\|\mathbb{E}[A]\,x - \mathbb{E}[b]\big\|^2,$$
where $\mathbb{E}[A] = I - \gamma P^{\pi}$ and $\mathbb{E}[b] = r^{\pi}$.
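One way to read this as a composition (a sketch consistent with the formulation above; the slides leave the splitting implicit): the inner function averages simulated Bellman data and the outer function is the deterministic squared norm,

$$g_w(x) = A_w x - b_w, \qquad g(x) = \mathbb{E}[A]\,x - \mathbb{E}[b], \qquad f(y) = \|y\|^2,$$

so each observed transition supplies an unbiased sample $(A_w, b_w)$ of $(\mathbb{E}[A], \mathbb{E}[b])$, while the square is applied only after the inner expectation. This is also why naively running SGD on $\|A_w x - b_w\|^2$ would optimize a different, biased objective.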

12 / 24

Page 13

Stochastic Composition Algorithms: Convergence and Sample Complexity

Outline

1 Background: Why is SGD a good method?

2 A New Problem: Stochastic Composition Optimization

3 Stochastic Composition Algorithms: Convergence and Sample Complexity

4 Acceleration via Smoothing-Extrapolation

13 / 24

Page 14

Stochastic Composition Algorithms: Convergence and Sample Complexity

Problem Formulation

$$\min_{x\in X}\ \Big\{\, F(x) := \underbrace{\mathbb{E}\big[f_v\big(\mathbb{E}[g_w(x)]\big)\big]}_{\text{nonlinear w.r.t.\ the distribution of } (w,v)} \,\Big\}$$

Sampling Oracle (SO)

Upon query $(x, y)$, the oracle returns:

• a noisy inner sample $g_w(x)$ and its noisy (sub)gradient $\nabla g_w(x)$;

• a noisy outer gradient $\nabla f_v(y)$ (a minimal interface sketch follows)
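A minimal Python sketch of this oracle interface, assuming a generic black-box simulator behind it; the class and method names are illustrative, not part of the slides.

```python
import numpy as np
from typing import Callable, Tuple

class SamplingOracle:
    """Hypothetical SO wrapper: each query draws fresh random indices (w, v) and
    returns a noisy inner value, a noisy inner Jacobian, and a noisy outer gradient."""

    def __init__(self, sample_g: Callable, sample_jac_g: Callable,
                 sample_grad_f: Callable, rng=None):
        self.sample_g = sample_g            # (x, rng) -> g_w(x), a vector in R^m
        self.sample_jac_g = sample_jac_g    # (x, rng) -> Jacobian of g_w at x, shape (m, n)
        self.sample_grad_f = sample_grad_f  # (y, rng) -> grad f_v(y), a vector in R^m
        self.rng = rng or np.random.default_rng()

    def query(self, x: np.ndarray, y: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """Return (g_w(x), grad g_w(x), grad f_v(y)) for freshly drawn w and v."""
        return (self.sample_g(x, self.rng),
                self.sample_jac_g(x, self.rng),
                self.sample_grad_f(y, self.rng))
```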

Challenges

• The stochastic gradient descent (SGD) method does not work, since an unbiased sample of the gradient
$$\nabla g(x_k)^\top \nabla f\big(g(x_k)\big)$$
is not available.

• Fenchel duality does not work except under rare conditions

• Sample average approximation (SAA) is subject to the curse of dimensionality

• The sample complexity of the problem class is unclear

14 / 24

Page 15

Stochastic Composition Algorithms: Convergence and Sample Complexity

Basic Idea

To approximate the projected gradient iteration
$$x_{k+1} = \Pi_X\Big\{ x_k - \alpha_k\, \nabla g(x_k)^\top \nabla f\big(g(x_k)\big) \Big\}$$
by a quasi-gradient iteration that uses running estimates of $g(x_k)$.

Algorithm 1: Stochastic Compositional Gradient Descent (SCGD)

Require: $x_0, z_0 \in \mathbb{R}^n$, $y_0 \in \mathbb{R}^m$, SO, $K$, stepsizes $\{\alpha_k\}_{k=1}^{K}$ and $\{\beta_k\}_{k=1}^{K}$.
Ensure: $\{x_k\}_{k=1}^{K}$
for $k = 1, \dots, K$ do
  Query the SO and obtain $g_{w_k}(x_k)$, $\nabla g_{w_k}(x_k)$, $\nabla f_{v_k}(y_{k+1})$. Update by
  $$y_{k+1} = (1-\beta_k)\, y_k + \beta_k\, g_{w_k}(x_k),$$
  $$x_{k+1} = \Pi_X\Big\{ x_k - \alpha_k\, \nabla g_{w_k}(x_k)^\top \nabla f_{v_k}(y_{k+1}) \Big\}.$$
end for

Remarks

• Each iteration makes simple updates by interacting with the SO (a runnable sketch follows these remarks)

• Scales to large batch data sets and can process streaming data points online

• First considered by Ermoliev (1976) as a stochastic approximation method, without rate analysis
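A runnable sketch of Algorithm 1, assuming the hypothetical SamplingOracle interface sketched earlier and a Euclidean-ball constraint as an illustrative choice of $X$; the sketch queries the oracle twice per iteration so that $\nabla f_{v_k}$ is evaluated at the updated $y_{k+1}$, matching the displayed update.

```python
import numpy as np

def project_ball(x, radius=10.0):
    """Projection onto X = {x : ||x|| <= radius} (an illustrative constraint set)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x

def scgd(oracle, x0, y0, K, alpha=lambda k: k ** -0.75, beta=lambda k: k ** -0.5):
    """Basic SCGD: track a running estimate y_k of g(x_k) with stepsize beta_k,
    then take a projected quasi-gradient step with stepsize alpha_k."""
    x, y = x0.copy(), y0.copy()
    for k in range(1, K + 1):
        g_sample, jac_g, _ = oracle.query(x, y)            # g_{w_k}(x_k), grad g_{w_k}(x_k)
        y = (1 - beta(k)) * y + beta(k) * g_sample          # y_{k+1}
        _, _, grad_f = oracle.query(x, y)                   # grad f_{v_k}(y_{k+1})
        x = project_ball(x - alpha(k) * jac_g.T @ grad_f)   # x_{k+1}
    return x
```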

15 / 24

Page 16

Stochastic Composition Algorithms: Convergence and Sample Complexity

Sample Complexity (Wang et al., 2016)

Under suitable conditions (inner function nonsmooth, outer function smooth) and with $X$ bounded, let the stepsizes be
$$\alpha_k = k^{-3/4}, \qquad \beta_k = k^{-1/2};$$
then, for $k$ large enough,
$$\mathbb{E}\bigg[ F\Big( \frac{2}{k} \sum_{t=k/2+1}^{k} x_t \Big) - F^* \bigg] = O\Big(\frac{1}{k^{1/4}}\Big).$$
(Optimal rate, matching the lower bound for stochastic programming.)

Sample Complexity in the Strongly Convex Case (Wang et al., 2016)

Under suitable conditions (inner function nonsmooth, outer function smooth), suppose that the compositional function $F(\cdot)$ is strongly convex and let the stepsizes be
$$\alpha_k = \frac{1}{k}, \qquad \beta_k = \frac{1}{k^{2/3}};$$
then, for $k$ sufficiently large,
$$\mathbb{E}\big[\|x_k - x^*\|^2\big] = O\Big(\frac{1}{k^{2/3}}\Big).$$

16 / 24

Page 17

Stochastic Composition Algorithms: Convergence and Sample Complexity

Outline of Analysis

• The auxiliary variable $y_k$ maintains a running estimate of $g(x_k)$, built from samples at the biased query points $x_0, \dots, x_k$:
$$y_{k+1} = \sum_{t=0}^{k} \Big( \Pi_{t'=t}^{k} \beta_{t'} \Big)\, g_{w_t}(x_t).$$

• Two entangled stochastic sequences:
$$\{\varepsilon_k = \|y_{k+1} - g(x_k)\|^2\}, \qquad \{\xi_k = \|x_k - x^*\|^2\}.$$

• Coupled supermartingale analysis:
$$\mathbb{E}[\varepsilon_{k+1} \mid \mathcal{F}_k] \le (1-\beta_k)\,\varepsilon_k + O\Big(\beta_k^2 + \frac{1}{\beta_k}\|x_{k+1} - x_k\|^2\Big),$$
$$\mathbb{E}[\xi_{k+1} \mid \mathcal{F}_k] \le (1+\alpha_k^2)\,\xi_k - \alpha_k\big(F(x_k) - F^*\big) + O(\beta_k)\,\varepsilon_k.$$

• Almost sure convergence follows from a Coupled Supermartingale Convergence Theorem (Wang and Bertsekas, 2013)

• Convergence rate analysis via optimizing over stepsizes and balancing noise-bias tradeoff

17 / 24

Page 18

Acceleration via Smoothing-Extrapolation

Outline

1 Background: Why is SGD a good method?

2 A New Problem: Stochastic Composition Optimization

3 Stochastic Composition Algorithms: Convergence and Sample Complexity

4 Acceleration via Smoothing-Extrapolation

18 / 24

Page 19

Acceleration via Smoothing-Extrapolation

Acceleration

When the function $g(\cdot)$ is smooth, can the algorithms be accelerated? Yes!

Algorithm 2: Accelerated SCGD

Require: $x_0, z_0 \in \mathbb{R}^n$, $y_0 \in \mathbb{R}^m$, SO, $K$, stepsizes $\{\alpha_k\}_{k=1}^{K}$ and $\{\beta_k\}_{k=1}^{K}$.
Ensure: $\{x_k\}_{k=1}^{K}$
for $k = 1, \dots, K$ do
  Query the SO and obtain $\nabla f_{v_k}(y_k)$, $\nabla g_{w_k}(x_k)$. Update by
  $$x_{k+1} = x_k - \alpha_k\, \nabla g_{w_k}(x_k)^\top \nabla f_{v_k}(y_k).$$
  Update the auxiliary variables via extrapolation-smoothing:
  $$z_{k+1} = \Big(1 - \frac{1}{\beta_k}\Big)\, x_k + \frac{1}{\beta_k}\, x_{k+1},$$
  $$y_{k+1} = (1-\beta_k)\, y_k + \beta_k\, g_{w_{k+1}}(z_{k+1}),$$
  where the sample $g_{w_{k+1}}(z_{k+1})$ is obtained by querying the SO.
end for

Key to the Acceleration

Bias reduction by averaging over extrapolated points (extrapolation-smoothing):

$$y_k = \sum_{t=0}^{k} \Big(\Pi_{t'=t}^{k} \beta_{t'}\Big)\, g_{w_t}(z_t) \;\approx\; g(x_k) = g\Big( \sum_{t=0}^{k} \Big(\Pi_{t'=t}^{k} \beta_{t'}\Big)\, z_t \Big).$$
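A sketch of the accelerated update (Algorithm 2) in the same hypothetical oracle setting as before, kept unconstrained for simplicity; the step-size exponents are the illustrative choices quoted on the next slide.

```python
import numpy as np

def accelerated_scgd(oracle, x0, y0, K,
                     alpha=lambda k: k ** (-5.0 / 7.0),
                     beta=lambda k: k ** (-4.0 / 7.0)):
    """Accelerated SCGD sketch: the inner-function samples are taken at the
    extrapolated points z_{k+1}, which reduces the bias of the estimate y_k."""
    x, y = x0.copy(), y0.copy()
    for k in range(1, K + 1):
        _, jac_g, grad_f = oracle.query(x, y)        # grad g_{w_k}(x_k), grad f_{v_k}(y_k)
        x_next = x - alpha(k) * jac_g.T @ grad_f      # main iterate update
        b = beta(k)
        z = (1 - 1 / b) * x + (1 / b) * x_next        # extrapolation
        g_sample, _, _ = oracle.query(z, y)           # g_{w_{k+1}}(z_{k+1})
        y = (1 - b) * y + b * g_sample                # smoothing
        x = x_next
    return x
```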

19 / 24

Page 20

Acceleration via Smoothing-Extrapolation

Accelerated Sample Complexity (Wang et al. 2016)

Under suitable conditions (inner function smooth, outer function smooth), if the stepsizes are chosen as
$$\alpha_k = k^{-5/7}, \qquad \beta_k = k^{-4/7},$$
we have
$$\mathbb{E}\big[F(\bar{x}_k) - F^*\big] = O\Big(\frac{1}{k^{2/7}}\Big), \qquad \text{where } \bar{x}_k = \frac{2}{k}\sum_{t=k/2+1}^{k} x_t.$$

Strongly Convex Case (Wang et al. 2016)

Under suitable conditions (inner function smooth, outer function smooth), and assuming that $F$ is strongly convex, if the stepsizes are chosen as
$$\alpha_k = \frac{1}{k}, \qquad \beta_k = \frac{1}{k^{4/5}},$$
we have
$$\mathbb{E}\big[\|x_k - x^*\|^2\big] = O\Big(\frac{1}{k^{4/5}}\Big).$$

20 / 24

Page 21

Acceleration via Smoothing-Extrapolation

Regularized Stochastic Composition Optimization (Wang and Liu 2016)

$$\min_{x\in\mathbb{R}^n}\ (\mathbb{E}_v f_v \circ \mathbb{E}_w g_w)(x) + R(x)$$

The penalty term R(x) is convex and nonsmooth.

Algorithm 3: Accelerated Stochastic Compositional Proximal Gradient (ASC-PG)

Require: $x_0, z_0 \in \mathbb{R}^n$, $y_0 \in \mathbb{R}^m$, SO, $K$, stepsizes $\{\alpha_k\}_{k=1}^{K}$ and $\{\beta_k\}_{k=1}^{K}$.
Ensure: $\{x_k\}_{k=1}^{K}$
for $k = 1, \dots, K$ do
  Query the SO and obtain $\nabla f_{v_k}(y_k)$, $\nabla g_{w_k}(x_k)$. Update the main iterate by
  $$x_{k+1} = \mathrm{prox}_{\alpha_k R(\cdot)}\Big( x_k - \alpha_k\, \nabla g_{w_k}(x_k)^\top \nabla f_{v_k}(y_k) \Big).$$
  Update the auxiliary iterates by the extrapolation-smoothing scheme:
  $$z_{k+1} = \Big(1 - \frac{1}{\beta_k}\Big)\, x_k + \frac{1}{\beta_k}\, x_{k+1},$$
  $$y_{k+1} = (1-\beta_k)\, y_k + \beta_k\, g_{w_{k+1}}(z_{k+1}),$$
  where the sample $g_{w_{k+1}}(z_{k+1})$ is obtained by querying the SO.
end for
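A sketch of ASC-PG with an $\ell_1$ penalty $R(x) = \lambda\|x\|_1$ as an illustrative choice of regularizer (the slides leave $R$ generic); the oracle interface is the hypothetical one sketched earlier.

```python
import numpy as np

def prox_l1(v, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def asc_pg(oracle, x0, y0, K, lam=0.1,
           alpha=lambda k: 1.0 / k,
           beta=lambda k: k ** (-4.0 / 5.0)):
    """ASC-PG sketch: accelerated SCGD with a proximal step for R(x) = lam * ||x||_1."""
    x, y = x0.copy(), y0.copy()
    for k in range(1, K + 1):
        _, jac_g, grad_f = oracle.query(x, y)              # grad g_{w_k}(x_k), grad f_{v_k}(y_k)
        x_next = prox_l1(x - alpha(k) * jac_g.T @ grad_f,  # proximal gradient step
                         alpha(k) * lam)
        b = beta(k)
        z = (1 - 1 / b) * x + (1 / b) * x_next             # extrapolation
        g_sample, _, _ = oracle.query(z, y)                # g_{w_{k+1}}(z_{k+1})
        y = (1 - b) * y + b * g_sample                     # smoothing
        x = x_next
    return x
```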

21 / 24

Page 22

Acceleration via Smoothing-Extrapolation

Sample Complexity for Smooth Optimization (Wang and Liu 2016)

Under suitable conditions (inner function nonsmooth, outer function smooth) and with $X$ bounded, let the stepsizes be chosen properly; then, for $k$ large enough,
$$\mathbb{E}\bigg[ F\Big( \frac{2}{k} \sum_{t=k/2+1}^{k} x_t \Big) - F^* \bigg] = O\Big(\frac{1}{k^{4/9}}\Big).$$
If either the outer or the inner function is linear, the rate improves to the optimal one:
$$\mathbb{E}\bigg[ F\Big( \frac{2}{k} \sum_{t=k/2+1}^{k} x_t \Big) - F^* \bigg] = O\Big(\frac{1}{k^{1/2}}\Big).$$

Sample Complexity in the Strongly Convex Case (Wang and Liu 2016)

Suppose that the compositional function $F(\cdot)$ is strongly convex and let the stepsizes be chosen properly; then, for $k$ sufficiently large,
$$\mathbb{E}\big[\|x_k - x^*\|^2\big] = O\Big(\frac{1}{k^{4/5}}\Big).$$
If either the outer or the inner function is linear, the rate improves to the optimal one:
$$\mathbb{E}\big[\|x_k - x^*\|^2\big] = O\Big(\frac{1}{k}\Big).$$

22 / 24

Page 23

Acceleration via Smoothing-Extrapolation

When there are two nested uncertainties:

• A class of two-timescale algorithms that update using first-order samples

• Analyzing convergence becomes harder: two coupled stochastic processes and a smoothness-noise interplay

• Convergence rates of stochastic algorithms establish sample complexity upper bounds for the new problem class

                                                General Convex    Strongly Convex
  Outer Nonsmooth, Inner Smooth                 O(k^{-1/4})       O(k^{-2/3})?
  Outer and Inner Smooth                        O(k^{-4/9})?      O(k^{-4/5})?
  Special Case: min_x E[f(x; ξ)]                O(k^{-1/2})       O(k^{-1})
  Special Case: min_x E[f(E[A]x − E[b]; ξ)]     O(k^{-1/2})       O(k^{-1})

Table: Summary of best known sample complexities

Applications and computations

• First scalable algorithm for sparse nonparametric estimation

• Optimal algorithm for on-policy reinforcement learning

23 / 24

Page 24

Acceleration via Smoothing-Extrapolation

Summary

• Stochastic Composition Optimization - a new and rich problem class

$$\min_x\ \underbrace{\mathbb{E}\big[f_v\big(\mathbb{E}[g_w(x)]\big)\big]}_{\text{nonlinear w.r.t.\ the distribution of } (w,v)}$$

• Applications in risk, data analysis, machine learning, real-time intelligent systems

• A class of stochastic compositional gradient methods with convergence guarantees. Basic sample complexity results are being developed:

                                                General Convex    Strongly Convex
  Outer Nonsmooth, Inner Smooth                 O(k^{-1/4})       O(k^{-2/3})?
  Outer and Inner Smooth                        O(k^{-4/9})?      O(k^{-4/5})?
  Special Case: min_x E[f(x; ξ)]                O(k^{-1/2})       O(k^{-1})
  Special Case: min_x E[f(E[A]x − E[b]; ξ)]     O(k^{-1/2})       O(k^{-1})

• Many open questions remain and more work is needed!

Thank you very much!

24 / 24

