
A Two-Timescale Stochastic Algorithm Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic

Mingyi Hong∗† Hoi-To Wai‡ Zhaoran Wang§ Zhuoran Yang¶

January 3, 2022

Abstract

This paper analyzes a two-timescale stochastic algorithm framework for bilevel optimization. Bilevel optimization is a class of problems which exhibits a two-level structure, and its goal is to minimize an outer objective function with variables which are constrained to be the optimal solution to an (inner) optimization problem. We consider the case when the inner problem is unconstrained and strongly convex, while the outer problem is constrained and has a smooth objective function. We propose a two-timescale stochastic approximation (TTSA) algorithm for tackling such a bilevel problem. In the algorithm, a stochastic gradient update with a larger step size is used for the inner problem, while a projected stochastic gradient update with a smaller step size is used for the outer problem. We analyze the convergence rates of the TTSA algorithm under various settings: when the outer problem is strongly convex (resp. weakly convex), the TTSA algorithm finds an O(K_max^{−2/3})-optimal (resp. O(K_max^{−2/5})-stationary) solution, where K_max is the total iteration number. As an application, we show that a two-timescale natural actor-critic proximal policy optimization algorithm can be viewed as a special case of our TTSA framework. Importantly, the natural actor-critic algorithm is shown to converge at a rate of O(K_max^{−1/4}) in terms of the gap in expected discounted reward compared to a globally optimal policy.

1 Introduction

Consider bilevel optimization problems of the form:

min_{x∈X⊆R^{d_1}}  ℓ(x) := f(x, y⋆(x))   subject to   y⋆(x) ∈ arg min_{y∈R^{d_2}} g(x, y),    (1)

where d_1, d_2 ≥ 1 are integers; X is a closed and convex subset of R^{d_1}; and f : X × R^{d_2} → R and g : X × R^{d_2} → R are continuously differentiable functions with respect to (w.r.t.) x, y.

∗ Authors listed in alphabetical order.
† University of Minnesota, email: [email protected]
‡ The Chinese University of Hong Kong, email: [email protected]
§ Northwestern University, email: [email protected]
¶ Princeton University, email: [email protected]

arXiv:2007.05170v3 [math.OC] 30 Dec 2021

Table 1: Summary of the main results. SC stands for strongly convex, WC for weakly convex, C for convex; k is the iteration counter, K_max is the total number of iterations.

ℓ(x) | Constraint  | Step Sizes (α_k, β_k)            | Rate (Outer)      | Rate (Inner)
SC   | X ⊆ R^{d_1} | O(k^{−1}), O(k^{−2/3})           | O(K_max^{−2/3}) † | O(K_max^{−2/3}) ⋆
C    | X ⊆ R^{d_1} | O(K_max^{−3/4}), O(K_max^{−1/2}) | O(K_max^{−1/4}) ¶ | O(K_max^{−1/2}) ⋆
WC   | X ⊆ R^{d_1} | O(K_max^{−3/5}), O(K_max^{−2/5}) | O(K_max^{−2/5}) # | O(K_max^{−2/5}) ⋆

† in terms of ‖x^{K_max} − x⋆‖², where x⋆ is the optimal solution; ⋆ in terms of ‖y^{K_max} − y⋆(x^{K_max−1})‖², where y⋆(x^{K_max−1}) is the optimal inner solution for fixed x^{K_max−1}; ¶ measured using ℓ(x) − ℓ(x⋆); # measured using the distance to a fixed point of the Moreau proximal map x̂(·); see (18).

Problem (1) involves two optimization problems following a two-level structure. We refer to min_{y∈R^{d_2}} g(x, y) as the inner problem, whose solution depends on x, and g(x, y) is called the inner objective function; min_{x∈X} ℓ(x) is referred to as the outer problem, and ℓ(x) ≡ f(x, y⋆(x)) is called the outer objective function, which we wish to minimize. Moreover, both f, g can be stochastic functions whose gradients may be difficult to compute. Despite being a non-convex stochastic problem in general, (1) has a wide range of applications, e.g., reinforcement learning [33], hyperparameter optimization [22], game theory [51], etc.

Tackling (1) is challenging as it involves solving the inner and outer optimization problems simultaneously. Even in the simplest case, when ℓ(x) and g(x, y) are strongly convex in x and y, respectively, solving (1) is difficult. For instance, if we aim to minimize ℓ(x) via a gradient method, then at any iterate x_cur ∈ R^{d_1}, applying the gradient method to (1) involves a double-loop algorithm that (a) solves the inner optimization problem y⋆(x_cur) = arg min_{y∈R^{d_2}} g(x_cur, y), and then (b) evaluates the gradient ∇ℓ(x_cur) based on the solution y⋆(x_cur). Depending on the application, step (a) is usually accomplished by applying yet another gradient method to the inner problem (unless a closed-form solution for y⋆(x_cur) exists). In this way, the resulting algorithm necessitates a double-loop structure.

To this end, [23] and the references therein proposed a stochastic algorithm for (1) involving a double-loop update. During the iterations, the inner problem min_{y∈R^{d_2}} g(x_cur, y) is solved using a stochastic gradient descent (SGD) method, with the solution denoted by y⋆(x_cur). Then, the outer problem is optimized with an SGD update using estimates of ∇f(x_cur, y⋆(x_cur)). Such a double-loop algorithm is proven to converge to a stationary solution, yet a practical issue lingers: what if the (stochastic) gradients of the inner and outer problems are only revealed sequentially? For example, when these problems must be updated at the same time, as in a sequential game.

To address the above issues, this paper investigates a single-loop stochastic algorithm for (1). Focusing on the class of bilevel optimization problems (1) where the inner problem is unconstrained and strongly convex and the outer objective function is smooth, our contributions are three-fold:

• We study a two-timescale stochastic approximation (TTSA) algorithm [7] for the concerned class of bilevel optimization problems. The TTSA algorithm updates both the outer and inner solutions simultaneously, using cheap estimates of the stochastic gradients of both the inner and outer objectives. The algorithm guarantees convergence by improving the inner (resp. outer) solution with a larger (resp. smaller) step size, also known as using a faster (resp. slower) timescale.


• We analyze the expected convergence rates of the TTSA algorithm; our results are summarized in Table 1. Our analysis is accomplished by building a set of coupled inequalities for the one-step update in TTSA. For a strongly convex outer function, we show inequalities that couple the outer and inner optimality gaps. For convex or weakly convex outer functions, we establish inequalities coupling the difference of outer iterates, the optimality of function values, and the inner optimality gap. We also provide new and generic results for solving coupled inequalities. The distinction of timescales between the step sizes of the inner and outer updates plays a crucial role in our convergence analysis.

• Finally, we illustrate the application of our analysis results to a two-timescale natural actor-critic policy optimization algorithm with linear function approximation [30,45]. The natural actor-critic algorithm converges at the rate O(K^{−1/4}) to an optimal policy, which is comparable to state-of-the-art results.

The rest of this paper is organized as follows. §2 formally describes the problem setting of bilevel optimization and specifies the problem class of interest; in addition, the TTSA algorithm is introduced and some application examples are discussed. §3 presents the main convergence results for the generic bilevel optimization problem (1); the convergence analysis is also presented, where we highlight the main proof techniques used. Lastly, §4 discusses the application to reinforcement learning. Notice that some technical details of the proofs have been relegated to the online appendix [24].

1.1 Related Works

The study of the bilevel optimization problem (1) can be traced to that of Stackelberg games [51], where the outer (resp. inner) problem optimizes the action taken by a leader (resp. follower). In the optimization literature, bilevel optimization was introduced in [10] for resource allocation problems and later studied in [9]. Furthermore, bilevel optimization is a special case of the broader class of problems known as Mathematical Programming with Equilibrium Constraints [39].

Many related algorithms have been proposed for bilevel optimization. These include approximate descent methods [19,56] and penalty-based methods [26,58]. The approximate descent methods deal with a subclass of problems where the outer problem possesses certain (local) differentiability properties, while the penalty-based methods approximate the inner and/or outer problems with appropriate penalty functions. It is noted in [12] that descent-based methods make relatively strong assumptions about the inner problem (such as non-degeneracy), while the penalty-based methods are typically slow. Moreover, these works typically focus on asymptotic convergence analysis without characterizing convergence rates; see [12] for a comprehensive survey.

In [13,23,27], the authors considered bilevel problems in the (stochastic) unconstrained setting, where the outer problem is non-convex and the inner problem is strongly (or strictly) convex. These works are most closely related to the algorithms and results developed in the current paper. In this case, the (stochastic) gradient of the outer problem may be computed using the chain rule. However, to obtain an accurate estimate, one has to either use a double-loop structure where the inner loop solves the inner sub-problem to high accuracy [13,23], or use a large batch size (e.g., O(1/ε)) [27]. Both of these methods can be difficult to implement in practice, as the batch size or the number of required inner-loop iterations is difficult to adjust. In reference [48], the authors analyzed a special bilevel problem where there is a single optimization variable in both the outer and inner levels. The authors proposed a Sequential Averaging Method (SAM) algorithm which provably solves problems with a strongly convex outer problem and convex inner problems. Building upon SAM, [35,38] developed first-order algorithms for bilevel problems without requiring that, for each fixed outer-level variable, the inner-level solution be a singleton.

In a different line of recent works, references [36,50] proposed and analyzed different versions of the so-called truncated back-propagation approach for approximating the (stochastic) gradient of the outer problem, and established convergence for the respective algorithms. The idea is to use a dynamical system to model an optimization algorithm that solves the inner problem, and then replace the optimal solution of the inner problem by unrolling a few iterations of the updates. However, computing the (hyper-)gradient of the objective function ℓ(x) requires back-propagation through the optimization algorithm, which can be computationally very expensive. It is important to note that none of the methods discussed above consider single-loop stochastic algorithms in which a small batch of samples is used to approximate the inner and outer gradients at each iteration. Later we will see that the ability to update using a small number of samples for both the outer and inner problems is critical in a number of applications, and it is also beneficial numerically.

In contrast to the above mentioned works, this paper considers a TTSA algorithm for stochastic bilevel optimization, which is a single-loop algorithm employing cheap stochastic estimates of the gradient. Notice that TTSA [7] is a class of algorithms designed to solve coupled systems of (nonlinear) equations. While its asymptotic convergence properties are well understood, e.g., [7,8,32], convergence rate analyses have focused on linear cases, e.g., [14,31,34]. In general, the bilevel optimization problem (1) requires a nonlinear TTSA algorithm. For this case, an asymptotic convergence rate is analyzed in [42] under a restricted form of nonlinearity. For convergence rate analysis, [48] considered a single-loop algorithm for deterministic bilevel optimization with only one variable, and [17] studied the convergence rate when the expected updates are strongly monotone.

Finally, it is worthwhile mentioning that various forms of TTSA have been applied to tackle compositional stochastic optimization [57], policy evaluation methods [6,54], and actor-critic methods [5,33,40]. Notice that some of these optimization problems can be cast as bilevel optimization, as we will demonstrate next.

Notations  Unless otherwise noted, ‖·‖ is the Euclidean norm on a finite-dimensional Euclidean space. For a twice differentiable function f : X × Y → R, ∇_x f(x, y) (resp. ∇_y f(x, y)) denotes its partial gradient taken w.r.t. x (resp. y), and ∇²_{yx} f(x, y) (resp. ∇²_{xy} f(x, y)) denotes the Jacobian of ∇_y f(x, y) at y (resp. of ∇_x f(x, y) at x). A function ℓ(·) is said to be weakly convex with modulus μ_ℓ ∈ R if

ℓ(w) ≥ ℓ(v) + ⟨∇ℓ(v), w − v⟩ + μ_ℓ ‖w − v‖²,  ∀ w, v ∈ X.    (2)

Notice that if μ_ℓ ≥ 0 (resp. μ_ℓ > 0), then ℓ(·) is convex (resp. strongly convex).

2 Two-Timescale Stochastic Approximation Algorithm for (1)

To formally define the problem class of interest, we state the following conditions on the bilevel optimization problem (1).


Assumption 1. The outer functions f(x, y) and ℓ(x) := f(x, y⋆(x)) satisfy:

1. For any x ∈ R^{d_1}, ∇_x f(x, ·) and ∇_y f(x, ·) are Lipschitz continuous with respect to (w.r.t.) y ∈ R^{d_2}, with constants L_{fx} and L_{fy}, respectively.

2. For any y ∈ R^{d_2}, ∇_y f(·, y) is Lipschitz continuous w.r.t. x ∈ X, with constant L̄_{fy}.

3. For any x ∈ X, y ∈ R^{d_2}, we have ‖∇_y f(x, y)‖ ≤ C_{fy} for some C_{fy} > 0.

Assumption 2. The inner function g(x, y) satisfies:

1. For any x ∈ X and y ∈ R^{d_2}, g(x, y) is twice continuously differentiable in (x, y);

2. For any x ∈ X, ∇_y g(x, ·) is Lipschitz continuous w.r.t. y ∈ R^{d_2}, with constant L_g.

3. For any x ∈ X, g(x, ·) is strongly convex in y, with modulus μ_g > 0.

4. For any x ∈ X, ∇²_{xy} g(x, ·) and ∇²_{yy} g(x, ·) are Lipschitz continuous w.r.t. y ∈ R^{d_2}, with constants L_{gxy} > 0 and L_{gyy} > 0, respectively.

5. For any y ∈ R^{d_2}, ∇²_{xy} g(·, y) and ∇²_{yy} g(·, y) are Lipschitz continuous w.r.t. x ∈ X, with constants L̄_{gxy} > 0 and L̄_{gyy} > 0, respectively.

6. For any x ∈ X and y ∈ R^{d_2}, we have ‖∇²_{xy} g(x, y)‖ ≤ C_{gxy} for some C_{gxy} > 0.

Basically, Assumptions 1 and 2 require that the outer and inner functions f, g are well-behaved. In particular, ∇_x f, ∇_y f, ∇²_{xy} g, and ∇²_{yy} g are Lipschitz continuous w.r.t. x when y is fixed, and Lipschitz continuous w.r.t. y when x is fixed. These assumptions are satisfied by common problems in machine learning and optimization, e.g., the application examples discussed in Sec. 2.1.

Our first endeavor is to develop a single-loop stochastic algorithm for tackling (1). Focusing on solutions which satisfy the first-order stationary condition of (1), we aim at finding a pair of solutions (x⋆, y⋆) such that

∇_y g(x⋆, y⋆) = 0,  ⟨∇ℓ(x⋆), x − x⋆⟩ ≥ 0, ∀ x ∈ X.    (3)

Given x⋆, a solution y⋆ satisfying the first condition in (3) may be found by a cheap stochastic gradient recursion such as y ← y − β h_g with E[h_g] = ∇_y g(x⋆, y). On the other hand, given y⋆(x), and supposing that we can obtain a cheap stochastic gradient estimate h_f with E[h_f] = ∇ℓ(x) = ∇̄_x f(x, y⋆(x)), where ∇̄_x f(x, y) is a surrogate for ∇ℓ(x) (to be described later), the second condition can be satisfied by a simple projected stochastic gradient recursion x ← P_X(x − α h_f), where P_X(·) denotes the Euclidean projection onto X.

A challenge in designing a single-loop algorithm for satisfying (3) is to ensure that the outer function's gradient ∇̄_x f(x, y) is evaluated at an inner solution y that is close to y⋆(x). This led us to develop a two-timescale stochastic approximation (TTSA) [7] framework, as summarized in Algorithm 1. An important feature is that the algorithm utilizes two step sizes α_k, β_k for the outer (x^k) and inner (y^k) solutions, respectively, designed on different timescales such that α_k/β_k → 0. As a larger step size is taken to optimize y^k, the latter shall stay close to y⋆(x^k). Using this strategy, it is expected that y^k converges to y⋆(x^k) asymptotically.


Algorithm 1. Two-Timescale Stochastic Approximation (TTSA)

S0) Initialize the variable (x⁰, y⁰) ∈ X × R^{d_2} and the step size sequence {α_k, β_k}_{k≥0};
S1) For iteration k = 0, ..., K:

y^{k+1} = y^k − β_k · h_g^k,    (4a)
x^{k+1} = P_X(x^k − α_k · h_f^k),    (4b)

where h_g^k, h_f^k are stochastic estimates of ∇_y g(x^k, y^k), ∇̄_x f(x^k, y^{k+1}) [cf. (6)], respectively, satisfying Assumption 3 given below. Moreover, P_X(·) is the Euclidean projection operator onto the convex set X.
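To make the recursion (4) concrete, the following is a minimal Python sketch of the TTSA loop. The callables `grad_g_hat`, `surrogate_grad_f_hat`, and `project_X`, as well as the schedule interface, are illustrative assumptions rather than notation from the paper.

```python
import numpy as np

def ttsa(x0, y0, grad_g_hat, surrogate_grad_f_hat, project_X,
         alpha, beta, K_max):
    """Minimal sketch of Algorithm 1 (TTSA).

    grad_g_hat(x, y)           -- stochastic estimate h_g of grad_y g(x, y)
    surrogate_grad_f_hat(x, y) -- stochastic estimate h_f of the surrogate (6)
    project_X(x)               -- Euclidean projection onto the set X
    alpha(k), beta(k)          -- step sizes with alpha(k)/beta(k) -> 0
    """
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for k in range(K_max):
        y = y - beta(k) * grad_g_hat(x, y)                        # (4a), fast timescale
        x = project_X(x - alpha(k) * surrogate_grad_f_hat(x, y))  # (4b), slow timescale
    return x, y
```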

Inspired by [23], we provide a method for computing a surrogate of ∇ℓ(x) given y, for general objective functions satisfying Assumptions 1, 2. Given y⋆(x), we observe that using the chain rule, the gradient of ℓ(x) can be derived as

∇ℓ(x) = ∇_x f(x, y⋆(x)) − ∇²_{xy} g(x, y⋆(x)) [∇²_{yy} g(x, y⋆(x))]^{−1} ∇_y f(x, y⋆(x)).    (5)

We note that the computation of the above gradient critically depends on the fact that the inner problem is strongly convex and unconstrained, so that the inverse function theorem can be applied when computing ∇y⋆(x).

We may now define ∇̄_x f(x, y) as a surrogate of ∇ℓ(x) by replacing y⋆(x) with y ∈ R^{d_2}:

∇̄_x f(x, y) := ∇_x f(x, y) − ∇²_{xy} g(x, y) [∇²_{yy} g(x, y)]^{−1} ∇_y f(x, y).    (6)

Notice that ∇ℓ(x) = ∇̄_x f(x, y⋆(x)). Eq. (6) is a surrogate for ∇ℓ(x) that may be used in TTSA. We emphasize that (6) is not the only construction, and TTSA can accommodate other forms of gradient surrogates. For example, see (41) in the application of our results to actor-critic.
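For intuition, when exact partial gradients and second-order quantities are available, the surrogate (6) can be evaluated by a linear solve instead of forming the inverse Hessian explicitly. A minimal sketch, with hypothetical oracle names:

```python
import numpy as np

def surrogate_grad(grad_x_f, grad_y_f, hess_xy_g, hess_yy_g, x, y):
    """Deterministic evaluation of the surrogate gradient (6).

    The oracles return grad_x f (d1,), grad_y f (d2,), grad^2_xy g (d1, d2),
    and grad^2_yy g (d2, d2) at (x, y); these interfaces are assumptions made
    for illustration. Solving [grad^2_yy g] v = grad_y f avoids an explicit
    matrix inverse.
    """
    v = np.linalg.solve(hess_yy_g(x, y), grad_y_f(x, y))
    return grad_x_f(x, y) - hess_xy_g(x, y) @ v
```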

Let F_k := σ{y⁰, x⁰, ..., y^k, x^k} and F'_k := σ{y⁰, x⁰, ..., y^k, x^k, y^{k+1}} be the filtrations of the random variables up to iteration k, where σ{·} denotes the σ-algebra generated by the random variables. We consider the following assumption regarding h_f^k, h_g^k:

Assumption 3. For any k ≥ 0, there exist constants σ_g, σ_f and a nonincreasing sequence {b_k}_{k≥0} such that:

E[h_g^k | F_k] = ∇_y g(x^k, y^k),  E[h_f^k | F'_k] = ∇̄_x f(x^k, y^{k+1}) + B_k,  ‖B_k‖ ≤ b_k,    (7a)
E[‖h_g^k − ∇_y g(x^k, y^k)‖² | F_k] ≤ σ_g² · (1 + ‖∇_y g(x^k, y^k)‖²),    (7b)
E[‖h_f^k − B_k − ∇̄_x f(x^k, y^{k+1})‖² | F'_k] ≤ σ_f².    (7c)

Notice that the conditions on h_g^k are standard when the latter is taken as a stochastic gradient of g(x^k, y^k), while h_f^k is a potentially biased estimate of ∇̄_x f(x^k, y^{k+1}). As we will see in our convergence analysis, the bias shall decay polynomially to zero.

In light of (6) and as inspired by [23], we suggest constructing a stochastic estimate of ∇̄_x f(x^k, y^{k+1}) as follows. Let t_max(k) ≥ 1 be an integer, c_h ∈ (0, 1] be a scalar parameter, and denote x ≡ x^k, y ≡ y^{k+1} for brevity. Consider:

1. Select p ∈ {0, ..., t_max(k) − 1} uniformly at random and draw 2 + p independent samples ξ^{(1)} ∼ μ^{(1)}, ξ_0^{(2)}, ..., ξ_p^{(2)} ∼ μ^{(2)}.

2. Construct the gradient estimator h_f^k as

h_f^k = ∇_x f(x, y; ξ^{(1)}) − ∇²_{xy} g(x, y; ξ_0^{(2)}) [ (t_max(k) c_h/L_g) ∏_{i=1}^{p} ( I − (c_h/L_g) ∇²_{yy} g(x, y; ξ_i^{(2)}) ) ] ∇_y f(x, y; ξ^{(1)}),

where, as a convention, we set ∏_{i=1}^{0} ( I − (c_h/L_g) ∇²_{yy} g(x, y; ξ_i^{(2)}) ) = I.

In the above, the distributions μ^{(1)}, μ^{(2)} are defined such that they yield unbiased estimates of the gradients/Jacobians/Hessians:

∇_x f(x, y) = E_{μ^{(1)}}[∇_x f(x, y; ξ^{(1)})],  ∇_y f(x, y) = E_{μ^{(1)}}[∇_y f(x, y; ξ^{(1)})],    (8)
∇²_{xy} g(x, y) = E_{μ^{(2)}}[∇²_{xy} g(x, y; ξ^{(2)})],  ∇²_{yy} g(x, y) = E_{μ^{(2)}}[∇²_{yy} g(x, y; ξ^{(2)})],

satisfying E[‖∇_y f(x, y; ξ^{(1)})‖²] ≤ C_y, E[‖∇²_{xy} g(x, y; ξ^{(2)})‖²] ≤ C_g, and

E[‖∇_x f(x, y) − ∇_x f(x, y; ξ^{(1)})‖²] ≤ σ²_{fx},  E[‖∇_y f(x, y) − ∇_y f(x, y; ξ^{(1)})‖²] ≤ σ²_{fy},
E[‖∇²_{xy} g(x, y) − ∇²_{xy} g(x, y; ξ^{(2)})‖₂²] ≤ σ²_{gxy},    (9)

where ‖·‖₂ is the Schatten-2 norm. For convenience of analysis, we assume μ_g/(μ_g² + σ²_{gxy}) ≤ 1 and L_g ≥ 1.
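The randomized construction above can be sketched in a few lines. The sampling interfaces below are assumed for illustration: `sample_grad_f()` returns a fresh draw of (∇_x f, ∇_y f) at the current (x, y), and `sample_hess_g()` returns a fresh draw of (∇²_{xy} g, ∇²_{yy} g), unbiased in the sense of (8).

```python
import numpy as np

rng = np.random.default_rng(0)

def hypergrad_estimate(sample_grad_f, sample_hess_g, t_max, c_h, L_g):
    """Sketch of the biased stochastic estimator h_f of the surrogate (6)."""
    p = int(rng.integers(0, t_max))       # p ~ Uniform{0, ..., t_max - 1}
    gx, gy = sample_grad_f()              # one sample xi^(1)
    Jxy, _ = sample_hess_g()              # sample xi^(2)_0
    v = gy.copy()
    for _ in range(p):                    # product of p independent factors
        _, Hyy = sample_hess_g()          # fresh sample xi^(2)_i
        v = v - (c_h / L_g) * (Hyy @ v)
    v = (t_max * c_h / L_g) * v           # randomized Neumann-series weight
    return gx - Jxy @ v
```

By Lemma 1 below, the bias of this estimator decays geometrically in t_max(k), so t_max(k) = O(1 + log k) suffices for the polynomial bias decay required by Assumption 3.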

The next lemma shows that h_f^k satisfies Assumption 3.

Lemma 1. Under Assumptions 1, 2, (8), (9), and with c_h = μ_g/(μ_g² + σ²_{gxy}), for any x ∈ X, y ∈ R^{d_2}, t_max(k) ≥ 1, it holds that

‖∇̄_x f(x^k, y^{k+1}) − E[h_f^k]‖ ≤ (C_{gxy} C_{fy}/μ_g) · ( 1 − μ_g²/(L_g(μ_g² + σ²_{gxy})) )^{t_max(k)}.    (10)

Furthermore, the variance is bounded as

E[‖h_f^k − E[h_f^k]‖²] ≤ σ²_{fx} + [ (σ²_{fy} + C_y²) σ²_{gxy} + 2C²_{gxy} + σ²_{fy} C²_{gxy} ] · max{ 3/μ_g², (3 d_1/L_g)/(μ_g² + σ²_{gxy}) }.    (11)

The proof of the above lemma is relegated to our online appendix; see §E in [24]. Note that the variance bound (11) relies on analyzing the expected norm of a product of random matrices, using techniques inspired by [18,25]. Finally, observe that the upper bounds in (10), (11) correspond to b_k, σ_f² in Assumption 3, respectively, and the requirements on h_f^k are satisfied.

To further understand the properties of the TTSA algorithm with (6), we borrow the following results from [23] on the Lipschitz continuity of the maps ∇ℓ(x), y⋆(x):

Lemma 2. [23, Lemma 2.2] Under Assumptions 1, 2, it holds that

‖∇̄_x f(x, y) − ∇ℓ(x)‖ ≤ L ‖y⋆(x) − y‖,  ‖y⋆(x₁) − y⋆(x₂)‖ ≤ L_y ‖x₁ − x₂‖,    (12a)
‖∇ℓ(x₁) − ∇ℓ(x₂)‖ = ‖∇̄_x f(x₁, y⋆(x₁)) − ∇̄_x f(x₂, y⋆(x₂))‖ ≤ L_f ‖x₁ − x₂‖,    (12b)

for any x, x₁, x₂ ∈ X and y ∈ R^{d_2}, where we have defined

L := L_{fx} + L_{fy} C_{gxy}/μ_g + C_{fy} ( L_{gxy}/μ_g + L_{gyy} C_{gxy}/μ_g² ),
L_f := L_{fx} + (L_{fy} + L) C_{gxy}/μ_g + C_{fy} ( L_{gxy}/μ_g + L_{gyy} C_{gxy}/μ_g² ),  L_y := C_{gxy}/μ_g.    (13)

The above properties will be pivotal in establishing the convergence of TTSA. First, we note that (12b) implies that the composite function ℓ(x) is weakly convex with a modulus at least (−L_f). Furthermore, (7c) in Assumption 3 combined with Lemma 2 leads to the following estimate:

E[‖h_f^k‖² | F'_k] ≤ σ̃_f² + 3b_k² + 3L² ‖y^{k+1} − y⋆(x^k)‖²,  σ̃_f² := σ_f² + 3 sup_{x∈X} ‖∇ℓ(x)‖².    (14)

Throughout, we assume σ̃_f² is bounded; e.g., this is satisfied if X is bounded, or if ℓ(x) has bounded gradient.

2.1 Applications

Practical problems such as hyperparameter optimization [22,41,50] and Stackelberg games [51] can be cast as special cases of the bilevel optimization problem (1). To be specific, we discuss three applications of the bilevel optimization problem (1) below.

Model-Agnostic Meta-Learning  An important paradigm of machine learning is to find a model that adapts to multiple training sets in order to achieve the best performance for individual tasks. Among others, a popular formulation is model-agnostic meta-learning (MAML) [20], which minimizes an outer objective of empirical risk on all training sets, while the inner objective is a one-step projected gradient. Let D^{(j)} = {z_i^{(j)}}_{i=1}^{n} be the j-th (j ∈ [J]) training set with sample size n. MAML can be formulated as a bilevel optimization problem [47]:

min_{θ∈Θ} Σ_{j=1}^{J} Σ_{i=1}^{n} ℓ̄( θ⋆_{(j)}(θ), z_i^{(j)} )  subject to  θ⋆_{(j)}(θ) ∈ arg min_{θ_{(j)}} { Σ_{i=1}^{n} ⟨θ_{(j)}, ∇_θ ℓ̄(θ, z_i^{(j)})⟩ + (λ/2) ‖θ_{(j)} − θ‖² }.    (15)

Here θ is the shared model parameter, θ_{(j)} is the adaptation of θ to the j-th training set, and ℓ̄ is the loss function. It can be checked that the inner problem is strongly convex. Assumptions 1, 2, 3 hold for stochastic gradient updates, assuming ℓ̄ is sufficiently regular, e.g., when the losses are logistic losses. Moreover, [21] proved that, assuming λ is sufficiently large and ℓ̄ is strongly convex, the outer problem is also strongly convex. In fact, [46] demonstrated that an algorithm with no inner loop achieves performance comparable to [21].
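Because the inner objective in (15) is a strongly convex quadratic in θ_{(j)}, its minimizer is available in closed form. A small sketch, assuming a per-sample-gradient interface that is not part of the paper's notation:

```python
import numpy as np

def maml_inner_solution(theta, per_sample_grads, lam):
    """Closed-form minimizer of the inner problem in (15).

    Minimizing <theta_j, G> + (lam/2) * ||theta_j - theta||^2 over theta_j,
    where G = sum_i grad_theta lbar(theta, z_i), gives
    theta_j* = theta - G / lam.
    """
    G = np.sum(per_sample_grads, axis=0)
    return theta - G / lam
```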

Policy Optimization  Another application of the bilevel optimization problem is policy optimization, particularly when combined with an actor-critic scheme. The optimization involved is to find an optimal policy that maximizes the expected (discounted) reward. Here, the 'actor' serves as the outer problem, and the 'critic' serves as the inner problem which evaluates the performance of the 'actor' (the current policy). To avoid redundancy, we refer our readers to §4, where we present a detailed case study. The latter will also shed light on the generality of our proof techniques for TTSA algorithms.

Data hyper-cleaning  The data hyper-cleaning problem trains a classifier with a dataset of randomly corrupted labels [50]. The problem formulation is given below:

min_{x∈R^{d_1}} ℓ(x) := Σ_{i∈D_val} L(a_i^⊤ y⋆(x), b_i)    (16)
s.t. y⋆(x) = arg min_{y∈R^{d_2}} { λ‖y‖² + Σ_{i∈D_tr} σ(x_i) L(a_i^⊤ y, b_i) }.

In this problem, we have d_1 = |D_tr|, and d_2 is the dimension of the classifier y; (a_i, b_i) is the i-th data point; L(·) is the loss function; x_i is the parameter that determines the weight of the i-th data sample; σ : R → R_+ is the weight function; λ > 0 is a regularization parameter; D_val and D_tr are the validation and training sets, respectively. Here, the inner problem finds the classifier y⋆(x) on the training set D_tr, while the outer problem finds the best weights x with respect to the validation set D_val.
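For concreteness, the two levels of (16) can be sketched as follows for the logistic-regression instance used in §5. Labels b_i ∈ {−1, +1} and the margin form of the logistic loss are assumptions made for this illustration:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_loss(margin):
    # L written on the margin u * b, with labels b in {-1, +1}
    return np.log1p(np.exp(-margin))

def inner_objective(y, x, A_tr, b_tr, lam):
    """g(x, y): the weighted, l2-regularized training loss in (16)."""
    margins = (A_tr @ y) * b_tr
    return lam * (y @ y) + np.sum(sigmoid(x) * logistic_loss(margins))

def outer_objective(y_star, A_val, b_val):
    """l(x): the validation loss evaluated at the inner solution y*(x)."""
    return np.sum(logistic_loss((A_val @ y_star) * b_val))
```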

Before ending this subsection, let us mention that we are not aware of any general sufficient conditions that can be used to verify whether the outer function ℓ(x) is (strongly) convex or not. To our knowledge, the convexity of ℓ(x) has to be verified in a case-by-case manner; please see [21, Appendix B3] for how this can be done.

3 Main Results

This section presents the convergence results for the TTSA algorithm applied to (1). We first summarize a list of important constants in Table 2 to be used in the forthcoming analysis. Next, we discuss a few concepts pivotal to our analysis.

Tracking Error  TTSA tackles the inner and outer problems simultaneously using single-loop updates. Due to the coupled nature of the inner and outer problems, in order to obtain an upper bound on the optimality gap Δ_x^k := E[‖x^k − x⋆‖²], where x⋆ is an optimal solution to (1), we need to estimate the tracking error defined as

Δ_y^k := E[‖y^k − y⋆(x^{k−1})‖²],  where y⋆(x) = arg min_{y∈R^{d_2}} g(x, y).    (17)

For any x ∈ X, y⋆(x) is well defined since the inner problem is strongly convex by Assumption 2. By definition, Δ_y^k quantifies how close y^k is to the optimal solution of the inner problem given x^{k−1}.

Moreau Envelope  Fix ρ > 0 and define the Moreau envelope and the proximal map as

Φ_{1/ρ}(z) := min_{x∈X} { ℓ(x) + (ρ/2)‖x − z‖² },  x̂(z) := arg min_{x∈X} { ℓ(x) + (ρ/2)‖x − z‖² }.    (18)

For any ε > 0, x^k ∈ X is said to be an ε-nearly stationary solution [16] if x^k is an approximate fixed point of the proximal map x̂(·), in the sense that

Δ̄_x^k := E[‖x̂(x^k) − x^k‖²] ≤ ρ^{−2} · ε.    (19)
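Numerically, the proximal map x̂(z) in (18) can be approximated by running projected gradient descent on the objective ℓ(x) + (ρ/2)‖x − z‖², which is strongly convex whenever ρ > −μ_ℓ. The sketch below uses a heuristic step size and illustrates how the near-stationarity measure in (19) may be evaluated; it is an illustration, not part of the algorithm.

```python
import numpy as np

def moreau_prox(ell_grad, project_X, z, rho, steps=500, lr=None):
    """Approximate x_hat(z) = argmin_{x in X} l(x) + (rho/2) ||x - z||^2."""
    if lr is None:
        lr = 1.0 / (2.0 * rho)   # heuristic choice; assumes rho dominates smoothness
    x = np.array(z, dtype=float)
    for _ in range(steps):
        x = project_X(x - lr * (ell_grad(x) + rho * (x - z)))
    return x

def near_stationarity(ell_grad, project_X, x_k, rho):
    """The quantity ||x_hat(x^k) - x^k||^2 appearing in (19)."""
    x_hat = moreau_prox(ell_grad, project_X, x_k, rho)
    return float(np.sum((x_hat - x_k) ** 2))
```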


Table 2: Summary of the constants for Section 3.

Constant         | Description                                                              | Reference
L_{fx}, L_{fy}   | Lipschitz constants of ∇_x f(x, ·), ∇_y f(x, ·) w.r.t. y, resp.          | Assumption 1
L̄_{fy}           | Lipschitz constant of ∇_y f(·, y) w.r.t. x                               | Assumption 1
C_{fy}           | Upper bound on ‖∇_y f(x, y)‖                                             | Assumption 1
L_g              | Lipschitz constant of ∇_y g(x, ·)                                        | Assumption 2
μ_g              | Strong convexity modulus of g(x, ·) w.r.t. y                             | Assumption 2
L_{gxy}, L_{gyy} | Lipschitz constants of ∇²_{xy} g(x, ·), ∇²_{yy} g(x, ·) w.r.t. y, resp.  | Assumption 2
L̄_{gxy}, L̄_{gyy} | Lipschitz constants of ∇²_{xy} g(·, y), ∇²_{yy} g(·, y) w.r.t. x, resp.  | Assumption 2
C_{gxy}          | Upper bound on ‖∇²_{xy} g(x, y)‖                                         | Assumption 2
b_k              | Bound on the bias of h_f^k at iteration k                                | Assumption 3
σ_g², σ_f²       | Variances of the stochastic estimates h_g^k, h_f^k, resp.                | Assumption 3
σ̃_f²             | Constant term in the bound on E[‖h_f^k‖²]                                | (14)
L                | Bounds ‖∇̄_x f(x, y) − ∇ℓ(x)‖ in terms of ‖y⋆(x) − y‖                    | Lemma 2
L_y              | Lipschitz constant of y⋆(x)                                              | Lemma 2
L_f              | Lipschitz constant of ∇ℓ(x)                                              | Lemma 2

We observe that if ε = 0, then x^k ∈ X is a stationary solution to (1), satisfying the second condition in (3). As we will demonstrate next, the near-stationarity condition (19) provides an apparatus for quantifying the finite-time convergence of TTSA in the case when ℓ(x) is non-convex.

3.1 Strongly Convex Outer Objective Function

Our first result considers the instance of (1) where ℓ(x) is strongly convex. We obtain:

Theorem 1. Under Assumptions 1, 2, 3, assume that ℓ(x) is weakly convex with a modulus μ_ℓ > 0 (i.e., it is strongly convex), and that the step sizes satisfy

α_k ≤ c_0 β_k^{3/2},  β_k ≤ c_1 α_k^{2/3},  β_{k−1}/β_k ≤ 1 + β_k μ_g/8,  α_{k−1}/α_k ≤ 1 + 3α_k μ_ℓ/4,    (20a)

α_k ≤ 1/μ_ℓ,  β_k ≤ min{ 1/μ_g, μ_g/(L_g²(1 + σ_g²)), μ_g²/(48 c_0² L² L_y²) },  8μ_ℓ α_k ≤ μ_g β_k,  ∀ k ≥ 0,    (20b)

where the constants L, L_y were defined in Lemma 2 and c_0, c_1 > 0 are free parameters. If the bias is bounded as b_k² ≤ c_b α_{k+1}, then for any k ≥ 1, the TTSA iterates satisfy

Δ_x^k ≲ ∏_{i=0}^{k−1}(1 − α_i μ_ℓ) [ Δ_x^0 + (L²/μ_ℓ²) Δ_y^0 ] + (c_1 L²/μ_ℓ²) [ σ_g²/μ_g + (c_0² L_y²/μ_g²) σ̃_f² ] α_{k−1}^{2/3},

Δ_y^k ≲ ∏_{i=0}^{k−1}(1 − β_i μ_g/4) Δ_y^0 + [ σ_g²/μ_g + (c_0² L_y²/μ_g²) σ̃_f² ] β_{k−1},    (21)

where the symbol ≲ indicates that numerical constants are omitted (see Section 3.3).

Notice that the bounds in (21) show that the expected optimality gap and tracking error at the k-th iteration consist of a transient term and a fluctuation term. For instance, in the bound on Δ_x^k, the first (transient) term decays sub-geometrically as ∏_{i=0}^{k−1}(1 − α_i μ_ℓ), while the second (fluctuation) term decays as α_{k−1}^{2/3}.


The conditions in (20) are satisfied by both diminishing and constant step sizes. For example, we define the constants:

k_α = max{ 3⁵ (L_g/μ_g)³ (1 + σ_g²)^{3/2}, 512^{3/2} L² L_y²/μ_ℓ² },  c_α = 8/(3μ_ℓ),  k_β = k_α/4,  c_β = 32/(3μ_g).    (22)

Then, for diminishing step sizes, we set α_k = c_α/(k + k_α), β_k = c_β/(k + k_β)^{2/3}, and for constant step sizes, we set α_k = c_α/k_α, β_k = c_β/k_β^{2/3}. Both pairs of step sizes satisfy (20) with c_0 = μ_g^{3/2}/μ_ℓ, c_1 = 10 μ_ℓ^{2/3}/μ_g. For diminishing step sizes, Theorem 1 shows that the last-iterate convergence rate of both the optimality gap and the tracking error is O(k^{−2/3}). To compute an ε-optimal solution with Δ_x^k ≤ ε, the TTSA algorithm with diminishing step sizes requires a total of O(log(1/ε)/ε^{3/2}) calls to stochastic (gradient/Hessian/Jacobian) oracles of both the outer (f(·,·)) and inner (g(·,·)) functions.¹

¹Notice that as we need b_k = O(√α_{k+1}), from Lemma 1 the polynomial bias decay requires using t_max(k) = O(1 + log k) samples per iteration, justifying the log factor in the bound.
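For reference, the diminishing schedules of this paragraph can be written as a short helper. The constants follow (22); the offsets k_α, k_β are treated as given inputs:

```python
def stepsizes_strongly_convex(mu_l, mu_g, k_alpha, k_beta):
    """Diminishing schedules for the strongly convex case of Theorem 1:
    alpha_k = O(1/k) and beta_k = O(k^(-2/3)), so alpha_k/beta_k -> 0.
    """
    c_alpha = 8.0 / (3.0 * mu_l)
    c_beta = 32.0 / (3.0 * mu_g)
    alpha = lambda k: c_alpha / (k + k_alpha)
    beta = lambda k: c_beta / (k + k_beta) ** (2.0 / 3.0)
    return alpha, beta
```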

While this is arguably the easiest case of (1), we note that the double-loop algorithm in [23] requires O(1/ε) and O(1/ε²) stochastic oracle calls for the outer (i.e., f(·,·)) and inner (i.e., g(·,·)) functions, respectively. As such, the TTSA algorithm requires fewer stochastic oracle calls for the inner function.

3.2 Smooth (Possibly Non-convex) Outer Objective Function

We now focus on the case where ℓ(x) is weakly convex. We obtain:

Theorem 2. Under Assumptions 1, 2, 3, assume that ℓ(·) is weakly convex with modulus μ_ℓ ∈ R. Let K_max ≥ 1 be the maximum iteration number and set

α = min{ μ_g²/(8 L_y L L_g²(1 + σ_g²)), (1/(4 L_y L)) K_max^{−3/5} },  β = min{ μ_g/(L_g²(1 + σ_g²)), (2/μ_g) K_max^{−2/5} }.    (23)

If b_k² ≤ α, then for any K_max ≥ 1, the iterates of the TTSA algorithm satisfy

E[Δ̄_x^K] = O(K_max^{−2/5}),  E[Δ_y^{K+1}] = O(K_max^{−2/5}),    (24)

where K is an independent, uniformly distributed random variable on {0, ..., K_max − 1}, and we recall Δ̄_x^k := E[‖x̂(x^k) − x^k‖²]. When K_max is large and μ_ℓ < 0, setting ρ = 2|μ_ℓ| yields

E[Δ̄_x^K] ≲ [ L²( Δ̄_0 + σ_g²/μ_g² ) + μ_g σ_f² ] K_max^{−2/5}/|μ_ℓ|²,  E[Δ_y^{K+1}] ≲ [ Δ̄_0/μ_g + σ_g²/μ_g² + μ_g σ_f²/L² ] K_max^{−2/5},    (25)

where we defined Δ̄_0 := max{ Φ_{1/ρ}(x⁰), (L_y/L) OPT_0, Δ_y^0 } and used the conditions α < 1, μ_g ≤ 1; the symbol ≲ indicates that numerical constants are omitted (see Section 3.3).

The above result uses constant step sizes determined by the maximum number of iterations, K_max. Here we set the step sizes as α_k ≍ K_max^{−3/5} and β_k ≍ K_max^{−2/5}. Similar to the previous case of a strongly convex outer objective function, α_k/β_k converges to zero as K_max goes to infinity. Nevertheless, Theorem 2 shows that TTSA requires O(ε^{−5/2} log(1/ε)) calls to the stochastic oracles for sampled gradients/Hessians to find an ε-nearly stationary solution. In addition, it is worth noting that when X = R^{d_1}, Theorem 2 implies that TTSA achieves E[‖∇ℓ(x^K)‖²] = O(K_max^{−2/5}) [16, Sec. 2.2], i.e., x^K is an O(K_max^{−2/5})-approximate (near) stationary point of ℓ(x) in expectation.

Let us compare our sampling complexity bounds to the double-loop algorithm in [23], which requires O(ε^{−3}) (resp. O(ε^{−2})) stochastic oracle calls for the inner problem (resp. outer problem) to reach an ε-stationary solution. The sample complexity of TTSA yields a tradeoff between the inner and outer stochastic oracles. We also observe that a trivial extension to a single-loop algorithm results in a constant error bound.² Finally, we can extend Theorem 2 to the case where ℓ(·) is a convex function.

Corollary 1. Under Assumptions 1, 2, 3, assume that ℓ(x) is weakly convex with modulus μ_ℓ ≥ 0. Consider (1) with X ⊆ R^{d_1} and D_x = sup_{x,x′∈X} ‖x − x′‖ < ∞. Let K_max ≥ 1 be the maximum iteration number and set

α = min{ μ_g²/(8 L_y L L_g²(1 + σ_g²)), (1/(4 L_y L)) K_max^{−3/4} },  β = min{ μ_g/(L_g²(1 + σ_g²)), (2/μ_g) K_max^{−1/2} }.    (26)

If b_k ≤ c_b K_max^{−1/4}, then for large K_max, the TTSA algorithm satisfies

E[ℓ(x^K) − ℓ(x⋆)] = O(K_max^{−1/4}),  E[Δ_y^{K+1}] = O(K_max^{−1/2}),    (27)

where K is an independent uniform random variable on {0, ..., K_max − 1}. By convexity, the above implies E[ℓ((1/K_max) Σ_{k=1}^{K_max} x^k) − ℓ(x⋆)] = O(K_max^{−1/4}).

From Corollary 1, the TTSA algorithm requires O(ε^{−4} log(1/ε)) stochastic oracle calls to find an ε-optimal solution (in terms of the optimality gap defined with objective values). This is comparable to the complexity bounds in [23], which requires O(ε^{−4} log(1/ε)) (resp. O(ε^{−2})) stochastic oracle calls for the inner problem (resp. outer problem). Additionally, we mention that the constant D_x, which represents the diameter of the constraint set, appears in the constants of the convergence bounds; it is therefore omitted in the big-O notation in (27). For details, please see the proof in Appendix B.4.

3.3 Convergence Analysis

We now present the proofs of Theorems 1 and 2. The proof of Corollary 1 is similar to that of Theorem 2; due to space limitations, we refer the readers to [24]. We highlight that the proofs of both theorems rely on similar ideas for tackling coupled inequalities.

Proof of Theorem 1  Our proof relies on bounding the optimality gap and the tracking error, which are coupled with each other. First, we derive the convergence of the inner problem.

Lemma 3. Under Assumptions 1, 2, 3, suppose that the step sizes satisfy (20a), (20b). For any k ≥ 1, it holds that

Δ_y^{k+1} ≤ ∏_{i=0}^{k}(1 − β_i μ_g/2) Δ_y^0 + (8/μ_g) { σ_g² + (4 c_0² L_y²/μ_g) [ σ̃_f² + 3b_0² ] } β_k.    (28)

²To see this, the readers are referred to [23, Theorem 3.1]. If a single inner iteration is performed, t_k = 1, so A_k ≥ ‖y⁰ − y⋆(x^k)‖, which is a constant. Then the r.h.s. of (3.70), (3.73), (3.74) in [23, Theorem 3.1] will all have a constant term.


Notice that the bound in (28) relies on the strong convexity of the inner problem and the Lipschitz properties established in Lemma 2 for y⋆(x); see §A.1. We emphasize that the step size condition α_k ≤ c_0 β_k^{3/2} is crucial in establishing the above bound. As the second step, we bound the convergence of the outer problem.

Lemma 4. Under Assumptions 1, 2, 3, assume that the bias satisfies b_k² ≤ c_b α_{k+1}. With (20a), (20b), for any k ≥ 1, it holds that

Δ_x^{k+1} ≤ ∏_{i=0}^{k}(1 − α_i μ_ℓ) Δ_x^0 + [ 4c_b/μ_ℓ² + (2σ̃_f² + 6b_0²)/μ_ℓ ] α_k + [ 2L²/μ_ℓ + 3α_0 L² ] Σ_{j=0}^{k} α_j ∏_{i=j+1}^{k}(1 − α_i μ_ℓ) Δ_y^{j+1}.    (29)

See §A.2. We observe that (28), (29) form a pair of coupled inequalities. To compute the final bound in the theorem, we substitute (28) into (29). As Δ_y^{j+1} = O(β_j) = O(α_j^{2/3}), the dominating term in (29) can be estimated as

Σ_{j=0}^{k} α_j ∏_{i=j+1}^{k}(1 − α_i μ_ℓ) Δ_y^{j+1} = Σ_{j=0}^{k} O(α_j^{5/3}) ∏_{i=j+1}^{k}(1 − α_i μ_ℓ) = O(α_k^{2/3}),    (30)

yielding the desired rates in the theorem. See §A.3 for details.

Proof of Theorem 2  Without strong convexity of the outer problem, the analysis becomes more challenging. To this end, we first develop the following lemma on coupled inequalities for numerical sequences, which will be pivotal to our analysis:

Lemma 5. Let K ≥ 1 be an integer. Consider sequences of non-negative scalars {Ω_k}_{k=0}^{K}, {Υ_k}_{k=0}^{K}, {Θ_k}_{k=0}^{K}, and let c_0, c_1, c_2, d_0, d_1, d_2 be some positive constants. Suppose the recursion

Ω_{k+1} ≤ Ω_k − c_0 Θ_{k+1} + c_1 Υ_{k+1} + c_2,  Υ_{k+1} ≤ (1 − d_0) Υ_k + d_1 Θ_k + d_2    (31)

holds for any k ≥ 0. Then, provided that c_0 − c_1 d_1 (d_0)^{−1} > 0 and d_0 − d_1 c_1 (c_0)^{−1} > 0, it holds that

(1/K) Σ_{k=1}^{K} Θ_k ≤ [ Ω_0 + (c_1/d_0)(Υ_0 + d_1 Θ_0 + d_2) ] / [ (c_0 − c_1 d_1 (d_0)^{−1}) K ] + [ c_2 + c_1 d_2 (d_0)^{−1} ] / [ c_0 − c_1 d_1 (d_0)^{−1} ],

(1/K) Σ_{k=1}^{K} Υ_k ≤ [ Υ_0 + d_1 Θ_0 + d_2 + (d_1/c_0) Ω_0 ] / [ (d_0 − d_1 c_1 (c_0)^{−1}) K ] + [ d_2 + d_1 c_2 (c_0)^{−1} ] / [ d_0 − d_1 c_1 (c_0)^{−1} ].    (32)

The proof of the above lemma is simple and is relegated to §B.1.
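As an illustration (not a substitute for the proof), one can simulate a recursion that satisfies (31) with equality and check the averaged bound on Θ_k from (32) numerically:

```python
# Numerical sanity check of Lemma 5 with arbitrarily chosen constants.
c0, c1, c2, d0, d1, d2 = 1.0, 0.1, 1e-3, 0.5, 0.2, 1e-3
assert c0 - c1 * d1 / d0 > 0 and d0 - d1 * c1 / c0 > 0

K = 10_000
Omega0, Ups0, Theta0 = 1.0, 1.0, 0.0
Omega, Ups, Theta = Omega0, Ups0, Theta0
avg_theta = 0.0
for _ in range(K):
    Theta_next = 0.9 * (Omega + c2) / c0             # feasible: keeps Omega >= 0
    Ups = (1 - d0) * Ups + d1 * Theta + d2           # (31), second inequality
    Omega = Omega - c0 * Theta_next + c1 * Ups + c2  # (31), first inequality
    Theta = Theta_next
    avg_theta += Theta / K

bound = ((Omega0 + (c1 / d0) * (Ups0 + d1 * Theta0 + d2))
         / ((c0 - c1 * d1 / d0) * K)
         + (c2 + c1 * d2 / d0) / (c0 - c1 * d1 / d0))
print(f"avg Theta = {avg_theta:.3e} <= bound = {bound:.3e}")
```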

We demonstrate that stationarity measures of the TTSA iterates satisfy (31). The conditions c_0 − c_1 d_1 (d_0)^{−1} > 0, d_0 − d_1 c_1 (c_0)^{−1} > 0 impose constraints on the step sizes, and (32) leads to a finite-time bound on the convergence of TTSA. To begin our derivation of Theorem 2, we observe the following coupled descent lemma:

Lemma 6. Under Assumptions 1, 2, 3, if μ_g β/2 < 1 and β L_g²(1 + σ_g²) ≤ μ_g, then the following inequalities hold for any k ≥ 0:

OPT_{k+1} ≤ OPT_k − ((1 − α L_f)/(2α)) E[‖x^{k+1} − x^k‖²] + α [ 2L² Δ_y^{k+1} + 2b_0² + σ_f² ],    (33a)

Δ_y^{k+1} ≤ (1 − μ_g β/2) Δ_y^k + ( 2/(μ_g β) − 1 ) L_y² E[‖x^k − x^{k−1}‖²] + β² σ_g².    (33b)

The proof of (33a) is due to the smoothness of the outer function ℓ(·) established in Lemma 2, while (33b) follows from the strong convexity of the inner problem; see the details in §B.2. Note that (33a), (33b) together form a special case of (31) with:

Ω_k = OPT_k,  Θ_k = E[‖x^k − x^{k−1}‖²],  c_0 = 1/(2α) − L_f/2,  c_1 = 2αL²,  c_2 = α(2b_0² + σ_f²),
Υ_k = Δ_y^k,  d_0 = μ_g β/2,  d_1 = ( 2/(μ_g β) − 1 ) L_y²,  d_2 = β² σ_g².    (34)

Notice that Θ_0 = 0. Assuming that α ≤ 1/(2L_f), we note the following implications:

α/β ≤ μ_g/(8 L_y L)  ⟹  c_0 − c_1 d_1/d_0 ≥ 1/(8α) > 0,  d_0 − d_1 c_1/c_0 ≥ μ_g β/4 > 0,    (35)

i.e., if (35) holds, then the conclusion (32) can be applied. It can be shown that the step sizes in (23) satisfy (35). Applying Lemma 5 shows that

(1/K) Σ_{k=1}^{K} E[‖x^k − x^{k−1}‖²] ≤ [ 2 OPT_0 + (L/L_y)( Δ_y^0 + 4σ_g²/μ_g² ) ] / ( L_y L K^{8/5} ) + [ (2b_0² + σ_f²) + 8σ_g² L²/μ_g² ] / ( 2 L_y² L² K^{6/5} ),

(1/K) Σ_{k=1}^{K} Δ_y^k ≤ [ 2Δ_y^0 + 8σ_g²/μ_g² + (4L_y/(μ_g L)) OPT_0 ] / K^{3/5} + [ 8σ_g²/μ_g² + μ_g(2b_0² + σ_f²)/(2L²) ] / K^{2/5}.

Again, we emphasize that the two-timescale step size design is crucial in establishing the above upper bounds. Now, recalling the properties of the Moreau envelope in (18), we obtain the following descent estimate:

Lemma 7. Under Assumptions 1, 2, 3, set ρ > −μ_ℓ, ρ ≥ 0; then for any k ≥ 0, it holds that

E[Φ_{1/ρ}(x^{k+1}) − Φ_{1/ρ}(x^k)] ≤ (5ρ/2) E[‖x^{k+1} − x^k‖²] + [ 2αρL²/(ρ + μ_ℓ) + 3α²ρL² ] Δ_y^{k+1}
  − ((ρ + μ_ℓ)ρα/4) E[‖x̂(x^k) − x^k‖²] + [ 2ρ/(ρ + μ_ℓ) + ρ(σ_f² + 3α) ] α².    (36)

See details in §B.3. Summing the inequality (36) from k = 0 to k = K_max − 1 gives the following upper bound:

(1/K_max) Σ_{k=0}^{K_max−1} E[‖x̂(x^k) − x^k‖²] ≤ (4/((ρ + μ_ℓ)ρ)) [ Φ_{1/ρ}(x⁰)/(α K_max) + (2/(α K_max)) Σ_{k=1}^{K_max} E[‖x^k − x^{k−1}‖²] ]
  + (4/(ρ + μ_ℓ)) [ ( L²/(ρ + μ_ℓ) + 3αL² ) (1/K_max) Σ_{k=1}^{K_max} Δ_y^k + ( 2/(ρ + μ_ℓ) + σ_f² + 3α ) α ].

Combining the above with α ≍ K_max^{−3/5} yields the desired bound (1/K_max) Σ_{k=0}^{K_max−1} E[‖x̂(x^k) − x^k‖²] = O(K_max^{−2/5}). In particular, the asymptotic bound is obtained by setting ρ = 2|μ_ℓ|.

Proof of Corollary 1  We observe that Lemma 6 applies directly in this setting, since convex functions are also weakly convex. With the step size choice (26), similar conclusions hold:

(1/K) Σ_{k=1}^{K} E[‖x^k − x^{k−1}‖²] ≤ [ 2 OPT_0 + (L/L_y)( Δ_y^0 + 4σ_g²/μ_g² ) ] / ( L_y L K^{7/4} ) + [ (2b_0² + σ_f²) + 8σ_g² L²/μ_g² ] / ( 2 L_y² L² K^{3/2} ),

(1/K) Σ_{k=1}^{K} Δ_y^k ≤ [ 2Δ_y^0 + 8σ_g²/μ_g² + (4L_y/(μ_g L)) OPT_0 ] / K^{1/2} + [ 8σ_g²/μ_g² + μ_g(2b_0² + σ_f²)/(2L²) ] / K^{1/2}.

With the additional property μ_ℓ ≥ 0, in §B.4 we further derive an alternative descent estimate to (33a) that leads to the desired bound on K^{−1} Σ_{k=1}^{K} OPT_k.

4 Application to Reinforcement Learning

Consider a Markov decision process (MDP) (S, A, γ, P, r), where S and A are the state and action spaces, respectively, γ ∈ [0, 1) is the discount factor, P(s′|s, a) is the transition kernel to the next state s′ given the current state s and action a, and r(s, a) ∈ [0, 1] is the reward at (s, a). Furthermore, the initial state s₀ is drawn from a fixed distribution ρ₀. We follow a stationary policy π : S × A → R; for any (s, a) ∈ S × A, π(a|s) is the probability of the agent choosing action a ∈ A at state s ∈ S. Note that a policy π ∈ X induces a Markov chain on S. Denote the induced Markov transition kernel by P^π, such that s_{t+1} ∼ P^π(·|s_t); for any s, s′ ∈ S, we have P^π(s′|s) = Σ_{a∈A} π(a|s) P(s′|s, a). For any π ∈ X, P^π is assumed to induce a stationary distribution over S, denoted by μ_π. We assume that |A| < ∞ while |S| is possibly infinite (but countable). To simplify our notation, for any distribution ρ on S, we let ⟨·,·⟩_ρ be the inner product with respect to ρ, and ‖·‖_{μ_π⊗π} the weighted ℓ₂-norm with respect to the probability measure μ_π ⊗ π over S × A (where f, g are measurable functions on S × A):

⟨f, g⟩_ρ = Σ_{s∈S} ⟨f(s, ·), g(s, ·)⟩ ρ(s),  ‖f‖_{μ_π⊗π} = ( Σ_{s∈S} Σ_{a∈A} π(a|s) [f(s, a)]² μ_π(s) )^{1/2}.

In policy optimization, our objective is to maximize the expected total discounted reward received by the agent with respect to the policy π, i.e.,

max_{π∈X⊆R^{|S|×|A|}}  −ℓ(π) = E_π[ Σ_{t≥0} γ^t r(s_t, a_t) | s₀ ∼ ρ₀ ],    (37)

where E_π is the expectation with the actions taken according to policy π, and ρ₀ denotes the distribution of the initial state. To see how (37) is approximated as a bilevel problem, let P^π be the Markov operator under the policy π. We let Q^π be the unique solution to the following Bellman equation [53]:

Q(s, a) = r(s, a) + γ (P^π Q)(s, a),  ∀ (s, a) ∈ S × A.    (38)


Notice that the following holds:

Q^π(s, a) = E_π[ Σ_{t≥0} γ^t r(s_t, a_t) | s₀ = s, a₀ = a ],  E_{a∼π(·|s)}[Q^π(s, a)] = ⟨Q^π(s, ·), π(·|s)⟩.

Further, we parameterize Q using a linear approximation Q(s, a) ≈ Q_θ(s, a) := φ^⊤(s, a) θ, where φ : S × A → R^d is a known feature mapping and θ ∈ R^d is a finite-dimensional parameter. Using the fact that ℓ(π) = −E_π[Q^π(s, a)], problem (37) can be approximated by the bilevel optimization problem

min_{π∈X⊆R^{|S|×|A|}} ℓ(π) = −⟨Q_{θ⋆(π)}, π⟩_{ρ₀}  subject to  θ⋆(π) ∈ arg min_{θ∈R^d} (1/2) ‖Q_θ − r − γ P^π Q_θ‖²_{μ_π⊗π}.    (39)

Solving the Policy Optimization Problem  We illustrate how to adopt the TTSA algorithm to solve (39). First, the inner problem is policy evaluation (a.k.a. the 'critic'), which minimizes the mean squared Bellman error (MSBE). A standard approach is TD learning [52]. We draw two consecutive state-action pairs (s, a, s′, a′) satisfying s ∼ μ_{π^k}, a ∼ π^k(·|s), s′ ∼ P(·|s, a), a′ ∼ π^k(·|s′), and update the critic via

θ^{k+1} = θ^k − β h_g^k  with  h_g^k = [ φ^⊤(s, a) θ^k − r(s, a) − γ φ^⊤(s′, a′) θ^k ] φ(s, a),    (40)

where β is the step size. This step resembles (4a) of TTSA, except that the mean field E[h_g^k | F_k] is a semi-gradient of the MSBE.
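A minimal sketch of the critic update (40) with linear function approximation; the feature vectors are assumed to be supplied by the sampling scheme described above:

```python
import numpy as np

def td0_critic_step(theta, phi_sa, reward, phi_next, beta, gamma):
    """One TD(0) semi-gradient step, matching (40).

    phi_sa, phi_next -- feature vectors phi(s, a) and phi(s', a')
    """
    td_error = phi_sa @ theta - reward - gamma * (phi_next @ theta)
    return theta - beta * td_error * phi_sa
```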

Secondly, the outer problem searches for the policy (a.k.a. the 'actor') that maximizes the expected discounted reward. To develop this step, let us define the visitation measure and the Bregman divergence as

ρ_{π^k}(s) := (1 − γ)^{−1} Σ_{t≥0} γ^t P(s_t = s),  D_{ψ,ρ_{π^k}}(π, π^k) := Σ_{s∈S} D_ψ( π(·|s), π^k(·|s) ) ρ_{π^k}(s),

where {s_t}_{t≥0} is a trajectory of states obtained by drawing s₀ ∼ ρ₀ and following the policy π^k, and D_ψ is the Kullback-Leibler (KL) divergence between probability distributions over A. We also define the following gradient surrogate:

[∇̄_π f(π^k, θ^{k+1})](s, a) = −(1 − γ)^{−1} Q_{θ^{k+1}}(s, a) ρ_{π^k}(s),  ∀ (s, a).    (41)

Similar to (6), and under the additional assumption that the linear approximation is exact, i.e., Q_{θ⋆(π^k)} = Q^{π^k}, we can show that ∇̄_π f(π^k, θ⋆(π^k)) = ∇ℓ(π^k) using the policy gradient theorem [53]. In a similar vein to (4b) in TTSA, we consider the mirror descent step for improving the policy (cf. proximal policy optimization in [49]):

π^{k+1} = arg min_{π∈X} { −(1 − γ)^{−1} ⟨Q_{θ^{k+1}}, π − π^k⟩_{ρ_{π^k}} + (1/α) D_{ψ,ρ_{π^k}}(π, π^k) },    (42)

where α is the step size. Note that the above update can be performed as

π^{k+1}(·|s) ∝ π^k(·|s) exp[ α(1 − γ)^{−1} Q_{θ^{k+1}}(s, ·) ] = π⁰(·|s) exp[ (1 − γ)^{−1} φ(s, ·)^⊤ Σ_{i=0}^{k} α θ^{i+1} ].

In other words, π^{k+1} can be represented using the running sum of critic parameters, Σ_{i=0}^{k} α θ^{i+1}. This is similar to the natural policy gradient method [30], and the algorithm requires a low memory footprint. Finally, the recursions (40), (42) give the two-timescale natural actor-critic (TT-NAC) algorithm.
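The multiplicative form of the policy update admits a simple per-state implementation. A sketch for a single state s, working in log space for numerical stability (an implementation choice, not part of the algorithm's statement):

```python
import numpy as np

def nac_actor_step(log_pi_s, q_row, alpha, gamma):
    """Closed-form softmax/mirror-descent update of (42) at one state s.

    log_pi_s -- log pi^k(.|s), shape (|A|,)
    q_row    -- Q_{theta^{k+1}}(s, .), shape (|A|,)
    """
    logits = log_pi_s + (alpha / (1.0 - gamma)) * q_row
    logits = logits - logits.max()        # shift for numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum()
```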

4.1 Convergence Analysis of TT-NAC

Consider the following assumptions on the MDP model of interest.

Assumption 4. The reward function is uniformly bounded by a constant r̄; that is, |r(s, a)| ≤ r̄ for all (s, a) ∈ S × A.

Assumption 5. The feature map φ : S × A → R^d satisfies ‖φ(s, a)‖₂ ≤ 1 for all (s, a) ∈ S × A. The action-value function associated with each policy is a linear function of φ; that is, for any policy π ∈ X, there exists θ⋆(π) ∈ R^d such that Q^π(·, ·) = φ(·, ·)^⊤ θ⋆(π) = Q_{θ⋆(π)}(·, ·).

Assumption 6. For each policy π ∈ X, the induced Markov chain P^π admits a unique stationary distribution μ_π. Moreover, there exists μ_φ > 0 such that E_{s∼μ_π, a∼π(·|s)}[φ(s, a) φ(s, a)^⊤] ⪰ μ_φ² · I_d for all π ∈ X.

Assumption 7. For any (s, a) ∈ S × A and any π ∈ X, let ϱ(s, a, π) be a probability measure over S, defined by

[ϱ(s, a, π)](s′) = (1 − γ)^{−1} Σ_{t≥0} γ^t P(s_t = s′),  ∀ s′ ∈ S;    (43)

that is, ϱ(s, a, π) is the visitation measure induced by the Markov chain starting from (s₀, a₀) = (s, a) and following π afterwards. For any optimal policy π⋆, there exists C_ρ > 0 such that

E_{s′∼ρ⋆}[ | [ϱ(s, a, π)](s′)/ρ⋆(s′) |² ] ≤ C_ρ²,  ∀ (s, a) ∈ S × A, π ∈ X.

Here we let ρ⋆ denote ρ_{π⋆} to simplify the notation; it is the visitation measure induced by π⋆ with s₀ ∼ ρ₀.

We remark that Assumption 4 is standard in the reinforcement learning literature [53,55]. In Assumption 5, we assume that each Q^π is linear, which implies that the linear function approximation is exact. A sufficient condition for Assumption 5 is that the underlying MDP is a linear MDP [28,62], where both the reward function and the Markov transition kernel are linear in φ. The linear MDP contains the tabular MDP as a special case, where the feature mapping φ(s, a) becomes the canonical basis vector indexed by (s, a). The assumption that the stationary distribution μ_π exists for any policy π is a common property of the MDPs analyzed in TD learning, e.g., [4,15]. Assumption 6 further assumes that the smallest eigenvalue of the feature covariance matrix E_{s∼μ_π, a∼π(·|s)}[φ(s, a) φ(s, a)^⊤] is bounded uniformly away from zero; such an assumption is commonly made in the literature on policy evaluation with linear function approximation, e.g., [4,37]. Finally, Assumption 7 postulates that ρ⋆ is regular, in the sense that the density ratio between ϱ(s, a, π) and ρ⋆ has uniformly bounded second-order moments under ρ⋆. Such an assumption is closely related to the concentratability coefficient [1,2,43], which characterizes the distribution shift incurred by policy updates and is conjectured essential for the sample complexity analysis of reinforcement learning methods [11]. Assumption 7 is satisfied if the initial distribution ρ₀ has lower-bounded density over S × A [1].

To state our main convergence results, let us define the quantities of interest:

Δ_Q^{k+1} := E[‖θ^{k+1} − θ⋆(π^k)‖₂²],  OPT_k := E[ℓ(π^k) − ℓ(π⋆)],    (44)


where the expectations above are taken with respect to the i.i.d. draws of state-action pairs in (40) for TT-NAC. We remark that Δ_Q^k, analogous to Δ_y^k used in TTSA, is the tracking error that characterizes the performance of TD learning when the target value function, Q^{π^k}, is time-varying due to policy updates. We obtain:

Theorem 3. Consider the TT-NAC algorithm (40)-(42) for the policy optimization problem (39). Let K_max ≥ 32² be the maximum number of iterations. Under Assumptions 4-7, set the step sizes as

α = ( (1 − γ)³ μ_φ / √(r̄ C_ρ²) ) min{ (1 − γ)² μ_φ²/128, K_max^{−3/4} },  β = min{ (1 − γ) μ_φ²/8, (16/((1 − γ) μ_φ²)) K_max^{−1/2} }.    (45)

Then the following holds:

E[OPT_K] = O(K_max^{−1/4}),  E[Δ_Q^{K+1}] = O(K_max^{−1/2}),    (46)

where K is an independent random variable uniformly distributed over {0, ..., K_max − 1}.

To shed light on our analysis, we first observe the following performance difference lemma, proven in [29, Lemma 6.1]:

ℓ(π) − ℓ(π⋆) = (1 − γ)^{−1} ⟨Q^π, π⋆ − π⟩_{ρ⋆},  ∀ π ∈ X,    (47)

where π⋆ is an optimal policy solving (39). The above implies a restricted form of convexity, and our analysis uses the insight that (47) plays a role similar to (2) [with μ_ℓ ≥ 0] and characterizes the loss geometry of the outer problem.

Our result shows that the TT-NAC algorithm finds an optimal policy at the rate of O(K^{−1/4}) in terms of the objective value. This rate is comparable to that of another variant of the TT-NAC algorithm in [61], which provided a customized analysis for TT-NAC. In contrast, the analysis of our TT-NAC algorithm is rooted in the general TTSA framework developed in §3.3 for tackling bilevel optimization problems. Notice that an analysis of a two-timescale actor-critic algorithm can also be found in [59], which provides an O(K^{−2/5}) convergence rate to a stationary solution.

5 Numerical Experiments

We consider the data hyper-cleaning task (16) and compare TTSA with several algorithms: the BSA algorithm [23], stocBiO [27] with different batch size choices, and the HOAG algorithm of [44]. Note that HOAG is a deterministic algorithm and requires a full gradient computation at each iteration. In contrast, stocBiO is a stochastic algorithm, but it relies on large-batch gradient computations.

We consider problem (16) with L(·) being the cross-entropy loss (i.e., a data cleaning problem for logistic regression), σ(x) := 1/(1 + exp(−x)), and c = 0.001; see [50]. The problem is trained on the FashionMNIST dataset [60] with 50k, 10k, and 10k image samples allocated for training, validation, and testing, respectively. We consider the setting where each sample in the training dataset is corrupted with probability 0.4. Note that the outer problem ℓ(x) is non-convex while the lower-level problem is strongly convex. The simulation results are averaged over 3 independent runs. The step sizes for the different algorithms are chosen according to their theoretically suggested values. Let the outer iteration be indexed by t. For TTSA we choose α_t = c_α/(1 + t)^{3/5}, β_t = c_β/(1 + t)^{2/5}, and tune c_α and c_β over the set {10^{−3}, 10^{−2}, 10^{−1}, 10}. For BSA [23], we index the outer iteration by t and the inner iteration by k ∈ {1, ..., k_t}. We set k_t = ⌈√(t + 1)⌉ as suggested in [23] and choose the outer and inner step sizes α_t and β_k, respectively, as α_t = d_α/(1 + t)^{1/2} and β_k = d_β/(k + 2). We tune d_α and d_β over the set {10^{−3}, 10^{−2}, 10^{−1}, 10}. Finally, for stocBiO we tune the parameters α_t and β_t in the range [0, 1].

Figure 1: Data hyper-cleaning task on the FashionMNIST dataset. We plot the training loss and testing accuracy against the number of gradients evaluated, with corruption rate p = 0.4.

In Figure 1, we compare the performance of the different algorithms against the total number of outer samples accessed. As observed, TTSA outperforms BSA, stocBiO, and HOAG. We remark that HOAG is a deterministic algorithm and hence requires full-batch gradient computations at each iteration. Similarly, stocBiO relies on large-batch gradients, which results in relatively slow convergence.

6 Conclusion

This paper develops efficient two-timescale stochastic approximation algorithms for a class of bilevel optimization problems where the inner problem is unconstrained and strongly convex. We show convergence rates for the proposed TTSA algorithm in the settings where the outer objective function is either strongly convex, convex, or non-convex. Additionally, we show how our theory and analysis can be customized to a two-timescale actor-critic proximal policy optimization algorithm in reinforcement learning, obtaining a convergence rate comparable to the existing literature.

A Omitted Proofs of Theorem 1

To simplify notation, for any n, m ∈ N, we define the following quantities:

G_{m:n}^{(1)} = ∏_{i=m}^{n} (1 − β_i μ_g/4),  G_{m:n}^{(2)} = ∏_{i=m}^{n} (1 − α_i μ_ℓ).    (48)

A.1 Proof of Lemma 3

Following a direct expansion of the updating rule for y^{k+1} and taking the conditional expectation given the filtration F_k yields

E[‖y^{k+1} − y⋆(x^k)‖² | F_k] ≤ (1 − 2β_k μ_g) ‖y^k − y⋆(x^k)‖² + β_k² E[‖h_g^k‖² | F_k],

where we used the unbiasedness of h_g^k [cf. Assumption 3] and the strong convexity of g. By direct computation and (7b) in Assumption 3, we have

E[‖h_g^k‖² | F_k] = E[‖h_g^k − ∇_y g(x^k, y^k)‖² | F_k] + ‖∇_y g(x^k, y^k)‖²
  ≤ σ_g² + (1 + σ_g²) ‖∇_y g(x^k, y^k)‖² ≤ σ_g² + (1 + σ_g²) L_g² ‖y^k − y⋆(x^k)‖²,    (49)

where the last inequality uses Assumption 2 and the optimality of the inner problem, ∇_y g(x^k, y⋆(x^k)) = 0. As β_k L_g²(1 + σ_g²) ≤ μ_g, we have

E[‖y^{k+1} − y⋆(x^k)‖² | F_k] ≤ (1 − β_k μ_g) ‖y^k − y⋆(x^k)‖² + β_k² σ_g².    (50)

Using the basic inequality 2ab ≤ a²/c + c b², valid for all c > 0 and a, b ∈ R, we have

‖y^k − y⋆(x^k)‖² ≤ (1 + 1/c) ‖y^k − y⋆(x^{k−1})‖² + (1 + c) ‖y⋆(x^{k−1}) − y⋆(x^k)‖².    (51)

Note that we have taken the convention x^{−1} = x⁰. Furthermore, we observe that

‖y⋆(x^{k−1}) − y⋆(x^k)‖² ≤ L_y² ‖x^k − x^{k−1}‖² ≤ α_{k−1}² L_y² ‖h_f^{k−1}‖²,    (52)

where the first inequality follows from Lemma 2, and the second inequality follows from the non-expansiveness of the projection. We set h_f^{−1} = 0 as a convention.

Setting c = 2(1 − β_k μ_g)/(β_k μ_g), we have (1 + 1/c)(1 − β_k μ_g) = 1 − (μ_g/2) β_k. Substituting this choice of c into (51) and combining with (50) shows that

E[‖y^{k+1} − y⋆(x^k)‖² | F_k] ≤ (1 − β_k μ_g/2) ‖y^k − y⋆(x^{k−1})‖² + β_k² σ_g² + ((2 − μ_g β_k)/(μ_g β_k)) α_{k−1}² L_y² ‖h_f^{k−1}‖².    (53)

Taking the total expectation and using (14), we have

∆k+1y ≤

(1− βkµg/2

)·∆k

y + β2kσ

2g +

2− µgβkµgβk

α2k−1L

2y

[σ2f + 3b2k−1 + 3L2∆k

y

],

with the convention α−1 = 0. Using αk−1 ≤ 2αk, αk ≤ c0β3/2k , we have

∆k+1y ≤

[1− βkµg/2 +

12c20L

2yL

2

µgβ2k

]·∆k

y + β2kσ

2g +

4c20L

2y

µgβ2k ·[σ2f + 3b20

]≤[1− βkµg/4

]·∆k

y + β2kσ

2g +

4c20L

2y

µgβ2k ·[σ2f + 3b20

],

where the last inequality is due to (20a). Solving the recursion leads to

∆k+1y ≤ G(1)

0:k ∆0y +

∑kj=0 β

2jG

(1)j+1:k

σ2g +

4c20L2y

µg

[σ2f + 3b20

]. (54)

Since βk−1/βk ≤ 1 + βk · (µg/8), applying Lemma 10 to βkk≥0 with a = µg/4 and q = 2, we have

20

Page 21: A Two-Timescale Stochastic Algorithm Framework for Bilevel ...

∑kj=0 β

2jG

(1)j+1:k ≤

8βkµg

. Finally, we can simplify (54) as

∆k+1y ≤ G(1)

0:k ∆0y + C(1)

y βk, where C(1)y :=

8

µg

σ2g +

4c20L

2y

µg

[σ2f + 3b20

]. (55)
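As an illustration (not part of the proof), iterating the scalar recursion in the penultimate display with placeholder constants confirms the $O(\beta_k)$ behavior asserted in (55):

import numpy as np

# Placeholder constants; only the qualitative O(beta_k) behavior is of interest.
mu_g, noise = 1.0, 2.0        # noise stands in for sigma_g^2 plus the bracketed constant
delta = 10.0                  # Delta_y^0
ratios = []
for k in range(10**5):
    beta = 1.0 / (k + 200) ** (2 / 3)   # this offset schedule satisfies beta_{k-1}/beta_k <= 1 + beta_k*mu_g/8
    delta = (1 - beta * mu_g / 4) * delta + beta**2 * noise
    ratios.append(delta / beta)
print(max(ratios[-100:]))     # stays bounded, consistent with Delta_y^{k+1} <= G_{0:k} Delta_y^0 + C_y^(1) beta_k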

A.2 Proof of Lemma 4

Due to the projection property, we get

\[ \|x^{k+1} - x^\star\|^2 \le \|x^k - \alpha_k h_f^k - x^\star\|^2 = \|x^k - x^\star\|^2 - 2\alpha_k \langle h_f^k, x^k - x^\star\rangle + \alpha_k^2 \|h_f^k\|^2. \]

Taking the conditional expectation given $\mathcal{F}_k'$ gives

\[ \mathbb{E}[\|x^{k+1} - x^\star\|^2 \,|\, \mathcal{F}_k'] \le \|x^k - x^\star\|^2 - 2\alpha_k \langle \nabla\ell(x^k), x^k - x^\star\rangle + \alpha_k^2\, \mathbb{E}[\|h_f^k\|^2 \,|\, \mathcal{F}_k'] - 2\alpha_k \langle \nabla_x f(x^k, y^{k+1}) - \nabla\ell(x^k) + B_k,\, x^k - x^\star\rangle, \tag{56} \]

where the inequality follows from (7a). Since strong convexity implies $\langle \nabla\ell(x^k), x^k - x^\star\rangle \ge \mu_\ell \|x^k - x^\star\|^2$, we further bound the r.h.s. of (56) via

\begin{align*}
\mathbb{E}[\|x^{k+1} - x^\star\|^2 \,|\, \mathcal{F}_k'] &\le \big(1 - 2\alpha_k\mu_\ell\big)\, \|x^k - x^\star\|^2 - 2\alpha_k \langle \nabla_x f(x^k, y^{k+1}) - \nabla\ell(x^k) + B_k,\, x^k - x^\star\rangle + \alpha_k^2\, \mathbb{E}[\|h_f^k\|^2 \,|\, \mathcal{F}_k'] \\
&\le \big(1 - \alpha_k\mu_\ell\big)\, \|x^k - x^\star\|^2 + \frac{\alpha_k}{\mu_\ell}\, \|\nabla_x f(x^k, y^{k+1}) - \nabla\ell(x^k) + B_k\|^2 + \alpha_k^2\, \mathbb{E}[\|h_f^k\|^2 \,|\, \mathcal{F}_k'] \\
&\le \big(1 - \alpha_k\mu_\ell\big)\, \|x^k - x^\star\|^2 + (2\alpha_k/\mu_\ell)\, \big\{ L^2 \|y^{k+1} - y^\star(x^k)\|^2 + b_k^2 \big\} + \alpha_k^2\, \mathbb{E}[\|h_f^k\|^2 \,|\, \mathcal{F}_k'],
\end{align*}

where the last inequality is from Lemma 2. Using (14) and taking the total expectation:

\begin{align*}
\Delta_x^{k+1} &\le \big[1 - \alpha_k\mu_\ell\big]\, \Delta_x^k + \big[2\alpha_k/\mu_\ell\big]\, L^2\, \Delta_y^{k+1} + 2\alpha_k b_k^2/\mu_\ell + \alpha_k^2\, \big[\sigma_f^2 + 3 b_0^2 + 3 L^2 \Delta_y^{k+1}\big] \\
&\le \big[1 - \alpha_k\mu_\ell\big]\, \Delta_x^k + \big[2\alpha_k/\mu_\ell + 3\alpha_k^2\big]\, L^2\, \Delta_y^{k+1} + \alpha_k^2\, \big[2 c_b/\mu_\ell + \sigma_f^2 + 3 b_0^2\big],
\end{align*}

where we have used $b_k^2 \le c_b \alpha_k$. Solving the recursion above leads to

\begin{align*}
\Delta_x^{k+1} &\le G^{(2)}_{0:k}\, \Delta_x^0 + \sum_{j=0}^{k} \Big\{ \Big[\frac{2 c_b}{\mu_\ell} + \sigma_f^2 + 3 b_0^2\Big]\, \alpha_j^2\, G^{(2)}_{j+1:k} + \Big[\frac{2 L^2}{\mu_\ell} + 3\alpha_0 L^2\Big]\, \alpha_j\, G^{(2)}_{j+1:k}\, \Delta_y^{j+1} \Big\} \\
&\le G^{(2)}_{0:k}\, \Delta_x^0 + \frac{2}{\mu_\ell}\, \Big[\frac{2 c_b}{\mu_\ell} + \sigma_f^2 + 3 b_0^2\Big]\, \alpha_k + \Big(\frac{2 L^2}{\mu_\ell} + 3\alpha_0 L^2\Big) \sum_{j=0}^{k} \alpha_j\, G^{(2)}_{j+1:k}\, \Delta_y^{j+1}.
\end{align*}

The last inequality follows from applying Lemma 10 with $q = 2$ and $a = \mu_\ell$.

A.3 Bounding $\Delta_x^k$ by coupling with $\Delta_y^k$

Using (55), we observe that

\[ \sum_{j=0}^{k} \alpha_j\, G^{(2)}_{j+1:k}\, \Delta_y^{j+1} \le \sum_{j=0}^{k} \alpha_j\, G^{(2)}_{j+1:k}\, \Big\{ G^{(1)}_{0:j}\, \Delta_y^0 + C_y^{(1)}\, \beta_j \Big\}. \tag{57} \]

We bound each term on the right-hand side of (57). For the first term, as $\mu_\ell \alpha_i \le \mu_g \beta_i / 8$ [cf. (20b)], applying Lemma 11 with $a = \mu_\ell$, $b = \mu_g/4$, $\gamma_i = \alpha_i$, $\rho_i = \beta_i$ gives

\[ \sum_{j=0}^{k} \alpha_j\, G^{(2)}_{j+1:k}\, G^{(1)}_{0:j}\, \Delta_y^0 \le \frac{1}{\mu_\ell}\, G^{(2)}_{0:k}\, \Delta_y^0. \tag{58} \]

Recall that $\beta_j \le c_1 \alpha_j^{2/3}$. Applying Lemma 10 with $q = 5/3$, $a = \mu_\ell$ yields

\[ \sum_{j=0}^{k} \alpha_j \beta_j\, G^{(2)}_{j+1:k} \le c_1 \sum_{j=0}^{k} \alpha_j^{5/3}\, G^{(2)}_{j+1:k} \le \frac{2 c_1}{\mu_\ell}\, \alpha_k^{2/3}. \tag{59} \]

We obtain a bound on the optimality gap as

\[ \Delta_x^{k+1} \le G^{(2)}_{0:k}\, \Big\{ \Delta_x^0 + \Big[\frac{2 L^2}{\mu_\ell^2} + \frac{3\alpha_0 L^2}{\mu_\ell}\Big]\, \Delta_y^0 \Big\} + \frac{2}{\mu_\ell}\, \Big[\frac{2 c_b}{\mu_\ell} + \sigma_f^2 + 3 b_0^2\Big]\, \alpha_k + \frac{2 c_1}{\mu_\ell}\, \Big[\frac{2 L^2}{\mu_\ell} + 3\alpha_0 L^2\Big]\, C_y^{(1)}\, \alpha_k^{2/3}. \]

To simplify the notation, we define the constants

\[ C_x^{(0)} = \Delta_x^0 + \Big[\frac{2 L^2}{\mu_\ell^2} + \frac{3\alpha_0 L^2}{\mu_\ell}\Big]\, \Delta_y^0, \qquad C_x^{(1)} = \frac{2}{\mu_\ell}\, \Big[\frac{2 c_b}{\mu_\ell} + \sigma_f^2 + 3 b_0^2\Big] + \frac{2 c_1}{\mu_\ell}\, \Big[\frac{2 L^2}{\mu_\ell} + 3\alpha_0 L^2\Big]\, C_y^{(1)}. \]

Then, as long as $\alpha_k < 1/\mu_\ell$ and we use the step size parameters in (22), we have

\[ \Delta_x^{k+1} \le G^{(2)}_{0:k}\, C_x^{(0)} + C_x^{(1)}\, \alpha_k^{2/3} = O\Big( \Big[\frac{L^2}{\mu_\ell^2 \mu_g^2} + \frac{L^2 L_y^2}{\mu_\ell^4}\Big]\, \frac{\tilde{\sigma}_f^2}{k^{2/3}} + \frac{L^2}{\mu_\ell^2 \mu_g}\, \frac{\sigma_g^2}{k^{2/3}} \Big), \]

and we recall that $\tilde{\sigma}_f^2 = \sigma_f^2 + 3 \sup_{x \in X} \|\nabla\ell(x)\|^2$.

B Omitted Proofs of Theorem 2 and Corollary 1

B.1 Proof of Lemma 5

We observe that summing the first and the second inequalities in (31) from $k = 0$ to $k = K - 1$ gives:

\[ c_0 \sum_{k=1}^{K} \Theta_k \le \Omega_0 + c_1 \sum_{k=1}^{K} \Upsilon_k + c_2\, K, \tag{60} \]
\[ d_0 \sum_{k=1}^{K} \Upsilon_k \le \Upsilon_1 + d_1 \sum_{k=1}^{K} \Theta_k + d_2\, K. \tag{61} \]

Substituting (60) into (61) gives

\[ d_0 \sum_{k=1}^{K} \Upsilon_k \le \Upsilon_1 + d_2\, K + \frac{d_1}{c_0}\, \Big[ \Omega_0 + c_1 \sum_{k=1}^{K} \Upsilon_k + c_2\, K \Big]. \tag{62} \]

Therefore, if $d_0 - d_1 c_1/c_0 > 0$, a simple computation (made explicit below) yields the second inequality in (32). Similarly, we substitute (61) into (60) to yield

\[ c_0 \sum_{k=1}^{K} \Theta_k \le \Omega_0 + c_2\, K + \frac{c_1}{d_0}\, \Big[ \Upsilon_1 + d_1 \sum_{k=1}^{K} \Theta_k + d_2\, K \Big]. \tag{63} \]

Under $c_0 - c_1 d_1/d_0 > 0$, the same computation yields the first inequality in (32).
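For completeness, the "simple computation" is the following rearrangement of (62); the first inequality in (32) follows symmetrically from (63):

\[ \Big( d_0 - \frac{d_1 c_1}{c_0} \Big) \sum_{k=1}^{K} \Upsilon_k \le \Upsilon_1 + \frac{d_1}{c_0}\, \Omega_0 + \Big( d_2 + \frac{d_1 c_2}{c_0} \Big) K \quad\Longrightarrow\quad \sum_{k=1}^{K} \Upsilon_k \le \Big( d_0 - \frac{d_1 c_1}{c_0} \Big)^{-1} \Big[ \Upsilon_1 + \frac{d_1}{c_0}\, \Omega_0 + \Big( d_2 + \frac{d_1 c_2}{c_0} \Big) K \Big]. \]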

B.2 Proof of Lemma 6

Recall that we defined $\mathrm{OPT}_k := \mathbb{E}[\ell(x^k) - \ell(x^\star)]$ for each $k \ge 0$. To begin with, we have the following descent estimate

\[ \ell(x^{k+1}) \le \ell(x^k) + \langle \nabla\ell(x^k), x^{k+1} - x^k\rangle + (L_f/2)\, \|x^{k+1} - x^k\|^2. \tag{64} \]

The optimality condition of step (4b) leads to the following bound

\[ \langle \nabla\ell(x^k), x^{k+1} - x^k\rangle \le \langle \nabla\ell(x^k) - \nabla_x f(x^k, y^{k+1}) - B_k,\, x^{k+1} - x^k\rangle + \langle B_k + \nabla_x f(x^k, y^{k+1}) - h_f^k,\, x^{k+1} - x^k\rangle - \frac{1}{\alpha}\, \|x^{k+1} - x^k\|^2, \]

where we obtained the inequality by adding and subtracting $B_k + \nabla_x f(x^k, y^{k+1}) - h_f^k$. Then, taking the conditional expectation given $\mathcal{F}_k'$, for any $c, d > 0$ we obtain

\begin{align*}
\mathbb{E}[\langle \nabla\ell(x^k), x^{k+1} - x^k\rangle \,|\, \mathcal{F}_k'] &\le \mathbb{E}\big[\|\nabla\ell(x^k) - \nabla_x f(x^k, y^{k+1}) - B_k\|\cdot\|x^{k+1} - x^k\| \,\big|\, \mathcal{F}_k'\big] + \mathbb{E}\big[\|B_k + \nabla_x f(x^k, y^{k+1}) - h_f^k\|\,\|x^{k+1} - x^k\| \,\big|\, \mathcal{F}_k'\big] - \frac{1}{\alpha}\, \mathbb{E}[\|x^{k+1} - x^k\|^2 \,|\, \mathcal{F}_k'] \\
&\le \frac{1}{2c}\, \mathbb{E}[\|\nabla\ell(x^k) - \nabla_x f(x^k, y^{k+1}) - B_k\|^2 \,|\, \mathcal{F}_k'] + \frac{c}{2}\, \mathbb{E}[\|x^{k+1} - x^k\|^2 \,|\, \mathcal{F}_k'] + \frac{\sigma_f^2}{2d} + \frac{d}{2}\, \mathbb{E}[\|x^{k+1} - x^k\|^2 \,|\, \mathcal{F}_k'] - \frac{1}{\alpha}\, \mathbb{E}[\|x^{k+1} - x^k\|^2 \,|\, \mathcal{F}_k'],
\end{align*}

where the second inequality follows from Young's inequality and Assumption 3. Simplifying the terms above leads to

\[ \mathbb{E}[\langle \nabla\ell(x^k), x^{k+1} - x^k\rangle \,|\, \mathcal{F}_k'] \le \frac{1}{2c}\, \mathbb{E}[\|\nabla\ell(x^k) - \nabla_x f(x^k, y^{k+1}) - B_k\|^2 \,|\, \mathcal{F}_k'] + \frac{\sigma_f^2}{2d} + \Big(\frac{c + d}{2} - \frac{1}{\alpha}\Big)\, \mathbb{E}[\|x^{k+1} - x^k\|^2 \,|\, \mathcal{F}_k']. \]

Setting $d = c = \frac{1}{2\alpha}$, plugging the above into (64), and taking the full expectation:

\[ \mathrm{OPT}_{k+1} \le \mathrm{OPT}_k - \Big(\frac{1}{2\alpha} - \frac{L_f}{2}\Big)\, \mathbb{E}[\|x^{k+1} - x^k\|^2] + \alpha\, \Delta^{k+1} + \alpha\, \sigma_f^2, \tag{65} \]

where we have denoted

\[ \Delta^{k+1} := \mathbb{E}[\|\nabla_x f(x^k, y^{k+1}) - \nabla\ell(x^k) - B_k\|^2] \overset{(12a)}{\le} 2 L^2\, \mathbb{E}[\|y^{k+1} - y^\star(x^k)\|^2] + 2 b_k^2, \]

where the last inequality follows from Lemma 2 and (7a) in Assumption 3. Next, following the standard SGD analysis [cf. (50)] and using $\beta \le \mu_g/(L_g^2 (1+\sigma_g^2))$, we have

\begin{align*}
\mathbb{E}[\|y^{k+1} - y^\star(x^k)\|^2 \,|\, \mathcal{F}_k] &\le (1 - \mu_g\beta)\, \mathbb{E}[\|y^k - y^\star(x^k)\|^2 \,|\, \mathcal{F}_k] + \beta^2 \sigma_g^2 \\
&\le (1 + c)(1 - \mu_g\beta)\, \mathbb{E}[\|y^k - y^\star(x^{k-1})\|^2 \,|\, \mathcal{F}_k] + (1 + 1/c)\, \mathbb{E}[\|y^\star(x^k) - y^\star(x^{k-1})\|^2 \,|\, \mathcal{F}_k] + \beta^2 \sigma_g^2 \\
&\le \big(1 - \mu_g\beta/2\big)\, \mathbb{E}[\|y^k - y^\star(x^{k-1})\|^2 \,|\, \mathcal{F}_k] + \Big(\frac{2}{\mu_g\beta} - 1\Big)\, \mathbb{E}[\|y^\star(x^k) - y^\star(x^{k-1})\|^2 \,|\, \mathcal{F}_k] + \beta^2 \sigma_g^2 \\
&\le \big(1 - \mu_g\beta/2\big)\, \mathbb{E}[\|y^k - y^\star(x^{k-1})\|^2 \,|\, \mathcal{F}_k] + \Big(\frac{2}{\mu_g\beta} - 1\Big)\, L_y^2\, \mathbb{E}[\|x^k - x^{k-1}\|^2 \,|\, \mathcal{F}_k] + \beta^2 \sigma_g^2,
\end{align*}

where the last inequality is due to the Lipschitz continuity property (12a) and $\mu_g\beta < 1$. Furthermore, we have picked $c = \mu_g\beta \cdot [2(1 - \mu_g\beta)]^{-1}$, so that

\[ (1 + c)(1 - \mu_g\beta) = 1 - \mu_g\beta/2, \qquad 1/c + 1 = 2/(\mu_g\beta) - 1. \]

Taking a full expectation on both sides leads to the desired result.


B.3 Proof of Lemma 7

For simplicity, we let $\hat{x}^{k+1}$ and $\hat{x}^k$ denote $\hat{x}(x^{k+1})$ and $\hat{x}(x^k)$, respectively. For any $x \in X$, letting $x_1 = \hat{x}(x)$ and $x_2 = x$ in (2), we get

\[ \ell(\hat{x}(x)) \ge \ell(x) + \langle \nabla\ell(x), \hat{x}(x) - x\rangle + \frac{\mu_\ell}{2}\, \|\hat{x}(x) - x\|^2. \tag{66} \]

Moreover, by the definition of $\hat{x}(x)$, for any $x \in X$ we have

\[ \ell(x) + \frac{\rho}{2}\, \|x - x\|^2 - \Big[\ell(\hat{x}(x)) + \frac{\rho}{2}\, \|\hat{x}(x) - x\|^2\Big] = \ell(x) - \Big[\ell(\hat{x}(x)) + \frac{\rho}{2}\, \|\hat{x}(x) - x\|^2\Big] \ge 0. \tag{67} \]

Adding the two inequalities above, we obtain

\[ -\frac{\mu_\ell + \rho}{2}\, \|\hat{x}(x) - x\|^2 \ge \langle \nabla\ell(x), \hat{x}(x) - x\rangle. \tag{68} \]

Note that we choose $\rho$ such that $\rho + \mu_\ell > 0$. To proceed, combining the definitions of the Moreau envelope and $\hat{x}$ in (18), for $x^{k+1}$ we have

\begin{align}
\Phi_{1/\rho}(x^{k+1}) &\overset{(18)}{=} \ell(\hat{x}^{k+1}) + \frac{\rho}{2}\, \|\hat{x}^{k+1} - x^{k+1}\|^2 \le \ell(\hat{x}^k) + \frac{\rho}{2}\, \|\hat{x}^k - x^{k+1}\|^2 \notag\\
&\le \ell(\hat{x}^k) + \frac{\rho}{2}\, \|\hat{x}^k - x^k\|^2 + \frac{\rho}{2}\, \|x^{k+1} - x^k\|^2 + \rho\alpha\, \langle \hat{x}^k - x^k, h_f^k\rangle + \alpha\rho\, \langle h_f^k, x^k - x^{k+1}\rangle + \rho\, \|x^{k+1} - x^k\|^2 \notag\\
&\overset{(18)}{\le} \Phi_{1/\rho}(x^k) + \frac{5\rho}{2}\, \|x^{k+1} - x^k\|^2 + \rho\alpha\, \langle \hat{x}^k - x^k, h_f^k\rangle + \alpha^2\rho\, \|h_f^k\|^2, \tag{69}
\end{align}

where the first equality and the first inequality follow from the optimality of $\hat{x}^{k+1} = \hat{x}(x^{k+1})$, and the second inequality follows from the optimality condition of the update (4b). For any $x^\star$ that is a globally optimal solution of the original problem $\min_{x \in X} \ell(x)$, we must have

\[ \Phi_{1/\rho}(x^\star) = \min_{x \in X}\, \Big\{ \ell(x) + \frac{\rho}{2}\, \|x - x^\star\|^2 \Big\} = \ell(x^\star), \]

where the last equality holds because

\[ \Phi_{1/\rho}(x^\star) = \min_{x \in X}\, \Big\{ \ell(x) + \frac{\rho}{2}\, \|x - x^\star\|^2 \Big\} \le \ell(x^\star) + \frac{\rho}{2}\, \|x^\star - x^\star\|^2 = \ell(x^\star), \tag{70} \]
\[ \Phi_{1/\rho}(z) = \min_{x \in X}\, \Big\{ \ell(x) + \frac{\rho}{2}\, \|x - z\|^2 \Big\} \ge \min_{x \in X}\, \ell(x) = \ell(x^\star), \quad \forall\, z \in X. \tag{71} \]

Taking the expectation of $\langle \hat{x}^k - x^k, h_f^k\rangle$ conditioned on $\mathcal{F}_k'$, we have:

\begin{align}
\mathbb{E}[\langle \hat{x}^k - x^k, h_f^k\rangle \,|\, \mathcal{F}_k'] &= \mathbb{E}[\langle \hat{x}^k - x^k,\, h_f^k - \nabla_x f(x^k, y^{k+1}) + \nabla_x f(x^k, y^{k+1}) - \nabla\ell(x^k) + \nabla\ell(x^k)\rangle \,|\, \mathcal{F}_k'] \notag\\
&= \langle \hat{x}^k - x^k, B_k\rangle + \mathbb{E}[\langle \hat{x}^k - x^k,\, \nabla_x f(x^k, y^{k+1}) - \nabla\ell(x^k)\rangle + \langle \hat{x}^k - x^k, \nabla\ell(x^k)\rangle \,|\, \mathcal{F}_k'], \tag{72}
\end{align}

where the second equality follows from (7a) in Assumption 3. By Young's inequality, for any $c > 0$ we have

\[ \langle \hat{x}^k - x^k, B_k\rangle \le \frac{c}{4}\, \|\hat{x}^k - x^k\|^2 + \frac{1}{c}\, b_k^2, \tag{73} \]
\[ \mathbb{E}[\langle \hat{x}^k - x^k,\, \nabla_x f(x^k, y^{k+1}) - \nabla\ell(x^k)\rangle \,|\, \mathcal{F}_k'] \le \frac{1}{c}\, \mathbb{E}[\|\nabla_x f(x^k, y^{k+1}) - \nabla\ell(x^k)\|^2 \,|\, \mathcal{F}_k'] + \frac{c}{4}\, \|\hat{x}^k - x^k\|^2, \]


where we also use (7a) in deriving (73). Combining (68), (72), (73), and setting $c = (\rho + \mu_\ell)/2$, we obtain

\begin{align}
\mathbb{E}[\langle \hat{x}^k - x^k, h_f^k\rangle \,|\, \mathcal{F}_k'] &\le \frac{c}{2}\, \|\hat{x}^k - x^k\|^2 + \frac{1}{c}\, b_k^2 + \frac{1}{c}\, \mathbb{E}[\|\nabla_x f(x^k, y^{k+1}) - \nabla\ell(x^k)\|^2 \,|\, \mathcal{F}_k'] - \frac{\rho + \mu_\ell}{2}\, \|\hat{x}^k - x^k\|^2 \tag{74}\\
&= \frac{2}{\rho + \mu_\ell}\, \mathbb{E}[\|\nabla_x f(x^k, y^{k+1}) - \nabla\ell(x^k)\|^2 \,|\, \mathcal{F}_k'] - \frac{\rho + \mu_\ell}{4}\, \|\hat{x}^k - x^k\|^2 + \frac{2}{\rho + \mu_\ell}\, b_k^2 \notag\\
&\le \frac{2 L^2}{\rho + \mu_\ell}\, \mathbb{E}[\|y^{k+1} - y^\star(x^k)\|^2 \,|\, \mathcal{F}_k'] - \frac{\rho + \mu_\ell}{4}\, \|\hat{x}^k - x^k\|^2 + \frac{2}{\rho + \mu_\ell}\, b_k^2, \notag
\end{align}

where the last step follows from the first inequality of Lemma 2. Plugging the above into (69) and taking a full expectation, we obtain

\begin{align*}
\mathbb{E}[\Phi_{1/\rho}(x^{k+1})] - \mathbb{E}[\Phi_{1/\rho}(x^k)] &\le \frac{5\rho}{2}\, \mathbb{E}[\|x^{k+1} - x^k\|^2] + \frac{2\rho\alpha L^2}{\rho + \mu_\ell}\, \Delta_y^{k+1} - \frac{(\rho + \mu_\ell)\rho\alpha}{4}\, \mathbb{E}[\|\hat{x}^k - x^k\|^2] + \frac{2\rho\alpha b_k^2}{\rho + \mu_\ell} + \alpha^2\rho\, \mathbb{E}[\|h_f^k\|^2] \\
&\le \frac{5\rho}{2}\, \mathbb{E}[\|x^{k+1} - x^k\|^2] + \frac{2\rho\alpha L^2}{\rho + \mu_\ell}\, \Delta_y^{k+1} - \frac{(\rho + \mu_\ell)\rho\alpha}{4}\, \mathbb{E}[\|\hat{x}^k - x^k\|^2] + \frac{2\rho\alpha b_k^2}{\rho + \mu_\ell} + \alpha^2\rho\, \big(\sigma_f^2 + 3 b_k^2 + 3 L^2 \Delta_y^{k+1}\big) \\
&\le \frac{5\rho}{2}\, \mathbb{E}[\|x^{k+1} - x^k\|^2] + \Big[\frac{2\alpha\rho L^2}{\rho + \mu_\ell} + 3\alpha^2\rho L^2\Big]\, \Delta_y^{k+1} - \frac{(\rho + \mu_\ell)\rho\alpha}{4}\, \mathbb{E}[\|\hat{x}^k - x^k\|^2] + \Big[\frac{2\rho}{\rho + \mu_\ell} + \rho\, \big(\sigma_f^2 + 3 b_0^2\big)\Big]\, \alpha^2,
\end{align*}

where the last inequality is due to the assumption $b_k^2 \le \alpha$.

B.4 Proof of Corollary 1

Our proof departs from that of Theorem 2 through manipulating the descent estimate (64) in an alternative way. The key is to observe the following three-point inequality [3]:

\[ \langle h_f^k, x^{k+1} - x^\star\rangle \le \frac{1}{2\alpha}\, \Big\{ \|x^\star - x^k\|^2 - \|x^\star - x^{k+1}\|^2 - \|x^k - x^{k+1}\|^2 \Big\}, \tag{75} \]

where $x^\star$ is an optimal solution to (1). Observe that

\[ \langle \nabla\ell(x^k), x^{k+1} - x^k\rangle = \langle \nabla\ell(x^k) - h_f^k,\, x^{k+1} - x^\star\rangle + \langle h_f^k,\, x^{k+1} - x^\star\rangle + \langle \nabla\ell(x^k),\, x^\star - x^k\rangle. \tag{76} \]

Notice that due to the convexity of $\ell(x)$, we have $\langle \nabla\ell(x^k), x^\star - x^k\rangle \le -\mathrm{OPT}_k$. Furthermore,

\begin{align*}
\langle \nabla\ell(x^k) - h_f^k,\, x^{k+1} - x^\star\rangle &= \langle \nabla\ell(x^k) - h_f^k + B_k + \nabla_x f(x^k, y^{k+1}) - B_k - \nabla_x f(x^k, y^{k+1}),\, x^{k+1} - x^\star\rangle \\
&\le D_x\, \big\{ b_k + L\, \|y^{k+1} - y^\star(x^k)\| \big\} + \langle B_k + \nabla_x f(x^k, y^{k+1}) - h_f^k,\, x^{k+1} - x^k + x^k - x^\star\rangle.
\end{align*}

We notice that $\mathbb{E}[\langle B_k + \nabla_x f(x^k, y^{k+1}) - h_f^k,\, x^k - x^\star\rangle \,|\, \mathcal{F}_k'] = 0$. Thus, taking the total expectation on both sides and applying Young's inequality to the last inner product lead to

\[ \mathbb{E}[\langle \nabla\ell(x^k) - h_f^k,\, x^{k+1} - x^\star\rangle] \le D_x\, \big\{ b_k + L\, \mathbb{E}[\|y^{k+1} - y^\star(x^k)\|] \big\} + \frac{\alpha}{2}\, \sigma_f^2 + \frac{1}{2\alpha}\, \mathbb{E}[\|x^{k+1} - x^k\|^2]. \]

Substituting the above observations into (64) and using the three-point inequality (75) give

\[ \mathbb{E}[\ell(x^{k+1}) - \ell(x^k)] \le D_x\, \big\{ b_k + L\, \mathbb{E}[\|y^{k+1} - y^\star(x^k)\|] \big\} + \frac{1}{2\alpha}\, \mathbb{E}\big[ \|x^\star - x^k\|^2 - \|x^\star - x^{k+1}\|^2 \big] + \frac{\alpha}{2}\, \sigma_f^2 - \mathrm{OPT}_k + \frac{L_f}{2}\, \mathbb{E}[\|x^{k+1} - x^k\|^2]. \]

Summing up both sides from $k = 0$ to $k = K_{\max} - 1$ and dividing by $K_{\max}$ gives

\[ \frac{1}{K_{\max}} \sum_{k=1}^{K_{\max}} \mathrm{OPT}_k \le D_x\, b_0 + \frac{\alpha \sigma_f^2}{2} + \frac{\|x^\star - x^0\|^2}{2\alpha K_{\max}} + \frac{D_x L}{K_{\max}} \sum_{k=1}^{K_{\max}} \mathbb{E}[\|y^k - y^\star(x^{k-1})\|] + \frac{L_f}{2 K_{\max}} \sum_{k=1}^{K_{\max}} \mathbb{E}[\|x^k - x^{k-1}\|^2]. \]

Applying the Cauchy-Schwarz inequality and Lemmas 5 and 6 with $\alpha = O(K_{\max}^{-3/4})$, $\beta = O(K_{\max}^{-1/2})$ as in (26) shows that $\frac{1}{K_{\max}} \sum_{k=1}^{K_{\max}} \mathbb{E}[\|y^k - y^\star(x^{k-1})\|] \le \sqrt{\frac{1}{K_{\max}} \sum_{k=1}^{K_{\max}} \Delta_y^k} = O(K_{\max}^{-1/4})$; cf. (34). The proof is concluded.

C Proof of Theorem 3

Hereafter, we let $\langle \cdot, \cdot\rangle$ and $\|\cdot\|$ denote the inner product and the $\ell_1$-norm on $\mathbb{R}^{|A|}$, respectively. For any two policies $\pi_1$ and $\pi_2$ and any $s \in S$, $\|\pi_1(\cdot|s) - \pi_2(\cdot|s)\|_1$ is the total variation distance between $\pi_1(\cdot|s)$ and $\pi_2(\cdot|s)$. For any $f, f': S \times A \to \mathbb{R}$, define the following norms:

\[ \|f\|_{\rho,1} = \Big[ \sum_{s \in S} \|f(s,\cdot)\|_1^2\, \rho(s) \Big]^{1/2}, \qquad \|f\|_{\rho,\infty} = \Big[ \sum_{s \in S} \|f(s,\cdot)\|_\infty^2\, \rho(s) \Big]^{1/2}. \]

The following result can be derived from Hölder's inequality:

\[ \big| \langle f, f'\rangle_\rho \big| \le \sum_{s \in S} \big| \langle f(s,\cdot), f'(s,\cdot)\rangle \big|\, \rho(s) \le \|f\|_{\rho,1}\, \|f'\|_{\rho,\infty}. \tag{77} \]

Lastly, it can be shown that $\|\pi\|_{\rho,1} = 1$ and $\|\pi\|_{\rho,\infty} \le 1$ for any policy $\pi$.

Under Assumption 5, $\theta^\star(\pi)$ is the solution to the inner problem with $Q^\pi(\cdot,\cdot) = \phi(\cdot,\cdot)^\top \theta^\star(\pi)$. Below we first show that $\theta^\star(\pi)$ and $Q^\pi$ are Lipschitz continuous maps with respect to $\|\cdot\|_{\rho^\star,1}$, where $\rho^\star$ is the visitation measure of an optimal policy $\pi^\star$.

Lemma 8. Under Assumptions 4-7, for any two policies $\pi_1, \pi_2 \in X$,

\[ \|Q^{\pi_1} - Q^{\pi_2}\|_{\rho^\star,\infty} \le (1 - \gamma)^{-2}\, r\, C_\rho\, \|\pi_1 - \pi_2\|_{\rho^\star,1}, \qquad \big\| \theta^\star(\pi_1) - \theta^\star(\pi_2) \big\|_2 \le (1 - \gamma)^{-2}\, r\, C_\rho\, \mu_\phi^{-1}\, \|\pi_1 - \pi_2\|_{\rho^\star,1}, \tag{78} \]

where $r$ is an upper bound on the reward function, $\mu_\phi$ is specified in Assumption 6, and $C_\rho$ is defined in Assumption 7.

The proof of the above lemma is relegated to §C.1.


In the sequel, we first derive coupled inequalities for the non-negative sequences $\mathrm{OPT}_k := \mathbb{E}[\ell(\pi^k) - \ell(\pi^\star)]$, $\Delta_Q^k := \mathbb{E}[\|\theta^k - \theta^\star(\pi^{k-1})\|^2]$, and $\mathbb{E}[\|\pi^k - \pi^{k+1}\|_{\rho^\star,1}^2]$; then we apply Lemma 5 to derive the convergence rates of TT-NAC. Using the performance difference lemma [cf. (47)], we obtain

\begin{align}
\ell(\pi^{k+1}) - \ell(\pi^k) &= -(1 - \gamma)^{-1}\, \langle Q^{\pi^{k+1}}, \pi^{k+1} - \pi^\star\rangle_{\rho^\star} + (1 - \gamma)^{-1}\, \langle Q^{\pi^k}, \pi^k - \pi^\star\rangle_{\rho^\star} \notag\\
&= (1 - \gamma)^{-1}\, \langle -Q^{\pi^k}, \pi^{k+1} - \pi^k\rangle_{\rho^\star} + (1 - \gamma)^{-1}\, \langle Q^{\pi^k} - Q^{\pi^{k+1}}, \pi^{k+1} - \pi^\star\rangle_{\rho^\star}. \tag{79}
\end{align}

Applying the inequality (77), we further have

\begin{align}
\langle Q^{\pi^k} - Q^{\pi^{k+1}}, \pi^{k+1} - \pi^\star\rangle_{\rho^\star} &\le \|Q^{\pi^k} - Q^{\pi^{k+1}}\|_{\rho^\star,\infty}\, \|\pi^{k+1} - \pi^\star\|_{\rho^\star,1} \tag{80}\\
&\le 2 L_Q\, \|\pi^k - \pi^{k+1}\|_{\rho^\star,1} \le \frac{1 - \gamma}{4\alpha}\, \|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2 + \frac{4 L_Q^2 \alpha}{1 - \gamma}, \tag{81}
\end{align}

where $L_Q := (1 - \gamma)^{-2}\, r\, C_\rho$. The above chain follows from $\|\pi_1 - \pi_2\|_{\rho^\star,1} \le 2$ for any $\pi_1, \pi_2 \in X$, Lemma 8, and Young's inequality. Then, combining (79), (81) leads to

\[ \ell(\pi^{k+1}) - \ell(\pi^k) \le -\frac{1}{1 - \gamma}\, \langle Q^{\pi^k}, \pi^{k+1} - \pi^k\rangle_{\rho^\star} + \frac{1}{4\alpha}\, \|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2 + \frac{4\alpha L_Q^2}{(1 - \gamma)^2}. \tag{82} \]

Let us bound the first term on the right-hand side of (82). To proceed, note that the policy update (42) can be implemented for each state individually, as shown below:

\[ \pi^{k+1}(\cdot|s) = \mathop{\arg\min}_{\nu:\, \sum_a \nu(a) = 1,\, \nu(a) \ge 0}\, \Big\{ -(1 - \gamma)^{-1}\, \langle Q_{\theta^{k+1}}(s,\cdot), \nu\rangle + \frac{1}{\alpha}\, D_\psi\big(\nu, \pi^k(\cdot|s)\big) \Big\}, \tag{83} \]

for all $s \in S$. Observe that we can replace $\rho_{\pi^k}$ in (42) by $\rho^\star$ without changing the optimal solution of this subproblem. Specifically, (42) can be written as

\[ \pi^{k+1} = \mathop{\arg\min}_{\pi \in X}\, \Big\{ -(1 - \gamma)^{-1}\, \langle Q_{\theta^{k+1}}, \pi - \pi^k\rangle_{\rho^\star} + \frac{1}{\alpha}\, D_{\psi,\rho^\star}(\pi, \pi^k) \Big\}. \tag{84} \]
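As an implementation aside (not used in the analysis), if $\psi$ is chosen as the negative entropy so that $D_\psi$ is the KL divergence (one standard instantiation; the analysis only requires 1-strong convexity of $D_\psi$), the per-state subproblem (83) has the closed-form multiplicative-weights solution sketched below, where the array q_s holds the critic estimates $Q_{\theta^{k+1}}(s, \cdot)$:

import numpy as np

def actor_step_kl(pi_s, q_s, alpha, gamma):
    """One mirror-descent actor step (83) for a single state s,
    assuming D_psi is the KL divergence (negative-entropy psi).

    pi_s : current policy pi^k(.|s), shape (|A|,), assumed strictly positive
    q_s  : critic values Q_{theta^{k+1}}(s, .), shape (|A|,)
    """
    logits = np.log(pi_s) + alpha / (1.0 - gamma) * q_s
    logits -= logits.max()            # numerical stabilization
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()      # softmax / multiplicative-weights update

Under this choice, the update reads $\pi^{k+1}(a|s) \propto \pi^k(a|s)\, \exp\big(\alpha (1-\gamma)^{-1} Q_{\theta^{k+1}}(s,a)\big)$, which is the familiar natural policy gradient step.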

Returning to the first term in (82), we have

\[ -(1 - \gamma)^{-1}\, \langle Q^{\pi^k}, \pi^{k+1} - \pi^k\rangle_{\rho^\star} = (1 - \gamma)^{-1}\, \Big[ \langle Q_{\theta^{k+1}} - Q^{\pi^k}, \pi^{k+1} - \pi^k\rangle_{\rho^\star} - \langle Q_{\theta^{k+1}}, \pi^{k+1} - \pi^k\rangle_{\rho^\star} \Big]. \]

Furthermore, from (84), we obtain

\[ \frac{\langle Q_{\theta^{k+1}}, \pi^{k+1} - \pi^k\rangle_{\rho^\star}}{1 - \gamma} \ge \frac{1}{\alpha} \sum_{s \in S} \langle \nabla D_\psi(\pi^{k+1}(\cdot|s), \pi^k(\cdot|s)),\, \pi^{k+1}(\cdot|s) - \pi^k(\cdot|s)\rangle\, \rho^\star(s), \tag{85} \]

where the inequality follows from the optimality condition of the mirror descent step. Meanwhile, the 1-strong convexity of $D_\psi(\cdot,\cdot)$ in its first argument implies that

\[ \big\langle \nabla D_\psi\big(\pi^{k+1}(\cdot|s), \pi^k(\cdot|s)\big),\, \pi^{k+1}(\cdot|s) - \pi^k(\cdot|s)\big\rangle \ge \|\pi^{k+1}(\cdot|s) - \pi^k(\cdot|s)\|^2. \tag{86} \]

Thus, combining (85) and (86), and applying Young's inequality, we further have

\begin{align}
-(1 - \gamma)^{-1}\, \langle Q^{\pi^k}, \pi^{k+1} - \pi^k\rangle_{\rho^\star} &\le \frac{1}{4\alpha}\, \|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2 + \alpha (1 - \gamma)^{-2}\, \|Q^{\pi^k} - Q_{\theta^{k+1}}\|_{\rho^\star,\infty}^2 - \frac{1}{\alpha}\, \|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2 \notag\\
&= \alpha (1 - \gamma)^{-2}\, \|Q^{\pi^k} - Q_{\theta^{k+1}}\|_{\rho^\star,\infty}^2 - \frac{3}{4\alpha}\, \|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2. \tag{87}
\end{align}


By direct computation and using $\|\phi(s,a)\| \le 1$ [cf. Assumption 5], we have

\begin{align}
\|Q^{\pi^k} - Q_{\theta^{k+1}}\|_{\rho^\star,\infty}^2 &= \sum_{s \in S} \max_{a \in A}\, \big| \phi(s,a)^\top [\theta^\star(\pi^k) - \theta^{k+1}] \big|^2\, \rho^\star(s) \notag\\
&\le \sum_{s \in S} \max_{a \in A}\, \big\{ \|\phi(s,a)\|^2 \big\}\, \|\theta^\star(\pi^k) - \theta^{k+1}\|_2^2\, \rho^\star(s) \le \|\theta^\star(\pi^k) - \theta^{k+1}\|_2^2. \tag{88}
\end{align}

Combining (82), (87), and (88), we obtain

\[ \ell(\pi^{k+1}) - \ell(\pi^k) \le \alpha (1 - \gamma)^{-2}\, \|\theta^\star(\pi^k) - \theta^{k+1}\|_2^2 - \frac{1}{2\alpha}\, \|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2 + 4 (1 - \gamma)^{-2}\, L_Q^2\, \alpha. \]

Taking the full expectation leads to

\[ \mathrm{OPT}_{k+1} - \mathrm{OPT}_k \le \alpha (1 - \gamma)^{-2}\, \Delta_Q^{k+1} - \frac{1}{2\alpha}\, \mathbb{E}[\|\pi^{k+1} - \pi^k\|_{\rho^\star,1}^2] + 4 (1 - \gamma)^{-2}\, L_Q^2\, \alpha. \tag{89} \]

Next, we consider the convergence of $\Delta_Q^k$. Let $\mathcal{F}_k = \sigma\{\theta^0, \pi^0, \ldots, \theta^k, \pi^k\}$ be the $\sigma$-algebra generated by the first $k+1$ actor and critic updates. Under Assumption 5, we can write the conditional expectation of $h_g^k$ as

\[ \mathbb{E}[h_g^k \,|\, \mathcal{F}_k] = \mathbb{E}_{\mu_{\pi^k}}\big[ \big\{ Q_{\theta^k}(s,a) - r(s,a) - \gamma\, Q_{\theta^k}(s',a') \big\}\, \phi(s,a) \,\big|\, \mathcal{F}_k \big] = \mathbb{E}_{\mu_{\pi^k}}\big[ \phi(s,a)\, \big\{ \phi(s,a) - \gamma\, \phi(s',a') \big\}^\top \big]\, [\theta^k - \theta^\star(\pi^k)], \tag{90} \]

where $\mathbb{E}_{\mu_{\pi^k}}[\cdot]$ denotes the expectation taken with $s \sim \mu_{\pi^k}$, $a \sim \pi^k(\cdot|s)$, $s' \sim P(\cdot|s,a)$, $a' \sim \pi^k(\cdot|s')$. Under Assumptions 5 and 6, Lemma 3 of [4] shows that $\mathbb{E}[h_g^k \,|\, \mathcal{F}_k]$ is a semigradient of the MSBE function $\|Q_{\theta^k} - Q_{\theta^\star(\pi^k)}\|_{\mu_{\pi^k} \otimes \pi^k}^2$. Particularly, we obtain

\[ \mathbb{E}[h_g^k \,|\, \mathcal{F}_k]^\top [\theta^k - \theta^\star(\pi^k)] \ge (1 - \gamma)\, \|Q_{\theta^k} - Q_{\theta^\star(\pi^k)}\|_{\mu_{\pi^k} \otimes \pi^k}^2 \ge \mu_{td}\, \|\theta^k - \theta^\star(\pi^k)\|_2^2, \tag{91} \]

where we have let $\mu_{td} = (1 - \gamma)\, \mu_\phi^2$. Moreover, Lemma 5 of [4] demonstrates that the second-order moment $\mathbb{E}[\|h_g^k\|_2^2 \,|\, \mathcal{F}_k]$ is bounded as

\[ \mathbb{E}[\|h_g^k\|_2^2 \,|\, \mathcal{F}_k] \le 8\, \|Q_{\theta^k} - Q_{\theta^\star(\pi^k)}\|_{\mu_{\pi^k} \otimes \pi^k}^2 + \sigma_{td}^2 \le 8\, \|\theta^k - \theta^\star(\pi^k)\|_2^2 + \sigma_{td}^2, \tag{92} \]

where $\sigma_{td}^2 = 4 r^2 (1 - \gamma)^{-2}$. Combining (91), (92) and recalling $\beta \le \mu_{td}/8$, it holds that

\begin{align}
\mathbb{E}[\|\theta^{k+1} - \theta^\star(\pi^k)\|_2^2 \,|\, \mathcal{F}_k] &= \|\theta^k - \theta^\star(\pi^k)\|_2^2 - 2\beta\, \mathbb{E}[h_g^k \,|\, \mathcal{F}_k]^\top [\theta^k - \theta^\star(\pi^k)] + \beta^2\, \mathbb{E}[\|h_g^k\|_2^2 \,|\, \mathcal{F}_k] \notag\\
&\le \big(1 - 2\mu_{td}\beta + 8\beta^2\big)\, \|\theta^k - \theta^\star(\pi^k)\|_2^2 + \beta^2\, \sigma_{td}^2 \le (1 - \mu_{td}\beta)\, \|\theta^k - \theta^\star(\pi^k)\|_2^2 + \beta^2\, \sigma_{td}^2. \tag{93}
\end{align}
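For concreteness (an implementation aside, not used in the proof), the critic step whose conditional mean is computed in (90) is the linear TD(0) semigradient update. A minimal sketch follows, where sample_transition is a hypothetical oracle returning $(s, a, r, s', a')$ sampled as described after (90), and phi is the feature map of Assumption 5:

import numpy as np

def critic_td0_step(theta, beta, gamma, phi, sample_transition):
    """One linear TD(0) critic step: theta <- theta - beta * h_g,
    with h_g = (Q_theta(s,a) - r - gamma * Q_theta(s',a')) * phi(s,a), cf. (90).
    """
    s, a, r, s2, a2 = sample_transition()   # hypothetical sampling oracle
    q_sa = phi(s, a) @ theta
    q_next = phi(s2, a2) @ theta
    h_g = (q_sa - r - gamma * q_next) * phi(s, a)   # stochastic semigradient
    return theta - beta * h_g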

Continuing from (93), by Young's inequality and Lemma 8, we further have

\begin{align}
\mathbb{E}[\|\theta^{k+1} - \theta^\star(\pi^k)\|_2^2 \,|\, \mathcal{F}_k] &\le (1 + c)(1 - \mu_{td}\beta)\, \|\theta^k - \theta^\star(\pi^{k-1})\|_2^2 + (1 + 1/c)\, \|\theta^\star(\pi^k) - \theta^\star(\pi^{k-1})\|_2^2 + \beta^2 \sigma_{td}^2 \notag\\
&\le (1 - \mu_{td}\beta/2)\, \|\theta^k - \theta^\star(\pi^{k-1})\|_2^2 + \Big(\frac{2}{\mu_{td}\beta} - 1\Big)\, \bar{L}_Q\, \|\pi^k - \pi^{k-1}\|_{\rho^\star,1}^2 + \beta^2 \sigma_{td}^2, \tag{94}
\end{align}

where we have chosen $c > 0$ such that $(1 + c)(1 - \mu_{td}\beta) = 1 - \mu_{td}\beta/2$, which implies $1/c + 1 = 2/(\mu_{td}\beta) - 1 > 0$ [cf. (45)]. The last inequality comes from Lemma 8 with the constant $\bar{L}_Q = (1 - \gamma)^{-4}\, r^2\, C_\rho^2\, \mu_\phi^{-2}$.


From (89), (94), we identify that condition (31) of Lemma 5 holds with:

\[ \Omega_k = \mathrm{OPT}_k, \quad \Theta_k = \mathbb{E}[\|\pi^k - \pi^{k-1}\|_{\rho^\star,1}^2], \quad c_0 = \frac{1}{2\alpha}, \quad c_1 = \frac{\alpha}{(1 - \gamma)^2}, \quad c_2 = \frac{4 L_Q^2 \alpha}{(1 - \gamma)^2}, \]
\[ \Upsilon_k = \mathbb{E}[\|\theta^k - \theta^\star(\pi^{k-1})\|^2], \quad d_0 = \mu_{td}\beta/2, \quad d_1 = \Big(\frac{2}{\mu_{td}\beta} - 1\Big)\, \bar{L}_Q > 0, \quad d_2 = \beta^2\, \sigma_{td}^2. \]

Selecting the step sizes as in (45), one can verify that $\frac{\alpha}{\beta} < \frac{\mu_{td}(1 - \gamma)}{16 \sqrt{\bar{L}_Q}}$. This ensures

\[ c_0 - c_1 d_1 (d_0)^{-1} > 1/(4\alpha), \qquad d_0 - c_1 d_1 (c_0)^{-1} > \mu_{td}\beta/4. \tag{95} \]

Applying Lemma 5, we obtain for any $K \ge 1$ that

\[ \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}[\|\pi^k - \pi^{k+1}\|_{\rho^\star,1}^2] \le \frac{4\alpha\, \mathrm{OPT}_0 + \frac{8\alpha^2 (1 - \gamma)^{-2}}{\mu_{td}\beta}\, \big( \Delta_Q^0 + \beta^2 \sigma_{td}^2 \big)}{K} + \frac{8\alpha^2\, \big( 4 L_Q^2 + \beta \sigma_{td}^2/\mu_{td} \big)}{(1 - \gamma)^2}, \]
\[ \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}[\Delta_Q^{k+1}] \le \frac{\mathbb{E}[\Delta_Q^0] + \beta^2 \sigma_{td}^2 + \frac{4\alpha}{\mu_{td}\beta}\, \bar{L}_Q\, \mathrm{OPT}_0}{\mu_{td}\beta K/4} + \frac{\beta^2 \sigma_{td}^2 + \frac{16\alpha^2}{\mu_{td}\beta}\, (1 - \gamma)^{-2} L_Q^2\, \alpha^2}{\mu_{td}\beta/4}. \]

Particularly, plugging in $\alpha \asymp K_{\max}^{-3/4}$, $\beta \asymp K_{\max}^{-1/2}$ shows that the convergence rates are $K_{\max}^{-1} \sum_{k=1}^{K_{\max}} \mathbb{E}[\|\pi^k - \pi^{k+1}\|_{\rho^\star,1}^2] = O(K_{\max}^{-3/2})$ and $K_{\max}^{-1} \sum_{k=1}^{K_{\max}} \mathbb{E}[\Delta_Q^{k+1}] = O(K_{\max}^{-1/2})$.

Our last step is to analyze the convergence rate of the objective value $\mathrm{OPT}_k$. To this end, we observe the following three-point inequality [3]:

\[ -\frac{\langle Q_{\theta^{k+1}}, \pi^{k+1} - \pi^\star\rangle_{\rho^\star}}{1 - \gamma} \le \frac{1}{\alpha}\, \Big[ D_{\psi,\rho^\star}(\pi^\star, \pi^k) - D_{\psi,\rho^\star}(\pi^\star, \pi^{k+1}) - D_{\psi,\rho^\star}(\pi^{k+1}, \pi^k) \Big]. \tag{96} \]

Meanwhile, by the inequalities (77), (79), (80), we have

\begin{align*}
\ell(\pi^{k+1}) - \ell(\pi^k) &= -\frac{1}{1 - \gamma}\, \langle Q^{\pi^k}, \pi^{k+1} - \pi^k\rangle_{\rho^\star} + \frac{1}{1 - \gamma}\, \langle Q^{\pi^k} - Q^{\pi^{k+1}}, \pi^{k+1} - \pi^\star\rangle_{\rho^\star} \\
&\le \frac{1}{1 - \gamma}\, \Big[ \langle -Q^{\pi^k}, \pi^{k+1} - \pi^\star\rangle_{\rho^\star} - \langle Q^{\pi^k}, \pi^\star - \pi^k\rangle_{\rho^\star} + \|\pi^{k+1} - \pi^\star\|_{\rho^\star,1}\, \|Q^{\pi^k} - Q^{\pi^{k+1}}\|_{\rho^\star,\infty} \Big] \\
&\le \frac{1}{1 - \gamma}\, \Big[ \langle -Q^{\pi^k}, \pi^{k+1} - \pi^\star\rangle_{\rho^\star} - \langle Q^{\pi^k}, \pi^\star - \pi^k\rangle_{\rho^\star} + 2 L_Q\, \|\pi^{k+1} - \pi^k\|_{\rho^\star,1} \Big],
\end{align*}

where the last inequality follows from Lemma 8. Now, with the performance difference lemma $\ell(\pi^\star) - \ell(\pi^k) = (1 - \gamma)^{-1}\, \langle -Q^{\pi^k}, \pi^\star - \pi^k\rangle_{\rho^\star}$, the above simplifies to

\[ \ell(\pi^{k+1}) - \ell(\pi^\star) \le (1 - \gamma)^{-1}\, \Big[ -\langle Q^{\pi^k}, \pi^{k+1} - \pi^\star\rangle_{\rho^\star} + 2 L_Q\, \|\pi^{k+1} - \pi^k\|_{\rho^\star,1} \Big]. \]

With $\langle Q^{\pi^k}, \pi^{k+1} - \pi^\star\rangle_{\rho^\star} = \langle Q^{\pi^k} - Q_{\theta^{k+1}}, \pi^{k+1} - \pi^\star\rangle_{\rho^\star} + \langle Q_{\theta^{k+1}}, \pi^{k+1} - \pi^\star\rangle_{\rho^\star}$ and applying the three-point inequality (96), we have

\begin{align*}
\ell(\pi^{k+1}) - \ell(\pi^\star) &\le \frac{2}{1 - \gamma}\, \Big[ L_Q\, \|\pi^{k+1} - \pi^k\|_{\rho^\star,1} + \|Q^{\pi^k} - Q_{\theta^{k+1}}\|_{\rho^\star,\infty} \Big] + \frac{1}{\alpha}\, \Big[ D_{\psi,\rho^\star}(\pi^\star, \pi^k) - D_{\psi,\rho^\star}(\pi^\star, \pi^{k+1}) - D_{\psi,\rho^\star}(\pi^{k+1}, \pi^k) \Big] \\
&\le \frac{2}{1 - \gamma}\, \Big[ L_Q\, \|\pi^{k+1} - \pi^k\|_{\rho^\star,1} + \|\theta^\star(\pi^k) - \theta^{k+1}\|_2 \Big] + \frac{1}{\alpha}\, \Big[ D_{\psi,\rho^\star}(\pi^\star, \pi^k) - D_{\psi,\rho^\star}(\pi^\star, \pi^{k+1}) \Big],
\end{align*}

where the last inequality uses (88) and the fact that $D_{\psi,\rho^\star}$ is non-negative. Finally, taking the full expectation on both sides of the inequality, we obtain

\[ \mathrm{OPT}_{k+1} \le 2 (1 - \gamma)^{-1}\, \mathbb{E}[\|\theta^\star(\pi^k) - \theta^{k+1}\|_2] + 2 (1 - \gamma)^{-1}\, L_Q\, \mathbb{E}[\|\pi^{k+1} - \pi^k\|_{\rho^\star,1}] + \frac{1}{\alpha}\, \mathbb{E}\big[ D_{\psi,\rho^\star}(\pi^\star, \pi^k) - D_{\psi,\rho^\star}(\pi^\star, \pi^{k+1}) \big]. \tag{97} \]

Summing up both sides from $k = 0$ to $k = K_{\max} - 1$ and dividing by $K_{\max}$ yields

\[ \frac{1}{K_{\max}} \sum_{k=1}^{K_{\max}} \mathrm{OPT}_k \le \frac{1}{\alpha K_{\max}}\, \Big\{ D_{\psi,\rho^\star}(\pi^\star, \pi^0) - D_{\psi,\rho^\star}(\pi^\star, \pi^{K_{\max}}) \Big\} + \frac{2}{1 - \gamma}\, \frac{1}{K_{\max}} \sum_{k=1}^{K_{\max}} \Big\{ \mathbb{E}[\|\theta^\star(\pi^{k-1}) - \theta^k\|_2] + L_Q\, \mathbb{E}[\|\pi^k - \pi^{k-1}\|_{\rho^\star,1}] \Big\}. \tag{98} \]

Using the Cauchy-Schwarz inequality and the rates derived above, it can be seen that the right-hand side is $O(K_{\max}^{-1/4})$. This concludes the proof of the theorem.

C.1 Proof of Lemma 8

We first bound $|Q^{\pi_1}(s,a) - Q^{\pi_2}(s,a)|$. By the Bellman equation (38) and the performance difference lemma (47), we have

\[ Q^{\pi_1}(s,a) - Q^{\pi_2}(s,a) = \sum_{s' \in S} P(s'|s,a)\, \big[ V^{\pi_1}(s') - V^{\pi_2}(s') \big] = (1 - \gamma)^{-1} \sum_{s' \in S} P(s'|s,a)\, \mathbb{E}_{\tilde{s} \sim \varrho(s',\pi_1)}\big[ \big\langle Q^{\pi_2}(\tilde{s},\cdot),\, \pi_1(\cdot|\tilde{s}) - \pi_2(\cdot|\tilde{s}) \big\rangle \big], \tag{99} \]

where $\varrho(s', \pi_1)$ is the visitation measure of the Markov chain induced by $\pi_1$ with the initial state fixed to $s'$. Recall the definition of the visitation measure $\varrho(s, a, \pi)$ in (43). We rewrite (99) as

\[ Q^{\pi_1}(s,a) - Q^{\pi_2}(s,a) = (1 - \gamma)^{-1}\, \mathbb{E}_{\tilde{s} \sim \varrho(s,a,\pi_1)}\big[ \big\langle Q^{\pi_2}(\tilde{s},\cdot),\, \pi_1(\cdot|\tilde{s}) - \pi_2(\cdot|\tilde{s}) \big\rangle \big]. \tag{100} \]

Moreover, notice that $\sup_{(s,a) \in S \times A} |Q^\pi(s,a)| \le (1 - \gamma)^{-1}\, r$ under Assumption 4. Then, applying Hölder's inequality to (100), we obtain

\begin{align}
\big| Q^{\pi_1}(s,a) - Q^{\pi_2}(s,a) \big| &\le (1 - \gamma)^{-1}\, \mathbb{E}_{\tilde{s} \sim \varrho(s,a,\pi_1)}\big[ \|Q^{\pi_2}(\tilde{s},\cdot)\|_\infty\, \|\pi_1(\cdot|\tilde{s}) - \pi_2(\cdot|\tilde{s})\|_1 \big] \notag\\
&\le (1 - \gamma)^{-2}\, r\, \mathbb{E}_{\tilde{s} \sim \rho^\star}\Big[ \frac{\varrho(s,a,\pi_1)(\tilde{s})}{\rho^\star(\tilde{s})}\, \|\pi_1(\cdot|\tilde{s}) - \pi_2(\cdot|\tilde{s})\|_1 \Big] \notag\\
&\le (1 - \gamma)^{-2}\, r\, \Big\{ \mathbb{E}_{\tilde{s} \sim \rho^\star}\Big[ \Big| \frac{\varrho(s,a,\pi_1)(\tilde{s})}{\rho^\star(\tilde{s})} \Big|^2 \Big]\, \mathbb{E}_{\tilde{s} \sim \rho^\star}\big[ \|\pi_1(\cdot|\tilde{s}) - \pi_2(\cdot|\tilde{s})\|_1^2 \big] \Big\}^{1/2} \notag\\
&\le (1 - \gamma)^{-2}\, r\, C_\rho\, \|\pi_1 - \pi_2\|_{\rho^\star,1}, \tag{101}
\end{align}

where the second inequality is from the boundedness of $Q^\pi$, the third is the Cauchy-Schwarz inequality, and the last is from Assumption 7. Since the bound in (101) is uniform over $(s,a)$ and $\rho^\star$ is a probability measure, we finally have

\[ \|Q^{\pi_1} - Q^{\pi_2}\|_{\rho^\star,\infty} \le (1 - \gamma)^{-2}\, r\, C_\rho\, \|\pi_1 - \pi_2\|_{\rho^\star,1}. \]

It remains to bound $\|\theta^\star(\pi_1) - \theta^\star(\pi_2)\|_2$. Under Assumption 5, we have

\begin{align}
\|Q^{\pi_1} - Q^{\pi_2}\|_{\mu_{\pi^\star} \otimes \pi^\star}^2 &= \mathbb{E}_{s \sim \mu_{\pi^\star},\, a \sim \pi^\star(\cdot|s)}\big[ Q^{\pi_1}(s,a) - Q^{\pi_2}(s,a) \big]^2 = \mathbb{E}_{s \sim \mu_{\pi^\star},\, a \sim \pi^\star(\cdot|s)}\Big( \big\{ \phi(s,a)^\top [\theta^\star(\pi_1) - \theta^\star(\pi_2)] \big\}^2 \Big) \notag\\
&= [\theta^\star(\pi_1) - \theta^\star(\pi_2)]^\top\, \Sigma_{\pi^\star}\, [\theta^\star(\pi_1) - \theta^\star(\pi_2)]. \tag{102}
\end{align}

Then, combining Assumption 6 and (101), we have

\[ \mu_\phi^2\, \|\theta^\star(\pi_1) - \theta^\star(\pi_2)\|^2 \le \|Q^{\pi_1} - Q^{\pi_2}\|_{\mu_{\pi^\star} \otimes \pi^\star}^2 \le (1 - \gamma)^{-4}\, r^2\, C_\rho^2\, \|\pi_1 - \pi_2\|_{\rho^\star,1}^2, \tag{103} \]

which yields the second inequality in Lemma 8. We conclude the proof.

D Auxiliary Lemmas

The proofs for the lemmas below can be found in the online appendix [24].

Lemma 9. [31, Lemma 12] Let $a > 0$ and let $\{\gamma_j\}_{j \ge 0}$ be a non-increasing, non-negative sequence such that $\gamma_0 < 1/a$. Then it holds for any $k \ge 0$ that

\[ \sum_{j=0}^{k} \gamma_j \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) \le \frac{1}{a}. \tag{104} \]

Lemma 10. Fix a real number $1 < q \le 2$. Let $a > 0$ and let $\{\gamma_j\}_{j \ge 0}$ be a non-increasing, non-negative sequence such that $\gamma_0 < 1/(2a)$. Suppose that $\frac{\gamma_{\ell-1}}{\gamma_\ell} \le 1 + \frac{a}{2(q-1)}\, \gamma_\ell$. Then it holds for any $k \ge 0$ that

\[ \sum_{j=0}^{k} \gamma_j^q \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) \le \frac{2}{a}\, \gamma_k^{q-1}. \tag{105} \]

Lemma 11. Fix real numbers $a, b > 0$. Let $\{\gamma_j\}_{j \ge 0}$, $\{\rho_j\}_{j \ge 0}$ be non-increasing, non-negative sequences such that $2 a \gamma_j \le b \rho_j$ for all $j$, and $\rho_0 < 1/b$. Then it holds that

\[ \sum_{j=0}^{k} \gamma_j \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) \prod_{i=0}^{j} (1 - \rho_i b) \le \frac{1}{a} \prod_{\ell=0}^{k} (1 - \gamma_\ell a), \quad \forall\, k \ge 0. \tag{106} \]

E Technical Results Omitted from the Main Paper

E.1 Proof of Lemma 10

To derive this result, we observe that

\[ \sum_{j=0}^{k} \gamma_j^q \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) \le \gamma_k^{q-1} \sum_{j=0}^{k} \gamma_j\, \frac{\gamma_j^{q-1}}{\gamma_k^{q-1}} \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) = \gamma_k^{q-1} \sum_{j=0}^{k} \gamma_j \prod_{\ell=j+1}^{k} \Big( \frac{\gamma_{\ell-1}}{\gamma_\ell} \Big)^{q-1} (1 - \gamma_\ell a). \]

Furthermore, from the conditions on $\gamma_\ell$,

\[ \Big( \frac{\gamma_{\ell-1}}{\gamma_\ell} \Big)^{q-1} (1 - \gamma_\ell a) \le \Big( 1 + \frac{a}{2(q-1)}\, \gamma_\ell \Big)^{q-1} (1 - \gamma_\ell a) \le 1 - \frac{a}{2}\, \gamma_\ell. \]

Therefore,

\[ \sum_{j=0}^{k} \gamma_j^q \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) \le \gamma_k^{q-1} \sum_{j=0}^{k} \gamma_j \prod_{\ell=j+1}^{k} \Big( 1 - \frac{a}{2}\, \gamma_\ell \Big) \le \frac{2}{a}\, \gamma_k^{q-1}, \]

where the last step applies Lemma 9 with $a/2$ in place of $a$. This concludes the proof.
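As an illustration (not part of the proof), the bound (105) can be checked numerically for a step-size sequence satisfying the stated conditions:

import numpy as np

a, q, K = 0.5, 1.5, 2000
gamma = 0.9 / (2 * a) / (1 + np.arange(K)) ** (2 * (q - 1) / 3)  # non-increasing, gamma_0 < 1/(2a)
# verify the ratio condition gamma_{l-1}/gamma_l <= 1 + a/(2(q-1)) * gamma_l
assert np.all(gamma[:-1] / gamma[1:] <= 1 + a / (2 * (q - 1)) * gamma[1:])

lhs = 0.0
for j in range(K):
    # running value of sum_{i<=j} gamma_i^q prod_{l=i+1}^{j} (1 - gamma_l * a)
    lhs = lhs * (1 - gamma[j] * a) + gamma[j] ** q
print(lhs, "<=", 2 / a * gamma[-1] ** (q - 1))   # left side stays below the bound in (105)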

E.2 Proof of Lemma 11

First of all, the condition $2 a \gamma_j \le b \rho_j$ implies

\[ \frac{1 - \rho_i b}{1 - \gamma_i a} \le 1 - \rho_i b / 2, \quad \forall\, i \ge 0. \]

As such, we observe $\prod_{i=0}^{j} \frac{1 - \rho_i b}{1 - \gamma_i a} \le \prod_{i=0}^{j} (1 - \rho_i b / 2)$ and subsequently,

\[ \sum_{j=0}^{k} \gamma_j \prod_{\ell=j+1}^{k} (1 - \gamma_\ell a) \prod_{i=0}^{j} (1 - \rho_i b) \le \Big[ \prod_{\ell=0}^{k} (1 - \gamma_\ell a) \Big] \sum_{j=0}^{k} \gamma_j \prod_{i=0}^{j} (1 - \rho_i b / 2). \]

Furthermore, for any $j = 0, \ldots, k$, it holds that

\[ \rho_j \prod_{i=0}^{j-1} (1 - \rho_i b / 2) = \frac{2}{b}\, \Big[ \prod_{i=0}^{j-1} (1 - \rho_i b / 2) - \prod_{i=0}^{j} (1 - \rho_i b / 2) \Big], \tag{107} \]

where we have taken the convention $\prod_{i=0}^{-1} (1 - \rho_i b / 2) = 1$. We obtain that

\[ \sum_{j=0}^{k} \gamma_j \prod_{i=0}^{j} (1 - \rho_i b / 2) \le \frac{b}{2a} \sum_{j=0}^{k} \rho_j \prod_{i=0}^{j-1} (1 - \rho_i b / 2) = \frac{1}{a} \sum_{j=0}^{k} \Big[ \prod_{i=0}^{j-1} (1 - \rho_i b / 2) - \prod_{i=0}^{j} (1 - \rho_i b / 2) \Big] \le \frac{1}{a}, \]

where the last inequality follows from the bound $1 - \prod_{i=0}^{k} (1 - \rho_i b / 2) \le 1$. Combining with the first display above yields the desired result.

E.3 Proof of Lemma 1

Proof. Since the samples are drawn independently, the expected value of $h_f^k$ is

\[ \mathbb{E}[h_f^k] = \nabla_x f(x,y) - \nabla_{xy}^2 g(x,y)\, \mathbb{E}\bigg[ \frac{t_{\max}\, \mu_g}{L_g (\mu_g^2 + \sigma_{gxy}^2)} \prod_{i=1}^{p} \Big( I - \frac{\mu_g}{L_g (\mu_g^2 + \sigma_{gxy}^2)}\, \nabla_{yy}^2 g(x,y;\xi_i^{(2)}) \Big) \bigg]\, \nabla_y f(x,y). \tag{108} \]

We have

\begin{align*}
\big\| \bar{\nabla} f(x,y) - \mathbb{E}[h_f^k] \big\| &= \bigg\| \nabla_{xy}^2 g(x,y)\, \bigg\{ \mathbb{E}\bigg[ \frac{t_{\max}\, \mu_g}{L_g (\mu_g^2 + \sigma_{gxy}^2)} \prod_{i=1}^{p} \Big( I - \frac{\mu_g}{L_g (\mu_g^2 + \sigma_{gxy}^2)}\, \nabla_{yy}^2 g(x,y;\xi_i^{(2)}) \Big) \bigg] - \big[ \nabla_{yy}^2 g(x,y) \big]^{-1} \bigg\}\, \nabla_y f(x,y) \bigg\| \\
&\le C_{gxy}\, C_{fy}\, \bigg\| \mathbb{E}\bigg[ \frac{t_{\max}\, \mu_g}{L_g (\mu_g^2 + \sigma_{gxy}^2)} \prod_{i=1}^{p} \Big( I - \frac{\mu_g}{L_g (\mu_g^2 + \sigma_{gxy}^2)}\, \nabla_{yy}^2 g(x,y;\xi_i^{(2)}) \Big) \bigg] - \big[ \nabla_{yy}^2 g(x,y) \big]^{-1} \bigg\|,
\end{align*}

where $\bar{\nabla} f$ denotes the exact surrogate gradient, and the last inequality follows from Assumption 1-3 and Assumption 2-5. Applying [23, Lemma 3.2], the latter norm can be bounded by $\frac{1}{\mu_g}\, \big( 1 - \frac{\mu_g^2}{L_g (\mu_g^2 + \sigma_{gxy}^2)} \big)^{t_{\max}}$. This concludes the proof of the first part of the lemma.
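For reference, a sketch of the stochastic estimator whose mean is computed in (108) is given below; it is essentially the Neumann-series construction of [23] with the modified scaling used here. The oracles grad_x_f, grad_y_f, hess_xy_g, hess_yy_g are hypothetical and should return fresh stochastic samples of the corresponding derivatives on each call, and p is drawn uniformly from $\{0, \ldots, t_{\max} - 1\}$.

import numpy as np

def hypergradient_estimate(x, y, t_max, mu_g, L_g, sigma2_gxy,
                           grad_x_f, grad_y_f, hess_xy_g, hess_yy_g, rng):
    """Stochastic estimate h_f of the outer gradient, cf. (108).
    All *_f / *_g arguments are hypothetical stochastic sampling oracles."""
    c = mu_g / (L_g * (mu_g**2 + sigma2_gxy))    # scaling used in (108)
    p = int(rng.integers(t_max))                 # p ~ Uniform{0, ..., t_max - 1}
    v = grad_y_f(x, y)                           # fresh sample of grad_y f
    for _ in range(p):
        v = v - c * (hess_yy_g(x, y) @ v)        # apply (I - c * Hessian sample) with a fresh sample
    return grad_x_f(x, y) - t_max * c * (hess_xy_g(x, y) @ v)

A call would pass, e.g., rng = np.random.default_rng(); averaging over the random truncation level p is what produces the Neumann-series approximation of $[\nabla_{yy}^2 g]^{-1}$ in expectation.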

It remains to bound the variance of $h_f^k$. We first let

\[ H_{yy} = \frac{t_{\max}\, \mu_g}{L_g (\mu_g^2 + \sigma_{gxy}^2)} \prod_{i=1}^{p} \Big( I - \frac{\mu_g}{L_g (\mu_g^2 + \sigma_{gxy}^2)}\, \nabla_{yy}^2 g(x,y;\xi_i^{(2)}) \Big). \]

To estimate the variance of $h_f^k$, using (108), we observe that

\[ \mathbb{E}\big[ \|h_f^k - \mathbb{E}[h_f^k]\|^2 \big] = \mathbb{E}\big[ \|\nabla_x f(x,y;\xi^{(1)}) - \nabla_x f(x,y)\|^2 \big] + \mathbb{E}\Big[ \big\| \nabla_{xy}^2 g(x,y;\xi_0^{(2)})\, H_{yy}\, \nabla_y f(x,y;\xi^{(1)}) - \nabla_{xy}^2 g(x,y)\, \mathbb{E}[H_{yy}]\, \nabla_y f(x,y) \big\|^2 \Big]. \]

The first term on the right-hand side can be bounded by $\sigma_{fx}^2$. Furthermore,

\begin{align*}
&\nabla_{xy}^2 g(x,y;\xi_0^{(2)})\, H_{yy}\, \nabla_y f(x,y;\xi^{(1)}) - \nabla_{xy}^2 g(x,y)\, \mathbb{E}[H_{yy}]\, \nabla_y f(x,y) \\
&= \big\{ \nabla_{xy}^2 g(x,y;\xi_0^{(2)}) - \nabla_{xy}^2 g(x,y) \big\}\, H_{yy}\, \nabla_y f(x,y;\xi^{(1)}) + \nabla_{xy}^2 g(x,y)\, \big\{ H_{yy} - \mathbb{E}[H_{yy}] \big\}\, \nabla_y f(x,y;\xi^{(1)}) \\
&\quad + \nabla_{xy}^2 g(x,y)\, \mathbb{E}[H_{yy}]\, \big\{ \nabla_y f(x,y;\xi^{(1)}) - \nabla_y f(x,y) \big\}.
\end{align*}

We also observe that

\[ \mathbb{E}[\|\nabla_y f(x,y;\xi^{(1)})\|^2] = \mathbb{E}[\|\nabla_y f(x,y;\xi^{(1)}) - \nabla_y f(x,y)\|^2] + \|\nabla_y f(x,y)\|^2 \le \sigma_{fy}^2 + C_y^2. \]

Using $(a + b + c)^2 \le 3 (a^2 + b^2 + c^2)$ and the Cauchy-Schwarz inequality, we have

\[ \mathbb{E}\Big[ \big\| \nabla_{xy}^2 g(x,y;\xi_0^{(2)})\, H_{yy}\, \nabla_y f(x,y;\xi^{(1)}) - \nabla_{xy}^2 g(x,y)\, \mathbb{E}[H_{yy}]\, \nabla_y f(x,y) \big\|^2 \Big] \le 3\, \Big\{ (\sigma_{fy}^2 + C_y^2)\, \sigma_{gxy}^2\, \mathbb{E}[\|H_{yy}\|^2] + C_{gxy}^2\, \mathbb{E}[\|H_{yy} - \mathbb{E}[H_{yy}]\|^2] + \sigma_{fy}^2\, C_{gxy}^2\, \|\mathbb{E}[H_{yy}]\|^2 \Big\}. \]

Next, we observe that

\[ \mathbb{E}[\|H_{yy}\|^2] = \frac{\mu_g^2}{L_g^2 (\mu_g^2 + \sigma_{gxy}^2)^2} \sum_{p=0}^{t_{\max} - 1} \mathbb{E}\bigg[ \Big\| \prod_{i=1}^{p} \Big( I - \frac{\mu_g}{L_g (\mu_g^2 + \sigma_{gxy}^2)}\, \nabla_{yy}^2 g(x,y;\xi_i^{(2)}) \Big) \Big\|^2 \bigg]. \tag{109} \]

Observe that the product of random matrices satisfies the conditions of Lemma 12 with $\mu \equiv \frac{\mu_g^2}{L_g (\mu_g^2 + \sigma_{gxy}^2)}$ and $\sigma^2 \equiv \frac{\mu_g^2\, \sigma_{gxy}^2}{L_g^2 (\mu_g^2 + \sigma_{gxy}^2)^2}$. Under the condition $L_g \ge 1$, it can be seen that

\[ (1 - \mu)^2 + \sigma^2 \le 1 - \frac{\mu_g^2}{L_g (\mu_g^2 + \sigma_{gxy}^2)} < 1. \]

Applying Lemma 12 shows that

\[ \mathbb{E}\bigg[ \Big\| \prod_{i=1}^{p} \Big( I - \frac{\mu_g}{L_g (\mu_g^2 + \sigma_{gxy}^2)}\, \nabla_{yy}^2 g(x,y;\xi_i^{(2)}) \Big) \Big\|^2 \bigg] \le d_1\, \Big( 1 - \frac{\mu_g^2}{L_g (\mu_g^2 + \sigma_{gxy}^2)} \Big)^p. \]

Subsequently,

\[ \mathbb{E}[\|H_{yy}\|^2] \le \frac{d_1}{L_g (\mu_g^2 + \sigma_{gxy}^2)}. \]

Furthermore, it is easy to derive that $\|\mathbb{E}[H_{yy}]\| \le 1/\mu_g$; see also the footnote below. Together, the above gives the following estimate on the variance:

\[ \mathbb{E}\big[ \|h_f^k - \mathbb{E}[h_f^k]\|^2 \big] \le \sigma_{fx}^2 + \Big\{ (\sigma_{fy}^2 + C_y^2)\, \sigma_{gxy}^2 + 2\, C_{gxy}^2 + \sigma_{fy}^2\, C_{gxy}^2 \Big\} \max\Big\{ \frac{3}{\mu_g^2},\, \frac{3\, d_1}{L_g (\mu_g^2 + \sigma_{gxy}^2)} \Big\}. \tag{110} \]

This concludes the proof of the second part of the lemma.

We observe the following lemma on products of (possibly non-PSD) random matrices, which is inspired by [18, 25]:

Lemma 12. Let $Z_i$, $i = 0, 1, \ldots$ be a sequence of random matrices defined recursively as $Z_i = Y_i Z_{i-1}$, $i \ge 1$, with $Z_0 = I \in \mathbb{R}^{d \times d}$, where $Y_i$, $i = 0, 1, \ldots$ are independent, symmetric random matrices satisfying $\|\mathbb{E}[Y_i]\| \le 1 - \mu$ and $\mathbb{E}[\|Y_i - \mathbb{E}[Y_i]\|_2^2] \le \sigma^2$. If $(1 - \mu)^2 + \sigma^2 < 1$, then for any $t \ge 0$ it holds that

\[ \mathbb{E}[\|Z_t\|^2] \le \mathbb{E}[\|Z_t\|_2^2] \le d\, \big( (1 - \mu)^2 + \sigma^2 \big)^t, \tag{111} \]

where $\|X\|_2$ denotes the Schatten-2 norm of the matrix $X$.

Proof. The norm equivalence between the spectral norm and the Schatten-2 norm yields $\|Z_t\| \le \|Z_t\|_2$, and thus $\mathbb{E}[\|Z_t\|^2] \le \mathbb{E}[\|Z_t\|_2^2]$. For any $i \ge 1$, we observe that

\[ Z_i = \underbrace{(Y_i - \mathbb{E}[Y_i])\, Z_{i-1}}_{=A_i} + \underbrace{\mathbb{E}[Y_i]\, Z_{i-1}}_{=B_i}. \]

Noticing that $\mathbb{E}[A_i | B_i] = 0$, applying [25, Proposition 4.3] yields

\[ \mathbb{E}[\|Z_t\|_2^2] \le \mathbb{E}[\|A_t\|_2^2] + \mathbb{E}[\|B_t\|_2^2]. \tag{112} \]

Furthermore, using the fact that the $Y_i$'s are independent random matrices together with Hölder's inequality for matrices, we observe that

\[ \mathbb{E}[\|A_t\|_2^2] \le \mathbb{E}\big[ \|Y_t - \mathbb{E}[Y_t]\|_2^2\, \|Z_{t-1}\|_2^2 \big] = \mathbb{E}[\|Y_t - \mathbb{E}[Y_t]\|_2^2]\, \mathbb{E}[\|Z_{t-1}\|_2^2] \le \sigma^2\, \mathbb{E}[\|Z_{t-1}\|_2^2], \]

and using [25, (4.1)],

\[ \mathbb{E}[\|B_t\|_2^2] \le \|\mathbb{E}[Y_t]\|^2\, \mathbb{E}[\|Z_{t-1}\|_2^2] \le (1 - \mu)^2\, \mathbb{E}[\|Z_{t-1}\|_2^2]. \]

Substituting the above into (112) yields $\mathbb{E}[\|Z_t\|_2^2] \le ((1 - \mu)^2 + \sigma^2)\, \mathbb{E}[\|Z_{t-1}\|_2^2]$. Repeating the same argument $t$ times and using $\|I\|_2^2 = d$ yields the upper bound.

³ Notice that a slight modification of the proof of [23, Lemma 3.2] yields $(\mathbb{E}[\|H_{yy}\|])^2 \le \mu_g^{-2}$. However, the latter lemma requires $I - (1/L_g)\, \nabla_{yy}^2 g(x,y,\xi_i^{(2)})$ to be a PSD matrix almost surely, which is not required in our analysis.


References

[1] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 2021.

[2] András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

[3] Amir Beck. First-order methods in optimization, volume 25. SIAM, 2017.

[4] Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pages 1691–1692, 2018.

[5] Shalabh Bhatnagar, Mohammad Ghavamzadeh, Mark Lee, and Richard S Sutton. Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 105–112, 2008.

[6] Shalabh Bhatnagar, Doina Precup, David Silver, Richard S Sutton, Hamid R Maei, and Csaba Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.

[7] Vivek S Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997.

[8] Vivek S Borkar and Sarath Pattathil. Concentration bounds for two time scale stochastic approximation. In Allerton Conference on Communication, Control, and Computing, pages 504–511, 2018.

[9] Jerome Bracken, James E. Falk, and James T. McGill. Technical note: The equivalence of two mathematical programs with optimization problems in the constraints. Operations Research, 22(5):1102–1104, 1974.

[10] Jerome Bracken and James T. McGill. Mathematical programs with optimization problems in the constraints. Operations Research, 21(1):37–44, 1973.

[11] Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In ICML, 2019.

[12] Benoît Colson, Patrice Marcotte, and Gilles Savard. An overview of bilevel optimization. Annals of Operations Research, 153(1):235–256, 2007.

[13] Nicolas Couellan and Wenjuan Wang. On the convergence of stochastic bi-level gradient methods. 2016.

[14] Gal Dalal, Balazs Szorenyi, and Gugan Thoppe. A tale of two-timescale reinforcement learning with the tightest finite-time bound. arXiv preprint arXiv:1911.09157, 2019.

[15] Christoph Dann, Gerhard Neumann, Jan Peters, et al. Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15:809–883, 2014.

[16] Damek Davis and Dmitriy Drusvyatskiy. Stochastic subgradient method converges at the rate $O(k^{-1/4})$ on weakly convex functions. arXiv preprint arXiv:1802.02988, 2018.

[17] Thinh T Doan. Nonlinear two-time-scale stochastic approximation: Convergence and finite-time performance. arXiv preprint arXiv:2011.01868, 2020.

[18] Alain Durmus, Eric Moulines, Alexey Naumov, Sergey Samsonov, Kevin Scaman, and Hoi-To Wai. Tight high probability bounds for linear stochastic approximation with fixed stepsize. In NeurIPS, 2021.

[19] James E. Falk and Jiming Liu. On bilevel programming, part I: General nonlinear cases. Mathematical Programming, 70:47–72, 1995.

[20] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135, 2017.

[21] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In International Conference on Machine Learning, 2019.

[22] L Franceschi, P Frasconi, S Salzo, R Grazzi, and M Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, pages 1563–1572, 2018.

[23] Saeed Ghadimi and Mengdi Wang. Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246, 2018.

[24] M. Hong, H. Wai, Z. Wang, and Z. Yang. A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170, 2020.

[25] De Huang, Jonathan Niles-Weed, Joel A Tropp, and Rachel Ward. Matrix concentration for products. Foundations of Computational Mathematics, pages 1–33, 2021.

[26] Y. Ishizuka and E. Aiyoshi. Double penalty method for bilevel optimization problems. Annals of Operations Research, 34:73–88, 1992.

[27] Kaiyi Ji, Junjie Yang, and Yingbin Liang. Provably faster algorithms for bilevel optimization and applications to meta-learning. arXiv preprint arXiv:2010.07962, 2020.

[28] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.

[29] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pages 267–274, 2002.

[30] Sham M Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.

[31] Maxim Kaledin, Eric Moulines, Alexey Naumov, Vladislav Tadic, and Hoi-To Wai. Finite time analysis of linear two-timescale stochastic approximation with Markovian noise. In COLT, 2020.

[32] Prasenjit Karmakar and Shalabh Bhatnagar. Two time-scale stochastic approximation with controlled Markov noise and off-policy temporal-difference learning. Mathematics of Operations Research, 43(1):130–151, 2018.

[33] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.

[34] Vijay R Konda, John N Tsitsiklis, et al. Convergence rate of linear two-time-scale stochastic approximation. Annals of Applied Probability, 14(2):796–819, 2004.

[35] Junyi Li, Bin Gu, and Heng Huang. Improved bilevel model: Fast and optimal algorithm with theoretical guarantee. arXiv preprint arXiv:2009.00690, 2020.

[36] Valerii Likhosherstov, Xingyou Song, Krzysztof Choromanski, Jared Davis, and Adrian Weller. UFO-BLO: Unbiased first-order bilevel optimization. arXiv preprint arXiv:2006.03631, 2020.

[37] Bo Liu, Ian Gemp, Mohammad Ghavamzadeh, Ji Liu, Sridhar Mahadevan, and Marek Petrik. Proximal gradient temporal difference learning: Stable reinforcement learning with polynomial sample complexity. Journal of Artificial Intelligence Research, 63:461–494, 2018.

[38] Risheng Liu, Pan Mu, Xiaoming Yuan, Shangzhi Zeng, and Jin Zhang. A generic first-order algorithmic framework for bi-level programming beyond lower-level singleton. arXiv preprint arXiv:2006.04045, 2020.

[39] Zhi-Quan Luo, Jong-Shi Pang, and Daniel Ralph. Mathematical Programs with Equilibrium Constraints. Cambridge University Press, 1996.

[40] Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S Sutton. Toward off-policy learning control with function approximation. In International Conference on Machine Learning, 2010.

[41] Akshay Mehra and Jihun Hamm. Penalty method for inversion-free deep bilevel optimization. In Asian Conference on Machine Learning, 2019.

[42] Abdelkader Mokkadem, Mariane Pelletier, et al. Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. Annals of Applied Probability, 16(3):1671–1702, 2006.

[43] Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.

[44] Fabian Pedregosa. Hyperparameter optimization with approximate gradient. In International Conference on Machine Learning, pages 737–746. PMLR, 2016.

[45] Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.

[46] Aniruddh Raghu, Maithra Raghu, Samy Bengio, and Oriol Vinyals. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. In International Conference on Learning Representations, 2019.

[47] Aravind Rajeswaran, Chelsea Finn, Sham Kakade, and Sergey Levine. Meta-learning with implicit gradients. 2019.

[48] Shoham Sabach and Shimrit Shtern. A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2):640–660, 2017.

[49] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[50] Amirreza Shaban, Ching-An Cheng, Nathan Hatch, and Byron Boots. Truncated back-propagation for bilevel optimization. In Artificial Intelligence and Statistics, pages 1723–1732, 2019.

[51] H. von Stackelberg. The theory of market economy. Oxford University Press, 1952.

[52] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

[53] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.

[54] Richard S Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning, pages 993–1000, 2009.

[55] Csaba Szepesvári. Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 4(1):1–103, 2010.

[56] L. Vicente, G. Savard, and J. Judice. Descent approaches for quadratic bilevel programming. Journal of Optimization Theory and Applications, 81:379–399, 1994.

[57] Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017.

[58] D. J. White and G. Anandalingam. A penalty function approach for solving bi-level linear programs. Journal of Global Optimization, 3:397–419, 1993.

[59] Yue Wu, Weitong Zhang, Pan Xu, and Quanquan Gu. A finite time analysis of two time-scale actor-critic methods. In NeurIPS, 2020.

[60] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

[61] Tengyu Xu, Zhe Wang, and Yingbin Liang. Non-asymptotic convergence analysis of two time-scale (natural) actor-critic algorithms. arXiv preprint arXiv:2005.03557, 2020.

[62] Lin Yang and Mengdi Wang. Sample-optimal parametric Q-learning using linearly additive features. In International Conference on Machine Learning, pages 6995–7004. PMLR, 2019.
