On the Convergence of Learning-based Iterative Methods for Nonconvex Inverse Problems

Risheng Liu, Member, IEEE, Shichao Cheng, Yi He, Xin Fan, Senior Member, IEEE, Zhouchen Lin, Fellow, IEEE, and Zhongxuan Luo

Abstract—Numerous tasks at the core of statistics, learning and vision are specific cases of ill-posed inverse problems. Recently, learning-based (e.g., deep) iterative methods have been empirically shown to be useful for these problems. Nevertheless, integrating learnable structures into iterations is still a laborious process, which can only be guided by intuition or empirical insights. Moreover, there is a lack of rigorous analysis of the convergence behaviors of these reimplemented iterations, so the significance of such methods remains somewhat unclear. This paper moves beyond these limits and proposes the Flexible Iterative Modularization Algorithm (FIMA), a generic and provable paradigm for nonconvex inverse problems. Our theoretical analysis reveals that FIMA allows us to generate globally convergent trajectories for learning-based iterative methods. Meanwhile, the devised scheduling policies on flexible modules should also be beneficial for classical numerical methods in the nonconvex scenario. Extensive experiments on real applications verify the superiority of FIMA.

Index Terms—Nonconvex optimization, Learning-based iteration, Global convergence, Computer vision.


1 INTRODUCTION

In applications throughout statistics, machine learning and computer vision, one is often faced with the challenge of solving ill-posed inverse problems. In general, the basic inverse problem leads to a discrete linear system of the form T(x) = y + n, where x ∈ R^D is the latent variable to be estimated, T denotes some given linear operation on x, and y, n ∈ R^D are the observation and an unknown error term, respectively. Typically, these inverse problems can be addressed by solving the composite minimization model:

min_x Ψ(x) := f(x; T, y) + g(x),  (1)

where f is the fidelity term that captures the loss of data fitting, and g refers to the prior that promotes the desired distribution on the solution. Recent studies illustrate that many problems (e.g., image deconvolution, matrix factorization and dictionary learning) naturally need to be solved in the nonconvex scenario. This trend motivates us to investigate Nonconvex Inverse Problems (NIPs) in the form of Eq. (1), under the practical configuration that f is continuously differentiable, g is nonsmooth, and both f and g are possibly nonconvex.

• R. Liu, Y. He, and X. Fan are with the DUT-RU International School of Information Science & Engineering, Dalian University of Technology, and also with the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian 116024, China. E-mail: {rsliu, xin.fan}@dlut.edu.cn, [email protected].
• S. Cheng is with the School of Mathematical Sciences, Dalian University of Technology, and also with the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, Dalian 116024, China. E-mail: [email protected].
• Z. Lin is with the Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China, and also with the Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai 200240, China. E-mail: [email protected].
• Z. Luo is with the DUT-RU International School of Information Science & Engineering, Dalian University of Technology, the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, and the School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, China, and also with the Institute of Artificial Intelligence, Guilin University of Electronic Technology, Guilin 541004, China. E-mail: [email protected].

Manuscript received April 19, 2018; revised August 26, 2015.

Over the past decades, a broad class of first-order methods has been developed to solve special instances of Eq. (1). For example, by integrating Nesterov's acceleration [1] into the fundamental Proximal Gradient (PG) scheme, the Accelerated Proximal Gradient (APG, a.k.a. FISTA [2]) method was initially proposed to solve convex models in the form of Eq. (1) for different applications, such as image restoration [2], image deblurring [3], and sparse/low-rank learning [4]. While these APGs generate a sequence of objective values that may oscillate [2], [5] developed a variant of APG that guarantees the monotonicity of the sequence. For nonconvex energies in Eq. (1), Li and Lin [6] investigated a monotone APG (mAPG) and proved its convergence under the Kurdyka-Łojasiewicz (KŁ) assumption [7]. The work in [8] developed another variant of APG (APGnc) for nonconvex problems, but the original analysis only characterized its fixed-point convergence. Recently, Li et al. [9] also proved the subsequence convergence of APGnc and estimated its convergence rates by further exploiting the KŁ property.

Unfortunately, even with some theoretically proven convergence properties, these classical numerical solvers may still fail in real-world scenarios. This is mainly because the abstractly designed and fixed updating schemes exploit neither the particular structure of the problem at hand nor the input data distribution [10].

In recent years, various learning-based strategies [11], [12], [13], [14], [15] have been proposed to address practical inverse problems in the form of Eq. (1). These methods first introduced hyperparameters into the classical numerical solvers and then performed discriminative learning on collected training data to obtain data-specific (but possibly inconsistent) iteration schemes. Inspired by the success of deep learning in different application fields, some preliminary studies considered handcrafted network architectures as implicit priors (a.k.a. deep priors) for inverse problems. Following this perspective, various deep priors have been designed and nested into numerical iterations [16], [17], [18]. Alternatively, the works in [19] and [20] addressed the iteration learning issue from the perspectives of deep reinforcement learning and recurrent learning, respectively.

Nevertheless, existing hyperparameter learning approaches can only build iterations based on specific energy forms (e.g., ℓ1-penalties and MRFs), so they are inapplicable to more generic inverse problems. Meanwhile, due to the severe inconstancy of the parameters across iterations, rigorous analysis of the resulting trajectories is also missing. Deep iterative methods have been applied to many learning and vision problems in practice. However, due to the complex network structures, few or even no results have been established on the convergence behaviors of these methods. In summary, the lack of strict theoretical investigation is one of the most fundamental limitations of prevalent learning-based iterative methods, especially in the challenging nonconvex scenario.

To break the limits of prevalent approaches, this paper explores the Flexible Iterative Modularization Algorithm (FIMA), a generic and convergent algorithmic framework that combines learnable architectures (e.g., mainstream deep networks) with principled knowledge (formulated by mathematical models) to tackle the challenging NIPs in Eq. (1). Specifically, derived from the fundamental forward-backward updating mechanism, FIMA replaces the specific calculations corresponding to the fidelity and prior in Eq. (1) with two user-specified (learnable) computational modules. A series of theoretical investigations is established for FIMA. For example, we first prove the subsequence convergence of FIMA with an explicit momentum policy (called eFIMA), which is as good as that of mathematically designed nonconvex proximal methods with Nesterov's acceleration (e.g., the various APGs in [6], [8], [9]). By introducing a carefully devised error-control policy (i.e., an implicit momentum policy, called iFIMA), we further strengthen the results and obtain a globally convergent Cauchy sequence for Eq. (1). We prove that this guarantee is also preserved for FIMA with multiple blocks of unknown variables (called mFIMA). As a nontrivial byproduct, we finally show how to specify the modules of FIMA for challenging inverse problems in the low-level vision area (e.g., non-blind and blind image deconvolution). Our primary contributions are summarized as follows:

1) FIMA provides a generic framework that unifies almost all existing learning-based iterative methods, as well as a series of scheduling policies that make it possible to develop theoretically convergent learning-based iterations for challenging nonconvex inverse problems in the form of Eq. (1).

2) Even with highly flexible (learnable) iterations, the convergence guarantees obtained by FIMA are still as good as (eFIMA) or better than (iFIMA) those of prevalent mathematically designed nonconvex APGs. It is thus worth noting that our devised scheduling policies, together with the flexible algorithmic structures, should also be beneficial for classical nonconvex algorithms.

3) FIMA also provides a practical and effective ensemble of domain knowledge and sophisticated learned data distributions for real applications. Thus we can bring together the expressive power of knowledge-based and data-driven methodologies to yield state-of-the-art performance on challenging low-level vision tasks.

2 RELATED WORK

2.1 Classical First-order Numerical Solvers

We first briefly review a group of classical first-order algorithms, which have been widely used to solve inverse problems. The gradient descent (GD) scheme on a differentiable function f can be reformulated as minimizing the following quadratic approximation of f at a given point v with step size γ > 0, i.e., Q_γ^f(x; v) := f(v) + ⟨∇f(v), x − v⟩ + (1/(2γ))‖x − v‖². As for the nonsmooth function g, its proximal mapping (PM) with parameter γ > 0 can be defined as prox_{γg}(v) = arg min_x g(x) + (1/(2γ))‖x − v‖². So it is natural to consider PG as a cascade of GD (on f) and PM (on g), or equivalently as optimizing the quadratic approximation of Eq. (1), i.e., x^{k+1} ∈ arg min_x g(x) + Q_{γ_k}^f(x; v^k), where v^k is some calculated variable at the k-th iteration. Thus most prevalent proximal schemes can be summarized as

v^k = x^k,  (A-1)
v^k = x^k + β_k(x^k − x^{k−1}),  (A-2)

x^{k+1} ∈ prox_{γ_k g}(v^k − γ_k ∇f(v^k)),  (B-1)
x^{k+1} ∈ prox_{γ_k g}^{ε^k}(v^k − γ_k ∇f(v^k + e^k)),  (B-2)

where ε^k and e^k in (B-2) denote the errors in the PM and GD calculations, respectively [21]. Within this general scheme, we first obtain the original PG by setting v^k = x^k (i.e., (A-1)) and computing the PM in (B-1) [2]. Using Nesterov's acceleration [1] (i.e., (A-2) with β_k > 0), we have the well-known APG method [2], [6], [9]. Moreover, by introducing ε^k and e^k to respectively capture the inexactness of PM and GD (i.e., (B-2)), we actually obtain inexact PG and APG for both convex [22] and nonconvex [21] problems. Notice that in the nonconvex scenario, most classical APGs can only guarantee subsequence convergence to the critical points [6], [9].
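To make the scheme above concrete, the following minimal Python sketch instantiates (A-1)/(A-2) with the exact backward step (B-1) on a toy problem. The quadratic fidelity, the ℓ1 prior and its soft-thresholding proximal mapping are illustrative assumptions, not the models used later in this paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal mapping of t * ||x||_1 (an example of a proximable prior g)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def forward_backward(grad_f, prox_g, x0, step, n_iter=200, use_momentum=True):
    """Scheme (A-1)/(A-2) + (B-1): plain PG or Nesterov-accelerated APG."""
    x_prev, x = x0.copy(), x0.copy()
    t_prev = 1.0
    for _ in range(n_iter):
        if use_momentum:                                   # (A-2): v^k = x^k + beta_k (x^k - x^{k-1})
            t = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2))
            v = x + ((t_prev - 1.0) / t) * (x - x_prev)
            t_prev = t
        else:                                              # (A-1): v^k = x^k
            v = x
        x_prev = x
        x = prox_g(v - step * grad_f(v), step)             # (B-1): proximal (backward) step
    return x

# Toy instance: f(x) = 0.5 * ||Ax - y||^2, g(x) = lam * ||x||_1 (both illustrative).
rng = np.random.default_rng(0)
A, y, lam = rng.standard_normal((20, 50)), rng.standard_normal(20), 0.1
grad_f = lambda x: A.T @ (A @ x - y)
step = 1.0 / np.linalg.norm(A, 2) ** 2                     # 0 < step <= 1/L with L = ||A^T A||_2
x_hat = forward_backward(grad_f, lambda v, s: soft_threshold(v, lam * s), np.zeros(50), step)
```

Setting `use_momentum=False` recovers plain PG, and the step size is chosen below 1/L as the scheme requires.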

2.2 Learning-based Iterative Methods

In [11], a trained version of FISTA (called LISTA) was introduced to approximate the solution of LASSO. [10], [23] extended LISTA to more generic sparse coding tasks and provided an adaptive acceleration. Unfortunately, LISTA is built on convex ℓ1 regularization and thus may not be applicable to other complex nonconvex inverse problems (e.g., with an ℓ0 prior). By introducing hyperparameters into MRFs and solving the resulting variational model with different iteration schemes, various learning-based iterative methods have been proposed for inverse problems in the image domain (e.g., denoising, super-resolution, and MRI imaging). For example, [13], [14], [15], [24], [25] have considered half-quadratic splitting, gradient descent, the Alternating Direction Method of Multipliers (ADMM) and primal-dual methods, respectively. But their parameterizations are completely based on MRF priors. Even worse, the original convergence properties are lost in these resulting iterations.

To better model complex image degradations, [16], [17], [18] considered Convolutional Neural Networks (CNNs) as implicit priors for image restoration. Since these methods discard the regularization term in Eq. (1), one cannot enforce principled constraints on their solutions. It is also unclear when and where these iterative trajectories should stop. Another group of very recent works [19], [20] directly formulated the descent directions from a reinforcement learning perspective or using recurrent networks. However, due to their high computational budgets, they can only be applied to relatively simple tasks (e.g., linear regression). Besides, due to the complex topological network structures, it is extremely hard to provide strict theoretical analysis for these methods.

3 THE PROPOSED ALGORITHMS

This section develops the Flexible Iterative Modularization Algorithm (FIMA) for nonconvex inverse problems in Eq. (1). The convergence behaviors are also investigated accordingly. Hereafter, some fairly loose assumptions are imposed on Eq. (1): f is proper and Lipschitz smooth (with modulus L) on a bounded set, g is proper, lower semi-continuous and proximable¹, and Ψ is coercive. Notice that the proofs and definitions are deferred to the Supplementary Materials.

3.1 Abstract Iterative Modularization

As summarized in Sec. 2.1, a large number of first-order methods can be written as forward-backward-type iterations. This motivates us to consider the following even more abstract updating principle:

x^{k+1} = A_g ∘ A_f(x^k),  (2)

where A_f and A_g respectively stand for the user-specified modules for f and g, and ∘ denotes operation composition. Building upon this formulation, it is easy to see that designing a learning-based iterative method reduces to the problem of iteratively specifying and learning A_f and A_g.

It is straightforward to see that most prevalent approaches [13], [14], [15], [16], [17], [18], [24] naturally fall into this general formulation. Nevertheless, it is currently still impossible to provide any strict theoretical results for practical trajectories of Eq. (2). This is mainly due to the lack of efficient mechanisms to control the propagations generated by these handcrafted operations. Fortunately, in the following, we will introduce different scheduling policies to automatically guide the iterations in Eq. (2), resulting in a series of theoretically convergent learning-based iterative methods.

3.2 Explicit Momentum: A Straightforward Strategy

The momentum of the objective values is one of the most important properties of numerical iterations. This property is also necessary for analyzing the convergence of some classical algorithms. Inspired by these points, we present an explicit momentum FIMA (eFIMA) (i.e., Alg. 1), in which we explicitly compare Ψ(u^k) and Ψ(x^k) and choose the variable with the smaller objective value as our monitor (denoted as v^k). Finally, a proximal refinement is performed to adjust the learning-based update at each stage.

1. The function g is proximable if min_x g(x) + (γ/2)‖x − y‖² can be easily solved for any given y and γ > 0.

Algorithm 1 Explicit Momentum FIMA (eFIMA)
Require: x^0, A = {A_g, A_f}, and 0 < γ_k < 1/L.
1: while not converged do
2:   u^k = A_g ∘ A_f(x^k).
3:   if Ψ(u^k) ≤ Ψ(x^k) then
4:     v^k = u^k.
5:   else
6:     v^k = x^k.
7:   end if
8:   x^{k+1} ∈ prox_{γ_k g}(v^k − γ_k ∇f(v^k)).
9: end while
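For readers who prefer code, a minimal Python sketch of Alg. 1 follows; `A_f`, `A_g`, `grad_f`, `prox_g` and `Psi` are hypothetical callables supplied by the user (the learnable modules, the fidelity gradient, the prior's proximal mapping and the objective Ψ), so this is a schematic reading of the pseudocode rather than the authors' implementation.

```python
def efima(x0, A_f, A_g, grad_f, prox_g, Psi, gamma, n_iter=100):
    """Explicit momentum FIMA (Alg. 1): accept A_g(A_f(x)) only when it decreases Psi."""
    x = x0
    for _ in range(n_iter):
        u = A_g(A_f(x))                               # Step 2: plug-and-play modules
        v = u if Psi(u) <= Psi(x) else x              # Steps 3-7: explicit momentum monitor
        x = prox_g(v - gamma * grad_f(v), gamma)      # Step 8: proximal refinement
    return x
```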

The following theorem first verifies the sufficient descent of {Ψ(x^k)}_{k∈N} and then establishes the subsequence convergence of eFIMA. Notably, these results do not rely on any specific choices of A_f and A_g.

Theorem 1. Let {x^k}_{k∈N} be the sequence generated by eFIMA. Then at the k-th iteration, there exists a sequence {α_k | α_k > 0}_{k∈N}, such that

Ψ(x^{k+1}) ≤ Ψ(v^k) − α_k‖x^{k+1} − v^k‖²,  (3)

where v^k is the monitor in Alg. 1. Furthermore, {x^k}_{k∈N} is bounded and any of its accumulation points is a critical point of Ψ(x) in Eq. (1).

Based on Theorem 1 and assuming Ψ is a semi-algebraic function², the convergence rate of eFIMA can be straightforwardly estimated as follows.

Corollary 1. Let φ(s) = (t/θ)s^θ be a desingularizing function with a constant t > 0 and a parameter θ ∈ (0, 1] [27]. Then {x^k}_{k∈N} generated by eFIMA converges after finitely many iterations if θ = 1. Linear and sub-linear rates are obtained by choosing θ ∈ [1/2, 1) and θ ∈ (0, 1/2), respectively.

3.3 Implicit Momentum via Error Control

Indeed, even with the explicit momentum schedule, we may still not obtain a globally convergent iteration. This is mainly because there is no policy to efficiently control the inexactness of the user-specified modules (i.e., A). In this subsection, we show how to address this issue by controlling the first-order optimality error during the iterations.

Specifically, we consider the auxiliary function of Ψ at x^k (denoted as Ψ_k) and denote its sub-differential (denoted as d^x_{Ψ_k})³ as

Ψ_k(x) = f(x) + g(x) + (μ_k/2)‖x − x^k‖²,
d^x_{Ψ_k} = d^x_g + ∇f(x) + μ_k(x − x^k) ∈ ∂Ψ_k(x),  (4)

where μ_k > 0 is a penalty parameter and d^x_g ∈ ∂g(x).

As shown in Alg. 2, at stage k, a variable ū^k is obtained by proximally minimizing Ψ_k at u^k (i.e., Step 3 of Alg. 2).

2. Indeed, a variety of functions (e.g., the indicator function of a polyhedral set, the ℓ0 penalty and rational ℓp penalties) satisfy the semi-algebraic property [26].

3. Strictly speaking, ∂Ψ_k(x) is the so-called limiting Fréchet sub-differential. We state its formal definition and propose a practical computation scheme for d^{ū^k}_{Ψ_k} in the Appendix.

Algorithm 2 Implicit Momentum FIMA (iFIMA)
Require: x^0, A = {A_g, A_f}, 0 < 2C_k < μ_k < ∞, and 0 < γ_k < 1/L.
1: while not converged do
2:   u^k = A_g ∘ A_f(x^k).
3:   ū^k ∈ prox_{γ_k g}(u^k − γ_k(∇f(u^k) + μ_k(u^k − x^k))).
4:   if ‖d^{ū^k}_{Ψ_k}‖ ≤ C_k‖ū^k − x^k‖ then
5:     v^k = ū^k.
6:   else
7:     v^k = x^k.
8:   end if
9:   x^{k+1} ∈ prox_{γ_k g}(v^k − γ_k ∇f(v^k)).
10: end while

Roughly, this new variable is an ensemble of the last updated x^k and the output u^k of the user-specified A, following the specific proximal structure in Eq. (1). The monitor is then obtained by checking the boundedness of d^{ū^k}_{Ψ_k}. Notice that the constant C_k actually reveals our tolerance to the inexactness of A at the k-th iteration.

Proposition 1. Let {x^k, ū^k, v^k}_{k∈N} be the sequences generated by Alg. 2. Then there exist two sequences {α_k | α_k > 0}_{k∈N} and {β_k | β_k > 0}_{k∈N}, such that the inequality (3) in Theorem 1 and Ψ(ū^k) ≤ Ψ(x^k) − β_k‖ū^k − x^k‖² are respectively satisfied.

Equipped with Proposition 1, it is straightforward to guarantee that the objective values generated by Alg. 2 (i.e., {Ψ(x^k)}_{k∈N}) also have sufficient descent. Hence we call this version of FIMA implicit momentum FIMA (iFIMA). The global convergence of iFIMA is then proved as follows.

Theorem 2. Let {x^k}_{k∈N} be the sequence generated by iFIMA. Then {x^k}_{k∈N} is bounded and any of its accumulation points is a critical point of Ψ. If Ψ is semi-algebraic, we further have that {x^k}_{k∈N} is a Cauchy sequence, and thus it globally converges to a critical point of Ψ(x) in Eq. (1).

Indeed, based on Theorem 2, it is also easy to obtain the same convergence rate for iFIMA as that in Corollary 1.

3.3.1 Practical Calculation of d^{ū^k}_{Ψ_k} in iFIMA

Here we propose a practical calculation scheme for d^{ū^k}_{Ψ_k} ∈ ∂Ψ_k(ū^k), which is defined in Eq. (4) and used in Alg. 2. In fact, it is challenging to calculate d^{ū^k}_{Ψ_k} directly, since the sub-differential d^{ū^k}_g is often intractable in the nonconvex scenario [28]. Fortunately, the following analysis provides an efficient practical calculation scheme for d^{ū^k}_{Ψ_k} within the FIMA framework. Specifically, since ū^k solves Step 3 of Alg. 2, we have

ū^k ∈ prox_{γ_k g}(u^k − γ_k(∇f(u^k) + μ_k(u^k − x^k)))
    = arg min_u (1/(2γ_k))‖u − (u^k − γ_k(∇f(u^k) + μ_k(u^k − x^k)))‖² + g(u).

By the first-order optimality condition, we further have

0 ∈ ∂g(ū^k) + (1/γ_k)(ū^k − (u^k − γ_k(∇f(u^k) + μ_k(u^k − x^k))))
⇒ −(∇f(u^k) + μ_k(u^k − x^k)) + (1/γ_k)(u^k − ū^k) ∈ ∂g(ū^k).

Recalling the definition of d^x_{Ψ_k} in Eq. (4) and d^x_g ∈ ∂g(x), we finally obtain the practical calculation scheme for d^{ū^k}_{Ψ_k}:

d^{ū^k}_{Ψ_k} = d^{ū^k}_g + ∇f(ū^k) + μ_k(ū^k − x^k)
             = (μ_k − 1/γ_k)(ū^k − u^k) − (∇f(u^k) − ∇f(ū^k)).
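Putting Alg. 2 and the closed form above together, one possible Python sketch of iFIMA is given below; as before, `A_f`, `A_g`, `grad_f` and `prox_g` are hypothetical user-supplied callables, and fixed γ, μ and C are assumed for simplicity.

```python
import numpy as np

def ifima(x0, A_f, A_g, grad_f, prox_g, gamma, mu, C, n_iter=100):
    """Implicit momentum FIMA (Alg. 2) with the error-control check of Sec. 3.3.1."""
    x = x0
    for _ in range(n_iter):
        u = A_g(A_f(x))                                                  # Step 2
        u_bar = prox_g(u - gamma * (grad_f(u) + mu * (u - x)), gamma)    # Step 3
        # Closed-form first-order optimality error of Psi_k at u_bar (derived above).
        d = (mu - 1.0 / gamma) * (u_bar - u) - (grad_f(u) - grad_f(u_bar))
        v = u_bar if np.linalg.norm(d) <= C * np.linalg.norm(u_bar - x) else x   # Steps 4-8
        x = prox_g(v - gamma * grad_f(v), gamma)                         # Step 9
    return x
```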

3.4 Multi-block Extension

Algorithm 3 Multi-block FIMA (mFIMA)
Require: X^0, A = {A_{g_1}, ..., A_{g_N}, A_f}, 0 < 2C_n^k < μ_n^k < ∞, and 0 < γ_n^k < 1/L_n.
1: while not converged do
2:   for n = 1 : N do
3:     u_n^k = A_{g_n} ∘ A_f(X^{k+1}_{[<n]}, X^k_{[≥n]}).
4:     ū_n^k ∈ prox_{γ_n^k g_n}(u_n^k − γ_n^k(∇_n f(X^{k+1}_{[<n]}, u_n^k, X^k_{[>n]}) + μ_n^k(u_n^k − x_n^k))).
5:     if ‖d^{ū_n^k}_{Ψ_n^k}‖ ≤ C_n^k‖ū_n^k − x_n^k‖ then
6:       v_n^k = ū_n^k.
7:     else
8:       v_n^k = x_n^k.
9:     end if
10:    x_n^{k+1} ∈ prox_{γ_n^k g_n}(v_n^k − γ_n^k ∇_n f(X^{k+1}_{[<n]}, v_n^k, X^k_{[>n]})).
11:  end for
12: end while
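To unpack the block notation, a schematic Python sketch of one outer mFIMA sweep is given below. The interface `f_grad(blocks, n)` (returning ∇_n f at the given list of blocks), the per-block module lists and the fixed per-block parameters are all our assumptions for illustration.

```python
import numpy as np

def grad_block(f_grad, X, n, z):
    """Partial gradient of f w.r.t. block n, with block n of X replaced by z (hypothetical helper)."""
    Xz = list(X)
    Xz[n] = z
    return f_grad(Xz, n)

def mfima_sweep(X, A_f, A_g, f_grad, prox_g, gamma, mu, C):
    """One Gauss-Seidel sweep of mFIMA (Alg. 3); X is a list of N block variables."""
    for n in range(len(X)):
        # Blocks 0..n-1 already hold their new (k+1) values, blocks n..N-1 the old (k) ones,
        # which is exactly the (X^{k+1}_{[<n]}, X^k_{[>=n]}) bookkeeping of Alg. 3.
        u = A_g[n](A_f(X, n))                                                    # Step 3
        u_bar = prox_g[n](u - gamma[n] * (grad_block(f_grad, X, n, u)
                                          + mu[n] * (u - X[n])), gamma[n])       # Step 4
        d = (mu[n] - 1.0 / gamma[n]) * (u_bar - u) \
            - (grad_block(f_grad, X, n, u) - grad_block(f_grad, X, n, u_bar))    # error control
        v = u_bar if np.linalg.norm(d) <= C[n] * np.linalg.norm(u_bar - X[n]) else X[n]
        X[n] = prox_g[n](v - gamma[n] * grad_block(f_grad, X, n, v), gamma[n])   # Step 10
    return X
```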

In order to tackle inverse problems with several blocks of unknown variables (e.g., blind deconvolution and dictionary learning), we now discuss how to extend FIMA to multi-block NIPs, formulated as T(X) = y + n, where X = {x_n}_{n=1}^N ∈ R^{D_1} × · · · × R^{D_N} is a set of N ≥ 2 unknown variables to be estimated. Notice that here T should be some given linear operation on X. The inference of such a problem can be addressed by solving

min_X Ψ(X) := f(X; T, y) + Σ_{n=1}^N g_n(x_n),  (5)

where f(X) : R^{D_1} × · · · × R^{D_N} → (−∞, +∞] is still differentiable and each g_n(x_n) : R^{D_n} → (−∞, +∞] may be nonsmooth and possibly nonconvex. Here both f and the block-wise g_n(x_n) follow the same assumptions as in Eq. (1), and f should also satisfy the generalized Lipschitz smoothness property on bounded subsets of R^{D_1} × · · · × R^{D_N}. For ease of presentation, we denote X_{[<n]} = {x_i}_{i=1}^{n−1}, X_{[≤n]} = {x_i}_{i=1}^{n}, and the subscripts [> n] and [≥ n] are defined in the same manner. Then we summarize the main iterations of multi-block FIMA (mFIMA) as follows⁴:

u_n^k = A_{g_n} ∘ A_f(X^{k+1}_{[<n]}, X^k_{[≥n]}),
x_n^{k+1} ∈ prox_{γ_k g_n}(v_n^k − γ_k ∇_n f(X^{k+1}_{[<n]}, v_n^k, X^k_{[>n]})).

Here v_n^k is the monitor of x_n^k, obtained by the same error-control strategy as in iFIMA. We summarize the complete multi-block FIMA in Alg. 3 and prove the convergence of mFIMA in Corollary 2.

4. Due to space limits, the details of mFIMA are presented in the Supplemental Material.


Corollary 2. Let {X^k}_{k∈N} be the sequence generated by mFIMA. Then we have the same convergence properties for {X^k}_{k∈N} as those in Theorem 2 and Corollary 1.

Notice that we adopt the error-control policy of iFIMA to guide the iterations of mFIMA.

4 DISCUSSIONS

Here we would like to discuss some important aspects of our FIMA, including the differences from traditional first-order methods, learning-based iterative methods, and existing meta-learning techniques. We also discuss the relation between our eFIMA and iFIMA. For convenience, we group the components of FIMA into three categories: the flexible modules A_g ∘ A_f, the monitoring criteria (Steps 3-7 in Alg. 1 and Steps 3-8 in Alg. 2), and the PG operator.

4.1 FIMA vs. Traditional First Order Methods

We first point out that, by specifying the user-specified modules A_f and A_g as numerical calculations, our FIMA reduces to some traditional first-order numerical algorithms (e.g., PG/APG/PALM) when the criteria and the PG operator are ignored. Specifically, by respectively specifying A_f and A_g as the gradient descent and proximal operators, our e/iFIMA becomes the standard PG [29]. Our e/iFIMA also reduces to the standard APG [2] when adopting Nesterov's acceleration and the proximal gradient operator as A_f and A_g, respectively. Considering the two-block case of mFIMA, our method becomes PALM [26] when configured with the same strategy as PG. In these degenerate cases, we obtain the same convergence results as existing PG/APG.

When retaining the criteria and the PG operator in FIMA, e/iFIMA become inexact APGs by setting A_g ∘ A_f similarly to PG. For example, eFIMA becomes the monotone APG [6] when A_f and A_g are arranged as Nesterov's acceleration and the proximal gradient operator, respectively. Furthermore, when transplanting these arrangements into our iFIMA framework, we obtain a new inexact APG method, whose convergence guarantees are even stronger than those of the prevalent nonconvex APGs [6], [8]. This actually suggests that our devised error-control policy, together with the flexible algorithmic structures, should also be beneficial for inexact nonconvex algorithms.

Moreover, our experimental results demonstrate that, thanks to the plug-and-play architectures, FIMA achieves much better numerical performance and final results than these traditional first-order numerical methods, especially in real-world applications. Indeed, FIMA provides a generic, flexible and theoretically guaranteed way to extend standard numerical methods with plug-and-play architectures.

4.2 FIMA vs. Learning-based Iterative Methods

As discussed above, most existing learning-based iterative methods simply replace their numerical computations with trained architectures, which directly leads to the loss of the necessary theoretical guarantees. Fortunately, within the FIMA framework, we prove in Corollary 1 and Theorem 2 that our newly proposed learning-based scheme does not depend on the particular choices of A_f and A_g in general. It actually provides a unified methodology to analyze and improve the convergence of learning-based methods. Within FIMA, we can provide an easily implemented and strictly convergent way to extend almost all the learning-based methods reviewed in Sec. 2.2. For example, we can take the gradient descent operator and the encoder architecture in LISTA [11] as A_f and A_g, respectively, and then cascade the explicit or implicit momentum of our algorithms to build convergence-guaranteed iterations. Similarly, we can regard the learnable priors in MRF [24] as A_g and the remaining part as A_f to generate our eFIMA or iFIMA. When designing the solution of the fidelity subproblem as A_f and exploring the data distribution by a denoising CNN [17] as A_g, we also provide a convergence guarantee for these plug-and-play learnable iterations. The criterion in our algorithms actually provides guidance to judge whether the output of each iteration of a learning-based method is reasonable. Thus, almost all learning-based methods can be made strictly convergent with little extra effort under our algorithmic framework.

4.3 FIMA vs. Existing Meta-learning Techniques

Meta-learning (a.k.a. "learning to learn") aims to design methods that can learn how to learn new tasks by reusing previous experience, rather than considering each new task in isolation [30], [31], [32]. It should be noticed that FIMA can also be categorized as a specific meta-learning technique from the perspective of "learning to optimize". However, compared with existing approaches [20], [33], which learn all the hyperparameters of their optimization processes in heuristic manners and thus lack solid theoretical investigation, the main advantage of our FIMA is that it provides a theoretically guaranteed framework to learn strictly convergent optimization schemes for meta-learning. Please notice, however, that to obtain these convergence results we have to set some algorithmic parameters following the theoretical guidance.

4.4 eFIMA vs. iFIMA

We first clarify that the main difference between eFIMA and iFIMA is the error-control condition. It can be seen that the optimality-based condition in iFIMA is somewhat stricter than the loss-based condition in eFIMA. Thus we obtain better convergence properties, but additional computations are needed at each iteration.

As for the plug-and-play computational modules in e/iFIMA, it will be shown in Sec. 6 that the choices of A_f and A_g do affect the speed and accuracy in practice. Fortunately, the proposed scheduling of learnable and numerical modules is automatically and adaptively adjusted by the conditions of both eFIMA and iFIMA. So for a specific task, FIMA can automatically reject improper A_f and A_g during the iterations.

5 APPLICATIONS

As a nontrivial byproduct, this section illustrates how to apply FIMA to tackle practical inverse problems in the low-level vision area, such as image deconvolution in the standard non-blind and the even more challenging blind scenario.

Non-blind Deconvolution (Uni-block) aims to restore the latent image z from a corrupted observation y with a known blur kernel b. In this part, we utilize the well-known sparse coding formulation [2]: y = Dx + n, where x, D and n are the sparse code, a given dictionary and unknown noise, respectively. Indeed, the form of D is given as D = BW^⊤, where B is the matrix form of b and W^⊤ denotes the inverse of the wavelet transform W (i.e., x = Wz and z = W^⊤x). So, by defining f(x; D, y) = ‖y − Dx‖² and g(x) = λ‖x‖_p (0 ≤ p < 1), we obtain the following special case of Eq. (1):

min_x f(x; D, y) + g(x).  (6)

Now we are ready to design the iterative modules (i.e., A_f and A_g) to optimize the SC model in Eq. (6). With the well-known imaging formulation y = b ⊗ z + n (⊗ denotes the convolution operator), we actually update z by solving A_f(z^k) := arg min_z ‖y − b ⊗ z‖² + τ‖z − z^k‖², which aggregates the principles of the task and the information from the last updated variable, where z^k = W^⊤x^k and τ is a positive constant. Then A_f on x can be defined as A_f(x^k) = W A_f(z^k), i.e.,

A_f(x^k) = W(B^⊤B + τI)^{−1}(B^⊤y + τW^⊤x^k),  (7)

where I is the identity matrix. It is easy to check that A_f can be efficiently calculated by FFT [24]. As for A_g, we consider solving A_g(A_f(z^k)) := arg min_z g(z) + τ‖z − A_f(z^k)‖² by a network that describes the distribution of the latent image.
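Since B is a convolution matrix, the quadratic subproblem defining A_f(z^k) has a closed-form solution that can be evaluated with FFTs. A minimal NumPy sketch is given below; the periodic boundary assumption and the simple kernel centering are ours, and the final wavelet step W(·) of Eq. (7) is omitted.

```python
import numpy as np

def A_f_latent(y, kernel, z_k, tau):
    """Closed-form minimizer of ||y - b (*) z||^2 + tau * ||z - z_k||^2 via the FFT
    (assumes periodic boundary conditions for the blur)."""
    # Embed the kernel into an image-sized array and centre it at the origin.
    b = np.zeros_like(y)
    kh, kw = kernel.shape
    b[:kh, :kw] = kernel
    b = np.roll(b, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    Fb, Fy, Fz = np.fft.fft2(b), np.fft.fft2(y), np.fft.fft2(z_k)
    return np.real(np.fft.ifft2((np.conj(Fb) * Fy + tau * Fz) / (np.abs(Fb) ** 2 + tau)))
```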

Blind Deconvolution (Multi-block) involves the joint estimation of both the latent image z and the blur kernel b, given only an observation y. Here we formulate this problem in the image gradient domain and solve the following special case of Eq. (5) with two unknown variables (x, b)⁵:

min_{x,b} f(x, b; ∇y) + g_x(x) + g_b(b),  (8)

where f(x, b; ∇y) = ‖∇y − b ⊗ x‖², g_x(x) = λ_x‖x‖_0, and g_b(b) = χ_{Ω_b}(b). Here χ_{Ω_b} is the indicator function of the set Ω_b := {b ∈ R^{D_b} : [b]_i ≥ 0, Σ_{i=1}^{D_b} [b]_i = 1}, where [·]_i denotes the i-th element. So the proximal updates in mFIMA corresponding to g_x and g_b can be respectively calculated by hard-thresholding [3] and simplex projection [34]. Here we need to specify three modules (i.e., A_f, A_{g_x} and A_{g_b}) for mFIMA. We first follow a similar idea to the non-blind case to define A_f(x^k, b^k) using the aggregated deconvolution energy

A_f(x^k, b^k) := arg min_{x,b} ‖∇y − b ⊗ x‖² + τ_x‖x − x^k‖² + τ_b‖b − b^k‖²,  (9)

where τ_b and τ_x are positive constants. We then train CNNs in the image gradient domain to formulate A_{g_x}, and solve min_b ‖∇y − b ⊗ x‖² + λ_b‖b‖² by the conjugate gradient method [35] to formulate A_{g_b}.
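Both proximal updates mentioned above are standard and cheap to compute: hard-thresholding for the ℓ0 penalty and Euclidean projection onto the probability simplex for χ_{Ω_b}. A hedged NumPy sketch follows; the sort-based projection is one common scheme consistent with our reading of [34].

```python
import numpy as np

def prox_l0(v, lam, gamma):
    """Hard-thresholding: proximal mapping of gamma * lam * ||x||_0."""
    out = v.copy()
    out[np.abs(v) < np.sqrt(2.0 * lam * gamma)] = 0.0
    return out

def project_simplex(v):
    """Euclidean projection onto {b : b_i >= 0, sum_i b_i = 1} (sort-based scheme)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, v.size + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)
```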

Rain Streaks Removal (Multi-block) is another challenging task, which focuses on removing the sporadic rain streaks r and restoring the rain-free background scene z from several types of visibility-distorted observations o. The observation o can be generated by o = z + r.

5. Notice that in this section x is used with different meanings, i.e., the image gradient in Eq. (8), but the sparse code in Eq. (6).

Fig. 1. Comparisons of FIMA with different A_f^τ (τ ∈ [10^{−4}, 10^1]) and A_g ∈ {A_g^PG, A_g^RF, A_g^TV, A_g^CNN}. The bar charts in the rightmost subfigure compare the overall iteration number and running time (in seconds, "Time(s)" for short).

Considering the different sparsity of the rain streaks and the background image, we formulate the sparse coding model as the minimization of the following energy:

min_{x,c} f(x, c; o) + g_x(x) + g_c(c),  (10)

where f(x, c; o) = ‖o − W^⊤x − W^⊤c‖², g_x(x) = λ_x‖x‖_{0.8} and g_c(c) = λ_c‖c‖_0. Here W is the wavelet transform explained in the non-blind deconvolution part, and x, c are the sparse codes of z and r, respectively (i.e., z = W^⊤x, r = W^⊤c). When applying our multi-block FIMA to solve Eq. (10), we design A_f by a strategy similar to Eq. (9). As for A_{g_x} and A_{g_c}, we realize them with the same network architecture but different training data.

6 EXPERIMENTAL RESULTS

This section conducts experiments to verify our theoretical results and compares the performance of FIMA with other state-of-the-art learning-based iterative methods on real-world inverse problems. All experiments are performed on a PC with an Intel Core i7 CPU at 3.4 GHz, 32 GB RAM and an NVIDIA GeForce GTX 1050 Ti GPU. More results can be found in the Supplemental Materials.

6.1 Non-blind Image Deconvolution

We first evaluate FIMA on solving Eq. (6) for image restoration. The test images are collected by [24], [36], and different levels of Gaussian noise are added to generate our corrupted observations.

Modules Evaluation: First, the influence of different choices of A in FIMA is studied. Following Eq. (7), we adopt A_f^τ with varying τ. As for A_g, different choices are also considered: classical PG (A_g^PG), the Recursive Filter [37] (A_g^RF), Total Variation [38] (A_g^TV) and CNNs (A_g^CNN). For A_g^CNN, we introduce a residual structure x = x + R(x) [39] and define R as a cascade of 7 dilated convolution layers (with filter size 3×3). ReLUs are added between every two linear layers and batch normalizations are used for the 2nd to 6th linear layers. We collect 800 images as training data, of which 400 have been used in [24] and the other 400 are randomly sampled from ImageNet [40]. We set the patch size to 160×160 and feed 10 patches in each epoch. Finally, Adam is adopted and run for 60000 epochs with the learning rate ranging from 1e-4 to 1e-6. Here we adopt strategies similar to [17] to train A_g^CNN for different noise levels; thus, A_g^CNN is trained independently of our iterations. As for the algorithmic parameters, we set γ_k = 0.1, μ_k = 0.2, and C_k = 0.09. Fig. 1 analyzes the contributions of A_f^τ (τ ∈ [10^{−4}, 10^1]) and A_g ∈ {A_g^PG, A_g^RF, A_g^TV, A_g^CNN}. We observe that A_g^TV is relatively better than A_g^PG and A_g^RF, while A_g^CNN performs consistently better and faster than the other strategies. So hereafter we always utilize A_g^CNN in eFIMA and iFIMA. We also observe that, even with different A_g, a relatively large τ in A_f^τ results in analogous quantitative results. Thus we experimentally set τ = 10^{−3} for A_f^τ in eFIMA and iFIMA for all the experiments.

Fig. 2. The iteration curves of FIMA with different settings. The first three subfigures show the objective values, reconstruction errors, and iteration errors, respectively. Subfigure (d) only plots the first 50 iterations to illustrate the scheduling policies of FIMA.

Convergence Behaviors: We then verify the convergence properties of FIMA. The convergence behaviors of each module in our algorithms, as well as of other nonconvex APGs, are considered. To be fair and comprehensive, we adopt a fixed iteration number (K = 80) and an iteration-error criterion (‖x^{k+1} − x^k‖/‖x^k‖ ≤ 10^{−4}) as the stopping rules in Figs. 2 and 3, respectively.

In Figs. 2(a), (b), and (c), we plot the curves of the objective values (log(Ψ(x^k))), the reconstruction errors (log(‖x^{k+1} − x^k‖²/‖x^k‖²)) and the iteration errors for FIMA with different settings. The legends "x", "u", and "u-x" respectively denote that at each iteration we only perform classical PG (i.e., only the last step of Algs. 1 and 2), only the task-driven modules A (i.e., only Eq. (2)), and their naive combination (without any scheduling policies). It can be seen that the function values and reconstruction errors of PG decrease more slowly than those of our FIMA strategies, while both the "u" curve (i.e., naive A_g ∘ A_f) and the "u-x" curve (i.e., A with PG refinement but without the "explicit momentum" or "error-control" policy) exhibit oscillations and do not converge within the first 30 iterations. Moreover, we observe that adding PG to "u" (i.e., "u-x") makes the curve worse rather than correcting it toward a descent direction. This illustrates that such naive combinations indeed break the convergence guarantee.

TABLE 1
The number of iterations (including plug-and-play modules) in FIMA. "No. Iter." reports the number of whole iterations and "No. A" denotes the number of times the plug-and-play modules A_g ∘ A_f have been performed by FIMA during the iterations. We also report the number of iterations for standard PG in the rightmost column.

| Image  | eFIMA No. Iter. | eFIMA No. A | iFIMA No. Iter. | iFIMA No. A | PG No. Iter. |
|--------|-----------------|-------------|-----------------|-------------|--------------|
| Fig. 4 | 13              | 11          | 22              | 21          | 542          |
| Fig. 5 | 19              | 15          | 26              | 25          | 577          |

In contrast, owing to the choice mechanism in our algorithms, both eFIMA and iFIMA provide a reliable variable (v^k) at each iteration that satisfies the convergence condition. We further explore the choice mechanism of FIMA in Fig. 2(d). The "circles" on each curve indicate iterations where the "explicit momentum" or "error-control" policy is satisfied, while the "triangles" denote that only PG is performed at the current stage. It can be seen that the eFIMA strategy is stricter than iFIMA: the judgment policy fails within only about 20 iterations in eFIMA, while it remains in effect for almost 40 iterations in iFIMA. Both eFIMA and iFIMA perform better than the other compared schemes, which verifies the efficiency of our proposed scheduling policies in Sec. 3.

We also compare the iteration behaviors of FIMA with classical nonconvex APGs, including mAPG [6], APGnc [9] and the inexact niAPG [8], on the dataset collected by [24], which consists of 68 images corrupted by different blur kernels with sizes ranging from 17×17 to 37×37. We add 1‰ and 1% Gaussian noise to generate our corrupted observations, respectively. In Fig. 3, the left four subfigures compare curves of iteration errors and reconstruction errors on an example image, and the rightmost one illustrates the averaged iteration numbers and run time on the whole dataset. It can be seen that our eFIMA and iFIMA are faster and better than these abstractly designed classical solvers under the same iteration error (≤ 1e−4). Moreover, we observe that the performance of these nonconvex APGs is not satisfactory when the noise level is larger: their reconstruction errors (Fig. 3(d)) ascend after dozens of steps, while our eFIMA and iFIMA maintain lower reconstruction errors with fewer iterations. This illustrates that our strategy is more stable than traditional nonconvex APGs in image restoration, thanks to the flexible modules and the effective choice mechanism.

In Fig. 4, we illustrate the visual results of eFIMA and iFIMA in comparison with convex image restoration approaches, including FISTA [2] (APG) and FTVd [41] (ADMM), and the nonconvex mAPG, APGnc, and niAPG, on an example image with 1% noise but a large kernel size (i.e., 75×75) [36]. Here FISTA and FTVd solve their original convex models, while mAPG, APGnc, and niAPG are based on the nonconvex model in Eq. (6). We observe that the APGs outperform the original PG, and the inexact niAPG is better than the exact mAPG and APGnc. Since FTVd is specifically designed for this task, it is the best among the classical solvers, but still worse than our FIMA. Overall, iFIMA obtains a higher PSNR than eFIMA, since the error-control mechanism tends to perform more accurate refinements.

Fig. 3. Comparing the iteration behaviors of FIMA with classical nonconvex APGs, including exact ones (mAPG and APGnc) and the inexact niAPG. The left four subfigures compare curves of iteration errors and reconstruction errors at different noise levels (1‰ and 1%), respectively. The rightmost subfigure plots bar charts of the averaged iteration number and "Time(s)" on the dataset [24].

Fig. 4. The non-blind deconvolution performance (1% noise level) of eFIMA and iFIMA compared with convex optimization based algorithms (i.e., FISTA and FTVd) and nonconvex solvers (i.e., APGnc, mAPG, and niAPG). Quantitative scores (PSNR/SSIM): Input (-), PG (24.97/0.79), mAPG (25.67/0.73), APGnc (25.68/0.73), niAPG (26.17/0.78), FISTA (25.03/0.68), FTVd (27.75/0.88), eFIMA (29.04/0.92), iFIMA (29.34/0.92). The rightmost subfigure on the bottom row plots the curves of PSNR and SSIM of our methods.

Fig. 5. The non-blind image deconvolution performance (5% noise level) of FIMA compared with existing plug-and-play type methods (i.e., PPADMM and IRCNN). Panels: Input, PPADMM, IRCNN, eFIMA, iFIMA; the reported quantitative scores (PSNR/SSIM) are (17.6/0.72), (20.96/0.82), (21.18/0.83), (21.23/0.83).

We also analyze the iteration behaviors of FIMA in Tab. 1. We report the number of whole iterations and the number of times the plug-and-play modules A_g ∘ A_f have been performed by FIMA during the iterations. It can be seen that A_g ∘ A_f is performed in most of the iterations. Moreover, thanks to these user-specified modules, FIMA only needs a dozen or twenty iterations to obtain the desired solutions. In contrast, the standard PG method requires more than 500 iterations, and its practical performance is still worse than our FIMA (see Figs. 4 and 5 for comparisons). These results verify the efficiency and effectiveness of the FIMA mechanism in real-world applications.

State-of-the-art Comparisons: We compare FIMA with state-of-the-art image restoration approaches, such as IDDBM3D [42], EPLL [43], PPADMM [25], RTF [44] and IRCNN [17]. Fig. 5 first compares our FIMA with two prevalent learning-based iterative approaches (i.e., PPADMM and IRCNN) on an example image with 5% noise. Tab. 2 then reports the averaged quantitative results of all the compared methods on the image set (collected by [24]) with different levels of Gaussian noise (i.e., 1%, 2%, 3% and 4%). We observe that eFIMA and iFIMA not only outperform classical numerical solvers by a large margin in terms of speed and accuracy, but also achieve better performance than the other state-of-the-art approaches. Within FIMA, the speed of eFIMA is faster, while the PSNR and SSIM of iFIMA are relatively higher. This is mainly because the "error control" strategy tends to perform more refinements than the "explicit momentum" rule during the iterations.


TABLE 2
Averaged PSNR, SSIM and Time(s) on the benchmark image set [24]. Here σ denotes the noise level. The first five columns are state-of-the-art image restoration methods, the next four are classical nonconvex methods, and the last two are ours.

| σ  | Metric  | IDDBM3D | EPLL   | PPADMM | RTF    | IRCNN | PG    | mAPG  | APGnc | niAPG | eFIMA | iFIMA |
|----|---------|---------|--------|--------|--------|-------|-------|-------|-------|-------|-------|-------|
| 1% | PSNR    | 28.83   | 28.67  | 28.01  | 29.12  | 29.78 | 27.32 | 26.68 | 26.69 | 27.24 | 29.81 | 29.85 |
| 1% | SSIM    | 0.81    | 0.81   | 0.78   | 0.83   | 0.84  | 0.71  | 0.67  | 0.67  | 0.73  | 0.85  | 0.85  |
| 1% | Time(s) | 193.13  | 112.03 | 293.99 | 249.83 | 2.67  | 20.36 | 13.02 | 7.16  | 5.29  | 1.89  | 2.06  |
| 2% | PSNR    | 27.60   | 26.79  | 26.54  | 25.58  | 27.90 | 25.61 | 25.20 | 25.28 | 25.63 | 28.02 | 28.06 |
| 2% | SSIM    | 0.76    | 0.74   | 0.72   | 0.66   | 0.78  | 0.63  | 0.60  | 0.61  | 0.64  | 0.79  | 0.79  |
| 2% | Time(s) | 198.66  | 100.52 | 270.45 | 254.26 | 2.68  | 15.43 | 7.70  | 4.66  | 3.30  | 1.90  | 2.07  |
| 3% | PSNR    | 26.72   | 25.68  | 25.78  | 21.18  | 26.81 | 24.63 | 24.39 | 24.48 | 24.76 | 27.05 | 27.07 |
| 3% | SSIM    | 0.72    | 0.69   | 0.68   | 0.42   | 0.73  | 0.57  | 0.55  | 0.56  | 0.61  | 0.74  | 0.75  |
| 3% | Time(s) | 191.25  | 96.32  | 257.94 | 252.47 | 2.68  | 13.89 | 6.44  | 5.37  | 2.63  | 1.89  | 2.07  |
| 4% | PSNR    | 26.06   | 24.88  | 25.27  | 17.95  | 26.10 | 24.05 | 23.88 | 23.95 | 24.14 | 26.20 | 26.37 |
| 4% | SSIM    | 0.69    | 0.65   | 0.66   | 0.28   | 0.70  | 0.54  | 0.53  | 0.53  | 0.59  | 0.70  | 0.72  |
| 4% | Time(s) | 183.44  | 93.82  | 258.45 | 255.84 | 2.67  | 11.99 | 6.01  | 7.82  | 2.35  | 1.89  | 2.07  |

TABLE 3
Averaged quantitative scores on Levin et al.'s benchmark.

| Method         | PSNR  | SSIM | ER   | KS   | Time(s) |
|----------------|-------|------|------|------|---------|
| Perrone et al. | 29.27 | 0.88 | 1.35 | 0.80 | 113.70  |
| Levin et al.   | 29.03 | 0.89 | 1.40 | 0.81 | 41.77   |
| Sun et al.     | 29.71 | 0.90 | 1.32 | 0.82 | 209.47  |
| Zhang et al.   | 28.01 | 0.86 | 1.25 | 0.58 | 37.45   |
| Pan et al.     | 29.78 | 0.89 | 1.33 | 0.80 | 102.60  |
| Ours           | 30.37 | 0.91 | 1.20 | 0.83 | 5.65    |


6.2 Blind Image Deconvolution

Blind deconvolution is known as one of the most challenging low-level vision tasks. Here we evaluate mFIMA on solving Eq. (8) to address this fundamentally ill-posed multi-variable inverse problem. We adopt the same CNN module A_g^CNN as in Sec. 6.1, but train it in the image gradient domain to enhance its ability to detect sharp edges.

In Fig. 6, we show the visual performance of mFIMA in different settings (i.e., with and without A) on an example blurry image from [45]. We observe that mFIMA without A almost fails on this experiment. This is not surprising, since [45], [46] have shown that the standard optimization strategy is likely to lead to degenerate global solutions like the delta kernel (frequently called the no-blur solution), or to many suboptimal local minima. In contrast, the CNN-based modules successfully avoid trivial results and significantly improve the deconvolution performance. We also plot the curves of the quantitative scores (i.e., PSNR for the latent image and Kernel Similarity (KS) for the blur kernel) for these two strategies on the bottom row. As these scores are stable after 20 iterations, we only plot the curves of the first 20 iterations. We then compare mFIMA with state-of-the-art deblurring methods⁶, such as Perrone et al. [47], Levin et al. [45], Sun et al. [46], Zhang et al. [48] and Pan et al. [49], on the most widely used benchmark of Levin et al. [45], which consists of 32 blurred images generated from 4 clean images and 8 blur kernels. Tab. 3 reports the averaged quantitative scores, including PSNR, SSIM, and Error Rate (ER) for the latent image, Kernel Similarity (KS) for the blur kernel, and the overall run time. Fig. 7 further compares the visual performance of mFIMA with Perrone et al., Sun et al. and Pan et al. (i.e., the top 3 in Tab. 3) on a challenging real-world blurry image collected by [36]. It can be seen that mFIMA consistently outperforms all the compared methods both quantitatively and qualitatively, which verifies the efficiency of our proposed learning-based iteration methodology.

6. In this and the following experiments, the widely used multi-scale techniques are adopted for all the compared methods.

Fig. 6. Comparison of mFIMA with and without the module A. The top row compares the visual results of these different strategies (Input, mFIMA without A, mFIMA with A). The bottom row plots the curves of the PSNR and KS scores during the iterations.

In Figs. 8 and 9, we further compare the blind image deconvolution performance of mFIMA with Perrone et al. [47], Sun et al. [46] and Pan et al. [49] (the top 3 among all the compared methods in Tab. 3) on example images corrupted not only by unknown blur kernels, but also by different levels of Gaussian noise (1% and 3% in Figs. 8 and 9, respectively). It can be seen that mFIMA is robust to these corruptions and outperforms all the compared state-of-the-art deblurring methods.


Fig. 7. Visual comparisons between mFIMA and other competitive methods (the top 3 in Tab. 3) on a real blurry image. Panels: Input, Perrone et al., Sun et al., Pan et al., Ours.

Fig. 8. The blind image deconvolution results of mFIMA compared with state-of-the-art approaches on a blurry image with 1% Gaussian noise. Quantitative scores (PSNR / SSIM / KS): Input (-), Perrone et al. (15.96 / 0.49 / 0.80), Sun et al. (17.35 / 0.60 / 0.88), Pan et al. (14.39 / 0.44 / 0.54), mFIMA (18.11 / 0.58 / 0.95).

Fig. 9. The blind image deconvolution results of mFIMA compared with state-of-the-art approaches on a blurry facial image with 3% Gaussian noise. Quantitative scores (PSNR / SSIM / KS): Input (-), Perrone et al. (24.76 / 0.75 / 0.48), Sun et al. (20.48 / 0.56 / 0.32), Pan et al. (28.05 / 0.83 / 0.40), mFIMA (31.25 / 0.87 / 0.89).

6.3 Rain Streaks Removal

To further verify that our method can handle various vision tasks, we report the performance of mFIMA on rain streak removal. As described in Sec. 5, we adopt the same CNN architecture to train the learnable A_{g_x} and A_{g_c}. Note that to train A_{g_c} we feed rainy observations into the network and output the rain streak images, while A_{g_x} is trained with a strategy similar to that used for deconvolution.

First, we report the quantitative scores (PSNR / SSIM) on the widely used synthetic Rain12 dataset [50] and compare with state-of-the-art deraining methods, including SR [51], DSC [52], LP [50], DerainNet [53], DetailNet [54], and UGSM [55]. As shown in Tab. 4, our mFIMA is clearly superior to all of the competing methods. Moreover, we also provide visual results on a challenging real-world rainy image from [52] in Fig. 10. As can be observed, our proposed method removes more rain streaks and preserves more detailed textures than the others.

7 CONCLUSION

This paper provided FIMA, a framework to analyze the convergence behaviors of learning-based iterative methods for nonconvex inverse problems. We proposed two novel mechanisms to adaptively guide the trajectories of learning-based iterations and proved their strict convergence. We also showed how to apply FIMA to real-world applications, such as non-blind image deconvolution, blind image deconvolution, and rain streak removal.

APPENDIX APROOFS

We first give some preliminaries on variational analysis andnonconvex optimization in Sec. A.1. Secs. A.2-A.4 then prove

Page 11: JOURNAL OF LA On the Convergence of Learning …...JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1 On the Convergence of Learning-based Iterative Methods for Nonconvex

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 11

TABLE 4
Averaged PSNR and SSIM on the benchmark image set [50] for rain streaks removal (image index 1-12 and the average).

Method     Metric |   1     2     3     4     5     6     7     8     9    10    11    12   | Avg.
SR         PSNR   | 28.17 29.46 25.55 31.40 26.21 29.05 30.66 27.35 28.71 27.87 26.40 27.97 | 28.23
           SSIM   |  0.83  0.88  0.92  0.93  0.87  0.94  0.94  0.94  0.88  0.89  0.88  0.90 |  0.90
DSC        PSNR   | 28.07 27.81 22.72 32.59 23.26 26.32 29.85 24.95 29.16 26.21 26.94 27.70 | 27.13
           SSIM   |  0.85  0.91  0.84  0.97  0.88  0.96  0.96  0.94  0.90  0.85  0.90  0.87 |  0.90
LP         PSNR   | 32.01 32.81 28.99 33.31 28.10 30.33 32.32 30.26 31.15 30.58 29.08 31.21 | 30.85
           SSIM   |  0.90  0.94  0.95  0.97  0.93  0.98  0.97  0.97  0.94  0.93  0.93  0.95 |  0.95
DerainNet  PSNR   | 28.62 29.58 24.86 33.96 27.22 30.67 33.66 26.75 30.05 27.72 26.22 28.00 | 28.94
           SSIM   |  0.90  0.93  0.92  0.98  0.94  0.97  0.98  0.95  0.94  0.91  0.91  0.93 |  0.94
DetailNet  PSNR   | 32.55 32.97 29.00 36.27 30.03 31.67 36.13 30.82 33.13 31.56 29.20 31.85 | 32.10
           SSIM   |  0.93  0.96  0.95  0.99  0.96  0.98  0.99  0.98  0.97  0.94  0.94  0.96 |  0.96
UGSM       PSNR   | 32.42 34.03 27.85 38.03 29.40 34.92 37.94 28.40 30.28 30.33 29.31 30.80 | 31.98
           SSIM   |  0.93  0.96  0.94  0.99  0.96  0.99  0.99  0.97  0.92  0.93  0.94  0.94 |  0.95
mFIMA      PSNR   | 34.10 36.24 29.46 40.76 31.94 35.10 39.51 35.37 36.66 33.02 31.17 33.41 | 34.73
           SSIM   |  0.95  0.98  0.96  0.99  0.97  0.99  0.99  0.99  0.98  0.96  0.95  0.97 |  0.97

Fig. 10. Rain streaks removal results of mFIMA with comparisons to the state-of-the-art approaches on real-world rainy images. Panels, left to right: Input, DerainNet, DetailNet, UGSM, mFIMA.


A.1 Preliminaries

Definition 1. [56] The necessary function properties, including proper, lower semi-continuous, Lipschitz smooth, and coercive, are summarized as follows. Let f : R^D → (−∞, +∞]. Then we have

• Proper and lower semi-continuous: f is proper if dom f := {x ∈ R^D : f(x) < +∞} is nonempty and f(x) > −∞. f is lower semi-continuous if lim inf_{x→y} f(x) ≥ f(y) at any point y ∈ dom f.

• Coercive: f is said to be coercive if f is bounded from below and f → ∞ as ‖x‖ → ∞, where ‖·‖ is the ℓ2 norm.

• L-Lipschitz smooth (i.e., C^{1,1}_L): f is L-Lipschitz smooth if f is differentiable and there exists L > 0 such that

‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, ∀ x, y ∈ R^D.

If f is L-Lipschitz smooth, we have the following (descent) inequality:

f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖², ∀ x, y ∈ R^D.
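As a quick sanity check of the descent inequality above (purely illustrative; the quadratic f and the sampled points are arbitrary), one can verify it numerically for f(x) = ½‖Ax − b‖², whose gradient is Lipschitz with modulus L = ‖AᵀA‖₂.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def grad_f(x):
    return A.T @ (A @ x - b)

L = np.linalg.norm(A.T @ A, 2)           # Lipschitz modulus of the gradient
x, y = rng.standard_normal(10), rng.standard_normal(10)

lhs = f(x)
rhs = f(y) + grad_f(y) @ (x - y) + 0.5 * L * np.linalg.norm(x - y) ** 2
assert lhs <= rhs + 1e-9                 # descent inequality holds
```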

Definition 2. [7], [56] Let g : R^D → (−∞, +∞] be a proper and lower semi-continuous function. Then we have

• Sub-differential: The Fréchet sub-differential (denoted as ∂̂g) of g at a point x ∈ dom g is the set of all vectors z which satisfy

lim inf_{y≠x, y→x} [g(y) − g(x) − ⟨z, y − x⟩] / ‖y − x‖ ≥ 0,

where ⟨·, ·⟩ denotes the inner product. Then the limiting Fréchet sub-differential (denoted as ∂g) at x ∈ dom g is the following closure of ∂̂g:

{z ∈ R^D : ∃ (x^k, g(x^k)) → (x, g(x)) and z^k ∈ ∂̂g(x^k) → z as k → ∞}.

• Kurdyka-Łojasiewicz property: g is said to have the Kurdyka-Łojasiewicz (KŁ) property at x̄ ∈ dom ∂g := {x ∈ R^D : ∂g(x) ≠ ∅} if there exist η ∈ (0, ∞], a neighborhood U_{x̄} of x̄, and a desingularizing function φ : [0, η) → R_+ which satisfies (1) φ is continuous at 0 and φ(0) = 0; (2) φ is concave and C^1 on (0, η); (3) φ′(s) > 0 for all s ∈ (0, η), such that for all

x ∈ U_{x̄} ∩ {x : g(x̄) < g(x) < g(x̄) + η},

the following inequality holds:

φ′(g(x) − g(x̄)) dist(0, ∂g(x)) ≥ 1.


Moreover, if g satisfies the KŁ property at each point of dom ∂g, then g is called a KŁ function.

• Semi-algebraic set and function: A subset Ω of R^D is a real semi-algebraic set if there exist a finite number of real polynomial functions r_{ij}, h_{ij} : R^D → R such that

Ω = ⋃_{j=1}^{p} ⋂_{i=1}^{q} {x ∈ R^D : r_{ij}(x) = 0 and h_{ij}(x) < 0}.   (11)

g is called semi-algebraic if its graph {(x, z) ∈ R^{D+1} : g(x) = z} is a semi-algebraic subset of R^{D+1}. It is verified in [7] that all semi-algebraic functions satisfy the KŁ property.

A.2 Explicit Momentum FIMA (eFIMA)

A.2.1 Proof of Theorem 1

Proof. We first prove the inequality relating Ψ(x^{k+1}) and Ψ(v^k). According to the update rule of x^{k+1} (Step 8 in Alg. 1), x^{k+1} ∈ prox_{γ^k g}(v^k − γ^k ∇f(v^k)), we have

x^{k+1} ∈ arg min_x g(x) + ⟨∇f(v^k), x − v^k⟩ + (1/(2γ^k))‖x − v^k‖²,   (12)

thus

g(x^{k+1}) + ⟨∇f(v^k), x^{k+1} − v^k⟩ + (1/(2γ^k))‖x^{k+1} − v^k‖² ≤ g(v^k).   (13)

Since f is C^{1,1}_L, we have

f(x^{k+1}) ≤ f(v^k) + ⟨∇f(v^k), x^{k+1} − v^k⟩ + (L/2)‖x^{k+1} − v^k‖²,   (14)

where L is the Lipschitz modulus of ∇f. Combining Eqs. (13) and (14), we have

Ψ(x^{k+1}) ≤ Ψ(v^k) − (1/(2γ^k) − L/2)‖x^{k+1} − v^k‖².   (15)

Setting γ^k < 1/L and defining α^k = 1/(2γ^k) − L/2, we have α^k > 0 and Ψ(x^{k+1}) ≤ Ψ(v^k) − α^k‖x^{k+1} − v^k‖².

Next we prove the boundedness and convergence of {x^k}_{k∈N}. Based on the momentum scheduling policy in Alg. 1, we obviously have Ψ(v^k) ≤ Ψ(x^k). Together with the result in Eq. (15) (i.e., Ψ(x^{k+1}) ≤ Ψ(v^k) with γ^k < 1/L), this concludes that for any k ∈ N_+,

Ψ(x^{k+1}) ≤ Ψ(v^k) ≤ Ψ(x^k) ≤ Ψ(v^{k−1}) ≤ ··· ≤ Ψ(x^0).   (16)

Since both f and g are proper, we also have Ψ(v^k) ≥ inf Ψ > −∞. Thus both sequences {Ψ(x^k)}_{k∈N} and {Ψ(v^k)}_{k∈N} are non-increasing and bounded. Together with the coercivity of Ψ, this concludes that both {x^k}_{k∈N} and {v^k}_{k∈N} are bounded and thus have accumulation points.

We now prove that all accumulation points are critical points of Ψ. From Eq. (16), the objective sequences {Ψ(x^k)}_{k∈N} and {Ψ(v^k)}_{k∈N} converge to the same value Ψ*, i.e.,

lim_{k→∞} Ψ(x^k) = lim_{k→∞} Ψ(v^k) = Ψ*.   (17)

From Eqs. (15) and (16), we have

(1/(2γ^k) − L/2)‖x^{k+1} − v^k‖² ≤ Ψ(v^k) − Ψ(x^{k+1}) ≤ Ψ(x^k) − Ψ(x^{k+1}).   (18)

Summing over k, we further have

min_k {1/(2γ^k) − L/2} Σ_{k=0}^{∞} ‖x^{k+1} − v^k‖² ≤ Ψ(x^0) − Ψ* < ∞.   (19)

The above inequality implies that ‖x^{k+1} − v^k‖ → 0, and hence {x^k}_{k∈N} and {v^k}_{k∈N} share the same set of accumulation points (denoted as Ω). Consider any accumulation point x* ∈ Ω of {x^k}_{k∈N}, i.e., x^{k_j} → x* as j → ∞. Then by Eq. (12), we have

g(x^{k+1}) + ⟨∇f(v^k), x^{k+1} − v^k⟩ + (1/(2γ^k))‖x^{k+1} − v^k‖² ≤ g(x*) + ⟨∇f(v^k), x* − v^k⟩ + (1/(2γ^k))‖x* − v^k‖².   (20)

Let k_j = k + 1 in Eq. (20) and j → ∞. Taking lim sup on both sides of Eq. (20) gives lim sup_{j→∞} g(x^{k_j}) ≤ g(x*). On the other hand, since g is lower semi-continuous and x^{k_j} → x*, it follows that lim inf_{j→∞} g(x^{k_j}) ≥ g(x*). So we have lim_{j→∞} g(x^{k_j}) = g(x*). Note that the continuity of f yields lim_{j→∞} f(x^{k_j}) = f(x*), so we conclude

lim_{j→∞} Ψ(x^{k_j}) = Ψ(x*).   (21)

Recalling that lim_{k→∞} Ψ(x^{k+1}) = Ψ* in Eq. (17), we have lim_{j→∞} Ψ(x^{k_j}) = Ψ*, so

Ψ(x*) = Ψ*, ∀ x* ∈ Ω.   (22)

By the first-order optimality condition of Eq. (12) and k_j = k + 1, we have

0 ∈ ∂g(x^{k_j}) + ∇f(v^k) + (1/γ^k)(x^{k_j} − v^k).   (23)

Thus, we have

∇f(x^{k_j}) − ∇f(v^k) − (1/γ^k)(x^{k_j} − v^k) ∈ ∂Ψ(x^{k_j})
⇒ ‖∇f(x^{k_j}) − ∇f(v^k) − (1/γ^k)(x^{k_j} − v^k)‖ ≤ (L + 1/γ^k)‖x^{k_j} − v^k‖ → 0, as j → ∞.   (24)

Then from the definition of the sub-differential and Eqs. (21), (23), and (24), we conclude that

0 ∈ ∂Ψ(x*), ∀ x* ∈ Ω.   (25)

Therefore, all accumulation points x* are critical points of Ψ.
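To make the update analyzed above concrete, the following is only a minimal numerical sketch of an eFIMA-style iteration for f(x) = ½‖Ax − b‖² with g = λ‖·‖₁ (so the proximal map is soft-thresholding); the momentum candidate v^k is accepted only when Ψ(v^k) ≤ Ψ(x^k), mirroring the scheduling policy behind Eq. (16). The learnable module is replaced here by a plain heavy-ball extrapolation, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 50))
b = rng.standard_normal(30)
lam = 0.1
L = np.linalg.norm(A.T @ A, 2)           # Lipschitz modulus of ∇f
gamma = 0.9 / L                           # step size with gamma < 1/L

def Psi(x):                               # objective Ψ = f + g
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))

def prox_g(z, t):                         # soft-thresholding: prox of t*lam*||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)

x_prev = x = np.zeros(50)
for k in range(200):
    v = x + 0.5 * (x - x_prev)            # candidate momentum point (stand-in for the learned module)
    if Psi(v) > Psi(x):                   # scheduling policy: only accept non-increasing candidates
        v = x
    x_prev, x = x, prox_g(v - gamma * (A.T @ (A @ v - b)), gamma)

print(Psi(x))
```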

A.2.2 Proof of Corollary 1

Proof. Considering the semi-algebraic (thus KŁ) property of Ψ(x) and defining a desingularizing function of the form φ(s) = (t/θ)s^θ, we can prove Corollary 1 from Eqs. (15), (16), and (23) using a methodology similar to that in [7], [27]. Since these derivations are quite standard, we omit the details of this proof; see our Supplemental Materials.

A.3 Implicit Momentum FIMA (iFIMA)

A.3.1 Proof of Proposition 1

Proof. First, by using the same derivations as in Eq. (15), we can directly obtain the inequality of Theorem 1 for iFIMA. Then we show how to build the relationship between Ψ(u^k) and Ψ(x^k).


It is known that u^k is actually an inexact minimizer of Ψ^k. But by defining its sub-differential d^{u^k}_{Ψ^k} ∈ ∂Ψ^k(u) as in Eq. (4), we can also regard u^k as the exact solution to the following problem:

u^k ∈ arg min_x Ψ^k(x) − ⟨d^{u^k}_{Ψ^k}, x⟩.   (26)

Thus, we have

Ψ(u^k) + (μ^k/2)‖u^k − x^k‖² − ⟨d^{u^k}_{Ψ^k}, u^k⟩ ≤ Ψ(x^k) − ⟨d^{u^k}_{Ψ^k}, x^k⟩
⇒ Ψ(u^k) ≤ Ψ(x^k) − (μ^k/2)‖u^k − x^k‖² + ⟨d^{u^k}_{Ψ^k}, u^k − x^k⟩
≤ Ψ(x^k) − (μ^k/2)‖u^k − x^k‖² + C^k‖u^k − x^k‖²
= Ψ(x^k) − (μ^k/2 − C^k)‖u^k − x^k‖²,   (27)

in which the second inequality holds by the Cauchy-Schwarz inequality and our error-control-based scheduling policy in Alg. 2. Setting C^k < μ^k/2 and defining β^k = μ^k/2 − C^k, we have β^k > 0 and Ψ(u^k) ≤ Ψ(x^k) − β^k‖u^k − x^k‖², which concludes the proof.
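The error-control test used above can be phrased operationally: the inexact inner solution u^k is accepted as the momentum point only when its residual sub-gradient is small relative to ‖u^k − x^k‖, with C^k < μ^k/2. The snippet below is a hedged sketch of this acceptance rule only; the inner solver and the way the residual is obtained are placeholders.

```python
import numpy as np

def accept_inexact(x_k, u_k, residual, mu_k, C_k):
    # Accept u^k as the momentum point v^k only if the inexactness measure
    # ||d^{u^k}|| <= C^k * ||u^k - x^k|| holds and C^k < mu_k / 2, which
    # guarantees Psi(u^k) <= Psi(x^k) - (mu_k/2 - C^k) * ||u^k - x^k||^2.
    gap = np.linalg.norm(u_k - x_k)
    if C_k < mu_k / 2 and np.linalg.norm(residual) <= C_k * gap:
        return u_k    # error-controlled candidate accepted: v^k = u^k
    return x_k        # otherwise fall back to the safe choice: v^k = x^k
```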

A.3.2 Proof of Theorem 2

Proof. We first prove the boundedness of {x^k}_{k∈N}. According to Proposition 1, we have Ψ(u^k) ≤ Ψ(x^k) when μ^k/2 > C^k. So if the error-control criterion in Alg. 2 is satisfied, we have v^k = u^k and Ψ(v^k) = Ψ(u^k) ≤ Ψ(x^k); otherwise, we have v^k = x^k and Ψ(v^k) = Ψ(x^k). Together with the results in Theorem 1 (i.e., Ψ(x^{k+1}) ≤ Ψ(v^k) with γ^k < 1/L), this concludes that for any k ∈ N_+,

Ψ(x^{k+1}) ≤ Ψ(v^k) ≤ Ψ(x^k) ≤ Ψ(v^{k−1}) ≤ ··· ≤ Ψ(x^0).   (28)

Then, by using similar derivations as in Theorem 1, all accumulation points x* are critical points of Ψ.

Now we are ready to prove that {x^k}_{k∈N} is a Cauchy sequence. Since Ψ is a KŁ function, we have

φ′(Ψ(x^{k+1}) − Ψ(x*)) dist(0, ∂Ψ(x^{k+1})) ≥ 1.

From Eq. (24) we get

φ′(Ψ(x^{k+1}) − Ψ(x*)) ≥ (L + 1/γ^k)^{−1} ‖x^{k+1} − v^k‖^{−1}.

On the other hand, from the concavity of φ and Eqs. (15) and (24) we have

φ(Ψ(x^{k+1}) − Ψ(x*)) − φ(Ψ(x^{k+2}) − Ψ(x*))
≥ φ′(Ψ(x^{k+1}) − Ψ(x*)) (Ψ(x^{k+1}) − Ψ(x^{k+2}))
≥ (L + 1/γ^k)^{−1} ‖x^{k+1} − v^k‖^{−1} (1/(2γ^k) − L/2) ‖x^{k+2} − v^{k+1}‖².

For convenience, we define for all m, n ∈ N and x* the quantities

Δ_{m,n} := φ(Ψ(x^m) − Ψ(x*)) − φ(Ψ(x^n) − Ψ(x*)),

and

B := sup_k (2Lγ^k + 2)/(1 − Lγ^k) ∈ (0, ∞).

These deduce that

Δ_{k+1,k+2} ≥ ‖x^{k+2} − v^{k+1}‖² / (B‖x^{k+1} − v^k‖),

and hence

‖x^{k+2} − v^{k+1}‖² ≤ B Δ_{k+1,k+2} ‖x^{k+1} − v^k‖
⇒ 2‖x^{k+2} − v^{k+1}‖ ≤ B Δ_{k+1,k+2} + ‖x^{k+1} − v^k‖.

Summing up the above inequality for k = p, ..., q ∈ N with p < q yields

2 Σ_{k=p}^{q} ‖x^{k+2} − v^{k+1}‖ ≤ Σ_{k=p}^{q} ‖x^{k+1} − v^k‖ + B Σ_{k=p}^{q} Δ_{k+1,k+2}
≤ Σ_{k=p}^{q} ‖x^{k+2} − v^{k+1}‖ + ‖x^{p+1} − v^p‖ + B Δ_{p+1,q+2},

where the last inequality holds by the fact that Δ_{m,n} + Δ_{n,r} = Δ_{m,r} for all m, n, r ∈ N. Since φ ≥ 0, we thus have for any p < q that

Σ_{k=p}^{q} ‖x^{k+2} − v^{k+1}‖ ≤ ‖x^{p+1} − v^p‖ + B φ(Ψ(x^{p+1}) − Ψ(x*)).   (29)

Moreover, recalling Eq. (27), we also have

min_k {μ^k/2 − C^k} Σ_{k=p}^{q} ‖v^{k+1} − x^{k+1}‖ ≤ Σ_{k=p}^{q} (Ψ(x^{k+1}) − Ψ(x^{k+2})) = Ψ(x^{p+1}) − Ψ(x^{q+2}).   (30)

Combining Eqs. (29) and (30), we easily deduce

Σ_{k=1}^{∞} ‖x^{k+1} − x^k‖ ≤ Σ_{k=1}^{∞} ‖x^{k+1} − v^k‖ + Σ_{k=1}^{∞} ‖v^k − x^k‖ < ∞.   (31)

The above inequality implies that {x^k}_{k∈N} is a Cauchy sequence. Thus the sequence globally converges to a critical point of Ψ(x) in Eq. (1).

A.4 Multi-block FIMA

A.4.1 Definition Extension

As for the generalized Lipschitz smooth property of f, we actually require f to satisfy the following two conditions (a minimal numerical sketch of this block-wise structure is given after the list).

• For each x_n with the other variables fixed, there exists L_n > 0 such that

‖∇_n f(X_{[<n]}, x_n, X_{[>n]}) − ∇_n f(X_{[<n]}, y_n, X_{[>n]})‖ ≤ L_n(X_{[<n]}, X_{[>n]})‖x_n − y_n‖, ∀ x_n, y_n ∈ R^{D_n},   (32)

where ∇_n denotes the gradient with respect to x_n.

• For each bounded subset Ω_1 × ··· × Ω_N ⊆ R^{D_1} × ··· × R^{D_N}, there exists M > 0 such that

‖(∇_1 f(X) − ∇_1 f(Y), ..., ∇_N f(X) − ∇_N f(Y))‖ ≤ M‖X − Y‖, ∀ X, Y ∈ Ω_1 × ··· × Ω_N.   (33)
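As an illustration only (the two blocks, the coupling function f, and the ℓ1 regularizers below are arbitrary stand-ins, not our experimental setting), the block-wise smoothness in Eq. (32) is what makes a Gauss-Seidel-style alternating prox-gradient sweep well defined: each block n uses its own step size γ_n^k < 1/L_n.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two coupled blocks: f(x1, x2) = 0.5*||A1 x1 + A2 x2 - b||^2, g_n = lam*||.||_1.
A1, A2 = rng.standard_normal((40, 15)), rng.standard_normal((40, 25))
b, lam = rng.standard_normal(40), 0.05

def prox_l1(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)

x1, x2 = np.zeros(15), np.zeros(25)
L1 = np.linalg.norm(A1.T @ A1, 2)        # block-wise Lipschitz moduli, cf. Eq. (32)
L2 = np.linalg.norm(A2.T @ A2, 2)
for k in range(100):
    # Update block 1 with block 2 fixed at its most recent value, then block 2.
    r = A1 @ x1 + A2 @ x2 - b
    x1 = prox_l1(x1 - (0.9 / L1) * (A1.T @ r), 0.9 / L1)
    r = A1 @ x1 + A2 @ x2 - b
    x2 = prox_l1(x2 - (0.9 / L2) * (A2.T @ r), 0.9 / L2)
```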

A.4.2 Proof of Corollary 2

Proof. We first prove the boundedness of {X^k}_{k∈N}. Using the inequality in Theorem 1 and Step 10 in Alg. 3, we have

x_n^{k+1} ∈ arg min_{x_n} g_n(x_n) + (1/(2γ_n^k))‖x_n − v_n^k‖² + ⟨∇_n f(X^{k+1}_{[<n]}, v_n^k, X^k_{[>n]}), x_n − v_n^k⟩.   (34)


This together with Eq. (32) concludes that

Ψ(X^{k+1}_{[<n]}, v_n^k, X^k_{[>n]}) − Ψ(X^{k+1}_{[≤n]}, X^k_{[>n]}) ≥ (1/(2γ_n^k) − L_n(X^{k+1}_{[<n]}, X^k_{[>n]})/2) ‖x_n^{k+1} − v_n^k‖².   (35)

Define the auxiliary function Ψ_n^k(x_n) = f(X^{k+1}_{[<n]}, x_n, X^k_{[>n]}) + g_n(x_n) + (μ_n^k/2)‖x_n − x_n^k‖². Then, by considering that u_n^k is an inexact solution of the auxiliary function Ψ_n^k(x_n) and applying Proposition 1, we have

Ψ(X^{k+1}_{[<n]}, x_n^k, X^k_{[>n]}) − Ψ(X^{k+1}_{[<n]}, u_n^k, X^k_{[>n]}) ≥ (μ_n^k/2 − C_n^k)‖u_n^k − x_n^k‖².   (36)

Let λ_n^k = L_n(X^{k+1}_{[<n]}, X^k_{[>n]}). Then, considering Eq. (35) with γ_n^k < 1/λ_n^k, Eq. (36) with μ_n^k > 2C_n^k, and our error-control updating rule, we have

Ψ(X^{k+1}_{[≤n]}, X^k_{[>n]}) ≤ Ψ(X^{k+1}_{[<n]}, v_n^k, X^k_{[>n]}) ≤ Ψ(X^{k+1}_{[<n]}, x_n^k, X^k_{[>n]}).

This concludes that for any k ∈ N_+ and n ∈ {1, ..., N},

Ψ(X^{k+1}) = Ψ(X^{k+1}_{[≤N]}, X^k_{[>N]}) ≤ Ψ(X^{k+1}_{[<n]}, v_n^k, X^k_{[>n]}) ≤ Ψ(X^{k+1}_{[<n]}, x_n^k, X^k_{[>n]}) ≤ Ψ(X^{k+1}_{[<1]}, x_1^k, X^k_{[>1]}) = Ψ(X^k) ≤ ··· ≤ Ψ(X^0).   (37)

Since f and g_n are proper, we also have −∞ < inf Ψ ≤ Ψ(X^{k+1}). Thus {Ψ(X^k)}_{k∈N} and {Ψ(X^{k+1}_{[<n]}, v_n^k, X^k_{[>n]})}_{k∈N} are all non-increasing and bounded. Together with the coercivity of Ψ, this concludes that the sequences {X^k}_{k∈N} and {v_n^k}_{k∈N} (1 ≤ n ≤ N) are bounded and thus have accumulation points.

We then prove that all accumulation points are critical points of Ψ. From Eq. (37), the function value sequences {Ψ(X^k)}_{k∈N} and {Ψ(X^{k+1}_{[<n]}, v_n^k, X^k_{[>n]})}_{k∈N} converge to the same value Ψ*, i.e.,

lim_{k→∞} Ψ(X^k) = lim_{k→∞} Ψ(X^{k+1}_{[<n]}, v_n^k, X^k_{[>n]}) = Ψ*.   (38)

From Eqs. (35) and (37), summing over k and n we have

min_{k,n} {1/(2γ_n^k) − λ_n^k/2} Σ_{k=0}^{∞} Σ_{n=1}^{N} ‖x_n^{k+1} − v_n^k‖²
≤ Σ_{k=0}^{∞} Σ_{n=1}^{N} (Ψ(X^{k+1}_{[<n]}, v_n^k, X^k_{[>n]}) − Ψ(X^{k+1}_{[≤n]}, X^k_{[>n]}))
≤ Σ_{k=0}^{∞} (Ψ(X^k) − Ψ(X^{k+1})) = Ψ(X^0) − Ψ* < ∞.   (39)

The above inequality implies that ‖x_n^{k+1} − v_n^k‖ → 0, hence {x_n^k}_{k∈N} and {v_n^k}_{k∈N} share the same set of accumulation points. Consider any accumulation point X* = {x_1*, ..., x_N*} of {X^k}_{k∈N}; there exists a subsequence {X^{k_j}}_{j∈N} such that

x_n^{k_j} → x_n*, as j → ∞.   (40)

From Eq. (34), we have

g_n(x_n^{k+1}) + ⟨∇_n f(X^{k+1}_{[<n]}, v_n^k, X^k_{[>n]}), x_n^{k+1} − v_n^k⟩ + (1/(2γ_n^k))‖x_n^{k+1} − v_n^k‖²
≤ g_n(x_n*) + ⟨∇_n f(X^{k+1}_{[<n]}, v_n^k, X^k_{[>n]}), x_n* − v_n^k⟩ + (1/(2γ_n^k))‖x_n* − v_n^k‖².   (41)

Let k_j = k + 1 in Eq. (41) and j → ∞. By taking lim sup on both sides, we have lim sup_{j→∞} g_n(x_n^{k_j}) ≤ g_n(x_n*). On the other hand, since g_n is lower semi-continuous and x_n^{k_j} → x_n*, it follows that lim inf_{j→∞} g_n(x_n^{k_j}) ≥ g_n(x_n*). Thus we have lim_{j→∞} g_n(x_n^{k_j}) = g_n(x_n*). Note that the continuity of f yields lim_{j→∞} f(X^{k_j}) = f(X*). Therefore, we conclude

lim_{j→∞} f(X^{k_j}) + Σ_{n=1}^{N} g_n(x_n^{k_j}) = f(X*) + Σ_{n=1}^{N} g_n(x_n*)
⇒ lim_{j→∞} Ψ(X^{k_j}) = Ψ(X*).   (42)

Considering lim_{k→∞} Ψ(X^{k+1}) = Ψ* in Eq. (38), we have lim_{j→∞} Ψ(X^{k_j}) = Ψ*, and thus

Ψ(X*) = Ψ*.   (43)

Similarly to the deduction in Theorem 2, we conclude that ‖X^{k+1} − X^k‖ → 0 and ‖x_n^{k+1} − v_n^k‖ → 0 when k → ∞.

By considering the first-order optimality condition of Eq. (34) and setting k_j = k + 1, we have

0 ∈ ∂_n g_n(x_n^{k_j}) + ∇_n f(X^{k_j}_{[<n]}, v_n^k, X^k_{[>n]}) + (1/γ_n^k)(x_n^{k_j} − v_n^k)
⇔ ∇_n f(X^{k_j}) − ∇_n f(X^{k_j}_{[<n]}, v_n^k, X^k_{[>n]}) − (1/γ_n^k)(x_n^{k_j} − v_n^k) ∈ ∂_n Ψ(X^{k_j})
⇒ ‖∇_n f(X^{k_j}) − ∇_n f(X^{k_j}_{[<n]}, v_n^k, X^k_{[>n]}) − (1/γ_n^k)(x_n^{k_j} − v_n^k)‖
≤ M‖X^{k_j} − (X^{k_j}_{[<n]}, v_n^k, X^k_{[>n]})‖ + (1/γ_n^k)‖x_n^{k_j} − v_n^k‖
≤ M(Σ_{i=n+1}^{N} ‖x_i^{k_j} − x_i^k‖ + ‖x_n^{k_j} − v_n^k‖) + (1/γ_−)‖x_n^{k_j} − v_n^k‖
≤ M‖X^{k_j} − X^k‖ + (M + 1/γ_−)‖x_n^{k_j} − v_n^k‖ → 0   (44)

when j → ∞. Here ∂_n denotes the partial sub-differential with respect to x_n, γ_− = inf{γ_n^k : k ∈ N, n = 1, ..., N}, and M is defined in Eq. (33). Therefore, combining Eqs. (38), (43), and (44) with the definition of the sub-differential, we finally conclude

‖∂Ψ(X^{k_j})‖ ≤ Σ_{n=1}^{N} ‖∂_n Ψ(X^{k_j})‖ ≤ NM‖X^{k_j} − X^k‖ + (M + 1/γ_−) Σ_{n=1}^{N} ‖x_n^{k_j} − v_n^k‖ → 0   (45)

when j → ∞, i.e.,

0 ∈ ∂Ψ(X*), ∀ X* ∈ Ω,   (46)

where Ω denotes the set of all accumulation points of {X^k}_{k∈N}. Therefore, all accumulation points X* are critical points of Ψ.


Finally, based on the KŁ property of Ψ(X) and using similar derivations as in the proof of Theorem 2, we also have

Σ_{k=0}^{∞} ‖X^{k+1} − X^k‖ < ∞.   (47)

It is clear that Eq. (47) implies that the sequence {X^k}_{k∈N} is a Cauchy sequence, and thus it globally converges to a critical point of Ψ(X) in Eq. (5).

Considering that Ψ(X) is semi-algebraic and choosing φ(s) = (t/θ)s^θ as the desingularizing function, it is also easy to conclude that mFIMA shares the same convergence rates as stated in Corollary 1.

ACKNOWLEDGMENTS

This work is partially supported by the National Natural Science Foundation of China (Nos. 61672125, 61733002, 61572096, 61432003 and 61632019), and the Fundamental Research Funds for the Central Universities.

REFERENCES

[1] Y. Nesterov, "A method of solving a convex programming problem with convergence rate O(1/k^2)," in Soviet Mathematics Doklady, vol. 27, no. 2, 1983, pp. 372–376.
[2] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
[3] L. Xu, C. Lu, Y. Xu, and J. Jia, "Image smoothing via ℓ0 gradient minimization," ACM Trans. on Graphics, vol. 30, no. 6, p. 174, 2011.
[4] E. J. Candes, X. Li, Y. Ma, and J. Wright, "Robust principal component analysis," Journal of the ACM, vol. 58, no. 3, p. 11, 2011.
[5] A. Beck and M. Teboulle, "Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems," IEEE Trans. Image Process., vol. 18, no. 11, pp. 2419–2434, 2009.
[6] H. Li and Z. Lin, "Accelerated proximal gradient methods for nonconvex programming," in Proc. Advances in Neural Inf. Process. Systems, 2015, pp. 379–387.
[7] H. Attouch and J. Bolte, "On the convergence of the proximal algorithm for nonsmooth functions involving analytic features," Mathematical Programming, vol. 116, no. 1, pp. 5–16, 2009.
[8] Q. Yao, J. T. Kwok, F. Gao, W. Chen, and T.-Y. Liu, "Efficient inexact proximal gradient algorithm for nonconvex problems," in Proc. Int. Joint Conf. Artif. Intell., 2017, pp. 3308–3314.
[9] Q. Li, Y. Zhou, Y. Liang, and P. K. Varshney, "Convergence analysis of proximal gradient with momentum for nonconvex optimization," in Proc. Int. Conf. Mach. Learn., 2017, pp. 2111–2119.
[10] T. Moreau and J. Bruna, "Understanding trainable sparse coding via matrix factorization," arXiv preprint arXiv:1609.00285, 2016.
[11] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in Proc. Int. Conf. Mach. Learn., 2010, pp. 399–406.
[12] R. Liu, G. Zhong, J. Cao, Z. Lin, S. Shan, and Z. Luo, "Learning to diffuse: A new perspective to design pdes for visual analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 12, pp. 2457–2471, 2016.
[13] Y. Chen and T. Pock, "Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1256–1272, 2017.
[14] Y. Yang, J. Sun, H. Li, and Z. Xu, "Admm-net: A deep learning approach for compressive sensing mri," in Proc. Advances in Neural Inf. Process. Systems, 2016, pp. 10–18.
[15] S. Wang, S. Fidler, and R. Urtasun, "Proximal deep structured models," in Proc. Advances in Neural Inf. Process. Systems, 2016, pp. 865–873.
[16] S. Diamond, V. Sitzmann, F. Heide, and G. Wetzstein, "Unrolled optimization with deep priors," arXiv preprint arXiv:1705.08041, 2017.
[17] K. Zhang, W. Zuo, S. Gu, and L. Zhang, "Learning deep cnn denoiser prior for image restoration," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3929–3938.
[18] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Deep image prior," arXiv preprint arXiv:1711.10925, 2017.
[19] K. Li and J. Malik, "Learning to optimize," arXiv preprint arXiv:1606.01885, 2016.
[20] M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas, "Learning to learn by gradient descent by gradient descent," in Proc. Advances in Neural Inf. Process. Systems, 2016, pp. 3981–3989.
[21] B. Gu, Z. Huo, and H. Huang, "Inexact proximal gradient methods for non-convex and non-smooth optimization," arXiv preprint arXiv:1612.06003, 2016.
[22] M. Schmidt, N. L. Roux, and F. R. Bach, "Convergence rates of inexact proximal-gradient methods for convex optimization," in Proc. Advances in Neural Inf. Process. Systems, 2011, pp. 1458–1466.
[23] A. Bronstein, P. Sprechmann, and G. Sapiro, "Learning efficient structured sparse models," arXiv preprint arXiv:1206.4649, 2012.
[24] U. Schmidt and S. Roth, "Shrinkage fields for effective image restoration," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2774–2781.
[25] S. H. Chan, X. Wang, and O. A. Elgendy, "Plug-and-play admm for image restoration: Fixed-point convergence and applications," IEEE Trans. Comput. Imaging, vol. 3, no. 1, pp. 84–98, 2017.
[26] J. Bolte, S. Sabach, and M. Teboulle, "Proximal alternating linearized minimization for nonconvex and nonsmooth problems," Mathematical Programming, vol. 146, no. 1-2, pp. 459–494, 2014.
[27] E. Chouzenoux, J.-C. Pesquet, and A. Repetti, "A block coordinate variable metric forward-backward algorithm," Journal of Global Optimization, vol. 66, no. 3, pp. 457–485, 2016.
[28] H. Attouch, J. Bolte, and B. F. Svaiter, "Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods," Mathematical Programming, vol. 137, pp. 91–129, 2013.
[29] I. Daubechies, M. Defrise, and C. De Mol, "An iterative thresholding algorithm for linear inverse problems with a sparsity constraint," Communications on Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413–1457, 2004.
[30] L.-Y. Gui, Y.-X. Wang, D. Ramanan, and J. M. Moura, "Few-shot human motion prediction via meta-learning," in Proc. Eur. Conf. Comput. Vis., 2018.
[31] Y.-X. Wang and M. Hebert, "Learning from small sample sets by combining unsupervised meta-training with cnns," in Proc. Advances in Neural Inf. Process. Systems, 2016, pp. 244–252.
[32] C. Zhang, Y. Yu, and Z.-H. Zhou, "Learning environmental calibration actions for policy self-evolution," in Proc. Int. Joint Conf. Artif. Intell., 2018, pp. 3061–3067.
[33] S. Thrun and L. Pratt, Learning to Learn. Springer Science & Business Media, 2012.
[34] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, "Efficient projections onto the ℓ1-ball for learning in high dimensions," in Proc. Int. Conf. Mach. Learn., 2008, pp. 272–279.
[35] S. Cho, J. Wang, and S. Lee, "Handling outliers in non-blind image deconvolution," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 495–502.
[36] W. S. Lai, J. B. Huang, Z. Hu, N. Ahuja, and M. H. Yang, "A comparative study for single image blind deblurring," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1701–1709.
[37] M. Unser, A. Aldroubi, and M. Eden, "Recursive regularization filters: design, properties, and applications," IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 3, pp. 272–277, 1991.
[38] Y. Wang, J. Yang, W. Yin, and Y. Zhang, "A new alternating minimization algorithm for total variation image reconstruction," SIAM Journal on Imaging Sciences, vol. 1, no. 3, pp. 248–272, 2008.
[39] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[40] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[41] Y. Wang, J. Yang, W. Yin, and Y. Zhang, "A new alternating minimization algorithm for total variation image reconstruction," SIAM Journal on Imaging Sciences, vol. 1, no. 3, pp. 248–272, 2008.
[42] A. Danielyan, V. Katkovnik, and K. Egiazarian, "Bm3d frames and variational image deblurring," IEEE Trans. Image Process., vol. 21, no. 4, pp. 1715–1728, 2012.
[43] D. Zoran and Y. Weiss, "From learning models of natural image patches to whole image restoration," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 479–486.
[44] U. Schmidt, J. Jancsary, S. Nowozin, S. Roth, and C. Rother, "Cascades of regression tree fields for image restoration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 4, pp. 677–689, 2016.
[45] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, "Understanding and evaluating blind deconvolution algorithms," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 1964–1971.
[46] L. Sun, S. Cho, J. Wang, and J. Hays, "Edge-based blur kernel estimation using patch priors," in Proc. IEEE Int. Conf. Comput. Photography, 2013, pp. 1–8.
[47] D. Perrone and P. Favaro, "Total variation blind deconvolution: The devil is in the details," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 2909–2916.
[48] H. Zhang, D. Wipf, and Y. Zhang, "Multi-image blind deblurring using a coupled adaptive sparse prior," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2013, pp. 1051–1058.
[49] J. Pan, D. Sun, H. Pfister, and M.-H. Yang, "Deblurring images via dark channel prior," IEEE Trans. Pattern Anal. Mach. Intell., vol. 1, no. 99, pp. 1–1, 2017.
[50] Y. Li, R. T. Tan, X. Guo, J. Lu, and M. S. Brown, "Rain streak removal using layer priors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2736–2744.
[51] L.-W. Kang, C.-W. Lin, and Y.-H. Fu, "Automatic single-image-based rain streaks removal via image decomposition," IEEE Trans. Image Process., vol. 21, no. 4, pp. 1742–1755, Apr. 2012.
[52] Y. Luo, Y. Xu, and H. Ji, "Removing rain from a single image via discriminative sparse coding," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 3397–3405.
[53] X. Fu, J. Huang, X. Ding, Y. Liao, and J. Paisley, "Clearing the skies: A deep network architecture for single-image rain removal," IEEE Trans. Image Process., vol. 26, no. 6, pp. 2944–2956, Apr. 2017.
[54] X. Fu, J. Huang, Y. Huang, D. Zeng, X. Ding, and J. Paisley, "Removing rain from single images via a deep detail network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 1715–1723.
[55] L.-J. Deng, T.-Z. Huang, X.-L. Zhao, and T.-X. Jiang, "A directional global sparse model for single image rain removal," Applied Mathematical Modelling, vol. 59, pp. 662–679, 2018.
[56] R. T. Rockafellar and R. J.-B. Wets, Variational Analysis. Springer Science & Business Media, 2009, vol. 317.

Risheng Liu (M'12-) received the BSc and PhD degrees, both in mathematics, from the Dalian University of Technology in 2007 and 2012, respectively. He was a visiting scholar in the Robotics Institute of Carnegie Mellon University from 2010 to 2012. He served as a Hong Kong Scholar Research Fellow at the Hong Kong Polytechnic University from 2016 to 2017. He is currently an associate professor with the Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province, International School of Information and Software Technology, Dalian University of Technology. His research interests include machine learning, optimization, computer vision and multimedia. He was a co-recipient of the IEEE ICME Best Student Paper Award in both 2014 and 2015. Two papers were also selected as Finalists of the Best Paper Award in ICME 2017. He is a member of the IEEE and ACM.

Shichao Cheng received the B.E. degree in Mathematics and Applied Mathematics from Henan Normal University, Xinxiang, China, in 2013. She is currently pursuing the PhD degree in Computational Mathematics at Dalian University of Technology, Dalian, China. Her research interests include computer vision, machine learning and optimization.

Yi He received the B.E. degree in Computer Science from Shihezi University, Shihezi, China, in 2017. He is currently pursuing the master degree in software engineering at Dalian University of Technology, Dalian, China. His research interests include computer vision and deep learning.

Xin Fan was born in 1977. He received the B.E. and Ph.D. degrees in information and communication engineering from Xi'an Jiaotong University, Xi'an, China, in 1998 and 2004, respectively. He was with Oklahoma State University, Stillwater, from 2006 to 2007, as a post-doctoral research fellow. He joined the School of Software, Dalian University of Technology, Dalian, China, in 2009. His current research interests include computational geometry and machine learning, and their applications to low-level image processing and DTI-MR image analysis.

Zhouchen Lin received the Ph.D. degree in applied mathematics from Peking University in 2000. He is currently a Professor with the Key Laboratory of Machine Perception, School of Electronics Engineering and Computer Science, Peking University. His research interests include computer vision, image processing, machine learning, pattern recognition, and numerical optimization. He is an area chair of ACCV 2009/2018, CVPR 2014/2016, ICCV 2015, NIPS 2015/2018 and AAAI 2019, and a senior program committee member of AAAI 2016/2017/2018 and IJCAI 2016/2018. He is an Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and the International Journal of Computer Vision. He is an IAPR/IEEE fellow.

Zhongxuan Luo received the B.S. degree in Computational Mathematics from Jilin University, China, in 1985, the M.S. degree in Computational Mathematics from Jilin University in 1988, and the PhD degree in Computational Mathematics from Dalian University of Technology, China, in 1991. He has been a full professor of the School of Mathematical Sciences at Dalian University of Technology since 1997. His research interests include computational geometry and computer vision.

