794 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 3, FEBRUARY 1, 2017

Majorization-Minimization Algorithms in Signal Processing, Communications, and Machine Learning

Ying Sun, Prabhu Babu, and Daniel P. Palomar, Fellow, IEEE

Overview Article

Abstract—This paper gives an overview of the majorization-minimization (MM) algorithmic framework, which can provide guidance in deriving problem-driven algorithms with low computational cost. A general introduction of MM is presented, including a description of the basic principle and its convergence results. The extensions, acceleration schemes, and connection to other algorithmic frameworks are also covered. To bridge the gap between theory and practice, upperbounds for a large number of basic functions, derived based on the Taylor expansion, convexity, and special inequalities, are provided as ingredients for constructing surrogate functions. With the pre-requisites established, the way of applying MM to solving specific problems is elaborated by a wide range of applications in signal processing, communications, and machine learning.

Index Terms—Majorization-minimization, upperbounds, surrogate function, non-convex optimization.

I. INTRODUCTION

IN the era of big data, we are witnessing a fast development in data acquisition techniques and a growth of computing power. From an optimization perspective, these can result in large-scale problems due to the tremendous amount of data and variables, which cause challenges to traditional algorithms [1]. For example, apart from trivially parallelizable or convex problems where decomposition techniques can be employed, solving a general problem with no structure to exploit calls for a large amount of computational resources (time and storage). Difficulties also arise when data is stored on different computers or is acquired in real-time. In these cases, it can be inefficient or even impossible to first collect the complete data set and then perform centralized optimization. Besides the aforementioned issues caused by the scale, a problem with a complicated form may lead to numerical problems as well.

Manuscript received January 11, 2016; revised May 5, 2016 and July 26, 2016; accepted July 29, 2016. Date of publication August 18, 2016; date of current version November 28, 2016. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Sergios Theodoridis. This work was supported by the Hong Kong RGC 16207814 and 16206315 research grants.

Y. Sun is with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong, and also with the School of Industrial Engineering, Purdue University, West Lafayette, IN 47907 USA (e-mail: [email protected]).

P. Babu is with CARE, IIT Delhi, Delhi 110016, India (e-mail: [email protected]).

D. P. Palomar is with the Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSP.2016.2601299

Fig. 1. The MM procedure.

For instance, the second order derivatives, which are required by Newton-type nonlinear programming algorithms, can be costly to compute under this scenario. Facing these obstacles, devising problem-driven algorithms that can take advantage of the problem structure may be a better option than employing a general-purpose solver. This is where MM comes into play.

The MM procedure consists of two steps. In the first majorization step, we find a surrogate function that locally approximates the objective function with their difference minimized at the current point. In other words, the surrogate upperbounds the objective function up to a constant. Then in the minimization step, we minimize the surrogate function. The procedure is shown pictorially in Fig. 1. A parallel argument can be made for maximization problems by replacing the upperbound minimization step by a lowerbound maximization step, and is referred to as minorization-maximization.

MM has a long history that dates back to the 1970s [2], and is closely related to the famous EM algorithm [3] intensively used in computational statistics. As a special case of MM, EM, which was systematically introduced in the seminal paper [4] by Dempster, Laird, and Rubin in 1977, is applied mainly to maximum likelihood (ML) estimation problems with incomplete data. MM generalizes EM by replacing the E-step, which calculates the conditional expectation of the log-likelihood of the complete data set, by a minorization step that finds a surrogate function. The surrogate function keeps the key property of the E-step by being a lower bound of the objective function. As a consequence, MM shares most of the convergence results of EM. Compared to EM, which relies on a missing data interpretation of the problem, MM is easier to understand and has a wider scope of applications.

The idea of MM appears in statistics and image processing in early works including [5]–[9], and started taking shape

1053-587X © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


TABLE I
SUMMARY OF APPLICATIONS IN SECTION V AND THEIR CORRESPONDING SURROGATE FUNCTION CONSTRUCTION TECHNIQUES

as a general algorithmic framework in [10]–[12]. It has been applied to a large number of problems since then [13], including sparse regression with non-convex or discontinuous objective functions [14]–[18], sparse principal component analysis (PCA) with cardinality constraint [19], canonical component analysis (CCA) [20], [21], covariance estimation [22]–[25], and matrix factorization [26], [27] with non-convex objective functions and constraints. It has also been applied to higher level applications such as image processing [28], [29], phase retrieval [30], and design [31], [32], just to name a few.

The key to the success of MM lies in constructing a surrogate function. Generally speaking, surrogate functions with the following features are desired [13]:

• Separability in variables (parallel computing);
• Convexity and smoothness;
• The existence of a closed-form minimizer.

Consequently, minimizing the surrogate function is efficient and scalable, yielding a neat algorithm that is easy to implement.

Nevertheless, finding an appropriate surrogate function that yields an algorithm with low computational complexity is not an easy task. On one hand, to achieve a fast convergence rate, a surrogate function that tries to follow the shape of the objective function is preferable. On the other hand, it should be simple to minimize so that the computational cost per iteration is low. Finding the right trade-off between these two opposite goals requires skills in applying inequalities to specific problems. As the main purpose of this paper, we are devoted to presenting surrogate function construction techniques, elaborated by examples in Section III and applications listed in Table I.¹

This paper is organized as follows. Section II serves as an introduction to MM, including a description of the framework, its convergence results, extensions, as well as accelerators. Section III presents the techniques and examples of constructing surrogate functions. Section IV connects MM with some other algorithmic frameworks. Section V demonstrates the way of applying the inequalities in Section III to devise MM algorithms for real-world applications. Section VI concludes the overview.

¹The convergence column indicates the type of convergence discussed in each problem. It should not be interpreted as meaning that the whole sequence generated by the algorithm converges to the corresponding point, which is a stronger conclusion.

A. Notation

Italic letters denote scalars, lower case boldface letters denote vectors, and upper case boldface letters denote matrices.

The sequence of nonnegative integers is denoted N := {0, 1, . . .}. Real numbers are denoted R, and complex numbers are denoted C. The Euclidean space of dimension n is denoted R^n. The nonnegative (positive) orthant is denoted R^n_+ (R^n_{++}). The set of symmetric matrices of size n × n is denoted S^n, and the positive semidefinite (definite) cone is denoted S^n_+ (S^n_{++}).

The elements of vectors and matrices are denoted as follows: scalar x_i stands for the i-th element of vector x, vector x_I stands for a vector constructed by eliminating all the elements of x but the x_i's with i ∈ I, vector X_{:,i} stands for the i-th column of matrix X, vector X_{i,:} stands for the i-th row of X, and scalar X_{ij} stands for the ij-th entry of X.

Superscripts (·)^*, (·)^T, (·)^H, (·)^{-1}, and (·)^† denote the complex conjugate, transpose, conjugate transpose, inverse, and Moore-Penrose pseudoinverse, respectively. The trace and determinant of a matrix X are denoted Tr(X) and det(X), respectively. Vector vec(X) is constructed by stacking the columns of X. The diagonal matrix diag(x) is constructed by setting its i-th diagonal element to be x_i. Notation A ⪰ (≻) B stands for matrix A − B being positive semidefinite (definite). The Hadamard product of two vectors x and y is denoted x ⊙ y. Whenever arithmetic operators such as √·, /, and (·)^{-1} are applied to vectors we mean an element-wise operation.


The magnitude of a scalar x is denoted |x|. The ℓ_p-norm of a vector x is denoted ‖x‖_p. The nuclear norm and Frobenius norm for a matrix X are denoted ‖X‖_* and ‖X‖_F, respectively.

Operator [·]_+ : R^n → R^n_+ denotes the Euclidean projection of a vector in R^n to R^n_+. The gradient of a function f is denoted ∇f. The composition of functions f and g is denoted f ∘ g. The sign function is denoted sgn. The expected value of a random vector x is denoted E(x), and its covariance matrix is denoted Cov(x). Unless otherwise specified, subscript (·)_t in x_t is reserved for the algorithm iteration and stands for the value of x at the t-th iteration, and x_i^t stands for the value of the i-th element of x_t, i.e., (x_i)_t, for notation simplicity (the same convention applies to vector x_i and matrix X_i).

II. ALGORITHMIC FRAMEWORK

A. The MM Algorithm

Consider the following optimization problem

minimize_x   f(x)
subject to   x ∈ X,     (1)

where X is a nonempty closed set in R^n and f : X → R is a continuous function. We assume that f(x) goes to infinity when x ∈ X and ‖x‖ → +∞.

Initialized as x_0 ∈ X, MM generates a sequence of feasible points (x_t)_{t∈N} by the following induction. At point x_t, in the majorization step we construct a continuous surrogate function g(·|x_t) : X → R satisfying the upperbound property that

g(x|x_t) ≥ f(x) + c_t,  ∀x ∈ X,     (2)

where c_t = g(x_t|x_t) − f(x_t). That is, the difference of g(·|x_t) and f is minimized at x_t.

Then in the minimization step, we update x as

x_{t+1} ∈ arg min_{x∈X} g(x|x_t).     (3)

The sequence (f(x_t))_{t∈N} is non-increasing since

f(x_{t+1}) ≤ g(x_{t+1}|x_t) − c_t ≤ g(x_t|x_t) − c_t = f(x_t),     (4)

where the first inequality follows from (2), and the second inequality follows from (3). We denote the algorithm mapping defined by steps (2) and (3) that sends x_t to x_{t+1} by M : R^n → R^n in the rest of the paper.
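
To make the iteration (2)–(3) concrete, the following minimal Python sketch (our illustration, not part of the paper) runs a generic MM loop. The callables majorize and minimize_surrogate are hypothetical placeholders to be supplied by the user: the former must return a surrogate g(·|x_t) satisfying (2), and the latter must return a minimizer over X as in (3).

```python
import numpy as np

def mm(f, majorize, minimize_surrogate, x0, max_iter=100, tol=1e-8):
    """Generic majorization-minimization loop, a sketch of steps (2)-(3).

    f                  : objective, x -> float
    majorize           : x_t -> g, a callable with g(x) >= f(x) + c_t  (Eq. (2))
    minimize_surrogate : g -> argmin over the feasible set of g        (Eq. (3))
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = majorize(x)                    # majorization step
        x_new = minimize_surrogate(g)      # minimization step
        # Eq. (4) guarantees f(x_new) <= f(x), so this difference is nonnegative
        if f(x) - f(x_new) <= tol * max(1.0, abs(f(x))):
            return x_new
        x = x_new
    return x
```

Every concrete MM algorithm in Section V can be cast in this form once the two callables are specified for the problem at hand.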

B. Extensions

The MM principle can be combined with other algorithmic frameworks, leading to the following extensions.

Instead of computing a minimizer of g(·|x_t), we can find a point x_{t+1} that satisfies g(x_{t+1}|x_t) ≤ g(x_t|x_t) (i.e., just making an improvement). This leads to the generalized EM (GEM) algorithm [4]. Point x_{t+1} can be found by taking a gradient, Newton, or quasi-Newton step. GEM is also closely related to MM acceleration schemes [64]–[66].

Combining with the block coordinate descent algorithm, we can partition the variables into blocks and apply MM to one block while keeping the value of the other blocks fixed. As a benefit, it provides more flexibility in designing surrogate functions. Moreover, in some cases the surrogate function can approximate f better than using a single block, leading to a faster convergence rate [67]. A simple update rule is sweeping the blocks cyclically. It can be generalized to the "essential cyclic rule" [68], where each block is updated at least once within a finite number of iterations [57], [69], [70]. Other sweeping schemes include the Gauss-Southwell update rule, maximum improvement update rule, as well as the randomized update rule [70].

An incremental MM was proposed in [71] for minimizing an objective function of the form f(x) = (1/N) ∑_{i=1}^N f_i(x), which is related to stochastic optimization with f being the empirical average. The algorithm assumes only one of the f_i's is observed at each iteration, and the surrogate function is updated based on the current f_i and the algorithm history recursively.

In [69], the global upperbound requirement of the surrogate function has been relaxed to just being a local upperbound.

In this paper, we restrict our scope to the standard MM with a single block of variables². For a comprehensive analysis of the above-described extensions, we refer the reader to [69], [70], [72], and [73].

C. Convergence of MM

Throughout this subsection, we assume as a preliminary that the MM conditions (2) and (3) hold and that X is convex. The convexity of X and continuity of f are minimum assumptions for a unified study of algorithm convergence. In some applications, MM is derived for a problem with a discontinuous objective function or a non-convex constraint set; see [17], [19] for examples. The convergence of these algorithms deserves a case by case study.

In Eq. (4), we have shown that the objective value is non-increasing and converges to a limit f* by the assumption that f is bounded below. The next step is to establish the conditions that guarantee f* being a stationary value and also the convergence of the sequence (x_t)_{t∈N}.

1) Unconstrained Optimization: We make the following assumptions on f and g:

(A1) The sublevel set lev_{≤f(x_0)} f := {x ∈ X | f(x) ≤ f(x_0)} is compact given that f(x_0) < +∞;
(A2.1) f(x) and g(x|x_t) are continuously differentiable with respect to x;
(A3.1) g(x|x_t) is continuous in x and x_t.

For unconstrained problem (1), the set of stationary points of f is defined as

X* = {x | ∇f(x) = 0}.     (5)

Under Assumptions (A1), (A2.1), (A3.1), the following statements hold [74], [75]:

(C1) Any limit point x_∞ of (x_t)_{t∈N} is a stationary point of f;
(C2) f(x_t) ↓ f* monotonically and f* = f(x*) with x* ∈ X*;

²There are a few applications in this paper where MM is applied with block alternation. For presentation clarity we only describe the update of one block while treating the other blocks of variables as fixed parameters.


(C3) If f(M(x)) = f(x), then x ∈ X* and x ∈ arg min g(·|x);
(C4) If x is a fixed point of M, then x is a convergent point of MM and belongs to X*.

To establish the convergence of sequence (x_t)_{t∈N} to a stationary point, we further require one of the following assumptions:

(A4.1) Set X* is a singleton;
(A4.2) Set X* is discrete and ‖x_{t+1} − x_t‖ → 0;
(A4.3) Set X* is discrete, and g(·|x) has a unique global minimum for all x ∈ X*.

2) Constrained Optimization With Smooth Objective Function: With X convex and f continuously differentiable, the set of stationary points is defined as

X* = {x | ∇f(x)^T (y − x) ≥ 0, ∀y ∈ X}.     (6)

Conclusions (C1)–(C4) still hold under Assumptions (A1), (A2.1) and (A3.1) [69]. Moreover, Assumption (A3.1) can be replaced by (A3.2) stated next.

(A3.2) For all x_t generated by the algorithm, there exists γ ≥ 0 such that ∀x ∈ X, we have

(∇g(x|x_t) − ∇g(x_t|x_t))^T (x − x_t) ≤ γ ‖x − x_t‖².

Assumption (A3.2) is equivalent to stating that g(x|x_t) can be uniformly upperbounded by a quadratic function with the Hessian matrix being γI, which is easier to verify than (A3.1) when g(·|x_t) has a complicated form³.

Convergence of sequence (x_t)_{t∈N} to a stationary point can be proved by further requiring (A4.1) or (A4.2).

3) Constrained Optimization With Non-Smooth Objective Function: Finally, we address the case that f and g(·|x) are nonsmooth, but their directional derivatives exist for all feasible directions [70]. The set of stationary points is defined as

X* = {x | f′(x; d) ≥ 0, ∀x + d ∈ X},     (7)

where

f′(x_t; d) := lim inf_{λ↓0} [f(x_t + λd) − f(x_t)] / λ     (8)

is the directional derivative of f at x_t in direction d. Accordingly, the gradient consistency assumption (A2.1) is modified as follows:

(A2.2) f′(x_t; d) = g′(x_t; d|x_t), ∀x_t + d ∈ X.

Under Assumptions (A1), (A2.2), (A3.1), the sequence (x_t)_{t∈N} converges to the set X*, i.e.,

lim_{t→+∞} inf_{x∈X*} ‖x_t − x‖₂ = 0.

D. Acceleration Schemes

A drawback of MM is that it can suffer from a slow convergence speed [3], [13], mainly because of the restrictive upperbound condition. To alleviate this shortcoming, MM accelerators are often employed. Various types of accelerators have been proposed in the literature, including those derived based on the multivariate Aitken's method [76], conjugate gradient acceleration [64], Newton and quasi-Newton type acceleration [66], [77], [78], and over-relaxation [79]–[82]; see [83, Chap. 4] for an overview in the context of EM.

³Since the continuous differentiability of f and g(·|x_t) and the upperbound condition of MM (Eq. (2)) imply that the directional derivatives of f and g(·|x_t) are equal along all feasible directions (Proposition 1, [70]), the first order consistency condition (R2) in [69] holds automatically.

We begin with the idea of line search type algorithms. To minimize a function f, at the current point x_t one first determines a descent direction d_t, then a step-size α_t that decreases the objective function. MM can be interpreted in this way by identifying d_t := M(x_t) − x_t and α_t := 1.

The line search type accelerators modify the value of α_t to achieve a larger decrement of the objective value. For instance, in [84] α_t was determined by the two previous steps based on Aitken's method. This method may, however, destroy the monotonicity of the algorithm. A constant step-size α_t ≡ α was adopted in over-relaxation methods [79]–[82], and the optimal α was provided in [82]. Nevertheless, it is also pointed out that computing the optimal α is generally a difficult problem. To address these issues, α_t was suggested to be computed using line search so that f(x_{t+1}) ≤ f(x_t) is guaranteed [38], [82], [85].

Another class of accelerators also modifies the descent direction d_t. To ensure the objective value is nonincreasing, x_{t+1} need not be a global minimizer of g(·|x_t). Instead, one can solve (3) inexactly by taking a Newton step. This leads to the EM gradient algorithm [65]. A quasi-Newton accelerator proposed in [66] improves it by adding an approximation of the Hessian of H(x|x_t) := f(x) − g(x|x_t) to ∇²g(x|x_t) in the Newton step (assuming both ∇²H(x|x_t) and ∇²g(x|x_t) exist). In [64], the generalized gradient algorithm was applied to minimize f by treating M(x_t) − x_t as the generalized gradient. See [85] for an overview and comparison of the above-mentioned accelerators.

Finally, we introduce a class of accelerators based on the idea of finding a fixed point of M, which is a stationary point of f if Assumptions (A1), (A2.1), and (A3.1) hold. Assuming that M is continuously differentiable, it is known that Newton's method enjoys a quadratic convergence rate in the vicinity of a fixed point. Defining F(x) = M(x) − x, a Newton step for finding a zero of F is given by⁴

x_{t+1} = x_t − (∇F(x_t))^{-1} F(x_t),

where ∇F is the Jacobian of F. While F(x_t) can be evaluated by the MM step, the Jacobian ∇F(x_t) is hard to obtain in general (unless M(x) has an explicit form) and is often approximated based on the previous iterates (x_{t′})_{0≤t′≤t}. The STEM accelerator proposed in [86] approximates ∇F(x_t) by a scaled identity matrix. The Aitken [76] and SQUAREM accelerators [86] approximate ∇F(x_t) using the secant method. More recently, an accelerator was proposed in [87] that approximates ∇F(x_t) based on the quasi-Newton method.

We point out that Newton type algorithms converge only in the vicinity of a stationary point; therefore accelerators based on Newton's iteration are often executed after a few MM steps so that x_t falls into the convergence region. It is also worth mentioning that the MM acceleration schemes are developed for

⁴The sequence (x_t)_{t∈N} of the Newton iteration here should be distinguished from the MM sequence (x_t)_{t∈N}.


unconstrained optimization problems (except the cases where the constraint can be eliminated by reparameterization). For a constrained optimization problem, it is generally not true that the point returned by accelerators will be feasible. In this case, heuristic manipulations such as projection to the feasible set can be employed.
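
As an illustration of the fixed-point accelerators just described, the sketch below implements one simple SQUAREM-flavored step: a scalar secant approximation of ∇F(x_t), in the spirit of [86]. The monotonicity safeguard and the interface (an MM map M and an objective f passed as callables) are assumptions of this sketch, not details taken from [86].

```python
import numpy as np

def squarem_step(M, f, x):
    """One SQUAREM-flavored acceleration step for a given MM map M.

    M : the MM map x_t -> x_{t+1};  f : the objective, used only as a safeguard.
    """
    x1 = M(x)
    x2 = M(x1)
    r = x1 - x                       # first difference, plays the role of F(x)
    v = (x2 - x1) - r                # second difference
    if np.linalg.norm(v) == 0.0:
        return x2                    # already at (numerical) convergence
    alpha = -np.linalg.norm(r) / np.linalg.norm(v)   # scalar secant step length
    x_acc = x - 2.0 * alpha * r + alpha ** 2 * v     # extrapolated point
    # Safeguard: fall back to the plain MM point if the extrapolation
    # does not decrease the objective (monotonicity is not automatic).
    return x_acc if f(x_acc) <= f(x2) else x2
```

For constrained problems, the extrapolated point x_acc would additionally have to be projected back onto the feasible set, as noted above.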

III. SURROGATE FUNCTION CONSTRUCTION

The key step of applying MM is constructing a surrogate function. While there is no concrete step to follow, some commonly adopted rules that can provide guidance exist. In this section, techniques to find surrogate functions, along with a number of illustrating examples, will be presented. The inequalities provided here will serve as building blocks in finding surrogate functions for more sophisticated objective functions in Section V.

A. First Order Taylor Expansion

Suppose f can be decomposed as

f(x) = f_0(x) + f_ccv(x),     (9)

where f_ccv is a differentiable concave function. Linearizing f_ccv at x = x_t yields the following inequality:

f_ccv(x) ≤ f_ccv(x_t) + ∇f_ccv(x_t)^T (x − x_t),     (10)

thus f can be upperbounded as

f(x) ≤ f_0(x) + ∇f_ccv(x_t)^T x + const.

Example 1: Function log(x) can be upperbounded as

log(x) ≤ log(x_t) + (1/x_t)(x − x_t)     (11)

with equality achieved at x = x_t.

Example 2: Function log det(Σ) can be upperbounded as

log det(Σ) ≤ log det(Σ_t) + Tr(Σ_t^{-1}(Σ − Σ_t))     (12)

with equality achieved at Σ = Σ_t.

Example 3: Function Tr(S X^{-1}) with both S and X in S_{++} can be lowerbounded as

Tr(S X^{-1}) ≥ Tr(S X_t^{-1}) − Tr(X_t^{-1} S X_t^{-1} (X − X_t))     (13)

with equality achieved at X = X_t.

Example 4 [88]: Function Tr(X^T Y^{-1} X) with Y ∈ S_{++} can be lowerbounded as

Tr(X^T Y^{-1} X) ≥ 2 Tr(X_t^T Y_t^{-1} X) − Tr(Y_t^{-1} X_t X_t^T Y_t^{-1} Y) + const.     (14)

with equality achieved at (X, Y) = (X_t, Y_t).

Proof: Function Tr(X^T Y^{-1} X) is jointly convex in X and Y, therefore lowerbounded by its linear expansion around (X_t, Y_t), which implies (14). □

Remark 5: We emphasize that the upperbounds derived based on linearizing a concave function are not necessarily linear in the variables; see Eq. (51) for example.

Fig. 2. Surrogate function construction technique by first order Taylor expansion: a concave function can be upperbounded by a linear function, which can be upperbounded by a convex function.

More generally, given a convex, a linear, and a concave function, f_cvx, f_lin, and f_ccv, respectively, if their values and gradients are equal at some x_t, then, for any x,

f_ccv(x) ≤ f_lin(x) ≤ f_cvx(x),     (15)

as illustrated in Fig. 2.

Example 6: Function |x|^p, 0 < p ≤ 1, which is concave on (−∞, 0] and [0, +∞), can be upperbounded as⁵

|x|^p ≤ (p/2) |x_t|^{p−2} x² + const.,     (16)

provided that x_t ≠ 0.

Inequality (16) plays an important role in iteratively reweighted least squares (IRLS) algorithms, where a quadratic upperbound is preferred to a tighter linear one in the majorization step, with the benefit that the minimization step admits a solution that is easy to compute.
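
As a quick numerical sanity check of inequality (16) (a numpy sketch of our own, not part of the paper), the snippet below compares |x|^p with its quadratic majorizer at a point x_t on a grid; the additive constant is chosen so that the two curves touch at x = x_t.

```python
import numpy as np

p, x_t = 0.5, 1.3
x = np.linspace(-3.0, 3.0, 601)

f = np.abs(x) ** p
# Quadratic majorizer from (16); const makes the bound tight at x = x_t.
const = (1.0 - p / 2.0) * np.abs(x_t) ** p
g = (p / 2.0) * np.abs(x_t) ** (p - 2.0) * x ** 2 + const

assert np.all(g >= f - 1e-12)                 # upperbound property of (16)
assert np.isclose(g[np.argmin(np.abs(x - x_t))], np.abs(x_t) ** p, atol=1e-3)
```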

In the last example, we show that inequality (15) can be used to construct lowerbounds for maximization problems.

Example 7 [54]: A monomial ∏_{i=1}^n x_i^{α_i}, where x_i ≥ 0 ∀i, can be lowerbounded as

∏_{i=1}^n x_i^{α_i} ≥ (∏_{i=1}^n (x_i^t)^{α_i}) (1 + ∑_{i=1}^n α_i log x_i − ∑_{i=1}^n α_i log x_i^t)     (17)

with equality achieved at x_i = x_i^t.

Proof: Inequality (11) implies that

log(∏_{i=1}^n x_i^{α_i}) ≤ log(∏_{i=1}^n (x_i^t)^{α_i}) + (∏_{i=1}^n (x_i^t)^{α_i})^{-1} (∏_{i=1}^n x_i^{α_i} − ∏_{i=1}^n (x_i^t)^{α_i}).

Rearranging the terms we have (17). □

⁵The result also holds for 1 < p ≤ 2 although |x|^p is convex.


The surrogate function is separable in the variables, which can be optimized in parallel if the constraints are also separable.

B. Convexity Inequality

For a convex function f_cvx, we have the following inequality:

f_cvx(∑_{i=1}^n w_i x_i) ≤ ∑_{i=1}^n w_i f_cvx(x_i),     (18)

where ∑_{i=1}^n w_i = 1, w_i ≥ 0, ∀i = 1, . . . , n. Equality can be achieved if the x_i's are equal, or for different x_i's if f_cvx is not strictly convex.

Example 8 (Jensen's Inequality): Let f : X → R be a convex function and x be a random variable that takes values in X. Assuming that E(x) and E(f(x)) are finite, then

E(f(x)) ≥ f(E(x)).

With Jensen's inequality we can show that EM is a special case of MM (cf. Section IV-A).

Particularizing (18) for the concave function log, we have the following inequality.

Example 9: Function ∑_{i=1}^n α_i log f_i(x) with α_i > 0 can be upperbounded as

∑_{i=1}^n α_i log f_i(x) ≤ ∑_{i=1}^n α_i log f_i(x_t) + (∑_{i=1}^n α_i) log( (∑_{i=1}^n α_i f_i(x)/f_i(x_t)) / (∑_{i=1}^n α_i) ),     (19)

where f_i(x) > 0, ∀i. Equality is achieved at x = x_t.

Inequality (19) creates a concave upperbound for ∑_{i=1}^n α_i log f_i(x) by merging the summation inside the log function. Recall that by applying inequality (11) we can obtain an alternative upperbound that is linear in the f_i(x)'s as

∑_{i=1}^n α_i log f_i(x) ≤ ∑_{i=1}^n α_i ( log f_i(x_t) + (1/f_i(x_t)) (f_i(x) − f_i(x_t)) ).     (20)

However, the concave upperbound (19) is tighter, thus is preferred to (20) for a faster convergence rate; see Fig. 3 as an illustration.

Particularizing inequality (18) for 1/x we have the following bound.

Example 10: The function 1/(∑_{i=1}^n a_i x_i) with a_i > 0 and x_i > 0 can be upperbounded as

1/(∑_{i=1}^n a_i x_i) ≤ (∑_{i=1}^n a_i (x_i^t)² x_i^{-1}) / (∑_{i=1}^n a_i x_i^t)²     (21)

with equality achieved at x_i = x_i^t, ∀i = 1, . . . , n.

Generalizing (21) to a convex function f yields the following inequality.

Fig. 3. Objective function: f(x) = 3 log(1 + x) + 5 log(1 + 3x) + 1.5 log(1 + 6x); log upperbound: upperbound given by (19); linear upperbound: upperbound given by (20).

Example 11 [13]: The convex function f(a^T x) can be upperbounded as

f(a^T x) ≤ ∑_{i=1}^n α_i f( (a_i/α_i)(x_i − x_i^t) + a^T x_t ),     (22)

where α_i > 0, ∑_{i=1}^n α_i = 1. Moreover, if the elements of a and x_t are positive, letting α_i = a_i x_i^t / (a^T x_t) yields a different upperbound as

f(a^T x) ≤ ∑_{i=1}^n (a_i x_i^t / (a^T x_t)) f( (a^T x_t / x_i^t) x_i ).     (23)

Inequalities (22) and (23) were proposed and applied in medical imaging in [6], [9].

C. Construction by Second Order Taylor Expansion

Lemma 12 (Descent Lemma [89]): Let f : R^n → R be a continuously differentiable function with a Lipschitz continuous gradient and Lipschitz constant L (we say that ∇f is L-Lipschitz henceforth). Then, for all x, y ∈ R^n,

f(x) ≤ f(y) + ∇f(y)^T (x − y) + (L/2) ‖x − y‖².     (24)

More generally, if f has bounded curvature, i.e., there exists a matrix M such that M ⪰ ∇²f(x), ∀x ∈ X, then the following inequality implied by Taylor's theorem [88] holds:

f(x) ≤ f(y) + ∇f(y)^T (x − y) + (1/2)(x − y)^T M (x − y).     (25)

Particularizing (25) for f(x) = x^H L x gives the following inequality⁶.

⁶Wirtinger calculus is applied for complex-valued matrix differentials [90].


Example 13: The quadratic form x^H L x, where L is a Hermitian matrix, can be upperbounded as

x^H L x ≤ x^H M x + 2 Re(x^H (L − M) x_t) + x_t^H (M − L) x_t,     (26)

where M ⪰ L. Equality is achieved at x = x_t.

Example 13 shows that using (26) we can replace L by M with some desired structures, such as being a diagonal matrix, so that the surrogate function is separable.
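
A small numpy sketch (our illustration, not the paper's code) of this separable majorization: choose M = λ_max(L) I, which satisfies M ⪰ L, and verify the upperbound property (26) at random points.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random Hermitian L and the diagonal majorizer M = lambda_max(L) * I (M >= L).
n = 5
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
L = (A + A.conj().T) / 2
M = np.max(np.linalg.eigvalsh(L)) * np.eye(n)

def quad(x, Q):
    return np.real(x.conj() @ Q @ x)

def surrogate(x, x_t):
    # Right-hand side of (26); separable in x because M is diagonal.
    return (quad(x, M)
            + 2 * np.real(x.conj() @ (L - M) @ x_t)
            + quad(x_t, M - L))

x_t = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
assert surrogate(x, x_t) >= quad(x, L) - 1e-9          # upperbound property
assert np.isclose(surrogate(x_t, x_t), quad(x_t, L))   # tight at x = x_t
```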

D. Arithmetic-Geometric Mean Inequality

The arithmetic-geometric mean inequality states that [88]

∏_{i=1}^n z_i^{α_i} ≤ ∑_{i=1}^n (α_i/‖α‖₁) z_i^{‖α‖₁},     (27)

where z_i and α_i are nonnegative scalars. Equality is achieved when the z_i's are equal.

Letting z_i = x_i/x_i^t for α_i > 0 and z_i = x_i^t/x_i for α_i < 0 we have the following inequality.

Example 14 [54]: A monomial ∏_{i=1}^n x_i^{α_i} can be upperbounded as

∏_{i=1}^n x_i^{α_i} ≤ (∏_{i=1}^n (x_i^t)^{α_i}) ∑_{i=1}^n (|α_i|/‖α‖₁) (x_i/x_i^t)^{‖α‖₁ sgn(α_i)}.     (28)

Equality is achieved at x_i = x_i^t, ∀i = 1, . . . , n.

Upperbound (28) and lowerbound (17) serve as the basic ingredients for deriving MM algorithms for signomial programming [54].

Example 15 [53]: A posynomial ∑_{i=1}^n u_i(x), where u_i(x) is a monomial, can be lowerbounded as

∑_{i=1}^n u_i(x) ≥ ∏_{i=1}^n (u_i(x)/α_i)^{α_i},     (29)

where α_i = u_i(x_t) / ∑_{j=1}^n u_j(x_t). Equality is achieved at x = x_t.

Inequality (29) can be used in solving complementary geometric programming (GP) with the objective function being the ratio of posynomials.

Example 16: The ℓ₂-norm ‖x‖₂ can be upperbounded as

‖x‖₂ ≤ (1/2)(‖x_t‖₂ + ‖x‖₂² / ‖x_t‖₂),     (30)

given that ‖x_t‖₂ ≠ 0. Equality is achieved at x = x_t.

E. Cauchy-Schwartz Inequality

The Cauchy-Schwartz inequality states that

x^T y ≤ ‖x‖₂ ‖y‖₂.

Equality is achieved when x and y are collinear.

Example 17: Function |a^H x| can be lowerbounded as

|a^H x| ≥ Re(x_t^H a a^H x) / |a^H x_t|,     (31)

given that |a^H x_t| ≠ 0. Equality is achieved at x = x_t.

Proof: For two complex numbers z₁ = u₁ + iv₁ and z₂ = u₂ + iv₂, we have

Re(z₁ z₂^*) = u₁u₂ + v₁v₂ ≤ √(u₁² + v₁²) · √(u₂² + v₂²)

by the Cauchy-Schwartz inequality. Letting z₁ = a^H x and z₂ = a^H x_t yields the desired inequality. □

Example 18: The ℓ₂-norm ‖x‖₂ can be lowerbounded as

‖x‖₂ ≥ x^T x_t / ‖x_t‖₂,     (32)

given that ‖x_t‖₂ ≠ 0. Equality is achieved at x = x_t.

Together with (30), they provide a quadratic upperbound and a linear lowerbound for the ℓ₂-norm on the whole space except the origin.

F. Schur Complement

The Schur complement condition for C ≻ 0 states that

X = [A  B; B^T  C] ⪰ 0

if and only if the Schur complement of C is in S_+. That is,

S := A − B C^{-1} B^T ⪰ 0.     (33)

Inequality (33) provides a way to upperbound the inverse of a matrix.

Example 19 ([25]): Assuming P ≻ 0, the matrix (A P A^H)^{-1} can be upperbounded as

R_t^{-1} A P_t P^{-1} P_t A^H R_t^{-1} ⪰ (A P A^H)^{-1},     (34)

where R_t = A P_t A^H. Equality is achieved at P = P_t.

Inequality (34) can also be derived based on convexity [63]. Particularizing (34) for P = diag(p₁, . . . , p_n) and A = [√a₁, . . . , √a_n] gives a different derivation for inequality (21).

G. Generalization

With the inequalities provided above we can construct surrogate functions for more complicated objective functions by majorizing f more than once. Specifically, one can find a sequence of functions g^(1)(·|x_t), . . . , g^(k)(·|x_t) satisfying

g^(i)(x_t|x_t) = g^(i+1)(x_t|x_t)
g^(i)(x|x_t) ≤ g^(i+1)(x|x_t),  ∀x ∈ X, ∀i = 1, . . . , k − 1.

Function g^(i)(·|x_t) usually gets a simpler structure gradually until its minimizer is easy to compute, as illustrated by the applications in Section V.

IV. CONNECTION TO OTHER ALGORITHMIC FRAMEWORKS

A. The EM Algorithm

Introduced in [4], EM is often employed to derive an iterative scheme for ML estimation problems with latent variables. To be precise, denote the observed variable by x and the latent variable


by z, the maximum likelihood estimator (MLE) of parameter θ is defined as the maximizer of the log-likelihood function

L(θ) = log p(x|θ) = log E_{z|θ} p(x|z, θ).

In the E-step of EM, we compute

g(θ|θ_t) = E_{z|x,θ_t} log p(x, z|θ),

where p(z|x, θ_t) is the posterior distribution of z given the current estimate θ_t, and g(θ|θ_t) is the expected log-likelihood of the complete data set. Then in the M-step, the new estimate θ_{t+1} is defined as

θ_{t+1} ∈ arg max_{θ∈Θ} g(θ|θ_t).

Applying Jensen's inequality, we have

L(θ) = log E_{z|θ} p(x|z, θ)
     = log E_{z|x,θ_t} [ p(x|z, θ) p(z|θ) / p(z|x, θ_t) ]
     ≥ E_{z|x,θ_t} log( p(x|z, θ) p(z|θ) / p(z|x, θ_t) )
     = g(θ|θ_t) + const.,

which shows that g(θ|θ_t) is a lower bound of L(θ). Therefore, EM is a special case of MM [11]. Moreover, theoretical results of EM such as convergence analysis and acceleration schemes can be adapted to MM [3], [10].

We mention that EM can also be viewed as a proximal minimization algorithm by rewriting g(θ|θ_t) as

g(θ|θ_t) = log p(x|θ) − β_t I(θ_t, θ)

with the proximal term

I(θ_t, θ) = ∫ log( p(z|x, θ_t) / p(z|x, θ) ) p(z|x, θ_t) dz

being the KL-divergence between p(z|x, θ_t) and p(z|x, θ), and β_t = 1 [91]. The ratio p(z|x, θ_t)/p(z|x, θ) is assumed to exist for all θ and θ_t. This connection suggests that one could tune the penalty parameter β_t to achieve a faster convergence rate.

In addition, EM belongs to the class of cyclic algorithms as well [92]. This can be shown by defining the function

F(p, θ) = E_p(log p(x, z|θ)) − E_p(log p(z))

and noticing that the E-step gives the optimal p(z) with θ fixed as θ_t, and the M-step gives the optimal θ_{t+1} with p(z) fixed as p(z|x, θ_t). The equivalence of MM and cyclic algorithms with finite dimensional variables will be justified in the following subsection.

B. Cyclic Minimization

If there exists an augmented function g : X × Y → R satisfying

f(x) = min_{y∈Y} g(x, y),

then the problem

minimize_x   f(x)
subject to   x ∈ X     (35)

can be equivalently reformulated as

min_{x∈X} min_{y∈Y} g(x, y).     (36)

The objective function g can be minimized by alternately minimizing it with respect to x and y. That is, (x, y) is updated as

y_{t+1} ∈ arg min_{y∈Y} g(x_t, y)
x_{t+1} ∈ arg min_{x∈X} g(x, y_{t+1}).     (37)

This method is referred to as cyclic minimization and appears in applications such as [37], [62], [93]–[96].

Here we prove that cyclic minimization and MM are equivalent. First we show that cyclic minimization belongs to MM. Define y*(x) ∈ arg min_{y∈Y} g(x, y); then

g(x, y*(x)) = f(x).     (38)

For any given feasible x_t ∈ X, we have

g(x, y*(x_t)) ≥ g(x, y*(x)) = f(x).     (39)

Eqs. (38) and (39) imply that g(x, y*(x_t)) is a surrogate function of f(x), and (37) is an MM iteration with x_{t+1} ∈ arg min_x g(x, y*(x_t)).

Conversely, MM can be regarded as cyclic minimization as follows. The MM conditions

g(x|x) = f(x)
g(x|y) ≥ f(x)

∀x, y ∈ X imply that x ∈ arg min_{y∈X} g(x|y). Therefore the MM iteration can be rewritten as

y_{t+1} = x_t ∈ arg min_{y∈X} g(x_t|y)
x_{t+1} ∈ arg min_{x∈X} g(x|y_{t+1}),

which can be interpreted as minimizing g(x|y) with respect to x and y alternately.

C. DC Programming and Concave-Convex Procedure

DC programming problems take the general form

minimize_x   f_0(x) − h_0(x)
subject to   f_i(x) − h_i(x) ≤ 0, i = 1, . . . , m,     (40)

where f_i(·) and h_i(·) for i = 0, . . . , m are convex functions [97], [98]. We assume that the f_i's and h_i's are differentiable and, without loss of generality, that they are strongly convex.

The concave-convex procedure (CCCP) [99]–[101], developed to reach a local minimum of (40), states that x_t can be updated by solving the following convex subproblem:

minimize_x   g_0(x|x_t)
subject to   g_i(x|x_t) ≤ 0, ∀i = 1, . . . , m,

where

g_i(x|x_t) = f_i(x) − (h_i(x_t) + ∇h_i(x_t)^T (x − x_t)),     (41)

for all i = 0, . . . , m.

Approximation (41) satisfies the MM principle and is a tight upperbound of f_i − h_i with equality attained at x = x_t. As a result, CCCP is a special case of MM if h_i ≡ 0, ∀i = 1, . . . , m. When there exists some h_i ≢ 0, the constraint set {x | g_i(x|x_t) ≤ 0, ∀i = 1, . . . , m} approximates the original constraint set from inside and is tangent to it at x = x_t.

D. Proximal Minimization

The proximal minimization algorithm [102]–[104] has a cyclic minimization interpretation, thus also belongs to MM. Specifically, it minimizes f : X → R by introducing an auxiliary variable y and solving

minimize_{x∈X, y∈X}   g(x, y) = f(x) + (1/(2c)) ‖y − x‖₂².

The objective function g(x, y) is minimized alternately with respect to x and y, leading to the iteration:

x_{t+1} ∈ arg min_{x∈X} f(x) + (1/(2c)) ‖x − y_t‖₂²,
y_{t+1} = x_{t+1}.     (42)

Algorithm (42) can be generalized as:

x_{t+1} = prox_{A(x_t), f}(x_t) := arg min_{x∈X} f(x) + (1/2) ‖x − x_t‖²_{A(x_t)},

where A(x_t) ∈ S^n_{++} and ‖x‖²_{A(x_t)} := x^T A(x_t) x.

E. Variable Metric Splitting Method for Non-Smooth Optimization

Variable metric forward-backward splitting (VMFB) can be derived based on MM for solving problems of the form

minimize_{x∈X}   f(x) + h(x),

where f is a differentiable function and h is a convex non-smooth function [105]. For presentation clarity we introduce below its simplest version to illustrate the connection.

Let (A_t)_{t∈N} be a sequence of positive definite matrices satisfying

g_f(x|x_t) = f(x_t) + ∇f(x_t)^T (x − x_t) + (1/2) ‖x − x_t‖²_{A_t} ≥ f(x),     (43)

i.e., g_f(·|x_t) is a quadratic function that majorizes f at x = x_t. Then we can upperbound f + h by

g(x|x_t) = f(x_t) + ∇f(x_t)^T (x − x_t) + (1/(2γ_t)) ‖x − x_t‖²_{A_t} + h(x),     (44)

where γ_t ∈ (0, 1), ∀t ∈ N. Omitting the constant terms, the update x_{t+1} is given by

x_{t+1} = prox_{γ_t^{-1} A_t, h}(x_t − γ_t A_t^{-1} ∇f(x_t))
        := arg min_{x∈X} (1/(2γ_t)) ‖x − (x_t − γ_t A_t^{-1} ∇f(x_t))‖²_{A_t} + h(x).     (45)

Steps (44) and (45) can be interpreted as MM naturally.

If ∇f is L-Lipschitz continuous, then A_t can be set as A_t = LI and the descent lemma (24) implies that condition (43) holds. In this case, VMFB reduces to the proximal gradient algorithm (see [61] and [104] for examples).

Similar to MM, VMFB can also be generalized to blockwise update [57], [106].
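
As an illustration of this proximal gradient special case, the sketch below applies the update (45) with A_t = LI to f(x) = (1/2)‖Ax − y‖₂² and h(x) = λ‖x‖₁, for which the proximal step is the soft-thresholding operator. The lasso-type instance and the function names are our choices, not details taken from [105].

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (closed form for this choice of h)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient_lasso(A, y, lam, n_iter=200):
    """VMFB with A_t = L*I for f(x) = 0.5*||Ax - y||^2 and h(x) = lam*||x||_1.

    L is the Lipschitz constant of grad f, i.e. the largest eigenvalue of A^T A.
    """
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)               # gradient of f at the current x
        x = soft_threshold(x - grad / L, lam / L)   # proximal step, Eq. (45)
    return x
```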

F. Successive Convex Approximation (SCA) Algorithms

1) Approximating the Objective Function: Consider the following problem

minimize_x   f(x) + h(x)
subject to   x ∈ X,

where f : X → R is smooth with a Lipschitz continuous gradient, and h : X → R is convex and possibly non-differentiable.

To arrive at a stationary point, FLEXA [107]–[109] approximates f by a strongly convex function g(·|x_t) satisfying the property that ∇g(x_t|x_t) = ∇f(x_t). The subproblem to be solved is

minimize_x   g(x|x_t) + (τ/2)(x − x_t)^T Q(x_t)(x − x_t) + h(x)
subject to   x ∈ X,

where Q(x_t) ∈ S_{++}.

The main differences between FLEXA and MM are summarized as follows:

• Applicable problems: To ensure convergence, both FLEXA and MM require the objective function to be continuous, and the set X to be convex. MM has been applied to some applications with a discontinuous objective function and non-convex set to devise an algorithm with a convergent objective value. The convergence of the iterates, however, needs to be studied separately.
• Approximating function: MM requires the surrogate function to be a global upperbound, not necessarily convex. On the contrary, FLEXA relaxes the upperbound condition, but requires the approximating function to be strongly convex.

For the sake of a clearer comparison, the FLEXA algorithm presented here is a simplified version of that proposed in [107] and [108], where blockwise update and parallel computation are incorporated. Extensions of the algorithm to stochastic optimization can be found in [110].

2) Approximating Both the Objective Function and Constraint Set: Consider the problem

minimize_x   f_0(x)
subject to   f_i(x) ≤ 0, i = 1, . . . , m.

Apart from f_0, we can also approximate the feasible set {x | f_i(x) ≤ 0, i = 1, . . . , m} at each iteration. As proposed in the early work [111], assuming that f_i is differentiable, we can solve the following convex subproblem at the t-th iteration:

minimize_x   g_0(x|x_t)
subject to   g_i(x|x_t) ≤ 0, i = 1, . . . , m,

where g_i(·|x_t), ∀i = 0, . . . , m, is a convex function that satisfies

g_i(x_t|x_t) = f_i(x_t)
g_i(x|x_t) ≥ f_i(x)
∇g_i(x_t|x_t) = ∇f_i(x_t).     (46)

The limit of any convergent sequence of (x_t)_{t∈N} is a KKT point. In short, the subproblem is constructed by upperbounding the objective function by a convex surrogate function, and approximating the feasible set from inside by a convex set.

The condition that g_i(·|x_t) is a global upperbound can be relaxed to just being the first order convex approximation. In addition, it can be generalized to blockwise update with the blocks updated either sequentially or in parallel. We refer readers to [101] and [112]–[115] for the details and convergence analysis.

G. Subspace MM Algorithm

The descent nature of MM indicates that it can be employed for step-size selection. Recall that line search type nonlinear optimization algorithms with update x_{t+1} = x_t + α_t d_t (x_t ∈ R^n) can be described as first finding a gradient-related descent direction d_t, and then the step-size as

α_t = arg min_{α≥0} f(x_t + α d_t).     (47)

The exact line search criterion (47) can be relaxed by only requiring that α_t d_t generates a sufficient decrease of the objective value.

MM subspace optimization generalizes the search space to be the column space of a matrix D_t = [d_t^1, . . . , d_t^m] (D_t is usually constructed by the gradient directions of the previous x_t's), and the step-size to be α_t ∈ R^m. Given D_t, α_t is found by MM so that the objective value decreases.

In the following we assume ∇f is L-Lipschitz. Define f_t as f_t(α) = f(x_t + D_t α); then

f_t(α) = f(x_t + D_t α)
       ≤ f_t(α_t^k) + ∇f_t(α_t^k)^T (α − α_t^k) + (L/2) ‖D_t(α − α_t^k)‖₂²
       := g(α|α_t^k).

The surrogate function g(α|α_t^k) is quadratic in α, and has a minimizer given by

α_t^{k+1} = α_t^k − (L D_t^T D_t)^† ∇f_t(α_t^k).

When m = 1, the method reduces to MM line search [116], [117], and when m = n it recovers the ordinary MM. Analysis of the algorithm convergence and generalizations can be found in [118]–[122].

V. APPLICATIONS

In this section, we demonstrate applications of MM categorized according to the techniques in Section III.

A. First Order Taylor Expansion

A large number of MM algorithms are derived based on linearizing the concave components in the objective function, as shown in the following applications.

1) Reweighted ℓ₁-norm Minimization: The problem of finding a sparse solution of an underdetermined equation system y = Ax can be formulated as

minimize_x   ∑_{i=1}^n log(ε + |x_i|)
subject to   y = Ax,     (48)

where the objective function is an approximation of the ℓ₀-norm with ε > 0 [15].

The reweighted ℓ₁-norm minimization algorithm solves problem (48) by solving

minimize_x   ∑_{i=1}^n |x_i| / (ε + |x_i^t|)
subject to   y = Ax     (49)

at the t-th iteration, which is an MM step obtained by applying inequality (11) to the objective function.
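
A minimal sketch of iteration (49) is given below, using the cvxpy modeling package for the weighted ℓ₁ subproblem; the choice of package and the function names are ours (any linear-programming solver would do), not part of the paper.

```python
import numpy as np
import cvxpy as cp

def reweighted_l1(A, y, eps=1e-2, n_iter=10):
    """Sketch of the reweighted l1 iteration (49) for problem (48)."""
    m, n = A.shape
    w = np.ones(n)                       # initial weights: first step is plain l1
    x_val = np.zeros(n)
    for _ in range(n_iter):
        x = cp.Variable(n)
        objective = cp.Minimize(cp.sum(cp.multiply(w, cp.abs(x))))
        prob = cp.Problem(objective, [A @ x == y])
        prob.solve()
        x_val = x.value
        w = 1.0 / (eps + np.abs(x_val))  # new weights from the current iterate
    return x_val
```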

2) Robust Covariance Estimation: A robust estimator of covariance matrix R with zero-mean observations {x_i}_{i=1}^N is formulated as the minimizer of the following problem [33]:

minimize_{R≻0}   log det(R) + (K/N) ∑_{i=1}^N log(x_i^H R^{-1} x_i).     (50)

By inequality (11) a surrogate function can be found as

g(R|R_t) = log det(R) + (K/N) ∑_{i=1}^N (x_i^H R^{-1} x_i) / (x_i^H R_t^{-1} x_i),     (51)

which is not convex in R, but has a closed-form minimizer given by

R_{t+1} = (K/N) ∑_{i=1}^N (x_i x_i^H) / (x_i^H R_t^{-1} x_i).     (52)

Notice that we can replace the log function in log(x_i^H R^{-1} x_i) by a continuously differentiable concave function ρ and the same derivation applies. This idea has also been used in [22]–[24] for regularized covariance estimation problems.
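
A minimal numpy sketch (our illustration, not the authors' code) of the fixed-point iteration (52), assuming the observations x_i are stacked as the rows of a matrix X:

```python
import numpy as np

def robust_covariance(X, K=None, n_iter=100, tol=1e-8):
    """Sketch of the MM update (52).

    X : N x d array whose i-th row is the observation x_i (real or complex).
    K : the constant in (50); defaults to the dimension d.
    """
    N, d = X.shape
    K = d if K is None else K
    R = np.eye(d, dtype=X.dtype)
    for _ in range(n_iter):
        Rinv = np.linalg.inv(R)
        q = np.real(np.einsum('ij,jk,ik->i', X.conj(), Rinv, X))  # x_i^H R^{-1} x_i
        R_new = (K / N) * (X.T * (1.0 / q)) @ X.conj()            # Eq. (52)
        if np.linalg.norm(R_new - R) <= tol * np.linalg.norm(R):
            return R_new
        R = R_new
    return R
```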

In contrast to problems (48) and (50), where discovering a surrogate function is easy, some applications require one to exploit hidden concavity by manipulating the objective function, as illustrated by the following example.


3) Variance Component Model: Consider the signal model

x_i = A s_i + n_i,

where the s_i's and n_i's are zero mean i.i.d. signal and noise with Cov(s_i) = diag(p₁, . . . , p_L) and Cov(n_i) = σ²I, respectively.

Denote p = [p₁, . . . , p_L]^T and P = diag(p); the covariance R of x_i admits the structure

R = A P A^H + σ²I.

Assuming Gaussianity of the observations {x_i}_{i=1}^N, a maximum likelihood type estimator of R is defined as the solution of the following problem [34], [35]:

minimize_{R, P⪰0, σ}   log det(R) + Tr(S R^{-1})
subject to   R = A P A^H + σ²I,     (53)

where S = (1/N) ∑_{i=1}^N x_i x_i^H.

We describe the SBL algorithm that solves (53), derived based on EM [35], using MM. For simplicity, we assume that σ² is given. To find a separable surrogate function, we work with the precision matrix Γ = P^{-1} and rewrite the objective function as

L(Γ) = log det(Σ^{-1}) − log det(Γ) − σ^{-4} Tr(S A Σ A^H) + const.,

where Σ = (Γ + σ^{-2} A^H A)^{-1}.

Since log det is concave, and the last term of L(Γ) is convex in Σ^{-1}, we construct the surrogate function

g(Γ|Γ_t) = Tr(Σ_t Γ) − log det(Γ) + (σ^{-4}/N) ∑_{i=1}^N x_i^H A Σ_t Γ Σ_t A^H x_i

by inequalities (12) and (13).

Define μ_i = σ^{-2} Σ_t A^H x_i; then

g(Γ|Γ_t) = ∑_{j=1}^L Σ_{jj}^t Γ_j − ∑_{j=1}^L log Γ_j + (1/N) ∑_{i=1}^N μ_i^H Γ μ_i
         = ∑_{j=1}^L ( Σ_{jj}^t + (1/N) ∑_{i=1}^N |μ_{ij}|² ) Γ_j − ∑_{j=1}^L log Γ_j,     (54)

where μ_{ij} is the j-th element of μ_i. The update of Γ_j (equivalently p_j) can be computed in parallel as

(Γ_j^{t+1})^{-1} = p_j^{t+1} = Σ_{jj}^t + (1/N) ∑_{i=1}^N |μ_{ij}|².     (55)
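
A compact numpy sketch of the update (55) (our illustration; the data layout and the handling of the 1/N factor follow the derivation above), assuming the snapshots x_i are stacked as the columns of X and σ² is known:

```python
import numpy as np

def sbl_mm(A, X, sigma2, n_iter=50):
    """Sketch of the MM/SBL update (55) with known noise power sigma2.

    A : K x L dictionary, X : K x N matrix whose columns are the snapshots x_i.
    Returns the estimated variance vector p = [p_1, ..., p_L].
    """
    K, L = A.shape
    N = X.shape[1]
    p = np.ones(L)
    for _ in range(n_iter):
        Gamma = 1.0 / p                                        # signal precisions
        Sigma = np.linalg.inv(np.diag(Gamma) + A.conj().T @ A / sigma2)
        Mu = Sigma @ A.conj().T @ X / sigma2                   # columns are mu_i
        p = np.real(np.diag(Sigma)) + np.mean(np.abs(Mu) ** 2, axis=1)  # Eq. (55)
    return p
```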

4) Optimization With Projection Forms: Projection matrices appear in optimization problems in structured low-rank approximation [123], minimization of the MSE criterion [124], etc.

With a slight abuse of terminology, in this subsection we refer to matrices parameterized as

P(X) = L(X)^T Q(X)^{-1} L(X)     (56)

as projection forms, where L(X) is linear in X and Q(X) is quadratic in X. Note that P is a standard projection matrix if L(X) = X and Q(X) = X X^T.

By inequality (14), the trace of P(X) can be lowerbounded as

Tr(P(X)) ≥ 2 Tr(L(X_t)^T Q(X_t)^{-1} L(X)) − Tr(Q(X_t)^{-1} L(X_t) L(X_t)^T Q(X_t)^{-1} Q(X))     (57)

with equality achieved at X = X_t.

Let us consider the covariance matrix estimation problem in [36] with the following objective function:

L(W) = log det(τ W^H W + I) + z^H (τ W W^H + I)^{-1} z.     (58)

The first term is upperbounded as

log det(τ W^H W + I) ≤ Tr(τ (τ W_t^H W_t + I)^{-1} W^H W) + const.     (59)

by inequality (12). As for the second term, we first create a projection form by the matrix inversion lemma as follows:

z^H (τ W W^H + I)^{-1} z = z^H z − z^H W (τ^{-1} I + W^H W)^{-1} W^H z,     (60)

where the second term on the right-hand side is a projection form. Letting L(W) = W and Q(W) = τ^{-1} I + W^H W, inequality (57) implies that (60) can be upperbounded as

z^H (τ W W^H + I)^{-1} z ≤ Tr(W H_t W^H) − 2 Re(L_t W^H),     (61)

where H_t and L_t are coefficients given by

H_t = (τ^{-1} I + W_t^H W_t)^{-1} W_t^H z z^H W_t (τ^{-1} I + W_t^H W_t)^{-1}

and

L_t = z z^H W_t (τ^{-1} I + W_t^H W_t)^{-1}.

Combining (59) and (61) we arrive at a surrogate function

g(W|W_t) = Tr(W H W^H) − 2 Re(L_t W^H)

with

H = (W_t^H W_t + τ^{-1} I)^{-1} + H_t,

which has a closed-form minimizer given by W_{t+1} = L_t H^{-1}.

5) Maximizing a Convex Function Over a Compact Set: Consider the problem

maximize_x   f(x)
subject to   x ∈ K,     (62)

where K is a compact set and f : K → R is convex.

A gradient method has been proposed and analyzed in [37] to solve (62), which falls into the category of MM. Since f is convex, it can be minorized as

f(x) ≥ f(x_t) + ∇f(x_t)^T (x − x_t).


The maximization step is then given by

x_{t+1} ∈ arg max_{x∈K} ∇f(x_t)^T x.     (63)

Since K is compact, x_{t+1} is well-defined.

For example, if K = {x ∈ R^n | ‖x‖₂ = 1}, then x_{t+1} = ∇f(x_t)/‖∇f(x_t)‖₂; and if K is the Stiefel manifold defined as

K = {X ∈ R^{m×n} | X^T X = I_n},

where n ≤ m, let the polar decomposition of ∇f(X_t) be ∇f(X_t) = U P; then X_{t+1} = U.

6) SEVP with �0-norm Constraint: The sparse eigenvectorproblem (SEVP) aims at finding a sparse unit length vector xthat maximizes the quadratic form xT Ax, where A ∈ Sn

+ . Itattracts a lot of attention in applications such as bioinformatics,big data analysis, and machine learning, where a parsimoniousinterpretation of the data set is desired, see references [37],[125]–[127] for examples.

To enforce sparsity, we can include a zero norm constraint onx and formulate the problem as [19]:

maximizex

xT Ax

subject to ‖x‖2 = 1

‖x‖0 ≤ k.

(64)

Since $x^{T} A x$ is convex in $x$, it can be minorized at $x = x_t$ by $g(x \mid x_t) = 2 x_t^{T} A x$. Define $a = A^{T} x_t$ for notational simplicity. In the maximization step we need to solve the problem
$$\begin{array}{ll} \underset{x}{\text{maximize}} & a^{T} x \\ \text{subject to} & \|x\|_2 = 1, \ \ \|x\|_0 \leq k. \end{array} \tag{65}$$
Define $\mathcal{I}_k = \{i \mid x_i \neq 0\}$; then
$$a^{T} x = a_{\mathcal{I}_k}^{T} x_{\mathcal{I}_k} \leq \|a_{\mathcal{I}_k}\|_2 \|x_{\mathcal{I}_k}\|_2 = \|a_{\mathcal{I}_k}\|_2,$$
where the inequality follows from the Cauchy-Schwarz inequality, and the last equality follows from the constraint $\|x\|_2 = 1$ and the definition of $\mathcal{I}_k$. Observe that $\|a_{\mathcal{I}_k}\|_2$ is maximized when $\mathcal{I}_k$ is the set of indices of the $a_i$ with the $k$ largest absolute values, and $a^{T} x$ is maximized when $a$ and $x$ are collinear. Sort the elements of $|a| = [|a_1|, \ldots, |a_n|]^{T}$ in descending order; that is, find a permutation $\pi: \{1, \ldots, n\} \to \{1, \ldots, n\}$ such that $|a|_{\pi(1)} \geq \cdots \geq |a|_{\pi(n)}$. The solution $x^{\star}$ of problem (65) is given by
$$x^{\star}_i = \begin{cases} a_i, & |a_i| \geq |a|_{\pi(k)} \\ 0, & |a_i| < |a|_{\pi(k)} \end{cases} \qquad x^{\star} = x^{\star}/\|x^{\star}\|_2. \tag{66}$$
The algorithm is named the truncated power method, as computing the vector $a$ is a power iteration step, and in (66) the smallest $n - k$ elements of $|a|$ are truncated to zero.
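A minimal sketch of the truncated power method just described; the interface, initialization, and stopping rule are illustrative choices (the original algorithm is given in [19]).

```python
import numpy as np

def truncated_power_method(A, k, x0, max_iter=200, tol=1e-9):
    """Updates (65)-(66): a power step, hard thresholding to the k largest-
    magnitude entries, and renormalization. Assumes A is symmetric PSD."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(max_iter):
        a = A @ x                                  # power iteration step
        idx = np.argsort(np.abs(a))[:-k]           # indices of the n-k smallest |a_i|
        a[idx] = 0.0                               # truncate them to zero
        x_new = a / np.linalg.norm(a)              # renormalize to the unit sphere
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```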

7) SEVP with $\ell_0$-Norm Penalty: The SEVP can also be formulated in penalty form as [21]
$$\begin{array}{ll} \underset{x}{\text{maximize}} & x^{T} A x - \rho \|x\|_0 \\ \text{subject to} & \|x\|_2 = 1, \end{array} \tag{67}$$
where $\rho \geq 0$ is a parameter that controls the sparsity level. Linearizing the quadratic term, we obtain the minorizing function
$$g(x \mid x_t) = 2 x_t^{T} A x - \rho \|x\|_0. \tag{68}$$
Denote $a = 2 A^{T} x_t$. Supposing $\|x\|_0 = k \leq \|a\|_0$, the maximizer $x^{\star}$ of $g(x \mid x_t)$ is given by (66), with $g(x^{\star} \mid x_t)$ equal to
$$\sqrt{\textstyle\sum_{i=1}^{k} |a|_{\pi(i)}^{2}} \;-\; k\rho.$$
Therefore, the update $x_{t+1}$ has cardinality
$$k^{\star} = \arg\max_{k} \ \sqrt{\textstyle\sum_{i=1}^{k} |a|_{\pi(i)}^{2}} - k\rho,$$
and takes the form (66) with $k = k^{\star}$.

We conclude this subsection with algorithms derived by iteratively upper bounding $f$ by a quadratic form, which belong to the class of iteratively reweighted least squares (IRLS) algorithms. Note that these upper bounds for a convex objective function $f$ are not constructed from a first-order Taylor expansion; we include them in this subsection for completeness.

8) Edge-Preserving Regularization in Image Processing: Many image restoration and reconstruction problems can be formulated as
$$\underset{x}{\text{minimize}} \ \ f(x) + \Phi(x),$$
where $f$ is a quadratic function of the form
$$f(x) = x^{T} Q x - 2 q^{T} x,$$
and $\Phi(x) = \sum_{i=1}^{m} \varphi(\delta_i)$, with $\delta_i = (V^{T} x - w)_i$, is a regularization term with parameters $V \in \mathbb{R}^{n \times m}$ and $w \in \mathbb{R}^{m}$ [38]. Suppose that $\varphi: \mathbb{R} \to \mathbb{R}$ satisfies the regularity conditions: (1) $\varphi$ is even; (2) $\varphi$ is coercive and continuously differentiable; (3) $\varphi(\sqrt{\cdot})$ is concave on $\mathbb{R}_{+}$; and (4) $0 < \varphi'(t)/t < \infty$ (cf. Fig. 4 for examples of $\varphi$). Then an upper bound of $\varphi(\delta)$ can be derived from the concavity of $\varphi(\sqrt{\cdot})$ as follows:
$$\varphi(\delta) = \varphi\big(\sqrt{\delta^2}\big) \leq \frac{1}{2}\frac{\varphi'(\delta_t)}{\delta_t}\,\delta^2 + \text{const.}$$
Letting $\delta_i = (V^{T} x - w)_i$ yields the following quadratic surrogate function:
$$g(x \mid x_t) = x^{T}\Big(Q + \frac{1}{2} V D_t V^{T}\Big) x - (2q + V D_t w)^{T} x,$$
where $D_t$ is a diagonal matrix with $i$-th diagonal element $d_i = \varphi'(\delta_i^t)/\delta_i^t$. Assuming that $2Q + V D_t V^{T}$ is invertible, $x$ is then updated as
$$x_{t+1} = \big(2Q + V D_t V^{T}\big)^{-1}(2q + V D_t w). \tag{69}$$


Fig. 4. Examples of edge-preserving functions presented in [128]: $\varphi_{GM}(t) = \frac{t^2}{1+t^2}$, $\varphi_{HL}(t) = \log(1+t^2)$, $\varphi_{HS}(t) = 2\sqrt{1+t^2} - 2$, $\varphi_{GR}(t) = 2\log(\cosh(t))$.

Suppose alternatively that $\varphi$ satisfies: (1) $\varphi$ is coercive and continuously differentiable; and (2) $\varphi'$ is $L$-Lipschitz. Then $f$ can be majorized, based on the descent lemma (24), by the surrogate function
$$g(x \mid x_t) = x^{T}\Big(Q + \frac{1}{2L} V V^{T}\Big) x - \Big(2q + \frac{V(\ell_t + w)}{L}\Big)^{T} x,$$
where $\ell_i^t = \delta_i^t - L\varphi'(\delta_i^t)$. Assuming that $2Q + \frac{1}{L} V V^{T}$ is invertible, $x$ is then updated as
$$x_{t+1} = \Big(2Q + \frac{1}{L} V V^{T}\Big)^{-1}\Big(2q + \frac{V(\ell_t + w)}{L}\Big). \tag{70}$$
Iterations (69) and (70) correspond to the half-quadratic minimization algorithms without over-relaxation proposed in [7] and [8], respectively. A convergence study of the algorithm with over-relaxation can be found in [38].
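A minimal sketch of iteration (69) for one particular edge-preserving potential from Fig. 4, $\varphi_{HS}(t) = 2\sqrt{1+t^2} - 2$, for which $\varphi'(t)/t = 2/\sqrt{1+t^2}$; the function name, interface, and iteration count are illustrative choices.

```python
import numpy as np

def half_quadratic_mm(Q, q, V, w, x0, n_iter=100):
    """Iteration (69) with phi_HS(t) = 2*sqrt(1+t^2) - 2, i.e. weights
    d_i = phi'(delta_i)/delta_i = 2/sqrt(1 + delta_i^2)."""
    x = x0.copy()
    for _ in range(n_iter):
        delta = V.T @ x - w                      # residuals delta_i
        d = 2.0 / np.sqrt(1.0 + delta**2)        # diagonal of D_t
        A = 2 * Q + V @ (d[:, None] * V.T)       # 2Q + V D_t V^T
        b = 2 * q + V @ (d * w)                  # 2q + V D_t w
        x = np.linalg.solve(A, b)                # update (69)
    return x
```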

9) $\ell_p$-Norm Minimization: Optimization problems involving the $\ell_p$-norm arise frequently in robust fitting and sparse representation problems. When $1 \leq p < 2$, $|x|^p$ is convex, and when $0 < p < 1$, the tightest convex upper bound of $|x|^p$ is obtained by linearization. IRLS-type algorithms, however, use a quadratic upper bound for $|x|^p$. The idea is to majorize $\|x\|_p^p$ ($x \in \mathbb{R}^n$) by a weighted squared $\ell_2$-norm at each iteration and solve a weighted $\ell_2$-norm minimization problem instead. More precisely, inequality (16) indicates that, at $x = x_t$ and provided none of the elements of $x_t$ are zero, $\|x\|_p^p$ can be majorized as
$$\|x\|_p^p \leq \|x\|_{W_t}^2 + \text{const.}, \tag{71}$$
where $W_t$ is a diagonal matrix with $i$-th diagonal element $w_i^t = \frac{p}{2}|x_i^t|^{p-2}$.

Take the following robust regression problem as an example:
$$\underset{x}{\text{minimize}} \ \ \|Ax - b\|_p^p, \tag{72}$$
where $b \in \mathbb{R}^m$. With inequality (16) we can construct a quadratic surrogate function
$$g(x \mid x_t) = \sum_{i=1}^{m} w_i^t\,(b_i - A_{i,:} x)^2,$$
where $w_i^t$ is given by $w_i^t = |b_i - A_{i,:} x_t|^{p-2}$. The function $g(x \mid x_t)$ admits the closed-form minimizer
$$x_{t+1} = \big(A^{T} W_t A\big)^{-1} A^{T} W_t b.$$
The loss function $|x|^p$ can be generalized to any continuously differentiable concave function for robust fitting [40].
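A minimal IRLS sketch for problem (72); the small constant `eps` guards the weights against zero residuals, in the spirit of the smoothing technique discussed below, and the interface and iteration count are illustrative.

```python
import numpy as np

def irls_lp_regression(A, b, p=1.0, n_iter=50, eps=1e-8):
    """MM / IRLS iteration for min_x ||Ax - b||_p^p via the weighted
    least-squares surrogate with weights w_i ~ |r_i|^(p-2)."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]     # ordinary least-squares start
    for _ in range(n_iter):
        r = b - A @ x
        w = (r**2 + eps**2) ** ((p - 2) / 2)     # smoothed weights
        Aw = A * w[:, None]                      # W_t A
        x = np.linalg.solve(A.T @ Aw, Aw.T @ b)  # (A^T W_t A)^{-1} A^T W_t b
    return x
```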

A similar idea has been applied in [41] to solve the sparse representation problem
$$\underset{x}{\text{minimize}} \ \ \|Ax - b\|_2^2 + \lambda \|x\|_1,$$
and in [42] to solve the compressed sensing problem
$$\begin{array}{ll} \underset{x}{\text{minimize}} & \|x\|_1 \\ \text{subject to} & Ax = b, \end{array}$$
where $\|x\|_1$ was upper bounded by a quadratic function using inequality (16) and a least squares problem was solved per iteration.

Note that in [41] convergence was established under the condition that none of the $x_i^t$'s are zero, and in [42] the weight was adaptively modified so that it never goes to infinity. If any $x_i^t$ becomes zero, the algorithm becomes ill-posed, since the weight matrix $W_t$ is undefined at the next iteration. The effect of this singularity issue on algorithm convergence has been extensively studied in the literature; see [29], [39], [41], and [42] for examples. A way to circumvent the difficulty is to smooth the objective function. For example, $\|x\|_p^p$ was approximated by
$$h_{\varepsilon,p}(x) = \sum_{i=1}^{n} \big(\varepsilon^2 + x_i^2\big)^{p/2}$$
in [43], where $\varepsilon$ is a small positive number. As a result, the weight at each iteration is always well-defined. The smoothing technique was also adopted in [21], with a different approximation, to solve a sparse generalized eigenvalue problem (GEVP) formulated as
$$\begin{array}{ll} \underset{x}{\text{maximize}} & x^{T} A x - \sum_{i=1}^{n} |x_i|^p \\ \text{subject to} & x^{T} B x = 1. \end{array} \tag{73}$$

B. Second-Order Taylor Expansion (Hessian Bound)

If the Hessian matrix of the objective function $f$ is uniformly bounded, i.e., $M \succeq \nabla^2 f(x)$ for all $x \in \mathcal{X}$, then we can find a quadratic surrogate function using inequality (25). As a benefit, the update usually admits a closed-form solution.


1) Logistic Regression: In a multi-class classification problem, we are given data pairs $(x_n, t_n)_{1 \leq n \leq N}$, where $x_n \in \mathbb{R}^m$ is a feature vector and $t_n$ is a $(K+1)$-dimensional encoding vector with $(t_n)_i = 1$ if $x_n$ belongs to the $i$-th category and $(t_n)_i = 0$ otherwise. The task is to train a statistical model that can predict $t$ based on $x$ [44], [129]. For notational simplicity we assume there is only one training sample $(x, t)$.

The problem can be formulated as finding a $w$, defined as $w = [w_1^T, \ldots, w_K^T]^T$, that minimizes the negative log-likelihood function
$$L(w) = \sum_{j=1}^{K} -t_j w_j^{T} x + \log\Bigg(1 + \sum_{j=1}^{K} \exp\big(w_j^{T} x\big)\Bigg). \tag{74}$$
It can be proved that the Hessian of $L(w)$ is uniformly upper bounded by the matrix
$$M = \frac{1}{2}\Big(I - \frac{\mathbf{1}\mathbf{1}^{T}}{K+1}\Big) \otimes \big(x x^{T}\big).$$
Therefore, inequality (25) implies that $L$ can be upper bounded (up to an additive constant) by
$$g(w \mid w_t) = \big((p(w_t) - t) \otimes x\big)^{T}(w - w_t) + \frac{1}{2}(w - w_t)^{T} M (w - w_t), \tag{75}$$
where $t := [t_1; \ldots; t_K]$ and $p(w) := [p_1(w); \ldots; p_K(w)]$ with
$$p_j(w) = \frac{\exp\big(w_j^{T} x\big)}{1 + \sum_{l=1}^{K} \exp\big(w_l^{T} x\big)}.$$
The update of $w$ is then given by
$$w_{t+1} = w_t - M^{-1}\big((p(w_t) - t) \otimes x\big) = w_t + M^{-1}\big((t - p(w_t)) \otimes x\big).$$
Compared to the Newton method, which requires computing $\nabla^2 L(w)$ at each iteration, minimizing $g(w \mid w_t)$ only requires pre-computing $M$ once, since it is independent of $w_t$.
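A minimal sketch of this MM scheme for $N$ training samples (the straightforward extension of the single-sample derivation above), where the curvature bound becomes $M = \frac{1}{2}\big(I - \mathbf{1}\mathbf{1}^{T}/(K+1)\big) \otimes \sum_n x_n x_n^{T}$ and is inverted once. The interface is illustrative and assumes $\sum_n x_n x_n^{T}$ is nonsingular.

```python
import numpy as np

def mm_multinomial_logistic(X, T, n_iter=200):
    """MM for multinomial logistic regression with a fixed curvature bound M.
    X is N x m (features); T is N x K (first K entries of each encoding t_n)."""
    N, m = X.shape
    K = T.shape[1]
    S = X.T @ X                                   # sum_n x_n x_n^T
    M = 0.5 * np.kron(np.eye(K) - np.ones((K, K)) / (K + 1), S)
    M_inv = np.linalg.inv(M)                      # precomputed once; M is constant
    w = np.zeros(K * m)
    for _ in range(n_iter):
        W = w.reshape(K, m)                       # rows are the w_j
        Z = X @ W.T                               # entries w_j^T x_n
        P = np.exp(Z) / (1.0 + np.exp(Z).sum(axis=1, keepdims=True))
        grad = ((P - T)[:, :, None] * X[:, None, :]).sum(axis=0).ravel()
        w = w - M_inv @ grad                      # MM update with constant M
    return w.reshape(K, m)
```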

Combining this with the technique described in Section V-A9, an MM algorithm can be derived for sparse logistic regression with the $\ell_1$-norm penalty, formulated as
$$L(w) = -\sum_{j=1}^{K} t_j w_j^{T} x + \log\Bigg(1 + \sum_{j=1}^{K} \exp\big(w_j^{T} x\big)\Bigg) + \lambda \|w\|_1, \tag{76}$$
where $\lambda \geq 0$ is a regularization parameter [45]. At each iteration, a quadratic upper bound for the $\ell_1$-norm term is merged with (75), still leading to a quadratic surrogate function that has a closed-form minimizer.

Even if the function to be majorized is already quadratic, inequality (25) is still applied in some applications with an $M$ that is easier to deal with (usually a diagonal or a scaled identity matrix), as can be seen in Sections V-B2, V-B3, V-B4, and V-B5.

2) Matrix Quadratic Form Minimization with Rank Constraint: Consider the problem
$$\begin{array}{ll} \underset{X}{\text{minimize}} & \operatorname{vec}(X)^{T} Q \operatorname{vec}(X) + \operatorname{vec}(L)^{T} \operatorname{vec}(X) \\ \text{subject to} & \operatorname{rank}(X) \leq r, \end{array} \tag{77}$$
where $Q$ is a symmetric square matrix whose maximum eigenvalue is positive and $X, L \in \mathbb{R}^{m \times n}$ ($m \geq n$).

Observe that if $Q$ is a scaled identity matrix, then problem (77) can be written as
$$\begin{array}{ll} \underset{X}{\text{minimize}} & \|X + cL\|_F^2 \\ \text{subject to} & \operatorname{rank}(X) \leq r \end{array} \tag{78}$$
with $c$ some positive constant, which has a closed-form minimizer based on the singular value decomposition (SVD) of $L$.

For this reason, we construct the following surrogate function by applying inequality (26) to the first term of the objective function:
$$g(X \mid X_t) = \lambda \|X - Y\|_F^2 + \text{const.}, \tag{79}$$
where $\lambda = \lambda_{\max}(Q)$ and $\operatorname{vec}(Y) = -(Q/\lambda - I)\operatorname{vec}(X_t) - \operatorname{vec}(L)/(2\lambda)$.

Let the thin SVD of $Y$ be $Y = U S V^{T}$ with $S = \operatorname{diag}(\sigma_1, \ldots, \sigma_n)$ and $\sigma_1 \geq \cdots \geq \sigma_n$; then $X_{t+1} \in \arg\min_{\operatorname{rank}(X) \leq r} g(X \mid X_t)$ is given by
$$X_{t+1} = U S_r V^{T},$$
where $S_r$ is obtained by thresholding the smallest $n - r$ elements of the diagonal of $S$ to zero. We refer to this procedure as singular value hard thresholding.

Many problems can be cast in the form of (77). Two examples are given as follows.

Problem 20: The weighted low-rank approximation problem is formulated as
$$\begin{array}{ll} \underset{X}{\text{minimize}} & \|X - R\|_Q^2 \\ \text{subject to} & \operatorname{rank}(X) \leq r, \end{array} \tag{80}$$
where $R \in \mathbb{R}^{n \times m}$ is the matrix to be approximated, and $\|X\|_Q^2 := \operatorname{vec}(X)^{T} Q \operatorname{vec}(X)$ with $Q \in \mathbb{S}_{+}$ a weight matrix. Problem (80) is an instance of problem (77) and thus can be solved accordingly.

Problem 21: The low-rank matrix completion problem is formulated as
$$\begin{array}{ll} \underset{X}{\text{minimize}} & \|P_{\Omega}(X) - P_{\Omega}(R)\|_F^2 \\ \text{subject to} & \operatorname{rank}(X) \leq r, \end{array} \tag{81}$$
where
$$P_{\Omega}(R)_{ij} = \begin{cases} R_{ij}, & (i,j) \in \Omega \\ 0, & \text{otherwise}. \end{cases}$$
By defining $Q = \operatorname{diag}(q)$ with $q_i = 1$ if $\operatorname{vec}(P_{\Omega}(R))_i \neq 0$ and $q_i = 0$ otherwise, problem (81) is a special case of (80).


Fig. 5. Matrix completion problem. Left: objective value $\|P_{\Omega}(X_t) - P_{\Omega}(R)\|_F^2$ versus the number of iterations; right: recovery error $\|X_t - R\|_F$ versus the number of iterations.

Moreover, since $q_i$ is either 0 or 1, we have $\lambda_{\max}(Q) = 1$. Consequently, $\operatorname{vec}(Y) = Q\big(\operatorname{vec}(R) - \operatorname{vec}(X_t)\big) + \operatorname{vec}(X_t)$, and the expression of $Y$ simplifies to
$$Y_{ij} = \begin{cases} R_{ij}, & (i,j) \in \Omega \\ X_{ij}^t, & \text{otherwise}. \end{cases} \tag{82}$$
We randomly generate a matrix $R \in \mathbb{R}^{500 \times 600}$ of rank $r = 10$ and create $P_{\Omega}(R)$ by uniformly at random deleting 70% of the entries of $R$. Fig. 5 plots the objective value evolution curve and the recovery error $\|X_t - R\|_F$ versus iterations. It can be seen that within 100 iterations the objective value (approximation error) decreases below $10^{-8}$ and the recovery error decreases to $10^{-4}$.
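A minimal sketch of this matrix-completion iteration (update (82) followed by singular value hard thresholding); the random test setup mirrors the experiment described above, but the exact error values will differ.

```python
import numpy as np

def mm_matrix_completion(R_obs, mask, r, n_iter=100):
    """MM for problem (81): fill unobserved entries with the current iterate
    (update (82)) and hard-threshold the singular values to rank r.
    R_obs has zeros outside the mask; mask is the boolean observation pattern."""
    X = np.zeros_like(R_obs)
    for _ in range(n_iter):
        Y = np.where(mask, R_obs, X)             # Y_ij = R_ij on Omega, X^t_ij otherwise
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        s[r:] = 0.0                              # keep the r largest singular values
        X = (U * s) @ Vt                         # X_{t+1} = U S_r V^T
    return X

# Illustrative reproduction of the setup above
rng = np.random.default_rng(0)
R = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 600))
mask = rng.random((500, 600)) > 0.7              # keep roughly 30% of the entries
X_hat = mm_matrix_completion(np.where(mask, R, 0.0), mask, r=10)
print(np.linalg.norm(X_hat - R) / np.linalg.norm(R))
```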

Similar to Section V-A7, the penalty form of problem (77), which relaxes the rank constraint into the objective function, can be handled with minor modifications of the above derivation.

3) Minimization of Quartic Forms: Minimizing a quartic function is closely related to minimizing matrix quadratic forms as discussed in Section V-B2. The objective function takes the form
$$f(x) = \sum_{i=1}^{N} \big(x^{H} A_i x - y_i\big)^2, \tag{83}$$
where $A_i$ is Hermitian positive definite. The idea is to reduce the order of $f$ by a change of variables. To this end, we define the "lifting matrix" $X = x x^{H}$. Then $f(x)$ can be written as
$$f(X) = \sum_{i=1}^{N} \big(\operatorname{Tr}(A_i X) - y_i\big)^2 = \operatorname{vec}(X)^{H}\Bigg(\sum_{i=1}^{N} \operatorname{vec}(A_i)\operatorname{vec}(A_i)^{H}\Bigg)\operatorname{vec}(X) - 2\sum_{i=1}^{N} y_i \operatorname{Tr}(A_i X) + \sum_{i=1}^{N} y_i^2, \tag{84}$$
which is quadratic in $X$. Although the order of the objective function has been reduced from quartic to quadratic, we have introduced the constraint that $X$ is rank-one, i.e., $X = x x^{H}$.

Fig. 6. Linear upper bound of a quadratic function on the unit circle (black curve: intercept of the quadratic function on the unit circle; magenta curve: intercept of the linear upper bound on the unit circle; red dot: point at which the two functions are equal).

Denoting $A = \sum_{i=1}^{N} \operatorname{vec}(A_i)\operatorname{vec}(A_i)^{H}$, minimizing $f$ is then equivalent to solving
$$\begin{array}{ll} \underset{X}{\text{minimize}} & \operatorname{vec}(X)^{H} A \operatorname{vec}(X) - 2\sum_{i=1}^{N} y_i \operatorname{Tr}(A_i X) \\ \text{subject to} & \operatorname{rank}(X) = 1, \end{array}$$
which is a special case of (77) with the identification $Q = A$ and $L = -2\sum_{i=1}^{N} y_i A_i$.

Problem 22: The phase retrieval problem aims at recovering a signal $x$ from phaseless measurements $y_i = |a_i^{H} x|^2$, $i = 1, \ldots, N$. The problem can be formulated as
$$\underset{x}{\text{minimize}} \ \ \sum_{i=1}^{N} \big(y_i - |a_i^{H} x|^2\big)^2. \tag{85}$$
Defining the matrices $A_i = a_i a_i^{H}$, problem (85) is of the form (83), and MM algorithms can be derived accordingly [47].

Problem 23: The sequence design problem considered in [31] aims at finding a length-$N$ complex-valued unimodular sequence $(x_n)_{1 \leq n \leq N}$ with low autocorrelation sidelobes. The associated optimization problem takes the form
$$\begin{array}{ll} \underset{x}{\text{minimize}} & \displaystyle\sum_{p=1}^{2N} \big(x^{H} a_p a_p^{H} x\big)^2 \\ \text{subject to} & |x_i| = 1, \quad i = 1, \ldots, N. \end{array} \tag{86}$$
The objective function is a special case of that of problem (83) with $A_p = a_p a_p^{H}$ and $y_p = 0$.

To deal with the unit-modulus constraint $|x_i| = 1$, we observe that $\operatorname{Tr}(X^{H} X)$ and $x^{H} x$ are constants on the set $\mathcal{X} = \{x \mid |x_i| = 1, \ i = 1, \ldots, N\}$. Therefore, we can apply inequality (25) twice with $M$ a scaled identity matrix, yielding a linear surrogate function that has a closed-form minimizer over $\mathcal{X}$ [31].

Remark 24: Upper bounding a quadratic function by a linear function is not possible if the constraint set is the entire Euclidean space; restricting it to the set $\mathcal{X}$, however, makes it possible. Fig. 6 visualizes how a linear function upper bounds a convex quadratic one on the unit circle.

We test the performance of MM in designing a sequence of length $N = 1024$. The algorithm is initialized with a Golomb sequence as a reasonably good starting point, and the SQUAREM accelerator is employed to achieve a fast convergence rate. Fig. 7 shows the evolution curve of the objective value versus the number of iterations and the correlation level of the resulting sequence at convergence.

Fig. 7. Sequence design problem. Left: objective value versus iterations; right: correlation level of the sequence of length N = 1024.

Extensions of the problem to the design of sequences minimizing a weighted integrated sidelobe level criterion and to the design of sequence sets using MM can be found in [32], [130].

4) Sparse Linear Regression: The sparse linear regression problem can be formulated as
$$\underset{x}{\text{minimize}} \ \ \|Ax - b\|_2^2 + \rho\, h(x), \tag{87}$$
where $h$ is a penalty function that promotes a sparse $x$, and $\rho \geq 0$ is the regularization parameter. We assume that $h$ is separable and even, i.e., $h(x) = \sum_{i=1}^{n} h_i(|x_i|)$, and that each $h_i$ is concave and nondecreasing on $\mathbb{R}_{+}$.

The idea is to decouple the objective function so that the optimization over $x$ can be done element-wise [18]. To this end, we resort to inequality (26) and upper bound the first term as
$$\|Ax - b\|_2^2 \leq \lambda x^{T} x - 2 y_t^{T} x + \text{const.},$$
where $\lambda = \lambda_{\max}(A^{T} A)$ and $y_t = A^{T} b - (A^{T} A - \lambda I) x_t$.

Then for each $x_i$, the problem boils down to finding a minimizer of
$$g^{(1)}(x_i \mid x_t) = \lambda x_i^2 - 2 y_i x_i + \rho\, h_i(|x_i|). \tag{88}$$
For example, when $h(x) = \|x\|_1$ the update is given by the soft-thresholding operator [49]:
$$x_i^{t+1} = \mathcal{S}_{\rho/\lambda}\Big(\frac{y_i}{\lambda}\Big) = \begin{cases} \dfrac{y_i}{\lambda} - \dfrac{\rho}{2\lambda}, & \dfrac{y_i}{\lambda} > \dfrac{\rho}{2\lambda} \\[6pt] \dfrac{y_i}{\lambda} + \dfrac{\rho}{2\lambda}, & \dfrac{y_i}{\lambda} < -\dfrac{\rho}{2\lambda} \\[6pt] 0, & \text{otherwise}. \end{cases} \tag{89}$$

When $h_i$ is concave, further applying inequality (10) we can upper bound $h_i(|x_i|)$ as
$$h_i(|x_i|) \leq h_i'\big(|x_i^t|\big)\,|x_i| + \text{const.}$$
Together with (88), we arrive at the surrogate function
$$g^{(2)}(x_i \mid x_t) = \lambda x_i^2 - 2 y_i x_i + \rho\, h_i'\big(|x_i^t|\big)\,|x_i|$$
with minimizer
$$x_i^{t+1} = \mathcal{S}_{\rho h_i'(|x_i^t|)/\lambda}\Big(\frac{y_i}{\lambda}\Big).$$
This method has been applied to image restoration in [29] and [48], where the problem is formulated as a high-dimensional penalized least-squares problem.

Finally, we present the special case $h_i(|x_i|) = \|x_i\|_0$, which is discontinuous [16], [17]. The minimizer of $g^{(1)}(x_i \mid x_t)$ has the closed form
$$x_i^{t+1} = \begin{cases} y_i/\lambda, & y_i^2/\lambda > \rho \\ 0, & \text{otherwise}, \end{cases} \tag{90}$$
which is an iterative hard thresholding algorithm.
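A minimal sketch of the element-wise MM updates of this subsection: the separable bound (88) followed by soft thresholding (89) for the $\ell_1$ penalty or hard thresholding (90) for the $\ell_0$ penalty; the interface and iteration count are illustrative choices.

```python
import numpy as np

def mm_sparse_regression(A, b, rho, penalty="l1", n_iter=200):
    """MM with the separable quadratic bound (88): a gradient-like step followed
    by soft thresholding (89) or hard thresholding (90)."""
    ATA = A.T @ A
    ATb = A.T @ b
    lam = np.linalg.eigvalsh(ATA)[-1]                # lambda_max(A^T A)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        y = ATb - (ATA - lam * np.eye(len(x))) @ x   # y_t in the text
        z = y / lam
        if penalty == "l1":                          # soft thresholding, update (89)
            x = np.sign(z) * np.maximum(np.abs(z) - rho / (2 * lam), 0.0)
        else:                                        # l0 penalty: hard thresholding, update (90)
            x = np.where(y**2 / lam > rho, z, 0.0)
    return x
```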

5) Nonnegative Least Squares: The nonnegative least squares (NLS) problem is a least-squares fitting problem that requires the regressor to be nonnegative. The problem is stated as
$$\begin{array}{ll} \underset{x}{\text{minimize}} & \|Ax - b\|_2^2 \\ \text{subject to} & x \geq 0. \end{array} \tag{91}$$

To obtain a closed-form update of $x$ under the constraint $x \geq 0$, we construct a separable surrogate function. The simplest way is to apply inequality (26) with $M = \lambda I$, which gives the surrogate function
$$g(x \mid x_t) = x^{T} x - 2 x^{T}\Big(x_t - \frac{1}{\lambda}\big(A^{T} A x_t - A^{T} b\big)\Big),$$
where $\lambda \geq \lambda_{\max}(A^{T} A)$. Consequently, the update of $x$ is given by
$$x_{t+1} = \Big[x_t - \frac{1}{\lambda}\big(A^{T} A x_t - A^{T} b\big)\Big]_{+}, \tag{92}$$
which is a gradient projection algorithm.

If we further assume that $A \in \mathbb{R}^{m \times n}_{++}$, $b \in \mathbb{R}^{m}_{+}$, and $b \neq 0$, it has been proven in [51] and [131] that
$$g(x \mid x_t) = x^{T} M_t x + 2 x^{T}\big((A^{T} A - M_t) x_t - A^{T} b\big)$$
with
$$M_t = \operatorname{diag}\Bigg(\frac{(A^{T} A x_t)_1}{x_1^t}, \ldots, \frac{(A^{T} A x_t)_n}{x_n^t}\Bigg)$$
is a valid surrogate function, since $M_t \succeq A^{T} A$. The update of $x$ is then given by
$$x_{t+1} = \big(A^{T} b \,/\, A^{T} A x_t\big) \odot x_t, \tag{93}$$
where the division is taken element-wise. Initialized at $x_0 > 0$, we can see that $(x_t)_{t \in \mathbb{N}}$ remains nonnegative if the elements of $A$ and $b$ are nonnegative. Although both are derived from separable quadratic surrogate functions, update (92) is additive while update (93) is multiplicative. Iteration (93) was studied in a more general context, namely as an instance of multiplicative iterative algorithms for convex problems, in [50].
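Minimal sketches of the two NLS updates: the additive gradient-projection iteration (92) and the multiplicative iteration (93). The interfaces and iteration counts are illustrative, and the multiplicative version assumes $A$ and $b$ have strictly positive entries.

```python
import numpy as np

def nls_additive(A, b, n_iter=500):
    """Gradient-projection MM update (92) for nonnegative least squares."""
    lam = np.linalg.eigvalsh(A.T @ A)[-1]
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        x = np.maximum(x - (A.T @ (A @ x) - A.T @ b) / lam, 0.0)
    return x

def nls_multiplicative(A, b, n_iter=500):
    """Multiplicative MM update (93); assumes A and b have positive entries."""
    x = np.ones(A.shape[1])
    ATb = A.T @ b
    for _ in range(n_iter):
        x = x * ATb / (A.T @ (A @ x))
    return x
```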

We mention that another surrogate function can be constructed based on (17) and (28), whose derivation is postponed to Section V-D3.

C. Convexity Inequality

In this subsection, we show the application of inequality (19) to the robust mean–covariance estimation problem formulated in [24] as
$$\begin{array}{ll} \underset{\mu,\, R \succ 0}{\text{minimize}} & \dfrac{K+1}{N} \displaystyle\sum_{i=1}^{N} \log\big(1 + (x_i - \mu)^{H} R^{-1} (x_i - \mu)\big) + \alpha\big(\log\det(R) + K \log \operatorname{Tr}(R^{-1} T)\big) \\[4pt] & {} + \gamma \log\big(1 + (\mu - t)^{H} R^{-1}(\mu - t)\big) + \log\det(R). \end{array} \tag{94}$$
Similar to (50), problem (94) can be solved by linearizing the log function. Here we provide an alternative solution that exploits convexity.

Specifically, applying inequality (19) to the sum of the log terms, the objective function can be majorized by
$$\begin{aligned} g(\mu, R \mid \mu_t, R_t) ={}& (1+\alpha)\log\det(R) + (K + 1 + \gamma + \alpha K) \\ & \times \log\Bigg( \sum_{i=1}^{N} \frac{K+1}{N}\, w_i(\mu_t, R_t)\big(1 + (x_i - \mu)^{H} R^{-1}(x_i - \mu)\big) \\ & \qquad + \gamma\, w_t(\mu_t, R_t)\big(1 + (\mu - t)^{H} R^{-1}(\mu - t)\big) + \frac{\alpha K}{\operatorname{Tr}(R_t^{-1} T)} \operatorname{Tr}(R^{-1} T) \Bigg). \end{aligned} \tag{95}$$

Proposition 25: The surrogate function $g(\mu, R \mid \mu_t, R_t)$ has a closed-form minimizer given by
$$\mu_{t+1} = \frac{(K+1)\sum_{i=1}^{N} w_i(\mu_t, R_t)\, x_i + \gamma N w_t(\mu_t, R_t)\, t}{(K+1)\sum_{i=1}^{N} w_i(\mu_t, R_t) + \gamma N w_t(\mu_t, R_t)}, \qquad R_{t+1} = \beta S_t, \tag{96}$$
where
$$S_t = \sum_{i=1}^{N} \frac{K+1}{N}\, w_i(\mu_t, R_t)\,(x_i - \mu_{t+1})(x_i - \mu_{t+1})^{H} + \gamma\, w_t(\mu_t, R_t)\,(\mu_{t+1} - t)(\mu_{t+1} - t)^{H} + \frac{\alpha K}{\operatorname{Tr}(R_t^{-1} T)}\, T \tag{97}$$
and
$$\beta = \frac{1 + \gamma}{1 + \alpha}\Bigg(\sum_{i=1}^{N} \frac{K+1}{N}\, w_i(\mu_t, R_t) + \gamma\, w_t(\mu_t, R_t)\Bigg)^{-1}. \tag{98}$$

Proof: See the Appendix. □

Note that the MM update (96) turns out to be the accelerated MM algorithm (provided without a convergence proof) in [24]. Fig. 4 and Table II of [24] show that the number of iterations required for algorithm (96) to converge is significantly smaller than for the MM algorithm derived by linearization, which can be explained by the fact that the former has a tighter surrogate function (see Fig. 3 for an example).

D. Geometric and Signomial Programming

An unconstrained standard GP takes the form
$$\underset{x}{\text{minimize}} \ \ f(x) := \sum_{j=1}^{J} c_j \prod_{i=1}^{n} x_i^{a_{ij}}, \tag{99}$$
where $\forall i, j$, $a_{ij} \in \mathbb{R}$, $c_j > 0$, and $x_i > 0$. The function $c_j \prod_{i=1}^{n} x_i^{a_{ij}}$ is a monomial, and a sum of monomials is a posynomial. When some of the $c_j$'s are negative, $f$ is a signomial.

1) Signomial Programming: We apply inequality (28) to the summands of $f$ with positive $c_j$'s and inequality (17) to those with negative $c_j$'s, which leads to the following surrogate function [54]:
$$g(x \mid x_t) = \sum_{i=1}^{n} g_i(x_i \mid x_t)$$
with
$$g_i(x_i \mid x_t) = \sum_{j: c_j > 0} c_j \Bigg(\prod_{k=1}^{n} \big(x_k^t\big)^{a_{kj}}\Bigg) \frac{|a_{ij}|}{\|a_j\|_1} \Big(\frac{x_i}{x_i^t}\Big)^{\|a_j\|_1 \operatorname{sgn}(a_{ij})} + \sum_{j: c_j < 0} c_j \Bigg(\prod_{k=1}^{n} \big(x_k^t\big)^{a_{kj}}\Bigg) a_{ij} \log x_i,$$
where $a_j = [a_{1j}, a_{2j}, \ldots, a_{nj}]^{T}$. The surrogate function $g(x \mid x_t)$ is separable, so minimizing each $g_i(x_i \mid x_t)$ is a univariate optimization problem, and the minimizations can be carried out in parallel.

Having discussed how to minimize a signomial, we move on to the problem of minimizing the ratio of two posynomials.

2) Complementary GP: Consider the following minimization problem:
$$\underset{x}{\text{minimize}} \ \ \frac{f(x)}{g(x)}, \tag{100}$$
where $f$ and $g$ are posynomials. The idea is to lower bound $g$ by a monomial, so that the resulting surrogate function becomes a posynomial.

Write $g(x)$ as $g(x) = \sum_{j=1}^{J} u_j(x)$, where each $u_j$ is a monomial. Invoking inequality (29), the objective function is majorized by
$$g(x \mid x_t) = f(x) \Bigg/ \prod_{j=1}^{J} \Big(\frac{u_j(x)}{\alpha_j}\Big)^{\alpha_j},$$
where $\alpha_j = u_j(x_t)\big/\sum_{j'=1}^{J} u_{j'}(x_t)$. As a result, minimizing the surrogate function becomes a standard GP [53].

At this point, we can either solve the GP directly or further upper bound $g(x \mid x_t)$ by a separable surrogate function using the techniques in Section V-D1.

3) Nonnegative Least Squares Revisited: An alternative approach to finding a separable surrogate function for the NLS problem (91) hinges on inequalities (17) and (28) for monomials.

To lighten the notation, denote $A^{T} A$ by $Q$ and $-A^{T} b$ by $q$. Problem (91) can be written equivalently as
$$\begin{array}{ll} \underset{x}{\text{minimize}} & \frac{1}{2} x^{T} Q x + q^{T} x \\ \text{subject to} & x \geq 0. \end{array}$$


Fig. 8. Objective value evolution of the MM algorithms for the NLS problem. Red: algorithm (93); black: algorithm (103); blue: algorithm (92); magenta: accelerated algorithm (92).

To find a separable surrogate function, we need to take care of the cross terms $Q_{ij} x_i x_j$.

Notice that $|Q_{ij}| x_i x_j$ is a monomial, and we have separable upper and lower bounds for a monomial given by inequalities (28) and (17), respectively. To be precise, for the terms $x_i x_j$ with $Q_{ij} > 0$ we have
$$x_i x_j \leq \frac{1}{2}\Big(\frac{x_j^t}{x_i^t} x_i^2 + \frac{x_i^t}{x_j^t} x_j^2\Big), \tag{101}$$
and for the terms $x_i x_j$ with $Q_{ij} < 0$ we have
$$x_i x_j \geq x_i^t x_j^t \big(1 + \log x_i + \log x_j - \log x_i^t - \log x_j^t\big). \tag{102}$$
Define $Q^{+} = \max[Q, 0]$ and $Q^{-} = -\min[Q, 0]$, where the maximum and minimum are taken element-wise. The bounds (101) and (102) lead to the following surrogate function:

the maximum and minimum are taken element-wise.The bounds (102) and (101) lead to the followingsurrogate function:

g (x|xt)

=12

i

(Q+xt)i

xti

x2i −

i

xi

(Q−xt

)ilog xi +

i

qixi.

Setting its gradient to zero gives the multiplicative update [55]

xt+1i = xt

i

(−qi +

√q2i + 4 (Q+xt)i (Q−xt)i

2 (Q+xt)i

)

. (103)

Remark 26: If $A \in \mathbb{R}^{m \times n}_{++}$ and $b \in \mathbb{R}^{m}_{+}$, $b \neq 0$, then $Q \in \mathbb{R}^{n \times n}_{++}$ and $-q \in \mathbb{R}^{n}_{++}$. In this case, iteration (103) coincides with (93).

Fig. 8 shows the performance of MM iterations (92), (93), and (103). We have also included an accelerated version of algorithm (92), obtained by modifying the step size according to the Armijo line-search rule. $A$ and $x$ are generated randomly so that each element follows a uniform distribution on $[0, 1]$, the dimensions are set to $m = 60$ and $n = 100$, and we assume the noiseless case, i.e., $b = Ax$.

Remark 27: Combining with the techniques for sparse linear regression in Section V-B4, an alternating MM algorithm can be derived for the matrix factorization problem, possibly with a nonnegativity constraint and a sparsity penalty. We refer the reader to [27] and [132]–[134] for details.

E. Cauchy-Schwarz Inequality

The Cauchy-Schwarz inequality can be used to lower bound the $\ell_2$-norm by a linear function, which is applied in the following applications.

1) Phase Retrieval Revisited: The phase retrieval problem considered in Section V-B3 can alternatively be formulated by magnitude matching as [56], [135]–[137]
$$\underset{x}{\text{minimize}} \ \ \big\|\sqrt{y} - |A^{H} x|\big\|_2^2, \tag{104}$$
where $\sqrt{\cdot}$ is applied element-wise.

Expanding the squares and applying inequality (31) to the cross term (assuming $|A^{H} x_t| \neq 0$) leads to the surrogate function
$$g^{(1)}(x \mid x_t) = \big\|C_t \sqrt{y} - A^{H} x\big\|_2^2,$$
where $C_t = \operatorname{diag}\big(e^{j \arg(A^{H} x_t)}\big)$, which has the minimizer
$$x_{t+1} = \big(A A^{H}\big)^{-1} A\, C_t \sqrt{y}. \tag{105}$$
Algorithm (105) turns out to be the famous Gerchberg-Saxton algorithm [56].
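A minimal sketch of the MM iteration (105); here $A = [a_1, \ldots, a_N]$ is assumed to be $p \times N$ with $A A^{H}$ invertible, and the interface, initialization, and iteration count are illustrative.

```python
import numpy as np

def gerchberg_saxton(A, y, x0, n_iter=200):
    """MM update (105) for the magnitude-matching formulation (104):
    x_{t+1} = (A A^H)^{-1} A C_t sqrt(y), with C_t carrying the current phases."""
    AAH_inv = np.linalg.inv(A @ A.conj().T)            # precomputed once
    sqrt_y = np.sqrt(y)
    x = x0
    for _ in range(n_iter):
        phase = np.exp(1j * np.angle(A.conj().T @ x))  # diagonal entries of C_t
        x = AAH_inv @ (A @ (phase * sqrt_y))           # update (105)
    return x
```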

Restricting $x$ to be real-valued, we can further majorize $g^{(1)}(\cdot \mid x_t)$ using inequality (26) [57].

Let us write $\|A^{H} x\|^2$ as $\|A^{H} x\|^2 = \sum_{i=1}^{N} |a_i^{H} x|^2 = \sum_{i=1}^{N} \big|\sum_{j=1}^{p} a_{ij} x_j\big|^2$, where $a_{ij}$ is the $j$-th element of $a_i^{H}$ and $p$ is the dimension of $x$. For simplicity we assume that $a_{ij} \neq 0$. Applying Jensen's inequality, we have
$$\begin{aligned} \Big|\sum_{j=1}^{p} a_{ij} x_j\Big|^2 &= \Big(\sum_{j=1}^{p} \operatorname{Re}(a_{ij})\, x_j\Big)^2 + \Big(\sum_{j=1}^{p} \operatorname{Im}(a_{ij})\, x_j\Big)^2 \\ &= \Big(\sum_{j=1}^{p} V_R^{ij}\, \frac{\operatorname{Re}(a_{ij})}{V_R^{ij}}\, x_j\Big)^2 + \Big(\sum_{j=1}^{p} V_I^{ij}\, \frac{\operatorname{Im}(a_{ij})}{V_I^{ij}}\, x_j\Big)^2 \\ &\leq \sum_{j=1}^{p} \frac{\operatorname{Re}(a_{ij})^2}{V_R^{ij}}\, x_j^2 + \sum_{j=1}^{p} \frac{\operatorname{Im}(a_{ij})^2}{V_I^{ij}}\, x_j^2, \end{aligned} \tag{106}$$
where $V_R^{ij} = \dfrac{|\operatorname{Re}(a_{ij})|}{\sum_{j'=1}^{p} |\operatorname{Re}(a_{ij'})|}$ and $V_I^{ij} = \dfrac{|\operatorname{Im}(a_{ij})|}{\sum_{j'=1}^{p} |\operatorname{Im}(a_{ij'})|}$.


Summing (106) over the index $i$, we have
$$\begin{aligned} \|A^{H} x\|^2 &\leq \sum_{i=1}^{N}\Bigg(\sum_{j=1}^{p} \frac{\operatorname{Re}(a_{ij})^2}{V_R^{ij}}\, x_j^2 + \sum_{j=1}^{p} \frac{\operatorname{Im}(a_{ij})^2}{V_I^{ij}}\, x_j^2\Bigg) \\ &= \sum_{i=1}^{N}\Bigg(\sum_{j=1}^{p}\Big(|\operatorname{Re}(a_{ij})| \sum_{j'=1}^{p} |\operatorname{Re}(a_{ij'})|\Big) x_j^2 + \sum_{j=1}^{p}\Big(|\operatorname{Im}(a_{ij})| \sum_{j'=1}^{p} |\operatorname{Im}(a_{ij'})|\Big) x_j^2\Bigg) \\ &:= x^{H} M x, \end{aligned}$$
where $M$ is a diagonal matrix with $j$-th diagonal entry
$$\sum_{i=1}^{N}\Bigg(|\operatorname{Re}(a_{ij})| \sum_{j'=1}^{p} |\operatorname{Re}(a_{ij'})| + |\operatorname{Im}(a_{ij})| \sum_{j'=1}^{p} |\operatorname{Im}(a_{ij'})|\Bigg).$$

2) Sensor Network Localization: We introduce the localization problem described in [138], where a sensor network is modeled by a graph $G(V, E \cup \bar{E})$. The nodes $V$ are partitioned into a set of $m$ anchor nodes $V_a = \{a_1, \ldots, a_m\}$ with known locations, while the remaining $V_x = \{x_1, \ldots, x_n\}$ are $n$ sensors with unknown locations. An edge in the set $E = \{(i,j) \mid i, j \in V_x\}$ is associated with $d_{ij}$, the distance between sensors $i$ and $j$, and an edge in the set $\bar{E} = \{(k,j) \mid k \in V_a, j \in V_x\}$ is associated with $d_{kj}$, the distance between anchor $k$ and sensor $j$.

To estimate the locations of all nodes, we formulate the problem as
$$\min_{\{x_i\}_{i=1}^{n}} \ \sum_{(i,j) \in E} \big(\|x_i - x_j\|_2 - d_{ij}\big)^2 + \sum_{(k,j) \in \bar{E}} \big(\|x_j - a_k\|_2 - d_{kj}\big)^2. \tag{107}$$
Expanding the squares, we can see that the cross terms are concave and thus destroy the convexity of the objective function.

Invoking inequality (32), we obtain the following quadratic surrogate function [58]–[61]:
$$\begin{aligned} g\big(\{x_i\}_{i=1}^{n} \mid \{x_i^t\}_{i=1}^{n}\big) ={}& \sum_{(i,j) \in E}\Bigg(\|x_i - x_j\|_2^2 - 2 d_{ij}\, \frac{(x_i^t - x_j^t)^{T}(x_i - x_j)}{\|x_i^t - x_j^t\|_2}\Bigg) \\ &+ \sum_{(k,j) \in \bar{E}}\Bigg(\|x_j - a_k\|_2^2 - 2 d_{kj}\, \frac{(x_j^t - a_k)^{T}(x_j - a_k)}{\|x_j^t - a_k\|_2}\Bigg), \end{aligned}$$
which has a closed-form minimizer.

Problem (107) has many variants. For example, to achieve robustness, the authors of [59] replaced the squared loss function by the $\ell_1$-norm loss function, and the authors of [139] employed Huber's loss function. In both cases, inequality (16) has been applied together with (32) to arrive at a quadratic surrogate function.
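As a minimal illustration, consider the single-source special case of (107): one unknown sensor $x$ and anchor measurements only. Minimizing the quadratic surrogate above then yields the simple closed-form update implemented below (cf. [60]). The interface is illustrative, and the sketch assumes $x_t \neq a_k$ so that the norms are nonzero.

```python
import numpy as np

def mm_single_source_localization(anchors, d, x0, n_iter=200):
    """Single-source MM iteration: x_{t+1} = (1/m) * sum_k (a_k + d_k * (x_t - a_k)/||x_t - a_k||),
    where anchors is m x dim and d holds the m range measurements."""
    x = x0.astype(float)
    for _ in range(n_iter):
        diff = x - anchors                               # x_t - a_k for each anchor
        norms = np.linalg.norm(diff, axis=1, keepdims=True)
        x = np.mean(anchors + d[:, None] * diff / norms, axis=0)
    return x
```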

F. Schur Complement

We revisit the variance component model problem (53) and show that an alternative MM algorithm can be derived based on inequality (34).

Define $\bar{A} = [A;\, I_K]$ and
$$P = \operatorname{diag}\big(p_1, \ldots, p_L, \underbrace{\sigma^2, \ldots, \sigma^2}_{K}\big);$$
then $R = \bar{A} P \bar{A}^{H}$, and the objective function can be rewritten as
$$L(P) = \log\det\big(\bar{A} P \bar{A}^{H}\big) + \operatorname{Tr}\Big(S\big(\bar{A} P \bar{A}^{H}\big)^{-1}\Big).$$

To find a surrogate function separable in $P$, we apply inequality (12) to the first term, which leads to the first-step majorization with surrogate function
$$g^{(1)}(R \mid R_t) = \operatorname{Tr}\big(R_t^{-1} R\big) + \operatorname{Tr}\big(S R^{-1}\big) \triangleq w_t^{H} p + \operatorname{Tr}\Big(S\big(\bar{A} P \bar{A}^{H}\big)^{-1}\Big),$$
where $w_j^t = a_j^{H} R_t^{-1} a_j$ with $a_j$ the $j$-th column of $\bar{A}$.

In the next step, we find a separable upper bound for $\operatorname{Tr}\big(S(\bar{A} P \bar{A}^{H})^{-1}\big)$. By inequality (34) we have
$$\begin{aligned} g^{(1)}(R \mid R_t) \leq g^{(2)}(R \mid R_t) &= w_t^{H} p + \operatorname{Tr}\big(P_t \bar{A}^{H} R_t^{-1} M_t R_t^{-1} \bar{A} P_t P^{-1}\big) \\ &= w_t^{H} p + \sum_{j=1}^{L} \big(p_j^t\big)^2\, a_j^{H} R_t^{-1} M_t R_t^{-1} a_j\, p_j^{-1} + \text{const.} \end{aligned}$$
The update of $p_j$ can then be obtained in closed form as
$$p_j^{t+1} = \sqrt{\frac{a_j^{H} R_t^{-1} M_t R_t^{-1} a_j}{a_j^{H} R_t^{-1} a_j}}\; p_j^{t}. \tag{108}$$
Iteration (108) is similar to the LIKES algorithm presented in [93], but executes the outer-loop iteration only once. Detailed numerical comparisons between the SBL and LIKES algorithms can be found in [140].

VI. CONCLUSIONS

In this overview, we have presented the MM principle and its recent developments. From a theoretical perspective, we have introduced the general algorithmic framework, its convergence conditions, as well as acceleration schemes. We have also related MM to several algorithmic frameworks, namely EM, cyclic minimization algorithms, CCCP, proximal minimization, VMFB, SCA, and subspace MM. More importantly, a large part of the article has been devoted to presenting techniques for constructing surrogate functions and applying MM to problems in signal processing, communications, and machine learning. A wide range of applications has been covered in this overview, such as sparse regression, matrix completion, phase retrieval, sparse PCA, covariance estimation, sequence design, and sensor network localization. Finally, we mention that although MM has been proven to be an effective tool for many applications, practitioners should also be aware of the following issues. One is that MM algorithms can get stuck at stationary points of nonconvex problems; therefore, the performance at the convergent point (whether it satisfies the application design criterion) should be studied either theoretically or empirically. Another is that MM can suffer from a slow convergence rate. In this situation, either the surrogate function should be tightened, or an MM accelerator needs to be employed (possibly at the cost of losing convergence guarantees).

APPENDIX
PROOF OF PROPOSITION 25

To find a minimizer of the surrogate function (95), we first set the gradient of $g(\mu, R \mid \mu_t, R_t)$ with respect to $\mu$ to zero, which leads to the minimizer
$$\mu_{t+1} = \frac{(K+1)\sum_{i=1}^{N} w_i(\mu_t, R_t)\, x_i + \gamma N w_t(\mu_t, R_t)\, t}{(K+1)\sum_{i=1}^{N} w_i(\mu_t, R_t) + \gamma N w_t(\mu_t, R_t)}.$$
Substituting the optimal $\mu$ back into $g(\mu, R \mid \mu_t, R_t)$ and setting its gradient with respect to $R$ to zero leads to the fixed-point equation
$$R = \frac{\big(K + (1+\gamma)/(1+\alpha)\big)\, S_t}{\Big(\sum_{i=1}^{N} \frac{K+1}{N}\, w_i(\mu_t, R_t) + \gamma\, w_t(\mu_t, R_t)\Big) + \operatorname{Tr}\big(S_t R^{-1}\big)}, \tag{109}$$
where $S_t$ is given by (97). Similar to the proof of Theorem 10 in [23], it can be shown by contradiction that if (109) has a solution, it is unique.

Since equation (109) indicates that $R$ should be proportional to $S_t$, we let the solution be $R = \beta S_t$. To obtain the value of $\beta$, we substitute $R$ back into (109), which leads to the following equation in $\beta$:
$$\frac{K + 1 + \gamma + \alpha K}{(1+\alpha)\beta} = \Bigg(\sum_{i=1}^{N} \frac{K+1}{N}\, w_i(\mu_t, R_t) + \gamma\, w_t(\mu_t, R_t)\Bigg) + K\beta^{-1}. \tag{110}$$
The solution of (110) is given by (98).

ACKNOWLEDGMENT

The authors would like to thank Prof. Wing-Kin (Ken) Ma for his comments on the connection between the MM algorithm and the cyclic minimization algorithm.

REFERENCES

[1] J. Fan, F. Han, and H. Liu, “Challenges of big data analysis,” Nat. Sci.Rev., vol. 1, no. 2, pp. 293–314, 2014.

[2] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of NonlinearEquations in Several Variables. New York, NY, USA: Academic, 1970,vol. 30.

[3] T. T. Wu and K. Lange, “The MM alternative to EM,” Statist. Sci., vol. 25,no. 4, pp. 492–505, 2010.

[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihoodfrom incomplete data via the EM algorithm,” J. Roy. Statist. Soc. Ser. B(Methodological), vol. 39, no. 1, pp. 1–38, 1977.

[5] J. De Leeuw, “Convergence of the majorization method for multidimen-sional scaling,” J. Classification, vol. 5, no. 2, pp. 163–180, 1988.

[6] K. Lange and J. A. Fessler, “Globally convergent algorithms formaximum a posteriori transmission tomography,” IEEE Trans. ImageProcess., vol. 4, no. 10, pp. 1430–1438, Oct. 1995.

[7] D. Geman and G. Reynolds, “Constrained restoration and the recov-ery of discontinuities,” IEEE Trans. Pattern Anal. Mach. Intell., no. 3,pp. 367–383, Mar. 1992.

[8] D. Geman and C. Yang, “Nonlinear image recovery with half-quadraticregularization,” IEEE Trans. Image Process., vol. 4, no. 7, pp. 932–946,Jul. 1995.

[9] A. R. De Pierro, “A modified expectation maximization algorithm forpenalized likelihood estimation in emission tomography,” IEEE Trans.Med. Imag., vol. 14, no. 1, pp. 132–137, Apr. 1994.

[10] M. P. Becker, I. Yang, and K. Lange, “EM algorithms without missingdata,” Statist. Methods Med. Res., vol. 6, no. 1, pp. 38–54, 1997.

[11] W. J. Heiser, “Convergent computation by iterative majorization: The-ory and applications in multidimensional data analysis,” in Recent Adv.Descriptive Multivariate Anal., W. J. Krzanowski, Ed. Oxford: OxfordUniv. Press, 1995, pp. 157–189.

[12] K. Lange, D. R. Hunter, and I. Yang, “Optimization transfer using sur-rogate objective functions,” J. Comput. Graph. Statist., vol. 9, no. 1,pp. 1–20, 2000.

[13] D. R. Hunter and K. Lange, “A tutorial on MM algorithms,” Am. Statist.,vol. 58, no. 1, pp. 30–37, 2004.

[14] T. Blumensath, M. Yaghoobi, and M. E. Davies, “Iterative hard thresh-olding and l0 regularisation,” in Proc. 2007 IEEE Int. Conf. Acoust.Speech Signal Process., 2007, vol. 3, pp. III–877.

[15] E. J. Candes, M. B. Wakin, and S. P. Boyd, “Enhancing sparsity byreweighted �1 minimization,” J. Fourier Anal. Appl., vol. 14, nos. 5–6,pp. 877–905, 2008.

[16] T. Blumensath and M. E. Davies, “Iterative thresholding for sparse ap-proximations,” J. Fourier Anal. Appl., vol. 14, nos. 5–6, pp. 629–654,2008.

[17] G. Marjanovic, M. O. Ulfarsson, and A. O. Hero III, “Mist: l0 sparselinear regression with momentum,” arXiv preprint arXiv:1409.7193,2014.

[18] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding al-gorithm for linear inverse problems with a sparsity constraint,” Commun.Pure Appl. Math., vol. 57, no. 11, pp. 1413–1457, 2004.

[19] X.-T. Yuan and T. Zhang, “Truncated power method for sparse eigen-value problems,” J. Mach. Learning Res., vol. 14, no. 1, pp. 899–925,2013.

[20] B. K. Sriperumbudur, D. A. Torres, and G. R. Lanckriet, “A majorization-minimization approach to the sparse generalized eigenvalue problem,”Mach. Learning, vol. 85, nos. 1/2, pp. 3–39, 2011.

[21] J. Song, P. Babu, and D. P. Palomar, “Sparse generalized eigenvalueproblem via smooth optimization,” IEEE Trans. Signal Process., vol. 63,no. 7, pp. 1627–1642, Apr. 2015.

[22] A. Wiesel, “Unified framework to regularized covariance estimation inscaled Gaussian models,” IEEE Trans. Signal Process., vol. 60, no. 1,pp. 29–38, Jan. 2012.

[23] Y. Sun, P. Babu, and D. P. Palomar, “Regularized Tyler’s scatter estimator:Existence, uniqueness, and algorithms,” IEEE Trans. Signal Process.,vol. 62, no. 19, pp. 5143–5156, Oct. 1, 2014.

[24] Y. Sun, P. Babu, and D. P. Palomar, “Regularized robust estimationof mean and covariance matrix under heavy-tailed distributions,” IEEETrans. Signal Process., vol. 63, no. 12, pp. 3096–3109, Jun. 2015.

[25] Y. Sun, P. Babu, and D. P. Palomar, “Robust estimation of structuredcovariance matrix for heavy-tailed elliptical distributions,” IEEE Trans.Signal Process., vol. 64, no. 14, pp. 3576–3590, 2016.

[26] M. Yaghoobi, T. Blumensath, and M. E. Davies, “Dictionary learningfor sparse approximations with the majorization method,” IEEE Trans.Signal Process., vol. 57, no. 6, pp. 2178–2191, Jun. 2009.

[27] C. Fevotte, “Majorization-minimization algorithm for smooth Itakura-Saito nonnegative matrix factorization,” in Proc. 2011 IEEE Int. Conf.Acoust., Speech Signal Process., 2011, pp. 1980–1983.

[28] J. M. Bioucas-Dias, M. A. Figueiredo, and J. P. Oliveira, “Total variation-based image deconvolution: A majorization-minimization approach,” inProc. 2006 IEEE Int. Conf. Acoust. Speech Signal Process., 2006, vol. 2,pp. II–II.

[29] M. A. Figueiredo, J. M. Bioucas-Dias, and R. D. Nowak, “Majorization-minimization algorithms for wavelet-based image restoration,” IEEETrans. Image Process., vol. 16, no. 12, pp. 2980–2991, Dec.2007.


[30] E. J. Candes, X. Li, and M. Soltanolkotabi, “Phase retrieval via Wirtingerflow: Theory and algorithms,” IEEE Trans. Inf. Theory, vol. 61, no. 4,pp. 1985–2007, Apr. 2015.

[31] J. Song, P. Babu, and D. P. Palomar, “Optimization methods for design-ing sequences with low autocorrelation sidelobes,” IEEE Trans. SignalProcess., vol. 63, no. 15, pp. 3998–4009, Aug. 2015.

[32] J. Song, P. Babu, and D. P. Palomar, “Sequence design to mini-mize the weighted integrated and peak sidelobe levels,” arXiv preprintarXiv:1506.04234, 2015.

[33] D. Tyler, “A distribution-free M-estimator of multivariate scatter,” Ann.Statist., vol. 15, no. 1, pp. 234–251, 1987.

[34] M. E. Tipping, “Sparse Bayesian learning and the relevance vector ma-chine,” J. Mach. Learning Res., vol. 1, pp. 211–244, 2001.

[35] D. P. Wipf and B. D. Rao, “Sparse Bayesian learning for basis selection,”IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2153–2164, Aug. 2004.

[36] Y. Sun, A. Breloy, P. Babu, D. P. Palomar, F. Pascal, and G. Ginol-hac, “Low-complexity algorithms for low rank clutter parameters esti-mation in radar systems,” IEEE Trans. Signal Process., vol. 64, no. 8,pp. 1986–1998, Apr. 2016.

[37] M. Journee, Y. Nesterov, P. Richtarik, and R. Sepulchre, “General-ized power method for sparse principal component analysis,” J. Mach.Learning Res., vol. 11, pp. 517–553, 2010.

[38] M. Allain, J. Idier, and Y. Goussard, “On global and local convergence ofhalf-quadratic algorithms,” IEEE Trans. Image Process., vol. 15, no. 5,pp. 1130–1142, May 2006.

[39] H. W. Kuhn, “A note on Fermat’s problem,” Math. Program., vol. 4,no. 1, pp. 98–107, 1973.

[40] A. E. Beaton and J. W. Tukey, “The fitting of power series, meaning poly-nomials, illustrated on band-spectroscopic data,” Technometrics, vol. 16,no. 2, pp. 147–185, 1974.

[41] J. J. Fuchs, “Convergence of a sparse representations algorithm applica-ble to real or complex data,” IEEE J. Sel. Topics Signal Process., vol. 1,no. 4, pp. 598–605, Dec. 2007.

[42] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Gunturk, “Iterativelyreweighted least squares minimization for sparse recovery,” Commun.Pure Appl. Math., vol. 63, no. 1, pp. 1–38, 2010.

[43] D. Ba, B. Babadi, P. L. Purdon, and E. N. Brown, “Convergence andstability of iteratively re-weighted least squares algorithms,” IEEE Trans.Signal Process., vol. 62, no. 1, pp. 183–195, Jan. 1, 2014.

[44] D. Bohning, “Multinomial logistic regression algorithm,” Ann. Inst.Statist. Math., vol. 44, no. 1, pp. 197–200, 1992.

[45] B. Krishnapuram, L. Carin, M. A. T. Figueiredo, and A. J. Hartemink,“Sparse multinomial logistic regression: Fast algorithms and generaliza-tion bounds,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6,pp. 957–968, Jun. 2005.

[46] J. T. Chi and E. C. Chi, “Getting to the bottom of matrix completion andnonnegative least squares with the MM algorithm,” StatisticsViews.com,Mar. 2014.

[47] T. Qiu, P. Babu, and D. P. Palomar, “PRIME: Phase retrieval viamajorization-minimization,” arXiv preprint arXiv: 1511.01669, 2015.

[48] M. A. Figueiredo and R. D. Nowak, “A bound optimization approachto wavelet-based image deconvolution,” in Proc. 2005 IEEE Int. Conf.Image Process., 2005, vol. 2, pp. II–782.

[49] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding al-gorithm for linear inverse problems,” SIAM J. Imag. Sci., vol. 2, no. 1,pp. 183–202, 2009.

[50] P. Eggermont, “Multiplicative iterative algorithms for convex program-ming,” Linear Algebra Appl., vol. 130, pp. 25–42, 1990.

[51] D. D. Lee and H. S. Seung, “Algorithms for non-negative matrix factor-ization,” in Proc. Adv. Neural Inf. Process. Syst., 2001, pp. 556–562.

[52] K. Lange, E. C. Chi, and H. Zhou, “A brief survey of modern optimizationfor statisticians,” Int. Statist. Rev., vol. 82, no. 1, pp. 46–70, 2014.

[53] M. Chiang, C. W. Tan, D. P. Palomar, D. O’Neill, and D. Julian, “Powercontrol by geometric programming,” IEEE Trans. Wireless Commun.,vol. 6, no. 7, pp. 2640–2651, Jul. 2007.

[54] K. Lange and H. Zhou, “MM algorithms for geometric and signomialprogramming,” Math. Program., vol. 143, nos. 1/2, pp. 339–356, 2014.

[55] F. Sha, L. K. Saul, and D. D. Lee, “Multiplicative updates for nonnegativequadratic programming in support vector machines,” in Proc. Adv. NeuralInf. Process. Syst., 2002, pp. 1041–1048.

[56] R. W. Gerchberg and W. O. Saxton, “A practical algorithm for the de-termination of phase from image and diffraction plane pictures,” Optik,vol. 35, 1972, Art. no. 237.

[57] E. Chouzenoux, J.-C. Pesquet, and A. Repetti, “A block coordinate vari-able metric forward-backward algorithm,” J. Global Optim., pp. 1–29,2016, doi: 10.1007/s10898-016-0405-9.

[58] J. A. Costa, N. Patwari, and A. O. Hero III, “Distributed weighted-multidimensional scaling for node localization in sensor networks,” ACMTrans. Sensor Netw., vol. 2, no. 1, pp. 39–64, 2006.

[59] P. Oguz-Ekim, J. P. Gomes, J. Xavier, and P. Oliveira, “Robust local-ization of nodes and time-recursive tracking in sensor networks usingnoisy range measurements,” IEEE Trans. Signal Process., vol. 59, no. 8,pp. 3930–3942, Aug. 2011.

[60] A. Beck, M. Teboulle, and Z. Chikishev, “Iterative minimization schemesfor solving the single source localization problem,” SIAM J. Optim.,vol. 19, no. 3, pp. 1397–1416, 2008.

[61] A. Beck and M. Teboulle, “Gradient-based algorithms with applica-tions to signal recovery problems,” in Convex Optimization in SignalProcessing and Communications, D. P. Palomar and Y. C. Eldar, Eds.Cambridge, U.K.: Cambridge Univ. Press, 2010, ch. 2.

[62] P. Stoica, P. Babu, and J. Li, “SPICE: A sparse covariance-based estima-tion method for array processing,” IEEE Trans. Signal Process., vol. 59,no. 2, pp. 629–638, Feb. 2011.

[63] H. Zhou, L. Hu, J. Zhou, and K. Lange, “MM algorithms for variancecomponents models,” arXiv preprint arXiv:1509.07426, 2015.

[64] M. Jamshidian and R. I. Jennrich, “Conjugate gradient acceleration ofthe EM algorithm,” J. Am. Statist. Assoc., vol. 88, no. 421, pp. 221–228,1993.

[65] K. Lange, “A gradient algorithm locally equivalent to the EM algorithm,”J. Roy. Statist. Soc. Ser. B (Methodol.), vol. 57, no. 2, pp. 425–437, 1995.

[66] K. Lange, “A quasi-Newton acceleration of the EM algorithm,” Statisticasinica, vol. 5, no. 1, pp. 1–18, 1995.

[67] J. A. Fessler and A. O. Hero, “Space-alternating generalized expectation-maximization algorithm,” IEEE Trans. Signal Process., vol. 42, no. 10,pp. 2664–2677, Oct. 1994.

[68] P. Tseng, “Convergence of a block coordinate descent method for non-differentiable minimization,” J. Optim. Theory Appl., vol. 109, no. 3,pp. 475–494, 2001.

[69] M. W. Jacobson and J. A. Fessler, “An expanded theoretical treatment ofiteration-dependent majorize-minimize algorithms,” IEEE Trans. ImageProcess., vol. 16, no. 10, pp. 2411–2422, Oct. 2007.

[70] M. Razaviyayn, M. Hong, and Z.-Q. Luo, “A unified convergence anal-ysis of block successive minimization methods for nonsmooth optimiza-tion,” SIAM J. Optim., vol. 23, no. 2, pp. 1126–1153, 2013.

[71] J. Mairal, “Incremental majorization-minimization optimization with ap-plication to large-scale machine learning,” SIAM J. Optim., vol. 25, no. 2,pp. 829–855, 2015.

[72] J. Mairal, “Optimization with first-order surrogate functions,” arXivpreprint arXiv: 1305.3120, 2013.

[73] J. Bolte and E. Pauwels, “Majorization-minimization procedures andconvergence of SQP methods for semi-algebraic and tame programs,”Math. Operations Res., vol. 41, no. 2, pp. 442–465, 2016.

[74] C. J. Wu, “On the convergence properties of the EM algorithm,” Ann.Statist., vol. 11, no. 1, pp. 95–103, 1983.

[75] F. Vaida, “Parameter convergence for EM and MM algorithms,” StatisticaSinica, vol. 15, no. 3, pp. 831–840, 2005.

[76] T. A. Louis, “Finding the observed information matrix when using the emalgorithm,” J. Roy. Statist. Soc. Series B (Methodol.), vol. 44, pp. 226–233, 1982.

[77] I. Meilijson, “A fast improvement to the EM algorithm on its ownterms,” J. Roy. Statist. Soc.: Ser. B (Methodol.), vol. 51, pp. 127–138,1989.

[78] C. Bouman and K. Sauer, “Fast numerical methods for emission andtransmission tomographic reconstruction,” in Proc. Conf. Inf. Sci. Syst.,1993, pp. 611–616.

[79] R. M. Lewitt and G. Muehllehner, “Accelerated iterative reconstructionfor positron emission tomography based on the EM algorithm for max-imum likelihood estimation,” IEEE Trans. Med. Imag., vol. 5, no. 1,pp. 16–22, Mar. 1986.

[80] D. P. Helmbold, R. E. Schapire, Y. Singer, and M. K. Warmuth, “Acomparison of new and old algorithms for a mixture estimation problem,”Mach. Learn., vol. 27, no. 1, pp. 97–119, 1997.

[81] E. Bauer, D. Koller, and Y. Singer, “Update rules for parameter estimationin Bayesian networks,” in Proc. 13th Conf. Uncertainty Artif. Intell.,1997, pp. 3–13.

[82] R. Salakhutdinov and S. Roweis, “Adaptive overrelaxed bound optimiza-tion methods,” in Proc. Int. Conf. Mach. Learn., 2003, pp. 664–671.

[83] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions,vol. 382. Hoboken, NJ, USA: Wiley, 2007.

[84] N. Laird, N. Lange, and D. Stram, “Maximum likelihood computationswith repeated measures: Application of the EM algorithm,” J. Am. Statist.Assoc., vol. 82, no. 397, pp. 97–105, 1987.


[85] M. Jamshidian and R. I. Jennrich, “Acceleration of the EM algorithmby using quasi-Newton methods,” J. Roy. Statist. Soc.: Ser. B (Statist.Methodol.), vol. 59, no. 3, pp. 569–587, 1997.

[86] R. Varadhan and C. Roland, “Simple and globally convergent methodsfor accelerating the convergence of any EM algorithm,” Scand. J. Statist.,vol. 35, no. 2, pp. 335–353, 2008.

[87] H. Zhou, D. Alexander, and K. Lange, “A quasi-Newton accelerationfor high-dimensional optimization algorithms,” Stat. Comput., vol. 21,no. 2, pp. 261–273, 2011.

[88] J. R. Magnus and H. Neudecker, Matrix Differential Calculus WithApplications in Statistics and Econometrics. Hoboken, NJ, USA:Wiley, 1995.

[89] D. P. Bertsekas, Nonlinear Programming. Belmont, MA, USA: AthenaSci., 1999.

[90] A. Hjørungnes, Complex-Valued Matrix Derivatives: With Applicationsin Signal Processing and Communications. Cambridge, U.K.: CambridgeUniv. Press, 2011.

[91] S. Chretien and A. O. Hero III, “Kullback proximal algorithms formaximum-likelihood estimation,” IEEE Trans. Inf. Theory, vol. 46, no. 5,pp. 1800–1810, Aug. 2000.

[92] R. M. Neal and G. E. Hinton, “A view of the EM algorithmthat justifies incremental, sparse, and other variants,” in Learning inGraphical Models, M. I. Jordan, Ed. Berlin, Germany: Springer-Verlag,1998, pp. 355–368.

[93] P. Stoica and P. Babu, “SPICE and LIKES: Two hyperparameter-freemethods for sparse-parameter estimation,” Signal Process., vol. 92, no. 7,pp. 1580–1590, 2012.

[94] M. M. Naghsh, M. Soltanalian, and M. Modarres-Hashemi, “Radar codedesign for detection of moving targets,” IEEE Trans. Aerosp. Electron.Syst., vol. 50, no. 4, pp. 2762–2778, Oct. 2014.

[95] P. Stoica, H. He, and J. Li, “New algorithms for designing unimodular se-quences with good correlation properties,” IEEE Trans. Signal Process.,vol. 57, no. 4, pp. 1415–1425, Apr. 2009.

[96] Q. Shi, M. Razaviyayn, Z.-Q. Luo, and C. He, “An iteratively weightedMMSE approach to distributed sum-utility maximization for a MIMOinterfering broadcast channel,” IEEE Trans. Signal Process., vol. 59,no. 9, pp. 4331–4340, Sep. 2011.

[97] R. Horst and N. V. Thoai, “DC programming: Overview,” J. Optim.Theory Appl., vol. 103, no. 1, pp. 1–43, 1999.

[98] P. D. Tao et al., “The DC (difference of convex functions) programmingand DCA revisited with DC models of real world nonconvex optimizationproblems,” Ann. Oper. Res., vol. 133, nos. 1–4, pp. 23–46, 2005.

[99] A. L. Yuille and A. Rangarajan, “The concave-convex procedure(CCCP),” in Proc. Adv. Neural Inf. Process. Syst., 2002, vol. 2,pp. 1033–1040.

[100] T. Lipp and S. Boyd,“ Variations and extension of the convex-concaveprocedure,” Optim. Eng., vol. 17, no. 2, pp. 1–25, 2015. [Online]. Avail-able: http://dx.doi.org/10.1007/s11081–015-9294-x

[101] T. D. Quoc and M. Diehl, “Sequential convex programming methodsfor solving nonlinear optimization problems with DC constraints,” arXivpreprint arXiv:1107.5841, 2011, to be published.

[102] D. P. Bertsekas and P. Tseng, “Partial proximal minimization algorithmsfor convex programming,” SIAM J. Optim., vol. 4, no. 3, pp. 551–572,1994.

[103] N. Parikh and S. Boyd, “Proximal algorithms,” Found. Trends Optim.,vol. 1, no. 3, pp. 123–231, 2013.

[104] P. L. Combettes and J.-C. Pesquet, “Proximal splitting methods insignal processing,” in Fixed-Point Algorithms for Inverse Problemsin Science and Engineering, Berlin, Germany: Springer-Verlag, 2011,pp. 185–212.

[105] E. Chouzenoux, J.-C. Pesquet, and A. Repetti, “Variable metric forward-backward algorithm for minimizing the sum of a differentiable func-tion and a convex function,” J. Optim. Theory Appl., vol. 162, no. 1,pp. 107–132, 2014.

[106] A. Repetti, M. Q. Pham, L. Duval, E. Chouzenoux, and J.-C. Pesquet,“Euclid in a taxicab: Sparse blind deconvolution with smoothed reg-ularization,” IEEE Signal Process. Lett., vol. 22, no. 5, pp. 539–543,May 2015.

[107] G. Scutari, F. Facchinei, P. Song, D. P. Palomar, and J.-S. Pang, “Decom-position by partial linearization: Parallel optimization of multi-agentsystems,” IEEE Trans. Signal Process., vol. 62, no. 3, pp. 641–656,Feb. 1, 2014.

[108] F. Facchinei, S. Sagratella, and G. Scutari, “Flexible parallel algorithmsfor big data optimization,” in Proc. 2014 IEEE Int. Conf. Acoust., SpeechSignal Process, 2014, pp. 7208–7212.

[109] F. Facchinei, G. Scutari, and S. Sagratella, “Parallel selective algorithmsfor nonconvex big data optimization,” IEEE Trans. Signal Process.,vol. 63, no. 7, pp. 1874–1889, Apr. 1, 2015.

[110] Y. Yang, G. Scutari, D. P. Palomar, and M. Pesavento, “A parallel de-composition method for nonconvex stochastic multi-agent optimizationproblems,” IEEE Trans. Signal Process., vol. 64, no. 11, pp. 2949–2964,Jun. 2016.

[111] B. R. Marks and G. P. Wright, “Technical note—A general inner approx-imation algorithm for nonconvex mathematical programs,” Oper. Res.,vol. 26, no. 4, pp. 681–683, 1978.

[112] G. Scutari, F. Facchinei, L. Lampariello, and P. Song, “Distributedmethods for constrained nonconvex multi-agent optimization—Part I:Theory,” arXiv preprint arXiv:1410.4754, 2014.

[113] Q. T. Dinh and M. Diehl, “Local convergence of sequential convexprogramming for nonconvex optimization,” in Recent Advances in Opti-mization and its Applications in Engineering. Berlin, Germany: Springer-Verlag, 2010, pp. 93–102.

[114] F. Palacios-Gomez, L. Lasdon, and M. Engquist, “Nonlinear optimiza-tion by successive linear programming,” Manage. Sci., vol. 28, no. 10,pp. 1106–1120, 1982.

[115] J. Nocedal and S. Wright, Numerical Optimization. Berlin, Germany:SpringerVerlag, 2006.

[116] C. Labat and J. Idier, “Convergence of conjugate gradient methods witha closed-form stepsize formula,” J. Optim. Theory Appl., vol. 136, no. 1,pp. 43–60, 2008.

[117] E. Chouzenoux, S. Moussaoui, and J. Idier, “Majorize-minimize line-search for inversion methods involving barrier function optimization,”Inverse Problems, vol. 28, no. 6, 2012, Art. no. 065011.

[118] E. Chouzenoux, J. Idier, and S. Moussaoui, “A majorize-minimize strat-egy for subspace optimization applied to image restoration,” IEEE Trans.Image Process., vol. 20, no. 6, pp. 1517–1528, Jun. 2011.

[119] E. Chouzenoux, A. Jezierska, J.-C. Pesquet, and H. Talbot, “A majorize-minimize subspace approach for �2 -�0 image regularization,” SIAM J.Imag. Sci., vol. 6, no. 1, pp. 563–591, 2013.

[120] A. Florescu, E. Chouzenoux, J.-C. Pesquet, P. Ciuciu, and S.Ciochina, “A majorize-minimize memory gradient method for complex-valued inverse problems,” Signal Process., vol. 103, pp. 285–295,2014.

[121] E. Chouzenoux and J.-C. Pesquet, “A stochastic majorize-minimize sub-space algorithm for online penalized least squares estimation,” arXivpreprint arXiv:1512.08722, 2015, to be published.

[122] E. Chouzenoux and J.-C. Pesquet, “Convergence rate analy-sis of the majorize-minimize subspace algorithm,” arXiv preprintarXiv:1603.07301, 2016, to be published.

[123] I. Markovsky, “Structured low-rank approximation and its applications,”Automatica, vol. 44, no. 4, pp. 891–909, 2008.

[124] J.-J. Xiao, S. Cui, Z.-Q. Luo, and A. J. Goldsmith, “Linear coherentdecentralized estimation,” IEEE Trans. Signal Process., vol. 56, no. 2,pp. 757–770, Feb. 2008.

[125] H. Zou, T. Hastie, and R. Tibshirani, “Sparse principal component anal-ysis,” J. Comput. Graph. Statist., vol. 15, no. 2, pp. 265–286, 2006.

[126] A. d’Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. Lanckriet, “Adirect formulation for sparse PCA using semidefinite programming,”SIAM Rev., vol. 49, no. 3, pp. 434–448, 2007.

[127] D. M. Witten, R. Tibshirani, and T. Hastie, “A penalized matrix decom-position, with applications to sparse principal components and canonicalcorrelation analysis,” Biostatistics, vol. 10, no. 3, pp. 515–534, 2009.

[128] P. Charbonnier, L. Blanc-Feraud, G. Aubert, and M. Barlaud, “Determin-istic edge-preserving regularization in computed imaging,” IEEE Trans.Image Process., vol. 6, no. 2, pp. 298–311, Feb. 1997.

[129] D. Bohning and B. G. Lindsay, “Monotonicity of quadratic-approximation algorithms,” Ann. Inst. Statist. Math., vol. 40, no. 4,pp. 641–663, 1988.

[130] J. Song, P. Babu, and D. P. Palomar, “Sequence set design with goodcorrelation properties via majorization-minimization,” arXiv preprintarXiv:1510.01899, 2015, to be published.

[131] A. R. De Pierro, “On the convergence of the iterative image space recon-struction algorithm for volume ECT,” IEEE Trans. Med. Imag., vol. 6,no. 2, pp. 174–175, Jun. 1987.

[132] J. D. Lee, B. Recht, N. Srebro, J. Tropp, and R. R. Salakhutdinov, “Prac-tical large-scale optimization for max-norm regularization,” in Proc. Adv.Neural Inf. Process. Syst., 2010, pp. 1297–1305.

[133] R. Jenatton, G. Obozinski, and F. Bach, “Structured sparse principalcomponent analysis,” in Proc. Int. Conf. Artif. Intell. Statist., 2010,pp. 366–373.


[134] C. Fevotte and J. Idier, “Algorithms for nonnegative matrix factorizationwith the β-divergence,” Neural Comput., vol. 23, no. 9, pp. 2421–2456,2011.

[135] P. Netrapalli, P. Jain, and S. Sanghavi, “Phase retrieval using alter-nating minimization,” in Proc. Adv. Neural Inf. Process. Syst., 2013,pp. 2796–2804.

[136] J. R. Fienup, “Reconstruction of an object from the modulus of its Fouriertransform,” Opt. Lett., vol. 3, no. 1, pp. 27–29, 1978.

[137] P. Netrapalli, P. Jain, and S. Sanghavi, “Phase retrieval using alter-nating minimization,” IEEE Trans. Signal Process., vol. 63, no. 18,pp. 4814–4826, Sep. 15, 2015.

[138] D. Shamsi, N. Taheri, Z. Zhu, and Y. Ye, “Conditions for correct sensornetwork localization using SDP relaxation,” in Discrete Geometry andOptimization, Berlin, Germany: Springer-Verlag, 2013, pp. 279–301.

[139] S. Korkmaz and A.-J. Van der Veen, “Robust localization in sensornetworks with iterative majorization techniques,” in Proc. 2009 IEEEInt. Conf. Acoust., Speech Signal Process., 2009, pp. 2049–2052.

[140] P. Babu and P. Stoica, “Sparse spectral-line estimation for nonuniformlysampled multivariate time series: SPICE, LIKES and MSBL,” in Proc.20th Eur. Signal Process. Conf., Aug. 2012, pp. 445–449.

Ying Sun received the B.E. degree in electronic in-formation from Huazhong University of Science andTechnology, Wuhan, China, in 2011, and the Ph.D.degree in electronic and computer engineering fromHong Kong University of Science and Technology,Hong Kong, in 2016. She is currently a PostdoctoralResearcher with the School of Industrial Engineering,Purdue University, West Lafayette, IN, USA. Her re-search interests include statistical signal processing,optimization algorithms and machine learning.

Prabhu Babu received the Ph.D. degree in electrical engineering from Uppsala University, Uppsala, Sweden, in 2012. From 2013 to 2016, he was a Postdoctoral Fellow with the Hong Kong University of Science and Technology. He is currently with the Centre for Applied Research in Electronics, Indian Institute of Technology Delhi, New Delhi, India.

Daniel P. Palomar (S’99–M’03–SM’08–F’12) re-ceived the Graduate degree in electrical engineeringand the Ph.D. degree from the Technical Universityof Catalonia (UPC), Barcelona, Spain, in 1998 and2003, respectively.

He is a Professor in the Department of Electronicand Computer Engineering, Hong Kong Universityof Science and Technology, Hong Kong, which hejoined in 2006. Since 2013, he has been a Fel-low of the Institute for Advance Study, HKUST. Hehad previously held several research appointments,

namely, at King’s College London, London, U.K.; Stanford University, Stanford,CA, USA; Telecommunications Technological Center of Catalonia, Barcelona,Spain; Royal Institute of Technology, Stockholm, Sweden; University of Rome“La Sapienza,” Rome, Italy; and Princeton University, Princeton, NJ, USA. Hiscurrent research interests include applications of convex optimization theory,game theory, and variational inequality theory to financial systems, big datasystems, and communication systems.

Dr. Palomar received the 2004/06 Fulbright Research Fellowship, the 2004Young Author Best Paper Award by the IEEE Signal Processing Society, the2002/03 best Ph.D. prize in Information Technologies and Communications bythe Technical University of Catalonia (UPC), the 2002/03 Rosina Ribalta firstprize for the Best Doctoral Thesis in information technologies and communica-tions by the Epson Foundation, and the 2004 prize for the Best Doctoral Thesisin advanced mobile communications by the Vodafone Foundation.

He is a Guest Editor of the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL

PROCESSING 2016 Special Issue on “Financial Signal Processing and MachineLearning for Electronic Trading” and has been Associate Editor of IEEE TRANS-ACTIONS ON INFORMATION THEORY and of IEEE TRANSACTIONS ON SIGNAL

PROCESSING, a Guest Editor of the IEEE SIGNAL PROCESSING MAGAZINE 2010Special Issue on “Convex Optimization for Signal Processing,” the IEEE JOUR-NAL ON SELECTED AREAS IN COMMUNICATIONS 2008 Special Issue on “GameTheory in Communication Systems,” and the IEEE JOURNAL ON SELECTED

AREAS IN COMMUNICATIONS 2007 Special Issue on “Optimization of MIMOTransceivers for Realistic Communication Networks.”

