DOMAIN-SPECIFIC LANGUAGES FOR
CONVEX AND NON-CONVEX OPTIMIZATION

A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

Steven Malone Diamond
May 2020


http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/tq788ns0013

© 2020 by Steven Malone Diamond. All Rights Reserved.

Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.


I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Stephen Boyd, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Chris Re, Co-Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Alex Aiken

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Gordon Wetzstein

Approved for the Stanford University Committee on Graduate Studies.

Stacey F. Bent, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.


Preface

Convex optimization has many applications to fields as diverse as machine learning, control, finance, and signal and image processing [28]. Using convex optimization in an application requires either developing a custom solver or converting the problem into a standard form. Both of these tasks require expertise, and are time-consuming and error prone. An alternative is to use a domain-specific language (DSL) for convex optimization, which allows the user to specify the problem in a natural way that follows the math; this specification is then automatically converted into the standard form required by generic solvers. CVX [104], YALMIP [159], CVXGEN [164], QCML [46], PICOS [188], and Convex.jl [214] are examples of such DSLs for convex optimization.

In this thesis we demonstrate that DSLs for convex optimization are easy to use, scale to large problems, and can be extended to useful classes of non-convex problems. We begin in chapter 1 with a discussion of CVXPY, a widely-used DSL for convex optimization. We present several examples of modeling optimization problems with CVXPY and highlight the novel features and modeling paradigms CVXPY introduced. The content of chapter 1 is drawn from [65].

We next illustrate how DSLs for convex optimization such as CVXPY can be extended to efficiently handle large-scale optimization problems involving structured linear operators. We call our approach matrix-free convex optimization modeling. Chapter 2 lays down the theoretical background and describes a concrete implementation. The content of chapter 2 is drawn from [64, 66]. Chapter 3 continues the discussion of matrix-free convex optimization modeling by presenting an effective algorithm for matrix-free preconditioning. Matrix-free preconditioning is needed for our modeling approach to be robust and generic. The content of chapter 3 is drawn from [67].

We conclude in chapter 4 with an exploration of non-convex optimization using convex optimization as a black-box method. In particular, we consider approximate minimization of convex functions over non-convex sets via an ADMM-based heuristic that solves a series of convex subproblems. Our approach lends itself to expression as a DSL, which we call NCVX. NCVX is an extension of CVXPY, in that CVXPY is used to model the convex subproblems. We show that our heuristic is an effective approach for a variety of non-convex problems that arise in applications. The content of chapter 4 is drawn from [69].

We hope the work presented in this thesis inspires future research on DSLs for optimization. The topic lies at the intersection of programming languages, optimization algorithms, and applications. Creating an optimization DSL brings insights from abstract mathematics down to the concrete realm of computation, making optimization techniques available to a broader class of practitioners. We are grateful to have so many users of our software, and our research has been substantially shaped by our interaction with them.


Acknowledgments

It’s been a long journey at Stanford, from freshman year in 2010 to graduating with a Ph.D. in 2020. I owe my thanks to everyone who helped me along the way. The person to whom I owe the most is my advisor, Stephen Boyd. I first learned about Stephen when I was a sophomore, from rumors about a super difficult but super interesting class on something called “convex optimization.” Little did I know when I started taking classes with Stephen that this was only the beginning of an incredible education, for which I’m more grateful than I can properly express. I’ll simply say that it’s been a privilege to watch the master at work, and I look forward to our future collaborations.

I want to thank Gordon Wetzstein next, for opening up his lab to a curious student and giving me a world-class education in computational imaging. I’m immensely proud of the work we did together, and though I was not able to include it in my thesis, our work on unrolled optimization and learned imaging systems was a highlight of my ten years at Stanford. I’m also grateful for the welcome I received from Gordon’s students, particularly Vincent Sitzmann and Felix Heide.

I owe a huge debt of gratitude to all my labmates, for all our collaborations, discussions, and general good times. I’m especially thankful that Akshay Agrawal shared my interest in software for optimization and worked with me on CVXPY. Outside the lab, I appreciate all the support I got from my friends. Grad school has its challenges, and Linus Mixson was there for me during a difficult time.

Of course I wouldn’t even have begun my journey at Stanford without the support of my family. I give my deepest love to my mom, my dad, and my brother. Last but not least, I thank Huan Gui for being the best distraction I could ever have hoped for, for showing me a new side to life, and for putting up with someone who has his head too much in the clouds for his own good.


Contents

Preface

Acknowledgments

1 CVXPY
  1.1 Introduction
  1.2 CVXPY syntax
  1.3 Solvers
  1.4 Signed DCP
  1.5 Parameters
  1.6 Object-oriented convex optimization

2 Matrix-free convex optimization modeling
  2.1 Introduction
  2.2 Forward-adjoint oracles
    2.2.1 Definition
    2.2.2 Vector mappings
    2.2.3 Matrix mappings
    2.2.4 Multiple vector mappings
    2.2.5 Additional examples
  2.3 Compositions
    2.3.1 Forward evaluation
    2.3.2 Adjoint evaluation
    2.3.3 Parallelism
    2.3.4 Optimizing the DAG
    2.3.5 Reducing the memory footprint
    2.3.6 Software implementations
  2.4 Cone programs and solvers
    2.4.1 Cone programs
    2.4.2 Cone solvers
  2.5 Matrix-free canonicalization
    2.5.1 Canonicalization
    2.5.2 Informal overview
    2.5.3 Expression DAGs
    2.5.4 Optimization problem representation
    2.5.5 Cone program representation
    2.5.6 Algorithm
  2.6 Numerical results
    2.6.1 Implementation
    2.6.2 Nonnegative deconvolution
    2.6.3 Sylvester LP

3 Stochastic matrix-free equilibration
  3.1 Equilibration
  3.2 Equilibration via convex optimization
    3.2.1 The equilibration problem
    3.2.2 Equilibration and condition number
    3.2.3 Regularized equilibration
  3.3 Stochastic method
    3.3.1 Unbiased gradient estimate
    3.3.2 Projected stochastic gradient
  3.4 Numerical experiments
  3.5 Applications
    3.5.1 LSQR
    3.5.2 Chambolle-Cremers-Pock
  3.6 Variants

4 NCVX
  4.1 Introduction
    4.1.1 The problem
    4.1.2 Special cases
    4.1.3 Convex relaxation
    4.1.4 Projections and approximate projections
    4.1.5 Residual and merit functions
    4.1.6 Solution methods
    4.1.7 Our approach
  4.2 Local improvement methods
    4.2.1 Polishing
    4.2.2 Relax-round-polish
    4.2.3 Neighbor search
  4.3 NC-ADMM
    4.3.1 ADMM
    4.3.2 Algorithm subroutines
    4.3.3 Discussion
    4.3.4 Solution improvement
    4.3.5 Overall algorithm
  4.4 Projections onto nonconvex sets
    4.4.1 Subsets of $\mathbf{R}$
    4.4.2 Subsets of $\mathbf{R}^n$
    4.4.3 Subsets of $\mathbf{R}^{m \times n}$
    4.4.4 Combinations of sets
  4.5 Implementation
    4.5.1 Variable constructors
    4.5.2 Variable methods
    4.5.3 Constructing and solving problems
    4.5.4 Limitations
  4.6 Examples
    4.6.1 Regressor selection
    4.6.2 3-satisfiability
    4.6.3 Circle packing
    4.6.4 Traveling salesman problem
    4.6.5 Factor analysis model
    4.6.6 Inexact graph isomorphism
  4.7 Conclusion

Bibliography


List of Tables


List of Figures

2.1 The FAO DAG for $f(x) = Ax + Bx$.
2.2 The FAO DAG for $f^*(u) = A^T u + B^T u$ obtained by transforming the FAO DAG in figure 2.1.
2.3 The FAO DAG for $f(x) = ABx + ACx$.
2.4 The FAO DAG for $f(x) = A(Bx + Cx)$.
2.5 The expression DAG for $f(x) = \|Ax\|_2 + 3$.
2.6 The expression DAG for $f(x) = x + 2$.
2.7 The Linear subroutine applied to the expression DAG in figure 2.6.
2.8 The Constant subroutine applied to the expression DAG in figure 2.6.
2.9 The expression DAG for vstack$(e_1, \ldots, e_\ell)$.
2.10 The expression DAG $H^{(1)}$ when $\ell = 1$ and $e_1$ represents $f(x, y) = x + A(x + y)$.
2.11 The expression DAG $H^{(2)}$ obtained by transforming $H^{(1)}$ in figure 2.10.
2.12 Results for a problem instance with $n = 1000$.
2.13 Solve time in seconds $T$ versus variable size $n$.
2.14 Solve time in seconds $T$ versus variable size $n$.

3.1 Problem (3.5) optimality gap and RMS error versus iterations $t$.
3.2 Condition number of $DAE$ versus iterations $t$.
3.3 Residual versus iterations $t$ for LSQR.
3.4 Optimality gap versus iterations $t$ for CCP.

4.1 The average error of solutions found by Lasso, relax-round-polish, and NC-ADMM for 40 random instances of the regressor selection problem.
4.2 The best value found by NC-ADMM (usually done in 35 milliseconds) and Gurobi after 10 seconds, 100 seconds, and 1000 seconds.
4.3 The fraction of the 10 3-SAT instances generated for each choice of number of clauses $m$ and variables $n$ for which NC-ADMM found a satisfying assignment. No instances were generated for $(n, m)$ in the gray region.
4.4 The relative radius $r_1/l$ for the densest known packing and the packing found with the relax-round-polish heuristic for $n = 1, \ldots, 100$.
4.5 The packing for $n = 41$ circles with equal radii found with the relax-round-polish heuristic.
4.6 The average cost of the TSP solutions found by relax-round-polish, NC-ADMM, and Gurobi with a time cutoff equal to the runtime of NC-ADMM.
4.7 The average difference between the objective value found by the nuclear norm, relax-round-polish, and NC-ADMM heuristics and the best objective value found by any of the heuristics for instances of the factor analysis problem constructed from daily stock returns.
4.8 Time comparison of Gurobi and NC-ADMM on random graph isomorphism problems. Each point shows how long NC-ADMM or Gurobi ran on a particular problem instance.


Chapter 1

CVXPY

1.1 Introduction

Convex optimization has many applications to fields as diverse as machine learning, control, finance, and signal and image processing [28]. Using convex optimization in an application requires either developing a custom solver or converting the problem into a standard form. Both of these tasks require expertise, and are time-consuming and error prone. An alternative is to use a domain-specific language (DSL) for convex optimization, which allows the user to specify the problem in a natural way that follows the math; this specification is then automatically converted into the standard form required by generic solvers. CVX [104], YALMIP [159], CVXGEN [164], QCML [46], PICOS [188], and Convex.jl [214] are examples of such DSLs for convex optimization.

CVXPY is a new domain-specific language (DSL) for convex optimization, principally developed by the author. It is based on CVX [104], but introduces new features such as signed disciplined convex programming analysis and parameters. CVXPY is an ordinary Python library, which makes it easy to combine convex optimization with high-level features of Python such as parallelism and object-oriented design.

CVXPY has been downloaded by thousands of users and used to teach multiple courses [24]. Many tools have been built on top of CVXPY, such as an extension for stochastic optimization [6]. In this short chapter we describe CVXPY. Future chapters rely on CVXPY as a software artifact and point of reference.

1.2 CVXPY syntax

CVXPY has a simple, readable syntax inspired by CVX [104]. The following code constructs and solves a least squares problem where the variable's entries are constrained to be between 0 and 1.


The problem data $A \in \mathbf{R}^{m \times n}$ and $b \in \mathbf{R}^m$ could be encoded as NumPy ndarrays or one of several other common matrix representations in Python.

from cvxpy import *

# Problem data A, b, and the dimension n are assumed given.
# Construct the problem.
x = Variable(n)
objective = Minimize(sum_squares(A*x - b))
constraints = [0 <= x, x <= 1]
prob = Problem(objective, constraints)

# The optimal objective is returned by prob.solve().
result = prob.solve()
# The optimal value for x is stored in x.value.
print(x.value)

The variable, objective, and constraints are each constructed separately and combined in the final problem. In CVX, by contrast, these objects are created within the scope of a particular problem. Allowing variables and other objects to be created in isolation makes it easier to write high-level code that constructs problems (see §1.6).

1.3 Solvers

CVXPY converts problems into a standard form known as conic form [172], a generalization of a linear program. The conversion is done using graph implementations of convex functions [103]. The resulting cone program is equivalent to the original problem, so by solving it we obtain a solution of the original problem.

Solvers that handle conic form are known as cone solvers; each one can handle combinations of several types of cones. CVXPY interfaces with the open-source cone solvers CVXOPT [8], ECOS [71], and SCS [175], which are implemented in combinations of Python and C. These solvers have different characteristics, such as the types of cones they can handle and the type of algorithms employed. CVXOPT and ECOS are interior-point solvers, which reliably attain high accuracy for small and medium scale problems; SCS is a first-order solver, which uses OpenMP to target multiple cores and scales to large problems with modest accuracy.
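
The target solver can be selected explicitly by passing a solver argument to the solve method. A minimal sketch, assuming the problem prob constructed in §1.2 and the solver constants exported by CVXPY:

# Solve with a first-order solver that scales to large problems.
result = prob.solve(solver=SCS)
# Solve with an interior-point solver for high accuracy.
result = prob.solve(solver=ECOS)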

1.4 Signed DCP

Like CVX, CVXPY uses disciplined convex programming (DCP) to verify problem convexity [105]. In DCP, problems are constructed from a fixed library of functions with known curvature and monotonicity properties. Functions must be composed according to a simple set of rules such that the composition's curvature is known. For a visualization of the DCP rules, visit dcp.stanford.edu.


CVXPY extends the DCP rules used in CVX by keeping track of the signs of expressions. The monotonicity of many functions depends on the sign of their argument, so keeping track of signs allows more compositions to be verified as convex. For example, the composition square(square(x)) would not be verified as convex under standard DCP because the square function is nonmonotonic. But the composition is verified as convex under signed DCP because square is increasing for nonnegative arguments and square(x) is nonnegative.

1.5 Parameters

Another improvement in CVXPY is the introduction of parameters. Parameters are constants whose symbolic properties (e.g., dimensions and sign) are fixed but whose numeric value can change. A problem involving parameters can be solved repeatedly for different values of the parameters without repeating computations that do not depend on the parameter values. Parameters are an old idea in DSLs for optimization, appearing in AMPL [83].

A common use case for parameters is computing a trade-off curve. The following code constructs a LASSO problem [28] where the positive parameter $\gamma$ trades off the sum of squares error and the regularization term. The problem data are $A \in \mathbf{R}^{m \times n}$ and $b \in \mathbf{R}^m$.

x = Variable(n)
gamma = Parameter(sign="positive")  # Must be positive due to DCP rules.
error = sum_squares(A*x - b)
regularization = norm(x, 1)
prob = Problem(Minimize(error + gamma*regularization))

Computing a trade-off curve is trivially parallelizable, since each problem can be solved independently. CVXPY can be combined with Python multiprocessing (or any other parallelism library) to distribute the trade-off curve computation across many processes.

import multiprocessing
import numpy

# Assign a value to gamma and find the optimal x.
def get_x(gamma_value):
    gamma.value = gamma_value
    result = prob.solve()
    return x.value

# Get a range of gamma values with NumPy.
gamma_vals = numpy.logspace(-4, 6)

# Do parallel computation with multiprocessing.
pool = multiprocessing.Pool(processes = N)
x_values = pool.map(get_x, gamma_vals)


1.6 Object-oriented convex optimization

CVXPY enables an object-oriented approach to constructing optimization problems. As an example, consider an optimal flow problem on a directed graph $G = (V, E)$ with vertex set $V$ and (directed) edge set $E$. Each edge $e \in E$ carries a flow $f_e \in \mathbf{R}$, and each vertex $v \in V$ has an internal source that generates $s_v \in \mathbf{R}$ flow. (Negative values correspond to flow in the opposite direction, or a sink at a vertex.) The (single commodity) flow problem is (with variables $f_e$ and $s_v$)

\[
\begin{array}{ll}
\mbox{minimize} & \sum_{e \in E} \phi_e(f_e) + \sum_{v \in V} \psi_v(s_v) \\
\mbox{subject to} & s_v + \sum_{e \in I(v)} f_e = \sum_{e \in O(v)} f_e, \quad \mbox{for all } v \in V,
\end{array}
\]

where the $\phi_e$ and $\psi_v$ are convex cost functions and $I(v)$ and $O(v)$ give vertex $v$'s incoming and outgoing edges, respectively.

To express the problem in CVXPY, we construct vertex and edge objects, which store local information such as optimization variables, constraints, and an associated objective term. These are exported as a CVXPY problem for each vertex and each edge.

class Vertex(object):
    def __init__(self, cost):
        self.source = Variable()
        self.cost = cost(self.source)
        self.edge_flows = []

    def prob(self):
        net_flow = sum(self.edge_flows) + self.source
        return Problem(Minimize(self.cost), [net_flow == 0])

class Edge(object):
    def __init__(self, cost):
        self.flow = Variable()
        self.cost = cost(self.flow)

    def connect(self, in_vertex, out_vertex):
        in_vertex.edge_flows.append(-self.flow)
        out_vertex.edge_flows.append(self.flow)

    def prob(self):
        return Problem(Minimize(self.cost))

The vertex and edge objects are composed into a graph using the edges’ connect method.


To construct the single commodity flow problem, we sum the vertices and edges' local problems. (Addition of problems is overloaded in CVXPY to add the objectives together and concatenate the constraints.)

prob = sum([object.prob() for object in vertices + edges])
prob.solve()  # Solve the single commodity flow problem.


Chapter 2

Matrix-free convex optimization modeling

2.1 Introduction

Convex optimization modeling systems like YALMIP [160], CVX [104], CVXPY [65], and Convex.jl [214] provide an automated framework for converting a convex optimization problem expressed in a natural human-readable form into the standard form required by a solver, calling the solver, and transforming the solution back to the human-readable form. This allows users to form and solve convex optimization problems quickly and efficiently. These systems easily handle problems with a few thousand variables, as well as much larger problems (say, with hundreds of thousands of variables) with enough sparsity structure, which generic solvers can exploit.

The overhead of the problem transformation, and the additional variables and constraints introduced in the transformation process, result in longer solve times than can be obtained with a custom algorithm tailored specifically for the particular problem. Perhaps surprisingly, the additional solve time (compared to a custom solver) for a modeling system coupled to a generic solver is often not as much as one might imagine, at least for modest sized problems. In many cases the convenience of easily expressing the problem makes up for the increased solve time using a convex optimization modeling system.

Many convex optimization problems in applications like signal and image processing, or medical imaging, involve hundreds of thousands or many millions of variables, and so are well out of the range that current modeling systems can handle. There are two reasons for this. First, the standard form problem that would be created is too large to store on a single machine, and second, even if it could be stored, standard interior-point solvers would be too slow to solve it. Yet many of these problems are readily solved on a single machine by custom solvers, which exploit fast linear transforms in the problems. The key to these custom solvers is to directly use the fast transforms, never forming the associated matrix. For this reason these algorithms are sometimes referred to as matrix-free solvers.

The literature on matrix-free solvers in signal and image processing is extensive; see, e.g., [15, 16, 40, 39, 183, 98, 233]. There has been particular interest in matrix-free solvers for LASSO and basis pursuit denoising problems [16, 43, 82, 77, 140, 217]. Matrix-free solvers have also been developed for specialized control problems [218, 219]. The most general matrix-free solvers target semidefinite programs [142] or quadratic programs and related problems [189, 100]. The software closest to a convex optimization modeling system for matrix-free problems is TFOCS, which allows users to specify many types of convex problems and solve them using a variety of matrix-free first-order methods [17].

To better understand the advantages of matrix-free solvers, consider the nonnegative deconvolution problem

\[
\begin{array}{ll}
\mbox{minimize} & \|c * x - b\|_2 \\
\mbox{subject to} & x \geq 0,
\end{array}
\qquad (2.1)
\]

where $x \in \mathbf{R}^n$ is the optimization variable, $c \in \mathbf{R}^n$ and $b \in \mathbf{R}^{2n-1}$ are problem data, and $*$ denotes convolution. Note that the problem data has size $O(n)$. There are many custom matrix-free methods for efficiently solving this problem, with $O(n)$ memory and a few hundred iterations, each of which costs $O(n \log n)$ floating point operations (flops). It is entirely practical to solve instances of this problem of size $n = 10^7$ on a single computer [144, 156].
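
As a simple illustration of the matrix-free idea (not one of the specialized solvers cited above), the following NumPy/SciPy sketch applies projected gradient descent to the equivalent problem of minimizing $\frac{1}{2}\|c * x - b\|_2^2$ over $x \geq 0$, evaluating the convolution and its adjoint with FFTs rather than forming the Toeplitz matrix; the function names and step-size rule are ours:

import numpy as np
from scipy.signal import fftconvolve

def conv(c, x):
    # c * x without forming the Toeplitz matrix: O(n log n) flops via the FFT.
    return fftconvolve(c, x, mode="full")

def conv_adjoint(c, r):
    # Adjoint of convolution by c (correlation with c), also O(n log n) flops.
    return fftconvolve(r, c[::-1], mode="valid")

def nonneg_deconv(c, b, iters=300):
    # Projected gradient on (1/2)||c * x - b||_2^2 subject to x >= 0.
    x = np.zeros(c.size)
    step = 1.0 / np.sum(np.abs(c)) ** 2   # ||C||_2 <= ||c||_1, so this step size is safe
    for _ in range(iters):
        grad = conv_adjoint(c, conv(c, x) - b)
        x = np.maximum(x - step * grad, 0.0)
    return x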

Existing convex optimization modeling systems fall far short of the efficiency of matrix-free solvers on problem (2.1). These modeling systems target a standard form in which a problem's linear structure is represented as a sparse matrix. As a result, linear functions must be converted into explicit matrix multiplication. In particular, the operation of convolving by $c$ will be represented as multiplication by a $(2n-1) \times n$ Toeplitz matrix $C$. A modeling system will thus transform problem (2.1) into the problem

\[
\begin{array}{ll}
\mbox{minimize} & \|Cx - b\|_2 \\
\mbox{subject to} & x \geq 0,
\end{array}
\qquad (2.2)
\]

as part of the conversion into standard form.

Once the transformation from (2.1) to (2.2) has taken place, there is no hope of solving the problem efficiently. The explicit matrix representation of $C$ requires $O(n^2)$ memory. A typical interior-point method for solving the transformed problem will take a few tens of iterations, each requiring $O(n^3)$ flops. For this reason existing convex optimization modeling systems will struggle to solve instances of problem (2.1) with $n = 10^4$, and when they are able to solve the problem, they will be dramatically slower than custom matrix-free methods.

The key to matrix-free methods is to exploit fast algorithms for evaluating a linear function and its adjoint. We call an implementation of a linear function that allows us to evaluate the function and its adjoint a forward-adjoint oracle (FAO). In this chapter we describe a new algorithm for converting convex optimization problems into standard form while preserving fast linear functions. This yields a convex optimization modeling system that can take advantage of fast linear transforms, and can be used to solve large problems such as those arising in image and signal processing and other areas, with millions of variables. This allows users to rapidly prototype and implement new convex optimization based methods for large-scale problems. As with current modeling systems, the goal is not to attain (or beat) the performance of a custom solver tuned for the specific problem; rather it is to make the specification of the problem straightforward, while increasing solve times only moderately.

The outline of the chapter is as follows. In §2.2 we give many examples of useful FAOs. In §2.3 we explain how to compose FAOs so that we can efficiently evaluate the composition and its adjoint. In §2.4 we describe cone programs, the standard intermediate-form representation of a convex problem, and solvers for cone programs. In §2.5 we describe our algorithm for converting convex optimization problems into equivalent cone programs while preserving fast linear transforms. In §2.6 we report numerical results for the nonnegative deconvolution problem (2.1) and a special type of linear program, for our implementation of the abstract ideas in the chapter, using versions of the existing cone solvers SCS [175] and POGS [81] modified to be matrix-free. (The main modification was using the matrix-free equilibration described in [67].) Even with our simple, far from optimized matrix-free cone solvers, we demonstrate scaling to problems far larger than those that can be solved by generic methods (based on sparse matrices), with acceptable performance loss compared to specialized custom algorithms tuned to the problems.

2.2 Forward-adjoint oracles

2.2.1 Definition

A general linear function $f : \mathbf{R}^n \to \mathbf{R}^m$ can be represented on a computer as a dense matrix $A \in \mathbf{R}^{m \times n}$ using $O(mn)$ bytes. We can evaluate $f(x)$ on an input $x \in \mathbf{R}^n$ in $O(mn)$ flops by computing the matrix-vector multiplication $Ax$. We can likewise evaluate the adjoint $f^*(y) = A^T y$ on an input $y \in \mathbf{R}^m$ in $O(mn)$ flops by computing $A^T y$.

Many linear functions arising in applications have structure that allows the function and its adjoint to be evaluated in fewer than $O(mn)$ flops or using fewer than $O(mn)$ bytes of data. The algorithms and data structures used to evaluate such a function and its adjoint can differ wildly. It is thus useful to abstract away the details and view linear functions as forward-adjoint oracles (FAOs), i.e., a tuple $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f$ is a linear function, $\Phi_f$ is an algorithm for evaluating $f$, and $\Phi_{f^*}$ is an algorithm for evaluating $f^*$. We use $n$ to denote the size of $f$'s input and $m$ to denote the size of $f$'s output.
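
As an illustration (our own, not the chapter's implementation), an FAO can be represented in Python as a small record holding the two evaluation algorithms:

from collections import namedtuple
import numpy as np

# A linear function together with algorithms for evaluating it and its adjoint.
FAO = namedtuple("FAO", ["forward", "adjoint", "input_size", "output_size"])

# Example: scalar multiplication f(x) = alpha x, whose adjoint is the same map.
alpha, n = 2.5, 4
scale = FAO(forward=lambda x: alpha * x,
            adjoint=lambda y: alpha * y,
            input_size=n, output_size=n)

# The defining property of the adjoint: <f(x), y> = <x, f*(y)>.
x, y = np.random.randn(n), np.random.randn(n)
assert np.isclose(scale.forward(x) @ y, x @ scale.adjoint(y))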

While we focus on linear functions from $\mathbf{R}^n$ into $\mathbf{R}^m$, the same techniques can be used to handle linear functions involving complex arguments or values, i.e., from $\mathbf{C}^n$ into $\mathbf{C}^m$, from $\mathbf{R}^n$ into $\mathbf{C}^m$, or from $\mathbf{C}^n$ into $\mathbf{R}^m$, using the standard embedding of complex $n$-vectors into real $2n$-vectors. This is useful for problems in which complex data arise naturally (e.g., in signal processing and communications), and also in some cases that involve only real data, where complex intermediate results appear (typically via an FFT).

2.2.2 Vector mappings

We present a variety of FAOs for functions that take as argument, and return, vectors.

Scalar multiplication. Scalar multiplication by $\alpha \in \mathbf{R}$ is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f : \mathbf{R}^n \to \mathbf{R}^n$ is given by $f(x) = \alpha x$. The adjoint $f^*$ is the same as $f$. The algorithms $\Phi_f$ and $\Phi_{f^*}$ simply scale the input, which requires $O(m+n)$ flops and $O(1)$ bytes of data to store $\alpha$. Here $m = n$.

Multiplication by a dense matrix. Multiplication by a dense matrix $A \in \mathbf{R}^{m \times n}$ is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f(x) = Ax$. The adjoint $f^*(u) = A^T u$ is also multiplication by a dense matrix. The algorithms $\Phi_f$ and $\Phi_{f^*}$ are the standard dense matrix multiplication algorithm. Evaluating $\Phi_f$ and $\Phi_{f^*}$ requires $O(mn)$ flops and $O(mn)$ bytes of data to store $A$ and $A^T$.

Multiplication by a sparse matrix. Multiplication by a sparse matrix $A \in \mathbf{R}^{m \times n}$, i.e., a matrix with many zero entries, is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f(x) = Ax$. The adjoint $f^*(u) = A^T u$ is also multiplication by a sparse matrix. The algorithms $\Phi_f$ and $\Phi_{f^*}$ are the standard algorithm for multiplying by a sparse matrix in (for example) compressed sparse row format. Evaluating $\Phi_f$ and $\Phi_{f^*}$ requires $O(\mathbf{nnz}(A))$ flops and $O(\mathbf{nnz}(A))$ bytes of data to store $A$ and $A^T$, where $\mathbf{nnz}$ is the number of nonzero elements in a sparse matrix [60, Chap. 2].

Multiplication by a low-rank matrix. Multiplication by a matrix $A \in \mathbf{R}^{m \times n}$ with rank $k$, where $k \ll m$ and $k \ll n$, is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f(x) = Ax$. The matrix $A$ can be factored as $A = BC$, where $B \in \mathbf{R}^{m \times k}$ and $C \in \mathbf{R}^{k \times n}$. The adjoint $f^*(u) = C^T B^T u$ is also multiplication by a rank $k$ matrix. The algorithm $\Phi_f$ evaluates $f(x)$ by first evaluating $z = Cx$ and then evaluating $f(x) = Bz$. Similarly, $\Phi_{f^*}$ multiplies by $B^T$ and then $C^T$. The algorithms $\Phi_f$ and $\Phi_{f^*}$ require $O(k(m+n))$ flops and use $O(k(m+n))$ bytes of data to store $B$ and $C$ and their transposes. Multiplication by a low-rank matrix occurs in many applications, and it is often possible to approximate multiplication by a full rank matrix with multiplication by a low-rank one, using the singular value decomposition or methods such as sketching [154].
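
A sketch of this FAO using SciPy's LinearOperator, whose matvec and rmatvec arguments play the roles of $\Phi_f$ and $\Phi_{f^*}$ (the random factors here are placeholders):

import numpy as np
from scipy.sparse.linalg import LinearOperator

m, n, k = 2000, 3000, 20
B = np.random.randn(m, k)   # A = B C with rank k << min(m, n)
C = np.random.randn(k, n)

A_op = LinearOperator((m, n),
                      matvec=lambda x: B @ (C @ x),        # f(x) = Ax in O(k(m+n)) flops
                      rmatvec=lambda u: C.T @ (B.T @ u))   # f*(u) = C^T B^T u

x, u = np.random.randn(n), np.random.randn(m)
assert np.isclose(A_op.matvec(x) @ u, x @ A_op.rmatvec(u))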


Discrete Fourier transform. The discrete Fourier transform (DFT) is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f : \mathbf{R}^{2p} \to \mathbf{R}^{2p}$ is given by

\[
\begin{array}{rcl}
f(x)_k &=& \dfrac{1}{\sqrt{p}} \displaystyle\sum_{j=1}^{p} \Re\left(\omega_p^{(j-1)(k-1)}\right) x_j - \Im\left(\omega_p^{(j-1)(k-1)}\right) x_{j+p} \\[1em]
f(x)_{k+p} &=& \dfrac{1}{\sqrt{p}} \displaystyle\sum_{j=1}^{p} \Im\left(\omega_p^{(j-1)(k-1)}\right) x_j + \Re\left(\omega_p^{(j-1)(k-1)}\right) x_{j+p}
\end{array}
\]

for $k = 1, \ldots, p$. Here $\omega_p = e^{-2\pi i/p}$. The adjoint $f^*$ is the inverse DFT. The algorithm $\Phi_f$ is the fast Fourier transform (FFT), while $\Phi_{f^*}$ is the inverse FFT. The algorithms can be evaluated in $O((m+n)\log(m+n))$ flops, using only $O(1)$ bytes of data to store the dimensions of $f$'s input and output [52, 158]. Here $m = n = 2p$. There are many fast transforms derived from the DFT, such as the discrete Hartley transform [30] and the discrete sine and cosine transforms [3, 163], with the same computational complexity as the FFT.

Convolution. Convolution with a kernel $c \in \mathbf{R}^p$ is defined as $f : \mathbf{R}^n \to \mathbf{R}^m$, where

\[
f(x)_k = \sum_{i+j=k+1} c_i x_j, \quad k = 1, \ldots, m. \qquad (2.3)
\]

Different variants of convolution restrict the indices $i, j$ to different ranges, or interpret vector elements outside their natural ranges as zero or using periodic (circular) indexing.

Standard (column) convolution takes $m = n + p - 1$, and defines $c_i$ and $x_j$ in (2.3) as zero when the index is outside its range. In this case the associated matrix $\mathrm{Col}(c) \in \mathbf{R}^{(n+p-1) \times n}$ is Toeplitz, with each column a shifted version of $c$:

\[
\mathrm{Col}(c) =
\begin{bmatrix}
c_1    &        &        \\
c_2    & \ddots &        \\
\vdots & \ddots & c_1    \\
c_p    &        & c_2    \\
       & \ddots & \vdots \\
       &        & c_p
\end{bmatrix}.
\]

Another standard form, row convolution, restricts the indices in (2.3) to the range $k = p, \ldots, n$. For simplicity we assume that $n \geq p$. In this case the associated matrix $\mathrm{Row}(c) \in \mathbf{R}^{(n-p+1) \times n}$ is Toeplitz, with each row a shifted version of $c$, in reverse order:

\[
\mathrm{Row}(c) =
\begin{bmatrix}
c_p & c_{p-1} & \cdots & c_1     &        &     \\
    & \ddots  & \ddots &         & \ddots &     \\
    &         & c_p    & c_{p-1} & \cdots & c_1
\end{bmatrix}.
\]


The matrices $\mathrm{Col}(c)$ and $\mathrm{Row}(c)$ are related by the equalities

\[
\mathrm{Col}(c)^T = \mathrm{Row}(\mathrm{rev}(c)), \qquad \mathrm{Row}(c)^T = \mathrm{Col}(\mathrm{rev}(c)),
\]

where $\mathrm{rev}(c)_k = c_{p-k+1}$ reverses the order of the entries of $c$.

Yet another variant on convolution is circular convolution, where we take $p = n$ and interpret the entries of vectors outside their range modulo $n$. In this case the associated matrix $\mathrm{Circ}(c) \in \mathbf{R}^{n \times n}$ is Toeplitz, with each column and row a (circularly) shifted version of $c$:

\[
\mathrm{Circ}(c) =
\begin{bmatrix}
c_1    & c_n    & c_{n-1} & \cdots & \cdots & c_2     \\
c_2    & c_1    & c_n     & \ddots &        & \vdots  \\
c_3    & c_2    & \ddots  & \ddots & \ddots & \vdots  \\
\vdots &        & \ddots  & \ddots & c_n    & c_{n-1} \\
\vdots &        &         & c_2    & c_1    & c_n     \\
c_n    & \cdots & \cdots  & c_3    & c_2    & c_1
\end{bmatrix}.
\]

Column convolution with $c \in \mathbf{R}^p$ is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f : \mathbf{R}^n \to \mathbf{R}^{n+p-1}$ is given by $f(x) = \mathrm{Col}(c)x$. The adjoint $f^*$ is row convolution with $\mathrm{rev}(c)$, i.e., $f^*(u) = \mathrm{Row}(\mathrm{rev}(c))u$. The algorithms $\Phi_f$ and $\Phi_{f^*}$ are given in algorithms 1 and 2, and require $O((m+n+p)\log(m+n+p))$ flops. Here $m = n + p - 1$. If the kernel is small (i.e., $p \ll n$), $\Phi_f$ and $\Phi_{f^*}$ instead evaluate (2.3) directly in $O(np)$ flops. In either case, the algorithms $\Phi_f$ and $\Phi_{f^*}$ use $O(p)$ bytes of data to store $c$ and $\mathrm{rev}(c)$ [51, 158].

Algorithm 1 Column convolution $c * x$.

Input: $c \in \mathbf{R}^p$ is a length $p$ array. $x \in \mathbf{R}^n$ is a length $n$ array. $y \in \mathbf{R}^{n+p-1}$ is a length $n+p-1$ array.

  Extend $c$ and $x$ into length $n+p-1$ arrays by appending zeros.
  $c \leftarrow$ FFT of $c$.
  $x \leftarrow$ FFT of $x$.
  for $i = 1, \ldots, n+p-1$ do
    $y_i \leftarrow c_i x_i$.
  $y \leftarrow$ inverse FFT of $y$.

Output: $y = c * x$.



Algorithm 2 Row convolution $c * u$.

Input: $c \in \mathbf{R}^p$ is a length $p$ array. $u \in \mathbf{R}^{n+p-1}$ is a length $n+p-1$ array. $v \in \mathbf{R}^n$ is a length $n$ array.

  Extend $\mathrm{rev}(c)$ and $v$ into length $n+p-1$ arrays by appending zeros.
  $c \leftarrow$ inverse FFT of zero-padded $\mathrm{rev}(c)$.
  $u \leftarrow$ FFT of $u$.
  for $i = 1, \ldots, n+p-1$ do
    $v_i \leftarrow c_i u_i$.
  $v \leftarrow$ inverse FFT of $v$.
  Reduce $v$ to a length $n$ array by removing the last $p-1$ entries.

Output: $v = c * u$.

Algorithm 3 Circular convolution $c * x$.

Input: $c \in \mathbf{R}^n$ is a length $n$ array. $x \in \mathbf{R}^n$ is a length $n$ array. $y \in \mathbf{R}^n$ is a length $n$ array.

  $c \leftarrow$ FFT of $c$.
  $x \leftarrow$ FFT of $x$.
  for $i = 1, \ldots, n$ do
    $y_i \leftarrow c_i x_i$.
  $y \leftarrow$ inverse FFT of $y$.

Output: $y = c * x$.
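
A NumPy sketch in the spirit of algorithms 1 and 2, evaluating column convolution and its adjoint with FFTs (the function names are ours):

import numpy as np

def col_conv(c, x):
    # Algorithm 1: y = Col(c) x via length-(n+p-1) FFTs.
    m = c.size + x.size - 1
    return np.real(np.fft.ifft(np.fft.fft(c, m) * np.fft.fft(x, m)))

def row_conv_rev(c, u):
    # Algorithm 2 (adjoint): Row(rev(c)) u, i.e., correlation with c.
    n = u.size - c.size + 1
    spectrum = np.conj(np.fft.fft(c, u.size)) * np.fft.fft(u)
    return np.real(np.fft.ifft(spectrum))[:n]

# Adjoint identity <Col(c) x, u> = <x, Row(rev(c)) u>.
c, x, u = np.random.randn(5), np.random.randn(8), np.random.randn(12)
assert np.isclose(col_conv(c, x) @ u, x @ row_conv_rev(c, u))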


Circular convolution with $c \in \mathbf{R}^n$ is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f : \mathbf{R}^n \to \mathbf{R}^n$ is given by $f(x) = \mathrm{Circ}(c)x$. The adjoint $f^*$ is circular convolution with

\[
\bar{c} =
\begin{bmatrix}
c_1 \\ c_n \\ c_{n-1} \\ \vdots \\ c_2
\end{bmatrix}.
\]

The algorithms $\Phi_f$ and $\Phi_{f^*}$ are given in algorithm 3, and require $O((m+n)\log(m+n))$ flops. The algorithms $\Phi_f$ and $\Phi_{f^*}$ use $O(m+n)$ bytes of data to store $c$ and $\bar{c}$ [51, 158]. Here $m = n$.

Discrete wavelet transform. The discrete wavelet transform (DWT) for orthogonal wavelets is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where the function $f : \mathbf{R}^{2^p} \to \mathbf{R}^{2^p}$ is given by

\[
f(x) =
\begin{bmatrix}
D_1 G_1 & \\
D_1 H_1 & \\
& I_{2^p - 2}
\end{bmatrix}
\cdots
\begin{bmatrix}
D_{p-1} G_{p-1} & \\
D_{p-1} H_{p-1} & \\
& I_{2^{p-1}}
\end{bmatrix}
\begin{bmatrix}
D_p G_p \\
D_p H_p
\end{bmatrix}
x, \qquad (2.4)
\]

where $D_k \in \mathbf{R}^{2^{k-1} \times 2^k}$ is defined such that $(D_k x)_i = x_{2i}$ and the matrices $G_k \in \mathbf{R}^{2^k \times 2^k}$ and $H_k \in \mathbf{R}^{2^k \times 2^k}$ are given by

\[
G_k = \mathrm{Circ}\left(\begin{bmatrix} g \\ 0 \end{bmatrix}\right), \qquad
H_k = \mathrm{Circ}\left(\begin{bmatrix} h \\ 0 \end{bmatrix}\right).
\]

Here $g \in \mathbf{R}^q$ and $h \in \mathbf{R}^q$ are low and high pass filters, respectively, that parameterize the DWT. The adjoint $f^*$ is the inverse DWT. The algorithms $\Phi_f$ and $\Phi_{f^*}$ repeatedly convolve by $g$ and $h$, which requires $O(q(m+n))$ flops and uses $O(q)$ bytes to store $h$ and $g$ [162]. Here $m = n = 2^p$. Common orthogonal wavelets include the Haar wavelet and the Daubechies wavelets [57, 58]. There are many variants on the particular DWT described here. For instance, the product in (2.4) can be terminated after fewer than $p-1$ multiplications by $G_k$ and $H_k$ [133], $G_k$ and $H_k$ can be defined as a different type of convolution matrix, or the filters $g$ and $h$ can be different lengths, as in biorthogonal wavelets [47].
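
A sketch of a single level of an orthogonal DWT and its adjoint using the PyWavelets package (this illustration is ours; with periodization boundary handling the one-level transform is orthogonal, so its adjoint is the inverse DWT):

import numpy as np
import pywt

wavelet = "db2"   # an orthogonal Daubechies wavelet

def dwt(x):
    # One level of the DWT: stack the low-pass and high-pass coefficients.
    cA, cD = pywt.dwt(x, wavelet, mode="periodization")
    return np.concatenate([cA, cD])

def dwt_adjoint(y):
    # For an orthogonal transform the adjoint equals the inverse DWT.
    cA, cD = np.split(y, 2)
    return pywt.idwt(cA, cD, wavelet, mode="periodization")

x, y = np.random.randn(2 ** 8), np.random.randn(2 ** 8)
assert np.isclose(dwt(x) @ y, x @ dwt_adjoint(y))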

Discrete Gauss transform. The discrete Gauss transform (DGT) is represented by the FAO $\Gamma = (f_{Y,Z,h}, \Phi_f, \Phi_{f^*})$, where the function $f_{Y,Z,h} : \mathbf{R}^n \to \mathbf{R}^m$ is parameterized by $Y \in \mathbf{R}^{m \times d}$, $Z \in \mathbf{R}^{n \times d}$, and $h > 0$. The function $f_{Y,Z,h}$ is given by

\[
f_{Y,Z,h}(x)_i = \sum_{j=1}^{n} \exp\left(-\|y_i - z_j\|^2 / h^2\right) x_j, \quad i = 1, \ldots, m,
\]

where $y_i \in \mathbf{R}^d$ is the $i$th column of $Y$ and $z_j \in \mathbf{R}^d$ is the $j$th column of $Z$. The adjoint of $f_{Y,Z,h}$ is the DGT $f_{Z,Y,h}$. The algorithms $\Phi_f$ and $\Phi_{f^*}$ are the improved fast Gauss transform, which evaluates $f(x)$ and $f^*(u)$ to a given accuracy in $O(dp(m+n))$ flops. Here $p$ is a parameter that depends on the accuracy desired. The algorithms $\Phi_f$ and $\Phi_{f^*}$ use $O(d(m+n))$ bytes of data to store $Y$, $Z$, and $h$ [230]. An interesting application of the DGT is efficient multiplication by a Gaussian kernel [229].

Multiplication by the inverse of a sparse triangular matrix. Multiplication by the inverse of a sparse lower triangular matrix $L \in \mathbf{R}^{n \times n}$ with nonzero elements on its diagonal is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f(x) = L^{-1}x$. The adjoint $f^*(u) = (L^T)^{-1}u$ is multiplication by the inverse of a sparse upper triangular matrix. The algorithms $\Phi_f$ and $\Phi_{f^*}$ are forward and backward substitution, respectively, which require $O(\mathbf{nnz}(L))$ flops and use $O(\mathbf{nnz}(L))$ bytes of data to store $L$ and $L^T$ [60, Chap. 3].
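
A SciPy sketch of this FAO (the construction of L here is a placeholder; spsolve_triangular carries out the forward and backward substitutions):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve_triangular

n = 500
# A sparse lower triangular L with nonzero diagonal (placeholder data).
L = (sp.tril(sp.random(n, n, density=0.01)) + sp.identity(n)).tocsr()
Lt = L.T.tocsr()

forward = lambda x: spsolve_triangular(L, x, lower=True)    # f(x) = L^{-1} x
adjoint = lambda u: spsolve_triangular(Lt, u, lower=False)  # f*(u) = (L^T)^{-1} u

x, u = np.random.randn(n), np.random.randn(n)
assert np.isclose(forward(x) @ u, x @ adjoint(u))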

Multiplication by a pseudo-random matrix. Multiplication by a matrix $A \in \mathbf{R}^{m \times n}$ whose columns are given by a pseudo-random sequence (i.e., the first $m$ values of the sequence are the first column of $A$, the next $m$ values are the second column of $A$, etc.) is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f(x) = Ax$. The adjoint $f^*(u) = A^T u$ is multiplication by a matrix whose rows are given by a pseudo-random sequence (i.e., the first $m$ values of the sequence are the first row of $A^T$, the next $m$ values are the second row of $A^T$, etc.). The algorithms $\Phi_f$ and $\Phi_{f^*}$ are the standard dense matrix multiplication algorithm, iterating once over the pseudo-random sequence without storing any of its values. The algorithms require $O(mn)$ flops and use $O(1)$ bytes of data to store the seed for the pseudo-random sequence. Multiplication by a pseudo-random matrix might appear, for example, as a measurement ensemble in compressed sensing [91].

Multiplication by the pseudo-inverse of a graph Laplacian. Multiplication by the pseudo-inverse of a graph Laplacian matrix $L \in \mathbf{R}^{n \times n}$ is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f(x) = L^{\dagger}x$. A graph Laplacian is a symmetric matrix with nonpositive off-diagonal entries and the property $L\mathbf{1} = 0$, i.e., the diagonal entry in a row is the negative sum of the off-diagonal entries in that row. (This implies that it is positive semidefinite.) The adjoint $f^*$ is the same as $f$, since $L = L^T$. The algorithms $\Phi_f$ and $\Phi_{f^*}$ are one of the fast solvers for graph Laplacian systems that evaluate $f(x) = f^*(x)$ to a given accuracy in around $O(\mathbf{nnz}(L))$ flops [202, 139, 221]. (The details of the computational complexity are much more involved.) The algorithms use $O(\mathbf{nnz}(L))$ bytes of data to store $L$.


2.2.3 Matrix mappings

We now consider linear functions that take as argument, or return, matrices. We take the standard inner product on matrices $X, Y \in \mathbf{R}^{p \times q}$,

\[
\langle X, Y \rangle = \sum_{i=1,\ldots,p,\; j=1,\ldots,q} X_{ij} Y_{ij} = \mathbf{Tr}(X^T Y).
\]

The adjoint of a linear function $f : \mathbf{R}^{p \times q} \to \mathbf{R}^{s \times t}$ is then the function $f^* : \mathbf{R}^{s \times t} \to \mathbf{R}^{p \times q}$ for which

\[
\mathbf{Tr}(f(X)^T Y) = \mathbf{Tr}(X^T f^*(Y))
\]

holds for all $X \in \mathbf{R}^{p \times q}$ and $Y \in \mathbf{R}^{s \times t}$.

Vec and mat. The function $\mathrm{vec} : \mathbf{R}^{p \times q} \to \mathbf{R}^{pq}$ is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f(X)$ converts the matrix $X \in \mathbf{R}^{p \times q}$ into a vector $y \in \mathbf{R}^{pq}$ by stacking the columns. The adjoint $f^*$ is the function $\mathrm{mat} : \mathbf{R}^{pq} \to \mathbf{R}^{p \times q}$, which outputs a matrix whose columns are successive slices of its vector argument. The algorithms $\Phi_f$ and $\Phi_{f^*}$ simply reinterpret their input as a differently shaped output in $O(1)$ flops, using only $O(1)$ bytes of data to store the dimensions of $f$'s input and output.
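
In NumPy, stacking columns corresponds to Fortran-order flattening, and mat is the matching reshape; a brief sketch:

import numpy as np

p, q = 3, 4
X = np.arange(p * q, dtype=float).reshape(p, q)
vec = lambda X: X.flatten(order="F")          # stack the columns of X
mat = lambda y: y.reshape((p, q), order="F")  # adjoint: reinterpret as a p x q matrix
assert np.array_equal(mat(vec(X)), X)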

Sparse matrix mappings. Many common linear functions on and to matrices are given by a sparse matrix multiplication of the vectorized argument, reshaped as the output matrix. For $X \in \mathbf{R}^{p \times q}$ and $f(X) = Y \in \mathbf{R}^{s \times t}$,

\[
Y = \mathrm{mat}(A\,\mathrm{vec}(X)).
\]

The form above describes the general linear mapping from $\mathbf{R}^{p \times q}$ to $\mathbf{R}^{s \times t}$; we are interested in cases when $A$ is sparse, i.e., has far fewer than $pqst$ nonzero entries. Examples include extracting a submatrix, extracting the diagonal, forming a diagonal matrix, summing the rows or columns of a matrix, transposing a matrix, scaling its rows or columns, and so on. The FAO representation of each such function is $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f$ is given above and the adjoint is given by

\[
f^*(U) = \mathrm{mat}(A^T \mathrm{vec}(U)).
\]

The algorithms $\Phi_f$ and $\Phi_{f^*}$ are the standard algorithms for multiplying a vector by a sparse matrix in (for example) compressed sparse row format. The algorithms require $O(\mathbf{nnz}(A))$ flops and use $O(\mathbf{nnz}(A))$ bytes of data to store $A$ and $A^T$ [60, Chap. 2].

Matrix product. Multiplication on the left by a matrix $A \in \mathbf{R}^{s \times p}$ and on the right by a matrix $B \in \mathbf{R}^{q \times t}$ is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f : \mathbf{R}^{p \times q} \to \mathbf{R}^{s \times t}$ is given by $f(X) = AXB$. The adjoint $f^*(U) = A^T U B^T$ is also a matrix product. There are two ways to implement $\Phi_f$ efficiently, corresponding to different orders of operations in multiplying out $AXB$. In one method we multiply by $A$ first and $B$ second, for a total of $O(s(pq + qt))$ flops (assuming that $A$ and $B$ are dense). In the other method we multiply by $B$ first and $A$ second, for a total of $O(p(qt + st))$ flops. The former method is more efficient if

\[
\frac{1}{t} + \frac{1}{p} < \frac{1}{s} + \frac{1}{q}.
\]

Similarly, there are two ways to implement $\Phi_{f^*}$, one requiring $O(s(pq + qt))$ flops and the other requiring $O(p(qt + st))$ flops. The algorithms $\Phi_f$ and $\Phi_{f^*}$ use $O(sp + qt)$ bytes of data to store $A$ and $B$ and their transposes. When $p = q = s = t$, the flop count for $\Phi_f$ and $\Phi_{f^*}$ simplifies to $O\big((m+n)^{1.5}\big)$ flops. Here $m = n = pq$. (When the matrices $A$ or $B$ are sparse, evaluating $f(X)$ and $f^*(U)$ can be done even more efficiently.) The matrix product function is used in Lyapunov and algebraic Riccati inequalities and Sylvester equations, which appear in many problems from control theory [89, 219].
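
A sketch of $\Phi_f$ and $\Phi_{f^*}$ that picks the cheaper multiplication order using the flop counts above (the function names are ours):

import numpy as np

def matrix_product_fao(A, B):
    # FAO for f(X) = A X B and its adjoint f*(U) = A^T U B^T.
    s, p = A.shape
    q, t = B.shape

    def forward(X):
        # Multiply by A first when s(pq + qt) < p(qt + st).
        if s * (p * q + q * t) < p * (q * t + s * t):
            return (A @ X) @ B
        return A @ (X @ B)

    def adjoint(U):
        # Multiply by A^T first when p(qt + st) < s(pq + qt).
        if p * (q * t + s * t) < s * (p * q + q * t):
            return (A.T @ U) @ B.T
        return A.T @ (U @ B.T)

    return forward, adjoint

# Quick check of the adjoint identity Tr(f(X)^T U) = Tr(X^T f*(U)).
A, B = np.random.randn(6, 4), np.random.randn(5, 7)
X, U = np.random.randn(4, 5), np.random.randn(6, 7)
f, fadj = matrix_product_fao(A, B)
assert np.isclose(np.sum(f(X) * U), np.sum(X * fadj(U)))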

2-D discrete Fourier transform. The 2-D DFT is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f : \mathbf{R}^{2p \times q} \to \mathbf{R}^{2p \times q}$ is given by

\[
\begin{array}{rcl}
f(X)_{k\ell} &=& \dfrac{1}{\sqrt{pq}} \displaystyle\sum_{s=1}^{p} \sum_{t=1}^{q} \Re\left(\omega_p^{(s-1)(k-1)} \omega_q^{(t-1)(\ell-1)}\right) X_{st} - \Im\left(\omega_p^{(s-1)(k-1)} \omega_q^{(t-1)(\ell-1)}\right) X_{s+p,t} \\[1em]
f(X)_{k+p,\ell} &=& \dfrac{1}{\sqrt{pq}} \displaystyle\sum_{s=1}^{p} \sum_{t=1}^{q} \Im\left(\omega_p^{(s-1)(k-1)} \omega_q^{(t-1)(\ell-1)}\right) X_{st} + \Re\left(\omega_p^{(s-1)(k-1)} \omega_q^{(t-1)(\ell-1)}\right) X_{s+p,t},
\end{array}
\]

for $k = 1, \ldots, p$ and $\ell = 1, \ldots, q$. Here $\omega_p = e^{-2\pi i/p}$ and $\omega_q = e^{-2\pi i/q}$. The adjoint $f^*$ is the inverse 2-D DFT. The algorithm $\Phi_f$ evaluates $f(X)$ by first applying the FFT to each row of $X$, replacing the row with its DFT, and then applying the FFT to each column, replacing the column with its DFT. The algorithm $\Phi_{f^*}$ is analogous, but with the inverse FFT and inverse DFT taking the role of the FFT and DFT. The algorithms $\Phi_f$ and $\Phi_{f^*}$ require $O((m+n)\log(m+n))$ flops, using only $O(1)$ bytes of data to store the dimensions of $f$'s input and output [155, 158]. Here $m = n = 2pq$.
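
The row-then-column evaluation can be checked against NumPy's 2-D FFT directly on complex data (the real embedding above is omitted in this quick check):

import numpy as np

X = np.random.randn(16, 8) + 1j * np.random.randn(16, 8)
row_then_col = np.fft.fft(np.fft.fft(X, axis=1), axis=0)  # FFT of each row, then of each column
assert np.allclose(row_then_col, np.fft.fft2(X))          # matches the 2-D DFT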

2-D convolution. 2-D convolution with a kernel $C \in \mathbf{R}^{p \times q}$ is defined as $f : \mathbf{R}^{s \times t} \to \mathbf{R}^{m_1 \times m_2}$, where

\[
f(X)_{k\ell} = \sum_{i_1+i_2=k+1,\; j_1+j_2=\ell+1} C_{i_1 j_1} X_{i_2 j_2}, \quad k = 1, \ldots, m_1, \; \ell = 1, \ldots, m_2. \qquad (2.5)
\]

Different variants of 2-D convolution restrict the indices $i_1, j_1$ and $i_2, j_2$ to different ranges, or interpret matrix elements outside their natural ranges as zero or using periodic (circular) indexing. There are 2-D analogues of 1-D column, row, and circular convolution.

Standard 2-D (column) convolution, the analogue of 1-D column convolution, takes $m_1 = s+p-1$ and $m_2 = t+q-1$, and defines $C_{i_1 j_1}$ and $X_{i_2 j_2}$ in (2.5) as zero when the indices are outside their range. We can represent the 2-D column convolution $Y = C * X$ as the matrix multiplication

\[
Y = \mathrm{mat}(\mathrm{Col}(C)\,\mathrm{vec}(X)),
\]

where $\mathrm{Col}(C) \in \mathbf{R}^{(s+p-1)(t+q-1) \times st}$ is given by:

\[
\mathrm{Col}(C) =
\begin{bmatrix}
\mathrm{Col}(c_1) &        &                   \\
\mathrm{Col}(c_2) & \ddots &                   \\
\vdots            & \ddots & \mathrm{Col}(c_1) \\
\mathrm{Col}(c_q) &        & \mathrm{Col}(c_2) \\
                  & \ddots & \vdots            \\
                  &        & \mathrm{Col}(c_q)
\end{bmatrix}.
\]

Here $c_1, \ldots, c_q \in \mathbf{R}^p$ are the columns of $C$ and $\mathrm{Col}(c_1), \ldots, \mathrm{Col}(c_q) \in \mathbf{R}^{(s+p-1) \times s}$ are 1-D column convolution matrices.

The 2-D analogue of 1-D row convolution restricts the indices in (2.5) to the range $k = p, \ldots, s$ and $\ell = q, \ldots, t$. For simplicity we assume $s \geq p$ and $t \geq q$. The output dimensions are $m_1 = s-p+1$ and $m_2 = t-q+1$. We can represent the 2-D row convolution $Y = C * X$ as the matrix multiplication

\[
Y = \mathrm{mat}(\mathrm{Row}(C)\,\mathrm{vec}(X)),
\]

where $\mathrm{Row}(C) \in \mathbf{R}^{(s-p+1)(t-q+1) \times st}$ is given by:

\[
\mathrm{Row}(C) =
\begin{bmatrix}
\mathrm{Row}(c_q) & \mathrm{Row}(c_{q-1}) & \cdots & \mathrm{Row}(c_1)     &        &                   \\
                  & \ddots                & \ddots &                       & \ddots &                   \\
                  &                       & \mathrm{Row}(c_q) & \mathrm{Row}(c_{q-1}) & \cdots & \mathrm{Row}(c_1)
\end{bmatrix}.
\]

Here $\mathrm{Row}(c_1), \ldots, \mathrm{Row}(c_q) \in \mathbf{R}^{(s-p+1) \times s}$ are 1-D row convolution matrices. The matrices $\mathrm{Col}(C)$ and $\mathrm{Row}(C)$ are related by the equalities

\[
\mathrm{Col}(C)^T = \mathrm{Row}(\mathrm{rev}(C)), \qquad \mathrm{Row}(C)^T = \mathrm{Col}(\mathrm{rev}(C)),
\]

where $\mathrm{rev}(C)_{k\ell} = C_{p-k+1,\,q-\ell+1}$ reverses the order of the columns of $C$ and of the entries in each row.

In the 2-D analogue of 1-D circular convolution, we take $p = s$ and $q = t$ and interpret the entries of matrices outside their range modulo $s$ for the row index and modulo $t$ for the column index. We can represent the 2-D circular convolution $Y = C \ast X$ as the matrix multiplication
\[
Y = \mathbf{mat}(\mathrm{Circ}(C)\mathbf{vec}(X)),
\]
where $\mathrm{Circ}(C) \in \mathbf{R}^{st\times st}$ is given by
\[
\mathrm{Circ}(C) = \begin{bmatrix}
\mathrm{Circ}(c_1) & \mathrm{Circ}(c_t) & \mathrm{Circ}(c_{t-1}) & \cdots & \cdots & \mathrm{Circ}(c_2) \\
\mathrm{Circ}(c_2) & \mathrm{Circ}(c_1) & \mathrm{Circ}(c_t) & \ddots & & \vdots \\
\mathrm{Circ}(c_3) & \mathrm{Circ}(c_2) & \ddots & \ddots & \ddots & \vdots \\
\vdots & & \ddots & \ddots & \mathrm{Circ}(c_t) & \mathrm{Circ}(c_{t-1}) \\
\vdots & & & \ddots & \mathrm{Circ}(c_1) & \mathrm{Circ}(c_t) \\
\mathrm{Circ}(c_t) & \cdots & \cdots & \mathrm{Circ}(c_3) & \mathrm{Circ}(c_2) & \mathrm{Circ}(c_1)
\end{bmatrix}.
\]
Here $\mathrm{Circ}(c_1), \ldots, \mathrm{Circ}(c_t) \in \mathbf{R}^{s\times s}$ are 1-D circular convolution matrices.

2-D column convolution with $C \in \mathbf{R}^{p\times q}$ is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f : \mathbf{R}^{s\times t} \to \mathbf{R}^{(s+p-1)\times(t+q-1)}$ is given by
\[
f(X) = \mathbf{mat}(\mathrm{Col}(C)\mathbf{vec}(X)).
\]
The adjoint $f^*$ is 2-D row convolution with $\mathbf{rev}(C)$, i.e.,
\[
f^*(U) = \mathbf{mat}(\mathrm{Row}(\mathbf{rev}(C))\mathbf{vec}(U)).
\]
The algorithms $\Phi_f$ and $\Phi_{f^*}$ are given in algorithms 4 and 5, and require $O((m + n)\log(m + n))$ flops. Here $m = (s + p - 1)(t + q - 1)$ and $n = st$. If the kernel is small (i.e., $p \ll s$ and $q \ll t$), $\Phi_f$ and $\Phi_{f^*}$ instead evaluate (2.5) directly in $O(pqst)$ flops. In either case, the algorithms $\Phi_f$ and $\Phi_{f^*}$ use $O(pq)$ bytes of data to store $C$ and $\mathbf{rev}(C)$ [158, Chap. 4]. Often the kernel is parameterized (e.g., a Gaussian kernel), in which case more compact representations of $C$ and $\mathbf{rev}(C)$ are possible [80, Chap. 7].

2-D circular convolution with $C \in \mathbf{R}^{s\times t}$ is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f : \mathbf{R}^{s\times t} \to \mathbf{R}^{s\times t}$ is given by
\[
f(X) = \mathbf{mat}(\mathrm{Circ}(C)\mathbf{vec}(X)).
\]

Algorithm 4 2-D column convolution $C \ast X$.
Input: $C \in \mathbf{R}^{p\times q}$ is a length $pq$ array. $X \in \mathbf{R}^{s\times t}$ is a length $st$ array. $Y \in \mathbf{R}^{(s+p-1)\times(t+q-1)}$ is a length $(s+p-1)(t+q-1)$ array.
  Extend the columns and rows of $C$ and $X$ with zeros so $C, X \in \mathbf{R}^{(s+p-1)\times(t+q-1)}$.
  $C \leftarrow$ 2-D DFT of $C$.
  $X \leftarrow$ 2-D DFT of $X$.
  for $i = 1, \ldots, s+p-1$ do
    for $j = 1, \ldots, t+q-1$ do
      $Y_{ij} \leftarrow C_{ij}X_{ij}$.
  $Y \leftarrow$ inverse 2-D DFT of $Y$.
Output: $Y = C \ast X$.

Algorithm 5 2-D row convolution $C \ast U$.
Input: $C \in \mathbf{R}^{p\times q}$ is a length $pq$ array. $U \in \mathbf{R}^{(s+p-1)\times(t+q-1)}$ is a length $(s+p-1)(t+q-1)$ array. $V \in \mathbf{R}^{s\times t}$ is a length $st$ array.
  Extend the columns and rows of $\mathbf{rev}(C)$ and $V$ with zeros so $\mathbf{rev}(C), V \in \mathbf{R}^{(s+p-1)\times(t+q-1)}$.
  $C \leftarrow$ inverse 2-D DFT of zero-padded $\mathbf{rev}(C)$.
  $U \leftarrow$ 2-D DFT of $U$.
  for $i = 1, \ldots, s+p-1$ do
    for $j = 1, \ldots, t+q-1$ do
      $V_{ij} \leftarrow C_{ij}U_{ij}$.
  $V \leftarrow$ inverse 2-D DFT of $V$.
  Truncate the rows and columns of $V$ so that $V \in \mathbf{R}^{s\times t}$.
Output: $V = C \ast U$.

Algorithm 6 2-D circular convolution $C \ast X$.
Input: $C \in \mathbf{R}^{s\times t}$ is a length $st$ array. $X \in \mathbf{R}^{s\times t}$ is a length $st$ array. $Y \in \mathbf{R}^{s\times t}$ is a length $st$ array.
  $C \leftarrow$ 2-D DFT of $C$.
  $X \leftarrow$ 2-D DFT of $X$.
  for $i = 1, \ldots, s$ do
    for $j = 1, \ldots, t$ do
      $Y_{ij} \leftarrow C_{ij}X_{ij}$.
  $Y \leftarrow$ inverse 2-D DFT of $Y$.
Output: $Y = C \ast X$.
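A compact numpy sketch in the spirit of algorithms 4 and 5; the function names are ours, and the sketch uses a standard zero-padded-FFT formulation, so its DFT and normalization conventions may differ slightly from the pseudocode above.

import numpy as np

def conv2_col(C, X):
    # Algorithm 4 style: 2-D column (full) convolution C * X via zero-padded FFTs.
    p, q = C.shape
    s, t = X.shape
    m1, m2 = s + p - 1, t + q - 1
    Cf = np.fft.fft2(C, (m1, m2))
    Xf = np.fft.fft2(X, (m1, m2))
    return np.real(np.fft.ifft2(Cf * Xf))

def conv2_col_adjoint(C, U):
    # Algorithm 5 style: the adjoint, 2-D row convolution of U with rev(C).
    p, q = C.shape
    m1, m2 = U.shape
    s, t = m1 - p + 1, m2 - q + 1
    Cf = np.fft.fft2(C[::-1, ::-1], (m1, m2))    # rev(C), zero padded
    V = np.real(np.fft.ifft2(Cf * np.fft.fft2(U)))
    return V[p - 1:p - 1 + s, q - 1:q - 1 + t]   # truncate to s-by-t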

The adjoint $f^*$ is 2-D circular convolution with
\[
\bar{C} = \begin{bmatrix}
C_{1,1} & C_{1,t} & C_{1,t-1} & \cdots & C_{1,2} \\
C_{s,1} & C_{s,t} & C_{s,t-1} & \cdots & C_{s,2} \\
C_{s-1,1} & C_{s-1,t} & C_{s-1,t-1} & \cdots & C_{s-1,2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
C_{2,1} & C_{2,t} & C_{2,t-1} & \cdots & C_{2,2}
\end{bmatrix}.
\]
The algorithms $\Phi_f$ and $\Phi_{f^*}$ are given in algorithm 6, and require $O((m + n)\log(m + n))$ flops. The algorithms $\Phi_f$ and $\Phi_{f^*}$ use $O(m + n)$ bytes of data to store $C$ and $\bar{C}$ [158, Chap. 4]. Here $m = n = st$.

2-D discrete wavelet transform. The 2-D DWT for separable, orthogonal wavelets is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f : \mathbf{R}^{2^p\times 2^p} \to \mathbf{R}^{2^p\times 2^p}$ is given by
\[
f(X)_{ij} = \left(W_k \cdots W_{p-1}W_pXW_p^TW_{p-1}^T \cdots W_k^T\right)_{ij},
\]
where $k = \max\{\lceil \log_2(i) \rceil, \lceil \log_2(j) \rceil, 1\}$ and $W_k \in \mathbf{R}^{2^p\times 2^p}$ is given by
\[
W_k = \begin{bmatrix}
D_kG_k & \\
D_kH_k & \\
& I
\end{bmatrix}.
\]
Here $D_k$, $G_k$, and $H_k$ are defined as for the 1-D DWT. The adjoint $f^*$ is the inverse 2-D DWT. As in the 1-D DWT, the algorithms $\Phi_f$ and $\Phi_{f^*}$ repeatedly convolve by the filters $g \in \mathbf{R}^q$ and $h \in \mathbf{R}^q$, which requires $O(q(m + n))$ flops and uses $O(q)$ bytes of data to store $g$ and $h$ [133]. Here $m = n = 2^{2p}$. There are many alternative wavelet transforms for 2-D data; see, e.g., [36, 204, 70, 132].

2.2.4 Multiple vector mappings

In this section we consider linear functions that take as argument, or return, multiple vectors. (The idea is readily extended to the case when the arguments or return values are matrices.) The adjoint is defined by the inner product
\[
\langle (x_1,\ldots,x_k), (y_1,\ldots,y_k) \rangle = \sum_{i=1}^{k}\langle x_i, y_i\rangle = \sum_{i=1}^{k} x_i^Ty_i.
\]

The adjoint of a linear function $f : \mathbf{R}^{n_1}\times\cdots\times\mathbf{R}^{n_k} \to \mathbf{R}^{m_1}\times\cdots\times\mathbf{R}^{m_\ell}$ is then the function $f^* : \mathbf{R}^{m_1}\times\cdots\times\mathbf{R}^{m_\ell} \to \mathbf{R}^{n_1}\times\cdots\times\mathbf{R}^{n_k}$ for which
\[
\sum_{i=1}^{\ell} f(x_1,\ldots,x_k)_i^Ty_i = \sum_{i=1}^{k} x_i^Tf^*(y_1,\ldots,y_\ell)_i
\]
holds for all $(x_1,\ldots,x_k) \in \mathbf{R}^{n_1}\times\cdots\times\mathbf{R}^{n_k}$ and $(y_1,\ldots,y_\ell) \in \mathbf{R}^{m_1}\times\cdots\times\mathbf{R}^{m_\ell}$. Here $f(x_1,\ldots,x_k)_i$ and $f^*(y_1,\ldots,y_\ell)_i$ refer to the $i$th output of $f$ and $f^*$, respectively.

Sum and copy. The function $\mathrm{sum} : \mathbf{R}^m\times\cdots\times\mathbf{R}^m \to \mathbf{R}^m$ with $k$ inputs is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f(x_1,\ldots,x_k) = x_1 + \cdots + x_k$. The adjoint $f^*$ is the function $\mathrm{copy} : \mathbf{R}^m \to \mathbf{R}^m\times\cdots\times\mathbf{R}^m$, which outputs $k$ copies of its input. The algorithms $\Phi_f$ and $\Phi_{f^*}$ require $O(m + n)$ flops to sum and copy their input, respectively, using only $O(1)$ bytes of data to store the dimensions of $f$'s input and output. Here $n = km$.

Vstack and split. The function $\mathrm{vstack} : \mathbf{R}^{m_1}\times\cdots\times\mathbf{R}^{m_k} \to \mathbf{R}^n$ is represented by the FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$, where $f(x_1,\ldots,x_k)$ concatenates its $k$ inputs into a single vector output. The adjoint $f^*$ is the function $\mathrm{split} : \mathbf{R}^n \to \mathbf{R}^{m_1}\times\cdots\times\mathbf{R}^{m_k}$, which divides a single vector into $k$ separate components. The algorithms $\Phi_f$ and $\Phi_{f^*}$ simply reinterpret their input as a differently sized output in $O(1)$ flops, using only $O(1)$ bytes of data to store the dimensions of $f$'s input and output. Here $n = m = m_1 + \cdots + m_k$.
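As a small illustration, here is a numpy sketch of the sum/copy pair together with a numerical check of the adjoint identity $\langle f(x), y\rangle = \langle x, f^*(y)\rangle$; the function name sum_fao is our own.

import numpy as np

def sum_fao(k):
    # forward: sum of k vectors; adjoint: copy a vector k times.
    def forward(xs):
        return sum(xs)
    def adjoint(y):
        return [y.copy() for _ in range(k)]
    return forward, adjoint

# Check the adjoint identity on random data.
k, m = 3, 5
f, fstar = sum_fao(k)
xs = [np.random.randn(m) for _ in range(k)]
y = np.random.randn(m)
lhs = f(xs) @ y
rhs = sum(x @ yi for x, yi in zip(xs, fstar(y)))
assert abs(lhs - rhs) < 1e-9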

2.2.5 Additional examples

The literature on fast linear transforms goes far beyond the preceding examples. In this section we highlight a few notable omissions. Many methods have been developed for matrices derived from physical systems. The multigrid [111] and algebraic multigrid [32] methods efficiently apply the inverse of a matrix representing discretized partial differential equations (PDEs). The fast multipole method accelerates multiplication by matrices representing pairwise interactions [107, 37], much like the fast Gauss transform [108]. Hierarchical matrices are a matrix format that allows fast multiplication by the matrix and its inverse, with applications to discretized integral operators and PDEs [112, 113, 23].

Many approaches exist for factoring an invertible sparse matrix into a product of components whose inverses can be applied efficiently, yielding a fast method for applying the inverse of the matrix [73, 60]. A sparse LU factorization, for instance, decomposes an invertible sparse matrix $A \in \mathbf{R}^{n\times n}$ into the product $A = LU$ of a lower triangular matrix $L \in \mathbf{R}^{n\times n}$ and an upper triangular matrix $U \in \mathbf{R}^{n\times n}$. The relationship between $\mathbf{nnz}(A)$, $\mathbf{nnz}(L)$, and $\mathbf{nnz}(U)$ is complex and depends on the factorization algorithm [60, Chap. 6].

We only discussed 1-D and 2-D DFTs and convolutions, but these and related transforms can be extended to arbitrarily many dimensions [72, 158]. Similarly, many wavelet transforms naturally operate on data indexed by more than two dimensions [143, 232, 161].

2.3 Compositions

In this section we consider compositions of FAOs. In fact we have already discussed several linear functions that are naturally and efficiently represented as compositions, such as multiplication by a low-rank matrix and sparse matrix mappings. Here though we present a data structure and algorithm for efficiently evaluating any composition and its adjoint, which gives us an FAO representing the composition.

A composition of FAOs can be represented using a directed acyclic graph (DAG) with exactly one node with no incoming edges (the start node) and exactly one node with no outgoing edges (the end node). We call such a representation an FAO DAG.

Each node in the FAO DAG stores the following attributes:

• An FAO $\Gamma = (f, \Phi_f, \Phi_{f^*})$. Concretely, $f$ is a symbol identifying the function, and $\Phi_f$ and $\Phi_{f^*}$ are executable code.

• The data needed to evaluate $\Phi_f$ and $\Phi_{f^*}$.

• A list $E_{\mathrm{in}}$ of incoming edges.

• A list $E_{\mathrm{out}}$ of outgoing edges.

Each edge has an associated array. The incoming edges to a node store the arguments to the node's FAO. When the FAO is evaluated, it writes the result to the node's outgoing edges. Matrix arguments and outputs are stored in column-major order on the edge arrays.

As an example, figure 2.1 shows the FAO DAG for the composition $f(x) = Ax + Bx$, where $A \in \mathbf{R}^{m\times n}$ and $B \in \mathbf{R}^{m\times n}$ are dense matrices. The copy node duplicates the input $x \in \mathbf{R}^n$ into the multi-argument output $(x, x) \in \mathbf{R}^n\times\mathbf{R}^n$. The $A$ and $B$ nodes multiply by $A$ and $B$, respectively. The sum node sums two vectors together. The copy node is the start node, and the sum node is the end node. The FAO DAG requires $O(mn)$ bytes to store, since the $A$ and $B$ nodes store the matrices $A$ and $B$ and their transposes. The edge arrays also require $O(mn)$ bytes of memory.

Figure 2.1: The FAO DAG for $f(x) = Ax + Bx$.

2.3.1 Forward evaluation

To evaluate the composition $f(x) = Ax + Bx$ using the FAO DAG in figure 2.1, we first evaluate the start node on the input $x \in \mathbf{R}^n$, which copies $x$ onto both outgoing edges. We evaluate the $A$ and $B$ nodes (serially or in parallel) on their incoming edges, and write the results ($Ax$ and $Bx$) to their outgoing edges. Finally, we evaluate the end node on its incoming edges to obtain the result $Ax + Bx$.

The general procedure for evaluating an FAO DAG is given in algorithm 7. The algorithm evaluates the nodes in a topological order. The total flop count is the sum of the flops from evaluating the algorithm $\Phi_f$ on each node. If we allocate all scratch space needed by the FAO algorithms in advance, then no memory is allocated during the algorithm.

Algorithm 7 Evaluate an FAO DAG.
Input: $G = (V, E)$ is an FAO DAG representing a function $f$. $V$ is a list of nodes. $E$ is a list of edges. $I$ is a list of inputs to $f$. $O$ is a list of outputs from $f$. Each element of $I$ and $O$ is represented as an array.
  Create edges whose arrays are the elements of $I$ and save them as the list of incoming edges for the start node.
  Create edges whose arrays are the elements of $O$ and save them as the list of outgoing edges for the end node.
  Create an empty queue $Q$ for nodes that are ready to evaluate.
  Create an empty set $S$ for nodes that have been evaluated.
  Add $G$'s start node to $Q$.
  while $Q$ is not empty do
    $u \leftarrow$ pop the front node of $Q$.
    Evaluate $u$'s algorithm $\Phi_f$ on $u$'s incoming edges, writing the result to $u$'s outgoing edges.
    Add $u$ to $S$.
    for each edge $e = (u, v)$ in $u$'s $E_{\mathrm{out}}$ do
      if for all edges $(p, v)$ in $v$'s $E_{\mathrm{in}}$, $p$ is in $S$ then
        Add $v$ to the end of $Q$.
Output: $O$ contains the outputs of $f$ applied to inputs $I$.
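The following Python sketch mirrors algorithm 7 on a toy node representation; the attribute names (forward, in_nodes, out_nodes) are our own illustrative choices, and each node here produces a single value rather than writing to per-edge arrays.

from collections import deque

def evaluate_fao_dag(start, end, inputs):
    values = {start: start.forward(*inputs)}   # one value per evaluated node
    done = {start}
    queued = set()
    ready = deque()

    def maybe_enqueue(v):
        # A node is ready once all of its predecessors have been evaluated.
        if v not in done and v not in queued and all(p in done for p in v.in_nodes):
            queued.add(v)
            ready.append(v)

    for v in start.out_nodes:
        maybe_enqueue(v)
    while ready:
        u = ready.popleft()
        values[u] = u.forward(*[values[p] for p in u.in_nodes])
        done.add(u)
        for v in u.out_nodes:
            maybe_enqueue(v)
    return values[end]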

Figure 2.2: The FAO DAG for $f^*(u) = A^Tu + B^Tu$ obtained by transforming the FAO DAG in figure 2.1.

2.3.2 Adjoint evaluation

Given an FAO DAG $G$ representing a function $f$, we can easily generate an FAO DAG $G^*$ representing the adjoint $f^*$. We modify each node in $G$, replacing the node's FAO $(f, \Phi_f, \Phi_{f^*})$ with the FAO $(f^*, \Phi_{f^*}, \Phi_f)$ and swapping $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$. We also reverse the orientation of each edge in $G$. We can apply algorithm 7 to the resulting graph $G^*$ to evaluate $f^*$. Figure 2.2 shows the FAO DAG in figure 2.1 transformed into an FAO DAG for the adjoint.
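In the toy representation used in the evaluation sketch above, this transformation amounts to swapping each node's forward and adjoint routines and reversing its edge lists; a minimal sketch, under the same assumed attribute names:

def adjoint_dag(nodes):
    # Swap f and f*, and swap incoming/outgoing edges, at every node.
    for u in nodes:
        u.forward, u.adjoint = u.adjoint, u.forward
        u.in_nodes, u.out_nodes = u.out_nodes, u.in_nodes
    return nodes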

2.3.3 Parallelism

Algorithm 7 can be easily parallelized, since the nodes in the ready queue $Q$ can be evaluated in any order. A simple parallel implementation could use a thread pool with $t$ threads to evaluate up to $t$ nodes in the ready queue at a time. The evaluation of individual nodes can also be parallelized by replacing a node's algorithm $\Phi_f$ with a parallel variant. For example, the standard algorithms for dense and sparse matrix multiplication have simple parallel variants.

The extent to which parallelism speeds up evaluation of an FAO DAG is difficult to predict. Naive parallel evaluation may be slower than serial evaluation due to communication costs and other overhead. Achieving a perfect parallel speed-up would require sophisticated analysis of the DAG to determine which aspects of the algorithm to parallelize, and may only be possible for highly structured DAGs like one describing a block matrix [101].

2.3.4 Optimizing the DAG

The FAO DAG can often be transformed so that the output of algorithm 7 is the same but the algorithm is executed more efficiently. Such optimizations are especially important when the FAO DAG will be evaluated on many different inputs (as will be the case for matrix-free solvers, to be discussed later). For example, the FAO DAG representing $f(x) = ABx + ACx$ where $A, B, C \in \mathbf{R}^{n\times n}$, shown in figure 2.3, can be transformed into the FAO DAG in figure 2.4, which requires one fewer multiplication by $A$. The transformation is equivalent to rewriting $f(x) = ABx + ACx$ as $f(x) = A(Bx + Cx)$. Many other useful graph transformations can be derived from the rewriting rules used in program analysis and code generation [4].

Figure 2.3: The FAO DAG for $f(x) = ABx + ACx$.

Figure 2.4: The FAO DAG for $f(x) = A(Bx + Cx)$.

Sometimes graph transformations will involve pre-computation. For example, if two nodes representing the composition $f(x) = b^Tcx$, where $b, c \in \mathbf{R}^n$, appear in an FAO DAG, the DAG can be made more efficient by evaluating $\alpha = b^Tc$ and replacing the two nodes with a single node for scalar multiplication by $\alpha$.

The optimal rewriting of a DAG will depend on the hardware and overall architecture on which the multiplication algorithm is being run. For example, if the algorithm is being run on a distributed computing cluster then a node representing multiplication by a large matrix
\[
A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix},
\]
could be split into separate nodes for each block, with the nodes stored on different computers. This rewriting would be necessary if the matrix $A$ is so large it cannot be stored on a single machine. The literature on optimizing compilers suggests many approaches to optimizing an FAO DAG for evaluation on a particular architecture [4].

2.3.5 Reducing the memory footprint

In a naive implementation, the total bytes needed to represent an FAO DAG $G$, with node set $V$ and edge set $E$, is the sum of the bytes of data on each node $u \in V$ and the bytes of memory needed for the array on each edge $e \in E$. A more sophisticated approach can substantially reduce the memory needed. For example, when the same FAO occurs more than once in $V$, duplicate nodes can share data.

We can also reuse memory across edge arrays. The key is determining which arrays can never be in use at the same time during algorithm 7. An array for an edge $(u, v)$ is in use if node $u$ has been evaluated but node $v$ has not been evaluated. The arrays for edges $(u_1, v_1)$ and $(u_2, v_2)$ can never be in use at the same time if and only if there is a directed path from $v_1$ to $u_2$ or from $v_2$ to $u_1$. If the sequence in which the nodes will be evaluated is fixed, rather than following an unknown topological ordering, then we can say precisely which arrays will be in use at the same time.

After we determine which edge arrays may be in use at the same time, the next step is to map the edge arrays onto a global array, keeping the global array as small as possible. Let $L(e)$ denote the length of edge $e$'s array and $U \subseteq E\times E$ denote the set of pairs of edges whose arrays may be in use at the same time. Formally, we want to solve the optimization problem
\[
\begin{array}{ll}
\mbox{minimize} & \max_{e\in E}\{z_e + L(e)\} \\
\mbox{subject to} & [z_e, z_e + L(e) - 1] \cap [z_f, z_f + L(f) - 1] = \emptyset, \quad (e, f) \in U \\
& z_e \in \{1, 2, \ldots\}, \quad e \in E,
\end{array} \qquad (2.6)
\]
where the $z_e$ are the optimization variables and represent the index in the global array where edge $e$'s array begins.

When all the edge arrays are the same length, problem (2.6) is equivalent to finding the chromatic number of the graph with vertices $E$ and edges $U$. Problem (2.6) is thus NP-hard in general [137]. A reasonable heuristic for problem (2.6) is to first find a graph coloring of $(E, U)$ using one of the many efficient algorithms for finding graph colorings that use a small number of colors; see, e.g., [114, 33]. We then have a mapping $\phi$ from colors to sets of edges assigned to the same color. We order the colors arbitrarily as $c_1, \ldots, c_k$ and assign the $z_e$ as follows:
\[
z_e = \begin{cases}
1, & e \in \phi(c_1) \\
\max_{f\in\phi(c_{i-1})}\{z_f + L(f)\}, & e \in \phi(c_i), \; i > 1.
\end{cases}
\]
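A sketch of this heuristic using the greedy graph-coloring routine from networkx; the function name assign_offsets and the dictionary-based interface are our own.

import networkx as nx

def assign_offsets(lengths, conflicts):
    # lengths: dict mapping each edge to its array length L(e).
    # conflicts: iterable of edge pairs (e, f) in U that may be in use together.
    G = nx.Graph()
    G.add_nodes_from(lengths)
    G.add_edges_from(conflicts)
    colors = nx.coloring.greedy_color(G, strategy="largest_first")
    by_color = {}
    for e, c in colors.items():
        by_color.setdefault(c, []).append(e)
    # Edges of the same color share a slab; slabs are laid out back to back.
    offsets, start = {}, 1
    for c in sorted(by_color):
        for e in by_color[c]:
            offsets[e] = start
        start += max(lengths[e] for e in by_color[c])
    return offsets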

Additional optimizations can be made based on the unique characteristics of different FAOs. For example, the outgoing edges from a copy node can share the incoming edge's array until the outgoing edges' arrays are written to (i.e., copy-on-write). Another example is that the outgoing edges from a split node can point to segments of the array on the incoming edge. Similarly, the incoming edges on a vstack node can point to segments of the array on the outgoing edge.

2.3.6 Software implementations

Several software packages have been developed for constructing and evaluating compositions of linear functions. The MATLAB toolbox SPOT allows users to construct expressions involving both fast transforms, like convolution and the DFT, and standard matrix multiplication [117]. TFOCS, a framework in MATLAB for solving convex problems using a variety of first-order algorithms, provides functionality for constructing and composing FAOs [17]. The Python package linop provides methods for constructing FAOs and combining them into linear expressions [216]. Halide is a domain-specific language for image processing that makes it easy to optimize compositions of fast transforms for a variety of architectures [185].

Our approach to representing and evaluating compositions of functions is similar to the approach taken by autodifferentiation tools. These tools represent a composite function $f : \mathbf{R}^n \to \mathbf{R}^m$ as a DAG [109], and multiply by the Jacobian $J \in \mathbf{R}^{m\times n}$ and its adjoint efficiently through graph traversal. Forward mode autodifferentiation computes $x \to Jx$ efficiently by traversing the DAG in topological order. Reverse mode autodifferentiation, or backpropagation, computes $u \to J^Tu$ efficiently by traversing the DAG once in topological order and once in reverse topological order [14]. An enormous variety of software packages have been developed for autodifferentiation; see [14] for a survey. Autodifferentiation in the form of backpropagation plays a central role in deep learning frameworks such as TensorFlow [1], Theano [20, 13], Caffe [134], and Torch [49].

2.4 Cone programs and solvers

2.4.1 Cone programs

A cone program is a convex optimization problem of the form
\[
\begin{array}{ll}
\mbox{minimize} & c^Tx \\
\mbox{subject to} & Ax + b \in \mathcal{K},
\end{array} \qquad (2.7)
\]
where $x \in \mathbf{R}^n$ is the optimization variable, $\mathcal{K}$ is a convex cone, and $A \in \mathbf{R}^{m\times n}$, $c \in \mathbf{R}^n$, and $b \in \mathbf{R}^m$ are problem data. Cone programs are a broad class that include linear programs, second-order cone programs, and semidefinite programs as special cases [172, 28]. We call the cone program matrix-free if $A$ is represented implicitly as an FAO, rather than explicitly as a dense or sparse matrix.
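As a concrete (if simplified) picture of what representing $A$ implicitly means, the following sketch packages forward and adjoint multiplication routines as a scipy LinearOperator; the choice of example operator (1-D circular convolution with a kernel c) is our own.

import numpy as np
from scipy.sparse.linalg import LinearOperator

n = 1000
c = np.random.randn(n)

def matvec(x):      # x -> Ax, circular convolution with c
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def rmatvec(u):     # u -> A^T u, circular correlation with c
    return np.real(np.fft.ifft(np.conj(np.fft.fft(c)) * np.fft.fft(u)))

A = LinearOperator((n, n), matvec=matvec, rmatvec=rmatvec)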

The convex cone $\mathcal{K}$ is typically a Cartesian product of simple convex cones from the following list:

• Zero cone: $\mathcal{K}_0 = \{0\}$.

• Free cone: $\mathcal{K}_{\mathrm{free}} = \mathbf{R}$.

• Nonnegative cone: $\mathcal{K}_+ = \{x \in \mathbf{R} \mid x \geq 0\}$.

• Second-order cone: $\mathcal{K}_{\mathrm{soc}} = \{(x, t) \in \mathbf{R}^{n+1} \mid x \in \mathbf{R}^n,\ t \in \mathbf{R},\ \|x\|_2 \leq t\}$.

• Positive semidefinite cone: $\mathcal{K}_{\mathrm{psd}} = \{\mathbf{vec}(X) \mid X \in \mathbf{S}^n,\ z^TXz \geq 0 \mbox{ for all } z \in \mathbf{R}^n\}$.

• Exponential cone ([179, §6.3.4]):
\[
\mathcal{K}_{\mathrm{exp}} = \{(x, y, z) \in \mathbf{R}^3 \mid y > 0,\ ye^{x/y} \leq z\} \cup \{(x, y, z) \in \mathbf{R}^3 \mid x \leq 0,\ y = 0,\ z \geq 0\}.
\]

• Power cone ([170, 199, 120]):
\[
\mathcal{K}^a_{\mathrm{pwr}} = \{(x, y, z) \in \mathbf{R}^3 \mid x^ay^{(1-a)} \geq |z|,\ x \geq 0,\ y \geq 0\},
\]
where $a \in [0, 1]$.

These cones are useful in expressing common problems (via canonicalization), and can be handled by various solvers (as discussed below). Note that all the cones are subsets of $\mathbf{R}^n$, i.e., real vectors. It might be more natural to view the elements of a cone as matrices or tuples, but viewing the elements as vectors simplifies the matrix-free canonicalization algorithm in §2.5.

Cone programs that include only cones from certain subsets of the list above have special names. For example, if the only cones are zero, free, and nonnegative cones, the cone program is a linear program; if in addition it includes the second-order cone, it is called a second-order cone program. A well studied special case is so-called symmetric cone programs, which include the zero, free, nonnegative, second-order, and positive semidefinite cones. Semidefinite programs, where the cone constraint consists of a single positive semidefinite cone, are another common case.

2.4.2 Cone solvers

Many methods have been developed to solve cone programs, the most widely used being interior-point methods; see, e.g., [28, 171, 174, 227, 231].

Interior-point. A large number of interior-point cone solvers have been implemented. Most support symmetric cone programs. SDPT3 [213] and SeDuMi [207] are open-source solvers implemented in MATLAB; CVXOPT [8] is an open-source solver implemented in Python; MOSEK [167] is a commercial solver with interfaces to many languages. ECOS is an open-source cone solver written in library-free C that supports second-order cone programs [71]; Akle extended ECOS to support the exponential cone [5]. DSDP5 [19] and SDPA [86] are open-source solvers for semidefinite programs implemented in C and C++, respectively.

First-order. First-order methods are an alternative to interior-point methods that scale more easily to large cone programs, at the cost of lower accuracy. PDOS [45] is a first-order cone solver based on the alternating direction method of multipliers (ADMM) [26]. PDOS supports second-order cone programs. POGS [81] is an ADMM based solver that runs on a GPU, with a version that is similar to PDOS and targets second-order cone programs. SCS is another ADMM-based cone solver, which supports symmetric cone programs as well as the exponential and power cones [175]. Many other first-order algorithms can be applied to cone programs (e.g., [148, 39, 182]), but none have been implemented as a robust, general purpose cone solver.

Matrix-free. Matrix-free cone solvers are an area of active research, and a small number have been developed. PENNON is a matrix-free semidefinite program (SDP) solver [142]. PENNON solves a series of unconstrained optimization problems using Newton's method. The Newton step is computed using a preconditioned conjugate gradient method, rather than by factoring the Hessian directly. Many other matrix-free algorithms for solving SDPs have been proposed (e.g., [44, 87, 212, 235]). CVXOPT can be used as a matrix-free cone solver, as it allows users to specify linear functions as Python functions for evaluating matrix-vector products, rather than as explicit matrices [7].

Several matrix-free solvers have been developed for quadratic programs (QPs), which are a superset of linear programs and a subset of second-order cone programs. Gondzio developed a matrix-free interior-point method for QPs that solves linear systems using a preconditioned conjugate gradient method [100, 99, 119]. PDCO is a matrix-free interior-point solver that can solve QPs [189], using LSMR to solve linear systems [79].

2.5 Matrix-free canonicalization

2.5.1 Canonicalization

Canonicalization is an algorithm that takes as input a data structure representing a general convex optimization problem and outputs a data structure representing an equivalent cone program. By solving the cone program, we recover the solution to the original optimization problem. This approach is used by convex optimization modeling systems such as YALMIP [160], CVX [104], CVXPY [65], and Convex.jl [214]. The same technique is used in the code generators CVXGEN [164] and QCML [46].

The downside of canonicalization's generality is that special structure in the original problem may be lost during the transformation into a cone program. In particular, current methods of canonicalization convert fast linear transforms in the original problem into multiplication by a dense or sparse matrix, which makes the final cone program far more costly to solve than the original problem.

The canonicalization algorithm can be modified, however, so that fast linear transforms are preserved. The key is to represent all linear functions arising during the canonicalization process as FAO DAGs instead of as sparse matrices. The FAO DAG representation of the final cone program can be used by a matrix-free cone solver to solve the cone program. The modified canonicalization algorithm never forms explicit matrix representations of linear functions. Hence we call the algorithm matrix-free canonicalization.

The remainder of this section has the following outline: In §2.5.2 we give an informal overview of the matrix-free canonicalization algorithm. In §2.5.3 we define the expression DAG data structure, which is used throughout the matrix-free canonicalization algorithm. In §2.5.4 we define the data structure used to represent convex optimization problems as input to the algorithm. In §2.5.5 we define the representation of a cone program output by the matrix-free canonicalization algorithm. In §2.5.6 we present the matrix-free canonicalization algorithm itself.

2.5.2 Informal overview

In this section we give an informal overview of the matrix-free canonicalization algorithm. Later sections define the data structures used in the algorithm and make the procedure described in this section formal and explicit.

We are given an optimization problem
\[
\begin{array}{ll}
\mbox{minimize} & f_0(x) \\
\mbox{subject to} & f_i(x) \leq 0, \quad i = 1,\ldots,p \\
& h_i(x) + d_i = 0, \quad i = 1,\ldots,q,
\end{array} \qquad (2.8)
\]
where $x \in \mathbf{R}^n$ is the optimization variable, $f_0 : \mathbf{R}^n \to \mathbf{R}, \ldots, f_p : \mathbf{R}^n \to \mathbf{R}$ are convex functions, $h_1 : \mathbf{R}^n \to \mathbf{R}^{m_1}, \ldots, h_q : \mathbf{R}^n \to \mathbf{R}^{m_q}$ are linear functions, and $d_1 \in \mathbf{R}^{m_1}, \ldots, d_q \in \mathbf{R}^{m_q}$ are vector constants. Our goal is to convert the problem into an equivalent matrix-free cone program, so that we can solve it using a matrix-free cone solver.

We assume that the problem satisfies a set of requirements known as disciplined convex programming [102, 105]. The requirements ensure that each of the $f_0, \ldots, f_p$ can be represented as partial minimization over a cone program. Let each function $f_i$ have the cone program representation
\[
\begin{array}{lll}
f_i(x) = & \mbox{minimize (over $t^{(i)}$)} & g^{(i)}_0(x, t^{(i)}) + e^{(i)}_0 \\
& \mbox{subject to} & g^{(i)}_j(x, t^{(i)}) + e^{(i)}_j \in \mathcal{K}^{(i)}_j, \quad j = 1,\ldots,r^{(i)},
\end{array}
\]
where $t^{(i)} \in \mathbf{R}^{s^{(i)}}$ is the optimization variable, $g^{(i)}_0, \ldots, g^{(i)}_{r^{(i)}}$ are linear functions, $e^{(i)}_0, \ldots, e^{(i)}_{r^{(i)}}$ are vector constants, and $\mathcal{K}^{(i)}_1, \ldots, \mathcal{K}^{(i)}_{r^{(i)}}$ are convex cones.

Figure 2.5: The expression DAG for $f(x) = \|Ax\|_2 + 3$.

We rewrite problem (2.8) as the equivalent cone program
\[
\begin{array}{ll}
\mbox{minimize} & g^{(0)}_0(x, t^{(0)}) + e^{(0)}_0 \\
\mbox{subject to} & -g^{(i)}_0(x, t^{(i)}) - e^{(i)}_0 \in \mathcal{K}_+, \quad i = 1,\ldots,p, \\
& g^{(i)}_j(x, t^{(i)}) + e^{(i)}_j \in \mathcal{K}^{(i)}_j, \quad i = 1,\ldots,p, \; j = 1,\ldots,r^{(i)}, \\
& h_i(x) + d_i \in \mathcal{K}^{m_i}_0, \quad i = 1,\ldots,q.
\end{array} \qquad (2.9)
\]
We convert problem (2.9) into the standard form for a matrix-free cone program given in (2.7) by representing $g^{(0)}_0$ as the inner product with a vector $c \in \mathbf{R}^{n + s^{(0)}}$, concatenating the $d_i$ and $e^{(i)}_j$ vectors into a single vector $b$, and representing the matrix $A$ implicitly as the linear function that stacks the outputs of all the $h_i$ and $g^{(i)}_j$ (excluding the objective $g^{(0)}_0$) into a single vector.

2.5.3 Expression DAGs

The canonicalization algorithm uses a data structure called an expression DAG to represent functions in an optimization problem. Like the FAO DAG defined in §2.3, an expression DAG encodes a composition of functions as a DAG where a node represents a function and an edge from a node $u$ to a node $v$ signifies that an output of $u$ is an input to $v$. Figure 2.5 shows an expression DAG for the composition $f(x) = \|Ax\|_2 + 3$, where $x \in \mathbf{R}^n$ and $A \in \mathbf{R}^{m\times n}$.

Formally, an expression DAG is a connected DAG with one node with no outgoing edges (the end node) and one or more nodes with no incoming edges (start nodes). Each node in an expression DAG has the following attributes:

• A symbol representing a function f .

• The data needed to parameterize the function, such as the power $p$ for the function $f(x) = x^p$.

• A list Ein of incoming edges.

• A list Eout of outgoing edges.

Each start node in an expression DAG is either a constant function or a variable. A variable is a symbol that labels a node input. If two nodes $u$ and $v$ both have incoming edges from variable nodes with symbol $t$, then the inputs to $u$ and $v$ are the same.

We say an expression DAG is affine if every non-start node represents a linear function. If in addition every start node is a variable, we say the expression DAG is linear. We say an expression DAG is constant if it contains no variables, i.e., every start node is a constant.

2.5.4 Optimization problem representation

An optimization problem representation (OPR) is a data structure that represents a convex optimization problem. The input to the matrix-free canonicalization algorithm is an OPR. An OPR can encode any mathematical optimization problem of the form
\[
\begin{array}{ll}
\mbox{minimize (over $y$ w.r.t. $\mathcal{K}_0$)} & f_0(x, y) \\
\mbox{subject to} & f_i(x, y) \in \mathcal{K}_i, \quad i = 1,\ldots,\ell,
\end{array} \qquad (2.10)
\]
where $x \in \mathbf{R}^n$ and $y \in \mathbf{R}^m$ are the optimization variables, $\mathcal{K}_0$ is a proper cone, $\mathcal{K}_1, \ldots, \mathcal{K}_\ell$ are convex cones, and for $i = 0,\ldots,\ell$, we have $f_i : \mathbf{R}^n\times\mathbf{R}^m \to \mathbf{R}^{m_i}$ where $\mathcal{K}_i \subseteq \mathbf{R}^{m_i}$. (For background on convex optimization with respect to a cone, see, e.g., [28, §4.7].)

Problem (2.10) is more complicated than the standard definition of a convex optimization problem given in (2.8). The additional complexity is necessary so that OPRs can encode partial minimization over cone programs, which can involve minimization with respect to a cone and constraints other than equalities and inequalities. These partial minimization problems play a major role in the canonicalization algorithm. Note that we can easily represent equality and inequality constraints using the zero and nonnegative cones.

Concretely, an OPR is a tuple $(s, o, C)$ where

• The element $s$ is a tuple $(V, \mathcal{K})$ representing the problem's objective sense. The element $V$ is a set of symbols encoding the variables being minimized over. The element $\mathcal{K}$ is a symbol encoding the proper cone the problem objective is being minimized with respect to.

• The element $o$ is an expression DAG representing the problem's objective function.

• The element $C$ is a set representing the problem's constraints. Each element $c_i \in C$ is a tuple $(e_i, \mathcal{K}_i)$ representing a constraint of the form $f(x, y) \in \mathcal{K}$. The element $e_i$ is an expression DAG representing the function $f$ and $\mathcal{K}_i$ is a symbol encoding the convex cone $\mathcal{K}$.

The matrix-free canonicalization algorithm can only operate on OPRs that satisfy the two DCP requirements [102, 105]. The first requirement is that each nonlinear function in the OPR have a known representation as partial minimization over a cone program. See [103] for many examples of such representations.

The second requirement is that the objective $o$ be verifiable as convex with respect to the cone $\mathcal{K}$ in the objective sense $s$ by the DCP composition rule. Similarly, for each element $(e_i, \mathcal{K}_i) \in C$, the constraint that the function represented by $e_i$ lie in the convex cone represented by $\mathcal{K}_i$ must be verifiable as convex by the composition rule. The DCP composition rule determines the curvature of a composition $f(g_1(x), \ldots, g_k(x))$ from the curvatures and ranges of the arguments $g_1, \ldots, g_k$, the curvature of the function $f$, and the monotonicity of $f$ on the range of its arguments. See [102] and [214] for a full discussion of the DCP composition rule. Additional rules are used to determine the range of a composition from the range of its arguments.

Note that it is not enough for the objective and constraints to be convex. They must also be structured so that the DCP composition rule can verify their convexity. Otherwise the cone program output by the matrix-free canonicalization algorithm is not guaranteed to be equivalent to the original problem.

To simplify the exposition of the canonicalization algorithm, we will also require that the objective sense $s$ represent minimization over all the variables in the problem with respect to the nonnegative cone, i.e., the standard definition of minimization. The most general implementation of canonicalization would also accept OPRs that can be transformed into an equivalent OPR with an objective sense that meets this requirement.

2.5.5 Cone program representation

The matrix-free canonicalization algorithm outputs a tuple $(c_{\mathrm{arr}}, d_{\mathrm{arr}}, b_{\mathrm{arr}}, G, \mathcal{K}_{\mathrm{list}})$ where

• The element $c_{\mathrm{arr}}$ is a length $n$ array representing a vector $c \in \mathbf{R}^n$.

• The element $d_{\mathrm{arr}}$ is a length one array representing a scalar $d \in \mathbf{R}$.

• The element $b_{\mathrm{arr}}$ is a length $m$ array representing a vector $b \in \mathbf{R}^m$.

• The element $G$ is an FAO DAG representing a linear function $f(x) = Ax$, where $A \in \mathbf{R}^{m\times n}$.

• The element $\mathcal{K}_{\mathrm{list}}$ is a list of symbols representing the convex cones $(\mathcal{K}_1, \ldots, \mathcal{K}_\ell)$.

The tuple represents the matrix-free cone program
\[
\begin{array}{ll}
\mbox{minimize} & c^Tx + d \\
\mbox{subject to} & Ax + b \in \mathcal{K},
\end{array} \qquad (2.11)
\]
where $\mathcal{K} = \mathcal{K}_1\times\cdots\times\mathcal{K}_\ell$.

We can use the FAO DAG $G$ and algorithm 7 to represent $A$ as an FAO, i.e., export methods for multiplying by $A$ and $A^T$. These two methods are all a matrix-free cone solver needs to efficiently solve problem (2.11).

2.5.6 Algorithm

The matrix-free canonicalization algorithm can be broken down into subroutines. We describe these subroutines before presenting the overall algorithm.

Conic-Form. The Conic-Form subroutine takes an OPR as input and returns an equivalent OPR where every non-start node in the objective and constraint expression DAGs represents a linear function. The output of the Conic-Form subroutine represents a cone program, but the output must still be transformed into a data structure that a cone solver can use, e.g., the cone program representation described in §2.5.5.

The general idea of the Conic-Form algorithm is to replace each nonlinear function in the OPR with an OPR representing partial minimization over a cone program. Recall that the canonicalization algorithm requires that all nonlinear functions in the problem be representable as partial minimization over a cone program. The OPR for each nonlinear function is spliced into the full OPR. We refer the reader to [103] and [214] for a full discussion of the Conic-Form algorithm.

The Conic-Form subroutine preserves fast linear transforms in the problem. All linear functions in the original OPR are present in the OPR output by Conic-Form. The only linear functions added are ones like sum and scalar multiplication that are very efficient to evaluate. Thus, evaluating the FAO DAG representing the final cone program will be as efficient as evaluating all the linear functions in the original problem.

Linear and Constant. The Linear and Constant subroutines take an affine expression DAG as input and return the DAG's linear and constant components, respectively. Concretely, the Linear subroutine returns a copy of the input DAG where every constant start node is replaced with a variable start node and a node mapping the variable output to a vector (or matrix) of zeros with the same dimensions as the constant. The Constant subroutine returns a copy of the input DAG where every variable start node is replaced with a zero-valued constant node of the same dimensions. Figures 2.7 and 2.8 show the results of applying the Linear and Constant subroutines to an expression DAG representing $f(x) = x + 2$, as depicted in figure 2.6.

Evaluate. The Evaluate subroutine takes a constant expression DAG as input and returns an array. The array contains the value of the function represented by the expression DAG. If the DAG evaluates to a matrix $A \in \mathbf{R}^{m\times n}$, the array represents $\mathbf{vec}(A)$. Similarly, if the DAG evaluates to multiple output vectors $(b_1, \ldots, b_k) \in \mathbf{R}^{n_1}\times\cdots\times\mathbf{R}^{n_k}$, the array represents $\mathbf{vstack}(b_1, \ldots, b_k)$. For example, the output of the Evaluate subroutine on the expression DAG in figure 2.8 is a length one array with first entry equal to 2.

Figure 2.6: The expression DAG for $f(x) = x + 2$.

Figure 2.7: The Linear subroutine applied to the expression DAG in figure 2.6.

Figure 2.8: The Constant subroutine applied to the expression DAG in figure 2.6.

Figure 2.9: The expression DAG for $\mathbf{vstack}(e_1, \ldots, e_\ell)$.

Graph-Repr. The Graph-Repr subroutine takes a list of linear expression DAGs, $(e_1, \ldots, e_\ell)$, and an ordering over the variables in the expression DAGs, $<_V$, as input and outputs an FAO DAG $G$. We require that the end node of each expression DAG represent a function with a single vector as output.

We construct the FAO DAG $G$ in three steps. In the first step, we combine the expression DAGs into a single expression DAG $H^{(1)}$ by creating a vstack node and adding an edge from the end node of each expression DAG to the new node. The expression DAG $H^{(1)}$ is shown in figure 2.9.

In the second step, we transform $H^{(1)}$ into an expression DAG $H^{(2)}$ with a single start node. Let $x_1, \ldots, x_k$ be the variables in $(e_1, \ldots, e_\ell)$ ordered by $<_V$. Let $n_i$ be the length of $x_i$ if the variable is a vector and of $\mathbf{vec}(x_i)$ if the variable is a matrix, for $i = 1, \ldots, k$. We create a start node representing the function $\mathrm{split} : \mathbf{R}^n \to \mathbf{R}^{n_1}\times\cdots\times\mathbf{R}^{n_k}$. For each variable $x_i$, we add an edge from output $i$ of the start node to a copy node and edges from that copy node to all the nodes representing $x_i$. If $x_i$ is a vector, we replace all the nodes representing $x_i$ with nodes representing the identity function. If $x_i$ is a matrix, we replace all the nodes representing $x_i$ with mat nodes. The transformation from $H^{(1)}$ to $H^{(2)}$ when $\ell = 1$ and $e_1$ represents $f(x, y) = x + A(x + y)$, where $x, y \in \mathbf{R}^n$ and $A \in \mathbf{R}^{n\times n}$, is depicted in figures 2.10 and 2.11.

In the third and final step, we transform $H^{(2)}$ from an expression DAG into an FAO DAG $G$. $H^{(2)}$ is almost an FAO DAG, since each node represents a linear function and the DAG has a single start and end node. To obtain $G$ we simply add the node and edge attributes needed in an FAO DAG. For each node $u$ in $H^{(2)}$ representing the function $f$, we add to $u$ an FAO $(f, \Phi_f, \Phi_{f^*})$ and the data needed to evaluate $\Phi_f$ and $\Phi_{f^*}$. The node already has the required lists of incoming and outgoing edges. We also add an array to each of $H^{(2)}$'s edges.

Optimize-Graph. The Optimize-Graph subroutine takes an FAO DAG $G$ as input and outputs an equivalent FAO DAG $G^{\mathrm{opt}}$, meaning that the output of algorithm 7 is the same for $G$ and $G^{\mathrm{opt}}$. We choose $G^{\mathrm{opt}}$ by optimizing $G$ so that the runtime of algorithm 7 is as short as possible (see §2.3.4). We also compress the FAO data and edge arrays to reduce the graph's memory footprint (see §2.3.5). We could optimize the graph for the adjoint, $G^*$, as well, but asymptotically at least the flop count and memory footprint for $G^*$ will be the same as for $G$, meaning optimizing $G$ is the same as jointly optimizing $G$ and $G^*$.

Figure 2.10: The expression DAG $H^{(1)}$ when $\ell = 1$ and $e_1$ represents $f(x, y) = x + A(x + y)$.

Figure 2.11: The expression DAG $H^{(2)}$ obtained by transforming $H^{(1)}$ in figure 2.10.

Matrix-Repr. The Matrix-Repr subroutine takes a list of linear expression DAGs, $(e_1, \ldots, e_\ell)$, and an ordering over the variables in the expression DAGs, $<_V$, as input and outputs a sparse matrix. Note that the input types are the same as in the Graph-Repr subroutine. In fact, for a given input the sparse matrix output by Matrix-Repr represents the same linear function as the FAO DAG output by Graph-Repr. The Matrix-Repr subroutine is used by the standard canonicalization algorithm to produce a sparse matrix representation of a cone program.

Overall algorithm. With all the subroutines in place, the matrix-free canonicalization algorithm is straightforward. The implementation is given in algorithm 8.

Algorithm 8 Matrix-free canonicalization.
Input: $p$ is an OPR that satisfies the requirements of DCP.
  $(s, o, C) \leftarrow$ Conic-Form($p$).
  Choose any ordering $<_V$ on the variables in $(s, o, C)$.
  Choose any ordering $<_C$ on the constraints in $C$.
  $((e_1, \mathcal{K}_1), \ldots, (e_\ell, \mathcal{K}_\ell)) \leftarrow$ the constraints in $C$ ordered according to $<_C$.
  $c_{\mathrm{mat}} \leftarrow$ Matrix-Repr($(\mathrm{Linear}(o))$, $<_V$).
  Convert $c_{\mathrm{mat}}$ from a 1-by-$n$ sparse matrix into a length $n$ array $c_{\mathrm{arr}}$.
  $d_{\mathrm{arr}} \leftarrow$ Evaluate(Constant($o$)).
  $b_{\mathrm{arr}} \leftarrow$ vstack(Evaluate(Constant($e_1$)), $\ldots$, Evaluate(Constant($e_\ell$))).
  $G \leftarrow$ Graph-Repr($(\mathrm{Linear}(e_1), \ldots, \mathrm{Linear}(e_\ell))$, $<_V$).
  $G^{\mathrm{opt}} \leftarrow$ Optimize-Graph($G$).
  $\mathcal{K}_{\mathrm{list}} \leftarrow (\mathcal{K}_1, \ldots, \mathcal{K}_\ell)$.
  return $(c_{\mathrm{arr}}, d_{\mathrm{arr}}, b_{\mathrm{arr}}, G^{\mathrm{opt}}, \mathcal{K}_{\mathrm{list}})$.

2.6 Numerical results

2.6.1 Implementation

We have implemented the matrix-free canonicalization algorithm as an extension of CVXPY [65], available at

https://github.com/mfopt/mf_cvxpy.

To solve the resulting matrix-free cone programs, we implemented modified versions of SCS [175] and POGS [81] that are truly matrix-free, available at

https://github.com/mfopt/mf_scs,
https://github.com/mfopt/mf_pogs.

(The main modification was using the matrix-free equilibration described in [67].) Our implementations are still preliminary and can be improved in many ways. We also emphasize that the canonicalization is independent of the particular matrix-free cone solver used.

In this section we benchmark our implementation of matrix-free canonicalization and of matrix-free SCS and POGS on several convex optimization problems involving fast linear transforms. We compare the performance of our matrix-free convex optimization modeling system with that of the current CVXPY modeling system, which represents the matrix $A$ in a cone program as a sparse matrix and uses standard cone solvers. The standard cone solvers and matrix-free SCS were run serially on a single Intel Xeon processor, while matrix-free POGS was run on a Titan X GPU.

2.6.2 Nonnegative deconvolution

We applied our matrix-free convex optimization modeling system to the nonnegative deconvolution problem (2.1). The Python code below constructs and solves problem (2.1). The constants c and b and problem size n are defined elsewhere. The code is only a few lines, and it could be easily modified to add regularization on x or apply a different cost function to $c \ast x - b$. The modeling system would automatically adapt to solve the modified problem.

# Construct the optimization problem.
x = Variable(n)
cost = norm2(conv(c, x) - b)
prob = Problem(Minimize(cost), [x >= 0])

# Solve using matrix-free SCS.
prob.solve(solver=MAT_FREE_SCS)

Problem instances. We used the following procedure to generate interesting (nontrivial) instances of problem (2.1). For all instances the vector $c \in \mathbf{R}^n$ was a Gaussian kernel with standard deviation $n/10$. All entries of $c$ less than $10^{-6}$ were set to $10^{-6}$, so that no entries were too close to zero. The vector $b \in \mathbf{R}^{2n-1}$ was generated by picking a solution $x$ with 5 entries randomly chosen to be nonzero. The values of the nonzero entries were chosen uniformly at random from the interval $[0, n/10]$. We set $b = c \ast x + v$, where the entries of the noise vector $v \in \mathbf{R}^{2n-1}$ were drawn from a normal distribution with mean zero and variance $\|c \ast x\|_2^2/(400(2n-1))$. Our choice of $v$ yielded a signal-to-noise ratio near 20.
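A numpy sketch of this instance-generation procedure, under our own assumptions that the Gaussian kernel is centered at $n/2$ and that $\ast$ denotes full (column) convolution; the function name deconv_instance is ours.

import numpy as np

def deconv_instance(n, seed=0):
    rng = np.random.default_rng(seed)
    # Gaussian kernel with standard deviation n/10, floored at 1e-6.
    i = np.arange(n)
    c = np.exp(-(i - n / 2.0) ** 2 / (2 * (n / 10.0) ** 2))
    c = np.maximum(c, 1e-6)
    # True signal with 5 nonzero entries drawn uniformly from [0, n/10].
    x = np.zeros(n)
    idx = rng.choice(n, 5, replace=False)
    x[idx] = rng.uniform(0, n / 10.0, 5)
    # Noisy measurements b = c * x + v.
    cx = np.convolve(c, x)                      # length 2n - 1
    sigma2 = np.sum(cx ** 2) / (400.0 * (2 * n - 1))
    b = cx + rng.normal(0, np.sqrt(sigma2), 2 * n - 1)
    return c, x, b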

While not relevant to solving the optimization problem, the solution of the nonnegative deconvolution problem often, but not always, (approximately) recovers the original vector $x$. Figure 2.12 shows the solution recovered by ECOS [71] for a problem instance with $n = 1000$. The ECOS solution $x^\star$ had a cluster of 3-5 adjacent nonzero entries around each spike in $x$. The sum of the entries was close to the value of the spike. The recovered $x$ in figure 2.12 shows only the largest entry in each cluster, with value set to the sum of the cluster's entries.

Figure 2.12: Results for a problem instance with $n = 1000$.

Results. Figure 2.13 compares the performance on problem (2.1) of the interior-point solver ECOS [71] and matrix-free versions of SCS and POGS as the size $n$ of the optimization variable increases. We limited the solvers to $10^4$ seconds.

For each variable size $n$ we generated ten different problem instances and recorded the average solve time for each solver. ECOS and matrix-free SCS were run with an absolute and relative tolerance of $10^{-3}$ for the duality gap, $\ell_2$ norm of the primal residual, and $\ell_2$ norm of the dual residual. Matrix-free POGS was run with an absolute tolerance of $10^{-4}$ and a relative tolerance of $10^{-3}$.

For each solver, we plot the solve times and the least squares linear fit to those solve times (the dotted line). The slopes of the lines show how the solvers scale. The least-squares linear fit for the ECOS solve times has slope 3.1, which indicates that the solve time scales like $n^3$, as expected. The least-squares linear fit for the matrix-free SCS solve times has slope 1.1, which indicates that the solve time scales like the expected $n\log n$. The least-squares linear fit for the matrix-free POGS solve times in the range $n \in [10^5, 10^7]$ has slope 1.1, which indicates that the solve time scales like the expected $n\log n$. For $n < 10^5$, the GPU is not saturated, so increasing $n$ barely increases the solve time.

Figure 2.13: Solve time in seconds $T$ versus variable size $n$.

2.6.3 Sylvester LP

We applied our matrix-free convex optimization modeling system to Sylvester LPs, or convex optimization problems of the form
\[
\begin{array}{ll}
\mbox{minimize} & \mathbf{Tr}(D^TX) \\
\mbox{subject to} & AXB \leq C \\
& X \geq 0,
\end{array} \qquad (2.12)
\]
where $X \in \mathbf{R}^{p\times q}$ is the optimization variable, and $A \in \mathbf{R}^{p\times p}$, $B \in \mathbf{R}^{q\times q}$, $C \in \mathbf{R}^{p\times q}$, and $D \in \mathbf{R}^{p\times q}$ are problem data. The inequality $AXB \leq C$ is a variant of the Sylvester equation $AXB = C$ [89].

Existing convex optimization modeling systems will convert problem (2.12) into the vectorized format
\[
\begin{array}{ll}
\mbox{minimize} & \mathbf{vec}(D)^T\mathbf{vec}(X) \\
\mbox{subject to} & (B^T\otimes A)\mathbf{vec}(X) \leq \mathbf{vec}(C) \\
& \mathbf{vec}(X) \geq 0,
\end{array} \qquad (2.13)
\]
where $B^T\otimes A \in \mathbf{R}^{pq\times pq}$ is the Kronecker product of $B^T$ and $A$. Let $p = kq$ for some fixed $k$, and let $n = kq^2$ denote the size of the optimization variable. A standard interior-point solver will take $O(n^3)$ flops and $O(n^2)$ bytes of memory to solve problem (2.13). A specialized matrix-free solver that exploits the matrix product $AXB$, by contrast, can solve problem (2.12) in $O(n^{1.5})$ flops using $O(n)$ bytes of memory [219].

Figure 2.14: Solve time in seconds $T$ versus variable size $n$.

Problem instances. We used the following procedure to generate interesting (nontrivial) instances of problem (2.12). We fixed $p = 5q$ and generated $A$ and $B$ by drawing entries i.i.d. from the folded standard normal distribution (i.e., the absolute value of the standard normal distribution). We then set
\[
A = A/\|A\|_2 + I, \qquad B = B/\|B\|_2 + I,
\]
so that $A$ and $B$ had positive entries and bounded condition number. We generated $D$ by drawing entries i.i.d. from a standard normal distribution. We fixed $C = \mathbf{1}\mathbf{1}^T$. Our method of generating the problem data ensured the problem was feasible and bounded.

Results. Figure 2.14 compares the performance on problem (2.12) of the interior-point solver ECOS [71] and matrix-free versions of SCS and POGS as the size $n = 5q^2$ of the optimization variable increases. We limited the solvers to $10^4$ seconds.

For each variable size $n$ we generated ten different problem instances and recorded the average solve time for each solver. ECOS and matrix-free SCS were run with an absolute and relative tolerance of $10^{-3}$ for the duality gap, $\ell_2$ norm of the primal residual, and $\ell_2$ norm of the dual residual. Matrix-free POGS was run with an absolute tolerance of $10^{-4}$ and a relative tolerance of $10^{-3}$.

For each solver, we plot the solve times and the least squares linear fit to those solve times (the dotted line). The slopes of the lines show how the solvers scale. The least-squares linear fit for the ECOS solve times has slope 3.0, which indicates that the solve time scales like $n^3$, as expected. The least-squares linear fit for the matrix-free SCS solve times has slope 1.4, which indicates that the solve time scales like the expected $n^{1.5}$. The least-squares linear fit for the matrix-free POGS solve times in the range $n \in [5\times 10^5, 5\times 10^6]$ has slope 1.1. The solve time scales more slowly than the expected $n^{1.5}$, likely because the GPU was not fully saturated even on the largest problem instances. For $n < 5\times 10^5$, the GPU was far from saturated, so increasing $n$ barely increases the solve time.

Chapter 3

Stochastic matrix-free equilibration

3.1 Equilibration

Equilibration refers to scaling the rows and columns of a matrix so the norms of the rows are the same, and the norms of the columns are the same. Given a matrix $A \in \mathbf{R}^{m\times n}$, the goal is to find diagonal matrices $D \in \mathbf{R}^{m\times m}$ and $E \in \mathbf{R}^{n\times n}$ so that the rows of $DAE$ all have $\ell_p$-norm $\alpha$ and the columns of $DAE$ all have $\ell_p$-norm $\beta$. (The row and column norm values $\alpha$ and $\beta$ are related by $m\alpha^p = n\beta^p$ for $p < \infty$.) Common choices of $p$ are 1, 2, and $\infty$; in this chapter, we will focus on $\ell_2$-norm equilibration. Without loss of generality, we assume throughout that the entries of $D$ and $E$ are nonnegative.

Equilibration has applications to a variety of problems, including target tracking in sensor networks [131], web page ranking [141], and adjusting contingency tables to match known marginal probabilities [193]. The primary use of equilibration, however, is as a heuristic method for reducing condition number [31]; in turn, reducing condition number is a heuristic for speeding up a variety of iterative algorithms [174, Chap. 5], [208, 92]. Using equilibration to accelerate iterative algorithms is connected to the broader notion of diagonal preconditioning, which has been a subject of research for decades; see, e.g., [138, Chap. 2], [106, Chap. 10], [182, 95].

Equilibration has several justifications as a heuristic for reducing condition number. We will show in §3.2.2 that if $A$ is square and nonsingular, any $D$ and $E$ that equilibrate $A$ in the $\ell_2$-norm minimize a tight upper bound on $\kappa(DAE)$ over all diagonal $D$ and $E$. Scaling only the rows or only the columns of $A$ so they have the same $\ell_2$-norms (rather than scaling both at once, as we do here) also has a connection with minimizing condition number [200].

Another perspective is that equilibration minimizes a lower bound on the condition number. It is straightforward to show that the ratio between the largest and smallest $\ell_2$-norms of the columns of $DAE$ is a lower bound on $\kappa(DAE)$:
\[
\kappa(DAE) = \frac{\sup_{\|x\|_2=1}\|DAEx\|_2}{\inf_{\|x\|_2=1}\|DAEx\|_2} \geq \frac{\max_{x\in\{e_1,\ldots,e_n\}}\|DAEx\|_2}{\min_{x\in\{e_1,\ldots,e_n\}}\|DAEx\|_2}.
\]
The same inequality holds for the ratio between largest and smallest $\ell_2$-norms of the rows of $DAE$. For an equilibrated matrix these ratios are one, the smallest they can be.

Equilibration is an old problem and many techniques have been developed for it, such as theSinkhorn-Knopp [198] and Ruiz algorithms [187]. Existing `p-norm equilibration methods requireknowledge of (the entries of) the matrix |A|

p, where the function | · |p is applied elementwise. For this

reason equilibration cannot be used in matrix-free methods, which only interact with the matrix A

via multiplication of a vector by A or by AT . Such matrix-free methods play a crucial role in scientificcomputing and optimization. Examples include the conjugate gradients method [119], LSQR [177],and the Chambolle-Cremers-Pock algorithm [39].

In this chapter we introduce a stochastic matrix-free equilibration method that provably con-verges in expectation to the correct D and E. Our method builds on work by Bradley [31], whoproposed a matrix-free equilibration algorithm with promising empirical results but no theoreticalguarantees. Examples demonstrate that our matrix-free equlibration method converges far morequickly than the theoretical analysis suggests, delivering effective equilibration in a few tens of it-erations, each involving one multiplication by A and one by AT . We demonstrate the method onexamples of matrix-free iterative algorithms. We observe that the cost of equilibration is more thancompensated for by the speedup of the iterative algorithm due to diagonal preconditioning. We showhow our method can be modified to handle variants of the equilibration problem, such as symmetricand block equilibration.

3.2 Equilibration via convex optimization

3.2.1 The equilibration problem

Equilibration can be posed as the convex optimization problem [28]

minimize (1/2)P

m

i=1

Pn

j=1(Aij)2e2ui+2vj � ↵21Tu� �21T v, (3.1)

where u 2 Rm and v 2 Rn are the optimization variables [12]. The diagonal matrices D and E areobtained via

D = diag(eu1 , . . . , eum), E = diag(ev1 , . . . , evn).

Page 58: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 46

The optimality conditions for problem (3.1) are precisely that DAE is equilibrated, i.e.,

|DAE|21 = ↵21, |EATD|

21 = �21.

The problem (3.1) is unbounded below precisely when the matrix A cannot be equilibrated.Problem (3.1) can be solved using a variety of methods for smooth convex optimization [28,

174]. One attractive method, which exploits the special structure of the objective, is to alternatelyminimize over u and v. We minimize over u (or equivalently D) by setting

Dii = ↵

0

@nX

j=1

A2ijE2

jj

1

A�1/2

, i = 1, . . . ,m.

We minimize over v (E) by setting

Ejj = �

mX

i=1

A2ijD2

ii

!�1/2

, j = 1, . . . , n.

When m = n and ↵ = � = 1, the above updates are precisely the Sinkhorn-Knopp algorithm. Inother words, the Sinkhorn-Knopp algorithm is alternating block minimization for the problem (3.1).

3.2.2 Equilibration and condition number

In this subsection we show that equilibrating a square matrix minimizes an upper bound on thecondition number. We will not use these results in the sequel, where we focus on matrix-freemethods for equilibration.

For U 2 Rn⇥n nonsingular define the function � by

�(U) = exp�kUk2

F/2�/ det

⇣(UTU)1/2

⌘= exp

nX

i=1

�2i/2

!/

nY

i=1

�i,

where �1 � · · · � �n > 0 are the singular values of U . (Here kUkF denotes the Frobenius norm.)

Theorem 3.2.1. Let A be square and invertible. Then diagonal D and E equilibrate A, with rowand column norms one, if and only if they minimize �(DAE) over D and E diagonal.

Proof. We first rewrite problem (3.1) in terms of D and E to obtain

minimize (1/2)kDAEk2F�P

n

i=1 logDii �P

n

j=1 logEjj

subject to diag(D) > 0, diag(E) > 0, D,E diagonal,

Page 59: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 47

(Here we take ↵ = � = 1, so the row and column norms are one.) We can rewrite this problem as

minimize (1/2)kDAEk2F� log det

�(DAE)T (DAE)

�1/2

subject to diag(D) > 0, diag(E) > 0, D,E diagonal,

since the objective differs from the problem above only by the constant log det(ATA)1/2. Finally,taking the exponential of the objective, we obtain the equivalent problem

minimize �(DAE) = exp�(1/2)kDAEk2

F

�/ det

�(DAE)T (DAE)

�1/2

subject to diag(D) > 0, diag(E) > 0, D,E diagonal.

Thus diagonal (positive) D and E equilibrate A, with row and column norms one, if and only if theyminimize the objective of this problem.

Theorem 3.2.1 links equilibration with minimization of condition number because the function�(DAE) gives an upper bound on (DAE).

Theorem 3.2.2. Let U 2 Rn⇥n be nonsingular with singular values �1 � · · · � �n > 0 andcondition number = �1/�n. Then

2e�n/2�(U) � . (3.2)

Moreover this inequality is tight within a factor of 2, i.e., there exists U with condition number with

2e�n/2�(U) 2. (3.3)

Proof. We factor � into

�(U) = (�1,�n)n�1Y

i=2

�(�i),

where (�1,�n) = exp((�2

1 + �2n)/2)/(�1�n), �(�i) = exp(�2

i/2)/�i.

We first relate and the condition number, by minimizing (�1,�n) with �1 = �n (i.e., withcondition number ). We must minimize over �n the function

(�n,�n) =exp(�2

n(1 + 2)/2)

�2n

.

With change of variable z = �2n, this function is convex, with minimizer z = 2/(1+2) and minimum

value (e/2)(+ 1/). Therefore we have

(�1,�n) � (e/2)(+ 1/).

It is straightforward to show that �(�i) is convex, and minimized when �i = 1. Thus we have

Page 60: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 48

�(�i) � �(1) = e1/2. We combine these results to obtain the inequality

�(U) � (en/2/2)(+ 1/), (3.4)

which is sharp; indeed, it is tight when

�1 =

✓22

1 + 2

◆1/2

, �n =

✓2

1 + 2

◆1/2

,

and �i = 1 for i = 2, . . . , n� 1.The inequality (3.4) implies inequality (3.2), since +1/ � . With the values of �i that make

(3.4) tight, the inequality (3.3) holds because + 1/ 2.

Theorems 3.2.1 and 3.2.2 show that equilibration is the same as minimizing �(DAE) over diag-onal D and E, and that �(DAE) is an upper bound on (DAE), the condition number of DAE.

3.2.3 Regularized equilibration

The equilibration problem, and its equivalent convex optimization problem (3.1), suffer from severalflaws. The first is that not all matrices can be equilibrated [31]. For example, if the nonzero matrixA has a zero row or column, it cannot be equilibrated. As a less obvious example, a triangularmatrix with unit diagonal cannot be equilibrated. When the matrix A cannot be equilibrated, theconvex problem (3.1) is unbounded [81].

The second flaw is that even when the matrix A can be equilibrated problem (3.1) does not havea unique solution. Given a solution (u?, v?) to problem (3.1), the point (u? + �, v?� �) is a solutionfor any � 2 R. In other words, we can scale D by e� and E by e�� and still have DAE equilibrated.We would prefer to guarantee a solution where D and E have roughly the same scale. The final flawis that in practice we do not want the entries of D and E to be extremely large or extremely small;we may have limits on how much we are willing to scale the rows or columns.

We address these flaws by modifying the problem (3.1), adding regularization and box constraints,and reframing the equilibration problem as

minimize (1/2)P

m

i=1

Pn

j=1(Aij)2e2ui+2vj � ↵21Tu� �21T v + (�/2)

�����

"u

v

#�����

2

2

,

subject to kuk1 M, kvk1 M,

(3.5)

where � > 0 is the regularization parameter and the parameter M > 0 bounds the entries of D andE to lie in the interval [e�M , eM ]. The additional regularization term penalizes large choices of u andv (which correspond to large or small row and column scalings). It also makes the objective strictlyconvex and bounded below, so the modified problem (3.5) always has a unique solution (u?, v?),

Page 61: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 49

even when A cannot be equilibrated. Assuming the constraints are not active at the solution wehave

1Tu? = 1T v?,

which means that the optimal D and E have the same scale in the sense that the product of theirdiagonal entries are equal:

mY

i=1

Dii =nY

j=1

Ejj .

Problem (3.5) is convex and can be solved using a variety of methods. Block alternating mini-mization over u and v can be used here, as in the Sinkhorn-Knopp algorithm. We minimize over u

(or equivalently D) by setting

Dii = ⇧[e�2M ,e2M ]

0

@2↵2/� �W

0

@2e2↵2/�

nX

j=1

A2ijE2

jj/�

1

A

1

A1/2

, i = 1, . . . ,m,

where W is the Lambert W function [53] and ⇧[e�2M ,e2M ] denotes projection onto the interval[e�2M , e2M ]. We minimize over v (E) by setting

Ejj = ⇧[e�2M ,e2M ]

2�2/� �W

2e2�

2/�

mX

i=1

A2ijD2

ii/�

!!1/2

, j = 1, . . . , n.

When M = +1, m = n, and ↵ = � = 1, the D and E updates converge to the Sinkhorn-Knoppupdates as � ! 0 [127]. This method works very well, but like the Sinkhorn-Knopp method requiresaccess to the individual entries of A, and so is not appropriate as a matrix-free algorithm.

Of course, solving problem (3.5) does not equilibrate A exactly; unless � = 0 and the constraintsare not active, its optimality conditions are not that DAE is equilibrated. We can make the equili-bration more precise by decreasing the regularization parameter � and increasing the scaling boundM . But if we are using equilibration as a heuristic for reducing condition number, approximateequilibration is more than sufficient.

3.3 Stochastic method

In this section we develop a method for solving problem (3.5) that is matrix-free, i.e., only accessesthe matrix A by multiplying a vector by A or by AT . (Of course we can find all the entries ofA by multiplying A by the unit vectors ei, i = 1, . . . , n; then, we can use the block minimizationmethod described above to solve the problem. But our hope is to solve the problem with far fewermultiplications by A or AT .)

Page 62: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 50

3.3.1 Unbiased gradient estimate

Gradient expression. Let f(u, v) denote the objective function of problem (3.5). The gradientruf(u, v) is given by

ruf(u, v) = |DAE|21� ↵2 + �u.

Similarly, the gradient rvf(u, v) is given by

rvf(u, v) =��EATD

��2 1� �2 + �v.

The first terms in these expressions, |DAE|21 and

��EATD��2 1, are the row norms squared of the

matrices DAE and EATD, respectively. These are readily computed if we have access to the entriesof A; but in a matrix-free setting, where we can only access A by multiplying a vector by A or AT , itis difficult to evaluate these row norms. Instead, we will estimate them using a randomized method.

Estimating row norms squared. Given a matrix B 2 Rm⇥n, we use the following approachto get an unbiased estimate z of the row norms squared |B|

21. We first sample a random vectors 2 Rn whose entries si 2 {�1, 1} are drawn independently with identical distribution (IID), withprobability one half for each outcome. We then set z = |Bs|2. This technique is discussed in[31, 18, 130].

To see that E[z] = |B|21, consider (bT s)2, where b 2 Rn. The expectation of (bT s)2 is given by

E[(bT s)2] =nX

i=1

b2iE[s2

i] +X

i 6=j

bibjE[sisj ] =nX

i=1

b2i.

As long as the entries of s are IID with mean 0 and variance 1, we have E[(bT s)2] =P

n

i=1 b2i.

Drawing the entries of s from {�1, 1}, however, minimizes the variance of (bT s)2.

3.3.2 Projected stochastic gradient

Method. We follow the projected stochastic gradient method described in [147] and [35, Chap. 6],which solves convex optimization problems of the form

minimize f(x)

subject to x 2 C,(3.6)

where x 2 Rn is the optimization variable, f : Rn! R is a strongly convex differentiable function,

and C is a convex set, using only an oracle that gives an unbiased estimate of rf , and projectiononto C.

Specifically, we cannot evaluate f(x) or rf(x), but we can evaluate a function g(x,!) and samplefrom a distribution ⌦ such that E!⇠⌦g(x,!) = rf(x). Let µ be the strong convexity constant for

Page 63: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 51

f and ⇧C : Rn! Rn denote the Euclidean projection onto C. Then the method consists of T

iterations of the update

xt := ⇧C

�xt�1

� ⌘tg(xt�1,!)

�,

where ⌘t = 2/(µ(t+ 1)) and ! is sampled from ⌦. The final approximate solution x is given by theweighted average

x =TX

t=0

2(t+ 1)

(T + 1)(T + 2)xt.

Algorithm (9) gives the full projected stochastic gradient method in the context of problem (3.5).Recall that the objective of problem (3.5) is strongly convex with strong convexity parameter �.

Algorithm 9 Projected stochastic gradient method for problem (3.5).

Input: u0 = 0, v0 = 0, u = 0, v = 0, and ↵,�, �,M > 0.

for t = 1, 2, . . . , T doD diag(eu

t�11 , . . . , eu

t�1m ), E diag(ev

t�11 , . . . , ev

t�1n ).

Draw entries of s 2 Rn and w 2 Rm IID uniform from {�1, 1}.ut ⇧[�M,M ]m

�ut�1

� 2�|DAEs|2 � ↵21+ �ut�1

�/(�(t+ 1))

�.

vt ⇧[�M,M ]n�vt�1

� 2�|EATDw|2 � �21+ �vt�1

�/(�(t+ 1))

�.

u 2ut/(t+ 2) + tu/(t+ 2).v 2vt/(t+ 2) + tv/(t+ 2).

Output: D = diag(eu1 , . . . , eum) and E = diag(ev1 , . . . , evn).

Convergence rate. Algorithm (9) converges in expectation to the optimal value of problem (3.5)with rate O(1/t) [147]. Let f(u, v) : Rm

⇥Rn! R denote the objective of problem (3.5), let (u?, v?)

denote the problem solution, and let g(u, v, s, w) : Rm⇥Rn

⇥ {�1, 1}n⇥ {�1, 1}m ! Rm+n be theestimate of rf(u, v) given by

g(u, v, s, w) =

"|DAEs|2 � ↵21+ �u

|EATDw|2 � �21+ �v

#.

Then after T iterations of the algorithm we have

E(uT ,vT ),...,(u1,v1)f(u, v)� f(u?, v?) C

µ(T + 1),

where C is a constant bounded above by

C max(u,v)2[�M,M ]m⇥n

2Es,wkg(u, v, s, w)k22.

Page 64: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 52

In the expectation s and w are random variables with entries drawn IID uniform from {�1, 1}.We can make the bound more explicit. It is straightforward to show the equality

Es,wkg(u, v, s, w)k22 = krf(u, v)k22 + 31T

������

�����

"DAE

EATD

#�����

2

1

������

2

� 41T|DAE|

41,

and the inequality

max(u,v)2[�M,M ]m⇥n

krf(u, v)k22 krf(M1,M1)k22 + 4�M(↵2m+ �2n).

We combine these two results to obtain the bound

C/2 krf(M1,M1)k22 + 4�M(↵2m+ �2n) + e8M

0

B@31T

������

�����

"A

AT

#�����

2

1

������

2

� 41T|A|

41

1

CA .

Our bound on C is quite large. A more thorough analysis could improve the bound by consideringthe relative sizes of the different parameters and entries of A. For instance, it is straightforward toshow that for t = 1, . . . , T we have

ut

i ↵2/�, i = 1, . . . ,m, vt

j �2/�, j = 1, . . . , n,

which gives a tighter bound if ↵2/� < M or �2/� < M . In any case, we find that in practice nomore than tens of iterations are required to reach an approximate solution.

3.4 Numerical experiments

We evaluated algorithm (9) on many different matrices A. We only describe the results for a singlenumerical experiment, but we obtained similar results for our other experiments. For our numericalexperiment we generated a sparse matrix A 2 Rm⇥n, with m = 2⇥104 and n = 104, with 1% of theentries chosen uniformly at random to be nonzero, and nonzero entries drawn IID from a standardnormal distribution. We next generated vectors u 2 Rm and v 2 Rn with entries drawn IID from anormal distribution with mean 1 and variance 1. We set the final matrix A to be

A = diag�eu1 , . . . , eum

�Adiag

�ev1 , . . . , evn

�.

We ran algorithm (9) for 1000 iteratons to obtain an approximate solution f(u, v). We used theparameters ↵ = ( n

m)1/4, � = (m

n)1/4, � = 10�1 and M = log(104). We obtained the exact solution

p? to high accuracy using Newton’s method with backtracking line search. (Newton’s method does

Page 65: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 53

Figure 3.1: Problem (3.5) optimality gap and RMS error versus iterations t.

not account for constraints, but we verified that the constraints were not active at the solution.)Figure 3.1 plots the relative optimality gap (f(u, v) � p?)/f(0, 0) and the RMS equilibration

error,

1pm+ n

0

@mX

i=1

✓qeTi|DAE|21� ↵

◆2

+nX

j=1

✓qeTj|EATD|

2 1� �

◆21

A1/2

,

versus iteration. The RMS error shows how close DAE is to equilibrated; we do not expect it toconverge to zero because of the regularization.

The objective value and RMS error decrease quickly for the first few iterations, with oscillations,and then decrease smoothly but more slowly. The slopes of the lines show the convergence rate. Theleast-squares linear fit for the optimality gap has slope �2.0, which indicates that the convergencewas (much) faster than the theoretical upper bound 1/t.

Figure 3.2 shows the condition number of DAE versus iteration. While equilibration merelyminimizes an upper bound on the condition number, in this case the condition number correspondedquite closely with the objective of problem (3.5). The plot shows that after 4 iterations (DAE)

is back to the original condition number (A) = 104. After 100 iterations the condition number isreduced by 200⇥, and it continues to decrease with further iterations.

Page 66: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 54

Figure 3.2: Condition number of DAE versus iterations t.

3.5 Applications

3.5.1 LSQR

The LSQR algorithm [177] is an iterative matrix-free method for solving the linear system Ax = b,where x 2 Rn, A 2 Rn⇥n and b 2 Rn. Each iteration of LSQR involves one multiplication by A

and one by AT . LSQR is equivalent in exact arithmetic to applying the conjugate gradients method[119] to the normal equations ATAx = AT b, but in practice has better numerical properties. Anupper bound on the number of iterations needed by LSQR to achieve a given accuracy grows with(A) [174, Chap. 5]. Thus decreasing the condition number of A via equilibration can acceleratethe convergence of LSQR. (Since LSQR is equivalent to conjugate gradients applied to the normalequations, it computes the exact solution in n iterations, at least in exact arithmetic. But withnumerical roundoff error this does not occur.)

We use equilibration as a preconditioner by solving the linear system (DAE)x = Db with LSQRinstead of Ax = b; we then recover x from x via x = Ex. We measure the accuracy of an approximatesolution x by the residual kAx� bk2 rather than by residual kDAEx�Dbk2 of the preconditionedsystem, since our goal is to solve the original system Ax = b.

We compared the convergence rate of LSQR with and without equilibration. We generated thematrix A 2 Rn⇥n as in §3.4, with n = 104. We choose b 2 Rn by first generating x?

2 Rn bydrawing entries IID from a standard normal distribution, and then setting b = Ax?.

Page 67: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 55

Figure 3.3: Residual versus iterations t for LSQR.

We generated equilibrated matrices D10AE10, D30AE30, D100AE100, and D300AE300 by runningalgorithm (9) for 10, 30, 100, and 300 iterations, respectively. We used the parameters ↵ = (n/m)1/4,� = (m/n)1/4, � = 10�1 and M = log(104). Note that the cost of equlibration iterations is the sameas the cost of LSQR iterations, since each involves one multiply by A and one by AT .

Figure 3.3 shows the results of running LSQR with and without equilibration, from the initialiterate x0 = 0. We show the relative residual kAxt

� bk2/kbk2 versus iterations, counting theequilibration iterations, which can be seen as the original flat portions at the beginning of each curve.We can see that to achieve relative accuracy 10�4, LSQR without preconditioning requires around104 iterations; with preconditioning with 30 or more iterations of equilibration, it requires more than10⇥ fewer iterations. We can see that higher accuracy justifies more equilibration iterations, butthat the choice of just 30 equilibration iterations does very well. We can see that 10 iterations ofequilibration is too few, and only improves LSQR convergence a small amount.

3.5.2 Chambolle-Cremers-Pock

The Chambolle-Cremers-Pock (CCP) algorithm [39, 183] is an iterative method for solving convexoptimization problems of the form

minimize f(x) + g(Ax),

Page 68: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 56

where x 2 Rn is the variable, A 2 Rm⇥n is problem data, and f and g are convex functions. Eachiteration of CCP requires one multiplication by A and one by AT . Chambolle and Pock do not showa dependence on (A) in their analysis of the algorithm convergence rate, but we nonetheless mightexpect that equilibration will accelerate convergence.

We compared the convergence rate of CCP with and without equilibration on the Lasso problem[85, §3.4]

minimize kAx� bk22/p�+p�kxk1.

We generated the matrix A 2 Rm⇥n as in §3.4, with m = 104 and n = 2 ⇥ 104. We generatedb 2 Rm by first generating x 2 Rn by choosing n/10 entries uniformly at random to be nonzero anddrawing those entries IID from a standard normal distribution. We then set b = Ax+ ⌫, where theentries of ⌫ 2 Rm were drawn IID from a standard normal distribution. We set � = 10�3

kAT bk1

and found the optimal value p? for the Lasso problem using CVXPY [65] and GUROBI [110].We generated equilibrated matrices D10AE10, D30AE30, D100AE100, and D300AE300 by running

algorithm (9) for 10, 30, 100, and 300 iterations, respectively. We used the parameters ↵ = (n/m)1/4,� = (m/n)1/4, � = 10�1 and M = log(104).

Figure 3.4 shows the results of running CCP with and without equilibration. We used theparameters ⌧ = � = 0.9/kDkAEkk2 and ✓ = 1 and set all initial iterates to 0. We show the relativeoptimality gap (f(xt)�p?)/f(0) versus iterations, counting the equilibration iterations, which can beseen as the original flat portions at the beginning of each curve. We can see that to achieve relativeaccuracy 10�6, CCP without preconditioning requires around 1000 iterations; with preconditioningwith 100 iterations of equilibration, it requires more than 4⇥ fewer iterations. CCP converges to ahighly accurate solution with just 100 equilibration iterations, so additional equilibration iterationsare unnecessary. We can see that 10 and 30 iterations of equilibration are too few, and do notimprove CCP’s convergence.

3.6 Variants

In this section we discuss several variants of the equilibration problem that can also be solved in amatrix-free manner.

Symmetric equilibration. When equilibrating a symmetric matrix A 2 Rn⇥n, we often want theequilibrated matrix DAE to also be symmetric. For example, to use equilibration as a preconditionerfor the conjugate gradients method, DAE must be symmetric [119]. We make DAE symmetric bysetting D = E.

Symmetric equilibration can be posed as the convex optimization problem

minimize (1/4)P

n

i=1

Pn

j=1(Aij)2e2ui+2uj � ↵21Tu, (3.7)

Page 69: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 57

Figure 3.4: Optimality gap versus iterations t for CCP.

where u 2 Rn is the optimization variable and ↵ > 0 is the desired value of the row and columnnorms. We approximately solve problem (3.7) by adding regularization and box constraints as inproblem (3.5) and then applying algorithm (10), a simple modification of algorithm (9) with thesame convergence guarantees.

Algorithm 10 Projected stochastic gradient method for symmetric equilibration.Input: u0 = 0, u = 0, and ↵, �,M > 0.

for t = 1, 2, . . . , T doD diag(eu

t�11 , . . . , eu

t�1n ).

Draw entries of s 2 Rn IID uniform from {�1, 1}.ut ⇧[�M,M ]n

�ut�1

� 2�|DADs|2 � ↵21 + �ut�1

�/(�(t+ 1))

�.

u 2ut/(t+ 2) + tu/(t+ 2).

Output: D = diag(eu1 , . . . , eun).

Varying row and column norms. In standard equilibration we want all the row norms of DAE

to be the same and all the column norms to be the same. We might instead want the row andcolumn norms to equal known vectors r 2 Rm and c 2 Rn, respectively. The vectors must satisfyrT r = cT c.

Equilibration with varying row and column norms can be posed as the convex optimization

Page 70: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 58

problemminimize (1/2)

Pm

i=1

Pn

j=1(Aij)2e2ui+2vj � rTu� cT v, (3.8)

where as usual u 2 Rm and v 2 Rn are the optimization variables. We approximately solve problem(3.8) by adding regularization and box constraints as in problem (3.5) and then applying algorithm(9) with the appropriate modification to the gradient estimate.

Block equilibration. A common constraint when using equilibration as a preconditioner is thatthe diagonal entries of D and E are divided into blocks that all must have the same value. Forexample, suppose we have a cone program

minimize cTx

subject to Ax+ b 2 K,

where x 2 Rn is the optimization variable, c 2 Rn, b 2 Rm, and A 2 Rm⇥n are problem data, andK = K1 ⇥ · · ·⇥K` is a product of convex cones.

If we equilibrate A we must ensure that DK = K. Let mi be the dimension of cone Ki. A simplesufficient condition for DK = K is that D have the form

D = diag(eu1Im1 , . . . , eupImp), (3.9)

where u 2 Rp and Imi is the mi-by-mi identity matrix. Given the constraint on D, we cannot ensurethat each row of DAE has norm ↵. Instead we view each block of mi rows as a single vector andrequire that the vector have norm pmi↵.

In the full block equilibration problem we also require that E have the form

E = diag(ev1In1 , . . . , evqInq ), (3.10)

where v 2 Rq and Inj is the nj-by-nj identity matrix. Again, we view each block of nj columns asa single vector and require that the vector have norm pnj�.

Block equilibration can be posed as the convex optimization problem

minimize (1/2)1T|DAE|

21� ↵2uT

2

664

m1

...mp

3

775� �2vT

2

664

n1

...nq

3

775 , (3.11)

where D and E are defined as in equations (3.9) and (3.10). We approximately solve problem(3.11) by adding regularization and box constraints as in problem (3.5) and then applying algorithm(9) with the appropriate modification to the gradient estimate. Our stochastic matrix-free blockequilibration method is used in the matrix-free versions of the cone solvers SCS [175] and POGS

Page 71: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 3. STOCHASTIC MATRIX-FREE EQUILIBRATION 59

[81] described in [64, 66].

Tensor equilibration. We describe here the case of 3-tensors; the generalization to higher ordertensors is clear. We are given a 3-dimensional array A 2 Rm⇥n⇥p, and seek coordinate scalingsd 2 Rm, e 2 Rn, f 2 Rp for which

⇣Pn

j=1

Pp

k=1 A2ijk

d2ie2jf2k

⌘1/2= ↵, i = 1, . . . ,m

⇣Pm

i=1

Pp

k=1 A2ijk

d2ie2jf2k

⌘1/2= �, j = 1, . . . , n

⇣Pm

i=1

Pn

j=1 A2ijk

d2ie2jf2k

⌘1/2= �, k = 1, . . . , p.

Here ↵,�, � > 0 are constants that satisfy m↵2 = n�2 = p�2.Tensor equilibration can be posed as the convex optimization problem

minimize (1/2)P

m

i=1

Pn

j=1

Pp

k=1(A2ijk

)e2(ui+vj+wk) � ↵21Tu� �21T v � �21Tw, (3.12)

where u 2 Rm, v 2 Rn, and w 2 Rp are the optimization variables. We can solve problem (3.12)using a simple variant of algorithm (9) that only interacts with the array A via the matrix-to-vectoroperations

X !P

m

i=1

Pn

j=1 AijkXij

Y !P

m

i=1

Pp

k=1 AijkYik

Z !P

n

j=1

Pp

k=1 AijkZjk.

Page 72: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

Chapter 4

NCVX

4.1 Introduction

4.1.1 The problem

We consider the optimization problem

minimize f0(x, z)

subject to fi(x, z) 0, i = 1, . . . ,m

Ax+Bz = c

z 2 Z,

(4.1)

where x 2 Rn and z 2 Rq are the decision variables, A 2 Rp⇥n, B 2 Rp⇥q, c 2 Rp are problemdata, and Z ✓ Rq is closed. We assume that the objective and inequality constraint functionsf0, . . . , fm : Rn

⇥ Rq! R are jointly convex in x and z. When the set Z is convex, (4.1) is a

convex optimization problem, but we are interested here in the case where Z is not convex. Roughlyspeaking, the problem (4.1) is a convex optimization problem, with some additional nonconvexconstraints, z 2 Z. We can think of x as the collection of decision variables that appear onlyin convex constraints, and z as the decision variables that are directly constrained to lie in the(generally) nonconvex set Z. The set Z is often a Cartesian product, Z = Z1 ⇥ · · · ⇥ Zk, whereZi ⇢ Rqi are sets that are simple to describe, e.g., Zi = {0, 1}. We denote the optimal value ofthe problem (4.1) as p?, with the usual conventions that p? = +1 if the problem is infeasible, andp? = �1 if the problem is unbounded below.

60

Page 73: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 4. NCVX 61

4.1.2 Special cases

Mixed integer convex optimization. When Z = {0, 1}q, the problem (4.1) is a general mixedinteger convex program, i.e., a convex optimization problem in which some variables are constrainedto be Boolean. (‘Mixed Boolean’ would be a more accurate name for such a problem, but ‘mixedinteger’ is commonly used.) It follows that the problem (4.1) is NP-hard; it includes as a specialcase, for example, the general Boolean satisfaction problem.

Cardinality constrained convex optimization. As another broad special case of (4.1), considerthe case Z = {z 2 Rq

| card(z) k, kzk1 M}, where card(z) is the number of nonzero elementsof z, and k and M are given. We call this the general cardinality-constrained convex problem. Itarises in many interesting applications, such as regressor selection.

Other special cases. As we will see in §4.6, many (hard) problems can be formulated in theform (4.1). More examples include regressor selection, 3-SAT, circle packing, the traveling salesmanproblem, factor analysis modeling, inexact graph isomorphism, and many more.

4.1.3 Convex relaxation

Convex relaxation of a set. For bounded sets Z there usually is a manageable full or partialdescription of the convex hull of Z. By this we mean a (modest-sized) set of convex inequality andlinear equality constraints that hold for every z 2 Z:

z 2 Z =) hi(z) 0, i = 1, . . . , s, Fz = g.

We will assume that these relaxation constraints are included in the convex constraints of (4.1).Adding these relaxation constraints to the original problem yields an equivalent problem (sincethe added constraints are redundant), but can improve the convergence of any method, global orheuristic. By tractable, we mean that the number of added constraints is modest, and in particular,polynomial in q.

For example, when Z = {0, 1}q, we have the inequalities 0 zi 1, i = 1, . . . , q. (Theseinequalities define the convex hull of Z, i.e., all other convex inequalities that hold for all z 2 Z areimplied by them.) When

Z = {z 2 Rq| card(z) k, kzk1 M},

we have the convex inequalities

kzk1 kM, kzk1 M.

Page 74: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 4. NCVX 62

(These inequalities define the convex hull of Z.) For general bounded Z the inequality kzk1 M

will always be a convex relaxation for some value of M .

Relaxed problem. If we remove the nonconvex constraint z 2 Z, we get a convex relaxation ofthe original problem:

minimize f0(x, z)

subject to fi(x, z) 0, i = 1, . . . ,m

Ax+Bz = c.

(4.2)

(Recall that convex equalities and inequalities known to hold for z 2 Z have been incorporated inthe convex constraints.) The relaxed problem is convex; its optimal value is a lower bound on theoptimal value p? of (4.1). A solution (x⇤, z⇤) to problem (4.2) need not satisfy z⇤ 2 Z, but if it does,the pair (x⇤, z⇤) is optimal for (4.1).

4.1.4 Projections and approximate projections

Our methods will make use of tractable projection, or tractable approximate projection, onto the setZ. The usual Euclidean projection onto Z will be denoted ⇧. (It need not be unique when Z is notconvex.) By approximate projection, we mean any function ⇧ : Rq

! Z that satisfies ⇧(z) = z forz 2 Z. A useful approximate projection ⇧(z) will also approximately minimize ku� zk22 over u 2 Z,but since all the algorithms we present are heuristics, we do not formalize this requirement. We useapproximate projections when computing an exact projection onto the set Z is too expensive.

For example, when Z = {0, 1}q, exact projection is given by rounding the entries to {0, 1}. Asa less trivial example, consider the cardinality-constrained problem. The projection of z onto Z isgiven by

(⇧ (z))i=

8>>>><

>>>>:

M zi > M, i 2 I

�M zi < �M, i 2 I

zi |zi| M, i 2 I

0 i 62 I,

where I ✓ {1, . . . , q} is a set of indices of k largest values of |zi|. We will describe many projections,and some approximate projections, in §4.4.

4.1.5 Residual and merit functions

For any (x, z) with z 2 Z, we define the constraint residual as

r(x, z) =mX

i=1

(fi(x, z))+ + kAx+Bz � ck1,

Page 75: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 4. NCVX 63

where (u)+ = max{u, 0} denotes the positive part; (x, z) is feasible if and only if r(x, z) = 0. Notethat r(x, z) is a convex function of (x, z). We define the merit function of a pair (x, z) as

⌘(x, z) = f0(x, z) + �r(x, z),

where � > 0 is a parameter. The merit function is also a convex function of (x, z).When Z is convex and the problem is feasible, minimizing ⌘(x, z) for large enough � yields a

solution of the original problem (4.1) (that is, the residual is a so-called exact penalty function);when the problem is not feasible, it tends to find approximate solutions that satisfy many of theconstraints [115, 63, 78].

We will use the merit function to judge candidate approximate solutions (x, z) with z 2 Z; thatis, we take a pair with lower merit function value to be a better approximate solution than one withhigher merit function value. For some problems (for example, unconstrained problems) it is easy tofind feasible points, so all candidate points will be feasible. The merit function then reduces to theobjective value. At the other extreme, for feasibility problems the objective is zero, and the goal isto find a feasible point. In this case the merit function reduces to �r(x, z), i.e., a positive multipleof the residual function.

4.1.6 Solution methods

In this section we describe various methods for solving the problem (4.1), either exactly (globally)or approximately.

Global methods. Depending on the set Z, the problem (4.1) can be solved globally by a varietyof algorithms, including (or mixing) branch-and-bound [150, 169, 34], branch-and-cut [176, 210, 206],semidefinite hierarchies [197], or even direct enumeration when Z is a finite set. In each iterationof these methods, a convex optimization problem derived from (4.1) is solved, with Z removed, and(possibly) additional variables and convex constraints added. While for many applications thesemethods are effective, they are generally thought to have high worst-case complexities and indeedcan be very slow for some problems.

Local solution methods and heuristics. A local method for (4.1) solves a modest number ofconvex problems, in an attempt to find a good approximate solution, i.e., a pair (x, z) with z 2 Z

and a low value of the merit function ⌘(x, z). For a feasibility problem, we might hope to find asolution; and if not, find one with a small constraint residual. For a general problem, we can hopeto find a feasible point with low objective value, ideally near the lower bound on p? from the relaxedproblem. If we cannot find any feasible points, we can settle for a pair (x, z) with z 2 Z and lowmerit function value. All of these methods are heuristics, in the sense that they cannot in general

Page 76: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 4. NCVX 64

be guaranteed to find an optimal, or even good, or even feasible, point in only a modest number ofiterations.

There are of course many heuristics for the general problem (4.1) and for many of its specialcases. For example, any global optimization method can be stopped after some modest number ofiterations; we then take the best point found (in terms of the merit function) as our approximatesolution. (We will discuss some local search methods, including neighbor search and polishing, in§4.2.)

4.1.7 Our approach

The purpose of this chapter is to describe a general system for heuristic solution of (4.1), based onsolving a modest number of convex problems derived from (4.1). By heuristic, we mean that thealgorithm need not find an optimal point, or indeed, even a feasible point, even when one exists. Wewould hope that for many feasible problem instances from some application, the algorithm does finda feasible point, and one with objective not too far from the optimal value. The disadvantage of aheuristic over a global method is clear and simple: it need not find an optimal point. The advantageof a heuristic is that it can be (and often is) dramatically faster to carry out than a global method.Moreover there are many applications where a heuristic method for (4.1) is sufficient because thedifference between a globally optimal solution and a solution that is only close to optimal is notsignificant in practice.

ADMM. One of the heuristic methods described in this chapter is based on the alternating di-rections method of multipliers (ADMM), an operator splitting algorithm originally devised to solveconvex optimization problems. (See e.g., [27] and [75] for comprehensive tutorials on ADMM.) Wecall this heuristic nonconvex alternating directions method of multipliers (NC-ADMM). ADMM wasintroduced in the mid-1970s [96, 88]. More recently, ADMM has found applications in a variety ofdistributed settings in machine learning such as model fitting, resource allocation, and classification.(See e.g., [224, 195, 225, 234, 168, 192, 11].) The idea of using ADMM as a general purpose heuristicto solve nonconvex problems was mentioned in [27, Ch. 9] and was further explored in [62]. Con-sensus ADMM has been used for nonconvex quadratically constrained quadratic programs in [129].In [228], ADMM has been applied to nonnegative matrix factorization with missing values. ADMMalso has been used for real and complex polynomial optimization models in [135], for constrainedtensor factorization in [153], and for optimal power flow in [76]: all nonconvex problems. ADMMcan be viewed as a version of the method of multipliers [118, 184, 21], where a Gauss-Seidel passover x and z is used instead of the usual joint minimization. There is a long history of using themethod of multipliers to (attempt to) solve nonconvex problems [41, 42, 124, 126, 180, 226, 152].Instead of basing our heuristic on ADMM, which is Douglas-Rachford splitting [74] applied to thedual problem, we could also have used Spingarn’s method [203], which is Douglas-Rachford splitting

Page 77: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 4. NCVX 65

applied directly to the primal problem. For nonconvex problems the two approaches could yielddifferent results.

Our contribution. Our main contribution is to identify a small number of concepts and methodsfor heuristics for nonconvex problems that can be applied across a very wide variety of problems.The only essential one is a projection, or even just an approximate projection, onto the nonconvexsets that appear in the problem. The others, which can dramatically improve the performance ofthe heuristic, are to identify a convex relaxation for each nonconvex set, a convex restriction at ageneral point in each nonconvex set, and a method to identify or list neighbors of a given point (ina discrete nonconvex set). We have implemented a general purpose system that uses just these fourmethods and handles a variety of different problems. Our implementation is readily extensible; theuser only needs to implement these four methods for any new nonconvex set to be added.

Outline. The chapter has the following structure. In §4.2 we discuss local search methods anddescribe how they can be used as solution improvement methods. This will enable us to studysimple but sophisticated methods such as relax-round-polish and iterative neighbor search. In §4.3 wepresent a heuristic for problem (4.1) based on ADMM, which makes use of the solution improvementmethods in §4.2. In §4.4 we catalog a variety of nonconvex sets for which Euclidean projection orapproximate projection is easily evaluated and, when applicable, we discuss relaxations, restrictions,and distance functions that define the set of neighbors for a given point. In §4.5 we discuss animplementation of our general system for heuristic solution NCVX, as an extension of CVXPY[68], a Python package for formulating and solving convex optimization problems. The object-oriented features of CVXPY make the extension particularly simple to implement. Finally, in §4.6we demonstrate the performance of our methods on several example problems.

4.2 Local improvement methods

In this section we describe some simple general local search methods. These methods take a pointz 2 Z and by performing a local search on z they find a candidate pair (x, z), with z 2 Z anda lower merit function. We will see that for many applications using these methods with a goodinitialization will result in an approximate solution. We will also see how we can use these methodsto improve solution candidates from other heuristics, hence we refer to these methods as solutionimprovement.

4.2.1 Polishing

Convex restriction. For non-discrete Z, the idea of a convex restriction at a point is useful forlocal search methods. A convex restriction at a point z 2 Z is a convex set Z

rstr(z) that satisfies

Page 78: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 4. NCVX 66

z 2 Zrstr(z) ✓ Z. The trivial restriction given by Z

rstr(z) = {z} is valid for all Z. When Z isdiscrete, for example Z = {0, 1}q, the trivial restriction is the only restriction. In other cases we canhave interesting nontrivial restrictions. For example, with Z = {z 2 Rq

| card(z) k, kzk1 M},we can take as restriction Z

rstr(z) the set of vectors z with the same sparsity pattern as z, andkzk1 M .

Polishing. Given any point z 2 Z, we can replace the constraint z 2 Z with z 2 Zrstr(z) to get

the convex problemminimize ⌘(x, z)

subject to z 2 Zrstr(z),

(4.3)

with variables x, z. (When the restriction Zrstr(z) is the trivial one, i.e., a singleton, this is equivalent

to fixing z = z and minimizing over x.) We call this problem the convex restriction of (4.1) at thepoint z. The restricted problem is convex, and its optimal value is an upper bound on p? assuming� is sufficiently large in ⌘(x, z) = f0(x, z) + �r(x, z) to ensure r(x, z) = 0.

As a simple example of polishing consider the mixed integer convex problem. The only restrictionis the trivial one, so the polishing problem for a given Boolean vector z simply fixes the values ofthe Boolean variables and solves the convex problem over the remaining variables, i.e., x. Forthe cardinality-constrained convex problem, polishing fixes the sparsity pattern of z and solves theresulting convex problem over z and x.

For problems with nontrivial restrictions, we can solve the polishing problem repeatedly untilconvergence. In other words we can use the output of the polishing problem as an initial pointfor another polishing problem and keep iterating until the merit function stops improving. Thistechnique is called iterated polishing and described in algorithm 11.

Algorithm 11 Iterated polishingInput: (x, z)

do(xold, zold) (x, z).Find (x, z) by solving the polishing problem with restriction z 2 Z

rstr(zold).while ⌘(x, z) < ⌘(xold, zold)

return (x, z).

If there exists a point x such that (x, z) is feasible, the restricted problem is feasible too. Therestricted problem need not be feasible in general, but if it is, with solution (x, z), then the pair(x, z) is feasible for the original problem (4.1) and satisfies f0(x, z) f0(x, z) for any x for which(x, z) is feasible. So polishing can take a point z 2 Z (or a pair (x, z)) and produce another pair(x, z) with a possibly better objective value.

Page 79: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 4. NCVX 67

4.2.2 Relax-round-polish

With the simple tools described so far (i.e., relaxation, polishing, and projection) we can createseveral heuristics for approximately solving the problem (4.1). A basic version solves the relaxation,projects the relaxed value of z onto Z, and then iteratively polishes the result.

Algorithm 12 Relax-round-polish heuristic

Solve the convex relaxation (4.2) to obtain (xrlx, zrlx).zrnd ⇧(zrlx).Use algorithm 11 on (xrlx, zrnd) to get (x, z).

Note that in the first step we also obtain a lower bound on the optimal value p?; in the polishingstep we obtain an upper bound and a feasible pair (x, z) that achieves the upper bound providedthat polishing is successful. The best outcome is for these bounds to be equal, which means that wehave found a (global) solution of (4.1) (for this problem instance). But relax-round-polish can fail;for example, it can fail to find a feasible point even though one exists.

Many variations on relax-round-polish are possible. We can introduce randomization by replacingthe round step with

zrnd = ⇧(zrlx + w),

where w is a random vector. We can repeat this heuristic with N different random instances of w.For each of N samples of w, we polish, giving us a set of N candidate approximate solutions. Wethen take as our final approximate solution the best among these N candidates, i.e., the one withleast merit function.

4.2.3 Neighbor search

Neighbors. When Z is discrete, convex restrictions are not useful for local search. Instead weuse the concept of neighbors of a point z 2 Z as a discrete analogue to a restriction. As with arestriction, we do local search over the set of neighbors. Neighbors are defined in terms of a distancefunction Z

dist : Z ⇥ Z ! Z+ [ {+1}. The set of neighbors of a point z 2 Z within distanceD 2 Z+, denoted Z

ngbr(z,D), is given by Zngbr(z, k) = {Z

dist(y, z) D | y 2 Z}. We select adistance function and distance D such that the size of Zngbr(z,D) is computationally tractable forall z 2 Z. For non-discrete Z, we use the trivial distance function

Zdist(z, y) =

(0 z = y

+1 z 6= y,

for which Zngbr(z,D) = {z} for all z and D.

Page 80: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 4. NCVX 68

For example, for the set of Boolean vectors in Rn we use Hamming distance, the number of entriesin which two Boolean vectors differ. Hence the neighbors of a Boolean vector z within distance D

are the set of vectors that differ from z in up to D components. We define the distance between twopermutation matrices as the minimum number of swaps of adjacent rows and columns necessary totransform the first matrix into the second. With this distance metric, neighbors of a permutationmatrix Z within distance D are the set of permutation matrices generated by swapping any twoadjacent rows or columns in Z up to D times. We define distance in terms of swaps of adjacent rowsand columns rather than swaps of arbitrary rows and columns to reduce the number of neighbors.

For Cartesian products of discrete sets we use the sum of distances. In this case, for z =

(z1, z2, . . . , zq) 2 Z = Z1 ⇥ Z2 ⇥ . . . ⇥ Zq, neighbors of z within distance D are points of the form(z1, z2, . . . , zq) where

Pq

i=1 Zdisti

(zi, zi) D.

Basic neighbor search. We introduced polishing as a tool that can find a pair (x, z) given aninput z 2 Z by solving a sequence of convex problems. In basic neighbor search we solve thepolishing problem for z and all neighbors of z (within distance D) and return the pair (x⇤, z⇤) withthe smallest merit function value. In practice, we can sample from Z

ngbr(z, D) instead of iteratingover all points in Z

ngbr(z, D) if��Zngbr(z, D)

�� is large.

Algorithm 13 Basic neighbor searchInput: z

Initialize (xbest, zbest) = ;, ⌘best =1.for z 2 Z

ngbr(z, D) doFind (x⇤, z⇤), by solving the polishing problem (4.3), with constraint z 2 Z

rstr(z).if ⌘(x⇤, z⇤) < ⌘best then

(xbest, zbest) (x⇤, z⇤), ⌘best ⌘(x⇤, z⇤).

return (xbest, zbest).

Iterated neighbor search. As with polishing, we can apply basic neighbor search repeatedly untilconvergence. In other words we can feed the output of algorithm 13 back into algorithm 13 untilthe merit function stops improving. The technique is called iterated neighbor search and describedin algorithm 14. Notice that for non-discrete sets where Z

ngbr(z,D) = {z} for all z and D, thisalgorithm reduces to iterated polishing.

Page 81: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 4. NCVX 69

Algorithm 14 Iterated neighbor searchInput: (x, z)

do(xold, zold) (x, z).Use algorithm 13 on zold to get (x, z).

while ⌘(x, z) < ⌘(xold, zold)

return (x, z).

4.3 NC-ADMM

We already can use the simple tools described in the previous section as heuristics to find approximatesolutions to problem (4.1). In this section, we describe the alternating direction method of multipliers(ADMM) as a mechanism to generate candidate points z to carry out local search methods such asiterated neighbor search. We call this method nonconvex ADMM, or NC-ADMM.

4.3.1 ADMM

Define � : Rq! R [ {�1,+1} such that �(z) is the best objective value of problem (4.1) after

fixing z. In other words,

�(z) = infx

{f0(x, z) | fi(x, z) 0, i = 1, . . . ,m, Ax+Bz = c} .

Notice that �(z) can be +1 or �1 in case the problem is not feasible for this particular valueof z, or problem (4.2) is unbounded below after fixing z. The function � is convex, since it is thepartial minimization of a convex function over a convex set [29, §3.4.4]. It is defined over all pointsz 2 Rq, but we are interested in finding its minimum value over the nonconvex set Z. In otherwords, problem (4.1) can be formulated as

minimize �(z)

subject to z 2 Z.(4.4)

As discussed in [27, Chapter 9], ADMM can be used as a heuristic to solve nonconvex constrainedproblems. ADMM has the form

wk+1 := argminz

��(z) + (⇢/2)kz � zk + uk

k22

zk+1 := ⇧�wk+1 + uk

uk+1 := uk + wk+1� zk+1,

(4.5)

Page 82: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 4. NCVX 70

where ⇢ > 0 is an algorithm parameter, k is the iteration counter, and ⇧ denotes Euclidean projectiononto Z (which need not be unique when Z is not convex).

The initial values u0 and z0 are additional algorithm parameters. We always set u0 = 0 anddraw z0 randomly from a normal distribution N (0,�2I), where � > 0 is an algorithm parameter.

4.3.2 Algorithm subroutines

Convex proximal step. Carrying out the first step of the algorithm, i.e., evaluating the proximaloperator of �, involves solving the convex optimization problem

minimize f0(x, z) + (⇢/2)kz � zk + ukk22

subject to fi(x, z) 0, i = 1, . . . ,m,

Ax+Bz = c,

(4.6)

over the variables x 2 Rn and z 2 Rq. This is the original problem (4.1), with the nonconvexconstraint z 2 Z removed, and an additional convex quadratic term involving z added to theobjective. We let (xk+1, wk+1) denote a solution of (4.6). If the problem (4.6) is infeasible, then sois the original problem (4.1); should this happen, we can terminate the algorithm with the certainconclusion that (4.1) is infeasible.

Projection. The (nonconvex) projection step consists of finding a closest point in Z to wk+1+uk.If more than one point has the smallest distance, we can choose one of the minimizers arbitrarily. Incases where the projection onto Z is too costly, we replace projection with approximate projection.

Dual update. The iterate uk2 Rq can be interpreted as a scaled dual variable, or as the running

sum of the error values wk� zk.

4.3.3 Discussion

Convergence. When Z is convex (and a solution of (4.1) exists), this algorithm is guaranteedto converge to a solution, in the sense that f0(xk+1, wk+1) converges to the optimal value of theproblem (4.1), and wk+1

� zk+1! 0, i.e., wk+1

! Z. See [27, §3] and the references therein for amore technical description and details. But in the general case, when Z is not convex, the algorithmis not guaranteed to converge, and even when it does, it need not be to a global, or even local,minimum. Some recent progress has been made on understanding convergence of ADMM in thenonconvex case [152].

Parameters. Another difference with the convex case is that the convergence and the quality ofsolution depends on ⇢, whereas for convex problems this algorithm is guaranteed to converge to

Page 83: DOMAIN-SPECIFIC LANGUAGES FOR CONVEX AND NON …tq788ns0013...domain-specific languages for convex and non-convex optimization a dissertation submitted to the department of computer

CHAPTER 4. NCVX 71

the optimal value regardless of the choice of ⇢. In other words, in the convex case the choice ofparameter ⇢ only affects the speed of the convergence, while in the nonconvex case the choice of ⇢can have a critical role in the quality of approximate solution, as well as the speed of convergence.

The optimal parameter selection for ADMM is still an active research area in the convex case;even less is known about it in the nonconvex case. In [90] the optimal parameter selection for convexquadratic problems is discussed. In a more general setting, Giselsson discusses the optimal parameterselection for ADMM for strongly convex functions in [92, 93, 94]. The dependence of global andlocal convergence properties of ADMM on parameter choice has been studied in [125, 22].

Initialization. In the convex case the choice of initial point z^0 affects the number of iterations to find a solution, but not the quality of the solution. Unsurprisingly, the nonconvex case differs in that the choice of z^0 has a major effect on the quality of the approximate solution. As with the choice of ρ, the initialization in the nonconvex case is currently an active area of research; see, e.g., [129, 152, 209]. A reasonable generic method is to draw initial points randomly from N(0, σ²I) (assuming reasonable scaling of the original problem).

4.3.4 Solution improvement

Now we describe two techniques to obtain better solutions after carrying out ADMM. The first technique relies on iterated neighbor search, and the second uses multiple restarts with random initial points in order to increase the chance of obtaining a better solution.

Iterated neighbor search. After each iteration, we can carry out iterated neighbor search (as described in §4.2.3) with Z^ngbr(z^{k+1}, D) to obtain (x^{k+1}, z^{k+1}). We return the pair with the smallest merit function as the output of the algorithm. The distance D is a parameter that can be increased so that the neighbor search considers more points.

Multiple restarts. We choose the initial value z^0 from a normal distribution N(0, σ²I). We can run the algorithm multiple times from different initial points to increase the chance of finding a feasible point with a smaller objective value.

4.3.5 Overall algorithm

The following is a summary of the algorithm with solution improvement.


Algorithm 15 NC-ADMM heuristic
Initialize u^0 = 0, (x_best, z_best) = ∅, η_best = ∞.
for algorithm repeats 1, 2, ..., M do
    Initialize z^0 ~ N(0, σ²I), k = 0.
    do
        (x^{k+1}, w^{k+1}) ← argmin_z ( φ̂(z) + (ρ/2)‖z - z^k + u^k‖_2^2 ).
        z^{k+1} ← Π(w^{k+1} + u^k).
        u^{k+1} ← u^k + w^{k+1} - z^{k+1}.
        Use algorithm 14 on (x^{k+1}, z^{k+1}) to get the improved iterate (x, z).
        if η(x, z) < η_best then
            (x_best, z_best) ← (x, z), η_best = η(x, z).
        k ← k + 1.
    while k ≤ N and (x, z) has not repeated P times in a row.
return x_best, z_best.
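As an illustration (not the NCVX implementation), the following NumPy sketch specializes the NC-ADMM iteration above to a Boolean least squares problem, minimize ‖Az - b‖_2^2 subject to z ∈ {0,1}^n. Here the convex proximal step has a closed form, the projection is entrywise rounding, and the local search of algorithm 14 is omitted; ρ and the iteration count are fixed for simplicity.

import numpy as np

np.random.seed(0)
m, n, rho, iters = 40, 20, 1.0, 50
A = np.random.randn(m, n)
b = A @ (np.random.rand(n) > 0.5) + 0.1 * np.random.randn(m)

z, u = np.random.randn(n), np.zeros(n)            # z^0 random, u^0 = 0
H = np.linalg.inv(2 * A.T @ A + rho * np.eye(n))  # factor once and reuse
z_best, eta_best = None, np.inf
for _ in range(iters):
    w = H @ (2 * A.T @ b + rho * (z - u))   # proximal step: solves (4.6)
    z = (w + u > 0.5).astype(float)         # projection onto {0,1}^n
    u = u + w - z                           # dual update
    eta = np.sum((A @ z - b) ** 2)          # merit of the projected point
    if eta < eta_best:
        eta_best, z_best = eta, z.copy()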

Convergence, stopping criteria, and optimality. As described in §4.3.3, ADMM need not converge for arbitrary nonconvex Z. The output of our heuristic is not the direct output of ADMM, however, but the output of ADMM after local search. In algorithm 15, ADMM may be viewed as a procedure for generating sample points, which we run through algorithm 14 to get different local optima. Our heuristic may therefore be useful even on problems where ADMM fails to converge. We terminate algorithm 15 when local search returns the same point P times in a row, where P is a parameter. Given the lack of convergence guarantees for ADMM with nonconvex Z, the only formal notion of optimality provided by our heuristic is that the solution is optimal among all points considered by the local search method.

4.4 Projections onto nonconvex sets

In this section we catalog various nonconvex sets with their implied convex constraints, which will be included in the convex constraints of problem (4.1). We also provide a Euclidean projection (or approximate projection) Π for these sets. Also, when applicable, we introduce a nontrivial restriction and distance function.

4.4.1 Subsets of R

Booleans. For Z = {0, 1}, a convex relaxation (in fact, the convex hull of Z) is [0, 1]. Projection is simple rounding: Π(z) = 0 for z ≤ 1/2, and Π(z) = 1 for z > 1/2. (z = 1/2 can be mapped to


either point.) Moreover, Z^dist(y, z) = |y - z| for y, z ∈ Z.

Finite sets. If Z has M elements, the convex hull of Z is the interval from the smallest to the largest element. We can project onto Z with no more than log_2 M comparisons. For y, z ∈ Z, the distance function is given by Z^dist(y, z) = |[y, z] ∩ Z| - 1.
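As a small illustration, the following sketch projects a scalar onto a finite set stored as a sorted list using binary search; the helper name and example values are ours, not part of NCVX.

import bisect

def project_finite(z, elems):
    # elems is a sorted list of the elements of Z
    i = bisect.bisect_left(elems, z)
    if i == 0:
        return elems[0]
    if i == len(elems):
        return elems[-1]
    lo, hi = elems[i - 1], elems[i]
    return lo if z - lo <= hi - z else hi

# e.g., project_finite(2.7, [-1, 0, 2, 5]) returns 2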

4.4.2 Subsets of R^n

Boolean vectors with fixed cardinality. Let Z = {z ∈ {0,1}^n | card(z) = k}. Any z ∈ Z satisfies 0 ≤ z ≤ 1 and 1^T z = k. We can project z ∈ R^n onto Z by setting the k entries of z with largest value to one and the remaining entries to zero. For y, z ∈ Z, the distance Z^dist(y, z) is defined as the minimum number of swaps of entries needed to transform y into z, or half the Hamming distance.
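A minimal NumPy sketch of this projection (the helper name is ours):

import numpy as np

def project_choose(z, k):
    # Set the k largest entries of z to one and the rest to zero.
    out = np.zeros_like(z, dtype=float)
    out[np.argsort(z)[-k:]] = 1.0
    return out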

Vectors with bounded cardinality. Let Z = {x ∈ [-M, M]^n | card(x) ≤ k}, where M > 0 and k ∈ Z_+. (Vectors z ∈ Z are called k-sparse.) Any point z ∈ Z satisfies -M ≤ z ≤ M and -Mk ≤ 1^T z ≤ Mk. The projection Π(z) is found as follows:

    (Π(z))_i = M     if z_i > M, i ∈ I,
               -M    if z_i < -M, i ∈ I,
               z_i   if |z_i| ≤ M, i ∈ I,
               0     if i ∉ I,

where I ⊆ {1, ..., n} is a set of indices of the k largest values of |z_i|. A restriction of Z at z ∈ Z is the set of all points in [-M, M]^n that have the same sparsity pattern as z.

Quadratic sets. Let S^n_+ and S^n_++ denote the set of n × n symmetric positive semidefinite and symmetric positive definite matrices, respectively. Consider the set

    Z = {z ∈ R^n | α ≤ z^T A z + 2 b^T z ≤ β},

where A ∈ S^n_++, b ∈ R^n, and β ≥ α ≥ -b^T A^{-1} b. We assume α ≥ -b^T A^{-1} b because z^T A z + 2 b^T z ≥ -b^T A^{-1} b for all z ∈ R^n. Any point z ∈ Z satisfies the convex inequality z^T A z + 2 b^T z ≤ β. We can find the projection onto Z as follows. If z^T A z + 2 b^T z > β, it suffices to solve

    minimize    ‖x - z‖_2^2
    subject to  x^T A x + 2 b^T x ≤ β,
    (4.7)


and if z^T A z + 2 b^T z < α, it suffices to solve

    minimize    ‖x - z‖_2^2
    subject to  x^T A x + 2 b^T x ≥ α.
    (4.8)

(If α ≤ z^T A z + 2 b^T z ≤ β, clearly Π(z) = z.) The first problem is a convex quadratically constrained quadratic program, and the second problem can be solved by solving a simple semidefinite program as described in [29, Appendix B]. Furthermore, there is a more efficient way to find the projection by finding the roots of a single-variable polynomial of degree 2p + 1, where p is the number of distinct eigenvalues of A [129, 122]. Note that the projection can be easily found even if A is not positive definite; we assume A ∈ S^n_++ only to make Z bounded and have a useful convex relaxation. A restriction of Z at z ∈ Z is the set

    Z^rstr(z) = {x ∈ R^n | (x^T A z + b^T(x + z) + b^T A^{-1} b) / √(z^T A z + 2 b^T z + b^T A^{-1} b) ≥ √(α + b^T A^{-1} b),  x^T A x + 2 b^T x ≤ β}.

Recall that z^T A z + 2 b^T z + b^T A^{-1} b ≥ 0 for all z ∈ R^n and we assume α ≥ -b^T A^{-1} b, so Z^rstr(z) is always well defined.

Annulus and sphere. Consider the set

    Z = {z ∈ R^n | r ≤ ‖z‖_2 ≤ R},

where R ≥ r. Any point z ∈ Z satisfies ‖z‖_2 ≤ R. We can project z ∈ R^n \ {0} onto Z by the following scaling:

    Π(z) = rz/‖z‖_2   if ‖z‖_2 < r,
           z          if z ∈ Z,
           Rz/‖z‖_2   if ‖z‖_2 > R.

If z = 0, any point with Euclidean norm r is a valid projection. A restriction of Z at z ∈ Z is the set

    Z^rstr(z) = {x ∈ R^n | x^T z ≥ r‖z‖_2, ‖x‖_2 ≤ R}.

Notice that if r = R, then Z is a sphere and the restriction will be a singleton.
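A minimal NumPy sketch of the annulus projection (the helper name is ours; for z = 0 it returns r times the first unit vector, one of the many valid projections):

import numpy as np

def project_annulus(z, r, R):
    nrm = np.linalg.norm(z)
    if nrm == 0:
        out = np.zeros_like(z, dtype=float)
        out[0] = r
        return out
    return z * np.clip(nrm, r, R) / nrm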


4.4.3 Subsets of R^{m×n}

Remember that the projection of a point X ∈ R^{m×n} onto a set Z ⊂ R^{m×n} is a point Z ∈ Z such that the Frobenius norm ‖X - Z‖_F is minimized. As always, if there is more than one point Z that minimizes ‖X - Z‖_F, we accept any of them.

Matrices with bounded singular values and orthogonal matrices. Consider the set of m × n matrices whose singular values lie between 1 and α:

    Z = {Z ∈ R^{m×n} | I ⪯ Z^T Z ⪯ α²I},

where α ≥ 1, and A ⪯ B means B - A ∈ S^n_+. Any point Z ∈ Z satisfies ‖Z‖_2 ≤ α.

If Z = UΣV^T is the singular value decomposition of Z, with singular values (σ_z)_{min{m,n}} ≤ ... ≤ (σ_z)_1, and X ∈ Z has singular values (σ_x)_{min{m,n}} ≤ ... ≤ (σ_x)_1, then according to the von Neumann trace inequality [223] we have

    Tr(Z^T X) ≤ Σ_{i=1}^{min{m,n}} (σ_z)_i (σ_x)_i.

Hence

    ‖Z - X‖_F^2 ≥ Σ_{i=1}^{min{m,n}} ((σ_z)_i - (σ_x)_i)²,

with equality when X = U diag(σ_x) V^T. This inequality implies that Π(Z) = U Σ̃ V^T, where Σ̃ is a diagonal matrix and Σ̃_ii is the projection of Σ_ii onto the interval [1, α]. When Z = 0, any matrix in Z with all singular values equal to one is a valid projection.

Given Z = UΣV^T ∈ Z, we have the following restriction [25]:

    Z^rstr(Z) = {X ∈ R^{m×n} | ‖X‖_2 ≤ α, V^T X^T U + U^T X V ⪰ 2I}.

(Notice that X ∈ Z^rstr(Z) satisfies X^T X ⪰ I + (X - UV^T)^T (X - UV^T) ⪰ I.)

There are several noteworthy special cases. When α = 1 and m = n we have the set of orthogonal matrices. In this case, the restriction will be a singleton. When n = 1, the set Z is equivalent to the annulus {z ∈ R^m | 1 ≤ ‖z‖_2 ≤ α}.

Matrices with bounded rank. Let Z = {Z ∈ R^{m×n} | Rank(Z) ≤ k, ‖Z‖_2 ≤ M}. Any point Z ∈ Z satisfies ‖Z‖_2 ≤ M and ‖Z‖_* ≤ Mk, where ‖·‖_* denotes the trace norm. If Z = UΣV^T is the singular value decomposition of Z, we have Π(Z) = U Σ̃ V^T, where Σ̃ is a diagonal matrix with Σ̃_ii = min{Σ_ii, M} for i = 1, ..., k, and Σ̃_ii = 0 otherwise.
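A minimal NumPy sketch of this projection (the helper name is ours):

import numpy as np

def project_bounded_rank(Z, k, M):
    # Keep the k leading singular values, clipped at M; zero out the rest.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_proj = np.zeros_like(s)
    s_proj[:k] = np.minimum(s[:k], M)
    return U @ np.diag(s_proj) @ Vt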

Given a point Z ∈ Z, we can write the singular value decomposition of Z as Z = UΣV^T with


U ∈ R^{m×k}, Σ ∈ R^{k×k}, and V ∈ R^{n×k}. A restriction of Z at Z is

    Z^rstr(Z) = {U Σ̃ V^T | Σ̃ ∈ R^{k×k}}.

Assignment and permutation matrices. Assignment matrices are Boolean matrices with exactly one nonzero element in each column and at most one nonzero element in each row. (They represent an assignment of the columns to the rows.) In other words, the set of assignment matrices in {0,1}^{m×n}, where m ≥ n, satisfies

    Σ_{j=1}^n Z_ij ≤ 1,  i = 1, ..., m,
    Σ_{i=1}^m Z_ij = 1,  j = 1, ..., n.

These two sets of inequalities, along with 0 ≤ Z_ij ≤ 1, are the implied convex inequalities. When m = n, this set becomes the set of permutation matrices, which we denote by P_n.

Projecting Z ∈ R^{m×n} (with m ≥ n) onto the set of assignment matrices involves choosing an entry from each column of Z such that no two chosen entries are from the same row and the sum of chosen entries is maximized. Assuming that the entries of Z are the weights of edges in a bipartite graph, the projection onto the set of assignment matrices is equivalent to finding a maximum-weight matching in a bipartite graph. The Hungarian method [146] is a well-known polynomial time algorithm to find the maximum weight matching, and hence also the projection onto assignment matrices.
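As an illustration, the following sketch computes the projection with SciPy's linear_sum_assignment routine, which solves the maximum-weight bipartite matching problem in polynomial time (the helper name is ours):

import numpy as np
from scipy.optimize import linear_sum_assignment

def project_assignment(Z):
    # Z has shape (m, n) with m >= n; match each column to a distinct row.
    m, n = Z.shape
    col_idx, row_idx = linear_sum_assignment(Z.T, maximize=True)
    out = np.zeros((m, n))
    out[row_idx, col_idx] = 1.0
    return out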

For Y, Z ∈ Z, the distance Z^dist(Y, Z) is defined as the minimum number of swaps of adjacent rows and columns necessary to transform Y into Z. We define distance in terms of swaps of adjacent rows and columns rather than arbitrary rows and columns to reduce the number of neighbors. For example, the restriction that swaps must be of adjacent rows and columns reduces |Z^ngbr(Z, 1)| from O(mn) to O(m + n) for Z ∈ Z.

Hamiltonian cycles. A Hamiltonian cycle is a cycle in a graph that visits every node exactly once. Every Hamiltonian cycle in a complete graph can be represented by its adjacency matrix; for example,

    [ 0 0 1 1 ]
    [ 0 0 1 1 ]
    [ 1 1 0 0 ]
    [ 1 1 0 0 ]

represents a Hamiltonian cycle that visits nodes (3, 2, 4, 1) sequentially. Let H_n be the set of n × n matrices that represent a Hamiltonian cycle.


Every point Z ∈ H_n satisfies 0 ≤ Z_ij ≤ 1 for i, j = 1, ..., n, Z = Z^T, (1/2)Z1 = 1, and

    2I - Z + (4/n)11^T ⪰ 2(1 - cos(2π/n)) I,

where I denotes the identity matrix. In order to see why the last inequality holds, it is enough to note that 2I - Z is the Laplacian of the cycle represented by Z [165, 9]. It can be shown that the smallest eigenvalue of 2I - Z is zero (which corresponds to the eigenvector 1), and the second smallest eigenvalue of 2I - Z is 2(1 - cos(2π/n)). Hence all eigenvalues of 2I - Z + (4/n)11^T must be no smaller than 2(1 - cos(2π/n)).

We are not aware of a polynomial time algorithm to find the projection of a given real n × n matrix onto H_n. We can find an approximate projection of Z by the following greedy algorithm: construct a graph with n vertices where the edge between i and j is weighted by z_ij. Start with the edge with the largest weight and at each step, among all the edges that do not create a cycle or give a vertex degree more than two, choose the edge with the largest weight (except for the last step, where a cycle is created).
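The following NumPy sketch implements this greedy approximate projection (our code, not NCVX's); it first builds a maximum-weight Hamiltonian path edge by edge, tracking components with a union-find structure, and then closes the cycle between the two remaining degree-one vertices.

import numpy as np

def approx_project_cycle(Z):
    n = Z.shape[0]
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    deg = [0] * n
    adj = np.zeros((n, n))
    edges = sorted(((Z[i, j], i, j) for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    added = 0
    for w, i, j in edges:               # greedily build a Hamiltonian path
        if added == n - 1:
            break
        if deg[i] < 2 and deg[j] < 2 and find(i) != find(j):
            adj[i, j] = adj[j, i] = 1
            deg[i] += 1
            deg[j] += 1
            parent[find(i)] = find(j)
            added += 1
    ends = [v for v in range(n) if deg[v] < 2]
    adj[ends[0], ends[1]] = adj[ends[1], ends[0]] = 1   # close the cycle
    return adj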

For Y, Z ∈ H_n, the distance Z^dist(Y, Z) is defined as the minimum number of adjacent nodes that must be swapped to transform Y into Z. Swapping adjacent nodes i and j means replacing Y with P_(i,j) Y P_(i,j)^T, where Y_ij = 1 and P_(i,j) is a permutation matrix that swaps nodes i and j and leaves other nodes unchanged. As with assignment matrices, we define distance in terms of swaps of adjacent nodes rather than arbitrary nodes to reduce the number of neighbors.

4.4.4 Combinations of sets

Cartesian product. Let Z = Z_1 × ··· × Z_k ⊂ R^n, where Z_1, ..., Z_k are closed sets with known projections (or approximate projections). A convex relaxation of Z is the Cartesian product Z_1^rlx × ··· × Z_k^rlx, where Z_i^rlx is the set described by the convex relaxation of Z_i. The projection of z ∈ R^n onto Z is (Π_1(z_1), ..., Π_k(z_k)), where Π_i denotes the projection onto Z_i for i = 1, ..., k.

A restriction of Z at a point z = (z_1, z_2, ..., z_k) ∈ Z is the Cartesian product Z^rstr(z) = Z_1^rstr(z_1) × ··· × Z_k^rstr(z_k). For y = (y_1, y_2, ..., y_k) ∈ Z and z = (z_1, z_2, ..., z_k) ∈ Z, the distance function is given by Z^dist(y, z) = Σ_{i=1}^k Z_i^dist(y_i, z_i).

4.5 Implementation

We have implemented the NCVX Python package for modeling problems of the form (4.1) and applying the NC-ADMM heuristic, along with the relax-round-polish and relax methods. The NCVX package is an extension of CVXPY [68]. The problem objective and convex constraints are expressed using standard CVXPY semantics. Nonconvex constraints are expressed implicitly by creating a variable constrained to lie in one of the sets described in §4.4. For example, the code snippet

x = Boolean()


creates a variable x ∈ R with the implicit nonconvex constraint x ∈ {0, 1}. The convex relaxation, in this case x ∈ [0, 1], is also implicit in the variable definition. The source code for NCVX is available at https://github.com/cvxgrp/ncvx.

4.5.1 Variable constructors

The NCVX package provides the following functions for creating variables with implicit nonconvex constraints, along with many others not listed:

• Boolean(n) creates a variable x ∈ R^n with the implicit constraint x ∈ {0,1}^n.

• Integer(n, M) creates a variable x ∈ R^n with the implicit constraints x ∈ Z^n and ‖x‖_∞ ≤ ⌊M⌋.

• Card(n, k, M) creates a variable x ∈ R^n with the implicit constraints that at most k entries are nonzero and ‖x‖_∞ ≤ M.

• Choose(n, k) creates a variable x ∈ R^n with the implicit constraints that x ∈ {0,1}^n and has exactly k nonzero entries.

• Rank(m, n, k, M) creates a variable X ∈ R^{m×n} with the implicit constraints Rank(X) ≤ k and ‖X‖_2 ≤ M.

• Assign(m, n) creates a variable X ∈ R^{m×n} with the implicit constraint that X is an assignment matrix.

• Permute(n) creates a variable X ∈ R^{n×n} with the implicit constraint that X is a permutation matrix.

• Cycle(n) creates a variable X ∈ R^{n×n} with the implicit constraint that X is the adjacency matrix of a Hamiltonian cycle.

• Annulus(n, r, R) creates a variable x ∈ R^n with the implicit constraint r ≤ ‖x‖_2 ≤ R.

• Sphere(n, r) creates a variable x ∈ R^n with the implicit constraint ‖x‖_2 = r.

4.5.2 Variable methods

Additionally, each variable created by the functions in §4.5.1 supports the following methods:

• variable.relax() returns a list of convex constraints that represent a convex relaxation of the nonconvex set Z, to which the variable belongs.

• variable.project(z) returns the Euclidean (or approximate) projection of z onto the nonconvex set Z, to which the variable belongs.


• variable.restrict(z) returns a list of convex constraints describing the convex restriction Z^rstr(z) at z of the nonconvex set Z, to which the variable belongs.

• variable.neighbors(z, D) returns a list of neighbors Z^ngbr(z, D) of z contained in the nonconvex set Z, to which the variable belongs.

Users can add support for additional nonconvex sets by providing functions that implement these four methods.
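For illustration, the snippet below queries these methods on a Boolean variable; it is a sketch that assumes the constructors and methods above are importable directly from the ncvx module, and the numeric values are made up.

import numpy as np
from ncvx import Boolean

x = Boolean(5)
z = np.array([0.9, 0.2, 0.6, -0.1, 0.51])
z_proj = x.project(z)            # Euclidean projection of z onto {0,1}^5
relax_constraints = x.relax()    # convex constraints describing [0,1]^5
nearby = x.neighbors(z_proj, 1)  # Boolean points within distance 1 of z_proj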

4.5.3 Constructing and solving problems

To construct a problem of the form (4.1), the user creates variables z_1, ..., z_k with the implicit constraints z_1 ∈ Z_1, ..., z_k ∈ Z_k, where Z_1, ..., Z_k are nonconvex sets, using the functions described in §4.5.1. The variable z in problem (4.1) corresponds to the vector (z_1, ..., z_k). The components of the variable x, the objective, and the constraints are constructed using standard CVXPY syntax.

Once the user has constructed a problem object, they can apply the following solve methods:

• problem.solve(method="relax") solves the convex relaxation of the problem.

• problem.solve(method="relax-round-polish") applies the relax-round-polish heuristic. Additional arguments can be used to specify the parameters N, D, σ, and λ. By default the parameter values are N = 5, D = 1, σ = 1, and λ = 10^4. When N > 1, the first sample w^1 ∈ R^q is always 0. Subsequent samples are drawn i.i.d. from N(0, σ²I). Neighbor search looks at all neighbors within distance D.

• problem.solve(method="nc-admm") applies the NC-ADMM heuristic. Additional arguments can be used to specify the number of starting points, the number of iterations the algorithm is run from each starting point, and the values of the parameters ρ, D, σ, and λ. By default the algorithm is run from 5 starting points for 50 iterations, the value of ρ is drawn uniformly from [0, 1], and the other parameter values are D = 1, σ = 1, and λ = 10^4. The first starting point is always z^0 = 0 and subsequent starting points are drawn i.i.d. from N(0, σ²I). Neighbor search looks at all neighbors within distance D.

The relax-round-polish and NC-ADMM methods record the best point found (x_best, z_best) according to the merit function. The methods return the objective value f_0(x_best, z_best) and the residual r(x_best, z_best), and set the value field of each variable to the appropriate segment of x_best and z_best.

For example, consider the regressor selection problem, which we will discuss in §4.6.1. This problem can be formulated as

    minimize    ‖Ax - b‖_2^2
    subject to  card(x) ≤ k,  ‖x‖_∞ ≤ M,
    (4.9)


with decision variable x ∈ R^n and problem data A ∈ R^{m×n}, b ∈ R^m, M > 0, and k ∈ Z_+. The following code attempts to approximately solve this problem using our heuristic.

x = Card(n,k,M)

prob = Problem(Minimize(sum_squares(A*x-b)))

objective, residual = prob.solve(method="nc-admm")

The first line constructs a variable x ∈ R^n with the implicit constraints that at most k entries are nonzero, ‖x‖_∞ ≤ M, and ‖x‖_1 ≤ kM. The second line creates a minimization problem with objective ‖Ax - b‖_2^2 and no constraints. The last line applies the NC-ADMM heuristic to the problem and returns the objective value and residual of the best point found.

4.5.4 Limitations

Our implementation is designed to be simple and to generalize to as many problems as possible. As a result, the implementation has several limitations in terms of computational efficiency and exploiting problem specific structure. For example, no work is cached across solves of convex subproblems. Caching factorizations or warm starting would improve performance when the convex solver supports these features. The implementation runs NC-ADMM from different initial values in parallel, but a more sophisticated implementation would use finer grained parallelism.

Neighbor search and polishing can be made more efficient than the general purpose approach in our implementation by exploiting problem specific structure. For example, if the variable z is a Boolean vector, i.e., Z = {0,1}^q, then any neighbor z̃ ∈ Z^ngbr(z, 1) differs from z in only two entries. The change in the merit function η(x, z̃) - η(x, z) can be computed efficiently given η(x, z), accelerating neighbor search. Polishing can also be accelerated by taking advantage of the structure of the convex restriction. For instance, the restriction of the cardinality constraint Z = {z ∈ R^q | card(z) ≤ k, ‖z‖_∞ ≤ M} fixes the sparsity pattern, which reduces the number of free entries of z from q to k. Our implementation imposes the restriction by adding convex constraints to the problem, but a more efficient implementation would replace z with a k-dimensional variable.

For several of the examples in §4.6, we implemented optimized versions of the NC-ADMM algorithm that remedy the limitations of our general purpose implementation. Even our problem specific implementations could be improved further by better exploiting parallelism and applying low-level code optimization, but the implementations are fast enough to compete with optimized general purpose mixed integer solvers like Gurobi [110].

4.6 Examples

In this section we apply the NC-ADMM heuristic to a wide variety of hard problems, i.e., problems that generally cannot be solved in polynomial time. Extensive research has been done on specialized


algorithms for each of the problems discussed in this section. Our intention is not to seek better performance than these specialized algorithms, but rather to show that our general purpose heuristic can yield decent results with minimal tuning. The advantage of our heuristic is that it can be applied to problems that no one has studied before, not that it outperforms the state-of-the-art on well-studied problems.

Unless otherwise specified, the algorithm parameters are the defaults described in §4.5. In particular, we use random initialization for all examples. For most problems a well chosen problem specific initialization will improve the results of our method; see, e.g., [129, 152, 209]. We use random initialization, however, because it better demonstrates that our heuristic can be effective with minimal tuning. Whenever possible, we compare our heuristic to Gurobi [110], a commercial global optimization solver. All runtimes reported are on a laptop with a four-core 2.3 GHz Intel Core i7 processor.

4.6.1 Regressor selection

We consider the problem of approximating a vector b with a linear combination of at most k columns of A with bounded coefficients. This problem can be formulated as

    minimize    ‖Ax - b‖_2^2
    subject to  card(x) ≤ k,  ‖x‖_∞ ≤ M,
    (4.10)

with decision variable x ∈ R^n and problem data A ∈ R^{m×n}, b ∈ R^m, k ∈ Z_+, and M > 0. Lasso (least absolute shrinkage and selection operator) is a well-known heuristic for solving this problem by adding ℓ_1 regularization and minimizing ‖Ax - b‖_2^2 + λ‖x‖_1. The value of λ is chosen as the smallest value for which card(x) ≤ k. (See [85, §3.4] and [29, §6.3].) The nonconvex set from §4.4 in problem (4.10) is the set of vectors with bounded cardinality Z = {x ∈ [-M, M]^n | card(x) ≤ k}.
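For reference, the following is a sketch of the Lasso heuristic in CVXPY (not the authors' experimental code): sweep λ, take the smallest value whose solution has at most k nonzeros, and then polish by re-solving over the selected support with the box constraint.

import cvxpy as cp
import numpy as np

def lasso_heuristic(A, b, k, M, lambdas):
    n = A.shape[1]
    x = cp.Variable(n)
    lam = cp.Parameter(nonneg=True)
    prob = cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.norm(x, 1)))
    support = np.arange(n)
    for val in sorted(lambdas):          # smallest lambda with card(x) <= k
        lam.value = val
        prob.solve()
        support = np.flatnonzero(np.abs(x.value) > 1e-6)
        if len(support) <= k:
            break
    x_pol = cp.Variable(len(support))    # polish on the fixed sparsity pattern
    cp.Problem(cp.Minimize(cp.sum_squares(A[:, support] @ x_pol - b)),
               [cp.norm(x_pol, "inf") <= M]).solve()
    x_full = np.zeros(n)
    x_full[support] = x_pol.value
    return x_full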

Problem instances. We first consider a family of random problem instances. We generated the matrix A ∈ R^{m×2m} with i.i.d. N(0, 1) entries, and chose b = Ax + v, where x was drawn uniformly at random from the set of vectors satisfying card(x) ≤ ⌊m/5⌋ and ‖x‖_∞ ≤ 1, and v ∈ R^m was a noise vector drawn from N(0, σ²I). We set σ² = ‖Ax‖²/(400m) so that the signal-to-noise ratio was near 20. For each value of m, we generated 40 instances of the problem as described above. We solved the instances for k = ⌊m/5⌋.

In order to examine this method on a real dataset, we also used data from the University of California, Irvine (UCI) Machine Learning repository [84] to study the murder rate (per 100K people) of m = 2215 communities in the United States. Similar to [211], we had n = 101 attributes measured in each community, and our goal was to predict the murder rate as a linear function of only k attributes. To find a good prediction model one would use cross validation analysis in order


Figure 4.1: The average error of solutions found by Lasso, relax-round-polish, and NC-ADMM for 40 random instances of the regressor selection problem.

to choose k; but we limited ourselves to the problem of finding 2 ≤ k ≤ 20 regressors that minimize ‖Ax - b‖_2^2.

Results. Figure 4.1 compares the average sum of squares error for the x* values found by the Lasso heuristic, relax-round-polish, and NC-ADMM for the randomly generated instances. For Lasso, we solved the problem for 100 values of λ and then solved the polishing problem after fixing the sparsity pattern suggested by Lasso. For all m, the objective values found by the NC-ADMM heuristic were on average better than those found by the Lasso and relax-round-polish heuristics.

For our second problem (murder rate), we used our NC-ADMM heuristic and Gurobi to solve the problem. Our tailored implementation of NC-ADMM never took more than 40 milliseconds to run. The implementation is extremely efficient because the dominant computation is a single factorization of the matrix A^T A + ρI. We use only one restart and hence only one value of ρ. Figure 4.2 shows the value found by NC-ADMM as well as the best value found by Gurobi after 10 seconds, 100 seconds, and 1000 seconds. For all k, the objective value found by NC-ADMM after only 40 milliseconds was better than those found by Gurobi after 10 or 100 seconds and comparable to those found after 1000 seconds. (Of course, Gurobi will eventually find the global optimal point, and therefore match or beat the point found by NC-ADMM.)


Figure 4.2: The best value found by NC-ADMM (usually done in 35 milliseconds) and Gurobi after 10 seconds, 100 seconds, and 1000 seconds.

4.6.2 3-satisfiability

Given Boolean variables x_1, ..., x_n, a literal is either a variable or the negation of a variable, for example x_1 and ¬x_2. A clause is a disjunction of literals (or a single literal), for example (¬x_1 ∨ x_2 ∨ ¬x_3). Finally, a formula is in conjunctive normal form (CNF) if it is a conjunction of clauses (or a single clause), for example (¬x_1 ∨ x_2 ∨ ¬x_3) ∧ (x_1 ∨ ¬x_2). Determining the satisfiability of a formula in conjunctive normal form where each clause is limited to at most three literals is called 3-satisfiability, or simply the 3-SAT problem. It is known that 3-SAT is NP-complete, hence we do not expect to be able to solve 3-SAT in general using our heuristic. A 3-SAT problem can be formulated as follows:

    minimize    0
    subject to  Az ≤ b,
                z ∈ {0,1}^n,
    (4.11)

where the entries of A ∈ R^{m×n} are given by

    a_ij = -1 if clause i contains x_j,
            1 if clause i contains ¬x_j,
            0 otherwise,


and the entries of b are given by

    b_i = (number of negated literals in clause i) - 1.

The nonconvex set from §4.4 in problem (4.11) is the set of Boolean vectors Z = {0,1}^n.
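The following sketch builds the data (A, b) of problem (4.11) from a CNF formula given in a DIMACS-like form, where each clause is a list of nonzero integers and a negative integer denotes a negated variable (the helper name is ours):

import numpy as np

def sat_to_inequalities(clauses, n):
    m = len(clauses)
    A = np.zeros((m, n))
    b = np.zeros(m)
    for i, clause in enumerate(clauses):
        for lit in clause:
            A[i, abs(lit) - 1] = -1.0 if lit > 0 else 1.0
        b[i] = sum(lit < 0 for lit in clause) - 1
    return A, b

# Example: (x1 or not x2 or x3) and (not x1 or x2)
A, b = sat_to_inequalities([[1, -2, 3], [-1, 2]], n=3)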

Problem instances. We generated 3-SAT problems with varying numbers of clauses and variables randomly as in [166, 157]. As discussed in [54], there is a threshold around 4.25 clauses per variable when problems transition from being feasible to being infeasible. Problems near this threshold are generally found to be hard satisfiability problems. The SATLIB uniform random-3-SAT benchmark is constructed by the same method [128]. We generated 10 instances for each choice of number of clauses m and variables n, verifying that each instance is feasible using Gurobi [110].

Results. We ran the NC-ADMM heuristic on each instance, with 10 restarts and 100 iterations, and ρ = 10. Figure 4.3 shows the fraction of instances solved correctly with NC-ADMM for each choice of number of clauses m and variables n. We see that using this heuristic, satisfying assignments can be found consistently for up to 3.2 constraints per variable, at which point success starts to decrease. Problems in the gray region in figure 4.3 were not tested since they are infeasible with high probability. We also tried the relax-round-polish heuristic, but it often failed to solve problems with more than 50 clauses.

For all instances, the runtime of the NC-ADMM heuristic with the parameters we chose was greater than the time it took Gurobi to find a solution. A specialized SAT solver would of course be even faster. We include the example nonetheless because it shows that the NC-ADMM heuristic can be effective for feasibility problems, even though the algorithm is not guaranteed to find a feasible point.

4.6.3 Circle packing

In the circle packing problem we are interested in finding the smallest square in which we can place n non-overlapping circles with radii r_1, ..., r_n [97]. This problem has been studied extensively [205, 38, 48] and a database of densest known packings (with all r_i equal) for different numbers of circles can be found in [201]. Variants of the problem arise in industrial packing and computer aided design [121]. The problem can be formulated as

    minimize    l
    subject to  r_i 1 ≤ x_i ≤ (l - r_i) 1,  i = 1, ..., n,
                x_i - x_j = z_ij,  i = 1, ..., n-1,  j = i+1, ..., n,
                2 Σ_{k=1}^n r_k ≥ ‖z_ij‖_2 ≥ r_i + r_j,  i = 1, ..., n-1,  j = i+1, ..., n,
    (4.12)


Figure 4.3: The fraction of the 10 3-SAT instances generated for each choice of number of clauses m and variables n for which NC-ADMM found a satisfying assignment. No instances were generated for (n, m) in the gray region.

where x_1, ..., x_n ∈ R^2 are variables representing the circle centers and z_12, z_13, ..., z_{n-1,n} ∈ R^2 are additional variables representing the offset between pairs (x_i, x_j). The nonconvex sets from §4.4 in problem (4.12) are the annuli

    Z_ij = {z_ij ∈ R^2 | r_i + r_j ≤ ‖z_ij‖_2 ≤ 2 Σ_{k=1}^n r_k},

for i = 1, ..., n-1 and j = i+1, ..., n.

Problem instances. We generated problems with different numbers of circles n, but with equal radii r_1, ..., r_n. Problem instances of this form are quite difficult to solve globally. The densest possible packing is unknown for most n > 36 [201].

Results. We ran the relax-round-polish heuristic for problems with n = 1, ..., 100. The heuristic is essentially equivalent to well-known methods like the convex-concave procedure and the majorization-minimization (MM) algorithm [157]. We observed that NC-ADMM is no more effective than relax-round-polish. Figure 4.4 shows the relative radius r_1/l of the packing found by our heuristic in comparison to the best packing known. Figure 4.5 shows the packing found by our heuristic for n = 41. The obtained packing covers 78.68% of the area of the bounding square, which


Figure 4.4: The relative radius r_1/l for the densest known packing and the packing found with the relax-round-polish heuristic for n = 1, ..., 100.

is close to the densest known packing, which covers 79.27% of the area.

4.6.4 Traveling salesman problem

In the traveling salesman problem (TSP), we wish to find the minimum weight Hamiltonian cycle in a weighted graph. A Hamiltonian cycle is a path that starts and ends on the same vertex and visits each other vertex in the graph exactly once. Let G be a graph with n vertices and let D ∈ S^n be the (weighted) adjacency matrix, i.e., the real number d_ij denotes the distance between vertices i and j. We can formulate the TSP for G as follows:

    minimize    (1/2) Tr(D^T Z)
    subject to  Z ∈ H_n,
    (4.13)

where Z is the decision variable [149, 145, 56, 123]. The nonconvex set from §4.4 in problem (4.13) is the set of Hamiltonian cycles Z = H_n.
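A sketch of this formulation in NCVX follows; it assumes the Cycle constructor and solve method described in §4.5 can be imported directly from the ncvx module.

import cvxpy as cp
import numpy as np
from ncvx import Cycle

np.random.seed(0)
n = 8
pts = np.random.uniform(-1, 1, (n, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

Z = Cycle(n)
prob = cp.Problem(cp.Minimize(0.5 * cp.trace(D.T @ Z)))
objective, residual = prob.solve(method="nc-admm")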

Problem instances. We generated problems with different numbers of vertices n by sampling n points from the uniform distribution on [-1, 1]^2. We set d_ij to be the Euclidean distance between points i and j. For each value of n, we generated 10 instances of the problem according to the above procedure.


Figure 4.5: The packing for n = 41 circles with equal radii found with the relax-round-polish heuristic.

Results. Figure 4.6 shows the average cost of the solutions found by relax-round-polish, NC-ADMM, and Gurobi with a time cutoff. We implemented an optimized version of NC-ADMM for the TSP problem. We ran NC-ADMM with 4 restarts and 25 iterations. We ran Gurobi on the standard MILP formulation of the TSP [178, §13] and gave it a time cutoff equal to the runtime of our NC-ADMM implementation. We ignored instances where Gurobi failed to find a feasible point within the runtime.

As n increases, the average cost of the solutions found by NC-ADMM goes below that of Gurobi with a time cutoff. Of course a specialized TSP solver like Concorde [10] could solve all the problem instances to global optimality within the runtime of NC-ADMM. We emphasize again, however, that our goal is not to outperform specialized solvers on every problem class, but simply for NCVX to compare favorably with other general purpose nonconvex solvers.

4.6.5 Factor analysis model

The factor analysis problem decomposes a matrix as a sum of a low-rank and a diagonal matrix and has been studied extensively (for example in [190, 173]). It is also known as the Frisch scheme in


Figure 4.6: The average cost of the TSP solutions found by relax-round-polish, NC-ADMM, and Gurobi with a time cutoff equal to the runtime of NC-ADMM.


the system identification literature [136, 59]. The problem is the following:

    minimize    ‖Σ - Σ_lr - D‖_F^2
    subject to  D = diag(d),  d ≥ 0,
                Σ_lr ⪰ 0,
                Rank(Σ_lr) ≤ k,
    (4.14)

where Σ_lr ∈ S^n_+ and the diagonal matrix D ∈ R^{n×n} with nonnegative diagonal entries are the decision variables, and Σ ∈ S^n_+ and k ∈ Z_+ are problem data. One well-known heuristic for solving this problem is adding ‖·‖_*, or nuclear norm, regularization and minimizing ‖Σ - Σ_lr - D‖_F^2 + λ‖Σ_lr‖_* [220, 190]. The value of λ is chosen as the smallest value possible such that Rank(Σ_lr) ≤ k. Since Σ_lr is positive semidefinite, ‖Σ_lr‖_* = Tr(Σ_lr). The nonconvex set from §4.4 for problem (4.14) is the set of matrices with bounded rank Z = {Σ_lr ∈ S^n_+ | Rank(Σ_lr) ≤ k}. Unlike in §4.4, we constrain Σ_lr to be positive semidefinite but impose no bound on the norm ‖Σ_lr‖_2.
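For comparison, the nuclear norm heuristic itself is a convex problem; the following CVXPY sketch solves it for one value of λ (the function name is ours):

import cvxpy as cp
import numpy as np

def nuclear_norm_heuristic(Sigma, lam):
    n = Sigma.shape[0]
    S_lr = cp.Variable((n, n), PSD=True)
    d = cp.Variable(n, nonneg=True)
    # Since S_lr is PSD, its nuclear norm equals its trace.
    obj = cp.sum_squares(Sigma - S_lr - cp.diag(d)) + lam * cp.trace(S_lr)
    cp.Problem(cp.Minimize(obj)).solve()
    return S_lr.value, d.value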

Problem instances. We constructed instances of the factor analysis problem using daily returns from stocks in the July 2016 S&P 500 over 2014 and 2015. There is a long history in finance of decomposing the covariance of stock returns into low-rank and diagonal components [196, 181]. We varied the number of stocks used n, the rank k, and the month of returns history considered. For each choice of n, k, and month, we generated an instance of problem (4.14) by setting Σ to be the covariance matrix of the daily percent returns over that month for the first n S&P 500 stocks, ordered alphabetically by NYSE ticker.

Results. We ran NC-ADMM, relax-round-polish, and the nuclear norm heuristic on each problem instance. For the nuclear norm heuristic, we solved the problem for 1000 values of λ and then polished the solution Σ_lr. In the polishing problem we replaced the rank constraint in problem (4.14) with the convex restriction Σ_lr ∈ {Q_1:k Σ̃ Q_1:k^T | Σ̃ ∈ S^k_+}, where Σ_lr = QΛQ^T is the eigendecomposition of Σ_lr and Q_1:k is the first k columns of Q. The Σ_lr and d values found by the three methods were always feasible solutions to problem (4.14).

For each problem instance and each method, we took the value of the objective ‖Σ - Σ_lr - D‖_F^2 obtained by the method and subtracted the smallest objective value obtained by any method, p_best. Figure 4.7 shows the average ‖Σ - Σ_lr - D‖_F^2 - p_best across all 24 months of returns data, for a given n and k. NC-ADMM always gave the best objective value on average (though not for each specific problem instance). The performance of relax-round-polish relative to NC-ADMM increased as k increased, while the relative performance of the nuclear norm heuristic decreased as k increased.


Figure 4.7: The average difference between the objective value found by the nuclear norm, relax-round-polish, and NC-ADMM heuristics and the best objective value found by any of the heuristics for instances of the factor analysis problem constructed from daily stock returns.

4.6.6 Inexact graph isomorphism

Two (undirected) graphs are isomorphic if we can permute the vertices of one so it is the same as the other (i.e., the same pairs of vertices are connected by edges). If we describe them by their adjacency matrices A and B, isomorphism is equivalent to the existence of a permutation matrix Z ∈ R^{n×n} such that ZAZ^T = B, or equivalently ZA = BZ.

Since in practical applications isomorphic graphs might be contaminated by noise, the inexact graph isomorphism problem is usually stated [2, 215, 55], in which we want to find a permutation matrix Z such that the disagreement ‖ZAZ^T - B‖_F^2 between the transformed matrix and the target matrix is minimized. Solving inexact graph isomorphism problems is of interest in pattern recognition [50, 186], computer vision [191], shape analysis [194, 116], image and video indexing [151], and neuroscience [222]. In many of the aforementioned fields graphs are used to represent geometric structures, and ‖ZAZ^T - B‖_F^2 can be interpreted as the strength of geometric deformation.

�Bk2F= kZA�BZk2

Ffor any permutation matrix Z, the inexact graph isomor-

phism problem can be formulated as

minimize kZA�BZk2F

subject to Z 2 Pn.(4.15)

If the optimal value of this problem is zero, it means that A and B are isomorphic. Otherwise, the solution of this problem minimizes the disagreement of ZAZ^T and B in the Frobenius norm sense.


The nonconvex set from §4.4 in problem (4.15) is the set of permutation matrices Z = P_n.
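A sketch of this formulation in NCVX follows; it assumes the Permute constructor and solve method described in §4.5 can be imported directly from the ncvx module, and generates a random graph and a permuted copy as test data.

import cvxpy as cp
import numpy as np
from ncvx import Permute

np.random.seed(0)
n = 10
A = (np.random.rand(n, n) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T                     # random undirected graph
P = np.eye(n)[np.random.permutation(n)]
B = P @ A @ P.T                 # isomorphic copy of A

Z = Permute(n)
prob = cp.Problem(cp.Minimize(cp.sum_squares(Z @ A - B @ Z)))
objective, residual = prob.solve(method="nc-admm")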

Problem instances. It can be shown that if A and B are isomorphic, A has distinct eigenvalues, and all eigenvectors v of A satisfy 1^T v ≠ 0, then the relaxed problem has a unique solution, which is the permutation matrix that relates A and B [2]. Hence in our first experiment, in order to generate harder problems, we generated the matrix A such that it violated these conditions. In particular, we constructed A for the Petersen graph (3-regular with 10 vertices), icosahedral graph (5-regular with 12 vertices), Ramsey graph (8-regular with 17 vertices), dodecahedral graph (3-regular with 20 vertices), and the Tutte-Coxeter graph (3-regular with 30 vertices). For each example we randomly permuted the vertices to obtain two isomorphic graphs.

We also used random graphs from the SIVALab dataset [61] in our second experiment. These are Erdős-Rényi graphs that have been used for benchmarking different graph isomorphism algorithms. We ran our NCVX heuristic and Gurobi on 100 problems of size n = 20, 40, 60, 80.

Results. We implemented a faster version of NC-ADMM that caches work between convex solves. We ran our implementation with 25 iterations, 2 restarts, and no neighbor search. For all of our examples in the first experiment NC-ADMM was able to find the permutation relating the two graphs. It is interesting to notice that running the algorithm multiple times can find different solutions if there is more than one permutation relating the two graphs.

We compared NC-ADMM with Gurobi on random examples in our second experiment. We ran Gurobi with a time limit of 300 seconds. Whenever Gurobi found a permutation matrix that gave an objective value of 0, it immediately returned the solution since the lower bound 0 was evident. NC-ADMM found a permutation solution for 97 out of 100 examples.

Figure 4.8 shows the runtime performance of the two methods. Each point shows how long NC-ADMM or Gurobi ran on a particular problem instance. Points with a time component of 300 seconds indicate instances where Gurobi was unable to find a solution within the time limit. The goal of this comparison is to test the performance of generic methods on the graph isomorphism problem; tailored methods for this problem are significantly faster than both of these methods.

4.7 Conclusion

We have discussed the relax-round-polish and NC-ADMM heuristics and demonstrated their performance on many different problems with convex objectives and decision variables from a nonconvex set. Our heuristics are easy to extend to additional problems because they rely on a simple mathematical interface for nonconvex sets. We need only know a method for (approximate) projection onto the set. We do not require but benefit from knowing a convex relaxation of the set, a convex restriction at any point in the set, and the neighbors of any point in the set under some discrete


Figure 4.8: Time comparison of Gurobi and NC-ADMM on random graph isomorphism problems. Each point shows how long NC-ADMM or Gurobi ran on a particular problem instance.


distance metric. Adapting our heuristics to any particular problem is straightforward, and we have fully automated the process in the NCVX package.

We do not claim that our heuristics give state-of-the-art results for any particular problem. Rather, the purpose of our heuristics is to give a fast and reasonable solution with minimal tuning for a wide variety of problems. Our heuristics also take advantage of the tremendous progress in technology for solving general convex optimization problems, which makes it practical to treat solving a convex problem as a black box.


Bibliography

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

[2] Y. Aflalo, A. Bronstein, and R. Kimmel. On convex relaxation of graph isomorphism. Proceedings of the National Academy of Sciences of the United States of America, 112(10):2942–2947, 2015.

[3] N. Ahmed, T. Natarajan, and K. Rao. Discrete cosine transform. IEEE Transactions on Computers, C-23(1):90–93, January 1974.

[4] A. Aho, M. Lam, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., 2006.

[5] S. Akle. Algorithms for unsymmetric cone optimization and an implementation for problems with the exponential cone. PhD thesis, Stanford University, 2015.

[6] A. Ali, Z. Kolter, S. Diamond, and S. Boyd. Disciplined convex stochastic programming: A new framework for stochastic optimization. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 62–71, 2015.

[7] M. Andersen, J. Dahl, Z. Liu, and L. Vandenberghe. Interior-point methods for large-scale cone programming. In Optimization for Machine Learning, pages 55–83. MIT Press, 2012.

[8] M. Andersen, J. Dahl, and L. Vandenberghe. CVXOPT: Python software for convex optimization, version 1.1. http://cvxopt.org/, May 2015.

[9] W. Anderson and T. Morley. Eigenvalues of the Laplacian of a graph. Linear and Multilinear Algebra, 18(2):141–145, 1985.



[10] D. Applegate, R. Bixby, V. Chvatal, and W. Cook. Concorde TSP solver, 2006.

[11] N. Aybat, S. Zarmehri, and S. Kumara. An ADMM algorithm for clustering partially observed networks. In Proceedings of the SIAM International Conference on Data Mining. SIAM, 2015.

[12] H. Balakrishnan, I. Hwang, and C. Tomlin. Polynomial approximation algorithms for belief matrix maintenance in identity management. In IEEE Conference on Decision and Control, volume 5, pages 4874–4879, Dec 2004.

[13] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning, Neural Information Processing Systems Workshop, 2012.

[14] A. Baydin, B. Pearlmutter, A. Radul, and J. Siskind. Automatic differentiation in machine learning: a survey. Preprint, 2015. http://arxiv.org/abs/1502.05767.

[15] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing, 18(11):2419–2434, November 2009.

[16] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

[17] S. Becker, E. Candès, and M. Grant. Templates for convex cone problems with applications to sparse signal recovery. Mathematical Programming Computation, 3(3):165–218, 2011.

[18] C. Bekas, E. Kokiopoulou, and Y. Saad. An estimator for the diagonal of a matrix. Applied Numerical Mathematics, 57(11-12):1214–1229, November 2007.

[19] S. Benson and Y. Ye. Algorithm 875: DSDP5—software for semidefinite programming. ACM Transactions on Mathematical Software, 34(3), May 2008.

[20] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference, June 2010.

[21] D. Bertsekas. Constrained optimization and Lagrange multiplier methods. Academic Press, 2014.

[22] D. Boley. Local linear convergence of the alternating direction method of multipliers on quadratic or linear programs. SIAM Journal on Optimization, 23(4):2183–2207, 2013.

[23] S. Börm, L. Grasedyck, and W. Hackbusch. Introduction to hierarchical matrices with applications. Engineering Analysis with Boundary Elements, 27(5):405–422, 2003.


[24] S. Boyd. EE364a: Convex optimization I. http://stanford.edu/class/ee364a/, December 2015.

[25] S. Boyd, M. Hast, and K. Astrom. MIMO PID tuning via iterated LMI restriction. Int. J. Robust Nonlinear Control, 2015.

[26] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[27] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[28] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[29] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[30] R. Bracewell. The fast Hartley transform. In Proceedings of the IEEE, volume 72, pages 1010–1018, August 1984.

[31] A. Bradley. Algorithms for the equilibration of matrices and their application to limited-memory Quasi-Newton methods. PhD thesis, Stanford University, 2010.

[32] A. Brandt, S. McCormick, and J. Ruge. Algebraic multigrid (AMG) for sparse matrix equations. In D. Evans, editor, Sparsity and its Applications, pages 257–284. Cambridge University Press, 1985.

[33] D. Brélaz. New methods to color the vertices of a graph. Communications of the ACM, 22(4):251–256, April 1979.

[34] P. Brucker, B. Jurisch, and B. Sievers. A branch and bound algorithm for the job-shop scheduling problem. Discrete Appl. Math., 49(1):107–127, 1994.

[35] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4):231–357, 2015.

[36] E. Candès, L. Demanet, D. Donoho, and L. Ying. Fast discrete curvelet transforms. Multiscale Modeling and Simulation, 5(3):861–899, 2006.

[37] J. Carrier, L. Greengard, and V. Rokhlin. A fast adaptive multipole algorithm for particle simulations. SIAM Journal on Scientific and Statistical Computing, 9(4):669–686, 1988.


[38] I. Castillo, F. Kampas, and J. Pintér. Solving circle packing problems by global optimization: numerical results and industrial applications. European Journal of Operational Research, 191(3):786–802, 2008.

[39] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, May 2011.

[40] T. Chan, S. Esedoglu, and M. Nikolova. Algorithms for finding global minimizers of image segmentation and denoising models. SIAM Journal on Applied Mathematics, 66(5):1632–1648, 2006.

[41] R. Chartrand. Nonconvex splitting for regularized low-rank + sparse decomposition. IEEE Transactions on Signal Processing, 60(11):5810–5819, 2012.

[42] R. Chartrand and B. Wohlberg. A nonconvex ADMM algorithm for group sparsity with sparse groups. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6009–6013. IEEE, 2013.

[43] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

[44] C. Choi and Y. Ye. Solving sparse semidefinite programs using the dual scaling algorithm with an iterative solver. Working paper, Department of Management Sciences, University of Iowa, 2000.

[45] E. Chu, B. O'Donoghue, N. Parikh, and S. Boyd. A primal-dual operator splitting method for conic optimization. Preprint, 2013.

[46] E. Chu, N. Parikh, A. Domahidi, and S. Boyd. Code generation for embedded second-order cone programming. In Proceedings of the European Control Conference, pages 1547–1552, 2013.

[47] A. Cohen, I. Daubechies, and J.-C. Feauveau. Biorthogonal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics, 45(5):485–560, 1992.

[48] C. Collins and K. Stephenson. A circle packing algorithm. Computational Geometry, 25(3):233–256, 2003.

[49] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A MATLAB-like environment for machine learning. In BigLearn, Neural Information Processing Systems Workshop, 2011.


[50] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. International Journal of Pattern Recognition and Artificial Intelligence, 18(03):265–298, 2004.

[51] J. Cooley, P. Lewis, and P. Welch. The fast Fourier transform and its applications. IEEE Transactions on Education, 12(1):27–34, March 1969.

[52] J. Cooley and J. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965.

[53] R. Corless, G. Gonnet, D. Hare, D. Jeffrey, and D. Knuth. On the Lambert W function. Advances in Computational Mathematics, 5(1):329–359, 1996.

[54] J. Crawford and L. Auton. Experimental results on the crossover point in random 3-SAT. Artificial Intelligence, 81(1):31–57, 1996.

[55] A. Cross, R. Wilson, and E. Hancock. Inexact graph matching using genetic search. Pattern Recognition, 30(6):953–970, 1997.

[56] G. Dantzig, R. Fulkerson, and S. Johnson. Solution of a large-scale traveling-salesman problem. Journal of the Operations Research Society of America, 2(4):393–410, 1954.

[57] I. Daubechies. Orthonormal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics, 41(7):909–996, 1988.

[58] I. Daubechies. Ten lectures on wavelets, volume 61. SIAM, 1992.

[59] J. David and B. De Moor. The opposite of analytic centering for solving minimum rank problems in control and identification. In Proceedings of the IEEE Conference on Decision and Control, pages 2901–2902. IEEE, 1993.

[60] T. Davis. Direct Methods for Sparse Linear Systems (Fundamentals of Algorithms 2). SIAM, 2006.

[61] M. De Santo, P. Foggia, C. Sansone, and M. Vento. A large database of graphs and its use for benchmarking graph isomorphism algorithms. Pattern Recognition Letters, 24(8):1067–1079, 2003.

[62] N. Derbinsky, J. Bento, V. Elser, and J. Yedidia. An improved three-weight message-passing algorithm. Preprint, 2013. https://arxiv.org/abs/1305.1961.

[63] G. Di Pillo and L. Grippo. Exact penalty functions in constrained optimization. SIAM Journal on Control and Optimization, 27(6):1333–1360, 1989.


[64] S. Diamond and S. Boyd. Convex optimization with abstract linear operators. In Proceedingsof the IEEE International Conference on Computer Vision, pages 675–683, December 2015.

[65] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex opti-mization. Journal of Machine Learning Research, 17(83):1–5, 2016.

[66] S. Diamond and S. Boyd. Matrix-free convex optimization modeling. In B. Goldengorin,editor, Optimization and Its Applications in Control and Data Sciences, volume 115 of SpringerOptimization and Its Applications, pages 221–264. Springer, 2016.

[67] S. Diamond and S. Boyd. Stochastic matrix-free equilibration. Journal of Optimization Theoryand Applications, 172(2):436–454, 2016.

[68] S. Diamond, E. Chu, and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization, version 0.2. http://cvxpy.org/, May 2014.

[69] S. Diamond, R. Takapoui, and S. Boyd. A general system for heuristic minimization of convex functions over non-convex sets. Optimization Methods and Software, 33(1):165–193, 2018.

[70] M. Do and M. Vetterli. The finite ridgelet transform for image representation. IEEE Transactions on Image Processing, 12(1):16–28, January 2003.

[71] A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In Proceedings of the European Control Conference, pages 3071–3076, 2013.

[72] D. Dudgeon and R. Mersereau. Multidimensional Digital Signal Processing. Prentice-Hall, 1984.

[73] I. Duff, A. Erisman, and J. Reid. Direct Methods for Sparse Matrices. Oxford University Press, 1986.

[74] J. Eckstein and D. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.

[75] J. Eckstein and W. Yao. Understanding the convergence of the alternating direction method of multipliers: Theoretical and computational perspectives. Pacific Journal of Optimization, 11(4):619–644, 2015.

[76] T. Erseghe. Distributed optimal power flow using ADMM. IEEE Transactions on Power Systems, 29(5):2370–2380, 2014.

[77] M. Figueiredo, R. Nowak, and S. Wright. Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems. IEEE Journal of Selected Topics in Signal Processing, 1(4):586–597, December 2007.

[78] R. Fletcher. An exact penalty function for nonlinear programming with inequalities. Mathematical Programming, 5(1):129–150, 1973.

[79] D. Fong and M. Saunders. LSMR: An iterative algorithm for sparse least-squares problems. SIAM Journal on Scientific Computing, 33(5):2950–2971, 2011.

[80] D. Forsyth and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, 2002.

[81] C. Fougner and S. Boyd. Parameter selection and pre-conditioning for a graph form solver. Preprint, 2015. http://arxiv.org/pdf/1503.08366v1.pdf.

[82] K. Fountoulakis, J. Gondzio, and P. Zhlobich. Matrix-free interior point method for compressed sensing problems. Preprint, 2012. http://arxiv.org/pdf/1208.5435.pdf.

[83] R. Fourer, D. Gay, and B. Kernighan. AMPL: A Modeling Language for Mathematical Programming. Cengage Learning, 2002.

[84] A. Frank and A. Asuncion. University of California, Irvine machine learning repository, 2010.

[85] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1 of Springer Series in Statistics. Springer New York, 2001.

[86] K. Fujisawa, M. Fukuda, K. Kobayashi, M. Kojima, K. Nakata, M. Nakata, and M. Yamashita. SDPA (semidefinite programming algorithm) user’s manual – version 7.0.5. Technical report, 2008.

[87] M. Fukuda, M. Kojima, and M. Shida. Lagrangian dual interior-point methods for semidefinite programs. SIAM Journal on Optimization, 12(4):1007–1031, 2002.

[88] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.

[89] J. Gardiner, A. Laub, J. Amato, and C. Moler. Solution of the Sylvester matrix equation AXB^T + CXD^T = E. ACM Transactions on Mathematical Software, 18(2):223–231, June 1992.

[90] E. Ghadimi, A. Teixeira, I. Shames, and M. Johansson. Optimal parameter selection for the alternating direction method of multipliers (ADMM): Quadratic problems. IEEE Transactions on Automatic Control, 60(3):644–658, 2015.

[91] A. Gilbert, M. Strauss, J. Tropp, and R. Vershynin. One sketch for all: Fast algorithms for compressed sensing. In Proceedings of the ACM Symposium on Theory of Computing, pages 237–246, 2007.

[92] P. Giselsson and S. Boyd. Diagonal scaling in Douglas-Rachford splitting and ADMM. In Proceedings of the IEEE Conference on Decision and Control, 2014.

[93] P. Giselsson and S. Boyd. Monotonicity and restart in fast gradient methods. In 53rd Annual IEEE Conference on Decision and Control, pages 5058–5063, 2014.

[94] P. Giselsson and S. Boyd. Preconditioning in fast dual gradient methods. In 53rd Annual IEEE Conference on Decision and Control, pages 5040–5045, 2014.

[95] P. Giselsson and S. Boyd. Metric selection in fast dual forward–backward splitting. Automatica, 62:1–10, 2015.

[96] R. Glowinski and A. Marroco. Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires. Revue française d’automatique, informatique, recherche opérationnelle. Analyse numérique, 9(2):41–76, 1975.

[97] M. Goldberg. The packing of equal circles in a square. Mathematics Magazine, pages 24–30, 1970.

[98] T. Goldstein and S. Osher. The split Bregman method for ℓ1-regularized problems. SIAM Journal on Imaging Sciences, 2(2):323–343, 2009.

[99] J. Gondzio. Convergence analysis of an inexact feasible interior point method for convex quadratic programming. Preprint, 2012. http://arxiv.org/pdf/1208.5960.pdf.

[100] J. Gondzio. Matrix-free interior point method. Computational Optimization and Applications, 51(2):457–480, 2012.

[101] J. Gondzio and A. Grothey. Parallel interior-point solver for structured quadratic programs: Application to financial planning problems. Annals of Operations Research, 152(1):319–339, 2007.

[102] M. Grant. Disciplined Convex Programming. PhD thesis, Stanford University, 2004.

[103] M. Grant and S. Boyd. Graph implementations for nonsmooth convex programs. In V. Blondel, S. Boyd, and H. Kimura, editors, Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95–110. Springer, 2008.

[104] M. Grant and S. Boyd. CVX: MATLAB software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.

[105] M. Grant, S. Boyd, and Y. Ye. Disciplined convex programming. In L. Liberti and N. Maculan, editors, Global Optimization: From Theory to Implementation, Nonconvex Optimization and its Applications, pages 155–210. Springer, 2006.

[106] A. Greenbaum. Iterative Methods for Solving Linear Systems. Society for Industrial and Applied Mathematics, 1997.

[107] L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73(2):325–348, 1987.

[108] L. Greengard and J. Strain. The fast Gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79–94, 1991.

[109] A. Griewank. On automatic differentiation. In M. Iri and K. Tanabe, editors, Mathematical Programming: Recent Developments and Applications, pages 83–108. Kluwer Academic, 1989.

[110] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2015.

[111] W. Hackbusch. Multi-Grid Methods and Applications. Springer Berlin Heidelberg, 1985.

[112] W. Hackbusch. A sparse matrix arithmetic based on H-matrices. Part I: Introduction to H-matrices. Computing, 62(2):89–108, 1999.

[113] W. Hackbusch, B. Khoromskij, and S. Sauter. On H2-matrices. In H.-J. Bungartz, R. Hoppe, and C. Zenger, editors, Lectures on Applied Mathematics, pages 9–29. Springer Berlin Heidelberg, 2000.

[114] M. Halldórsson. A still better performance guarantee for approximate graph coloring. Information Processing Letters, 45(1):19–23, 1993.

[115] S. Han and O. Mangasarian. Exact penalty functions in nonlinear programming. Mathematical Programming, 17(1):251–269, 1979.

[116] L. He, C. Han, and W. Wee. Object recognition and recovery by skeleton graph matching. In IEEE International Conference on Multimedia and Expo, pages 993–996. IEEE, 2006.

[117] G. Hennenfent, F. Herrmann, R. Saab, O. Yilmaz, and C. Pajean. SPOT: A linear operator toolbox, version 1.2. http://www.cs.ubc.ca/labs/scl/spot/index.html, March 2014.

[118] M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4(5):303–320, 1969.

[119] M. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.

[120] L. Hien. Differential properties of Euclidean projection onto power cone. Mathematical Methods of Operations Research, 82(3):265–284, 2015.

[121] M. Hifi and R. M’Hallah. A literature review on circle and sphere packing problems: Models and methodologies. Advances in Operations Research, pages 1–22, 2009.

[122] H. Hmam. Quadratic optimization with one quadratic equality constraint. Technical report, Electronic Warfare and Radar Division, Defence Science and Technology Organisation (DSTO), Australia, 2010.

[123] K. Hoffman, M. Padberg, and G. Rinaldi. Traveling salesman problem. In Encyclopedia of Operations Research and Management Science, pages 1573–1578. Springer, 2013.

[124] M. Hong. A distributed, asynchronous and incremental algorithm for nonconvex optimization: An ADMM approach. IEEE Transactions on Control of Network Systems, 2017.

[125] M. Hong and Z. Luo. On the linear convergence of the alternating direction method of multipliers. Mathematical Programming, 162:165–199, 2017.

[126] M. Hong, Z. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3836–3840. IEEE, 2015.

[127] A. Hoorfar and M. Hassani. Inequalities on the Lambert W function and hyperpower function. Journal of Inequalities in Pure and Applied Mathematics, 9:1–5, 2008.

[128] H. Hoos. SATLIB — benchmark problems, 2016.

[129] K. Huang and N. Sidiropoulos. Consensus-ADMM for general quadratically constrained quadratic programming. IEEE Transactions on Signal Processing, 64(20):5297–5310, 2016.

[130] M. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2):433–450, 1990.

[131] I. Hwang, H. Balakrishnan, K. Roy, J. Shin, L. Guibas, and C. Tomlin. Multiple-target tracking and identity management. In Proceedings of IEEE Sensors, volume 1, pages 36–41, October 2003.

[132] L. Jacques, L. Duval, C. Chaux, and G. Peyré. A panorama on multiscale geometric representations, intertwining spatial, directional and frequency selectivity. Signal Processing, 91(12):2699–2730, 2011.

[133] A. Jensen and A. la Cour-Harbo. Ripples in Mathematics. Springer, 2001.

[134] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. Preprint, 2014. http://arxiv.org/abs/1408.5093.

[135] B. Jiang, S. Ma, and S. Zhang. Alternating direction method of multipliers for real and complex polynomial optimization models. Optimization, 63(6):883–898, 2014.

[136] R. Kalman. Identification of noisy systems. Russian Mathematical Surveys, 40(4):25–42, 1985.

[137] R. Karp. Reducibility among combinatorial problems. In R. Miller, J. Thatcher, and J. Bohlinger, editors, Complexity of Computer Computations, The IBM Research Symposia Series, pages 85–103. Springer US, 1972.

[138] C. Kelley. Iterative Methods for Linear and Nonlinear Equations. Society for Industrial and Applied Mathematics, 1995.

[139] J. Kelner, L. Orecchia, A. Sidford, and A. Zhu. A simple, combinatorial algorithm for solving SDD systems in nearly-linear time. In Proceedings of the ACM Symposium on Theory of Computing, pages 911–920, 2013.

[140] S.-J. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. An interior-point method for large-scale ℓ1-regularized least squares. IEEE Journal on Selected Topics in Signal Processing, 1(4):606–617, December 2007.

[141] P. Knight. The Sinkhorn–Knopp algorithm: Convergence and applications. SIAM Journal on Matrix Analysis and Applications, 30(1):261–275, 2008.

[142] M. Kocvara and M. Stingl. On the solution of large-scale SDP problems by the modified barrier method using iterative solvers. Mathematical Programming, 120(1):285–287, 2009.

[143] J. Kovacevic and M. Vetterli. Nonseparable multidimensional perfect reconstruction filter banks and wavelet bases for R^n. IEEE Transactions on Information Theory, 38(2):533–555, March 1992.

[144] P. Krishnaprasad and R. Barakat. A descent approach to a class of inverse problems. Journal of Computational Physics, 24(4):339–347, 1977.

[145] J. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.

[146] H. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics, 52(1):7–21, 2005.

[147] S. Lacoste-Julien, M. Schmidt, and F. Bach. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method. Preprint, 2012. http://arxiv.org/pdf/1212.2002v2.pdf.

[148] G. Lan, Z. Lu, and R. Monteiro. Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Mathematical Programming, 126(1):1–29, 2011.

[149] E. Lawler. The Traveling Salesman Problem: A Guided Tour of Combinatorial Optimization. Wiley Series in Discrete Mathematics & Optimization, 1985.

[150] E. Lawler and D. Wood. Branch-and-bound methods: A survey. Operations Research, 14(4):699–719, 1966.

[151] J. Lee. A graph-based approach for modeling and indexing video data. In IEEE International Symposium on Multimedia, pages 348–355. IEEE, 2006.

[152] G. Li and T.-K. Pong. Global convergence of splitting methods for nonconvex composite optimization. Preprint, November 2015. http://arxiv.org/pdf/1407.0753.pdf.

[153] A. Liavas and N. Sidiropoulos. Parallel algorithms for constrained tensor factorization via alternating direction method of multipliers. IEEE Transactions on Signal Processing, 63(20):5450–5463, 2015.

[154] E. Liberty. Simple and deterministic matrix sketching. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 581–588, 2013.

[155] J. Lim. Two-dimensional Signal and Image Processing. Prentice-Hall, 1990.

[156] Y. Lin, D. Lee, and L. Saul. Nonnegative deconvolution for time of arrival estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 377–380, May 2004.

[157] T. Lipp and S. Boyd. Variations and extension of the convex–concave procedure. Optimization and Engineering, pages 1–25, 2014.

[158] C. Van Loan. Computational Frameworks for the Fast Fourier Transform. SIAM, 1992.

[159] J. Lofberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the IEEE International Symposium on Computer Aided Control Systems Design, pages 284–289, 2004.

[160] J. Lofberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the IEEE International Symposium on Computer Aided Control Systems Design, pages 284–289, September 2004.

[161] Y. Lu and M. Do. Multidimensional directional filter banks and surfacelets. IEEE Transactions on Image Processing, 16(4):918–931, April 2007.

[162] S. Mallat. A theory for multiresolution signal decomposition: the wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674–693, July 1989.

[163] S. Martucci. Symmetric convolution and the discrete sine and cosine transforms. IEEE Transactions on Signal Processing, 42(5):1038–1051, May 1994.

[164] J. Mattingley and S. Boyd. CVXGEN: A code generator for embedded convex optimization. Optimization and Engineering, 13(1):1–27, 2012.

[165] R. Merris. Laplacian matrices of graphs: a survey. Linear Algebra and its Applications, 197:143–176, 1994.

[166] D. Mitchell, B. Selman, and H. Levesque. Hard and easy distributions of SAT problems. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 92, pages 459–465, 1992.

[167] MOSEK optimization software, version 7. https://mosek.com/, January 2015.

[168] J. Mota, J. Xavier, P. Aguiar, and M. Püschel. Basis pursuit in sensor networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 2916–2919. IEEE, 2011.

[169] P. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, 100(9):917–922, 1977.

[170] Y. Nesterov. Towards nonsymmetric conic optimization. Optimization Methods and Software, 27(4–5):893–917, 2012.

[171] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming, volume 13. Society for Industrial and Applied Mathematics, 1994.

[172] Y. Nesterov and A. Nemirovsky. Conic formulation of a convex programming problem and duality. Optimization Methods and Software, 1(2):95–115, 1992.

[173] L. Ning, T. Georgiou, A. Tannenbaum, and S. Boyd. Linear models based on noisy data and the Frisch scheme. SIAM Review, 57(2):167–197, 2015.

[174] J. Nocedal and S. Wright. Numerical Optimization. Springer Science, 2006.

[175] B. O’Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, pages 1–27, 2016.

[176] M. Padberg and G. Rinaldi. A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM Review, 33(1):60–100, 1991.

[177] C. Paige and M. Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Transactions on Mathematical Software, 8(1):43–71, 1982.

[178] C. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Dover, Mineola, NY, 1998.

[179] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2014.

[180] Z. Peng, J. Chen, and W. Zhu. A proximal alternating direction method of multipliers for a minimization problem with nonconvex constraints. Journal of Global Optimization, pages 1–18, 2015.

[181] A. Perold. Large-scale portfolio optimization. Management Science, 30(10):1143–1160, 1984.

[182] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1762–1769, 2011.

[183] T. Pock, D. Cremers, H. Bischof, and A. Chambolle. An algorithm for minimizing the Mumford-Shah functional. In Proceedings of the IEEE International Conference on Computer Vision, pages 1133–1140, September 2009.

[184] M. Powell. Algorithms for nonlinear constraints that use Lagrangian functions. Mathematical Programming, 14(1):224–248, 1978.

[185] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 519–530, 2013.

[186] J. Rocha and T. Pavlidis. A shape analysis model with applications to a character recognition system. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(4):393–404, 1994.

[187] D. Ruiz. A scaling algorithm to equilibrate both rows and columns norms in matrices. Technical report, Rutherford Appleton Lab., Oxon, UK, RAL-TR-2001-034, 2001.

[188] G. Sagnol. PICOS: A Python interface for conic optimization solvers, version 1.1. http://picos.zib.de/index.html, April 2015.

[189] M. Saunders, B. Kim, C. Maes, S. Akle, and M. Zahr. PDCO: Primal-dual interior method for convex objectives. http://web.stanford.edu/group/SOL/software/pdco/, November 2013.

[190] J. Saunderson, V. Chandrasekaran, P. Parrilo, and A. Willsky. Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting. SIAM Journal on Matrix Analysis and Applications, 33(4):1395–1416, 2012.

[191] C. Schellewald, S. Roth, and C. Schnörr. Evaluation of convex optimization techniques for the weighted graph-matching problem in computer vision. In Pattern Recognition, pages 361–368. Springer, 2001.

[192] L. Schizas, A. Ribeiro, and G. Giannakis. Consensus in ad hoc WSNs with noisy links – Part I: Distributed estimation of deterministic signals. IEEE Transactions on Signal Processing, 56(1):350–364, 2008.

[193] M. Schneider and S. Zenios. A comparative study of algorithms for matrix balancing. Operations Research, 38(3):439–455, 1990.

[194] T. Sebastian, P. Klein, and B. Kimia. Recognition of shapes by editing their shock graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):550–571, 2004.

[195] H. Sedghi, A. Anandkumar, and E. Jonckheere. Multi-step stochastic ADMM in high dimensions: Applications to sparse optimization and matrix decomposition. In Advances in Neural Information Processing Systems, pages 2771–2779, 2014.

[196] W. Sharpe. Portfolio Theory and Capital Markets. McGraw-Hill, New York, 1970.

[197] H. Sherali and W. Adams. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM Journal on Discrete Mathematics, 3(3):411–430, 1990.

[198] R. Sinkhorn and P. Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.

[199] A. Skajaa and Y. Ye. A homogeneous interior-point algorithm for nonsymmetric convex conic optimization. Mathematical Programming, pages 1–32, May 2014.

[200] A. Sluis. Condition numbers and equilibration of matrices. Numerische Mathematik, 14(1):14–23, 1969.

[201] E. Specht. Packomania, October 2013.

[202] D. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the ACM Symposium on Theory of Computing, pages 81–90, 2004.

[203] J.E. Spingarn. Applications of the method of partial inverses to convex programming: decomposition. Mathematical Programming, 32(2):199–223, 1985.

[204] J.-L. Starck, E. Candès, and D. Donoho. The curvelet transform for image denoising. IEEE Transactions on Image Processing, 11(6):670–684, June 2002.

[205] K. Stephenson. Introduction to circle packing: The theory of discrete analytic functions. Cambridge University Press, 2005.

[206] R. Stubbs and S. Mehrotra. A branch-and-cut method for 0-1 mixed convex programming. Mathematical Programming, 86(3):515–532, 1999.

[207] J. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11(1-4):625–653, 1999.

[208] R. Takapoui and H. Javadi. Preconditioning via diagonal scaling. EE364b: Convex Optimization II Class Project, 2014.

[209] R. Takapoui, N. Moehle, S. Boyd, and A. Bemporad. A simple effective heuristic for embedded mixed-integer quadratic programming. In Proceedings of the American Control Conference, pages 5620–5625, 2016.

[210] M. Tawarmalani and N.V. Sahinidis. A polyhedral branch-and-cut approach to global optimization. Mathematical Programming, 103(2):225–249, 2005.

[211] R. Tibshirani. Lecture notes in modern regression, March 2013.

[212] K.-C. Toh. Solving large scale semidefinite programs via an iterative solver on the augmented systems. SIAM Journal on Optimization, 14(3):670–698, 2004.

[213] K.-C. Toh, M. Todd, and R. Tütüncü. SDPT3 – a MATLAB software package for semidefinite programming, version 4.0. Optimization Methods and Software, 11:545–581, 1999.

[214] M. Udell, K. Mohan, D. Zeng, J. Hong, S. Diamond, and S. Boyd. Convex optimization in Julia. In Proceedings of the Workshop for High Performance Technical Computing in Dynamic Languages, pages 18–28, 2014.

[215] S. Umeyama. An eigendecomposition approach to weighted graph matching problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(5):695–703, 1988.

[216] G. Vaillant. linop, version 0.7. http://pythonhosted.org//linop/, December 2013.

[217] E. van den Berg and M. Friedlander. Probing the Pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2009.

[218] L. Vandenberghe and S. Boyd. A polynomial-time algorithm for determining quadratic Lyapunov functions for nonlinear systems. In Proceedings of the European Conference on Circuit Theory and Design, pages 1065–1068, 1993.

[219] L. Vandenberghe and S. Boyd. A primal-dual potential reduction method for problems involving matrix inequalities. Mathematical Programming, 69(1-3):205–236, 1995.

[220] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49–95, 1996.

[221] K. Vishnoi. Laplacian solvers and their algorithmic applications. Theoretical Computer Science, 8(1-2):1–141, 2012.

[222] J. Vogelstein, J. Conroy, L. Podrazik, S. Kratzer, E. Harley, D. Fishkind, R. Vogelstein, and C. Priebe. Fast approximate quadratic programming for graph matching. PLOS ONE, 10(4):1–17, 2015.

[223] J. Von Neumann. Some matrix inequalities and metrization of metric space. Tomsk University Review, 1:286–296, 1937.

[224] B. Wahlberg, S. Boyd, M. Annergren, and Y. Wang. An ADMM algorithm for a class of total variation regularized estimation problems. IFAC Proceedings Volumes, 45(16):83–88, 2012.

[225] D. Wang, H. Lu, and M. Yang. Online object tracking with sparse prototypes. IEEE Transactions on Image Processing, 22(1):314–325, 2013.

[226] F. Wang, Z. Xu, and H. Xu. Convergence of Bregman alternating direction method with multipliers for nonconvex composite problems. Preprint, 2014. https://arxiv.org/abs/1410.8625v3.

[227] S. Wright. Primal-Dual Interior-Point Methods, volume 54. Society for Industrial and Applied Mathematics, 1997.

[228] Y. Xu, W. Yin, Z. Wen, and Y. Zhang. An alternating direction algorithm for matrix completion with nonnegative factors. Frontiers of Mathematics in China, 7(2):365–384, 2012.

[229] C. Yang, R. Duraiswami, and L. Davis. Efficient kernel machines using the improved fast Gauss transform. In L. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1561–1568. MIT Press, 2005.

[230] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis. Improved fast Gauss transform and efficient kernel density estimation. In Proceedings of the IEEE International Conference on Computer Vision, volume 1, pages 664–671, October 2003.

[231] Y. Ye. Interior Point Algorithms: Theory and Analysis. Wiley-Interscience, 2011.

[232] L. Ying, L. Demanet, and E. Candès. 3D discrete curvelet transform. In Proceedings of SPIE: Wavelets XI, volume 5914, pages 351–361, 2005.

[233] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-ℓ1 optical flow. In Pattern Recognition, volume 4713 of Lecture Notes in Computer Science, pages 214–223. Springer Berlin Heidelberg, 2007.

[234] R. Zhang and J. Kwok. Asynchronous distributed ADMM for consensus optimization. In Proceedings of the International Conference on Machine Learning, pages 1701–1709, 2014.

[235] X.-Y. Zhao, D. Sun, and K.-C. Toh. A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM Journal on Optimization, 20(4):1737–1765, 2010.

