
arXiv:1301.4566v2 [stat.ML] 2 May 2013



Sparse/Robust Estimation and Kalman Smoothing with Nonsmooth Log-Concave Densities: Modeling, Computation, and Theory

Aleksandr Y. Aravkin SARAVKIN@US.IBM.COM
IBM T.J. Watson Research Center, Yorktown, NY 10598

James V. Burke BURKE@MATH.WASHINGTON.EDU
Department of Mathematics, University of Washington, Seattle, WA, USA

Gianluigi Pillonetto GIAPI@DEI.UNIPD.IT

Department of Information Engineering, University of Padova, Padova, Italy

Editor:

Abstract

We introduce a new class of quadratic support (QS) functions, many of which already play a crucial role in a variety of applications, including machine learning, robust statistical inference, sparsity promotion, and inverse problems such as Kalman smoothing. Well known examples of QS penalties include the ℓ2, Huber, ℓ1 and Vapnik losses. We build on a dual representation for QS functions, using it to characterize conditions necessary to interpret these functions as negative logs of true probability densities. This interpretation establishes the foundation for statistical modeling with both known and new QS loss functions, and enables construction of non-smooth multivariate distributions with specified means and variances from simple scalar building blocks.

For a broad subclass of QS loss functions known as piecewise linear quadratic (PLQ) penalties, the dual representation allows for the development of efficient numerical estimation schemes. The main contribution of this paper is a flexible statistical modeling framework for a variety of learning applications, together with a toolbox of efficient numerical methods for estimation using these densities. In particular, for PLQ densities, we show that interior point (IP) methods can be used. IP methods solve nonsmooth optimization problems by working directly with smooth systems of equations characterizing the optimality of these problems. We provide a few simple numerical examples, along with a code that can be used to prototype general PLQ problems.

The efficiency of the IP approach depends on the structure of particular applications. We consider the class of dynamic inverse problems using Kalman smoothing. This class comprises a wide variety of applications, where the aim is to reconstruct the state of a dynamical system with known process and measurement models starting from noisy output samples. In the classical case, Gaussian errors are assumed both in the process and measurement models for such problems. We show that the extended framework allows arbitrary PLQ densities to be used, and that the proposed IP approach solves the generalized Kalman smoothing problem while maintaining the linear complexity in the size of the time series, just as in the Gaussian case. This extends the computational efficiency of the Mayne-Fraser and Rauch-Tung-Striebel algorithms to a much broader nonsmooth setting, and includes many recently proposed robust and sparse smoothers as special cases.

Keywords: statistical modeling; convex analysis; nonsmooth optimization; robust inference; sparsity optimization; Kalman smoothing; interior point methods

1. The authors would like to thank Bradley M. Bell for insightful discussions and helpful suggestions.



1. Introduction

Consider the classical problem of Bayesian parametric regression (MacKay, 1992; Roweis and Ghahramani, 1999), where the unknown x ∈ R^n is a random vector¹, with a prior distribution specified using a known invertible matrix G ∈ R^{n×n} and known vector µ ∈ R^n via

µ = Gx + w, (1.1)

where w is a zero mean vector with covariance Q. Let z denote a linear transformation of x contaminated with additive zero mean measurement noise v with covariance R,

z = Hx + v, (1.2)

where H ∈ R^{ℓ×n} is a known matrix, while v and w are independent. It is well known that the (unconditional) minimum variance linear estimator of x, as a function of z, is the solution to the following optimization problem:

min_x (z − Hx)ᵀ R^{−1} (z − Hx) + (µ − Gx)ᵀ Q^{−1} (µ − Gx). (1.3)
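Since (1.3) is an unconstrained convex quadratic, its minimizer can be computed directly from the normal equations. The following sketch sets up a small instance with numpy and checks the result; all problem data here is synthetic and chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, ell = 4, 6  # state and measurement dimensions (illustrative sizes)

# Synthetic problem data for (1.3): prior mu = G x + w, measurements z = H x + v.
G = np.eye(n)
H = rng.standard_normal((ell, n))
Q = 0.5 * np.eye(n)    # prior covariance of w
R = 0.1 * np.eye(ell)  # measurement covariance of v
x_true = rng.standard_normal(n)
mu = G @ x_true + rng.multivariate_normal(np.zeros(n), Q)
z = H @ x_true + rng.multivariate_normal(np.zeros(ell), R)

# Setting the gradient of (1.3) to zero gives the normal equations
#   (H^T R^{-1} H + G^T Q^{-1} G) x = H^T R^{-1} z + G^T Q^{-1} mu.
Rinv, Qinv = np.linalg.inv(R), np.linalg.inv(Q)
A = H.T @ Rinv @ H + G.T @ Qinv @ G
rhs = H.T @ Rinv @ z + G.T @ Qinv @ mu
x_hat = np.linalg.solve(A, rhs)

def objective(x):
    rz, rm = z - H @ x, mu - G @ x
    return rz @ Rinv @ rz + rm @ Qinv @ rm

# The solution of the normal equations globally minimizes the quadratic objective.
assert objective(x_hat) <= objective(x_true)
print("MAP estimate:", x_hat)
```

For the structured H, G, Q, R arising in Kalman smoothing, this dense solve is replaced by the O(N) block-tridiagonal recursions discussed below.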

As we will show, (1.3) includes estimation problems arising in discrete-time dynamic linear systems which admit a state space representation (Anderson and Moore, 1979; Brockett, 1970). In this context, x is partitioned into N subvectors x_k, where each x_k represents the hidden system state at time instant k. For known data z, the classical Kalman smoother exploits the special structure of the matrices H, G, Q and R to compute the solution of (1.3) in O(N) operations (Gelb, 1974). This procedure returns the minimum variance estimate of the state sequence {x_k} when the additive noise in the system is assumed to be Gaussian.

In many circumstances, the estimator (1.3) performs poorly; put another way, quadratic penalization on model deviation is a bad model in many situations. For instance, it is not robust with respect to the presence of outliers in the data (Huber, 1981; Gao, 2008; Aravkin et al., 2011a; Farahmand et al., 2011) and may have difficulties in reconstructing fast system dynamics, e.g. jumps in the state values (Ohlsson et al., 2011). In addition, sparsity-promoting regularization is often used in order to extract a small subset from a large measurement or parameter vector which has greatest impact on the predictive capability of the estimate for future data. This sparsity principle permeates many well known techniques in machine learning and signal processing, including feature selection, selective shrinkage, and compressed sensing (Hastie and Tibshirani, 1990; Efron et al., 2004; Donoho, 2006). In these cases, (1.3) is often replaced by a more general formulation

min_x V(Hx − z; R) + W(Gx − µ; Q), (1.4)

where the loss V may be the ℓ2-norm, the Huber penalty (Huber, 1981), Vapnik's ε-insensitive loss (used in support vector regression (Vapnik, 1998), see also (Hastie et al., 2001)) or the hinge loss (leading to support vector classifiers (Evgeniou et al., 2000; Pontil and Verri, 1998; Scholkopf et al., 2000)). The regularizer W may be the ℓ2-norm, the ℓ1-norm (as in the LASSO (Tibshirani, 1996)), or a weighted combination of the two, yielding the elastic net procedure (Zou and Hastie, 2005). Many learning algorithms using infinite-dimensional reproducing kernel Hilbert spaces as hypothesis spaces (Aronszajn, 1950; Saitoh, 1988; Cucker and Smale, 2001) boil down to solving

1. All vectors are column vectors, unless otherwise specified.


finite-dimensional problems of the form (1.4) by virtue of the representer theorem (Wahba, 1998; Scholkopf et al., 2001).

These robust and sparse approaches can often be interpreted as placing non-Gaussian priors on w (or directly on x) and on the measurement noise v. The Bayesian interpretation of (1.4) has been extensively studied in the statistical and machine learning literature in recent years, and probabilistic approaches used in the analysis of estimation and learning algorithms can be found e.g. in (Mackay, 1994; Tipping, 2001; Wipf et al., 2011). Non-Gaussian model errors and priors leading to a great variety of loss and penalty functions are also reviewed in (Palmer et al., 2006) using convex-type representations, and integral-type variational representations related to Gaussian scale mixtures.

In contrast to the above approaches, in the first part of the paper, we consider a wide class of quadratic support (QS) functions and exploit their dual representation. This class of functions generalizes the notion of piecewise linear quadratic (PLQ) penalties (Rockafellar and Wets, 1998). The dual representation is the key to identifying which QS loss functions can be associated with a density, which in turn allows us to interpret the solution to the problem (1.4) as a MAP estimator when the loss functions V and W come from this subclass of QS penalties. This viewpoint allows statistical modeling using non-smooth penalties, such as the ℓ1, hinge, Huber and Vapnik losses, which are all PLQ penalties. Identifying a statistical interpretation for this class of problems gives us several advantages, including a systematic constructive approach to prescribe mean and variance parameters for the corresponding model, a property that is particularly important for Kalman smoothing. In addition, the dual representation provides the foundation for efficient numerical methods in estimation based on interior point optimization technology.
In the second part of the paper, we derive the Karush-Kuhn-Tucker (KKT) equations for problem (1.4), and introduce interior point (IP) methods, which are iterative methods to solve the KKT equations using smooth approximations. This is essentially a smoothing approach to many (non-smooth) robust and sparse problems of interest to practitioners. Furthermore, we provide conditions under which the IP methods solve (1.4) when V and W come from PLQ densities, and describe implementation details for the entire class.

A concerted research effort has recently focused on the solution of regularized large-scale inverse and learning problems, where computational costs and memory limitations are critical. This class of problems includes the popular kernel-based methods (Rasmussen and Williams, 2006; Scholkopf and Smola, 2001; Smola and Scholkopf, 2003), coordinate descent methods (Tseng and Yun, 2008; Lucidi et al., 2007; Dinuzzo, 2011) and decomposition techniques (Joachims, 1998; Lin, 2001; Lucidi et al., 2007), one of which is the widely used sequential minimal optimization algorithm for support vector machines (Platt, 1998). Other techniques are based on kernel approximations, e.g. using incomplete Cholesky factorization (Fine and Scheinberg, 2001), approximate eigen-decomposition (Zhang and Kwok, 2010) or truncated spectral representations (Pillonetto and Bell, 2007). Efficient interior point methods have been developed for ℓ1-regularized problems (Kim et al., 2007), and for support vector machines (Ferris and Munson, 2003).

In contrast, general and efficient solvers for state space estimation problems of the form (1.4) are missing in the literature. The last part of this paper provides a contribution to fill this gap, specializing the general results to the dynamic case, and recovering the classical efficiency results of the least-squares formulation. In particular, we design new Kalman smoothers tailored for systems subject to noises coming from PLQ densities.
Amazingly, it turns out that the IP method used in (Aravkin et al., 2011a) generalizes perfectly to the entire class of PLQ densities under a simple verifiable non-degeneracy condition. In practice, IP methods converge in a small number of iterations, and the effort per iteration depends on the structure of the underlying problem. We show that the IP


iterations for all PLQ Kalman smoothing problems can be computed with a number of operations that scales linearly in N, as in the quadratic case. This theoretical foundation generalizes the results recently obtained in (Aravkin et al., 2011a,b; Farahmand et al., 2011; Ohlsson et al., 2011), framing them as particular cases of the general framework presented here.

The paper is organized as follows. In Section 2 we introduce the class of QS convex functions, and give sufficient conditions that allow us to interpret these functions as the negative logs of associated probability densities. In Section 3 we show how to construct QS penalties and densities having a desired structure from basic components, and in particular how multivariate densities can be endowed with prescribed means and variances using scalar building blocks. To illustrate this procedure, further details are provided for the Huber and Vapnik penalties. In Section 4, we focus on PLQ penalties, derive the associated KKT system, and present a theorem that guarantees convergence of IP methods under appropriate hypotheses. In Section 5, we present a few simple well-known problems, and compare a basic IP implementation for these problems with an ADMM implementation (all code is available online). In Section 6, we present the Kalman smoothing dynamic model, formulate Kalman smoothing with PLQ penalties, present the KKT system for the dynamic case, and show that IP iterations for PLQ smoothing preserve the classical computational efficiency known for the Gaussian case. We present numerical examples using both simulated and real data in Section 7, and make some concluding remarks in Section 8. Section 9 serves as an appendix where supporting mathematical results and proofs are presented.

2. Quadratic Support Functions and Densities

In this section, we introduce the class of Quadratic Support (QS) functions, characterize some of their properties, and show that many commonly used penalties fall into this class. We also give a statistical interpretation to QS penalties by interpreting them as negative log likelihoods of probability densities; this relationship allows prescribing means and variances along with the general quality of the error model, an essential requirement of the Kalman smoothing framework and many other areas.

2.1 Preliminaries

We recall a few definitions from convex analysis, required to specify the domains of QS penalties. The reader is referred to (Rockafellar, 1970; Rockafellar and Wets, 1998) for more detailed reading.

• (Affine hull) Define the affine hull of any set C ⊂ R^n, denoted by aff(C), as the smallest affine set (translated subspace) that contains C.

• (Cone) For any set C ⊂ R^n, denote by cone C the set {tr | r ∈ C, t ∈ R_+}.

• (Domain) For f : R^n → R̄ = R ∪ {∞}, dom(f) = {x : f(x) < ∞}.

• (Polars of convex sets) For any convex set C ⊂ R^m, the polar of C is defined to be

C° := {r | 〈r, d〉 ≤ 1 ∀ d ∈ C},

and if C is a convex cone, this representation is equivalent to

C° := {r | 〈r, d〉 ≤ 0 ∀ d ∈ C}.


• (Horizon cone) Let C ⊂ R^n be a nonempty convex set. The horizon cone C^∞ is the convex cone of 'unbounded directions' for C, i.e. d ∈ C^∞ if C + d ⊂ C.

• (Barrier cone) The barrier cone of a convex set C is denoted by bar(C):

bar(C) := {x* | for some β ∈ R, 〈x, x*〉 ≤ β ∀ x ∈ C}.

• (Support function) The support function for a set C is denoted by δ*(x | C):

δ*(x | C) := sup_{c∈C} 〈x, c〉.
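To make the support and gauge functionals concrete, here is a small numerical sketch (our own example, not from the paper) for the box C = [−1, 1]³, whose support function is the ℓ1-norm and whose gauge γ(x | C) = inf{t ≥ 0 : x ∈ tC} is the ℓ∞-norm.

```python
import numpy as np

def support_box(x):
    # delta*(x | C) = sup_{c in C} <x, c> for C = [-1, 1]^n; the sup is
    # attained at c = sign(x), giving the l1-norm of x.
    return np.sum(np.abs(x))

def gauge_box(x):
    # gamma(x | C) = inf{t >= 0 : x in tC} = max_i |x_i|, the l-infinity norm.
    return np.max(np.abs(x))

x = np.array([1.5, -0.3, 2.0])

# Brute-force delta*(x | C) over a grid of points of C as a sanity check.
g = np.linspace(-1.0, 1.0, 21)
C = np.stack(np.meshgrid(g, g, g), axis=-1).reshape(-1, 3)
brute = np.max(C @ x)

assert abs(brute - support_box(x)) < 1e-9
assert gauge_box(x) == 2.0
print("support:", support_box(x), "gauge:", gauge_box(x))
```

The same box (scaled and products of it) reappears below as the dual set U of the ℓ1, Huber, and Vapnik penalties.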

2.2 QS functions and densities

We now introduce the QS functions and associated densities that are the focus of this paper. We begin with the dual representation, which is crucial both to establishing a statistical interpretation and to the development of a computational framework.

Definition 1 (Quadratic Support functions and penalties) A QS function is any function ρ(U, M, b, B; ·) : R^n → R̄ having representation

ρ(U, M, b, B; y) = sup_{u∈U} { 〈u, b + By〉 − ½〈u, Mu〉 }, (2.1)

where U ⊂ R^m is a nonempty convex set, M ∈ S^m_+, the set of real symmetric positive semidefinite matrices, and b + By is an injective affine transformation in y, with B ∈ R^{m×n}, so, in particular, n ≤ m and null(B) = {0}.

When 0 ∈U, we refer to the associated QS function as a penalty, since it is necessarily non-negative.

Remark 2 When U is polyhedral, 0 ∈U, b = 0 and B = I, we recover the basic piecewise linear-quadratic penalties characterized in (Rockafellar and Wets, 1998, Example 11.18).

Theorem 3 Let U, M, b, B be as in Definition 1, and set K = U^∞ ∩ null(M). Then

B^{−1}[bar(U) + Ran(M) − b] ⊂ dom[ρ(U, M, b, B; ·)] ⊂ B^{−1}[K° − b],

with equality throughout when bar(U) + Ran(M) is closed, where bar(U) = dom(δ*(· | U)) is the barrier cone of U. In particular, equality always holds when U is polyhedral.

We now show that many commonly used penalties are special cases of the QS (and indeed, of the PLQ) class.

Remark 4 (scalar examples) The ℓ2, ℓ1, elastic net, Huber, hinge, and Vapnik penalties are all representable using the notation of Definition 1.

1. ℓ2: Take U = R, M = 1, b = 0, and B = 1. We obtain

ρ(y) = sup_{u∈R} { uy − u²/2 }.

The function inside the sup is maximized at u = y, hence ρ(y) = ½y²; see the top left panel of Fig. 1.


Figure 1: Scalar ℓ2 (top left), ℓ1 (top right), Huber (middle left), Vapnik (middle right), elastic net (bottom left) and smooth insensitive loss (bottom right) penalties.

2. ℓ1: Take U = [−1, 1], M = 0, b = 0, and B = 1. We obtain

ρ(y) = sup_{u∈[−1,1]} uy.

The function inside the sup is maximized by taking u = sign(y), hence ρ(y) = |y|; see the top right panel of Fig. 1.

3. Elastic net: ℓ2 + λℓ1. Take

U = R × [−λ, λ], b = [0; 0], M = [1 0; 0 0], B = [1; 1].

This construction reveals the general calculus of PLQ addition, see Remark 5. See the bottom left panel of Fig. 1.

4. Huber: Take U = [−κ, κ], M = 1, b = 0, and B = 1. We obtain

ρ(y) = sup_{u∈[−κ,κ]} { uy − u²/2 },

with three explicit cases:

(a) If y < −κ, take u = −κ to obtain −κy − ½κ².
(b) If −κ ≤ y ≤ κ, take u = y to obtain ½y².
(c) If y > κ, take u = κ to obtain κy − ½κ².

This is the Huber penalty, shown in the middle left panel of Fig. 1.

6

5. Hinge loss: Taking B = 1, b = −ε, M = 0 and U = [0, 1] we have

ρ(y) = sup_{u∈U} (y − ε)u = (y − ε)_+.

To verify this, just note that if y < ε, u* = 0; otherwise u* = 1.

6. Vapnik loss is given by (y − ε)_+ + (−y − ε)_+. We immediately obtain its PLQ representation by taking

B = [1; −1], b = −[ε; ε], M = [0 0; 0 0], U = [0, 1] × [0, 1],

to yield

ρ(y) = sup_{u∈U} 〈[y − ε; −y − ε], u〉 = (y − ε)_+ + (−y − ε)_+.

The Vapnik penalty is shown in the middle right panel of Fig. 1.

7. Soft hinge loss function (Chu et al., 2001). Combining ideas from examples 4 and 5, we can construct a 'soft' hinge loss, i.e. the function

ρ(y) = 0 if y < ε; ½(y − ε)² if ε < y < ε + κ; κ(y − ε) − ½κ² if ε + κ < y,

that has a smooth (quadratic) transition rather than a kink at ε. Taking B = 1, b = −ε, M = 1 and U = [0, κ] we have

ρ(y) = sup_{u∈[0,κ]} { (y − ε)u − ½u² }.

To verify that this function has the explicit representation given above, note that if y < ε, u* = 0; if ε < y < κ + ε, we have u* = (y − ε)_+; and if κ + ε < y, we have u* = κ.

8. Soft insensitive loss function (Chu et al., 2001). Using example 7, we can create a symmetric soft insensitive loss function (which one might term the Hubnik) by adding together two soft hinge loss functions:

ρ(y) = sup_{u∈[0,κ]} { (y − ε)u − ½u² } + sup_{u∈[0,κ]} { (−y − ε)u − ½u² }
     = sup_{u∈[0,κ]²} { 〈[y − ε; −y − ε], u〉 − ½ uᵀ [1 0; 0 1] u }.

See the bottom right panel of Fig. 1.

Note that the affine generalization (Definition 1) is needed to form the elastic net, the Vapnik penalty, and the SILF function, as all of these are sums of simpler QS penalties. These sum constructions are examples of a general calculus which allows the modeler to build up a QS density having a desired structure. This calculus is described in the following remark.
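The scalar examples above can be checked numerically by brute-forcing the supremum in Definition 1 over a fine grid of U. The sketch below (grids and tolerances are our own choices, and the grid search is only a stand-in for the interior point machinery developed later) recovers the ℓ1, Huber, and hinge penalties.

```python
import numpy as np

def qs_scalar(y, u_grid, M=0.0, b=0.0, B=1.0):
    # rho(y) = sup_{u in U} [ u*(b + B*y) - 0.5*M*u^2 ], with the set U
    # sampled by u_grid.
    return np.max(u_grid * (b + B * y) - 0.5 * M * u_grid**2)

kappa, eps = 0.5, 0.25
u_l1 = np.linspace(-1.0, 1.0, 2001)      # U = [-1, 1]
u_hb = np.linspace(-kappa, kappa, 2001)  # U = [-kappa, kappa]
u_hi = np.linspace(0.0, 1.0, 2001)       # U = [0, 1]

ys = np.linspace(-3.0, 3.0, 61)
l1  = [qs_scalar(y, u_l1) for y in ys]         # should give |y|
hub = [qs_scalar(y, u_hb, M=1.0) for y in ys]  # should give the Huber penalty
hin = [qs_scalar(y, u_hi, b=-eps) for y in ys] # should give (y - eps)_+

huber = lambda y: 0.5 * y**2 if abs(y) <= kappa else kappa * abs(y) - 0.5 * kappa**2
assert np.allclose(l1, np.abs(ys), atol=1e-3)
assert np.allclose(hub, [huber(y) for y in ys], atol=1e-3)
assert np.allclose(hin, np.maximum(ys - eps, 0.0), atol=1e-3)
print("dual representations match the closed forms")
```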


Remark 5 Let ρ1(y) and ρ2(y) be two QS penalties specified by U_i, M_i, b_i, B_i, for i = 1, 2. Then the sum ρ(y) = ρ1(y) + ρ2(y) is also a QS penalty, with

U = U1 × U2, M = [M1 0; 0 M2], b = [b1; b2], B = [B1; B2].

Notwithstanding the catalogue of scalar QS functions in Remark 4 and the gluing procedure described in Remark 5, the supremum in Definition 1 appears to be a significant roadblock to understanding and designing a QS function having specific properties. However, with some practice the design of QS penalties is not as daunting a task as it first appears. A key tool in understanding the structure of QS functions is the Euclidean norm projection onto convex sets.

Theorem 6 (Projection Theorem for Convex Sets) [Zarantonello (1971)] Let Q ∈ R^{n×n} be symmetric and positive definite and let C ⊂ R^n be non-empty, closed and convex. Then Q defines an inner product on R^n by 〈x, y〉_Q = xᵀQy, with associated Euclidean norm ‖x‖_Q = √〈x, x〉_Q. The projection of a point y ∈ R^n onto C in the norm ‖·‖_Q is the unique point P_Q(y | C) solving the least distance problem

inf_{x∈C} ‖y − x‖_Q, (2.2)

and z = P_Q(y | C) if and only if z ∈ C and

〈x − z, y − z〉_Q ≤ 0 ∀ x ∈ C. (2.3)

Note that the least distance problem (2.2) is equivalent to the problem

inf_{x∈C} ½‖y − x‖²_Q.

In the following theorem we use projections as well as duality theory to provide alternative representations for QS penalties.

Theorem 7 Let M ∈ R^{n×n} be a symmetric and positive semi-definite matrix, let L ∈ R^{n×k} be any matrix satisfying M = LLᵀ, where k = rank(M), and let U ⊂ R^n be a non-empty, closed and convex set that contains the origin. Then the QS function ρ := ρ(U, M, 0, I; ·) has the primal representations

ρ(y) = inf_{s∈R^k} [ ½‖s‖²₂ + δ*(y − Ls | U) ] = inf_{s∈R^k} [ ½‖s‖²₂ + γ(y − Ls | U°) ], (2.4)

where, for any convex set V,

δ*(z | V) := sup_{v∈V} 〈z, v〉 and γ(z | V) := inf{ t | t ≥ 0, z ∈ tV }

are the support and gauge functionals for V, respectively.

If it is further assumed that M ∈ S^n_{++}, the set of positive definite matrices, then ρ has the representations

ρ(y) = inf_{s∈R^n} [ ½‖s‖²_M + γ(M^{−1}y − s | M^{−1}U°) ] (2.5)
     = ½‖P_M(M^{−1}y | U)‖²_M + γ(M^{−1}y − P_M(M^{−1}y | U) | M^{−1}U°) (2.6)
     = inf_{s∈R^n} [ ½‖s‖²_{M^{−1}} + γ(y − s | U°) ] (2.7)
     = ½‖P_{M^{−1}}(y | MU)‖²_{M^{−1}} + γ(y − P_{M^{−1}}(y | MU) | U°) (2.8)
     = ½ yᵀM^{−1}y − inf_{u∈U} ½‖u − M^{−1}y‖²_M (2.9)
     = ½‖P_M(M^{−1}y | U)‖²_M + 〈M^{−1}y − P_M(M^{−1}y | U), P_M(M^{−1}y | U)〉_M (2.10)
     = ½ yᵀM^{−1}y − inf_{v∈MU} ½‖v − y‖²_{M^{−1}} (2.11)
     = ½‖P_{M^{−1}}(y | MU)‖²_{M^{−1}} + 〈y − P_{M^{−1}}(y | MU), P_{M^{−1}}(y | MU)〉_{M^{−1}}. (2.12)

In particular, (2.11) says ρ(y) = ½ yᵀM^{−1}y whenever y ∈ MU. Also note that, by (2.4), one can replace the gauge functionals in (2.5)-(2.8) by the support functional of the appropriate set, where M^{−1}U° = (MU)°.

The formulas (2.5)-(2.12) show how one can build PLQ penalties having a wide range of desirable properties. We now give a short list of examples illustrating how to make use of these representations.
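Representation (2.9) can be sanity-checked in the scalar Huber case, where the inner infimum is attained at the clip of M^{−1}y onto U. The sketch below (under our reconstruction of the formula, with κ chosen arbitrarily) compares it against the dual definition:

```python
import numpy as np

kappa = 0.8

def rho_dual(y):
    # Definition 1 brute-forced over a grid of U = [-kappa, kappa], M = 1.
    u = np.linspace(-kappa, kappa, 4001)
    return np.max(u * y - 0.5 * u**2)

def rho_via_29(y):
    # (2.9): rho(y) = 0.5*y^2 - inf_{u in U} 0.5*(u - y)^2; the inf is
    # attained by projecting y onto [-kappa, kappa], i.e. a clip.
    u_star = np.clip(y, -kappa, kappa)
    return 0.5 * y**2 - 0.5 * (u_star - y)**2

ys = np.linspace(-4.0, 4.0, 81)
assert np.allclose([rho_dual(y) for y in ys],
                   [rho_via_29(y) for y in ys], atol=1e-4)
print("(2.9) agrees with the dual definition on the Huber example")
```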

Remark 8 (General examples) In this remark we show how the representations in Theorem 7 can be used to build QS penalties with specific structure. In each example we specify the components U, M, b, and B for the QS function ρ := ρ(U, M, b, B; ·).

1. Norms. Any norm ‖·‖ can be represented as a QS function by taking M = 0, B = I, b = 0, U = B°, where B is the unit ball of the desired norm. Then, by (2.4), ρ(y) = ‖y‖ = γ(y | B).

2. Gauges and support functions. Let U be any closed convex set containing the origin, and take M = 0, B = I, b = 0. Then, by (2.4), ρ(y) = γ(y | U°) = δ*(y | U).

3. Generalized Huber functions. Take any norm ‖·‖ having closed unit ball B. Let M ∈ S^n_{++}, B = I, b = 0, and U = B°. Then, by the representation (2.8),

ρ(y) = ½ P_{M^{−1}}(y | MB°)ᵀ M^{−1} P_{M^{−1}}(y | MB°) + ‖y − P_{M^{−1}}(y | MB°)‖. (2.13)

In particular, for y ∈ MB°, ρ(y) = ½ yᵀM^{−1}y.

If we take M = I and ‖·‖ = κ‖·‖₁ for κ > 0 (so that B = κ^{−1}B₁ and U = B° = κB∞), then ρ is the multivariate Huber function described in item 4 of Remark 4. In this way, Theorem 7 shows how to generalize the essence of the Huber norm to any choice of norm. For example, if we take U = κB_M = {κu | ‖u‖_M ≤ 1}, then, by (2.10),

ρ(y) = ½‖y‖²_{M^{−1}} if ‖y‖_{M^{−1}} ≤ κ; κ‖y‖_{M^{−1}} − κ²/2 if ‖y‖_{M^{−1}} > κ.


4. Generalized hinge-loss functions. Let ‖·‖ be a norm with closed unit ball B, let K be a non-empty closed convex cone in R^n, and let v ∈ R^n. Set M = 0, b = −v, B = I, and U = −(B° ∩ K°) = B° ∩ (−K°). Then, by (Burke, 1987, Section 2),

ρ(y) = dist(y | v − K) = inf_{u∈K} ‖y − v + u‖.

If we consider the order structure ≤_K induced on R^n by

y ≤_K v ⟺ v − y ∈ K,

then ρ(y) = 0 if and only if y ≤_K v. By taking ‖·‖ = ‖·‖₁, K = R^n_+ (so that −K° = K), and v = ε1, where 1 is the vector of all ones, we recover the multivariate hinge loss function in Remark 4.

5. Order intervals and Vapnik loss functions. Let ‖·‖ be a norm with closed unit ball B, let K ⊂ R^n be a non-empty symmetric convex cone in the sense that K° = −K, and let w <_K v, or equivalently, v − w ∈ intr(K). Set

U = (B° ∩ K) × (−(B° ∩ K)), M = [0 0; 0 0], b = −[v; w], and B = [I; I].

Then

ρ(y) = dist(y | v − K) + dist(y | w + K).

Observe that ρ(y) = 0 if and only if w ≤_K y ≤_K v. The set {y | w ≤_K y ≤_K v} is an 'order interval' (Schaefer, 1970). If we take w = −v, then {y | −v ≤_K y ≤_K v} is a symmetric neighborhood of the origin. By taking ‖·‖ = ‖·‖₁, K = R^n_+, and v = ε1 = −w, we recover the multivariate Vapnik loss function in Remark 4. Further examples of symmetric cones are S^n_+ and the Lorentz or ℓ2 cone (Guler and Hauser, 2002).

The examples given above show that one can also construct generalized versions of the elastic net as well as the soft insensitive loss functions defined in Remark 4. In addition, cone constraints can also be added by using the identity δ*(· | K) = δ(· | K°). These examples serve to illustrate the wide variety of penalty functions representable as QS functions. Computationally, one is only limited by the ability to compute the projections described in Theorem 7. Further computational properties of QS functions are described in (Aravkin et al., 2012, Section 6).
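The generalized Huber construction in item 3 can be verified numerically in the simplest case M = I with the (self-dual) ℓ2 unit ball, where the polar subtleties disappear and the projection formula should reduce to the Huber function of ‖y‖ with κ = 1. This is an illustrative sketch of that special case, not a general implementation:

```python
import numpy as np

def proj_ball(y):
    # Euclidean projection onto the l2 unit ball.
    nrm = np.linalg.norm(y)
    return y if nrm <= 1.0 else y / nrm

def rho(y):
    # (2.13) with M = I and U = B the l2 ball: 0.5*||P||^2 + ||y - P||.
    P = proj_ball(y)
    return 0.5 * P @ P + np.linalg.norm(y - P)

def huber_of_norm(y):
    # Huber function of ||y|| with threshold kappa = 1.
    t = np.linalg.norm(y)
    return 0.5 * t**2 if t <= 1.0 else t - 0.5

rng = np.random.default_rng(1)
for _ in range(100):
    y = rng.standard_normal(3) * rng.uniform(0.0, 3.0)
    assert abs(rho(y) - huber_of_norm(y)) < 1e-12
print("projection formula reproduces the norm-Huber penalty")
```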

In order to characterize QS functions as negative logs of density functions, we need to ensure the integrability of said density functions. The function ρ(y) is said to be coercive if lim_{‖y‖→∞} ρ(y) = ∞, and coercivity turns out to be the key property to ensure integrability. The proof of this fact and the characterization of coercivity for QS functions are the subject of the next two theorems (see the Appendix for proofs).

Theorem 9 (QS integrability) Suppose ρ(y) is a coercive QS penalty. Then the function exp[−ρ(y)] is integrable on aff[dom(ρ)] with respect to the dim(aff[dom(ρ)])-dimensional Lebesgue measure.

Theorem 10 A QS function ρ is coercive if and only if [Bᵀcone(U)]° = {0}.

Theorem 10 can be used to show the coercivity of familiar penalties. In particular, note that if B = I, then the QS function is coercive if and only if U contains the origin in its interior.


Corollary 11 The penalties ℓ2, ℓ1, elastic net, Vapnik, and Huber are all coercive.

Proof We show that all of these penalties satisfy the hypothesis of Theorem 10.

ℓ2: U = R and B = 1, so [Bᵀcone(U)]° = R° = {0}.

ℓ1: U = [−1, 1], so cone(U) = R, and B = 1.

Elastic net: In this case, cone(U) = R² and B = [1; 1].

Huber: U = [−κ, κ], so cone(U) = R, and B = 1.

Vapnik: U = [0, 1] × [0, 1], so cone(U) = R²_+. B = [1; −1], so Bᵀcone(U) = R.

One can also show the coercivity of the above examples using their primal representations. However, our main objective is to pave the way for a modeling framework where multi-dimensional penalties can be constructed from simple building blocks and then solved by a uniform approach using the dual representations alone.

We now define a family of distributions on R^n by interpreting piecewise linear quadratic functions ρ as negative logs of corresponding densities. Note that the support of the distributions is always contained in dom(ρ), which is characterized in Theorem 3.

Definition 12 (QS densities) Let ρ(U, M, b, B; y) be any coercive extended QS penalty on R^n. Define p(y) to be the following density on R^n:

p(y) = c^{−1} exp[−ρ(y)] if y ∈ dom(ρ), and 0 otherwise, (2.14)

where

c = ∫_{y∈dom(ρ)} exp[−ρ(y)] dy,

and the integral is with respect to the dim(dom(ρ))-dimensional Lebesgue measure.

QS densities are true densities on the affine hull of the domain of ρ. The proof of Theorem 9 can be easily adapted to show that they have moments of all orders.

3. Constructing QS densities

In this section, we describe how to construct multivariate QS densities with prescribed means and variances. We show how to compute normalization constants to obtain scalar densities, and then extend to multivariate densities using linear transformations. Finally, we show how to obtain the data structures U, M, B, b corresponding to multivariate densities, since these are used by the optimization approach in Section 4.

We make use of the following definitions. Given a sequence of column vectors {r_k} = {r_1, . . . , r_N} and matrices {Σ_k} = {Σ_1, . . . , Σ_N}, we use the notation

vec({r_k}) = [r_1; r_2; . . . ; r_N],

and diag({Σ_k}) denotes the block diagonal matrix with blocks Σ_1, Σ_2, . . . , Σ_N on the diagonal and zeros elsewhere.


In Definition 12, QS densities are defined over R^n. The moments of these densities depend in a nontrivial way on the choice of parameters b, B, U, M. In practice, we would like to be able to construct these densities to have prescribed means and variances. We show how this can be done using scalar QS random variables as the building blocks. Suppose y = vec({y_k}) is a vector of independent (but not necessarily identical) QS random variables with mean 0 and variance 1. Denote by b_k, B_k, U_k, M_k the specification for the densities of y_k. To obtain the density of y, we need only take

U = U_1 × U_2 × ··· × U_N, M = diag({M_k}), B = diag({B_k}), b = vec({b_k}).

For example, the standard Gaussian distribution is specified by U = R^n, M = I, b = 0, B = I, while the standard ℓ1-Laplace (see (Aravkin et al., 2011a)) is specified by U = [−1, 1]^n, M = 0, b = 0, B = √2 I.

The random vector ȳ = Q^{1/2} y + µ has mean µ and variance Q. If c is the normalizing constant for the density of y, then c det(Q)^{1/2} is the normalizing constant for the density of ȳ.

Remark 13 Note that only independence of the building blocks is required in the above result. This allows the flexibility to impose different QS densities on different errors in the model. Such flexibility may be useful for example when combining measurement data from different instruments, where some instruments may occasionally give bad data (with outliers), while others have errors that are modeled well by Gaussian distributions.
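The block construction above can be tested empirically by sampling. The sketch below (sizes, µ and Q are our own choices) draws independent scalar ℓ1-Laplace variables, which for B = √2·I correspond to a Laplace distribution with scale 1/√2 and hence unit variance, and maps them through ȳ = Q^{1/2}y + µ:

```python
import numpy as np

rng = np.random.default_rng(2)
n, N = 3, 200_000
mu = np.array([1.0, -2.0, 0.5])
A = np.array([[2.0, 0.0, 0.0],
              [0.3, 1.0, 0.0],
              [0.1, -0.2, 0.5]])
Q = A @ A.T  # target covariance; A plays the role of Q^{1/2}

# Scalar l1-Laplace blocks: density proportional to exp(-sqrt(2)|y|),
# i.e. Laplace with scale 1/sqrt(2), which has mean 0 and variance 1.
y = rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2), size=(N, n))
ybar = y @ A.T + mu

assert np.allclose(ybar.mean(axis=0), mu, atol=0.05)
assert np.allclose(np.cov(ybar.T), Q, atol=0.1)
print("sample mean/covariance match the prescribed mu and Q")
```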

We now show how to construct scalar building blocks with mean 0 and variance 1, i.e. how to compute the key normalizing constants for any QS penalty. To this aim, suppose ρ(y) is a scalar QS penalty that is symmetric about 0. We would like the function p(y) = exp[−ρ(c₂y)]/c₁ to be a true density with unit variance, that is,

(1/c₁) ∫ exp[−ρ(c₂y)] dy = 1   and   (1/c₁) ∫ y² exp[−ρ(c₂y)] dy = 1 ,   (3.1)

where the integrals are over R. Using u-substitution, these equations become

c₁c₂ = ∫ exp[−ρ(y)] dy   and   c₁c₂³ = ∫ y² exp[−ρ(y)] dy .

Solving this system yields

c₂ = √( ∫ y² exp[−ρ(y)] dy / ∫ exp[−ρ(y)] dy ) ,   c₁ = (1/c₂) ∫ exp[−ρ(y)] dy .

These expressions can be used to obtain the normalizing constants for any particular ρ using simple integrals.
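This recipe is easy to carry out numerically. Below is a minimal pure-Python sketch; the trapezoidal quadrature grid and the choice of the Huber penalty with κ = 1 are our own, and any symmetric scalar penalty can be passed in. The algebra above guarantees that rescaling the penalty argument by the computed c₂ yields a unit-variance density.

```python
import math

def huber(y, kappa=1.0):
    # Huber penalty: quadratic for |y| <= kappa, linear in the tails.
    return 0.5 * y * y if abs(y) <= kappa else kappa * abs(y) - 0.5 * kappa ** 2

def normalizing_constants(rho, lo=-20.0, hi=20.0, m=40001):
    # Trapezoidal quadrature for I0 = int exp(-rho(y)) dy and
    # I2 = int y^2 exp(-rho(y)) dy over [lo, hi].
    h = (hi - lo) / (m - 1)
    i0 = i2 = 0.0
    for j in range(m):
        y = lo + j * h
        w = 0.5 if j in (0, m - 1) else 1.0
        e = math.exp(-rho(y))
        i0 += w * e * h
        i2 += w * y * y * e * h
    c2 = math.sqrt(i2 / i0)   # from c1*c2 = I0 and c1*c2^3 = I2
    c1 = i0 / c2
    return c1, c2

c1, c2 = normalizing_constants(huber)
```

As a consistency check, applying the same routine to the rescaled penalty y ↦ ρ(c₂y) should return a second scaling constant of 1 and the same c₁, since the rescaled density already has unit variance.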


3.1 Huber Density

The scalar density corresponding to the Huber penalty is constructed as follows. Set

p_H(y) = (1/c₁) exp[−ρ_H(c₂y)] ,   (3.2)

where c₁ and c₂ are chosen as in (3.1). Specifically, we compute

∫ exp[−ρ_H(y)] dy = 2 exp[−κ²/2] (1/κ) + √(2π) [2Φ(κ) − 1]
∫ y² exp[−ρ_H(y)] dy = 4 exp[−κ²/2] (1 + κ²)/κ³ + √(2π) [2Φ(κ) − 1] ,

where Φ is the standard normal cumulative density function. The constants c₁ and c₂ can now be readily computed. To obtain the multivariate Huber density with variance Q and mean µ, let U = [−κ, κ]ⁿ, M = I, B = I, and b = 0. This gives the desired density:

p_H(y) = (1 / (c₁ⁿ det(Q^{1/2}))) exp[ − sup_{u∈U} { ⟨c₂ Q^{−1/2}(y − µ), u⟩ − (1/2) uᵀu } ] .   (3.3)

3.2 Vapnik Density

The scalar density associated with the Vapnik penalty is constructed as follows. Set

p_V(y) = (1/c₁) exp[−ρ_V(c₂y)] ,   (3.4)

where the normalizing constants c₁ and c₂ can be obtained from

∫ exp[−ρ_V(y)] dy = 2(ε + 1)
∫ y² exp[−ρ_V(y)] dy = (2/3)ε³ + 2(ε² + 2ε + 2) ,

using the results in Section 3. Taking U = [0, 1]^{2n}, the multivariate Vapnik distribution with mean µ and variance Q is

p_V(y) = (1 / (c₁ⁿ det(Q^{1/2}))) exp[ − sup_{u∈U} ⟨c₂ B Q^{−1/2}(y − µ) − ε1_{2n}, u⟩ ] ,   (3.5)

where B is block diagonal with each block of the form [1; −1], and 1_{2n} is a column vector of 1's of length 2n.

4. Optimization with PLQ penalties

In the previous sections, QS penalties were characterized using their dual representation and interpreted as negative log likelihoods of true densities. As we have seen, the scope of such densities is extremely broad. Moreover, these densities can easily be constructed to possess specified moment properties. In this section, we expand on their utility by showing that the resulting estimation problems (1.4) can be solved with high accuracy using standard techniques from numerical optimization for a large subclass of these penalties. We focus on PLQ penalties for the sake of simplicity in our presentation of an interior point approach to solving these estimation problems. However, the interior point approach applies in much more general settings, e.g. see Nemirovskii and Nesterov (1994). Nonetheless, the PLQ case is sufficient to cover all of the examples given in Remark 4 while giving the flavor of how to proceed in the more general cases.

We exploit the dual representation for the class of PLQ penalties (Rockafellar and Wets, 1998) to explicitly construct the Karush-Kuhn-Tucker (KKT) conditions for a wide variety of model problems of the form (1.4). Working with these systems opens the door to using a wide variety of numerical methods for convex quadratic programming to solve (1.4).

Let ρ(Uv,Mv,bv,Bv;y) and ρ(Uw,Mw,bw,Bw;y) be two PLQ penalties and define

V (v;R) := ρ(Uv,Mv,bv,Bv;R−1/2v) (4.1)

and

W(w; Q) := ρ(U_w, M_w, b_w, B_w; Q^{−1/2}w) .   (4.2)

Then (1.4) becomes

min_{y∈Rⁿ}  ρ(U, M, b, B; y) ,   (4.3)

where

U := U_v × U_w ,  M := [ M_v 0 ; 0 M_w ] ,  b := ( b_v − B_v R^{−1/2} z ; b_w − B_w Q^{−1/2} µ ) ,

and

B := [ B_v R^{−1/2} H ; B_w Q^{−1/2} G ] .

Moreover, the hypotheses in (1.1), (1.2), (1.4), and (2.1) imply that the matrix B in (4.3) is injective. Indeed, By = 0 if and only if B_w Q^{−1/2} G y = 0, but, since G is nonsingular and B_w is injective, this implies that y = 0. That is, nul(B) = {0}. Consequently, the objective in (4.3) takes the form of a PLQ penalty function (2.1). In particular, if (4.1) and (4.2) arise from PLQ densities (Definition 12), then the solution to problem (4.3) is the MAP estimator in the statistical model (1.1)-(1.2).

To simplify the notational burden, in the remainder of this section we work with (4.3) directly and assume that the defining objects in (4.3) have the dimensions specified in (2.1):

U ⊂ Rᵐ,  M ∈ R^{m×m},  b ∈ Rᵐ,  and  B ∈ R^{m×n} .   (4.4)

The Lagrangian (Rockafellar and Wets, 1998, Example 11.47) for problem (4.3) is given by

L(y, u) = bᵀu − (1/2) uᵀMu + uᵀBy .

By assumption U is polyhedral, and so can be specified to take the form

U = { u : Aᵀu ≤ a } ,   (4.5)


where A ∈ R^{m×ℓ}. Using this representation for U, the optimality conditions for (4.3) (Rockafellar, 1970; Rockafellar and Wets, 1998) are

0 = Bᵀu
0 = b + By − Mu − Aq
0 = Aᵀu + s − a
0 = q_i s_i , i = 1, . . . , ℓ ,  q, s ≥ 0 ,   (4.6)

where the non-negative slack variable s is defined by the third equation in (4.6). The non-negativity of s implies that u ∈ U. The equations 0 = q_i s_i, i = 1, . . . , ℓ in (4.6) are known as the complementarity conditions. By convexity, solving the problem (4.3) is equivalent to satisfying (4.6). There is a vast optimization literature on working directly with the KKT system. In particular, interior point (IP) methods (Kojima et al., 1991; Nemirovskii and Nesterov, 1994; Wright, 1997) can be employed. In the Kalman filtering/smoothing application, IP methods have been used to solve the KKT system (4.6) in a numerically stable and efficient manner, see e.g. (Aravkin et al., 2011b). Remarkably, the IP approach used in (Aravkin et al., 2011b) generalizes to the entire PLQ class. For Kalman filtering and smoothing, the computational efficiency is also preserved (see Section 6). Here, we show the general development for the entire PLQ class using standard techniques from the IP literature (see e.g. (Kojima et al., 1991)).

Let U, M, b, B, and A be as defined in (2.1) and (4.5), and let τ ∈ (0, +∞]. We define the τ-slice of the strict feasibility region for (4.6) to be the set

F₊(τ) = { (s, q, u, y) | 0 < s, 0 < q, sᵀq ≤ τ, and (s, q, u, y) satisfy the affine equations in (4.6) } ,

and the central path for (4.6) to be the set

C := { (s, q, u, y) | 0 < s, 0 < q, q_i s_i = γ, i = 1, . . . , ℓ, for some γ > 0, and (s, q, u, y) satisfy the affine equations in (4.6) } .

For simplicity, we define F₊ := F₊(+∞). The basic strategy of a primal-dual IP method is to follow the central path to a solution of (4.6) as γ ↓ 0 by applying a predictor-corrector damped Newton method to the function mapping R^ℓ × R^ℓ × R^m × R^n to itself given by

Fγ(s, q, u, y) = [ s + Aᵀu − a ;  D(q)D(s)1 − γ1 ;  By − Mu − Aq + b ;  Bᵀu ] ,   (4.7)

where D(q) and D(s) are diagonal matrices with vectors q,s on the diagonal.
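To make the strategy concrete, here is a minimal dense-linear-algebra sketch of the corrector part of such a damped Newton iteration (no predictor step is included). The toy problem min_y |y − 3|, its PLQ encoding, and the γ-reduction schedule are our own choices for illustration.

```python
import numpy as np

def ip_solve(A, a, M, b, B, iters=40):
    """Follow the central path of (4.6): one damped Newton step on
    F_gamma per value of gamma, then shrink gamma."""
    mdim, ell = A.shape
    n = B.shape[1]
    s, q = np.ones(ell), np.ones(ell)
    u, y = np.zeros(mdim), np.zeros(n)
    gamma = 1.0
    for _ in range(iters):
        # Residual F_gamma(s, q, u, y) from (4.7).
        F = np.concatenate([s + A.T @ u - a,
                            q * s - gamma,
                            B @ y - M @ u - A @ q + b,
                            B.T @ u])
        # Jacobian of F_gamma, assembled blockwise in the order (s, q, u, y).
        J = np.zeros((2 * ell + mdim + n,) * 2)
        J[:ell, :ell] = np.eye(ell)
        J[:ell, 2 * ell:2 * ell + mdim] = A.T
        J[ell:2 * ell, :ell] = np.diag(q)
        J[ell:2 * ell, ell:2 * ell] = np.diag(s)
        J[2 * ell:2 * ell + mdim, ell:2 * ell] = -A
        J[2 * ell:2 * ell + mdim, 2 * ell:2 * ell + mdim] = -M
        J[2 * ell:2 * ell + mdim, 2 * ell + mdim:] = B
        J[2 * ell + mdim:, 2 * ell:2 * ell + mdim] = B.T
        d = np.linalg.solve(J, -F)
        ds, dq = d[:ell], d[ell:2 * ell]
        du, dy = d[2 * ell:2 * ell + mdim], d[2 * ell + mdim:]
        # Damp the step to keep s > 0 and q > 0 (fraction to the boundary).
        t = 1.0
        for v, dv in ((s, ds), (q, dq)):
            neg = dv < 0
            if neg.any():
                t = min(t, 0.99 * np.min(-v[neg] / dv[neg]))
        s, q = s + t * ds, q + t * dq
        u, y = u + t * du, y + t * dy
        gamma *= 0.2  # drive gamma toward 0, approaching a solution of (4.6)
    return s, q, u, y

# Toy PLQ problem: min_y |y - 3|, i.e. U = [-1, 1], M = 0, b = -3, B = 1,
# with U encoded as {u : A^T u <= a}, A^T = [[1], [-1]], a = (1, 1).
A = np.array([[1.0, -1.0]])
a = np.array([1.0, 1.0])
M = np.zeros((1, 1))
b = np.array([-3.0])
B = np.array([[1.0]])
s, q, u, y = ip_solve(A, a, M, b, B)
```

Production implementations exploit the structure of J rather than forming it densely; this sketch only illustrates the central-path mechanics that Theorem 14 below justifies.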

Theorem 14 Let U, M, b, B, and A be as defined in (2.1) and (4.5). Given τ > 0, let F₊, F₊(τ), and C be as defined above. If

F₊ ≠ ∅  and  null(M) ∩ null(Aᵀ) = 0 ,   (4.8)

then the following statements hold.


(i) Fγ⁽¹⁾(s, q, u, y), the Jacobian of Fγ, is invertible for all (s, q, u, y) ∈ F₊.

(ii) Define F̂₊ = { (s, q) | ∃ (u, y) ∈ Rᵐ × Rⁿ s.t. (s, q, u, y) ∈ F₊ }. Then for each (s, q) ∈ F̂₊ there exists a unique (u, y) ∈ Rᵐ × Rⁿ such that (s, q, u, y) ∈ F₊.

(iii) The set F+(τ) is bounded for every τ > 0.

(iv) For every g ∈ R^ℓ₊₊, there is a unique (s, q, u, y) ∈ F₊ such that g = (s₁q₁, s₂q₂, . . . , s_ℓ q_ℓ)ᵀ.

(v) For every γ > 0, there is a unique solution [s(γ), q(γ), u(γ), y(γ)] to the equation Fγ(s, q, u, y) = 0. Moreover, these points form a differentiable trajectory in R^ℓ × R^ℓ × R^m × R^n. In particular, we may write

C = { [s(γ), q(γ), u(γ), y(γ)] | γ > 0 } .

(vi) The set of cluster points of the central path as γ ↓ 0 is non-empty, and every such cluster point is a solution to (4.6).

Please see the Appendix for proof. Theorem 14 shows that if the conditions (4.8) hold, then IP techniques can be applied to solve the problem (4.3). In all of the applications we consider, the condition null(M) ∩ null(Aᵀ) = 0 is easily verified. For example, in the setting of (4.3) with

U_v = { u | A_v u ≤ a_v }  and  U_w = { u | A_w u ≤ a_w }   (4.9)

this condition reduces to

null(M_v) ∩ null(A_vᵀ) = 0  and  null(M_w) ∩ null(A_wᵀ) = 0 .   (4.10)

Corollary 15 The densities corresponding to ℓ₁, ℓ₂, Huber, and Vapnik penalties all satisfy hypothesis (4.10).

Proof We verify that null(M) ∩ null(Aᵀ) = 0 for each of the four penalties. In the ℓ₂ case, M has full rank. For the ℓ₁, Huber, and Vapnik penalties, the respective sets U are bounded, so U∞ = {0}.

On the other hand, the condition F₊ ≠ ∅ is typically more difficult to verify. We show how this is done for two sample cases from class (1.4), where the non-emptiness of F₊ is established by constructing an element of this set. Such constructed points are useful for initializing the interior point algorithm.

4.1 `1 – `2:

Suppose V(v; R) = ‖R^{−1/2}v‖₁ and W(w; Q) = (1/2)‖Q^{−1/2}w‖₂². In this case

U_v = [−1_m, 1_m] ,  M_v = 0_{m×m} ,  b_v = 0_m ,  B_v = I_{m×m} ,
U_w = Rⁿ ,  M_w = I_{n×n} ,  b_w = 0_n ,  B_w = I_{n×n} ,


and R ∈ R^{m×m} and Q ∈ R^{n×n} are symmetric positive definite covariance matrices. Following the notation of (4.3) we have

U = [−1_m, 1_m] × Rⁿ ,  M = [ 0_{m×m} 0 ; 0 I_{n×n} ] ,  b = ( −R^{−1/2}z ; −Q^{−1/2}µ ) ,  B = [ R^{−1/2}H ; Q^{−1/2}G ] .

The specification of U in (4.5) is given by

Aᵀ = [ I_{m×m} 0_{m×n} ; −I_{m×m} 0_{m×n} ]  and  a = ( 1_m ; 1_m ) .

Clearly, the condition null(M) ∩ null(Aᵀ) = 0 in (4.8) is satisfied. Hence, for Theorem 14 to apply, we need only check that F₊ ≠ ∅. This is easily established by noting that (s, q, u, y) ∈ F₊, where

u = ( 0_m ; 0_n ) ,  y = G^{−1}µ ,  s = ( 1_m ; 1_m ) ,  q = ( 1_m + [R^{−1/2}(Hy − z)]₊ ; 1_m − [R^{−1/2}(Hy − z)]₋ ) ,

where, for g ∈ R^ℓ, g₊ is defined componentwise by g₊(i) = max{g_i, 0} and g₋(i) = min{g_i, 0}.
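This construction is easy to check numerically. The following numpy sketch builds the ℓ1-ℓ2 data above and verifies that the proposed point satisfies the affine equations of (4.6) with s, q > 0; the random problem sizes and the simplifying choice R = Q = I are our own test fixtures.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
H = rng.standard_normal((m, n))
G = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # nonsingular for this seed
z = rng.standard_normal(m)
mu = rng.standard_normal(n)
Rih = np.eye(m)  # stand-in for R^{-1/2}
Qih = np.eye(n)  # stand-in for Q^{-1/2}

# Data of the l1-l2 problem in the notation of (4.3) and (4.5).
M = np.block([[np.zeros((m, m)), np.zeros((m, n))],
              [np.zeros((n, m)), np.eye(n)]])
b = np.concatenate([-Rih @ z, -Qih @ mu])
B = np.vstack([Rih @ H, Qih @ G])
A = np.block([[np.eye(m), -np.eye(m)],
              [np.zeros((n, m)), np.zeros((n, m))]])
a = np.concatenate([np.ones(m), np.ones(m)])

# Candidate strictly feasible point from the text.
u = np.zeros(m + n)
y = np.linalg.solve(G, mu)
r = Rih @ (H @ y - z)
s = np.concatenate([np.ones(m), np.ones(m)])
q = np.concatenate([1 + np.maximum(r, 0), 1 - np.minimum(r, 0)])

# Affine equations of (4.6) hold, and s, q are strictly positive.
assert np.allclose(B.T @ u, 0)
assert np.allclose(b + B @ y - M @ u - A @ q, 0)
assert np.allclose(A.T @ u + s - a, 0)
assert (s > 0).all() and (q > 0).all()
```

The same point can serve as an initializer for an interior point iteration on this problem class.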

4.2 Vapnik – Huber:

Suppose that V(v; R) and W(w; Q) are as in (4.1) and (4.2), respectively, with V a Vapnik penalty and W a Huber penalty:

U_v = [0, 1_m] × [0, 1_m] ,  M_v = 0_{2m×2m} ,  b_v = −( ε1_m ; ε1_m ) ,  B_v = [ I_{m×m} ; −I_{m×m} ] ,
U_w = [−κ1_n, κ1_n] ,  M_w = I_{n×n} ,  b_w = 0_n ,  B_w = I_{n×n} ,

and R ∈ R^{m×m} and Q ∈ R^{n×n} are symmetric positive definite covariance matrices. Following the notation of (4.3) we have

U = ([0, 1_m] × [0, 1_m]) × [−κ1_n, κ1_n] ,  M = [ 0_{2m×2m} 0 ; 0 I_{n×n} ] ,

b = −( ε1_m + R^{−1/2}z ; ε1_m − R^{−1/2}z ; Q^{−1/2}µ ) ,  B = [ R^{−1/2}H ; −R^{−1/2}H ; Q^{−1/2}G ] .

The specification of U in (4.5) is given by

Aᵀ = [ I_{m×m} 0 0 ; −I_{m×m} 0 0 ; 0 I_{m×m} 0 ; 0 −I_{m×m} 0 ; 0 0 I_{n×n} ; 0 0 −I_{n×n} ]  and  a = ( 1_m ; 0_m ; 1_m ; 0_m ; κ1_n ; κ1_n ) .

Since null(Aᵀ) = 0, the condition null(M) ∩ null(Aᵀ) = 0 in (4.8) is satisfied. Hence, for Theorem 14 to apply, we need only check that F₊ ≠ ∅. We establish this by constructing an element (s, q, u, y) of F₊. For this, let

u = ( u₁ ; u₂ ; u₃ ) ,  s = ( s₁ ; s₂ ; s₃ ; s₄ ; s₅ ; s₆ ) ,  q = ( q₁ ; q₂ ; q₃ ; q₄ ; q₅ ; q₆ ) ,

and set

y = 0_n ,  u₁ = u₂ = (1/2)1_m ,  u₃ = 0_n ,  s₁ = s₂ = s₃ = s₄ = (1/2)1_m ,  s₅ = s₆ = κ1_n ,

and

q1 = 1m− (ε1m +R−1/2z)−, q2 = 1m +(ε1m +R−1/2z)+,

q3 = 1m− (ε1m−R−1/2z)−, q4 = 1m +(ε1m−R−1/2z)+,

q5 = 1n− (Q−1/2µ)−, q6 = 1n +(Q−1/2

µ)+ .

Then (s,q,u,y) ∈F+.

5. Simple Numerical Examples and Comparisons

Before we proceed to the main application of interest (Kalman smoothing), we present a few simple and interesting problems in the PLQ class. An IP solver that handles the problems discussed in this section is available through github.com/saravkin/, along with example files and ADMM implementations. A comprehensive comparison with other methods is not in our scope, but we do compare the IP framework with the Alternating Direction Method of Multipliers (ADMM) (see Boyd et al. (2011) for a tutorial reference). We hope that the examples and the code will help readers to develop intuition about these two methods.

We focus on ADMM in particular because these methods enjoy widespread use in machine learning and other applications, due to their versatility and ability to scale to large problems. The fundamental difference between ADMM and IP is that ADMM methods have at best linear convergence, so they cannot reach high accuracy in reasonable time (see (Boyd et al., 2011, Section 3.2.2)). In contrast, IP methods have a superlinear convergence rate (in fact, some variants have 2-step quadratic convergence, see Ye and Anstreicher (1993); Wright (1997)).

In addition to accuracy concerns, IP methods may be preferable to ADMM when

• objective contains complex non-smooth terms, e.g. ‖Ax−b‖1.

• linear operators within the objective formulations are ill-conditioned.

For formulations with well-conditioned linear operators and simple nonsmooth pieces (such as Lasso), ADMM can easily outperform IP. In these cases ADMM methods can attain moderate accuracy (and good solutions) very quickly, by exploiting partial smoothness and/or simplicity of regularizing functionals. For problems lacking these features, such as general formulations built from (nonsmooth) PLQ penalties and possibly ill-conditioned linear operators, IP can dominate ADMM, reaching the true solution while ADMM struggles.


We present a few simple examples below, either developing the ADMM approach for each, or discussing the difficulties (when applicable). We explain advantages and disadvantages of using IP, and present numerical results. A simple IP solver that handles all of the examples, together with ADMM code used for the comparisons, is available through github.com/saravkin/. The Lasso example was taken directly from http://www.stanford.edu/~boyd/papers/admm/, and we implemented the other ADMM examples using this code as a template.

5.1 Lasso Problem

Consider the Lasso problem

min_x (1/2)‖Ax − b‖₂² + λ‖x‖₁ ,   (5.1)

where A ∈ R^{n×m}. Assume that m < n. In order to develop an ADMM approach, we split the variables and introduce a constraint:

min_{x,z} (1/2)‖Ax − b‖₂² + λ‖z‖₁  s.t.  x = z .   (5.2)

The augmented Lagrangian for (5.2) is given by

L(x, z, y) = (1/2)‖Ax − b‖₂² + λ‖z‖₁ + η yᵀ(z − x) + (η/2)‖z − x‖₂² ,   (5.3)

where η is the augmented Lagrangian parameter. The ADMM method now comprises the following iterative updates:

x^{k+1} = argmin_x (1/2)‖Ax − b‖₂² + (η/2)‖x − zᵏ − yᵏ‖₂²
z^{k+1} = argmin_z λ‖z‖₁ + (η/2)‖z − x^{k+1} + yᵏ‖₂²
y^{k+1} = yᵏ + (z^{k+1} − x^{k+1}) .   (5.4)

Turning our attention to the x-update, note that the gradient is given by

Aᵀ(Ax − b) + η(x − zᵏ − yᵏ) = (AᵀA + ηI)x − Aᵀb − η(zᵏ + yᵏ) .

At every iteration, the update requires solving the same positive definite m×m symmetric system. Forming AᵀA + ηI is O(nm²) time, and obtaining a Cholesky factorization is O(m³), but once this is done, every x-update can be obtained in O(m²) time by doing two back-solves.

The z-update has a closed form solution given by soft thresholding:

z^{k+1} = S(x^{k+1} − yᵏ, λ/η) ,

which is an O(n) operation. The multiplier update is also O(n). Therefore, the complexity per iteration is O(m² + n), making ADMM a great method for this problem.
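A compact implementation of this scheme in standard scaled-ADMM form is sketched below (numpy); the random problem data, η = 1, and the iteration count are our own choices. The Cholesky factor is computed once and reused for every x-update.

```python
import numpy as np

def soft(v, t):
    # Soft-thresholding S(v, t), the prox of t*||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_admm(A, b, lam, eta=1.0, iters=2000):
    m = A.shape[1]
    # Factor A^T A + eta*I once; every x-update is then two back-solves.
    L = np.linalg.cholesky(A.T @ A + eta * np.eye(m))
    Atb = A.T @ b
    x = np.zeros(m)
    z = np.zeros(m)
    y = np.zeros(m)
    for _ in range(iters):
        rhs = Atb + eta * (z + y)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # x-update
        z = soft(x - y, lam / eta)                         # z-update
        y = y + (z - x)                                    # multiplier update
    return z

rng = np.random.default_rng(1)
A = rng.standard_normal((40, 15))
b = rng.standard_normal(40)
lam = 0.5 * np.abs(A.T @ b).max()
x = lasso_admm(A, b, lam)
```

On a small well-conditioned instance like this one, the returned point satisfies the lasso subgradient optimality conditions to high accuracy after a couple of thousand iterations.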

In contrast, each iteration of IP is dominated by the complexity of forming a dense m×m system AᵀDᵏA, where Dᵏ is a diagonal matrix that depends on the iteration. So while both methods require an investment of O(nm²) to form and O(m³) to factorize the system, ADMM requires this only at the outset, while IP has to repeat the computation for every iteration. A simple test shows ADMM can find a good answer, with a significant speed advantage already evident for moderate (1000×5000) well-conditioned systems (see Table 1).


5.2 Linear Support Vector Machines

The support vector machine problem can be formulated as the PLQ (see (Ferris and Munson, 2003, Section 2.1))

min_{w,γ} (1/2)‖w‖² + λ ρ₊(1 − D(Aw − γ1)) ,   (5.5)

where ρ₊ is the hinge loss function, wᵀx = γ is the hyperplane being sought, D ∈ R^{m×m} is a diagonal matrix with ±1 on the diagonals (in accordance with the classification of the training data), and A ∈ R^{m×k} is the observation matrix, where each row gives the features corresponding to observation i ∈ {1, . . . , m}. The ADMM details are similar to the Lasso example, so we omit them here. The interested reader can study the details in the file linear_svm available through github.com/saravkin/.

The SVM example turned out to be very interesting. We downloaded the 9th Adult example from the SVM library at http://www.csie.ntu.edu.tw/~cjlin/libsvm/. The training set has 32561 examples, each with 123 features. When we formed the operator A for problem (5.5), we found it was very poorly conditioned, with condition number 7.7×10^10. It should not surprise the reader that after running for 653 iterations, ADMM is still appreciably far away — its objective value is higher, and in fact the relative norm distance to the (unique) true solution is 10%.

It is interesting to note that in this application, high optimization accuracy does not mean better classification accuracy on the test set — indeed, the (suboptimal) ADMM solution achieves a lower classification error on the test set (18%, vs. 18.75% error for IP). Nonetheless, this is not an advantage of one method over another — one can also stop the IP method early. The point here is that from the optimization perspective, SVM illustrates the advantages of Newton methods over methods with a linear rate.

5.3 Robust Lasso

For the examples in this section, we take ρ(·) to be a robust convex loss, either the 1-norm or the Huber function, and consider the robust Lasso problem

min_x ρ(Ax − b) + λ‖x‖₁ .   (5.6)

First, we develop an ADMM approach that works for both losses, exploiting the simple nature of the regularizer. Then, we develop a second ADMM approach when ρ(x) is the Huber function by exploiting partial smoothness of the objective.

Setting z = Ax − b, we obtain the augmented Lagrangian

L(x, z, y) = ρ(z) + λ‖x‖₁ + η yᵀ(z − Ax + b) + (η/2)‖z − Ax + b‖₂² .   (5.7)

The ADMM updates for this formulation are

x^{k+1} = argmin_x λ‖x‖₁ + (η/2)‖Ax − b − zᵏ − yᵏ‖₂²
z^{k+1} = argmin_z ρ(z) + (η/2)‖z + yᵏ − Ax^{k+1} + b‖₂²
y^{k+1} = yᵏ + (z^{k+1} − Ax^{k+1} + b) .   (5.8)


The z-update can be solved using thresholding, or modified thresholding, in O(m) time when ρ(·) is the Huber loss or 1-norm. Unfortunately, the x-update now requires solving a LASSO problem. This can be done with ADMM (see previous section), but the nested ADMM structure does not perform as well as IP methods, even for well conditioned problems.
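For the Huber loss, the "modified thresholding" mentioned above is the componentwise proximal operator of ρ_H. A closed form follows by solving the stationarity condition ρ′_H(z) + η(z − v) = 0 in each of the two regimes of ρ_H; the sketch below (numpy, with parameter names of our own choosing) implements it.

```python
import numpy as np

def prox_huber(v, eta, kappa):
    """Componentwise argmin_z rho_H(z) + (eta/2)*(z - v)^2.

    In the quadratic regime the solution is z = eta*v/(1+eta); in the
    linear regime it is z = v - (kappa/eta)*sign(v). The quadratic
    branch applies exactly when its candidate lands in [-kappa, kappa].
    """
    v = np.asarray(v, dtype=float)
    quad = eta * v / (1.0 + eta)
    lin = v - (kappa / eta) * np.sign(v)
    return np.where(np.abs(quad) <= kappa, quad, lin)
```

Under the conventions of (5.8), the Huber z-update would then read z^{k+1} = prox_huber(Ax^{k+1} − b − yᵏ, η, κ), applied componentwise in O(m) time.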

When ρ(·) is smooth, such as in the case of the Huber loss, the partial smoothness of the objective can be exploited by setting x = z, obtaining

L(x, z, y) = ρ(Ax − b) + λ‖z‖₁ + η yᵀ(z − x) + (η/2)‖x − z‖₂² .   (5.9)

The ADMM updates are:

x^{k+1} = argmin_x ρ(Ax − b) + (η/2)‖x − zᵏ − yᵏ‖₂²
z^{k+1} = argmin_z λ‖z‖₁ + (η/2)‖z − x^{k+1} + yᵏ‖₂²
y^{k+1} = yᵏ + (z^{k+1} − x^{k+1}) .   (5.10)

The problem required for the x-update is smooth, and can be solved by a fast quasi-Newton method, such as L-BFGS. L-BFGS is implemented using only matrix-vector products, and for well-conditioned problems, the ADMM/L-BFGS approach has a speed advantage over IP methods. For ill-conditioned problems, L-BFGS has to work harder to achieve high accuracy, and inexact solves may destabilize the overall ADMM approach. IP methods are more consistent (see Table 1).

Just as in the Lasso problem, the IP implementation is dominated by the formation of AᵀDᵏA at every iteration with complexity O(mn²). However, a simple change of penalty makes the problem much harder for ADMM, especially when the operator A is ill-conditioned.

5.4 Complex objectives

Many problems (including Kalman smoothers in the next section) do not have the simplifying features exhibited by Lasso, SVM, and robust Lasso problems. Consider the general regression problem

ρ(Ax − b) + ‖Cx‖₁ ,   (5.11)

where ρ may be nonsmooth, and C is in R^{k×n}. Applying ADMM to these objectives requires a bi-level implementation. For example, when ρ(x) is the 1-norm, the x-update for ADMM requires solving

min_x ‖Ax − b‖₁ + (η₂/2)‖Cx − z − y‖₂² ,

which is more computationally expensive than the Lasso subproblem. In particular, an ADMM implementation requires iteratively solving subproblems of the form

min_x ‖Cx − c‖₂² + (ξ/2)‖Ax − d‖₂² .

Since C ∈ R^{k×n} and A ∈ R^{m×n}, a Cholesky approach to the above problem requires forming an n×n matrix and factoring it. Since it was already observed that ADMM struggles to achieve moderate


Table 1: For each problem, we give iteration counts for IP, outer ADMM iterations, and the maximum cap for inner ADMM iterations (if applicable). We also give total computing time for both algorithms (tADMM, tIP) on a 2.2 GHz dual-core Intel machine, and the objective difference f(xADMM) − f(xIP). This difference is always positive, since in all experiments IP found a lower objective value. Therefore, the magnitude of the objective difference can be used as an accuracy heuristic for ADMM in each experiment, where lower difference means higher ADMM accuracy. κ(A) = condition number of A.

Problem                                     ADMM Iters  ADMM Inner  IP Iters  tADMM (s)  tIP (s)  ObjDiff
Lasso
  A: 1500×5000                                   15          —          18        2.0      58.3   0.0025
SVM
  κ(A) = 7.7×10^10; A: 32561×123                653          —          77       41.2      23.9   0.17
Huber Lasso (ADMM/ADMM)
  κ(A) = 5.8;  A: 1000×2000                      26         100         20       14.1      10.5   0.00006
  κ(A) = 1330; A: 1000×2000                      27         100         24       40.0      13.0   0.0018
Huber Lasso (ADMM/L-BFGS)
  κ(A) = 5.8;  A: 1000×2000                      18          —          20        2.8      10.3   1.02
  κ(A) = 1330; A: 1000×2000                      22          —          24       21.2      13.1   1.24
L1 Lasso (ADMM/ADMM)
  κ(A) = 2.2;  A: 500×2000                      104         100         29       57.4       5.9   0.06
  κ(A) = 1416; A: 500×2000                      112         100         29       81.4       5.6   0.21
General L1-L1
  C: 500×2000; A: 1000×2000                      —           —          11        —        21.4    —

accuracy in the L1 Lasso case, we did not build an ADMM implementation in this more general setting.

However, applying the IP solver is straightforward, and we illustrate by solving the problem where ρ(·) is the 1-norm. In this case, the objective is a linear program with special structure, so it is not surprising that IP methods work well.

We hope that the toy problems, results, and code that we developed in order to write this section have given the reader a better intuition for IP methods. Before moving on, note that the Kalman smoothing problems in the next section have the flavor of the general L1-L1 example, since they must balance tradeoffs between process and measurement models. Either penalty can be taken to be the 1-norm, or any other PLQ penalty, and we will show that IP methods can be specifically designed to exploit the time series structure and preserve classical Kalman smoothing computational efficiency results.


6. Kalman Smoothing with PLQ penalties

Consider now a dynamic scenario, where the system state x_k evolves according to the following stochastic discrete-time linear model

x1 = x0 +w1

xk = Gkxk−1 +wk, k = 2,3, . . . ,N

zk = Hkxk + vk, k = 1,2, . . . ,N

(6.1)

where x₀ is known, z_k is the m-dimensional subvector of z containing the noisy output samples collected at instant k, and G_k and H_k are known matrices. Further, we consider the general case where w_k and v_k are mutually independent zero-mean random variables which can come from any of the densities introduced in the previous sections, with positive definite covariance matrices denoted by Q_k and R_k, respectively. In order to formulate the Kalman smoothing problem over the entire sequence {x_k}, define

x = vec{x₁, · · · , x_N} ,  w = vec{w₁, · · · , w_N} ,  v = vec{v₁, · · · , v_N} ,
Q = diag{Q₁, · · · , Q_N} ,  R = diag{R₁, · · · , R_N} ,  H = diag{H₁, · · · , H_N} ,

and

G = the block lower bidiagonal matrix with identity blocks I on the diagonal and −G₂, . . . , −G_N on the subdiagonal (all other blocks zero).

Then model (6.1) can be written in the form of (1.1)-(1.2), i.e.,

µ = Gx + w
z = Hx + v ,   (6.2)

where x ∈ R^{nN} is the entire state sequence of interest, w is the corresponding process noise, z is the vector of all measurements, v is the measurement noise, and µ is a vector of size nN with the first n-block equal to x₀, the initial state estimate, and the other blocks set to 0. This is precisely the problem (1.1)-(1.2) that began our study. The problem (1.3) becomes the classical Kalman smoothing problem with quadratic penalties. In this case, the objective function can be written

‖Gx − µ‖²_{Q⁻¹} + ‖Hx − z‖²_{R⁻¹} ,

and the minimizer can be found by taking the gradient and setting it to zero:

(GᵀQ⁻¹G + HᵀR⁻¹H)x = r ,  where r = GᵀQ⁻¹µ + HᵀR⁻¹z .

One can view this as a single step of Newton's method, which converges to the solution because the objective is quadratic. Note also that once the linear system above is formed, it takes only O(n³N) operations to solve due to its special block tridiagonal structure (for a generic system, it would take O(n³N³) time). In this section, we will show that IP methods can preserve this complexity for much more general penalties on the measurement and process residuals. We first make a brief remark related to the statistical interpretation of PLQ penalties.
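The O(n³N) claim rests on a block-tridiagonal factorization. The numpy sketch below (the small random test model is our own fixture, with Q = R = I for simplicity) extracts the blocks of C = GᵀQ⁻¹G + HᵀR⁻¹H for a toy smoothing problem and solves Cx = r by a block Thomas recursion, which costs O(Nn³) instead of O(N³n³).

```python
import numpy as np

def block_thomas(D, E, r, n):
    """Solve the symmetric block tridiagonal system with diagonal blocks
    D[k], subdiagonal blocks E[k] (superdiagonal E[k].T) in O(N n^3)."""
    N = len(D)
    Dp = [None] * N
    rp = [None] * N
    Dp[0], rp[0] = D[0], r[0:n]
    for k in range(1, N):
        # Eliminate block k-1: Schur-complement update of row k.
        W = np.linalg.solve(Dp[k - 1].T, E[k - 1].T).T  # W = E[k-1] Dp[k-1]^{-1}
        Dp[k] = D[k] - W @ E[k - 1].T
        rp[k] = r[k * n:(k + 1) * n] - W @ rp[k - 1]
    x = np.zeros(N * n)
    x[(N - 1) * n:] = np.linalg.solve(Dp[N - 1], rp[N - 1])
    for k in range(N - 2, -1, -1):
        x[k * n:(k + 1) * n] = np.linalg.solve(
            Dp[k], rp[k] - E[k].T @ x[(k + 1) * n:(k + 2) * n])
    return x

rng = np.random.default_rng(2)
n, N = 2, 6
Gk = [0.9 * np.eye(n) + 0.1 * rng.standard_normal((n, n)) for _ in range(N)]
G = np.eye(N * n)
for k in range(1, N):
    G[k * n:(k + 1) * n, (k - 1) * n:k * n] = -Gk[k]
H = np.kron(np.eye(N), rng.standard_normal((1, n)))  # one measurement per step
C = G.T @ G + H.T @ H  # Q = R = I for this test
r = rng.standard_normal(N * n)

D = [C[k * n:(k + 1) * n, k * n:(k + 1) * n] for k in range(N)]
E = [C[(k + 1) * n:(k + 2) * n, k * n:(k + 1) * n] for k in range(N - 1)]
x = block_thomas(D, E, r, n)
```

Because C is symmetric positive definite here, every pivot block Dp[k] is a Schur complement of an SPD matrix and hence nonsingular, so the recursion is well defined.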


Remark 16 Suppose we decide to move to an outlier robust formulation, where the 1-norm or Huber penalties are used, but the measurement variance is known to be R. Using the statistical interpretation developed in Section 3, the statistically correct objective function for the smoother is

(1/2)‖Gx − µ‖²_{Q⁻¹} + √2 ‖R^{−1/2}(Hx − z)‖₁ .

Analogously, the statistically correct objective when the measurement noise follows the Huber-based density with parameter κ is

(1/2)‖Gx − µ‖²_{Q⁻¹} + ρ_H(c₂ R^{−1/2}(Hx − z)) ,

where, following Section 3.1,

c₂ = √( ( 4 exp[−κ²/2](1 + κ²)/κ³ + √(2π)[2Φ(κ) − 1] ) / ( 2 exp[−κ²/2](1/κ) + √(2π)[2Φ(κ) − 1] ) ) .

The normalization constant comes from the results in Section 3.1, and ensures that the weighting between process and measurement terms is still consistent with the situation regardless of which shapes are used for the process and measurement penalties. This is one application of the statistical interpretation.

Next, we show that when the penalties used on the process residual Gx − µ and measurement residual Hx − z arise from general PLQ densities, the general Kalman smoothing problem takes the form (4.3), studied in the previous section. The details are given in the following remark.

Remark 17 Suppose that the noises w and v in the model (6.2) are PLQ densities with means 0, variances Q and R (see Def. 12). Then, for suitable U_w, M_w, b_w, B_w and U_v, M_v, b_v, B_v and corresponding ρ_w and ρ_v we have

p(w) ∝ exp[ −ρ(U_w, M_w, b_w, B_w; Q^{−1/2}w) ]
p(v) ∝ exp[ −ρ(U_v, M_v, b_v, B_v; R^{−1/2}v) ]   (6.3)

while the MAP estimator of x in the model (6.2) is

argmin_{x∈R^{nN}}  ρ[U_w, M_w, b_w, B_w; Q^{−1/2}(Gx − µ)] + ρ[U_v, M_v, b_v, B_v; R^{−1/2}(Hx − z)]   (6.4)

If Uw and Uv are given as in (4.9), then the system (4.6) decomposes as

0 = A_wᵀ u_w + s_w − a_w ;   0 = A_vᵀ u_v + s_v − a_v
0 = s_wᵀ q_w ;   0 = s_vᵀ q_v
0 = b_w + B_w Q^{−1/2} G d − M_w u_w − A_w q_w
0 = b_v − B_v R^{−1/2} H d − M_v u_v − A_v q_v
0 = Gᵀ Q^{−T/2} B_wᵀ u_w − Hᵀ R^{−T/2} B_vᵀ u_v
0 ≤ s_w, s_v, q_w, q_v .   (6.5)

See the Appendix and (Aravkin, 2010) for details on deriving the KKT system. By further exploiting the decomposition shown in (6.1), we obtain the following theorem.


Theorem 18 (PLQ Kalman smoother theorem) Suppose that all w_k and v_k in the Kalman smoothing model (6.1) come from PLQ densities that satisfy

null(M_kʷ) ∩ null((A_kʷ)ᵀ) = 0 ,  null(M_kᵛ) ∩ null((A_kᵛ)ᵀ) = 0 ,  ∀k ,   (6.6)

i.e. their corresponding penalties are finite-valued. Suppose further that the corresponding set F₊ from Theorem 14 is nonempty. Then (6.4) can be solved using an IP method, with computational complexity O[N(n³ + m³ + l)], where l is the largest column dimension of the matrices A_kᵛ and A_kʷ.

Note that the first part of this theorem, the solvability of the problem using IP methods, already follows from Theorem 14. The main contribution of the result in the dynamical system context is the computational complexity. The proof is presented in the Appendix and shows that IP methods for solving (6.4) preserve the key block tridiagonal structure of the standard smoother. If the number of IP iterations is fixed (10−20 are typically used in practice), general smoothing estimates can thus be computed in O[N(n³ + m³ + l)] time. Notice also that the number of required operations scales linearly with l, which represents the complexity of the PLQ density encoding.

7. Numerical example

7.1 Simulated data

In this section we use a simulated example to test the computational scheme described in the previous section. We consider the following function

f (t) = exp [sin(8t)]

taken from (Dinuzzo et al., 2007). Our aim is to reconstruct f starting from 2000 noisy samples collected uniformly over the unit interval. The measurement noise v_k was generated using a mixture of two Gaussian densities, with p = 0.1 denoting the fraction drawn from the second (outlier) component; i.e.,

vk ∼ (1− p)N(0,0.25)+ pN(0,25),

Data are displayed as dots in Fig. 2. Note that the purpose of the second component of the Gaussian mixture is to simulate outliers in the output data and that all the measurements exceeding vertical axis limits are plotted on the upper and lower axis limits (4 and −2) to improve readability. The initial condition f(0) = 1 is assumed to be known, while the difference of the unknown function from the initial condition (i.e. f(·) − 1) is modeled as a Gaussian process given by an integrated Wiener process. This model captures the Bayesian interpretation of cubic smoothing splines (Wahba, 1990), and admits a two-dimensional state space representation where the first component of x(t), which models f(·) − 1, corresponds to the integral of the second state component, modelled as Brownian motion. To be more specific, letting ∆t = 1/2000, the sampled version of the state space model (see (Jazwinski, 1970; Oksendal, 2005) for details) is defined by

G_k = [ 1 0 ; ∆t 1 ] ,  k = 2, 3, . . . , 2000
H_k = [ 0 1 ] ,  k = 1, 2, . . . , 2000



Figure 2: Simulation: measurements (·) with outliers plotted on axis limits (4 and −2), true function (continuous line), smoothed estimate using either the quadratic loss (dashed line, left panel) or Vapnik's ε-insensitive loss (dashed line, right panel)


with the autocovariance of wk given by

Q_k = λ² [ ∆t  ∆t²/2 ; ∆t²/2  ∆t³/3 ] ,  k = 1, 2, . . . , 2000 ,

where λ² is an unknown scale factor to be estimated from the data.

We compare the performance of two Kalman smoothers. The first (classical) estimator uses a quadratic loss function to describe the negative log of the measurement noise density and contains only λ² as unknown parameter. The second estimator is a Vapnik smoother relying on the ε-insensitive loss, and so depends on two unknown parameters: λ² and ε. In both cases, the unknown parameters are estimated by means of a cross validation strategy where the 2000 measurements are randomly split into a training and a validation set of 1300 and 700 data points, respectively. The Vapnik smoother was implemented by exploiting the efficient computational strategy described in the previous section, see (Aravkin et al., 2011b) for specific implementation details. Efficiency is particularly important here, because of the need for cross-validation. In this way, for each value of λ² and ε contained in a 10×20 grid on [0.01, 10000]×[0, 1], with λ² logarithmically spaced, the function estimate was rapidly obtained by the new smoother applied to the training set. Then, the relative average prediction error on the validation set was computed, see Fig. 3. The parameters leading to the best prediction were λ² = 2.15×10³ and ε = 0.45, which give a sparse solution defined by fewer than 400 support vectors. The value of λ² for the classical Kalman smoother was then estimated following the same strategy described above. In contrast to the Vapnik penalty, the quadratic loss does not induce any sparsity, so that, in this case, the number of support vectors equals the size of the training set.

The left and right panels of Fig. 2 display the function estimate obtained using the quadratic and the Vapnik losses, respectively. It is clear that the estimate obtained using the quadratic penalty is heavily affected by the outliers. In contrast, as expected, the estimate coming from the Vapnik based smoother performs well over the entire time period, and is virtually unaffected by the presence of large outliers.

7.2 Real industrial data

Let us now consider real industrial data coming from Syncrude Canada Ltd, also analyzed in Liu et al. (2004). Oil production data is typically a multivariate time series capturing variables such as flow rate, pressure, particle velocity, and other observables. Because the data is proprietary, the exact nature of the variables is not known. The data from Liu et al. (2004) comprises two anonymized time series variables, called V14 and V36, that have been selected from the process data. Each time series consists of 936 measurements, collected at times [1, 2, . . . , 936] (see the top panels of Fig. 4). Due to the nature of production data, we hypothesize that the temporal profile of the variables is smooth and that the observations contain outliers, as suggested by the fact that some observations differ markedly from their neighbors, especially in the case of V14. Our aim is to compare the prediction performance of two smoothers that rely on ℓ₂ and ℓ₁ measurement loss functions. For this purpose, we consider 100 Monte Carlo runs. During each run, data are randomly divided into three disjoint sets: training and validation data sets, both of size 350, and a test set of size 236. We use the same state space model adopted in the previous subsection, with ∆t = 1, and use a non-informative prior to model the initial condition of the system. The regularization parameter γ (equal to the inverse of λ², assuming that the noise variance is 1) is chosen using

Figure 3: Estimation of the smoothing filter parameters using the Vapnik loss. Average prediction error on the validation data set as a function of the process variance λ² and ε.

standard cross validation techniques. For each value of γ, logarithmically spaced between 0.1 and 1000 (30 point grid), the smoothers are trained on the training set, and the γ chosen corresponds to the smoother that achieves the best prediction on the validation set. After estimating γ, the variable's profile is reconstructed for the entire time series (at all times [1, 2, ..., 936]), using the measurements contained in the union of the training and the validation data sets. Then, the prediction capability of the smoothers is evaluated by computing the 236 relative percentage errors (ratio of residual and observation times 100) in the reconstruction of the test set.

In Fig. 4 we display the boxplots of the overall 23600 relative errors stored after the 100 runs for V14 (bottom left panel) and V36 (bottom right panel). One can see that the ℓ1-Kalman smoother outperforms the classical one, especially in the case of V14. This is not surprising, since in this case prediction is more difficult due to the larger number of outliers in the time series. In particular, for V14, the average percentage errors are 1.4% and 2.4% while, for V36, they are 1% and 1.2%, using ℓ1 and ℓ2, respectively.
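The cross-validation loop described above can be sketched as follows. `fit_smoother` and `predict` are hypothetical placeholders for any smoother implementation; the grid matches the 30-point logarithmic grid on [0.1, 1000] used here, and the score is the relative percentage error.

```python
import numpy as np

def select_gamma(y_train, y_val, fit_smoother, predict, n_grid=30):
    """Grid-search cross validation for the regularization parameter:
    fit on the training set for each gamma on a log grid, score the
    predictions on the validation set, return the best gamma."""
    grid = np.logspace(np.log10(0.1), np.log10(1000.0), n_grid)
    errors = []
    for gamma in grid:
        model = fit_smoother(y_train, gamma)
        y_hat = predict(model, y_val)
        # relative percentage error: |residual| / |observation| * 100
        errors.append(np.mean(np.abs(y_val - y_hat) / np.abs(y_val)) * 100)
    return grid[int(np.argmin(errors))]
```

After selection, the smoother would be refit on the union of training and validation data before scoring on the test set, as done in the experiment above.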

8. Conclusions

We have presented a new theory for robust and sparse estimation using nonsmooth QS penalties. We give both primal and dual representations for these densities and show how to obtain closed form expressions using Euclidean projections. Using their dual representation, we first derived conditions allowing the interpretation of QS penalties as negative logs of true probability densities, thus establishing a statistical modeling framework. In this regard, the coercivity condition characterized in Th. 10 played a central role. This condition, necessary for the statistical interpretation, underscores the importance of an idea already useful in machine learning. Specifically, coercivity of the objective (1.4) is a fundamental prerequisite in sparse and robust estimation, as it precludes directions for which the sum of the loss and the regularizer is insensitive to large parameter changes. Thus, the condition for a QS penalty to be a negative log of a true density also ensures that the problem is well posed in the machine learning context, i.e., the learning machine has enough control over

Figure 4: Left panels: data set for variable 14 (top) and relative percentage errors in the reconstruction of the test set obtained by Kalman smoothers based on the ℓ2 and the ℓ1 loss (bottom; panel annotations: plus 4.5% outliers for the ℓ2 smoother, plus 3.4% outliers for the ℓ1 smoother). Right panels: data set for variable 36 (top) and relative percentage errors in the reconstruction of the test set obtained by Kalman smoothers based on the ℓ2 and the ℓ1 loss (bottom; panel annotations: plus 2.2% outliers for the ℓ2 smoother, plus 1.8% outliers for the ℓ1 smoother).


model complexity.

The QS class captures a variety of existing penalties when used either as a misfit measure or as a regularization functional. We have also shown how to construct natural generalizations of these penalties within the QS class that are based on general norm and cone geometries. Moreover, we show how the structure of these functions can be understood through the use of Euclidean projections. It is also straightforward to use the presented results to design new formulations. Specifically, starting with the requisite shape of a new penalty, one can use the results of Section 3 to obtain a standardized corresponding density, as well as the data structures U, M, B, b required to formulate and solve the optimization problem in Section 4. The statistical interpretation of these methods allows us to prescribe the mean and variance parameters of the corresponding model.

In the second part of the paper, we presented a broad computational approach to solving estimation problems (1.4) using interior point methods. In the process, we derived additional conditions that guarantee the successful implementation of IP methods to compute the estimator (1.4) when x and v come from PLQ densities (a broad subclass of QS penalties), and provided a theorem characterizing the convergence of IP methods for this class. The key condition required for the successful execution of IP iterations is that the PLQ penalties be finite valued, which implies non-degeneracy of the corresponding statistical distribution (the support cannot be contained in a lower-dimensional subspace). The statistical interpretation is thus strongly linked to the computational procedure.

We applied both the statistical framework and the computational approach to the broad class of state estimation problems in discrete-time dynamic systems, extending the classical formulations to allow dynamics and measurement noise to come from any PLQ densities. Moreover, we showed that the classical computational efficiency results can be preserved when the general IP approach is used in the state estimation context; specifically, PLQ Kalman smoothing can always be performed with a number of operations that is linear in the length of the time series, as in the quadratic case. The computational framework presented therefore allows the broad application of interior point methods to a wide class of smoothing problems of interest to practitioners. The powerful algorithmic scheme designed here, together with the breadth and significance of the new statistical framework presented, underscores the practical utility and flexibility of this approach. We believe that this perspective on modeling, robust/sparse estimation and Kalman smoothing will be useful in a number of applications in the years ahead.

While we only considered convex formulations in this paper, it is important to note that the presented approach makes it possible to solve a much broader class of non-convex problems. In particular, if the functions Hx and Gx in (1.4) are replaced by nonlinear functions g(x) and h(x), the methods in this paper can be used to compute descent directions for the non-convex problem. For an example of this approach, see (Aravkin et al., 2011a), which considers non-convex Kalman smoothing problems with nonlinear process and measurement models and solves them using the standard methodology of convex composite optimization (Burke, 1985). As in the Gauss-Newton method, at each outer iteration the process and measurement models are linearized around the current iterate, and the descent direction is found by solving a particular subproblem of type (1.4) using IP methods.

In many contexts, it would be useful to estimate the parameters that define QS penalties, for example the κ in the Huber penalty or the ε in the Vapnik penalty. In the numerical examples presented in this paper, we have relied on cross-validation to accomplish this task. An alternative method could be to compute the MAP points returned by our estimator for different filter parameters to gain information about the joint posterior of states and parameters. This strategy could help in designing a good proposal density for posterior simulation using e.g. particle smoothing filters (Ristic et al., 2004). We leave a detailed study of this approach within the QS modeling framework to future work.
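The convex-composite (Gauss-Newton style) outer loop described above might look as follows. This is a rough sketch under stated assumptions: `solve_plq_subproblem` is a hypothetical stand-in for the IP solver applied to the linearized problem of type (1.4), and the undamped step is a simplification (a line search would normally be used).

```python
import numpy as np

def gauss_newton(x0, g, h, jac_g, jac_h, solve_plq_subproblem,
                 max_iter=20, tol=1e-8):
    """Convex-composite outer loop: at each iterate, linearize the
    nonlinear process model g and measurement model h, then obtain a
    descent direction by solving the resulting convex subproblem
    (of type (1.4)) with, e.g., an interior-point method."""
    x = x0
    for _ in range(max_iter):
        # Linearize the models around the current iterate.
        G, H = jac_g(x), jac_h(x)
        rg, rh = g(x), h(x)
        # The subproblem solver returns a descent direction d.
        d = solve_plq_subproblem(G, H, rg, rh)
        if np.linalg.norm(d) < tol:
            break
        x = x + d   # a damped line-search step would be used in practice
    return x
```

With quadratic losses the subproblem reduces to a linear least-squares solve and the scheme recovers the classical Gauss-Newton iteration.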

9. Appendix

9.1 Proof of Theorem 3

Let ρ(y) = ρ(U, M, I, 0; y) so that ρ(U, M, B, b; y) = ρ(b + By). Then dom(ρ(U, M, B, b; ·)) = B⁻¹(dom(ρ) − b), hence the theorem follows if it can be shown that bar(U) + Ran(M) ⊂ dom(ρ) ⊂ [U∞ ∩ null(M)]°, with equality when bar(U) + Ran(M) is closed. Observe that if there exists w ∈ U∞ ∩ null(M) such that ⟨y, w⟩ > 0, then trivially ρ(y) = +∞, so y ∉ dom(ρ). Consequently, dom(ρ) ⊂ [U∞ ∩ null(M)]°. Next let y ∈ bar(U) + Ran(M); then there is a v ∈ bar(U) and a w such that y = v + Mw. Hence

  sup_{u∈U} [⟨u, y⟩ − ½⟨u, Mu⟩] = sup_{u∈U} [⟨u, v + Mw⟩ − ½⟨u, Mu⟩]
                                = sup_{u∈U} [⟨u, v⟩ + ½ wᵀMw − ½ (w − u)ᵀM(w − u)]
                                ≤ δ*(v | U) + ½ wᵀMw < ∞.

Hence bar(U) + Ran(M) ⊂ dom(ρ).

If the set bar(U) + Ran(M) is closed, then so is the set bar(U). Therefore, by (Rockafellar, 1970, Corollary 14.2.1), (U∞)° = bar(U), and, by (Rockafellar, 1970, Corollary 16.4.2), [U∞ ∩ null(M)]° = bar(U) + Ran(M), which proves the result.

In the polyhedral case, bar(U) is a polyhedral convex set, and the sum of such sets is also a polyhedral convex set (Rockafellar, 1970, Corollary 19.3.2), hence closed.

9.2 Proof of Theorem 7

To see the first equation in (2.4), write ρ(y) = sup_u [⟨y, u⟩ − (½‖Lᵀu‖₂² + δ(u | U))], and then apply the calculus of convex conjugate functions (Rockafellar, 1970, Section 16) to find that

  (½‖Lᵀ·‖₂² + δ(· | U))*(y) = inf_{s∈ℝᵏ} [½‖s‖₂² + δ*(y − Ls | U)].

The second equivalence in (2.4) follows from (Rockafellar, 1970, Theorem 14.5).

For the remainder, we assume that M is positive definite. In this case it is easily shown that (MU)° = M⁻¹U°. Hence, by (Rockafellar, 1970, Theorem 14.5), γ(· | MU°) = δ*(· | M⁻¹U). We use these facts freely throughout the proof.

The formula (2.5) follows by observing that

  ½‖s‖₂² + δ*(y − Ls | U) = ½‖L⁻ᵀs‖²_M + δ*(M⁻¹y − L⁻ᵀs | MU)

and then making the substitution v = L⁻ᵀs. To see (2.6), note that the optimality conditions for (2.5) are Ms ∈ ∂δ*(M⁻¹y − s | MU), or equivalently, M⁻¹y − s ∈ N(Ms | MU), i.e. s ∈ U and

  ⟨M⁻¹y − s, u − s⟩_M = ⟨M⁻¹y − s, M(u − s)⟩ ≤ 0  for all u ∈ U,

which, by (2.3), tells us that s = P_M(M⁻¹y | U). Plugging this into (2.5) gives (2.6).

Using the substitution v = Ls, the argument showing (2.7) and (2.8) differs only slightly from that for (2.5) and (2.6), and so is omitted.

The formula (2.9) follows by completing the square in the M-norm in the definition (2.1):

  ⟨y, u⟩ − ½⟨u, Mu⟩ = ⟨M⁻¹y, u⟩_M − ½⟨u, u⟩_M
                    = ½ yᵀM⁻¹y − ½ [⟨M⁻¹y, M⁻¹y⟩_M − 2⟨M⁻¹y, u⟩_M + ⟨u, u⟩_M]
                    = ½ yᵀM⁻¹y − ½ ‖M⁻¹y − u‖²_M.

The result as well as (2.10) now follow from Theorem 6. Both (2.11) and (2.12) follow similarly by completing the square in the M⁻¹-norm.
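As a numerical sanity check of the dual representation used in this proof, the scalar Huber penalty (U = [−κ, κ], M = 1, B = 1, b = 0) can be evaluated by brute force over the dual variable and compared with the closed form obtained from the projection formula. This sketch is ours and is not an efficient evaluation scheme.

```python
import numpy as np

def huber_dual(y, kappa, n_grid=100001):
    """Evaluate the Huber penalty through its QS dual representation
    rho(y) = sup_{|u| <= kappa} [u*y - u^2/2], by searching over a
    fine grid of dual variables u (numerical check only)."""
    u = np.linspace(-kappa, kappa, n_grid)
    return np.max(u * y - 0.5 * u**2)

def huber_closed_form(y, kappa):
    """Closed form: the dual sup is attained at u = clip(y, -kappa, kappa),
    i.e. at the projection of y onto U."""
    a = abs(y)
    return 0.5 * y**2 if a <= kappa else kappa * a - 0.5 * kappa**2

for y in (-3.0, -0.4, 0.0, 0.7, 5.0):
    assert abs(huber_dual(y, 1.0) - huber_closed_form(y, 1.0)) < 1e-6
```

The maximizing u is exactly the projection of y onto U, illustrating how the projection formulas of this section yield closed-form expressions for QS penalties.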

9.3 Proof of Theorem 9

First we will show that if ρ is convex and coercive, then for any x̄ ∈ argmin ρ ≠ ∅, there exist constants R and K > 0 such that

  ρ(x) ≥ ρ(x̄) + K‖x − x̄‖  for all x ∉ RB.   (9.1)

Without loss of generality, we can assume that 0 = ρ(0) = inf ρ. Otherwise, replace ρ(x) by ρ̃(x) = ρ(x + x̄) − ρ(x̄), where x̄ is any global minimizer of ρ.

Let α > 0. Since ρ is coercive, there exists R such that lev_ρ(α) ⊂ RB. We will show that (α/R)‖x‖ ≤ ρ(x) for all x ∉ RB. Indeed, for all x ≠ 0 we have ρ((R/‖x‖)x) ≥ α. Therefore, if x ∉ RB, then 0 < R/‖x‖ < 1, and, by convexity and ρ(0) = 0,

  α ≤ ρ((R/‖x‖)x) = ρ((R/‖x‖)x + (1 − R/‖x‖)·0) ≤ (R/‖x‖)ρ(x),

so that (α/R)‖x‖ ≤ ρ(x). Then by (9.1),

  ∫ exp(−ρ(x)) dx = ∫_{x̄+RB} exp(−ρ(x)) dx + ∫_{‖x−x̄‖>R} exp(−ρ(x)) dx
                  ≤ C₁ + C₂ ∫_{‖x−x̄‖>R} exp(−K‖x − x̄‖) dx < ∞.
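The integrability conclusion can be checked numerically for a concrete coercive PLQ penalty, e.g. the scalar Huber loss. A rough sketch (our own illustration; the Riemann sum stands in for the integral, whose tail is negligible beyond the truncation):

```python
import numpy as np

def huber(x, kappa=1.0):
    """Scalar Huber penalty: quadratic near zero, linear in the tails."""
    a = np.abs(x)
    return np.where(a <= kappa, 0.5 * x**2, kappa * a - 0.5 * kappa**2)

# Coercivity gives the linear lower bound (9.1), so exp(-rho) has an
# exponentially decaying tail and a finite normalization constant.
x = np.linspace(-50.0, 50.0, 2_000_001)
Z = np.sum(np.exp(-huber(x))) * (x[1] - x[0])
```

Here Z is the normalization constant that turns exp(−ρ) into a true density, exactly the object whose existence Theorem 9 guarantees.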

9.4 Proof of Theorem 10

First observe that B⁻¹[cone(U)°] = [Bᵀcone(U)]° by (Rockafellar, 1970, Corollary 16.3.2). Suppose that y ∈ B⁻¹[cone(U)°] and y ≠ 0. Then By ∈ cone(U)°, and By ≠ 0 since B is injective, and for t > 0 we have

  ρ(ty) = sup_{u∈U} [⟨b + tBy, u⟩ − ½ uᵀMu]
        = sup_{u∈U} [⟨b, u⟩ − ½ uᵀMu + t⟨By, u⟩]
        ≤ sup_{u∈U} [⟨b, u⟩ − ½ uᵀMu] = ρ(U, M, I, 0; b),

since ⟨By, u⟩ ≤ 0 for all u ∈ U. So ρ(ty) stays bounded even as t → ∞, and hence ρ cannot be coercive.

Conversely, suppose that ρ is not coercive. Then we can find a sequence {yᵏ} with ‖yᵏ‖ > k and a constant P such that ρ(yᵏ) ≤ P for all k > 0. Without loss of generality, we may assume that yᵏ/‖yᵏ‖ → ȳ.

Then, by the definition of ρ, we have for all u ∈ U

  ⟨b + Byᵏ, u⟩ − ½ uᵀMu ≤ P
  ⟨b + Byᵏ, u⟩ ≤ P + ½ uᵀMu
  ⟨(b + Byᵏ)/‖yᵏ‖, u⟩ ≤ P/‖yᵏ‖ + (1/(2‖yᵏ‖)) uᵀMu.

Note that ȳ ≠ 0, so Bȳ ≠ 0. Taking the limit as k → ∞ gives ⟨Bȳ, u⟩ ≤ 0. From this inequality we see that Bȳ ∈ cone(U)°, and so ȳ ∈ B⁻¹[cone(U)°].

9.5 Proof of Theorem 14

Proof (i) Using standard elementary row operations, reduce the matrix

  F_γ^(1) :=  [ I      0      Aᵀ    0
                D(q)   D(s)   0     0
                0     −A     −M     B
                0      0      Bᵀ    0 ]   (9.2)

to

  [ I   0      Aᵀ          0
    0   D(s)  −D(q)Aᵀ      0
    0   0     −T           B
    0   0      Bᵀ          0 ],

where T = M + A D(q) D(s)⁻¹ Aᵀ. The matrix T is invertible since null(M) ∩ null(Aᵀ) = {0}. Hence, we can further reduce this matrix to the block upper triangular form

  [ I   0      Aᵀ          0
    0   D(s)  −D(q)Aᵀ      0
    0   0     −T           B
    0   0      0           BᵀT⁻¹B ].

Since B is injective, the matrix BᵀT⁻¹B is also invertible. Hence this final block upper triangular matrix is invertible, proving Part (i).

(ii) Let (s, q) ∈ F̂₊ and choose (uⁱ, yⁱ) so that (s, q, uⁱ, yⁱ) ∈ F₊ for i = 1, 2. Set u := u¹ − u² and y := y¹ − y². Then, by definition,

  0 = Aᵀu,  0 = By − Mu,  and  0 = Bᵀu.   (9.3)

Multiplying the second of these equations on the left by uᵀ and utilizing the third as well as the positive semi-definiteness of M, we find that Mu = 0. Hence, u ∈ null(M) ∩ null(Aᵀ) = {0}, and so By = 0. But then y = 0 since B is injective.

(iii) Let (s̄, q̄, ū, ȳ) ∈ F₊ and (s, q, u, y) ∈ F₊(τ). Then, by (4.6),

  (s − s̄)ᵀ(q − q̄) = [(a − Aᵀu) − (a − Aᵀū)]ᵀ(q − q̄)
                   = (ū − u)ᵀ(Aq − Aq̄)
                   = (ū − u)ᵀ[(b + By − Mu) − (b + Bȳ − Mū)]
                   = (ū − u)ᵀM(ū − u)
                   ≥ 0.

Hence,

  τ + s̄ᵀq̄ ≥ sᵀq + s̄ᵀq̄ ≥ s̄ᵀq + sᵀq̄ ≥ ξ ‖(s, q)‖₁,

where ξ = min{s̄ᵢ, q̄ᵢ | i = 1, …, ℓ} > 0. Therefore, the set

  F̂₊(τ) = {(s, q) | (s, q, u, y) ∈ F₊(τ)}

is bounded. Now suppose the set F₊(τ) is not bounded. Then there exists a sequence {(sᵛ, qᵛ, uᵛ, yᵛ)} ⊂ F₊(τ) such that ‖(sᵛ, qᵛ, uᵛ, yᵛ)‖ ↑ +∞. Since F̂₊(τ) is bounded, we can assume that ‖(uᵛ, yᵛ)‖ ↑ +∞ while ‖(sᵛ, qᵛ)‖ remains bounded. With no loss in generality, we may assume that there exists (u, y) ≠ (0, 0) such that (uᵛ, yᵛ)/‖(uᵛ, yᵛ)‖ → (u, y). By dividing (4.6) by ‖(uᵛ, yᵛ)‖ and taking the limit, we find that (9.3) holds. But then, as in Part (ii), (u, y) = (0, 0). This contradiction yields the result.

(iv) We first show existence; this follows from a standard continuation argument. Let (s̄, q̄, ū, ȳ) ∈ F₊ and v ∈ ℝ^ℓ₊₊. Define

  F(s, q, u, y, t) =  [ s + Aᵀu − a
                        D(q)D(s)1 − [(1 − t)ḡ + tv]
                        By − Mu − Aq
                        Bᵀu + b ]   (9.4)

where ḡ := (s̄₁q̄₁, …, s̄ℓq̄ℓ)ᵀ. Note that F(s̄, q̄, ū, ȳ, 0) = 0 and, by Part (i), ∇_{(s,q,u,y)}F(s̄, q̄, ū, ȳ, 0)⁻¹ exists.

The Implicit Function Theorem implies that there is a t̂ > 0 and a differentiable mapping t ↦ (s(t), q(t), u(t), y(t)) on [0, t̂) such that

  F[s(t), q(t), u(t), y(t), t] = 0 on [0, t̂).

Let t̄ > 0 be the largest such t̂ on [0, 1]. Since

  {[s(t), q(t), u(t), y(t)] | t ∈ [0, t̄)} ⊂ F₊(τ),

where τ = max{1ᵀḡ, 1ᵀv}, Part (iii) implies that there is a sequence tᵢ → t̄ and a point (s̃, q̃, ũ, ỹ) such that [s(tᵢ), q(tᵢ), u(tᵢ), y(tᵢ)] → (s̃, q̃, ũ, ỹ). By continuity, F(s̃, q̃, ũ, ỹ, t̄) = 0. If t̄ = 1, we are done; otherwise, apply the Implicit Function Theorem again at (s̃, q̃, ũ, ỹ, t̄) to obtain a contradiction to the maximality of t̄.

We now show uniqueness. By Part (ii), we need only establish the uniqueness of (s, q). Let (sⱼ, qⱼ) ∈ F̂₊, j = 1, 2, be such that v = (s_j(1)q_j(1), s_j(2)q_j(2), …, s_j(ℓ)q_j(ℓ))ᵀ, where s_j(i) denotes the ith element of sⱼ. As in Part (iii), we have (s₁ − s₂)ᵀ(q₁ − q₂) = (u₁ − u₂)ᵀM(u₁ − u₂) ≥ 0, and, for each i = 1, …, ℓ, s₁(i)q₁(i) = s₂(i)q₂(i) = vᵢ > 0. If (s₁, q₁) ≠ (s₂, q₂), then, for some i ∈ {1, …, ℓ}, (s₁(i) − s₂(i))(q₁(i) − q₂(i)) ≥ 0 and either s₁(i) ≠ s₂(i) or q₁(i) ≠ q₂(i). If s₁(i) > s₂(i), then q₁(i) ≥ q₂(i) > 0, so that vᵢ = s₁(i)q₁(i) > s₂(i)q₂(i) = vᵢ, a contradiction. So, without loss of generality (by exchanging (s₁, q₁) with (s₂, q₂) if necessary), we must have q₁(i) > q₂(i). But then s₁(i) ≥ s₂(i) > 0, so that again vᵢ = s₁(i)q₁(i) > s₂(i)q₂(i) = vᵢ, again a contradiction. Therefore, (s, q) is unique.

(v) Apply Part (iv) to get a point on the central path, and then use the continuation argument to trace out the central path. The differentiability follows from the Implicit Function Theorem.

(vi) Part (iii) allows us to apply a standard compactness argument to get the existence of cluster points, and the continuity of F_γ(s, q, u, y) in all of its arguments, including γ, implies that all of these cluster points solve (4.6).

9.6 Details for Remark 17

The Lagrangian for (6.4) for feasible (x, u_w, u_v) is

  L(x, u_w, u_v) = ⟨ [b̄_w; b̄_v], [u_w; u_v] ⟩ − ½ [u_w; u_v]ᵀ [M_w 0; 0 M_v] [u_w; u_v] − ⟨ [u_w; u_v], [−B_w Q^{-1/2} G; B_v R^{-1/2} H] x ⟩   (9.5)

where b̄_w = b_w − B_w Q^{-1/2} x₀ and b̄_v = b_v − B_v R^{-1/2} z. The associated optimality conditions for feasible (x, u_w, u_v) are given by

  Gᵀ Q^{-T/2} B_wᵀ u_w − Hᵀ R^{-T/2} B_vᵀ u_v = 0
  b̄_w − M_w u_w + B_w Q^{-1/2} G x ∈ N_{U_w}(u_w)
  b̄_v − M_v u_v − B_v R^{-1/2} H x ∈ N_{U_v}(u_v),   (9.6)

where N_C(r) denotes the normal cone to the set C at the point r (see (Rockafellar, 1970) for details). Since U_w and U_v are polyhedral, we can derive explicit representations of the normal cones N_{U_w}(u_w) and N_{U_v}(u_v). For a polyhedral set U ⊂ ℝᵐ and any point u ∈ U, the normal cone N_U(u) is polyhedral. Indeed, relative to any representation

  U = {u | Aᵀu ≤ a}

and the active index set I(u) := {i | ⟨Aᵢ, u⟩ = aᵢ}, where Aᵢ denotes the ith column of A, we have

  N_U(u) = { q₁A₁ + ··· + q_m A_m | qᵢ ≥ 0 for i ∈ I(u), qᵢ = 0 for i ∉ I(u) }.   (9.7)

Using (9.7), we may rewrite the optimality conditions (9.6) more explicitly as

  Gᵀ Q^{-T/2} B_wᵀ u_w − Hᵀ R^{-T/2} B_vᵀ u_v = 0
  b̄_w − M_w u_w + B_w Q^{-1/2} G d = A_w q_w
  b̄_v − M_v u_v − B_v R^{-1/2} H d = A_v q_v
  q_v ≥ 0, q_v(i) = 0 for i ∉ I(u_v)
  q_w ≥ 0, q_w(i) = 0 for i ∉ I(u_w)   (9.8)

where q_v(i) and q_w(i) denote the ith elements of q_v and q_w. Define slack variables s_w ≥ 0 and s_v ≥ 0 as follows:

  s_w = a_w − A_wᵀ u_w
  s_v = a_v − A_vᵀ u_v.   (9.9)

Note that the entries q_w(i) and q_v(i) are zero if and only if the corresponding slack variables s_w(i) and s_v(i) are nonzero, respectively. Then we have q_wᵀ s_w = q_vᵀ s_v = 0. These equations are known as the complementarity conditions. Together, all of these equations give system (6.5).
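The active-set description (9.7) and the complementarity conditions can be checked mechanically. A small sketch (the helper below is ours, for illustration only): it tests whether a multiplier vector q certifies that Aq lies in the normal cone at u, using the slacks (9.9).

```python
import numpy as np

def normal_cone_member(A, a, u, q):
    """Check whether sum_i q_i A_i lies in N_U(u) for the polyhedral
    set U = {u : A^T u <= a}, using (9.7): q_i >= 0 on the active set
    I(u) and q_i = 0 off it, equivalently q^T s = 0 with s = a - A^T u."""
    s = a - A.T @ u                      # slack variables, cf. (9.9)
    active = np.isclose(s, 0.0)          # active index set I(u)
    feasible = np.all(s >= -1e-12)       # u must belong to U
    sign_ok = np.all(q[active] >= 0) and np.all(q[~active] == 0)
    comp_ok = np.isclose(q @ s, 0.0)     # complementarity q^T s = 0
    return bool(feasible and sign_ok and comp_ok)
```

For example, with U = [−1, 1] written as {u : u ≤ 1, −u ≤ 1} and u = 1, only the first constraint is active, so only q₁ may be positive.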


9.7 Proof of Theorem 18

IP methods apply a damped Newton iteration to find the solution of the relaxed KKT system F_γ = 0, where

  F_γ(s_w, s_v, q_w, q_v, u_w, u_v, x) =  [ A_wᵀ u_w + s_w − a_w
                                            A_vᵀ u_v + s_v − a_v
                                            D(q_w) D(s_w) 1 − γ1
                                            D(q_v) D(s_v) 1 − γ1
                                            b̄_w + B_w Q^{-1/2} G d − M_w u_w − A_w q_w
                                            b̄_v − B_v R^{-1/2} H d − M_v u_v − A_v q_v
                                            Gᵀ Q^{-T/2} B_wᵀ u_w − Hᵀ R^{-T/2} B_vᵀ u_v ].

This entails solving the system

  F_γ^(1)(s_w, s_v, q_w, q_v, u_w, u_v, x) [∆s_w; ∆s_v; ∆q_w; ∆q_v; ∆u_w; ∆u_v; ∆x] = −F_γ(s_w, s_v, q_w, q_v, u_w, u_v, x),   (9.10)

where the derivative matrix F_γ^(1) is given by

  [ I       0       0       0       A_wᵀ                0                    0
    0       I       0       0       0                   A_vᵀ                 0
    D(q_w)  0       D(s_w)  0       0                   0                    0
    0       D(q_v)  0       D(s_v)  0                   0                    0
    0       0      −A_w     0      −M_w                 0                    B_w Q^{-1/2} G
    0       0       0      −A_v     0                  −M_v                 −B_v R^{-1/2} H
    0       0       0       0       Gᵀ Q^{-T/2} B_wᵀ   −Hᵀ R^{-T/2} B_vᵀ    0 ]   (9.11)

We now show the row operations necessary to reduce the matrix F_γ^(1) in (9.11) to upper block triangular form. After each operation, we show only the row that was modified.

  row₃ ← row₃ − D(q_w) row₁:   [0  0  D(s_w)  0  −D(q_w)A_wᵀ  0  0]
  row₄ ← row₄ − D(q_v) row₂:   [0  0  0  D(s_v)  0  −D(q_v)A_vᵀ  0]
  row₅ ← row₅ + A_w D(s_w)⁻¹ row₃:   [0  0  0  0  −T_w  0  B_w Q^{-1/2} G]
  row₆ ← row₆ + A_v D(s_v)⁻¹ row₄:   [0  0  0  0  0  −T_v  −B_v R^{-1/2} H].

In the above expressions,

  T_w := M_w + A_w D(s_w)⁻¹ D(q_w) A_wᵀ
  T_v := M_v + A_v D(s_v)⁻¹ D(q_v) A_vᵀ,   (9.12)

where D(s_w)⁻¹D(q_w) and D(s_v)⁻¹D(q_v) are always full-rank diagonal matrices, since the vectors s_w, q_w, s_v, q_v are componentwise positive along the IP iterations. The matrices T_w and T_v are invertible as long as the PLQ densities for w and v satisfy (4.10).

Remark 19 (block diagonal structure of T in the independent case) Suppose that y is a random vector, y = vec({yᵢ}), where each yᵢ is itself a random vector in ℝ^{m(i)} from some PLQ density p(yᵢ) ∝ exp[−c² ρ(Uᵢ, Mᵢ, 0, I; ·)], and all yᵢ are independent. Let Uᵢ = {u : Aᵢᵀ u ≤ aᵢ}. Then the matrix T_ρ is given by T_ρ = M + A D Aᵀ, where M = diag[M₁, …, M_N], A = diag[A₁, …, A_N], D = diag[D₁, …, D_N], and the Dᵢ are diagonal with positive entries. Moreover, T_ρ is block diagonal, with ith diagonal block given by Mᵢ + Aᵢ Dᵢ Aᵢᵀ.

From Remark 19, the matrices T_w and T_v in (9.12) are block diagonal provided that w_k and v_k are independent vectors from any PLQ densities.

We now finish the reduction of F_γ^(1) to upper block triangular form:

  row₇ ← row₇ + (Gᵀ Q^{-T/2} B_wᵀ T_w⁻¹) row₅ − (Hᵀ R^{-T/2} B_vᵀ T_v⁻¹) row₆:

  [ I   0   0       0       A_wᵀ           0               0
    0   I   0       0       0              A_vᵀ            0
    0   0   D(s_w)  0      −D(q_w)A_wᵀ     0               0
    0   0   0       D(s_v)  0             −D(q_v)A_vᵀ      0
    0   0   0       0      −T_w            0               B_w Q^{-1/2} G
    0   0   0       0       0             −T_v            −B_v R^{-1/2} H
    0   0   0       0       0              0               Ω ]

where

  Ω = Ω_G + Ω_H = Gᵀ Q^{-T/2} B_wᵀ T_w⁻¹ B_w Q^{-1/2} G + Hᵀ R^{-T/2} B_vᵀ T_v⁻¹ B_v R^{-1/2} H.   (9.13)

Note that Ω is symmetric positive definite. Note also that Ω is block tridiagonal, since

  1. Ω_H is block diagonal.
  2. Q^{-T/2} B_wᵀ T_w⁻¹ B_w Q^{-1/2} is block diagonal, and G is block bidiagonal, hence Ω_G is block tridiagonal.

Solving system (9.10) requires inverting the block diagonal matrices T_v and T_w at each iteration of the damped Newton's method, as well as solving an equation of the form Ω∆x = ρ. The matrices T_v and T_w are block diagonal, with sizes Nn and Nm, assuming m measurements at each time point. Given that they are invertible (see (4.10)), these inversions take O(Nn³) and O(Nm³) time. Since Ω is block tridiagonal, symmetric, and positive definite, Ω∆x = ρ can be solved in O(Nn³) time using the block tridiagonal algorithm in (Bell, 2000). The remaining four back solves required to solve (9.10) can each be done in O(Nl) time, where we assume that A_{v(k)} ∈ ℝ^{n×l} and A_{w(k)} ∈ ℝ^{m×l} at each time point k.
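A minimal sketch of the O(Nn³) block tridiagonal solve referenced above, via block forward elimination and back substitution in the spirit of the algorithm in (Bell, 2000). This is our own illustration under the stated assumptions (symmetric positive definite system), not the paper's implementation.

```python
import numpy as np

def block_tridiag_solve(diag, off, rhs):
    """Solve a symmetric positive definite block tridiagonal system in
    O(N n^3) time. diag[k] is the k-th n x n diagonal block, off[k] the
    sub-diagonal block coupling blocks k and k+1 (the super-diagonal
    block is off[k].T), rhs[k] the k-th n-vector."""
    N = len(diag)
    d = [None] * N          # eliminated diagonal blocks (block LDL^T)
    y = [None] * N
    d[0], y[0] = diag[0], rhs[0]
    for k in range(1, N):
        # Forward elimination of the sub-diagonal block off[k-1].
        w = np.linalg.solve(d[k - 1], off[k - 1].T)
        d[k] = diag[k] - off[k - 1] @ w
        y[k] = rhs[k] - off[k - 1] @ np.linalg.solve(d[k - 1], y[k - 1])
    x = [None] * N
    x[N - 1] = np.linalg.solve(d[N - 1], y[N - 1])
    for k in range(N - 2, -1, -1):
        # Back substitution using the super-diagonal blocks off[k].T.
        x[k] = np.linalg.solve(d[k], y[k] - off[k].T @ x[k + 1])
    return np.concatenate(x)
```

Each step factors or solves only n × n blocks, so the total cost is linear in the number of time points N, which is what makes PLQ Kalman smoothing as cheap (in order terms) as the classical quadratic smoother.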

References

B. D. O. Anderson and J. B. Moore. Optimal Filtering. Prentice-Hall, Englewood Cliffs, N.J., USA,1979.

A. Y. Aravkin, J. V. Burke, and M. P. Friedlander. Variational properties of value functions. Technical report, Preprint, University of Washington, 2012.

A.Y. Aravkin. Robust Methods with Applications to Kalman Smoothing and Bundle Adjustment. PhD thesis, University of Washington, Seattle, WA, June 2010.

A.Y. Aravkin, B.M. Bell, J.V. Burke, and G. Pillonetto. An ℓ1-Laplace robust Kalman smoother. IEEE Transactions on Automatic Control, 56(12):2898–2911, December 2011a. ISSN 0018-9286. doi: 10.1109/TAC.2011.2141430.

A.Y. Aravkin, B.M. Bell, J.V. Burke, and G. Pillonetto. Learning using state space kernel machines. In Proc. IFAC World Congress 2011, Milan, Italy, 2011b.

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.

B.M. Bell. The marginal likelihood for parameters in a discrete Gauss-Markov process. IEEE Transactions on Signal Processing, 48(3):626–636, August 2000.

Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, January 2011. ISSN 1935-8237. doi: 10.1561/2200000016. URL http://dx.doi.org/10.1561/2200000016.

R. Brockett. Finite Dimensional Linear Systems. John Wiley and Sons, Inc., 1970.

J. V. Burke. An exact penalization viewpoint of constrained optimization. Technical report, Argonne National Laboratory, ANL/MCS-TM-95, 1987.

J.V. Burke. Descent methods for composite nondifferentiable optimization problems. Mathematical Programming, 33:260–279, 1985.

Wei Chu, S. Sathiya Keerthi, and Chong Jin Ong. A unified loss function in Bayesian framework for support vector regression. In Proceedings of the 18th International Conference on Machine Learning, pages 51–58, 2001.

F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1–49, 2001.

F. Dinuzzo. Analysis of fixed-point and coordinate descent algorithms for regularized kernel methods. IEEE Transactions on Neural Networks, 22(10):1576–1587, 2011.

F. Dinuzzo, M. Neve, G. De Nicolao, and U. P. Gianazza. On the representer theorem and equivalent degrees of freedom of SVR. Journal of Machine Learning Research, 8:2467–2495, 2007.

D. Donoho. Compressed sensing. IEEE Trans. on Information Theory, 52(4):1289–1306, 2006.

B. Efron, T. Hastie, L. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.

T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1–150, 2000.

S. Farahmand, G.B. Giannakis, and D. Angelosante. Doubly robust smoothing of dynamical processes via outlier sparsity constraints. IEEE Transactions on Signal Processing, 59:4529–4543, 2011.

M.C. Ferris and T.S. Munson. Interior-point methods for massive support vector machines. SIAM Journal on Optimization, 13(3):783–804, 2003.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res., 2:243–264, 2001.

J. Gao. Robust l1 principal component analysis and its Bayesian variational inference. Neural Computation, 20(2):555–572, February 2008.

A. Gelb. Applied Optimal Estimation. The M.I.T. Press, Cambridge, MA, 1974.

O. Guler and R. Hauser. Self-scaled barrier functions on symmetric cones and their classification. Foundations of Computational Mathematics, 2:121–143, 2002.

T. J. Hastie and R. J. Tibshirani. Generalized additive models. In Monographs on Statistics and Applied Probability, volume 43. Chapman and Hall, London, UK, 1990.

T. J. Hastie, R. J. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, Canada, 2001.

P.J. Huber. Robust Statistics. Wiley, 1981.

A. Jazwinski. Stochastic Processes and Filtering Theory. Dover Publications, Inc., 1970.

T. Joachims, editor. Making Large-Scale Support Vector Machine Learning Practical. MIT Press, Cambridge, MA, USA, 1998.

S. Kim, K. Koh, M. Lustig, S. Boyd, and D. Gorinevsky. An interior-point method for large-scale ℓ1-regularized least squares. IEEE Journal of Selected Topics in Signal Processing, 1(4):606–617, 2007.

M. Kojima, N. Megiddo, T. Noma, and A. Yoshise. A Unified Approach to Interior Point Algorithms for Linear Complementarity Problems, volume 538 of Lecture Notes in Computer Science. Springer Verlag, Berlin, Germany, 1991.

C.J. Lin. On the convergence of the decomposition method for support vector machines. IEEE Transactions on Neural Networks, 12(12):1288–1298, 2001.

H. Liu, S. Shah, and W. Jiang. On-line outlier detection and data cleaning. Computers and Chemical Engineering, 28:1635–1647, 2004.

S. Lucidi, L. Palagi, A. Risi, and M. Sciandrone. A convergent decomposition algorithm for support vector machines. Comput. Optim. Appl., 38(2):217–234, 2007.

D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4:415–447, 1992.

D.J.C. MacKay. Bayesian non-linear modelling for the prediction competition. ASHRAE Trans., 100(2):3704–3716, 1994.

A. Nemirovskii and Y. Nesterov. Interior-Point Polynomial Algorithms in Convex Programming, volume 13 of Studies in Applied Mathematics. SIAM, Philadelphia, PA, USA, 1994.

H. Ohlsson, F. Gustafsson, L. Ljung, and S. Boyd. State smoothing by sum-of-norms regularization. Automatica (to appear), 2011.

B. Oksendal. Stochastic Differential Equations. Springer, sixth edition, 2005.

J.A. Palmer, D.P. Wipf, K. Kreutz-Delgado, and B.D. Rao. Variational EM algorithms for non-Gaussian latent variable models. In Proc. of NIPS, 2006.

G. Pillonetto and B.M. Bell. Bayes and empirical Bayes semi-blind deconvolution using eigenfunctions of a prior covariance. Automatica, 43(10):1698–1712, 2007.

J. Platt. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning, 1998.

M. Pontil and A. Verri. Properties of support vector machines. Neural Computation, 10:955–974, 1998.

C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.

B. Ristic, S. Arulampalam, and N. Gordon. Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House Publishers, 2004.

R.T. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, 1970.

R.T. Rockafellar and R.J.B. Wets. Variational Analysis, volume 317. Springer, 1998.

S. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural Computation, 11:305–345, 1999.

S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman, 1988.

H. H. Schaefer. Topological Vector Spaces. Springer-Verlag, 1970.

B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). The MIT Press, 2001.

B. Scholkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. Neural Computation, 12:1207–1245, 2000.

B. Scholkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. Neural Networks and Computational Learning Theory, 81:416–426, 2001.

A. J. Smola and B. Scholkopf. Bayesian kernel methods. In S. Mendelson and A. J. Smola, editors, Machine Learning, Proceedings of the Summer School, Australian National University, pages 65–117, Berlin, Germany, 2003. Springer-Verlag.

R. Tibshirani. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.

M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–244, 2001.

P. Tseng and S. Yun. A coordinate gradient descent method for linearly constrained smooth optimization and support vector machines training. Comput. Optim. Appl., 47(2):1–28, 2008.

V. Vapnik. Statistical Learning Theory. Wiley, New York, NY, USA, 1998.

G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.

G. Wahba. Support vector machines, reproducing kernel Hilbert spaces and randomized GACV. Technical Report 984, Department of Statistics, University of Wisconsin, 1998.

D.P. Wipf, B.D. Rao, and S. Nagarajan. Latent variable Bayesian models for promoting sparsity. IEEE Transactions on Information Theory (to appear), 2011.

S.J. Wright. Primal-Dual Interior-Point Methods. SIAM, Philadelphia, PA, USA, 1997.

Yinyu Ye and Kurt Anstreicher. On quadratic and O(√nL) convergence of a predictor-corrector method for LCP. Mathematical Programming, 62(1-3):537–551, 1993.

E. H. Zarantonello. Projections on Convex Sets in Hilbert Space and Spectral Theory. Academic Press, 1971.

K. Zhang and J.T. Kwok. Clustered Nyström method for large scale manifold learning and dimension reduction. IEEE Transactions on Neural Networks, 21(10):1576–1587, 2010.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.

