
Discrete-Continuous ADMM for Transductive Inference in Higher-Order MRFs

Emanuel Laude 1, Jan-Hendrik Lange 2, Jonas Schüpfer 1, Csaba Domokos 1,
Laura Leal-Taixé 1, Frank R. Schmidt 1,3, Bjoern Andres 2,3,4, Daniel Cremers 1

1 Technical University of Munich   2 Max Planck Institute for Informatics, Saarbrücken   3 Bosch Center for Artificial Intelligence   4 University of Tübingen

Abstract

This paper introduces a novel algorithm for transductive inference in higher-order MRFs, where the unary energies are parameterized by a variable classifier. The considered task is posed as a joint optimization problem in the continuous classifier parameters and the discrete label variables. In contrast to prior approaches such as convex relaxations, we propose an advantageous decoupling of the objective function into discrete and continuous subproblems and a novel, efficient optimization method related to ADMM. This approach preserves integrality of the discrete label variables and guarantees global convergence to a critical point. We demonstrate the advantages of our approach in several experiments including video object segmentation on the DAVIS data set and interactive image segmentation.

1. Introduction

Various problems in computer vision, computer graphics and machine learning can be formulated as MAP inference in a (possibly higher-order) Markov random field (MRF) [34, 27, 42, 29, 31, 13, 33, 45]. The resulting optimization problem is defined over a hypergraph (V, C) and a finite label set L as:

$$\min_{y \in \mathcal{Y}} \;\sum_{i \in \mathcal{V}} E_i(y_i) + \sum_{\substack{C \in \mathcal{C}\\ |C| > 1}} E_C(y_C). \quad (1)$$

The optimization variable y ∈ Y := L^{|V|} corresponds to a labeling of the vertices V and assigns a label y_i ∈ L to each vertex i ∈ V. For convenience, we make a distinction between the singleton clique energies (unaries) E_i(y_i) and the higher-order energies E_C(y_C), |C| > 1.

Figure 1: A pixel-classifier trained to predict an object mask (top row) performs well when the distribution of object pixels in the training image is similar to the test image (left), but often fails if it is dissimilar (right). In a transductive inference approach (bottom row), we optimize jointly for the test labels and classifier parameters, which successfully prevents the hallucination of object pixels in difficult scenes, such as in the case of occlusion (right), cf. Sec. 4.2.

In computer vision tasks, where the image is interpreted as a (higher-order) pixel grid (V, C), the higher-order potentials often correspond to priors favoring spatially smooth solutions. In semantic image segmentation, for instance, inference in MRFs is widely used as a post-processing step to introduce spatial smoothness on the labeling y [11]. In this sense, the overall task of semantic segmentation is subdivided into two tasks: First, a classifier, parameterized by W, is trained in a supervised fashion on a sufficiently large labeled training set, and assigns to each pixel in a test image a class (probability) score. Second, to enforce spatial smoothness of the labeling of the test image, MAP inference in an MRF is performed as a post-processing step, in which the class (probability) scores are interpreted as the unary energies E_i(y_i; W) for each y_i. We argue that it is advantageous to merge such a two-step approach into a joint approach, as the training of the classifier profits from both (i) the distribution of the unlabeled pixels in the color or feature space, and (ii) the available structural information about the unlabeled pixels, namely the spatial smoothness prior. Conversely, the segmentation will also profit from an improved classifier.

A joint formulation has two interpretations: on the one hand, it is a semi-supervised learning method that makes use of structural knowledge about the training data to learn a classifier. Such knowledge may take the form of higher-order clique energies E_C [59, 3] on the labeling and acts as weak supervision in the training process. This approach helps to mitigate the need for large amounts of annotated training data in typical modern machine learning applications. We, on the other hand, focus on its interpretation as a transductive inference method [58], i.e. the approach to directly infer the labels of specific test data given specific training data, which we accomplish by incorporating a variable classifier in the inference process. Transductive inference stands in contrast to inductive inference, which refers to first learning a general model from training data and subsequently applying the model to predict labels of a-priori unknown test data. We show the benefits of using transductive inference for the tasks of video object segmentation (cf. Fig. 1) as well as scribble-based segmentation (cf. Fig. 4).

1.1. Contributions

We propose a general joint model in which the unaries are not fixed for inference in the MRF, but rather optimized jointly with the labeling. For each i ∈ V, let x_i ∈ R^d denote the d-dimensional feature vector associated to the i-th vertex and let φ : R^d → R^{d'} be a feature map. Then, mathematically, such a task can be naturally formulated in terms of a bilevel optimization problem:

$$\min_{y \in \mathcal{Y},\; W \in \mathbb{R}^{|\mathcal{L}| \times d'}} \;\sum_{i \in \mathcal{V}} \ell(y_i; W\varphi(x_i)) + \sum_{C \in \mathcal{C}} E_C(y_C) \quad (2)$$
$$\text{subject to}\quad W = \operatorname*{argmin}_{W \in \mathbb{R}^{|\mathcal{L}| \times d'}} \;\sum_{i \in \mathcal{V}} \ell(y_i; W\varphi(x_i)) + g(W).$$

Here, the upper-level task is inference in an MRF with additional unaries E_i(y_i; W) := ℓ(y_i; Wφ(x_i)), parameterized by linear classifier weights W ∈ R^{|L|×d'}, and a loss function ℓ : L × R^{|L|} → R. The lower-level task associates to each given set of labels y the optimal parameters W. For instance, if ℓ is the hinge loss and g(W) = ‖W‖²_F, then the lower-level optimization problem amounts to the training of a classical SVM. Note that in a semi-supervised learning context it might be more convenient to swap the upper- and lower-level tasks, since the primary interest is the estimation of the classifier that minimizes the generalization error and not the inferred labeling. However, mathematically, both viewpoints are equivalent.

The model (2) suggests a simple alternating optimization scheme, as in Lloyd's algorithm [38] for k-means, to compute a local optimum. However, such an approach has two major drawbacks: (i) The lower-level subproblems are expensive, which is prohibitive for large-scale applications. (ii) The optimization is prone to poor local optima and therefore sensitive to initialization [63]. Motivated by the good practical performance of the alternating direction method of multipliers (ADMM) in nonconvex optimization, we propose to generalize vanilla ADMM (commonly applied in continuous optimization) to discrete-continuous problems of the form (2), while preserving integrality of the discrete variables. Since our method serves as a general algorithmic framework to tackle such problems, it is also relevant to semi-supervised and transductive learning in a broader sense.

The main contributions of this work can be summarized as follows:

• We devise a decomposition of the model into simple, purely discrete and purely continuous subproblems within the framework of proximal splitting. The subproblems can be solved in a distributed fashion.

• We devise a tailored ADMM-inspired algorithm, discrete-continuous ADMM, to compute a local optimum of (2). In contrast to vanilla ADMM, our algorithm allows us to obtain sub-optimal solutions of the MAP inference problem, so that also computationally more challenging MRFs can be considered.

• We generalize the convergence guarantee of nonconvex vanilla ADMM to the presented inexact discrete-continuous ADMM.

• In diverse experiments we demonstrate the relevance and generality of our model and the efficiency of our method: In contrast to standard k-means, our model integrates well with deep features. In contrast to a tailored SDP relaxation approach for transductive logistic regression, our method produces more consistent results, while being more efficient in terms of both runtime and memory consumption.

1.2. Related work

To improve image segmentation results it is common practice to treat the unary terms E_i as additional variables in the optimization [5, 64, 8, 49, 55, 56, 57]. More recently, [55, 56] revealed the equivalence of k-means clustering with pairwise constraints [59, 3] and the Chan-Vese [8] approach, where the average foreground and background intensities (corresponding to the centroids in k-means) are not assumed to be fixed, but are rather treated as additional variables. The goal of this approach is to jointly cluster the pixels in the color space and regularize the cluster assignment (the segmentation) in the image space. The clustering viewpoint suggests the application of the "kernel trick", which allows us to separate more complicated, possibly nonlinearly deformed color clusters [59, 3, 55, 56]. Experimentally, it has been shown that this approach integrates well with color or even depth pixel features [55, 56]. Due to the enormous success of deep convolutional neural networks on computer vision tasks, it is tempting to replace the color features by more sophisticated deep features that are capable of compactly representing complicated semantic information [32, 11]. However, high-dimensional deep features are in general not "k-means friendly" [61] and, without further preprocessing of the features as in [61], the plain Chan-Vese k-means approach does not generalize very well to deep features, despite (almost) linear separability of the data. Since deep neural network classifiers can be viewed as a linear model on top of a deep feature extractor, we propose to alter the approaches from related work by using a (multiclass) SVM or a (multinomial) logistic regression model along with deep features. In the absence of general higher-order terms (i.e. E_C = 0 for all C ∈ C), our model (2) is closely related to transductive SVMs [58, 22, 4] and transductive logistic regression [23]. In such a setting, optimization schemes that alternately optimize w.r.t. labels and model parameters, as in Lloyd's algorithm [38], are ineffective [63]. Other approaches, such as SDP relaxations, are computationally expensive [23].

Instead, we propose an algorithm related to ADMM, which has recently been successfully applied to many nonconvex continuous optimization problems [10, 44, 52, 40, 35]. ADMM appears similar in form to message passing and subgradient descent schemes applied to the Lagrangian dual problem (dual decomposition) [30, 6, 39, 62, 53]. The latter is a Lagrangian relaxation approach, so that in difficult nonconvex cases the linear equality constraints may remain violated in the limit [30]. In contrast, ADMM attempts to solve the problem exactly and enforces the linear equality constraints strictly via additional quadratic penalty terms. In order to make mixed discrete-continuous problems such as (2) amenable to ADMM, related approaches often relax the discrete variable and perform rounding operations [21, 54]. In contrast, we propose a generalization of vanilla ADMM that preserves the integrality of the label variable and admits a theoretical convergence guarantee under affordable conditions. In the traditional convex and continuous setting, ADMM [19, 18] converges under mild conditions [17, 14]. For more restrictive nonconvex problems, its convergence has only been established recently [20, 37]. In this case, however, the required assumptions are fairly strong.

2. Discrete-Continuous ADMM

The coupling of the discrete labeling variable y and the continuous variable W renders problem (2) hard to solve. This is not surprising, since the related k-means clustering problem is known to be NP-hard. A common approach is to compute a local optimum by a simple discrete-continuous coordinate descent approach as in Lloyd's algorithm [38]. Instead, we propose an advantageous decoupling into purely discrete and purely continuous subproblems, which allows us to compute a local optimum by updating the continuous and discrete variables jointly and efficiently.

2.1. Variable decoupling via ADMM

To this end, we employ a change of representation to make the proposed problem amenable to the "kernel trick". Note that, for any fixed labeling y, the lower-level task in (2) amounts to supervised SVM training (resp. supervised logistic regression). Thus, we can apply the representer theorem [50]: Let Φ(X) be the feature matrix for a (possibly infinite-dimensional) matrix feature map Φ : R^{d×|V|} → R^{d'×|V|} and let

$$g(W) = h(\|W\|_F), \quad (3)$$

for h : [0,∞) → R strictly monotonically increasing. Then the weights W^⊤ = Φ(X)α can be substituted via their representation α ∈ R^{|V|×|L|} in terms of the features. More precisely, we replace the scalar products Wφ(x_i), up to transposition, by K_iα = (Wφ(x_i))^⊤, where K := Φ(X)^⊤Φ(X) denotes the Gram or kernel matrix.
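As a small illustration of this change of representation, the following sketch (ours, assuming a plain linear kernel and our own variable names) computes the Gram matrix and the classifier scores K_iα:

```python
import numpy as np

# X: d x |V| feature matrix, alpha: |V| x |L| representer coefficients.
def linear_kernel_scores(X, alpha):
    K = X.T @ X        # Gram matrix K = Phi(X)^T Phi(X) for the linear feature map
    scores = K @ alpha # row i equals K_i alpha, i.e. (W phi(x_i))^T
    return K, scores

# Example with random data: 5-dimensional features, 100 vertices, 3 labels.
X = np.random.randn(5, 100)
alpha = np.random.randn(100, 3)
K, scores = linear_kernel_scores(X, alpha)  # K is 100 x 100, scores is 100 x 3
```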

For f : R^{|V|×|L|} → R defined as

$$f(\alpha) := h(\|\Phi(X)\alpha\|_F), \quad (4)$$

this substitution leaves us with the following equivalent mixed-integer nonlinear program formulation of (2):

$$\min_{y \in \mathcal{Y},\; \alpha \in \mathbb{R}^{|\mathcal{V}|\times|\mathcal{L}|}} \;\sum_{i \in \mathcal{V}} \ell(y_i; K_i\alpha) + f(\alpha) + \sum_{C \in \mathcal{C}} E_C(y). \quad (5)$$

In order to decompose problem (5) into simple subproblems associated with each i ∈ V, we introduce auxiliary variables β_i = K_iα, which yields

$$\min_{y \in \mathcal{Y},\; \alpha, \beta \in \mathbb{R}^{|\mathcal{V}|\times|\mathcal{L}|}} \;\sum_{i \in \mathcal{V}} \ell(y_i; \beta_i) + f(\alpha) + \sum_{C \in \mathcal{C}} E_C(y) \quad \text{subject to } K\alpha = \beta. \quad (6)$$

Note that the objective of (6) is a separable function over the β_i. This suggests to relax the linear constraint Kα = β and consider the equivalent saddle-point problem

$$\min_{y \in \mathcal{Y},\; \alpha, \beta \in \mathbb{R}^{|\mathcal{V}|\times|\mathcal{L}|}} \;\max_{\lambda \in \mathbb{R}^{|\mathcal{V}|\times|\mathcal{L}|}} \;\mathcal{L}_\rho(\alpha, \beta, \lambda, y), \quad (7)$$

where λ ∈ R^{|V|×|L|} are the Lagrange multipliers corresponding to Kα = β and L_ρ denotes the "discrete-continuous" augmented Lagrangian, which for some penalty parameter ρ > 0 is defined as

$$\mathcal{L}_\rho(\alpha, \beta, \lambda, y) := \sum_{i \in \mathcal{V}} \ell(y_i; \beta_i) + f(\alpha) + \sum_{C \in \mathcal{C}} E_C(y) + \langle \lambda, K\alpha - \beta\rangle + \frac{\rho}{2}\|K\alpha - \beta\|_F^2. \quad (8)$$
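A direct sketch of evaluating (8) is given below; `loss`, `f` and `higher_order_energy` stand in for ℓ, f and Σ_C E_C and are assumptions of ours, not the authors' code.

```python
import numpy as np

def augmented_lagrangian(alpha, beta, lam, y, K, rho, loss, f, higher_order_energy):
    """Evaluate the "discrete-continuous" augmented Lagrangian (8)."""
    residual = K @ alpha - beta
    value = sum(loss(y[i], beta[i]) for i in range(len(y)))  # sum_i l(y_i; beta_i)
    value += f(alpha)                                        # regularizer term f(alpha)
    value += higher_order_energy(y)                          # sum_C E_C(y)
    value += np.sum(lam * residual)                          # <lambda, K alpha - beta>
    value += 0.5 * rho * np.sum(residual ** 2)               # (rho/2) ||K alpha - beta||_F^2
    return value
```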

We show in Sec. 2.2 that, for fixed λ and α, the function L_ρ(α, ·, λ, ·) can be minimized (not necessarily to global optimality) efficiently and jointly over y and β. This central observation and the good practical performance of ADMM in nonconvex optimization motivate the following generalization to discrete-continuous problems of the form (7).

We propose an algorithm that, similar to continuous ADMM, updates the discrete-continuous variable pair (β^{t+1}, y^{t+1}) via joint (and possibly suboptimal) minimization of L_ρ(α^t, ·, λ^t, ·). Subsequently, it updates α^{t+1} via minimization of L_ρ(·, β^{t+1}, λ^t, y^{t+1}) and the Lagrange multiplier λ^t by performing one iteration of gradient ascent on L_ρ(α^{t+1}, β^{t+1}, ·, y^{t+1}) with step size ρ > 0. In summary, the update steps at iteration t are given as

$$(\beta^{t+1}, y^{t+1}) = \operatorname*{argmin}_{\beta, y}\ \mathcal{L}_\rho(\alpha^t, \beta, \lambda^t, y), \quad (9)$$
$$\alpha^{t+1} = \operatorname*{argmin}_{\alpha}\ \mathcal{L}_\rho(\alpha, \beta^{t+1}, \lambda^t, y^{t+1}), \quad (10)$$
$$\lambda^{t+1} = \lambda^t + \rho\,(K\alpha^{t+1} - \beta^{t+1}). \quad (11)$$

In practice, we choose the step size adaptively, as this often leads to better solutions in terms of objective value: For finitely many iterations, the penalty parameter ρ is increased according to the schedule ρ^{t+1} = min{ρ_max, τρ^t} with τ > 1 and some ρ_max > 0 that guarantees theoretical convergence of the algorithm (cf. Sec. 3).

2.2. Distributed solution of the subproblems

In this section, we describe the implementation of update steps (9)–(11) in our algorithm. In principle, (9) could be solved by minimization over β for every feasible labeling y ∈ Y. Obviously, this is not a viable approach, as it implies performing exhaustive search over the set Y, which has size |L|^{|V|}. Instead, we pursue the following more efficient strategy.

Solution via lookup-tables. Assume first the absence of any higher-order energies, i.e. E_C = 0 for all C ∈ C. Then, since L_ρ(α^t, β, λ^t, y) is separable w.r.t. β_i and y_i, we can decompose problem (9) into |V| independent problems of the form

$$\operatorname*{argmin}_{\beta_i, y_i}\ \underbrace{\ell(y_i; \beta_i) + \frac{\rho}{2}\big\|\beta_i - K_i\alpha^t - \lambda_i^t/\rho\big\|_F^2}_{\psi_i(\beta_i, y_i;\, \alpha^t, \lambda_i^t)}, \quad (12)$$

which can thus be solved in parallel. In the presence of higher-order energies, however, the problems (12) are not completely independent, because the variables y_i are coupled via the energies E_C in which they appear. In this case, we first solve (12) w.r.t. only the continuous variables β_i for every possible label y_i ∈ L and store the results in a lookup-table (u^{t+1}, B^{t+1}).

Precisely, for each 1 ≤ i ≤ |V| and each y_i ∈ L we create an entry (u^{t+1}_{i,y_i}, B^{t+1}_{i,y_i}) according to

$$B^{t+1}_{i,y_i} := \operatorname*{argmin}_{\beta_i}\ \psi_i(\beta_i; y_i, \alpha^t, \lambda_i^t), \qquad u^{t+1}_{i,y_i} := \min_{\beta_i}\ \psi_i(\beta_i; y_i, \alpha^t, \lambda_i^t). \quad (13)$$
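The following sketch (ours) builds the lookup table of (13) with a generic numerical minimizer in place of the closed-form or dual solvers described further below; it is only meant to make the data layout of (u^{t+1}, B^{t+1}) concrete, and the function names are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def build_lookup_table(K, alpha, lam, rho, loss, num_labels):
    """Build (u, B) of (13): for each vertex i and label c, minimize psi_i over beta_i."""
    n = K.shape[0]
    consensus = K @ alpha + lam / rho              # row i equals K_i alpha^t + lambda_i^t / rho
    u = np.zeros((n, num_labels))                  # optimal values  u_{i,c}
    B = np.zeros((n, num_labels, alpha.shape[1]))  # optimal arguments B_{i,c}
    for i in range(n):
        for c in range(num_labels):
            psi = lambda b: loss(c, b) + 0.5 * rho * np.sum((b - consensus[i]) ** 2)
            res = minimize(psi, consensus[i])      # generic solver; the paper uses duality/closed forms
            B[i, c] = res.x
            u[i, c] = res.fun
    return u, B
```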

In a second step, we determine the discrete variable update y^{t+1} as the (possibly suboptimal) solution of the MRF

$$y^{t+1} = \operatorname*{argmin}_{y \in \mathcal{Y}}\ \sum_{i \in \mathcal{V}} u^{t+1}_{i,y_i} + \sum_{C \in \mathcal{C}} E_C(y). \quad (14)$$

Afterwards, the continuous variable updates β^{t+1}_i can be read off from the solution of (13) via

$$\beta^{t+1}_i = B^{t+1}_{i,\,y^{t+1}_i}. \quad (15)$$

Note that there is an abundance of algorithms available to tackle problems of the form (14), such as graph cuts for binary submodular MRFs [29], move-making and message passing algorithms [13, 28], primal-dual algorithms [31, 15] and more. For an overview, see also [24].

The matrix u specifies the unary energies in problem (14) and pushes the MRF to attain a labeling that corresponds to a more suitable classifier. The latter is determined by a tradeoff between minimizing the distance of β_i to the current consensus parameters K_iα^t + λ_i^t/ρ and minimizing the loss term corresponding to sample i.

In case of suboptimality of y^{t+1} we require that y^{t+1}, for some δ ≥ 0, satisfies a (sufficient) descent condition

$$\mathcal{L}_\rho(\alpha^t, \beta^{t+1}, \lambda^t, y^{t+1}) - \mathcal{L}_\rho(\alpha^t, B^{t+1}_{:,y^t}, \lambda^t, y^t) \le -\delta. \quad (16)$$

If this condition is violated, then we keep the previous iterate y^{t+1} = y^t. Under condition (16), the overall convergence of our algorithm is guaranteed (cf. Prop. 1 and Prop. 2 in Sec. 3). We summarize our method in Alg. 1.

Note that if the discrete subproblem (14) is solved to global optimality, our method specializes to classical nonconvex ADMM applied to the purely continuous problem min_α E(Kα) + f(α). The function E(β) = min_{y∈Y} Σ_{i∈V} ℓ(y_i; β_i) + Σ_{C∈C} E_C(y) encapsulates the minimization over the discrete labelings y. This results in a pointwise minimum over exponentially many functions.

Algorithm 1 Discrete-Continuous ADMM

Require: initialize α^0, λ^0, ρ^0 > 0, τ > 1, ρ_max as in (18)
1: while (not converged) do
2:   Compute lookup-table (u^{t+1}, B^{t+1}):
3:   for all i ∈ {1, ..., |V|} and y_j ∈ L do
4:     In parallel, update (u^{t+1}_{i,y_j}, B^{t+1}_{i,y_j}) as in (13).
5:   end for
6:   Update y^{t+1} as in (14).
7:   if y^{t+1} violates condition (16) then
8:     y^{t+1} ← y^t.
9:   end if
10:  β^{t+1}_i ← B^{t+1}_{i,y^{t+1}_i}
11:  Perform updates (10), (11).
12:  if ρ violates condition (18) then
13:    ρ^{t+1} ← min{ρ_max, τρ^t}.
14:  end if
15:  t ← t + 1
16: end while

Distributed optimization. Distributed optimization is considered one of the main advantages of ADMM in supervised learning [16, 6]. In our method, the (β, y) update requires the solution of only |L| · |V| (instead of |L|^{|V|} for the naive approach) independent and small-scale continuous minimization problems of the form (13) and one additional discrete problem (14). This suggests the distributed solution of the subproblems (13), for instance on a GPU. Subsequently, the optimization of the MRF (14) and the update of the consensus variable are carried out after gathering the solutions of the subproblems. Since (14) need not be solved to optimality, the MRF solver may be stopped early to speed up computation. This is particularly useful if a primal-dual algorithm for solving the LP relaxation is used [31, 15].
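For orientation, a compact Python skeleton of the iteration in Alg. 1 is sketched below. The MRF solver, a lookup-table routine in the spirit of (13) and the α-update are passed in as callables, and the descent check mirrors condition (16). This is a structural sketch under our own naming, not the authors' implementation.

```python
import numpy as np

def discrete_continuous_admm(K, alpha, lam, rho, y, build_table, solve_mrf,
                             update_alpha, lagrangian, rho_max, tau=1.003,
                             delta=0.0, max_iter=100):
    """Skeleton of Alg. 1: (beta, y) update via lookup table + MRF, then alpha and lambda."""
    for _ in range(max_iter):
        u, B = build_table(K, alpha, lam, rho)        # entries of (13), in parallel in practice
        y_new = solve_mrf(u)                          # (possibly suboptimal) solution of (14)
        beta_new = B[np.arange(len(y_new)), y_new]    # read off beta^{t+1} as in (15)
        beta_old = B[np.arange(len(y)), y]            # B^{t+1} indexed by the previous labeling
        # descent condition (16); fall back to the previous labeling if it is violated
        if lagrangian(alpha, beta_new, lam, y_new) - lagrangian(alpha, beta_old, lam, y) > -delta:
            y_new, beta_new = y, beta_old
        alpha = update_alpha(K, beta_new, lam, rho)   # consensus update (10), e.g. via CG
        lam = lam + rho * (K @ alpha - beta_new)      # dual ascent step (11)
        rho = min(rho_max, tau * rho)                 # increase rho until condition (18) holds
        y = y_new
    return alpha, lam, y
```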

Exploit duality. If the loss terms ℓ(y_i; ·) are convex and lower semicontinuous, then the independent subproblems (13) can be solved efficiently via duality as follows. For all loss functions we consider, it is convenient to solve the dual problem, as it scales linearly with the number of training samples (which is equal to one in our case). For the Crammer and Singer multiclass SVM loss [12], for instance, there exists an efficient variable fixing algorithm [26] for solving the dual problem. For the softmax loss the dual problem reduces to a one-dimensional nonlinear equation via the Lambert W function [36] and may be solved by performing a few iterations of Newton's or Halley's method. For the special case of the one-vs.-all hinge loss, (13) can be solved in closed form. In any case, each subproblem involves only a small number of instructions, which is important for a GPU-based implementation.
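As an illustration of the closed-form case (our own sketch; the one-vs.-all encoding with target sign s_j = +1 if j = y_i and s_j = −1 otherwise is an assumption and not spelled out in the text), each class coordinate of (13) reduces to the standard proximal operator of the hinge loss. Writing c := K_iα^t + λ_i^t/ρ and z := s_j c_j, the minimizer of max(0, 1 − s_jβ_{ij}) + (ρ/2)(β_{ij} − c_j)² is β_{ij} = s_j t^* with

$$t^* = \begin{cases} z, & z \ge 1,\\ 1, & 1 - 1/\rho \le z < 1,\\ z + 1/\rho, & z < 1 - 1/\rho. \end{cases}$$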

2.3. Consensus update

For a quadratic regularizer h(x) = νx², where ν is the regularization parameter, the update step (10) is equivalent to

$$\alpha^{t+1} = \operatorname*{argmin}_{\alpha}\ \nu\,\langle \alpha, K\alpha\rangle + \frac{\rho}{2}\,\big\|K\alpha - \beta^{t+1} + \lambda^t/\rho\big\|_F^2. \quad (17)$$

This is a quadratic problem that can be solved via a normal equation, using either a cached eigenvalue decomposition of the kernel matrix or an iterative algorithm such as conjugate gradient (CG). The latter is preferred for large-scale applications, as each CG iteration involves a kernel-matrix-vector multiplication Kv. For the linear kernel K = X^⊤X, this guarantees efficiency of our method, since K does not have to be stored explicitly. For general kernels such as the RBF kernel, a low-rank approximation to the kernel matrix K ≈ GG^⊤, for some G ∈ R^{|V|×l} with l ≪ |V|, can be obtained, for instance, via the Nyström method [43, 60] or random features [47]. Furthermore, in practice, often only a small number of conjugate gradient iterations are necessary.
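Setting the gradient of (17) to zero gives the normal equation (2νK + ρK^⊤K)α = K^⊤(ρβ^{t+1} − λ^t). The sketch below (our own illustrative code) solves it with a hand-rolled, matrix-free conjugate gradient that applies K only through matrix-vector products.

```python
import numpy as np

def consensus_update(K, beta, lam, rho, nu, iters=50, tol=1e-8):
    """Solve (2*nu*K + rho*K^T K) alpha = K^T (rho*beta - lam) by conjugate gradient."""
    apply_A = lambda V: 2.0 * nu * (K @ V) + rho * (K.T @ (K @ V))  # matrix-free operator
    rhs = K.T @ (rho * beta - lam)
    alpha = np.zeros_like(beta)       # cold start; warm-starting with the previous alpha also works
    R = rhs - apply_A(alpha)          # residual
    P = R.copy()
    rs_old = np.sum(R * R)
    for _ in range(iters):
        AP = apply_A(P)
        step = rs_old / np.sum(P * AP)
        alpha += step * P
        R -= step * AP
        rs_new = np.sum(R * R)
        if np.sqrt(rs_new) < tol:     # residual small enough: stop early
            break
        P = R + (rs_new / rs_old) * P
        rs_old = rs_new
    return alpha
```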

3. Convergence analysis

In this section, we provide a complete convergence analysis of the proposed algorithm. To this end, we make the following assumptions:

• The function f is L-smooth, m-semiconvex and lower-bounded, i.e. f is differentiable, ∇f is Lipschitz continuous with modulus L, and there exists m > 0 sufficiently large so that f + (m/2)‖·‖²_F is convex.

• For all y_i ∈ L, ℓ(y_i; ·) is lower-bounded.

• The kernel matrix K ∈ R^{|V|×|V|} is surjective, i.e. the smallest eigenvalue σ_min(K^⊤K) > 0 is positive.

• After finitely many iterations t, the penalty parameter ρ is sufficiently large and kept fixed such that

$$\frac{L^2}{\rho\,\sigma_{\min}(K^\top K)} + \frac{m - \rho\,\sigma_{\min}(K^\top K)}{2} < 0. \quad (18)$$

A small numerical check of this condition is sketched below.
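The helper below (ours, purely illustrative) evaluates the left-hand side of (18) for given L, m and ρ.

```python
import numpy as np

def satisfies_condition_18(K, L, m, rho):
    """Check L^2/(rho*sigma_min(K^T K)) + (m - rho*sigma_min(K^T K))/2 < 0."""
    sigma_min = np.linalg.eigvalsh(K.T @ K)[0]  # eigenvalues in ascending order; take the smallest
    lhs = L**2 / (rho * sigma_min) + (m - rho * sigma_min) / 2.0
    return lhs < 0.0
```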

When the MRF subproblem is solved to global optimality, convergence can be guaranteed by considering a pointwise minimum over exponentially many augmented Lagrangians and applying existing theory [37, 20]. For the general case, however, the theory needs to be extended. Our convergence proof borrows arguments from [37, 20], where the convergence of ADMM in the nonconvex setting is shown via a monotonic decrease of the augmented Lagrangian. In our case, for a sufficiently large penalty parameter ρ, we can achieve a monotonic decrease of the "discrete-continuous" augmented Lagrangian (8), even if the MRF subproblem (14) is not solved to global optimality. This allows us to stop exact MRF solvers early or to apply heuristic solvers if computing global optima is intractable.

For the complete proofs of all the theoretical results presented in this section, cf. Appendix A.

Lemma 1. Let K ∈ R^{|V|×|V|} be surjective and δ ≥ 0. For ρ meeting condition (18) we have that:

1. The "discrete-continuous" augmented Lagrangian (8) decreases monotonically with the iterates (α^t, β^t, λ^t, y^t):

$$\mathcal{L}_\rho(\alpha^{t+1}, \beta^{t+1}, \lambda^{t+1}, y^{t+1}) - \mathcal{L}_\rho(\alpha^t, \beta^t, \lambda^t, y^t) \le \left(\frac{L^2}{\rho\,\sigma_{\min}(K^\top K)} + \frac{m - \rho\,\sigma_{\min}(K^\top K)}{2}\right)\|\alpha^{t+1} - \alpha^t\|_F^2 - \delta\,[\![y^{t+1} \neq y^t]\!], \quad (19)$$

where [[·]] denotes the Iverson bracket.

2. The sequence {L_ρ(α^{t+1}, β^{t+1}, λ^{t+1}, y^{t+1})}_{t∈N} is lower bounded.

3. The sequence {L_ρ(α^{t+1}, β^{t+1}, λ^{t+1}, y^{t+1})}_{t∈N} converges.

We are now able to guarantee that feasibility is achieved in the limit. This is in contrast to a dual-decomposition approach [30, 62, 53] or a Gauss-Seidel quadratic penalty method (with finite penalty parameter ρ), used for instance in [51, 25], where a violation of the consensus constraint remains in the limit. Moreover, if δ > 0 is chosen strictly positive, then the discrete variable is guaranteed to converge, i.e. for T sufficiently large, we have y^{t+1} = y^t for all t > T.

Lemma 2. Let {(α^t, β^t, λ^t, y^t)}_{t∈N} be the iterates produced by Alg. 1. Then {(α^t, β^t, λ^t, y^t)}_{t∈N} is a bounded sequence. Furthermore, for t → ∞ the distance between two consecutive continuous iterates vanishes, and feasibility is achieved in the limit:

$$\|\alpha^{t+1} - \alpha^t\|_F \to 0, \quad (20)$$
$$\|\beta^{t+1} - \beta^t\|_F \to 0, \quad (21)$$
$$\|\lambda^{t+1} - \lambda^t\|_F \to 0, \quad (22)$$
$$\|K\alpha^{t+1} - \beta^{t+1}\|_F \to 0. \quad (23)$$

Finally, if δ > 0 is chosen strictly positive, then there exists some T ∈ N such that y^{t+1} = y^t for all t > T.

The limit points of our algorithm correspond to "discrete-continuous" critical points of the augmented Lagrangian.

Definition 1 ("Discrete-continuous" critical point). We call (α*, β*, λ*, y*) a "discrete-continuous" critical point of the "discrete-continuous" augmented Lagrangian (8) if it satisfies

$$0 \in \partial\big(\ell(y_i^*; \cdot)\big)(\beta_i^*) - \lambda_i^*, \quad \forall\, i \in \mathcal{V}, \quad (24)$$
$$0 \in \partial g(\alpha^*) + K^\top \lambda^*, \quad (25)$$
$$K\alpha^* = \beta^*, \quad (26)$$

for y* with E_C(y*) < ∞ for all C ∈ C. Here, ∂f(x) denotes the "limiting" subdifferential [48, Definition 8.3] of the function f at x with f(x) < ∞.

Proposition 1. Let δ ≥ 0. Then any limit point (α*, β*, λ*, y*) of the sequence {(α^t, β^t, λ^t, y^t)}_{t∈N} is a "discrete-continuous" critical point.

Finally, under convexity of f and ℓ(y_i; ·) for all y_i ∈ L and strictly positive δ > 0, we can guarantee that the sequence of iterates produced by Alg. 1 globally converges to a point (α*, β*, λ*, y*) with the following property: α* is the global optimum of the supervised learning problem w.r.t. the estimated training labels y*:

$$\alpha^* = \operatorname*{argmin}_{\alpha}\ \sum_{i \in \mathcal{V}} \ell(y_i^*; K_i\alpha) + f(\alpha). \quad (27)$$

Proposition 2. Let ℓ(y_i; ·) and g be proper, convex and lower semicontinuous and let δ > 0. Then the sequence {(α^t, β^t, λ^t, y^t)}_{t∈N} produced by Alg. 1 converges to a "discrete-continuous" critical point (α*, β*, λ*, y*) of (8) and α* solves problem (27) to global optimality.

Discussion of the assumptions. Note that in general the kernel matrix K is not surjective. However, for the strictly positive definite RBF kernel, K is strictly positive definite, so that convergence can be achieved for finite ρ. In order to enforce theoretical convergence for general kernels, we may add a small constant to the diagonal of the kernel matrix, K := K + γI, which alters the model only slightly. In fact, for the binary SVM, this change is equivalent to replacing the hinge loss with its square.

4. Experiments

In this section, we present the experimental results of our method on several transductive learning tasks. First, we compare our method to an SDP relaxation method for transductive multinomial logistic regression by [23]. Second, we use our model and solver for the tasks of video object segmentation as well as image segmentation with user interaction, showing improvements on the false positive rate of object pixels.

4.1. Comparison with SDP relaxation for transductive learning

In this experiment, we consider the standard SSL benchmark [9] for a comparison with the SDP relaxation method for transductive multinomial logistic regression by [23]. The benchmark is a collection of several datasets with varying feature dimensions and numbers of classes. Each dataset is provided with 12 splits into l = 10 or l = 100 labeled and N − l unlabeled samples. We introduce additional unary energies E_C with |C| = 1 for all the labeled examples to constrain their labels to be fixed during optimization. While [23] incorporates an entropy prior on the labeling which favors an equal balance distribution, we introduce a higher-order potential E_C, with C = V, that restricts the solution to deviate by at most 10 percent from the equal balance distribution. We solve the LP relaxation of the higher-order MRF subproblem (14) with the dual-simplex method and round the solution. The baseline results are computed with a MATLAB implementation provided by the authors. For these experiments, we use the softmax loss and set the regularization parameter ν = 0.05 for the linear kernel. For the RBF kernel we manually chose the variance parameter σ = 0.5477 and the regularization parameter ν = 0.0025. We chose the initial penalty parameter ρ^0 = 0.001 and τ = 1.003. All values are averaged over 12 different splits. The evaluation in Tab. 1 suggests that our method performs better for a standard hyperparameter setting except for three out of 20 settings. Moreover, it produces more consistent results, i.e. lower variances over the splits, which suggests that our method is more robust towards noise and poorly labeled data.

Table 1: Comparison with the method of [23] on the SSL benchmark [9]. Reported are the average label accuracy (in %) and variance over the splits. Our evaluation suggests that our method performs better for a standard hyperparameter setting except for three out of 20 settings. Moreover, it produces more consistent results, i.e. lower variances over the splits.

Dataset       | Linear (SDP)  | Linear (Ours) | RBF (SDP)     | RBF (Ours)
Digit1, 10l   | 69.27±27.56   | 82.20±4.54    | 53.93±9.43    | 78.18±8.43
USPS, 10l     | 57.72±13.73   | 64.58±3.37    | 40.10±11.67   | 48.19±5.84
BCI, 10l      | 50.44±3.16    | 50.62±2.08    | 50.00±3.06    | 51.67±0.44
g241c, 10l    | 49.88±38.92   | 55.42±3.95    | 62.33±36.84   | 89.98±0.32
g241n, 10l    | 52.77±34.37   | 57.61±4.44    | 50.13±0.53    | 51.13±0.13
Digit1, 100l  | 75.74±29.73   | 85.60±2.91    | 88.65±0.49    | 87.61±3.44
USPS, 100l    | 63.44±9.97    | 72.14±0.84    | 39.83±12.63   | 56.54±3.31
BCI, 100l     | 60.58±6.87    | 65.23±1.25    | 64.19±1.23    | 62.62±1.00
g241c, 100l   | 64.92±17.47   | 86.31±0.91    | 85.63±0.76    | 89.34±1.07
g241n, 100l   | 54.14±17.13   | 54.11±0.64    | 52.23±1.61    | 53.98±0.38

Figure 2 (rows: [7], Inductive, Ours; columns: frames 5, 15, 60, 68, 84): Exemplary results for video object segmentation on the DAVIS benchmark [46]. It can be seen that both the inductive MRF inference approach and OSVOS produce a large number of false positive object pixels for the frames where the object is occluded.

4.2. Video object segmentation

In this experiment, we evaluated our method on video object segmentation. Here, the task is to segment an object throughout a video, given its mask in the first frame. This problem has been successfully approached by [7], using end-to-end deep learning with fully convolutional neural networks. At test time, their classifier is fine-tuned on the appearance of the object and the background in the first frame and predicts the object pixels of individual later frames. However, this method struggles with drastic appearance changes of the object which have not been learned in advance. These include pose changes, sharp lighting and background changes, or severe occlusions, as shown in Fig. 2.

We propose to use a transductive approach instead. More precisely, we use the pre-trained (not fine-tuned) OSVOS parent network [7] as a deep feature extractor and an MRF model with a variable classifier in the form of (2). We use a simple linear kernel SVM in our model, as the extracted deep features are almost linearly separable. Further, we introduce unary indicator energies E_i to fix the labels of the user-annotated pixels in the first frame and pairwise energies E_ij for adjacent pixels in any frame to favor spatially smooth solutions. Similar to [7], we do not use any temporal consistency terms. To reduce the number of examples, we apply our method on a superpixel level: we extract 6000 superpixels [1] for each frame and apply average pooling over the superpixels. We compare the proposed transductive approach to OSVOS [7] and the classical (inductive) MRF inference approach (where the classifier is learned with the first frame only) on the DAVIS benchmark [46]. For both the inductive and the transductive approach, the linear kernel SVM model, the higher-order energies E_C and the extracted superpixels are the same. The results are shown in Fig. 2. It can be seen that [7] works well as long as the appearance of the object and background are sufficiently similar to the first frame (first column). In frames 60 to 68, where the object is occluded, both the inductive MRF inference approach and OSVOS produce a large number of false positive object pixels. In this experiment the intersection-over-union scores (the higher the better) are 0.7087 for our method, 0.6452 for OSVOS and 0.5063 for the inductive approach. Similarly, in the car-shadow sequence, OSVOS and the inductive approach additionally mask the other car and the motorbike in frame 39 (cf. Fig. 3). In contrast, our method masks the correct car only. Here, the intersection-over-union scores are 0.9262 for OSVOS, 0.9196 for our method and 0.8844 for the inductive approach.

Figure 3 (rows: [7], Inductive, Ours; columns: frames 3 and 39): Further results for video object segmentation on the car-shadow sequence from the DAVIS benchmark [46].
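A minimal sketch (entirely our own illustration) of the superpixel average pooling step, assuming per-pixel features and an integer superpixel label map, e.g. from a SLIC-style segmenter:

```python
import numpy as np

def superpixel_average_pooling(features, segments):
    """Average per-pixel features over superpixels.

    features: H x W x D array of deep features, segments: H x W integer superpixel ids.
    Returns an S x D array with one pooled feature vector per superpixel."""
    H, W, D = features.shape
    ids = segments.reshape(-1)
    flat = features.reshape(-1, D)
    num_segments = ids.max() + 1
    pooled = np.zeros((num_segments, D))
    np.add.at(pooled, ids, flat)                       # sum features per superpixel
    counts = np.bincount(ids, minlength=num_segments)  # number of pixels per superpixel
    return pooled / counts[:, None]
```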

4.3. Image segmentation with user interaction

We evaluated our method on the task of interactive foreground-background segmentation with deep features. Like in the previous experiment, we used OSVOS as a deep feature extractor. On this task we compare our method to the Chan-Vese kernel k-means approach proposed in [55, 56] as a baseline method. Since the features are almost linearly separable, we use a simple linear kernel for both our model and the baseline model. As shown in Fig. 4, the k-means approach often fails to find a good cluster-center assignment, despite strong supervision (provided in the form of user scribbles) and the richness of the features. This is due to the fact that deep high-dimensional features are in general not k-means friendly [61], which means further preprocessing or a k-means-suited kernel would be required. In contrast, our method provides a reasonable result in all cases, without the need for feature preprocessing or kernel-parameter tuning.

Figure 4 (columns: Annotation, [55, 56], Ours): Exemplary results for interactive binary image segmentation with deep features. Left: Input images along with user scribbles in red for foreground and blue for background. Middle: Segmentation results (red masks) obtained from k-means. Right: Segmentation results (red masks) obtained with the proposed method.

5. Conclusion

We considered the joint solution of MAP inference in MRFs and parameter learning, which can be viewed as a transductive inference problem. To solve this task, we proposed a novel algorithm that jointly optimizes over the discrete label variables and the continuous model parameters. The proposed method is related to classical ADMM from continuous optimization and admits a convergence proof under suitable assumptions, even though the objective function is discrete-continuous and nonconvex. Our algorithm makes use of a decoupling of the problem into purely discrete and purely continuous subproblems and can be implemented in a distributed fashion. We evaluated our approach in several experiments including video object segmentation and interactive image segmentation. Our results suggest that the proposed optimization method performs favorably compared to alternating optimization (as in k-means) and convex relaxations. In particular, this indicates that the presented method also serves as an alternative approach to optimization problems arising in semi-supervised or transductive learning, e.g., in the case of SVMs. Furthermore, the visual results show that the transductive inference model is able to reduce the hallucination of false object pixels in image and video segmentation tasks.

A. Theoretical Results

In the remainder of this section we make use of the following properties of L-smooth functions (known as the descent lemma) and of m-semiconvexity [2, 41], which are standard results and therefore stated without proof.

Lemma 3. Let f : R^{k×n} → R be continuously differentiable and let x, y ∈ R^{k×n}.

• If f is L-smooth (meaning that ∇f is Lipschitz continuous with modulus L), then

$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|x - y\|_F^2. \quad (28)$$

• If f is m-semiconvex (meaning that f + (m/2)‖·‖²_F is convex), then

$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle - \frac{m}{2}\|x - y\|_F^2. \quad (29)$$

For showing convergence we make the following assumptions on our problem:

• The function f is L-smooth, m-semiconvex and lower-bounded.

• For all y_i ∈ L, ℓ(y_i; ·) is lower-bounded.

• The kernel matrix K ∈ R^{|V|×|V|} is surjective, i.e. the smallest eigenvalue σ_min(K^⊤K) > 0 is positive.

• After finitely many iterations, the penalty parameter ρ is sufficiently large and kept fixed such that condition (18) holds.

A.1. Proof of Lemma 1

In [37, 20], to show convergence of nonconvex ADMM, a monotonic decrease of the augmented Lagrangian is guaranteed. Following a similar line of argument, we show that the "discrete-continuous" augmented Lagrangian (8) monotonically decreases with the iterates. Whereas its value decreases with the primal and discrete variable updates, the dual update yields a positive contribution to the overall estimate. Yet, for ρ > 0 chosen large enough, K surjective and f being L-smooth, this ascent can be dominated by a sufficiently large descent in the primal block α, which is updated last.

We need the following notation. Let B^{t+1}_{:,y^t} denote the matrix whose i-th row is given by B^{t+1}_{i,y_i^t}. In particular, by definition of the β update, this means β^{t+1} = B^{t+1}_{:,y^{t+1}}.

Proof. We rewrite the difference of two consecutive "discrete-continuous" augmented Lagrangians as

$$\begin{aligned}
&\mathcal{L}_\rho(\alpha^{t+1}, \beta^{t+1}, \lambda^{t+1}, y^{t+1}) - \mathcal{L}_\rho(\alpha^t, B^t_{:,y^t}, \lambda^t, y^t) \\
&= \mathcal{L}_\rho(\alpha^t, B^{t+1}_{:,y^t}, \lambda^t, y^t) - \mathcal{L}_\rho(\alpha^t, B^t_{:,y^t}, \lambda^t, y^t) \\
&\quad + \mathcal{L}_\rho(\alpha^t, \beta^{t+1}, \lambda^t, y^{t+1}) - \mathcal{L}_\rho(\alpha^t, B^{t+1}_{:,y^t}, \lambda^t, y^t) \\
&\quad + \mathcal{L}_\rho(\alpha^{t+1}, \beta^{t+1}, \lambda^t, y^{t+1}) - \mathcal{L}_\rho(\alpha^t, \beta^{t+1}, \lambda^t, y^{t+1}) \\
&\quad + \mathcal{L}_\rho(\alpha^{t+1}, \beta^{t+1}, \lambda^{t+1}, y^{t+1}) - \mathcal{L}_\rho(\alpha^{t+1}, \beta^{t+1}, \lambda^t, y^{t+1}).
\end{aligned}$$

We now bound each of the four differences separately. Since the augmented Lagrangian is separable in β and we solve, for any y_i, the minimization problem in β_{y_i} to global optimality, we have that

$$\mathcal{L}_\rho(\alpha^t, B^{t+1}_{:,y^t}, \lambda^t, y^t) - \mathcal{L}_\rho(\alpha^t, B^t_{:,y^t}, \lambda^t, y^t) \le 0. \quad (30)$$

A similar estimate holds for the discrete variable y^{t+1} due to the update in the algorithm:

$$\mathcal{L}_\rho(\alpha^t, \beta^{t+1}, \lambda^t, y^{t+1}) - \mathcal{L}_\rho(\alpha^t, B^{t+1}_{:,y^t}, \lambda^t, y^t) \le -\delta\,[\![y^{t+1} \neq y^t]\!]. \quad (31)$$

Now we devise a bound for the third term, given by

$$\begin{aligned}
&\mathcal{L}_\rho(\alpha^{t+1}, \beta^{t+1}, \lambda^t, y^{t+1}) - \mathcal{L}_\rho(\alpha^t, \beta^{t+1}, \lambda^t, y^{t+1}) \\
&= f(\alpha^{t+1}) - f(\alpha^t) + \langle K\alpha^{t+1} - K\alpha^t, \lambda^t\rangle + \frac{\rho}{2}\|K\alpha^{t+1} - \beta^{t+1}\|_F^2 - \frac{\rho}{2}\|K\alpha^t - \beta^{t+1}\|_F^2.
\end{aligned}$$

We apply the identity ‖a + c‖²_F − ‖b + c‖²_F = −‖b − a‖²_F + 2⟨a + c, a − b⟩ with a := Kα^{t+1}, b := Kα^t and c := −β^{t+1} and obtain

$$f(\alpha^{t+1}) - f(\alpha^t) - \frac{\rho}{2}\|K\alpha^{t+1} - K\alpha^t\|_F^2 + \big\langle K\alpha^{t+1} - K\alpha^t,\ \lambda^t + \rho(K\alpha^{t+1} - \beta^{t+1})\big\rangle.$$

The optimality condition for the update of the variable α is given as

$$0 = \nabla f(\alpha^{t+1}) + K^\top\big(\rho(K\alpha^{t+1} - \beta^{t+1}) + \lambda^t\big). \quad (32)$$

We replace the term ⟨Kα^{t+1} − Kα^t, λ^t + ρ(Kα^{t+1} − β^{t+1})⟩ = ⟨α^{t+1} − α^t, K^⊤(λ^t + ρ(Kα^{t+1} − β^{t+1}))⟩ and obtain from the optimality condition of the α update that

$$\begin{aligned}
&f(\alpha^{t+1}) - f(\alpha^t) - \frac{\rho}{2}\|K\alpha^{t+1} - K\alpha^t\|_F^2 + \langle \alpha^{t+1} - \alpha^t, -\nabla f(\alpha^{t+1})\rangle \\
&\le f(\alpha^{t+1}) - f(\alpha^t) - \frac{\rho\,\sigma_{\min}(K^\top K)}{2}\|\alpha^{t+1} - \alpha^t\|_F^2 + \langle \alpha^t - \alpha^{t+1}, \nabla f(\alpha^{t+1})\rangle.
\end{aligned}$$

Moreover, due to the m-semiconvexity of f we know that

$$f(\alpha^t) + \frac{m}{2}\|\alpha^{t+1} - \alpha^t\|_F^2 \ge f(\alpha^{t+1}) + \langle \nabla f(\alpha^{t+1}), \alpha^t - \alpha^{t+1}\rangle.$$

Overall, we can bound

$$\mathcal{L}_\rho(\alpha^{t+1}, \beta^{t+1}, \lambda^t, y^{t+1}) - \mathcal{L}_\rho(\alpha^t, \beta^{t+1}, \lambda^t, y^{t+1}) \le \frac{m - \rho\,\sigma_{\min}(K^\top K)}{2}\,\|\alpha^{t+1} - \alpha^t\|_F^2. \quad (33)$$

Since by assumption K is surjective, the smallest eigenvalue of K^⊤K is greater than zero: σ_min(K^⊤K) > 0. This means there exists some ρ > 0 large enough so that (m − ρσ_min(K^⊤K))/2 < 0.

Finally, we estimate the last term:

$$\mathcal{L}_\rho(\alpha^{t+1}, \beta^{t+1}, \lambda^{t+1}, y^{t+1}) - \mathcal{L}_\rho(\alpha^{t+1}, \beta^{t+1}, \lambda^t, y^{t+1}) = \langle K\alpha^{t+1} - \beta^{t+1}, \lambda^{t+1} - \lambda^t\rangle = \frac{1}{\rho}\|\lambda^{t+1} - \lambda^t\|_F^2.$$

From the update of the dual variable and the optimality condition for the α update (32) it follows that

$$-\nabla f(\alpha^{t+1}) = K^\top \lambda^{t+1}. \quad (34)$$

Further, since f is L-smooth we know that

$$\|\nabla f(\alpha^{t+1}) - \nabla f(\alpha^t)\|_F^2 \le L^2\|\alpha^{t+1} - \alpha^t\|_F^2. \quad (35)$$

Overall, we obtain

$$\sigma_{\min}(K^\top K)\,\|\lambda^{t+1} - \lambda^t\|_F^2 \le \|K^\top\lambda^{t+1} - K^\top\lambda^t\|_F^2 \le L^2\|\alpha^{t+1} - \alpha^t\|_F^2.$$

This gives the bound for the last term:

$$\mathcal{L}_\rho(\alpha^{t+1}, \beta^{t+1}, \lambda^{t+1}, y^{t+1}) - \mathcal{L}_\rho(\alpha^{t+1}, \beta^{t+1}, \lambda^t, y^{t+1}) \le \frac{L^2}{\rho\,\sigma_{\min}(K^\top K)}\|\alpha^{t+1} - \alpha^t\|_F^2.$$

Then, by merging the four estimates we obtain the desired result.

We proceed by showing the lower boundedness of {L_ρ(α^{t+1}, β^{t+1}, λ^{t+1}, y^{t+1})}_{t∈N}. Since K is surjective, there exists α' such that Kα' = β^{t+1} and it holds that

$$-\frac{L}{2}\|\alpha^{t+1} - \alpha'\|_F^2 \ge -\frac{L}{2\,\sigma_{\min}(K^\top K)}\|K\alpha^{t+1} - K\alpha'\|_F^2.$$

Let ρ > L/σ_min(K^⊤K). Then, since f is L-smooth, we have

$$\begin{aligned}
&f(\alpha^{t+1}) + \langle \lambda^{t+1}, K\alpha^{t+1} - \beta^{t+1}\rangle + \frac{\rho}{2}\|K\alpha^{t+1} - \beta^{t+1}\|_F^2 \\
&= f(\alpha^{t+1}) + \langle K^\top\lambda^{t+1}, \alpha^{t+1} - \alpha'\rangle + \frac{\rho}{2}\|K\alpha^{t+1} - \beta^{t+1}\|_F^2 \\
&= f(\alpha^{t+1}) + \langle \nabla f(\alpha^{t+1}), \alpha' - \alpha^{t+1}\rangle + \frac{\rho}{2}\|K\alpha^{t+1} - \beta^{t+1}\|_F^2 \\
&\ge f(\alpha') - \frac{L}{2}\|\alpha^{t+1} - \alpha'\|_F^2 + \frac{\rho}{2}\|K\alpha^{t+1} - \beta^{t+1}\|_F^2 \\
&\ge f(\alpha') + \frac{\rho\,\sigma_{\min}(K^\top K) - L}{2\,\sigma_{\min}(K^\top K)}\,\|K\alpha^{t+1} - \beta^{t+1}\|_F^2 \ \ge\ f(\alpha').
\end{aligned}$$

Overall, since by assumption f and ℓ(y_i; ·) are bounded from below (for all y_i ∈ L), this means that {L_ρ(α^{t+1}, β^{t+1}, λ^{t+1}, y^{t+1})}_{t∈N} is bounded from below.

Since {L_ρ(α^{t+1}, β^{t+1}, λ^{t+1}, y^{t+1})}_{t∈N} is monotonically decreasing and bounded from below, it converges. This completes the proof.

A.2. Proof of Lemma 2

Proof. We sum over the estimate (19), which yields

$$-\infty < \lim_{t\to\infty} \mathcal{L}_\rho(\alpha^t, \beta^t, \lambda^t, y^t) - \mathcal{L}_\rho(\alpha^1, \beta^1, \lambda^1, y^1) \le \sum_{t=1}^{\infty}\left(\frac{L^2}{\rho\,\sigma_{\min}(K^\top K)} + \frac{m - \rho\,\sigma_{\min}(K^\top K)}{2}\right)\|\alpha^{t+1} - \alpha^t\|_F^2 - \sum_{t=1}^{\infty}\delta\,[\![y^{t+1} \neq y^t]\!].$$

Due to the lower boundedness, the infinite sums have to converge. This yields that ‖α^{t+1} − α^t‖_F → 0. Since 0 ≤ σ_min(K^⊤K)‖λ^{t+1} − λ^t‖²_F ≤ L²‖α^{t+1} − α^t‖²_F and σ_min(K^⊤K) > 0, also ‖λ^{t+1} − λ^t‖_F → 0. Since, due to the dual update, λ^{t+1} − λ^t = ρ(Kα^{t+1} − β^{t+1}), also ‖Kα^{t+1} − β^{t+1}‖_F → 0. Moreover, it holds that

$$\|\beta^{t+1} - \beta^t\|_F \le \|\beta^{t+1} - K\alpha^{t+1}\|_F + \|K\alpha^{t+1} - K\alpha^t\|_F + \|K\alpha^t - \beta^t\|_F \le \|K\alpha^{t+1} - \beta^{t+1}\|_F + \|K\|\,\|\alpha^{t+1} - \alpha^t\|_F + \|K\alpha^t - \beta^t\|_F \to 0$$

as t → ∞.

Finally, suppose that there exists an infinite subsequence {t_j}_{j=1}^∞ ⊂ {t}_{t=1}^∞ such that y^{t_j+1} ≠ y^{t_j}. The last sum then rewrites as

$$\sum_{t=1}^{\infty} \delta\,[\![y^{t+1} \neq y^t]\!] = \sum_{j=1}^{\infty} \delta,$$

which diverges for δ > 0. This, however, contradicts the lower boundedness of L_ρ(α^t, β^t, λ^t, y^t).

A.3. Proof of Proposition 1

Proof. Let (α*, β*, λ*, y*) be a limit point of {(α^t, β^t, λ^t, y^t)}_{t∈N}, and let {t_j}_{j=1}^∞ ⊂ {t}_{t=1}^∞ be the corresponding subsequence of indices. The optimality conditions for the updates of the variables β_i (for any i) and α are given as

$$0 \in \partial \ell(y_i^{t_j}; \beta_i^{t_j}) - \rho\big(K_i\alpha^{t_j-1} - \beta_i^{t_j} + \lambda_i^{t_j-1}/\rho\big), \quad (36)$$
$$0 = \nabla f(\alpha^{t_j}) + \rho K^\top\big(K\alpha^{t_j} - \beta^{t_j} + \lambda^{t_j-1}/\rho\big). \quad (37)$$

Passing to the limit j → ∞ and applying Lemma 2, we arrive at conditions (24)–(26). This completes the proof.

A.4. Proof of Proposition 2

Proof. Let δ > 0. Then, due to Lemma 2, the discrete variable converges, i.e. there is T > 0 so that for all t > T

$$y^{t+1} = y^t. \quad (38)$$

Then, since f and ℓ(y_i; ·) are convex, proper and lower semicontinuous, after finitely many iterations our scheme Alg. 1 reduces to convex ADMM and the global convergence is a direct consequence of [17, 14, 6]. This completes the proof.

B. Additional Experimental Results

As a proof of concept, we conduct a synthetic experiment with data sampled from 2D moon-shaped distributions (600 samples, 4 classes, 150 per class). We sample 25 (possibly overlapping) cliques C ⊂ V of cardinality 25 from the set of examples. The synthetic labeling prior in this experiment is given in terms of constraints that balance the label assignment within each clique. More precisely, it restricts the maximal deviation of the determined labeling from the true labeling to a given bound within each clique C ∈ C. Mathematically, the higher-order energies E_C in the MRF are defined so that E_C(y_C) = 0 if L_C^j ≤ |{i ∈ C : y_i = j}| ≤ U_C^j, and ∞ otherwise. The bounds L_C^j and U_C^j are fixed and chosen a priori, such that the number of samples i ∈ C assigned to class j deviates by at most 3 from the true number within clique C. This means that we do not provide any exact labels to the algorithm.
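To make the constraint concrete, a small sketch (our own) of this cardinality-interval clique energy is given below; it returns 0 when every class count inside the clique lies within its bounds and ∞ otherwise.

```python
import numpy as np

def clique_balance_energy(y, clique, lower, upper, num_classes):
    """E_C(y_C) = 0 if L_C^j <= |{i in C : y_i = j}| <= U_C^j for all j, else infinity."""
    counts = np.bincount(np.asarray(y)[clique], minlength=num_classes)
    ok = np.all((counts >= lower) & (counts <= upper))
    return 0.0 if ok else np.inf

# Example: a clique of 25 samples, 4 classes, counts allowed to deviate by at most 3
# from hypothetical true per-class counts within the clique.
true_counts = np.array([7, 6, 6, 6])
lower, upper = np.maximum(true_counts - 3, 0), true_counts + 3
```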

The overall task is to infer the correct labels from both the distribution of the examples in the feature space and the combinatorial prior encoded within the higher-order energies. Within the algorithm, we solve the LP relaxation of the higher-order MRF subproblem (14) with the dual-simplex method and threshold the solution. On this task, we compare our method to constrained kernel k-means and plain discrete-continuous coordinate descent on (5) with an RBF kernel and an SVM loss (see Figure 5). Like [55, 56, 59, 3], we apply k-means in the RBF kernel space and solve the E-step w.r.t. (14). It can be seen that both discrete-continuous coordinate descent on the SVM-based model (Figure 5c) and constrained kernel k-means (Figure 5b) get stuck in poor local minima. In contrast, our method is able to infer the correct labels of most examples and finds a reasonable classifier, even for a trivial initialization of the parameters, cf. Figure 5d. The label errors are 66.6% for constrained RBF kernel k-means, 68.5% for coordinate descent and 2.5% for our method.

Figure 5, panels (a)–(d), from left to right: Ground truth, RBF kernel k-means, coordinate descent, proposed method. The label inference errors are 66.6% for constrained RBF kernel k-means, 68.5% for coordinate descent and 2.5% for our method.

References

[1] R. Achanta and S. Süsstrunk. Superpixels and polygons using simple non-iterative clustering. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[2] M. Artina, M. Fornasier, and F. Solombrino. Linearly constrained nonsmooth and nonconvex minimization. SIAM Journal on Optimization, 23(3):1904–1937, 2013.
[3] S. Basu, I. Davidson, and K. Wagstaff. Constrained Clustering: Advances in Algorithms, Theory, and Applications. Chapman & Hall/CRC, 1st edition, 2008.
[4] K. P. Bennett and A. Demiriz. Semi-supervised support vector machines. In Advances in Neural Information Processing Systems, NIPS, pages 368–374, 1999.
[5] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, 1987.
[6] S. P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
[7] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[8] T. F. Chan and L. A. Vese. Active contours without edges. IEEE Transactions on Image Processing, 10(2):266–277, 2001.
[9] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.
[10] R. Chartrand and B. Wohlberg. A nonconvex ADMM algorithm for group sparsity with sparse groups. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6009–6013, May 2013.
[11] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. CoRR, abs/1412.7062, 2014.
[12] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, 2001.
[13] A. Delong, A. Osokin, H. N. Isack, and Y. Boykov. Fast approximate energy minimization with label costs. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2173–2180, June 2010.
[14] J. Eckstein and D. P. Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55:293–318, 1992.
[15] A. Fix, C. Wang, and R. Zabih. A primal-dual algorithm for higher-order multilabel Markov random fields. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1138–1145, 2014.
[16] P. A. Forero, A. Cano, and G. B. Giannakis. Consensus-based distributed support vector machines. Journal of Machine Learning Research, 11:1663–1707, 2010.
[17] D. Gabay. Applications of the method of multipliers to variational inequalities. Studies in Mathematics and Its Applications, 15:299–331, 1983.
[18] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
[19] R. Glowinski and A. Marrocco. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. Revue Française d'Automatique, Informatique, Recherche Opérationnelle. Analyse Numérique, 9(2):41–76, 1975.
[20] M. Hong, Z.-Q. Luo, and M. Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.
[21] K. Huang and N. D. Sidiropoulos. Consensus-ADMM for general quadratically constrained quadratic programming. IEEE Transactions on Signal Processing, 64(20):5297–5310, 2016.
[22] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning, ICML, pages 200–209, 1999.
[23] A. Joulin and F. R. Bach. A convex relaxation for weakly supervised classifiers. In Proceedings of the 29th International Conference on Machine Learning, ICML, 2012.
[24] J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, J. Lellmann, N. Komodakis, and C. Rother. A comparative study of modern inference techniques for discrete energy minimization problems. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[25] S. Kim, D. Min, S. Lin, and K. Sohn. DCTM: Discrete-continuous transformation matching for semantic flow. In IEEE International Conference on Computer Vision, ICCV, Oct 2017.
[26] K. C. Kiwiel. Variable fixing algorithms for the continuous quadratic knapsack problem. Journal of Optimization Theory and Applications, 136(3):445–458, 2008.
[27] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[28] V. Kolmogorov. A new look at reweighted message passing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(5):919–930, 2015.
[29] V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):147–159, 2004.
[30] N. Komodakis, N. Paragios, and G. Tziritas. MRF optimization via dual decomposition: Message-passing revisited. In IEEE 11th International Conference on Computer Vision, ICCV, pages 1–8. IEEE Computer Society, 2007.
[31] N. Komodakis, G. Tziritas, and N. Paragios. Fast, approximately optimal solutions for single and dynamic MRFs. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
[32] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, NIPS, 2012.
[33] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associative hierarchical random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1056–1077, 2014.
[34] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, ICML, pages 282–289, 2001.
[35] R. Lai and S. Osher. A splitting method for orthogonality constrained problems. Journal of Scientific Computing, 58(2):431–449, 2014.
[36] M. Lapin, M. Hein, and B. Schiele. Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2017.
[37] G. Li and T. K. Pong. Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 25(4):2434–2460, 2015.
[38] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[39] A. F. T. Martins, M. A. T. Figueiredo, P. M. Q. Aguiar, N. A. Smith, and E. P. Xing. An augmented Lagrangian approach to constrained MAP inference. In Proceedings of the 28th International Conference on Machine Learning, ICML, pages 169–176, 2011.
[40] O. Miksik, V. Vineet, P. Perez, and P. H. S. Torr. Distributed non-convex ADMM-based inference in large-scale random fields. In British Machine Vision Conference, BMVC, 2014.
[41] T. Möllenhoff, E. Strekalovskiy, M. Möller, and D. Cremers. The primal-dual hybrid gradient method for semiconvex splittings. SIAM Journal on Imaging Sciences, 8(2):827–857, 2015.
[42] S. Nowozin and C. H. Lampert. Structured prediction and learning in computer vision. Foundations and Trends in Computer Graphics and Vision, 6(3–4), 2010.
[43] E. J. Nyström. Über die praktische Auflösung von Integralgleichungen mit Anwendungen auf Randwertaufgaben. Acta Mathematica, 54(1):185–204, 1930.
[44] V. Ozolins, R. Lai, R. Caflisch, and S. Osher. Compressed modes for variational problems in mathematics and physics. Proceedings of the National Academy of Sciences, 110(46):18368–18373, 2013.
[45] R. R. Paulsen, J. A. Bærentzen, and R. Larsen. Markov random field surface reconstruction. IEEE Transactions on Visualization and Computer Graphics, pages 636–646, 2010.
[46] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. J. V. Gool, M. H. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016.
[47] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, NIPS, pages 1177–1184, 2008.
[48] R. Rockafellar and R.-B. Wets. Variational Analysis. Springer, 1998.
[49] C. Rother, V. Kolmogorov, and A. Blake. "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3):309–314, 2004.
[50] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In 14th Annual Conference on Computational Learning Theory, COLT, volume 2111, pages 416–426, 2001.
[51] F. Steinbrücker, T. Pock, and D. Cremers. Large displacement optical flow computation without warping. In IEEE 12th International Conference on Computer Vision, ICCV, pages 1609–1614, 2009.
[52] M. Storath, A. Weinmann, and L. Demaret. Jump-sparse and sparse recovery using Potts functionals. IEEE Transactions on Signal Processing, 62(14):3654–3666, 2014.
[53] P. Swoboda and B. Andres. A message passing algorithm for the minimum cost multicut problem. In IEEE Conference on Computer Vision and Pattern Recognition, July 2017.
[54] R. Takapoui, N. Moehle, S. Boyd, and A. Bemporad. A simple effective heuristic for embedded mixed-integer quadratic programming. International Journal of Control, pages 1–23, 2017.
[55] M. Tang, I. B. Ayed, D. Marin, and Y. Boykov. Secrets of GrabCut and kernel k-means. In IEEE International Conference on Computer Vision, ICCV, pages 1555–1563, 2015.
[56] M. Tang, D. Marin, I. B. Ayed, and Y. Boykov. Normalized cut meets MRF. In European Conference on Computer Vision, ECCV, pages 748–765, 2016.
[57] V. Trajkovska, P. Swoboda, F. Åström, and S. Petra. Graphical model parameter learning by inverse linear programming. In Scale Space and Variational Methods in Computer Vision - 6th International Conference, SSVM, pages 323–334, 2017.
[58] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., 1995.
[59] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with background knowledge. In Proceedings of the 18th International Conference on Machine Learning, ICML, pages 577–584. Morgan Kaufmann, 2001.
[60] C. K. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems, NIPS, pages 682–688, 2001.
[61] B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning, ICML, pages 3861–3870, 2017.
[62] C. Zach. Dual decomposition for joint discrete-continuous optimization. In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, AISTATS, volume 31 of Proceedings of Machine Learning Research, pages 632–640, 2013.
[63] K. Zhang, I. W. Tsang, and J. T. Kwok. Maximum margin clustering made practical. In Proceedings of the 24th International Conference on Machine Learning, ICML, volume 227, pages 1119–1126, 2007.
[64] S. C. Zhu and A. L. Yuille. Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9):884–900, 1996.

