Pattern Recognition 74 (2018) 122–133

Learning to refine depth for robust stereo estimation

Feiyang Cheng a,b, Xuming He b,1, Hong Zhang a,*

a Image Research Center, Beihang University, Xueyuan Rd., Haidian Dist., Beijing, 100191, China
b Computer Vision Group, National ICT Australia, Locked Bag 8001, Canberra, 2601, Australia

* Corresponding author at: 37 Xueyuan Road, Haidian District, Beijing, China. E-mail addresses: [email protected] (F. Cheng), [email protected] (X. He), [email protected], [email protected] (H. Zhang).
1 X. He was also with the Australian National University and is currently at ShanghaiTech University.

Article history: Received 11 March 2017; revised 18 July 2017; accepted 26 July 2017; available online 31 August 2017.

Keywords: Stereo matching; Confidence measure; Convolutional neural network

http://dx.doi.org/10.1016/j.patcog.2017.07.027

Abstract

Traditional depth estimation from stereo images is usually formulated as a patch-matching problem, which requires post-processing stages to impose smoothness and handle depth discontinuities and occlusions. While recent deep network approaches directly learn a regressor for the entire disparity map, they still suffer from large errors near the depth discontinuities. In this paper, we propose a novel method to refine the disparity maps generated by deep regression networks. Instead of relying on ad hoc post-processing, we learn a unified deep network model that predicts a confidence map and the disparity gradients from the learned feature representation in regression networks. We integrate the initial disparity estimation, the confidence map and the disparity gradients into a continuous Markov Random Field (MRF) for depth refinement, which is capable of representing rich surface structures. Our disparity MRF model can be solved via efficient global optimization in closed form. We evaluate our approach on both synthetic and real-world datasets, and the results show that it achieves state-of-the-art performance and produces more structure-preserving disparity maps with smaller errors in the neighborhood of depth boundaries.

© 2017 Elsevier Ltd. All rights reserved.

1. Introduction

Inferring depth from images is a fundamental problem in computer vision [1], vital for a large number of real-world applications such as 3D scene reconstruction, robotics and autonomous driving. Despite the progress in predicting depth from a single image [2,3] or using active sensors [4,5], stereo image matching still remains one of the most effective strategies for depth estimation due to its efficiency and broad range of applicable settings [6–9]. The traditional stereo matching paradigm typically comprises four steps: 1) computing the matching cost, 2) cost aggregation, 3) global optimization and 4) disparity refinement [10,11]. In particular, a variety of ad hoc post-processing methods have been studied to handle occlusion and uncertainty in the matching stage [9,10,12,13]. Moreover, most existing methods use a discrete global optimization to enforce smoothness in the estimated disparity map [7,14,15] and refine it to subpixel accuracy afterwards [9,16]. However, such a stage-wise pipeline is prone to errors in each step and lacks an overall objective to optimize.

Recent progress in learning-based deep neural networks has provided an alternative strategy that aims at building an end-to-end system for depth estimation. Most prior work focuses on single image-based depth prediction [2,17], or on learning neural networks to compute the matching cost [8,18,19]. By contrast, Mayer et al. proposed end-to-end trainable deep regression networks for stereo estimation [20]. Despite their competitive performance, the networks suffer from the well-known foreground-fattening problem, which appears as a halo effect near object boundaries, as shown in Fig. 1b.

In this paper, we propose a deep network approach to refine the disparity maps generated by deep regression networks such as [20]. Our main idea is based on a detect-and-correct strategy [13,21], in which we find the regions with low confidence in the initial predictions and exploit the disparity gradients to reconstruct more accurate and structure-preserving disparity maps. To this end, we design a novel two-branch fully convolutional network that takes the regression network features and images as input, and predicts a dense confidence map for the regressed disparities and a disparity gradient map.

Given the confidence estimation and the predicted disparity gradients, we develop a continuous Markov Random Field (MRF) to refine the disparities generated by the regression network. Our MRF takes the initial disparity map as its observation and imposes a structure-preserving prior based on the estimated confidence scores and disparity gradients. Specifically, we enforce the


Fig. 1. An illustration of refining an estimated disparity map based on deep networks. (a) Reference image and two cropped patches of ground truth disparity. (b) Initial result of DispNetCorr1D, which has blurred object boundaries and noisy structures on object surfaces [20]. (c) Halo-free and structure-preserving result of our method.

Fig. 2. An overview of our method. The input consists of the reference image and the matching feature extracted from a pre-trained CNN for disparity estimation [20]. We learn a confidence network (ConfNet) and a gradient network (GradNet) to predict the correctness of the initially estimated disparities and the disparity gradients. A global optimization method is proposed to combine the useful information to reconstruct high-quality disparity maps.

refined disparity map to be consistent with the predicted disparity gradients, especially in the regions with low confidences. We solve our continuous disparity MRF via an efficient global optimization thanks to its quadratic form and convexity. An example of our results is shown in Fig. 1(c): large errors are removed near disparity discontinuities and surface structures are improved. We refer the readers to Fig. 2 for an overview of our framework.

We extensively evaluate our refinement method on four challenging synthetic datasets and the real-world KITTI2015 stereo benchmark. The results show that our approach outperforms the state-of-the-art and several global optimization baselines. In particular, our method not only produces accurate disparity maps but also recovers better surface structures. Moreover, confidence learning via a deep network is more robust than learning confidence with hand-crafted features.

2. Related work

Depth estimation from stereo images has a long history in computer vision [22] and it is beyond the scope of this paper to present a complete review; we refer the readers to recent surveys of the literature [10,11,22]. Here, we mainly focus on recent learning-based work, including deep networks for depth prediction, learning the confidence of stereo matching, and modeling depth priors with MRFs.

Deep networks for depth prediction: Deep learning approaches have achieved significant progress in depth estimation recently. A large number of prior works focus on single image-based depth prediction, e.g., [2,17], which typically build an end-to-end trainable deep network to directly predict a dense depth map.

However, such data-driven methods require a large training dataset with ground truth depth, which is difficult to obtain for outdoor scenes. As a consequence, until recently deep learning methods for generic stereo estimation mostly considered the task of learning a local matching cost. In particular, Zbontar and LeCun learn a convolutional neural network for estimating the affinity of a pair of patches, which outperforms matching cost functions based on low-level features [8,9]. However, computing the matching cost for each patch pair is time-consuming during inference. An efficient architecture is proposed in [18] by learning a probability distribution over all disparities at once with a dot-product layer. Using a dot-product layer is also analyzed in [9,19] to speed up the matching cost computation. The main drawback of those patch-matching networks is that they have to rely on ad hoc post-processing to obtain the final disparity estimation. Beyond the stereo matching literature, Zagoruyko and Komodakis propose various architectures of deep neural networks for comparing image patches [23].

Only recent work by Mayer et al. [20] learns deep regression networks to estimate continuous disparities, thanks to the availability of large synthetic stereo datasets. Their networks adopt an encoder-decoder structure, which enables end-to-end training and efficient inference at test time. Our work is built on top of one of their regression networks and aims at removing the aforementioned halo effect near disparity discontinuities and reconstructing structure-preserving disparity maps. We note that unsupervised deep learning methods for stereo have also attracted much attention recently [24,25]. Nevertheless, their performance is still inferior to that of networks trained with strong supervision.

Learning confidence in stereo estimation: Traditional stereo vision often uses confidence measures of the disparity estimation to remove potentially large errors (i.e., bad pixels) within disparity maps [26]. For example, the left-right consistency (LRC) check and simple interpolation are commonly used to tackle occlusions and mismatches [10,12,13]. Learning a confidence measure with various


Fig. 3. The architecture of our confidence and gradient networks.

Fig. 4. Sparsification curves of the first frames of the FlyingThings3D, Driving, Monkaa and Sintel datasets. The optimal method means that the ground truth is used to remove bad pixels.

hand-crafted confidence features has been investigated in [12]. Spyropoulos et al. propose to learn a confidence map using similar hand-crafted features and to incorporate the predicted confidences into a discrete MRF model to obtain dense high-quality disparity maps [21]. Park and Yoon use a two-stage learning method to choose effective confidence measures first and then use them as features to refine the confidence estimation [13]; the predicted confidence map is used to modulate the matching cost, which can then be incorporated into a semi-global optimization framework. A training data generation method is proposed in [27] to improve the accuracy of confidence prediction. With disparity patches as the input, convolutional neural networks have recently proved effective for learning confidence measures [28,29]. By contrast, we propose to learn the confidence map using a deep neural network based on learned matching and image features, which can be integrated into a unified deep network model.

Modeling depth smoothness: Modeling depth structures has been extensively studied in the stereo matching literature [10]. In the discrete setting, MRF-based global optimization methods often make use of smoothness priors such as first-order piece-wise smoothness or second-order smoothness to tackle challenging scenarios such as occlusions or textureless regions [7,15]. The continu-


Fig. 5. AUC values on the FlyingThings3D, Driving, Monkaa and Sintel datasets. The AUC values are sorted in ascending order of the optimal AUC values.

ous counterparts typically formulate the problem in variational optimization frameworks in which the smoothness priors are based on total variation functionals [30,31]. The fast bilateral solver proposed in [32] achieves competitive performance on various vision tasks. However, these smoothness priors are globally defined based on relatively simple assumptions about surface structures. Recently, surface normal classifiers have been adopted to impose image-dependent constraints for depth estimation and multi-view reconstruction [33,34], which has shown effective improvements. In contrast, we propose a deep learning based approach to estimate the disparity gradients for every image and incorporate them into our MRF model to obtain structure-preserving disparity maps.

3. Two-branch fully convolutional networks

Our goal is to refine an initial disparity estimation generated by the deep regression network [20], given a pair of rectified images. To achieve this, we follow a detect-and-correct strategy, in which we first identify noisy regions in the disparity estimation based on a predicted confidence map, and then estimate the disparity gradients to capture surface structures in the scene. Finally, the confidence map and disparity gradients are integrated into a continuous Markov Random Field (MRF), based on which we compute a structure-preserving disparity map via global optimization. In this section, we introduce the two-branch deep network for computing the confidence map and disparity gradients. The MRF-based refinement process will be described in Section 4.

An overview of our deep network is shown in Fig. 2. In addition to the original regression layers, we add two sub-networks to the deep regression network, for disparity confidence and gradient prediction respectively. Both sub-networks share a similar network architecture except for their input features and output layer. Each sub-network consists of 5 convolutional layers. For the first four layers, we use 3 × 3 convolution kernels and rectified linear units (ReLU) as their activation functions. The fifth layer performs a 1 × 1 convolution with its full connection as in the fully convolutional network (FCN) [35], followed by a dropout layer. The architecture is shown in Fig. 3. We now describe the specific details of these two sub-networks in the following.
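As a rough illustration of this shared sub-network design (four 3 × 3 convolution layers with a ReLU-style activation, then a 1 × 1 per-pixel output layer), the following is a minimal NumPy sketch; the function names, weight shapes, and the omission of dropout, biases and training are our simplifications, not the paper's exact implementation:

```python
import numpy as np

def conv2d(x, w):
    """'Same'-padded 2-D convolution (cross-correlation, as in deep learning
    frameworks) of an (H, W, Cin) feature map with a (k, k, Cin, Cout) kernel."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h, wd, _ = x.shape
    out = np.zeros((h, wd, w.shape[3]))
    for i in range(h):
        for j in range(wd):
            patch = xp[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def subnetwork(x, weights, leaky=0.0):
    """Five conv layers: four 3x3 layers with (leaky) ReLU activations,
    then a 1x1 output layer (a per-pixel fully connected layer, FCN-style)."""
    for w in weights[:-1]:
        x = conv2d(x, w)
        x = np.where(x > 0, x, leaky * x)   # ReLU; leaky variant for the gradient net
    return conv2d(x, weights[-1])           # 1x1 output layer, no activation
```

Setting `leaky=0.0` gives the plain ReLU used by the confidence branch, while `leaky=0.1` mimics the gradient branch described in Section 3.2.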

3.1. Confidence network

The confidence network aims at detecting regions in the predicted disparity map that have large uncertainty or errors. We formulate it as a binary labeling task and design an FCN as described above to predict a confidence map, in which each pixel indicates the probability of being correct.

The input to our confidence network consists of three types of features: the reference image, the last concatenated matching features of the DispNetCorr1D in [20], and the left-right consistency (LRC) [12,13,21]. The LRC map C_lr is defined as follows:

C_lr(i, j) = |d^l_{i,j} − d^r_{i, j−d^l_{i,j}}|  if (j − d^l_{i,j}) > 0,  and −1 otherwise,   (1)

where i, j are the pixel coordinates in the reference image, and d^l and d^r denote the left and right estimated disparity maps respectively. This feature combines the disparity information from the left and the right views to identify potential mismatches. We stack these three inputs into a multi-dimensional image before passing them into the confidence network.
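As a sanity check of Eq. (1), a direct (unoptimized) NumPy implementation of the LRC map could look like the following; rounding the continuous left disparity to an integer column index is our simplification, and we assume non-negative disparities:

```python
import numpy as np

def lrc_map(d_left, d_right):
    """Left-right consistency feature of Eq. (1): compare the left disparity
    at (i, j) with the right disparity at the matched column j - d_l(i, j);
    pixels whose match falls outside the image get the value -1."""
    h, w = d_left.shape
    c = np.full((h, w), -1.0)
    for i in range(h):
        for j in range(w):
            jr = j - int(round(d_left[i, j]))  # matched column in the right view
            if jr > 0:                          # the (j - d_l) > 0 condition of Eq. (1)
                c[i, j] = abs(d_left[i, j] - d_right[i, jr])
    return c
```

For consistent left/right estimates the map is near zero in matched regions and −1 where the match leaves the image, which is exactly what makes it a useful mismatch cue.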

The output layer of the confidence network for each pixel is a softmax function with two outputs. We train the network us-


Fig. 6. Comparison of the disparity maps on the synthetic datasets. From top to bottom: reference image, initial disparity map, initial error map, predicted confidences, our disparity map and our error map. In the error maps, bad pixels are marked red and blue regions mean small errors. The predicted confidence maps locate the bad pixels well, and according to the refined disparity maps our method focuses on changing these regions without degradation in other regions. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

ing the cross-entropy loss and stochastic gradient descent based on Adam [36]. On the training data, we generate the ground truth binary label map L by locating pixels with large disparity errors. Specifically, for the pixel at location (i, j),

L(i, j) = 1(|d_{i,j} − d*_{i,j}| ≤ t),   (2)

where t is a predefined threshold and 1(·) is the indicator function. d and d* denote the estimated disparity and the ground truth respectively.
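The label generation of Eq. (2) is straightforward to sketch in NumPy (the function name is ours; t = 1 px matches the bad-pixel threshold used elsewhere in the paper):

```python
import numpy as np

def confidence_labels(d_est, d_gt, t=1.0):
    """Ground-truth labels of Eq. (2): 1 where the absolute disparity error
    is within t pixels (a "correct" pixel), 0 otherwise."""
    return (np.abs(d_est - d_gt) <= t).astype(np.int64)
```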

3.2. Gradient network

To exploit 3D surface structure to refine the initial disparity estimation, we focus on the smoothness property of surfaces, which is commonly used in prior work on stereo estimation. However, instead of relying on a generic smoothness prior, we design an FCN sub-network to predict the disparity gradients for the reference image and use them as an image-specific prior for disparity refinement.

Our gradient network takes the reference image and the last concatenated matching features of the DispNetCorr1D as input and regresses the disparity gradients at each pixel. Unlike the confidence network, we use the leaky ReLU activation with a negative slope of 0.1 in all the layers. The output layer of the gradient network has two neurons for each pixel, predicting its disparity gradients along the horizontal and vertical directions.

We use forward differences to approximately compute the ground truth gradients for training our deep gradient network. The


Fig. 7. AUC value comparison of our ConfNet and RDF [12]. The gaps between our ConfNet and the optimal AUC values are relatively narrow in most cases.

2 The bad sequence ‘A-149’, which contains totally occluded views, is not used.

gradients at disparity discontinuities are ill-defined, and thus we mask out pixels whose forward-difference values are larger than a threshold during training. We set the threshold empirically to 1 pixel in this work. The L1 loss function is used for training, and the weights are learned based on Adam [36].
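The target-generation step just described might be sketched as follows (NumPy; zero-padding the last row/column and combining the x and y thresholds into one validity mask are our choices):

```python
import numpy as np

def gradient_targets(d_gt, thresh=1.0):
    """Forward-difference disparity gradients used as regression targets.
    Gradients at disparity discontinuities are ill-defined, so pixels whose
    forward difference exceeds `thresh` (1 px in the paper) are masked out."""
    gx = np.zeros_like(d_gt)
    gy = np.zeros_like(d_gt)
    gx[:, :-1] = d_gt[:, 1:] - d_gt[:, :-1]   # horizontal forward difference
    gy[:-1, :] = d_gt[1:, :] - d_gt[:-1, :]   # vertical forward difference
    mask = (np.abs(gx) <= thresh) & (np.abs(gy) <= thresh)  # valid training pixels
    return gx, gy, mask
```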

4. Continuous MRF for depth refinement

Given the disparity gradient map G and the confidence map τ, we now build a continuous MRF to refine the initial disparity map D. The MRF model encodes an image-specific smoothness prior by integrating the predicted confidence scores and disparity gradients.

Formally, denoting the reference image as I and the refined disparity map as S, we define the energy function of the MRF as follows:

E(S) = E_d(S | D, τ) + α E_g(S | G) + β E_smooth(S | I),   (3)

where E_d is the data term, E_g is the gradient consistency term, and E_smooth is the disparity smoothness term. α and β are hyper-parameters that balance the effect of the three terms.

Data term: The data term E_d(S | D, τ) enforces that the refined disparity values stay close to the initial disparities in the confident regions. Specifically,

E_d(S | D, τ) = Σ_p τ_p (S_p − D_p)^2,   (4)

where D_p and S_p are the initial and refined disparity at pixel p, and τ_p is the confidence score of the initial disparity at p.

Gradient consistency term: The gradient term E_g(S | G) enforces consistency between the gradients of the refined disparity and the predicted gradients:

E_g(S | G) = Σ_p ϕ^x_p (∇_x S_p − G^x_p)^2 + ϕ^y_p (∇_y S_p − G^y_p)^2,   (5)

where ∇_x and ∇_y are the horizontal and vertical gradient operators. ϕ^x_p and ϕ^y_p are image-based weights that approximately estimate the likelihood of smooth surfaces, defined as:

ϕ^x_p = exp(−||∇_x I_p||^2 / (2σ_r^2)),   ϕ^y_p = exp(−||∇_y I_p||^2 / (2σ_r^2)),   (6)

where ∇_x I_p and ∇_y I_p are the forward differences of the color of pixel p in the horizontal and vertical directions respectively, and σ_r is the parameter of the Gaussian kernel.

Disparity smoothness term: The disparity smoothness term enforces that the refined disparity at each potential disparity discontinuity should be close to the disparities of the neighboring pixels with similar appearance:

E_smooth(S | I) = Σ_p (1 − ϕ_p) (S_p − Σ_{q ∈ N_p} w_q S_q)^2,   (7)

where N_p is the neighborhood of pixel p defined by a local window; a 3 × 3 patch centered at p is used in our work. w_q is a weight based on the difference of the two pixels' intensities, w_q = exp(−||I_p − I_q||^2 / (2σ_p^2)), where σ_p is the variance of the intensities in a window around p. The same regularization has been exploited in image colorization and segmentation algorithms [37]. This term and the gradient consistency term represent an image-specific disparity prior for surfaces and disparity discontinuities.

Global optimization for disparity refinement: We compute the refined disparity map by minimizing the energy function E(S) of the continuous MRF. As the energy function consists of a set of quadratic functions with positive weights, the overall energy function is convex and can be minimized in closed form.

Specifically, setting ∂E(S)/∂S = 0, we can find the optimal disparity efficiently by solving a sparse linear system AS = b. Here A and b are defined as

A = τ − α(C̃_x φ_x C_x + C̃_y φ_y C_y) + β(2I − φ_x − φ_y)(I − W),
b = τ D − α(C̃_x φ_x G^x + C̃_y φ_y G^y).   (8)

Let N denote the number of pixels and I the identity matrix. Here, τ is an N × N diagonal matrix with the τ_p as its diagonal elements. C̃ and C are Toeplitz matrices which perform the backward-difference and forward-difference computations respectively. φ is also an N × N diagonal matrix with the ϕ_p as its diagonal elements. W is a matrix which stores the weights w_q, q ∈ N_p, for each pixel p in the p-th row. The hyper-parameters of the MRF are validated on a held-out training set and fixed throughout our experiments.
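To make the closed-form solution concrete, here is a deliberately simplified 1-D NumPy sketch that keeps only the data term and the horizontal gradient-consistency term of Eqs. (3)-(5); the smoothness term, the sparse-matrix bookkeeping of Eq. (8), and the function name are all our simplifications, and a dense solve replaces the sparse one for clarity:

```python
import numpy as np

def refine_1d(D, tau, Gx, phi_x, alpha=100.0):
    """Minimal 1-D analogue of the closed-form refinement:
        E(S) = sum_p tau_p (S_p - D_p)^2 + alpha * sum_p phi_p (S_{p+1} - S_p - Gx_p)^2
    Setting dE/dS = 0 yields a linear system A S = b, solved directly here."""
    n = len(D)
    C = np.zeros((n, n))                    # forward-difference operator
    for p in range(n - 1):
        C[p, p], C[p, p + 1] = -1.0, 1.0
    T = np.diag(tau)
    Phi = np.diag(phi_x)
    A = T + alpha * C.T @ Phi @ C           # Hessian of the quadratic energy
    b = tau * D + alpha * C.T @ (Phi @ Gx)  # right-hand side from dE/dS = 0
    return np.linalg.solve(A, b)
```

When the predicted gradients agree with the initial disparities, the refinement leaves them untouched; a low-confidence pixel (small τ_p) is instead pulled toward the value implied by its neighbors and the predicted gradients, which is the detect-and-correct behavior the MRF is designed for.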

5. Experiments

The proposed method is evaluated on four recently published large synthetic stereo datasets and the real-world KITTI2015 stereo benchmark. The following subsections describe the experimental details and the comparisons to the state-of-the-art approaches on the synthetic and real-world datasets in turn.

5.1. Synthetic datasets

We first demonstrate the effectiveness of our disparity refinement method on four synthetic datasets [20], including FlyingThings3D, Driving, Monkaa, and MPI Sintel.

For the FlyingThings3D dataset, we follow the setup of [20] and use 22,390 image pairs for training the neural networks. We split the test scenes into 780 image pairs for validation and 3580 image pairs for testing 2. In the validation set, the first 700 image pairs are used to validate the training parameters of our neural networks and the other 80 image pairs are used to validate the hyper-parameters of both our global optimization method and the baselines.

The Driving dataset has 4400 image pairs of driving scenes, and the Monkaa dataset, made from an animated short film, Monkaa, contains 8640 image pairs [20]. The 1064 image pairs of the MPI Sintel dataset simulate realistic scenes including natural image degradations such as fog and motion blur [38]. All three of these datasets are used as test sets, as in [20], for evaluating our method and the baselines.

Implementation details: We fix the regression network and train the sub-networks independently. For training the gradient


Fig. 8. Refining the disparity discontinuities on the KITTI2015 stereo dataset. Our ConfNet detects the bad pixels (non-confident results marked blue) accurately, and visible improvement can be seen in our refined disparity maps. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


network, we start from a learning rate of 0.0001 and halve it every 20K iterations, stopping after 200K iterations; the weight decay is set to 0.0004. For training the confidence network, we start from a learning rate of 0.001 and halve it every 40K iterations, stopping after 320K iterations; the weight decay is set to 0.0005. Both models are trained on the training set of the FlyingThings3D dataset only and evaluated on the other datasets, as in the prior work [20]. The hyper-parameters of our continuous MRF are fixed for all the synthetic datasets as follows: α = 100, β = 0.1, σ_r = 2.55.

Baselines: We employ two groups of baselines in our comparisons: deep regression networks and MRF-based methods. For the deep regression networks, the first baseline is the pre-trained neural network of [20]. We further fine-tune it with an additional loss term on disparity gradients as our second baseline. Let l_p = (d_p − d*_p) be the difference for a pixel p. We define the loss function for the fine-tuning as follows:

L = (1/N) Σ_p |l_p| + (λ/N) Σ_p (|∇_x l_p| + |∇_y l_p|),   (9)

where λ = 5 is the parameter that balances the two terms of the loss function. The loss function encourages the network to find a local optimum that balances the disparity accuracy and the gradient structure of the surfaces. A similar loss function has been used in [3] for single image-based depth prediction. Moreover, we include a weighted median filter based smoothing technique for the pre-trained regression network, as it is often used to remove noise from disparity maps without blurring the disparity discontinuities [39,40]. For the MRF-based methods, we compare our method with three state-of-the-art continuous MRFs [4,31,32,37] based on our ConfNet.

Evaluation metrics: The most commonly used metric for eval-

ating stereo matching algorithms is the percentage of bad pix-

ls among all the pixels with valid ground truth disparity values.

bad pixel means the absolute disparity error is larger than a

hreshold, which is set to 1 pixel in this paper. To evaluate the

uality of continuous disparities, we also compute the end-point

rror 1 N

p | d p − d ∗p | to compare different methods as in [20] . Here,

p and d ∗p denote the estimated and ground truth disparity of a

ixel p respectively. Additionally, we compute the sum of the end-

oint errors of both the horizontal and the vertical disparity gra-

ients to compare the structure-preserving performances of our

ethod and the baselines.

Detailed analysis: A sparsification curve shows the change of

he percentage of bad pixels while removing least confident dis-

arities from the disparity map [12,13] . Examples of sparsification

urve are shown in Fig. 4 . We first evaluate the performance of

ur ConfNet, we plot the area under sparsification curves (AUC)

f the test sets as [12,13] in Fig. 5 . For the FlyingThings3D dataset,

he gaps between our AUC and the optimal AUC values are small,

hich evidents that our ConfNet is able to effectively detect bad

ixels. On the Monkaa and the Sintel datasets, we observe the sim-

lar trend except for some challenging scenes corresponding to the

arge gaps in the plots. For the Driving dataset, we note that the

ptimal AUC values are large, which means that the initially esti-

ated disparity maps have poor qualities. Therefore, locating bad

ixels (i.e., selecting reliable disparities) is also challenging in this

ataset. For large scale synthetic datasets, learning a random forest

lassifier with hand-crafted features to predict confidence is not

pplicable due to memory limit. Hence, the comparison of ConfNet

nd the previous work [13] will be discussed on the real world

ITTI2015 dataset later.

We then compare the predicted disparity gradients with the

radients of the initially estimated disparity maps of DispNet-

orr1D in Table 1 . The predicted gradients are much better and

his demonstrates that they can be used as effective smoothness

onstraints to improve the noisy disparity maps.

Comparing with CNN-based methods: We first compare our

ethod with the baselines on the FlyingThings3D dataset since

ur models are trained on the training data of this dataset. For

he occlusions and the whole image, we list the percentage of

ad pixels respectively in Table 2 . Fine-tuning the deep regres-

ion network with an additional loss term on disparity gradient

s proved to be useful for improving disparity estimation accord-

ng to the current metric. Both WMF and our method work well

n removing bad pixels and our results outperform them in all

ases. Table 2 also shows the end-point errors of different refine-

ent methods based on deep networks and filtering. Fine-tuning

he network with an additional gradient-related loss term results

n slight accuracy changes, and our method and the WMF algo-

ithm achieves competitive performances.

For the other three synthetic datasets, we list the percentages

f bad pixels and the end-point-errors in Table 3 . Our method out-

erforms the baselines in most cases, which demonstrates the uni-

ersality of our trained models. We note that our method only re-

oves a little bad pixels for the Driving dataset. The probable rea-

on is that the confidence networks cannot effectively detect the

ad pixels since the FlyingThings3D ’s train set has no similar driv-

ng scenes like the Driving dataset. Hence, we also compute a per-

ormance upper bound of our method as shown in Table 3 . Specif-

cally, instead of using predicted values, we put the ground truth

onfidences into our MRF model to generate the upper bound per-

ormances. The upper bound shows that our method can remove a

arge amount of bad pixels with the ground truth confidence and

he predicted disparity gradients. An interesting result is that the

ne-tuned network has competitive performances on the Driving

nd the Sintel datasets according to the percentages of bad pix-

ls but worse performances according to the end-point-errors. It

hows that using a single metric to compare the quality of dispar-

ty maps is not enough. Lower average absolute error may mean

ver-smooth disparity maps and fewer bad pixels may mean very

arge errors for a certain number of pixels.

Additionally, we compare the quality of the final gradients of

he disparities in Table 4 . Our method has the best performance on

ll the datasets, which proves that we can effectively reconstruct

he surface structures. Note that the fine-tuned network only im-

rove the disparity gradients slightly and using a large λ will result

n bad disparity maps according to our experiment. In conclusion,

ur method outperforms the baselines in most cases considering

ll the three metrics.

Comparing with MRF-based methods: Using our confidence

eighted data term, we perform a separate set of experiments to

valuate the importance of different MRF regularization terms. The

olor-based term makes simple piece-wise smoothness assumption

nd is commonly used for filling missing depth values and image

olorization [4,37] . The TGVL2 model, which is the state-of-the-

rt method for noisy disparity map upsampling, employs both the

econd-order smoothness prior and the anisotropic diffusion tensor

o enforce smoothness within surfaces and preserve discontinuities

t potential boundaries [31] . The fast bilateral solver is proposed

n [32] to perform confidence-based edge-aware filtering for depth

efinement and upsampling. Our method is built on a similar strat-

gy and the difference is that the smoothness prior (i.e., disparity

radients) is learned from training data. According to Table 5 , our

ethod has better performance on estimating accurate disparity

alues while TGVL2-based regularization can remove slightly more

ad pixels. However, TGVL2 model needs to solve the global op-

imization problem iteratively, which can be slow in practice. In

ontrast, our global optimization problem is solved efficiently us-

ng standard least square solvers and is about 10x faster according

o the computational time. The general fast bilateral solver (FBS)
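The step schedules above (learning rate multiplied by 0.5 every 20K iterations for the disparity network, every 40K for the confidence network) can be sketched in a few lines of Python; the function name and the way the rate would be wired into an optimizer are our own assumptions:

```python
def step_lr(iteration, base_lr, halve_every):
    """Step schedule: multiply `base_lr` by 0.5 once per `halve_every` iterations."""
    return base_lr * 0.5 ** (iteration // halve_every)

# Disparity network: start at 1e-4, halve every 20K iterations (stop at 200K).
# Confidence network: start at 1e-3, halve every 40K iterations (stop at 320K).
lr_disp = step_lr(45_000, base_lr=1e-4, halve_every=20_000)  # third step: 2.5e-5
lr_conf = step_lr(45_000, base_lr=1e-3, halve_every=40_000)  # second step: 5e-4
```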

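The fine-tuning loss of Eq. (9) can be written down directly with finite differences. Below is a NumPy sketch of the same quantity on a dense disparity map (in training it would be computed on tensors inside the network framework; the function name is ours):

```python
import numpy as np

def gradient_l1_loss(d, d_gt, lam=5.0):
    """Eq. (9): mean |l_p| plus a lam-weighted mean of |grad_x l_p| + |grad_y l_p|,
    where l_p = d_p - d*_p and gradients are forward differences."""
    l = d - d_gt
    n = l.size
    grad_x = np.abs(np.diff(l, axis=1)).sum()  # horizontal differences of the error
    grad_y = np.abs(np.diff(l, axis=0)).sum()  # vertical differences of the error
    return np.abs(l).sum() / n + lam * (grad_x + grad_y) / n
```

A constant error field has zero gradient penalty, so the second term only punishes errors that vary across the image, i.e. broken surface structure.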

Table 1
Comparisons of the disparity gradients. The output gradients of our GradNet are much better on all the datasets. The metric is the sum of the end-point errors of the horizontal and vertical disparity gradients.

Method               FlyingThings3D(test)  FlyingThings3D(val)  Driving  Monkaa  Sintel(train)
DispNetCorr1D [20]   0.55                  0.55                 0.71     0.45    0.45
GradNet              0.05                  0.06                 0.32     0.09    0.10
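The metrics reported in these tables (bad-pixel percentage, end-point error, and the gradient end-point error of Tables 1 and 4) can be sketched in NumPy as follows; the function names and the exact border handling are our assumptions:

```python
import numpy as np

def bad_pixel_rate(d, d_gt, valid, thresh=1.0):
    """Percentage of valid pixels whose absolute disparity error exceeds `thresh`."""
    return 100.0 * np.mean(np.abs(d - d_gt)[valid] > thresh)

def end_point_error(d, d_gt, valid):
    """Mean absolute disparity error (EPE) over valid pixels."""
    return np.mean(np.abs(d - d_gt)[valid])

def gradient_epe(d, d_gt):
    """Sum of the horizontal and vertical disparity-gradient end-point errors."""
    err_x = np.mean(np.abs(np.diff(d, axis=1) - np.diff(d_gt, axis=1)))
    err_y = np.mean(np.abs(np.diff(d, axis=0) - np.diff(d_gt, axis=0)))
    return err_x + err_y
```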

Table 2
Quantitative comparisons of the disparities of both the occluded regions and all pixels. The metrics are the percentage of bad pixels (> 1 px) and the end-point error (EPE). The suffix _F means that the network is fine-tuned with an additional gradient-related loss term. Our method outperforms all the baselines on both metrics.

                          FlyingThings3D(val)           FlyingThings3D(test)
                          Occ           All             Occ           All
Method                    >1px   EPE    >1px   EPE      >1px   EPE    >1px   EPE
DispNetCorr1D [20]        54.29  4.70   24.99  2.01     53.08  4.57   23.48  2.16
DispNetCorr1D + WMF [40]  48.91  4.15   23.11  1.84     47.66  4.02   21.57  1.98
DispNetCorr1D_F           51.23  4.79   23.24  2.07     49.78  4.73   21.67  2.10
Ours                      47.74  4.08   22.73  1.81     46.53  3.98   21.17  1.98

Table 3
Quantitative comparisons of the whole disparity maps. The metrics are the percentage of bad pixels (> 1 px) and the end-point error (EPE). The suffix _F means that the network is fine-tuned with an additional gradient-related loss term. Our method outperforms all the baselines on end-point error and removes bad pixels effectively except on the Driving dataset; the likely reason is that the training data contains no similar driving scenes. The upper bound shows that bad-pixel removal can benefit from better confidence estimates.

Method                    Driving         Monkaa          Sintel(train)
                          >1px   EPE      >1px   EPE      >1px   EPE
DispNetCorr1D [20]        70.50  12.56    36.95  10.92    46.96  5.74
DispNetCorr1D + WMF [40]  70.18  12.54    36.48  10.86    46.21  5.65
DispNetCorr1D_F           68.92  13.57    37.00  10.84    44.41  7.31
Ours                      70.49  12.35    35.73  10.81    46.33  5.49
Ours + CONF_GT            63.36  12.21    25.50  10.97    35.47  4.99

Table 4
Comparisons of the quality of the final disparity gradients.

Method                    FlyingThings3D(val)  FlyingThings3D(test)  Driving  Monkaa  Sintel(train)
DispNetCorr1D [20]        0.55                 0.55                  0.71     0.45    0.45
DispNetCorr1D + WMF [40]  0.47                 0.47                  0.75     0.44    0.40
DispNetCorr1D_F           0.47                 0.47                  0.67     0.42    0.47
Ours                      0.38                 0.38                  0.60     0.32    0.32

Table 5
Comparisons of different MRF regularization terms. Grad means that only the gradient consistency regularization term is used in our model. Our method uses both the color-based smoothness and gradient consistency terms to handle smooth surfaces and boundaries respectively.

             FlyingThings3D(val)    Sintel(train)
Method       disp   bad    grad     disp   bad    grad     Time(s)
Color [37]   2.04   22.54  0.43     5.65   46.70  0.36     13.6
TGVL2 [31]   1.82   22.37  0.39     5.59   46.03  0.36     148.5
FBS [32]     1.83   23.31  0.53     5.78   47.84  0.47     0.7
Grad         1.99   22.92  0.42     5.51   46.42  0.38     2.5
Ours         1.81   22.73  0.38     5.49   46.33  0.32     14.3

The general fast bilateral solver (FBS) is efficient but fails to work competitively.³ Note that using a single regularization term of our model cannot achieve state-of-the-art performance. This shows that a simple piece-wise smoothness assumption is not robust on smooth surfaces, and that modeling the disparity boundaries is important for our approach.

Visual results are shown in Fig. 6 to demonstrate that our method can remove bad pixels and smooth the disparities within surfaces.

³ We use the publicly available code to evaluate this baseline. Note that the authors propose a variant of their algorithm (the RBS solver) to refine stereo matching results using iteratively reweighted least squares. The effectiveness and efficiency of this iterative algorithm are not clear since this part of the code is not available.
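The sparsification analysis used above (Figs. 4 and 5) can be sketched as follows: sort pixels by confidence, progressively drop the least confident ones, and track the bad-pixel rate; the mean of the curve approximates its AUC. This is a simplified NumPy version with names of our own; the exact curve construction and normalization in [12,13] may differ:

```python
import numpy as np

def sparsification_curve(conf, err, thresh=1.0, steps=20):
    """Bad-pixel rate after removing the k/steps least-confident pixels."""
    order = np.argsort(conf)[::-1]             # most confident first
    bad = (err[order] > thresh).astype(float)
    n = len(bad)
    return np.array([bad[:n - int(k / steps * n)].mean() for k in range(steps)])

def sparsification_auc(conf, err, thresh=1.0, steps=20):
    """Area under the sparsification curve (lower = better confidence measure)."""
    return sparsification_curve(conf, err, thresh, steps).mean()
```

A confidence measure that ranks pixels exactly by their true error drives the curve to zero fastest, which yields the optimal AUC that the paper's gaps are measured against.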

5.2. Real world dataset

We now turn to the KITTI2015 dataset, which consists of 200 image pairs for training and 200 image pairs for testing [41]. The images are captured in real-world driving environments and come with sparse ground truth disparity maps.

Detailed analysis: We use the first 100 images of the KITTI2015 train set to train our ConfNet to predict the confidence of DispNetCorr1D on KITTI before fine-tuning. We compare our approach to the baseline method RDF12 [13]. For the baseline, the matching cost-based features are not available in this setting, so a 12-dimension feature vector is used to train the Random Forest classifier.⁴


Table 6
Comparisons of state-of-the-art methods on the KITTI2015 stereo benchmark. The metric is the percentage of bad pixels; bg and fg denote background and foreground regions respectively.

Methods                bg     fg     all
Displets [42]          3.00   5.56   3.43
MC-CNN-acrt [9]        2.89   8.88   3.89
DispNetCorr1D-K [20]   4.32   4.41   4.34
Content-CNN [18]       3.73   8.58   4.54
SPS-St [14]            3.84   12.67  5.31
Ours                   4.39   4.59   4.43

We refer the readers to [12,13] for the details of these features. The results of our ConfNet and RDF12 are shown in Fig. 7, and our ConfNet outperforms RDF12 in most cases. This shows that an added sub-network can easily learn the error mode of its parent network based on the pre-learned representations.

To fine-tune our GradNet, we need dense disparity maps, which are unavailable in the KITTI dataset. We instead use the results of [9] for its strong performance in background regions and replace the object regions with ground truth to create high-quality dense disparity maps for fine-tuning our GradNet.

Overall results: We first show our qualitative results in Fig. 8. We can see the visual improvement in these examples. The gradients learned from high-quality disparity maps refine the disparities near depth discontinuities. Also, the predicted confidence locates the errors well, and our method focuses on removing these bad pixels near depth discontinuities. We note that a number of bad pixels still cannot be removed, since we learn the gradients of [9], which also suffers from foreground-fattening effects in some cases. On the public KITTI2015 stereo benchmark, we achieve performance close to the DispNetCorr1D-K model, as shown in Table 6. This is probably due to the fact that there are few ground truth points near object boundaries in the KITTI2015 dataset, and the synthetic datasets are insufficient to train a network to predict disparity gradients for real world data.⁵

⁴ Most of the hand-crafted confidence features are based on the matching cost, which is absent in our case since the regression network outputs continuous disparity maps only. Hence, we use the code of [13] to compute all 12 image- and disparity map-based confidence features to learn RDF12 as the baseline.

⁵ We have tried to train our sub-networks on the Driving dataset and test on the KITTI data, but without success. The Driving dataset has rather different image and disparity statistics compared to realistic driving scenes.

6. Conclusions

We propose a deep network method to refine continuous disparity maps estimated by a deep regression network. Specifically, we learn a unified two-branch fully convolutional network to predict the confidence map of the disparity and the surface smoothness priors, respectively. We then integrate them into a continuous MRF for improving the initial disparity maps, and solve the MRF efficiently with a closed-form solution. Performance evaluated on both synthetic and real-world datasets demonstrates that our method effectively reduces the errors of a deep regression network without requiring multiple ad hoc post-processing steps.

Acknowledgment

The work of F. Cheng was supported by the China Scholarship Council. This work was done during Cheng's visit to NICTA, and was also supported by the National Natural Science Foundation of China (grant number 61571026). The work of X. He was supported in part by the Australian Government through the Department of Communications and in part by the Australian Research Council through the ICT Center of Excellence Program.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at 10.1016/j.patcog.2017.07.027.

References

[1] R.I. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed., Cambridge University Press, 2004. ISBN: 0521540518.
[2] D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multi-scale deep network, in: NIPS, 2014, pp. 2366–2374.
[3] D. Eigen, R. Fergus, Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, in: ICCV, 2015, pp. 2650–2658.
[4] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from RGBD images, in: ECCV, 2012, pp. 746–760.
[5] A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? The KITTI vision benchmark suite, in: CVPR, pp. 3354–3361.
[6] H. Hirschmuller, Accurate and efficient stereo processing by semi-global matching and mutual information, in: CVPR, 2005, pp. 807–814.
[7] O. Woodford, P. Torr, I. Reid, A. Fitzgibbon, Global stereo reconstruction under second-order smoothness priors, PAMI 31 (12) (2009) 2115–2128.
[8] J. Zbontar, Y. LeCun, Computing the stereo matching cost with a convolutional neural network, in: CVPR, 2015, pp. 1592–1599.
[9] J. Zbontar, Y. LeCun, Stereo matching by training a convolutional neural network to compare image patches, JMLR 17 (2016) 1–32.
[10] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, IJCV 47 (1–3) (2002) 7–42.
[11] H. Hirschmuller, D. Scharstein, Evaluation of stereo matching costs on images with radiometric differences, PAMI 31 (9) (2009) 1582–1599.
[12] R. Haeusler, R. Nair, D. Kondermann, Ensemble learning for confidence measures in stereo vision, in: CVPR, 2013, pp. 305–312.
[13] M.-G. Park, K.-J. Yoon, Leveraging stereo matching with learning-based confidence measures, in: CVPR, 2015, pp. 101–109.
[14] K. Yamaguchi, D. McAllester, R. Urtasun, Efficient joint segmentation, occlusion labeling, stereo and flow estimation, in: ECCV, 2014, pp. 756–771.
[15] M.G. Mozerov, J. van de Weijer, Accurate stereo matching by two-step energy minimization, TIP 24 (3) (2015) 1153–1163.
[16] R. Szeliski, D. Scharstein, Sampling the disparity space image, PAMI 26 (3) (2004) 419–425.
[17] F. Liu, C. Shen, G. Lin, Deep convolutional neural fields for depth estimation from a single image, in: CVPR, 2015, pp. 5162–5170.
[18] W. Luo, A.G. Schwing, R. Urtasun, Efficient deep learning for stereo matching, in: CVPR, 2016, pp. 5695–5703.
[19] Z. Chen, X. Sun, L. Wang, Y. Yu, C. Huang, A deep visual correspondence embedding model for stereo matching costs, in: ICCV, 2015, pp. 972–980.
[20] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, T. Brox, A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation, in: CVPR, 2016, pp. 4040–4048.
[21] A. Spyropoulos, N. Komodakis, P. Mordohai, Learning to detect ground control points for improving the accuracy of stereo matching, in: CVPR, 2014, pp. 1621–1628.
[22] R. Szeliski, Computer Vision: Algorithms and Applications, 1st ed., Springer-Verlag New York, Inc., New York, NY, USA, 2010.
[23] S. Zagoruyko, N. Komodakis, Learning to compare image patches via convolutional neural networks, in: CVPR, 2015, pp. 4353–4361.
[24] R. Garg, V. Kumar B.G., G. Carneiro, I. Reid, Unsupervised CNN for single view depth estimation: geometry to the rescue, in: ECCV, pp. 740–756.
[25] C. Godard, O. Mac Aodha, G.J. Brostow, Unsupervised monocular depth estimation with left-right consistency, arXiv:1609.03677v2, 2016.
[26] X. Hu, P. Mordohai, A quantitative evaluation of confidence measures for stereo vision, PAMI 34 (11) (2012) 2121–2133.
[27] C. Mostegel, M. Rumpler, F. Fraundorfer, H. Bischof, Using self-contradiction to learn confidence measures in stereo vision, in: CVPR, pp. 4067–4076.
[28] A. Seki, M. Pollefeys, Patch based confidence prediction for dense disparity map, in: BMVC, 2016.
[29] M. Poggi, S. Mattoccia, Learning from scratch a confidence measure, in: BMVC, 2016, pp. 4165–4175.
[30] R. Ranftl, S. Gehrig, T. Pock, H. Bischof, Pushing the limits of stereo using variational stereo estimation, in: Intelligent Vehicles Symposium, 2012, pp. 401–407.
[31] D. Ferstl, C. Reinbacher, R. Ranftl, M. Ruther, H. Bischof, Image guided depth upsampling using anisotropic total generalized variation, in: ICCV, 2013, pp. 993–1000.
[32] J.T. Barron, B. Poole, The fast bilateral solver, in: ECCV, pp. 617–632.
[33] C. Hane, L. Ladicky, M. Pollefeys, Direction matters: depth estimation with a surface normal classifier, in: CVPR, 2015, pp. 381–389.
[34] S. Galliani, K. Schindler, Just look at the image: viewpoint-specific surface normal prediction for improved multi-view reconstruction, in: CVPR, 2016, pp. 5479–5487.
[35] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: CVPR, 2015, pp. 3431–3440.
[36] D. Kingma, J. Ba, Adam: a method for stochastic optimization, arXiv:1412.6980, 2014.
[37] A. Levin, D. Lischinski, Y. Weiss, Colorization using optimization, TOG 23 (3) (2004) 689–694.
[38] D.J. Butler, J. Wulff, G.B. Stanley, M.J. Black, A naturalistic open source movie for optical flow evaluation, in: ECCV, 2012, pp. 611–625.
[39] Z. Ma, K. He, Y. Wei, J. Sun, E. Wu, Constant time weighted median filtering for stereo matching and beyond, in: ICCV, 2013, pp. 49–56.
[40] Q. Zhang, L. Xu, J. Jia, 100+ times faster weighted median filter (WMF), in: CVPR, 2014, pp. 2830–2837.
[41] M. Menze, A. Geiger, Object scene flow for autonomous vehicles, in: CVPR, 2015, pp. 3061–3070.
[42] F. Guney, A. Geiger, Displets: resolving stereo ambiguities using object knowledge, in: CVPR, 2015, pp. 4165–4175.


Feiyang Cheng received the B.S. degree in bio-medical engineering from Tianjin Medical University in 2010. He is currently a Ph.D. candidate in the Image Research Center, School of Astronautics at Beihang University, China. He was also a visiting Ph.D. student at National ICT Australia (NICTA) from 2014 to 2016. His research interests include stereo vision and semantic segmentation.

Xuming He is currently an Associate Professor in the School of Information Science and Technology at ShanghaiTech University. He received the Ph.D. degree in computer science from the University of Toronto (Canada) in 2008. He held a postdoctoral position at the University of California at Los Angeles (USA) from 2008 to 2010. After that, he joined National ICT Australia (NICTA) and was a Senior Researcher from 2013 to 2016. He was also an adjunct Research Fellow at the Australian National University from 2010 to 2016. His research interests include semantic image and video segmentation, 3D scene understanding, visual motion analysis, and efficient inference and learning in structured models.

Hong Zhang received the Ph.D. degree from Beijing Institute of Technology, China in 2002. She is currently a Professor at Beihang University. She was at the University of Pittsburgh as a visiting scholar from 2007 to 2008. Her research interests include activity recognition, image indexing, object detection and stereo vision.

