COST-AWARE DEPTH MAP ESTIMATION FOR LYTRO CAMERA∗
Min-Jung Kim, Tae-Hyun Oh, In So Kweon
Robotics and Computer Vision Laboratory, KAIST
ABSTRACT
Since commercial light field cameras became available, the
light field camera has aroused much interest from computer
vision and image processing communities due to its versa-
tile functions. Most of its special features are based on an
estimated depth map, so reliable depth estimation is a cru-
cial step. However, estimating depth on real light field cam-
eras is a challenging problem due to noise and short base-
lines among sub-aperture images. We propose a depth map
estimation method for light field cameras by exploiting corre-
spondence and focus cues. We aggregate costs among all the
sub-aperture images on cost volume to alleviate noise effects.
With efficiency of the cost volume, cost-aware depth estima-
tion is quickly achieved by discrete-continuous optimization.
In addition, we analyze each property of correspondence and
focus cues and utilize them to select reliable anchor points.
A well reconstructed initial depth map from the anchors is
shown to enhance convergence. We show our method out-
performs the state-of-the-art methods by validating it on real
datasets acquired with a Lytro camera.
Index Terms— Lytro, light field camera, depth map, cost
volume, discrete-continuous optimization
1. INTRODUCTION
Unlike conventional cameras, Light Field (LF) cameras can
capture both spatial and angular light information in only a
single shot. This feature has drawn much interest to LF cam-
eras, and has generated versatile applications like refocus-
ing [1, 2], segmentation [3], surface reconstruction [4], and
view perspective shifting [5]. Most of these applications are
based on depth estimation, so a reliable depth estimation pro-
cess is crucial for LF camera applications.
An early work of Adelson and Wang [5] mentioned the
possibility of estimating a depth map using a plenoptic cam-
era. Since the concept of the LF camera was introduced, there
have been several studies on estimating a reliable depth map
from an LF camera [6, 7, 8, 9]. Kim et al. [8] and Wanner et
al. [7, 10] analyzed the slope on an epipolar image domain to
estimate correspondence information. Yu et al. [6] exploited
a 3D line structure in ray space to constrain the solution space
∗THIS WORK WAS SUPPORTED BY THE NATIONAL RESEARCH FOUNDATION OF KOREA (NRF) GRANT FUNDED BY THE KOREA GOVERNMENT (MSIP) (NO. 2010-0028680).
Fig. 1: (a) A sample sub-aperture image. (b) Depth map that min-
imizes RGB-Gradient consistency. (c) Depth map that maximizes
defocus measure. (d) Initial depth map by our approach. (e) Final
depth map based on (d) as initial.
by linearly interpolating the disparity along a 3D line. Since
these methods [6, 7, 8] work on the epipolar image domain to
detect lines or to measure edge confidence, they can be vul-
nerable to noise. Furthermore, these methods are restricted to
synthetic or large baseline multi-camera systems. They have
rarely been tested for challenging micro-lens based LF cam-
eras such as Raytrix and Lytro, because these cameras have a short
baseline and noisy input images. A difference between
Raytrix and Lytro is the type of micro-lens array. Lytro uses
a single lens type array, while Raytrix uses a group lens type
array to provide extended DoF and to benefit depth
estimation. Lytro has simpler hardware and is at a disadvantage
for depth estimation, so this study targets Lytro [11].
Some studies [4, 9] have recently demonstrated
applications on real data acquired by micro-lens
LF cameras. Tao et al. [4] combined depth maps es-
timated from correspondence and defocus cues, while Per-
waß et al. [9] used only correspondence for triangulation.
However, Tao’s method does not consider the differences
between the cues: it uses the depth estimated from each cue as
is and combines them directly using confidence weights.
Thus, that method could result in artifacts in the final depth.
In this paper, we propose a reliable depth estimation
framework for micro-lens based LF cameras like Lytro. Sim-
ilar to conventional stereo problems, there are ambiguities
from the homogeneous regions and narrow baseline among
sub-aperture images. We alleviate the ambiguities by picking
out the pixels with reliable depth estimates as anchor pixels;
the anchor pixels are used to correct unreliable nearby pixels.
We analyze each characteristic of correspondence and focus
cues and exploit the cues to select reliable pixels consider-
ing their own properties. After estimating the anchor pixels,
unreliable depth values are corrected with the assumption
that nearby pixels with similar color values may have similar
Fig. 2: Illustration of cost volume. (a) Cost volume structure. (b)
Projection model. (c) Cost volume for defocus cue.
depth values. The assumption significantly improves convergence
of the following non-convex optimization step. Finally,
we estimate the final depth map by cost-aware optimization
on a cost-volume structure. By using the cost volume con-
cept and full 4D LF information, fast depth map estimation
is achieved along with alleviating noise effects. We validate
our performance on real data acquired by a Lytro camera, and
show that our method outperforms the state-of-the-art methods in
both computational efficiency and quality.
2. COST VOLUME CONSTRUCTION
The LF cameras provide multi-view images in a single shot
as sub-aperture images with coplanar image planes. To allevi-
ate noise effects, we fully utilize all the information from the
sub-aperture images simultaneously by an efficient 3D vol-
umetric structure, called cost volume [12]. Our framework
first performs a cost volume construction by aggregating all
the measurements.
The cost volume is defined by the voxel structure with the
size M × N × L, where M and N are the height and width
of the reference image (we choose the center view), and L
is the predefined number of depth hypotheses. The depth
candidates are non-linearly sampled on a ray that is a back-
projection ray at a pixel location of the reference view. The
cost volume configuration for the LF is illustrated in Fig. 2-
(a). To use the cost volume structure, the relationship among
sub-aperture images and the 3D space needs to be determined.
We use the calibration method suggested by Dansereau et
al. [13]. Here (i, j) denotes the index of the sub-aperture im-
age (i-th column and j-th row), and (k, l) denotes a pixel location
on the sub-aperture image designated by (i, j); we use
this notation throughout. Based on the geometric relationship
defined by [13], we can calculate a projected point (k′, l′) on
a sub-aperture image (i′, j′) from a 3D point (X, Y, Z) on a
ray defined by a pixel (k, l) of the reference image (i, j) by
the following equation:
$$
k' = \frac{X - (H_{11} i' + H_{15} + Z H_{13} i' + Z H_{35})}{H_{13} + Z H_{33}}, \qquad
l' = \frac{Y - (H_{22} j' + H_{25} + Z H_{42} j' + Z H_{45})}{H_{24} + Z H_{44}}, \tag{1}
$$
where Hpq denotes an entry at (p, q) of the calibration matrix
H obtained by [13]. Eq. (1) allows us to linearly calculate the
projection location of a 3D point defined in the 3D space of
which the origin is the camera center of the reference view
as shown in Fig. 2-(b). From here, we regard the correspon-
dences as known, given the known (i, j, k, l) and depth Z.
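As a minimal sketch, Eq. (1) can be transcribed term by term as written above. The function name `project_to_view`, the dictionary layout of `H` (keyed by 1-based row and column), and the toy calibration values are assumptions made for illustration; they are not taken from the calibration toolbox of [13].

```python
def project_to_view(H, X, Y, Z, i_p, j_p):
    """Project the 3D point (X, Y, Z) into sub-aperture view (i_p, j_p) via Eq. (1).

    H maps 1-based (row, column) pairs to calibration entries, e.g. H[(1, 3)]
    is H13. Only the entries that appear in Eq. (1) are read.
    """
    k_p = (X - (H[(1, 1)] * i_p + H[(1, 5)] + Z * H[(1, 3)] * i_p + Z * H[(3, 5)])) / (
        H[(1, 3)] + Z * H[(3, 3)]
    )
    l_p = (Y - (H[(2, 2)] * j_p + H[(2, 5)] + Z * H[(4, 2)] * j_p + Z * H[(4, 5)])) / (
        H[(2, 4)] + Z * H[(4, 4)]
    )
    return k_p, l_p


# Toy calibration entries (hypothetical values, chosen only to exercise the formula).
H_toy = {
    (1, 1): 1.0, (1, 3): 2.0, (1, 5): 0.0, (3, 3): 0.0, (3, 5): 0.0,
    (2, 2): 1.0, (2, 4): 4.0, (2, 5): 1.0, (4, 2): 0.0, (4, 4): 1.0, (4, 5): 0.0,
}
k_prime, l_prime = project_to_view(H_toy, X=10.0, Y=11.0, Z=1.0, i_p=2, j_p=0)
```

In a real pipeline the returned (k′, l′) is generally non-integer, so the sub-aperture image would have to be interpolated at that location.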
The centroid of each voxel element corresponds to a hy-
pothetical 3D point. When we project a 3D point to all the
sub-aperture images, if the projected points have some consis-
tency on the image (such as color), the hypothetical 3D point
is regarded as reliable. From these observations, we measure
the RGB color and color-gradient consistency as cost and ag-
gregate each cost into a voxel at pixel p and depth d as:
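$$
C_c(p, d) = \frac{1}{n(S)} \sum_{(i', j') \in S} \left| a(i_r, j_r, k, l) - a(i', j', k', l') \right|, \tag{2}
$$
where p = (k, l) is a pixel of the reference image indexed by $(i_r, j_r)$;
$S = \{(i', j') \mid 0 \le k'_{i',j'} < N,\ 0 \le l'_{i',j'} < M\}$; n(·) denotes
the number of elements in a set; and a(i, j, k, l) is a feature
vector defined as
$$
a = \left[ R, G, B, \frac{\partial R}{\partial x}, \frac{\partial G}{\partial x}, \frac{\partial B}{\partial x}, \frac{\partial R}{\partial y}, \frac{\partial G}{\partial y}, \frac{\partial B}{\partial y} \right],
$$
in which R, G, B denote the color intensities. Intuitively, the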
depth with the minimum cost value along the d axis can be
regarded as a strong candidate. We will refer to Cc as the
consistency cost (volume).
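The aggregation in Eq. (2) for a single voxel can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the |·| in Eq. (2) is taken to be the L1 norm of the feature difference, and a voxel seen by no view is assigned an infinite cost.

```python
def consistency_cost(ref_feature, view_features):
    """Eq. (2) for one voxel: mean feature difference over the valid views.

    ref_feature   -- feature vector a(i_r, j_r, k, l) of the reference pixel
    view_features -- one entry per sub-aperture view: the feature vector sampled
                     at the projection (k', l'), or None when the projection
                     falls outside the image (such views are not in the set S)
    """
    valid = [f for f in view_features if f is not None]
    if not valid:
        return float("inf")  # assumption: no evidence rejects the hypothesis
    total = 0.0
    for feat in valid:
        # assumption: |.| is the L1 norm of the feature difference
        total += sum(abs(r - v) for r, v in zip(ref_feature, feat))
    return total / len(valid)


def best_depth_index(costs_along_d):
    """Index of the minimum-cost depth hypothesis along the d axis."""
    return min(range(len(costs_along_d)), key=costs_along_d.__getitem__)
```

Calling `best_depth_index` on the costs of one pixel gives the winner-take-all depth that the text describes as a strong candidate.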
While the previous cost volume only incorporates con-
sistency information, the LF configuration can also provide
focus information. We construct another cost volume for a
focus cue to achieve stable depth estimation. The cost vol-
ume for the focus cue is constructed by accumulating local
sharpness information. To measure local sharpness, Sum-
Modified-Laplacian (SML) [14] is used. The cost volume for
the focus cue is defined as the following equation:
$$
C_f(p, d) = \sum_{q \in N(p)} \delta(ML(q, d) \ge T) \cdot ML(q, d), \tag{3}
$$
where δ(·) represents the indicator function which returns 1 if
the argument is true, otherwise 0, N(p) is a set of neighbors
within a radius R, and T denotes a predefined threshold. This
is illustrated in Fig. 2-(c). ML(·) is defined in [14]. This
measure is computed for all the generated refocused images I_d, each
specified at depth d. An intensity value of the refocused image
is obtained by averaging the intensities of the 2D points projected
from a 3D point at depth d onto each sub-aperture image.
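The focus cost of Eq. (3) can be sketched on a single refocused image as below. The modified Laplacian follows the definition in [14]; the square window standing in for the neighborhood N(p), the border clamping, the default parameters, and the function names are assumptions made for this example.

```python
def modified_laplacian(img, y, x, step=1):
    """Modified Laplacian of [14]: sum of absolute second differences in x and y."""
    h, w = len(img), len(img[0])

    def at(r, c):
        # assumption: coordinates outside the image are clamped to the border
        return img[min(max(r, 0), h - 1)][min(max(c, 0), w - 1)]

    center = at(y, x)
    return (abs(2 * center - at(y, x - step) - at(y, x + step))
            + abs(2 * center - at(y - step, x) - at(y + step, x)))


def focus_cost(img, y, x, radius=1, threshold=0.0):
    """Eq. (3): sum of the ML values that reach the threshold T inside the window."""
    h, w = len(img), len(img[0])
    total = 0.0
    for r in range(max(0, y - radius), min(h, y + radius + 1)):
        for c in range(max(0, x - radius), min(w, x + radius + 1)):
            ml = modified_laplacian(img, r, c)
            if ml >= threshold:
                total += ml
    return total


flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
spike = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

Evaluating `focus_cost` on the refocused image of every depth candidate fills one column of the focus cost volume. A flat patch yields zero for every candidate, which is the property used later to reject homogeneous pixels.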
3. INITIAL DEPTH MAP ESTIMATION
We estimate a depth map by optimizing it in a cost-aware
manner. It is typically a non-convex optimization procedure,
so the initial guess is important to obtain a fine depth map.
Tao et al. [4] used two depth maps estimated from focus and
correspondence independently as initial depth maps. How-
ever, we found that consistency (correspondence) and focus
cues should be treated differently considering each character-
istic. As shown in Fig. 3, we observe that the defocus cost
in (b) only provides a rough bound of possible depth values
Fig. 3: Analysis of consistency and focus costs. (a) The blue cir-
cle denotes the selected pixel to inspect two different costs for each
depth candidate. (b) The cost obtained from the defocus cue. (c) The
cost obtained from the consistency cue. (d) Green circles denote the se-
lected pixels for (e,f). (e) The consistency cost of pixel A. (f) The
consistency cost of pixel B. The red curves are quadratic curves
locally fitted near the minimum.
rather than an accurate extremum point like the consistency
cost in (c). Depth maps obtained by searching extrema of two
cues are shown in Fig. 1-(b) and (c). The consistency cost
provides accurate depth values on pixels with distinctive fea-
tures, but depth values in homogeneous regions are quite
noisy and unreliable. From these observations, we decided to
use the defocus cue as a guide. Even in a weakly textured re-
gion, the defocus cue can estimate a reliable bound of depth
because SML is measured on a local region. In a homo-
geneous region, SML costs are zero for all depth candidates,
so unreliable depth in such a region can easily be rejected
without any heuristic threshold.
Both initial depth maps from two cues are vulnerable in
homogeneous regions. Such unreliable depth could disturb
estimating the fine depth map (e.g. the 2nd column of Fig. 4-
(b)). We select reliable pixels as anchors and use them to correct
neighboring pixels. The anchor points are selected by the fol-
lowing criteria:
1) Pixels with zero SML costs for all depth candidates are filtered out.
2) If the depth estimated by the consistency cost does not fall into the bound specified by the focus cue, the pixel is filtered out.
3) While the consistency cost on a distinctive feature point shows a sufficiently convex shape, the cost on a weakly textured region has a broad shape, as shown in Fig. 3-(e) and (f). We fit a quadratic function near the minimum of the cost and measure the variance (the inverse of the coefficient of the quadratic term). If the variance is large, the pixel is filtered out.
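The three criteria can be sketched for one pixel as follows. Several details are assumptions, since the text does not specify them: the focus bound is modeled as the set of depth candidates whose focus cost reaches a fraction `bound_ratio` of the peak, the quadratic is fitted through the three samples around the minimum, and `max_variance` as well as the function names are hypothetical.

```python
def quadratic_variance(costs, m):
    """Inverse of the quadratic coefficient of the parabola through samples m-1, m, m+1."""
    a = (costs[m - 1] - 2.0 * costs[m] + costs[m + 1]) / 2.0
    return float("inf") if a <= 0 else 1.0 / a


def is_anchor(consistency, focus, bound_ratio=0.5, max_variance=1.0):
    """Apply the three anchor criteria to the per-depth costs of one pixel."""
    # 1) zero SML cost for all depth candidates: homogeneous, reject
    peak = max(focus)
    if peak == 0:
        return False
    # depth index that minimizes the consistency cost
    m = min(range(len(consistency)), key=consistency.__getitem__)
    # 2) the estimate must fall inside the bound given by the focus cue
    #    (assumption: bound = candidates with focus cost >= bound_ratio * peak)
    if focus[m] < bound_ratio * peak:
        return False
    # 3) the cost must be sharply peaked around its minimum
    if m == 0 or m == len(consistency) - 1:
        return False  # assumption: no three-point fit is possible at the ends
    return quadratic_variance(consistency, m) <= max_variance
```

A sharply peaked cost such as `[5, 1, 5]` passes the third test with variance 0.25, while a broad cost such as `[1.2, 1.0, 1.2]` gives a variance near 5 and is rejected.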
Most of the pixels in the homogeneous region are filtered
out in the previous step. To get a holistic initial depth map,
depth values in homogeneous regions also need to be esti-
mated. One reasonable assumption is that local color con-
sistent regions may have similar depth values and the depth
may vary smoothly in a homogeneous region [15]. Under this
assumption, we propagate the depth of anchor points along
the color consistent regions. Inspired by Levin et al. [16], we
                        Yu et al. [6]   Tao et al. [4]   Ours
Environment             Matlab          C++              Matlab
Execution time (min)    12              25               9.6 (w/o parfor), 6 (with parfor)
Continuous depth        X               O                O
Metric depth            O               X                O
Table 1: Performance comparison
formulate this as a depth propagation problem:
$$
\min_D \sum_p \Big( D(p) - \sum_{q \in N(p)} w_{pq} D(q) \Big)^2, \tag{4}
$$
where N(p) is a set of neighboring pixels of p. Our inten-
tion is that Eq. (4) encourages depth values of two neigh-
bors to be similar if their colors are similar. Thus, we define
the weight as $w_{pq} = \exp(-\|K(p) - K(q)\| / 2\sigma^2)$, where
K = [R, G, B] at the pixel. Given a set of pixels p_i with reli-
able depth D_i, Eq. (4) is minimized subject to the constraints
D(p_i) = D_i. Since Eq. (4) is quadratic and the constraints are
linear, this optimization can be solved in closed
form by pseudo-inversion. The propagated initial
depth map is shown in Fig. 1-(d).
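A small sketch of the constrained least-squares problem in Eq. (4), written for a 1D row of pixels with left and right neighbors. It relies on NumPy; the normalization of the weights so that they sum to one per pixel follows the colorization formulation of [16] and is an assumption here, as are the function name and the value of sigma.

```python
import numpy as np


def propagate_depth(colors, anchors, sigma=0.1):
    """Minimize Eq. (4) on a 1D pixel row subject to D(p_i) = D_i at the anchors.

    colors  -- (n, 3) sequence of RGB values
    anchors -- dict {pixel index: reliable depth}
    """
    colors = np.asarray(colors, dtype=float)
    n = len(colors)
    A = np.eye(n)  # row p encodes D(p) - sum_q w_pq D(q)
    for p in range(n):
        nbrs = [q for q in (p - 1, p + 1) if 0 <= q < n]
        w = np.array([np.exp(-np.linalg.norm(colors[p] - colors[q]) / (2 * sigma**2))
                      for q in nbrs])
        w /= w.sum()  # assumption: weights normalized per pixel, as in [16]
        A[p, nbrs] -= w
    known = sorted(anchors)
    unknown = [p for p in range(n) if p not in anchors]
    d_known = np.array([anchors[p] for p in known], dtype=float)
    # Move the anchored columns to the right-hand side and solve in the
    # least-squares sense, which is the pseudo-inverse solution.
    d_unknown, *_ = np.linalg.lstsq(A[:, unknown], -A[:, known] @ d_known, rcond=None)
    depth = np.empty(n)
    depth[known] = d_known
    depth[unknown] = d_unknown
    return depth


row = [[1, 0, 0]] * 3 + [[0, 0, 1]] * 3  # three red pixels, then three blue ones
depth = propagate_depth(row, {0: 1.0, 5: 3.0})
```

In this toy row the anchor depths spread within each color segment and the depth jumps at the color edge. For a full image the matrix is large and sparse, so a sparse solver would be the practical choice.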
4. DISCRETE-CONTINUOUS OPTIMIZATION
The estimated initial depth map looks plausible, but the depth
map obtained by Eq. (4) only depends on a color-aware
smoothness prior and does not consider any cost of the esti-
mated region. We apply another refinement step to estimate
depth in a cost-aware manner.
Since our depth map should be at minima of the consis-
tency cost volume and the neighbor depth should also be sim-
ilar, we formulate the following objective function:
$$
\min_d \sum_p \big( \lambda_s w_g(p) \|\nabla d(p)\|_H + w_r(p) C_c(p, d) \big), \tag{5}
$$
where ‖·‖_H is the Huber function [17], and w_g and w_r are weights for
anisotropic smoothness and for alleviating the cost of unreli-
able pixels, respectively. We define the weights as
$w_g(p) = \exp(-\|\nabla K(p)\|_2^2 / \sigma_g)$ to encourage depth discontinuity at
color inconsistent regions and w_r(p) = 1 if p is classified as
reliable, otherwise 1/1000.
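The per-pixel terms of Eq. (5) can be sketched as follows. The Huber threshold `eps`, the particular Huber form (quadratic below `eps`, linear above), the parameter defaults, and the function names are assumptions; the text only cites [17] and does not give these values.

```python
import math


def huber(x, eps=0.01):
    """Huber function: quadratic for |x| <= eps, linear beyond (assumed form)."""
    ax = abs(x)
    return ax * ax / (2.0 * eps) if ax <= eps else ax - eps / 2.0


def smoothness_weight(color_gradient, sigma_g=0.1):
    """w_g(p) = exp(-||grad K(p)||_2^2 / sigma_g): small across strong color edges."""
    return math.exp(-sum(g * g for g in color_gradient) / sigma_g)


def reliability_weight(is_reliable):
    """w_r(p) = 1 for reliable pixels, 1/1000 otherwise."""
    return 1.0 if is_reliable else 1.0 / 1000.0


def pixel_energy(depth_gradient_norm, cost, color_gradient, is_reliable, lam=1.0):
    """One summand of Eq. (5): weighted smoothness plus weighted consistency cost."""
    return (lam * smoothness_weight(color_gradient) * huber(depth_gradient_norm)
            + reliability_weight(is_reliable) * cost)
```

With these weights, a depth jump is penalized less where the color gradient is large, and the data term of an unreliable pixel contributes only a thousandth of its cost, so the smoothness term dominates there.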
In Eq. (5), minimizing Cc(·) is a discrete optimization
problem and the smoothness term is defined in the continuous
domain, so optimizing Eq. (5) simultaneously is intractable.
We instead add an additional penalty term with an auxiliary
variable z to split the discrete and continuous parts. Then,
Eq. (5) is converted to
$$
\sum_p \Big( \lambda_s w_g(p) \|\nabla z(p)\|_H + w_r(p) C_c(p, d) + \frac{1}{2\theta} \|z - d\|_2^2 \Big). \tag{6}
$$
As θ → 0, the penalty term must be almost 0 and Eq. (6)
becomes Eq. (5). We solve Eq. (6) by alternation: we optimize
one variable while fixing the other, and vice versa.
The sub-problem in z can be solved efficiently by a con-
ventional primal-dual method [18]. The sub-problem in d can
Fig. 4: Experimental results on various real datasets. (a) Comparisons with other methods. Panel labels: captured scene, Wanner et al. (CVPR 2012), Yu et al. (ICCV 2013), Tao et al. (ICCV 2013), proposed. (b) Self-evaluation of the proposed method. Panel labels: captured scene, depth from RGB-gradient cost, depth from initial guess, 3D reconstruction.
be solved by an exhaustive search over all discrete depth can-
didates. Searching exhaustively at every iteration is time
consuming, so we apply an acceleration technique sug-
gested by Newcombe et al. [19]. Following their result, we
only need to search within the theoretical range
$$
d \in \Big[\, z - 2\theta \sqrt{C_c^{\max}(p) - C_c^{\min}(p)},\ z + 2\theta \sqrt{C_c^{\max}(p) - C_c^{\min}(p)} \,\Big],
$$
where $C_c^{\max}(p)$ and $C_c^{\min}(p)$ are the maximum and minimum
consistency costs at pixel p over the depth candidates.
This range significantly reduces the number of candidates
that must be inspected and prevents undesirable local minima.
At every iteration we update θ by multiplying it by ρ = 10^{-3}:
θ_{t+1} = ρ θ_t. The iteration is terminated when ‖d − z‖_2
becomes almost 0 or θ is smaller than 10^{-7}.
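The update of d for one pixel and the θ schedule can be sketched as below. The search radius is transcribed from the range stated in the text; the initial value `theta0`, the fallback to the nearest candidate when the window contains no sample, and the function names are assumptions made for this example.

```python
import math


def search_radius(costs, theta):
    """Half-width of the search range as stated in the text: 2*theta*sqrt(Cmax - Cmin)."""
    return 2.0 * theta * math.sqrt(max(costs) - min(costs))


def update_d(depths, costs, z, theta, w_r=1.0, radius=None):
    """Pick the depth candidate minimizing w_r * C(d) + (z - d)^2 / (2 * theta)."""
    if radius is None:
        radius = search_radius(costs, theta)
    best_d, best_e = None, math.inf
    for d, c in zip(depths, costs):
        if abs(d - z) > radius:
            continue  # outside the restricted range, skip
        e = w_r * c + (z - d) ** 2 / (2.0 * theta)
        if e < best_e:
            best_d, best_e = d, e
    if best_d is None:
        # assumption: if no candidate lies inside the range, take the nearest one
        best_d = min(depths, key=lambda d: abs(d - z))
    return best_d


def theta_schedule(theta0=1.0, rho=1e-3, theta_min=1e-7):
    """Yield theta_t with theta_{t+1} = rho * theta_t until theta drops below theta_min."""
    theta = theta0
    while theta >= theta_min:
        yield theta
        theta *= rho
```

A full solver would loop over `theta_schedule()`, alternating a primal-dual update of z [18] with `update_d` at every pixel, and stop early once d and z agree.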
5. EXPERIMENTAL RESULTS
We validate our method by comparing it with the state-of-the-art
methods on real datasets acquired by Lytro (Fig. 4-(a)) and
provide a self-evaluation to analyze dependency on the initial
depth map (Fig. 4-(b)). For fair comparison, we adjust the
parameters of all the methods to have the same depth space:
the step of disparity is 0.02 pixels and the same maximum
disparity parameter is set over all the methods. Default values
are used for other parameters. We also maintain all the same
parameters for all the experiments. Yu’s method [6] returns a
disparity as an output. We convert the disparity information
into depth with the known calibration. Tao et al. [4] calcu-
lated relative depth as an output, so we adjust the depth scale
for visually easy comparison.
The performance comparisons are summarized in Table 1. The
proposed method is the fastest despite its unoptimized Matlab im-
plementation. Since Yu’s method is based on graph-cut, the
final depth map is discrete, while our depth map is continuous
by virtue of discrete-continuous optimization.
In Fig. 4-(a), the results of Yu et al. have clear depth dis-
continuity preserved at the boundary of the object. However,
since their method depends on line detection, depth reversal
effects are observed (indicated by red circles). In addition, a
discrete labeling scheme produces severely quantized depth.
Although Tao et al. took advantage of the depth from defocus,
their method had overall artifacts in an incorrectly estimated
region (e.g. far distant or homogeneous regions) by defocus
(orange circles). The method of Wanner et al. [7] works well in textured re-
gions, but is vulnerable in homogeneous regions. Even though
there are some depth bleeding effects caused by calibration
error at the boundary of the object, our method overall shows
continuous and stable depth estimation results.
Fig. 4-(b) presents the self-evaluation, which shows the impor-
tance of the initial guess. The second and third columns depict
the depth map obtained by discrete-continuous optimization
with different initializations. The former uses the initial depth map
that minimizes RGB-gradient consistency cost directly, while
the latter is based on the initial estimated by our approach in
Sec. 3. As mentioned, the initial acquired by propagating reli-
able depth values plays an important role in our optimization.
The proposed initialization method provides more plausible
results with fewer artifacts. We additionally provide 3D re-
construction results as shown in the rightmost column of (b).
Our method produces continuous and metric depth, so plausi-
ble surface reconstruction can be achieved.
6. CONCLUSION
We proposed a stable depth estimation method for Lytro. We
estimated an initial depth map by propagating reliable depth
values filtered by the focus cue and the level of texture with
the assumption that depth varies smoothly in color consistent
regions. The reliable initial depth map enhanced the final
solution of the discrete-continuous optimization. The effi-
ciency and quality of our method were validated on various
real datasets. In a future study, we will extend our framework
to video sequences from an LF camera to enhance the absolute
quality by incorporating multi-view measurements.
7. REFERENCES
[1] Ren Ng, Digital light field photography, Ph.D. thesis,
Stanford University, 2006.
[2] K. Mitra and A. Veeraraghavan, “Light field denoising,
light field superresolution and stereo camera based refocussing
using a gmm light field patch prior,” in IEEE Conference on
Computer Vision and Pattern Recognition Workshop (CVPRW), 2012.
[3] S. Wanner, C. Straehle, and B. Goldluecke, “Globally
consistent multi-label assignment on the ray space of 4d
light fields,” in IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2013.
[4] M. W. Tao, S. Hadap, J. Malik, and R. Ramamoorthi,
“Depth from combining defocus and correspondence using
light-field cameras,” in IEEE International Conference
on Computer Vision (ICCV), 2013.
[5] E. H. Adelson and J. Y. Wang, “Single lens stereo with a
plenoptic camera,” IEEE Transactions on Pattern Analysis
and Machine Intelligence (TPAMI), vol. 14, no. 2,
pp. 99–106, 1992.
[6] Z. Yu, X. Guo, and J. Yu, “Line assisted light field
triangulation and stereo matching,” in IEEE International
Conference on Computer Vision (ICCV), 2013.
[7] S. Wanner and B. Goldluecke, “Globally consistent
depth labeling of 4d light fields,” in IEEE Conference
on Computer Vision and Pattern Recognition (CVPR), 2012.
[8] C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung,
and M. Gross, “Scene reconstruction from high spatio-angular
resolution light fields,” ACM Transactions on
Graphics (SIGGRAPH), vol. 32, no. 4, pp. 73:1–73:12,
2013.
[9] C. Perwaß and L. Wietzke, “Single lens 3d-camera with
extended depth-of-field,” in Proceedings of the Conference
on Society of Photo-Optical Instrumentation Engineers (SPIE), 2012.
[10] S. Wanner and B. Goldluecke, “Variational light field
analysis for disparity estimation and super-resolution,”
IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI), 2013.
[11] R. Ng, “Lytro official homepage,” https://www.lytro.com/about/, 2012.
[12] Christoph Rhemann, Asmaa Hosni, Michael Bleyer,
Carsten Rother, and Margrit Gelautz, “Fast cost-volume
filtering for visual correspondence and beyond,”
in IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2011.
[13] D. G. Dansereau, O. Pizarro, and S. B. Williams, “Decoding,
calibration and rectification for lenselet-based
plenoptic cameras,” in IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2013.
[14] S. K. Nayar and Y. Nakagawa, “Shape from focus,”
IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI), vol. 16, no. 8, pp. 824–831, 1994.
[15] J. Park, H. Kim, Y.-W. Tai, M. S. Brown, and I. Kweon,
“High quality depth map upsampling for 3d-tof cameras,”
in IEEE International Conference on Computer
Vision (ICCV), 2011.
[16] A. Levin, D. Lischinski, and Y. Weiss, “Colorization
using optimization,” ACM Transactions on Graphics
(TOG), vol. 23, no. 3, pp. 689–694, 2004.
[17] Peter J Huber, “Robust estimation of a location parameter,”
The Annals of Mathematical Statistics, vol. 35, no.
1, pp. 73–101, 1964.
[18] A. Chambolle, V. Caselles, D. Cremers, M. Novaga, and
T. Pock, “An introduction to total variation for image
analysis,” in Theoretical Foundations and Numerical
Methods for Sparse Recovery. De Gruyter, 2010.
[19] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison,
“DTAM: Dense tracking and mapping in real-time,”
in IEEE International Conference on Computer
Vision (ICCV), 2011.