Accurate and real-time depth video acquisition using Kinect–stereo camera fusion

Williem,a Yu-Wing Tai,b and In Kyu Parka,*

aInha University, Department of Information and Communication Engineering, 100 Inha-ro, Nam-gu, Incheon 402-751, Republic of Korea
bKorea Advanced Institute of Science and Technology, Department of Electrical Engineering, 291 Daehak-ro, Yuseong-gu, Daejeon 305-701, Republic of Korea

*Address all correspondence to: In Kyu Park, E-mail: [email protected]

Abstract. This paper presents a Kinect–stereo camera fusion system that significantly improves the accuracy of depth map acquisition. The typical Kinect depth map suffers from missing depth values and errors resulting from a single Kinect input. To ameliorate such problems, the proposed system couples a Kinect with a stereo RGB camera to provide an additional disparity map. The Kinect depth map and the disparity map are efficiently fused in real time by exploiting a spatiotemporal Markov random field framework on a graphics processing unit. An efficient temporal data cost is proposed to maintain the temporal coherency between frames. We demonstrate the performance of the proposed system on challenging real-world examples. Experimental results confirm that the proposed system is robust and accurate in depth video acquisition. © 2014 Society of Photo-Optical Instrumentation Engineers (SPIE) [DOI: 10.1117/1.OE.53.4.043110]

Keywords: Kinect–stereo fusion; stereo matching; depth correspondence; three-dimensional computer vision; real-time stereo matching.

Paper 140121 received Jan. 22, 2014; revised manuscript received Mar. 24, 2014; accepted for publication Mar. 26, 2014; published online Apr. 30, 2014.

1 Introduction

Accurate depth (disparity) map acquisition is an important task in computer vision. Early works in stereo matching estimate the disparity map of a scene by measuring pixel correspondences in stereo image pairs. Recently, the introduction of consumer depth cameras, such as Microsoft Kinect, has led to a revolution in depth map acquisition. Kinect utilizes a structured light technique, projecting speckle patterns in the infrared spectrum to capture depth maps in real time at a resolution of up to 1280 × 1024. However, depth maps acquired by a Kinect typically contain holes and errors which cannot be easily resolved from a single Kinect input. In this paper, we propose a Kinect–stereo camera fusion system that fuses a disparity map from a stereo camera and a depth map from a Kinect to achieve accurate depth map acquisition.

We formulate the disparity map and depth map fusion problem in a spatiotemporal Markov random field (MRF) framework. The proposed framework utilizes three different data costs, namely the disparity data cost, the Kinect data cost, and the temporal data cost. The temporal data cost ensures the temporal consistency of the acquired depth maps. Since real-time acquisition is an important criterion in depth video acquisition, our solution is formulated to facilitate an efficient graphics processing unit (GPU) implementation by using hierarchical belief propagation (HBP) to solve a two-dimensional (2-D) MRF discrete labeling problem on each frame individually. The novel temporal data cost contributes to an efficient GPU implementation because it reduces the computational cost in comparison with the temporal smoothness term used in Zhu's algorithm.1 Although this work is more applied in nature, the result is a system that is able to capture highly accurate depth video in real time in a streaming fashion. We describe the components and implementation issues of the proposed system, and assess its performance in challenging real-world experiments. Experimental results demonstrate that the proposed framework is efficient and effective and is able to handle cases where an individual acquisition by either Kinect or stereo camera fails.

This paper is organized as follows. In Sec. 2, we review the related works. Section 3 describes the proposed Kinect–stereo camera fusion system. The spatiotemporal MRF framework is presented in detail in Sec. 4. Section 5 describes the GPU implementation of the HBP algorithm to obtain an optimal disparity result. Section 6 provides the experimental results. Finally, we give concluding remarks in Sec. 7.

2 Related Work

Our work is related to real-time stereo matching2–8 and fusion techniques for depth map enhancement.1,9–18

Real-time performance in stereo matching is mostly achieved by using embedded hardware, such as GPUs and field-programmable gate arrays (FPGAs). Kowalczuk et al. proposed a real-time stereo matching framework which utilizes adaptive support-weight aggregation and a low-complexity iterative disparity refinement technique.5 A real-time local stereo matching using guided image filtering was proposed by Hosni et al.3 Instead of using bilateral filtering to smooth the cost space, a guided filter was utilized to obtain better performance in terms of computational time. Zhang et al. proposed a GPU-oriented bitwise fast voting method to achieve real-time performance.8 Richardt et al. proposed a dual-cross-bilateral grid method to smooth their cost space while preserving edges in both input images.4 A near real-time stereo matching based on anisotropic diffusion was proposed by De-Maeztu et al.6 Yang et al. proposed a method which can handle weakly textured scenes.7 An implementation of the HBP on a GPU was proposed by Yang et al. for global stereo matching.2 Table 1 summarizes the performance of the state-of-the-art real-time stereo matching algorithms in terms of their running time in frames per second (fps) and in million disparity evaluations per second (MDe/s = width × height × disparity labels × fps). Compared with offline stereo matching techniques,19 these real-time techniques have lower accuracy in their estimated disparity maps and might even contain errors to a large extent. However, they are still favored in many applications where real-time processing is needed. An evaluation of the performance of several image processing algorithms on GPU implementations was done by Park et al.20
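As a quick illustration of the throughput metric used in Table 1, the following sketch (not from the original paper) evaluates MDe/s for the Kowalczuk et al. configuration; the small discrepancy with the tabulated value comes from rounding of the reported fps.

```python
def mde_per_sec(width, height, labels, fps):
    """Million disparity evaluations per second (MDe/s)."""
    return width * height * labels * fps / 1e6

# Kowalczuk et al. row of Table 1: 320 x 240 image, 32 disparity labels, 62 fps.
print(mde_per_sec(320, 240, 32, 62))  # ~152.4 MDe/s (152.3 in Table 1 after fps rounding)
```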

In fusion techniques for depth map enhancement, recent advances use an additional RGB image to denoise and upsample a depth map.9–11,21 With an RGB image whose resolution and signal-to-noise ratio are higher than those of a depth map, a direct approach is to apply a joint bilateral filter9,10 or optimization11,21 using the RGB image to guide the neighborhood smoothness term. Although these methods can produce good results, they are likely to overly smooth depth details, especially in hole regions where the depth values are missing in the initial depth map. Considering the limitation of using a single RGB image in depth map enhancement, Zhu et al. proposed not only a method to calibrate a time-of-flight (ToF) camera and a stereo color camera but also an MRF framework to fuse depth maps from the ToF camera and disparity maps from the stereo camera.12 They also considered spatiotemporal fusion by adding a temporal smoothness term.1 Nair et al. solved the fusion problem with a variational approach using a high-order total variation regularization.13 Another ToF and stereo camera fusion method for depth image upsampling was proposed by Mutto et al., which utilizes bilateral filtering and image segmentation.14 Wang and Jia designed a data term based on visibility and pixelwise noise of the Kinect depth map for Kinect and stereo camera fusion.15 Chiu et al. proposed cross-modal stereo matching between the Kinect RGB camera and IR camera to improve the Kinect depth map.16

Comparing the proposed approach with the previous works, especially the works that utilize disparity map and depth map fusion, the proposed approach focuses on practical real-time depth video acquisition. Instead of using a high quality disparity map in fusion, we utilize a real-time stereo matching technique with a simple data cost that has lower resolution in the disparity map for real-time fusion. In addition, unlike the previous frame-by-frame individual fusion methods,9–11,13–16,21 we include the temporal data cost in the proposed framework to avoid temporal flickering. The spatiotemporal MRF framework is implemented as a 2-D MRF framework, which is more efficient than the previous three-dimensional (3-D) MRF framework in Zhu's algorithms.1,12 The proposed system also demonstrates higher quality depth video acquisition in comparison with the previous real-time methods.

3 Kinect–Stereo Camera Fusion

We build the Kinect–stereo camera fusion system by connecting two Kinects as illustrated in Fig. 1. The master Kinect captures the depth map and the right RGB image of the stereo camera, and the slave Kinect captures the left RGB image of the stereo camera. The two RGB cameras are separated by 5 cm and are located on the left- and the right-hand side of the IR camera of the master Kinect. The IR projector and the IR camera in the slave Kinect are disabled due to the interference of IR projector patterns. We build the Kinect–stereo camera fusion system in this way because we can fully utilize the Kinect Software Development Kit (SDK) (http://msdn.microsoft.com/en-us/library/jj663856.aspx) and the standard rectification tools (http://www.vision.caltech.edu/bouguetj/calib_doc/) for calibration. Note that the captured depth map and the stereo RGB image pairs are already fully calibrated photometrically and geometrically in the SDK. In addition, the cameras are fully controlled at the software level. Thus, although we do not synchronize the captures of the master and the slave Kinect at the hardware level, they are synchronized at the software level. For the sake of completeness, we refer readers to Zhu's algorithm12 for details of photometric and geometric calibration in building a similar system by fusing a ToF depth camera and a stereo camera.

4 Spatiotemporal MRF Framework

4.1 Overview

The pipeline of the proposed framework is presented in Fig. 2. After capturing the raw data from the proposed fusion system, we apply Gaussian smoothing to filter image noise.

Table 1 State-of-the-art real-time stereo matching algorithms. The comparison is done by borrowing the results from the original papers; therefore, the results are generated from different hardware and resources.

Reference          | Image size | Max disparity | fps | MDe/s | Platform | GPU device
Kowalczuk et al.5  | 320 × 240  | 32            | 62  | 152.3 | CPU–GPU  | GeForce GTX 580
Richardt et al.4   | 480 × 270  | 40            | 14  | 72.6  | CPU–GPU  | Quadro FX 5800
Hosni et al.3      | 640 × 480  | 26            | 25  | 199.7 | CPU–GPU  | GeForce GTX 480
Yang et al.2       | 320 × 240  | 16            | 16  | 19.7  | CPU–GPU  | GeForce GTX 7900
Zhang et al.8      | 320 × 240  | 16            | 87  | 106.9 | CPU–GPU  | GeForce GTX 8800
De-Maeztu et al.6  | 384 × 288  | 16            | 16  | 28.3  | CPU–GPU  | GeForce GTX 480
Yang et al.7       | 512 × 384  | 48            | 18  | 169.8 | CPU–GPU  | GeForce GTX 8800


The real-time optical flow algorithm by Bruhn et al.22 (implemented on GPU) is adopted to establish correspondences between the current frame and the previous and the next frames. Since the proposed algorithm uses the previous and the next frames to define the temporal data cost, the belief costs of the previous frame, the current frame, and the next frame are cached in GPU memory together with the pixel correspondences across the three frames. The HBP is applied to optimize the MRF, and the information of the previous frame is discarded after processing the current frame. When processing the next frame, we shift the memory pointer to update the belief cost of the previous frame and compute the belief cost of the new next frame using HBP without applying the temporal data cost. Thus, the cached belief cost of the previous frame is the result after the spatiotemporal MRF optimization, while the cached belief cost of the next frame is the result after MRF optimization without temporal information. Note that both the belief costs of the previous and the next frames are used as temporal information to compute the disparity of the current frame.
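The streaming scheme above can be summarized by a small rolling buffer. The sketch below is illustrative only: compute_belief_no_temporal and fuse_with_temporal are hypothetical placeholders for the HBP passes without and with the temporal data cost described in Secs. 4 and 5, and the real system keeps these buffers in GPU memory rather than in NumPy arrays.

```python
from collections import deque

def stream_disparity(frames, compute_belief_no_temporal, fuse_with_temporal):
    """Rolling three-frame buffer of belief costs (boundary frames omitted).

    compute_belief_no_temporal(frame) -> belief cost from HBP without the
    temporal term; fuse_with_temporal(cur, prev, nxt) -> (disparity, fused
    belief cost) from the spatiotemporal optimization. Both are placeholders."""
    buf = deque(maxlen=3)  # belief costs of [previous, current, next]
    for frame in frames:
        buf.append(compute_belief_no_temporal(frame))
        if len(buf) == 3:
            prev_b, cur_b, next_b = buf
            disparity, fused_b = fuse_with_temporal(cur_b, prev_b, next_b)
            buf[1] = fused_b  # the fused current frame becomes the next "previous"
            yield disparity
```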

In the following, we formulate the spatiotemporal MRF and define the data and the smoothness costs, respectively. The HBP algorithm and the important pseudocode of the GPU implementation will also be presented. The definition of the data and the smoothness costs balances the running time and the quality of the depth result. Note that although there are better definitions of the cost function, as demonstrated by Park et al.,11 they are not ready to be implemented on GPU in real time considering the heavy computation involved in using high level information to define the data and the smoothness costs.

4.2 MRF Formulation

We formulate the MRF optimization using maximum a posteriori estimation. Given the observation data s (the stereo image pairs) from the stereo camera, k (the raw depth map) from Kinect, and t (the temporal belief cost) from the previous and the next frames, our goal is to estimate the scene depth d of the current frame by maximizing the a posteriori probability P(d|s, k, t). Assuming s, k, and t are conditionally independent and using Bayes' rule, we have

$$P(d \mid s, k, t) = \frac{P(s, k, t \mid d)\, P(d)}{P(s, k, t)} = \frac{P(s \mid d)\, P(k \mid d)\, P(t \mid d)\, P(d)}{P(s)\, P(k)\, P(t)} \propto P(s \mid d)\, P(k \mid d)\, P(t \mid d)\, P(d), \quad (1)$$

where P(s|d), P(k|d), and P(t|d) are the likelihood probabilities of the stereo data cost, the Kinect data cost, and the temporal data cost, respectively. P(d) is the prior probability that encodes the smoothness cost in the current frame. According to the Hammersley–Clifford theorem, we can further rewrite Eq. (1) into

$$P(s \mid d)\, P(k \mid d)\, P(t \mid d)\, P(d) \propto \frac{1}{Z} \prod_{i} \phi(d_i, s_i)\, \phi(d_i, k_i)\, \phi(d_i, t_i) \prod_{i,\, j \in N(i)} \varphi(d_i, d_j), \quad (2)$$

where Z is a normalization constant and j ∈ N(i) is the first-order neighborhood of i. ϕ(d_i, s_i), ϕ(d_i, k_i), and ϕ(d_i, t_i) are called the potential functions, which encode the cost for pixel i based on the stereo, Kinect, and temporal information, respectively. The compatibility function, which gives the smoothness cost to pixel i, is denoted as φ(d_i, d_j). After taking the negative logarithm of Eq. (2), we have

$$E = \sum_{i} C(d_i) + \lambda_s \sum_{i,\, j \in N(i)} V(d_i, d_j), \quad (3)$$

$$C(d_i) = w_s f_s(d_i, s_i) + w_k f_k(d_i, k_i) + w_t f_t(d_i, t_i), \quad (4)$$

Fig. 1 The proposed Kinect–stereo camera fusion system. (a) Up-side view and (b) frontal view.

Fig. 2 The pipeline of the graphics processing unit implementation. Host side: image acquisition and retrieval of the final disparity map. Device side: preprocessing, optical flow computation, next-frame belief cost computation (using the HBP algorithm), temporal belief cost normalization (previous and next frames), spatiotemporal data cost computation, HBP computation, final belief cost computation, and final disparity computation.


where w_s, w_k, and w_t are the reliability weight functions that evaluate the relative reliability among the different data costs, and f_s(d_i, s_i), f_k(d_i, k_i), and f_t(d_i, t_i) are the data costs for each pixel based on the stereo, Kinect, and temporal information, respectively. λ_s is a global regularization parameter that controls the relative weight between the data term C(d_i) and the smoothness term V(d_i, d_j).

Comparing the proposed formulation with the formulation in Zhu's algorithm,1 their temporal coherence is encoded in the smoothness cost while ours is encoded in the data cost. A major benefit of encoding the temporal coherence in the data term instead of the smoothness term is that it significantly reduces the number of nodes in the MRF by reducing a 3-D grid MRF to a 2-D grid MRF. This reduces the number of message-passing operations during the MRF optimization and the memory required to store probabilities in hidden nodes, while the quality of the resulting depth map remains undegraded. Indeed, in the experimental comparisons, the proposed method performs better than Zhu's algorithm,1 which shows that the proposed MRF formulation is effective.

4.2.1 Stereo data cost

The stereo data cost f_s(d_i, s_i) is defined by measuring the absolute difference of the gradient between the left image L and the right image R. A gradient-based cost function achieves a better result than an intensity-based cost function, as demonstrated by Hermann and Vaudrey.23 The stereo data cost is given by

$$f_s(d_i, s_i) = \min(\lvert \nabla R_i - \nabla L_{i+d} \rvert_1, T_1), \quad (5)$$

where ∇I_i = |I_{i+1} − I_{i−1}| / 2 is the central difference gradient operator, |·|_1 is the absolute L1-norm distance between the two vectors, and T_1 is a truncation value for a robust data cost. When computing the stereo data cost, we first shift the left gradient images on GPU according to the disparity value d and then compute the matching cost of the entire image on GPU in parallel.
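A minimal CPU sketch of this cost is shown below (the actual system evaluates it per label on the GPU). It assumes grayscale rectified images, uses only the horizontal central-difference gradient for brevity, and assigns the truncated cost T1 to pixels whose shifted correspondence falls outside the image.

```python
import numpy as np

def central_gradient_x(img):
    """Central-difference gradient: |I(i+1) - I(i-1)| / 2 (horizontal only here)."""
    grad = np.zeros_like(img, dtype=np.float32)
    grad[:, 1:-1] = np.abs(img[:, 2:].astype(np.float32) - img[:, :-2]) / 2.0
    return grad

def stereo_data_cost(left, right, num_labels, T1):
    """f_s(d_i, s_i) = min(|grad_R(i) - grad_L(i + d)|_1, T1) for every pixel and
    every candidate disparity d; returns an (H, W, num_labels) cost volume."""
    gl, gr = central_gradient_x(left), central_gradient_x(right)
    h, w = right.shape
    cost = np.full((h, w, num_labels), T1, dtype=np.float32)
    for d in range(num_labels):
        stop = w - d if d > 0 else w
        diff = np.abs(gr[:, :stop] - gl[:, d:])   # shift the left gradient by d
        cost[:, :stop, d] = np.minimum(diff, T1)  # truncate at T1
    return cost
```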

4.2.2 Kinect data cost

The Kinect data cost f_k(d_i, k_i) is defined by calculating the difference between the Kinect depth value and the candidate depth value D_i. The candidate depth value is computed from the candidate disparity d_i as

$$D_i = \frac{F \times B}{d_i}, \quad (6)$$

where F is the focal length of the stereo camera and B is the baseline length. Essentially, Eq. (6) converts the disparity value to match the absolute depth value from Kinect. Similar to the stereo data cost, the Kinect data cost is defined as

$$f_k(d_i, k_i) = \min(\lvert k_i - D_i \rvert_1, T_2), \quad (7)$$

where k_i is the measured Kinect depth value and T_2 is a truncation value. A uniform belief cost is given if k_i is missing in the initial Kinect depth map.
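The following sketch illustrates Eqs. (6) and (7) on a CPU; it assumes missing Kinect measurements are encoded as zeros, that the candidate disparities are nonzero, and that the uniform cost assigned to missing pixels is an illustrative constant (the paper does not specify its value).

```python
import numpy as np

def kinect_data_cost(kinect_depth, disparities, F, B, T2):
    """f_k(d_i, k_i) = min(|k_i - D_i|, T2) with D_i = F * B / d_i (Eqs. 6 and 7).
    `disparities` holds the nonzero candidate disparity values; pixels with a
    missing Kinect measurement (encoded as 0 here) get a uniform belief cost."""
    h, w = kinect_depth.shape
    cost = np.empty((h, w, len(disparities)), dtype=np.float32)
    for idx, d in enumerate(disparities):
        D = F * B / float(d)                       # candidate depth for this label
        cost[:, :, idx] = np.minimum(np.abs(kinect_depth - D), T2)
    cost[kinect_depth == 0, :] = 0.5 * T2          # uniform (uninformative) cost
    return cost
```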

4.2.3 Temporal data cost

The temporal data cost is introduced to ensure temporal coherence of the estimated disparity map across the previous and the next frames. Since the belief costs have different ranges of values, we normalize the values to [0, 1]. Using the normalized final belief cost b_{t−1} from the previous frame and the normalized stereo, Kinect, and smoothness belief cost b_{t+1} from the next frame, the proposed temporal data cost is defined as

$$f_t(d_i, t_i) = b^{d}_{t-1} + b^{d}_{t+1}. \quad (8)$$

The temporal belief cost is propagated according to the estimated pixel correspondences from the real-time optical flow algorithm.22 The optical flow map between the current frame and the previous frame is computed to get the previous-frame pixel correspondences. Then, the next-frame pixel correspondences are obtained by the same method. Finally, both sets of pixel correspondences are utilized to compute the temporal belief cost.
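A CPU sketch of this propagation is given below; it assumes the flow fields map the current frame to the previous and next frames, uses nearest-neighbor rounding of the flow vectors, and applies a simple min-max normalization of each cached belief volume to [0, 1].

```python
import numpy as np

def warp_belief(belief, flow):
    """Read the belief volume (H, W, L) of another frame at the flow
    correspondences i + flow(i); nearest-neighbor rounding, clamped at borders."""
    h, w, _ = belief.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return belief[yt, xt]

def temporal_data_cost(belief_prev, flow_to_prev, belief_next, flow_to_next):
    """f_t(d_i, t_i) = b_{t-1}(d) + b_{t+1}(d) after normalizing each cached
    belief volume to [0, 1] (Eq. 8)."""
    def normalize(b):
        return (b - b.min()) / (b.max() - b.min() + 1e-8)
    return (normalize(warp_belief(belief_prev, flow_to_prev)) +
            normalize(warp_belief(belief_next, flow_to_next)))
```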

4.2.4 Smoothness cost

We follow Felzenszwalb's algorithm24 to define the smoothness cost as a truncated linear function, which allows the belief message to be computed and passed in O(n) per iteration. The smoothness cost is defined as

$$V(d_i, d_j) = \min(S \lvert d_i - d_j \rvert_1, T_3), \quad (9)$$

where S is the cost multiplier defined according to the intensity difference between neighboring pixels25 and T_3 is the maximum cost value.

4.2.5 Reliability weight function

In order to combine the three data costs effectively, we utilize a simple heuristic that evaluates the reliability of each data cost by comparing how different the first minimum belief value b_1st is from the second minimum belief value b_2nd.12

Note that the first minimum belief value is the minimum cost over all candidate labels at a pixel, as described in

$$b^{\mathrm{1st}} = \min_{d_i} C(d_i). \quad (10)$$

Intuitively, if there is no ambiguity in the data cost, the first minimum belief value should be much smaller than the second minimum belief value, and therefore a large weight should be given to this data cost. In contrast, a small weight should be given to a data cost when there exist ambiguities in its data cost function, e.g., ambiguities in stereo matching or a missing depth value in Kinect. The reliability function is defined as

$$R_{m \in \{s,k,t\}} = \begin{cases} 1 - \dfrac{b^{\mathrm{1st}}_{m}}{b^{\mathrm{2nd}}_{m}}, & \text{if } b^{\mathrm{2nd}}_{m} > T_c \\ 0, & \text{otherwise,} \end{cases} \quad (11)$$

where T_c is introduced to avoid division by zero. After normalization, we obtain the reliability weight functions

$$w_s = \frac{R_s}{W}, \quad w_k = \frac{R_k}{W}, \quad w_t = \frac{R_t}{W}, \quad (12)$$

where W = R_s + R_k + R_t.
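The sketch below combines Eqs. (10)-(12) with Eq. (4) on (H, W, L) cost volumes; in the actual system this logic runs inside the CUDA data cost kernel with a parallel reduction (Sec. 5), and the small epsilon guards are illustrative.

```python
import numpy as np

def reliability(cost_volume, Tc):
    """R_m = 1 - b_1st / b_2nd per pixel (Eq. 11), where b_1st and b_2nd are the
    two smallest costs over the candidate labels; R_m = 0 when b_2nd <= Tc."""
    two_smallest = np.partition(cost_volume, 1, axis=2)
    b1, b2 = two_smallest[..., 0], two_smallest[..., 1]
    return np.where(b2 > Tc, 1.0 - b1 / np.maximum(b2, 1e-8), 0.0)

def combined_data_cost(fs, fk, ft, Tc=0.0):
    """C(d_i) = w_s f_s + w_k f_k + w_t f_t (Eq. 4) with the normalized
    reliability weights of Eq. (12)."""
    Rs, Rk, Rt = reliability(fs, Tc), reliability(fk, Tc), reliability(ft, Tc)
    W = Rs + Rk + Rt + 1e-8                 # W = R_s + R_k + R_t (epsilon guard)
    ws, wk, wt = Rs / W, Rk / W, Rt / W
    return ws[..., None] * fs + wk[..., None] * fk + wt[..., None] * ft
```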


5 Parallelizing HBP on GPU

In this paper, a GPU implementation of the efficient HBP algorithm is utilized to solve the MRF energy minimization problem. The HBP algorithm operates by passing belief messages to the neighborhood pixels. The belief message is defined as follows:

$$m^{l}_{ij}(d_j) = \min_{d_i} \Big[ V(d_i, d_j) + C(d_i) + \sum_{n \in N(i) \setminus j} m^{l-1}_{ni}(d_i) \Big], \quad (13)$$

where n ∈ N(i)\j denotes the neighbors of i except j, and m^l_{ij}(d_j) is the belief message which is sent from pixel i to pixel j at iteration l. The iterative message-passing computation is done in N iterations, and the final belief cost b(d_i) for each pixel is computed as

$$b(d_i) = C(d_i) + \sum_{j \in N(i)} m^{N}_{ji}(d_i). \quad (14)$$

Finally, the label d*_i is obtained by computing the minimum belief cost b(d_i) for all pixels. Felzenszwalb and Huttenlocher's efficient HBP algorithm24 reduces the computational complexity of the message computation and increases the speed of the HBP algorithm using grid graph and multiscale approaches.
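For reference, the message computation of Eq. (13) with the truncated linear smoothness of Eq. (9) can be carried out in O(L) per message with the forward/backward lower-envelope passes of Felzenszwalb and Huttenlocher. The following is a CPU sketch for a single outgoing message (not the CUDA kernel), where h denotes C(d_i) plus the incoming messages excluding the target neighbor.

```python
import numpy as np

def message_update(h, S, T3):
    """One belief message m_ij(d_j) = min_{d_i} [ min(S|d_i - d_j|, T3) + h(d_i) ]
    (Eqs. 9 and 13), with h(d_i) = C(d_i) + sum of incoming messages m_{ni}(d_i)
    over n in N(i) excluding j. The two passes compute the lower envelope of the
    linear part in O(L); the final step applies the truncation T3."""
    m = np.asarray(h, dtype=np.float32).copy()
    for d in range(1, len(m)):                  # forward pass
        m[d] = min(m[d], m[d - 1] + S)
    for d in range(len(m) - 2, -1, -1):         # backward pass
        m[d] = min(m[d], m[d + 1] + S)
    return np.minimum(m, float(np.min(h)) + T3) # truncated linear smoothness
```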

In the proposed GPU implementation, for each pixel, we need to access belief messages from four directions during the belief propagation. These messages can be accessed and processed in parallel. Thus, the number of threads we need is equal to the number of pixels multiplied by four. For fast parallel processing, we utilize the on-chip shared memory instead of the global memory, where each thread reads one message array from the global memory and saves it in the shared memory for fast access. For each GPU thread block, we create 1024 threads. Note that the number of GPU blocks is not the same as the number of pixels because each GPU block computes more than one pixel. The number of pixels computed in a single GPU block depends on the total number of disparity labels, the size of shared memory, and other hardware limitations. Algorithm 1 shows the pseudocode of this procedure.

Algorithm 2 shows the pseudocode for computing the data costs for each pixel. Note that the data cost computation has a large computational complexity, which is slow when it is computed on central processing unit (CPU) hardware. The first and the second minimum belief values can be obtained efficiently using the parallel reduction method in the compute unified device architecture (CUDA) SDK so that the computational speed is faster on the GPU architecture.

6 Experimental Results

6.1 Implementation Environment and Running Time

We implemented the proposed framework on an Intel CPU (2.67-GHz i5 750 CPU with 4 GB RAM) and an NVIDIA GPU (GTX580 with 4 GB RAM). The optical flow algorithm and the HBP are implemented on GPU using CUDA (version 5.0) with the parallel architecture described in Sec. 5. The proposed implementation runs at 33 fps for depth video acquisition with 320 × 240 resolution (6 fps at a resolution of 640 × 480). Note that this running time includes the optical flow computation and other pre-/postprocessing in the proposed framework. For reference, we have also implemented Zhu's algorithms,1,12 which also use the GPU architecture for real-time depth map and stereo fusion. A detailed comparison with Zhu's algorithms1,12 on the running time for processing 150 frames is shown in Table 2. Compared with Zhu's algorithm,12 their method runs slightly faster than ours. However, they do not consider any temporal information and thus their depth maps are not temporally consistent. Compared with Zhu's algorithm,1 which also considers temporal coherency in its optimization, the proposed algorithm runs much faster. To show the importance of the GPU architecture, we also provide the running time of our CPU implementation for comparison.

6.2 Qualitative Evaluation

We present our qualitative results on challenging examples in Fig. 3. For all experimental results, we set T_2 = 400, T_3 = 4, S = 0.5, and T_c = 0 in the proposed implementation. These values were determined empirically. Our results are compared with Zhu's algorithms.1,12 In the first row example, the depth map around the light bulb area is missing in the Kinect depth map but was captured in stereo matching. On the other hand, the disparity map of the table is missing due to matching ambiguity but was captured in the Kinect depth map.

Algorithm 1 Pseudocode for message_kernel

Create 4N threads, where N is the number of pixels
for each thread do
    Determine the memory offset of the belief values for computing one belief message in the global memory
    Save the array of belief values in the shared memory
    Compute one belief message for propagation
    Save the belief message in the global memory
end for

Algorithm 2 Pseudocode for datacost_kernel

Create MN threads, where M is the number of disparities and N is the number of pixels
for each thread do
    Determine the memory offset of the data costs of a pixel in the global memory
    Save the data costs in the shared memory
    Find the first and the second minimum values of each data cost
    Compute the reliability weights
    Compute the data cost belief values
    Save the data cost belief values in the global memory
end for


The proposed framework can effectively combine the inputs from Kinect and stereo matching to correct errors and fill the missing values. In the second and third row examples, we have a moving camera and a moving object, respectively. Similar to the first row example, there are many missing depth values in the Kinect depth map and ambiguities in stereo matching. Without Kinect–stereo fusion, either of the results is unsatisfactory. After the fusion process, we can achieve accurate depth map acquisition in real time. In the fourth and the fifth row examples, a person is moving with different poses and the brightness of the images changes due to auto-exposure. The proposed method can produce temporally consistent disparity maps for the background and accurate disparity maps for the foreground. This example also shows the limitation of the proposed approach. The proposed method fails to estimate the correct disparity map around the left fist in the fifth row example. This is because the depth map of the left fist is not captured by Kinect and the stereo matching around that region is wrong. Comparing our results with the results from Zhu's algorithms,1,12 although the results are similar, the results of Zhu's algorithm12 do not have temporal coherency due to its individual frame-by-frame processing, and the proposed framework runs much faster than Zhu's algorithm1 with similar qualitative results.

To further compare our results with the results from Zhu's algorithms1,12 in terms of temporal coherency, we show the results of three consecutive frames in Fig. 4. The results are presented in Video 1. The results of Zhu's algorithm12 exhibit noticeable flickering, as shown in Fig. 4(b). On the contrary, our results are temporally smooth and more accurate, as shown in Fig. 4(d).

6.3 Quantitative Evaluation

We evaluate the accuracy of our results quantitatively by measuring the peak signal-to-noise ratio (PSNR). To calculate the PSNR value, the left stereo image is projected to the corresponding right stereo image using the estimated disparity map. Then, the image difference between the right image and the projected left image is used to measure the PSNR value. The PSNR value is defined as

$$\mathrm{PSNR} = 10 \times \log_{10} \left( \frac{255^2 \times N_{\mathrm{total}}}{\sum_{i} \lvert L'(i) - R(i) \rvert^2} \right), \quad (15)$$

where L' is the projected left image and R is the right image. The regions near the right border are ignored because those regions are invisible in the left image. The higher the PSNR value, the better the accuracy of the estimated disparity map. Figure 5 shows the frame-by-frame PSNR of our results and the results from Zhu's algorithms1,12 and Wang's algorithm15 on six real-world video sequences.
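A CPU sketch of this evaluation is given below; it assumes grayscale images, a right-referenced disparity map (the left correspondence of right pixel x is x + d), nearest-neighbor sampling, and masks out pixels whose correspondence falls outside the left image.

```python
import numpy as np

def warp_left_to_right(left, disparity):
    """Project the left image to the right view, L'(x) = L(x + d(x)), with
    nearest-neighbor sampling; correspondences outside the image are invalid."""
    h, w = left.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs_src = np.rint(xs + disparity).astype(int)
    valid = (xs_src >= 0) & (xs_src < w)
    warped = np.zeros_like(left, dtype=np.float32)
    warped[valid] = left[ys[valid], xs_src[valid]]
    return warped, valid

def psnr(right, warped_left, valid):
    """PSNR = 10 log10(255^2 * N / sum_i |L'(i) - R(i)|^2) over valid pixels (Eq. 15)."""
    sq_err = (warped_left - right.astype(np.float32)) ** 2
    return 10.0 * np.log10(255.0 ** 2 * valid.sum() / (sq_err[valid].sum() + 1e-8))
```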

Table 2 Comparative results of running time for 150 frames with resolution 320 × 240.

Type             | Proposed (on CPU) | Zhu's12 | Zhu's1 | Proposed (on GPU)
Running time (s) | 49.47             | 2.72    | 5.39   | 2.99
fps              | 2.02              | 36      | 18     | 33
MDe/s            | 2.36              | 42.19   | 21.09  | 38.67

Fig. 3 Challenging real-world examples. (a) Right images of the stereo camera; (b) disparity maps from Kinect; (c) disparity maps from stereo matching; (d) results from Zhu's algorithm;12 (e) results from Zhu's algorithm;1 and (f) our results after Kinect–stereo fusion. More results can be found in Video 1 (MOV, 20.0 MB) [URL: http://dx.doi.org/10.1117/1.OE.53.4.043110.1].


Our results consistently show the highest PSNR in most video frames in comparison with the results from Zhu's algorithms1,12 and Wang's algorithm.15

We have also evaluated the temporal consistency of our estimated disparity maps by counting the number of inconsistent pixels with a large disparity difference between corresponding pixels in consecutive frames. The error percentage of inconsistent pixels in the disparity map, ε, is defined as

$$\varepsilon = \frac{100}{N_{\mathrm{total}}} \sum_{i} \big( \lvert d_t(i) - d_{t+1}(j) \rvert > 0 \big), \quad (16)$$

where N_total is the number of pixels in the image, d_t(i) and d_{t+1}(j) denote the disparity results of the current and the next frames, respectively, and j is the corresponding pixel position in the next frame of the pixel at i. Figure 6 shows the frame-by-frame error percentage of inconsistent pixels between the current and the next frames for our results and the results from Zhu's algorithms1,12 and Wang's algorithm.15
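The sketch below computes the error percentage of Eq. (16); it assumes the optical flow maps the current frame to the next frame, uses nearest-neighbor rounding, and leaves the comparison threshold as a parameter (0 in Eq. (16)).

```python
import numpy as np

def temporal_error(d_t, d_t_next, flow_to_next, threshold=0):
    """Error percentage of temporally inconsistent pixels (Eq. 16): pixel i is
    inconsistent when |d_t(i) - d_{t+1}(j)| > threshold, with j = i + flow(i)
    its correspondence in the next frame (nearest-neighbor rounding)."""
    h, w = d_t.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xj = np.clip(np.rint(xs + flow_to_next[..., 0]).astype(int), 0, w - 1)
    yj = np.clip(np.rint(ys + flow_to_next[..., 1]).astype(int), 0, h - 1)
    inconsistent = np.abs(d_t.astype(np.float32) - d_t_next[yj, xj]) > threshold
    return 100.0 * inconsistent.sum() / (h * w)
```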

Our results have a consistently small error percentage due to the usage of temporal data in the spatiotemporal MRF framework. Table 3 shows the average temporal error percentage and the average PSNR for each of the testing videos. Although the results of Wang's algorithm15 have the smallest average temporal errors in most testing cases, their results also show the minimum PSNR among all testing cases.

Fig. 4 Disparity map of test video 5 (frames 44 to 46 of Video 1). (a) Right image; (b) result of Zhu's algorithm;12 (c) result of Zhu's algorithm;1 and (d) our results. Flickering is significantly removed.

Fig. 5 Frame-by-frame peak signal-to-noise ratio values on six real-world video sequences. Our results are compared with the results of Zhu's algorithms1,12 and Wang's algorithm.15


This shows that their approach has over-smoothed the resulting disparity map temporally, which leads to low quality in the spatial domain but high quality in temporal smoothness.

The plots and the tables show that the proposed method outperforms Zhu's methods1,12 and Wang's method15 in terms of disparity map accuracy. Together with the qualitative comparisons, this shows that having the temporal consistency term in the data cost instead of the smoothness cost (which was used by Zhu's algorithm1) is effective not only in speeding up the computation, but also in achieving more accurate fusion results.

7 Conclusion

In this paper, we have presented a Kinect–stereo camera fusion system for real-time video disparity map acquisition.

A GPU spatiotemporal MRF framework is proposed to fuse the Kinect depth map and the stereo disparity map in real time. Although the proposed data cost, smoothness cost, and reliability weighting function are simple, they are effective in fusing the Kinect depth map and the stereo disparity map. As demonstrated in the experimental results, the depth map from Kinect and the disparity map from the stereo camera can complement each other, especially for regions where either one of the inputs is missing. In the qualitative and quantitative comparisons, the proposed approach outperforms the state-of-the-art real-time depth map and stereo fusion algorithms of Zhu1,12 and Wang.15 Our results achieved the second best running time, the highest average PSNR in measuring disparity map quality, and the second best disparity map temporal coherence.

Fig. 6 Frame-by-frame temporal consistency between the current and the next frame on six real-world video sequences. Our results are compared with the results of Zhu's algorithms1,12 and Wang's algorithm.15

Table 3 Comparative results of the average temporal error and the average peak signal-to-noise ratio (PSNR).

            | Average temporal error                 | Average PSNR
Dataset No. | Zhu's12 | Wang's15 | Zhu's1 | Proposed | Zhu's12 | Wang's15 | Zhu's1 | Proposed
1           | 3.25    | 4.59     | 3.60   | 3.12     | 18.54   | 17.79    | 18.68  | 18.77
2           | 6.08    | 3.29     | 4.65   | 3.61     | 18.77   | 17.97    | 18.87  | 18.95
3           | 12.78   | 7.44     | 10.67  | 8.43     | 17.16   | 16.95    | 17.24  | 17.44
4           | 6.63    | 4.06     | 5.41   | 4.63     | 18.27   | 17.92    | 18.37  | 18.49
5           | 12.14   | 7.36     | 9.67   | 7.97     | 17.74   | 17.04    | 18.01  | 18.22
6           | 13.05   | 7.79     | 10.17  | 8.15     | 16.01   | 15.73    | 16.18  | 16.39

Note: The bold value represents the best performance compared with other results.


Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2012R1A1A2009495).

References

1. J. Zhu et al., "Spatial-temporal fusion for high accuracy depth maps using dynamic MRFs," IEEE Trans. Pattern Anal. Mach. Intell. 32(5), 899–909 (2010).
2. Q. Yang et al., "Real-time global stereo matching using hierarchical belief propagation," in Proc. of British Machine Vision Conf., pp. 989–998, British Machine Vision Association, Edinburgh, UK (2006).
3. A. Hosni et al., "Real-time local stereo matching using guided image filtering," in Proc. of IEEE Int. Conf. on Multimedia & Expo, pp. 1–6, IEEE, Barcelona, Spain (2011).
4. C. Richardt et al., "Real-time spatiotemporal stereo matching using the dual-cross-bilateral grid," in Proc. of European Conf. on Computer Vision, pp. 510–523, Springer, Heraklion, Crete, Greece (2010).
5. J. Kowalczuk, E. T. Psota, and L. C. Perez, "Real-time stereo matching on CUDA using an iterative refinement method for adaptive support-weight correspondences," IEEE Trans. Circuits Syst. Video Technol. 23(1), 94–104 (2013).
6. L. De-Maeztu, A. Villanueva, and R. Cabeza, "Near real-time stereo matching using geodesic diffusion," IEEE Trans. Pattern Anal. Mach. Intell. 34(2), 410–416 (2012).
7. Q. Yang, C. Engels, and A. Akbarzadeh, "Near real-time stereo for weakly-textured scenes," in Proc. of British Machine Vision Conf., pp. 72.1–72.10, British Machine Vision Association, Leeds, UK (2008).
8. K. Zhang et al., "Real-time accurate stereo with bitwise fast voting on CUDA," in Proc. of IEEE Int. Conf. on Computer Vision Workshops, pp. 794–800, IEEE, Kyoto, Japan (2009).
9. Q. Yang et al., "Spatial-depth super resolution for range images," in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1–8, IEEE, Minneapolis, Minnesota (2007).
10. J. Dolson et al., "Upsampling range data in dynamic environments," in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1141–1148, IEEE, San Francisco, California (2010).
11. J. Park et al., "High quality depth map upsampling for 3D-ToF cameras," in Proc. of IEEE Int. Conf. on Computer Vision, pp. 1623–1630, IEEE, Barcelona, Spain (2011).
12. J. Zhu et al., "Reliability fusion of time-of-flight depth and stereo geometry for high quality depth maps," IEEE Trans. Pattern Anal. Mach. Intell. 33(7), 1400–1414 (2011).
13. R. Nair et al., "High accuracy ToF and stereo sensor fusion at interactive rates," in Proc. of European Conf. on Computer Vision Workshops, pp. 1–11, Springer, Florence, Italy (2012).
14. C. D. Mutto et al., "Locally consistent ToF and stereo data fusion," in Proc. of European Conf. on Computer Vision Workshops, pp. 598–607, Springer, Florence, Italy (2012).
15. Y. Wang and Y. Jia, "A fusion framework of stereo vision and Kinect for high-quality dense depth maps," in Proc. of Asian Conf. on Computer Vision Workshops, pp. 109–120, Springer, Daejeon, Korea (2012).
16. W. C. Chiu, U. Blanke, and M. Fritz, "Improving the Kinect by cross-modal stereo," in Proc. of British Machine Vision Conf., pp. 116.1–116.10, British Machine Vision Association, Dundee, UK (2011).
17. E. S. Larsen et al., "Temporally consistent reconstruction from multiple video streams using enhanced belief propagation," in Proc. of IEEE Int. Conf. on Computer Vision, pp. 1–8, IEEE, Rio de Janeiro, Brazil (2007).
18. L. Yao, D. X. Li, and M. Zhang, "Temporally consistent depth maps recovery from stereo vision," Inf. Technol. J. 11(1), 30–39 (2012).
19. D. Scharstein and R. Szeliski, "A taxonomy and evaluation of dense two-frame stereo correspondence algorithms," Int. J. Comput. Vision 47(1), 7–42 (2002).
20. I. K. Park et al., "Design and performance evaluation of image processing algorithms on GPUs," IEEE Trans. Parallel Distrib. Syst. 22(1), 91–104 (2011).
21. L.-F. Yu et al., "Shading-based shape refinement of RGB-D images," in Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1415–1422, IEEE, Portland, Oregon (2013).
22. A. Bruhn et al., "Real-time optic flow computation with variational methods," in Proc. of Computer Analysis of Images and Patterns, pp. 222–229, Springer, Münster, Germany (2003).
23. S. Hermann and T. Vaudrey, "The gradient—a powerful and robust cost function for stereo matching," in Proc. of Int. Conf. of Image and Vision Computing New Zealand, pp. 1–8, IEEE, Queenstown, New Zealand (2010).
24. P. Felzenszwalb and D. Huttenlocher, "Efficient belief propagation for early vision," Int. J. Comput. Vision 70(1), 41–54 (2006).
25. M. Tappen and W. Freeman, "Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters," in Proc. of IEEE Int. Conf. on Computer Vision, pp. 900–906, IEEE, Nice, France (2003).

Williem received the BS degree in computer science from Bina Nusantara University, Indonesia, in 2011. He is currently working toward the PhD degree at Inha University, Republic of Korea. His research interests include stereo matching, computational photography, and GPU computing.

Yu-Wing Tai received the BEng (first class honors) and MS degrees in computer science from the Hong Kong University of Science and Technology (HKUST) in 2003 and 2005, respectively, and the PhD degree from the National University of Singapore (NUS) in June 2009. He joined the Korea Advanced Institute of Science and Technology (KAIST) as an assistant professor in fall 2009. His research interests include computer vision and image/video processing.

In Kyu Park received the BS, MS, and PhD degrees from Seoul National University in 1995, 1997, and 2001, respectively, all in electrical engineering and computer science. From September 2001 to March 2004, he was a member of technical staff at Samsung Advanced Institute of Technology. Since March 2004, he has been with the Department of Information and Communication Engineering, Inha University. His research interests include the joint area of computer graphics and vision.


