Feature Descriptors for 4D Light Field Images

Jonathan Samuel Lumentut, Williem, and In Kyu Park
Department of Information and Communication Engineering

Inha University, Incheon 22212, Korea {[email protected], [email protected], [email protected]}

Abstract—Research on feature descriptors has been conducted widely, along with feature detectors, to improve feature matching performance. Viewpoint invariance is one of the key issues in feature matching. To obtain better matching results in scenes captured from different viewpoints, a robust viewpoint-invariant descriptor is needed. In this framework, we introduce a method to obtain viewpoint-invariant descriptors by utilizing light field images. Initial features are extracted from the central image of the sampled light field images. Then, the corresponding features in neighboring images are tracked using a KLT tracker. Their feature descriptors are then computed and used to describe the initial features. Experimental results show that the proposed descriptors obtained from light field images give better matching results than conventional descriptors.

Keywords—feature, descriptor, light field, viewpoint invariant

I. INTRODUCTION

For more than a decade, feature detectors and descriptors have been developed and widely used in many image processing and computer vision algorithms, such as structure from motion [1], visual simultaneous localization and mapping [2], object detection [3], object recognition [4], object tracking [5], and scene classification [6]. Many popular descriptors, such as the Scale Invariant Feature Transform (SIFT) [7] and Maximally Stable Extremal Regions (MSER) [8], are designed to be invariant to illumination, rotation, scale, and viewpoint change. Among these, viewpoint difference has been an interesting yet arduous topic for generating a robust descriptor, since even a small change of viewpoint can alter the descriptor values. Descriptors such as MSER [8] and ASIFT [9] have been proposed to increase feature robustness under viewpoint change. MSER is not fully scale invariant because it does not simulate the blur induced by the depth distance from the camera. ASIFT [9] handles the transformation dilemma of the MSER algorithm by applying manual projective transformations to achieve smooth deformation. Its simulation of camera axis orientation with fixed parameters, such as rotation angle and tilt angle, produces a significantly larger number of inliers than normal matching. Adopting this idea for a similar objective, our method relies on multiview light field images to substitute for the projective simulation.

(a) 3 × 3 sampling

(b) 5 × 5 sampling

Fig. 1. Light field image sampling patterns

Capturing multiview images is another issue besides generating descriptors. It requires strict intrinsic and extrinsic camera parameters, and it is impractical for a common user to operate a large camera array to capture multiview images. To overcome these issues, the Lytro [10] and Raytrix [11] light field cameras were introduced to consumers. Owing to their compactness and small size, we use a Lytro Illum camera to generate the light field data. A single shot of a light field camera produces multiple sub-aperture images.

In our work, the raw light field image captured by the light field camera is adjusted and extracted using the light field toolbox provided by Dansereau et al. [12]. A total of 11 × 11 light field images are extracted from one raw file. From these 11 × 11 light field images, n × n images are sampled with the patterns shown in Fig. 1. There is a central image in each set of sampled light field images. Initial features are detected on this image using the SIFT detector. These initial features are then fed to a KLT tracker [13] to track their correspondences in the neighboring images. The reason for not extracting features independently in all sampled images is to keep the total number of features constant across views. Descriptors are then computed from patches located at those key points. Each feature is represented by n × n × bins descriptor values. In our proposed method, SIFT is used, so a total of 128n² gradient magnitudes are produced as the descriptor at each key point.

Fig. 2. Light field descriptor and matching framework

II. RELATED WORK

Feature descriptors have been proposed along with feature detectors over the past two decades. One state-of-the-art method, SIFT, introduced by Lowe [7], computes descriptors by generating histograms of gradient orientations on detected key patches. This method is widely used since it produces high repeatability. Despite its wide use in many feature-related works, SIFT requires a high computational cost in the matching procedure. The GLOH descriptor, introduced by Mikolajczyk and Schmid [14], extended the work of Lowe by using a circular sampling pattern to manage gradient locations. Alahi et al. [15] proposed a descriptor, coined FREAK, that takes advantage of a retinal sampling pattern to generate a binary descriptor, which in turn yields low-cost matching computation. BRIEF [16], also a binary descriptor, uses binary strings from patches established by a Gaussian sampling pattern.

While those features are designed to be robust to scale, rotation, and illumination change, they are extracted from a single-view image only. By extending this idea to multiview images, our method utilizes light field images to obtain viewpoint-invariant descriptors.

III. PROPOSED METHOD

We develop an efficient framework for extracting light field descriptors. Substituting the manual orientation and tilt simulation of the previous work [9], we use the sampled light field images to extract descriptors from different viewpoints. As shown in Fig. 2, the proposed method starts by extracting initial features from the central image, which is located at index [(n+1)/2, (n+1)/2]. After the initial features are extracted, their correspondences in the neighboring images are tracked using the KLT tracker. The descriptor of each detected key point is computed using the SIFT descriptor algorithm. Finally, exhaustive matching over all combinations is performed on all detected features. The proposed method is described in more detail in the following subsections.
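As a road map for those subsections, the following Python sketch strings the stages together. It composes hypothetical helper functions (sample_grid, detect_initial, track_klt, compute_descriptors) sketched under Sections III-A through III-D below, and assumes the decoded light field is available as an (11, 11, H, W) NumPy array of 8-bit grayscale sub-aperture views; none of this is the authors' code.

```python
import numpy as np

def lf_sift(views, n=3):
    """LF-SIFT sketch: sample n x n views, detect features on the central
    image, track them into every sampled view, and keep one descriptor
    set per view (the n^2 subsets are not concatenated)."""
    coords = sample_grid(views.shape[0], n)    # Sec. III-A: n x n view indices
    ci, cj = coords[len(coords) // 2]          # central image, [(n+1)/2, (n+1)/2]
    central = views[ci, cj]

    pts = detect_initial(central)              # Sec. III-B: SIFT key points
    per_view = []
    for (u, v) in coords:
        tracked, _ok = track_klt(central, views[u, v], pts)  # KLT correspondences
        per_view.append(compute_descriptors(views[u, v],
                                            tracked.reshape(-1, 2)))  # Sec. III-C
    return per_view  # n^2 descriptor subsets of shape (f, 128), matched in Sec. III-D
```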

A. Image sampling

The sampling patterns can be seen in Fig. 1. Black marks indicate the sampled images among the total of 121 light field sub-aperture images. The central image is located precisely at the center of the n × n light field images. Note that the farther a neighboring image is from the center, the larger its viewpoint difference.

Image sampling is performed with n = 3 and n = 5 to show how much the matching results improve as the sampling size increases. In our experiments, the maximum value of n is limited to 5, since the time complexity of exhaustive matching is O(n⁴).
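A minimal sketch of this sampling step, assuming the 11 × 11 sub-aperture views are stored in a NumPy array; the evenly spaced, center-symmetric index pattern is our reading of Fig. 1 rather than code from the paper:

```python
import numpy as np

def sample_grid(grid=11, n=3):
    """Pick n x n view indices, evenly spaced and symmetric about the
    center of a grid x grid light field. For grid=11, n=3 this selects
    rows/columns (0, 5, 10); the central view always sits at index 5."""
    rows = np.round(np.linspace(0, grid - 1, n)).astype(int)
    return [(u, v) for u in rows for v in rows]

print(sample_grid(11, 3))  # [(0, 0), (0, 5), ..., (5, 5), ..., (10, 10)]
```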

B. Feature extraction

After light field image sampling is done, initial features are extracted on the central image using the SIFT detection algorithm. Then, the corresponding features in all n² − 1 neighboring images are located using the KLT tracker. The number of corresponding features in each neighboring image is equal to the number of features in the central image.
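One way to implement this step with OpenCV is sketched below: SIFT detection on the central view, then pyramidal Lucas-Kanade tracking into a neighboring view, assuming 8-bit grayscale images. The paper specifies only the SIFT detector [7] and the KLT tracker [13]; the API choices and the status handling are ours.

```python
import cv2
import numpy as np

def detect_initial(central):
    """SIFT key points on the central view as an (f, 1, 2) float32 array,
    the layout expected by OpenCV's optical-flow tracker."""
    kps = cv2.SIFT_create().detect(central, None)
    return np.float32([kp.pt for kp in kps]).reshape(-1, 1, 2)

def track_klt(central, neighbor, pts):
    """Track the initial points into one neighboring view with the
    pyramidal KLT tracker. Every input point yields an output position,
    which keeps the feature count constant across views; `ok` flags the
    points that were tracked reliably."""
    tracked, status, _err = cv2.calcOpticalFlowPyrLK(central, neighbor, pts, None)
    return tracked, status.ravel() == 1
```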

C. Descriptor computation

To compute local descriptors, the SIFT descriptor algorithm is applied to every detected feature in the n × n light field images. The patch size for each detected feature is 16 × 16, so the radius parameter (r) of the SIFT descriptor algorithm is set to 8. A total of 128 bins of gradient magnitudes is generated from each patch. Each descriptor subset i represents the same feature from a single viewpoint. Note that the n² per-view descriptor subsets of a feature are not concatenated.
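In OpenCV terms, the fixed 16 × 16 patch can be imposed by assigning each tracked point a fixed key point size before invoking the descriptor; this mapping of the paper's radius parameter to the key point size is our assumption, not the authors' implementation.

```python
import cv2

def compute_descriptors(image, points, patch=16):
    """Compute a 128-D SIFT descriptor on a fixed 16 x 16 patch around
    each tracked location. `points` is an (f, 2) array; the result is one
    (f, 128) subset per view, and the n^2 subsets describing a feature
    are kept separate rather than concatenated."""
    kps = [cv2.KeyPoint(float(x), float(y), float(patch)) for x, y in points]
    _kps, desc = cv2.SIFT_create().compute(image, kps)
    return desc
```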

D. Descriptor matching

After collecting all feature descriptors from the central image and its neighbors, the next step is exhaustive matching to obtain correspondences between the two compared data sets. The number of features (f) is the same in each sampled light field image. Each successfully paired key point is pushed to the correspondence vector.
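A sketch of the exhaustive matching stage: every per-view descriptor subset of one light field is matched against every subset of the other, which is where the O(n⁴) complexity noted in Sec. III-A comes from. The paper does not state the pairing criterion, so Lowe's ratio test is our assumption.

```python
import cv2

def lf_match(desc_a, desc_b, ratio=0.8):
    """Exhaustively match two LF-SIFT descriptor sets, given as lists of
    per-view (f, 128) float32 arrays. All n^2 x n^2 view combinations are
    compared, and each accepted pair of key point indices is pushed to
    the correspondence vector."""
    bf = cv2.BFMatcher(cv2.NORM_L2)
    correspondences = []
    for da in desc_a:            # n^2 views of the first light field
        for db in desc_b:        # n^2 views of the second light field
            for pair in bf.knnMatch(da, db, k=2):
                if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
                    correspondences.append((pair[0].queryIdx, pair[0].trainIdx))
    return correspondences
```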

(a) Orange triceratops (105/1)

(b) House (147/4)

(c) Orange triceratops (271/2)

(d) House (352/20)

(e) Car (55/29)

(f) Brown triceratops (26/17)

(g) Car (86/52)

(h) Brown triceratops (70/37)

Fig. 3. Matching results from the 3 × 3 sampled dataset: (a), (b), (e), and (f) Normal SIFT; (c), (d), (g), and (h) LF-SIFT. Green and red lines represent inliers and outliers, respectively. Performance is reported as (# of inliers / # of outliers).

IV. EXPERIMENTAL RESULTS

We run the experiments on an Intel i7-6700 @ 3.4 GHz computer with 16 GB RAM. The light field data are captured using a Lytro Illum camera and extracted from the raw Lytro camera file using the toolbox provided by Dansereau et al. [12]. The toolbox produces images of 625 × 434 spatial resolution with an 11 × 11 angular resolution. We only consider viewpoint differences under the same indoor environment.

As depicted in Fig. 3, a comparison is made between two methods: Normal SIFT and our proposed method, Light Field SIFT (LF-SIFT). Normal SIFT means that the feature descriptors are extracted only from the central image and then directly used to find correspondences. On the other hand, LF-SIFT uses descriptors from the central image and its neighbors. In this experiment, we also include results using 5 × 5 sampling along with 3 × 3 sampling. Inliers and outliers are checked manually during matching. The performance evaluation of the proposed method is shown in Table I. Although our inlier rate differs only slightly from that of the normal method, ours still provides more inliers, as seen in Fig. 3 and Fig. 4. Moreover, in the proposed method, the inlier rates mostly increase as the sampling size increases, because more descriptors are collected from the neighboring images. We limit the maximum sampling size to 5 to reduce the computational time.
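The Table I entries are consistent with the inlier rate being the fraction of all matches that are inliers, i.e., inlier rate = # of inliers / (# of inliers + # of outliers). For example, the Normal SIFT result on the orange triceratops scene in Fig. 3(a) reports (105/1), and 105 / (105 + 1) ≈ 0.99057, matching the first entry of Table I.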

Table I. Inlier rate

Method          | Orange triceratops | House   | Car     | Brown triceratops
----------------|--------------------|---------|---------|------------------
Normal-SIFT     | 0.99057            | 0.97351 | 0.65476 | 0.60465
LF-SIFT (3 × 3) | 0.99267            | 0.94624 | 0.62319 | 0.65421
LF-SIFT (5 × 5) | 0.9802             | 0.94664 | 0.69058 | 0.67188

(a) Orange triceratops (396/8)

(b) House (550/31)

(c) Car (154/69)

(d) Brown triceratops (86/42)

Fig. 4. Matching results from the 5 × 5 sampled dataset using LF-SIFT. Green and red lines represent inliers and outliers, respectively. Performance is reported as (# of inliers / # of outliers).

V. CONCLUSION

In this paper, we proposed a framework for extracting descriptors that are invariant to viewpoint. Collecting features from sampled light field images produces more inliers in the matching process than using a single-viewpoint (normal) image. Our proposed framework achieves good matching performance while giving up little in terms of inlier rate. For future work, we will add more variation, such as illumination, rotation, and scale changes, to the datasets.

ACKNOWLEDGEMENT

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2013R1A2A2A01069181).

REFERENCES

[1] S. Agarwal et al., “Building Rome in a Day,” Communications of the ACM, vol. 54, no. 10, pp. 105-112, October 2011.

[2] N. Karlsson et al., “The vSLAM Algorithm for Robust Localization and Mapping,” Proc. IEEE ICRA, pp. 24-29, April 2005.

[3] S. Maji and J. Malik, “Object Detection Using a Max-Margin Hough Transform,” Proc. IEEE CVPR, pp. 1038-1045, June 2009.

[4] D. G. Lowe, “Object Recognition from Local Scale-Invariant Features,” Proc. IEEE ICCV, pp. 1150-1157, September 1999.

[5] H. Zhou, Y. Yuan, and C. Shi, “Object Tracking Using SIFT Features and Mean Shift,” CVIU, vol. 113, no. 3, pp. 345-352, March 2009.

[6] N. Serrano, A. E. Savakis, and J. Luo, “Improved Scene Classification Using Efficient Low-Level Features and Semantic Cues,” Pattern Recognition, vol. 37, no. 9, pp. 1773-1784, November 2004.

[7] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, pp. 91-110, November 2004.

[8] M. Donoser and H. Bischof, “Efficient Maximally Stable Extremal Region (MSER) Tracking,” Proc. IEEE CVPR, pp. 553-560, June 2006.

[9] J-M. Morel and G. Yu, “ASIFT: A new framework for fully affine invariant image comparison,” SIAM Journal on Imaging Sciences, vol. 2, pp. 438-469, April 2009.

[10] Lytro (2014) The Lytro camera. https://www.lytro.com

[11] Raytrix (2013) 3D light field camera technology. www.raytrix.de

[12] D. G. Dansereau, O. Pizarro, and S. B. Williams, “Decoding, calibration and rectification for lenselet-based plenoptic cameras,” Proc. IEEE CVPR, pp. 1027-1034, June 2013.

[13] B. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” Proc. IJCAI, vol. 2, pp. 674-679, August 1981.

[14] K. Mikolajczyk and C. Schmid, “A Performance Evaluation of Local Descriptors,” IEEE TPAMI, vol. 27, no. 10, pp. 1615-1630, October 2005.

[15] A. Alahi, R. Ortiz, and P. Vandergheynst, “FREAK: Fast Retina Keypoint,” Proc. IEEE CVPR, pp. 510-517, June 2012.

[16] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “BRIEF: Binary Robust Independent Elementary Features,” Proc. ECCV, pp. 778-792, September 2010.

