

Rethinking RGB-D Salient Object Detection:Models, Data Sets, and Large-Scale Benchmarks

Deng-Ping Fan, Member, IEEE, Zheng Lin, Zhao Zhang, Menglong Zhu, Member, IEEE, and Ming-Ming Cheng, Senior Member, IEEE

Abstract— The use of RGB-D information for salient object detection (SOD) has been extensively explored in recent years. However, relatively few efforts have been put toward modeling SOD in real-world human activity scenes with RGB-D. In this article, we fill the gap by making the following contributions to RGB-D SOD: 1) we carefully collect a new Salient Person (SIP) data set that consists of ∼1K high-resolution images that cover diverse real-world scenes from various viewpoints, poses, occlusions, illuminations, and backgrounds; 2) we conduct a large-scale (and, so far, the most comprehensive) benchmark comparing contemporary methods, which has long been missing in the field and can serve as a baseline for future research, and we systematically summarize 32 popular models and evaluate 18 of the 32 models on seven data sets containing a total of about 97k images; and 3) we propose a simple general architecture, called deep depth-depurator network (D3Net). It consists of a depth depurator unit (DDU) and a three-stream feature learning module (FLM), which perform low-quality depth map filtering and cross-modal feature learning, respectively. These components form a nested structure and are elaborately designed to be learned jointly. D3Net exceeds the performance of any prior contenders across all five metrics under consideration, thus serving as a strong model to advance research in this field. We also demonstrate that D3Net can be used to efficiently extract salient object masks from real scenes, enabling an effective background-changing application with a speed of 65 frames/s on a single GPU. All the saliency maps, our new SIP data set, the D3Net model, and the evaluation tools are publicly available at https://github.com/DengPingFan/D3NetBenchmark.

Index Terms— Benchmark, RGB-D, saliency, salient object detection (SOD), Salient Person (SIP) data set.

I. INTRODUCTION

HOW to take high-quality photos has become one of the most important competition points among mobile phone manufacturers. Salient object detection (SOD) methods [1]–[18] have been incorporated into mobile phones and widely used for creating perfect portraits by automatically adding large aperture and other enhancement effects.

Manuscript received July 16, 2019; revised March 9, 2020; accepted May 16, 2020. This work was supported in part by the Major Project for New Generation of AI under Grant 2018AAA0100400, in part by the NSFC under Grant 61922046, and in part by the Tianjin Natural Science Foundation under Grant 17JCJQJC43700. (Corresponding author: Ming-Ming Cheng.)

Deng-Ping Fan was with the College of Computer Science, Nankai University, Tianjin 300350, China. He is now with the Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates.

Zheng Lin, Zhao Zhang, and Ming-Ming Cheng are with the College of Computer Science, Nankai University, Tianjin 300350, China (e-mail: [email protected]).

Menglong Zhu is with Google AI, Mountain View, CA 94043 USA.

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TNNLS.2020.2996406

Fig. 1. Left to right: input image, GT, and the corresponding depth map. The quality of the depth maps ranges from low (first row) and mid (second row) to high (last row). As shown in the second row, it is difficult to recognize the boundary of the human arm in the bounding-box region of the RGB image, but it is clearly visible in the depth map. High-quality depth maps benefit the RGB-D-based SOD task. These three examples are from the NJU2K [37], our SIP, and NLPR [39] data sets, respectively.

While existing SOD methods [19]–[35] have achieved remarkable success, most of them rely only on RGB images and ignore the important depth information, which is widely available in modern smartphones (e.g., iPhone X, Huawei Mate 10, and Samsung Galaxy S10). Thus, fully utilizing RGB-D information for SOD has recently attracted significant research attention [36]–[51].

One of the primary goals of existing smartphone cameras is to identify humans in visual scenes through either coarse, bounding-box-level, or instance-level segmentation. To this end, intelligent solutions, such as RGB-D saliency detection techniques, have gained considerable attention.

However, most existing RGB-D-based SOD methods are tested on RGB-D images taken by Kinect [52] or a light field camera [53] or estimated by optical flow [54], which have different characteristics from actual smartphone cameras. Since humans are the key subjects of photographs taken with smartphones, a human-oriented RGB-D data set featuring realistic, in-the-wild images would be more useful for mobile manufacturers. Despite the effort of some authors [37], [39] to augment their scenes with additional objects, a human-centered RGB-D data set for SOD does not yet exist.

Furthermore, although depth maps provide important complementary information for identifying salient objects, low-quality ones often cause wrong detections [55]. While existing RGB-D-based SOD models typically fuse RGB and depth features by different strategies [51], there is no model in the RGB-D SOD field that explicitly/automatically discards low-quality depth maps (see Fig. 1).


We believe that such models have a high potential for driving this field forward.

In addition to the limitations of current RGB-D data sets and models already mentioned, most RGB-D studies also suffer from several other common constraints, including the following.

Sufficiency: Only a limited number of data sets (1∼4) have been benchmarked in recent articles [39], [56] (see Table I). The generalizability of models cannot be properly assessed with such a small number of data sets.

Completeness: The F-measure [57], MAE, and the precision & recall (PR) curve are the three most widely used metrics in existing works. However, as suggested in [58] and [59], these metrics essentially act at a pixel level. It is, thus, difficult to draw thorough and reliable conclusions from quantitative evaluations [60].

Fairness: Some works [49], [51], [61] use the same F-measure metric but do not explicitly describe which statistic (e.g., mean or max) was used, easily resulting in unfair comparisons and inconsistent performance. Meanwhile, different threshold strategies for the F-measure (e.g., 255 varied thresholds [51], [61], [62], an adaptive saliency threshold [39], [41], and a self-adaptive threshold [43]) will result in different performance. It is, thus, crucial to provide a fair comparison of RGB-D-based SOD models by extensively evaluating them with the same metrics on a standard leaderboard.

A. Contribution

To address the abovementioned problems, we provide three distinct contributions.

1) We have built a new Salient Person (SIP) data set (see Figs. 2 and 3). It consists of 929 accurately annotated high-resolution images that are designed to contain multiple salient persons per image. It is worth mentioning that the depth maps are captured by a real smartphone. We believe such a data set is highly valuable and will facilitate the application of RGB-D models to mobile devices. Besides, the data set is carefully designed to cover diverse scenes and various challenging situations (e.g., occlusion and appearance change) and is elaborately annotated with pixel-level ground truths (GTs). Another discriminative feature of our SIP data set is the availability of both RGB and grayscale images captured by a binocular camera, which can benefit a broad number of research directions, such as stereo matching, depth estimation, and human-centered detection.

2) With the proposed SIP and six existing RGB-D data sets [37]–[39], [63], [64], [66], we provide a more comprehensive comparison of 32 classical RGB-D SOD models and present a large-scale (∼97k images) fair evaluation of 18 state-of-the-art (SOTA) algorithms [37]–[47], [49], [55], [67]–[69], making our study a good all-around RGB-D benchmark. To further promote the development of this field, we additionally provide an online evaluation platform with the preserved test set.

3) We propose a simple general model called deep depth-depurator network (D3Net), which learns to automatically discard low-quality depth maps using a novel depth depurator unit (DDU). Due to the gate connection mechanism, our D3Net can predict salient objects accurately. Extensive experiments demonstrate that our D3Net remarkably outperforms prior work on many challenging data sets. Such a general framework design helps to learn cross-modality features from RGB images and depth maps.

Our contributions offer a systematic benchmark equipped with the basic tools for a comprehensive assessment of RGB-D models, offering deep insight into the task of RGB-D-based modeling and encouraging future research in this direction.

B. Organization

In Section II, we first review current data sets for RGB-D SOD, as well as representative models for this task. Then, we present details on the proposed salient person data set SIP in Section III. In Section IV, we describe our D3Net model for RGB-D SOD, which explicitly filters out the low-quality depth maps.

In Section V, we provide both a quantitative and qualitative experimental analysis of the proposed algorithm. Specifically, in Section V-A, we offer more details on our experimental settings, including the benchmarked models, data sets, and runtime. In Section V-B, five evaluation metrics (E-measure [59], S-measure [58], MAE, PR curve, and F-measure [57]) are described in detail. In Section V-C, we provide the mean statistics over different data sets and summarize them in Table II. Comparison results of 18 SOTA RGB-D-based SOD models over seven data sets, namely, STERE [63], LFSD [64], DES [38], NLPR [39], NJU2K [37], SSD [66], and SIP (ours), clearly demonstrate the robustness and efficiency of our D3Net model. Furthermore, in Section V-D, we provide a performance comparison between traditional and deep models and discuss the experimental results in more depth. In Section V-E, we provide visualizations of the results and present saliency maps generated for various challenging scenes. In Section VI, we discuss some potential applications related to human activities and provide an interesting and realistic use scenario of D3Net in a background-changing application. To better understand the contributions of the DDU in the proposed D3Net, in Section VII, we present the upper and lower bounds of the DDU. All in all, the extensive experimental results clearly demonstrate that our D3Net model exceeds the performance of any prior competitors across five different metrics. In Section VII-B, we discuss the limitations of this work. Finally, Section VIII concludes this article.

II. RELATED WORKS

A. RGB-D Data Sets

Over the past few years, several RGB-D data sets have been constructed for SOD. Some statistics of these data sets are shown in Table III. Specifically, the STERE [63] data set was the first collection of stereoscopic photos in this field. GIT [36], LFSD [64], and DES [38] are three small-sized data sets. GIT and LFSD were designed with specific purposes in mind, e.g., saliency-based segmentation of generic objects and saliency detection on the light field. DES has 135 indoor images captured by Microsoft Kinect [52].


Fig. 2. Representative subsets in our SIP. The images in SIP are grouped into eight subsets according to background objects (i.e., grass, car, barrier, road, sign, tree, flower, and other), different lighting conditions (i.e., low light and sunny with clear object boundaries), and various numbers of objects (i.e., 1, 2, ≥3).

Fig. 3. Examples of images, depth maps, and annotations (i.e., object level and instance level) in our SIP data set with different numbers of salient objects, object sizes, object positions, scene complexities, and lighting conditions. Note that the “RGB” and “Gray” images are captured by two different monocular cameras separated by a short distance. Thus, the “Gray” images are slightly different from the grayscale images obtained from the color (RGB) images. Our SIP data set opens up new directions, such as depth estimation from “RGB” and “Gray” images and instance-level RGB-D SOD.

Although these data sets have advanced the field to various degrees, they are severely restricted by their small scale or low resolution. To overcome these barriers, Peng et al. [39] created NLPR, a large-scale RGB-D data set with a resolution of 640 × 480. Later, Ju et al. built NJU2K [37], which has become one of the most popular RGB-D data sets. The recent SSD [66] data set partially remedied the resolution restriction of NLPR and NJU2K. However, it only contains 80 images. Despite the progress made by existing RGB-D data sets, they still suffer from the common limitation that their depth maps are not captured with real smartphones, making them unsuitable for reflecting real environmental conditions (e.g., lighting or distance to the object).

Compared with the previous data sets, the proposed SIP data set has three fundamental differences.

1) It includes 929 images with many challenging situations [83] (e.g., dark background, occlusion, appearance change, and out of view) from various outdoor scenarios.

2) The RGB images, grayscale images, and estimated depth maps are captured by a smartphone with a dual camera. Due to the predominant application of SOD to human subjects on mobile phones, we also focus on this and, thus, for the first time, emphasize salient persons in real-world scenes.

3) A detailed quantitative analysis is presented for the quality of the data set (e.g., center bias and object size distribution), which was not carefully investigated in previous RGB-D-based studies.

B. RGB-D Models

Traditional models rely heavily on handcrafted features (e.g., contrast [38], [39], [73], [75] and shape [36]).


TABLE I
Comparison of 31 classical RGB-D-based SOD algorithms and the proposed baseline (D3Net). Train/Val Set (#) = training or validation set: NLR = NLPR [39], NJU = NJU2K [37], MK = MSRA10K [70], O = MK + DUTS [71]. Basic: 4 priors = region, background, depth, and surface orientation priors; IPT = initialization parameters transfer; LGBS priors = local contrast, global contrast, background, and spatial priors; RFR [72] = random forest regressor; MCFM = multiconstraint feature matching; CLP = cross label propagation. Type: T = traditional, D = deep learning. SP. = whether or not the superpixel algorithm is used. E-measure: the range of scores over the seven data sets in Table II. Evaluation tools: https://github.com/DengPingFan/E-measure

By embedding classical principles (e.g., spatial bias [38], center-dark channel [46], 3-D priors [77], and background priors [40], [47]) into techniques such as difference of Gaussians [37], region classification [62], SVMs [45], [73], graph knowledge [55], cellular automata [42], and Markov random fields [40], [75], these models show that specific handcrafted features can lead to decent performance. Several studies have also explored methods of integrating RGB and depth features via various combination strategies using, for instance, angular densities [41], random forest regressors [45], [62], and minimum barrier distances [77]. More details are shown in Table I.

To overcome the limited expression ability of handcrafted features, recent works [43], [44], [48], [49], [51], [61], [76], [78], [80]–[82] have proposed to introduce CNNs to infer salient objects from RGB-D data. BED [76] and DF [44] are the two pioneering works for this, which introduced deep learning technology into the RGB-D-based SOD task. More recently, Huang et al. [78] developed a more efficient end-to-end model with a modified loss function. To address the shortage of training data, Zhu et al. [48] presented a robust prior model with a guided depth-enhancement module for SOD. In addition, Chen et al. [49] developed a series of novel approaches for this field, such as hidden structure transfer [43], a complementarity fusion module [49], an attention-aware component [80], [82], and dilated convolutions [81]. Nevertheless, these works, to the best of our knowledge, are dedicated to extracting general depth features/information.

We argue that not all information in a depth map is informative for SOD, and low-quality depth maps often introduce significant noise (first row in Fig. 1). Thus, we instead design a simple general framework, D3Net, which is equipped with a depth-depurator unit to explicitly exclude low-quality depth maps when learning complementary features.

III. PROPOSED DATA SET

A. Data Set Overview

We introduce SIP, the first human-activity-oriented salient person detection data set. Our data set contains 929 RGB-D images belonging to eight different background scenes, under two different object boundary conditions, and portraying multiple actors, each of whom wears different clothes in different images. Following [83], the images are carefully selected to cover diverse challenging cases (e.g., appearance change, occlusion, and shape complexity). Examples can be found in Figs. 2 and 3. The overall data set can be downloaded from our website http://dpfan.net/SIPDataset/.

B. Sensors and Data Acquisition

1) Image Collection: We used the Huawei Mate 10 to collect our images. The Mate 10's rear cameras feature high-grade Leica SUMMILUX-H lenses with bright f/1.6 apertures and combine 12MP RGB and 20MP monochrome (grayscale) sensors. The depth map is automatically estimated by the Mate 10. We asked nine people, all dressed in different colors, to perform specific actions in real-world daily scenes.


TABLE II
Benchmarking results of 18 leading RGB-D approaches on our SIP and six classical [37]–[39], [63], [64], [66] data sets. ↑ and ↓ denote that larger and smaller is better, respectively. “-T” indicates the test set of the corresponding data set. For traditional models, the statistics are based on the overall data sets rather than on the test sets. The “Rank” denotes the ranking of each model in a specific measure. The “All Rank” indicates the overall ranking (average of each rank) on a specific data set. The best performance is highlighted in bold.

Instructions on how to perform the actions to cover different challenging situations (e.g., occlusion and out of view) were given, but no instructions on style, angle, or speed were provided, in order to record realistic data.

2) Data Annotation: After capturing 5269 images and the corresponding depth maps, we first manually selected about 2500 images, each of which included one or multiple salient people. Following many famous SOD data sets [19], [57], [70], [71], [84]–[90], six viewers were instructed to draw bounding boxes (bboxes) around the most attention-grabbing person according to their first instinct. We adopted the voting scheme described in [39] to discard images with low voting consistency and chose the top 1000 most satisfactory images.


Fig. 4. (a) Distribution of the normalized object center distance from the image center. (b) Distribution of the normalized object margin (farthest point in an object) distance from the image center. (c) Distribution of the normalized object size.

TABLE III
Comparison of current RGB-D data sets in terms of year (Year), publication (Pub.), data set size (DS.), number of objects in the images (#Obj.), type of scene (Types.), depth sensor (Sensor.), depth quality (DQ.; e.g., a high-quality depth map suffers from less random noise, see the last row in Fig. 1), annotation quality (AQ., see Fig. 12), whether or not a grayscale image from a monocular camera is provided (GI.), center bias [CB., see Fig. 4(a) and (b)], and resolution (in pixels). H & W denote the height and width of the image, respectively.

TABLE IV
Statistics regarding camera/object motions and salient object instance numbers in the SIP data set.

Another five annotators were then introduced to label accurate silhouettes of the salient objects according to the bboxes. We discarded some images with low-quality annotations and finally obtained 929 images with high-quality GT annotations (see Fig. 12).

C. Data Set Statistics

1) Center Bias: Center bias has been identified as one of the most significant biases of saliency detection data sets [91]. It occurs because subjects tend to look at the center of a screen [92]. As noted in [83], simply overlapping all of the maps in the data set does not adequately describe the degree of center bias.

Following [83], we present the statistics of two distances, Ro and Rm, in Fig. 4(a) and (b), where Ro and Rm indicate how far the object center and the object margin (farthest) point are from the image center, respectively. The center biases of our SIP and existing [36]–[39], [63], [64], [66] data sets are shown in Fig. 4(a) and (b). Except for our SIP and two small-scale data sets (GIT and SSD), most data sets present a high degree of center bias, i.e., the center of the object is close to the image center.

2) Size of Objects: We define object size as the ratio of salient object pixels to the total number of pixels in the image. The distribution of the normalized object size in SIP ranges from 0.48% to 66.85% (avg.: 20.43%) [see Fig. 4(c)].
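This statistic can be computed directly from the binary GT masks; the following is a minimal NumPy sketch, where the mask-loading helper is a hypothetical placeholder rather than part of the released tools.

import numpy as np

def object_size_ratio(gt_mask: np.ndarray) -> float:
    """Ratio of salient (foreground) pixels to all pixels in a binary GT mask."""
    gt = gt_mask > 0  # treat any nonzero label as salient
    return float(gt.sum()) / gt.size

# Example aggregation over a data set (load_sip_gt_masks is hypothetical):
# sizes = [object_size_ratio(m) for m in load_sip_gt_masks()]
# print(min(sizes), float(np.mean(sizes)), max(sizes))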

3) Background Objects: As summarized in Table IV, SIP includes diverse background objects (e.g., cars, trees, and grass). Models tested on such a data set would likely be able to handle realistic scenes better and, thus, be more practical.

4) Object Boundary Conditions: In Table IV, we show different object boundary conditions (e.g., dark and clear) in our SIP data set. One example of a dark condition, which often occurs in daily scenes, can be found in Fig. 3. The depth maps obtained in low-light conditions inevitably introduce more challenges for detecting salient objects.

5) Number of Salient Objects: From Table III, we note that existing data sets fall short in their numbers of salient objects (e.g., they often only have one). Previous studies [93], however, have shown that humans can accurately enumerate up to at least five objects without counting. Thus, our SIP is designed to contain up to five salient objects per image. The statistics of labeled objects in each image are shown in Table IV (# Object).

IV. PROPOSED MODEL



Fig. 5. Illustration of the proposed D3Net. In the training stage (left), the input RGB and depth images are processed with three parallel subnetworks, i.e., RgbNet, RgbdNet, and DepthNet. The three subnetworks are based on the same modified structure of feature pyramid networks (FPNs) (see Section IV-A). We introduce these subnetworks to obtain three saliency maps (i.e., Srgb, Srgbd, and Sdepth) that consider both coarse and fine details of the input. In the test phase (right), a novel DDU (see Section IV-B) is utilized for the first time in this work to explicitly discard or keep the saliency cue introduced by the depth map. In the training/test phase, these components form a nested structure and are elaborately designed (e.g., the gate connection in the DDU) to automatically learn the salient object from the RGB image and the depth image jointly.

According to the motivation described in Fig. 1, cross-modality feature extraction and a depth filtering unit are both highly desired; therefore, we propose the simple general D3Net model (as illustrated in Fig. 5), which contains two components, i.e., a three-stream feature learning module (FLM) (see Section IV-A) and a DDU (see Section IV-B). The FLM is utilized to extract features from the different modalities, while the DDU acts as a gate to explicitly filter out low-quality depth maps. If the DDU decides to filter out a depth map, the data flow passes along the RgbNet only. These components form a nested structure and are elaborately designed to achieve robust performance and high generalization ability on various challenging data sets.

A. Feature Learning Module

Most existing models [94]–[96] have shown significant improvements for object detectors in several applications. These models typically share the common structure of FPNs [97]. Motivated by this, we introduce an FPN-like component into our D3Net baseline to efficiently extract features in a pyramid manner. The entire D3Net model is divided into a training phase and a test phase, as the DDU is only used in the test phase.

As shown in Fig. 5, the designed FLM appears in both the training and test phases. The FLM consists of three subnetworks, i.e., RgbNet, RgbdNet, and DepthNet. Note that the three subnetworks have the same structure but are fed with different numbers of input channels. Specifically, each subnetwork receives a rescaled image I ∈ {Irgb, Irgbd, Idepth} with 224 × 224 resolution. The goal of the FLM is to obtain the corresponding predicted map S ∈ {Srgb, Srgbd, Sdepth}.

As in [97], we also use a bottom-up pathway, a top-down pathway, and lateral connections to extract features. The outputs are then proportionally organized at multiple levels. The FPN is independent of the backbone; thus, for simplicity, we adopt the VGG-16 [98] architecture as our basic convolutional network to extract spatial features, while utilizing a more powerful backbone [99] as the feature extractor could be explored in the future.

Fig. 6. The FPN is introduced to extract context-aware information. Different from [97], we further add a sixth layer on top of VGG-16, and the information merge strategy is concatenation rather than addition. More details can be found in Section IV-A.

Some studies, such as [100], have shown that deeper layers retain more semantic information for locating objects. Based on this observation, we introduce a layer containing two 3 × 3 convolution kernels on top of the five-layer VGG-16 structure to achieve this goal.

As shown in Fig. 6, our top-down features are built as follows. For a specific (coarser) layer, we first conduct 2× upsampling using the nearest-neighbor operation. Then, the upsampled feature is concatenated with the finer feature map to obtain rich features. Before being concatenated with the coarser map, the finer map undergoes a 1 × 1 Conv operation to reduce its channels. For example, let Irgbd ∈ R^(W×H×4) denote the 4-D feature tensor of the input of RgbdNet. Then, we define a set of anchors on different layers so that we can obtain a set of pyramid feature tensors with Ci × Wi × Hi, i.e., {64×224×224, 128×112×112, 256×56×56, 512×28×28, 512×14×14, 32×7×7, 32×14×14, 32×28×28, 32×56×56, 32×112×112, 32×224×224} on {Fi, i ∈ [1, 11]}, respectively. Note that {F1, F2, F3, F4, F5} correspond to the five convolutional stages of VGG-16 (i.e., {C1, C2, C3, C4, C5}).
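For concreteness, the following is a minimal PyTorch sketch of the concatenation-based top-down merge described above. The module and channel names are illustrative assumptions rather than the released implementation, and any further convolution applied to the concatenated features is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownMerge(nn.Module):
    """Merge a coarser pyramid feature with a finer one by concatenation (cf. Fig. 6)."""
    def __init__(self, fine_channels: int, reduced_channels: int = 32):
        super().__init__()
        # 1x1 conv reduces the channel number of the finer (lateral) feature map.
        self.lateral = nn.Conv2d(fine_channels, reduced_channels, kernel_size=1)

    def forward(self, coarse: torch.Tensor, fine: torch.Tensor) -> torch.Tensor:
        # 2x nearest-neighbor upsampling of the coarser map.
        up = F.interpolate(coarse, scale_factor=2, mode="nearest")
        # Concatenation (rather than addition, as in the original FPN).
        return torch.cat([up, self.lateral(fine)], dim=1)

# Example with shapes matching the text: a 32x7x7 coarse map and a 512x14x14 fine map.
merge = TopDownMerge(fine_channels=512, reduced_channels=32)
out = merge(torch.randn(1, 32, 7, 7), torch.randn(1, 512, 14, 14))
print(out.shape)  # torch.Size([1, 64, 14, 14]); a later conv would typically reduce this again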

B. Depth Depurator Unit

In the test phase, we further adopt a new gate connection strategy to obtain the optimal predicted map.


Fig. 7. Smoothed histograms of high-quality (first row) and low-quality (second row) depth maps, respectively. (a) RGB. (b) Depth. (c) Histogram.

Low-quality depth maps introduce more noise than informative cues to the prediction. The goal of the gate connection is to classify depth maps into reasonable and low-quality ones and to not use the poor ones in the pipeline.

As shown in Fig. 7(b), a stand-alone salient object in a high-quality depth map is typically characterized by well-defined closed boundaries and shows clear double peaks in its depth distribution. The statistics of the depth maps in the existing data sets [37]–[39], [63], [64], [66] also support the fact that “high-quality depth maps usually contain clear objects, while the elements in low-quality depth maps are cluttered (second row in Fig. 7).” In order to reject low-quality depth maps, we propose the DDU as follows.

More specifically, in the test phase, the RGB image and the depth map are first resized to a fixed size (e.g., 224 × 224, the same as in the training phase) to reduce the computational complexity. As shown in Fig. 5 (right), the DDU is implemented with a gate connection. Denote the three predicted maps for the input images by S ∈ {Srgb, Srgbd, Sdepth}; then, the goal of the DDU is to decide which predicted map P ∈ [0, 1]^(W×H) is optimal

P = Fddu({Srgb, Srgbd, Sdepth}). (1)

Intuitively, there are two ways to achieve this goal, i.e., postprocessing and preprocessing. We propose a simple but general postprocessing scheme for the DDU, which is applied in the test phase rather than in the training phase. Specifically, a comparison unit Fcu is leveraged to assess the similarity between Sdepth and Srgbd, generated by DepthNet and RgbdNet, respectively

Fcu = 1, if δ(Srgbd, Sdepth) ≤ t; 0, otherwise (2)

where δ(·) represents a distance function and t indicates a fixed threshold. Note that the comparison unit Fcu acts as an index to decide which subnetwork (RgbNet or RgbdNet) should be utilized.

The comparison unit is the key of our DDU. We utilize the comparison unit Fcu as a gate connection to decide the final/optimal predicted map P. Thus, our Fddu module can be formulated as

P = Fcu · Srgbd + F̄cu · Srgb (3)

where F̄cu = 1 − Fcu. Fcu can be viewed as a fixed weight. A more elegant formulation (adaptive weight) would be a part of our future work.

C. Implementation Details

1) DDU: The key component of our D3Net is the DDU. In this work, we adopt the simple yet powerful distance function formulated in (2). We leverage the mean absolute error (MAE) metric [the same as (5)] to assess the distance between two maps. The basic motivation is that if a high-quality depth map contains clear objects, DepthNet will easily detect these objects in Sdepth (see the first row in Fig. 7). The higher the quality of the depth map Idepth, the more similar Srgbd and Sdepth are; in other words, the predicted map Srgbd from RgbdNet has incorporated the features from Idepth. If the quality of the depth map is low, then the predicted map from RgbdNet will be quite different from the map generated by DepthNet. We have tested a set of values for the fixed threshold t in (2), such as 0.01, 0.02, 0.05, 0.10, 0.15, and 0.20, and found that t = 0.15 achieved the best performance.
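Putting (1)-(3) and the MAE-based distance together, a minimal sketch of the test-time gate could look as follows; the function and argument names are illustrative and not the released code.

import numpy as np

def ddu_select(s_rgb: np.ndarray, s_rgbd: np.ndarray, s_depth: np.ndarray,
               t: float = 0.15) -> np.ndarray:
    """Depth depurator unit as a test-time gate, cf. (1)-(3).

    All inputs are saliency maps in [0, 1] with the same spatial size.
    """
    delta = np.abs(s_rgbd - s_depth).mean()   # delta(.,.) is the MAE between the two maps
    f_cu = 1.0 if delta <= t else 0.0         # comparison unit, (2)
    # Gate connection, (3): keep S_rgbd for a reliable depth map, otherwise fall back to S_rgb.
    return f_cu * s_rgbd + (1.0 - f_cu) * s_rgb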

2) Loss Function: We adopt the widely used cross-entropy loss function L to train our model

L(S, G) = −(1/N) Σ_{i=1}^{N} [ g_i log(s_i) + (1 − g_i) log(1 − s_i) ]  (4)

where S ∈ [0, 1]^(224×224) and G ∈ {0, 1}^(224×224) indicate the estimated saliency map (i.e., Srgb, Srgbd, or Sdepth) and the GT map, respectively, g_i ∈ G, s_i ∈ S, and N denotes the total number of pixels.
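Equation (4) is the standard pixelwise binary cross-entropy; a minimal PyTorch sketch (equivalent to torch.nn.BCELoss applied to a predicted probability map and its GT) is shown below.

import torch

def saliency_bce_loss(s: torch.Tensor, g: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Pixelwise cross-entropy loss of (4); s in (0, 1), g in {0, 1}, same shape."""
    s = s.clamp(eps, 1.0 - eps)  # avoid log(0)
    return -(g * torch.log(s) + (1.0 - g) * torch.log(1.0 - s)).mean()

# In practice this matches torch.nn.BCELoss() applied to the predicted map and the GT map.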

3) Training Settings: For fair comparisons, we follow the same training settings described in [51]. We select 1485 image pairs from the NJU2K [37] data set and 700 image pairs from the NLPR [39] data set, respectively, as the training data (please refer to our website for the Trainlist.txt). The proposed D3Net is implemented in Python with the PyTorch toolbox. We adopt Adam as the optimizer; the initial learning rate is 1e-4, and the batch size is set to 8. The total training takes 30 epochs on a GTX TITAN X GPU with 12 GB of memory.

4) Data Augmentation: Due to the limited scale of existing data sets, we augment the training samples by flipping the images horizontally to reduce the risk of overfitting.
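The settings above can be tied together as in the sketch below. The subnetwork constructor and data loader are hypothetical placeholders, the per-subnetwork training scheme is an assumption, and the loss reuses the saliency_bce_loss sketch from the previous subsection.

import torch
from torchvision import transforms

# Hypothetical placeholders: `build_subnetwork` builds one of RgbNet/RgbdNet/DepthNet,
# and `train_loader` yields (input, gt) batches from the NJU2K/NLPR training lists.
subnet = build_subnetwork().cuda()
optimizer = torch.optim.Adam(subnet.parameters(), lr=1e-4)   # Adam, initial lr 1e-4
flip = transforms.RandomHorizontalFlip(p=0.5)                # applied inside the data pipeline

for epoch in range(30):                                      # 30 epochs, batch size 8 in the loader
    for image, gt in train_loader:
        optimizer.zero_grad()
        loss = saliency_bce_loss(subnet(image.cuda()), gt.cuda())  # loss (4) from the sketch above
        loss.backward()
        optimizer.step()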

V. BENCHMARKING EVALUATION RESULTS

We benchmark about 97k images (5398 images × 18 models) in this study, making it the largest and most comprehensive RGB-D-based SOD benchmark to date.

A. Experimental Settings

1) Models: We benchmark 18 SOTA models (see Table II), including ten traditional and eight CNN-based models.

2) Data Sets: We conduct our experiments on seven data sets (see Table II). The test sets of the NJU2K [37] and NLPR [39] data sets and the whole STERE [63], DES [38], SSD [66], LFSD [64], and SIP data sets are used for testing.

3) Runtime: In Table II, we summarize the runtime of existing approaches. The timings are tested on the same platform: Intel Xeon E5-2676 v3 2.4 GHz × 24 and GTX TITAN X. Since [43], [47], [49], [67]–[69], [80]–[82] have not released their codes, their timings are borrowed from the original articles or provided by the authors.


Fig. 8. PR curves (top) and F-measures (bottom) for 18 methods on the (a) NJU2K [37], (b) STERE [63], and (c) NLPR [39] data sets, using various fixed thresholds.

Fig. 9. PR curve (left) and F-measures (right) under different thresholds on the proposed SIP data set.

Our D3Net does not apply postprocessing (e.g., CRF); thus, the computation only takes about 0.015 s for a 224 × 224 image.

B. Evaluation Metrics

1) MAE M: We follow Perazzi et al. [101] and evaluate the MAE between a real-valued saliency map Sal and a binary GT G over all image pixels

MAE = (1/N) Σ_{i=1}^{N} |Sal(i) − G(i)|  (5)

where N is the total number of pixels. The MAE estimates the approximation degree between the saliency map and the GT map, and it is normalized to [0, 1]. The MAE provides a direct estimate of conformity between the estimated and GT maps. However, for the MAE metric, small objects are naturally assigned smaller errors, while larger objects are given larger errors. The metric is also unable to tell where the error occurs [102].

2) PR Curve: We also follow Borji et al. [5] and provide the PR curve. We binarize a saliency map S using a fixed threshold that changes from 0 to 255. For each threshold, a pair of recall and precision scores is computed; these pairs are then combined to form a precision-recall curve that describes the model performance in different situations. The overall evaluation results for the PR curves are shown in Figs. 8 (top) and 9 (left).

3) F-Measure Fβ: The F-measure is essentially a region-based similarity metric. Following the works by Borji et al. [5] and Zhang et al. [103], we also provide the max F-measure computed over various fixed (0-255) thresholds. The overall F-measure evaluation results under different thresholds on each data set are shown in Fig. 8 (bottom) and Fig. 9 (right).
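A minimal NumPy sketch of this fixed-threshold sweep is given below; using β² = 0.3 follows the common convention of [57] and is an assumption here, since the text does not restate the value.

import numpy as np

def pr_and_max_f(sal: np.ndarray, gt: np.ndarray, beta2: float = 0.3):
    """Sweep fixed thresholds 0..255 over a saliency map in [0, 255]; gt is binary."""
    gt = gt > 0
    precisions, recalls = [], []
    for th in range(256):
        pred = sal >= th
        tp = np.logical_and(pred, gt).sum()
        precisions.append(tp / (pred.sum() + 1e-8))
        recalls.append(tp / (gt.sum() + 1e-8))
    p, r = np.array(precisions), np.array(recalls)
    f = (1 + beta2) * p * r / (beta2 * p + r + 1e-8)   # F-measure at each threshold
    return p, r, f.max()                                # PR curve points and max F-measure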


Fig. 10. Visual comparisons with the top-three CNN-based models (CPFP [51], TANet [82], and PCF [49]) and three classical nondeep methods (MDSF [45], SE [42], and DCMC [55]) on five data sets. Further results can be found at http://dpfan.net/D3NetBenchmark. (a) RGB. (b) Depth. (c) GT. (d) D3Net. (e) CPFP [51]. (f) TANet [82]. (g) PCF [49]. (h) MDSF [45]. (i) SE [42]. (j) DCMC [55].


4) S-Measure Sα: Both the MAE and F-measure metrics ignore important structural information. However, behavioral vision studies have shown that the human visual system is highly sensitive to structures in scenes [58]. Thus, we additionally include the structure measure (S-measure [58]). The S-measure combines the region-aware (Sr) and object-aware (So) structural similarities as the final structure metric

Sα = α ∗ So + (1 − α) ∗ Sr (6)

where α ∈ [0, 1] is the balance parameter and is set to 0.5.

5) E-Measure Eξ: The E-measure is the recently proposed enhanced alignment measure [59] from the binary map evaluation field. This measure is based on cognitive vision studies and combines local pixel values with the image-level mean value in one term, jointly capturing image-level statistics and local pixel matching information. Here, we report the max E-measure to provide a more comprehensive evaluation.

C. Metric Statistics

For a given metric ζ ∈ {Sα, Fβ, Eξ, M}, we consider different statistics. Let I^i_j denote the j-th image from a specific data set Di, so that Di = {I^i_1, I^i_2, . . . , I^i_j}, and let ζ(I^i_j) be the metric score on image I^i_j. The mean is the average data set statistic, defined as Mζ(Di) = (1/|Di|) Σ_j ζ(I^i_j), where |Di| is the total number of images in the data set Di. The mean statistics over different data sets are summarized in Table II.
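In code, the mean statistic is a straightforward average of per-image scores; the sketch below uses the MAE of (5) as an illustrative choice of ζ, and the map lists are placeholders.

import numpy as np

def mean_statistic(metric, saliency_maps, gt_masks):
    """M_zeta(D_i): average a per-image metric zeta over one data set."""
    scores = [metric(s, g) for s, g in zip(saliency_maps, gt_masks)]
    return float(np.mean(scores))

# Example with the MAE metric of (5) as zeta (predicted_maps/gt_maps are placeholder lists).
mae = lambda s, g: np.abs(s.astype(float) - (g > 0)).mean()
# mean_mae = mean_statistic(mae, predicted_maps, gt_maps)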

D. Performance Comparison and Analysis

1) Performance of Traditional Models: Based on the overall performances listed in Table II, we observe that SE [42], MDSF [45], and DCMC [55] are the top-three traditional algorithms. Utilizing superpixel technology, both SE and DCMC explicitly extract region contrast features from the RGB image. In contrast, MDSF formulates SOD as a pixelwise binary labeling problem, which is solved by an SVM.

2) Performance of Deep Models: Our D3Net, CPFP [51], and TANet [82] are the top-three deep models out of all leading methods, showing the strong feature representation ability of deep learning for this task.

3) Traditional Versus Deep Models: From Table II, we observe that most of the deep models perform better than the traditional algorithms. Interestingly, MDSF [45] outperforms two deep models (i.e., DF [44] and AFNet [61]) on the NLPR data set.

E. Comparison With SOTAs

We compare our D3Net with 17 SOTA models in Table II. In general, our model outperforms the best published result (CPFP [51], CVPR'19) by large margins of 1.0%∼5.8% on six data sets. Notably, we also achieve a significant improvement of 1.4% on the proposed real-world SIP data set.

We also report saliency maps generated on various challenging scenes to show the visual superiority of our D3Net. Some representative examples are shown in Fig. 10, such as cases where the structure of the salient object in the depth map is partially (e.g., the first, fourth, and fifth rows) or dramatically (the second and third rows) damaged.


TABLE V
S-measure↑ score on our SIP and the STERE data set. The symbol ↑ indicates that the higher the score, the better the model performs, and vice versa. See details in Section VII.

Specifically, in the third and fifth rows, the depth of the salient object is locally connected with background scenes. Also, the fourth row contains multiple isolated salient objects. For these challenging situations, most of the existing top competitors are unlikely to locate the salient objects due to their poor depth maps or insufficient multimodal fusion schemes. Although CPFP [51], TANet [82], and PCF [49] can generate more correct saliency maps than the others, their results often include noticeable background regions (third to fifth rows) or lose fine details of the salient object (first row) due to the lack of a cross-modality learning ability. In contrast, our D3Net can eliminate low-quality depth maps and adaptively select complementary cues from RGB and depth images to infer the real salient object and highlight its details.

VI. APPLICATIONS

A. Human Activities

Nowadays, mobile phones generally have depth-sensing cameras. With RGB-D SOD, users can better achieve functions such as object extraction, a bokeh effect, mobile user recognition, and so on. Many monitoring probes also have depth sensors, and RGB-D SOD can be helpful for the discovery of suspicious objects. For example, there is a LiDAR probe in autonomous vehicles designed to obtain depth information; RGB-D SOD is, thus, helpful for detecting basic objects, such as pedestrians and signboards, in these vehicles. There are also depth sensors in most industrial robots, so RGB-D SOD can help them better perceive the environment and take certain actions.

B. Background-Changing Application

Background-changing techniques have become vital for art designers to leverage the increasing volumes of available image databases. Traditional designers utilize Photoshop to design their products. This is quite a time-consuming task and requires significant technical knowledge, and a large majority of potential users fail to grasp such high-skill techniques in art design. Thus, an easy-to-use application is needed.

To overcome the abovementioned drawbacks, SOD technology could be a potential solution. Previous similar works, such as the automatic generation of visual-textual applications [104], [105], motivate us to create a background-changing application for book cover layouts. We provide a prototype demo, as shown in Fig. 11. First, the user can upload an image as a candidate design image [input image in Fig. 11(a)].

Fig. 11. Examples of the book cover maker (see Section VI for details). (a) Input image. (b) Template. (c) Salient object. (d) Results.

Then, content-based image features, such as an RGB-D-based saliency map, are considered in order to automatically generate salient objects. Finally, the system allows us to choose from our library of professionally designed book cover layouts [template in Fig. 11(b)]. By combining high-level template constraints and low-level image features, we obtain the background-changed book cover [results in Fig. 11(d)].

Since designing a complete software system is not our main focus in this article, future researchers can follow [104] and set our visual background image with a specified topic [105]. In stage two, the input image is resized to match the target style size and preserve the salient region according to the inference of our D3Net model.

VII. DISCUSSION

Based on our comprehensive benchmarking results, we present our conclusions on the most important questions, which may help the research community rethink the use of RGB-D images for SOD.

A. Ablation Study

We now provide a detailed analysis of the proposed baseline D3Net model. To verify the effectiveness of the depth map filtering mechanism (the DDU), we derive two ablation settings: w/o DDU and w/ DDU, which refer to our D3Net without and with the DDU, respectively. For w/o DDU, we further test the performance of the three subnetworks in the test phase of D3Net. In Table V, we observe that RgbdNet performs better than RgbNet on the SIP, STERE, DES, LFSD, SSD, and NJU2K data sets. This indicates that the cross-modality (RGB and depth) features show strong promise for RGB-D image representation learning. In most cases, however, DepthNet has lower performance than RgbdNet and RgbNet. This shows that, based only on a single modality, it is difficult for the model to construct the geometric structure in an image.

From Table V, we also observe that the use of the DDU improves the performance (compared with RgbdNet) to a certain extent on the STERE, DES, NJU2K, and NLPR data sets.


Fig. 12. Comparison with the previous object-level data sets that are labeled with polygons (the foot pad in NLPR [39]), coarse boundaries (i.e., the chair in DES [38]), and missed object parts (e.g., the person in SSD [66]). In contrast, the proposed object-/instance-level SIP data set is labeled with smooth, fine boundaries. More specifically, occlusions are also considered (e.g., the barrier region).

We attribute the improvement to the DDU being able to discard low-quality depth maps and select the optimal path (RgbNet or RgbdNet). For the SSD data set, however, the DDU achieves comparable performance to the single-stream network (i.e., RgbdNet). It is worth mentioning that D3Net outperforms any prior approach intended for SOD without any postprocessing techniques, such as CRF, which are typically used to boost scores. In order to know the lower and upper bounds of our D3Net, we additionally select the optimal path (RgbdNet or RgbNet) of the D3Net. For example, for a specific RGB image (Irgb) and depth map (Idepth), the two predicted maps, i.e., Srgb and Srgbd, can be assessed separately. Thus, for each input, we know the best output of the existing network. We aggregate all the best and worst results to obtain the upper bound and lower bound of our D3Net. From the results listed in Table V, D3Net still has a ∼1.6% performance gap on average relative to the upper bound.
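A hedged sketch of this oracle-style bound computation is shown below; the scoring function (e.g., an S-measure implementation, as suggested by Table V) is passed in as a placeholder rather than implemented here.

import numpy as np

def ddu_bounds(pred_rgb_maps, pred_rgbd_maps, gt_masks, score_fn):
    """Per-image oracle selection between S_rgb and S_rgbd (upper/lower bounds of the DDU).

    `score_fn(pred, gt)` is a higher-is-better metric, e.g., an S-measure implementation.
    """
    best, worst = [], []
    for s_rgb, s_rgbd, gt in zip(pred_rgb_maps, pred_rgbd_maps, gt_masks):
        scores = (score_fn(s_rgb, gt), score_fn(s_rgbd, gt))
        best.append(max(scores))    # oracle path selection -> upper bound
        worst.append(min(scores))   # adversarial path selection -> lower bound
    return float(np.mean(best)), float(np.mean(worst))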

B. Limitations

First, it is worth pointing out that the number of images in the SIP data set is relatively small compared with most data sets for RGB SOD. Our goal behind building this data set is to explore the potential direction of smartphone-based applications. As can be seen from the benchmark results and the demo application described in Section VI, SOD over real human activity scenes is a promising direction. We plan to keep growing the data set with more challenging situations and various kinds of foreground persons.

Second, our simple general framework D3Net consists of three subnetworks, which may increase the memory footprint on a lightweight device. In a real environment, several strategies can be considered to avoid this, such as replacing the backbone with MobileNet V2 [106], dimension reduction [107], or using the recently released ESPNet V2 [108] models.

Third, we present the lower and upper bounds of the DDU. The optimal upper bound is obtained by feeding the input into RgbdNet or RgbNet so that the predicted map is optimal. As shown in Table V, our DDU module does not achieve the best upper bound on the current training subset. There is, thus, still an opportunity to design a better DDU to further improve the performance.

VIII. CONCLUSION

We present systematic studies on RGB-D-based SOD by: 1) introducing a new human-oriented SIP data set reflecting realistic in-the-wild mobile use scenarios; 2) designing a novel D3Net; and 3) conducting the largest-scale (∼97k) benchmark to date. Compared with the existing data sets, SIP covers several challenges (e.g., background diversity and occlusion) of humans in real environments. Moreover, the proposed baseline achieves promising results and is among the fastest methods, making it a practical solution to RGB-D SOD. The comprehensive benchmarking results include 32 summarized SOTAs and 18 evaluated traditional/deep models. We hope this benchmark will accelerate not only the development of this area but also that of related areas (e.g., stereo estimation/matching [109], multiple salient person detection, salient instance detection [19], sensitive object detection [110], and image segmentation [111]). Note that the components used in our D3Net baseline are simple; more complex components (e.g., the PDC in [112]) or training strategies [113] are promising avenues for increasing the performance. In the future, we plan to incorporate recently proposed techniques, e.g., the weighted triplet loss [114], hierarchical deep features [115], and visual question-driven saliency [116], into our D3Net to further boost the performance. Since this submission, many interesting models, such as UCNet [117], JL-DCF [118], GFNet [119], DMRA [120], ERNet [121], and BiANet [122], have been released. Please refer to our online leaderboard (http://dpfan.net/d3netbenchmark/) for more details; this website will be updated continually. We foresee this study driving SOD toward real-world application scenarios with multiple salient persons and complex interactions on mobile devices (e.g., smartphones or tablets).

ACKNOWLEDGMENT

The authors thank Jia-Xing Zhao, Yun Liu, and Qibin Hou for insightful feedback.

REFERENCES

[1] A. Borji, M.-M. Cheng, Q. Hou, H. Jiang, and J. Li, “Salient object detection: A survey,” Comput. Vis. Media, vol. 5, no. 2, pp. 117–150, Jun. 2019.

[2] T. Wang et al., “Detect globally, refine locally: A novel approach to saliency detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3127–3135.

[3] H. Fu, D. Xu, S. Lin, and J. Liu, “Object-based RGBD image co-segmentation with mutex constraint,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4428–4436.

[4] P. Zhang, W. Liu, H. Lu, and C. Shen, “Salient object detection with lossless feature reflection and weighted structural loss,” IEEE Trans. Image Process., vol. 28, no. 6, pp. 3048–3060, Jun. 2019.


[5] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, “Salient object detection: A benchmark,” IEEE Trans. Image Process., vol. 24, no. 12, pp. 5706–5722, Dec. 2015.

[6] W. Wang, J. Shen, J. Xie, M.-M. Cheng, H. Ling, and A. Borji, “Revisiting video saliency prediction in the deep learning era,” IEEE Trans. Pattern Anal. Mach. Intell., early access, Jun. 24, 2019, doi: 10.1109/TPAMI.2019.2924417.

[7] D.-P. Fan, W. Wang, M.-M. Cheng, and J. Shen, “Shifting more attention to video salient object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 8554–8564.

[8] Y. Zeng, Y. Zhuge, H. Lu, L. Zhang, M. Qian, and Y. Yu, “Multi-source weak supervision for saliency detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019.

[9] R. Wu, M. Feng, W. Guan, D. Wang, H. Lu, and E. Ding, “A mutual learning method for salient object detection with intertwined multi-supervision,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 8150–8159.

[10] L. Zhang, J. Zhang, Z. Lin, H. Lu, and Y. He, “CapSal: Leveraging captioning to boost semantics for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 6024–6033.

[11] M. Feng, H. Lu, and E. Ding, “Attentive feedback network for boundary-aware salient object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1623–1632.

[12] X. Hu, L. Zhu, J. Qin, C.-W. Fu, and P.-A. Heng, “Recurrently aggregating deep features for salient object detection,” in Proc. AAAI Conf. Artif. Intell., 2018, pp. 6943–6950.

[13] L. Wang, L. Wang, H. Lu, P. Zhang, and X. Ruan, “Salient object detection with recurrent fully convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1734–1746, Jul. 2019.

[14] Y. Xu, X. Hong, F. Porikli, X. Liu, J. Chen, and G. Zhao, “Saliency integration: An arbitrator model,” IEEE Trans. Multimedia, vol. 21, no. 1, pp. 98–113, Jan. 2019.

[15] N. Liu, J. Han, and M.-H. Yang, “PiCANet: Learning pixel-wise contextual attention for saliency detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3089–3098.

[16] Z. Deng et al., “R3Net: Recurrent residual refinement network for saliency detection,” in Proc. 27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 684–690.

[17] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, and P. H. S. Torr, “Deeply supervised salient object detection with short connections,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 4, pp. 815–828, Apr. 2019.

[18] D. Tao, J. Cheng, M. Song, and X. Lin, “Manifold ranking-based matrix factorization for saliency detection,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 6, pp. 1122–1134, Jun. 2016.

[19] G. Li, Y. Xie, L. Lin, and Y. Yu, “Instance-level salient object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 247–256.

[20] M. A. Amirul, M. Kalash, M. Rochan, N. Bruce, and Y. Wang, “Salient object detection using a context-aware refinement network,” in Proc. Brit. Mach. Vis. Conf., 2017, pp. 1–12.

[21] Z. Luo, A. Mishra, A. Achkar, J. Eichel, S. Li, and P.-M. Jodoin, “Non-local deep features for salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2017, pp. 6609–6617.

[22] T. Chen, L. Lin, L. Liu, X. Luo, and X. Li, “DISC: Deep image saliency computing via progressive representation learning,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 6, pp. 1135–1149, Jun. 2016.

[23] G. Lee, Y.-W. Tai, and J. Kim, “Deep saliency with encoded low level distance map and high level features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 660–668.

[24] R. Zhao, W. Ouyang, H. Li, and X. Wang, “Saliency detection by multi-context deep learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1265–1274.

[25] S. Chen, X. Tan, B. Wang, and X. Hu, “Reverse attention for salient object detection,” in Proc. Eur. Conf. Comput. Vis. Springer, 2018, pp. 234–250.

[26] Y. Zhuge, Y. Zeng, and H. Lu, “Deep embedding features for salient object detection,” in Proc. AAAI Conf. Artif. Intell., 2019, pp. 9340–9347.

[27] J. Su, J. Li, Y. Zhang, C. Xia, and Y. Tian, “Selectivity or invariance: Boundary-aware salient object detection,” 2018, arXiv:1812.10066. [Online]. Available: http://arxiv.org/abs/1812.10066

[28] P. Jiang, Z. Pan, N. Vasconcelos, B. Chen, and J. Peng, “Super diffusion for salient object detection,” 2018, arXiv:1811.09038. [Online]. Available: http://arxiv.org/abs/1811.09038

[29] Z. Li, C. Lang, Y. Chen, J. Liew, and J. Feng, “Deep reasoning with multi-scale context for salient object detection,” 2019, arXiv:1901.08362. [Online]. Available: http://arxiv.org/abs/1901.08362

[30] S. Jia and N. D. B. Bruce, “Richer and deeper supervision network for salient object detection,” 2019, arXiv:1901.02425. [Online]. Available: http://arxiv.org/abs/1901.02425

[31] X. Huang and Y.-J. Zhang, “300-FPS salient object detection via minimum directional contrast,” IEEE Trans. Image Process., vol. 26, no. 9, pp. 4243–4254, Sep. 2017.

[32] X. Li, F. Yang, H. Cheng, W. Liu, and D. Shen, “Contour knowledge transfer for salient object detection,” in Proc. Eur. Conf. Comput. Vis. Springer, 2018, pp. 355–370.

[33] M. Kummerer, T. S. A. Wallis, and M. Bethge, “Saliency benchmarking made easy: Separating models, maps and metrics,” in Proc. Eur. Conf. Comput. Vis. Springer, 2018.

[34] X. Chen, A. Zheng, J. Li, and F. Lu, “Look, perceive and segment: Finding the salient objects in images via two-stream fixation-semantic CNNs,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 1050–1058.

[35] M. A. Islam, M. Kalash, and N. D. B. Bruce, “Revisiting salient object detection: Simultaneous detection, ranking, and subitizing of multiple salient objects,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7142–7150.

[36] A. Ciptadi, T. Hermans, and J. Rehg, “An in depth view of saliency,” in Proc. Brit. Mach. Vis. Conf., 2013, pp. 1–11.

[37] R. Ju, L. Ge, W. Geng, T. Ren, and G. Wu, “Depth saliency based on anisotropic center-surround difference,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2014, pp. 1115–1119.

[38] Y. Cheng, H. Fu, X. Wei, J. Xiao, and X. Cao, “Depth enhanced saliency detection method,” in Proc. Int. Conf. Internet Multimedia Comput. Service (ICIMCS), 2014, p. 23.

[39] H. Peng, B. Li, W. Xiong, W. Hu, and R. Ji, “RGBD salient object detection: A benchmark and algorithms,” in Proc. Eur. Conf. Comput. Vis. Springer, 2014, pp. 92–109.

[40] J. Ren, X. Gong, L. Yu, W. Zhou, and M. Y. Yang, “Exploiting global priors for RGB-D saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2015, pp. 25–32.

[41] D. Feng, N. Barnes, S. You, and C. McCarthy, “Local background enclosure for RGB-D salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2343–2350.

[42] J. Guo, T. Ren, and J. Bei, “Salient object detection for RGB-D image via saliency evolution,” in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2016, pp. 1–6.

[43] J. Han, H. Chen, N. Liu, C. Yan, and X. Li, “CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion,” IEEE Trans. Cybern., vol. 48, no. 11, pp. 3171–3183, Nov. 2018.

[44] L. Qu, S. He, J. Zhang, J. Tian, Y. Tang, and Q. Yang, “RGBD salient object detection via deep fusion,” IEEE Trans. Image Process., vol. 26, no. 5, pp. 2274–2285, May 2017.

[45] H. Song, Z. Liu, H. Du, G. Sun, O. Le Meur, and T. Ren, “Depth-aware salient object detection and segmentation via multiscale discriminative saliency fusion and bootstrap learning,” IEEE Trans. Image Process., vol. 26, no. 9, pp. 4204–4216, Sep. 2017.

[46] C. Zhu, G. Li, W. Wang, and R. Wang, “An innovative salient object detection using center-dark channel prior,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2017.

[47] F. Liang, L. Duan, W. Ma, Y. Qiao, Z. Cai, and L. Qing, “Stereoscopic saliency model using contrast and depth-guided-background prior,” Neurocomputing, vol. 275, pp. 2227–2238, Jan. 2018.

[48] C. Zhu, X. Cai, K. Huang, T. H. Li, and G. Li, “PDNet: Prior-model guided depth-enhanced network for salient object detection,” in Proc. IEEE Int. Conf. Multimedia Expo (ICME), Jul. 2019, pp. 199–204.

[49] H. Chen and Y. Li, “Progressively complementarity-aware fusion network for RGB-D salient object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 3051–3060.

[50] W. Wang, J. Shen, Y. Yu, and K.-L. Ma, “Stereoscopic thumbnail creation via efficient stereo saliency detection,” IEEE Trans. Vis. Comput. Graphics, vol. 23, no. 8, pp. 2014–2027, Aug. 2017.

[51] J.-X. Zhao, Y. Cao, D.-P. Fan, M.-M. Cheng, X.-Y. Li, and L. Zhang, “Contrast prior and fluid pyramid integration for RGBD salient object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3927–3936.

[52] Z. Zhang, “Microsoft kinect sensor and its effect,” IEEE Multimedia Mag., vol. 19, no. 2, pp. 4–10, Feb. 2012.


[53] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan, “Light field photography with a hand-held plenoptic camera,” Comput. Sci. Tech. Rep., vol. 2, no. 11, pp. 1–11, 2005.

[54] C. Liu, J. Yuen, and A. Torralba, “SIFT flow: Dense correspondence across scenes and its applications,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 978–994, May 2011.

[55] R. Cong, J. Lei, C. Zhang, Q. Huang, X. Cao, and C. Hou, “Saliency detection for stereoscopic images based on depth confidence analysis and multiple cues fusion,” IEEE Signal Process. Lett., vol. 23, no. 6, pp. 819–823, Jun. 2016.

[56] R. Cong, J. Lei, H. Fu, M.-M. Cheng, W. Lin, and Q. Huang, “Review of visual saliency detection with comprehensive information,” IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 10, pp. 2941–2959, Oct. 2019.

[57] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 1597–1604.

[58] D.-P. Fan, M.-M. Cheng, Y. Liu, T. Li, and A. Borji, “Structure-measure: A new way to evaluate foreground maps,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4548–4557.

[59] D.-P. Fan, C. Gong, Y. Cao, B. Ren, M.-M. Cheng, and A. Borji, “Enhanced-alignment measure for binary foreground map evaluation,” in Proc. 27th Int. Joint Conf. Artif. Intell., Jul. 2018, pp. 698–704.

[60] R. Margolin, L. Zelnik-Manor, and A. Tal, “How to evaluate foreground maps,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 248–255.

[61] N. Wang and X. Gong, “Adaptive fusion for RGB-D salient object detection,” IEEE Access, vol. 7, pp. 55277–55284, 2019.

[62] H. Du, Z. Liu, H. Song, L. Mei, and Z. Xu, “Improving RGBD saliency detection using progressive region classification and saliency fusion,” IEEE Access, vol. 4, pp. 8987–8994, 2016.

[63] Y. Niu, Y. Geng, X. Li, and F. Liu, “Leveraging stereopsis for saliency analysis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 454–461.

[64] N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu, “Saliency detection on light field,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2806–2813.

[65] D. Sun, S. Roth, and M. J. Black, “Secrets of optical flow estimation and their principles,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 2432–2439.

[66] G. Li and C. Zhu, “A three-pathway psychobiological framework of salient object detection using stereoscopic technology,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2017, pp. 3008–3014.

[67] R. Cong et al., “An iterative co-saliency framework for RGBD images,” IEEE Trans. Cybern., vol. 49, no. 1, pp. 233–246, Nov. 2017.

[68] R. Cong, J. Lei, H. Fu, Q. Huang, X. Cao, and N. Ling, “HSCS: Hierarchical sparsity based co-saliency detection for RGBD images,” IEEE Trans. Multimedia, vol. 21, no. 7, pp. 1660–1671, Jul. 2019.

[69] R. Cong, J. Lei, H. Fu, Q. Huang, X. Cao, and C. Hou, “Co-saliency detection for RGBD images based on multi-constraint feature matching and cross label propagation,” IEEE Trans. Image Process., vol. 27, no. 2, pp. 568–579, Feb. 2018.

[70] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, Mar. 2015.

[71] L. Wang et al., “Learning to detect salient objects with image-level supervision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 136–145.

[72] P. Sauer, T. Cootes, and C. Taylor, “Accurate regression procedures for active appearance models,” in Proc. Brit. Mach. Vis. Conf., 2011, pp. 1–11.

[73] K. Desingh, K. M. Krishna, D. Rajan, and C. Jawahar, “Depth really matters: Improving visual salient region detection with depth,” in Proc. Brit. Mach. Vis. Conf., 2013, pp. 1–11.

[74] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, 2011.

[75] X. Fan, Z. Liu, and G. Sun, “Salient region detection for stereoscopic images,” in Proc. 19th Int. Conf. Digit. Signal Process., Aug. 2014, pp. 454–458.

[76] R. Shigematsu, D. Feng, S. You, and N. Barnes, “Learning RGB-D salient object detection using background enclosure, depth contrast, and top-down features,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Oct. 2017, pp. 2749–2757.

[77] A. Wang and M. Wang, “RGB-D salient object detection via minimum barrier distance transform and saliency fusion,” IEEE Signal Process. Lett., vol. 24, no. 5, pp. 663–667, May 2017.

[78] P. Huang, C.-H. Shen, and H.-F. Hsiao, “RGBD salient object detection using spatially coherent deep learning framework,” in Proc. IEEE 23rd Int. Conf. Digit. Signal Process. (DSP), Nov. 2018, pp. 1–5.

[79] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik, “Learning rich features from RGB-D images for object detection and segmentation,” in Proc. Eur. Conf. Comput. Vis. Springer, 2014, pp. 345–360.

[80] H. Chen, Y.-F. Li, and D. Su, “Attention-aware cross-modal cross-level fusion network for RGB-D salient object detection,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 6821–6826.

[81] H. Chen, Y. Li, and D. Su, “Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection,” Pattern Recognit., vol. 86, pp. 376–385, Feb. 2019.

[82] H. Chen and Y. Li, “Three-stream attention-aware network for RGB-D salient object detection,” IEEE Trans. Image Process., vol. 28, no. 6, pp. 2825–2835, Jun. 2019.

[83] D.-P. Fan, J.-J. Liu, S.-H. Gao, Q. Hou, A. Borji, and M.-M. Cheng, “Salient objects in clutter: Bringing salient object detection to the foreground,” in Proc. Eur. Conf. Comput. Vis. Springer, 2018, pp. 1597–1604.

[84] S. Alpert, M. Galun, R. Basri, and A. Brandt, “Image segmentation by probabilistic bottom-up aggregation and cue integration,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2007, pp. 1–8.

[85] H. Jiang, M.-M. Cheng, S.-J. Li, A. Borji, and J. Wang, “Joint salient object detection and existence prediction,” Frontiers Comput. Sci., vol. 13, no. 4, pp. 778–788, Aug. 2019.

[86] G. Li and Y. Yu, “Visual saliency based on multiscale deep features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 5455–5463.

[87] T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum, “Learning to detect a salient object,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2007, pp. 1–8.

[88] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. 8th IEEE Int. Conf. Comput. Vis. (ICCV), vol. 2, 2001, pp. 416–423.

[89] C. Xia, J. Li, X. Chen, A. Zheng, and Y. Zhang, “What is and what is not a salient object? Learning salient object detector by ensembling linear exemplar regressors,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 4142–4150.

[90] Q. Yan, L. Xu, J. Shi, and J. Jia, “Hierarchical saliency detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1155–1162.

[91] Y. Li, X. Hou, C. Koch, J. M. Rehg, and A. L. Yuille, “The secrets of salient object segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 280–287.

[92] B. W. Tatler, R. J. Baddeley, and I. D. Gilchrist, “Visual correlates of fixation selection: Effects of scale and time,” Vis. Res., vol. 45, no. 5, pp. 643–659, Mar. 2005.

[93] E. L. Kaufman, M. W. Lord, T. W. Reese, and J. Volkmann, “The discrimination of visual number,” Amer. J. Psychol., vol. 62, no. 4, pp. 498–525, 1949.

[94] A. Kirillov, R. Girshick, K. He, and P. Dollar, “Panoptic feature pyramid networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 6399–6408.

[95] X. Chen, R. Girshick, K. He, and P. Dollar, “TensorMask: A foundation for dense object segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 2061–2069.

[96] Y. Xiong et al., “UPSNet: A unified panoptic segmentation network,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 8818–8826.

[97] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2117–2125.

[98] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. Int. Conf. Learn. Represent., 2015, pp. 1–14.

[99] S. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. H. S. Torr, “Res2Net: A new multi-scale backbone architecture,” IEEE Trans. Pattern Anal. Mach. Intell., early access, Aug. 30, 2019, doi: 10.1109/TPAMI.2019.2938758.

[100] T. Zhao and X. Wu, “Pyramid feature attention network for saliency detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3085–3094.


[101] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 733–740.

[102] D. Tsai, M. Flagg, and J. Rehg, “Motion coherent tracking with multi-label MRF optimization,” in Proc. Brit. Mach. Vis. Conf., 2010, pp. 1–11.

[103] P. Zhang, D. Wang, H. Lu, H. Wang, and X. Ruan, “Amulet: Aggregating multi-level convolutional features for salient object detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 202–211.

[104] X. Yang, T. Mei, Y.-Q. Xu, Y. Rui, and S. Li, “Automatic generation of visual-textual presentation layout,” ACM Trans. Multimedia Comput., Commun., Appl., vol. 12, no. 2, p. 33, 2016.

[105] A. Jahanian et al., “Recommendation system for automatic design of magazine covers,” in Proc. Int. Conf. Intell. User Interface (IUI). New York, NY, USA: ACM, 2013, pp. 95–106.

[106] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4510–4520.

[107] J. Zhang, J. Yu, and D. Tao, “Local deep-feature alignment for unsupervised dimension reduction,” IEEE Trans. Image Process., vol. 27, no. 5, pp. 2420–2432, May 2018.

[108] S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi, “ESPNetv2: A light-weight, power efficient, and general purpose convolutional neural network,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 9190–9200.

[109] G.-Y. Nie et al., “Multi-level context ultra-aggregation for stereo matching,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3283–3291.

[110] J. Yu, B. Zhang, Z. Kuang, D. Lin, and J. Fan, “IPrivacy: Image privacy protection by identifying sensitive objects via deep multi-task learning,” IEEE Trans. Inf. Forensics Security, vol. 12, no. 5, pp. 1005–1016, May 2017.

[111] J. Shen, X. Dong, J. Peng, X. Jin, L. Shao, and F. Porikli, “Submodular function optimization for motion clustering and image segmentation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 9, pp. 2637–2649, Sep. 2019.

[112] D.-P. Fan, G.-P. Ji, G. Sun, M.-M. Cheng, J. Shen, and L. Shao, “Camouflaged object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2020, pp. 1–11.

[113] D.-P. Fan et al., “Inf-Net: Automatic COVID-19 lung infection segmentation from CT images,” IEEE Trans. Med. Imag., early access, doi: 10.1109/TMI.2020.2996645.

[114] J. Yu, C. Zhu, J. Zhang, Q. Huang, and D. Tao, “Spatial pyramid-enhanced NetVLAD with weighted triplet loss for place recognition,” IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 2, pp. 661–674, Feb. 2020.

[115] J. Yu, M. Tan, H. Zhang, D. Tao, and Y. Rui, “Hierarchical deep click feature prediction for fine-grained image recognition,” IEEE Trans. Pattern Anal. Mach. Intell., early access, Jul. 30, 2019, doi: 10.1109/TPAMI.2019.2932058.

[116] S. He, C. Han, G. Han, and J. Qin, “Exploring duality in visual question-driven top-down saliency,” IEEE Trans. Neural Netw. Learn. Syst., early access, Sep. 2, 2019, doi: 10.1109/TNNLS.2019.2933439.

[117] J. Zhang et al., “UC-Net: Uncertainty inspired RGB-D saliency detection via conditional variational autoencoders,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020.

[118] K. F. Fu, D.-P. Fan, G.-P. Ji, and Q. Zhao, “JL-DCF: Joint learning and densely-cooperative fusion framework for RGB-D salient object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Apr. 2020, pp. 1–11.

[119] Z. Liu, W. Zhang, and P. Zhao, “A cross-modal adaptive gated fusion generative adversarial network for RGB-D salient object detection,” Neurocomputing, vol. 387, pp. 210–220, Apr. 2020.

[120] Y. Piao, W. Ji, J. Li, M. Zhang, and H. Lu, “Depth-induced multi-scale recurrent attention network for saliency detection,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7254–7263.

[121] Y. Piao, Z. Rong, M. Zhang, and H. Lu, “Exploit and replace: An asymmetrical two-stream architecture for versatile light field saliency detection,” in Proc. AAAI Conf. Artif. Intell., 2020, pp. 1–9.

[122] Z. Zhang, Z. Lin, J. Xu, W. Jin, S.-P. Lu, and D.-P. Fan, “Bilateral attention network for RGB-D salient object detection,” 2020, arXiv:2004.14582. [Online]. Available: http://arxiv.org/abs/2004.14582

Deng-Ping Fan (Member, IEEE) received the Ph.D. degree from Nankai University, Tianjin, China, in 2019.

He joined the Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, United Arab Emirates, in 2019. From 2015 to 2019, he was a Ph.D. Candidate with the Department of Computer Science, Nankai University, directed by Prof. Ming-Ming Cheng. His current research interests include computer vision, image processing, and deep learning.

Dr. Fan received the Huawei Scholarship in 2017.

Zheng Lin is currently pursuing the Ph.D. degree with the College of Computer Science, Nankai University, Tianjin, China, under the supervision of Prof. Ming-Ming Cheng.

His research interests include deep learning, computer graphics, and computer vision.

Zhao Zhang received the B.Eng. degree from Yangzhou University, Yangzhou, China, in 2019. He is currently pursuing the master’s degree with Nankai University, Tianjin, China, under the supervision of Prof. Ming-Ming Cheng.

His research interests include computer vision and image processing.

Menglong Zhu (Member, IEEE) received the bachelor’s degree in computer science from Fudan University, Shanghai, China, in 2010, and the master’s degree in robotics and the Ph.D. degree in computer and information science from the University of Pennsylvania, Philadelphia, PA, USA, in 2012 and 2016, respectively.

He is currently a Computer Vision Software Engineer with Google AI, Mountain View, CA, USA. His research interests include object recognition, 3-D object/human pose estimation, human action recognition, visual simultaneous localization and mapping (SLAM), and text recognition.

Ming-Ming Cheng (Senior Member, IEEE) received the Ph.D. degree from Tsinghua University, Beijing, China, in 2012.

He then did a two-year research fellowship with Prof. Philip Torr at the University of Oxford, Oxford, U.K. He is currently a Professor with Nankai University, Tianjin, China, where he has been leading the Media Computing Lab. His research interests include computer graphics, computer vision, and image processing.

Dr. Cheng received several research awards, including the ACM China Rising Star Award, the IBM Global SUR Award, and the CCF-Intel Young Faculty Researcher Program.
