IEEE TRANSACTIONS ON MULTIMEDIA

HNIP: Compact Deep Invariant Representations for Video Matching, Localization, and Retrieval


Jie Lin, Ling-Yu Duan, Member, IEEE, Shiqi Wang, Yan Bai, Yihang Lou, Vijay Chandrasekhar, Tiejun Huang, Alex Kot, Fellow, IEEE, and Wen Gao, Fellow, IEEE


Abstract—With emerging demand for large-scale video analysis, MPEG initiated the compact descriptor for video analysis (CDVA) standardization in 2014. Unlike the handcrafted descriptors adopted by the ongoing CDVA standard, in this work we study the problem of deep learned global descriptors for video matching, localization, and retrieval. First, inspired by a recent invariance theory, we propose a nested invariance pooling (NIP) method to derive compact deep global descriptors from convolutional neural networks (CNN), by progressively encoding translation, scale, and rotation invariances into the pooled descriptors. Second, our empirical studies have shown that pooling moments (e.g., max or average) drastically affect video matching performance, which motivates us to design hybrid pooling operations within NIP (HNIP). HNIP further improves the discriminability of deep global descriptors. Third, the advantages and performance of the combination of deep and handcrafted descriptors are provided to better investigate their complementary effects. We evaluate the effectiveness of HNIP by incorporating it into the well-established CDVA evaluation framework. Experimental results show that HNIP outperforms state-of-the-art deep and canonical handcrafted descriptors with significant mAP gains of 5.5% and 4.7%, respectively. Moreover, the combination of HNIP and handcrafted global descriptors further boosts the performance of CDVA core techniques with comparable descriptor size.

Index Terms—Convolutional neural networks (CNN), deep global descriptors, handcrafted descriptors, hybrid nested invariance pooling, MPEG compact descriptor for video analysis (CDVA), MPEG CDVS.

Manuscript received December 5, 2016; revised April 3, 2017 and May 30, 2017; accepted May 30, 2017. This work was supported in part by the National High-tech R&D Program of China under Grant 2015AA016302, in part by the National Natural Science Foundation of China under Grant U1611461 and Grant 61661146005, and in part by the PKU-NTU Joint Research Institute (JRI) sponsored by a donation from the Ng Teng Fong Charitable Foundation. The guest editor coordinating the review of this manuscript and approving it for publication was Dr. Cees Snoek. (Corresponding author: Ling-Yu Duan.)

J. Lin and V. Chandrasekhar are with the Institute for Infocomm Research, A*STAR, Singapore 138632 (e-mail: [email protected]).

L.-Y. Duan, Y. Bai, Y. Lou, T. Huang, and W. Gao are with the Institute of Digital Media, Peking University, Beijing 100080, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

S. Wang is with the Department of Computer Science, City University of Hong Kong, Hong Kong, China (e-mail: [email protected]).

V. Chandrasekhar and A. Kot are with Nanyang Technological University, Singapore 639798 (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2017.2713410

I. INTRODUCTION

RECENT years have witnessed a remarkable growth of interest in video retrieval, which refers to searching for

videos representing the same object or scene as the one depicted in a query video. Such a capability can facilitate a variety of applications including mobile augmented reality (MAR), automotive, surveillance, media entertainment, etc. [1]. In the rapid evolution of video retrieval, both great promises and new challenges arising from real applications have been perceived [2]. Typically, video retrieval is performed at the server end, which requires the transmission of visual data over a wireless network [3], [4]. Instead of directly sending huge volumes of compressed video data, developing compact and robust video feature representations is highly desirable, as this enables low-latency transmission over bandwidth-constrained networks, e.g., thousands of bytes per second.

To this end, in 2009, MPEG started the standardization of Compact Descriptors for Visual Search (CDVS) [5], which came up with a normative bitstream of standardized descriptors for mobile visual search and augmented reality applications. In Sep. 2015, MPEG published the CDVS standard [6]. Very recently, towards large-scale video analysis, MPEG has moved forward to standardize Compact Descriptors for Video Analysis (CDVA) [7]. To deal with content redundancy in the temporal dimension, the latest CDVA Experimental Model (CXM) [8] casts video retrieval as a keyframe-based image retrieval task, in which the keyframe-level matching results are combined for video matching. The keyframe-level representation avoids descriptor extraction on dense frames in videos, which largely reduces the computational complexity (e.g., CDVA only extracts descriptors for 1-2% of the frames detected from raw videos).

In CDVS, handcrafted local and global descriptors have been successfully standardized in a compact and scalable manner (e.g., from 512 B to 16 KB), where local descriptors capture the invariant characteristics of local image patches and global descriptors like Fisher Vectors (FV) [9] and VLAD [10] reflect the aggregated statistics of local descriptors. Though handcrafted descriptors have achieved great success in the CDVS standard [5] and the CDVA experimental model, how to leverage promising deep learned global descriptors remains an open issue in the MPEG CDVA Ad-hoc group. Many recent works [11]–[18] have shown the advantages of deep global descriptors for image retrieval, which may be attributed to the remarkable success of Convolutional Neural Networks (CNN) [19], [20]. In particular, the state-of-the-art deep global descriptor R-MAC [18] computes the max over a set of Regions-of-Interest (ROI) cropped from feature maps output by an intermediate convolutional layer, followed by the average of these regional max-pooled features. Results show that R-MAC offers remarkable improvements over

other deep global descriptors like MAC [18] and SPoC [16], while maintaining the same dimensionality.

In the context of compact descriptors for video retrieval, there exist important practical issues with CNN-based deep global descriptors. First, one main drawback of CNN is the lack of invariance to geometric transformations of the input image such as rotations. The performance of deep global descriptors quickly degrades when the objects in query and database videos are rotated differently. Second, different from CNN features, handcrafted descriptors are robust to scale and rotation changes in the 2D plane, because of the local invariant feature detectors. As such, more insights should be provided on whether there is great complementarity between CNN and conventional handcrafted descriptors for better performance.

To tackle the above issues, we make the following contributions in this work:

1) We propose a Nested Invariance Pooling (NIP) method to produce compact global descriptors from CNN by progressively encoding translation, scale and rotation invariances. NIP is inspired by a recent invariance theory, which provides a practical and mathematically proven way of computing invariant representations with feedforward neural networks. In this respect, NIP is extensible to other types of transformations. Both quantitative and qualitative evaluations are introduced for a deeper look at the invariance properties.

2) We further improve the discriminability of deep global descriptors by designing Hybrid pooling moments within NIP (HNIP). Evaluations on video retrieval show that HNIP outperforms state-of-the-art deep and handcrafted descriptors by a large margin with comparable or smaller descriptor size.

3) We analyze the complementary nature of deep features and handcrafted descriptors over diverse datasets (landmarks, scenes and common objects). A simple combination strategy is introduced to fuse the strengths of both deep and handcrafted global descriptors. We show that the combined descriptors offer the best video matching and retrieval performance, without incurring extra cost in descriptor size compared to CDVA.

4) Due to its superior performance, HNIP has been adopted by the CDVA Ad-hoc group as a technical reference to set up new core experiments [21], which opens up future exploration of CNN techniques in the development of standardized video descriptors. The latest core experiments involve compact deep feature representation, CNN model compression, etc.

II. RELATED WORK

Handcrafted descriptors: Handcrafted local descriptors [22], [23], such as SIFT based on the Difference of Gaussians (DoG) detector [22], have been successfully and widely employed to conduct image matching and localization tasks due to their robustness to scale and rotation changes. Building on local image descriptors, global image representations aim to provide statistical summaries of high-level image properties and facilitate fast large-scale image search. In particular, for

global image descriptors, the most prominent ones include Bag-of-Words (BoW) [24], [25], Fisher Vector (FV) [9], VLAD [10], Triangulation Embedding [26] and the Robust Visual Descriptor with Whitening (RVDW) [27], with which fast comparisons against a large-scale database become practical.

Given the fact that raw descriptors such as SIFT and FV may consume an extraordinarily large number of bits for transmission and storage, many compact descriptors were developed. For example, numerous strategies have been proposed to compress SIFT using hashing [28], [29], transform coding [30], lattice coding [31] and vector quantization [32]. On the other hand, binary descriptors including BRIEF [33], ORB [34], BRISK [35] and the Ultrashort Binary Descriptor (USB) [36] were proposed, which support fast Hamming distance matching. For compact global descriptors, efforts have also been made to reduce descriptor size, such as a tree-structured quantizer [37] for BoW histograms, locality sensitive hashing [38], dimensionality reduction and vector quantization for FV [9], and simple sign binarization for VLAD-like descriptors [9], [39].

Deep descriptors: Deep learned descriptors have been extensively applied to image retrieval [11]–[18]. Initial studies [11], [13] proposed to use representations directly extracted from the fully connected layers of a CNN. More compact global descriptors [14]–[16] can be extracted by performing either global max or average pooling (e.g., SPoC in [16]) over feature maps output by intermediate layers. Further improvements are obtained by spatial or channel-wise weighting of the pooled descriptors [17]. Very recently, inspired by the R-CNN approach [40] used for object detection, Tolias et al. [18] proposed ROI-based pooling on deep convolutional features, Regional Maximum Activation of Convolutions (R-MAC), which significantly improves on global pooling approaches. Though R-MAC is scale invariant to some extent, it suffers from the lack of rotation invariance. These regional deep features can also be aggregated into global descriptors by VLAD [12].

In a number of recent works [13], [41]–[43], pre-trained CNNs for image classification are repurposed for the image retrieval task by fine-tuning them with specific loss functions (e.g., Siamese or triplet networks) over carefully constructed matching and non-matching training image sets. There is considerable performance improvement when training and test datasets are in similar domains (e.g., buildings). In this work, we aim to explicitly construct invariant deep global descriptors from the perspective of better leveraging state-of-the-art or classical CNN architectures, rather than further optimizing the learning of invariant deep descriptors.

Video descriptors: A video is typically composed of a number of moving frames. Therefore, a straightforward method for video descriptor representation is extracting feature descriptors at the frame level and then reducing the temporal redundancies of these descriptors for compact representation. For local descriptors, Baroffio et al. [44] proposed both intra- and inter-feature coding methods for SIFT in the context of visual sensor networks, and an additional mode decision scheme based on rate-distortion optimization was designed to further improve the feature coding efficiency. In [45], [46], studies have been conducted to encode binary features such as BRISK [35]. Makar et al. [47] presented a temporally coherent keypoint detector to allow efficient

interframe coding of canonical patches, corresponding feature descriptors, and locations towards mobile augmented reality applications. Chao et al. [48] developed a keypoint encoding technique, where locations, scales and orientations extracted from original videos are encoded and transmitted along with the compressed video to the server. Recently, the temporal dependencies of global descriptors have also been exploited. For BoW extracted from video sequences, Baroffio et al. [49] proposed an intra-frame coding method with uniform scalar quantization, as well as an inter-frame technique with arithmetic coding of the quantized symbols. Chen et al. [50], [51] proposed an encoding scheme for scalable residual-based global signatures, given the fact that the REVVs [39] of adjacent frames share most codewords and residual vectors.

Besides the frame-level approaches, aggregations of local descriptors over video slots and global descriptors over scenes have also been intensively explored [52]–[55]. In [54], temporal aggregation strategies for large-scale video retrieval were experimentally studied and evaluated with the CDVS global descriptors [56]. Four aggregation modes, including local feature, global signature, tracking-based and independent frame based aggregation schemes, were investigated. For video-level CNN representation, in [57], the authors applied FV and VLAD aggregation techniques over dense local features of CNN activation maps for video event detection.

III. MPEG CDVA

MPEG CDVA [7] aims to standardize the bitstream of compact video descriptors for large-scale video analysis. The CDVA standard imposes two main technical requirements on the dedicated descriptors: compactness and robustness. On the one hand, compact representation is an efficient way to economize transmission bandwidth, storage space and computational cost. On the other hand, robustness to geometric transformations such as rotation and scale variations is particularly required. To this end, in the 115th MPEG meeting, the CXM [8] was released, which mainly relies on the MPEG CDVS reference software TM14.2 [6] for keyframe-level compact and robust handcrafted descriptor representation based on scale and rotation invariant local features.

A. CDVS-Based Handcrafted Descriptors

The MPEG CDVS [5] standardized descriptors serve as the fundamental infrastructure to represent video keyframes. The normative blocks of the CDVS standard are illustrated in Fig. 1(b), mainly involving the extraction of compressed local and global descriptors. First, scale and rotation invariant interest key points are detected from the image, and a subset of reliable key points is retained, followed by the computation of handcrafted local SIFT features. The compressed local descriptors are formed by applying a low-complexity transform coding to the local SIFT features. The compact global descriptors are Fisher vectors aggregated from the selected local features, followed by scalable descriptor compression with simple sign binarization. Basically, pairwise image matching is accomplished by first comparing compact global descriptors, then further performing geometric consistency checking (GCC) with the compressed local descriptors.

CDVS handcrafted descriptors have a very low memory footprint, while preserving competitive matching and retrieval accuracy. The standard supports operating points ranging from 512 B to 16 kB specified for different bandwidth constraints. Overall, the 4 kB operating point achieves a good tradeoff between accuracy and complexity (e.g., transmission bitrate, search time). Thus, CDVA CXM adopts the 4 kB descriptors for keyframe-level representation, in which compressed local descriptors and compact global descriptors are both about 2 kB per keyframe.

B. CDVA Evaluation Framework

Fig. 1(a) shows the evaluation framework of CDVA, including keyframe detection, CDVS descriptor extraction, transmission, and video retrieval and matching. At the client side, color histogram comparison is applied to identify keyframes from video clips. The standardized CDVS descriptors are extracted from these keyframes, which can be further packed to form CDVA descriptors [58]. Keyframe detection largely reduces the temporal redundancy in videos, resulting in low-bitrate query descriptor transmission. At the server side, the same keyframe detection and CDVS descriptor extraction are applied to the database videos. Formally, we denote the query video X = {x_1, ..., x_{N_x}} and the reference video Y = {y_1, ..., y_{N_y}}, where x and y denote keyframes. N_x and N_y are the numbers of detected keyframes in the query and reference videos, respectively. The start and end timestamps for keyframes are recorded, e.g., [T^s_{x_n}, T^e_{x_n}] for query keyframe x_n. Here, we briefly describe the pipeline of pairwise video matching, localization and video retrieval with CDVA descriptors.

Pairwise video matching and localization: Pairwise video matching is performed by comparing the CDVA descriptors of the video keyframe pair (X, Y). Each keyframe in X is compared with all keyframes in Y. The video-level similarity K(X, Y) is defined as the largest matching score among all keyframe-level similarities. For example, if we consider video matching with CDVS global descriptors only,

$$K(X, Y) = \max_{x \in X,\, y \in Y} k(f(x), f(y)) \qquad (1)$$

where k(·, ·) denotes a matching function (e.g., cosine similarity) and f(x) denotes the CDVS global descriptor for keyframe x.^1 Following the matching pipeline in CDVS, if k(·, ·) exceeds a pre-defined threshold, GCC with CDVS local descriptors is subsequently applied to verify true positive keyframe matches. Then the keyframe-level similarity is finally determined as the product of the matching scores from both the CDVS global and local descriptors. Correspondingly, K(X, Y) in (1) is refined as the maximum of their combined similarities.
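As a concrete illustration of (1), the sketch below computes the video-level similarity as the maximum cosine similarity over all keyframe descriptor pairs. It is a minimal NumPy example, not the CDVA reference software; descriptor layout and names are illustrative assumptions.

import numpy as np

def video_similarity(query_desc, ref_desc):
    """Video-level similarity K(X, Y) as in (1): the maximum cosine
    similarity over all keyframe descriptor pairs.

    query_desc: (Nx, D) array of keyframe descriptors f(x) for the query video.
    ref_desc:   (Ny, D) array of keyframe descriptors f(y) for the reference video.
    """
    # L2-normalize each keyframe descriptor so dot products equal cosine similarity.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    r = ref_desc / np.linalg.norm(ref_desc, axis=1, keepdims=True)
    sims = q @ r.T              # (Nx, Ny) keyframe-level similarities k(f(x), f(y))
    return float(sims.max())    # K(X, Y) = max over all keyframe pairs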

The matched keyframe timestamps between the query and reference videos are recorded for evaluating the temporal localization task, i.e., locating the video segment containing the item of interest. In particular, if the product of the CDVS global and local matching scores exceeds a predefined threshold, the corresponding keyframe timestamps are recorded. Assuming there are τ (τ ≤ N_x) keyframes satisfying this criterion in a query video, the matching video segment is defined by T′_start = min{T^s_{x_n}} and T′_end = max{T^e_{x_n}}, where 1 ≤ n ≤ τ. As such, we can obtain the predicted matching video segment by descriptor matching, as shown in Fig. 1(c).

1 We use the same notation for deep global descriptors in the following section.

Fig. 1. (a) Illustration of the MPEG CDVA evaluation framework. (b) Descriptor extraction pipeline for MPEG CDVS. (c) Temporal localization of the item of interest between a video pair.

Video retrieval: Video retrieval differs from pairwise video matching in that the former is one-to-many matching, while the latter is one-to-one matching. Thus, video retrieval shares a similar matching pipeline with pairwise video matching, except for the following differences: 1) For each query keyframe, the top K_g candidate keyframes are retrieved from the database by comparing CDVS global descriptors. Subsequently, GCC reranking with CDVS local descriptors is performed between the query keyframe and each candidate, and the top K_l database keyframes are recorded. The default choices for K_g and K_l are 500 and 100, respectively. 2) For each query video, all returned database keyframes are merged into candidate database videos according to their video indices. Then, the video-level similarity between the query and each candidate database video is obtained following the same principle as pairwise video matching. Finally, the top ranked candidate database videos are returned.
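The two-stage retrieval just described could be organized roughly as in the following sketch. This is not the CXM implementation: the geometric consistency check is abstracted as a hypothetical gcc_score callable, descriptors are assumed L2-normalized, and the data layout is an illustrative assumption.

import numpy as np
from collections import defaultdict

def retrieve_videos(query_kf, db_kf, db_video_idx, gcc_score, Kg=500, Kl=100):
    """Keyframe-based video retrieval: global shortlist, local (GCC) rerank,
    then merge keyframes into candidate videos ranked by video-level similarity.

    query_kf:     (Nq, D) L2-normalized global descriptors of the query keyframes.
    db_kf:        (Nd, D) L2-normalized global descriptors of database keyframes.
    db_video_idx: (Nd,) video index of each database keyframe.
    gcc_score:    callable(q_kf_id, db_kf_id) -> geometric verification score.
    """
    video_scores = defaultdict(float)
    for q_id, q in enumerate(query_kf):
        sims = db_kf @ q                       # global-descriptor similarities
        shortlist = np.argsort(-sims)[:Kg]     # top-Kg candidate keyframes
        # rerank the shortlist with the local-descriptor geometric check
        reranked = sorted(shortlist,
                          key=lambda d: sims[d] * gcc_score(q_id, d),
                          reverse=True)[:Kl]
        for d in reranked:
            v = int(db_video_idx[d])
            combined = sims[d] * gcc_score(q_id, d)           # combined keyframe score
            video_scores[v] = max(video_scores[v], combined)  # video-level = max
    # return candidate videos ranked by their video-level similarity
    return sorted(video_scores.items(), key=lambda kv: -kv[1])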

IV. METHOD

A. Hybrid Nested Invariance Pooling

Fig. 2(a) shows the extraction pipeline of our compact deep invariant global descriptors with HNIP. Given a video keyframe x, we rotate it R times (with step size θ°). By forwarding each rotated image through a pre-trained deep CNN, the convolutional feature maps output by an intermediate layer (e.g., a convolutional layer) are represented by a cube W × H × C, where W and H denote the width and height of each feature map, respectively, and C is the number of channels in the feature maps. Subsequently, we extract a set of ROIs from the cube using an overlapping sliding window, with window size W′ ≤ W and H′ ≤ H. The window size is adjusted to incorporate ROIs with different scales (e.g., 5 × 5, 10 × 10). Here, we denote the number of scales as S. Finally, a 5-D data structure γ_x(G_t, G_s, G_r, C) ∈ ℝ^{W′×H′×S×R×C} is derived, which encodes translations G_t (i.e., spatial locations W′ × H′), scales G_s and rotations G_r of the input keyframe x.

HNIP aims to aggregate the 5-D data into a compact deep invariant global descriptor. In particular, it first performs pooling over translations (W′ × H′), then scales (S), and finally rotations (R) in a nested way, resulting in a C-dimensional global descriptor. Formally, for the cth feature map, n_t-norm pooling over translations G_t is given by

$$\gamma_x(G_s, G_r, c) = \left( \frac{1}{W' \times H'} \sum_{j=1}^{W' \times H'} \gamma_x(j, G_s, G_r, c)^{n_t} \right)^{\frac{1}{n_t}} \qquad (2)$$

where the pooling operation has a parameter n_t defining the statistical moment, e.g., n_t = 1 is first order (i.e., average pooling), n_t → +∞ on the other extreme is of infinite order (i.e., max pooling), and n_t = 2 is second order (i.e., square-root pooling). Equation (2) leads to 3-D data γ_x(G_s, G_r, C) ∈ ℝ^{S×R×C}. Analogously, n_s-norm pooling over scale transformations G_s and the subsequent n_r-norm pooling over rotation transformations G_r are

$$\gamma_x(G_r, c) = \left( \frac{1}{S} \sum_{j=1}^{S} \gamma_x(j, G_r, c)^{n_s} \right)^{\frac{1}{n_s}}, \qquad (3)$$

$$\gamma_x(c) = \left( \frac{1}{R} \sum_{j=1}^{R} \gamma_x(j, c)^{n_r} \right)^{\frac{1}{n_r}}. \qquad (4)$$
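A minimal NumPy sketch of the nested pooling in (2)-(4), assuming the 5-D structure γ_x has already been built by forwarding the rotated keyframe through the CNN and cropping multi-scale ROIs. The moments n_t, n_s, n_r are parameters (np.inf selects max pooling); this is an illustration of the equations, not the authors' implementation.

import numpy as np

def norm_pool(x, n, axis):
    """Generalized n-norm pooling along one axis: max for n -> inf,
    average for n = 1, square-root (second-order) pooling for n = 2.
    Activations from a post-ReLU layer are non-negative, so the powers are well defined."""
    if np.isinf(n):
        return x.max(axis=axis)
    return np.mean(x ** n, axis=axis) ** (1.0 / n)

def nested_invariance_pooling(gamma, n_t=2, n_s=1, n_r=np.inf):
    """gamma: 5-D array of shape (W', H', S, R, C) encoding translations,
    scales and rotations of one keyframe. Returns a C-dimensional descriptor.

    Default moments follow HNIP: square-root over translations, average over
    scales, max over rotations."""
    Wp, Hp, S, R, C = gamma.shape
    g = gamma.reshape(Wp * Hp, S, R, C)
    g = norm_pool(g, n_t, axis=0)   # (2): pool over translations -> (S, R, C)
    g = norm_pool(g, n_s, axis=0)   # (3): pool over scales       -> (R, C)
    g = norm_pool(g, n_r, axis=0)   # (4): pool over rotations    -> (C,)
    return g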


Fig. 2. (a) Nested invariance pooling (NIP) on feature maps extracted from an intermediate layer of the CNN architecture. (b) A single convolution-pooling operation from a CNN schematized for a single input layer and single output neuron. The parallel with invariance theory shows that the universal building block of CNN is compatible with the incorporation of invariance to local translations of the input.

The corresponding global descriptor is obtained by concatenating γ_x(c) for all feature maps

$$f(x) = \{\gamma_x(c)\}_{0 \le c < C}. \qquad (5)$$

The similarity between two keyframes x and y is then measured as

$$k(x, y) = \beta(x)\,\beta(y) \sum_{c=1}^{C} \langle \gamma_x(c), \gamma_y(c) \rangle \qquad (6)$$

where β(·) is a normalization term computed by β(x) = (Σ_{c=1}^{C} ⟨γ_x(c), γ_x(c)⟩)^{-1/2}. Equation (6) refers to cosine similarity obtained by accumulating the scalar products of the normalized pooled features for each feature map. HNIP descriptors can be further improved by post-processing techniques such as PCA whitening [16], [18]. In this work, the global descriptor is first L2-normalized, followed by PCA projection and whitening with a pre-trained PCA matrix. The whitened vectors are L2-normalized and compared with (6).
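The post-processing and matching steps could look roughly as follows, a sketch under the assumption that the PCA mean, projection matrix, and eigenvalues have been learned offline (e.g., from sampled distractor frames); names are illustrative.

import numpy as np

def l2n(v):
    # L2 normalization with a small constant to avoid division by zero.
    return v / (np.linalg.norm(v) + 1e-12)

def postprocess(desc, pca_mean, pca_eigvecs, pca_eigvals):
    """L2-normalize, PCA-project, whiten, then L2-normalize again."""
    v = l2n(desc)
    v = (v - pca_mean) @ pca_eigvecs          # PCA rotation (columns = components)
    v = v / np.sqrt(pca_eigvals + 1e-12)      # whitening
    return l2n(v)

def keyframe_similarity(desc_x, desc_y):
    """Cosine similarity between two pooled keyframe descriptors, as in (6)."""
    return float(np.dot(l2n(desc_x), l2n(desc_y)))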

Subsequently, we investigate HNIP in more detail. In Section IV-B, inspired by a recent invariance theory [59], HNIP is proven to be approximately invariant to translation, scale and rotation transformations, independently of the statistical moments chosen in the nested pooling stages. In Section IV-C, both quantitative and qualitative evaluations are presented to illustrate the invariance properties. Moreover, we observe that the statistical moments in HNIP drastically affect video matching performance. Our empirical results show that the optimal nested pooling moments are: n_t = 2 as a second-order moment, n_s first order, and n_r of infinite order.

B. Theory on Transformation Invariance

Invariance theory in a nutshell: Recently, Anselmi and Poggio [59] proposed an invariance theory exploring how signal (e.g., image) representations are invariant to various transformations. Denoting f(x) as the representation of image x, f(x) is invariant to a transformation g (e.g., rotation) if f(x) = f(g · x) holds ∀g ∈ G, where we define the orbit of x under a transformation group G as O_x = {g · x | g ∈ G}. It can be easily shown

that O_x is globally invariant to the action of any element of G, and thus any descriptor computed directly from O_x will be globally invariant to G.

More specifically, the invariant descriptor f(x) can be derived in two stages. First, given a predefined template t (e.g., a convolutional filter in CNN), we compute the dot products of t over the orbit: D_{x,t} = {⟨g · x, t⟩ ∈ ℝ | g ∈ G}. Second, the extracted invariant descriptor should be a histogram representation of the distribution D_{x,t} with a specific bin configuration, for example, the statistical moments (e.g., mean, max, standard deviation, etc.) derived from D_{x,t}. Such a representation is mathematically proven to have the proper invariance property for transformations such as translations (G_t), scales (G_s) and rotations (G_r). One may note that the transformation g can be applied either to the image or to the template indifferently, i.e., {⟨g · x, t⟩ = ⟨x, g · t⟩ | g ∈ G}. Recent work on face verification [60] and music classification [61] successfully applied this theory to practical applications. More details about the invariance theory can be found in [59].
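To make the two-stage recipe concrete, the toy sketch below uses cyclic shifts of a 1-D signal as the transformation group G: any statistical moment of the dot products over the orbit is unchanged when the input itself is shifted. This is an illustrative example of the theory, not the construction used in the paper.

import numpy as np

def orbit_dot_products(x, t):
    """Dot products <g.x, t> over the orbit of x under all cyclic shifts."""
    return np.array([np.dot(np.roll(x, g), t) for g in range(len(x))])

rng = np.random.default_rng(0)
x, t = rng.normal(size=8), rng.normal(size=8)
d1 = orbit_dot_products(x, t)               # orbit of x
d2 = orbit_dot_products(np.roll(x, 3), t)   # orbit of a shifted copy of x
# The two distributions are permutations of each other, so any pooled moment is invariant.
assert np.isclose(d1.mean(), d2.mean()) and np.isclose(d1.max(), d2.max())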

An example: translation invariance of CNN: The convolution-pooling operations in CNN are compliant with the invariance theory. Existing well-known CNN architectures like AlexNet [19] and VGG16 [20] share a common building block: a succession of convolution and pooling operations, which in fact provides a way to incorporate local translation invariance. As shown in Fig. 2(b), the convolution operation on translated image patches (i.e., sliding windows) is equivalent to ⟨g · x, t⟩, and the max pooling operation is in line with the statistical histogram computation over the distribution of the dot products (i.e., feature maps). For instance, consider a convolutional filter that has learned a "cat face" pattern: the filter would always respond to an image depicting a cat face, no matter where the face is located in the image. Subsequently, max pooling over the activation feature maps captures the most salient feature of the cat face, which is naturally invariant to object translation.

Incorporating scale and rotation invariances into CNN: Based on the already locally translation-invariant feature maps (e.g., the last pooling layer, pool5), we propose to further improve the invariance of pool5 CNN descriptors by incorporating global invariance to several transformation groups. The specific transformation groups considered in this study are scales G_s and rotations G_r. As one can see, it is impractical to generate all transformations g · x for the orbit O_x. In addition, the computational complexity of the feedforward pass in CNN increases linearly with the number of transformed versions of the input x. For practical consideration, we simplify the orbit to a finite set of transformations (e.g., # of rotations R = 4, # of scales S = 3). This results in HNIP descriptors that are approximately invariant to transformations, without a huge increase in feature extraction time.

Fig. 3. Comparison of pooled descriptors invariant to (a) rotation and (b) scale changes of query images, measured by retrieval accuracy (mAP) on the Holidays dataset. The fc6 layer of VGG16 [20] pretrained on the ImageNet dataset is used.

An interesting aspect of the invariance theory is the possibility in practice to chain multiple types of group invariances one after the other, as already demonstrated in [61]. In this study, we construct descriptors invariant to several transformation groups by progressively applying the method to different transformation groups, as shown in (2)-(4).

Discussions: While there is a theoretical guarantee of the scale- and rotation-invariance of handcrafted local feature detectors such as DoG, classical CNN architectures lack invariance to these geometric transformations [62]. Many works have proposed to encode transformation invariances into both handcrafted (e.g., BoW built on densely sampled SIFT [63]) and CNN representations [64], by explicitly augmenting input images with rotation and scale transformations. Our HNIP takes a similar idea of image augmentation, but has several significant differences. First, rather than a single pooling (max or average) layer over all transformations, HNIP progressively pools features together across different transformations with different moments, which is essential for significantly improving the quality of the pooled CNN descriptors. Second, unlike previous empirical studies, we have attempted to mathematically show that the design of nested pooling ensures that HNIP is approximately invariant to multiple transformations, which is inspired by the recently developed invariance theory. Third, to the best of our knowledge, this work is the first to comprehensively analyze the invariance properties of CNN descriptors in the context of large-scale video matching and retrieval.

C. Quantitative and Qualitative Evaluations

Transformation invariance: In this section, we propose a database-side data augmentation strategy for image retrieval to study rotation and scale invariance properties, respectively. For simplicity, we represent an image as a 4096-dimensional descriptor extracted from the first fully connected layer (fc6) of VGG16 [20] pre-trained on the ImageNet dataset. We report retrieval results in terms of mean Average Precision (mAP) on the INRIA Holidays dataset [65] (500 query images, 991 reference images).

Fig. 4. Distances for three matching pairs from the MPEG CDVA dataset (see Section VI-A for more details). For each pair, three pairwise distances (L2-normalized) are computed by progressively encoding translations (Gt), scales (Gt + Gs), and rotations (Gt + Gs + Gr) into the nested pooling stages. Average pooling is used for all transformations. Feature maps are extracted from the last pooling layer of pretrained VGG16.

Fig. 3(a) investigates the invariance property with respect to query rotations. First, we observe that the retrieval performance drops significantly as we rotate the query images while fixing the reference images (the red curve). To gain invariance to query rotations, we rotate each reference image within a range of −180° to 180°, with a step of 10°. The fc6 features of its 36 rotated images are pooled together into one common global descriptor representation, with either max or average pooling. We observe that the performance is relatively consistent (blue for max pooling, green for average pooling) against the rotation of the query images. Moreover, performance with respect to variations of the query image scale is plotted in Fig. 3(b). It is observed that database-side augmentation by max or average pooling over scale changes (scale ratios of 0.75, 0.5, 0.375, 0.25, 0.2 and 0.125) can improve performance when the query scale is small (e.g., 0.125).
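A sketch of the database-side augmentation used in Fig. 3(a), assuming a hypothetical extract_fc6(image) helper that handles preprocessing and returns the 4096-D fc6 descriptor of a PIL image; the 36 rotated copies of one reference image are pooled into a single rotation-robust descriptor.

import numpy as np
from PIL import Image

def augmented_reference_descriptor(image_path, extract_fc6, pooling="avg"):
    """Pool fc6 descriptors of rotated copies (-180 to 170 deg, step 10, i.e. 36
    rotations) of one reference image into a single descriptor."""
    img = Image.open(image_path)
    descs = [extract_fc6(img.rotate(angle, expand=True))
             for angle in range(-180, 180, 10)]
    descs = np.stack(descs)
    pooled = descs.max(axis=0) if pooling == "max" else descs.mean(axis=0)
    return pooled / np.linalg.norm(pooled)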

Nesting multiple transformations: We further analyze nested pooling over multiple transformations. Fig. 4 provides insight into how progressively adding different types of transformations affects the matching distance for different image matching pairs. We can see the reduction in matching distance with the incorporation of each new transformation group. Fig. 5 takes a closer look at pairwise similarity maps between the local deep features of query keyframes and the global deep descriptors of reference keyframes, which explicitly reflect the regions of the query keyframe that contribute significantly to the similarity measurement. We compare our HNIP (the third heatmap in each row) to the state-of-the-art deep descriptors MAC [18] and R-MAC [18]. Because of the introduction of scale and rotation transformations, HNIP is able to locate the query object of interest responsible for the similarity measures more precisely than MAC and R-MAC, though there are scale and rotation changes between the query-reference pairs. Moreover, as shown in Table I, quantitative results on video matching by HNIP with progressively encoded multiple transformations provide additional positive evidence for the nested invariance property.

Fig. 5. Example similarity maps between local deep features of query keyframes and the global deep descriptors of reference keyframes, using off-the-shelf VGG16. For the query (left image) and reference (right image) keyframe pair in each row, the middle three similarity maps are from MAC, R-MAC, and HNIP (from left to right), respectively. Each similarity map is generated by cosine similarity between the query local features at each feature map location and the pooled global descriptor of the reference keyframe (i.e., MAC, R-MAC, or HNIP), which allows locating the regions of the query keyframe contributing most to the pairwise similarity.

TABLE I
PAIRWISE VIDEO MATCHING BETWEEN MATCHING AND NON-MATCHING VIDEO DATASETS, WITH POOLING CARDINALITY INCREASED BY PROGRESSIVELY ENCODING TRANSLATION, SCALE, AND ROTATION TRANSFORMATIONS, FOR DIFFERENT POOLING STRATEGIES, I.E., MAX-MAX-MAX, AVG-AVG-AVG, AND OUR HNIP (SQU-AVG-MAX)

Gt              Gt-Gs                 Gt-Gs-Gr
Max    71.9     Max-Max      72.8     Max-Max-Max    73.9
Avg    76.9     Avg-Avg      79.2     Avg-Avg-Avg    82.2
Squ    81.6     Squ-Avg      82.7     Squ-Avg-Max    84.3

TABLE II
STATISTICS ON THE NUMBER OF RELEVANT DATABASE VIDEOS RETURNED IN THE TOP 100 LIST (I.E., RECALL@100) FOR ALL QUERY VIDEOS IN THE MPEG CDVA DATASET (SEE SECTION VI-A FOR MORE DETAILS)

              Landmarks   Scenes   Objects
HNIP \ CXM    8143        1477     1218
CXM \ HNIP    1052        105      1834

"A \ B" represents relevant instances successfully retrieved by method A but missed in the list generated by method B. The last pooling layer of pretrained VGG16 is used for HNIP.

Pooling moments: In Fig. 3, it is interesting to note that any choice of pooling moment n in the pooling stage can produce invariant descriptors. However, the discriminability of NIP with varied pooling moments could be quite different. For video retrieval, we empirically observe that pooling with hybrid moments works well for NIP, e.g., starting with square-root pooling (n_t = 2) for translations and average pooling (n_s = 1) for scales, and ending with max pooling (n_r → +∞) for rotations. Here, we present an empirical analysis of how pooling moments affect pairwise video matching performance.

We construct matching and non-matching video sets from the MPEG CDVA dataset. Both sets contain 4690 video pairs. With an input video keyframe size of 640 × 480, feature maps of size 20 × 15 are extracted from the last pooling layer of VGG16 [20] pre-trained on the ImageNet dataset. For transformations, we consider nested pooling by progressively adding transformations with translations (Gt), scales (Gt-Gs) and rotations (Gt-Gs-Gr). For pooling moments, we evaluate Max-Max-Max, Avg-Avg-Avg and our HNIP (i.e., Squ-Avg-Max). Finally, video similarity is computed using (1) with the pooled features.

Table I reports pairwise matching performance in terms of True Positive Rate with the False Positive Rate set to 1%, with transformations switching from Gt to Gt-Gs-Gr, for Max-Max-Max, Avg-Avg-Avg and HNIP. As more transformations are nested in, the separability between the matching and non-matching video sets becomes larger, regardless of the pooling moments used. More importantly, HNIP performs the best compared to Max-Max-Max and Avg-Avg-Avg, while Max-Max-Max is the worst. For instance, HNIP outperforms Avg-Avg-Avg, i.e., 84.3% vs. 82.2%.
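The TPR at 1% FPR figures of Table I can be computed from the two score sets roughly as in this sketch; it is an illustrative helper, not the CDVA evaluation code, and the threshold selection is approximate.

import numpy as np

def tpr_at_fpr(match_scores, nonmatch_scores, target_fpr=0.01):
    """True Positive Rate at the threshold where the False Positive Rate
    on the non-matching pairs is approximately target_fpr."""
    nonmatch = np.sort(np.asarray(nonmatch_scores))[::-1]   # descending
    k = max(int(np.floor(target_fpr * len(nonmatch))), 1)
    threshold = nonmatch[k - 1]                              # FPR close to target_fpr
    return float(np.mean(np.asarray(match_scores) > threshold))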

V. COMBINING DEEP AND HANDCRAFTED DESCRIPTORS

In this section, we analyze the strengths and weaknesses of deep features in the context of video retrieval and matching, compared to state-of-the-art handcrafted descriptors built upon local invariant features (SIFT). To this end, we compute statistics of HNIP and CDVA handcrafted descriptors (CXM) by retrieving different types of video data. In particular, we focus on videos depicting landmarks, scenes and common objects, collected by MPEG CDVA. Here we describe how the statistics are computed. First, for each query video, we retrieve the top 100 most similar database videos using HNIP and CXM, respectively. Second, for all queries from each type of video data, we accumulate the number of relevant database videos (1) retrieved by HNIP but missed by CXM (denoted as HNIP \ CXM), and (2) retrieved by CXM but missed by HNIP (CXM \ HNIP). The statistics are presented in Table II. As one can see, compared to the handcrafted descriptors of CXM, HNIP is able to identify many more relevant landmark and scene videos where CXM fails. On the other hand, CXM recalls more videos depicting common objects than HNIP. Fig. 6 shows qualitative examples of keyframe pairs corresponding to HNIP \ CXM and CXM \ HNIP, respectively.

Fig. 6. Examples of keyframe pairs which (a) HNIP determines as matching but CXM as non-matching (HNIP \ CXM) and (b) CXM determines as matching but HNIP as non-matching (CXM \ HNIP).

Fig. 7 further visualizes intermediate keyframe matching results produced by handcrafted and deep features, respectively. Despite the viewpoint change of the landmark images in Fig. 7(a), the salient features fired in their activation maps are spatially consistent. Similar observations hold for the indoor scene images in Fig. 7(b). These observations are probably attributable to deep descriptors excelling at characterizing global salient features. On the other hand, handcrafted descriptors work on local patches detected at sparse interest points, which prefer richly textured blobs [Fig. 7(c)] rather than lower-textured ones [Fig. 7(a) and 7(b)]. This may explain why there are more inlier matches found by GCC for the product images in Fig. 7(c). Finally, compared to the approximate scale and rotation invariances provided by HNIP analyzed in the previous section, handcrafted local features have a built-in mechanism to ensure nearly exact invariance to these transformations of rigid objects in the 2D plane; examples can be found in Fig. 6(b).

In summary, these observations reveal that deep learning features may not always outperform handcrafted features. There may exist complementary effects between CNN deep descriptors and handcrafted descriptors. Therefore, we propose to leverage the benefits of both deep and handcrafted descriptors. Considering that handcrafted descriptors are categorized into local and global ones, we investigate the combination of deep descriptors with either handcrafted local or global descriptors, respectively.

Combining HNIP with handcrafted local descriptors: For matching, if the HNIP matching score exceeds a threshold, we then use handcrafted local descriptors for verification. For retrieval, the HNIP matching score is used to select the top 500 candidate list, and then handcrafted local descriptors are used for reranking.

Combining HNIP and handcrafted global descriptors: Instead of simply concatenating the HNIP-derived deep descriptors and the handcrafted descriptors, for both matching and retrieval, the similarity score is defined as the weighted sum of the matching scores of HNIP and handcrafted global descriptors

$$\hat{k}(x, y) = \alpha \cdot k_c(x, y) + (1 - \alpha) \cdot k_h(x, y) \qquad (7)$$

where α is the weighting factor, and k_c and k_h represent the matching scores of HNIP and handcrafted descriptors, respectively. In this work, α is empirically set to 0.75.
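Expressed in code, the late fusion in (7) is a simple weighted sum of the two matching scores; the sketch assumes k_cnn and k_hand are on comparable scales (e.g., cosine similarities).

def fused_similarity(k_cnn, k_hand, alpha=0.75):
    """Weighted sum of the HNIP (k_cnn) and handcrafted (k_hand) matching
    scores, as in (7); alpha = 0.75 follows the paper's setting."""
    return alpha * k_cnn + (1.0 - alpha) * k_hand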

VI. EXPERIMENTAL RESULTS

A. Datasets and Evaluation Metrics

Datasets: The MPEG CDVA ad-hoc group collects a large-scale, diverse video dataset to evaluate the effectiveness of video descriptors for video matching, localization and retrieval applications, with resource constraints including descriptor size, extraction time and matching complexity. This CDVA dataset^2 is diversified to contain views of 1) stationary large objects, e.g., buildings, landmarks (most likely background objects, possibly partially occluded or a close-up), 2) generally smaller items (e.g., paintings, books, CD covers, products) which typically appear in front of background scenes, possibly occluded, and 3) scenes (e.g., interior scenes, natural scenes, multi-camera shots, etc.). The CDVA dataset also comprises planar or non-planar, rigid or partially rigid, textured or partially textured objects (scenes), which are captured from different viewpoints with different camera parameters and lighting conditions.

2 The MPEG CDVA dataset and evaluation framework are available upon request at http://www.cldatlas.com/cdva/dataset.html. CDVA standard documents are available at http://mpeg.chiariglione.org/standards/exploration/compact-descriptors-video-analysis.

Fig. 7. Keyframe matching examples which illustrate the strengths and weaknesses of CNN-based deep descriptors and handcrafted descriptors. In (a) and (b), deep descriptors perform well but handcrafted ones fail, while (c) is the opposite.

TABLE III
STATISTICS ON THE MPEG CDVA BENCHMARK DATASETS

IoI: items of interest. q.v.: query videos. r.v.: reference videos.

Specifically, the MPEG CDVA dataset contains 9974 query and 5127 reference videos (denoted as All), depicting 796 items of interest, of which 489 are large landmarks (e.g., buildings), 71 are scenes (e.g., interior or natural scenes) and 236 are small common objects (e.g., paintings, books, products). The videos have durations from 1 sec to over 1 min. To evaluate video retrieval on different types of video data, we categorize query videos into Landmarks (5224 queries), Scenes (915 queries) and Objects (3835 queries). Table III summarizes the numbers of items of interest and their instances for each category. Fig. 8 shows some example video clips from the three categories.

To evaluate the performance of large-scale video retrieval, we combine the reference videos with a set of user-generated and broadcast videos as distractors, which consist of content unrelated to the items of interest. There are 14537 distractor videos with more than 1000 hours of data.

Moreover, to evaluate pairwise video matching and temporal localization, 4693 matching video pairs and 46911 non-matching video pairs are constructed from the query and reference videos. The temporal location of the item of interest within each video pair is annotated as the ground truth.

We also evaluate our method on image retrieval benchmark datasets. The INRIA Holidays dataset [65] is composed of 1491 high-resolution (e.g., 2048 × 1536) scene-centric images, 500 of which are queries. This dataset includes a large variety of outdoor scene/object types: natural, man-made, water and fire effects. We evaluate the rotated version of Holidays [13], where all images are in up-right orientation. Oxford5k [66] is a buildings dataset consisting of 5062 images, mainly of size 1024 × 768. There are 55 queries composed of 11 landmarks, each represented by 5 queries. We use the provided bounding boxes to crop the query images. The University of Kentucky Benchmark (UKBench) [67] consists of 10200 VGA-size images, organized into 2550 groups of common objects, each object represented by 4 images. All 10200 images serve as queries.

Evaluation metrics: Retrieval performance is evaluated by mean Average Precision (mAP) and precision at a given cut-off rank R for query videos (Precision@R), and we set R = 100 following the MPEG CDVA standard. Pairwise video matching performance is evaluated by the Receiver Operating Characteristic (ROC) curve. We also report pairwise matching results in terms of True Positive Rate (TPR), given a False Positive Rate (FPR) equal to 1%. In case a video pair is predicted as a match, the temporal location of the item of interest is further identified within the video pair. The localization accuracy is measured by the Jaccard index

$$\frac{\left|[T_{start}, T_{end}] \cap [T'_{start}, T'_{end}]\right|}{\left|[T_{start}, T_{end}] \cup [T'_{start}, T'_{end}]\right|}$$

where [T_start, T_end] denotes the ground truth and [T'_start, T'_end] denotes the predicted start and end frame timestamps.

Besides these accuracy measurements, we also measure the complexity of the algorithms, including descriptor size, transmission bit rate, extraction time and search time. In particular, transmission bit rate is measured by (# query keyframes) × (descriptor size in Bytes) / (query duration in seconds).
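The two measures just defined can be computed as below; the helpers are illustrative sketches, with segments given as [start, end] timestamps in seconds.

def temporal_jaccard(gt_segment, pred_segment):
    """Jaccard index between the ground-truth [T_start, T_end] and predicted
    [T'_start, T'_end] segments: intersection length over union length."""
    (gs, ge), (ps, pe) = gt_segment, pred_segment
    inter = max(0.0, min(ge, pe) - max(gs, ps))
    union = (ge - gs) + (pe - ps) - inter
    return inter / union if union > 0 else 0.0

def transmission_bitrate(num_query_keyframes, descriptor_bytes, duration_sec):
    """Transmission bit rate in bytes per second, as defined above."""
    return num_query_keyframes * descriptor_bytes / duration_sec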

B. Implementation Details

    In this work, we build HNIP descriptors with two widely 673used CNN architectures : AlexNet [19] and VGG16 [20]. We 674test off-the-shelf networks pre-trained on ImageNet ILSVRC 675classification data set. In particular, we crop the networks to the 676


    Fig. 8. Example video clips from the CDVA dataset.

    TABLE IV
    VIDEO RETRIEVAL COMPARISON (mAP) BY PROGRESSIVELY ADDING TRANSFORMATIONS (TRANSLATION, SCALE, ROTATION) INTO NIP (OFF-THE-SHELF VGG16 IS USED FOR ALL TEST DATASETS)

    Transf.      size (kB/k.f.)   Landmarks   Scenes   Objects   All
    Gt           2                64.0        82.9     64.8      66.0
    Gt-Gs        2                65.3        82.4     67.3      67.6
    Gt-Gs-Gr     2                64.6        82.7     72.2      69.2

    Average pooling is applied to all transformations. No PCA whitening is performed. kB/k.f.: descriptor size per keyframe. The best results are highlighted in bold.

    We resize all video keyframes to 640×480 and Holidays (Oxford5k) images to 1024×768 as the inputs of the CNN for descriptor extraction. Finally, post-processing can be applied to the pooled descriptors such as HNIP and R-MAC [18]. Following standard practice, we choose PCA whitening in this work. We randomly sample 40K frames from the distractor videos for PCA learning. These experimental setups are applied to both HNIP and state-of-the-art deep pooled descriptors such as MAC [18], SPoC [16], CroW [17] and R-MAC [18].
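    As a minimal sketch of this post-processing step (assuming the pooled descriptors are stacked as rows of a matrix; the 40K-frame sampling and the descriptor extraction itself are outside the snippet, and the helper names are ours):

```python
import numpy as np

def learn_pca_whitening(train_desc, eps=1e-8):
    """Learn a PCA whitening transform from pooled descriptors (one per
    row), e.g. 40K x 512 for pool5 descriptors of sampled distractor frames."""
    mean = train_desc.mean(axis=0)
    centered = train_desc - mean
    _, sing_vals, comps = np.linalg.svd(centered, full_matrices=False)
    whiten = comps.T / (sing_vals + eps)   # project, then rescale each axis
    return mean, whiten

def apply_pca_whitening(desc, mean, whiten):
    """Whiten a descriptor (or a batch) and re-normalize to unit L2 norm."""
    out = (desc - mean) @ whiten
    return out / (np.linalg.norm(out, axis=-1, keepdims=True) + 1e-12)

# Hypothetical usage: learn on 40K sampled frames, apply to one keyframe.
train = np.random.randn(40000, 512).astype(np.float32)
mean, whiten = learn_pca_whitening(train)
keyframe_desc = apply_pca_whitening(np.random.randn(512), mean, whiten)
```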

    We also compare HNIP with the MPEG CXM, which is the current state-of-the-art handcrafted compact descriptor for video analysis. Following the practice in the CDVA standard, we employ OpenMP to perform parallel retrieval for both CXM and deep global descriptors. Experiments are conducted on the Tianhe HPC platform, where each node is equipped with 2 processors (24 cores, Xeon E5-2692) @2.2 GHz and 64 GB RAM. For CNN feature map extraction, we use an NVIDIA Tesla K80 GPU.

    C. Evaluations on HNIP Variants

    We perform video retrieval experiments to assess the effect of transformations and pooling moments in the HNIP pipeline, using off-the-shelf VGG16.

    Transformations: Table IV studies the influence of pooling cardinalities by progressively adding transformations into the nested pooling stages. We simply apply average pooling to all transformations. First, the dimensions of all NIP variants are 512 for VGG16, resulting in a descriptor size of 2 kB per keyframe for floating point vectors (4 bytes per element). Second, overall, retrieval performance (mAP) increases as more transformations are nested in the pooled descriptors, e.g., from 66.0% for Gt to 69.2% for Gt-Gs-Gr on the full test dataset (All). Also, we observe that Gt-Gs-Gr outperforms Gt-Gs and Gt by a large margin on Objects, while achieving comparable performance on Landmarks and Scenes.

    TABLE V
    VIDEO RETRIEVAL COMPARISON (mAP) OF NIP WITH DIFFERENT POOLING MOMENTS (OFF-THE-SHELF VGG16 IS USED FOR ALL TEST DATASETS)

    Pool Op.        size (kB/k.f.)   Landmarks   Scenes   Objects   All
    Max-Max-Max     2                63.2        82.9     69.9      67.6
    Avg-Avg-Avg     2                64.6        82.7     72.2      69.2
    Squ-Squ-Squ     2                54.2        65.6     66.5      60.0
    Max-Squ-Avg     2                60.6        80.7     74.4      67.8
    Hybrid (HNIP)   2                70.0        88.2     80.9      75.9

    Transformations are Gt-Gs-Gr for all experiments. No PCA whitening is performed. The best results are highlighted in bold.

    Revisiting the analysis of rotation invariance pooling on the scene-centric Holidays dataset in Fig. 3(a), though invariance to query rotation changes can be gained by database-side augmented pooling, one may note that its retrieval performance is comparable to the one without rotating query and reference images (i.e., the peak value of the red curve). These observations are probably because there are relatively limited rotation (scale) changes for videos depicting large landmarks or scenes, compared to small common objects. More examples can be found in Fig. 8.
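    To make the nesting order concrete, the following is a simplified sketch of how the three pooling stages can be stacked (average pooling at every stage, as in Table IV). The particular choice of centered regions for the scale stage and of four rotated copies per keyframe is our illustrative assumption; it is not meant to reproduce the exact region and rotation sampling of the HNIP pipeline.

```python
import numpy as np

def nip_average(rotated_feature_maps, region_scales=(1.0, 0.75, 0.5)):
    """Nested invariance pooling with average pooling at every stage.

    rotated_feature_maps: one (C, H, W) CNN feature map (e.g. VGG16 pool5)
    per rotated copy of the keyframe. Returns a C-dim global descriptor.
    """
    per_rotation = []
    for fmap in rotated_feature_maps:
        channels, height, width = fmap.shape
        per_scale = []
        for scale in region_scales:
            h, w = max(1, int(height * scale)), max(1, int(width * scale))
            top, left = (height - h) // 2, (width - w) // 2
            region = fmap[:, top:top + h, left:left + w]
            # Stage 1 (Gt): pool over spatial positions within the region.
            per_scale.append(region.mean(axis=(1, 2)))
        # Stage 2 (Gs): pool over regions of different scales.
        per_rotation.append(np.mean(per_scale, axis=0))
    # Stage 3 (Gr): pool over the rotated copies of the keyframe.
    desc = np.mean(per_rotation, axis=0)
    return desc / (np.linalg.norm(desc) + 1e-12)

# Hypothetical usage: four rotations of one keyframe, 512-channel pool5 maps.
maps = [np.random.rand(512, 20, 15) for _ in range(4)]
print(nip_average(maps).shape)   # (512,) -> 2 kB as 4-byte floats
```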

    Hybrid pooling moments: Table V explores the effects of pooling moments within NIP. Transformations are fixed as Gt-Gs-Gr. There are 3^3 = 27 possible combinations of pooling moments in HNIP. For simplicity, we compare our Hybrid NIP (i.e., Squ-Avg-Max) to two widely-used pooling strategies (i.e., max or average pooling across all transformations) and another two schemes: square-root pooling across all transformations and Max-Squ-Avg, which decreases the pooling moment along the way. First, for uniform pooling, Avg-Avg-Avg is overall superior to Max-Max-Max and Squ-Squ-Squ, while Squ-Squ-Squ performs much worse than the other two. Second, HNIP outperforms the optimal uniform pooling Avg-Avg-Avg by a large margin. For instance, the gains over Avg-Avg-Avg are +5.4%, +5.5% and +8.7% on Landmarks, Scenes and Objects, respectively. Finally, for hybrid pooling, HNIP performs significantly better than Max-Squ-Avg over all test datasets. We observe similar trends when comparing HNIP to other hybrid pooling combinations.
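    Under our reading of the notation in Table V, each of the three nested stages simply swaps in a different statistical moment for the averaging used above; a minimal sketch of such a moment-parameterized pooling operator follows (the mapping of "Squ" to the root of the mean of squares is our interpretation):

```python
import numpy as np

def moment_pool(values, moment, axis):
    """Pool along `axis` with a chosen statistical moment:
    'avg' = first moment (mean), 'squ' = root of the mean of squares,
    'max' = infinite moment (maximum)."""
    if moment == "avg":
        return values.mean(axis=axis)
    if moment == "squ":
        return np.sqrt((values ** 2).mean(axis=axis))
    if moment == "max":
        return values.max(axis=axis)
    raise ValueError("unknown pooling moment: %s" % moment)

# E.g., the hybrid scheme reported as Squ-Avg-Max would call moment_pool
# with 'squ' at the translation stage, 'avg' at the scale stage and 'max'
# at the rotation stage of the nested pooling sketched earlier.
```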

    D. Comparisons Between HNIP and State-of-the-Art Deep Descriptors

    Previous experiments show that the integration of transformations and hybrid pooling moments offers remarkable video retrieval performance improvements.


    TABLE VI
    VIDEO RETRIEVAL COMPARISON OF HNIP WITH STATE-OF-THE-ART DEEP DESCRIPTORS IN TERMS OF mAP (OFF-THE-SHELF VGG16 IS USED FOR ALL TEST DATASETS)

    method        size (kB/k.f.)   extra. time (s/k.f.)   Landmarks (w/o PCAW / w/ PCAW)   Scenes (w/o PCAW / w/ PCAW)   Objects (w/o PCAW / w/ PCAW)   All (w/o PCAW / w/ PCAW)
    MAC [18]      2                0.32                   57.8 / 61.9                      77.4 / 76.2                   70.0 / 71.8                    64.3 / 67.0
    SPoC [16]     2                0.32                   64.0 / 69.1                      82.9 / 84.0                   64.8 / 70.3                    66.0 / 70.9
    CroW [17]     2                0.32                   62.3 / 63.9                      79.2 / 78.4                   71.9 / 72.0                    67.5 / 68.3
    R-MAC [18]    2                0.32                   69.4 / 74.6                      84.4 / 87.3                   73.8 / 78.2                    72.5 / 77.1
    HNIP (Ours)   2                0.96                   70.0 / 74.8                      88.2 / 90.1                   80.9 / 85.0                    75.9 / 80.1

    We implement MAC [18], SPoC [16], CroW [17], R-MAC [18] based on the source codes released by the authors, while following the same experimental setups as our HNIP. The best results are highlighted in bold.

    TABLE VII
    EFFECT OF THE NUMBER OF DETECTED VIDEO KEYFRAMES ON DESCRIPTOR TRANSMISSION BIT RATE, RETRIEVAL PERFORMANCE (mAP), AND SEARCH TIME, ON THE FULL TEST DATASET (ALL)

    # query k.f.    # DB k.f.       method         size (kB/k.f.)   bit rate (Bps)   mAP    search time (s/q.v.)
    ∼140K (1.6%)    ∼105K (2.4%)    CXM            ∼4               2840             73.6   12.4
                                    AlexNet-HNIP   1                459              71.4   1.6
                                    VGG16-HNIP     2                918              80.1   2.3
    ∼175K (2.0%)    ∼132K (3.0%)    CXM            ∼4               3463             74.3   16.6
                                    AlexNet-HNIP   1                571              71.9   2.0
                                    VGG16-HNIP     2                1143             80.6   2.8
    ∼231K (2.7%)    ∼176K (3.9%)    CXM            ∼4               4494             74.6   21.0
                                    AlexNet-HNIP   1                759              71.9   2.2
                                    VGG16-HNIP     2                1518             80.7   3.1

    We report performance of state-of-the-art handcrafted descriptors (CXM), and PCA whitened HNIP with both off-the-shelf AlexNet and VGG16. Numbers in brackets denote the percentage of detected keyframes from raw videos. Bps: bytes per second. s/q.v.: seconds per query video.

    TABLE VIII
    VIDEO RETRIEVAL COMPARISON OF HNIP WITH STATE-OF-THE-ART HANDCRAFTED DESCRIPTORS (CXM), FOR ALL TEST DATASETS

    method         Landmarks    Scenes       Objects      All
    CXM            61.4/60.9    63.0/61.9    92.6/91.2    73.6/72.6
    AlexNet-HNIP   65.2/62.3    77.6/74.1    78.4/74.5    71.4/68.1
    VGG16-HNIP     74.8/71.6    90.1/86.6    85.0/81.3    80.1/76.7

    The former (latter) number in each cell represents performance in terms of mAP (Precision@R). We report performance of PCA whitened HNIP with both off-the-shelf AlexNet and VGG16. The best results are highlighted in bold.

    Here, we conduct another round of video retrieval experiments to validate the effectiveness of our optimal reported HNIP, compared to state-of-the-art deep descriptors [16]–[18].

    Effect of PCA whitening: Table VI studies the effect of PCA whitening on different deep descriptors in terms of video retrieval performance (mAP), using off-the-shelf VGG16. Overall, PCA whitened descriptors perform better than their counterparts without PCA whitening. More specifically, the improvements on SPoC, R-MAC and our HNIP are much larger than those on MAC and CroW. In view of this, we apply PCA whitening to HNIP in the following sections.

    HNIP versus MAC, SPoC, CroW and R-MAC: Table VI presents the comparison of HNIP against state-of-the-art deep descriptors. We observe that HNIP obtains consistently better performance than other approaches on all test datasets, at the cost of extra extraction time.3 HNIP significantly improves the retrieval performance over MAC [18], SPoC [16] and CroW [17], e.g., by over 10% in mAP on the full test dataset (All). Compared with the state-of-the-art R-MAC [18], a +7% mAP improvement is achieved on Objects, which is mainly attributed to the improved robustness against rotation changes in videos (the keyframes capture small objects from different angles).

    E. Comparisons Between HNIP and Handcrafted Descriptors

    In this section, we first study the influence of the number of detected video keyframes on descriptor transmission bit rate, retrieval performance and search time. Then, with keyframes fixed, we compare HNIP to the state-of-the-art compact handcrafted descriptors (CXM), which currently obtain the optimal video retrieval performance on the MPEG CDVA datasets.

    Effect of the number of detected video keyframes: As shown in Table VII, we generate three keyframe detection configurations by varying the detection parameters. Also, we test their retrieval performance and complexity for CXM descriptors (∼4 kB per keyframe) and PCA whitened HNIP with both off-the-shelf AlexNet (1 kB per keyframe) and VGG16 (2 kB per keyframe), on the full test dataset (All).

    3The extraction time of deep descriptors is mainly decomposed into 1) the feed-forward pass for extracting feature maps and 2) pooling over feature maps followed by post-processing such as PCA whitening. In our implementation based on MatConvNet, the first stage takes 0.21 seconds per keyframe (VGA size input image to VGG16 executed on an NVIDIA Tesla K80 GPU); HNIP is four times slower (∼0.84) as there are four rotations for each keyframe. The second stage is ∼0.11 seconds for MAC, SPoC and CroW, ∼0.115 seconds for R-MAC and ∼0.12 seconds for HNIP. Therefore, the extraction time of HNIP is roughly three times as much as the others.


    Fig. 9. Pairwise video matching comparison of HNIP with state-of-the-art handcrafted descriptors (CXM) in terms of ROC curve, for all test datasets. Experimental settings are identical with those in Table VIII.

    It is easy to find that the descriptor transmission bit rate and search time increase proportionally with the number of detected keyframes. However, there is little retrieval performance gain for all descriptors, i.e., less than 1% in mAP. Thus, we consider the first configuration throughout this work, which achieves a good tradeoff between accuracy and complexity. For instance, mAP for VGG16-HNIP is 80.1% when the descriptor transmission bit rate is only 2840 Bytes per second.

    Video retrieval: Table VIII shows the video retrieval comparison of HNIP with the handcrafted descriptors CXM on all test datasets. Overall, AlexNet-HNIP is inferior to CXM, while VGG16-HNIP performs the best. Second, HNIP with both AlexNet and VGG16 outperforms CXM on Landmarks and Scenes. The performance gap between HNIP and CXM becomes larger as the network goes deeper from AlexNet to VGG16, e.g., AlexNet-HNIP and VGG16-HNIP improve over CXM by 3.8% and 13.4% in mAP on Landmarks, respectively. Third, we observe that AlexNet-HNIP performs much worse than CXM on Objects (e.g., 74.5% vs. 91.2% in Precision@R). VGG16-HNIP reduces the gap, but still underperforms CXM. This is reasonable as handcrafted descriptors based on SIFT are more robust to scale and rotation changes of rigid objects in the 2D plane.

    Video pairwise matching and localization: Fig. 9 and Table IX further show pairwise video matching and temporal localization performance of HNIP and CXM on all test datasets, respectively. For pairwise video matching, VGG16-HNIP and AlexNet-HNIP consistently outperform CXM in terms of TPR for varied FPR on Landmarks and Scenes. In Table IX, we observe that the performance trends of temporal localization are roughly consistent with pairwise video matching.

    One may note that the localization accuracy of CXM is worse than that of HNIP on Objects (see Table IX), but CXM obtains much better video retrieval mAP than HNIP on Objects (see Table VIII). First, given a query-reference video pair, video retrieval tries to identify the most similar keyframe pair, whereas temporal localization aims to locate multiple keyframe pairs by comparison against a predefined threshold. Second, as shown in Fig. 9, CXM achieves better TPR (Recall) than both VGG16-HNIP and AlexNet-HNIP on Objects when FPR is small (e.g., FPR = 1%), and its TPR becomes worse than theirs as FPR increases.

    TABLE IX
    VIDEO LOCALIZATION COMPARISON OF HNIP WITH STATE-OF-THE-ART HANDCRAFTED DESCRIPTORS (CXM) IN TERMS OF JACCARD INDEX, FOR ALL TEST DATASETS

    method         Landmarks   Scenes   Objects   All
    CXM            45.5        45.9     68.8      54.4
    AlexNet-HNIP   48.9        63.0     67.3      57.1
    VGG16-HNIP     50.8        63.8     71.2      59.7

    Experimental settings are the same as in Table VIII. The best results are highlighted in bold.

    TABLE X
    IMAGE RETRIEVAL COMPARISON (mAP) OF HNIP WITH STATE-OF-THE-ART DEEP AND HANDCRAFTED DESCRIPTORS (CXM), ON HOLIDAYS, OXFORD5K, AND UKBENCH

    method        Holidays   Oxford5k   UKbench
    CXM           71.2       43.5       3.46
    MAC [18]      78.3       56.1       3.65
    SPoC [16]     84.5       68.6       3.68
    R-MAC [18]    87.2       67.6       3.73
    HNIP (Ours)   88.9       69.3       3.90

    We report performance of PCA whitened deep descriptors with off-the-shelf VGG16. The best results are highlighted in bold.

    This implies that 1) CXM ranks relevant videos and keyframes higher than HNIP in the retrieved list for object queries, which leads to better mAP on Objects when evaluating retrieval performance with a small shortlist (100 in our experiments); and 2) VGG16-HNIP gains higher Recall than CXM when FPR becomes large, which leads to higher localization accuracy on Objects. In other words, towards better temporal localization, we choose a small threshold (the corresponding FPR = 14.3% in our experiments) in order to recall as many relevant keyframes as possible.
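    The distinction drawn here between the two uses of keyframe similarities can be illustrated with a toy sketch (our own simplification; the actual CDVA pipeline additionally performs temporal alignment and, for CXM, geometric verification):

```python
import numpy as np

def score_and_localize(q_desc, r_desc, q_times, r_times, threshold):
    """q_desc: (Nq, D), r_desc: (Nr, D) L2-normalized keyframe descriptors;
    q_times, r_times: keyframe timestamps in seconds (numpy arrays)."""
    sim = q_desc @ r_desc.T              # cosine similarity of keyframe pairs
    retrieval_score = sim.max()          # retrieval: best single keyframe pair
    qi, ri = np.where(sim >= threshold)  # localization: all pairs above threshold
    if qi.size == 0:
        return retrieval_score, None
    segment = ((float(q_times[qi].min()), float(q_times[qi].max())),
               (float(r_times[ri].min()), float(r_times[ri].max())))
    return retrieval_score, segment
```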

    Image retrieval: To further verify the effectiveness of HNIP, we conduct image instance retrieval experiments on the scene-centric Holidays, landmark-centric Oxford5k and object-centric UKbench. Table X shows the comparisons of HNIP with MAC, SPoC, R-MAC, and the handcrafted descriptors from the MPEG CDVA evaluation framework. First, we observe that HNIP outperforms the handcrafted descriptors by a large margin on all datasets.


    Fig. 10. (a) and (b) Video retrieval, (c) pairwise video matching, and (d) localization performance of the optimal reported HNIP (i.e., VGG16-HNIP) combined with either CXM local or CXM global descriptors, for all test datasets. For simplicity, we report the pairwise video matching performance in terms of TPR given FPR = 1%.

    TABLE XI
    VIDEO RETRIEVAL COMPARISON OF HNIP WITH CXM AND THE COMBINATION OF HNIP WITH CXM-LOCAL AND CXM-GLOBAL RESPECTIVELY, ON THE FULL TEST DATASET (ALL), WITHOUT (“W/O D”) OR WITH (“W/ D”) COMBINATION OF THE LARGE SCALE DISTRACTOR VIDEOS

    method                     size (kB/k.f.)   mAP (w/o D / w/ D)   Precision@R (w/o D / w/ D)   # query k.f.   # DB k.f. (w/o D / w/ D)   search time s/q.v. (w/o D / w/ D)
    CXM                        ∼4               73.6 / 72.1          72.6 / 71.2                  ∼140K          ∼105K / ∼1.25M             12.4 / 38.6
    VGG16-HNIP                 2                80.1 / 76.8          76.7 / 73.6                                                            2.3 / 9.2
    VGG16-HNIP + CXM-Local     ∼4               75.7 / 75.4          74.4 / 74.1                                                            12.9 / 17.8
    VGG16-HNIP + CXM-Global    ∼4               84.9 / 82.6          82.4 / 80.3                                                            4.9 / 39.5

    Second, HNIP performs significantly better than the state-of-the-art deep descriptor R-MAC on UKbench, though it shows only marginally better performance on Holidays. The performance advantage trend between HNIP and R-MAC is consistent with the video retrieval results on CDVA-Scenes and CDVA-Objects in Table VI. It is again demonstrated that HNIP tends to be more effective on object-centric datasets compared to scene- and landmark-centric datasets, as object-centric datasets usually exhibit more rotation and scale distortions.

    F. Combination of HNIP and Handcrafted Descriptors

    CXM contains both compressed local descriptors (∼2 kB/frame) and compact global descriptors (∼2 kB/frame) aggregated from the local ones. Following the combination strategies designed in Section V, Fig. 10 shows the effectiveness of combining VGG16-HNIP with either CXM-Global or CXM-Local descriptors,4 in video retrieval (a) & (b), matching (c) and localization (d). First, we observe that the combination of VGG16-HNIP with either CXM-Global or CXM-Local consistently improves over CXM across all tasks on all test datasets. In this regard, the improvements of VGG16-HNIP + CXM-Global are much larger than those of VGG16-HNIP + CXM-Local, especially for Landmarks and Scenes. Second, VGG16-HNIP + CXM-Global performs better on all test datasets in video retrieval, matching and localization (except localization accuracy on Landmarks). In particular, VGG16-HNIP + CXM-Global significantly improves over VGG16-HNIP on Objects in terms of mAP and Precision@R (+10%).

    4Here, we did not introduce the more complicated combination of VGG-HNIP + CXM-Global + CXM-Local, because its performance is very close to that of VGG-HNIP + CXM-Local, and moreover it further increases descriptor size and search time, compared to VGG-HNIP + CXM-Local.

    This leads us to the conclusion that the deep descriptor VGG16-HNIP and the handcrafted descriptor CXM-Global are complementary to each other. Third, we observe that VGG16-HNIP + CXM-Local significantly degrades the performance of VGG16-HNIP on Landmarks and Scenes, e.g., there is a drop of ∼10% in mAP on Landmarks. This is due to the fact that matching pairs retrieved by HNIP (but missed by the handcrafted features) cannot pass the GCC step, i.e., the number of inliers (patch-level matching pairs) is insufficient. For instance, in Fig. 7, the landmark pair is determined as a match by VGG16-HNIP, but the subsequent GCC step considers it a non-match because there are only 2 matched patch pairs. More examples can be found in Fig. 6(a).
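    The failure mode described above boils down to a two-stage accept rule; the sketch below is only a toy illustration of that reasoning, and the thresholds as well as the exact decision logic of the Section V combination are placeholders, not the paper's:

```python
def accept_with_local_verification(hnip_similarity, gcc_inlier_count,
                                   sim_threshold=0.5, inlier_threshold=4):
    """Toy accept rule for HNIP + CXM-Local: a pair that the deep global
    descriptor considers a match is still rejected when the geometric
    consistency check (GCC) yields too few patch-level inliers, e.g. the
    landmark pair in Fig. 7 with only 2 matched patch pairs."""
    return hnip_similarity >= sim_threshold and gcc_inlier_count >= inlier_threshold
```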

    G. Large Scale Video Retrieval

    Table XI studies the video retrieval performance of CXM, VGG16-HNIP, and their combinations VGG16-HNIP + CXM-Local and VGG16-HNIP + CXM-Global, on the full test dataset (All) without and with the large scale distractor video set. By combining the reference videos with the large scale distractor set, the number of database keyframes increases from ∼105K to ∼1.25M, making the search speed significantly slower. For example, HNIP is ∼5 times slower with the 512-D Euclidean distance computation. Further compressing HNIP into an extremely compact code (e.g., 256 bits) for ultra-fast Hamming distance computation is highly desirable, without incurring considerable performance loss. We will study this in our future work. Second, the performance ordering of the approaches remains the same in the large scale experiments, i.e., VGG16-HNIP + CXM-Global performs the best, followed by VGG16-HNIP, VGG16-HNIP + CXM-Local and CXM.


    Finally, when increasing the database size by 10×, we observe that the performance loss is relatively small, e.g., −1.5%, −3.3%, −0.3% and −2.3% in mAP for CXM, VGG16-HNIP, VGG16-HNIP + CXM-Local and VGG16-HNIP + CXM-Global, respectively.
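    As a rough illustration of the compact-code idea deferred to future work (sign binarization after a random projection is only one of many possible encoders, and none of the parameters below come from the paper):

```python
import numpy as np

def binarize(desc, projection):
    """Map a D-dim float descriptor to a packed binary code by projecting
    (e.g. a 512 x 256 random matrix for a 256-bit code) and taking signs."""
    bits = (desc @ projection) > 0
    return np.packbits(bits.astype(np.uint8))

def hamming(code_a, code_b):
    """Hamming distance between two packed binary codes."""
    return int(np.unpackbits(np.bitwise_xor(code_a, code_b)).sum())

rng = np.random.default_rng(0)
projection = rng.standard_normal((512, 256))
a, b = rng.standard_normal(512), rng.standard_normal(512)
print(hamming(binarize(a, projection), binarize(b, projection)))
```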

    VII. CONCLUSION AND DISCUSSIONS

    In this work, we propose a compact and discriminative CNN descriptor, HNIP, for video retrieval, matching and localization. Based on the invariance theory, HNIP is proven to be robust to multiple geometric transformations. More importantly, our empirical studies show that the statistical moments in HNIP dramatically affect video matching performance, which leads us to the design of hybrid pooling moments within HNIP. In addition, we study the complementary nature of deep learned and handcrafted descriptors, and then propose a strategy to combine the two descriptors. Experimental results demonstrate that the HNIP descriptor significantly outperforms state-of-the-art deep and handcrafted descriptors, with comparable or even smaller descriptor size. Furthermore, the combination of HNIP and handcrafted descriptors offers the optimal performance.

    This work provides valuable insights for the ongoing CDVA standardization efforts. During the recent 116th MPEG meeting in Oct. 2016, the MPEG CDVA Ad-hoc group adopted the proposed HNIP into core experiments [21] for investigating more practical issues when dealing with deep learned descriptors in the well-established CDVA evaluation framework. There are several directions for future work. First, an in-depth theoretical analysis of how pooling moments affect video matching performance is valuable to further reveal and clarify the mechanism of hybrid pooling, which may contribute to the invariance theory. Second, it is interesting to study how to further improve retrieval performance by optimizing deep features, e.g., fine-tuning a CNN tailored for the video retrieval task, instead of the off-the-shelf CNNs used in this work. Third, to accelerate search speed, further compressing deep descriptors into extremely compact codes (e.g., dozens of bits) while still preserving retrieval accuracy is worth investigating. Last but not least, as a CNN incurs a huge number of network model parameters (over 10 million), how to effectively and efficiently compress the CNN model is a promising direction.

    REFERENCES

    [1] Compact Descriptors for Video Analysis: Objectives, Applications and Use Cases, ISO/IEC JTC1/SC29/WG11/N14507, 2014.
    [2] Compact Descriptors for Video Analysis: Requirements for Search Applications, ISO/IEC JTC1/SC29/WG11/N15040, 2014.
    [3] B. Girod et al., “Mobile visual search,” IEEE Signal Process. Mag., vol. 28, no. 4, pp. 61–76, Jul. 2011.
    [4] R. Ji et al., “Learning compact visual descriptor for low bit rate mobile landmark search,” vol. 22, no. 3, 2011.
    [5] L.-Y. Duan et al., “Overview of the MPEG-CDVS standard,” IEEE Trans. Image Process., vol. 25, no. 1, pp. 179–194, Jan. 2016.
    [6] Test Model 14: Compact Descriptors for Visual Search, ISO/IEC JTC1/SC29/WG11/W15372, 2011.
    [7] Call for Proposals for Compact Descriptors for Video Analysis (CDVA)-Search and Retrieval, ISO/IEC JTC1/SC29/WG11/N15339, 2015.
    [8] CDVA Experimentation Model (CXM) 0.2, ISO/IEC JTC1/SC29/WG11/W16274, 2015.
    [9] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier, “Large-scale image retrieval with compressed Fisher vectors,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2010, pp. 3384–3391.
    [10] H. Jégou, M. Douze, C. Schmid, and P. Pérez, “Aggregating local descriptors into a compact image representation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2010, pp. 3304–3311.
    [11] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features off-the-shelf: An astounding baseline for recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Workshops, Jun. 2014, pp. 512–519.
    [12] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale orderless pooling of deep convolutional activation features,” in Proc. Eur. Conf. Comput. Vis., 2014.
    [13] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, “Neural codes for image retrieval,” in Proc. Eur. Conf. Comput. Vis., 2014.
    [14] A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson, “A baseline for visual instance retrieval with deep convolutional networks,” CoRR, 2014. [Online]. Available: http://arxiv.org/abs/1412.6574
    [15] H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson, “From generic to specific deep representations for visual recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Workshops, 2015.
    [16] A. Babenko and V. Lempitsky, “Aggregating local deep features for image retrieval,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1269–1277.
    [17] Y. Kalantidis, C. Mellina, and S. Osindero, “Cross-dimensional weighting for aggregated deep convolutional features,” CoRR, 2015. [Online]. Available: http://arxiv.org/1512.04065
    [18] G. Tolias, R. Sicre, and H. Jégou, “Particular object retrieval with integral max-pooling of CNN activations,” CoRR, 2015. [Online]. Available: http://arxiv.org/abs/1511.05879
    [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012.
    [20] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
    [21] Description of Core Experiments in CDVA, ISO/IEC JTC1/SC29/WG11/W16510, 2016.
    [22] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
    [23] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” in Proc. Eur. Conf. Comput. Vis., 2006.
    [24] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2003, vol. 2, pp. 1470–1477.
    [25] D. Nistér and H. Stewénius, “Scalable recognition with a vocabulary tree,” in Proc. Comput. Vis. Pattern Recog., 2006.
    [26] H. Jégou and A. Zisserman, “Triangulation embedding and democratic aggregation for image search,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 3310–3317.
    [27] S. S. Husain and M. Bober, “Improving large-scale image retrieval through robust aggregation of local descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., to be published.
    [28] M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift-invariant kernels,” in Proc. Adv. Neural Inf. Process. Syst., 2009.
    [29] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Proc. Adv. Neural Inf. Process. Syst., 2009.
    [30] V. Chandrasekhar et al., “Transform coding of image feature descriptors,” in Proc. IS&T/SPIE Electron. Imag., 2009.
    [31] V. Chandrasekhar et al., “CHoG: Compressed histogram of gradients: A low bit-rate feature descriptor,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2009, pp. 2504–2511.
    [32] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.
    [33] M. Calonder et al., “BRIEF: Computing a local binary descriptor very fast,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1281–1298, Jul. 2012.
    [34] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: An efficient alternative to SIFT or SURF,” in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 2564–2571.
    [35] S. Leutenegger, M. Chli, and R. Y. Siegwart, “BRISK: Binary robust invariant scalable keypoints,” in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2548–2555.
    [36] S. Zhang, Q. Tian, Q. Huang, W. Gao, and Y. Rui, “USB: Ultrashort binary descriptor for fast visual matching and retrieval,” IEEE Trans. Image Process., vol. 23, no. 8, pp. 3671–3683, Aug. 2014.


    [37] D. M. Chen et al., “Tree histogram coding for mobile image matching,” in Proc. Data Compression Conf., 2009.
    [38] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep.–Oct. 2009, pp. 2130–2137.
    [39] D. Chen et al., “Residual enhanced visual vector as a compact signature for mobile visual search,” Signal Process., vol. 93, no. 8, pp. 2316–2327, 2013.
    [40] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 580–587.
    [41] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in Proc. Comput. Vis. Pattern Recog., Jun. 2016, pp. 5297–5307.
    [42] F. Radenović, G. Tolias, and O. Chum, “CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples,” in Proc. Eur. Conf. Comput. Vis., 2016.
    [43] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, “Deep image retrieval: Learning global representations for image search,” in Proc. Eur. Conf. Comput. Vis., 2016.
    [44] L. Baroffio, M. Cesana, A. Redondi, M. Tagliasacchi, and S. Tubaro, “Coding visual features extracted from video sequences,” IEEE Trans. Image Process., vol. 23, no. 5, pp. 2262–2276, May 2014.
    [45] A. Redondi, L. Baroffio, M. Cesana, and M. Tagliasacchi, “Compress-then-analyze vs. analyze-then-compress: Two paradigms for image analysis in visual sensor networks,” in Proc. IEEE Int. Workshop Multimedia Signal Process., Sep.–Oct. 2013, pp. 278–282.
    [46] L. Baroffio, J. Ascenso, M. Cesana, A. Redondi, and M. Tagliasacchi, “Coding binary local features extracted from video sequences,” in Proc. IEEE Int. Conf. Image Process., Oct. 2014, pp. 2794–2798.
    [47] M. Makar, V. Chandrasekhar, S. Tsai, D. Chen, and B. Girod, “Interframe coding of feature descriptors for mobile augmented reality,” IEEE Trans. Image Process., vol. 23, no. 8, pp. 3352–3367, Aug. 2014.
    [48] J. Chao and E. G. Steinbach, “Keypoint encoding for improved feature extraction from compressed video at low bitrates,” IEEE Trans. Multimedia, vol. 18, no. 1, pp. 25–39, Jan. 2016.
    [49] L. Baroffio et al., “Coding local and global binary visual features extracted from video sequences,” IEEE Trans. Image Process., vol. 24, no. 11, pp. 3546–3560, Nov. 2015.
    [50] D. M. Chen, M. Makar, A. F. Araujo, and B. Girod, “Interframe coding of global image signatures for mobile augmented reality,” in Proc. Data Compression Conf., 2014.
    [51] D. M. Chen and B. Girod, “A hybrid mobile visual search system with compact global signatures,” IEEE Trans. Multimedia, vol. 17, no. 7, pp. 1019–1030, Jul. 2015.
    [52] C.-Z. Zhu, H. Jégou, and S. Satoh, “NII team: Query-adaptive asymmetrical dissimilarities for instance search,” in Proc. TRECVID 2013 Workshop, Gaithersburg, USA, 2013.
    [53] N. Ballas et al., “IRIM at TRECVID 2014: Semantic indexing and instance search,” in Proc. TRECVID 2014 Workshop, 2014.
    [54] A. Araujo, J. Chaves, R. Angst, and B. Girod, “Temporal aggregation for large-scale query-by-image video retrieval,” in Proc. IEEE Int. Conf. Image Process., Sep. 2015, pp. 1519–1522.
    [55] M. Shi, T. Furon, and H. Jégou, “A group testing framework for similarity search in high-dimensional spaces,” in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 407–416.
    [56] J. Lin et al., “Rate-adaptive compact Fisher codes for mobile visual search,” IEEE Signal Process. Lett., vol. 21, no. 2, pp. 195–198, Feb. 2014.
    [57] Z. Xu, Y. Yang, and A. G. Hauptmann, “A discriminative CNN video representation for event detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 1798–1807.
    [58] L.-Y. Duan et al., “Compact descriptors for video analysis: The emerging MPEG standard,” CoRR, 2017. [Online]. Available: http://arxiv.org/abs/1704.08141
    [59] F. Anselmi and T. Poggio, “Representation learning in sensory cortex: A theory,” in Proc. Center Brains, Minds Mach., 2014.
    [60] Q. Liao, J. Z. Leibo, and T. Poggio, “Learning invariant representations and applications to face verification,” in Proc. Adv. Neural Inf. Process. Syst., Lake Tahoe, NV, 2013.
    [61] C. Zhang, G. Evangelopoulos, S. Voinea, L. Rosasco, and T. Poggio, “A deep representation for invariance and music classification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2014, pp. 6984–6988.
    [62] K. Lenc and A. Vedaldi, “Understanding image representations by measuring their equivariance and equivalence,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 991–995.
    [63] J. R. R. Uijlings, A. W. M. Smeulders, and R. J. H. Scha, “Real-time visual concept classification,” IEEE Trans. Multimedia, vol. 12, no. 7, pp. 665–681, Nov. 2010.
    [64] M. Cimpoi, S. Maji, and A. Vedaldi, “Deep filter banks for texture recognition and segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 3828–3836.
    [65] H. Jégou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in Proc. Eur. Conf. Comput. Vis., 2008.
    [66] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabularies and fast spatial matching,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2007, pp. 1–8.
    [67] D. Nistér and H. Stewénius, “Scalable recognition with a vocabulary tree,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., Jun. 2006, vol. 2, pp. 2161–2168.

    Jie Lin received the B.S. and Ph.D. degrees from the School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China, in 2006 and 2014, respectively.

    He is currently a Research Scientist with the Institute for Infocomm Research, A*STAR, Singapore. He was previously a visiting student in the Rapid-Rich Object Search Laboratory, Nanyang Technological University, Singapore, and the Institute of Digital Media, Peking University, Beijing, China, from 2011 to 2014. His research interests include deep learning, feature coding and large-scale image/video retrieval. His work on image feature coding has been recognized as a core contribution to the MPEG-7 Compact Descriptors for Visual Search (CDVS) standard.

    Ling-Yu Duan (M’09) received the M.Sc. degree in automation from the University of Science and Technology of China, Hefei, China, in 1999, the M.Sc. degree in computer science from the National University of Singapore (NUS), Singapore, in 2002, and the Ph.D. degree in information technology from The University of Newcastle, Callaghan, Australia, in 2008.

    He is currently a Full Professor with the National Engineering Laboratory of Video Technology, School of Electronics Engineering and Computer Science, Peking University (PKU), Beijing, China. He has been the Associate Director of the Rapid-Rich Object Search Laboratory, a joint lab between Nanyang Technological University, Singapore, and PKU, since 2012. Before he joined PKU, he was a Research Scientist with the Institute for Infocomm Research, Singapore, from Mar. 2003 to Aug. 2008. He has authored or coauthored more than 130 research papers in international journals and conferences. His research interests include multimedia indexing, search, and retrieval, mobile visual search, visual feature coding, and video analytics. Prior to 2010, his research was basically focused on multimedia (semantic) content analysis, especially in the domains of broadcast sports videos and TV commercial videos.

    Prof. Duan was the recipient of the EURASIP Journal on Image and Video Processing Best Paper Award in 2015, and the Ministry of Education Technology Invention Award (First Prize) in 2016. He was a co-editor of the MPEG Compact Descriptor for Visual Search (CDVS) standard (ISO/IEC 15938-13), and is a Co-Chair of MPEG Compact Descriptor for Video Analytics (CDVA). His recent major achievements focus on the topic of compact representation of visual features and high-performance image search. He made significant contributions to the completed MPEG-CDVS standard. The suite of CDVS technologies has been successfully deployed, impacting the visual search products/services of leading Internet companies such as Tencent (WeChat) and Baidu (Image Search Engine).


    Shiqi Wang received the B.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 2008, and the Ph.D. degree in computer application technology from Peking University, Beijing, China, in 2014.

    From March 2014 to March 2016, he was a Postdoc Fellow with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada. From April 2016 to April 2017, he was with the Rapid-Rich Object Search Laboratory, Nanyang Technological University, Singapore, as a Research Fellow. He is currently an Assistant Professor in the Department of Computer Science, City University of Hong Kong, Hong Kong. He has propo

