IEEE TRANSACTIONS ON MULTIMEDIA

HNIP: Compact Deep Invariant Representations for Video Matching, Localization, and Retrieval

Jie Lin, Ling-Yu Duan, Member, IEEE, Shiqi Wang, Yan Bai, Yihang Lou, Vijay Chandrasekhar, Tiejun Huang, Alex Kot, Fellow, IEEE, and Wen Gao, Fellow, IEEE

Abstract—With the emerging demand for large-scale video analysis, MPEG initiated the compact descriptor for video analysis (CDVA) standardization in 2014. Unlike the handcrafted descriptors adopted by the ongoing CDVA standard, in this work we study the problem of deep learned global descriptors for video matching, localization, and retrieval. First, inspired by a recent invariance theory, we propose a nested invariance pooling (NIP) method to derive compact deep global descriptors from convolutional neural networks (CNN), by progressively encoding translation, scale, and rotation invariances into the pooled descriptors. Second, our empirical studies have shown that pooling moments (e.g., max or average) drastically affect video matching performance, which motivates us to design hybrid pooling operations within NIP (HNIP). HNIP further improves the discriminability of deep global descriptors. Third, the advantages and performance of the combination of deep and handcrafted descriptors are analyzed to better investigate their complementary effects. We evaluate the effectiveness of HNIP by incorporating it into the well-established CDVA evaluation framework. Experimental results show that HNIP outperforms state-of-the-art deep and canonical handcrafted descriptors with significant mAP gains of 5.5% and 4.7%, respectively. Moreover, the combination of HNIP and handcrafted global descriptors further boosts the performance of CDVA core techniques with comparable descriptor size.

Index Terms—Convolutional neural networks (CNN), deep global descriptors, handcrafted descriptors, hybrid nested invariance pooling, MPEG compact descriptor for video analysis (CDVA), MPEG CDVS.
Manuscript received December 5, 2016; revised April 3, 2017 and May 30, 2017; accepted May 30, 2017. This work was supported in part by the National High-tech R&D Program of China under Grant 2015AA016302, in part by the National Natural Science Foundation of China under Grant U1611461 and Grant 61661146005, and in part by the PKU-NTU Joint Research Institute (JRI) sponsored by a donation from the Ng Teng Fong Charitable Foundation. The guest editor coordinating the review of this manuscript and approving it for publication was Dr. Cees Snoek. (Corresponding author: Ling-Yu Duan.)

J. Lin and V. Chandrasekhar are with the Institute for Infocomm Research, A∗STAR, Singapore 138632 (e-mail: lin-j@i2r.a-star.edu.sg).

L.-Y. Duan, Y. Bai, Y. Lou, T. Huang, and W. Gao are with the Institute of Digital Media, Peking University, Beijing 100080, China (e-mail: lingyu@pku.edu.cn; yanbai@pku.edu.cn; yihang@pku.edu.cn; tjhuang@pku.edu.cn; wgao@pku.edu.cn).

S. Wang is with the Department of Computer Science, City University of Hong Kong, Hong Kong, China (e-mail: shiqwang@cityu.edu.hk).

V. Chandrasekhar and A. Kot are with the Nanyang Technological University, Singapore 639798 (e-mail: vijay@i2r.a-star.edu.sg; EACKOT@ntu.edu.sg).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2017.2713410

I. INTRODUCTION

Recent years have witnessed a remarkable growth of interest in video retrieval, which refers to searching for videos representing the same object or scene as the one depicted in a query video. Such capability can facilitate a variety of applications including mobile augmented reality (MAR), automotive, surveillance, media entertainment, etc. [1]. In the rapid evolution of video retrieval, both great promises and new challenges arising from real applications have been perceived [2]. Typically, video retrieval is performed at the server end, which requires the transmission of visual data over a wireless network [3], [4]. Instead of directly sending huge volumes of compressed video data, it is highly desirable to develop compact and robust video feature representations, which fulfill low latency transmission over bandwidth-constrained networks, e.g., thousands of bytes per second.
To this end, in 2009, MPEG started the standardization of Compact Descriptors for Visual Search (CDVS) [5], which came up with a normative bitstream of standardized descriptors for mobile visual search and augmented reality applications. In Sep. 2015, MPEG published the CDVS standard [6]. Very recently, towards large-scale video analysis, MPEG has moved forward to standardize Compact Descriptors for Video Analysis (CDVA) [7]. To deal with content redundancy in the temporal dimension, the latest CDVA Experimental Model (CXM) [8] casts video retrieval as a keyframe based image retrieval task, in which the keyframe-level matching results are combined for video matching. The keyframe-level representation avoids descriptor extraction on dense frames in videos, which largely reduces the computational complexity (e.g., CDVA only extracts descriptors for 1∼2% of the frames detected from raw videos).
In CDVS, handcrafted local and global descriptors have been successfully standardized in a compact and scalable manner (e.g., from 512 B to 16 KB), where local descriptors capture the invariant characteristics of local image patches and global descriptors like Fisher Vectors (FV) [9] and VLAD [10] reflect the aggregated statistics of local descriptors. Though handcrafted descriptors have achieved great success in the CDVS standard [5] and the CDVA experimental model, how to leverage promising deep learned global descriptors remains an open issue in the MPEG CDVA Ad-hoc group. Many recent works [11]–[18] have shown the advantages of deep global descriptors for image retrieval, which may be attributed to the remarkable success of Convolutional Neural Networks (CNN) [19], [20]. In particular, the state-of-the-art deep global descriptor R-MAC [18] computes the max over a set of Regions-of-Interest (ROI) cropped from feature maps output by an intermediate convolutional layer, followed by the average of these regional max-pooled features. Results show that R-MAC offers remarkable improvements over other deep global descriptors like MAC [18] and SPoC [16], while maintaining the same dimensionality.
In the context of compact descriptors for video retrieval, there exist important practical issues with CNN based deep global descriptors. First, one main drawback of CNN is the lack of invariance to geometric transformations of the input image, such as rotations. The performance of deep global descriptors quickly degrades when the objects in query and database videos are rotated differently. Second, different from CNN features, handcrafted descriptors are robust to scale and rotation changes in the 2D plane, because of the local invariant feature detectors. As such, more insights should be provided on whether CNN and conventional handcrafted descriptors are strongly complementary and can be combined for better performance.
To tackle the above issues, we make the following contributions in this work:

1) We propose a Nested Invariance Pooling (NIP) method to produce compact global descriptors from CNN by progressively encoding translation, scale and rotation invariances. NIP is inspired by a recent invariance theory, which provides a practical and mathematically proven way for computing invariant representations with feedforward neural networks. In this respect, NIP is extensible to other types of transformations. Both quantitative and qualitative evaluations are introduced for a deeper look at the invariance properties.

2) We further improve the discriminability of deep global descriptors by designing hybrid pooling moments within NIP (HNIP). Evaluations on video retrieval show that HNIP outperforms state-of-the-art deep and handcrafted descriptors by a large margin with comparable or smaller descriptor size.

3) We analyze the complementary nature of deep features and handcrafted descriptors over diverse datasets (landmarks, scenes and common objects). A simple combination strategy is introduced to fuse the strengths of both deep and handcrafted global descriptors. We show that the combined descriptors offer the optimal video matching and retrieval performance, without incurring extra cost on descriptor size compared to CDVA.

4) Due to its superior performance, HNIP has been adopted by the CDVA Ad-hoc group as a technical reference to set up new core experiments [21], which opens up future exploration of CNN techniques in the development of standardized video descriptors. The latest core experiments involve compact deep feature representation, CNN model compression, etc.
II. RELATED WORK

Handcrafted descriptors: Handcrafted local descriptors [22], [23], such as SIFT based on the Difference of Gaussians (DoG) detector [22], have been successfully and widely employed to conduct image matching and localization tasks due to their robustness to scale and rotation changes. Building on local image descriptors, global image representations aim to provide statistical summaries of high level image properties and facilitate fast large-scale image search. In particular, for global image descriptors, the most prominent ones include Bag-of-Words (BoW) [24], [25], Fisher Vector (FV) [9], VLAD [10], Triangulation Embedding [26] and the Robust Visual Descriptor with Whitening (RVDW) [27], with which fast comparisons against a large scale database become practical.

Given the fact that raw descriptors such as SIFT and FV may consume an extraordinarily large number of bits for transmission and storage, many compact descriptors were developed. For example, numerous strategies have been proposed to compress SIFT using hashing [28], [29], transform coding [30], lattice coding [31] and vector quantization [32]. On the other hand, binary descriptors including BRIEF [33], ORB [34], BRISK [35] and the Ultrashort Binary Descriptor (USB) [36] were proposed, which support fast Hamming distance matching. For compact global descriptors, efforts have also been made to reduce their descriptor size, such as a tree-structure quantizer [37] for the BoW histogram, locality sensitive hashing [38], dimensionality reduction and vector quantization for FV [9], and simple sign binarization for VLAD like descriptors [9], [39].
Deep descriptors: Deep learned descriptors have been extensively applied to image retrieval [11]–[18]. Initial studies [11], [13] proposed to use representations directly extracted from the fully connected layer of a CNN. More compact global descriptors [14]–[16] can be extracted by performing either global max or average pooling (e.g., SPoC in [16]) over feature maps output by intermediate layers. Further improvements are obtained by spatial or channel-wise weighting of the pooled descriptors [17]. Very recently, inspired by the R-CNN approach [40] used for object detection, Tolias et al. [18] proposed ROI based pooling on deep convolutional features, Regional Maximum Activation of Convolutions (R-MAC), which significantly improves global pooling approaches. Though R-MAC is scale invariant to some extent, it suffers from the lack of rotation invariance. These regional deep features can also be aggregated into global descriptors by VLAD [12].

In a number of recent works [13], [41]–[43], pre-trained CNNs for image classification are repurposed for the image retrieval task, by fine-tuning them with specific loss functions (e.g., Siamese or triplet networks) over carefully constructed matching and non-matching training image sets. There is considerable performance improvement when training and test datasets are in similar domains (e.g., buildings). In this work, we aim to explicitly construct invariant deep global descriptors from the perspective of better leveraging state-of-the-art or classical CNN architectures, rather than further optimizing the learning of invariant deep descriptors.
Video descriptors: Video is typically composed of a number of moving frames. Therefore, a straightforward method for video descriptor representation is to extract feature descriptors at the frame level and then reduce the temporal redundancies of these descriptors for compact representation. For local descriptors, Baroffio et al. [44] proposed both intra- and inter-feature coding methods for SIFT in the context of visual sensor networks, and an additional mode decision scheme based on rate-distortion optimization was designed to further improve the feature coding efficiency. In [45], [46], studies have been conducted to encode binary features such as BRISK [35]. Makar et al. [47] presented a temporally coherent keypoint detector to allow efficient interframe coding of canonical patches, corresponding feature descriptors, and locations towards mobile augmented reality applications. Chao et al. [48] developed a key-points encoding technique, where locations, scales and orientations extracted from original videos are encoded and transmitted along with the compressed video to the server. Recently, the temporal dependencies of global descriptors have also been exploited. For BoW extracted from video sequences, Baroffio et al. [49] proposed an intra-frame coding method with uniform scalar quantization, as well as an inter-frame technique with arithmetic coding of the quantized symbols. Chen et al. [50], [51] proposed an encoding scheme for scalable residual based global signatures, given the fact that the REVVs [39] of adjacent frames share most codewords and residual vectors.

Besides the frame-level approaches, aggregations of local descriptors over video slots and global descriptors over scenes have also been intensively explored [52]–[55]. In [54], temporal aggregation strategies for large scale video retrieval were experimentally studied and evaluated with the CDVS global descriptors [56]. Four aggregation modes, including local feature, global signature, tracking-based and independent frame based aggregation schemes, were investigated. For video-level CNN representation, in [57], the authors applied FV and VLAD aggregation techniques over dense local features of CNN activation maps for video event detection.
III. MPEG CDVA

MPEG CDVA [7] aims to standardize the bitstream of compact video descriptors for large-scale video analysis. The CDVA standard imposes two main technical requirements on the dedicated descriptors: compactness and robustness. On the one hand, compact representation is an efficient way to economize transmission bandwidth, storage space and computational cost. On the other hand, robust representation under geometric transformations such as rotation and scale variations is particularly required. To this end, in the 115th MPEG meeting, the CXM [8] was released, which mainly relies on the MPEG CDVS reference software TM14.2 [6] for keyframe-level compact and robust handcrafted descriptor representation based on scale and rotation invariant local features.
A. CDVS-Based Handcrafted Descriptors

The MPEG CDVS [5] standardized descriptors serve as the fundamental infrastructure to represent video keyframes. The normative blocks of the CDVS standard are illustrated in Fig. 1(b), mainly involving the extraction of compressed local and global descriptors. First, scale and rotation invariant interest key points are detected from the image, and a subset of reliable key points is retained, followed by the computation of handcrafted local SIFT features. The compressed local descriptors are formed by applying a low-complexity transform coding on the local SIFT features. The compact global descriptors are Fisher vectors aggregated from the selected local features, followed by scalable descriptor compression with simple sign binarization. Basically, pairwise image matching is accomplished by first comparing compact global descriptors, then further performing geometric consistency checking (GCC) with compressed local descriptors. CDVS handcrafted descriptors have a very low memory footprint, while preserving competitive matching and retrieval accuracy. The standard supports operating points ranging from 512 B to 16 kB specified for different bandwidth constraints. Overall, the 4 kB operating point achieves a good tradeoff between accuracy and complexity (e.g., transmission bitrate, search time). Thus, CDVA CXM adopts the 4 kB descriptors for keyframe-level representation, in which the compressed local descriptors and compact global descriptors are both ∼2 kB per keyframe.
B. CDVA Evaluation Framework

Fig. 1(a) shows the evaluation framework of CDVA, including keyframe detection, CDVS descriptor extraction, transmission, and video retrieval and matching. At the client side, color histogram comparison is applied to identify keyframes from video clips. The standardized CDVS descriptors are extracted from these keyframes, which can be further packed to form CDVA descriptors [58]. Keyframe detection largely reduces the temporal redundancy in videos, resulting in low bitrate query descriptor transmission. At the server side, the same keyframe detection and CDVS descriptor extraction are applied to database videos. Formally, we denote the query video X = {x_1, ..., x_{N_x}} and the reference video Y = {y_1, ..., y_{N_y}}, where x and y denote keyframes, and N_x and N_y are the numbers of detected keyframes in the query and reference videos, respectively. The start and end timestamps of each keyframe are recorded, e.g., [T^s_{x_n}, T^e_{x_n}] for query keyframe x_n. Here, we briefly describe the pipeline of pairwise video matching, localization and video retrieval with CDVA descriptors.

Fig. 1. (a) Illustration of the MPEG CDVA evaluation framework. (b) Descriptor extraction pipeline for MPEG CDVS. (c) Temporal localization of the item of interest between a video pair.
Pairwise video matching and localization: Pairwise video matching is performed by comparing the CDVA descriptors of the video keyframe pair (X, Y). Each keyframe in X is compared with all keyframes in Y. The video-level similarity K(X, Y) is defined as the largest matching score among all keyframe-level similarities. For example, if we consider video matching with CDVS global descriptors only,

$$K(\mathbf{X}, \mathbf{Y}) = \max_{\mathbf{x} \in \mathbf{X},\, \mathbf{y} \in \mathbf{Y}} k\big(f(\mathbf{x}), f(\mathbf{y})\big) \qquad (1)$$

where k(·, ·) denotes a matching function (e.g., cosine similarity) and f(x) denotes the CDVS global descriptor of keyframe x.¹ Following the matching pipeline in CDVS, if k(·, ·) exceeds a pre-defined threshold, GCC with CDVS local descriptors is subsequently applied to verify true positive keyframe matches. Then the keyframe-level similarity is finally determined as the multiplication of the matching scores from both CDVS global and local descriptors. Correspondingly, K(X, Y) in (1) is refined as the maximum of their combined similarities.

¹We use the same notation for deep global descriptors in the following section.
The matched keyframe timestamps between query and reference videos are recorded for evaluating the temporal localization task, i.e., locating the video segment containing the item of interest. In particular, if the multiplication of the CDVS global and local matching scores exceeds a predefined threshold, the corresponding keyframe timestamps are recorded. Assuming there are τ (τ ≤ N_x) keyframes satisfying this criterion in a query video, the matching video segment is defined by T'_start = min{T^s_{x_n}} and T'_end = max{T^e_{x_n}}, where 1 ≤ n ≤ τ. As such, we can obtain the predicted matching video segment by descriptor matching, as shown in Fig. 1(c).
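The video-level matching rule in (1) together with the timestamp bookkeeping above reduces to a few lines of code. The following sketch only illustrates the logic and is not the CDVA reference software; for brevity it applies the match threshold to the global-descriptor score alone, whereas CXM thresholds the combined global and local (GCC) score, and the keyframe record fields (`desc`, `ts`, `te`) are hypothetical names.

```python
import numpy as np

def cosine_similarity(fx, fy):
    # k(f(x), f(y)) in Eq. (1): cosine similarity of two global descriptors
    return float(np.dot(fx, fy) / (np.linalg.norm(fx) * np.linalg.norm(fy) + 1e-12))

def match_and_localize(query_kf, ref_kf, threshold):
    """query_kf / ref_kf: lists of keyframe records, each a dict with
    'desc' (global descriptor), 'ts' (start timestamp T^s), 'te' (end timestamp T^e).
    Returns K(X, Y) and the predicted segment [T'_start, T'_end] (or None)."""
    K, matched = 0.0, []
    for q in query_kf:
        best = max(cosine_similarity(q['desc'], r['desc']) for r in ref_kf)
        K = max(K, best)                    # video-level similarity, Eq. (1)
        if best > threshold:                # query keyframe deemed matched
            matched.append(q)
    if not matched:
        return K, None
    t_start = min(q['ts'] for q in matched)   # T'_start = min{T^s_{x_n}}
    t_end = max(q['te'] for q in matched)     # T'_end   = max{T^e_{x_n}}
    return K, (t_start, t_end)
```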
Video retrieval: Video retrieval differs from pairwise video matching in that the former is one-to-many matching, while the latter is one-to-one matching. Thus, video retrieval shares a similar matching pipeline with pairwise video matching, except for the following differences. 1) For each query keyframe, the top K_g candidate keyframes are retrieved from the database by comparing CDVS global descriptors. Subsequently, GCC reranking with CDVS local descriptors is performed between the query keyframe and each candidate, and the top K_l database keyframes are recorded. The default choices for K_g and K_l are 500 and 100, respectively. 2) For each query video, all returned database keyframes are merged into candidate database videos according to their video indices. Then, the video-level similarity between the query and each candidate database video is obtained following the same principle as pairwise video matching. Finally, the top ranked candidate database videos are returned.
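As a sketch, the two-stage retrieval just described (global-descriptor shortlisting with K_g = 500, GCC reranking to K_l = 100, then merging keyframe hits into video candidates) might look as follows. The callables `global_score` and `gcc_score` stand in for CDVS global matching and the geometric consistency check; they are assumptions for illustration, and the video-level score here is simplified to the best reranked keyframe score per candidate video.

```python
def retrieve_videos(query_kf, db_keyframes, global_score, gcc_score,
                    Kg=500, Kl=100, top_videos=100):
    """db_keyframes: list of (video_id, keyframe_record) tuples.
    Returns database video ids ranked by video-level similarity."""
    video_scores = {}
    for q in query_kf:
        # Stage 1: shortlist the top-Kg database keyframes by global-descriptor similarity.
        shortlist = sorted(db_keyframes, key=lambda item: global_score(q, item[1]),
                           reverse=True)[:Kg]
        # Stage 2: GCC reranking with local descriptors; keep the top-Kl keyframes.
        reranked = sorted(((gcc_score(q, kf), vid) for vid, kf in shortlist),
                          key=lambda t: t[0], reverse=True)[:Kl]
        # Merge keyframe hits into candidate videos, keeping the best score per video.
        for score, vid in reranked:
            video_scores[vid] = max(video_scores.get(vid, 0.0), score)
    return sorted(video_scores, key=video_scores.get, reverse=True)[:top_videos]
```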
IV. METHOD

A. Hybrid Nested Invariance Pooling

Fig. 2(a) shows the extraction pipeline of our compact deep invariant global descriptors with HNIP. Given a video keyframe x, we rotate it R times (with step size θ°). By forwarding each rotated image through a pre-trained deep CNN, the convolutional feature maps output by an intermediate layer (e.g., a convolutional layer) are represented by a cube W × H × C, where W and H denote the width and height of each feature map, respectively, and C is the number of channels in the feature maps. Subsequently, we extract a set of ROIs from the cube using an overlapping sliding window, with window size W' ≤ W and H' ≤ H. The window size is adjusted to incorporate ROIs with different scales (e.g., 5 × 5, 10 × 10). Here, we denote the number of scales as S. Finally, a 5-D data structure γ_x(G_t, G_s, G_r, C) ∈ R^{W'×H'×S×R×C} is derived, which encodes the translations G_t (i.e., spatial locations W' × H'), scales G_s and rotations G_r of the input keyframe x.
HNIP aims to aggregate the 5-D data into a compact deep invariant global descriptor. In particular, it first performs pooling over translations (W' × H'), then scales (S) and finally rotations (R) in a nested way, resulting in a C-dimensional global descriptor. Formally, for the c-th feature map, n_t-norm pooling over translations G_t is given by

$$\gamma_{\mathbf{x}}(G_s, G_r, c) = \left( \frac{1}{W' \times H'} \sum_{j=1}^{W' \times H'} \gamma_{\mathbf{x}}(j, G_s, G_r, c)^{n_t} \right)^{\frac{1}{n_t}} \qquad (2)$$

where the pooling operation has a parameter n_t defining the statistical moment, e.g., n_t = 1 is first order (i.e., average pooling), n_t → +∞ on the other extreme is infinite order (i.e., max pooling), and n_t = 2 is second order (i.e., square-root pooling). Equation (2) leads to a 3-D data structure γ_x(G_s, G_r, C) ∈ R^{S×R×C}. Analogously, n_s-norm pooling over scale transformations G_s and the subsequent n_r-norm pooling over rotation transformations G_r are

$$\gamma_{\mathbf{x}}(G_r, c) = \left( \frac{1}{S} \sum_{j=1}^{S} \gamma_{\mathbf{x}}(j, G_r, c)^{n_s} \right)^{\frac{1}{n_s}}, \qquad (3)$$

$$\gamma_{\mathbf{x}}(c) = \left( \frac{1}{R} \sum_{j=1}^{R} \gamma_{\mathbf{x}}(j, c)^{n_r} \right)^{\frac{1}{n_r}}. \qquad (4)$$

The corresponding global descriptor is obtained by concatenating γ_x(c) over all feature maps:

$$f(\mathbf{x}) = \{\gamma_{\mathbf{x}}(c)\}_{0 \leq c < C}. \qquad (5)$$

As such, the keyframe matching function in (1) is defined as

$$k\big(f(\mathbf{x}), f(\mathbf{y})\big) = \beta(\mathbf{x})\,\beta(\mathbf{y}) \sum_{c=1}^{C} \langle \gamma_{\mathbf{x}}(c), \gamma_{\mathbf{y}}(c) \rangle \qquad (6)$$

where β(·) is a normalization term computed by $\beta(\mathbf{x}) = \big(\sum_{c=1}^{C} \langle \gamma_{\mathbf{x}}(c), \gamma_{\mathbf{x}}(c) \rangle\big)^{-\frac{1}{2}}$. Equation (6) is the cosine similarity obtained by accumulating the scalar products of the normalized pooled features for each feature map. HNIP descriptors can be further improved by post-processing techniques such as PCA whitening [16], [18]. In this work, the global descriptor is first L2 normalized, followed by PCA projection and whitening with a pre-trained PCA matrix. The whitened vectors are L2 normalized and compared with (6).

Fig. 2. (a) Nested invariance pooling (NIP) on feature maps extracted from an intermediate layer of a CNN architecture. (b) A single convolution-pooling operation from a CNN schematized for a single input layer and a single output neuron. The parallel with invariance theory shows that the universal building block of CNN is compatible with the incorporation of invariance to local translations of the input.
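To make the nested pooling of (2)-(5) and the matching function (6) concrete, the sketch below implements them with NumPy on a precomputed activation tensor. This is a simplified illustration under stated assumptions, not the authors' implementation: the tensor layout (W', H', S, R, C) follows the description above, activations are assumed non-negative (post-ReLU), and the optional `pca` object is assumed to be a pre-fitted whitening transform (e.g., scikit-learn PCA with whiten=True).

```python
import numpy as np

def lp_pool(x, n, axis):
    """n-norm pooling along `axis`: (mean(x**n))**(1/n).
    n = 1 -> average, n = 2 -> square-root, n = np.inf -> max pooling."""
    if np.isinf(n):
        return x.max(axis=axis)
    return np.power(np.power(x, n).mean(axis=axis), 1.0 / n)

def hnip_descriptor(gamma, nt=2, ns=1, nr=np.inf):
    """gamma: non-negative activations of shape (W', H', S, R, C), i.e.,
    translations x scales x rotations x channels as in Fig. 2(a).
    Returns a C-dimensional global descriptor via Eqs. (2)-(5)."""
    g = gamma.reshape(-1, *gamma.shape[2:])   # flatten the W' x H' translation grid
    g = lp_pool(g, nt, axis=0)                # Eq. (2): pool over translations
    g = lp_pool(g, ns, axis=0)                # Eq. (3): pool over scales
    g = lp_pool(g, nr, axis=0)                # Eq. (4): pool over rotations
    return g                                  # Eq. (5): one value per channel

def keyframe_similarity(fx, fy, pca=None):
    """Cosine similarity of Eq. (6), optionally after PCA whitening."""
    def post(v):
        v = v / (np.linalg.norm(v) + 1e-12)           # L2 normalization
        if pca is not None:
            v = pca.transform(v[None, :])[0]          # PCA projection + whitening
            v = v / (np.linalg.norm(v) + 1e-12)       # re-normalize before matching
        return v
    return float(np.dot(post(fx), post(fy)))
```

With the default arguments (n_t = 2, n_s = 1, n_r → ∞) this corresponds to the Squ-Avg-Max configuration reported as HNIP; passing 1 or np.inf for all three moments recovers the uniform Avg-Avg-Avg and Max-Max-Max baselines discussed later.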
In what follows, we investigate HNIP in more detail. In Section IV-B, inspired by a recent invariance theory [59], HNIP is shown to be approximately invariant to translation, scale and rotation transformations, independently of the statistical moments chosen in the nested pooling stages. In Section IV-C, both quantitative and qualitative evaluations are presented to illustrate the invariance properties. Moreover, we observe that the statistical moments in HNIP drastically affect video matching performance. Our empirical results show that the optimal nested pooling moments are: n_t = 2 (second order), n_s = 1 (first order) and n_r of infinite order.
B. Theory on Transformation Invariance

Invariance theory in a nutshell: Recently, Anselmi and Poggio [59] proposed an invariance theory exploring how signal (e.g., image) representations can be made invariant to various transformations. Denoting by f(x) the representation of image x, f(x) is invariant to a transformation g (e.g., rotation) if f(x) = f(g · x) holds for all g ∈ G, where the orbit of x under a transformation group G is defined as O_x = {g · x | g ∈ G}. It can be easily shown that O_x is globally invariant to the action of any element of G, and thus any descriptor computed directly from O_x will be globally invariant to G.

More specifically, the invariant descriptor f(x) can be derived in two stages. First, given a predefined template t (e.g., a convolutional filter in a CNN), we compute the dot products of t over the orbit: D_{x,t} = {⟨g · x, t⟩ ∈ R | g ∈ G}. Second, the extracted invariant descriptor should be a histogram representation of the distribution D_{x,t} with a specific bin configuration, for example, the statistical moments (e.g., mean, max, standard deviation, etc.) derived from D_{x,t}. Such a representation is mathematically proven to have the proper invariance property for transformations such as translations (G_t), scales (G_s) and rotations (G_r). One may note that the transformation g can be applied either to the image or to the template indifferently, i.e., ⟨g · x, t⟩ = ⟨x, g · t⟩ for all g ∈ G. Recent work on face verification [60] and music classification [61] successfully applied this theory to practical applications. We refer the reader to [59] for more details on the invariance theory.
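As a toy illustration of this two-stage recipe (dot products over an orbit, then a pooled statistic), the snippet below builds a rotation orbit of a random patch and checks that a moment of the resulting dot-product distribution barely changes when the input itself is rotated. This is purely didactic and not part of the HNIP pipeline; the invariance is only approximate because of interpolation and border effects in `scipy.ndimage.rotate`.

```python
import numpy as np
from scipy.ndimage import rotate

def orbit_moment(x, template, angles=range(0, 360, 10), moment="mean"):
    """D_{x,t} = {<g.x, t> | g in G_R} over a sampled rotation orbit,
    summarized by a single statistical moment (mean or max)."""
    dots = [float(np.sum(rotate(x, a, reshape=False, order=1) * template))
            for a in angles]
    return float(np.mean(dots)) if moment == "mean" else float(np.max(dots))

rng = np.random.default_rng(0)
x, t = rng.random((32, 32)), rng.random((32, 32))
s_original = orbit_moment(x, t)
s_rotated = orbit_moment(rotate(x, 40, reshape=False, order=1), t)
# s_original and s_rotated are close: the pooled statistic over the orbit is
# (approximately) unchanged when x is replaced by a rotated copy of itself.
```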
An example: translation invariance of CNN: The convolution-pooling operations in a CNN are compliant with the invariance theory. Existing well-known CNN architectures like AlexNet [19] and VGG16 [20] share a common building block: a succession of convolution and pooling operations, which in fact provides a way to incorporate local translation invariance. As shown in Fig. 2(b), the convolution operation on translated image patches (i.e., sliding windows) is equivalent to ⟨g · x, t⟩, and the max pooling operation is in line with the statistical histogram computation over the distribution of the dot products (i.e., feature maps). For instance, consider a convolutional filter that has learned a "cat face" pattern: the filter will respond to an image depicting a cat face no matter where the face is located in the image. Subsequently, max pooling over the activation feature maps captures the most salient feature of the cat face, which is naturally invariant to object translation.
Incorporating scale and rotation invariances into CNN: Based on the already locally translation invariant feature maps (e.g., the last pooling layer, pool5), we propose to further improve the invariance of pool5 CNN descriptors by incorporating global invariance to several transformation groups. The specific transformation groups considered in this study are scales G_s and rotations G_r. As one can see, it is impractical to generate all transformations g · x for the orbit O_x. In addition, the computational complexity of the feedforward pass in a CNN increases linearly with the number of transformed versions of the input x. For practical consideration, we simplify the orbit to a finite set of transformations (e.g., number of rotations R = 4, number of scales S = 3). This results in HNIP descriptors that are approximately invariant to the transformations, without a huge increase in feature extraction time.

An interesting aspect of the invariance theory is the possibility, in practice, to chain multiple types of group invariances one after the other, as already demonstrated in [61]. In this study, we construct descriptors invariant to several transformation groups by progressively applying the method to different transformation groups, as shown in (2)–(4).

Fig. 3. Comparison of pooled descriptors invariant to (a) rotation and (b) scale changes of query images, measured by retrieval accuracy (mAP) on the Holidays dataset. The fc6 layer of VGG16 [20] pretrained on the ImageNet dataset is used.
Discussions: While there are theoretical guarantees on the scale and rotation invariance of handcrafted local feature detectors such as DoG, classical CNN architectures lack invariance to these geometric transformations [62]. Many works have proposed to encode transformation invariances into both handcrafted representations (e.g., BoW built on densely sampled SIFT [63]) and CNN representations [64], by explicitly augmenting input images with rotation and scale transformations. Our HNIP takes a similar image augmentation idea, but has several significant differences. First, rather than a single pooling (max or average) layer over all transformations, HNIP progressively pools features together across different transformations with different moments, which is essential for significantly improving the quality of the pooled CNN descriptors. Second, unlike previous empirical studies, we have attempted to mathematically show that the design of nested pooling ensures that HNIP is approximately invariant to multiple transformations, inspired by the recently developed invariance theory. Third, to the best of our knowledge, this work is the first to comprehensively analyze the invariance properties of CNN descriptors in the context of large scale video matching and retrieval.
C. Quantitative and Qualitative Evaluations

Transformation invariance: In this section, we propose a database-side data augmentation strategy for image retrieval to study the rotation and scale invariance properties, respectively. For simplicity, we represent an image by a 4096-dimensional descriptor extracted from the first fully connected layer (fc6) of VGG16 [20] pre-trained on the ImageNet dataset. We report retrieval results in terms of mean Average Precision (mAP) on the INRIA Holidays dataset [65] (500 query images, 991 reference images).
Fig. 3(a) investigates the invariance property with respect to query rotations. First, we observe that the retrieval performance drops significantly as we rotate query images while fixing the reference images (the red curve). To gain invariance to query rotations, we rotate each reference image within a range of −180° to 180°, with a step of 10°. The fc6 features of its 36 rotated images are pooled together into one common global descriptor representation, with either max or average pooling. We observe that the performance is relatively consistent (blue for max pooling, green for average pooling) against the rotation of query images. Moreover, performance under variations of the query image scale is plotted in Fig. 3(b). It is observed that database-side augmentation by max or average pooling over scale changes (scale ratios of 0.75, 0.5, 0.375, 0.25, 0.2 and 0.125) can improve the performance when the query scale is small (e.g., 0.125).
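The database-side augmentation used for Fig. 3 amounts to pooling one fc6 descriptor per rotated (or rescaled) copy of each reference image into a single vector. A minimal sketch follows; `extract_fc6` and `rotate_image` are hypothetical helpers standing in for the VGG16 fc6 forward pass and an image rotation routine.

```python
import numpy as np

def augmented_reference_descriptor(image, extract_fc6, rotate_image,
                                   angles=range(-180, 180, 10), pooling="avg"):
    """Pool fc6 descriptors of rotated copies of a reference image (Fig. 3(a)).
    extract_fc6: image -> 4096-d vector; rotate_image: (image, angle) -> image."""
    descs = np.stack([extract_fc6(rotate_image(image, a)) for a in angles])  # 36 x 4096
    pooled = descs.max(axis=0) if pooling == "max" else descs.mean(axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-12)   # L2-normalize for cosine matching
```

The same routine applies to the scale study in Fig. 3(b) by replacing the rotation angles with the scale ratios listed above.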
Nesting multiple transformations: We further analyze nested pooling over multiple transformations. Fig. 4 provides an insight into how progressively adding different types of transformations affects the matching distance for different image matching pairs. We can see a reduction in matching distance with the incorporation of each new transformation group. Fig. 5 takes a closer look at pairwise similarity maps between the local deep features of query keyframes and the global deep descriptors of reference keyframes, which explicitly reflect the regions of the query keyframe that contribute significantly to the similarity measurement. We compare our HNIP (the third heatmap in each row) to the state-of-the-art deep descriptors MAC [18] and R-MAC [18]. Because of the introduction of scale and rotation transformations, HNIP is able to locate the query object of interest responsible for the similarity measures more precisely than MAC and R-MAC, even though there are scale and rotation changes between the query-reference pairs. Moreover, as shown in Table I, quantitative results on video matching by HNIP with progressively encoded multiple transformations provide additional positive evidence for the nested invariance property.

Fig. 4. Distances for three matching pairs from the MPEG CDVA dataset (see Section VI-A for more details). For each pair, three pairwise distances (L2 normalized) are computed by progressively encoding translations (G_t), scales (G_t + G_s), and rotations (G_t + G_s + G_r) into the nested pooling stages. Average pooling is used for all transformations. Feature maps are extracted from the last pooling layer of pretrained VGG16.

Fig. 5. Example similarity maps between local deep features of query keyframes and the global deep descriptors of reference keyframes, using off-the-shelf VGG16. For the query (left image) and reference (right image) keyframe pair in each row, the middle three similarity maps are from MAC, R-MAC, and HNIP (from left to right), respectively. Each similarity map is generated by cosine similarity between the query local features at each feature map location and the pooled global descriptor of the reference keyframe (i.e., MAC, R-MAC, or HNIP), which allows locating the regions of the query keyframe contributing most to the pairwise similarity.

TABLE I
Pairwise video matching (TPR at 1% FPR) between matching and non-matching video datasets, with pooling cardinality increased by progressively encoding translation, scale, and rotation transformations, for different pooling strategies, i.e., Max-Max-Max, Avg-Avg-Avg, and our HNIP (Squ-Avg-Max).

    Gt          Gt-Gs            Gt-Gs-Gr
    Max   71.9  Max-Max    72.8  Max-Max-Max  73.9
    Avg   76.9  Avg-Avg    79.2  Avg-Avg-Avg  82.2
    Squ   81.6  Squ-Avg    82.7  Squ-Avg-Max  84.3

TABLE II
Statistics on the number of relevant database videos returned in the top 100 list (i.e., Recall@100) for all query videos in the MPEG CDVA dataset (see Section VI-A for more details). "A \ B" represents relevant instances successfully retrieved by method A but missed in the list generated by method B. The last pooling layer of pretrained VGG16 is used for HNIP.

                 Landmarks   Scenes   Objects
    HNIP \ CXM   8143        1477     1218
    CXM \ HNIP   1052        105      1834
Pooling moments: In Fig. 3, it is interesting to note that any choice of pooling moment n in the pooling stage can produce invariant descriptors. However, the discriminability of NIP with varied pooling moments can be quite different. For video retrieval, we empirically observe that pooling with hybrid moments works well for NIP, e.g., starting with square-root pooling (n_t = 2) for translations and average pooling (n_s = 1) for scales, and ending with max pooling (n_r → +∞) for rotations. Here, we present an empirical analysis of how pooling moments affect pairwise video matching performance.

We construct matching and non-matching video sets from the MPEG CDVA dataset. Both sets contain 4690 video pairs. With an input video keyframe size of 640 × 480, feature maps of size 20 × 15 are extracted from the last pooling layer of VGG16 [20] pre-trained on the ImageNet dataset. For transformations, we consider nested pooling by progressively adding transformations with translations (G_t), scales (G_t-G_s) and rotations (G_t-G_s-G_r). For pooling moments, we evaluate Max-Max-Max, Avg-Avg-Avg and our HNIP (i.e., Squ-Avg-Max). Finally, video similarity is computed using (1) with the pooled features.

Table I reports pairwise matching performance in terms of True Positive Rate with the False Positive Rate set to 1%, with transformations switching from G_t to G_t-G_s-G_r, for Max-Max-Max, Avg-Avg-Avg and HNIP. As more transformations are nested in, the separability between the matching and non-matching video sets becomes larger, regardless of the pooling moments used. More importantly, HNIP performs the best compared to Max-Max-Max and Avg-Avg-Avg, while Max-Max-Max is the worst. For instance, HNIP outperforms Avg-Avg-Avg, i.e., 84.3% vs. 82.2%.
V. COMBINING DEEP AND HANDCRAFTED DESCRIPTORS

In this section, we analyze the strengths and weaknesses of deep features in the context of video retrieval and matching, compared to state-of-the-art handcrafted descriptors built upon local invariant features (SIFT). To this end, we respectively compute statistics of HNIP and CDVA handcrafted descriptors (CXM) by retrieving different types of video data. In particular, we focus on videos depicting landmarks, scenes and common objects, collected by MPEG CDVA. Here we describe how to compute the statistics. First, for each query video, we retrieve the top 100 most similar database videos using HNIP and CXM, respectively. Second, for all queries from each type of video data, we accumulate the number of relevant database videos (1) retrieved by HNIP but missed by CXM (denoted as HNIP \ CXM), and (2) retrieved by CXM but missed by HNIP (CXM \ HNIP).

The statistics are presented in Table II. As one can see, compared to the handcrafted descriptors of CXM, HNIP is able to identify many more relevant landmark and scene videos on which CXM fails. On the other hand, CXM recalls more videos depicting common objects than HNIP. Fig. 6 shows qualitative examples of keyframe pairs corresponding to HNIP \ CXM and CXM \ HNIP, respectively.

Fig. 6. Examples of keyframe pairs which (a) HNIP determines as matching but CXM as non-matching (HNIP \ CXM) and (b) CXM determines as matching but HNIP as non-matching (CXM \ HNIP).
Fig. 7 further visualizes intermediate keyframe matching results produced by handcrafted and deep features, respectively. Despite the viewpoint change of the landmark images in Fig. 7(a), the salient features fired in their activation maps are spatially consistent. Similar observations hold for the indoor scene images in Fig. 7(b). These observations are probably because deep descriptors excel at characterizing globally salient features. On the other hand, handcrafted descriptors work on local patches detected at sparse interest points, which favor richly textured blobs [Fig. 7(c)] rather than weakly textured ones [Fig. 7(a) and 7(b)]. This may explain why there are more inlier matches found by GCC for the product images in Fig. 7(c). Finally, compared to the approximate scale and rotation invariances provided by HNIP analyzed in the previous section, handcrafted local features have built-in mechanisms that ensure nearly exact invariance to these transformations for rigid objects in the 2D plane; examples can be found in Fig. 6(b).
In summary, these observations reveal that deep learned features may not always outperform handcrafted features; there may exist complementary effects between CNN deep descriptors and handcrafted descriptors. Therefore, we propose to leverage the benefits of both deep and handcrafted descriptors. Considering that handcrafted descriptors are categorized into local and global ones, we investigate the combination of deep descriptors with either handcrafted local or handcrafted global descriptors, respectively.

Combining HNIP with handcrafted local descriptors: For matching, if the HNIP matching score exceeds a threshold, then we use handcrafted local descriptors for verification. For retrieval, the HNIP matching score is used to select the top 500 candidate list, and then we use handcrafted local descriptors for reranking.
Combining HNIP and handcrafted global descriptors: Instead of simply concatenating the HNIP derived deep descriptors and the handcrafted descriptors, for both matching and retrieval the similarity score is defined as the weighted sum of the matching scores of HNIP and the handcrafted global descriptors:

$$k(\mathbf{x}, \mathbf{y}) = \alpha \cdot k_c(\mathbf{x}, \mathbf{y}) + (1 - \alpha) \cdot k_h(\mathbf{x}, \mathbf{y}) \qquad (7)$$

where α is the weighting factor, and k_c and k_h represent the matching scores of HNIP and the handcrafted descriptors, respectively. In this work, α is empirically set to 0.75.
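The late fusion in (7) is a single weighted sum of two keyframe matching scores; a minimal sketch, with the weight fixed to the value used in this work, is given below.

```python
def fused_similarity(k_hnip, k_handcrafted, alpha=0.75):
    """Eq. (7): weighted sum of the HNIP score k_c and the handcrafted
    global-descriptor score k_h, with weighting factor alpha."""
    return alpha * k_hnip + (1.0 - alpha) * k_handcrafted
```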
VI. EXPERIMENTAL RESULTS

A. Datasets and Evaluation Metrics

Datasets: The MPEG CDVA ad-hoc group collected a large-scale, diverse video dataset to evaluate the effectiveness of video descriptors for video matching, localization and retrieval applications, with resource constraints including descriptor size, extraction time and matching complexity. This CDVA dataset² is diversified to contain views of 1) stationary large objects, e.g., buildings, landmarks (most likely background objects, possibly partially occluded or a close-up), 2) generally smaller items (e.g., paintings, books, CD covers, products) which typically appear in front of background scenes, possibly occluded, and 3) scenes (e.g., interior scenes, natural scenes, multi-camera shots, etc.). The CDVA dataset also comprises planar or non-planar, rigid or partially rigid, textured or partially textured objects (scenes), which are captured from different viewpoints with different camera parameters and lighting conditions.

²The MPEG CDVA dataset and evaluation framework are available upon request at http://www.cldatlas.com/cdva/dataset.html. CDVA standard documents are available at http://mpeg.chiariglione.org/standards/exploration/compact-descriptors-video-analysis.

Fig. 7. Keyframe matching examples which illustrate the strengths and weaknesses of CNN based deep descriptors and handcrafted descriptors. In (a) and (b), deep descriptors perform well but handcrafted ones fail, while (c) is the opposite.

TABLE III
Statistics on the MPEG CDVA benchmark datasets. IoI: items of interest; q.v.: query videos; r.v.: reference videos.
Specifically, the MPEG CDVA dataset contains 9974 query and 5127 reference videos (denoted as All), depicting 796 items of interest, among which are 489 large landmarks (e.g., buildings), 71 scenes (e.g., interior or natural scenes) and 236 small common objects (e.g., paintings, books, products). The videos have durations from 1 sec to over 1 min. To evaluate video retrieval on different types of video data, we categorize the query videos into Landmarks (5224 queries), Scenes (915 queries) and Objects (3835 queries). Table III summarizes the numbers of items of interest and their instances for each category. Fig. 8 shows some example video clips from the three categories.

To evaluate the performance of large scale video retrieval, we combine the reference videos with a set of user-generated and broadcast videos as distractors, which consist of content unrelated to the items of interest. There are 14537 distractor videos with more than 1000 hours of data.

Moreover, to evaluate pairwise video matching and temporal localization, 4693 matching video pairs and 46911 non-matching video pairs are constructed from the query and reference videos. The temporal location of the items of interest within each video pair is annotated as the ground truth.
We also evaluate our method on image retrieval benchmark datasets. The INRIA Holidays dataset [65] is composed of 1491 high-resolution (e.g., 2048 × 1536) scene-centric images, 500 of which are queries. This dataset includes a large variety of outdoor scene/object types: natural, man-made, water and fire effects. We evaluate the rotated version of Holidays [13], where all images are in up-right orientation. Oxford5k [66] is a buildings dataset consisting of 5062 images, mainly of size 1024 × 768. There are 55 queries composed of 11 landmarks, each represented by 5 queries. We use the provided bounding boxes to crop the query images. The University of Kentucky Benchmark (UKBench) [67] consists of 10200 VGA size images, organized into 2550 groups of common objects, each object represented by 4 images. All 10200 images serve as queries.
Evaluation metrics: Retrieval performance is evaluated by mean Average Precision (mAP) and precision at a given cut-off rank R for query videos (Precision@R), and we set R = 100 following the MPEG CDVA standard. Pairwise video matching performance is evaluated by the Receiver Operating Characteristic (ROC) curve. We also report pairwise matching results in terms of True Positive Rate (TPR), given a False Positive Rate (FPR) equal to 1%. In case a video pair is predicted as a match, the temporal location of the item of interest is further identified within the video pair. The localization accuracy is measured by the Jaccard index

$$\frac{\big|[T_{start}, T_{end}] \cap [T'_{start}, T'_{end}]\big|}{\big|[T_{start}, T_{end}] \cup [T'_{start}, T'_{end}]\big|}$$

where [T_start, T_end] denotes the ground truth and [T'_start, T'_end] denotes the predicted start and end frame timestamps.

Besides these accuracy measurements, we also measure the complexity of the algorithms, including descriptor size, transmission bit rate, extraction time and search time. In particular, the transmission bit rate is measured by (# query keyframes) × (descriptor size in Bytes) / (query duration in seconds).
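Both the temporal Jaccard index and the transmission bit rate reduce to simple arithmetic; a small sketch of the two measures, as defined above, follows.

```python
def temporal_jaccard(gt, pred):
    """Jaccard index between the ground-truth [T_start, T_end] and the predicted
    [T'_start, T'_end] segments, each given as a (start, end) pair in seconds."""
    inter = max(0.0, min(gt[1], pred[1]) - max(gt[0], pred[0]))
    union = max(gt[1], pred[1]) - min(gt[0], pred[0])
    return inter / union if union > 0 else 0.0

def transmission_bit_rate(num_query_keyframes, descriptor_bytes, query_duration_sec):
    """Bit rate in bytes per second: (# query keyframes) x (descriptor size) / duration."""
    return num_query_keyframes * descriptor_bytes / query_duration_sec
```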
B. Implementation Details

In this work, we build HNIP descriptors with two widely used CNN architectures: AlexNet [19] and VGG16 [20]. We test off-the-shelf networks pre-trained on the ImageNet ILSVRC classification dataset. In particular, we crop the networks to the last pooling layer (i.e., pool5). We resize all video keyframes to 640×480 and Holidays (Oxford5k) images to 1024×768 as the inputs of the CNN for descriptor extraction. Finally, post-processing can be applied to pooled descriptors like HNIP and R-MAC [18]; following standard practice, we choose PCA whitening in this work. We randomly sample 40K frames from the distractor videos for PCA learning. These experimental setups are applied to both HNIP and state-of-the-art deep pooled descriptors like MAC [18], SPoC [16], CroW [17] and R-MAC [18].

We also compare HNIP with the MPEG CXM, which is the current state-of-the-art handcrafted compact descriptor for video analysis. Following the practice in the CDVA standard, we employ OpenMP to perform parallel retrieval for both CXM and the deep global descriptors. Experiments are conducted on the Tianhe HPC platform, where each node is equipped with 2 processors (24 cores, Xeon E5-2692) @2.2 GHz and 64 GB RAM. For CNN feature map extraction, we use an NVIDIA Tesla K80 GPU.

Fig. 8. Example video clips from the CDVA dataset.

TABLE IV
Video retrieval comparison (mAP) by progressively adding transformations (translation, scale, rotation) into NIP (off-the-shelf VGG16 is used for all test datasets). Average pooling is applied to all transformations. No PCA whitening is performed. kB/k.f.: descriptor size per keyframe. The best results are highlighted in bold.

    Transf.     size (kB/k.f.)   Landmarks   Scenes   Objects   All
    Gt          2                64.0        82.9     64.8      66.0
    Gt-Gs       2                65.3        82.4     67.3      67.6
    Gt-Gs-Gr    2                64.6        82.7     72.2      69.2
C. Evaluations on HNIP Variants

We perform video retrieval experiments to assess the effect of the transformations and pooling moments in the HNIP pipeline, using off-the-shelf VGG16.

Transformations: Table IV studies the influence of the pooling cardinalities by progressively adding transformations into the nested pooling stages. We simply apply average pooling to all transformations. First, the dimensionality of all NIP variants is 512 for VGG16, resulting in a descriptor size of 2 kB per keyframe for floating point vectors (4 bytes per element). Second, overall, retrieval performance (mAP) increases as more transformations are nested into the pooled descriptors, e.g., from 66.0% for G_t to 69.2% for G_t-G_s-G_r on the full test dataset (All). Also, we observe that G_t-G_s-G_r outperforms G_t-G_s and G_t by a large margin on Objects, while achieving comparable performance on Landmarks and Scenes. Revisiting the analysis of rotation invariance pooling on the scene-centric Holidays dataset in Fig. 3(a), though invariance to query rotation changes can be gained by database-side augmented pooling, one may note that its retrieval performance is comparable to that obtained without rotating query and reference images (i.e., the peak value of the red curve). These observations are probably because there are relatively limited rotation (scale) changes for videos depicting large landmarks or scenes, compared to small common objects. More examples can be found in Fig. 8.

TABLE V
Video retrieval comparison (mAP) of NIP with different pooling moments (off-the-shelf VGG16 is used for all test datasets). Transformations are Gt-Gs-Gr for all experiments. No PCA whitening is performed. The best results are highlighted in bold.

    Pool Op.        size (kB/k.f.)   Landmarks   Scenes   Objects   All
    Max-Max-Max     2                63.2        82.9     69.9      67.6
    Avg-Avg-Avg     2                64.6        82.7     72.2      69.2
    Squ-Squ-Squ     2                54.2        65.6     66.5      60.0
    Max-Squ-Avg     2                60.6        80.7     74.4      67.8
    Hybrid (HNIP)   2                70.0        88.2     80.9      75.9
Hybrid pooling moments: Table V explores the effects of pooling moments within NIP. Transformations are fixed as G_t-G_s-G_r. There are 3³ = 27 possible combinations of pooling moments in HNIP. For simplicity, we compare our hybrid NIP (i.e., Squ-Avg-Max) to two widely used pooling strategies (i.e., max or average pooling across all transformations) and two other schemes: square-root pooling across all transformations, and Max-Squ-Avg, which decreases the pooling moment along the way. First, for uniform pooling, Avg-Avg-Avg is overall superior to Max-Max-Max and Squ-Squ-Squ, while Squ-Squ-Squ performs much worse than the other two. Second, HNIP outperforms the best uniform pooling, Avg-Avg-Avg, by a large margin. For instance, the gains over Avg-Avg-Avg are +5.4%, +5.5% and +8.7% on Landmarks, Scenes and Objects, respectively. Finally, for hybrid pooling, HNIP performs significantly better than Max-Squ-Avg over all test datasets. We observe similar trends when comparing HNIP to other hybrid pooling combinations.
D. Comparisons Between HNIP and State-of-the-Art Deep Descriptors

Previous experiments show that the integration of transformations and hybrid pooling moments offers remarkable video retrieval performance improvements. Here, we conduct another round of video retrieval experiments to validate the effectiveness of our optimal reported HNIP, compared to state-of-the-art deep descriptors [16]–[18].

TABLE VI
Video retrieval comparison of HNIP with state-of-the-art deep descriptors in terms of mAP (off-the-shelf VGG16 is used for all test datasets). Each dataset column reports mAP without / with PCA whitening (PCAW). We implement MAC [18], SPoC [16], CroW [17] and R-MAC [18] based on the source codes released by the authors, while following the same experimental setups as our HNIP. The best results are highlighted in bold.

    method        size (kB/k.f.)   extra. time (s/k.f.)   Landmarks     Scenes        Objects       All
    MAC [18]      2                0.32                   57.8 / 61.9   77.4 / 76.2   70.0 / 71.8   64.3 / 67.0
    SPoC [16]     2                0.32                   64.0 / 69.1   82.9 / 84.0   64.8 / 70.3   66.0 / 70.9
    CroW [17]     2                0.32                   62.3 / 63.9   79.2 / 78.4   71.9 / 72.0   67.5 / 68.3
    R-MAC [18]    2                0.32                   69.4 / 74.6   84.4 / 87.3   73.8 / 78.2   72.5 / 77.1
    HNIP (Ours)   2                0.96                   70.0 / 74.8   88.2 / 90.1   80.9 / 85.0   75.9 / 80.1

TABLE VII
Effect of the number of detected video keyframes on descriptor transmission bit rate, retrieval performance (mAP), and search time, on the full test dataset (All). We report the performance of state-of-the-art handcrafted descriptors (CXM) and PCA whitened HNIP with both off-the-shelf AlexNet and VGG16. Numbers in brackets denote the percentage of detected keyframes from the raw videos. Bps: bytes per second. s/q.v.: seconds per query video.

    # query k.f.    # DB k.f.       method         size (kB/k.f.)   bit rate (Bps)   mAP    search time (s/q.v.)
    ∼140K (1.6%)    ∼105K (2.4%)    CXM            ∼4               2840             73.6   12.4
                                    AlexNet-HNIP   1                459              71.4   1.6
                                    VGG16-HNIP     2                918              80.1   2.3
    ∼175K (2.0%)    ∼132K (3.0%)    CXM            ∼4               3463             74.3   16.6
                                    AlexNet-HNIP   1                571              71.9   2.0
                                    VGG16-HNIP     2                1143             80.6   2.8
    ∼231K (2.7%)    ∼176K (3.9%)    CXM            ∼4               4494             74.6   21.0
                                    AlexNet-HNIP   1                759              71.9   2.2
                                    VGG16-HNIP     2                1518             80.7   3.1

TABLE VIII
Video retrieval comparison of HNIP with state-of-the-art handcrafted descriptors (CXM), for all test datasets. Each cell reports mAP / Precision@R. We report the performance of PCA whitened HNIP with both off-the-shelf AlexNet and VGG16. The best results are highlighted in bold.

    method         Landmarks     Scenes        Objects       All
    CXM            61.4 / 60.9   63.0 / 61.9   92.6 / 91.2   73.6 / 72.6
    AlexNet-HNIP   65.2 / 62.3   77.6 / 74.1   78.4 / 74.5   71.4 / 68.1
    VGG16-HNIP     74.8 / 71.6   90.1 / 86.6   85.0 / 81.3   80.1 / 76.7
Effect of PCA whitening: Table VI studies the effect of PCA whitening on different deep descriptors in terms of video retrieval performance (mAP), using off-the-shelf VGG16. Overall, PCA whitened descriptors perform better than their counterparts without PCA whitening. More specifically, the improvements for SPoC, R-MAC and our HNIP are much larger than for MAC and CroW. In view of this, we apply PCA whitening to HNIP in the following sections.
HNIP versus MAC, SPoC, CroW and R-MAC: Table VI presents the comparison of HNIP against state-of-the-art deep descriptors. We observe that HNIP obtains consistently better performance than the other approaches on all test datasets, at the cost of extra extraction time.³ HNIP significantly improves the retrieval performance over MAC [18], SPoC [16] and CroW [17], e.g., by over 10% in mAP on the full test dataset (All). Compared with the state-of-the-art R-MAC [18], a +7% mAP improvement is achieved on Objects, which is mainly attributed to the improved robustness against the rotation changes in videos (the keyframes capture small objects from different angles).
E. Comparisons Between HNIP and Handcrafted Descriptors

In this section, we first study the influence of the number of detected video keyframes on descriptor transmission bit rate, retrieval performance and search time. Then, with the keyframes fixed, we compare HNIP to the state-of-the-art compact handcrafted descriptors (CXM), which currently obtain the best video retrieval performance on the MPEG CDVA datasets.
Effect of the number of detected video keyframes: As shown in Table VII, we generate three keyframe detection configurations by varying the detection parameters.
3The extraction time of deep descriptors decomposes mainly into 1) the feed-forward pass that extracts the feature maps and 2) pooling over the feature maps followed by post-processing such as PCA whitening. In our implementation based on MatConvNet, the first stage takes 0.21 seconds per keyframe (VGA-size input image to VGG16 executed on an NVIDIA Tesla K80 GPU); HNIP is four times slower (∼0.84 seconds) as there are four rotations per keyframe. The second stage takes ∼0.11 seconds for MAC, SPoC and CroW, ∼0.115 seconds for R-MAC, and ∼0.12 seconds for HNIP. Therefore, the extraction time of HNIP is roughly three times that of the other descriptors.
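As a back-of-the-envelope check of footnote 3 against the extraction times in Table VI: 0.21 s + ∼0.11 s ≈ 0.32 s per keyframe for MAC, SPoC, CroW and R-MAC, and 4 × 0.21 s + ∼0.12 s ≈ 0.96 s per keyframe for HNIP, which matches the reported 0.32 and 0.96 s/k.f.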
Fig. 9. Pairwise video matching comparison of HNIP with state-of-the-art handcrafted descriptors (CXM) in terms of ROC curves, for all test datasets. Experimental settings are identical to those in Table VIII.
We also test the retrieval performance and complexity of these configurations for the CXM descriptors (∼4 kB per keyframe) and for PCA-whitened HNIP with both off-the-shelf AlexNet (1 kB per keyframe) and VGG16 (2 kB per keyframe), on the full test dataset (All). Descriptor transmission bit rate and search time increase proportionally with the number of detected keyframes, whereas the retrieval performance gain is marginal for all descriptors, i.e., less than 1% in mAP. Thus, we adopt the first configuration throughout this work, which achieves a good tradeoff between accuracy and complexity. For instance, the mAP of VGG16-HNIP is 80.1% while its descriptor transmission bit rate is only 918 bytes per second (Table VII).
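The bit rates in Table VII follow directly from the keyframe detection rate and the per-keyframe descriptor size. A minimal helper, under the assumption of roughly 30 fps source video (the frame rate is not stated in the text), lands in the same ballpark as the reported numbers:

```python
def transmission_rate_bps(keyframe_fraction, video_fps, descriptor_bytes):
    """Approximate descriptor bit rate in bytes per second:
    detected keyframes per second times descriptor size per keyframe."""
    return keyframe_fraction * video_fps * descriptor_bytes

# ~1.6% detected keyframes, assumed ~30 fps, 2 kB VGG16-HNIP descriptor.
print(transmission_rate_bps(0.016, 30.0, 2 * 1024))  # ~983 bytes/s
```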
Video retrieval: Table VIII shows the video retrieval comparison of HNIP with the handcrafted CXM descriptors on all test datasets. First, AlexNet-HNIP is overall inferior to CXM, while VGG16-HNIP performs the best. Second, HNIP with both AlexNet and VGG16 outperforms CXM on Landmarks and Scenes. The performance gap between HNIP and CXM becomes larger as the network goes deeper from AlexNet to VGG16; e.g., AlexNet-HNIP and VGG16-HNIP improve over CXM by 3.8% and 13.4% in mAP on Landmarks, respectively. Third, we observe that AlexNet-HNIP performs much worse than CXM on Objects (e.g., 74.5% vs. 91.2% in Precision@R). VGG16-HNIP reduces the gap, but still underperforms CXM. This is reasonable, as handcrafted descriptors based on SIFT are more robust to scale and rotation changes of rigid objects in the 2D plane.
Video pairwise matching and localization: Fig. 9 and Table IX further show the pairwise video matching and temporal localization performance of HNIP and CXM on all test datasets, respectively. For pairwise video matching, VGG16-HNIP and AlexNet-HNIP consistently outperform CXM in terms of TPR at varied FPR on Landmarks and Scenes. In Table IX, we observe that the temporal localization trends are roughly consistent with those of pairwise video matching.
One may note that the localization accuracy of CXM is worse than that of HNIP on Objects (see Table IX), whereas CXM obtains a much better video retrieval mAP than HNIP on Objects (see Table VIII). First, given a query-reference video pair, video retrieval tries to identify the single most similar keyframe pair, while temporal localization aims to locate multiple keyframe pairs by comparing against a predefined threshold. Second, as shown in Fig. 9, CXM achieves a better TPR (Recall) than both VGG16-HNIP and AlexNet-HNIP on Objects when the FPR is small (e.g., FPR = 1%), and its TPR becomes worse as the FPR increases.
TABLE IX
VIDEO LOCALIZATION COMPARISON OF HNIP WITH STATE-OF-THE-ART HANDCRAFTED DESCRIPTORS (CXM) IN TERMS OF JACCARD INDEX, FOR ALL TEST DATASETS

method       | Landmarks | Scenes | Objects | All
CXM          | 45.5      | 45.9   | 68.8    | 54.4
AlexNet-HNIP | 48.9      | 63.0   | 67.3    | 57.1
VGG16-HNIP   | 50.8      | 63.8   | 71.2    | 59.7

Experimental settings are the same as in Table VIII. The best results are highlighted in bold.
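Table IX scores localization with the temporal Jaccard index. Below is a minimal sketch of such an overlap measure between one predicted segment and one ground-truth segment; the exact CDVA evaluation protocol (e.g., how multiple or empty segments are handled) is not reproduced here.

```python
def temporal_jaccard(pred_start, pred_end, gt_start, gt_end):
    """Intersection-over-union of two time intervals, in [0, 1]."""
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = max(pred_end, gt_end) - min(pred_start, gt_start)
    return intersection / union if union > 0 else 0.0

# Example: predicted segment [12 s, 30 s] vs. ground truth [15 s, 32 s].
print(temporal_jaccard(12.0, 30.0, 15.0, 32.0))  # 0.75
```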
TABLE X
IMAGE RETRIEVAL COMPARISON (MAP) OF HNIP WITH STATE-OF-THE-ART DEEP AND HANDCRAFTED DESCRIPTORS (CXM), ON HOLIDAYS, OXFORD5K, AND UKBENCH

method      | Holidays | Oxford5k | UKbench
CXM         | 71.2     | 43.5     | 3.46
MAC [18]    | 78.3     | 56.1     | 3.65
SPoC [16]   | 84.5     | 68.6     | 3.68
R-MAC [18]  | 87.2     | 67.6     | 3.73
HNIP (Ours) | 88.9     | 69.3     | 3.90

We report performance of PCA-whitened deep descriptors with off-the-shelf VGG16. The best results are highlighted in bold.
This implies that 1) CXM ranks relevant videos and keyframes higher than HNIP in the retrieved list for object queries, which leads to better mAP on Objects when retrieval performance is evaluated over a small shortlist (100 in our experiments); and 2) VGG16-HNIP attains higher Recall than CXM when the FPR becomes large, which leads to higher localization accuracy on Objects. In other words, for better temporal localization we choose a small threshold (corresponding to FPR = 14.3% in our experiments) in order to recall as many relevant keyframes as possible.
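The distinction drawn above, i.e., retrieval keeps only the single best keyframe pair while localization thresholds all keyframe pairs and spans the matched timestamps, can be summarized by the sketch below. The similarity function, the threshold value, and all variable names are placeholders rather than the normative CDVA procedure.

```python
import numpy as np

def keyframe_similarities(query_desc, ref_desc):
    """Cosine similarities between all query/reference keyframe descriptors
    (rows are assumed to be L2-normalized global descriptors)."""
    return query_desc @ ref_desc.T

def retrieval_score(sim):
    """Video-level similarity used for ranking: the single best keyframe pair."""
    return float(sim.max())

def localized_segment(sim, query_times, threshold=0.6):
    """Temporal localization sketch: collect every query keyframe whose best
    match exceeds the threshold, where query_times holds one (start_sec,
    end_sec) pair per query keyframe, and report the spanned interval."""
    matched = [t for t, row in zip(query_times, sim) if row.max() > threshold]
    if not matched:
        return None
    starts, ends = zip(*matched)
    return min(starts), max(ends)
```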
Image retrieval: To further verify the effectiveness of HNIP, we conduct image instance retrieval experiments on the scene-centric Holidays, landmark-centric Oxford5k, and object-centric UKbench datasets. Table X compares HNIP with MAC, SPoC, R-MAC, and the handcrafted descriptors from the MPEG CDVA evaluation framework. First, we observe that HNIP outperforms the handcrafted descriptors by a large margin on all datasets.
Fig. 10. (a), (b) Video retrieval, (c) pairwise video matching, and (d) localization performance of the optimal reported HNIP (i.e., VGG16-HNIP) combined with either CXM local or CXM global descriptors, for all test datasets. For simplicity, we report the pairwise video matching performance in terms of TPR at FPR = 1%.
TABLE XI
VIDEO RETRIEVAL COMPARISON OF HNIP WITH CXM AND THE COMBINATION OF HNIP WITH CXM-LOCAL AND CXM-GLOBAL, RESPECTIVELY, ON THE FULL TEST DATASET (ALL), WITHOUT ("W/O D") OR WITH ("W/ D") THE LARGE SCALE DISTRACTOR VIDEOS

method                  | size (kB/k.f.) | mAP (w/o D / w/ D) | Precision@R (w/o D / w/ D) | search time (s/q.v., w/o D / w/ D)
CXM                     | ∼4             | 73.6 / 72.1        | 72.6 / 71.2                | 12.4 / 38.6
VGG16-HNIP              | 2              | 80.1 / 76.8        | 76.7 / 73.6                | 2.3 / 9.2
VGG16-HNIP + CXM-Local  | ∼4             | 75.7 / 75.4        | 74.4 / 74.1                | 12.9 / 17.8
VGG16-HNIP + CXM-Global | ∼4             | 84.9 / 82.6        | 82.4 / 80.3                | 4.9 / 39.5

# query k.f.: ∼140K; # DB k.f.: ∼105K (w/o D) / ∼1.25M (w/ D).
Second, HNIP performs significantly better than the state-of-the-art deep descriptor R-MAC on UKbench, though it shows only marginally better performance on Holidays. This performance trend between HNIP and R-MAC is consistent with the video retrieval results on CDVA-Scenes and CDVA-Objects in Table VI. It again demonstrates that HNIP tends to be more effective on object-centric datasets than on scene- and landmark-centric ones, as object-centric datasets typically exhibit more rotation and scale distortions.
F. Combination of HNIP and Handcrafted Descriptors

CXM contains both compressed local descriptors (∼2 kB/frame) and compact global descriptors (∼2 kB/frame) aggregated from the local ones. Following the combination strategies designed in Section V, Fig. 10 shows the effectiveness of combining VGG16-HNIP with either CXM-Global or CXM-Local descriptors,4 in video retrieval (a), (b), matching (c), and localization (d). First, we observe that combining VGG16-HNIP with either CXM-Global or CXM-Local consistently improves over CXM across all tasks on all test datasets. In this regard, the improvements of VGG16-HNIP + CXM-Global are much larger than those of VGG16-HNIP + CXM-Local, especially on Landmarks and Scenes. Second, VGG16-HNIP + CXM-Global performs best on all test datasets in video retrieval, matching, and localization (except for localization accuracy on Landmarks).
4Here, we did not evaluate the more complicated combination VGG16-HNIP + CXM-Global + CXM-Local, because its performance is very close to that of VGG16-HNIP + CXM-Local, while it further increases descriptor size and search time compared to VGG16-HNIP + CXM-Local.
In particular, VGG16-HNIP + CXM-Global significantly improves over VGG16-HNIP on Objects in terms of mAP and Precision@R (+10%). This leads us to the conclusion that the deep descriptor VGG16-HNIP and the handcrafted descriptor CXM-Global are complementary to each other. Third, we observe that VGG16-HNIP + CXM-Local significantly degrades the performance of VGG16-HNIP on Landmarks and Scenes; e.g., there is a drop of ∼10% in mAP on Landmarks. This is because matching pairs retrieved by HNIP (but missed by the handcrafted features) cannot pass the GCC step, i.e., the number of inliers (patch-level matching pairs) is insufficient. For instance, in Fig. 7, the landmark pair is determined to be a match by VGG16-HNIP, but the subsequent GCC step considers it a non-match because there are only 2 matched patch pairs. More examples can be found in Fig. 6(a).
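A minimal sketch of the kind of score-level combination evaluated here is given below. The actual combination strategy is defined in Section V and is not reproduced; the weighting scheme, the GCC gate, and all names are illustrative assumptions.

```python
def combined_similarity(hnip_sim, cxm_global_sim, weight=0.5):
    """Late fusion of the deep (HNIP) and handcrafted (CXM-Global) keyframe
    similarities; a simple convex combination is assumed in this sketch."""
    return weight * hnip_sim + (1.0 - weight) * cxm_global_sim

def combined_with_local(hnip_sim, gcc_inliers, min_inliers=5):
    """Illustration of why HNIP + CXM-Local can hurt: a keyframe match found
    by HNIP is discarded whenever the geometric consistency check (GCC) on
    local descriptors yields too few inlier patch pairs."""
    return hnip_sim if gcc_inliers >= min_inliers else 0.0
```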
G. Large Scale Video Retrieval

Table XI studies the video retrieval performance of CXM, VGG16-HNIP, and their combinations VGG16-HNIP + CXM-Local and VGG16-HNIP + CXM-Global, on the full test dataset (All) without and with the large scale distractor video set. When the reference videos are combined with the large scale distractor set, the number of database keyframes increases from ∼105K to ∼1.25M, making the search significantly slower; for example, HNIP is ∼5 times slower with the 512-D Euclidean distance computation. Further compressing HNIP into extremely compact codes (e.g., 256 bits) for ultra-fast Hamming distance computation is therefore highly desirable, provided it does not incur a considerable performance loss; we will study this in future work.
Second, the performance ordering of the approaches remains the same in the large scale experiments, i.e., VGG16-HNIP + CXM-Global performs best, followed by VGG16-HNIP, VGG16-HNIP + CXM-Local, and CXM. Finally, when increasing the database size by 10x, the performance loss is relatively small, e.g., -1.5%, -3.3%, -0.3%, and -2.3% in mAP for CXM, VGG16-HNIP, VGG16-HNIP + CXM-Local, and VGG16-HNIP + CXM-Global, respectively.
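As a pointer to the compression direction mentioned above, here is a minimal sketch of turning a 512-D descriptor into a 256-bit code with random hyperplane hashing and matching codes by Hamming distance; this is an illustrative baseline of ours, not the compression scheme the authors plan to adopt.

```python
import numpy as np

rng = np.random.default_rng(0)
# 256 random hyperplanes for sign-based hashing of 512-D descriptors.
hyperplanes = rng.standard_normal((256, 512))

def to_binary_code(descriptor):
    """Map a real-valued 512-D descriptor to a 256-bit code (sign of projections)."""
    return (hyperplanes @ descriptor > 0).astype(np.uint8)

def hamming_distance(code_a, code_b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(code_a != code_b))
```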
VII. CONCLUSION AND DISCUSSIONS

In this work, we propose HNIP, a compact and discriminative CNN descriptor for video retrieval, matching, and localization. Based on the invariance theory, HNIP is proven to be robust to multiple geometric transformations. More importantly, our empirical studies show that the statistical moments used for pooling in HNIP dramatically affect video matching performance, which leads us to the design of hybrid pooling moments within HNIP. In addition, we study the complementary nature of deep learned and handcrafted descriptors, and propose a strategy to combine the two. Experimental results demonstrate that the HNIP descriptor significantly outperforms state-of-the-art deep and handcrafted descriptors, with a comparable or even smaller descriptor size. Furthermore, the combination of HNIP and handcrafted descriptors offers the best overall performance.
This work provides valuable insights for the ongoing CDVA standardization efforts. During the 116th MPEG meeting in Oct. 2016, the MPEG CDVA Ad-hoc group adopted the proposed HNIP into core experiments [21] for investigating practical issues in handling deep learned descriptors within the well-established CDVA evaluation framework. There are several directions for future work. First, an in-depth theoretical analysis of how pooling moments affect video matching performance would further reveal and clarify the mechanism of hybrid pooling, and may contribute to the invariance theory. Second, it is interesting to study how to further improve retrieval performance by optimizing the deep features, e.g., fine-tuning a CNN tailored for the video retrieval task instead of the off-the-shelf CNNs used in this work. Third, to accelerate search, further compressing the deep descriptors into extremely compact codes (e.g., dozens of bits) while preserving retrieval accuracy is worth investigating. Last but not least, as a CNN incurs a huge number of model parameters (over 10 million), how to effectively and efficiently compress the CNN model is a promising direction.
REFERENCES

[1] Compact Descriptors for Video Analysis: Objectives, Applications and Use Cases, ISO/IEC JTC1/SC29/WG11/N14507, 2014.
[2] Compact Descriptors for Video Analysis: Requirements for Search Applications, ISO/IEC JTC1/SC29/WG11/N15040, 2014.
[3] B. Girod et al., "Mobile visual search," IEEE Signal Process. Mag., vol. 28, no. 4, pp. 61–76, Jul. 2011.
[4] R. Ji et al., "Learning compact visual descriptor for low bit rate mobile landmark search," vol. 22, no. 3, 2011.
[5] L.-Y. Duan et al., "Overview of the MPEG-CDVS standard," IEEE Trans. Image Process., vol. 25, no. 1, pp. 179–194, Jan. 2016.
[6] Test Model 14: Compact Descriptors for Visual Search, ISO/IEC JTC1/SC29/WG11/W15372, 2011.
[7] Call for Proposals for Compact Descriptors for Video Analysis (CDVA) - Search and Retrieval, ISO/IEC JTC1/SC29/WG11/N15339, 2015.
[8] CDVA Experimentation Model (CXM) 0.2, ISO/IEC JTC1/SC29/WG11/W16274, 2015.
[9] F. Perronnin, Y. Liu, J. Sanchez, and H. Poirier, "Large-scale image retrieval with compressed Fisher vectors," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2010, pp. 3384–3391.
[10] H. Jegou, M. Douze, C. Schmid, and P. Perez, "Aggregating local descriptors into a compact image representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2010, pp. 3304–3311.
[11] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Workshops, Jun. 2014, pp. 512–519.
[12] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, "Multi-scale orderless pooling of deep convolutional activation features," in Proc. Eur. Conf. Comput. Vis., 2014.
[13] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in Proc. Eur. Conf. Comput. Vis., 2014.
[14] A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson, "A baseline for visual instance retrieval with deep convolutional networks," CoRR, 2014. [Online]. Available: http://arxiv.org/abs/1412.6574
[15] H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson, "From generic to specific deep representations for visual recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Workshops, 2015.
[16] A. Babenko and V. Lempitsky, "Aggregating local deep features for image retrieval," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1269–1277.
[17] Y. Kalantidis, C. Mellina, and S. Osindero, "Cross-dimensional weighting for aggregated deep convolutional features," CoRR, 2015. [Online]. Available: http://arxiv.org/1512.04065
[18] G. Tolias, R. Sicre, and H. Jegou, "Particular object retrieval with integral max-pooling of CNN activations," CoRR, 2015. [Online]. Available: http://arxiv.org/abs/1511.05879
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012.
[20] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[21] Description of Core Experiments in CDVA, ISO/IEC JTC1/SC29/WG11/W16510, 2016.
[22] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.
[23] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Proc. Eur. Conf. Comput. Vis., 2006.
[24] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2003, vol. 2, pp. 1470–1477.
[25] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proc. Comput. Vis. Pattern Recog., 2006.
[26] H. Jegou and A. Zisserman, "Triangulation embedding and democratic aggregation for image search," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 3310–3317.
[27] S. S. Husain and M. Bober, "Improving large-scale image retrieval through robust aggregation of local descriptors," IEEE Trans. Pattern Anal. Mach. Intell., to be published.
[28] M. Raginsky and S. Lazebnik, "Locality-sensitive binary codes from shift-invariant kernels," in Proc. Adv. Neural Inf. Process. Syst., 2009.
[29] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Proc. Adv. Neural Inf. Process. Syst., 2009.
[30] V. Chandrasekhar et al., "Transform coding of image feature descriptors," in Proc. IS&T/SPIE Electron. Imag., 2009.
[31] V. Chandrasekhar et al., "CHoG: Compressed histogram of gradients, a low bit-rate feature descriptor," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2009, pp. 2504–2511.
[32] H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.
[33] M. Calonder et al., "BRIEF: Computing a local binary descriptor very fast," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1281–1298, Jul. 2012.
[34] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 2564–2571.
[35] S. Leutenegger, M. Chli, and R. Y. Siegwart, "BRISK: Binary robust invariant scalable keypoints," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2548–2555.
[36] S. Zhang, Q. Tian, Q. Huang, W. Gao, and Y. Rui, "USB: Ultrashort binary descriptor for fast visual matching and retrieval," IEEE Trans. Image Process., vol. 23, no. 8, pp. 3671–3683, Aug. 2014.
[37] D. M. Chen et al., "Tree histogram coding for mobile image matching," in Proc. Data Compression Conf., 2009.
[38] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing for scalable image search," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep.–Oct. 2009, pp. 2130–2137.
[39] D. Chen et al., "Residual enhanced visual vector as a compact signature for mobile visual search," Signal Process., vol. 93, no. 8, pp. 2316–2327, 2013.
[40] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 580–587.
[41] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in Proc. Comput. Vis. Pattern Recog., Jun. 2016, pp. 5297–5307.
[42] F. Radenovic, G. Tolias, and O. Chum, "CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples," in Proc. Eur. Conf. Comput. Vis., 2016.
[43] A. Gordo, J. Almazan, J. Revaud, and D. Larlus, "Deep image retrieval: Learning global representations for image search," in Proc. Eur. Conf. Comput. Vis., 2016.
[44] L. Baroffio, M. Cesana, A. Redondi, M. Tagliasacchi, and S. Tubaro, "Coding visual features extracted from video sequences," IEEE Trans. Image Process., vol. 23, no. 5, pp. 2262–2276, May 2014.
[45] A. Redondi, L. Baroffio, M. Cesana, and M. Tagliasacchi, "Compress-then-analyze vs. analyze-then-compress: Two paradigms for image analysis in visual sensor networks," in Proc. IEEE Int. Workshop Multimedia Signal Process., Sep.–Oct. 2013, pp. 278–282.
[46] L. Baroffio, J. Ascenso, M. Cesana, A. Redondi, and M. Tagliasacchi, "Coding binary local features extracted from video sequences," in Proc. IEEE Int. Conf. Image Process., Oct. 2014, pp. 2794–2798.
[47] M. Makar, V. Chandrasekhar, S. Tsai, D. Chen, and B. Girod, "Interframe coding of feature descriptors for mobile augmented reality," IEEE Trans. Image Process., vol. 23, no. 8, pp. 3352–3367, Aug. 2014.
[48] J. Chao and E. G. Steinbach, "Keypoint encoding for improved feature extraction from compressed video at low bitrates," IEEE Trans. Multimedia, vol. 18, no. 1, pp. 25–39, Jan. 2016.
[49] L. Baroffio et al., "Coding local and global binary visual features extracted from video sequences," IEEE Trans. Image Process., vol. 24, no. 11, pp. 3546–3560, Nov. 2015.
[50] D. M. Chen, M. Makar, A. F. Araujo, and B. Girod, "Interframe coding of global image signatures for mobile augmented reality," in Proc. Data Compression Conf., 2014.
[51] D. M. Chen and B. Girod, "A hybrid mobile visual search system with compact global signatures," IEEE Trans. Multimedia, vol. 17, no. 7, pp. 1019–1030, Jul. 2015.
[52] C.-Z. Zhu, H. Jegou, and S. Satoh, "NII team: Query-adaptive asymmetrical dissimilarities for instance search," in Proc. TRECVID 2013 Workshop, Gaithersburg, USA, 2013.
[53] N. Ballas et al., "IRIM at TRECVID 2014: Semantic indexing and instance search," in Proc. TRECVID 2014 Workshop, 2014.
[54] A. Araujo, J. Chaves, R. Angst, and B. Girod, "Temporal aggregation for large-scale query-by-image video retrieval," in Proc. IEEE Int. Conf. Image Process., Sep. 2015, pp. 1519–1522.
[55] M. Shi, T. Furon, and H. Jegou, "A group testing framework for similarity search in high-dimensional spaces," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 407–416.
[56] J. Lin et al., "Rate-adaptive compact Fisher codes for mobile visual search," IEEE Signal Process. Lett., vol. 21, no. 2, pp. 195–198, Feb. 2014.
[57] Z. Xu, Y. Yang, and A. G. Hauptmann, "A discriminative CNN video representation for event detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 1798–1807.
[58] L.-Y. Duan et al., "Compact descriptors for video analysis: The emerging MPEG standard," CoRR, 2017. [Online]. Available: http://arxiv.org/abs/1704.08141
[59] F. Anselmi and T. Poggio, "Representation learning in sensory cortex: A theory," in Proc. Center Brains, Minds Mach., 2014.
[60] Q. Liao, J. Z. Leibo, and T. Poggio, "Learning invariant representations and applications to face verification," in Proc. Adv. Neural Inf. Process. Syst., Lake Tahoe, NV, 2013.
[61] C. Zhang, G. Evangelopoulos, S. Voinea, L. Rosasco, and T. Poggio, "A deep representation for invariance and music classification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2014, pp. 6984–6988.
[62] K. Lenc and A. Vedaldi, "Understanding image representations by measuring their equivariance and equivalence," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 991–995.
[63] J. R. R. Uijlings, A. W. M. Smeulders, and R. J. H. Scha, "Real-time visual concept classification," IEEE Trans. Multimedia, vol. 12, no. 7, pp. 665–681, Nov. 2010.
[64] M. Cimpoi, S. Maji, and A. Vedaldi, "Deep filter banks for texture recognition and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 3828–3836.
[65] H. Jegou, M. Douze, and C. Schmid, "Hamming embedding and weak geometric consistency for large scale image search," in Proc. Eur. Conf. Comput. Vis., 2008.
[66] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2007, pp. 1–8.
[67] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., Jun. 2006, vol. 2, pp. 2161–2168.
Jie Lin received the B.S. and Ph.D. degrees from the School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China, in 2006 and 2014, respectively.
He is currently a Research Scientist with the Institute for Infocomm Research, A*STAR, Singapore. He was previously a visiting student in the Rapid-Rich Object Search Laboratory, Nanyang Technological University, Singapore, and the Institute of Digital Media, Peking University, Beijing, China, from 2011 to 2014. His research interests include deep learning, feature coding, and large-scale image/video retrieval. His work on image feature coding has been recognized as a core contribution to the MPEG-7 Compact Descriptors for Visual Search (CDVS) standard.
Ling-Yu Duan (M'09) received the M.Sc. degree in automation from the University of Science and Technology of China, Hefei, China, in 1999, the M.Sc. degree in computer science from the National University of Singapore (NUS), Singapore, in 2002, and the Ph.D. degree in information technology from The University of Newcastle, Callaghan, Australia, in 2008.
He is currently a Full Professor with the National Engineering Laboratory of Video Technology, School of Electronics Engineering and Computer Science, Peking University (PKU), Beijing, China. He has been the Associate Director of the Rapid-Rich Object Search Laboratory, a joint lab between Nanyang Technological University, Singapore, and PKU, since 2012. Before joining PKU, he was a Research Scientist with the Institute for Infocomm Research, Singapore, from Mar. 2003 to Aug. 2008. He has authored or coauthored more than 130 research papers in international journals and conferences. His research interests include multimedia indexing, search, and retrieval, mobile visual search, visual feature coding, and video analytics. Prior to 2010, his research mainly focused on multimedia (semantic) content analysis, especially in the domains of broadcast sports videos and TV commercial videos.
Prof. Duan was the recipient of the EURASIP Journal on Image and Video Processing Best Paper Award in 2015, and the Ministry of Education Technology Invention Award (First Prize) in 2016. He was a co-editor of the MPEG Compact Descriptors for Visual Search (CDVS) standard (ISO/IEC 15938-13), and is a Co-Chair of MPEG Compact Descriptors for Video Analytics (CDVA). His recent major achievements focus on the compact representation of visual features and high-performance image search. He made significant contributions to the completed MPEG-CDVS standard. The suite of CDVS technologies has been successfully deployed, impacting visual search products/services of leading Internet companies such as Tencent (WeChat) and Baidu (Image Search Engine).
Shiqi Wang received the B.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 2008, and the Ph.D. degree in computer application technology from Peking University, Beijing, China, in 2014.
From March 2014 to March 2016, he was a Postdoc Fellow with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada. From April 2016 to April 2017, he was with the Rapid-Rich Object Search Laboratory, Nanyang Technological University, Singapore, as a Research Fellow. He is currently an Assistant Professor in the Department of Computer Science, City University of Hong Kong, Hong Kong. He has proposed more than 30 technical proposals to ISO/MPEG, ITU-T, and AVS standards. His research interests include image/video compression, analysis, and quality assessment.
Yan Bai received the B.S. degree in software engineering from Dalian University of Technology, Liaoning, China, in 2015, and is currently working toward the M.S. degree at the School of Electrical Engineering and Computer Science, Peking University, Beijing, China.
Her research interests include large-scale video retrieval and fine-grained visual recognition.
Yihang Lou received the B.S. degree in software engineering from Dalian University of Technology, Liaoning, China, in 2015, and is currently working toward the M.S. degree at the School of Electrical Engineering and Computer Science, Peking University, Beijing, China.
His current research interests include large-scale video retrieval and object detection.
Vijay Chandrasekhar received the B.S. and M.S. degrees from Carnegie Mellon University, Pittsburgh, PA, USA, in 2002 and 2005, respectively, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2013.
He has authored or coauthored more than 80 papers/MPEG contributions in a wide range of top-tier journals/conferences such as the International Journal of Computer Vision, ICCV, CVPR, the IEEE Signal Processing Magazine, ACM Multimedia, the IEEE TRANSACTIONS ON IMAGE PROCESSING, Designs, Codes and Cryptography, the International Society of Music Information Retrieval, and MPEG-CDVS, and has filed 7 U.S. patents (one granted, six pending). His research interests include mobile audio and visual search, large-scale image and video retrieval, machine learning, and data compression. His Ph.D. work on feature compression led to the MPEG-CDVS (Compact Descriptors for Visual Search) standard, to which he actively contributed from 2010 to 2013.
Dr. Chandrasekhar was the recipient of the A*STAR National Science Scholarship (NSS) in 2002.
Tiejun Huang received the B.S. and M.S. degrees in computer science from the Wuhan University of Technology, Wuhan, China, in 1992 and 1995, respectively, and the Ph.D. degree in pattern recognition and intelligent systems from the Huazhong (Central China) University of Science and Technology, Wuhan, China, in 1998.
He is a Professor and the Chair of the Department of Computer Science, School of Electronic Engineering and Computer Science, Peking University, Beijing, China. His research areas include video coding, image understanding, and neuromorphic computing.
Prof. Huang is a Member of the Board of the Chinese Institute of Electronics and of the Advisory Board of IEEE Computing Now. He was the recipient of the National Science Fund for Distinguished Young Scholars of China in 2014, and was awarded the title of Distinguished Professor of the Chang Jiang Scholars Program by the Ministry of Education in 2015.
Alex Kot (S'85–M'89–SM'98–F'06) has been with Nanyang Technological University, Singapore, since 1991. He headed the Division of Information Engineering, School of Electrical and Electronic Engineering, for eight years, and was an Associate Chair/Research and the Vice Dean (Research) of the School of Electrical and Electronic Engineering. He is currently a Professor with the College of Engineering and the Director of the Rapid-Rich Object Search Laboratory. He has authored or coauthored extensively in the areas of signal processing for communication, biometrics, data hiding, image forensics, and information security.
Prof. Kot is a Member of the IEEE Fellow Evaluation Committee and a Fellow of the Academy of Engineering, Singapore. He was the recipient of the Best Teacher of the Year Award and is a coauthor of several Best Paper Awards, including ICPR, IEEE WIFS, ICEC, and IWDW. He has served the IEEE Signal Processing Society in various capacities, such as the General Co-Chair of the 2004 IEEE International Conference on Image Processing, Chair of the worldwide SPS Chapter Chairs, and the Distinguished Lecturer Program. He is the Vice President of the IEEE Signal Processing Society. He was a Guest Editor for Special Issues of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and the EURASIP Journal on Advances in Signal Processing, and is an Editor of the EURASIP Journal on Advances in Signal Processing. He is an IEEE SPS Distinguished Lecturer. He was an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON SIGNAL PROCESSING, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE SIGNAL PROCESSING LETTERS, the IEEE SIGNAL PROCESSING MAGAZINE, the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I, and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II.
Wen Gao (S'87–M'88–SM'05–F'09) received the Ph.D. degree in electronics engineering from the University of Tokyo, Tokyo, Japan, in 1991.
He was a Professor of Computer Science with the Harbin Institute of Technology, Harbin, China, from 1991 to 1995, and a Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He is currently a Professor of computer science with the Institute of Digital Media, School of Electronic Engineering and Computer Science, Peking University, Beijing, China.
Prof. Gao has served on the editorial boards of several journals, such as the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, the EURASIP Journal of Image Communications, and the Journal of Visual Communication and Image Representation. He has chaired a number of prestigious international conferences on multimedia and video signal processing, such as IEEE ICME and ACM Multimedia, and has also served on the advisory and technical committees of numerous professional organizations.
videos representing the same object or scene as the one depicted 36
in a query video. Such capability can facilitate a variety of appli- 37
cations including mobile augmented reality (MAR), automotive, 38
surveillance, media entertainment, etc. [1]. In the rapid evolution 39
of video retrieval, both great promises and new challenges aris- 40
ing from real applications have been perceived [2]. Typically, the 41
video retrieval is performed at the server end, which requires 42
the transmission of visual data via wireless network [3], [4]. 43
Instead of directly sending huge volume of compressed video 44
data, developing compact and robust video feature representa- 45
tions is highly desirable, which fulfills low latency transmission 46
over bandwidth constrained network, e.g., thousands of bytes 47
per second. 48
To this end, in 2009, MPEG started the standardization of 49
Compact Descriptors for Visual Search (CDVS) [5], which 50
came up with a normative bitstream of standardized descriptors 51
for mobile visual search and augmented reality applications. 52
In Sep. 2015, MPEG published the CDVS standard [6]. Very 53
recently, towards large-scale video analysis, MPEG has moved 54
forward to standardize Compact Descriptors for Video Analy- 55
sis (CDVA) [7]. To deal with content redundancy in temporal 56
dimension, the latest CDVA Experimental Model (CXM) [8] 57
casts video retrieval into keyframe based image retrieval task, 58
in which the keyframe-level matching results are combined for 59
video matching. The keyframe-level representation avoids de- 60
scriptor extraction on dense frames in videos, which largely re- 61
duces the computational complexity (e.g., CDVA only extracts 62
descriptors of 1∼2% frames detected from raw videos). 63
In CDVS, handcrafted local and global descriptors have been 64
successfully standardized in a compact and scalable manner 65
(e.g., from 512 B to 16 KB), where local descriptors capture the 66
invariant characteristics of local image patches and global de- 67
scriptors like Fisher Vectors (FV) [9] and VLAD [10] reflect the 68
aggregated statistics of local descriptors. Though handcrafted 69
descriptors have achieved great success in CDVS standard [5] 70
and CDVA experimental model, how to leverage promising 71
deep learned global descriptors remains an open issue in the 72
MPEG CDVA Ad-hoc group. Many recent works [11]–[18] 73
have shown the advantages of deep global descriptors for image 74
retrieval, which may be attributed to the remarkable success of 75
Convolutional Neural Networks (CNN) [19], [20]. In particular, 76
state-of-the-art deep global descriptors R-MAC [18] computes 77
the max over a set of Region-of-Interest (ROI) cropped from 78
feature maps output by intermediate convolutional layer, 79
followed by the average of these regional max-pooled features. 80
Results show that R-MAC offers remarkable improvements over 81
1520-9210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications standards/publications/rights/index.html for more information.
IEEE P
roof
2 IEEE TRANSACTIONS ON MULTIMEDIA
other deep global descriptors like MAC [18] and SPoC [16],82
while maintaining the same dimensionality.83
In the context of compact descriptors for video retrieval, there84
exist important practical issues with CNN based deep global de-85
scriptors. First, one main drawback of CNN is the lack of invari-86
ance to geometric transformations of the input image such as87
rotations. The performance of deep global descriptors quickly88
degrades when the objects in query and database videos are89
rotated differently. Second, different from CNN features, hand-90
crafted descriptors are robust to scale and rotation changes in91
2D plane, because of the local invariant feature detectors. As92
such, more insights should be provided on whether there is great93
complementarity between CNN and conventional handcrafted94
descriptors for better performance.95
To tackle the above issues, we make the following contribu-96
tions in this work:97
1) We propose a Nested Invariance Pooling (NIP) method98
to produce compact global descriptors from CNN by pro-99
gressively encoding translation, scale and rotation invari-100
ances. NIP is inspired from a recent invariance theory,101
which provides a practical and mathematically proven102
way for computing invariant representations with feedfor-103
ward neural networks. In this respect, NIP is extensible104
to other types of transformations. Both quantitative and105
qualitative evaluations are introduced for a deeper look at106
the invariance properties.107
2) We further improve the discriminability of deep global108
descriptors by designing Hybrid pooling moments within109
NIP (HNIP). Evaluations on video retrieval show that110
HNIP outperforms state-of-the-art deep and handcrafted111
descriptors by a large margin with comparable or smaller112
descriptor size.113
3) We analyze the complementary nature of deep features114
and handcrafted descriptors over diverse datasets (land-115
marks, scenes and common objects). A simple combina-116
tion strategy is introduced to fuse the strengths of both117
deep and handcrafted global descriptors. We show that118
the combined descriptors offer the optimal video match-119
ing and retrieval performance, without incurring extra cost120
on descriptor size compared to CDVA.121
4) Due to the superior performance, HNIP has been adopted122
by CDVA Ad-hoc group as technical reference to setup123
new core experiments [21], which opens up future ex-124
ploration of CNN techniques in the development of stan-125
dardized video descriptors. The latest core experiments126
involve compact deep feature representation, CNN model127
compression, etc.128
II. RELATED WORK129
Handcrafted descriptors: Handcrafted local descriptors [22],130
[23], such as SIFT based on Difference of Gaussians (DoG)131
detector [22], have been successfully and widely employed132
to conduct image matching and localization tasks due to133
their robustness in scale and rotation changes. Building on134
local image descriptors, global image representations aim to135
provide statistical summaries of high level image properties136
and facilitate fast large-scale image search. In particular, for137
global image descriptors, the most prominent ones include 138
Bag-of-Words (BoW) [24], [25], Fisher Vector (FV) [9], 139
VLAD [10], Triangulation Embedding [26] and Robust Visual 140
Descriptor with Whitening (RVDW) [27], with which fast 141
comparisons against a large scale database become practical. 142
Given the fact that raw descriptors such as SIFT and FV may 143
consume extraordinarily large number of bits for transmission 144
and storage, many compact descriptors were developed. For ex- 145
ample, numerous strategies have been proposed to compress 146
SIFT using hashing [28], [29], transform coding [30], lattice 147
coding [31] and vector quantization [32]. On the other hand, bi- 148
nary descriptors including BRIEF [33], ORB [34], BRISK [35] 149
and Ultrashort Binary Descriptor (USB) [36] were proposed, 150
which support fast Hamming distance matching. For compact 151
global descriptors, efforts have also been made to reduce their 152
descriptor size, such as tree-structure quantizer [37] for BoW 153
histogram, locality sensitive hashing [38], dimensionality re- 154
duction and vector quantization for FV [9], and simple sign 155
binarization for VLAD like descriptors [9], [39]. 156
Deep descriptors: Deep learned descriptors have been ex- 157
tensively applied to image retrieval [11]–[18]. First initial 158
study [11], [13] proposed to use representations directly ex- 159
tracted from fully connected layer of CNN. More compact 160
global descriptors [14]–[16] can be extracted by performing 161
either global max or average pooling (e.g., SPoC in [16]) over 162
feature maps output by intermediate layers. Further improve- 163
ments are obtained by spatial or channel-wise weighting of 164
pooled descriptors [17]. Very recently, inspired by the R-CNN 165
approach [40] used for object detection, Tolias et al. [18] pro- 166
posed ROI based pooling on deep convolutional features, Re- 167
gional Maximum Activation of Convolutions (R-MAC), which 168
significantly improves global pooling approaches. Though 169
R-MAC is scale invariant to some extent, it suffers from the 170
lack of rotation invariance. These regional deep features can be 171
also aggregated into global descriptors by VLAD [12]. 172
In a number of recent works [13], [41]–[43], pre-trained 173
CNNs for image classification are repurposed for the image 174
retrieval task, by fine-tuning them with specific loss functions 175
(e.g., Siamese or triplet networks) over carefully constructed 176
matching and non-matching training image sets. There is consid- 177
erable performance improvement when training and test datasets 178
in similar domains (e.g., buildings). In this work, we aim to 179
explicitly construct invariant deep global descriptors from the 180
perspective of better leveraging the state-of-the-arts or classical 181
CNN architectures, rather than further optimizing the learning 182
of invariant deep descriptors. 183
Video descriptors: Video is typically composed of a num- 184
ber of moving frames. Therefore, a straightforward method for 185
video descriptor representation is extracting feature descriptors 186
at frame level then reducing the temporal redundancies of these 187
descriptors for compact representation. For local descriptors, 188
Baroffio et al. [44] proposed both intra- and inter-feature cod- 189
ing methods of SIFT in the context of visual sensor network, 190
and an additional mode decision scheme based on rate-distortion 191
optimization was designed to further improve the feature coding 192
efficiency. In [45], [46], studies have been conducted to encode 193
the binary features such as BRISK [35]. Makar et al. [47] pre- 194
sented a temporally coherent keypoint detector to allow efficient 195
IEEE P
roof
LIN et al.: HNIP: COMPACT DEEP INVARIANT REPRESENTATIONS FOR VIDEO MATCHING, LOCALIZATION, AND RETRIEVAL 3
interframe coding of canonical patches, corresponding feature196
descriptors, and locations towards mobile augmented reality197
application. Chao et al. [48] developed a key-points encoding198
technique, where locations, scales and orientations extracted199
from original videos are encoded and transmitted along with200
compressed video to the server. Recently, the temporal depen-201
dencies of global descriptors have also been exploited. For BoW202
extracted from video sequence, Baroffio et al. [49] proposed an203
intra-frame coding method with uniform scalar quantization,204
as well as an inter-frame technique with arithmetic coding the205
quantized symbols. Chen et al. [50], [51] proposed an encoding206
scheme for scalable residual based global signatures given the207
fact that REVVs [39] of adjacent frames share most codewords208
and residual vectors.209
Besides the frame-level approaches, aggregations of local de-210
scriptors over video slots and global descriptors over scenes211
have also been intensively explored [52]–[55]. In [54], tempo-212
ral aggregation strategies for large scale video retrieval were213
experimentally studied and evaluated with the CDVS global214
descriptors [56]. Four aggregation modes, including local fea-215
ture, global signature, tracking-based and independent frame216
based aggregation schemes were investigated. For video-level217
CNN representation, in [57], the authors applied FV and VLAD218
aggregation techniques over dense local features of CNN acti-219
vation maps for video event detection.220
III. MPEG CDVA221
MPEG CDVA [7] aims to standardize the bitstream of com-222
pact video descriptors for large-scale video analysis. The CDVA223
standard incurs two main technical requirements of the dedi-224
cated descriptors: compactness and robustness. On the one hand,225
compact representation is an efficient way to economize the226
transmission bandwidth, storage space and computational cost.227
On the other hand, robust representation in the scenario of ge-228
ometric transformations such as rotation and scale variations is229
particularly required. To this end, in the 115th MPEG meeting,230
the CXM [8] was released, which mainly relies on MPEG CDVS231
reference software TM14.2 [6] for keyframe-level compact and232
robust handcrafted descriptor representation based on scale and233
rotation invariant local features.234
A. CDVS-Based Handcrafted Descriptors235
The MPEG CDVS [5] standardized descriptors serve as the236
fundamental infrastructure to represent video keyframes. The237
normative blocks of CDVS standard are illustrated in Fig. 1(b),238
mainly involving extraction of compressed local and global239
descriptors. First, scale and rotation invariant interest key points240
are detected from image, and a subset of reliable key points241
are retained, followed by the computation of handcrafted local242
SIFT features. The compressed local descriptors are formed243
by applying a low-complexity transform coding on local SIFT244
features. The compact global descriptors are Fisher vectors ag-245
gregated from the selected local features, followed by scalable246
descriptor compression with simple sign binarization. Basically,247
pairwise image matching is accomplished by first comparing248
compact global descriptors, then further performing geometric249
consistency checking (GCC) with compressed local descriptors.250
CDVS handcrafted descriptors are with very low memory foot- 251
print, while preserving competitive matching and retrieval accu- 252
racy. The standard supports operating points ranging from 512 B 253
to 16 kB specified for different bandwidth constraints. Overall, 254
the 4 kB operating point achieves a good tradeoff between ac- 255
curacy and complexity (e.g., transmission bitrate, search time). 256
Thus, CDVA CXM adopts the 4 kB descriptors for keyframe- 257
level representation, in which compressed local descriptors and 258
compact global descriptors are both ∼2 kB per keyframe. 259
B. CDVA Evaluation Framework 260
Fig. 1(a) shows the evaluation framework of CDVA, includ- 261
ing keyframe detection, CDVS descriptors extraction, trans- 262
mission, and video retrieval and matching. At the client side, 263
color histogram comparison is applied to identify keyframes 264
from video clips. The standardized CDVS descriptors are ex- 265
tracted from these keyframes, which can be further packed to 266
form CDVA descriptors [58]. Keyframe detection has largely 267
reduced the temporal redundancy in videos, resulting in low bi- 268
trate query descriptor transmission. At the server side, the same 269
keyframe detection and CDVS descriptors extraction are ap- 270
plied to database videos. Formally, we denote query video X = 271
{x1 , ...,xNx} and reference video Y = {y1 , ...,yNy
}, where x 272
and y denote keyframes. Nx and Ny are the number of detected 273
keyframes in query and reference videos, respectively. The start 274
and end timestamps for keyframes are recorded, e.g., [T sxn
, T exn
] 275
for query keyframe xn . Here, we briefly describe the pipeline of 276
pairwise video matching, localization and video retrieval with 277
CDVA descriptors. 278
Pairwise video matching and localization: Pairwise video 279
matching is performed by comparing the CDVA descriptors of 280
video keyframe pair (X,Y). Each keyframe in X is compared 281
with all keyframes in Y. The video-level similarity K(X, Y) is 282
defined as the largest matching score among all keyframe-level 283
similarities. For example, if we consider video matching with 284
CDVS global descriptors only 285
K(X, Y) = maxx∈X,y∈Y
k(f(x), f(y)) (1)
where k(·, ·) denotes a matching function (e.g., cosine similar- 286
ity). f(x) denotes CDVS global descriptors for keyframe x.1 287
Following the matching pipeline in CDVS, if k(·, ·) exceeds a 288
pre-defined threshold, GCC with CDVS local descriptors is sub- 289
sequently applied for verifying true positive keyframe matches. 290
Then the keyframe-level similarity is finally determined as the 291
multiplication of matching scores from both CDVS global and 292
local descriptors. Correspondingly, K(X, Y) in (1) is refined as 293
the maximum of their combined similarities. 294
The matched keyframe timestamps between query and refer- 295
ence videos are recorded for evaluating the temporal localization 296
task, i.e., locating the video segment containing item of inter- 297
est. In particular, if the multiplication of CDVS global and local 298
matching scores exceeds a predefined threshold, the correspond- 299
ing keyframe timestamps are recorded. Assuming there are τ 300
(τ ≤ Nx ) keyframes satisfying such criterion in a query video, 301
the matching video segment is defined as T ′start = min
{T s
xn
}302
1We use the same notation for deep global descriptors in the following section.
IEEE P
roof
4 IEEE TRANSACTIONS ON MULTIMEDIA
Fig. 1. (a) Illustration of MPEG CDVA evaluation framework. (b) Descriptor extraction pipeline for MPEG CDVS. (c) Temporal localization of item of interestbetween video pair.
and T ′end = max
{T e
xn
}, where 1 ≤ n ≤ τ . As such, we can ob-303
tain the predicted matching video segment by descriptor match-304
ing, as shown in Fig. 1(c).305
Video retrieval: Video retrieval differs from pairwise video matching in that the former is one-to-many matching, while the latter is one-to-one matching. Thus, video retrieval shares a similar matching pipeline with pairwise video matching, except for the following differences: 1) For each query keyframe, the top K_g candidate keyframes are retrieved from the database by comparing CDVS global descriptors. Subsequently, GCC reranking with CDVS local descriptors is performed between the query keyframe and each candidate, and the top K_l database keyframes are recorded. The default choices for K_g and K_l are 500 and 100, respectively. 2) For each query video, all returned database keyframes are merged into candidate database videos according to their video indices. Then, the video-level similarity between the query and each candidate database video is obtained following the same principle as pairwise video matching. Finally, the top ranked candidate database videos are returned.
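The sketch below illustrates this two-stage retrieval flow under simplifying assumptions: `gcc_score` is a placeholder for the local-descriptor (GCC) verification score, keyframe descriptors are unit vectors, and candidate videos are scored by their best combined keyframe match, mirroring the pairwise-matching principle above. It is not the reference implementation.

import numpy as np
from collections import defaultdict

def retrieve_videos(query_kf_descs, db_kf_descs, db_kf_video_ids,
                    gcc_score, Kg=500, Kl=100):
    # Stage 1: top-Kg shortlist per query keyframe by global descriptors.
    # Stage 2: GCC reranking keeps the top Kl keyframes.
    # Merge: keyframe hits are grouped by video index; a video keeps its best score.
    video_scores = defaultdict(float)
    for qi, q in enumerate(query_kf_descs):
        global_sims = db_kf_descs @ q                 # cosine similarity
        shortlist = np.argsort(-global_sims)[:Kg]
        reranked = sorted(shortlist,
                          key=lambda di: global_sims[di] * gcc_score(qi, di),
                          reverse=True)[:Kl]
        for di in reranked:
            vid = db_kf_video_ids[di]
            score = global_sims[di] * gcc_score(qi, di)
            video_scores[vid] = max(video_scores[vid], score)
    return sorted(video_scores.items(), key=lambda kv: -kv[1])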
IV. METHOD

A. Hybrid Nested Invariance Pooling

Fig. 2(a) shows the extraction pipeline of our compact deep invariant global descriptors with HNIP. Given a video keyframe x, we rotate it R times (with step size θ°). By forwarding each rotated image through a pre-trained deep CNN, the convolutional feature maps output by an intermediate layer (e.g., a convolutional layer) are represented by a cube W × H × C, where W and H denote the width and height of each feature map, respectively, and C is the number of channels in the feature maps. Subsequently, we extract a set of ROIs from the cube using an overlapping sliding window, with window size W' ≤ W and H' ≤ H. The window size is adjusted to incorporate ROIs with different scales (e.g., 5 × 5, 10 × 10). Here, we denote the number of scales as S. Finally, a 5-D data structure γ_x(G_t, G_s, G_r, C) ∈ R^{W'×H'×S×R×C} is derived, which encodes translations G_t (i.e., spatial locations W' × H'), scales G_s, and rotations G_r of the input keyframe x.
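As a rough illustration, the sketch below collects one response per (translation, scale, rotation, channel) cell from the feature maps of pre-rotated keyframes. The function name `cnn_feature_maps`, the window sizes, the stride, and in particular the reduction of each ROI to a single value per channel (a max over the ROI) are assumptions introduced here for concreteness; the paper specifies the data structure but not every extraction detail.

import numpy as np

def build_nip_tensor(rotated_images, cnn_feature_maps,
                     window_sizes=((5, 5), (10, 10)), stride=2):
    # rotated_images: the R rotated versions of one keyframe (rotation group G_r).
    # cnn_feature_maps(img) -> (W, H, C) activations of an intermediate layer.
    # Returns a nested list indexed [rotation][scale] -> (n_translations, C),
    # since different window sizes yield different numbers of positions.
    per_rotation = []
    for img in rotated_images:                       # rotation group G_r
        fmap = cnn_feature_maps(img)                 # (W, H, C)
        W, H, _ = fmap.shape
        per_scale = []
        for (w, h) in window_sizes:                  # scale group G_s
            rois = []
            for i in range(0, W - w + 1, stride):    # translation group G_t
                for j in range(0, H - h + 1, stride):
                    # assumed ROI reduction: max over the window, per channel
                    rois.append(fmap[i:i + w, j:j + h, :].max(axis=(0, 1)))
            per_scale.append(np.stack(rois))
        per_rotation.append(per_scale)
    return per_rotation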
HNIP aims to aggregate the 5-D data into a compact deep invariant global descriptor. In particular, it first performs pooling over translations (W' × H'), then scales (S), and finally rotations (R) in a nested way, resulting in a C-dimensional global descriptor. Formally, for the c-th feature map, n_t-norm pooling over translations G_t is given by

\gamma_x(G_s, G_r, c) = \left( \frac{1}{W' \times H'} \sum_{j=1}^{W' \times H'} \gamma_x(j, G_s, G_r, c)^{n_t} \right)^{1/n_t}    (2)

where the pooling operation has a parameter n_t defining the statistical moment, e.g., n_t = 1 is first order (i.e., average pooling), n_t → +∞ on the other extreme is infinite order (i.e., max pooling), and n_t = 2 is second order (i.e., square-root pooling). Equation (2) leads to a 3-D data structure \gamma_x(G_s, G_r, C) \in \mathbb{R}^{S \times R \times C}.
Analogously, n_s-norm pooling over scale transformations G_s and the subsequent n_r-norm pooling over rotation transformations G_r are

\gamma_x(G_r, c) = \left( \frac{1}{S} \sum_{j=1}^{S} \gamma_x(j, G_r, c)^{n_s} \right)^{1/n_s},    (3)

\gamma_x(c) = \left( \frac{1}{R} \sum_{j=1}^{R} \gamma_x(j, c)^{n_r} \right)^{1/n_r}.    (4)
Fig. 2. (a) Nested invariance pooling (NIP) on feature maps extracted from an intermediate layer of a CNN architecture. (b) A single convolution-pooling operation from a CNN, schematized for a single input layer and single output neuron. The parallel with the invariance theory shows that the universal building block of CNNs is compatible with the incorporation of invariance to local translations of the input.
The corresponding global descriptor is obtained by concatenating \gamma_x(c) for all feature maps

f(x) = \{\gamma_x(c)\}_{0 \le c < C}.    (5)

As such, the keyframe matching function in (1) is defined as

k(f(x), f(y)) = \beta(x)\,\beta(y) \sum_{c=1}^{C} \langle \gamma_x(c), \gamma_y(c) \rangle    (6)

where \beta(\cdot) is a normalization term computed by \beta(x) = \left( \sum_{c=1}^{C} \langle \gamma_x(c), \gamma_x(c) \rangle \right)^{-1/2}. Equation (6) refers to cosine similarity, accumulating the scalar products of the normalized pooled features over all feature maps. HNIP descriptors can be further improved by post-processing techniques such as PCA whitening [16], [18]. In this work, the global descriptor is first L2-normalized, followed by PCA projection and whitening with a pre-trained PCA matrix. The whitened vectors are L2-normalized and compared with (6).
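A minimal NumPy sketch of the nested pooling in (2)-(4) and the optional whitening step is given below. It consumes the ragged structure produced by the earlier sketch (one array per rotation and scale); the default moments follow the hybrid choice reported later (Squ-Avg-Max). The PCA parameter names are assumptions; any pre-trained mean/projection/eigenvalue triple would do.

import numpy as np

def norm_pool(x, n, axis):
    # Generalized n-norm pooling as in (2)-(4): average (n=1),
    # square-root / second order (n=2), or max (n = inf).
    if np.isinf(n):
        return x.max(axis=axis)
    return (np.mean(np.abs(x) ** n, axis=axis)) ** (1.0 / n)

def hnip_descriptor(gamma, n_t=2, n_s=1, n_r=np.inf):
    # gamma: nested list [rotation][scale] -> (n_translations, C).
    # Pool translations first, then scales, then rotations (nested order),
    # yielding a C-dimensional, L2-normalized global descriptor.
    per_rotation = []
    for scales in gamma:
        pooled_scales = np.stack([norm_pool(t, n_t, axis=0) for t in scales])  # (S, C)
        per_rotation.append(norm_pool(pooled_scales, n_s, axis=0))             # (C,)
    f = norm_pool(np.stack(per_rotation), n_r, axis=0)                         # (C,)
    return f / np.linalg.norm(f)

def whiten(f, pca_mean, pca_eigvecs, pca_eigvals):
    # Optional PCA whitening with a pre-trained projection, followed by
    # re-normalization; whitened descriptors are then compared by a dot product as in (6).
    z = (f - pca_mean) @ pca_eigvecs / np.sqrt(pca_eigvals + 1e-8)
    return z / np.linalg.norm(z)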
Subsequently, we investigate HNIP in more detail. In Section IV-B, inspired by a recent invariance theory [59], HNIP is proven to be approximately invariant to translation, scale, and rotation transformations, independently of the statistical moments chosen in the nested pooling stages. In Section IV-C, both quantitative and qualitative evaluations are presented to illustrate these invariance properties. Moreover, we observe that the statistical moments in HNIP drastically affect video matching performance. Our empirical results show that the optimal nested pooling moments are n_t = 2 (second order), n_s = 1 (first order), and n_r → +∞ (infinite order).

B. Theory on Transformation Invariance
Invariance theory in a nutshell: Recently, Anselmi and Poggio [59] proposed an invariance theory exploring how signal (e.g., image) representations can be made invariant to various transformations. Denoting f(x) as the representation of image x, f(x) is invariant to a transformation g (e.g., rotation) if f(x) = f(g · x) holds for all g ∈ G, where the orbit of x under a transformation group G is defined as O_x = {g · x | g ∈ G}. It can be easily shown that O_x is globally invariant to the action of any element of G, and thus any descriptor computed directly from O_x will be globally invariant to G.

More specifically, the invariant descriptor f(x) can be derived in two stages. First, given a predefined template t (e.g., a convolutional filter in a CNN), we compute the dot products of t over the orbit: D_{x,t} = {⟨g · x, t⟩ ∈ R | g ∈ G}. Second, the extracted invariant descriptor is a histogram representation of the distribution D_{x,t} with a specific bin configuration, for example, the statistical moments (e.g., mean, max, standard deviation, etc.) derived from D_{x,t}. Such a representation is mathematically proven to have the proper invariance property for transformations such as translations (G_t), scales (G_s), and rotations (G_r). One may note that the transformation g can be applied either to the image or to the template indifferently, i.e., {⟨g · x, t⟩ = ⟨x, g · t⟩ | g ∈ G}. Recent work on face verification [60] and music classification [61] successfully applied this theory to practical applications. More details about the invariance theory can be found in [59].
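A toy numerical illustration (not the paper's method) of the orbit idea: for a 1-D signal under cyclic shifts, statistics of the dot products between a fixed template and the signal's orbit are identical for the original signal and any shifted copy, because both share the same orbit.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)            # a 1-D "signal"
t = rng.normal(size=8)            # a fixed template
g_x = np.roll(x, 3)               # a transformed version of x (cyclic shift)

def orbit_stats(signal, template):
    # Dot products of the template with the full orbit of the signal under
    # cyclic shifts, summarized by a few statistical moments.
    dots = np.array([np.dot(np.roll(signal, k), template)
                     for k in range(len(signal))])
    return dots.mean(), dots.max(), dots.std()

# The summaries agree for x and g.x: the orbit (as a set) is shift-invariant.
assert np.allclose(orbit_stats(x, t), orbit_stats(g_x, t))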
An example: translation invariance of CNN: The convolution-pooling operations in CNNs are compliant with the invariance theory. Existing well-known CNN architectures such as AlexNet [19] and VGG16 [20] share a common building block: a succession of convolution and pooling operations, which in fact provides a way to incorporate local translation invariance. As shown in Fig. 2(b), the convolution operation on translated image patches (i.e., sliding windows) is equivalent to ⟨g · x, t⟩, and the max pooling operation is in line with the statistical histogram computation over the distribution of the dot products (i.e., feature maps). For instance, consider a convolutional filter that has learned a "cat face" pattern; the filter will respond to an image depicting a cat face no matter where the face is located in the image. Subsequently, max pooling over the activation feature maps captures the most salient feature of the cat face, which is naturally invariant to object translation.
Incorporating scale and rotation invariances into CNN: Based on the already locally translation invariant feature maps (e.g., the last pooling layer, pool5), we propose to further improve the invariance of pool5 CNN descriptors by incorporating global invariance to several transformation groups. The specific transformation groups considered in this study are scales G_s and rotations G_r. As one can see, it is impractical to generate all transformations g · x for the orbit O_x. In addition, the computational complexity of the feedforward pass in a CNN increases linearly with the number of transformed versions of the input x. For practical consideration, we simplify the orbit to a finite set of transformations (e.g., # of rotations R = 4, # of scales S = 3). This results in HNIP descriptors that are approximately invariant to transformations, without a huge increase in feature extraction time.

Fig. 3. Comparison of pooled descriptors invariant to (a) rotation and (b) scale changes of query images, measured by retrieval accuracy (mAP) on the Holidays dataset. The fc6 layer of VGG16 [20] pretrained on the ImageNet dataset is used.

An interesting aspect of the invariance theory is the possibility, in practice, to chain multiple types of group invariances one after the other, as already demonstrated in [61]. In this study, we construct descriptors invariant to several transformation groups by progressively applying the method to different transformation groups, as shown in (2)-(4).
Discussions: While there is a theoretical guarantee for the scale and rotation invariance of handcrafted local feature detectors such as DoG, classical CNN architectures lack invariance to these geometric transformations [62]. Many works have proposed to encode transformation invariances into both handcrafted (e.g., BoW built on densely sampled SIFT [63]) and CNN representations [64], by explicitly augmenting input images with rotation and scale transformations. Our HNIP takes a similar idea of image augmentation, but has several significant differences. First, rather than a single pooling (max or average) layer over all transformations, HNIP progressively pools features together across different transformations with different moments, which is essential for significantly improving the quality of the pooled CNN descriptors. Second, unlike previous empirical studies, we have attempted to mathematically show that the design of nested pooling ensures that HNIP is approximately invariant to multiple transformations, inspired by the recently developed invariance theory. Third, to the best of our knowledge, this work is the first to comprehensively analyze the invariance properties of CNN descriptors in the context of large scale video matching and retrieval.
C. Quantitative and Qualitative Evaluations

Transformation invariance: In this section, we propose a database-side data augmentation strategy for image retrieval to study the rotation and scale invariance properties, respectively. For simplicity, we represent an image as a 4096-dimensional descriptor extracted from the first fully connected layer (fc6) of VGG16 [20] pre-trained on the ImageNet dataset. We report retrieval results in terms of mean Average Precision (mAP) on the INRIA Holidays dataset [65] (500 query images, 991 reference images).

Fig. 4. Distances for three matching pairs from the MPEG CDVA dataset (see Section VI-A for more details). For each pair, three pairwise distances (L2 normalized) are computed by progressively encoding translations (G_t), scales (G_t + G_s), and rotations (G_t + G_s + G_r) into the nested pooling stages. Average pooling is used for all transformations. Feature maps are extracted from the last pooling layer of pretrained VGG16.

Fig. 3(a) investigates the invariance property with respect to query rotations. First, we observe that the retrieval performance drops significantly as we rotate the query images while fixing the reference images (the red curve). To gain invariance to query rotations, we rotate each reference image within a range of −180° to 180°, with a step of 10°. The fc6 features of its 36 rotated images are pooled together into one common global descriptor representation, with either max or average pooling. We observe that the performance is relatively consistent (blue for max pooling, green for average pooling) against the rotation of query images. Moreover, performance under variations of the query image scale is plotted in Fig. 3(b). It is observed that the database-side augmentation by max or average pooling over scale changes (scale ratios of 0.75, 0.5, 0.375, 0.25, 0.2, and 0.125) can improve the performance when the query scale is small (e.g., 0.125).
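The database-side augmentation above can be sketched as follows. The extractor `extract_fc6` and the use of PIL rotation are assumptions for illustration; any fc6 feature extractor and rotation utility would serve.

import numpy as np
from PIL import Image

def augmented_reference_descriptor(img_path, extract_fc6,
                                   angles=range(-180, 180, 10), op="avg"):
    # Pool fc6 descriptors of 36 rotated copies of a reference image into one
    # rotation-robust descriptor (database-side augmentation).
    # extract_fc6 is an assumed extractor returning a 4096-D vector for a PIL image.
    img = Image.open(img_path)
    feats = np.stack([extract_fc6(img.rotate(a)) for a in angles])
    pooled = feats.max(axis=0) if op == "max" else feats.mean(axis=0)
    return pooled / np.linalg.norm(pooled)   # cosine-comparable descriptor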
Nesting multiple transformations: We further analyze nested pooling over multiple transformations. Fig. 4 provides insight into how progressively adding different types of transformations affects the matching distance for different image matching pairs. We can see the reduction in matching distance with the incorporation of each new transformation group. Fig. 5 takes a closer look at pairwise similarity maps between the local deep features of query keyframes and the global deep descriptors of reference keyframes, which explicitly reflect the regions of the query keyframe significantly contributing to the similarity measurement. We compare our HNIP (the third heatmap in each row) to the state-of-the-art deep descriptors MAC [18] and R-MAC [18]. Because of the introduction of scale and rotation transformations, HNIP is able to locate the query object of interest responsible for the similarity measures more precisely than MAC and R-MAC, even though there are scale and rotation changes between the query-reference pairs. Moreover, as shown in Table I, quantitative results on video matching by HNIP with progressively encoded multiple transformations provide additional positive evidence for the nested invariance property.

Fig. 5. Example similarity maps between local deep features of query keyframes and the global deep descriptors of reference keyframes, using off-the-shelf VGG16. For the query (left image) and reference (right image) keyframe pair in each row, the middle three similarity maps are from MAC, R-MAC, and HNIP (from left to right), respectively. Each similarity map is generated by the cosine similarity between the query local features at each feature map location and the pooled global descriptor of the reference keyframe (i.e., MAC, R-MAC, or HNIP), which allows locating the regions of the query keyframe contributing most to the pairwise similarity.

TABLE I
PAIRWISE VIDEO MATCHING BETWEEN MATCHING AND NON-MATCHING VIDEO DATASETS, WITH POOLING CARDINALITY INCREASED BY PROGRESSIVELY ENCODING TRANSLATION, SCALE, AND ROTATION TRANSFORMATIONS, FOR DIFFERENT POOLING STRATEGIES, I.E., MAX-MAX-MAX, AVG-AVG-AVG, AND OUR HNIP (SQU-AVG-MAX)

G_t         G_t-G_s            G_t-G_s-G_r
Max  71.9   Max-Max  72.8      Max-Max-Max  73.9
Avg  76.9   Avg-Avg  79.2      Avg-Avg-Avg  82.2
Squ  81.6   Squ-Avg  82.7      Squ-Avg-Max  84.3

TABLE II
STATISTICS ON THE NUMBER OF RELEVANT DATABASE VIDEOS RETURNED IN THE TOP 100 LIST (I.E., RECALL@100) FOR ALL QUERY VIDEOS IN THE MPEG CDVA DATASET (SEE SECTION VI-A FOR MORE DETAILS)

              Landmarks   Scenes   Objects
HNIP \ CXM       8143       1477     1218
CXM \ HNIP       1052        105     1834

"A \ B" indicates relevant instances that are successfully retrieved by method A but missed in the list generated by method B. The last pooling layer of pretrained VGG16 is used for HNIP.
Pooling moments: In Fig. 3, it is interesting to note that any choice of pooling moment n in the pooling stage can produce invariant descriptors. However, the discriminability of NIP with varied pooling moments could be quite different. For video retrieval, we empirically observe that pooling with hybrid moments works well for NIP, e.g., starting with square-root pooling (n_t = 2) for translations and average pooling (n_s = 1) for scales, and ending with max pooling (n_r → +∞) for rotations. Here, we present an empirical analysis of how pooling moments affect pairwise video matching performance.
We construct matching and non-matching video sets from the MPEG CDVA dataset. Both sets contain 4690 video pairs. With an input video keyframe size of 640 × 480, feature maps of size 20 × 15 are extracted from the last pooling layer of VGG16 [20] pre-trained on the ImageNet dataset. For transformations, we consider nested pooling by progressively adding transformations: translations (G_t), scales (G_t-G_s), and rotations (G_t-G_s-G_r). For pooling moments, we evaluate Max-Max-Max, Avg-Avg-Avg, and our HNIP (i.e., Squ-Avg-Max). Finally, video similarity is computed using (1) with the pooled features.

Table I reports pairwise matching performance in terms of True Positive Rate with the False Positive Rate set to 1%, with transformations switching from G_t to G_t-G_s-G_r, for Max-Max-Max, Avg-Avg-Avg, and HNIP. As more transformations are nested in, the separability between the matching and non-matching video sets becomes larger, regardless of the pooling moments used. More importantly, HNIP performs the best compared to Max-Max-Max and Avg-Avg-Avg, while Max-Max-Max is the worst. For instance, HNIP outperforms Avg-Avg-Avg, i.e., 84.3% vs. 82.2%.
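For reference, TPR at a fixed FPR can be computed from the two score sets as in the short sketch below; thresholding at the (1 − FPR) quantile of the non-matching scores is one common convention, not necessarily the exact procedure used in the evaluation framework.

import numpy as np

def tpr_at_fpr(match_scores, nonmatch_scores, fpr=0.01):
    # Threshold chosen so that the given fraction of non-matching pairs
    # is (wrongly) accepted; report the fraction of matching pairs above it.
    thr = np.quantile(nonmatch_scores, 1.0 - fpr)
    return float(np.mean(np.asarray(match_scores) > thr))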
V. COMBINING DEEP AND HANDCRAFTED DESCRIPTORS

In this section, we analyze the strengths and weaknesses of deep features in the context of video retrieval and matching, compared to state-of-the-art handcrafted descriptors built upon local invariant features (SIFT). To this end, we compute statistics of HNIP and the CDVA handcrafted descriptors (CXM) by retrieving different types of video data. In particular, we focus on videos depicting landmarks, scenes, and common objects, collected by MPEG CDVA. Here we describe how the statistics are computed. First, for each query video, we retrieve the top 100 most similar database videos using HNIP and CXM, respectively. Second, for all queries of each type of video data, we accumulate the number of relevant database videos (1) retrieved by HNIP but missed by CXM (denoted as HNIP \ CXM), and (2) retrieved by CXM but missed by HNIP (CXM \ HNIP).
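The counting itself reduces to set differences over each query's top-100 lists, as in this small sketch (data structures are illustrative):

def complementarity_counts(relevant, top100_a, top100_b):
    # relevant[q]: set of relevant video ids for query q
    # top100_a[q], top100_b[q]: top-100 retrieved video ids for methods A and B
    a_not_b, b_not_a = 0, 0
    for q, rel in relevant.items():
        hits_a = rel & set(top100_a[q])
        hits_b = rel & set(top100_b[q])
        a_not_b += len(hits_a - hits_b)   # e.g., HNIP \ CXM
        b_not_a += len(hits_b - hits_a)   # e.g., CXM \ HNIP
    return a_not_b, b_not_a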
The statistics are presented in Table II. As one can see, compared to the handcrafted descriptors CXM, HNIP is able to identify many more relevant landmark and scene videos for which CXM fails. On the other hand, CXM recalls more videos depicting common objects than HNIP. Fig. 6 shows qualitative examples of keyframe pairs corresponding to HNIP \ CXM and CXM \ HNIP, respectively.

Fig. 6. Examples of keyframe pairs which (a) HNIP determines as matching but CXM as non-matching (HNIP \ CXM) and (b) CXM determines as matching but HNIP as non-matching (CXM \ HNIP).
Fig. 7 further visualizes intermediate keyframe matching results produced by handcrafted and deep features, respectively. Despite the viewpoint change of the landmark images in Fig. 7(a), the salient features fired in their activation maps are spatially consistent. Similar observations hold for the indoor scene images in Fig. 7(b). These observations are probably attributable to deep descriptors excelling at characterizing global salient features. On the other hand, handcrafted descriptors work on local patches detected at sparse interest points, which prefer richly textured blobs [Fig. 7(c)] rather than lower textured ones [Fig. 7(a) and (b)]. This may explain why there are more inlier matches found by GCC for the product images in Fig. 7(c). Finally, compared to the approximate scale and rotation invariances provided by HNIP analyzed in the previous section, handcrafted local features have built-in mechanisms to ensure nearly exact invariance to these transformations for rigid objects in the 2D plane; examples can be found in Fig. 6(b).
In summary, these observations reveal that deep learned features may not always outperform handcrafted features. There may exist complementary effects between CNN deep descriptors and handcrafted descriptors. Therefore, we propose to leverage the benefits of both deep and handcrafted descriptors. Considering that handcrafted descriptors are categorized into local and global ones, we investigate the combination of deep descriptors with either handcrafted local or global descriptors, respectively.

Combining HNIP with handcrafted local descriptors: For matching, if the HNIP matching score exceeds a threshold, we use handcrafted local descriptors for verification. For retrieval, the HNIP matching score is used to select a top-500 candidate list, and handcrafted local descriptors are then used for reranking.
Combining HNIP and handcrafted global descriptors: Instead of simply concatenating the HNIP deep descriptors and the handcrafted descriptors, for both matching and retrieval the similarity score is defined as the weighted sum of the matching scores of HNIP and the handcrafted global descriptors

k(x, y) = \alpha \cdot k_c(x, y) + (1 - \alpha) \cdot k_h(x, y)    (7)

where α is the weighting factor, and k_c and k_h represent the matching scores of HNIP and the handcrafted descriptors, respectively. In this work, α is empirically set to 0.75.
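Both combination strategies are simple to express in code. The weighted fusion follows (7) directly; the local-descriptor combination below assumes a CDVS-style score product after thresholded GCC verification, which is one plausible reading of the verification step described above rather than a specification from the text.

def combined_score(k_hnip, k_global, alpha=0.75):
    # Weighted fusion of HNIP and handcrafted-global matching scores as in (7).
    return alpha * k_hnip + (1.0 - alpha) * k_global

def match_with_local_verification(k_hnip, gcc_verify, threshold):
    # Gate on the HNIP score, then verify with handcrafted local descriptors;
    # gcc_verify() is a placeholder returning the GCC matching score.
    # Multiplying the two scores is an assumption mirroring the CDVS pipeline.
    if k_hnip <= threshold:
        return 0.0
    return k_hnip * gcc_verify()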
VI. EXPERIMENTAL RESULTS

A. Datasets and Evaluation Metrics

Datasets: The MPEG CDVA ad-hoc group collected a large-scale, diverse video dataset to evaluate the effectiveness of video descriptors for video matching, localization, and retrieval applications, under resource constraints including descriptor size, extraction time, and matching complexity. This CDVA dataset^2 is diversified to contain views of 1) stationary large objects, e.g., buildings, landmarks (most likely background objects, possibly partially occluded or a close-up), 2) generally smaller items (e.g., paintings, books, CD covers, products) which typically appear in front of background scenes, possibly occluded, and 3) scenes (e.g., interior scenes, natural scenes, multi-camera shots, etc.). The CDVA dataset also comprises planar or non-planar, rigid or partially rigid, textured or partially textured objects (scenes), which are captured from different viewpoints with different camera parameters and lighting conditions.

^2 The MPEG CDVA dataset and evaluation framework are available upon request at http://www.cldatlas.com/cdva/dataset.html. CDVA standard documents are available at http://mpeg.chiariglione.org/standards/exploration/compact-descriptors-video-analysis.

Fig. 7. Keyframe matching examples which illustrate the strengths and weaknesses of CNN-based deep descriptors and handcrafted descriptors. In (a) and (b), deep descriptors perform well but handcrafted ones fail, while (c) is the opposite.

TABLE III
STATISTICS ON THE MPEG CDVA BENCHMARK DATASETS
IoI: items of interest. q.v.: query videos. r.v.: reference videos.
Specifically, the MPEG CDVA dataset contains 9974 query and 5127 reference videos (denoted as All), depicting 796 items of interest, of which 489 are large landmarks (e.g., buildings), 71 are scenes (e.g., interior or natural scenes), and 236 are small common objects (e.g., paintings, books, products). The videos have durations from 1 second to over 1 minute. To evaluate video retrieval on different types of video data, we categorize the query videos into Landmarks (5224 queries), Scenes (915 queries), and Objects (3835 queries). Table III summarizes the numbers of items of interest and their instances for each category. Fig. 8 shows some example video clips from the three categories.

To evaluate the performance of large scale video retrieval, we combine the reference videos with a set of user-generated and broadcast videos as distractors, which consist of content unrelated to the items of interest. There are 14537 distractor videos with more than 1000 hours of data.

Moreover, to evaluate pairwise video matching and temporal localization, 4693 matching video pairs and 46911 non-matching video pairs are constructed from the query and reference videos. The temporal location of the item of interest within each video pair is annotated as the ground truth.
We also evaluate our method on image retrieval benchmark datasets. The INRIA Holidays dataset [65] is composed of 1491 high-resolution (e.g., 2048 × 1536) scene-centric images, 500 of which are queries. This dataset includes a large variety of outdoor scene/object types: natural, man-made, water, and fire effects. We evaluate the rotated version of Holidays [13], where all images have up-right orientation. Oxford5k [66] is a buildings dataset consisting of 5062 images, mainly of size 1024 × 768. There are 55 queries composed of 11 landmarks, each represented by 5 queries. We use the provided bounding boxes to crop the query images. The University of Kentucky Benchmark (UKBench) [67] consists of 10200 VGA-size images, organized into 2550 groups of common objects, each object represented by 4 images. All 10200 images serve as queries.
Evaluation metrics: Retrieval performance is evaluated by mean Average Precision (mAP) and precision at a given cut-off rank R for query videos (Precision@R), and we set R = 100 following the MPEG CDVA standard. Pairwise video matching performance is evaluated by the Receiver Operating Characteristic (ROC) curve. We also report pairwise matching results in terms of True Positive Rate (TPR), given a False Positive Rate (FPR) equal to 1%. In case a video pair is predicted as a match, the temporal location of the item of interest is further identified within the video pair. The localization accuracy is measured by the Jaccard index

\frac{|[T_{start}, T_{end}] \cap [T'_{start}, T'_{end}]|}{|[T_{start}, T_{end}] \cup [T'_{start}, T'_{end}]|}

where [T_start, T_end] denotes the ground truth and [T'_start, T'_end] denotes the predicted start and end frame timestamps.
Besides these accuracy measurements, we also measure the complexity of the algorithms, including descriptor size, transmission bit rate, extraction time, and search time. In particular, the transmission bit rate is measured by (# query keyframes) × (descriptor size in bytes) / (query duration in seconds).
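The two simple formulas above translate directly into code; the following sketch is a straightforward transcription.

def jaccard_index(gt, pred):
    # Temporal localization accuracy: overlap over union of the ground-truth
    # and predicted segments, each given as (T_start, T_end) in seconds.
    inter = max(0.0, min(gt[1], pred[1]) - max(gt[0], pred[0]))
    union = (gt[1] - gt[0]) + (pred[1] - pred[0]) - inter
    return inter / union if union > 0 else 0.0

def transmission_bitrate(num_query_keyframes, descriptor_bytes, duration_sec):
    # Descriptor transmission bit rate in bytes per second.
    return num_query_keyframes * descriptor_bytes / duration_sec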
B. Implementation Details

In this work, we build HNIP descriptors with two widely used CNN architectures: AlexNet [19] and VGG16 [20]. We test off-the-shelf networks pre-trained on the ImageNet ILSVRC classification dataset. In particular, we crop the networks to the last pooling layer (i.e., pool5). We resize all video keyframes to 640×480 and Holidays (Oxford5k) images to 1024×768 as inputs to the CNN for descriptor extraction. Finally, post-processing can be applied to the pooled descriptors such as HNIP and R-MAC [18]. Following standard practice, we choose PCA whitening in this work. We randomly sample 40K frames from the distractor videos for PCA learning. These experimental setups are applied to both HNIP and state-of-the-art deep pooled descriptors, i.e., MAC [18], SPoC [16], CroW [17], and R-MAC [18].

We also compare HNIP with the MPEG CXM, which is the current state-of-the-art handcrafted compact descriptor for video analysis. Following the practice in the CDVA standard, we employ OpenMP to perform parallel retrieval for both CXM and the deep global descriptors. Experiments are conducted on the Tianhe HPC platform, where each node is equipped with 2 processors (24 cores, Xeon E5-2692) @2.2 GHz and 64 GB RAM. For CNN feature map extraction, we use an NVIDIA Tesla K80 GPU.

Fig. 8. Example video clips from the CDVA dataset.

TABLE IV
VIDEO RETRIEVAL COMPARISON (MAP) BY PROGRESSIVELY ADDING TRANSFORMATIONS (TRANSLATION, SCALE, ROTATION) INTO NIP (OFF-THE-SHELF VGG16 IS USED FOR ALL TEST DATASETS)

Transf.        size (kB/k.f.)   Landmarks   Scenes   Objects    All
G_t                 2              64.0       82.9     64.8     66.0
G_t-G_s             2              65.3       82.4     67.3     67.6
G_t-G_s-G_r         2              64.6       82.7     72.2     69.2

Average pooling is applied to all transformations. No PCA whitening is performed. kB/k.f.: descriptor size per keyframe. The best results are highlighted in bold.
C. Evaluations on HNIP Variants

We perform video retrieval experiments to assess the effect of transformations and pooling moments in the HNIP pipeline, using off-the-shelf VGG16.

Transformations: Table IV studies the influence of pooling cardinalities by progressively adding transformations into the nested pooling stages. We simply apply average pooling to all transformations. First, the dimensions of all NIP variants are 512 for VGG16, resulting in a descriptor size of 2 kB per keyframe for floating-point vectors (4 bytes per element). Second, overall retrieval performance (mAP) increases as more transformations are nested into the pooled descriptors, e.g., from 66.0% for G_t to 69.2% for G_t-G_s-G_r on the full test dataset (All). Also, we observe that G_t-G_s-G_r outperforms G_t-G_s and G_t by a large margin on Objects, while achieving comparable performance on Landmarks and Scenes. Revisiting the analysis of rotation invariance pooling on the scene-centric Holidays dataset in Fig. 3(a), though invariance to query rotation changes can be gained by database-side augmented pooling, one may note that its retrieval performance is comparable to that without rotating query and reference images (i.e., the peak value of the red curve). These observations are probably because there are relatively limited rotation (scale) changes in videos depicting large landmarks or scenes, compared to small common objects. More examples can be found in Fig. 8.

TABLE V
VIDEO RETRIEVAL COMPARISON (MAP) OF NIP WITH DIFFERENT POOLING MOMENTS (OFF-THE-SHELF VGG16 IS USED FOR ALL TEST DATASETS)

Pool Op.        size (kB/k.f.)   Landmarks   Scenes   Objects    All
Max-Max-Max          2              63.2       82.9     69.9     67.6
Avg-Avg-Avg          2              64.6       82.7     72.2     69.2
Squ-Squ-Squ          2              54.2       65.6     66.5     60.0
Max-Squ-Avg          2              60.6       80.7     74.4     67.8
Hybrid (HNIP)        2              70.0       88.2     80.9     75.9

Transformations are G_t-G_s-G_r for all experiments. No PCA whitening is performed. The best results are highlighted in bold.
Hybrid pooling moments: Table V explores the effects of pooling moments within NIP. The transformations are fixed as G_t-G_s-G_r. There are 3^3 = 27 possible combinations of pooling moments in HNIP. For simplicity, we compare our hybrid NIP (i.e., Squ-Avg-Max) to two widely used pooling strategies (i.e., max or average pooling across all transformations) and another two schemes: square-root pooling across all transformations, and Max-Squ-Avg, which decreases the pooling moment along the way. First, for uniform pooling, Avg-Avg-Avg is overall superior to Max-Max-Max and Squ-Squ-Squ, while Squ-Squ-Squ performs much worse than the other two. Second, HNIP outperforms the best uniform pooling, Avg-Avg-Avg, by a large margin. For instance, the gains over Avg-Avg-Avg are +5.4%, +5.5%, and +8.7% on Landmarks, Scenes, and Objects, respectively. Finally, for hybrid pooling, HNIP performs significantly better than Max-Squ-Avg on all test datasets. We observe similar trends when comparing HNIP to other hybrid pooling combinations.
D. Comparisons Between HNIP and State-of-the-Art Deep Descriptors

Previous experiments show that the integration of transformations and hybrid pooling moments offers remarkable video retrieval performance improvements. Here, we conduct another round of video retrieval experiments to validate the effectiveness of our optimal reported HNIP, compared to state-of-the-art deep descriptors [16]-[18].

TABLE VI
VIDEO RETRIEVAL COMPARISON OF HNIP WITH STATE-OF-THE-ART DEEP DESCRIPTORS IN TERMS OF MAP (OFF-THE-SHELF VGG16 IS USED FOR ALL TEST DATASETS)
Each dataset column reports mAP without / with PCA whitening (w/o PCAW / w/ PCAW).

method        size (kB/k.f.)  extra. time (s/k.f.)   Landmarks     Scenes       Objects       All
MAC [18]            2               0.32             57.8 / 61.9   77.4 / 76.2   70.0 / 71.8   64.3 / 67.0
SPoC [16]           2               0.32             64.0 / 69.1   82.9 / 84.0   64.8 / 70.3   66.0 / 70.9
CroW [17]           2               0.32             62.3 / 63.9   79.2 / 78.4   71.9 / 72.0   67.5 / 68.3
R-MAC [18]          2               0.32             69.4 / 74.6   84.4 / 87.3   73.8 / 78.2   72.5 / 77.1
HNIP (Ours)         2               0.96             70.0 / 74.8   88.2 / 90.1   80.9 / 85.0   75.9 / 80.1

We implement MAC [18], SPoC [16], CroW [17], and R-MAC [18] based on the source codes released by the authors, while following the same experimental setups as our HNIP. The best results are highlighted in bold.

TABLE VII
EFFECT OF THE NUMBER OF DETECTED VIDEO KEYFRAMES ON DESCRIPTOR TRANSMISSION BIT RATE, RETRIEVAL PERFORMANCE (MAP), AND SEARCH TIME, ON THE FULL TEST DATASET (ALL)

# query k.f.     # DB k.f.       method          size (kB/k.f.)  bit rate (Bps)   mAP   search time (s/q.v.)
∼140K (1.6%)    ∼105K (2.4%)     CXM                  ∼4             2840         73.6        12.4
                                 AlexNet-HNIP          1              459         71.4         1.6
                                 VGG16-HNIP            2              918         80.1         2.3
∼175K (2.0%)    ∼132K (3.0%)     CXM                  ∼4             3463         74.3        16.6
                                 AlexNet-HNIP          1              571         71.9         2.0
                                 VGG16-HNIP            2             1143         80.6         2.8
∼231K (2.7%)    ∼176K (3.9%)     CXM                  ∼4             4494         74.6        21.0
                                 AlexNet-HNIP          1              759         71.9         2.2
                                 VGG16-HNIP            2             1518         80.7         3.1

We report performance of state-of-the-art handcrafted descriptors (CXM), and PCA whitened HNIP with both off-the-shelf AlexNet and VGG16. Numbers in brackets denote the percentage of detected keyframes from the raw videos. Bps: bytes per second. s/q.v.: seconds per query video.

TABLE VIII
VIDEO RETRIEVAL COMPARISON OF HNIP WITH STATE-OF-THE-ART HANDCRAFTED DESCRIPTORS (CXM), FOR ALL TEST DATASETS

method          Landmarks    Scenes      Objects       All
CXM             61.4/60.9   63.0/61.9   92.6/91.2   73.6/72.6
AlexNet-HNIP    65.2/62.3   77.6/74.1   78.4/74.5   71.4/68.1
VGG16-HNIP      74.8/71.6   90.1/86.6   85.0/81.3   80.1/76.7

The former (latter) number in each cell represents performance in terms of mAP (Precision@R). We report performance of PCA whitened HNIP with both off-the-shelf AlexNet and VGG16. The best results are highlighted in bold.
Effect of PCA whitening: Table VI studies the effect of PCA whitening on the video retrieval performance (mAP) of different deep descriptors, using off-the-shelf VGG16. Overall, PCA whitened descriptors perform better than their counterparts without PCA whitening. More specifically, the improvements on SPoC, R-MAC, and our HNIP are much larger than those on MAC and CroW. In view of this, we apply PCA whitening to HNIP in the following sections.

HNIP versus MAC, SPoC, CroW, and R-MAC: Table VI presents the comparison of HNIP against state-of-the-art deep descriptors. We observe that HNIP obtains consistently better performance than the other approaches on all test datasets, at the cost of extra extraction time.^3 HNIP significantly improves the retrieval performance over MAC [18], SPoC [16], and CroW [17], e.g., by over 10% in mAP on the full test dataset (All). Compared with the state-of-the-art R-MAC [18], a +7% mAP improvement is achieved on Objects, which is mainly attributed to the improved robustness against the rotation changes in videos (the keyframes capture small objects from different angles).
E. Comparisons Between HNIP and Handcrafted Descriptors

In this section, we first study the influence of the number of detected video keyframes on descriptor transmission bit rate, retrieval performance, and search time. Then, with keyframes fixed, we compare HNIP to the state-of-the-art compact handcrafted descriptors (CXM), which currently obtain the optimal video retrieval performance on the MPEG CDVA datasets.

Effect of the number of detected video keyframes: As shown in Table VII, we generate three keyframe detection configurations by varying the detection parameters.

^3 The extraction time of deep descriptors is mainly decomposed into 1) the feedforward pass for extracting feature maps and 2) pooling over feature maps followed by post-processing such as PCA whitening. In our implementation based on MatConvNet, the first stage takes 0.21 seconds per keyframe (VGA-size input image to VGG16 executed on an NVIDIA Tesla K80 GPU); HNIP is four times slower (∼0.84) as there are four rotations per keyframe. The second stage is ∼0.11 seconds for MAC, SPoC, and CroW, ∼0.115 seconds for R-MAC, and ∼0.12 seconds for HNIP. Therefore, the extraction time of HNIP is roughly three times as much as the others.

Fig. 9. Pairwise video matching comparison of HNIP with state-of-the-art handcrafted descriptors (CXM) in terms of ROC curves, for all test datasets. Experimental settings are identical to those in Table VIII.
We also test their retrieval performance and complexity for CXM descriptors (∼4 kB per keyframe) and PCA whitened HNIP with both off-the-shelf AlexNet (1 kB per keyframe) and VGG16 (2 kB per keyframe), on the full test dataset (All). It is easy to see that descriptor transmission bit rate and search time increase proportionally with the number of detected keyframes. However, there is little retrieval performance gain for any descriptor, i.e., less than 1% in mAP. Thus, we adopt the first configuration throughout this work, which achieves a good tradeoff between accuracy and complexity. For instance, mAP for VGG16-HNIP is 80.1% when the descriptor transmission bit rate is only 918 bytes per second.
Video retrieval: Table VIII shows the video retrieval comparison of HNIP with the handcrafted descriptors CXM on all test datasets. Overall, AlexNet-HNIP is inferior to CXM, while VGG16-HNIP performs the best. Second, HNIP with both AlexNet and VGG16 outperforms CXM on Landmarks and Scenes. The performance gap between HNIP and CXM becomes larger as the network goes deeper from AlexNet to VGG16, e.g., AlexNet-HNIP and VGG16-HNIP improve over CXM by 3.8% and 13.4% in mAP on Landmarks, respectively. Third, we observe that AlexNet-HNIP performs much worse than CXM on Objects (e.g., 74.5% vs. 91.2% in Precision@R). VGG16-HNIP reduces the gap, but still underperforms CXM. This is reasonable, as handcrafted descriptors based on SIFT are more robust to scale and rotation changes of rigid objects in the 2D plane.
Video pairwise matching and localization: Fig. 9 and Table IX further show the pairwise video matching and temporal localization performance of HNIP and CXM on all test datasets, respectively. For pairwise video matching, VGG16-HNIP and AlexNet-HNIP consistently outperform CXM in terms of TPR at varied FPR on Landmarks and Scenes. In Table IX, we observe that the performance trends of temporal localization are roughly consistent with pairwise video matching.

One may note that the localization accuracy of CXM is worse than that of HNIP on Objects (see Table IX), but CXM obtains much better video retrieval mAP than HNIP on Objects (see Table VIII). First, given a query-reference video pair, video retrieval tries to identify the most similar keyframe pair, whereas temporal localization aims to locate multiple keyframe pairs by comparing against a predefined threshold. Second, as shown in Fig. 9, CXM achieves better TPR (Recall) than both VGG16-HNIP and AlexNet-HNIP on Objects when FPR is small (e.g., FPR = 1%), and its TPR becomes worse as FPR increases. This implies that 1) CXM ranks relevant videos and keyframes higher than HNIP in the retrieved list for object queries, which leads to better mAP on Objects when evaluating retrieval performance over a small shortlist (100 in our experiments); and 2) VGG16-HNIP gains higher Recall than CXM when FPR becomes large, which leads to higher localization accuracy on Objects. In other words, towards better temporal localization, we choose a small threshold (corresponding to FPR = 14.3% in our experiments) in order to recall as many relevant keyframes as possible.

TABLE IX
VIDEO LOCALIZATION COMPARISON OF HNIP WITH STATE-OF-THE-ART HANDCRAFTED DESCRIPTORS (CXM) IN TERMS OF JACCARD INDEX, FOR ALL TEST DATASETS

method          Landmarks   Scenes   Objects    All
CXM                45.5      45.9      68.8     54.4
AlexNet-HNIP       48.9      63.0      67.3     57.1
VGG16-HNIP         50.8      63.8      71.2     59.7

Experimental settings are the same as in Table VIII. The best results are highlighted in bold.

TABLE X
IMAGE RETRIEVAL COMPARISON (MAP) OF HNIP WITH STATE-OF-THE-ART DEEP AND HANDCRAFTED DESCRIPTORS (CXM), ON HOLIDAYS, OXFORD5K, AND UKBENCH

method          Holidays   Oxford5k   UKbench
CXM               71.2       43.5       3.46
MAC [18]          78.3       56.1       3.65
SPoC [16]         84.5       68.6       3.68
R-MAC [18]        87.2       67.6       3.73
HNIP (Ours)       88.9       69.3       3.90

We report performance of PCA whitened deep descriptors with off-the-shelf VGG16. The best results are highlighted in bold.
Image retrieval: To further verify the effectiveness of HNIP, we conduct image instance retrieval experiments on the scene-centric Holidays, landmark-centric Oxford5k, and object-centric UKbench datasets. Table X shows the comparisons of HNIP with MAC, SPoC, R-MAC, and the handcrafted descriptors from the MPEG CDVA evaluation framework. First, we observe that HNIP outperforms the handcrafted descriptors by a large margin on all datasets. Second, HNIP performs significantly better than the state-of-the-art deep descriptor R-MAC on UKbench, though it shows only marginally better performance on Holidays. The performance advantage of HNIP over R-MAC is consistent with the video retrieval results on CDVA-Scenes and CDVA-Objects in Table VI. It is again demonstrated that HNIP tends to be more effective on object-centric datasets compared to scene- and landmark-centric datasets, as object-centric datasets usually exhibit more rotation and scale distortions.

Fig. 10. (a) and (b) Video retrieval, (c) pairwise video matching, and (d) localization performance of the optimal reported HNIP (i.e., VGG16-HNIP) combined with either CXM local or CXM global descriptors, for all test datasets. For simplicity, we report the pairwise video matching performance in terms of TPR given FPR = 1%.

TABLE XI
VIDEO RETRIEVAL COMPARISON OF HNIP WITH CXM AND THE COMBINATION OF HNIP WITH CXM-LOCAL AND CXM-GLOBAL, RESPECTIVELY, ON THE FULL TEST DATASET (ALL), WITHOUT ("W/O D") OR WITH ("W/ D") THE LARGE SCALE DISTRACTOR VIDEOS

method                       size (kB/k.f.)   mAP (w/o D / w/ D)   Precision@R (w/o D / w/ D)   search time s/q.v. (w/o D / w/ D)
CXM                               ∼4              73.6 / 72.1            72.6 / 71.2                   12.4 / 38.6
VGG16-HNIP                         2              80.1 / 76.8            76.7 / 73.6                    2.3 / 9.2
VGG16-HNIP + CXM-Local            ∼4              75.7 / 75.4            74.4 / 74.1                   12.9 / 17.8
VGG16-HNIP + CXM-Global           ∼4              84.9 / 82.6            82.4 / 80.3                    4.9 / 39.5

# query k.f.: ∼140K; # DB k.f.: ∼105K (w/o D) / ∼1.25M (w/ D).
F. Combination of HNIP and Handcrafted Descriptors

CXM contains both compressed local descriptors (∼2 kB/frame) and compact global descriptors (∼2 kB/frame) aggregated from the local ones. Following the combination strategies designed in Section V, Fig. 10 shows the effectiveness of combining VGG16-HNIP with either CXM-Global or CXM-Local descriptors,^4 in video retrieval (a) & (b), matching (c), and localization (d). First, we observe that the combination of VGG16-HNIP with either CXM-Global or CXM-Local consistently improves CXM across all tasks on all test datasets. In this regard, the improvements of VGG16-HNIP + CXM-Global are much larger than those of VGG16-HNIP + CXM-Local, especially for Landmarks and Scenes. Second, VGG16-HNIP + CXM-Global performs best on all test datasets in video retrieval, matching, and localization (except localization accuracy on Landmarks). In particular, VGG16-HNIP + CXM-Global significantly improves over VGG16-HNIP on Objects in terms of mAP and Precision@R (+10%). This leads us to the conclusion that the deep descriptors VGG16-HNIP and the handcrafted descriptors CXM-Global are complementary to each other. Third, we observe that VGG16-HNIP + CXM-Local significantly degrades the performance of VGG16-HNIP on Landmarks and Scenes, e.g., there is a drop of ∼10% in mAP on Landmarks. This is due to the fact that matching pairs retrieved by HNIP (on which handcrafted features fail) cannot pass the GCC step, i.e., the number of inliers (patch-level matching pairs) is insufficient. For instance, in Fig. 7, the landmark pair is determined as a match by VGG16-HNIP, but the subsequent GCC step considers it a non-match because there are only 2 matched patch pairs. More examples can be found in Fig. 6(a).

^4 Here, we did not introduce the more complicated combination VGG-HNIP + CXM-Global + CXM-Local, because its performance is very close to that of VGG-HNIP + CXM-Local, and moreover it further increases descriptor size and search time compared to VGG-HNIP + CXM-Local.
G. Large Scale Video Retrieval

Table XI studies the video retrieval performance of CXM, VGG16-HNIP, and their combinations VGG16-HNIP + CXM-Local and VGG16-HNIP + CXM-Global, on the full test dataset (All) without and with the large scale distractor video set. By combining the reference videos with the large scale distractor set, the number of database keyframes increases from ∼105K to ∼1.25M, making the search speed significantly slower. For example, HNIP is ∼5 times slower with the 512-D Euclidean distance computation. Further compressing HNIP into an extremely compact code (e.g., 256 bits) for ultra-fast Hamming distance computation is highly desirable, without incurring considerable performance loss; we will study this in future work. Second, the performance ordering of the approaches remains the same in the large scale experiments, i.e., VGG16-HNIP + CXM-Global performs the best, followed by VGG16-HNIP, VGG16-HNIP + CXM-Local, and CXM. Finally, when increasing the database size by 10×, we observe that the performance loss is relatively small, e.g., −1.5%, −3.3%, −0.3%, and −2.3% in mAP for CXM, VGG16-HNIP, VGG16-HNIP + CXM-Local, and VGG16-HNIP + CXM-Global, respectively.
VII. CONCLUSION AND DISCUSSIONS

In this work, we propose a compact and discriminative CNN descriptor, HNIP, for video retrieval, matching, and localization. Based on the invariance theory, HNIP is proven to be robust to multiple geometric transformations. More importantly, our empirical studies show that the statistical moments in HNIP dramatically affect video matching performance, which leads us to the design of hybrid pooling moments within HNIP. In addition, we study the complementary nature of deep learned and handcrafted descriptors, and then propose a strategy to combine the two. Experimental results demonstrate that the HNIP descriptor significantly outperforms state-of-the-art deep and handcrafted descriptors, with comparable or even smaller descriptor size. Furthermore, the combination of HNIP and handcrafted descriptors offers the best overall performance.

This work provides valuable insights for the ongoing CDVA standardization efforts. During the recent 116th MPEG meeting in Oct. 2016, the MPEG CDVA ad-hoc group adopted the proposed HNIP into core experiments [21] for investigating more practical issues when dealing with deep learned descriptors in the well-established CDVA evaluation framework. There are several directions for future work. First, an in-depth theoretical analysis of how pooling moments affect video matching performance would be valuable to further reveal and clarify the mechanism of hybrid pooling, which may contribute to the invariance theory. Second, it is interesting to study how to further improve retrieval performance by optimizing deep features, e.g., fine-tuning a CNN tailored for the video retrieval task, instead of the off-the-shelf CNNs used in this work. Third, to accelerate search speed, further compressing deep descriptors into extremely compact codes (e.g., dozens of bits) while still preserving retrieval accuracy is worth investigating. Last but not least, as CNNs incur a huge number of network model parameters (over 10 million), how to effectively and efficiently compress the CNN model is a promising direction.
REFERENCES930
[1] Compact Descriptors for Video Analysis: Objectives, Applications and931Use Cases, ISO/IEC JTC1/SC29/WG11/N14507, 2014.932
[2] Compact Descriptors for Video Analysis: Requirements for Search Appli-933cations, ISO/IEC JTC1/SC29/WG11/N15040, 2014.934
[3] B. Girod et al., “Mobile visual search,” IEEE Signal Process. Mag.,935vol. 28, no. 4, pp. 61–76, Jul. 2011.936
[4] R. Ji et al., “Learning compact visual descriptor for low bit rate mobile937landmark search,” vol. 22, no. 3, 2011.Q1 938
[5] L.-Y. Duan et al., “Overview of the MPEG-CDVS standard,” IEEE Trans.939Image Process., vol. 25, no. 1, pp. 179–194, Jan. 2016.940
[6] Test Model 14: Compact Descriptors for Visual Search, ISO/IEC941JTC1/SC29/WG11/W15372, 2011.942
[7] Call for Proposals for Compact Descriptors for Video Analysis (CDVA)-943Search and Retrieval, ISO/IEC JTC1/SC29/WG11/N15339, 2015.944
[8] Cdva Experimentation Model (cxm) 0.2, ISO/IEC JTC1/SC29/945WG11/W16274, 2015.946
[9] F. Perronnin, Y. Liu, J. Sanchez, and H. Poirier, “Large-scale image re- 947trieval with compressed fisher vectors,” in Proc. IEEE Conf. Comput. Vis. 948Pattern Recog., Jun. 2010, pp. 3384–3391. Q2949
[10] H. Jegou, M. Douze, C. Schmid, and P. Perez, “Aggregating local descrip- 950tors into a compact image representation,” in Proc. IEEE Conf. Comput. 951Vis. Pattern Recog., Jun. 2010, pp. 3304–3311. 952
[11] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn fea- 953tures off-the-shelf: An astounding baseline for recognition,” in Proc. IEEE 954Conf. Comput. Vis. Pattern Recog. Workshops, Jun. 2014, pp. 512–519. 955
[12] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale orderless pooling 956of deep convolutional activation features,” in Proc. Eur. Conf. Comput. 957Vis., 2014. 958
[13] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, “Neural codes 959for image retrieval,” in Proc. Eur. Conf. Comput. Vis., 2014. 960
[14] A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson, “A baseline for 961visual instance retrieval with deep convolutional networks,” CoRR, 2014. 962[Online]. Available: http://arxiv.org/abs/1412.6574 Q3963
[15] H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson, 964“From generic to specific deep representations for visual recognition,” in 965Proc. IEEE Conf. Comput. Vis. Pattern Recog. Workshops, 2015. 966
[16] A. Babenko and V. Lempitsky, “Aggregating local deep features for image 967retrieval,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1269– 9681277. 969
[17] Y. Kalantidis, C. Mellina, and S. Osindero, “Cross-dimensional weight- 970ing for aggregated deep convolutional features,” CoRR, 2015. [Online]. 971Available: http://arxiv.org/1512.04065 972
[18] G. Tolias, R. Sicre, and H. Jegou, “Particular object retrieval with inte- 973gral max-pooling of cnn activations,” CoRR, 2015. [Online]. Available: 974http://arxiv.org/abs/1511.05879 975
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification 976with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Pro- 977cess. Syst., 2012. 978
[20] K. Simonyan and A. Zisserman, “Very deep convolutional networks 979for large-scale image recognition,” CoRR, 2014. [Online]. Available: 980http://arxiv.org/abs/1409.1556 981
[21] Description of Core Experiments in CDVA, ISO/IEC JTC1/SC29/ 982WG11/W16510, 2016. 983
[22] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” 984Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004. 985
[23] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” 986in Proc. Eur. Conf. Comput. Vis., 2006. 987
[24] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach 988to object matching in videos,” in Proc. IEEE Int. Conf. Comput. Vis., 989Oct. 2003, vol. 2, pp. 1470–1477. 990
[25] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” 991in Proc. Comput. Vis. Pattern Recog., 2006. 992
[26] H. Jegou and A. Zisserman, “Triangulation embedding and democratic 993aggregation for image search,” in Proc. IEEE Conf. Comput. Vis. Pattern 994Recog., Jun. 2014, pp. 3310–3317. 995
[27] S. S. Husain and M. Bober, “Improving large-scale image retrieval through 996robust aggregation of local descriptors,” IEEE Trans. Pattern Anal. Mach. 997Intell., to be published. 998
[28] M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift- 999invariant kernels,” in Proc. Adv. Neural Inf. Process. Syst., 2009. 1000
[29] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Proc. Adv. 1001Neural Inf. Process. Syst., 2009. 1002
[30] V. Chandrasekhar et al.,“Transform coding of image feature descriptors,” 1003in Proc. IS&T/SPIE Electron. Imag., 2009. 1004
[31] V. Chandrasekhar et al., “CHoG: Compressed histogram of gradients a 1005low bit-rate feature descriptor,” in Proc. IEEE Conf. Comput. Vis. Pattern 1006Recog., Jun. 2009, pp. 2504–2511. 1007
[32] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest 1008neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, 1009pp. 117–128, Jan. 2011. 1010
[33] M. Calonder et al., “BRIEF: Computing a local binary descriptor 1011very fast,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, 1012pp. 1281–1298, Jul. 2012. 1013
[34] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: an efficient 1014alternative to SIFT or SURF,” in Proc. IEEE Int. Conf. Comput. Vis., 1015Nov. 2011, pp. 2564–2571. 1016
[35] S. Leutenegger, M. Chli, and R. Y. Siegwart, “BRISK: Binary robust 1017invariant scalable keypoints,” in Proc. Int. Conf. Comput. Vis., Nov. 2011, 1018pp. 2548–2555. 1019
[36] S. Zhang, Q. Tian, Q. Huang, W. Gao, and Y. Rui, “USB: Ultrashort 1020binary descriptor for fast visual matching and retrieval,” IEEE Trans. 1021Image Process., vol. 23, no. 8, pp. 3671–3683, Aug. 2014. 1022
IEEE P
roof
LIN et al.: HNIP: COMPACT DEEP INVARIANT REPRESENTATIONS FOR VIDEO MATCHING, LOCALIZATION, AND RETRIEVAL 15
[37] D. M. Chen et al., "Tree histogram coding for mobile image matching," in Proc. Data Compression Conf., 2009.
[38] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing for scalable image search," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep.–Oct. 2009, pp. 2130–2137.
[39] D. Chen et al., "Residual enhanced visual vector as a compact signature for mobile visual search," Signal Process., vol. 93, no. 8, pp. 2316–2327, 2013.
[40] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 580–587.
[41] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in Proc. Comput. Vis. Pattern Recog., Jun. 2016, pp. 5297–5307.
[42] F. Radenovic, G. Tolias, and O. Chum, "CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples," in Proc. Eur. Conf. Comput. Vis., 2016.
[43] A. Gordo, J. Almazan, J. Revaud, and D. Larlus, "Deep image retrieval: Learning global representations for image search," in Proc. Eur. Conf. Comput. Vis., 2016.
[44] L. Baroffio, M. Cesana, A. Redondi, M. Tagliasacchi, and S. Tubaro, "Coding visual features extracted from video sequences," IEEE Trans. Image Process., vol. 23, no. 5, pp. 2262–2276, May 2014.
[45] A. Redondi, L. Baroffio, M. Cesana, and M. Tagliasacchi, "Compress-then-analyze vs. analyze-then-compress: Two paradigms for image analysis in visual sensor networks," in Proc. IEEE Int. Workshop Multimedia Signal Process., Sep.–Oct. 2013, pp. 278–282.
[46] L. Baroffio, J. Ascenso, M. Cesana, A. Redondi, and M. Tagliasacchi, "Coding binary local features extracted from video sequences," in Proc. IEEE Int. Conf. Image Process., Oct. 2014, pp. 2794–2798.
[47] M. Makar, V. Chandrasekhar, S. Tsai, D. Chen, and B. Girod, "Interframe coding of feature descriptors for mobile augmented reality," IEEE Trans. Image Process., vol. 23, no. 8, pp. 3352–3367, Aug. 2014.
[48] J. Chao and E. G. Steinbach, "Keypoint encoding for improved feature extraction from compressed video at low bitrates," IEEE Trans. Multimedia, vol. 18, no. 1, pp. 25–39, Jan. 2016.
[49] L. Baroffio et al., "Coding local and global binary visual features extracted from video sequences," IEEE Trans. Image Process., vol. 24, no. 11, pp. 3546–3560, Nov. 2015.
[50] D. M. Chen, M. Makar, A. F. Araujo, and B. Girod, "Interframe coding of global image signatures for mobile augmented reality," in Proc. Data Compression Conf., 2014.
[51] D. M. Chen and B. Girod, "A hybrid mobile visual search system with compact global signatures," IEEE Trans. Multimedia, vol. 17, no. 7, pp. 1019–1030, Jul. 2015.
[52] C.-Z. Zhu, H. Jegou, and S. Satoh, "NII team: Query-adaptive asymmetrical dissimilarities for instance search," in Proc. TRECVID 2013 Workshop, Gaithersburg, MD, USA, 2013.
[53] N. Ballas et al., "IRIM at TRECVID 2014: Semantic indexing and instance search," in Proc. TRECVID 2014 Workshop, 2014.
[54] A. Araujo, J. Chaves, R. Angst, and B. Girod, "Temporal aggregation for large-scale query-by-image video retrieval," in Proc. IEEE Int. Conf. Image Process., Sep. 2015, pp. 1519–1522.
[55] M. Shi, T. Furon, and H. Jegou, "A group testing framework for similarity search in high-dimensional spaces," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 407–416.
[56] J. Lin et al., "Rate-adaptive compact Fisher codes for mobile visual search," IEEE Signal Process. Lett., vol. 21, no. 2, pp. 195–198, Feb. 2014.
[57] Z. Xu, Y. Yang, and A. G. Hauptmann, "A discriminative CNN video representation for event detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 1798–1807.
[58] L.-Y. Duan et al., "Compact descriptors for video analysis: The emerging MPEG standard," CoRR, 2017. [Online]. Available: http://arxiv.org/abs/1704.08141
[59] F. Anselmi and T. Poggio, "Representation learning in sensory cortex: A theory," in Proc. Center Brains, Minds Mach., 2014.
[60] Q. Liao, J. Z. Leibo, and T. Poggio, "Learning invariant representations and applications to face verification," in Proc. Advances Neural Inf. Process. Syst., Lake Tahoe, NV, 2013.
[61] C. Zhang, G. Evangelopoulos, S. Voinea, L. Rosasco, and T. Poggio, "A deep representation for invariance and music classification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2014, pp. 6984–6988.
[62] K. Lenc and A. Vedaldi, "Understanding image representations by measuring their equivariance and equivalence," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 991–995.
[63] J. R. R. Uijlings, A. W. M. Smeulders, and R. J. H. Scha, "Real-time visual concept classification," IEEE Trans. Multimedia, vol. 12, no. 7, pp. 665–681, Nov. 2010.
[64] M. Cimpoi, S. Maji, and A. Vedaldi, "Deep filter banks for texture recognition and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2015, pp. 3828–3836.
[65] H. Jegou, M. Douze, and C. Schmid, "Hamming embedding and weak geometric consistency for large scale image search," in Proc. Eur. Conf. Comput. Vis., 2008.
[66] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, "Object retrieval with large vocabularies and fast spatial matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2007, pp. 1–8.
[67] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., Jun. 2006, vol. 2, pp. 2161–2168.
Jie Lin received the B.S. and Ph.D. degrees from the School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China, in 2006 and 2014, respectively.
He is currently a Research Scientist with the Institute for Infocomm Research, A*STAR, Singapore. He was previously a visiting student with the Rapid-Rich Object Search Laboratory, Nanyang Technological University, Singapore, and the Institute of Digital Media, Peking University, Beijing, China, from 2011 to 2014. His research interests include deep learning, feature coding, and large-scale image/video retrieval. His work on image feature coding has been recognized as a core contribution to the MPEG-7 Compact Descriptors for Visual Search (CDVS) standard.
Ling-Yu Duan (M'09) received the M.Sc. degree in automation from the University of Science and Technology of China, Hefei, China, in 1999, the M.Sc. degree in computer science from the National University of Singapore (NUS), Singapore, in 2002, and the Ph.D. degree in information technology from The University of Newcastle, Callaghan, Australia, in 2008.
He is currently a Full Professor with the National Engineering Laboratory of Video Technology, School of Electronics Engineering and Computer Science, Peking University (PKU), Beijing, China. He has been the Associate Director of the Rapid-Rich Object Search Laboratory, a joint laboratory between Nanyang Technological University, Singapore, and PKU, since 2012. Before joining PKU, he was a Research Scientist with the Institute for Infocomm Research, Singapore, from March 2003 to August 2008. He has authored or coauthored more than 130 research papers in international journals and conferences. His research interests include multimedia indexing, search, and retrieval, mobile visual search, visual feature coding, and video analytics. Prior to 2010, his research focused mainly on multimedia (semantic) content analysis, especially in the domains of broadcast sports videos and TV commercial videos.
Prof. Duan was the recipient of the EURASIP Journal on Image and Video Processing Best Paper Award in 2015 and the Ministry of Education Technology Invention Award (First Prize) in 2016. He was a co-editor of the MPEG Compact Descriptors for Visual Search (CDVS) standard (ISO/IEC 15938-13), and is a Co-Chair of MPEG Compact Descriptors for Video Analysis (CDVA). His recent work has focused on compact representation of visual features and high-performance image search, with significant contributions to the completed MPEG-CDVS standard. The suite of CDVS technologies has been successfully deployed and impacts the visual search products/services of leading Internet companies such as Tencent (WeChat) and Baidu (Image Search Engine).
Shiqi Wang received the B.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 2008, and the Ph.D. degree in computer application technology from Peking University, Beijing, China, in 2014.
From March 2014 to March 2016, he was a Postdoctoral Fellow with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada. From April 2016 to April 2017, he was a Research Fellow with the Rapid-Rich Object Search Laboratory, Nanyang Technological University, Singapore. He is currently an Assistant Professor with the Department of Computer Science, City University of Hong Kong, Hong Kong. He has contributed more than 30 technical proposals to the ISO/MPEG, ITU-T, and AVS standards. His research interests include image/video compression, analysis, and quality assessment.
Yan Bai received the B.S. degree in software engineering from Dalian University of Technology, Liaoning, China, in 2015, and is currently working toward the M.S. degree at the School of Electronics Engineering and Computer Science, Peking University, Beijing, China.
Her research interests include large-scale video retrieval and fine-grained visual recognition.
Yihang Lou received the B.S. degree in software engineering from Dalian University of Technology, Liaoning, China, in 2015, and is currently working toward the M.S. degree at the School of Electronics Engineering and Computer Science, Peking University, Beijing, China.
His current research interests include large-scale video retrieval and object detection.
Vijay Chandrasekhar received the B.S. and M.S. degrees from Carnegie Mellon University, Pittsburgh, PA, USA, in 2002 and 2005, respectively, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2013.
He has authored or coauthored more than 80 papers/MPEG contributions in a wide range of top-tier journals and conferences, such as the International Journal of Computer Vision, ICCV, CVPR, the IEEE Signal Processing Magazine, ACM Multimedia, the IEEE TRANSACTIONS ON IMAGE PROCESSING, Designs, Codes and Cryptography, the International Society for Music Information Retrieval, and MPEG-CDVS, and has filed seven U.S. patents (one granted, six pending). His research interests include mobile audio and visual search, large-scale image and video retrieval, machine learning, and data compression. His Ph.D. work on feature compression led to the MPEG-CDVS (Compact Descriptors for Visual Search) standard, to which he actively contributed from 2010 to 2013.
Dr. Chandrasekhar was the recipient of the A*STAR National Science Scholarship (NSS) in 2002.
Tiejun Huang received the B.S. and M.S. degrees in computer science from the Wuhan University of Technology, Wuhan, China, in 1992 and 1995, respectively, and the Ph.D. degree in pattern recognition and intelligent systems from the Huazhong (Central China) University of Science and Technology, Wuhan, China, in 1998.
He is a Professor and the Chair of the Department of Computer Science, School of Electronics Engineering and Computer Science, Peking University, Beijing, China. His research areas include video coding, image understanding, and neuromorphic computing.
Prof. Huang is a Member of the Board of the Chinese Institute of Electronics and the Advisory Board of IEEE Computing Now. He was the recipient of the National Science Fund for Distinguished Young Scholars of China in 2014, and was named a Distinguished Professor of the Chang Jiang Scholars Program by the Ministry of Education in 2015.
Alex Kot (S'85–M'89–SM'98–F'06) has been with Nanyang Technological University, Singapore, since 1991. He headed the Division of Information Engineering, School of Electrical and Electronic Engineering, for eight years, and was an Associate Chair (Research) and the Vice Dean (Research) of the School of Electrical and Electronic Engineering. He is currently a Professor with the College of Engineering and the Director of the Rapid-Rich Object Search Laboratory. He has authored or coauthored extensively in the areas of signal processing for communication, biometrics, data hiding, image forensics, and information security.
Prof. Kot is a Member of the IEEE Fellow Evaluation Committee and a Fellow of the Academy of Engineering, Singapore. He was the recipient of the Best Teacher of the Year Award and is a coauthor of several Best Paper Award papers at conferences including ICPR, IEEE WIFS, ICEC, and IWDW. He has served the IEEE Signal Processing Society in various capacities, such as the General Co-Chair of the 2004 IEEE International Conference on Image Processing and the Chair of the worldwide SPS Chapter Chairs and the Distinguished Lecturer Program, and he is currently the Vice President of the IEEE Signal Processing Society and an IEEE SPS Distinguished Lecturer. He was a Guest Editor for special issues of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and the EURASIP Journal on Advances in Signal Processing, and is also an Editor of the EURASIP Journal on Advances in Signal Processing. He was an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON SIGNAL PROCESSING, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE SIGNAL PROCESSING LETTERS, the IEEE SIGNAL PROCESSING MAGAZINE, the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I, and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II.
Wen Gao (S'87–M'88–SM'05–F'09) received the Ph.D. degree in electronics engineering from the University of Tokyo, Tokyo, Japan, in 1991.
He was a Professor of computer science with the Harbin Institute of Technology, Harbin, China, from 1991 to 1995, and a Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He is currently a Professor of computer science with the Institute of Digital Media, School of Electronics Engineering and Computer Science, Peking University, Beijing, China.
Prof. Gao has served on the editorial boards of several journals, such as the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, the EURASIP Journal of Image Communications, and the Journal of Visual Communication and Image Representation. He has chaired a number of prestigious international conferences on multimedia and video signal processing, such as IEEE ICME and ACM Multimedia, and has also served on the advisory and technical committees of numerous professional organizations.