IEEE TRANSACTIONS ON MULTIMEDIA 1 HNIP: …vijaychan.github.io/Publications/2017_TMM_HNIP.pdf ·...

transcript

IEEE P

IEEE TRANSACTIONS ON MULTIMEDIA 1

HNIP: Compact Deep Invariant Representations forVideo Matching, Localization, and Retrieval

Jie Lin, Ling-Yu Duan, Member, IEEE, Shiqi Wang, Yan Bai, Yihang Lou, Vijay Chandrasekhar, Tiejun Huang,Alex Kot, Fellow, IEEE, and Wen Gao, Fellow, IEEE

Abstract—With emerging demand for large-scale video analysis,5MPEG initiated the compact descriptor for video analysis (CDVA)6standardization in 2014. Unlike handcrafted descriptors adopted7by the ongoing CDVA standard, in this work, we study the8problem of deep learned global descriptors for video matching,9localization, and retrieval. First, inspired by a recent invariance10theory, we propose a nested invariance pooling (NIP) method11to derive compact deep global descriptors from convolutional12neural networks (CNN), by progressively encoding translation,13scale, and rotation invariances into the pooled descriptors. Second,14our empirical studies have shown that pooling moments (e.g.,15max or average) drastically affect video matching performance,16which motivates us to design hybrid pooling operations within17NIP (HNIP). HNIP further improves the discriminability of18deep global descriptors. Third, the advantages and performance19on the combination of deep and handcrafted descriptors are20provided to better investigate the complementary effects of them.21We evaluate the effectiveness of HNIP by incorporating it into22the well-established CDVA evaluation framework. Experimental23results show that HNIP outperforms state-of-the-art deep and24canonical handcrafted descriptors with significant mAP gains of255.5% and 4.7%, respectively. Moreover, the combination of HNIP26and handcrafted global descriptors further boosts the performance27of CDVA core techniques with comparable descriptor size.28

Index Terms—Convolutional neural networks (CNN), deep29global descriptors, handcrafted descriptors, hybrid nested30invariance pooling, MPEG compact descriptor for video analysis31(CDVA), MPEG CDVS.32

I. INTRODUCTION33

R ECENT years have witnessed a remarkable growth of34

interest for video retrieval, which refers to searching for35

Manuscript received December 5, 2016; revised April 3, 2017 and May 30,2017; accepted May 30, 2017. This work was supported in part by the NationalHightech R&D Program of China under Grant 2015AA016302, in part bythe National Natural Science Foundation of China under Grant U1611461 andGrant 61661146005, and in part by the PKU-NTU Joint Research Institute (JRI)sponsored by a donation from the Ng Teng Fong Charitable Foundation. Theguest editor coordinating the review of this manuscript and approving it forpublication was Dr. Cees Snoek. (Corresponding author: Ling-Yu Duan.)

J. Lin and V. Chandrasekhar is with the Institute for Infocomm Research,A∗STAR, Singapore 138632 (e-mail: lin-j@i2r.a-star.edu.sg).

L.-Y. Duan, Y. Bai, Y. Lou, T. Huang, and W. Gao are with the Insti-tute of Digital Media, Peking University, Beijing 100080, China (e-mail:lingyu@pku.edu.cn; yanbai@pku.edu.cn; yihang@pku.edu.cn; tjhuang@pku.edu.cn; wgao@pku.edu.cn).

S. Wang is with the Department of Computer Science, City University ofHong Kong, Hong Kong, China (e-mail: shiqwang@cityu.edu.hk).

V. Chandrasekhar and A. Kot are with the Nanyang Technological University,Singapore 639798 (e-mail: vijay@i2r.a-star.edu.sg; EACKOT@ntu.edu.sg).

Color versions of one or more of the figures in this paper are available onlineat http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2017.2713410

videos representing the same object or scene as the one depicted 36

in a query video. Such capability can facilitate a variety of appli- 37

cations including mobile augmented reality (MAR), automotive, 38

surveillance, media entertainment, etc. [1]. In the rapid evolution 39

of video retrieval, both great promises and new challenges aris- 40

ing from real applications have been perceived [2]. Typically, the 41

video retrieval is performed at the server end, which requires 42

the transmission of visual data via wireless network [3], [4]. 43

Instead of directly sending huge volume of compressed video 44

data, developing compact and robust video feature representa- 45

tions is highly desirable, which fulfills low latency transmission 46

over bandwidth constrained network, e.g., thousands of bytes 47

per second. 48

To this end, in 2009, MPEG started the standardization of 49

Compact Descriptors for Visual Search (CDVS) [5], which 50

came up with a normative bitstream of standardized descriptors 51

for mobile visual search and augmented reality applications. 52

In Sep. 2015, MPEG published the CDVS standard [6]. Very 53

recently, towards large-scale video analysis, MPEG has moved 54

forward to standardize Compact Descriptors for Video Analy- 55

sis (CDVA) [7]. To deal with content redundancy in temporal 56

dimension, the latest CDVA Experimental Model (CXM) [8] 57

casts video retrieval into keyframe based image retrieval task, 58

in which the keyframe-level matching results are combined for 59

video matching. The keyframe-level representation avoids de- 60

scriptor extraction on dense frames in videos, which largely re- 61

duces the computational complexity (e.g., CDVA only extracts 62

descriptors of 1∼2% frames detected from raw videos). 63

In CDVS, handcrafted local and global descriptors have been 64

successfully standardized in a compact and scalable manner 65

(e.g., from 512 B to 16 KB), where local descriptors capture the 66

invariant characteristics of local image patches and global de- 67

scriptors like Fisher Vectors (FV) [9] and VLAD [10] reflect the 68

aggregated statistics of local descriptors. Though handcrafted 69

descriptors have achieved great success in CDVS standard [5] 70

and CDVA experimental model, how to leverage promising 71

deep learned global descriptors remains an open issue in the 72

MPEG CDVA Ad-hoc group. Many recent works [11]–[18] 73

have shown the advantages of deep global descriptors for image 74

retrieval, which may be attributed to the remarkable success of 75

Convolutional Neural Networks (CNN) [19], [20]. In particular, 76

state-of-the-art deep global descriptors R-MAC [18] computes 77

the max over a set of Region-of-Interest (ROI) cropped from 78

feature maps output by intermediate convolutional layer, 79

followed by the average of these regional max-pooled features. 80

Results show that R-MAC offers remarkable improvements over 81

1520-9210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications standards/publications/rights/index.html for more information.

IEEE P

2 IEEE TRANSACTIONS ON MULTIMEDIA

other deep global descriptors like MAC [18] and SPoC [16],82

while maintaining the same dimensionality.83

In the context of compact descriptors for video retrieval, there84

exist important practical issues with CNN based deep global de-85

scriptors. First, one main drawback of CNN is the lack of invari-86

ance to geometric transformations of the input image such as87

rotations. The performance of deep global descriptors quickly88

degrades when the objects in query and database videos are89

rotated differently. Second, different from CNN features, hand-90

crafted descriptors are robust to scale and rotation changes in91

2D plane, because of the local invariant feature detectors. As92

such, more insights should be provided on whether there is great93

complementarity between CNN and conventional handcrafted94

descriptors for better performance.95

To tackle the above issues, we make the following contribu-96

tions in this work:97

1) We propose a Nested Invariance Pooling (NIP) method98

to produce compact global descriptors from CNN by pro-99

gressively encoding translation, scale and rotation invari-100

ances. NIP is inspired from a recent invariance theory,101

which provides a practical and mathematically proven102

way for computing invariant representations with feedfor-103

ward neural networks. In this respect, NIP is extensible104

to other types of transformations. Both quantitative and105

qualitative evaluations are introduced for a deeper look at106

the invariance properties.107

2) We further improve the discriminability of deep global108

descriptors by designing Hybrid pooling moments within109

NIP (HNIP). Evaluations on video retrieval show that110

HNIP outperforms state-of-the-art deep and handcrafted111

descriptors by a large margin with comparable or smaller112

descriptor size.113

3) We analyze the complementary nature of deep features114

and handcrafted descriptors over diverse datasets (land-115

marks, scenes and common objects). A simple combina-116

tion strategy is introduced to fuse the strengths of both117

deep and handcrafted global descriptors. We show that118

the combined descriptors offer the optimal video match-119

ing and retrieval performance, without incurring extra cost120

on descriptor size compared to CDVA.121

4) Due to the superior performance, HNIP has been adopted122

by CDVA Ad-hoc group as technical reference to setup123

new core experiments [21], which opens up future ex-124

ploration of CNN techniques in the development of stan-125

dardized video descriptors. The latest core experiments126

involve compact deep feature representation, CNN model127

compression, etc.128

II. RELATED WORK129

Handcrafted descriptors: Handcrafted local descriptors [22],130

[23], such as SIFT based on Difference of Gaussians (DoG)131

detector [22], have been successfully and widely employed132

to conduct image matching and localization tasks due to133

their robustness in scale and rotation changes. Building on134

local image descriptors, global image representations aim to135

provide statistical summaries of high level image properties136

and facilitate fast large-scale image search. In particular, for137

global image descriptors, the most prominent ones include 138

Bag-of-Words (BoW) [24], [25], Fisher Vector (FV) [9], 139

VLAD [10], Triangulation Embedding [26] and Robust Visual 140

Descriptor with Whitening (RVDW) [27], with which fast 141

comparisons against a large scale database become practical. 142

Given the fact that raw descriptors such as SIFT and FV may 143

consume extraordinarily large number of bits for transmission 144

and storage, many compact descriptors were developed. For ex- 145

ample, numerous strategies have been proposed to compress 146

SIFT using hashing [28], [29], transform coding [30], lattice 147

coding [31] and vector quantization [32]. On the other hand, bi- 148

nary descriptors including BRIEF [33], ORB [34], BRISK [35] 149

and Ultrashort Binary Descriptor (USB) [36] were proposed, 150

which support fast Hamming distance matching. For compact 151

global descriptors, efforts have also been made to reduce their 152

descriptor size, such as tree-structure quantizer [37] for BoW 153

histogram, locality sensitive hashing [38], dimensionality re- 154

duction and vector quantization for FV [9], and simple sign 155

binarization for VLAD like descriptors [9], [39]. 156

Deep descriptors: Deep learned descriptors have been ex- 157

tensively applied to image retrieval [11]–[18]. First initial 158

study [11], [13] proposed to use representations directly ex- 159

tracted from fully connected layer of CNN. More compact 160

global descriptors [14]–[16] can be extracted by performing 161

either global max or average pooling (e.g., SPoC in [16]) over 162

feature maps output by intermediate layers. Further improve- 163

ments are obtained by spatial or channel-wise weighting of 164

pooled descriptors [17]. Very recently, inspired by the R-CNN 165

approach [40] used for object detection, Tolias et al. [18] pro- 166

posed ROI based pooling on deep convolutional features, Re- 167

gional Maximum Activation of Convolutions (R-MAC), which 168

significantly improves global pooling approaches. Though 169

R-MAC is scale invariant to some extent, it suffers from the 170

lack of rotation invariance. These regional deep features can be 171

also aggregated into global descriptors by VLAD [12]. 172

In a number of recent works [13], [41]–[43], pre-trained 173

CNNs for image classification are repurposed for the image 174

retrieval task, by fine-tuning them with specific loss functions 175

(e.g., Siamese or triplet networks) over carefully constructed 176

matching and non-matching training image sets. There is consid- 177

erable performance improvement when training and test datasets 178

in similar domains (e.g., buildings). In this work, we aim to 179

explicitly construct invariant deep global descriptors from the 180

perspective of better leveraging the state-of-the-arts or classical 181

CNN architectures, rather than further optimizing the learning 182

of invariant deep descriptors. 183

Video descriptors: Video is typically composed of a num- 184

ber of moving frames. Therefore, a straightforward method for 185

video descriptor representation is extracting feature descriptors 186

at frame level then reducing the temporal redundancies of these 187

descriptors for compact representation. For local descriptors, 188

Baroffio et al. [44] proposed both intra- and inter-feature cod- 189

ing methods of SIFT in the context of visual sensor network, 190

and an additional mode decision scheme based on rate-distortion 191

optimization was designed to further improve the feature coding 192

efficiency. In [45], [46], studies have been conducted to encode 193

the binary features such as BRISK [35]. Makar et al. [47] pre- 194

sented a temporally coherent keypoint detector to allow efficient 195

IEEE P

LIN et al.: HNIP: COMPACT DEEP INVARIANT REPRESENTATIONS FOR VIDEO MATCHING, LOCALIZATION, AND RETRIEVAL 3

interframe coding of canonical patches, corresponding feature196

descriptors, and locations towards mobile augmented reality197

application. Chao et al. [48] developed a key-points encoding198

technique, where locations, scales and orientations extracted199

from original videos are encoded and transmitted along with200

compressed video to the server. Recently, the temporal depen-201

dencies of global descriptors have also been exploited. For BoW202

extracted from video sequence, Baroffio et al. [49] proposed an203

intra-frame coding method with uniform scalar quantization,204

as well as an inter-frame technique with arithmetic coding the205

quantized symbols. Chen et al. [50], [51] proposed an encoding206

scheme for scalable residual based global signatures given the207

fact that REVVs [39] of adjacent frames share most codewords208

and residual vectors.209

Besides the frame-level approaches, aggregations of local de-210

scriptors over video slots and global descriptors over scenes211

have also been intensively explored [52]–[55]. In [54], tempo-212

ral aggregation strategies for large scale video retrieval were213

experimentally studied and evaluated with the CDVS global214

descriptors [56]. Four aggregation modes, including local fea-215

ture, global signature, tracking-based and independent frame216

based aggregation schemes were investigated. For video-level217

CNN representation, in [57], the authors applied FV and VLAD218

aggregation techniques over dense local features of CNN acti-219

vation maps for video event detection.220

III. MPEG CDVA221

MPEG CDVA [7] aims to standardize the bitstream of com-222

pact video descriptors for large-scale video analysis. The CDVA223

standard incurs two main technical requirements of the dedi-224

cated descriptors: compactness and robustness. On the one hand,225

compact representation is an efficient way to economize the226

transmission bandwidth, storage space and computational cost.227

On the other hand, robust representation in the scenario of ge-228

ometric transformations such as rotation and scale variations is229

particularly required. To this end, in the 115th MPEG meeting,230

the CXM [8] was released, which mainly relies on MPEG CDVS231

reference software TM14.2 [6] for keyframe-level compact and232

robust handcrafted descriptor representation based on scale and233

rotation invariant local features.234

A. CDVS-Based Handcrafted Descriptors235

The MPEG CDVS [5] standardized descriptors serve as the236

fundamental infrastructure to represent video keyframes. The237

normative blocks of CDVS standard are illustrated in Fig. 1(b),238

mainly involving extraction of compressed local and global239

descriptors. First, scale and rotation invariant interest key points240

are detected from image, and a subset of reliable key points241

are retained, followed by the computation of handcrafted local242

SIFT features. The compressed local descriptors are formed243

by applying a low-complexity transform coding on local SIFT244

features. The compact global descriptors are Fisher vectors ag-245

gregated from the selected local features, followed by scalable246

descriptor compression with simple sign binarization. Basically,247

pairwise image matching is accomplished by first comparing248

compact global descriptors, then further performing geometric249

consistency checking (GCC) with compressed local descriptors.250

CDVS handcrafted descriptors are with very low memory foot- 251

print, while preserving competitive matching and retrieval accu- 252

racy. The standard supports operating points ranging from 512 B 253

to 16 kB specified for different bandwidth constraints. Overall, 254

the 4 kB operating point achieves a good tradeoff between ac- 255

curacy and complexity (e.g., transmission bitrate, search time). 256

Thus, CDVA CXM adopts the 4 kB descriptors for keyframe- 257

level representation, in which compressed local descriptors and 258

compact global descriptors are both ∼2 kB per keyframe. 259

B. CDVA Evaluation Framework 260

Fig. 1(a) shows the evaluation framework of CDVA, includ- 261

ing keyframe detection, CDVS descriptors extraction, trans- 262

mission, and video retrieval and matching. At the client side, 263

color histogram comparison is applied to identify keyframes 264

from video clips. The standardized CDVS descriptors are ex- 265

tracted from these keyframes, which can be further packed to 266

form CDVA descriptors [58]. Keyframe detection has largely 267

reduced the temporal redundancy in videos, resulting in low bi- 268

trate query descriptor transmission. At the server side, the same 269

keyframe detection and CDVS descriptors extraction are ap- 270

plied to database videos. Formally, we denote query video X = 271

{x1 , ...,xNx} and reference video Y = {y1 , ...,yNy

}, where x 272

and y denote keyframes. Nx and Ny are the number of detected 273

keyframes in query and reference videos, respectively. The start 274

and end timestamps for keyframes are recorded, e.g., [T sxn

, T exn

for query keyframe xn . Here, we briefly describe the pipeline of 276

pairwise video matching, localization and video retrieval with 277

CDVA descriptors. 278

Pairwise video matching and localization: Pairwise video 279

matching is performed by comparing the CDVA descriptors of 280

video keyframe pair (X,Y). Each keyframe in X is compared 281

with all keyframes in Y. The video-level similarity K(X, Y) is 282

defined as the largest matching score among all keyframe-level 283

similarities. For example, if we consider video matching with 284

CDVS global descriptors only 285

K(X, Y) = maxx∈X,y∈Y

k(f(x), f(y)) (1)

where k(·, ·) denotes a matching function (e.g., cosine similar- 286

ity). f(x) denotes CDVS global descriptors for keyframe x.1 287

Following the matching pipeline in CDVS, if k(·, ·) exceeds a 288

pre-defined threshold, GCC with CDVS local descriptors is sub- 289

sequently applied for verifying true positive keyframe matches. 290

Then the keyframe-level similarity is finally determined as the 291

multiplication of matching scores from both CDVS global and 292

local descriptors. Correspondingly, K(X, Y) in (1) is refined as 293

the maximum of their combined similarities. 294

The matched keyframe timestamps between query and refer- 295

ence videos are recorded for evaluating the temporal localization 296

task, i.e., locating the video segment containing item of inter- 297

est. In particular, if the multiplication of CDVS global and local 298

matching scores exceeds a predefined threshold, the correspond- 299

ing keyframe timestamps are recorded. Assuming there are τ 300

(τ ≤ Nx ) keyframes satisfying such criterion in a query video, 301

the matching video segment is defined as T ′start = min

1We use the same notation for deep global descriptors in the following section.

IEEE P

Fig. 1. (a) Illustration of MPEG CDVA evaluation framework. (b) Descriptor extraction pipeline for MPEG CDVS. (c) Temporal localization of item of interestbetween video pair.

and T ′end = max

}, where 1 ≤ n ≤ τ . As such, we can ob-303

tain the predicted matching video segment by descriptor match-304

ing, as shown in Fig. 1(c).305

Video retrieval: Video retrieval differs from pairwise video306

matching in that the former is one to many matching, while307

the latter is one to one matching. Thus, video retrieval shares308

similar matching pipeline with pairwise video matching, ex-309

cept for the following differences: 1) For each query keyframe,310

the top Kg candidate keyframes are retrieved from database311

by comparing CDVS global descriptors. Subsequently, GCC312

reranking with CDVS local descriptors is performed between313

query keyframe and each candidate. The top Kl database314

keyframes are recorded. The default choices for Kg and Kl315

are 500 and 100, respectively. 2) For each query video, all re-316

turned database keyframes are merged into candidate database317

videos according to their video indices. Then, the video-level318

similarity between query and each candidate database video319

is obtained following the same principle as pairwise video320

matching. Finally, the top ranked candidate database videos are321

returned.322

IV. METHOD323

A. Hybrid Nested Invariance Pooling324

Fig. 2(a) shows the extraction pipeline of our compact deep325

invariant global descriptors with HNIP. Given a video keyframe326

x, we rotate it by R times (with step size θ◦). By forward-327

ing each rotated image through a pre-trained deep CNN, the328

convolutional feature maps output by intermediate layer (e.g.,329

convolutional layer) are represented by a cube W × H × C,330

where W and H denote width and height of each feature331

map respectively and C is the number of channels in the fea-332

ture maps. Subsequently, we extract a set of ROIs from the333

cube using an overlapping sliding window, with window size334

W′ ≤ W and H

′ ≤ H . The window size is adjusted to incor- 335

porate ROIs with different scales (e.g., 5 × 5, 10 × 10). Here, 336

we denote the number of scales as S. Finally, a 5-D data struc- 337

ture γx(Gt,Gs,Gr , C) ∈ RW′×H

′×S×R×C is derived, which 338

encodes translations Gt (i.e., spatial locations W′ × H

′), scales 339

Gs and rotations Gr of input keyframe x. 340

HNIP aims to aggregate the 5-D data into a compact deep in- 341

variant global descriptor. In particular, it firstly performs pooling 342

over translations (W′ × H

′), then scales (S) and finally rota- 343

tions (R) in a nested way, resulting in a C-dimensional global 344

descriptor. Formally, for the cth feature map, nt-norm pooling 345

over translations Gt is given by 346

γx(Gs,Gr , c) =

⎝ 1W ′ × H ′

W′×H

′∑

γx(j,Gs,Gr , c)nt

(2)where the pooling operation has a parameter nt defining the sta- 347

tistical moments, e.g., nt = 1 is first-order (i.e., average pool- 348

ing), nt → +∞ on the other extreme is infinite order (i.e., max 349

pooling), and nt = 2 is second order (i.e., square-root pooling). 350

Equation (2) leads to a 3-D data γx(Gs,Gr , C) ∈ RS×R×C . 351

Analogously, ns -norm pooling over scale transformations Gs 352

and the subsequent nr -norm pooling over rotation transforma- 353

tions Gr are 354

γx(Gr , c) =

⎝ 1S

γx(j,Gr , c)ns

γx(c) =

⎝ 1R

γx(j, c)nr

IEEE P

Fig. 2. (a) Nested invariance pooling (NIP) on feature maps extracted from intermediate layer of CNN architecture. (b) A single convolution-pooling operationfrom a CNN schematized for a single input layer and single output neuron. The parallel with invariance theory shows that the universal building block of CNN iscompatible with the incorporation of invariance to local translations of the input.

The corresponding global descriptor is obtained by concatenat-355

ing γx(c) for all feature maps356

f(x) = {γx(c)}0≤c<C . (5)

As such, the keyframe matching function in (1) is defined as357

k(f(x), f(y)) = β(x)β(y)C∑

< γx(c), γy(c) > (6)

where β(·) is a normalization term computed by β(x) =358

c=1 < γx(c), γx(c) >)−12 . Equation (6) refers to cosine sim-359

ilarity by accumulating the scalar products of normalized pooled360

features for each feature map. HNIP descriptors can be further361

improved by post-processing techniques such as PCA whiten-362

ing [16], [18]. In this work, The global descriptor is firstly L2363

normalized, followed by PCA projection and whitening with a364

pre-trained PCA matrix. The whitened vectors are L2 normal-365

ized and compared with (6).366

Subsequently, we further investigate HNIP with more details.367

In Section IV-B, inspired by a recent invariance theory [59],368

HNIP is proven to be approximately invariant to translation,369

scale and rotation transformations, which is independent of370

the statistical moments chosen in the nested pooling stages.371

In Section IV-C, both quantitative and qualitative evaluations372

are presented to illustrate the invariance properties. Moreover,373

we observe the statistical moments in HNIP drastically affect374

video matching performance. Our empirical results show that375

the optimal nested pooling moments correspond to: nt = 2 as a376

second order moment, ns first-order and nr of infinite order.377

B. Theory on Transformation Invariance378

Invariance theory in a nutshell: Recently, Anselmi and379

Poggio [59] proposed an invariance theory exploring how signal380

(e.g., image) representations are invariant to various transforma-381

tions. Denote f(x) as the representation for image x, f(x) is382

invariant to transformation g (e.g., rotation) if f(x) = f(g · x)383

is hold ∀g ∈ G, where we define the orbit of x by a transfor-384

mation group G as Ox = {g · x|g ∈ G}. It can be easily shown385

that Ox is globally invariant to the action of any element of 386

G and thus any descriptor computed directly from Ox will be 387

globally invariant to G. 388

More specifically, the invariant descriptor f(x) can be 389

derived in two stages. First, given a predefined template t (e.g., 390

convolutional filter in CNN), we compute the dot products of 391

t over the orbit: Dx,t = {< g · x, t >∈ R|g ∈ G}. Second, the 392

extracted invariant descriptor should be a histogram represen- 393

tation of the distribution Dx,t with a specific bin configuration, 394

for example, the statistical moments (e.g., mean, max, standard 395

deviation, etc.) derived from Dx,t . Such a representation is 396

mathematically proven to have proper invariance property for 397

transformations such as translations (Gt), scales (Gs) and 398

rotations (Gr ). One may note that the transformation g can 399

be applied either on the image or the template indifferently, 400

i.e., {< g · x, t >=< x, g · t > |g ∈ G}. Recent work on face 401

verification [60] and music classification [61] successfully 402

applied this theory to practical applications. More details about 403

invariance theory are referred to [59]. 404

An example: translation invariance of CNN: The convolution- 405

pooling operations in CNN are compliant with the invari- 406

ance theory. Existing well-known CNN architectures like 407

AlexNet [19] and VGG16 [20] share a common building block: 408

a succession of convolution and pooling operations, which in 409

fact provides a way to incorporate local translation invariance. 410

As shown in Fig. 2(b), convolution operation on translated im- 411

age patches (i.e., sliding windows) is equivalent to < g · x, t >, 412

and max pooling operation is in line with the statistical his- 413

togram computation over the distribution of the dot products 414

(i.e., feature maps). For instance, considering a convolutional 415

filter learned “cat face”, the filter would always response to an 416

image depicted cat face no matter where is the face located in 417

the image. Subsequently, max pooling over the activation fea- 418

ture maps captures the most salient feature of the cat face, which 419

is naturally invariant to object translation. 420

Incorporating scale and rotation invariances into CNN: 421

Based on the already locally translation invariant feature maps 422

(e.g., the last pooling layer, pool5), we propose to further im- 423

IEEE P

Fig. 3. Comparison of pooled descriptors invariant to (a) rotation and (b)scale changes of query images, measured by retrieval accuracy (mAP) on theHolidays data set. fc6 layer of VGG16 [20] pretrained on ImageNet datasetis used.

prove the invariance of pool5 CNN descriptors by incorporating424

global invariance to several transformation groups. The spe-425

cific transformation groups considered in this study are scales426

GS and rotations GR . As one can see, it is impractical to gen-427

erate all transformations g · x for the orbit Ox. In addition,428

the computational complexity of feedforward pass in CNN in-429

creases linearly with the number of transformed version of the430

input x. For practical consideration, we simplify the orbit by431

a finite set of transformations (e.g., # of rotations R = 4, # of432

scales S = 3). This results in HNIP descriptors approximately433

invariant to transformations, without huge increase in feature434

extraction time.435

An interesting aspect of the invariance theory is the possibility436

in practice to chain multiple types of group invariances one after437

the other as already demonstrated in [61]. In this study, we con-438

struct descriptors invariant to several transformation groups by439

progressively applying the method to different transformation440

groups as shown in (2)–(4).441

Discussions: While there is theoretical guarantee in the scale-442

and rotation-invariance of handcrafted local feature detectors443

such as DoG, classical CNN architectures lack invariance to444

these geometric transformations [62]. Many works have pro-445

posed to encode transformation invariances into both hand-446

crafted (e.g., BoW built on dense sampled SIFT [63]) and CNN447

representations [64], by explicitly augmenting input images with448

rotation and scale transformations. Our HNIP takes similar idea449

of image augmentation, but has several significant differences.450

First, rather than a single pooling (max or average) layer over451

all transformations, HNIP progressively pools features together452

across different transformations with different moments, which453

is essential for significantly improving the quality of pooled454

CNN descriptors. Second, unlike previous empirical studies,455

we have attempted to mathematically show that the design of456

nested pooling ensures that HNIP is approximately invariant to457

multiple transformations, which is inspired by the recently de-458

veloped invariance theory. Third, to the best of our knowledge,459

this work is the first to comprehensively analyze invariant prop-460

erties of CNN descriptors, in the context of large scale video461

matching and retrieval.462

C. Quantitative and Qualitative Evaluations463

Transformation invariance: In this section, we propose a464

database-side data augmentation strategy for image retrieval465

Fig. 4. Distances for three matching pairs from the MPEG CDVA dataset (seeSection VI-A for more details). For each pair, three pairwise distances (L2normalized) are computed by progressively encoding translations (Gt ), scales(Gt + Gs ), and rotations (Gt + Gs + Gr ) into the nested pooling stages.Average pooling is used for all transformations. Feature maps are extractedfrom the last pooling layer of pretrained VGG16.

to study rotation and scale invariance properties, respectively. 466

For simplicity, we represent an image as a 4096 dimensional 467

descriptor extracted from the first fully connected layer (fc6) 468

of VGG16 [20] pre-trained on ImageNet dataset. We report re- 469

trieval results in terms of mean Average Precision (mAP) on the 470

INRIA Holidays dataset [65] (500 query images, 991 reference 471

images). 472

Fig. 3(a) investigates the invariance property with respect to 473

query rotations. First, we observe that the retrieval performance 474

drops significantly as we rotate query images when fixing the 475

reference images (the red curve). To gain invariance to query 476

rotations, we rotate each reference image within a range of 477

−180◦ to 180◦, with the step of 10◦. The fc6 features for its 478

36 rotated images are pooled together into one common global 479

descriptor representation, with either max or average pooling. 480

We observe that the performance is relatively consistent (blue for 481

max pooling, green for average pooling) against the rotation of 482

query images. Moreover, performance in terms of the variations 483

of the query image scale is plotted in Fig. 3(b). It is observed 484

that the database-side augmentation by max or average pooling 485

over scale changes (scale ratio of 0.75, 0.5, 0.375, 0.25, 0.2 and 486

0.125) can improve the performance when query scale is small 487

(e.g., 0.125). 488

Nesting multiple transformations: We further analyze nested 489

pooling over multiple transformations. Fig. 4 provides an insight 490

on how progressively adding different types of transformations 491

affect the matching distance on different image matching pairs. 492

We can see the reduction in matching distance with the incor- 493

poration of each new transformation group. Fig. 5 takes a closer 494

look at pairwise similarity maps between local deep features 495

of query keyframes and the global deep descriptors of refer- 496

ence keyframes, which explicitly reflects the regions of query 497

keyframe significantly contributing to similarity measurement. 498

We compare our HNIP (the third heatmap at each row) to the 499

state-of-the-art deep descriptors MAC [18] and R-MAC [18]. 500

Because of the introduction of scale and rotation transforma- 501

IEEE P

Fig. 5. Example similarity maps between local deep features of query keyframes and the global deep descriptors of reference keyframes, using off-the-shelfVGG16. For query (left image) and reference (right image) keyframe pair at each row, the middle three similarity maps are from MAC, R-MAC, and HNIP (fromleft to right), respectively. Each similarity map is generated by cosine similarity between query local features at each feature map location and the pooled globaldescriptors for reference keyframe (i.e., MAC, R-MAC, or HNIP), which allows locating the regions of query keyframe contributing most to the pairwise similarity.

TABLE IPAIRWISE VIDEO MATCHING BETWEEN MATCHING AND NON-MATCHING

VIDEO DATASETS, WITH POOLING CARDINALITY INCREASED

BY PROGRESSIVELY ENCODING TRANSLATION, SCALE, AND

ROTATION TRANSFORMATIONS, FOR DIFFERENT POOLING

STRATEGIES, I.E., MAX-MAX-MAX, AVG-AVG-AVG,AND OUR HNIP (SQU+AVG+MAX)

Gt Gt -Gs Gt -Gs -Gr

Max 71.9 Max-Max 72.8 Max-Max-Max 73.9Avg 76.9 Avg-Avg 79.2 Avg-Avg-Avg 82.2Squ 81.6 Squ-Avg 82.7 Squ-Avg-Max 84.3

TABLE IISTATISTICS ON THE NUMBER OF RELEVANT DATABASE VIDEOS RETURNED

IN TOP 100 LIST (I.E., RECALL@100) FOR ALL QUERY VIDEOS IN THE

MPEG CDVA DATASET (SEE SECTION VI-A FOR MORE DETAILS)

Landmarks Scenes Objects

HNIP \ CXM 8143 1477 1218CXM \ HNIP 1052 105 1834

“A \ B” represents relevant instances are successfully re-trieved by method A but missed in the list generated bymethod B. The last pooling layer of pretrained VGG16 isused for HNIP.

tions, HNIP is able to locate the query object of interest re-502

sponsible for similarity measures more precisely than MAC503

and R-MAC, though there are scale and rotation changes be-504

tween query-reference pairs. Moreover, as shown in Table I,505

quantitative results on video matching by HNIP with progres-506

sively encoded multiple transformations provide additional pos-507

itive evidence for the nested invariance property.508

Pooling moments: In Fig. 3, it is interesting to note that any509

choice of pooling moment n in the pooling stage can produce510

invariant descriptors. However, the discriminability of NIP with511

varied pooling moments could be quite different. For video512

retrieval, we empirically observe that pooling with hybrid mo-513

ments works well for NIP, e.g., starting with square-root pool-514

ing (nt = 2) for translations and average pooling (ns = 1) for515

scales, and ending with max pooling (nr → +∞) for rotations.516

Here, we present an empirical analysis of how pooling moments517

affect pairwise video matching performance.518

We construct matching and non-matching video set from the 519

MPEG CDVA dataset. Both sets contain 4690 video pairs. With 520

input video keyframe size 640 × 480, feature maps of size 20 × 521

15 are extracted from the last pooling layer of VGG16 [20] pre- 522

trained on ImageNet dataset. For transformations, we consider 523

nested pooling by progressively adding transformations with 524

translations (Gt), scales (Gt-Gs) and rotations (Gt-Gs-Gr ). 525

For pooling moments, we evaluate Max-Max-Max, Avg-Avg- 526

Avg and our HNIP (i.e., Squ-Avg-Max). Finally, video similarity 527

is computed using (1) with pooled features. 528

Table I reports pairwise matching performance in terms 529

of True Positive Rate with False Positive Rate set to 1%, 530

with transformations switching from Gt to Gt-Gs-Gr , for 531

Max-Max-Max, Avg-Avg-Avg and HNIP. As more transfor- 532

mations nested in, the separability between matching and 533

non-matching video sets becomes larger, regardless of the 534

pooling moments used. More importantly, HNIP performs the 535

best compared to Max-Max-Max and Avg-Avg-Avg, while 536

Max-Max-Max is the worst. For instance, HNIP outperforms 537

Avg-Avg-Avg, i.e., 84.3% vs. 82.2%. 538

V. COMBINING DEEP AND HANDCRAFTED DESCRIPTORS 539

In this section, we analyze the strength and weakness of deep 540

features in the context of video retrieval and matching, com- 541

pared to state-of-the-art handcrafted descriptors built upon local 542

invariant features (SIFT). To this end, we respectively compute 543

statistics of HNIP and CDVA handcrafted descriptors (CXM) 544

by retrieving different types of video data. In particular, we fo- 545

cus on videos depicting landmarks, scenes and common objects, 546

collected by MPEG CDVA. Here we describe how to compute 547

the statistics. First, for each query video, we retrieve top 100 548

most similar database videos using HNIP and CXM, respec- 549

tively. Second, for all queries from each type of video data, we 550

accumulate the number of relevant database videos (1) retrieved 551

by HNIP but missed by CXM (denoted as HNIP \ CXM), and 552

(2) retrieved by CXM but missed by HNIP (CXM \ HNIP). 553

The statistics are presented in Table II. As one can see, com- 554

pared to handcrafted descriptors CXM, HNIP is able to identify 555

much more relevant landmark and scene videos in which CXM 556

fails. On the other hand, CXM recalls more videos depicting 557

common objects than HNIP. Fig. 6 shows qualitative examples 558

IEEE P

Fig. 6. Examples of keyframe pairs in which (a) HNIP determines as matching but CXM as non-matching (HNIP \ CXM) and (b) CXM determines as matchingbut HNIP as non-matching (CXM \ HNIP).

of keyframe pairs corresponding to HNIP \ CXM and CXM \559

HNIP, respectively.560

Fig. 7 further visualizes intermediate keyframe matching re-561

sults produced by handcrafted and deep features, respectively.562

Despite viewpoint change of the landmark images in Fig. 7(a),563

the salient features fired in their activation maps are spatially564

consistent. Similar observations exist in indoor scene images in565

Fig. 7(b). These observations are probably attributed to deep566

descriptors are excelling in characterizing global salient fea-567

tures. On the other hand, handcrafted descriptors work on local568

patches detected at sparse interest points, which prefer rich tex-569

tured blobs [Fig. 7(c)] rather than lower textured ones [Fig. 7(a)570

and 7(b)]. This may explain why there are more inlier matches571

found by GCC for the product images in Fig. 7(a). Finally, com-572

pared to approximate scale and rotation invariances provided573

by HNIP analyzed in the previous section, handcrafted local574

features have with built-in mechanism to ensure nearly exact575

invariances to these transformations of rigid object in the 2D576

plane, and examples can be found in Fig. 6(b).577

In summary, these observations reveal that the deep learn-578

ing features may not always outperform handcrafted features.579

There may exist complementary effects between CNN deep de-580

scriptors and handcrafted descriptors. Therefore, we propose to581

leverage the benefits of both deep and handcrafted descriptors.582

Considering that handcrafted descriptors are categorized into583

local and global ones, we investigate the combination of deep584

descriptors with either handcrafted local or global descriptors,585

respectively.586

Combining HNIP with handcrafted local descriptors: For587

matching, if the HNIP matching score exceeds a threshold, then588

we use handcrafted local descriptors for verification. For re-589

trieval, HNIP matching score is used to select the top 500 590

candidates list, then we use handcrafted local descriptors for 591

reranking. 592

Combining HNIP and handcrafted global descriptors: In- 593

stead of simply concatenating the HNIP derived deep descrip- 594

tors and handcrafted descriptors, for both matching and retrieval, 595

the similarity score is defined as the weighted sum of matching 596

scores of HNIP and handcrafted global descriptors 597

k(x,y) = α · kc(x,y) + (1 − α) · kh(x,y) (7)

where α is the weighting factor. kc and kh represent the matching 598

score of HNIP and handcrafted descriptors, respectively. In this 599

work, α is empirically set to 0.75. 600

VI. EXPERIMENTAL RESULTS 601

A. Datasets and Evaluation Metrics 602

Datasets: MPEG CDVA ad-hoc group collects large scale 603

diverse video dataset to evaluate the effectiveness of video 604

descriptors for video matching, localization and retrieval ap- 605

plications, with resource constraints including descriptor size, 606

extraction time and matching complexity. This CDVA dataset2 607

is diversified to contain views of 1) stationary large objects, e.g., 608

buildings, landmarks (most likely background objects, possibly 609

partially occluded or a close-up), 2) generally smaller items 610

(e.g., paintings, books, CD covers, products) which typically 611

2MPEG CDVA dataset and evaluation framework are available upon re-quest at http://www.cldatlas.com/cdva/dataset.html. CDVA standard documentsare available at http://mpeg.chiariglione.org/standards/exploration/compact-descriptors-video-analysis.

IEEE P

Fig. 7. Keyframe matching examples which illustrate the strength and weakness of CNN based deep descriptors and handcrafted descriptors. In (a) and (b), deepdescriptors perform well but handcrafted ones fail, while (c) is the opposite.

TABLE IIISTATISTICS ON THE MPEG CDVA BENCHMARK DATASETS

IoI: items of interest. q.v. : query videos. r.v.: reference videos.

appear in front of background scenes, possibly occluded and 3)612

scenes (e.g., interior scenes, natural scenes, multi-camera shots,613

etc.). CDVA dataset also comprises planar or non-planar, rigid614

or partially rigid, textured or partially textured objects (scenes),615

which are captured from different view-points with different616

camera parameters and lighting conditions.617

Specifically, MPEG CDVA dataset contains 9974 query and618

5127 reference videos (denoted as All), depicting 796 items of619

interest in which 489 large landmarks (e.g., buildings), 71 scenes620

(e.g., interior or natural scenes) and 236 small common objects621

(e.g., paintings, books, products). The videos have durations622

from 1 sec to 1+ min. To evaluate video retrieval on different623

types of video data, we categorize query videos into Landmarks624

(5224 queries), Scenes (915 queries) and Objects (3835 queries).625

Table III summaries the numbers of items of interest and their626

instances for each category. Fig. 8 shows some example video627

clips from the three categories.628

To evaluate the performance of large scale video retrieval,629

we combine the reference videos with a set of user-generated630

and broadcast videos as distractors, which consist of content631

unrelated to the items of interest. There are 14537 distractor632

videos with more than 1000 hours data.633

Moreover, to evaluate pairwise video matching and tempo-634

ral localization, 4693 matching video pairs and 46911 non-635

matching video pairs are constructed from query and reference636

videos. Temporal location of items of interest within each video637

pair is annotated as the ground truth.638

We also evaluate our method on image retrieval bench- 639

mark datasets. INRIA Holidays, [65] dataset is composed of 640

1491 high-resolution (e.g., 2048 × 1536) scene-centric images, 641

500 of them are queries. This dataset includes a large variety 642

of outdoor scene/object types: natural, man-made, water and 643

fire effects. We evaluate the rotated version of Holidays [13], 644

where all images are with up-right orientation. Oxford5k, [66] 645

is buildings datasets consisting of 5062 images, mainly with 646

size 1024 × 768. There are 55 queries composed of 11 land- 647

marks, each represented by 5 queries. We use the provided 648

bounding boxes to crop query images. The University of Ken- 649

tucky Benchmark (UKBench) [67] consists of 10200 VGA size 650

images, organized into 2550 groups of common objects, each 651

object represented by 4 images. All 10200 images are serving as 652

queries. 653

Evaluation metrics: Retrieval performance is evaluated by 654

mean Average Precision (mAP) and precision at a given cut- 655

off rank R for query videos (Precisian@R), and we set R = 656

100 following MPEG CDVA standard. Pairwise video matching 657

performance is evaluated by Receiver Operating Characteristic 658

(ROC) curve. We also report pairwise matching results in terms 659

of True Positive Rate (TPR), given False Positive Rate (FPR) 660

equals to 1%. In case a video pair is predicted as a match, 661

temporal location of the item of interest is further identified 662

within the video pair. The localization accuracy is measured 663

by Jaccard Index: [T s t a r t ,Te n d ]⋂

[T ′s t a r t ,T

′e n d ]

[T s t a r t ,Te n d ]⋃

[T ′s t a r t ,T

′e n d ] , where [Tstart , Tend ] 664

denotes the ground truth and [T ′start , T

′end ] denotes the predicted 665

start and end frame timestamps. 666

Besides these accuracy measurements, we also measure the 667

complexity of algorithms, including descriptor size, transmis- 668

sion bit rate, extraction time and search time. In particular, 669

transmission bit rate is measured by (# query keyframes) × 670

(descriptor size in Bytes) / (query durations in seconds). 671

B. Implementation Details 672

In this work, we build HNIP descriptors with two widely 673

used CNN architectures : AlexNet [19] and VGG16 [20]. We 674

test off-the-shelf networks pre-trained on ImageNet ILSVRC 675

classification data set. In particular, we crop the networks to the 676

IEEE P

Fig. 8. Example video clips from the CDVA dataset.

TABLE IVVIDEO RETRIEVAL COMPARISON (MAP) BY PROGRESSIVELY ADDING

TRANSFORMATIONS (TRANSLATION, SCALE, ROTATION) INTO NIP(OFF-THE-SHELF VGG16 IS USED FOR ALL TEST DATASETS)

Transf. size (kB/k.f.) Landmarks Scenes Objects All

Gt 2 64.0 82.9 64.8 66.0Gt -Gs 2 65.3 82.4 67.3 67.6Gt -Gs -Gr 2 64.6 82.7 72.2 69.2

Average pooling is applied to all transformations. No PCA whitening is performed.kB/k.f.: descriptor size per key frame. The best results are highlighted in bold.

last pooling layer (i.e., pool5). We resize all video keyframes677

to 640×480 and Holidays (Oxford5k) images to 1024×768678

as the inputs of CNN for descriptor extraction. Finally, post-679

processing can be applied to the pooled descriptors like HNIP680

and R-MAC [18]. Following the standard practice, we choose681

PCA whitening in this work. We randomly sample 40K frames682

from the distractor videos for PCA learning. These experimen-683

tal setups are applied to both HNIP and state-of-the-art deep684

pooled descriptors like MAC [18], SPoC [16], CroW [17] and685

R-MAC [18].686

We also compare HNIP with the MPEG CXM, which is cur-687

rent state-of-the-art handcrafted compact descriptors for video688

analysis. Following the practice in CDVA standard, we em-689

ploy OpenMP to perform parallel retrieval for both CXM and690

deep global descriptors. Experiments are conducted on Tianhe691

HPC platform, where each node is equipped with 2 processors692

(24 cores, Xeon E5-2692) @2.2GHZ, and 64 GB RAM. For693

CNN feature maps extraction, we use NVIDIA Tesla K80 GPU.694

C. Evaluations on HNIP variants695

We perform video retrieval experiments to assess the effect of696

transformations and pooling moments in HNIP pipeline, using697

off-the-shelf VGG16.698

Transformations: Table IV studies the influence of pooling699

cardinalities by progressively adding transformations into the700

nested pooling stages. We simply apply average pooling to all701

transformations. First, dimensions of all NIP variants are 512702

for VGG16, resulting in descriptor size 2 kB per keyframe for703

floating point vectors (4 bytes each element). Second, over-704

all, retrieval performance (mAP) increases as more transfor-705

mations nested in the pooled descriptors, e.g., from 66.0% for706

Gt to 69.2% for Gt-Gs -Gr on the full test dataset (All). Also,707

we observe that Gt-Gs-Gr outperforms Gt-Gs and Gt by a708

large margin on Objects, while achieving comparable perfor-709

TABLE VVIDEO RETRIEVAL COMPARISON (MAP) OF NIP WITH DIFFERENT POOLING

MOMENTS (OFF-THE-SHELF VGG16 IS USED FOR ALL TEST DATASETS)

Pool Op. size (kB/k.f.) Landmarks Scenes Objects All

Max-Max-Max 2 63.2 82.9 69.9 67.6Avg-Avg-Avg 2 64.6 82.7 72.2 69.2Squ-Squ-Squ 2 54.2 65.6 66.5 60.0Max-Squ-Avg 2 60.6 80.7 74.4 67.8Hybrid (HNIP) 2 70.0 88.2 80.9 75.9

Transformations are Gt -Gs -Gr for all experiments. No PCA whitening is performed.The best results are highlighted in bold.

mance on Landmarks and Scenes. Revisiting the analysis of ro- 710

tation invariance pooling on the scene-centric Holidays dataset 711

in Fig. 3(a), though invariance to query rotation changes can 712

be gained by database-side augmented pooling, one may note 713

that its retrieval performance is comparable to the one without 714

rotating query and reference images (i.e., the peak value of the 715

red curve). These observations are probably because there are 716

relatively limited rotation (scale) changes for videos depicting 717

large landmarks or scenes, compared to small common objects. 718

More examples can be found in Fig. 8. 719

Hybrid pooling moments: Table V explores the effects of 720

pooling moments within NIP. Transformations are fixed as Gt- 721

Gs-Gr . There are 33 = 27 possible combinations of pooling 722

moments in HNIP. For simplicity, we compare our Hybrid NIP 723

(i.e., Squ-Avg-Max) to two widely-used pooling strategies (i.e., 724

max or average pooling across all transformations) and another 725

two schemes: square-root pooling across all transformations 726

and Max-Squ-Avg which decreases pooling moment along the 727

way. First, for uniform pooling, Avg-Avg-Avg is overall supe- 728

rior over Max-Max-Max and Squ-Squ-Squ, while Squ-Squ-Squ 729

performs much worse than the other two. Second, HNIP out- 730

performs the optimal uniform pooling Avg-Avg-Avg by a large 731

margin. For instance, the gains over Avg-Avg-Avg are +5.4%, 732

+5.5% and +8.7% on Landmarks, Scenes and Objects, respec- 733

tively. Finally, for hybrid pooling, HNIP performs significantly 734

better than Max-Squ-Avg over all test datasets. We observe 735

similar trends when comparing HNIP to other hybrid pooling 736

combinations. 737

D. Comparisons Between HNIP and State-of-the-Art Deep 738

Descriptors 739

Previous experiments show that the integration of transfor- 740

mations and hybrid pooling moments offers remarkable video 741

IEEE P

TABLE VIVIDEO RETRIEVAL COMPARISON OF HNIP WITH STATE-OF-THE-ART DEEP DESCRIPTORS

IN TERMS OF MAP (OFF-THE-SHELF VGG16 IS USED FOR ALL TEST DATASETS)

method size (kB/k.f.) extra. time (s/k.f.) Landmarks Scenes Objects All

w/o PCAW w/ PCAW w/o PCAW w/ PCAW w/o PCAW w/ PCAW w/o PCAW w/ PCAW

MAC [18] 2 0.32 57.8 61.9 77.4 76.2 70.0 71.8 64.3 67.0SPoC [16] 2 0.32 64.0 69.1 82.9 84.0 64.8 70.3 66.0 70.9CroW [17] 2 0.32 62.3 63.9 79.2 78.4 71.9 72.0 67.5 68.3R-MAC [18] 2 0.32 69.4 74.6 84.4 87.3 73.8 78.2 72.5 77.1HNIP (Ours) 2 0.96 70.0 74.8 88.2 90.1 80.9 85.0 75.9 80.1

We implement MAC [18], SPoC [16], CroW [17], R-MAC [18] based on the source codes released by the authors, while following the same experimental setups as our HNIP. Thebest results are highlighted in bold.

TABLE VIIEFFECT OF THE NUMBER OF DETECTED VIDEO KEYFRAMES ON DESCRIPTOR TRANSMISSION BIT RATE,

RETRIEVAL PERFORMANCE (MAP), AND SEARCH TIME, ON THE FULL TEST DATASET (ALL)

# query k.f. # DB k.f. method size (kB/k.f.) bit rate (Bps) mAP search time (s/q.v.)

∼140K (1.6%) ∼105K (2.4%) CXM ∼4 2840 73.6 12.4AlexNet-HNIP 1 459 71.4 1.6

VGG16-HNIP 2 918 80.1 2.3∼175K (2.0%) ∼132K (3.0%) CXM ∼4 3463 74.3 16.6

AlexNet-HNIP 1 571 71.9 2.0VGG16-HNIP 2 1143 80.6 2.8

∼231K (2.7%) ∼176K (3.9%) CXM ∼4 4494 74.6 21.0AlexNet-HNIP 1 759 71.9 2.2VGG16-HNIP 2 1518 80.7 3.1

We report performance of state-of-the-art handcrafted descriptors (CXM), and PCA whitened HNIP with both off-the-shelf AlexNet andVGG16. Numbers in bracket denote the percentage of detected keyframes from raw videos. Bps: bytes per second. s/q.v.: seconds perquery video.

TABLE VIIIVIDEO RETRIEVAL COMPARISON OF HNIP WITH STATE-OF-THE-ART

HANDCRAFTED DESCRIPTORS (CXM), FOR ALL TEST DATASETS

method Landmarks Scenes Objects All

CXM 61.4/60.9 63.0/61.9 92.6/91.2 73.6/72.6AlexNet-HNIP 65.2/62.3 77.6/74.1 78.4/74.5 71.4/68.1VGG16-HNIP 74.8/71.6 90.1/86.6 85.0/81.3 80.1/76.7

The former (latter) number in each cell represents performance in terms of mAP(Precision@R). We report performance of PCA whitened HNIP with both off-the-shelf AlexNet and VGG16. The best results are highlighted in bold.

retrieval performance improvements. Here, we conduct another742

round of video retrieval experiments to validate the effective-743

ness of our optimal reported HNIP, compared to state-of-the-art744

deep descriptors [16]–[18].745

Effect of PCA whitening: Table VI studies the effect of PCA746

whitening on different deep descriptors in video retrieval per-747

formance (mAP), using off-the-shelf VGG16. Overall, PCA748

whitened descriptors perform better than their counterparts749

without PCA whitening. More specifically, the improvements750

on SPoC, R-MAC and our HNIP are much larger than MAC751

and CroW. In view of this, we apply PCA whitening to HNIP in752

the following sections.753

HNIP versus MAC, SPoC, CroW and R-MAC: Table VI754

presents the comparison of HNIP against state-of-the-art deep755

descriptors. We observe that HNIP obtains consistently better756

performance than other approaches on all test datasets, at the cost 757

of extra extraction time.3 HNIP significantly improves the re- 758

trieval performance over MAC [18], SPoC [16] and CroW [17], 759

e.g., over 10% in mAP on the full test dataset (All). Compared 760

with the state-of-the-art R-MAC [18], +7% mAP improvement 761

is achieved on Objects, which is mainly attributed to the im- 762

proved robustness against the rotation changes in videos (the 763

keyframes capture small objects from different angles). 764

E. Comparisons Between HNIP and Handcrafted Descriptors 765

In this section, we first study the influences of the number 766

of detected video keyframes on descriptor transmission bit rate, 767

retrieval performance and search time. Then, with keyframes 768

fixed, we compare HNIP to the state-of-the-art compact hand- 769

crafted descriptors (CXM), which currently obtains the optimal 770

video retrieval performance in MPEG CDVA datasets. 771

Effect of the number of detected video keyframes: As shown 772

in Table VII, we generate three keyframe detection config- 773

urations by varying the detection parameters. Also, we test 774

3The extraction time of deep descriptors is mainly decomposed into 1) feed-forward pass for extracting feature maps and 2) pooling over feature mapsfollowed by post-processing such as PCA whitening. In our implementationbased on MatConvNet, the first stage takes 0.21 seconds per keyframe (VGAsize input image to VGG16 executed on a NVIDIA Tesla K80 GPU); HNIP isfour times slower (∼0.84) as there are four rotations for each keyframe. Thesecond stage is ∼0.11 seconds for MAC, SPoC and CroW, ∼0.115 seconds forR-MAC and ∼0.12 seconds for HNIP. Therefore, the extraction time of HNIPis roughly three times as much as others.

IEEE P

Fig. 9. Pairwise video matching comparison of HNIP with state-of-the-art handcrafted descriptors (CXM) in terms of ROC curve, for all test datasets. Experimentalsettings are identical with those in Table VIII.

their retrieval performance and complexity for CXM descriptors775

(∼4 kB per keyframe) and PCA whitened HNIP with both off-776

the-shelf AlexNet (1 kB per keyframe) and VGG16 (2 kB per777

keyframe), on the full test dataset (All). It is easy to find that778

descriptor transmission bit rate and search time increase pro-779

portionally with the number of detected keyframes. However,780

there is little retrieval performance gain for all descriptors, i.e.,781

less than 1% in mAP. Thus, we consider the first configuration782

throughout this work, which achieves a good tradeoff between783

accuracy and complexities. For instance, mAP for VGG16-784

HNIP is 80.1% when the descriptor transmission bit rate is785

only 2840 Bytes per second.786

Video retrieval: Table VIII shows the video retrieval com-787

parison of HNIP with handcrafted descriptors CXM on all788

test datasets. Overall, AlexNet-HNIP is inferior to CXM,789

while VGG16-HNIP performs the best. Second, HNIP with790

both AlexNet and VGG16 outperform CXM on Landmarks791

and Scenes. The performance gap between HNIP and CXM792

becomes larger as network goes deeper from AlexNet to793

VGG16, e.g., AlexNet-HNIP and VGG16-HNIP improve CXM794

by 3.8% and 13.4% in mAP on Landmarks, respectively. Third,795

we observe AlexNet-HNIP performs much worse than CXM on796

Objects (e.g., 74.5% vs. 91.2% in Precision@R). VGG16-HNIP797

reduces the gap, but still underperforms CXM. This is reason-798

able as handcrafted descriptors based on SIFT are more robust799

to scale and rotation changes of rigid objects in the 2D plane.800

Video pairwise matching and localization: Fig. 9 and801

Table IX further show pairwise video matching and temporal802

localization performance of HNIP and CXM on all test datasets,803

respectively. For pairwise video matching, VGG16-HNIP and804

AlexNet-HNIP consistently outperform CXM in terms of TPR805

for varied FPR on Landmarks and Scenes. In Table IX, we806

observe the performance trends of temporal localization are807

roughly consistent with pairwise video matching.808

One may note that the localization accuracy of CXM is809

worse than HNIP on Objects (see Table IX), but CXM ob-810

tains much better video retrieval mAP than HNIP on Objects811

(see Table VIII). First, given a query-reference video pair, video812

retrieval tries to identify the most similar keyframe pair, but813

temporal localization aims to locate multiple keyframe pairs by814

comparing with a predefined threshold. Second, as shown in815

Fig. 9, CXM achieves better TPR (Recall) than both VGG16-816

HNIP and AlexNet-HNIP on Objects when FPR is small (e.g.,817

TABLE IXVIDEO LOCALIZATION COMPARISON OF HNIP WITH STATE-OF-THE-ART

HANDCRAFTED DESCRIPTORS (CXM) IN TERMS

OF JACCARD INDEX, FOR ALL TEST DATASETS

CXM 45.5 45.9 68.8 54.4AlexNet-HNIP 48.9 63.0 67.3 57.1VGG16-HNIP 50.8 63.8 71.2 59.7

Experimental settings are the same as in Table VIII. The best resultsare highlighted in bold.

TABLE XIMAGE RETRIEVAL COMPARISON (MAP) OF HNIP WITH STATE-OF-THE-ART

DEEP AND HANDCRAFTED DESCRIPTORS (CXM),ON HOLIDAYS, OXFORD5K, AND UKBENCH

method Holidays Oxford5k UKbench

CXM 71.2 43.5 3.46MAC [18] 78.3 56.1 3.65SPoC [16] 84.5 68.6 3.68R-MAC [18] 87.2 67.6 3.73HNIP (Ours) 88.9 69.3 3.90

We report performance of PCA whitened deep descriptorswith off-the-shelf VGG16. The best results are highlightedin bold.

FPR = 1%), and TPR becomes worse as FPR increases. This 818

implies that 1) CXM ranks relevant videos and keyframes higher 819

than HNIP in the retrieved list for object queries, which leads to 820

better mAP on Objects when evaluating retrieval performance 821

in a small shortlist (100 in our experiments). 2) VGG16-HNIP 822

gains higher Recall than CXM when FPR becomes large, which 823

leads to higher localization accuracy on Objects. In other words, 824

towards better temporal localization, we choose a small thresh- 825

old (the corresponding FPR = 14.3% in our experiments) in 826

order to recall as many relevant keyframes as possible. 827

Image retrieval: To further verify the effectiveness of HNIP, 828

we conduct image instance retrieval experiments on scene- 829

centric Holidays, landmark-centric Oxford5k and object-centric 830

UKbench. Table X shows the comparisons of HNIP with MAC, 831

SPoC, R-MAC, and handcrafted descriptors from MPEG CDVA 832

evaluation framework. First, we observe HNIP outperforms 833

handcrafted descriptors by a large margin on all data sets. 834

IEEE P

Fig. 10. (a) and (b) Video retrieval, (c) pairwise video matching, and (d) and localization performance of the optimal reported HNIP (i.e., VGG16-HNIP)combined with either CXM local or CXM global descriptors, for all test datasets. For simplicity, we report the pairwise video matching performance in terms ofTPR given FPR = 1%.

TABLE XIVIDEO RETRIEVAL COMPARISON OF HNIP WITH CXM AND THE COMBINATION OF HNIP WITH CXM-LOCAL AND CXM-GLOBAL RESPECTIVELY,

ON THE FULL TEST DATASET (ALL), WITHOUT (“W/O D ”), or WITH (“W/ D ”) COMBINATION OF THE LARGE SCALE DISTRACTOR VIDEOS

method size (kB/k.f.) mAP Precision@R # query k.f. # DB k.f. search time (s/q.v.)

w/o D w/ D w/o D w/ D w/o D w/ D w/o D w/ D

CXM ∼4 73.6 72.1 72.6 71.2 ∼140K ∼105K ∼1.25M 12.4 38.6VGG16-HNIP 2 80.1 76.8 76.7 73.6 2.3 9.2VGG16-HNIP + CXM-Local ∼4 75.7 75.4 74.4 74.1 12.9 17.8VGG16-HNIP + CXM-Global ∼4 84.9 82.6 82.4 80.3 4.9 39.5

Second, HNIP performs significantly better than state-of-the-835

art deep descriptors R-MAC on UKbench, though it shows836

marginally better performance on Holidays. The performance837

advantage trend between HNIP and R-MAC is consistent with838

video retrieval results on CDVA-Scenes and CDVA-Objects in839

Table VI. It is again demonstrated that HNIP tends to be more840

effective on object-centric datasets compared to scene- and841

landmark-centric datasets, as object-centric datasets are usually842

with more rotation and scale distortions.843

F. Combination of HNIP and Handcrafted Descriptors844

CXM contains both compressed local descriptors845

(∼2 kB/frame) and compact global descriptors (∼2 kB/frame)846

aggregated from local ones. Following the combination strate-847

gies designed in Section V, Fig. 10 shows the effectiveness848

of combining VGG16-HNIP with either CXM-Global or849

CXM-Local descriptors,4 in video retrieval (a) & (b), matching850

(c) and localization (d). First, we observe the combination of851

VGG16-HNIP with either CXM-Global or CXM-Local consis-852

tently improves CXM across all tasks on all test datasets. In this853

regard, the improvements of VGG16-HNIP + CXM-Global are854

much larger than VGG16-HNIP + CXM-Local, especially for855

Landmarks and Scenes. Second, VGG16-HNIP + CXM-Global856

performs better on all test datasets in video retrieval, matching857

and localization (except localization accuracy on Landmarks).858

In particular, VGG16-HNIP + CXM-Global significantly859

4Here, we did not introduce the complicated combination of VGG-HNIP +CXM-Global + CXM-Local, because its performance is very close to VGG-HNIP + CXM-Local, and moreover it further increases descriptor size andsearch time, compared to VGG-HNIP + CXM-Local.

improves VGG16-HNIP on Objects in terms of mAP and 860

Precision@R (+10%). This leads us to the conclusion that 861

deep descriptors VGG16-HNIP and handcrafted descriptors 862

CXM-Global are complementary to each other. Third, we 863

observe VGG16-HNIP + CXM-Local significantly degrades 864

the performance of VGG16-HNIP on Landmarks and Scenes, 865

e.g., there is a drop of ∼10% in mAP on Landmarks. This 866

is due to the fact that matching pairs retrieved by HNIP (but 867

handcrafted features fail) cannot pass the GCC step, i.e., the 868

number of inliers (patch-level matching pairs) is less sufficient. 869

For instance, in Fig. 7, the landmark pair is determined as 870

a match by VGG16-HNIP, but the subsequent GCC step 871

considers it as a non-match because there are only 2 matched 872

patch pairs. More examples can be found in Fig. 6(a). 873

G. Large Scale Video Retrieval 874

Table XI studies video retrieval performance of CXM, 875

VGG16-HNIP, their combinations VGG16-HNIP + CXM-Local 876

and VGG16-HNIP + CXM-Global, on the full test dataset 877

(All) without or with the large scale distractor video set. By 878

combining reference videos with the large scale distractor set, 879

the number of database keyframes increases from ∼105K to 880

∼1.25M, making the search speed significantly slower. For ex- 881

ample, HNIP is ∼5 times slower with the 512-D Euclidean 882

distance computation. Further compressing HNIP into extreme 883

compact code (e.g., 256 bits) for ultra-fast Hamming distance 884

computation is highly desirable, while without incurring con- 885

siderable performance loss. We will study it in our future work. 886

Second, the performance ordering of approaches remains the 887

same on large scale experiments, i.e., VGG16-HNIP + CXM- 888

IEEE P

Global performs the best, followed by VGG16-HNIP, VGG16-889

HNIP + CXM-Local and CXM. Finally, when increasing the890

database size by 10× larger, we observe the performance loss891

is relatively small, e.g., −1.5%, −3.3%, −0.3% and −2.3% in892

mAP for CXM, VGG16-HNIP, VGG16-HNIP + CXM-Local893

and VGG16-HNIP + CXM-Global, respectively.894

VII. CONCLUSION AND DISCUSSIONS895

In this work, we propose a compact and discriminative CNN896

descriptor HNIP for video retrieval, matching and localization.897

Based on the invariance theory, HNIP is proven to be robust898

to multiple geometric transformations. More importantly, our899

empirical studies show that the statistical moments in HNIP900

dramatically affect video matching performance, which leads901

us to the design of hybrid pooling moments within HNIP. In ad-902

dition, we study the complementary nature of deep learned and903

handcrafted descriptors, then propose a strategy to combine the904

two descriptors. Experimental results demonstrate that HNIP905

descriptor significantly outperforms state-of-the-art deep and906

handcrafted descriptors, with comparable or even smaller de-907

scriptor size. Furthermore, the combination of HNIP and hand-908

crafted descriptors offers the optimal performance.909

This work provides valuable insights for the ongoing CDVA910

standardization efforts. During the recent 116th MPEG meeting911

in Oct. 2016, MPEG CDVA Ad-hoc group has adopted the pro-912

posed HNIP into core experiments [21] for investigating more913

practical issues when dealing with deep learned descriptors in914

the well-established CDVA evaluation framework. There are915

several directions for future work. First, an in-depth theoretical916

analysis on how pooling moments affect video matching perfor-917

mance is valuable to further reveal and clarify the mechanism918

of hybrid pooling, which may contribute to the invariance the-919

ory. Second, it is interesting to study how to further improve re-920

trieval performance by optimizing deep features like fine-tuning921

CNN tailored for the video retrieval task, instead of off-the-shelf922

CNNs used in this work. Third, to accelerate search speed, fur-923

ther compressing deep descriptors into extremely compact codes924

(e.g., dozens of bits) while still preserving retrieval accuracy is925

worth investigating. Last but not the least, as CNN incurs huge926

number of network model parameters (over 10 millions), how to927

effectively and efficiently compress CNN model is a promising928

direction.929

REFERENCES930

[1] Compact Descriptors for Video Analysis: Objectives, Applications and931Use Cases, ISO/IEC JTC1/SC29/WG11/N14507, 2014.932

[2] Compact Descriptors for Video Analysis: Requirements for Search Appli-933cations, ISO/IEC JTC1/SC29/WG11/N15040, 2014.934

[3] B. Girod et al., “Mobile visual search,” IEEE Signal Process. Mag.,935vol. 28, no. 4, pp. 61–76, Jul. 2011.936

[4] R. Ji et al., “Learning compact visual descriptor for low bit rate mobile937landmark search,” vol. 22, no. 3, 2011.Q1 938

[5] L.-Y. Duan et al., “Overview of the MPEG-CDVS standard,” IEEE Trans.939Image Process., vol. 25, no. 1, pp. 179–194, Jan. 2016.940

[6] Test Model 14: Compact Descriptors for Visual Search, ISO/IEC941JTC1/SC29/WG11/W15372, 2011.942

[7] Call for Proposals for Compact Descriptors for Video Analysis (CDVA)-943Search and Retrieval, ISO/IEC JTC1/SC29/WG11/N15339, 2015.944

[8] Cdva Experimentation Model (cxm) 0.2, ISO/IEC JTC1/SC29/945WG11/W16274, 2015.946

[9] F. Perronnin, Y. Liu, J. Sanchez, and H. Poirier, “Large-scale image re- 947trieval with compressed fisher vectors,” in Proc. IEEE Conf. Comput. Vis. 948Pattern Recog., Jun. 2010, pp. 3384–3391. Q2949

[10] H. Jegou, M. Douze, C. Schmid, and P. Perez, “Aggregating local descrip- 950tors into a compact image representation,” in Proc. IEEE Conf. Comput. 951Vis. Pattern Recog., Jun. 2010, pp. 3304–3311. 952

[11] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “Cnn fea- 953tures off-the-shelf: An astounding baseline for recognition,” in Proc. IEEE 954Conf. Comput. Vis. Pattern Recog. Workshops, Jun. 2014, pp. 512–519. 955

[12] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, “Multi-scale orderless pooling 956of deep convolutional activation features,” in Proc. Eur. Conf. Comput. 957Vis., 2014. 958

[13] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, “Neural codes 959for image retrieval,” in Proc. Eur. Conf. Comput. Vis., 2014. 960

[14] A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson, “A baseline for 961visual instance retrieval with deep convolutional networks,” CoRR, 2014. 962[Online]. Available: http://arxiv.org/abs/1412.6574 Q3963

[15] H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson, 964“From generic to specific deep representations for visual recognition,” in 965Proc. IEEE Conf. Comput. Vis. Pattern Recog. Workshops, 2015. 966

[16] A. Babenko and V. Lempitsky, “Aggregating local deep features for image 967retrieval,” in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1269– 9681277. 969

[17] Y. Kalantidis, C. Mellina, and S. Osindero, “Cross-dimensional weight- 970ing for aggregated deep convolutional features,” CoRR, 2015. [Online]. 971Available: http://arxiv.org/1512.04065 972

[18] G. Tolias, R. Sicre, and H. Jegou, “Particular object retrieval with inte- 973gral max-pooling of cnn activations,” CoRR, 2015. [Online]. Available: 974http://arxiv.org/abs/1511.05879 975

[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification 976with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Pro- 977cess. Syst., 2012. 978

[20] K. Simonyan and A. Zisserman, “Very deep convolutional networks 979for large-scale image recognition,” CoRR, 2014. [Online]. Available: 980http://arxiv.org/abs/1409.1556 981

[21] Description of Core Experiments in CDVA, ISO/IEC JTC1/SC29/ 982WG11/W16510, 2016. 983

[22] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” 984Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004. 985

[23] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” 986in Proc. Eur. Conf. Comput. Vis., 2006. 987

[24] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach 988to object matching in videos,” in Proc. IEEE Int. Conf. Comput. Vis., 989Oct. 2003, vol. 2, pp. 1470–1477. 990

[25] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” 991in Proc. Comput. Vis. Pattern Recog., 2006. 992

[26] H. Jegou and A. Zisserman, “Triangulation embedding and democratic 993aggregation for image search,” in Proc. IEEE Conf. Comput. Vis. Pattern 994Recog., Jun. 2014, pp. 3310–3317. 995

[27] S. S. Husain and M. Bober, “Improving large-scale image retrieval through 996robust aggregation of local descriptors,” IEEE Trans. Pattern Anal. Mach. 997Intell., to be published. 998

[28] M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift- 999invariant kernels,” in Proc. Adv. Neural Inf. Process. Syst., 2009. 1000

[29] Y. Weiss, A. Torralba, and R. Fergus, “Spectral hashing,” in Proc. Adv. 1001Neural Inf. Process. Syst., 2009. 1002

[30] V. Chandrasekhar et al.,“Transform coding of image feature descriptors,” 1003in Proc. IS&T/SPIE Electron. Imag., 2009. 1004

[31] V. Chandrasekhar et al., “CHoG: Compressed histogram of gradients a 1005low bit-rate feature descriptor,” in Proc. IEEE Conf. Comput. Vis. Pattern 1006Recog., Jun. 2009, pp. 2504–2511. 1007

[32] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest 1008neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, 1009pp. 117–128, Jan. 2011. 1010

[33] M. Calonder et al., “BRIEF: Computing a local binary descriptor 1011very fast,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, 1012pp. 1281–1298, Jul. 2012. 1013

[34] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: an efficient 1014alternative to SIFT or SURF,” in Proc. IEEE Int. Conf. Comput. Vis., 1015Nov. 2011, pp. 2564–2571. 1016

[35] S. Leutenegger, M. Chli, and R. Y. Siegwart, “BRISK: Binary robust 1017invariant scalable keypoints,” in Proc. Int. Conf. Comput. Vis., Nov. 2011, 1018pp. 2548–2555. 1019

[36] S. Zhang, Q. Tian, Q. Huang, W. Gao, and Y. Rui, “USB: Ultrashort 1020binary descriptor for fast visual matching and retrieval,” IEEE Trans. 1021Image Process., vol. 23, no. 8, pp. 3671–3683, Aug. 2014. 1022

IEEE P

[37] D. M. Chen et al., “Tree histogram coding for mobile image matching,”1023in Proc. Data Compression Conf., 2009.1024

[38] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scal-1025able image search,” in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep.-Oct.10262009, pp. 2130–2137.1027

[39] D. Chen et al., “Residual enhanced visual vector as a compact signature1028for mobile visual search,” Signal Process., vol. 93, no. 8, pp. 2316–2327,10292013.1030

[40] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies1031for accurate object detection and semantic segmentation,” in Proc. IEEE1032Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 580–587.1033

[41] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD:1034CNN architecture for weakly supervised place recognition,” in Proc. Com-1035put. Vis. Pattern Recog., Jun. 2016, pp. 5297–5307.1036

[42] F. Radenovic, G. Tolias, and O. Chum, “CNN image retrieval learns from1037BoW: Unsupervised fine-tuning with hard examples,” in Proc. Eur. Conf.1038Comput. Vis., 2016.1039

[43] A. Gordo, J. Almazan, J. Revaud, and D. Larlus, “Deep image retrieval:1040Learning global representations for image search,” in Proc. Eur. Conf.1041Comput. Vis., 2016.1042

[44] L. Baroffio, M. Cesana, A. Redondi, M. Tagliasacchi, and S. Tubaro,1043“Coding visual features extracted from video sequences,” IEEE Trans.1044Image Process., vol. 23, no. 5, pp. 2262–2276, May 2014.1045

[45] A. Redondi, L. Baroffio, M. Cesana, and M. Tagliasacchi, “Compress-1046then-analyze vs. analyze-then-compress: Two paradigms for image anal-1047ysis in visual sensor networks,” in Proc. IEEE Int. Workshop Multimedia1048Signal Process., Sep.-Oct. 2013, pp. 278–282.1049

[46] L. Baroffio, J. Ascenso, M. Cesana, A. Redondi, and M. Tagliasacchi,1050“Coding binary local features extracted from video sequences,” in Proc.1051IEEE Int. Conf. Image Process., Oct. 2014, pp. 2794–2798.1052

[47] M. Makar, V. Chandrasekhar, S. Tsai, D. Chen, and B. Girod, “Interframe1053coding of feature descriptors for mobile augmented reality,” IEEE Trans.1054Image Process., vol. 23, no. 8, pp. 3352–3367, Aug. 2014.1055

[48] J. Chao and E. G. Steinbach, “Keypoint encoding for improved feature ex-1056traction from compressed video at low bitrates,” IEEE Trans. Multimedia,1057vol. 18, no. 1, pp. 25–39, Jan. 2016.1058

[49] L. Baroffio et al., “Coding local and global binary visual features extracted1059from video sequences,” IEEE Trans. Image Process., vol. 24, no. 11,1060pp. 3546–3560, Nov. 2015.1061

[50] D. M. Chen, M. Makar, A. F. Araujo, and B. Girod, “Interframe coding1062of global image signatures for mobile augmented reality,” in Proc. Data1063Compression Conf., 2014.1064

[51] D. M. Chen and B. Girod, “A hybrid mobile visual search system with1065compact global signatures,” IEEE Trans. Multimedia, vol. 17, no. 7,1066pp. 1019–1030, Jul. 2015.1067

[52] C.-Z. Zhu, H. Jegou, and S. Satoh, “Nii team: Query-adaptive asymmetri-1068cal dissimilarities for instance search,” in Proc. TRECVID 2013 Workshop,1069Gaithersburg, USA, 2013.1070

[53] N. Ballas et al., “Irim at trecvid 2014: Semantic indexing and instance1071search,” in Proc. TRECVID 2014 Workshop, 2014.1072

[54] A. Araujo, J. Chaves, R. Angst, and B. Girod, “Temporal aggregation1073for large-scale query-by-image video retrieval,” in Proc. IEEE Int. Conf.1074Image Process., Sep. 2015, pp. 1519–1522.1075

[55] M. Shi, T. Furon, and H. Jegou, “A group testing framework for simi-1076larity search in high-dimensional spaces,” in Proc. 22nd ACM Int. Conf.1077Multimedia. 2014, pp. 407–416.1078

[56] J. Lin et al., “Rate-adaptive compact fisher codes for mobile vi-1079sual search,” IEEE Signal Process. Lett., vol. 21, no. 2, pp. 195–198,1080Feb. 2014.1081

[57] Z. Xu, Y. Yang, and A. G. Hauptmann, “A discriminative CNN video1082representation for event detection,” in Proc. IEEE Conf. Comput. Vis.1083Pattern Recog., Jun. 2015, pp. 1798–1807.1084

[58] L.-Y. et al., “Compact Descriptors for Video Analysis: The1085Emerging MPEG Standard,” CoRR, 2017. [Online]. Available:1086http://arxiv.org/abs/1704.081411087

[59] F. Anselmi and T. Poggio, “Representation learning in sensory cortex: A1088theory,” in Proc. Center Brains, Minds Mach., 2014.1089

[60] Q. Liao, J. Z. Leibo, and T. Poggio, “Learning invariant representations and1090applications to face verification,” in Proc. Advances Neural Inf. Process.1091Syst., Lake Tahoe, NV, 2013.1092

[61] C. Zhang, G. Evangelopoulos, S. Voinea, L. Rosasco, and T. Poggio, “A1093deep representation for invariance and music classification,” in Proc. IEEE1094Int. Conf. Acoust., Speech, Signal Process., May 2014, pp. 6984–6988.1095

[62] K. Lenc and A. Vedaldi, “Understanding image representations by mea-1096suring their equivariance and equivalence,” in Proc. IEEE Conf. Comput.1097Vis. Pattern Recog., Jun. 2015, pp. 991–995.1098

[63] J. R. R. Uijlings, A. W. M. Smeulders, and R. J. H. Scha, “Real-time 1099visual concept classification.” IEEE Trans. Multimedia, vol. 12, no. 7, 1100pp. 665–681, Nov. 2010. 1101

[64] M. Cimpoi, S. Maji, and A. Vedaldi, “Deep filter banks for texture recogni- 1102tion and segmentation.” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 1103Jun. 2015, pp. 3828–3836. 1104

[65] H. Jegou, M. Douze, and C. Schmid, “Hamming embedding and weak 1105geometric consistency for large scale image search,” in Proc. Eur. Conf. 1106Comput. Vis., 2008. 1107

[66] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval 1108with large vocabularies and fast spatial matching,” in Proc. IEEE Conf. 1109Comput. Vis. Pattern Recog., Jun. 2007, pp. 1–8. 1110

[67] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” 1111in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recog., Jun. 2006, 1112vol. 2, pp. 2161–2168. 1113

Jie Lin received the B.S. and Ph.D. degrees from the 1114School of Computer Science and Technology, Bei- 1115jing Jiaotong University, Beijing, China, in 2006 and 11162014, respectively. 1117

He is currently a Research Scientist with the In- 1118stitute of Infocomm Research, A*STAR, Singapore. 1119He was previously a visiting student in the Rapid- 1120Rich Object Search Laboratory, Nanyang Technolog- 1121ical University, Singapore, and the Institute of Digital 1122Media, Peking University, Beijing, China, from 2011 1123to 2014. His research interests include deep learning, 1124

feature coding and large-scale image/video retrieval. His work on image feature 1125coding has been recognized as core contribution to the MPEG-7 Compact De- 1126scriptors for Visual Search (CDVS) standard. 1127

Ling-Yu Duan (M’09) received the M.Sc. degree in 1129automation from the University of Science and Tech- 1130nology of China, Hefei, China, in 1999, the M.Sc. 1131degree in computer science from the National Uni- 1132versity of Singapore (NUS), Singapore, in 2002, and 1133the Ph.D. degree in information technology from 1134The University of Newcastle, Callaghan, Australia, 1135in 2008. 1136

He is currently a Full Professor with the National 1137Engineering Laboratory of Video Technology, School 1138of Electronics Engineering and Computer Science, 1139

Peking University (PKU), Beijing, China. He was the Associate Director of the 1140Rapid-Rich Object Search Laboratory, a joint lab between Nanyang Techno- 1141logical University, Singapore, and PKU since 2012. Before he joined PKU, he 1142was a Research Scientist with the Institute for Infocomm Research, Singapore, 1143from Mar. 2003 to Aug. 2008. He has authored or coauthored more than 130 1144research papers in international journals and conferences. His research interests 1145include multimedia indexing, search, and retrieval, mobile visual search, visual 1146feature coding, and video analytics. Prior to 2010, his research was basically 1147focused on multimedia (semantic) content analysis, especially in the domains 1148of broadcast sports videos and TV commercial videos. 1149

Prof. Duan was the recipient of the EURASIP Journal on Image and Video 1150Processing Best Paper Award in 2015, and the Ministry of Education Technology 1151Invention Award (First Prize) in 2016. He was a co-editor of MPEG Compact 1152Descriptor for Visual Search (CDVS) Standard (ISO/IEC 15938-13), and is a 1153Co-Chair of MPEG Compact Descriptor for Video Analytics (CDVA). His re- 1154cent major achievements focus on the topic of compact representation of visual 1155features and high-performance image search. He made significant contribution 1156in the completed MPEG-CDVS standard. The suite of CDVS technologies have 1157been successfully deployed, which impact visual search products/services of 1158leading Internet companies such as Tencent (WeChat) and Baidu (Image Search 1159Engine). 1160

IEEE P

Shiqi Wang received the B.S. degree in computer sci-1162ence from the Harbin Institute of Technology, Harbin,1163China, in 2008, and the Ph.D. degree in computer1164application technology from the Peking University,1165Beijing, China, in 2014.1166

From March 2014 to March 2016, he was a Post-1167doc Fellow with the Department of Electrical and1168Computer Engineering, University of Waterloo, Wa-1169terloo, ON, Canada. From April 2016 to Aprill 2017,1170he was with the Rapid-Rich Object Search Labora-1171tory, Nanyang Technological University, Singapore,1172

as a Research Fellow. He is currently an Assistant Professor in the Depart-1173ment of Computer Science, City University of Hong Kong, Hong Kong. He1174has proposed more than 30 technical proposals to ISO/MPEG, ITU-T, and AVS1175standards. His research interests include image/video compression, analysis and1176quality assessment.1177

Yan Bai received the B.S. degree in software en-1179gineering from Dalian University of Technology,1180Liaoning, China, in 2015, and is currently working1181toward the M.S. at the School of Electrical Engi-1182neering and Computer Science, Peking University,1183Beijing, China.1184

Her research interests include large-scale video1185retrieval and fine-grained visual recognition.1186

Yihang Lou received the B.S. degree in software1188engineering from Dalian University of Technology,1189Liaoning, China, in 2015, and is currently working1190toward the M.S. degree at the School of Electrical1191Engineering and Computer Science, Peking Univer-1192sity, Beijing, China.1193

His current interests include large-scale video re-1194trieval and object detection.1195

Vijay Chandrasekhar received the B.S and M.S. de-1197grees from Carnegie Mellon University, Pittsburgh,1198PA, USA, in 2002 and 2005, respectively, and the1199Ph.D. degree in electrical engineering from Stanford1200University, Stanford, CA, USA, in 2013.1201

He has authored or coauthored more than 80 pa-1202pers/MPEG contributions in a wide range of top-1203tier journals/conferences such as the International1204Journal of Computer Vision, ICCV, CVPR, the IEEE1205Signal Processing Magazine, ACM Multimedia, the1206IEEE TRANSACTIONS ON IMAGE PROCESSING, De-1207

signs, Codes and Cryptography, the International Society of Music Information1208Retrieval, and MPEG-CDVS, and has filed 7 U.S. patents (one granted, six pend-1209ing). His research interests include mobile audio and visual search, large-scale1210image and video retrieval, machine learning, and data compression. His Ph.D.1211work on feature compression led to the MPEG-CDVS (Compact Descriptors1212for Visual Search) standard, which he actively contributed from 2010 to 2013.1213

Dr. Chandrasekhar was the recipient of the A*STAR National Science Schol-1214arship (NSS) in 2002.1215

Tiejun Huang received the B.S. and M.S. degrees 1217in computer science from the Wuhan University of 1218Technology, Wuhan, China, in 1992 and 1995, re- 1219spectively, and the Ph.D. degree in pattern recogni- 1220tion and intelligent system from the Huazhong (Cen- 1221tral China) University of Science and Technology, 1222Wuhan, China, in 1998. 1223

He is a Professor and the Chair of the Department 1224of Computer Science, School of Electronic Engineer- 1225ing and Computer Science, Peking University, Bei- 1226jing, China. His research areas include video coding, 1227

image understanding, and neuromorphic computing. 1228Prof. Huang is a Member of the Board of the Chinese Institute of Electronics 1229

and the Advisory Board of IEEE Computing Now. He was the recipient of the 1230National Science Fund for Distinguished Young Scholars of China in 2014, and 1231was awarded the Distinguished Professor of the Chang Jiang Scholars Program 1232by the Ministry of Education in 2015. 1233

Alex Kot (S’85–M’89–SM’98–F’06) has been with 1235Nanyang Technological University, Singapore, since 12361991. He headed the Division of Information En- 1237gineering, School of Electrical and Electronic En- 1238gineering, for eight years and was an Associate 1239Chair/Research and the Vice Dean Research with the 1240School of Electrical and Electronic Engineering. He 1241is currently a Professor with the College of Engi- 1242neering and the Director of the Rapid-Rich Object 1243Search Laboratory. He has authored or coauthored 1244extensively in the areas of signal processing for com- 1245

munication, biometrics, data-hiding, image forensics, and information security. 1246Prof. Kot is a Member of the IEEE Fellow Evaluation Committee and a 1247

Fellow of Academy of Engineering, Singapore. He was the recipient of the Best 1248Teacher of the Year Award and is a coauthor for several Best Paper Awards, 1249including ICPR, IEEE WIFS, ICEC, and IWDW. He was on the IEEE SP So- 1250ciety in various capacities, such as the General Co-Chair for the 2004 IEEE 1251International Conference on Image Processing and a Chair of the worldwide 1252SPS Chapter Chairs, and the Distinguished Lecturer Program. He is the Vice 1253President for the IEEE Signal Processing Society. He was a Guest Editor for 1254the Special Issues for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS 1255FOR VIDEO TECHNOLOGY and the EURASIP Journal on Advances in Signal 1256Processing. He is also an Editor of the EURASIP Journal on Advances in Signal 1257Processing. He is an IEEE SPS Distinguished Lecturer. He was an Associate 1258Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANS- 1259ACTIONS ON SIGNAL PROCESSING, the IEEE TRANSACTIONS ON MULTIMEDIA, 1260the IEEE SIGNAL PROCESSING LETTERS, the IEEE SIGNAL PROCESSING MAGA- 1261ZINE, the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, the IEEE 1262TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, and the 1263IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I, and the IEEE TRANSAC- 1264TIONS ON CIRCUITS AND SYSTEMS II. 1265

Wen Gao (S’87–M’88–SM’05–F’09) received the 1267Ph.D. degree in electronics engineering from the Uni- 1268versity of Tokyo, Tokyo, Japan, in 1991. 1269

He was a Professor of Computer Science with the 1270Harbin Institute of Technology, Harbin, China, from 12711991 to 1995, and a Professor with the Institute of 1272Computing Technology, Chinese Academy of Sci- 1273ences, Beijing, China. He is currently a Professor of 1274computer science with the Institute of Digital Media, 1275School of Electronic Engineering and Computer Sci- 1276ence, Peking University, Beijing, China. 1277

Prof. Gao has served on the editorial boards for several journals, such as the 1278IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 1279the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE TRANSACTIONS ON AU- 1280TONOMOUS MENTAL DEVELOPMENT, the EURASIP Journal of Image Commu- 1281nications, and the Journal of Visual Communication and Image Representation. 1282He has chaired a number of prestigious international conferences on multimedia 1283and video signal processing, such as IEEE ICME and ACM Multimedia, and 1284also served on the advisory and technical committees of numerous professional 1285organizations. 1286

IEEE P

QUERIES 1288

Q1. Author: Please provide the journal name and the page range in Ref. [4]. 1289

Q2. Author: Please provide the page range in Refs. [9], [11]–[13], [15], [16], [19], [23]–[25], [28]–[31], [34], [37], [38], [4]–[43], 1290

[45]–[47], [50], [52], [53], [59]–[62], and [64]–[67]. 1291

Q3. Author: Please update Refs. [14], [17], [18], [20], [27], and [58]. 1292