
Learning Category Correlations for Multi-label Image Recognition with Graph Networks

Qing Li1,2, Xiaojiang Peng2,∗, Yu Qiao2, Qiang Peng1

1 School of Information Science and Technology, Southwest Jiaotong University, Chengdu, China
2 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

[email protected], {xj.peng,yu.qiao}@siat.ac.cn

Abstract

Multi-label image recognition is a task that predicts a set of object labels in an image. As objects co-occur in the physical world, it is desirable to model label dependencies. Existing methods resort to either recurrent networks or pre-defined label correlation graphs for this purpose. In this paper, instead of using a pre-defined graph, which is inflexible and may be sub-optimal for multi-label classification, we propose the A-GCN, which leverages the popular Graph Convolutional Networks with an Adaptive label correlation graph to model label dependencies. Specifically, we introduce a plug-and-play Label Graph (LG) module to learn label correlations with word embeddings, and then utilize traditional GCN to map this graph into label-dependent object classifiers, which are further applied to image features. The basic LG module incorporates two 1×1 convolutional layers and uses the dot product to generate label graphs. In addition, we propose a sparse correlation constraint to enhance the LG module, and also explore different LG architectures. We validate our method on two diverse multi-label datasets: MS-COCO and Fashion550K. Experimental results show that our A-GCN significantly improves baseline methods and achieves performance superior or comparable to the state of the art.

Introduction

As an important problem in the computer vision community, multi-label image recognition has attracted considerable attention due to its wide applications, such as music emotion categorization (Trohidis et al. 2008), fashion attribute recognition (Inoue et al. 2017), and human attribute recognition (Li et al. 2016). Unlike conventional multi-class classification problems, which predict only one class label for each image, multi-label image recognition needs to assign multiple labels to a single image. Its challenges come from the rich and diverse semantic information in images.

Early methods (Clare and King 2001; Tsoumakas and Katakis 2007; Cheng and Hüllermeier 2009; Zhou et al. 2012; Zhang and Zhou 2013) address the multi-label classification problem either by transforming it into i) multiple binary classification tasks or ii) a multivariate regression problem, or by iii) adapting single-label classification algorithms. With the great success of deep Convolutional Neural Networks (CNNs) on single-label multi-class image classification (Krizhevsky, Sutskever, and Hinton 2012), recent multi-label image classification methods are mainly based on CNNs with certain adaptations (Wei et al. 2014; Wei et al. 2015; Wang et al. 2016; Wang et al. 2017; Zhu et al. 2017; Ge, Yang, and Yu 2018; Chen et al. 2018a; Yu et al. 2019; Chen et al. 2019).

A popular approach in modern CNN-based multi-label classification is to model label dependencies, as objects usually co-occur in the physical world. For instance, 'baseball', 'bat' and 'person' often appear in the same image, but 'baseball' and 'ocean' rarely appear together. Wang et al. (Wang et al. 2016) propose a CNN-RNN framework, which learns a joint image-label embedding to characterize the semantic label dependency. It shows that recurrent neural networks (RNNs) can capture higher-order label dependencies in a sequential fashion. However, this method ignores the explicit associations between semantic labels and image regions. Consequently, some works combine the attention mechanism (Xu et al. 2015) with the CNN-RNN framework to explore the associations between labels and image regions (Wang et al. 2017; Zhu et al. 2017; Ge, Yang, and Yu 2018; Chen et al. 2018a). For example, Zhu et al. (Zhu et al. 2017) propose a Spatial Regularization Network which generates class-related attention maps and captures both spatial and semantic label dependencies via learnable convolutions. These methods essentially learn local correlations from the attention regions of an image, which introduces limited complementary information. Chen et al. (Chen et al. 2019) propose a multi-label GCN (ML-GCN) framework, which leverages Graph Convolutional Networks to capture global correlations between labels with extra knowledge from label statistics. One drawback of ML-GCN is that the label correlation graph is manually designed and needs careful adaptation. This hand-crafted correlation graph makes ML-GCN inflexible and may be sub-optimal for multi-label classification.

In this paper, we propose a unified multi-label GCN framework, termed A-GCN, to address the inflexible correlation graph problem in ML-GCN. The key of A-GCN is that it learns an Adaptive label correlation graph to model label dependencies in an end-to-end manner.


Specifically, we introduce a plug-and-play adaptive Label Graph (LG) module to learn label correlations with word embeddings, and then utilize a traditional GCN to map this graph into label-dependent object classifiers, which are further applied to image features. By default, we implement the LG module with two 1×1 convolutional layers and use the dot product to generate label graphs. As label co-occurrence is sparse in current popular multi-label datasets, we also introduce a sparse correlation constraint to enhance the LG module, using an L1-norm loss between the learned correlation graph and an identity matrix. Furthermore, we explore three alternative architectures to evaluate the LG module. We validate our method on two diverse multi-label datasets: MS-COCO and Fashion550K. Experimental results show that our A-GCN significantly improves baseline methods and achieves performance superior or comparable to the state of the art.

Related Work

Our work is mainly related to multi-label image recognition and graph neural networks. In this section, we first review multi-label image recognition methods, and then graph neural network methods.

Multi-label Image Recognition

Remarkable developments in image recognition have been observed over the past few years, thanks to the availability of large-scale hand-labeled datasets such as ImageNet (Deng et al. 2009) and MS-COCO (Lin et al. 2014). Recent progress on single-label image classification is built on deep convolutional neural networks (CNNs) (Krizhevsky, Sutskever, and Hinton 2012; Simonyan and Zisserman 2014; He et al. 2016), which learn powerful visual representations by stacking multiple nonlinear transformations. A straightforward approach is to adapt these single-label classification networks to multi-label image recognition, which has yielded good results (Sharif Razavian et al. 2014; Wang et al. 2016; Wang et al. 2017; Chen et al. 2018b).

Early works on multi-label image recognition utilize hand-crafted image features and linear models (Tsoumakas, Katakis, and Vlahavas 2009; Tai and Lin 2012; Cabral et al. 2014; Chen et al. 2012). A well-known example is to decompose the multi-label recognition problem into multiple binary classification tasks (Tsoumakas and Katakis 2007), as in (Tsoumakas, Katakis, and Vlahavas 2009), which trains an independent linear classifier for each label. Zhang et al. (Zhang and Zhou 2007) propose a multi-label lazy learning approach named ML-KNN, which uses k-nearest neighbors to predict labels for unseen data from the training data. Tai et al. (Tai and Lin 2012) design a Principal Label Space Transformation (PLST) algorithm, which seeks important correlations between labels before learning. Chen et al. (Chen et al. 2012) introduce a hierarchical matching framework with so-called side information for image classification based on the bag-of-words model. Although these methods may perform well on simple benchmarks, they cannot generalize as well as deep learning-based methods on images with complex scenes and multiple objects.

CNN-based approaches continue to attract attention in multi-label image recognition (Chen et al. 2012; Wang et al. 2016; Wang et al. 2017). The earliest application of deep learning to multi-label classification is by Gong et al. (Gong et al. 2013), who propose to combine convolutional architectures with an approximate top-k ranking objective for annotating multi-label images. Instead of extracting off-the-shelf deep features, Chatfield et al. (Chatfield et al. 2014) fine-tune the network on the target multi-label dataset to learn more domain-specific features and boost classification performance. Wu et al. (Wu et al. 2015) propose weakly semi-supervised deep learning for multi-label image annotation, which uses a triplet loss to draw together images with similar label sets. To consider the correlations between labels instead of treating each label independently, various approaches have been proposed in recent works. One popular line utilizes graph models to build the label co-occurrence dependency (Tehrani and Ahrens 2017), such as Conditional Random Fields (Ghamrawi and McCallum 2005), Dependency Networks (Guo and Gu 2011), and Co-occurrence Matrices (Xue et al. 2011). To explore label co-occurrence dependency in combination with CNN models, another group of researchers applies low-dimensional recurrent neurons in RNN models to efficiently abstract high-order label correlations. For example, Wang et al. (Wang et al. 2016) combine RNNs with a CNN to learn a joint image-label embedding that characterizes both the semantic label dependency and the image-label relevance. Wang et al. (Wang et al. 2017) introduce a spatial transformer layer and long short-term memory (LSTM) units to capture label correlations. Lee et al. (Lee et al. 2018) propose a framework that incorporates knowledge graphs describing the relationships between multiple labels, and learns representations of this graph to enhance image features and promote multi-label recognition.

Graph Convolutional Neural Networks

The generalization of GCNNs has drawn great attention in recent years. There are two typical types of GCNNs: the spatial manner and the spectral manner. The first type applies feed-forward neural networks to every node (Scarselli et al. 2008). For example, Marino et al. (Marino, Salakhutdinov, and Gupta 2016) successfully apply GCNNs to multi-label image classification to exploit explicit semantic relations in the form of structured knowledge graphs. Wang et al. (Wang and Gupta 2018) propose to represent videos as space-time region graphs which capture similarity relationships and spatial-temporal relationships. Wang et al. (Wang et al. 2019) propose a spatial-based GCN to solve the link prediction problem. The second type provides well-defined localization operators on graphs via convolutions in the Fourier domain (Kipf and Welling 2016). In recent years, an important branch of spectral GCNNs has been proposed to tackle graph-structured data. The outputs of spectral GCNNs are updated features for each object node, leading to advanced performance on tasks related to graph-based information processing.


Figure 1: The pipeline of our A-GCN for multi-label image recognition. It consists of two branches, namely an image-level branch to extract image features and a label GCN branch to learn label-dependent classifiers. An adaptive label graph (LG) module is introduced to construct the label correlation matrix from label embeddings for the label GCN branch.

Figure 2: Three kinds of alternative label graph architectures.

More specifically, Kipf et al. (Kipf and Welling 2016) apply GCNNs to semi-supervised classification, and Hamilton et al. (Hamilton, Ying, and Leskovec 2017) leverage GCNs to learn feature representations. Chen et al. (Chen et al. 2019) propose a GCN-based model (ML-GCN) to learn label correlations for multi-label image recognition tasks. It utilizes the GCN to learn object classifiers by mining label co-occurrence patterns within the dataset. Motivated by ML-GCN (Chen et al. 2019), our work leverages the graph structure to capture and explore an adaptive label correlation graph. With the proposed A-GCN, we overcome the limitations of a manually designed graph and automatically learn label correlations with an LG module. We also demonstrate that our A-GCN is an effective model for label dependency and can be trained in an end-to-end manner.

Approach

To efficiently exploit label dependencies and make GCN flexible, we propose the A-GCN to learn label correlations for GCN-based multi-label image classification. In this section, we first introduce notation to define the problem, then describe basic GCN-based multi-label classification, and finally present our A-GCN and several alternative label graph architectures.

Preliminaries

Notations. Let D = {(I_i, y_i) | i = 1 ... N} be the training data, where I_i is the i-th image and y_i ∈ {0, 1}^C is the corresponding multi-hot label vector. Zeros and ones in the label vector y denote the absence or presence of the corresponding categories in the image. Let x_i = f(I_i; θ) ∈ R^D denote the CNN feature of I_i, where f(·; θ) is a CNN model with parameters θ. Assume we have object classifiers W = {w_i}_{i=1}^C ∈ R^{C×D}; then the predicted logit scores of feature x_i are defined as,

p_i = W x_i    (1)

The CNN model and classifiers can be optimized by the following multi-label classification loss,

L_classifier = -\frac{1}{C} \sum_{j=1}^{C} [ y_i^j \log(\sigma(p_i^j)) + (1 - y_i^j) \log(1 - \sigma(p_i^j)) ]    (2)

where σ(·) is the sigmoid function.
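
For reference, Eq. (2) is the standard per-class binary cross-entropy. A minimal PyTorch sketch (ours, not the authors' released code; names are illustrative):

```python
import torch.nn.functional as F

def multilabel_bce(logits, targets):
    """Eq. (2): binary cross-entropy over C sigmoid outputs, averaged.

    logits:  (batch, C) raw scores p_i = W x_i from Eq. (1)
    targets: (batch, C) multi-hot label vectors y_i
    """
    # the sigmoid of Eq. (2) is applied internally for numerical stability;
    # 'mean' averages over both the batch and the C classes
    return F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="mean")
```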

Multi-label classification with GCN. We revisit the ML-GCN (Chen et al. 2019) pipeline for multi-label classification in the following. It performs GCN on the word embeddings E ∈ R^{C×d_e} of the labels, and learns inter-dependent object classifiers to improve performance. The purpose of GCN is to learn a function on a graph G = (V, E), which takes the previous feature descriptions H^l ∈ R^{C×d} and the correlation matrix A ∈ R^{C×C}, and outputs updated node features H^{l+1} ∈ R^{C×d'}. One GCN layer can be formulated as,

H^{(l+1)} = \delta(\hat{A} H^{(l)} W^{(l)})    (3)

where

\hat{A} = D^{-1/2} (A + I_C) D^{-1/2}    (4)


Algorithm 1: Training of A-GCN

Input:
  image data and ground-truth label data (I, Y);
  labels' word embeddings E;
Output:
  image-level features X;
  adaptive correlation matrix A;
  label-dependent classifiers W;
  the final predicted score vector P;
Repeat:
  Branch 1: feedforward of the image CNN
    Extract image features X: X = f_CNN(I; θ_CNN);
  Branch 2: feedforward of the label-dependent classifiers
    Learn/initialize the correlation matrix A from the labels' word embeddings: A ← Eq. (5);
    Compute L_A: L_A ← Eq. (6);
    Learn the label-dependent classifiers W by GCN: W = f_GCN(E, \hat{A}; θ_GCN) ← Eq. (3);
    Get predictions by applying the classifiers W to the image features X: P ← Eq. (1);
    Compute L_classifier: L_classifier(P; Y) ← Eq. (2);
    Compute L_total: L_total = L_classifier + α · L_A;
  Backpropagate until L_total converges;

where W^{(l)} ∈ R^{d×d'} is a transformation matrix to be learned, \hat{A} is the normalized version of A with D_ii = \sum_j \tilde{A}_{ij} and \tilde{A} = A + I_C, I_C is an identity matrix, and δ(·) is an activation function, set as LeakyReLU following (Chen et al. 2019). The input of the first layer is E and the output of the last layer is W ∈ R^{C×D}, i.e., the inter-dependent classifiers.

The crucial problem of ML-GCN is how to build the correlation matrix A. (Chen et al. 2019) construct it by mining label co-occurrence within the target datasets. To overcome the over-smoothing problem of A, they either binarize or re-weight the original co-occurrence matrix with thresholding.
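
For contrast with the adaptive graph introduced below, a hand-crafted graph in this style might be built as in the following sketch (the conditional-probability formulation and the threshold value 0.4 are our reading of Chen et al. 2019, not taken from this paper):

```python
import torch

def binarized_cooccurrence_graph(labels, tau=0.4):
    """Co-occurrence correlation matrix with thresholding, ML-GCN style.

    labels: (N, C) multi-hot training annotations.
    A[i, j] approximates P(label j | label i), binarized at threshold tau.
    """
    labels = labels.float()
    counts = labels.t() @ labels                                 # counts[i, j] = co-occurrences of i and j
    cond = counts / counts.diagonal().clamp(min=1).unsqueeze(1)  # row-wise conditional probabilities
    A = (cond >= tau).float()
    A.fill_diagonal_(0)                                          # self-loops are re-added as I_C in Eq. (4)
    return A
```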

A-GCN

Following the pipeline of ML-GCN, we propose the A-GCN to address the generation of the label correlation matrix A. Figure 1 depicts the framework of A-GCN. It mainly consists of two branches: the upper branch is a traditional CNN for image feature learning, and the bottom branch is a GCN model that generates inter-dependent classifiers.

The key difference between our A-GCN and ML-GCN is the construction of A. We argue that building the correlation matrix A by counting the occurrence of label pairs and thresholding is inflexible and may be sub-optimal for multi-label classification. To address this problem, we propose an adaptive label graph (i.e., correlation matrix) module to learn label correlations in an end-to-end manner.

Adaptive label graph (LG) module. As shown in Figure 1, the adaptive LG module comprises two 1×1 convolutional layers and a dot product operation. The LG module takes as input the word embeddings of the labels and outputs a learned label correlation matrix A. Formally, the learned A can be written as,

A = \frac{1}{C} (W_\phi * E)^T (W_\theta * E)    (5)

where W_\phi and W_\theta are the convolutional kernels to be learned, and * denotes the convolution operation.

Following the normalization trick in (Kipf and Welling 2016), we normalize A to \hat{A} by Equation (4).
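
A minimal sketch of the default LG module in Eq. (5), treating the embeddings E as a length-C sequence with d_e channels so that the 1×1 convolutions act on the embedding dimension (the hidden width d_h is our assumption):

```python
import torch.nn as nn

class LabelGraph(nn.Module):
    """Adaptive LG module: A = (1/C) (W_phi * E)^T (W_theta * E), Eq. (5)."""
    def __init__(self, d_e, d_h):
        super().__init__()
        self.phi = nn.Conv1d(d_e, d_h, kernel_size=1)    # W_phi
        self.theta = nn.Conv1d(d_e, d_h, kernel_size=1)  # W_theta

    def forward(self, E):
        # E: (C, d_e) label word embeddings
        C = E.size(0)
        x = E.t().unsqueeze(0)            # (1, d_e, C): channels x labels
        f = self.phi(x).squeeze(0)        # (d_h, C)
        g = self.theta(x).squeeze(0)      # (d_h, C)
        return (f.t() @ g) / C            # (C, C) learned correlation matrix A
```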

Sparse correlation constraint. For each node of a graph, GCN gradually aggregates information from the node's own features and those of its adjacent nodes. The features can become indistinguishable through over-smoothing if the learned \hat{A} becomes uniform, since a uniform \hat{A} denotes dense correlations among different labels. To avoid this issue, we enforce a sparse correlation constraint on \hat{A} with an L1-norm loss as follows,

L_A = \| \hat{A} - I_C \|_1    (6)

This constraint encourages high self-correlation weights to avoid over-smoothed features in GCN. Our total loss is L_total = L_classifier + α · L_A, where α is a trade-off weight that defaults to 1.0 in our experiments.
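
In code, the constraint of Eq. (6) reduces to a few lines (the sum reduction is our reading of the L1 norm; a mean would only rescale α):

```python
import torch

def sparse_correlation_loss(A_hat):
    """Eq. (6): elementwise L1 distance between the learned graph and the identity."""
    I_C = torch.eye(A_hat.size(0), device=A_hat.device)
    return (A_hat - I_C).abs().sum()

# total objective: L_total = L_classifier + alpha * L_A, with alpha = 1.0 by default
```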

Alternative LG architectures. As illustrated in Figure 2, we propose three alternative LG architectures: i) pair-wise cosine similarity (abbreviated as Cos-A), ii) a linear transformation of E by a fully-connected layer (FC-A), and iii) a linear transformation of E with a dot product (Dot-A).

Cos-A simply computes the cosine similarities between label embeddings, which yields a symmetric correlation matrix. Each element of A is defined by,

A(i, j) = \cos(E_i, E_j)    (7)

FC-A directly utilizes a linear layer W_l ∈ R^{d_e×C} to generate the correlation matrix as,

A = W_l^T E    (8)

Dot-A first applies a single 1×1 convolutional layer to E, and then computes the self-correlation matrix with a dot product,

A = \frac{1}{C} (W_\phi * E)^T (W_\phi * E)    (9)
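
The three alternatives differ only in how A is produced from E; a compact sketch of Eqs. (7)-(9), reusing the (C, d_e) layout and conv1d trick from the LG sketch above (the transpose convention for FC-A is our assumption):

```python
import torch.nn.functional as F

def cos_a(E):
    """Cos-A, Eq. (7): pairwise cosine similarity; symmetric, no learned parameters."""
    En = F.normalize(E, dim=1)              # unit-norm label embeddings
    return En @ En.t()                      # (C, C)

def fc_a(E, W_l):
    """FC-A, Eq. (8): a single fully-connected map; W_l: (d_e, C)."""
    return E @ W_l                          # (C, C), i.e. W_l^T applied to each embedding

def dot_a(E, phi):
    """Dot-A, Eq. (9): one shared 1x1 conv (phi), then a self dot product."""
    f = phi(E.t().unsqueeze(0)).squeeze(0)  # (d_h, C)
    return (f.t() @ f) / E.size(0)          # symmetric (C, C), scaled by 1/C
```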

Training. We illustrate the training process of A-GCN in Algorithm 1. We train A-GCN in an end-to-end manner with two branches: Branch 1 extracts image features and updates the image-level CNN parameters; Branch 2 learns the adaptive label correlation graph and the GCN model to generate label-dependent classifiers. The total loss combines the sparse correlation constraint L_A and the multi-label classification loss L_classifier.
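
Putting the pieces together, one iteration of Algorithm 1 might look like the following sketch (it reuses the helper functions from the sketches above; `gcn` is assumed to stack GCN layers mapping E to the classifier matrix W):

```python
import torch.nn.functional as F

def train_step(images, targets, E, cnn, lg, gcn, optimizer, alpha=1.0):
    """One end-to-end A-GCN iteration following Algorithm 1 (illustrative)."""
    X = cnn(images)                               # Branch 1: image features (batch, D)
    A = lg(E)                                     # Branch 2: adaptive graph, Eq. (5)
    A_hat = normalize_adjacency(A)                # Eq. (4)
    W = gcn(E, A_hat)                             # label-dependent classifiers (C, D), Eq. (3)
    P = X @ W.t()                                 # logits, Eq. (1)
    loss = F.binary_cross_entropy_with_logits(P, targets.float()) \
           + alpha * sparse_correlation_loss(A_hat)   # Eq. (2) + alpha * Eq. (6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```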


Table 1: Performance comparison of our framework and state-of-the-art methods on MS-COCO. * denotes our re-implementation results.

Model                    | mAP   | All: CP    CR    CF1   OP    OR    OF1   | Top-3: CP    CR    CF1   OP    OR    OF1
CNN-RNN                  | 61.2  |      -     -     -     -     -     -     |       66.0  55.6  60.4  69.2  66.4  67.8
RNN-Attention            | -     |      -     -     -     -     -     -     |       79.1  58.7  67.4  84.0  63.0  72.08
Order-Free RNN           | -     |      -     -     -     -     -     -     |       71.6  54.8  62.1  74.2  62.2  67.7
ML-ZSL                   | -     |      -     -     -     -     -     -     |       74.1  64.5  69.0  -     -     -
SRN                      | 77.1  |      81.6  65.4  71.2  82.7  69.9  75.8  |       85.2  58.8  67.4  87.4  62.5  72.9
Multi-Evidence           | -     |      80.4  70.2  74.9  85.2  72.5  78.4  |       84.5  62.2  70.6  89.1  64.3  74.7
ML-GCN (Binary)          | 80.3  |      81.1  70.1  75.2  83.8  74.2  78.7  |       84.9  61.3  71.2  88.8  65.2  75.2
ML-GCN (Re-weighted)     | 83.0  |      85.1  72.0  78.0  85.8  75.4  80.3  |       89.2  64.1  74.6  90.5  66.5  76.7
ML-GCN (Re-weighted)*    | 82.5  |      83.7  72.0  77.4  84.7  75.5  79.8  |       88.4  63.8  74.1  89.9  66.2  76.3
Our baseline (ResNet101) | 80.3  |      77.8  72.8  75.2  81.5  75.1  78.2  |       82.5  64.6  72.4  87.3  65.7  75.0
A-GCN                    | 83.1  |      84.7  72.3  78.0  85.6  75.5  80.3  |       89.0  64.2  74.6  90.5  66.3  76.6
A-GCN (w/o L_A)          | 82.78 |      83.04 72.87 77.63 84.45 75.75 79.87 |       87.48 64.73 74.4  89.55 66.54 76.35
Cos-A (w L_A)            | 82.77 |      84.89 71.67 77.72 85.77 74.83 79.93 |       88.92 64.03 74.45 90.24 66.2  76.37
FC-A (w L_A)             | 82.85 |      83.65 72.45 77.65 84.99 75.56 80.0  |       88.29 64.23 74.37 89.95 66.3  76.34
Dot-A (w L_A)            | 82.22 |      84.64 70.93 77.18 85.86 74.65 79.87 |       88.74 63.19 73.82 90.37 65.93 76.24

Experiment

In this section, we evaluate the proposed A-GCN and compare it to state-of-the-art methods on two public multi-label benchmark datasets: MS-COCO (Lin et al. 2014) and Fashion550K (Inoue et al. 2017). We first present the implementation details and metrics, then extensively explore our A-GCN on MS-COCO, and finally apply A-GCN to Fashion550K.

Implementations and evaluation metrics

We implement our method with PyTorch. For data augmentation, we resize images to 512×512 on MS-COCO (256×256 on Fashion550K), and randomly crop 448×448 regions (224×224 on Fashion550K) with random flipping. For testing, we resize images to 448×448 (224×224). For fair comparison, we use ResNet-101 on MS-COCO (Chen et al. 2019) and ResNet-50 on Fashion550K (Inoue et al. 2017), both pre-trained on ImageNet. We use SGD for optimization with a momentum of 0.9 and a weight decay of 10^{-4}. We set the minibatch size to 32 and the initial learning rate (lr) to 10^{-2}. We divide the lr by 10 after every 30 epochs, and stop training after 65 epochs. The word embedding method and other GCN hyper-parameters are kept consistent with (Chen et al. 2019).
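
The described schedule corresponds to a standard SGD setup; a sketch (`model`, `loader`, and `train_one_epoch` are placeholders, not from the paper):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=1e-4)
# divide the learning rate by 10 every 30 epochs; stop after 65 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(65):
    train_one_epoch(model, loader, optimizer)  # placeholder training routine
    scheduler.step()
```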

Evaluation metrics. For the MS-COCO dataset, we use the same evaluation metrics as (Chen et al. 2019), i.e., mean average precision over classes (mAP), overall precision (OP), recall (OR), and F1 (OF1), and per-class precision (CP), recall (CR), and F1 (CF1). For each image, a label is predicted as positive if its confidence is greater than 0.5. Among all these metrics, mAP is the most important one. For fair comparison, we also report results for the top-3 labels. On Fashion550K, we also use mAP and the class-agnostic average precision (AP_all) to evaluate performance, for consistency with (Inoue et al. 2017).
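
For reference, the thresholded per-class (CP/CR/CF1) and overall (OP/OR/OF1) metrics can be computed as below (a minimal sketch following the standard definitions; mAP computation omitted):

```python
import torch

def multilabel_prf1(scores, targets, thresh=0.5):
    """Per-class (CP/CR/CF1) and overall (OP/OR/OF1) metrics at a fixed threshold.

    scores:  (N, C) sigmoid confidences; targets: (N, C) multi-hot ground truth.
    """
    pred = (scores >= thresh).float()
    targets = targets.float()
    tp = (pred * targets).sum(0)                 # per-class true positives
    n_pred = pred.sum(0).clamp(min=1)            # per-class predicted positives
    n_gt = targets.sum(0).clamp(min=1)           # per-class ground-truth positives
    cp, cr = (tp / n_pred).mean(), (tp / n_gt).mean()
    op = tp.sum() / pred.sum().clamp(min=1)
    o_r = tp.sum() / targets.sum().clamp(min=1)
    return dict(CP=cp, CR=cr, CF1=2 * cp * cr / (cp + cr),
                OP=op, OR=o_r, OF1=2 * op * o_r / (op + o_r))
```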

Figure 3: Accuracy comparisons with different values of α.

Exploration on MS-COCO

MS-COCO is the most popular multi-label image dataset; it consists of 80 categories with 82,081 training images and 40,137 test images. We compare our A-GCN to several state-of-the-art methods, including CNN-RNN (Wang et al. 2016), RNN-Attention (Wang et al. 2017), Order-Free RNN (Chen et al. 2018a), ML-ZSL (Lee et al. 2018), SRN (Zhu et al. 2017), Multi-Evidence (Ge, Yang, and Yu 2018), and ML-GCN (Chen et al. 2019). The results are presented in Table 1. Our A-GCN significantly improves the baseline (ResNet101) on most of the metrics. Specifically, A-GCN improves the mAP of the baseline from 80.3% to 83.1%. In addition, our A-GCN slightly outperforms the most related method, ML-GCN, in mAP. Compared to ML-GCN, our A-GCN, with a small extra LG module, is more flexible and does not need an elaborately designed correlation matrix.

Evaluation of L_A and LG architectures. We evaluate the effect of the sparse correlation constraint L_A and of different label graph architectures in the last four rows of Table 1. Several observations can be made. First, without L_A we obtain slightly worse results than the default A-GCN, which indicates the effectiveness of the sparse correlation constraint.


Figure 4: Per-class improvement or degradation of AP between A-GCN and the baseline (or ML-GCN) on MS-COCO (or Fashion550K). (a) Baseline vs. A-GCN on MS-COCO; (b) ML-GCN vs. A-GCN on MS-COCO; (c) Baseline vs. A-GCN on Fashion550K; (d) ML-GCN vs. A-GCN on Fashion550K. The top-10 improved classes for our A-GCN are shown in red, and the top-10 degraded classes in blue.

Second, all the alternative LG architectures clearly improve the baseline, which suggests that all of them learn label correlation information effectively. Third, FC-A, which differs from the default A-GCN only by replacing the 1×1 convolutions with one FC layer, shows the best results among the alternatives. Compared to the default A-GCN, Dot-A shows an obvious degradation.

Evaluation of α. The trade-off weight α controls the contribution of L_A to the total loss. Intuitively, this regularization should not have a large weight. Figure 3 shows the evaluation of α on MS-COCO. Increasing α from 0 to 1 slightly boosts performance, while larger α leads to degradation and even divergence (at α = 2.0 in our test).

Visualization. To further investigate the effect of our A-GCN, we show the per-class improvement (or degradation) of A-GCN on MS-COCO and Fashion550K in Figure 4. It shows that objects (mainly daily necessities) whose presence usually depends on co-occurring container objects are likely to have large gains, e.g., spoon, backpack, book, and toothbrush in panel (a) (or glasses, sneakers, and sweatshirts in panel (c)). This suggests that our A-GCN leverages the graph module to automatically learn object co-occurrence relations, which effectively improves multi-label recognition performance.

Performance on Fashion550K

Fashion550K (Inoue et al. 2017) is a multi-label fashion dataset which contains 66 unique weakly-annotated tags and 407,772 images in total. Among all the images, 3,000 manually verified images are used for training (the clean set), 300 images for validation, and 2,000 images for testing. The remaining images are used as noisy-labeled data, i.e., the noisy set. We report performance on the test set following the common setting.

Table 2: Comparison of AP_all and mAP on Fashion550K.

Model        | Data        | AP_all  | mAP
Baseline     | noisy       | 69.18 % | 58.68 %
StyleNet     | noisy       | 69.53 % | 53.24 %
ML-GCN       | noisy       | 68.46 % | 60.85 %
Our baseline | noisy       | 68.26 % | 58.59 %
A-GCN        | noisy       | 70.28 % | 61.35 %
Baseline     | noisy+clean | 79.39 % | 64.04 %
Veit et al.  | noisy+clean | 78.92 % | 63.08 %
Inoue et al. | noisy+clean | 79.87 % | 64.62 %
ML-GCN       | noisy+clean | 80.52 % | 65.74 %
Our baseline | noisy+clean | 77.84 % | 62.92 %
A-GCN        | noisy+clean | 80.95 % | 66.32 %

We compare our default A-GCN to several well-known state-of-the-art methods on Fashion550K, including StyleNet (Simo-Serra and Ishikawa 2016), the baseline and the method of Inoue et al. (Inoue et al. 2017), the method of Veit et al. (Veit et al. 2017), and our re-implementation of ML-GCN (Re-weighted). For fair comparison, we use two training configurations: i) training on the noisy set, and ii) further fine-tuning on the clean set (i.e., noisy+clean). The comparison is presented in Table 2. Our A-GCN improves our baseline by 2.76% and 3.4% in mAP under the two training settings, respectively. This also demonstrates that label correlation information is helpful for multi-label fashion image classification.


Conclusion

In this paper, we proposed a simple and flexible A-GCN framework for multi-label image recognition. The A-GCN leverages a plug-and-play label graph module to automatically construct the label correlation matrix for GCN on the label embeddings. We designed a sparse correlation constraint on the learned correlation matrix to avoid over-smoothing of the features. We also explored several alternative label graph modules to demonstrate the effectiveness of our A-GCN. Extensive experiments on MS-COCO and Fashion550K show that our A-GCN achieves performance superior to several state-of-the-art methods.

References

[Cabral et al. 2014] Cabral, R.; De la Torre, F.; Costeira, J. P.; and Bernardino, A. 2014. Matrix completion for weakly-supervised multi-label image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(1):121–135.

[Chatfield et al. 2014] Chatfield, K.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2014. Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531.

[Chen et al. 2012] Chen, Q.; Song, Z.; Hua, Y.; Huang, Z.; and Yan, S. 2012. Hierarchical matching with side information for image classification. In CVPR, 3426–3433. IEEE.

[Chen et al. 2018a] Chen, S.-F.; Chen, Y.-C.; Yeh, C.-K.; and Wang, Y.-C. F. 2018a. Order-free RNN with visual attention for multi-label classification. In Thirty-Second AAAI Conference on Artificial Intelligence.

[Chen et al. 2018b] Chen, T.; Wang, Z.; Li, G.; and Lin, L. 2018b. Recurrent attentional reinforcement learning for multi-label image recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.

[Chen et al. 2019] Chen, Z.-M.; Wei, X.-S.; Wang, P.; and Guo, Y. 2019. Multi-label image recognition with graph convolutional networks. In CVPR, 5177–5186.

[Cheng and Hüllermeier 2009] Cheng, W., and Hüllermeier, E. 2009. Combining instance-based learning and logistic regression for multilabel classification. Machine Learning 76(2-3):211–225.

[Clare and King 2001] Clare, A., and King, R. D. 2001. Knowledge discovery in multi-label phenotype data. In European Conference on Principles of Data Mining and Knowledge Discovery, 42–53. Springer.

[Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.

[Ge, Yang, and Yu 2018] Ge, W.; Yang, S.; and Yu, Y. 2018. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In CVPR, 1277–1286.

[Ghamrawi and McCallum 2005] Ghamrawi, N., and McCallum, A. 2005. Collective multi-label classification. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, 195–200. ACM.

[Gong et al. 2013] Gong, Y.; Jia, Y.; Leung, T.; Toshev, A.; and Ioffe, S. 2013. Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894.

[Guo and Gu 2011] Guo, Y., and Gu, S. 2011. Multi-label classification using conditional dependency networks. In Twenty-Second International Joint Conference on Artificial Intelligence.

[Hamilton, Ying, and Leskovec 2017] Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 1024–1034.

[He et al. 2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.

[Inoue et al. 2017] Inoue, N.; Simo-Serra, E.; Yamasaki, T.; and Ishikawa, H. 2017. Multi-label fashion image classification with minimal human supervision. In CVPR Workshops.

[Kipf and Welling 2016] Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.

[Lee et al. 2018] Lee, C.-W.; Fang, W.; Yeh, C.-K.; and Frank Wang, Y.-C. 2018. Multi-label zero-shot learning with structured knowledge graphs. In CVPR, 1576–1585.

[Li et al. 2016] Li, Y.; Huang, C.; Loy, C. C.; and Tang, X. 2016. Human attribute recognition by deep hierarchical contexts. In ECCV, 684–700. Springer.

[Lin et al. 2014] Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In ECCV, 740–755. Springer.

[Marino, Salakhutdinov, and Gupta 2016] Marino, K.; Salakhutdinov, R.; and Gupta, A. 2016. The more you know: Using knowledge graphs for image classification. arXiv preprint arXiv:1612.04844.

[Scarselli et al. 2008] Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2008. The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80.

[Sharif Razavian et al. 2014] Sharif Razavian, A.; Azizpour, H.; Sullivan, J.; and Carlsson, S. 2014. CNN features off-the-shelf: An astounding baseline for recognition. In CVPR Workshops, 806–813.

[Simo-Serra and Ishikawa 2016] Simo-Serra, E., and Ishikawa, H. 2016. Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction. In CVPR, 298–307.

[Simonyan and Zisserman 2014] Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[Tai and Lin 2012] Tai, F., and Lin, H.-T. 2012. Multilabel classification with principal label space transformation. Neural Computation 24(9):2508–2542.

[Tehrani and Ahrens 2017] Tehrani, A. F., and Ahrens, D. 2017. Modeling label dependence for multi-label classification using the choquistic regression. Pattern Recognition Letters 92:75–80.

[Trohidis et al. 2008] Trohidis, K.; Tsoumakas, G.; Kalliris, G.; and Vlahavas, I. P. 2008. Multi-label classification of music into emotions. In ISMIR, volume 8, 325–330.

[Tsoumakas and Katakis 2007] Tsoumakas, G., and Katakis, I. 2007. Multi-label classification: An overview. IJDWM 3(3):1–13.

[Tsoumakas, Katakis, and Vlahavas 2009] Tsoumakas, G.; Katakis, I.; and Vlahavas, I. 2009. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook. Springer. 667–685.

[Veit et al. 2017] Veit, A.; Alldrin, N.; Chechik, G.; Krasin, I.; Gupta, A.; and Belongie, S. 2017. Learning from noisy large-scale datasets with minimal supervision. In CVPR, 839–847.

[Wang and Gupta 2018] Wang, X., and Gupta, A. 2018. Videos as space-time region graphs. In ECCV, 399–417.

[Wang et al. 2016] Wang, J.; Yang, Y.; Mao, J.; Huang, Z.; Huang, C.; and Xu, W. 2016. CNN-RNN: A unified framework for multi-label image classification. In CVPR, 2285–2294.

[Wang et al. 2017] Wang, Z.; Chen, T.; Li, G.; Xu, R.; and Lin, L. 2017. Multi-label image recognition by recurrently discovering attentional regions. In CVPR, 464–472.

[Wang et al. 2019] Wang, Z.; Zheng, L.; Li, Y.; and Wang, S. 2019. Linkage based face clustering via graph convolution network. In CVPR, 1117–1125.

[Wei et al. 2014] Wei, Y.; Xia, W.; Huang, J.; Ni, B.; Dong, J.; Zhao, Y.; and Yan, S. 2014. CNN: Single-label to multi-label. arXiv preprint arXiv:1406.5726.

[Wei et al. 2015] Wei, Y.; Xia, W.; Lin, M.; Huang, J.; Ni, B.; Dong, J.; Zhao, Y.; and Yan, S. 2015. HCP: A flexible CNN framework for multi-label image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(9):1901–1907.

[Wu et al. 2015] Wu, F.; Wang, Z.; Zhang, Z.; Yang, Y.; Luo, J.; Zhu, W.; and Zhuang, Y. 2015. Weakly semi-supervised deep learning for multi-label image annotation. IEEE Transactions on Big Data 1(3):109–122.

[Xu et al. 2015] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, 2048–2057.

[Xue et al. 2011] Xue, X.; Zhang, W.; Zhang, J.; Wu, B.; Fan, J.; and Lu, Y. 2011. Correlative multi-label multi-instance image annotation. In ICCV, 651–658. IEEE.

[Yu et al. 2019] Yu, W.-J.; Chen, Z.-D.; Luo, X.; Liu, W.; and Xu, X.-S. 2019. DELTA: A deep dual-stream network for multi-label image classification. Pattern Recognition 91:322–331.

[Zhang and Zhou 2007] Zhang, M.-L., and Zhou, Z.-H. 2007. ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition 40(7):2038–2048.

[Zhang and Zhou 2013] Zhang, M.-L., and Zhou, Z.-H. 2013. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26(8):1819–1837.

[Zhou et al. 2012] Zhou, Z.-H.; Zhang, M.-L.; Huang, S.-J.; and Li, Y.-F. 2012. Multi-instance multi-label learning. Artificial Intelligence 176(1):2291–2320.

[Zhu et al. 2017] Zhu, F.; Li, H.; Ouyang, W.; Yu, N.; and Wang, X. 2017. Learning spatial regularization with image-level supervisions for multi-label image classification. In CVPR, 5513–5522.

