
Cost-Effective Active Learning for Deep Image Classification

Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin, Senior Member, IEEE

Abstract— Recent successes in learning-based image classification, however, heavily rely on a large number of annotated training samples, which may require considerable human effort. In this paper, we propose a novel active learning (AL) framework, which is capable of building a competitive classifier with optimal feature representation via a limited amount of labeled training instances in an incremental learning manner. Our approach advances the existing AL methods in two aspects. First, we incorporate deep convolutional neural networks into AL. Through the properly designed framework, the feature representation and the classifier can be simultaneously updated with progressively annotated informative samples. Second, we present a cost-effective sample selection strategy to improve the classification performance with fewer manual annotations. Unlike traditional methods that focus only on the uncertain samples of low prediction confidence, we especially exploit the large number of high-confidence samples from the unlabeled set for feature learning. Specifically, these high-confidence samples are automatically selected and iteratively assigned pseudolabels. We thus call our framework cost-effective AL (CEAL), standing for these two advantages. Extensive experiments demonstrate that the proposed CEAL framework can achieve promising results on two challenging image classification data sets, i.e., face recognition on the Cross-Age Celebrity Dataset (CACD) and object categorization on Caltech-256.

Index Terms— Active learning (AL), deep neural nets, image classification, incremental learning.

I. INTRODUCTION

AIMING at improving existing models by incrementally selecting and annotating the most informative unlabeled samples, active learning (AL) has been well studied in the past few decades [3]–[12] and applied to various kinds of vision tasks, such as image/video categorization [13]–[17], text/Web classification [18]–[20], and image/video retrieval [21], [22]. In the AL methods [3]–[5], the classifier/model is first initialized with a relatively small set of labeled training samples. Then it is continuously boosted by selecting and pushing some of the most informative samples to the user for annotation. Although the existing AL approaches [10], [11] have demonstrated impressive results on image classification, their classifiers/models are trained with hand-crafted features (e.g., HOG and SIFT) on small-scale visual data sets. The effectiveness of AL on more challenging image classification tasks has not been well studied.

Manuscript received June 26, 2015; revised January 5, 2016 and April 25, 2016; accepted July 1, 2016. This work was supported in part by the National Natural Science Foundation of China under Grant 61622214, in part by the State Key Development Program under Grant 2016YFB1001000, in part by the CCF-Tencent Open Fund, in part by the Special Program through the Applied Research on Super Computation of the Natural Science Foundation of China–Guangdong Joint Fund (the second phase), and in part by NVIDIA Corporation through the Tesla K40 GPU. This paper was recommended by Associate Editor E. Cetin. (Corresponding author: Dongyu Zhang.)

K. Wang, D. Zhang, R. Zhang, and L. Lin are with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou 510006, China, and also with the Collaborative Innovation Center of High Performance Computing, National University of Defense Technology, Changsha 410073, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Y. Li is with Guangzhou University, Guangzhou 510182, China (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2016.2589879

1051-8215 © 2016 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Recently, incredible progress on visual recognition tasks has been made by deep learning approaches [23], [24]. With sufficient labeled data [25], deep convolutional neural networks (CNNs) [23], [26] are trained to learn features directly from raw pixels and have achieved state-of-the-art performance for image classification. However, in many real applications of large-scale image classification, the labeled data are not enough, since the tedious manual labeling process requires a great deal of time and labor. Thus, it is of great practical significance to develop a framework combining CNNs and AL, which can jointly learn features and classifiers/models from unlabeled training data with minimal human annotation. However, incorporating CNNs into the AL framework is not straightforward for real image classification tasks. This is due to the following two issues.

1) The labeled training samples given by current AL approaches are insufficient for CNNs, as the majority of unlabeled samples are usually ignored in AL. AL usually selects only a few of the most informative samples (e.g., samples with quite low prediction confidence) in each learning step and frequently solicits user labeling. Thus, it is difficult to obtain proper feature representations by fine-tuning CNNs with this minority of informative samples.

2) The processing pipelines of AL and CNNs are inconsistent with each other. Most AL methods pay close attention to model/classifier training. Their strategies for selecting the most informative samples depend heavily on the assumption that the feature representation is fixed. However, feature learning and classifier training are jointly optimized in CNNs. Because of this inconsistency, simply fine-tuning CNNs within the traditional AL framework may face the divergence problem.

Inspired by the insights and lessons from a significant amount of previous work, as well as the recently proposed technique of self-paced learning [27]–[30], we address the above-mentioned issues by cost-effectively combining the CNN and AL via complementary sample selection. In particular, we propose a novel AL framework called cost-effective AL (CEAL), which is able to fine-tune the CNN with sufficient unlabeled training data and overcomes the inconsistency between AL and the CNN.

Fig. 1. Illustration of our proposed CEAL framework. Our proposed CEAL progressively feeds samples from the unlabeled data set into the CNN. Then both the clearly classified sample and the most informative sample selection criteria are applied to the classifier output of the CNN. After adding the user-annotated minority of uncertain samples into the labeled set and pseudolabeling the majority of certain samples, the model (feature representation and classifier of the CNN) is further updated.

Different from the existing AL approaches that consider only the most informative and representative samples, our CEAL proposes to automatically select and pseudoannotate unlabeled samples. As Fig. 1 illustrates, our proposed CEAL progressively feeds the samples from the unlabeled data set into the CNN and selects two kinds of samples for fine-tuning according to the output of the CNN's classifiers. One kind is the minority of samples with low prediction confidence, called the most informative/uncertain samples; their predicted labels are the most uncertain ones. For the selection of these uncertain samples, the proposed CEAL considers three common AL methods: least confidence (LC) [31], margin sampling (MS) [32], and entropy (EN) [33]. The selected samples are added into the labeled set after active user labeling. The other kind is the majority of samples with high prediction confidence, called high-confidence samples; their predicted labels are the most certain ones. For this kind of samples, the proposed CEAL automatically assigns pseudolabels with no human labor cost. As one can see, these two kinds of samples are complementary to each other in representing different confidence levels of the current model on the unlabeled data set. In the model updating stage, all the samples in the labeled set and the currently pseudolabeled high-confidence samples are exploited to fine-tune the CNN.

The proposed CEAL advances in employing these two complementary kinds of samples to incrementally improve the model's classifier training and feature learning: the minority of informative samples contributes to training more powerful classifiers, while the majority of high-confidence samples helps to learn more discriminative feature representations. On one hand, although small in number, the most uncertain unlabeled samples usually have a great potential impact on the classifiers. Selecting and annotating them for training can lead to a better decision boundary of the classifiers. On the other hand, though unable to significantly improve the performance of the classifiers, the high-confidence unlabeled samples are close to the labeled samples in the CNN's feature space. Thus, pseudolabeling this majority of high-confidence samples for training is a reasonable way of data augmentation for the CNN to learn robust features. In particular, the number of high-confidence samples is actually much larger than that of the most uncertain ones. With the obtained robust feature representation, the inconsistency between AL and the CNN can be overcome.

To keep the model stable during the training stage, many works [34], [35] have been proposed in recent years, inspired by the human learning process of gradually including samples in training from easy to complex. In this way, the training samples for further iterations are gradually determined by the model itself based on what it has already learned [30]. In other words, the model can gradually select the high-confidence samples as pseudolabeled ones along with the training process. The advantages of these related studies motivate us to incrementally select unlabeled samples in an easy-to-hard manner to make the pseudolabeling process reliable. Specifically, considering that the classification model is usually not reliable enough in the initial iterations, we employ a high-confidence threshold to define clearly classified samples and assign them pseudolabels. When the performance of the classification model improves, the threshold correspondingly decreases.

The main contribution of this paper is threefold. First, to the best of our knowledge, our work is the first to address deep image classification problems by combining an AL framework with CNN training. Our framework can be easily extended to other similar visual recognition tasks. Second, this paper also advances AL development by introducing a cost-effective strategy to automatically select and annotate the high-confidence samples, which improves traditional sample selection strategies. Third, experiments on the challenging Cross-Age Celebrity Dataset (CACD) [1] and Caltech-256 [2] data sets show that our approach outperforms other methods not only in classification accuracy but also in the reduction of human annotation.

The rest of this paper is organized as follows. Section II presents a brief review of related work. Section III discusses the components of our framework and the corresponding learning algorithm. Section IV presents the experimental results with in-depth empirical analysis. Section V concludes this paper.

II. RELATED WORK

The key idea of AL is that a learning algorithm should achieve higher accuracy with fewer labeled training samples if it is allowed to choose the data from which it learns [31]. In this way, the instance selection scheme becomes extremely important. One of the most common strategies is uncertainty-based selection [12], [18], which measures the uncertainty of novel unlabeled samples from the predictions of previous classifiers. Lewis [12] proposed to extract the sample that has the largest EN on the conditional distribution over predicted labels as the most uncertain instance. The support vector machine (SVM)-based method [18] determined the uncertain samples based on the relative distance between the candidate samples and the decision boundary. Some earlier works [19], [36] also determined the sample uncertainty by referring to a committee of classifiers (i.e., examining the disagreement among class labels assigned by a set of classifiers). Such a theoretically motivated framework is called query-by-committee in the literature [31]. All the above-mentioned uncertainty-based methods usually ignore the majority of certain unlabeled samples and thus are sensitive to outliers. Later methods have taken an information density measure into account and exploited the information of unlabeled data when selecting samples. These approaches usually sequentially select the informative samples by relying on probability estimation [6], [37] or prior information [8] to minimize the generalization error of the trained classifier over the unlabeled data. For example, Joshi et al. [6] considered the uncertainty sampling method based on the probability estimation of class membership for all the instances in the selection pool, and such a method can effectively handle the multiclass case. In [8], some context constraints are introduced as priors to guide users to tag face images more efficiently. At the same time, a series of works [7], [24] was proposed to select the samples that maximize the increase of mutual information between the candidate instance and the remaining unlabeled instances under the Gaussian process framework. Li and Guo [10] presented a novel adaptive AL approach that combines an information density measure and an uncertainty measure to label critical instances for image classification. Moreover, the diversity of the selected instances over a certain category has also been taken into consideration in [4]. That work is also the pioneering study extending SVM-based AL from single mode to batch mode. Recently, Elhamifar et al. [11] further integrated the uncertainty and diversity measurements into a unified batch-mode framework via convex programming for unlabeled sample selection. Such an approach can be combined with any type of classifier, not limited to max-margin ones. It is obvious that all the above-mentioned AL methods consider only those low-confidence samples (e.g., uncertain and diverse samples), while losing sight of the large majority of high-confidence samples. We hold that, due to their majority and consistency, these high-confidence samples will also be beneficial for improving the accuracy and keeping the classifiers stable. Moreover, we shall demonstrate that considering these high-confidence samples can also effectively reduce the user annotation effort.

III. COST-EFFECTIVE ACTIVE LEARNING

In this section, we develop an efficient algorithm for the proposed CEAL framework. Our objective is to apply our CEAL framework to deep image classification tasks by progressively selecting complementary samples for model updating. Suppose we have a data set of m categories and n samples denoted by D = {x_i}_{i=1}^{n}. We denote the currently annotated samples of D by D_L, while the unlabeled ones by D_U. The label of x_i is denoted by y_i = j, j ∈ {1, ..., m}, i.e., x_i belongs to the j-th category. We should give two necessary remarks on our problem settings. One is that in our investigated image classification problems, almost all data are unlabeled, i.e., most of the {y_i} values of D are unknown and need to be completed in the learning process. The other remark is that D_U might be input into the system in an incremental way. This means that the data scale might be consistently growing.

Thanks to handling both manually annotated and automatically pseudolabeled samples together, our proposed CEAL model can progressively fit the consistently growing unlabeled data in such a holistic manner. The CEAL for deep image classification is formulated as follows:

\min_{W,\, \{y_i : i \in D_U\}} \; -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} 1\{y_i = j\} \log p(y_i = j \mid x_i; W)    (1)

where 1{·} is the indicator function, such that 1{a true statement} = 1 and 1{a false statement} = 0, and W denotes the network parameters of the CNN. p(y_i = j | x_i; W) denotes the softmax output of the CNN for the j-th category, which represents the probability of the sample x_i belonging to the j-th class.

An alternating search strategy is readily employed to optimize (1). Specifically, the algorithm is designed by alternately updating the pseudolabels {y_i} of samples in D_U and the network parameters W. In the following, we introduce the details of the optimization steps and give their physical interpretations. The practical implementation of CEAL will also be discussed at the end.

A. Initialization

Before the experiment starts, the labeled set D_L is empty. For each class, we randomly select a few training samples from D_U and manually annotate them as the starting point to initialize the CNN parameters W.

B. Complementary Sample Selection

Fixing the CNN parameters W, we first rank all unlabeled samples according to the common AL criteria and then manually annotate the most uncertain samples and add them into D_L. For the most certain ones, we assign pseudolabels and denote them by D_H.


1) Informative Sample Annotating: Our CEAL can be used in conjunction with any common active learning criterion, e.g., LC [31], MS [32], and EN [33], to select the K most informative/uncertain samples left in D_U. The selection criteria are based on p(y_i = j | x_i; W), which denotes the probability of x_i belonging to the j-th class. Specifically, the three selection criteria are defined as follows; a short code sketch after the list illustrates all three.

1) LC: Rank all the unlabeled samples in ascending order according to their lc_i value, defined as

lc_i = \max_j \, p(y_i = j \mid x_i; W).    (2)

If the probability of the most probable class for a sample is low, then the classifier is uncertain about the sample.

2) MS: Rank all the unlabeled samples in ascending order according to their ms_i value, defined as

ms_i = p(y_i = j_1 \mid x_i; W) - p(y_i = j_2 \mid x_i; W)    (3)

where j_1 and j_2 represent the first and second most probable class labels predicted by the classifiers, respectively. The smaller the margin, the more uncertain the classifier is about the sample.

3) EN: Rank all the unlabeled samples in descending order according to their en_i value, defined as

en_i = -\sum_{j=1}^{m} p(y_i = j \mid x_i; W) \log p(y_i = j \mid x_i; W).    (4)

This method takes all class label probabilities into consideration to measure the uncertainty. The higher the EN value, the more uncertain the sample.
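To make these criteria concrete, the following is a minimal NumPy sketch (not the authors' code) that ranks unlabeled samples given an n × m matrix of softmax outputs; the function name and the small constant added inside the logarithm for numerical stability are our own choices.

```python
import numpy as np

def rank_uncertain(probs, criterion="entropy", k=1000):
    """Return the indices of the k most uncertain samples.

    probs: (n, m) array of softmax outputs p(y_i = j | x_i; W).
    """
    if criterion == "least_confidence":        # Eq. (2): small max-probability = uncertain
        scores = probs.max(axis=1)
        order = np.argsort(scores)             # ascending
    elif criterion == "margin":                # Eq. (3): small top-2 margin = uncertain
        top2 = np.sort(probs, axis=1)[:, -2:]
        scores = top2[:, 1] - top2[:, 0]
        order = np.argsort(scores)             # ascending
    elif criterion == "entropy":               # Eq. (4): large entropy = uncertain
        scores = -(probs * np.log(probs + 1e-12)).sum(axis=1)
        order = np.argsort(-scores)            # descending
    else:
        raise ValueError(criterion)
    return order[:k]
```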

2) High-Confidence Sample Pseudolabeling: We select the high-confidence samples from D_U whose EN is smaller than the threshold δ and assign clearly predicted pseudolabels to them. The pseudolabel y_i is defined as

j^* = \arg\max_j \, p(y_i = j \mid x_i; W)

y_i = \begin{cases} j^*, & en_i < \delta \\ 0, & \text{otherwise} \end{cases}    (5)

where a nonzero y_i denotes that x_i is regarded as a high-confidence sample. The selected samples are denoted by D_H. Note that, compared with the classification probability p(y_i = j^* | x_i; W) for the j^*-th category alone, the employed EN en_i holistically considers the classification probabilities of the other categories, i.e., the selected sample should be clearly classified with high confidence. The threshold δ is set to a sufficiently small value to guarantee high reliability in assigning pseudolabels.
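A minimal sketch of this pseudolabeling step in the same NumPy setting as above; the helper name and the numerical-stability constant are ours.

```python
import numpy as np

def pseudolabel_high_confidence(probs, delta):
    """Assign pseudolabels to clearly classified unlabeled samples, as in Eq. (5).

    probs: (n, m) softmax outputs; delta: entropy threshold.
    Returns the indices of the selected high-confidence samples (D_H) and their pseudolabels.
    """
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)   # en_i from Eq. (4)
    selected = np.where(entropy < delta)[0]                   # en_i < delta
    pseudo = probs[selected].argmax(axis=1)                   # j* = argmax_j p(y_i = j | x_i; W)
    return selected, pseudo
```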

C. CNN Fine-Tuning

Fixing the labels of the self-labeled high-confidence samples D_H and of the manually annotated ones D_L given by the active user, (1) can be simplified as

\min_{W} \; -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} 1\{y_i = j\} \log p(y_i = j \mid x_i; W)    (6)

where N denotes the number of samples in D_H ∪ D_L. We employ standard backpropagation to update the CNN's parameters W. Specifically, let L denote the loss function of (6); then the partial derivative of L with respect to the network parameters W is

\frac{\partial L}{\partial W} = \frac{\partial}{\partial W}\left(-\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} 1\{y_i = j\} \log p(y_i = j \mid x_i; W)\right)

= -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} 1\{y_i = j\} \frac{\partial \log p(y_i = j \mid x_i; W)}{\partial W}

= -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} \left(1\{y_i = j\} - p(y_i = j \mid x_i; W)\right) \frac{\partial z_j(x_i; W)}{\partial W}    (7)

where {z_j(x_i; W)}_{j=1}^{m} denotes the activation of the last layer of the CNN model for the i-th sample before it is fed into the softmax classifier, which is defined as

p(y_i = j \mid x_i; W) = \frac{e^{z_j(x_i; W)}}{\sum_{t=1}^{m} e^{z_t(x_i; W)}}.    (8)
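For illustration, the following NumPy sketch computes the loss of (6) and its gradient with respect to the last-layer activations z_j(x_i; W), i.e., the (1{y_i = j} − p(y_i = j | x_i; W)) factor appearing in (7); backpropagation then multiplies this by ∂z/∂W. It is a didactic sketch under our own naming, not the paper's Caffe implementation.

```python
import numpy as np

def softmax_xent_grad(logits, labels):
    """Loss of Eq. (6) and its gradient w.r.t. the last-layer activations z.

    logits: (N, m) activations z_j(x_i; W); labels: (N,) integer labels (true or pseudo).
    """
    z = logits - logits.max(axis=1, keepdims=True)                 # numerical stability
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)       # softmax, Eq. (8)
    N = logits.shape[0]
    onehot = np.zeros_like(probs)
    onehot[np.arange(N), labels] = 1.0                             # 1{y_i = j}
    loss = -np.log(probs[np.arange(N), labels] + 1e-12).mean()     # Eq. (6)
    grad_z = -(onehot - probs) / N                                 # sign convention of Eq. (7)
    return loss, grad_z
```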

After fine-tuning, we put the high-confidence samples D_H back into D_U and erase their pseudolabels.

D. Threshold Updating

As the incremental learning process goes on, the classification capability of the classifier improves and more high-confidence samples are selected, which raises the risk of incorrect autoannotation. In order to guarantee the reliability of high-confidence sample selection, at the end of each iteration t, we update the high-confidence sample selection threshold by setting

\delta = \begin{cases} \delta_0, & t = 0 \\ \delta - dr \cdot t, & t > 0 \end{cases}    (9)

where δ_0 is the initial threshold and dr controls the threshold decay rate.

The entire algorithm is summarized in Algorithm 1; a Python sketch of the loop follows the listing. It is easy to see that this alternating optimization strategy accords well with the pipeline of the proposed CEAL framework.

IV. EXPERIMENTS

A. Data Sets and Experimental Settings

1) Data Set Description: In this section, we evaluate our CEAL framework on two publicly available, challenging benchmarks, i.e., CACD [1] and the Caltech-256 object categorization [2] data set (see Fig. 2). CACD is a large-scale and challenging data set for face identification and retrieval problems. It contains more than 160 000 images of 2000 celebrities, which vary in age, pose, illumination, and occlusion. Since not all of the images are annotated, we adopt a subset of 580 individuals from the whole data set in our experiments, in which 200 individuals are originally annotated and 380 persons are additionally annotated by us.


Fig. 2. Demonstration of the effectiveness of our proposed heuristic deep AL framework on face recognition and object categorization. First and second rows: sample images from the Caltech-256 [2] data set. Last row: sample images from CACD [1].

Algorithm 1 Learning Algorithm of CEAL

Input: Unlabeled samples D_U, initially labeled samples D_L, uncertain sample selection size K, high-confidence sample selection threshold δ, threshold decay rate dr, maximum iteration number T, fine-tuning interval t.
Output: CNN parameters W.

1: Initialize W with D_L.
2: while the maximum iteration T is not reached do
3:   Add K uncertain samples into D_L based on Eq. (2), (3), or (4).
4:   Obtain high-confidence samples D_H based on Eq. (5).
5:   In every t iterations:
       • Update W via fine-tuning according to Eq. (6) with D_H ∪ D_L.
       • Update δ according to Eq. (9).
6: end while
7: return W
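The following Python sketch shows how the steps of Algorithm 1 fit together, reusing the rank_uncertain and pseudolabel_high_confidence sketches above; predict_probs, fine_tune, and oracle_label are hypothetical helpers standing in for CNN inference, back-propagation, and the human annotator, and the default hyperparameters simply mirror the CACD settings reported below.

```python
import numpy as np

def ceal_loop(model, X_unlabeled, X_labeled, y_labeled,
              K=2000, delta0=0.05, dr=0.0033, T=20, t_interval=1,
              criterion="entropy"):
    """Sketch of Algorithm 1: alternate querying, pseudolabeling, and fine-tuning."""
    delta = delta0
    unlabeled = list(range(len(X_unlabeled)))
    for it in range(T):
        # Step 3: query the K most uncertain samples and add them to D_L.
        probs = predict_probs(model, X_unlabeled[unlabeled])
        query = [unlabeled[i] for i in rank_uncertain(probs, criterion, K)]
        X_labeled = np.concatenate([X_labeled, X_unlabeled[query]])
        y_labeled = np.concatenate([y_labeled, oracle_label(X_unlabeled[query])])
        unlabeled = [i for i in unlabeled if i not in set(query)]
        # Step 4: pseudolabel the high-confidence samples D_H (Eq. (5)).
        probs = predict_probs(model, X_unlabeled[unlabeled])
        sel, pseudo = pseudolabel_high_confidence(probs, delta)
        high_conf = [unlabeled[i] for i in sel]
        # Step 5: fine-tune on D_H ∪ D_L, then decay the threshold (Eq. (9)).
        if it % t_interval == 0:
            X_train = np.concatenate([X_labeled, X_unlabeled[high_conf]])
            y_train = np.concatenate([y_labeled, pseudo])
            fine_tune(model, X_train, y_train)
            delta = delta0 if it == 0 else delta - dr * it
        # Pseudolabels are discarded afterwards; D_H returns to the unlabeled pool.
    return model
```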

In particular, 6336 images of 80 individuals are utilized for pretraining the network, and the remaining 500 individuals are used to perform the experiments. Caltech-256 is a challenging object category data set. It contains a total of 30 607 images of 256 categories collected from the Internet.

2) Experimental Setting: For CACD, we utilize the method proposed in [38] to detect the facial points and align the faces based on the eye locations. We resize all the faces to 200 × 150 and set the parameters δ_0 = 0.05, dr = 0.0033, and K = 2000. For Caltech-256, we resize all the images to 256 × 256 and set δ_0 = 0.005, dr = 0.00033, and K = 1000. Following the settings in the existing AL method [11], we randomly select 80% of the images of each class to form the unlabeled training set, and the rest are used as the testing set in our experiments. Among the unlabeled training set, we randomly select 10% of the samples of each class to initialize the network, and the rest are used for the incremental learning process. To get rid of the influence of randomness, we average the results of five executions as the final result.
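For reference, the stated experimental settings can be collected into a small configuration structure; the dictionary keys below are our own naming, while the values are taken from the settings stated in this section.

```python
# Illustrative configuration; key names are ours, values follow the stated settings.
EXPERIMENT_CONFIG = {
    "CACD": {
        "input_size": (200, 150),   # aligned face crops
        "delta0": 0.05,             # initial high-confidence entropy threshold
        "dr": 0.0033,               # threshold decay rate
        "K": 2000,                  # uncertain samples queried per iteration
    },
    "Caltech-256": {
        "input_size": (256, 256),   # resized images, randomly cropped to 227 x 227 in training
        "delta0": 0.005,
        "dr": 0.00033,
        "K": 1000,
    },
    "split": {"unlabeled_train": 0.8, "test": 0.2, "init_labeled_of_train": 0.1},
    "runs": 5,                      # results averaged over five executions
}
```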

We use different network architectures for the CACD [1] and Caltech-256 [2] data sets because the difference between faces and objects is relatively large.

TABLE I
Detailed configuration of the CNN architecture used for CACD [1]. It takes 200 × 150 × 3 images as input and generates a 500-way softmax output for class prediction. The ReLU [39] activation function is not shown for brevity.

TABLE II
Detailed configuration of the CNN architecture used for Caltech-256 [2]. It takes 256 × 256 × 3 images as input, which are randomly cropped to 227 × 227 during training, and generates a 256-way softmax output for class prediction. The ReLU activation function is not shown for brevity.

Table I shows the overall network architecture for the CACD experiments, and Table II shows the overall network architecture for the Caltech-256 experiments. We use AlexNet [23] as the network architecture for Caltech-256 and use the ImageNet ILSVRC data set [40] pretrained model as the starting point, following the setting of [41]. Then we keep all the layers fixed and just modify the last layer to be a 256-way softmax classifier to perform the Caltech-256 experiments. We employ Caffe [42] for the CNN implementation.
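A minimal PyTorch-style sketch of this Caltech-256 setup (the paper itself uses Caffe [42]); loading torchvision's ImageNet-pretrained AlexNet and swapping its final layer for a 256-way classifier is our approximation of the described procedure.

```python
import torch.nn as nn
from torchvision import models

def build_caltech256_model(num_classes=256):
    """ImageNet-pretrained AlexNet with its last layer replaced by a 256-way classifier."""
    model = models.alexnet(pretrained=True)            # pretrained starting point
    in_features = model.classifier[6].in_features      # final fully connected layer
    model.classifier[6] = nn.Linear(in_features, num_classes)
    return model
```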


Fig. 3. Classification accuracy under different percentages of annotated samples of the whole training set on the (a) CACD and (b) Caltech-256 data sets. Our proposed method CEAL_MS performs consistently better than the compared TCAL and AL_RAND.

For CACD, we set the learning rates of all the layers to 0.01. For Caltech-256, we set the learning rates of all the layers to 0.001, except for the softmax layer, which is set to 0.01. All the experiments are conducted on a common desktop PC with an Intel 3.8-GHz CPU and a Titan X GPU. On average, 17 h are needed to finish training on the CACD data set with 44 708 images.

3) Comparison Methods: To demonstrate that our proposed CEAL framework can improve the classification performance with less labeled data, we compare CEAL with a state-of-the-art AL method [triple-criteria AL (TCAL)] and baseline methods (AL_ALL and AL_RAND).

1) AL_ALL: We manually label all the training samples and use them to train the CNN. This method can be regarded as the upper bound (the best performance that the CNN can reach with all labeled training samples).

2) AL_RAND: During the training process, we randomly select samples to be annotated to fine-tune the CNN. This method discards all AL techniques and can be considered as the lower bound.

3) TCAL [3]: TCAL is a comprehensive AL approach that is well designed to jointly evaluate sample selection criteria (uncertainty, diversity, and density) and has outperformed state-of-the-art methods with many fewer annotations. TCAL represents those methods that intend to mine a minority of informative samples to improve performance. Thus, we regard it as a relevant competitor.

Implementation Details: The compared methods share the same CNN architecture with our CEAL on both data sets. The only difference is in the sample selection criteria. For the AL_ALL baseline, we select all training samples to fine-tune the CNN, i.e., all labels are used. For TCAL, we follow the pipeline of [3] by training an SVM classifier and then applying the uncertainty, diversity, and density criteria to select the most informative samples. Specifically, the uncertainty of samples is assessed according to the MS strategy. The diversity is calculated by clustering the most uncertain samples via k-means with a histogram intersection kernel. The density of one sample is measured by calculating the average distance to the other samples within the cluster it belongs to. For each cluster, the sample with the highest density (i.e., the smallest average distance) is selected as the most informative sample. For CACD, we cluster the 2000 most uncertain samples and select the 500 most informative samples according to the above-mentioned diversity and density. For Caltech-256, we select 250 most informative samples from the 1000 most uncertain samples. To make a fair comparison, samples selected in each iteration by TCAL are also used to fine-tune the CNN to learn the optimal feature representation, as in AL_RAND. Once the optimal features are learned, the SVM classifier of TCAL is further updated.
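A simplified sketch of the described TCAL diversity/density step; scikit-learn's KMeans with plain Euclidean distance is substituted for the histogram-intersection-kernel k-means used in the paper, and the helper name is ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def tcal_select(features, uncertain_idx, n_clusters):
    """Pick one representative (highest-density) sample per cluster of uncertain samples."""
    feats = features[uncertain_idx]
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    selected = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        # Density = average distance to the other samples in the cluster;
        # the smallest average distance marks the most representative sample.
        dists = np.linalg.norm(feats[members][:, None] - feats[members][None, :], axis=-1)
        avg_dist = dists.sum(axis=1) / max(len(members) - 1, 1)
        selected.append(uncertain_idx[members[avg_dist.argmin()]])
    return selected
```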

B. Comparison Results and Empirical Analysis

1) Comparison Results: To demonstrate the effectiveness of our proposed framework, we also apply the MS criterion to measure the uncertainty of samples and denote this method by CEAL_MS. Fig. 3 illustrates the accuracy versus percentage-of-annotated-samples curves of AL_RAND, AL_ALL, TCAL, and the proposed CEAL_MS on both the CACD and Caltech-256 data sets. These curves show the classification accuracy under different percentages of annotated samples of the whole training set.

As illustrated in Fig. 3 and Table III(a) and (b), our proposed CEAL framework outperforms the compared methods in terms of both recognition accuracy and user annotation amount. From the aspect of recognition accuracy, given the same percentage of annotated samples, our CEAL_MS outperforms the compared methods by a clear margin, especially when the percentage of annotated samples is low. From the aspect of user annotation amount, to achieve 91.5% recognition accuracy on the CACD data set, AL_RAND and TCAL require 99% and 81% labeled training samples, respectively. CEAL_MS needs only 63% labeled samples and reduces user annotations by around 36% and 18%, compared with AL_RAND and TCAL. To achieve 73.8% accuracy on the Caltech-256 data set, AL_RAND and TCAL require 97% and 93% labeled samples, respectively. CEAL_MS needs only 78% labeled samples and reduces user annotations by around 19% and 15%, compared with AL_RAND and TCAL. This justifies that our proposed CEAL framework can effectively reduce the need for labeled samples.


TABLE III
Classification accuracy at some specific AL iterations on the (a) CACD and (b) Caltech-256 data sets.

Fig. 4. Extensive study of different informative sample selection criteria on the CACD (first row) and Caltech-256 (second row) data sets. These criteria include LC (first column), MS (second column), and EN (third column). One can observe that our CEAL framework works consistently well with the common informative sample selection criteria.

From the above results, one can see that our proposed CEAL framework performs consistently better than the state-of-the-art method TCAL in both recognition accuracy and user annotation amount under fair comparisons. This is because TCAL mines only a minority of informative samples and is not able to provide sufficient training data for feature learning under the deep image classification scenario. Hence, our CEAL has a competitive advantage in deep image classification tasks. To clearly analyze our CEAL and justify the effectiveness of its components, we have conducted several experiments, discussed in the following sections.

2) Component Analysis: To justify that the proposed CEAL can work consistently well with the common informative sample selection criteria, we implement three variants of CEAL that use LC, MS, and EN, respectively, to assess uncertain samples. These three variants are denoted by CEAL_LC, CEAL_MS, and CEAL_EN. Meanwhile, to show the raw performance of these criteria, we discard the cost-effective high-confidence sample selection from the above-mentioned variants and denote the resulting versions by AL_LC, AL_MS, and AL_EN. To clarify the contribution of our strategy of pseudolabeling the majority of high-confidence samples, we further introduce this strategy into AL_RAND and denote this variant by CEAL_RAND. Since AL_RAND means randomly selecting samples to be annotated, CEAL_RAND reflects the original contribution of the strategy of pseudolabeling the majority of high-confidence samples, i.e., CEAL_RAND denotes the method that uses only the pseudolabeled majority of samples.

Fig. 4 illustrates the results of these variants on the CACD (first row) and Caltech-256 (second row) data sets.


Fig. 5. Comparison between different informative sample selection criteria and their fusion (CEAL_FUSION) on the (a) CACD and (b) Caltech-256 data sets.

The results demonstrate that, given the same percentage of labeled samples and compared with AL_RAND, CEAL_RAND, which simply exploits the pseudolabeled majority of samples, obtains performance gains similar to those of AL_LC, AL_MS, and AL_EN, which employ informative sample selection criteria. This justifies that our proposed strategy of pseudolabeling the majority of samples is as effective as some common informative sample selection criteria. Moreover, as one can see in Fig. 4, CEAL_LC, CEAL_MS, and CEAL_EN all consistently outperform the pure pseudolabeling version CEAL_RAND and their counterparts without pseudolabeled samples, AL_LC, AL_MS, and AL_EN, by a clear margin on both the CACD and Caltech-256 data sets. This validates that our proposed strategy of pseudolabeling the majority of samples is complementary to the common informative sample selection criteria and can further significantly improve the recognition performance.

To analyze the choice of informative sample selection criteria, we have made a comparison among the three above-mentioned criteria. We also make an attempt to simply combine them together. Specifically, in each iteration, we select the top K/2 samples according to each criterion, respectively. Then we remove repeated ones (i.e., some samples may be selected by the three criteria at the same time) from the obtained 3K/2 samples. After removing the repeated samples, we randomly select K samples from them for user annotation. We denote this method by CEAL_FUSION.
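A short sketch of the CEAL_FUSION selection rule, reusing the rank_uncertain helper from the earlier sketch; the function name is ours.

```python
import random

def fuse_criteria(probs, K):
    """Select top K/2 samples per criterion, deduplicate, then draw K for annotation."""
    pool = set()
    for criterion in ("least_confidence", "margin", "entropy"):
        pool.update(rank_uncertain(probs, criterion, K // 2).tolist())
    pool = list(pool)                       # at most 3K/2 candidates, duplicates removed
    return random.sample(pool, min(K, len(pool)))
```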

Fig. 5 illustrates that CEAL_LC, CEAL_MS, and CEAL_EN have similar performance, while CEAL_FUSION performs better. This demonstrates that the informative sample selection criterion still plays an important role in improving the recognition accuracy. Though being a minority, the informative samples have a great potential impact on the classifier.

C. Reliability of CEAL

From the above experiments, we know that the performance of our framework is better than those of the other methods, which shows the superiority of introducing the majority of pseudolabeled samples. But how accurate is the assignment of pseudolabels to those high-confidence samples? In order to demonstrate the reliability of our proposed CEAL framework, we also evaluate the average error in selecting high-confidence samples.

Fig. 6. Average error rate of the pseudolabels of high-confidence samples assigned by the heuristic strategy in the CACD and Caltech-256 experiments. The vertical axes represent the average error rate and the horizontal axes represent the learning iteration. Our proposed CEAL framework can assign reliable pseudolabels to the unlabeled samples under an acceptable average error rate.

Fig. 6 shows the error rate of pseudolabel assignment along the learning iterations. As one can see, the average error rate is quite low (less than 3% on the CACD data set and less than 5.5% on the Caltech-256 data set) even at early iterations. Hence, our proposed CEAL framework can assign reliable pseudolabels to the unlabeled samples under an acceptable average error rate along the learning iterations.

D. Sensitivity of High-Confidence Threshold

Since the training phase of deep CNNs is time-consuming, it is not affordable to employ a trial-and-error approach to set the threshold for defining high-confidence samples. We further analyze the sensitivity of the threshold parameters δ (threshold) and dr (threshold decay rate) with respect to our system performance on the CACD data set using CEAL_EN. While analyzing the sensitivity of the parameter δ, we fix the decay rate dr to 0.0033. We fix the threshold δ to 0.05 when analyzing the sensitivity of dr.


Fig. 7. Sensitivity analysis of the heuristic threshold δ (top) and the decay rate dr (bottom). One can observe that these parameters do not substantially affect the overall system performance.

The results of the sensitivity analysis of δ (range 0.045 to 0.1) are shown in the top of Fig. 7, while the sensitivity analysis of dr (range 0.001 to 0.0035) is shown in the bottom of Fig. 7. Note that the test ranges of δ and dr are set to ensure the majority-of-high-confidence assumption of this paper. Though the range of {δ, dr} seems narrow in value, it leads to a significant difference: about 10%–60% of the samples are pseudolabeled in high-confidence sample selection. The low standard deviation of the accuracy in Fig. 7 shows that the choice of these parameters does not significantly affect the overall system performance.

V. CONCLUSION

In this paper, we propose a CEAL framework for deep image classification tasks, which employs a complementary sample selection strategy: progressively select the minority of most informative samples and pseudolabel the majority of high-confidence samples for model updating. In such a holistic manner, the minority of labeled samples benefits the decision boundary of the classifier and the majority of pseudolabeled samples provides sufficient training data for robust feature learning. Extensive experimental results on two publicly available, challenging benchmarks justify the effectiveness of our proposed CEAL framework. In future work, we plan to apply our framework to more challenging large-scale object recognition tasks (e.g., 1000 categories in ImageNet) and to incorporate more persons from the CACD data set to evaluate our framework. Moreover, we plan to generalize our framework to other multilabel object recognition tasks (e.g., 20 categories in PASCAL VOC).

ACKNOWLEDGMENT

The authors would like to thank D. Liang and J. Xu for their preliminary contributions to this project.

REFERENCES

[1] B.-C. Chen, C.-S. Chen, and W. H. Hsu, "Cross-age reference coding for age-invariant face recognition and retrieval," in Proc. ECCV, 2014, pp. 768–783.

[2] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," California Inst. Technol., Pasadena, CA, USA, Tech. Rep. 7694, 2007.

[3] B. Demir and L. Bruzzone, "A novel active learning method in relevance feedback for content-based remote sensing image retrieval," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 5, pp. 2323–2334, May 2015.

[4] K. Brinker, "Incorporating diversity in active learning with support vector machines," in Proc. ICML, 2003, pp. 1–8.

[5] B. Long, J. Bian, O. Chapelle, Y. Zhang, Y. Inagaki, and Y. Chang, "Active learning for ranking through expected loss optimization," IEEE Trans. Knowl. Data Eng., vol. 27, no. 5, pp. 1180–1191, May 2015.

[6] A. J. Joshi, F. Porikli, and N. Papanikolopoulos, "Multi-class active learning for image classification," in Proc. CVPR, Jun. 2009, pp. 2372–2379.

[7] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell, "Active learning with Gaussian processes for object categorization," in Proc. ICCV, Oct. 2007, pp. 1–8.

[8] A. Kapoor, G. Hua, A. Akbarzadeh, and S. Baker, "Which faces to tag: Adding prior constraints into active learning," in Proc. ICCV, Sep./Oct. 2009, pp. 1058–1065.

[9] R. M. Castro and R. D. Nowak, "Minimax bounds for active learning," IEEE Trans. Inf. Theory, vol. 54, no. 5, pp. 2339–2353, May 2008.

[10] X. Li and Y. Guo, "Adaptive active learning for image classification," in Proc. CVPR, Jun. 2013, pp. 859–866.

[11] E. Elhamifar, G. Sapiro, A. Yang, and S. S. Sastry, "A convex optimization framework for active learning," in Proc. Int. Conf. Comput. Vis. (ICCV), Dec. 2013, pp. 209–216.

[12] D. D. Lewis, "A sequential algorithm for training text classifiers: Corrigendum and additional data," ACM SIGIR Forum, vol. 29, no. 2, pp. 13–19, 1995.

[13] X. Li and Y. Guo, "Multi-level adaptive active learning for scene classification," in Proc. ECCV, 2014, pp. 234–249.

[14] B. Zhang, Y. Wang, and F. Chen, "Multilabel image classification via high-order label correlation driven active learning," IEEE Trans. Image Process., vol. 23, no. 3, pp. 1430–1441, Mar. 2014.

[15] F. Sun, M. Xu, and X. Jiang, "Robust multi-label image classification with semi-supervised learning and active learning," in Proc. 21st Int. Conf. MultiMedia Modeling, 2015, pp. 512–523.

[16] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu, "Batch mode active learning and its application to medical image classification," in Proc. ICML, 2006, pp. 417–424.

[17] D. Tuia, F. Ratle, F. Pacifici, M. F. Kanevski, and W. J. Emery, "Correction to 'Active learning methods for remote sensing image classification,'" IEEE Trans. Geosci. Remote Sens., vol. 48, no. 6, p. 2767, Jun. 2010.

[18] S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," J. Mach. Learn. Res., vol. 2, pp. 45–66, Mar. 2001.

[19] A. McCallum and K. Nigam, "Employing EM and pool-based active learning for text classification," in Proc. ICML, 1998, pp. 350–358.

[20] G. Schohn and D. Cohn, "Less is more: Active learning with support vector machines," in Proc. ICML, 2000, pp. 1–8.

[21] S. Vijayanarasimhan and K. Grauman, "Large-scale live active learning: Training object detectors with crawled data and crowds," in Proc. CVPR, Jun. 2011, pp. 1449–1456.

[22] A. G. Hauptmann, W.-H. Lin, R. Yan, J. Yang, and M.-Y. Chen, "Extreme video retrieval: Joint maximization of human and computer performance," in Proc. 14th ACM Int. Conf. Multimedia, 2006, pp. 385–394.


[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS, 2012, pp. 1097–1105.

[24] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proc. CVPR, Jun. 2012, pp. 3642–3649.

[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. CVPR, Jun. 2009, pp. 248–255.

[26] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. ICLR, 2015, pp. 1–14.

[27] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann, "Self-paced curriculum learning," in Proc. AAAI, 2015, pp. 1–7.

[28] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann, "Self-paced learning for matrix factorization," in Proc. AAAI, 2015, pp. 3196–3202.

[29] L. Jiang, D. Meng, T. Mitamura, and A. G. Hauptmann, "Easy samples first: Self-paced reranking for zero-example multimedia search," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 547–556.

[30] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. G. Hauptmann, "Self-paced learning with diversity," in Proc. NIPS, 2014, pp. 2078–2086.

[31] B. Settles, "Active learning literature survey," Comput. Sci. Dept., Univ. Wisconsin–Madison, Madison, WI, USA, Tech. Rep. 1648, 2009.

[32] T. Scheffer, C. Decomain, and S. Wrobel, "Active hidden Markov models for information extraction," in Proc. IDA, 2001, pp. 309–318.

[33] C. E. Shannon, "A mathematical theory of communication," ACM SIGMOBILE Mobile Comput. Commun. Rev., vol. 5, no. 1, pp. 3–55, 2001.

[34] M. P. Kumar, B. Packer, and D. Koller, "Self-paced learning for latent variable models," in Proc. NIPS, 2010, pp. 1189–1197.

[35] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proc. ICML, 2009, pp. 41–48.

[36] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby, "Selective sampling using the query by committee algorithm," Mach. Learn., vol. 28, no. 2, pp. 133–168, 1997.

[37] A. J. Joshi, F. Porikli, and N. P. Papanikolopoulos, "Scalable active learning for multiclass image classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2259–2273, Nov. 2012.

[38] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in Proc. CVPR, Jun. 2013, pp. 532–539.

[39] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. ICML, 2010, pp. 1–8.

[40] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.

[41] J. Donahue et al., "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. 31st Int. Conf. Mach. Learn. (ICML), Beijing, China, Jun. 2014, pp. 21–26.

[42] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 675–678.

Keze Wang received the B.S. degree in software engineering from Sun Yat-sen University, Guangzhou, China, in 2012. He is currently pursuing the Ph.D. degrees in computer science and technology with Sun Yat-sen University and Hong Kong Polytechnic University, Hong Kong, under the supervision of Prof. L. Lin and Prof. L. Zhang.

His current research interests include computer vision and machine learning.

Dongyu Zhang received the B.S. and Ph.D. degrees from the Harbin Institute of Technology, Harbin, China, in 2003 and 2010, respectively.

He is currently a Research Associate with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. His current research interests include computer vision and machine learning.

Ya Li received the B.E. degree from Zhengzhou University, Zhengzhou, China, in 2002, the M.E. degree from Southwest Jiaotong University, Chengdu, China, in 2006, and the Ph.D. degree from Sun Yat-sen University, Guangzhou, China, in 2015.

She is currently a Lecturer with the School of Computer Science and Educational Software, Guangzhou University, Guangzhou. Her current research interests include computer vision and machine learning.

Ruimao Zhang received the B.E. degree from the School of Software, Sun Yat-sen University, Guangzhou, China, in 2011, where he is currently pursuing the Ph.D. degree in computer science with the School of Information Science and Technology.

He was a Visiting Ph.D. Student with the Department of Computing, Hong Kong Polytechnic University, Hong Kong, from 2013 to 2014. His current research interests include computer vision, pattern recognition, machine learning, and related applications.

Liang Lin (SM'14) received the B.S. and Ph.D. degrees from the Beijing Institute of Technology, Beijing, China, in 1999 and 2008, respectively.

He was a Post-Doctoral Research Fellow with the Department of Statistics, University of California at Los Angeles, Los Angeles, CA, USA, from 2008 to 2010. He was a Visiting Scholar with the Department of Computing, Hong Kong Polytechnic University, Hong Kong, and with the Department of Electronic Engineering, Chinese University of Hong Kong, Hong Kong. He is currently a Professor with the School of Computer Science, Sun Yat-sen University, Guangzhou, China. He has authored over 100 papers in top-tier academic journals and conferences. His current research interests include new models, algorithms, and systems for intelligent processing and understanding of visual data, such as images and videos.

Prof. Lin received the Best Paper Runner-Up Award at ACM NPAR 2010, the Google Faculty Award in 2012, the Best Student Paper Award at IEEE ICME 2014, and the Hong Kong Scholars Award in 2014. He currently serves as an Associate Editor of the IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS.

Page 11: Senior Member, IEEE IEEE Proof - linliang.netlinliang.net/wp-content/uploads/2017/07/TCSVT_CEAL.pdf · IEEE Proof 2 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

IEEE P

roof

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY 1


In the AL methods [3]–[5], the classifier/model is first initialized with a relatively small set of labeled training samples. Then it is continuously boosted by selecting and pushing some of the most informative samples to the user for annotation. Although the existing AL approaches [10], [11] have demonstrated impressive results on image classification, their classifiers/models are trained with hand-crafted features (e.g., HoG and SIFT) on small-scale visual data sets. The effectiveness of AL on more challenging image classification tasks has not been well studied.

Recently, incredible progress on visual recognition tasks has been made by deep learning approaches [23], [24]. With sufficient labeled data [25], deep convolutional neural networks (CNNs) [23], [26] are trained to learn features directly from raw pixels, which has achieved state-of-the-art performance for image classification. However, in many real applications of large-scale image classification, the labeled data are not enough, since the tedious manual labeling process requires a lot of time and labor. Thus, it is of great practical significance to develop a framework that combines CNNs and AL and can jointly learn features and classifiers/models from unlabeled training data with minimal human annotations. However, incorporating CNNs into an AL framework is not straightforward for real image classification tasks. This is due to the following two issues.

1) The labeled training samples given by current AL approaches are insufficient for CNNs, as the majority of unlabeled samples are usually ignored in AL. AL usually selects only a few of the most informative samples (e.g., samples with quite low prediction confidence) in each learning step and frequently solicits user labeling. Thus, it is difficult to obtain proper feature representations by fine-tuning CNNs with this minority of informative samples.

2) The processing pipelines of AL and CNNs are inconsistent with each other. Most AL methods pay close attention to model/classifier training. Their strategies for selecting the most informative samples depend heavily on the assumption that the feature representation is fixed. However, feature learning and classifier training are jointly optimized in CNNs. Because of this inconsistency, simply fine-tuning CNNs in the traditional AL framework may face the divergence problem.

Fig. 1. Illustration of our proposed CEAL framework. Our proposed CEAL progressively feeds the samples from the unlabeled data set into the CNN. Then both the clearly-classified-sample and most-informative-sample selection criteria are applied to the classifier output of the CNN. After adding the user-annotated minority of uncertain samples into the labeled set and pseudolabeling the majority of certain samples, the model (feature representation and classifier of the CNN) is further updated.

Inspired by the insights and lessons from a significant amount of previous work, as well as the recently proposed self-paced learning technique [27]–[30], we address the above-mentioned issues by cost-effectively combining the CNN and AL via a complementary sample selection. In particular, we propose a novel AL framework called cost-effective AL (CEAL), which is able to fine-tune the CNN with sufficient unlabeled training data and overcomes the inconsistency between AL and the CNN.

Different from the existing AL approaches that consider only the most informative and representative samples, our CEAL proposes to automatically select and pseudoannotate unlabeled samples. As Fig. 1 illustrates, our proposed CEAL progressively feeds the samples from the unlabeled data set into the CNN and selects two kinds of samples for fine-tuning according to the output of the CNN's classifiers. One kind is the minority of samples with low prediction confidence, called the most informative/uncertain samples, whose predicted labels are the most uncertain. For the selection of these uncertain samples, the proposed CEAL considers three common AL methods: least confidence (LC) [31], margin sampling (MS) [32], and entropy (EN) [33]. The selected samples are added into the labeled set after active user labeling. The other kind is the majority of samples with high prediction confidence, called high-confidence samples, whose predicted labels are the most certain. For this kind of samples, the proposed CEAL automatically assigns pseudolabels with no human labor cost. As one can see, these two kinds of samples are complementary to each other in representing different confidence levels of the current model on the unlabeled data set. In the model updating stage, all the samples in the labeled set and the currently pseudolabeled high-confidence samples are exploited to fine-tune the CNN.

The proposed CEAL advances in employing these two complementary kinds of samples to incrementally improve the model's classifier training and feature learning: the minority of informative samples contributes to training more powerful classifiers, while the majority of high-confidence samples conduces to learning more discriminative feature representations. On one hand, although their number is small, the most uncertain unlabeled samples usually have a great potential impact on the classifiers. Selecting and annotating them for training can lead to a better decision boundary of the classifiers. On the other hand, though unable to significantly improve the performance of the classifiers, the high-confidence unlabeled samples are close to the labeled samples in the CNN's feature space. Thus, pseudolabeling this majority of high-confidence samples for training is a reasonable data augmentation for the CNN to learn robust features. In particular, the number of high-confidence samples is actually much larger than that of the most uncertain ones. With the obtained robust feature representation, the inconsistency between AL and the CNN can be overcome.

For the problem of keeping the model stable in the training stage, many works [34], [35] have been proposed in recent years, inspired by the human learning process of gradually including samples into training from easy to complex. In this way, the training samples for further iterations are gradually determined by the model itself based on what it has already learned [30]. In other words, the model can gradually select the high-confidence samples as pseudolabeled ones along with the training process. The advantages of these related studies motivate us to incrementally select unlabeled samples in an easy-to-hard manner to make the pseudolabeling process reliable. Specifically, considering that the classification model is usually not reliable enough in the initial iterations, we employ a high-confidence threshold to define clearly classified samples and assign them pseudolabels. When the performance of the classification model improves, the threshold correspondingly decreases.

The main contribution of this paper is threefold. First, to the best of our knowledge, our work is the first to address deep image classification problems by combining an AL framework with CNN training. Our framework can be easily extended to other similar visual recognition tasks. Second, this paper also advances AL development by introducing a cost-effective strategy to automatically select and annotate high-confidence samples, which improves the traditional sample selection strategies. Third, experiments on the challenging cross-age celebrity face recognition data set (CACD) [1] and Caltech-256 [2] data sets show that our approach outperforms other methods not only in classification accuracy but also in the reduction of human annotation.

The rest of this paper is organized as follows. Section II presents a brief review of related work. Section III discusses the components of our framework and the corresponding learning algorithm. Section IV presents the experimental results with deep empirical analysis. Section V concludes this paper.

II. RELATED WORK

The key idea of AL is that a learning algorithm should achieve higher accuracy with fewer labeled training samples if it is allowed to choose the ones from which it learns [31]. In this way, the instance selection scheme becomes extremely important. One of the most common strategies is uncertainty-based selection [12], [18], which measures the uncertainty of novel unlabeled samples from the predictions of previous classifiers. Lewis [12] proposed to extract the sample that has the largest EN on the conditional distribution over predicted labels as the most uncertain instance. The support vector machine (SVM)-based method [18] determined the uncertain samples based on the relative distance between the candidate samples and the decision boundary. Some earlier works [19], [36] also determined the sample uncertainty by referring to a committee of classifiers (i.e., examining the disagreement among class labels assigned by a set of classifiers). Such a theoretically motivated framework is called query-by-committee in the literature [31]. All the above-mentioned uncertainty-based methods usually ignore the majority of certain unlabeled samples and are thus sensitive to outliers. Later methods have taken the information density measure into account and exploited the information of unlabeled data when selecting samples. These approaches usually sequentially select the informative samples relying on probability estimation [6], [37] or prior information [8] to minimize the generalization error of the trained classifier over the unlabeled data. For example, Joshi et al. [6] considered an uncertainty sampling method based on the probability estimation of class membership for all the instances in the selection pool, and such a method can effectively handle the multiclass case. In [8], some context constraints are introduced as priors to guide users to tag face images more efficiently. At the same time, a series of works [7], [24] was proposed to take the samples that maximize the increase of mutual information between the candidate instance and the remaining unlabeled instances under the Gaussian process framework. Li and Guo [10] presented a novel adaptive AL approach that combines an information density measure and a most-uncertainty measure together to label critical instances for image classification. Moreover, the diversity of the selected instances over a certain category has been taken into consideration in [4] as well. That work is also the pioneering study expanding SVM-based AL from single mode to batch mode. Recently, Elhamifar et al. [11] further integrated the uncertainty and diversity measurements into a unified batch-mode framework via convex programming for unlabeled sample selection. Such an approach is feasible in conjunction with any type of classifier, not limited to max-margin ones. It is obvious that all the above-mentioned AL methods consider only those low-confidence samples (e.g., uncertain and diverse samples), while losing sight of the large majority of high-confidence samples. We hold that, due to their majority and consistency, these high-confidence samples will also be beneficial for improving the accuracy and keeping the stability of the classifiers. Moreover, we shall demonstrate that considering these high-confidence samples can also effectively reduce the user annotation effort.

III. COST-EFFECTIVE ACTIVE LEARNING

In this section, we develop an efficient algorithm for the proposed CEAL framework. Our objective is to apply our CEAL framework to deep image classification tasks by progressively selecting complementary samples for model updating. Suppose we have a data set of m categories and n samples denoted by $D = \{x_i\}_{i=1}^{n}$. We denote the currently annotated samples of D by D_L, and the unlabeled ones by D_U. The label of x_i is denoted by y_i = j, j ∈ {1, ..., m}, i.e., x_i belongs to the jth category. We should give two necessary remarks on our problem settings. One is that in our investigated image classification problems, almost all data are unlabeled, i.e., most of the {y_i} values of D are unknown and need to be completed in the learning process. The other remark is that D_U might be input into the system in an incremental way. This means that the data scale might be consistently growing.

By handling both manually annotated and automatically pseudolabeled samples together, our proposed CEAL model can progressively fit the consistently growing unlabeled data in a holistic manner. The CEAL for deep image classification is formulated as follows:

$$\min_{W,\; \{y_i : i \in D_U\}} \; -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \mathbf{1}\{y_i = j\} \log p(y_i = j \mid x_i; W) \tag{1}$$

where 1{·} is the indicator function, so that 1{a true statement} = 1 and 1{a false statement} = 0, and W denotes the network parameters of the CNN. p(y_i = j | x_i; W) denotes the softmax output of the CNN for the jth category, which represents the probability of the sample x_i belonging to the jth class.

An alternating search strategy is readily employed to optimize (1). Specifically, the algorithm alternately updates the pseudolabels {y_i : i ∈ D_U} and the network parameters W. In the following, we introduce the details of the optimization steps and give their physical interpretations. The practical implementation of CEAL is discussed at the end.

A. Initialization

Before the experiment starts, the labeled set D_L is empty. For each class, we randomly select a few training samples from D_U and manually annotate them as the starting point to initialize the CNN parameters W.

B. Complementary Sample Selection

Fixing the CNN parameters W, we first rank all unlabeled samples according to the common AL criteria, then manually annotate the most uncertain samples and add them into D_L. For the most certain ones, we assign pseudolabels and denote them by D_H.


1) Informative Sample Annotating: Our CEAL can be used in conjunction with any common active learning criterion, e.g., LC [31], MS [32], and EN [33], to select the K most informative/uncertain samples left in D_U. The selection criteria are based on p(y_i = j | x_i; W), which denotes the probability of x_i belonging to the jth class. Specifically, the three selection criteria are defined as follows.

1) LC: Rank all the unlabeled samples in ascending order according to their lc_i value, defined as

$$lc_i = \max_{j} \; p(y_i = j \mid x_i; W). \tag{2}$$

If the probability of the most probable class for a sample is low, the classifier is uncertain about the sample.

2) MS: Rank all the unlabeled samples in ascending order according to their ms_i value, defined as

$$ms_i = p(y_i = j_1 \mid x_i; W) - p(y_i = j_2 \mid x_i; W) \tag{3}$$

where j_1 and j_2 represent the first and second most probable class labels predicted by the classifiers, respectively. The smaller the margin, the more uncertain the classifier is about the sample.

3) EN: Rank all the unlabeled samples in descending order according to their en_i value, defined as

$$en_i = -\sum_{j=1}^{m} p(y_i = j \mid x_i; W) \log p(y_i = j \mid x_i; W). \tag{4}$$

This method takes all class label probabilities into consideration to measure the uncertainty. The higher the EN value, the more uncertain the sample.
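For concreteness, the three ranking rules can be written in a few lines. The following is a minimal NumPy sketch, not the authors' implementation; `probs` is a hypothetical (n × m) matrix of softmax outputs p(y_i = j | x_i; W) for the unlabeled pool, and the function returns the indices of the k most uncertain samples under the chosen criterion.

```python
import numpy as np

def rank_uncertain(probs, criterion="entropy", k=1000):
    """Return indices of the k most uncertain samples under Eq. (2)-(4).

    probs: (n, m) array of softmax outputs p(y_i = j | x_i; W).
    """
    eps = 1e-12
    if criterion == "least_confidence":      # Eq. (2): low max-probability
        order = np.argsort(probs.max(axis=1))            # ascending
    elif criterion == "margin":              # Eq. (3): small gap between top-2 classes
        top2 = np.sort(probs, axis=1)[:, -2:]
        order = np.argsort(top2[:, 1] - top2[:, 0])      # ascending
    elif criterion == "entropy":             # Eq. (4): high entropy
        ent = -(probs * np.log(probs + eps)).sum(axis=1)
        order = np.argsort(-ent)                         # descending
    else:
        raise ValueError("unknown criterion: %s" % criterion)
    return order[:k]
```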

2) High-Confidence Sample Pseudolabeling: We select the high-confidence samples from D_U whose EN is smaller than the threshold δ, and assign them their clearly predicted pseudolabels. The pseudolabel y_i is defined as

$$j^{*} = \arg\max_{j} \; p(y_i = j \mid x_i; W)$$
$$y_i = \begin{cases} j^{*}, & en_i < \delta \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where a nonzero y_i (i.e., y_i = j*) denotes that x_i is regarded as a high-confidence sample. The selected samples are denoted by D_H. Note that, compared with the classification probability p(y_i = j* | x_i; W) for the j*th category, the employed EN en_i holistically considers the classification probabilities of the other categories, i.e., the selected sample should be clearly classified with high confidence. The threshold δ is set conservatively to guarantee a high reliability of assigning a pseudolabel.
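The pseudolabeling rule of Eq. (5) is equally compact. Below is a small sketch under the same assumptions as above (`probs` is the hypothetical softmax output matrix for the unlabeled pool); it returns the indices of the high-confidence set D_H together with their pseudolabels, which are later erased after fine-tuning.

```python
import numpy as np

def pseudolabel_high_confidence(probs, delta):
    """Eq. (5): pseudolabel every sample whose prediction entropy is below delta."""
    eps = 1e-12
    entropy = -(probs * np.log(probs + eps)).sum(axis=1)   # en_i
    selected = np.flatnonzero(entropy < delta)              # indices of D_H
    pseudo = probs[selected].argmax(axis=1)                  # j* for each selected sample
    return selected, pseudo
```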

C. CNN Fine-Tuning

Fixing the labels of the self-labeled high-confidence samples D_H and of the manually annotated ones D_L, (1) can be simplified as

$$\min_{W} \; -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} \mathbf{1}\{y_i = j\} \log p(y_i = j \mid x_i; W) \tag{6}$$

where N denotes the number of samples in D_H ∪ D_L. We employ standard backpropagation to update the CNN's parameters W. Specifically, let L denote the loss function of (6); the partial derivative of the loss with respect to the network parameters W is

$$\frac{\partial L}{\partial W} = \frac{\partial}{\partial W}\left(-\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} \mathbf{1}\{y_i = j\} \log p(y_i = j \mid x_i; W)\right)$$
$$= -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} \mathbf{1}\{y_i = j\} \frac{\partial \log p(y_i = j \mid x_i; W)}{\partial W}$$
$$= -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} \left(\mathbf{1}\{y_i = j\} - p(y_i = j \mid x_i; W)\right) \frac{\partial z_j(x_i; W)}{\partial W} \tag{7}$$

where $\{z_j(x_i; W)\}_{j=1}^{m}$ denotes the activations of the last layer of the CNN model for the ith sample before being fed into the softmax classifier, which is defined as

$$p(y_i = j \mid x_i; W) = \frac{e^{z_j(x_i; W)}}{\sum_{t=1}^{m} e^{z_t(x_i; W)}}. \tag{8}$$
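The derivation in (7) and (8) reduces to the familiar fact that, for a softmax classifier, the gradient of the cross-entropy loss with respect to the logit z_j is p(y = j | x; W) − 1{y = j}. The small NumPy check below is our own illustration, not part of the paper; it verifies this identity against finite differences for a single sample.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                     # numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss(z, y):
    # per-sample cross-entropy of Eq. (6): -log p(y | x; W)
    return -np.log(softmax(z)[y])

rng = np.random.default_rng(0)
z, y = rng.normal(size=5), 2
analytic = softmax(z)
analytic[y] -= 1.0                      # dL/dz_j = p_j - 1{y = j}, cf. Eq. (7)

h, numeric = 1e-6, np.zeros_like(z)
for j in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[j] += h
    zm[j] -= h
    numeric[j] = (loss(zp, y) - loss(zm, y)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```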

After fine-tuning, we put the high-confidence samples D_H back into D_U and erase their pseudolabels.

D. Threshold Updating

As the incremental learning process goes on, the classification capability of the classifier improves and more high-confidence samples are selected, which may result in a decrease of incorrect autoannotations. In order to guarantee the reliability of high-confidence sample selection, at the end of each iteration t we update the high-confidence sample selection threshold by setting

$$\delta = \begin{cases} \delta_0, & t = 0 \\ \delta - dr \cdot t, & t > 0 \end{cases} \tag{9}$$

where δ_0 is the initial threshold and dr controls the threshold decay rate.

The entire algorithm is summarized in Algorithm 1. It is easy to see that this alternating optimization strategy accords well with the pipeline of the proposed CEAL framework.
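Read procedurally, Algorithm 1 is a short loop. The skeleton below is only a schematic restatement under assumed interfaces, not the authors' released code: `fine_tune`, `predict_probs`, and `oracle_annotate` are hypothetical placeholders for the CNN update of Eq. (6), the softmax forward pass, and the human annotator; `labeled` is assumed to be a list of (image, label) pairs and `unlabeled` a list of images; and it reuses the `rank_uncertain` and `pseudolabel_high_confidence` helpers sketched earlier.

```python
def ceal(unlabeled, labeled, K, delta0, dr, T, t_interval,
         fine_tune, predict_probs, oracle_annotate):
    """Schematic CEAL loop following Algorithm 1."""
    delta = delta0
    weights = fine_tune(None, labeled)                    # initialize W with D_L
    for t in range(1, T + 1):
        probs = predict_probs(weights, unlabeled)
        # minority: query the user about the K most uncertain samples (Eq. (2)-(4))
        uncertain = set(rank_uncertain(probs, "entropy", K).tolist())
        labeled += oracle_annotate([x for i, x in enumerate(unlabeled) if i in uncertain])
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in uncertain]
        # majority: pseudolabel the clearly classified samples (Eq. (5))
        probs = predict_probs(weights, unlabeled)
        idx, pseudo = pseudolabel_high_confidence(probs, delta)
        d_h = [(unlabeled[i], int(c)) for i, c in zip(idx, pseudo)]
        if t % t_interval == 0:
            weights = fine_tune(weights, labeled + d_h)   # Eq. (6) on D_L and D_H
            delta = delta - dr * t                        # threshold decay, Eq. (9)
        # pseudolabels are erased; D_H stays in the unlabeled pool
    return weights
```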

IV. EXPERIMENTS

A. Data Sets and Experimental Settings

1) Data Set Description: In this section, we evaluate our CEAL framework on two public challenging benchmarks, i.e., CACD [1] and the Caltech-256 object categorization data set [2] (see Fig. 2). CACD is a large-scale and challenging data set for face identification and retrieval problems. It contains more than 160 000 images of 2000 celebrities, which vary in age, pose, illumination, and occlusion. Since not all of the images are annotated, we adopt a subset of 580 individuals from the whole data set in our experiments, in which 200 individuals are originally annotated and 380 additional persons are annotated by us. In particular, 6336 images of 80 individuals are utilized for pretraining the network and the remaining


Fig. 2. Demonstration of the effectiveness of our proposed heuristic deep AL framework on face recognition and object categorization. First and second rows: sample images from the Caltech-256 [2] data set. Last row: sample images from CACD [1].

Algorithm 1 Learning Algorithm of CEAL
Input: Unlabeled samples D_U, initially labeled samples D_L, uncertain sample selection size K, high-confidence sample selection threshold δ, threshold decay rate dr, maximum iteration number T, fine-tuning interval t.
Output: CNN parameters W.
1: Initialize W with D_L.
2: while not reach maximum iteration T do
3:   Add K uncertain samples into D_L based on Eq. (2), (3), or (4).
4:   Obtain high-confidence samples D_H based on Eq. (5).
5:   In every t iterations:
     • Update W via fine-tuning according to Eq. (6) with D_H ∪ D_L.
     • Update δ according to Eq. (9).
6: end while
7: return W

500 persons are used to perform the experiments. Caltech-256 is a challenging object category data set. It contains a total of 30 607 images of 256 categories collected from the Internet.

2) Experimental Setting: For CACD, we utilize the method proposed in [38] to detect the facial points and align the faces based on the eye locations. We resize all the faces to 200 × 150 and set the parameters as δ_0 = 0.05, dr = 0.0033, and K = 2000. For Caltech-256, we resize all the images to 256 × 256 and set δ_0 = 0.005, dr = 0.00033, and K = 1000. Following the settings in the existing AL method [11], we randomly select 80% of the images of each class to form the unlabeled training set, and the rest serve as the testing set in our experiments. Among the unlabeled training set, we randomly select 10% of the samples of each class to initialize the network, and the rest are used for the incremental learning process. To get rid of the influence of randomness, we average the results of five executions as the final result.
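As a concrete illustration of the split just described, the helper below is an assumption-laden sketch (not released code) that performs the per-class 80/20 train/test split and holds out 10% of each class's training images as the initially labeled set.

```python
import random
from collections import defaultdict

def split_dataset(samples, seed=0):
    """Per-class split used in the experiments: 80% of each class forms the
    unlabeled training pool, 20% the test set; 10% of each class's training
    images are labeled up front to initialize the CNN.
    samples: list of (image, class_id) pairs."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for img, c in samples:
        by_class[c].append(img)
    init_labeled, unlabeled, test = [], [], []
    for c, imgs in by_class.items():
        rng.shuffle(imgs)
        n_train = int(0.8 * len(imgs))
        train, held_out = imgs[:n_train], imgs[n_train:]
        n_init = max(1, int(0.1 * len(train)))
        init_labeled += [(x, c) for x in train[:n_init]]
        unlabeled += [(x, c) for x in train[n_init:]]
        test += [(x, c) for x in held_out]
    return init_labeled, unlabeled, test
```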

We use different network architectures for the CACD [1] and Caltech-256 [2] data sets because the difference between faces and objects is relatively large. Table I shows the overall network

TABLE I: Detailed configuration of the CNN architecture used on CACD [1]. It takes 200 × 150 × 3 images as input and generates a 500-way softmax output for class prediction. The ReLU [39] activation function is not shown for brevity.

TABLE II: Detailed configuration of the CNN architecture used on Caltech-256 [2]. It takes 256 × 256 × 3 images as input, which are randomly cropped to 227 × 227 during training, and generates a 256-way softmax output for class prediction. The ReLU activation function is not shown for brevity.

architecture for the CACD experiments, and Table II shows the overall network architecture for the Caltech-256 experiments. We use AlexNet [23] as the network architecture for Caltech-256 and use the model pretrained on the ImageNet ILSVRC data set [40] as the starting point, following the setting of [41]. We then keep all layers fixed and just modify the last layer to be a 256-way softmax classifier to perform the Caltech-256 experiments. We employ Caffe [42] for the CNN implementation.


Fig. 3. Classification accuracy under different percentages of annotated samples of the whole training set on the (a) CACD and (b) Caltech-256 data sets. Our proposed method CEAL_MS performs consistently better than the compared TCAL and AL_RAND.

For CACD, we set the learning rates of all the layers to 0.01. For Caltech-256, we set the learning rates of all the layers to 0.001, except for the softmax layer, which is set to 0.01. All the experiments are conducted on a common desktop PC with an Intel 3.8-GHz CPU and a Titan X GPU. On average, 17 h are needed to finish training on the CACD data set with 44 708 images.
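The experiments themselves are implemented in Caffe; purely as an illustrative stand-in, the PyTorch/torchvision sketch below (assuming torchvision ≥ 0.13) shows the two ingredients described above for Caltech-256: starting from an ImageNet-pretrained AlexNet, swapping the final layer for a 256-way classifier, and giving that new layer a 10× larger learning rate than the pretrained layers.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative stand-in for the Caffe setup described in the text.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, 256)        # new 256-way softmax head

head_params = list(model.classifier[6].parameters())
head_ids = {id(p) for p in head_params}
base_params = [p for p in model.parameters() if id(p) not in head_ids]

optimizer = torch.optim.SGD(
    [{"params": base_params, "lr": 0.001},        # pretrained layers
     {"params": head_params, "lr": 0.01}],        # freshly initialized classifier
    momentum=0.9,
)
```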

3) Comparison Methods: To demonstrate that our proposed CEAL framework can improve the classification performance with less labeled data, we compare CEAL with the state-of-the-art AL method [triple-criteria AL (TCAL)] and baseline methods (AL_ALL and AL_RAND).

1) AL_ALL: We manually label all the training samples and use them to train the CNN. This method can be regarded as the upper bound (the best performance that the CNN can reach with all labeled training samples).

2) AL_RAND: During the training process, we randomly select samples to be annotated to fine-tune the CNN. This method discards all AL techniques and can be considered as the lower bound.

3) TCAL [3]: TCAL is a comprehensive AL approach that is well designed to jointly evaluate sample selection criteria (uncertainty, diversity, and density) and has outperformed state-of-the-art methods with far fewer annotations. TCAL represents the methods that aim to mine a minority of informative samples to improve performance. Thus, we regard it as a relevant competitor.

Implementation Details: The compared methods share the same CNN architecture as our CEAL on both data sets. The only difference is in the sample selection criteria. For the AL_ALL baseline, we select all training samples to fine-tune the CNN, i.e., all labels are used. For TCAL, we follow the pipeline of [3] by training an SVM classifier and then applying the uncertainty, diversity, and density criteria to select the most informative samples. Specifically, the uncertainty of samples is assessed according to the MS strategy. The diversity is calculated by clustering the most uncertain samples via k-means with a histogram intersection kernel. The density of a sample is measured by calculating the average distance to the other samples within the cluster it belongs to. For each cluster, the highest-density sample (i.e., the one with the smallest average distance) is selected as the most informative sample. For CACD, we cluster the 2000 most uncertain samples and select 500 most informative samples according to the above-mentioned diversity and density. For Caltech-256, we select 250 most informative samples from the 1000 most uncertain samples. To make a fair comparison, samples selected in each iteration by TCAL are also used to fine-tune the CNN to learn the optimal feature representation, as in AL_RAND. Once the optimal features are learned, the SVM classifier of TCAL is further updated.
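To make the TCAL baseline concrete, here is a rough sketch of the uncertainty–diversity–density pipeline described above. It is our own approximation: plain Euclidean k-means from scikit-learn stands in for the histogram-intersection-kernel clustering, and `probs`/`features` are hypothetical softmax outputs and feature vectors for the unlabeled pool.

```python
import numpy as np
from sklearn.cluster import KMeans

def tcal_select(probs, features, n_uncertain=2000, n_select=500, seed=0):
    """Schematic TCAL-style selection: most uncertain samples by margin
    sampling, clustered, keeping the densest sample per cluster."""
    part = np.sort(probs, axis=1)
    margin = part[:, -1] - part[:, -2]
    uncertain = np.argsort(margin)[:n_uncertain]          # smallest margins first
    feats = features[uncertain]
    km = KMeans(n_clusters=n_select, n_init=10, random_state=seed).fit(feats)
    chosen = []
    for c in range(n_select):
        members = np.where(km.labels_ == c)[0]
        if members.size == 0:
            continue
        # density: smallest average distance to the other cluster members
        d = np.linalg.norm(feats[members][:, None] - feats[members][None], axis=-1)
        chosen.append(uncertain[members[d.mean(axis=1).argmin()]])
    return np.array(chosen)
```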

B. Comparison Results and Empirical Analysis

1) Comparison Results: To demonstrate the effectiveness of our proposed framework, we also apply the MS criterion to measure the uncertainty of samples and denote this method by CEAL_MS. Fig. 3 illustrates the accuracy versus the percentage of annotated samples for AL_RAND, AL_ALL, TCAL, and the proposed CEAL_MS on both the CACD and Caltech-256 data sets. This curve shows the classification accuracy under different percentages of annotated samples of the whole training set.

As illustrated in Fig. 3 and Table III(a) and (b), our proposed CEAL framework outperforms the compared methods in terms of both recognition accuracy and the amount of user annotation. From the aspect of recognition accuracy, given the same percentage of annotated samples, our CEAL_MS outperforms the compared methods by a clear margin, especially when the percentage of annotated samples is low. From the aspect of the user annotation amount, to achieve 91.5% recognition accuracy on the CACD data set, AL_RAND and TCAL require 99% and 81% labeled training samples, respectively. CEAL_MS needs only 63% labeled samples and reduces user annotation by around 36% and 18% compared with AL_RAND and TCAL. To achieve 73.8% accuracy on the Caltech-256 data set, AL_RAND and TCAL require 97% and 93% labeled samples, respectively. CEAL_MS needs only 78% labeled samples and reduces user annotation by around 19% and 15% compared with AL_RAND and TCAL. This justifies that our proposed CEAL framework can effectively reduce the need for labeled samples.


TABLE III: Class accuracy at some specific AL iterations on the (a) CACD and (b) Caltech-256 data sets.

Fig. 4. Extensive study of different informative sample selection criteria on the CACD (first row) and Caltech-256 (second row) data sets. These criteria include LC (first column), MS (second column), and EN (third column). One can observe that our CEAL framework works consistently well with the common informative sample selection criteria.

From the above results, one can see that our proposed CEAL framework performs consistently better than the state-of-the-art method TCAL in both recognition accuracy and user annotation amount through fair comparisons. This is because TCAL mines only a minority of informative samples and is not able to provide sufficient training data for feature learning in the deep image classification scenario. Hence, our CEAL has a competitive advantage in the deep image classification task. To clearly analyze our CEAL and justify the effectiveness of its components, we have conducted several experiments, discussed in the following sections.

2) Component Analysis: To justify that the proposed CEAL can work consistently well with the common informative sample selection criteria, we implement three variants of CEAL that use LC, MS, and EN, respectively, to assess uncertain samples. These three variants are denoted by CEAL_LC, CEAL_MS, and CEAL_EN. Meanwhile, to show the raw performance of these criteria, we discard the cost-effective high-confidence sample selection from the above-mentioned variants and denote the resulting versions by AL_LC, AL_MS, and AL_EN. To clarify the contribution of our strategy of pseudolabeling the majority of high-confidence samples, we further introduce this strategy into AL_RAND and denote this variant by CEAL_RAND. Since AL_RAND randomly selects samples to be annotated, CEAL_RAND reflects the original contribution of the pseudolabeling strategy, i.e., CEAL_RAND denotes the method that uses only the pseudolabeled majority of samples.

Fig. 5. Comparison between different informative sample selection criteria and their fusion (CEAL_FUSION) on the (a) CACD and (b) Caltech-256 data sets.

Fig. 4 illustrates the results of these variants on the CACD (first row) and Caltech-256 (second row) data sets. The results demonstrate that, given the same percentage of labeled samples and compared with AL_RAND, CEAL_RAND, which simply exploits the pseudolabeled majority of samples, obtains a performance gain similar to that of AL_LC, AL_MS, and AL_EN, which employ informative sample selection criteria. This justifies that our proposed strategy of pseudolabeling the majority of samples is as effective as the common informative sample selection criteria. Moreover, as one can see in Fig. 4, CEAL_LC, CEAL_MS, and CEAL_EN all consistently outperform the pure pseudolabeling version CEAL_RAND and their counterparts without pseudolabeled samples, AL_LC, AL_MS, and AL_EN, by a clear margin on both the CACD and Caltech-256 data sets. This validates that our proposed strategy of pseudolabeling the majority of samples is complementary to the common informative sample selection criteria and can further significantly improve the recognition performance.

To analyze the choice of informative sample selection criteria, we compare the three above-mentioned criteria. We also make an attempt to simply combine them. Specifically, in each iteration, we select the top K/2 samples according to each criterion, respectively. Then we remove repeated ones (i.e., some samples may be selected by the three criteria at the same time) from the obtained 3K/2 samples. After removing the repeated samples, we randomly select K samples from them for user annotation. We denote this method by CEAL_FUSION.
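Under the same assumptions as the earlier selection sketches (`probs` is a hypothetical softmax output matrix for the unlabeled pool), CEAL_FUSION can be written as follows; this is our own illustrative sketch, not the authors' implementation.

```python
import numpy as np

def fusion_select(probs, K, seed=0):
    """CEAL_FUSION: top K/2 per criterion, de-duplicated, then K random picks."""
    rng = np.random.default_rng(seed)
    eps, half = 1e-12, K // 2
    lc = np.argsort(probs.max(axis=1))[:half]                       # Eq. (2)
    top2 = np.sort(probs, axis=1)[:, -2:]
    ms = np.argsort(top2[:, 1] - top2[:, 0])[:half]                 # Eq. (3)
    ent = -(probs * np.log(probs + eps)).sum(axis=1)
    en = np.argsort(-ent)[:half]                                    # Eq. (4)
    pool = np.unique(np.concatenate([lc, ms, en]))                  # remove repeats
    return rng.choice(pool, size=min(K, pool.size), replace=False)
```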

Fig. 5 illustrates that CEAL_LC, CEAL_MS, and CEAL_EN have similar performance, while CEAL_FUSION performs better. This demonstrates that the informative sample selection criterion still plays an important role in improving the recognition accuracy. Though being a minority, the informative samples have a great potential impact on the classifier.

C. Reliability of CEAL

Fig. 6. Average error rate of the pseudolabels of high-confidence samples assigned by the heuristic strategy in the CACD and Caltech-256 experiments. The vertical axes represent the average error rate and the horizontal axes represent the learning iteration. Our proposed CEAL framework can assign reliable pseudolabels to the unlabeled samples under an acceptable average error rate.

From the above experiments, we know that the performance of our framework is better than those of the other methods, which shows the superiority of introducing the majority of pseudolabeled samples. But how accurate is the pseudolabel assignment for those high-confidence samples? In order to demonstrate the reliability of our proposed CEAL framework, we also evaluate the average error in selecting high-confidence samples. Fig. 6 shows the error rate of pseudolabel assignment along with the learning iteration. As one can see, the average error rate is quite low (less than 3% on the CACD data set and less than 5.5% on the Caltech-256 data set) even at early iterations. Hence, our proposed CEAL framework can assign reliable pseudolabels to the unlabeled samples under an acceptable average error rate along the learning iterations.
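The error rate plotted in Fig. 6 can be computed straightforwardly once the ground-truth labels of the pseudolabeled samples are looked up; the short helper below reflects our reading of the metric (the fraction of pseudolabels that disagree with the held-back ground truth) rather than the authors' exact evaluation script.

```python
import numpy as np

def pseudolabel_error_rate(pseudo_labels, true_labels):
    """Fraction of automatically assigned pseudolabels that are wrong."""
    pseudo = np.asarray(pseudo_labels)
    truth = np.asarray(true_labels)
    return float(np.mean(pseudo != truth)) if pseudo.size else 0.0
```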

D. Sensitivity of High-Confidence Threshold

Fig. 7. Sensitivity analysis of the heuristic threshold δ (top) and decay rate dr (bottom). One can observe that these parameters do not substantially affect the overall system performance.

Since the training phase of deep CNNs is time consuming, it is not affordable to employ a trial-and-error approach to set the threshold for defining high-confidence samples. We further analyze the sensitivity of the threshold parameters δ (threshold) and dr (threshold decay rate) with respect to our system performance on the CACD data set using CEAL_EN. While analyzing the sensitivity of the parameter δ, we fix the decay rate dr to 0.0033. We fix the threshold δ to 0.05 when analyzing the sensitivity of dr. The results of the sensitivity analysis of δ (range 0.045 to 0.1) are shown in the top of Fig. 7, while the sensitivity analysis of dr (range 0.001 to 0.0035) is shown in the bottom of Fig. 7. Note that the test ranges of δ and dr are set to ensure the majority-of-high-confidence assumption of this paper. Though the range of {δ, dr} seems numerically narrow, it leads to a significant difference: about 10%–60% of the samples are pseudolabeled in the high-confidence sample selection. The low standard deviation of the accuracy in Fig. 7 shows that the choice of these parameters does not significantly affect the overall system performance.

V. CONCLUSION

In this paper, we propose a CEAL framework for deep image classification tasks, which employs a complementary sample selection strategy: progressively select the minority of the most informative samples and pseudolabel the majority of high-confidence samples for model updating. In such a holistic manner, the minority of labeled samples benefit the decision boundary of the classifier and the majority of pseudolabeled samples provide sufficient training data for robust feature learning. Extensive experimental results on two public challenging benchmarks justify the effectiveness of our proposed CEAL framework. In future work, we plan to apply our framework to more challenging large-scale object recognition tasks (e.g., 1000 categories in ImageNet) and to incorporate more persons from the CACD data set to evaluate our framework. Moreover, we plan to generalize our framework to other multilabel object recognition tasks (e.g., 20 categories in PASCAL VOC).

ACKNOWLEDGMENT

The authors would like to thank D. Liang and J. Xu for their preliminary contributions to this project.

REFERENCES

[1] B.-C. Chen, C.-S. Chen, and W. H. Hsu, "Cross-age reference coding for age-invariant face recognition and retrieval," in Proc. ECCV, 2014, pp. 768–783.
[2] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," California Inst. Technol., Pasadena, CA, USA, Tech. Rep. 7694, 2007.
[3] B. Demir and L. Bruzzone, "A novel active learning method in relevance feedback for content-based remote sensing image retrieval," IEEE Trans. Geosci. Remote Sens., vol. 53, no. 5, pp. 2323–2334, May 2015.
[4] K. Brinker, "Incorporating diversity in active learning with support vector machines," in Proc. ICML, 2003, pp. 1–8.
[5] B. Long, J. Bian, O. Chapelle, Y. Zhang, Y. Inagaki, and Y. Chang, "Active learning for ranking through expected loss optimization," IEEE Trans. Knowl. Data Eng., vol. 27, no. 5, pp. 1180–1191, May 2015.
[6] A. J. Joshi, F. Porikli, and N. Papanikolopoulos, "Multi-class active learning for image classification," in Proc. CVPR, Jun. 2009, pp. 2372–2379.
[7] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell, "Active learning with Gaussian processes for object categorization," in Proc. ICCV, Oct. 2007, pp. 1–8.
[8] A. Kapoor, G. Hua, A. Akbarzadeh, and S. Baker, "Which faces to tag: Adding prior constraints into active learning," in Proc. ICCV, Sep./Oct. 2009, pp. 1058–1065.
[9] R. M. Castro and R. D. Nowak, "Minimax bounds for active learning," IEEE Trans. Inf. Theory, vol. 54, no. 5, pp. 2339–2353, May 2008.
[10] X. Li and Y. Guo, "Adaptive active learning for image classification," in Proc. CVPR, Jun. 2013, pp. 859–866.
[11] E. Elhamifar, G. Sapiro, A. Yang, and S. S. Sastry, "A convex optimization framework for active learning," in Proc. Int. Conf. Comput. Vis. (ICCV), Dec. 2013, pp. 209–216.
[12] D. D. Lewis, "A sequential algorithm for training text classifiers: Corrigendum and additional data," ACM SIGIR Forum, vol. 29, no. 2, pp. 13–19, 1995.
[13] X. Li and Y. Guo, "Multi-level adaptive active learning for scene classification," in Proc. ECCV, 2014, pp. 234–249.
[14] B. Zhang, Y. Wang, and F. Chen, "Multilabel image classification via high-order label correlation driven active learning," IEEE Trans. Image Process., vol. 23, no. 3, pp. 1430–1441, Mar. 2014.
[15] F. Sun, M. Xu, and X. Jiang, "Robust multi-label image classification with semi-supervised learning and active learning," in Proc. 21st Int. Conf. MultiMedia Modeling, 2015, pp. 512–523.
[16] S. C. H. Hoi, R. Jin, J. Zhu, and M. R. Lyu, "Batch mode active learning and its application to medical image classification," in Proc. ICML, 2006, pp. 417–424.
[17] D. Tuia, F. Ratle, F. Pacifici, M. F. Kanevski, and W. J. Emery, "Correction to 'Active learning methods for remote sensing image classification,'" IEEE Trans. Geosci. Remote Sens., vol. 48, no. 6, p. 2767, Jun. 2010.
[18] S. Tong and D. Koller, "Support vector machine active learning with applications to text classification," J. Mach. Learn. Res., vol. 2, pp. 45–66, Mar. 2001.
[19] A. McCallum and K. Nigam, "Employing EM and pool-based active learning for text classification," in Proc. ICML, 1998, pp. 350–358.
[20] G. Schohn and D. Cohn, "Less is more: Active learning with support vector machines," in Proc. ICML, 2000, pp. 1–8.
[21] S. Vijayanarasimhan and K. Grauman, "Large-scale live active learning: Training object detectors with crawled data and crowds," in Proc. CVPR, Jun. 2011, pp. 1449–1456.
[22] A. G. Hauptmann, W.-H. Lin, R. Yan, J. Yang, and M.-Y. Chen, "Extreme video retrieval: Joint maximization of human and computer performance," in Proc. 14th ACM Int. Conf. Multimedia, 2006, pp. 385–394.


[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS, 2012, pp. 1097–1105.
[24] D. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proc. CVPR, Jun. 2012, pp. 3642–3649.
[25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. CVPR, Jun. 2009, pp. 248–255.
[26] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. ICLR, 2015, pp. 1–14.
[27] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann, "Self-paced curriculum learning," in Proc. AAAI, 2015, pp. 1–7.
[28] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann, "Self-paced learning for matrix factorization," in Proc. AAAI, 2015, pp. 3196–3202.
[29] L. Jiang, D. Meng, T. Mitamura, and A. G. Hauptmann, "Easy samples first: Self-paced reranking for zero-example multimedia search," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 547–556.
[30] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. G. Hauptmann, "Self-paced learning with diversity," in Proc. NIPS, 2014, pp. 2078–2086.
[31] B. Settles, "Active learning literature survey," Comput. Sci. Dept., Univ. Wisconsin–Madison, Madison, WI, USA, Tech. Rep. 1648, 2009.
[32] T. Scheffer, C. Decomain, and S. Wrobel, "Active hidden Markov models for information extraction," in Proc. IDA, 2001, pp. 309–318.
[33] C. E. Shannon, "A mathematical theory of communication," ACM SIGMOBILE Mobile Comput. Commun. Rev., vol. 5, no. 1, pp. 3–55, 2001.
[34] M. P. Kumar, B. Packer, and D. Koller, "Self-paced learning for latent variable models," in Proc. NIPS, 2010, pp. 1189–1197.
[35] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proc. ICML, 2009, pp. 41–48.
[36] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby, "Selective sampling using the query by committee algorithm," Mach. Learn., vol. 28, no. 2, pp. 133–168, 1997.
[37] A. J. Joshi, F. Porikli, and N. P. Papanikolopoulos, "Scalable active learning for multiclass image classification," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2259–2273, Nov. 2012.
[38] X. Xiong and F. De la Torre, "Supervised descent method and its applications to face alignment," in Proc. CVPR, Jun. 2013, pp. 532–539.
[39] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. ICML, 2010, pp. 1–8.
[40] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[41] J. Donahue et al., "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. 31st Int. Conf. Mach. Learn. (ICML), Beijing, China, Jun. 2014, pp. 21–26.
[42] Y. Jia et al., "Caffe: Convolutional architecture for fast feature embedding," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 675–678.

Keze Wang received the B.S. degree in software engineering from Sun Yat-sen University, Guangzhou, China, in 2012. He is currently pursuing the Ph.D. degree in computer science and technology with Sun Yat-sen University and Hong Kong Polytechnic University, Hong Kong, under the supervision of Prof. L. Lin and Prof. L. Zhang.
His current research interests include computer vision and machine learning.

Dongyu Zhang received the B.S. and Ph.D. degrees from the Harbin Institute of Technology, Harbin, China, in 2003 and 2010, respectively.
He is currently a Research Associate with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. His current research interests include computer vision and machine learning.

Ya Li received the B.E. degree from Zhengzhou University, Zhengzhou, China, in 2002, the M.E. degree from Southwest Jiaotong University, Chengdu, China, in 2006, and the Ph.D. degree from Sun Yat-sen University, Guangzhou, China, in 2015.
She is currently a Lecturer with the School of Computer Science and Educational Software, Guangzhou University, Guangzhou. Her current research interests include computer vision and machine learning.

Ruimao Zhang received the B.E. degree from the School of Software, Sun Yat-sen University, Guangzhou, China, in 2011, where he is currently pursuing the Ph.D. degree in computer science with the School of Information Science and Technology.
He was a Visiting Ph.D. Student with the Department of Computing, Hong Kong Polytechnic University, Hong Kong, from 2013 to 2014. His current research interests include computer vision, pattern recognition, machine learning, and related applications.

Liang Lin (SM'14) received the B.S. and Ph.D. degrees from the Beijing Institute of Technology, Beijing, China, in 1999 and 2008, respectively.
He was a Post-Doctoral Research Fellow with the Department of Statistics, University of California at Los Angeles, Los Angeles, CA, USA, from 2008 to 2010. He was a Visiting Scholar with the Department of Computing, Hong Kong Polytechnic University, Hong Kong, and with the Department of Electronic Engineering, Chinese University of Hong Kong, Hong Kong. He is currently a Professor with the School of Computer Science, Sun Yat-sen University, Guangzhou, China. He has authored over 100 papers in top-tier academic journals and conferences. His current research interests include new models, algorithms, and systems for intelligent processing and understanding of visual data, such as images and videos.
Prof. Lin received the Best Paper Runner-Up Award at ACM NPAR 2010, the Google Faculty Award in 2012, the Best Student Paper Award at IEEE ICME 2014, and the Hong Kong Scholars Award in 2014. He currently serves as an Associate Editor of the IEEE TRANSACTIONS ON HUMAN-MACHINE SYSTEMS.

