
DeepSentiBank: Visual Sentiment Concept Classification with Deep Convolutional Neural Networks

Tao Chen (1), Damian Borth (2), Trevor Darrell (2) and Shih-Fu Chang (1)

(1) Columbia University, USA   (2) University of California, Berkeley, CA
(1) {taochen,sfchang}@ee.columbia.edu   (2) {borth@icsi,trevor@eecs}.berkeley.edu

ABSTRACT
This paper introduces a visual sentiment concept classification method based on deep convolutional neural networks (CNNs). The visual sentiment concepts are adjective noun pairs (ANPs) automatically discovered from the tags of web photos, and can be utilized as effective statistical cues for detecting emotions depicted in images. Nearly one million Flickr images tagged with these ANPs are downloaded to train the classifiers of the concepts. We adopt the popular deep convolutional neural network model, which has recently shown great performance improvements on classifying large-scale web-based image datasets such as ImageNet. Our deep CNN model is trained with Caffe, a newly developed deep learning framework. To deal with the biased training data, which contain only images with strong sentiment, and to prevent overfitting, we initialize the model with the weights trained from ImageNet. Performance evaluation shows the newly trained deep CNN model SentiBank 2.0 (also called DeepSentiBank) is significantly improved in both annotation accuracy and retrieval performance, compared to its predecessors, which mainly use binary SVM classification models.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Retrieval and Indexing

Keywords
deep learning, visual sentiment, affective computing

1. INTRODUCTION
The explosive growth of social media and online visual content has motivated research on large-scale social multimedia analysis. Among these research efforts, understanding the emotion and sentiment in visual media content has attracted increasing attention in research and practical applications. Images and videos depicting strong sentiments can strengthen the opinion conveyed in the content and more effectively influence the audience. Understanding sentiment expressed in visual content will greatly benefit social media communication and enable broad applications in education, advertisement and entertainment.

Modeling generic visual concepts (nouns) such as "sky" and "dog" has been studied extensively in computer vision, but modeling adjectives correlated with visual sentiments, like "amazing" and "shy", remains difficult, if not impossible, due to the big "affective gap" between the low-level visual features and the high-level sentiment. Therefore, Borth et al. [1] proposed a more tractable approach which models sentiment-related visual concepts as a mid-level representation to fill the gap. Those concepts are Adjective Noun Pairs (ANPs), such as "happy dog" and "beautiful sky", which combine the sentimental strength of adjectives and the detectability of nouns. Though these ANP concepts do not directly express emotions or sentiments, they were discovered based on strong co-occurrence relationships with emotion tags of web photos, and thus are useful as effective statistical cues for detecting emotions depicted in images. In [1], binary SVM classifiers of the ANPs are trained on whole images, denoted as SentiBank 1.1. Later, Chen et al. [2] improved these classifiers by considering object-based concept localization and leveraging semantic similarity among the concepts.

The dataset for training the visual sentiment concepts involves thousands of categories consisting of about one million images downloaded from Flickr. Recently, Krizhevsky et al. [18] showed that deep convolutional neural networks (CNNs) are able to achieve great classification performance improvements and efficiency on similar datasets such as ImageNet [4]. Compared to SVMs and other learning methods, the model has a much larger learning capacity that can be controlled by varying the network depth and breadth. Its strong assumptions about the nature of images, namely stationarity of statistics and locality of pixel dependencies, are also mostly correct. CNNs are also easier to train than standard feed-forward neural networks with layers of similar size, since they have far fewer connections and parameters, with only slightly degraded theoretical performance. CNNs also have the capability to incorporate model weights learned from a more general dataset, which can be applied to our case by transferring the model learned over ImageNet to a specialized dataset like SentiBank.

This work introduces SentiBank 2.0, also called DeepSentiBank, a visual sentiment concept classification model trained with Caffe [14, 15], a GPU-based deep learning framework. We adopt a CNN architecture similar to the one used in [18] for training on the ILSVRC2012 [4] dataset. We find that initializing the model with the weights trained from ImageNet provides much better performance than training from the visual sentiment dataset alone. Performance evaluation and comparisons with its predecessors show the newly trained DeepSentiBank significantly improves annotation accuracy in ANP classification and moderately improves ANP retrieval performance.

arXiv:1410.8586v1 [cs.CV] 30 Oct 2014


2. RELATED WORK

2.1 Modeling Sentiment
Most work on sentiment analysis so far has been based on textual information [36, 8, 32]. Sentiment models have been demonstrated to be useful in various applications including human behavior prediction [8], business [26], and political science [34].

Compared to text-based sentiment analysis, modeling sentiment based on images has been much less studied. The most relevant work is [1], which proposed to design a large-scale visual sentiment ontology based on Adjective-Noun Pairs (the sentiment modeling is then based on one-vs-all SVMs). Chen et al. [2] further improved the model by considering object-based concept localization and leveraging semantic similarity among the concepts.

2.2 Modeling Visual Concepts
Concept modeling has been widely studied in multimedia [25, 31] and computer vision, where concepts are often referred to as "attributes" [9]. The concepts being modeled are mostly objects [31], scenes [27], or activities [10]. There is work on the "fine-grained recognition" task, where the categories are usually organized in a hierarchical structure [6, 7, 5]. There is also work modeling "non-conventional" concepts or properties of images, such as image aesthetics and quality [16, 22], memorability [12], interestingness [12], and affection/emotions [21, 35, 13, 37]. The models are usually trained by SVMs and other shallow learning methods.

2.3 Deep Learning
Deep convolutional networks have long been studied in computer vision. Successful results on digit recognition using supervised back-propagation networks were achieved in early research [20]. More recently, similar networks have been applied to large benchmark datasets consisting of more than one million images, such as ImageNet [4], with competition-winning results [18].

The learned deep representations can be transferred across tasks. This has been extensively studied in an unsupervised setting [29, 23]. However, such models in convolutional networks have been limited to relatively small datasets such as CIFAR and MNIST, and achieved only modest success in [19]. Sermanet et al. [30] propose to use unsupervised pre-training, followed by supervised fine-tuning, to solve the problem of insufficient training data. The supervised pre-training approach using a concept-bank paradigm [17, 33] has also proven successful in computer vision and multimedia settings. It learns the features on large-scale data in a supervised setting, then transfers them to different tasks with different labels. Recently, Girshick et al. [11] showed that supervised pre-training on a large dataset, followed by domain-adaptive fine-tuning on a smaller dataset, is an efficient paradigm for scarce data.

3. VISUAL SENTIMENT ONTOLOGY AND CONCEPTS OVERVIEW

In this section, we briefly review the visual sentiment ontology construction in [1] and define our classification problem.

3.1 Building Ontology
The analysis of emotion, affect and sentiment from visual content has become an exciting area in the multimedia community, enabling new applications for brand monitoring, advertising, and opinion mining. To create a corpus for sentiment analysis on visual content and stimulate innovative research on this challenging issue, a database was constructed by Borth et al. [1]. This database contains a Visual Sentiment Ontology (VSO) consisting of more than 3,000 adjective noun pairs (ANPs); SentiBank (version 1.1 can be downloaded from http://visual-sentiment-ontology.appspot.com/), a set of 1,200 trained visual concept detectors providing a mid-level representation of sentiment; and associated training images acquired from Flickr. Construction of the VSO is founded on psychological research through data-driven discovery: for each of the 24 emotions defined in Plutchik's theory [28], images and videos are retrieved from Flickr and YouTube respectively to extract co-occurring tags. The set of all adjectives and all nouns is then used to form ANPs such as "beautiful flowers" or "sad eyes". SentiBank is then trained on the images tagged with these ANPs.

3.2 Dataset
The database contains a set of Flickr images for training and testing the ANP classifiers in SentiBank 1.1. For each ANP, at most 1,000 images tagged with it are downloaded, resulting in about one million images for 3,316 ANPs. To train the visual sentiment concept (ANP) classifiers, we first filter out the ANPs associated with fewer than 120 images; 2,089 ANPs with 867,919 images remain after filtering. For each ANP, 20 images are randomly selected for testing, while the others are used in training, ensuring at least 100 training images per ANP. To prevent bias in the test set, a training image and a test image associated with the same ANP must not share the same publisher on Flickr. The ANP tags from Flickr users are used as labels for each image. Note those labels may suffer from incompleteness and noisiness, i.e., not all true labels are annotated and sometimes there are falsely assigned labels. However, we do not fix them due to the huge amount of annotation work required. We use the labels as is and thus refer to them as pseudo ground truth.

We also build a subset to compare the retrieval performance of different models. This subset only contains images associated with six nouns, namely "car", "dog", "dress", "face", "flower" and "food". These nouns are not only frequently tagged in social multimedia, but are also associated with diverse adjectives, forming a large set of ANPs (135 in total). Its training set is the corresponding subset of the full training set. Its test set, however, contains 60 manually annotated images for each ANP, where 20 are positive and 40 are negative. The retrieval performance is evaluated by the average precision on the ranking of the 60 test images for each ANP. For this dataset, we will compare the new DeepSentiBank with an earlier version of SentiBank using object-based localization, called SentiBank 1.5R (indicating region-based SentiBank) [2].
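To make the filtering and split procedure concrete, here is a minimal Python sketch, assuming the raw data are available as (image_id, ANP, uploader) records; all names are illustrative rather than the actual DeepSentiBank pipeline.

```python
import random
from collections import defaultdict

MIN_IMAGES_PER_ANP = 120   # ANPs with fewer tagged images are dropped
TEST_IMAGES_PER_ANP = 20   # images held out per ANP for testing

def split_dataset(records, seed=0):
    """records: iterable of (image_id, anp, uploader) tuples from Flickr tags.

    Returns (train, test) lists of (image_id, anp) pairs, keeping only ANPs
    with enough images and ensuring that no train/test pair of the same ANP
    shares an uploader.
    """
    rng = random.Random(seed)
    by_anp = defaultdict(list)
    for image_id, anp, uploader in records:
        by_anp[anp].append((image_id, uploader))

    train, test = [], []
    for anp, items in by_anp.items():
        if len(items) < MIN_IMAGES_PER_ANP:
            continue  # too few images to train a reliable classifier
        rng.shuffle(items)
        held_out = items[:TEST_IMAGES_PER_ANP]
        test_uploaders = {u for _, u in held_out}
        test.extend((img, anp) for img, _ in held_out)
        # keep the rest for training, excluding images from test uploaders
        train.extend((img, anp) for img, u in items[TEST_IMAGES_PER_ANP:]
                     if u not in test_uploaders)
    return train, test
```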

4. DEEP CONVOLUTIONAL NEURAL NETWORKS SOLUTION

4.1 Introduction of Caffe


Figure 1: The architecture of the deep convolutional neural networks.

Caffe is a deep learning framework developed with cleanliness, readability, and speed in mind. It was created by Jia [14], and is in active development by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Caffe is released under the BSD 2-Clause license (see http://caffe.berkeleyvision.org/).

Using Caffe for deep learning programming has multiple advantages. Its clean architecture enables rapid deployment. Networks are specified in simple config files, with no hard-coded parameters in the code. Switching between CPU and GPU is as simple as setting a flag, so models can be trained on a GPU machine and then used on commodity clusters.

4.2 CNN Architecture
Here we describe the overall architecture of the deep convolutional neural networks for training the visual sentiment concept classification model, SentiBank 2.0 or DeepSentiBank. The architecture mostly follows [18]. As depicted in Figure 1, the net contains eight main layers (conv or fc) with weights; the first five are convolutional and the other three are fully connected. The output of the last fully-connected layer is fed to a 2089-way softmax which produces a distribution over the 2089 class labels. The network maximizes the average across training instances of the log-probability of the correct label under the prediction distribution by multinomial logistic regression. The kernels of the second, fourth, and fifth convolutional layers are connected only to half of the kernel maps in the previous layer. The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully-connected layers are connected to all neurons in the previous layer. Following [24], the Rectified Linear Unit (ReLU) non-linearity f(x) = max(0, x) is applied to the output of every convolutional and fully-connected layer. Overlapping max-pooling layers (pool) follow the first, second and fifth ReLU layers (relu). Each pooling layer consists of a grid of pooling units spaced 2 pixels apart, each summarizing a neighborhood of size 3 × 3 centered at the location of the pooling unit. Local response normalization layers (norm) follow the first and second pooling layers.

Table 1: The input/output data size (left) and the layer shape (right) for each layer.

  Data size per layer              Layer (weight) shapes
  name    size                     name    shape
  input   3 × 256 × 256            conv1   96 × 3 × 11 × 11
  data    3 × 227 × 227            conv2   256 × 48 × 5 × 5
  conv1   96 × 55 × 55             conv3   384 × 256 × 3 × 3
  pool1   96 × 27 × 27             conv4   384 × 192 × 3 × 3
  norm1   96 × 27 × 27             conv5   256 × 192 × 3 × 3
  conv2   256 × 27 × 27            fc6     4096 × 9216
  pool2   256 × 13 × 13            fc7     4096 × 4096
  norm2   256 × 13 × 13            fc8     2089 × 4096
  conv3   384 × 13 × 13
  conv4   384 × 13 × 13
  conv5   256 × 13 × 13
  pool5   256 × 6 × 6
  fc6     4096
  fc7     4096
  fc8     2089
  label   1
  output  1
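As a quick sanity check of the spatial sizes in Table 1, the following sketch applies the standard convolution/pooling output-size formula; the paddings assumed for conv2-conv5 are not stated in the text and are taken from the values commonly used in this architecture.

```python
def conv_out(size, kernel, stride, pad=0):
    """Spatial output size: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

print(conv_out(227, 11, 4))        # conv1: 11x11 kernels, stride 4      -> 55
print(conv_out(55, 3, 2))          # pool1: 3x3 pooling, stride 2        -> 27
print(conv_out(27, 5, 1, pad=2))   # conv2: 5x5, stride 1, pad 2 (assumed) -> 27
print(conv_out(27, 3, 2))          # pool2: 3x3, stride 2                -> 13
print(conv_out(13, 3, 1, pad=1))   # conv3-5: 3x3, stride 1, pad 1 (assumed) -> 13
print(conv_out(13, 3, 2))          # pool5: 3x3, stride 2                -> 6
```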

The response-normalized activity $b^{i}_{x,y}$ is given by the expression

$$ b^{i}_{x,y} = a^{i}_{x,y} \Bigg/ \left( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left( a^{j}_{x,y} \right)^{2} \right)^{\beta} $$

where $a^{i}_{x,y}$ is the activity of a neuron computed by max-pooling, the sum runs over $n$ "adjacent" kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer. The constants are $k = 2$, $n = 5$, $\alpha = 10^{-4}$, and $\beta = 0.75$. Dropout layers (dropout) are applied in the first two fully-connected layers.
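For illustration, here is a minimal NumPy sketch of the normalization above (not the actual Caffe layer, which implements the same computation efficiently in C++/CUDA):

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Apply the response normalization above to activity a of shape (N, H, W),
    where N is the number of kernel maps. Returns b with the same shape."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        # sum of squared activities over the n "adjacent" kernel maps
        scale = k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)
        b[i] = a[i] / scale ** beta
    return b

# example: 96 kernel maps of size 27 x 27 (the output of pool1)
b = local_response_norm(np.random.rand(96, 27, 27).astype(np.float32))
```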

The input/output data size and the layer shape for each layer are shown in Table 1. All training and test images are first normalized to 256 × 256 without keeping the aspect ratio. To prevent overfitting, we apply data augmentation consisting of image translations and horizontal reflections. We do this by extracting random 227 × 227 patches (and their horizontal reflections) from the 256 × 256 images and training our network on these extracted patches. The first convolutional layer filters the 227 × 227 × 3 input image with 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels. The second convolutional layer takes as input the (pooled and response-normalized) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without pooling or normalization. The third convolutional layer has 384 kernels of size 3 × 3 × 256 connected to the (normalized and pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192, and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The fully-connected layers have 4096 neurons each.
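The augmentation step can be sketched as follows, assuming images are stored as 256 × 256 × 3 arrays; this is illustrative, not the exact Caffe data-layer implementation:

```python
import numpy as np

def random_crop_and_flip(image, crop=227, rng=np.random):
    """Randomly crop a crop x crop patch from a (256, 256, 3) image and
    flip it horizontally with probability 0.5, as in the augmentation above."""
    h, w = image.shape[:2]
    top = rng.randint(0, h - crop + 1)
    left = rng.randint(0, w - crop + 1)
    patch = image[top:top + crop, left:left + crop]
    if rng.rand() < 0.5:
        patch = patch[:, ::-1]  # horizontal reflection
    return patch
```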

4.3 Learning Details
The regression objective is minimized by stochastic gradient descent with a batch size of 256 examples, momentum of 0.9, and weight decay of 0.0005. The small weight decay here is not only a regularizer but also reduces the model's training error.

Due to insufficient data and the bias toward images with strong sentiment, training on our dataset may suffer from overfitting. Since our dataset is from the same domain as ImageNet, it is promising to use fine-tuning. We initialize the weights with the model trained from ILSVRC2012, except for the top layer. The pre-trained model can be downloaded from http://caffe.berkeleyvision.org/getting_pretrained_models.html. The learning rate is initialized at 0.001. Regarding the full forward-backward pass of each batch as an iteration, we run a total of 250,000 iterations (about 77 epochs). We divide the learning rate by 10 after every 100,000 iterations (about 20 epochs).
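For clarity, here is a small sketch of the update rule and step schedule implied by these settings, assuming the Caffe-style formulation of momentum and weight decay; the function and variable names are illustrative:

```python
BASE_LR, GAMMA, STEP = 0.001, 0.1, 100_000   # fine-tuning schedule from the text
MOMENTUM, WEIGHT_DECAY = 0.9, 0.0005

def learning_rate(iteration):
    """Divide the base learning rate by 10 every 100,000 iterations."""
    return BASE_LR * GAMMA ** (iteration // STEP)

def sgd_step(w, grad, velocity, iteration):
    """One Caffe-style SGD update with momentum and L2 weight decay."""
    lr = learning_rate(iteration)
    velocity = MOMENTUM * velocity - lr * (grad + WEIGHT_DECAY * w)
    return w + velocity, velocity
```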

For comparison, we also train a similar model without fine-tuning. In that case, we initialize the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01; we initialize the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully-connected hidden layers, with the constant 0.1, and the remaining biases with the constant 0. The learning rate is initialized at 0.01.

During testing, we center-crop the test images to 227 × 227, apply forward propagation with the trained model weights, and use the softmax output as the predicted probability of each concept.
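A hypothetical pycaffe sketch of this test-time procedure is shown below; the prototxt/caffemodel file names and the blob names 'data' and 'prob' are assumptions that depend on the deployed network definition, and the input is assumed to be already preprocessed (channel order and mean subtraction):

```python
import numpy as np
import caffe  # pycaffe; assumes a deploy prototxt and trained weights are available

caffe.set_mode_gpu()  # or caffe.set_mode_cpu()
# hypothetical file names -- the actual DeepSentiBank files are not specified here
net = caffe.Net('deepsentibank_deploy.prototxt', 'deepsentibank.caffemodel', caffe.TEST)

def predict(image_256):
    """image_256: float32 array of shape (3, 256, 256), already mean-subtracted.
    Center-crops to 227 x 227, runs a forward pass, and returns the softmax
    scores over the 2,089 ANP concepts (assuming the output blob is 'prob')."""
    off = (256 - 227) // 2
    crop = image_256[:, off:off + 227, off:off + 227]
    net.blobs['data'].reshape(1, 3, 227, 227)
    net.blobs['data'].data[0] = crop
    out = net.forward()
    return out['prob'][0]
```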

5. EXPERIMENTAL RESULTS

5.1 Computation Speed
Our experiment is done on a single server machine with dual Intel E5-2650L processors (16 cores), 64 GB of memory and an Nvidia K20 GPU. Training over 826,806 images takes about 9 days, and testing over 41,780 test images takes about 6 minutes. The maximum memory used is 42 GB, and storing the data takes 300 GB of disk space.

5.2 Performance and Comparisons
We evaluate the new classification model by both annotation accuracy (measured by the percentage of images that have the pseudo ground truth label among the top detected concepts) and retrieval performance (measured by mean average precision).

Figure 2: The curves of ranked top-10 accuracy per ANP for the different approaches. The curves have been smoothed.

5.2.1 Annotation accuracy
The annotation accuracy is evaluated on the full test set of 2,089 ANPs mentioned in Section 3.2 and measured by top-k accuracy: the percentage of images that have the pseudo ground truth label among the top k detected concepts. Top-1, top-5 and top-10 accuracies of each and all ANPs are computed and compared among the fine-tuned deep CNN model (SentiBank 2.0), the deep CNN model without fine-tuning, and SentiBank 1.1 [1]. The overall accuracies are listed in Table 2. Unlike generic visual concepts, some visual sentiment concepts can be very abstract, such as "terrible crime" and "strong community". Such ANPs usually have very low classification performance, and it is meaningless to include them in the classifier library for generating mid-level sentiment-related features. Thus it is important to compare the performance of ANPs with acceptable detectability. Similar to [1], for each approach we select the top 1,200 ANPs ranked by top-10 accuracy. Note that different approaches will produce different ANP subsets. The overall accuracies for these subsets are also shown in Table 2. Figure 2 shows the curve of ranked top-10 accuracy per ANP for each subset. According to the table and the figure, it is clear that the CNN-based approaches greatly outperform the SVM-based approach, with as much as a 370% performance gain on top-1 accuracy, 200% on top-5, and 150% on top-10. The fine-tuned model is also 14-25% better than the one without fine-tuning. Figure 3 shows some examples of top detected concepts from test images by the fine-tuned model. It shows that despite the serious problem of incomplete and incorrect labels in our dataset, the top detected concepts can still be accurate. Since the pseudo ground truth labels may not be correct, the top-5 and top-10 accuracies are more appropriate than top-1 accuracy. We also recognize that an important reason for the performance boost is that the SVM-based SentiBank trains binary classifiers, rather than using a general multi-label classification approach. Such a binary classification setting is more suitable for retrieval than for annotation. Thus, in the next section, we will evaluate the performance of DeepSentiBank in terms of image retrieval.
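The top-k metric can be computed as in the following NumPy sketch (illustrative, not the evaluation code used in the paper):

```python
import numpy as np

def top_k_accuracy(scores, labels, k=10):
    """scores: (num_images, num_anps) array of predicted concept scores.
    labels: (num_images,) array of pseudo ground truth ANP indices.
    Returns the fraction of images whose true label is among the k highest-scoring concepts."""
    topk = np.argsort(-scores, axis=1)[:, :k]        # indices of the k best concepts per image
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

# example with random scores for 5 images over 2,089 concepts
acc = top_k_accuracy(np.random.rand(5, 2089), np.array([3, 10, 7, 0, 42]), k=10)
```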

5.2.2 Retrieval performance


Figure 3: Examples of the top 10 concepts detected from test images by the fine-tuned DeepSentiBank model. The red concepts are the pseudo ground truth concepts. Credits of images (from top to bottom, from left to right): © Mauricio Gelfuso, Bob Wright, fraKara, Melanie Bateman, Photographs-n-Memories, Matt Swanson, Twan Goossens, S Debras, Erin Nichols, Yael Levine, 7 Years Later... and Anda Stavri of Flickr.

Figure 4: The mean AP for each and all noun categories for the subset of 135 ANPs mentioned in Section 3.2.

The retrieval performance is evaluated on the subset of 135 ANPs mentioned in Section 3.2. We apply the models trained from SentiBank 1.1, SentiBank 1.5R and DeepSentiBank to the test set. For each ANP, the test images are ranked by the estimated probability of the ANP. The performance is measured by average precision (AP) at top 20. The mean AP for each and all noun categories is shown in Figure 4. Although not designed for retrieval, DeepSentiBank still outperforms SentiBank 1.1 by 62.3% and SentiBank 1.5R by 8.9%. Note that DeepSentiBank is trained only on whole images and does not consider concept localization or concept similarity. This means the performance could be further improved if we incorporated the two factors into deep learning. Recently, R-CNN [11] has shown state-of-the-art performance on object detection, which makes it a promising candidate approach for concept localization.
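For reference, one common way to compute average precision over the top 20 ranked images of an ANP is sketched below in NumPy (illustrative; the exact evaluation script is not specified in the paper):

```python
import numpy as np

def average_precision_at_k(ranked_relevance, k=20):
    """ranked_relevance: sequence of 0/1 flags for the ranked test images of one ANP
    (1 = positive). Returns AP over the top k positions."""
    rel = np.asarray(ranked_relevance[:k], dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_i * rel).sum() / rel.sum())

# example: relevance flags of images ranked by predicted ANP probability
ap = average_precision_at_k([1, 0, 1, 1, 0, 0, 1, 0, 0, 1], k=20)
```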

6. CONCLUSION
This paper presents a visual sentiment concept classification model based on deep convolutional neural networks. The deep CNN model is trained with Caffe, a newly developed deep learning framework. To deal with the biased training data, which contain only images with strong sentiment, and to prevent overfitting, we initialize the model with the weights trained from ImageNet. Performance evaluation shows the newly trained deep CNN model DeepSentiBank is significantly better in both annotation and retrieval, compared to previous work using independent binary SVM classification models. In the future, we will incorporate concept localization into the deep CNN model and improve the network structure by leveraging concept relations. The performance boost will also help to improve applications built on SentiBank, such as the assistive comment robot [3] and Twitter sentiment prediction, as well as other applications such as sentiment-aware image editing.

7. REFERENCES

[1] Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel, and Shih-Fu Chang. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013.

[2] Tao Chen, Felix X. Yu, Jiawei Chen, Yin Cui, Yan-Ying Chen, and Shih-Fu Chang. Object-based visual sentiment concept analysis and application. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014.

[3] Yan-Ying Chen, Tao Chen, Winston H. Hsu, Hong-Yuan Mark Liao, and Shih-Fu Chang. Predicting viewer affective comments based on image content in social media. In Proceedings of the International Conference on Multimedia Retrieval. ACM, 2014.

[4] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei. Large scale visual recognition challenge. www.image-net.org/challenges/LSVRC/2012, 1, 2012.


Table 2: Evaluation of different SentiBank models in terms of top-k annotation accuracy.

                                    2,089 ANPs                          1,200 ANPs
  SentiBank ver.                    Top-1     Top-5      Top-10         Top-1      Top-5      Top-10
  SentiBank 1.1                     1.7075%   6.3211%    10.2917%       3.0386%    11.4288%   18.7356%
  DeepSentiBank w/o fine-tuning     6.5235%   16.0095%   22.4941%       11.4430%   28.4856%   39.0800%
  DeepSentiBank                     8.1629%   19.1132%   26.1012%       14.3572%   33.3726%   44.3664%

[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.

[6] Jia Deng, Jonathan Krause, and Li Fei-Fei. Fine-grained crowdsourcing for fine-grained recognition. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 580-587. IEEE, 2013.

[7] Kun Duan, Devi Parikh, David Crandall, and Kristen Grauman. Discovering localized attributes for fine-grained recognition. In Computer Vision and Pattern Recognition. IEEE, 2012.

[8] Andrea Esuli and Fabrizio Sebastiani. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of the Conference on Language Resources and Evaluation, volume 6, 2006.

[9] V. Ferrari and A. Zisserman. Learning visual attributes. In Neural Information Processing Systems, 2007.

[10] Yanwei Fu, Timothy M. Hospedales, Tao Xiang, and Shaogang Gong. Attribute learning for understanding unstructured social activity. In European Conference on Computer Vision. Springer, 2012.

[11] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE CVPR, 2014.

[12] P. Isola, J. Xiao, A. Torralba, and A. Oliva. What makes an image memorable? In Computer Vision and Pattern Recognition, 2011.

[13] Jia Jia, Sen Wu, Xiaohui Wang, Peiyun Hu, Lianhong Cai, and Jie Tang. Can we understand van Gogh's mood? Learning to infer affects from images in social networks. In Proceedings of the 20th ACM International Conference on Multimedia, pages 857-860. ACM, 2012.

[14] Yangqing Jia. Caffe: An open source convolutional architecture for fast feature embedding. Available from http://caffe.berkeleyvision.org/, 2013.

[15] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014.

[16] Dhiraj Joshi, Ritendra Datta, Elena Fedorovskaya, Quang-Tuan Luong, James Z. Wang, Jia Li, and Jiebo Luo. Aesthetics and emotions in images. IEEE Signal Processing Magazine, 28(5):94-115, 2011.

[17] Lyndon Kennedy and Alexander Hauptmann. LSCOM lexicon definitions and annotations (version 1.0). 2006.

[18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.

[19] Quoc V. Le. Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8595-8598. IEEE, 2013.

[20] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.

[21] J. Machajdik and A. Hanbury. Affective image classification using features inspired by psychology and art theory. In Proceedings of ACM Multimedia, pages 83-92, 2010.

[22] L. Marchesotti, F. Perronnin, D. Larlus, and G. Csurka. Assessing the aesthetic quality of photographs using generic image descriptors. In Proceedings of the International Conference on Computer Vision, 2011.

[23] Grégoire Mesnil, Yann Dauphin, Xavier Glorot, Salah Rifai, Yoshua Bengio, Ian J. Goodfellow, Erick Lavoie, Xavier Muller, Guillaume Desjardins, David Warde-Farley, et al. Unsupervised and transfer learning challenge: a deep learning approach. In ICML Unsupervised and Transfer Learning, pages 97-110, 2012.

[24] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807-814, 2010.

[25] M. Naphade, J. Smith, J. Tesic, S.-F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis. Large-scale concept ontology for multimedia. In IEEE Multimedia, 2006.

[26] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Information Retrieval, 2(1-2):1-135, 2008.

[27] Genevieve Patterson and James Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In Computer Vision and Pattern Recognition. IEEE, 2012.

[28] Robert Plutchik. Emotion: A Psychoevolutionary Synthesis. Harper & Row, Publishers, 1980.

[29] Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, pages 759-766. ACM, 2007.

[30] Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3626-3633. IEEE, 2013.

[31] J.R. Smith, M. Naphade, and A. Natsev. Multimedia semantic indexing using model vectors. In International Conference on Multimedia and Expo, 2003.

[32] Mike Thelwall, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12):2544-2558, 2010.

[33] Lorenzo Torresani, Martin Szummer, and Andrew Fitzgibbon. Efficient object category recognition using classemes. In Computer Vision - ECCV 2010, pages 776-789. Springer, 2010.

[34] Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, and Isabell M. Welpe. Predicting elections with Twitter: What 140 characters reveal about political sentiment. In Proceedings of the 4th International AAAI Conference on Weblogs and Social Media, 2010.

[35] Weining Wang and Qianhua He. A survey on emotional semantic image retrieval. In Image Processing, 2008. ICIP 2008. 15th IEEE International Conference on, pages 117-120. IEEE, 2008.

[36] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 2005.

[37] V. Yanulevskaya, J. van Gemert, K. Roth, A. Herbold, N. Sebe, and J.M. Geusebroek. Emotional valence categorization using holistic image features. In Proceedings of the IEEE International Conference on Image Processing, pages 101-104, 2008.

