
Robust Image Sentiment Analysis Using Progressively Trained and Domain Transferred Deep Networks

Quanzeng You and Jiebo Luo
Department of Computer Science
University of Rochester
Rochester, NY 14623
{qyou, jluo}@cs.rochester.edu

Hailin Jin and Jianchao Yang
Adobe Research
345 Park Avenue
San Jose, CA 95110
{hljin, jiayang}@adobe.com

Abstract

Sentiment analysis of online user generated content is important for many social media analytics tasks. Researchers have largely relied on textual sentiment analysis to develop systems to predict political elections, measure economic indicators, and so on. Recently, social media users are increasingly using images and videos to express their opinions and share their experiences. Sentiment analysis of such large scale visual content can help better extract user sentiments toward events or topics, such as those in image tweets, so that prediction of sentiment from visual content is complementary to textual sentiment analysis. Motivated by the need to leverage large scale yet noisy training data to solve the extremely challenging problem of image sentiment analysis, we employ Convolutional Neural Networks (CNN). We first design a suitable CNN architecture for image sentiment analysis. We obtain half a million training samples by using a baseline sentiment algorithm to label Flickr images. To make use of such noisy machine labeled data, we employ a progressive strategy to fine-tune the deep network. Furthermore, we improve the performance on Twitter images by inducing domain transfer with a small number of manually labeled Twitter images. We have conducted extensive experiments on manually labeled Twitter images. The results show that the proposed CNN can achieve better performance in image sentiment analysis than competing algorithms.

Introduction

Online social networks are providing more and more convenient services to their users. Today, social networks have grown to be one of the most important sources for people to acquire information on all aspects of their lives. Meanwhile, every online social network user is a contributor to such large amounts of information. Online users love to share their experiences and to express their opinions on virtually all events and subjects.

Among the large amount of online user generated data, we are particularly interested in people's opinions or sentiments towards specific topics and events. There have been many

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Figure 1: Examples of Flickr images related to the 2012 United States presidential election.

works on using online users' sentiments to predict box-office revenues for movies (Asur and Huberman 2010), political elections (O'Connor et al. 2010; Tumasjan et al. 2010) and economic indicators (Bollen, Mao, and Zeng 2011; Zhang, Fuehres, and Gloor 2011). These works have suggested that online users' opinions or sentiments are closely correlated with our real-world activities. All of these results hinge on accurate estimation of people's sentiments according to their online generated content. Currently, all of these works rely only on sentiment analysis of textual content. However, multimedia content, including images and videos, has become prevalent over all online social networks. Indeed, online social network providers are competing with each other by providing easier access to their increasingly powerful and diverse services. Figure 1 shows example images related to the 2012 United States presidential election. Clearly, images in the top and bottom rows convey opposite sentiments towards the two candidates.

A picture is worth a thousand words. People with different backgrounds can easily understand the main content of an image or video. Apart from the large amount of easily available visual content, today's computational infrastructure is also much cheaper and more powerful, making computationally intensive visual content analysis feasible. In this era of big data, it has been shown that the integration of visual content can provide us more reliable or complementary online social signals (Jin et al. 2010; Yuan et al. 2013).

To the best of our knowledge, little attention has been paid to the sentiment analysis of visual content. Only a few recent works attempted to predict visual sentiment using features


from images (Siersdorfer et al. 2010; Borth et al. 2013b; 2013a; Yuan et al. 2013) and videos (Morency, Mihalcea, and Doshi 2011). Visual sentiment analysis is extremely challenging. First, image sentiment analysis is inherently more challenging than object recognition, as the latter is usually well defined. Image sentiment involves a much higher level of abstraction and subjectivity in the human recognition process (Joshi et al. 2011), on top of a wide variety of visual recognition tasks including object, scene, action and event recognition. In order to use supervised learning, it is imperative to collect a large and diverse labeled training set, perhaps on the order of millions of images. This is an almost insurmountable hurdle due to the tremendous labor required for image labeling. Second, the learning schemes need to have high generalizability to cover different domains. However, the existing works use either pixel-level features or a limited number of predefined attribute features, which makes it difficult to adapt the trained models to images from a different domain.

The deep learning framework enables robust and accurate feature learning, which in turn produces the state-of-the-art performance on digit recognition (LeCun et al. 1989; Hinton, Osindero, and Teh 2006), image classification (Ciresan et al. 2011; Krizhevsky, Sutskever, and Hinton 2012), musical signal processing (Hamel and Eck 2010) and natural language processing (Maas et al. 2011). Both academia and industry have invested a huge amount of effort in building powerful neural networks. These works suggested that deep learning is very effective in learning robust features in a supervised or unsupervised fashion. Even though deep neural networks may be trapped in local optima (Hinton 2010; Bengio 2012), using different optimization techniques, one can achieve the state-of-the-art performance on many challenging tasks mentioned above.

Inspired by the recent successes of deep learning, we are interested in solving the challenging visual sentiment analysis task using deep learning algorithms. For image related tasks, Convolutional Neural Networks (CNN) are widely used because their convolutional layers take into consideration the locations and neighbors of image pixels, which are important for capturing useful features for visual tasks. Convolutional Neural Networks (LeCun et al. 1998; Ciresan et al. 2011; Krizhevsky, Sutskever, and Hinton 2012) have been proven very powerful in solving computer vision related tasks. We intend to find out whether applying CNN to visual sentiment analysis provides advantages over using a predefined collection of low-level visual features or visual attributes, as has been done in prior works.

To that end, we address in this work two major challenges: 1) how to learn with large scale weakly labeled training data, and 2) how to generalize and extend the learned model across domains. In particular, we make the following contributions.

• We develop an effective deep convolutional network architecture for visual sentiment analysis. Our architecture employs two convolutional layers and several fully connected layers for the prediction of visual sentiment labels.

• Our model attempts to address the weakly labeled nature of the training image data, where such labels are machine generated, by leveraging a progressive training strategy and a domain transfer strategy to fine-tune the neural network. Our evaluation results suggest that this strategy is effective for improving the performance of the neural network in terms of generalizability.

• In order to evaluate our model as well as competing algorithms, we build a large manually labeled visual sentiment dataset using Amazon Mechanical Turk. This dataset will be released to the research community to promote further investigations on visual sentiment.

Related Work

In this section, we review literature closely related to our study on visual sentiment analysis, particularly in sentiment analysis and Convolutional Neural Networks.

Sentiment Analysis

Sentiment analysis is a very challenging task (Liu et al. 2003; Li et al. 2010). Researchers from natural language processing and information retrieval have developed different approaches to solve this problem, achieving promising or satisfying results (Pang and Lee 2008). In the context of social media, there are several additional unique challenges. First, there are huge amounts of data available. Second, messages on social networks are by nature informal and short. Third, people use not only textual messages, but also images and videos to express themselves.

Tumasjan et al. (2010) and Bollen et al. (2011) employed pre-defined dictionaries for measuring the sentiment level of Tweets. The volume or percentage of sentiment-bearing words can produce an estimate of the sentiment of one particular tweet. Davidov et al. (2010) used weak labels from a large amount of Tweets: they manually selected hashtags with strong positive and negative sentiments, and ASCII smileys were also utilized to label the sentiments of tweets. Furthermore, Hu et al. (2013) incorporated social signals into their unsupervised sentiment analysis framework. They defined and integrated both emotion indication and correlation into a framework to learn parameters for their sentiment classifier.

There are also several recent works on visual sentiment analysis. Siersdorfer et al. (2010) proposed a machine learning algorithm to predict the sentiment of images using pixel-level features. Motivated by the fact that sentiment involves high-level abstraction, which may be easier to explain by objects or attributes in images, both (Borth et al. 2013a) and (Yuan et al. 2013) propose to employ visual entities or attributes as features for visual sentiment analysis. In (Borth et al. 2013a), 1200 adjective noun pairs (ANP), which may correspond to different levels of different emotions, are extracted. These ANPs are used as queries to crawl images from Flickr. Next, pixel-level features of the images in each ANP are employed to train 1200 ANP detectors. The responses of these 1200 classifiers can then be considered as mid-level features for visual sentiment analysis. The work in (Yuan et al. 2013) employed a similar mechanism. The main difference is that 102 scene attributes are used instead.


[Figure 2 diagram: a 256 × 256 × 3 input is center-cropped to 227 × 227; an 11 × 11 convolution produces 96 feature maps of size 55 × 55, a 5 × 5 convolution produces 256 feature maps of size 27 × 27, followed by fully connected layers of sizes 512, 512, 24, and 2.]
Figure 2: Convolutional Neural Network for Visual Sentiment Analysis.

Convolutional Neural Networks

Convolutional Neural Networks (CNN) have been very successful in document recognition (LeCun et al. 1998). A CNN typically consists of several convolutional layers and several fully connected layers. Between the convolutional layers, there may also be pooling layers and normalization layers. CNN is a supervised learning algorithm, where the parameters of different layers are learned through back-propagation. Due to the computational complexity of CNN, it had only been applied to relatively small images in the literature. Recently, thanks to the increasing computational power of GPUs, it is now possible to train a deep convolutional neural network on a large scale image dataset (Krizhevsky, Sutskever, and Hinton 2012). Indeed, in the past several years, CNN has been successfully applied to scene parsing (Grangier, Bottou, and Collobert 2009), feature learning (LeCun, Kavukcuoglu, and Farabet 2010), visual recognition (Kavukcuoglu et al. 2010) and image classification (Krizhevsky, Sutskever, and Hinton 2012). In our work, we intend to use CNN to learn features which are useful for visual sentiment analysis.

Visual Sentiment Analysis

We propose to develop a suitable convolutional neural network architecture for visual sentiment analysis. Moreover, we employ a progressive training strategy that leverages the training results of the convolutional neural network to further filter out (noisy) training data. The details of the proposed framework will be described in the following sections.

Visual Sentiment Analysis with Regular CNN

CNN has been proven to be effective in image classification tasks, e.g., achieving the state-of-the-art performance in the ImageNet Challenge (Krizhevsky, Sutskever, and Hinton 2012). Visual sentiment analysis can also be treated as an image classification problem. It may seem to be a much easier problem than image classification on ImageNet (2 classes vs. 1000 classes in ImageNet). However, visual sentiment analysis is quite challenging because sentiments or opinions correspond to high level abstractions from a given image. This type of high level abstraction may require the viewer's knowledge beyond the image content itself. Meanwhile, images in the same class of ImageNet mainly contain the same type of object. In sentiment analysis, each class contains much more diverse images. It is therefore extremely challenging to discover features which can distinguish much more diverse classes from each other. In addition, people may have totally different sentiments over the same image. This adds difficulties to not only our classification task, but also the acquisition of labeled images. In other words, it is nontrivial to obtain highly reliable labeled instances, let alone a large number of them. Therefore, we need a supervised learning engine that is able to tolerate a significant level of noise in the training dataset.

The architecture of the CNN we employ for sentiment analysis is shown in Figure 2. Each image is resized to 256 × 256 (if needed, we employ center crop, which first resizes the shorter dimension to 256 and then crops the middle section of the resized image). The resized images are processed by two convolutional layers. Each convolutional layer is followed by max-pooling layers and normalization layers. The first convolutional layer has 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels. The second convolutional layer has 256 kernels of size 5 × 5 with a stride of 2 pixels. Furthermore, we have four fully connected layers. Inspired by (Caglar Gulcehre et al. 2013), we constrain the second to last fully connected layer to have 24 neurons. According to Plutchik's wheel of emotions (Plutchik 1984), there are a total of 24 emotions belonging to two categories: positive emotions and negative emotions. Intuitively, we hope these 24 nodes may help the network to learn the 24 emotions from a given image and then classify each image into the positive or negative class according to the responses of these 24 emotions.
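To make the layer dimensions concrete, the following is a minimal PyTorch-style sketch of the network just described. The authors' implementation uses Caffe, so the pooling windows, normalization parameters, and ReLU activations here are assumptions; the kernel sizes, strides, and the 512-512-24-2 fully connected widths follow the text and Figure 2.

```python
import torch
import torch.nn as nn

class SentimentCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),   # conv1: 96 kernels of 11 x 11 x 3, stride 4
            nn.ReLU(inplace=True),                        # activation choice assumed
            nn.MaxPool2d(kernel_size=3, stride=2),        # max-pooling after conv1 (window assumed)
            nn.LocalResponseNorm(size=5),                 # normalization layer (parameters assumed)
            nn.Conv2d(96, 256, kernel_size=5, stride=2),  # conv2: 256 kernels of 5 x 5, stride 2
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LocalResponseNorm(size=5),
        )
        self.classifier = nn.Sequential(                  # four fully connected layers: 512-512-24-2
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(inplace=True),    # LazyLinear infers the flattened size
            nn.Linear(512, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 24), nn.ReLU(inplace=True),    # 24 units, one per Plutchik emotion
            nn.Linear(24, 2),                             # positive vs. negative sentiment
        )

    def forward(self, x):                                 # x: (batch, 3, 227, 227) center crops
        return self.classifier(self.features(x))

model = SentimentCNN()
out = model(torch.randn(1, 3, 227, 227))                  # -> tensor of shape (1, 2)
```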

The last layer is designed to learn the parameters w by maximizing the following conditional log likelihood function, where x_i and y_i are the feature vector and label of the i-th instance, respectively:

l(w) = \sum_{i=1}^{n} \left[ y_i \ln p(y_i = 1 \mid x_i, w) + (1 - y_i) \ln p(y_i = 0 \mid x_i, w) \right]    (1)

where

p(y_i = 1 \mid x_i, w) = \frac{\exp(w_0 + \sum_{j=1}^{k} w_j x_{ij})}{1 + \exp(w_0 + \sum_{j=1}^{k} w_j x_{ij})}    (2)
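For concreteness, here is a small NumPy sketch of Eqns. (1) and (2), treating the inputs to the last layer as a fixed feature matrix; the variable names and dimensions are illustrative assumptions.

```python
import numpy as np

def log_likelihood(w, X, y):
    """l(w) from Eqn. (1); w = [w_0, w_1, ..., w_k], X is (n, k), y in {0, 1}^n."""
    z = w[0] + X @ w[1:]              # w_0 + sum_j w_j * x_ij for every instance i
    p = 1.0 / (1.0 + np.exp(-z))      # p(y_i = 1 | x_i, w), Eqn. (2), in stable form
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```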


[Figure 3 diagram: 1) input images; 2) train the convolutional neural network to obtain the CNN model; 3) predict sentiment scores on the training data; 4) sample the training instances by those scores; 5) fine-tune; 6) the resulting PCNN model.]

Figure 3: Progressive CNN (PCNN) for visual sentiment analysis.

Visual Sentiment Analysis with Progressive CNN

Since the images are weakly labeled, it is possible that the neural network gets stuck in a bad local optimum. This may lead to poor generalizability of the trained neural network. On the other hand, we found that the neural network is still able to correctly classify a large proportion of the training instances. In other words, the neural network has learned knowledge to distinguish the training instances with relatively distinct sentiment labels. Therefore, we propose to progressively select a subset of the training instances to reduce the impact of noisy training instances. Figure 3 shows the overall flow of the proposed progressive CNN (PCNN). We first train a CNN on Flickr images. Next, we select training samples according to the prediction scores of the trained model on the training data itself. Instead of training from the beginning, we further fine-tune the trained model using these newly selected, and potentially cleaner, training instances. This fine-tuned model is our final model for visual sentiment analysis.

Algorithm 1 Progressive CNN training for Visual Sentiment Analysis
Input: X = {x_1, x_2, ..., x_n}, a set of images of size 256 × 256;
       Y = {y_1, y_2, ..., y_n}, sentiment labels of X
1: Train convolutional neural network CNN with input X and Y
2: Let S ∈ R^{n×2} be the sentiment scores of X predicted using CNN
3: for s_i ∈ S do
4:    Delete x_i from X with probability p_i (Eqn. (3))
5: end for
6: Let X′ ⊂ X be the remaining training images and Y′ their sentiment labels
7: Fine-tune CNN with input X′ and Y′ to get PCNN
8: return PCNN

In particular, we employ a probabilistic sampling algorithm to select the new training subset. The intuition is that we want to keep instances with distinct sentiment scores between the two classes with a high probability, and conversely remove instances with similar sentiment scores for both classes with a high probability. Let s_i = (s_{i1}, s_{i2}) be the predicted sentiment scores for the two classes of instance i. We choose to remove the training instance i with probability p_i given by Eqn. (3). Algorithm 1 summarizes the steps of the proposed framework.

p_i = \max(0, 2 - \exp(|s_{i1} - s_{i2}|))    (3)

When the difference between the predicted sentiment scores of a training instance is large enough, the instance is kept in the training set. Otherwise, the smaller the difference between the predicted sentiment scores becomes, the larger the probability of the instance being removed from the training set.
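The filtering step of Algorithm 1 is straightforward to implement. A NumPy sketch, where `scores` stands for the n × 2 matrix S of predicted sentiment scores:

```python
import numpy as np

def progressive_filter(scores, rng=None):
    """Return a boolean keep-mask over training instances (Algorithm 1, lines 3-5)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Eqn. (3): p_i = max(0, 2 - exp(|s_i1 - s_i2|))
    p_remove = np.maximum(0.0, 2.0 - np.exp(np.abs(scores[:, 0] - scores[:, 1])))
    return rng.random(len(scores)) >= p_remove

# Instances with |s_i1 - s_i2| >= ln 2 ≈ 0.693 have p_remove = 0 and are always kept;
# instances with identical scores have p_remove = 1 and are always removed.
```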

Experiments

We use the same half million Flickr images from SentiBank¹ to train our Convolutional Neural Network. These images are only weakly labeled since each image belongs to one adjective noun pair (ANP). There are a total of 1200 ANPs. According to Plutchik's Wheel of Emotions (Plutchik 1984), each ANP is generated by the combination of adjectives with strong sentiment values and nouns from tags of images and videos (Borth et al. 2013b). These ANPs are then used as queries to collect related images for each ANP. The released SentiBank contains 1200 ANPs with about half a million Flickr images. We train our convolutional neural network mainly on this image dataset. We implement the proposed CNN architecture on the publicly available implementation Caffe (Jia 2013). All of our experiments are evaluated on a Linux x86_64 machine with 32GB RAM and two NVIDIA GTX Titan GPUs.

Comparisons of different CNN architectures

The architecture of our model is shown in Figure 2. However, we also evaluate other architectures for the visual sentiment analysis task. Table 1 summarizes the performance of different architectures on a randomly chosen Flickr testing dataset.

¹ http://visual-sentiment-ontology.appspot.com/


In Table 1, iCONV-jFC indicates that there are i convolutional layers and j fully connected layers in the architecture. The model in Figure 2 shows slightly better performance than the other models in terms of F1 and accuracy. In the following experiments, we mainly focus on the evaluation of the CNN using the architecture in Figure 2.

Table 1: Summary of performance of different architectures on randomly chosen testing data.

Architecture   Precision   Recall   F1      Accuracy
3CONV-4FC      0.679       0.845    0.753   0.644
3CONV-2FC      0.690       0.847    0.760   0.657
2CONV-3FC      0.679       0.874    0.765   0.654
2CONV-4FC      0.688       0.875    0.770   0.665

Baselines

We compare the performance of PCNN with three other baselines or competing algorithms for image sentiment classification.

Low-level Feature-based: Siersdorfer et al. (2010) defined both global and local visual features. Specifically, the global color histogram (GCH) features consist of a 64-bin RGB histogram. The local color histogram (LCH) features first divide the image into 16 blocks and use the 64-bin RGB histogram for each block. They also employed SIFT features to learn a visual word dictionary. Next, they defined bag of visual words (BoW) features for each image.
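As an illustration, here is a NumPy sketch of the 64-bin GCH feature; the uniform 4-level quantization per channel is an assumption consistent with the stated 4³ = 64 bins, not necessarily the exact scheme of Siersdorfer et al.

```python
import numpy as np

def global_color_histogram(image):
    """image: (H, W, 3) uint8 RGB array -> L1-normalized 64-bin histogram."""
    q = image // 64                                     # 4 quantization levels per channel
    bins = q[..., 0] * 16 + q[..., 1] * 4 + q[..., 2]   # joint RGB bin index in [0, 63]
    hist = np.bincount(bins.ravel().astype(np.int64), minlength=64)
    return hist / hist.sum()

# The LCH variant applies the same histogram to each of the 16 image blocks
# (a 4 x 4 grid is assumed here) and concatenates the per-block histograms.
```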

Mid-level Feature-based: Damian et al. (2013a; 2013b) proposed a framework to build a visual sentiment ontology and SentiBank according to the previously discussed 1200 ANPs. With the trained 1200 ANP detectors, they are able to generate 1200 responses for any given test image. A sentiment classifier is built on top of these mid-level features according to the sentiment labels of the training images. Sentribute (Yuan et al. 2013) also employed mid-level features for sentiment prediction. However, instead of using adjective noun pairs, they employed scene-based attributes (Patterson and Hays 2012) to define the mid-level features.

Deep Learning on Flickr Dataset

We randomly choose 90% of the half million Flickr images as our training dataset. The remaining 10% of the images are our testing dataset. We train the convolutional neural network with 300,000 iterations of mini-batches (each mini-batch contains 256 images). We employ the sampling probability in Eqn. (3) to filter the training images according to the prediction scores of the CNN on its own training data. In the fine-tuning stage of PCNN, we run another 100,000 iterations of mini-batches using the filtered training dataset. Table 2 gives a summary of the number of data instances in our experiments. Figure 4 shows the filters learned in the first convolutional layer of CNN and PCNN, respectively. There are some differences between 4(a) and 4(b). While it is somewhat inconclusive whether the neural networks have reached a better local optimum, at least we can conclude that

Table 2: Statistics of the Flickr image dataset.

Models   training   testing   # of iterations
CNN      401,739    44,637    300,000
PCNN     369,828    44,637    100,000

Table 3: Performance on the testing dataset by CNN and PCNN.

Algorithm   Precision   Recall   F1      Accuracy
CNN         0.714       0.729    0.722   0.718
PCNN        0.759       0.826    0.791   0.781

the fine-tuning stage using a progressively cleaner training dataset has prompted the neural networks to learn different knowledge. Indeed, the evaluation results suggest that this fine-tuning leads to an improvement in performance.

Table 3 shows the performance of both CNN and PCNN on the 10% randomly chosen testing data. PCNN outperformed CNN in terms of Precision, Recall, F1 and Accuracy. The results in Table 3 and the filters in Figure 4 show that the fine-tuning stage of PCNN can help the neural network search for a better local optimum.

(a) Filters learned from CNN

(b) Filters learned from PCNN

Figure 4: Filters of the first convolutional layer.

Twitter Testing Dataset

We also built a new image dataset from image tweets. Image tweets refer to those tweets that contain images. We collected a total of 1269 images as our candidate testing images. We employed crowd intelligence, Amazon Mechanical Turk (AMT), to generate sentiment labels for these testing images, in a similar fashion to (Borth et al. 2013b). We recruited 5 AMT workers for each candidate image. Table 4 shows the statistics of the labeling results from the Amazon Mechanical Turk. In the table, "five agree" indicates that all 5 AMT workers gave the same sentiment label for a given image. Only a small portion of the images, 153 out of 1269, had significant disagreements between the 5 workers (3 vs. 2).


Table 5: Performance of different algorithms on the Twitter image dataset (Acc stands for Accuracy).

             Five Agree                      At Least Four Agree             At Least Three Agree
Algorithms   Precision  Recall  F1     Acc    Precision  Recall  F1     Acc    Precision  Recall  F1     Acc
CNN          0.749      0.869   0.805  0.722  0.707      0.839   0.768  0.686  0.691      0.814   0.747  0.667
PCNN         0.770      0.878   0.821  0.747  0.733      0.845   0.785  0.714  0.714      0.806   0.757  0.687


Table 4: Summary of AMT labeled results for the Twitter testing dataset.

Sentiment   Five Agree   At Least Four Agree   At Least Three Agree
Positive    581          689                   769
Negative    301          427                   500
Sum         882          1116                  1269

We evaluate the performance of the Convolutional Neural Networks on this manually labeled image dataset using the models trained on Flickr images. Table 5 shows the performance of the two frameworks. Not surprisingly, both models perform better on the less ambiguous image set ("five agree" by AMT). Meanwhile, PCNN shows better performance than CNN on all three labeling sets in terms of both F1 and accuracy. This suggests that the fine-tuning stage of PCNN effectively improves the generalizability of the neural networks.

Transfer Learning

Half a million Flickr images are used in our CNN training. The features learned are generic features over these half million images. Table 5 shows that these generic features also have the ability to predict the visual sentiment of images from other domains. The question we ask is whether we can further improve the performance of visual sentiment analysis on Twitter images by inducing transfer learning. In this section, we conduct experiments to answer this question.

The users of Flickr are more likely to spend time on taking high quality pictures, while Twitter users tend to share the moment with the world. Thus, most of the Twitter images are casually taken snapshots. Meanwhile, most of the images are related to current trending topics and personal experiences, making the images on Twitter much more diverse in content as well as quality.

In this experiment, we fine-tune the pre-trained neural network model in the following way to achieve transfer learning. We randomly divide the Twitter images into 5 equal partitions. Each time, we use 4 of the 5 partitions to fine-tune our model pre-trained on the half million Flickr images, and evaluate the new model on the remaining partition. The averaged evaluation results are reported. The algorithm is detailed in Algorithm 2.

Similar to (Borth et al. 2013b), we also employ 5-fold cross-validation to evaluate the performance of all the baseline algorithms. Table 6 summarizes the averaged performance of the different baseline algorithms and our two CNN models. Overall, both CNN models outperform the baseline algorithms. Among the baseline algorithms, Sentribute

Figure 5: Positive (top block) and Negative (bottom block) examples. Each column shows the example images for each algorithm (PCNN, CNN, Sentribute, Sentibank, GCH, LCH, GCH+BoW, LCH+BoW). The images are ranked by prediction score from top to bottom in decreasing order.

Algorithm 2 Transfer Learning to fine-tune CNN
Input: X = {x_1, x_2, ..., x_n}, a set of images of size 256 × 256;
       Y = {y_1, y_2, ..., y_n}, sentiment labels of X;
       Pre-trained CNN model M
1: Randomly partition X and Y into 5 equal groups {(X_1, Y_1), ..., (X_5, Y_5)}
2: for i from 1 to 5 do
3:    Let (X′, Y′) = (X, Y) − (X_i, Y_i)
4:    Fine-tune M with input (X′, Y′) to obtain model M_i
5:    Evaluate the performance of M_i on (X_i, Y_i)
6: end for
7: return The averaged performance of M_i on (X_i, Y_i), for i from 1 to 5
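A compact Python sketch of this protocol follows; the `fine_tune` and `evaluate` helpers are hypothetical stand-ins for the Caffe fine-tuning and testing runs.

```python
import copy
import numpy as np

def transfer_learning_cv(model, X, Y, fine_tune, evaluate, k=5, seed=0):
    """Algorithm 2: k-fold fine-tuning of a pre-trained model on target-domain data."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        m_i = fine_tune(copy.deepcopy(model),            # keep the pre-trained weights intact
                        [X[t] for t in train_idx], [Y[t] for t in train_idx])
        scores.append(evaluate(m_i, [X[t] for t in test_idx], [Y[t] for t in test_idx]))
    return float(np.mean(scores))                        # averaged performance over the k folds
```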

gives slightly better results than the other two baseline algorithms. Interestingly, even the combination of the low-level features local color histogram (LCH) and bag of visual words (BoW) shows better results than SentiBank on our Twitter dataset.


Table 6: 5-Fold Cross-Validation Performance of different algorithms on the Twitter image dataset. Note that compared with Table 5, both fine-tuned CNN models have been improved due to domain transfer learning (Acc stands for Accuracy).

             Five Agree                      At Least Four Agree             At Least Three Agree
Algorithms   Precision  Recall  F1     Acc    Precision  Recall  F1     Acc    Precision  Recall  F1     Acc
GCH          0.708      0.888   0.787  0.684  0.687      0.840   0.756  0.665  0.678      0.836   0.749  0.660
LCH          0.764      0.809   0.786  0.710  0.725      0.753   0.739  0.671  0.716      0.737   0.726  0.664
GCH + BoW    0.724      0.904   0.804  0.710  0.703      0.849   0.769  0.685  0.683      0.835   0.751  0.665
LCH + BoW    0.771      0.811   0.790  0.717  0.751      0.762   0.756  0.697  0.722      0.726   0.723  0.664
SentiBank    0.785      0.768   0.776  0.709  0.742      0.727   0.734  0.675  0.720      0.723   0.721  0.662
Sentribute   0.789      0.823   0.805  0.738  0.750      0.792   0.771  0.709  0.733      0.783   0.757  0.696
CNN          0.795      0.905   0.846  0.783  0.773      0.855   0.811  0.755  0.734      0.832   0.779  0.715
PCNN         0.797      0.881   0.836  0.773  0.786      0.842   0.811  0.759  0.755      0.805   0.778  0.723

Both fine-tuned CNN models have been improved. This improvement is significant given that we only use four fifths of the 1269 images for domain adaptation. Both neural network models have similar performance on all three sets of the Twitter testing data. This suggests that the fine-tuning stage helps both models find a better local optimum. In particular, the knowledge from the Twitter images starts to determine the performance of both neural networks; the previously trained model only determines the starting position of the fine-tuned model.

Meanwhile, for each model, we respectively select the top 5 positive and top 5 negative examples from the 1269 Twitter images according to the evaluation scores. Figure 5 shows those examples for each model. In both blocks, each column contains the images for one model. A green solid box means the prediction label of the image agrees with the human label; otherwise, we use a red dashed box. The labels of the top ranked images in both neural network models are all correctly predicted. However, the images are not all the same. This, on the other hand, suggests that even though the two models achieve similar results after fine-tuning, they may have arrived at somewhat different local optima due to the different starting positions, as well as the transfer learning process. For all the baseline models, it is difficult to say which kinds of images are more likely to be correctly classified. However, we observe that several mistakenly classified images are common among the models using low-level features (the four rightmost columns in Figure 5). Similarly, for Sentibank and Sentribute, several of the same images are also in the top ranked samples. This indicates that there is some common learned knowledge in the low-level feature models and mid-level feature models.

Conclusions

Visual sentiment analysis is a challenging and interesting problem. In this paper, we adopt recently developed convolutional neural networks to solve this problem. We have designed a new architecture, as well as new training strategies, to overcome the noisy nature of the large-scale training samples. Both progressive training and transfer learning induced by a small number of confidently labeled images from the target domain have yielded notable improvements.

The experimental results suggest that properly trained convolutional neural networks can outperform both classifiers that use predefined low-level features and those that use mid-level visual attributes for the highly challenging problem of visual sentiment analysis. Meanwhile, the main advantage of using convolutional neural networks is that we can transfer the knowledge to other domains using a much simpler fine-tuning technique than those in the literature, e.g., (Duan et al. 2012).

It is important to reiterate the significance of this work over the state-of-the-art (Siersdorfer et al. 2010; Borth et al. 2013b; Yuan et al. 2013). We are able to directly leverage a much larger weakly labeled data set for training, as well as a larger manually labeled dataset for testing. The larger data sets, along with the proposed deep CNN and its training strategies, give rise to better generalizability of the trained model and higher confidence in such generalizability.

We believe that sentiment analysis on large scale online user generated content is quite useful, since it can provide more robust signals and information for many data analytics tasks, such as using social media for prediction and forecasting. In the future, we plan to develop robust multimodality models that employ both textual and visual content for social media sentiment analysis. We also hope our sentiment analysis results can encourage further research on online user generated content.

Acknowledgments

This work was generously supported in part by Adobe Research. We would like to thank the Digital Video and Multimedia (DVMM) Lab at Columbia University for providing the half million Flickr images and their machine-generated labels.

References

Asur, S., and Huberman, B. A. 2010. Predicting the future with social media. In WI-IAT, volume 1, 492–499. IEEE.

Bengio, Y. 2012. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade. Springer. 437–478.

Bollen, J.; Mao, H.; and Pepe, A. 2011. Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. In ICWSM.

Bollen, J.; Mao, H.; and Zeng, X. 2011. Twitter mood predicts the stock market. Journal of Computational Science 2(1):1–8.

Borth, D.; Chen, T.; Ji, R.; and Chang, S.-F. 2013a. SentiBank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content. In ACM MM, 459–460. ACM.

Borth, D.; Ji, R.; Chen, T.; Breuel, T.; and Chang, S.-F. 2013b. Large-scale visual sentiment ontology and detectors using adjective noun pairs. In ACM MM, 223–232. ACM.

Caglar Gulcehre; Cho, K.; Pascanu, R.; and Bengio, Y. 2013. Learned-norm pooling for deep neural networks. CoRR abs/1311.1780.

Ciresan, D. C.; Meier, U.; Masci, J.; Gambardella, L. M.; and Schmidhuber, J. 2011. Flexible, high performance convolutional neural networks for image classification. In IJCAI, 1237–1242. AAAI Press.

Davidov, D.; Tsur, O.; and Rappoport, A. 2010. Enhanced sentiment learning using twitter hashtags and smileys. In ICL, 241–249. Association for Computational Linguistics.

Duan, L.; Xu, D.; Tsang, I.-H.; and Luo, J. 2012. Visual event recognition in videos by learning from web data. IEEE PAMI 34(9):1667–1680.

Grangier, D.; Bottou, L.; and Collobert, R. 2009. Deep convolutional networks for scene parsing. In ICML 2009 Deep Learning Workshop, volume 3. Citeseer.

Hamel, P., and Eck, D. 2010. Learning features from music audio with deep belief networks. In ISMIR, 339–344.

Hinton, G. E.; Osindero, S.; and Teh, Y.-W. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18(7):1527–1554.

Hinton, G. 2010. A practical guide to training restricted Boltzmann machines. Momentum 9(1):926.

Hu, X.; Tang, J.; Gao, H.; and Liu, H. 2013. Unsupervised sentiment analysis with emotional signals. In WWW, 607–618. International World Wide Web Conferences Steering Committee.

Jia, Y. 2013. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/.

Jin, X.; Gallagher, A.; Cao, L.; Luo, J.; and Han, J. 2010. The wisdom of social multimedia: using Flickr for prediction and forecast. In ACM MM, 1235–1244. ACM.

Joshi, D.; Datta, R.; Fedorovskaya, E.; Luong, Q.-T.; Wang, J. Z.; Li, J.; and Luo, J. 2011. Aesthetics and emotions in images. IEEE Signal Processing Magazine 28(5):94–115.

Kavukcuoglu, K.; Sermanet, P.; Boureau, Y.-L.; Gregor, K.; Mathieu, M.; and LeCun, Y. 2010. Learning convolutional feature hierarchies for visual recognition. In NIPS, 5.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 4.

LeCun, Y.; Boser, B.; Denker, J. S.; Henderson, D.; Howard, R. E.; Hubbard, W.; and Jackel, L. D. 1989. Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4):541–551.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.

LeCun, Y.; Kavukcuoglu, K.; and Farabet, C. 2010. Convolutional networks and applications in vision. In ISCAS, 253–256. IEEE.

Li, G.; Hoi, S. C.; Chang, K.; and Jain, R. 2010. Micro-blogging sentiment detection by collaborative online learning. In ICDM, 893–898. IEEE.

Liu, B.; Dai, Y.; Li, X.; Lee, W. S.; and Yu, P. S. 2003. Building text classifiers using positive and unlabeled examples. In ICDM, 179–186. IEEE.

Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning word vectors for sentiment analysis. In ACL, 142–150.

Morency, L.-P.; Mihalcea, R.; and Doshi, P. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In ICMI, 169–176. New York, NY, USA: ACM.

O'Connor, B.; Balasubramanyan, R.; Routledge, B. R.; and Smith, N. A. 2010. From tweets to polls: Linking text sentiment to public opinion time series. ICWSM 11:122–129.

Pang, B., and Lee, L. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2(1-2):1–135.

Patterson, G., and Hays, J. 2012. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In CVPR.

Plutchik, R. 1984. Emotions: A general psychoevolutionary theory. Approaches to Emotion 1984:197–219.

Siersdorfer, S.; Minack, E.; Deng, F.; and Hare, J. 2010. Analyzing and predicting sentiment of images on the social web. In ACM MM, 715–718. ACM.

Tumasjan, A.; Sprenger, T. O.; Sandner, P. G.; and Welpe, I. M. 2010. Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM 178–185.

Yuan, J.; Mcdonough, S.; You, Q.; and Luo, J. 2013. Sentribute: image sentiment analysis from a mid-level perspective. In Proceedings of the Second International Workshop on Issues of Sentiment Discovery and Opinion Mining, 10. ACM.

Zhang, X.; Fuehres, H.; and Gloor, P. A. 2011. Predicting stock market indicators through Twitter "I hope it is not as bad as I fear". Procedia - Social and Behavioral Sciences 26:55–62.

