
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016

Multi-class Sentiment Classification on Twitter using an Emoji Training Heuristic

FREDRIK HALLSMAR AND JONAS PALM

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION


Multi-class Sentiment Classification on Twitter using an Emoji Training Heuristic

FREDRIK HALLSMAR
JONAS PALM

Degree Project in Computer Science, DD143X
Supervisor: Richard Glassey
Examiner: Örjan Ekeberg

CSC, KTH, 2016-05


Abstract

Sentiment analysis on social media is an important part of today's need for information gathering. Different machine learning techniques have been used in recent years, and usage of an emoticon heuristic to automatically annotate training sets has been a popular approach. As emojis are becoming more popular in text-based communication, this thesis investigates the feasibility of an emoji training heuristic for multi-class sentiment analysis using a Multinomial Naive Bayes classifier. Training sets consisting of 4 000 to 400 000 tweets were used to train the classifier using various configurations of N-grams. The results show that an emoji heuristic performs well compared to emoticon- or hashtag-based heuristics. However, classifier confusion is highly dependent on class selection and emoji representations when multi-class sentiment analysis is performed.

Referat (Swedish abstract)

Sentiment analysis is a problem of great importance on social media. Several different machine learning techniques have been used in recent years, and using a training set automatically annotated with the help of a heuristic based on so-called emoticons has been a popular approach. The use of so-called emojis in text-based communication has recently increased. In line with this development, this study aims to investigate whether the use of a heuristic based on emojis is viable for multi-class sentiment analysis. This is examined with the help of a Multinomial Naive Bayes classifier trained on sets of 4 000 to 400 000 tweets and different variations of N-grams. The results show that an emoji-based heuristic performs well compared to one based on hashtags or emoticons. However, the choice of classes and emoji representations has a large impact on classifier confusion.


Contents

1 Introduction
  1.1 Problem statement
  1.2 Scope of study

2 Background
  2.1 Sentiment analysis
  2.2 Emojis
  2.3 Classification
    2.3.1 Naive Bayes
    2.3.2 N-gram
  2.4 An example of Naive Bayes and N-grams

3 Method
  3.1 Sentiment classes and their emoji representation
  3.2 Data mining
  3.3 Implementation
    3.3.1 Sampling
    3.3.2 Preprocessing and tokenization
    3.3.3 Classification settings
    3.3.4 Evaluation

4 Results
  4.1 Accuracy diagrams
  4.2 Data tables
  4.3 Confusion matrix

5 Discussion
  5.1 Result discussion
  5.2 Method discussion

6 Conclusion
  6.1 Further research

Bibliography


1 Introduction

Social media usage has increased drastically in the last decade. Services such as Twitter, Instagram and Facebook have become a crucial part of everyday life [1]. These services are used for communication, but also to express one's emotions and opinions. At the time of writing, Twitter has a 140 character limit for its messages. This creates a need for people and companies using Twitter to be able to express themselves in as few words as possible [2]. Additionally, these texts often lack non-verbal cues present in face-to-face communication, such as sarcasm and excitement [3][4].

Emojis were created in the late 1990s by a Japanese telecom company as a way to attract teenagers to its pager service [5][6][7]. The word "emoji" is Japanese for "picture character". Emojis were later adopted in the West when users discovered a hidden keyboard in the Apple iPhone meant for the Japanese market [5][7]. Emojis are actual pictures embedded in text and should not be confused with emoticons (face imitations created using ASCII characters) [8]. Today emojis are available as an optional written language on most smartphones and have become an important part of our everyday communication, commonly used to spice up or add emotional cues to text messages [4][9].

Being able to determine the sentiment of posts on social media is of great value for businesses and organizations, since people are increasingly using social media to express opinions [10][11]. For example, a company can use sentiment analysis to perform market research and evaluate how its products are experienced by their users, without having to send out questionnaires or otherwise bother its customers [11]. Common approaches for determining sentiment on Twitter have focused on whether a tweet is positive or negative, also known as binary (or polar) sentiment analysis [3]. Machine learning algorithms such as Naive Bayes classifiers have turned out to be successful within this field of study [12].

A study by Twitter in 2015 shows that 15% of tweets during TV prime time contain at least one emoji, and that the most popular emojis are not [emoji] and [emoji] but rather [emoji] and [emoji] [13]. Multi-class sentiment analysis aims to utilize this information by using more than two classes of sentiment. Whereas traditional sentiment analysis determines whether a text is positive or negative (polarity), multi-class sentiment analysis uses categories or clusters such as excited, happy, bored and angry to better understand the emotions expressed in the text [14].



Training data annotated by a heuristic based on emoticons already exists and has been shown to perform well compared to hand-annotated training sets [12]. This study intends to investigate the feasibility of a heuristic that annotates tweets using emojis. By investigating an emoji heuristic, this study fills a gap in the area, where most research has focused on the effects of emoticons and sentiment in text.

1.1 Problem statement

This report investigates whether multi-class sentiment classification of tweets can be achieved by automatically annotating training sets using a heuristic based on emojis. The results will be evaluated against a testing set annotated by hand.

The result will be used to answer the question: can the multi-class sentiment of tweets be determined by using emojis as a training heuristic?

The contribution to the field of sentiment analysis will be a proof of concept withproposals for future research.

1.2 Scope of study

The number of categories each tweet can be classified into is limited to four. The emojis used in the training heuristic are restricted to emojis present in the Apple iOS firmware at the time of writing. Due to the expressional ambiguity of some emojis, a small number of emojis for each class were selected to make the classes more distinctive [8][9].

Furthermore, this study will only be concerned with tweets in English, even though the heuristic can be considered language agnostic. As the testing set used to evaluate performance will be annotated by hand, knowledge of the language is required. All people involved in this study are proficient in English, and therefore English was selected.

This study only evaluates the generated training sets using a Naive Bayes classifier. Naive Bayes classifiers have been shown to work well in previous studies [12][15]. Furthermore, only unigram and bigram models, as well as a combination of both, will be tested and evaluated.


2 Background

This section introduces the concepts of emojis and sentiment analysis. The difference between emojis and emoticons is explained, as well as the difference between polar and multi-class classification. Finally, means of classification and machine learning concepts are presented.

2.1 Sentiment analysis

Sentiment analysis, also known as opinion mining, is the task of determining the underlying attitude of a writer or speaker. Sentiment analysis is a subfield of Natural Language Processing (NLP), which deals with the task of processing natural languages. Natural languages are languages that people speak, such as Swedish or English, and evolve naturally by invention of new words or slang. These languages contain ambiguity and other complications, such as irony or sarcasm, as opposed to formal languages such as programming languages. Today NLP is widely used, for example to find relevant search results in a search engine and to correct misspellings in word processing applications.

A common approach to sentiment analysis is to determine whether the sentiment of a sample is positive or negative, also known as polarity. This is often achieved using either a rule-based approach or an approach based on machine learning algorithms [3][16]. The analysis works by extracting features from a text and determining which sentiment class the sample most likely belongs to. This can be done either by using a lexicon (rule-based) or probabilities (machine learning). Common machine learning classifiers used in sentiment analysis are Naive Bayes and Maximum Entropy [12].

Tasks such as TV and movie review analysis have shown that polar sentiment analysis is not always sufficient. Within this domain, a sample that would be classified as negative in a polar analysis might actually be "positive". For example, in reviews of a sad drama film, classifying sad tweets as negative might not be desirable; in that context, sadness should rather be seen as something positive. This brings the attention to methods using more classes, also known as multi-class sentiment analysis [14]. The difference between multi-class and polar classification is illustrated in Table 2.1.


Table 2.1: Example of the difference between polar classification and multi-class classification.

Phrase                                    Polar sentiment   Multi-class
This movie is so sad                      Negative          Sad
The hotel in The Shining is so scary      Negative          Fearful
The election debate is making me angry    Negative          Angry
I am loving my new car                    Positive          Happy

2.2 Emojis

Emojis are "picture characters" originating from Japan in the late 1990s. They were created by a Japanese telecom company as a way to attract teenagers to use its pager service [5][6][7]. Emojis have become increasingly popular in recent years and are used worldwide in text-based communication [17]. Software keyboards for emoji usage are implemented in most major mobile operating systems, the best known perhaps being Apple iOS and Google Android.

Emojis can be seen as a further development of emoticons, sequences of text symbols meant to represent a facial expression, for example ":-)" and ":-(". Looked at sideways, the combinations of characters become a happy face and a sad face. The birthplace of the emoticon can be traced back to Carnegie Mellon University, where in the 1980s the Computer Science department needed a way to explicitly mark certain posts on its online bulletin boards as not to be taken seriously [7][18]. Recent studies have shown that the usage of emojis is growing and that emojis are not only complementing emoticons but replacing them on social media [8]. A recent study shows that during the period February 2014 to August 2015, almost 14% of all public messages on Twitter contained at least one emoji, whereas emoticons were found in roughly 2% of the messages [17].

In contrast to emoticons, emojis are actual pictures and have the ability to convey a wider range of emotions. In addition to expressing facial expressions similar to emoticons, a much wider range of concepts and ideas can be visualized, such as weather, food and events. For example, the rugby ball emoji can be inserted to indicate that the text is about rugby. However, due to the increasing number of emojis, the level of ambiguity, created by cultural differences and cross-device implementations, has become a problem in sentiment analysis [8].


2.3 Classification

Machine learning algorithms are often used for classification purposes. In general, a training set is constructed from known samples and used to train a classifier. The classifier can then be used to predict the most probable class of unseen samples, often with a testing set to evaluate performance. It is important that these sets are different and that the testing set is not used during development, to avoid over-training, which occurs when the algorithm performs well on the testing data but generalizes badly [16].

2.3.1 Naive Bayes

The Naive Bayes (NB) classifier has been proved efficient for sentiment analysis [12]. The idea behind the NB classifier is that the probability of a feature vector f belonging to a class s can be computed using Bayes' theorem (Equation 2.1).

$$P(s \mid \vec{f}) = \frac{P(\vec{f} \mid s)\,P(s)}{P(\vec{f})} \qquad (2.1)$$

In addition to this, the naive assumption (indicated by its name) is made that all features are independent of each other. As a consequence of this assumption, and since P(f) is the same for all classes, Equation 2.2 can be used to find the most probable class for a feature vector f [16].

$$\hat{s} = \underset{s \in S}{\operatorname{argmax}} \; P(s) \prod_{j=1}^{n} P(f_j \mid s) \qquad (2.2)$$

There are several different variations of NB classifiers, two common variants being Multinomial Naive Bayes and Bernoulli Naive Bayes. As the names suggest, these classifiers make different assumptions about the distribution of the feature vectors. Multinomial NB is often used when counting frequencies of words is important, whereas Bernoulli NB can be used when only the presence of words matters [19].
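The thesis uses the multinomial variant (see Section 3.3.3). As a rough illustration of the count-versus-presence distinction, the sketch below fits both variants in Scikit-learn on two invented toy sentences; the texts, labels and variable names are examples for this illustration, not material from the study.

```python
# Minimal sketch (not the thesis implementation) contrasting the two NB variants
# in Scikit-learn on invented toy sentences.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

texts = ["I am so happy happy happy", "I am sad and hungry"]
labels = ["Happy", "Sad"]

# Multinomial NB: features are word counts, so repeated words carry extra weight.
count_vectorizer = CountVectorizer()
X_counts = count_vectorizer.fit_transform(texts)
multinomial = MultinomialNB().fit(X_counts, labels)

# Bernoulli NB: features are binary word presence, so repetition is ignored.
binary_vectorizer = CountVectorizer(binary=True)
X_binary = binary_vectorizer.fit_transform(texts)
bernoulli = BernoulliNB().fit(X_binary, labels)

print(multinomial.predict(count_vectorizer.transform(["so happy today"])))
print(bernoulli.predict(binary_vectorizer.transform(["so happy today"])))
```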

2.3.2 N-gram

N-gram models can intuitively be described as a method to predict words. In general, N-grams can be used to predict the next word given the previous N−1 words. More formally, an N-gram is a sequence of N tokens which can be assigned a probability estimated via maximum likelihood estimation (MLE), calculated as shown in Equation 2.3 [16].

$$P(W_n \mid W_{n-N+1}^{\,n-1}) = \frac{C(W_{n-N+1}^{\,n-1} W_n)}{C(W_{n-N+1}^{\,n-1})} \qquad (2.3)$$

Furthermore, prediction of words has turned out to be closely related to computing the probability that a sequence of N-grams belongs to a language model (or a certain class) [16]. A common problem within this field arises when an N-gram token never occurs in the training data of the language model and hence is given the probability zero. Consequently, samples can be given zero probability when they should actually have some non-zero probability. This is solved by a technique called smoothing, where part of the probability mass is moved to unseen tokens [16].
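As a small illustration of Equation 2.3 and of smoothing, the sketch below estimates bigram probabilities from a tiny invented corpus, first unsmoothed and then with add-one (Laplace) smoothing; the corpus and function names are assumptions made for the example.

```python
# Minimal sketch of the bigram MLE from Equation 2.3 and of add-one (Laplace)
# smoothing; the tiny corpus and function names are invented for illustration.
from collections import Counter

corpus = ["i am happy", "i am hungry", "i am so happy"]

unigram_counts = Counter(w for sentence in corpus for w in sentence.split())
bigram_counts = Counter(
    pair
    for sentence in corpus
    for pair in zip(sentence.split(), sentence.split()[1:])
)
vocabulary_size = len(unigram_counts)

def bigram_mle(prev, word):
    """Unsmoothed estimate C(prev word) / C(prev); zero for unseen bigrams."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def bigram_add_one(prev, word):
    """Add-one smoothing: (C(prev word) + 1) / (C(prev) + |V|)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocabulary_size)

print(bigram_mle("am", "happy"))      # seen bigram, non-zero
print(bigram_mle("am", "sad"))        # unseen bigram, exactly zero
print(bigram_add_one("am", "sad"))    # unseen bigram, small but non-zero
```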

2.4 An example of Naive Bayes and N-grams

Consider the following example. A maximum likelihood table has been calculated from unigrams extracted from a given set of training data with two classes, Happy and Sad (in this case the training data would have consisted of text sentences). An excerpt of the table relevant to the example is shown in Table 2.2. Given the task to predict the class of two new sentences, "Hungry I am" and "I am playing", the probabilities are calculated by applying the Naive Bayes formula from the previous section (as shown in Equation 2.4).

In this case the first sentence was labeled as Sad and the second was labeled as Happy. A uniform distribution of the classes is assumed, and hence the class prior probabilities P(s) can be left out. Note that the probabilities are small, which can lead to computational underflow. This is usually solved by converting to log-space and using addition instead of multiplication. Since addition in log-space is equivalent to multiplication in linear space, the calculations would produce the same predictions.


$$\begin{aligned}
P(\mathrm{Sad} \mid \text{Hungry I am}) &= P(\text{Hungry} \mid \mathrm{Sad})\,P(\text{I} \mid \mathrm{Sad})\,P(\text{am} \mid \mathrm{Sad}) = 0.08 \cdot 0.075 \cdot 0.076 = 0.000456 \\
P(\mathrm{Happy} \mid \text{Hungry I am}) &= P(\text{Hungry} \mid \mathrm{Happy})\,P(\text{I} \mid \mathrm{Happy})\,P(\text{am} \mid \mathrm{Happy}) = 0.02 \cdot 0.073 \cdot 0.08 = 0.0001168 \\
P(\mathrm{Sad} \mid \text{I am playing}) &= P(\text{I} \mid \mathrm{Sad})\,P(\text{am} \mid \mathrm{Sad})\,P(\text{playing} \mid \mathrm{Sad}) = 0.075 \cdot 0.076 \cdot 0.01 = 0.000057 \\
P(\mathrm{Happy} \mid \text{I am playing}) &= P(\text{I} \mid \mathrm{Happy})\,P(\text{am} \mid \mathrm{Happy})\,P(\text{playing} \mid \mathrm{Happy}) = 0.073 \cdot 0.08 \cdot 0.09 = 0.0005256
\end{aligned} \qquad (2.4)$$

Table 2.2: Excerpt from a possible MLE table showing the unigram frequencies for the classes Sad and Happy.

         I       am      hungry   playing   basketball
Sad      0.075   0.076   0.08     0.01      0.02
Happy    0.073   0.08    0.02     0.09      0.03
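A minimal sketch of this example, using the probabilities from Table 2.2 and the log-space trick described above (uniform priors dropped), could look as follows; the function name is illustrative.

```python
# Minimal sketch reproducing the worked example: unigram likelihoods come from
# Table 2.2, class priors are uniform and therefore dropped, and the products are
# computed in log-space to avoid the underflow mentioned above.
import math

likelihoods = {
    "Sad":   {"i": 0.075, "am": 0.076, "hungry": 0.08, "playing": 0.01, "basketball": 0.02},
    "Happy": {"i": 0.073, "am": 0.08,  "hungry": 0.02, "playing": 0.09, "basketball": 0.03},
}

def classify(sentence):
    # Sum of log-probabilities; the ordering of classes is the same as for the product.
    scores = {
        label: sum(math.log(probs[word]) for word in sentence.lower().split())
        for label, probs in likelihoods.items()
    }
    return max(scores, key=scores.get), scores

print(classify("Hungry I am"))   # Sad scores higher, as in Equation 2.4
print(classify("I am playing"))  # Happy scores higher
```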


3 Method

This study began with a literature study to explore the topic of sentiment classification and the usage of emojis in the field of machine learning. As a result, four sentiment classes with two emoji representations each were chosen and used to collect and annotate a training set. The collection was made using the Twitter Streaming API. Additionally, a testing set was created by manually annotating hand-picked tweets. Finally, the annotated training and testing sets were used to train and evaluate a Multinomial NB classifier, implemented in Python using NLTK (version 3.2) and Scikit-learn (version 0.17.1).

3.1 Sentiment classes and their emoji representation

Paul Ekman's original theory about universal emotion categories suggests that human emotions can be categorized as one out of six different classes: anger, disgust, fearful, happy, sad and surprise [20]. Furthermore, Instagram's engineering team recently published a study based on 50 million English Instagram comments and captions from 2015, with a visualization of how emojis appear in similar contexts. The resulting 2D semantic map (part of which is shown in Figure 3.1) indicates which emojis can be considered contextually similar [9]. It can be concluded from the semantic map that emojis which could be considered representations of the emotions anger, fearful, happy and sad are clearly separated. Consequently, the subset of Paul Ekman's original emotion categories shown in Table 3.1 was selected for the study.

Table 3.1: Sentiment classes and their emoji representation chosen for this study (emoji glyphs not reproduced here).

Sad | Anger | Fearful | Happy

3.2 Data mining

To the best of the authors' knowledge, there exists no publicly available gold standard for the four emotion classes used in this study. Therefore, the authors created a testing set by manually hand-picking tweets from Twitter's live search, using search queries specifically constructed to find tweets for each class. Tweets with a clear and distinct class were selected for inclusion in the testing set. In total, 80 tweets were collected, with an even distribution among the classes.

Figure 3.1: Part of the semantic map of emojis produced by the Instagram engineering team in the "Emojineering" report. Classes and emojis selected for this study are marked by circles. (Image not reproduced.)

In contrast to the testing set, the tweets used in the training set were collected automatically using the Twitter Streaming API¹ and stored in a local database. An English language filter was applied and the API was instructed to only track tweets containing emojis from the four chosen sentiment classes. In order to be able to create samples with an even distribution of the classes, the collection of tweets was allowed to run continuously until at least 100 000 unique tweets had been collected for each of the four classes. The amount of data should be considered enough compared to similar studies [21][22]. Furthermore, any tweet containing emojis from two or more different classes was considered ambiguous and excluded from the training set.
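The thesis does not reproduce its collection code. As a purely hypothetical sketch, such a collector could be written with the tweepy library (3.x-style streaming API) roughly as follows; the credentials, the CLASS_EMOJIS mapping (the actual emojis of Table 3.1 are not reproduced in this text) and the save_to_database helper are all placeholders.

```python
# Hypothetical sketch of the tweet collection step using tweepy's 3.x-style
# streaming API. Credentials, CLASS_EMOJIS and save_to_database are placeholders.
import tweepy

CLASS_EMOJIS = {                     # stand-in emojis, not the ones from Table 3.1
    "Sad":     ["\U0001F622"],
    "Anger":   ["\U0001F620"],
    "Fearful": ["\U0001F631"],
    "Happy":   ["\U0001F600"],
}

def save_to_database(label, text):
    print(label, text)               # placeholder for the local database used in the study

class EmojiListener(tweepy.StreamListener):
    def on_status(self, status):
        text = status.text
        # A tweet is kept only if its emojis map to exactly one class;
        # tweets mixing classes are treated as ambiguous and discarded.
        matching = {label for label, emojis in CLASS_EMOJIS.items()
                    if any(e in text for e in emojis)}
        if len(matching) == 1:
            save_to_database(matching.pop(), text)

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
stream = tweepy.Stream(auth, EmojiListener())
track = [e for emojis in CLASS_EMOJIS.values() for e in emojis]
stream.filter(track=track, languages=["en"])
```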

Ultimately, the distribution of the classes in the database was 26% Sad, 6% Anger, 6% Fearful and 62% Happy. The smallest number of collected tweets for any class was 113 562 (for Fearful).

¹ Twitter Streaming API - https://dev.twitter.com/streaming/overview


3.3 Implementation

3.3.1 Sampling

Taking into account that the distribution of collected tweets was uneven, the selection of tweets used to train the classifier was randomly sampled and limited to the number of tweets collected from the least frequent class. Five random samples were drawn from the full database of collected tweets. From each of these five samples, eleven new training sets of different sizes were constructed: 1 000, 10 000, 20 000, ..., 100 000 tweets for each class. The selection from the full sample was ordered sequentially, resulting in training set S_N always consisting of at least all tweets in training set S_{N-1}; in other words, S_{N-1} ⊂ S_N for 1 < N < 12.
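A minimal sketch of this nested sampling scheme, assuming a tweets_by_class dictionary that maps each class to its collected tweets, might look as follows; the names and seed handling are illustrative, not the study's code.

```python
# Minimal sketch of the nested sampling scheme described above. `tweets_by_class`
# (a dict mapping each class to its collected tweets) is an assumed input.
import random

SIZES = [1000] + [10000 * k for k in range(1, 11)]    # 1k, then 10k, 20k, ..., 100k per class

def build_nested_training_sets(tweets_by_class, seed):
    rng = random.Random(seed)
    # Shuffle each class once and take growing prefixes: every smaller training
    # set is then contained in every larger one (S_{N-1} subset of S_N).
    shuffled = {label: rng.sample(tweets, len(tweets))
                for label, tweets in tweets_by_class.items()}
    return [{label: tweets[:size] for label, tweets in shuffled.items()}
            for size in SIZES]

# Five independent rounds of random sampling, as in the study:
# rounds = [build_nested_training_sets(tweets_by_class, seed) for seed in range(5)]
```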

3.3.2 Preprocessing and tokenization

All steps of preprocessing used in this study are based on recent studies exploringsimilar fields of research [12]. Additionally, stemming was performed for furtherfeature reduction.

• All characters converted to lowercase.

• Usernames (mentions) replaced with "@USER", and URLs (strings beginning with http[s], ftp[s] and www) replaced with "URL". Subsequently, these words were added to the list of stop words.

• Repeated characters that occur in a sequence longer than two are reduced to two characters. For example, "Helloooooo" would be transformed to "Helloo".

• Stemmed using NLTK's SnowballStemmer² for the English language in order to convert individual words to their stems.

• Tokenized using TweetTokenizer³, a Twitter-aware tokenizer (i.e. one that correctly separates hashtags and emojis) included in NLTK.

In addition to this, all training and testing sets were evaluated with and without an English stop word list provided by Scikit-learn.

² NLTK SnowballStemmer - http://www.nltk.org/api/nltk.stem.html
³ NLTK TweetTokenizer - http://www.nltk.org/api/nltk.tokenize.html
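A minimal sketch of these preprocessing and tokenization steps, assuming NLTK is installed, is shown below; the regular expressions for mentions and URLs are illustrative rather than the exact patterns used in the study.

```python
# Minimal sketch of the preprocessing and tokenization steps listed above; the
# regular expressions for mentions and URLs are illustrative, not the thesis code.
import re
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import TweetTokenizer

stemmer = SnowballStemmer("english")
tokenizer = TweetTokenizer()

def preprocess(tweet):
    text = tweet.lower()                                            # lowercase
    text = re.sub(r"@\w+", "@USER", text)                           # mentions -> @USER
    text = re.sub(r"(?:https?|ftps?)://\S+|www\.\S+", "URL", text)  # URLs -> URL
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                      # "Helloooooo" -> "Helloo"
    tokens = tokenizer.tokenize(text)                               # Twitter-aware tokenization
    return [stemmer.stem(token) for token in tokens]                # stemming

print(preprocess("@friend Helloooooo, loving my new car!! http://example.com"))
```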


3.3.3 Classification settings

All training sets for each of the five random samples were evaluated with and without an English stop word list (provided by Scikit-learn, including the extra words mentioned in the previous section). Additionally, unigrams, bigrams and a combination of both were tested, producing 2 × 3 × 5 × 11 = 330 different sets of results.

The classifier has several parameters that can be changed in order to optimize its predictions. In our implementation, however, all default values were kept with the exception of fit_prior, which is used to choose whether to learn class prior probabilities or not; as all training sets contain an even distribution of the classes, this parameter does not have any effect on the result. The alpha parameter controls the amount of additive smoothing (0 for no smoothing and 1 for full smoothing). The final parameters used are displayed in Table 3.2:

Table 3.2: Final parameter settings used to initialize the Scikit-learn Multinomial NB classifier.

Classifier settings
MultinomialNB(alpha=1.0, fit_prior=False)
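A minimal sketch of this setup, combining Scikit-learn's CountVectorizer for the N-gram features with the MultinomialNB settings from Table 3.2, could look as follows. The function name is illustrative; in the study, the tokenization and stemming from Section 3.3.2 and the extended stop word list would also be plugged in.

```python
# Minimal sketch of the classification setup: N-gram counting with CountVectorizer
# and the MultinomialNB settings from Table 3.2. The custom preprocessing from
# Section 3.3.2 and the extra stop words (@USER, URL) are omitted for brevity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def make_classifier(ngram_range, use_stop_words):
    return Pipeline([
        ("vectorizer", CountVectorizer(
            ngram_range=ngram_range,                           # (1, 1), (2, 2) or (1, 2)
            stop_words="english" if use_stop_words else None,  # Scikit-learn's English list
        )),
        ("classifier", MultinomialNB(alpha=1.0, fit_prior=False)),
    ])

# Example usage with the best-performing configuration (unigrams + bigrams, stop words),
# where train_texts/train_labels would come from the emoji-annotated training set:
# model = make_classifier(ngram_range=(1, 2), use_stop_words=True)
# model.fit(train_texts, train_labels)
```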

3.3.4 Evaluation

Accuracy is a common measurement used to evaluate the performance of machine learning classifiers [16]. Consequently, accuracy was used to evaluate the Multinomial NB classifier in this study. It can be interpreted as the percentage of classifications that were correct (Equation 3.1).

$$\text{Accuracy} = \frac{N(\text{correct classifications})}{N(\text{classifications})} \qquad (3.1)$$

Additionally, a confusion matrix was extracted from the predictions made by theclassifier to visualize class similarity and classifier confusion.
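A minimal sketch of this evaluation step, using Scikit-learn's metrics on two short invented label lists so that it runs on its own, is shown below.

```python
# Minimal sketch of the evaluation step: accuracy as in Equation 3.1 plus a
# confusion matrix. The label lists here are invented; in the study they would be
# the hand-annotated test labels and the classifier's predictions.
from sklearn.metrics import accuracy_score, confusion_matrix

CLASSES = ["Anger", "Fearful", "Happy", "Sad"]

actual    = ["Anger", "Anger", "Sad", "Happy", "Fearful", "Sad"]
predicted = ["Anger", "Anger", "Anger", "Happy", "Fearful", "Sad"]

print(accuracy_score(actual, predicted))                    # correct / total
print(confusion_matrix(actual, predicted, labels=CLASSES))  # rows: actual, columns: predicted
```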


4 Results

This section begins with a visualization of the accuracy score as line graphs for the different setups explained in Section 3.3.3. This is followed by a presentation of the raw data results achieved after running the classifier, and lastly an explanation and visualization of class prediction confusion.

4.1 Accuracy diagrams

The graphs presented in this section show the mean accuracy score achieved by running the Multinomial NB classifier on the testing set with five different random samples. The graphs in Figure 4.1 also show a baseline at 25%, which in this case would be equal to guessing or picking the same class for each prediction. While all configurations performed above baseline, a combination of unigrams and bigrams with the use of a stop word list produced the best result, with a mean accuracy of 71% (exact numbers are shown in Table 4.1).

Figure 4.1: Average accuracy for unigrams, bigrams and a combination of both, plotted against the size of the training set (×1000 tweets): (a) with stop word list; (b) without stop word list. Baseline is plotted as a dashed line at y = 0.25. (Plots not reproduced.)


Figure 4.2: Average accuracy in log-space (logarithmic training set size axis, ×1000 tweets) for unigrams, bigrams and a combination of both: (a) with stop word list; (b) without stop word list. (Plots not reproduced.)


4.2 Data tables

The tables presented in this section show the mean accuracy score for each training set size and configuration visualized in Section 4.1.

Table 4.1: Mean accuracies for all configurations and training sets.

Dataset (×1000 tweets per class)   Unigrams   Bigrams   Combination
1                                  0.58       0.40      0.56
10                                 0.67       0.52      0.65
20                                 0.66       0.56      0.68
30                                 0.66       0.56      0.68
40                                 0.66       0.60      0.70
50                                 0.67       0.60      0.69
60                                 0.67       0.62      0.69
70                                 0.67       0.65      0.70
80                                 0.67       0.65      0.70
90                                 0.68       0.66      0.71
100                                0.68       0.65      0.71

(a) Average accuracies from five rounds of random sampling for unigram, bigram and a unigram-bigram combination when a stop word filter was used.

Dataset (×1000 tweets per class)   Unigrams   Bigrams   Combination
1                                  0.54       0.47      0.52
10                                 0.65       0.58      0.64
20                                 0.64       0.62      0.67
30                                 0.64       0.64      0.68
40                                 0.65       0.65      0.67
50                                 0.65       0.66      0.67
60                                 0.66       0.66      0.67
70                                 0.67       0.66      0.68
80                                 0.68       0.66      0.68
90                                 0.67       0.67      0.68
100                                0.68       0.68      0.68

(b) Average accuracies from five rounds of random sampling for unigram, bigram and a unigram-bigram combination without a stop word filter.


4.3 Confusion matrix

The confusion matrix is a contingency table with two dimensions, actual and predicted. Each row represents instances of the actual class, while each column represents the predicted class. The confusion matrix visualizes errors in the predicted classes, for example which items of the class Happy the classifier predicted as Sad. A good classification result should produce a matrix where the diagonal element of each row contains the highest value (indicating that the majority were in fact correctly classified).

Actual \ Predicted   Anger   Fearful   Happy   Sad
Anger                18      2         0       0
Fearful              1       15        1       3
Happy                0       5         13      2
Sad                  8       1         2       9

Figure 4.3: Confusion matrix visualizing the classification result when using the combined uni- and bigram model, a training set size of 400 000 tweets and a stop word list (the configuration which yielded the best mean accuracy).

The confusion matrix for the configuration that yielded the best mean accuracy is displayed in Figure 4.3. The matrix shows that the largest confusion is within the class Sad, where only 45% of the tweets were correctly classified as Sad. 40% of the Sad tweets were confused as Angry by the classifier, whereas, interestingly, no Angry tweets were classified as Sad. The class that performed best was Angry, where only two of the tweets were misclassified, yielding a class accuracy of 90%. For visual purposes the confusion matrix is also shown as a stacked bar chart in Figure 4.4.
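The per-class figures quoted above can be recomputed directly from the matrix in Figure 4.3; the short sketch below does so (variable names are illustrative).

```python
# Minimal sketch computing the per-class figures quoted above (90% for Anger,
# 45% for Sad, 40% of Sad predicted as Anger) from the matrix in Figure 4.3.
CLASSES = ["Anger", "Fearful", "Happy", "Sad"]
MATRIX = [            # rows: actual class, columns: predicted class
    [18, 2, 0, 0],    # Anger
    [1, 15, 1, 3],    # Fearful
    [0, 5, 13, 2],    # Happy
    [8, 1, 2, 9],     # Sad
]

for i, label in enumerate(CLASSES):
    row_total = sum(MATRIX[i])
    print(f"{label}: {MATRIX[i][i] / row_total:.0%} correctly classified")

sad = MATRIX[CLASSES.index("Sad")]
print(f"Sad predicted as Anger: {sad[CLASSES.index('Anger')] / sum(sad):.0%}")
```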


Figure 4.4: Confusion matrix in Figure 4.3 visualized as a stacked bar chart. (Chart not reproduced.)


5 Discussion

5.1 Result discussion

The results show that it is possible to perform multi-class sentiment classification using emojis as a training heuristic. Whereas a combination of unigrams and bigrams together with a stop word list yielded the best accuracy (71%), all N-gram categories, with and without a stop word list, performed above the baseline, even when the training set size was small. As previously mentioned, NB classifiers have been shown to work well with both short texts and small training sizes, which the results in this study confirm. Compared to other studies, the emoji-based training heuristic achieves better or similar results to heuristics based on emoticons and hashtags [12][22]. However, it is difficult to perform a fair comparison due to differences in class selection, data sets and measurements used for evaluation.

It can be concluded from Figure 4.1 that without a stop word list, bigram performance aligns with that of unigrams. However, the presence of a stop word filter results in a decrease in bigram performance, in contrast to unigrams whose performance stays roughly the same. This might be explained by bigrams losing some of the benefits gained from contextual information, as well as by the amount of training data being reduced. Naturally, if 25% of a tweet consists of words present in the stop word list, a stop word filter would remove 25% of the training data.

The confusion matrix in Figure 4.3 describes correct and incorrect predictions made by the classifier compared to the actual target value in the data. The matrix shows, as previously explained, that the classifier often mispredicted tweets that should be classified as Sad as Anger. On the contrary, none of the Angry tweets were classified as Sad, thus showing a one-way confusion between the classes. Nevertheless, this still suggests there exists some overlap between the two classes. The same reasoning can also be applied to the confusion between the classes Happy and Fear.

One explanation for this confusion is that there is an overlap in the actual emotions conveyed. In other words, both emotions might be present. For example, two tweets that were mispredicted by the classifier as Angry but labeled as Sad were "My school is a jail. They blocked Snapchat, Netflix, Instagram.. Like what's next?" and "Why did I take tutoring as a job? :( Added stress". Looking at these two tweets and trying to decide whether they are sad or angry is a difficult task, even for humans (according to the authors of this study). It might not even be possible, considering the correct classification could be to classify them as both Angry and Sad. However, that is a multi-label classification task and not in the scope of this study. Furthermore, it is possible that people are more likely to be angry when they are sad rather than sad when they are angry, in other words, P(Angry|Sad) > P(Sad|Angry). This would explain the one-way relationship.

Another explanation is that the emojis chosen for each class do not properly represent the contexts in which they are used. As expected, the confusion matrix shows that Fear, Angry and Sad are more closely related to each other than to Happy, which is intuitively explained by the fact that the classes Fear, Angry and Sad represent negative emotions whereas Happy represents a positive emotion. Additionally, the relationships between the emoji pairs used for class representation correspond to the relationships displayed in the Instagram semantic map (Figure 3.1). This suggests that emojis are used in similar ways on Twitter and Instagram. Consequently, according to Instagram, one of the emojis this study uses to represent fear is often used in the same context as the internet slang "OMG" ("Oh My God"). This is a possible explanation of the confusion between Fear and Happy. Finally, emoji ambiguity is not limited to the user's interpretation or intuition. For example, emojis are rendered differently on different platforms, which has been shown to affect their sentiment [8].

5.2 Method discussion

The selection of sentiment classes to be evaluated in this study was made with assistance of a study performed by Instagram, as well as Paul Ekman's universal emotion research. The semantic map created in the study by Instagram (Figure 3.1) was made using the t-SNE algorithm to transform a 100-dimensional data set into a 2-dimensional representation. This algorithm has a number of weaknesses identified by its creators, one of which is that it is unclear how well it performs on general dimensionality reduction tasks [23]. Consequently, the semantic map might not always be the best visualization to consider when selecting classes and their representation.

The testing set used in this study was collected manually by the two authors using Twitter's live search functionality. Queries were constructed in order to find tweets with a clear and unambiguous sentiment. For example, searching for "exam" and "afraid" might return tweets with a fearful sentiment. Perhaps the most serious disadvantage of this method is the assumption that the final selection of testing tweets represents Twitter data in a realistic and general way. As a consequence, the selection might have become "too easy" for the classifier to predict. As previously mentioned, the research to date has tended to focus on polar sentiment analysis rather than multi-class sentiment analysis, and thus no public gold standard exists that could be used as a testing set. One common way of solving this is to use human annotators via Amazon's Mechanical Turk [22]. However, creating a gold standard is beyond the scope of this study and can be considered a research project on its own.

Data mining for the training set was run continuously during a short period of time. A serious weakness with this is that the training data could suffer from time and event bias. In addition, it would be interesting to compare the result to sampling performed randomly rather than continuously for all training set sizes. Another weakness of the sampling method is that, for the training sets of size 400 000, the samples for Fear and Anger contained almost all tweets from their respective classes in the collected data, and thus were not truly randomly sampled.

Finally, it should also be taken into account that Twitter is considered a noisy channel and that data mining was performed without the use of a proper spam filter [22]. Nevertheless, even without such filtering, the results are still considered sufficient to answer the research question at hand, and maximizing performance is left as a proposal for future research.


6 Conclusion

It can be concluded that multi-class sentiment analysis can be done using an emoji heuristic. The approach used in this study shows reasonable performance, well above baseline and in line with similar studies, for all training sets and configurations. Nevertheless, the purpose of this study was not to maximize performance but rather to investigate whether or not an emoji heuristic can be used for multi-class sentiment analysis. However, it is important to select classes carefully in order to minimize class overlap. Additional care must also be taken when selecting emoji representations, as emojis have been shown to appear in unforeseeable contexts.

6.1 Further research

One of the major drawbacks of this study has been the testing set. Therefore, another study focused on creating a good and representative testing set to use in evaluation is needed. Additionally, it would be interesting to see further research within this field utilizing all of Paul Ekman's universal emotion classes, as well as an in-depth study of emojis' contextual appearance. Other areas that would be interesting to investigate are:

• How would a similar heuristic perform on other mediums such as Instagramand Facebook?

• How does spam and other noise affect sentiment analysis?

• What would happen if a neutral class was added?

• Is it possible to achieve better results using other machine learning techniques?


Bibliography

[1] Maeve Duggan et al. "Social media update 2014". In: Pew Research Center 19 (2015).

[2] Michele Zappavigna. Discourse of Twitter and Social Media: How We Use Language to Create Affiliation on the Web. A&C Black, 2012.

[3] Bo Pang and Lillian Lee. "Opinion mining and sentiment analysis". In: Foundations and Trends in Information Retrieval 2.1-2 (2008), pp. 1–135.

[4] Joseph B. Walther and Kyle P. D'Addario. "The impacts of emoticons on message interpretation in computer-mediated communication". In: Social Science Computer Review 19.3 (2001), pp. 324–347.

[5] Adam Sternbergh. Smile, You're Speaking Emoji - The rapid evolution of a wordless tongue. Nov. 2014. URL: http://nymag.com/daily/intelligencer/2014/11/emojis-rapid-evolution.html (visited on 04/18/2016).

[6] Jeff Blagdon. "How Emoji Conquered the World". In: The Verge (2013).

[7] Luke Stark and Kate Crawford. "The Conservatism of Emoji: Work, Affect, and Communication". In: Social Media + Society 1.2 (2015), article 2056305115604853.

[8] Hannah Miller et al. ""Blissfully happy" or "ready to fight": Varying Interpretations of Emoji". In: ICWSM-16 (2016).

[9] Thomas Dimson. Emojineering Part 1: Machine Learning for Emoji Trends. 2015. URL: http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji (visited on 04/18/2016).

[10] Maria Ogneva. "How companies can use sentiment analysis to improve their business". In: Retrieved August 30 (2010).

[11] Alex Wright. "Mining the web for feelings, not facts". In: New York Times 24 (2009).

[12] Alec Go, Lei Huang, and Richa Bhayani. "Twitter sentiment analysis". In: Entropy 17 (2009).

[13] Twitter. Emoji usage in TV conversation. 2015. URL: https://blog.twitter.com/2015/emoji-usage-in-tv-conversation (visited on 04/18/2016).

[14] Yuki Yamamoto, Tadahiko Kumamoto, and Akiyo Nadamoto. "Role of Emoticons for Multidimensional Sentiment Analysis of Twitter". In: Proceedings of the 16th International Conference on Information Integration and Web-based Applications & Services. ACM, 2014, pp. 107–115.

[15] Sida Wang and Christopher D. Manning. "Baselines and bigrams: Simple, good sentiment and topic classification". In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2. Association for Computational Linguistics, 2012, pp. 90–94.

[16] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 1st ed. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2000. ISBN: 0130950696.

[17] Umashanthi Pavalanathan and Jacob Eisenstein. "Emoticons vs. Emojis on Twitter: A Causal Inference Approach". In: arXiv preprint arXiv:1510.08480 (2015).

[18] Scott E. Fahlman. Smiley Lore :-). URL: https://www.cs.cmu.edu/~sef/sefSmiley.htm (visited on 04/18/2016).

[19] Andrew McCallum, Kamal Nigam, et al. "A comparison of event models for naive Bayes text classification". In: AAAI-98 Workshop on Learning for Text Categorization. Vol. 752. Citeseer, 1998, pp. 41–48.

[20] Paul Ekman. "An argument for basic emotions". In: Cognition & Emotion 6.3-4 (1992), pp. 169–200.

[21] Alexander Pak and Patrick Paroubek. "Twitter as a Corpus for Sentiment Analysis and Opinion Mining". In: LREC. Vol. 10. 2010, pp. 1320–1326.

[22] Matthew Purver and Stuart Battersby. "Experimenting with distant supervision for emotion classification". In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2012, pp. 482–491.

[23] Laurens van der Maaten and Geoffrey Hinton. "Visualizing data using t-SNE". In: Journal of Machine Learning Research 9 (2008), pp. 2579–2605.
