Sentiment Analysis with Deep Neural Networks
João Carlos Duarte Santos Oliveira Violante
Thesis to obtain the Master of Science Degree in
Telecommunications and Informatics Engineering
Supervisors: Prof. Bruno Emanuel da Graça Martins
Prof. Pavél Pereira Calado
Examination Committee
Chairperson: Prof. Luís Manuel Antunes Veiga
Supervisor: Prof. Bruno Emanuel da Graça Martins
Members of the Committee: Prof. Maria Luísa Torres Ribeiro Marques da Silva Coheur
November 2016
Acknowledgements
I would first like to thank Professors Pavel Calado and Bruno Martins for having contributed
to this work with their extensive knowledge and motivation.
Secondly, I would like to thank my family for all the unconditional support throughout all
these years, and also for giving me the opportunity to learn in this institute. A special thanks
goes to my sister for all the availability and assistance during this period.
Finally, I have to thank all my friends and colleagues who have supported me throughout
my academic career.
Lisbon, November 2016
João Carlos Duarte Santos Oliveira Violante
For my family,
Resumo
O aumento de utilizadores da Internet e o consequente aumento do volume de opiniões,
expressas pelos mesmos, nesse meio de comunicação, resultou em grandes fontes de informação.
Esta informação oferece-nos um importante feedback sobre determinados produtos ou serviços,
provocando um aumento do interesse em vários problemas, práticos ou académicos, que lidam
com a análise deste tipo de informação. A área que tem como objectivo resolver este tipo de
problemas é normalmente designada por sentiment analysis ou opinion mining. Tendo isto em
consideração, pretende-se com este trabalho abordar o tema de detecção do tipo de sentimento
expresso num determinado documento textual. Especificamente foram estudadas e comparadas,
em diferentes contextos, algumas das abordagens que representam o actual estado da arte,
maioritariamente relacionadas com o uso de redes neuronais profundas. Adicionalmente, testou-se
a possibilidade de melhorar os resultados dessas abordagens introduzindo alguma informação
sobre as dimensões das diferentes emoções expressas em cada um dos textos. Nesta dissertação
é apresentada uma descrição para as arquitecturas dos referidos modelos assim como a sua
comparação com os sistemas existentes actualmente. Os resultados experimentais obtidos mostram
que a ideia de adicionar informação sobre as emoções, em algumas situações, melhora o
desempenho de diferentes abordagens.
Abstract
The increasing amount of Internet users and the consequent increase of online user reviews,
expressing their opinions, has resulted in large sources of information. This information can give
us an important feedback about particular products or services, leading to a growing interest on
several problems that deal with the analysis of this type of information. This area of research is
typically called sentiment analysis or opinion mining. Considering the interest in this area, the
goal of this MSc research project was to address the topic of detecting the sentiment (positive
or negative) of the opinion expressed in a given textual document, by studying and comparing,
in different contexts, some of the approaches that represent the current state of art in the
area, which is mainly related to the use of deep neural networks. Additionally, this work tried
to improve the results of these methods, by adding some additional information about the
dimensions of the different emotions expressed in the documents. This dissertation presents
a description of the considered model architectures, as well as their comparison with existing
systems. Our experimental results show that adding information about the emotions can, in
some cases, improve the performance of different approaches.
Palavras Chave
Redes Neuronais Profundas
Classificação de Texto
Análise de Sentimentos
Polaridade de Opiniões
Análise de Emoções
Keywords
Deep Neural Networks
Text Classification
Sentiment Analysis
Opinion Polarity
Emotion Analysis
Contents
1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Structure of the Document
2 Fundamental Concepts and Related Work
  2.1 Fundamental Concepts
    2.1.1 Text Representation
    2.1.2 Text Classification
  2.2 Related Work
    2.2.1 Lexicon-Based Approaches
    2.2.2 Corpus-Based Approaches
    2.2.3 Combined Approaches
  2.3 Overview
3 Sentiment Analysis with Deep Neural Networks
  3.1 Text Representation
    3.1.1 Bag of Words
    3.1.2 Word to Vector
    3.1.3 Document to Vector
  3.2 Classification Models
    3.2.1 Overview on the Layers used within Deep Neural Networks
      3.2.1.1 Embedding Layers
      3.2.1.2 Dropout Layers
      3.2.1.3 LSTM Layers
      3.2.1.4 Dense Layers
      3.2.1.5 Convolutional Layers
      3.2.1.6 Pooling Layers and Flatten Layers
      3.2.1.7 Activation Layers
    3.2.2 Model Architectures
      3.2.2.1 Stack of LSTMs
      3.2.2.2 Bidirectional LSTMs
      3.2.2.3 Multi-Layer Perceptron
      3.2.2.4 Convolutional Neural Networks
      3.2.2.5 Combined CNN-LSTM Network
      3.2.2.6 Merged CNN
      3.2.2.7 Merged CNN-LSTM
  3.3 Overview
4 Experimental Evaluation
  4.1 Evaluation Methodology
    4.1.1 Evaluation for the Sentiment Analysis Task
    4.1.2 Evaluation for the Dimensional Sentiment Analysis Task
  4.2 Datasets
    4.2.1 Sentiment Analysis Datasets
    4.2.2 Dimensional Sentiment Analysis Datasets
    4.2.3 Word Embeddings
  4.3 Evaluation Metrics
  4.4 Experimental Results
    4.4.1 Sentiment Analysis
      4.4.1.1 Sentence Polarity Dataset
      4.4.1.2 Stanford Sentiment TreeBank Dataset
      4.4.1.3 Tweet 2016 Dataset
    4.4.2 Dimensional Sentiment Analysis
      4.4.2.1 Affective Norms for English Text Dataset
      4.4.2.2 EmoTales Dataset
      4.4.2.3 Facebook Messages Dataset
  4.5 Overview
5 Conclusions
  5.1 Conclusions
  5.2 Future Work
Bibliography
List of Figures
2.1 Tree-Based Convolutional Neural Network (Mou et al., 2015).
2.2 Gated Recurrent Neural Network Architecture (Tang et al., 2015a).
3.1 Model Architectures (Mikolov et al., 2013).
3.2 Examples of a) a traditional Recurrent Neural Network Architecture, and b) a Long Short Term Memory Architecture.
3.3 Stack of two LSTMs.
3.4 Bidirectional LSTM Architecture.
3.5 Multi-Layer Perceptron Architecture.
3.6 Convolutional Neural Network Architecture.
3.7 CNN-LSTM Architecture.
3.8 Merged CNN Architecture.
3.9 Merged CNN-LSTM Architecture.
List of Tables
4.1 Sentence Polarity Dataset: Model results using pre-trained word embeddings against previous works.
4.2 Sentence Polarity Dataset: Model results using a concatenated version of word embeddings, against the results of merged architectures.
4.3 Stanford Sentiment Treebank Dataset: Model results using pre-trained word embeddings against previous works.
4.4 Stanford Sentiment Treebank Dataset: Model results using a concatenated version of word embeddings against the results of merged architectures.
4.5 Tweet 2016 Dataset: Model results using pre-trained word embeddings against previous works.
4.6 Tweet 2016 Dataset: Model results using a concatenated version of word embeddings against the results of merged architectures.
4.7 ANET Dataset: Prediction results for valence, arousal and dominance in terms of the Pearson correlation coefficient.
4.8 EmoTales Dataset: Prediction results for valence, arousal and dominance in terms of the Pearson correlation coefficient.
4.9 Facebook Messages Dataset: Prediction results for valence and arousal in terms of the Pearson correlation coefficient.
Acronyms
NLP Natural Language Processing
CNN Convolutional Neural Network
RNN Recurrent Neural Network
LSTM Long Short-Term Memory
GRU Gated Recurrent Unit
MLP Multi Layer Perceptron
NB Naive Bayes
SVM Support Vector Machine
VAD Valence Arousal Dominance
BoW Bag of Words
CBoW Continuous Bag of Words
RAE Recursive Autoencoders
MV-RNN Matrix-Vector Recursive Neural Network
UPNN User Product Neural Network
TBCNN Tree-Based Convolution Neural Network
RNTN Recursive Neural Tensor Network
DCNN Dynamic Convolutional Neural Network
Paragraph-Vec Logistic Regression on top of Paragraph Vectors
CCAE Combinatorial Category Autoencoders
Sent-Parser Sentiment Analysis-Specific parser
NBSVM Naive Bayes SVM
MNB Multinomial Naive Bayes
G-Dropout Gaussian Dropout
F-Dropout Fast Dropout
Tree-CRF Dependency Tree with Conditional Random Fields
1 Introduction

In different areas, as well as in different contexts, the feedback provided by consumers of
a specific product or service has an unmatchable relevance. This kind of information has a
wide range of applications, bringing clear advantages to areas like marketing or politics. For
instance, collecting the opinions of consumers concerning a certain product or service can allow
a marketing company to achieve a more accurate assessment of the effectiveness of their last
campaign, and suggest to their clients the necessary adjustments to increase sales or become more
efficient. On the other hand, feedback from citizens can be used to measure the popularity, near
election time, of a particular candidate, allowing campaign managers to obtain more accurate
and timely information.
The aforementioned advantages are not limited to producers, embracing final consumers as
well. Searching for opinions about a certain product or service is nowadays practically mandatory
before purchase or subscription decisions. This is only possible because potential future
consumers can find feedback from former or current consumers, helping them to make the best
decision.
Before the Internet became a day-to-day tool for people, the access to this kind of information
was very limited, and feedback analysis was practically impossible at large scale. With
the rise of the Internet, the problem of the lack of information sources is almost completely
eliminated, giving users access to a wide range of opinions and experiences.
However, another obstacle arises along with the increasing availability of the Internet: how to
analyse all these information sources. Automatically predicting the emotion/sentiment/opinion
behind a textual document is not an easy task to perform. Many opinions are often
subtle and complex, including negation and sometimes sarcasm. Taking this into account, a
new research area emerged, often referred to as Sentiment Analysis or Opinion Mining.
1.1 Motivation
Sentiment analysis is an area of research with a broad scope, including tasks with different
degrees of complexity. A fundamental task in sentiment analysis consists in detecting words
that express a specific sentiment and then, through the detected words, assign a sentiment to
a particular textual document. Another and more complex task is called aspect-level sentiment
analysis, where the idea is to get a more powerful and fine-grained evaluation about the opinions
expressed for a particular topic. In this case, the different aspects that need to be evaluated
individually will be extracted first, and then the opinions related to each aspect will be evaluated.
For example, nowadays when we want to buy a cellphone we search for evaluations about specific
features like camera, processor, RAM or battery. The main idea behind aspect-level sentiment
analysis is to obtain the opinion expressed by the other users about each of these specific features.
Finally, in order to help the user to make the decision (buy or not buy), the sentiment analysis
tool assigns a global positive or negative value taking into account the evaluations of all the
individual features.
Still, using the conventional positive and negative sentiment evaluations is insufficient for
an accurate and more detailed evaluation. Opinions not only include positive or negative
sentiments, but they are also dependent on the emotional state of the writer at that moment. So,
adding an emotion detection system, going beyond detecting positive versus negative opinions
into more nuanced notions of opinion valence, can give us a stronger and more expressive
evaluation over the conventional approaches.
Emotions like joy, surprise, or anger are present in our daily life, making their analysis a
great source of information about, for instance, the emotional state of the employees of a company,
the consumers of a product, or even a country's population. Bearing this in mind, the expression
of emotions in written textual contents has been studied using two different approaches, namely
the discrete approach and the dimensional approach.
While the discrete approach sees emotions as a set of basic affective states that can
be easily identified by themselves, such as sadness, joy, or frustration, the dimensional approach
clusters affective states into a smaller set of major dimensions like valence, arousal and dominance
(VAD). In brief, valence represents an emotional dimension related to the attractiveness of an
object, event, or situation, while arousal represents a degree of emotional activation (physiological
and psychological), and finally dominance represents a change in the sensation of having
control over a situation. Although both approaches have their utility, the dimensional approach has
properties that make it more robust than the discrete one. While the discrete approach is
limited to only the emotions defined by the chosen theory (i.e., it is useful when we want to study
particular emotions), the VAD measures used in the dimensional approach are independent from
any cultural or linguistic interpretation.
Taking the aforementioned aspects into account, an interesting research problem relates to
the development of systems that include sentiment evaluation as well as emotional evaluation.
1.2 Contributions
This thesis makes the following contributions:
• Development of methods for addressing a sentiment analysis task that aims to predict
the sentiment/polarity of a given text. With this in mind, and in order to obtain the
best possible results in this task, different approaches were tested, ranging from simple
and common models like Naive Bayes and Support Vector Machine classifiers, leveraging
bag-of-words representations, to more complex models that are nowadays considered the
state of the art, namely deep neural networks. The considered deep neural network models are
divided into two categories: Recurrent Neural Networks (RNNs) and Convolutional Neural
Networks (CNNs). In addition, within each of these general categories, the number
of different architectures and possible configurations is very large. Furthermore, different
input representations can be used, namely the bag of words or representations based on
word/phrase embeddings. Datasets from different contexts and different topics were used
to support an extensive set of comparative experiments. Specifically, the Sentence Polarity
dataset v1.0 (Pang and Lee, 2005) jointly with the Stanford Sentiment TreeBank (Socher
et al., 2013), containing data from movie reviews, and the Tweet 2016 dataset
(http://alt.qcri.org/semeval2016/task4/), with data from Twitter posts, were used in the
experimental evaluation, and the results showed that
in most cases the models used are close to the results presented in the literature.
• Development of methods for addressing a dimensional sentiment analysis task that aims to
predict three emotion dimensions, namely valence, arousal, and dominance. The valence
value indicates the pleasantness of the stimulus, the arousal value the intensity of the
emotion caused by the stimulus, and the dominance value indicates the degree of control
implied by the stimulus. In order to predict these values, regression models using deep
neural networks were applied. The training dataset, in this case, is an expanded version of
an existing dataset of word ratings (Warriner et al., 2013) that, instead of containing only
the VAD values for single English words, also has these values for phrases. This expanded
dataset was created in the context of this MSc thesis and extracts information from the
Paraphrase Database (Pavlick et al., 2015). The test datasets are the Affective Norms for
English Text (ANET) (Bradley and Lang, 2007), the EmoTales (Francisco et al., 2012),
and the Facebook Messages (Preoctiuc-Pietro et al., 2016) datasets. The experimental results
showed that the best results were obtained on the ANET dataset and, although the other
datasets yielded worse results, they sometimes surpass the performance of other existing approaches.
• The final contribution consists in combining the aforementioned two types of information.
The idea is to use information about the emotion dimensions to help the sentiment analysis
models to improve their prediction performance. Experiments with deep neural networks
using two different types of word representations, namely pre-trained word embeddings on the one
hand, and word embeddings extended with the emotion dimension values on the other, showed that
adding information about the emotion dimensions, in some cases, improves the prediction
performance in sentiment analysis models.
1.3 Structure of the Document
The remainder of this document is organized as follows. Chapter 2 presents fundamental
concepts required to understand the topics discussed in this thesis, namely text representation
approaches and common algorithms to solve text classification problems. This chapter also
describes previous related work, grouped into three categories: lexicon-based approaches,
corpus-based approaches, and combined approaches. Chapter 3 describes the different models used in
each particular task, specifically the sentiment analysis and the dimensional sentiment analysis
tasks. Particular emphasis is given to the discussion of deep learning model architectures.
Chapter 4 describes the evaluation methodology, the datasets, the word embeddings used in
the context of the deep learning models, the evaluation metrics, and the experimental results.
Finally, Chapter 5 concludes this document by summarizing its main points, and presenting
directions for future work.
2 Fundamental Concepts and Related Work
This chapter details the basic concepts needed to understand the topics discussed throughout
the document, and also some sentiment analysis approaches studied in previous work.
2.1 Fundamental Concepts
This section presents the main concepts required to understand the topics discussed in
this document. Section 2.1.1 describes the methods that are traditionally used to represent
phrases/documents. Section 2.1.2 introduces some of the most common algorithms used in text
classification problems.
2.1.1 Text Representation
In the context of text classification, documents are typically represented through sets of
smaller components, like words, n-grams of words (i.e., sequences of n contiguous words in a
text) or n-grams of characters (i.e., sequences of n contiguous characters in a text).
Besides sets, another common representation for documents is the vector space model ap-
proach. The vector space model approach is widely used in information filtering and information
retrieval to compute a continuous degree of similarity between documents. Each document d
is represented as a feature vector d = 〈w1,d, w2,d, ..., wn,d〉, where n is the number of
features, and where each wf,d corresponds to a weight that reflects the importance of the feature f
for describing the contents of document d. The different features can, for instance, correspond
to words, n-grams of words, or n-grams of characters.
In the vector space model approach, the weight of each feature can be computed in several
ways. The methodology used to compute the weights is usually known as term weighting
scheme. One of these schemes involves using binary weights, where wf,d is zero or one, depending
on whether or not the feature f is present in the document d.
Another popular term weighting scheme is called TF-IDF. The motivation for its existence
is that there are words occurring in each document that are also very frequent in many other
documents and, thus, they should not contribute to the comparison process with the same weight
of words that are more specific to some domains. The TF-IDF weighting scheme combines the
individual frequency of each feature f in the document d (i.e., a component represented as
TFf,d) with the inverse frequency of the feature f in the entire collection of documents (i.e.,
IDFf). There are different ways to compute the term frequency. However, the most common
is simply counting the number of occurrences of the feature f within document d without any
further normalization. The inverse document frequency (IDF) is a measure of feature importance
within the collection of documents. A feature that appears in most of the documents of a given
collection is not important to discriminate between the different documents. Taking this into
account, the IDF is based on the inverse of the number of documents in which a feature occurs, and is
computed as follows:
IDFf = log(N / df)   (2.1)
In Equation 2.1, N corresponds to the number of documents in the collection and df corresponds
to the number of documents containing the feature f.
The TF-IDF weight of a feature f for a document d is defined as follows:
TF-IDFf,d = TFf,d × IDFf (2.2)
As described previously, the vector space model allows us to evaluate the degree of similarity
between two documents d1 and d2, as the correlation between their vector representations V (d1)
and V (d2). This correlation can be computed using, for example, the cosine similarity metric:
sim(d1, d2) = V(d1) · V(d2) / (||V(d1)|| × ||V(d2)||)   (2.3)
In Equation 2.3, the numerator is the inner product of the vectors V(d1) and V(d2) and the
denominator is the product of their Euclidean lengths. If we also represent a query as a vector,
it is possible to compute the similarity between the documents and the query, this way extracting
and ranking the most relevant documents.
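As an illustration, the TF-IDF weighting and cosine similarity just described can be sketched in a few lines of Python (a minimal example over a toy collection of three documents; whitespace tokenization and raw counts for the TF component are simplifying assumptions, not choices made in this thesis):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weight vectors for a small document collection.

    TF is the raw count of a feature in a document; IDF follows
    Equation 2.1, IDF_f = log(N / df_f)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = {f: sum(1 for toks in tokenized if f in toks) for f in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[f] * math.log(n / df[f]) for f in vocab])
    return vocab, vectors

def cosine(v1, v2):
    """Cosine similarity between two weight vectors (Equation 2.3)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

docs = ["the movie was great", "the movie was terrible", "a great great film"]
vocab, vecs = tfidf_vectors(docs)
# Documents sharing the discriminative feature "great" come out more similar.
print(cosine(vecs[0], vecs[2]) > cosine(vecs[1], vecs[2]))  # True
```

Note how features occurring in most documents (here, "the", "movie", "was") receive low IDF weights and therefore contribute little to the similarity, which is exactly the motivation given above for the TF-IDF scheme.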
2.1.2 Text Classification
Text classification is nowadays typically addressed through supervised machine learning
methods. Training data, i.e. vectors x labeled by humans with a class y, are used to learn a
function d(x), known as a classifier, that aims to automatically predict the class to which a new
instance of test data belongs. Over the years, many different learning methods have been
introduced to address the task of finding the function d(x), such as nearest neighbour classifiers
(Mao and Lebanon, 2006), linear classifiers (Mullen and Collier, 2004), or tree-based models
(Augustyniak et al., 2014).
In sentiment analysis, the most commonly used methods are linear classifiers. These methods
define the function d(x) in terms of a linear combination of the individual dimensions from the
predictor variables (i.e., features). There are two broad classes of methods for determining the
parameters of linear classifiers: the generative approach and the discriminative approach.
The generative approach learns the joint probability distribution P(X,Y ). The Naive Bayes
(NB) classifier is a probabilistic model based on Bayes' theorem. Specifically, the NB classifier is
a classification algorithm that assumes the features X1...Xn as being conditionally independent
of one another, given Y . This assumption dramatically simplifies the representation of P(X|Y )
and the problem of estimating it from the training data. When X contains n attributes which
are conditionally independent of one another given Y , we have:
P(X1...Xn|Y) = ∏i=1..n P(Xi|Y)   (2.4)
Considering that Y is any discrete-valued variable and the features X1...Xn are any discrete
or real-valued variables, our goal is to train a classifier that will output the probability distribu-
tion over possible values of Y , for each new instance X that we ask it to classify. To compute
the probability that Y will take on each of its possible values yk, we use the following equation:
P(Y = yk|X1...Xn) = P(Y = yk) ∏i P(Xi|Y = yk) / ∑j P(Y = yj) ∏i P(Xi|Y = yj)   (2.5)
Given a new instance xnew = 〈X1...Xn〉, Equation 2.5 shows how to calculate the probability
that Y will take on any given value, given the observed attribute values of xnew and the
distributions P(Y ) and P(Xi|Y ) estimated from the training data. In order to choose the most
probable value of Y , we use the following classification rule:
9
Y ← arg maxyk P(Y = yk) ∏i P(Xi|Y = yk)   (2.6)
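The estimation and classification steps above can be sketched as follows (a toy example assuming a tiny labeled collection of review snippets with words as the features Xi; add-one smoothing, which Equation 2.6 does not mention, is an added assumption used to avoid zero probabilities):

```python
import math
from collections import Counter, defaultdict

# Toy training set: (document, class) pairs, with 1 = positive, 0 = negative.
train = [("good great fun", 1), ("great acting good plot", 1),
         ("bad boring plot", 0), ("terrible bad acting", 0)]

# Estimate P(Y) and P(Xi|Y) from counts, with add-one (Laplace) smoothing.
class_counts = Counter(y for _, y in train)
word_counts = defaultdict(Counter)
for doc, y in train:
    word_counts[y].update(doc.split())
vocab = {w for counts in word_counts.values() for w in counts}

def predict(doc):
    """Classification rule of Equation 2.6, in log space to avoid underflow."""
    best_y, best_score = None, -math.inf
    for y in class_counts:
        score = math.log(class_counts[y] / len(train))   # log P(Y = y)
        total = sum(word_counts[y].values())
        for w in doc.split():                            # + sum of log P(Xi|Y = y)
            score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_y, best_score = y, score
    return best_y

print(predict("good plot"), predict("boring terrible plot"))  # 1 0
```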
The discriminative approach learns the conditional probability distribution P(Y|X). Logistic
regression is an approach to learning functions of the form f : X → Y, or P(Y|X), in the case
where Y is discrete-valued and X = 〈X1...Xn〉 is any vector containing discrete or continuous
variables.
This approach assumes a parametric form for the distribution P(Y|X), and then directly
estimates its parameters from the training data. The parametric model assumed by logistic
regression, in the case where Y can take on any of the discrete values {y1, ..., yk}, is the following:

P(Y = yk|X) = 1 / (1 + ∑j=1..k−1 exp(wj0 + ∑i=1..n wji Xi))   (2.7)
In the formula, wji denotes the weight associated with the j-th class Y = yj and with the input
Xi. To classify any given instance X, we generally want to assign the label yk that maximizes
P(Y = yk|X).
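A minimal sketch of the two-class case, where Equation 2.7 reduces to P(Y = 1|X) = 1 / (1 + exp(−(w0 + ∑i wi Xi))), trained by gradient ascent on the conditional log-likelihood, could look as follows (the toy features, learning rate, and epoch count are illustrative assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(data, lr=0.5, epochs=200):
    """Gradient ascent on the log-likelihood of a binary logistic regression."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)           # w[0] is the bias term w0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            grad = y - p          # gradient of the log-likelihood wrt the logit
            w[0] += lr * grad
            for i in range(n):
                w[i + 1] += lr * grad * x[i]
    return w

# Tiny separable set: feature vector = (#positive words, #negative words).
data = [((2, 0), 1), ((3, 1), 1), ((0, 2), 0), ((1, 3), 0)]
w = train_logreg(data)
p = sigmoid(w[0] + w[1] * 2 + w[2] * 0)
print(p > 0.5)  # a clearly positive document gets P(Y = 1|X) > 0.5: True
```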
Support Vector Machines (SVMs) are discriminative machine learning methods designed to
solve binary classification problems. The main principle behind SVMs consists in minimizing
the empirical classification error while finding an optimal classification hyperplane with a large
margin. Specifically, the idea is not only to make a correct prediction, but also to make a confident
prediction (Joachims, 2002).
Let D be a set of n points of the form (X1, y1), ..., (Xn, yn), where each yi is either 1 or
−1 and indicates the class to which the point Xi (a p-dimensional real vector, representing the
instance) belongs. The goal is to find the maximum-margin hyperplane that divides the group
of points Xi for which yi = 1 from the group of points for which yi = −1. The hyperplane can
be written as the set of points X satisfying w · X − b = 0, where w is the normal vector to the
hyperplane.
When the training data are linearly separable, it is possible to select two parallel hyperplanes
that separate the two classes of the data, such that the distance between them is as large as
possible. The region between these two hyperplanes is called the margin, and the maximum-margin
hyperplane is the hyperplane that lies halfway between them. These two hyperplanes can be
represented by the following equations:
10
w · X − b = 1   and   w · X − b = −1   (2.8)
The geometric distance between these two hyperplanes is 2/||w||. To maximize the distance
between the hyperplanes, we have to minimize ||w||. However, we also have to prevent points from
falling into the margin. Putting this together, we get the following optimization problem:

Minimize ||w|| subject to yi(w · Xi − b) ≥ 1, for i = 1, ..., n   (2.9)
When the training data are not linearly separable, we introduce the hinge loss function:

max(0, 1 − yi(w · Xi − b))   (2.10)
Equation 2.10 returns zero if the constraint yi(w · Xi − b) ≥ 1 is satisfied for all 1 ≤ i ≤ n,
i.e., if Xi lies on the correct side of the margin. For data on the wrong side of the
margin, the function returns a value proportional to the distance from the margin. We therefore
wish to minimize:
[ (1/n) ∑i=1..n max(0, 1 − yi(w · Xi − b)) ] + λ||w||²   (2.11)
In the equation, the parameter λ determines the trade-off between increasing the margin
size and ensuring that each Xi lies on the correct side of the margin. In order to solve this
optimization problem, we can use, for instance, the sub-gradient descent approach.
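The sub-gradient descent approach just mentioned can be sketched as follows, directly minimizing the soft-margin objective of Equation 2.11 (a toy implementation; the data, learning rate, and regularization constant are illustrative assumptions, and in practice one would use an off-the-shelf SVM solver):

```python
def train_svm(data, lam=0.01, lr=0.1, epochs=500):
    """Per-sample sub-gradient descent on:
       (1/n) * sum_i max(0, 1 - y_i (w·x_i - b)) + lam * ||w||^2  (Eq. 2.11)."""
    n_feat = len(data[0][0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) - b)
            if margin < 1:   # point inside the margin: hinge loss is active
                for i in range(n_feat):
                    w[i] += lr * (y * x[i] - 2 * lam * w[i])
                b -= lr * y
            else:            # only the regularizer contributes a sub-gradient
                for i in range(n_feat):
                    w[i] -= lr * 2 * lam * w[i]
    return w, b

# Tiny separable set with labels in {1, -1}, as in the SVM formulation above.
data = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((0.0, 2.0), -1), ((1.0, 3.0), -1)]
w, b = train_svm(data)

def predict(x):
    """Sign of w·x - b, i.e., which side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else -1

print(predict((3.0, 0.0)), predict((0.0, 3.0)))  # 1 -1
```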
Furthermore, a new branch of machine learning that also allows one to address text
classification is called deep learning. The property that makes deep learning distinctive is
that it studies deep neural networks as classification models, i.e., neural networks with many
layers that are typically trained end-to-end. The use of multiple layers allows models to
progressively warp the data into a form where it is easy to solve the specific classification task.
In these models, each layer is a function acting on the output of the previous layer, so we can
say that the network is a chain of composed functions, and also that this chain is optimized to
perform the specific task. As described before, neural networks transform the data through the
existing layers, making their task easier to address. These transformed versions of the data are
called representations.
A representation in deep learning is a way to embed the data in k dimensions. Following this
logic, two functions can only be composed together if their types/representations agree, and the
choice of representation is made over the course of training (adjacent layers negotiate the
representation they will use to communicate). Meeting these requirements is necessary to obtain
good performance from a particular network.
One such type of representation is typically called word embeddings. This kind of representation
is built or used to solve natural language processing tasks, where the input to the network is
typically one or more words. A word can initially be represented as a unit vector in a very
high-dimensional space, with each dimension corresponding to a word in the vocabulary. The
network then warps and compresses this space, mapping words into a lower-dimensional one. This
new representation for words has some very useful properties. One is that words with similar
meanings tend to be close in the resulting space; for instance, the words good and great will
correspond to vectors close to each other. Another, no less important, property is that difference
vectors between words seem to encode analogies: for example, the difference between the woman
and man vectors is approximately the same as the difference between the queen and king vectors.
Pre-trained word embeddings are nowadays typically used when addressing Natural Language
Processing (NLP) tasks with deep neural networks.
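The two properties above can be illustrated with cosine similarity over a handful of hand-picked toy vectors; the 3-dimensional values below are illustrative assumptions (real embeddings have hundreds of dimensions and are learned from data).

```python
import numpy as np

# Toy 3-dimensional "embeddings"; the numbers are illustrative, not trained.
emb = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "bad":   np.array([-0.9, 0.1, 0.0]),
    "king":  np.array([0.5, 0.9, 0.1]),
    "queen": np.array([0.5, 0.9, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words with similar meaning end up close together in the space.
assert cosine(emb["good"], emb["great"]) > cosine(emb["good"], emb["bad"])

# Difference vectors approximate analogies: king - man + woman ≈ queen.
analogy = emb["king"] - emb["man"] + emb["woman"]
nearest = max((w for w in emb if w not in {"king", "man", "woman"}),
              key=lambda w: cosine(emb[w], analogy))
```

Excluding the query words when searching for the nearest neighbor follows the usual convention in analogy evaluation.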
Furthermore, another key point in modern neural networks is the fact that many copies of one
neuron can be used in the same network. However, writing the same code multiple times not only
increases the risk of introducing bugs, but also makes it more difficult to catch mistakes. Just as
the abstraction of functions is essential in programming, we can use one function instead of
multiple copies of a neuron. This technique is traditionally called weight tying and is fundamental
to the good results that deep learning has achieved in many different tasks.
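The idea can be made concrete with a minimal recurrent step: one function, holding one set of weights, is reused at every position of the sequence. The dimensions and random initialization below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4)) * 0.1   # hidden-to-hidden weights
W_x = rng.normal(size=(4, 3)) * 0.1   # input-to-hidden weights

def rnn_step(h, x):
    """One recurrent step. The SAME W_h and W_x are reused at every
    position in the sequence; this reuse is what weight tying means."""
    return np.tanh(W_h @ h + W_x @ x)

sequence = [rng.normal(size=3) for _ in range(5)]
h = np.zeros(4)
for x in sequence:        # the same function (same weights) at each step
    h = rnn_step(h, x)
```

Whatever the sequence length, the number of parameters stays fixed, which is exactly the benefit that weight tying provides.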
The following list describes a set of widely used neural network patterns, such as recurrent
layers and convolutional layers. These patterns are nothing more than functions which take
functions as arguments, i.e. higher-order functions. Some of the most common patterns are:
• General Recurrent Neural Network - used to make predictions involving sequences (e.g.,
sequences of words). Many different types of recurrent neural networks have been proposed,
including long short-term memory (LSTM) networks and gated recurrent units (GRUs).
• Encoding Recurrent Neural Network - used to allow a neural network to take a variable-length
list as input, for instance taking a sentence as input.
• Generating Recurrent Neural Network - used to allow a neural network to produce a list
of outputs, such as words in a sentence.
• Bidirectional Recurrent Neural Network - used to make predictions over a sequence, taking
into account both past and future contexts.
• Convolutional Neural Network - used to look at neighboring elements, applying a function
to a small window around every element.
• Recursive Neural Network - used for natural language processing, allowing neural networks
to operate on parse trees.
In order to build more complex and larger networks, these patterns (i.e., these building blocks)
can also be combined. Some of the aforementioned blocks will be detailed later in this
dissertation.
2.2 Related Work
Existing sentiment analysis approaches can be divided into two main categories based on the
source of information they use: the lexicon-based approach and the corpus-based approach.
The lexicon-based approach essentially calculates the orientation (i.e., positive or negative
sentiment) of a text by aggregating the semantic orientations of its words. On the other hand,
the corpus-based approach uses supervised learning algorithms to train a sentiment classifier
from training data. Both categories have their advantages and disadvantages, and in some cases
researchers combine the best of both, building hybrid models.
2.2.1 Lexicon-Based Approaches
A lexicon-based approach starts with a set of terms with known sentiment orientation. After
selecting the set of terms, an algorithm is used to estimate the sentiment of a text based upon
the occurrences of these words. Some of these approaches have been improved with additional
information, such as emoticon lists and negation word lists (e.g., not or don't); see for instance
the paper by Taboada et al. (2011).
The SentiStrength approach developed by Thelwall et al. (2010) is a lexicon-based classifier
that additionally uses non-lexical linguistic information and rules to detect the sentiment of a
text. In detail, SentiStrength uses as key elements the following resources:
• A word list with human polarity and strength judgements;
• A spelling correction algorithm, which identifies the standard spelling of words that have
been misspelled by the inclusion of repeated letters (e.g., the word awwwesome would
be identified as awesome by this algorithm);
• A booster word list, used to strengthen or weaken emotion words. For example, the
words very and extremely increase the emotion of nearby words;
• A negation word list, used to invert emotion words;
• An idiomatic expression list is used to detect the sentiment of a few common phrases;
• Repeated letters besides those needed to correct spelling are used to give a strength boost.
Thus, words that have many repeated letters will be quantified with more strength;
• An emoticon list is used to detect additional sentiment;
• Sentences with exclamation marks have a minimum positive strength of 2. Having at least
one exclamation mark gives a strength boost of 1 to the immediately preceding emotion
word or phrase;
• Negative emotions are ignored in questions.
For each text, after applying the above key elements, the SentiStrength algorithm outputs
two integers. The first represents the positive sentiment strength from 1 to 5, where 1 means no
sentiment and 5 strong sentiment, and the second represents the negative sentiment strength
in the same way. These two scores can also be combined: for instance, a text with a combined
score of (2, 5) would contain weak positive strength and strong negative strength. However, as
this approach was initially only tested on the short informal friendship messages of the social
networking service MySpace, Thelwall et al. (2012) developed a new version called SentiStrength2
that handles a wider variety of types of text.
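A few of the rules listed above can be sketched in a minimal scorer. The word strengths, booster list, and repeated-letter boost below are illustrative simplifications of SentiStrength's behavior, not its actual word lists.

```python
import re

# Illustrative strength values on the -5..5 scale described in the text.
STRENGTH = {"love": 3, "awesome": 4, "hate": -3, "awful": -4}
BOOSTERS = {"very": 1, "extremely": 2}

def senti_scores(text):
    """Return (positive, negative) strengths, each on the 1..5 scale."""
    pos, neg = 1, 1
    boost = 0
    for word in text.lower().split():
        if word in BOOSTERS:
            boost = BOOSTERS[word]
            continue
        # Repeated letters (e.g. 'awwwesome') give a strength boost of 1.
        collapsed = re.sub(r"(.)\1{2,}", r"\1", word)
        letter_boost = 1 if collapsed != word else 0
        s = STRENGTH.get(collapsed, 0)
        if s > 0:
            pos = max(pos, min(5, s + boost + letter_boost))
        elif s < 0:
            neg = max(neg, min(5, -s + boost + letter_boost))
        boost = 0
    return pos, neg

assert senti_scores("I love it but the ending is awful") == (3, 4)
assert senti_scores("awwwesome") == (5, 1)
```

The real system additionally applies negation, idiom, emoticon, and punctuation rules over human-judged word lists.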
To address the weak performance of SentiStrength in negative sentiment strength detection,
some changes were also introduced. The main change was to extend the sentiment word list with
negative General Inquirer (GI) terms (Stone et al., 1966). Second, the sentiment word terms
were tested against a dictionary to check for incorrectly matching words and to derive words
that did not match. Third, negating negative terms now makes them neutral. Fourth, the list of
idiomatic expression terms was extended; for instance, is like has strength 1 because like is a
comparator after is. Finally, the special rule for negative sentiment in questions was removed.
The experimental evaluation showed that SentiStrength2 performed significantly above several
baselines on different datasets.
Another lexicon-based approach to extract sentiment from text is called the Semantic
Orientation CALculator (SO-CAL). The SO-CAL approach rests on two principles: first, that
words have a semantic orientation that is independent of context, and second, that this semantic
orientation can be expressed as a numerical value. In previous versions of SO-CAL (Taboada
and Grieve, 2004; Taboada et al., 2006), the classification task was based only on an adjective
dictionary. However, the current version (Taboada et al., 2011) is composed of different
dictionaries, including adjectives, verbs, nouns, and adverbs. In addition to these dictionaries,
the SO-CAL approach also incorporates valence shifters such as intensifiers, downtoners, negation
and irrealis markers (i.e., words that can change the meaning of sentiment words). In order
to build the main adjective dictionary, Taboada et al. (2011) manually tagged all adjectives
found in a development corpus, on a scale ranging from −5 for extremely negative to +5 for
extremely positive, where 0 indicates neutral words, which were not included in the dictionary.
The remaining dictionaries were built in a similar way, but each has its peculiarities.
The adverb dictionary was built automatically using the adjective dictionary, matching
adverbs ending in -ly to their potentially corresponding adjectives, except for some words
that were tagged or modified by hand. If SO-CAL encounters a word tagged as an adverb
that is not yet in the dictionary, the system stems the word and tries to match it to an
adjective in the main dictionary. The verb dictionary contains, in addition to simple
verbs, multi-word expressions such as fall apart. All nouns and verbs found in the text are
lemmatized, i.e. the inflected forms of a word are grouped so that they can be analyzed as a
single word.
Taboada et al. (2011) also incorporated some valence shifters into their method. Intensification
was modeled using modifiers, with each intensifying word having a percentage associated
with it. Words that increase the intensity of sentiment are called amplifiers, whereas words that
decrease it are called downtoners. For instance, excellent has an SO-value of 5, and thus most
excellent would have an SO-value of 5 × (100% + 100%) = 10. Besides adverbs and adjectives,
other intensifiers include all capital letters, the use of exclamation marks, and the use of the
discourse connective but to indicate more salient information.
In terms of negation, some words such as not, never, without or lack can occur at a
significant distance from the lexical item they affect. Taking this into account, the SO-CAL
approach includes two options for negation search. The first looks backwards until a clause
boundary marker (i.e., punctuation or a sentential connective) is reached. The second looks
backwards as long as the words found are in a backward search skip list. After finding the
words affected by negation, instead of changing the sign, the SO-value is shifted toward the
opposite polarity by a fixed amount.
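The intensifier percentages and the polarity-shift treatment of negation can be sketched together. The dictionary values, percentages, and the shift amount of 4 below are illustrative assumptions for the sketch, not SO-CAL's actual lists.

```python
# Illustrative SO-CAL-style scoring (all values are assumptions).
SO = {"excellent": 5, "good": 3, "sleazy": -3}
INTENSIFIERS = {"most": 1.00, "very": 0.25, "slightly": -0.50}
NEGATORS = {"not", "never", "without"}
SHIFT = 4  # negation shifts toward the opposite polarity by a fixed amount

def score_phrase(words):
    """Score the final sentiment word of a phrase, applying any preceding
    intensifier (percentage modifier) or negator (polarity shift)."""
    value = SO.get(words[-1], 0)
    for w in words[:-1]:
        if w in INTENSIFIERS:
            value *= 1.0 + INTENSIFIERS[w]
        if w in NEGATORS:
            value += -SHIFT if value > 0 else SHIFT
    return value

assert score_phrase(["most", "excellent"]) == 10   # 5 * (100% + 100%)
assert score_phrase(["not", "good"]) == -1         # 3 shifted by 4, not sign-flipped
```

Note how the shift makes not good mildly negative (−1) rather than the strongly negative −3 that flipping the sign would produce, which matches the motivation given above.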
The irrealis blocking filter was created by taking into account that a number of markers
indicate that the words in a sentence might not be reliable for sentiment analysis. Irrealis
markers are words that can change the meaning of sentiment words in very subtle ways. The
implemented solution consists of ignoring the semantic orientation of any word in the scope
of an irrealis marker, within the same clause. The dictionary of irrealis markers is composed
of modals, conditional markers, negative polarity items such as any and anything, certain verbs,
questions, and words enclosed in quotes.
The final polarity decision for a given text is determined by the average sentiment strengths
(SO values) of the words detected, after modifications.
The experimental evaluation concluded that the new version improves on the performance of
the previous version of SO-CAL. However, the main conclusion is that lexicon-based methods
for sentiment analysis can be robust, resulting in good cross-domain performance, and that they
can easily be improved with multiple sources of knowledge.
2.2.2 Corpus-Based Approaches
A corpus-based approach, as previously described, involves building classifiers from training
data. The training data consist of a set of training examples, each composed of an input
object and a desired output value. In sentiment analysis there are many approaches that use
this concept, and I will separate them into two categories: (i) combining classifiers for sentiment
analysis, and (ii) neural networks for sentiment analysis.
Combining Classifiers for Sentiment Analysis
One of the challenges in sentiment analysis is how to represent variable-length documents,
given that simple bag-of-words (BoW) approaches lose word order information. One possibility
involves the use of advanced machine learning techniques such as recurrent neural networks
(Mikolov et al., 2010; Socher et al., 2011). However, it is not clear whether this method
results in improvements over simple bag-of-words and bag-of-ngram techniques.
Mesnil et al. (2014) compared several different approaches and concluded that model
combination performs better than any individual technique. This is due to the fact that
ensembles benefit most from models that are complementary, i.e. each model is put to better
use. Following this, and since the majority of models proposed in the literature are
discriminative, the authors proposed to combine generative and discriminative models together,
to improve the performance of the ensemble in sentiment prediction.
In terms of the generative model, Mesnil et al. (2014) implemented an n-gram language
model using the SRILM toolkit, relying on modified Kneser-Ney smoothing, although this
model suffers from large memory requirements. To address this issue, the authors implemented
a recurrent neural network (Mikolov et al., 2010), which significantly outperforms an n-gram
language model. In both cases, the Bayes rule was used to compute the probability of a test
sample belonging to the positive or negative class.
For the discriminative model, the authors implemented a supervised re-weighting of the
counts, as in the Naive Bayes Support Vector Machine (NB-SVM) approach (Wang and
Manning, 2012). Specifically, this approach computes a log-ratio vector between the average
word counts extracted from positive and negative documents, and the input to the logistic
regression classifier (which can be replaced by a linear SVM) corresponds to the log-ratio
vector multiplied by the binary pattern for each word in the document vector. Mesnil et al.
(2014) slightly improved the performance of this approach by adding tri-grams. In the ensemble
model, the log probability scores of the previously described models are combined via linear
interpolation. Finally, the evaluation demonstrated better results when the models are combined,
with each model contributing to the success of the overall system.
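The log-ratio re-weighting step can be sketched as follows; the toy corpus and the Laplace smoothing constant of 1 are illustrative assumptions, and a linear classifier would then be trained on the resulting features.

```python
import numpy as np

# Toy corpus (an assumption for the sketch); 1 = positive, 0 = negative.
docs = [("good great fun", 1), ("great movie", 1),
        ("bad boring", 0), ("bad awful movie", 0)]
vocab = sorted({w for text, _ in docs for w in text.split()})
index = {w: i for i, w in enumerate(vocab)}

def binary_vector(text):
    v = np.zeros(len(vocab))
    for w in text.split():
        v[index[w]] = 1.0
    return v

X = np.array([binary_vector(t) for t, _ in docs])
y = np.array([label for _, label in docs])

# Smoothed count vectors for each class, then the element-wise log-ratio r.
p = 1.0 + X[y == 1].sum(axis=0)
q = 1.0 + X[y == 0].sum(axis=0)
r = np.log((p / p.sum()) / (q / q.sum()))

# NB-SVM-style input: the binary indicator vector scaled by the log-ratio.
features = X * r
```

Words characteristic of positive documents receive positive weights and vice versa, so the classifier starts from Naive-Bayes-informed features rather than raw counts.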
Another challenge is how to use unlabeled data to improve sentiment classification performance.
Recent approaches try to reduce the large dependency on labeled data by introducing
the concept of semi-supervised learning (Gao et al., 2014; Zhu and Ghahramani, 2002). In
semi-supervised learning, the classifier receives both labeled and unlabeled data as input.
However, each semi-supervised approach has its own pros and cons, making it difficult to choose
the best one for a specific domain.
To address this challenge, Li et al. (2015) introduced a new principle that combines two
or more semi-supervised algorithms instead of choosing only one. Specifically, Li et al. (2015)
combined the probability outputs of two distinct algorithms. The first algorithm is
self-trainingFS, proposed by Gao et al. (2014), and the second is label propagation, a
graph-based semi-supervised learning approach proposed by Zhu and Ghahramani (2002). The main
idea was to apply meta-learning, i.e. to re-predict the labels of the unlabeled data given the
outputs from the member algorithms. The meaning of meta, in this context, is that the learning
samples (xmeta) are not represented by a bag-of-words, but instead by the posterior probabilities
of the unlabeled samples (xk) belonging to the positive (pos) and negative (neg) classes, according
to the member algorithms. The feature representation is made as follows:
x_{meta} = \langle p_1(pos|x_k), \, p_1(neg|x_k), \, p_2(pos|x_k), \, p_2(neg|x_k) \rangle \qquad (2.12)
In the equation, p1(pos|xk) and p1(neg|xk) are the posterior probabilities from the first
semi-supervised method, and p2(pos|xk) and p2(neg|xk) are the posterior probabilities from
the second. The probability results and the real labels are then used as meta-learning samples
to train the meta-classifier (i.e., a maximum entropy classifier). An experimental evaluation in
four domains demonstrated that this approach outperforms both member algorithms.
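The construction of the meta-features in Equation 2.12 can be sketched as follows; the posterior values are illustrative assumptions, and a simple averaging rule stands in for the trained maximum entropy meta-classifier.

```python
# Meta-learning sketch: the meta-features for each sample are the posterior
# probabilities produced by two member algorithms (Equation 2.12).

def meta_features(p1_pos, p2_pos):
    """Build <p1(pos|x), p1(neg|x), p2(pos|x), p2(neg|x)> for one sample."""
    return [p1_pos, 1.0 - p1_pos, p2_pos, 1.0 - p2_pos]

# Posteriors from the two member algorithms for three unlabeled samples
# (illustrative values, not outputs of the actual algorithms).
samples = [(0.9, 0.8), (0.2, 0.4), (0.6, 0.3)]
X_meta = [meta_features(p1, p2) for p1, p2 in samples]

# A trivial stand-in for the meta-classifier: average the two positive
# posteriors and threshold (a real system trains, e.g., MaxEnt here).
labels = ["pos" if (x[0] + x[2]) / 2 > 0.5 else "neg" for x in X_meta]
```

In the actual approach, the meta-classifier is trained on these four-dimensional vectors paired with the true labels, so it can learn when to trust each member algorithm.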
Neural Networks for Sentiment Analysis
One sentiment classification approach that has recently emerged is based on Convolutional
Neural Networks (CNNs). A CNN architecture can be, for instance, divided into four layers:
the first layer focuses on the representation of the sentences; the second is a convolutional
layer with multiple filter widths and feature maps; the third performs max-over-time pooling;
the last is a fully connected layer with dropout and a softmax output.
Specifically, the first layer is responsible for representing the sentence through word vectors.
The second layer is where the convolution operations occur. A convolution operation applies a
filter to each possible window of words in the sentence, producing a feature map. The penultimate
layer executes a max-over-time pooling operation over the feature maps previously computed by
each filter, taking the maximum value, which corresponds to the most important feature of each
map. The last layer receives the features of the penultimate layer and passes them to a fully
connected softmax layer whose output is the probability distribution over labels.
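The convolution and max-over-time pooling steps can be sketched directly on word-vector matrices; the sentence length, embedding size, filter width, and random values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sentence of 6 words, each a 5-dimensional word vector (illustrative).
sentence = rng.normal(size=(6, 5))

def conv_feature_map(sentence, filt):
    """Slide a filter over every window of h consecutive words, producing
    one feature per window (the feature map)."""
    h = filt.shape[0]
    n = sentence.shape[0]
    return np.array([np.tanh(np.sum(sentence[i:i + h] * filt))
                     for i in range(n - h + 1)])

# Three filters of width 3; max-over-time pooling keeps one value per filter,
# i.e. the strongest feature each filter detected anywhere in the sentence.
filters = [rng.normal(size=(3, 5)) for _ in range(3)]
maps = [conv_feature_map(sentence, f) for f in filters]
pooled = np.array([m.max() for m in maps])   # fixed-size vector, one per filter
```

Note that the pooled vector has a fixed size regardless of sentence length, which is how the architecture handles variable-length input before the final softmax layer.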
In natural language processing, CNNs have been shown to be effective, reaching good results
in semantic parsing, sentence modeling, and even search query retrieval. CNNs have also been
extensively used for sentiment analysis (Kim, 2014; Mou et al., 2015; Kalchbrenner et al., 2014).
Kim (2014) introduced a principle that is based on training a simple CNN with one layer
of convolution on top of word vectors obtained from an unsupervised neural language model.
In terms of the word vectors, the author used the publicly available word2vec vectors that were
trained on 100 billion words from Google News.
Initially, Kim (2014) keeps the word vectors static and learns only the other parameters of
the model, this way already obtaining excellent results. However, as learning task-specific vectors
through fine-tuning yields further improvements, the author describes a simple modification to
the architecture that allows the use of both pre-trained and task-specific vectors, by having
multiple channels. The different channels are initialized with word2vec, and each filter of the
CNN is applied to all of them. However, gradients are back-propagated only through one of the
channels, allowing the model to adjust one set of vectors while keeping the other static. The
evaluation of this approach shows that unsupervised pre-training of word vectors is an important
ingredient in deep learning for natural language processing.
In sentiment analysis, capturing the meaning of longer phrases has also received a lot of
attention. However, to be able to extract this information, one needs to address the lack of
large labeled compositionality resources. Socher et al. (2013) created the Stanford Sentiment
Treebank corpus, the first corpus that allows capturing compositional effects of sentiment
in language, providing fully labeled parse trees. The type of information contained in this
dataset enables the community to train and develop new compositional models. Exploiting this
new resource, Socher et al. (2013) proposed a model called the Recursive Neural Tensor Network
(RNTN), which aims to capture compositional effects with higher accuracy.
The RNTN addresses several issues with standard RNNs (Goller and Kuchler, 1996; Socher
et al., 2011) and with the previously proposed Matrix-Vector Recursive Neural Network (MV-RNN)
architecture (Socher et al., 2012). In the MV-RNN model, the parameters are associated with
words, and each composition function that computes vectors of longer phrases depends on the
actual words being combined. However, the number of parameters can become very large, as it
depends on the size of the vocabulary. Considering this problem, Socher et al. (2013) suggested
that it would be more plausible to have a single composition function with a fixed number of
parameters. Briefly, the main idea of the RNTN is to use the same tensor-based composition
function for all nodes of the compositionality tree.
Experiments showed that the RNTN model improves sentence-level sentiment detection,
achieving better results than the MV-RNN. Another relevant aspect is that this new model
captures negation of different sentiments.
Figure 2.1: Tree-Based Convolutional Neural Network (Mou et al., 2015).
Another alternative to capture sentence meaning was proposed by Mou et al. (2015) and
is called the Tree-Based Convolutional Neural Network, based on CNNs and RNNs.
CNNs can extract features over neighboring words effectively, with short propagation paths,
but they do not capture inherent sentence structures, for instance parse trees. RNNs can encode
structural information through recursive semantic composition along a parse tree, but they have
difficulty learning deep dependencies because of long propagation paths. In order to exploit both
kinds of information, Mou et al. (2015) proposed a novel neural architecture (Figure 2.1) that
combines the advantages of CNNs and RNNs, called the Tree-Based Convolutional Neural
Network (TBCNN).
Initially, in TBCNNs, sentences are converted to either constituency or dependency parse
trees, and each node in the tree is represented as a distributed real-valued vector. Afterwards, a
set of fixed-subtree feature detectors (i.e., a tree-based convolution window) is applied, sliding
over the entire tree of a sentence to extract structural information. This structural information
is then packaged into one or more fixed-size vectors by max pooling, i.e., the maximum value on
each dimension is taken. Finally, the model has a fully connected hidden layer and a softmax
output layer. One advantage of such an architecture is that all features, along the tree, have
short propagation paths to the output layer, and hence structural information can be learned
effectively. Since there are different approaches to representing sentence structures, two variants
are considered. The c-TBCNN strategy pretrains the constituency tree with an RNN, implying
that the vector representations of nodes are fixed. The other variant, called d-TBCNN, is based
on dependency representations. The nature of this representation leads to the major difference
of d-TBCNN from traditional convolutions, because nodes can have different numbers of child
nodes.
Figure 2.2: Gated Recurrent Neural Network Architecture (Tang et al., 2015a).
The experimental evaluation showed that both c-TBCNNs and d-TBCNNs perform well in
sentiment analysis, and also that TBCNNs can extract sentence structural information effectively,
which is very important for sentence modeling.
Tang et al. (2015a) also created a novel neural network approach to learn continuous document
representations for sentiment classification (Figure 2.2).
Their method has two main steps. The first uses convolutional neural networks and long
short-term memory networks (Kim, 2014; Kalchbrenner et al., 2014) to produce sentence
representations from word representations. The second step addresses how to adaptively encode
the semantics of sentences and their inherent relations in the document. The convolutional
neural network and long short-term memory models, beyond learning fixed-length vectors for
sentences of varying lengths, also capture word order within sentences. In the second step, Tang
et al. (2015a) developed a Gated Recurrent Neural Network to encode the semantics and relations
of sentences in the document. This model can be viewed as an LSTM whose output gate is always
on, since it is preferable not to discard any part of the semantics of sentences, in order to obtain
a better document representation.
After these two steps, the document representations can be considered as features in models
to classify the document. The experimental evaluation revealed that traditional recurrent neu-
ral networks have a weak performance in modeling document composition, while adding gates
dramatically boosts the performance.
Tang et al. (2015b) introduced a new model called the User Product Neural Network (UPNN)
in order to capture user- and product-level information. UPNN takes as input not just variable-size
documents but also the user who writes the review, as well as the product that is being
evaluated.
The architecture of the system is divided into three steps: modeling the semantics of
documents, modeling the semantics of users and products, and sentiment classification. The step
of modeling the semantics of documents has two stages. In the first, as documents consist of a
list of sentences, and sentences consist of a list of words, Tang et al. (2015b) began by modeling
the semantics of words. For this purpose, each word was represented as a low-dimensional,
continuous, real-valued vector (i.e., as an embedding). In the final stage, a Convolutional Neural
Network (Kim, 2014; Kalchbrenner et al., 2014) was used to model the semantic representation
of sentences. In the intermediate step of modeling the semantics of users and products, both
users and products are encoded in continuous vector spaces, allowing the model to capture
important global clues such as user preferences and product qualities. Finally, in the sentiment
classification step, instead of using hand-crafted features as input to a classifier, the authors used
the continuous representations of documents and the vector representations of users and products
as discriminative features. The experimental evaluation confirmed that including continuous user
and product representations significantly improves the accuracy of sentiment classification.
2.2.3 Combined Approaches
Previous studies (Kennedy and Inkpen, 2006; Andreevskaia and Bergler, 2008; Qiu et al.,
2009) showed that lexicon-based and corpus-based approaches have complementary perfor-
mances and therefore should be combined.
Yang et al. (2015) developed a new approach that combines these two views, named the
LCCT model (Lexicon-based and Corpus-based, Co-Training Model). Another idea behind the
LCCT model is the fact that social reviews, such as posts in forums and blogs, in contrast
with product and service reviews, do not have associated numerical ratings, making it difficult
to perform supervised learning. Since manual labeling is time consuming and expensive, it
is preferable to label a small portion of the social reviews and perform semi-supervised learning,
leveraging information from both labeled and unlabeled data.
In terms of the lexicon-based approach, Yang et al. (2015) presented a novel method called
semi-supervised sentiment-aware LDA (ssLDA) to build a domain-specific sentiment lexicon.
Building a domain-specific sentiment lexicon is particularly relevant in sentiment analysis
because a single word can carry different sentiment meanings in distinct domains. In the
domain-specific sentiment lexicon, each word is associated with a particular class (i.e., positive
or negative) through a specific weight. After creating the lexicon, each document is classified by
aggregating the weights of its words. If the accumulated weight is greater than zero, the
document is classified as positive; otherwise it is classified as negative.
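The decision rule just described amounts to summing signed word weights and thresholding at zero; the lexicon entries below are illustrative assumptions, not output of ssLDA.

```python
# Sketch of the lexicon-based decision rule: each word carries a signed
# weight, and the document is positive when the aggregate weight exceeds
# zero. The lexicon entries here are illustrative assumptions.
lexicon = {"great": 1.2, "love": 0.8, "dull": -1.0, "waste": -1.5}

def classify(document):
    total = sum(lexicon.get(w, 0.0) for w in document.split())
    return "positive" if total > 0 else "negative"

assert classify("great acting love it") == "positive"
assert classify("a dull waste of time") == "negative"
```

In the LCCT model these weights would be learned per domain by ssLDA, which is what makes the lexicon domain-specific.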
For the corpus-based approach, Yang et al. (2015) used Stacked Denoising Auto-Encoders
(SDA) to build the corpus-based sentiment classifier. The autoencoder concept was introduced
by Rumelhart et al. (1986) and its denoising variant was proposed by Vincent et al. (2010).
After the SDA parameters are trained on both labeled and unlabeled data, and the high-level
representation of each data instance is obtained, an SVM trained on the labeled data, using the
resulting representations, is employed as the sentiment classifier.
In order to combine the methods, both are initially trained with the partially available labels,
and then one of the two classifiers (i.e., the corpus-based classifier or the lexicon-based classifier)
is used to label unlabeled documents, adding these instances to the pool of labeled data. The
other classifier is then re-trained using the new labeled data produced by the first. This procedure
is performed iteratively and, after a sufficient number of iterations, the classifiers are combined
using a majority-voting scheme to predict the sentiment labels of the test data. The experimental
evaluation demonstrates that LCCT performs significantly better on a variety of datasets,
compared with other state-of-the-art sentiment analysis approaches.
Another combined approach was proposed by Augustyniak et al. (2014), with the aim of
improving the efficiency of lexicon-based methods by combining several of them through
ensemble classification. In this approach, the initial task is to select the lexicons. The
authors employed a variety of lexicons, starting from a very basic list consisting of 2-word lists
of strong sentiment words (i.e., good/bad), a lexicon called SM, or a lexicon with English verbs
conjugated in different tenses, called PF.
Augustyniak et al. (2014) also considered additional word lists (lexicons), which they called
WL, 5MF and 25MF. They assumed that the input review sets form a probability space whose
sample space consists of reviews represented as pairs (score, text). The text is represented as
the set of words occurring in the review and the score is the normalized [−1, 1] sentiment of
the review. To create these new lexicons, they used the following equation:
fqmt(w) = \sum_{s \in scores} s \times \frac{P(\text{review has score } s \mid \text{review has word } w)}{P(\text{review has score } s)} \qquad (2.13)
In the equation, scores is a countable subset of [−1, 1] and s ranges over the possible scores of reviews containing the word w.
The lexicon WL is a list with the 25 most positive (i.e., highest fqmt) and the 25 most
negative (i.e., lowest fqmt) words obtained by merging all corpora. In contrast, 5MF and 25MF
select respectively the 5 and 25 most positive and most negative words, separately for each
corpus.
After selecting and constructing all the lexicons, the bag-of-words model was used. To each
word in the review which occurs in a lexicon, the authors assign a numeric value (i.e., 1 if the
word is positive, −1 if the word is negative, and 0 if the word does not appear in the lexicon).
The sentiment of the review is positive if the difference between the number of positive and
negative words identified in the review is greater than zero; if the difference is smaller than zero,
the review is negative. From these results, a sentiment polarity matrix is constructed, where
the columns represent the reviews and the rows represent the several existing lexicons.
Specifically, each position of the matrix corresponds to the sentiment polarity value provided by
one lexicon for one review. With this first step completed, Augustyniak et al. (2014) trained a
strong classifier, such as the C4.5 decision tree method, over the previously described matrix,
and used the classifier to predict the sentiment of new reviews.
Experiments show that the accuracy obtained from the combination of these lexicons
outperforms other lexicon-based approaches.
Considering that the challenges of sentiment classification cannot be easily addressed by
simple text categorization approaches relying on n-gram or keyword identification, Mullen and
Collier (2004) introduced a new concept to classify natural language texts as positive or negative.
To do that, they applied Support Vector Machines using a variety of diverse information sources,
on the grounds that SVMs are an ideal tool for bringing these sources together.
The diverse information sources were provided by the following methods (used to mea-
sure the favorability content of phrases): semantic orientation with PMI (Turney, 2002); Osgood
semantic differentiation with WordNet; and topic proximity with syntactic-relation features.
In the first method, the authors relied on the semantic orientation of words or phrases.
The meaning of the term semantic orientation (SO) refers to a measure (i.e., a real number)
that captures the sentiment (i.e., positive or negative) expressed by a word or phrase. The
solution proposed by the authors allows for the modeling of not only singular words but also
multiple word phrases, named value phrases. In this particular case, the approach taken by
Turney (2002) is used to derive the SO values and also to extract the value phrases. The
phrases are designated as value phrases because they are the sources of the SO values. After
extracting the value phrases, the SO value of each one is determined based upon the pointwise
mutual information (PMI) with the words excellent and poor (Church and Hanks, 1989), in the
following way:
PMI(w1, w2) = log2( p(w1 & w2) / (p(w1) p(w2)) ) (2.14)
In the equation, p(w1 & w2) is the probability of w1 and w2 occurring simultaneously. Finally,
the SO value of each value phrase is the difference between its PMI with the word excellent
and its PMI with the word poor.
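A minimal sketch of this SO computation, using hypothetical co-occurrence counts (the corpus size and counts below are made up for illustration):

```python
import math

# Toy co-occurrence counts from a hypothetical corpus of N windows.
N = 1000
count = {"excellent": 40, "poor": 40, "superb view": 20}
joint = {("superb view", "excellent"): 8, ("superb view", "poor"): 1}

def pmi(w1, w2):
    """Eq. 2.14: log2( p(w1 & w2) / (p(w1) p(w2)) )."""
    p12 = joint.get((w1, w2), 0) / N
    return math.log2(p12 / ((count[w1] / N) * (count[w2] / N)))

def semantic_orientation(phrase):
    """SO = PMI(phrase, excellent) - PMI(phrase, poor)."""
    return pmi(phrase, "excellent") - pmi(phrase, "poor")

print(round(semantic_orientation("superb view"), 2))  # → 3.0 (positive phrase)
```

A positive SO indicates that the phrase co-occurs more often with excellent than with poor, and a negative SO the reverse.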
The second method is based on using WordNet relationships to derive three values, namely
potency (strong or weak), activity (active or passive) and evaluative (good or bad) (Kamps
and Marx, 2002; Osgood et al., 1957). The derivation of these values is obtained by computing
the minimal path length in WordNet between the adjective in question and the pair of words
mentioned before. However, for the purposes of this research, each of these values is averaged
over all the adjectives in a text and then delivered to the SVM model.
The last method aims to exploit information that is known in advance, i.e., what the
topic is and which target the sentiment is directed at. Considering this, the method creates
several classes of features based upon the semantic orientation values of phrases, given their
position in relation to the topic of the text. In each review, the references to the target being
reviewed were tagged as THIS WORK and references to the artist under review were tagged
as THIS ARTIST . This is just an example, since many other classes were retrieved from
natural language text. Each of these classes is assigned a value, so representing each
text as a vector of these real-valued features forms the basis for the SVM model. However, if
no topic information is available, only the values of the first and the second method are used.
The authors concluded that combinations of SVMs using these features, in conjunction with
SVMs based on unigrams and lemmatized unigrams, outperform models which do not use these
information sources.
In order to perform sentiment analysis more thoroughly, Mudinas et al. (2012) introduced an
aspect-level sentiment analysis system (pSenti) that integrates lexicon-based and corpus-based
approaches. The lexicon-based approaches are generally implemented in two steps: lexicon
detection and sentiment strength measurement. In corpus-based approaches, the sentiment
detection is treated as a simple classification problem, which can be addressed by employing
machine learning algorithms such as Naive Bayes or Support Vector Machines. Following this,
the main idea of the introduced concept is combining the best of both worlds, generating feature
vectors for supervised learning in the same way as is seen in lexicon-based approaches.
Initially, in a pre-processing phase, some simplifications were performed, such as replacing
known idiomatic expressions and emoticons with text masks. For instance, if the given dataset
shows that the emoticon :) has a positive sentiment, then the emoticon will be replaced by the
mask Good one . After these simplifications, the Stanford CoreNLP toolkit was used to carry
out POS and named entity tagging.
Considering that people express multiple views, sometimes opposite, about different aspects
of the same product in a single review, it is important to extract the discussed aspects, as well as
the corresponding views. Therefore, for aspect and view extraction, the authors generated lists
of aspects and views. The list of aspects is composed of nouns, noun phrases and entity tags,
identified by the POS tagger. The list of views is composed of adjectives and known sentiment
words, which occur near an aspect. This step was important to find views which can be used to
expand the sentiment lexicon, and also to perform context-aware sentiment value extraction for
such adjectives in the given aspect.
The sentiment lexicon used in this system is constructed using public resources, more specif-
ically 7048 sentiment words and their sentiment values that are marked in the range from −3
to +3. Furthermore, the authors applied heuristic linguistic rules such as negation (i.e., words
that can change the overall sentiment, such as not and don't) and modifiers (i.e., words that can
increase or decrease the sentiment value, such as less and more).
After the lexicon is created, another step is initiated: the corpus-based sentiment evaluation.
In this step, the authors used the linear SVM implementation in LibSVM. The feature vectors
of each aspect are constructed based on three elements. The first element is sentiment words,
where the weight of each such feature is the sum of the sentiment values in the given review. For
instance, if we have a review with the word good appearing twice, which has a sentiment value
+2, we would add the feature Good with a weight of +4. The second element, called other
adjectives, consists of adjectives which are not in the sentiment lexicon but are initialized with
their occurrence frequencies, and whose sentiment value is estimated by the learning algorithm.
For instance, if the word big appears twice, we would have the feature Big with a weight +2.
The final element is the lexicon based sentiment score which estimates the sentiment value of a
word that was previously unseen in the training samples but that exists in test samples.
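The first two feature types can be sketched as follows (the lexicon, review and adjective list are hypothetical; the third, lexicon-based score feature is omitted for brevity):

```python
# Hypothetical sentiment lexicon with values in the [-3, +3] range.
lexicon = {"good": 2, "terrible": -3}
review = ["good", "good", "big", "terrible", "big"]

def feature_vector(tokens, adjectives=("big",)):
    """pSenti-style features: each sentiment word is weighted by the sum of
    its lexicon values in the review; other adjectives by their frequency."""
    features = {}
    for w in tokens:
        if w in lexicon:
            features[w] = features.get(w, 0) + lexicon[w]
        elif w in adjectives:
            features[w] = features.get(w, 0) + 1
    return features

print(feature_vector(review))  # → {'good': 4, 'big': 2, 'terrible': -3}
```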
After being trained, the SVM model can reuse the calculated feature weights to adjust the final
sentiment calculation. The final overall sentiment scoring of pSenti is a real-valued sentiment
score in the range of [-1,1], which is calculated as follows:
Ssenti = (1/2) × log2( pos / neg ) (2.15)
In the equation, pos are the positive overall sentiment scores and neg are the negative overall
sentiment scores.
The experimental evaluation shows that the proposed hybrid approach achieves a high
accuracy, very close to that of pure corpus-based systems, and much higher than that of pure
lexicon-based systems.
Sentiment lexicons are often used as key sources for the automatic analysis of opinions,
emotions and subjective text. However, manually created sentiment lexicons consist of few
carefully selected words. The associated problem in this case is that these few words fail to
capture the use of non-conventional word spellings and slang, commonly found in social media.
In order to solve this problem, Moreira et al. (2015) developed a system, based on a novel
method, to create large-scale domain-specific sentiment lexicons. The authors address this task
as a regression problem, in which terms are represented as word embeddings. Considering this,
the system can be divided into two main phases.
The first phase consists of deriving word embeddings from large corpora. For this purpose,
Moreira et al. (2015) tested some different approaches: the Skip-gram and Structured Skip-
gram methods, the Continuous Bag-Of-Words (CBoW) model and the Global Vector (GloVe)
approach.
The skip-gram and the CBOW models estimate the optimal word embeddings by maximizing
the probability that words within a given window size are predicted correctly. Essential to the
skip-gram method is a log-linear model of word predictions. When given the i-th word from a
sentence wi, the skip-gram method estimates the probability of each word at a distance p from
wi as follows:
p(w_{i+p} | w_i; C_p, E) ≈ exp(C_p · E · w_i) (2.16)

In the equation, w_i ∈ {0, 1}^{v×1} is a sparse column vector of the size of the vocabulary v, with
a 1 in the position corresponding to that word (i.e., a one-hot sparse representation). The model
is parametrized by two matrices: E ∈ ℝ^{e×v} is the embedding matrix, transforming the one-hot
representation into a compact real-valued space of size e, and C_p ∈ ℝ^{v×e} is a matrix mapping the
real-valued representation to a vector with the size of the vocabulary v. A distribution over all
possible words is then attained by exponentiating and normalizing over the v possible options.
In order to avoid the normalization over the whole vocabulary (in practice the value of v
is large), some approximations are used. In the structured skip-gram model, a different matrix
C_p is used for each relative position p between the words.
The CBOW method defines an objective function that predicts the word at position i given
the context window from i−d to i+d, where d is the size of the context window. The probability
of the word wi is defined as follows:

p(w_i | w_{i−d}, ..., w_{i+d}; C, E) ≈ exp(C · S_{i−d}^{i+d}) (2.17)

In the equation, S_{i−d}^{i+d} is the pointwise sum of the embeddings of all context words, from
E · w_{i−d} to E · w_{i+d}, excluding the word w_i itself, and once again C ∈ ℝ^{v×e} is a matrix
mapping the embedding space into the output vocabulary space of size v.
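Both predictions of Equations 2.16 and 2.17 can be sketched with toy matrices (the values of E and C below are made-up illustrative numbers, with vocabulary size v = 3 and embedding size e = 2):

```python
import math

# Toy vocabulary of v = 3 words, embedding size e = 2 (illustrative values).
E = [[0.1, 0.3, -0.2],   # e x v embedding matrix
     [0.4, -0.1, 0.2]]
C = [[0.2, 0.1],         # v x e output matrix (one C_p per offset p
     [-0.3, 0.5],        # in the structured skip-gram variant)
     [0.1, -0.4]]

def embed(word_idx):
    """E . w_i for a one-hot w_i: just column word_idx of E."""
    return [row[word_idx] for row in E]

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def skipgram_probs(word_idx):
    """Eq. 2.16: distribution over context words, softmax(C_p . E . w_i)."""
    h = embed(word_idx)
    return softmax([sum(c * x for c, x in zip(row, h)) for row in C])

def cbow_probs(context_idxs):
    """Eq. 2.17: softmax(C . S), with S the pointwise sum of context embeddings."""
    s = [sum(embed(i)[d] for i in context_idxs) for d in range(len(E))]
    return softmax([sum(c * x for c, x in zip(row, s)) for row in C])

p = skipgram_probs(0)
print([round(x, 3) for x in p], round(sum(p), 3))  # probabilities sum to 1
```

The softmax over all v entries is exactly the normalization step that, for realistic vocabulary sizes, the approximations mentioned above are designed to avoid.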
The models described above are based on different assumptions about the relations between
words within a context window. The GloVe method combines the logic of the previous models
with ideas drawn from matrix factorization methods. The GloVe method derives the embeddings
with an objective function that combines context window information with corpus statistics
computed efficiently from a global term co-occurrence matrix.
In order to support the unsupervised learning of the embedding matrix E, in all methods,
a corpus of 52 million tweets was used.
In the second phase, after mapping terms to their respective embeddings, a regression model
was trained, using the manually annotated lexicons, to predict a score y ∈ [0, 1] corresponding
to the intensity of sentiment of any word or phrase. For this purpose, the authors tested several
linear regression models, such as least squares and regularized variants like ridge and elastic
net regression. They also experimented with Support Vector Regression (SVR) using non-linear
kernels, namely, polynomial, sigmoid and Radial Basis Function (RBF) kernels. Experiments
indicated that several configurations of the embedding model and size could achieve optimal
results. Therefore, the system was based on structured skip-gram embeddings with 600 dimen-
sions, and SVR with RBF kernel.
Similar to the work of Moreira et al. (2015), Zhang et al. (2015) developed a system to
predict a score between 0 and 1, which is indicative of the strength of association of Twitter
terms with positive sentiment. For this purpose, the authors implemented a regression model
to calculate the sentiment strength score for each target term with the aid of sentiment lexicon
score features and word embeddings.
Firstly, the authors transformed the informal terms into their normal forms. With this in
mind, some abbreviations and rules to convert the irregular writing found in services like Twitter
into normal forms were collected from the Internet. After this process, in order to extract sentiment
lexicon features, they employed some sentiment lexicons and transformed the score of all words
in all sentiment lexicons to the range between -1 and 1. If a target term contained more than one
word, the authors averaged their scores and used these averages as the final sentiment lexicon
feature. Word embedding features were also adopted. Specifically, the authors used the publicly
available word2vec vectors to get word embeddings with a dimensionality of 300. If a sentence or
phrase contains more than one word, the strategy adopted was to sum up all the word vectors. Some
of the experiments demonstrated that the combination of sentiment lexicon features and word
embedding is the most effective feature type for sentiment score prediction.
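The feature construction just described can be sketched as follows (the lexicon scores and the 3-dimensional "embeddings" are made-up stand-ins for the rescaled lexicons and 300-dimensional word2vec vectors):

```python
# Hypothetical lexicon scores rescaled to [-1, 1] and toy 3-d "embeddings".
lexicon = {"happy": 0.8, "sad": -0.7}
embeddings = {"happy": [0.1, 0.2, 0.3],
              "not":   [0.0, -0.1, 0.1],
              "sad":   [-0.2, 0.1, 0.0]}

def term_features(term):
    """Average lexicon score over the term's words, plus the sum of
    their word vectors, concatenated into one feature vector."""
    words = term.split()
    scores = [lexicon[w] for w in words if w in lexicon]
    lex_feat = sum(scores) / len(scores) if scores else 0.0
    emb_feat = [sum(embeddings[w][d] for w in words if w in embeddings)
                for d in range(3)]
    return [lex_feat] + emb_feat

print(term_features("not sad"))
```

The resulting vector (one averaged lexicon score followed by the summed embedding) is what the regression model is trained on.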
Finally, in order to predict the sentiment score of a new instance, Zhang et al. (2015) trained
an SVM classifier with the sentiment lexicon features and word embedding features together.
The authors concluded that using word embeddings features alone may not achieve sufficiently
good results, but embeddings make a considerable contribution to performance improvement,
in combination with traditional linguistic features.
2.3 Overview
In this chapter, I presented the necessary concepts to understand the work that has been
made in the context of my MSc thesis. Furthermore, I reviewed some of the most representative
sentiment analysis approaches, divided into three categories according to the source of information
they use. I concluded that each of these categories has its pros and cons, making each of them
useful in different contexts.
In the next chapter, I will present the text representations and model architectures that
I used in the sentiment analysis and dimensional sentiment analysis tasks.
3 Sentiment Analysis with Deep Neural Networks
This chapter describes the different text representations, as well as the model architectures,
used in the experiments that are reported in this dissertation. First, Section 3.1 presents
the different input representations for the models. Section 3.2 describes the model architectures,
including a description of the different layers used in each of them. Finally, Section 3.3
summarizes the most important aspects of this chapter.
3.1 Text Representation
One of the main problems in sentiment analysis consists in creating representations of text for
computational analysis, i.e., preparing the text to meet the input requirements of the classification
systems. After converting the data into a structured format, we need an efficient text
representation model on which to build an effective classification system. Some of the pre-processing
techniques that deal with this challenge will be described in the next sections.
3.1.1 Bag of Words
The BoW model is one of the pre-processing techniques proposed in the literature. Given a
collection of documents, we first identify the set of words used in the entire collection. Commonly
called vocabulary, this set is traditionally reduced, by keeping only the words that are used in
at least two documents. Besides that, many of the applications in text mining remove from
the vocabulary the so-called stop words. For instance, words like and, because, to, of or you
do not give us additional information about the sentiment polarity of a given document. From
the moment in which the vocabulary is defined, each document can be represented as a vector
(with numeric entries) of length m, being m the size of the vocabulary. In order to calculate the
length (i.e., the norm) n of a document vector we can do the following:
n = ∑_{j=1}^{m} x_j (3.1)
In the formula, x is the vector representation of the document and xj is the number of
occurrences of the word j in the document.
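The vocabulary construction, the BoW vector and the norm of Equation (3.1) can be sketched over a made-up toy collection:

```python
from collections import Counter

# Toy collection; the vocabulary keeps only words used in at least two documents.
docs = [["the", "movie", "was", "great"],
        ["great", "cast", "great", "movie"],
        ["the", "cast", "was", "dull"]]

doc_freq = Counter(w for d in docs for w in set(d))
vocab = sorted(w for w, n in doc_freq.items() if n >= 2)

def bow(doc):
    """Vector of length m = len(vocab): x_j = occurrences of word j."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

x = bow(docs[1])
print(vocab)      # → ['cast', 'great', 'movie', 'the', 'was']
print(x, sum(x))  # the sum of the entries is the norm n of Eq. 3.1
```

A Bag-of-bi-grams variant would apply the same counting to adjacent word pairs instead, e.g. `list(zip(doc, doc[1:]))`.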
The BoW scheme nonetheless has some limitations. The word order is lost, so that
different sentences may have exactly the same representation, as long as the same words are
used. With the aim of reducing the impact of this limitation, a particular variant of the BoW
technique corresponds to Bag-of-n-grams. The Bag-of-n-grams identifies multi-word expressions
occurring in the document, ensuring word order in short contexts. Expressions like Far cry
from and United States will be detected as single units when using a Bag-of-tri-grams or a
Bag-of-bi-grams, respectively. Still, the BoW representation and its variant Bag-of-n-grams have
little sense about the semantics of the words, and they both suffer from data sparsity and high
dimensionality problems.
3.1.2 Word to Vector
Many of the sentiment analysis systems use techniques that treat words as singular units, i.e.
there is no notion of similarity between them (words are represented as indices in a vocabulary).
Another common limitation is the production of data with high dimensionality and sparsity.
Taking into account these limitations and the progress of machine learning techniques in recent
years, the word to vector representation has been increasing its relevance in NLP applications. As
its own name indicates, the idea is to also represent the word as vectors, trying to reduce the
limitations detected in previous work. The produced vectors, in this case, are often called word
embeddings.
The name word embedding comes from the fact that we are embedding the words into
a real-valued low-dimensional space. Basically, word embeddings are used to map words or
phrases from a vocabulary to a corresponding vector of real numbers. The main advantages in
comparison with the previously described BoW technique are: dimensionality reduction, which
makes the representation more efficient, and contextual similarity, which makes the representation
more expressive. Considering the dimensionality reduction, and in contrast with the BoW
approach, in which the size of the vectors grows with the size of the vocabulary of the document
collection, word embeddings aim to create vector representations with a much lower
dimensionality. In relation to the contextual similarity, the word embeddings can capture word
meanings (i.e., the semantic similarity between two words is correlated with the cosine of the
angle between their word embeddings). For example, the cosine for tire and car may be 0.7,
whereas for tire and milk it may be 0.1, which is not much different than what a human would
Figure 3.1: Model Architectures (Mikolov et al., 2013).
say about the relationship between these words.
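The cosine similarity mentioned above is computed as follows (the 3-dimensional vectors for tire, car and milk are made-up illustrations, not real embeddings):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two word embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-d embeddings: "tire" and "car" share contexts, "milk" does not.
tire, car, milk = [0.9, 0.1, 0.2], [0.8, 0.3, 0.1], [0.1, 0.9, 0.8]
print(cosine(tire, car) > cosine(tire, milk))  # → True
```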
In this context, another important property is the fact that most methods for learning good
word embeddings are completely unsupervised, in that they build the embeddings using a big
unannotated text corpus.
One of the most popular algorithms for producing these vectors is Word2Vec, proposed by
Mikolov et al. (2013). The Word2Vec algorithm uses one of two possible model architectures to
produce a distributed representation of words: the Continuous Bag of Words or the Skip-gram.
The CBoW architecture predicts the current word based on the context, whereas the Skip-gram
predicts surrounding words given the current word (Figure 3.1).
3.1.3 Document to Vector
Le and Mikolov (2014) proposed an extension to the Word2Vec approach that aims to
construct embeddings for entire documents. This new extension can be called either
Doc2Vec or Paragraph2Vec. Sentiment analysis systems typically require the text input to
be represented as fixed-length vectors. Again, due to its simplicity, efficiency, and sometimes
surprising accuracy, the BoW technique is oftentimes applied. In order to improve the results
reported in previous studies, like BoW, techniques like Doc2Vec arose. Briefly, Doc2Vec consists
in combining the word vectors provided by the Word2Vec algorithm. The main idea is to end up
with a single aggregate vector that represents the semantics of the entire document.
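A very simple form of this aggregation is averaging the word vectors of the document (a crude stand-in for the learned paragraph vectors of Doc2Vec; the 2-dimensional vectors below are hypothetical):

```python
# Hypothetical pre-trained 2-d word vectors; averaging them is a simple
# stand-in for the learned aggregation performed by Doc2Vec.
vectors = {"good": [0.5, 0.1], "movie": [0.2, 0.3], "bad": [-0.5, 0.1]}

def doc_vector(tokens):
    """Average the vectors of the known words into one document vector."""
    known = [vectors[w] for w in tokens if w in vectors]
    return [sum(v[d] for v in known) / len(known) for d in range(2)]

print(doc_vector(["good", "movie"]))
```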
This technique allows us to represent a document in a small data space, i.e. in a few hundred
dimensions. The main point in this process is to know how much one should compress the
dimensionality of the matrix encoding the document collection. If the data is compressed too
much, we run the risk of not preserving the important differences between our
data points. The central goal is to obtain a nice balance where similar documents will be
Figure 3.2: Examples of a) a traditional Recurrent Neural Network Architecture, and b) a Long Short Term Memory Architecture.
clustered together, leaving enough space to discern one cluster from the others. This property
has a big relevance to classification systems. Specifically, applying Doc2Vec to a collection of
documents and tuning well the dimensionality can give an important support to a classification
algorithm to better define the boundary that separates the different categories that we want to
distinguish. In other words, the problem of sparsity, present in previous studies leveraging BoW
representations is reduced due to the fact that we can now manage to get enough density in the
data points.
3.2 Classification Models
This section presents and describes the models used in this work, some of which (e.g., the
NB and SVM classifiers) have already been described in detail in Chapter 2. The
models presented in this chapter are more complex and mainly based on deep neural networks.
Traditionally, neural networks are divided into two categories: the recurrent neural networks
(RNNs) and the convolutional neural networks (CNNs).
Starting with the RNNs, the reason they are called recurrent is because they perform the
same task for every element of an input sequence, with the output being dependent on the
previous computations. Theoretically, RNNs have the possibility to use information in arbitrarily
long sequences, but in practice they are limited to looking back only a few steps. In other words,
RNNs have two sources of input, namely the present and the recent past, that combined give us
the possibility to determine how to respond to new data, as we do in life (Figure 3.2 a)). Still,
RNNs suffer from the problem of vanishing gradients. The problem of vanishing gradients is
a challenge found in training neural networks with gradient based methods, for instance when
using the backpropagation algorithm. The backpropagation method is used in conjunction with
an optimization method, such as gradient descent, and it attempts to reduce the errors between
the output and the desired result. Specifically, for each training example, the hidden layer
weights are modified in order to minimize the error computed between the network’s prediction
and the correct value. As the name suggests, these modifications are made in the backwards
direction, i.e. from the output layer, through each hidden layer, to the first hidden layer. The
vanishing gradients problem implies loss of sensitivity in the network, over the time as new
inputs overwrite the activations of the hidden layers, causing forgetfulness of the first inputs.
In order to remedy this problem, a new RNN variant called Long Short Term Memory (LSTM)
was introduced. An LSTM can choose to retain memory over arbitrary periods of time, but also
forget if necessary (Figure 3.2 b)). Taking into account these developments, LSTMs are one of
the most used RNN models in Natural Language Processing tasks.
With respect to Convolutional Neural Networks, they are usually composed of several layers
of convolutions with non-linear activation functions applied to the results. In more detail,
convolutions over the input layer are used with the aim to compute the output. In contrast with
Feedforward Neural Networks where each input neuron is connected to each output neuron in
the next layer, in the CNNs the use of convolutions results in many local connections, where
each region of the input is connected to a neuron in the output. Each convolutional layer
applies different filters, and at the end combines their results. The values of these filters are
automatically learned during the training phase.
3.2.1 Overview on the Layers used within Deep Neural Networks
This section presents all the layers that compose the model architectures, each of them in
its own subsection, with a description of their respective usefulness, purpose and functioning.
3.2.1.1 Embedding Layers
An embedding layer aims to convert word indices into dense representations. In this case,
three different options are available: the first is to learn the embedding weights from scratch, i.e.
the word embeddings are initialized with random values and will be improved over the training
process. The second is to use pre-trained embeddings, keeping the embedding weights static.
The third and last option is to use pre-trained embeddings, but instead of keeping the
weights static, they are further improved during the training process.
3.2.1.2 Dropout Layers
The dropout layer has the responsibility of helping models prevent overfitting. As de-
scribed throughout the document, deep neural networks are composed of multiple non-linear
hidden layers, making these models very expressive. The existence of these types of layers allows
the models to learn very complicated relationships between inputs and outputs. Taking into
account the limited training data, many of the relationships will be the result of sampling noise,
i.e. some relationships will exist in the training set but not in the test set. When this occurs, the
model does not generalize well, causing bad results in terms of the prediction performance. In
order to avoid the overfitting problem, many methods have been developed, one of them being
the dropout method. Essentially, the dropout method forces the neural network to learn mul-
tiple independent representations of the same data (dropping out neurons during the training
phase), with the aim to reduce the dependence on the training set. A dropout layer will thus
set to zero a given portion of its inputs.
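This behavior can be sketched as follows (an inverted-dropout variant, which also rescales the surviving inputs so their expected sum is unchanged; the rescaling is a common implementation detail, not something the text above prescribes):

```python
import random

def dropout(inputs, p=0.25, training=True, seed=None):
    """Inverted dropout: zero each input with probability p during training
    and rescale survivors by 1/(1-p), keeping the expected sum unchanged."""
    if not training:
        return list(inputs)
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else x / (1 - p) for x in inputs]

out = dropout([1.0] * 8, p=0.25, seed=0)
print(out.count(0.0), "of 8 units dropped")
```

At test time (training=False) the layer is simply the identity, so no units are dropped when making predictions.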
3.2.1.3 LSTM Layers
An LSTM layer basically implements the LSTM recurrent model, which is one of the most
used variants of RNNs due to the fact that this approach considers three additional aspects that
improve significantly the prediction performance. The first is that the network can control
when to let the input enter the neuron. The second is the capacity to control
when to remember what was computed in the previous time steps. Finally, this method also
has the capacity to control which parts of the information the system should output. These
improvements are illustrated in Figure 3.2 b).
3.2.1.4 Dense Layers
A dense layer represents a regular fully connected layer. Specifically, the idea of this layer
is to connect every neuron in the network to every neuron in adjacent layers.
3.2.1.5 Convolutional Layers
A convolutional layer is the most relevant in the case of CNN models. These layers are
composed by a set of learnable filters with a small receptive field. During the forward process,
each filter is convolved across the full depth of the input volume. This process is accomplished
computing the dot product between the entries of the filter and the input, in order to produce
the activation map of each filter. Finally, the key point is that the network learns filters which
are activated when they see some specific type of feature at some spatial position in the input.
The output of a convolutional layer is represented by a stack of activation maps produced by
each filter. Each stack position can be seen as an output of a neuron that looks at a small region
in the input.
3.2.1.6 Pooling Layers and Flatten Layers
A pooling layer is commonly used between successive convolutional layers in a CNN model.
Their goal consists in progressively reducing the spatial size of the representation, which also
causes a reduction in the number of network parameters and computations. Due to this fact,
these layers are also an important aid in controlling the overfitting problem. The most traditional
function used to execute this operation is max pooling. The idea is to split the input representation
into a set of non-overlapping regions and, for each of these regions, select the maximum value.
Thus, at the end, each element of the new representation is the maximum value of a region in the
original input
representation. The intuition behind this is that once a feature has been detected, its exact
location is not very important compared to the importance of its location relative to other
features.
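The max pooling operation over non-overlapping regions can be sketched in one dimension (the input values are arbitrary illustrative numbers):

```python
def max_pool_1d(values, size):
    """Split the input into non-overlapping regions of `size` elements
    and keep only the maximum of each region."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

feature_map = [1, 5, 2, 8, 3, 3]
print(max_pool_1d(feature_map, size=2))  # → [5, 8, 3]
```

Flattening a stack of pooled maps into one dimension, as a flatten layer does, is then just `[x for row in maps for x in row]`.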
A flattening layer, as the name indicates, flattens the input. More properly, these layers
flatten all of the dimensions of their inputs into one dimension.
3.2.1.7 Activation Layers
An activation layer just applies an activation function to an output. An activation function
in a neural network, besides restricting outputs to a certain range, breaks the linearity of the
network, allowing it to learn more complex functions than linear regression. A neuron without
an activation function is equivalent to a neuron with a linear activation function, like f(x) =
x. When these functions do not add any non-linearity, the entire network is equivalent to a
single linear neuron. Therefore, it makes no sense to build a multi-layer network with linear
activation functions. Moreover, given that a single linear neuron is not capable of dealing with
non-separable data, no matter how deep a multi-layer network is, it can never solve any non-
linear problem. Taking this into account, the role of activation functions in a neural network is
to produce a non-linear decision boundary, via a non-linear combination of the weighted inputs.
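The collapse of linear layers can be verified numerically: composing two weight matrices without an activation in between is exactly one linear map (the matrices below are arbitrary illustrative values):

```python
# Two "layers" without activation functions: y = W2 (W1 x).
W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[0.5, 1.0], [1.0, 0.0]]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

x = [1.0, 1.0]
two_layers = matvec(W2, matvec(W1, x))
one_layer = matvec(matmul(W2, W1), x)  # the single equivalent linear layer
print(two_layers == one_layer)  # → True
```

Inserting a non-linearity such as a sigmoid between the two layers breaks this equivalence, which is exactly what gives depth its expressive power.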
Figure 3.3: Stack of two LSTMs.
3.2.2 Model Architectures
This section describes the model architectures, each of them in its own subsection. Each
model description gives, beyond the model's composition (i.e., its layers), the specifications of
each component.
3.2.2.1 Stack of LSTMs
The stack of LSTMs model is simply a stack of several LSTM layers. The idea is to form a
deep network not only in terms of layers but also in recurrent dimensions. The intuition is that
higher LSTM layers can capture abstract concepts in sequences which might help in a particular
task, in our case, sentiment analysis. To better understand this concept, Figure 3.3 presents the
architecture of the Stack of two LSTMs model.
In brief, Figure 3.3 shows a model that is composed of one embedding layer, two LSTM
layers and one activation layer. The figure, as in all the next model architectures, does not
present any dropout layer. However, in order to prevent overfitting problems, as described in
the previous section, dropout layers with a probability of 0.25 are used between most of the
layers.
Starting with the embedding layer, there are four data parameters that need to be specified:
input dimensionality, output dimensionality, input length and weights. The input dimensionality
is defined as the maximum number of words to consider in the representation (i.e., set to 30000).
The output dimensionality as the size of the word embeddings. The input length as the maximum
length of a sentence (i.e., set to 50 words), and finally the weights as a list of numpy arrays to
set the initial embedding weights.
Figure 3.4: Bidirectional LSTM Architecture.
The dropout layers implement the dropout mechanism, which consists in randomly setting a
fraction p of the input units to 0 at each update during training time, helping prevent overfitting.
Thus, the only specification needed is the definition of the value of p, set here to 0.25.
The LSTM layers have three parameters: output dimensionality, activation and weight
initialization. The output dimensionality is defined as the word embeddings size. The activation
is set to a sigmoid function, and the weights are initialized to zero.
The activation layer only has one parameter, which depends on whether the model is used for
regression or classification. If the model is used in a regression task, then the activation
is defined as a linear function, otherwise a sigmoid function is used. The output dimensionality
represents the number of output dimensions of the model and is selected according to the specific
task.
3.2.2.2 Bidirectional LSTMs
The bidirectional LSTM model was introduced with the aim of improving the performance of standard RNNs, which do not have access to future information at a given state. To overcome this limitation, the bidirectional LSTM model connects hidden layers of opposite directions
Figure 3.5: Multi-Layer Perceptron Architecture.
to the same output, so that the output layer can access information from past and future states.
One such architecture is presented in Figure 3.4.
Figure 3.4 shows that the bidirectional LSTM model is composed of one embedding layer, two LSTM layers in a bidirectional arrangement, and an activation layer at the end. Each of these layers has its own parameters, as further detailed below.
The embedding, dropout and activation layers have the same parameters that were explained for the previous model, although some layers have additions. The bidirectional arrangement adds parameters to the LSTM layers, such as setting the go_backwards parameter to true.
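The forward/backward wiring can be illustrated with a toy recurrent pass, where a deliberately simplified update rule stands in for the LSTM equations and reverse=True mimics the go_backwards parameter:

```python
def rnn_state(sequence, reverse=False):
    """Run a toy recurrent update over the sequence and return the final state.
    reverse=True mimics the go_backwards parameter of the backward LSTM."""
    items = list(reversed(sequence)) if reverse else list(sequence)
    state = 0.0
    for x in items:
        state = 0.5 * state + x   # placeholder for the LSTM cell update
    return state

def bidirectional(sequence):
    """Concatenate the final states of the forward and backward passes, so
    the output can access information from both past and future positions."""
    return [rnn_state(sequence), rnn_state(sequence, reverse=True)]

out = bidirectional([1.0, 2.0, 3.0])   # [forward state, backward state]
```

Only the wiring is the point here: the two directions run independently over the same input, and their outputs are joined before the activation layer.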
3.2.2.3 Multi-Layer Perceptron
The multi-layer perceptron (MLP) model is a simple feedforward neural network, i.e., the connections between the network's neurons do not form a cycle. More precisely, information flows in a single direction, from the input to the output. All the nodes in this model, except the input nodes, are neurons with a non-linear activation function. The main improvement of the MLP model over the standard linear perceptron is its capacity to distinguish data that is not linearly separable. The architecture of the model is illustrated in Figure 3.5.
Figure 3.5 shows that the multi-layer perceptron is composed of the following layers: three dense layers and one activation layer. The dropout and the activation layers have the same parameters as in the previous models; the only difference lies in the dense layers. The first two dense layers have two parameters: the output dimensionality and the activation. The output dimensionality is defined as the embedding size, and the activation is set to a relu function. For the last dense layer, the only parameter is the output dimensionality, which represents the number of output dimensions of the model and is selected according to the specific task.
Figure 3.6: Convolutional Neural Network Architecture.
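A forward pass through such a stack of dense layers can be sketched as follows. The weights and biases are toy values chosen for illustration; in the model they are learned, the hidden size equals the embedding size, and the output activation depends on the task:

```python
def dense(inputs, weights, bias, activation):
    """One fully connected layer: weighted sum per output unit, then activation."""
    out = []
    for w_row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        out.append(activation(z))
    return out

relu = lambda z: max(0.0, z)
identity = lambda z: z   # linear output, as used for regression

x = [0.5, -1.0, 2.0]
h1 = dense(x, [[1, 0, 0], [0, 1, 1]], [0.0, 0.0], relu)   # first dense layer, relu
h2 = dense(h1, [[1, 1], [1, -1]], [0.0, 0.0], relu)       # second dense layer, relu
y = dense(h2, [[1, 1]], [0.0], identity)                  # output layer (task-dependent)
```

Each layer feeds only forward into the next one, which is exactly the acyclic structure the text describes.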
3.2.2.4 Convolutional Neural Networks
Convolutional neural network models apply convolutions over the inputs to compute the outputs, in contrast with other neural networks, which connect each neuron to every neuron of the previous layer. In CNNs, each layer applies different filters over the data and, during this process, the model learns the values of its filters based on the task we want to perform. At the end of this process, the results of each layer are combined. In this case, the architecture used is composed of three filters with lengths 3, 5 and 7. To make this clearer, the model architecture is shown in Figure 3.6.
This architecture is composed of the following layers: one embedding layer, three convolutional layers, three max pooling layers, three flatten layers and, finally, an activation layer.
The embedding, dropout and activation layers have the same parameters as in the previous models. In this model, the additions are the convolutional, max pooling and flatten layers; the flatten layers do not have any parameters to be described.
Figure 3.7: CNN-LSTM Architecture.
In the convolutional layers there are five important parameters: the number of convolutional kernels, the filter length, the input dimensionality, the input length, and the activation. The number of convolutional kernels is defined as the embedding dimensionality. The filter length is defined as the extension (spatial or temporal) of each filter, and is assigned the values 3, 5 and 7. The input dimensionality is defined as the embedding dimensionality. The input length is the maximum length of a sentence (i.e., set to 50) and, finally, the activation is defined as a relu function.
The max pooling layers have only one parameter, the pool length, defined as the region size over which max pooling is applied; in this case, it depends on the filter length used in the convolutional layers.
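The interplay between a convolutional filter and the subsequent max pooling step can be sketched as follows. This uses a single one-dimensional filter of length 3 with toy weights; the real layers use as many learned filters as the embedding dimensionality:

```python
def conv1d(sequence, kernel):
    """Slide a filter of length len(kernel) over the sequence (valid padding)."""
    k = len(kernel)
    return [sum(w * x for w, x in zip(kernel, sequence[i:i + k]))
            for i in range(len(sequence) - k + 1)]

def max_pool1d(sequence, pool_length):
    """Take the maximum over consecutive, non-overlapping regions."""
    return [max(sequence[i:i + pool_length])
            for i in range(0, len(sequence) - pool_length + 1, pool_length)]

feature_map = conv1d([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], kernel=[1.0, 0.0, 1.0])
pooled = max_pool1d(feature_map, pool_length=2)
```

Filters of lengths 3, 5 and 7 produce feature maps of different lengths from the same 50-word input, which is why each branch has its own pooling and flatten layers before the results are combined.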
3.2.2.5 Combined CNN-LSTM Network
The CNN-LSTM model consists, as the name suggests, of combining a CNN architecture with an LSTM architecture. Previous studies, described in Chapter 2, show that combining distinct models can sometimes yield a more powerful overall model that improves prediction performance, because each of the combined models contributes its specific features, resulting in a more complete model. Specifically, since CNN models are capable of extracting local information but may fail to capture long-range dependencies, an LSTM model is combined with the CNN in an attempt to overcome this limitation. The combined architecture can be seen in Figure 3.7.
Figure 3.7 shows that the model is composed of one embedding layer, one convolutional layer, one max pooling layer, one LSTM layer and, finally, an activation layer.
The embedding, dropout and activation layers have the same parameters as in the previous models. The differences are in the convolutional and max pooling layers.
In the convolutional layer, the number of convolution kernels is defined as the embedding dimensionality, the filter length is assigned the single value 3, and the activation is set to a relu function.
The only parameter involved in the max pooling layer is the pool length, defined as the region size over which max pooling is applied. In this specific architecture, this parameter was assigned the value 2.
Figure 3.8: Merged CNN Architecture.
3.2.2.6 Merged CNN
The merged CNN model is a variant of the CNN model, whose main idea is to allow the previously described CNN model to receive two different types of word representations as input. Hence, this architecture has two branches, each receiving its respective word representation. At the end, the goal is to concatenate the two representations produced by the branches and perform the classification over this new representation. Figure 3.8 shows the model architecture with the respective layers. The parameters involved in each of the layers are the same as in the CNN model.
3.2.2.7 Merged CNN-LSTM
As in the previous merged model, the merged CNN-LSTM model is a variant of the previously introduced CNN-LSTM model. The idea is also to allow the CNN-LSTM model to receive two different types of word representations. Thus, this architecture has two branches, each using a different word representation. At the end, the goal is to concatenate the two representations produced by the branches and perform the classification over this new representation. Figure 3.9 shows the model architecture.
Figure 3.9: Merged CNN-LSTM Architecture.
3.3 Overview
In this chapter, I described all the components involved in addressing the sentiment analysis and dimensional sentiment analysis tasks. Specifically, the chapter described the different text representations, as well as the different model architectures. To better contextualize the layers of each model, brief overviews were also presented, followed by the model architectures with their respective layers and parameters. In the next chapter, I present the evaluation experiments conducted for each particular task.
4 Experimental Evaluation
This chapter starts with the presentation of the general evaluation methodology, followed by an explanation of the different inputs that are provided to the classification models, the evaluation metrics, and finally the results of the experimental evaluation. Section 4.1 describes the evaluation methodology. Section 4.2 presents the datasets, as well as the fine-tuned word embeddings that were used for each specific task, i.e., sentiment analysis or dimensional sentiment analysis. Section 4.3 details the metrics for measuring the quality of the results and, finally, Section 4.4 presents the results from the different models, according to each specific task. Section 4.5 summarizes the contents of this chapter.
4.1 Evaluation Methodology
The evaluation methodology has components that are specific to each task and dataset. However, a common methodological aspect is present in all tests with neural network models, namely the use of a callback with an early stopping function. A callback in this context is nothing more than a set of functions to be applied at given stages of the training procedure. The use of callback functions allows us to observe the internal states and statistics of the model during training. Specifically, the early stopping function allows us to stop the training process when a monitored quantity has stopped improving.
In the experiments, the monitored quantity was the validation loss. The validation loss is measured at each training epoch on a previously defined validation set. In our case, the validation set was defined as the last 10% of the training data, and the training procedure stops when we observe at least two consecutive epochs with no improvement.
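The stopping rule can be sketched in plain Python. The list of validation losses below is hypothetical; in the experiments these would be the losses measured, at each epoch, on the held-out last 10% of the training data:

```python
def train_with_early_stopping(val_losses, patience=2):
    """Stop when the validation loss fails to improve for `patience`
    consecutive epochs; return the number of epochs actually run."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Loss improves for three epochs, then stalls for two: training stops at epoch 5.
stopped_at = train_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.66, 0.5])
```

Note how the final improvement at epoch 6 is never seen: the callback trades a possible late gain for a shorter, less overfitting-prone training run.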
4.1.1 Evaluation for the Sentiment Analysis Task
The sentiment analysis task was performed over different datasets/contexts, with the aim of making the results more robust. Thus, the Sentence Polarity v1.0 (Pang and Lee, 2005), the Stanford Sentiment Treebank (Socher et al., 2013), and the Tweet 2016 datasets were used.
The Sentence Polarity dataset v1.0, described in more detail in the next section and corresponding to a binary classification task, does not have pre-defined train and test sets. Therefore, in this case, and in order to assess the prediction quality of the different models, cross-validation with 3 folds was applied.
Cross-validation is probably the simplest and most widely used method for estimating prediction errors. This method directly estimates the expected extra-sample error Err = E[L(Y, f(X))], i.e., the average generalization error when the method f(X) is applied to an independent test sample from the joint distribution of X and Y.
In cases where there are no pre-defined train and test sets, this method uses part of the available data to fit the model and a different part to test it, repeating this process several times. The idea is to split the data into K roughly equal-sized parts and, after selecting the k-th part, fit the model to the other K − 1 parts of the data, and calculate the prediction error of the fitted model when predicting the k-th part. We do this for k = 1, 2, ..., K and combine the K estimates of the prediction error.
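Under the K = 3 setting used for the Sentence Polarity dataset, the fold construction can be sketched as follows, operating on item indices only:

```python
def k_fold_splits(n_items, k):
    """Split item indices into k roughly equal parts and return, for each part,
    the (train, test) index lists: the k-th part is held out for testing and
    the model is fitted on the remaining k-1 parts."""
    indices = list(range(n_items))
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits

splits = k_fold_splits(10, k=3)
# Each split holds out a different part; the prediction errors measured on the
# held-out parts are then combined into a single estimate.
```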
Almost all the models considered in our tests use word embeddings. In this particular case, the word embeddings used were the publicly available word2vec vectors trained on Google News (Mikolov et al., 2013).
The Stanford Sentiment Treebank corpus has pre-defined train and test sets. In cases like this, the best option is to use the pre-defined sets, in order to allow a better comparison with the results of previous studies.
With respect to word embeddings, and similarly to the previous dataset, the publicly available word2vec vectors trained on Google News were used. This specific dataset has sentences labeled with an integer sentiment polarity score ranging from 0 to 4.
The Tweet 2016 dataset also has pre-defined train/test sets. However, in contrast with the previous datasets, where the word embeddings trained on Google News were used, in this case word embeddings trained on Twitter microposts (Godin et al., 2013) were used. This specific dataset has sentences labeled with an integer sentiment polarity ranging from 0 to 2.
A first set of experiments leveraged models that simply use pre-trained word embeddings. All the words, including those not present in the pre-trained word embeddings (which are randomly initialized), are fine-tuned throughout the training process. Thus, not only the parameters of the models are learned, but also the weights of the word embeddings.
1 http://alt.qcri.org/semeval2016/task4/
The second set of experiments involved models that require pre-processing the word embeddings, in order to concatenate the pre-trained representations with the emotion dimension values (valence, arousal and dominance). The algorithm that produces this concatenation has two distinct steps. For words that are simultaneously present in a pre-existing dataset of affective ratings for words (Warriner et al., 2013) and in the pre-trained word embeddings, the values are simply concatenated. For the remaining cases, a regression model is trained, taking the word embeddings as the training set and the dimensional values as labels. The idea is to use this model to predict the dimensional values of words that do not appear in the original affective dataset (Warriner et al., 2013), so that the entire collection of word embeddings can have an associated group of dimensional values. The final result of this process is a set of word embeddings whose weights are concatenated with the emotion dimension values. After this pre-processing phase, the word embeddings are given to the models, as in the previous set of experiments.
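A minimal sketch of this two-step procedure follows. The embeddings, the affective ratings, and the predict_vad regressor are toy stand-ins introduced for illustration; in the actual experiments the regressor is trained on the words covered by Warriner et al. (2013):

```python
# Toy pre-trained embeddings (dimensionality 3 here, 300 in the experiments).
embeddings = {"good": [0.1, 0.2, 0.3], "film": [0.4, 0.5, 0.6]}

# Toy affective ratings: (valence, arousal, dominance) per word.
vad_ratings = {"good": (7.9, 4.1, 6.5)}

def predict_vad(vector):
    """Hypothetical stand-in for the regression model that predicts
    (valence, arousal, dominance) for words missing from the ratings."""
    return (5.0, 5.0, 5.0)   # neutral mid-scale guess

def concat_with_vad(word):
    vector = embeddings[word]
    # Step 1: the word has a rating -> concatenate the known values.
    # Step 2: otherwise -> concatenate values predicted by the regressor.
    vad = vad_ratings.get(word) or predict_vad(vector)
    return vector + list(vad)

enriched = {w: concat_with_vad(w) for w in embeddings}
```

After this step, every embedding has three extra components, so the models receive vectors of dimensionality 303 instead of 300.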
The third and last set of experiments consists of separately using two different types of word embeddings within the same model architecture. Specifically, pre-trained word embeddings are used together with a set of word embeddings of dimensionality 3, composed of the values of valence, arousal and dominance (taken from the pre-existing dataset of affective ratings for words (Warriner et al., 2013)). In contrast with the previous experiments, where there is just one input to the models, there are now two inputs. The idea is to build a merged architecture with two branches, one branch receiving one type of word embeddings as input, and the other branch receiving the other type. Both branches have the same model architecture and, after some computations in each branch, the results are concatenated. This type of merged architecture was considered only for the models with the best results in the previous experiments, and corresponds to the Merged CNN and Merged CNN-LSTM architectures described in Chapter 3.
4.1.2 Evaluation for the Dimensional Sentiment Analysis Task
In this particular case, the training set in all experiments is the Extended Warriner dataset of affective ratings for words and phrases, described in more detail in the next section. The test sets are EmoTales (Francisco et al., 2012), ANET (Bradley and Lang, 2007) and a Facebook Messages dataset (Preoctiuc-Pietro et al., 2016), in which sentences are labeled with real values encoding emotional valence, arousal and dominance. With respect to the word embeddings required by some of the models, the publicly available word2vec vectors trained on Google News are used.
4.2 Datasets
This section describes the data involved in the experiments for the sentiment analysis and
the dimensional sentiment analysis tasks. The section presents the different datasets as well as
the word embeddings that were used, according to each specific task.
4.2.1 Sentiment Analysis Datasets
The Sentence Polarity dataset v1.0, described by Pang and Lee (2005), is a corpus composed of 5331 positive and 5331 negative sentences taken from several movie reviews.
The Stanford Sentiment Treebank dataset, described by Socher et al. (2013), is a corpus
with fully labeled parse trees that supports the complete analysis of compositional effects of
sentiment in language. This corpus is based on the Rotten Tomatoes movie reviews dataset
introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie
reviews. Each unique phrase is annotated by 3 human judges, with the following sentiment
scale: negative - 0, somewhat negative - 1, neutral - 2, somewhat positive - 3 and positive - 4.
The Tweet 2016 dataset was provided for Task 4 (Sentiment Analysis in Twitter) of the SemEval 2016 competition. This dataset contains tweets annotated with the following sentiment scale: negative - 0, neutral - 1, positive - 2.
4.2.2 Dimensional Sentiment Analysis Datasets
The EmoTales dataset, described by Francisco et al. (2012), consists of a collection of
1389 English sentences from 18 different folk tales, annotated by 36 different people. All the
sentences are annotated with real-valued ratings for emotion, according to the dimensions of
evaluation/valence, activation/arousal and power/dominance.
The Affective Norms for English Text dataset from Bradley and Lang (2007), provides
normative ratings of emotion (pleasure/valence, arousal, dominance) for a small set of brief
texts (i.e., a total of 100 sentences) in the English language.
2 http://alt.qcri.org/semeval2016/task4/
The dataset introduced by Preoctiuc-Pietro et al. (2016) provides a set of Facebook mes-
sages rated by two psychologically trained annotators on two separate ordinal nine-point scales,
representing valence and arousal. This dataset includes a total of 2896 Facebook messages.
The dataset provided by Warriner et al. (2013) contains 13915 English lemmas annotated with valence, arousal and dominance values. Beyond that, this dataset also includes complementary information on the annotations, such as gender, age and educational differences in emotion norms.
The Paraphrase Database (PPDB), provided by Pavlick et al. (2015), is an automatically extracted database containing millions of paraphrases for short sentences in 16 different languages. The goal of PPDB is to improve language processing by making systems more robust to language variability and unseen words. In the context of this work, the PPDB package of size L for the English language was used.
The Extended Warriner dataset, as the name suggests, is an extension of the Warriner dataset (Warriner et al., 2013) that was developed by me for the experiments in this dissertation. This extension was created using information present in the Paraphrase Database (Pavlick et al., 2015). Specifically, the idea was to filter the PPDB file in order to find phrases with more than one word, with high confidence of being paraphrases, that are equivalent to words present in the Warriner dataset. The goal is then to associate these equivalent phrases with the scores found in the Warriner dataset. Thus, the final result was a dataset containing not only single words but also phrases, together with their respective emotion dimension values. A total of 54000 instances were present in the resulting dataset.
4.2.3 Word Embeddings
Almost all the models described in the previous chapter use word embeddings. The different
word embeddings that were used in the experiments are the following:
• Word embeddings trained on about 100 billion words from Google News (Mikolov et al.,
2013). The training was performed using the continuous bag of words architecture, and
the word vectors have a dimensionality of 300.
• Word embeddings trained on about 400 million Twitter microposts (Godin et al., 2013). The training was performed using the skip-gram architecture, and the word vectors have a dimensionality of 400.
4.3 Evaluation Metrics
The evaluation of the different tasks was done using different measures of result quality. In
order to assess the sentiment analysis task, I used the accuracy evaluation metric.
Consider that each testing instance, after being processed by an automated system that detects whether it expresses a positive or negative sentiment, can either be:
• True Positive (TP) - the system prediction is positive, as is the real value.
• False Positive (FP) - the system prediction is positive and the real value is negative.
• False Negative (FN) - the system prediction is negative and the real value is positive.
• True Negative (TN) - the system prediction is negative, as is the real value.
Accuracy can be computed as the overall correctness of the system, i.e., the number of decisions the system got right divided by the total number of decisions made by the system. For a binary classification problem:
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4.1)
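As a sanity check, Equation 4.1 translates directly into code (the confusion counts below are made up for illustration):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of decisions the system got right (Equation 4.1)."""
    return (tp + tn) / (tp + tn + fp + fn)

# A system that is right on 85 positive and 80 negative instances out of 200.
acc = accuracy(tp=85, tn=80, fp=15, fn=20)
```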
In order to assess the dimensional sentiment analysis task, I used the Pearson correlation coefficient. The Pearson correlation measures the degree of correlation between two variables. Specifically, it measures the strength of a linear association between two variables and is denoted by ρ. The idea is to draw a line of best fit through the data of the two variables; the Pearson correlation coefficient indicates how close the data points lie to this line. The coefficient ρ can only take values between −1 and 1, where:
• ρ = 1 indicates a perfect positive correlation between the two variables, i.e., if one of them increases the other will also increase.
• ρ = −1 indicates a perfect negative correlation between the two variables, i.e., if one of them increases the other will decrease.
• ρ = 0 indicates that there is no linear association between the two variables.
Briefly, the stronger the association between the two variables, the closer the ρ value will be to either 1 or −1, depending on whether the relationship is positive or negative, respectively.
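For reference, ρ can be computed directly from its definition as the covariance of the two variables divided by the product of their standard deviations (a stdlib sketch; the experiments presumably relied on a library implementation):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance of the two variables divided by the
    product of their standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rho_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly positive linear relation
rho_neg = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly negative linear relation
```

In the experiments, xs would hold the predicted values of a dimension (e.g., valence) and ys the gold-standard annotations.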
4.4 Experimental Results
This section presents the results for the different tasks addressed in this work (i.e., for
the sentiment analysis and dimensional sentiment analysis tasks). In order to evaluate the
performance of each of them, the models were used over different contexts/datasets. Thus,
results from the experiments will be presented according to each specific task and dataset.
4.4.1 Sentiment Analysis
This section presents the results for the sentiment analysis task. All the sentiment analysis experiments are reported according to their context/dataset.
Model Epochs Batch size Accuracy
NB Bag of Words - - 78.4 %
SVM Bag of Words - - 75.6 %
NB-SVM Bag of Words - - 76.5 %
MLP Bag of Words 4 32 76.8 %
Stack of Two LSTMs Word2Vec 5 32 76.7 %
Bidirectional LSTM Word2Vec 5 32 75.8 %
CNN Word2Vec 4 32 77.1 %
CNN-LSTM Word2Vec 5 32 76.0 %
MLP Doc2Vec 4 32 64.0 %
CNN-non-static Kim (2014) - - 81.5 %
RAE (Socher et al., 2011) - - 77.7 %
MV-RNN (Socher et al., 2012) - - 79.0 %
CCAE (Hermann and Blunsom, 2013) - - 77.8 %
sent-Parser (Dong et al., 2014) - - 79.5 %
NBSVM (Wang and Manning, 2012) - - 79.4 %
MNB (Wang and Manning, 2012) - - 79.0 %
G-Dropout (Wang and Manning, 2013) - - 79.0 %
F-Dropout (Wang and Manning, 2013) - - 79.1 %
Tree-CRF (Nakagawa et al., 2010) - - 77.3 %
Table 4.1: Sentence Polarity Dataset: model results using pre-trained word embeddings, compared against previous works.
4.4.1.1 Sentence Polarity Dataset
The Sentence Polarity dataset v1.0 (Pang and Lee, 2005) was used in experiments regarding a binary sentiment analysis task on movie reviews. The models used to address this task are described in Table 4.1, with their respective results in terms of accuracy. The first four models use the bag-of-words text representation, while the remaining ones use pre-trained word embeddings. Table 4.1 is divided into two parts: the first half describes the performance of the models implemented by me and described in this dissertation, and the second half describes the performance of models from previous works using this dataset.
Table 4.1 shows that the results of the models used in my experiments are close to those of the current state of the art, although all of these models perform somewhat below it. Another relevant fact is that the best result was obtained when using a simple Naive Bayes classifier with the bag-of-words representation. This is not very surprising taking into account the results of previous studies, which also showed that these simple baselines can outperform more complex models (Wang and Manning, 2012).
The next set of experiments involved the addition of information about emotion dimensions (i.e., valence, arousal and dominance); the results are presented in Table 4.2. The first half of Table 4.2 reports the results of the experiments with the concatenated version of the word embeddings, and the second half the results of the experiments using the merged architectures.
Model Epochs Batch size Accuracy
MLP Bag of Words 4 32 76.2 %
Stack of Two LSTMs Word2Vec 5 32 75.8 %
Bidirectional LSTM Word2Vec 5 32 76.3 %
CNN Word2Vec 4 32 77.5 %
CNN-LSTM Word2Vec 5 32 74.5 %
MLP Doc2Vec 4 32 64.0 %
CNN Word2Vec 4 32 79.1 %
CNN-LSTM Word2Vec 5 32 77.5 %
Table 4.2: Sentence Polarity Dataset: model results using a concatenated version of the word embeddings, compared against the results of the merged architectures.
Looking at the first half of Table 4.2 and comparing with the results of Table 4.1, one can see a slight decrease in performance for most models, with some exceptions; the best model remains the Naive Bayes classifier. Looking at the second half of the table, which reports the results of the merged architectures, one can see a slight increase in performance for all models, and the CNN model is now the one with the best performance.
The results in Table 4.2 give some indication that adding information about the emotion dimensions can improve the results of some models in the sentiment analysis task.
4.4.1.2 Stanford Sentiment TreeBank Dataset
The Stanford Sentiment Treebank dataset (Socher et al., 2013) was also used to experiment with a sentiment analysis task on movie reviews but, instead of only distinguishing the sentiment as positive or negative, this dataset allows us to experiment with a more fine-grained sentiment analysis task, distinguishing the sentiment into five categories. The results of these experiments are described in Table 4.3. The organization is the same as in the previous experiments, i.e., the models used are the same, as is the division considered for the table.
Model Epochs Batch size Accuracy
NB Bag of Words - - 39.2 %
SVM Bag of Words - - 37.2 %
NB-SVM Bag of Words - - 39.2 %
MLP Bag of Words 4 32 39.0 %
Stack of Two LSTMs Word2Vec 5 32 38.9 %
Bidirectional LSTM Word2Vec 5 32 39.3 %
CNN Word2Vec 4 32 37.9 %
CNN-LSTM Word2Vec 5 32 41.3 %
MLP Doc2Vec 4 32 32.6 %
CNN-non-static Kim (2014) - - 48.0 %
RAE (Socher et al., 2011) - - 43.2 %
MV-RNN (Socher et al., 2012) - - 44.4 %
RNTN (Socher et al., 2013) - - 45.7 %
DCNN (Kalchbrenner et al., 2014) - - 48.5 %
Paragraph-Vec (Le and Mikolov, 2014) - - 48.7 %
Table 4.3: Stanford Sentiment Treebank Dataset: model results using pre-trained word embeddings, compared against previous works.
Similarly to what was observed in the previous set of experiments, the results obtained in this somewhat different task are also relatively close to those of the current state-of-the-art models. The best result was obtained using the combination of a CNN and an LSTM model.
The results of adding information about emotion dimensions are presented in Table 4.4. As before, the first half of the table reports the results of the experiments with the concatenated version of the word embeddings, and the second half reports on the experiments using the merged architectures.
Model Epochs Batch size Accuracy
MLP Bag of Words 4 32 39.6 %
Stack of Two LSTMs Word2Vec 6 32 40.9 %
Bidirectional LSTM Word2Vec 6 32 37.8 %
CNN Word2Vec 7 32 43.3 %
CNN-LSTM Word2Vec 5 32 44.2 %
MLP Doc2Vec 4 32 32.4 %
CNN Word2Vec 5 32 40.7 %
CNN-LSTM Word2Vec 6 32 40.5 %
Table 4.4: Stanford Sentiment Treebank Dataset: model results using a concatenated version of the word embeddings, compared against the results of the merged architectures.
Looking at the first half of Table 4.4 and comparing with the results of Table 4.3, it is possible to see a significant improvement in the performance of some of the models, namely the CNN and the CNN-LSTM. On the other hand, models like the Bidirectional LSTM and the MLP with Doc2Vec show a slight decrease in performance. In the second half of the table, and comparing with Table 4.3, there is also an improvement, although on a smaller scale.
4.4.1.3 Tweet 2016 Dataset
The Tweet 2016 dataset was used to experiment with a sentiment analysis task on Twitter data. Similarly to the Stanford Sentiment Treebank dataset, this dataset allows one to experiment with a more fine-grained sentiment analysis task, distinguishing the sentiment into three different categories. The results of the experiments using this dataset are described in Table 4.5. The first half of the table presents the performance of the models implemented by me, and the second half presents the results obtained by some of the teams in the SemEval 2016 competition using this dataset.
3 http://alt.qcri.org/semeval2016/task4/
4 http://alt.qcri.org/semeval2016/task4/
Model Epochs Batch size Accuracy
Linear model with Bag of Words - - 53.0 %
MLP Bag of Words 5 32 56.2 %
Stack of Two LSTMs Word2Vec 10 32 56.1 %
Bidirectional LSTM Word2Vec 14 32 56.4 %
CNN Word2Vec 4 32 58.7 %
CNN-LSTM Word2Vec 11 32 60.0 %
MLP Doc2Vec 6 32 52.2 %
SwissCheese - - 64.3 %
SENSEI-LIF - - 61.7 %
UNIMELB - - 61.6 %
INESC-ID - - 60.0 %
TwiSE - - 52.8 %
MDSENT - - 54.5 %
Table 4.5: Tweet 2016 Dataset: model results using pre-trained word embeddings, compared against previous works.
Table 4.5 shows that the results of the considered models are close to the results of many of the teams that performed the same task in the SemEval 2016 competition. The best result was obtained using the combination of a CNN and an LSTM model.
As with the previous datasets, the next experiments involve the inclusion of information about emotion dimensions. The results are described in Table 4.6. The first half of the table reports the results of the experiments with the concatenated version of the word embeddings, and the second half refers to the experiments using the merged architectures.
Model Epochs Batch size Accuracy
MLP Bag of Words 5 32 57.6 %
Stack of Two LSTMs Word2Vec 10 32 57.4 %
Bidirectional LSTM Word2Vec 14 32 56.1 %
CNN Word2Vec 4 32 56.2 %
CNN-LSTM Word2Vec 11 32 58.2 %
MLP Doc2Vec 6 32 52.9 %
CNN Word2Vec 10 32 60.6 %
CNN-LSTM Word2Vec 9 32 59.4 %
Table 4.6: Tweet 2016 Dataset: model results using a concatenated version of the word embeddings, compared against the results of the merged architectures.
Comparing the first half of Table 4.6 with the results of Table 4.5, it is possible to see that some models improve their performance, while others perform worse. Despite some improvements, the best result remains that of the CNN-LSTM model from the previous experiments. Regarding the second half, the CNN model improves its performance, establishing itself as the best result, while the CNN-LSTM model improves its performance when compared with the first half of the table, but remains below the results described in Table 4.5. On this dataset, the improvements when using the emotion dimensions are not as sharp as in the other datasets, in the sense that there are fewer models with improvements. Nevertheless, the best result was obtained using this type of information. Taking this into account, the idea that this kind of information can improve the prediction performance in sentiment analysis tasks is still perhaps worth pursuing.
4.4.2 Dimensional Sentiment Analysis
This section describes results for the dimensional sentiment analysis task. Specifically, I
evaluate the results for each dimension (i.e., valence, arousal, and dominance) individually for
each dataset.
4.4.2.1 Affective Norms for English Text Dataset
The ANET dataset (Bradley and Lang, 2007) was used to experiment with a dimensional
sentiment analysis task on brief texts in the English language. The models used to perform this
task are described in Table 4.7, with their respective results.
Models Valence Arousal Dominance
Stack of Two LSTMs Word2Vec 0.400 0.235 0.366
Bidirectional LSTM Word2Vec 0.549 0.183 0.294
CNN Word2Vec 0.551 0.391 0.425
CNN-LSTM Word2Vec 0.434 0.172 0.451
Table 4.7: ANET Dataset: Prediction results for valence, arousal and dominance in terms of
the Pearson correlation.
Looking closely at Table 4.7, it is possible to conclude that the best performance, in terms of
the Pearson correlation coefficient, was achieved for the valence dimension, while in the arousal
dimension the results were much worse. In the valence dimension, the best model was the CNN
model and the worst the Stack of Two LSTMs. In the arousal dimension, the best was also the
CNN model and the worst the CNN-LSTM model. Finally, in the dominance dimension, the best
performance was obtained using the CNN-LSTM model, and the worst using the Bidirectional
LSTM model.
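As a reference, the Pearson correlation coefficient used as the evaluation metric throughout these experiments can be computed as in the following sketch (the predicted and gold ratings are made-up values for illustration only):

```python
import numpy as np

def pearson(pred, gold):
    """Pearson correlation coefficient between predicted and gold ratings."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    return np.corrcoef(pred, gold)[0, 1]

# Illustrative valence ratings on a 1-9 scale, as in ANET-style annotations.
gold = [7.2, 2.1, 5.5, 8.0, 3.3]
pred = [6.5, 3.0, 5.0, 7.1, 4.2]
print(round(pearson(pred, gold), 3))
```

A coefficient close to 1 indicates that the predictions track the gold ratings closely; values near zero, as observed for arousal in some of the experiments below, indicate almost no linear relationship.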
4.4.2.2 EmoTales Dataset
The EmoTales dataset (Francisco et al., 2012) was used to experiment with a dimensional
sentiment analysis task on sentences from 18 different folk tales. The results are described in
Table 4.8.
Models Valence Arousal Dominance
Stack of Two LSTMs Word2Vec 0.205 0.043 0.073
Bidirectional LSTM Word2Vec 0.221 0.064 0.037
CNN Word2Vec 0.152 0.025 -0.026
CNN-LSTM Word2Vec 0.246 0.049 0.033
Table 4.8: EmoTales Dataset: Prediction results for valence, arousal and dominance in Pearson
Correlation Coefficient.
Table 4.8 shows that, similarly to the previous experiment with ANET, the best results in
terms of the Pearson correlation coefficient are in the valence dimension. The arousal and
dominance dimensions have much worse results. Focusing in more detail on each dimension,
the best model in valence was the CNN-LSTM model and the worst the CNN model. In the
arousal dimension, the best was the Bidirectional LSTM while the worst was the CNN model.
Finally, in the dominance dimension, the best model was the Stack of Two LSTMs, and the
worst was the CNN model.
The results for this dataset are worse than those for the previous one. This can be due to
the fact that, in this dataset, the meaning of each individual sentence, and consequently its
emotion ratings, are not independent from the other sentences that compose each tale.
4.4.2.3 Facebook Messages Dataset
The Facebook Messages dataset (Preoctiuc-Pietro et al., 2016) was also used to experiment
with a dimensional sentiment analysis task, this time on Facebook posts. In contrast with the
previous experiments, where the results were reported for three dimensions, this dataset only
contains annotations for two dimensions, namely valence and arousal. Taking this into account,
the results are described in terms of these two dimensions alone, and are presented in Table 4.9.
The first half of the table describes the performance of the models used in my experiments and
the second half, in order to compare the results, describes the performance of previous work
using this dataset.
Models Valence Arousal
Stack of Two LSTMs Word2Vec 0.300 0.073
Bidirectional LSTM Word2Vec 0.310 0.081
CNN Word2Vec 0.345 0.096
CNN-LSTM Word2Vec 0.390 0.105
ANEW (Bradley and Lang, 1999) 0.307 0.085
Aff Norms (Warriner et al., 2013) 0.113 0.188
MPQA (Wilson et al., 2005) 0.385 -
NRC (Mohammad et al., 2013) 0.405 -
BoW model (Preoctiuc-Pietro et al., 2016) 0.650 0.850
Table 4.9: Facebook Messages Dataset: Prediction results for valence and arousal in Pearson
Correlation Coefficient.
Looking at the first half of Table 4.9, one can see that the best results, in terms of the
Pearson correlation coefficient, are in the valence dimension. In this dimension, the best
result was obtained using the CNN-LSTM model, and the worst was obtained using the Stack
of Two LSTMs model. In the arousal dimension, the best and the worst models are the same
as in the valence dimension.
The second half of Table 4.9 presents a number of different existing approaches. Comparing
with the results described in the first half, it is possible to see that some of the models studied
in this work surpass the results of other existing approaches. However, the results are still
some distance from the best approach, i.e., the BoW model by Preoctiuc-Pietro et al. (2016).
This may be due to the fact that the BoW model was trained using 10-fold cross-validation,
and so had the advantage of being trained and tested on the same type of data.
4.5 Overview
In this chapter, I presented the evaluation experiments concerning each specific task.
I performed experiments in different contexts/datasets, in order to obtain more robust results.
Despite the fact that I cannot exactly compare the results in the sentiment analysis task, due
to a lack of information, I conclude that in most cases the models advanced in this dissertation
are close to the best results presented in the literature. Furthermore, the idea that adding
information about the emotion dimensions can improve the prediction performance, in sentiment
analysis tasks, remains plausible and is reinforced by some positive indications.
In the dimensional sentiment analysis task, it was possible to see that the best results were
obtained with the ANET dataset. The remaining datasets yielded worse results, and the
Pearson correlation coefficient is very close to zero in some tests, which indicates that there is
almost no relationship between the predicted values and the correct values for each dimension.
However, and in contrast with the ANET and EmoTales datasets, it is possible to compare the
results obtained by the different models using the Facebook Messages dataset. Taking this into
account, it is noticeable that some of the models used in this particular case surpass the results
of other existing approaches. However, they are still some distance from the best approach,
namely the BoW model by Preoctiuc-Pietro et al. (2016). The reasons behind these results may
have different sources, for example an inadequate training set, or even the fact that the models
being used are not the most appropriate for performing this kind of task in these contexts.
The next chapter finishes this dissertation by presenting the conclusions regarding the work
developed while also introducing some directions in terms of future work.
5 Conclusions
This chapter presents the conclusions drawn throughout this dissertation as well as possible
approaches for future work. Section 5.1 presents all the conclusions while Section 5.2 describes
different approaches for future work.
5.1 Conclusions
This dissertation described the research work conducted in the context of my Master's thesis.
Throughout the document, I presented two different tasks, namely the sentiment analysis task
and the dimensional sentiment analysis task. The sentiment analysis task aims to predict the
polarity of a given textual document, while dimensional sentiment analysis aims to predict
emotion dimensions like valence, arousal and dominance.
In the sentiment analysis task, I put into practice the idea of adding information about the
emotion dimensions associated with words. From all the experiments, I can conclude that this
idea remains promising, being reinforced by some positive indications. The results showed that,
in some cases, adding this type of information improves the prediction performance.
In the dimensional sentiment analysis task, I created a new dataset corresponding to an
extension of the Warriner dataset. After being created, this dataset was used as the training set
in some models, in order to predict the emotion dimensions in different contexts. The results
showed that, in some cases, the models advanced in this dissertation performed better than
others, although in some contexts the results were not as good using the same methodology.
For each task I also presented the model architectures and their parameters, as well as the
details involved in the experimental evaluation.
5.2 Future Work
As future work, taking into account the positive indications given by the experimental
evaluation, many other experiments can also be done. In the sentiment analysis task, it would
be interesting, for instance, to experiment with the use of the emotion dimensions in other
models. Another possibility is to use other emotion dimension datasets, like the Extended
Warriner dataset created in this thesis, for trying to improve the results reported throughout
the document.
In the dimensional sentiment analysis task, the indications are not so good in some contexts.
However, for future, it would be interesting experiment with different datasets for training and
testing neural network models.
Furthermore, for both tasks, another possibility can be pre-train a model with a very large
dataset, such as the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015)
or the Paraphrase Database (PPDB) (Pavlick et al., 2015), in order to recognize equivalent
phrases. After this process, the idea is to use the learned parameters of the model for developing
other model that allows to predict the polarity or the emotion dimensions of a given text. In
more detail, the objective is try to explore big datasets to train models that are good modeling
sequences of words (i.e. phrases), and after use these representations in models with other goals,
in this case sentiment analysis and dimensional sentiment analysis.
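The transfer scheme outlined above can be sketched as follows. The sketch assumes a simple dictionary-of-arrays representation of model parameters; the layer names, sizes, and random values are hypothetical placeholders, not the actual parameters of any model described in this dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for parameters learned by pre-training a phrase model on a large
# corpus such as SNLI or PPDB (random values here, for illustration only).
pretrained = {
    "embedding": rng.normal(size=(1000, 50)),  # vocabulary size x embedding dim
    "encoder":   rng.normal(size=(50, 50)),    # sequence-encoder weights
}

def build_sentiment_model(pretrained_params):
    """Initialize a new model that reuses the pre-trained representation
    layers and adds a freshly initialized task-specific output layer."""
    return {
        "embedding": pretrained_params["embedding"].copy(),  # transferred
        "encoder":   pretrained_params["encoder"].copy(),    # transferred
        "output":    rng.normal(size=(50, 1)),  # new head for polarity/valence
    }

model = build_sentiment_model(pretrained)
print(model["output"].shape)  # only this layer would be trained from scratch
```

In practice the transferred layers could either be frozen or fine-tuned together with the new output layer on the sentiment analysis or dimensional sentiment analysis data.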
Bibliography
Andreevskaia, A. and S. Bergler (2008). When specialists and generalists work together:
overcoming domain dependence in sentiment tagging. In Proceedings of the Annual Meeting
of the Association for Computational Linguistics.
Augustyniak, L., T. Kajdanowicz, P. Szymanski, W. Tuliglowicz, P. Kazienko, R. Alhajj, and
B. K. Szymanski (2014). Simpler is better? lexicon-based ensemble sentiment classification
beats supervised methods. In Proceedings of the IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining.
Bowman, S. R., G. Angeli, C. Potts, and C. D. Manning (2015). A large annotated corpus
for learning natural language inference. In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing (EMNLP). Association for Computational Linguis-
tics.
Bradley, M. M. and P. J. Lang (1999). Affective norms for English words (ANEW): Stimuli,
instruction manual, and affective ratings. Technical report, Center for Research in Psychophys-
iology, University of Florida.
Bradley, M. M. and P. J. Lang (2007). Affective norms for English Text (ANET): Affective
ratings of text and instruction manual. Technical report, University of Florida, Gainesville,
Fl.
Church, K. and P. Hanks (1989). Word association norms, mutual information and lexicography.
In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
Dong, L., F. Wei, S. Liu, M. Zhou, and K. Xu (2014). A statistical parsing framework for
sentiment classification. Computing Research Repository.
Francisco, V., R. Hervas, F. Peinado, and P. Gervas (2012). Emotales: creating a corpus of folk
tales with emotional annotations. Language Resources and Evaluation.
Gao, W., S. Li, Y. Xue, M. Wang, and G. Zhou (2014). Semi-supervised sentiment classifica-
tion with self-training on feature subspaces. In Proceedings of the workshop Chinese Lexical
Semantics.
Godin, F., B. Vandersmissen, W. De Neve, and R. Van de Walle (2013). Multimedia Lab @ ACL
WNUT NER shared task: Named entity recognition for Twitter microposts using distributed
word representations.
Goller, C. and A. Kuchler (1996). Learning task-dependent distributed representations by back-
propagation through structure. In Proceedings of the International Conference on Neural
Networks.
Hermann, K. M. and P. Blunsom (2013). The role of syntax in vector space models of com-
positional semantics. In In Proceedings of the 51st Annual Meeting of the Association for
Computational Linguistics.
Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines: Methods, Theory
and Algorithms. Kluwer Academic Publishers.
Kalchbrenner, N., E. Grefenstette, and P. Blunsom (2014). A convolutional neural network for
modelling sentences. In Proceedings of the Annual Meeting of the Association for Computational
Linguistics.
Kamps, J. and M. Marx (2002). Words with attitude. In Proceedings of the International
WordNet Conference.
Kennedy, A. and D. Inkpen (2006). Sentiment classification of movie reviews using contextual
valence shifters. Computational Intelligence.
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the
Conference on Empirical Methods in Natural Language Processing.
Le, Q. and T. Mikolov (2014). Distributed representations of sentences and documents. In
Proceedings of the 31st International Conference on Machine Learning.
Li, S., L. Huang, J. Wang, and G. Zhou (2015). Semi-stacking for semi-supervised sentiment
classification. In Proceedings of the Annual Meeting of the Association for Computational
Linguistics and of the International Joint Conference on Natural Language Processing.
Mao, Y. and G. Lebanon (2006). Sequential models for sentiment prediction. In Proceedings
of the International Machine Learning Society Workshop on Learning in Structured Output
Spaces.
Mesnil, G., T. Mikolov, M. Ranzato, and Y. Bengio (2014). Ensemble of generative and discrim-
inative techniques for sentiment analysis of movie reviews. In Proceedings of the International
Conference on Learning Representations.
Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013). Distributed
representations of words and phrases and their compositionality. In Neural Information
Processing Systems.
Mikolov, T., K. Chen, G. Corrado, and J. Dean (2013). Efficient estimation of word represen-
tations in vector space. Computing Research Repository.
Mikolov, T., M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur (2010). Recurrent neural
network based language model. In Proceedings of the Annual Conference of the International
Speech Communication Association.
Mohammad, S. M., S. Kiritchenko, and X. Zhu (2013). NRC-Canada: Building the state-of-the-art
in sentiment analysis of tweets. Computing Research Repository.
Moreira, S., R. F. Astudillo, W. Ling, B. Martins, M. J. Silva, and I. Trancoso (2015). INESC-ID:
A regression model for twitter sentiment lexicon induction. In Proceedings of the International
Workshop on Semantic Evaluation.
Mou, L., H. Peng, G. Li, Y. Xu, L. Zhang, and Z. Jin (2015). Tree-based convolution: A new
neural architecture for sentence modeling. In Proceedings of the International Conference on
Computer Supported Collaborative Learning.
Mudinas, A., D. Zhang, and M. Levene (2012). Combining lexicon and learning based approaches
for concept-level sentiment analysis. In Proceedings of the International Workshop on Issues
of Sentiment Discovery and Opinion Mining.
Mullen, T. and N. Collier (2004). Sentiment analysis using support vector machines with di-
verse information sources. In Proceedings of the Conference on Empirical Methods on Natural
Language Processing.
Nakagawa, T., K. Inui, and S. Kurohashi (2010). Dependency tree-based sentiment classifica-
tion using crfs with hidden variables. In Human Language Technologies: The 2010 Annual
Conference of the North American Chapter of the Association for Computational Linguistics.
Osgood, C. E., G. J. Suci, and P. H. Tannenbaum (1957). The Measurement of Meaning.
University of Illinois Press.
Pang, B. and L. Lee (2005). Seeing stars: Exploiting class relationships for sentiment catego-
rization with respect to rating scales. In Proceedings of the Annual Meeting on Association
for Computational Linguistics.
Pavlick, E., P. Rastogi, J. Ganitkevitch, B. Van Durme, and C. Callison-Burch (2015). Ppdb
2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style
classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational
Linguistics and the 7th International Joint Conference on Natural Language Processing.
Preoctiuc-Pietro, D., H. A. Schwartz, G. Park, J. Eichstaedt, M. Kern, L. Ungar, and E. P.
Shulman (2016). Modelling valence and arousal in facebook posts. In Proceedings of the
Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
(WASSA).
Qiu, L., W. Zhang, C. Hu, and K. Zhao (2009). Selc: A self-supervised model for sentiment
classification. In Proceedings of the ACM Conference on Information and Knowledge Man-
agement.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). Learning internal representations
by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group
(Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition.
MIT Press.
Socher, R., B. Huval, C. D. Manning, and A. Y. Ng (2012). Semantic compositionality through
recursive matrix-vector spaces. In Proceedings of the Joint Conference on Empirical Methods
in Natural Language Processing and Computational Natural Language Learning.
Socher, R., C. C. Lin, A. Y. Ng, and C. D. Manning (2011). Parsing Natural Scenes and Natural
Language with Recursive Neural Networks. In Proceedings of the International Conference on
Machine Learning.
Socher, R., J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning (2011). Semi-supervised
recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference
on Empirical Methods in Natural Language Processing.
Socher, R., A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013).
Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing.
Stone, P. J., D. C. Dunphy, M. S. Smith, and D. M. Ogilvie (1966). The General Inquirer: A
Computer Approach to Content Analysis. The MIT Press.
Taboada, M., C. Anthony, and K. Voll (2006). Methods for creating semantic orientation dic-
tionaries. In Proceedings of the Conference on Language Resources and Evaluation.
Taboada, M., J. Brooke, M. Tofiloski, K. Voll, and M. Stede (2011). Lexicon-based methods for
sentiment analysis. Computational Linguistics.
Taboada, M. and J. Grieve (2004). Analyzing appraisal automatically. In Proceedings of the
AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications.
Tang, D., B. Qin, and T. Liu (2015a). Document modeling with gated recurrent neural network
for sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing.
Tang, D., B. Qin, and T. Liu (2015b). Learning semantic representations of users and products
for document level sentiment classification. In Proceedings of the Annual Meeting of the As-
sociation for Computational Linguistics and of the International Joint Conference on Natural
Language Processing.
Thelwall, M., K. Buckley, and G. Paltoglou (2012). Sentiment strength detection for the social
web. Journal of the Association for Information Science and Technology.
Thelwall, M., K. Buckley, G. Paltoglou, D. Cai, and A. Kappas (2010). Sentiment strength
detection in short informal text. Journal of the Association for Information Science and
Technology.
Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsu-
pervised classification of reviews. In Proceedings of the Annual Meeting on Association for
Computational Linguistics.
Vincent, P., H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol (2010). Stacked denois-
ing autoencoders: Learning useful representations in a deep network with a local denoising
criterion. Journal of Machine Learning Research.
Wang, S. and C. Manning (2013). Fast dropout training. In Proceedings of the 30th International
Conference on Machine Learning.
Wang, S. and C. D. Manning (2012). Baselines and bigrams: Simple, good sentiment and topic
classification. In Proceedings of the Annual Meeting of the Association for Computational
Linguistics.
Warriner, A. B., V. Kuperman, and M. Brysbaert (2013). Norms of valence, arousal, and
dominance for 13,915 English lemmas. Behavior Research Methods.
Wilson, T., J. Wiebe, and P. Hoffmann (2005). Recognizing contextual polarity in phrase-level
sentiment analysis. In Proceedings of the Conference on Human Language Technology and
Empirical Methods in Natural Language Processing.
Yang, M., W. Tu, Z. Lu, W. Yin, and K.-P. Chow (2015). Lcct: A semi-supervised model for
sentiment classification. In Proceedings of the Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies.
Zhang, Z., G. Wu, and M. Lan (2015). Ecnu: Multi-level sentiment analysis on twitter using
traditional linguistic features and word embedding features. In Proceedings of the International
Workshop on Semantic Evaluation.
Zhu, X. and Z. Ghahramani (2002). Learning from labeled and unlabeled data with label
propagation. In Proceedings of the Conference on Automated Learning and Discovery.