Sentiment Analysis with Deep Neural Networks
João Carlos Duarte Santos Oliveira Violante
Thesis to obtain the Master of Science Degree in
Telecommunications and Informatics Engineering
Supervisors: Prof. Bruno Emanuel da Graça Martins
Prof. Pavél Pereira Calado
Examination Committee
Chairperson: Prof. Luís Manuel Antunes Veiga
Supervisor: Prof. Bruno Emanuel da Graça Martins
Members of the Committee: Prof. Maria Luísa Torres Ribeiro Marques da Silva Coheur
November 2016
Acknowledgements
I would first like to thank Professors Pavel Calado and Bruno Martins for having contributed
to this work with their extensive knowledge and motivation.
Secondly, I would like to thank my family for all the unconditional support throughout all
these years, and also for giving me the opportunity to learn in this institute. A special thanks
goes to my sister for all the availability and assistance during this period.
Finally, I have to thank all my friends and colleagues who have supported me throughout
my academic career.
Lisbon, November 2016
João Carlos Duarte Santos Oliveira Violante
For my family,
Resumo
O aumento de utilizadores da Internet e o consequente aumento do volume de opiniões,
expressas pelos mesmos, nesse meio de comunicação, resultou em grandes fontes de informação.
Esta informação oferece-nos um importante feedback sobre determinados produtos ou serviços,
provocando um aumento do interesse em vários problemas, práticos ou académicos, que lidam
com a análise deste tipo de informação. A área que tem como objectivo resolver este tipo de
problemas é normalmente designada por sentiment analysis ou opinion mining. Tendo isto em
consideração, pretende-se com este trabalho abordar o tema de detecção do tipo de sentimento
expresso num determinado documento textual. Especificamente foram estudadas e comparadas,
em diferentes contextos, algumas das abordagens que representam o actual estado da arte,
maioritariamente relacionadas com o uso de redes neuronais profundas. Adicionalmente, testou-se
a possibilidade de melhorar os resultados dessas abordagens introduzindo alguma informação
sobre as dimensões das diferentes emoções expressas em cada um dos textos. Nesta dissertação
é apresentada uma descrição para as arquitecturas dos referidos modelos assim como a sua
comparação com os sistemas existentes actualmente. Os resultados experimentais obtidos mostram
que a ideia de adicionar informação sobre as emoções, em algumas situações, melhora o
desempenho de diferentes abordagens.
Abstract
The increasing amount of Internet users and the consequent increase of online user reviews,
expressing their opinions, has resulted in large sources of information. This information can give
us an important feedback about particular products or services, leading to a growing interest on
several problems that deal with the analysis of this type of information. This area of research is
typically called sentiment analysis or opinion mining. Considering the interest in this area, the
goal of this MSc research project was to address the topic of detecting the sentiment (positive
or negative) of the opinion expressed in a given textual document, by studying and comparing,
in different contexts, some of the approaches that represent the current state of art in the
area, which is mainly related to the use of deep neural networks. Additionally, this work tried
to improve the results of these methods, by adding some additional information about the
dimensions of the different emotions expressed in the documents. This dissertation presents
a description of the considered model architectures, as well as their comparison with existing
systems. Our experimental results show that adding information about the emotions can, in
some cases, improve the performance of different approaches.
Palavras Chave
Redes Neuronais Profundas
Classificação de Texto
Análise de Sentimentos
Polaridade de Opiniões
Análise de Emoções
Keywords
Deep Neural Networks
Text Classification
Sentiment Analysis
Opinion Polarity
Emotion Analysis
Contents
1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Structure of the Document
2 Fundamental Concepts and Related Work
  2.1 Fundamental Concepts
    2.1.1 Text Representation
    2.1.2 Text Classification
  2.2 Related Work
    2.2.1 Lexicon-Based Approaches
    2.2.2 Corpus-Based Approaches
    2.2.3 Combined Approaches
  2.3 Overview
3 Sentiment Analysis with Deep Neural Networks
  3.1 Text Representation
    3.1.1 Bag of Words
    3.1.2 Word to Vector
    3.1.3 Document to Vector
  3.2 Classification Models
    3.2.1 Overview on the Layers used within Deep Neural Networks
      3.2.1.1 Embedding Layers
      3.2.1.2 Dropout Layers
      3.2.1.3 LSTM Layers
      3.2.1.4 Dense Layers
      3.2.1.5 Convolutional Layers
      3.2.1.6 Pooling Layers and Flatten Layers
      3.2.1.7 Activation Layers
    3.2.2 Model Architectures
      3.2.2.1 Stack of LSTMs
      3.2.2.2 Bidirectional LSTMs
      3.2.2.3 Multi-Layer Perceptron
      3.2.2.4 Convolutional Neural Networks
      3.2.2.5 Combined CNN-LSTM Network
      3.2.2.6 Merged CNN
      3.2.2.7 Merged CNN-LSTM
  3.3 Overview
4 Experimental Evaluation
  4.1 Evaluation Methodology
    4.1.1 Evaluation for the Sentiment Analysis Task
    4.1.2 Evaluation for the Dimensional Sentiment Analysis Task
  4.2 Datasets
    4.2.1 Sentiment Analysis Datasets
    4.2.2 Dimensional Sentiment Analysis Datasets
    4.2.3 Word Embeddings
  4.3 Evaluation Metrics
  4.4 Experimental Results
    4.4.1 Sentiment Analysis
      4.4.1.1 Sentence Polarity Dataset
      4.4.1.2 Stanford Sentiment TreeBank Dataset
      4.4.1.3 Tweet 2016 Dataset
    4.4.2 Dimensional Sentiment Analysis
      4.4.2.1 Affective Norms for English Text Dataset
      4.4.2.2 EmoTales Dataset
      4.4.2.3 Facebook Messages Dataset
  4.5 Overview
5 Conclusions
  5.1 Conclusions
  5.2 Future Work
Bibliography
List of Figures
2.1 Tree-Based Convolutional Neural Network (Mou et al., 2015).
2.2 Gated Recurrent Neural Network Architecture (Tang et al., 2015a).
3.1 Model Architectures (Mikolov et al., 2013).
3.2 Examples of a) a traditional Recurrent Neural Network Architecture, and b) a Long Short Term Memory Architecture.
3.3 Stack of two LSTMs.
3.4 Bidirectional LSTM Architecture.
3.5 Multi-Layer Perceptron Architecture.
3.6 Convolutional Neural Network Architecture.
3.7 CNN-LSTM Architecture.
3.8 Merged CNN Architecture.
3.9 Merged CNN-LSTM Architecture.
List of Tables
4.1 Sentence Polarity Dataset: Model results using pre-trained word embeddings against previous works.
4.2 Sentence Polarity Dataset: Model results using a concatenated version of word embeddings, against the results of merged architectures.
4.3 Stanford Sentiment Treebank Dataset: Model results using pre-trained word embeddings against previous works.
4.4 Stanford Sentiment Treebank Dataset: Model results using a concatenated version of word embeddings against the results of merged architectures.
4.5 Tweet 2016 Dataset: Model results using pre-trained word embeddings against previous works.
4.6 Tweet 2016 Dataset: Model results using a concatenated version of word embeddings against the results of merged architectures.
4.7 ANET Dataset: Prediction results for valence, arousal and dominance in terms of the Pearson correlation coefficient.
4.8 EmoTales Dataset: Prediction results for valence, arousal and dominance in terms of the Pearson correlation coefficient.
4.9 Facebook Messages Dataset: Prediction results for valence and arousal in terms of the Pearson correlation coefficient.
Acronyms
NLP Natural Language Processing
CNN Convolutional Neural Network
RNN Recurrent Neural Network
LSTM Long Short-Term Memory
GRU Gated Recurrent Unit
MLP Multi Layer Perceptron
NB Naive Bayes
SVM Support Vector Machine
VAD Valence Arousal Dominance
BoW Bag of Words
CBoW Continuous Bag of Words
RAE Recursive Autoencoders
MV-RNN Matrix-Vector Recursive Neural Network
UPNN User Product Neural Network
TBCNN Tree-Based Convolution Neural Network
RNTN Recursive Neural Tensor Network
DCNN Dynamic Convolutional Neural Network
Paragraph-Vec Logistic Regression on top of Paragraph Vectors
CCAE Combinatorial Category Autoencoders
Sent-Parser Sentiment Analysis-Specific parser
NBSVM Naive Bayes SVM
MNB Multinomial Naive Bayes
G-Dropout Gaussian Dropout
F-Dropout Fast Dropout
Tree-CRF Dependency Tree with Conditional Random Fields
1 Introduction

In different areas, as well as in different contexts, the feedback provided by consumers of
a specific product or service has an unmatchable relevance. This kind of information has a
wide range of applications, bringing clear advantages to areas like marketing or politics. For
instance, collecting the opinions of consumers concerning a certain product or service can allow
a marketing company to achieve a more accurate assessment of the effectiveness of their last
campaign, and suggest to their clients the necessary adjustments to increase sales or become more
efficient. On the other hand, feedback from citizens can be used to measure the popularity, near
election time, of a particular candidate, allowing campaign managers to obtain more accurate
and timely information.
The aforementioned advantages are not limited to producers, embracing final consumers as
well. Searching for opinions about a certain product or service is nowadays practically mandatory
before purchase or subscription decisions. This is only possible because potential future
consumers can find feedback from former or current consumers, helping them to make the best
decision.
Before the Internet became a day-to-day tool for people, the access to this kind of information
was very limited, and feedback analysis was practically impossible at large scale. With
the rise of the Internet, the problem of the lack of information sources is almost completely
eliminated, giving users access to a wide range of opinions and experiences.
However, another obstacle arises along with the increasing availability of the Internet: how to
analyse all these information sources. Automatically predicting the emotion/sentiment/opinion
behind a textual document is not an easy task to perform. Many opinions are often
subtle and complex, including negation and sometimes sarcasm. Taking this into account, a
new research area emerged, often referred to as Sentiment Analysis or Opinion Mining.
1.1 Motivation
Sentiment analysis is an area of research with a broad scope, including tasks with different
degrees of complexity. A fundamental task in sentiment analysis consists in detecting words
that express a specific sentiment and then, through the detected words, assign a sentiment to
a particular textual document. Another and more complex task is called aspect-level sentiment
analysis, where the idea is to get a more powerful and fine-grained evaluation about the opinions
expressed for a particular topic. In this case, the different aspects that need to be evaluated
individually will be extracted first, and then the opinions related to each aspect will be evaluated.
For example, nowadays when we want to buy a cellphone we search for evaluations about specific
features like camera, processor, RAM or battery. The main idea behind aspect-level sentiment
analysis is to obtain the opinion expressed by the other users about each of these specific features.
Finally, in order to help the user to make the decision (buy or not buy), the sentiment analysis
tool assigns a global positive or negative value taking into account the evaluations of all the
individual features.
Still, using the conventional positive and negative sentiment evaluations is insufficient for
an accurate and more detailed evaluation. Opinions not only include positive or negative
sentiments, but they are also dependent on the emotional state of the writer at that moment. So,
adding an emotion detection system, going beyond detecting positive versus negative opinions
into more nuanced notions of opinion valence, can give us a stronger and more expressive
evaluation over the conventional approaches.
Emotions like joy, surprise, or anger are present in our daily life, making their analysis a
great source of information about, for instance, the emotional state of the employees of a company,
the consumers of a product, or even a country's population. Bearing this in mind, the expression
of emotions in written textual contents has been studied using two different approaches, namely
the discrete approach and the dimensional approach.
While the discrete approach sees emotions as a set of basic affective states that can
be easily identified by themselves, such as sadness, joy, or frustration, the dimensional approach
clusters affective states into a smaller set of major dimensions like valence, arousal and dominance
(VAD). In brief, valence represents an emotional dimension related to the attractiveness of an
object, event, or situation, while arousal represents a degree of emotional activation (physiological
and psychological), and finally dominance represents a change in the sensation of having
control over a situation. Although both approaches have their utility, the dimensional approach has
properties that make it more robust than the discrete one. While the discrete approach is
limited to only the emotions defined by the chosen theory (i.e., it is useful when we want to study
particular emotions), the VAD measures used in the dimensional approach are independent from
any cultural or linguistic interpretation.
Taking the aforementioned aspects into account, an interesting research problem relates to
the development of systems that include sentiment evaluation as well as emotional evaluation.
1.2 Contributions
This thesis makes the following contributions:
• Development of methods for addressing a sentiment analysis task that aims to predict
the sentiment/polarity of a given text. With this in mind, and in order to obtain the
best possible results in this task, different approaches were tested, ranging from simple
and common models like Naive Bayes and Support Vector Machine classifiers, leveraging
bag-of-words representations, to more complex models that are nowadays considered the
state of the art, namely deep neural networks. The considered deep neural network models are
divided into two categories: Recurrent Neural Networks (RNNs) and Convolutional Neural
Networks (CNNs). In addition, within each of these general categories, the number
of different architectures and possible configurations is very large. Furthermore, different
input representations can be used, namely the bag of words or representations based on
word/phrase embeddings. Datasets from different contexts and different topics were used
to support an extensive set of comparative experiments. Specifically, the Sentence Polarity
dataset v1.0 (Pang and Lee, 2005) jointly with the Stanford Sentiment TreeBank (Socher
et al., 2013), containing data from movie reviews, and the Tweet 2016 dataset
(http://alt.qcri.org/semeval2016/task4/), with data from Twitter posts, were used in the
experimental evaluation, and the results showed that
in most cases the models used are close to the results presented in the literature.
• Development of methods for addressing a dimensional sentiment analysis task that aims to
predict three emotion dimensions, namely valence, arousal, and dominance. The valence
value indicates the pleasantness of the stimulus, the arousal value the intensity of the
emotion caused by the stimulus, and the dominance value indicates the degree of control
implied by the stimulus. In order to predict these values, regression models using deep
neural networks were applied. The training dataset, in this case, is an expanded version of
an existing dataset of word ratings (Warriner et al., 2013) that, instead of containing only
the VAD values for single English words, also has these values for phrases. This expanded
dataset was created in the context of this MSc thesis and extracts information from the
Paraphrase Database (Pavlick et al., 2015). The test datasets are the Affective Norms for
English Text (ANET) (Bradley and Lang, 2007), the EmoTales (Francisco et al., 2012),
and the Facebook Messages (Preoctiuc-Pietro et al., 2016) datasets. The experimental results
showed that the best results were obtained on the ANET dataset and, although the other
datasets yielded worse results, they sometimes surpass the performance of other existing approaches.
• The final contribution consists in combining the aforementioned two types of information.
The idea is to use information about the emotion dimensions to help the sentiment analysis
models to improve their prediction performance. Experiments with deep neural networks
using two different types of word representations, namely pre-trained word embeddings on the one
hand, and word embeddings extended with the emotion dimension values on the other, showed that
adding information about the emotion dimensions, in some cases, improves the prediction
performance in sentiment analysis models.
1.3 Structure of the Document
The remainder of this document is organized as follows. Chapter 2 presents fundamental
concepts required to understand the topics discussed in this thesis, namely text representation
approaches and common algorithms to solve text classification problems. This chapter also
describes previous related work, grouped into three categories: lexicon-based approaches,
corpus-based approaches, and combined approaches. Chapter 3 describes the different models used in
each particular task, specifically the sentiment analysis and the dimensional sentiment analysis
tasks. Particular emphasis is given to the discussion of deep learning model architectures.
Chapter 4 describes the evaluation methodology, the datasets, the word embeddings used in
the context of the deep learning models, the evaluation metrics, and the experimental results.
Finally, Chapter 5 concludes this document by summarizing its main points, and presenting
directions for future work.
2 Fundamental Concepts and Related Work
This chapter details the basic concepts needed to understand the topics discussed throughout
the document, and also some sentiment analysis approaches studied in previous work.
2.1 Fundamental Concepts
This section presents the main concepts required to understand the topics discussed in
this document. Section 2.1.1 describes the methods that are traditionally used to represent
phrases/documents. Section 2.1.2 introduces some of the most common algorithms used in text
classification problems.
2.1.1 Text Representation
In the context of text classification, documents are typically represented through sets of
smaller components, like words, n-grams of words (i.e., sequences of n contiguous words in a
text) or n-grams of characters (i.e., sequences of n contiguous characters in a text).
Besides sets, another common representation for documents is the vector space model ap-
proach. The vector space model approach is widely used in information filtering and information
retrieval to compute a continuous degree of similarity between documents. Each document d
is represented as a feature vector d = 〈w1,d, w2,d, ..., wn,d〉, where n is the number of
features, and where each wf,d corresponds to a weight that reflects the importance of the feature f
for describing the contents of document d. The different features can, for instance, correspond
to words, n-grams of words, or n-grams of characters.
In the vector space model approach, the weight of each feature can be computed in several
ways. The methodology used to compute the weights is usually known as term weighting
scheme. One of these schemes involves using binary weights, where wf,d is zero or one, depending
on whether or not the feature f is present in the document d.
Another popular term weighting scheme is called TF-IDF. The motivation for its existence
is that there are words occurring in each document that are also very frequent in many other
documents and, thus, they should not contribute to the comparison process with the same weight
of words that are more specific to some domains. The TF-IDF weighting scheme combines the
individual frequency of each feature f in the document d (i.e., a component represented as
TFf,d) with the inverse frequency of the feature f in the entire collection of documents (i.e.,
IDFf). There are different ways to compute the term frequency. However, the most common
is simply counting the number of occurrences of the feature f within document d without any
further normalization. The inverse document frequency (IDF) is a measure of feature importance
within the collection of documents. A feature that appears in most of the documents of a given
collection is not important to discriminate between the different documents. Taking this into
account, the IDF is based on the inverse of the number of documents in which a feature occurs, and is
computed as follows:
IDFf = log(N / df)   (2.1)
In Equation 2.1, N corresponds to the number of documents in the collection and df corresponds
to the number of documents containing the feature f.
The TF-IDF weight of a feature f for a document d is defined as follows:
TF-IDFf,d = TFf,d × IDFf (2.2)
As described previously, the vector space model allows us to evaluate the degree of similarity
between two documents d1 and d2, as the correlation between their vector representations V (d1)
and V (d2). This correlation can be computed using, for example, the cosine similarity metric:
sim(d1, d2) = V(d1) · V(d2) / (||V(d1)|| × ||V(d2)||)   (2.3)
In Equation 2.3, the numerator is the inner product of the vectors V(d1) and V(d2) and the
denominator is the product of their Euclidean lengths. If we also represent a query as a vector,
it is possible to compute the similarity between the documents and the query, this way extracting
and ranking the most relevant documents.
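As an illustration, the TF-IDF weighting and cosine similarity just described can be sketched in a few lines of Python (a minimal example over a toy collection of three documents; whitespace tokenization and raw counts for the TF component are simplifying assumptions, not choices made in this thesis):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weight vectors for a small document collection.

    TF is the raw count of a feature in a document; IDF follows
    Equation 2.1, IDF_f = log(N / df_f)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    df = {f: sum(1 for toks in tokenized if f in toks) for f in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[f] * math.log(n / df[f]) for f in vocab])
    return vocab, vectors

def cosine(v1, v2):
    """Cosine similarity between two weight vectors (Equation 2.3)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

docs = ["the movie was great", "the movie was terrible", "a great great film"]
vocab, vecs = tfidf_vectors(docs)
# Documents sharing the discriminative feature "great" come out more similar.
print(cosine(vecs[0], vecs[2]) > cosine(vecs[1], vecs[2]))  # True
```

Note how features occurring in most documents (here, "the", "movie", "was") receive low IDF weights and therefore contribute little to the similarity, which is exactly the motivation given above for the TF-IDF scheme.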
2.1.2 Text Classification
Text classification is nowadays typically addressed through supervised machine learning
methods. Training data, i.e. vectors x labeled by humans with a class y, are used to learn a
function d(x), known as a classifier, that aims to automatically predict the class to which a new
instance of test data belongs. Over the years, many different learning methods have been
introduced to address the task of finding the function d(x), such as nearest neighbour classifiers
(Mao and Lebanon, 2006), linear classifiers (Mullen and Collier, 2004), or tree-based models
(Augustyniak et al., 2014).
In sentiment analysis, the most commonly used methods are linear classifiers. These methods
define the function d(x) in terms of a linear combination of the individual dimensions from the
predictor variables (i.e., features). There are two broad classes of methods for determining the
parameters of linear classifiers: the generative approach and the discriminative approach.
The generative approach learns the joint probability distribution P(X,Y ). The Naive Bayes
(NB) classifier is a probabilistic model based on Bayes' theorem. Specifically, the NB classifier is
a classification algorithm that assumes the features X1...Xn as being conditionally independent
of one another, given Y . This assumption dramatically simplifies the representation of P(X|Y )
and the problem of estimating it from the training data. When X contains n attributes which
are conditionally independent of one another given Y , we have:
P(X1...Xn|Y) = ∏i=1..n P(Xi|Y)   (2.4)
Considering that Y is any discrete-valued variable and the features X1...Xn are any discrete
or real-valued variables, our goal is to train a classifier that will output the probability distribu-
tion over possible values of Y , for each new instance X that we ask it to classify. To compute
the probability that Y will take on each of its possible values yk, we use the following equation:
P(Y = yk|X1...Xn) = P(Y = yk) ∏i P(Xi|Y = yk) / ∑j P(Y = yj) ∏i P(Xi|Y = yj)   (2.5)
Given a new instance xnew = 〈X1...Xn〉, Equation 2.5 shows how to calculate the probability
that Y will take on any given value, given the observed attribute values of xnew and the
distributions P(Y ) and P(Xi|Y ) estimated from the training data. In order to choose the most
probable value of Y , we use the following classification rule:
9
Y ← arg maxyk P(Y = yk) ∏i P(Xi|Y = yk)   (2.6)
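The estimation and classification steps above can be sketched as follows (a toy example assuming a tiny labeled collection of review snippets with words as the features Xi; add-one smoothing, which Equation 2.6 does not mention, is an added assumption used to avoid zero probabilities):

```python
import math
from collections import Counter, defaultdict

# Toy training set: (document, class) pairs, with 1 = positive, 0 = negative.
train = [("good great fun", 1), ("great acting good plot", 1),
         ("bad boring plot", 0), ("terrible bad acting", 0)]

# Estimate P(Y) and P(Xi|Y) from counts, with add-one (Laplace) smoothing.
class_counts = Counter(y for _, y in train)
word_counts = defaultdict(Counter)
for doc, y in train:
    word_counts[y].update(doc.split())
vocab = {w for counts in word_counts.values() for w in counts}

def predict(doc):
    """Classification rule of Equation 2.6, in log space to avoid underflow."""
    best_y, best_score = None, -math.inf
    for y in class_counts:
        score = math.log(class_counts[y] / len(train))   # log P(Y = y)
        total = sum(word_counts[y].values())
        for w in doc.split():                            # + sum of log P(Xi|Y = y)
            score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_y, best_score = y, score
    return best_y

print(predict("good plot"), predict("boring terrible plot"))  # 1 0
```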
The discriminative approach learns the conditional probability distribution P(Y|X). Logistic
regression is an approach to learning functions of the form f : X → Y, or P(Y|X), in the case
where Y is discrete-valued and X = 〈X1...Xn〉 is any vector containing discrete or continuous
variables.
This approach assumes a parametric form for the distribution P(Y|X), and then directly
estimates its parameters from the training data. The parametric model assumed by logistic
regression, in the case where Y can take on any of the discrete values {y1, ..., yk}, is the following:

P(Y = yk|X) = 1 / (1 + ∑j=1..k−1 exp(wj0 + ∑i=1..n wji Xi))   (2.7)
In the formula, wji denotes the weight associated with the j-th class Y = yj and with the input
Xi. To classify any given instance X, we generally want to assign the label yk that maximizes
P(Y = yk|X).
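A minimal sketch of the two-class case, where Equation 2.7 reduces to P(Y = 1|X) = 1 / (1 + exp(−(w0 + ∑i wi Xi))), trained by gradient ascent on the conditional log-likelihood, could look as follows (the toy features, learning rate, and epoch count are illustrative assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(data, lr=0.5, epochs=200):
    """Gradient ascent on the log-likelihood of a binary logistic regression."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)           # w[0] is the bias term w0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            grad = y - p          # gradient of the log-likelihood wrt the logit
            w[0] += lr * grad
            for i in range(n):
                w[i + 1] += lr * grad * x[i]
    return w

# Tiny separable set: feature vector = (#positive words, #negative words).
data = [((2, 0), 1), ((3, 1), 1), ((0, 2), 0), ((1, 3), 0)]
w = train_logreg(data)
p = sigmoid(w[0] + w[1] * 2 + w[2] * 0)
print(p > 0.5)  # a clearly positive document gets P(Y = 1|X) > 0.5: True
```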
Support Vector Machines (SVMs) are discriminative machine learning methods designed to
solve binary classification problems. The main principle behind SVMs consists in minimizing
the empirical classification error while finding an optimal classification hyperplane with a large
margin. Specifically, the idea is not only to make a correct prediction, but also to make a confident
prediction (Joachims, 2002).
Let D be a set of n points of the form (X1, y1), ..., (Xn, yn), where each yi is either 1 or
−1 and indicates the class to which the point Xi (a p-dimensional real vector, representing the
instance) belongs. The goal is to find the maximum-margin hyperplane that divides the group
of points Xi for which yi = 1 from the group of points for which yi = −1. The hyperplane can
be written as the set of points X satisfying w · X − b = 0, where w is the normal vector to the
hyperplane.
When the training data are linearly separable, it is possible to select two parallel hyperplanes
that separate the two classes of the data, such that the distance between them is as large as
possible. The region between these two hyperplanes is called the margin, and the maximum-margin
hyperplane is the hyperplane that lies halfway between them. These two hyperplanes can be
represented by the following equations:
10
w · X − b = 1   and   w · X − b = −1   (2.8)
The geometric distance between these two hyperplanes is 2/||w||. To maximize the distance
between the hyperplanes, we have to minimize ||w||. However, we also have to prevent points from
falling into the margin. Putting this together, we get the following optimization problem:

Minimize ||w|| subject to yi(w · Xi − b) ≥ 1, for i = 1, ..., n   (2.9)
When the training data are not linearly separable, we introduce the hinge loss function:

max(0, 1 − yi(w · Xi − b))   (2.10)
Equation 2.10 returns zero if the constraint yi(w · Xi − b) ≥ 1 is satisfied for all 1 ≤ i ≤ n,
i.e., if Xi lies on the correct side of the margin. For data on the wrong side of the
margin, the function returns a value proportional to the distance from the margin. We therefore
wish to minimize:
[ (1/n) ∑i=1..n max(0, 1 − yi(w · Xi − b)) ] + λ||w||²   (2.11)
In the equation, the parameter λ determines the trade-off between increasing the margin
size and ensuring that each Xi lies on the correct side of the margin. In order to solve this
optimization problem, we can use, for instance, the sub-gradient descent approach.
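The sub-gradient descent approach just mentioned can be sketched as follows, directly minimizing the soft-margin objective of Equation 2.11 (a toy implementation; the data, learning rate, and regularization constant are illustrative assumptions, and in practice one would use an off-the-shelf SVM solver):

```python
def train_svm(data, lam=0.01, lr=0.1, epochs=500):
    """Per-sample sub-gradient descent on:
       (1/n) * sum_i max(0, 1 - y_i (w·x_i - b)) + lam * ||w||^2  (Eq. 2.11)."""
    n_feat = len(data[0][0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) - b)
            if margin < 1:   # point inside the margin: hinge loss is active
                for i in range(n_feat):
                    w[i] += lr * (y * x[i] - 2 * lam * w[i])
                b -= lr * y
            else:            # only the regularizer contributes a sub-gradient
                for i in range(n_feat):
                    w[i] -= lr * 2 * lam * w[i]
    return w, b

# Tiny separable set with labels in {1, -1}, as in the SVM formulation above.
data = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((0.0, 2.0), -1), ((1.0, 3.0), -1)]
w, b = train_svm(data)

def predict(x):
    """Sign of w·x - b, i.e., which side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else -1

print(predict((3.0, 0.0)), predict((0.0, 3.0)))  # 1 -1
```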
Furthermore, a new branch of machine learning that also allows one to address text
classification is called deep learning. The property that makes deep learning distinctive is
that it studies deep neural networks as classification models, i.e., neural networks with many
layers that are typically trained end-to-end. The use of multiple layers allows models to
progressively warp the data into a form where it is easy to solve the specific classification task.
In these models, each layer is a function acting on the output of the previous layer, so we can
say that the network is a chain of composed functions, and also that this chain is optimized to
perform the specific task. As described before, neural networks transform the data through the
existing layers, making their task easier to address. These transformed versions of the data are
called representations.
A representation in deep learning is a way to embed the data in k dimensions. Following this
logic, two functions can only be composed together if their types/representations agree, and the
choice of representation is made over the course of training (adjacent layers negotiate the
representation they will use to communicate). Meeting these requirements is necessary to obtain
good performance from a particular network.
One such type of representation is typically called word embeddings. This kind of representation
is built or used to solve natural language processing tasks, where the input to the network is
typically one or more words. A word can initially be represented as a unit vector in a very
high-dimensional space, with each dimension corresponding to a word in the vocabulary. The
network then warps and compresses this space, mapping words into a lower-dimensional one. This
new representation for words has some very useful properties. One is that words with similar
meanings tend to be close in the resulting space; for instance, the words good and great will
correspond to vectors close to each other. Another, no less important, property is that difference
vectors between words seem to encode analogies: for example, the difference between the woman
and man vectors is approximately the same as the difference between the queen and king vectors.
Pre-trained word embeddings are nowadays typically used when addressing Natural Language
Processing (NLP) tasks with deep neural networks.
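The two properties above can be illustrated with cosine similarity over a handful of hand-picked toy vectors; the 3-dimensional values below are illustrative assumptions (real embeddings have hundreds of dimensions and are learned from data).

```python
import numpy as np

# Toy 3-dimensional "embeddings"; the numbers are illustrative, not trained.
emb = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "bad":   np.array([-0.9, 0.1, 0.0]),
    "king":  np.array([0.5, 0.9, 0.1]),
    "queen": np.array([0.5, 0.9, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words with similar meaning end up close together in the space.
assert cosine(emb["good"], emb["great"]) > cosine(emb["good"], emb["bad"])

# Difference vectors approximate analogies: king - man + woman ≈ queen.
analogy = emb["king"] - emb["man"] + emb["woman"]
nearest = max((w for w in emb if w not in {"king", "man", "woman"}),
              key=lambda w: cosine(emb[w], analogy))
```

Excluding the query words when searching for the nearest neighbor follows the usual convention in analogy evaluation.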
Furthermore, another key point in modern neural networks is the fact that many copies of one
neuron can be used in the same network. However, writing the same code multiple times not only
increases the risk of introducing bugs, but also makes it more difficult to catch mistakes. Just as
the abstraction of functions is essential in programming, we can use one function instead of
multiple copies of a neuron. This technique is traditionally called weight tying and is fundamental
to the good results that deep learning has achieved in many different tasks.
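The idea can be made concrete with a minimal recurrent step: one function, holding one set of weights, is reused at every position of the sequence. The dimensions and random initialization below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4)) * 0.1   # hidden-to-hidden weights
W_x = rng.normal(size=(4, 3)) * 0.1   # input-to-hidden weights

def rnn_step(h, x):
    """One recurrent step. The SAME W_h and W_x are reused at every
    position in the sequence; this reuse is what weight tying means."""
    return np.tanh(W_h @ h + W_x @ x)

sequence = [rng.normal(size=3) for _ in range(5)]
h = np.zeros(4)
for x in sequence:        # the same function (same weights) at each step
    h = rnn_step(h, x)
```

Whatever the sequence length, the number of parameters stays fixed, which is exactly the benefit that weight tying provides.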
The following list describes a set of widely used neural network patterns, such as recurrent
layers and convolutional layers. These patterns are nothing more than functions which take
functions as arguments, i.e. higher-order functions. Some of the most common patterns are:
• General Recurrent Neural Network - used to make predictions involving sequences (e.g.,
sequences of words). Many different types of recurrent neural networks have been proposed,
including long short-term memory (LSTM) networks and gated recurrent units (GRUs).
• Encoding Recurrent Neural Network - used to allow a neural network to take a variable-length
list as input, for instance taking a sentence as input.
• Generating Recurrent Neural Network - used to allow a neural network to produce a list
of outputs, such as words in a sentence.
• Bidirectional Recurrent Neural Network - used to make predictions over a sequence, taking
into account both past and future contexts.
• Convolutional Neural Network - used to look at neighboring elements, applying a function
to a small window around every element.
• Recursive Neural Network - used for natural language processing, allowing neural networks
to operate on parse trees.
In order to build more complex and larger networks, these patterns (i.e., these building blocks)
can also be combined. Some of the aforementioned blocks will be detailed later in this
dissertation.
2.2 Related Work
Existing sentiment analysis approaches can be divided into two main categories based on the
source of information they use: the lexicon-based approach and the corpus-based approach.
The lexicon-based approach essentially calculates the orientation (i.e., positive or negative
sentiment) of a text by aggregating the semantic orientations of its words. On the other hand,
the corpus-based approach uses supervised learning algorithms to train a sentiment classifier
from training data. Both categories have their advantages and disadvantages, and in some cases
researchers combine the best of both, building hybrid models.
2.2.1 Lexicon-Based Approaches
A lexicon-based approach starts with a set of terms with known sentiment orientation. After
selecting the set of terms, an algorithm is used to estimate the sentiment of a text based upon
the occurrences of these words. Some of these approaches have been improved with additional
information, such as emoticon lists and negation word lists (e.g., not or don't); see for instance
the paper by Taboada et al. (2011).
The SentiStrength approach developed by Thelwall et al. (2010) is a lexicon-based classifier
that additionally uses non-lexical linguistic information and rules to detect the sentiment of a
text. In detail, SentiStrength uses as key elements the following resources:
• A word list with human polarity and strength judgements;
• A spelling correction algorithm, which identifies the standard spelling of words that have
been misspelled by the inclusion of repeated letters (e.g., the word awwwesome would
be identified as awesome by this algorithm);
• A booster word list, used to strengthen or weaken emotion words. For example, the
words very and extremely increase the emotion of nearby words;
• A negation word list, used to invert emotion words;
• An idiomatic expression list is used to detect the sentiment of a few common phrases;
• Repeated letters besides those needed to correct spelling are used to give a strength boost.
Thus, words that have many repeated letters will be quantified with more strength;
• An emoticon list is used to detect additional sentiment;
• Sentences with exclamation marks have a minimum positive strength of 2. Having at least
one exclamation mark gives a strength boost of 1 to the immediately preceding emotion
word or phrase;
• Negative emotions are ignored in questions.
For each text, after applying the above key elements, the SentiStrength algorithm outputs
two integers. The first represents the positive sentiment strength from 1 to 5, where 1 means no
sentiment and 5 strong sentiment, and the second represents the negative sentiment strength
in the same way. These two scores can also be combined: for instance, a text with a combined
score of (2, 5) would contain weak positive strength and strong negative strength. However, as
this approach was initially only tested on the short informal friendship messages of the social
networking service MySpace, Thelwall et al. (2012) developed a new version called SentiStrength2
that handles a wider variety of types of text.
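A few of the rules listed above can be sketched in a minimal scorer. The word strengths, booster list, and repeated-letter boost below are illustrative simplifications of SentiStrength's behavior, not its actual word lists.

```python
import re

# Illustrative strength values on the -5..5 scale described in the text.
STRENGTH = {"love": 3, "awesome": 4, "hate": -3, "awful": -4}
BOOSTERS = {"very": 1, "extremely": 2}

def senti_scores(text):
    """Return (positive, negative) strengths, each on the 1..5 scale."""
    pos, neg = 1, 1
    boost = 0
    for word in text.lower().split():
        if word in BOOSTERS:
            boost = BOOSTERS[word]
            continue
        # Repeated letters (e.g. 'awwwesome') give a strength boost of 1.
        collapsed = re.sub(r"(.)\1{2,}", r"\1", word)
        letter_boost = 1 if collapsed != word else 0
        s = STRENGTH.get(collapsed, 0)
        if s > 0:
            pos = max(pos, min(5, s + boost + letter_boost))
        elif s < 0:
            neg = max(neg, min(5, -s + boost + letter_boost))
        boost = 0
    return pos, neg

assert senti_scores("I love it but the ending is awful") == (3, 4)
assert senti_scores("awwwesome") == (5, 1)
```

The real system additionally applies negation, idiom, emoticon, and punctuation rules over human-judged word lists.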
To address the weak performance of SentiStrength in negative sentiment strength detection,
some changes were also introduced. The main change was to extend the sentiment word list with
negative General Inquirer (GI) terms (Stone et al., 1966). Second, the sentiment word terms
were tested against a dictionary to check for incorrectly matching words and to derive words
that did not match. Third, negating negative terms now makes them neutral. Fourth, the list of
idiomatic expression terms was extended; for instance, is like has strength 1 because like is a
comparator after is. Finally, the special rule for negative sentiment in questions was removed.
The experimental evaluation showed that SentiStrength2 performed significantly above several
baselines on different datasets.
Another lexicon-based approach to extract sentiment from text is called the Semantic
Orientation CALculator (SO-CAL). The SO-CAL approach rests on two principles: first, that
words have a semantic orientation that is independent of context, and second, that this semantic
orientation can be expressed as a numerical value. In previous versions of SO-CAL (Taboada
and Grieve, 2004; Taboada et al., 2006), the classification task was based only on an adjective
dictionary. However, the current version (Taboada et al., 2011) is composed of different
dictionaries, including adjectives, verbs, nouns, and adverbs. In addition to these dictionaries,
the SO-CAL approach also incorporates valence shifters such as intensifiers, downtoners, negation
and irrealis markers (i.e., words that can change the meaning of sentiment words). In order
to build the main adjective dictionary, Taboada et al. (2011) manually tagged all adjectives
found in a development corpus, on a scale ranging from −5 for extremely negative to +5 for
extremely positive, where 0 indicates neutral words, which were not included in the dictionary.
The remaining dictionaries were built in a similar way, but each has its peculiarities.
The adverb dictionary was built automatically using the adjective dictionary, matching
adverbs ending in -ly to their potentially corresponding adjectives, except for some words
that were tagged or modified by hand. If SO-CAL encounters a word tagged as an adverb
that is not yet in the dictionary, the system stems the word and tries to match it to an
adjective in the main dictionary. The verb dictionary contains, in addition to simple
verbs, multi-word expressions such as fall apart. All nouns and verbs found in the text are
lemmatized, i.e. the inflected forms of a word are grouped so that they can be analyzed as a
single word.
Taboada et al. (2011) also incorporated some valence shifters into their method. Intensification
was modeled using modifiers, with each intensifying word having a percentage associated
with it. Words that increase the intensity of sentiment are called amplifiers, whereas words that
decrease it are called downtoners. For instance, excellent has an SO-value of 5, and thus most
excellent would have an SO-value of 5 × (100% + 100%) = 10. Besides adverbs and adjectives,
other intensifiers include all capital letters, the use of exclamation marks, and the use of the
discourse connective but to indicate more salient information.
In terms of negation, some words such as not, never, without or lack can occur at a
significant distance from the lexical item they affect. Taking this into account, the SO-CAL
approach includes two options for negation search. The first looks backwards until a clause
boundary marker (i.e., punctuation or a sentential connective) is reached. The second looks
backwards as long as the words found are in a backward search skip list. After finding the
words affected by negation, instead of changing the sign, the SO-value is shifted toward the
opposite polarity by a fixed amount.
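The intensifier percentages and the polarity-shift treatment of negation can be sketched together. The dictionary values, percentages, and the shift amount of 4 below are illustrative assumptions for the sketch, not SO-CAL's actual lists.

```python
# Illustrative SO-CAL-style scoring (all values are assumptions).
SO = {"excellent": 5, "good": 3, "sleazy": -3}
INTENSIFIERS = {"most": 1.00, "very": 0.25, "slightly": -0.50}
NEGATORS = {"not", "never", "without"}
SHIFT = 4  # negation shifts toward the opposite polarity by a fixed amount

def score_phrase(words):
    """Score the final sentiment word of a phrase, applying any preceding
    intensifier (percentage modifier) or negator (polarity shift)."""
    value = SO.get(words[-1], 0)
    for w in words[:-1]:
        if w in INTENSIFIERS:
            value *= 1.0 + INTENSIFIERS[w]
        if w in NEGATORS:
            value += -SHIFT if value > 0 else SHIFT
    return value

assert score_phrase(["most", "excellent"]) == 10   # 5 * (100% + 100%)
assert score_phrase(["not", "good"]) == -1         # 3 shifted by 4, not sign-flipped
```

Note how the shift makes not good mildly negative (−1) rather than the strongly negative −3 that flipping the sign would produce, which matches the motivation given above.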
The irrealis blocking filter was created by taking into account that a number of markers
indicate that the words in a sentence might not be reliable for sentiment analysis. Irrealis
markers are words that can change the meaning of sentiment words in very subtle ways. The
implemented solution consists of ignoring the semantic orientation of any word in the scope
of an irrealis marker, within the same clause. The dictionary of irrealis markers is composed
of modals, conditional markers, negative polarity items such as any and anything, certain verbs,
questions, and words enclosed in quotes.
The final polarity decision for a given text is determined by the average sentiment strengths
(SO values) of the words detected, after modifications.
The experimental evaluation concluded that the new version improves on the performance of
the previous version of SO-CAL. However, the main conclusion is that lexicon-based methods
for sentiment analysis can be robust, resulting in good cross-domain performance, and that they
can easily be improved with multiple sources of knowledge.
2.2.2 Corpus-Based Approaches
A corpus-based approach, as previously described, involves building classifiers from training
data. The training data consist of a set of training examples, each composed of an input
object and a desired output value. In sentiment analysis there are many approaches that use
this concept, and I will separate them into two categories: (i) combining classifiers for sentiment
analysis, and (ii) neural networks for sentiment analysis.
Combining Classifiers for Sentiment Analysis
One of the challenges in sentiment analysis is how to represent variable-length documents,
given that simple bag-of-words (BoW) approaches lose word order information. One possibility
involves the use of advanced machine learning techniques such as recurrent neural networks
(Mikolov et al., 2010; Socher et al., 2011). However, it is not clear whether this method
results in improvements over simple bag-of-words and bag-of-ngram techniques.
Mesnil et al. (2014) compared several different approaches and concluded that model
combination performs better than any individual technique. This is due to the fact that
ensembles benefit most from models that are complementary, i.e. each model is put to better
use. Following this, and since the majority of models proposed in the literature are
discriminative, the authors proposed to combine generative and discriminative models together,
to improve the performance of the ensemble in sentiment prediction.
In terms of the generative model, Mesnil et al. (2014) implemented an n-gram language
model using the SRILM toolkit, relying on modified Kneser-Ney smoothing, although this
model suffers from large memory requirements. To address this issue, the authors implemented
a recurrent neural network (Mikolov et al., 2010), which significantly outperforms an n-gram
language model. In both cases, the Bayes rule was used to compute the probability of a test
sample belonging to the positive or negative class.
For the discriminative model, the authors implemented a supervised re-weighting of the
counts, as in the Naive Bayes Support Vector Machine (NB-SVM) approach (Wang and
Manning, 2012). Specifically, this approach computes a log-ratio vector between the average
word counts extracted from positive and negative documents, and the input to the logistic
regression classifier (which can be replaced by a linear SVM) corresponds to the log-ratio
vector multiplied by the binary pattern for each word in the document vector. Mesnil et al.
(2014) slightly improved the performance of this approach by adding tri-grams. In the ensemble
model, the log probability scores of the previously described models are combined via linear
interpolation. Finally, the evaluation demonstrated better results when the models are combined,
with each model contributing to the success of the overall system.
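The log-ratio re-weighting step can be sketched as follows; the toy corpus and the Laplace smoothing constant of 1 are illustrative assumptions, and a linear classifier would then be trained on the resulting features.

```python
import numpy as np

# Toy corpus (an assumption for the sketch); 1 = positive, 0 = negative.
docs = [("good great fun", 1), ("great movie", 1),
        ("bad boring", 0), ("bad awful movie", 0)]
vocab = sorted({w for text, _ in docs for w in text.split()})
index = {w: i for i, w in enumerate(vocab)}

def binary_vector(text):
    v = np.zeros(len(vocab))
    for w in text.split():
        v[index[w]] = 1.0
    return v

X = np.array([binary_vector(t) for t, _ in docs])
y = np.array([label for _, label in docs])

# Smoothed count vectors for each class, then the element-wise log-ratio r.
p = 1.0 + X[y == 1].sum(axis=0)
q = 1.0 + X[y == 0].sum(axis=0)
r = np.log((p / p.sum()) / (q / q.sum()))

# NB-SVM-style input: the binary indicator vector scaled by the log-ratio.
features = X * r
```

Words characteristic of positive documents receive positive weights and vice versa, so the classifier starts from Naive-Bayes-informed features rather than raw counts.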
Another challenge is how to use unlabeled data to improve sentiment classification performance.
Recent approaches try to reduce the large dependency on labeled data by introducing
the concept of semi-supervised learning (Gao et al., 2014; Zhu and Ghahramani, 2002). In
semi-supervised learning, the classifier receives both labeled and unlabeled data as input.
However, each semi-supervised approach has its own pros and cons, making it difficult to choose
the best one for a specific domain.
To address this challenge, Li et al. (2015) introduced a new principle that combines two
or more semi-supervised algorithms instead of choosing only one. Specifically, Li et al. (2015)
combined the probability outputs of two distinct algorithms. The first algorithm is
self-trainingFS, proposed by Gao et al. (2014), and the second is label propagation, a
graph-based semi-supervised learning approach proposed by Zhu and Ghahramani (2002). The main
idea was to apply meta-learning, i.e. to re-predict the labels of the unlabeled data given the
outputs from the member algorithms. The meaning of meta, in this context, is that the learning
samples (xmeta) are not represented by a bag-of-words, but instead by the posterior probabilities
of the unlabeled samples (xk) belonging to the positive (pos) and negative (neg) classes, according
to the member algorithms. The feature representation is made as follows:
x_{meta} = \langle p_1(pos|x_k), \, p_1(neg|x_k), \, p_2(pos|x_k), \, p_2(neg|x_k) \rangle \qquad (2.12)
In the equation, p1(pos|xk) and p1(neg|xk) are the posterior probabilities from the first
semi-supervised method, and p2(pos|xk) and p2(neg|xk) are the posterior probabilities from
the second. The probability results and the real labels are then used as meta-learning samples
to train the meta-classifier (i.e., a maximum entropy classifier). An experimental evaluation in
four domains demonstrated that this approach outperforms both member algorithms.
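The construction of the meta-features in Equation 2.12 can be sketched as follows; the posterior values are illustrative assumptions, and a simple averaging rule stands in for the trained maximum entropy meta-classifier.

```python
# Meta-learning sketch: the meta-features for each sample are the posterior
# probabilities produced by two member algorithms (Equation 2.12).

def meta_features(p1_pos, p2_pos):
    """Build <p1(pos|x), p1(neg|x), p2(pos|x), p2(neg|x)> for one sample."""
    return [p1_pos, 1.0 - p1_pos, p2_pos, 1.0 - p2_pos]

# Posteriors from the two member algorithms for three unlabeled samples
# (illustrative values, not outputs of the actual algorithms).
samples = [(0.9, 0.8), (0.2, 0.4), (0.6, 0.3)]
X_meta = [meta_features(p1, p2) for p1, p2 in samples]

# A trivial stand-in for the meta-classifier: average the two positive
# posteriors and threshold (a real system trains, e.g., MaxEnt here).
labels = ["pos" if (x[0] + x[2]) / 2 > 0.5 else "neg" for x in X_meta]
```

In the actual approach, the meta-classifier is trained on these four-dimensional vectors paired with the true labels, so it can learn when to trust each member algorithm.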
Neural Networks for Sentiment Analysis
One sentiment classification approach that has recently emerged is based on Convolutional
Neural Networks (CNNs). A CNN architecture can be, for instance, divided into four layers:
the first layer focuses on the representation of the sentences; the second is a convolutional
layer with multiple filter widths and feature maps; the third performs max-over-time pooling;
the last is a fully connected layer with dropout and a softmax output.
Specifically, the first layer is responsible for representing the sentence through word vectors.
The second layer is where the convolution operations occur. A convolution operation applies a
filter to each possible window of words in the sentence, producing a feature map. The penultimate
layer executes a max-over-time pooling operation over the feature maps previously computed by
each filter, taking the maximum value, which corresponds to the most important feature of each
map. The last layer receives the features of the penultimate layer and passes them to a fully
connected softmax layer whose output is the probability distribution over labels.
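The convolution and max-over-time pooling steps can be sketched directly on word-vector matrices; the sentence length, embedding size, filter width, and random values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sentence of 6 words, each a 5-dimensional word vector (illustrative).
sentence = rng.normal(size=(6, 5))

def conv_feature_map(sentence, filt):
    """Slide a filter over every window of h consecutive words, producing
    one feature per window (the feature map)."""
    h = filt.shape[0]
    n = sentence.shape[0]
    return np.array([np.tanh(np.sum(sentence[i:i + h] * filt))
                     for i in range(n - h + 1)])

# Three filters of width 3; max-over-time pooling keeps one value per filter,
# i.e. the strongest feature each filter detected anywhere in the sentence.
filters = [rng.normal(size=(3, 5)) for _ in range(3)]
maps = [conv_feature_map(sentence, f) for f in filters]
pooled = np.array([m.max() for m in maps])   # fixed-size vector, one per filter
```

Note that the pooled vector has a fixed size regardless of sentence length, which is how the architecture handles variable-length input before the final softmax layer.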
In natural language processing, CNNs have been shown to be effective, reaching good results
in semantic parsing, sentence modeling, and even search query retrieval. CNNs have also been
extensively used for sentiment analysis (Kim, 2014; Mou et al., 2015; Kalchbrenner et al., 2014).
Kim (2014) introduced a principle that is based on training a simple CNN with one layer
of convolution on top of word vectors obtained from an unsupervised neural language model.
In terms of the word vectors, the author used the publicly available word2vec vectors that were
trained on 100 billion words from Google News.
Initially, Kim (2014) keeps the word vectors static and learns only the other parameters of
the model, this way already obtaining excellent results. However, as learning task-specific vectors
through fine-tuning yields further improvements, the author describes a simple modification to
the architecture that allows the use of both pre-trained and task-specific vectors, by having
multiple channels. The different channels are initialized with word2vec, and each filter of the
CNN is applied to all of them. However, gradients are back-propagated only through one of the
channels, allowing the model to adjust one set of vectors while keeping the other static. The
evaluation of this approach shows that unsupervised pre-training of word vectors is an important
ingredient in deep learning for natural language processing.
In sentiment analysis, capturing the meaning of longer phrases has also received a lot of
attention. However, to be able to extract this information, one needs to address the lack of
large labeled compositionality resources. Socher et al. (2013) created the Stanford Sentiment
Treebank corpus, the first corpus that allows capturing compositional effects of sentiment
in language, providing fully labeled parse trees. The type of information contained in this
dataset enables the community to train and develop new compositional models. Exploiting this
new resource, Socher et al. (2013) proposed a model called the Recursive Neural Tensor Network
(RNTN), which aims to capture compositional effects with higher accuracy.
The RNTN addresses several issues with standard RNNs (Goller and Kuchler, 1996; Socher
et al., 2011) and with the previously proposed Matrix-Vector Recursive Neural Network (MV-RNN)
architecture (Socher et al., 2012). In the MV-RNN model, the parameters are associated with
words, and each composition function that computes vectors of longer phrases depends on the
actual words being combined. However, the number of parameters can become very large, as it
depends on the size of the vocabulary. Considering this problem, Socher et al. (2013) suggested
that it would be more plausible to have a single composition function with a fixed number of
parameters. Briefly, the main idea of the RNTN is to use the same tensor-based composition
function for all nodes of the compositionality tree.
Experiments showed that the RNTN model improves sentence-level sentiment detection,
achieving better results than the MV-RNN. Another relevant aspect is that this new model
captures negation of different sentiments.
Figure 2.1: Tree-Based Convolutional Neural Network (Mou et al., 2015).
Another alternative to capture sentence meaning was proposed by Mou et al. (2015) and
is called the Tree-Based Convolutional Neural Network, based on CNNs and RNNs.
CNNs can extract features over neighboring words effectively, with short propagation paths,
but they do not capture inherent sentence structures, for instance parse trees. RNNs can encode
structural information through recursive semantic composition along a parse tree, but they have
difficulty learning deep dependencies because of long propagation paths. In order to exploit both
kinds of information, Mou et al. (2015) proposed a novel neural architecture (Figure 2.1) that
combines the advantages of CNNs and RNNs, called the Tree-Based Convolutional Neural
Network (TBCNN).
Initially, in TBCNNs, sentences are converted to either constituency or dependency parse
trees, and each node in the tree is represented as a distributed real-valued vector. Afterwards, a
set of fixed-subtree feature detectors (i.e., a tree-based convolution window) is applied, sliding
over the entire tree of a sentence to extract structural information. This structural information
is then packaged into one or more fixed-size vectors by max pooling, i.e., the maximum value on
each dimension is taken. Finally, the model has a fully connected hidden layer and a softmax
output layer. One advantage of such an architecture is that all features, along the tree, have
short propagation paths to the output layer, and hence structural information can be learned
effectively. Since there are different approaches to representing sentence structures, two variants
are considered. The c-TBCNN strategy pretrains the constituency tree with an RNN, implying
that the vector representations of nodes are fixed. The other variant, called d-TBCNN, is based
on dependency representations. The nature of this representation leads to the major difference
of d-TBCNN from traditional convolutions, because nodes can have different numbers of child
nodes.
Figure 2.2: Gated Recurrent Neural Network Architecture (Tang et al., 2015a).
The experimental evaluation showed that both c-TBCNNs and d-TBCNNs perform well in
sentiment analysis, and also that TBCNNs can extract sentence structural information effectively,
which is very important for sentence modeling.
Tang et al. (2015a) also created a novel neural network approach to learn continuous document
representations for sentiment classification (Figure 2.2).
Their method has two main steps. The first uses convolutional neural networks and long
short-term memory networks (Kim, 2014; Kalchbrenner et al., 2014) to produce sentence
representations from word representations. The second step addresses how to adaptively encode
the semantics of sentences and their inherent relations in the document. The convolutional
neural network and long short-term memory models, beyond learning fixed-length vectors for
sentences of varying lengths, also capture word order within sentences. In the second step, Tang
et al. (2015a) developed a Gated Recurrent Neural Network to encode the semantics and relations
of sentences in the document. This model can be viewed as an LSTM whose output gate is always
on, since it is preferable not to discard any part of the semantics of sentences, in order to obtain
a better document representation.
After these two steps, the document representations can be considered as features in models
to classify the document. The experimental evaluation revealed that traditional recurrent neu-
ral networks have a weak performance in modeling document composition, while adding gates
dramatically boosts the performance.
Tang et al. (2015b) introduced a new model called the User Product Neural Network (UPNN)
in order to capture user- and product-level information. UPNN takes as input not just variable-size
documents but also the user who writes the review, as well as the product that is being
evaluated.
The architecture of the system is divided into three steps: modeling the semantics of
documents, modeling the semantics of users and products, and sentiment classification. The step
of modeling the semantics of documents has two stages. In the first, as documents consist of a
list of sentences, and sentences consist of a list of words, Tang et al. (2015b) began by modeling
the semantics of words. For this purpose, each word was represented as a low-dimensional,
continuous, real-valued vector (i.e., as an embedding). In the final stage, a Convolutional Neural
Network (Kim, 2014; Kalchbrenner et al., 2014) was used to model the semantic representation
of sentences. In the intermediate step of modeling the semantics of users and products, both
users and products are encoded in continuous vector spaces, allowing the model to capture
important global clues such as user preferences and product qualities. Finally, in the sentiment
classification step, instead of using hand-crafted features as input to a classifier, the authors used
the continuous representations of documents and the vector representations of users and products
as discriminative features. The experimental evaluation confirmed that including continuous user
and product representations significantly improves the accuracy of sentiment classification.
2.2.3 Combined Approaches
Previous studies (Kennedy and Inkpen, 2006; Andreevskaia and Bergler, 2008; Qiu et al.,
2009) showed that lexicon-based and corpus-based approaches have complementary perfor-
mances and therefore should be combined.
Yang et al. (2015) developed a new approach that combines these two views, named the
LCCT model (Lexicon-based and Corpus-based, Co-Training Model). Another idea behind the
LCCT model is the fact that social reviews, such as posts in forums and blogs, in contrast
with product and service reviews, do not have associated numerical ratings, making it difficult
to perform supervised learning. Since manual labeling is time consuming and expensive, it
is preferable to label a small portion of the social reviews and perform semi-supervised learning,
leveraging information from both labeled and unlabeled data.
In terms of the lexicon-based approach, Yang et al. (2015) presented a novel method called
semi-supervised sentiment-aware LDA (ssLDA) to build a domain-specific sentiment lexicon.
Building a domain-specific sentiment lexicon is particularly relevant in sentiment analysis
because a single word can carry different sentiment meanings in distinct domains. In the
domain-specific sentiment lexicon, each word is associated with a particular class (i.e., positive
or negative) through a specific weight. After creating the lexicon, each document is classified by
aggregating the weights of its words. If the accumulated weight is greater than zero, the
document is classified as positive; otherwise it is classified as negative.
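The decision rule just described amounts to summing signed word weights and thresholding at zero; the lexicon entries below are illustrative assumptions, not output of ssLDA.

```python
# Sketch of the lexicon-based decision rule: each word carries a signed
# weight, and the document is positive when the aggregate weight exceeds
# zero. The lexicon entries here are illustrative assumptions.
lexicon = {"great": 1.2, "love": 0.8, "dull": -1.0, "waste": -1.5}

def classify(document):
    total = sum(lexicon.get(w, 0.0) for w in document.split())
    return "positive" if total > 0 else "negative"

assert classify("great acting love it") == "positive"
assert classify("a dull waste of time") == "negative"
```

In the LCCT model these weights would be learned per domain by ssLDA, which is what makes the lexicon domain-specific.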
For the corpus-based approach, Yang et al. (2015) used Stacked Denoising Auto-Encoders
(SDA) to build the corpus-based sentiment classifier. The autoencoder concept was introduced
by Rumelhart et al. (1986) and its denoising variant was proposed by Vincent et al. (2010).
After the SDA parameters are trained on both labeled and unlabeled data, and the high-level
representation of each data instance is obtained, an SVM trained on the labeled data, using the
resulting representations, is employed as the sentiment classifier.
In order to combine the methods, both are initially trained with the partially available labels,
and then one of the two classifiers (i.e., the corpus-based classifier or the lexicon-based classifier)
is used to label unlabeled documents, adding these instances to the pool of labeled data. The
other classifier is then re-trained using the new labeled data produced by the first. This procedure
is performed iteratively and, after a sufficient number of iterations, the classifiers are combined
using a majority-voting scheme to predict the sentiment labels of the test data. The experimental
evaluation demonstrates that LCCT performs significantly better on a variety of datasets,
compared with other state-of-the-art sentiment analysis approaches.
Another combined approach was proposed by Augustyniak et al. (2014), with the aim of
improving the efficiency of lexicon-based methods by combining several of them through
ensemble classification. In this approach, the initial task is to select the lexicons. The
authors employed a variety of lexicons, starting from a very basic list consisting of 2-word lists
of strong sentiment words (i.e., good/bad), a lexicon called SM, or a lexicon with English verbs
conjugated in different tenses, called PF.
Augustyniak et al. (2014) also considered additional word lists (lexicons), which they called
WL, 5MF and 25MF. They assumed that the input review sets form a probability space whose
sample space consists of reviews represented as pairs (score, text). The text is represented as
the set of words occurring in the review and the score is the normalized [−1, 1] sentiment of
the review. To create these new lexicons, they used the following equation:
fqmt(w) = \sum_{s \in scores} s \times \frac{P(\text{review has score } s \mid \text{review has word } w)}{P(\text{review has score } s)} \qquad (2.13)
In the equation, scores is a countable subset of [−1, 1] and s ranges over the possible scores of reviews containing the word w.
The lexicon WL is a list with the 25 most positive (i.e., highest fqmt) and the 25 most
negative (i.e., lowest fqmt) words obtained by merging all corpora. In contrast, 5MF and 25MF
select respectively the 5 and 25 most positive and most negative words, separately for each
corpus.
After selecting and constructing all the lexicons, the bag-of-words model was used. To each
word in the review which occurs in a lexicon, the authors assign a numeric value (i.e., 1 if the
word is positive, −1 if the word is negative, and 0 if the word does not appear in the lexicon).
The sentiment of the review is positive if the difference between the number of positive and
negative words identified in the review is greater than zero; if the difference is smaller than zero,
the review is negative. From these results, a sentiment polarity matrix is constructed, where
the columns represent the reviews and the rows represent the several existing lexicons.
Specifically, each position of the matrix corresponds to the sentiment polarity value provided by
one lexicon for one review. With this first step completed, Augustyniak et al. (2014) trained a
strong classifier, such as the C4.5 decision tree method, over the previously described matrix,
and used the classifier to predict the sentiment of new reviews.
Experiments show that the accuracy obtained from the combination of these lexicons
outperforms other lexicon-based approaches.
Considering that the challenges of sentiment classification cannot be easily addressed by
simple text categorization approaches relying on n-gram or keyword identification, Mullen and
Collier (2004) introduced a new concept to classify natural language texts as positive or negative.
To do that, they applied Support Vector Machines using a variety of diverse information sources,
on the grounds that SVMs are an ideal tool for bringing these sources together.
The diverse information sources were provided by the following methods (used to mea-
sure the favorability content of phrases): semantic orientation with PMI (Turney, 2002); Osgood
semantic differentiation with WordNet; and topic proximity with syntactic-relation features.
In the first method, the authors relied on the semantic orientation of words or phrases.
The meaning of the term semantic orientation (SO) refers to a measure (i.e., a real number)
that captures the sentiment (i.e., positive or negative) expressed by a word or phrase. The
solution proposed by the authors allows for the modeling of not only singular words but also
multiple word phrases, named value phrases. In this particular case, the approach taken by
Turney (2002) is used to derive the SO values and also to extract the value phrases. The
phrases are designated as value phrases because they are the sources of the SO values. After
extracting the value phrases, the SO value of each one is determined based upon the pointwise
mutual information (PMI) with the words excellent and poor (Church and Hanks, 1989), in the
following way:
PMI(w1, w2) = log2( p(w1 & w2) / (p(w1) p(w2)) ) (2.14)
In the equation, p(w1 & w2) is the probability of w1 and w2 occurring simultaneously. Finally,
the SO value of each value phrase is the difference between its PMI with the word excellent
and its PMI with the word poor.
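A minimal sketch of this SO computation, using hypothetical co-occurrence counts (the corpus size and counts below are made up for illustration):

```python
import math

# Toy co-occurrence counts from a hypothetical corpus of N windows.
N = 1000
count = {"excellent": 40, "poor": 40, "superb view": 20}
joint = {("superb view", "excellent"): 8, ("superb view", "poor"): 1}

def pmi(w1, w2):
    """Eq. 2.14: log2( p(w1 & w2) / (p(w1) p(w2)) )."""
    p12 = joint.get((w1, w2), 0) / N
    return math.log2(p12 / ((count[w1] / N) * (count[w2] / N)))

def semantic_orientation(phrase):
    """SO = PMI(phrase, excellent) - PMI(phrase, poor)."""
    return pmi(phrase, "excellent") - pmi(phrase, "poor")

print(round(semantic_orientation("superb view"), 2))  # → 3.0 (positive phrase)
```

A positive SO indicates that the phrase co-occurs more often with excellent than with poor, and a negative SO the reverse.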
The second method is based on using WordNet relationships to derive three values, namely
potency (strong or weak), activity (active or passive) and evaluative (good or bad) (Kamps
and Marx, 2002; Osgood et al., 1957). The derivation of these values is obtained by computing
the minimal path length in WordNet between the adjective in question and the pair of words
mentioned before. However, for the purposes of this research, each of these values is averaged
over all the adjectives in a text and then delivered to the SVM model.
The last method aims to exploit information that is known in advance, i.e., what the
topic is and which target the sentiment is directed at. Considering this, the method creates
several classes of features based upon the semantic orientation values of phrases, given their
position in relation to the topic of the text. In each review, the references to the target being
reviewed were tagged as THIS WORK and references to the artist under review were tagged
as THIS ARTIST . This is just an example, since many other classes were retrieved from
natural language text. Each of these classes is assigned a value, so representing each
text as a vector of these real-valued features forms the basis for the SVM model. However, if
no topic information is available, only the values of the first and the second method are used.
The authors concluded that combinations of SVMs using these features, in conjunction with
SVMs based on unigrams and lemmatized unigrams, outperform models which do not use these
information sources.
In order to perform sentiment analysis more thoroughly, Mudinas et al. (2012) introduced an
aspect-level sentiment analysis system (pSenti) that integrates lexicon-based and corpus-based
approaches. The lexicon-based approaches are generally implemented in two steps: lexicon
detection and sentiment strength measurement. In corpus-based approaches, the sentiment
detection is treated as a simple classification problem, which can be addressed by employing
machine learning algorithms such as Naive Bayes or Support Vector Machines. Following this,
the main idea of the introduced concept is combining the best of both worlds, generating feature
vectors for supervised learning in the same way as is seen in lexicon-based approaches.
Initially, in a pre-processing phase, some simplifications were performed, such as replacing
known idiomatic expressions and emoticons with text masks. For instance, if the given dataset
shows that the emoticon :) has a positive sentiment, then the emoticon will be replaced by the
mask Good one . After these simplifications, the Stanford CoreNLP toolkit was used to carry
out POS and named entity tagging.
Considering that people express multiple views, sometimes opposite, about different aspects
of the same product in a single review, it is important to extract the discussed aspects, as well as
the corresponding views. Therefore, for aspect and view extraction, the authors generated lists
of aspects and views. The list of aspects is composed of nouns, noun phrases and entity tags,
identified by the POS tagger. The list of views is composed of adjectives and known sentiment
words, which occur near an aspect. This step was important to find views which can be used to
expand the sentiment lexicon, and also to perform context-aware sentiment value extraction for
such adjectives in the given aspect.
The sentiment lexicon used in this system is constructed using public resources, more specif-
ically 7048 sentiment words and their sentiment values that are marked in the range from −3
to +3. Furthermore, the authors applied heuristic linguistic rules such as negation (i.e., words
that can change the overall sentiment, such as not and don't) and modifiers (i.e., words that can
increase or decrease the sentiment value, such as less and more).
After the lexicon is created, another step is initiated: the corpus-based sentiment evaluation.
In this step, the authors used the linear SVM implementation in LibSVM. The feature vectors
of each aspect are constructed based on three elements. The first element is sentiment words,
where the weight of each such feature is the sum of the sentiment values in the given review. For
instance, if we have a review with the word good appearing twice, which has a sentiment value
+2, we would add the feature Good with a weight of +4. The second element, called other
adjectives, consists of adjectives which are not in the sentiment lexicon but are initialized with
their occurrence frequencies, and whose sentiment value is estimated by the learning algorithm.
For instance, if the word big appears twice, we would have the feature Big with a weight +2.
The final element is the lexicon based sentiment score which estimates the sentiment value of a
word that was previously unseen in the training samples but that exists in test samples.
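The first two feature types can be sketched as follows (the lexicon, review and adjective list are hypothetical; the third, lexicon-based score feature is omitted for brevity):

```python
# Hypothetical sentiment lexicon with values in the [-3, +3] range.
lexicon = {"good": 2, "terrible": -3}
review = ["good", "good", "big", "terrible", "big"]

def feature_vector(tokens, adjectives=("big",)):
    """pSenti-style features: each sentiment word is weighted by the sum of
    its lexicon values in the review; other adjectives by their frequency."""
    features = {}
    for w in tokens:
        if w in lexicon:
            features[w] = features.get(w, 0) + lexicon[w]
        elif w in adjectives:
            features[w] = features.get(w, 0) + 1
    return features

print(feature_vector(review))  # → {'good': 4, 'big': 2, 'terrible': -3}
```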
After being trained, the SVM model can reuse the calculated feature weights to adjust the final
sentiment calculation. The final overall sentiment scoring of pSenti is a real-valued sentiment
score in the range of [-1,1], which is calculated as follows:
Ssenti = (1/2) × log2( pos / neg ) (2.15)
In the equation, pos are the positive overall sentiment scores and neg are the negative overall
sentiment scores.
The experimental evaluation shows that the proposed hybrid approach achieves a high
accuracy, very close to that of pure corpus-based systems, and much higher than that of pure
lexicon-based systems.
Sentiment lexicons are often used as key sources for the automatic analysis of opinions,
emotions and subjective text. However, manually created sentiment lexicons consist of few
carefully selected words. The associated problem in this case is that these few words fail to
capture the use of non-conventional word spellings and slang, commonly found in social media.
In order to solve this problem, Moreira et al. (2015) developed a system, based on a novel
method, to create large-scale domain-specific sentiment lexicons. The authors address this task
as a regression problem, in which terms are represented as word embeddings. Considering this,
the system can be divided into two main phases.
The first phase consists of deriving word embeddings from large corpora. For this purpose,
Moreira et al. (2015) tested some different approaches: the Skip-gram and Structured Skip-
gram methods, the Continuous Bag-Of-Words (CBoW) model and the Global Vector (GloVe)
approach.
The skip-gram and the CBOW models estimate the optimal word embeddings by maximizing
the probability that words within a given window size are predicted correctly. Essential to the
skip-gram method is a log-linear model of word predictions. When given the i-th word from a
sentence wi, the skip-gram method estimates the probability of each word at a distance p from
wi as follows:
p(w_{i+p} | w_i; C_p, E) ≈ exp(C_p · E · w_i) (2.16)

In the equation, w_i ∈ {0, 1}^{v×1} is a sparse column vector of the size of the vocabulary v, with
a 1 in the position corresponding to that word (i.e., a one-hot sparse representation). The model
is parametrized by two matrices: E ∈ ℝ^{e×v} is the embedding matrix, transforming the one-hot
representation into a compact real-valued space of size e, and C_p ∈ ℝ^{v×e} is a matrix mapping the
real-valued representation to a vector with the size of the vocabulary v. A distribution over all
possible words is then attained by exponentiating and normalizing over the v possible options.
In order to avoid the normalization over the whole vocabulary (in practice the value of v
is large), some approximations are used. In the structured skip-gram model, a different matrix
C_p is used for each relative position p between the words.
The CBOW method defines an objective function that predicts the word at position i given
the context window from i−d to i+d, where d is the size of the context window. The probability
of the word wi is defined as follows:

p(w_i | w_{i−d}, ..., w_{i+d}; C, E) ≈ exp(C · S_{i−d}^{i+d}) (2.17)

In the equation, S_{i−d}^{i+d} is the pointwise sum of the embeddings of all context words, from
E · w_{i−d} to E · w_{i+d}, excluding the word w_i itself, and once again C ∈ ℝ^{v×e} is a matrix
mapping the embedding space into the output vocabulary space of size v.
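Both predictions of Equations 2.16 and 2.17 can be sketched with toy matrices (the values of E and C below are made-up illustrative numbers, with vocabulary size v = 3 and embedding size e = 2):

```python
import math

# Toy vocabulary of v = 3 words, embedding size e = 2 (illustrative values).
E = [[0.1, 0.3, -0.2],   # e x v embedding matrix
     [0.4, -0.1, 0.2]]
C = [[0.2, 0.1],         # v x e output matrix (one C_p per offset p
     [-0.3, 0.5],        # in the structured skip-gram variant)
     [0.1, -0.4]]

def embed(word_idx):
    """E . w_i for a one-hot w_i: just column word_idx of E."""
    return [row[word_idx] for row in E]

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def skipgram_probs(word_idx):
    """Eq. 2.16: distribution over context words, softmax(C_p . E . w_i)."""
    h = embed(word_idx)
    return softmax([sum(c * x for c, x in zip(row, h)) for row in C])

def cbow_probs(context_idxs):
    """Eq. 2.17: softmax(C . S), with S the pointwise sum of context embeddings."""
    s = [sum(embed(i)[d] for i in context_idxs) for d in range(len(E))]
    return softmax([sum(c * x for c, x in zip(row, s)) for row in C])

p = skipgram_probs(0)
print([round(x, 3) for x in p], round(sum(p), 3))  # probabilities sum to 1
```

The softmax over all v entries is exactly the normalization step that, for realistic vocabulary sizes, the approximations mentioned above are designed to avoid.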
The models described above are based on different assumptions about the relations between
words within a context window. The GloVe method combines the logic of the previous models
with ideas drawn from matrix factorization methods. The GloVe method derives the embeddings
with an objective function that combines context window information with corpus statistics
computed efficiently from a global term co-occurrence matrix.
In order to support the unsupervised learning of the embedding matrix E, in all methods,
a corpus of 52 million tweets was used.
In the second phase, after mapping terms to their respective embeddings, a regression model
was trained, using the manually annotated lexicons, to predict a score y ∈ [0, 1] corresponding
to the intensity of sentiment of any word or phrase. For this purpose, the authors tested several
linear regression models, such as least squares and regularized variants like ridge and elastic
net regression. They also experimented with Support Vector Regression (SVR) using non-linear
kernels, namely, polynomial, sigmoid and Radial Basis Function (RBF) kernels. Experiments
indicated that several configurations of the embedding model and size could achieve optimal
results. Therefore, the system was based on structured skip-gram embeddings with 600 dimen-
sions, and SVR with RBF kernel.
Similar to the work of Moreira et al. (2015), Zhang et al. (2015) developed a system to
predict a score between 0 and 1, which is indicative of the strength of association of Twitter
terms with positive sentiment. For this purpose, the authors implemented a regression model
to calculate the sentiment strength score for each target term with the aid of sentiment lexicon
score features and word embeddings.
Firstly, the authors transformed the informal terms into their normal forms. With this in
mind, some abbreviations and rules to convert the irregular writing found in services like Twitter
into normal forms were collected from the Internet. After this process, in order to extract sentiment
lexicon features, they employed some sentiment lexicons and transformed the score of all words
in all sentiment lexicons to the range between -1 and 1. If a target term contained more than one
word, the authors averaged their scores and used these averages as the final sentiment lexicon
feature. Word embedding features were also adopted. Specifically, the authors used the publicly
available word2vec vectors to get word embeddings with a dimensionality of 300. If a sentence or
phrase contains more than one word, the strategy adopted was to sum up all the word vectors. Some
of the experiments demonstrated that the combination of sentiment lexicon features and word
embedding is the most effective feature type for sentiment score prediction.
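The feature construction just described can be sketched as follows (the lexicon scores and the 3-dimensional "embeddings" are made-up stand-ins for the rescaled lexicons and 300-dimensional word2vec vectors):

```python
# Hypothetical lexicon scores rescaled to [-1, 1] and toy 3-d "embeddings".
lexicon = {"happy": 0.8, "sad": -0.7}
embeddings = {"happy": [0.1, 0.2, 0.3],
              "not":   [0.0, -0.1, 0.1],
              "sad":   [-0.2, 0.1, 0.0]}

def term_features(term):
    """Average lexicon score over the term's words, plus the sum of
    their word vectors, concatenated into one feature vector."""
    words = term.split()
    scores = [lexicon[w] for w in words if w in lexicon]
    lex_feat = sum(scores) / len(scores) if scores else 0.0
    emb_feat = [sum(embeddings[w][d] for w in words if w in embeddings)
                for d in range(3)]
    return [lex_feat] + emb_feat

print(term_features("not sad"))
```

The resulting vector (one averaged lexicon score followed by the summed embedding) is what the regression model is trained on.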
Finally, in order to predict the sentiment score of a new instance, Zhang et al. (2015) trained
an SVM classifier with the sentiment lexicon features and word embedding features together.
The authors concluded that using word embeddings features alone may not achieve sufficiently
good results, but embeddings make a considerable contribution to performance improvement,
in combination with traditional linguistic features.
2.3 Overview
In this chapter, I presented the necessary concepts to understand the work that has been
made in the context of my MSc thesis. Furthermore, I reviewed some of the most representative
sentiment analysis approaches, divided into three categories according to the source of information
they use. I concluded that each of these categories has its pros and cons, making each of them
useful in different contexts.
In the next chapter, I will present the text representations and model architectures that
I used in the sentiment analysis and dimensional sentiment analysis tasks.
3 Sentiment Analysis with Deep Neural Networks
This chapter describes the different text representations, as well as the model architectures,
used in the experiments that are reported in this dissertation. First, Section 3.1 presents
the different input representations for the models. Section 3.2 describes the model architectures,
including a description of the different layers used in each of them. Finally, Section 3.3
summarizes the most important aspects of this chapter.
3.1 Text Representation
One of the main problems in sentiment analysis consists in creating representations of text for
computational analysis, i.e., preparing the text to meet the input requirements of the classification
systems. After converting the data into a structured format, we need an efficient text
representation model on which to build an effective classification system. Some of the pre-processing
techniques that deal with this challenge will be described in the next sections.
3.1.1 Bag of Words
The BoW model is one of the pre-processing techniques proposed in the literature. Given a
collection of documents, we first identify the set of words used in the entire collection. Commonly
called vocabulary, this set is traditionally reduced, by keeping only the words that are used in
at least two documents. Besides that, many of the applications in text mining remove from
the vocabulary the so-called stop words. For instance, words like and, because, to, of or you
do not give us additional information about the sentiment polarity of a given document. From
the moment in which the vocabulary is defined, each document can be represented as a vector
(with numeric entries) of length m, being m the size of the vocabulary. In order to calculate the
length (i.e., the norm) n of a document vector we can do the following:
n = ∑_{j=1}^{m} x_j (3.1)
In the formula, x is the vector representation of the document and xj is the number of
occurrences of the word j in the document.
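The vocabulary construction, the BoW vector and the norm of Equation (3.1) can be sketched over a made-up toy collection:

```python
from collections import Counter

# Toy collection; the vocabulary keeps only words used in at least two documents.
docs = [["the", "movie", "was", "great"],
        ["great", "cast", "great", "movie"],
        ["the", "cast", "was", "dull"]]

doc_freq = Counter(w for d in docs for w in set(d))
vocab = sorted(w for w, n in doc_freq.items() if n >= 2)

def bow(doc):
    """Vector of length m = len(vocab): x_j = occurrences of word j."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

x = bow(docs[1])
print(vocab)      # → ['cast', 'great', 'movie', 'the', 'was']
print(x, sum(x))  # the sum of the entries is the norm n of Eq. 3.1
```

A Bag-of-bi-grams variant would apply the same counting to adjacent word pairs instead, e.g. `list(zip(doc, doc[1:]))`.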
The BoW scheme nonetheless has some limitations. The word order is lost, so that
different sentences may have exactly the same representation, as long as the same words are
used. With the aim of reducing the impact of this limitation, a particular variant of the BoW
technique corresponds to Bag-of-n-grams. The Bag-of-n-grams identifies multi-word expressions
occurring in the document, ensuring word order in short contexts. Expressions like Far cry
from and United States will be detected as single units when using a Bag-of-tri-grams or a
Bag-of-bi-grams, respectively. Still, the BoW representation and its variant Bag-of-n-grams have
little sense about the semantics of the words, and they both suffer from data sparsity and high
dimensionality problems.
3.1.2 Word to Vector
Many of the sentiment analysis systems use techniques that treat words as singular units, i.e.
there is no notion of similarity between them (words are represented as indices in a vocabulary).
Another common limitation is the production of data with high dimensionality and sparsity.
Taking into account these limitations and the progress of machine learning techniques in recent
years, the word to vector representation has been increasing its relevance in NLP applications. As
its own name indicates, the idea is to also represent the word as vectors, trying to reduce the
limitations detected in previous work. The produced vectors, in this case, are often called word
embeddings.
The name word embedding comes from the fact that we are embedding the words into
a real-valued low-dimensional space. Basically, word embeddings are used to map words or
phrases from a vocabulary to a corresponding vector of real numbers. The main advantages in
comparison with the previously described BoW technique are: dimensionality reduction, which
makes the representation more efficient, and contextual similarity, which makes the representation
more expressive. Considering the dimensionality reduction, and in contrast with the BoW
approach, in which the size of the vectors grows with the size of the vocabulary of the document
collection, word embeddings aim to create vector representations with a much lower
dimensionality. In relation to the contextual similarity, the word embeddings can capture word
meanings (i.e., the semantic similarity between two words is correlated with the cosine of the
angle between their word embeddings). For example, the cosine for tire and car may be 0.7,
whereas for tire and milk it may be 0.1, which is not much different than what a human would
Figure 3.1: Model Architectures (Mikolov et al., 2013).
say about the relationship between these words.
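The cosine similarity mentioned above is computed as follows (the 3-dimensional vectors for tire, car and milk are made-up illustrations, not real embeddings):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two word embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-d embeddings: "tire" and "car" share contexts, "milk" does not.
tire, car, milk = [0.9, 0.1, 0.2], [0.8, 0.3, 0.1], [0.1, 0.9, 0.8]
print(cosine(tire, car) > cosine(tire, milk))  # → True
```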
In this context, another important property is the fact that most methods for learning good
word embeddings are completely unsupervised, in that they build the embeddings using a big
unannotated text corpus.
One of the most popular algorithms for producing these vectors is Word2Vec, proposed by
Mikolov et al. (2013). The Word2Vec algorithm uses one of two possible model architectures to
produce a distributed representation of words: the Continuous Bag of Words or the Skip-gram.
The CBoW architecture predicts the current word based on the context, whereas the Skip-gram
predicts surrounding words given the current word (Figure 3.1).
3.1.3 Document to Vector
Le and Mikolov (2014) proposed an extension to the Word2Vec approach that aims to
construct embeddings for entire documents. This new extension can be called either
Doc2Vec or Paragraph2Vec. Sentiment analysis systems typically require the text input to
be represented as fixed-length vectors. Again, due to its simplicity, efficiency, and sometimes
surprising accuracy, the BoW technique is oftentimes applied. In order to improve the results
reported in previous studies, like BoW, techniques like Doc2Vec arose. Briefly, Doc2Vec consists
in combining the word vectors provided by the Word2Vec algorithm. The main idea is to end up
with a single aggregate vector that represents the semantics of the entire document.
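A very simple form of this aggregation is averaging the word vectors of the document (a crude stand-in for the learned paragraph vectors of Doc2Vec; the 2-dimensional vectors below are hypothetical):

```python
# Hypothetical pre-trained 2-d word vectors; averaging them is a simple
# stand-in for the learned aggregation performed by Doc2Vec.
vectors = {"good": [0.5, 0.1], "movie": [0.2, 0.3], "bad": [-0.5, 0.1]}

def doc_vector(tokens):
    """Average the vectors of the known words into one document vector."""
    known = [vectors[w] for w in tokens if w in vectors]
    return [sum(v[d] for v in known) / len(known) for d in range(2)]

print(doc_vector(["good", "movie"]))
```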
This technique allows us to represent a document in a small data space, i.e. in a few hundred
dimensions. The main point in this process is to know how much one should compress the
dimensionality of the matrix encoding the document collection. If the data is compressed too
much, we run the risk of not preserving the important differences between our
data points. The central goal is to obtain a nice balance where similar documents will be
Figure 3.2: Examples of a) a traditional Recurrent Neural Network Architecture, and b) a Long Short Term Memory Architecture.
clustered together, leaving enough space to discern one cluster from the others. This property
has a big relevance to classification systems. Specifically, applying Doc2Vec to a collection of
documents and tuning well the dimensionality can give an important support to a classification
algorithm to better define the boundary that separates the different categories that we want to
distinguish. In other words, the problem of sparsity, present in previous studies leveraging BoW
representations is reduced due to the fact that we can now manage to get enough density in the
data points.
3.2 Classification Models
This section presents and describes the models used in this work, some of which (e.g., the
NB and SVM classifiers) have already been described in detail in Chapter 2. The
models presented in this chapter are more complex and mainly based on deep neural networks.
Traditionally, neural networks are divided into two categories: the recurrent neural networks
(RNNs) and the convolutional neural networks (CNNs).
Starting with the RNNs, the reason they are called recurrent is because they perform the
same task for every element of an input sequence, with the output being dependent on the
previous computations. Theoretically, RNNs have the possibility to use information in arbitrarily
long sequences, but in practice they are limited to looking back only a few steps. In other words,
RNNs have two sources of input, namely the present and the recent past, that combined give us
the possibility to determine how to respond to new data, as we do in life (Figure 3.2 a)). Still,
RNNs suffer from the problem of vanishing gradients. The problem of vanishing gradients is
a challenge found in training neural networks with gradient based methods, for instance when
using the backpropagation algorithm. The backpropagation method is used in conjunction with
an optimization method, such as gradient descent, and it attempts to reduce the errors between
the output and the desired result. Specifically, for each training example, the hidden layer
weights are modified in order to minimize the error computed between the network’s prediction
and the correct value. As the name suggests, these modifications are made in the backwards
direction, i.e. from the output layer, through each hidden layer, to the first hidden layer. The
vanishing gradients problem implies loss of sensitivity in the network, over the time as new
inputs overwrite the activations of the hidden layers, causing forgetfulness of the first inputs.
In order to remedy this problem, a new RNN variant called Long Short Term Memory (LSTM)
was introduced. An LSTM can choose to retain memory over arbitrary periods of time, but also
forget if necessary (Figure 3.2 b)). Taking into account these developments, LSTMs are one of
the most used RNN models in Natural Language Processing tasks.
With respect to Convolutional Neural Networks, they are usually composed of several layers
of convolutions with non-linear activation functions applied to the results. In more detail,
convolutions over the input layer are used with the aim to compute the output. In contrast with
Feedforward Neural Networks where each input neuron is connected to each output neuron in
the next layer, in the CNNs the use of convolutions results in many local connections, where
each region of the input is connected to a neuron in the output. Each convolutional layer
applies different filters, and at the end combines their results. The values of these filters are
automatically learned during the training phase.
3.2.1 Overview on the Layers used within Deep Neural Networks
This section presents all the layers that compose the model architectures, each of them in
its own subsection, with a description of their respective usefulness, purpose and functioning.
3.2.1.1 Embedding Layers
An embedding layer aims to convert word indices into dense representations. In this case,
three different options are available: the first is to learn the embedding weights from scratch, i.e.
the word embeddings are initialized with random values and will be improved over the training
process. The second is to use pre-trained embeddings, keeping the embedding weights static.
The third and last option is to use pre-trained embeddings, but instead of keeping the
weights static, they are further improved during the training process.
3.2.1.2 Dropout Layers
The dropout layer has the responsibility of helping models prevent overfitting. As de-
scribed throughout the document, deep neural networks are composed of multiple non-linear
hidden layers, making these models very expressive. The existence of these types of layers allows
the models to learn very complicated relationships between inputs and outputs. Taking into
account the limited training data, many of the relationships will be the result of sampling noise,
i.e. some relationships will exist in the training set but not in the test set. When this occurs, the
model does not generalize well, causing bad results in terms of the prediction performance. In
order to avoid the overfitting problem, many methods have been developed, one of them being
the dropout method. Essentially, the dropout method forces the neural network to learn mul-
tiple independent representations of the same data (dropping out neurons during the training
phase), with the aim to reduce the dependence on the training set. A dropout layer will thus
set to zero a given portion of its inputs.
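This behavior can be sketched as follows (an inverted-dropout variant, which also rescales the surviving inputs so their expected sum is unchanged; the rescaling is a common implementation detail, not something the text above prescribes):

```python
import random

def dropout(inputs, p=0.25, training=True, seed=None):
    """Inverted dropout: zero each input with probability p during training
    and rescale survivors by 1/(1-p), keeping the expected sum unchanged."""
    if not training:
        return list(inputs)
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else x / (1 - p) for x in inputs]

out = dropout([1.0] * 8, p=0.25, seed=0)
print(out.count(0.0), "of 8 units dropped")
```

At test time (training=False) the layer is simply the identity, so no units are dropped when making predictions.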
3.2.1.3 LSTM Layers
An LSTM layer basically implements the LSTM recurrent model, which is one of the most
used variants of RNNs due to the fact that this approach considers three additional aspects that
improve significantly the prediction performance. The first is that the network can control
when to let the input enter the neuron. The second is the capacity to control
when to remember what was computed in the previous time steps. Finally, this method also
has the capacity to control which parts of the information the system should output. These
improvements are illustrated in Figure 3.2 b).
3.2.1.4 Dense Layers
A dense layer represents a regular fully connected layer. Specifically, the idea of this layer
is to connect every neuron in the network to every neuron in adjacent layers.
3.2.1.5 Convolutional Layers
A convolutional layer is the most relevant in the case of CNN models. These layers are
composed by a set of learnable filters with a small receptive field. During the forward process,
each filter is convolved across the full depth of the input volume. This process is accomplished
computing the dot product between the entries of the filter and the input, in order to produce
the activation map of each filter. Finally, the key point is that the network learns filters which
are activated when they see some specific type of feature at some spatial position in the input.
The output of a convolutional layer is represented by a stack of activation maps produced by
each filter. Each stack position can be seen as an output of a neuron that looks at a small region
in the input.
3.2.1.6 Pooling Layers and Flatten Layers
A pooling layer is commonly used between successive convolutional layers in a CNN model.
Their goal consists in progressively reducing the spatial size of the representation, which also
causes a reduction in the number of network parameters and computations. Due to this fact,
these layers are also an important aid in controlling the overfitting problem. The most traditional
function used to execute this operation is max pooling. The idea is to split the input representation
into a set of non-overlapping regions and, for each of these regions, select the maximum value.
Thus, at the end, each element of the new representation is the maximum value of a region in the
original input
representation. The intuition behind this is that once a feature has been detected, its exact
location is not very important compared to the importance of its location relative to other
features.
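The max pooling operation over non-overlapping regions can be sketched in one dimension (the input values are arbitrary illustrative numbers):

```python
def max_pool_1d(values, size):
    """Split the input into non-overlapping regions of `size` elements
    and keep only the maximum of each region."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

feature_map = [1, 5, 2, 8, 3, 3]
print(max_pool_1d(feature_map, size=2))  # → [5, 8, 3]
```

Flattening a stack of pooled maps into one dimension, as a flatten layer does, is then just `[x for row in maps for x in row]`.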
A flattening layer, as the name indicates, flattens the input. More properly, these layers
flatten all of the dimensions of their inputs into one dimension.
3.2.1.7 Activation Layers
An activation layer just applies an activation function to an output. An activation function
in a neural network, besides restricting outputs to a certain range, breaks the linearity of the
network, allowing it to learn more complex functions than linear regression. A neuron without
an activation function is equivalent to a neuron with a linear activation function, like f(x) =
x. When these functions do not add any non-linearity, the entire network is equivalent to a
single linear neuron. Therefore, it makes no sense to build a multi-layer network with linear
activation functions. Moreover, given that a single linear neuron is not capable of dealing with
non-separable data, no matter how deep a multi-layer network is, it can never solve any non-
linear problem. Taking this into account, the role of activation functions in a neural network is
to produce a non-linear decision boundary, via a non-linear combination of the weighted inputs.
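The collapse of linear layers can be verified numerically: composing two weight matrices without an activation in between is exactly one linear map (the matrices below are arbitrary illustrative values):

```python
# Two "layers" without activation functions: y = W2 (W1 x).
W1 = [[1.0, 2.0], [0.0, 1.0]]
W2 = [[0.5, 1.0], [1.0, 0.0]]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

x = [1.0, 1.0]
two_layers = matvec(W2, matvec(W1, x))
one_layer = matvec(matmul(W2, W1), x)  # the single equivalent linear layer
print(two_layers == one_layer)  # → True
```

Inserting a non-linearity such as a sigmoid between the two layers breaks this equivalence, which is exactly what gives depth its expressive power.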
Figure 3.3: Stack of two LSTMs.
3.2.2 Model Architectures
This section describes the model architectures, each of them in its own subsection. Each
model description gives, beyond the model's composition (i.e., its layers), the specifications of
each component.
3.2.2.1 Stack of LSTMs
The stack of LSTMs model is simply a stack of several LSTM layers. The idea is to form a
deep network not only in terms of layers but also in recurrent dimensions. The intuition is that
higher LSTM layers can capture abstract concepts in sequences which might help in a particular
task, in our case, sentiment analysis. To better understand this concept, Figure 3.3 presents the
architecture of the Stack of two LSTMs model.
In brief, Figure 3.3 shows a model that is composed of one embedding layer, two LSTM
layers and one activation layer. The figure, as in all the next model architectures, does not
present any dropout layer. However, in order to prevent overfitting problems, as described in
the previous section, dropout layers with a probability of 0.25 are used between most of the
layers.
Starting with the embedding layer, there are four data parameters that need to be specified:
input dimensionality, output dimensionality, input length and weights. The input dimensionality
is defined as the maximum number of words to consider in the representation (i.e., set to 30000).
The output dimensionality as the size of the word embeddings. The input length as the maximum
length of a sentence (i.e., set to 50 words), and finally the weights as a list of numpy arrays to
set the initial embedding weights.
Figure 3.4: Bidirectional LSTM Architecture.
The dropout layers implement the dropout mechanism, which consists in randomly setting a
fraction p of the input units to 0 at each update during training time, helping prevent overfitting.
Thus, the only specification needed is the definition of the value of p, set here to 0.25.
The LSTM layers have three parameters: output dimensionality, activation and weight
initialization. The output dimensionality is defined as the word embeddings size. The activation
is set to a sigmoid function, and the weights are initialized to zero.
The activation layer only has one parameter, which depends on whether the model is used for
regression or classification. If the model is used in a regression task, then the activation
is defined as a linear function, otherwise a sigmoid function is used. The output dimensionality
represents the number of output dimensions of the model and is selected according to the specific
task.
3.2.2.2 Bidirectional LSTMs
The bidirectional LSTM model was introduced with the aim of improving the performance of standard RNNs, which do not have access to future information at a given state. To overcome this limitation, the bidirectional LSTM model connects hidden layers of opposite directions
Figure 3.5: Multi-Layer Perceptron Architecture.
to the same output, so that the output layer can access information from past and future states.
One such architecture is presented in Figure 3.4.
Figure 3.4 shows that the bidirectional LSTM model is composed of one embedding layer, two LSTM layers in a bidirectional arrangement, and an activation layer at the end. Each of these layers has its own parameters, as further detailed below.
The embedding, dropout and activation layers have the same parameters that were explained for the previous model, although some layers have additions. The bidirectional arrangement adds parameters to the LSTM layers, such as setting the go_backwards parameter to true.
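The forward/backward wiring can be illustrated with a toy recurrent pass, where a deliberately simplified update rule stands in for the LSTM equations and reverse=True mimics the go_backwards parameter:

```python
def rnn_state(sequence, reverse=False):
    """Run a toy recurrent update over the sequence and return the final state.
    reverse=True mimics the go_backwards parameter of the backward LSTM."""
    items = list(reversed(sequence)) if reverse else list(sequence)
    state = 0.0
    for x in items:
        state = 0.5 * state + x   # placeholder for the LSTM cell update
    return state

def bidirectional(sequence):
    """Concatenate the final states of the forward and backward passes, so
    the output can access information from both past and future positions."""
    return [rnn_state(sequence), rnn_state(sequence, reverse=True)]

out = bidirectional([1.0, 2.0, 3.0])   # [forward state, backward state]
```

Only the wiring is the point here: the two directions run independently over the same input, and their outputs are joined before the activation layer.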
3.2.2.3 Multi-Layer Perceptron
The multi-layer perceptron (MLP) model is a simple feedforward neural network, i.e., the connections between the network's neurons do not form a cycle. More precisely, information flows in a single direction, from the input to the output. All the nodes in this model, except the input nodes, are neurons with a non-linear activation function. The main improvement of the MLP model over the standard linear perceptron is its capacity to distinguish data that is not linearly separable. The architecture of the model is illustrated in Figure 3.5.
Figure 3.5 shows that the multi-layer perceptron is composed of the following layers: three dense layers and one activation layer. The dropout and the activation layers have the same parameters as in the previous models; the only difference lies in the dense layers. The first two dense layers have two parameters: the output dimensionality and the activation. The output dimensionality is defined as the embedding size, and the activation is set to a relu function. For the last dense layer, the only parameter is the output dimensionality, which represents the number of output dimensions of the model and is selected according to the specific task.
Figure 3.6: Convolutional Neural Network Architecture.
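A forward pass through such a stack of dense layers can be sketched as follows. The weights and biases are toy values chosen for illustration; in the model they are learned, the hidden size equals the embedding size, and the output activation depends on the task:

```python
def dense(inputs, weights, bias, activation):
    """One fully connected layer: weighted sum per output unit, then activation."""
    out = []
    for w_row, b in zip(weights, bias):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        out.append(activation(z))
    return out

relu = lambda z: max(0.0, z)
identity = lambda z: z   # linear output, as used for regression

x = [0.5, -1.0, 2.0]
h1 = dense(x, [[1, 0, 0], [0, 1, 1]], [0.0, 0.0], relu)   # first dense layer, relu
h2 = dense(h1, [[1, 1], [1, -1]], [0.0, 0.0], relu)       # second dense layer, relu
y = dense(h2, [[1, 1]], [0.0], identity)                  # output layer (task-dependent)
```

Each layer feeds only forward into the next one, which is exactly the acyclic structure the text describes.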
3.2.2.4 Convolutional Neural Networks
Convolutional neural network models apply convolutions over the inputs to compute the outputs, in contrast with other neural networks, which connect each neuron to every neuron of the previous layer. In CNNs, each layer applies different filters over the data and, during this process, the model learns the values of its filters based on the task we want to perform. At the end of this process, the results of each layer are combined. In this case, the architecture used is composed of three filters with lengths 3, 5 and 7. To make this clearer, the model architecture is shown in Figure 3.6.
This architecture is composed of the following layers: one embedding layer, three convolutional layers, three max pooling layers, three flatten layers and, finally, an activation layer.
The embedding, dropout and activation layers have the same parameters as in the previous models. In this model, the additions are the convolutional, max pooling and flatten layers; the flatten layers do not have any parameters to be described.
Figure 3.7: CNN-LSTM Architecture.
In the convolutional layers there are five important parameters: the number of convolutional kernels, the filter length, the input dimensionality, the input length, and the activation. The number of convolutional kernels is defined as the embedding dimensionality. The filter length is defined as the extension (spatial or temporal) of each filter, and is assigned the values 3, 5 and 7. The input dimensionality is defined as the embedding dimensionality. The input length is the maximum length of a sentence (i.e., set to 50) and, finally, the activation is defined as a relu function.
The max pooling layers have only one parameter, the pool length, defined as the region size over which max pooling is applied; in this case, it depends on the filter length used in the convolutional layers.
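The interplay between a convolutional filter and the subsequent max pooling step can be sketched as follows. This uses a single one-dimensional filter of length 3 with toy weights; the real layers use as many learned filters as the embedding dimensionality:

```python
def conv1d(sequence, kernel):
    """Slide a filter of length len(kernel) over the sequence (valid padding)."""
    k = len(kernel)
    return [sum(w * x for w, x in zip(kernel, sequence[i:i + k]))
            for i in range(len(sequence) - k + 1)]

def max_pool1d(sequence, pool_length):
    """Take the maximum over consecutive, non-overlapping regions."""
    return [max(sequence[i:i + pool_length])
            for i in range(0, len(sequence) - pool_length + 1, pool_length)]

feature_map = conv1d([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], kernel=[1.0, 0.0, 1.0])
pooled = max_pool1d(feature_map, pool_length=2)
```

Filters of lengths 3, 5 and 7 produce feature maps of different lengths from the same 50-word input, which is why each branch has its own pooling and flatten layers before the results are combined.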
3.2.2.5 Combined CNN-LSTM Network
The CNN-LSTM model consists, as the name suggests, of combining a CNN architecture with an LSTM architecture. Previous studies, described in Chapter 2, show that combining distinct models can sometimes yield a more powerful overall model that improves prediction performance, because each of the combined models contributes its specific features, resulting in a more complete model. Specifically, since CNN models are capable of extracting local information but may fail to capture long-range dependencies, an LSTM model is combined with the CNN in an attempt to overcome this limitation. The combined architecture can be seen in Figure 3.7.
Figure 3.7 shows that the model is composed of one embedding layer, one convolutional layer, one max pooling layer, one LSTM layer and, finally, an activation layer.
The embedding, dropout and activation layers have the same parameters as in the previous models. The differences are in the convolutional and max pooling layers.
In the convolutional layer, the number of convolution kernels is defined as the embedding dimensionality, the filter length is assigned the single value 3, and the activation is set to a relu function.
The only parameter involved in the max pooling layer is the pool length, defined as the region size over which max pooling is applied. In this specific architecture, this parameter was assigned the value 2.
Figure 3.8: Merged CNN Architecture.
3.2.2.6 Merged CNN
The merged CNN model is a variant of the CNN model, whose main idea is to allow the previously described CNN model to receive two different types of word representations as input. Hence, this architecture has two branches, each receiving its respective word representation. At the end, the goal is to concatenate the two representations produced by the branches and perform the classification over this new representation. Figure 3.8 shows the model architecture with the respective layers. The parameters involved in each of the layers are the same as in the CNN model.
3.2.2.7 Merged CNN-LSTM
As in the previous merged model, the merged CNN-LSTM model is a variant of the previously introduced CNN-LSTM model. The idea is also to allow the CNN-LSTM model to receive two different types of word representations. Thus, this architecture has two branches, each using a different word representation. At the end, the goal is to concatenate the two representations produced by the branches and perform the classification over this new representation. Figure 3.9 shows the model architecture.
Figure 3.9: Merged CNN-LSTM Architecture.
3.3 Overview
In this chapter, I described all the components involved in addressing the sentiment analysis and dimensional sentiment analysis tasks. Specifically, the chapter described the different text representations, as well as the different model architectures. To better contextualize the layers of each model, brief overviews were also presented, followed by the model architectures with their respective layers and parameters. In the next chapter, I present the evaluation experiments conducted for each particular task.
4 Experimental Evaluation
This chapter starts with the presentation of the general evaluation methodology, followed by an explanation of the different inputs that are provided to the classification models, the evaluation metrics, and finally the results of the experimental evaluation. Section 4.1 describes the evaluation methodology. Section 4.2 presents the datasets, as well as the fine-tuned word embeddings that were used for each specific task, i.e., sentiment analysis or dimensional sentiment analysis. Section 4.3 details the metrics for measuring the quality of the results and, finally, Section 4.4 presents the results from the different models, according to each specific task. Section 4.5 summarizes the contents of this chapter.
4.1 Evaluation Methodology
The evaluation methodology has components that are specific to each task and dataset. However, a common methodological aspect is present in all tests with neural network models, namely the use of a callback with an early stopping function. A callback in this context is nothing more than a set of functions to be applied at given stages of the training procedure. The use of callback functions allows us to observe the internal states and statistics of the model during training. Specifically, the early stopping function allows us to stop the training process when a monitored quantity has stopped improving.
In the experiments, the monitored quantity was the validation loss. The validation loss is measured at each training epoch on a previously defined validation set. In our case, the validation set was defined as the last 10% of the training data, and the training procedure stops when we observe at least two consecutive epochs with no improvement.
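The stopping rule can be sketched in plain Python. The list of validation losses below is hypothetical; in the experiments these would be the losses measured, at each epoch, on the held-out last 10% of the training data:

```python
def train_with_early_stopping(val_losses, patience=2):
    """Stop when the validation loss fails to improve for `patience`
    consecutive epochs; return the number of epochs actually run."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Loss improves for three epochs, then stalls for two: training stops at epoch 5.
stopped_at = train_with_early_stopping([0.9, 0.7, 0.6, 0.65, 0.66, 0.5])
```

Note how the final improvement at epoch 6 is never seen: the callback trades a possible late gain for a shorter, less overfitting-prone training run.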
4.1.1 Evaluation for the Sentiment Analysis Task
The sentiment analysis task was performed over different datasets/contexts, with the aim of making the results more robust. Thus, the Sentence Polarity v1.0 (Pang and Lee, 2005), the Stanford Sentiment Treebank (Socher et al., 2013), and the Tweet 2016 datasets were used.
The Sentence Polarity dataset v1.0, described in more detail in the next section and corresponding to a binary classification task, does not have pre-defined train and test sets. Therefore, in this case, and in order to assess the prediction quality of the different models, cross-validation with 3 folds was applied.
Cross-validation is probably the simplest and most widely used method for estimating prediction errors. This method directly estimates the expected extra-sample error Err = E[L(Y, f(X))], i.e., the average generalization error when the method f(X) is applied to an independent test sample from the joint distribution of X and Y.
In cases where there are no pre-defined train and test sets, this method uses part of the available data to fit the model and a different part to test it, repeating this process several times. The idea is to split the data into K roughly equal-sized parts and, after selecting the k-th part, fit the model to the other K − 1 parts of the data, and calculate the prediction error of the fitted model when predicting the k-th part. We do this for k = 1, 2, ..., K and combine the K estimates of the prediction error.
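Under the K = 3 setting used for the Sentence Polarity dataset, the fold construction can be sketched as follows, operating on item indices only:

```python
def k_fold_splits(n_items, k):
    """Split item indices into k roughly equal parts and return, for each part,
    the (train, test) index lists: the k-th part is held out for testing and
    the model is fitted on the remaining k-1 parts."""
    indices = list(range(n_items))
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits

splits = k_fold_splits(10, k=3)
# Each split holds out a different part; the prediction errors measured on the
# held-out parts are then combined into a single estimate.
```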
Almost all the models considered in our tests use word embeddings. In this particular case, the word embeddings used were the publicly available word2vec vectors trained on Google News (Mikolov et al., 2013).
The Stanford Sentiment Treebank corpus has pre-defined train and test sets. In cases like this, the best option is to use the pre-defined sets, in order to allow a better comparison with the results of previous studies.
With respect to word embeddings, and similarly to the previous dataset, the publicly available word2vec vectors trained on Google News were used. This specific dataset has sentences labeled with an integer sentiment polarity score ranging from 0 to 4.
The Tweet 2016 dataset also has pre-defined train/test sets. However, in contrast with the previous datasets, where the word embeddings trained on Google News were used, in this case word embeddings trained on Twitter microposts (Godin et al., 2013) were used. This specific dataset has sentences labeled with an integer sentiment polarity ranging from 0 to 2.
A first set of experiments leveraged models that simply use pre-trained word embeddings. All the words, including those not present in the pre-trained word embeddings (which are randomly initialized), are fine-tuned throughout the training process. Thus, not only the parameters of the models are learned, but also the weights of the word embeddings.
1 http://alt.qcri.org/semeval2016/task4/
The second set of experiments involved models that require pre-processing the word embeddings, in order to concatenate the pre-trained representations with the emotion dimension values (valence, arousal and dominance). The algorithm that produces this concatenation has two distinct steps. For words that are simultaneously present in a pre-existing dataset of affective ratings for words (Warriner et al., 2013) and in the pre-trained word embeddings, the values are simply concatenated. For the remaining cases, a regression model is trained, taking the word embeddings as the training set and the dimensional values as labels. The idea is to use this model to predict the dimensional values of words that do not appear in the original affective dataset (Warriner et al., 2013), so that the entire collection of word embeddings can have an associated group of dimensional values. The final result of this process is a set of word embeddings whose weights are concatenated with the emotion dimension values. After this pre-processing phase, the word embeddings are given to the models, as in the previous set of experiments.
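A minimal sketch of this two-step procedure follows. The embeddings, the affective ratings, and the predict_vad regressor are toy stand-ins introduced for illustration; in the actual experiments the regressor is trained on the words covered by Warriner et al. (2013):

```python
# Toy pre-trained embeddings (dimensionality 3 here, 300 in the experiments).
embeddings = {"good": [0.1, 0.2, 0.3], "film": [0.4, 0.5, 0.6]}

# Toy affective ratings: (valence, arousal, dominance) per word.
vad_ratings = {"good": (7.9, 4.1, 6.5)}

def predict_vad(vector):
    """Hypothetical stand-in for the regression model that predicts
    (valence, arousal, dominance) for words missing from the ratings."""
    return (5.0, 5.0, 5.0)   # neutral mid-scale guess

def concat_with_vad(word):
    vector = embeddings[word]
    # Step 1: the word has a rating -> concatenate the known values.
    # Step 2: otherwise -> concatenate values predicted by the regressor.
    vad = vad_ratings.get(word) or predict_vad(vector)
    return vector + list(vad)

enriched = {w: concat_with_vad(w) for w in embeddings}
```

After this step, every embedding has three extra components, so the models receive vectors of dimensionality 303 instead of 300.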
The third and last set of experiments consists of separately using two different types of word embeddings within the same model architecture. Specifically, pre-trained word embeddings are used together with a set of word embeddings of dimensionality 3, composed of the values of valence, arousal and dominance (taken from the pre-existing dataset of affective ratings for words (Warriner et al., 2013)). In contrast with the previous experiments, where there is just one input to the models, there are now two inputs. The idea is to build a merged architecture with two branches, one branch receiving one type of word embeddings as input, and the other branch receiving the other type. Both branches have the same model architecture and, after some computations in each branch, the results are concatenated. This type of merged architecture was considered only for the models with the best results in the previous experiments, and corresponds to the Merged CNN and Merged CNN-LSTM architectures described in Chapter 3.
4.1.2 Evaluation for the Dimensional Sentiment Analysis Task
In this particular case, the training set in all experiments is the Extended Warriner dataset of affective ratings for words and phrases, described in more detail in the next section. The test sets are EmoTales (Francisco et al., 2012), ANET (Bradley and Lang, 2007) and a Facebook Messages dataset (Preoctiuc-Pietro et al., 2016), in which sentences are labeled with real values encoding emotional valence, arousal and dominance. With respect to the word embeddings required by some of the models, the publicly available word2vec vectors trained on Google News are used.
4.2 Datasets
This section describes the data involved in the experiments for the sentiment analysis and
the dimensional sentiment analysis tasks. The section presents the different datasets as well as
the word embeddings that were used, according to each specific task.
4.2.1 Sentiment Analysis Datasets
The Sentence Polarity dataset v1.0, described by Pang and Lee (2005), is a corpus composed of 5331 positive and 5331 negative sentences taken from several movie reviews.
The Stanford Sentiment Treebank dataset, described by Socher et al. (2013), is a corpus
with fully labeled parse trees that supports the complete analysis of compositional effects of
sentiment in language. This corpus is based on the Rotten Tomatoes movie reviews dataset
introduced by Pang and Lee (2005) and consists of 11,855 single sentences extracted from movie
reviews. Each unique phrase is annotated by 3 human judges, with the following sentiment
scale: negative - 0, somewhat negative - 1, neutral - 2, somewhat positive - 3 and positive - 4.
The Tweet 2016 dataset was provided for Task 4 (Sentiment Analysis in Twitter) of the SemEval 2016 competition. This dataset contains tweets annotated with the following sentiment scale: negative - 0, neutral - 1, positive - 2.
4.2.2 Dimensional Sentiment Analysis Datasets
The EmoTales dataset, described by Francisco et al. (2012), consists of a collection of
1389 English sentences from 18 different folk tales, annotated by 36 different people. All the
sentences are annotated with real-valued ratings for emotion, according to the dimensions of
evaluation/valence, activation/arousal and power/dominance.
The Affective Norms for English Text dataset from Bradley and Lang (2007), provides
normative ratings of emotion (pleasure/valence, arousal, dominance) for a small set of brief
texts (i.e., a total of 100 sentences) in the English language.
2 http://alt.qcri.org/semeval2016/task4/
The dataset introduced by Preoctiuc-Pietro et al. (2016) provides a set of Facebook mes-
sages rated by two psychologically trained annotators on two separate ordinal nine-point scales,
representing valence and arousal. This dataset includes a total of 2896 Facebook messages.
The dataset provided by Warriner et al. (2013) contains 13915 English lemmas annotated with valence, arousal and dominance values. Beyond that, this dataset also includes complementary information on the annotations, such as gender, age and educational differences in emotion norms.
The Paraphrase Database (PPDB), provided by Pavlick et al. (2015), is an automatically extracted database containing millions of paraphrases for short sentences in 16 different languages. The goal of PPDB is to improve language processing by making systems more robust to language variability and unseen words. In the context of this work, the PPDB package of size L for the English language was used.
The Extended Warriner dataset, as the name suggests, is an extension of the Warriner dataset (Warriner et al., 2013) that was developed by me for the experiments in this dissertation. This extension was created using information present in the Paraphrase Database (Pavlick et al., 2015). Specifically, the idea was to filter the PPDB file in order to find phrases with more than one word, with high confidence of being paraphrases, that are equivalent to words present in the Warriner dataset. The goal is then to associate these equivalent phrases with the scores found in the Warriner dataset. Thus, the final result was a dataset containing not only single words but also phrases, together with their respective emotion dimension values. A total of 54000 instances were present in the resulting dataset.
4.2.3 Word Embeddings
Almost all the models described in the previous chapter use word embeddings. The different
word embeddings that were used in the experiments are the following:
• Word embeddings trained on about 100 billion words from Google News (Mikolov et al.,
2013). The training was performed using the continuous bag of words architecture, and
the word vectors have a dimensionality of 300.
• Word embeddings trained on about 400 million Twitter microposts (Godin et al., 2013). The training was performed using the skip-gram architecture, and the word vectors have a dimensionality of 400.
4.3 Evaluation Metrics
The evaluation of the different tasks was done using different measures of result quality. In
order to assess the sentiment analysis task, I used the accuracy evaluation metric.
Consider that each testing instance, after being processed by an automated system that detects whether it expresses a positive or negative sentiment, can either be:
• True Positive (TP) - the system prediction is positive, as is the real value.
• False Positive (FP) - the system prediction is positive and the real value is negative.
• False Negative (FN) - the system prediction is negative and the real value is positive.
• True Negative (TN) - the system prediction is negative, as is the real value.
Accuracy can be computed as the overall correctness of the system, i.e., the number of decisions the system got right divided by the total number of decisions made by the system. For a binary classification problem:
Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4.1)
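As a sanity check, Equation 4.1 translates directly into code (the confusion counts below are made up for illustration):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of decisions the system got right (Equation 4.1)."""
    return (tp + tn) / (tp + tn + fp + fn)

# A system that is right on 85 positive and 80 negative instances out of 200.
acc = accuracy(tp=85, tn=80, fp=15, fn=20)
```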
In order to assess the dimensional sentiment analysis task, I used the Pearson correlation coefficient. The Pearson correlation measures the degree of correlation between two variables. Specifically, it measures the strength of a linear association between two variables and is denoted by ρ. The idea is to draw a line of best fit through the data of the two variables; the Pearson correlation coefficient indicates how close the data points lie to this line. The coefficient ρ can only take values between −1 and 1, where:
• ρ = 1 indicates a perfect positive correlation between the two variables, i.e., if one of them increases the other will also increase.
• ρ = −1 indicates a perfect negative correlation between the two variables, i.e., if one of them increases the other will decrease.
• ρ = 0 indicates that there is no linear association between the two variables.
Briefly, the stronger the association between the two variables, the closer the ρ value will be to either 1 or −1, depending on whether the relationship is positive or negative, respectively.
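For reference, ρ can be computed directly from its definition as the covariance of the two variables divided by the product of their standard deviations (a stdlib sketch; the experiments presumably relied on a library implementation):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance of the two variables divided by the
    product of their standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rho_pos = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly positive linear relation
rho_neg = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # perfectly negative linear relation
```

In the experiments, xs would hold the predicted values of a dimension (e.g., valence) and ys the gold-standard annotations.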
4.4 Experimental Results
This section presents the results for the different tasks addressed in this work (i.e., for
the sentiment analysis and dimensional sentiment analysis tasks). In order to evaluate the
performance of each of them, the models were used over different contexts/datasets. Thus,
results from the experiments will be presented according to each specific task and dataset.
4.4.1 Sentiment Analysis
This section presents the results for the sentiment analysis task. All the sentiment analysis experiments are reported according to their context/dataset.
Model Epochs Batch size Accuracy
NB Bag of Words - - 78.4 %
SVM Bag of Words - - 75.6 %
NB-SVM Bag of Words - - 76.5 %
MLP Bag of Words 4 32 76.8 %
Stack of Two LSTMs Word2Vec 5 32 76.7 %
Bidirectional LSTM Word2Vec 5 32 75.8 %
CNN Word2Vec 4 32 77.1 %
CNN-LSTM Word2Vec 5 32 76.0 %
MLP Doc2Vec 4 32 64.0 %
CNN-non-static Kim (2014) - - 81.5 %
RAE (Socher et al., 2011) - - 77.7 %
MV-RNN (Socher et al., 2012) - - 79.0 %
CCAE (Hermann and Blunsom, 2013) - - 77.8 %
sent-Parser (Dong et al., 2014) - - 79.5 %
NBSVM (Wang and Manning, 2012) - - 79.4 %
MNB (Wang and Manning, 2012) - - 79.0 %
G-Dropout (Wang and Manning, 2013) - - 79.0 %
F-Dropout (Wang and Manning, 2013) - - 79.1 %
Tree-CRF (Nakagawa et al., 2010) - - 77.3 %
Table 4.1: Sentence Polarity Dataset: model results using pre-trained word embeddings, compared against previous works.
4.4.1.1 Sentence Polarity Dataset
The Sentence Polarity dataset v1.0 (Pang and Lee, 2005) was used in experiments regarding a binary sentiment analysis task on movie reviews. The models used to address this task are described in Table 4.1, with their respective results in terms of accuracy. The first four models use the bag-of-words text representation, while the remaining ones use pre-trained word embeddings. Table 4.1 is divided into two parts: the first half describes the performance of the models implemented by me and described in this dissertation, and the second half describes the performance of models from previous works using this dataset.
Table 4.1 shows that the results of the models used in my experiments are close to those of the current state of the art, although all of these models perform somewhat below it. Another relevant fact is that the best result was obtained when using a simple Naive Bayes classifier with the bag-of-words representation. This is not very surprising taking into account the results of previous studies, which also showed that these simple baselines can outperform more complex models (Wang and Manning, 2012).
The next set of experiments involved the addition of information about emotion dimensions (i.e., valence, arousal and dominance); the results are presented in Table 4.2. The first half of Table 4.2 reports the results of the experiments with the concatenated version of the word embeddings, and the second half the results of the experiments using the merged architectures.
Model Epochs Batch size Accuracy
MLP Bag of Words 4 32 76.2 %
Stack of Two LSTMs Word2Vec 5 32 75.8 %
Bidirectional LSTM Word2Vec 5 32 76.3 %
CNN Word2Vec 4 32 77.5 %
CNN-LSTM Word2Vec 5 32 74.5 %
MLP Doc2Vec 4 32 64.0 %
CNN Word2Vec 4 32 79.1 %
CNN-LSTM Word2Vec 5 32 77.5 %
Table 4.2: Sentence Polarity Dataset: model results using a concatenated version of the word embeddings, compared against the results of the merged architectures.
Looking at the first half of Table 4.2 and comparing with the results of Table 4.1, one can see a slight decrease in performance for most models, with some exceptions; the best model remains the Naive Bayes classifier. Looking at the second half of the table, which reports the results of the merged architectures, one can see a slight increase in performance for all models, and the CNN model is now the one with the best performance.
The results in Table 4.2 give some indication that adding information about the emotion dimensions can improve the results of some models in the sentiment analysis task.
4.4.1.2 Stanford Sentiment TreeBank Dataset
The Stanford Sentiment Treebank dataset (Socher et al., 2013) was also used to experiment with a sentiment analysis task on movie reviews but, instead of only distinguishing the sentiment as positive or negative, this dataset allows us to experiment with a more fine-grained sentiment analysis task, distinguishing the sentiment into five categories. The results of these experiments are described in Table 4.3. The organization is the same as in the previous experiments, i.e., the models used are the same, as is the division considered for the table.
Model Epochs Batch size Accuracy
NB Bag of Words - - 39.2 %
SVM Bag of Words - - 37.2 %
NB-SVM Bag of Words - - 39.2 %
MLP Bag of Words 4 32 39.0 %
Stack of Two LSTMs Word2Vec 5 32 38.9 %
Bidirectional LSTM Word2Vec 5 32 39.3 %
CNN Word2Vec 4 32 37.9 %
CNN-LSTM Word2Vec 5 32 41.3 %
MLP Doc2Vec 4 32 32.6 %
CNN-non-static Kim (2014) - - 48.0 %
RAE (Socher et al., 2011) - - 43.2 %
MV-RNN (Socher et al., 2012) - - 44.4 %
RNTN (Socher et al., 2013) - - 45.7 %
DCNN (Kalchbrenner et al., 2014) - - 48.5 %
Paragraph-Vec (Le and Mikolov, 2014) - - 48.7 %
Table 4.3: Stanford Sentiment Treebank Dataset: model results using pre-trained word embeddings, compared against previous works.
Similarly to what was observed in the previous set of experiments, the results obtained in this somewhat different task are also relatively close to those of the current state-of-the-art models. The best result was obtained using the combination of a CNN and an LSTM model.
The results of adding information about emotion dimensions are presented in Table 4.4. As before, the first half of the table reports the results of the experiments with the concatenated version of the word embeddings, and the second half reports on the experiments using the merged architectures.
Model Epochs Batch size Accuracy
MLP Bag of Words 4 32 39.6 %
Stack of Two LSTMs Word2Vec 6 32 40.9 %
Bidirectional LSTM Word2Vec 6 32 37.8 %
CNN Word2Vec 7 32 43.3 %
CNN-LSTM Word2Vec 5 32 44.2 %
MLP Doc2Vec 4 32 32.4 %
CNN Word2Vec 5 32 40.7 %
CNN-LSTM Word2Vec 6 32 40.5 %
Table 4.4: Stanford Sentiment Treebank Dataset: model results using a concatenated version of the word embeddings, compared against the results of the merged architectures.
Looking at the first half of Table 4.4 and comparing with the results of Table 4.3, it is possible to see a significant improvement in the performance of some of the models, namely the CNN and the CNN-LSTM. On the other hand, models like the Bidirectional LSTM and the MLP with Doc2Vec show a slight decrease in performance. In the second half of the table, and comparing with Table 4.3, there is also an improvement, although on a smaller scale.
4.4.1.3 Tweet 2016 Dataset
The Tweet 2016 dataset was used to experiment with a sentiment analysis task on Twitter data. Similarly to the Stanford Sentiment Treebank dataset, this dataset allows one to experiment with a more fine-grained sentiment analysis task, distinguishing the sentiment into three different categories. The results of the experiments using this dataset are described in Table 4.5. The first half of the table presents the performance of the models implemented by me, and the second half presents the results obtained by some of the teams in the SemEval 2016 competition using this dataset.
3 http://alt.qcri.org/semeval2016/task4/
4 http://alt.qcri.org/semeval2016/task4/
Model Epochs Batch size Accuracy
Linear model with Bag of Words - - 53.0 %
MLP Bag of Words 5 32 56.2 %
Stack of Two LSTMs Word2Vec 10 32 56.1 %
Bidirectional LSTM Word2Vec 14 32 56.4 %
CNN Word2Vec 4 32 58.7 %
CNN-LSTM Word2Vec 11 32 60.0 %
MLP Doc2Vec 6 32 52.2 %
SwissCheese - - 64.3 %
SENSEI-LIF - - 61.7 %
UNIMELB - - 61.6 %
INESC-ID - - 60.0 %
TwiSE - - 52.8 %
MDSENT - - 54.5 %
Table 4.5: Tweet 2016 Dataset: model results using pre-trained word embeddings, compared against previous works.
Table 4.5 shows that the results of the considered models are close to the results of many of the teams that performed the same task in the SemEval 2016 competition. The best result was obtained using the combination of a CNN and an LSTM model.
As with the previous datasets, the next experiments involve the inclusion of information about emotion dimensions. The results are described in Table 4.6. The first half of the table reports the results of the experiments with the concatenated version of the word embeddings, and the second half refers to the experiments using the merged architectures.
Model Epochs Batch size Accuracy
MLP Bag of Words 5 32 57.6 %
Stack of Two LSTMs Word2Vec 10 32 57.4 %
Bidirectional LSTM Word2Vec 14 32 56.1 %
CNN Word2Vec 4 32 56.2 %
CNN-LSTM Word2Vec 11 32 58.2 %
MLP Doc2Vec 6 32 52.9 %
CNN Word2Vec 10 32 60.6 %
CNN-LSTM Word2Vec 9 32 59.4 %
Table 4.6: Tweet 2016 Dataset: model results using a concatenated version of the word embeddings, compared against the results of the merged architectures.
Comparing the first half of Table 4.6 with the results of Table 4.5, it is possible to see that some models improve their performance, while others perform worse. Despite some improvements, the best result remains that of the CNN-LSTM model from the previous experiments. Regarding the second half, the CNN model improves its performance, establishing itself as the best result, while the CNN-LSTM model improves its performance when compared with the first half of the table, but remains below the results described in Table 4.5. On this dataset, the improvements when using the emotion dimensions are not as sharp as in the other datasets, in the sense that there are fewer models with improvements. Nevertheless, the best result was obtained using this type of information. Taking this into account, the idea that this kind of information can improve the prediction performance in sentiment analysis tasks is still perhaps worth pursuing.
4.4.2 Dimensional Sentiment Analysis
This section describes results for the dimensional sentiment analysis task. Specifically, I
evaluate the results for each dimension (i.e., valence, arousal, and dominance) individually for
each dataset.
4.4.2.1 Affective Norms for English Text Dataset
The ANET dataset (Bradley and Lang, 2007) was used to experiment with a dimensional
sentiment analysis task on brief texts in the English language. The models used to perform this
task are described in Table 4.7, with their respective results.
Models Valence Arousal Dominance
Stack of Two LSTMs Word2Vec 0.400 0.235 0.366
Bidirectional LSTM Word2Vec 0.549 0.183 0.294
CNN Word2Vec 0.551 0.391 0.425
CNN-LSTM Word2Vec 0.434 0.172 0.451
Table 4.7: ANET Dataset: Prediction results for valence, arousal and dominance in terms of
the Pearson correlation.
Looking closely at Table 4.7, it is possible to conclude that the best performance, in terms of
the Pearson correlation coefficient, was achieved for the valence dimension, while in the arousal
dimension the results were much worse. In the valence dimension, the best model was the CNN
model and the worst the Stack of Two LSTMs. In the arousal dimension, the best was also the
CNN model and the worst the CNN-LSTM model. Finally, in the dominance dimension, the best
performance was obtained using the CNN-LSTM model, and the worst using the Bidirectional
LSTM model.
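As a reference, the Pearson correlation coefficient used as the evaluation metric throughout these experiments can be computed as in the following sketch (the predicted and gold ratings are made-up values for illustration only):

```python
import numpy as np

def pearson(pred, gold):
    """Pearson correlation coefficient between predicted and gold ratings."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    return np.corrcoef(pred, gold)[0, 1]

# Illustrative valence ratings on a 1-9 scale, as in ANET-style annotations.
gold = [7.2, 2.1, 5.5, 8.0, 3.3]
pred = [6.5, 3.0, 5.0, 7.1, 4.2]
print(round(pearson(pred, gold), 3))
```

A coefficient close to 1 indicates that the predictions track the gold ratings closely; values near zero, as observed for arousal in some of the experiments below, indicate almost no linear relationship.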
4.4.2.2 EmoTales Dataset
The EmoTales dataset (Francisco et al., 2012) was used to experiment with a dimensional
sentiment analysis task on sentences from 18 different folk tales. The results are described in
Table 4.8.
Models Valence Arousal Dominance
Stack of Two LSTMs Word2Vec 0.205 0.043 0.073
Bidirectional LSTM Word2Vec 0.221 0.064 0.037
CNN Word2Vec 0.152 0.025 -0.026
CNN-LSTM Word2Vec 0.246 0.049 0.033
Table 4.8: EmoTales Dataset: Prediction results for valence, arousal and dominance in Pearson
Correlation Coefficient.
Table 4.8 shows that, similarly to the previous experiment with ANET, the best results in
terms of the Pearson correlation coefficient are in the valence dimension. The arousal and
dominance dimensions have much worse results. Focusing in more detail on each dimension,
the best model in valence was the CNN-LSTM model and the worst the CNN model. In the
arousal dimension, the best was the Bidirectional LSTM while the worst was the CNN model.
Finally, in the dominance dimension, the best model was the Stack of Two LSTMs, and the
worst was the CNN model.
The results for this dataset are worse than those for the previous one. This can be due to
the fact that, in this dataset, the meaning of each individual sentence, and consequently its
emotion ratings, are not independent from the other sentences that compose each tale.
4.4.2.3 Facebook Messages Dataset
The Facebook Messages dataset (Preoctiuc-Pietro et al., 2016) was also used to experiment
with a dimensional sentiment analysis task, this time on Facebook posts. In contrast with the
previous experiments, where the results were reported for three dimensions, this dataset only
contains annotations for two dimensions, namely valence and arousal. Taking this into account,
the results are described in terms of these two dimensions alone, and are presented in Table 4.9.
The first half of the table describes the performance of the models used in my experiments and
the second half, in order to compare the results, describes the performance of previous work
using this dataset.
Models Valence Arousal
Stack of Two LSTMs Word2Vec 0.300 0.073
Bidirectional LSTM Word2Vec 0.310 0.081
CNN Word2Vec 0.345 0.096
CNN-LSTM Word2Vec 0.390 0.105
ANEW (Bradley and Lang, 1999) 0.307 0.085
Aff Norms (Warriner et al., 2013) 0.113 0.188
MPQA (Wilson et al., 2005) 0.385 -
NRC (Mohammad et al., 2013) 0.405 -
BoW model (Preoctiuc-Pietro et al., 2016) 0.650 0.850
Table 4.9: Facebook Messages Dataset: Prediction results for valence and arousal in Pearson
Correlation Coefficient.
Looking at the first half of Table 4.9, one can see that the best results, in terms of the
Pearson correlation coefficient, are in the valence dimension. In this dimension, the best
result was obtained using the CNN-LSTM model, and the worst was obtained using the Stack
of Two LSTMs model. In the arousal dimension, the best and the worst models are the same
as in the valence dimension.
The second half of Table 4.9 presents a number of different existing approaches. Comparing
with the results described in the first half, it is possible to see that some of the models studied
in this work surpass the results of other existing approaches. However, the results are still
some distance from the best approach, i.e., the BoW model by Preoctiuc-Pietro et al. (2016).
This may be due to the fact that the BoW model was trained using 10-fold cross-validation,
and so had the advantage of being trained and tested on the same type of data.
4.5 Overview
In this chapter, I presented the evaluation experiments concerning each specific task.
I performed experiments in different contexts/datasets, in order to obtain more robust results.
Despite the fact that I cannot exactly compare the results in the sentiment analysis task, due
to a lack of information, I conclude that in most cases the models advanced in this dissertation
are close to the best results presented in the literature. Furthermore, the idea that adding
information about the emotion dimensions can improve the prediction performance, in sentiment
analysis tasks, remains plausible and is reinforced by some positive indications.
In the dimensional sentiment analysis task, it was possible to see that the best results were
obtained with the ANET dataset. The remaining datasets yielded worse results, and the
Pearson correlation coefficient is very close to zero in some tests, which indicates that there is
almost no relationship between the predicted values and the correct values for each dimension.
However, and in contrast with the ANET and EmoTales datasets, it is possible to compare the
results obtained by the different models using the Facebook Messages dataset. Taking this into
account, it is noticeable that some of the models used in this particular case surpass the results
of other existing approaches. However, they are still some distance from the best approach,
namely the BoW model by Preoctiuc-Pietro et al. (2016). The reasons behind these results may
have different sources, for example an inadequate training set, or even the fact that the models
being used are not the most appropriate for performing this kind of task in these contexts.
The next chapter finishes this dissertation by presenting the conclusions regarding the work
developed while also introducing some directions in terms of future work.
5 Conclusions
This chapter presents the conclusions drawn throughout this dissertation as well as possible
approaches for future work. Section 5.1 presents all the conclusions while Section 5.2 describes
different approaches for future work.
5.1 Conclusions
This dissertation described the research work conducted in the context of my Master's thesis.
Throughout the document, I presented two different tasks, namely the sentiment analysis task
and the dimensional sentiment analysis task. The sentiment analysis task aims to predict the
polarity of a given textual document, while dimensional sentiment analysis aims to predict
emotion dimensions like valence, arousal and dominance.
In the sentiment analysis task, I put into practice the idea of adding information about the
emotion dimensions associated with words. From all the experiments, I can conclude that this
idea remains promising, being reinforced by some positive indications. The results showed that,
in some cases, adding this type of information improves the prediction performance.
In the dimensional sentiment analysis task, I created a new dataset corresponding to an
extension of the Warriner dataset. After being created, this dataset was used as the training set
in some models, in order to predict the emotion dimensions in different contexts. The results
showed that, in some cases, the models advanced in this dissertation performed better than
others, although in some contexts the results were not as good using the same methodology.
For each task I also presented the model architectures and their parameters, as well as the
details involved in the experimental evaluation.
5.2 Future Work
As future work, taking into account the positive indications given by the experimental
evaluation, many other experiments can also be done. In the sentiment analysis task, it would
be interesting, for instance, to experiment with the use of the emotion dimensions in other
models. Another possibility is to use other emotion dimension datasets, like the Extended
Warriner dataset created in this thesis, for trying to improve the results reported throughout
the document.
In the dimensional sentiment analysis task, the indications are not so good in some contexts.
However, for future, it would be interesting experiment with different datasets for training and
testing neural network models.
Furthermore, for both tasks, another possibility can be pre-train a model with a very large
dataset, such as the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015)
or the Paraphrase Database (PPDB) (Pavlick et al., 2015), in order to recognize equivalent
phrases. After this process, the idea is to use the learned parameters of the model for developing
other model that allows to predict the polarity or the emotion dimensions of a given text. In
more detail, the objective is try to explore big datasets to train models that are good modeling
sequences of words (i.e. phrases), and after use these representations in models with other goals,
in this case sentiment analysis and dimensional sentiment analysis.
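The transfer scheme outlined above can be sketched as follows. The sketch assumes a simple dictionary-of-arrays representation of model parameters; the layer names, sizes, and random values are hypothetical placeholders, not the actual parameters of any model described in this dissertation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for parameters learned by pre-training a phrase model on a large
# corpus such as SNLI or PPDB (random values here, for illustration only).
pretrained = {
    "embedding": rng.normal(size=(1000, 50)),  # vocabulary size x embedding dim
    "encoder":   rng.normal(size=(50, 50)),    # sequence-encoder weights
}

def build_sentiment_model(pretrained_params):
    """Initialize a new model that reuses the pre-trained representation
    layers and adds a freshly initialized task-specific output layer."""
    return {
        "embedding": pretrained_params["embedding"].copy(),  # transferred
        "encoder":   pretrained_params["encoder"].copy(),    # transferred
        "output":    rng.normal(size=(50, 1)),  # new head for polarity/valence
    }

model = build_sentiment_model(pretrained)
print(model["output"].shape)  # only this layer would be trained from scratch
```

In practice the transferred layers could either be frozen or fine-tuned together with the new output layer on the sentiment analysis or dimensional sentiment analysis data.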
Bibliography
Andreevskaia, A. and S. Bergler (2008). When specialists and generalists work together:
overcoming domain dependence in sentiment tagging. In Proceedings of the Annual Meeting
of the Association for Computational Linguistics.
Augustyniak, L., T. Kajdanowicz, P. Szymanski, W. Tuliglowicz, P. Kazienko, R. Alhajj, and
B. K. Szymanski (2014). Simpler is better? lexicon-based ensemble sentiment classification
beats supervised methods. In Proceedings of the IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining.
Bowman, S. R., G. Angeli, C. Potts, and C. D. Manning (2015). A large annotated corpus
for learning natural language inference. In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing (EMNLP). Association for Computational Linguis-
tics.
Bradley, M. M. and P. J. Lang (1999). Affective norms for English words (ANEW): Stimuli,
instruction manual, and affective ratings. Technical report, Center for Research in Psychophys-
iology, University of Florida.
Bradley, M. M. and P. J. Lang (2007). Affective norms for English Text (ANET): Affective
ratings of text and instruction manual. Technical report, University of Florida, Gainesville,
Fl.
Church, K. and P. Hanks (1989). Word association norms, mutual information and lexicography.
In Proceedings of the Annual Meeting of the Association for Computational Linguistics.
Dong, L., F. Wei, S. Liu, M. Zhou, and K. Xu (2014). A statistical parsing framework for
sentiment classification. Computing Research Repository.
Francisco, V., R. Hervas, F. Peinado, and P. Gervas (2012). Emotales: creating a corpus of folk
tales with emotional annotations. Language Resources and Evaluation.
Gao, W., S. Li, Y. Xue, M. Wang, and G. Zhou (2014). Semi-supervised sentiment classifica-
tion with self-training on feature subspaces. In Proceedings of the workshop Chinese Lexical
Semantics.
Godin, F., B. Vandersmissen, W. De Neve, and R. Van de Walle (2013). Multimedia Lab @ ACL
WNUT NER shared task: Named entity recognition for Twitter microposts using distributed
word representations.
Goller, C. and A. Kuchler (1996). Learning task-dependent distributed representations by back-
propagation through structure. In Proceedings of the International Conference on Neural
Networks.
Hermann, K. M. and P. Blunsom (2013). The role of syntax in vector space models of com-
positional semantics. In In Proceedings of the 51st Annual Meeting of the Association for
Computational Linguistics.
Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines: Methods, Theory
and Algorithms. Kluwer Academic Publishers.
Kalchbrenner, N., E. Grefenstette, and P. Blunsom (2014). A convolutional neural network for
modelling sentences. In Proceedings of the Annual Meeting of the Association for Computational
Linguistics.
Kamps, J. and M. Marx (2002). Words with attitude. In Proceedings of the International
WordNet Conference.
Kennedy, A. and D. Inkpen (2006). Sentiment classification of movie reviews using contextual
valence shifters. Computational Intelligence.
Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the
Conference on Empirical Methods in Natural Language Processing.
Le, Q. and T. Mikolov (2014). Distributed representations of sentences and documents. In
Proceedings of the 31st International Conference on Machine Learning.
Li, S., L. Huang, J. Wang, and G. Zhou (2015). Semi-stacking for semi-supervised sentiment
classification. In Proceedings of the Annual Meeting of the Association for Computational
Linguistics and of the International Joint Conference on Natural Language Processing.
Mao, Y. and G. Lebanon (2006). Sequential models for sentiment prediction. In Proceedings
of the International Machine Learning Society Workshop on Learning in Structured Output
Spaces.
Mesnil, G., T. Mikolov, M. Ranzato, and Y. Bengio (2014). Ensemble of generative and discrim-
inative techniques for sentiment analysis of movie reviews. In Proceedings of the International
Conference on Learning Representations.
Mikolov, T., I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013). Distributed
representations of words and phrases and their compositionality. In Neural Information
Processing Systems.
Mikolov, T., K. Chen, G. Corrado, and J. Dean (2013). Efficient estimation of word represen-
tations in vector space. Computing Research Repository.
Mikolov, T., M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur (2010). Recurrent neural
network based language model. In Proceedings of the Annual Conference of the International
Speech Communication Association.
Mohammad, S. M., S. Kiritchenko, and X. Zhu (2013). NRC-Canada: Building the state-of-the-art
in sentiment analysis of tweets. Computing Research Repository.
Moreira, S., R. F. Astudillo, W. Ling, B. Martins, M. J. Silva, and I. Trancoso (2015). INESC-ID:
A regression model for twitter sentiment lexicon induction. In Proceedings of the International
Workshop on Semantic Evaluation.
Mou, L., H. Peng, G. Li, Y. Xu, L. Zhang, and Z. Jin (2015). Tree-based convolution: A new
neural architecture for sentence modeling. In Proceedings of the International Conference on
Computer Supported Collaborative Learning.
Mudinas, A., D. Zhang, and M. Levene (2012). Combining lexicon and learning based approaches
for concept-level sentiment analysis. In Proceedings of the International Workshop on Issues
of Sentiment Discovery and Opinion Mining.
Mullen, T. and N. Collier (2004). Sentiment analysis using support vector machines with di-
verse information sources. In Proceedings of the Conference on Empirical Methods on Natural
Language Processing.
Nakagawa, T., K. Inui, and S. Kurohashi (2010). Dependency tree-based sentiment classifica-
tion using crfs with hidden variables. In Human Language Technologies: The 2010 Annual
Conference of the North American Chapter of the Association for Computational Linguistics.
Osgood, C. E., G. J. Suci, and P. H. Tannenbaum (1957). The Measurement of Meaning.
University of Illinois Press.
Pang, B. and L. Lee (2005). Seeing stars: Exploiting class relationships for sentiment catego-
rization with respect to rating scales. In Proceedings of the Annual Meeting on Association
for Computational Linguistics.
Pavlick, E., P. Rastogi, J. Ganitkevitch, B. Van Durme, and C. Callison-Burch (2015). Ppdb
2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style
classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational
Linguistics and the 7th International Joint Conference on Natural Language Processing.
Preoctiuc-Pietro, D., H. A. Schwartz, G. Park, J. Eichstaedt, M. Kern, L. Ungar, and E. P.
Shulman (2016). Modelling valence and arousal in facebook posts. In Proceedings of the
Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
(WASSA).
Qiu, L., W. Zhang, C. Hu, and K. Zhao (2009). Selc: A self-supervised model for sentiment
classification. In Proceedings of the ACM Conference on Information and Knowledge Man-
agement.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams (1986). Learning internal representations
by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group
(Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition.
MIT Press.
Socher, R., B. Huval, C. D. Manning, and A. Y. Ng (2012). Semantic compositionality through
recursive matrix-vector spaces. In Proceedings of the Joint Conference on Empirical Methods
in Natural Language Processing and Computational Natural Language Learning.
Socher, R., C. C. Lin, A. Y. Ng, and C. D. Manning (2011). Parsing Natural Scenes and Natural
Language with Recursive Neural Networks. In Proceedings of the International Conference on
Machine Learning.
Socher, R., J. Pennington, E. H. Huang, A. Y. Ng, and C. D. Manning (2011). Semi-supervised
recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference
on Empirical Methods in Natural Language Processing.
Socher, R., A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013).
Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing.
Stone, P. J., D. C. Dunphy, M. S. Smith, and D. M. Ogilvie (1966). The General Inquirer: A
Computer Approach to Content Analysis. The MIT Press.
Taboada, M., C. Anthony, and K. Voll (2006). Methods for creating semantic orientation dic-
tionaries. In Proceedings of the Conference on Language Resources and Evaluation.
Taboada, M., J. Brooke, M. Tofiloski, K. Voll, and M. Stede (2011). Lexicon-based methods for
sentiment analysis. Computational Linguistics.
Taboada, M. and J. Grieve (2004). Analyzing appraisal automatically. In Proceedings of the
AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications.
Tang, D., B. Qin, and T. Liu (2015a). Document modeling with gated recurrent neural network
for sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural
Language Processing.
Tang, D., B. Qin, and T. Liu (2015b). Learning semantic representations of users and products
for document level sentiment classification. In Proceedings of the Annual Meeting of the As-
sociation for Computational Linguistics and of the International Joint Conference on Natural
Language Processing.
Thelwall, M., K. Buckley, and G. Paltoglou (2012). Sentiment strength detection for the social
web. Journal of the Association for Information Science and Technology.
Thelwall, M., K. Buckley, G. Paltoglou, D. Cai, and A. Kappas (2010). Sentiment strength
detection in short informal text. Journal of the Association for Information Science and
Technology.
Turney, P. D. (2002). Thumbs up or thumbs down?: Semantic orientation applied to unsu-
pervised classification of reviews. In Proceedings of the Annual Meeting on Association for
Computational Linguistics.
Vincent, P., H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol (2010). Stacked denois-
ing autoencoders: Learning useful representations in a deep network with a local denoising
criterion. Journal of Machine Learning Research.
Wang, S. and C. Manning (2013). Fast dropout training. In Proceedings of the 30th International
Conference on Machine Learning.
Wang, S. and C. D. Manning (2012). Baselines and bigrams: Simple, good sentiment and topic
classification. In Proceedings of the Annual Meeting of the Association for Computational
Linguistics.
Warriner, A. B., V. Kuperman, and M. Brysbaert (2013). Norms of valence, arousal, and
dominance for 13,915 English lemmas. Behavior Research Methods.
Wilson, T., J. Wiebe, and P. Hoffmann (2005). Recognizing contextual polarity in phrase-level
sentiment analysis. In Proceedings of the Conference on Human Language Technology and
Empirical Methods in Natural Language Processing.
Yang, M., W. Tu, Z. Lu, W. Yin, and K.-P. Chow (2015). Lcct: A semi-supervised model for
sentiment classification. In Proceedings of the Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies.
Zhang, Z., G. Wu, and M. Lan (2015). Ecnu: Multi-level sentiment analysis on twitter using
traditional linguistic features and word embedding features. In Proceedings of the International
Workshop on Semantic Evaluation.
Zhu, X. and Z. Ghahramani (2002). Learning from labeled and unlabeled data with label
propagation. In Proceedings of the Conference on Automated Learning and Discovery.