The degree of polarity as a factor for deep-learning based Sentiment Analysis
By
BORJA BARES FERNÁNDEZ
Department of Computer Science
UNIVERSITY OF VIGO
A Master's thesis submitted to the University of Vigo in accordance with the requirements of the degree of MÁSTER UNIVERSITARIO EN SISTEMAS SOFTWARE INTELIXENTES E ADAPTABLES in the School of Computer Engineering (ESEI) of Ourense.
JULY 2015
Directed by: PROF. DAVID N. OLIVIERI
ABSTRACT
Recently, exponential strides have been reported in the machine learning literature for challenging pattern recognition problems. One application area where marked progress has been made is the understanding of natural language for Sentiment Analysis. The goal of this thesis is to investigate these claims. At present, all authors report results on standard test datasets where training texts are highly polar (extremely positive or extremely negative). Using such prepared datasets casts doubt on the excellent results reported by these authors, and it is of interest to determine how such a dataset biases the reported results. Thus, the principal contribution of this thesis is to construct a new sentiment analysis dataset that allows us to study how the degree of sentiment polarity affects the recognition error rates. For this, we used web scraping to automatically obtain text from a popular website, Rotten Tomatoes, consisting of consensus-scored movie reviews, and stored it in a local database. With this system, we could study word embedding transformations based upon deep-learning methodology with recurrent neural networks, and slice the training data according to score in order to study the effect on learning performance. We present results with various classifiers, including random forest and stochastic gradient classifiers, and comment on test results with a recurrent neural network.
DEDICATION AND ACKNOWLEDGEMENTS
Thanks to David and my family.
TABLE OF CONTENTS
List of Tables
List of Figures
1 Introduction
    1.1 Background and State of the art
2 Methods and Software
    2.0.1 Web Scraping: data preparation and big data
    2.0.2 Word Embeddings
    2.0.3 Vector space model transforms
    2.0.4 Statistical Language models and Word embeddings
    2.0.5 The neural network language models
    2.0.6 Implementation of a RNN based NNLM
    2.0.7 Classification
3 Results
    3.0.8 The standard Large movie datasets
    3.0.9 The Rotten Tomatoes Dataset
    3.0.10 Classification Models and impact of parameters
4 Discussion
Bibliography
LIST OF TABLES
3.1 Accuracy vs epoch learning
3.2 Accuracy vs vector length
3.3 Accuracy vs number of RF estimators
3.4 Accuracy vs noise
3.5 Accuracy vs degree of sentiment polarity
LIST OF FIGURES
2.1 Workflow of the software
2.2 Functional details of Web scraper
2.3 Word embedding using tSNE
2.4 Neural network language model architectures
3.1 tSNE projection of standardized movie review database
3.2 Rotten Tomatoes database: Number of reviews vs score
3.3 Rotten Tomatoes database: Number of reviews vs size
3.4 Results of tSNE for custom database
3.5 Accuracy vs degree of sentiment polarity
CHAPTER 1. INTRODUCTION
Recent advances in machine learning (ML) are ushering in a new age in nearly all
branches of artificial intelligence. This progress is fueled by three fundamental
developments: the distributed computing capability of modern systems, the availability of
extremely large datasets, and novel theoretical insights that can be rapidly and
efficiently implemented in modern programming frameworks. Problems in pattern recognition
and natural language processing, which were once far beyond the scope of toy
ML models, are becoming mainstream and have been integrated into software that now
operates on commodity devices. The machine learning subfield that has contributed most
to this progress is called Deep Learning, and it is powered for the most part
by a resurgence in neural networks, with the modern enhancement of going ever deeper
(and recursive) in hidden (latent) layers.
Among the challenging problems that this new breed of deep learning algorithms
has tackled is Sentiment Analysis: determining whether a chunk of natural language
text has a positive or negative sentiment, called its polarity. In general, processing natural
language and obtaining a representation from such text is a grand challenge problem
within the field of artificial intelligence. Sentiment analysis is a difficult machine learning
task, but it represents a fundamental problem for benchmarking new ideas in AI.
Nonetheless, in its limited form, sentiment analysis seeks to predict the overall polarity
of a text, which is a binary learning problem.
The main thrust of this work is to provide a hands-on comparison of a few leading
methods that have emerged in sentiment analysis in the past two years (from approximately
2013), using a newly constructed dataset that allows us to study how the degree
of sentiment polarity affects the recognition error rates. As such, this work represents a
computational or practical review article of the competing methods. The field has moved
so quickly in the past two years, with breakthroughs in deep-learning methodologies, that
such a review (albeit here not extensive) is both timely and necessary. In particular, I
implemented a system for studying word embedding transformations using two methods,
word2vec and doc2vec, and then used these high-dimensional projections for predicting
sentiment polarity with modern classifiers: random forests, stochastic gradient
classifiers, and recurrent neural networks.
In order to compare methods, a specific contribution of this work is to test them
on a real dataset, namely, movie reviews from a leading website, Rotten Tomatoes. This
provided a way of selecting the degree of polarity and determining how the models perform
as a function of it. With respect to the specific problem, we address the
following: the dependence of the classification error rate on the corpus domain, word
embedding comparisons, comparison of different classifiers, implementation details of
obtaining word embeddings, and the quality of results as a function of the quality of the input
dataset (i.e., whether the text reviews are highly conditioned or can contain noise). For
this, we developed a system for web scraping and data preparation (data wrangling).
Here this dataset serves as a realistic test system, but it could also serve in the future
for evaluating new models for sentiment analysis.
1.1 Background and State of the art
In and of itself, sentiment analysis has become an important practical tool for the
booming field of machine intelligence applied to commercial applications. It seems that
everyone these days is a data scientist! Emerging applications that rely upon sentiment
analysis include product comparisons (including product review recommendation systems),
opinion summarization (from social networks to news summarization), opinion
reason mining (aggregation of opinion polarity, or quantifying large numbers of opinions
with statistical analysis), and other applications (search engines, email filtering, etc.).
Given the economic implications of these types of applications, it is no wonder that the
major technology companies (Google, Twitter, Facebook, Amazon, to name a few) have
become the major players in this arena, and the principal driving forces behind the
theoretical advances of machine learning.
A recent review of Sentiment Analysis can be found in [21]. Particular applications of
these techniques include the analysis of financial information [12], intonation for
text-to-speech [22], and assessments of commercial products [7]. Nonetheless, a standard
test that has emerged as a way of comparing different machine learning algorithms in
sentiment analysis is movie reviews [18]. In particular, the dataset most used is the
Large Movie Review database [1, 10], consisting of 50k annotated and highly polar reviews,
evenly split between positive and negative reviews.
Automatic classification of text for discovering sentiment with machine learning has a
long history. A recent review [13] describes modern methods. Several techniques emerged
early on to convert words or text into a numerical feature vector, thereby constructing a
vector space of word embeddings $W : w \to \mathbb{R}^n$. The basic idea of this vector space
paradigm follows immediately: two words or phrases are similar in meaning if their distance
in $\mathbb{R}^n$ is small, and dissimilar if their distance is large. In practice,
constructing such a word embedding is nontrivial.
Historically, a basic word embedding technique that emerged early on is bag-of-words
(BOW). While it is difficult to trace the origins of the first use of BOW, it could most
probably be credited to [9, 11]. The fundamental concept of BOW is that a document
is a mere collection of terms, each assigned a number depending upon its occurrence in
the text, or corpus.
As may be expected, the performance of machine learning predictions with BOW
is poor due to the obvious disadvantage of such a simple representation, namely that the
grammatical position of words, which is needed for understanding the linguistic context,
is not considered. Later methods sought to overcome such problems using n-grams,
which are groupings of n words, together with a host of algorithmic developments. The most
promising approaches that have emerged are probabilistic linguistic models, described
recently by many authors [2, 8, 21]. One approach is Latent Dirichlet Allocation (LDA) [5].
This is a probabilistic graphical model (or generative model, since it produces probability
distributions from which samples can be generated, or drawn) and is often used in topic
modelling or topic reduction; the idea being that groups of words pertain to topics.
A very recent family of approaches that has yielded the best results to date is based upon
a renewal of recurrent neural networks, specifically in the context of deep learning
[14]. These models generate vectors of fixed size from variable-sized inputs, unlike the
classifier-based approaches. Two such models are Word2Vec and Doc2Vec.
CHAPTER 2. METHODS AND SOFTWARE
In this section, we describe the algorithms and software tools that were developed
to distinguish different reviews by constructing word embeddings and using modern
classifiers. First, we describe details of the overall architecture and data acquisition.
Next, we describe the specific feature vector encoding used for word embedding. Finally,
we discuss the details of the Software/Hardware implementation, providing details with
respect to the language, platforms, and libraries that were used to build the system.
In order to implement new methods, I used several libraries and
platforms for word modeling, machine learning, and deep learning. In particular,
for word modeling, I used the Gensim library [17], which provides a Python wrapper for
word embedding. Text preparation was done with NLTK and BeautifulSoup. For the random
forest and stochastic gradient classifiers, as well as data wrangling for cross-validation,
I used the scikit-learn library (http://scikit-learn.org/). For modern machine learning
(deep learning), especially with recurrent neural networks, I used Theano [3] in
conjunction with Keras (http://keras.io/). These latter tools have been developed by some
of the groups that are driving the deep-learning movement. Here I shall show how this
technology fits together and can be used to solve some of the more challenging problems
in the machine learning field.
Figure 2.1 shows the general architecture of the system. The Web scraping block
consists of custom-built software for web scraping and database loading. The
Word-embedding block uses the reviews to construct a specific corpus and
applies word models to derive the word-embedding mapping. Finally, the classifier
method is used to determine the polarity of the review for a given word-embedding
method. In the following sections, each module shall be described in detail.
[Figure 2.1: block diagram with labels: Movie review repository, URL, 1. Web scraping, 2. Word embedding modeling, 3. Classification, indexing, queries.]
FIGURE 2.1. The workflow/system architecture is composed of three modules: the
Web scraper, the Word embedding module (with two embedding methods), and the Classifier, connected serially.
2.0.1 Web Scraping: data preparation and big data
The present Big-Data paradigm would not be possible without the Internet. Much of the
data in big data is harvested from the enormous quantity generated each day
by social networks and messaging, search engines, and news. Indeed, in 2012, 2.5 quintillion
bytes ($2.5 \times 10^{18}$ bytes) were created every day. At this rate, 90% of the world's data was
produced in the last two years. The numbers are astounding; each minute, Facebook
processes 350 GB, Google processes more than 2 million search queries, and Twitter generates
more than 300 thousand tweets.
For intelligent systems driven by machine learning, this deluge of information must
be harvested and processed. Web scraping is the name given to harvesting or extracting
data from the web by browsing web pages automatically.
Technically, it consists of an HTTP connection and a GET request. Once obtained, this
information must be transformed into a form that can be used by intelligent algorithms.
This conversion process is referred to as data munging or data wrangling,
because it carries the connotation of fighting with the data to force it into a tame format
that can be used by downstream software tools.
Obtaining a heterogeneous data source for sentiment analysis represents an important
challenge. In this work, I used the reviews available at Rotten Tomatoes
(http://www.rottentomatoes.com/), a popular website for user-generated movie criticism,
known as a film review aggregator, since many reviews from different sources (blogs,
articles, etc.) can be associated with a particular film. This site has the peculiarity that it
does not actually store the text of the reviews, but stores only the link and metadata
associated with each review and an associated score, called the Tomatometer critic aggregate
score. An excerpt from the Wikipedia entry for Rotten Tomatoes explains this process
succinctly:
Rotten Tomatoes staff first collect online reviews from writers who are certified
members of various writing guilds or film critic associations. To be
accepted as a critic on the website, a critic's original reviews must garner a
specific number of likes from users. Those classified as Top Critics generally
write for major newspapers. The staff determine for each review whether it
is positive (fresh, marked by a small icon of a red tomato) or negative (rotten,
marked by a small icon of a green splattered tomato). (Staff assessment is
needed as some reviews are qualitative rather than numeric in ranking.) At
the end of the year, they identify the film that was rated highest as receiving
the annual Golden Tomato.
The fact that Rotten Tomatoes does not save the text, but provides the URL link to the
review, means that each review has its own particular format and size. From the point of
view of a web scraping front-end and sentiment analysis software, this situation is far
more complex than obtaining tweet data from Twitter, where the format is uniform and
the text size is restricted. Indeed, for Rotten Tomatoes, the web scraper must be
designed to extract data from a wide array of web formats and sizes, and must be fault
tolerant against dead URLs and websites that do not obey standard markup.
The Web Scraping Application. Connecting to the Rotten Tomatoes (RT) website to
obtain metadata and reviews is greatly facilitated by the API
provided by RT for third-party developers. Figure 2.2 shows a simplified schematic of
the connectivity of the web scraper. By sending an HTTP request to this API,
the relevant metadata associated with a movie can be retrieved. In particular, this
request returns a list of movies with their associated Rotten Tomatoes IDs, which are used
to retrieve the list of URLs pertaining to the reviews, the polarity, and other metadata
associated with each movie. With this information, the next step is to obtain the actual
text of the reviews indicated by the various URLs (a movie can have multiple reviews,
hosted at different sites).
FIGURE 2.2. Functional details of the Web scraper modules, indicating the
actions taken during a connection with the Rotten Tomatoes third-party
API.
Given the URL of a review, the webpage is automatically fetched using the
requests module in Python. This fetch operation is blocking, obliging
the code to wait until the request is resolved either by data transfer or by a timeout.
In a totally synchronous design, such a situation would be inefficient, producing
large amounts of dead time. To solve this problem, the threading module of Python
is used to issue requests concurrently from multiple threads. It is true that Python is built
with a global interpreter lock, or GIL [Understanding the Python GIL, David Beazley;
http://www.dabeaz.com/python/UnderstandingGIL.pdf], which means that launching threads
does not produce true parallel execution, since a mutual exclusion lock is held by the
Python interpreter. Nonetheless, since threads waiting on network connections sit on a
wait queue, they do not need to be actively running, and the threading primitives work
effectively in this situation. Thus, we obtained up to 10× speed improvements with a
multithreaded implementation. It should be mentioned that on multicore machines, the
Python multiprocessing module could be used to achieve core-level parallelism, since each
process has its own interpreter.
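The threaded fetching described above can be sketched as follows. This is a minimal illustration and not the thesis code: the names `fetch_all` and `fake_fetch` are hypothetical, and `fake_fetch` simulates a blocking network call with `time.sleep` so that the sketch runs without network access.

```python
import threading
import time
from queue import Queue, Empty


def fetch_all(urls, fetch, n_workers=8):
    """Apply fetch(url) to every URL using a small pool of worker threads."""
    tasks = Queue()
    for url in urls:
        tasks.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except Empty:
                return  # no work left for this thread
            try:
                value = fetch(url)  # blocking call; other threads keep running
            except Exception as exc:
                value = exc  # dead URLs are recorded instead of aborting the run
            with lock:
                results[url] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_fetch(url):
    """Stand-in for a real HTTP GET (assumption: simulated latency only)."""
    time.sleep(0.05)
    if "dead" in url:
        raise IOError("unreachable: " + url)
    return "<html>%s</html>" % url
```

In a real run, `fetch` could be something like `lambda u: requests.get(u, timeout=10).text`. Because the threads spend most of their time waiting on the network, the GIL is released during the wait and the total time approaches that of the slowest batch rather than the sum of all requests.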
Once the HTML webpage of the review has been retrieved, the text must be extracted
and converted into a usable form. For this, the library BeautifulSoup
(https://pypi.python.org/pypi/beautifulsoup4) was used. This library provides methods for
dissecting the contents of a webpage by navigating, searching, and modifying a generated
parse tree. When first designing the algorithm for the web scraper, a general-purpose solution
was developed. In this generic algorithm, a set of elements was eliminated based upon
expected regular expression patterns. However, due to the complexity of
handling general cases, this method had limited utility. Instead, a more
specific but effective solution was to use BeautifulSoup to obtain specific HTML elements
in the document, namely the p (paragraph), span, and article elements. In this pass,
non-visible content is eliminated. For example, elements hidden with CSS styles such as
display: none, commented code, and JavaScript code are identified and eliminated. Since the HTML
markup tags are recognized by the BeautifulSoup parse tree, only the text between
markup is extracted. The following code segment shows the default rule for the web
scraper.
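def _default_scrapper(body):
    # body is a BeautifulSoup node; _visible (defined elsewhere in the scraper)
    # is the predicate that filters out non-visible text nodes.
    paragraphs = []
    for paragraph in body.find_all(['p', 'span', 'article']):
        paragraphs.append("".join(filter(_visible, paragraph.find_all(text=True))))
    return " ".join(paragraphs)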
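To make the idea of the default rule concrete without depending on the helpers that are not shown here, the following sketch reimplements it with Python's standard-library `html.parser` instead of BeautifulSoup. This is a swapped-in technique for illustration only; the class and function names are hypothetical, and it does not detect elements hidden via `display: none`.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text found inside p/span/article, skipping script and style."""

    KEEP = {"p", "span", "article"}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.keep_depth = 0   # how many kept elements we are currently inside
        self.skip_depth = 0   # how many script/style elements we are inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.KEEP:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.KEEP and self.keep_depth:
            self.keep_depth -= 1

    def handle_data(self, data):
        # HTML comments never reach this method, so they are dropped implicitly.
        if self.keep_depth and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Unlike the BeautifulSoup rule, this variant visits each text node once, so text in a span nested inside a paragraph is not collected twice.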
Some domains had significantly more reviews than others, so it made sense to
use custom scrapers for these cases. In fact, approximately 15 domains account for more
than 25% of the reviews at Rotten Tomatoes. This is because Rotten
Tomatoes reports reviews from qualified sources, such as professional movie reviews
from major newspapers and movie critics. To handle these cases, a function was developed
that loads specific configurations depending upon the source of the URL. In this
way, by specifying the id or class of the review, specific elements could be eliminated. This
has the advantage that the scrapers can be made highly accurate by tailoring them to these
specific web domains. The disadvantage, of course, is that the application will need
revalidation in the future if the page layout at these domains changes. The following
code shows a configurable scraping function that obtains more than 25% of the reviews.
def _class_scraper(body, **options):
    # The upper-case names are option keys defined elsewhere in the scraper module.
    scraper_options = options.get('options')
    review_div = None
    # Locate the element that contains the review, by class and/or by id.
    if REVIEW_DIV_CLASS in scraper_options:
        review_div = body.find(class_=scraper_options[REVIEW_DIV_CLASS])
    if REVIEW_DIV_ID in scraper_options:
        review_div = body.find(id=scraper_options[REVIEW_DIV_ID])
    # Remove unwanted descendants from the parse tree before extracting text.
    if CLASSES_TO_DECOMPOSE in scraper_options:
        for class_to_decompose in scraper_options[CLASSES_TO_DECOMPOSE]:
            for div in review_div.find_all(class_=class_to_decompose):
                div.decompose()
    if IDS_TO_DECOMPOSE in scraper_options:
        for id_to_decompose in scraper_options[IDS_TO_DECOMPOSE]:
            for div in review_div.find_all(id=id_to_decompose):
                div.decompose()
    if HTML_TAGS_TO_DECOMPOSE in scraper_options:
        for html_tag_to_decompose in scraper_options[HTML_TAGS_TO_DECOMPOSE]:
            for div in review_div.find_all(html_tag_to_decompose):
                div.decompose()
    return _review_div_scraper(review_div)
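How a configuration might be selected from the source of the URL can be sketched as below. The domain name, option keys, and the function `config_for` are all hypothetical placeholders; the thesis does not show its configuration table.

```python
from urllib.parse import urlparse

# Hypothetical per-domain options; real entries would name the actual
# class/id of the review container on each site.
SCRAPER_CONFIGS = {
    "example-news.com": {
        "review_div_class": "review-body",
        "classes_to_decompose": ["ad", "share"],
    },
}


def config_for(url, configs=SCRAPER_CONFIGS):
    """Return the scraper options registered for the URL's domain, or None."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # treat www.example.com and example.com alike
    return configs.get(host)
```

A caller would use the returned options with the configurable scraper when they exist and fall back to the default rule when `config_for` returns `None`.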
There were also several domains that contained many reviews, yet without a consistent
webpage style. In some cases this was because the sites were still serving older
versions (e.g., reviews written several years ago), or because the webpage design changes
depending upon the sub-domain due to different web interfaces. For these domains, customized
functions were created for each. These functions cover 13 domains that represent more
than 25% of the reviews.
Once the reviews were obtained, they were stored in a MySQL database on disk
together with the metadata of each review. Downstream analysis of reviews was
accomplished by performing database queries and converting the saved raw text to a set of
numbers. This conversion of text to a high-dimensional vector is referred to as
word embedding and is the subject of the next section.
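The storage and query step can be illustrated with a small sketch. It uses SQLite from the Python standard library as a stand-in for MySQL so that it runs without a database server; the table layout, column names, and the 0 to 100 score scale are assumptions made for the example, not the schema used in this work.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a review store; the schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        "id INTEGER PRIMARY KEY, url TEXT UNIQUE, score REAL, body TEXT)"
    )
    return conn


def save_review(conn, url, score, body):
    # INSERT OR REPLACE keeps one row per URL if a review is scraped twice.
    conn.execute(
        "INSERT OR REPLACE INTO reviews (url, score, body) VALUES (?, ?, ?)",
        (url, score, body),
    )
    conn.commit()


def reviews_by_score(conn, low, high):
    """Return (body, score) pairs whose score lies in [low, high]."""
    rows = conn.execute(
        "SELECT body, score FROM reviews WHERE score BETWEEN ? AND ? ORDER BY score",
        (low, high),
    )
    return rows.fetchall()
```

Selecting by score range in this way is what allows the training data to be sliced by degree of polarity, for example by comparing a model trained on the extremes with one trained on mid-range scores.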
2.0.2 Word Embeddings
A word embedding $W : \text{words} \to \mathbb{R}^n$ is a parameterized function that maps words
from natural language to vectors in a high-dimensional space (perhaps 200 to
500 dimensions). Although it is impossible to visualize such high-dimensional vectors
directly, one approach is to map them to a lower-dimensional space. For this, the
t-SNE method (t-distributed stochastic neighbor embedding) [23, 24] is a recent successful
technique that can perform such a dimensionality reduction. In our
case, it can be used to gain an intuition of the word embedding space by visualizing its most
relevant axes (more high-dimensional data visualizations with t-SNE can be found at
http://homepage.tudelft.nl/19j49/t-SNE.html). Figure 2.3 shows an example of the word
embeddings we obtained using gensim's implementation of word2vec, specifically
trained on the Rotten Tomatoes dataset. In the results section, we use this technique to
visualize data vectors formed from the movie reviews.
2.0.3 Vector space model transforms
As described previously, the bag-of-words (BOW) model is perhaps the first and simplest
method for representing natural language. It is based upon the idea that a full text
(i.e., a sentence, paragraph, or full document) is tantamount to a collection of unconnected
words, as if contained in a bag. In this way, a possible representation of a particular word
is based upon its frequency of occurrence in the text. The major disadvantage of
this representation is that it disregards word context. Thus, as is to be expected, the BOW
representation performs poorly. Nonetheless, it is still a popular choice for rudimentary
classifiers such as those used in simple spam filtering.
Other incremental improvements on the bag-of-words technique include the following
methods:
• Term frequency - inverse document frequency (Tf-Idf): a statistic that uses word
frequency both in the document and in the corpus, performing slightly better
than bag-of-words, since it accounts for how frequently words appear in general.
• Latent semantic indexing (LSI): an index based upon singular value decomposition
that uncovers the semantic structure in the usage of words in a text.
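The difference between raw bag-of-words counts and Tf-Idf weights can be seen in a few lines of plain Python. This is a sketch of one common unsmoothed variant (term frequency times log of inverse document frequency); library implementations such as scikit-learn's use slightly different smoothing and normalization, and the function names here are illustrative.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to its term counts, ignoring word order entirely."""
    return Counter(text.lower().split())


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    bags = [bag_of_words(doc) for doc in docs]
    n_docs = len(bags)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in bag.items()
        })
    return weights
```

A term that occurs in every document receives weight zero, which is how Tf-Idf discounts words that carry little information about any particular text.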
FIGURE 2.3. An example of the t-SNE from the word embeddings of a dataset.
See http://alexanderfabisch.github.io/t-sne-in-scikit-learn.html
2.0.4 Statistical Language models and Word embeddings
Clearly, any vector space transform for representing words in a text should contain
information about the word context. The most successful models for attaining this are
those based upon probabilistic descriptions of natural language.
One example of a highly successful probabilistic graphical model technique is
Latent Dirichlet Allocation (LDA) [5]. In this method, the model attempts to reduce
natural language text to a set of topics. In this sense, it is a dimensionality reduction
method. With this reduced-dimensional space of topics, all words of the text are classified
or grouped within these categories. An interesting application area of this method is the
conversion of large texts into summaries. An excellent recent review of probabilistic topic
models is given by one of the leading developers of the method [4].
Topic analysis, as described in [4], can be understood in terms of joint distributions
with some simple definitions. Reproducing the notation of [4]: there are $K$ topics
(written $\beta_{1:K}$), where each $\beta_k$ is a distribution over words; topic proportions
are given by $\theta_{d,k}$ for the $d$th document and $k$th topic; topic assignments for
document $d$ are given by $z_d$, while the assignment for the $n$th word in document $d$ is
given by $z_{d,n}$. The observed $n$th word in the $d$th document is $w_{d,n}$. With these
definitions, LDA forms the joint distribution:
$$p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)$$
This joint probability is normally formulated in the context of a Bayesian graphical
model and can be solved with various optimization methods, or with Markov chain
Monte Carlo (MCMC).
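One way to read the joint distribution is as a recipe for generating documents: draw the topics, draw each document's topic proportions, then draw a topic and a word for every position. The sketch below follows that recipe with the standard library only. The symmetric Dirichlet priors (`alpha`, `eta`), the fixed document length `N`, and all function names are assumptions made for the illustration; this is a sampler for the generative model, not an inference procedure.

```python
import random


def dirichlet(alpha, rng):
    """Sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, K=2, D=3, N=8, alpha=0.5, eta=0.5, seed=0):
    rng = random.Random(seed)
    # p(beta_i): each topic is a distribution over the vocabulary.
    beta = [dirichlet([eta] * len(vocab), rng) for _ in range(K)]
    docs = []
    for _ in range(D):
        theta = dirichlet([alpha] * K, rng)              # p(theta_d)
        z = [categorical(theta, rng) for _ in range(N)]  # p(z_dn | theta_d)
        w = [vocab[categorical(beta[k], rng)] for k in z]  # p(w_dn | beta, z_dn)
        docs.append({"theta": theta, "z": z, "w": w})
    return beta, docs
```

Inference runs this process in reverse: given only the words `w`, it estimates the hidden `beta`, `theta`, and `z` that most plausibly produced them.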
The LDA method is implemented in the gensim (https://radimrehurek.com/gensim/)
system. While we performed initial tests using this word embedding, in the end, we did
not use this for sentiment analysis, since an extra step would still be required to reduce
topics to sentiment polarity. As a result, we turned our attention to other probabilistic
language models, those based upon Neural Networks.
2.0.5 The neural network language models
Similar statistical approaches have been taken in the context of large recurrent neural
networks. The goal of statistical language modeling [15]
(http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf)
is to predict the next word in textual data given its context. Recurrent neural networks are
interesting because arbitrarily long word contexts can be remembered within the hidden
layers of the network, thereby contributing to more accurate conditional probabilities at
the output layer.
In the paper [16], the power of such models is explained quite succinctly:
Continuous space language models have recently demonstrated outstanding
results across a variety of tasks. In this paper, we examine the vector-space
word representations that are implicitly learned by the input-layer weights.
We find that these representations are surprisingly good at capturing syntactic
and semantic regularities in language, and that each relationship is
characterized by a relation-specific vector offset. This allows vector-oriented
reasoning based on the offsets between words. For example, the male/female
relationship is automatically learned, and with the induced vector representations,
King - Man + Woman results in a vector very close to Queen.
[Figure 2.4: two panels, (a) Feed-forward NNLM and (b) Recurrent NNLM, each showing one-hot inputs, an input layer, an aggregation layer, hidden layer(s), and an output layer.]
FIGURE 2.4. Architecture of a standard feed-forward neural network language
model (NNLM). The two input words u and v are at the input layer with
a one-hot encoding scheme. The projection or aggregation layer, joins the
inputs, while multiple hidden layers are activated either linearly (left) or
with temporal firing for recurrent NN (right).
This last statement by [16], in mathematical form
$v(\text{King}) - v(\text{Man}) + v(\text{Woman}) \approx v(\text{Queen})$, was quite
remarkable. It sent shock waves throughout the NLP
field, because it revealed the inferential and generative prowess of the method. Thus, in
[14], the word2vec model, which transforms words into fixed-size vectors that capture
the meanings of words, was introduced and immediately became the superstar of the show.
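The vector-offset reasoning can be reproduced on a toy scale. In the sketch below the three-dimensional vectors are invented by hand purely to illustrate the arithmetic (they are not learned embeddings), and `analogy` is a hypothetical helper that returns the nearest remaining word by cosine similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(embeddings, a, b, c):
    """Return the word whose vector is closest to v(a) - v(b) + v(c)."""
    target = [x - y + z
              for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    # The query words themselves are excluded, as is usual for this test.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Hand-made toy vectors (assumption): axes loosely mean royalty, male, female.
toy = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "film":  [0.5, 0.5, 0.5],
}
```

With these vectors, `analogy(toy, "king", "man", "woman")` selects `"queen"`, because subtracting the male offset and adding the female offset moves the king vector toward the queen vector.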
Neural network language models (NNLM) follow similar topologies. First, they are
probabilistic graphical models, where the interconnections between nodes $i$ and $j$ describe
a conditional probability $p(i \mid j)$.
Following the excellent recent review of [19], a basic NNLM architecture is shown
in Figure 2.4. The input layers depicted are two words within a specified context (they
can be successive words, or predecessors that are part of an n-gram). These words $u$ and $v$
are represented with a one-hot encoding scheme, i.e., all elements are zero except one,
representing a $\delta$-function at the $j$th position. Given a weight matrix $W$, a
bias $b$, and an activation function $F$, the nodes $x$ on the $k$th layer are given by:
$$x^{(k)} = F\left(W^{(k)} x^{(k-1)} + b^{(k)}\right)$$
where it can be seen that the $k$th layer depends explicitly on the previous layer
$(k-1)$. The activation function $F$ varies with each layer. For the aggregation, or projection,
layer, the activation function is just the identity. For the hidden layers, a sigmoid
activation function is used. Finally, at the output, a softmax is used.
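The layer recursion can be traced with a small forward pass written in plain Python. The layer sizes, the random untrained weights, and all function names are assumptions chosen for illustration; the sketch only shows how a one-hot input pair flows through an identity projection, a sigmoid hidden layer, and a softmax output.

```python
import math
import random


def affine(W, b, x):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def identity(a):
    return a


def sigmoid(a):
    return [1.0 / (1.0 + math.exp(-v)) for v in a]


def softmax(a):
    m = max(a)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def random_layer(n_out, n_in, rng):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def forward(u, v, layers, vocab_size):
    """Return p(w | u, v) over the vocabulary for context word indices u, v."""
    x = one_hot(u, vocab_size) + one_hot(v, vocab_size)  # concatenated inputs
    for (W, b), F in layers:
        x = F(affine(W, b, x))  # x(k) = F(W(k) x(k-1) + b(k))
    return x


rng = random.Random(0)
V, d, h = 6, 4, 5  # vocabulary size, projection size, hidden size (arbitrary)
layers = [
    (random_layer(d, 2 * V, rng), identity),  # projection / aggregation layer
    (random_layer(h, d, rng), sigmoid),       # hidden layer
    (random_layer(V, h, rng), softmax),       # output layer
]
probs = forward(1, 3, layers, V)
```

Because the weights are random, `probs` is close to uniform; training adjusts the matrices so that the probability mass concentrates on words that actually follow the given context.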
As can be seen in Figure 2.4, at the output, the probability p(w|u,v) is found given
two input context or prede words. All the work of the neural network training is to
discover the matrix W, such that the probability distribution p(w|u,v) can be known.
Many variants of this basic model exists that further decompose this probability into
clusters. Nonetheless, if wi is the interpreted as the probability of observing a certain
word (or word sequence), then p(w1, · · · ,wm) is n-gram model. A Markov assumption is
then assumed, such that the conditional probabilities are separable:
p(w1, · · · ,wm)=M∏
i=1p(wi|wi−(n−1), ·,wi−1)
which means that only the n−1 most recent words are considered for predicting the ith
word. These separable probabilities in the Markov expression above are what is optimized in the
training of a neural network. In particular, the objective function is the log-likelihood
expression
\[ L = \sum_{i} \log p(w_i \mid w_{i-(n-1)}, \cdots, w_{i-1}) \]
which is maximized (equivalently, −L is minimized), typically with a stochastic gradient descent algorithm, where the gradients
are obtained with the back-propagation algorithm. Further
details can be found in [20] and [19].
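To make the objective concrete, the sketch below evaluates the log-likelihood L for a toy sentence with n = 2. As an assumption made purely for illustration, the conditional probabilities come from smoothed bigram counts rather than from the softmax output of a neural network; the corpus and the function names are hypothetical.

```python
import math
from collections import Counter

def ngram_log_likelihood(tokens, n, cond_prob):
    """Sum log p(w_i | previous n-1 words) over the sequence."""
    total = 0.0
    for i, word in enumerate(tokens):
        context = tuple(tokens[max(0, i - (n - 1)):i])
        total += math.log(cond_prob(word, context))
    return total

# Toy corpus (made up for illustration).
corpus = "the film was good the film was bad".split()
vocab = sorted(set(corpus))

# Count-based bigram estimate with add-one smoothing (an assumption, used
# here in place of the neural network's softmax output).
pair_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def bigram_prob(word, context):
    if not context:                 # first word: uniform prior over the vocabulary
        return 1.0 / len(vocab)
    prev = context[-1]
    return (pair_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

L = ngram_log_likelihood(corpus, n=2, cond_prob=bigram_prob)
```

Because every factor is a probability, L is negative; a better model of the sequence gives a value closer to zero.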
Recurrent neural networks have a similar architecture; however, they allow the
number of predecessor words to be extended through the use of memory. Internally, the
nodes can have loops, and the activation function F will fire at different times and
will have a finite duration during activation. This is the essential idea behind the
concept of Long Short-Term Memory (LSTM) recurrent networks. A detailed mathematical
development of the mechanisms necessary to explain these models is beyond the scope of
this work; such a discussion can be found in [19, 20].
2.0.6 Implementation of a RNN based NNLM
A NNLM that has yielded powerful results is word2vec [16]. The word2vec algorithm
was implemented in C and made available by the authors (see
https://code.google.com/p/word2vec/). Since then, a general Python framework for language
modelling, gensim (https://radimrehurek.com/gensim/), has provided a wrapper to this
method.
An extension of the word2vec model to paragraphs and full documents was developed
later [10]. From this work, two separate models emerged that
build upon word2vec: the Distributed Memory (DM) model and the Distributed Bag of Words
(DBOW) model. In the DM model, the network attempts to predict the next word in a document
given a number of context words and a vector for the paragraph, so that although the words are
changing, the paragraph vector remains as a fixed reference. In the DBOW model, a number of words
from the paragraph are predicted given only the paragraph vector. The following code shows how the DM and
DBOW models are called using the gensim interface.
import gensim

# `size` is the vector length; gensim 4 and later name this argument `vector_size`.
model_dm = gensim.models.Doc2Vec(min_count=1, window=10, size=size, sample=1e-3, negative=5, workers=8)
model_dbow = gensim.models.Doc2Vec(min_count=1, window=10, size=size, sample=1e-3, negative=5, dm=0, workers=8)  # dm=0 selects DBOW
model_dm.build_vocab(x_array)
model_dbow.build_vocab(x_array)
The window parameter determines the maximum context distance (in number of words)
used when learning the model. Both the min_count and sample parameters are
correction terms for word frequency: the min_count parameter eliminates words
that appear in the corpus fewer times than the number indicated; the sample parameter sets the
threshold above which the most frequent words in the corpus are down-sampled. A more complete description of these
models can be found in [10]; see also the tutorial at
http://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis
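The effect of the two frequency corrections can be sketched in a few lines of plain Python. The discard probability used here, 1 − sqrt(t / f(w)) for a word with relative frequency f(w) and threshold t, is the rule commonly quoted for word2vec; it is an assumption that this matches the gensim internals exactly, and the function name and toy tokens are hypothetical.

```python
import math
import random
from collections import Counter

def filter_and_subsample(tokens, min_count=1, sample=1e-3, seed=0):
    """Drop rare words, then randomly thin out very frequent ones."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for word in tokens:
        if counts[word] < min_count:
            continue                   # rarer than min_count: removed
        freq = counts[word] / total    # relative frequency f(w)
        # Discard probability 1 - sqrt(t / f(w)); zero when f(w) <= t.
        p_discard = max(0.0, 1.0 - math.sqrt(sample / freq))
        if rng.random() >= p_discard:
            kept.append(word)
    return kept

tokens = ["the"] * 90 + ["plot"] * 9 + ["sublime"]
no_rare = filter_and_subsample(tokens, min_count=2, sample=1.0)   # only min_count acts
thinned = filter_and_subsample(tokens, min_count=1, sample=0.05)  # only sub-sampling acts
```

In the first call the single occurrence of "sublime" is removed; in the second, the very frequent "the" is thinned out while the rare word is always kept.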
As with word2vec, these models are based upon probabilistic graphical models
implemented within a RNN deep-learning framework. Also, as in any RNN, iterative training
(epochs in the language of NN) is required to improve the weights. The epoch training
is carried out as follows:
for epoch in range(n_epoch):
    perm = np.random.permutation(x_np_array.shape[0])
    model_dm.train(x_np_array[perm])
    model_dbow.train(x_np_array[perm])
The result of this training is a set of fixed-length NumPy vectors, one per review.
Training. As outlined above, the process of training consists in minimizing an objective
function that involves all input, internal and output nodes of a probabilistic graphical
model. Even for a modest-sized corpus, the training of RNNs is computationally
intensive. Thus, modern implementations of deep-learning NN architectures
assume an efficient computational platform from the outset of problem definition. Because
back-propagation and the other numerical computations of networks can be parallelized,
modern machine learning platforms take advantage of multi-core concurrency as well as
the parallelism offered by GPUs. Two machine learning platforms that have emerged for
researchers in the field are Theano and Torch.
Theano [3] (http://deeplearning.net/software/theano/) is a full-featured Python
framework for numerical and symbolic expressions that internally maps computations
to bare metal. It has become one of the de facto standard frameworks for modern
machine learning. A competing framework is Torch (http://torch.ch/), which uses the Lua
language (http://www.lua.org/). Both frameworks have reached a sufficient
level of maturity to have active communities and design roadmaps.
Here we have chosen Theano-based development. As a demonstration of how Theano
can be used to obtain computational benefits on the GPU compared to the CPU, the
following example shows the code for performing an element-wise numerical computation.
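import numpy
import theano
import theano.tensor as T
from theano import config, function, shared
...
vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
# Shared variable holding the random input on the selected device
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))  # compiled function computing exp(x)
for i in range(iters):
    r = f()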
This script computes the function exp() on a set of random numbers. Contained in the
Theano function is a map to the actual implementation, which could be optimized for a
CPU or GPU. By changing the device options in the following ways:
THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py
THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py
we can determine the performance on each device. On an Intel quad-core i7 CPU, the
operation took 12.02 seconds, while on an Nvidia GTX 780, the operation took 0.96 seconds,
representing an acceleration of approximately 12.5×.
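A comparison of this kind only needs a wall-clock timer around the repeated call. The following is a minimal sketch using the Python standard library; the helper names and the small CPU workload are hypothetical stand-ins and are not the Theano script timed above.

```python
import math
import time

def time_call(fn, iters):
    """Return the wall-clock seconds needed to call `fn` `iters` times."""
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return time.perf_counter() - start

def speedup(t_reference, t_accelerated):
    """Ratio of the two elapsed times."""
    return t_reference / t_accelerated

# Stand-in workload (an assumption): exp() over a small list on the CPU.
data = [0.001 * i for i in range(1000)]
elapsed = time_call(lambda: [math.exp(v) for v in data], iters=100)

# Speed-up implied by the timings quoted in the text (12.02 s and 0.96 s).
gpu_speedup = speedup(12.02, 0.96)
```

With the quoted timings the ratio is about 12.5.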
2.0.7 Classification
Once the vector transformations are obtained for each review, we used different
classifiers to train on and predict the sentiment of the text. We investigated the use of the following classifiers:
Stochastic Gradient Descent, Random Forest and RNN.
For the Stochastic Gradient Descent and Random Forest classifiers, we used the popular Sklearn
(http://scikit-learn.org/) framework. For the RNN classification, we used Keras (http://keras.io/),
which is a recently constructed framework for wrapping Theano implementations of
neural networks. In this framework, simple feedforward networks as well as research-level
networks, such as LSTM recurrent networks or convolutional networks, can be constructed.
The following is an example of constructing an RNN using an LSTM recurrent layer.
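from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
...
model = Sequential()
model.add(Embedding(max_features, 256))
model.add(LSTM(256, 128))  # recurrent layer: 256 inputs, 128 outputs
model.add(Dropout(0.5))
model.add(Dense(128, 1))
model.add(Activation('sigmoid'))  # single output for binary sentiment
model.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary")
# Train on the training vectors, holding out 10% for validation
model.fit(train_vecs, y_train, batch_size=16, nb_epoch=2, validation_split=0.1, show_accuracy=True)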
Despite being one of the more developed frameworks for modern neural networks, the
classes are still restrictive. For example, we found that for fixed vectors, we could run the
model using the RNN for a limited number of reviews; however, above 5000 reviews, our
model did not converge. More work would need to be dedicated to this. Our conclusion
is that Keras (and other such frameworks, such as Lasagne or Blocks) are interesting
ideas, but lack the flexibility of implementing a network model directly in
Theano, Python, Torch or even in C (as was the case for the original
word2vec).
CHAPTER 3
RESULTS
To understand how training (and subsequent prediction) depends upon the dataset, we
carried out experiments with the algorithms on two datasets:
• Large Movie Review Dataset v1.0. This is a standardized dataset used by
researchers as a benchmark for sentiment classification, providing an easy way to
compare methods. The dataset contains 50,000 reviews, with their associated binary
sentiment polarity labels, split evenly into testing
and training sets. The overall distribution of labels is balanced (25k positive
and 25k negative). There are also an additional 50,000 unlabeled documents for
unsupervised learning.
• Custom Rotten Tomatoes web review scraper. This is a dataset we constructed by
web-scraping reviews from Rotten Tomatoes. The dataset has more
than 40,000 reviews (of which 27,000 are from a custom scraper), and the polarity as
well as a score (in most cases) are provided.
3.0.8 The standard Large movie datasets
First, we obtained doc2vec feature vectors for the standardized Large Movie
Review database. To gain intuition about the vector spaces created by
the doc2vec method, t-distributed Stochastic Neighbor Embedding (tSNE) [24] can
be applied to perform a dimensionality reduction and visualize the high-dimensional
vectors. While a description of the method is technically involved, it converts distances
between points in the high-dimensional space to joint probabilities and then minimizes the
Kullback-Leibler divergence between these and the corresponding probabilities in the
low-dimensional space to attain a low-dimensional embedding. An implementation
of this method is provided in Scikit-learn. In the code below, sklearn.manifold.TSNE
provides the implementation of tSNE. Before using tSNE, a truncated singular value
decomposition (SVD) is first used to reduce the space with a linear transformation.
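The quantity that tSNE minimizes can be illustrated on a handful of points. The sketch below is a deliberate simplification and not the Scikit-learn implementation: it uses a single fixed Gaussian bandwidth instead of the perplexity-based choice, it only evaluates the Kullback-Leibler cost for two hand-made candidate embeddings rather than optimizing it, and all names and data are hypothetical.

```python
import math

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def joint_probabilities(points, kernel):
    """Normalise pairwise kernel values so that they sum to one."""
    n = len(points)
    w = {(i, j): kernel(sq_dist(points[i], points[j]))
         for i in range(n) for j in range(n) if i != j}
    z = sum(w.values())
    return {pair: value / z for pair, value in w.items()}

def kl_divergence(P, Q):
    """KL(P || Q) = sum over pairs of p_ij * log(p_ij / q_ij)."""
    return sum(p * math.log(p / Q[pair]) for pair, p in P.items() if p > 0)

def gaussian(d2, sigma=1.0):
    # Fixed bandwidth: a simplification of the perplexity-based choice.
    return math.exp(-d2 / (2 * sigma ** 2))

def student_t(d2):
    # Heavy-tailed kernel used in the low-dimensional space.
    return 1.0 / (1.0 + d2)

high = [(0, 0, 0), (0.1, 0, 0.1), (3, 3, 3), (3.1, 3, 2.9)]  # toy 3-d data, two clusters
good = [(0, 0), (0.1, 0.1), (3, 3), (3.1, 2.9)]              # embedding that keeps the clusters
bad = [(0, 0), (3, 3), (0.1, 0.1), (3.1, 2.9)]               # embedding that mixes them

P = joint_probabilities(high, gaussian)
cost_good = kl_divergence(P, joint_probabilities(good, student_t))
cost_bad = kl_divergence(P, joint_probabilities(bad, student_t))
```

The embedding that preserves the two clusters has the lower cost, which is the direction in which the optimization moves the low-dimensional points.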
The following code listing shows how the tSNE method is used in Python with
sklearn. As can be seen, the vectors are first reduced to 30 components, and then the tSNE
method is run.
...
from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD
...
# Linear reduction to 30 components, then a non-linear embedding into 2-d
X_reduced = TruncatedSVD(n_components=30, random_state=0).fit_transform(train_vecs[:1000])
X_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_reduced)
# Separate the embedded points by sentiment label (0 = negative)
neg, pos = [], []
for i in range(y_train[:1000].size):
    if y_train[i] == 0:
        neg.append(X_embedded[i])
    else:
        pos.append(X_embedded[i])
neg = np.array(neg)
pos = np.array(pos)
plt.scatter(neg[:, 0], neg[:, 1], color=next(colors))
plt.scatter(pos[:, 0], pos[:, 1], color=next(colors))
Figure 3.1 shows the low-dimensional manifold obtained for the standardized Large
Movie Review database, which uses reviews from IMDB. The results show that the doc2vec
algorithm does produce feature vectors that separate positive (blue) and negative
(red) polarity. Clearly, it is not a clean separation, but it should be remembered that
this is only an approximate embedding.
FIGURE 3.1. Low-dimensional manifold of results for the standardized Large
Movie Review database using reviews from IMDB. Each review was
converted to a doc2vec vector. In the plot, positive sentiments are in blue and
negative sentiments are in red.
3.0.9 The Rotten Tomatoes Dataset
As previously mentioned, most sentiment analysis has been carried out with clean
and highly polar datasets. To study the impact of noise, we carried out tests with
reviews that are not restricted in size and/or were obtained with a general-purpose
scraper, thereby containing artefacts. For example, the default scraper will often not only
recover the review, but may also retrieve additional information on the page (in the form of
advertisements or other non-markup text) that could contribute as noise.
Our dataset, the Custom Rotten Tomatoes web review scraper, consists of
approximately 40,000 Rotten Tomatoes reviews of different sizes, most with an assigned
polarity score. To understand the overall polarity of the database, Figure 3.2 shows the
distribution of the number of reviews for each polarity score. As can be seen, this dataset is
peaked around a score of 7, indicating that the text is positive, but perhaps not strongly positive.
FIGURE 3.2. Distribution of movie reviews as a function of score from our
Rotten Tomatoes database.
Reviews in the Custom Rotten Tomatoes web review scraper database have a wide
range of sizes, from a paragraph to several thousand words. As a sanity check,
we wanted to make sure that no correlation exists between long-windedness and score.
For example, could it be that all short reviews are simply negative? Figure 3.3 shows the
size of the movie reviews as a function of score. The results show that there does not appear to
be a correlation between the size of the review and the overall score.
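A sanity check of this kind can also be expressed numerically through a correlation coefficient between review length and score. The sketch below computes the Pearson coefficient in plain Python; the (score, text) records are hypothetical stand-ins for rows of the review database, and measuring length in whitespace-separated words is an assumption.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical (score, text) records standing in for rows of the review database.
reviews = [
    (2, "dull and far too long"),
    (9, "a joy"),
    (7, "solid cast, uneven script, worth a look"),
    (4, "forgettable"),
    (7, "good fun"),
]
scores = [score for score, _ in reviews]
lengths = [len(text.split()) for _, text in reviews]  # length in words
r = pearson_r(scores, lengths)
```

A value of r close to zero would support the absence of a linear relation between review size and score.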
We used the Custom Rotten Tomatoes web review scraper database and converted each
review to a fixed-size vector using doc2vec RNN training. Figure 3.4 shows the results
of a low-dimensional embedding with tSNE for different slices of the data. First
(top-left), we plotted the results for all the reviews. As can be seen, the distribution
of reviews as a function of score is highly peaked around 7, a score that is somewhat
in the middle and not very polar. Any classification on such a dataset will therefore have a hard
time separating the data on polarity.
Next, we sliced the data according to score. In the top-right plot of Figure 3.4, we
selected all reviews with a score greater than 8 and those with a score less than 2. As can be seen, the
vectors, projected with tSNE, are clearly separated. Similarly, we repeated this procedure
for the bottom-left and bottom-right plots of Figure 3.4, where the slices of the data were (>8, <3)
and (>9, <2), respectively. In each of these cases, it is clear that very negative and
very positive reviews can be separated, but if the score lies closer to the middle, even at 7,
then the sentiment discrimination will be ambiguous.
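The slicing step itself is simple to express in code. The following is a minimal sketch in which reviews are assumed to be (score, text) pairs on a 0 to 10 scale; the function name, the record layout and the sample data are hypothetical and do not reproduce the database schema used in this work.

```python
def slice_by_score(reviews, pos_above, neg_below):
    """Keep reviews with score > pos_above (label 1) or < neg_below (label 0)."""
    texts, labels = [], []
    for score, text in reviews:
        if score > pos_above:
            texts.append(text)
            labels.append(1)
        elif score < neg_below:
            texts.append(text)
            labels.append(0)
    return texts, labels

# Hypothetical (score, text) records; scores are on the 0-10 scale.
reviews = [(9.5, "superb"), (8.5, "very good"), (7, "fine"), (5, "so-so"),
           (2.5, "weak"), (1, "awful")]

texts_8_2, labels_8_2 = slice_by_score(reviews, pos_above=8, neg_below=2)
texts_8_3, labels_8_3 = slice_by_score(reviews, pos_above=8, neg_below=3)
texts_9_2, labels_9_2 = slice_by_score(reviews, pos_above=9, neg_below=2)
```

Reviews whose score falls between the two thresholds are discarded, so a more extreme slice yields fewer, but more polar, training examples.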
FIGURE 3.3. Distribution of movie review text size as a function of score from
our Rotten Tomatoes database.
3.0.10 Classification Models and impact of parameters
For the creation of both doc2vec models (DM and DBOW), it is necessary to shuffle the
body of reviews and do several repetitions (epochs) over this corpus. Since this is an iterative
process, the training time grows linearly with the number of epochs. We studied the importance of
iteration in the learning process for the final quality of the doc2vec results, measured by the prediction
accuracy of different classifiers.
The doc2vec method generates vectors of fixed size, which can be specified prior
to the learning process. Internally, this controls the output layer of the RNN. We have
studied the impact of the size of these vectors on learning.
In terms of classifiers, the Random Forest method has its own set of parameters
that should be set and optimized for a particular problem. One parameter of particular
interest is the number of estimators, which
determines the number of decision trees in the forest. We tested the prediction results of doc2vec
as a function of the number of estimators for the random forest.
FIGURE 3.4. Results of tSNE for the movie reviews obtained with our custom
scraper and converted to doc2vec vectors. Different data slices based upon the
Rotten Tomatoes consensus scores are given: (top-left) all reviews,
(top-right) >8 and <2, (bottom-left) >8 and <3, (bottom-right) >9 and <2. In the
plot, positive sentiments are in blue and negative sentiments are in
red.
For the training of the model used in doc2vec, it is necessary to carry out repeated
iterations, called epochs in the RNN literature, shuffling the input vectors each time. The
number of repetitions influences the resulting vectors generated with doc2vec, which
subsequently influence the prediction results. Table 3.1 shows the dependence of the accuracy
results for doc2vec embeddings on the number of epoch iterations. Each row in Table 3.1
uses the same 200-element vectors for the SGD and the RF classifiers.
TABLE 3.1. Dependence of accuracy results for doc2vec embeddings with different
epoch iteration learning.
Number of epochs   SGD   RF
1                  65%   59%
3                  69%   61%
4                  70%   60%
5                  71%   61%
6                  72%   61%
7                  72%   61%
8                  73%   59%
9                  73%   60%
10                 71%   61%
Also, it is possible to vary the size of the vector that is generated by doc2vec for a
review. Table 3.2 shows that this will also influence the prediction results for the
classifiers. Each row in Table 3.2 uses the same vector, obtained with five epoch iterations,
for both the SGD and the RF classifiers.
TABLE 3.2. Dependence of accuracy results for doc2vec using different resulting
vector lengths.
Vector length   SGD accuracy   RF accuracy
50              66%            63%
100             66%            63%
200             71%            60%
300             71%            60%
400             72%            59%
600             69%            59%
With respect to the Random Forest classifier, one important parameter is the number
of estimators (trees). To choose each split, decision tree classifiers also use a criterion:
either the Gini impurity measure [6] or the entropy (information gain) measure. These two
choices in the RF model will have an impact on the final training/prediction results.
Table 3.3 shows a comparison using 5 repetitions and 200 element vectors, with either
the Gini or the entropy (information gain) criterion.
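For reference, the two split criteria can be written out directly. The sketch below computes the Gini impurity, the entropy and the impurity decrease of one candidate split for a toy set of binary labels; the helper names and the labels are hypothetical, and this is not the internal Scikit-learn code.

```python
import math
from collections import Counter

def class_proportions(labels):
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    return 1.0 - sum(p * p for p in class_proportions(labels))

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k log2 p_k."""
    return -sum(p * math.log2(p) for p in class_proportions(labels))

def impurity_decrease(parent, left, right, measure):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    children = (len(left) / n) * measure(left) + (len(right) / n) * measure(right)
    return measure(parent) - children

parent = [1, 1, 1, 0, 0, 0]          # toy sentiment labels at a tree node
left, right = [1, 1, 1, 0], [0, 0]   # one candidate split

gain_gini = impurity_decrease(parent, left, right, gini)
gain_entropy = impurity_decrease(parent, left, right, entropy)
```

With the entropy measure, the impurity decrease is the information gain; a tree chooses the split that maximizes the decrease under whichever criterion is selected.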
TABLE 3.3. Results with varying Random Forest parameters. Dependence of
accuracy results for doc2vec using different criteria conditions and number
of estimators. These tests used 27600 reviews with 5 epochs and a fixed word
vector size of 400.
Number of estimators   Criterion   Accuracy
5                      gini        59%
10                     gini        61%
20                     gini        63%
30                     gini        66%
50                     gini        66%
80                     gini        68%
100                    gini        69%
200                    gini        69%
100                    entropy     68%
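A parameter sweep of the kind summarized in Table 3.3 could be set up with Scikit-learn along the following lines. This is a sketch under stated assumptions: the feature vectors are synthetic data from make_classification standing in for the doc2vec vectors, the SGD classifier uses its default settings, and the parameter grid is only a small illustrative subset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the doc2vec vectors and polarity labels (an assumption).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Baseline linear classifier trained with stochastic gradient descent.
sgd = SGDClassifier(random_state=0).fit(X_train, y_train)
sgd_accuracy = sgd.score(X_test, y_test)

# Random forest accuracy for a few (n_estimators, criterion) settings.
rf_accuracy = {}
for criterion in ("gini", "entropy"):
    for n_estimators in (5, 20, 100):
        rf = RandomForestClassifier(n_estimators=n_estimators, criterion=criterion,
                                    random_state=0)
        rf_accuracy[(n_estimators, criterion)] = rf.fit(X_train, y_train).score(X_test, y_test)
```

Each entry of rf_accuracy is the fraction of correctly classified test vectors for one parameter setting, which corresponds to one row of such a table.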
Table 3.4 shows the prediction accuracy as a function of the quality and polarity of
the text review training set. For each of the rows, the word embedding was performed
with doc2vec using 9 epoch iterations with an output vector of 400 elements. The same
vector was then used in the SGD and RF classifiers.
TABLE 3.4. Accuracy of predictions with different classifiers using different
levels of web scraping results.
Input                         Number of reviews   SGD   RF    Elapsed time (s)
All rotten tomatoes reviews   40000               66%   63%   1496
RT without custom scraper     27600               63%   60%   873
RT with custom scraper        22400               70%   67%   716
Large movie review database   50000               88%   82%   582
Table 3.5 shows the results for the data from the Rotten Tomatoes reviews database
constructed for this study. In particular, the data was separated based upon the Rotten
Tomatoes consensus scores in order to study the prediction results as a function of the degree
of sentiment polarity. For each of the rows, the word embedding was performed with
doc2vec using 9 epoch iterations with an output vector of 400 elements. The same vector
was then used in the SGD and RF classifiers.
TABLE 3.5. Accuracy of prediction based upon slicing the data into different
degrees of sentiment polarity.
Score                    Number of reviews   SGD   RF
10–0                     1226                91%   84%
>9–<1                    1819                93%   88%
>8–<2                    5196                91%   87%
>7–<3                    10662               89%   83%
>6–<4                    15991               77%   70%
All reviews with score   19665               74%   68%
FIGURE 3.5. Accuracy of prediction based upon slicing the data into different
degrees of sentiment polarity (plotted series: SGD accuracy, RF accuracy).
CHAPTER 4
DISCUSSION
The first hypothesis is that sentiment polarity is an important determinant of
subsequent prediction accuracy. As expected from the outset, the polarity of the dataset greatly
affects the prediction results. Moreover, this study is able to show the level of polarity
necessary for obtaining good prediction. With a difference of 8 score points between
positive and negative polarity, the prediction accuracy improves by approximately 20 percentage points.
Additionally, the slice with positive >9 and negative <1 approaches the accuracy cited in the literature.
Prior to these experiments, we could not estimate how much this would affect the
results. Systems are very good at separating very positive and very negative text, but
there is quite a gray area in the middle. Perhaps focusing on the sentiments that
lie somewhere in the middle could lead to more robust algorithms, as well as to a deeper
understanding of machine intelligence, i.e., machines that can understand the nuances
of a language just as a human could.
As a corollary, the second hypothesis is that clean data sources are also an important
determinant of downstream prediction accuracy. Once again, our tests confirmed this
hypothesis by providing prediction accuracy results that indicate the influence of dirty
data. An area of future research might be an intelligent filter, perhaps based on
Latent Dirichlet Allocation with topic reduction, as a means of eliminating unwanted
topics.
The field of sentiment analysis is broader than it first appears. Presently, making
an original contribution to this field requires a formidable background in probabilistic
graphical models, since these are the methods that have produced the best results. As is
always the case, original contributions start by identifying the present problems, which
are often not apparent from simply reading scientific journal articles. Instead, we have
asked some very simple questions: Are the datasets set up to give good results? If so, where
can we begin to get a handle on the more general problem?
Given the limited scope (and time) of the thesis, there are several limitations
and areas that have been omitted. At present, we have tested the RNN as a
classifier using a fixed-size input vector. In this way, the true power of the recurrent
neural network is not realized. A more interesting approach would be to
classify documents with variable-sized input text. This work was necessarily limited
in scope. To compare with the results by the Google group [14], truly massive learning
datasets would be necessary.
Several other ideas for future research would be possible. First, a deeper understanding
of the 6–4 case (ambiguous sentiment polarity) would be of interest. This would
require some amount of manual curation: reading the texts and determining whether the
sentiment is in fact ambiguous. Next, could the system obtain an actual polarity score
automatically? And finally, it would be interesting to compare other models, such as LDA,
with the word2vec and doc2vec paradigm using our dataset. Clearly, more work could
be done related to deep learning: for example, using different topologies
of RNN; investigating the range of optimal parameters for the LSTM; trying
convolutional NNs; and studying results as a function of training set size, as well as of hidden layer
depth.
As can be seen from the study of sentiment analysis presented in this thesis, the
field shall continue to provide a rich test-bed for new and exciting ideas in
machine learning. With the present trend and advances of modern intelligent systems,
new and more complex problems shall be tackled and solved. Someday, we may just have
the perfect recommender system. But, in the end, it might be like Bruce Springsteen’s
song "57 Channels (And Nothin’ On)": we humans still need to decide which movie to
watch.
BIBLIOGRAPHY
[1] M. ANNETT AND G. KONDRAK, A comparison of sentiment analysis techniques:
Polarizing movie blogs, in Proceedings of the Canadian Society for Computational
Studies of Intelligence, 21st Conference on Advances in Artificial Intelligence,
Canadian AI’08, Berlin, Heidelberg, 2008, Springer-Verlag, pp. 25–35.
[2] Y. BENGIO, H. SCHWENK, J.-S. SENÉCAL, F. MORIN, AND J.-L. GAUVAIN, Neural
Probabilistic Language Models, in Innovations in Machine Learning, D. Holmes
and L. Jain, eds., vol. 194 of Studies in Fuzziness and Soft Computing, Springer
Berlin Heidelberg, 2006, pp. 137–186.
[3] J. BERGSTRA, O. BREULEUX, F. BASTIEN, P. LAMBLIN, R. PASCANU,
G. DESJARDINS, J. TURIAN, D. WARDE-FARLEY, AND Y. BENGIO, Theano: a CPU
and GPU math expression compiler, in Proceedings of the Python for Scientific
Computing Conference (SciPy), June 2010.
[4] D. M. BLEI, Probabilistic topic models, Commun. ACM, 55 (2012), pp. 77–84.
[5] D. M. BLEI, A. Y. NG, AND M. I. JORDAN, Latent dirichlet allocation, J. Mach.
Learn. Res., 3 (2003), pp. 993–1022.
[6] L. BREIMAN, J. FRIEDMAN, R. OLSHEN, AND C. STONE, Classification and
Regression Trees, Wadsworth and Brooks, Monterey, CA, 1984.
[7] M. CABANLIT AND K. JUNSHEAN ESPINOSA, Optimizing n-gram based text feature
selection in sentiment analysis for commercial products in twitter through
polarity lexicons, in Information, Intelligence, Systems and Applications, IISA 2014,
The 5th International Conference on, July 2014, pp. 94–97.
[8] E. CAMBRIA, B. SCHULLER, Y. XIA, AND C. HAVASI, New avenues in opinion
mining and sentiment analysis, Intelligent Systems, IEEE, 28 (2013), pp. 15–21.
[9] Z. HARRIS, Distributional structure, Word, 10 (1954), pp. 146–162.
[10] Q. V. LE AND T. MIKOLOV, Distributed representations of sentences and documents,
CoRR, abs/1405.4053 (2014).
[11] H. P. LUHN, The automatic creation of literature abstracts, IBM J. Res. Dev., 2
(1958), pp. 159–165.
[12] Y. T. MCINTYRE-BHATTY, Neural network analysis and the characteristics of market
sentiment in the financial markets, Expert Systems, 17 (2000), pp. 191–198.
[13] W. MEDHAT, A. HASSAN, AND H. KORASHY, Sentiment analysis algorithms and
applications: A survey, Ain Shams Engineering Journal, 5 (2014), pp. 1093–1113.
[14] T. MIKOLOV, K. CHEN, G. CORRADO, AND J. DEAN, Efficient estimation of word
representations in vector space, CoRR, abs/1301.3781 (2013).
[15] T. MIKOLOV, M. KARAFIÁT, L. BURGET, J. CERNOCKÝ, AND S. KHUDANPUR,
Recurrent neural network based language model, in INTERSPEECH 2010, 11th
Annual Conference of the International Speech Communication Association,
Makuhari, Chiba, Japan, September 26-30, 2010, pp. 1045–1048.
[16] T. MIKOLOV, W.-T. YIH, AND G. ZWEIG, Linguistic regularities in continuous space
word representations, in Proceedings of the 2013 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language
Technologies, Atlanta, Georgia, June 2013, Association for Computational
Linguistics, pp. 746–751.
[17] R. REHUREK AND P. SOJKA, Software Framework for Topic Modelling with Large
Corpora, in Proceedings of the LREC 2010 Workshop on New Challenges for
NLP Frameworks, Valletta, Malta, May 2010, ELRA, pp. 45–50.
http://is.muni.cz/publication/884893/en.
[18] V. SINGH, R. PIRYANI, A. UDDIN, AND P. WAILA, Sentiment analysis of movie
reviews: A new feature-based heuristic for aspect-level sentiment classification,
in Automation, Computing, Communication, Control and Compressed Sensing
(iMac4s), 2013 International Multi-Conference on, March 2013, pp. 712–717.
[19] M. SUNDERMEYER, H. NEY, AND R. SCHLUTER, From feedforward to recurrent
lstm neural networks for language modeling, Audio, Speech, and Language
Processing, IEEE/ACM Transactions on, 23 (2015), pp. 517–529.
[20] M. SUNDERMEYER, I. OPARIN, J.-L. GAUVAIN, B. FREIBERG, R. SCHLUTER, AND
H. NEY, Comparison of feedforward and recurrent neural network language
models, in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE
International Conference on, May 2013, pp. 8430–8434.
[21] H. TANG, S. TAN, AND X. CHENG, A survey on sentiment detection of reviews,
Expert Syst. Appl., 36 (2009), pp. 10760–10773.
[22] T. TRILLA AND F. ALIAS, Sentence-based sentiment analysis for expressive
text-to-speech, Audio, Speech, and Language Processing, IEEE Transactions on, 21
(2013), pp. 223–233.
[23] L. VAN DER MAATEN, Accelerating t-sne using tree-based algorithms, J. Mach.
Learn. Res., 15 (2014), pp. 3221–3245.
[24] L. VAN DER MAATEN AND G. HINTON, Visualizing high-dimensional data using
t-sne, J. Mach. Learn. Res., 9 (2008), pp. 2579–2605.