The degree of polarity as a factor for deep-learning based Sentiment Analysis · 2015-07-24 · The...

The degree of polarity as a factor for deep-learning based Sentiment Analysis

By

BORJA BARES FERNÁNDEZ

Department of Computer Science, UNIVERSITY OF VIGO

A Master’s thesis submitted to the University of Vigo in accordance with the requirements of the degree of MÁSTER UNIVERSITARIO EN SISTEMAS SOFTWARE INTELIXENTES E ADAPTABLES in the School of Computer Engineering (ESEI) of Ourense.

JULY 2015

Directed by: PROF. DAVID N. OLIVIERI

ABSTRACT

Recently, exponential strides have been reported in the machine learning literature for challenging pattern recognition problems. One application area where marked progress has been made is the understanding of natural language for Sentiment Analysis. The goal of this thesis is to investigate these claims. At present, all authors report results on standard test datasets where training texts are highly polar (extremely positive or extremely negative). But using such prepared datasets casts doubt on the excellent results reported by these authors. It is of interest to determine how this dataset biases the reported results. Thus, the principal contribution of this thesis is to construct a new sentiment analysis dataset that allows us to study how the degree of sentiment polarity affects the recognition error rates. For this, we used web scraping to automatically obtain text from a popular website, Rotten Tomatoes, consisting of consensus-scored movie reviews, and stored it in a local database. By implementing the system, we could study the word embedding transformations based upon Deep-learning methodology with Recurrent Neural networks and slice the training data according to score in order to study the effect on learning performance. We present results with various classifiers, including a random forest and stochastic gradient, and comment on test results with a recurrent neural network.


DEDICATION AND ACKNOWLEDGEMENTS

Thanks to David and my family.


TABLE OF CONTENTS

List of Tables
List of Figures
1 Introduction
  1.1 Background and State of the art
2 Methods and Software
  2.0.1 Web Scraping: data preparation and big data
  2.0.2 Word Embeddings
  2.0.3 Vector space model transforms
  2.0.4 Statistical Language models and Word embeddings
  2.0.5 The neural network language models
  2.0.6 Implementation of a RNN based NNLM
  2.0.7 Classification
3 Results
  3.0.8 The standard Large movie datasets
  3.0.9 The Rotten Tomatoes Dataset
  3.0.10 Classification Models and impact of parameters
4 Discussion
Bibliography


LIST OF TABLES

3.1 Accuracy vs epoch learning
3.2 Accuracy vs vector length
3.3 Accuracy vs number of RF estimators
3.4 Accuracy vs noise
3.5 Accuracy vs degree of sentiment polarity

LIST OF FIGURES

2.1 Workflow of the software
2.2 Functional details of Web scraper
2.3 Word embedding using tSNE
2.4 Neural network language model architectures
3.1 tSNE projection of standardized movie review database
3.2 Rotten Tomatoes database: Number of reviews vs score
3.3 Rotten Tomatoes database: Number of reviews vs size
3.4 Results of tSNE for custom database
3.5 Accuracy vs degree of sentiment polarity

CHAPTER 1. INTRODUCTION

Recent advances in machine learning (ML) are ushering in a new age in nearly all branches of artificial intelligence, fueled by three fundamental breakthroughs: the distributed computing capability of modern systems, the availability of extremely large datasets, and novel theoretical insights that can be rapidly and efficiently implemented in modern programming frameworks. Problems in pattern recognition and natural language processing, which were once far beyond the scope of toy ML models, are becoming mainstream and have been integrated into software that now operates on commodity devices. The machine learning subfield that has contributed most to this progress is called Deep Learning, and is powered for the most part by a resurgence in neural networks, with the modern enhancement of going ever deeper (and recursive) in hidden (latent) layers.

Among the challenging problems that this new breed of deep learning algorithms has tackled is Sentiment Analysis: determining whether a chunk of natural language text has a positive or negative sentiment, called its polarity. In general, processing natural language and obtaining a representation from such text is a grand challenge problem within the field of artificial intelligence. Sentiment analysis is a difficult machine learning task, but it represents a fundamental problem for benchmarking new ideas in AI. In its limited form, sentiment analysis seeks to predict the overall polarity of a text, which is a binary learning problem.

The main thrust of this work is to provide a hands-on comparison of a few leading methods that have emerged in sentiment analysis in the past two years (from approximately 2013), using a newly constructed dataset that allows us to study how the degree of sentiment polarity affects the recognition error rates. As such, this work represents a computational or practical review article of the competing methods. The field has moved so quickly in the past two years, with breakthroughs in deep-learning methodologies, that such a review (albeit here not extensive) is both timely and necessary. In particular, I implemented a system for studying word embedding transformations using two methods, word2vec and doc2vec, and then used these high-dimensional projections for predicting sentiment polarity with modern classifiers: random forests, stochastic gradient classifiers, and recurrent neural networks.

In order to compare methods, a specific contribution of this work is to test them on a real dataset, namely movie reviews from a leading website, Rotten Tomatoes. This provided a way of selecting the degree of polarity and determining how the models perform as a function of it. With respect to the specific problem, we address the following: the dependency of the classification error rate on the corpus domain, word embedding comparisons, comparison of different classifiers, implementation details of obtaining word embeddings, and the quality of results as a function of the quality of the input dataset (i.e., whether the text reviews are highly conditioned or can contain noise). For this, we developed a system for web scraping and data preparation (data wrangling). Here this dataset serves as a realistic test system, but it could also serve in the future for new models for sentiment analysis.

1.1 Background and State of the art

In and of itself, sentiment analysis has become an important practical tool for the booming field of machine intelligence applied to commercial applications. It seems that everyone these days is a data scientist! Emerging applications that rely upon sentiment analysis include product comparisons (including product review recommendation systems), opinion summarization (from social networks to news summarization), opinion reason mining (aggregation of opinion polarity, or quantifying large amounts of opinions with statistical analysis), and other applications (search engines, email filtering, etc.). Given the economic implications of these types of applications, it is no wonder that the major technology companies (Google, Twitter, Facebook, Amazon, to name a few) have become the major players in this arena, and the principal driving forces behind the theoretical advances of machine learning.

A recent review of Sentiment Analysis can be found in [21]. Particular applications of these techniques include the analysis of financial information [12], intonation for text-to-speech [22], and assessments of commercial products [7]. Nonetheless, a standard test that has emerged as a way of comparing different machine learning algorithms in sentiment analysis is movie reviews [18]. In particular, the dataset most used is the Large movie review database [1, 10], consisting of 50k annotated and highly polar reviews, evenly split between positive and negative reviews.

Automatic classification of text for discovering sentiment with machine learning has a long history. A recent review [13] describes modern methods. Several techniques emerged early on to convert words/text into a numerical feature vector, thereby constructing a vector space of word embeddings $W : w \to \mathbb{R}^n$. The basic idea of this vector space paradigm follows immediately: two words or phrases are similar in meaning if their distance in $\mathbb{R}^n$ is small, and dissimilar if their distance is large. In practice, constructing such a word embedding is nontrivial.

Historically, a basic word embedding technique that emerged early on is bag-of-words (BOW). While it is difficult to trace the first use of BOW, the following could most probably be credited [9, 11]. The fundamental concept of BOW is that a document is a mere collection of terms, each assigned a number depending upon its occurrence in the text, or corpus.

As may be expected, the performance of machine learning predictions with BOW is poor due to the obvious disadvantage of such a simple representation, namely that the grammatical position of words, which is needed for understanding the linguistic context, is not considered. Later methods sought to overcome such problems using n-grams, which are groupings of n words, together with a host of algorithmic developments. The most promising approaches that have emerged are probabilistic linguistic models, described recently by many authors [2, 8, 21]. One approach is Latent Dirichlet Allocation (LDA) [5]. This is a probabilistic graphical model (or generative model, since it produces probability distributions from which samples can be generated, or drawn) and is often used in topic modelling or topic reduction, the idea being that groups of words pertain to topics.

Very recent approaches that have yielded the best results to date are based upon a renewal of recurrent neural networks, specifically in the context of deep learning [14]. These models generate vectors of fixed size from variable-sized inputs, unlike the classifier-based approaches. Two such models are Word2Vec and Doc2Vec.

CHAPTER 2. METHODS AND SOFTWARE

In this section, we describe the algorithms and software tools that were developed to distinguish different reviews by constructing word embeddings and using modern classifiers. First, we describe details of the overall architecture and data acquisition. Next, we describe the specific feature vector encoding used for word embedding. Finally, we discuss the details of the software/hardware implementation, providing details with respect to the language, platforms, and libraries that were used to build the system.

In order to implement new methods, I used several libraries and platforms for word modeling, machine learning, and deep learning. In particular, for word modeling, I used the Gensim library [17], which provides a Python wrapper for word embedding. For the random forest and stochastic gradient classifiers, as well as data wrangling for cross-validation preparation, I used the scikit-learn library (http://scikit-learn.org/); text preparation was done with NLTK and BeautifulSoup. For modern machine learning (deep learning), especially with recurrent neural networks, I used Theano [3] in conjunction with Keras (http://keras.io/). These latter tools have been developed by some of the groups that are driving the deep-learning movement. Here I shall show how this technology fits together and can be used to solve some of the more challenging problems in the machine learning field.

Figure 2.1 shows the general architecture of the system. The Web scraping block consists of custom-built software for web scraping and database loading. The Word-embedding block uses the scraped reviews to construct a specific corpus and applies word models for deriving the word-embedding mapping. Finally, the classifier method is used to determine the polarity of the review for a given word-embedding method. In the following sections, each module shall be described in detail.

[Figure 2.1 diagram labels: Movie Review repository, url, 1. Web scraping, 2. Word Embedding Modeling, 3. Classification, Indexing, queries]

FIGURE 2.1. The workflow/system architecture is composed of three modules linked in series: the Web scraper, the Word embedding module (with its two embedding models), and the Classifier.

2.0.1 Web Scraping: data preparation and big data

The present Big-Data paradigm would not be possible without the Internet. Much of the data in big data is harvested from the enormous quantity generated each day by social networks/messages, search engines, and news. Indeed, in 2012, 2.5 quintillion bytes ($2.5 \times 10^{18}$ bytes) were created every day. At this rate, 90% of the world's data was produced in the last two years. The numbers are just astounding; each minute, Facebook processes 350 GB, Google processes more than 2 million search queries, and Twitter generates more than 300 thousand tweets.

For intelligent systems driven by machine learning, this deluge of information must be harvested and processed. Web scraping is the name given to harvesting or extracting data from the web by automatically browsing web pages. Technically, it consists of an HTTP connection and a GET operation. Once obtained, this information must be transformed into a form that can be used by intelligent algorithms. This conversion process is referred to in English as data munging or data wrangling, because it carries the connotation of fighting with the data to force it into a tame format that can be used by downstream software tools.

Obtaining a heterogeneous data source for sentiment analysis represents an important challenge. In this work, I used the reviews available at Rotten Tomatoes (http://www.rottentomatoes.com/), a popular website for user-generated movie criticism, known as a film review aggregator, since many reviews from different sources (blogs, articles, etc.) can be associated with a particular film. This site has the peculiarity that it doesn't actually store the text of the reviews, but stores only the link and metadata associated with each review and an associated score, called the Tomatometer critic aggregate score. An excerpt from the Wikipedia entry for Rotten Tomatoes explains this process succinctly:

Rotten Tomatoes staff first collect online reviews from writers who are certified members of various writing guilds or film critic associations. To be accepted as a critic on the website, a critic’s original reviews must garner a specific number of likes from users. Those classified as Top Critics generally write for major newspapers. The staff determine for each review whether it is positive (fresh, marked by a small icon of a red tomato) or negative (rotten, marked by a small icon of a green splattered tomato). (Staff assessment is needed as some reviews are qualitative rather than numeric in ranking.) At the end of the year, they identify the film that was rated highest as receiving the annual Golden Tomato.

The fact that Rotten Tomatoes does not save the text, but provides the URL link to the review, means that each review has its own particular format and size. From the point of view of a web scraping front-end and sentiment analysis software, this situation is far more complex than obtaining tweet data from Twitter, where the format is uniform and the text size is restricted. Indeed, for Rotten Tomatoes, the "web scraper" must be designed to extract data from a wide array of different web formats and sizes, and must have fault tolerance against dead URLs and/or websites that do not obey standard markup.

The Web Scraping Application. Connecting to the Rotten Tomatoes (RT) website to obtain metadata and reviews is greatly facilitated by the API that RT provides for third-party developers. Figure 2.2 shows a simplified schematic of the connectivity of the web scraper. By sending an HTTP request to this API, the relevant metadata associated with a movie can be retrieved. In particular, this request returns a list of movies with their associated Rotten Tomatoes IDs, which are used to retrieve the list of URLs pertaining to the reviews, the polarity, and other metadata associated with each movie. With this information, the next step is to obtain the actual text of the reviews indicated by the various URLs (a movie can have multiple reviews, hosted at different sites).

FIGURE 2.2. Functional details of the Web scraper modules, indicating the actions taken during a connection with the Rotten Tomatoes third-party API.

Given the URL of a review, a webpage is automatically fetched using the functionality of the requests module in Python. This fetch operation is blocking, obliging the code to wait until the URL request is resolved either by data transfer or a timeout. In a totally synchronous design, such a situation would be inefficient, producing large amounts of dead time. To solve this problem, the threading module of Python is used for asynchronous multi-threading. It is true that Python is built with a global interpreter lock, or GIL (Understanding the Python GIL, David Beazley; http://www.dabeaz.com/python/UnderstandingGIL.pdf), a design choice of the standard interpreter, which means that launching threads does not produce pure multi-threading, since a mutual exclusion lock is held by the Python interpreter. Nonetheless, since threads waiting on network connections sit on a wait queue, they do not need to be actively running, and the threading primitives can work effectively in this situation. Thus, we obtained up to 10× speed improvements with a multithreaded implementation. It should be mentioned that on multicore machines, the Python multiprocessing module could be used to achieve core-level parallelism, since each process has its own interpreter.
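The threaded fetching strategy described above can be sketched as follows. This is only a minimal illustration, not the thesis code: the function names `fetch_html` and `fetch_all` and the default of 8 threads are assumptions, and the standard-library `urllib.request` is swapped in for the `requests` module so that the sketch has no external dependencies.

```python
import threading
import urllib.request
from queue import Queue, Empty


def fetch_html(url, timeout=10.0):
    """Blocking fetch of one page (hypothetical helper).

    Returns None on network errors, timeouts, or malformed URLs, so that
    dead links do not stop the scraper.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return None


def fetch_all(urls, fetch=fetch_html, n_threads=8):
    """Fetch every URL with a small pool of worker threads.

    Returns a dict mapping each url to its page text (or None on failure).
    """
    pending = Queue()
    for url in urls:
        pending.put(url)
    results = {}
    lock = threading.Lock()

    def worker():
        # Each worker pulls URLs until the queue is empty. While one thread
        # waits on the network, the others are free to run.
        while True:
            try:
                url = pending.get_nowait()
            except Empty:
                return
            page = fetch(url)
            with lock:
                results[url] = page

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Passing the fetch function as a parameter makes it easy to try the thread pool with a stub instead of real network access.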

Once the HTML webpage of the review has been retrieved, the text must be extracted and converted into a usable form. For this, the library BeautifulSoup (https://pypi.python.org/pypi/beautifulsoup4) was used. This library provides methods for dissecting the contents of a webpage by navigating, searching and modifying a generated parse tree. Initially, when designing the algorithm for the web scraper, a general-purpose solution was developed. In this generic algorithm, a set of elements was eliminated based upon expected regular expression patterns. Nonetheless, due to the complexity needed to handle general cases, this method had limited utility. Instead, a more specific but effective solution was to use BeautifulSoup to obtain specific HTML elements in the document, namely the paragraph (p), span and article elements. In this pass, non-visible elements are eliminated. For example, elements hidden with CSS styles such as display: none, commented code, and JavaScript code are identified and eliminated. Since the HTML markup tags are recognized by the BeautifulSoup parse tree, only the text between markup is extracted. The following code segment describes the default rule for the web scraper.

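    def _default_scrapper(body):
        # `body` is a BeautifulSoup element; `_visible` is the scraper's own
        # predicate (defined elsewhere) that rejects non-visible text nodes.
        paragraphs = []
        for paragraph in body.find_all(['p', 'span', 'article']):
            # Keep only the visible text nodes found inside this element.
            paragraphs.append("".join(filter(_visible, paragraph.find_all(text=True))))
        return " ".join(paragraphs)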

Some domains had significantly more reviews than others, so it made sense to use custom scrapers for these cases. In fact, approximately 15 domains represent more than 25% of the reviews at Rotten Tomatoes. This of course is due to the fact that Rotten Tomatoes reports reviews from qualified sources, such as professional movie reviews from major newspapers and movie critics. To handle these cases, a function was developed that loads specific configurations depending upon the source of the URL. In this way, by specifying the id or class of the review, specific elements could be eliminated. This has the advantage that the scrapers can be made highly accurate by tailoring them to these specific web domains. The disadvantage, of course, is that the application would need re-validation in the future in case the page layout at these domains changes. The following code shows a configurable scraping function that obtains more than 25% of the reviews.

    def _class_scraper(body, **options):
        # The option keys (REVIEW_DIV_CLASS, ...) and _review_div_scraper are
        # module-level names defined elsewhere in the scraper.
        scraper_options = options.get('options')
        review_div = None
        # Locate the element holding the review, by CSS class and/or id.
        if REVIEW_DIV_CLASS in scraper_options:
            review_div = body.find(class_=scraper_options[REVIEW_DIV_CLASS])
        if REVIEW_DIV_ID in scraper_options:
            review_div = body.find(id=scraper_options[REVIEW_DIV_ID])
        # Remove unwanted sub-elements, selected by class, id, or tag name.
        if CLASSES_TO_DECOMPOSE in scraper_options:
            for class_to_decompose in scraper_options[CLASSES_TO_DECOMPOSE]:
                for div in review_div.find_all(class_=class_to_decompose):
                    div.decompose()
        if IDS_TO_DECOMPOSE in scraper_options:
            for id_to_decompose in scraper_options[IDS_TO_DECOMPOSE]:
                for div in review_div.find_all(id=id_to_decompose):
                    div.decompose()
        if HTML_TAGS_TO_DECOMPOSE in scraper_options:
            for html_tag_to_decompose in scraper_options[HTML_TAGS_TO_DECOMPOSE]:
                for div in review_div.find_all(html_tag_to_decompose):
                    div.decompose()
        return _review_div_scraper(review_div)
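One way the per-domain configuration could be selected from the review URL is sketched below. The thesis does not list the actual configuration table, so the dictionary `SCRAPER_CONFIGS`, the string values of the option keys, the example domains, and the helper `select_scraper_options` are all hypothetical and serve only as illustration.

```python
from urllib.parse import urlparse

# Hypothetical option keys and per-domain settings: the real values used in
# the thesis are not given, so every entry below is illustrative only.
REVIEW_DIV_CLASS = "review_div_class"
REVIEW_DIV_ID = "review_div_id"
HTML_TAGS_TO_DECOMPOSE = "html_tags_to_decompose"

SCRAPER_CONFIGS = {
    "example-newspaper.com": {
        REVIEW_DIV_CLASS: "article-body",
        HTML_TAGS_TO_DECOMPOSE: ["aside", "figure"],
    },
    "example-critic.org": {
        REVIEW_DIV_ID: "review-text",
    },
}


def select_scraper_options(url):
    """Return the options registered for the URL's domain, or None."""
    host = urlparse(url).netloc.lower()
    # Treat "www.example.com" and "example.com" as the same domain.
    if host.startswith("www."):
        host = host[4:]
    return SCRAPER_CONFIGS.get(host)
```

A caller could fall back to the default scraper whenever this lookup returns None, and otherwise pass the result as the `options` keyword argument of the configurable scraper.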

There were also several domains that contained many reviews, yet without a consistent webpage style. In some cases this was because the sites were still serving older versions (e.g., reviews written several years ago), or because the webpage design changes depending upon the sub-domain due to different web interfaces. For these domains, customized functions were created, one for each domain. These functions cover 13 domains that represent more than 25% of the reviews.

Once the reviews were obtained, they were stored in a MySQL database on disk together with the metadata of the review. Downstream analysis of reviews was accomplished by performing database queries and converting the saved raw text to a set of numbers. This conversion process of text to a high-dimensional vector is referred to as word embedding and is the subject of the next section.

2.0.2 Word Embeddings

A word embedding $W : \text{words} \to \mathbb{R}^n$ is a parameterized function that maps words from natural language to a vector on a high-dimensional manifold (perhaps 200 to 500 dimensions). Although such high-dimensional vectors cannot be visualized directly, they can be mapped to a lower-dimensional space. For this, the t-SNE method (t-distributed stochastic neighbor embedding) [23, 24] is a recent successful technique that can perform such a dimensionality reduction. Thus, in our case, it can be used to gain an intuition of the word embedding space by visualizing its most relevant axes (more high-dimensional data visualizations with t-SNE can be found at http://homepage.tudelft.nl/19j49/t-SNE.html). Figure 2.3 shows an example of the word embeddings we obtained using gensim’s implementation of word2vec, trained specifically on the Rotten Tomatoes dataset. In the results section, we use this technique to visualize data vectors formed using the movie reviews.

2.0.3 Vector space model transforms

As described previously, the bag-of-words (BOW) model is perhaps the first and easiest method for representing natural language. It is based upon the idea that a full text (i.e., a sentence, paragraph or full document) is tantamount to a collection of unconnected words, as if contained in a bag. In this way, a possible representation of a particular word would be based upon its frequency of occurrence in the text. The major disadvantage of this representation is that it disregards word context. Thus, as is to be expected, the BOW representation performs poorly. Nonetheless, it is still a popular choice for rudimentary classifiers such as those used in simple spam filtering.
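To make the loss of word context concrete, the following minimal sketch builds bag-of-words count vectors with the standard library only. The tokenizer and the function names are assumptions chosen for illustration, not the thesis implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors) where each vector holds term counts."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        # One count per vocabulary term; word order is discarded here.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["good movie, not bad",
                                    "bad movie, not good"])
```

The two example sentences have opposite sentiment, yet they contain the same words and therefore receive identical count vectors, which is exactly the weakness noted above.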

Other incremental improvements of the bag-of-words technique include the following methods:

• Term frequency - inverse document frequency (Tf-Idf): a statistic that uses word frequency both in the document and in the corpus, performing slightly better than bag-of-words, since it down-weights words that are common across the whole corpus.

• Latent semantic indexing (LSI): an index based upon singular value decomposition that uncovers the semantic structure in the usage of words in a text.
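As an illustration of the Tf-Idf statistic, the sketch below uses one common variant: relative term frequency multiplied by the logarithm of the inverse document frequency. Several other weightings exist, so the exact formula and the function name `tf_idf` should be read as assumptions.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / doc length) * log(N / document frequency)
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        # Count each term once per document it appears in.
        doc_freq.update(set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [["a", "great", "film"], ["a", "dull", "film"], ["a", "great", "cast"]]
weights = tf_idf(docs)
```

In this toy corpus the word "a" occurs in every document and receives weight zero, while the rarer "dull" is weighted more heavily than "film".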


FIGURE 2.3. An example of a t-SNE projection of the word embeddings of a dataset. See http://alexanderfabisch.github.io/t-sne-in-scikit-learn.html

2.0.4 Statistical Language models and Word embeddings

Clearly, any vector space transform for representing words in a text should contain information about the word context. The most successful models for attaining this are those based upon probabilistic descriptions of natural language.

One example of a highly successful probabilistic graphical model technique is Latent Dirichlet Allocation (LDA) [5]. In this method, the model attempts to reduce natural language text to a set of topics. In this sense, it is a dimensionality reduction method. With this reduced-dimensional space of topics, all words of the text are classified or grouped within these categories. An interesting application area of this method is the conversion of large texts into summaries. An excellent recent review of Probabilistic Topic Models is given by one of the leading developers of the method [4].



Topic analysis, as described in [4], can be understood in terms of joint distributions with some simple definitions. Reproducing the notation of [4]: there are $K$ topics, written $\beta_{1:K}$, where each $\beta_k$ is a distribution over words; the topic proportions are given by $\theta_{d,k}$ for the $d$th document and $k$th topic; the topic assignments for document $d$ are given by $z_d$, while the assignment for the $n$th word in document $d$ is given by $z_{d,n}$. The observed $n$th word in the $d$th document is $w_{d,n}$. With these definitions, LDA forms the joint distribution:

\[
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right)
\]

This joint probability is normally formulated in the context of a Bayesian graphical model and can be solved with various optimization methods, or with Markov Chain Monte Carlo (MCMC) sampling.
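The factorization of the joint distribution can be read as a recipe for generating documents, which the following sketch follows term by term. It assumes symmetric Dirichlet priors with hypothetical hyperparameters `alpha` and `eta`, draws Dirichlet samples by normalizing Gamma draws from the standard library, and only generates data; it does not perform inference.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocabulary, n_topics, n_docs, doc_length,
                    alpha=0.5, eta=0.5, seed=0):
    """Generate documents following the LDA generative process."""
    rng = random.Random(seed)
    # beta_k: one distribution over the vocabulary per topic, p(beta_k).
    beta = [sample_dirichlet(eta, len(vocabulary), rng)
            for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # theta_d: topic proportions of document d, p(theta_d).
        theta = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_length):
            z = sample_categorical(theta, rng)    # z_{d,n} ~ p(z | theta_d)
            w = sample_categorical(beta[z], rng)  # w_{d,n} ~ p(w | beta, z)
            words.append(vocabulary[w])
        corpus.append(words)
    return corpus


corpus = generate_corpus(["good", "bad", "plot", "actor"],
                         n_topics=2, n_docs=3, doc_length=5)
```

Inference runs this process in reverse: given only the observed words, it estimates the hidden topics, proportions, and assignments.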

The LDA method is implemented in the gensim (https://radimrehurek.com/gensim/) library. While we performed initial tests using this word embedding, in the end we did not use it for sentiment analysis, since an extra step would still be required to reduce topics to sentiment polarity. As a result, we turned our attention to other probabilistic language models: those based upon neural networks.

2.0.5 The neural network language models

Similar statistical approaches have been taken in the context of large recurrent neural networks. The goal of statistical language modeling [15] (http://www.fit.vutbr.cz/research/groups/speech/publi/2010/mikolov_interspeech2010_IS100722.pdf) is to predict the next word in textual data given its context. Recurrent neural networks are interesting because arbitrarily large word contexts can be remembered within the hidden layers of the network, thereby contributing to more accurate conditional probabilities at the output layer.

In the paper [16], the power of such models is explained quite succinctly:

Continuous space language models have recently demonstrated outstanding results across a variety of tasks. In this paper, we examine the vector-space word representations that are implicitly learned by the input-layer weights. We find that these representations are surprisingly good at capturing syntactic and semantic regularities in language, and that each relationship is characterized by a relation-specific vector offset. This allows vector-oriented reasoning based on the offsets between words. For example, the male/female relationship is automatically learned, and with the induced vector representations, King - Man + Woman results in a vector very close to Queen.

[Figure 2.4 diagram: panels (a) Feed-forward NNLM and (b) Recurrent NNLM; each shows one-hot input vectors at the input layer, followed by the aggregation layer, hidden layer(s), and output layer.]

FIGURE 2.4. Architecture of a standard feed-forward neural network language model (NNLM). The two input words u and v are at the input layer with a one-hot encoding scheme. The projection or aggregation layer joins the inputs, while multiple hidden layers are activated either linearly (left) or with temporal firing for a recurrent NN (right).

This last statement by [16], in mathematical form $v(\text{King}) - v(\text{Man}) + v(\text{Woman}) \approx v(\text{Queen})$, was quite remarkable! It sent shock waves throughout the natural language (NL) field, because it revealed the inferential and generative prowess of the method. Thus, in [14], the word2vec model, which transforms words into fixed-size vectors that capture the meanings of words, was introduced and immediately became the superstar of the show.
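The vector-offset reasoning can be illustrated with a toy example. The three-dimensional vectors below are hand-made for illustration and are not learned embeddings; with real word2vec vectors the match is only approximate, which is why the nearest neighbor by cosine similarity is taken. The function names are assumptions.

```python
import math

# Hand-made toy vectors (illustrative only, not trained embeddings).
TOY_VECTORS = {
    "king":  [0.9,  0.8, 0.1],
    "queen": [0.9, -0.8, 0.1],
    "man":   [0.1,  0.8, 0.0],
    "woman": [0.1, -0.8, 0.0],
    "movie": [0.0,  0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(vectors, a, b, c):
    """Return the word closest to v(a) - v(b) + v(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


result = analogy(TOY_VECTORS, "king", "man", "woman")
```

The query words are excluded from the candidates because, with real embeddings, the nearest vector to the target is often one of the inputs themselves.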

Neural network language models (NNLM) follow similar topologies. First, they are probabilistic graphical models, where the interconnections between nodes $i$ and $j$ describe a conditional probability $p(i \mid j)$.

Following the excellent recent review of [19], a basic NNLM architecture is shown in Figure 2.4. The input layers depicted are two words within a specified context (they can be successive words, or predecessors that are part of an n-gram). These words $u$ and $v$ are represented with a one-hot encoding scheme, i.e., all elements are zero except one, representing a Kronecker $\delta$ function at the $j$th position. Given a weight matrix $W$, a bias $b$, and an activation function $F$, the nodes $x$ on the $k$th layer are given by:

\[
x^{(k)} = F\left( W^{(k)} x^{(k-1)} + b^{(k)} \right)
\]

where it can be seen that the $k$th layer depends explicitly on the previous layer $(k-1)$. The activation function $F$ varies from layer to layer. For the aggregation, or projection, layer, the activation function is just the identity. For the hidden layers, a sigmoid activation function is used. Finally, at the output, a softmax is used.
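The layer recursion can be sketched as a forward pass in plain Python. The layer sizes, the random weights, and the choice to concatenate the two one-hot inputs before the projection layer are assumptions made for illustration; a trained model would use learned weights.

```python
import math
import random


def one_hot(index, size):
    """Vector of zeros with a single 1 at the given position."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def identity(v):
    return list(v)


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def softmax(v):
    shift = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - shift) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def layer(weights, bias, x, activation):
    """Compute F(W x + b) for one layer."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, bias)]
    return activation(pre)


def random_layer(n_out, n_in, rng):
    """Random weights and zero bias (stand-ins for trained parameters)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def next_word_distribution(u, v, vocab_size, layers):
    """Forward pass giving p(w | u, v) over the vocabulary."""
    x = one_hot(u, vocab_size) + one_hot(v, vocab_size)
    for (weights, bias), activation in layers:
        x = layer(weights, bias, x, activation)
    return x


rng = random.Random(0)
VOCAB, PROJ, HIDDEN = 6, 4, 5
layers = [
    (random_layer(PROJ, 2 * VOCAB, rng), identity),  # projection layer
    (random_layer(HIDDEN, PROJ, rng), sigmoid),      # hidden layer
    (random_layer(VOCAB, HIDDEN, rng), softmax),     # output layer
]
probs = next_word_distribution(1, 4, VOCAB, layers)
```

Because the last activation is a softmax, the output is a valid probability distribution over the vocabulary regardless of the weight values.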

As can be seen in Figure 2.4, the output is the probability p(w|u,v) of a word w given the two input context (predecessor) words. All the work of training the neural network goes into discovering the weight matrices W such that the probability distribution p(w|u,v) is known. Many variants of this basic model exist that further decompose this probability into clusters. Nonetheless, if w_i denotes the ith word of a sequence, then p(w_1, ..., w_m) is the probability of observing the whole sequence, and an n-gram model approximates it. A Markov assumption is made, such that the joint probability factorizes into conditional probabilities:

$$p(w_1, \dots, w_m) = \prod_{i=1}^{m} p(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$$

which means that only the n−1 most recent words are considered for predicting the ith word. The network is trained to fit these conditional probabilities. In particular, the objective function is the log-likelihood

$$L = \sum_{i=1}^{m} \log p(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$$

which is maximized (equivalently, −L is minimized), typically with a stochastic gradient descent algorithm, where the gradients are obtained with the back-propagation algorithm. Further details can be found in [20] and [19].
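To make the Markov factorization and the log-likelihood concrete, the following sketch estimates the conditional probabilities from raw counts for n = 2 (a bigram model) and sums their logarithms. The toy corpus is invented for illustration, and count-based estimates stand in for the probabilities that a neural network would produce.

```python
import math
from collections import Counter

corpus = "the movie was good the movie was bad the plot was good".split()
n = 2  # bigram model: each word depends on the n-1 = 1 preceding word

context_counts = Counter()
ngram_counts = Counter()
for i in range(n - 1, len(corpus)):
    context = tuple(corpus[i - (n - 1):i])
    ngram_counts[(context, corpus[i])] += 1
    context_counts[context] += 1

def cond_prob(word, context):
    """Count-based estimate of p(word | context); fails for unseen contexts."""
    return ngram_counts[(context, word)] / context_counts[context]

def log_likelihood(words):
    """Sum of log p(w_i | w_{i-(n-1)}, ..., w_{i-1}) over the sequence."""
    total = 0.0
    for i in range(n - 1, len(words)):
        context = tuple(words[i - (n - 1):i])
        total += math.log(cond_prob(words[i], context))
    return total

L = log_likelihood("the movie was good".split())
print(L)
```

Here p(movie|the) = 2/3, p(was|movie) = 1 and p(good|was) = 2/3, so L = log(4/9). An n-gram that never occurs in the corpus would give a zero probability, which is one motivation for smoothed or neural estimates.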

Recurrent neural networks have a similar architecture; however, they can extend the number of predecessor words taken into account through the use of memory. Internally, the nodes can have loops, so that the output of a node at one time step feeds back as an input at the next, and the activation of a node can persist for a finite duration. This is the essential idea behind Long Short-Term Memory (LSTM) recurrent networks. A detailed mathematical development of the mechanisms necessary to explain these models is beyond the scope of this work; such a discussion can be found in [19, 20].
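The role of the loop can be hinted at with a deliberately simple recurrent cell. Note that this is a plain recurrence, h_t = tanh(W_x x_t + W_h h_{t−1} + b), and not an LSTM; the sizes and random weights are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.RandomState(1)
input_size, hidden_size = 4, 3  # hypothetical sizes
W_x = rng.randn(hidden_size, input_size) * 0.5   # input-to-hidden weights
W_h = rng.randn(hidden_size, hidden_size) * 0.5  # hidden-to-hidden (the loop)
b = np.zeros(hidden_size)

def run(sequence):
    """Feed a sequence of input vectors through the cell and return the final state."""
    h = np.zeros(hidden_size)
    for x in sequence:
        # The previous state h re-enters the computation: this is the memory.
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h

# Two sequences of one-hot inputs that differ only in their first element.
seq_a = [np.eye(input_size)[i] for i in (0, 1, 2)]
seq_b = [np.eye(input_size)[i] for i in (3, 1, 2)]
h_a, h_b = run(seq_a), run(seq_b)
print(h_a, h_b)
```

Since the state is carried forward, the two final states will in general differ even though the last two inputs are identical, which is how earlier words can influence later predictions.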

2.0.6 Implementation of a RNN based NNLM

An NNLM that has yielded powerful results is word2vec [14]. The word2vec algorithm was implemented in C and made available by the authors (see https://code.google.com/p/word2vec/). Since then, gensim (https://radimrehurek.com/gensim/), a general Python framework for language modelling, has provided a Python interface to this method.

An extension of the word2vec model to paragraphs and full documents was developed later by Le and Mikolov [10]. From this work, two separate models emerged that build upon word2vec: the Distributed Memory (DM) model and the Distributed Bag of Words (DBOW) model. The DM model attempts to predict the next word in a document given a number of context words and a vector for the paragraph, so that although the context words change, the paragraph reference remains. In the DBOW model, words from the paragraph are predicted given only the paragraph vector. The following code shows how the DM and DBOW models are called using the gensim interface.

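import gensim

# dm defaults to 1 (Distributed Memory); dm=0 selects Distributed Bag of Words.
# size is the document vector length (named vector_size in gensim 4.0 and later).
model_dm = gensim.models.Doc2Vec(min_count=1, window=10, size=size, sample=1e-3, negative=5, workers=8)
model_dbow = gensim.models.Doc2Vec(min_count=1, window=10, size=size, sample=1e-3, negative=5, dm=0, workers=8)

model_dm.build_vocab(x_array)
model_dbow.build_vocab(x_array)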

The window parameter determines the maximum context distance (in words) between the current and the predicted word used during training. Both the min_count and sample parameters are correction terms for word frequency: min_count eliminates words that appear in the corpus fewer times than the number indicated, while sample sets the threshold above which the most frequent words in the corpus are randomly down-sampled. A more complete description of these models can be found in [10] and in the tutorial (http://districtdatalabs.silvrback.com/modern-methods-for-sentiment-analysis).
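The effect of the two frequency corrections can be sketched in plain Python. The keep probability sqrt(t/f) used below is the sub-sampling heuristic published by the word2vec authors, where f is the relative frequency of a word and t is the sample threshold; we assume it here for illustration, and the exact expression inside gensim may differ in detail. The toy corpus and the helper names are our own.

```python
import math
from collections import Counter

tokens = ("the film was great the film was dull the ending was odd "
          "the cast was great").split()
counts = Counter(tokens)
total = sum(counts.values())

def filter_vocab(word_counts, min_count):
    """Drop words that occur fewer than min_count times."""
    return {w: c for w, c in word_counts.items() if c >= min_count}

def keep_probability(word, sample):
    """Probability of keeping one occurrence of a word during sub-sampling."""
    f = counts[word] / total  # relative frequency of the word
    return min(1.0, math.sqrt(sample / f))

vocab = filter_vocab(counts, min_count=2)
print(sorted(vocab))
print(keep_probability("the", 0.05), keep_probability("film", 0.05))
```

Frequent words such as "the" receive a lower keep probability than rarer words, so they contribute fewer training examples, while rare words below the threshold are always kept.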

As with word2vec, these models are based upon probabilistic graphical models implemented within an RNN deep-learning framework. Also, as in any RNN, iterative training (epochs in the language of NN) is required to improve the weights. The epoch training is carried out as follows:

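import numpy as np

# Reshuffle the reviews before every pass so that each epoch sees a different order.
for epoch in range(n_epoch):
    perm = np.random.permutation(x_np_array.shape[0])
    model_dm.train(x_np_array[perm])
    model_dbow.train(x_np_array[perm])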

The result of this training is a set of fixed-length NumPy vectors, one per review.
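The shuffling step of the epoch loop can be checked in isolation. The sketch below uses a hypothetical array of review identifiers in place of x_np_array and only records the order seen in each epoch; a model's train call would take the place of the comment.

```python
import numpy as np

docs = np.array(["review_%d" % i for i in range(6)])  # hypothetical stand-in corpus
rng = np.random.RandomState(0)  # fixed seed so the run is repeatable
n_epoch = 3

orders = []
for epoch in range(n_epoch):
    perm = rng.permutation(docs.shape[0])  # a fresh random order for this epoch
    shuffled = docs[perm]
    orders.append(shuffled.tolist())
    # model.train(shuffled) would be called here

for order in orders:
    print(order)
```

Every epoch visits exactly the same documents; only their order changes, which avoids presenting the reviews to the model in the same sequence on every pass.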



Training. As outlined above, the process of training consists of minimizing an objective function defined over all input, internal and output nodes of a probabilistic graphical model. For even a modest-sized corpus, the training of RNNs is computationally intensive. Thus, modern implementations of deep-learning NN architectures assume an efficient computational platform from the outset of the problem definition. Because back-propagation and the other numerical computations of networks can be parallelized, modern machine learning platforms take advantage of multi-core concurrency as well as the parallelism offered by GPUs. Two machine learning platforms that have emerged for researchers in the field are Theano and Torch.

Theano [3] (http://deeplearning.net/software/theano/) is a full-featured Python framework for numerical and symbolic expressions that compiles them to optimized native code for the CPU or GPU. It has become one of the de facto standard frameworks for modern machine learning. A competing framework is Torch (http://torch.ch/), which is based on the Lua language (http://www.lua.org/). Both frameworks have reached a sufficient level of maturity to have active communities and design roadmaps.

Here we have chosen Theano-based development. As a demonstration of how Theano can be used to obtain computational benefits on the GPU compared to the CPU, the following example shows the code for repeatedly evaluating an element-wise exponential over a large vector.

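import numpy
import theano.tensor as T
from theano import function, config, shared
...
vlen = 10 * 30 * 768  # 10 x #cores x #threads per core
iters = 1000

rng = numpy.random.RandomState(22)
# A shared variable is stored on the selected device (CPU or GPU).
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))  # compiled function computing exp(x) element-wise
for i in range(iters):
    r = f()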

This script computes the function exp() on a set of random numbers. Contained in the

Theano function is a map to the actual implementation, which could be optimized for a

CPU or GPU. By changing the device options in the following ways:

THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python check1.py

THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python check1.py


we can determine the performance on each device. On an Intel quad-core i7 CPU, the operation took 12.02 seconds, while on an Nvidia GTX 780 it took 0.96 seconds, a speed-up of roughly 12.5×.
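The speed-up follows directly from the two timings quoted above. The short sketch below reproduces the arithmetic and also includes a small timing helper of our own (`time_call`, a hypothetical name) of the kind that could be wrapped around the loop in the test script.

```python
import time

def time_call(fn, iters):
    """Return the wall-clock seconds taken to call fn() iters times."""
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return time.perf_counter() - start

cpu_seconds = 12.02  # Intel quad-core i7, as reported in the text
gpu_seconds = 0.96   # Nvidia GTX 780, as reported in the text
speedup = cpu_seconds / gpu_seconds
print("speed-up: %.1fx" % speedup)
```

The ratio depends on the size of the vector and on the hardware, so it should be read as indicative of this particular test rather than as a general figure.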

2.0.7 Classification

Once the vector transformations were obtained for each review, we used different classifiers to train on and predict the sentiment of the text. We investigated the use of the following classifiers: Stochastic Gradient Descent, Random Forest and RNN.

For the Stochastic Gradient Descent and Random Forest classifiers we used the popular scikit-learn (http://scikit-learn.org/) framework. For the RNN classification, we used Keras (http://keras.io/), a recently constructed framework that wraps Theano implementations of neural networks. In this framework, simple feedforward networks as well as research-level networks, such as LSTM recurrent networks or convolutional networks, can be constructed. The following is an example of constructing an RNN using an LSTM recurrent layer.

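from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
...
# Keras API as of mid-2015: layers take (input_dim, output_dim).
model = Sequential()
model.add(Embedding(max_features, 256))
model.add(LSTM(256, 128))
model.add(Dropout(0.5))
model.add(Dense(128, 1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary")
model.fit(train_vecs, y_train, batch_size=16, nb_epoch=2, validation_split=0.1, show_accuracy=True)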

Despite being one of the more developed frameworks for modern neural networks, its classes are still restrictive. For example, we found that for fixed-size vectors we could run the RNN model for a limited number of reviews; however, above 5000 reviews, our model did not converge. More work would need to be dedicated to this. Our conclusion is that Keras (and other such frameworks, such as Lasagne or Blocks) is an interesting idea, but lacks the flexibility of implementing a network model directly in a lower-level framework or language such as Theano, Python, Torch or even C (as was the case for the original word2vec).
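For the two scikit-learn classifiers mentioned in this section, a minimal sketch of the train and predict cycle is given below. It does not use our review vectors: the inputs are synthetic, well-separated Gaussian clusters standing in for fixed-length document vectors, and the parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for document vectors: two clusters, one per polarity.
rng = np.random.RandomState(0)
n_per_class, dim = 200, 10
X = np.vstack([rng.normal(+2.0, 1.0, (n_per_class, dim)),
               rng.normal(-2.0, 1.0, (n_per_class, dim))])
y = np.array([1] * n_per_class + [0] * n_per_class)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Linear classifier trained with stochastic gradient descent.
sgd = SGDClassifier(random_state=0).fit(X_train, y_train)
# Ensemble of decision trees; n_estimators is the number of trees.
rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            random_state=0).fit(X_train, y_train)

sgd_acc = sgd.score(X_test, y_test)
rf_acc = rf.score(X_test, y_test)
print(sgd_acc, rf_acc)
```

On such cleanly separated data both classifiers score close to 100%; the accuracies reported for real reviews in the Results chapter are lower because the document vectors of different polarities overlap.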

CHAPTER 3. RESULTS

To understand how training (and subsequent prediction) depends upon the datasets, we carried out experiments with the algorithms using two datasets:

• Large Movie Review Dataset v1.0. This is a standardized dataset used by researchers as a benchmark for sentiment classification, providing an easy way to compare methods. The dataset contains 50,000 reviews with their associated binary sentiment polarity labels, split evenly into training and testing sets. The overall distribution of labels is balanced (25k positive and 25k negative). There are also an additional 50,000 unlabeled documents for unsupervised learning.

• Custom Rotten Tomatoes web review scraper. This is a dataset we constructed by web-scraping reviews from Rotten Tomatoes. The dataset has more than 40,000 reviews (27,000 of which are from a custom scraper), and for most of them both the polarity and a score are provided.

3.0.8 The standard Large movie datasets

First, we studied the doc2vec feature vectors obtained for the standardized Large Movie Review database. To gain intuition about the vector spaces created by the doc2vec method, the t-distributed Stochastic Neighbor Embedding (tSNE) [24] can be applied to perform a dimensionality reduction and visualize the high-dimensional vectors. While a full description of the method is technically involved, in essence it converts distances between points in the high-dimensional space to joint probabilities and then minimizes the Kullback-Leibler divergence between these and the corresponding probabilities in the low-dimensional embedding. An implementation of this method is provided in scikit-learn as the TSNE class of sklearn.manifold. Before using tSNE, a truncated singular value decomposition (SVD) is used to first reduce the space with a linear transformation.

The following code listing shows how the tSNE method is used in Python with sklearn. As can be seen, the vectors are first reduced to 30 components, and then the tSNE method is run.

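...
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD
...
# Linear reduction to 30 components, then non-linear embedding into 2-D.
X_reduced = TruncatedSVD(n_components=30, random_state=0).fit_transform(train_vecs[:1000])
X_embedded = TSNE(n_components=2, perplexity=40, verbose=2).fit_transform(X_reduced)

# Split the embedded points by sentiment label (0 = negative, otherwise positive).
neg, pos = [], []
for i in range(y_train[:1000].size):
    if y_train[i] == 0:
        neg.append(X_embedded[i])
    else:
        pos.append(X_embedded[i])

neg = np.array(neg)
pos = np.array(pos)
colors = iter(["red", "blue"])  # negative in red, positive in blue, as in Figure 3.1
plt.scatter(neg[:, 0], neg[:, 1], color=next(colors))
plt.scatter(pos[:, 0], pos[:, 1], color=next(colors))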

Figure 3.1 shows the resulting low-dimensional manifold for the standardized Large Movie Review database, which consists of reviews from IMDB. The results show that the doc2vec algorithm does produce feature vectors that separate positive (blue) from negative (red) polarity. Clearly, it is not a clean separation, but it should be remembered that this is only an approximate embedding.
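The quantity that tSNE minimizes, the Kullback-Leibler divergence between the joint probabilities in the high-dimensional space (P) and those in the embedding (Q), can be illustrated on small hand-made distributions. The numbers below are hypothetical and are not derived from any dataset; the function is a plain definition of the divergence, not the tSNE optimizer itself.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), skipping terms with p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]       # pairwise similarities in the original space
Q_good = [0.5, 0.3, 0.2]  # an embedding that preserves them exactly
Q_bad = [0.2, 0.3, 0.5]   # an embedding that distorts them

kl_good = kl_divergence(P, Q_good)
kl_bad = kl_divergence(P, Q_bad)
print(kl_good, kl_bad)
```

The divergence is zero only when the two distributions coincide and grows as the embedding distorts the neighbourhood structure, which is why a low final value indicates a faithful, although still approximate, low-dimensional picture.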

3.0.9 The Rotten Tomatoes Dataset

As previously mentioned, most sentiment analysis has been carried out with clean and highly polar datasets. To study the impact of noise, we carried out tests with reviews that are not restricted in size and/or were obtained with a general-purpose scraper, thereby containing artefacts. For example, the default scraper will often recover not only the review, but also additional information on the page (in the form of advertisements or other non-markup text) that could contribute noise.

FIGURE 3.1. Low-dimensional manifold of results for the standardized Large Movie Review database using reviews from IMDB. Each review was converted to a vector with doc2vec. In the plot, the positive sentiments are in blue and the negative sentiments are in red.

Our dataset, the Custom Rotten Tomatoes web review scraper, consists of approximately 40,000 Rotten Tomatoes reviews of different sizes, most with an assigned polarity score. To understand the overall polarity of the database, Figure 3.2 shows the distribution of the number of reviews for each polarity score. As can be seen, this dataset is peaked around a score of 7, indicating that the text is mostly positive, but not strongly so.

Reviews in the Custom Rotten Tomatoes web review scraper database have a wide range of sizes, from a paragraph to several thousand words. As a sanity check, we wanted to make sure that no correlation exists between long-windedness and score. For example, could it be that all short reviews are simply negative? Figure 3.3 shows the size of the movie reviews as a function of score. The results show there does not appear to be a correlation between the size of the review and the overall score.

FIGURE 3.2. Distribution of movie reviews as a function of score from our Rotten Tomatoes database.
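The sanity check on review length versus score was carried out visually in this work. One way to quantify it, shown here only as a sketch, is the Pearson correlation coefficient; the scores and word counts below are invented values, chosen so that length carries no information about score.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

scores = [2, 4, 6, 8, 10]             # hypothetical review scores
lengths = [300, 120, 500, 120, 300]   # hypothetical word counts

r = pearson(scores, lengths)
print(r)
```

A coefficient near zero, as in this constructed example, indicates no linear relationship between length and score, whereas values near +1 or −1 would suggest that length alone could predict polarity.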

We used the Custom Rotten Tomatoes web review scraper database and converted each review to a fixed-size vector using doc2vec training. Figure 3.4 shows the results of a low-dimensional embedding with tSNE for different slices of the data. First (top-left), we plotted the results for all the reviews. Since the distribution of reviews as a function of score is highly peaked around 7, a score that is somewhat in the middle and not very polar, the two classes are not clearly separated. Any classifier will have a hard time separating such a dataset by polarity.

Next, we sliced the data according to score. In the top-right plot of Figure 3.4, we selected all reviews with a score greater than 8 and those with a score less than 2. As can be seen, the vectors, projected with tSNE, are clearly separated. Similarly, we repeated this procedure for the bottom-left and bottom-right plots of Figure 3.4, where the slices of the data were (>8, <3) and (>9, <2), respectively. In each of these cases, it is clear that very negative and very positive reviews can be separated, but as the scores approach the middle of the range (even a score of 7), the sentiment discrimination becomes ambiguous.
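The slicing by score described above can be sketched as a simple filter. The record layout (a dictionary with "text" and "score" keys), the function name `polar_slice` and the example reviews are all assumptions made for illustration; they do not reproduce the schema of our database.

```python
# Hypothetical records standing in for scraped reviews.
reviews = [
    {"text": "a masterpiece", "score": 9.5},
    {"text": "very enjoyable", "score": 8.5},
    {"text": "pretty good", "score": 7.0},
    {"text": "decent", "score": 6.5},
    {"text": "weak", "score": 2.5},
    {"text": "poor", "score": 1.5},
    {"text": "unwatchable", "score": 0.5},
]

def polar_slice(records, pos_above, neg_below):
    """Keep reviews scoring above pos_above (label 1) or below neg_below (label 0)."""
    sliced = []
    for r in records:
        if r["score"] > pos_above:
            sliced.append((r["text"], 1))
        elif r["score"] < neg_below:
            sliced.append((r["text"], 0))
    return sliced

slice_8_2 = polar_slice(reviews, pos_above=8, neg_below=2)
slice_9_2 = polar_slice(reviews, pos_above=9, neg_below=2)
slice_6_4 = polar_slice(reviews, pos_above=6, neg_below=4)
print(len(slice_8_2), len(slice_9_2), len(slice_6_4))
```

Tightening the thresholds keeps fewer but more polar reviews, which is the trade-off between dataset size and separability explored in this chapter.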


FIGURE 3.3. Distribution of movie review text size as a function of score from our Rotten Tomatoes database.

3.0.10 Classification Models and impact of parameters

For the creation of both doc2vec models (DM and DBOW) it is necessary to shuffle the body of reviews and make several passes (epochs) over this corpus. Since this is an iterative process, the training time grows linearly with the number of epochs. We studied how the number of iterations in the learning process affects the final quality of the doc2vec results, measured by the prediction accuracy of different classifiers.

The doc2vec method generates vectors of a fixed size, which can be specified prior to the learning process. Internally, this sets the dimensionality of the vector representation learned by the network. We have studied the impact of the size of these vectors on learning.

In terms of classifiers, the Random Forest method has its own set of parameters that should be set and optimized for a particular problem. One parameter of particular interest is the number of estimators, which is the number of decision trees in the forest. We tested the prediction results obtained with the doc2vec vectors as a function of the number of estimators for the random forest.

FIGURE 3.4. Results of tSNE for the movie reviews obtained with our custom scraper and converted to doc2vec vectors. Different data slices based upon the Rotten Tomatoes consensus scores are given: (top-left) all reviews, (top-right) >8 and <2, (bottom-left) >8 and <3, (bottom-right) >9 and <2. In the plots, the positive sentiments are in blue and the negative sentiments are in red.

For the training of the model used in doc2vec, it is necessary to carry out repeated iterations, called epochs in the RNN literature, shuffling the input on each pass. The number of repetitions influences the vectors generated with doc2vec, which subsequently influence the prediction results. Table 3.1 shows the dependence of the accuracy results for doc2vec embeddings on the number of epoch iterations. Each row in Table 3.1 uses the same 200-element vectors for the SGD and the RF classifiers.

TABLE 3.1. Dependence of accuracy results for doc2vec embeddings with different epoch iteration learning.

Number of epochs   SGD   RF
1                  65%   59%
3                  69%   61%
4                  70%   60%
5                  71%   61%
6                  72%   61%
7                  72%   61%
8                  73%   59%
9                  73%   60%
10                 71%   61%

Also, it is possible to vary the size of the vector that is generated by doc2vec for a review. Table 3.2 shows that this also influences the prediction results of the classifiers. Each row in Table 3.2 uses the same vectors, trained with five epoch iterations, for both the SGD and the RF classifiers.

TABLE 3.2. Dependence of accuracy results for doc2vec using different resulting vector lengths.

Vector length   SGD accuracy   RF accuracy
50              66%            63%
100             66%            63%
200             71%            60%
300             71%            60%
400             72%            59%
600             69%            59%

With respect to the Random Forest classifier, one important parameter is the number of estimators (trees). To decide how each node is split, decision tree classifiers also use a criterion: either the Gini impurity measure [6] or the entropy-based information gain measure. These two choices in the RF model have an impact on the final training/prediction results. Table 3.3 shows a comparison using 5 repetitions and 200 element vectors, with either the Gini or the entropy (information gain) criterion.
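The two split criteria can be written down directly from their definitions. The sketch below computes the Gini impurity, 1 − Σ p_c², and the entropy, −Σ p_c log₂ p_c, for the class labels reaching a node; the label lists are toy examples and the function names are our own.

```python
import math
from collections import Counter

def class_fractions(labels):
    """Fraction of samples belonging to each class present in labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a balanced binary node."""
    return 1.0 - sum(p * p for p in class_fractions(labels))

def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node, 1 for a balanced binary node."""
    return -sum(p * math.log2(p) for p in class_fractions(labels))

pure = [1, 1, 1, 1]    # all positive reviews
mixed = [1, 1, 0, 0]   # evenly mixed polarity
print(gini(pure), gini(mixed))
print(entropy(pure), entropy(mixed))
```

Both measures vanish for a pure node and peak for an even mixture, so in practice they tend to select similar splits, which is consistent with the small difference between the two criteria in Table 3.3.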

TABLE 3.3. Results with varying Random Forest parameters. Dependence of accuracy results for doc2vec using different criteria conditions and number of estimators. These tests used 27600 reviews with 5 epochs and a fixed word vector size of 400.

Number of estimators   Criterion   Accuracy
5                      gini        59%
10                     gini        61%
20                     gini        63%
30                     gini        66%
50                     gini        66%
80                     gini        68%
100                    gini        69%
200                    gini        69%
100                    entropy     68%

Table 3.4 shows the prediction accuracy as a function of the quality and polarity of

the text review training set. For each of the rows, the word embedding was performed

with doc2vec using 9 epoch iterations with an output vector of 400 elements. The same

vector was then used in the SGD and RF classifiers.

TABLE 3.4. Accuracy of predictions with different classifiers using different levels of web scraping results.

Input                         Number of reviews   SGD   RF    Elapsed time (s)
All rotten tomatoes reviews   40000               66%   63%   1496
RT without custom scraper     27600               63%   60%   873
RT with custom scraper        22400               70%   67%   716
Large movie review database   50000               88%   82%   582

Table 3.5 shows the results for the data from the Rotten Tomatoes reviews database constructed for this study. In particular, the data was separated based upon the Rotten Tomatoes consensus scores in order to study the prediction results as a function of the degree of sentiment polarity. For each of the rows, the word embedding was performed with doc2vec using 9 epoch iterations with an output vector of 400 elements. The same vector was then used in the SGD and RF classifiers.

TABLE 3.5. Accuracy of prediction based upon slicing the data into different degrees of sentiment polarity.

Score                    Number of reviews   SGD   RF
10–0                     1226                91%   84%
>9–<1                    1819                93%   88%
>8–<2                    5196                91%   87%
>7–<3                    10662               89%   83%
>6–<4                    15991               77%   70%
All reviews with score   19665               74%   68%

FIGURE 3.5. Accuracy of prediction (SGD accuracy and RF accuracy) based upon slicing the data into different degrees of sentiment polarity.

CHAPTER 4. DISCUSSION

The first hypothesis is that sentiment polarity is an important determinant of subsequent prediction accuracy. As expected from the outset, the polarity of the dataset greatly affects the prediction results. Moreover, this study is able to show the level of polarity necessary for obtaining good predictions. With a difference of 8 score points between positive and negative polarity, the prediction accuracy improves by approximately 20 percentage points with respect to using all scored reviews. Additionally, the accuracy of the positive >9, negative <1 slice approaches the accuracy cited in the literature. Prior to these experiments, we could not estimate how much polarity would affect the results. Systems are very good at separating very positive and very negative text, but there is quite a gray area in the middle. Perhaps focusing on the sentiments that lie somewhere in the middle could lead to more robust algorithms, as well as to a deeper understanding of how to make machines intelligent, i.e., able to understand the nuances of a language just as a human could.
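The improvement of approximately 20 points quoted above can be reproduced from the values in Table 3.5. The sketch below simply restates the table in two dictionaries (the key names are our own shorthand for the score slices) and takes the difference between the >9–<1 slice and the full scored set.

```python
# Accuracies in percent, copied from Table 3.5.
sgd_accuracy = {"10-0": 91, ">9-<1": 93, ">8-<2": 91,
                ">7-<3": 89, ">6-<4": 77, "all": 74}
rf_accuracy = {"10-0": 84, ">9-<1": 88, ">8-<2": 87,
               ">7-<3": 83, ">6-<4": 70, "all": 68}

# Gain, in percentage points, of the most polar slice over all scored reviews.
gain_sgd = sgd_accuracy[">9-<1"] - sgd_accuracy["all"]
gain_rf = rf_accuracy[">9-<1"] - rf_accuracy["all"]
print(gain_sgd, gain_rf)
```

The gain is 19 points for SGD and 20 points for RF, which is the basis for the figure of approximately 20 percentage points.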

As a corollary, the second hypothesis is that clean data sources are also an important determinant of downstream prediction accuracy. Once again, our tests confirmed this hypothesis by providing prediction accuracy results that indicate the influence of dirty data. An area of future research might be an intelligent filter, perhaps based on Latent Dirichlet Allocation with topic reduction, as a means of eliminating unwanted topics.

The field of sentiment analysis is broader than it first appears. Presently, making an original contribution to this field requires a formidable background in probabilistic graphical models, since these are the methods that have produced the best results. As is always the case, original contributions start by identifying the present problems, which are often not apparent from simply reading scientific journal articles. Instead, we have asked some very simple questions: Are the datasets set up to give good results? If so, where can we begin to get a handle on the more general problem?

Given the limited scope (and time) of the thesis, there are several limitations and areas that have been omitted. At present, we have tested the RNN only as a classifier using a fixed-size input vector. In this way, the true power of the recurrent neural network is not realized. A more interesting approach would be to classify documents with variable-sized input text. In addition, to compare with the results of the Google group [14], truly massive learning datasets would be necessary.

Several other ideas for future research are possible. First, a deeper understanding of the 6–4 case (ambiguous sentiment polarity) would be of interest. This would require some amount of manual curation, reading the texts and determining whether the sentiment is in fact ambiguous. Next, could the system obtain an actual polarity score automatically? And finally, it would be interesting to compare other models, such as LDA, with the word2vec and doc2vec paradigm using our dataset. Clearly, more work could be done related to deep learning: for example, using different RNN topologies; investigating the range of optimal parameters for the LSTM; trying convolutional NNs; and studying the results as a function of training set size and of hidden layer depth.

As can be seen from the study of sentiment analysis presented in this thesis, the field shall continue to provide a rich test-bed for new and exciting ideas in machine learning. With the present trend and advances of modern intelligent systems, new and more complex problems shall be tackled and solved. Someday, we may just have the perfect recommender system. But, in the end, it might be like Bruce Springsteen's song "57 Channels (And Nothin' On)": we humans still need to decide which movie to watch.


BIBLIOGRAPHY

[1] M. ANNETT AND G. KONDRAK, A comparison of sentiment analysis techniques: Polarizing movie blogs, in Proceedings of the Canadian Society for Computational Studies of Intelligence, 21st Conference on Advances in Artificial Intelligence, Canadian AI'08, Berlin, Heidelberg, 2008, Springer-Verlag, pp. 25–35.

[2] Y. BENGIO, H. SCHWENK, J.-S. SENÉCAL, F. MORIN, AND J.-L. GAUVAIN, Neural Probabilistic Language Models, in Innovations in Machine Learning, D. Holmes and L. Jain, eds., vol. 194 of Studies in Fuzziness and Soft Computing, Springer Berlin Heidelberg, 2006, pp. 137–186.

[3] J. BERGSTRA, O. BREULEUX, F. BASTIEN, P. LAMBLIN, R. PASCANU, G. DESJARDINS, J. TURIAN, D. WARDE-FARLEY, AND Y. BENGIO, Theano: a CPU and GPU math expression compiler, in Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010.

[4] D. M. BLEI, Probabilistic topic models, Commun. ACM, 55 (2012), pp. 77–84.

[5] D. M. BLEI, A. Y. NG, AND M. I. JORDAN, Latent dirichlet allocation, J. Mach. Learn. Res., 3 (2003), pp. 993–1022.

[6] L. BREIMAN, J. FRIEDMAN, R. OLSHEN, AND C. STONE, Classification and Regression Trees, Wadsworth and Brooks, Monterey, CA, 1984.

[7] M. CABANLIT AND K. JUNSHEAN ESPINOSA, Optimizing n-gram based text feature selection in sentiment analysis for commercial products in twitter through polarity lexicons, in Information, Intelligence, Systems and Applications, IISA 2014, The 5th International Conference on, July 2014, pp. 94–97.

[8] E. CAMBRIA, B. SCHULLER, Y. XIA, AND C. HAVASI, New avenues in opinion mining and sentiment analysis, Intelligent Systems, IEEE, 28 (2013), pp. 15–21.

[9] Z. HARRIS, Distributional structure, Word, 10 (1954), pp. 146–162.


[10] Q. V. LE AND T. MIKOLOV, Distributed representations of sentences and documents, CoRR, abs/1405.4053 (2014).

[11] H. P. LUHN, The automatic creation of literature abstracts, IBM J. Res. Dev., 2 (1958), pp. 159–165.

[12] Y. T. MCINTYRE-BHATTY, Neural network analysis and the characteristics of market sentiment in the financial markets, Expert Systems, 17 (2000), pp. 191–198.

[13] W. MEDHAT, A. HASSAN, AND H. KORASHY, Sentiment analysis algorithms and applications: A survey, Ain Shams Engineering Journal, 5 (2014), pp. 1093–1113.

[14] T. MIKOLOV, K. CHEN, G. CORRADO, AND J. DEAN, Efficient estimation of word representations in vector space, CoRR, abs/1301.3781 (2013).

[15] T. MIKOLOV, M. KARAFIÁT, L. BURGET, J. CERNOCKÝ, AND S. KHUDANPUR, Recurrent neural network based language model, in INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pp. 1045–1048.

[16] T. MIKOLOV, W.-T. YIH, AND G. ZWEIG, Linguistic regularities in continuous space word representations, in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, June 2013, Association for Computational Linguistics, pp. 746–751.

[17] R. REHUREK AND P. SOJKA, Software Framework for Topic Modelling with Large Corpora, in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, May 2010, ELRA, pp. 45–50. http://is.muni.cz/publication/884893/en.

[18] V. SINGH, R. PIRYANI, A. UDDIN, AND P. WAILA, Sentiment analysis of movie reviews: A new feature-based heuristic for aspect-level sentiment classification, in Automation, Computing, Communication, Control and Compressed Sensing (iMac4s), 2013 International Multi-Conference on, March 2013, pp. 712–717.

[19] M. SUNDERMEYER, H. NEY, AND R. SCHLUTER, From feedforward to recurrent lstm neural networks for language modeling, Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 23 (2015), pp. 517–529.


[20] M. SUNDERMEYER, I. OPARIN, J.-L. GAUVAIN, B. FREIBERG, R. SCHLUTER, AND H. NEY, Comparison of feedforward and recurrent neural network language models, in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, May 2013, pp. 8430–8434.

[21] H. TANG, S. TAN, AND X. CHENG, A survey on sentiment detection of reviews, Expert Syst. Appl., 36 (2009), pp. 10760–10773.

[22] T. TRILLA AND F. ALIAS, Sentence-based sentiment analysis for expressive text-to-speech, Audio, Speech, and Language Processing, IEEE Transactions on, 21 (2013), pp. 223–233.

[23] L. VAN DER MAATEN, Accelerating t-sne using tree-based algorithms, J. Mach. Learn. Res., 15 (2014), pp. 3221–3245.

[24] L. VAN DER MAATEN AND G. HINTON, Visualizing high-dimensional data using t-sne, J. Mach. Learn. Res., 9 (2008), pp. 2579–2605.
