REGULAR PAPER
A recommender system for the TV on the web: integrating unrated reviews and movie ratings
Filipa Peleja • Pedro Dias • Flavio Martins •
Joao Magalhaes
© Springer-Verlag Berlin Heidelberg 2013
Abstract The activity of Social-TV viewers has grown
considerably in the last few years—viewers are no longer
passive elements. The Web has socially empowered the
viewers in many new ways; for example, viewers can now rate TV programs, comment on them, and suggest TV shows to friends through Web sites. Some innovations
have been exploring these new activities of viewers but we
are still far from realizing the full potential of this new
setting. For example, social interactions on the Web, such
as comments and ratings in online forums, create valuable
feedback about the targeted TV entertainment shows. In
this paper, we address this last setting: a media recom-
mendation algorithm that suggests recommendations based
on users’ ratings and unrated comments. In contrast to
similar approaches that are only ratings-based, we propose
the inclusion of sentiment knowledge in recommendations.
This approach computes new media recommendations by
merging media ratings and comments written by users
about specific entertainment shows. This contrasts with
existing recommendation methods that explore ratings and
metadata but do not analyze what users have to say about
particular media programs. In this paper, we argue that text
comments are excellent indicators of user satisfaction.
Sentiment analysis algorithms offer an analysis of the
users’ preferences even when the comments are not associated with an explicit rating. Thus, this analysis will also have an impact on the popularity of a given media show. The recommendation algorithm—based on
matrix factorization by Singular Value Decomposition—
will consider both explicit ratings and the output of senti-
ment analysis algorithms to compute new recommenda-
tions. The implemented recommendation framework can
be integrated into a Web TV system where users can view and comment on entertainment media from a video-on-demand service. The recommendation framework was
evaluated on two datasets from IMDb with 53,112 reviews
(50 % unrated) and Amazon entertainment media with
698,210 reviews (26 % unrated). Recommendation results
with ratings and the inferred preferences—based on the
sentiment analysis algorithms—exhibited an improvement
over ratings-only recommendations. This result
illustrates the potential of sentiment analysis of user com-
ments in recommendation systems.
Keywords Social-TV · Recommendation · Reviews analysis · Sentiment analysis · Opinion mining
1 Introduction
The times when one would sit in front of the TV and passively watch a fixed list of broadcast shows are giving way to more interactive services such as video-on-demand. The increasing competition from computers and mobile devices, as novel ways of accessing entertainment in multiple places, has pushed TV viewing into a more personalized and interactive experience. TV
F. Peleja � P. Dias � F. Martins � J. Magalhaes (&)
Departamento de Informatica, Centro de Informatica
e Tecnologias da Informacao, Faculdade de Ciencias
e Tecnologia, Universidade Nova de Lisboa,
2829-516 Caparica, Portugal
e-mail: [email protected]
F. Peleja
e-mail: [email protected]
P. Dias
e-mail: [email protected]
F. Martins
e-mail: [email protected]
Multimedia Systems
DOI 10.1007/s00530-013-0310-8
entertainment has grown well beyond the traditional
broadcast paradigm. The difference between the TV and
the Web is fading away with new Internet-enabled TV devices and new Web applications delivering media entertainment from both amateur and professional media
producers.
Social-TV is a novel paradigm that has received much
attention in the last decade: research in this area has
brought us new technologies to support interaction among
users. The change is even greater than a first look might
indicate: TV shows go far beyond the TV set and extend
themselves onto the Web with extra media entertainment,
online forums, discussions, collectors’ art, etc. People not
only expect rich interactive experiences, they are eager for
more social and engaging entertainment offered by active
services that "sense" their mood and preferences.
In this paper, we argue that social interaction is not
limited to interactions occurring during the show. People
interact after the show in online forums and other social-media Web sites, where they leave valuable feedback about the show they have watched. For instance, a recent multinational Nielsen survey of device owners (http://blog.nielsen.com/nielsenwire/?p=31338) observed that 88 % of respondents in the US and 80 % in the UK claim to use another device while watching TV at least once a month.
Thus, the true strength of Social-TV lies in the collaborative interaction and in the correct interpretation of users’ most valuable feedback. Our proposal is to improve
media recommendations (available through a video-on-
demand (VoD) service) by exploring social interactions
among users. We propose a novel media entertainment
system that computes recommendations based on user
ratings and comments.
Product recommendations have become increasingly
popular in e-commerce services such as in Amazon Instant
Video service1 and Netflix service2 [25]. The main goal of
a recommender system is to suggest unknown items
(movies or other entertainment media) by considering the
information exchanged by users when interacting with the
system. Until recently, users commonly asked for a rec-
ommendation from their own circle of known friends or
family. However, recommendations demand a certain level
of trustworthy knowledge and not everyone is eligible to
provide a skilled recommendation. Thus, a recommender system that observes the viewpoints of diverse users across many movies may offer more reliable and insightful recommendations than the average user, whose own awareness covers only a small fraction of the vast number of different movies.
In general, two families of algorithms inspire Recom-
mender Systems (RS): content-based and collaborative
filtering. Broadly, to predict movies that might be of
interest to particular users, the content-based approach uses
a correlation analysis between users’ personal information
and movies metadata, while the collaborative-filtering
approach detects user-movie rating patterns over time.
The main difference between these two strategies relies
on the nature of the information used to build the recom-
mender system. In content-based approaches, information
related to users and movies is obtained manually and is therefore expensive to collect. In contrast, collaborative-filtering approaches automatically identify future preferences by observing users’ interactions, i.e., user-movie ratings and users’ comments/reviews.
Figure 1 depicts our approach: users rate media enter-
tainment shows and these ratings are received instanta-
neously by the recommender system; users can later
comment on the show in the online forum or fan zone; user
preferences are then inferred from the text comments and
merged into the recommendation algorithm. Since inferred
ratings are the exclusive result of a text sentiment analysis
algorithm, this approach can be seen as a weakly super-
vised recommendation algorithm.
The popularity of the information exchanged by media
consumers (and not only TV viewers) has been increasing
at an enormous rate. While some Web applications allow users both to rate and comment on a movie, others support only one of the two. For example, blogs and online forums
only support comments, and personal media players only
support ratings. Some authors, such as Takama and Muto
[36], have explored sentiment analysis techniques to infer
TV viewer profiles from comments. In contrast, we bypass
the extraction of user profiles and directly compute new
recommendations. Moreover, we describe a complete
framework that fuses sentiment analysis output with users’
explicit ratings to recommend new media. This integrated
approach ensures that no information is lost in the process of
creating a user profile. A comprehensive evaluation of our
framework illustrates how such an integrated approach can
Fig. 1 Recommendations based on ratings and reviews: reviews and ratings from a Social-TV online media forum feed a sentiment analysis stage whose inferred preferences, together with explicit ratings, drive the recommendation algorithm for TV shows
1 http://www.amazon.com.
2 http://www.netflix.com.
indeed compute more accurate recommendations. We also
identify the right settings to obtain the best results, i.e.,
compensating for biased or spam comments.
This paper is organized as follows. Section 2 discusses
some of the most relevant previous work. Section 3 offers
an overview of the proposed recommendation framework
and Sects. 4 and 5 discuss the details of the framework, more specifically the implemented sentiment analysis algorithm and the recommendation algorithm, respectively. Finally, the experimental evaluation and discussion on a real dataset are presented in Sect. 6.
2 Related work
2.1 Social-TV
According to Jenkins [20] and Haythornthwaite [16], media popularity is linked to social interactions in the new media.
They concluded that social ties and social media are
important to users’ media viewing habits. More recently,
Harboe et al. [15] conducted an experiment examining the
influence of social-TV watching: for example, a user watching television alone but interacting through instant notifications with friends or family who are watching the same, or another, program. The Web is making TV entertainment a social activity through online forums, instant
chatting, and other forms of technology-driven social
interaction.
Brown and Barkhuus [7] studied television-watching
practices among users of interactive media centers. They
have observed that iTV users are willing to signal some
minor preferences to receive personalized content. Uchyi-
git and Clark [41] proposed a similar system for the per-
sonalized generation of Electronic Program Guides (EPGs)
for digital TV. These approaches are a major step forward
in improving the viewer experience in TV, but their limited
user feedback and engagement has been a bottleneck in
these early systems. Thus, the amount of work in Web
recommendation systems [2, 25] shows the potential this
technology has for the TV domain. For example, Vildjio-
unaite and Kyllonen [42] addressed the issue of profiling
the preferences of a household. This implies modeling the
interaction of each individual (each child or each parent)
and the interaction of groups (just children, just parents, or
children with a parent).
To strengthen the adoption of program recommenda-
tions in iTV, the user must be deeply engaged. For
example, some approaches have adapted the live shows on-
the-fly according to users’ explicit votes/audiences col-
lected from set-top-boxes [46]. This example shows how
user feedback contributes to on-the-fly production of TV
shows through personalization for the masses. Thus,
exploiting social interactions is the key to modeling viewer
preferences and making TV entertainment more compelling.
Besides instant messages, set-top-box interaction data,
viewing habits, and online forum comments, other authors
have explored emotions [30]. Oliveira et al. [30] explored an
emotion-based approach to categorize, access, explore, and
visualize movies. In their system, named iFelt, users catego-
rize their movies according to the emotions they felt while
watching them. In contrast, our approach does not explicitly ask users for information, nor does it compute a user profile; instead,
we pervasively monitor and analyze user actions, and directly
embed this information in the recommendation algorithm.
2.2 Recommender systems
Delivering recommendation services on the TV domain
can be a non-trivial task since the distribution architecture
of the TV content does not favor interactivity. Xu et al. [43]
discuss a general system architecture and its building
blocks for delivering recommendation services to the TV
domain. Their setup integrates DVB-T and DVB-S televi-
sion systems with a WebTV service to provide interaction
mechanisms enabling TV program recommendations. The
Web part of their system overcomes the broadcast nature of
TV systems and allows user feedback and the delivery of
recommended TV programs. The MPEG-7 multimedia
description standard supports the description of a user TV-
usage history. Ferman et al. [14] propose a fuzzy algorithm
to compute the usage history descriptor. A filtering agent
then combines user preferences, usage history, and content
metadata to compute recommendations.
Recommender systems for the TV domain face several specific difficulties, and in some cases aggravated ones. Baudisch [4] considered collaborative-filtering
approaches and the cold-start problem in the TV domain.
The user burden and tolerance in interacting with such
systems can become a serious disincentive. TV viewers are
accustomed to zero-effort, thus Bausdisch suggest using
opinion leaders. TV viewing is rarely an individual expe-
rience, and most of the times the TV watching is a shared
experience (traditionally the family members). Thus, rec-
ommendations for groups of users are an important aspect
for the TV domain (even if users are not physically toge-
ther as in the Social-TV paradigm).
In this paper, we argue that in Social-TV users not only
rate TV programs but also comment and discuss the movie,
show, etc. Recommender systems can rely on user ratings
as in traditional collaborative filtering, or they can also explore
other data generated by users. Many researchers have
developed different strategies for exploring user feedback
in recommender systems (RS). Most of these approaches
gather movie ratings (explicitly provided by users) and
exploit this data as a collaborative-filtering task [25].
Within collaborative-filtering approaches [23–25], the
matrix factorization methods have proven their superiority over other methods (e.g. nearest neighbor). Koren’s
work [23] also showed that ratings exhibit temporal pat-
terns linked to seasonal purchases (e.g. Christmas) and
other time-dependent events. In all cases, matrix factorization approaches have proven capable of modeling the various details of the problem data. In this paper, we
also follow a matrix factorization approach.
Explicit ratings by themselves can be a limited metric for assessing user opinion about a movie. In some cases, such information can be very scarce: when a movie is of low quality, users often simply do not bother to rate it. In contrast, users may discuss, or
exchange impressions, about the movie. However, as Jakob
et al. [19] point out, most recommendation algorithms focus on explicit ratings and user/product characteristics, disregarding the information enclosed in the free-text
reviews. In addition, to the best of our knowledge, only a
few studies have proposed to integrate sentiment analysis
with recommendation algorithms [1, 19, 26, 28, 47].
Leung et al. [26] suggested inferring ratings from
reviews and integrating them with a collaborative-filtering
(CF) approach. The authors tackle the extraction of mul-
tilevel ratings by proposing a new method of identifying
opinion words, semantic orientation and corresponding
strength. This method allows similar words to receive different semantic orientation values. For example, the words terrible
and frightening may seem similar but in some domains
(e.g. movie) frightening is likely to be applied in a positive
context. However, in contrast to the present work, Leung
et al. [26] did not evaluate the recommendation part, having only assessed the sentiment strength and orientation of the opinion words.
In the movies domain, Jakob et al. [19] present the
advantages of improving recommendations with the senti-
ment extracted from user reviews. According to the
authors, the sentiment words should be split into clusters
where each cluster corresponds to different movie aspects.
Hence, the overall sentiment regarding a movie is mea-
sured by observing the sentiment words within these
clusters. In Jakob et al.’s approach, the sentiment review information is supplied to the recommendation algorithm as feature vectors. In comparison to Jakob et al. [19], where recommendations always need explicit ratings, we infer ratings and their associated confidence values from reviews. Hence, unlike Jakob et al.’s proposal, the ratings inferred by sentiment analysis are not directly combined with the explicit ratings. In addition, Jakob et al. use manual and automatic clustering algorithms to infer the movie aspects upon which users express an opinion. In contrast, we do not require a complete set of ratings for all reviews or fine-grained aspects of the reviewed movies.
In a more recent study, Zhang et al. [47] propose a comprehensive approach to sentiment-based recommendation algorithms on an online video service. Their system extracts "like"/"don’t like" information from reviews and users’ facial expressions, and relates the comments to the video description through a keyword vector-space model. In Zhang et al.’s approach, the inferred prediction is based on an unsupervised sentiment classification. In addition, regarding the CF recommender system, our approach differs in how the inferred ratings are incorporated: Zhang et al. build a list of keywords that are combined in a users matrix, a products matrix, and a ratings matrix.
To handle the sparsity of ratings, Moshfeghi et al. [28]
propose to improve a RS algorithm by considering not only
ratings but also emotion and semantic spaces to better
describe the movies’ and users’ space. The Latent Dirichlet
Allocation is used to compute a set of latent groups of
users. Moshfeghi et al.’s evaluation showed that such a hybrid approach (combining ratings with additional spaces extracted from metadata) outperforms ratings-only approaches and reduces the effects of cold-start.
In a different study, Aciar et al. [1] propose to analyze users’ reviews by developing an ontology to translate the review text. Aciar et al.’s [1] ontology relies on observing review positiveness, negativeness, and the users’ skill level. However, an important part of their work
relies on a manually created ontology capturing related
words. In addition, the training examples are manually
collected and labeled. Nonetheless, this study presents an
initial approach where the recommender system is based on
the reviews. The related-word concepts allow the identifi-
cation of co-related product characteristics (features). For
instance, in this domain the concept "carry" is related to the concept "size". Thus, Aciar et al.’s ontology measures
the quality of the several features within a product to create
user recommendations. Unlike the approaches to recom-
mendation systems from Aciar et al. [1], Jakob et al. [19],
and Leung et al. [26], we do not use any manual lexicons or initialization, and our focus is on the integration of ratings and unrated comments.
2.3 Sentiment analysis (SA)
Sensing the mood and the preferences of users through text
analysis techniques is a research area that has been quite
active in the last decade. From the first techniques of
review analysis [31, 38], to more recent techniques of
tweets analysis [6, 10], the field has progressed much.
Nonetheless, sentiment classification is commonly tackled with binary classifiers, even though the specific characteristics of different types of products, or rating scales more closely related to the domain, suggest a multiclass classification rather than the simplistic view of positive versus negative [35].
In a sentiment analysis study, one of the most important tasks is identifying which words express a sentiment. Similarly to a text categorization task in which not
every word is related to a topic, not every word is qualified
with sentiment intensity. In this context, a word associated
with a sentiment intensity is also referred to as opinion
word. In [40], Turney and Littman reported a study showing that sentiment classification improves when using only adjectives. Nonetheless, other studies [12, 17, 37, 39]
reveal that adverbs, nouns, and verbs are also qualified with
sentiment intensity. Hence, in our work, we will consider
these aforementioned word-families as opinion words.
Initial research in SA aimed at understanding "how and which words" humans use to express their preferences
[27]. Turney [38] aimed at assessing the positiveness or
negativeness of an opinion word through a method called
Semantic Orientation. It assumes that the correlation between a word and two reference words ("excellent" and "poor") indicates the orientation of the sentence (positive or negative). Turney measures the correlation with the
Point-wise Mutual Information-Information Retrieval
(PMI-IR) algorithm on the Web. Mullen and Collier [10]
observe that the choice of the words "excellent" and "poor" for the PMI-IR metric seems somewhat arbitrary.
However, further experiments led the authors to conclude
that those terms were the most appropriate. In this paper,
we conducted an evaluation of PMI-IR with different reference terms.
Opinion words can be identified through manual, corpus-based, or dictionary-based approaches. Since manual approaches are highly time-consuming, it is common to combine them with other automated methods [8, 44]. Typically, a corpus-based approach [11, 38] relies on identifying co-occurrence patterns, while dictionary-based approaches [18, 22] use a seed set of opinion words and a dictionary. A
popular linguistic resource in sentiment analysis is the
SentiWordNet dictionary, which provides an answer to the question of "how and which words" people use to express preferences. The lexical resource SentiWordNet,
introduced by Esuli and Sebastiani [13] and recently
revised by Baccianella et al. [3], is a lexicon created semi-
automatically by means of linguistic classifiers and human
annotation. In this context, a set of synonyms representing the same concept is referred to as a synset. In SentiWordNet,
each synset is annotated with its degree of positivity,
negativity, and neutrality (the same synset can express
opposite feelings). Previous studies using the opinion lex-
icon SentiWordNet for sentiment classification have shown
promising results [9, 29]. Ohana and Tierney [29] applied
the Support Vector Machine (SVM) classifier and a clas-
sifier that summed all the positive and negative features in
a review. Their evaluation on an IMDb corpus [32] indicates that SentiWordNet is an important resource for sentiment analysis tasks, although the best accuracy obtained for their
SVM classifier was 69.35 %. Denecke [9] evaluated a rule-
based classifier with the SentiWordNet on different
domains. Denecke showed that a rule-based classifier (RIPPER) performed worse than a logistic regression classifier. In the present study, we will use a gradient-descent classifier and measure the opinion word degree of
positivity and negativity with the SentiWordNet dictionary.
Previous approaches have shown that movie reviews are
among the most difficult ones to analyze. The most evident
result was obtained by Turney [38], who reached an accuracy of 66 % for movie reviews compared to an accuracy
of 80 % for automobile reviews. Despite this fact, we will
show how, given a sufficiently large number of movie
reviews, we can improve movie recommendation
techniques.
3 Sentiment-based recommendation framework
The goal of the proposed framework is to integrate, in one single recommendation framework, both explicit ratings and free-text comments with no associated rating.
The algorithm behind the recommendation framework
analyzes user comments and represents them together with user ratings in a collaborative matrix integrating the
interactions of all users. Figure 2 illustrates the proposed
sentiment-based recommendation framework. The
framework is divided into two parts: a comments analysis
algorithm to infer ratings from user comments and a
recommendation algorithm that merges all data into a
single sparse and highly incomplete matrix, to compute
new recommendations by matrix factorization. The two
following sections will detail both algorithms of the
framework.
The laboratory demonstrator where the recommendation
framework is integrated offers a social-media online ser-
vice for sharing, commenting, rating, and interacting with
movies. The Web TV demonstrator home page, Fig. 3, lists
the most popular movies and popular actors. Once the user
asks for personalized recommendations, the system
examines the user interactions with the online service and
computes a playlist recommendation. The playlist recom-
mendation is shown in full screen with the recommended
videos at the top of the screen (Fig. 4). The user is allowed
to comment on and rate all movies in the database. In Fig. 5,
the UI for the user interactions with the online service is
shown. The evaluation of this demonstrator is outside the
scope of this paper.
4 Comments analysis algorithm
The goal of the comments analysis algorithm is to analyze
the feedback that users write about a movie and infer a
preference in the form of ratings. To formulate the
problem, we consider a set of text reviews and their asso-
ciated rating,
$D = \{(re_1, ra_1), \ldots, (re_n, ra_n)\},$  (1)

where a comment/review $re_i$ is rated according to the value of $ra_i \in \{1, 2, 3, 4, 5\}$. Reviews are represented by a set of
opinion words (OW), i.e.
$re_i = (ow_{i,1}, \ldots, ow_{i,m}),$  (2)
where each component $ow_{i,j}$ represents the opinion word $j$ of the review $i$. An opinion word (OW) is a word that conveys a feeling or preference, e.g. "great" or "miserable". Moreover, for each OW, we identify the
semantic orientation (like or dislike) and quantify its
positiveness or negativeness. Therefore, the comments
analysis algorithm aims at learning a classification
function,
$\Phi(re_i) \mapsto [0, 1],$  (3)
Fig. 2 Proposed framework: unrated and rated comments flow through a comments analysis stage (corpus representation; orientation and intensity of words, using Semantic Orientation and SentiWordNet; comment classification), whose inferred ratings, together with explicit ratings, feed the sentiment-based recommendation stage (user/movie biases; matrix factorization; recommendations inference) that produces user-movie recommendations
Fig. 3 Media player with recommendations at the top
Fig. 4 Playing a video with recommendations at the top
to infer the rating of the review $re_i$. Following a machine learning approach, this function is learnt as a probabilistic model $p(ra_i \mid re_i)$ estimated from a training set.
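As a minimal sketch of this formulation (Eqs. (1)-(2)), the training set $D$ can be held as a list of (opinion-word tuple, star rating) pairs; the helper name and the example reviews below are illustrative, not the paper's data.

```python
# Illustrative sketch of D = {(re_1, ra_1), ..., (re_n, ra_n)}: each review
# re_i is a tuple of opinion words and each rating ra_i is on a 1-5 star scale.
def make_dataset(pairs):
    """Validate that every rating lies on the 1-5 star scale and return D."""
    for _review, rating in pairs:
        assert rating in {1, 2, 3, 4, 5}, "ratings must be 1-5 stars"
    return list(pairs)

D = make_dataset([
    (("great", "thrilling"), 5),   # re_1 = (ow_{1,1}, ow_{1,2}), ra_1 = 5
    (("dull", "miserable"), 1),    # re_2 = (ow_{2,1}, ow_{2,2}), ra_2 = 1
])
```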
4.1 Corpus representation
The most elementary representation of an OW is the bag-
of-words representation, i.e., a unigram-based representa-
tion. Pang et al. [6] claim that this representation delivers fairly good results in relation to a bigram, or adjective, representation. However, its simplicity may raise some doubts about its ability to describe a sentiment. For instance, the unigram representation may fail to capture strong opinions [7]. For this reason, each review $re_i = (ow_{i,1}, \ldots, ow_{i,m})$ is represented as a histogram of unigrams (isolated words) and frequent bigrams (adjective-word pairs).
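A possible sketch of this representation: count unigrams plus adjective-word bigrams in a tokenized review. The small adjective set below is a hypothetical stand-in for the part-of-speech tagging the full pipeline would rely on.

```python
# Sketch: represent a review as a histogram of unigrams and adjective-word
# bigrams (Sect. 4.1). ADJECTIVES is an illustrative stand-in for a POS tagger.
from collections import Counter

ADJECTIVES = {"great", "strong", "weak", "miserable"}  # hypothetical seed set

def review_histogram(tokens):
    """Count unigrams and (adjective, word) bigrams in a tokenized review."""
    counts = Counter(tokens)  # unigram counts
    for adj, nxt in zip(tokens, tokens[1:]):
        if adj in ADJECTIVES:
            counts[(adj, nxt)] += 1  # adjective-word bigram
    return counts

h = review_histogram(["a", "great", "movie", "with", "a", "strong", "cast"])
# h counts "a" twice, "great" once, and the bigram ("great", "movie") once
```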
4.2 Orientation and intensity of opinion words
The orientation of an OW is related to a word’s affinity
towards a positive or negative sense. Recently, Turney [8]
proposed a metric to estimate the orientation of a phrase
using the concept of the point-wise mutual information
(PMI), which is known for its ability to measure the
strength of semantic associations. The metric will measure
the degree of statistical dependence between the candidate
word and two reference words (i.e. a positive reference
word and a negative reference word). A high co-occurrence
between the candidate word and the positive word indicates
a positive sense, e.g. a high co-occurrence between "ice-cream" and the reference word "excellent". Since the
choice of reference words is of particular importance, the
algorithm presented in this paper considers the reference
words presented in Table 1.
The semantic orientation (SO) is computed by observing
the co-occurrence between the reference words and the
candidate word on the Web corpus. Thus, the semantic
orientation is given by the expression

$SO(word) = \dfrac{hits(word, \text{``excellent''}) \cdot hits(\text{``poor''})}{hits(word, \text{``poor''}) \cdot hits(\text{``excellent''})},$  (4)

where $hits(word)$ and $hits(word, \text{``excellent''})$ are given by the number of hits a search engine returns using these keywords as search queries.
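The ratio above can be sketched in a few lines. The `hits` callable and the toy counts below stand in for the Web search-engine hit counts the paper queries; they are illustrative values, not measured ones.

```python
# Sketch of the semantic-orientation ratio in Eq. (4). `hits` stands in for
# search-engine hit counts; the toy numbers below are purely illustrative.
def semantic_orientation(word, hits, pos_ref="excellent", neg_ref="poor"):
    """SO(word) = hits(word, pos) * hits(neg) / (hits(word, neg) * hits(pos))."""
    num = hits((word, pos_ref)) * hits((neg_ref,))
    den = hits((word, neg_ref)) * hits((pos_ref,))
    return num / den if den else float("inf")

# Toy hit counts: "superb" co-occurs far more often with "excellent".
toy_counts = {
    ("superb", "excellent"): 900, ("superb", "poor"): 100,
    ("excellent",): 5000, ("poor",): 5000,
}
so = semantic_orientation("superb", lambda q: toy_counts[q])
# so > 1 signals a positive orientation for "superb"
```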
Computing the semantic orientation can be computa-
tionally very demanding. The high dimensionality gener-
ated by the use of the bigram representation does not allow
the process to scale due to typical search engines’ querying constraints. Thus, the semantic orientation of the adjective-
Fig. 5 Media comments and ratings
Table 1 Semantic orientation and word polarity references

Technique | Word polarity references
T: Turney [8] | "excellent"/"poor"
G: Generic | "good"/"bad"
DS: Domain specific | "best movie"/"worst movie"
DS + T | "excellent movie"/"poor movie"
word pairs is replaced by the semantic orientation of the
adjective.
The semantic orientation determines the polarity of a
word but does not weight the intensity expressed by the
opinion word. OWs may express different intensity values,
for instance: ‘contented’ versus ‘ecstatic’ [4]. Thus, with a
lexical resource SentiWordNet [13], we identify the OW
strength. In this lexical resource, each feature is associated
with two numerical scores (positivity and negativity). In
this context, given the SO for a unigram, or bigram, Sen-
tiWordNet will return its sentiment strength. So, the Sen-
tiWordNet (swn) value of an OW will be given by,
$swn(ow) = \begin{cases} pos_{SWN}(ow), & SO(ow) > 1 \\ neg_{SWN}(ow), & SO(ow) \le 1 \end{cases}$  (5)
where $pos_{SWN}(ow)$ corresponds to the positive score value given by SentiWordNet and $neg_{SWN}(ow)$ corresponds to the negative score. For the adjective-word bigram representation, the score is measured as

$swn(adjective\text{-}word) = swn(adjective) + swn(word)$  (6)
Table 2 illustrates how the sentence "Love it or hate it, however someone tell me what on Earth…" is processed: according to each word’s family, we either discard it or determine its semantic orientation and SentiWordNet weight. Expression (5) is then applied to the weights of the opinion-word vector of Eq. (2) accordingly.
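Equations (5)-(6) can be sketched as follows. The tiny lexicon and the SO values here are hypothetical illustrations, not real SentiWordNet entries or measured orientations.

```python
# Sketch of Eqs. (5)-(6): choose the SentiWordNet positive or negative score
# of an opinion word depending on its semantic orientation. Both dictionaries
# below hold made-up illustrative values, not real SentiWordNet data.
SWN = {"love": (1.375, 0.0), "hate": (0.0, 0.75)}   # ow -> (posSWN, negSWN)
SO = {"love": 1.2, "hate": 0.8}                      # hypothetical SO(ow) values

def swn(ow):
    """Eq. (5): positive score if SO(ow) > 1, otherwise the negative score."""
    pos, neg = SWN[ow]
    return pos if SO[ow] > 1 else neg

def swn_bigram(adjective, word):
    """Eq. (6): the bigram score is the sum of the two word scores."""
    return swn(adjective) + swn(word)

# swn("love") picks posSWN (SO > 1); swn("hate") picks negSWN (SO <= 1)
```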
5 Classification
For classifying reviews, we used the Vowpal Wabbit (VW) framework.3 VW uses a linear classifier to assign a confidence value to a review. The classifier identifies the orientation
and intensity of all opinion words of a review $re_i = (ow_{i,1}, \ldots, ow_{i,m})$ and computes its rating based on the sigmoid function,

$\Phi(re_i) = \dfrac{1}{1 + \exp\left(-\sum_{j} ow_{i,j} \cdot w_j\right)}.$  (7)
The weights $w_j$ are learned with a gradient-descent algorithm that trains the function to distinguish between ratings 1 and 5. Other classifiers could have been used;
however, VW implements a stochastic gradient-descent
classifier for optimizing the square loss of a linear model.
In addition, VW operates in an entirely online fashion
overcoming practical limitations such as efficiency and
scalability [45].
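The linear-plus-sigmoid scoring of Eq. (7) can be sketched directly; this is a stand-alone stand-in for the Vowpal Wabbit classifier the paper actually uses, with hand-set (not learned) weights for illustration.

```python
# Sketch of Eq. (7): a linear model over opinion-word feature values squashed
# by the sigmoid. The feature values and weights below are illustrative only;
# in the paper the weights are learned by stochastic gradient descent in VW.
import math

def review_score(ow_values, weights):
    """Phi(re_i) = 1 / (1 + exp(-sum_j ow_{i,j} * w_j))."""
    z = sum(ow * w for ow, w in zip(ow_values, weights))
    return 1.0 / (1.0 + math.exp(-z))

# Two opinion-word features (e.g. a positive and a negative word strength)
# with hand-set weights: the first pushes toward rating 5, the second toward 1.
positive_review = review_score([1.375, 0.0], [2.0, -2.0])
negative_review = review_score([0.0, 0.75], [2.0, -2.0])
# positive_review lands above 0.5, negative_review below it
```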
6 Computing recommendations with ratings
and unrated comments
In this section, we describe the collaborative-filtering
algorithm that combines ratings with the comments anal-
ysis output. The algorithm decomposes the ratings matrix
into two new matrices representing movies and users. This
matrix factorization introduces a bias correction mecha-
nism to compensate for different users’ optimism and dif-
ferent movie popularities. A second step selects the comments to be merged with the explicit ratings based on the inference confidence and converts the probabilistic analysis output of the unrated comments into a 1–5 star scale.
6.1 Ratings-based recommendation
Among all recommending techniques, collaborative-filter-
ing approaches are the most widely adopted. Collaborative-
filtering techniques attempt to collaboratively infer users’
preferences towards products, by analyzing the user-movie
ratings matrix R. Each element of this matrix represents a
rating given by user u to movie i, expressed by a numeric
value. It is important to mention that since each user usu-
ally rates a very small portion of all available products, the
ratings matrix R is always sparsely filled. Thus, the purpose
of collaborative-filtering techniques is to work over the few
known ratings to predict the unknown ones.
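The sparsity discussed above is usually handled by storing only the observed entries of R, for example as a dictionary keyed by (user, movie); the users and movies below are made up for illustration.

```python
# Sparse representation of the ratings matrix R: only the known
# (user, movie) -> rating entries are stored. Entries are illustrative.
ratings = {
    ("alice", "movie_a"): 5,
    ("alice", "movie_b"): 3,
    ("bob", "movie_b"): 4,
}

def known_rating(user, movie):
    """Return the explicit rating if it exists, otherwise None
    (to be predicted by the collaborative-filtering model)."""
    return ratings.get((user, movie))
```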
Within collaborative-filtering techniques, latent factor
approaches are very popular [25]. A well-known alternative to latent factor approaches is the neighborhood method. However, neighborhood methods are limited to like-minded users; latent factor approaches can therefore discover a wider range of recommendations that neighborhood methods overlook. The purpose of latent factor
approaches to recommender systems is to map both users
Table 2 SentiWordNet weights for the sentence "Love it or hate it, however someone tell me what on earth…"

Word     Family  SO       posSWN  negSWN
Love     N       -0.0824  1.375   0.0
It       -       -        -       -
Or       -       -        -       -
Hate     V       0.8399   0.0     0.75
It       -       -        -       -
However  R       -0.3415  0.5     0.5
Someone  N       -0.6594  0.0     0.0
Tell     V       -0.3956  0.875   0.625
Me       -       -        -       -
What     -       -        -       -
On       -       -        -       -
Earth    N       -0.4041  0.0     0.625
3 https://github.com/JohnLangford/vowpal_wabbit/wiki.
and movies onto the same latent factor space, representing these as vectors with k dimensions:

$$p_u = (u_1, u_2, \ldots, u_k) \qquad q_i = (i_1, i_2, \ldots, i_k), \qquad (8)$$

where $p_u$ is the user $u$ factors vector, $q_i$ is the movie $i$ factors vector, and $k$ is the number of latent factors (dimensions) in which each user $u$ and each movie $i$ are represented. With this latent factor representation of users and movies, we intend to achieve a rating prediction rule to assess user preferences for movies, by calculating the dot product of their respective latent factor vectors, as follows:

$$\hat{r}_{ui} = p_u \cdot q_i, \qquad (9)$$

where $\hat{r}_{ui}$ is the predicted rating of user $u$ for movie $i$.
6.2 Matrix factorization through singular value
decomposition
As previously mentioned, the first step to obtain such
representation is to discover the latent factor space under-
lying the user-movie ratings matrix. The most widely
adopted category of techniques to discover this latent factor
space is matrix factorization, mainly through Singular
Value Decomposition (SVD). SVD is a technique to decompose the users-products matrix R into a product of three matrices, $U$, $\Sigma$ and $V$. The matrix $U$ contains the left singular vectors, $\Sigma$ contains the singular values, and $V$ contains the right singular vectors of the original matrix R.
Due to the vast number of users and movies in most real recommender systems, even if the ratings matrix R were full, it would be computationally expensive to compute a full SVD. Thus, what is pursued is a low-rank approximation to the SVD. Such a low-rank approximation can be obtained by zeroing out the less relevant (lower) singular values contained in matrix $\Sigma$, preserving only the $k$ most relevant ones. Notice that the number $k$ determines the dimensionality of the pursued latent factor space.
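On a small, fully observed matrix, this truncation can be sketched with NumPy. This is an illustration of the rank-k idea only; as discussed next, real recommenders learn the factors directly from the known ratings rather than computing a full SVD.

```python
import numpy as np

def rank_k_approx(R, k):
    """Low-rank approximation: zero out all but the k largest
    singular values of R and recompose the matrix."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    s[k:] = 0.0                       # keep the k most relevant singular values
    return U @ np.diag(s) @ Vt

# Toy, fully observed "ratings" matrix, for illustration only.
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 5.0]])
R_k = rank_k_approx(R, k=2)           # best rank-2 approximation of R
```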
The application of SVD to recommender systems is motivated by the desire to decompose the ratings matrix into a 2-matrix representation $R = P \cdot Q^T$, as

$$\begin{bmatrix} r_{1,1} & \cdots & r_{1,n} \\ \vdots & \ddots & \vdots \\ r_{m,1} & \cdots & r_{m,n} \end{bmatrix} = \begin{bmatrix} u_{1,1} & \cdots & u_{1,k} \\ \vdots & \ddots & \vdots \\ u_{m,1} & \cdots & u_{m,k} \end{bmatrix} \cdot \begin{bmatrix} p_{1,1} & \cdots & p_{1,k} \\ \vdots & \ddots & \vdots \\ p_{n,1} & \cdots & p_{n,k} \end{bmatrix}^T \qquad (10)$$
Again, matrix R is the ratings matrix where each rui
value represents a rating given by user u to movie i,
expressed by a real value. Each vector (row) pu of P
represents a user u and each vector (row) qi of Q represents
a movie i, as in Eq. 9. Again, the goal of using matrix factorization in recommendation problems is to enable the assessment of user preferences for movies by calculating the dot product of their factor vector representations, as previously defined by Eq. 9.
As previously mentioned, the original SVD is designed to be performed over a complete matrix, decomposing it into a product of three matrices. Thus, the SVD technique must undergo some modifications to deal with a sparsely filled ratings matrix and decompose it into a 2-matrix representation. In that sense, Simon Funk4 suggested an efficient solution to learn the factorization model, which has been widely adopted by other researchers [25]: decompose the ratings matrix into a product of a user-factor matrix with a movie-factor matrix by taking into account the set of known ratings only. Thus, matrices P and Q are given by:

$$[P, Q] = \arg\min_{p_u, q_i} \sum_{r_{ui} \in R} \left( r_{ui} - p_u \cdot q_i^T \right)^2 + \lambda \cdot \left( \|p_u\|^2 + \|q_i\|^2 \right) \qquad (11)$$
This expression accomplishes two goals: matrix factorization by minimization and the corresponding regularization. The first part of Eq. 11 pursues the minimization of the difference (henceforth referred to as error) between the known ratings present in the original ratings matrix R and their decomposed representation in P and Q. The second part controls generality by avoiding overfitting during the learning process, where $\lambda$ is a constant defining the extent of regularization, usually chosen by cross-validation.
6.3 Rating biases
Although the latent factor vector inference largely captures rating tendencies, some improvements can be made to the model by defining baseline predictors. These allow the factor vectors to simply adjust the baseline predictions towards the real rating values instead of having to fully capture the rating patterns on their own. A straightforward
choice for a baseline predictor is the global average of the
observed ratings. In addition, it is useful to account for the
fact that some users tend to give higher ratings than others
and some movies tend to get higher ratings than others, as
well. Based on this premise, arrangements can be made to
capture these rating trends, regarded as user-related or
movie-related deviations from the average rating, hence-
forth referred to as user and movie biases. This reasoning
leads to a new model where, by considering the global
rating average and biases, the prediction rule can be
modified into
4 http://sifter.org/~simon/journal/20061211.html.
$$\hat{r}_{ui} = p_u \cdot q_i + \mu + b_u + b_i. \qquad (12)$$

In the new prediction rule, the parameters $\mu$, $b_u$, and $b_i$ represent the global rating average, the user bias, and the movie bias, respectively. Accordingly, the new least-squares problem to solve, which is an extension of the regularized Eq. 11, becomes:

$$[P, Q] = \arg\min_{p_u, q_i} \sum_{r_{ui} \in R} (r_{ui} - \hat{r}_{ui})^2 + \lambda \cdot \left( \|p_u\|^2 + \|q_i\|^2 + b_u^2 + b_i^2 \right) \qquad (13)$$
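A minimal SGD sketch of the biased factorization model of Eqs. (12) and (13) follows. The hyperparameters (number of factors k, learning rate, regularization λ) and the tiny rating set in the usage line are illustrative, not the settings used in the paper.

```python
import random

def train_biased_mf(ratings, n_users, n_items, k=2, lr=0.05, lam=0.02, epochs=300):
    """Learn P, Q and the biases of Eq. (13) by stochastic gradient
    descent over the known (user, item, rating) triples."""
    rng = random.Random(0)
    p = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    bu, bi = [0.0] * n_users, [0.0] * n_items
    mu = sum(r for _, _, r in ratings) / len(ratings)   # global average
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))
            e = r - pred                                # prediction error
            bu[u] += lr * (e - lam * bu[u])
            bi[i] += lr * (e - lam * bi[i])
            for f in range(k):
                pf, qf = p[u][f], q[i][f]
                p[u][f] += lr * (e * qf - lam * pf)
                q[i][f] += lr * (e * pf - lam * qf)

    def predict(u, i):
        """Eq. (12): mu + b_u + b_i + p_u . q_i."""
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    return predict

# Usage on a toy 2-user, 2-movie rating set (illustrative values).
predict = train_biased_mf([(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 1, 2.0)], 2, 2)
```

Note how each observed rating updates only the corresponding user and item parameters, which is what keeps the learning step cheap on a sparse R.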
7 Merging ratings and unrated comments
So far, the algorithm assumes the existence of a ratings
matrix R containing all user-movie ratings. This matrix is
by nature highly incomplete given the large number of
movies and users and the limited number of rated movies
per user. On average, a user may rate 30 movies from a set
of 2 million movies, and the remaining ratings are
unknown. Thus, the above ratings matrix can be made
more complete by adding ratings inferred by the sentiment
analysis of user comments. Figure 6 illustrates the process
described in this section.
Consider now a set $G = \{ra_1, \ldots, ra_n\}$ of explicit ratings assigned by users, and a set $F = \{re_1, \ldots, re_m\}$ of unrated comments written by users. Applying the sentiment analysis function $\Phi(\cdot)$, described in Eq. 7, to convert comments into probabilistic ratings, we define the $\Gamma$ function to quantize this value into a rating value, i.e.,

$$ra_i = \Gamma(\Phi(re_i)). \qquad (14)$$

This puts all ratings in the same format, $ra_i \in \{1, 2, 3, 4, 5\}$, and the union of both sets creates a new set $R = G \cup \Gamma(\Phi(F))$, which is easily represented as a ratings matrix with both explicit ratings and inferred ratings. The inclusion of the ratings inferred by the algorithm does not take into account the nature of the binary classifier. Thus, the final step is to filter out the less accurate inferred ratings.
Since the classifier discriminates between positive and negative reviews, inferred ratings with a probability around 0.5 have a higher uncertainty associated with them. Following this reasoning, the $\Gamma$ function filters out the unwanted ratings by imposing a threshold around the probability 0.5, ignoring these ratings before translating the classifier's output into inferred ratings. Formally, the $\Gamma$ function is expressed as

$$ra_i = \Gamma(\Phi(re_i)) = \begin{cases} \emptyset, & 0.5 - th < \Phi(re_i) < 0.5 + th \\ \mathrm{round}(4 \cdot \Phi(re_i) + 1), & \text{otherwise} \end{cases} \qquad (15)$$
where the threshold $th$ is used to discard ratings whose probabilities are close to 0.5; all other values are converted into an integer rating. For example, with th = 0.2, ratings in the interval $0.3 < \Phi(\cdot) < 0.7$ are discarded. This allows the recommendation algorithm to accept only ratings in which the confidence is greater. In the experimental evaluation, we further assess the influence of this threshold.
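The quantization of Eq. (15) can be written directly as below; a straightforward sketch in which `None` plays the role of the empty set, i.e., a discarded rating.

```python
def gamma(phi, th=0.2):
    """Eq. (15): discard classifier confidences near 0.5 and map the
    remaining ones onto the 1-5 star scale."""
    if 0.5 - th < phi < 0.5 + th:
        return None                     # too ambiguous: rating discarded
    return round(4 * phi + 1)

# e.g. gamma(0.95) -> 5, gamma(0.1) -> 1, gamma(0.6) -> None (inside 0.3-0.7)
```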
8 Experimental evaluation
8.1 Datasets
To evaluate the implemented recommendation framework,
a large-scale dataset with both reviews and ratings for
multiple users and media was required. Such a large dataset
is only available on large production sites such as IMDb,5 Amazon, and other VoD services. Three datasets were chosen to perform the evaluation: (1) the polarity dataset [32], widely used in sentiment analysis, containing 2,000 movie reviews from IMDb,6 (2) 698,210 movie and music reviews from Amazon [21], and finally, (3) the dataset used by Jakob et al. [19], which contains 53,112 movie reviews from IMDb, for comparison purposes. More specifically, we
have:
• Polarity: This dataset is used to validate the sentiment
analysis algorithm: it is evenly split into positive and
negative categories. We used 1,400 training and 600 test reviews, with positive and negative reviews equally divided.
• Amazon movies and music: This large-scale dataset is
used to blend sentiment analysis knowledge into the full recommender framework. This dataset7 was compiled
by [21] and includes reviews that are rated with 1–5
rating stars.
• IMDb-TSA09: This dataset was released by Jakob et al.
[19]. This data covers 2,731 movies and 509 users. The
reviews are rated with 1–10 rating stars. We have
chosen this dataset for comparison purposes.
Unlike the polarity dataset, the Amazon and IMDb-TSA09 datasets do not offer an equally distributed number of ratings across the scale (1-5, or positive vs. negative). Considering what motivates users to offer their insights about a movie, this lack of proportionality between positive and negative reviews is foreseeable. Since Amazon reviews are tied to purchases, we may intuitively say that the odds of a user acquiring a movie that displeases them are smaller than the odds of being pleased with the purchase, while
5 http://www.imdb.com.
6 http://www.cs.cornell.edu/people/pabo/movie-review-data.
7 http://131.193.40.52/data/.
regarding the movie domain, this notion is less intuitive. Table 3 presents the dataset details. Concerning the
positive and negative ratings range, we have followed the reasoning of Pang et al. [31] and others [5, 28, 34]: we trained and evaluated the sentiment analysis algorithm on the Amazon dataset and considered reviews with ratings of 4 or 5 as positive, otherwise negative. In addition, in the IMDb-TSA09 dataset, we considered reviews with ratings above 6 as positive, otherwise negative.
8.2 Evaluation metrics
The evaluation of the sentiment analysis algorithm is given
by the standard evaluation measures of precision, recall
and F-score, which is the harmonic mean between preci-
sion and recall,
$$Prec = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FP_i)} \qquad Rec = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FN_i)} \qquad (16)$$

$$F\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \qquad (17)$$
where TP, the true positives, corresponds to all correctly
classified reviews as belonging to the class; TN, true
negative, to all the correctly classified as not belonging to
the class; FP, false positive, to all reviews misclassified as belonging to the class; and FN, false negative, to all reviews misclassified as belonging to another class.
To evaluate the RS framework, the statistical measure
root mean square error (RMSE) is applied,

$$RMSE = \sqrt{\frac{\sum_{r_{ui} \in R} (r_{ui} - \hat{r}_{ui})^2}{|R|}}, \qquad (18)$$

where R represents the set of ratings, $r_{ui}$ represents the rating given by user u to movie i, and $\hat{r}_{ui}$ represents the rating predicted by the RS algorithm. Smaller values of RMSE indicate a more accurate performance.
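The measures of Eqs. (16)-(18) translate directly into code; the sketch below takes per-class TP/FP/FN counts and rating lists as plain Python lists, with toy inputs for illustration.

```python
import math

def precision(tp, fp):
    """Eq. (16), left: micro-averaged precision over per-class counts."""
    return sum(tp) / (sum(tp) + sum(fp))

def recall(tp, fn):
    """Eq. (16), right: micro-averaged recall over per-class counts."""
    return sum(tp) / (sum(tp) + sum(fn))

def f_score(prec, rec):
    """Eq. (17): harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec)

def rmse(actual, predicted):
    """Eq. (18): root mean square error between true and predicted ratings."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(actual, predicted)) / len(actual))
```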
9 Experiment: SO reference words
Different reference words (see Table 1) were used to
compute the influence of the Semantic Orientation on the
sentiment analysis algorithm. Figure 7 quantifies this
effect. The standard Semantic Orientation, as proposed by Turney [38], reached an F-score of 74 % (the word references are "poor"/"excellent"). The Domain-Specific (DS) word references combined with Turney's proposal (T) improve the F-score by 4 % ("excellent movie"/"poor movie").
Thus, a careful selection of Domain Specific reference
words can lead to an improved classifier: context sensitive
words should be combined with strong negative or positive
words (‘‘excellent’’/‘‘poor’’).
10 Experiment: recommendation framework
To evaluate the influence of the comments analysis over
the RS algorithm, the datasets were randomly split into four disjoint subsets. The role of each subset is detailed in Table 4. These splits allow an unbiased evaluation with
new data for each training and test step.
10.1 Comments and ratings based recommendations
(Amazon dataset)
In order to integrate the inferred ratings in the recom-
mendation system, it is required to examine the SA ratings
confusion-matrix. The Amazon subset #1 was used to train
the algorithm and the subset #2 to test it.
Table 5 presents the ratings confusion-matrix between the predicted ratings and the actual ratings. The values in
bold in the diagonal correspond to the correctly classified
reviews. If the comments analysis part were completely
accurate, only the diagonal would be active. The right-most
column and the bottom row present the total number of
ratings for a given level. For example, there are 95,248
ratings of 5-star and the algorithm predicted 39,658 ratings
of 5-star. We aim at distinguishing between similar ratings,
which is a more challenging task than a binary classifica-
tion task (positive vs. negative) [35]. In addition, the users’
Table 3 Detailed information the datasets
Dataset #Reviews #Users #Movies
Polarity [32] 2,000 – –
IMDb-TSA09 [19] 53,112 509 2,731
Amazon [21] 698,210 3,700 8,018
Fig. 6 Merging ratings and comments can improve recommendation performance (users' original ratings are complemented with ratings computed by the sentiment analysis algorithm, yielding review-enhanced ratings and new recommendations)
reasoning when providing a rating, and the associated review, differs. Some users can prove to be more demanding, or more generous, than others. Nonetheless, we aim at feeding the inferred ratings into a recommendation algorithm in which ratings are re-adjusted through user and product biases ($b_u$ and $b_i$ in Eq. 12).
Inspecting the confusion-matrix (Table 5), one can see
how the number of reviews incorrectly classified as
belonging to other ratings are distributed. A more conve-
nient visualization of the confusion-matrix is offered by
Fig. 8. It can be observed that incorrectly predicted ratings
are usually in the neighboring ratings. This is also justified
by the nature of the data since users may write a review
with a rating of 4 while others not as demanding write a
similar review with a rating of 5 [33]. In addition, one can observe that the data is unbalanced: most of the ratings are 4 or 5, which implies a small number of low ratings. This explains why the confusion is greater among the low ratings.
Once the reviews are rated by the analysis algorithm, we
proceeded to the recommendation algorithm evaluation. In
this setting, we have three situations corresponding to the
three principal experiments:
1. RS Lower bound (LB): the system was trained on a
minimum set of ratings corresponding to subset #3 (see
Table 4). This establishes the lower bound on the
error.
2. RS Upper bound (UB): the system was trained on the
maximum number of ratings corresponding to the
union of the subsets #2 and #3 (see Table 4). This
establishes an upper bound on the error.
3. RS + OM (OM): the system was then trained on
explicit ratings (subset #3) and ratings inferred from
unrated reviews (subset #2). In this experiment, all
ratings of subset #2 are withheld.
The summary presented in Fig. 9 carries a strong message: comments analysis can indeed improve recommendations by examining, in addition to the ratings provided by the users, the reviews that they wrote. The RMSE obtained with OM brings to light a surprising result concerning the replacement of the explicit ratings (Fig. 9) by the inferred ratings. When using just explicit ratings for
the LB and UB, the RMSE was 1.0092 and 0.9963,
respectively. However, with the inferred ratings (OM), we
obtained a lower RMSE of 0.9845. Hence, we argue that
the inferred ratings (OM) can better accommodate the
uncertainty regarding the explicit rating assigned by users.
This is explained by the fact that some ratings are strongly
biased by users and a review provides a more complete
opinion. For example, users’ reviews that focus on
answering other reviews or unrelated information about the
movie (actors previous performances). Moreover, since the
inferred ratings are produced by a single algorithm, its bias
is unique, thus easier to be compensated by the recom-
mendation framework.
Figure 10 provides a detailed view of how the threshold value influences the recommendation quality (RMSE).
The upper bound (UB) and lower bound (LB) correspond
to the RS classifications where the OM has no influence.
The RS + OM curve includes the ratings from subset #3
and the ones inferred from subset #2 reviews (Table 4).
When OM inferred ratings are added to the set of LB rat-
ings, we see that the recommendation framework can
indeed improve the overall RMSE.
For a threshold th = 0.0, all inferred ratings are used by
the recommendation system; for a threshold th = 0.5, only
the inferred ratings with probabilities 0.0 and 1.0 are used.
As the threshold th increases, ratings with probabilities
near 0.5 are ignored (they are ambiguously positive or
negative). Thus, the higher the threshold, the fewer inferred
ratings are considered (# of OM ratings curve).
The RS + OM curve illustrates how the analysis of
unrated comments can indeed improve the RMSE of the
computed recommendations. As we exclude ratings closer
to probability 0.0 and 1.0, the RMSE increases until it
reaches its worst value for th = 0.5. This corresponds to
considering 1-star and 5-star inferred ratings. We believe
that this RMSE value is due to the high amount of 5-stars
spam reviews and to the wrath of some users when writing
Fig. 7 F-score on the polarity dataset
Table 4 Data splits for the recommendation evaluation

Split  #Reviews (Amazon)  #Reviews (IMDb-TSA09)  Description
#1     184,996            23,599                 Train comments analysis
#2     182,651            23,601                 Test comments analysis/train RS
#3     236,450            #1                     Train RS combined with #2
#4     94,113             5,912                  Test RS
1-star reviews. To better examine this behavior of the RS + OM curve, Fig. 11 presents an insightful look into the performance of the comments analysis algorithm.
Precision is quite high for 5-star ratings but it is extremely
low for the other rating levels—this is critical because the
recommendation algorithm needs both low and high rating
values. Recall is below 30 % for 1- and 2-star ratings and
above 30 % for 3-, 4- and 5-star ratings. These recall
values generate a small set of 1- and 2-star ratings. Note
that precision and recall measure the exact match between
the actual ratings and the inferred ratings. However, for the
recommendation algorithm what is most important is the
average error between the actual rating and the inferred
rating. In other words, we need to consider the mean
absolute error of each predicted rating.
Figure 11 illustrates the MAE curve (mean absolute
error) between the predicted ratings and the true ratings.
One can see that for 3- and 4-star ratings, the average
distance between the predicted and the true rating is less
than 1. Thus, this graph shows that noisier data is con-
centrated on ratings with 1-star and 5-stars, which clarifies
the RS + OM curve behavior.
10.2 Comments and ratings based recommendations
(IMDb-TSA09)
In this section, we compare our proposal to the approach of Jakob et al. [19]. While their approach can explore more media-related information, such as genres and actors, it does not consider unrated comments. The dataset used in this section (IMDb-TSA09) corresponds to the dataset used in
[19]. In this experiment, we trained the sentiment analysis
algorithm with the split #1 and inferred the sentiment
analysis ratings (OM) on split #2 (see Table 4). For the
recommendation algorithm, the split #1 was used to train
individually, and combined with the inferred ratings from
split #2. The split #4 was used to evaluate the recom-
mendation algorithm. In the first experiment (Fig. 12), our baseline performed better (RMSE = 1.819) than Jakob et al.'s (RMSE = 1.853) when considering the full set of ratings. In fact, our approach was slightly better than Jakob et al.'s when their approach included sentiment analysis (RMSE = 1.823) and we only had explicit ratings (RMSE = 1.819). Thus, one would expect that by including unrated reviews, our approach would increase this difference, since Jakob et al.'s approach is not designed to handle unrated reviews.
A second experiment (Fig. 13) was conducted on this
dataset to confirm the influence of the sentiment analysis
on the recommendation framework performance. We used
50 % of explicit ratings and 50 % of inferred ratings to achieve RMSE = 1.886, which is slightly better than just using 50 % of explicit ratings (RMSE = 1.896). Since this
dataset has a small set of reviews for training, we believe
that a finer-grain classifier or more training data can further
increase this gap. These experiments show how the pro-
posed approach compares to existing ones: despite being
competitive, it can also extract extra information from the
text reviews to infer unknown ratings, which makes it
applicable to a wider range of scenarios.
Table 5 Ratings confusion-matrix (rows: true rating values; columns: predicted rating values; right-most column: row totals)

       1       2       3       4       5       Total
1      1,723   1,497   2,205   994     32      6,451
2      2,000   2,086   2,946   1,468   98      8,598
3      2,931   3,806   7,494   6,178   1,173   21,582
4      2,369   4,253   13,410  21,848  8,892   50,772
5      1,695   3,892   16,447  43,751  29,463  95,248
Total  10,718  15,534  42,502  74,239  39,658
Fig. 8 Normalized predicted ratings distribution
Fig. 9 RMSE for LB, UB and OM blend with RS
11 Conclusion
In this paper, we proposed a recommendation framework
where ratings and unrated comments from media con-
sumers are blended to improve recommendations accuracy.
This is an ideal application for a cable operator wishing to implement a system that considers users' complaints and praises as measures of user satisfaction. The evaluation
with real user data illustrates the importance of revising
users’ explicit ratings with text analysis techniques.
The described evaluation shows that by applying senti-
ment analysis techniques to the unrated user reviews, we
can compute more accurate recommendations than by just
using explicit ratings. Following the presented evaluation,
we point out the following observations:
• The proposed recommendation framework can successfully integrate unrated reviews with ratings to
improve ratings-only recommendations. This has a
direct applicability to Social-TV environments where
users rate some of their consumed media or discuss the
media in online forums without rating it.
• The recommendation framework can be applied to filter
spam reviews or to add a review-based bias. Since all
reviews are analyzed by a common algorithm, consis-
tency is guaranteed across all inferred ratings.
Fig. 10 RMSE of the recommendations versus the opinion mining output. As the threshold increases, fewer $\Phi(re_i)$ ratings are included
Fig. 11 Sentiment analysis precision and recall per rating. The MAE
measure indicates the average distance to the true rating
Fig. 12 RMSE when only ratings are used
Fig. 13 RMSE improvement when sentiment analysis is added
• Reviews with extreme ratings (1-star or 5-star) will
harm the recommender system performance. We
observed that ratings with very high confidence are
usually the source of undesired biases. A careful processing of the inferred ratings is required.
As far as we are aware, only a few authors have tackled
a similar problem [1, 19, 26, 28, 47]. However, none of the cited studies integrates unrated reviews in its framework.
For example, Jakob et al. [19] use explicit ratings and
enhance these existing ratings with opinion mining and
other techniques. Despite this difference, we compared the
proposed approach to Jakob’s et al. approach. This evalu-
ation illustrated how the recommendation framework is
competitive to similar approaches and how it can tackle
different scenarios.
As future work, we would like to improve the rating predictions by identifying reviews that are spam and by detecting particular aspects of the reviewed product. We
believe that other sentiment analysis algorithms can pro-
vide a finer-grain analysis of the user opinion and conse-
quently improve the overall recommendation performance.
Acknowledgments The authors are grateful to the authors of [19], who kindly provided us with their IMDb dataset. This
work has been funded by the Portuguese Foundation for Science and
Technology under project references UTA-Est/MAI/0010/2009 and
PEst-OE/EEI/UI0527/2011, Centro de Informática e Tecnologias da Informação (CITI/FCT/UNL), 2011-2012.
References
1. Aciar, S., et al.: Informed recommender: basing recommenda-
tions on consumer product reviews. IEEE Intell. Syst. 22(3),
39–47 (2007)
2. Adomavicius, G., Tuzhilin, A.: Toward the next generation of
recommender systems: a survey of the state-of-the-art and pos-
sible extensions. IEEE Trans. Knowl. Data Eng. 17(6), 734–749
(2005)
3. Baccianella, S., et al.: SentiWordNet 3.0: an enhanced lexical
resource for sentiment analysis and opinion mining. In: Pro-
ceedings of the Seventh Conference on International Language
Resources and Evaluation (LREC’10) (2010)
4. Baudisch, P.: Recommending TV programs: how far can we get
at zero user effort? AAAI Workshop on Recommender Systems
(1998)
5. Bespalov, D., et al.: Sentiment classification based on supervised
latent n-gram analysis. Building, 375–382 (2011)
6. Bollen, J.: Determining the public mood state by analysis of
microblogging posts. Alife XII Conf. MIT Press (2010)
7. Brown, B., Barkhuus, L.: The television will be revolutionized:
effects of PVRs and filesharing on television watching. ACM
SIGCHI Conference on Human Factors in Computing Systems.
ACM (2006)
8. Das, S., Chen, M.: Yahoo! for Amazon: sentiment parsing from
small talk on the Web. EFA 2001 Barcelona Meetings (2001)
9. Denecke, K.: Are SentiWordNet scores suited for multi-domain
sentiment classification? In: Proceedings of ICDIM’2009, 33–38
(2009)
10. Diakopoulos, N.A., Shamma, D.A.: Characterizing debate per-
formance via aggregated twitter sentiment. In: Proceedings of the
28th International Conference on Human Factors in Computing
Systems (2010)
11. Ding, X., et al.: A holistic lexicon-based approach to opinion
mining. In: Proceedings of the International Conference on Web
Search and Web Data Mining, pp. 231–240 (2008)
12. Esuli, A., Sebastiani, F.: Determining the semantic orientation of
terms through gloss classification. In: Proceedings of the 14th
ACM International Conference on Information and Knowledge
Management CIKM 05, 617 (2005)
13. Esuli, A., Sebastiani, F.: Sentiwordnet: a publicly available lex-
ical resource for opinion mining. In: Proceedings of the 5th
Conference on Language Resources and Evaluation (LREC’06).
Citeseer (2006)
14. Ferman, A.M., et al.: Multimedia content recommendation engine
with automatic inference of user preferences. In: IEEE Interna-
tional Conference on Image Processing (2003)
15. Harboe, G., et al.: Ambient social TV: drawing people into a
shared experience. In: ACM SIGCHI Conference on Human
Factors in Computing Systems. ACM (2008)
16. Haythornthwaite, C.: The strength and the impact of new media.
In: Proceedings of the 34th Annual Hawaii International Con-
ference on System Sciences (HICSS-34)-Volume 1-Volume 1.
IEEE Computer Society (2001)
17. Heerschop, B., et al.: Polarity analysis of texts using discourse
structure. In: Proceedings of the 20th ACM International Con-
ference on Information and Knowledge Management CIKM 11,
1061 (2011)
18. Hu, M., Liu, B.: Mining and summarizing customer reviews. In:
Proceedings of the Tenth ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining. ACM (2004)
19. Jakob, N., et al.: Beyond the stars: exploiting free-text user
reviews to improve the accuracy of movie recommendations. In:
Proceeding of the 1st International CIKM Workshop on TOPIC-
Sentiment Analysis for Mass Opinion (TSA), pp. 57–64 (2009)
20. Jenkins, H.: Convergence Culture—Where Old and New Collide.
NYU Press, New York (2006)
21. Jindal, N., Liu, B.: Opinion spam and analysis. In: WSDM’08
Proceedings of the International Conference on Web Search and
Web Data Mining, pp. 219–230 (2008)
22. Kim, S.-M., Hovy, E.: Determining the sentiment of opinions. In:
Proceedings of the 20th International Conference on Computa-
tional Linguistics COLING 04, 1367-es (2004)
23. Koren, Y.: Collaborative filtering with temporal dynamics. In:
Proceedings of the 15th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pp. 447–456 (2009)
24. Koren, Y.: Factorization meets the neighborhood: a multifaceted
collaborative filtering model. In: Proceeding of the 14th ACM
SIGKDD International Conference on Knowledge Discovery and
Data Mining, pp. 426–434 (2008)
25. Koren, Y., et al.: Matrix factorization techniques for recom-
mender systems. IEEE Comput. 42(8), 30–37 (2009)
26. Leung, C.W.K., et al.: Integrating collaborative filtering and
sentiment analysis: a rating inference approach. In: Proceedings
of the ECAI 2006 Workshop on Recommender Systems,
pp. 62–66 (2006)
27. Liu, B.: Sentiment analysis and subjectivity. Handbook of Nat-
ural Language Processing. (2010), 978-1420085921
28. Moshfeghi, Y., et al.: Handling data sparsity in collaborative
filtering using emotion and semantic based features. In: Pro-
ceedings of the 34th International ACM SIGIR Conference on
Research and Development in Information—SIGIR’11 (New
York, NY, USA, Jul. 2011), 625 (2011)
29. Ohana, B., Tierney, B.: Sentiment classification of reviews using
SentiWordNet. In: 9th. IT & T Conference (2009)
30. Oliveira, E. et al.: Ifelt: accessing movies through our emotions.
In; Proceedings of the 9th International Interactive Conference on
Interactive Television—EuroITV’11 (New York, NY, USA, Jun.
2011), 105 (2011)
31. Pang, B., et al.: Thumbs up? Sentiment classification using
machine learning techniques. In: Proceedings of the ACL-02
Conference on Empirical Methods in Natural Language Pro-
cessing-Volume 10, 79–86 (2002)
32. Pang, B., Lee, L.: A sentimental education: sentiment analysis
using subjectivity summarization based on minimum cuts. In:
Proceedings of the Association for Computational Linguistics
(ACL) (2004)
33. Pang, B., Lee, L.: Seeing stars: exploiting class relationships for
sentiment categorization with respect to rating scales. In: Pro-
ceedings of the 43rd Annual Meeting of the Association for
Computational Linguistics (ACL), pp. 115–124 (2005)
34. Qu, L., et al.: The bag-of-opinions method for review rating
prediction from sparse text patterns. In: Proceedings of the 23rd
International Conference on Computational Linguistics
(COLING '10), pp. 913–921 (2010)
35. Sparling, E.I.: Rating: how difficult is it? In: Proceedings of the
5th ACM Conference on Recommender Systems (RecSys '11),
pp. 149–156 (2011)
36. Takama, Y., Muto, Y.: Profile generation from TV watching
behavior using sentiment analysis. In: Proceedings of the 2007
IEEE/WIC/ACM International Conferences on Web Intelligence
and Intelligent Agent Technology—Workshops. IEEE Computer
Society (2007)
37. Takamura, H., et al.: Extracting semantic orientations of words
using spin model. In: Proceedings of the 43rd Annual Meeting of
the Association for Computational Linguistics (ACL '05),
pp. 133–140 (2005)
38. Turney, P.: Thumbs up or thumbs down? Semantic orientation
applied to unsupervised classification of reviews. In: Proceedings
of the 40th Annual Meeting on Association for Computational
Linguistics (2002)
39. Turney, P.D., Littman, M.L.: Measuring praise and criticism:
inference of semantic orientation from association. ACM Trans.
Inf. Syst. 21(4), 315–346 (2003)
40. Turney, P.D., Littman, M.L.: Unsupervised learning of semantic
orientation from a hundred-billion-word corpus. Technical
Report ERB-1094, National Research Council Canada (2002)
41. Uchyigit, G., Clark, K.: Personalised multi-modal electronic
program guide. In: European Conference on Interactive Televi-
sion: from Viewers to Actors? (2003)
42. Vildjiounaite, E., Kyllonen, V.: Unobtrusive dynamic modelling
of TV program preferences in a household. In: Changing Tele-
vision Environments (EuroITV 2008). Springer (2008)
43. Xu, J., Zhang, L.: The development and prospect of personalized
TV program recommendation systems. In: IEEE International
Symposium on Multimedia Software Engineering (2002)
44. Yi, J., et al.: Sentiment analyzer: extracting sentiments about a
given topic using natural language processing techniques. In:
IEEE International Conference on Data Mining (ICDM),
pp. 427–434 (2003)
45. Yuan, G.X., et al.: Recent advances of large-scale linear classi-
fication. Computer 3, 1–15 (2011)
46. Zaletelj, J., et al.: Real-time viewer feedback in the iTV pro-
duction. In: Proceedings of the European Conference on Inter-
active Television and Video. ACM (2009)
47. Zhang, W., et al.: Augmenting online video recommendations by
fusing review sentiment classification. In: 2010 IEEE Interna-
tional Conference on Data Mining Workshops (ICDMW),
pp. 1143–1150 (2010)