Page 1: Zizka synasc 2012

+ Grouping Customer Opinions Written in Natural Language Using Unsupervised Machine Learning

František Dařena, Jan Žižka, Karel Burda

Department of Informatics, Faculty of Business and Economics, Mendel University in Brno, Czech Republic

Page 2: Zizka synasc 2012

+ Introduction

Many companies collect opinions expressed by their customers

These opinions can hide valuable knowledge

Discovering such knowledge manually can be a very demanding task because:

the opinion database can be very large,

the customers can use different languages,

people can handle the opinions subjectively,

sometimes additional resources (like lists of positive and negative words) might be needed.

Page 3: Zizka synasc 2012

+ Introduction

Our previous research focused on analyzing what was significant for placing a certain opinion into one of the categories, such as satisfied or dissatisfied customers

However, this requires having the reviews separated into classes sharing a common opinion/sentiment

Page 4: Zizka synasc 2012

+ Introduction

Clustering, the most common form of unsupervised learning, enables automatic grouping of unlabeled documents into subsets called clusters

In the previous research, we analyzed how well a computer can separate the classes expressing a certain opinion, and we searched for a clustering algorithm with its best parameters: similarity measure, clustering-criterion function, word representation, and the role of stemming for the given data

Page 5: Zizka synasc 2012

+ Objective

The clustering process is naturally not error-free, so some reviews labeled as positive appear in a cluster containing mostly negative reviews, and vice versa

The objective was to analyze why certain reviews were assigned “wrongly” to a group containing mostly reviews from a different class, in order to improve the results of classification and prediction

Page 6: Zizka synasc 2012

+ Data description

Processed data included reviews of hotel clients collected from publicly available sources

The reviews were labeled as positive or negative

Reviews characteristics:

more than 5,000,000 reviews

written in more than 25 natural languages

written only by real customers, based on their experience

written relatively carefully, but still containing errors typical of natural language

Page 7: Zizka synasc 2012

+ Properties of data used for experiments

The subset (marked as written in English) used in our experiments contained almost two million opinions

Review category        Positive     Negative
Number of reviews      1,190,949    741,092
Maximal review length  391 words    396 words
Average review length  21.67 words  25.73 words
Variance               403.34       618.47

Page 8: Zizka synasc 2012

+ Review examples

Positive:

The breakfast and the very clean rooms stood out as the best features of this hotel.

Clean and moden, the great loation near station. Friendly reception!

The rooms are new. The breakfast is also great. We had a really nice stay.

Good location - very quiet and good breakfast.

Negative:

High price charged for internet access which actual cost now is extreamly low.

water in the shower did not flow away

The room was noisy and the room temperature was higher than normal.

The air conditioning wasn't working

Page 9: Zizka synasc 2012

+ Data preparation

Data collection, cleaning (removing tags, non-letter characters), converting to upper-case

Removing stopwords and words shorter than 3 characters

Spell checking, diacritics removal, etc. were not carried out

Creating three smaller subsets containing positive and negative reviews in the following proportions:

about 1,000 positive and 1,000 negative (small)

about 50,000 positive and 50,000 negative (medium)

about 250,000 positive and 250,000 negative (large)
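The cleaning steps above can be sketched as a small Python function (an illustrative sketch; the stopword list and regular expressions are assumptions, not the exact ones used in the study):

```python
import re

# Illustrative stopword subset; the study's actual stopword list is not given
STOPWORDS = {"the", "and", "was", "a", "in", "to", "of"}

def prepare(review):
    """Apply the cleaning steps from this slide: remove tags and
    non-letter characters, convert to upper-case, and drop stopwords
    and words shorter than 3 characters (no spell checking or
    diacritics removal, as in the original experiments)."""
    text = re.sub(r"<[^>]+>", " ", review)      # remove tags
    text = re.sub(r"[^A-Za-z ]+", " ", text)    # keep letters only
    words = text.upper().split()                # upper-case, tokenize
    return [w for w in words
            if w.lower() not in STOPWORDS and len(w) >= 3]

# prepare("The room was <b>noisy</b>!") -> ['ROOM', 'NOISY']
```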

Page 10: Zizka synasc 2012

+ Experimental steps

Transformation of the data into the vector representation (bag-of-words model, tf-idf weighting schema)

Clustering with Cluto* with the following parameters:

similarity function – cosine similarity,

clustering method – k-means (Cluto’s variation),

criterion function optimized during the clustering process – H2

Weighted entropy of the results varied from about 0.58 to 0.60 (e.g., for the small set of reviews, the entropy was 0.587 and the accuracy 0.859)

* Free software providing different clustering methods working with several clustering criterion functions and similarity measures, suitable for operating on very large datasets.
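The two experimental steps can be sketched in Python. The study used Cluto with the H2 criterion; this sketch substitutes scikit-learn (an assumption, not the original setup), so details such as the criterion function differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

reviews = [  # toy stand-in for the hotel-review corpus
    "The rooms are new. The breakfast is also great.",
    "The room was noisy and the air conditioning wasn't working.",
]

# Bag-of-words model with tf-idf weighting
vectors = TfidfVectorizer().fit_transform(reviews)

# L2-normalizing the rows makes Euclidean k-means behave like
# spherical (cosine-similarity) k-means
vectors = normalize(vectors)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
```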

Page 11: Zizka synasc 2012

+ Graphical representation of the results of clustering

False Positive (FP)  False Negative (FN)

True Positive (TP)  True Negative (TN)

Clustered Positive (CP)  Clustered Negative (CN)
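The six sets in the diagram can be derived from each review's original label and the majority sentiment of the cluster it landed in; a minimal sketch (the 'pos'/'neg' string encoding is an assumption for illustration):

```python
def split_by_outcome(labels, clusters):
    """Partition review indices into TP, FP, TN, FN, given each
    review's original label and the majority sentiment of its cluster
    (both encoded as 'pos' or 'neg').  CP = TP + FP, CN = TN + FN."""
    sets = {"TP": [], "FP": [], "TN": [], "FN": []}
    for i, (label, cluster) in enumerate(zip(labels, clusters)):
        if cluster == "pos":                       # review landed in CP
            sets["TP" if label == "pos" else "FP"].append(i)
        else:                                      # review landed in CN
            sets["TN" if label == "neg" else "FN"].append(i)
    return sets
```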

Page 12: Zizka synasc 2012

+ Analysis of incorrectly clustered reviews

When a review rPi, originally labeled as positive, is “wrongly” assigned to a cluster with mostly negative reviews (CN), we can assume that the properties of this review are more “similar” to the properties of the other reviews in CN, i.e., the words of rPi and their combinations are more similar to the words contained in the dictionary of CN

The similarity was related to the frequency of words of rPi in the subsets of the clustering solution (FN is compared to TN, TP, and CP; FP is compared to TP, TN, and CN)

Page 13: Zizka synasc 2012

+ Analysis of incorrectly clustered reviews

We introduce the importance of a word wi in a given set X:

iX(wi) = NX(wi) / NX

where NX(wi) is the frequency of word wi in set X and NX is the number of dictionary words in X
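The importance measure can be written as a few lines of Python (a sketch; tokenized reviews as lists of words are an assumed input format, and NX is taken as the total word count of X, consistent with the later example of 3 occurrences in 3,678 words giving 0.0008):

```python
from collections import Counter

def importance(word, docs):
    """i_X(w) = N_X(w) / N_X, where docs is set X as a list of
    tokenized reviews, N_X(w) is the frequency of the word in X,
    and N_X is the total word count of X."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return counts[word] / total if total else 0.0
```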

Page 14: Zizka synasc 2012

+ Analysis of incorrectly clustered

reviews

The importance of a word in one set should be similar to the importance of the same word in the most similar set, i.e., the importance of words in FN and TN should be more similar than, e.g., the importance of words in FN and TP

The lowest value among |iFP(wi) - iTP(wi)|, |iFP(wi) - iTN(wi)|, and |iFP(wi) - iCN(wi)| corresponds to the highest importance similarity with TP, TN, or CN

The same comparisons between FN and TN, TP, and CP were carried out
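The comparison described above reduces to an argmin over absolute importance differences; a minimal self-contained sketch:

```python
def most_similar_set(i_ref, candidates):
    """Return the label of the set whose importance for a word is
    closest to i_ref, the word's importance in the reference set
    (FP or FN); i.e. the argmin of |i_ref - i_X(w)| over the sets."""
    return min(candidates, key=lambda name: abs(i_ref - candidates[name]))

# With the figures reported later for the word "excellent" (iFN = 0.0008):
# most_similar_set(0.0008, {"TN": 0.0007, "TP": 0.007, "CP": 0.006}) -> "TN"
```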

Page 15: Zizka synasc 2012

+ Importance of words from dictionary of False Positive set compared to the other sets

Page 16: Zizka synasc 2012

+ Importance of words from dictionary of False Negative set compared to the other sets

Page 17: Zizka synasc 2012

+ Results of the analysis

The words with higher frequencies included mostly words that could be considered positive (e.g., location, excellent, or friendly) or negative (e.g., small, noise, or poor) in terms of their contribution to the assignment of reviews to a “correct” category

These words, important from the correct-classification viewpoint, often have their most similar importance in a different set than one would expect; e.g., some words in reviews from FN bearing a strong positive sentiment had their importance most similar to their importance in TN, not in TP or CP

Page 18: Zizka synasc 2012

+ Example 1 – small data set

The strongly positive word excellent was used 3 times in FN (290 positive reviews, 3,678 words): iFN = 0.0008

This importance was most similar to the importance of the same word in TN (iTN = 0.0007), not in TP (iTP = 0.007) or CP (iCP = 0.006)

The review “Excellent bed making. Very good restaurant but an English language menu would be advantageous to non-german speaking visitors.”, containing the strongly positive word excellent, was categorized incorrectly

Page 19: Zizka synasc 2012

+ Example 2 – small data set

The positive word good (less strongly positive than excellent) had the importance iFN = 0.0114

This importance was most similar to the importance of the same word in CP (iCP = 0.0146), not in TP (iTP = 0.016) or TN (iTN = 0.0021)

Nevertheless, some reviews containing this positive word were assigned to a group with mostly negative reviews.

Page 20: Zizka synasc 2012

+ Results of the analysis

Both examples demonstrate that other document properties, i.e., the presence of the other words together with their importance, are significant. This is demonstrated in the table with importance similarities of words of an obviously positive review, containing the strongly positive word “good” twice, which was assigned incorrectly to CN.

Page 21: Zizka synasc 2012

+ Results of the analysis – importance vs. frequency

The analysis of the importance of words from the dictionary of FN showed that about 60% of the words had an importance similar to their importance in TN

However, the frequency of each of these words (number of occurrences in all reviews) was relatively low (many of them appeared just once)

These words with highly similar importance also often did not bear any sentiment, such as the words discounted, happening, or attitude

Page 22: Zizka synasc 2012

+ Conclusions

The study aimed at finding the actual reason for assigning some documents to a “wrong” class

The critical information is provided by certain significant words included in individual reviews

Words that the previous research found significant for opinion polarity did not act as misleading information, unlike words that were far less significant or quite insignificant

Specific words (or their combinations) can be filtered out as noise, improving cluster generation

