Page 1: Zizka synasc 2012

+ Grouping Customer Opinions Written in Natural Language Using Unsupervised Machine Learning

František Dařena, Jan Žižka, Karel Burda

Department of Informatics, Faculty of Business and Economics, Mendel University in Brno, Czech Republic

Page 2: Zizka synasc 2012

+ Introduction

Many companies collect opinions expressed by their customers

These opinions can hide valuable knowledge

Discovering such knowledge manually can be a very demanding task because:

the opinion database can be very large,

the customers can use different languages,

people can handle the opinions subjectively,

sometimes additional resources (like lists of positive and negative words) might be needed.

Page 3: Zizka synasc 2012

+ Introduction

Our previous research focused on analyzing what was significant for placing a certain opinion into one of the categories, such as satisfied or dissatisfied customers

However, this requires having the reviews separated into classes sharing a common opinion/sentiment

Page 4: Zizka synasc 2012

+ Introduction

Clustering, the most common form of unsupervised learning, enables automatic grouping of unlabeled documents into subsets called clusters

In the previous research, we analyzed how well a computer can separate the classes expressing a certain opinion, and we searched for a clustering algorithm with its best parameters: similarity measure, clustering-criterion function, word representation, and the role of stemming for the given data

Page 5: Zizka synasc 2012

+ Objective

The clustering process is naturally not error-free, so some reviews labeled as positive appear in a cluster containing mostly negative reviews, and vice versa

The objective was to analyze why certain reviews were assigned “wrongly” to a group containing mostly reviews from a different class, in order to improve the results of classification and prediction

Page 6: Zizka synasc 2012

+ Data description

Processed data included reviews of hotel clients collected from publicly available sources

The reviews were labeled as positive or negative

Reviews characteristics:

more than 5,000,000 reviews

written in more than 25 natural languages

written only by real customers, based on their experience

written relatively carefully, but still containing errors typical of natural language

Page 7: Zizka synasc 2012

+ Properties of data used for experiments

The subset (marked as written in English) used in our experiments contained almost two million opinions

Review category        Positive     Negative
Number of reviews      1,190,949    741,092
Maximal review length  391 words    396 words
Average review length  21.67 words  25.73 words
Variance               403.34       618.47

Page 8: Zizka synasc 2012

+ Review examples

Positive:

The breakfast and the very clean rooms stood out as the best features of this hotel.

Clean and moden, the great loation near station. Friendly reception!

The rooms are new. The breakfast is also great. We had a really nice stay.

Good location - very quiet and good breakfast.

Negative:

High price charged for internet access which actual cost now is extreamly low.

water in the shower did not flow away

The room was noisy and the room temperature was higher than normal.

The air conditioning wasn't working

Page 9: Zizka synasc 2012

+ Data preparation

Data collection, cleaning (removing tags, non-letter characters), converting to upper-case

Removing stopwords and words shorter than 3 characters

Spell checking, diacritics removal, etc. were not carried out

Creating three smaller subsets containing positive and negative reviews in the following proportions:

about 1,000 positive and 1,000 negative (small)

about 50,000 positive and 50,000 negative (medium)

about 250,000 positive and 250,000 negative (large)
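The cleaning steps above can be sketched as a small Python function (an illustrative sketch; the stopword list and regular expressions are assumptions, not the exact ones used in the study):

```python
import re

# Illustrative stopword subset; the study's actual stopword list is not given
STOPWORDS = {"the", "and", "was", "a", "in", "to", "of"}

def prepare(review):
    """Apply the cleaning steps from this slide: remove tags and
    non-letter characters, convert to upper-case, and drop stopwords
    and words shorter than 3 characters (no spell checking or
    diacritics removal, as in the original experiments)."""
    text = re.sub(r"<[^>]+>", " ", review)      # remove tags
    text = re.sub(r"[^A-Za-z ]+", " ", text)    # keep letters only
    words = text.upper().split()                # upper-case, tokenize
    return [w for w in words
            if w.lower() not in STOPWORDS and len(w) >= 3]

# prepare("The room was <b>noisy</b>!") -> ['ROOM', 'NOISY']
```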

Page 10: Zizka synasc 2012

+ Experimental steps

Transformation of the data into the vector representation (bag-of-words model, tf-idf weighting schema)

Clustering with Cluto* with the following parameters:

similarity function – cosine similarity,

clustering method – k-means (Cluto’s variation),

criterion function optimized during the clustering process – H2

Weighted entropy of the results varied from about 0.58 to 0.60 (e.g., for the small set of reviews, the entropy was 0.587 and the accuracy 0.859)

* Free software providing different clustering methods working with several clustering criterion functions and similarity measures, suitable for operating on very large datasets.
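The two experimental steps can be sketched in Python. The study used Cluto with the H2 criterion; this sketch substitutes scikit-learn (an assumption, not the original setup), so details such as the criterion function differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

reviews = [  # toy stand-in for the hotel-review corpus
    "The rooms are new. The breakfast is also great.",
    "The room was noisy and the air conditioning wasn't working.",
]

# Bag-of-words model with tf-idf weighting
vectors = TfidfVectorizer().fit_transform(reviews)

# L2-normalizing the rows makes Euclidean k-means behave like
# spherical (cosine-similarity) k-means
vectors = normalize(vectors)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
```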

Page 11: Zizka synasc 2012

+ Graphical representation of the results of clustering

False Positive (FP)  False Negative (FN)

True Positive (TP)  True Negative (TN)

Clustered Positive (CP)  Clustered Negative (CN)
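The six sets in the diagram can be derived from each review's original label and the majority sentiment of the cluster it landed in; a minimal sketch (the 'pos'/'neg' string encoding is an assumption for illustration):

```python
def split_by_outcome(labels, clusters):
    """Partition review indices into TP, FP, TN, FN, given each
    review's original label and the majority sentiment of its cluster
    (both encoded as 'pos' or 'neg').  CP = TP + FP, CN = TN + FN."""
    sets = {"TP": [], "FP": [], "TN": [], "FN": []}
    for i, (label, cluster) in enumerate(zip(labels, clusters)):
        if cluster == "pos":                       # review landed in CP
            sets["TP" if label == "pos" else "FP"].append(i)
        else:                                      # review landed in CN
            sets["TN" if label == "neg" else "FN"].append(i)
    return sets
```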

Page 12: Zizka synasc 2012

+ Analysis of incorrectly clustered reviews

When a review rPi, originally labeled as positive, is “wrongly” assigned to a cluster with mostly negative reviews (CN), we can assume that the properties of this review are more “similar” to the properties of the other reviews in CN, i.e., the words of rPi and their combinations are more similar to the words contained in the dictionary of CN

The similarity was related to the frequency of words of rPi in the subsets of the clustering solution (FN is compared to TN, TP, and CP; FP is compared to TP, TN, and CN)

Page 13: Zizka synasc 2012

+ Analysis of incorrectly clustered reviews

We introduce the importance of a word wi in a given set X:

iX(wi) = NX(wi) / NX

where NX(wi) is the frequency of word wi in set X and NX is the number of dictionary words in X
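The importance measure can be written as a few lines of Python (a sketch; tokenized reviews as lists of words are an assumed input format, and NX is taken as the total word count of X, consistent with the later example of 3 occurrences in 3,678 words giving 0.0008):

```python
from collections import Counter

def importance(word, docs):
    """i_X(w) = N_X(w) / N_X, where docs is set X as a list of
    tokenized reviews, N_X(w) is the frequency of the word in X,
    and N_X is the total word count of X."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return counts[word] / total if total else 0.0
```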

Page 14: Zizka synasc 2012

+ Analysis of incorrectly clustered

reviews

The importance of a word in one set should be similar to the importance of the same word in the most similar set, i.e., the importance of words in FN and TN should be more similar than, e.g., the importance of words in FN and TP

The lowest value among |iFP(wi) - iTP(wi)|, |iFP(wi) - iTN(wi)|, and |iFP(wi) - iCN(wi)| corresponds to the highest importance similarity with TP, TN, or CN

The same comparisons between FN and TN, TP, and CP were carried out
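The comparison described above reduces to an argmin over absolute importance differences; a minimal self-contained sketch:

```python
def most_similar_set(i_ref, candidates):
    """Return the label of the set whose importance for a word is
    closest to i_ref, the word's importance in the reference set
    (FP or FN); i.e. the argmin of |i_ref - i_X(w)| over the sets."""
    return min(candidates, key=lambda name: abs(i_ref - candidates[name]))

# With the figures reported later for the word "excellent" (iFN = 0.0008):
# most_similar_set(0.0008, {"TN": 0.0007, "TP": 0.007, "CP": 0.006}) -> "TN"
```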

Page 15: Zizka synasc 2012

+ Importance of words from dictionary of False Positive set compared to the other sets

Page 16: Zizka synasc 2012

+ Importance of words from dictionary of False Negative set compared to the other sets

Page 17: Zizka synasc 2012

+ Results of the analysis

The words with higher frequencies included mostly words that could be considered positive (e.g., location, excellent, or friendly) or negative (e.g., small, noise, or poor) in terms of their contribution to the assignment of reviews to a “correct” category

These words, important from the correct-classification viewpoint, often have their most similar importance in a different set than one would expect; e.g., some words in reviews from FN bearing a strong positive sentiment had their importance most similar to their importance in TN, not in TP or CP

Page 18: Zizka synasc 2012

+ Example 1 – small data set

The strongly positive word excellent was used 3 times in FN (290 positive reviews, 3,678 words): iFN = 0.0008

This importance was most similar to the importance of the same word in TN (iTN = 0.0007), not in TP (iTP = 0.007) or CP (iCP = 0.006)

The review “Excellent bed making. Very good restaurant but an English language menu would be advantageous to non-german speaking visitors.”, containing the strongly positive word excellent, was categorized incorrectly

Page 19: Zizka synasc 2012

+ Example 2 – small data set

The positive word good (less strongly positive than excellent) had the importance iFN = 0.0114

This importance was most similar to the importance of the same word in CP (iCP = 0.0146), not in TP (iTP = 0.016) or TN (iTN = 0.0021)

Nevertheless, some reviews containing this positive word were assigned to a group with mostly negative reviews.

Page 20: Zizka synasc 2012

+ Results of the analysis

Both examples demonstrate that other document properties, i.e., the presence of the other words together with their importance, are significant. This is demonstrated in the table with importance similarities of words of an obviously positive review, containing the strongly positive word “good” twice, which was assigned incorrectly to CN.

Page 21: Zizka synasc 2012

+ Results of the analysis – importance vs. frequency

The analysis of the importance of words from the dictionary of FN showed that about 60% of the words had an importance similar to their importance in TN

However, the frequency of each of these words (number of occurrences in all reviews) was relatively low (many of them appeared just once)

These words with highly similar importance also often did not bear any sentiment, such as the words discounted, happening, or attitude

Page 22: Zizka synasc 2012

+ Conclusions

The study aimed at finding the actual reason for assigning some documents to a “wrong” class

The critical information is provided by certain significant words included in individual reviews

Words that the previous research found significant for opinion polarity did not act as misleading information, unlike words that were far less significant or quite insignificant

Specific words (or their combinations) can be filtered out as noise, improving cluster generation

