+
Grouping Customer Opinions Written in
Natural Language Using Unsupervised
Machine Learning
František Dařena
Jan Žižka
Karel Burda
Department of Informatics, Faculty of Business and Economics, Mendel University in Brno, Czech Republic
+ Introduction
Many companies collect opinions expressed by
their customers
These opinions can hide valuable knowledge
Discovering such knowledge manually can be a very demanding task because:
the opinion database can be very large,
the customers can use different languages,
the people can handle the opinions subjectively,
sometimes additional resources (like lists of positive
and negative words) might be needed.
+ Introduction
Our previous research focused on analyzing what was significant for assigning a certain opinion to one of the categories, such as satisfied or dissatisfied customers
However, this requires the reviews to be separated into classes sharing a common opinion/sentiment
+ Introduction
Clustering as the most common form of unsupervised learning enables automatic grouping of unlabeled documents into subsets called clusters
In the previous research, we analyzed how well a computer can separate the classes expressing a certain opinion, and sought a clustering algorithm with its best parameters: similarity measure, clustering-criterion function, word representation, and the role of stemming for the given specific data
+ Objective
The clustering process is naturally not error-free: some reviews labeled as positive appear in a cluster containing mostly negative reviews, and vice versa
The objective was to analyze why certain reviews were assigned “wrongly” to a group containing mostly reviews from a different class, in order to improve the results of classification and prediction
+ Data description
Processed data included reviews of hotel clients collected from publicly available sources
The reviews were labeled as positive and negative
Review characteristics:
more than 5,000,000 reviews
written in more than 25 natural languages
written only by real customers, based on their experience
written relatively carefully but still containing errors that are typical for natural languages
+ Properties of data used for
experiments
The subset (marked as written in English) used in
our experiments contained almost two million
opinions
Review category          Positive        Negative
Number of reviews        1,190,949       741,092
Maximal review length    391 words       396 words
Average review length    21.67 words     25.73 words
Variance                 403.34 words    618.47 words
+ Review examples
Positive The breakfast and the very clean rooms stood out as the best
features of this hotel.
Clean and moden, the great loation near station. Friendly reception!
The rooms are new. The breakfast is also great. We had a really nice stay.
Good location - very quiet and good breakfast.
Negative High price charged for internet access which actual cost now is
extreamly low.
water in the shower did not flow away
The room was noisy and the room temperature was higher than normal.
The air conditioning wasn't working
+ Data preparation
Data collection, cleaning (removing tags, non-letter characters), converting to upper-case
Removing stopwords and words shorter than 3 characters
Spell checking, diacritics removal etc. were not carried out
Creating three smaller subsets containing positive and negative reviews with the following proportions:
about 1,000 positive and 1,000 negative (small)
about 50,000 positive and 50,000 negative (medium)
about 250,000 positive and 250,000 negative (large)
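The preparation steps above can be sketched in Python; the stopword list and the helper name are illustrative assumptions, not from the study:

```python
import re

# Illustrative stopword subset (the study's actual list is not published)
STOPWORDS = {"THE", "AND", "WAS", "FOR", "WITH"}

def prepare(review: str) -> list[str]:
    """Clean one review: strip tags and non-letters, convert to upper-case,
    drop stopwords and words shorter than 3 characters."""
    text = re.sub(r"<[^>]+>", " ", review)     # remove tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # keep letters only
    tokens = text.upper().split()              # convert to upper-case
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]

print(prepare("The room was <b>very</b> clean!"))  # → ['ROOM', 'VERY', 'CLEAN']
```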
+ Experimental steps
Transformation of the data into the vector representation (bag-
of-words model, tf-idf weighting schema)
Clustering with Cluto* with the following parameters:
similarity function – cosine similarity,
clustering method – k-means (Cluto’s variation)
criterion function that is optimized during clustering process – H2
The weighted entropy of the results varied from about 0.58 to 0.60 (e.g., for the small set of reviews, the entropy was 0.587 and the accuracy 0.859)
* Free software providing different clustering methods working with several
clustering criterion functions and similarity measures, suitable for operating on
very large datasets.
+ Graphical representation of the
results of clustering
False Positive (FP) False Negative (FN)
True Positive (TP) True Negative (TN)
Clustered Positive (CP) Clustered Negative (CN)
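The quadrants above can be reproduced in code by crossing each review's original label with its cluster assignment; this is a sketch on toy data, not the authors' implementation:

```python
# Cross original sentiment labels with cluster assignments to obtain the
# four subsets from the diagram (CP/CN are the two cluster unions).
labeled = [("pos", "CP"), ("pos", "CN"), ("neg", "CN"), ("neg", "CP")]

groups = {"TP": [], "FN": [], "TN": [], "FP": []}
for i, (label, cluster) in enumerate(labeled):
    if label == "pos":
        groups["TP" if cluster == "CP" else "FN"].append(i)
    else:
        groups["TN" if cluster == "CN" else "FP"].append(i)

print(groups)  # → {'TP': [0], 'FN': [1], 'TN': [2], 'FP': [3]}
```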
+ Analysis of incorrectly clustered
reviews
When a review rPi, originally labeled as positive, is
“wrongly” assigned to a cluster with mostly negative
reviews (CN), we can assume that the properties of this
review are more “similar” to the properties of the other
reviews in CN, i.e., the words of rPi and their combinations
are more similar to the words contained in the dictionary
of CN
The similarity was related to the frequency of words of rPi
in the subsets of the clustering solution (FN is compared
to TN, TP, CP, and FP is compared to TP, TN, CN)
+ Analysis of incorrectly clustered
reviews
We introduce the importance of a word wi in a given set X:

iX(wi) = NX(wi) / NX

where NX(wi) is the frequency of word wi in set X and NX is the number of dictionary words in X
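Reading NX as the total word count of X (consistent with Example 1 below, where 3 / 3,678 ≈ 0.0008), the measure can be sketched in Python; the function name is an assumption:

```python
from collections import Counter

def importance(word: str, docs: list[list[str]]) -> float:
    """i_X(w) = N_X(w) / N_X: frequency of w in set X divided by the
    total word count of X (assumed reading of N_X)."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

# Toy FN set: 5 words in total, "BED" occurs twice
FN = [["EXCELLENT", "BED"], ["GOOD", "RESTAURANT", "BED"]]
print(importance("BED", FN))  # → 0.4
```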
+ Analysis of incorrectly clustered
reviews
The importance of a word in one set should be similar
to the importance of the same word in the most similar
set, i.e., importance of words in FN and TN should be
more similar than, e.g., importance of words in FN and
TP
The lowest value among |iFP(wi) − iTP(wi)|, |iFP(wi) − iTN(wi)|, and |iFP(wi) − iCN(wi)| corresponds to the highest importance similarity with TP, TN, or CN
The same comparisons between FN and TN, TP, and
CP were carried out
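A minimal sketch of this comparison, using the values reported for the word excellent in Example 1 below (iFN = 0.0008); the function name is an assumption:

```python
def most_similar_set(target_imp: float, imps: dict[str, float]) -> str:
    """Return the set whose importance of the word has the smallest
    absolute difference from the target importance."""
    return min(imps, key=lambda s: abs(imps[s] - target_imp))

# Importances of "excellent" in TN, TP, and CP (values from Example 1)
print(most_similar_set(0.0008, {"TN": 0.0007, "TP": 0.007, "CP": 0.006}))  # → TN
```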
+ Importance of words from dictionary of False
Positive set compared to the other sets
+ Importance of words from dictionary of False
Negative set compared to the other sets
+ Results of the analysis
The words with higher frequencies included mostly the words
that could be considered positive (e.g., location, excellent, or
friendly) and negative (e.g., small, noise, or poor) in terms of
their contribution to the assignment of reviews to a “correct”
category
These words, important from the correct-classification viewpoint, often have their most similar importance in a different set than one would expect; e.g., some words in reviews from FN bearing a strong positive sentiment had their importance most similar to their importance in TN, not in TP or CP
+ Example 1 – small data set
The strongly positive word excellent was used 3 times in FN (290 positive reviews, 3,678 words): iFN = 0.0008
This importance was most similar to the importance of the same word in TN (iTN = 0.0007) and not in TP (iTP = 0.007) or CP (iCP = 0.006)
The review “Excellent bed making. Very good restaurant but an English language menu would be advantageous to non-german speaking visitors.”, although containing the strongly positive word excellent, was categorized incorrectly
+ Example 2 – small data set
The positive word good (less strongly positive than excellent) had the importance iFN = 0.0114
This importance was most similar to the importance of the same word in CP (iCP = 0.0146) and not in TP (iTP = 0.016) or TN (iTN = 0.0021)
Nevertheless, some reviews containing this positive
word were assigned to a group with mostly negative
reviews.
+ Results of the analysis
Both examples demonstrate that other document properties, i.e., the presence of the other words together with their importance, are significant. This is demonstrated in the table with importance similarities of words of an obviously positive review, containing the strongly positive word “good” twice, that was assigned incorrectly to CN.
+ Results of the analysis – importance
vs. frequency
The analysis of the importance of words from the dictionary of FN showed that about 60% of words had an importance similar to their importance in TN
However, the frequency of each of these words (number of
occurrences in all reviews) was relatively low (many of them
appeared just once)
These words with highly similar importance also often did not
bear any sentiment, such as the words discounted, happening,
or attitude
+ Conclusions
The study aimed to find the actual reason for assigning some documents to a “wrong” class
The critical information is provided by certain significant words included in individual reviews
Words that the previous research found significant for opinion polarity did not act as misleading information, unlike words that were much less significant or entirely insignificant
Specific words (or their combinations) can be filtered out as
noise, improving the cluster generation
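As a hedged sketch of this final point, words whose importance in the labeled set falls below a threshold could be filtered before re-clustering; the threshold and function name are illustrative assumptions, not values from the study:

```python
def filter_noise(dictionary, importances, threshold=0.001):
    """Keep only words whose importance in the labeled set exceeds a
    threshold; the rest are treated as noise and dropped."""
    return [w for w in dictionary if importances.get(w, 0.0) > threshold]

# Toy importances: sentiment-bearing "EXCELLENT" survives, rare
# sentiment-free words (cf. discounted, happening) are filtered out
imps = {"EXCELLENT": 0.007, "DISCOUNTED": 0.0002, "HAPPENING": 0.0001}
print(filter_noise(imps.keys(), imps))  # → ['EXCELLENT']
```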