Date posted: 20-Nov-2014
Category: Technology
Uploaded by: awesomesos
Online Feedback Correlation using Clustering
Research Work Done for CS 651: Internet Algorithms
Dedicated to Tibor Horvath
Whose endless pursuit of getting a PhD (imagine that) kept him from researching this topic.
Problem Statement
– Millions of reviews available
– Consumers read only a small number of them
– Reviewer content is not always trustworthy
Problem Statement (continued)
– What information in reviews is important?
– What can we extract efficiently from the overall set of reviews to provide more utility to consumers than is already provided?
Motivation
– People are increasingly relying on online feedback mechanisms in making choices [Guernsey 2000]
– Online feedback mechanisms draw consumers and offer a competitive edge
– Their current quality is poor
Current Solutions
– Prominent placement of “good” reviews
– Showing only a small number of reviews
. . . any more trustworthy?
Amazon Example
Observations
– Consumers look at a product based on its overall rating
– Consumers read the “editorial review” for content
– Reviews can indicate common issues
… Can we correlate these reviews in some meaningful way?
Observations Lead to Hypotheses!
Hypothesis: Products with numerous similar negative reviews will often not be purchased, regardless of their positive reviews. Furthermore, the number of negative reviews is a strong indicator of the likelihood of certain flaws in a product.
Definitions
– Semantic Orientation: polar classification of whether something is positive or negative
– Natural Language Processing: deciphering parts of speech from free text
– Feature: a quality of a product that customers care about
– Feature Vector: a vector representing a review in a d-dimensional space, where each dimension represents a feature
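As an illustration of the feature-vector definition above, here is a minimal sketch (not the project's actual Java code; the feature list and function names are hypothetical) that maps one review's per-feature orientations onto a fixed feature list:

```python
# Hypothetical a-priori feature list, as in the project's simplifications.
FEATURES = ["battery", "screen", "sound", "price"]

def feature_vector(review_orientations):
    """Map a review's per-feature orientations (+1 positive, -1 negative)
    to a vector over the fixed feature list; 0 means the feature
    was not mentioned in the review."""
    return [review_orientations.get(f, 0) for f in FEATURES]

# A review praising sound quality but criticizing battery life:
vec = feature_vector({"battery": -1, "sound": 1})
print(vec)  # [-1, 0, 1, 0]
```

Each review thus becomes one point in a d-dimensional space (d = number of features), which is what makes clustering possible.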
Overview of Project
– Obtain a large repository of customer reviews
– Extract features from the customer reviews and orient them
– Create feature vectors, e.g. [1, 0, -1, 1, 1, -1, …], from the reviews and features
– Cluster the feature vectors to find large negative clusters
– Analyze the clusters and compare to the hypothesis
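The clustering step above was done with KMLocal; purely as an illustration of the idea (not the project's implementation), a minimal pure-Python k-means over feature vectors might look like this, with a helper flagging "negative" clusters by mean orientation:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal k-means: assign each feature vector to one of k clusters."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, v in enumerate(vectors):
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(v, centers[c])))
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centers[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return assign, centers

def is_negative(center, threshold=-0.1):
    """A cluster is 'negative' if its center's mean orientation
    falls below the threshold used in the analysis (-0.1)."""
    return sum(center) / len(center) < threshold
```

The real KMLocal library implements more refined variants (local search, swap heuristics); this sketch only shows how clustering feature vectors surfaces groups of reviews with similar orientations.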
Related Work
Related work has fallen into one of three disparate camps
1. Classification: classifying reviews as negative or positive
2. Domain Specificity: the overall effect of reviews in a domain
3. Summarization: feature extraction to summarize reviews
Limitations of Related Work
– Classification: overly summarizing
– Domain Specificity: hard to generalize given domain information
– Summarization: no overall knowledge of the collection
Close to Summarization?
Most closely related to the summarization work of Hu and Liu
– Summarization with dynamic feature extraction and orientation per review
Data for Project
Data from Amazon.com customer reviews
– Available through the Amazon E-Commerce Service (ECS)
– Four thousand products related to mp3 players
– Over twenty thousand customer reviews
Technologies Used
– Java to program the modules
– Amazon ECS
– NLProcessor (trial version) from Infogistics
– Princeton’s WordNet as a thesaurus
– KMLocal from David Mount’s group at the University of Maryland for clustering
Project Structure
Simplifications Made
– Limited data set
– Feature list created a priori
– Features from the same sentence given the same orientation
– Sentences without features neglected
– Number of clusters chosen only to see correlations in the biggest cluster
– Small adjective seed set
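The small-adjective-seed-set simplification can be pictured with a sketch. The project used WordNet as its thesaurus; the tiny hand-coded synonym table below is a hypothetical stand-in for WordNet lookups, and the function names are illustrative only:

```python
# Hypothetical stub replacing WordNet synonym lookups.
SYNONYMS = {
    "good": ["great", "fine"],
    "bad": ["poor", "terrible"],
    "great": ["excellent"],
}

def orient(seeds, depth=2):
    """Expand a seed set of adjectives through synonyms, propagating
    each seed's orientation (+1 or -1) to the synonyms it reaches."""
    orientation = dict(seeds)
    frontier = list(seeds)
    for _ in range(depth):
        nxt = []
        for word, polarity in frontier:
            for syn in SYNONYMS.get(word, []):
                if syn not in orientation:
                    orientation[syn] = polarity
                    nxt.append((syn, polarity))
        frontier = nxt
    return orientation

print(orient([("good", 1), ("bad", -1)]))
```

A larger seed set (listed under Future Work) would let this propagation reach more adjectives and reduce the number of reviews pruned for poor orientation.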
Analysis
– Associated clusters with products
– Found negative clusters using a threshold (-0.1)
– Eliminated non-negative clusters
– Sorted the product list twice:
– Products by sales rank (given by Amazon)
– Products sorted by the hypothesis, with a tweak
– Tweak: Relative Size × Distortion
– Computed Spearman’s Distance
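The slides do not spell out which Spearman distance was computed; assuming the common Spearman footrule (the sum of absolute differences in each item's rank position), comparing the two product orderings can be sketched as:

```python
def spearman_footrule(rank_a, rank_b):
    """Spearman's footrule distance between two rankings of the same
    items: sum of absolute differences in each item's position.
    0 means identical orderings; larger means more disagreement."""
    pos_b = {item: i for i, item in enumerate(rank_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(rank_a))

# Identical rankings give 0; a fully reversed 3-item ranking gives
# |0-2| + |1-1| + |2-0| = 4.
print(spearman_footrule(["a", "b", "c"], ["c", "b", "a"]))  # 4
```

Here `rank_a` would be the products ordered by Amazon sales rank and `rank_b` the hypothesis-based ordering; a small distance supports the hypothesis.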
Results
– The hypothesis predicts with 82% accuracy!
– But most of the four thousand products were pruned due to poor orientation
Conclusion
– Consumers are affected by negative reviews that correlate to show similar flaws
– They are affected regardless of the positive reviews
Future Work
– Larger seed set for adjectives
– Use more complicated NLP techniques
– Experiment with the size of clusters
– Dynamically determine features using summarization techniques
– Use different data sets
– Use a different distance measure in clustering
Questions