Date post: | 13-Jul-2015 |
Category: | Technology |
Upload: | haha-teh |
Mining User’s Opinions in Hotel
National University Of Singapore
TEY JUN HONG
U095074X
1. Background
2. Formulating the problem
3. Data Mining Process
4. Techniques
5. Analysis
Content
• Extraction of meaningful, useful and interesting patterns from a large volume of data
• In this project, the source is a large volume of WEB HOTEL REVIEWS data
• Data mining is one of the top ten emerging technologies
What is Data Mining?
MIT’s TECHNOLOGY REVIEW 2004
• Process of exploration and analysis
• By automatic or semi-automatic means
• With little or no human interaction
• To discover meaningful patterns and rules
What is Data Mining?
MASTERING DATA MINING BY BERRY AND LINOFF, 2000
• Increase in social media and web users
• Increase in valuable opinion-oriented data on hotels due to web expansion
• Identify potential hotels to stay at by looking at their aspects
• Overall sentiments on hotels are widely sought on the web, which motivates Sentiment Analysis
User’s Opinions in Hotel
• Identify best prospects (ASPECTS) and retain customers
• Predict which ASPECTS customers like and promote accordingly
• Learn parameters influencing trends in sales and margins
• Identify opinions for customers
Sentiment Analysis !!!
What can Data Mining do?
• Exponential growth of users’ opinions
• Limitations of human analysis
• Accuracy of human analysis
Machines can be trained to take over human analysis using advanced computer technology, and at LOW COST
What are the problems?
• Unable to read like a human
• No emotions
• Cannot detect sarcasm
• Expression of sentiments in different topics and domains
• Polarity analysis
• Facts vs. opinions
Some Limitations of machines
• “The service is as good as none”. Negation is not obvious to a machine.
• “Swimming pool is big enough to swim with comfort”, “There is a big crowd at the counter complaining”. Polarity might change with context.
• “The room is warmer than the lobby”. Comparisons are hard to classify.
Some machine limitation examples
• Machine learning
• Pattern recognition
• Statistics
• Databases
Sentiment Analysis
• A tool for data mining and intelligent decision support
• Application of computer algorithms that improve automatically through experience
Machine Learning
MASTERING DATA MINING BY BERRY AND LINOFF, 2000
• Supervised Learning
  • A training set (data with correct answers) is provided and used to mine for known patterns
• Unsupervised Learning
  • Data are provided with no prior knowledge of the hidden patterns that they contain
• Semi-Supervised Learning
Types of Machine learning
• Rule Mining and Rule learning
• Bayesian Networks
• Support Vector Machine
Supervised Learning techniques
• Prediction of sentence polarity
• Classification of polarity for the sentiment lexicon
• Detection of relations
Project Objective
• Large data set
• Relevant prior knowledge of the domain, in our case the hotel domain
  • E.g. ratings
• Sentiment lexicon for sentiment analysis
• Data selection for reliability and standards
Pre-requisite
Data Mining Process
• Frequent problem: data inconsistencies
• Duplicate data
• Spelling Errors != Trim from data
• Foreign accents and characters
• Singular / plural conversion
• Punctuation removal / replacement
• Noise and incomplete data
• Naming conventions misused: same name but different meaning
Cleaning the “Dirty” Data (60% of effort)
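A few of the cleaning steps listed above (accent stripping, punctuation removal, singular / plural conversion, duplicate removal) can be sketched as follows. This is a minimal illustration only, not the project's actual pipeline: the function names are hypothetical and the plural rule is a deliberately rough heuristic.

```python
import re
import unicodedata

def clean_review(text):
    """Normalize one review: strip accents, lowercase, drop punctuation."""
    # Decompose accented characters and drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace punctuation with spaces, then collapse runs of whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", no_accents.lower())
    return " ".join(no_punct.split())

def singularize(word):
    """Rough plural-to-singular heuristic (an assumption, not a real stemmer)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith(("ss", "us", "is")) and len(word) > 3:
        return word[:-1]
    return word

def clean_corpus(reviews):
    """Clean every review and drop exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for review in reviews:
        normalized = " ".join(singularize(w) for w in clean_review(review).split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned
```

Spelling correction is left out on purpose: it needs a dictionary or a dedicated library, which this sketch does not assume.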
• Part of Speech (POS) tagging using Brill Tagger
• Polarity tagging using sentiment lexicon
Data Preprocessing (Laundering)
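Polarity tagging by lexicon lookup might look like the sketch below. It assumes the input is already POS-tagged with Penn Treebank-style tags (as a Brill tagger typically produces) and uses a tiny hypothetical lexicon; `LEXICON`, `tag_polarity` and `coverage` are illustrative names, not part of the project.

```python
# Hypothetical mini-lexicon; a real sentiment lexicon would be far larger.
LEXICON = {
    "clean": "positive",
    "friendly": "positive",
    "dirty": "negative",
    "noisy": "negative",
    "big": "neutral",
}

def tag_polarity(tagged_tokens, lexicon=LEXICON):
    """Attach a polarity label to each adjective; None marks a lexicon miss."""
    result = []
    for word, pos in tagged_tokens:
        # Penn Treebank adjective tags are JJ, JJR and JJS.
        if pos.startswith("JJ"):
            result.append((word, pos, lexicon.get(word.lower())))
    return result

def coverage(tagged):
    """Fraction of tagged adjectives that were found in the lexicon."""
    if not tagged:
        return 0.0
    found = sum(1 for _, _, polarity in tagged if polarity is not None)
    return found / len(tagged)

tokens = [("The", "DT"), ("room", "NN"), ("was", "VBD"),
          ("clean", "JJ"), ("but", "CC"), ("cramped", "JJ")]
tagged = tag_polarity(tokens)
```

Here "cramped" is tagged with `None` because it is missing from the lexicon, which is the kind of gap that a coverage measure makes visible.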
• Part of Speech (POS) tagging using Brill Tagger - NO PROBLEM
  - 95% accuracy in POS tagging of words after data cleaning
Findings
• Polarity tagging using sentiment lexicon - BIG PROBLEM
  - 40% of sentiment words are not found in the sentiment lexicon
  - 10% of sentiment words with a positive or negative polarity are found in the neutral section of the sentiment lexicon
Findings
• The sentiment lexicon is not comprehensive enough for the machine learning technique adopted
• Domain-dependent sentiment words are found in the neutral section of the sentiment lexicon
• The polarity of domain-dependent sentiment words can also change within the same domain
EXPANSION OF LEXICON !!!
Problems
• Classify the polarity of unlabeled sentiment words using rule-based mining
• Classify domain-dependent sentiment words
• Establish word relations between labeled and unlabeled sentiment words
Solution
• Rule-based mining using conjunctions and punctuation
Data Processing
Polarity Assignment Rules
Relation | Pattern
Same | Adj AND/OR Adj
Opposite | Neg-Adj AND/OR Adj, or Adj AND/OR Neg-Adj
Same | Neg-Adj AND/OR Neg-Adj
Opposite | Adj BUT/NOR Adj
Same | Neg-Adj BUT/NOR Adj, or Adj BUT/NOR Neg-Adj
Opposite | Neg-Adj BUT/NOR Neg-Adj
Same | Adj , Adj
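The rules above reduce to one observation: AND, OR and a comma start from "same", BUT and NOR start from "opposite", and each negated adjective flips the relation once. A minimal sketch under that reading follows; the function names are hypothetical, and applying the negation flip to the comma rule is an assumption, since the table lists only the plain "Adj , Adj" case.

```python
SAME_CONJUNCTIONS = {"and", "or", ","}
OPPOSITE_CONJUNCTIONS = {"but", "nor"}

def relation(conjunction, first_negated=False, second_negated=False):
    """Return 'same' or 'opposite' for two adjectives joined by a conjunction."""
    conj = conjunction.lower()
    if conj in SAME_CONJUNCTIONS:
        opposite = False
    elif conj in OPPOSITE_CONJUNCTIONS:
        opposite = True
    else:
        raise ValueError(f"no rule for conjunction: {conjunction!r}")
    # Each negated adjective flips the relation once.
    if first_negated:
        opposite = not opposite
    if second_negated:
        opposite = not opposite
    return "opposite" if opposite else "same"

def infer_polarity(known_polarity, conjunction,
                   known_negated=False, unknown_negated=False):
    """Infer the polarity of an unlabeled adjective from a labeled neighbour."""
    if known_polarity not in ("positive", "negative"):
        raise ValueError("known_polarity must be 'positive' or 'negative'")
    if relation(conjunction, known_negated, unknown_negated) == "same":
        return known_polarity
    return "negative" if known_polarity == "positive" else "positive"
```

For a phrase such as "clean but cramped", with "clean" labeled positive, `infer_polarity("positive", "but")` yields "negative" for the unlabeled word.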
• Relation Network – Aspect – Sentiment word pair
Data Processing
• Using the expanded sentiment lexicon, we analyze sentiment polarity through a sentiment lookup with a Bayesian Network
Analysis
• To determine the polarity of sentiments
P(X | Y) = P(X) P(Y | X) / P(Y)
• Probability that a sentiment is positive or negative, given its contents
• Assumption: there is no link between words
• P(sentiment | sentence) = P(sentiment) P(sentence | sentiment) / P(sentence)
Bayesian
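Under the "no link between words" assumption, P(sentence | sentiment) factors into a product of per-word probabilities, which is the naive Bayes classifier. The sketch below illustrates that idea; it is not the project's implementation. The class name, the add-one (Laplace) smoothing and the toy training data are all assumptions, and P(sentence) is dropped because it is the same for every sentiment.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesSentiment:
    """Naive Bayes over bags of words with add-one smoothing (illustrative)."""

    def __init__(self):
        self.class_counts = Counter()            # sentences seen per sentiment
        self.word_counts = defaultdict(Counter)  # word counts per sentiment
        self.vocabulary = set()

    def train(self, labeled_sentences):
        for words, label in labeled_sentences:
            self.class_counts[label] += 1
            for word in words:
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def log_scores(self, words):
        """log P(sentiment) + sum of log P(word | sentiment), per sentiment."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            n_words = sum(self.word_counts[label].values())
            for word in words:
                if word not in self.vocabulary:
                    continue  # words never seen in training carry no evidence
                numerator = self.word_counts[label][word] + 1
                score += math.log(numerator / (n_words + vocab_size))
            scores[label] = score
        return scores

    def classify(self, words):
        scores = self.log_scores(words)
        return max(scores, key=scores.get)

# Toy training data, invented for illustration only.
model = NaiveBayesSentiment()
model.train([
    (["clean", "friendly", "staff"], "positive"),
    (["great", "clean", "room"], "positive"),
    (["dirty", "noisy", "room"], "negative"),
    (["rude", "staff", "dirty"], "negative"),
])
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.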
• Precision = N (agree & found) / N (found)
• High precision means most of the found sentiment words are correctly labeled by the system
• Recall = N (agree & found) / N (agree)
• High recall means most of the correct sentiment words are found by the system
Validation
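The two measures can be computed from two sets: the items the system found, and the items the human judges agree on. A minimal sketch, with a hypothetical function name and invented example data:

```python
def precision_recall(found, agreed):
    """Return (precision, recall) for a found set against an agreed set."""
    found, agreed = set(found), set(agreed)
    hits = len(found & agreed)  # N (agree & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(agreed) if agreed else 0.0
    return precision, recall

# Invented example: 4 items found, 3 agreed by judges, 2 in common.
precision, recall = precision_recall({"a", "b", "c", "d"}, {"b", "c", "e"})
```

In this example precision is 2/4 and recall is 2/3.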
• Out of the 350 aspect-unlabelled sentiment word pairs, only 194 are found by the methods. Thus, the precision is about 57%.
• The recall is also not very high; only 126 words are correctly labelled by the system, which is about 63%.
Validation Results
• The results should improve if more rules are applied, such as treating more adverbs (for example “excessively”) as negation words.
• There might not be enough data for the system to work on. There are only 350 aspect-unlabelled sentiment word pairs for the application to work with.
• Enlarging the data set, however, requires more human judges to validate the data
Discussion
• A comprehensive sentiment lexicon is a simple yet effective solution to sentiment analysis, as it does not require prior training
• The current sentiment lexicon does not capture the domain and context sensitivities of sentiment expressions
Conclusion
• This leads to poor coverage
• Thus, expanding the general sentiment lexicon to capture domain and context sensitivities of sentiment expressions is advocated
Conclusion
DEMO
Questions?