Fypca4

Mining User’s

Opinions in

National University Of Singapore

TEY JUN HONG

U095074X

1. Background

2. Formulating the problem

3. Data Mining Process

4. Techniques

5. Analysis

Content

• Extraction of meaningful, useful, and interesting patterns from a large volume of data

• In this project, the source is a large volume of WEB HOTEL REVIEWS data

• Data mining is one of the top ten emerging technologies

What is Data

Mining?

MIT’s TECHNOLOGY REVIEW 2004

• Process of exploration and analysis

• By automatic or semi-automatic means

• With little or no human interaction

• To discover meaningful patterns and rules

What is Data

Mining?

MASTERING DATA MINING BY BERRY AND LINOFF, 2000

• Growth of social media and the web

• Growth of valuable opinion-oriented data on hotels as the web expands

• Identify a potential hotel to stay at by looking at its aspects

• Overall sentiments on hotels are widely sought on the web for Sentiment Analysis

User’s Opinions in Hotel

• Identify best prospects

(ASPECTS), and retain customers

• Predict what ASPECTS

customers like and promote

accordingly

• Learn parameters influencing

trends in sales and margins

• Identification of opinions for

customers

Sentiment Analysis !!!

What can Data Mining do?

• Exponential growth of user’s

opinions

• Limitations of human analysis

• Accuracy of human analysis

Machines can be trained with advanced computer technology to take over human analysis, and this is done at LOW COST

What are the problems?

• Unable to read like a human

• No emotions

• Cannot detect sarcasm

• Expression of sentiments in different topics and domains

• Polarity analysis

• Facts Vs Opinion

Some Limitations of machines

• “The service is as good as none”. Negation not obvious to machine

• “Swimming pool is big enough to swim with comfort”, “There is a big crowd at the counter complaining”. Polarity might change with context.

• “The room is warmer than the lobby”. Comparisons are hard to classify

Some machine limitation

examples

• Machine learning

• Pattern recognition

• Statistics

• Databases

Sentiment

Analysis

• A tool for data mining and intelligent decision

support

• Application of computer algorithms that

improve automatically through experience

Machine Learning

MASTERING DATA MINING BY BERRY AND LINOFF, 2000

• Supervised Learning

• A training set is provided (data with correct answers), which is used to mine for known patterns

• Unsupervised Learning

• Data are provided with no prior

knowledge of the hidden

patterns that they contain.

• Semi Supervised Learning

Types of Machine learning

• Rule Mining and Rule learning

• Bayesian Networks

• Support Vector Machine

Supervised Learning techniques

• Prediction of sentence polarity

• Classification of polarity for sentiment

lexicon

• Detection of relations

Project Objective

• Large data set

• Relevant Prior Knowledge to

domain, in our case the hotel

domain

• Eg. Rating

• Sentiment lexicon for sentiment

analysis

• Data selection for reliability and

standards

Pre-requisite

Data Mining Process

• Frequent problem : Data inconsistencies

• Duplicate data

• Spelling Errors != Trim from data

• Foreign accent and characters

• Singular / Plural conversion

• Punctuation removal / replacement

• Noise and incomplete data

• Misused naming conventions: same name but different meaning

Cleaning the “Dirty” Data (60% of

effort)
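
A few of the cleaning steps listed above (duplicate data, foreign accents and characters, punctuation replacement) can be sketched in Python as follows. This is only a minimal illustration under assumed conventions: the function names `clean_review` and `deduplicate` are hypothetical, and the project's actual cleaning pipeline is not shown in the slides.

```python
import re
import string
import unicodedata
from typing import List


def clean_review(text: str) -> str:
    """Normalise one review: strip accents, replace punctuation, collapse spaces."""
    # Decompose accented characters and drop the combining marks (e.g. "é" -> "e").
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace ASCII punctuation with spaces so "clean,quiet" stays two tokens.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    no_punct = no_accents.translate(table)
    # Collapse runs of whitespace and lowercase the result.
    return re.sub(r"\s+", " ", no_punct).strip().lower()


def deduplicate(reviews: List[str]) -> List[str]:
    """Keep the first occurrence of each cleaned review, preserving order."""
    seen = set()
    unique = []
    for review in reviews:
        cleaned = clean_review(review)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique
```

Spelling correction and singular / plural conversion are left out of this sketch because they need a dictionary or a stemmer.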

• Part of Speech Tagging (POS) using Brill

Tagger

• Polarity tagging using sentiment lexicon

Data Preprocessing (Laundering)
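
Polarity tagging with a sentiment lexicon can be sketched as a dictionary lookup over POS-tagged words. The sketch assumes the tagger emits Penn Treebank style tags (adjectives start with "JJ") and that sentiment words are adjectives. The tiny `LEXICON` and the function name `tag_polarity` are hypothetical and stand in for the real lexicon.

```python
# Hypothetical miniature lexicon; a real one would hold thousands of entries.
LEXICON = {
    "clean": "positive",
    "friendly": "positive",
    "dirty": "negative",
    "big": "neutral",
}


def tag_polarity(tagged_words, lexicon=LEXICON):
    """Label each adjective with its lexicon polarity, or 'unknown' if absent.

    tagged_words: list of (word, pos_tag) pairs from a POS tagger.
    """
    result = []
    for word, pos in tagged_words:
        # Assumption: only adjectives (tags JJ, JJR, JJS) carry sentiment.
        if pos.startswith("JJ"):
            result.append((word, lexicon.get(word.lower(), "unknown")))
    return result
```

Words that come back as "unknown" or "neutral" are the ones a lexicon lookup alone cannot resolve.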

• Part of Speech Tagging (POS) using Brill

Tagger - NO PROBLEM

- 95% accuracy when POS tagging words after data cleaning

Findings

• Polarity tagging using sentiment lexicon –

BIG PROBLEM

- 40% of sentiment words are not found in the sentiment lexicon

- 10% of sentiment words with a positive or negative polarity are listed in the neutral section of the sentiment lexicon

Findings

• The sentiment lexicon is not comprehensive enough for the machine learning technique adopted

• Domain-dependent sentiment words are listed in the neutral section of the sentiment lexicon

• The polarity of domain-dependent sentiment words can also change within the domain

EXPANSION OF LEXICON !!!

Problems

• Classify the polarity of unlabeled sentiment words using rule-based mining

• Classify domain dependent sentiment words

• Establish word relations between labeled and

unlabeled sentiment words

Solution

• Rule-based mining using conjunctions and punctuation

Data Processing

Polarity Assignment Rules

Relation    Pattern
Same        Adj – AND/OR – Adj
Opposite    Neg-Adj – AND/OR – Adj  /  Adj – AND/OR – Neg-Adj
Same        Neg-Adj – AND/OR – Neg-Adj
Opposite    Adj – BUT/NOR – Adj
Same        Neg-Adj – BUT/NOR – Adj  /  Adj – BUT/NOR – Neg-Adj
Opposite    Neg-Adj – BUT/NOR – Neg-Adj
Same        Adj , Adj
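
The polarity assignment rules above reduce to one observation: AND/OR and the comma keep polarity, BUT/NOR flips it, and each negated adjective flips it once more. A minimal Python sketch of that reading follows. The function names `relation` and `infer_polarity` are hypothetical, and applying negation to the comma rule is an assumption that goes beyond the listed patterns.

```python
SAME_CONNECTORS = {"and", "or", ","}
OPPOSITE_CONNECTORS = {"but", "nor"}


def relation(neg_left: bool, connector: str, neg_right: bool) -> str:
    """Return 'same' or 'opposite' for two adjectives joined by a connector."""
    connector = connector.lower()
    if connector in SAME_CONNECTORS:
        same = True
    elif connector in OPPOSITE_CONNECTORS:
        same = False
    else:
        raise ValueError(f"no rule for connector {connector!r}")
    # Exactly one negated adjective flips the relation; two negations cancel.
    if neg_left != neg_right:
        same = not same
    return "same" if same else "opposite"


def infer_polarity(known: str, neg_left: bool, connector: str, neg_right: bool) -> str:
    """Infer the polarity of an unlabeled adjective from its labeled partner."""
    flipped = {"positive": "negative", "negative": "positive"}
    if relation(neg_left, connector, neg_right) == "same":
        return known
    return flipped[known]
```

For "clean but noisy", with "clean" known to be positive, `infer_polarity("positive", False, "but", False)` labels "noisy" as negative.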

• Relation Network – Aspect – Sentiment word

Data Processing

• Using the expanded sentiment lexicon, we analyze sentiment polarity by performing a sentiment lookup with a Bayesian network

Analysis

• To determine polarity of sentiments

P(X | Y) = P(X) P(Y | X) / P(Y)

• Probability that a sentiment is positive or negative, given its contents

• Assumption: there is no link between words (words are treated as independent)

• P(sentiment | sentence) = P(sentiment) P(sentence | sentiment) / P(sentence)

Bayesian
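
Under the stated assumption that words are independent, P(sentence | sentiment) becomes a product of per-word probabilities, which is the naive Bayes form of the rule above. The following Python sketch illustrates that form only. The add-one smoothing and the function names `train` and `classify` are assumptions, not details taken from the project. P(sentence) is the same for every sentiment, so it is dropped when comparing scores.

```python
import math
from collections import Counter, defaultdict


def train(labelled_sentences):
    """Count labels and words. labelled_sentences: iterable of (words, label)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in labelled_sentences:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary


def classify(words, model):
    """Return the sentiment with the highest log P(sentiment) + sum log P(word | sentiment)."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log P(sentiment)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in words:
            # Add-one smoothing (an assumption) avoids log(0) for unseen words.
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.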

• Precision = N (agree & found) / N (found)

• High precision means most of the found sentiment words are correctly labeled by the system

• Recall = N (agree & found) / N (agree)

• High recall means most of the correct sentiment words are found by the system

Validation
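
The two ratios above can be computed directly from two sets. In this Python sketch, `found` is taken to be the set of items the system labelled and `agreed` the set the human judges agree on. That reading of the slide's notation, and the function name `precision_recall`, are assumptions.

```python
def precision_recall(found: set, agreed: set):
    """Return (precision, recall) for system output `found` against judged set `agreed`."""
    hits = len(found & agreed)  # N (agree & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(agreed) if agreed else 0.0
    return precision, recall
```

With `found = {"a", "b", "c", "d"}` and `agreed = {"a", "b", "e"}`, two items are in both sets, so precision is 2/4 and recall is 2/3.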

• Out of the 350 aspect-unlabelled sentiment word pairs, only 194 are found by the methods. Thus, the precision is about 57%.

• The recall is also not very high; only 126 words are correctly labelled by the system, which is about 63%.

Validation Results

• The results will improve if more rules are applied, for example by treating more adverbs, such as “excessively”, as negation words.

• There might not be enough data for the system to work on: there are only 350 aspect-unlabelled sentiment word pairs for the application to work with.

• A larger dataset, however, requires more human judges to validate the data

Discussion

• A comprehensive sentiment lexicon is a simple yet effective solution to sentiment analysis, as it does not require prior training

• Current sentiment lexicons do not capture the domain and context sensitivities of sentiment expressions

Conclusion

• This leads to poor coverage

• Thus, expanding the general sentiment lexicon to capture domain and context sensitivities of sentiment expressions is advocated

Conclusion

DEMO

Questions?
