Date post: | 14-Feb-2017 |
Upload: | asoka-korale |
1
A SENTIMENT ANALYSIS AND CLASSIFICATION ALGORITHM UTILIZING AN INDEPENDENT TERM MATCHING SCHEME SENSITIVE TO WORD COUNT PATTERNS
Authors:
Asoka Korale, Ph.D., C.Eng., MIET
Chanuka Perera, Dip., ABE(UK)
Eranda Adikari, B.Sc., C.Eng., MIESL
Nadeesha Ekanayake, B.Sc.
2
Business Drivers of “Sentiment Analysis” & Classification
Devise a Customer focused Corporate Strategy
Help Determine Areas of Future Investments
Analysis of Customer Feedback for Decision making
Insights on Corporate Image, Service Level and Performance
Business Process Improvement …
3
Objective of the Modeling
Prioritize Comments by Sentiment (Severity of Feedback)
Classify Comments to Pre Defined Categories
Rate Sentiment contained in Feedback
Analyze Feedback Comments, Prioritize and Classify for Timely Action
Direct each Class to Appropriate Authority in Priority Order for Timely action
4
“Sentiment” a Definition
Concise “Comments” give insight to “Emotional” content of message
Emotional Dimensions of Words: Valence (Happiness), Activation (Arousal), Dominance
An Opinion, View held or Expressed
Only “Select” words convey “Emotion”
Dictionaries of rated Words across each Emotional Dimension
Account separately for “Negations”
Words rated for “Sentiment” by Human agents via large Surveys
Introduce Local Language Support
5
Feedback Comment Classification Process
Supervised Methods employ “Training Sequences”
Technique uses word Combinations, Patterns, Frequencies
Grouping comments on a “Theme” or Criteria into “Classes”
Requires Pre-classified Comments
Suitable for classifying large texts
6
Sentiment Analysis via Independent Term Matching
Assumptions -
Twitter, FB & Customer comments
Each term in a comment independent of others
Valence, Activation and Dominance components of each word drawn from a Normal Distribution with specified Mean and Standard Deviation
Combined overall sentiment rating of matched words occurs at maximum of the sum of the individual Normal Densities
Overall Sentiment in a comment represented by the combined effect of the sentiment of individual words in the comment
Suitable for small text data
Ref: http://www.csc.ncsu.edu/faculty/healey/tweet_viz/
7
Algorithm – Sentiment Score for each Comment
I. Comments in Series: Each Analyzed Separately
II. Select a Comment, Convert words to Lower case and Remove Punctuation
III. Find match in Dictionary for each word in selected comment and get corresponding mean and standard deviation
IV. Extract Mean and Standard Deviation of “Valence” and “Activation” attributes of each matched word from Dictionary
V. Compute a Normal Density Function with Mean and Standard Deviation corresponding to each Attribute of each matched word by scaling a Standard Normal Random Variable
VI. Compute the sum of the Density functions corresponding to each attribute of all matched words in the comment
VII. Determine Maximum point “max-GMM” of the sum of the Density functions to arrive at an average score for the effect of that attribute across all words in the comment
$$\mu = [\mu_1, \mu_2, \ldots, \mu_n], \qquad \sigma = [\sigma_1, \sigma_2, \ldots, \sigma_n]$$
| Comment Words (Dictionary Value) | Valence Mean | Valence Std Dev | Activation Mean | Activation Std Dev |
| 'service' | 6.83 | 1.54 | 2.95 | 2.09 |
| 'good' | 7.89 | 1.24 | 3.66 | 2.72 |
| 'late' | 3.32 | 1.17 | 5.57 | 2.56 |
| Simple Average | 6.01 | 1.32 | 4.06 | 2.46 |
max-GMM: Valence Rating 7.5, Activation Rating 3.7
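Steps II to IV and the "Simple Average" row can be sketched in a few lines of Python. This is only an illustrative sketch: the three-word `DICTIONARY`, the function names and the sample comment are assumptions made for the example, not the authors' Matlab implementation.

```python
import string

# Hypothetical three-word dictionary, copied from the example table:
# word -> (valence mean, valence std, activation mean, activation std)
DICTIONARY = {
    "service": (6.83, 1.54, 2.95, 2.09),
    "good": (7.89, 1.24, 3.66, 2.72),
    "late": (3.32, 1.17, 5.57, 2.56),
}

def match_words(comment, dictionary=DICTIONARY):
    """Steps II-IV: lower-case, strip punctuation, look up each word."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = comment.lower().translate(table)
    return [(w, dictionary[w]) for w in cleaned.split() if w in dictionary]

def simple_average(matches):
    """Column-wise mean of the matched ratings (the 'Simple Average' row)."""
    n = len(matches)
    return tuple(round(sum(r[i] for _, r in matches) / n, 2) for i in range(4))

matches = match_words("The service was good, but was late!")
print([w for w, _ in matches])
print(simple_average(matches))
```

Words without a dictionary entry ("the", "was", "but") are simply skipped, which is the independent term matching assumption in its plainest form.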
8
Gaussian Mixtures in Rating “Total Sentiment”
Consider the cumulative effect of all matched sentiment bearing words via the sum of the individual probability densities:

$$f(x) = \sum_{k=1}^{N} p_k \, g(x; m_k, \sigma_k)$$

where

$$p_k = \frac{1}{N}$$

and

$$g(x; m_k, \sigma_k) = \frac{1}{\sigma_k \sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x - m_k}{\sigma_k}\right)^2}$$

x represents the sentiment score, N the number of matched words in a comment, and $m_k, \sigma_k$ the mean and standard deviation of the Normal Distribution of the ratings of each matched word.

The overall sentiment $x_{comment}$ of a comment in a particular dimension is then determined as

$$x_{comment} = \arg\max_x f(x)$$

which is the point at which the probability of the mixture of distributions is a maximum, and so is the most likely value for the overall sentiment of a comment composed of several words.
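The maximum of the mixture has no closed form in general, so one simple way to locate it is a grid search over the rating scale. The sketch below assumes equal weights of 1/N and a 1 to 9 rating range; the grid bounds, step size and function names are choices made for this example and do not come from the slides.

```python
import math

def normal_pdf(x, mean, std):
    """Normal density g(x; m, sigma)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def max_gmm(params, lo=1.0, hi=9.0, step=0.001):
    """Return the x in [lo, hi] that maximises the equally weighted mixture.

    params is a list of (mean, std) pairs, one per matched word.
    The [lo, hi] range is an assumed rating scale.
    """
    n = len(params)
    steps = int(round((hi - lo) / step))
    best_x, best_f = lo, -1.0
    for i in range(steps + 1):
        x = lo + i * step
        f = sum(normal_pdf(x, m, s) for m, s in params) / n
        if f > best_f:
            best_x, best_f = x, f
    return best_x

# (mean, std) pairs for 'service', 'good', 'late' from the example table.
valence = [(6.83, 1.54), (7.89, 1.24), (3.32, 1.17)]
activation = [(2.95, 2.09), (3.66, 2.72), (5.57, 2.56)]
print(round(max_gmm(valence), 1), round(max_gmm(activation), 1))
```

For the three example words the search should land close to the max-GMM ratings quoted in the deck (7.5 for Valence and 3.7 for Activation), noticeably away from the simple average in the Valence dimension because 'late' pulls the average down without moving the dominant peak.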
9
Overall Valence (Happiness) and Activation (Arousal) of a comment
| Comment Words (Dictionary Value) | Valence Mean | Valence Std Dev | Activation Mean | Activation Std Dev |
| 'service' | 6.83 | 1.54 | 2.95 | 2.09 |
| 'good' | 7.89 | 1.24 | 3.66 | 2.72 |
| 'late' | 3.32 | 1.17 | 5.57 | 2.56 |
| Simple Average | 6.01 | 1.32 | 4.06 | 2.46 |
max-GMM: Valence Rating 7.5, Activation Rating 3.7
Figure 1: Gaussian Mixtures of matched words in the Valence Dimension
Figure 2: Gaussian Mixtures of matched words in the Activation Dimension
10
IMPACT OF “NEGATIONS” ON TOTAL RATING
“the service was not good and late”
| Comment Words (Dictionary Value) | Valence Mean | Valence Std Dev | Activation Mean | Activation Std Dev |
| 'service' | 6.83 | 1.54 | 2.95 | 2.09 |
| Not 'good' | 6.65 | 1.24 | 6.38 | 2.72 |
| 'late' | 3.32 | 1.17 | 5.57 | 2.56 |
| Simple Average | 5.6 | 1.32 | 4.97 | 2.46 |
max-GMM: Valence Rating 6.7, Activation Rating 4.5

“the service was good but was late”
| Comment Words (Dictionary Value) | Valence Mean | Valence Std Dev | Activation Mean | Activation Std Dev |
| 'service' | 6.83 | 1.54 | 2.95 | 2.09 |
| 'good' | 7.89 | 1.24 | 3.66 | 2.72 |
| 'late' | 3.32 | 1.17 | 5.57 | 2.56 |
| Simple Average | 6.01 | 1.32 | 4.06 | 2.46 |
max-GMM: Valence Rating 7.5, Activation Rating 3.7
Account for Negations by adjusting the sentiment score of the word immediately following the negation in a direction opposite in polarity to its matched dictionary sentiment value.
The magnitude of the adjustment made corresponds to the standard deviation of the particular rating value being adjusted.
The magnitude of the adjustment can also be user definable.
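A minimal sketch of this negation rule follows. It assumes that "polarity" is judged against a neutral midpoint of 5 on a 1 to 9 scale, which is consistent with the worked example ('good' moves from 7.89 to 6.65 in Valence and from 3.66 to 6.38 in Activation) but is not stated explicitly in the slides. The `NEGATIONS` list, the `scale` parameter and the function names are hypothetical.

```python
NEGATIONS = {"not", "no", "never"}  # hypothetical negation list
NEUTRAL = 5.0                       # assumed midpoint of a 1-9 rating scale

def negate(mean, std, scale=1.0):
    """Shift a rating by scale * std toward the opposite side of NEUTRAL."""
    if mean >= NEUTRAL:
        return mean - scale * std
    return mean + scale * std

def apply_negations(words, dictionary, scale=1.0):
    """Return (word, mean, std) for each matched word, adjusting any
    matched word that immediately follows a negation."""
    rated = []
    for i, w in enumerate(words):
        if w not in dictionary:
            continue
        mean, std = dictionary[w]
        if i > 0 and words[i - 1] in NEGATIONS:
            mean = negate(mean, std, scale)
        rated.append((w, round(mean, 2), std))
    return rated

# Valence (mean, std) values from the example table.
valence = {"service": (6.83, 1.54), "good": (7.89, 1.24), "late": (3.32, 1.17)}
words = "the service was not good and late".split()
print(apply_negations(words, valence))
```

The `scale` argument stands in for the user-definable magnitude: 1.0 reproduces the one-standard-deviation shift used in the example table.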
11
Variance in Max GMM and Simple Average Measure
For the Valence attribute, the difference between the max-GMM and Simple Average measures is within +/- 0.5 for 90% of the samples.
The CDF of the difference in the Activation attribute is tightly centered on the origin indicating hardly any variance.
This is also an indication that most comments convey sentiments of a single polarity and only a few comments (less than 10%) have words with conflicting emotional content.
Figure 1: Variance between GMM and Simple Average measures for estimating overall comment sentiment
A measure of the degree of disparate emotions in the comments
12
Sample Comments for Rating and Classification
1. HOTLINE ISSUES - DELAY IN ANSWERING - CX SERVICE ASSISTANCE Today morning CX has called to the 444 H\L for Movie Ticket & he has waited for more than 10 mins in the line, regarding this now CX was very disappointed on our service. So pls be kind enough to chk on ths & give the call back to the CX ASAP. * Note: - Regarding this issue CX need the call back from one of our manager & CX has requested not to charge a single rupee from his no for this issue.
2. Yes,man magea prshnaya kiyapu gaman eyaa magea prshnea wisaduwaa he's a good
3. Yes kad pin nambar signal
4. Wenath ayathana wala mema pahasukam nomati nisa
5. very good service
6. uparimaya
7. Uparima
8. think so
9. thanks
10. Super
11. Solved
12. She resolved my problem.
13. Service nallam
14. Sambanda weemata boho welawak giya nisa
15. recharge
16. Prashnayata pilithura hodin pahadili kara dima
17. Payak athulatha gataluwa nirakaranaya karanwa kiuwa. Thawamath gataluwa nirakaranaya kara natha.
18. oba ayathanaya sewawan sadaha ihala mudalak ayakarana nisa
19. no mms setting laba dunnada save kala nohaka
20. nam apahu e tika ewanna
21. Mata awashshaya u pilithurau pahadili lesa laba ganemata hakiuna.
22. mage parshnata pilithuru dunna.
23. lotari SMS stop
24. Its professional
25. ing tone sewawa ain kirima
26. I submitted Xtv reg form on 27th oct at yr crescat arcade. They told to call me on 28th wed to give the AC No
27. Hot line eka answer karapu girlge voice eka and care eka good
28. Hi kohomada? Mama mea dawas wala plan karagena yanawa mage next music video eka karanna. Song eka "Mata Rawana" :-)
29. harima pehediliwa mage getaluwa nirakaranaya kala thanks
30. Good service but shortcomings due to some arrogant customer care officers
31. good men
32. Good
33. getaluwa hadunagenimata noheki wiya..
34. First of all its great to be treated as a privilege customer. Reason is simple. I'm using X mobile connection and XTV, because dialog has the better
35. durakathanayata pilithuru denda epai eke hoda naraka kiyanna.
36. Cx need to add the CHU CHU TV which is a kids channel to the channel list.Since this channel is available on another TV connection.Cx need this channel to activate for XTV aswell.Please check on this and do the needfull. Thank you
37. Customer service personal have to be trained better cause they can't think out of the box.
38. bashawa wenaskaranna
13
Sentiment Aggregates on Sample Comments
Fig 1: Heat Map of Sentiment rated sample comments
Fig 2: Sentiment Dimensions of sample comments
14
A Novel Association Rule Mining Algorithm
• Initialize (at level L1) by determining the set of all Items {I} that meet the minimum support criteria
  • Determine support for all pairs of items {Ii, Ij} (i ≠ j) in {I}
  • Determine rules for all pairs of items of the form Ii -> Ij
• At each subsequent level (Lp), p > 1
  • Determine item combinations that meet minimum support criteria
  • Items at subsequent stages selected from rules of previous stage that met min support criteria
  • Antecedent at subsequent level (Lp+1) is formed by merging the antecedent and consequent terms of the rules that meet the minimum support criteria at level Lp
• Stop when combined terms no longer meet min support criteria
Deriving likely word combinations (Keyword Selection)
• Selection Measures

$$\mathrm{Support}(A \rightarrow B) = N(A \,\&\, B) / N$$

$$\mathrm{Confidence}(A \rightarrow B) = \mathrm{Support}(A \rightarrow B) / \mathrm{Support}(A) = P(E_A \,\&\, E_B) / P(E_A) = P(E_B \mid E_A)$$

where N(A & B) is the number of comments containing both A and B, and N is the total number of comments.
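The first level of the procedure (single items, then pairs and their rules) can be sketched as follows. This covers only level L1 and the pair rules; the merging of antecedent and consequent at higher levels is left out. The function name, the `min_support` default and the sample comments are hypothetical.

```python
from itertools import combinations

def pair_rules(comments, min_support=0.3):
    """Return {(a, b): (support, confidence)} for rules a -> b over word pairs.

    Support is the fraction of comments containing both words;
    confidence is that count divided by the count of the antecedent.
    """
    n = len(comments)
    baskets = [set(c.lower().split()) for c in comments]

    # Level L1: single words that meet the minimum support criterion.
    counts = {}
    for basket in baskets:
        for w in basket:
            counts[w] = counts.get(w, 0) + 1
    frequent = sorted(w for w, c in counts.items() if c / n >= min_support)

    rules = {}
    for a, b in combinations(frequent, 2):
        both = sum(1 for s in baskets if a in s and b in s)
        support = both / n
        if support < min_support:
            continue
        rules[(a, b)] = (support, both / counts[a])  # rule a -> b
        rules[(b, a)] = (support, both / counts[b])  # rule b -> a
    return rules

# Made-up comments for illustration only.
comments = ["hotline delay", "hotline delay answering", "hotline busy", "good service"]
for rule, (support, confidence) in sorted(pair_rules(comments).items()):
    print(rule, support, round(confidence, 2))
```

Here "delay -> hotline" has confidence 1.0 while "hotline -> delay" is weaker, which is the kind of asymmetry that makes confidence useful for picking keywords.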
15
Simplifying Assumptions of the Naïve Bayes Technique
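Probability of a sequence of words {Xi} in a comment given class Cj:

$$P(X_1, X_2, \ldots, X_N \mid C_j) = P(X_1, X_2, \ldots, X_N, C_j) / P(C_j)$$

$$= P(X_1 \mid X_2, \ldots, X_N, C_j) \, P(X_2, X_3, \ldots, X_N, C_j) / P(C_j)$$

$$= P(X_1 \mid X_2, \ldots, X_N, C_j) \cdots P(X_N \mid C_j) \, P(C_j) / P(C_j)$$

Under the assumption of conditional independence of word Xi given class Cj:

$$P(X_i \mid X_{i+1}, \ldots, X_N, C_j) = P(X_i \mid C_j)$$

$$P(X_1, X_2, \ldots, X_N \mid C_j) = P(X_1 \mid C_j) \, P(X_2 \mid C_j) \cdots P(X_N \mid C_j)$$

Probability of class C given a set of words X = {X1, X2, …, XN}:

$$C = \arg\max_{C_j} P(C_j \mid X) = \arg\max_{C_j} \{ P(X \mid C_j) \, P(C_j) \}$$

$$C = \arg\max_{C_j} \{ P(X_1 \mid C_j) \, P(X_2 \mid C_j) \cdots P(X_N \mid C_j) \, P(C_j) \}$$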
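The decision rule above can be sketched as a small bag-of-words classifier. Two details in the sketch are assumptions that the slides do not state: add-one (Laplace) smoothing to avoid zero probabilities, and summing log-probabilities instead of multiplying raw probabilities. The training comments and class labels are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (comment, class). Returns priors, word counts, vocab."""
    class_counts = Counter(c for _, c in samples)
    word_counts = defaultdict(Counter)
    for text, c in samples:
        word_counts[c].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    priors = {c: k / len(samples) for c, k in class_counts.items()}
    return priors, word_counts, vocab

def classify(text, priors, word_counts, vocab):
    """Pick the class maximising log P(Cj) + sum_i log P(Xi | Cj)."""
    best, best_score = None, -math.inf
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for w in text.lower().split():
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing: an assumption, not taken from the slides.
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical pre-classified training comments.
training = [
    ("hotline delay in answering", "hotline"),
    ("waited long on hotline", "hotline"),
    ("add kids channel", "tv"),
    ("channel not available", "tv"),
]
model = train(training)
print(classify("hotline answering delay", *model))
```

Because only word counts enter the score, reordering the words of a comment leaves its predicted class unchanged.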
16
Classification via Naïve Bayes
Assumptions -
The words {Xi} in a comment are independent of one another, and of the order in which they occur, given the class {Cj}
A class is determined solely on the specific words in a comment and their frequency of occurrence in that comment
Conditional Independence of the words in a comment given the class of the comment
a “bag of words model”
17
Performance of the Classification Algorithm
Accuracy greater than 75% on predicted classes
Accuracy greater than 90% on training samples
Performance will further increase with preprocessing and filtering
single word comments don’t convey meaningful category information
Use misclassified comments to “Retrain” algorithm
Key Words for classification via Association Rules
18
Algorithm Implementation & Results
• Algorithm designed and built from first principles using the Matlab programming language
• Local Language Support by updating Dictionary with Sinhala and Tamil words conveying emotion
• 59,000 comments analyzed and Rated for Sentiment and Classified / Binned into six categories
• Improved Classification by word relationships (key words) derived from Association Rule Mining
• 3000 Training comments used with six classes for Training Model
• Fast implementation processing all comments in a few hours
• A Word vs. Frequency Analysis used to determine which new words to add to the Dictionary
• The Sentiment rating is a means to “prioritize” the handling of the sorted and binned comments
• Performance improvement by “re-classifying” misclassified comments and reusing them in Training
19
Conclusion
• Pre Processing – improved performance by retaining only relevant words and word combinations for the classification and the business purpose of the analysis
• Spelling mistakes will cause problems as words will not match those in dictionary
• Update Dictionary with new words and misspelled words
• Introduce limits on the minimum number of words that should be matched for a comment to be analyzed – for increased reliability
• Independent Term Matching – doesn’t necessarily capture “meaning” of comment
• Short comments can be analyzed to assess overall sentiment
• Rate the emotional content in a comment
• Algorithm can provide other segmentations by matching words specific to the purpose of routing
• Naïve Bayes gave good classification accuracy
• The severity of sentiment in the classified comment used to prioritize comment handling
• Simple averaging of the attribute values to arrive at the combined effect of all matched words in a comment can also be considered and may give results that are not that far off from the assumption of Normality
20
THANK YOU