On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter
Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani
Knowledge Media Institute, The Open University,
Milton Keynes, United Kingdom
The 9th edition of the Language Resources and Evaluation Conference, Reykjavik, Iceland
Outline
• Sentiment Analysis
• Stopwords Removal Methods
• Comparative Study
• Conclusion
“Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text”
Sentiment Analysis
• “The main dish was delicious” → Opinion
• “The main dish was salty and horrible” → Opinion
• “It is a Syrian dish” → Fact
Stopwords Removal
Stopwords Removal in Twitter Sentiment Analysis
Is removing stopwords useful? Prior work is split between YES and NO:
- Kouloumpis et al., 2011
- Pak & Paroubek, 2010
- Asiaee et al., 2012
- Bollen et al., 2011
- Bifet & Frank, 2010
- Speriosu et al., 2011
- Zhang & Yuan, 2013
- Gokulakrishnan et al., 2012
- Saif et al., 2012
- Hu et al., 2013
- Camara et al., 2013
Classic Stopword Lists
• Precompiled
• Very popular
• Outdated
• Domain-independent
Automatic Stopword Generation Methods
• Unsupervised Methods
– Term Frequency (a minimal sketch follows below)
– Term-based Random Sampling
• Supervised Methods
– Term Entropy Measures
– Maximum Likelihood Estimation
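As a minimal illustration of the unsupervised, frequency-based idea (not the paper's implementation; the tokenisation, toy corpus and top-k cut-off below are assumptions), a Python sketch:

```python
# Minimal sketch of term-frequency-based stoplist generation:
# the top-k most frequent tokens in the corpus become stopword candidates.
from collections import Counter

def tf_stoplist(tweets, top_k=50):
    counts = Counter(tok for tweet in tweets for tok in tweet.lower().split())
    return {term for term, _ in counts.most_common(top_k)}

# Hypothetical usage on a toy corpus:
tweets = ["the main dish was delicious", "the main dish was salty and horrible"]
print(tf_stoplist(tweets, top_k=3))  # e.g. {'the', 'main', 'dish'}
```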
Stopwords Removal for Twitter Sentiment Analysis
Stopword Analysis Set-Up (1)
Datasets (number of negative and positive tweets):

Dataset    Negative   Positive
OMD        688        393
HCR        957        397
STS        1402       632
SemEval    1590       3781
WAB        2580       2915
GASP       5235       1050
Stopword Analysis Set-Up (2)
Stopwords Removal Methods
1. The Baseline Method
– No removal of stopwords
2. The Classic Method
– Removes stopwords obtained from a pre-compiled list: the Van stoplist (a minimal sketch follows below)
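A minimal sketch of the classic approach, assuming a tiny hand-written placeholder list in place of the full Van stoplist:

```python
# Minimal sketch of classic stopword removal: drop any token found in a
# pre-compiled list. The list below is a placeholder, not the Van stoplist.
CLASSIC_STOPLIST = {"the", "a", "an", "and", "is", "was", "of", "to", "in", "it"}

def remove_stopwords(tweet, stoplist=CLASSIC_STOPLIST):
    return [tok for tok in tweet.lower().split() if tok not in stoplist]

print(remove_stopwords("The main dish was delicious"))  # ['main', 'dish', 'delicious']
```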
Stopword Analysis Set-Up (3)
Stopwords Removal Methods
3. Methods based on Zipf’s Law (sketches of the TF1 and IDF filters follow below)
– TF-High Method: removes the most frequent words
– TF1 Method: removes singleton words (i.e., words that occur only once in the tweets)
– IDF Method: removes words with a low inverse document frequency (IDF)
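A minimal sketch of the TF1 and IDF filters, assuming whitespace tokenisation and an arbitrary illustrative IDF threshold:

```python
# Minimal sketch of two Zipf's-law-based filters.
import math
from collections import Counter

def tf1_stoplist(tweets):
    """TF1: words that occur exactly once in the whole corpus."""
    counts = Counter(tok for t in tweets for tok in t.lower().split())
    return {term for term, c in counts.items() if c == 1}

def low_idf_stoplist(tweets, threshold=1.0):
    """IDF: words whose inverse document frequency falls below a threshold."""
    n = len(tweets)
    df = Counter()
    for t in tweets:
        df.update(set(t.lower().split()))
    return {term for term, d in df.items() if math.log(n / d) < threshold}
```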
Stopword Analysis Set-Up (4)
Stopwords Removal Methods
4. Term-based Random Sampling (TBRS): ranks terms by how informative they are within randomly sampled chunks of the corpus and removes the least informative ones
5. The Mutual Information Method (MI): removes terms that share little mutual information with the sentiment labels (a minimal sketch follows below)
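A minimal sketch of an MI-based filter, assuming MI is computed between a term's presence/absence and the tweet's sentiment label (the paper's exact estimator may differ):

```python
# Minimal sketch: mutual information between a term's presence and the
# sentiment label; terms whose MI is close to zero carry little sentiment
# signal and become stopword candidates.
import math
from collections import Counter

def mi_scores(tweets, labels):
    n = len(tweets)
    docs = [set(t.lower().split()) for t in tweets]
    class_counts = Counter(labels)
    scores = {}
    for term in set().union(*docs):
        mi = 0.0
        for present in (True, False):
            p_x = sum((term in d) == present for d in docs) / n
            for label, c in class_counts.items():
                p_y = c / n
                p_xy = sum((term in d) == present and y == label
                           for d, y in zip(docs, labels)) / n
                if p_xy > 0:
                    mi += p_xy * math.log(p_xy / (p_x * p_y))
        scores[term] = mi
    return scores  # sort ascending: lowest-MI terms are stopword candidates
```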
Stopword Analysis Set-Up (5)
Twitter Sentiment Classifiers
– Two Supervised Classifiers:
• Maximum Entropy (MaxEnt)
• Naïve Bayes (NB)
– Performance measured in accuracy and F1 (a minimal evaluation sketch follows below)
– 10-fold cross-validation
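A minimal evaluation sketch with scikit-learn, under the assumption that MaxEnt is approximated by logistic regression and features are unigram counts (the paper's exact features and settings are not reproduced here):

```python
# Minimal sketch: 10-fold cross-validation of MaxEnt (approximated here by
# logistic regression) and Naive Bayes over unigram-count features,
# reporting accuracy and macro-averaged F1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def evaluate(tweets, labels, stoplist=None):
    vectorizer = CountVectorizer(stop_words=sorted(stoplist) if stoplist else None)
    for name, clf in [("MaxEnt", LogisticRegression(max_iter=1000)),
                      ("NB", MultinomialNB())]:
        scores = cross_validate(make_pipeline(vectorizer, clf), tweets, labels,
                                cv=10, scoring=["accuracy", "f1_macro"])
        print(name,
              round(scores["test_accuracy"].mean(), 3),
              round(scores["test_f1_macro"].mean(), 3))
```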
Experimental Results
Assess the impact of removing stopwords by observing fluctuations in:
- Classification Performance
- Feature space
- Data Sparsity
Experimental Results (1)
1. Classification Performance
[Figure: Baseline classification performance (accuracy and F-measure) of the MaxEnt and NB classifiers across all datasets (OMD, HCR, STS-Gold, SemEval, WAB, GASP)]
Experimental Results (2)
1. Classification Performance
[Figure: Average accuracy and F-measure of the MaxEnt and NB classifiers with each stoplist (Baseline, Classic, TF1, TF-High, IDF, TBRS, MI)]
Experimental Results (3)
2. Feature Space
Reduction rate (%) on the feature space for each stoplist:

Baseline   Classic   TF1     TF-High   IDF     TBRS   MI
0.00       5.50      65.24   0.82      11.22   6.06   19.34
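The reduction rate is most naturally read as the relative shrinkage of the vocabulary after applying a stoplist; a plausible formulation (the slide does not spell out the exact definition) is:

```latex
\mathrm{Reduction\ rate} \;=\; \frac{|V_{\mathrm{baseline}}| - |V_{\mathrm{stoplist}}|}{|V_{\mathrm{baseline}}|} \times 100\%
```

where |V_baseline| is the vocabulary size with no stopword removal and |V_stoplist| is the vocabulary size after the stoplist is applied.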
[Figure: Proportion of singleton words (TF = 1) versus non-singleton words (TF > 1) in each dataset (OMD, HCR, STS-Gold, SemEval, WAB, GASP)]
Experimental Results (4)
3. Data Sparsity
[Figure: Impact of each stoplist (Baseline, Classic, TF1, TF-High, IDF, TBRS, MI) on the sparsity degree of each dataset (OMD, HCR, STS-Gold, SemEval, WAB, GASP)]
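The sparsity degree approaches 1 when most entries of the tweet-term matrix are zero. A minimal sketch, assuming sparsity degree is the proportion of zero entries (the paper's exact definition may differ):

```python
# Minimal sketch: sparsity degree as the proportion of zero entries in the
# tweet-term matrix; values near 1 mean a very sparse dataset.
import numpy as np

def sparsity_degree(tweet_term_matrix):
    m = np.asarray(tweet_term_matrix)
    return 1.0 - np.count_nonzero(m) / m.size

print(sparsity_degree([[1, 0, 0, 0], [0, 2, 0, 0]]))  # 0.75
```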
The Ideal Stoplist (1)
• The ideal stopword removal method is one which:
– Maintains a high classification performance,
– Shrinks the classifier’s feature space,
– Reduces data sparsity,
– Has low runtime and storage complexity, and
– Requires minimal human supervision.
The Ideal Stoplist (2)
Average accuracy, F1, reduction rate on the feature space, and data sparsity of the six stoplist methods. Positive sparsity values indicate an increase in the sparsity degree; negative values indicate a decrease.
Overall Analysis Results
Conclusion
• We studied how six different stopword removal methods affect sentiment polarity classification on Twitter.
• Using a pre-compiled (classic) stoplist has a negative impact on classification performance.
• The TF1 stopword removal method obtains the best trade-off:
– Reducing the feature space by nearly 65%,
– Decreasing the data sparsity degree by up to 0.37%, and
– Maintaining a high classification performance.