Competence Center Information Retrieval & Machine Learning
Sascha Narr, Michael Hülfenhaus, Sahin Albayrak
KDML 2012, LWA, Dortmund, Germany
Language-Independent Twitter Sentiment Analysis
Sascha Narr
10. April 2023 Language-Independent Twitter Sentiment Analysis
Overview
► 1. Sentiment analysis on social media
► 2. Creation of a multilingual evaluation dataset of tweets
► 3. A language-independent sentiment labeling heuristic for semi-supervised learning
► 4. Experiments on the multilingual dataset
1. Sentiment Analysis on Social Media
► Why Sentiment Analysis?
    People’s opinions and sentiments about products and events are invaluable in large numbers: market research, product feedback and more
    Sentiment analysis makes it possible to collect such data automatically
► Why Twitter?
    400 million tweets are posted each day [1]
    Short text lengths encourage people to “just write” what they think
    Tweets are often informal and contain lots of opinions
[1] http://news.cnet.com/8301-1023 3-57448388-93/twitter-hits-400-million-tweets-per-day-mostly-mobile/
1. Methods for Sentiment Classification
► Sentiment classification goals:
    Subjectivity: “Does the tweet contain an opinion?”
    Polarity: “Is the expressed opinion positive or negative?”
► Classifiers used: Naive Bayes, Maximum Entropy, Support Vector Machines
► Features used: n-grams, WordNet semantics, part-of-speech information
► Tweet texts have unique properties: they are informal and contain slang, emoticons and misspellings
1. Multilingual Sentiment Analysis
► Fewer than 40% of tweets are in English [1]
► Natural language processing methods are often designed specifically for one language
► Increase coverage of sentiment analysis by using a language-independent approach:
    No extra effort for additional languages
    Is the approach really effective for all languages?
[1] http://semiocast.com/publications/2011_11_24_Arabic_highest_growth_on_Twitter
2. Creation of a Multilingual Evaluation Dataset
► We created a hand-annotated sentiment evaluation dataset of over 12,000 tweets
    4 languages: English, German, French, Portuguese
► Used the Amazon Mechanical Turk platform for annotation
► Each tweet was annotated by 3 different workers:
    Labels: “positive”, “neutral”, “negative”
    Added validation tweets to try to ensure the quality of the annotations
2. Our Multilingual Evaluation Dataset
► Observed a low inter-annotator agreement in our dataset
    Sentiment classification is a hard task, even for humans
    Tweets that humans disagree on are harder to classify as well
► The dataset is publicly available for research purposes
Table 1: Tweet counts for the complete annotated dataset
3. A Language-Independent Heuristic
► To train a sentiment classifier, a large amount of labeled training data is needed
    Can be obtained without human effort using a previously proposed heuristic
► The heuristic uses emoticons in tweets as noisy labels
► Heuristic: If a tweet contains only positive emoticons, label its whole text as positive (and vice versa for negative).
► Examples of emoticons we used:
    Positive: :) :-) =) ;) :] :D ^-^ ^_^
    Negative: :( :-( :(( -.- >:-( D: :/
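The labeling heuristic can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the emoticon lists are the examples from the slide, while the function name `label_tweet`, the whitespace tokenization and the removal of the emoticons from the returned text are assumptions made for this sketch.

```python
# Emoticon examples taken from the slide; the real lists may be longer.
POSITIVE = {":)", ":-)", "=)", ";)", ":]", ":D", "^-^", "^_^"}
NEGATIVE = {":(", ":-(", ":((", "-.-", ">:-(", "D:", ":/"}


def label_tweet(text):
    """Return (label, text_without_emoticons), or None if the tweet is unusable.

    A tweet is usable only if it contains emoticons of exactly one polarity.
    Assumption: emoticons are whitespace-separated tokens.
    """
    tokens = text.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:
        # Either no emoticon at all, or both polarities are present.
        return None
    label = "positive" if has_pos else "negative"
    # Assumption: drop the emoticons so a classifier cannot simply memorize them.
    words = [t for t in tokens if t not in POSITIVE and t not in NEGATIVE]
    return label, " ".join(words)


print(label_tweet("great concert tonight :)"))
print(label_tweet("good :) bad :("))
```

Matching whole tokens rather than substrings keeps a URL such as `http://example.com` from being mistaken for the negative emoticon `:/`.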
3. Heuristic for Semi-Supervised Learning
► Heuristic can be applied to almost any language, since emoticons are used extensively on Twitter
► The number of tweets with emoticons differs among languages
    Caused by many factors, such as language-specific ways of expressing sentiment or different proportions of “formal” tweets
Table 2: Number of tweets containing emoticons for each language
4. Experiments – Sentiment Classification
► Data:
    Training: From ~800M random tweets of mixed languages:
        Filter for languages: English, German, French, Portuguese
        Use emoticon heuristic to select and label training data
    Evaluation: 12597 hand-annotated tweets (4 languages)
► Setup:
    Classification: Sentiment polarity only
    Classifier: Naive Bayes
    Features: 1-grams and 1,2-grams
    Trained 4 classifiers for en, de, fr, pt
    1 classifier for combined en+de+fr+pt
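The setup above can be illustrated with a small Naive Bayes classifier over word n-grams. The slides only state "Naive Bayes" with 1-gram and 1,2-gram features; the multinomial variant, the add-one smoothing (`alpha`), the lowercasing, the whitespace tokenization and all names below are assumptions of this sketch, and the four training tweets are made up.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, max_n):
    """Return all word n-grams of length 1..max_n (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with additive smoothing (an assumed variant)."""

    def __init__(self, max_n=1, alpha=1.0):
        self.max_n = max_n          # 1 -> 1-grams, 2 -> 1,2-grams
        self.alpha = alpha          # smoothing constant (assumption)
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for feat in ngrams(text, self.max_n):
                self.feature_counts[label][feat] += 1
                self.vocab.add(feat)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat in ngrams(text, self.max_n):
                if feat in self.vocab:  # skip n-grams never seen in training
                    score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, as the emoticon heuristic might produce it.
train_texts = ["i love this song", "what a great day",
               "i hate mondays", "this is so bad"]
train_labels = ["positive", "positive", "negative", "negative"]

unigram_clf = NaiveBayes(max_n=1).fit(train_texts, train_labels)
bigram_clf = NaiveBayes(max_n=2).fit(train_texts, train_labels)
print(unigram_clf.predict("so bad mondays"))
print(bigram_clf.predict("love this great day"))
```

Because the features are plain word n-grams, nothing in the sketch is tied to one language; a combined classifier would simply be fitted on the tweets of all four languages at once.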
4. Experiments: Evaluation Dataset
► 2 variations of our evaluation set for the experiments:
    agree-3: Tweets for which all 3 annotators agreed on a sentiment
    agree-2: Tweets for which at least 2 annotators agreed on a sentiment
► Baseline: always guess “positive” (more positive tweets than negative)
Table 3: Tweet counts for the evaluation datasets
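The two evaluation subsets and the baseline can be derived from the three annotations per tweet roughly as follows. This is a sketch under stated assumptions: the function names and the example annotations are hypothetical, and dropping tweets whose consensus label is "neutral" is an assumption based on the experiments classifying polarity only.

```python
from collections import Counter


def consensus(labels, min_agree):
    """Return the label given by at least min_agree annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None


def build_eval_set(annotated, min_agree):
    """Keep tweets whose consensus label is positive or negative.

    Assumption: neutral tweets are excluded from the polarity experiments.
    """
    result = []
    for text, labels in annotated:
        label = consensus(labels, min_agree)
        if label in ("positive", "negative"):
            result.append((text, label))
    return result


def baseline_accuracy(eval_set):
    """Accuracy of the baseline that always guesses "positive"."""
    hits = sum(1 for _, label in eval_set if label == "positive")
    return hits / len(eval_set)


# Hypothetical annotations: (tweet text, labels from 3 workers).
annotated = [
    ("t1", ["positive", "positive", "positive"]),
    ("t2", ["positive", "positive", "neutral"]),
    ("t3", ["negative", "negative", "negative"]),
    ("t4", ["positive", "neutral", "negative"]),
    ("t5", ["neutral", "neutral", "neutral"]),
]

agree3 = build_eval_set(annotated, min_agree=3)
agree2 = build_eval_set(annotated, min_agree=2)
print(len(agree3), baseline_accuracy(agree3))
print(len(agree2), baseline_accuracy(agree2))
```

By construction agree-3 is a subset of agree-2, so agree-2 additionally contains the tweets on which the annotators partly disagreed.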
4. Results – English Classifier
► Best results: English classifier using 1-grams, on the agree-3 set
    81.3% accuracy (trained on 500k tweets)
► Performance on the agree-2 set is consistently lower than on agree-3
[Figure: results for the English (en) classifier]
4. Results – All Languages
[Figure: results per language, with panels en, fr, pt, de]
4. Evaluation – All Languages Compared
► Strong differences between languages
► Differences do not correlate with the number of emoticons in each language
► The emoticon heuristic is a better fit for some languages; this may depend on the style of expressing sentiment in each language
► “muito engraçado kkkkkkkk” (Portuguese: “very funny”, followed by “kkkkkkkk” as written laughter)
Table 3: Tweet counts containing emoticons for each language
[Figure: results per language, with panels en, fr, pt, de]
4. Evaluation – Multi-language Classifier
► Tested on the combined 4-language evaluation set
► Highest performance: 71.5% accuracy
    Slightly less than using 4 individual classifiers (73.9% accuracy)
► Usefulness of the combined classifier can outweigh the performance degradation
[Figure: results for the combined en+de+fr+pt classifier]
Conclusions
► We presented and evaluated a language-independent sentiment classification approach on 4 languages
    A language-independent classifier can be trained given only raw tweets, using a noisy label heuristic
    Good performance across languages, though it varies for each
    Classifiers need a very large number of tweets for training
    Mixed-language classifiers are viable
► Future work:
    Currently we only classify sentiment polarity
    Classifying subjectivity in tweets is important, but finding a good heuristic to label “neutral” tweets is a challenge
Thanks for your attention!
Questions?
Contact
Sascha Narr, Dipl.-Inform.
DAI-Labor, Technische Universität Berlin
Fakultät IV – Elektrotechnik & Informatik
Sekretariat TEL 14, Ernst-Reuter-Platz 7, 10587 Berlin
Fon: +49 (0) 30 / 314 – 74 138
Fax: +49 (0) 30 / 314 – 74 003
www.dai-labor.de