Date post: | 07-Feb-2017 |
Upload: | love-tyagi |
Sentiment Analysis @JETBLUE
Presented by Mighty Ancient Three
Chuanye “Elvin” Ouyang
Love K Tyagi
Xingjian “Strayn” Wang
Tarek Elduzdar
Contents
Project Introduction and Theories
Text Cleaning and Analytics
Feature Engineering and Advanced Analytics
Conclusions
Project Introduction and Theories
Our Topic
What knowledge can be mined from tweets that mention airline companies?
Question 1: What are the wording and behavior patterns of Twitter users who post tweets related to JetBlue?
Question 2: How can we predict the sentiment of tweets related to airlines?
Our Data from the Real World
Data extracted using the Twitter API:
Contains: “@JetBlue”
Date range: two weeks of tweets
Number of observations: around 20,000 tweets
16 variables:
"Text" "favorited" "favoritesCount" "replyTo" "created" "truncated" "replyToSID" "id" "replyToUID" "statusSource" "screenName" "retweet_Count" "isRetweet" "retweeted" "longitude" "latitude"
To train the data mining models, we used a labelled US airline sentiment dataset that contains:
All tweets that mention US airline companies
The same 16 variables as above
Plus a sentiment score variable (Positive/Negative/Neutral)
Our Method: Bag of Words (BOW)
The text variable in the original tweets dataset is transformed into a word matrix.
A term-document matrix (TDM) was created, with one column per document and one row per word. Shown here transposed, with one row per document, it looks like:
Document  Late  Baggage  Airport  Time
1294      3     0        2        0
1682      0     1        0        0
2893      1     0        1        1
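The bag-of-words transformation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_dtm` and the three toy documents are hypothetical and are not taken from the project data or code. The documents are chosen so that the counts mirror the example matrix.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocab, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    # Vocabulary: every distinct term, sorted so column order is reproducible.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Hypothetical, already-cleaned tweets (not from the project dataset).
docs = [
    "late late late airport airport",
    "baggage",
    "late airport time",
]

vocab, dtm = build_dtm(docs)

# Transposing the document-term matrix gives the term-document layout
# (one row per word, one column per document).
tdm = [list(column) for column in zip(*dtm)]

print(vocab)
for row in dtm:
    print(row)
```

Each row of `dtm` is one tweet and each column is one word; `tdm` holds the same counts with rows and columns swapped.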
Text Cleaning and Analytics
Our Work Flow
Five sprints
One Trello board
Thousands of Slack messages
Tens of emails
GitHub project page
And a wonderful mindmap!
Text Cleaning and Preprocessing
Original tweet: “I have been trying to reach @JetBlue hotline for one hour, it’s really frustrating!!”
Cleaned tweet: “try reach hotline hour frustrate”
Document #  try  love  delay  reach  thank  hotline  frustrate  pilot
23424       1    0     0      1      0      1        1          0
Cleaning steps:
Remove punctuation
Lowercase
Remove stopwords
Stemming
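A minimal sketch of these cleaning steps, applied to the example tweet, might look like the following. The stopword set and the `STEMS` lookup are tiny hypothetical stand-ins for a full stopword list and a real stemmer, and dropping @mentions is an assumption based on the cleaned example, not a documented step of the project.

```python
import re
import string

# Hypothetical stopword list and stem lookup, kept tiny for illustration.
STOPWORDS = {"i", "have", "been", "to", "for", "one", "it", "s", "really"}
STEMS = {"trying": "try", "frustrating": "frustrate"}

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_tweet(text):
    text = re.sub(r"@\w+", " ", text)        # assumption: drop @mentions
    text = text.replace("\u2019", "'")       # normalise curly apostrophes
    text = text.translate(PUNCT_TABLE)       # remove punctuation
    tokens = text.lower().split()            # lowercase and tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]  # remove stopwords
    return [STEMS.get(t, t) for t in tokens]            # toy "stemming"

tweet = ("I have been trying to reach @JetBlue hotline for one hour, "
         "it\u2019s really frustrating!!")
cleaned = clean_tweet(tweet)
print(" ".join(cleaned))
```

With these toy lists the example tweet reduces to the five tokens shown in the cleaned version of the slide.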
Lexicon-Based Approach (Approach One)
This approach scores the sentiment of a sentence based on the sentiment of its individual words.
Lexicons: dictionaries of terms with corresponding sentiment scores
Assumption: the aggregate of each word’s sentiment = the sentence’s sentiment
For instance: “I hate @JetBlue as its food is bad and its flights always delay, although its customer service is pretty decent.”
Document #  hate  bad  delay  pretty  decent
7454        N     N    N      P       P
Total sentiment is Negative (three negative words outweigh two positive words)
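The aggregation assumption can be illustrated with a short sketch. The five-word `LEXICON`, the +1/-1 scores, and the function name are hypothetical; real lexicons such as NRC or BING are far larger. Returning "Missing" when no word matches the lexicon is also an assumption.

```python
# Hypothetical mini-lexicon: +1 for positive terms, -1 for negative terms.
LEXICON = {"hate": -1, "bad": -1, "delay": -1, "pretty": 1, "decent": 1}

def lexicon_sentiment(tokens, lexicon=LEXICON):
    """Sum the scores of the words found in the lexicon and label the total."""
    hits = [lexicon[t] for t in tokens if t in lexicon]
    if not hits:
        return "Missing"      # assumption: no lexicon word matched
    total = sum(hits)
    if total > 0:
        return "Positive"
    if total < 0:
        return "Negative"
    return "Neutral"

# Tokens of the example sentence after cleaning (illustrative).
tokens = ["hate", "food", "bad", "flight", "delay",
          "customer", "service", "pretty", "decent"]
print(lexicon_sentiment(tokens))
```

Three negative hits and two positive hits sum to -1, so the sentence is labelled Negative, matching the example.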
Machine Learning on DTM (Approach Two)
Naive Bayesian approach for sentiment analysis:
Model: Sentiment Score ~ DTM
Training: labelled US airline sentiment dataset
Testing: our data downloaded from the Twitter API
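As a rough illustration of the idea, the sketch below implements a small multinomial Naive Bayes classifier with add-one (Laplace) smoothing over token counts. It is not the project's code: the class name, the smoothing choice, and the four toy training tweets are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # word counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {t for c in self.word_counts.values() for t in c}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)           # log prior
            for t in tokens:
                if t in self.vocab:                # skip words never seen in training
                    count = self.word_counts[label][t]
                    score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical labelled training tweets (already cleaned and tokenized).
train_docs = [
    ["delay", "late", "frustrate"],
    ["thank", "love", "pilot"],
    ["delay", "baggage", "lost"],
    ["love", "crew", "thank"],
]
train_labels = ["Negative", "Positive", "Negative", "Positive"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict(["flight", "delay", "frustrate"]))
print(model.predict(["thank", "pilot"]))
```

The model is trained on labelled tweets and then applied to unlabelled ones, which mirrors the training and testing split described above.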
Tweeting patterns
Positive tweet word clouds (panels: NRC, BING, Bayesian)
Negative tweet word clouds (panels: NRC, BING, Bayesian)
Words associated with “delay”
A quick look at the most frequent word in negative tweets, “delay”
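One simple way to find words associated with a target term is to count which words appear in the same tweets. The sketch below uses that co-occurrence count as a stand-in; the slides do not say how the association was measured, and the function name and toy tweets are hypothetical.

```python
from collections import Counter

def cooccurring_terms(docs, target, top_n=3):
    """Count how many documents containing `target` also contain each other term."""
    counts = Counter()
    for tokens in docs:
        unique = set(tokens)
        if target in unique:
            counts.update(unique - {target})
    return counts.most_common(top_n)

# Hypothetical cleaned tweets (not from the project dataset).
docs = [
    ["delay", "hour", "gate"],
    ["delay", "hour", "weather"],
    ["thank", "crew"],
    ["delay", "weather", "hour"],
]

top_terms = cooccurring_terms(docs, "delay", top_n=2)
print(top_terms)
```

Correlation-based measures computed on the term matrix are a common alternative to raw co-occurrence counts.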
Interactive R Shiny App
Quick sentiment overview of the Twitter data we downloaded
Link
Predictions on Tweet Data
Models                  Negative     Positive     Other
Google API (Benchmark)  4398 (33%)   6907 (52%)   2066 neutral (15%)
NRC                     4608 (34%)   4005 (30%)   4758 missing (36%)
BING                    1977 (15%)   5139 (38%)   6255 missing (47%)
Naive Bayesian          3829 (29%)   9542 (71%)
Feature Engineering and Advanced Analytics
Feature Engineering Pipeline (Approach Three)
Link
Conclusion
Which approach works the best?
Term matrices have unique features that make them drastically different from structured data
Much effort should go into text cleaning to remove typos and stem Internet slang
Which approach to use depends on:
  Level of predictive power needed: broad-stroke analytics vs. automated sentiment scoring
  Constraints on computing resources and time frames
Going Beyond Sentiment
Optimize text cleaning process
  Manually label tweets with sentiment scores
  Create a special lexicon based on the corpus
  Cluster tweets
Dig deeper into the words
  Explore words associated with the most frequent words
  Create issues/topics and automatically categorize tweets
Generate richer contextual variables
  Integrate variables such as weather, airline delays, traffic conditions, etc.
  Generate user tweeting behavior data
Thank you!
Team Members
Chuanye “Elvin” Ouyang
Love K Tyagi
Xingjian “Strayn” Wang
Tarek Elduzdar