Final Presentation

Sentiment Analysis @JETBLUE

Presented by: Mighty Ancient Three

Chuanye “Elvin” Ouyang, Love K Tyagi, Xingjian “Strayn” Wang, Tarek Elduzdar

Contents

Project Introduction and Theories
Text Cleaning and Analytics
Feature Engineering and Advanced Analytics
Conclusions

Project Introduction and Theories

Our Topic

What Knowledge to Mine from Tweets that Mention Airline Companies

Question 1: What are the wording and behavior patterns of Twitter users who post tweets related to JetBlue?

Question 2: How can we predict the sentiment of tweets related to airlines?

Our Data from the Real World

Data extracted using the Twitter API:

Contains: “@JetBlue”
Date Range: Two weeks of tweets
Number of Observations: Around 20,000 tweets
16 Variables:

"Text" "favorited" "favorites Count" "replyTo" "created" "truncated" "replyTo SID" "id" "replyTo UID" "status Source" "screenName" "retweet_Count" "is Retweet" "retweeted" "longitude" "latitude"

To train data mining models, we used a labelled US airlines sentiment dataset that contains:

All tweets that mention US airline companies
The same 16 variables as above
Plus the sentiment score variable (Positive/Negative/Neutral)

Our Method: Bag of Words (BOW)

The text variable in the original tweets dataset is transformed into a word matrix.

A TDM (Term Document Matrix) was created with one column per document and one row per word. Shown transposed here (one row per document), it looks like:

Document   Late   Baggage   Airport   Time
1294       3      0         2         0
1682       0      1         0         0
2893       1      0         1         1
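A minimal Python sketch of how such a matrix can be built from raw text. The function name `build_dtm` and the three toy documents are assumptions made for illustration; the toy texts are chosen only so that the counts match the table above, and this is not the team's actual code.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, rows); rows[doc_id][j] counts vocabulary[j] in that document."""
    counts = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    vocabulary = sorted({term for c in counts.values() for term in c})
    rows = {doc_id: [c[term] for term in vocabulary] for doc_id, c in counts.items()}
    return vocabulary, rows

# Hypothetical toy documents (not real tweets from the dataset).
docs = {
    1294: "late late late airport airport",
    1682: "baggage",
    2893: "late airport time",
}

vocabulary, dtm = build_dtm(docs)

# Transposing gives the term-document layout: one row per word, one column per document.
tdm = {term: [dtm[doc_id][j] for doc_id in docs] for j, term in enumerate(vocabulary)}

print(vocabulary)
for doc_id, row in dtm.items():
    print(doc_id, row)
```

Each row of `dtm` is one document and each column one word; `tdm` holds the same counts with the axes swapped.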

Text Cleaning and Analytics

Our Work Flow

Five SPRINTs
One Trello Board
Thousands of Slacks
Tens of Emails
GitHub project page
And a wonderful mindmap!

Text Cleaning and Preprocessing

“I have been trying to reach @JetBlue hotline for one hour, it’s really frustrating!!”

“try reach hotline hour frustrate”

Document #   try   love   delay   reach   thank   hotline   frustrate   pilot
23424        1     0      0       1       0       1         1           0

Remove Stopwords
Stemming
Remove Punctuations
Lowercase
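The cleaning steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the team's pipeline: the stopword set and the `STEMS` lookup table are tiny hypothetical stand-ins for a real stopword list and a real stemmer, and dropping @mentions is inferred from the example tweet rather than stated on the slide.

```python
import re

# Hypothetical, deliberately tiny stopword list (a real list is much longer).
STOPWORDS = {"i", "have", "been", "to", "for", "one", "it", "s", "really"}

# Hypothetical lookup standing in for a real stemming algorithm.
STEMS = {"trying": "try", "frustrating": "frustrate"}

def clean_tweet(text):
    """Lowercase, drop mentions and punctuation, remove stopwords, then stem."""
    text = text.lower()                    # lowercase
    text = re.sub(r"@\w+", " ", text)      # drop @mentions (assumed step)
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and digits
    tokens = [t for t in text.split() if t not in STOPWORDS]  # remove stopwords
    return [STEMS.get(t, t) for t in tokens]                  # stem

tweet = "I have been trying to reach @JetBlue hotline for one hour , it’s really frustrating!!"
print(" ".join(clean_tweet(tweet)))
```

With these toy lists the example tweet reduces to the same five tokens shown on the slide.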

Lexicon Based Approach (Approach One)

This approach scores the sentiment of a sentence based on the sentiment of its individual words.

Lexicons: dictionaries with terms and corresponding sentiment scores

Assumption: aggregation of each word’s sentiment = the sentence’s sentiment

For instance: “I hate @JetBlue as its food is bad and its flights alway delay, although its customer service is pretty decent.”

Total sentiment is Negative

Document #   hate   bad   delay   pretty   decent   (word sentiment)
7454         N      N     N       P        P
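The aggregation assumption can be sketched in a few lines of Python. The five-word `LEXICON` below is a hypothetical stand-in for a real lexicon such as NRC or BING, the +1/-1 scores are illustrative, and labelling a tweet "missing" when none of its words appear in the lexicon is an assumption about how that category arises.

```python
import re

# Hypothetical mini-lexicon: +1 for positive words, -1 for negative words.
LEXICON = {"hate": -1, "bad": -1, "delay": -1, "pretty": 1, "decent": 1}

def lexicon_sentiment(text, lexicon=LEXICON):
    """Sum the word scores; the sign of the total is the sentence sentiment."""
    tokens = re.findall(r"[a-z]+", text.lower())
    matched = [lexicon[t] for t in tokens if t in lexicon]
    if not matched:
        return "missing"  # no lexicon word found (assumed meaning of "missing")
    total = sum(matched)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"

tweet = ("I hate @JetBlue as its food is bad and its flights alway delay, "
         "although its customer service is pretty decent.")
print(lexicon_sentiment(tweet))
```

Three negative words against two positive ones give a total of -1, so the sentence is scored negative, matching the slide.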

Machine Learning on DTM (Approach Two)

Naive Bayesian approach for sentiment analysis:

Model: Sentiment Score ~ DTM

Training: Labelled US airlines sentiments

Testing: Our data downloaded from Twitter API
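A minimal sketch of a multinomial Naive Bayes classifier over bag-of-words counts, written in plain Python. The function names, the Laplace smoothing parameter `alpha`, and the four toy training documents are all assumptions for illustration; the slides do not specify which implementation or settings the team used.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocabulary = {t for c in term_counts.values() for t in c}
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(term_counts[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_likelihood": {
                t: math.log((term_counts[label][t] + alpha) / total)
                for t in vocabulary
            },
        }
    return model

def predict(model, tokens):
    """Return the class with the highest posterior score; unseen words are skipped."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_likelihood"][t] for t in tokens if t in params["log_likelihood"]
        )
    return max(model, key=score)

# Hypothetical, already-cleaned training tweets with labels.
train_docs = [
    ["delay", "frustrate", "hotline"],
    ["bad", "delay", "baggage"],
    ["love", "thank", "pilot"],
    ["thank", "great", "crew"],
]
train_labels = ["negative", "negative", "positive", "positive"]

model = train_naive_bayes(train_docs, train_labels)
print(predict(model, ["delay", "hotline"]))
print(predict(model, ["thank", "pilot"]))
```

In the project's setup, the labelled US airlines tweets would play the role of `train_docs`, and the cleaned @JetBlue tweets would be passed to `predict`.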

Tweeting patterns

Positive tweet word clouds (panels: NRC, BING, Bayesian)

Negative tweet word clouds (panels: NRC, BING, Bayesian)

Words associated with “delay”

A quick look at the most frequent word in negative tweets, “delay”
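The slides do not say how word associations were computed. One common choice, sketched here as an assumption, is the Pearson correlation between the target word's column and every other word's column in the document-term matrix. The helper names and the four-document toy matrix are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def term_associations(dtm, vocabulary, target):
    """Correlate the target term's column with every other term's column."""
    j = vocabulary.index(target)
    target_col = [row[j] for row in dtm]
    return {
        term: pearson(target_col, [row[k] for row in dtm])
        for k, term in enumerate(vocabulary)
        if k != j
    }

# Hypothetical toy matrix: rows are documents, columns follow `vocabulary`.
vocabulary = ["delay", "hour", "thank"]
dtm = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]

associations = term_associations(dtm, vocabulary, "delay")
for term, r in sorted(associations.items(), key=lambda kv: -kv[1]):
    print(term, round(r, 2))
```

Words that tend to appear in the same tweets as “delay” get a correlation near 1, and words that appear in different tweets get a negative value.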

Interactive R Shiny App

Quick sentiment overview of the Twitter data we downloaded

Link: https://lktanalytics.shinyapps.io/Twitter_Analysis2/

Predictions on Tweet Data

Models                   Negative      Positive      Other
Google API (Benchmark)   4398 (33%)    6907 (52%)    2066 neutral (15%)
NRC                      4608 (34%)    4005 (30%)    4758 missing (36%)
BING                     1977 (15%)    5139 (38%)    6255 missing (47%)
Naive Bayesian           3829 (29%)    9542 (61%)

Feature Engineering and Advanced Analytics

Feature Engineering Pipeline (Approach Three)

Link

https://view.officeapps.live.com/op/view.aspx?src=http://strayn.space:80/wp-content/uploads/2016/12/6103-final-presentation-Strayn.pptx&wdSlideId=256&wdModeSwitchTime=1480549161461

Conclusion

Which approach works the best?

A term matrix has unique features, making it drastically different from structured data.

Much effort should go into text cleaning to remove typos and stem Internet slang.

Which approach to use depends on:

Level of predictive power needed: broad-stroke analytics vs. automated sentiment scoring

Constraints on computing resources and time frames

Going Beyond Sentiment

Optimize text cleaning process
  Manually label tweets with sentiment scores
  Create special lexicon based on corpus
  Clustering tweets

Dig deeper into the words
  Explore words associated with most frequent words
  Create issue/topic and automatically categorize tweets

Generate richer contextual variables
  Integrate variables such as weather, airline delay, traffic conditions, etc.
  Generate user tweeting behavior data

Thank you!

Team Members

Chuanye “Elvin” Ouyang, Love K Tyagi, Xingjian “Strayn” Wang, Tarek Elduzdar

Date post:	07-Feb-2017
Upload:	love-tyagi