
Final Presentation

Date post: 07-Feb-2017
Upload: love-tyagi
Sentiment Analysis @JETBLUE PRESENTED BY -- MIGHTY ANCIENT THREE CHUANYE “ELVIN” OUYANG LOVE K TYAGI XINGJIAN “STRAYN” WANG TAREK ELDUZDAR
Transcript
Page 1: Final Presentation

Sentiment Analysis @JETBLUE
Presented by: Mighty Ancient Three
Chuanye “Elvin” Ouyang
Love K Tyagi
Xingjian “Strayn” Wang
Tarek Elduzdar

Page 2: Final Presentation

Contents

Project Introduction and Theories
Text Cleaning and Analytics
Feature Engineering and Advanced Analytics
Conclusions

Page 3: Final Presentation

Project Introduction and Theories

Page 4: Final Presentation

Our Topic

What Knowledge to Mine from Tweets that Mention Airline Companies

Question 1: What are the wording and behavior patterns of Twitter users who post tweets related to JetBlue?

Question 2: How can we predict the sentiment of tweets related to airlines?

Page 5: Final Presentation

Our Data from the Real World

Data extracted using the Twitter API:

Contains: “@JetBlue”
Date range: two weeks of tweets
Number of observations: around 20,000 tweets
16 variables:

"Text" "favorited" "favorites Count" "replyTo" "created" "truncated" "replyTo SID" "id" "replyTo UID" "status Source" "screenName" "retweet_Count" "is Retweet" "retweeted" "longitude" "latitude"

To train data mining models, we used the labelled US airlines sentiment data, which contains:

All tweets that mention US airline companies
The same 16 variables as above
Plus the sentiment score variable (Positive/Negative/Neutral)

Page 6: Final Presentation

Our Method: Bag of Words (BOW)

The text variable in the original tweets dataset is transformed into a word matrix.

A TDM (Term Document Matrix) holds one column per document and one row per word. Shown here transposed, with one row per document, it looks like:

Document   Late   Baggage   Airport   Time
1294       3      0         2         0
1682       0      1         0         0
2893       1      0         1         1
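
The bag-of-words transformation can be illustrated with a minimal Python sketch. This is not the team's code: the function name `build_dtm` and the three sample documents are assumptions made for illustration, chosen so that the counts match the example rows above.

```python
from collections import Counter


def build_dtm(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    # Sorted vocabulary gives a stable column order.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical, already-cleaned documents (not taken from the project's data).
docs = ["late late airport late airport", "baggage", "late airport time"]
vocabulary, dtm = build_dtm(docs)

# Transposing the document-term matrix gives the term-document matrix:
# one row per word, one column per document.
tdm = [list(column) for column in zip(*dtm)]
```

Each row of `dtm` corresponds to one document and each row of `tdm` to one word, so the two layouts carry the same counts.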

Page 7: Final Presentation

Text Cleaning and Analytics

Page 8: Final Presentation

Our Work Flow

Five SPRINTs
One Trello Board
Thousands of Slacks
Tens of Emails
GitHub project page
And a wonderful mindmap!

Page 9: Final Presentation
Page 10: Final Presentation

Text Cleaning and Preprocessing

Original tweet: “I have been trying to reach @JetBlue hotline for one hour, it’s really frustrating!!”

Cleaning steps: Remove Stopwords, Stemming, Remove Punctuation, Lowercase

Cleaned tweet: “try reach hotline hour frustrate”

Document #   try   love   delay   reach   thank   hotline   frustrate   pilot
23424        1     0      0       1       0       1         1           0
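
The four cleaning steps can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the stopword list is a tiny hand-picked set, `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer, and dropping @mentions is inferred from the cleaned example. Because the stemmer is so crude, it yields "frustrat" where the slide shows "frustrate".

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"i", "have", "been", "to", "for", "one", "it", "s",
             "its", "really", "the", "a", "is"}


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_tweet(text):
    """Lowercase, drop mentions and punctuation, remove stopwords, then stem."""
    text = text.lower()
    text = re.sub(r"@\w+", " ", text)      # drop @mentions such as @jetblue
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and digits
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return [crude_stem(tok) for tok in tokens]


tweet = "I have been trying to reach @JetBlue hotline for one hour, it's really frustrating!!"
tokens = clean_tweet(tweet)
```

The resulting token list is what would be counted into one row of the word matrix.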

Page 11: Final Presentation
Page 12: Final Presentation

Lexicon Based Approach (Approach One)

This approach scores the sentiment of a sentence based on the sentiment of its individual words.

Lexicons: dictionaries with terms and corresponding sentiment scores

Assumption: aggregation of each word’s sentiment = the sentence’s sentiment

For instance: “I hate @JetBlue as its food is bad and its flights always delay, although its customer service is pretty decent.”

Document #   hate   bad   delay   pretty   decent
7454         N      N     N       P        P

Total sentiment is Negative
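
A minimal sketch of lexicon-based scoring, assuming a toy lexicon with +1 for positive and -1 for negative words. The lexicon contents, the function name `lexicon_sentiment`, and the choice to return "Missing" when no lexicon word appears are illustrative assumptions, not the NRC or BING lexicons themselves.

```python
import re

# Toy lexicon (an assumption for illustration): +1 positive, -1 negative.
LEXICON = {"hate": -1, "bad": -1, "delay": -1, "pretty": 1, "decent": 1}


def lexicon_sentiment(text, lexicon=LEXICON):
    """Sum the scores of the words found in the lexicon and label the total."""
    tokens = re.findall(r"[a-z]+", text.lower())
    matched = [lexicon[tok] for tok in tokens if tok in lexicon]
    if not matched:
        return "Missing"  # no lexicon word in the text, so nothing to score
    total = sum(matched)
    if total > 0:
        return "Positive"
    if total < 0:
        return "Negative"
    return "Neutral"


example = ("I hate @JetBlue as its food is bad and its flights always delay, "
           "although its customer service is pretty decent.")
label = lexicon_sentiment(example)
```

With three negative words and two positive ones, the sum is -1, which reproduces the Negative total in the example.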

Page 13: Final Presentation
Page 14: Final Presentation

Machine Learning on DTM (Approach Two)

Naive Bayesian approach for sentiment analysis:

Model: Sentiment Score ~ DTM

Training: Labelled US airlines sentiments

Testing: Our data downloaded from Twitter API
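
To make the idea concrete, here is a small multinomial naive Bayes classifier with add-one smoothing, written from scratch in Python. It is a sketch under assumptions, not the model the team trained: the function names and the four labelled training documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Fit multinomial naive Bayes on tokenized docs."""
    word_counts = defaultdict(Counter)  # per-class word frequencies
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {tok for tokens in docs for tok in tokens}
    priors = {c: n / len(labels) for c, n in Counter(labels).items()}
    return priors, word_counts, vocab


def predict(tokens, priors, word_counts, vocab):
    """Return the class with the highest log posterior for the given tokens."""
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][tok] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labelled training tweets, already cleaned and tokenized.
train_docs = [["delay", "hour", "frustrat"], ["lost", "baggage", "delay"],
              ["thank", "great", "crew"], ["love", "great", "flight"]]
train_labels = ["negative", "negative", "positive", "positive"]
model = train_naive_bayes(train_docs, train_labels)

prediction = predict(["flight", "delay", "again"], *model)
```

Training on one labelled corpus and predicting on another, as described above, only works when both are cleaned the same way, since words outside the training vocabulary contribute nothing to the score.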

Page 15: Final Presentation
Page 16: Final Presentation

Tweeting patterns

Page 17: Final Presentation

Positive tweet word clouds

NRC, BING, Bayesian

Page 18: Final Presentation

Negative tweet word clouds

NRC, BING, Bayesian

Page 19: Final Presentation

Words associated with “delay”

A quick look at the most frequent word in negative tweets, “delay”
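
The slide does not say how word associations were computed. One simple option, sketched here in Python as an assumption, is to count how often each other word appears in the same tweet as the target word. The function name `cooccurring_terms` and the sample tweets are hypothetical.

```python
from collections import Counter


def cooccurring_terms(docs, target, top_n=5):
    """Count words that share a document with `target`, most frequent first."""
    counts = Counter()
    for tokens in docs:
        unique = set(tokens)  # count each word once per document
        if target in unique:
            counts.update(unique - {target})
    return counts.most_common(top_n)


# Hypothetical cleaned, tokenized tweets (not the project's data).
tweets = [["flight", "delay", "hour"],
          ["delay", "hour", "gate"],
          ["thank", "crew"],
          ["delay", "weather"]]
associated = cooccurring_terms(tweets, "delay")
```

Raw co-occurrence favors words that are frequent everywhere, so a correlation-based measure may be a better fit for larger corpora.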

Page 20: Final Presentation

Interactive R Shiny App

Quick sentiment overview of the Twitter data we downloaded

Link

Page 21: Final Presentation
Page 22: Final Presentation

Predictions on Tweet Data

Models                   Negative      Positive      Other
Google API (Benchmark)   4398 (33%)    6907 (52%)    2066 neutral (15%)
NRC                      4608 (34%)    4005 (30%)    4758 missing (36%)
BING                     1977 (15%)    5139 (38%)    6255 missing (47%)
Naive Bayesian           3829 (29%)    9542 (71%)

Page 23: Final Presentation

Feature Engineering and Advanced Analytics

Page 24: Final Presentation
Page 26: Final Presentation

Conclusion

Page 27: Final Presentation

Which approach works the best?

The term matrix has unique features, making it drastically different from structured data

Much effort should go into text cleaning to remove typos and stem Internet slang

Which approach to use depends on:

Level of predictive power needed: broad-stroke analytics vs. automated sentiment scoring

Constraints on computing resources and timeframes

Page 28: Final Presentation

Going Beyond Sentiment

Optimize text cleaning process
Manually label tweets with sentiment scores
Create special lexicon based on corpus
Clustering tweets

Dig deeper into the words
Explore words associated with most frequent words
Create issue/topic and automatically categorize tweets

Generate richer contextual variables
Integrate variables such as weather, airline delay, traffic conditions, etc.
Generate user tweeting behavior data

Page 29: Final Presentation

Thank you!

Team Members

Chuanye “Elvin” Ouyang
Love K Tyagi
Xingjian “Strayn” Wang
Tarek Elduzdar

