Team : Priya Iyer Vaidy Venkat Sonali Sharma Mentor: Andy Schlaikjer Twist : User Timeline Tweets...


Team: Priya Iyer, Vaidy Venkat, Sonali Sharma

Mentor: Andy Schlaikjer

Twist: User Timeline Tweets Classifier

Auto classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology

Input: user timeline tweets

Output: list of auto classified tweets

Rationale

Twitter allows users to create custom Friend Lists based on the user handles.

Rationale (contd.)

Our application is a twist on this functionality of Twitter where we auto classify tweets on the user’s timeline based on just the occurrence of terms in the tweet.

Approach

Step 1: Data collection
Step 2: Text mining
Step 3: Creation of the training file for the library
Step 4: Evaluation of several classifiers
Step 5: Selecting the best classifier
Step 6: Validating the classification
Step 7: Tuning the parameters
Step 8: Repeat until correct classification

Text Mining Process

Remove special characters
Tokenize
Remove redundant letters in words
Spell check
Stemming
Language identification
Remove stop words
Generate bigrams and change to lower case

Original tweet: Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D

After stop words: SF Giants! amaazzzing feelin’!!!! \/ :D

After special chars: SF Giants amaazzzing feelin

After spell check: SF Giants amazing feeling

After stemming: SF Giants amazing feel me

After stop words: SF Giants amazing feel
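A minimal sketch of part of this process, in Python. It covers only special-character removal, tokenizing, stop-word removal, lower-casing and bigram generation; spell check, stemming and language identification are left out. The stop-word list, the rule that drops single-character tokens, and all function names are assumptions made for illustration, not the team's actual code.

```python
import re

# Tiny illustrative stop-word list (an assumption; the slides do not give one).
STOP_WORDS = {"go", "such", "an", "a", "the", "is", "me", "to", "of", "and"}


def strip_special_chars(text):
    """Replace everything except ASCII letters, digits and whitespace with a space."""
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)


def remove_stop_words(tokens):
    """Drop tokens whose lower-cased form is in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def bigrams(tokens):
    """Return adjacent token pairs joined by an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def preprocess(tweet):
    """Clean one tweet and return its unigrams followed by its bigrams."""
    tokens = strip_special_chars(tweet).split()
    # Assumption: single characters are emoticon leftovers (e.g. the D of ":D").
    tokens = [t for t in tokens if len(t) > 1]
    tokens = remove_stop_words(tokens)
    tokens = [t.lower() for t in tokens]
    return tokens + bigrams(tokens)
```

Calling preprocess("Go SF Giants! Such an amazing feeling!!!!") gives the unigrams sf, giants, amazing, feeling followed by the bigrams sf_giants, giants_amazing, amazing_feeling.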

Choice of ML technique

Logistic Regression Classifier

Reasons:

Most popular linear classification technique for text classification

Ability to handle multiple categories with ease

Gave the best cross-validation accuracy and precision-recall score

Library: LIBLINEAR for Python

Creation of LIBLINEAR training input

SF Giants amazing feel

Indexing: SF-1 Giants-2 amazing-3 feel-4

Boolean: SF-1 (1) Giants-2 (1) amazing-3 (1) feel-4 (1)

Training input for the SVM: 1 1:1 2:1 3:1 4:1
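The indexing and Boolean encoding steps can be sketched as follows. This is an illustration under assumptions, not the team's code: the helper names build_index and to_liblinear_line are hypothetical, and the label 1 stands for sports as in the backup slides. The only format facts relied on are that each line is a label followed by index:value pairs and that indices appear in ascending order.

```python
def build_index(token_lists):
    """Assign each distinct token a 1-based feature index in first-seen order."""
    index = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in index:
                index[tok] = len(index) + 1
    return index


def to_liblinear_line(label, tokens, index):
    """Format one tweet as '<label> <index>:1 ...' with ascending indices.

    Boolean features: a token that occurs gets value 1 no matter how often
    it occurs; tokens missing from the index are skipped.
    """
    feature_ids = sorted({index[t] for t in tokens if t in index})
    return " ".join([str(label)] + [f"{i}:1" for i in feature_ids])


# One labelled tweet, already preprocessed (1 = sports).
tweets = [(1, ["SF", "Giants", "amazing", "feel"])]
index = build_index(tokens for _, tokens in tweets)
lines = [to_liblinear_line(label, tokens, index) for label, tokens in tweets]
print(lines[0])  # 1 1:1 2:1 3:1 4:1
```

In practice the index would be built once from the training tweets and reused unchanged for the test file, so that the same word always maps to the same feature number.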

THANK YOU

Marti & The Twitter Team

Questions?

Data Collection Challenges – Backup Slides

Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”

Tweets were not purely “Sports” or “Business” related

Personal messages were prominent

Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly
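The slides do not say how the weights were computed. One possible reading, sketched below, scores a tweet by the fraction of its tokens that appear in each category's term list. The term lists, the function name and the scoring rule are all assumptions made for illustration.

```python
# Hypothetical term lists; the slides do not show the actual corpus.
CATEGORY_TERMS = {
    "sports": {"giants", "game", "score", "team", "win"},
    "business": {"market", "stocks", "shares", "earnings", "profit"},
}


def category_weights(tokens):
    """Return, per category, the fraction of tokens found in its term list."""
    if not tokens:
        return {name: 0.0 for name in CATEGORY_TERMS}
    lowered = [t.lower() for t in tokens]
    return {
        name: sum(t in terms for t in lowered) / len(lowered)
        for name, terms in CATEGORY_TERMS.items()
    }
```

A tweet whose weights are all near zero is probably a personal message and could be dropped from the training set or given less influence.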

Text Mining Challenges

Noise in the data:
▪ Tweets are in inconsistent format
▪ Lots of meaningless words
▪ Misspellings
▪ More of individual expression
▪ For example, BAAAAAAAAAAAASSKEttt!!!! bskball, futball, %, :D, \m/, ^xoxo

Solution: Regular expressions and NLP toolkit

Different words, same root

Playing, plays, playful - play

Solution: Stemming
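As an illustration of the regular-expression part of the solution, the sketch below shortens elongated words such as the example above. The function names and the choice to keep at most two repeated letters are assumptions; a spell checker would still be needed afterwards to reach a dictionary word.

```python
import re


def collapse_repeats(word, max_run=2):
    """Shorten any run of the same character to at most max_run characters."""
    pattern = re.compile(r"(.)\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, word)


def normalize_token(token):
    """Keep ASCII letters only, lower-case them, then collapse repeated letters."""
    letters = re.sub(r"[^A-Za-z]", "", token)
    return collapse_repeats(letters.lower())


print(normalize_token("BAAAAAAAAAAAASSKEttt!!!!"))  # baasskett
```

Keeping two letters rather than one preserves legitimate doubles such as "feel" and leaves the remaining correction to the spell-check step.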

Sample LIBLINEAR input format (Train)

LIBLINEAR output for a test file of 20 tweets

Mixed bag of sports(=1), finance(=2) tweets, entertainment(=3) and technology (=4)

Comma-separated values of the category assigned to each tweet

Accuracy here is 94%. Precision: 0.89 Recall: 0.89
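For reference, these figures can be computed from the true and predicted category numbers as sketched below. The slides do not say how precision and recall were averaged over the four categories; macro-averaging (a plain mean over categories) is an assumption here, and the function names are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of tweets whose predicted category equals the true one."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single category label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


def macro_average(y_true, y_pred):
    """Mean precision and mean recall over the categories present in y_true."""
    labels = sorted(set(y_true))
    pairs = [precision_recall(y_true, y_pred, label) for label in labels]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n
```

Here y_true and y_pred are equal-length lists of category numbers (1 to 4), one entry per test tweet.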

Experiment with different kernels for a better accuracy

Summary: Data Source/Software/Tools

Category based tweets from https://twitter.com/i/#!/who_to_follow/interests

Coding done in Python

Database – sqlite3

ML tool – lib SVM

Stemming – Porter’s Stemming

NLP Tool kit
