Post on 14-Dec-2015
Team: Priya Iyer, Vaidy Venkat, Sonali Sharma
Mentor: Andy Schlaikjer
Twist: User Timeline Tweets Classifier
Goal
Auto classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology
Input: user timeline tweets
Output: list of auto classified tweets
Rationale
Twitter allows users to create custom Friend Lists based on the user handles.
Rationale (contd.)
Our application is a twist on this functionality of Twitter where we auto classify tweets on the user’s timeline based on just the occurrence of terms in the tweet.
Approach
Step 1: Data Collection
Step 2: Text mining
Step 3: Creation of the training file for the library
Step 4: Evaluation of several classifiers
Step 5: Selecting the best classifier
Step 6: Validating the classification
Step 7: Tuning the parameters
Step 8: Repeat until correct classification
Text Mining Process
Remove special characters
Tokenize
Remove redundant letters in words
Spell Check
Stemming
Language Identification
Remove Stop Words
Generate bigrams and change to lower case
Example: Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D
Stopwords: SF Giants! amaazzzing feelin’!!!! \/ :D
Special chars: SF Giants amaazzzing feelin
Spell check: SF Giants amazing feeling
Stemming: SF Giants amazing feel me
Stopwords: SF Giants amazing feel
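The cleaning steps above can be sketched roughly as follows. This is a minimal illustration only, not the team's actual code: the stop-word list, the helper names `squeeze_repeats` and `preprocess`, and the rule of dropping one-character tokens (to discard emoticon leftovers) are all assumptions, and spell check, stemming and language identification are left out because they need an external NLP toolkit.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"go", "such", "an", "a", "the", "me", "is", "of"}

def squeeze_repeats(word):
    """Collapse runs of three or more identical characters to a single one."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

def preprocess(tweet):
    """Return (tokens, bigrams) for one tweet after basic cleaning."""
    # Remove special characters: keep only letters, digits and whitespace.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", tweet)
    # Tokenize on whitespace and change to lower case.
    tokens = text.lower().split()
    # Drop one-character leftovers such as the "m" from "\m/" (assumed rule).
    tokens = [t for t in tokens if len(t) > 1]
    # Remove redundant letters, then remove stop words.
    tokens = [squeeze_repeats(t) for t in tokens]
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Generate bigrams from adjacent tokens.
    bigrams = [a + " " + b for a, b in zip(tokens, tokens[1:])]
    return tokens, bigrams

tokens, bigrams = preprocess("Go SF Giants! Such an amaazzzing feelin'!!!! \\m/ :D")
print(tokens)
print(bigrams)
```

Note that squeezing repeated letters alone still leaves "amaazing"; that residue is what the spell-check step is meant to catch.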
Choice of ML technique
Logistic Regression Classifier
Reasons:
Most popular linear classification technique for text classification
Ability to handle multiple categories with ease
Gave the best cross-validation accuracy and precision-recall score
Library: LIBLINEAR for Python
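The slides do not say how the cross-validation was run. As a rough sketch of the idea, the following computes k-fold cross-validation accuracy for any classifier passed in as a pair of functions. The names `k_fold_indices`, `cross_val_accuracy`, `train_majority` and `predict_majority` are hypothetical, the folds are contiguous with no shuffling, and the majority-label "classifier" is only a stand-in so the example runs without LIBLINEAR installed.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_accuracy(X, y, train_fn, predict_fn, k=5):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict_fn(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)

# Stand-in classifier (assumption): always predict the most common training label.
def train_majority(X, y):
    return max(set(y), key=y.count)

def predict_majority(model, x):
    return model

X = [[0]] * 10
y = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2]
acc = cross_val_accuracy(X, y, train_majority, predict_majority, k=5)
print(acc)
```

Comparing several classifiers then amounts to calling `cross_val_accuracy` once per candidate with the same data and the same `k`.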
Creation of LIBLINEAR training input
SF Giants amazing feel
Indexing: SF-1 Giants-2 amazing-3 feel-4
Boolean: SF-1 (1) Giants-2 (1) amazing-3 (1) feel-4 (1)
Training input for the SVM: 1 1:1 2:1 3:1 4:1
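A minimal sketch of how the indexing and Boolean-feature steps could produce that training line. The function names `build_index` and `to_liblinear_line` are hypothetical; the output follows the sparse `label index:value` text format that LIBLINEAR and LIBSVM read, in which feature indices must appear in ascending order.

```python
def build_index(token_lists):
    """Assign each distinct term a 1-based index in order of first appearance."""
    index = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in index:
                index[tok] = len(index) + 1
    return index

def to_liblinear_line(label, tokens, index):
    """Encode one tweet as 'label idx:1 idx:1 ...' using Boolean term features."""
    # A set removes duplicate terms; sorting gives the required ascending order.
    feature_ids = sorted({index[t] for t in tokens if t in index})
    return " ".join([str(label)] + ["%d:1" % i for i in feature_ids])

tweet = ["SF", "Giants", "amazing", "feel"]
index = build_index([tweet])
line = to_liblinear_line(1, tweet, index)  # label 1 = Sports
print(line)  # 1 1:1 2:1 3:1 4:1
```

Terms that are not in the index (for example, words seen only at test time) are skipped here, which is one simple choice among several.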
Demo
THANK YOU Andy, Marti & The Twitter Team
Questions?
Data Collection Challenges – Backup Slides
Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”
Tweets were not purely “Sports” or “Business” related
Personal messages were prominent
Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly
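The slides do not describe how the weights were computed. One plausible sketch, under the assumption that a tweet's weight for a category is the fraction of its tokens found in that category's term corpus, is shown below. The term sets, the 0.2 threshold and the function names are all hypothetical.

```python
# Hypothetical mini-corpora; the real term lists are not given in the slides.
SPORTS_TERMS = {"game", "team", "score", "giants", "coach"}
BUSINESS_TERMS = {"market", "stocks", "earnings", "ipo", "shares"}

def category_weight(tokens, terms):
    """Fraction of the tweet's tokens that appear in the category corpus."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in terms)
    return hits / len(tokens)

def label_tweet(tokens, threshold=0.2):
    """Return the best-matching category, or None for off-topic tweets."""
    weights = {
        "Sports": category_weight(tokens, SPORTS_TERMS),
        "Business": category_weight(tokens, BUSINESS_TERMS),
    }
    best = max(weights, key=weights.get)
    # Tweets below the threshold (e.g. personal messages) are left unlabeled.
    return best if weights[best] >= threshold else None

print(label_tweet(["Giants", "win", "the", "game"]))
print(label_tweet(["happy", "birthday", "mom"]))
```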
Text Mining Challenges
Noise in the data:
▪ Tweets are in inconsistent format
▪ Lots of meaningless words
▪ Misspellings
▪ More of individual expression
▪ For example, BAAAAAAAAAAAASSKEttt!!!! bskball, futball, %, :D, \m/, ^xoxo
Solution: Regular expressions and NLP toolkit
Different words, same root
Playing, plays, playful - play
Solution: Stemming
Sample LIBLINEAR input format (Train)
LIBLINEAR output for a test file of 20 tweets
Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets
Comma-separated values of the categories that each tweet was classified into
Accuracy here is 94%. Precision: 0.89 Recall: 0.89
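For reference, these metrics can be computed from true and predicted category labels as sketched below. The slides do not state how precision and recall were averaged over the four categories, so macro-averaging (the unweighted mean of per-class scores) is an assumption here, and the label lists are made-up illustrations rather than the actual test file.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall) for multi-class labels."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls = [], []
    for c in labels:
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)

# Illustrative labels only: 1=sports, 2=finance, 3=entertainment, 4=technology.
y_true = [1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [1, 1, 2, 3, 3, 3, 4, 4]
accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)
```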
Experiment with different kernels for better accuracy
Summary: Data Source/Software/Tools
Category based tweets from https://twitter.com/i/#!/who_to_follow/interests
Coding done in Python
Database – sqlite3
ML tool – LIBSVM
Stemming – Porter’s Stemming
NLP Tool kit