Project Discussion 3
Prof. Vincent Ng, Jitendra Mohanty, Girish Vaidyanathan, Chen Chen
Agenda
• Brief recap of last presentation
• Things we have done so far
  – Finding interests of user
  – Finding user's gender
  – Finding trending topics
  – Finding opinion-target pairs
  – Argument detection
• Future plans
  – Argument detection
  – User profile construction
Brief recap of last presentation
• Finding interests of user
  – 20 categories of interest
• Algorithms
  – Neural network
  – Support Vector Machine
  – Passive Aggressive algorithm
• Data
  – Twitter data
  – Blog data
Data Preparation
Interest category distribution:
music 0.3305412193304446
photography 0.13342809041481524
art 0.10207545607595286
reading 0.3219854828471283
movie 0.19912786686170067
sport 0.2109817017635857
writing 0.1620484089090056
travel 0.12136726188833384
cooking 0.0761322551265421
fashion 0.0824524604642177
food 0.06361604062594872
politics 0.04773272983192118
god 0.05087903292578588
singing 0.055998675240802584
dancing 0.05427372836916623
family 0.05007865757734662
animal 0.04193690834322303
shopping 0.043261667540639745
game 0.046159578284988824
social media 0.05710264123864985
We have 120,778 users in total.
60% are used as training data.
20% are used as development data (for tuning learning algorithm parameters).
20% are used as testing data.
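The 60/20/20 split can be sketched as below (a minimal illustration; `split_users` is a hypothetical helper name, not the actual pipeline code):

```python
import random

def split_users(user_ids, seed=0):
    """Shuffle the users and split them 60/20/20 into train/dev/test."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_dev = int(0.6 * n), int(0.2 * n)
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]
    return train, dev, test

# For 120,778 users this yields 72,466 / 24,155 / 24,157 users.
train, dev, test = split_users(range(120778))
```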
Finding interests of user
• Feature groups
  – POS sequence: 1,003
  – Named entities: 14
  – Social linguistic: 37,063
  – Bigram: 193,985
  – Unigram: 273,985
  – Unigram for description: 18,855
  – Bigram for description: 15,754
  – N-gram for user name: 17,482
  – N-gram for screen name: 19,944
  – Total: 578,085
Finding interests of user
• Measures
  – Precision: the fraction of users predicted to have an interest who really have it.
  – Recall: the fraction of users who have an interest that are correctly retrieved.
  – F-score.
  – Accuracy.
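For reference, all four measures can be computed from per-category binary labels as in this sketch (`prf` is a hypothetical helper name):

```python
def prf(gold, pred):
    """Per-interest precision, recall, F-score, and accuracy.
    gold/pred are parallel lists of 0/1 labels (has the interest or not)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = len(gold) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gold)
    return precision, recall, f, accuracy
```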
Finding interests of user
• Neural Network Result

Interest F score Accuracy Precision Recall
Music 0.650616 0.773514 0.65797 0.643426
photography 0.535807 0.896423 0.65906 0.451391
art 0.504014 0.907932 0.574479 0.448947
reading 0.658378 0.764531 0.615774 0.707317
movie 0.574119 0.846456 0.632111 0.525873
sport 0.580578 0.832464 0.612107 0.552139
writing 0.604014 0.876677 0.662005 0.555365
travel 0.496454 0.888309 0.567406 0.441274
cooking 0.479604 0.930286 0.585219 0.406283
fashion 0.649794 0.947301 0.702558 0.604401
food 0.496868 0.950116 0.689455 0.388381
politics 0.604425 0.969656 0.769231 0.497778
god 0.532205 0.958809 0.643182 0.453889
singing 0.563721 0.961169 0.768061 0.445261
dancing 0.553832 0.960714 0.725369 0.447909
family 0.467952 0.957733 0.656433 0.363563
animal 0.532616 0.971229 0.8 0.399194
shopping 0.538219 0.969738 0.771739 0.413191
game 0.562929 0.968372 0.817276 0.429319
socialMedia 0.557978 0.963073 0.840299 0.417656
Finding interests of user
• Support Vector Machine Result

Interest F score Accuracy Precision Recall
Music 0.652349 0.76747 0.639563 0.665656
photography 0.524648 0.888227 0.600564 0.465771
art 0.486772 0.907642 0.578142 0.420342
reading 0.656884 0.766725 0.621858 0.69609
movie 0.57672 0.841033 0.605836 0.550273
sport 0.577859 0.837721 0.636838 0.528878
writing 0.599764 0.873862 0.648211 0.558054
travel 0.494236 0.887399 0.562183 0.440942
cooking 0.476624 0.928631 0.567197 0.410995
fashion 0.647953 0.950157 0.755798 0.567042
food 0.499016 0.947301 0.628345 0.413838
politics 0.592075 0.971022 0.85956 0.451556
god 0.522603 0.960217 0.686684 0.421812
singing 0.564531 0.961169 0.766709 0.44673
dancing 0.568401 0.962908 0.775296 0.448669
family 0.472056 0.95972 0.715461 0.352227
animal 0.52027 0.970608 0.788934 0.388105
shopping 0.550218 0.970152 0.770979 0.42774
game 0.563605 0.968331 0.813839 0.431065
socialMedia 0.569113 0.962577 0.796 0.442878
Finding interests of user
• Passive Aggressive Result

Interest F score Precision Recall
Music 0.64241 0.614073 0.673487
photography 0.517897 0.52488 0.511097
art 0.484836 0.511926 0.460469
reading 0.651151 0.621134 0.684217
movie 0.559173 0.579553 0.540177
sport 0.572767 0.622685 0.530258
writing 0.596787 0.646874 0.553899
travel 0.483696 0.532721 0.442933
cooking 0.472289 0.556028 0.410471
fashion 0.647539 0.718926 0.589048
food 0.486872 0.57231 0.423629
politics 0.585903 0.769899 0.472889
god 0.51585 0.643114 0.430634
singing 0.553899 0.763427 0.438648
dancing 0.555715 0.748711 0.441825
family 0.452219 0.636765 0.350607
animal 0.519975 0.666142 0.426411
shopping 0.537906 0.708399 0.43356
game 0.55376 0.765794 0.433682
socialMedia 0.552083 0.763089 0.432493
Finding interests of user
• Result Comparison

Algorithm Micro F-score Macro F-score
Support Vector Machine 0.58499 0.56396
Neural Network 0.58678 0.56504
Passive aggressive 0.57933 0.56027
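Micro F-score pools the true positives, false positives, and false negatives over all 20 categories before computing F, while macro F-score averages the per-category F-scores. A minimal sketch, assuming per-category (tp, fp, fn) counts are available:

```python
def micro_macro_f(per_label_counts):
    """per_label_counts: list of (tp, fp, fn) tuples, one per interest category."""
    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    # Macro: average of per-category F-scores (every category weighs equally).
    macro = sum(f1(*c) for c in per_label_counts) / len(per_label_counts)
    # Micro: F-score of the pooled counts (frequent categories weigh more).
    tp = sum(c[0] for c in per_label_counts)
    fp = sum(c[1] for c in per_label_counts)
    fn = sum(c[2] for c in per_label_counts)
    micro = f1(tp, fp, fn)
    return micro, macro
```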
Finding interests of user
• Result Analysis
• Recall analysis
  – Recall is low because of the small amount of data per user. Some users claim they have a certain interest but have not published any tweets or blogs related to it. Once they publish such tweets or blogs in the future, our system can make the right prediction.
• Precision analysis
  – We found cases where people have published tweets or blogs related to a certain interest but do not list it as one of their interests.
  – Precision is higher than recall for most interest categories.
Finding interests of user
• Including more features
  – Tweet POS sequence: 1,003
  – Named entities: 14
  – Social linguistic: 37,063
  – Bigram for tweets and blogs: 193,985
  – Unigram for tweets and blogs: 273,985
  – Unigram for description: 18,855
  – Bigram for description: 15,754
  – N-gram for user name: 17,482
  – N-gram for screen name: 19,944
  – Gender: 2
  – Blog POS sequence: 3,075
  – Unigram for "About me": 18,753
  – Bigram for "About me": 12,391
  – Industry: 39
  – Location: 4,505
  – Occupation: 332
  – Total: 617,180
Finding interests of user
• Neural network result after including more features

Interest Improved F score Old F score
Music 0.661115 0.650616
photography 0.556041 0.535807
art 0.515067 0.504014
reading 0.664573 0.658378
movie 0.583581 0.574119
sport 0.594504 0.580578
writing 0.617632 0.604014
travel 0.504436 0.496454
cooking 0.49019 0.479604
fashion 0.680579 0.649794
food 0.505812 0.496868
politics 0.613917 0.604425
god 0.542065 0.532205
singing 0.570498 0.563721
dancing 0.565947 0.553832
family 0.469649 0.467952
animal 0.529793 0.532616
shopping 0.54717 0.538219
game 0.571116 0.562929
socialMedia 0.56829 0.557978
Finding interests of user
• Result after including more features
• After including the additional features, we predict nearly all interest categories better than before (animal is the one slight exception). Several categories, such as music, photography, sport, writing, fashion, and dancing, improve by more than 1 point.
Model Micro F-score Macro F-score
Neural Network 0.58678 0.56504
Neural Network (More features) 0.59780 0.57582
Finding interests of user
• Feature Analysis
  – There are 16 feature groups in total. We delete one feature group at a time and observe the result.
  – Since the neural network gives the best result, we use it for this analysis.
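The leave-one-group-out procedure can be sketched as follows; `train_and_eval` stands in for the real training pipeline and is purely illustrative:

```python
def ablation(feature_groups, train_and_eval):
    """Retrain without each feature group in turn and report the change in
    score relative to the all-features baseline.
    train_and_eval(groups) -> score is a stand-in for the real pipeline."""
    baseline = train_and_eval(set(feature_groups))
    deltas = {}
    for group in feature_groups:
        score = train_and_eval(set(feature_groups) - {group})
        deltas[group] = score - baseline  # > 0 means removing the group helped
    return baseline, deltas
```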
Feature Analysis result

Neural Network Micro F-score Macro F-score
All features 0.5978 0.57582
remove tweet pos 0.6041 0.58344
remove ner 0.59739 0.57412
remove social linguistic 0.60028 0.57929
remove bigram 0.55531 0.52591
remove unigram 0.59273 0.57005
remove description unigram 0.59579 0.57388
remove description bigram 0.59622 0.57491
remove name ngram 0.59945 0.57679
remove screen name ngram 0.60034 0.57693
remove gender 0.59737 0.57463
remove blog pos 0.60167 0.57937
remove aboutMe unigram 0.59443 0.57166
remove aboutMe bigram 0.59831 0.57594
remove industry 0.5986 0.57557
remove location 0.59847 0.57524
remove occupation 0.59905 0.57676
Finding gender of user
• Motivation
  – Helps to construct the user's profile
  – Helps to compare opinions between genders
• Data
  – Tweet data
  – Blog data
• Feature groups:
  – POS sequence: 1,003
  – Named entities: 14
  – Social linguistic: 37,063
  – Bigram: 193,985
  – Unigram: 273,985
  – Unigram for description: 18,855
  – Bigram for description: 15,754
  – N-gram for user name: 17,482
  – N-gram for screen name: 19,944
  – Interest features *
  – Total: 578,085 + 20 (interest features)
Gender Distribution

Gender Count Percentage
Male 47404 39.25%
Female 73374 60.75%
Finding gender of user
• Result:
  – If we can improve our interest prediction, it may be possible to improve the gender prediction as well.
Neural Network F-Score Accuracy Precision Recall
No interest features 0.88528 0.909339 0.882414 0.888165
Real interest features 0.889391 0.912858 0.889251 0.889531
predicted interest features 0.884091 0.908429 0.881505 0.886693
Support Vector Machine F-Score Accuracy Precision Recall
No interest features 0.868892 0.894809 0.85335 0.885012
Real interest features 0.873173 0.897996 0.855558 0.891528
predicted interest features 0.868175 0.893857 0.849738 0.887429
Passive Aggressive Algorithm F-Score Precision Recall
No interest features 0.872402 0.862544 0.882489
Real interest features 0.876564 0.852488 0.902039
predicted interest features 0.870567 0.861458 0.879872
Finding Trending Topics
• Motivation
  – Helps find the interesting topics that attract people's attention
  – Trending topics are helpful in argument detection
• Possible Approaches
  – Naïve approach
  – Online clustering
  – Latent Dirichlet Allocation (LDA)
Trending Topics Results
• Naïve approach: a brief recap
  – Visit every tweet in our dataset and find the words and phrases that occur frequently.
  – These are the probable trending topics in our dataset.
  – The timeframe of a trending topic can be found in a similar fashion: keep track of the timestamp associated with every tweet and find the minimum and maximum timestamp for every word/phrase.
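The naïve approach can be sketched as a single pass over (timestamp, text) pairs (an illustration, not the actual implementation; the count threshold is an assumption):

```python
from collections import defaultdict

def naive_trending(tweets, min_count=2):
    """tweets: iterable of (timestamp, text) pairs.
    Counts every distinct word per tweet and tracks each word's earliest and
    latest timestamp; words above min_count are candidate trending topics."""
    counts = defaultdict(int)
    span = {}
    for ts, text in tweets:
        for word in set(text.lower().split()):
            counts[word] += 1
            lo, hi = span.get(word, (ts, ts))
            span[word] = (min(lo, ts), max(hi, ts))
    return {w: (c, span[w]) for w, c in counts.items() if c >= min_count}
```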
Some results for the month of December in our dataset, in chronological order, using the naïve approach (12/10/2009–12/17/2009)

(Table: top trending words and counts for each day of the week; recurring entries include christmas, iphone, obama, justinbieber, bieber, tiger woods, gaga, avatar, apple, copenhagen, blackberry, president, #ifsantawasblack, and #iranelection.)
Topics over the month of December (12/10/2009–12/17/2009)

Topic | Days trending (of 7) | Total count
apple | 5 | 5,673
campaign | 2 | 1,434
allah | 1 | 659
jesus | 2 | 1,083
#iphone | 1 | 800
copenhagen | 4 | 3,523
shooting | 1 | 607
#in2010 | 1 | 1,024
#thoushallnot | 1 | 3,272
arms | 2 | 1,457
gaga | 6 | 6,775
iphone | 6 | 19,016
#ifsantawasblack | 3 | 5,186
christmas | 6 | 83,129
avatar | 5 | 6,636
ladygaga | 1 | 503
#thisiswar | 1 | 682
congress | 1 | 652
bieber | 4 | 6,154
president | 3 | 3,465
obama | 7 | 13,074
#iranelection | 2 | 1,640
justinbieber | 7 | 13,960
tiger woods | 5 | 7,222
blackberry | 4 | 4,008
Problems in the Naïve approach
• Many irrelevant words or phrases with large counts are treated as trending topics.
• For example:
  – Youtube video
  – I arrived
  – Watching movie
Solution for the problem in the previous slide (part of future work)
• Possible solutions
  – Online clustering
  – Latent Dirichlet Allocation
Solution for the problem in the previous slide (part of future work)
• Online clustering algorithm
  – All the tweets are ordered on the timeline.
  – Tweets are represented as vectors of tf-idf (term frequency × inverse document frequency) weights.
  – Tweets with the highest similarities are clustered together.
  – Every cluster corresponds to a trending topic.
• The online clustering algorithm works better because it uses tf-idf weights for all terms and finds the similarity between a tweet and all the currently available clusters.
• For example, consider the following tweets:
  – I love lady gaga
  – I love gandhi
  – Lady gaga is the best singer in the world
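A minimal single-pass clustering sketch in the spirit of the algorithm above; for brevity it uses raw term-count vectors where the real algorithm would use tf-idf weights, and the similarity threshold is an illustrative choice:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    num = sum(v * b.get(t, 0) for t, v in a.items())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def online_cluster(tweets, threshold=0.5):
    """Single pass over tweets in timeline order: each tweet joins the most
    similar existing cluster, or starts a new one. Each cluster is a
    (centroid, members) pair; every cluster is a candidate trending topic."""
    clusters = []
    for text in tweets:
        vec = Counter(text.lower().split())
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster[0])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append((vec, [text]))
        else:
            best[0].update(vec)   # fold the tweet into the centroid
            best[1].append(text)
    return clusters
```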
Solution for the problem in the previous slide (part of future work)
• Latent Dirichlet Allocation (LDA)
  – LDA is a bag-of-words model.
  – In LDA, each tweet is viewed as a mixture of various topics.
  – If a tweet contains a particular trending topic, it has a high probability of belonging to that topic.
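As an illustration of the idea (not the tool we would actually use), a tiny collapsed Gibbs sampler for LDA can assign each tweet a dominant topic; all parameter values here are illustrative defaults:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized tweets.
    Returns the dominant topic index for each document."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]  # topic per token
    ndk = [[0] * n_topics for _ in docs]                      # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]         # topic-word counts
    nk = [0] * n_topics                                       # tokens per topic
    for i, d in enumerate(docs):
        for j, w in enumerate(d):
            k = z[i][j]
            ndk[i][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for i, d in enumerate(docs):
            for j, w in enumerate(d):
                k = z[i][j]  # remove this token's current assignment
                ndk[i][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[i][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + vocab_size * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights)[0]  # resample topic
                z[i][j] = k
                ndk[i][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [max(range(n_topics), key=lambda t: ndk[i][t]) for i in range(len(docs))]
```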
Opinion-Target: What is it?
• Opinion words in an opinionated sentence are, in most cases, adjectives that act directly on a target. For example:
• I am so excited that the vacation is coming.
  – Here the opinion word is excited
  – and its target is I
• The water is green and clear.
  – Here the opinion word is green
  – and its target is water
• The Dream Lake is a beautiful place.
  – Here the opinion is beautiful
  – and its target is Dream Lake
Why Opinion-Target pairs? Motivation
• An opinionated sentence gives a sense of a person's general opinion on a *subject matter* or *topic*, called the *target* in the research literature.
• The *subject matter* or *topic* is diverse. For example, it could be a travel article that deals with several tourist attractions.
• The last two examples on the previous slide are tourism-related opinions.
• Opinions change over the course of time. Example:
  – At time t1, user p's view: place x has a really good scenic view, let's go for it.
  – At time t2 = t1 + 1 year, user p's view: place y has a better scenic view than place x.
  – The user's opinion about tourist place x has changed from positive to negative over the one-year timeframe.
• This suggests that by listening to a user's posts (tweets in our case), we can build a profile that shows whether the user's interest in a particular topic has changed over time.
Extraction of Opinion-Target Pairs
• The Stanford parser was run over the tweets to give us dependencies among the entities in a tweet.
• The following 5 rules were applied to the dependency information from the previous step to generate opinion-target pairs:
  – Direct Object Rule
    • dobj(opinion, target)
    • I love(opinion1) Firefox(target1) and defended(opinion2) it.
  – Nominal Subject Rule
    • nsubj(opinion, target)
    • IE(target) breaks(opinion) with everything.
  – Adjective Modifier Rule
    • amod(target, opinion)
    • The annoying(opinion) popup(target)
    • The opinion is the adjectival modifier of the target.
  – Prepositional Object Rule
    • If prep(target1, IN) => pobj(IN, target2)
    • The prepositional object of a known target is also a target of the same opinion.
    • The annoying(op) popup(tar1) in IE(tar2)
  – Recursive Modifiers Rule
    • If conj(adj2, opinion adj1) => amod(target, adj2)
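The first four rules can be sketched over (relation, head, dependent) triples as produced by the Stanford parser (`extract_pairs` is an illustrative helper; the Recursive Modifiers Rule is omitted for brevity):

```python
def extract_pairs(deps):
    """Apply the dependency rules to (relation, head, dependent) triples
    and return (opinion, target) pairs."""
    pairs = []
    for rel, head, dep in deps:
        if rel == "dobj":       # Direct Object Rule: dobj(opinion, target)
            pairs.append((head, dep))
        elif rel == "nsubj":    # Nominal Subject Rule: nsubj(opinion, target)
            pairs.append((head, dep))
        elif rel == "amod":     # Adjective Modifier Rule: amod(target, opinion)
            pairs.append((dep, head))
    # Prepositional Object Rule: the pobj of a known target shares its opinion.
    preps = [(h, d) for rel, h, d in deps if rel == "prep"]
    pobjs = [(h, d) for rel, h, d in deps if rel == "pobj"]
    for opinion, target in list(pairs):
        for head, prep_word in preps:
            if head == target:
                for pobj_head, new_target in pobjs:
                    if pobj_head == prep_word:
                        pairs.append((opinion, new_target))
    return pairs
```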
What to do with the extracted Opinion-Target pairs?
• Once we have the opinion-target pairs, we use the subjectivity lexicon of (Wilson et al., 2005), which contains 8,221 words, to express the polarity of the opinion. The words are nothing but the opinions.
• Some samples from the lexicon:

type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative
type=weaksubj len=1 word1=ability pos1=noun stemmed1=n priorpolarity=positive
type=weaksubj len=1 word1=above pos1=anypos stemmed1=n priorpolarity=positive
type=strongsubj len=1 word1=amazing pos1=adj stemmed1=n priorpolarity=positive
type=strongsubj len=1 word1=absolutely pos1=adj stemmed1=n priorpolarity=neutral
type=weaksubj len=1 word1=absorbed pos1=verb stemmed1=n priorpolarity=neutral
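Each lexicon line is a flat list of key=value fields, so the prior polarities can be loaded with a few lines of parsing (a sketch; the helper names are ours):

```python
def parse_lexicon_line(line):
    """Parse one MPQA subjectivity-lexicon line into a field dict."""
    return dict(field.split("=", 1) for field in line.split())

def load_polarity(lines):
    """Map word -> prior polarity (later entries overwrite earlier ones)."""
    polarity = {}
    for line in lines:
        entry = parse_lexicon_line(line)
        polarity[entry["word1"]] = entry["priorpolarity"]
    return polarity
```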
What do the extracted Opinion-Target pairs look like?

Tweet_id  Opinion-Target pair  Polarity-Target pair
Tweet:3   will-I           I+
Tweet:6   evil-I           I-
Tweet:7   best-books       books+
Tweet:7   cry-me           me-
Tweet:7   think-what       what*
Tweet:7   think-you        you*
Tweet:9   amazing-houses   houses+
Tweet:10  love-I           I+

Tweet: <More amazing gingerbread houses>
The opinion-target pair extracted from this tweet is <amazing-houses>, and the corresponding polarity-target pair is <houses+>, where amazing has positive prior polarity.
What Next?
• Integration
• Next, we apply these polarity-target pairs to the tweets to get useful information about the interests of a person.
Integrating opinion-target pairs with tweets of a trending topic
• Trending topics are those that are immediately popular in the Twitter world.
• They help people discover the *most breaking* news stories from across the world.
• Polarity-target pairs are applied to the tweets of a trending topic to find the opinion of a person w.r.t. that trending topic.
• This gives us the user who posted the tweet, the actual message, the topic the tweet is about, and the corresponding polarity. For example, the following tweet has tweet_id 746 in a general tweet file:

RT @GirlOnMission Hasn't Obama's warranty runs out yet? // It was a limited warranty covering nothing substantial anyway!

The above tweet has also been tagged as a trending-topic tweet under the trending topic *Obama*. The opinion-target and polarity-target pairs for this tweet, generated using the five rules we discussed, are as follows:

tweet_id: 746  limited-warranty     warranty-
tweet_id: 746  substantial-nothing  nothing+

We have a matched tweet_id in both scenarios, which gives us the opinion-target and polarity-target pairs that will be used for argument detection and profile construction.
Drawback
• The polarity we have talked about is the prior polarity. It does not take the *context of the sentence* into consideration.
• However, OpinionFinder does!
• Why prior polarity is not always effective is explained with an example in later slides.
OpinionFinder
• Software developed at the University of Pittsburgh to predict the contextual polarity of sentences. It was designed mainly for documents and has limitations on the size of the file it can process, as well as in its sentence-splitting module.
• We modified the software to suit our purpose, i.e., tweets.
• It is extremely slow: it processes a 27 MB file in approximately 12 hours (the total tweet file size is 18 GB).
How is OpinionFinder different from the conventional Opinion-Target/Polarity-Target pairs?
• Consider a tweet: <No one is happy with Barack Obama's healthcare plan…>
• Intuitively, we can say that the above tweet has a negative connotation.
• How is this depicted in OpinionFinder?
• Output of OpinionFinder:
<MPQASENT autoclass1="unknown" autoclass2="obj" diff="3.1">No one is <MPQASD><MPQAPOL autoclass="negative">happy</MPQAPOL></MPQASD> with Barack <MPQASRC>Obama</MPQASRC>'s healthcare plan</MPQASENT>.
• Output of the conventional 5-rule system:
  – Tweet:1 happy-one one+
What does the contextual-polarity output look like?
• <MPQASENT autoclass1="obj" autoclass2="obj" diff="21.9">@A_ClayChillin what the <MPQAPOL autoclass="negative">hell</MPQAPOL> did she do, <MPQASD>push</MPQASD> him out the truck?</MPQASENT>
• <MPQASENT autoclass1="unknown" autoclass2="subj" diff="1.7"><MPQASRC>i</MPQASRC> <MPQASD>think</MPQASD> its time for me to go back to bed, dnl going to bed at 3 and waking up at 7. yuck.</MPQASENT>
• <MPQASENT autoclass1="obj" autoclass2="obj" diff="25.2">Quebecor veut vendre Jobboom - secteurs-d-activite - LesAffaires.com - http://bit.ly/5dlkE5</MPQASENT>
• <MPQASENT autoclass1="unknown" autoclass2="obj" diff="7.4">wak @mirandamia maacii ya keik boltantemnyaaa,, *senaaaaannggg*</MPQASENT>
• <MPQASENT autoclass1="subj" autoclass2="subj" diff="12.5">90% of any <MPQAPOL autoclass="negative">pain</MPQAPOL> comes from trying to keep the <MPQAPOL autoclass="negative">pain</MPQAPOL> secret. You cannot keep a secret and let it go.</MPQASENT>
Status as of now
• OpinionFinder is slow.
• It runs a pipeline of internal modules, such as document preprocessing, sentence splitting, tokenization, POS tagging, feature finder, source finder, etc.
• ~20% of the tweets have been processed with OpinionFinder so far.
Argument Detection
• Argument detection is finding the arguments people use to support or oppose an issue.
  – Example:
    • Obama is bankrupting Americans, he does nothing to improve the economy just drain it.
    – "Obama" is the issue
    – "bankrupting Americans" and "does nothing to improve the economy just drain it" are the arguments
Argument Detection
• Motivation
  – To discover the reason why people show positive or negative opinions toward an issue.
  – If people suddenly change their opinion toward an issue because of a particular event, we can infer what the event is from the arguments they use.
  – Arguments will be used as an attribute in the user's profile.
  – One step in argument detection is classifying the polarity of a tweet toward the issue. From the result of polarity classification, we can infer the public's attitude toward the issue.
Argument Detection Approach
• Step 1. Given a trending topic, retrieve all the tweets associated with that trending topic. (Output from trending topic detection)
  – Treating trending topics as issues, we detect people's arguments for an issue from the tweets that are relevant to that topic.
• Step 2. Determine whether a tweet is subjective or neutral about the topic. (Ongoing)
  – Though some tweets belong to a certain trending topic, they don't show any subjective opinion toward the topic. Example:
    • Barack Obama makes banks an offer they can't refuse.
Argument Detection Approach
• Step 3. Polarity classification: decide whether the tweet is positive or negative toward the topic. (Ongoing)
  – After we get an argument from a tweet, we need to know whether the argument supports or opposes the issue, so we should know the polarity of the tweet first.
• Step 4. Get all opinion-target pairs from the tweets that show positive or negative opinions, separately. (Output from opinion-target pairs)
  – We collect the arguments from those opinion-target pairs.
Argument Detection Approach
• Step 5. Determine whether an opinion-target pair can be used as an argument. (Ongoing)
  – Some opinion-target pairs can't be used as arguments. Example:
    • Tweet: I envy you guys with a leader like Obama.
    • Opinion-target pairs:
      – envy-I I- (this opinion-target pair can't be used as an argument)
  – Use mention co-reference and Mutual Information (MI) to find useful opinion-target pairs.
Argument Detection Approach
  – Find the targets that are co-referenced with the topic. Example:
    • Tweet: Obama is still the best president you Americans have had in a very long time :)
    • Opinion-target pairs:
      – best-president president+
    • Argument:
      – Positive president (best) (Obama and president are co-referenced)
Argument Detection Approach
  – MI is a quantity that measures the mutual dependence of two random variables. We calculate the MI between the topic and the target of each opinion-target pair; if the MI value exceeds a certain threshold, we keep the pair. Example:
    • Tweet: Obama's nasty army. They aren't funded yet but take a good look…
    • Opinion-target pairs:
      – nasty-army army-
    • Argument:
      – Negative army (nasty). (The MI between Obama and army is high)
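The MI filter can be approximated with pointwise mutual information over co-occurrence counts (a sketch; the exact MI formulation and threshold in our pipeline may differ):

```python
import math

def pmi(topic, target, tweets):
    """Pointwise mutual information between a topic word and a target word.
    tweets: list of token sets, one per tweet. High PMI suggests the target
    is strongly associated with the topic."""
    n = len(tweets)
    p_topic = sum(topic in t for t in tweets) / n
    p_target = sum(target in t for t in tweets) / n
    p_both = sum(topic in t and target in t for t in tweets) / n
    if not p_both:
        return float("-inf")   # never co-occur
    return math.log(p_both / (p_topic * p_target))
```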
Argument Detection Approach
• Step 6. Argument clustering (Ongoing)
  – Cluster arguments that have the same meaning into the same cluster using WordNet.
    • The targets of opinion-target pairs may have similar meanings, so we can cluster them. Example (we can combine troops and army into the same cluster):
      – Argument:
        » Negative troops (little)
        » Negative army (nasty)
Future work
• Trending Topic Detection
  – Two more algorithms to overcome the shortcomings of the naïve approach.
• Opinion-Target Pairs
  – Run OpinionFinder to parse all the tweets.
  – Compare the results of lexicon polarity and contextual polarity.
• Argument Detection
  – Identify whether a tweet is subjective or objective toward a topic.
  – Identify the polarity of the subjective tweets.
  – Identify the opinion-target pairs that are useful for argument detection.
  – Cluster opinion targets.
Future work
• User profile construction. A user's profile will include the following content:
  – All the tweets the user published
  – Location and description from the user's Twitter account profile
  – Predicted gender
  – Predicted interests
  – Opinion-target pairs
  – Trending topics the user has discussed, along with his or her opinions toward those topics
  – Arguments used to support or oppose a topic
  – …
Thank You!