Project Discussion 3
Prof. Vincent Ng, Jitendra Mohanty, Girish Vaidyanathan, Chen Chen
Agenda
• Brief recap of last presentation
• Things we have done so far
  – Finding interests of user
  – Finding user's gender
  – Finding trending topics
  – Finding opinion-target pairs
  – Argument detection
• Future plans
  – Argument detection
  – User profile construction
Brief recap of last presentation
• Finding interests of user
  – 20 categories of interest
• Algorithms
  – Neural network
  – Support Vector Machine
  – Passive Aggressive algorithm
• Data
  – Twitter data
  – Blog data
Data Preparation
Interest category distribution:
music 0.3305412193304446
photography 0.13342809041481524
art 0.10207545607595286
reading 0.3219854828471283
movie 0.19912786686170067
sport 0.2109817017635857
writing 0.1620484089090056
travel 0.12136726188833384
cooking 0.0761322551265421
fashion 0.0824524604642177
food 0.06361604062594872
politics 0.04773272983192118
god 0.05087903292578588
singing 0.055998675240802584
dancing 0.05427372836916623
family 0.05007865757734662
animal 0.04193690834322303
shopping 0.043261667540639745
game 0.046159578284988824
social media 0.05710264123864985
We have 120,778 users in total.
60% are used as training data.
20% are used as development data (for tuning learning algorithm parameters).
20% are used as testing data.
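The 60/20/20 split can be sketched as below (a minimal illustration; `split_users` is a hypothetical helper name, not the actual pipeline code):

```python
import random

def split_users(user_ids, seed=0):
    """Shuffle the users and split them 60/20/20 into train/dev/test."""
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_dev = int(0.6 * n), int(0.2 * n)
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]
    return train, dev, test

# For 120,778 users this yields 72,466 / 24,155 / 24,157 users.
train, dev, test = split_users(range(120778))
```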
Finding interests of user
• Feature groups
  – POS sequence: 1,003
  – Named entities: 14
  – Social linguistic: 37,063
  – Bigram: 193,985
  – Unigram: 273,985
  – Unigram for description: 18,855
  – Bigram for description: 15,754
  – N-gram for user name: 17,482
  – N-gram for screen name: 19,944
  – Total: 578,085
Finding interests of user
• Measures
  – Precision: the fraction of users predicted to have an interest who really have it.
  – Recall: the fraction of users who have an interest that are correctly retrieved.
  – F-score.
  – Accuracy.
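For reference, all four measures can be computed from per-category binary labels as in this sketch (`prf` is a hypothetical helper name):

```python
def prf(gold, pred):
    """Per-interest precision, recall, F-score, and accuracy.
    gold/pred are parallel lists of 0/1 labels (has the interest or not)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    tn = len(gold) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gold)
    return precision, recall, f, accuracy
```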
Finding interests of user
• Neural Network Result

Interest F score Accuracy Precision Recall
Music 0.650616 0.773514 0.65797 0.643426
photography 0.535807 0.896423 0.65906 0.451391
art 0.504014 0.907932 0.574479 0.448947
reading 0.658378 0.764531 0.615774 0.707317
movie 0.574119 0.846456 0.632111 0.525873
sport 0.580578 0.832464 0.612107 0.552139
writing 0.604014 0.876677 0.662005 0.555365
travel 0.496454 0.888309 0.567406 0.441274
cooking 0.479604 0.930286 0.585219 0.406283
fashion 0.649794 0.947301 0.702558 0.604401
food 0.496868 0.950116 0.689455 0.388381
politics 0.604425 0.969656 0.769231 0.497778
god 0.532205 0.958809 0.643182 0.453889
singing 0.563721 0.961169 0.768061 0.445261
dancing 0.553832 0.960714 0.725369 0.447909
family 0.467952 0.957733 0.656433 0.363563
animal 0.532616 0.971229 0.8 0.399194
shopping 0.538219 0.969738 0.771739 0.413191
game 0.562929 0.968372 0.817276 0.429319
socialMedia 0.557978 0.963073 0.840299 0.417656
Finding interests of user
• Support Vector Machine Result

Interest F score Accuracy Precision Recall
Music 0.652349 0.76747 0.639563 0.665656
photography 0.524648 0.888227 0.600564 0.465771
art 0.486772 0.907642 0.578142 0.420342
reading 0.656884 0.766725 0.621858 0.69609
movie 0.57672 0.841033 0.605836 0.550273
sport 0.577859 0.837721 0.636838 0.528878
writing 0.599764 0.873862 0.648211 0.558054
travel 0.494236 0.887399 0.562183 0.440942
cooking 0.476624 0.928631 0.567197 0.410995
fashion 0.647953 0.950157 0.755798 0.567042
food 0.499016 0.947301 0.628345 0.413838
politics 0.592075 0.971022 0.85956 0.451556
god 0.522603 0.960217 0.686684 0.421812
singing 0.564531 0.961169 0.766709 0.44673
dancing 0.568401 0.962908 0.775296 0.448669
family 0.472056 0.95972 0.715461 0.352227
animal 0.52027 0.970608 0.788934 0.388105
shopping 0.550218 0.970152 0.770979 0.42774
game 0.563605 0.968331 0.813839 0.431065
socialMedia 0.569113 0.962577 0.796 0.442878
Finding interests of user
• Passive Aggressive Result

Interest F score Precision Recall
Music 0.64241 0.614073 0.673487
photography 0.517897 0.52488 0.511097
art 0.484836 0.511926 0.460469
reading 0.651151 0.621134 0.684217
movie 0.559173 0.579553 0.540177
sport 0.572767 0.622685 0.530258
writing 0.596787 0.646874 0.553899
travel 0.483696 0.532721 0.442933
cooking 0.472289 0.556028 0.410471
fashion 0.647539 0.718926 0.589048
food 0.486872 0.57231 0.423629
politics 0.585903 0.769899 0.472889
god 0.51585 0.643114 0.430634
singing 0.553899 0.763427 0.438648
dancing 0.555715 0.748711 0.441825
family 0.452219 0.636765 0.350607
animal 0.519975 0.666142 0.426411
shopping 0.537906 0.708399 0.43356
game 0.55376 0.765794 0.433682
socialMedia 0.552083 0.763089 0.432493
Finding interests of user
• Result Comparison

Algorithm Micro F-score Macro F-score
Support Vector Machine 0.58499 0.56396
Neural Network 0.58678 0.56504
Passive aggressive 0.57933 0.56027
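Micro F-score pools the true positives, false positives, and false negatives over all 20 categories before computing F, while macro F-score averages the per-category F-scores. A minimal sketch, assuming per-category (tp, fp, fn) counts are available:

```python
def micro_macro_f(per_label_counts):
    """per_label_counts: list of (tp, fp, fn) tuples, one per interest category."""
    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    # Macro: average of per-category F-scores (every category weighs equally).
    macro = sum(f1(*c) for c in per_label_counts) / len(per_label_counts)
    # Micro: F-score of the pooled counts (frequent categories weigh more).
    tp = sum(c[0] for c in per_label_counts)
    fp = sum(c[1] for c in per_label_counts)
    fn = sum(c[2] for c in per_label_counts)
    micro = f1(tp, fp, fn)
    return micro, macro
```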
Finding interests of user
• Result Analysis
• Recall analysis
  – Recall is low because of the small amount of data per user. Some users claim they have a certain interest but have not published any tweets or blogs related to it. Once they publish such tweets or blogs in the future, our system can make the right prediction.
• Precision analysis
  – We found cases where people have published tweets or blogs related to a certain interest but do not list it as one of their interests.
  – Precision is higher than recall for most interest categories.
Finding interests of user
• Including more features
  – Tweet POS sequence: 1,003
  – Named entities: 14
  – Social linguistic: 37,063
  – Bigram for tweets and blogs: 193,985
  – Unigram for tweets and blogs: 273,985
  – Unigram for description: 18,855
  – Bigram for description: 15,754
  – N-gram for user name: 17,482
  – N-gram for screen name: 19,944
  – Gender: 2
  – Blog POS sequence: 3,075
  – Unigram for "About me": 18,753
  – Bigram for "About me": 12,391
  – Industry: 39
  – Location: 4,505
  – Occupation: 332
  – Total: 617,180
Finding interests of user
• Neural network result after including more features

Interest Improved F score Old F score
Music 0.661115 0.650616
photography 0.556041 0.535807
art 0.515067 0.504014
reading 0.664573 0.658378
movie 0.583581 0.574119
sport 0.594504 0.580578
writing 0.617632 0.604014
travel 0.504436 0.496454
cooking 0.49019 0.479604
fashion 0.680579 0.649794
food 0.505812 0.496868
politics 0.613917 0.604425
god 0.542065 0.532205
singing 0.570498 0.563721
dancing 0.565947 0.553832
family 0.469649 0.467952
animal 0.529793 0.532616
shopping 0.54717 0.538219
game 0.571116 0.562929
socialMedia 0.56829 0.557978
Finding interests of user
• Result after including more features
• After including the additional features, we predict nearly all interest categories better than before (animal is the one slight exception). Several categories, such as music, photography, sport, writing, fashion, and dancing, improve by more than 1 point.
Model Micro F-score Macro F-score
Neural Network 0.58678 0.56504
Neural Network (More features) 0.59780 0.57582
Finding interests of user
• Feature Analysis
  – There are 16 feature groups in total. We delete one feature group at a time and observe the result.
  – Since the neural network gives the best result, we use it for this analysis.
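The leave-one-group-out procedure can be sketched as follows; `train_and_eval` stands in for the real training pipeline and is purely illustrative:

```python
def ablation(feature_groups, train_and_eval):
    """Retrain without each feature group in turn and report the change in
    score relative to the all-features baseline.
    train_and_eval(groups) -> score is a stand-in for the real pipeline."""
    baseline = train_and_eval(set(feature_groups))
    deltas = {}
    for group in feature_groups:
        score = train_and_eval(set(feature_groups) - {group})
        deltas[group] = score - baseline  # > 0 means removing the group helped
    return baseline, deltas
```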
Feature Analysis result

Neural Network Micro F-score Macro F-score
All features 0.5978 0.57582
remove tweet pos 0.6041 0.58344
remove ner 0.59739 0.57412
remove social linguistic 0.60028 0.57929
remove bigram 0.55531 0.52591
remove unigram 0.59273 0.57005
remove description unigram 0.59579 0.57388
remove description bigram 0.59622 0.57491
remove name ngram 0.59945 0.57679
remove screen name ngram 0.60034 0.57693
remove gender 0.59737 0.57463
remove blog pos 0.60167 0.57937
remove aboutMe unigram 0.59443 0.57166
remove aboutMe bigram 0.59831 0.57594
remove industry 0.5986 0.57557
remove location 0.59847 0.57524
remove occupation 0.59905 0.57676
Finding gender of user
• Motivation
  – Helps to construct the user's profile
  – Helps to compare opinions between genders
• Data
  – Tweet data
  – Blog data
• Feature groups:
  – POS sequence: 1,003
  – Named entities: 14
  – Social linguistic: 37,063
  – Bigram: 193,985
  – Unigram: 273,985
  – Unigram for description: 18,855
  – Bigram for description: 15,754
  – N-gram for user name: 17,482
  – N-gram for screen name: 19,944
  – Interest features *
  – Total: 578,085 + 20 (interest features)
Gender Distribution

Gender Count Percentage
Male 47404 39.25%
Female 73374 60.75%
Finding gender of user
• Result:
  – If we can improve our interest prediction, it may be possible to improve the gender prediction as well.
Neural Network F-Score Accuracy Precision Recall
No interest features 0.88528 0.909339 0.882414 0.888165
Real interest features 0.889391 0.912858 0.889251 0.889531
predicted interest features 0.884091 0.908429 0.881505 0.886693
Support Vector Machine F-Score Accuracy Precision Recall
No interest features 0.868892 0.894809 0.85335 0.885012
Real interest features 0.873173 0.897996 0.855558 0.891528
predicted interest features 0.868175 0.893857 0.849738 0.887429
Passive Aggressive Algorithm F-Score Precision Recall
No interest features 0.872402 0.862544 0.882489
Real interest features 0.876564 0.852488 0.902039
predicted interest features 0.870567 0.861458 0.879872
Finding Trending Topics
• Motivation
  – Helps find the interesting topics that attract people's attention
  – Trending topics are helpful in argument detection
• Possible Approaches
  – Naïve approach
  – Online clustering
  – Latent Dirichlet Allocation (LDA)
Trending Topics Results
• Naïve approach: a brief recap
  – Visit every tweet in our dataset and find the words and phrases that occur frequently.
  – These are the probable trending topics in our dataset.
  – The timeframe of a trending topic can be found in a similar fashion: keep track of the timestamp associated with every tweet and find the minimum and maximum timestamp for every word/phrase.
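The naïve approach can be sketched as a single pass over (timestamp, text) pairs (an illustration, not the actual implementation; the count threshold is an assumption):

```python
from collections import defaultdict

def naive_trending(tweets, min_count=2):
    """tweets: iterable of (timestamp, text) pairs.
    Counts every distinct word per tweet and tracks each word's earliest and
    latest timestamp; words above min_count are candidate trending topics."""
    counts = defaultdict(int)
    span = {}
    for ts, text in tweets:
        for word in set(text.lower().split()):
            counts[word] += 1
            lo, hi = span.get(word, (ts, ts))
            span[word] = (min(lo, ts), max(hi, ts))
    return {w: (c, span[w]) for w, c in counts.items() if c >= min_count}
```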
Some results for the month of December in our dataset, in chronological order, using the naïve approach (12/10/2009–12/17/2009)

(Table: top trending words and counts for each day of the week; recurring entries include christmas, iphone, obama, justinbieber, bieber, tiger woods, gaga, avatar, apple, copenhagen, blackberry, president, #ifsantawasblack, and #iranelection.)
Topics over the month of December (12/10/2009–12/17/2009)

Topic | Days trending (of 7) | Total count
apple | 5 | 5,673
campaign | 2 | 1,434
allah | 1 | 659
jesus | 2 | 1,083
#iphone | 1 | 800
copenhagen | 4 | 3,523
shooting | 1 | 607
#in2010 | 1 | 1,024
#thoushallnot | 1 | 3,272
arms | 2 | 1,457
gaga | 6 | 6,775
iphone | 6 | 19,016
#ifsantawasblack | 3 | 5,186
christmas | 6 | 83,129
avatar | 5 | 6,636
ladygaga | 1 | 503
#thisiswar | 1 | 682
congress | 1 | 652
bieber | 4 | 6,154
president | 3 | 3,465
obama | 7 | 13,074
#iranelection | 2 | 1,640
justinbieber | 7 | 13,960
tiger woods | 5 | 7,222
blackberry | 4 | 4,008
Problems in the Naïve approach
• Many irrelevant words or phrases with large counts are treated as trending topics.
• For example:
  – Youtube video
  – I arrived
  – Watching movie
Solution for the problem in the previous slide (part of future work)
• Possible solutions
  – Online clustering
  – Latent Dirichlet Allocation
Solution for the problem in the previous slide (part of future work)
• Online clustering algorithm
  – All the tweets are ordered on the timeline.
  – Tweets are represented as vectors of tf-idf (term frequency × inverse document frequency) weights.
  – Tweets with the highest similarities are clustered together.
  – Every cluster corresponds to a trending topic.
• The online clustering algorithm works better because it uses tf-idf weights for all terms and finds the similarity between a tweet and all the currently available clusters.
• For example, consider the following tweets:
  – I love lady gaga
  – I love gandhi
  – Lady gaga is the best singer in the world
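A minimal single-pass clustering sketch in the spirit of the algorithm above; for brevity it uses raw term-count vectors where the real algorithm would use tf-idf weights, and the similarity threshold is an illustrative choice:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    num = sum(v * b.get(t, 0) for t, v in a.items())
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def online_cluster(tweets, threshold=0.5):
    """Single pass over tweets in timeline order: each tweet joins the most
    similar existing cluster, or starts a new one. Each cluster is a
    (centroid, members) pair; every cluster is a candidate trending topic."""
    clusters = []
    for text in tweets:
        vec = Counter(text.lower().split())
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster[0])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append((vec, [text]))
        else:
            best[0].update(vec)   # fold the tweet into the centroid
            best[1].append(text)
    return clusters
```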
Solution for the problem in the previous slide (part of future work)
• Latent Dirichlet Allocation (LDA)
  – LDA is a bag-of-words model.
  – In LDA, each tweet is viewed as a mixture of various topics.
  – If a tweet contains a particular trending topic, it has a high probability of belonging to that topic.
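As an illustration of the idea (not the tool we would actually use), a tiny collapsed Gibbs sampler for LDA can assign each tweet a dominant topic; all parameter values here are illustrative defaults:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics=2, iters=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized tweets.
    Returns the dominant topic index for each document."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]  # topic per token
    ndk = [[0] * n_topics for _ in docs]                      # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]         # topic-word counts
    nk = [0] * n_topics                                       # tokens per topic
    for i, d in enumerate(docs):
        for j, w in enumerate(d):
            k = z[i][j]
            ndk[i][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for i, d in enumerate(docs):
            for j, w in enumerate(d):
                k = z[i][j]  # remove this token's current assignment
                ndk[i][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                weights = [(ndk[i][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + vocab_size * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights)[0]  # resample topic
                z[i][j] = k
                ndk[i][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return [max(range(n_topics), key=lambda t: ndk[i][t]) for i in range(len(docs))]
```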
Opinion-Target: What is it?
• Opinion words in an opinionated sentence are, in most cases, adjectives that act directly on a target. For example:
• I am so excited that the vacation is coming.
  – Here the opinion word is excited
  – and its target is I
• The water is green and clear.
  – Here the opinion word is green
  – and its target is water
• The Dream Lake is a beautiful place.
  – Here the opinion is beautiful
  – and its target is Dream Lake
Why Opinion-Target pairs? Motivation
• An opinionated sentence gives a sense of a person's general opinion on a *subject matter* or *topic*, called the *target* in the research literature.
• The *subject matter* or *topic* is diverse. For example, it could be a travel article that deals with several tourist attractions.
• The last two examples on the previous slide are tourism-related opinions.
• Opinions change over the course of time. Example:
  – At time t1, user p's view: place x has a really good scenic view, let's go for it.
  – At time t2 = t1 + 1 year, user p's view: place y has a better scenic view than place x.
  – The user's opinion about tourist place x has changed from positive to negative over the one-year timeframe.
• This suggests that by listening to a user's posts (tweets in our case), we can build a profile that shows whether the user's interest in a particular topic has changed over time.
Extraction of Opinion-Target Pairs
• The Stanford parser was run over the tweets to give us dependencies among the entities in a tweet.
• The following 5 rules were applied to the dependency information from the previous step to generate opinion-target pairs:
  – Direct Object Rule
    • dobj(opinion, target)
    • I love(opinion1) Firefox(target1) and defended(opinion2) it.
  – Nominal Subject Rule
    • nsubj(opinion, target)
    • IE(target) breaks(opinion) with everything.
  – Adjective Modifier Rule
    • amod(target, opinion)
    • The annoying(opinion) popup(target)
    • The opinion is the adjectival modifier of the target.
  – Prepositional Object Rule
    • If prep(target1, IN) => pobj(IN, target2)
    • The prepositional object of a known target is also a target of the same opinion.
    • The annoying(op) popup(tar1) in IE(tar2)
  – Recursive Modifiers Rule
    • If conj(adj2, opinion adj1) => amod(target, adj2)
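The first four rules can be sketched over (relation, head, dependent) triples as produced by the Stanford parser (`extract_pairs` is an illustrative helper; the Recursive Modifiers Rule is omitted for brevity):

```python
def extract_pairs(deps):
    """Apply the dependency rules to (relation, head, dependent) triples
    and return (opinion, target) pairs."""
    pairs = []
    for rel, head, dep in deps:
        if rel == "dobj":       # Direct Object Rule: dobj(opinion, target)
            pairs.append((head, dep))
        elif rel == "nsubj":    # Nominal Subject Rule: nsubj(opinion, target)
            pairs.append((head, dep))
        elif rel == "amod":     # Adjective Modifier Rule: amod(target, opinion)
            pairs.append((dep, head))
    # Prepositional Object Rule: the pobj of a known target shares its opinion.
    preps = [(h, d) for rel, h, d in deps if rel == "prep"]
    pobjs = [(h, d) for rel, h, d in deps if rel == "pobj"]
    for opinion, target in list(pairs):
        for head, prep_word in preps:
            if head == target:
                for pobj_head, new_target in pobjs:
                    if pobj_head == prep_word:
                        pairs.append((opinion, new_target))
    return pairs
```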
What to do with the extracted Opinion-Target pairs?
• Once we have the opinion-target pairs, we use the subjectivity lexicon of (Wilson et al., 2005), which contains 8,221 words, to express the polarity of the opinion. The words are nothing but the opinions.
• Some samples from the lexicon:

type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
type=weaksubj len=1 word1=abandon pos1=verb stemmed1=y priorpolarity=negative
type=weaksubj len=1 word1=ability pos1=noun stemmed1=n priorpolarity=positive
type=weaksubj len=1 word1=above pos1=anypos stemmed1=n priorpolarity=positive
type=strongsubj len=1 word1=amazing pos1=adj stemmed1=n priorpolarity=positive
type=strongsubj len=1 word1=absolutely pos1=adj stemmed1=n priorpolarity=neutral
type=weaksubj len=1 word1=absorbed pos1=verb stemmed1=n priorpolarity=neutral
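Each lexicon line is a flat list of key=value fields, so the prior polarities can be loaded with a few lines of parsing (a sketch; the helper names are ours):

```python
def parse_lexicon_line(line):
    """Parse one MPQA subjectivity-lexicon line into a field dict."""
    return dict(field.split("=", 1) for field in line.split())

def load_polarity(lines):
    """Map word -> prior polarity (later entries overwrite earlier ones)."""
    polarity = {}
    for line in lines:
        entry = parse_lexicon_line(line)
        polarity[entry["word1"]] = entry["priorpolarity"]
    return polarity
```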
What do the extracted Opinion-Target pairs look like?

Tweet_id  Opinion-Target pair  Polarity-Target pair
Tweet:3   will-I           I+
Tweet:6   evil-I           I-
Tweet:7   best-books       books+
Tweet:7   cry-me           me-
Tweet:7   think-what       what*
Tweet:7   think-you        you*
Tweet:9   amazing-houses   houses+
Tweet:10  love-I           I+

Tweet: <More amazing gingerbread houses>
The opinion-target pair extracted from this tweet is <amazing-houses>, and the corresponding polarity-target pair is <houses+>, where amazing has positive prior polarity.
What Next?
• Integration
• Next, we apply these polarity-target pairs to the tweets to get useful information about the interests of a person.
Integrating opinion-target pairs with tweets of a trending topic
• Trending topics are those that are immediately popular in the Twitter world.
• They help people discover the *most breaking* news stories from across the world.
• Polarity-target pairs are applied to the tweets of a trending topic to find the opinion of a person w.r.t. that trending topic.
• This gives us the user who posted the tweet, the actual message, the topic the tweet is about, and the corresponding polarity. For example, the following tweet has tweet_id 746 in a general tweet file:

RT @GirlOnMission Hasn't Obama's warranty runs out yet? // It was a limited warranty covering nothing substantial anyway!

The above tweet has also been tagged as a trending-topic tweet under the trending topic *Obama*. The opinion-target and polarity-target pairs for this tweet, generated using the five rules we discussed, are as follows:

tweet_id: 746  limited-warranty     warranty-
tweet_id: 746  substantial-nothing  nothing+

We have a matched tweet_id in both scenarios, which gives us the opinion-target and polarity-target pairs that will be used for argument detection and profile construction.
Drawback
• The polarity we have talked about is the prior polarity. It does not take the *context of the sentence* into consideration.
• However, OpinionFinder does!
• Why prior polarity is not always effective is explained with an example in later slides.
OpinionFinder
• Software developed at the University of Pittsburgh to predict the contextual polarity of sentences. It was designed mainly for documents and has limitations on the size of the file it can process, as well as in its sentence-splitting module.
• We modified the software to suit our purpose, i.e., tweets.
• It is extremely slow: it processes a 27 MB file in approximately 12 hours (the total tweet file size is 18 GB).
How is OpinionFinder different from the conventional Opinion-Target/Polarity-Target pairs?
• Consider a tweet: <No one is happy with Barack Obama's healthcare plan…>
• Intuitively, we can say that the above tweet has a negative connotation.
• How is this depicted in OpinionFinder?
• Output of OpinionFinder:
<MPQASENT autoclass1="unknown" autoclass2="obj" diff="3.1">No one is <MPQASD><MPQAPOL autoclass="negative">happy</MPQAPOL></MPQASD> with Barack <MPQASRC>Obama</MPQASRC>'s healthcare plan</MPQASENT>.
• Output of the conventional 5-rule system:
  – Tweet:1 happy-one one+
What does the contextual-polarity output look like?
• <MPQASENT autoclass1="obj" autoclass2="obj" diff="21.9">@A_ClayChillin what the <MPQAPOL autoclass="negative">hell</MPQAPOL> did she do, <MPQASD>push</MPQASD> him out the truck?</MPQASENT>
• <MPQASENT autoclass1="unknown" autoclass2="subj" diff="1.7"><MPQASRC>i</MPQASRC> <MPQASD>think</MPQASD> its time for me to go back to bed, dnl going to bed at 3 and waking up at 7. yuck.</MPQASENT>
• <MPQASENT autoclass1="obj" autoclass2="obj" diff="25.2">Quebecor veut vendre Jobboom - secteurs-d-activite - LesAffaires.com - http://bit.ly/5dlkE5</MPQASENT>
• <MPQASENT autoclass1="unknown" autoclass2="obj" diff="7.4">wak @mirandamia maacii ya keik boltantemnyaaa,, *senaaaaannggg*</MPQASENT>
• <MPQASENT autoclass1="subj" autoclass2="subj" diff="12.5">90% of any <MPQAPOL autoclass="negative">pain</MPQAPOL> comes from trying to keep the <MPQAPOL autoclass="negative">pain</MPQAPOL> secret. You cannot keep a secret and let it go.</MPQASENT>
Status as of now
• OpinionFinder is slow.
• It runs a pipeline of internal modules, such as document preprocessing, sentence splitting, tokenization, POS tagging, feature finder, source finder, etc.
• ~20% of the tweets have been processed with OpinionFinder so far.
Argument Detection
• Argument detection is finding the arguments people use to support or oppose an issue.
  – Example:
    • Obama is bankrupting Americans, he does nothing to improve the economy just drain it.
    – "Obama" is the issue
    – "bankrupting Americans" and "does nothing to improve the economy just drain it" are the arguments
Argument Detection
• Motivation
  – To discover the reason why people show positive or negative opinions toward an issue.
  – If people suddenly change their opinion toward an issue because of a particular event, we can infer what the event is from the arguments they use.
  – Arguments will be used as an attribute in the user's profile.
  – One step in argument detection is classifying the polarity of a tweet toward the issue. From the result of polarity classification, we can infer the public's attitude toward the issue.
Argument Detection Approach
• Step 1. Given a trending topic, retrieve all the tweets associated with that trending topic. (Output from trending topic detection)
  – Treating trending topics as issues, we detect people's arguments for an issue from the tweets that are relevant to that topic.
• Step 2. Determine whether a tweet is subjective or neutral about the topic. (Ongoing)
  – Though some tweets belong to a certain trending topic, they don't show any subjective opinion toward the topic. Example:
    • Barack Obama makes banks an offer they can't refuse.
Argument Detection Approach
• Step 3. Polarity classification: decide whether the tweet is positive or negative toward the topic. (Ongoing)
  – After we get an argument from a tweet, we need to know whether the argument supports or opposes the issue, so we should know the polarity of the tweet first.
• Step 4. Get all opinion-target pairs from the tweets that show positive or negative opinions, separately. (Output from opinion-target pairs)
  – We collect the arguments from those opinion-target pairs.
Argument Detection Approach
• Step 5. Determine whether an opinion-target pair can be used as an argument. (Ongoing)
  – Some opinion-target pairs can't be used as arguments. Example:
    • Tweet: I envy you guys with a leader like Obama.
    • Opinion-target pairs:
      – envy-I I- (this opinion-target pair can't be used as an argument)
  – Use mention co-reference and Mutual Information (MI) to find useful opinion-target pairs.
Argument Detection Approach
  – Find the targets that are co-referenced with the topic. Example:
    • Tweet: Obama is still the best president you Americans have had in a very long time :)
    • Opinion-target pairs:
      – best-president president+
    • Argument:
      – Positive president (best) (Obama and president are co-referenced)
Argument Detection Approach
  – MI is a quantity that measures the mutual dependence of two random variables. We calculate the MI between the topic and the target of each opinion-target pair; if the MI value exceeds a certain threshold, we keep the pair. Example:
    • Tweet: Obama's nasty army. They aren't funded yet but take a good look…
    • Opinion-target pairs:
      – nasty-army army-
    • Argument:
      – Negative army (nasty). (The MI between Obama and army is high)
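The MI filter can be approximated with pointwise mutual information over co-occurrence counts (a sketch; the exact MI formulation and threshold in our pipeline may differ):

```python
import math

def pmi(topic, target, tweets):
    """Pointwise mutual information between a topic word and a target word.
    tweets: list of token sets, one per tweet. High PMI suggests the target
    is strongly associated with the topic."""
    n = len(tweets)
    p_topic = sum(topic in t for t in tweets) / n
    p_target = sum(target in t for t in tweets) / n
    p_both = sum(topic in t and target in t for t in tweets) / n
    if not p_both:
        return float("-inf")   # never co-occur
    return math.log(p_both / (p_topic * p_target))
```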
Argument Detection Approach
• Step 6. Argument clustering (Ongoing)
  – Cluster arguments that have the same meaning into the same cluster using WordNet.
    • The targets of opinion-target pairs may have similar meanings, so we can cluster them. Example (we can combine troops and army into the same cluster):
      – Argument:
        » Negative troops (little)
        » Negative army (nasty)
Future work
• Trending Topic Detection
  – Two more algorithms to overcome the shortcomings of the naïve approach.
• Opinion-Target Pairs
  – Run OpinionFinder to parse all the tweets.
  – Compare the results of lexicon polarity and contextual polarity.
• Argument Detection
  – Identify whether a tweet is subjective or objective toward a topic.
  – Identify the polarity of the subjective tweets.
  – Identify the opinion-target pairs that are useful for argument detection.
  – Cluster opinion targets.
Future work
• User profile construction. A user's profile will include the following content:
  – All the tweets the user published
  – Location and description from the user's Twitter account profile
  – Predicted gender
  – Predicted interests
  – Opinion-target pairs
  – Trending topics the user has discussed, along with his or her opinions toward those topics
  – Arguments used to support or oppose a topic
  – …
Thank You!