
Prediction of Trends in Online Social Network

M.Tech Thesis

submitted in partial fulfillment of the requirements for the degree of

Dual Degree in Computer Science and Engineering

by

Pranay Agarwal

Entry No: 2008CS50220

under the guidance of

Dr. Amitabha Bagchi

and

Dr. Maya Ramanath

Department of Computer Science and Engineering
Indian Institute of Technology

New Delhi - 110016.

2013

Dissertation Approval Sheet

This is to certify that the dissertation titled

Prediction of Trends in Online Social Network

By

Pranay Agarwal (2008CS50220)

is approved for the degree of Dual Degree in Computer Science and Engineering.

Dr. Amitabha Bagchi (Guide)

Dr. Maya Ramanath (Guide)

Date : June 28, 2013

Abstract

In the information age, the magnitude of online social media activity has reached an unprecedented level. Millions of users participate in social awareness streams such as social networks and microblogging services to consume and spread information in the network. Twitter is one such very popular online social networking and microblogging service, which enables hundreds of millions of users to share short messages in real time about events worth broad attention, expressing public opinion. In aggregate, these messages indicate the interests and attention of local and global communities, which are known as temporal trends on Twitter. Many events and topics are discussed on Twitter; some of them get enough attention to become a “trend” and some do not. Detecting these trends in online social networks has become an important problem that has attracted the attention of both the industry and the research community in recent years.

Our challenge is to automatically detect and analyze the emerging topics or trends that appear in the stream, and to distinguish trending topics from non-trending topics in their early stage. What is unique about our approach is that we use the directed “following” links of Twitter. These directed links determine the flow of information and hence indicate a user’s influence on others, which is a crucial phenomenon in any topic becoming trending or viral in a social network.


Acknowledgments

I express my gratitude towards my guides Dr. Amitabha Bagchi and Dr. Maya Ramanath for their constant guidance, motivation and help during the course of the project.

I would also like to thank Jyoti Ahuja, Muthusamy Chelliah and Mridul Jain for making this research collaboration with Yahoo! possible and also for their valuable insights, suggestions and help throughout the project.

I am also grateful to the CSE Department, IITD, and all its staff and lab in-charges for providing us with a great working environment in the labs.

Pranay Agarwal
IIT Delhi

June, 2013


Declaration of Authorship

I, Pranay Agarwal, declare that this thesis titled “Prediction of Trends in Online Social Network” and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a Dual Degree in Computer Science and Engineering at IIT Delhi.

• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.

• Where I have consulted the published work of others, this is always clearly attributed.

• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.

• I have acknowledged all main sources of help.

• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Signed:

Date:


Contents

Abstract

Acknowledgments

List of Figures

1 Introduction
    1.1 Problem Statement
    1.2 Organization

2 Related Work
    2.1 Filtering model
    2.2 Building Social Graph for twitter
    2.3 Evolving Graph

3 A model to mine and filter tweets
    3.1 “Informative” tweets
        3.1.1 NLP tools for tweets
        3.1.2 Model Feature weights
    3.2 Preprocessing and Scaling
        3.2.1 Efficient two phase modeling

4 Constructing social graph
    4.1 Collecting data via Twitter API
        4.1.1 Resolving relations for “good” users
    4.2 Storing the graph in Graph Database Neo4j
    4.3 Analysis of the Constructed Social Graph

5 Evolving graph
    5.1 Tweets clusters
        5.1.1 Topic detection
        5.1.2 N gram matching.
        5.1.3 Clustering
    5.2 Evolving graph of a topic
    5.3 Experiment with few topics
    5.4 Analysis and limitations

6 Model Evaluation
    6.1 Conductance
    6.2 Density and largest component
    6.3 Analysis
    6.4 Limitations

7 Conclusions
    7.1 Contributions
    7.2 Future Work

List of Figures

1.1 Trending topic “Cricket”
1.2 Non-trending topic “Football”

3.1 Red : “chat” tweets, Blue : “informative” tweets
3.2 Interface to experiment with features on left and ranked tweets on right
3.3 Pipeline showing ingestion of tweets to ranked output

5.1 IRS Scandal
5.2 Angelina Jolie
5.3 Cannes Film Festival
5.4 The great gatsby
5.5 Number of (tweets, nodes, disjoint components) against time
5.6 IRS Scandal
5.7 IRS Scandal
5.8 Angelina Jolie
5.9 Angelina Jolie
5.10 Cannes Film Festival
5.11 Cannes Film Festival
5.12 The great gatsby
5.13 The great gatsby
5.14 Size of largest and second largest component against time

6.1 Very low trend, topic 1
6.2 Mild trend, topic 2
6.3 trend, topic 3
6.4 High trend, topic 4
6.5 Conductance, Largest component size against Time
6.6 Very low trend, topic 1
6.7 Very low trend, topic 1
6.8 Mild trend, topic 2
6.9 Mild trend, topic 2
6.10 trend, topic 3
6.11 trend, topic 3
6.12 High trend, topic 4
6.13 High trend, topic 4
6.14 Ratios of number of edges, size of largest component with number of nodes

Chapter 1

Introduction

In the information age, the magnitude of online social media activity has reached an unprecedented level. Hundreds of millions of users spend hours online every day to stay connected and communicate with the rest of the world. Millions of users participate in these social networks, or social awareness streams [19]. People generate huge amounts of data every day on various social media networks, which in aggregate indicate the interests and current attention of local and global communities. Many events and topics are discussed on Twitter. Some topics get a lot of attention and some do not; some become very popular and a focus of interest for a large number of people. The connections and the nature of a social network let information disseminate to a large number of other people, a phenomenon known as going “viral”. These popular topics of discussion are also called “trends” in the social network. Trends are very dynamic and temporal in nature, and they expose the aggregate interests and attention of global and local communities.

Trends in social networks are of high significance and a major point of interest in both the industry [19] and the research community [8, 15]. Many web and business applications can benefit immensely from knowing what is currently “trending”, which represents an answer to the age-old query: what are people talking about? [9]. Examples range from stock exchanges making real-time decisions to search engines delivering more up-to-date, relevant search results (Figures 1.1 and 1.2).

Figure 1.1: Trending topic “Cricket”
Figure 1.2: Non-trending topic “Football”

Twitter is one of the most popular social networking and microblogging services; by 2013 it had more than 200 million registered users producing 400 million tweets every day [17]. As a microblogging website it allows its users to post short text messages of up to 140 characters, called “tweets”. There are many different ways for users to post tweets, including mobile phones, the web and text messaging tools [14]. Twitter is also very real-time in nature: in the past, several events were reported on Twitter hours earlier than in the mainstream media [10]. Hence Twitter is a very robust source for real-time trends on the web.

The numbers of active users and of tweets generated daily are enormous, and collectively they can give crucial clues to several interesting problems such as public opinion analysis and hot trend detection. Twitter employs a social model called following [14], in which a user may choose to follow any other user without needing permission and without that user having to follow her back. The one she follows is her friend, and she is the follower. Being a follower on Twitter means that she receives all the updates of her friends [4]. This makes Twitter a directed social network, where directed links could represent anything from intimate friendships to common interests, or even a passion for breaking news or celebrity gossip. Such directed links determine the flow of information and hence indicate a user’s influence on others, a concept that is crucial in sociology and viral marketing.

The major drawback of using Twitter as a source of information is that not all tweets are informative. On the contrary, the majority of tweets are “chatty” or “spammy” in nature [7]. So it is crucial to filter this noise out of the data and use only the useful “informative” tweets. Hence we need a system that can separate out useful tweets, which essentially means a classifier model that separates “informative” tweets from “chat” tweets. This system should work in a single pass and be robust and fast enough to process up to 400 million tweets a day. Our attempt in this thesis is to provide such a framework which, given topic-wise tweet clusters, can detect current trends and also predict upcoming trends in their early stage using the social graph of Twitter.

1.1 Problem Statement

Our problem statement can be formally defined as follows.

Create an automated system which takes in a continuous stream of raw tweets and processes these tweets to filter out “noise” and keep the relevant, informative tweets. Further, mine these “filtered” tweets to detect and predict evolving “trending” topics in their very early stage.

There are three key aspects to this problem:

• Creating a model that processes the real-time tweet feed, filters out the “noise” and outputs only relevant tweets.

• Creating a system to store the social graph of Twitter users such that it can be used efficiently as a data structure for answering various user queries and getting the relations of a given user.

• Creating a model to distinguish a “trending” topic from a “non-trending” one, so that a prediction can be made in the very early stage of its evolution.

1.2 Organization

Chapter 1 motivates the use of the Twitter network to extract aggregate public opinion, topics of interest and activities in real time.

Chapter 2 summarizes the literature on past work that has attempted similar problems, and introduces the concept of the evolving topic graph, which we apply to create a trend detection and prediction model.

Chapter 3 creates a model to filter the noise out of the huge volume of tweets and keep only relevant, informative tweets.

Chapter 4 looks at the challenge of creating the large social graph of Twitter, which consists of getting this data from Twitter and storing and processing such a big graph.

Chapter 5 implements the algorithm to create evolving topical graphs for several topic-wise tweet clusters and analyzes the results of the experiments on these graphs.

Chapter 6 presents the results for a few topics with different levels of popularity and makes some crucial observations.

Chapter 7 concludes this thesis by summarizing our contributions and discussing future work.


Chapter 2

Related Work

Our work builds upon several areas of research, in particular information extraction from noisy text, natural language processing on short text, and information flow in social networks.

2.1 Filtering model

Before ingesting Twitter data, as mentioned earlier, it is necessary to remove the noise from the data to find the needle in the haystack. Several people have worked on this problem [7, 11, 16], but most of them have a very specific “target” class of tweets to filter, such as tweets related to a particular sports event. Finding a specific feature set or keywords related to such an event can give good results for that use case but might not be very useful in general. Our goal is to get all the “relevant” and “informative” tweets, which is a much more generic definition for classification. These tweets might cover topics ranging from a sports event to a political activity. Hence we had to develop a model which is generic enough to capture all kinds of “informative” tweets and at the same time strict enough to filter out all the “chat” tweets.

2.2 Building Social Graph for twitter

As mentioned earlier, the core of our method is the social graph, which consists of Twitter users and the relations among them. Twitter currently has more than 200 million nodes, or users, in it [17].

Very few attempts have been made to collect the social graph of Twitter [18, 5]. Also, Twitter restricts the sharing of social network relations. Further, Twitter has stopped support for “whitelisted” servers (with extra permissions), which left us with the only option of collecting this data via the Twitter API under the normal rate limit.

Twitter exposes these relations to the public via its API [4], where relations like “followers” and “friends” (users a person follows) can be collected via API calls. The major blocking factor here is the rate limit on the number of calls that can be made to the API. We did several optimizations to have multiple nodes collecting data from Twitter.


2.3 Evolving Graph

Several attempts have been made to detect and even predict trends, particularly for Twitter [7, 11, 16]. The significant thing about our approach is that we are trying to make use of the fact that the connections of a social network are the most crucial part in the spread and dissemination of information. Making use of these connections can give an earlier indication of trends than approaches which use only the time series pattern of tweet volume to detect and predict. Sebastien Ardon, Bagchi et al. [6] proposed a novel approach: a dynamic evolving graph for a topic, built from the authors involved in the discussion of that topic and their relations. The hypothesis was that during the evolution of the graph, its structure and topology give rise to patterns which could act as distinct features to distinguish a “trending” topic from a “non-trending” topic.

To construct the graph, they first partition the tweets related to a particular topic by day, and for each day they construct the subgraph induced on Twitter by the users who have tweeted on that topic on that particular day. Finally, they study the cumulative evolving graphs of a topic. They denote by $G_i^t = (V_i^t, E_i^t)$ the cumulative evolving graph for topic $t$ on day $i$ and define it as follows:

• The vertex set of $G_0^t$ comprises the users $V_0^t$ who tweet about $t$ on day 0. The edge set is empty.

• The vertex set $V_i^t$ of $G_i^t$ is the set of all users who have tweeted on the topic in days 0 through $i$. An edge $(u \to v)$ is added to $E_i^t$ if $u \in V_{i-1}^t$ and $v$ tweets about $t$ on day $i$.

We plan to make some modifications to the above algorithm and then construct evolving graphs for the whole candidate set of trending topics, distinguishing “trend” from “non-trend” by observing the changing topology of the graph, such as the size of connected components, the density of the graph, etc. We also observe the amount of information flowing through this graph, which we quantify later by a term called conductance.


Chapter 3

A model to mine and filter tweets

In this chapter we propose a new model to filter tweets to remove “noise” and getrelevant tweets as per our use case here.

3.1 “Informative” tweets

Figure 3.1 gives some examples of what we mean by “informative” and “chat” tweets. As mentioned earlier, we do not have any training data of labeled tweets, which is required before we can train a classifier. Looking at several such examples gave us insights about potential features for such a classifier (Table 3.1).

3.1.1 NLP tools for tweets

As explained by Ritter et al., “The performance of standard NLP tools is severely degraded on tweets” [12]. The key reason behind this is that these tools are trained on clean text such as news articles, unlike tweets.

Twitter presents a new and challenging style of text for language technology due to its noisy and informal nature [13]. Hence they have trained their tool particularly for tweets, which gives better accuracy and performance on tweets [1]. It is also to be noted that this tool gives NER tags along with the class of each tagged entity, such as location, person, date etc. Using the number of distinct NER classes in a tweet gives better results than just using the number of named entities in a tweet.

Tweet text based            Tweet metadata based                   NLP tags
word count                  retweet count                          NER (persons, location etc.)
tweet language structure    followers count                        POS tags
mentions                    URL                                    -
hash-tags                   location and time                      -
-                           source of tweet (web, mobile etc.)     -

Table 3.1: Different features of tweets filter model

Figure 3.1: Red : “chat” tweets, Blue : “informative” tweets

3.1.2 Model Feature weights

Next, we have to learn the weights of these features to train the model. As mentioned earlier, although we could choose any standard classifier such as SVM or logistic regression, we do not have training data to learn from. In order to generate training data, and also to experiment with the model and see the effect on the output, we built a web tool (Figure 3.2). The tool takes the configured weights for the model on its interface and then ranks all the tweets by an aggregate score, which is basically a weighted sum of all the features. Applying various heuristics in this sandbox, we observed the following interesting points:

• The most relevant “informative” tweets appear at the top when ranked by score.

• The bottom 80-90% of the tweets are “chat”, which are basically personal conversations.

• From top to bottom there is a pattern of clusters: the top tweets are “NEWS/EVENTS” in nature, the next band is about subjective opinions and the last cluster is about personal errands.

3.2 Preprocessing and Scaling

There are 200 million active users who produce 400 million tweets every day. We receive around 1 million tweets every 5 minutes. Handling such a scale affects our algorithmic and processing logic. Figure 3.3 shows the pipeline of the whole process.

Figure 3.2: Interface to experiment with features on left and ranked tweets on right.

Figure 3.3: Pipeline showing ingestion of tweets to ranked output


3.2.1 Efficient two phase modeling

As shown in the pipeline, the step which uses NLP tools for entity recognition is the most expensive. Initially, we were pushing all the tweets into this step before ranking them. Instead, we implemented a two-phase model, where we try to drop “chat” tweets even before the NLP step, so that the most expensive step is avoided for them, which improves overall performance.

• First Phase : Drop “chat” tweets using a heuristic, without using the NLP tool.

• Second Phase : Take the output of the first phase, apply the NLP tool and rank the tweets.

First Phase (Pre-Processing)

Once we had generated some training data with tweets labeled as “chat” or “informative” using the web tool (Figure 3.2), we made some interesting observations:

• Very short tweets (<= 2 words) are always “chat” tweets.

• 40% of the “chat” tweets contained one of the stop words, while only 2% of the “informative” tweets did.
Stop words = (I, me, mine, you, yours, etc)

Using the above observations, we dropped a tweet if it is “too short” or if it contains any of the stop words. After this phase 70% of the tweets are dropped, which means 70% less data for the next phase.
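The first-phase rule above can be illustrated with a minimal sketch. It assumes a simple regular-expression tokenizer and uses only the five stop words listed in the text; the full stop word list ends with “etc” in the source, so the set below is incomplete by construction. The names `STOP_WORDS`, `tokenize` and `passes_first_phase` are hypothetical.

```python
import re

# Only the stop words listed in the text; the source list ends with "etc",
# so the full set used in the thesis is not known.
STOP_WORDS = {"i", "me", "mine", "you", "yours"}
MIN_WORDS = 3  # tweets with <= 2 words are treated as "chat"


def tokenize(text):
    """Lowercase word tokens; a deliberately simple, assumed tokenizer."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def passes_first_phase(text):
    """Return True if the tweet should go on to the NLP/ranking phase."""
    words = tokenize(text)
    if len(words) < MIN_WORDS:
        return False
    return not any(w in STOP_WORDS for w in words)


tweets = [
    "lol ok",
    "I love this song so much",
    "IRS officials testify before Congress on Tuesday",
]
kept = [t for t in tweets if passes_first_phase(t)]
```

Because the check needs only tokenization and set lookups, it can run in a single pass over the stream before any tweet reaches the expensive NLP step.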

Second Phase (Ranking)

In this phase, we process all the tweets coming from the first phase through the NLP tool to recognize the entities in the tweets. Then we extract the other features from the tweets. To rank the tweets we calculate an aggregate score $S(t)$ for each tweet $t$, where $W_i$ is the relative weight of feature $i$, $N_i$ is the normalizing factor of feature $i$ and $f(t)_i$ is the value of feature $i$ for tweet $t$:

$$S(t) = \sum_i \frac{W_i \cdot f(t)_i}{N_i} \qquad (3.1)$$

The value of $f(t)_i$ is normalized for all $i$ such that $f(t)_i / N_i$ takes a maximum value of 100 over the tweets in the dataset.

Experiments with the web tool (Figure 3.2) allow us to configure different $W_i$ and see the effect on the final ranked tweets.
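As an illustration of Equation 3.1, the following sketch computes the normalizing factors and the aggregate scores for a handful of tweets. The feature names, feature values and weights are hypothetical; the only parts taken from the text are the weighted-sum form of the score and the rule that the normalized feature value peaks at 100.

```python
def normalizers(feature_rows, scale=100.0):
    """N_i chosen so that max over tweets of f(t)_i / N_i equals `scale`."""
    names = feature_rows[0].keys()
    norms = {}
    for name in names:
        peak = max(row[name] for row in feature_rows)
        # A feature that is zero everywhere contributes nothing; avoid 0/0.
        norms[name] = peak / scale if peak > 0 else 1.0
    return norms


def score(row, weights, norms):
    """S(t) = sum_i W_i * f(t)_i / N_i (Equation 3.1)."""
    return sum(weights[name] * row[name] / norms[name] for name in weights)


# Hypothetical feature values and weights, for illustration only.
rows = [
    {"word_count": 20, "retweet_count": 50, "ner_classes": 2},
    {"word_count": 10, "retweet_count": 0, "ner_classes": 0},
    {"word_count": 5, "retweet_count": 200, "ner_classes": 1},
]
weights = {"word_count": 1.0, "retweet_count": 2.0, "ner_classes": 3.0}

norms = normalizers(rows)
scores = [score(r, weights, norms) for r in rows]
# Indices of tweets from highest to lowest score.
ranking = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
```

Normalizing every feature to the same 0 to 100 range means the weights $W_i$ alone express relative importance, which is what makes them convenient to tune by hand in a tool like the one described.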


Chapter 4

Constructing social graph

4.1 Collecting data via Twitter API

As mentioned earlier, the core of our approach is the social graph, which consists of Twitter users and the relations among them. Twitter currently has more than 200 million nodes, or users, in it [17]. Twitter exposes these relations to the public via its API [4], where relations like “followers” and “friends” (users a person follows) can be collected via API calls.

The major blocking factor here is the rate limit on the number of calls that can be made to the API. We created more than 10 different API keys and ran several instances of nodes making calls to the Twitter API. We wrote a Python library to communicate with the API.

The Twitter API was also upgraded several times by Twitter, resulting in changes to API methods and responses, which required corresponding changes in our code. In addition, due to heavy traffic or technical glitches there were frequent outages in the API service, causing further delays and blocking.

4.1.1 Resolving relations for “good” users

It is clear from the above discussion that it is practically impossible to collect the relations for all Twitter users. Hence we have to decide which users are “good” enough to be resolved first.

We proceeded with the hypothesis that users who are “good” will be involved in creating and sharing “informative” tweets. Ideally these users should act as good “sensors” of the whole bigger graph, which means that tracking these users’ behavior should give a good idea of the “trends” in the larger graph.

To select these users, we extracted the authors of the top tweets from the “filtered” data output by the above model over several days of tweets. Then we resolved the followers and friends of these authors via the Twitter API. The API returns responses in batches of 5000 users. We also decided to collect only up to 20000 friends or followers for a particular user, because some popular celebrities might have followers running into the millions.
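The batched collection with a cap of 20000 ids per user can be sketched as follows. This is not the library written for the thesis: `fetch_page` is a hypothetical callable standing in for one API request, the cursor convention (start at -1, stop at 0) is an assumption, and rate-limit handling and retries are omitted. An in-memory fake API is included so the sketch runs offline.

```python
def collect_ids(user_id, fetch_page, max_ids=20000):
    """Collect up to `max_ids` follower (or friend) ids for one user.

    `fetch_page(user_id, cursor)` is a hypothetical callable that returns
    (ids, next_cursor); next_cursor == 0 is assumed to mean "no more pages".
    """
    collected = []
    cursor = -1  # assumed start-of-list cursor
    while cursor != 0 and len(collected) < max_ids:
        ids, cursor = fetch_page(user_id, cursor)
        collected.extend(ids)
    return collected[:max_ids]


def make_fake_api(total, page_size=5000):
    """In-memory stand-in for the API so the sketch can run offline."""
    def fetch_page(user_id, cursor):
        start = 0 if cursor == -1 else cursor
        end = min(start + page_size, total)
        next_cursor = end if end < total else 0
        return list(range(start, end)), next_cursor
    return fetch_page


# A user with a million followers is cut off after four pages of 5000.
celebrity = collect_ids("celebrity", make_fake_api(total=1_000_000))
# A user with fewer followers than the cap is collected completely.
regular = collect_ids("regular", make_fake_api(total=7300))
```

With batches of 5000, the cap bounds the cost of resolving any single user at four calls per relation type, which matters when the number of calls is the limiting resource.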


4.2 Storing the graph in Graph Database Neo4j

After the data is collected from Twitter, the major challenge is to store this social graph in a data structure which allows us to efficiently answer various queries, such as getting all the “followers” or “friends” of a particular user. Due to the size of the graph and the nature of our queries, conventional database systems like MySQL become a bad choice [2]. Hence we decided to use Neo4j [3], a graph database which could easily handle the amount of data we had and also efficiently answer the user queries we need to make.

The Twitter API returns a JSON response for queries about user relations. We stored this data in Neo4j, a graph database which stores data in adjacency list format, unlike the tables in MySQL.

• Node : Each node is a user, and also contains several other details of the user profile. We also label the users for which we have resolved all the relations.

• Edge : A directed edge (u → v) represents that user “u” follows user “v”.

• Index : We need to make several queries where, given a user’s details, we want to get all of its followers and friends. To make this query fast and efficient we indexed the graph by a unique key “uid”, the user id, on all user nodes. This “uid” is the same as the Twitter user id, which is already present in the tweet object.
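The node, edge and index conventions above can be mirrored in a small in-memory structure. This is only a toy illustration of the data model, not Neo4j code and not the storage layer used in the thesis; the class `FollowGraph` and its method names are hypothetical.

```python
from collections import defaultdict


class FollowGraph:
    """Toy adjacency-list store; an edge (u, v) means user u follows user v."""

    def __init__(self):
        self.profiles = {}                 # uid -> profile details (the "index")
        self.friends = defaultdict(set)    # uid -> users this uid follows
        self.followers = defaultdict(set)  # uid -> users following this uid
        self.resolved = set()              # uids whose relations are fully fetched

    def add_user(self, uid, **profile):
        """Create the node if needed and merge in any profile details."""
        self.profiles.setdefault(uid, {}).update(profile)

    def add_follow(self, u, v):
        """Record the directed edge u -> v (u follows v)."""
        self.add_user(u)
        self.add_user(v)
        self.friends[u].add(v)
        self.followers[v].add(u)

    def mark_resolved(self, uid):
        """Label a user whose followers and friends have all been collected."""
        self.resolved.add(uid)


g = FollowGraph()
g.add_user(1, screen_name="alice")
g.add_follow(2, 1)
g.add_follow(3, 1)
g.add_follow(1, 3)
g.mark_resolved(1)
```

Keeping both directions of each edge keyed by “uid” makes the two queries named in the text, all followers and all friends of a user, a single lookup each.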

4.3 Analysis of the Constructed Social Graph

Using the approach explained above we ran our data collection and storing process for several days. This resulted in a directed graph of more than 30 million nodes and 60 million edges. This is around 10% of the original Twitter graph in terms of the number of users present.

To analyze this graph and the users collected in it, we collected two sets of users. The first set consists of top celebrities and the most “influential” users who are widely mentioned in many news articles etc. We crawled 1000 such users from sites like these [links]. The second set consists of users from the top “filtered” tweets produced by the above model. We made the following observations:

• Almost 95% of the top celebrities are present in our graph.

• Around 60% of the users of the second set are present in our graph.

It is very interesting to note that although the size of our graph is only 10% of the whole Twitter graph, 60% of the second user set is present. This is a strong indication that our graph mostly contains “active” and “good” users, while a significant fraction of Twitter users could be “inactive” users who never appear in filtered tweets and can be assumed to play little or no role in creating and spreading “trends” in the network.


Chapter 5

Evolving graph

Now that we have created the social graph, we can proceed with the creation of the topical evolving graph. Sebastien Ardon, Bagchi et al. [6] have explained that we need the tweet cluster of a topic to make the evolving graph for that topic.

5.1 Tweets clusters

We need all the tweets related to a particular topic together, which we call a cluster. This tweet cluster tells us which users are involved in the discussion of that particular topic.

5.1.1 Topic detection

To start the clustering we first need the notion of a topic. For that we use “Timesense”, a Yahoo! proprietary service, which gives a list of search queries along with a “buzz” score indicating the “trendiness” of each topic based on the time series of the volume and frequency of that query in various Yahoo services like Yahoo Search. It does web search for time-sensitive (buzzing, breaking news, seasonal/recurrent) queries. The search queries returned by Timesense are not clustered together, which means that different search queries related to the same event are returned as separate queries (Table 5.1).

5.1.2 N gram matching.

Users on Twitter use different variants of the same term while discussing the same topic. Hence we need to collect all such tweets, which might contain these variants but are essentially related to the same topic. Therefore we first cluster the variants together so that they are recognized as variants of the same topic. To do that we implemented a bi-gram matching algorithm to cluster together search queries like those in Table 5.1. We also did some manual cleaning of the queries to ensure we get the correct set of variants for a topic.

mark appel houston mlb draft    what high school did mark appel go to
mark appel and pat appel        mark appel major stanford
mark appel 2013 mlb draft       mark appel stanford baseball
mark appel contract             baseball player mark appel of stanford

Table 5.1: Search query variants returned for the same topic

Serial no.    Trend type    Topic
1             High trend    IRS Scandal of Obama administration
2             High trend    Angelina Jolie going through mastectomy
3             Low trend     Cannes Film Festival 2013
4             Low trend     The Great Gatsby (Movie 2013)

Table 5.2: Topics selected for the experiment

5.1.3 Clustering

Now we have the list of all variants for a given topic. To get the cluster of tweets related to this topic, we do one pass over all the public tweets and fetch a tweet if it contains any of the topic’s bi-gram term pairs.
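The two steps, grouping query variants and then selecting tweets, can be sketched as below. The text does not spell out the matching rule, so this sketch assumes that two queries belong to the same topic when they share at least one word bi-gram, and that a tweet matches a topic when it contains any bi-gram of any variant. Punctuation handling and the manual cleaning step are left out, and all function names are hypothetical.

```python
from collections import defaultdict


def bigrams(text):
    """Set of adjacent word pairs in a lowercased, whitespace-split text."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def cluster_queries(queries):
    """Group queries that share at least one bigram (assumed matching rule)."""
    parent = list(range(len(queries)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # bigram -> index of first query containing it
    for i, q in enumerate(queries):
        for bg in bigrams(q):
            if bg in first_seen:
                parent[find(i)] = find(first_seen[bg])
            else:
                first_seen[bg] = i

    groups = defaultdict(list)
    for i, q in enumerate(queries):
        groups[find(i)].append(q)
    return list(groups.values())


def tweets_for_topic(tweets, variants):
    """One pass over tweets, keeping those containing any variant bigram."""
    topic_bigrams = set().union(*(bigrams(v) for v in variants))
    return [t for t in tweets if bigrams(t) & topic_bigrams]


queries = [
    "mark appel contract",
    "mark appel stanford baseball",
    "cannes film festival",
    "cannes film festival 2013",
]
clusters = cluster_queries(queries)

tweets = [
    "Mark Appel goes first in the draft",
    "Great film at the festival",
    "Cannes Film jury announced",
]
appel_tweets = tweets_for_topic(tweets, clusters[0])
cannes_tweets = tweets_for_topic(tweets, clusters[1])
```

A shared-bigram rule is permissive: a generic pair such as “mlb draft” could merge unrelated queries, which is one reason a manual cleaning step like the one described may be needed.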

5.2 Evolving graph of a topic

Now that we have all the ingredients to make evolving graph for a topic we can startbuilding it. Building upon the idea of Sebastien Ardon, Bagchi et al.[6], we made someessential modifications in the algorithm.The time window of one day as used in their work is too big for most of the cases astopics appearing for the first time to becoming “trending”, happen in less than 24 hours.Also we need to do better than doing detection of “trend” with one day of latency.Also we take up little more liberal approach to insert edges in the graph. Previously theywere comparing with users who have tweeted in the previous day, now we compare withusers in the current graph not just previous time window. So effectively the modifiedalgorithm becomes as follow

• The vertex set of $G^t_0$ comprises the users $V_0$ who tweet about $t$ in window 0. The edge set is empty.

• The vertex set $V^t_i$ of $G^t_i$ is the set of all users who have tweeted on the topic in windows 0 through $i$. An edge $(u \leftarrow v)$ is added to $E^t_i$ if $u \in V$ (the current set) and $v$ has tweeted about $t$ in window $i$. Note the change in arrow direction.
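The construction above can be sketched in a few lines. Two assumptions are made here that the text does not state explicitly: an edge (u ← v) additionally requires that v follows u in the social graph, and the “current set” is read as the users already in the graph before window i is processed, which keeps the edge set of window 0 empty. The names `build_evolving_graph`, `windows` and `follows` are hypothetical.

```python
def build_evolving_graph(windows, follows):
    """Return one (vertices, edges) snapshot per time window.

    windows: list of sets; windows[i] holds the users who tweeted about the
             topic in window i.
    follows: dict mapping a user to the set of users they follow (assumed).
    An edge is stored as the tuple (u, v) and read "u <- v": v follows u,
    u was already in the graph, and v tweeted in the current window.
    """
    vertices, edges = set(), set()
    snapshots = []
    for active in windows:
        for v in active:
            for u in follows.get(v, ()):
                # Compare against every user seen so far, not just the
                # previous window (assumed reading of "current set").
                if u in vertices and u != v:
                    edges.add((u, v))
        vertices |= active
        snapshots.append((set(vertices), set(edges)))
    return snapshots
```

Reading the “current set” as including the users of window i itself is equally plausible; that variant would only require updating `vertices` before the edge loop.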

5.3 Experiment with few topics

We now come to the core concept of our approach: the connections of the social graph, which facilitate the dissemination of information. The following passage puts it very precisely:

“If we visualize the social network as a set of communities connected through users who may belong to multiple communities, our narrative of topic spread says that topics that are going to become very popular witness intense discussion within communities at first. When the level of intensity rises then the users who bridge communities enter the discussion in a big way causing a merging of what were earlier disjoint discussions.” [6]

We start with selected topics (Table 5.2), for which we build the evolving graph as explained earlier (Section 5.2). We observed the sizes of the largest and second-largest connected components in the graph for each of these topics (Figure 5.14).

5.4 Analysis and limitations

Figure 5.1: IRS Scandal
Figure 5.2: Angelina Jolie
Figure 5.3: Cannes Film Festival
Figure 5.4: The Great Gatsby

Figure 5.5: Number of tweets, nodes, and disjoint components against time


From Figure 5.14 it is clear that the size of the largest component increases for all the topics. The increase is much more significant for topics 1 and 2, which are high-trending topics, and reflects users forming a large connected community while discussing a particular topic. High-trend topics 1 and 2 form a large connected component (Table 5.3), unlike low-trend topics 3 and 4. It can also be observed that the ratio given in Table 5.3 is several times higher for high trends than for low trends, which shows that the graphs of these topics contain most of their nodes in the largest component. The low ratio for the other topics shows that there are many small independent clusters of communities discussing among themselves without merging into a large component.
It is important to keep in mind that our social graph does not contain the relations of all users. Hence an “ideal” version of these evolving graphs would contain many more edges, which may be missing here.
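The two quantities compared in this analysis, the size C1 of the largest component and the ratio C1/C2 to the second largest, can be computed from one snapshot of the evolving graph as sketched below. The sketch assumes that components are taken on the undirected version of the graph (weakly connected components) and that every edge endpoint is in the vertex set. `component_sizes` and `c1_and_ratio` are hypothetical helper names.

```python
from collections import defaultdict, deque


def component_sizes(vertices, edges):
    """Return the sizes of the weakly connected components, largest first."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)  # ignore edge direction (assumption)

    seen, sizes = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:  # breadth-first search over one component
            node = queue.popleft()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def c1_and_ratio(vertices, edges):
    """Return (C1, C1/C2); the ratio is None when no second component exists."""
    sizes = component_sizes(vertices, edges)
    c1 = sizes[0] if sizes else 0
    c2 = sizes[1] if len(sizes) > 1 else 0
    return c1, (c1 / c2 if c2 else None)
```

Evaluating `c1_and_ratio` on each window's snapshot gives the kind of time series plotted for the four topics.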



Figure 5.6: IRS Scandal
Figure 5.7: IRS Scandal
Figure 5.8: Angelina Jolie
Figure 5.9: Angelina Jolie
Figure 5.10: Cannes Film Festival
Figure 5.11: Cannes Film Festival
Figure 5.12: The Great Gatsby
Figure 5.13: The Great Gatsby

Figure 5.14: Sizes of the largest and second-largest components against time

Topic   C1     C1/C2
1       1732   346
2       974    44
3       127    8
4       22     3

Table 5.3: C1 and C2 are the sizes of the largest and second-largest components


Chapter 6

Model Evaluation

To investigate how much the missing edges affect our results, we chose a few topics (Table 6.1) and collected the tweet clusters related to them using the same clustering algorithm as above. We then resolved the friend and follower relations for all the authors of the tweets in those clusters. All the analysis from here on is done on these topics.

6.1 Conductance

When the level of intensity rises, the users who bridge communities enter the discussion in a big way, causing a merging of what were earlier disjoint discussions. Bridge users serve as a barometer of the topic's rising popularity. To investigate the applicability of this narrative, we study the conductance of evolving topic graphs [6]. Motivated by a definition widely used in the study of mixing times of random walks in graphs, we define the conductance φ(S) of a subset of nodes S of a directed graph G = (V, E) as the fraction of the edges outgoing from the vertices of S that land outside S:

\[
\phi(S) = \frac{|\{(u \to v) : u \in S,\ v \in V \setminus S\}|}{|\{(u \to v) : u \in S\}|} \tag{6.1}
\]

Clearly, the higher the value of φ(S), the larger the number of nodes outside S that are made aware of a topic being tweeted by the users in S. We plot the evolving value of the conductance of the user set of each window's graph alongside other statistics for the four topics chosen above (Figure 6.5).
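Equation (6.1) translates directly into code. The sketch below assumes that the social graph is given as a collection of directed pairs (u, v), read u → v, where information flows from u to v. Returning 0.0 for a set with no outgoing edges is a convention chosen here, not something the text prescribes, and `conductance` is a hypothetical name.

```python
def conductance(subset, edges):
    """phi(S): fraction of the edges leaving vertices of S that land outside S."""
    subset = set(subset)
    outgoing = [(u, v) for u, v in edges if u in subset]
    if not outgoing:
        return 0.0  # convention for an empty denominator (assumption)
    crossing = sum(1 for _, v in outgoing if v not in subset)
    return crossing / len(outgoing)
```

For example, with edges a → b, a → c, b → d and c → a, the set S = {a, b} has three outgoing edges, two of which leave S, so φ(S) = 2/3. Once c joins the discussion, S = {a, b, c} has four outgoing edges and only b → d leaves S, so φ(S) drops to 1/4.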

Nature           Topic
Very low trend   Mattapoisett Car accident
Mild trend       Jeep Patriot
Trend            Mayor Bloomberg
High trend       Mark Appel

Table 6.1: Topics with different levels of trend


6.2 Density and largest component

Another interesting distinguishing characteristic is the density of edges in the evolving graph. The graphs get denser for more popular, trending topics than for low trends.
A further feature of popular topics is that most of the people discussing the topic tend to form a large connected component; that is, most users of the evolving graph eventually fall into the largest connected component. For less popular topics, there are multiple connected components that never merge to form a large component. We plotted these features against time for all the topics mentioned earlier (Figure 6.14).
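Both features can be read off a single snapshot of the evolving graph: the edge density E/N and the fraction L/N of users in the largest component. The following is a minimal sketch, assuming that components ignore edge direction and that every edge endpoint belongs to the vertex set. `graph_ratios` is a hypothetical name.

```python
from collections import Counter


def graph_ratios(vertices, edges):
    """Return (E/N, L/N) for one snapshot of the evolving graph."""
    parent = {v: v for v in vertices}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)  # merge the two endpoints' components

    n = len(parent)
    if n == 0:
        return 0.0, 0.0
    largest = max(Counter(find(v) for v in parent).values())
    return len(edges) / n, largest / n
```

Multiplying the second value by 100 gives the percentage of users that lie in the largest component.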

Figure 6.1: Very low trend, topic 1
Figure 6.2: Mild trend, topic 2
Figure 6.3: Trend, topic 3
Figure 6.4: High trend, topic 4

Figure 6.5: Conductance and largest component size against time

6.3 Analysis

Some observations are clear from the above plots:

• Figure 6.5 shows that the size of the largest component increases much more rapidly for the more trending topics 3 and 4 than for the low-trending topics 1 and 2. The difference in the concavity of the plots (down for low trends and up for high trends) also supports this.

• The fall in conductance is due to the conversion of external edges into internal edges as the topic spreads in the community. The fall is much more significant for trending topics.


Figure 6.6: Very low trend, topic 1
Figure 6.7: Very low trend, topic 1
Figure 6.8: Mild trend, topic 2
Figure 6.9: Mild trend, topic 2
Figure 6.10: Trend, topic 3
Figure 6.11: Trend, topic 3
Figure 6.12: High trend, topic 4
Figure 6.13: High trend, topic 4

Figure 6.14: Ratios of the number of edges and of the size of the largest component to the number of nodes


Topic   E/N    L/N*100   % Fall in Conductance
1       0.09   8.7       0.3
2       0.38   36.7      0.1
3       0.83   82.1      0.5
4       0.98   97.98     1.2

Table 6.2: E: number of edges, L: size of largest component, N: number of nodes

• Figure 6.14 shows a higher conversion of external edges to internal edges for trending topics, which indicates stronger behavioral influence and more spreading to followers in the case of trends.

• The largest connected component contains around 15% and 35% of all users for topics 1 and 2 respectively, while it contains 80% for topic 3 and more than 90% for topic 4. This strongly supports the hypothesis that, for trending topics, users form a large connected community.

6.4 Limitations

Though the above results look promising, the major drawback is the need for all the relations of all users in the tweet clusters. As mentioned above, it is not possible to collect the whole Twitter graph and keep it updated. It took a couple of days just to resolve all the relations for the users in the above four topics.


Chapter 7

Conclusions

7.1 Contributions

The main contribution of this thesis is a new way to analyze trends in online social networks. We have identified features that can help develop a model to classify “trends” and “non-trends” at a very early stage. We have also developed a highly scalable and efficient model to filter noise from tweets. The prediction model is generic enough to be applied to any social media network that has connections among users. We have shown how, by constructing evolving graphs for different topics and observing several topological properties of these graphs, we can distinguish “trend” from “non-trend” topics.

7.2 Future Work

We have established the benefits of constructing the social graph of Twitter. Resolving more relations for the users in the graph can therefore be useful and can improve the performance of the model. The model's success depends on the topic-wise clustering of tweets. Currently we use a simple clustering that is not very “strict”; better clustering algorithms could be used. Storing and processing graphs has been the real challenge and bottleneck of the whole pipeline. It is necessary to improve this step, for example by exploring other data structures and algorithms that process the graph faster.


Bibliography

[1] Twitter nlp. 2012.

[2] Graph databases, http://www.neo4j.org/learn/graphdatabase. 2013.

[3] neo4j http://www.neo4j.org. 2013.

[4] Twitter api. 2013.

[5] Y. Altshuler, W. Pan, and A. S. Pentland. Trends prediction using social diffusion models. In Proceedings of the 5th international conference on Social Computing, Behavioral-Cultural Modeling and Prediction, SBP’12, pages 97–104, Berlin, Heidelberg, 2012. Springer-Verlag.

[6] S. Ardon, A. Bagchi, A. Mahanti, A. Ruhela, A. Seth, R. M. Tripathy, and S. Triukose. Spatio-temporal analysis of topic popularity in twitter. CoRR, abs/1111.2904, 2011.

[7] L. Barbosa and J. Feng. Robust sentiment detection on twitter from biased and noisy data. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, COLING ’10, pages 36–44, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[8] L. M. L. Delcambre and G. Giuliano, editors. Proceedings of the 2005 National Conference on Digital Government Research, DG.O 2005, Atlanta, Georgia, USA, May 15-18, 2005, volume 89 of ACM International Conference Proceeding Series. Digital Government Research Center, 2005.

[9] N. P. Fang Fang1. Detecting Twitter Trends in Real-Time. WITS 2011.

[10] M. Mathioudakis and N. Koudas. Twittermonitor: trend detection over the twitter stream. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, SIGMOD ’10, pages 1155–1158, New York, NY, USA, 2010. ACM.

[11] M. Naaman, H. Becker, and L. Gravano. Hip and trendy: Characterizing emerging trends on twitter.

[12] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: An experimental study. In EMNLP, 2011.

[13] A. Ritter, Mausam, O. Etzioni, and S. Clark. Open domain event extraction from twitter. In KDD, 2012.


[14] G. H. S. Milstein, A. Chowdhury. Twitter and the Micro-Messaging Revolution: Communication, Connections, and Immediacy–140 Characters at a Time. O’Reilly Radar Report, November 2008.

[15] J. Sankaranarayanan, H. Samet, B. E. Teitler, M. D. Lieberman, and J. Sperling. Twitterstand: news in tweets. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, GIS ’09, pages 42–51, New York, NY, USA, 2009. ACM.

[16] A. Vakali, M. Giatsoglou, and S. Antaris. Social networking trends and dynamics detection via a cloud-based framework design. In Proceedings of the 21st international conference companion on World Wide Web, WWW ’12 Companion, pages 1213–1220, New York, NY, USA, 2012. ACM.

[17] B. K. Wickre. celebrating-twitter7. J. Mach. Learn. Res., 3, Mar. 2013.

[18] J. Yang and J. Leskovec. Patterns of temporal variation in online media. In Proceedings of the fourth ACM international conference on Web search and data mining, WSDM ’11, pages 177–186, New York, NY, USA, 2011. ACM.

[19] M. J. Zaki, A. Siebes, J. X. Yu, B. Goethals, G. I. Webb, and X. Wu, editors. 12th IEEE International Conference on Data Mining, ICDM 2012, Brussels, Belgium, December 10-13, 2012. IEEE Computer Society, 2012.
