
Predicting YouTube Content Popularity via Facebook Data: A Network Spread Model for Optimizing Multimedia Delivery

Dinuka A. Soysa, Denis Guangyin Chen, Oscar C. Au, and Amine Bermak
The Hong Kong University of Science and Technology, ECE, Hong Kong

Email: {dinuka, denischn, eeau, eebermak}@ust.hk

Abstract—The recent popularity of social networking websites has resulted in greater usage of Internet bandwidth for sharing multimedia content through websites such as Facebook and YouTube. Moving large volumes of multimedia data through limited network resources remains a technical challenge to this day. The current state-of-the-art solution for optimizing cache server utilization depends heavily on efficient caching policies to determine content priority. This paper proposes a Fast Threshold Spread Model (FTSM) to predict the future access pattern of multimedia content based on the social information of its past viewers. The prediction results are compared and evaluated against ground-truth statistics of the respective YouTube videos. A complexity analysis of the proposed algorithm for large datasets is presented, and the correlation between Facebook social sharing and YouTube global hit count is explored.

I. INTRODUCTION

Internet services provided by sites such as YouTube and Facebook allow their users to easily generate and share multimedia content of various formats, genres, and lengths with their peers. The recent popularity of social networking websites has resulted in greater usage of Internet bandwidth for sharing multimedia content, but the formidable task of moving large volumes of multimedia data through limited network resources remains a technical challenge to this day. A web-hosted multimedia clip, for instance, can experience viral growth [1] in access demand within a relatively short period of time when each viewer advertises it to his peers, who in turn spread the access to others downstream. A website such as YouTube, which hosts this multimedia file, must be able to adapt to this explosion of access demand in order to avoid interruption of service.

Recent research by Krishnan et al. [2] has shown that even a 5-second startup delay of an online video can lead to up to a 10% abandonment rate among viewers. Therefore, in a viral video outbreak, it is crucial that the video experiences minimal streaming delays, so that the video streaming website can keep its audience from viewing the same video on a competitor's website. The commonly accepted solution for dealing with large access volumes is to cache the content across multiple servers so the load can be distributed. Since pre-caching a multimedia file across all available servers is an expensive exercise, typically only the most accessed files are cached near high-demand hot-spot locations. The management of this premium cache estate is a non-trivial problem. In the case of a virally spreading file, this pre-caching decision must be made quickly due to the explosive access demand. Therefore, a real-time viral behavior prediction model for user-shared multimedia content at its early growth stages is extremely useful for efficient cache management.

[Fig. 1. Statistics and ranking of viral videos by Unruly Media.]

Adhikari et al. [3], [4] have shown that YouTube employs a hashing strategy that maps video IDs to a hierarchy of servers organized in both geographical and logical space. Advanced dynamic DNS re-mapping is performed at runtime to achieve load balancing. The geographic distribution of YouTube servers was obtained by accessing YouTube from 20,000 randomly chosen proxy servers from across the globe. The majority of the servers are based in the United States, and most of them are in Mountain View, California, in close vicinity of the Google headquarters.



[Fig. 2. An example infection process of the Independent Cascade Model, shown in six panels (Step 1 through Step 5 and the final stage): the original infected nodes each have a single chance of infecting their uninfected neighbors with a certain probability, and newly infected nodes pass on the infection until no new nodes are infected.]

Brazil and Indonesia also had surprisingly high percentages of shares, which may reflect the high access demand originating from the Asia region.

The main contributions of this paper are as follows.

• This paper proposes a novel concept in which a user's social profile is used to model the potential future access pattern of user-shared multimedia content. In the past, multimedia content has typically been cached according to its access frequency only.

• It provides statistical analysis to verify FTSM against real-world data for the first time. The Fast Threshold Spread Model (FTSM) is a fast approximation of the classical Independent Cascade Model (ICM). A complexity analysis is provided for FTSM.

• The correlation between Facebook social sharing and YouTube global hit count is investigated. The mined Facebook dataset obeys the power-law degree distribution and is viewed as a predictor of the greater Facebook global trend of video sharing activity.

The remainder of this paper is organized as follows. The data mining process, along with the proposed method of using the Fast Threshold Spread Model (FTSM), is explained in Section II. The simulation results of FTSM on a real Facebook dataset are discussed in Section III, along with the evaluation of the results against ground-truth YouTube data. Future directions for this work and the final concluding remarks are given in Sections IV and V, respectively.

II. METHODOLOGY

The spread of information through a socially connected network can be modelled by the Independent Cascade Model (ICM). The application of ICM to studying “word of mouth” based viral marketing strategies was first explained by Kempe et al. [5]. The basic algorithm is illustrated in Fig. 2. While this model provides quantitative insights into marketing dynamics, its computational complexity is not suitable for making real-time estimations. A similar probabilistic approach is also taken by Trpevski et al. [6] in their Opinion Disseminating Model (ODM). This model is useful for studying the dynamic behaviour of opinion dissemination through a social graph, but the large number of model parameters in ODM makes it difficult to apply in real-world applications where only limited social attributes from services such as Facebook are available for mining.

[Fig. 3. Experimental setup: requesting, downloading, and analyzing JSON objects from Facebook. The diagram shows the login page, the PHP server, and the Facebook Graph API, with the steps: 1) authorize (OAuth 2.0), sending login credentials; 2) verified; 3) click download; 4) Graph API call; 5) JSON returned; 6) repeat for all pages of each friend; 7) parsers extract specific data, which is passed to MATLAB.]

Given a set of users who have already accessed a piece of user-shared multimedia content, FTSM can extrapolate future access patterns in real time. This provides a quantitative viral-spread assessment of the content as it lives through a life cycle of infection, exponential spread, and extinction.

A. Facebook Data Mining

Websites such as YouTube do not contain direct social connection information about their users. To overcome this obstacle, we propose to use social networking sites such as Facebook as the reference dataset to build the viral spread model of user-shared media content. The assumption here is that the content access records of Facebook users can be linked to individual multimedia items on sites such as YouTube; these records can be anonymous. The strong correlation between user activity on Facebook and content access patterns on YouTube is partially supported by statistics from sites such as the Viral Video Chart by Unruly Media [7]: a very close resemblance can be observed between the most-watched videos on the Internet and the most-shared videos on Facebook.

Facebook Timeline and profile data are mined via the Facebook Graph API [8] to build a social graph (an undirected edge network) of users and their activity levels. Indicators such as the number of posts per day can be used to assess how active and influential a user is in the network. This graph is used to simulate the FTSM.



Intuitively, if a piece of multimedia content is accessed by a group of highly influential users, there is a high probability that the content will become popular very rapidly by means of viral spread among their peers.

A scraper was created to automatically download a large set of users' Timeline information. The scraper is designed to be accessible on any computer and any platform, so that users can ‘donate’ their Facebook dataset, both the edge network of their friends and the Timelines of their friends, by running the scraper in their web browser.

The scraper runs in any browser that supports HTML, and the computation is done on the server side in PHP [9] and JavaScript. The communication between the three entities (the client computer, the Facebook Graph API server, and the PHP server) is shown in Fig. 3. In step 1, a user logs in using his Facebook credentials. Once the login button is pressed, the credentials are passed to the Facebook server and authenticated using the Open Authentication protocol (OAuth 2.0 [10], [11]); this way, the scraper never stores or intercepts the login credentials. Upon successful authentication, the scraper requests an access token from the Facebook API for requesting data (this step is not marked on the diagram for simplicity). Once the user clicks download, the PHP server starts the automatic scraping process.

Next, steps 4, 5, and 6 are executed; these steps vary slightly depending on the data type being mined. Initially, the scraper requests the list of friends of the particular user, then iterates through each friend's Timeline. Since one Timeline consists of many thousands of posts, the program requests the Timeline in pages of 100 posts each. The main challenge in collecting this dataset was the “HTTP Error 500: Internal server error” returned by the server from time to time, perhaps to stop data from being harvested non-stop (flooding the server). An intelligent scraper stops, waits for a random timeout of between 20 and 60 seconds, and then resumes collecting data; many researchers avoid real-world data because of this challenge.
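To make the retry behavior concrete, the following is a minimal Python sketch of the back-off loop described above (our own illustration, not the original PHP/JavaScript implementation; the retry limit is an assumption):

    import random
    import time
    import urllib.error
    import urllib.request

    def fetch_with_backoff(url, max_retries=10):
        # Fetch one Graph API page, retrying on intermittent HTTP 500 errors.
        for _ in range(max_retries):
            try:
                with urllib.request.urlopen(url) as resp:
                    return resp.read()
            except urllib.error.HTTPError as e:
                if e.code != 500:
                    raise                            # only HTTP 500 is retried
                time.sleep(random.uniform(20, 60))   # random 20-60 s timeout
        raise RuntimeError("server kept returning HTTP 500")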

B. YouTube Video Statistics Mining

For the purpose of analyzing the transient behavior of a video, from its first hit (viewer) all the way up to a few million, the YouTube hit-count graph statistics were extracted semi-automatically using the Google Chart API [12]. For the sake of simplicity, the ground-truth YouTube hit-count data corresponding to each video will be referred to as the YouTube transient graph. The miniature graph provided under the YouTube statistics on each video page is an image file, as shown in Fig. 4. This is a stumbling block for the data mining, because the statistic cannot be mined directly using a parser or a text scraper. After careful inspection, it was observed that, in the source HTML code of the YouTube page, the image is generated by the Google Chart API in real time when the page is loaded. The code inspector [13] is used to inspect the element and reveal the long URL shown below.

http://chart.apis.google.com/chart?cht=lc:nda&chs=460x100&chf=bg,s,F4F4F4&chco=5F8FC9&chls=1.5&chg=0,-1,1,1&chxt=y,x&chxtc=0,0&chxs=0N*s*%20,333333,10|1,333333,10&chxl=1:|03/05/12|07/12/12|11/18/12&chxp=1,5,50,95&chxr=0,0,112628952|1,0,100&chd=t:0.0,6.2,59.1,65.8,70.3,72.8,74.3,75.2,75.9,76.2,76.6,77.0,77.2,77.5,77.5,77.7,77.9,78.0,78.3,78.6,78.7,78.9,79.1,79.2,79.3,79.4,79.5,79.7,79.7,79.9,80.0,80.0,80.2,80.3,80.3,80.5,80.5,80.6,80.7,80.8,80.9,81.0,81.0,81.1,81.2,81.2,81.3,81.4,81.4,81.5,81.6,81.6,81.7,81.8,81.8,81.9,81.9,81.9,81.9,82.0,82.0,82.0,82.1,82.1,82.1,82.2,82.2,82.2,82.2,82.3,82.3,82.3,82.3,82.4,82.4,82.4,82.4,82.5,82.5,82.5,82.5,82.6,82.6,82.7,82.7,82.7,82.8,82.8,82.8,82.9,82.9,82.9,83.0,83.0,83.1,83.1,83.1,83.2,83.3,83.3&chm=B,dce7eed4,0,0,0|AA,333333,0,0,10|AB,333333,0,0,10|AC,333333,0,0,10|AD,333333,0,0,10|AE,333333,0,0,10|AF,333333,0,0,10|AG,333333,0,0,10|AH,333333,0,0,10

[Fig. 4. The YouTube statistics provided by the YouTube API.]

The long URL above contains all the coordinates, normalized to 100 data points in the range 1–100, presumably by the API. A simple parser is used to mine the content between ‘&chd=t:’ and ‘&chm=B’. After that step, using prior knowledge of the video post date, the current date, and the cumulative number of hits thus far, the x and y coordinates of all 100 data points are calculated. This process is repeated for all the videos of interest to this dataset.
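As an illustration only (our own sketch; the rescaling convention, mapping the last normalized point to the known cumulative hit count, is an assumption), the parsing step could look like:

    import datetime

    def parse_transient(chart_url, post_date, mined_date, total_hits):
        # Extract the 100 normalized y-values between '&chd=t:' and '&chm=B'.
        raw = chart_url.split("&chd=t:")[1].split("&chm=B")[0]
        y_norm = [float(v) for v in raw.split(",")]
        n = len(y_norm)
        # Spread the points evenly between the post date and the mining date.
        span_days = (mined_date - post_date).days
        xs = [post_date + datetime.timedelta(days=i * span_days / (n - 1))
              for i in range(n)]
        # Rescale so the final normalized value equals the known cumulative hits.
        ys = [v / y_norm[-1] * total_hits for v in y_norm]
        return list(zip(xs, ys))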

C. Fast Threshold Spread Model

The Fast Threshold Spread Model (FTSM) is a deterministic diffusion model based on the activity rate of each individual user node. The Facebook social graph extracted in Section II-A is modelled as an undirected graph

G = (V, E)    (1)

with vertices m in V as the users in the network and edges k in E as the relationships between individuals. For each user node m, we evaluate the weight function

W(m) = 0.5 A1(m) + 0.5 A2(m)    (2)

where A1(m) is the average number of posts posted per week by user m, and A2(m) is the average number of shares, comments, and likes received per post.



The two measures are averaged with equal weight to yield W(m), which indicates the social influence of a given user.
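For concreteness, a small sketch of Eq. 2 (the function and argument names are ours, not the paper's):

    def weight(posts_per_week, shares, comments, likes, num_posts):
        a1 = posts_per_week                              # A1(m): posting rate
        a2 = (shares + comments + likes) / num_posts     # A2(m): engagement per post
        return 0.5 * a1 + 0.5 * a2                       # W(m) of Eq. 2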

Each user node m is either active or inactive. We assume that nodes can switch from inactive to active, but not the reverse. The total number of activated nodes, also known as the influence spread, is denoted by NumActiveNodes. The activation process is determined by a threshold function:

Decision(m) = { Active(m) = 1 if W(m) ≥ Threshold
                Active(m) = 0 if W(m) < Threshold }    (3)

The Threshold value is chosen by simulation to be 1.5; the justification for this value is detailed in Section III-A.

The pseudocode of the proposed FTSM algorithm is as follows:

Algorithm 1 Pseudocode for the FTSM algorithm
1: Load G = (V, E) into an array
2: Set currentSet = seedNodes
3: for i = 1 to 20 do
4:   for each node m in currentSet do
5:     Set Active(m) = 1
6:     Set Neighbors to m's immediate neighbors
7:     Compute W(m)
8:     for each node n in Neighbors do
9:       if n is not already visited then
10:        if W(m) ≥ Threshold then
11:          Increment NumActiveNodes by 1
12:          Add n to newSet
13:        end if
14:      end if
15:    end for
16:  end for
17:  currentSet = newSet
18: end for
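Below is a minimal, runnable Python rendering of Algorithm 1. It is our own sketch under stated assumptions: the graph is an adjacency-list dict and the per-node weights of Eq. 2 are precomputed.

    def ftsm(graph, weights, seed_nodes, threshold=1.5, hops=20):
        # graph: dict mapping node -> set of neighbor nodes (undirected)
        # weights: dict mapping node -> W(m) from Eq. 2
        visited = set(seed_nodes)        # each node gets one activation chance
        current = set(seed_nodes)        # currentSet of Algorithm 1
        num_active = 0
        for _ in range(hops):            # line 3: fixed number of hops
            new_set = set()
            for m in current:            # lines 4-7: take m, test its weight
                if weights[m] >= threshold:          # Eq. 3 threshold test
                    for n in graph[m] - visited:     # lines 8-9: unvisited neighbors
                        visited.add(n)
                        num_active += 1              # line 11
                        new_set.add(n)               # line 12
            current = new_set            # line 17
            if not current:              # spread has died out early
                break
        return num_active                # influence spread (NumActiveNodes)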

D. Complexity Analysis on a Small Network vs. a Large Network

Recent statistics show that Facebook has grown into a one-billion-node online social network, the largest of its kind [14]. This is indeed a classical ‘large network’. The runtime of the FTSM algorithm on the small network of 2,344 nodes is 2.4 hours (on an Intel Core i7 860 @ 2.80 GHz with 16 GB RAM, MATLAB R2012a). This section attempts to formulate a mathematical expression for the key elements that affect the computational complexity of FTSM.

Consider an undirected graph G = (V, E) with vertices m in V as the users in the network and edges k in E as the relationships between individuals. Let d_i be the number of neighbors of node i. Also define B as the number of seed nodes, the initial points at which viral spread begins. Let N be the number of hops (number of iterations), set here to 20, and let c_i be the computational complexity of the inner loop in line 4 of Algorithm 1. P(W > Threshold) is the probability that the computed W(m) in Eq. 2 is greater than the Threshold in Eq. 3, and P(notRepeat) is the probability that a node has not already been visited and activated in the past (cf. the ICM in Fig. 2).

c_{i+1} = P(W > Threshold) · Σ_{j=1}^{c_i} d_j · P_j(notRepeat)    (4)

The sum of all the search computations for an entire seed set is

S_i = Σ_{j=1}^{c_{i-1}} d_j    (5)

The total search for N = 20 hops is then

Σ_{i=1}^{N} S_i = Σ_{i=1}^{N} Σ_{j=1}^{c_{i-1}} d_j    (6)

Since Eq. 6 is too complex for closed-form analysis, a simpler special case is shown below. Assume that d_i, P(notRepeat), and P(W > Threshold) are constant across all nodes in the entire network.

c_1 = B
c_2 = B · [P(W > th) · P(notRepeat) · d]
c_3 = B · [P(W > th) · P(notRepeat) · d]^2
⋮
c_i = B · [P(W > th) · P(notRepeat) · d]^{i-1}    (7)

Since the above terms follow a geometric progression, the sum of all N terms can be calculated as

Σ_{i=1}^{N} c_i = Σ_{i=1}^{N} B · [P(W > th) · P(notRepeat) · d]^{i-1}
              = Σ_{i=0}^{N-1} B · [P(W > th) · P(notRepeat) · d]^i
              = B · (1 − [P(W > th) · P(notRepeat) · d]^N) / (1 − [P(W > th) · P(notRepeat) · d])    (8)

From Eq. 8, it can be seen that the number of computations grows geometrically as N increases. Therefore, for a large network, a larger N is required to observe convergence. Moreover, the value of P(notRepeat) decays as more and more nodes become activated; for a large network this decay takes much longer, because more nodes need to be reached. Therefore, it is more advantageous to simulate on a small network that is a representative sample of the large network rather than on the full network.
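As a quick sanity check, the closed form of Eq. 8 can be evaluated directly; the sketch and parameter values below are our own illustrations, not measurements from the paper:

    def ftsm_cost(num_seeds, p_w, p_not_repeat, degree, hops):
        # Common ratio of the geometric series in Eq. 8.
        r = p_w * p_not_repeat * degree
        if r == 1.0:
            return num_seeds * hops
        return num_seeds * (1 - r**hops) / (1 - r)

    # Example: 7 seeds, P(W > th) = 0.05, P(notRepeat) = 0.9, average degree 60.
    print(ftsm_cost(7, 0.05, 0.9, 60, hops=20))   # grows geometrically with hops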

Fig. 5 illustrates the small-network observation concept. The highlighted yellow ellipse of observation is analogous to the 2,344-node dataset used in this paper, in comparison to the greater Facebook dataset, which contains many hundreds of thousands of nodes. Our hypothesis is that the spread pattern around a seed node in Fig. 5(d) is similar to that in other areas of the network if the remaining clusters have a similarly structured social fabric.



[Fig. 5. A four-step illustration of multi-seed FTSM. The highlighted circle represents a small-network observation that shows a spread pattern similar to that of the large network.]

III. SIMULATION RESULTS

The FTSM algorithm is simulated over a Facebook dataset of 2,344 user nodes and 70,789 edges, and the resulting spread curves are generated for the top 10 viral videos. Once the unique video identifiers (the YouTube URLs) for all the videos are identified, the nodes involved in re-sharing these viral videos are identified, and a table containing the Facebook IDs and timestamps of all the nodes in the dataset that shared a particular video is compiled. This ground-truth data is used as the seed nodes for the FTSM simulation. The results of this spread are discussed and interpreted in Section III-D.

A. Determining Global Threshold

The effect on the simulated FTSM spread size (NumActiveNodes) of varying the Threshold from Eq. 3 is shown in Fig. 6. The value of Threshold is varied from 1.0 to 2.0 in steps of 0.1 for the most active user in the Facebook dataset. If Threshold is too small, FTSM can experience boundary problems and end up activating the entire network; on the other hand, if it is too large, there will be insufficient spread and the analysis will be inconclusive. The value 1.5 is chosen heuristically to give sensible spread results, and fairness is ensured by applying the same threshold to all nodes.

[Fig. 6. Effect on NumActiveNodes of varying the Threshold from 1.0 to 2.0 (x-axis: Threshold; y-axis: number of nodes activated after FTSM, approximately 1000–1800).]
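A minimal sketch of this threshold sweep, reusing the hypothetical ftsm() function sketched in Section II-C (graph, weights, and seeds stand in for the mined dataset):

    def threshold_sweep(graph, weights, seeds):
        results = {}
        for i in range(11):                  # Threshold from 1.0 to 2.0, step 0.1
            t = round(1.0 + 0.1 * i, 1)
            results[t] = ftsm(graph, weights, seeds, threshold=t, hops=20)
        return results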

B. Power Law behavior of the Facebook Dataset

A power law is a mathematical relationship between two quantities in which the frequency of an event varies as a power of some attribute of that event. There is evidence that the distributions of a wide variety of physical, biological, and man-made phenomena follow a power law, including the sizes of earthquakes, craters on the Moon, and solar flares [15], and, in this case, the node degree distribution of online social networks. Fig. 7 plots the number of nodes in the mined Facebook dataset against the node degree (the number of neighbors each node has).



[Fig. 7. Plot of node degree vs. number of nodes on a linear scale (x-axis: node degree, 0–500; y-axis: number of nodes, 0–60).]

This diagram demonstrates the data integrity of the mined small network in terms of its node degree distribution characteristics. Let y be the number of nodes that have x neighbors. In a power law we have

y = C x^{-a}    (9)

Taking the logarithm of both sides yields

log(y) = log(C) − a log(x)    (10)

Therefore, a power law with exponent a should, in theory, appear as a straight line with slope −a and a positive intercept of log(C) on a log-log plot. Fig. 8 shows the log-log plot of the number of nodes vs. node degree. It can be seen that the number of nodes with degree less than 10 is relatively low, causing a dip in the expected straight line. It can be speculated that this is due to the small size of the dataset. Another explanation could be that Facebook deactivates accounts that have very few friends or are rarely accessed, or suggests that such users add friends; this may lead to fewer nodes with fewer than 10 neighbors.
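For illustration (our own sketch, with hypothetical toy data), the exponent a of Eq. 9 can be estimated by a straight-line fit on the log-log plot suggested by Eq. 10:

    import numpy as np

    def powerlaw_exponent(degrees, counts):
        x = np.log(np.asarray(degrees, dtype=float))
        y = np.log(np.asarray(counts, dtype=float))
        slope, intercept = np.polyfit(x, y, 1)    # log(y) = log(C) - a*log(x)
        return -slope, np.exp(intercept)          # returns (a, C)

    # Toy data falling off roughly as degree^-1.5:
    a, C = powerlaw_exponent([10, 20, 50, 100, 200], [60, 21, 5.4, 1.9, 0.7])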

C. Correlation between Facebook Social Sharing and YouTube Global Hit Count

An underlying assumption throughout this study is that viral videos are made popular by the positive influence of social networking sites such as Facebook and Twitter. In the case of Facebook, each time a user shares a video, comments on it, or likes it, a wider audience is able to see the video in their news feeds. This leads to more viewership, and eventually the video propagates across the social network.

An analytics website named Unruly Media [7] uses proprietary data mining techniques to compile a real-time viral chart of the videos that show a steep increase (7-day moving average) in Facebook/Twitter share count.

[Fig. 8. Plot of node degree vs. number of nodes on a log-log scale.]

[Fig. 10. Scatter plot of the top 20 viral videos' global YouTube hit count (y-axis, up to 8×10^8) vs. global Facebook share count (x-axis, up to 4×10^7).]

To assess the correlation between YouTube global hit count and Facebook global share count, Fig. 10 is compiled. A strong positive correlation can be seen, which validates the above-mentioned assumption.

D. Transient spread simulation compared with YouTube data

Fig. 9 shows the normalized view spread curves of the FTSM simulation and the YouTube transient view count graphs for the top 9 viral videos. Initially, as described in the previous section, the analysis is conducted for the top 10 videos; however, the uploader of video 6 (Jason Mraz, “I Won't Give Up” lyric video) has disabled its statistics, so it is not plotted. Note that the horizontal axis is in Microsoft timestamp format, i.e., the number of days since January 1, 1900. The data was mined on November 13, 2012, and therefore the largest value on the horizontal axis is 41226.

It can be observed that in most of the graphs, the FTSM curve (red) rises ahead of the YouTube hits curve (blue). This demonstrates the effectiveness of the FTSM algorithm at simulating the viral spread ahead of time, using the seeds observed at different moments in time.



[Fig. 9. Normalized view count for the FTSM simulation (red) and YouTube data (blue) for the top 9 viral videos in the Facebook dataset, panels (a)–(i); x-axis: time (Microsoft timestamp in days, 4.02×10^4 to 4.12×10^4); y-axis: normalized view count (0–100).]

For 7 out of 9 cases, the prediction was accurate and timely; in 2 cases the prediction was too late, and in no case was it too early. In Fig. 9(b) and Fig. 9(f), the red FTSM simulation curve shows a delayed prediction, and it can be seen that the data points are clustered within a few days (4.082 to 4.084 in Fig. 9(b)). This behavior can be explained by the fact that all the seed-node share occurrences fell within a brief period of time in the ground-truth Facebook dataset. This is indeed a viral outbreak (high spread within a short time), but one that occurred somewhat late; because the dataset is small, it could not capture the global trend early enough.

E. FTSM Predictor accuracy evaluation

As discussed in the Methodology section, the FTSM simulator returns a value representing the potential spread of a viral event across the network, i.e., how many potential viewers the viral content may reach. For the top 10 most-watched videos in the Facebook dataset, the first 7 viewers' Facebook IDs are used as the seed set for each video. The 7 seeds are used as currentSet (in line 4 of Algorithm 1) and are updated for every different video (in line 17 of Algorithm 1). For each video, the global YouTube statistics are also mined to obtain the total number of views from all YouTube users.



[Fig. 11. Scatter plot of the top 10 viral videos' global YouTube hit count (y-axis, up to 5×10^8) vs. the FTSM predictor's final spread value (x-axis, roughly 1200–2100).]

A scatter plot of the YouTube hit count vs. the FTSM score for the 10 most popular videos within our 2,344-node Facebook dataset is shown in Fig. 11. The Spearman's rank correlation coefficient for the two variables is ρ = 0.8303; there is therefore a strong positive correlation between the two variables.
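The rank correlation can be computed as below (a sketch with hypothetical placeholder vectors, not the paper's measured values):

    from scipy.stats import spearmanr

    # Hypothetical FTSM spread values and global hit counts for 10 videos.
    ftsm_spread = [1250, 1400, 1520, 1600, 1700, 1750, 1820, 1900, 1980, 2050]
    youtube_hits = [3e7, 9e7, 1.2e8, 1.5e8, 2.1e8, 1.9e8, 3.0e8, 3.6e8, 4.2e8, 4.8e8]
    rho, pval = spearmanr(ftsm_spread, youtube_hits)   # rho near 1 = strong positive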

IV. FUTURE WORK

As discussed in Section II-D, computing FTSM for a large network of a few million nodes results in a very long execution time, and there is considerable room for further research in this direction. Since this paper shows that a small network's viral spread simulation can be used to predict viral spread and ranking with good accuracy, the idea can be extended to perform prediction in large networks more efficiently. Specifically, a given large network can be partitioned into multiple small networks (subgraphs) that already have weak inter-cluster connections.

For example, for a 7-million-node network representing the Hong Kong population, the first step is to extract all the users who reside in Hong Kong from the international Facebook social graph. Once the geography is fixed, what remains is the efficient allocation of local cache servers to the different districts of Hong Kong. The last step is to cluster once again, based purely on social geography (who is socially connected to whom), and to run the viral prediction algorithm on the much smaller networks in parallel, as sketched below. The time-domain predictions for each YouTube video can then be compared against the results in each district (small network).
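A minimal sketch of the proposed partition-and-parallelize idea, reusing the hypothetical ftsm() function from Section II-C (the district partitioning itself is assumed to be given):

    from concurrent.futures import ProcessPoolExecutor

    def _run_district(args):
        graph, weights, seeds = args
        return ftsm(graph, weights, seeds)

    def predict_per_district(districts):
        # districts: list of (graph, weights, seeds) tuples, one weakly
        # connected subgraph per district, simulated independently in parallel.
        with ProcessPoolExecutor() as pool:
            return list(pool.map(_run_district, districts))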

V. CONCLUSION

This paper provided statistical analysis to verify FTSM against real-world data for the first time. The Fast Threshold Spread Model (FTSM) is a fast approximation of the classical Independent Cascade Model (ICM) and was used to perform fast prediction of multimedia content propagation based on the social information of its past viewers. This can be a solution to the cache management challenge of prioritizing large volumes of user-generated multimedia content under limited network resources. The predicted spread patterns for the simulated 2,344 nodes were compared against the real YouTube ground-truth statistics for the respective videos in Section III-D, and 7 out of 9 predictions were found to be accurate. The global view count ranking was found to match the predicted spread ranking with strong correlation (ρ = 0.83). This prediction information can be used to optimally allocate server resources so that the average delivery latency is minimized.

ACKNOWLEDGEMENT

The authors would like to thank the Hong Kong Research Grants Council for its support of this work under grant reference 610509. The authors would also like to thank Mr. Vinit Jakhetiya, Mr. Pengfei Wan, Mr. Pradeep Rajendran, Mr. Sheshan R. Aaron, Mr. Ramitha Soysa, and Ms. Liu Xiangjun for their valuable technical assistance and advice.

REFERENCES

[1] T. Broxton, Y. Interian, J. Vaver, and M. Wattenhofer, “Catching a viral video,” in Proc. IEEE Int. Conf. on Data Mining Workshops (ICDMW), Dec. 2010, pp. 296–304.

[2] S. S. Krishnan and R. K. Sitaraman, “Video stream quality impacts viewer behavior: inferring causality using quasi-experimental designs,” in Proc. 2012 ACM Internet Measurement Conference (IMC ’12), New York, NY, USA: ACM, 2012, pp. 211–224. [Online]. Available: http://doi.acm.org/10.1145/2398776.2398799

[3] Y. Chen, S. Jain, V. Adhikari, and Z.-L. Zhang, “Reverse engineering the YouTube delivery cloud,” in Proc. IEEE INFOCOM, 2011.

[4] V. Adhikari, S. Jain, and Z.-L. Zhang, “Where do you ‘Tube’? Uncovering YouTube server selection strategy,” in Proc. 20th Int. Conf. on Computer Communications and Networks (ICCCN), Jul.–Aug. 2011, pp. 1–6.

[5] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD ’03), New York, NY, USA: ACM, 2003, pp. 137–146. [Online]. Available: http://doi.acm.org/10.1145/956750.956769

[6] D. Trpevski, W. Tang, and L. Kocarev, “An opinion disseminating model for market penetration in social networks,” in Proc. 2010 IEEE Int. Symposium on Circuits and Systems (ISCAS), May–Jun. 2010, pp. 413–416.

[7] Unruly Media. (2012, Nov.) Viral video charts. [Online]. Available: http://viralvideochart.unrulymedia.com

[8] Facebook Developers. (2012, Apr.) Facebook Developers Graph API documentation. [Online]. Available: https://developers.facebook.com/docs/reference/api/

[9] R. Lerdorf. (1995) PHP: Hypertext Preprocessor documentation. [Online]. Available: http://www.php.net/

[10] D. Hardt. (2012, Oct.) RFC 6749: The OAuth 2.0 Authorization Framework. [Online]. Available: http://tools.ietf.org/html/rfc6749

[11] E. Hammer-Lahav, D. Recordon, and D. Hardt, “The OAuth 2.0 authorization protocol,” draft-ietf-oauth-v2-18, vol. 8, 2011.

[12] “Developer’s Guide, Google Chart API, Google Code.” [Online]. Available: http://code.google.com/apis/chart/

[13] “Chrome Developer Tools: Overview, Inspector.” [Online]. Available: https://developers.google.com/chrome-developer-tools/docs/overview

[14] “Yahoo Finance: Number of active users at Facebook over the years,” Oct. 2012. [Online]. Available: http://finance.yahoo.com/news/number-active-users-facebook-over-years-214600186--finance.html

[15] M. Newman, “Power laws, Pareto distributions and Zipf’s law,” Contemporary Physics, vol. 46, pp. 323–351, Sep. 2005.
