Predicting YouTube Content Popularity via Facebook Data: A Network Spread Model for
Optimizing Multimedia Delivery
Dinuka A. Soysa, Denis Guangyin Chen, Oscar C. Au, and Amine Bermak
The Hong Kong University of Science and Technology, ECE, Hong Kong
Email: {dinuka, denischn, eeau, eebermak}@ust.hk
Abstract—The recent popularity of social networking websites has resulted in greater usage of internet bandwidth for sharing multimedia content through websites such as Facebook and YouTube. Moving large volumes of multi-media data through limited network resources remains a technical challenge to this day. The current state-of-the-art solution for optimizing cache server utilization depends heavily on efficient caching policies to determine content priority. This paper proposes a Fast Threshold Spread Model (FTSM) to predict the future access pattern of multi-media content based on the social information of its past viewers. The prediction results are compared and evaluated against ground truth statistics of the respective YouTube videos. A complexity analysis of the proposed algorithm for large datasets and the correlation between Facebook social sharing and YouTube global hit count are also explored.
I. INTRODUCTION
Internet services provided by sites such as YouTube and
Facebook have allowed their users to easily generate, access and share
multi-media content of various formats, genres, and lengths
with their peers. The recent popularity of social networking
websites has resulted in greater usage of internet bandwidth
for sharing multimedia content, and the formidable task of
moving large volumes of multi-media data through limited
network resources remains a technical challenge to this day.
A web-hosted multi-media clip, for instance, can experience
viral growth [1] in access demand within a relatively short
period of time when each viewer advertises it to his or her peers,
who in turn spread it to others downstream.
Websites such as YouTube, which host these multi-media files,
must be able to adapt to this explosion of access demand in
order to avoid interruption of service.
Recent research by Krishnan et al. [2] has shown that even
a 5 second startup delay of an online video led to an
abandonment rate of up to 10% of the viewership. Therefore, in a
viral video outbreak, it is crucial that the video experiences
minimal streaming delay, so that the video streaming
website does not lose its audience to a competitor website
hosting the same video. The commonly accepted solution
for dealing with large access volume is to cache the content
across multiple servers so that the load can be distributed. Since
pre-caching a multi-media file across all available servers is an
expensive exercise, typically only the most accessed files are
cached near high-demand hot-spot locations. The management
of this premium cache estate is a non-trivial problem. In
the case of a virally spreading file, this pre-caching decision
must be made quickly due to the explosive access demand.
Therefore, a real-time viral behavior prediction model for
user-shared multi-media content at its early growth stages is
extremely useful for efficient cache management.
Fig. 1. Statistics and ranking of viral videos by Unruly Media
Adhikari et al. [3], [4] have shown that YouTube
employs a hashing strategy which maps video IDs to
a hierarchy of servers organized in both geographical space
and logical space. Advanced dynamic DNS re-mapping is
performed at runtime to achieve load balancing. The
geographic distribution of YouTube servers was obtained by
accessing YouTube from 20,000 randomly chosen proxy servers
across the globe. The majority of the servers are based
in the United States, and most of them are in Mountain View,
California, in close vicinity of the Google headquarters. Brazil
978-1-4673-5895-8/13/$31.00 ©2013 IEEE
Fig. 2. An example infection process of the Independent Cascade Model (panels: Step 1 to Step 5 and Final Stage, on a 9-node example graph). The original infected nodes each have a single chance of infecting their uninfected neighbors with a certain probability; newly infected nodes pass on the infection until no new nodes are infected.
and Indonesia also had surprisingly high percentages of shares.
This may reflect the high access demand originating from the
Asia region.
The main contributions of this paper are as follows.
• This paper proposes a novel concept where the user’s
social profile is used to model the potential future access
pattern of user-shared multi-media content. In the
past, multi-media content has typically been cached according
to its access frequency only.
• It provides statistical analysis to verify FTSM against
real-world data for the first time. The Fast Threshold Spread
Model (FTSM) is a fast approximation of the classical
Independent Cascade Model (ICM). A complexity analysis
is provided for FTSM.
• The correlation between Facebook social sharing and
YouTube global hit count is investigated. The mined
Facebook dataset obeys a power-law degree distribution
and is viewed as a predictor of the greater Facebook
global trend of video sharing activity.
The remainder of this paper is organized as follows. The
data mining process, along with the proposed method of using
the Fast Threshold Spread Model (FTSM), is explained in
Section II. The simulation results of FTSM on a real Facebook
dataset are discussed in Section III, along with the evaluation
of the results against ground truth YouTube data. Future
directions for this work and concluding remarks are
given in Sections IV and V, respectively.
II. METHODOLOGY
The spread of information through a socially connected
network can be modelled by the Independent Cascade Model
(ICM). The application of ICM in studying “word of mouth”
based viral marketing strategies was first explained by Kempe
et al. [5]. The basic algorithm is illustrated in Fig. 2. While this
model provides quantitative insights into marketing dynamics,
its computational complexity makes it unsuitable for real-time
estimation. A similar probabilistic approach is
taken by Trpevski et al. [6] in their Opinion Disseminating
Model (ODM). This model is useful for studying the dynamic
behaviour of opinion dissemination through a social graph,
but the large number of model parameters in ODM makes
Fig. 3. Experimental setup: Requesting, downloading and analyzing JSON objects from Facebook. Entities shown: Login Page, PHP server, Facebook Graph API. Steps shown: 1) Authorize (OAuth 2.0), send login credentials; 2) Verified; 3) Click download; 4) Graph API call; 5) JSON; 6) Repeat for all pages of each friend; 7) Parsers extract specific data, which is passed to Matlab AICM.
it difficult to apply in real-world applications where only
limited social attributes from services such as Facebook are
available for mining. Given a set of users who have already
accessed a piece of user-shared multi-media content, FTSM
can extrapolate future access patterns in real time. This
provides a quantitative viral spread assessment of the content as
it lives through a life cycle of infection, exponential spread,
and extinction.
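For comparison with the deterministic model introduced later, the ICM process summarized in Fig. 2 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function name, the adjacency-dictionary representation and the single uniform infection probability `p` are all assumptions made for the example.

```python
import random

def independent_cascade(neighbors, seeds, p, rng=None):
    """Run one ICM cascade and return the set of infected nodes.

    neighbors: dict mapping node -> iterable of adjacent nodes (undirected graph)
    seeds:     initially infected nodes
    p:         probability that one infection attempt succeeds (assumed uniform)
    """
    rng = rng or random.Random()
    infected = set(seeds)
    frontier = list(seeds)
    while frontier:                        # stop when no new nodes are infected
        new_frontier = []
        for u in frontier:                 # each newly infected node gets one chance
            for v in neighbors.get(u, ()):
                if v not in infected and rng.random() < p:
                    infected.add(v)
                    new_frontier.append(v)
        frontier = new_frontier
    return infected
```

Because each attempt is random, a single run yields only one sample of the spread; estimating the expected spread requires repeating the cascade many times, which is one source of the computational cost noted above.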
A. Facebook Data Mining
Websites such as YouTube do not contain direct social
connection information about their users. To overcome this
obstacle, we propose to use social network sites such as
Facebook as the reference data set to build the viral spread
model of user-shared media content. The assumption here
is that the content access records of Facebook users can
be linked to individual multi-media content on sites such
as YouTube. These records can be anonymous. The strong
correlation between user activity on Facebook and content
access patterns on YouTube is partially supported by statistics
from sites such as the Viral Video Chart by Unruly Media [7].
An extremely high resemblance can be observed between the
most-watched videos over the Internet and the most-shared videos
over Facebook.
Facebook Timeline and Profile data are mined via the Facebook
API [8] to build a social graph (an undirected edge network) of
users and their activity levels. Indicators such as the number of
posts per day can be used to assess how active and influential
a user is in the network. This graph is used to simulate
2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM) 215
the FTSM. Intuitively, if a piece of multi-media content is
accessed by a group of highly influential users, there is a high
probability that this content will become popular very rapidly
by means of viral spread among their peers.
Scraping software was created to automatically download
a large set of users’ Timeline information. The scraper is
designed to be accessible from any computer on any platform, so
that users can ‘donate’ their Facebook dataset (both the edge
network of their friends and the Timelines of their friends) by
running the scraper in their web browser.
The scraper runs in any browser that supports HTML, and
the computation is done on the server side in PHP [9] and
JavaScript. The communication between the three entities (the
client computer, the Facebook Graph API server and the PHP
server) is shown in Fig. 3. In step 1, a user logs in using his or her
Facebook credentials. Once the login button is pressed, the
credentials are passed to the Facebook server and authenticated
using the Open Authentication protocol (OAuth 2.0 [10],
[11]). This way, the scraper neither stores nor intercepts the
login credentials. Upon successful authentication, the scraper
requests an access token from the Facebook API, which is used for
requesting data. This step is not marked on the diagram for
simplicity. Once the user clicks download, the PHP server
starts the automatic scraping process.
Next, steps 4, 5 and 6 are executed; these steps vary
slightly depending on the data type being mined.
Initially, the scraper requests the list of friends of the
particular user. It then iterates through each friend’s Timeline.
Since one Timeline can consist of many thousands of posts, the
program requests the Timeline in pages of 100 posts each.
The main challenge in collecting this dataset was the “HTTP
Error 500: Internal server error” returned by the server from
time to time, perhaps to stop data from being harvested non-stop
(flooding the server). An intelligent scraper would stop,
wait for a random timeout of between 20 and 60 seconds, and
then resume collecting data. This difficulty may be one reason
why real-world data is not used more widely in such studies.
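The paging and random-timeout behaviour described above can be sketched as follows. The sketch is in Python rather than the PHP used for the actual scraper, and it makes no real Graph API call: `fetch_page` is a hypothetical callable supplied by the caller, and `ServerError` and `max_retries` are illustrative names that do not come from the original tool.

```python
import random
import time

class ServerError(Exception):
    """Stands in for an 'HTTP Error 500: Internal server error' response."""

def fetch_timeline(fetch_page, page_size=100, min_wait=20, max_wait=60,
                   max_retries=5, sleep=time.sleep):
    """Collect all posts of one Timeline, page by page.

    fetch_page(offset, limit) is a hypothetical callable that returns a list
    of posts (empty when the Timeline is exhausted) or raises ServerError.
    """
    posts, offset, retries = [], 0, 0
    while True:
        try:
            page = fetch_page(offset, page_size)
        except ServerError:
            retries += 1
            if retries > max_retries:      # give up after repeated failures
                raise
            # Pause for a random 20 to 60 seconds, then retry the same page.
            sleep(random.uniform(min_wait, max_wait))
            continue
        retries = 0
        if not page:                       # no more posts on this Timeline
            return posts
        posts.extend(page)
        offset += len(page)
```

Passing `sleep` as a parameter keeps the waiting policy testable without actually pausing; the retry limit is an added safeguard so that a permanently failing request does not loop forever.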
B. YouTube Video Statistics Mining
For the purpose of analyzing the transient behavior of a
video, from a single hit (viewer) all the way up
to a few million, the YouTube hit-count graph statistics were
extracted semi-automatically by making use of the Google Chart API
[12]. For the sake of simplicity, the ground-truth YouTube hit-count
data that corresponds to each video will be referred to
as the YouTube transient graph. The miniature graph provided
under the YouTube statistics on each video page is an image
file, as shown in Fig. 4. This creates a stumbling block for
data mining because the statistic cannot be mined directly
using a parser or a text scraper. After careful inspection of
the source HTML code of the YouTube page, it was observed
that the image is generated by the Google Chart API in real time,
when the page is loaded or requested.
The code inspector [13] is used to inspect the element and
reveal the long URL shown below.
http://chart.apis.google.com/chart?cht=lc:nda&chs=460x100&chf=bg,s,F4F4F4&chco=5F8FC9&chls=1.5&chg=0,-1,1,1&chxt=y,x&chxtc=0,0&chxs=0N*s*%20,333333,10|1,333333,10&chxl=1:|03/05/12|07/12/12|11/18/12&chxp=1,5,50,95&chxr=0,0,112628952|1,0,100&chd=t:0.0,6.2,59.1,65.8,70.3,72.8,74.3,75.2,75.9,76.2,76.6,77.0,77.2,77.5,77.5,77.7,77.9,78.0,78.3,78.6,78.7,78.9,79.1,79.2,79.3,79.4,79.5,79.7,79.7,79.9,80.0,80.0,80.2,80.3,80.3,80.5,80.5,80.6,80.7,80.8,80.9,81.0,81.0,81.1,81.2,81.2,81.3,81.4,81.4,81.5,81.6,81.6,81.7,81.8,81.8,81.9,81.9,81.9,81.9,82.0,82.0,82.0,82.1,82.1,82.1,82.2,82.2,82.2,82.2,82.3,82.3,82.3,82.3,82.4,82.4,82.4,82.4,82.5,82.5,82.5,82.5,82.6,82.6,82.7,82.7,82.7,82.8,82.8,82.8,82.9,82.9,82.9,83.0,83.0,83.1,83.1,83.1,83.2,83.3,83.3&chm=B,dce7eed4,0,0,0|AA,333333,0,0,10|AB,333333,0,0,10|AC,333333,0,0,10|AD,333333,0,0,10|AE,333333,0,0,10|AF,333333,0,0,10|AG,333333,0,0,10|AH,333333,0,0,10
Fig. 4. The YouTube statistics provided by YouTube API
The above URL contains all the coordinates, which
are normalized to 100 data points within the range 0-100,
perhaps by the API. A simple parser is used to extract the
content between ‘&chd=t:’ and ‘&chm=B’. After that step,
using prior knowledge of the video post date, the current
date and the cumulative number of hits thus far, the x and y
coordinates of all 100 data points are calculated. This process
is repeated for all the videos that are of interest to this dataset.
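The extraction step just described can be sketched in Python as below. The text does not spell out the exact scaling rule, so the sketch makes two labelled assumptions: the samples are evenly spaced between the post date and the current date, and the last sample corresponds to the current cumulative hit count. The function name `parse_transient` is illustrative.

```python
from datetime import date, timedelta

def parse_transient(url, post_date, current_date, total_hits):
    """Return a list of (date, estimated cumulative hits) points."""
    # Take the text between '&chd=t:' and '&chm=B', as described above.
    start = url.index("&chd=t:") + len("&chd=t:")
    end = url.index("&chm=B", start)
    values = [float(v) for v in url[start:end].split(",")]
    if len(values) < 2 or values[-1] == 0:
        raise ValueError("need at least two samples and a non-zero last sample")

    span_days = (current_date - post_date).days
    last = values[-1]
    points = []
    for i, v in enumerate(values):
        # Assumption: samples are evenly spaced between the two dates.
        day = post_date + timedelta(days=span_days * i / (len(values) - 1))
        # Assumption: the last sample equals the current total hit count.
        points.append((day, total_hits * v / last))
    return points
```

A small usage example with a shortened, made-up data series: `parse_transient("http://chart.apis.google.com/chart?cht=lc:nda&chd=t:0.0,50.0,100.0&chm=B,dce7eed4,0,0,0", date(2012, 1, 1), date(2012, 1, 3), 200)` yields one point per day from 1 to 3 January 2012.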
C. Fast Threshold Spread Model
The Fast Threshold Spread Model (FTSM) is a deterministic
diffusion model based on the activity rate of each individual
user node. The Facebook social graph extracted in Section
II-A is modelled as an undirected graph

$$G = (V, E) \tag{1}$$

with vertices, m, in V as the users in the network and edges,
k, in E as the relationships between individuals. For each
user node m, we evaluate the weight function:

$$W(m) = 0.5\,A_1(m) + 0.5\,A_2(m) \tag{2}$$

where A1(m) is the average number of posts posted per
week by user m and A2(m) is the average number of shares
plus the number of comments plus the number of likes for
each of his or her posts. They are averaged to yield W(m), which
indicates the social influence of a given user.
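As a small worked illustration of Eq. 2, the weight of one user can be computed from raw counts as follows. This is a sketch under one stated assumption: A2(m) is read as the mean, over the user's posts, of shares + comments + likes. The function and argument names are hypothetical.

```python
def influence_weight(posts_per_week, shares, comments, likes):
    """W(m) = 0.5*A1(m) + 0.5*A2(m) for one user.

    posts_per_week:          A1(m), average number of posts per week
    shares, comments, likes: per-post counts, one list entry per post
    """
    n_posts = len(shares)
    if not (n_posts == len(comments) == len(likes)):
        raise ValueError("per-post lists must have equal length")
    if n_posts == 0:
        a2 = 0.0                           # no posts, so no engagement
    else:
        # Assumed reading of A2(m): mean engagement per post.
        a2 = (sum(shares) + sum(comments) + sum(likes)) / n_posts
    return 0.5 * posts_per_week + 0.5 * a2
```

For a user with 4 posts per week and two posts that received (1 share, 2 comments, 3 likes) and (0 shares, 2 comments, 4 likes), A2 = 12 / 2 = 6 and W = 0.5 · 4 + 0.5 · 6 = 5.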
Each user node m will be either active or inactive. We
assume that the nodes can switch from inactive to active, but not
the reverse. The total number of activated nodes, also known as
the influence spread, is denoted by NumActiveNodes. The
activation process is determined by a threshold function:

$$\mathrm{Decision}(m) = \begin{cases} \mathrm{Active}(m) = 1, & W \ge \mathrm{Threshold} \\ \mathrm{Active}(m) = 0, & W < \mathrm{Threshold} \end{cases} \tag{3}$$
The Threshold value is chosen by simulation to be 1.5. The
justification of this value is detailed in Section III-A.
The pseudocode of the proposed FTSM algorithm is
as follows:
Algorithm 1 Pseudocode for the FTSM algorithm
 1: Load G = (V, E) to array
 2: Set currentSet = seedNodes
 3: for i = 1 to 20 do
 4:   for each node m in currentSet do
 5:     Set Active(m) = 1;
 6:     Set Neighbors to m’s immediate neighbors
 7:     compute W(m)
 8:     for each node n in Neighbors do
 9:       if n is not already visited then
10:         if W < threshold then
11:           Increment NumActiveNodes by 1
12:           add n to newSet
13:         end if
14:       end if
15:     end for
16:   end for
17:   currentSet = newSet
18: end for
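A runnable Python sketch of the hop-by-hop spread in Algorithm 1 is given below; it is an illustration, not the MATLAB code used for the experiments. Two points are assumptions of this sketch: it follows Eq. 3 in letting a node pass the content on only when W(m) ≥ threshold (the listing above writes the test as W < threshold), and it marks nodes as visited when they are added, which the listing leaves implicit. The seeds are included in the returned count.

```python
def ftsm(neighbors, weight, seeds, threshold, hops=20):
    """Deterministic threshold spread; returns (active_nodes, size_per_hop).

    neighbors: dict node -> iterable of adjacent nodes (undirected graph G)
    weight:    dict node -> W(m) as in Eq. 2
    seeds:     nodes known to have accessed the content
    """
    active = set(seeds)
    current = set(seeds)
    history = [len(active)]                # spread size after each hop
    for _ in range(hops):
        new = set()
        for m in current:
            # Assumed reading of Eq. 3: only W(m) >= threshold spreads.
            if weight.get(m, 0.0) < threshold:
                continue
            for n in neighbors.get(m, ()):
                if n not in active:        # skip nodes already visited
                    active.add(n)
                    new.add(n)
        if not new:                        # spread has died out
            break
        history.append(len(active))
        current = new
    return active, history
```

The per-hop history gives a transient spread curve, while the final size of the active set plays the role of NumActiveNodes.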
D. Complexity Analysis on a Small Network vs a Large
Network
Recent statistics show that Facebook has grown into a 1
billion node online social network, the largest of its kind [14].
This is indeed a classical ‘Large Network’. The runtime of the
FTSM algorithm on the small network of 2344 nodes is 2.4 hours
(on an Intel Core i7 860 @ 2.80 GHz, 16 GB RAM, MATLAB
R2012a). This section attempts to formulate a mathematical
expression for the key elements that affect the computational
complexity of FTSM.
Consider an undirected graph G = (V, E) with vertices,
m, in V as the users in the network and edges, k, in E as
the relationships between individuals. Let d_i be the number of
neighbors of node i. Also define B as the set of seed nodes,
the initial points at which viral spread begins (in the equations
below, B also denotes the number of seeds). Let N be the
number of hops (number of iterations), which is set
to 20, and let c_i be the computational complexity of the inner
loop at line 4 of Algorithm 1 in iteration i. P(W > threshold) is the
probability that the computed W(m) in Eq. 2 is greater than
Threshold in Eq. 3. P(notRepeat) is the probability that the
node has not been accessed and activated in the past (refer to
the ICM in Fig. 2).
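$$c_{i+1} = P(W > \mathrm{threshold}) \sum_{j=1}^{c_i} d_j\, P_j(\mathrm{notRepeat}) \tag{4}$$

The sum of all the search computations for an entire seed
set is:

$$S_i = \sum_{j=1}^{c_{i-1}} d_j \tag{5}$$

The total search for N = 20 is:

$$\sum_{i=1}^{N} S_i = \sum_{i=1}^{N} \sum_{j=1}^{c_{i-1}} d_j \tag{6}$$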
Since Eq. 6 is too complex for closed-form analysis,
a simpler special case is shown below. Assume that d_i, P(notRepeat) and P(W > threshold) are constant values
for all nodes in the entire network, with d_i = d. Then

$$\begin{aligned}
c_1 &= B \\
c_2 &= B\,[P(W > th)\,P(\mathrm{notRepeat}) \cdot d] \\
c_3 &= B\,[P(W > th)\,P(\mathrm{notRepeat}) \cdot d]^{2} \\
&\;\;\vdots \\
c_i &= B\,[P(W > th)\,P(\mathrm{notRepeat}) \cdot d]^{\,i-1}
\end{aligned} \tag{7}$$
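Since the above terms follow a geometric progression,
the sum of all N terms can be calculated by

$$\begin{aligned}
\sum_{i=1}^{N} c_i &= \sum_{i=1}^{N} B\,[P(W > th)\,P(\mathrm{notRepeat}) \cdot d]^{\,i-1} \\
&= \sum_{i=0}^{N-1} B\,[P(W > th)\,P(\mathrm{notRepeat}) \cdot d]^{\,i} \\
&= B \cdot \frac{1 - [P(W > th)\,P(\mathrm{notRepeat}) \cdot d]^{N}}{1 - [P(W > th)\,P(\mathrm{notRepeat}) \cdot d]}
\end{aligned} \tag{8}$$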
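The closed form in Eq. 8 can be checked numerically against the term-by-term sum of Eq. 7. The sketch below does this in Python; the parameter values used in the usage note are purely illustrative and are not measurements from the dataset.

```python
def total_nodes_processed(B, p_w, p_nr, d, N):
    """Return (iterative sum of c_i for i = 1..N, closed form of Eq. 8).

    B:    number of seed nodes
    p_w:  P(W > th), assumed constant
    p_nr: P(notRepeat), assumed constant
    d:    number of neighbors per node, assumed constant
    N:    number of hops
    """
    r = p_w * p_nr * d                     # per-hop growth ratio
    c, total = float(B), 0.0
    for _ in range(N):                     # c_1 = B, c_{i+1} = r * c_i
        total += c
        c *= r
    # The geometric-series formula is undefined at r = 1, where the sum is B*N.
    closed = B * N if r == 1 else B * (1 - r ** N) / (1 - r)
    return total, closed
```

With B = 7, P(W > th) = 0.5, P(notRepeat) = 0.5 and d = 8, the ratio is 2, and for N = 10 both expressions give 7 · (2^10 − 1) = 7161.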
From Eq. 8, it can be seen that the number of computations
grows geometrically with the number of hops N whenever the
per-hop ratio P(W > th) P(notRepeat) · d exceeds 1. For a
large network, a larger N is needed to observe convergence.
Also, the value of P(notRepeat) decays as more
and more nodes get activated. For a large network, this decay
takes much longer, because a larger number of nodes need to
be reached. Therefore, it is more advantageous to simulate on
a small network that is a representative sample of the large
network than to simulate on the full network.
Fig. 5 illustrates the small network observation concept. The
highlighted yellow ellipse of observation is analogous to the
2344 node dataset used in this paper, in comparison to the
greater Facebook dataset that contains many hundreds of thousands
of nodes. Our hypothesis is that the spread pattern around a
seed node in Fig. 5d) is similar to that in other areas of the network
if the remaining clusters have a similarly structured social fabric.
Fig. 5. A 4 step illustration of multi-seed FTSM. The highlighted circle represents a small network observation that shows a similar spread pattern to the large network
III. SIMULATION RESULTS
The FTSM algorithm is simulated over a Facebook dataset
of 2344 user nodes and 70789 edges. The resulting spread
curves are generated for the top 10 viral videos. Once the
unique video identifiers (the YouTube URLs) for all the videos
are identified, the nodes that are involved in re-sharing these
viral videos are identified. A table containing the
Facebook IDs and timestamps of all the nodes in the dataset that
shared a particular video is compiled. This ground truth data
is used as the seed nodes when running the FTSM simulation.
The results of this spread are discussed and interpreted in
Section III-D.
A. Determining Global Threshold
The effect on the simulated FTSM spread size
(NumActiveNodes) of varying Threshold from
Eq. 3 is shown in Fig. 6. The value of Threshold is varied
from 1.0 to 2.0 in steps of 0.1 for the most active user in
the Facebook dataset. It can be seen that if Threshold is too
big, FTSM can experience boundary problems and end up
activating the entire network; on the other hand, if it is too
small, there will be insufficient spread and the analysis will
be inconclusive. The value 1.5 is chosen heuristically to give
sensible spread results, and fairness is ensured by applying
the same threshold to all nodes.
Fig. 6. Effect on NumActiveNodes by changing the Threshold (x-axis: Threshold, 1.0 to 2.0; y-axis: number of nodes activated after FTSM, 1000 to 1800)
B. Power Law behavior of the Facebook Dataset
A power law is a mathematical relationship between two
quantities when the frequency of an event varies as a power
of some attribute of that event. There is evidence that the
distributions of a wide variety of physical, biological, and
man-made phenomena follow a power law, including the sizes
of earthquakes, craters on the moon and of solar flares [15],
and in this case, the node degree distribution of online social
networks.
Fig. 7. Plot of Node Degree vs Number of Nodes in linear scale (x-axis: node degree, 0 to 500; y-axis: number of nodes, 0 to 60)
Fig. 7 plots the number of nodes in the mined Facebook
dataset against the node degree (the number of
neighbors each node has). This diagram demonstrates the data
integrity of the mined small network in terms of its node
degree distribution characteristics. Let y be the number of
nodes that have x neighbors. In a power law
we have:

$$y = C x^{-a} \tag{9}$$
Taking the logarithm of both sides of the equation
yields

$$\log(y) = \log(C) - a \log(x) \tag{10}$$

Therefore, a power law with exponent a should, in theory,
appear as a straight line with slope −a on a log-log plot,
with a positive intercept of log(C). Fig. 8 shows the log-log
plot of the number of nodes vs. node degree. It can be seen
that there are relatively few nodes with node degree less than 10,
which causes the expected straight line to have
a dip. It can be speculated that this is due to the small size
of the dataset. Another explanation could be that Facebook
deactivates accounts that have very few friends or
are not accessed frequently, or prompts such users to add friends.
This may lead to fewer nodes that have fewer than 10 neighbors.
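The straight-line reading of Eq. 10 suggests a simple way to estimate the exponent a and the constant C from a degree histogram: fit a least-squares line to the points in log-log space. The Python sketch below illustrates that idea only; it is not the procedure used to produce Fig. 8, and the function name is hypothetical. A line fit on a log-log plot is best treated as a quick check rather than a rigorous estimator of a power-law exponent.

```python
import math

def fit_power_law(degrees, counts):
    """Least-squares fit of log(y) = log(C) - a*log(x); returns (a, C).

    degrees: node degree values x
    counts:  number of nodes y observed with each degree
    """
    # Zero degrees or zero counts have no logarithm and are skipped.
    pts = [(math.log10(x), math.log10(y))
           for x, y in zip(degrees, counts) if x > 0 and y > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two positive points")
    mx = sum(px for px, _ in pts) / n
    my = sum(py for _, py in pts) / n
    sxx = sum((px - mx) ** 2 for px, _ in pts)
    sxy = sum((px - mx) * (py - my) for px, py in pts)
    slope = sxy / sxx                      # equals -a in Eq. 10
    intercept = my - slope * mx            # equals log10(C)
    return -slope, 10 ** intercept
```

For synthetic data that follow y = 1000 · x^(−1) exactly, such as degrees (1, 10, 100) with counts (1000, 100, 10), the fit recovers a = 1 and C = 1000 up to rounding.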
C. Correlation between Facebook social sharing and YouTube
Global hit-count
An underlying assumption throughout this study is that viral
videos are made popular via the positive influence of social
networking sites such as Facebook and Twitter. In the case of
Facebook, each time a user shares a video, comments on it
or likes it, a wider audience is able to see this video on their
newsfeed. This leads to more viewership and eventually, the
video propagates across the social network.
An analytics website named Unruly Media [7] uses proprietary
data mining techniques to compile a real-time viral chart
of the videos that have a steep increase (7 day moving average)
in Facebook/Twitter share count. To assess the correlation
between YouTube global hit count and Facebook global share
count, Fig. 10 is compiled. It can be seen that there is a strong
positive correlation, which supports the above-mentioned
assumption.
Fig. 8. Plot of Node Degree vs Number of Nodes in log scale (x-axis: node degree (log), 10^0 to 10^3; y-axis: number of nodes (log), 10^0 to 10^2)
Fig. 10. Scatter plot of top 20 viral videos’ YouTube global hit count vs Facebook global share count (x-axis: global Facebook share count, 0 to 4 × 10^7; y-axis: global YouTube hit count, 0 to 8 × 10^8)
D. Transient spread simulation compared with YouTube data
Fig. 9 shows the normalized view spread curves for the
FTSM simulation and the YouTube transient view count graphs
for the top 9 viral videos. Initially, as described in the previous
sections, the analysis was conducted for the top 10 videos.
However, the uploader of video 6 (Jason Mraz - I Won’t Give
Up Lyric Video) has disabled the statistics, so that video is not
plotted. Note that the horizontal axis is in Microsoft
timestamp format. This figure can be obtained by calculating the
number of days since Jan 1, 1900. The data was mined on
November 13, 2012 and, therefore, the largest value on the
horizontal axis is 41226.
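The conversion between calendar dates and the day-count axis can be sketched as follows. The sketch assumes the usual spreadsheet serial-date convention, in which 1 January 1900 is day 1 and a non-existent 29 February 1900 is counted; using 30 December 1899 as day 0 absorbs both effects for any date from March 1900 onward. The function names are illustrative.

```python
from datetime import date, timedelta

# Day 0 of the serial scheme (assumed convention, valid from March 1900 on).
_EPOCH = date(1899, 12, 30)

def to_serial(d):
    """Calendar date -> serial day number used on the horizontal axis."""
    return (d - _EPOCH).days

def from_serial(n):
    """Serial day number -> calendar date."""
    return _EPOCH + timedelta(days=n)
```

Under this convention `to_serial(date(2012, 11, 13))` gives 41226, matching the largest axis value quoted above, whereas a plain difference from 1 January 1900 would give 41224.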
It can be observed that in most of the graphs, the FTSM
curve (red curve) rises ahead of the YouTube hits curve (blue
curve). This demonstrates the effectiveness of the FTSM algorithm
for simulating the viral spread ahead of time, using the seeds
observed at different moments in time. For 7 out of 9 cases, the
prediction was accurate and timely. There were 2 cases where
the prediction was too late and 0 cases where the prediction
was too early. In Fig. 9b) and Fig. 9f), the red curve (FTSM
simulation curve) shows a delayed prediction, and it can be
seen that the data points are clustered within a short interval
(4.082 × 10^4 to 4.084 × 10^4 in Fig. 9b)). This behavior can be explained by the
fact that all the seed node share occurrences fell within a brief
period of time in the ground truth Facebook dataset. This is
indeed a viral outbreak (high spread within a short time), but it
occurred somewhat late. Since the dataset is small, it
was not able to capture the global trend early enough.
Fig. 9. Normalized view count for FTSM simulation (in red) and YouTube data (in blue) for top 9 viral videos in the Facebook Dataset. Panels (a) to (i); x-axis: time (Microsoft timestamp in days), 4.02 × 10^4 to 4.12 × 10^4; y-axis: normalized view count, 0 to 100.
E. FTSM Predictor accuracy evaluation
As discussed in the Methodology section, the FTSM
simulator returns a value representing the potential spread of a
viral event across the network. This value predicts how many
potential viewers the viral content may reach. For
the top 10 most watched videos in the Facebook dataset, the
first 7 viewers’ Facebook IDs are used as the seed set for
each video. The 7 seeds are used as currentSet (in line 4
of Algorithm 1) and are updated for every different video (in
line 17 of Algorithm 1). For each video, the global YouTube
statistics are also mined to obtain the total number of views
from all YouTube users. Then, a scatter graph of the YouTube
hit count vs the FTSM score for 10 of the most popular videos
within our 2344 node Facebook dataset is plotted in Fig.
11. The Spearman’s rank correlation coefficient for the two
variables is ρ = 0.83030. Therefore, there exists a strong
positive correlation between the two variables.
Fig. 11. Scatter plot of top 10 viral videos’ Global YouTube hit count vs FTSM predictor’s spread count (x-axis: FTSM final spread value, 1200 to 2100; y-axis: YouTube global hit count, 0 to 5 × 10^8)
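For reference, Spearman's rank correlation can be computed as the Pearson correlation of the two rank vectors. The following Python sketch shows that calculation with average ranks for ties; it is a generic illustration, not the script used to obtain the value above, and the function names are illustrative.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

When there are no ties this agrees with ρ = 1 − 6Σd² / (n(n² − 1)), where d is the rank difference of each pair. For n = 10 videos without ties, a value of 0.8303 is what this formula gives for Σd² = 28, since 1 − 6 · 28 / 990 ≈ 0.8303.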
IV. FUTURE WORK
As discussed in Section II-D, computing FTSM for a large
network of a few million nodes results in a very long execution
time. There is considerable room for further research in
this direction. Since this paper shows that a small
network’s viral spread simulation can be used to predict the
viral spread and ranking with high accuracy, this idea can
be extended to perform prediction in large networks more
efficiently. Specifically, a given large network can be partitioned
into multiple small networks (subgraphs) that have only
weak inter-cluster connections.
For example, for a 7 million node network representing the
Hong Kong population, the first step is to extract all the users
that reside in Hong Kong from the international Facebook
social graph. Once the geography is fixed, what remains is the
efficient allocation of local cache servers to the different districts
of Hong Kong. The last step is to cluster once again, this time
based purely on social geography (who is socially connected
to whom), and to run the viral prediction algorithm on the much
smaller networks in parallel. The time domain predictions
for each YouTube video can then be compared against the results
in each district (small network).
V. CONCLUSION
This paper provided statistical analysis to verify FTSM
against real world data for the first time. The Fast Threshold
Spread Model (FTSM) is a fast approximation of the classical
Independent Cascade Model (ICM) and was used to perform
fast prediction of multi-media content propagation based on
the social information of its past viewers. This can be a
solution to the cache management challenges when prioritizing
large volumes of user generated multi-media content through
limited network resources. The predicted spread patterns for
the simulated 2344 nodes were compared against the real
YouTube ground truth statistics for the respective videos in
Section III-D and 7 out of 9 predictions were found to be
accurate. The global view count ranking is found to match the
predicted spread ranking with strong correlation (ρ = 0.83).
This prediction information can be used to optimally allocate
server resources, so that the average delivery latency can be
minimized.
ACKNOWLEDGEMENT
The authors would like to thank the Hong Kong Research
Grant Council for their support on this work under grant
reference 610509. The authors would like to thank Mr. Vinit
Jakhetiya, Mr. Pengfei Wan, Mr. Pradeep Rajendran, Mr.
Sheshan R. Aaron, Mr. Ramitha Soysa and Ms. Liu Xiangjun
for their valuable technical assistance and advice.
REFERENCES
[1] T. Broxton, Y. Interian, J. Vaver, and M. Wattenhofer, “Catching a viral video,” in Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, Dec. 2010, pp. 296–304.
[2] S. S. Krishnan and R. K. Sitaraman, “Video stream quality impacts viewer behavior: inferring causality using quasi-experimental designs,” in Proceedings of the 2012 ACM conference on Internet measurement conference, ser. IMC ’12. New York, NY, USA: ACM, 2012, pp. 211–224. [Online]. Available: http://doi.acm.org/10.1145/2398776.2398799
[3] Y. Chen, S. Jain, V. Adhikari, and Z.-L. Zhang, “Reverse engineering the YouTube delivery cloud,” in INFOCOM, 2011 Proceedings IEEE, 2011.
[4] V. Adhikari, S. Jain, and Z.-L. Zhang, “Where do you ‘tube’? Uncovering YouTube server selection strategy,” in Computer Communications and Networks (ICCCN), 2011 Proceedings of 20th International Conference on, Jul. 31-Aug. 4, 2011, pp. 1–6.
[5] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’03. New York, NY, USA: ACM, 2003, pp. 137–146. [Online]. Available: http://doi.acm.org/10.1145/956750.956769
[6] D. Trpevski, W. Tang, and L. Kocarev, “An opinion disseminating model for market penetration in social networks,” in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, May 30-Jun. 2, 2010, pp. 413–416.
[7] Unruly Media. (2012, Nov) Viral video charts. [Online]. Available: http://viralvideochart.unrulymedia.com
[8] Facebook Developers. (2012, April) Facebook developers Graph API documentation. [Online]. Available: https://developers.facebook.com/docs/reference/api/
[9] R. Lerdorf. (1995) PHP: Hypertext Preprocessor, documentation. [Online]. Available: http://www.php.net/
[10] D. Hardt. (2012, October) RFC 6749 - The OAuth 2.0 authorization framework - revision. [Online]. Available: http://tools.ietf.org/html/rfc6749
[11] E. Hammer-Lahav, D. Recordon, and D. Hardt, “The OAuth 2.0 authorization protocol,” draft-ietf-oauth-v2-18, vol. 8, 2011.
[12] “Developer’s Guide - Google Chart API - Google Code.” [Online]. Available: http://code.google.com/apis/chart/
[13] “Chrome developer tools: Overview - inspector.” [Online]. Available: https://developers.google.com/chrome-developer-tools/docs/overview
[14] “Yahoo finance: Number of active users at Facebook over the years,” Oct. 2012. [Online]. Available: http://finance.yahoo.com/news/number-active-users-facebook-over-years-214600186–finance.html
[15] M. Newman, “Power laws, Pareto distributions and Zipf’s law,” Contemporary Physics, vol. 46, pp. 323–351, Sep. 2005.