CS910: Foundations of Data Analytics

Graham Cormode

Case Studies

¨ 4 papers on data analytics published in the scientific literature:
– I Tube, You Tube, Everybody Tubes: Analyzing the World’s Largest User Generated Content Video System
  Internet Measurement Conference 2007
  http://conferences.sigcomm.org/imc/2007/papers/imc131.pdf
– What is Twitter, a Social Network or a News Media?
  19th International Conference on World Wide Web, 2010
  http://an.kaist.ac.kr/~haewoon/papers/2010-www-twitter.pdf
– Meme-tracking and the Dynamics of the News Cycle
  Knowledge Discovery and Data Mining (KDD), 2009
  http://snap.stanford.edu/class/cs224w-readings/leskovec09meme.pdf
– Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge
  2011 International Joint Conference on Neural Networks (IJCNN)
  http://arxiv.org/pdf/1102.4374v1.pdf
¨ Read them!


Details on Case Studies

¨ Full details and links to the papers on course web site
– www2.warwick.ac.uk/fac/sci/dcs/teaching/material/cs910/
¨ Please read the papers in detail to get the full story
¨ Bias: papers are from Computer Science research community
– Mostly address data analysis applied to large websites
– The most well-studied example of “Big Data”
– Examples should be familiar to you (YouTube, Facebook, Twitter)
¨ Objectives for the case studies:
– To see examples of data analytics in practice
– To introduce and motivate topics we will study in more detail later
– To see examples of going from data to insight to understanding


Case Study 1: Online video

¨ “I Tube, You Tube, Everybody Tubes: Analyzing the world’s largest user generated content video system”

¨ By Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, Sue Moon (Telefonica Research and KAIST)

¨ Published in Internet Measurement Conference 2007
¨ http://conferences.sigcomm.org/imc/2007/papers/imc131.pdf



Objectives

¨ To understand the impact of video sharing systems
¨ To study the popularity life-cycle of videos
¨ Study statistical properties of requests, relation to video age
¨ Study prevalence of copying activities
¨ Understand potential for caching to save bandwidth


Data Collection

¨ Crawled YouTube and Daum sites in 2007
– Wrote programs to automatically collect data about all videos
¨ YouTube was already very large in 2007
– Restricted crawl to ‘Entertainment’ and ‘Science/Tech’ categories
¨ Collected data on each video:
– Fixed: Uploader id, date of upload, duration of video
– Variable: #views, #total ratings, #positive ratings, links to
¨ Daily crawl for 6 days to see changes


Video popularity distribution

¨ Plot what fraction of views are outside the top videos
¨ Normalize ranks from 0 to 100, to allow comparison
¨ Top 10% of videos account for 80% of views
– Very skewed distribution
– Wide variation in popularity. Why?
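A statistic of the form "top 10% of videos account for 80% of views" can be computed directly from a list of view counts. This is a minimal sketch: the function name `top_share` and the sample counts are invented for illustration and are not data from the paper.

```python
def top_share(views, top_fraction=0.10):
    """Fraction of all views received by the most-viewed `top_fraction` of videos."""
    ranked = sorted(views, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))  # at least one video in the "top" group
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical view counts for ten videos (illustrative only).
views = [800, 50, 40, 30, 25, 20, 15, 10, 6, 4]
print(top_share(views))
```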


Understanding video popularity distribution

¨ “Skewness” (“the long tail”) is a common phenomenon in data
¨ Observed by plotting data on a log-log scale: straight lines
¨ Plot views on x-axis, #videos with more than x views on y-axis
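The quantity on the y-axis is a complementary cumulative distribution (CCDF). A small sketch of how it could be computed before plotting on log-log axes; the helper name `ccdf` is an assumption, and the result is normalised to a fraction rather than a raw count.

```python
from bisect import bisect_right


def ccdf(values):
    """Return (x, fraction of items with value strictly greater than x) per distinct x."""
    ordered = sorted(values)
    n = len(ordered)
    return [(x, (n - bisect_right(ordered, x)) / n) for x in sorted(set(ordered))]


# Hypothetical view counts; the largest value always maps to 0, which a
# log-scale y-axis cannot show, so that point is usually dropped when plotting.
print(ccdf([1, 1, 2, 4]))
```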


Modeling Skewness

Several distributions generate skew
¨ “Power law” (Pareto, Zipf): $y \propto x^{-a}$ for some $a$
– Gives straight line on log-log plot
¨ “Power law with exponential cut-off”: $y \propto x^{-a} e^{-bx}$
– $x < 1/b$: behaves like power law
– $x > 1/b$: behaves as exponential decay
¨ Log-normal: taking the log of the variable produces a Normal curve

How to tell which we are seeing?
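One way to see how a log-log plot separates these models is to take logs of each form. This is a sketch that treats $y$ as a density; the constant $c$ and the log-normal parameters $\mu$, $\sigma$ are symbols introduced here, not taken from the slides.

```latex
\begin{align*}
\text{Power law:} \quad & \log y = \log c - a \log x \\
\text{With cut-off:} \quad & \log y = \log c - a \log x - b x \\
\text{Log-normal:} \quad & \log y = \log c - \log x - \frac{(\log x - \mu)^2}{2\sigma^2}
\end{align*}
```

The first is a straight line in $\log x$ with slope $-a$. The second follows that line and then bends downward once $x$ exceeds roughly $1/b$. The third is a parabola in $\log x$.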


[Figure: three log-log plots, x-axis from 10^0 to 10^6, y-axis from 10^-6 to 10^0]

Fit a curve

¨ Find the best fitting curve for each model
– A regression problem (covered later)
– Log-normal captures behaviour for popular videos (head)
– Power law with exponential cutoff seems best for tail
¨ Why?
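As a simple illustration of curve fitting as regression, a power law can be fitted with a least-squares line through the points (log x, log y). This is only a sketch and not necessarily the fitting procedure used in the paper; `fit_power_law` and the synthetic data are assumptions. Least squares on log-log data is known to give biased exponents on noisy data, so treat it as a first look.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y); returns (a, c) for y ~ c * x**(-a)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    sxx = sum((u - mean_x) ** 2 for u in lx)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic, noise-free data generated from y = 5 * x**(-1.5).
xs = [1, 10, 100, 1000]
ys = [5 * x ** -1.5 for x in xs]
a, c = fit_power_law(xs, ys)
print(a, c)
```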

Possible explanations

Several mechanisms are known to generate long tail distributions:
¨ Preferential attachment: popular items are most likely
– Describes the main behaviour, but not the truncated tail
¨ Aging effect: “old” items eventually die, receive no new activity
– Does not fit videos: no ‘death’ or ‘removal’ of an old video
¨ Information filtering: a user can only view a fixed number
– Does not fit: people can keep watching new videos
¨ Fetch-at-most-once: a user can view each video at most once
– Better: people prefer to watch new videos, don’t watch top-10 over and over
– Some exceptions: music fans watch favourites many times


Validating ‘fetch-at-most-once’

¨ Simulate preferential attachment + fetch-at-most-once
– R requests per user
– U different users
– V different videos
¨ Observations from simulation
– Increasing R sharpens tail
– Increasing U shifts graph
  – Shape doesn’t change much
¨ Do you agree?
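A simulation of this kind can be sketched in a few lines. The details below are assumptions rather than the paper's exact setup: users arrive one after another, and each request picks an unseen video with probability proportional to its current view count plus one.

```python
import random


def simulate(num_users, requests_per_user, num_videos, seed=0):
    """Preferential attachment with fetch-at-most-once; returns view count per video."""
    rng = random.Random(seed)
    views = [0] * num_videos
    for _ in range(num_users):
        seen = set()  # videos this user has already watched
        for _ in range(min(requests_per_user, num_videos)):
            candidates = [v for v in range(num_videos) if v not in seen]
            # Popular videos are more likely to be picked ("+1" lets unseen videos start).
            weights = [views[v] + 1 for v in candidates]
            choice = rng.choices(candidates, weights=weights, k=1)[0]
            seen.add(choice)
            views[choice] += 1
    return views


# U = 50 users, R = 5 requests each, V = 20 videos (arbitrary illustrative values).
sim_views = simulate(num_users=50, requests_per_user=5, num_videos=20)
print(sorted(sim_views, reverse=True))
```

Because each user fetches a video at most once, no video can collect more views than there are users, which is what truncates the tail.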


Effect of time

¨ Views increase over time
¨ Truncation gets sharper over time
– Many possible reasons: e.g. more push to most popular content


Impact of age

¨ Popularity does not vary strongly by age
¨ Most recent videos are slightly more popular
¨ Data is from early days in YouTube: have things changed?


Can we predict future popularity?

¨ Consider current popularity (views) and age (time since upload)
– Does this correlate with future popularity?
– Table shows correlation coefficient (number of videos sampled)
¨ Strong correlation of instant popularity with future popularity
– From day 2. Day 3 does not change much.
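A correlation coefficient between early and later view counts can be computed as below. The sketch assumes the Pearson coefficient (the slide does not say which coefficient is meant), and the sample numbers are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical views of five videos on day 2 and some later day.
views_day2 = [10, 200, 35, 5, 80]
views_later = [40, 900, 120, 30, 310]
print(pearson(views_day2, views_later))
```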


Use these observations to cache

¨ Streaming video uses a lot of Internet bandwidth (up to 66%?)
¨ Could we cut bandwidth usage by running a cache?
– E.g. put a video cache for all of Warwick University
¨ How to fill the cache?
– Static: Pick the most popular items once and for all
– Dynamic: Initialize with most popular, then cache all new videos
  – Unrealistic, but a point of comparison
– Hybrid: Static + daily most popular
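The static and hybrid policies can be compared by replaying a request log and counting cache hits. This is a sketch under assumptions: day 0 is used only to choose the static contents, "daily most popular" is taken to mean the previous day's top videos, and the hybrid cache gets extra space for them. The request log is invented.

```python
from collections import Counter


def top_k(requests, k):
    """The k most requested video ids in a list of requests."""
    return {video for video, _ in Counter(requests).most_common(k)}


def hit_rate(requests_by_day, cache_size, policy="static"):
    """Fraction of requests from day 1 onwards that are served from the cache."""
    static_part = top_k(requests_by_day[0], cache_size)
    hits = total = 0
    for day in range(1, len(requests_by_day)):
        cache = set(static_part)
        if policy == "hybrid":
            # Assumed meaning of "daily most popular": yesterday's top videos.
            cache |= top_k(requests_by_day[day - 1], cache_size)
        for video in requests_by_day[day]:
            hits += video in cache
            total += 1
    return hits / total


# Hypothetical three-day request log.
days = [
    ["a", "a", "a", "b", "b", "c"],
    ["a", "d", "d", "d", "b"],
    ["d", "d", "a", "e"],
]
print(hit_rate(days, 1, "static"), hit_rate(days, 1, "hybrid"))
```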


Reflections on the paper

¨ A widely referenced paper from early in YouTube’s history
– 1100+ citations in the literature
¨ Characterized many aspects of video viewing behaviour
– And attempted to explain many of these
– Many other plots in the paper
¨ Video on the Internet has changed a lot since 2007
– Changes to YouTube website structure
– Huge growth in mobile devices
– Videos with billions of views
¨ Do conclusions still hold? What other phenomena emerge?


Case Study 2: Microblogging

¨ “What is Twitter, a social network or a news media?”
– Kwak, Lee, Park, Moon (KAIST), in WWW conference 2010
– http://an.kaist.ac.kr/~haewoon/papers/2010-www-twitter.pdf
¨ Objectives:
– “To study the topological characteristics of Twitter and its power as a new medium of information sharing”
– “The first quantitative study on the entire Twittersphere and information diffusion on it”


One slide on Twitter

¨ Messaging service where messages are up to 140 characters
¨ Broadcast (by default) to all other users
– @userid addresses a particular user
– #hashtag to tag a message
– RT: re-tweet someone else’s message (sometimes with comment)
¨ Users “follow” another, and receive that user’s messages


Data Collection

¨ User Profiles
– Crawl name, location, timezone, number of tweets
– Begin at a popular user to crawl the “giant connected component”
– Twitter limits to 20,000 requests an hour
  – Using 20 machines with different IP addresses, took 24 days
¨ Tweets
– Collected all tweets mentioning current “trending topics”
  – Probed every 5 minutes
– Used Twitter search API to get up to 1500 tweets per query
– Collect text, author, timestamp
– Removed “spam tweets” from very new users


Follower/following analysis

¨ Asymmetry between followers and following
¨ Some bumps can be explained with domain knowledge
– Following 20: Twitter suggests an initial set of 20 to follow
– Following 2000: used to be a limit of 2000, since removed
¨ Fits power law, with exponent 2.28 (quite skewed)


CCDF: Fraction with more than this number

Degree of separation

¨ 2 users are “friends” if there is a mutual following relationship
¨ Study the friendship distance from random starting points
– 80% are within 4 steps (compare to “6 degrees of separation”)
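Friendship distance from a starting user can be measured with a breadth-first search over the mutual-follow graph. A minimal sketch; the function names and the tiny example graph are illustrative assumptions.

```python
from collections import deque


def mutual_friends(following):
    """following: dict user -> set of users they follow. Keep only reciprocated links."""
    return {
        u: {v for v in outs if u in following.get(v, set())}
        for u, outs in following.items()
    }


def distances_from(friends, start):
    """Breadth-first search: number of friendship steps from `start` to each reachable user."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in friends.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Hypothetical graph: c follows d, but d does not follow back, so they are not friends.
following = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": set()}
friends = mutual_friends(following)
print(distances_from(friends, "a"))
```

Repeating the search from many random starting users and pooling the distances would give the kind of distribution the slide summarises.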


Proximity of users

¨ Are friends geographically close? How would you test this?
¨ Look at average time difference between friends
– As a function of number of friends


Importance of Twitter users

¨ Who are most important/influential Twitter users?
– F: Count followers? Too crude?
– R: Most retweeted?
– PR: Most critical in the follower-following graph?
  – Use PageRank: defined to measure the importance of web pages
  – Recursive definition: PageRank of a node is a weighted sum of the PageRanks of its followers
  – Can be computed efficiently (see later)
¨ Compare top-20 in each: they appear similar
¨ Compute rank-correlation between top-k
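The recursive definition of PageRank can be evaluated by repeated updating (power iteration). The sketch below uses the standard formulation, in which each user splits their score equally among the users they follow; the damping factor 0.85 and the iteration count are conventional choices assumed here, not values taken from the paper.

```python
def pagerank(following, damping=0.85, iterations=50):
    """following: dict user -> set of users they follow (u follows v gives v credit)."""
    users = set(following) | {v for outs in following.values() for v in outs}
    n = len(users)
    rank = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in users}
        for u in users:
            outs = following.get(u, set())
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new_rank[v] += share
            else:
                # A user who follows nobody spreads their score evenly over everyone.
                for v in users:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank


# Hypothetical graph: a and b both follow c; c follows a; nobody follows b.
ranks = pagerank({"a": {"c"}, "b": {"c"}, "c": {"a"}})
print(sorted(ranks, key=ranks.get, reverse=True))
```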


Comparison of importance measures

¨ Followers (F) and PageRank (PR) give most similar ranking
¨ ReTweets (RT) are also correlated, but more weakly so


Spread of influence

¨ ReTweets can spread a message far and wide

¨ Figure shows retweets of messages about a plane crash
– Most trees are shallow (<3 hops)

CCDF: Fraction with more than this number

Retweet time

¨ What do these plots tell you?

Distribution of delay from initial tweet to retweet

Inter-hop delay through retweet trees


¨ See also “A few chirps about twitter”, Krishnamurthy, Gill, Arlitt
– From 2008, early in Twitter’s history
– Instructive to compare the approach and the findings
¨ Shows that there are many ways to slice and dice data from a relatively “simple” data source
– Did not even look much at content of tweets
¨ Widely cited (3400+ citations) as early work on Twitter


Case Study 3: Meme-tracking

¨ “Meme-tracking and the Dynamics of the News Cycle”
¨ Jure Leskovec, Lars Backstrom, Jon Kleinberg (Stanford, Cornell)
¨ International Conference on Knowledge Discovery and Data Mining (KDD), 2009
¨ http://snap.stanford.edu/class/cs224w-readings/leskovec09meme.pdf
¨ Objectives:
– Track short distinctive phrases that travel through the web
– Use this to study the “news cycle” in the news media


Data Collection and Preparation

¨ News and blog activity from August 1 to October 31 2008
– 90 million documents from 1.65 million sites
– Used Spinn3r API to collect: see spinn3r.com
¨ Extracted 112 million phrases in “quotes”
– Discard those with < 4 words or seen < 10 times (uninteresting)
– Discard those where > 25% occurrences from same source (spam)
– Leaves 47 million occurrences of phrases
¨ Collect phrases into clusters, based on overlap
– Consider two phrases linked if they differ by at most 1 word
– Partition the induced “phrase graph” to isolate key phrases
  – A long speech may have several key phrases in it
  – Quite detailed process; see paper for details
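The linking rule can be illustrated with a word-level edit distance. This sketch reads "differ by at most 1 word" as "one word inserted, deleted or substituted", which is an assumption; the paper's actual procedure is more detailed.

```python
def word_edit_distance(p, q):
    """Minimum number of word insertions, deletions or substitutions turning p into q."""
    a, b = p.split(), q.split()
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        cur = [i]
        for j, word_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                        # delete word_a
                cur[j - 1] + 1,                     # insert word_b
                prev[j - 1] + (word_a != word_b),   # substitute (free if words match)
            ))
        prev = cur
    return prev[-1]


def linked(p, q):
    """Assumed reading of the slide's rule: phrases within one word edit are linked."""
    return word_edit_distance(p, q) <= 1


print(linked("you can put lipstick on a pig", "you can put lipstick on the pig"))
```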


Phrase distribution

¨ For each volume, plot # of phrases with at least that volume
– For all phrases, clusters of phrases, and phrases about “lipstick on a pig” (largest phrase cluster)


Most important threads

¨ Thread is all articles containing a phrase from a cluster
– Plot is automatically generated and labeled to show volume


Formulating a model

¨ Advanced data analytics: propose a new model to explain data
¨ Try to capture major effects, and neglect minor points
– Imitation: sources imitate/copy each other
– Recency: news cycle dominated by recent events
¨ Model: simulate discrete time steps: news sources report threads
– New thread produced at each step
– At time $t$, each source picks thread $j$ with probability proportional to $f(n_j)\,d(t - t_j)$
  – $n_j$: number of sources reporting on thread $j$ [“Imitation”]
  – $d(t - t_j)$: decay factor based on age of thread $j$ [“Recency”]


Validating the Model

¨ Simulation: pick f() as power-law and d() as exponential decay
– Generates synthetic data that looks similar to real
– Can we more rigorously validate the model?
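A simulation along these lines can be sketched as follows. The specific forms are assumptions chosen for illustration: f(n) = (n + 1)^gamma so that a brand-new thread can still be picked, d(age) = exp(-decay * age), and all parameter values are arbitrary.

```python
import math
import random


def simulate_news_cycle(steps, num_sources, gamma=1.0, decay=0.1, seed=0):
    """Each step adds one thread; each source then reports on one thread.

    A thread is chosen with probability proportional to
    (reports so far + 1) ** gamma * exp(-decay * age).
    Returns the total number of reports per thread.
    """
    rng = random.Random(seed)
    birth = []   # birth[j]: time step at which thread j appeared
    counts = []  # counts[j]: number of reports on thread j so far
    for t in range(steps):
        birth.append(t)
        counts.append(0)
        weights = [
            (counts[j] + 1) ** gamma * math.exp(-decay * (t - birth[j]))
            for j in range(len(birth))
        ]
        for j in rng.choices(range(len(birth)), weights=weights, k=num_sources):
            counts[j] += 1
    return counts


thread_volumes = simulate_news_cycle(steps=30, num_sources=10)
print(thread_volumes)
```

Setting `gamma=0` gives a recency-only variant and `decay=0` an imitation-only variant, which is one way to probe what each ingredient contributes.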


[Figure panels: Recency-only, Imitation-only]


¨ A very innovative approach to the question
– Created from scratch a way to think about “memes”
– Proposed new models, and gave some evaluation of them
– Tackled a timely question and used compelling examples
– Plots and figures illustrate the key points
– Widely cited (800+ citations)
– But - models not robustly evaluated


Case Study 4: Link prediction

¨ “Link Prediction by De-anonymization: How we won the Kaggle Social Network Challenge”

¨ Narayanan (Texas), Shi (Berkeley), Rubinstein (Microsoft)
¨ International Joint Conference on Neural Networks, 2011
¨ http://arxiv.org/abs/1102.4374
¨ Objectives:
– Correctly predict whether two users in a network would form a link
– Use additional background information to improve the results


The Competition

Kaggle hosts competitions for data analytics
¨ Hosted the 2011 IJCNN Social Network Challenge in late 2010
¨ Provided a graph drawn from a social network
– Nodes correspond to users in the network
– (Directed) edges indicate a following relationship
– Evaluation: determine whether a set of test edges truly occur
¨ It was later disclosed that the graph came from Flickr
– Edges are (directed) “friendship” relations between users


Link Prediction

¨ Goal of “link prediction” is to determine which new links will form, given current state of the graph

¨ Many factors can be taken into account:
– Properties of the nodes
– Existing number of links
– Common neighbours between a pair
– Graph distance between the pair
¨ This work used an additional factor:
– Try to match nodes in data set to their own data collection


Data Collection

¨ Competition data: Kaggle
– 1.1M nodes, 7.2M edges provided as main data
– 8960 “test edges”: 50% true edges (removed from main data)
– 20% of test set held back by Kaggle to evaluate the results
¨ Competitors’ data: Flickr
– Crawled Flickr social graph (used Python + Curl library)
– 2M nodes and all outgoing edges crawled
– Total of 9.1M nodes, with 163M edges (much bigger data set)
¨ Evaluation: Area Under the Curve (AUC) [defined later]
– Values are True/False, predicted as Positive/Negative
– Ranges from 0.5 (poor, random chance) to 1 (perfect)
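Ahead of the formal definition, AUC can be read as the probability that a randomly chosen true edge receives a higher score than a randomly chosen false edge, with ties counting half. The sketch below computes it by comparing all pairs, which is fine for small inputs; the scores and labels are made up.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted scores for four candidate edges; label 1 means a true edge.
print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```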


Degree distribution

¨ Each node has an in-degree and an out-degree
¨ Skewed distribution, few nodes have high in/out-degree
– Can try to use these as “landmarks”


Seed identification

¨ Try to match a few nodes between the two graphs
– Look at nodes with high in-degree (pointed to by many)
– These are likely to be present in both graphs
  – Because of the crawling process
¨ Pick highest n (20) degree nodes from Kaggle (K) and Flickr (F)
– Try to match them up to get a “seed” matching
– For a pair of nodes v, w in K (F), compute their “cosine similarity”:
  $\mathrm{sim}(v, w) = \dfrac{|N(v) \cap N(w)|}{\sqrt{|N(v)| \cdot |N(w)|}}$, where $N(v)$ is the set of neighbours of $v$
– Find best matching of nodes in K and F based on cosine similarities
  – Initially: manually
  – Later: optimization problem (see OR and optimization)
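The cosine similarity between two nodes is straightforward to compute once each node's neighbours are held as a set. A minimal sketch with invented neighbour sets; returning 0 for a node without neighbours is a convention chosen here.

```python
import math


def cosine_similarity(neighbours_v, neighbours_w):
    """#common neighbours / sqrt(#neighbours(v) * #neighbours(w))."""
    if not neighbours_v or not neighbours_w:
        return 0.0  # assumed convention for nodes with no neighbours
    common = len(neighbours_v & neighbours_w)
    return common / math.sqrt(len(neighbours_v) * len(neighbours_w))


# Hypothetical neighbour sets sharing two nodes out of four each.
print(cosine_similarity({1, 2, 3, 4}, {3, 4, 5, 6}))
```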


Propagation

¨ Now have matched n nodes between Kaggle and Flickr graphs
¨ Maintain a matching, and try to extend based on neighbors
– Find pairs of nodes in Kaggle and Flickr whose similarity is high
– Extend the matching. Iterate.
– Some heuristics to accept a new pair into matching:
  – Must be at least 4 mapped common neighbors
  – Cosine-similarity score must be at least 0.5
  – Difference in similarity scores between best, and second best must be at least 0.2
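The three acceptance heuristics can be written as a single check applied to the candidate Flickr matches for one Kaggle node. This is a sketch: the tuple layout and the function name are assumptions, and the gap is taken to be an absolute difference in scores.

```python
def accept_match(candidates, min_common=4, min_score=0.5, min_gap=0.2):
    """candidates: list of (flickr_node, cosine_score, mapped_common_neighbours).

    Returns the best candidate if it passes all three heuristics, else None.
    """
    if not candidates:
        return None
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    node, score, common = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    if common >= min_common and score >= min_score and score - second_score >= min_gap:
        return node
    return None


# Hypothetical candidates for one Kaggle node.
print(accept_match([("f1", 0.9, 5), ("f2", 0.6, 6)]))
```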


Results

¨ Using “ground truth” information:
– After 120,000 mappings in first stage, 99.3% correct
– After second stage, had mappings for 14K out of 17.6K in test set
  – Overall accuracy for matched nodes 97.8%
  – A coverage of 57% of edges: still need to give answer for rest
  – For test edges, accuracy was 95%
¨ Use inferred information to predict links for more of test set
– Look at all possible candidates for node pair in Flickr
– Use these to vote on whether the edge is present or not
  – Accept if unanimous vote
– Covers a further 19% of test edges


Machine Learning

¨ Leaves 24% of test edges without a mapping to Flickr
¨ Apply Machine Learning (the original goal of the challenge)
– Create a number of “features” for each edge
  – In-degree and out-degree of node
  – Whether reverse edge exists
  – Measures of local graph (number of common neighbors etc.)
– Train a “classifier”: see later lectures on classification
– AUC for the classifier approach is ~0.9
¨ Total AUC for the whole approach on test data is 0.981
– Excellent accuracy for deanonymized nodes, less for rest
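Features of the kind listed for a candidate edge u → v can be derived from the directed graph alone. The sketch below is illustrative: the feature names, the choice of which degrees to include, and the tiny graph are assumptions, not the authors' feature set.

```python
def build_in_edges(out_edges):
    """Reverse a directed graph given as dict node -> set of successors."""
    in_edges = {}
    for u, outs in out_edges.items():
        for v in outs:
            in_edges.setdefault(v, set()).add(u)
    return in_edges


def edge_features(out_edges, in_edges, u, v):
    """Simple features for the candidate directed edge u -> v."""
    neighbours_u = out_edges.get(u, set()) | in_edges.get(u, set())
    neighbours_v = out_edges.get(v, set()) | in_edges.get(v, set())
    return {
        "out_degree_u": len(out_edges.get(u, set())),
        "in_degree_v": len(in_edges.get(v, set())),
        "reverse_edge": u in out_edges.get(v, set()),
        "common_neighbours": len((neighbours_u & neighbours_v) - {u, v}),
    }


# Hypothetical graph; the candidate edge a -> b is not in it.
graph = {"a": {"c"}, "b": {"a", "c"}, "c": {"b"}}
features = edge_features(graph, build_in_edges(graph), "a", "b")
print(features)
```

Feature dictionaries like this, one per test edge, would then be passed to whichever classifier is trained.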



¨ The score of 0.981 AUC was enough to win the competition
– Second best was 0.969
  – http://www.kaggle.com/c/socialNetwork/leaderboard
– The researchers contacted the organizers to reveal their method
– Were told that this was within the rules
– Read the messageboard for the competition to see other opinions
  – http://www.kaggle.com/c/socialNetwork/forums
¨ Lesson: when understanding data, think beyond what you have
– Are there other data sets that can help understand it better?
– Can you learn properties of one data set and transfer to another?
– Can you link two data sets to learn more about the first?
¨ Lesson: removing information does not “anonymize” data


Lessons from Case Studies

¨ Start by describing the data
– How collected, with what attributes, what is dropped
¨ Look at the data
– Plot combinations of attributes to see correlations/distributions
¨ Find models that agree with the data
– Lines of best fit, long tail distributions
¨ Find explanations that are consistent with data and knowledge
¨ Extract observations of interest/value
¨ Make predictions and evaluate the quality

