Date posted: 11-May-2015
DrunkardMob - RecSys '13
DrunkardMob: Billions of Random Walks on Just a PC
Aapo Kyrola
Carnegie Mellon University
Twitter: @kyrpov
Big Data – small machine
Thanks: A major part of this work was done during a visit to Twitter’s Personalization and Recommendations team (Fall 2012).
This work in a Nutshell
1. Background: Random-walk-based methods are popular in recommender systems.
2. Research problem: How do you simulate random walks if your graph does not fit in memory?
3. Solution: Instead of doing one walk at a time, do billions of them at a time. Stream the graph from disk and maintain the walk states in RAM.
Contents
• Introduction to random walks
• Disk-based graph systems: GraphChi
• DrunkardMob algorithm
• Experiments
All code is available on GitHub: http://github.com/graphchi/graphchi-java
Introduction: Random Walks
• Graph: G(V, E)
– V = vertices / nodes, E = edges / links.
• A walk is a random sequence of t visits to vertices:
w := source(0) → v(1) → v(2) → v(3) → … → v(t)
• Walks follow edges by default, but can also reset or teleport with a certain probability.
– Transition probability: P(v(k+1) | v(k))
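The walk definition above can be illustrated with a minimal Python sketch. It is not taken from the GraphChi code base: the adjacency-list dictionary, the function name `random_walk`, and the choice to reset to the source (rather than teleport to an arbitrary vertex) are assumptions made for illustration.

```python
import random

def random_walk(graph, source, num_steps, reset_prob, rng):
    """Return the visited vertices source(0), v(1), ..., v(t) of one walk."""
    walk = [source]
    current = source
    for _ in range(num_steps):
        neighbors = graph.get(current, [])
        if not neighbors or rng.random() < reset_prob:
            # Reset: jump back to the source (also used at dead ends).
            current = source
        else:
            # Default: follow a uniformly random out-edge.
            current = rng.choice(neighbors)
        walk.append(current)
    return walk

# Toy graph as adjacency lists (hypothetical data).
graph = {0: [1, 2], 1: [2], 2: [0], 3: []}
walk = random_walk(graph, source=0, num_steps=5, reset_prob=0.15,
                   rng=random.Random(42))
print(walk)
```

Each step is either an edge traversal or a reset, so the walk is fully described by the transition probability P(v(k+1) | v(k)).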
Introduction (cont.)
• Usually we are interested in the distribution of the visits.
– Either the global distribution or the distribution for each source separately.
– Many applications (PageRank, FolkRank, SALSA, ...)
• Can be used to generate candidates:
– Choose the top K visited vertices as candidates to recommend.
Example: Global PageRank
• Model: a random surfer who starts from a random web page and clicks each link on the page with uniform probability:
– With probability d, teleports to a random vertex → infinite walk.
• PageRank(web page) ~ authority of the web page.
– Can be computed very efficiently using “power iteration” (in seconds / minutes even for graphs with billions of vertices) → not the interesting case.
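[Figure: random-surfer example. From the current vertex, each of the 3 out-links is followed with P = (1-d) / 3; with P = d the surfer teleports to “any vertex”.]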
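For reference, the “power iteration” mentioned on this slide can be sketched in a few lines of Python. This is a generic textbook-style sketch, not code from the talk. Following the slide, `d` is the teleport probability (many texts use d for the opposite quantity, the damping factor). Spreading the mass of vertices without out-links uniformly is an assumption of this sketch.

```python
def pagerank(graph, d=0.15, iterations=50):
    """Global PageRank by power iteration; d is the teleport probability."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Every vertex receives the teleport share d / n.
        new_rank = {v: d / n for v in graph}
        for v, out in graph.items():
            if out:
                # Follow each out-link with probability (1 - d) / len(out).
                share = (1.0 - d) * rank[v] / len(out)
                for u in out:
                    new_rank[u] += share
            else:
                # Assumption: a vertex with no out-links spreads its mass uniformly.
                for u in graph:
                    new_rank[u] += (1.0 - d) * rank[v] / n
        rank = new_rank
    return rank

cycle = {0: [1], 1: [2], 2: [0]}
ranks = pagerank(cycle)
print(ranks)
```

On a symmetric cycle every vertex ends up with the same rank, which is a quick sanity check for the update rule.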
Personalized PageRank
• PageRank | home (source) nodes:
– Compute a PageRank vector for each node separately → resets only to the home node(s).
– Restrict home nodes to some category / topic / pages visited by a user.
• Used e.g. for social network recommendations.
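[Figure: personalized example. From the current vertex, each of the 3 out-links is followed with P = (1-d) / 3; with P = d the walk resets to the home vertex.]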
Personalized PageRank (cont.)
• Naïve computation of Personalized PageRank (PPR):
– Compute the PageRank vector for each source separately using power iteration: O(n^2)
• Approximate by sampling:
– Simulate actual walks on the graph.
Random walk in an in-memory graph
• Compute one walk at a time (multiple in parallel, of course):
parfor walk in walks:
    for i = 1 to numsteps:
        vertex = walk.atVertex()
        walk.takeStep(vertex.randomNeighbor())
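A sequential Python version of the loop above, extended to count visits so that it estimates PPR for one source by sampling, might look as follows. The function name `estimate_ppr`, the dictionary-based graph, and the reset-to-source rule at dead ends are illustrative assumptions; the parallel `parfor` is replaced by a plain loop.

```python
import random
from collections import Counter

def estimate_ppr(graph, source, num_walks, num_steps, reset_prob, seed=0):
    """Estimate the PPR vector of `source` from the visit counts of short walks."""
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(num_walks):          # "parfor walk in walks", run sequentially here
        vertex = source
        for _ in range(num_steps):
            neighbors = graph[vertex]
            if not neighbors or rng.random() < reset_prob:
                vertex = source         # reset to the home vertex
            else:
                vertex = rng.choice(neighbors)
            visits[vertex] += 1
    total = sum(visits.values())
    # Normalize counts into an empirical visit distribution.
    return {v: count / total for v, count in visits.items()}

toy_graph = {0: [1], 1: [0], 2: [0]}
ppr = estimate_ppr(toy_graph, source=0, num_walks=100, num_steps=10, reset_prob=0.15)
print(sorted(ppr.items()))
```

This works only while `graph` fits in RAM, because every step makes a random access into it.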
Problem: What if the graph does not fit in memory?
[Figure: Twitter network visualization, by Akshay Java, 2009]
Distributed graph systems:
- Each hop across a partition boundary is costly.
Disk-based “single-machine” graph systems (this talk):
- “Paging” from disk is costly.
DISK-BASED GRAPH SYSTEMS
Disk-based Graph Systems
• Recently, frameworks that can handle graphs with billions of edges on a single machine, using disk, have been proposed:
– GraphChi (Kyrola, Blelloch, Guestrin: OSDI’12)
– TurboGraph (KDD’13)
– [X-Stream (SOSP’13) – model not suitable]
• We assume a vertex-centric model:
– Computation is done one vertex at a time.
GraphChi execution model
For T iterations:
    For p = 1 to P:
        For vertex in interval(p):
            updateFunction(vertex)
[Figure: the vertex range 1 … n is split at v1, v2, … into interval(1), interval(2), …, interval(P), with a corresponding shard(1), shard(2), …, shard(P).]
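The nested loop of the execution model can be mimicked in plain Python. This sketch only shows the iteration order; it does no disk I/O. The helper names `make_intervals` and `run`, and the equal-sized contiguous intervals, are assumptions for illustration (a real system may choose interval boundaries differently).

```python
def make_intervals(num_vertices, num_intervals):
    """Split vertex ids 0..num_vertices-1 into contiguous, roughly equal intervals."""
    size = -(-num_vertices // num_intervals)  # ceiling division
    return [range(start, min(start + size, num_vertices))
            for start in range(0, num_vertices, size)]

def run(num_vertices, num_intervals, num_iterations, update_function):
    """Call update_function once per vertex per iteration, interval by interval."""
    intervals = make_intervals(num_vertices, num_intervals)
    for _ in range(num_iterations):
        for interval in intervals:
            # A disk-based system would load the data for this interval here.
            for vertex in interval:
                update_function(vertex)

visited = []
run(num_vertices=10, num_intervals=3, num_iterations=2,
    update_function=visited.append)
print(visited)
```

The point of the structure is that only one interval's worth of graph data needs to be in memory at any moment.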
DRUNKARDMOB ALGORITHM
A random walk is often called a “Drunkard’s Walk”.
DrunkardMob: Basic Idea
• By example:
– Task: Compute personalized PageRank (PPR) for 1 million users in a social network, in parallel.
• I.e. 1 million different home / source nodes.
– For each user, launch 1000 random walks (with resets), in parallel.
• Each walk takes 10 hops.
• ~ Equivalent to one 10,000-hop walk (with resets) per user.
– For each user, keep track of the visits made by its 1000 short walks → PPR for each user.
– Store the state of each walk in RAM, process the graph from disk.
= 1B random walks in parallel, using ~5 GB of RAM.
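The numbers in this example can be checked with a few lines of arithmetic. The 4 bytes per walk is taken from the “Encoding walks” slide; attributing the gap between the resulting 4 GB and the quoted ~5 GB to bookkeeping overhead is an assumption, not something the slides state.

```python
num_users = 1_000_000        # home / source nodes
walks_per_user = 1_000
hops_per_walk = 10
bytes_per_walk = 4           # from the "Encoding walks" slide

total_walks = num_users * walks_per_user
hops_per_user = walks_per_user * hops_per_walk
walk_state_bytes = total_walks * bytes_per_walk

print(total_walks)             # 1000000000 walks in flight
print(hops_per_user)           # 10000 hops per user
print(walk_state_bytes / 1e9)  # 4.0 GB of raw walk state
```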
Random walks in GraphChi
• DrunkardMob algorithm
– Reverse thinking: iterate over the vertices and advance the walks currently at each vertex, instead of iterating over the walks.
ForEach interval p:
    walkSnapshot = getWalksForInterval(p)
    ForEach vertex in interval(p):
        mywalks = walkSnapshot.getWalksAtVertex(vertex.id)
        ForEach walk in mywalks:
            walkManager.addHop(walk, vertex.randomNeighbor())
Note: Need to store only the current position of each walk!
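A toy in-memory rendering of this loop in Python is given below. It is a sketch of the idea only, not the graphchi-java implementation: the graph is an ordinary dictionary rather than data streamed from disk, each walk is represented just by its source id stored under its current vertex, and the name `drunkardmob` and all parameters are hypothetical.

```python
import random
from collections import defaultdict

def drunkardmob(graph, intervals, sources, walks_per_source, num_hops,
                reset_prob, seed=0):
    """Advance all walks one hop per pass over the graph; return visit counts."""
    rng = random.Random(seed)
    # walks_at[v] holds the source id of every walk currently at vertex v.
    walks_at = defaultdict(list)
    for s in sources:
        walks_at[s].extend([s] * walks_per_source)
    visits = defaultdict(lambda: defaultdict(int))  # visits[source][vertex]
    for _ in range(num_hops):
        next_walks = defaultdict(list)
        for interval in intervals:                  # graph processed interval by interval
            for vertex in interval:
                for source in walks_at.get(vertex, ()):
                    neighbors = graph[vertex]
                    if not neighbors or rng.random() < reset_prob:
                        dest = source               # reset to the walk's home vertex
                    else:
                        dest = rng.choice(neighbors)
                    next_walks[dest].append(source) # only the new position is kept
                    visits[source][dest] += 1
        walks_at = next_walks
    return visits

toy = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
counts = drunkardmob(toy, intervals=[range(0, 2), range(2, 4)], sources=[0, 3],
                     walks_per_source=100, num_hops=5, reset_prob=0.15)
print({s: dict(c) for s, c in counts.items()})
```

Because the outer loops run over vertices in interval order, the graph is read sequentially, while the random access happens only in the walk table held in RAM.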
WalkManager
• Store walks in buckets
– An array for each vertex would cost too much.
Encoding walks
Only 4 bytes / walk.
Keeps track of each path → knowledge base applications.
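One way a walk could fit into 4 bytes is to pack a few small fields into a single 32-bit word. The slide does not preserve the actual bit layout, so the split below (24 bits for a source index, 7 bits for the offset of the current vertex inside its bucket, 1 flag bit) is purely an assumption for illustration, as are the function names.

```python
import struct

SOURCE_BITS = 24   # assumption: index of the walk's source
OFFSET_BITS = 7    # assumption: position of the current vertex inside its bucket
FLAG_BITS = 1      # assumption: one spare flag bit

def encode_walk(source_idx, offset, flag):
    """Pack the three fields into one unsigned 32-bit integer."""
    if not 0 <= source_idx < (1 << SOURCE_BITS):
        raise ValueError("source index out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("bucket offset out of range")
    if flag not in (0, 1):
        raise ValueError("flag must be 0 or 1")
    return (source_idx << (OFFSET_BITS + FLAG_BITS)) | (offset << FLAG_BITS) | flag

def decode_walk(word):
    """Inverse of encode_walk."""
    flag = word & 1
    offset = (word >> FLAG_BITS) & ((1 << OFFSET_BITS) - 1)
    source_idx = word >> (OFFSET_BITS + FLAG_BITS)
    return source_idx, offset, flag

word = encode_walk(source_idx=123456, offset=42, flag=1)
packed = struct.pack("<I", word)   # exactly 4 bytes on disk or in an array
print(len(packed), decode_walk(word))
```

Storing the offset within a bucket, rather than a full vertex id, is what keeps the word small: the bucket itself already identifies most of the position.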
Keeping track of walks
[Figure: GraphChi execution interval; vertex walks table (WalkManager); walk distribution tracker (DrunkardCompanion) holding the top-N visits for Source A and Source B.]
Keeping track of Walks
• If we don’t have enough RAM to store the distributions:
– Cut long tails: a similar problem to estimating the top-K frequent items in data streams with limited memory.
• Can also write hops to disk (bucket-by-bucket) and analyze them later.
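As one example of the limited-memory frequent-items problem mentioned above, here is the classic Misra-Gries summary in Python. The slides do not say which method DrunkardMob uses, so this is a stand-in technique chosen for illustration; the stream of visited vertex ids is made up.

```python
def misra_gries(stream, k):
    """Keep at most k - 1 counters; any item occurring more than n / k times survives."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop counters that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# Hypothetical visit stream for one source: two hot vertices and a long tail.
visit_stream = [1] * 50 + [2] * 30 + list(range(100, 120))
summary = misra_gries(visit_stream, k=4)
print(summary)
```

The counts in the summary are underestimates, but the long tail of rarely visited vertices is cut without ever holding more than k - 1 counters per source.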
Validity
• We assume that simulating 2000 × 5-hop walks with resets ~ one 10,000-hop walk with resets.
– Not exactly the same distribution: some longer streaks are not covered.
• But those would not be relevant for recommendations anyway!
– See Fogaras et al. (2005) for analysis.
Related Work
• Fogaras, Racz, Csalogany, Sarlos: “Towards scaling fully personalized pagerank: Algorithms, lower bounds, experiments” (2005)
– Similar idea with a full external-memory implementation.
• We keep walks in memory.
• Plenty of research on approximating PPR.
EXPERIMENTS
See paper for more experiments!
Case Study: Twitter WTF
• Implemented Twitter’s Who-to-Follow algorithm on GraphChi (see paper)
– Based on the WWW’13 paper by Gupta et al.
– Use DrunkardMob to generate a set of candidates to recommend for each user.
PPR: Full Twitter Graph
With a large server with SSD and 144 GB of memory: [Figure: results chart not preserved]
On a Mac laptop, could estimate 500K-1M PPRs (= 0.5-1B walks) in roughly the same time.
Runtime / Graph size
Running time ~ linear with graph size
Comparison to in-memory walks
Competitive with in-memory walks. However, if you can fit your graph in memory – no need for DrunkardMob.
Summary
• DrunkardMob allows simulating random walks efficiently on extremely large graphs
– Uses the bulk of RAM for keeping track of walks; the graph is streamed from disk.
– Graph size is not limited by RAM.
– Implement Twitter Who-To-Follow on your laptop!
• Future work: Adapt to distributed graph systems.
– Even Hadoop if you really really want.
Thank You!
• Code: http://github.com/graphchi/graphchi-java
Aapo Kyrölä
Ph.D. candidate @ CMU
http://www.cs.cmu.edu/~akyrola
Twitter: @kyrpov
Special thanks to Pankaj Gupta, Dong Wang, Aneesh Sharma and Jayarama Shenoy @ Twitter.