Partitioning Social Networks for Fast Retrieval of Time-dependent Queries Mindi Yuan, David Stein,...

1

Partitioning Social Networks forFast Retrieval of Time-dependent Queries

Mindi Yuan, David Stein, Berenice Carrasco, Joana Trindade, Yi Lu

University of Illinois at Urbana-Champaign

Presenter:Sameh ElniketyMicrosoft Research

Email questions, praise and complains to Prof. Yi Lu <[email protected]>

2

Online Social Networking (OSN)

• OSNs differ from traditional web applications significantly– Handle highly personalized content

• Retrieved pages differ from user to user; also vary over short time intervals when new activities occur

– Highly interconnected data• Immediate neighborhood and extended neighborhood (friend’s friend)

– Community• Strong community structure among end users, albeit varying with time

• Scaling OSNs is a problem– Interconnected nature– Astounding growth rate

• Twitter grew by 10 times in one month, forced to redesign and re-implement its architecture several times

3

Existing Approaches

• Conventional vertical scaling – Upgrade existing hardware– Replicate fully on each machine

• Problems– Expensive because of the cost of high performance servers– Even infeasible

• Facebook requires multiple hundreds of Terabytes of memory

• Horizontal scaling– Use a higher number of cheap commodity servers– Partition the load among them– Good for stateless traditional web applications

• Problematic with data back-end layer that maintains state– OSN data cannot be partitioned into disjoint components– Leads to costly inter-server communication

4

Inter-server communication

• Distributed queries are shown to reduce performance compared to local queries [Pujol et. al. 2010][Curino et. al. 2010]

• Curino et. al. reported that local queries double throughput

• Question to consider: How much can we minimize inter-server communication and at what cost?

5

Overview

1. Problem

2. Previous work

3. Our approach

4. Evaluation

6

Two extremes of the spectrum

• Cassandra– Random partitioning with distributed hashing– Consistency is easy to maintain since only one replica is available– Inter-server traffic increases at retrieval– Slow down responses– Lead to “multi-get” hole problem with e.g. Facebook, where increasing the

number of servers actually reduces throughput

• Replicating all users’ data on multiple servers– Eliminates inter-server traffic but increases replication overhead– Impossible to hold all data in memory on each server– Delay in writing due to updates of all replicas– Network traffic for maintaining consistency across replicas

7

Little Engines

The little engine(s) that could: scaling online social networks. Pujol et. al. Sigcomm, 2010.

• Assumption– Most of operations are based on data of a user and her neighbors– The assumption is debatable: True for Twitter data, not true for Facebook data

where a two-hop network is accessed for most operations (expanded on next slide)

• Guarantees that for all users in an OSN, their direct neighbors’ data are co-located on the same server– Application developers can assume local semantics– Scalability achieved with servers with low memory and network I/O

• Drawback– Popular data replicated on all servers– Extension to two-hop network will require a much larger number of replicas

8

Two-hop network

• SPAR proposed in Little Engines considers one-hop network– Mainly focus on twitter-type models: a user follows all contents of a group

of senders– Facebook messages were tested upon by SPAR, but they are still modeled

as a user following all his friends, instead of the real-world scenario where a user can access related activities of friends’ friends

• Two-hop networks– For instance, Joana can accesses all activities Between Jona and Nandana

Adarsh

Jona

Nandana

Joana

Naseer

10

Overview

1. Problem

2. Previous work

3. Our approach

4. Evaluation

11

Our Approach

• Goal: have most requests handled locally while having much fewer replicas than the approach in “little engines”?– To have an approach scalable with two-hop networks– To reduce the cost of maintaining consistency among replicas

• Main idea: – Instead of partitioning the social network, which contains all users that will

potentially send a message– We focus on the activity network, which contains users actually sending

messages– Partitioning is based on actual activities in the network so that most requests

can be served locally, which can be achieved with much fewer replicas than a strong locality guarantee

12

Activity Network 1/2

• Contains much fewer edges than the social network– More amenable to partitioning and replication

• In a two-hop neighborhood, even if a user does not participate in much activity with his direct neighbors, there can still be abundant activity in the two-hop neighborhood

• Facebook data for Dec 2006 New Orleans network– While most user pairs only have 1 post, the top 8% of users receiver > 100

posts in their two-hop neighborhood

13

Activity Network 2/2

• There is strong time correlation among wall posts between the same pairs of users– This motivates a local, dynamic partitioning algorithm that adapts to the

temporal correlation• Computation of auto-correlation

– Gather all messages in the period m(1), … m(k)– Compute the auto-correlation function

14

Activity Prediction Graph

• The example has 6 users in the network• There is a message node between each pair of user nodes• Each edge is assigned a weight that reflects

– Read frequency of a user (assume data collected)– The likelihood of retrieving a message from this particular edge when the

most recent N messages are retrieved from the two-hop neighborhood

15

Partitioning Algorithm

• Periodic partitioning– Use K-Metis to repartition the graph at fixed time epochs– In between epochs, new users and new interactions are added to the graph

to minimize partitioning cost

• Local adaptive partitioning– Takes advantage of strong time correlation between messages– Triggered when a retrieval results in remote accesses– Keep movements of nodes across partitions small

16

Overview

1. Problem

2. Previous work

3. Our approach

4. Evaluation

17

Evaluation Scenario

• Each retrieval involves the most recent 6 messages in a two-hop neighborhood

• We chose 6 as the dataset available is relatively small– A total of 13948 messages with 8640 active users

• We consider algorithms– Random hashing based on sender (hash_p1)– Random hashing based on sender-receiver pair (hash_p2)– Periodic partitioning– Local adaptive partitioning– Retrospective algorithm: instead use the APG, which uses past data to predict

the activity for the current month, use the actual activity in the month. It is the best one can do given the limitations of the partitioning algorithm

18

Results

• Proportion of retrievals that accesses only one partition for all 6 messages• The periodic algorithm performs much better than random hashing• Close to retro, so APG does provide relatively accurate prediction

19

Results

• Proportion of retrievals that accesses at most 3 partitions for all 6 messages

• Most retrievals touch at most 3 partitions

20

Results

• The local adaptive algorithm further improves over periodic partitioning• At least 6.4 times more local queries than random hashing• Without causing large movements of nodes as in the periodic case

21

Results

• We also compare the periodic partitioning on the APG vs a graph using weights in one-hop neighborhood only

• The retrievals are performed in the two-hop neighborhood• There is clearly an advantage in considering activities in the two-hop

neighborhood when constructing the graph• The advantage is more significant as the number of partitions increases

22

Conclusion

• We proposed a dynamic, adaptive algorithm to predict activities on an OSN and partition the network for fast retrieval of data

• The algorithm scales when information needs to be retrieved from the two-hop neighborhood

• Further work is done to improve data locality for most retrievals with a small number of replicas

Date post:	31-Mar-2015
Category:	Documents
Upload:	ramon-holby
View:	214 times
Download:	2 times

Partitioning Social Networks for Fast Retrieval of Time-dependent Queries Mindi Yuan, David Stein,...

Documents