
Distributed K-Betweenness (Spark)

Date post: 24-Jan-2017
Upload: daniel-marcous
Page 1: Distributed K-Betweenness (Spark)

Distributed K-Betweenness

Complex Network Analysis

Daniel Marcous and Yotam Sandbank

dmarcous@gmail.com, yotamsandbank@gmail.com

Page 2: Distributed K-Betweenness (Spark)

Centrality

❖ Core concept in complex network analysis

❖ Different measures:
    ❖ Closeness
    ❖ Degree
    ❖ Betweenness

Page 3: Distributed K-Betweenness (Spark)

Betweenness

● 
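For reference, the standard textbook definition of betweenness centrality is given here as an assumption about what this slide presented. The symbols are the conventional ones, not taken from the deck: $\sigma_{st}$ is the number of shortest paths from $s$ to $t$, and $\sigma_{st}(v)$ is the number of those paths that pass through $v$.

```latex
BC(v) = \sum_{\substack{s \neq v \neq t \\ s \neq t}} \frac{\sigma_{st}(v)}{\sigma_{st}}
```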

Page 4: Distributed K-Betweenness (Spark)

Betweenness computation

● 

Page 5: Distributed K-Betweenness (Spark)

Betweenness computation

Page 6: Distributed K-Betweenness (Spark)

Betweenness computation

Expensive Computation!
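To make the cost concrete, here is a minimal sketch of Brandes' algorithm (the algorithm named in the Implementation slide) for an unweighted, undirected graph in plain Python. The function names and the adjacency-dict input are illustrative assumptions, not the authors' code. It runs one BFS per source vertex, so the total work is roughly O(|V|·|E|).

```python
from collections import deque

def single_source_dependencies(adj, s):
    """Dependency of source s on every other vertex (one Brandes pass)."""
    sigma = {v: 0 for v in adj}    # number of shortest paths from s
    dist = {v: -1 for v in adj}    # BFS distance from s, -1 = unvisited
    preds = {v: [] for v in adj}   # predecessors on shortest paths
    sigma[s], dist[s] = 1, 0
    order = []                     # vertices in non-decreasing distance
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if dist[w] < 0:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = {v: 0.0 for v in adj}
    for w in reversed(order):      # accumulate from the farthest vertex back
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    delta[s] = 0.0                 # a source never counts toward itself
    return delta

def betweenness(adj):
    """Sum the per-source dependencies; halve because the graph is undirected."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        for v, d in single_source_dependencies(adj, s).items():
            bc[v] += d
    return {v: b / 2.0 for v, b in bc.items()}

# Path graph 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(betweenness(path))
```

Each call to `single_source_dependencies` is independent of the others, which is what makes the per-source loop a candidate for distribution.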

Page 7: Distributed K-Betweenness (Spark)

Betweenness computation

Page 8: Distributed K-Betweenness (Spark)

Distributed Betweenness

❖ Independent computation for each node

❖ Why not run on different machines?

❖ Betweenness computation not implemented in GraphX

Page 9: Distributed K-Betweenness (Spark)

Distributed Betweenness

❖ Algorithm:
    ❖ Divide nodes between machines
    ❖ For each machine, compute the Betweenness contribution of each node to every other node in the graph
    ❖ Aggregate results from all machines

❖ Problems:
    ❖ Can’t get information about a specific node in GraphX
    ❖ Need to copy the graph to every machine (scales badly with big graphs)
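The three algorithm steps above can be simulated on one machine. This is a hedged sketch in plain Python, not Spark code: `partition`, `machine_partial`, and `distributed_betweenness` are hypothetical names, and each simulated "machine" receives the whole graph, which is exactly the copying problem the slide points out.

```python
from collections import deque

def dependencies_from(adj, s):
    """Brandes accumulation for a single source on an unweighted graph."""
    sigma = {v: 0 for v in adj}
    dist = {v: -1 for v in adj}
    preds = {v: [] for v in adj}
    sigma[s], dist[s] = 1, 0
    order, queue = [], deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if dist[w] < 0:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = {v: 0.0 for v in adj}
    for w in reversed(order):
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    delta[s] = 0.0
    return delta

def partition(nodes, n_machines):
    """Step 1: divide the source nodes between machines (round-robin)."""
    return [nodes[i::n_machines] for i in range(n_machines)]

def machine_partial(adj, sources):
    """Step 2: one machine sums the contributions of its own sources.
    Note that it needs the full graph `adj` to do so."""
    partial = {v: 0.0 for v in adj}
    for s in sources:
        for v, d in dependencies_from(adj, s).items():
            partial[v] += d
    return partial

def distributed_betweenness(adj, n_machines):
    """Step 3: aggregate the partial results from all machines."""
    chunks = partition(sorted(adj), n_machines)
    partials = [machine_partial(adj, chunk) for chunk in chunks]
    # Halve the sum because each undirected pair is seen from both ends.
    return {v: sum(p[v] for p in partials) / 2.0 for v in adj}

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(distributed_betweenness(path, n_machines=2))
```

The result does not depend on how the sources are split, because the aggregation is a plain sum.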

Page 10: Distributed K-Betweenness (Spark)

Distributed Betweenness

❖ Solutions:
    ❖ Can’t get information about a specific node in GraphX
        ❖ GraphX Pregel API
        ❖ Run 1 iteration, with every node passing its identity to all its neighbors
    ❖ Need to copy the graph to every machine (scales badly with big graphs)
        ❖ We didn’t find a good solution for this problem
        ❖ How can we avoid copying the whole graph to every machine?
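The single Pregel iteration described above can be imitated in plain Python. This is a simulation under stated assumptions, not GraphX code: the three functions are named after the `vprog`, `sendMsg`, and `mergeMsg` arguments of GraphX's Pregel operator, but their bodies and the `one_superstep` driver are illustrative only.

```python
def send_msg(src, dst):
    """Each edge delivers the identity of one endpoint to the other."""
    return [(dst, {src}), (src, {dst})]

def merge_msg(a, b):
    """Combine two messages addressed to the same vertex."""
    return a | b

def vprog(state, msg):
    """Vertex program: fold the merged message into the vertex state."""
    return state | msg

def one_superstep(vertices, edges):
    """Run a single iteration: afterwards every vertex knows its neighbors."""
    inbox = {}
    for src, dst in edges:
        for target, msg in send_msg(src, dst):
            inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
    return {v: vprog(set(), inbox.get(v, set())) for v in vertices}

neighbors = one_superstep([1, 2, 3, 4], [(1, 2), (2, 3)])
print(neighbors)
```

After the iteration each vertex carries its own neighbor list as local state, so later steps no longer need to query the graph for a specific node.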

Page 11: Distributed K-Betweenness (Spark)

Distributed K-Betweenness

● 

Quotes by Adriana Iamnitchi , University of South Florida

Page 12: Distributed K-Betweenness (Spark)

Distributed K-Betweenness

● 

Page 13: Distributed K-Betweenness (Spark)

Distributed K-Betweenness

● 

Page 14: Distributed K-Betweenness (Spark)

Implementation

❖ Technology:
    ❖ Spark 1.5.2
    ❖ Scala 2.10
    ❖ GraphX 1.5.2 (+ Pregel API)

❖ Steps:
    ❖ Create K-graphlets
        ❖ Pregel
    ❖ Parallel BC calculation - contribution of vertex X to other vertices’ BC
        ❖ Local for each vertex’s graphlet
        ❖ Brandes
        ❖ Also parallelized for each vertex in k-graphlet
    ❖ BC aggregation - final kBC score for each vertex
        ❖ Reduce
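The steps above can be sketched on a single machine in plain Python. This is not the project's Scala code. It assumes that K-betweenness counts only shortest paths of length at most k, and the names `k_graphlet`, `contributions`, and `k_betweenness` are hypothetical.

```python
from collections import deque

def k_graphlet(adj, root, k):
    """Step 1: the subgraph induced by all vertices within k hops of root."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == k:
            continue                      # do not expand past k hops
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return {v: [w for w in adj[v] if w in dist] for v in dist}

def contributions(graphlet, root):
    """Step 2: Brandes pass from root, run locally on root's graphlet."""
    sigma = {v: 0 for v in graphlet}
    dist = {v: -1 for v in graphlet}
    preds = {v: [] for v in graphlet}
    sigma[root], dist[root] = 1, 0
    order, queue = [], deque([root])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graphlet[v]:
            if dist[w] < 0:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = {v: 0.0 for v in graphlet}
    for w in reversed(order):
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    delta[root] = 0.0
    return delta

def k_betweenness(adj, k):
    """Step 3: reduce all contributions into one kBC score per vertex."""
    kbc = {v: 0.0 for v in adj}
    for root in adj:                      # each root is independent work
        for v, d in contributions(k_graphlet(adj, root, k), root).items():
            kbc[v] += d
    return {v: s / 2.0 for v, s in kbc.items()}   # undirected graph

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(k_betweenness(path, k=2))
```

Because each root only needs its own graphlet, no worker has to hold the whole graph. The price is that every graphlet must be built, stored, and shipped.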

Page 15: Distributed K-Betweenness (Spark)

Code

Page 16: Distributed K-Betweenness (Spark)

Code

Page 17: Distributed K-Betweenness (Spark)

Usage

Page 18: Distributed K-Betweenness (Spark)

Tuning

Page 19: Distributed K-Betweenness (Spark)

Do it yourself

❖ The project can be found on GitHub:
    ❖ https://github.com/dmarcous/spark-betweenness

❖ Accessible as a Spark Package!
    ❖ http://spark-packages.org/package/dmarcous/spark-betweenness
    ❖ Spark code (Scala / Java), spark-shell, spark-submit, pySpark APIs

Page 20: Distributed K-Betweenness (Spark)

Experiment design

❖ Amazon EMR cluster
    ❖ 1 master
    ❖ 4 worker nodes
    ❖ r3.2xlarge
        ❖ 8 vCPU
        ❖ 61 GB RAM
        ❖ 160 GB SSD

❖ 6 datasets
    ❖ Different sizes (|E| / |V|)
    ❖ Different diameters

❖ Implementations
    ❖ spark-betweenness
    ❖ networkX

Page 21: Distributed K-Betweenness (Spark)

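Results

| Name       | Type           | Description            | \|V\|   | \|E\|   | Diameter | K | Single | Spark |
|------------|----------------|------------------------|---------|---------|----------|---|--------|-------|
| HW2        | Random         | Small random generated | 3015    | 5156    | 9        | 3 | 31     | 240   |
| Facebook   | Social         | Social circles         | 4039    | 88234   | 8        | 3 | 210    | 601   |
|            |                |                        |         |         |          | 4 | 349    | -1    |
| Brightkite | Social         | Friendship network     | 58228   | 428156  | 16       | 3 | -1     | 2160  |
| Amazon     | Social         | Customer co-purchases  | 334863  | 925872  | 44       | 3 | -1     | 489   |
|            |                |                        |         |         |          | 4 | -1     | 5707  |
|            |                |                        |         |         |          | 5 | -1     | -1    |
| roadNet-CA | Infrastructure | Road net of California | 1965206 | 2766607 | 849      | 3 | -1     | 139   |
|            |                |                        |         |         |          | 4 | -1     | 356   |
|            |                |                        |         |         |          | 5 | -1     | 638   |
| roadNet-TX | Infrastructure | Road net of Texas      | 1379917 | 3843324 | 1054     | 3 | -1     | 85    |
|            |                |                        |         |         |          | 4 | -1     | 305   |
|            |                |                        |         |         |          | 5 | -1     | 600   |

-1 means the run either crashed or didn’t finish in a long time (over an hour).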

Page 22: Distributed K-Betweenness (Spark)

Results

Page 23: Distributed K-Betweenness (Spark)

Results

❖ Performs great on graphs with large diameter
    ❖ Large K-graphlets are “impossible” to store in memory and send between machines

❖ Not good for graphs with small diameter (very slow, sometimes crashes)

❖ Very hard to tune (how many cores, memory for each process, and so on)
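One way to see why diameter matters is to compare how much of the graph a 3-hop graphlet covers. The sketch below uses two toy graphs chosen purely for illustration, not the datasets from the experiment: a 20×20 grid stands in for a road-like, large-diameter network, and a star stands in for a hub-dominated, small-diameter one.

```python
from collections import deque

def k_hop_size(adj, root, k):
    """Number of vertices within k hops of root (size of root's k-graphlet)."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        if dist[v] == k:
            continue
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return len(dist)

def grid_graph(n):
    """n x n lattice: large diameter, like a road network."""
    adj = {(r, c): [] for r in range(n) for c in range(n)}
    for r, c in adj:
        for dr, dc in ((1, 0), (0, 1)):
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].append(nb)
                adj[nb].append((r, c))
    return adj

def star_graph(n_leaves):
    """One hub (vertex 0) connected to every leaf: diameter 2."""
    adj = {0: list(range(1, n_leaves + 1))}
    for leaf in range(1, n_leaves + 1):
        adj[leaf] = [0]
    return adj

grid = grid_graph(20)          # 400 vertices
star = star_graph(399)         # 400 vertices
print(k_hop_size(grid, (10, 10), 3))   # 25 of 400 vertices
print(k_hop_size(star, 1, 3))          # 400 of 400 vertices
```

On the grid each graphlet stays small. On the star every graphlet is the entire graph, so building one per vertex amounts to copying the graph once per vertex.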

Page 24: Distributed K-Betweenness (Spark)

Conclusions

❖ Distributed Betweenness – good idea in theory, hard to implement

❖ Multi-threaded on a single strong machine might do the job

❖ Our implementation – great for large diameter graphs (road networks, power grids, and more)

