GraphFrames: Graph Queries in Spark SQL by Ankur Dave

Post on 21-Apr-2017

4,619 views 5 download

transcript

GraphFrames: Graph Queries in Spark SQLAnkur DaveUC Berkeley AMPLab

Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber),Joseph Gonzalez (UC Berkeley), Matei Zaharia (MIT and Databricks)

+ Graph Queries

2016Spark + GraphFrames

Support for Graph Analysis in Spark

+ Graph Algorithms

2013Spark + GraphX

Relational Queries

2009Spark

Graph Updates + Anchored Traversals

Neo4j, etc.

Graph Algorithms vs. Graph Queries

≈x

PageRank

Alternating Least Squares

Graph Algorithms Graph Queries

Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank Graph Query: Wikipedia Collaborators

Editor 1 Editor 2 Article 1 Article 2

Article 1

Article 2

Editor 1

Editor 2

same day} same day}

Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank

// Iterate until convergence wikipedia.pregel(sendMsg = { e =>

e.sendToDst(e.srcRank * e.weight)},mergeMsg = _ + _,vprog = { (id, oldRank, msgSum) =>

0.15 + 0.85 * msgSum})

Graph Query: Wikipedia Collaboratorswikipedia.find("(u1)-[e11]->(article1);(u2)-[e21]->(article1);(u1)-[e12]->(article2);(u2)-[e22]->(article2)")

.select("*","e11.date – e21.date".as("d1"),"e12.date – e22.date".as("d2"))

.sort("d1 + d2".desc).take(10)

Separate SystemsGraph Algorithms Graph Queries

Raw Wikipedia

< / >< / >< / >XML

Text Table

Edit GraphEdit Table

Frequent Collaborators

Problem: Mixed Graph AnalysisHyperlinks PageRank

User Product

User Article

Vandalism Suspects

User User

User Article

Solution: GraphFrames

Graph Algorithms Graph Queries

Spark SQL

GraphFrames API

Pattern Query Optimizer

GraphFrames API• Unifies graph algorithms, graph queries, and DataFrames• Available in Scala and Python

class GraphFrame {def vertices: DataFramedef edges: DataFrame

def find(pattern: String): DataFramedef registerView(pattern: String, df: DataFrame): Unit

def degrees(): DataFramedef pageRank(): GraphFramedef connectedComponents(): GraphFrame...

}

Implementation

Parsed Pattern

Logical Plan

Materialized Views

Optimized Logical Plan

DataFrameResult

Query String

Graph–RelationalTranslation Join Elimination

and Reordering

Spark SQL

View SelectionGraph

AlgorithmsGraphX

Graph–Relational Translation

B

D

A

C

Existing Logical PlanOutput: A,B,C

Src Dst

⋈C=Src

Edge Table

ID Attr

Vertex Table

⋈D=ID

Materialized View Selection

GraphX: Triplet view enables efficient message-passing algorithms

Vertices

B

A

C

D

Edges

A B

A C

B C

C D

A

B

Triplet View

A C

B C

C D

Graph

+

UpdatedPageRanks

B

A

C

D

A

Materialized View Selection

GraphFrames: User-defined views enable efficient graph queries

Vertices

B

A

C

D

Edges

A B

A C

B C

C D

A

B

Triplet View

A CB CC D

Graph

User-Defined Views

PageRank

CommunityDetection

Graph Queries

Join Elimination

Src Dst1 21 32 32 5

Edges

ID Attr1 A2 B3 C4 D

VerticesSELECT src, dst, attr AS src_attrFROM edges INNER JOIN vertices ON src = id;

Standard vertex-edge join:

SELECT src, dstFROM edges INNER JOIN vertices ON src = id;

Unnecessary join

can be eliminated if tables satisfy referential integrity, simplifying graph–relational translation

Join Reordering

A → B B → A

⋈A, BB → D

C→ B⋈BB → E⋈B

C→ D⋈BC→ E⋈C, D

⋈C, EExample Query

Left-Deep Plan Bushy Plan

A → B B → A

⋈A, B

B → D C→ B

⋈B

B → E⋈B⋈B⋈B, C

User-Defined View

EvaluationFaster than Neo4j for unanchored pattern queries

0

0.5

1

1.5

2

2.5

GraphFrames Neo4j

Que

ry la

tenc

y, s

Anchored Pattern Query

01020304050607080

GraphFrames Neo4j

Que

ry la

tenc

y, s

Unanchored Pattern Query

Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.

EvaluationApproaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation

0

1

2

3

4

5

6

7

GraphFrames GraphX Naïve Spark

Per-i

tera

tion

runt

ime,

s

PageRank Performance

Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.

EvaluationRegistering the right views can greatly improve performance

Future Work• Suggest views automatically• Exploit attribute-based partitioning in optimizer• Code generation for single node

Try It Out!Preview available for Spark 1.4+ at:

https://github.com/graphframes/graphframesThanks to Databricks contributors Joseph Bradley, Xiangrui Meng, and Timothy Hunter.

Watch for the release on Spark Packages in the coming weeks.

ankurd@eecs.berkeley.edu