Graph Computation

Naveen Molleti, Sigmoid

Graph of the Internet

Source: INRIA (http://raweb.inria.fr/rapportsactivite/RA2009/gravite/uid59.html)

Red Hat family tree rendered along with an axis

Source: Wikimedia Commons (https://commons.wikimedia.org/wiki/File:Redhat_family_tree_11-06.png)

Tabular structure: rows, fields, values

Graph structure: vertices, edges, labels, properties

?

Graph computation

Customer ID   Customer Name   Bill ID   Item Name
391           Naveen          137       Pizza
391           Naveen          137       Coke
391           Naveen          139       Garlic Bread
393           Rahul           154       Garlic Bread
393           Rahul           154       Coke
391           Naveen          193       Coke

Table data

Compute configuration

Specify type of edges to be created:

(Customer ID: Customer Name) => Bill ID

Bill ID => Item Name
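
One way to picture this configuration is as a list of edge rules applied to every table row. The sketch below is a minimal, hypothetical encoding in plain Scala. The names Row, EdgeRule, key and edgesFor are invented for illustration and are not taken from the talk's code; joining the columns of a compound vertex with ":" is likewise an assumption.

```scala
// Hypothetical encoding of the edge configuration (all names are illustrative).
object EdgeConfigSketch {
  // One table row, keyed by column name.
  type Row = Map[String, String]

  // An edge rule: the columns identifying the source vertex and the target vertex.
  final case class EdgeRule(from: Seq[String], to: Seq[String])

  // (Customer ID: Customer Name) => Bill ID, and Bill ID => Item Name
  val rules: Seq[EdgeRule] = Seq(
    EdgeRule(Seq("Customer ID", "Customer Name"), Seq("Bill ID")),
    EdgeRule(Seq("Bill ID"), Seq("Item Name"))
  )

  // Build a vertex key by joining the values of the chosen columns with ":".
  def key(row: Row, cols: Seq[String]): String = cols.map(row).mkString(":")

  // Every rule yields one (source, target) pair per row.
  def edgesFor(row: Row): Seq[(String, String)] =
    rules.map(r => key(row, r.from) -> key(row, r.to))

  // First row of the sample table.
  val sample: Row = Map(
    "Customer ID" -> "391", "Customer Name" -> "Naveen",
    "Bill ID" -> "137", "Item Name" -> "Pizza"
  )
}
```

Calling EdgeConfigSketch.edgesFor(EdgeConfigSketch.sample) produces one customer-to-bill pair and one bill-to-item pair for that row.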

[Pipeline diagram: Raw data -> Ingest data -> Compute -> Insert graph. Also shown: Configuration, Persistence. Technologies labelled on the diagram: HDFS, Spark, Titan, TinkerPop, Cassandra.]

Graph data structures

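import scala.collection.immutable

// An edge between two vertices, with a label and arbitrary properties.
trait Edge {
  def out: Vertex
  def in: Vertex
  def props: Map[String, AnyRef]
  def label: String
}

// A vertex with an id, a display name, and arbitrary properties.
trait Vertex {
  def name: String
  def id: String
  def props: Map[String, AnyRef]
}

// A graph stored as an adjacency list: each vertex maps to its edges.
trait Graph {
  def adjList: immutable.Map[Vertex, Seq[Edge]]
}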
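
To make the traits concrete, here is a small sketch that implements them with case classes and builds a two-vertex graph. The case class names (SimpleVertex, SimpleEdge, SimpleGraph), the "billed" label, and the choice of out as the source vertex and in as the target (the TinkerPop convention) are assumptions for illustration, not the talk's implementation. The traits are repeated so that the block compiles on its own.

```scala
import scala.collection.immutable

object GraphTypesSketch {
  // The traits from the slide, repeated so this sketch is self-contained.
  trait Vertex { def name: String; def id: String; def props: Map[String, AnyRef] }
  trait Edge { def out: Vertex; def in: Vertex; def props: Map[String, AnyRef]; def label: String }
  trait Graph { def adjList: immutable.Map[Vertex, Seq[Edge]] }

  // Hypothetical concrete implementations; case class fields implement the trait members.
  final case class SimpleVertex(id: String, name: String,
                                props: Map[String, AnyRef] = Map.empty) extends Vertex
  final case class SimpleEdge(out: Vertex, in: Vertex, label: String,
                              props: Map[String, AnyRef] = Map.empty) extends Edge
  final case class SimpleGraph(adjList: immutable.Map[Vertex, Seq[Edge]]) extends Graph

  val customer: Vertex = SimpleVertex("391", "Naveen")
  val bill: Vertex = SimpleVertex("137", "Bill 137")

  // Assumption: `out` is the source and `in` is the target of the edge.
  val billed: Edge = SimpleEdge(out = customer, in = bill, label = "billed")

  // Each vertex maps to the edges that leave it; the bill has none in this tiny example.
  val adj: immutable.Map[Vertex, Seq[Edge]] =
    immutable.Map(customer -> Seq(billed), bill -> Seq.empty[Edge])

  val graph: Graph = SimpleGraph(adj)
}
```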

Compute

data -> tokens + relations -> vertices + edges

Compute - simple map reduce approach

0) Split data into partitions

1) For each partition, compute tokens and relations

2) Create vertices and edges, and adjacency lists (local subgraphs)

3) Merge adjacency lists using groupBy vertices

4) Merge duplicate edges within adjacency list

5) Result is final graph
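
The steps above can be sketched with plain Scala collections standing in for Spark partitions. This is an illustrative approximation, not the talk's code: vertices are reduced to strings, edges to (source, label, target) triples, and the "billed" and "contains" labels as well as the function names are hypothetical.

```scala
object MergeSketch {
  // Vertices are plain strings and an edge is (source, label, target) to keep the sketch small.
  type V = String
  type E = (V, String, V)
  type Adj = Map[V, Seq[E]]

  // Steps 1-2: turn one partition of (customer, bill, item) rows into a local subgraph.
  def localSubgraph(rows: Seq[(String, String, String)]): Adj = {
    val edges: Seq[E] = rows.flatMap { case (customer, bill, item) =>
      Seq((customer, "billed", bill), (bill, "contains", item))
    }
    // Adjacency list keyed by the source vertex of each edge.
    edges.groupBy(_._1)
  }

  // Steps 3-4: merge adjacency lists by vertex, then drop duplicate edges.
  def merge(subgraphs: Seq[Adj]): Adj =
    subgraphs.flatten.groupBy(_._1).map { case (v, lists) =>
      v -> lists.flatMap(_._2).distinct
    }

  // Steps 0 and 5: split the rows into partitions, build subgraphs, merge into the final graph.
  def build(rows: Seq[(String, String, String)], partitionSize: Int): Adj =
    merge(rows.grouped(partitionSize).map(localSubgraph).toSeq)

  // The sample table as (customer ID, bill ID, item name) rows.
  val rows: Seq[(String, String, String)] = Seq(
    ("391", "137", "Pizza"), ("391", "137", "Coke"), ("391", "139", "Garlic Bread"),
    ("393", "154", "Garlic Bread"), ("393", "154", "Coke"), ("391", "193", "Coke")
  )
}
```

Only vertices with outgoing edges appear as keys in this sketch; a fuller version would also register leaf vertices such as items.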

[Diagram: DATA is split into chunks; each chunk yields tokens and relations, then vertices and edges, then a subgraph; the subgraphs are combined into the GRAPH. Stages are labelled map step, reduce step, and transformation step.]

Tweaking for memory

- Maintaining vertex and edge objects consumes a lot of memory, both on the application server and on the Spark master/workers.
- Moving objects around on the network is costly too.

Solution: Compute on ‘aliases’. Create the objects corresponding to each alias only just before returning.

- Merging duplicate objects has an aftereffect: GC! (which opens another box of problems)

Solution: Avoid all duplicate objects as far as possible.
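
A rough illustration of the alias idea, under the assumption that an alias is a small integer standing in for a vertex (the talk does not define the alias type). All names here are hypothetical. The computation works on Int pairs, and Vertex and Edge objects are created once at the end.

```scala
object AliasSketch {
  final case class Vertex(id: String)
  final case class Edge(out: Vertex, in: Vertex)

  // Assign each distinct vertex key a small integer alias.
  def aliases(keys: Seq[String]): Map[String, Int] = keys.distinct.zipWithIndex.toMap

  // Compute on aliases only: edges are pairs of Ints, and a Set drops duplicates
  // without ever allocating duplicate Vertex or Edge objects.
  def compute(pairs: Seq[(String, String)]): (Map[String, Int], Set[(Int, Int)]) = {
    val alias = aliases(pairs.flatMap { case (a, b) => Seq(a, b) })
    (alias, pairs.map { case (a, b) => (alias(a), alias(b)) }.toSet)
  }

  // Materialise Vertex and Edge objects once, just before returning.
  def materialise(alias: Map[String, Int], edges: Set[(Int, Int)]): Set[Edge] = {
    val vertexOf: Map[Int, Vertex] = alias.map { case (k, i) => i -> Vertex(k) }
    edges.map { case (o, i) => Edge(vertexOf(o), vertexOf(i)) }
  }
}
```

Integer pairs are cheaper to hold and to ship between workers than full objects, which is the motivation given in the bullets above.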

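[Diagram: DATA is split into chunks; each chunk yields tokens and relations, then a subcompute; the subcomputes are combined into a compute result, from which the GRAPH is produced. Stages are labelled map step, reduce step, and transformation step.]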

http://aa.bb.cc.dd:8000/graph/zzgraph/search?name=mr%20vijay&depth=2&limit=10

Not enough memory?

- Xmx values on a forked JVM launched via SBT (fork := true)
  - Set the javaOptions key (e.g. javaOptions += "-Xmx16G")
- Underestimated size of the Spark compute result
  - Set spark.driver.maxResultSize
- Get the most out of your machine. Don’t let the OS kill the process under memory pressure.
  - Set vm.panic_on_oom (echo 1 | sudo tee /proc/sys/vm/panic_on_oom); the kernel then panics on out-of-memory instead of invoking the OOM killer
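
The SBT settings mentioned on this slide might look like the following build.sbt fragment. It is a sketch under assumptions: the 16G heap comes from the slide, while the 4g result-size limit is a placeholder value, and passing it as a JVM system property relies on SparkConf picking up spark.* system properties in the driver JVM.

```scala
// build.sbt (sketch): run the application in a forked JVM with a larger heap.
fork := true

// javaOptions only takes effect when the JVM is forked.
javaOptions += "-Xmx16G"

// Assumption: 4g is a placeholder. SparkConf reads spark.* system properties by default,
// so this raises the limit on results collected to the driver.
javaOptions += "-Dspark.driver.maxResultSize=4g"
```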

?

Graph

Database

References

Titan: http://thinkaurelius.github.io/titan/
TinkerPop: http://tinkerpop.apache.org/
Cassandra: http://cassandra.apache.org/
