Graph computation


Naveen Molleti, Sigmoid

Graph of the Internet

Source: INRIA (http://raweb.inria.fr/rapportsactivite/RA2009/gravite/uid59.html)

Red Hat family tree rendered along with an axis

Source: Wikimedia Commons (https://commons.wikimedia.org/wiki/File:Redhat_family_tree_11-06.png)

Tabular structure: rows, fields, values

Graph structure: vertices, edges, labels, properties

Graph computation

Customer ID | Customer Name | Bill ID | Item Name
391         | Naveen        | 137     | Pizza
391         | Naveen        | 137     | Coke
391         | Naveen        | 139     | Garlic Bread
393         | Rahul         | 154     | Garlic Bread
393         | Rahul         | 154     | Coke
391         | Naveen        | 193     | Coke

Table data

Compute configuration

Specify type of edges to be created:

(Customer ID: Customer Name) => Bill ID

Bill ID => Item Name
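
As a rough illustration of the edge configuration above, the two edge types can be written down as data and applied to one row of the table. This is only a sketch: the names EdgeSpec, Relation and relations, the Map-based row, and the ':' separator for multi-column tokens are assumptions made for the example, not the talk's actual configuration format.

```scala
object EdgeConfigSketch {
  // One table row, keyed by column name (hypothetical representation).
  type Row = Map[String, String]

  // An edge type: the columns that form the source token and the column
  // that forms the target token.
  final case class EdgeSpec(from: Seq[String], to: String)

  // A relation between two tokens, before any vertex or edge objects exist.
  final case class Relation(source: String, target: String)

  // The two edge types listed on the slide.
  val specs: Seq[EdgeSpec] = Seq(
    EdgeSpec(Seq("Customer ID", "Customer Name"), "Bill ID"),
    EdgeSpec(Seq("Bill ID"), "Item Name")
  )

  // Apply every edge spec to a row; multi-column sources are joined with ':'.
  def relations(row: Row): Seq[Relation] =
    specs.map(s => Relation(s.from.map(row).mkString(":"), row(s.to)))

  // First row of the sample table.
  val sample: Row = Map(
    "Customer ID" -> "391",
    "Customer Name" -> "Naveen",
    "Bill ID" -> "137",
    "Item Name" -> "Pizza"
  )
}
```

Calling EdgeConfigSketch.relations(EdgeConfigSketch.sample) yields one relation per configured edge type for that row.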

[Diagram: Raw data -> Ingest data -> Compute -> Insert graph, with Configuration and Persistence]

[Diagram, repeated with technology labels: HDFS, Spark, Titan, TinkerPop, Cassandra]

Graph data structures

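import scala.collection.immutable

// A directed edge between two vertices.
trait Edge {
  def out: Vertex
  def in: Vertex
  def props: Map[String, AnyRef]
  def label: String
}

// A vertex with an identifier, a display name, and arbitrary properties.
trait Vertex {
  def name: String
  def id: String
  def props: Map[String, AnyRef]
}

// A graph stored as an adjacency list: each vertex maps to its edges.
trait Graph {
  def adjList: immutable.Map[Vertex, Seq[Edge]]
}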
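
To make the traits above concrete, here is a minimal sketch that implements them with case classes and builds a tiny graph from the first row of the sample table. The case class names, the edge labels "billed" and "contains", and the reading of out as the source vertex and in as the target vertex (the TinkerPop convention) are assumptions for illustration only.

```scala
import scala.collection.immutable

object GraphSketch {
  // The traits from the slide, repeated so that the example is self-contained.
  trait Vertex { def name: String; def id: String; def props: Map[String, AnyRef] }
  trait Edge {
    def out: Vertex
    def in: Vertex
    def props: Map[String, AnyRef]
    def label: String
  }
  trait Graph { def adjList: immutable.Map[Vertex, Seq[Edge]] }

  // Hypothetical implementations; properties default to empty.
  final case class SimpleVertex(id: String, name: String,
      props: Map[String, AnyRef] = Map.empty) extends Vertex
  final case class SimpleEdge(out: Vertex, in: Vertex, label: String,
      props: Map[String, AnyRef] = Map.empty) extends Edge
  final case class SimpleGraph(adjList: immutable.Map[Vertex, Seq[Edge]]) extends Graph

  val customer = SimpleVertex("391", "Naveen")
  val bill = SimpleVertex("137", "Bill 137")
  val pizza = SimpleVertex("Pizza", "Pizza")

  // Each vertex maps to the edges that leave it (an assumption of this sketch).
  val graph = SimpleGraph(Map[Vertex, Seq[Edge]](
    customer -> Seq(SimpleEdge(customer, bill, "billed")),
    bill -> Seq(SimpleEdge(bill, pizza, "contains"))
  ))
}
```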

Compute

tokens + relations -> vertices + edges

Compute - simple map reduce approach

0) Split data into partitions

1) For each partition, compute tokens and relations

2) Create vertices and edges, and adjacency lists (local subgraphs)

3) Merge adjacency lists using groupBy vertices

4) Merge duplicate edges within adjacency list

5) Result is final graph
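
The steps above can be sketched with plain Scala collections standing in for Spark partitions. This is not the talk's Spark code, which the slides do not show. The EdgeRec type, the "customer:", "bill:" and "item:" token prefixes, and the edge labels are assumptions made for the example.

```scala
object MapReduceSketch {
  // Compact stand-ins for vertices and edges (hypothetical types).
  type VertexId = String
  final case class EdgeRec(source: VertexId, label: String, target: VertexId)
  type AdjList = Map[VertexId, Seq[EdgeRec]]

  // Rows are (customer ID, bill ID, item name) triples from the sample table.
  type Row = (String, String, String)

  val rows: Seq[Row] = Seq(
    ("391", "137", "Pizza"), ("391", "137", "Coke"), ("391", "139", "Garlic Bread"),
    ("393", "154", "Garlic Bread"), ("393", "154", "Coke"), ("391", "193", "Coke")
  )

  // Steps 1-2 (map): turn one partition into a local subgraph.
  def localSubgraph(partition: Seq[Row]): AdjList = {
    val edges = partition.flatMap { case (customer, bill, item) =>
      Seq(
        EdgeRec(s"customer:$customer", "bill", s"bill:$bill"),
        EdgeRec(s"bill:$bill", "item", s"item:$item")
      )
    }
    edges.groupBy(_.source)
  }

  // Steps 3-4 (reduce): merge adjacency lists by vertex and drop duplicate edges.
  def merge(a: AdjList, b: AdjList): AdjList =
    (a.keySet ++ b.keySet).map { v =>
      v -> (a.getOrElse(v, Seq.empty) ++ b.getOrElse(v, Seq.empty)).distinct
    }.toMap

  // Step 0 and step 5: partition the rows, map, then reduce to the final graph.
  def build(input: Seq[Row], partitionSize: Int): AdjList =
    input.grouped(partitionSize).map(localSubgraph).foldLeft[AdjList](Map.empty)(merge)
}
```

Because merge removes duplicates and is insensitive to how rows were grouped, MapReduceSketch.build(MapReduceSketch.rows, 2) and build(rows, 6) describe the same graph, although the order of edges within an adjacency list may differ.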

[Diagram: Chunk ... -> tokens, relations -> vertices, edges -> subgraph, subgraph, ... Stage labels: map step, reduce step, transformation step]

Tweaking for memory

- Maintaining vertex and edge objects is memory-intensive, both on the application server and on the Spark master/workers.
- Moving objects around the network is costly too.

Solution: Compute on ‘aliases’. Create the objects corresponding to each alias only just before returning.

- Side effect of merging duplicate objects: GC! (which opens another box of problems)

Solution: Avoid duplicate objects as far as possible.
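
One possible reading of the alias idea, as a minimal sketch: the computation works on lightweight string aliases, so duplicates collapse without allocating any vertex objects, and full objects are created once per alias at the very end. The type names and the choice of plain strings as aliases are assumptions; the slides do not say what the aliases actually were.

```scala
object AliasSketch {
  // Heavy object that should be created only once per vertex, at the end.
  final case class Vertex(id: String, name: String, props: Map[String, AnyRef])

  // In this sketch an alias is just the token string; an edge is a pair of aliases.
  type Alias = String
  type AliasEdge = (Alias, Alias)

  // Compute step: works purely on aliases, so duplicate relations collapse
  // in a Set without any Vertex being allocated.
  def computeEdges(relations: Seq[AliasEdge]): Set[AliasEdge] = relations.toSet

  // Final step: create exactly one Vertex per distinct alias.
  def materialize(edges: Set[AliasEdge]): Map[Alias, Vertex] =
    edges
      .flatMap { case (a, b) => Seq(a, b) }
      .map(alias => alias -> Vertex(alias, alias, Map.empty))
      .toMap
}
```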

[Diagram: Chunk ... -> tokens, relations -> subcompute, subcompute, ... -> compute result. Stage labels: map step, reduce step, transformation step]

http://aa.bb.cc.dd:8000/graph/zzgraph/search?name=mr%20vijay&depth=2&limit=10

Not enough memory?

- Xmx values on a forked JVM launched via SBT (fork := true)
  - Set the javaOptions key (e.g. javaOptions += "-Xmx16G")

- Underestimated size of the Spark compute result
  - Set spark.driver.maxResultSize

- Get the most out of your machine. Don’t let the OS kill the process under memory pressure.
  - Set vm.panic_on_oom (echo 1 | sudo tee /proc/sys/vm/panic_on_oom)
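
The SBT settings mentioned on this slide might look like the following build.sbt fragment. It is a sketch: the 16G heap size is the value quoted on the slide and would need adjusting to the machine, and the rest of the build definition is omitted.

```scala
// build.sbt (fragment)

// Run the application in a forked JVM; javaOptions only apply to forked runs.
fork := true

// Maximum heap size for the forked JVM (value taken from the slide).
javaOptions += "-Xmx16G"
```

The Spark setting is an ordinary configuration property, so it can be passed at submit time, for example with --conf spark.driver.maxResultSize=4g, where 4g is only an illustrative value.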

Database

References

Titan: http://thinkaurelius.github.io/titan/

TinkerPop: http://tinkerpop.apache.org/

Cassandra: http://cassandra.apache.org/
