Graph Computation
Naveen Molleti, Sigmoid
Graph of the Internet
Source: INRIA (http://raweb.inria.fr/rapportsactivite/RA2009/gravite/uid59.html)
Red Hat family tree rendered along with an axis
Source: Wikimedia Commons (https://commons.wikimedia.org/wiki/File:Redhat_family_tree_11-06.png)
Tabular structure: rows, fields, values
Graph structure: vertices, edges, labels, properties
?
Graph computation
Customer ID   Customer Name   Bill ID   Item Name
391           Naveen          137       Pizza
391           Naveen          137       Coke
391           Naveen          139       Garlic Bread
393           Rahul           154       Garlic Bread
393           Rahul           154       Coke
391           Naveen          193       Coke
Table data
Compute configuration
Specify type of edges to be created:
(Customer ID: Customer Name) => Bill ID
Bill ID => Item Name
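A minimal sketch of how such a configuration could be applied to one table row. The names `EdgeSpec`, `Relation`, `config` and `relations` are hypothetical and not taken from the talk; joining several columns with ":" is an assumption based on the notation above.

```scala
object EdgeConfigSketch {
  // One table row, keyed by column name (hypothetical representation).
  type Row = Map[String, String]

  // An edge type: the columns forming the source token and the target token.
  final case class EdgeSpec(from: Seq[String], to: Seq[String])

  // A relation between two tokens, before any vertex objects exist.
  final case class Relation(from: String, to: String)

  // The two edge types listed on the slide.
  val config: Seq[EdgeSpec] = Seq(
    EdgeSpec(Seq("Customer ID", "Customer Name"), Seq("Bill ID")),
    EdgeSpec(Seq("Bill ID"), Seq("Item Name"))
  )

  // Join the values of the given columns with ":" (assumed separator).
  private def token(row: Row, cols: Seq[String]): String =
    cols.map(row).mkString(":")

  // Produce one relation per configured edge type for a single row.
  def relations(row: Row, specs: Seq[EdgeSpec]): Seq[Relation] =
    specs.map(s => Relation(token(row, s.from), token(row, s.to)))
}
```

For the first table row this yields one relation from "391:Naveen" to "137" and one from "137" to "Pizza".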
Raw data
Ingest data Compute Insert graph
Configuration
Persistence
HDFS, Spark
HDFS
Titan, Tinkerpop
Cassandra
Graph data structures
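import scala.collection.immutable

// An edge between two vertices, with a label and properties.
trait Edge {
  def out: Vertex
  def in: Vertex
  def props: Map[String, AnyRef]
  def label: String
}
// A vertex with a name, an id and properties.
trait Vertex {
  def name: String
  def id: String
  def props: Map[String, AnyRef]
}
// A graph stored as an adjacency list.
trait Graph {
  def adjList: immutable.Map[Vertex, Seq[Edge]]
}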
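A sketch of concrete types for these traits, to show how an adjacency list could be built from a list of edges. `SimpleVertex`, `SimpleEdge`, `SimpleGraph` and `fromEdges` are hypothetical names; treating `out` as the source vertex follows the Tinkerpop convention and is an assumption here.

```scala
import scala.collection.immutable

object GraphTypesSketch {
  // Traits repeated from the slide so that this block compiles on its own.
  trait Vertex { def name: String; def id: String; def props: Map[String, AnyRef] }
  trait Edge { def out: Vertex; def in: Vertex; def props: Map[String, AnyRef]; def label: String }
  trait Graph { def adjList: immutable.Map[Vertex, Seq[Edge]] }

  // Hypothetical concrete types; case classes give structural equality,
  // so two vertices with the same fields count as the same map key.
  final case class SimpleVertex(id: String, name: String,
                                props: Map[String, AnyRef] = Map.empty) extends Vertex
  final case class SimpleEdge(out: Vertex, in: Vertex, label: String,
                              props: Map[String, AnyRef] = Map.empty) extends Edge
  final case class SimpleGraph(adjList: immutable.Map[Vertex, Seq[Edge]]) extends Graph

  // Build the adjacency list by grouping edges on their "out" vertex
  // (assumed to be the source of the edge).
  def fromEdges(edges: Seq[Edge]): Graph =
    SimpleGraph(edges.groupBy(_.out))
}
```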
Compute
data
tokens + relations
vertices + edges
Compute - simple map reduce approach
0) Split data into partitions
1) For each partition, compute tokens and relations
2) Create vertices and edges, and adjacency lists (local subgraphs)
3) Merge adjacency lists using groupBy vertices
4) Merge duplicate edges within adjacency list
5) Result is final graph
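The steps above can be sketched with plain Scala collections standing in for Spark partitions. This is an illustration under stated assumptions, not the code from the talk: `Vertex`, `Edge`, `localSubgraph`, `merge` and `build` are hypothetical names, and relations are assumed to arrive as (from, label, to) triples.

```scala
object MapReduceGraphSketch {
  final case class Vertex(id: String)
  final case class Edge(out: Vertex, in: Vertex, label: String)
  type AdjList = Map[Vertex, Seq[Edge]]

  // Steps 1-2: turn one partition of (from, label, to) relations
  // into a local subgraph, i.e. an adjacency list keyed by source vertex.
  def localSubgraph(partition: Seq[(String, String, String)]): AdjList =
    partition
      .map { case (from, label, to) => Edge(Vertex(from), Vertex(to), label) }
      .groupBy(_.out)

  // Steps 3-4: merge two adjacency lists vertex by vertex,
  // dropping duplicate edges within each merged list.
  def merge(a: AdjList, b: AdjList): AdjList =
    (a.keySet ++ b.keySet).map { v =>
      v -> (a.getOrElse(v, Seq.empty) ++ b.getOrElse(v, Seq.empty)).distinct
    }.toMap

  // Steps 0 and 5: split into partitions, map each to a subgraph,
  // then fold the subgraphs into the final graph.
  // partitionSize must be greater than zero.
  def build(relations: Seq[(String, String, String)], partitionSize: Int): AdjList =
    relations
      .grouped(partitionSize)
      .map(localSubgraph)
      .foldLeft(Map.empty[Vertex, Seq[Edge]])(merge)
}
```

On Spark the same shape would map over partitions and reduce the per-partition results; the local version only shows the data flow.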
[Diagram: DATA is split into chunks; each chunk yields tokens and relations, then vertices and edges, then a local subgraph; the subgraphs are combined into GRAPH. Stage labels: map step, reduce step, transformation step.]
Tweaking for memory
- Maintaining vertex and edge objects consumes a lot of memory, both on the application server and on the Spark master/workers
- Moving objects around on the network is costly too
Solution: Compute on ‘aliases’. Create the objects corresponding to each alias only just before returning.
- Merging duplicate objects has an after-effect: GC! (which opens another box of problems)
Solution: Avoid duplicate objects as far as possible.
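One way to read "compute on aliases" is to replace each token with a small integer, do all deduplication on integer pairs, and create vertex and edge objects once at the end. The sketch below assumes that reading; the talk does not say what an alias is, and `computeOnAliases` and `materialise` are hypothetical names.

```scala
object AliasSketch {
  final case class Vertex(name: String)
  final case class Edge(out: Vertex, in: Vertex)

  // Assign each distinct token an integer alias (its index in `tokens`)
  // and express every relation as a pair of aliases. Collecting the pairs
  // into a Set removes duplicates without creating any Edge objects.
  def computeOnAliases(relations: Seq[(String, String)]): (Vector[String], Set[(Int, Int)]) = {
    val tokens = relations.flatMap { case (a, b) => Seq(a, b) }.distinct
    val alias = tokens.zipWithIndex.toMap
    val pairs = relations.map { case (a, b) => (alias(a), alias(b)) }.toSet
    (tokens.toVector, pairs)
  }

  // Create the real objects only once, just before returning the result.
  def materialise(names: Vector[String], pairs: Set[(Int, Int)]): Set[Edge] = {
    val vertices = names.map(Vertex(_))
    pairs.map { case (o, i) => Edge(vertices(o), vertices(i)) }
  }
}
```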
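[Diagram: DATA is split into chunks; each chunk yields tokens and relations, then a subcompute; the subcomputes are combined into a compute result, from which GRAPH is produced. Stage labels: map step, reduce step, transformation step.]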
http://aa.bb.cc.dd:8000/graph/zzgraph/search?name=mr%20vijay&depth=2&limit=10
Not enough memory?
- Xmx values on a forked JVM launched via SBT (fork := true)
- Set the javaOptions key (e.g. javaOptions += "-Xmx16G")
- Underestimated size of Spark compute result
- Set spark.driver.maxResultSize
- Get the most out of your machine. Don’t let the OS kill the process under memory pressure.
- Set vm.panic_on_oom (echo 1 | sudo tee /proc/sys/vm/panic_on_oom)
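As a sketch, the SBT settings mentioned above could look like this in build.sbt. The 16G heap is the value from the slide; the 4g result size in the comment is only an illustrative assumption and should be sized to the actual compute result.

```scala
// build.sbt (fragment)

// Run the application in a forked JVM so that javaOptions take effect.
fork := true

// Heap size for the forked JVM.
javaOptions += "-Xmx16G"

// The Spark limit is a Spark property, not an SBT key. It can be passed
// at submit time, for example:
//   spark-submit --conf spark.driver.maxResultSize=4g ...
```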
?
Graph database
References
Titan: http://thinkaurelius.github.io/titan/
Tinkerpop: http://tinkerpop.apache.org/
Cassandra: http://cassandra.apache.org/