Apache Giraph: Large-scale Graph Processing on Hadoop
Claudio Martella
@claudiomartella
2
Graphs are simple
3
A computer network
4
A social network
5
A semantic network
6
A map
7
Predicting break ups
8
[Figure: aggregation approach vs. graph approach]
Graphs are nasty.
9
Each vertex depends on its neighbours, recursively.
10
Recursive problems are nicely solved iteratively.
11
12
PageRank in MapReduce
• Record: < v_i, pr, [ v_j, ..., v_k ] >
• Mapper: for each neighbour v_j, emits < v_j, pr / #neighbours >
• Reducer: sums the partial values received by each vertex
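The record, mapper, and reducer above can be sketched in plain Python. This is a minimal in-memory illustration, not Hadoop code: `run_iteration` and the three-vertex graph are hypothetical, the shuffle is simulated with a dictionary, and no damping factor is applied because the slide does not mention one. The mapper also re-emits the adjacency list so that the reducer can rebuild the record for the next job.

```python
from collections import defaultdict


def mapper(record):
    """Emit each neighbour's share of the rank, plus the adjacency list."""
    vertex, rank, neighbours = record
    # The structure has to travel through the shuffle as well,
    # otherwise the reducer could not rebuild the record.
    yield vertex, ("structure", neighbours)
    for target in neighbours:
        yield target, ("rank", rank / len(neighbours))


def reducer(vertex, values):
    """Sum the partial ranks and rebuild the record for the next job."""
    neighbours, total = [], 0.0
    for kind, payload in values:
        if kind == "structure":
            neighbours = payload
        else:
            total += payload
    return (vertex, total, neighbours)


def run_iteration(records):
    """One simulated MapReduce job: map, shuffle (group by key), reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return [reducer(v, vals) for v, vals in sorted(shuffled.items())]


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
records = [("a", 1 / 3, ["b", "c"]), ("b", 1 / 3, ["c"]), ("c", 1 / 3, ["a"])]
for _ in range(20):  # each pass stands for a separate MapReduce job
    records = run_iteration(records)
ranks = {vertex: rank for vertex, rank, _ in records}
```

Each pass through the loop corresponds to a full job, and the whole graph structure is shuffled again on every pass even though it never changes.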
13
MapReduce dataflow
14
Drawbacks
• The job is executed N times, once per iteration
• Each job pays the bootstrap cost
• Mappers send both PR values and graph structure
• Extensive I/O at input, shuffle & sort, and output
15
16
Timeline
• Inspired by Google Pregel (2010)
• Donated to ASF by Yahoo! in 2011
• Top-level project in 2012
• 1.0 release in January 2013
• 1.1 release in November 2014
17
Plays well with Hadoop
18
Vertex-centric API
19
Shortest Paths
24
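Code

def compute(vertex, messages):
    # Smallest distance proposed by any incoming message.
    minValue = float('inf')
    for m in messages:
        minValue = min(minValue, m)
    # If it improves on the current distance, adopt it and notify the neighbours.
    if minValue < vertex.getValue():
        vertex.setValue(minValue)
        for edge in vertex.getEdges():
            message = minValue + edge.getValue()
            sendMessage(edge.getTargetId(), message)
    # Go inactive until a new message arrives.
    vertex.voteToHalt()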
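To see the compute function above run end to end, here is a small single-process sketch of the superstep loop around it. It is not the Giraph API: the `Vertex` class, the `run` driver, the explicit `send_message` argument, and the example graph are all hypothetical, and the source is seeded with a message carrying distance 0 as an assumption, since the slide does not show how the computation starts.

```python
import math
from collections import defaultdict


class Vertex:
    """Hypothetical stand-in for a Giraph vertex; not the real Giraph API."""

    def __init__(self, vertex_id, edges):
        self.id = vertex_id
        self.edges = edges      # {target_id: edge_weight}
        self.value = math.inf   # best known distance from the source
        self.halted = False


def compute(vertex, messages, send_message):
    """Same logic as the slide, with send_message passed in explicitly."""
    min_value = min(messages, default=math.inf)
    if min_value < vertex.value:
        vertex.value = min_value
        for target, weight in vertex.edges.items():
            send_message(target, min_value + weight)
    vertex.halted = True  # vote to halt; an incoming message reactivates


def run(graph, source):
    vertices = {vid: Vertex(vid, edges) for vid, edges in graph.items()}
    inbox = {source: [0.0]}  # assumption: seed the source with distance 0
    supersteps = 0
    while True:
        outbox = defaultdict(list)

        def send_message(target, message):
            outbox[target].append(message)

        # A vertex runs if it has not halted or if it received a message.
        active = [v for v in vertices.values() if not v.halted or v.id in inbox]
        if not active:
            break
        for vertex in active:
            compute(vertex, inbox.get(vertex.id, []), send_message)
        # Barrier: messages sent in this superstep are delivered in the next.
        inbox = dict(outbox)
        supersteps += 1
    return {vid: v.value for vid, v in vertices.items()}, supersteps


# Hypothetical weighted graph.
graph = {"a": {"b": 1, "c": 4}, "b": {"c": 2, "d": 6}, "c": {"d": 1}, "d": {}}
distances, supersteps = run(graph, "a")
```

The job ends when every vertex has voted to halt and no messages are in flight, which is the termination condition the loop checks before each superstep.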
25
26
27
28
29
BSP & Giraph
30
Advantages
• No locks: message-based communication
• No semaphores: global synchronization
• Iteration isolation: massively parallelizable
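The three points above can be illustrated with threads and a barrier. This is a sketch under stated assumptions rather than how Giraph workers are implemented: the ring graph, the two-way partitioning, and the `deliver` and `worker` functions are hypothetical. Each worker writes only to its own vertices and its own outbox, so user code needs no locks; the only coordination is the global barrier between supersteps.

```python
import threading

NUM_WORKERS = 2
NUM_SUPERSTEPS = 4

# Hypothetical ring of four vertices; each forwards the largest value seen.
neighbours = {0: 1, 1: 2, 2: 3, 3: 0}
values = {0: 5, 1: 9, 2: 2, 3: 7}
partitions = [[0, 1], [2, 3]]                 # vertices owned by each worker

inbox = {v: [] for v in values}               # only read during a superstep
outboxes = [[] for _ in range(NUM_WORKERS)]   # one private outbox per worker


def deliver():
    """Runs in a single thread once all workers have reached the barrier."""
    for v in inbox:
        inbox[v] = []
    for outbox in outboxes:
        for target, message in outbox:
            inbox[target].append(message)
        outbox.clear()


barrier = threading.Barrier(NUM_WORKERS, action=deliver)


def worker(worker_id):
    for _ in range(NUM_SUPERSTEPS):
        for v in partitions[worker_id]:
            # Messages read here were sent during the previous superstep.
            values[v] = max([values[v]] + inbox[v])
            outboxes[worker_id].append((neighbours[v], values[v]))
        barrier.wait()  # global synchronisation point between supersteps


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After four supersteps every vertex holds 9, the largest initial value. The result does not depend on thread scheduling, because within a superstep no worker reads anything another worker is writing.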
31
Designed for iterations
• Stateful (in-memory)
• Only intermediate values (messages) are sent
• Hits the disk at input, output, and checkpoints
• Can go out-of-core
32
Giraph job lifetime
33
Architecture
34
Composable API
35
Checkpointing
36
No SPoFs
37
Giraph scales
38
ref: https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
Giraph is fast
• 100x faster than MapReduce (PageRank)
• Jobs run within minutes
• Given you have the resources ;-)
39
Serialised objects
40
Primitive types
• Autoboxing is expensive
• Object overhead (JVM)
• Use primitive types in your own code
• Use libraries built on primitive types (e.g. fastutil)
41
Sharded aggregators
42
Okapi
• Apache Mahout for graphs
• Graph-based recommenders: ALS, SGD, SVD++, etc.
• Graph analytics: graph partitioning, community detection, k-core, etc.
43