APACHE GIRAPH ON YARN Chuan Lei and Mohammad Islam
Transcript
Page 1: Apache  Giraph on yarn

APACHE GIRAPH ON YARN
Chuan Lei and Mohammad Islam

Page 2: Apache  Giraph on yarn

Fast Scalable Graph Processing

What is Apache Giraph
Why do I need it
Giraph + MapReduce
Giraph + YARN

Page 3: Apache  Giraph on yarn

What is Apache Giraph

Giraph is a framework for performing offline batch processing of semi-structured graph data at massive scale

Giraph is loosely based upon Google’s Pregel graph processing framework

Page 4: Apache  Giraph on yarn

What is Apache Giraph

Giraph performs iterative calculations on top of an existing Hadoop cluster

Page 5: Apache  Giraph on yarn

What is Apache Giraph

Giraph uses Apache ZooKeeper to enforce atomic barrier waits and perform leader election

[Figure: workers at a superstep barrier - two report "Done!", one is "Still working…!"]

Page 6: Apache  Giraph on yarn

What is Apache Giraph

Page 7: Apache  Giraph on yarn

Why do I need it?

Giraph makes graph algorithms easy to reason about and implement by following the Bulk Synchronous Parallel (BSP) programming model

In BSP, all algorithms are implemented from the point of view of a single vertex in the input graph performing a single iteration of the computation
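As a minimal illustration of that single-vertex viewpoint, a Giraph computation skeleton might look like the following (a sketch only; class, package, and method names approximate Giraph's Java API and differ between Giraph versions):

    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    // Giraph calls compute() once per active vertex in every superstep;
    // all logic is written from the point of view of that single vertex.
    public class MyComputation extends BasicComputation<
        LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

      @Override
      public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
                          Iterable<DoubleWritable> messages) {
        // 1. Read the messages sent to this vertex in the previous superstep.
        // 2. Update this vertex's own value.
        // 3. Send messages along out-edges; they are delivered in the next superstep.
        // 4. Vote to halt; the job ends once every vertex has halted and no
        //    messages are in flight.
        vertex.voteToHalt();
      }
    }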

Page 8: Apache  Giraph on yarn

Why do I need it?

Giraph makes iterative data processing more practical for Hadoop users

Giraph can avoid costly disk and network operations that are mandatory in MR

No concept of message passing in MR

Page 9: Apache  Giraph on yarn

Why do I need it?

Each cycle of an iterative calculation on Hadoop means running a full MapReduce job

Page 10: Apache  Giraph on yarn

PageRank example

PageRank measures the relative importance of a document within a set of documents

1. All vertices start with the same PageRank (1.0 each in this example)

Page 11: Apache  Giraph on yarn

PageRank example

2. Each vertex distributes an equal portion of its PageRank to all of its neighbors (e.g., a vertex with PageRank 1.0 and two out-edges sends 0.5 along each)

Page 12: Apache  Giraph on yarn

PageRank example

3. Each vertex sums its incoming values times a weight factor and adds a small adjustment of 1/(# vertices in graph):
1.5*1 + 1/3 = 1.83
1.0*1 + 1/3 = 1.33
0.5*1 + 1/3 = 0.83

Page 13: Apache  Giraph on yarn

PageRank example

4. This value becomes each vertex's PageRank for the next iteration (1.83, 1.33, and 0.83 in the example)

Page 14: Apache  Giraph on yarn

PageRank example

5. Repeat until convergence (change in PageRank per iteration < epsilon)
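Putting the five steps together, here is a hedged sketch of the update rule as a Giraph vertex computation (it mirrors the structure of Giraph's bundled PageRank example; exact class and method names vary between Giraph versions, and the weight factor of 1 plus the 1/N adjustment follow the toy example above rather than the damped PageRank formula):

    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    // Sketch of the PageRank update described on the previous slides.
    public class SimplePageRank extends BasicComputation<
        LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

      private static final int MAX_SUPERSTEPS = 30;  // or stop when the change < epsilon

      @Override
      public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
                          Iterable<DoubleWritable> messages) {
        // At superstep 0 the initial value (1.0 here) comes from the input graph.
        if (getSuperstep() > 0) {
          // 3. Sum the incoming shares and add the small 1/N adjustment.
          double sum = 0;
          for (DoubleWritable msg : messages) {
            sum += msg.get();
          }
          // 4. This becomes the vertex's PageRank for the next iteration.
          vertex.setValue(new DoubleWritable(sum + 1.0 / getTotalNumVertices()));
        }
        if (getSuperstep() < MAX_SUPERSTEPS) {
          // 2. Distribute an equal portion of the PageRank to all neighbors.
          double share = vertex.getValue().get() / vertex.getNumEdges();
          sendMessageToAllEdges(vertex, new DoubleWritable(share));
        } else {
          vertex.voteToHalt();
        }
      }
    }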

Page 15: Apache  Giraph on yarn

PageRank on MapReduce

1. Load complete input graph from disk as [K = vertex ID, V = out-edges and PR]

[Figure: Map -> Sort/Shuffle -> Reduce pipeline]

Page 16: Apache  Giraph on yarn

PageRank on MapReduce

2. Emit all input records (the full graph state), plus [K = edgeTarget, V = share of PR] for each out-edge

Page 17: Apache  Giraph on yarn

PageRank on MapReduce

3. Sort and shuffle this entire mess.

Page 18: Apache  Giraph on yarn

PageRank on MapReduce

4. Sum incoming PR shares for each vertex, update PR values in graph state records

Page 19: Apache  Giraph on yarn

PageRank on MapReduce

5. Emit full graph state to disk…

Page 20: Apache  Giraph on yarn

PageRank on MapReduce

6. … and START OVER!

Page 21: Apache  Giraph on yarn

PageRank on MapReduce

Awkward to reason about
I/O bound despite simple core business logic
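For contrast, here is a rough sketch of a single PageRank iteration written as a plain MapReduce job (the tab-separated record format and the hard-coded vertex count are illustrative assumptions, not from the slides); every iteration re-reads and re-writes the whole graph:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // One PageRank iteration as a vanilla MapReduce job (illustrative sketch).
    // Assumed line format: "vertexId<TAB>pageRank<TAB>neighbor1,neighbor2,..."
    public class PageRankIteration {

      public static class PRMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = line.toString().split("\t");
          String vertexId = parts[0];
          double rank = Double.parseDouble(parts[1]);
          String edges = parts.length > 2 ? parts[2] : "";
          String[] neighbors = edges.isEmpty() ? new String[0] : edges.split(",");

          // Re-emit the full graph state so the reducer can rebuild it.
          ctx.write(new Text(vertexId), new Text("GRAPH\t" + edges));

          // Emit this vertex's share of PageRank to each neighbor.
          for (String n : neighbors) {
            ctx.write(new Text(n), new Text("SHARE\t" + rank / neighbors.length));
          }
        }
      }

      public static class PRReducer extends Reducer<Text, Text, Text, Text> {
        private static final long NUM_VERTICES = 3;  // would come from the job config

        @Override
        protected void reduce(Text vertexId, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          double sum = 0;
          String edges = "";
          for (Text v : values) {
            String[] parts = v.toString().split("\t", 2);
            if (parts[0].equals("GRAPH")) {
              edges = parts.length > 1 ? parts[1] : "";
            } else {
              sum += Double.parseDouble(parts[1]);
            }
          }
          // Weight factor of 1 plus the small 1/N adjustment, as in the example.
          double newRank = sum + 1.0 / NUM_VERTICES;
          // Write the updated graph state back to disk; the next iteration is a
          // brand-new MapReduce job that reads all of this again.
          ctx.write(vertexId, new Text(newRank + "\t" + edges));
        }
      }
    }

The sort/shuffle between the two stages and the full rewrite of the graph state at the end are exactly the per-iteration costs Giraph avoids.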

Page 22: Apache  Giraph on yarn

PageRank on Giraph

1. Hadoop Mappers are "hijacked" to host Giraph master and worker tasks

Page 23: Apache  Giraph on yarn

PageRank on Giraph

2. Input graph is loaded once, maintaining code-data locality when possible

Page 24: Apache  Giraph on yarn

PageRank on Giraph

3. All iterations are performed on data in memory, optionally spilled to disk. Disk access is linear/scan-based

Page 25: Apache  Giraph on yarn

PageRank on Giraph

4. Output is written from the Mappers hosting the calculation, and the job run ends

Page 26: Apache  Giraph on yarn

PageRank on Giraph

This is all well and good, but must we manipulate Hadoop this way?

Heap and other resources are set once, globally, for all Mappers in the computation
No control over which cluster nodes host which tasks
No control over how Mappers are scheduled
The Mapper and Reducer slot abstraction is meaningless for Giraph

Page 27: Apache  Giraph on yarn

Overview of YARN

YARN (Yet Another Resource Negotiator) is Hadoop's next-generation resource management platform

A general-purpose framework that is not tied to the MapReduce paradigm

Offers fine-grained control over each task's resource allocation

Page 28: Apache  Giraph on yarn

Giraph on YARN

It’s a natural fit!

Page 29: Apache  Giraph on yarn

Giraph on YARN

Components: Client, Resource Manager, Application Master, Node Managers, ZooKeeper

[Figure: the Client submits the job to the Resource Manager, which launches an Application Master; Node Managers host the App Master, the Giraph Master, and Worker tasks, all coordinating through ZooKeeper]

Page 30: Apache  Giraph on yarn

Giraph Architecture

Master / Workers / ZooKeeper

[Figure: a single Master coordinating multiple Workers, with ZooKeeper providing coordination]

Page 31: Apache  Giraph on yarn

Metrics

Performance: processing time
Scalability: graph size (number of vertices and number of edges)

Page 32: Apache  Giraph on yarn

Optimization Factors

JVM:
• GC control (Parallel GC, Concurrent GC, Young Generation)

Giraph App:
• Memory size
• Number of workers
• Combiner
• Out-of-core
• Object reuse
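For reference, the JVM-side knobs above map to standard HotSpot options: -XX:+UseParallelGC for parallel GC, -XX:+UseConcMarkSweepGC for concurrent GC, -Xmn for the young-generation size, and -Xmx for the per-worker heap. How these flags reach the worker JVMs (for example through the Hadoop task launch options) depends on the deployment, so treat the mapping as a general guide rather than Giraph-specific configuration.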

Page 33: Apache  Giraph on yarn

Experimental Settings

Cluster: 43 nodes, ~800 GB total memory
Hadoop-2.0.3-alpha (non-secure)
Giraph-1.0.0-release

Data: LinkedIn social network graph
• Approx. 205 million vertices
• Approx. 11 billion edges

Application: PageRank algorithm

Page 34: Apache  Giraph on yarn

Baseline Result

10 vs. 20 GB per worker, with max memory of 800 GB

Processing time: 10 GB per worker gives better performance
Scalability: 20 GB per worker gives higher scalability

[Figures: processing time (sec) vs. number of vertices (millions), and number of vertices (millions) vs. number of workers, for 10 GB and 20 GB per worker; at 40 workers the totals reach 400 GB and 800 GB respectively]

Page 35: Apache  Giraph on yarn

Heap Dump w/o Concurrent GC

              Iteration 3    Iteration 27
Reachable     1.5 GB         1.5 GB
Unreachable   3 GB           6 GB

A big portion of the unreachable objects are messages created at each superstep

Page 36: Apache  Giraph on yarn

Concurrent GC

Significantly improves scalability, by 3x
Suffers a 16% performance degradation

[Figures: memory needed (GB) and processing time (sec) vs. number of vertices (millions), concurrent GC on vs. off, at 20 GB per worker]

Page 37: Apache  Giraph on yarn

Using Combiner

Scales up 2x without any other optimizations
Speeds up performance by 50%

[Figures: number of vertices (millions) and processing time (sec) vs. number of workers, combiner on vs. off, at 20 GB per worker]
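The combiner referred to here aggregates messages headed for the same destination vertex before they are sent, so each destination receives one combined value instead of many individual shares. A hedged sketch of a sum combiner, loosely following Giraph's MessageCombiner interface (older Giraph releases, including the 1.0 build used above, expose a similar but not identical Combiner class):

    import org.apache.giraph.combiner.MessageCombiner;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;

    // Sums PageRank shares addressed to the same vertex, so only one message
    // per destination crosses the network (and sits in memory) per superstep.
    public class DoubleSumCombiner
        implements MessageCombiner<LongWritable, DoubleWritable> {

      @Override
      public void combine(LongWritable vertexId,
                          DoubleWritable originalMessage,
                          DoubleWritable messageToCombine) {
        originalMessage.set(originalMessage.get() + messageToCombine.get());
      }

      @Override
      public DoubleWritable createInitialMessage() {
        return new DoubleWritable(0);
      }
    }

This only works because PageRank consumes its messages through a sum, which is associative and commutative; a combiner cannot be used when the vertex needs to see each message individually.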

Page 38: Apache  Giraph on yarn

Memory Distribution

More workers achieve better performance
Larger memory size per worker provides higher scalability

[Figure: performance (processing time, sec) and scalability (number of vertices, millions) for 20 workers (20 GB), 40 (10 GB), 80 (5 GB), and 100 (4 GB), with total memory fixed at 400 GB]

Page 39: Apache  Giraph on yarn

Application – Object Reuse

Improves scalability by 5x
Improves performance by 4x
Requires skill from application developers

[Figures: minimum memory needed (GB) and processing time (sec) vs. number of vertices (millions), with and without memory reuse, at 20 GB per worker; the largest run with memory reuse is annotated at 650 GB and 29 minutes]
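Object reuse here means cutting per-message and per-value allocations inside compute(), so far fewer short-lived objects ever reach the garbage collector. A hedged illustration of the pattern, reusing the PageRank sketch from earlier (whether a given reuse is safe depends on when Giraph serializes messages and values, which is exactly why the slide notes it requires skill from application developers):

    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    // Same PageRank update as before, but without allocating a new Writable
    // per message or per superstep.
    public class ReusingPageRank extends BasicComputation<
        LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

      private static final int MAX_SUPERSTEPS = 30;

      // Reused for every outgoing message instead of "new DoubleWritable(...)".
      private final DoubleWritable reusableMessage = new DoubleWritable();

      @Override
      public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
                          Iterable<DoubleWritable> messages) {
        if (getSuperstep() > 0) {
          double sum = 0;
          for (DoubleWritable msg : messages) {
            sum += msg.get();
          }
          // Mutate the vertex's existing value object in place rather than
          // replacing it with a freshly allocated one.
          vertex.getValue().set(sum + 1.0 / getTotalNumVertices());
        }
        if (getSuperstep() < MAX_SUPERSTEPS) {
          reusableMessage.set(vertex.getValue().get() / vertex.getNumEdges());
          // Safe only if the message is serialized at send time; otherwise
          // every neighbor would observe the last value written here.
          sendMessageToAllEdges(vertex, reusableMessage);
        } else {
          vertex.voteToHalt();
        }
      }
    }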

Page 40: Apache  Giraph on yarn

Problems of Giraph on YARN

Many knobs must be tuned to make Giraph applications work efficiently
Highly dependent on skilled application developers
Performance penalties when scaling up

Page 41: Apache  Giraph on yarn

Future Direction

C++ provides direct control over memory management

No need to rewrite the whole of Giraph; only the master and worker would be in C++

Page 42: Apache  Giraph on yarn

Conclusion

LinkedIn is the first adopter of Giraph on YARN

Improvements and bug fixes contributed as patches to Apache Giraph
Made the full LinkedIn graph run on a 40-node cluster with 650 GB of memory
Evaluated various performance and scalability options

