Multi-Tenant Data Cloud with YARN & Helix

Date post: 19-Aug-2014
Upload: kishore-gopalakrishna
Description:
Building applications on YARN with Helix
Transcript
Page 1: Multi-Tenant Data Cloud with YARN & Helix

Multi-Tenant Data Cloud with YARN & Helix

Kishore Gopalakrishna (@kishore_b_g)

LinkedIn - Data infra: Helix, Espresso

Yahoo - Ads infra: S4

Thursday, June 5, 2014

Page 3: Multi-Tenant Data Cloud with YARN & Helix

What is YARN: Next Generation Compute Platform

Hadoop 1.0: MapReduce on HDFS

Hadoop 2.0: MapReduce and Others (Batch, Interactive, Online, Streaming) on YARN (cluster resource management) on HDFS

[Diagram: tasks A1-A3, B1-B5, and C1-C5 from three applications placed across the cluster; label: "Enables"]

Page 4: Multi-Tenant Data Cloud with YARN & Helix

YARN Architecture

[Diagram components: Client, Resource Manager, two Node Managers, Application Master, Container, App Package, HDFS/Common Area]

[Diagram arrows: submit job (Client to Resource Manager), node status (each Node Manager to Resource Manager), container request (Application Master to Resource Manager)]

Page 5: Multi-Tenant Data Cloud with YARN & Helix

So, let’s build something

Page 6: Multi-Tenant Data Cloud with YARN & Helix

Example System

- Generate data in Hadoop
- Use it for serving

[Diagram labels: Generate Data (M/R, HDFS 3); Serve (Redis Server 3)]

Page 9: Multi-Tenant Data Cloud with YARN & Helix

Example System

Requirements
- Big Data :-)
- Partitioned, replicated
- Fault tolerant, Scalable
- Efficient resource utilization

ApplicationMaster
- Request Containers
- Assign work
- Handle Failure
- Handle workload Changes

[Diagram labels: Generate Data (M/R, HDFS 3); Serve (Server 3)]

Page 10: Multi-Tenant Data Cloud with YARN & Helix

Allocation + Assignment

M/R job generates partitioned data (HDFS: p1 p2 p3 p4 p5 p6)

Multiple servers to serve the partitioned data

Container Allocation - data affinity, rack aware placement

Partition Assignment - affinity, even distribution

Replica Placement - on different physical machines

Server 1: p1 p2, p5 p4
Server 2: p3 p4, p1 p6
Server 3: p5 p6, p3 p2
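
The placement rules on this slide (even distribution, replicas on different machines) can be sketched with a simple round-robin scheme. This is only an illustration under stated assumptions: `assign_replicas` is a hypothetical helper, not a Helix or YARN API, and it ignores data affinity and rack awareness, which a real allocator would also weigh.

```python
from collections import defaultdict


def assign_replicas(partitions, servers, replicas):
    """Round-robin placement: replica r of partition i goes to server (i + r) mod n.

    Hypothetical helper for illustration; not a Helix or YARN API.
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        for r in range(replicas):
            # Consecutive offsets guarantee distinct servers for one partition.
            assignment[servers[(i + r) % len(servers)]].append(partition)
    return dict(assignment)


partitions = ["p1", "p2", "p3", "p4", "p5", "p6"]
servers = ["server1", "server2", "server3"]
layout = assign_replicas(partitions, servers, replicas=2)
for server, hosted in layout.items():
    print(server, hosted)
```

With 6 partitions, 2 replicas, and 3 servers, every server ends up hosting 4 replicas and no server holds two copies of the same partition. The exact pairing differs from the one drawn on the slide, which may also reflect data affinity.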

Page 13: Multi-Tenant Data Cloud with YARN & Helix

Failure Handling

On Failure - Even load distribution, while waiting for new container

Acquire new container close to data if possible

Assign failed partitions to new container

[Diagram: the three-server layout (p1 p2 / p5 p4, p3 p4 / p1 p6, p5 p6 / p3 p2) with a new container, Server 4, taking over p5 p6 and p3 p2 after a failure]
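
The "even load distribution, while waiting for new container" step can be sketched as follows. This is a minimal illustration, assuming a greedy policy: each partition of the failed server goes to the least-loaded survivor that does not already host it. `redistribute_on_failure` is a hypothetical name, not part of Helix.

```python
def redistribute_on_failure(assignment, failed):
    """Spread the failed server's partitions over the survivors (greedy sketch)."""
    survivors = {s: list(p) for s, p in assignment.items() if s != failed}
    for partition in assignment.get(failed, []):
        # A survivor that already hosts this partition cannot take a second copy.
        candidates = [s for s in survivors if partition not in survivors[s]]
        if not candidates:
            continue  # replica stays down until a new container is acquired
        target = min(candidates, key=lambda s: (len(survivors[s]), s))
        survivors[target].append(partition)
    return survivors


assignment = {
    "server1": ["p1", "p2", "p5", "p4"],
    "server2": ["p3", "p4", "p1", "p6"],
    "server3": ["p5", "p6", "p3", "p2"],
}
after_failure = redistribute_on_failure(assignment, failed="server3")
for server, hosted in after_failure.items():
    print(server, hosted)
```

With only two survivors and two replicas per partition, both survivors temporarily host all six partitions. Once a replacement container arrives, the failed partitions can be moved onto it.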

Page 16: Multi-Tenant Data Cloud with YARN & Helix

Workload Changes

Monitor - CPU, Memory, Latency, TPS

Workload change - Acquire/Release containers

Container change - Re-distribute work

Server 1: p1 p2, p5
Server 2: p3 p4, p1
Server 3: p5 p6, p3
Server 4: p4 p6, p2
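
Re-distributing work after a container is added, while moving as few partitions as possible, can be sketched like this. It is an assumption-laden illustration: `scale_out` is a hypothetical helper that moves one replica at a time from the most loaded server to the new one until the new server reaches the average load.

```python
def scale_out(assignment, new_server):
    """Move just enough replicas to the new server to even out the load."""
    result = {s: list(parts) for s, parts in assignment.items()}
    result[new_server] = []
    target = sum(len(parts) for parts in result.values()) // len(result)
    moves = []
    while len(result[new_server]) < target:
        # Try donors from most to least loaded (stable for equal loads).
        donors = sorted((s for s in result if s != new_server),
                        key=lambda s: -len(result[s]))
        for donor in donors:
            movable = [p for p in result[donor] if p not in result[new_server]]
            if movable and len(result[donor]) > target:
                partition = movable[-1]
                result[donor].remove(partition)
                result[new_server].append(partition)
                moves.append((partition, donor, new_server))
                break
        else:
            break  # nothing can move without breaking a constraint
    return result, moves


before = {
    "server1": ["p1", "p2", "p5", "p4"],
    "server2": ["p3", "p4", "p1", "p6"],
    "server3": ["p5", "p6", "p3", "p2"],
}
after, moves = scale_out(before, "server4")
print(after)
print(moves)
```

Starting from the three-server layout, only three replicas move (p4, p6, p2), one from each existing server, and every server ends with three replicas.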

Page 18: Multi-Tenant Data Cloud with YARN & Helix

Service Discovery

Discover everything, what is running where

Dynamically updated on changes

[Diagram: Clients use Service Discovery to locate the partitions hosted on Server 1, Server 2, and Server 3]
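
One way to picture "what is running where, dynamically updated on changes" is a client-side index from partition to servers that is rebuilt whenever the assignment changes. The `RoutingTable` class below is a hypothetical sketch for illustration and does not reproduce any Helix API.

```python
class RoutingTable:
    """Hypothetical client-side view of 'what is running where'."""

    def __init__(self):
        self._servers_by_partition = {}

    def on_assignment_change(self, assignment):
        # Rebuild the partition -> servers index whenever the assignment changes.
        index = {}
        for server, partitions in assignment.items():
            for partition in partitions:
                index.setdefault(partition, []).append(server)
        self._servers_by_partition = index

    def servers_for(self, partition):
        return list(self._servers_by_partition.get(partition, []))


table = RoutingTable()
table.on_assignment_change({
    "server1": ["p1", "p2", "p5", "p4"],
    "server2": ["p3", "p4", "p1", "p6"],
    "server3": ["p5", "p6", "p3", "p2"],
})
print(table.servers_for("p1"))

# A later change (for example, server3 lost) replaces the whole view.
table.on_assignment_change({
    "server1": ["p1", "p2", "p5", "p4", "p6", "p3"],
    "server2": ["p3", "p4", "p1", "p6", "p5", "p2"],
})
print(table.servers_for("p6"))
```

Clients only read the table; whoever observes assignment changes calls `on_assignment_change`, so lookups never see a half-updated index.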

Page 20: Multi-Tenant Data Cloud with YARN & Helix

Building YARN Application

Writing AM is Hard and Error Prone

Handling Faults, Workload Changes is non-trivial and often overlooked

Request container
- How many containers
- Where

Assign work
- Place partitions & replicas
- Affinity

Workload changes
- acquire/release containers
- Minimize movement

Faults Handling
- Detect non trivial failures
- new v/s reuse containers

Other
- Service Discovery
- Monitoring

Is there something that can make this easy?

Page 21: Multi-Tenant Data Cloud with YARN & Helix

Apache Helix

Page 22: Multi-Tenant Data Cloud with YARN & Helix

What is Helix?

- Built at LinkedIn, 2+ years in production
- Generic cluster management framework
- Contributed to Apache, now a TLP: helix.apache.org
- Decoupling cluster management from core functionality

Page 23: Multi-Tenant Data Cloud with YARN & Helix

Helix at LinkedIn: In Production

[Diagram labels: User Writes, Oracle DB, Change Capture, Change Consumers, Index, Search Index, Data Replicator, ETL, HDFS, Analytics]

Page 24: Multi-Tenant Data Cloud with YARN & Helix

Helix at LinkedIn: In Production

- Over 1000 instances covering over 30000 partitions
- Over 1000 instances for change capture consumers
- As many as 500 instances in a single Helix cluster

(all numbers are per-datacenter)

Page 25: Multi-Tenant Data Cloud with YARN & Helix

Others Using Helix

Page 30: Multi-Tenant Data Cloud with YARN & Helix

Helix concepts

Resource (Database, Index, Topic, Task)

Partitions: p1 p2 p3 p4 p5 p6

Replicas: r1, r2, r3

[Diagram: three Container Processes]

Assignment?

Page 34: Multi-Tenant Data Cloud with YARN & Helix

Helix Concepts: State Model and Constraints

States: Stop, bootstrap, Serve

StateCount = Replication factor: 3

Scope | State Constraints | Transition Constraints
Partition | Serve: 3, bootstrap: 0 | Max T1 transitions in parallel
Resource | - | Max T2 transitions in parallel
Node | No more than 10 replicas | Max T3 transitions in parallel
Cluster | - | Max T4 transitions in parallel
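
A state model with a transition constraint can be sketched as a small graph plus a throttle. The slide names the states but not the edges between them, so the `ALLOWED` graph below is an assumption, and `next_state` and `plan_transitions` are hypothetical helpers rather than Helix APIs. The throttle stands in for a "max T transitions in parallel" constraint.

```python
from collections import deque

# Assumed transition graph for the slide's three states (not given explicitly).
ALLOWED = {
    "stop": {"bootstrap"},
    "bootstrap": {"serve", "stop"},
    "serve": {"stop"},
}


def next_state(state, goal):
    """Return the first hop on a shortest allowed path, or None if already there."""
    queue = deque([(state, None)])
    seen = {state}
    while queue:
        node, first_hop = queue.popleft()
        if node == goal:
            return first_hop
        for neighbor in sorted(ALLOWED[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, first_hop or neighbor))
    raise ValueError(f"no path from {state} to {goal}")


def plan_transitions(current, target, max_parallel):
    """Pick at most max_parallel transitions that move replicas toward target."""
    batch = []
    for replica in sorted(current):
        hop = next_state(current[replica], target[replica])
        if hop is not None:
            batch.append((replica, current[replica], hop))
    return batch[:max_parallel]


current = {"r1": "stop", "r2": "stop", "r3": "bootstrap"}
target = {"r1": "serve", "r2": "serve", "r3": "serve"}
batch = plan_transitions(current, target, max_parallel=2)
print(batch)
```

Here three replicas need work but only two transitions are issued in this round; the third waits for the next round, which is how a parallelism limit keeps bootstrap load bounded.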

Page 35: Multi-Tenant Data Cloud with YARN & Helix

Helix Architecture

[Diagram: three Participants hosting partitions P1-P8, each partition with a state (stop, bootstrap, server); a Controller containing Target Provider, Provisioner, and Rebalancer; Clients attached as spectators for Service Discovery; arrows labeled "assign work via callback" and "metrics"]

Page 36: Multi-Tenant Data Cloud with YARN & Helix

Helix Controller: High-Level Overview

[Diagram labels: Resource Config, Constraints, Objectives; Controller (TargetProvider, Provisioner, Rebalancer); Number of Containers; Task -> Container Mapping; YARN RM]

Page 37: Multi-Tenant Data Cloud with YARN & Helix

Helix Controller: Target Provider

Determine how many containers are required along with the spec

Fixed, CPU, Memory, Bin Packing

Monitoring system provides usage information

Default implementations, Bin Packing can be used to customize further

TargetProvider inputs
- Resources: p1, p2 .. pn
- Existing containers: c1, c2 .. cn
- Health of tasks, containers: cpu, memory, health
- Allocation constraints: Affinity, rack locality
- SLA: Fixed: 10 containers; CPU headroom: 30%; Memory Usage: 70%; time: 5h

TargetProvider outputs
- Number of containers: acquire list, release list
- Container spec: cpu: x, memory: y, location: L
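
As a rough illustration of a CPU-based target provider, the sketch below sizes the cluster so that each container keeps a configured CPU headroom, then derives how many containers to acquire or which to release. The function names and the sizing formula are assumptions made for this example, not the default implementations mentioned on the slide.

```python
import math


def target_container_count(cpu_usage, headroom, min_containers, max_containers):
    """cpu_usage: per-container CPU utilization as fractions in [0, 1]."""
    demand = sum(cpu_usage)
    usable_per_container = 1.0 - headroom
    needed = math.ceil(demand / usable_per_container)
    # Clamp to the configured bounds.
    return max(min_containers, min(max_containers, needed))


def plan_containers(existing, needed):
    """Return (number of containers to acquire, list of containers to release)."""
    if needed >= len(existing):
        return needed - len(existing), []
    return 0, existing[needed:]


existing = ["c1", "c2", "c3"]

busy = target_container_count([0.9, 0.9, 0.9], headroom=0.3,
                              min_containers=1, max_containers=25)
print(busy, plan_containers(existing, busy))

idle = target_container_count([0.1, 0.1, 0.1], headroom=0.3,
                              min_containers=1, max_containers=25)
print(idle, plan_containers(existing, idle))
```

Under heavy load the sketch asks for one more container; when the cluster is mostly idle it releases two. A production provider would also consider memory, SLA targets, and locality.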

Page 38: Multi-Tenant Data Cloud with YARN & Helix

Helix Controller: Provisioner

Given the container spec, interact with YARN RM to acquire/release, NM to start/stop containers

YARN: Interacts with YARN RM and subscribes to notifications

Page 39: Multi-Tenant Data Cloud with YARN & Helix

Helix Controller: Rebalancer

Based on the current nodes in the cluster and constraints, find an assignment of task to node

Auto, Semi-Auto, Static, User defined

Rebalancer inputs
- Tasks: t1, t2 .. tn
- Existing containers: c1, c2 .. cn
- Allocation constraints & objectives: Affinity, rack locality, Even distribution of tasks, Minimize movement while expanding

Rebalancer output
- Assignment: C1: t1, t2; C2: t3, t4

Based on the FSM, compute & fire the transitions to Participants
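
The assignment shown on this slide (C1: t1, t2; C2: t3, t4) is what a plain even split produces. The sketch below is a hypothetical `even_assignment` helper that covers only the "even distribution of tasks" objective; affinity, rack locality, and movement minimization are left out.

```python
def even_assignment(tasks, containers):
    """Split tasks into contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(tasks), len(containers))
    result, start = {}, 0
    for i, container in enumerate(containers):
        size = base + (1 if i < extra else 0)
        result[container] = tasks[start:start + size]
        start += size
    return result


mapping = even_assignment(["t1", "t2", "t3", "t4"], ["C1", "C2"])
print(mapping)
```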

Page 40: Multi-Tenant Data Cloud with YARN & Helix

Example System: Helix-Based Solution

Solution
- Configure App
- Configure Target Provider
- Configure Provisioner
- Configure Rebalancer

[Diagram labels: Generate Data (M/R, HDFS 3); Serve (Server 3)]

Page 41: Multi-Tenant Data Cloud with YARN & Helix

Example System: Helix-Based Solution

app_config_spec.yaml

Configure App
- App Name: Partitioned Data Server
- App Master Package: /path/to/GenericHelixAppMaster.tar
- App package: /path/to/RedisServerLauncher.tar
- App Config: DataDirectory: hdfs:/path/to/data

Configure target provider
- TargetProvider: RedisTargetProvider
- Goal: Target TPS: 1 million
- Min container: 1
- Max containers: 25

Configure Provisioner
- YARN RM: host:port

Configure Rebalancer
- Partitions: 6
- Replica: 2
- Max partitions per container: 4
- Rebalancer.Mode: AUTO
- Placement: Data Affinity
- FailureHandling: Even distribution
- Scaling: Minimize Movement
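
The rebalancer settings above imply a lower bound on the number of containers, which the short calculation below works out. The variable names are made up for this example, and the placement bound assumes that replicas of one partition must sit in different containers.

```python
import math

partitions = 6
replicas = 2
max_partitions_per_container = 4
min_containers, max_containers = 1, 25

total_replicas = partitions * replicas  # 12 replicas to place
# Capacity bound: enough containers to hold every replica.
by_capacity = math.ceil(total_replicas / max_partitions_per_container)
# Placement bound (assumed): replicas of a partition sit in different containers.
by_placement = replicas
required = max(by_capacity, by_placement, min_containers)
print(required)
```

With these values the capacity bound dominates and at least 3 containers are needed, well inside the configured range of 1 to 25.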

Page 42: Multi-Tenant Data Cloud with YARN & Helix

Launch Application

yarn_app_launcher.sh app_config_spec.yaml

Page 49: Multi-Tenant Data Cloud with YARN & Helix

Helix + YARN

[Diagram, in build order]
- submit job: Client to YARN Resource Manager
- Launch AM: the Application Master hosts the Helix Controller (Target Provider, Provisioner, Rebalancer)
- request cntrs: Application Master to YARN Resource Manager
- launch containers: on the Node Managers, each container running a participant
- assign work: Rebalancer to participants

participant 1: p1 p2, p5 p4
participant 2: p3 p4, p1 p6
participant 3: p5 p6, p3 p2

Page 50: Multi-Tenant Data Cloud with YARN & Helix

Auto Scaling

Non linear scaling from 0 to 1M TPS and back

Page 51: Multi-Tenant Data Cloud with YARN & Helix

Failure Handling: Random Faults

Recovering from faults at 1M TPS (5%, 10%, 20% failures/min)

Page 52: Multi-Tenant Data Cloud with YARN & Helix

Summary

[Diagram labels: HDFS; YARN (cluster resource management); HELIX (container + task management); Others (Batch, Interactive, Online, Streaming)]

- Fault tolerance, Expansion handled transparently
- Generic Application Master
- Efficient resource utilization by task model

Page 53: Multi-Tenant Data Cloud with YARN & Helix

Questions?

- Website: helix.apache.org, #apachehelix
- Twitter: @apachehelix, @kishore_b_g
- Mail: [email protected]
- Team: Kanak Biscuitwala, Zhen Zhang

We love helping & being helped