Transcript

Bay Area Spark Meetup, 05/19/2015

Spark Streaming Resiliency

Prasanna Padmanabhan & Bharat Venkat

Personalization Infrastructure

Agenda

● Background

● Use cases for Real Time Stream Processing

● Motivations for Spark

● Creating Chaos

● Spark Streaming Primer

● Deployment Setup

● Injecting Chaos in Spark

● Future

Netflix is a logging company

that also happens to stream videos

Scale at Netflix

● 400 Billion events per day

● 8 Million events/sec during peak

● Numerous types of events (UI events, play events, impression events, etc.)

What do we do with it?

● Event logs are captured into Hadoop (EMR)

● Run ETL jobs using Hive/Presto for

○ Input to pre-computed recommendations

○ User behavior analysis

○ Data analysis and reporting

Use Cases for Stream Processing

● Recommendations based on collective real-time signals

● Faster identification of data anomalies and regressions

[Chart annotation: Bad iPhone push]

Motivations for Spark

● Popular compute engine for batch processing

● Widely used for offline experimentation at Netflix

● Improves agility with interactive queries

[Figure: Interactive Experimenter’s Notebook]

Motivations for Spark

● Single platform to build batch and real-time applications

[Diagram labels: S3, Micro Services, Batch Data, Streaming Data, Spark, Spark Streaming, Recommender Systems]

Challenges in Cloud

● Ephemeral Resources

● Cannot rely on local state

● No fixed IP

Chaos Monkey Approach

● Simulate failures by randomly killing components

● Failures inevitably happen when least desired

● Lather, Rinse, Repeat!
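
The approach above can be sketched as a loop that repeatedly picks a running component at random and kills it. This is a minimal illustration, not Chaos Monkey itself: the component names, the `kill` callback, and the fixed seed are all assumptions made for the example.

```python
import random

def chaos_round(components, kill, rng):
    """Pick one running component at random, kill it, and return its name."""
    victim = rng.choice(sorted(components))
    kill(victim)
    return victim

# Hypothetical component names, for illustration only.
running = {"driver", "master", "worker", "executor"}
killed = []

def kill(name):
    # A real harness would terminate a process or an instance here.
    running.discard(name)
    killed.append(name)

rng = random.Random(42)  # fixed seed so a run can be replayed
while running:           # lather, rinse, repeat
    chaos_round(running, kill, rng)
```

Seeding the generator is a design choice for the sketch: it keeps the "random" kill order reproducible when a failure needs to be investigated.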

Can Spark Streaming survive Chaos Monkey?

Spark Components

[Diagram: Spark Driver, Cluster Manager (Mesos, YARN, Standalone), and Worker Nodes, each running an Executor with Tasks]

● Spark Driver: main program, DAG scheduler

● Cluster Manager: resource allocation

● Spark Worker: runs worker process and monitors executors

How does streaming work?

● Data Streams are processed in batches

● Each batch processed in Spark

● Results are pushed out in batch
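
The three steps above can be illustrated without Spark. The following is a plain-Python sketch of the micro-batch idea, not Spark Streaming's API; the function names, the 2-second batch interval, and the sample events are assumptions.

```python
from collections import defaultdict

def to_batches(events, interval):
    """Group (timestamp, value) events into fixed-width time batches."""
    batches = defaultdict(list)
    for ts, value in events:
        batches[int(ts // interval)].append(value)
    return [batches[k] for k in sorted(batches)]

def process(batch):
    """Per-batch computation; here, a simple count per event type."""
    counts = defaultdict(int)
    for value in batch:
        counts[value] += 1
    return dict(counts)

# Hypothetical events: (seconds since start, event type).
events = [(0.2, "play"), (0.9, "ui"), (1.5, "play"), (2.1, "ui"), (3.7, "ui")]

# One result per batch, produced batch by batch.
results = [process(b) for b in to_batches(events, interval=2)]
```

In this sketch an interval with no events produces no batch at all; a real streaming system may still emit an empty batch for that interval.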

Application Details

● Process subset of UI Events from Kafka

● Compute aggregate metrics

● Publish metrics to Atlas

● Spark 1.2.0

Standalone Cluster Manager

● Provide resource management and resiliency

● All in one package

○ Built-in, easy to deploy

○ Troubleshoot issues with a single team (Databricks)

Deployment

Stream Resiliency

● Streaming application continues to run

● Partial data loss during failure is acceptable

Driver Resiliency (Client Mode)

[Diagram: Master, three Workers, and a Client that hosts the Driver]

./spark-submit --deploy-mode client

● Entire application is killed

Driver Resiliency (Cluster Mode, with supervise)

[Diagram: Master, three Workers, and a Client; the Driver runs on one of the Workers]

./spark-submit --deploy-mode cluster --supervise

● Driver runs in the worker

● Driver is started in a new worker
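
The supervised restart described above can be pictured as choosing a different worker to host the driver once the original driver process is gone. This is a toy sketch of that outcome only, not how Spark implements `--supervise`; the worker names and the `relaunch_driver` helper are hypothetical.

```python
def relaunch_driver(workers, previous_host):
    """Return a worker to host the driver after it died on previous_host."""
    candidates = [w for w in workers if w != previous_host]
    if not candidates:
        raise RuntimeError("no other worker available for the driver")
    return candidates[0]

workers = ["worker-1", "worker-2", "worker-3"]  # hypothetical worker names
driver_host = "worker-1"

# Chaos step: the driver running on worker-1 is killed.
new_host = relaunch_driver(workers, previous_host=driver_host)
```

The sketch always picks the first remaining worker; any placement policy would do for illustrating the behavior shown on the slides.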

Master Resiliency (Single Master)

[Diagram: Master, three Workers, and a Client]

● Entire application is killed

Master Resiliency (Multi Master)

[Diagram: Active Master, Standby Master, three Workers, and a Client]

● Standby becomes active

● No impact
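
The failover above can be sketched as a small election: if the active master is gone, a live standby is promoted. This is an illustration under assumed names (`Master`, `elect_active`, `master-1`, `master-2`), not Spark's implementation; in standalone mode the election is typically coordinated through ZooKeeper, which the sketch leaves out.

```python
class Master:
    def __init__(self, name, active=False):
        self.name = name
        self.active = active
        self.alive = True

def elect_active(masters):
    """Return the active master, promoting a live standby if needed."""
    for m in masters:
        if m.alive and m.active:
            return m
    for m in masters:
        if m.alive:
            m.active = True  # standby becomes active
            return m
    raise RuntimeError("no live master: the application cannot continue")

masters = [Master("master-1", active=True), Master("master-2")]
masters[0].alive = False     # chaos step: kill the active master
leader = elect_active(masters)
```

With a single master the second loop finds no candidate and the function raises, which mirrors the single-master case where the application does not survive.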

Worker Resiliency

[Diagram: Master, Workers, and a Client; one Worker hosts the Driver and an Executor]

● Executor runs as child process of Worker

● Driver and Executor are also killed

● Worker is relaunched

● Driver and Executor are also relaunched

Executor Resiliency

[Diagram: Master, Workers, and a Client; the Driver and Executors run on the Workers]

● Executor is relaunched

● Tasks in flight are rescheduled
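
Rescheduling of in-flight tasks can be sketched as reassigning every task that was running on the failed executor to its replacement. This is a simplified illustration, not Spark's scheduler; the task names, executor ids, and the `reschedule` helper are hypothetical.

```python
def reschedule(assignments, failed, replacement):
    """Move tasks that were in flight on a failed executor to a replacement."""
    updated = {}
    for task, executor in assignments.items():
        updated[task] = replacement if executor == failed else executor
    return updated

# Hypothetical in-flight tasks and executor ids.
in_flight = {"task-0": "exec-A", "task-1": "exec-B", "task-2": "exec-A"}

# Chaos step: exec-A is killed and relaunched as exec-A2.
after = reschedule(in_flight, failed="exec-A", replacement="exec-A2")
```

Tasks on healthy executors are left where they are; only the work that was lost with the killed executor is run again.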

Resiliency Results

Summary

Future

● Lambda Architecture

● Operational Enhancements

○ Dynamic scaling

○ Additional Spark instrumentation

We are hiring!

● http://bit.ly/persinfra (Senior Software Engineer - Personalization Infra)

