Productionizing your Streaming Jobs

Date posted: 13-Apr-2017
Uploaded by: databricks
Transcript
Page 1: Productionizing your Streaming Jobs

Productionizing your Streaming Jobs

Prakash Chockalingam @prakash573

Page 2: Productionizing your Streaming Jobs

About the speaker: Prakash Chockalingam

Prakash is currently a Solutions Architect at Databricks, where he focuses on helping customers build their big data infrastructure, drawing on a decade of experience building large-scale distributed systems and machine learning infrastructure at companies including Netflix and Yahoo. Prior to joining Databricks, he was at Netflix, designing and building the recommendation infrastructure that serves out millions of recommendations to Netflix users every day.


Page 3: Productionizing your Streaming Jobs

About the moderator: Denny Lee

Denny Lee is a Technology Evangelist with Databricks. He is a hands-on data science engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems, both on-premises and in the cloud.


Page 4: Productionizing your Streaming Jobs

About Databricks

Founded by creators of Spark in 2013

Cloud enterprise data platform
- Managed Spark clusters
- Interactive data science
- Production pipelines
- Data governance, security, …

Page 5: Productionizing your Streaming Jobs

Agenda

• Introduction to Spark Streaming

• Lifecycle of a Spark streaming app

• Aggregations and best practices

• Operationalization tips

• Key benefits of Spark streaming

Page 6: Productionizing your Streaming Jobs

What is Spark Streaming?

Page 7: Productionizing your Streaming Jobs

Spark Streaming

Page 8: Productionizing your Streaming Jobs
Page 9: Productionizing your Streaming Jobs

How does it work?

● Receivers receive data streams and chop them into batches.

● Spark processes the batches and pushes out the results.
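The micro-batch model above can be sketched in plain Scala, with no Spark involved. This is an illustration only: a "receiver" chops an incoming sequence of records into fixed-size batches (the batch size standing in for the batch interval), and each batch is processed independently. All names here are made up for the sketch.

```scala
// Plain-Scala sketch (no Spark) of the micro-batch idea: chop a stream of
// records into batches, then process each batch independently.
object MicroBatchSketch {
  // Chop the stream into batches of `batchSize` records
  // (a stand-in for the batch interval).
  def chop[A](stream: Seq[A], batchSize: Int): Seq[Seq[A]] =
    stream.grouped(batchSize).toSeq

  // "Process" a batch: word count within the batch.
  def process(batch: Seq[String]): Map[String, Int] =
    batch.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  def main(args: Array[String]): Unit = {
    val stream  = Seq("a", "b", "a", "c", "b", "a")
    val results = chop(stream, 3).map(process) // one word-count map per batch
    println(results)
  }
}
```

In real Spark Streaming the chopping happens continuously on receivers and the per-batch processing is distributed across executors; the sketch only shows the batching structure.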

Page 10: Productionizing your Streaming Jobs

Word Count

val context = new StreamingContext(conf, Seconds(1))  // entry point; Seconds(1) is the batch interval

val lines = context.socketTextStream(...)  // DStream: represents a data stream

Page 11: Productionizing your Streaming Jobs

Word Count

val context = new StreamingContext(conf, Seconds(1))

val lines = context.socketTextStream(...)

val words = lines.flatMap(_.split(" "))

Transformations: transform data to create new DStreams

Page 12: Productionizing your Streaming Jobs

Word Count

val context = new StreamingContext(conf, Seconds(1))

val lines = context.socketTextStream(...)

val words = lines.flatMap(_.split(" "))

val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

wordCounts.print()  // print the DStream contents on screen

context.start()  // start the streaming job

Page 13: Productionizing your Streaming Jobs

Lifecycle of a streaming app

Page 14: Productionizing your Streaming Jobs

Execution in any Spark Application

Spark Driver

User code runs in the driver process

YARN / Mesos / Spark Standalone cluster

Tasks sent to executors for processing data

Spark Executor

Spark Executor

Spark Executor

Driver launches executors in cluster

Page 15: Productionizing your Streaming Jobs

Execution in Spark Streaming: Receiving data

The driver runs receivers as long-running tasks. A receiver receives the data stream, divides it into blocks, and keeps the blocks in memory on its executor; the blocks are also replicated to another executor.

Driver:

object WordCount {
  def main(args: Array[String]) {
    val context = new StreamingContext(...)
    val lines = KafkaUtils.createStream(...)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    context.start()
    context.awaitTermination()
  }
}

Page 16: Productionizing your Streaming Jobs

Execution in Spark Streaming: Processing data

Every batch interval, the driver launches tasks to process the data blocks held on the executors, and the results are pushed out to a data store. (The driver runs the same WordCount program shown on the previous slide.)

Page 17: Productionizing your Streaming Jobs

End-to-end view

t1 = ssc.socketStream("…")
t2 = ssc.socketStream("…")
t = t1.union(t2).map(…)

t.saveAsHadoopFiles(…)
t.map(…).foreach(…)
t.filter(…).foreach(…)

[Diagram] YOU write the streaming app: input DStreams, transformations, and output operations. Every batch interval, the DStreamGraph produces a DAG of RDDs over the received BlockRDDs; the Spark Streaming JobScheduler + JobGenerator submit these as Spark jobs, the Spark DAGScheduler breaks each job into stages (Stage 1, Stage 2, Stage 3), and the Spark TaskScheduler launches the tasks on executors.

Page 18: Productionizing your Streaming Jobs

Aggregations

Page 19: Productionizing your Streaming Jobs

Word count over a time window

val wordCounts = wordStream.reduceByKeyAndWindow(
  (x: Int, y: Int) => x + y, windowSize, slidingInterval)

Reduces the parent DStream over a time window: windowSize is the length of the window, slidingInterval is how often it advances.

Page 20: Productionizing your Streaming Jobs

Word count over a time window

Scenario: Word count for the last 30 minutes.

How to optimize for good performance?

● Increase the batch interval, if possible

● Incremental aggregations with an inverse reduce function:

val wordCounts = wordStream.reduceByKeyAndWindow(
  (x: Int, y: Int) => x + y,
  (x: Int, y: Int) => x - y,
  windowSize, slidingInterval)

● Checkpointing: wordStream.checkpoint(checkpointInterval)
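The inverse-reduce optimization can be illustrated in plain Scala (no Spark). The idea: rather than re-reducing every batch still inside the window, the new window's counts are the old window's counts plus the batch that entered minus the batch that left, so the cost per slide is proportional to the batch, not the window. All names below are illustrative.

```scala
// Plain-Scala sketch of the inverse-reduce idea behind
// reduceByKeyAndWindow(reduce, invReduce, ...).
object IncrementalWindow {
  type Counts = Map[String, Int]

  // Per-batch word count.
  def countBatch(batch: Seq[String]): Counts =
    batch.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Combine two count maps key-by-key with `op`, dropping zero counts.
  def merge(a: Counts, b: Counts, op: (Int, Int) => Int): Counts =
    (a.keySet ++ b.keySet).map { k =>
      k -> op(a.getOrElse(k, 0), b.getOrElse(k, 0))
    }.toMap.filter(_._2 != 0)

  // New window = old window + entering batch - leaving batch.
  def slide(window: Counts, entering: Counts, leaving: Counts): Counts =
    merge(merge(window, entering, _ + _), leaving, _ - _)
}
```

The subtraction is exactly what the inverse reduce function `(x, y) => x - y` provides to Spark; without it, Spark must re-reduce every batch in the window on each slide.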

Page 21: Productionizing your Streaming Jobs

Stateful: Global Aggregations

Scenario: Maintain a global state based on the input events coming in. Ex: Word count from beginning of time.

updateStateByKey (Spark 1.5 and before) ● Performance is proportional to the size of the state.

mapWithState (Spark 1.6+) ● Performance is proportional to the size of the batch.
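The performance contrast above can be made concrete with a plain-Scala sketch (no Spark): a mapWithState-style update visits only the keys present in the current batch, while an updateStateByKey-style pass would touch every key in the state each batch. The helper below is illustrative, not Spark API.

```scala
// Plain-Scala sketch: mapWithState-style global word count.
// Only keys appearing in the batch are touched, so per-batch cost is
// proportional to the batch size, not the (possibly huge) state size.
object GlobalCounts {
  def updateWithBatch(state: Map[String, Long],
                      batch: Seq[String]): Map[String, Long] = {
    // Count the batch once...
    val batchCounts =
      batch.groupBy(identity).map { case (w, ws) => (w, ws.size.toLong) }
    // ...then fold only those keys into the running state.
    batchCounts.foldLeft(state) { case (s, (w, n)) =>
      s.updated(w, s.getOrElse(w, 0L) + n)
    }
  }
}
```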

Page 22: Productionizing your Streaming Jobs

Stateful: Global Aggregations

Page 23: Productionizing your Streaming Jobs

Stateful: Global Aggregations

Key features of mapWithState:

● An initial state - read from somewhere as an RDD

● # of partitions for the state - if you have a good estimate of the size of the state, you can specify the # of partitions

● Partitioner - default: hash partitioner. If you have a good understanding of the key space, you can provide a custom partitioner

● Timeout - keys whose values are not updated within the specified timeout period will be removed from the state

Page 24: Productionizing your Streaming Jobs

Stateful: Global Aggregations (Word count)

val stateSpec = StateSpec.function(updateState _)
  .initialState(initialRDD)
  .numPartitions(100)
  .partitioner(MyPartitioner())
  .timeout(Minutes(120))

val wordCountState = wordStream.mapWithState(stateSpec)

Page 25: Productionizing your Streaming Jobs

Stateful: Global Aggregations (Word count)

def updateState(batchTime: Time,        // current batch time
                key: String,            // a word in the input stream
                value: Option[Int],     // current value (= 1)
                state: State[Long])     // counts so far for the word
  : Option[(String, Long)]              // the word and its new count
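To see how such a function behaves, here is a plain-Scala stand-in (not Spark's API) with a tiny mutable State cell and a driver loop that calls the update function once per (word, 1) record in a batch, roughly as mapWithState does. The batchTime parameter is omitted for brevity; everything here is illustrative.

```scala
// Plain-Scala stand-in for Spark's State[S]: a mutable cell per key.
final class StateCell[S](private var value: Option[S]) {
  def getOption(): Option[S] = value
  def update(s: S): Unit = { value = Some(s) }
}

object UpdateStateDemo {
  // Mirrors the updateState shape above (batchTime dropped):
  // add the incoming value to the running count and emit (word, newCount).
  def updateState(key: String, value: Option[Int],
                  state: StateCell[Long]): Option[(String, Long)] = {
    val newCount = state.getOption().getOrElse(0L) + value.getOrElse(0)
    state.update(newCount)
    Some((key, newCount))
  }

  // Run one batch of words against a per-key state table,
  // calling updateState once per record.
  def runBatch(states: scala.collection.mutable.Map[String, StateCell[Long]],
               batch: Seq[String]): Seq[(String, Long)] =
    batch.flatMap { w =>
      val cell = states.getOrElseUpdate(w, new StateCell[Long](None))
      updateState(w, Some(1), cell)
    }
}
```

Returning None from the real function suppresses output for that key; Spark additionally handles timeouts and partitioning, which this sketch ignores.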

Page 26: Productionizing your Streaming Jobs

Operationalization

Page 27: Productionizing your Streaming Jobs

Checkpoint

Two types of checkpointing:

● Checkpointing Data

● Checkpointing Metadata

Page 28: Productionizing your Streaming Jobs

Checkpoint Data

● Checkpointing DStreams

• Primarily needed to cut long lineage on past batches (updateStateByKey/reduceByKeyAndWindow).

• Example: wordStream.checkpoint(checkpointInterval)

Page 29: Productionizing your Streaming Jobs

Checkpoint Metadata

● Checkpointing Metadata

• All the configuration, DStream operations, and incomplete batches are checkpointed.

• Required for failure recovery if the driver process crashes.

• Example: streamingContext.checkpoint(directory)

Page 30: Productionizing your Streaming Jobs

Achieving good throughput

context.socketStream(...)
  .map(...)
  .filter(...)
  .saveAsHadoopFiles(...)

Problem: there will be one receiver that receives all the data and stores it in its executor, and all the processing happens on that executor. Adding more nodes doesn't help.

Page 31: Productionizing your Streaming Jobs

Achieving good throughput

Solution: Increase the # of receivers and union them.

● Each receiver is run in 1 executor. Having 5 receivers will ensure that the data gets received in parallel in 5 executors.

● Data gets distributed across 5 executors, so all the subsequent Spark map/filter operations will be distributed.

val numStreams = 5
val inputStreams = (1 to numStreams).map(i => context.socketStream(...))
val fullStream = context.union(inputStreams)
fullStream.map(...).filter(...).saveAsHadoopFiles(...)

Page 32: Productionizing your Streaming Jobs

Achieving good throughput

● In the case of direct receivers (like Kafka), set the appropriate # of partitions in Kafka.

● Each Kafka partition gets mapped to a Spark partition.

● More partitions in Kafka = more parallelism in Spark

Page 33: Productionizing your Streaming Jobs

Achieving good throughput

● Provide the right # of partitions based on your cluster size for operations causing shuffles.

words.map(x => (x, 1)).reduceByKey(_ + _, 100)  // 100 = # of partitions

Page 34: Productionizing your Streaming Jobs

Debugging a Streaming application

Streaming tab in Spark UI

Page 35: Productionizing your Streaming Jobs

Debugging a Streaming application

Processing Time

● Make sure that the processing time < batch interval

Page 36: Productionizing your Streaming Jobs

Debugging a Streaming application

Page 37: Productionizing your Streaming Jobs

Debugging a Streaming application

Batch Details Page: ● Input to the batch ● Jobs that were run as part of the processing for the batch

Page 38: Productionizing your Streaming Jobs

Debugging a Streaming application

Job Details Page:
● DAG Visualization
● Stages of a Spark job

Page 39: Productionizing your Streaming Jobs

Debugging a Streaming application

Task Details Page: Ensure that tasks are executed on multiple executors (nodes) in your cluster so there is enough parallelism while processing. If you have a single receiver, sometimes only one executor might be doing all the work even though you have more than one executor in your cluster.

Page 40: Productionizing your Streaming Jobs

Key benefits of Spark streaming

Page 41: Productionizing your Streaming Jobs

Dynamic Load Balancing

Page 42: Productionizing your Streaming Jobs

Fast failure and Straggler recovery

Page 43: Productionizing your Streaming Jobs

Combine Batch and Stream Processing

Join data streams with static data sets:

val dataset = sparkContext.hadoopFile("file")
…
kafkaStream.transform { batchRdd =>
  batchRdd.join(dataset).filter(...)
}
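The stream-static join above can be sketched in plain Scala (no Spark): each micro-batch of (key, event) pairs is joined against a static lookup table loaded once, keeping only keys present in the static set. The names are illustrative.

```scala
// Plain-Scala sketch of joining a micro-batch against a static data set.
object StreamStaticJoin {
  // Inner join: keep (key, (event, staticValue)) only when the key
  // exists in the static table, as RDD.join would.
  def joinBatch[V, W](batch: Seq[(String, V)],
                      static: Map[String, W]): Seq[(String, (V, W))] =
    batch.flatMap { case (k, v) =>
      static.get(k).map(w => (k, (v, w)))
    }
}
```

In Spark the static side is typically loaded once on the driver (or broadcast) and reused across batches, which is exactly why transform-with-join is cheap per batch.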

Page 44: Productionizing your Streaming Jobs

Combine ML and Stream Processing

Learn models offline, apply them online

val model = KMeans.train(dataset, …)

kafkaStream.map { event =>
  model.predict(event.feature)
}
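The "learn offline, apply online" pattern can be sketched in plain Scala: a set of centroids stands in for an offline-trained KMeans model, and predict picks the nearest centroid for each streamed feature vector, which is what KMeansModel.predict does conceptually. The object and helper names below are made up for the sketch.

```scala
// Plain-Scala sketch: centroids learned "offline" are applied "online"
// to each incoming feature vector by nearest-centroid assignment.
object OfflineModelOnlinePredict {
  type Point = Array[Double]

  // Squared Euclidean distance between two points.
  def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // "Model": return the index of the nearest centroid.
  def predict(centroids: Seq[Point], feature: Point): Int =
    centroids.indices.minBy(i => dist2(centroids(i), feature))
}
```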

Page 45: Productionizing your Streaming Jobs

Combine SQL and Stream Processing

inputStream.foreachRDD { rdd =>
  val df = sqlContext.createDataFrame(rdd)
  df.select(...).where(...).groupBy(...)
}

Page 46: Productionizing your Streaming Jobs

Thank you.
