Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)



Spark Streaming + Kafka Best Practices

Brandon O’Brien | @hakczar | Expedia, Inc

Or: “A Case Study in Operationalizing Spark Streaming”

Context/Disclaimer

Our use case: build a resilient, scalable data pipeline with streaming reference data lookups, a 24-hour stream self-join, and some aggregation. Values accuracy over speed.

Spark Streaming 1.5-1.6, Kafka 0.9

Standalone Cluster (not YARN or Mesos)

No Hadoop

Message velocity: thousands of messages per second. Batch window: 10s

Data sources: Kafka (primary), Redis (joins + ref data) & S3 (ref data)

Demo: Spark in Action

Game & Scoreboard Architecture

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience


Spark Streaming & Standalone Cluster Overview

RDD: partitioned, replicated collection of data objects

Driver: the JVM that creates the Spark program and negotiates for resources. It handles scheduling of tasks but does not do the heavy lifting, so it can become a bottleneck.

Executor: slave to the driver; executes tasks on RDD partitions. Function closures are serialized and shipped to the executors.

Lazy execution: transformations build the lineage, actions trigger it (see the sketch below)

Cluster types: Standalone, YARN, Mesos
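To make the lazy-execution point concrete, here is a minimal Spark Streaming sketch (the socket source, host, and port are placeholders, not from the talk): the map/filter transformations only extend the DAG, and nothing runs until an output operation such as foreachRDD forces an action each batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LazyExecutionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-execution-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10s batch window, as in the use case

    val lines = ssc.socketTextStream("localhost", 9999) // placeholder input DStream

    val parsed = lines
      .map(_.split(","))     // transformation: lazy, only extends the DAG
      .filter(_.length == 2) // transformation: still nothing has executed

    // Output operation: registering it is what makes each batch run, and the
    // count() action does its work on the executors, not the driver.
    parsed.foreachRDD { rdd =>
      println(s"records this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```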

Spark Streaming & Standalone Cluster Overview

[Diagram: standalone cluster topology. Each node runs Master, Worker, Executor, and Driver processes, with a ZooKeeper cluster alongside for Master failover.]

Outline Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Design Patterns for Performance

Delegate all IO/CPU work to the executors

Avoid unnecessary shuffles (join, groupBy, repartition)

Externalize streaming joins & reference data lookups for large/volatile ref data sets. Options: JVM static hashmap, external cache (e.g. Redis), static LRU cache (amortizes lookups), RocksDB

Hygienic function closures (capture only what the executors need; see the sketch below)
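As one way to combine these patterns, here is a sketch (assuming Jedis for the Redis client and Guava for the LRU cache; the hostname, port, and the Record shape are illustrative, not from the talk): lookups run on the executors via mapPartitions, a per-JVM static cache amortizes repeated keys, and the closure stays hygienic by capturing nothing from the driver.

```scala
import com.google.common.cache.{Cache, CacheBuilder}
import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.Jedis

case class Record(refKey: String, payload: String, refData: Option[String] = None)

object RefDataCache {
  // One static LRU cache per executor JVM, amortizing repeated ref-data lookups
  lazy val lru: Cache[String, String] =
    CacheBuilder.newBuilder().maximumSize(100000L).build[String, String]()
}

object Enrichment {
  def enrich(stream: DStream[Record]): DStream[Record] =
    stream.mapPartitions { records =>
      // Runs on the executor: the closure captures no driver-side state (hygienic),
      // and the Redis connection is created once per partition, not per record.
      val redis = new Jedis("redis-host", 6379) // placeholder host/port
      val out = records.map { rec =>
        val value = Option(RefDataCache.lru.getIfPresent(rec.refKey)).orElse {
          // Cache miss: fall back to the external cache and remember the answer
          Option(redis.get(rec.refKey)).map { v => RefDataCache.lru.put(rec.refKey, v); v }
        }
        rec.copy(refData = value)
      }.toList // materialize before closing the connection
      redis.close()
      out.iterator
    }
}
```

A JVM static hashmap or a RocksDB store would slot into the same place as the Guava cache here, depending on how large and volatile the reference data set is.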

We’re done, right?

Just need to QA the data…

70% missing data

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Guaranteed Message Processing & Direct Kafka Integration

Guaranteed message processing = at-least-once processing + idempotence

Kafka Receiver:
Consumes messages faster than Spark can process them
Checkpoints before processing is finished
Inefficient CPU utilization

Direct Kafka Integration (sketch below):
Control over checkpointing & transactionality
Better distribution of resource consumption
1:1 Kafka topic-partition to Spark RDD-partition
Use Kafka as the WAL
Statelessness, fail-fast
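Roughly, the direct integration on Spark 1.5/1.6 with Kafka 0.9 looks like the sketch below (broker list, topic name, and the upsert/saveOffsets helpers are placeholders): each batch is processed idempotently, and offsets are persisted only after the batch succeeds, so Kafka itself serves as the write-ahead log.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object DirectKafkaSketch {
  def upsert(value: String): Unit = ()                   // hypothetical idempotent write to the sink
  def saveOffsets(ranges: Array[OffsetRange]): Unit = () // hypothetical offset store (Redis/ZooKeeper)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-kafka-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092,broker2:9092", // placeholder brokers
      "auto.offset.reset"    -> "smallest")                  // where to start if no offsets are stored

    // No receiver: 1:1 Kafka topic-partition to Spark RDD-partition
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))                       // placeholder topic

    stream.foreachRDD { rdd =>
      // Exact offsets for this batch are available because there is no receiver
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { records =>
        // Idempotent writes (e.g. keyed upserts): at-least-once plus idempotence
        // adds up to effectively-once results
        records.foreach { case (_, value) => upsert(value) }
      }

      // Commit offsets only after the batch succeeds; on restart, replay from
      // the stored offsets, using Kafka as the WAL
      saveOffsets(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```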

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Operational Monitoring & Alerting

Driver “heartbeat”
Batch processing time
Message count
Kafka lag (latest offsets vs. last processed)
Driver start events

Tooling: StatsD + Graphite + Seyren, plus Spark’s metrics endpoint at http://localhost:4040/metrics/json/
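One way to emit these signals is plain StatsD over UDP from the driver, sketched below (host, port, and metric names are placeholders; the slides do not show the exact metric layout). The heartbeat counter fires once per batch, so Seyren can alert when it goes quiet; Kafka lag would come from comparing each batch's OffsetRange.untilOffset values against the brokers' latest offsets, and full batch processing time is also exposed on the metrics endpoint above.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}
import org.apache.spark.streaming.dstream.DStream

object Statsd {
  private val host        = InetAddress.getByName("statsd-host") // placeholder
  private val port        = 8125
  private lazy val socket = new DatagramSocket()

  // StatsD wire format: "<name>:<value>|c" for counters, "<name>:<value>|g" for gauges
  def send(metric: String): Unit = {
    val bytes = metric.getBytes("UTF-8")
    socket.send(new DatagramPacket(bytes, bytes.length, host, port))
  }
}

object Monitoring {
  // Attach to the DStream from the direct Kafka sketch above
  def instrument(stream: DStream[(String, String)]): Unit =
    stream.foreachRDD { rdd =>
      val start = System.currentTimeMillis()
      val count = rdd.count() // message count for this batch (an action)

      // foreachRDD runs on the driver, so this doubles as the driver "heartbeat":
      // if the counter stops arriving in Graphite, Seyren pages someone.
      Statsd.send("pipeline.driver.heartbeat:1|c")
      Statsd.send(s"pipeline.batch.messages:$count|c")
      Statsd.send(s"pipeline.batch.output_op_ms:${System.currentTimeMillis() - start}|g")
    }
}
```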

Data loss fixed

So we’re done, right?

Cluster & app continuously crashing

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Spark Cluster & App Stability

Spark slave memory utilization

Spark Cluster & App Stability

Slave memory overhead can invoke the Linux OOM killer

Crashes + Kafka Receiver = missing data

Supervised driver: pass "--supervise" to spark-submit (see the config sketch below)

Driver restart logging

Cluster resource over-provisioning

Standby Masters for failover

Auto-cleanup of work directories: spark.worker.cleanup.enabled=true
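For reference, the supervised-driver and work-directory cleanup settings look roughly like this (master URLs, class name, and jar path are placeholders):

```bash
# Cluster deploy mode plus --supervise: the Master restarts the driver if it dies.
# Listing both masters lets the app fail over to the standby Master.
spark-submit \
  --master spark://master1:7077,master2:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.StreamingApp \
  /path/to/app.jar

# On each worker, in conf/spark-env.sh: periodically clean old application
# work directories so the disks don't fill up.
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
```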

We’re done, right?

Finally, yes

Party Time

TL;DR

1. Use Direct Kafka Integration + transactionality
2. Cache reference data for speed
3. Avoid shuffles & driver bottlenecks
4. Supervised driver
5. Cleanup worker temp directory
6. Beware of function closures
7. Cluster resource over-provisioning
8. Spark slave memory headroom
9. Monitoring on driver heartbeat & Kafka lag
10. Standby masters

Spark Streaming + Kafka Best Practices

Brandon O’Brien | @hakczar | Expedia, Inc

Thanks!