Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)



Spark Streaming + Kafka Best Practices

Brandon O’Brien | @hakczar | Expedia, Inc

Or: “A Case Study in Operationalizing Spark Streaming”

Context/Disclaimer

Our use case: build a resilient, scalable data pipeline with streaming reference data lookups, a 24-hour stream self-join, and some aggregation. Values accuracy over speed.

Spark Streaming 1.5-1.6, Kafka 0.9

Standalone Cluster (not YARN or Mesos)

No Hadoop

Message velocity: thousands of messages per second. Batch window: 10s

Data sources: Kafka (primary), Redis (joins + ref data) & S3 (ref data)

Demo: Spark in Action

Game & Scoreboard Architecture

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience


Spark Streaming & Standalone Cluster Overview

RDD: partitioned, replicated collection of data objects

Driver: the JVM that creates the Spark program and negotiates for resources. It handles scheduling of tasks but does not do the heavy lifting, so it can become a bottleneck.

Executor: slave to the driver; executes tasks on RDD partitions. Function closures are serialized and shipped to the executors.

Lazy execution: transformations build the lineage, actions trigger it (see the sketch below)

Cluster types: Standalone, YARN, Mesos
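To make the lazy-execution point concrete, here is a minimal Spark Streaming sketch (the socket source, host, and port are placeholders, not from the talk): the map/filter transformations only extend the DAG, and nothing runs until an output operation such as foreachRDD forces an action each batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LazyExecutionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lazy-execution-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10)) // 10s batch window, as in the use case

    val lines = ssc.socketTextStream("localhost", 9999) // placeholder input DStream

    val parsed = lines
      .map(_.split(","))     // transformation: lazy, only extends the DAG
      .filter(_.length == 2) // transformation: still nothing has executed

    // Output operation: registering it is what makes each batch run, and the
    // count() action does its work on the executors, not the driver.
    parsed.foreachRDD { rdd =>
      println(s"records this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```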

Spark Streaming & Standalone Cluster Overview

[Diagram: standalone cluster topology. Each node runs Master, Worker, Executor, and Driver processes, with a ZooKeeper cluster alongside for Master failover.]

Outline Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Design Patterns for Performance

Delegate all IO/CPU work to the executors

Avoid unnecessary shuffles (join, groupBy, repartition)

Externalize streaming joins & reference data lookups for large/volatile ref data sets. Options: JVM static hashmap, external cache (e.g. Redis), static LRU cache (amortizes lookups), RocksDB

Hygienic function closures (capture only what the executors need; see the sketch below)
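As one way to combine these patterns, here is a sketch (assuming Jedis for the Redis client and Guava for the LRU cache; the hostname, port, and the Record shape are illustrative, not from the talk): lookups run on the executors via mapPartitions, a per-JVM static cache amortizes repeated keys, and the closure stays hygienic by capturing nothing from the driver.

```scala
import com.google.common.cache.{Cache, CacheBuilder}
import org.apache.spark.streaming.dstream.DStream
import redis.clients.jedis.Jedis

case class Record(refKey: String, payload: String, refData: Option[String] = None)

object RefDataCache {
  // One static LRU cache per executor JVM, amortizing repeated ref-data lookups
  lazy val lru: Cache[String, String] =
    CacheBuilder.newBuilder().maximumSize(100000L).build[String, String]()
}

object Enrichment {
  def enrich(stream: DStream[Record]): DStream[Record] =
    stream.mapPartitions { records =>
      // Runs on the executor: the closure captures no driver-side state (hygienic),
      // and the Redis connection is created once per partition, not per record.
      val redis = new Jedis("redis-host", 6379) // placeholder host/port
      val out = records.map { rec =>
        val value = Option(RefDataCache.lru.getIfPresent(rec.refKey)).orElse {
          // Cache miss: fall back to the external cache and remember the answer
          Option(redis.get(rec.refKey)).map { v => RefDataCache.lru.put(rec.refKey, v); v }
        }
        rec.copy(refData = value)
      }.toList // materialize before closing the connection
      redis.close()
      out.iterator
    }
}
```

A JVM static hashmap or a RocksDB store would slot into the same place as the Guava cache here, depending on how large and volatile the reference data set is.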

We’re done, right?

Just need to QA the data…

70% missing data

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Guaranteed Message Processing & Direct Kafka Integration

Guaranteed message processing = at-least-once processing + idempotence

Kafka Receiver:
Consumes messages faster than Spark can process them
Checkpoints before processing is finished
Inefficient CPU utilization

Direct Kafka Integration (sketch below):
Control over checkpointing & transactionality
Better distribution of resource consumption
1:1 Kafka topic-partition to Spark RDD-partition
Use Kafka as the WAL
Statelessness, fail-fast
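Roughly, the direct integration on Spark 1.5/1.6 with Kafka 0.9 looks like the sketch below (broker list, topic name, and the upsert/saveOffsets helpers are placeholders): each batch is processed idempotently, and offsets are persisted only after the batch succeeds, so Kafka itself serves as the write-ahead log.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object DirectKafkaSketch {
  def upsert(value: String): Unit = ()                   // hypothetical idempotent write to the sink
  def saveOffsets(ranges: Array[OffsetRange]): Unit = () // hypothetical offset store (Redis/ZooKeeper)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-kafka-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092,broker2:9092", // placeholder brokers
      "auto.offset.reset"    -> "smallest")                  // where to start if no offsets are stored

    // No receiver: 1:1 Kafka topic-partition to Spark RDD-partition
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))                       // placeholder topic

    stream.foreachRDD { rdd =>
      // Exact offsets for this batch are available because there is no receiver
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { records =>
        // Idempotent writes (e.g. keyed upserts): at-least-once plus idempotence
        // adds up to effectively-once results
        records.foreach { case (_, value) => upsert(value) }
      }

      // Commit offsets only after the batch succeeds; on restart, replay from
      // the stored offsets, using Kafka as the WAL
      saveOffsets(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```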

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Operational Monitoring & Alerting

Driver “heartbeat”
Batch processing time
Message count
Kafka lag (latest offsets vs. last processed)
Driver start events

Tooling: StatsD + Graphite + Seyren, plus Spark’s metrics endpoint at http://localhost:4040/metrics/json/
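One way to emit these signals is plain StatsD over UDP from the driver, sketched below (host, port, and metric names are placeholders; the slides do not show the exact metric layout). The heartbeat counter fires once per batch, so Seyren can alert when it goes quiet; Kafka lag would come from comparing each batch's OffsetRange.untilOffset values against the brokers' latest offsets, and full batch processing time is also exposed on the metrics endpoint above.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}
import org.apache.spark.streaming.dstream.DStream

object Statsd {
  private val host        = InetAddress.getByName("statsd-host") // placeholder
  private val port        = 8125
  private lazy val socket = new DatagramSocket()

  // StatsD wire format: "<name>:<value>|c" for counters, "<name>:<value>|g" for gauges
  def send(metric: String): Unit = {
    val bytes = metric.getBytes("UTF-8")
    socket.send(new DatagramPacket(bytes, bytes.length, host, port))
  }
}

object Monitoring {
  // Attach to the DStream from the direct Kafka sketch above
  def instrument(stream: DStream[(String, String)]): Unit =
    stream.foreachRDD { rdd =>
      val start = System.currentTimeMillis()
      val count = rdd.count() // message count for this batch (an action)

      // foreachRDD runs on the driver, so this doubles as the driver "heartbeat":
      // if the counter stops arriving in Graphite, Seyren pages someone.
      Statsd.send("pipeline.driver.heartbeat:1|c")
      Statsd.send(s"pipeline.batch.messages:$count|c")
      Statsd.send(s"pipeline.batch.output_op_ms:${System.currentTimeMillis() - start}|g")
    }
}
```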

Data loss fixed

So we’re done, right?

Cluster & app continuously crashing

Outline

Spark Streaming & Standalone Cluster Overview

Design Patterns for Performance

Guaranteed Message Processing & Direct Kafka Integration

Operational Monitoring & Alerting

Spark Cluster & App Resilience

Spark Cluster & App Stability

Spark slave memory utilization

Spark Cluster & App Stability

Slave memory overhead can invoke the Linux OOM killer

Crashes + Kafka Receiver = missing data

Supervised driver: pass "--supervise" to spark-submit (see the config sketch below)

Driver restart logging

Cluster resource over-provisioning

Standby Masters for failover

Auto-cleanup of work directories: spark.worker.cleanup.enabled=true
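For reference, the supervised-driver and work-directory cleanup settings look roughly like this (master URLs, class name, and jar path are placeholders):

```bash
# Cluster deploy mode plus --supervise: the Master restarts the driver if it dies.
# Listing both masters lets the app fail over to the standby Master.
spark-submit \
  --master spark://master1:7077,master2:7077 \
  --deploy-mode cluster \
  --supervise \
  --class com.example.StreamingApp \
  /path/to/app.jar

# On each worker, in conf/spark-env.sh: periodically clean old application
# work directories so the disks don't fill up.
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
```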

We’re done, right?

Finally, yes

Party Time

TL;DR

1. Use Direct Kafka Integration + transactionality
2. Cache reference data for speed
3. Avoid shuffles & driver bottlenecks
4. Supervised driver
5. Cleanup worker temp directory
6. Beware of function closures
7. Cluster resource over-provisioning
8. Spark slave memory headroom
9. Monitoring on driver heartbeat & Kafka lag
10. Standby masters

Spark Streaming + Kafka Best Practices

Brandon O’Brien | @hakczar | Expedia, Inc

Thanks!