Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Date post: 05-Apr-2017
Spark Streaming + Kafka Best Practices Brandon O’Brien @hakczar Expedia, Inc
Transcript
Page 1: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Spark Streaming + Kafka Best Practices

Brandon O’Brien, @hakczar, Expedia, Inc

Page 2: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Or: “A Case Study in Operationalizing Spark Streaming”

Page 3: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Context/Disclaimer

- Our use case: Build resilient, scalable data pipeline with streaming ref data lookups, 24hr stream self-join and some aggregation. Values accuracy over speed.
- Spark Streaming 1.5-1.6, Kafka 0.9
- Standalone Cluster (not YARN or Mesos)
- No Hadoop
- Message velocity: k/s. Batch window: 10s
- Data sources: Kafka (primary), Redis (joins + ref data) & S3 (ref data)

Page 4: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Demo: Spark in Action

Page 5: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Game & Scoreboard Architecture

Page 6: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Outline

- Spark Streaming & Standalone Cluster Overview
- Design Patterns for Performance
- Guaranteed Message Processing & Direct Kafka Integration
- Operational Monitoring & Alerting
- Spark Cluster & App Resilience

Page 8: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Spark Streaming & Standalone Cluster Overview

- RDD: Partitioned, replicated collection of data objects
- Driver: JVM that creates Spark program, negotiates for resources. Handles scheduling of tasks but does not do heavy lifting. Bottlenecks.
- Executor: Slave to the driver, executes tasks on RDD partitions. Function serialization.
- Lazy Execution: Transformations & Actions
- Cluster Types: Standalone, YARN, Mesos

Page 9: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Spark Streaming & Standalone Cluster Overview

Standalone Cluster

- Each node: Master, Worker, Executor, Driver
- Zookeeper cluster

Page 11: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Design Patterns for Performance

- Delegate all IO/CPU to the Executors
- Avoid unnecessary shuffles (join, groupBy, repartition)
- Externalize streaming joins & reference data lookups. Large/volatile ref data set.
  - JVM static hashmap
  - External cache (e.g. Redis)
  - Static LRU cache (amortize lookups)
  - RocksDB
- Hygienic function closures
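
The "static LRU cache (amortize lookups)" pattern from this slide can be sketched roughly as follows. This is a plain-Python illustration, not the deck's actual code: `REF_STORE`, `fetch_from_store`, `lookup` and `enrich_partition` are hypothetical names, and the dictionary only stands in for an external cache such as Redis. The idea is that a process-wide cache sits in front of the external store, so repeated keys within an executor cost one round trip instead of many.

```python
from functools import lru_cache

# Hypothetical stand-in for an external reference-data store (e.g. Redis).
REF_STORE = {"ref-a": "alpha", "ref-b": "beta"}
store_calls = 0  # counts simulated round trips to the external store


def fetch_from_store(key):
    """Simulate one network round trip to the external store."""
    global store_calls
    store_calls += 1
    return REF_STORE.get(key)


@lru_cache(maxsize=10_000)
def lookup(key):
    """Process-wide LRU cache in front of the external store."""
    return fetch_from_store(key)


def enrich_partition(records):
    """Enrich every (key, value) record of one partition with ref data."""
    return [(key, value, lookup(key)) for key, value in records]


# Four records, but only two distinct keys reach the external store.
partition = [("ref-a", 10), ("ref-b", 7), ("ref-a", 3), ("ref-a", 5)]
enriched = enrich_partition(partition)
```

One caveat with this sketch: a plain LRU cache never refreshes an entry, so for the volatile reference data the slide mentions, some form of expiry or explicit invalidation would probably be needed on top of it.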

Page 13: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

We’re done, right?

Just need to QA the data…

Page 14: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

70% missing data

Page 16: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Guaranteed Message Processing & Direct Kafka Integration

- Guaranteed Message Processing = At-least-once processing + idempotence
- Kafka Receiver
  - Consumes messages faster than Spark can process
  - Checkpoints before processing finished
  - Inefficient CPU utilization
- Direct Kafka Integration
  - Control over checkpointing & transactionality
  - Better distribution of resource consumption
  - 1:1 Kafka Topic-partition to Spark RDD-partition
  - Use Kafka as WAL
- Statelessness, Fail-fast
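
The equation "at-least-once processing + idempotence" can be illustrated with a small simulation. This is a sketch in plain Python rather than Spark or Kafka code, and every name in it (`IdempotentSink`, `process_batch`, the `events` topic) is hypothetical. It assumes offsets are committed only after the whole batch has been written, and that the sink is keyed by (topic, partition, offset), so replaying a batch after a crash overwrites rows instead of duplicating them.

```python
class IdempotentSink:
    """Sink keyed by (topic, partition, offset): rewriting a row is a no-op."""

    def __init__(self):
        self.rows = {}

    def upsert(self, topic, partition, offset, value):
        self.rows[(topic, partition, offset)] = value


def process_batch(messages, sink, committed, fail_after=None):
    """Write a batch, then commit offsets only after every write succeeded."""
    for i, (topic, partition, offset, value) in enumerate(messages):
        if fail_after is not None and i == fail_after:
            # Fail fast: abort the whole batch instead of skipping a record.
            raise RuntimeError("simulated crash mid-batch")
        sink.upsert(topic, partition, offset, value.upper())
    # The commit comes last, so a crash above leaves the offsets untouched.
    for topic, partition, offset, _ in messages:
        key = (topic, partition)
        committed[key] = max(committed.get(key, -1), offset)


batch = [("events", 0, 0, "a"), ("events", 0, 1, "b"), ("events", 1, 0, "c")]
sink, committed = IdempotentSink(), {}

try:
    process_batch(batch, sink, committed, fail_after=2)  # two writes, then crash
except RuntimeError:
    pass  # nothing was committed, so the same offsets are read again

process_batch(batch, sink, committed)  # replay: rows are overwritten, not duplicated
```

After the replay the sink holds exactly three rows even though two of them were written twice, which is the property the slide relies on. The receiver-based failure mode described above is the reverse ordering: offsets are checkpointed before processing finishes, so a crash loses the in-flight records.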

Page 18: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Operational Monitoring & Alerting

- Driver “Heartbeat”
- Batch processing time
- Message count
- Kafka lag (latest offsets vs last processed)
- Driver start events
- StatsD + Graphite + Seyren
- http://localhost:4040/metrics/json/
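
The "Kafka lag (latest offsets vs last processed)" metric can be computed along these lines. This is a minimal sketch with hypothetical names (`kafka_lag`, `should_alert`) and made-up offsets. It assumes both maps are keyed by (topic, partition), that the two offsets are directly comparable, and that a partition with no processed offset yet is treated as starting from 0.

```python
def kafka_lag(latest_offsets, processed_offsets):
    """Per-partition lag: latest offset minus last processed offset."""
    return {
        tp: latest - processed_offsets.get(tp, 0)  # assumption: unseen partition starts at 0
        for tp, latest in latest_offsets.items()
    }


def should_alert(lag_by_partition, threshold):
    """Alert when the total lag across partitions exceeds the threshold."""
    return sum(lag_by_partition.values()) > threshold


# Illustrative offsets only; in practice these would come from Kafka and the app.
latest = {("events", 0): 1500, ("events", 1): 1200, ("events", 2): 300}
processed = {("events", 0): 1450, ("events", 1): 1200}

lag = kafka_lag(latest, processed)
alert = should_alert(lag, threshold=100)
```

The resulting number is the kind of value that could be pushed to StatsD and alerted on, alongside the driver heartbeat listed on the slide.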

Page 20: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Data loss fixed

So we’re done, right?

Page 21: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Cluster & app continuously crashing

Page 23: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Spark Cluster & App Stability

Spark slave memory utilization

Page 24: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Spark Cluster & App Stability

- Slave memory overhead
  - OOM killer
- Crashes + Kafka Receiver = missing data
- Supervised driver: “--supervise” for spark-submit. Driver restart logging
- Cluster resource overprovisioning
- Standby Masters for failover
- Auto-cleanup of work directories
  - spark.worker.cleanup.enabled=true

Page 26: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

We’re done, right?

Finally, yes

Page 27: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Party Time

Page 28: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

TL;DR

1. Use Direct Kafka Integration + transactionality
2. Cache reference data for speed
3. Avoid shuffles & driver bottlenecks
4. Supervised driver
5. Cleanup worker temp directory
6. Beware of function closures
7. Cluster resource over-provisioning
8. Spark slave memory headroom
9. Monitoring on Driver heartbeat & Kafka lag
10. Standby masters

Page 29: Spark Streaming + Kafka Best Practices (w/ Brandon O'Brien)

Spark Streaming + Kafka Best Practices

Brandon O’Brien, @hakczar, Expedia, Inc

Thanks!