Streaming Data Ecosystems
Spark Streaming + Kafka: Best Practices
Brandon O'Brien | [email protected] | @hakczar | Expedia, Inc
Tell our story, to share learnings.
Or: A Case Study in Operationalizing Spark Streaming
This was our use case; yours may be different.
Context/Disclaimer
Our use case: build a resilient, scalable data pipeline with streaming ref data lookups, a 24-hour stream self-join and some aggregation. Values accuracy over speed.
Spark Streaming 1.5-1.6, Kafka 0.9
Standalone Cluster (not YARN or Mesos)
No Hadoop
Message velocity: thousands of messages/sec. Batch window: 10s
Data sources: Kafka (primary), Redis (joins + ref data) & S3 (ref data)
This is our use case; yours may be different.
Demo: Spark in Action
Live system to reason about.
Game & Scoreboard Architecture
Outline
Spark Streaming & Standalone Cluster Overview
Design Patterns for Performance
Guaranteed Message Processing & Direct Kafka Integration
Operational Monitoring & Alerting
Spark Cluster & App Resilience
Spark Streaming & Standalone Cluster Overview
RDD: partitioned, replicated collection of data objects
Driver: JVM that creates the Spark program and negotiates for resources. Handles scheduling of tasks but does not do the heavy lifting. Can become a bottleneck.
Executor: slave to the driver; executes tasks on RDD partitions. Function serialization.
Lazy execution: transformations & actions
Cluster types: Standalone, YARN, Mesos
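A minimal, self-contained Scala sketch (hypothetical job and values) of the lazy-execution model above: transformations only record lineage on the driver, and nothing runs on the executors until an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyExecutionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-execution-sketch"))

    // RDD: a partitioned collection of data objects
    val numbers = sc.parallelize(1 to 1000, numSlices = 8)

    // Transformations: lazily recorded in the lineage, no work happens yet
    val evens   = numbers.filter(_ % 2 == 0)
    val doubled = evens.map(_ * 2)

    // Action: triggers the actual computation on the executors;
    // the driver only schedules tasks and receives the small result
    val total = doubled.reduce(_ + _)
    println(s"total = $total")

    sc.stop()
  }
}
```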
Spark Streaming & Standalone Cluster Overview: Standalone Cluster
Each node runs a Master, a Worker, an Executor and (potentially) the Driver; a ZooKeeper cluster coordinates Master failover.
Not necessarily the only way to set it up; colocating the daemons saves IP space.
Design Patterns for Performance
Delegate all IO/CPU to the executors
Avoid unnecessary shuffles (join, groupBy, repartition)
Externalize streaming joins & reference data lookups for large/volatile ref data sets (see the sketch after this list):
  JVM static hashmap
  External cache (e.g. Redis)
  Static LRU cache (amortize lookups)
  RocksDB
Hygienic function closures
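A minimal sketch of the "externalize lookups on the executors" pattern, assuming Redis via the Jedis client and a hypothetical (key, value) event stream: the connection and a small LRU cache live as JVM statics per executor, and lookups run inside mapPartitions so no IO lands on the driver and no shuffle is introduced.

```scala
import org.apache.spark.rdd.RDD
import redis.clients.jedis.Jedis          // assumed Redis client (Jedis)

// One instance per executor JVM: static connection + small LRU cache.
// (A production app would use a connection pool and a thread-safe cache.)
object RefDataLookup {
  private lazy val redis = new Jedis("redis-host", 6379)   // assumed host/port

  // LinkedHashMap in access order acts as a simple LRU cache (amortizes lookups)
  private lazy val cache =
    new java.util.LinkedHashMap[String, String](10000, 0.75f, true) {
      override def removeEldestEntry(e: java.util.Map.Entry[String, String]): Boolean =
        size() > 10000
    }

  def lookup(key: String): String = {
    val cached = cache.get(key)
    if (cached != null) cached
    else {
      val value = Option(redis.get(key)).getOrElse("")      // external ref data fetch
      cache.put(key, value)
      value
    }
  }
}

object Enrichment {
  // Enrichment happens entirely on the executors; no shuffle, no driver IO
  def enrichWithRefData(events: RDD[(String, String)]): RDD[(String, String, String)] =
    events.mapPartitions { partition =>
      partition.map { case (key, value) => (key, value, RefDataLookup.lookup(key)) }
    }
}
```

Keeping the cache and client inside an object (rather than capturing them in a closure) also keeps the function closures hygienic: nothing non-serializable is shipped from the driver.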
OK, we built the app on the Spark framework for scalability and made it fast.
We're done, right?
Just need to QA the data
70% missing data
Pause, check on the game player.
Guaranteed Message Processing & Direct Kafka Integration
Guaranteed message processing = at-least-once processing + idempotence
Kafka Receiver:
  Consumes messages faster than Spark can process them
  Checkpoints before processing is finished
  Inefficient CPU utilization
Direct Kafka Integration:
  Control over checkpointing & transactionality
  Better distribution of resource consumption
  1:1 Kafka topic-partition to Spark RDD-partition
  Use Kafka as the WAL
  Statelessness, fail-fast
With the receiver, Spark hides the fact that it can't keep up with the stream. Crash + restart + bad checkpoint = missing messages. There is config to ameliorate this, but it's an artifact of the absence of a WAL/HDFS, and there are multiple data-loss scenarios. Direct Kafka Integration = statelessness.
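A minimal sketch of the direct (receiver-less) Kafka integration on Spark Streaming 1.6 with the spark-streaming-kafka artifact; the broker list, topic name and sink write are assumptions.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectKafkaSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("direct-kafka-sketch"), Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "kafka-1:9092,kafka-2:9092")
    // 1:1 mapping from Kafka topic-partitions to Spark RDD-partitions, no receiver
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("game-events"))

    stream.foreachRDD { rdd =>
      // Offsets consumed in this batch: persist them only after the (idempotent)
      // writes succeed, which gives at-least-once processing with Kafka as the WAL
      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { partition =>
        partition.foreach { case (key, value) =>
          // idempotent write to the sink goes here
        }
      }
      offsets.foreach(o => println(s"${o.topic}-${o.partition}: ${o.fromOffset}..${o.untilOffset}"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```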
Operational Monitoring & Alerting
Driver heartbeat
Batch processing time
Message count
Kafka lag (latest offsets vs last processed)
Driver start events
StatsD + Graphite + Seyren
http://localhost:4040/metrics/json/
Simple: at a glance, batch processing time should stay below the batch interval. A strong checkpointing strategy (direct integration) + fail-fast / idempotent code, plus driver heartbeat + Kafka lag monitoring = confidence.
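A minimal sketch of reporting batch processing time, message count and a driver heartbeat to StatsD from a StreamingListener; the StatsD host, port and metric prefix are assumptions, with Graphite/Seyren picking the metrics up from there.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Bare-bones StatsD gauge emitter over UDP (host/port/prefix are assumptions)
class StatsDClient(host: String = "localhost", port: Int = 8125, prefix: String = "scoreboard") {
  private val socket  = new DatagramSocket()
  private val address = InetAddress.getByName(host)

  def gauge(name: String, value: Long): Unit = {
    val payload = s"$prefix.$name:$value|g".getBytes("UTF-8")
    socket.send(new DatagramPacket(payload, payload.length, address, port))
  }
}

// Reports after every completed batch; a steady stream of these metrics also
// doubles as a driver heartbeat for alerting (e.g. via Seyren)
class BatchMetricsListener(statsd: StatsDClient) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    statsd.gauge("batch.processing_ms", info.processingDelay.getOrElse(0L))
    statsd.gauge("batch.records", info.numRecords)
    statsd.gauge("driver.heartbeat", 1L)
  }
}

// Wire it up once, right after creating the StreamingContext:
//   ssc.addStreamingListener(new BatchMetricsListener(new StatsDClient()))
```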
Data loss fixed
So we're done, right?
After a few days, we notice…
Cluster & app continuously crashing
I thought resiliency was the promise of Spark: Resilient Distributed Datasets.
The app was crashing, but why?
Spark Cluster & App Stability
Spark slave memory utilization
Spark Cluster & App Stability
Slave memory overhead; the OOM killer
Crashes + Kafka Receiver = missing data
Supervised driver: --supervise for spark-submit. Driver restart logging
Cluster resource over-provisioning
Standby Masters for failover
Auto-cleanup of work directories: spark.worker.cleanup.enabled=true
Crashes while using the Kafka Receiver = missing data (no WAL). Is Spark so flaky? Spark was being attacked by the operating system, and doing surprisingly well given the circumstances, especially with the direct Kafka integration and checkpointing. Goal: have enough resiliency, redundancy, idempotence and checkpointing to survive multiple failures without causing problems.
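A short sketch of the resilience settings above, in the same sparkConf.set(...) style the links slide uses; the app name and memory value are illustrative, and the supervised-driver and standby-Master pieces are noted in comments since they are configured outside the application code.

```scala
import org.apache.spark.SparkConf

object ResilienceSettings {
  val conf: SparkConf = new SparkConf()
    .setAppName("scoreboard-stream")                  // hypothetical app name
    // Auto-clean old application work directories (a Worker-side setting,
    // also commonly placed in SPARK_WORKER_OPTS on each worker)
    .set("spark.worker.cleanup.enabled", "true")
    // Leave slave memory headroom: size executor memory well below the
    // machine's RAM so the OS and the OOM killer don't take executors down
    .set("spark.executor.memory", "8g")               // illustrative value

  // Supervised driver: the standalone Master restarts the driver if it dies.
  //   spark-submit --deploy-mode cluster --supervise ...
  // Standby Masters: ZooKeeper-based failover, configured on the Master daemons
  // (spark.deploy.recoveryMode=ZOOKEEPER, spark.deploy.zookeeper.url=...).
}
```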
We're done, right?
Finally, yes!
Party Time
TL;DR
Use Direct Kafka Integration + transactionality
Cache reference data for speed
Avoid shuffles & driver bottlenecks
Supervised driver
Clean up the worker temp directory
Beware of function closures
Cluster resource over-provisioning
Spark slave memory headroom
Monitoring on driver heartbeat & Kafka lag
Standby masters
Spark Streaming + Kafka: Best Practices
Brandon O'Brien | [email protected] | Expedia, Inc
Thanks!
Links
Operationalizing Spark Streaming: https://techblog.expedia.com/2016/12/29/operationalizing-spark-streaming-part-1/
Direct Kafka Integration: https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
App metrics: http://localhost:4040/metrics/json/
MetricsSystem: http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
sparkConf.set("spark.worker.cleanup.enabled", "true")