Building Production Spark Streaming Applications
Joey Echeverria, Platform Technical Lead - @fwiffo
Data Day Texas 2017
Joey
• Where I work: Rocana – Platform Technical Lead
• Where I used to work: Cloudera (’11-’15), NSA
• Distributed systems, security, data processing, big data
Context
• We built a system for large-scale, real-time collection, processing, and analysis of event-oriented machine data
• On prem or in the cloud, but not SaaS
• Supportability is a big deal for us
  • Predictability of performance under load and failures
  • Ease of configuration and operation
  • Behavior in wacky environments
Apache Spark Streaming
Spark streaming overview
• Stream processing API built on top of the Spark execution engine
• Micro-batching
  • Every n milliseconds, fetch records from the data source
  • Execute Spark jobs on each input batch
• DStream API
  • Wrapper around the RDD API
  • Lets the developer think in terms of transformations on a stream of events (see the sketch below)
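To make the DStream model concrete, here is a minimal sketch in the Java API, assuming a hypothetical socket source on localhost:9999 (a receiver or Kafka stream would work the same way); the 15-second batch duration matches the example used later in this deck.

  import org.apache.spark.SparkConf;
  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.JavaDStream;
  import org.apache.spark.streaming.api.java.JavaStreamingContext;

  public class DStreamSketch {
    public static void main(String[] args) {
      SparkConf conf = new SparkConf().setAppName("dstream-sketch");
      // Every 15 seconds, the records received so far become one input batch (an RDD)
      JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(15));

      // Hypothetical source: lines of text from a socket
      JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

      // Transformations look like RDD transformations, applied once per micro-batch
      JavaDStream<String> errors = lines.filter(line -> line.contains("ERROR"));

      // Output operation: triggers one Spark job per micro-batch
      errors.print();

      jssc.start();
      jssc.awaitTermination();
    }
  }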
[Diagram: Input Batch → Spark Batch Engine → Output Batch]
Structured streaming
• New streaming API for Spark
• Re-uses the DataFrames API for streaming
• The API was too new when we started
  • First release was an alpha
  • No Kafka support at the time
• Details won't apply, but the overall approach should be in the ballpark
Other notes
• Our experience is with Spark 1.6.2
• 2.0.0 was released after we started our Spark integration
• We use the Apache release of Spark
  • Supports both CDH and HDP without recompiling
  • We run Spark on YARN, so we're decoupled from other users on the cluster
Use Case
Real-time alerting on IT operational data
Our typical customer use cases
• >100K events/sec (8.6B events/day), sub-second end-to-end latency, full-fidelity retention, critical use cases
• Quality of service - "are credit card transactions happening fast enough?"
• Fraud detection - "detect, investigate, prosecute, and learn from fraud."
• Forensic diagnostics - "what really caused the outage last Friday?"
• Security - "who's doing what, where, when, why, and how, and is that ok?"
• User behavior - "capture and correlate user behavior with system performance, then feed it to downstream systems in real time."
Overall architecture
[Diagram: weirdo formats → transformation 1 (weirdo format → event) → Avro events → transformation 2 (event → storage-specific) → storage-specific representation of events]
Real-time alerting
• Define aggregations, conditions, and actions
• Use cases:
  • Send me an e-mail when the number of failed login events from a user is > 3 within an hour
  • Create a ServiceNow ticket when CPU utilization spikes to > 95% for 10 minutes
UI
Architecture
Packaging, Deployment, and Execution
Packaging
• Application classes and dependencies
• Two options
  • Shade all dependencies into an uber jar
    • Make sure Hadoop and Spark dependencies are marked provided
  • Submit application jars and dependent jars when submitting
Deployment modes
• Standalone
  • Manually start up head and worker services
  • Resource control depends on options selected when launching daemons
  • Difficult to mix versions
• Apache Mesos
  • Coarse-grained run mode, launch executors as Mesos tasks
  • Can use dynamic allocation to launch executors on demand
• Apache Hadoop YARN
  • Best choice if your cluster is already running YARN
Spark on YARN
• Client mode versus cluster mode
  • Client mode == Spark driver runs on the local server
  • Cluster mode == Spark driver runs in the YARN AM
• Spark executors run in YARN containers (one JVM per executor)
  • spark.executor.instances
• Each executor core uses one YARN vCore
  • spark.executor.cores (see the sketch below)
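As an illustrative sketch (the numbers are placeholders, not recommendations), both sizing knobs can be set programmatically on a SparkConf; 50 instances * 8 cores gives the 400 task slots used in the sizing example later in this deck.

  import org.apache.spark.SparkConf;

  public class YarnSizingSketch {
    public static SparkConf sizedConf() {
      return new SparkConf()
          .setAppName("alerting")                 // hypothetical app name
          .set("spark.executor.instances", "50")  // number of executor JVMs (YARN containers)
          .set("spark.executor.cores", "8");      // YARN vCores (task slots) per executor
    }
  }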
Job submission
• Most documentation covers spark-submit
  • OK for testing, but not great for production
• We use the Spark submitter APIs (see the sketch below)
  • Built an easier-to-use wrapper API
  • Hides some of the details of configuration
• Some configuration parameters aren't respected when using the submitter API
  • spark.executor.cores, spark.executor.memory
  • spark.driver.cores, spark.driver.memory
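A minimal sketch of programmatic submission using org.apache.spark.launcher.SparkLauncher (the jar path and main class are hypothetical); per the caveat above, some executor and driver sizing parameters may need to be supplied through other means.

  import org.apache.spark.launcher.SparkLauncher;

  public class SubmitSketch {
    public static void main(String[] args) throws Exception {
      Process spark = new SparkLauncher()
          .setMaster("yarn-cluster")                    // driver runs in the YARN AM
          .setAppResource("/path/to/alerting-app.jar")  // hypothetical application jar
          .setMainClass("com.example.alerting.Main")    // hypothetical main class
          .setConf("spark.executor.instances", "50")
          .launch();
      spark.waitFor(); // in production, monitor and restart rather than just waiting
    }
  }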
Job monitoring
• Streaming applications are always on
• Need to monitor the job for failures
  • Restart the job on recoverable failures
  • Notify an admin on fatal failures (e.g. misconfiguration)
• Validate as much up front as possible
  • Our application runs rules through a type checker and query planner before saving
Instrumentation, Metrics, and Monitoring
Instrumentation
You can't fix what you don't measure
Instrumentation APIs
• Spark supports Dropwizard (née CodaHale) metrics
  • Collect both application and framework metrics
  • Supports most popular metric types
    • Counters
    • Gauges
    • Histograms
    • Timers
    • etc.
• Use your own APIs
  • Best option if you have existing metric collection infrastructure
Custom metrics
• Implement the org.apache.spark.metrics.source.Source interface
• Register your source with sparkEnv.metricsSystem().registerSource()
• If you're measuring something during execution, you need to register the metric on the executors
  • Register executor metrics in a static block
  • You can't register a metrics source until the SparkEnv has been initialized

  SparkEnv sparkEnv = SparkEnv.get();
  if (sparkEnv != null) {
    // create and register source
  }
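For reference, a minimal Source implementation might look like this sketch (the class and metric names are hypothetical):

  import com.codahale.metrics.Counter;
  import com.codahale.metrics.MetricRegistry;
  import org.apache.spark.metrics.source.Source;

  // Hypothetical application metrics source
  public class AlertingSource implements Source {
    private final MetricRegistry registry = new MetricRegistry();
    public final Counter eventsProcessed = registry.counter("events.processed");

    @Override
    public String sourceName() { return "alerting"; }

    @Override
    public MetricRegistry metricRegistry() { return registry; }
  }

With that in place, the body of the null check above becomes sparkEnv.metricsSystem().registerSource(new AlertingSource());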
Metrics collection
• Configure $SPARK_HOME/conf/metrics.properties (example below)
• Built-in sinks
  • ConsoleSink
  • CsvSink
  • JmxSink
  • MetricsServlet
  • GraphiteSink
  • Slf4jSink
  • GangliaSink
• Or build your own
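As an illustrative metrics.properties fragment (host and port are placeholders), this reports all driver and executor metrics to Graphite every 10 seconds:

  *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
  *.sink.graphite.host=graphite.example.com
  *.sink.graphite.port=2003
  *.sink.graphite.period=10
  *.sink.graphite.unit=seconds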
Build your own
• Implement the org.apache.spark.metrics.sink.Sink interface (see the sketch below)
• We built a KafkaEventSink that sends the metrics to a Kafka topic formatted as Osso* events
• Our system has a metrics collector
  • Aggregates metrics in a Parquet table
  • Query and visualize metrics using SQL
• *http://www.osso-project.org
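Not Rocana's KafkaEventSink, but a minimal sketch of the shape a custom sink takes; Spark instantiates sinks reflectively through a (Properties, MetricRegistry, SecurityManager) constructor, and this hypothetical sink just dumps metrics to stdout.

  import java.util.Properties;
  import java.util.concurrent.TimeUnit;
  import com.codahale.metrics.ConsoleReporter;
  import com.codahale.metrics.MetricRegistry;
  import org.apache.spark.SecurityManager;
  import org.apache.spark.metrics.sink.Sink;

  // Hypothetical sink that periodically dumps metrics to stdout
  public class ConsoleDumpSink implements Sink {
    private final ConsoleReporter reporter;

    public ConsoleDumpSink(Properties props, MetricRegistry registry, SecurityManager mgr) {
      reporter = ConsoleReporter.forRegistry(registry).build();
    }

    @Override public void start() { reporter.start(10, TimeUnit.SECONDS); }
    @Override public void stop() { reporter.stop(); }
    @Override public void report() { reporter.report(); }
  }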
Report and visualize
Gotcha
• Due to the order of metrics subsystem initialization, your collection plugin must be on the system classpath, not the application classpath
  • https://issues.apache.org/jira/browse/SPARK-18115
• Options:
  • Deploy the library on cluster nodes (e.g. add to HADOOP_CLASSPATH)
  • Build a custom Spark assembly jar
Custom Spark assembly
• Maven shade plugin
  • Merge the upstream Spark assembly JAR with your library and dependencies
  • Shade/rename library packages
    • Might break configuration parameters as well
    • *.sink.kafka.com_rocana_assembly_shaded_kafka_brokers
  • Mark any dependencies already in the assembly as provided
• Ask me about our akka.version fiasco
Configuration and Tuning
Architecture
Predicting CPU/task resources
• Each output operation creates a separate batch job when processing a micro-batch
  • number of jobs = number of output ops
• Each data shuffle/re-partitioning creates a separate stage
  • number of stages per job = number of shuffles + 1
• Each partition in a stage creates a separate task
  • number of tasks per job = number of stages * number of partitions
Resources for alerting
• Each rule has a single output operation (write to Kafka)
• Each rule has 3 stages
  1. Read from Kafka; project, filter, and group data for aggregation
  2. Aggregate values, filter (conditions), and group data for triggers
  3. Aggregate trigger results and send trigger events to Kafka
• First-stage partitions = number of Kafka partitions
• Stages 2 and 3 use spark.default.parallelism partitions
Example
• 100 rules, Kafka partitions = 50, spark.default.parallelism = 50
• number of jobs = 100
• number of stages per job = 3
• number of tasks per job = 3 * 50 = 150
• total number of tasks = 100 * 150 = 15,000
Task slots
• number of task slots = spark.executor.instances * spark.executor.cores
• Example
  • 50 instances * 8 cores = 400 task slots
Waves
• The jobs processing the micro-batches will run in waves based on available task slots
• Number of waves = total number of tasks / number of task slots
• Example
  • Number of waves = 15,000 / 400 = 38 waves
Max time per wave
• maximum time per wave = micro-batch duration / number of waves
• Example:
  • 15-second micro-batch duration
  • maximum time per wave = 15,000 ms / 38 waves = 394 ms per wave
• If the average task time > 394 ms, then Spark streaming will fall behind
Monitoring batch processing time
Delay scheduling
• A technique of delaying task scheduling to get better data locality
• Works great for long-running batch tasks
  • Not ideal for low-latency stream processing tasks
• Tip
  • Set spark.locality.wait = 0ms (sketched below)
• Results
  • Running a job with 800 tasks on a very small (2 task slot) cluster with a 300-event micro-batch
  • With the default setting: 402 seconds
  • With the 0ms setting: 26 seconds (15.5x faster)
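For reference, a sketch of that setting applied programmatically (it can equally be set in spark-defaults.conf or at submit time):

  import org.apache.spark.SparkConf;

  public class LocalitySketch {
    public static SparkConf lowLatencyConf() {
      // Don't hold tasks waiting for a data-local slot; schedule immediately
      return new SparkConf().set("spark.locality.wait", "0ms");
    }
  }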
Model memory requirements
• Persistent memory used by stateful operators
  • reduceByWindow, reduceByKeyAndWindow (see the sketch after this list)
  • countByWindow, countByValueAndWindow
  • mapWithState, updateStateByKey
• Model retention time
  • Built-in time-based retention (e.g. reduceByWindow)
  • Explicit state management (e.g. org.apache.spark.streaming.State#remove())
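As a sketch of the windowed aggregation used in the example that follows (the stream and key types are hypothetical):

  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.JavaPairDStream;

  public class WindowSketch {
    // Sum integer values per key over a 30-second window sliding every 10 seconds
    static JavaPairDStream<String, Integer> windowedSums(
        JavaPairDStream<String, Integer> counts) {
      return counts.reduceByKeyAndWindow(
          (a, b) -> a + b,         // associative reduce function
          Durations.seconds(30),   // window length
          Durations.seconds(10));  // slide interval
    }
  }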
Example
• Use reduceByKeyAndWindow to sum integers with a 30-second window and 10-second slide over 10,000 keys
• active windows = window length / window slide
  • 30s / 10s = 3
• estimated memory = active windows * num keys * (state size + key size)
  • 3 * 10,000 * (16 bytes + 80 bytes) = 2.75 MB
Monitor Memory
Putting it all together
• Pick your packaging and deployment model based on operational needs, not developer convenience
• Use the Spark submitter APIs whenever possible
• Measure and report operational metrics
• Focus configuration and tuning on the expected behavior of your application
  • Model, configure, monitor