Spark Streaming The State of the Union and the Road Beyond
Tathagata “TD” Das @tathadas
March 18, 2015
Who am I?
Project Management Committee (PMC) member of Spark
Lead developer of Spark Streaming
Formerly in AMPLab, UC Berkeley
Software developer at Databricks
What is Spark Streaming?
Spark Streaming
Scalable, fault-tolerant stream processing system
Input sources: Kafka, Flume, Kinesis, HDFS/S3
Output sinks: file systems, databases, dashboards
High-level API
joins, windows, and more, often with 5x less code
Fault-tolerant
Exactly-once semantics, even for stateful ops
Integration
Integrate with MLlib, SQL, DataFrames, GraphX
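To make "windows" concrete, here is a plain-Python sketch of sliding-window word counting over micro-batches (no Spark; the `windowed_counts` helper and the sample batches are invented for illustration, roughly mirroring what the real `reduceByKeyAndWindow` operation computes):

```python
from collections import Counter, deque

def windowed_counts(batches, window_len):
    """Word counts over a sliding window of the last `window_len` batches."""
    window = deque(maxlen=window_len)   # old batches fall out automatically
    for batch in batches:
        # Count words in the newly arrived batch of lines
        window.append(Counter(w for line in batch for w in line.split()))
        # Merge the per-batch counts currently inside the window
        total = Counter()
        for counts in window:
            total += counts
        yield dict(total)

batches = [["a b"], ["b c"], ["c c"]]
results = list(windowed_counts(batches, window_len=2))
# After batch 3, the window covers batches 2 and 3 only
```

Spark expresses the same idea declaratively and distributes the per-key merging across the cluster.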
How does it work?
Receivers receive data streams and chop them up into batches
Spark processes the batches and pushes out the results
[Diagram: data streams → receivers → batches → results]
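The receive-chop-process loop can be sketched in plain Python (no Spark; `receiver` and `process` are made-up stand-ins, and real Spark Streaming batches by time interval rather than record count):

```python
import itertools

def receiver(stream, batch_size):
    """Chop an incoming stream of records into fixed-size batches."""
    records = iter(stream)
    while True:
        batch = list(itertools.islice(records, batch_size))
        if not batch:
            return
        yield batch

def process(batch):
    """Stand-in for a Spark job run on each batch (here: just a sum)."""
    return sum(batch)

results = [process(b) for b in receiver(range(7), batch_size=3)]
# Batches are [0,1,2], [3,4,5], [6]
```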
Streaming Word Count with Kafka
// Create DStream with lines from Kafka
val kafka = KafkaUtils.createStream(ssc, kafkaParams, …)
// Split lines into words
val words = kafka.map(_._2).flatMap(_.split(" "))
// Count the words
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
// Print some counts on screen
wordCounts.print()
// Start processing the stream
ssc.start()
Languages
Natively supports Scala, Java, and Python
Can use any other language by using RDD.pipe()
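The contract behind `RDD.pipe()` is simply lines on stdin and stdout of an external command. A plain-Python sketch of that contract (the `pipe_partition` helper is hypothetical, and a Unix `tr` command stands in for the "other language" program; assumes a Unix-like shell):

```python
import subprocess

def pipe_partition(records, command):
    """Feed records line-by-line to an external command's stdin and
    return its stdout lines, mimicking how RDD.pipe() hands a partition
    to a program written in any language."""
    text_in = "\n".join(records) + "\n"
    proc = subprocess.run(command, input=text_in, capture_output=True,
                          text=True, shell=True, check=True)
    return proc.stdout.splitlines()

# A shell `tr` command plays the role of the "other language" worker
upper = pipe_partition(["spark", "streaming"], "tr a-z A-Z")
```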
Integrates with Spark Ecosystem
Spark Core
Spark Streaming
Spark SQL MLlib GraphX
Combine batch and streaming processing
Join data streams with static data sets:
// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")
// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset).filter(...)
}
Combine machine learning with streaming
Learn models offline, apply them online:
// Learn model offline
val model = KMeans.train(dataset, ...)
// Apply model online on stream
kafkaStream.map { event =>
  model.predict(event.feature)
}
Combine SQL with streaming
Interactively query streaming data with SQL:
// Register each batch in stream as a table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.registerTempTable("latestEvents")
}
// Interactively query the table
sqlContext.sql("select * from latestEvents")
A Brief History
Late 2011 – research idea, AMPLab, UC Berkeley
"We need to make Spark faster." "Okay... umm, how??!?"
Q2 2012 – prototype: rewrote large parts of Spark core; smallest job went from 900 ms to <50 ms
Q3 2012 – Spark core improvements open sourced in Spark 0.6
Feb 2013 – alpha release: 7.7k lines, merged in 7 days; released with Spark 0.7
Jan 2014 – stable release: graduated with Spark 0.9
Current state of Spark Streaming
Adoption
Roadmap
Development
What have we added in the last year?
Python API
Core functionality in Spark 1.2, with sockets and files as sources
Kafka support in Spark 1.3
Other sources coming in future
kafka = KafkaUtils.createStream(ssc, params, …)
lines = kafka.map(lambda x: x[1])
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()
Streaming MLlib algorithms
val model = new StreamingKMeans()
  .setK(10)
  .setDecayFactor(1.0)
  .setRandomCenters(4, 0.0)
// Apply model to DStreams
model.trainOn(trainingDStream)
model.predictOnValues(
  testDStream.map { lp => (lp.label, lp.features) }
).print()

Continuous learning and prediction on streaming data:
StreamingLinearRegression in Spark 1.1
StreamingKMeans in Spark 1.2
StreamingLogisticRegression in Spark 1.3
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
Kafka `Direct` Stream API
Earlier receiver-based approach: a receiver uses the Kafka high-level consumer API to read topics, and requires replicated journals (write ahead logs) to ensure zero data loss under driver failures.
New direct approach in Spark 1.3 – treat Kafka like a file system, using the simple consumer API:
No receivers! Directly query Kafka for the latest topic offsets and read data like reading files. Spark Streaming keeps track of Kafka offsets itself instead of relying on ZooKeeper. More efficient, fault-tolerant, exactly-once receiving of Kafka data.
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
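A plain-Python sketch of the idea behind the direct approach (all names here are invented; the real Scala/Python API is `KafkaUtils.createDirectStream`). Each batch is defined by explicit offset ranges, so a failed batch can be re-read deterministically:

```python
def plan_batch(consumed, latest):
    """A batch in the direct approach is just a set of offset ranges:
    (from_offset, until_offset) per topic-partition, computed by asking
    Kafka for its latest offsets. Ranges are deterministic, so a failed
    batch can be re-read exactly as before."""
    return {tp: (consumed.get(tp, 0), latest[tp]) for tp in latest}

def read_range(log, start, end):
    """Reading an offset range is like reading a slice of a file."""
    return log[start:end]

log = ["e0", "e1", "e2", "e3", "e4"]         # one topic-partition's log
ranges = plan_batch({"t-0": 2}, {"t-0": 5})  # consumed up to offset 2
batch = read_range(log, *ranges["t-0"])
```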
Other Library Additions
Amazon Kinesis integration [Spark 1.1]
More fault-tolerant Flume integration [Spark 1.1]
System Infrastructure
Automated driver fault-tolerance [Spark 1.0]
Graceful shutdown [Spark 1.0]
Write Ahead Logs for zero data loss [Spark 1.2]
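As a rough sketch of how these pieces are switched on in a Spark 1.2+ deployment (the class and jar names are placeholders):

```
# spark-defaults.conf: turn on the write ahead log for receivers (Spark 1.2+)
spark.streaming.receiver.writeAheadLog.enable  true

# Submit in cluster mode with --supervise so a failed driver is restarted
spark-submit --deploy-mode cluster --supervise --class MyApp my-streaming-app.jar
```

The write ahead log also needs a checkpoint directory set via `ssc.checkpoint(...)` so logged blocks can be recovered after a failure.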
Contributors to Streaming
[Bar chart: contributors to Streaming per release (Spark 0.9, 1.0, 1.1, 1.2), y-axis 0–40]
Contributors - Full Picture
[Bar chart: contributors per release (Spark 0.9–1.2), y-axis 0–120, comparing Streaming alone vs Core + Streaming (w/o SQL, MLlib, …)]
All contributions to core Spark directly improve Spark Streaming
Spark Packages
More contributions from the community in spark-packages
Alternate Kafka receiver
Apache Camel receiver
Cassandra examples
http://spark-packages.org/
Who is using Spark Streaming?
Spark Summit 2014 Survey
40% of Spark users were using Spark Streaming in production or prototyping; another 39% were evaluating it.
[Pie chart: Production 9%, Prototyping 31%, Evaluating 39%, Not using 21%]
80+ known deployments
Intel China builds big data solutions for large enterprises, with multiple streaming applications for top businesses:
Real-time risk analysis for a top online payment company
Real-time deal and flow metric reporting for a top online shopping company
Complicated stream processing: SQL queries on streams, joining streams with large historical datasets
> 1 TB/day passing through Spark Streaming
[Stack: Kafka and RocketMQ → Spark Streaming on YARN → HBase]
One of the largest publishing and education companies, accelerating its push into digital learning
Needed to combine student activities and domain events to continuously update each student's learning model
Earlier implementation in Storm, now moved to Spark Streaming
Chose Spark Streaming because Spark combines batch, streaming, machine learning, and graph processing
[Stack: Kafka → Spark Streaming on Spark Standalone → Cassandra, Apache Blur]
More information: http://dbricks.co/1BnFZZ8
Leading advertising automation company with an exchange platform for in-feed ads
Processes clickstream data to optimize real-time bidding for ads
[Stack: Kinesis, RabbitMQ, SQS → Spark Streaming on Mesos + Marathon → MySQL, Redis]
Wants to learn trending movies and shows in real time
Currently replacing one of their internal stream processing architectures with Spark Streaming
Tested the resiliency of Spark Streaming with Chaos Monkey: driver failures are handled by the Spark Standalone cluster's supervise mode; worker, executor, and receiver failures are handled automatically
Spark Streaming can handle all kinds of failures
More information: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
Neuroscience @ Freeman Lab, Janelia Farm
Spark Streaming and MLlib to analyze neural activities
Laser microscope scans zebrafish brain → Spark Streaming → interactive visualization → laser ZAP to kill neurons!
Streaming machine learning algorithms on time series data of every neuron
Up to 2 TB/hour, increasing with brain size; up to 80 HPC nodes
http://www.jeremyfreeman.net/share/talks/spark-summit-2014/
Why are they adopting Spark Streaming?
Easy, high-level API
Unified API across batch and streaming
Integration with Spark SQL and MLlib
Ease of operations
What’s coming next?
Libraries
Operational Ease
Performance
Roadmap
Libraries: streaming machine learning algorithms
A/B testing
Online Latent Dirichlet Allocation (LDA)
More streaming linear algorithms
Streaming + DataFrames, Streaming + SQL
Roadmap
Operational Ease:
Better flow control
Elastic scaling
Cross-version upgradability
Improved support for non-Hadoop environments
Roadmap
Performance:
Higher throughput, especially for stateful operations
Lower latencies
Easy deployment of streaming apps in Databricks Cloud!
You can help!
Roadmaps are heavily driven by community feedback; we have listened to community demands over the last year:
Write Ahead Logs for zero data loss
New Kafka direct API
Let us know what you want to see in Spark Streaming: post on the Spark user mailing list, or tweet it to me @tathadas
Industry adoption increasing rapidly
Community contributing very actively
More libraries, operational ease, and performance in the roadmap
@tathadas
Backup slides
Typesafe survey of Spark users
2,136 developers, data scientists, and other tech professionals
65% of Spark users are interested in Spark Streaming
2/3 of Spark users want to process event streams
http://java.dzone.com/articles/apache-spark-survey-typesafe-0
More usecases
• Big data solution provider for enterprises
• Multiple streaming applications for different businesses
  - Monitoring and optimizing online services of a Tier-1 bank
  - Fraudulent transaction detection for a Tier-2 bank
• Kafka → Spark Streaming → Cassandra, MongoDB
• Built their own Stratio Streaming platform on Spark Streaming, Kafka, Cassandra, MongoDB
• Provides data analytics solutions for Communication Service Providers
  - 4 of 5 top mobile operators, 3 of 4 top internet backbone providers
  - Processes >50% of all US mobile traffic
• Multiple applications for different businesses
  - Real-time anomaly detection in cell tower traffic
  - Real-time call quality optimizations
• Kafka → Spark Streaming
http://spark-summit.org/2014/talk/building-big-data-operational-intelligence-platform-with-apache-spark
• Runs claims processing applications for healthcare providers
• Predictive models can look for claims that are likely to be held up for approval
• Spark Streaming allows model scoring in seconds instead of hours
http://searchbusinessanalytics.techtarget.com/feature/Spark-Streaming-project-looks-to-shed-new-light-on-medical-claims