Spark Streaming Real-time big-data processing Tathagata Das (TD) UC BERKELEY
Page 1: Spark Streaming Real-time big-data processing Tathagata Das (TD) UC BERKELEY.

Spark Streaming: Real-time big-data processing

Tathagata Das (TD)

UC BERKELEY

Page 2:

What is Spark Streaming?

Extends Spark for doing big data stream processing

Project started in early 2012, alpha released in Spring 2013 with Spark 0.7

Moving out of alpha in Spark 0.9

[Stack diagram: Spark core, with Shark, Spark Streaming, GraphX, MLlib, BlinkDB, … on top]

Page 3:

Why Spark Streaming?

Many big-data applications need to process large data streams in realtime
- Website monitoring
- Fraud detection
- Ad monetization

Page 4:

Why Spark Streaming?

Need a framework for big data stream processing that
- Scales to hundreds of nodes
- Achieves second-scale latencies
- Efficiently recovers from failures
- Integrates with batch and interactive processing

Page 5:

Integration with Batch Processing

Many environments require processing the same data in live streaming as well as in batch post-processing

Existing frameworks cannot do both
- Either stream processing of 100s of MB/s with low latency
- Or batch processing of TBs of data with high latency

Extremely painful to maintain two different stacks
- Different programming models
- Double implementation effort

Page 6:

Stateful Stream Processing

Traditional model
- Processing pipeline of nodes
- Each node maintains mutable state
- Each input record updates the state and new records are sent out

Mutable state is lost if a node fails

Making stateful stream processing fault tolerant is challenging!

[Diagram: input records flowing through node 1, node 2, node 3, each node holding mutable state]

Page 7:

Existing Streaming Systems

Storm
- Replays a record if it is not processed by a node
- Processes each record at least once
- May update mutable state twice!
- Mutable state can be lost due to failure!

Trident – uses transactions to update state
- Processes each record exactly once
- Per-state transaction to an external database is slow

Page 8:

Spark Streaming

Page 9:

Spark Streaming: run a streaming computation as a series of very small, deterministic batch jobs

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Chop up the live stream into batches of X seconds

Spark treats each batch of data as an RDD and processes it using RDD operations

Finally, the processed results of the RDD operations are returned in batches
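The chop-into-batches idea can be sketched in plain Scala with no Spark at all — a toy simulation where the arrival timestamps, the 2-second batch interval, and the per-batch "job" (a simple count) are all illustrative assumptions, not anything from the talk:

```scala
// Toy illustration of the micro-batch model: chop an incoming stream into
// fixed-size "batches of X seconds" and run the same batch job on each.
// (Arrival times are simulated; real Spark Streaming batches by wall clock.)
val batchIntervalSec = 2
val events: Seq[(Int, String)] = Seq( // (arrivalSec, record)
  (0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e")
)

// Group records into consecutive batches by arrival time...
val batches: Seq[Seq[String]] =
  events.groupBy(_._1 / batchIntervalSec).toSeq.sortBy(_._1).map(_._2.map(_._2))

// ...then "process" each batch with an ordinary batch operation (a count).
val processed: Seq[Int] = batches.map(_.size)
```

The point of the sketch is that each batch is an ordinary collection processed by ordinary code — which is exactly why the same engine can serve both batch and streaming.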

Page 10:

Spark Streaming: run a streaming computation as a series of very small, deterministic batch jobs

Batch sizes as low as ½ second, latency of about 1 second

Potential for combining batch processing and streaming processing in the same system

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

Page 11:

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()

DStream: a sequence of RDDs representing a stream of data

[Diagram: Twitter Streaming API feeding the tweets DStream, one RDD per batch (batch @ t, batch @ t+1, batch @ t+2), each stored in memory as an immutable, distributed RDD]

Page 12:

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify data in one DStream to create another DStream

[Diagram: flatMap applied to each batch RDD of the tweets DStream creates the new RDDs of the hashTags DStream ([#cat, #dog, …]) for every batch]

Page 13:

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: push data to external storage

[Diagram: flatMap on each batch of the tweets DStream, then every batch of the hashTags DStream saved to HDFS]
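For intuition, the flatMap step can be tried in plain Scala on ordinary collections. The talk never shows `getTags`; the regex version below is a hypothetical stand-in that pulls `#`-prefixed tokens out of a status string:

```scala
// Hypothetical getTags: extract #hashtags from a tweet's text via regex.
def getTags(status: String): Seq[String] =
  "#\\w+".r.findAllIn(status).toSeq

// One micro-batch of tweets, flatMapped the same way the DStream example
// flatMaps each batch RDD: one input record can yield zero or more tags.
val tweets   = Seq("loving #spark and #scala", "no tags here", "go #spark")
val hashTags = tweets.flatMap(getTags)
```

Because flatMap on a DStream is just flatMap applied to each batch RDD, the collection version behaves the same way one batch would.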

Page 14:

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreach(hashTagRDD => { ... })

foreach: do whatever you want with the processed data – write to a database, update an analytics UI, do whatever you want

[Diagram: foreach applied to every batch of the hashTags DStream]

Page 15:

Demo

Page 16:

Java Example

Scala:

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java:

JavaDStream<Status> tweets = ssc.twitterStream()
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { })
hashTags.saveAsHadoopFiles("hdfs://...")

(Java passes a Function object where Scala uses a closure)

Page 17: Spark Streaming Real-time big-data processing Tathagata Das (TD) UC BERKELEY.

DStream of data

Window-based Transformationsval tweets = ssc.twitterStream()

val hashTags = tweets.flatMap(status => getTags(status))

val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

sliding window operation window length sliding interval

window length

sliding interval
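A plain-Scala sketch of what the windowed count computes. For illustration it assumes a window of 3 batches sliding by 1 batch (rather than the real Minutes(1) window over Seconds(5) batches) and made-up hashtag batches:

```scala
// Each element is one micro-batch of hashtags (illustrative data).
val batches: Seq[Seq[String]] = Seq(
  Seq("#cat", "#dog"),
  Seq("#cat"),
  Seq("#dog", "#dog"),
  Seq("#cat")
)
val windowLen = 3 // batches per window (stands in for Minutes(1)/Seconds(5))

// Each sliding window unions the last `windowLen` batches, then
// countByValue tallies how often each tag appears in the window.
val windowedCounts: Seq[Map[String, Int]] =
  batches.sliding(windowLen, 1).map { w =>
    w.flatten.groupBy(identity).view.mapValues(_.size).toMap
  }.toSeq
```

This mirrors how `window(...).countByValue()` produces one counts-RDD per sliding interval, each covering the last window-length of data.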

Page 18:

Arbitrary Stateful Computations

Specify a function to generate new state based on previous state and new data
- Example: maintain per-user mood as state, and update it with their tweets

def updateMood(newTweets, lastMood) => newMood

moods = tweetsByUser.updateStateByKey(updateMood _)
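The updateStateByKey semantics can be imitated with an ordinary fold over micro-batches. Everything below is a hypothetical stand-in: the "mood" is just a running score (+1 per smiley tweet, -1 otherwise), and the sample batches are made up:

```scala
// Hypothetical mood update: previous state (if any) plus a per-tweet delta.
// Returning None would drop the key's state; here we always keep it.
def updateMood(newTweets: Seq[String], lastMood: Option[Int]): Option[Int] = {
  val delta = newTweets.map(t => if (t.contains(":)")) 1 else -1).sum
  Some(lastMood.getOrElse(0) + delta)
}

// Two micro-batches of (user, tweet) pairs.
val batches = Seq(
  Seq("alice" -> "great day :)", "bob" -> "ugh"),
  Seq("alice" -> "still smiling :)")
)

// Fold each batch into the state map, applying the update function per key,
// the way Spark Streaming applies it once per key per batch interval.
val moods: Map[String, Int] =
  batches.foldLeft(Map.empty[String, Int]) { (state, batch) =>
    val byUser = batch.groupBy(_._1).view.mapValues(_.map(_._2)).toMap
    byUser.foldLeft(state) { case (s, (user, tweets)) =>
      updateMood(tweets, s.get(user)).map(m => s.updated(user, m)).getOrElse(s)
    }
  }
```

The fold makes the key property visible: state for a key evolves deterministically from the sequence of batches, which is what lets Spark recompute it after a failure.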

Page 19:

Arbitrary Combinations of Batch and Streaming Computations

Inter-mix RDD and DStream operations!
- Example: join incoming tweets with a spam HDFS file to filter out bad tweets

tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})
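As a rough plain-Scala analogue of that per-batch join-and-filter — collection operations standing in for RDD ones, with a made-up spam list and sample batch (the real code would join against an RDD loaded from the HDFS file):

```scala
// Hypothetical spam list, as if loaded once from an HDFS file of bad accounts.
val spamUsers = Set("spammy", "bot42")

// One batch of the tweets stream, as (user, text) pairs.
val tweetsBatch = Seq("alice" -> "hello", "spammy" -> "buy now", "bob" -> "hi")

// The join + filter collapses, for this sketch, to dropping keys that
// appear in the spam set.
val cleanTweets = tweetsBatch.filterNot { case (user, _) => spamUsers(user) }
```

The `transform` operation runs exactly this kind of per-batch RDD code, which is how batch-built datasets (like the spam file) plug directly into the stream.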

Page 20:

DStreams + RDDs = Power

Online machine learning
- Continuously learn and update data models (updateStateByKey and transform)

Combine live data streams with historical data
- Generate historical data models with Spark, etc.
- Use data models to process live data stream (transform)

CEP-style processing
- Window-based operations (reduceByWindow, etc.)

Page 21:

Input Sources

Out of the box, we provide
- Kafka, HDFS, Flume, Akka Actors, raw TCP sockets, etc.

Very easy to write a receiver for your own data source

Also, generate your own RDDs from Spark, etc. and push them in as a “stream”

Page 22:

Fault-tolerance

Batches of input data are replicated in memory for fault-tolerance

Data lost due to worker failure can be recomputed from the replicated input data

[Diagram: input data (tweetsRDD) replicated in memory; lost partitions of hashTagsRDD recomputed on other workers by re-running flatMap]

All transformations are fault-tolerant and provide exactly-once semantics

Page 23:

Performance

Can process 60M records/sec (6 GB/sec) on 100 nodes at sub-second latency

[Charts: cluster throughput (GB/s) vs. # nodes in cluster (0–100) for WordCount and Grep, at 1 sec and 2 sec batch intervals]

Page 24:

Comparison with other systems

Higher throughput than Storm
- Spark Streaming: 670k records/sec/node
- Storm: 115k records/sec/node
- Commercial systems: 100–500k records/sec/node

[Charts: throughput per node (MB/s) vs. record size (100–10000 bytes) for WordCount and Grep, Spark vs. Storm]

Page 25:

Fast Fault Recovery

Recovers from faults/stragglers within 1 sec

Page 26:

Mobile Millennium Project

Traffic transit time estimation using online machine learning on GPS observations

- Markov-chain Monte Carlo simulations on GPS observations
- Very CPU intensive, requires dozens of machines for useful computation
- Scales linearly with cluster size

[Chart: GPS observations per sec vs. # nodes in cluster, rising to ~2000 observations/sec across 0–80 nodes]

Page 27:

Advantage of a unified stack

- Explore data interactively to identify problems
- Use the same code in Spark for processing large logs
- Use similar code in Spark Streaming for realtime processing

$ ./spark-shell
scala> val file = sc.hadoopFile("smallLogs")
...
scala> val filtered = file.filter(_.contains("ERROR"))
...
scala> val mapped = filtered.map(...)
...

object ProcessProductionData {
  def main(args: Array[String]) {
    val sc = new SparkContext(...)
    val file = sc.hadoopFile("productionLogs")
    val filtered = file.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

object ProcessLiveStream {
  def main(args: Array[String]) {
    val sc = new StreamingContext(...)
    val stream = sc.kafkaStream(...)
    val filtered = stream.filter(_.contains("ERROR"))
    val mapped = filtered.map(...)
    ...
  }
}

Page 28:

Roadmap

Spark 0.8.1
- Marked alpha, but has been quite stable
- Master fault tolerance – manual recovery
  - Restart computation from a checkpoint file saved to HDFS

Spark 0.9 in Jan 2014 – out of alpha!
- Automated master fault recovery
- Performance optimizations
- Web UI and better monitoring capabilities

Page 29:

Roadmap

Long term goals
- Python API
- MLlib for Spark Streaming
- Shark Streaming

Community feedback is crucial!
- Helps us prioritize the goals

Contributions are more than welcome!!

Page 30:

Today’s Tutorial

Process a Twitter data stream to find the most popular hashtags over a window

Requires a Twitter account
- Need to set up Twitter OAuth keys to access tweets
- All the instructions are in the tutorial

Your account will be safe!
- No need to enter your password anywhere, only the keys
- Destroy the keys after the tutorial is done

Page 31:

Conclusion

Streaming programming guide – spark.incubator.apache.org/docs/latest/streaming-programming-guide.html

Research paper – tinyurl.com/dstreams

Thank you!

