
Spark & Spark Streaming Internals

Transcript
Page 1:

Spark & Spark Streaming Internals

Akhil [email protected]

Page 2:

Apache Spark

Spark Stack

Page 3:

Spark Internals

Page 4:

Resilient Distributed Dataset (RDD)

A restricted form of distributed shared memory
➔ Immutable
➔ Can only be built through deterministic transformations (textFile, map, filter, join, …)

Efficient fault recovery using lineage (see the sketch below)
➔ Recompute lost partitions on failure
➔ No cost if nothing fails
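
As a minimal sketch of what lineage looks like in practice (the variable names, app name, and local master are illustrative, not from the slides), each transformation records its parent RDD, and toDebugString prints the resulting chain:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("LineageDemo").setMaster("local[*]"))

// Each transformation records its parent, building a lineage graph
// instead of materializing data.
val nums    = sc.parallelize(1 to 1000)   // base RDD
val evens   = nums.filter(_ % 2 == 0)     // deterministic transformation
val doubled = evens.map(_ * 2)            // another transformation

// This lineage is what Spark replays to recompute a lost partition.
println(doubled.toDebugString)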

Page 5:

RDD Operations

Transformations
➔ map/flatMap
➔ filter
➔ union/join/groupBy
➔ cache
…

Actions
➔ collect/count
➔ save
➔ take
…
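
Transformations are lazy: they only extend the lineage, while actions launch a job and return a result. A small sketch reusing the sc from above (the names are illustrative):

val words  = sc.parallelize(Seq("spark", "streaming", "spark"))

// Transformations: nothing runs yet, Spark just records the lineage.
val pairs  = words.map(w => (w, 1))
val counts = pairs.reduceByKey(_ + _)
counts.cache()                            // mark for in-memory reuse (also lazy)

// Actions: trigger execution and bring results back to the driver.
println(counts.count())                   // 2 distinct words
counts.collect().foreach(println)         // e.g. (spark,2) and (streaming,1)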

Page 6:

Log Mining Example

Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
val messages = errors.map(_.split('\t')(2))         // transformed RDD
messages.persist()

messages.filter(_.contains("foo")).count()          // action
messages.filter(_.contains("bar")).count()          // action

Page 7:

What is Spark Streaming?

Page 8:

What is Spark Streaming?

A framework for large-scale stream processing
➔ Scales to 100s of nodes
➔ Can achieve second-scale latencies
➔ Provides a simple batch-like API for implementing complex algorithms
➔ Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis, etc. (see the sketch below)
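
As one hedged example of absorbing a live stream, the receiver-based Kafka API of this era looked roughly like the following; the spark-streaming-kafka artifact is assumed, the ZooKeeper quorum, group id, and topic map are placeholders, and ssc is an existing StreamingContext:

import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaLines = KafkaUtils.createStream(
  ssc,                      // an existing StreamingContext
  "zk-host:2181",           // ZooKeeper quorum (placeholder)
  "demo-consumer-group",    // consumer group id (placeholder)
  Map("events" -> 1)        // topic -> number of receiver threads
).map(_._2)                 // keep the message value, drop the key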

Page 9:

Overview

Run a streaming computation as a series of very small, deterministic batch jobs


- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as RDDs and processes them using RDD operations
- Finally, the processed results of the RDD operations are returned in batches (a runnable sketch follows)
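
A minimal word-count sketch of this micro-batch model (the host, port, and app name are placeholders): the batch interval passed to StreamingContext is the "X seconds" above, and each batch becomes one RDD under the hood.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MicroBatchDemo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))    // chop the stream into 2-second batches

val lines  = ssc.socketTextStream("localhost", 9999) // each batch of lines is one RDD
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

counts.print()          // processed results come back batch by batch

ssc.start()             // start receiving and processing
ssc.awaitTermination()

local[2] matters here: one thread runs the socket receiver, the other processes the batches.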

Page 10:

Key Concepts

➔ DStream – a sequence of RDDs representing a stream of data
  - Sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets
➔ Transformations – modify data from one DStream to another
  - Standard RDD operations – map, countByValue, reduce, join, …
  - Stateful operations – window, countByValueAndWindow, …
➔ Output Operations – send data to an external entity (see the sketch below)
  - saveAsHadoopFiles – saves to HDFS
  - foreach – do anything with each batch of results
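
Building on the streaming sketch above, a stateful operation plus an output operation might look like this; the checkpoint path is a placeholder, countByValueAndWindow needs checkpointing because it uses an inverse reduce, and foreachRDD is the general-purpose output operation the slide calls foreach:

ssc.checkpoint("/tmp/checkpoint")     // required for the windowed count below

val words = lines.flatMap(_.split(" "))

// Stateful operation: counts over the last 30 seconds, recomputed every 10.
val windowedCounts = words.countByValueAndWindow(Seconds(30), Seconds(10))

// Output operation: do anything with each batch of results.
windowedCounts.foreachRDD { rdd =>
  rdd.take(10).foreach(println)
}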

Page 11:

E.g.: Get hashtags from Twitter

val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))   // transformation
hashTags.saveAsHadoopFiles("hdfs://...")   // output operation

Sample output: #Ebola, #India, #Mars ...
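
As with the earlier sketch, nothing is received or saved until the streaming context is started:

ssc.start()             // begin pulling tweets and writing hashtag files
ssc.awaitTermination()  // block until the streaming job is stopped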

Page 12:

Thank You

