
Spark & Spark Streaming Internals - Nov 15 (1)

  • Spark & Spark Streaming Internals

    Akhil [email protected]

  • Apache Spark

    Spark Stack

  • Spark Internals

  • Resilient Distributed Dataset (RDD)

    Restricted form of distributed shared memory:
    - Immutable
    - Can only be built through deterministic transformations (textFile, map, filter, join, ...)

    Efficient fault recovery using lineage:
    - Recompute lost partitions on failure
    - No cost if nothing fails
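
    A minimal Scala sketch of lineage in action (the SparkContext sc, the file name, and the variable names are assumptions for illustration); each transformation records its parent, so a lost partition of lengths can be recomputed from lines rather than replicated:

    val lines    = sc.textFile("data.txt")     // base RDD from stable storage
    val nonEmpty = lines.filter(_.nonEmpty)    // deterministic transformation
    val lengths  = nonEmpty.map(_.length)      // another deterministic transformation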

  • RDD Operations


    Transformations (lazy): map/flatMap, filter, union/join/groupBy, cache


    Actions (trigger execution): collect/count, save, take
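
    Since transformations are lazy and only actions trigger execution, a minimal Scala sketch makes the split concrete (the SparkContext sc and the sample data are assumptions for illustration):

    val nums    = sc.parallelize(1 to 1000)   // distribute a local collection
    val doubled = nums.map(_ * 2)             // transformation: builds lineage, runs nothing
    val evens   = doubled.filter(_ % 4 == 0)  // transformation: still lazy
    println(evens.count())                    // action: triggers the job, returns a count
    evens.take(3).foreach(println)            // action: ships a few elements to the driver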


  • Log Mining Example

    Load error messages from a log into memory, then interactively search for various patterns

    val lines = spark.textFile("hdfs://...")             // Base RDD
    val errors = lines.filter(_.startsWith("ERROR"))     // Transformed RDD
    val messages = errors.map(_.split("\t")(2))
    messages.persist()

    messages.filter(_.contains("foo")).count()
    messages.filter(_.contains("bar")).count()


  • What is Spark Streaming?

    Framework for large-scale stream processing

    Scales to 100s of nodes

    Can achieve second-scale latencies

    Provides a simple batch-like API for implementing complex algorithms

    Can absorb live data streams from Kafka, Flume, ZeroMQ, Kinesis, etc.

  • Overview

    Run a streaming computation as a series of very small, deterministic batch jobs



    - Chop up the live stream into batches of X seconds

    - Spark treats each batch of data as an RDD and processes it using RDD operations

    - Finally, the processed results of the RDD operations are returned in batches (see the sketch below)
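
    A minimal sketch of this batching model (the app name and the TCP source on localhost:9999 are assumptions for illustration):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf  = new SparkConf().setMaster("local[2]").setAppName("Overview")  // 2 threads: receiver + processing
    val ssc   = new StreamingContext(conf, Seconds(1))     // X = 1 second batches
    val lines = ssc.socketTextStream("localhost", 9999)    // live input DStream
    lines.count().print()                                  // each batch is an RDD; print its size
    ssc.start()
    ssc.awaitTermination()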

  • Key Concepts

    DStream: a sequence of RDDs representing a stream of data
    - Sources: Twitter, HDFS, Kafka, Flume, ZeroMQ, Akka Actor, TCP sockets

    Transformations: modify data from one DStream to another
    - Standard RDD operations: map, countByValue, reduce, join, ...
    - Stateful operations: window, countByValueAndWindow, ...

    Output Operations: send data to an external entity
    - saveAsHadoopFiles: saves to HDFS
    - foreach: do anything with each batch of results

  • Example: Get hashtags from Twitter

    val tweets = ssc.twitterStream(<Twitter username>, <Twitter password>)
    val hashTags = tweets.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))  // Transformation
    hashTags.saveAsHadoopFiles("hdfs://...")

    Result: #Ebola, #India, #Mars ...
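
    The stateful window operations listed under Key Concepts would extend this example; a hedged sketch (reusing the hashTags DStream above, with a slide interval assumed to be a multiple of the batch interval) that counts each hashtag over a sliding 10-minute window, every 5 seconds:

    import org.apache.spark.streaming.{Minutes, Seconds}

    val tagCounts = hashTags.window(Minutes(10), Seconds(5)).countByValue()  // sliding count per hashtag
    tagCounts.print()                                                        // output operation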

  • Thank You
