Real Time Data Processing Using Spark Streaming
Transcript
Page 1: Real Time Data Processing Using Spark Streaming


Hari Shreedharan, Software Engineer @ Cloudera

Committer/PMC Member, Apache Flume

Committer, Apache Sqoop

Contributor, Apache Spark

Author, Using Flume (O’Reilly)

Real Time Data Processing using Spark Streaming

Page 2: Real Time Data Processing Using Spark Streaming


Motivation for Real-Time Stream Processing

Data is being created at unprecedented rates

• Exponential data growth from mobile, web, social
• Connected devices: 9B in 2012 to 50B by 2020
• Over 1 trillion sensors by 2020
• Datacenter IP traffic growing at a CAGR of 25%

How can we harness this data in real time?
• Value can quickly degrade → capture value immediately
• From reactive analysis to direct operational impact
• Unlocks new competitive advantages
• Requires a completely new approach...

Page 3: Real Time Data Processing Using Spark Streaming


Use Cases Across Industries

Credit: Identify fraudulent transactions as soon as they occur.

Transportation: Dynamic re-routing of traffic or vehicle fleets.

Retail:
• Dynamic inventory management
• Real-time in-store offers and recommendations

Consumer Internet & Mobile: Optimize user engagement based on the user’s current behavior.

Healthcare: Continuously monitor patient vital stats and proactively identify at-risk patients.

Manufacturing:
• Identify equipment failures and react instantly
• Perform proactive maintenance

Surveillance: Identify threats and intrusions in real time.

Digital Advertising & Marketing: Optimize and personalize content based on real-time information.

Page 4: Real Time Data Processing Using Spark Streaming


From Volume and Variety to Velocity

Big Data has evolved, and the Hadoop ecosystem evolves with it:

• Past: Big Data = Volume + Variety; batch processing; time to insight of hours
• Present: Big Data = Volume + Variety + Velocity; batch + stream processing; time to insight of seconds

Page 5: Real Time Data Processing Using Spark Streaming


Key Components of Streaming Architectures

• Data Ingestion & Transportation Service (Kafka, Flume)
• Real-Time Stream Processing Engine
• Real-Time Data Serving
• Data Management & Integration
• System Management
• Security

Page 6: Real Time Data Processing Using Spark Streaming


Canonical Stream Processing Architecture

Data Sources → Data Ingest (Kafka, Flume) → Kafka → App 1, App 2, ... → HDFS, HBase

Page 7: Real Time Data Processing Using Spark Streaming


Spark: Easy and Fast Big Data

• Easy to Develop
  • Rich APIs in Java, Scala, Python
  • Interactive shell
  • 2-5× less code

• Fast to Run
  • General execution graphs
  • In-memory storage
  • Up to 10× faster on disk, 100× in memory

Page 8: Real Time Data Processing Using Spark Streaming


Spark Architecture

A Driver coordinates multiple Workers; each Worker holds its partition of the data in RAM.
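A minimal driver-side sketch (the app name, master, and data below are illustrative assumptions, not from the deck): the Driver builds the SparkContext and the execution graph, and work is shipped to executors on the Workers, which can cache partitions in RAM.

import org.apache.spark.{SparkConf, SparkContext}

object ArchitectureSketch {
  def main(args: Array[String]): Unit = {
    // Placeholder settings; in this deck's setup the master would typically be YARN.
    val conf = new SparkConf().setAppName("ArchitectureSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // The driver defines the computation; executors on the workers run the tasks.
    val data = sc.parallelize(1 to 1000000, 8)   // 8 partitions spread across workers
    val doubled = data.map(_ * 2).cache()        // partitions cached in worker RAM
    println(doubled.count())                     // action triggers distributed execution

    sc.stop()
  }
}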

Page 9: Real Time Data Processing Using Spark Streaming


RDDs

RDD = Resilient Distributed Dataset
• Immutable representation of data
• Operations on one RDD create a new one
• Memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
• Lazy materialization

Two observations:
a. Can fall back to disk when the dataset does not fit in memory
b. Provides fault tolerance through the concept of lineage
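A brief sketch of these properties (the input path and filter term are assumptions): each transformation returns a new RDD, nothing executes until an action forces materialization, cached partitions spill to disk when memory is short, and lost partitions are recomputed from lineage.

import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`; the path is a placeholder.
val logs = sc.textFile("hdfs:///data/app/logs/*")            // RDD over data in stable storage
val errors = logs.filter(_.contains("ERROR"))                // new, immutable RDD; lazy, nothing runs yet
val persisted = errors.persist(StorageLevel.MEMORY_AND_DISK) // falls back to disk if it does not fit in memory
println(persisted.count())                                   // action materializes the lineage: textFile -> filter
// If a partition is lost, Spark recomputes just that partition from the lineage above.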

Page 10: Real Time Data Processing Using Spark Streaming


Spark Streaming

An extension of Apache Spark’s core API for stream processing.

The framework provides:
• Fault tolerance
• Scalability
• High throughput

Page 11: Real Time Data Processing Using Spark Streaming


Spark Streaming

• Incoming data represented as Discretized Streams (DStreams)

• Stream is broken down into micro-batches

• Each micro-batch is an RDD – can share code between batch and streaming

Page 12: Real Time Data Processing Using Spark Streaming


val tweets = ssc.twitterStream()                           // DStream of tweet statuses
val hashTags = tweets.flatMap(status => getTags(status))   // extract hashtags from each status
hashTags.saveAsHadoopFiles("hdfs://...")                   // write each micro-batch out to HDFS

“Micro-batch” architecture: the tweets DStream is broken into batches at t, t+1, t+2, ...; each batch is transformed by flatMap into the corresponding batch of the hashTags DStream and then saved. The stream is composed of small (1-10s) batch computations.

Page 13: Real Time Data Processing Using Spark Streaming


Use DStreams for Windowing Functions
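A minimal windowing sketch, reusing the hashTags DStream from the previous slide (the window and slide durations are illustrative):

import org.apache.spark.streaming.Seconds

// Count hashtags seen in the last 60 seconds, recomputed every 10 seconds.
val tagCounts = hashTags
  .map(tag => (tag, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(60), Seconds(10))

tagCounts.print()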

Page 14: Real Time Data Processing Using Spark Streaming


Spark Streaming

• Runs as a Spark job

• YARN or standalone for scheduling

• YARN has KDC integration

• Use the same code for real-time Spark Streaming and for batch Spark jobs.

• Integrates natively with messaging systems such as Flume, Kafka, ZeroMQ, ...

• Easy to write “Receivers” for custom messaging systems.
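For a custom messaging system, a receiver can be sketched roughly as follows (the line-oriented socket source, host, and port are assumptions); it only has to call store() with whatever it pulls in, and is attached with ssc.receiverStream(new CustomSourceReceiver(host, port)).

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver for a custom, line-oriented messaging system.
class CustomSourceReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Pull data on a background thread so onStart() returns immediately.
    new Thread("Custom Source Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // The receive loop checks isStopped(), so nothing extra is needed here.
  }

  private def receive(): Unit = {
    val socket = new Socket(host, port)
    val reader = new BufferedReader(new InputStreamReader(socket.getInputStream))
    var line = reader.readLine()
    while (!isStopped() && line != null) {
      store(line)               // hand each record to Spark Streaming
      line = reader.readLine()
    }
    reader.close()
    socket.close()
  }
}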

Page 15: Real Time Data Processing Using Spark Streaming


Sharing Code between Batch and Streaming

def filterErrors(rdd: RDD[String]): RDD[String] = {
  rdd.filter(s => s.contains("ERROR"))
}

Library that filters "ERROR" lines

• Streaming generates RDDs periodically

• Any code that operates on RDDs can therefore be used in streaming as well

Page 16: Real Time Data Processing Using Spark Streaming


Sharing Code between Batch and Streaming

Spark:

val lines = sc.textFile(…)
val filtered = filterErrors(lines)
filtered.saveAsTextFile(...)

Spark Streaming:

val dStream = FlumeUtils.createStream(ssc, "34.23.46.22", 4435)
val filtered = dStream.transform((rdd: RDD[String], time: Time) => filterErrors(rdd))
filtered.saveAsTextFiles(…)

Page 17: Real Time Data Processing Using Spark Streaming


Reliability

• Received data is automatically persisted to an HDFS Write Ahead Log (WAL) to prevent data loss

• Set spark.streaming.receiver.writeAheadLog.enable=true in the Spark conf (a minimal sketch follows this list)

• When the YARN Application Master (AM) dies, the application is restarted by YARN

• Received, acked, and unprocessed data is replayed from the WAL (data that made it into blocks)

• Reliable Receivers can replay data from the original source, if required

• Un-acked data replayed from source.

• Kafka, Flume receivers bundled with Spark are examples

• Reliable Receivers + WAL = No data loss on driver or receiver failure!
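A minimal configuration sketch (the app name, batch interval, and checkpoint directory are illustrative) showing where the WAL flag goes; the WAL itself lives under the checkpoint directory:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ReliableStreamingApp")                            // placeholder name
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // persist received data to the HDFS WAL

val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///user/spark/checkpoints")                 // assumed directory; holds the WAL and recovery metadata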

Page 18: Real Time Data Processing Using Spark Streaming


Kafka Connectors

• Reliable Kafka DStream

• Stores received data to Write Ahead Log on HDFS for replay

• No data loss

• Stable and supported!

• Direct Kafka DStream

• Uses low level API to pull data from Kafka

• Replays from Kafka on driver failure

• No data loss

• Experimental
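Hedged sketches of both connectors (the ZooKeeper quorum, broker list, consumer group, and topic are placeholders); the receiver-based stream relies on the WAL described earlier, while the direct stream replays from Kafka itself:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Reliable (receiver-based) Kafka DStream: received data goes to the WAL for replay.
val receiverStream = KafkaUtils.createStream(
  ssc, "zk1:2181", "my-consumer-group", Map("events" -> 1))

// Direct Kafka DStream (Spark 1.3): Spark tracks offsets and re-reads from Kafka on failure.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))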

Page 19: Real Time Data Processing Using Spark Streaming


Flume Connector

• Flume Polling DStream

• Add the Spark sink (available from Maven) to Flume’s plugin directory

• The Flume Polling Receiver polls the sink to receive data (a minimal sketch follows this list)

• Replays received data from WAL on HDFS

• No data loss

• Stable and Supported!
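A rough sketch of the polling connector (the host and port of the Flume Spark sink are placeholders):

import org.apache.spark.streaming.flume.FlumeUtils

// Poll the Spark sink that Flume writes to (assumed to run on flume-host:9988).
val flumeStream = FlumeUtils.createPollingStream(ssc, "flume-host", 9988)
flumeStream.map(event => new String(event.event.getBody.array())).print()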

Page 20: Real Time Data Processing Using Spark Streaming


Spark Streaming Use-Cases

• Real-time dashboards

• Show approximate results in real-time

• Reconcile periodically with source-of-truth using Spark

• Joins of multiple streams (a minimal sketch follows this list)

• Time-based or count-based “windows”

• Combine multiple sources of input to produce composite data

• Re-use RDDs created by Streaming in other Spark jobs.
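A rough sketch of joining two streams with a time-based window (the socket sources, key layout, and durations are illustrative assumptions, not from the deck):

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

// Two keyed DStreams, e.g. parsed from separate ingest sources (placeholders).
val clicks: DStream[(String, String)] =
  ssc.socketTextStream("localhost", 9999).map { line => val f = line.split(","); (f(0), f(1)) }
val impressions: DStream[(String, String)] =
  ssc.socketTextStream("localhost", 9998).map { line => val f = line.split(","); (f(0), f(1)) }

// Join clicks against the last 5 minutes of impressions, keyed by a shared id.
val joined = clicks.join(impressions.window(Minutes(5)))
joined.print()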

Page 21: Real Time Data Processing Using Spark Streaming


What is coming?

• Run on Secure YARN for more than 7 days!

• Better Monitoring and alerting

• Batch-level and task-level monitoring

• SQL on Streaming

• Run SQL-like queries on top of Streaming (medium – long term)

• Python!

• Limited support coming in Spark 1.3

Page 22: Real Time Data Processing Using Spark Streaming


Current Spark project status

• 400+ contributors and 50+ companies contributing

• Includes Databricks, Cloudera, Intel, Yahoo!, etc.

• Dozens of production deployments

• Spark Streaming survived the Netflix Chaos Monkey – production ready!

• Included in CDH!

Page 23: Real Time Data Processing Using Spark Streaming


More Info

• CDH Docs: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_spark_installation.html

• Cloudera Blog: http://blog.cloudera.com/blog/category/spark/

• Apache Spark homepage: http://spark.apache.org/

• Github: https://github.com/apache/spark

Page 24: Real Time Data Processing Using Spark Streaming


Thank you

[email protected]

15% discount code for Cloudera training: PNWCUG_15
university.cloudera.com

