
Spark Streaming with Kafka - Meetup Bangalore

Date post: 16-Aug-2015
Upload: dibyendu-bhattacharya
Transcript
Page 1: Spark Streaming with Kafka - Meetup Bangalore

Tale of Kafka Consumer for Spark Streaming

by Dibyendu Bhattacharya

http://www.meetup.com/Real-Time-Data-Processing-and-Cloud-Computing/

Page 2: Spark Streaming with Kafka - Meetup Bangalore

What we do at Pearson

Page 3: Spark Streaming with Kafka - Meetup Bangalore

Anatomy of Kafka Cluster..

Page 4: Spark Streaming with Kafka - Meetup Bangalore

Spark in a Slide..

Page 5: Spark Streaming with Kafka - Meetup Bangalore

Spark + Kafka

Page 6: Spark Streaming with Kafka - Meetup Bangalore

Spark + Kafka

1. The streaming application uses a StreamingContext, which uses the SparkContext to launch jobs across the cluster.

2. Receivers running on Executors process the incoming stream.

3. The Receiver divides the stream into Blocks and writes those Blocks to the Spark BlockManager.

4. The Spark BlockManager replicates the Blocks.

5. The Receiver reports the received Blocks to the StreamingContext.

6. The StreamingContext periodically (every batch interval) takes all the Blocks, creates an RDD, and launches jobs on those RDDs using the SparkContext.

7. Spark processes the RDD by running tasks on the Blocks.

8. This process repeats every batch interval.
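The block-to-batch flow above can be sketched very roughly in plain Python. This is a conceptual simulation only, not Spark code; `BlockManager` and `StreamingContext` here are simplified stand-ins for the real components:

```python
# Conceptual simulation of the receiver -> block -> batch flow described above.
# NOT Spark code; it only mimics the bookkeeping between the components.

class BlockManager:
    """Stand-in for Spark's BlockManager: stores and replicates blocks (step 4)."""
    def __init__(self, replication=2):
        self.stores = [[] for _ in range(replication)]  # one list per replica

    def put(self, block):
        for store in self.stores:          # replicate each block
            store.append(block)

class StreamingContext:
    """Stand-in: collects reported blocks, turns them into one 'RDD' per batch."""
    def __init__(self):
        self.pending_blocks = []
        self.batches = []

    def report_block(self, block_id):      # step 5: receiver reports blocks
        self.pending_blocks.append(block_id)

    def tick(self):                        # step 6: batch interval fires
        if self.pending_blocks:
            self.batches.append(list(self.pending_blocks))  # one "RDD" per batch
            self.pending_blocks.clear()

def receive(stream, ssc, bm, block_size=3):
    # steps 2-3: the receiver divides the stream into blocks
    for i in range(0, len(stream), block_size):
        bm.put(stream[i:i + block_size])   # step 3: write block to BlockManager
        ssc.report_block(i // block_size)  # step 5: report it

ssc, bm = StreamingContext(), BlockManager()
receive(list(range(12)), ssc, bm)          # 12 records -> 4 blocks
ssc.tick()                                 # batch interval: 1 batch of 4 blocks
print(ssc.batches)                         # [[0, 1, 2, 3]]
```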

Page 7: Spark Streaming with Kafka - Meetup Bangalore

Failure Scenarios..

Receiver failed

Driver failed

Data loss in both cases?

Page 8: Spark Streaming with Kafka - Meetup Bangalore

Failure Scenarios..Receiver

Unreliable Receiver

Need a Reliable Receiver

Page 9: Spark Streaming with Kafka - Meetup Bangalore

Kafka Receivers..

Reliable Receiver can use ..

Kafka High Level API (Spark out-of-the-box Receiver)

Kafka Low Level API (part of Spark-Packages) http://spark-packages.org/package/dibbhatt/kafka-spark-consumer

The High Level Kafka API has a SERIOUS issue with Consumer Re-Balance... It cannot be used in Production.

https://cwiki.apache.org/confluence/display/KAFKA/Consumer+Client+Re-Design

Page 10: Spark Streaming with Kafka - Meetup Bangalore

Low Level Kafka Receiver Challenges

Consumer implemented as a Custom Spark Receiver, which needs to handle:

• Consumer needs to know the Leader of a Partition.

• Consumer should be aware of Leader changes.

• Consumer should handle ZK timeouts.

• Consumer needs to manage Kafka Offsets.

• Consumer needs to handle various failovers.
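The leader-tracking and retry bookkeeping in the list above can be sketched as follows. This is a hypothetical illustration in plain Python; the class and function names are invented for this sketch, not a real Kafka client API:

```python
# Hypothetical sketch of the leader-tracking / retry logic a low-level
# Kafka consumer must implement itself (names are illustrative only).

class LeaderMovedError(Exception):
    """Raised when a fetch hits a broker that is no longer the partition leader."""

class SimpleLowLevelConsumer:
    def __init__(self, cluster, topic, partition):
        self.cluster = cluster          # metadata: partition -> current leader broker
        self.topic, self.partition = topic, partition
        self.leader = cluster[partition]
        self.offset = 0                 # the consumer manages its own offset

    def fetch(self, broker_read):
        """Fetch from the current leader; on a leader change, refresh and retry."""
        for _ in range(3):
            try:
                msgs = broker_read(self.leader, self.partition, self.offset)
                self.offset += len(msgs)   # advance the managed offset on success
                return msgs
            except LeaderMovedError:
                # leader changed: re-read metadata and retry against the new leader
                self.leader = self.cluster[self.partition]
        raise RuntimeError("no leader found after retries")

# --- demo: broker 1 loses leadership of partition 0 to broker 2 mid-fetch ---
cluster = {0: 1}                    # partition 0 currently led by broker 1
c = SimpleLowLevelConsumer(cluster, "events", 0)

calls = []
def broker_read(leader, partition, offset):
    calls.append(leader)
    if leader == 1:                 # broker 1 has just lost leadership
        cluster[0] = 2              # cluster metadata now points at broker 2
        raise LeaderMovedError()
    return ["m0", "m1"]            # broker 2 serves the fetch

msgs = c.fetch(broker_read)
print(msgs, c.offset, c.leader)     # ['m0', 'm1'] 2 2
```

The consumer recovers transparently: the first attempt fails, metadata is refreshed, and the retry succeeds against the new leader.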

Page 11: Spark Streaming with Kafka - Meetup Bangalore

Low Level Receiver..

Page 12: Spark Streaming with Kafka - Meetup Bangalore

Failure Scenarios..Driver

Need to enable WAL-based recovery

Data which is buffered but not yet processed is lost... why?
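Why the WAL helps can be shown with a toy write-ahead log in plain Python. This is a sketch of the idea only, not Spark's actual WAL implementation; the file layout is an assumption:

```python
import json
import os
import tempfile

# Toy write-ahead log: the receiver appends each record durably BEFORE
# acknowledging it, so a driver restart can replay buffered-but-unprocessed data.

class WriteAheadLog:
    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())   # durable on disk before we ack upstream

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f]

log_path = os.path.join(tempfile.mkdtemp(), "receiver.wal")
wal = WriteAheadLog(log_path)
for rec in ["a", "b", "c"]:
    wal.append(rec)            # buffered data hits disk before acknowledgement

# simulate a driver failure + restart: buffered data is recovered, not lost
recovered = WriteAheadLog(log_path).replay()
print(recovered)               # ['a', 'b', 'c']
```

Without the durable append, anything sitting only in the receiver's memory at crash time would be gone.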

Page 13: Spark Streaming with Kafka - Meetup Bangalore

Zero Loss using Receiver Based approach..

Page 14: Spark Streaming with Kafka - Meetup Bangalore

Direct Kafka Stream Approach ..Experimental

Page 15: Spark Streaming with Kafka - Meetup Bangalore

Direct Kafka Stream Approach ..needs to mature

Where to store the Offset details?

Checkpoint is the default mechanism. It has issues with checkpoint recoverability.

Offset in an external store. Complex, and you manage your own recovery.

Saving the offset to an external store is only possible when you get the offset details from the Stream.
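The "offset in external store" option amounts to a process-then-commit loop. A minimal sketch in plain Python, where the store and the batch shapes are assumptions for illustration, not the actual spark-streaming-kafka API:

```python
# Sketch of direct-stream-style offset management against an external store.
# After processing each message we persist its ending offset; on restart we
# resume from the stored offset, giving at-least-once semantics.

class OffsetStore:                     # stand-in for ZK / HBase / an RDBMS
    def __init__(self):
        self.offsets = {}

    def load(self, tp):
        return self.offsets.get(tp, 0)

    def commit(self, tp, offset):
        self.offsets[tp] = offset

def run(messages, store, tp, processed, crash_after=None):
    """Consume from the stored offset; optionally 'crash' partway through."""
    offset = store.load(tp)            # recovery: resume where we left off
    for i, msg in enumerate(messages[offset:], start=offset):
        processed.append(msg)          # process first...
        store.commit(tp, i + 1)        # ...then commit the offset externally
        if crash_after is not None and i + 1 == crash_after:
            return                     # simulated driver crash

store, processed = OffsetStore(), []
msgs = ["m0", "m1", "m2", "m3"]
run(msgs, store, ("events", 0), processed, crash_after=2)  # crash after m1
run(msgs, store, ("events", 0), processed)                 # restart resumes at m2
print(processed)   # ['m0', 'm1', 'm2', 'm3'] -- no loss across the crash
```

Committing after processing is what makes the recovery "complex": a crash between processing and commit replays that message, so downstream handling must tolerate duplicates.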

Page 16: Spark Streaming with Kafka - Meetup Bangalore

Some pointers on Low Level Receiver

Rate limiting by size of messages, not by number of messages.

Can save the ZK offset to a different Zookeeper node than the one managing the Kafka cluster.

Can handle ALL failure recovery: Kafka broker down, Zookeeper down, underlying Spark BlockManager failure, Offset Out Of Range issues. Ability to Restart or Retry based on the failure scenario.
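Rate limiting by message size rather than message count can be sketched as a byte budget per fetch. This is illustrative only; the actual consumer's configuration and internals will differ:

```python
# Sketch: cap each fetch by total BYTES rather than by message count,
# so a handful of very large messages cannot blow up a single batch.

def fetch_by_size(queue, max_bytes):
    """Drain messages until the byte budget is exhausted (always take at least one)."""
    batch, used = [], 0
    while queue:
        size = len(queue[0])
        if batch and used + size > max_bytes:
            break                       # next message would exceed the budget
        used += size
        batch.append(queue.pop(0))
    return batch, used

queue = ["x" * 400, "x" * 400, "x" * 900, "x" * 200]
b1, s1 = fetch_by_size(queue, 1000)     # 400 + 400 fit; the 900 would not
b2, s2 = fetch_by_size(queue, 1000)     # the 900-byte message alone
b3, s3 = fetch_by_size(queue, 1000)     # the trailing 200-byte message
print(s1, s2, s3)                       # 800 900 200
```

A count-based limit of, say, 2 messages per fetch would have let the 900-byte message ride along with another, producing wildly uneven batch sizes; the byte budget keeps batches roughly uniform.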

Page 17: Spark Streaming with Kafka - Meetup Bangalore

dibbhatt/kafka-spark-consumer

Data Rate: 1MB/250ms per Receiver

Page 18: Spark Streaming with Kafka - Meetup Bangalore

dibbhatt/kafka-spark-consumer

Data Rate: 5MB/250ms per Receiver

Page 19: Spark Streaming with Kafka - Meetup Bangalore

dibbhatt/kafka-spark-consumer

Data Rate: 10MB/250ms per Receiver

Page 20: Spark Streaming with Kafka - Meetup Bangalore

Direct Stream Approach

Page 21: Spark Streaming with Kafka - Meetup Bangalore

Thank You !

