
Spark Streaming with Apache Kafka

Date posted: 13-Apr-2017


Vikas Gite
Principal Software Engineer, Big Data Analytics - PubMatic

Every ad. Every sales channel. Every screen. One platform.

Agenda
- Spark streaming 101
  - What is RDD
  - What is DStream
- Spark streaming architecture
- Introduction to Kafka
- Streaming ingestion with Kafka

Spark streaming 101: RDD
- Immutable
- Partitioned
- Fault tolerant
- Lazily evaluated
- Can be persisted

Lineage graph: First RDD -> Filter -> Second RDD -> Map -> Third RDD
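The lineage above can be sketched in a few lines of Spark code. This is a minimal sketch, assuming spark-core is on the classpath and a local master is acceptable; the names `LineageSketch` and `evenTimesTen` are illustrative and do not come from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LineageSketch {
  // Builds First RDD -> filter -> Second RDD -> map -> Third RDD.
  // Nothing is computed here: transformations only record lineage.
  def evenTimesTen(sc: SparkContext): RDD[Int] = {
    val first  = sc.parallelize(1 to 10, numSlices = 2) // partitioned into 2 slices
    val second = first.filter(_ % 2 == 0)               // a new immutable RDD
    val third  = second.map(_ * 10)
    third.cache()                                       // persisted once it is first computed
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-sketch")
    val sc = new SparkContext(conf)
    try {
      val third = evenTimesTen(sc)
      println(third.toDebugString)            // prints the recorded lineage
      println(third.collect().mkString(", ")) // action: triggers the actual computation
    } finally {
      sc.stop()
    }
  }
}
```

If a partition of the third RDD is lost, Spark can recompute it from the first RDD by replaying the filter and map steps, which is what makes RDDs fault tolerant.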

Spark streaming 101: DStream
- Continuous sequence of RDDs
- Designed for stream processing

Spark streaming architecture: Micro batching
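To make the idea concrete, here is a plain-Scala sketch of micro batching with no Spark dependency. It assumes a hypothetical `Event` type carrying an arrival time, and it only illustrates how a batch interval slices a stream into small batches; it is not how Spark implements it internally.

```scala
object MicroBatchSketch {
  // Hypothetical event type: a payload plus its arrival time in milliseconds.
  final case class Event(arrivalMs: Long, payload: String)

  // Assigns each event to the batch whose interval contains its arrival time
  // and returns (batch start time, payloads) pairs in time order.
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[String])] = {
    require(batchIntervalMs > 0, "batch interval must be positive")
    events
      .groupBy(e => e.arrivalMs / batchIntervalMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batchIndex, batchEvents) =>
        (batchIndex * batchIntervalMs, batchEvents.sortBy(_.arrivalMs).map(_.payload))
      }
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(100, "a"), Event(900, "b"), Event(1200, "c"), Event(2500, "d"))
    toBatches(events, batchIntervalMs = 1000).foreach { case (start, payloads) =>
      println(s"batch starting at $start ms: ${payloads.mkString(", ")}")
    }
  }
}
```

One simplification to note: this sketch skips intervals that received no events, whereas a DStream still produces an (empty) RDD for every batch interval.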

Spark streaming architecture: Dynamic load balancing


Notes from Ashish Tadose (AT):
- Stream processing is inherently costlier than batch processing, so always send only the required fields to the stream processing platform.
- Data ingestion should be able to fork a smaller stream (required fields only) from the existing data flow and pass it on to a different destination, such as a message buffer or queue.

Spark streaming architecture: Failure and recovery


Introduction to Kafka
- Kafka is a message queue (circular buffer)
  - Retention is based on disk space or time
  - Oldest messages are deleted to maintain size
- Split into topics and partitions
- Indexed only by offset
- Delivery semantics are your responsibility
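The "circular buffer indexed only by offset" idea can be modelled in a few lines of plain Scala. This is a toy model under stated assumptions, not Kafka's implementation: `PartitionLog` is a hypothetical type, and retention is counted in messages here, whereas Kafka retains by bytes or time.

```scala
import scala.collection.immutable.Queue

object PartitionLogSketch {
  // Toy model of one partition: messages are addressed only by offset,
  // and the oldest ones are dropped once `maxMessages` is exceeded.
  final case class PartitionLog(
      maxMessages: Int,
      startOffset: Long = 0L,
      messages: Queue[String] = Queue.empty) {

    def nextOffset: Long = startOffset + messages.size

    // Appends a message; returns the new log and the offset assigned to the message.
    def append(msg: String): (PartitionLog, Long) = {
      val offset = nextOffset
      val grown = messages.enqueue(msg)
      val overflow = math.max(0, grown.size - maxMessages)
      (copy(startOffset = startOffset + overflow, messages = grown.drop(overflow)), offset)
    }

    // Reads by offset; None if the message was already deleted or not yet written.
    def read(offset: Long): Option[String] =
      if (offset < startOffset || offset >= nextOffset) None
      else Some(messages((offset - startOffset).toInt))
  }
}
```

Offsets keep growing even as old messages are deleted, so a consumer that falls too far behind finds that its saved offset no longer exists. That is one reason delivery semantics end up being the consumer's responsibility.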


High-level consumer
- Offsets are stored in ZooKeeper
- Offsets are stored per consumer group

Low-level consumer
- Offsets can be stored in any store
- Must handle broker leader changes


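At most once
1. Save offsets
   !!! Possible failure !!!
2. Save results

On failure, processing restarts at the saved offset, so the messages whose results were not yet saved are lost.

At least once
1. Save results
   !!! Possible failure !!!
2. Save offsets

On failure, processing restarts at the previously saved offset, so messages are repeated.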
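The difference between the two orderings is easy to see in a small simulation. The sketch below is plain Scala with hypothetical in-memory stores (`Stores`) and a single simulated crash between the two save steps; it is an illustration, not consumer code.

```scala
import scala.collection.mutable.ArrayBuffer

object DeliverySketch {
  // Hypothetical in-memory stand-ins for an offset store and a result store.
  final class Stores {
    var savedOffset: Int = 0 // next offset to read after a restart
    val results: ArrayBuffer[String] = ArrayBuffer.empty
  }

  // Processes messages one at a time, always resuming from the saved offset.
  // `crashAt` simulates one failure between the two save steps for that offset.
  def run(messages: Vector[String], offsetsFirst: Boolean, crashAt: Int): Stores = {
    val stores = new Stores
    var crashed = false

    // Throws once, the first time the given offset is processed.
    def possibleFailure(offset: Int): Unit =
      if (offset == crashAt && !crashed) {
        crashed = true
        throw new RuntimeException("simulated crash")
      }

    while (stores.savedOffset < messages.length) {
      val offset = stores.savedOffset
      try {
        if (offsetsFirst) {               // at most once
          stores.savedOffset = offset + 1 // 1. save offset
          possibleFailure(offset)
          stores.results += messages(offset) // 2. save result
        } else {                          // at least once
          stores.results += messages(offset) // 1. save result
          possibleFailure(offset)
          stores.savedOffset = offset + 1 // 2. save offset
        }
      } catch {
        case _: RuntimeException => () // restart: the loop resumes from savedOffset
      }
    }
    stores
  }
}
```

With messages `m0, m1, m2` and a crash at offset 1, saving offsets first yields `m0, m2` (one message lost), while saving results first yields `m0, m1, m1, m2` (one message repeated).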


Idempotent exactly once
1. Save result with a natural unique key
   !!! Possible failure !!!
2. Save offset

The operation is safe to repeat.

Pros:
- Simple
- Works well with map transformations

Cons:
- Hard for aggregate transformations
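A minimal sketch of why a natural unique key makes replays harmless, and why aggregates are harder. It assumes a hypothetical `Impression` record whose `id` is the natural key, with a mutable map standing in for the result store.

```scala
import scala.collection.mutable

object IdempotentSketch {
  // Hypothetical record with a natural unique key (for example an impression id).
  final case class Impression(id: String, priceMicros: Long)

  // Upsert keyed by the natural key: writing the same record twice leaves one row.
  def save(store: mutable.Map[String, Impression], batch: Seq[Impression]): Unit =
    batch.foreach(r => store.update(r.id, r))

  // Contrast: a running count is not idempotent, so replaying a batch inflates it.
  def addToCount(current: Long, batch: Seq[Impression]): Long =
    current + batch.size

  def main(args: Array[String]): Unit = {
    val batch = Seq(Impression("imp-1", 1500L), Impression("imp-2", 900L))
    val store = mutable.Map.empty[String, Impression]
    save(store, batch)
    save(store, batch) // replay after a failure
    println(s"rows after replay: ${store.size}")
    println(s"count after replay: ${addToCount(addToCount(0L, batch), batch)}")
  }
}
```

The keyed store ends up with the same rows no matter how often the batch is replayed, whereas the running count double counts. This is the reason the slide calls aggregate transformations hard for the idempotent approach.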


Transactional exactly once
1. Begin transaction
2. Save results
3. Save offset
4. Ensure offsets are ok
5. Commit transaction

On failure, roll back results and offsets.

Pros:
- Works for any transformation

Cons:
- More complex
- Requires a transactional data store
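The steps above can be sketched in plain Scala. Everything here is an assumption for illustration: `Store` is a hypothetical in-memory stand-in for a transactional data store, and "ensure offsets are ok" is modelled as checking that a batch starts exactly where the last committed batch ended.

```scala
object TransactionalSketch {
  // Results and offsets live in the same (hypothetical) store so they commit together.
  final case class State(totals: Map[String, Long], nextOffset: Long)

  final class Store(initial: State) {
    private var committed: State = initial

    def current: State = synchronized(committed)

    // Publishes the updated state only if `update` succeeds; an exception
    // leaves the committed state untouched, which plays the role of a rollback.
    def transaction(update: State => State): Unit = synchronized {
      committed = update(committed)
    }
  }

  // A batch covers offsets [fromOffset, untilOffset) of one partition.
  final case class Batch(fromOffset: Long, untilOffset: Long, counts: Map[String, Long])

  // Returns true if the batch was committed, false if it was rolled back.
  def saveBatch(store: Store, batch: Batch): Boolean =
    try {
      store.transaction { state =>
        // Ensure offsets are ok: the batch must start where the last commit ended.
        if (state.nextOffset != batch.fromOffset)
          throw new IllegalStateException(
            s"expected offset ${state.nextOffset}, got ${batch.fromOffset}")
        val merged = batch.counts.foldLeft(state.totals) { case (acc, (key, n)) =>
          acc.updated(key, acc.getOrElse(key, 0L) + n)
        }
        State(merged, batch.untilOffset) // results and offset change together
      }
      true
    } catch {
      case _: IllegalStateException => false // rolled back: nothing was committed
    }
}
```

Because a replayed batch fails the offset check, even an aggregate such as a running count is applied exactly once, which is what the idempotent approach struggles with.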


Streaming ingestion with Kafka: Approach 1, receiver-based approach

Pros:
- The write-ahead log (WAL) design could work with non-Kafka data stores

Cons:
- Duplication of write operations
- Dependent on HDFS
- Must use the idempotent approach for exactly once
- No access to offsets, so the transactional approach can't be used
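For reference, a receiver-based stream with the write-ahead log enabled might be set up roughly as follows. This is a sketch assuming the older `spark-streaming-kafka` (Kafka 0.8) integration that the rest of the deck uses; `ReceiverSketch`, its method names, the 5-second batch interval and all argument values are placeholders, not the speaker's configuration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.kafka.KafkaUtils

object ReceiverSketch {
  // Enables the receiver write-ahead log, so received data is also
  // written to fault-tolerant storage before it is processed.
  def walConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

  // The caller supplies the checkpoint directory, ZooKeeper quorum, group and topic.
  def receiverStream(
      conf: SparkConf,
      checkpointDir: String,
      zkQuorum: String,
      groupId: String,
      topic: String): (StreamingContext, ReceiverInputDStream[(String, String)]) = {
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir) // the WAL is kept under the checkpoint directory
    val stream = KafkaUtils.createStream(
      ssc,
      zkQuorum,
      groupId,
      Map(topic -> 1),                  // topic -> number of consumer threads
      StorageLevel.MEMORY_AND_DISK_SER) // no in-memory replication; the WAL covers recovery
    (ssc, stream)
  }
}
```

The data is written once by Kafka and again to the log, which is the duplicated write the slide lists as a drawback, and the checkpoint directory is where the dependency on HDFS (or a similar file system) comes from.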

Streaming ingestion with Kafka: Approach 2, direct approach (no receivers)

Pros:
- Simplified parallelism
  - One-to-one mapping between Kafka partitions and RDD partitions
- Efficiency
  - Reduces WAL overhead
- Exactly-once semantics
  - Spark checkpoints
  - Atomic transactions

Streaming ingestion with Kafka: Approach 2, direct approach (how to use it)

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Kafka config params (topics, brokers and ssc are defined elsewhere)
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers,
      "auto.offset.reset" -> "largest")

    // DirectStream method call
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

Streaming ingestion with Kafka: Where to store offsets

Easy: Spark checkpoints
- No need to access the offsets; they are used automatically on restart
- Must be idempotent, not transactional
- Checkpoints may not be recoverable

Complex: your own data store
- Must access offsets, save them, and provide them on restart
- Idempotent or transactional
- Offsets are just as recoverable as your results
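A hedged sketch of the "your own data store" option with the direct approach, assuming the `spark-streaming-kafka` (Kafka 0.8) integration. The `OffsetStore` trait and `InMemoryOffsetStore` are hypothetical stand-ins for a durable store, and the sketch assumes the store already holds a starting offset for every partition to be read.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object OffsetStoreSketch {
  // Hypothetical interface over "your own data store".
  trait OffsetStore {
    def load(): Map[TopicAndPartition, Long]
    def save(ranges: Array[OffsetRange]): Unit
  }

  // In-memory stand-in for illustration only; a real store must be durable.
  final class InMemoryOffsetStore(initial: Map[TopicAndPartition, Long]) extends OffsetStore {
    private var offsets = initial
    def load(): Map[TopicAndPartition, Long] = synchronized(offsets)
    def save(ranges: Array[OffsetRange]): Unit = synchronized {
      offsets = offsets ++ ranges.map(r => r.topicAndPartition() -> r.untilOffset)
    }
  }

  // Wires up the stream; the caller still has to call ssc.start().
  def start(ssc: StreamingContext, kafkaParams: Map[String, String], store: OffsetStore): Unit = {
    val fromOffsets = store.load() // provide the saved offsets on restart
    val handler = (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message())
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets, handler)

    stream.foreachRDD { rdd =>
      // Access the offsets on the RDD that comes straight from the stream.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val count = rdd.count() // placeholder for real processing and result saving
      println(s"processed $count messages")
      store.save(ranges) // results first, then offsets: at least once
    }
  }
}
```

As written this gives at-least-once delivery. Making the result writes keyed upserts turns it into the idempotent variant, and saving results and offsets in one transaction turns it into the transactional variant.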

Our Scale
- 18B+ ad impressions served daily
- 10T bids processed monthly
- 22TB data processed daily
- 5PB data under management
- 6 data centers across geographies

Thank You
