Spark stream - Kafka

Date post: 13-Apr-2017
Author: dori-waldman
Spark Streaming Kafka in Action Dori Waldman Big Data Lead


  • Spark Streaming with Kafka: Receiver-Based

    Spark Streaming with Kafka: Direct (No Receiver)

    Stateful Spark Streaming (Demo)



  • What we do: Ad-Exchange

    Real-time trading (150 ms average response time) and campaign optimization over ad spaces.

    Tech Stack :


  • Why Spark ...


  • Use Case

    Tens of millions of transactions per minute (and growing)

    ~15 TB daily (24/7, 99.99999 resiliency)

    Data Aggregation (#Video Success Rate):

    Real-time aggregation and DB update

    Raw data persistency as recovery backup

    Retrospective aggregation updates (recalculate)

    Analytic Data:

    Persist incoming events (raw data persistency)

    Real-time analytics and ML algorithm (inside)



  • Receiver Approach - KafkaUtils.createStream

    Based on the high-level Kafka consumer

    The receiver stores Kafka messages in executors/workers

    Write-Ahead Logs (WAL) are recommended to recover data on failures

    ZK offsets are updated by Spark

    Data duplication (WAL/Kafka)


  • Receiver Approach - Code

    Spark Partition != Kafka Partition

    val kafkaStream = {

    Basic

    Advanced


  • Receiver Approach - Code (continued)


  • Architecture 1.0

    [Diagram: Events, Stream, Consumer, Raw Data, Aggregation, Spark Stream, Spark Batch]


  • Architecture

    Pros:

    Worked just fine with a single MySQL server

    Simplicity: legacy code stays the same

    Real-time DB updates

    Partial aggregation was done in Spark, DB was updated via Insert On Duplicate Key Update

    Cons:

    MySQL limitations (MySQL sharding is an issue, Cassandra is optimal)

    S3 raw data (in standard formats) is not trivial when using Spark


  • Monitoring



  • Architecture 2.0

    [Diagram: Events, Stream, Consumer, Raw Data, Aggregation]

    Stream starts from the largest offset by default

    Parquet columnar format (FS, not DB)

    Spark batch updates C* every few minutes (overwrite)


  • Architecture

    Pros:

    Parquet is ideal for Spark analytics

    Backup data requires less disk space

    Cons:

    DB is not updated in real time (streaming); we could combine it with MySQL for the current hour...

    What has been changed:

    C* uses counters for sum/update, which is a bad practice (there is no Insert On Duplicate Key Update as in MySQL)

    Parquet conversion is a heavy job, and it seems that hourly streaming conversions (using batch in case of failure) are a better approach


  • Direct Approach - KafkaUtils.createDirectStream

    Based on the Kafka simple consumer

    Queries Kafka for the latest offsets in each topic+partition and defines the offset range for each batch

    No need to create multiple input Kafka streams and consolidate them

    Spark creates an RDD partition for each Kafka partition, so data is consumed in parallel

    ZK offsets are not updated by Spark; offsets are tracked by Spark within its checkpoints (might not recover)

    No data duplication (no WAL)
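The offset-range step of the direct approach can be illustrated without a cluster. What follows is a minimal plain-Scala model, not Spark's implementation: `OffsetRange` only mirrors the field names of Spark's class of the same name, while `TopicPartition`, `DirectBatchPlanner`, `planBatch`, and the sample offsets are hypothetical names and values introduced for illustration.

```scala
// Plain-Scala model of how one batch is planned in the direct approach (illustrative only).
final case class TopicPartition(topic: String, partition: Int)

// Mirrors the fields of Spark's OffsetRange; untilOffset is exclusive.
final case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset
}

object DirectBatchPlanner {
  // consumed: next offset to read per partition (hypothetical bookkeeping)
  // latest:   latest offset reported by Kafka per partition
  def planBatch(consumed: Map[TopicPartition, Long],
                latest: Map[TopicPartition, Long]): Seq[OffsetRange] =
    latest.toSeq
      .sortBy { case (tp, _) => (tp.topic, tp.partition) }
      .map { case (tp, until) =>
        // A partition with no recorded position starts at the latest offset.
        val from = consumed.getOrElse(tp, until)
        OffsetRange(tp.topic, tp.partition, from, math.max(from, until))
      }
}

object OffsetRangeDemo {
  val consumed = Map(TopicPartition("events", 0) -> 100L, TopicPartition("events", 1) -> 250L)
  val latest   = Map(TopicPartition("events", 0) -> 160L, TopicPartition("events", 1) -> 250L)
  val ranges   = DirectBatchPlanner.planBatch(consumed, latest)

  def main(args: Array[String]): Unit =
    ranges.foreach { r =>
      println(s"${r.topic}-${r.partition}: [${r.fromOffset}, ${r.untilOffset}) -> ${r.count} messages")
    }
}
```

Each planned range corresponds to one RDD partition, which is why the number of Kafka partitions determines the read parallelism in this approach.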


  • S3 / HDFS

    Save metadata needed for recovery from driver failures

    RDDs for stateful transformations (RDDs of previous batches)



  • Transfer data from driver to workers

    Broadcast - keeps a read-only variable cached on each machine rather than shipping a copy of it with tasks

    Accumulator - used to implement counters/sums; workers can only add to an accumulator, the driver can read its value (you can extend AccumulatorParam[Vector])

    Static (Scala Object)

    Context (rdd): get data after recovery


  • Direct Approach - Code


  • In-house code

    def start(sparkConfig: SparkConfiguration, decoder: String): Unit = {
      // Recover the context from the checkpoint folder, or create a new one
      val ssc = StreamingContext.getOrCreate(
        sparkCheckpointDirectory(sparkConfig),
        () => functionToCreateContext(decoder, sparkConfig))

      sys.ShutdownHookThread {
        ssc.stop(stopSparkContext = true, stopGracefully = true)
      }

      ssc.start()
      ssc.awaitTermination()
    }


  • In-house code (continued)

    def functionToCreateContext(decoder: String, sparkConfig: SparkConfiguration): StreamingContext = {

      val sparkConf = new SparkConf().setMaster(sparkClusterHost).setAppName(sparkConfig.jobName)
      sparkConf.set(S3_KEY, sparkConfig.awsKey)
      sparkConf.set(S3_CREDS, sparkConfig.awsSecret)
      sparkConf.set(PARQUET_OUTPUT_DIRECTORY, sparkConfig.parquetOutputDirectory)

      val sparkContext = SparkContext.getOrCreate(sparkConf)

      // Hadoop S3 writer optimization
      sparkContext.hadoopConfiguration.set("spark.sql.parquet.output.committer.class", "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")

      // Same as Avro, Parquet also supports schema evolution. This work happens in the driver
      // and takes a relatively long time
      sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
      sparkContext.hadoopConfiguration.setInt("parquet.metadata.read.parallelism", 100)

      val ssc = new StreamingContext(sparkContext, Seconds(sparkConfig.batchTime))
      ssc.checkpoint(sparkCheckpointDirectory(sparkConfig))


  • // The streams are created only if the checkpoint folder does not exist
    val streams = sparkConfig.kafkaConfig.streams map { c =>
      val topic = c.topic.split(",").toSet
      KafkaUtils.createDirectStream[String, String, StringDecoder, JsonDecoder](ssc, c.kafkaParams, topic)
    }

    streams.foreach { dsStream => {
      dsStream.foreachRDD { rdd =>
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

        for (o

  • In-house code (continued)

    val data = sqlContext.read.json(rdd.map(_._2))
    val carpetData = data.count()
    if (carpetData > 0) {

      // coalesce(1): data transfer optimization during shuffle
      data.coalesce(1).write.mode(SaveMode.Append).partitionBy("day", "hour").parquet("s3a://...")

      // In case of an S3Exception we will not continue to update ZK
      zk.updateNode(o.topic, o.partition.toString, kafkaConsumerGroup, o.untilOffset.toString.getBytes)
    }
    } } }
    ssc
    }


  • SaveMode (Append/Overwrite) controls how existing data is handled (add new files / overwrite)

    Spark Streaming does not update ZK (see http://curator.apache.org/ for a ZooKeeper client)

    Spark Streaming saves offsets in its checkpoint folder. After a crash it continues from the last saved offset

    You can avoid using the checkpoint for offsets and manage them manually
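Managing offsets manually usually comes down to one ordering rule: write the batch first, and advance the stored offset only if the write succeeded, which is also what the in-house code does when it skips the ZK update after an S3 exception. The sketch below is a plain-Scala illustration under stated assumptions: `OffsetStore` is a hypothetical in-memory stand-in for the ZooKeeper nodes, and `ManualOffsets.processBatch` is an invented helper, not part of Spark or Curator.

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

// Hypothetical in-memory stand-in for the ZooKeeper offset nodes.
final class OffsetStore {
  private val offsets = mutable.Map.empty[(String, Int), Long]

  def commit(topic: String, partition: Int, untilOffset: Long): Unit =
    offsets((topic, partition)) = untilOffset

  def committed(topic: String, partition: Int): Option[Long] =
    offsets.get((topic, partition))
}

object ManualOffsets {
  // Write the batch first; commit the offset only if the write succeeded.
  // A failed write leaves the old offset in place, so the same range is read
  // again later (at-least-once delivery, duplicates are possible).
  def processBatch(store: OffsetStore, topic: String, partition: Int, untilOffset: Long)
                  (write: () => Unit): Boolean =
    Try(write()) match {
      case Success(_) =>
        store.commit(topic, partition, untilOffset)
        true
      case Failure(_) =>
        false
    }

  def main(args: Array[String]): Unit = {
    val store = new OffsetStore
    println(processBatch(store, "events", 0, 160L)(() => ()))
    println(processBatch(store, "events", 0, 220L)(() => throw new RuntimeException("S3 write failed")))
    println(store.committed("events", 0))
  }
}
```

Because the offset is committed after the write, a crash between the two steps replays the batch, so the sink has to tolerate duplicates (for example by overwriting).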



  • Batch code

    val sparkConf = new SparkConf().setMaster("local[4]").setAppName("demo")
    val sparkContext = SparkContext.getOrCreate(sparkConf)
    val sqlContext = SQLContext.getOrCreate(sparkContext)
    val data = sqlContext.read.json(path)
    data.coalesce(1).write.mode(SaveMode.Overwrite).partitionBy("table", "day").parquet(outputFolder)


  • Back Pressure

    Built-in support for backpressure since Spark 1.5 (disabled by default)

    Receiver: spark.streaming.receiver.maxRate

    Direct: spark.streaming.kafka.maxRatePerPartition
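The two rate properties cap how much a single batch can read, and the cap is easy to estimate up front. The sketch below shows that arithmetic in plain Scala; the object and method names are hypothetical, and the rates, partition count, and batch interval are made-up values. `spark.streaming.backpressure.enabled` is the Spark property that turns the feature on.

```scala
object RateLimits {
  // Direct approach: spark.streaming.kafka.maxRatePerPartition is expressed in
  // messages per second per Kafka partition.
  def maxMessagesPerBatchDirect(maxRatePerPartition: Long, kafkaPartitions: Int, batchSeconds: Long): Long =
    maxRatePerPartition * kafkaPartitions * batchSeconds

  // Receiver approach: spark.streaming.receiver.maxRate is expressed in
  // records per second per receiver.
  def maxMessagesPerBatchReceiver(maxRate: Long, receivers: Int, batchSeconds: Long): Long =
    maxRate * receivers * batchSeconds

  // Settings as plain key/value pairs; the values are illustrative assumptions.
  val settings: Map[String, String] = Map(
    "spark.streaming.backpressure.enabled"      -> "true",
    "spark.streaming.kafka.maxRatePerPartition" -> "1000",
    "spark.streaming.receiver.maxRate"          -> "5000"
  )

  def main(args: Array[String]): Unit = {
    // 1000 msg/s per partition, 12 partitions, 10-second batches
    println(maxMessagesPerBatchDirect(1000L, 12, 10L))
    // 5000 records/s per receiver, 4 receivers, 10-second batches
    println(maxMessagesPerBatchReceiver(5000L, 4, 10L))
  }
}
```

Sizing these caps against what one batch interval can actually process keeps the first batches after a restart from being oversized.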


  • Links: Spark & Kafka integration

    https://www.youtube.com/watch?v=fXnNEq1v3VA&list=PL-x35fyliRwgfhffEpywn4q23ykotgQJ6&index=16

    https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/


  • Architecture: other Spark options

    We can use an hourly window, do the aggregation in Spark, and overwrite the C* row in real time
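The hourly-window option can be pictured as bucketing events by the hour they fall in, summing per bucket, and replacing the stored row with the new total. The sketch below is a plain-Scala model under stated assumptions, not Spark's windowing API: `HourlyWindow`, `Event`, and the in-memory `Map` standing in for the C* table are all hypothetical, and it buckets by event timestamp whereas Spark Streaming windows are defined over batch time.

```scala
object HourlyWindow {
  final case class Event(timestampMillis: Long, advertiserId: Int, clicks: Long)

  private val HourMillis = 60L * 60L * 1000L

  // Start of the hour that contains the given timestamp (non-negative epoch millis).
  def hourStart(timestampMillis: Long): Long = (timestampMillis / HourMillis) * HourMillis

  // Total clicks per (hour, advertiser).
  def aggregate(events: Seq[Event]): Map[(Long, Int), Long] =
    events
      .groupBy(e => (hourStart(e.timestampMillis), e.advertiserId))
      .map { case (key, es) => key -> es.map(_.clicks).sum }

  // Overwrite semantics: the new hourly total replaces whatever the row held.
  def overwrite(table: Map[(Long, Int), Long], totals: Map[(Long, Int), Long]): Map[(Long, Int), Long] =
    table ++ totals

  def main(args: Array[String]): Unit = {
    val events = Seq(Event(0L, 5, 2L), Event(1000L, 5, 3L), Event(3600000L, 5, 1L))
    println(overwrite(Map((0L, 5) -> 1L), aggregate(events)))
  }
}
```

Overwriting a whole row with a recomputed total is idempotent, which is what makes reprocessing the same hour after a failure safe.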


  • Stateful Spark Streaming

    https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-spark-streaming.html

    https://docs.cloud.databricks.com/docs/spark/1.6/examples/Streaming%20mapWithState.html


  • Architecture 3.0

    [Diagram: Events, Stream, Consumer, Raw Data, Aggregation]

    Analytic data uses Spark Streaming to transfer Kafka raw data to Parquet. A regular Kafka consumer saves a raw data backup in S3 (on streaming failure, a Spark batch job converts it to Parquet).

    Aggregation data uses stateful Spark Streaming (mapWithState) to update C*. On streaming failure, a Spark batch job updates C* from the Parquet data.
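The idea behind stateful streaming is that a per-key state survives from one micro-batch to the next and is updated with each new value. The sketch below models only that idea in plain Scala; it does not use Spark's mapWithState API, and `StatefulAggregation`, `updateState`, `runBatch`, and the sample batches are hypothetical names and data.

```scala
object StatefulAggregation {
  type Key = (Int, Int) // (advertiserId, countryId)

  // Per-key update: combine the previous state (if any) with the new value.
  def updateState(previous: Option[Long], clicks: Long): Long =
    previous.getOrElse(0L) + clicks

  // Apply one micro-batch to the state carried over from earlier batches.
  def runBatch(state: Map[Key, Long], batch: Seq[(Key, Long)]): Map[Key, Long] =
    batch.foldLeft(state) { case (acc, (key, clicks)) =>
      acc.updated(key, updateState(acc.get(key), clicks))
    }

  val batch1: Seq[(Key, Long)] = Seq((5, 8) -> 2L, (7, 1) -> 1L)
  val batch2: Seq[(Key, Long)] = Seq((5, 8) -> 3L)

  // State after both batches have been processed in order.
  val finalState: Map[Key, Long] =
    Seq(batch1, batch2).foldLeft(Map.empty[Key, Long])(runBatch)

  def main(args: Array[String]): Unit =
    finalState.toSeq.sortBy(_._1).foreach(println)
}
```

In this model the running totals in `finalState` are what would be written to C* after each batch.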


  • Architecture

    Pros:

    Real-time DB updates

    Cons:

    Too many components, relatively expensive (compared to phase 1)

    According to the documentation, upgrading Spark has an issue with checkpoints


  • What's Next

    FiloDB? (probably not, lots of nodes)

    Parquet performance based on C*

    http://www.slideshare.net/planetcassandra/tuplejump-breakthrough-olap-performance-on-cassandra-and-spark?ref=http://www.planetcassandra.org/blog/introducing-filodb/


  • Questions?


  • In-house code

    val ssc = new StreamingContext(sparkConfig.sparkConf, Seconds(batchTime))
    val kafkaStreams = (1 to sparkConfig.workers) map { i =>
      new FixedKafkaInputDStream[String, AggregationEvent, StringDecoder, SerializedDecoder[AggregationEvent]](
        ssc, kafkaConfiguration.kafkaMapParams, topicMap, StorageLevel.MEMORY_ONLY_SER).map(_._2) // for write ahead log
    }

    val unifiedStream = ssc.union(kafkaStreams) // manage all streams as one

    val mapped = unifiedStream flatMap { event =>
      // convert the event to an aggregation object which contains
      // key (advertiserId, countryId) and values (click, impression)
      Aggregations.getEventAggregationsKeysAndValues(Option(event))
    }

    val reduced = mapped.reduceByKey {
      // per aggregation type we created a + method that
      // describes how to do the aggregation
      _ + _
    }

    K1 = (advertiserId = 5, countryId = 8)

    V1 = (clicks = 2, impression = 17)

    Before reduceByKey: k1(e), v1(e); k1(e), v2(e); k2(e), v3(e)

    After reduceByKey: k1(e), v1+v2; k2(e), v3(e)
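The reduceByKey step relies on the value type defining its own `+`. The sketch below reproduces that pattern on an in-memory collection in plain Scala, without Spark; `AggKey`, `AggValue`, and the local `reduceByKey` helper are hypothetical stand-ins for the in-house aggregation classes, and the second key and the extra values are made-up sample data.

```scala
object ReduceByKeyModel {
  final case class AggKey(advertiserId: Int, countryId: Int)

  final case class AggValue(clicks: Long, impressions: Long) {
    // Describes how two values for the same key are merged.
    def +(other: AggValue): AggValue =
      AggValue(clicks + other.clicks, impressions + other.impressions)
  }

  // Local equivalent of reduceByKey(_ + _): group by key, then merge the values.
  def reduceByKey(events: Seq[(AggKey, AggValue)]): Map[AggKey, AggValue] =
    events
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2).reduce(_ + _) }

  val k1 = AggKey(advertiserId = 5, countryId = 8)
  val k2 = AggKey(advertiserId = 6, countryId = 3)

  val reduced: Map[AggKey, AggValue] = reduceByKey(Seq(
    k1 -> AggValue(2, 17),
    k1 -> AggValue(1, 5),
    k2 -> AggValue(4, 9)
  ))

  def main(args: Array[String]): Unit =
    reduced.foreach { case (k, v) => println(s"$k -> $v") }
}
```

Keeping the merge logic in `+` means a new aggregation type only needs to define its own operator.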


  • Kafka message semantics (offset)