
Spark Summit 2014: Spark Streaming for Realtime Auctions

Description:
Spark Summit (2014-06-30) http://spark-summit.org/2014/talk/building-a-data-processing-system-for-real-time-auctions What do you do when you need to update your models sooner than your existing batch workflows allow? At Sharethrough we faced that question. Although we use Hadoop extensively for batch processing, we needed a system to process clickstream data as soon as possible for our real-time ad auction platform. We found Spark Streaming to be a perfect fit because of its easy integration into the Hadoop ecosystem, powerful functional programming API, and low-friction interoperability with our existing batch workflows. In this talk, we'll present some of the use cases that led us to choose a stream processing system and explain why we use Spark Streaming in particular. We'll also discuss how we organized our jobs to promote reusability between batch and streaming workflows and to improve testability.
Transcript
Page 1: Spark Summit 2014: Spark Streaming for Realtime Auctions

Spark Streaming for Realtime Auctions @russellcardullo

Sharethrough

Page 2: Spark Summit 2014: Spark Streaming for Realtime Auctions

Agenda

• Sharethrough?
• Streaming use cases
• How we use Spark
• Next steps

Page 3: Spark Summit 2014: Spark Streaming for Realtime Auctions

Sharethrough

vs

Page 4: Spark Summit 2014: Spark Streaming for Realtime Auctions

The Sharethrough Native Exchange

Page 5: Spark Summit 2014: Spark Streaming for Realtime Auctions

How can we use streaming data?

Page 6: Spark Summit 2014: Spark Streaming for Realtime Auctions

Use Cases

• Creative Optimization
• Spend Tracking
• Operational Monitoring

$\mu^* = \max_k \{ \mu_k \}$ (serve the variant with the highest observed mean reward)

Page 7: Spark Summit 2014: Spark Streaming for Realtime Auctions

Creative Optimization

• Choose best performing variant

• Short feedback cycle required

impression: {
  device: "iOS",
  geocode: "C967",
  pkey: "p193",
  …
}

[Figure: an impression routed to one of three creative variants with headlines like "Click Here", "Content", and "Hello Friend", each with an unknown click-through rate]
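The talk shows only the selection rule, not code. As a minimal sketch of the greedy rule $\mu^* = \max_k \{ \mu_k \}$ above, with a hypothetical per-variant stats class that is not from the talk:

// Hypothetical per-variant stats; names are illustrative, not from the talk.
case class VariantStats(variant: String, impressions: Long, clicks: Long) {
  // Observed mean reward (click-through rate) for this variant.
  def meanReward: Double =
    if (impressions == 0L) 0.0 else clicks.toDouble / impressions
}

// Greedy selection: serve the variant whose observed mean reward is
// highest, i.e. the arm achieving µ* = max_k { µ_k }.
def bestVariant(stats: Seq[VariantStats]): VariantStats =
  stats.maxBy(_.meanReward)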

Page 8: Spark Summit 2014: Spark Streaming for Realtime Auctions

Spend Tracking

• Spend on visible impressions and clicks

• Actual spend happens asynchronously

• Want to correct prediction for optimal serving

[Figure: predicted spend plotted against actual spend over time]
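No code is shown for this either; one reading of the slide is a running correction factor applied to the prediction, sketched here with hypothetical names:

// Hypothetical correction: scale the predicted spend by the ratio of
// actual to predicted spend observed so far, so that serving decisions
// use a calibrated estimate. Illustrative only, not from the talk.
def correctedPrediction(predicted: Double,
                        actualSoFar: Double,
                        predictedSoFar: Double): Double = {
  val correction =
    if (predictedSoFar == 0.0) 1.0 else actualSoFar / predictedSoFar
  predicted * correction
}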

Page 9: Spark Summit 2014: Spark Streaming for Realtime Auctions

Operational Monitoring

• Detect issues with content served on third party sites

• Use same logs as reporting

[Figure: grid of click rates for placements P1 through P8]
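A monitor in the style of the transformations shown later in the talk could compute windowed click rates per placement and flag outliers. This sketch uses the BeaconLogLine type introduced later in the deck; the "impression" and "click" beaconType values and the threshold are assumptions, not from the talk:

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

// Hypothetical monitor: click rate per placement over a 10-minute
// sliding window, flagging unusually low rates.
def lowClickRatePlacements(source: DStream[BeaconLogLine]): DStream[(String, Double)] = {
  def countByPkey(beaconType: String) =
    source.filter(_.beaconType == beaconType).
      map(b => (b.pkey, 1L)).
      reduceByKeyAndWindow(_ + _, Minutes(10))

  countByPkey("impression").join(countByPkey("click")).
    map { case (pkey, (imps, clks)) => (pkey, clks.toDouble / imps) }.
    filter { case (_, rate) => rate < 0.001 }  // illustrative threshold
}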

Page 10: Spark Summit 2014: Spark Streaming for Realtime Auctions

We can directly measure business impact of using this data sooner

Page 11: Spark Summit 2014: Spark Streaming for Realtime Auctions

Why use Spark to build these features?

Page 12: Spark Summit 2014: Spark Streaming for Realtime Auctions

Why Spark?

• Scala API
• Supports batch and streaming
• Active community support
• Easily integrates into the existing Hadoop ecosystem
• But it doesn't require Hadoop to run

Page 13: Spark Summit 2014: Spark Streaming for Realtime Auctions

How we’ve integrated Spark

Page 14: Spark Summit 2014: Spark Streaming for Realtime Auctions

Existing Data Pipeline

[Diagram: three web servers each write a log file that Flume ships into HDFS; downstream, an analytics web service and a database consume the batch output]

Page 15: Spark Summit 2014: Spark Streaming for Realtime Auctions

Pipeline with Streaming

[Diagram: the same web server → log file → Flume → HDFS pipeline, with a Spark Streaming path added alongside the batch path into the analytics web service and database]

Page 16: Spark Summit 2014: Spark Streaming for Realtime Auctions

Streaming
• "Real-Time" reporting
• Low latency to use data
• Only as reliable as the source
• Low latency > correctness

Batch
• Daily reporting
• Billing / earnings
• Anything with a strict SLA
• Correctness > low latency

Page 17: Spark Summit 2014: Spark Streaming for Realtime Auctions

Spark Job Abstractions

Page 18: Spark Summit 2014: Spark Streaming for Realtime Auctions

Job Organization

A Job composes three stages: Source → Transform → Sink

Page 19: Spark Summit 2014: Spark Streaming for Realtime Auctions

Sources

case class BeaconLogLine(
  timestamp: String,
  uri: String,
  beaconType: String,
  pkey: String,
  ckey: String
)

object BeaconLogLine {

  def newDStream(ssc: StreamingContext, inputPath: String): DStream[BeaconLogLine] = {
    ssc.textFileStream(inputPath).map { parseRawBeacon(_) }
  }

  def parseRawBeacon(b: String): BeaconLogLine = {
    ...
  }
}

Slide callouts: case class for pattern matching; generate DStream; encapsulate common operations.
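The body of parseRawBeacon is elided on the slide. A minimal sketch, assuming a comma-separated key=value log format (an assumption; the real Sharethrough format isn't shown):

// Hypothetical parser; assumes lines like
//   "timestamp=..., uri=/strbeacon, type=visible, pkey=p193, ckey=c1".
def parseRawBeacon(b: String): BeaconLogLine = {
  val fields = b.split(",").
    map(_.trim.split("=", 2)).
    collect { case Array(k, v) => k -> v }.
    toMap
  BeaconLogLine(
    timestamp  = fields.getOrElse("timestamp", ""),
    uri        = fields.getOrElse("uri", ""),
    beaconType = fields.getOrElse("type", ""),
    pkey       = fields.getOrElse("pkey", ""),
    ckey       = fields.getOrElse("ckey", "")
  )
}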

Page 20: Spark Summit 2014: Spark Streaming for Realtime Auctions

Transformations

def visibleByPlacement(source: DStream[BeaconLogLine]): DStream[(String, Long)] = {
  source.
    filter(data => {
      data.uri == "/strbeacon" && data.beaconType == "visible"
    }).
    map(data => (data.pkey, 1L)).
    reduceByKey(_ + _)
}

Slide callout: type safety from the case class.
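The Code Reuse slide later calls an engagementsByPlacement transform that isn't defined anywhere in the deck; by analogy with visibleByPlacement it plausibly looks like this (the "engagement" beaconType value is an assumption):

def engagementsByPlacement(source: DStream[BeaconLogLine]): DStream[(String, Long)] = {
  source.
    filter(data => {
      // "engagement" is an assumed beaconType value, not from the talk
      data.uri == "/strbeacon" && data.beaconType == "engagement"
    }).
    map(data => (data.pkey, 1L)).
    reduceByKey(_ + _)
}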

Page 21: Spark Summit 2014: Spark Streaming for Realtime Auctions

Sinks

class RedisSink @Inject()(store: RedisStore) {

  def sink(result: DStream[(String, Long)]) = {
    result.foreachRDD { rdd =>
      rdd.foreach { element =>
        val (key, value) = element
        store.merge(key, value)
      }
    }
  }
}

Slide callout: custom sinks for new stores.

Page 22: Spark Summit 2014: Spark Streaming for Realtime Auctions

Jobs

object ImpressionsForPlacements {

  def run(config: Config, inputPath: String) {
    val conf = new SparkConf().
      setMaster(config.getString("master")).
      setAppName("Impressions for Placement")

    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(5))

    val source = BeaconLogLine.newDStream(ssc, inputPath)
    val visible = visibleByPlacement(source)
    sink(visible)

    ssc.start
    ssc.awaitTermination
  }
}

Slide callouts: source; transform; sink.

Page 23: Spark Summit 2014: Spark Streaming for Realtime Auctions

Advantages?

Page 24: Spark Summit 2014: Spark Streaming for Realtime Auctions

Code Reuse

object PlacementVisibles {
  …
  val source = BeaconLogLine.newDStream(ssc, inputPath)
  val visible = visibleByPlacement(source)
  sink(visible)
  …
}

…

object PlacementEngagements {
  …
  val source = BeaconLogLine.newDStream(ssc, inputPath)
  val engagements = engagementsByPlacement(source)
  sink(engagements)
  …
}

Slide callout: composable jobs.

Page 25: Spark Summit 2014: Spark Streaming for Realtime Auctions

Readability

ssc.textFileStream(inputPath).
  map { parseRawBeacon(_) }.
  filter(data => {
    data._2 == "/strbeacon" && data._3 == "visible"
  }).
  map(data => (data._4, 1L)).
  reduceByKey(_ + _).
  foreachRDD { rdd =>
    rdd.foreach { element =>
      store.merge(element._1, element._2)
    }
  }

?

Page 26: Spark Summit 2014: Spark Streaming for Realtime Auctions

Readability

val source = BeaconLogLine.newDStream(ssc, inputPath)
val visible = visibleByPlacement(source)
redis.sink(visible)

Page 27: Spark Summit 2014: Spark Streaming for Realtime Auctions

Testing

def assertTransformation[T: Manifest, U: Manifest](
  transformation: DStream[T] => DStream[U],
  input: Seq[T],
  expectedOutput: Seq[U]
): Unit = {
  val ssc = new StreamingContext("local[1]", "Testing", Seconds(1))
  val rddQueue = new SynchronizedQueue[RDD[T]]()
  val source = ssc.queueStream(rddQueue)
  val results = transformation(source)

  var output = Array[U]()
  results.foreachRDD { rdd => output = output ++ rdd.collect() }
  ssc.start
  rddQueue += ssc.sparkContext.makeRDD(input, 2)
  Thread.sleep(jobCompletionWaitTimeMillis)
  ssc.stop(true)

  assert(output.toSet === expectedOutput.toSet)
}

Slide callouts: function, input, expectation; test.

Page 28: Spark Summit 2014: Spark Streaming for Realtime Auctions

Testing

test("#visibleByPlacement") {

  val input = Seq(
    "pkey=abcd, …",
    "pkey=abcd, …",
    "pkey=wxyz, …"
  )

  val expectedOutput = Seq( ("abcd", 2L), ("wxyz", 1L) )

  assertTransformation(visibleByPlacement, input, expectedOutput)
}

Slide callout: use our test helper.

Page 29: Spark Summit 2014: Spark Streaming for Realtime Auctions

Other Learnings

Page 30: Spark Summit 2014: Spark Streaming for Realtime Auctions

Other Learnings

• Keeping your driver program healthy is crucial
  • 24/7 operation and monitoring
  • Spark on Mesos? Use Marathon.
• Pay attention to settings for spark.cores.max
  • Monitor the data rate and increase as needed
• Serialization on classes: Java or Kryo (configuration sketch below)
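The slide names the knobs but shows no code. A configuration sketch covering the cores and serialization settings, with illustrative values and a hypothetical registrator class (the com.example package is a placeholder):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator; registers the job's case classes with Kryo.
class BeaconKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[BeaconLogLine])
  }
}

// Illustrative values only.
val conf = new SparkConf().
  set("spark.cores.max", "8").  // tune to the observed data rate
  set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
  set("spark.kryo.registrator", "com.example.BeaconKryoRegistrator")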

Page 31: Spark Summit 2014: Spark Streaming for Realtime Auctions

What’s next?

Page 32: Spark Summit 2014: Spark Streaming for Realtime Auctions

Twitter Summingbird

• Write-once, run anywhere • Supports:

• Hadoop MapReduce • Storm • Spark (maybe?)

Page 33: Spark Summit 2014: Spark Streaming for Realtime Auctions

Amazon Kinesis

[Diagram: web servers, a mobile device, and app logs feed Amazon Kinesis, which fans out to other applications]
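For context, Spark gained a receiver-based Kinesis source in the spark-streaming-kinesis-asl module around Spark 1.1. A sketch, with placeholder stream name and endpoint; note the exact createStream signature has changed across Spark versions:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// Sketch only: stream name and endpoint are placeholders.
val kinesisStream = KinesisUtils.createStream(
  ssc,                                        // existing StreamingContext
  "beacon-stream",                            // Kinesis stream name (placeholder)
  "https://kinesis.us-east-1.amazonaws.com",  // regional endpoint
  Seconds(5),                                 // checkpoint interval
  InitialPositionInStream.LATEST,
  StorageLevel.MEMORY_AND_DISK_2)
// kinesisStream is a DStream[Array[Byte]]; decode and parse into
// domain types (e.g. BeaconLogLine) before applying transforms.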

Page 34: Spark Summit 2014: Spark Streaming for Realtime Auctions

Thanks!

