
ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Date post: 07-Apr-2017
Upload: johann-schleier-smith
Transcript
Page 1: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

October 6, 2016

Johann Schleier-Smith, Erik T. Krogen, Joseph M. Hellerstein UC Berkeley

@jssmith @joe_hellerstein

Page 2: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Overview

• Motivations for backtesting and stream replay

• Alternatives for scaling throughput

• ReStream and Multi-Version Parallel Streaming (MVPS)

• Evaluation

Page 3: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Research Motivation

• Operating Tagged and hi5 social networks

• >300 million users registered

• Millions of daily active users

Practical Pains

Curiosity

Page 4: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Real-Time Spam Detection for Dating Product

• >10 million active accounts

• >1000 updates/sec

• Must respond to current activity

• Require near-instant decisions

Page 5: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Real-Time Spam Detection for Dating Product

• Facts recorded in event log

• Real-time stream-processing

• Need to evaluate new ideas quickly, e.g., simulate model using data of past 30 days in under 10 minutes

Page 6: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Replay lets Agile developers ask

powerful tool for creating and enhancing streaming applications

Page 7: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

When latency matters… streaming shines

• Spam detection

• Payment fraud

• Money laundering

• Real-time recommendations

• Ad serving

• Dynamic pricing and inventory management for e-commerce, car-services, etc.

• Financial trading

• Industrial monitoring

• IoT applications

• And more

Page 8: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Research Motivation

• Operating Tagged and hi5 social networks

• >300 million users registered

• Millions of daily active users

Practical Pains

Curiosity

Given a program that processes an ordered log sequentially, how can we achieve parallel speedup?

Page 9: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Serial-Equivalent Parallel Replay

[Figure: ordered log of events 1–5]

Page 10: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Serial-Equivalent Parallel Replay

[Figure: ordered log of events 1–5 and a box labeled Program]

Page 11: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Serial-Equivalent Parallel Replay

[Figure: ordered log of events 1–10, a box labeled Program, and labels A and B]

Page 12: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Serial-Equivalent Parallel Replay

[Figure: ordered log of events 1–11, a box labeled Program, and labels A and B]

Page 13: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Serial-Equivalent Parallel Replay

[Figure: ordered log of events 2–15, a box labeled Program, labels A, B, and C, and times t=4, t=5, t=9]

Page 14: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Serial-Equivalent Parallel Replay

[Figure: two boxes, each labeled Program*]

Page 15: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

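Serial-Equivalent Parallel Replay

[Figure: the log split into two streams, events 1, 3, 5, 7, 9 and events 2, 4, 6, 8, 10, with two boxes labeled Program* and labels A, B, and C]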

Page 16: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Serial-Equivalent Parallel Replay

[Figure: the log split into two longer streams (events up to 25 and up to 26), with two boxes labeled Program*, labels A, B, and C, and times t=4, t=5, t=9]

Page 17: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Serial-Equivalent Parallel Replay

• Deterministic output

• More restrictive than transaction serializability

• Partition the input between multiple parallel programs

• Obtain same output as from one program

Page 18: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Developers’ Accelerated Replay Wish List

• Semantics of sequential operations with mutable state

• Full fine-grained temporal resolution

• Process months in minutes: 10,000x real-time rate

Want serial-equivalent parallel replay
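
The rate on the wish list can be related to the earlier goal of simulating 30 days in under 10 minutes with simple arithmetic. The sketch below is illustrative only: `ReplayRate` and its method names are hypothetical and do not appear in the slides, and the 60-day history in the comments is an assumed example.

```scala
// Illustrative arithmetic only; ReplayRate and its methods are hypothetical names.
object ReplayRate {
  private val MinutesPerDay = 24.0 * 60.0

  /** Speedup over real time needed to replay `historyDays` within `replayMinutes`. */
  def speedup(historyDays: Double, replayMinutes: Double): Double =
    historyDays * MinutesPerDay / replayMinutes

  /** Minutes needed to replay `historyDays` at `factor` times the real-time rate. */
  def minutesAt(historyDays: Double, factor: Double): Double =
    historyDays * MinutesPerDay / factor
}

// Replaying 30 days in 10 minutes needs a 4,320x real-time rate:
//   ReplayRate.speedup(30, 10) == 4320.0
// An assumed 60 days of history at 10,000x takes about 8.64 minutes:
//   ReplayRate.minutesAt(60, 10000)
```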

Page 19: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Workload Assumptions

• Total order provided by log

• Abundant cloud resources available

• Per-event latency not a concern

Page 20: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Possible Solutions

Page 21: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Streaming Databases

Examples

• StreamBase / Aurora

• Truviso / TelegraphCQ

• Recent startups - PipelineDB - RethinkDB

• (+/–) Query interface derived from SQL

• (+) Set-oriented approach allows query plan optimization, parallelism and reordering

• (–) Some programs can be difficult to express

• (–) Most systems emphasize latency over replay throughput

Page 22: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

OLTP Databases

Examples

• PostgreSQL

• IBM DB2

• MS SQL Server

• Oracle

• (+/–) SQL interface

• (+) Robust high-performance implementations

• (–) Need to coordinate parallel replay programs

• (–) Transactional serializability gives weaker consistency than serial-equivalence

Page 23: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Parallel Big Data Systems

Examples

• Hadoop

• Apache Spark Streaming

• Lambda architecture

• (+) Routinely delivers desired log-processing throughput

• (+) Easy to integrate arbitrary functions

• (–) MapReduce foundation does not lend itself naturally to sequential processing

• (–) Throughput and program semantics may be linked

Page 24: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Other Systems

• Other streaming: Google MillWheel, Yahoo S4, Apache Storm, Twitter Heron, Apache Flink, Apache Samza, Walmart MUP8

• Deterministic databases: Calvin, Bohm

• Transactional: VoltDB / S-Store

• Complex Event Processing: Esper, Tibco, JBoss

• Other recent systems: Trill, Naiad, Google Cloud Dataflow

Page 25: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

ReStream

Page 26: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Challenge: serial equivalence and parallelism

Observation: causal dependencies are often sparse

• Consequence of input data

• Suggests opportunity for parallelism

• Can we maintain order when necessary, but not necessarily otherwise?

Page 28: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Multi-Versioned State

SET(timestamp=10, key=x, value=3)
SET(timestamp=20, key=x, value=5)

Stored versions: x=3@t=10, x=5@t=20

GET(timestamp=11, key=x) → 3
GET(timestamp=15, key=x) → 3
GET(timestamp=25, key=x) → 5   (x=5@t=25)

SET(timestamp=21, key=x, value=7)   ✗
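
The SET and GET calls on this slide can be reproduced with a small timestamp-indexed map. What follows is a minimal sketch, not the ReStream implementation: the class name `MultiVersionedMap` is hypothetical, it assumes Scala 2.13 or later (for `rangeTo`), and it assumes a read returns the latest version written at or before the read timestamp. It does not track reads, so it would not detect a write such as SET(timestamp=21, ...) arriving after a read at timestamp 25 has already been served.

```scala
import scala.collection.immutable.TreeMap

// Hypothetical illustration of timestamped, multi-versioned state.
final class MultiVersionedMap[K, V] {
  // For each key, all written versions sorted by timestamp.
  private var versions = Map.empty[K, TreeMap[Long, V]]

  /** Record `value` as the version of `key` written at `timestamp`. */
  def set(timestamp: Long, key: K, value: V): Unit = {
    val history = versions.getOrElse(key, TreeMap.empty[Long, V])
    versions = versions.updated(key, history.updated(timestamp, value))
  }

  /** Latest version of `key` written at or before `timestamp` (an assumption). */
  def get(timestamp: Long, key: K): Option[V] =
    versions.get(key).flatMap(_.rangeTo(timestamp).lastOption).map(_._2)
}
```

With versions written at timestamps 10 and 20, reads at 11 and 15 see the first version and a read at 25 sees the second, matching the arrows on the slide.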

Page 29: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Social Network Anti-Spam Example

sender has sent 2x as many messages to non-friends as to friends AND

> 20% of messages sent from IP contain e-mail address

⇒ message is spam
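
As a sketch of how this rule could be written as a predicate over four counters: the object name `SpamRule`, the parameter names, and the choice of strict inequalities (more than 2x, more than 20%) are assumptions for illustration, since the slide does not spell them out.

```scala
// Hypothetical predicate for the rule on the slide; thresholds are read as strict.
object SpamRule {
  def isSpam(friendMsgs: Long, nonfriendMsgs: Long, ipMsgs: Long, ipEmailMsgs: Long): Boolean = {
    // Sender has sent more than twice as many messages to non-friends as to friends.
    val mostlyNonFriends = nonfriendMsgs > 2 * friendMsgs
    // More than 20% of messages from this IP contain an e-mail address
    // (integer form of ipEmailMsgs / ipMsgs > 0.2).
    val emailHeavyIp = ipEmailMsgs * 5 > ipMsgs
    mostlyNonFriends && emailHeavyIp
  }
}
```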

Page 30: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Social Network Anti-Spam Example

Express program in four pieces

A. Track friendships

B. Track how often user sends to friends / non-friends

C. Track how often ip address sends text containing e-mail

D. For each message, check B and C to label spam

Page 31: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Sample Code

// A
{ e: NewFriendshipEvent =>
}

Page 32: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Sample Code

// A
{ e: NewFriendshipEvent =>
    val userPair = (e.userIdA, e.userIdB)
    friendships.merge(e.timestamp, userPair, _ => true)
}

Page 33: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Sample Code

// A
{ e: NewFriendshipEvent =>
    val userPair = (e.userIdA, e.userIdB)
    friendships.merge(e.timestamp, userPair, _ => true)
}

Page 34: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Sample Code

// A
{ e: NewFriendshipEvent =>
    val userPair = (e.userIdA, e.userIdB)
    friendships.merge(e.timestamp, userPair, _ => true)   // WRITE
}

Page 35: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Sample Code

friendships.merge(timestamp, key, value)   // WRITE

Page 36: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Sample Code

// A
{ e: NewFriendshipEvent =>
    val userPair = (e.userIdA, e.userIdB)
    friendships.merge(e.timestamp, userPair, _ => true)   // WRITE
}

// B
{ e: MessageEvent =>
}

Page 37: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Sample Code

// A
{ e: NewFriendshipEvent =>
    val userPair = (e.userIdA, e.userIdB)
    friendships.merge(e.timestamp, userPair, _ => true)   // WRITE
}

// B
{ e: MessageEvent =>
    val userPair = (e.senderId, e.recipientId)
    if (friendships.get(e.timestamp, userPair)) {          // READ
        friendMsgs.merge(e.timestamp, e.senderId, _ + 1)
    } else {
        nonfriendMsgs.merge(e.timestamp, e.senderId, _ + 1)
    }
}

Page 38: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Sample Code

// A
{ e: NewFriendshipEvent =>
    val userPair = (e.userIdA, e.userIdB)
    friendships.merge(e.timestamp, userPair, _ => true)   // WRITE
}

// B
{ e: MessageEvent =>
    val userPair = (e.senderId, e.recipientId)
    if (friendships.get(timestamp, key)) {                 // READ, shown in its general form get(timestamp, key)
        friendMsgs.merge(e.timestamp, e.senderId, _ + 1)
    } else {
        nonfriendMsgs.merge(e.timestamp, e.senderId, _ + 1)
    }
}

Page 39: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Sample Code

// A
{ e: NewFriendshipEvent =>
    val userPair = (e.userIdA, e.userIdB)
    friendships.merge(e.timestamp, userPair, _ => true)   // WRITE
}

// B
{ e: MessageEvent =>
    val userPair = (e.senderId, e.recipientId)
    if (friendships.get(e.timestamp, userPair)) {          // READ
        friendMsgs.merge(e.timestamp, e.senderId, _ + 1)   // WRITE
    } else {
        nonfriendMsgs.merge(e.timestamp, e.senderId, _ + 1)   // WRITE
    }
}
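
To try operators A and B outside the slides, they can be run against a small stand-in for the timestamped state. The following is a sketch under stated assumptions rather than ReStream's API: the `Versioned` class, its default-value parameter, the event case classes, and the `process` function are hypothetical, reads are assumed to see the latest write at or before the given timestamp, and Scala 2.13 or later is assumed for `rangeTo`. As on the slide, the friendship key is the pair in the order given, with no normalization.

```scala
import scala.collection.immutable.TreeMap

object SampleOperators {
  // Hypothetical stand-in for multi-versioned state with get/merge.
  final class Versioned[K, V](default: V) {
    private var versions = Map.empty[K, TreeMap[Long, V]]

    // Latest value written at or before `ts`, or the default if none.
    def get(ts: Long, key: K): V =
      versions.get(key).flatMap(_.rangeTo(ts).lastOption).map(_._2).getOrElse(default)

    // Write f(previous value) as the version at `ts`.
    def merge(ts: Long, key: K, f: V => V): Unit = {
      val history = versions.getOrElse(key, TreeMap.empty[Long, V])
      versions = versions.updated(key, history.updated(ts, f(get(ts, key))))
    }
  }

  sealed trait Event { def timestamp: Long }
  final case class NewFriendshipEvent(timestamp: Long, userIdA: Long, userIdB: Long) extends Event
  final case class MessageEvent(timestamp: Long, senderId: Long, recipientId: Long) extends Event

  val friendships = new Versioned[(Long, Long), Boolean](false)
  val friendMsgs = new Versioned[Long, Long](0L)
  val nonfriendMsgs = new Versioned[Long, Long](0L)

  def process(event: Event): Unit = event match {
    case e: NewFriendshipEvent => // operator A: WRITE friendships
      val userPair = (e.userIdA, e.userIdB)
      friendships.merge(e.timestamp, userPair, _ => true)
    case e: MessageEvent => // operator B: READ friendships, WRITE a counter
      val userPair = (e.senderId, e.recipientId)
      if (friendships.get(e.timestamp, userPair)) friendMsgs.merge(e.timestamp, e.senderId, _ + 1)
      else nonfriendMsgs.merge(e.timestamp, e.senderId, _ + 1)
  }
}
```

This sketch processes events one at a time in log order; it shows the read and write pattern of the two operators, not the parallel execution strategy.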

Page 40: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

[Figure: operators A, B, C, and D connected to shared state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs) by five read (R) and five write (W) edges]

Page 41: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

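Topological sort

[Figure: operators A, B, C, and D arranged in a row, connected to shared state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs) by five read (R) and five write (W) edges]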
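
The ordering step can be sketched by deriving writer-to-reader edges from each operator's read and write sets and then sorting the operators. The read and write sets below are inferred from the four-piece description of the example, not copied from the figure, and `OperatorOrder` with its members is a hypothetical name; the sort is a simple repeated scan rather than any particular algorithm used by ReStream.

```scala
object OperatorOrder {
  val operators: List[String] = List("A", "B", "C", "D")

  // Assumed state written by each operator.
  val writes: Map[String, Set[String]] = Map(
    "A" -> Set("friendships"),
    "B" -> Set("friendMsgs", "nonfriendMsgs"),
    "C" -> Set("ipMsgs", "ipEmailMsgs"),
    "D" -> Set.empty[String]
  )

  // Assumed state read by each operator.
  val reads: Map[String, Set[String]] = Map(
    "A" -> Set.empty[String],
    "B" -> Set("friendships"),
    "C" -> Set.empty[String],
    "D" -> Set("friendMsgs", "nonfriendMsgs", "ipMsgs", "ipEmailMsgs")
  )

  // An edge (w, r) means operator r reads state that operator w writes.
  val edges: List[(String, String)] =
    for {
      w <- operators
      r <- operators
      if w != r && (writes(w) & reads(r)).nonEmpty
    } yield (w, r)

  /** Order nodes so that every writer precedes its readers. */
  def topologicalOrder(nodes: List[String], deps: List[(String, String)]): List[String] = {
    def loop(remaining: List[String], done: List[String]): List[String] =
      if (remaining.isEmpty) done.reverse
      else {
        // Pick the first node whose writers have all been placed already.
        val ready = remaining.find { n =>
          deps.forall { case (from, to) => to != n || done.contains(from) }
        }
        ready match {
          case Some(n) => loop(remaining.filterNot(_ == n), n :: done)
          case None    => throw new IllegalArgumentException("dependency cycle")
        }
      }
    loop(nodes, Nil)
  }
}
```

Under these assumed sets the edges are A to B, B to D, and C to D, so A B C D is one valid order.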

Page 42: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Reading from log

[Figure: operators A, B, C, D over shared state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs); events 1–2 shown; one read (R) marked]

Page 43: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Reading from log, writing shared state

[Figure: operators A, B, C, D over shared state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs); events 1–4 shown; one write (W) and one read (R) marked]

Page 44: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Loose coupling

[Figure: operators A, B, C, D over shared state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs); events 1–6 shown; two reads (R) and one write (W) marked]

Page 45: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Loose coupling

[Figure: operators A, B, C, D over shared state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs); events 1–9 shown]

Page 46: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Must respect dependencies

[Figure: operators A, B, C, D over shared state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs); events 1–9 shown; arrangement marked NO]

Page 47: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Loose coupling

[Figure: operators A, B, C, D over shared state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs); events 1–9 shown; arrangement marked OK]

Page 48: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Loose coupling

[Figure: operators A, B, C, D over shared state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs); events 1–11 shown; arrangement marked OK]

Page 49: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Out-of-order processing

[Figure: operators A, B, C, D over shared state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs); events 1–13 shown; arrangement marked OK]

Page 50: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Out-of-order processing

[Figure: operators A, B, C, D over shared state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs); events 2–14 shown; arrangement marked OK]

Page 51: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

MVPS: Multi-version Parallel Streaming

Page 52: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

MVPS

[Figure: two instances of operators A, B, C, D sharing state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs)]

Page 53: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

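MVPS

[Figure: two instances of operators A, B, C, D sharing state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs); the log is split into alternating events, 1, 3, 5, … for one instance and 2, 4, 6, … for the other]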

Page 54: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Mini-batches for MVPS

[Figure: events 1–10 grouped into one mini-batch, followed by individual events 11–18]

Page 55: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Mini-batches for MVPS

[Figure: two mini-batches, events 1–10 and events 11–20]

Page 56: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

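MVPS with mini-batches

[Figure: two instances of operators A, B, C, D sharing state (nonfriendMsgs, friendMsgs, friendships, ipEmailMsgs, ipMsgs). One instance receives mini-batches 1–10, 21–30, 41–50, 61–70, 81–90, 101–110, 121–130; the other receives 11–20, 31–40, 51–60, 71–80, 91–100, 111–120, and a batch ending at 140]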
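
The batch layout in this figure (1–10 and 21–30 on one side, 11–20 and 31–40 on the other) is consistent with cutting the ordered log into fixed-size mini-batches and dealing them out round-robin. The sketch below illustrates that reading; `MiniBatches.assign` is a hypothetical helper, events are represented as plain integers, and round-robin assignment is an assumption rather than something the slides state.

```scala
object MiniBatches {
  /** Cut `events` into batches of `batchSize` and deal them round-robin to `workers`. */
  def assign(events: Seq[Int], batchSize: Int, workers: Int): Map[Int, Seq[Seq[Int]]] =
    events
      .grouped(batchSize)   // consecutive mini-batches in log order
      .zipWithIndex         // pair each batch with its position in the log
      .toSeq
      .groupBy { case (_, index) => index % workers }
      .map { case (worker, batches) => worker -> batches.map(_._1) }
}
```

For example, `MiniBatches.assign(1 to 40, 10, 2)` gives worker 0 the batches starting at events 1 and 21, and worker 1 the batches starting at 11 and 31.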

Page 57: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Multi-Versioned Parallel Streaming (MVPS)

• Partitioned parallel dataflow

• Input events passed to all operators

• Globally shared multi-versioned state

• Logical timestamps referenced throughout computation

• Analyze DAG of operator potential read-write dependency

• May use mini-batches to amortize coordination

• Serial-equivalent semantics

Page 58: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Evaluation

Page 59: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

ReStream Evaluation Aims

• Demonstrate parallel speedup vs. single-thread (COST)

• Compare to alternative systems

• Understand limits to parallelism

Page 60: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

ReStream Evaluation Workload

• Simulated social network spam detection

• Structure of read-write dependency graph linked to structure of social network

• Can tune workload characteristics by generating different social graphs

Page 61: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

ReStream Evaluation Workload

• Simulated social network spam detection

• Structure of read-write dependency graph linked to structure of social network

• Can tune workload characteristics by generating different social graphs

[Figure: two example social graphs, one with uniform degree distribution and one with skewed degree distribution]

Page 62: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Scaling Throughput

[Figure: throughput (events/s, 0 to 600,000) versus number of hosts (1, 2, 4, 8, 16, 32); execution engines: ReStream, MVPS on Spark, 1-Thread]

Page 63: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Modeling Performance

• Greater parallel speedup possible when there are fewer read-write dataflow dependencies

• Track reads and writes of global state, compute critical path length along chained dependencies
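
One way to make the critical-path idea concrete is to walk the log in order and give each event a depth one greater than the deepest earlier write it reads. The sketch below is a simplified illustration: `CriticalPath`, `Access`, and the restriction to read-after-write dependencies on string keys are assumptions, and it is not the performance model used in the evaluation.

```scala
object CriticalPath {
  // Keys of global state that one event reads and writes.
  final case class Access(reads: Set[String], writes: Set[String])

  /** Length of the longest chain of read-after-write dependencies in `log`. */
  def length(log: Seq[Access]): Int = {
    val start = (Map.empty[String, Int], 0)
    val (_, longest) = log.foldLeft(start) { case ((lastWriteDepth, best), access) =>
      // Depths of the most recent writers of every key this event reads.
      val readDepths = access.reads.toSeq.flatMap(key => lastWriteDepth.get(key))
      val depth = 1 + (0 +: readDepths).max
      // This event becomes the latest writer of the keys it writes.
      val updated = lastWriteDepth ++ access.writes.map(key => key -> depth)
      (updated, math.max(best, depth))
    }
    longest
  }
}
```

A log in which every event touches different keys has critical path length 1 and is fully parallel under this model, while a log in which each event reads what the previous one wrote has a critical path as long as the log.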

Page 64: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

[Figure: example graphs ranging from uniform degree distribution to skewed degree distribution]

Parametrized by α

Page 65: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Modeling Performance

[Figure: throughput (events/s, 0 to 300,000) versus α (1.5 to 3.0) for 2, 4, 8, and 16 hosts; regions of α labeled Web out-links, Photo Sharing, Social Networks, and Web in-links]

R² = 0.94

Per-host batch size 2,500-40,000 events (10,000 shown)

Fit gives

Page 66: ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

ReStream Summary

• Serial-equivalent results from parallel replay

• Throughput much greater than real-time rate

• MVPS consistency: Multi-Versioned Parallel Streaming
  - Analyze for potential read-write dependencies
  - Timestamped multi-versioned state
  - Track logical time at runtime

• Also may apply to online stream processing and deterministic databases

@jssmith @joe_hellerstein

This work was supported in part by AWS Cloud Credits for Research

