Data Stream Processing - Amazon S3 · • High performance message store: Kafka • Ubiquity of...

Date post: 22-Aug-2020
Data Stream Processing Can we finally forget the batches?
Transcript
Page 1: Data Stream Processing - Amazon S3 · • High performance message store: Kafka • Ubiquity of high performance distributed stream processing: Storm, Samza, Kafka Streams, Spark

Data Stream Processing: Can we finally forget the batches?

Page 2

Who am I?

Dominik Wagenknecht
Senior Technology Architect
Accenture, Vienna / Austria

Dealing with data in many industries

Page 3

Data needs to move!
A → B

Page 4

To get data where it’s needed

Monolith with a gargantuan DB:
• Teams colliding
• Low agility
• JOIN everything!

Services, each with its own DB:
• Per team
• High agility
• Data as needed

Page 5

To get smarter

Service / Systems → Data Warehouse → Reporting, Analytics, Insight

• Systems: every system does its job
• Data Warehouse: tells you what to do better

Page 6

= to integrate services

Page 7

Let’s do Batch!
(oldie but goldie)

Page 8

So how does batch work?

Sources → Extract → Processing (Merge, Transform, Enrich) → Load → Target

every night

Page 9

Speed it up?!? Delta-batches…

every hour → ½ hour → ¼ hour

Source → ETL → Target
Source → ETL → Target
Source → ETL → Target

enjoy the fun when batches overlap

Page 10

Batch is not enough

• It’s basically always too late*
• Bumpy load patterns >> shockwaves in the system
• Mostly in the night
• Testing becomes painful

*exceptions: on-purpose timings like interest rate calculations, monthly billing, etc…

Page 11

You cannot go from batch to stream!

Page 12

What does the stream look like?

Sources → Event-by-Event Processing → Target
(there is often some state here!)

continuously
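A minimal sketch of what event-by-event processing with state can look like, assuming events are simple `(key, amount)` pairs. The event shape and the name `running_totals` are hypothetical and chosen only for illustration; real stream processors keep this state in a fault-tolerant store rather than in a plain dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, float]  # (key, amount): hypothetical event shape


def running_totals(events: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Consume events one at a time and emit the updated total per key."""
    totals: Dict[str, float] = defaultdict(float)  # the state kept between events
    for key, amount in events:
        totals[key] += amount
        # A result is emitted per event instead of once per nightly run.
        yield key, totals[key]


if __name__ == "__main__":
    source = [("a", 10.0), ("b", 5.0), ("a", 2.5)]
    for key, total in running_totals(source):
        print(key, total)
```

Because the function is a generator, it works unchanged on an unbounded source: each result is available as soon as its event has been seen.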

Page 13

But you can go from stream to batch

Sources → Event-by-Event Processing → Target
(there is often some state here!)

continuously

Stream → Extract → Load (batch)
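Going from stream to batch can be as simple as grouping consecutive events into chunks before loading them. The sketch below assumes an in-process iterable as the stream; `to_batches` is a hypothetical helper name, not part of any streaming framework.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def to_batches(stream: Iterable[T], size: int) -> Iterator[List[T]]:
    """Group a continuous stream into lists of at most `size` events."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, size))  # take the next `size` events
        if not batch:
            return  # stream exhausted
        yield batch


if __name__ == "__main__":
    for batch in to_batches(range(5), size=2):
        print(batch)
```

The reverse direction has no such shortcut: a nightly extract has already thrown away the timing of individual events.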

Page 14

Why now?

Page 15

Why now?

• Message queues have existed forever
• It’s a cost and efficiency thing

What changed?
• LinkedIn / Netflix & friends
• Transaction guarantees of classic MQs not needed
• High performance message store: Kafka
• Ubiquity of high performance distributed stream processing: Storm, Samza, Kafka Streams, Spark Streaming, Heron, Flink, …

Page 16

Performance & Transactional Guarantees

Source → Classic MQ → Target

Fully transactional: MQ keeps track of all messages and transactions from all sources

Fully transactional: MQ needs to track the state of every message
• re-available after timeout
• back-out queues on rollback, etc…
• look-ups by correlation ID, etc…

Page 17

Enter the simple distributed log

Source → Log-File(s) (oldest data … latest data) → Target

• Writing a message just appends it to one of the log files. The message essentially has a file position as its index.
• Reading essentially starts pulling at an index and just keeps reading forward.

Challenge: Position tracking
• Kafka helps with that
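The append-and-read-forward idea can be sketched in a few lines. This is an in-memory toy, not Kafka's implementation or API: `AppendOnlyLog` and `Reader` are hypothetical names, and a real log persists to disk and stores reader positions durably.

```python
from typing import List, Tuple


class AppendOnlyLog:
    """In-memory stand-in for one log file (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, message: str) -> int:
        """Append at the end and return the message's position (offset)."""
        self._records.append(message)
        return len(self._records) - 1

    def read_from(self, offset: int, max_records: int = 100) -> List[Tuple[int, str]]:
        """Return up to `max_records` (offset, message) pairs starting at `offset`."""
        if offset < 0:
            raise ValueError("offset must not be negative")
        chunk = self._records[offset:offset + max_records]
        return list(enumerate(chunk, start=offset))


class Reader:
    """A reader owns its position; the log tracks nothing per reader."""

    def __init__(self, log: AppendOnlyLog, start: int = 0) -> None:
        self._log = log
        self.position = start

    def poll(self) -> List[str]:
        records = self._log.read_from(self.position)
        if records:
            self.position = records[-1][0] + 1  # move past what was read
        return [message for _, message in records]


if __name__ == "__main__":
    log = AppendOnlyLog()
    log.append("first")
    log.append("second")
    reader = Reader(log)
    print(reader.poll(), reader.position)
```

The only per-reader state is a single integer, which is why position tracking is the one piece that needs help from the platform.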

Page 18

Consequences of the distributed log

We lose
• Full transaction support
• Lookups by correlation ID, etc…

We win
• Very high throughput
• No overfilled queues, so we can batch into it!
• Strict ordering per log-file/partition*
• Multiple target systems can read independently
• Very simple testing

*which is quite useful given we’re replacing batch...
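Two of the wins, strict ordering per partition and independent readers, can be illustrated with a toy partitioned log. Everything here is an assumption for illustration: `PartitionedLog` and `Consumer` are hypothetical names, and CRC32 stands in for whatever stable hash a real system uses to pick a partition.

```python
import zlib
from typing import List, Tuple

Record = Tuple[str, str]  # (key, value)


class PartitionedLog:
    """Toy log split into partitions; each partition is append-only."""

    def __init__(self, partition_count: int) -> None:
        self._partitions: List[List[Record]] = [[] for _ in range(partition_count)]

    @property
    def partition_count(self) -> int:
        return len(self._partitions)

    def partition_for(self, key: str) -> int:
        # Stable hash: the same key always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self._partitions)

    def append(self, key: str, value: str) -> None:
        self._partitions[self.partition_for(key)].append((key, value))

    def read(self, partition: int, offset: int) -> List[Record]:
        return self._partitions[partition][offset:]


class Consumer:
    """Each consumer keeps its own offsets, so consumers never interfere."""

    def __init__(self, log: PartitionedLog) -> None:
        self._log = log
        self._offsets = [0] * log.partition_count

    def poll(self) -> List[Record]:
        out: List[Record] = []
        for partition in range(self._log.partition_count):
            records = self._log.read(partition, self._offsets[partition])
            self._offsets[partition] += len(records)
            out.extend(records)
        return out


if __name__ == "__main__":
    log = PartitionedLog(3)
    for i in range(3):
        log.append("order-1", f"step-{i}")
    reporting, billing = Consumer(log), Consumer(log)
    print(reporting.poll())
    print(billing.poll())  # same records; billing has its own offsets
```

Ordering holds per key because a key maps to exactly one partition; across partitions no order is promised.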

Page 19

Summary

Page 20

Technologies in play

Source → Event processing → Target

Source
• DB: change-data-capture or batch export :-)
• System: data feed

Target
• DB: just insert (with decent commit-size)
• System: REST calls

Page 21

What should you take away from this?

• Think differently: streaming-first
• Go idempotent to keep it simple
• Partition to go fast & ordered
• Establish governance & standards as you go*

*data formats, naming conventions, operations, …
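"Go idempotent" means a target can see the same event twice and still end up in the same state, which matters because a log without full transactions may redeliver. One common way to get there, sketched under the assumption that every event carries a key and an increasing version number (both hypothetical here, as is the name `IdempotentStore`):

```python
from typing import Dict, Tuple

Event = Tuple[str, int, str]  # (key, version, payload): hypothetical event shape


class IdempotentStore:
    """Target whose state does not change when an event is replayed."""

    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[int, str]] = {}

    def apply(self, event: Event) -> bool:
        """Upsert the event; return False if it was a duplicate or stale."""
        key, version, payload = event
        current = self._rows.get(key)
        if current is not None and current[0] >= version:
            return False  # already applied: ignore the redelivery
        self._rows[key] = (version, payload)
        return True

    def snapshot(self) -> Dict[str, str]:
        return {key: payload for key, (_, payload) in self._rows.items()}


if __name__ == "__main__":
    store = IdempotentStore()
    events = [("cust-1", 1, "new"), ("cust-1", 2, "active")]
    for event in events + events:  # replay everything once more
        store.apply(event)
    print(store.snapshot())
```

With a target like this, recovering from a failure is just rereading from an earlier position; no rollback logic is needed.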

Page 22

Thank you
Questions?