
Data Stream Processing
Can we finally forget the batches?

Who am I?

Dominik Wagenknecht
Senior Technology Architect
Accenture Vienna / Austria

Dealing with data in many industries

Data needs to move! A → B

To get data where it’s needed

Monolith with one gargantuan DB: teams colliding, low agility, JOIN everything!

Separate services, each with its own DB: high agility per team, data as needed.

To get smarter

Systems → Data Warehouse → Reporting / Analytics / Insight

Every system does its job; the Data Warehouse tells you what to do better.

Data needs to move = to integrate services

Let's do Batch! (oldie but goldie)

So how does batch work?

Sources → Extract → Processing (Merge, Transform, Enrich) → Load → Target

Runs every night.
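To make the nightly pattern concrete, here is a minimal sketch in plain JDBC. The connection strings, tables, and columns (customers, orders, fact_orders) are made-up assumptions for illustration, not from the talk.

```java
// Nightly batch sketch: extract from the source, merge/transform/enrich,
// then load into the target warehouse. Typically run once per day by a scheduler.
import java.sql.*;
import java.util.*;

public class NightlyBatch {
    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection("jdbc:postgresql://source/db");
             Connection tgt = DriverManager.getConnection("jdbc:postgresql://target/dwh")) {

            // Extract the customer dimension once, to enrich each order with it
            Map<Long, String> customerSegment = new HashMap<>();
            try (ResultSet rs = src.createStatement()
                    .executeQuery("SELECT id, segment FROM customers")) {
                while (rs.next()) customerSegment.put(rs.getLong(1), rs.getString(2));
            }

            // Extract yesterday's orders, transform/enrich, and load the result
            try (ResultSet rs = src.createStatement().executeQuery(
                     "SELECT customer_id, amount FROM orders WHERE order_date = CURRENT_DATE - 1");
                 PreparedStatement ins = tgt.prepareStatement(
                     "INSERT INTO fact_orders(customer_id, segment, amount) VALUES (?, ?, ?)")) {
                while (rs.next()) {
                    long customerId = rs.getLong(1);
                    ins.setLong(1, customerId);
                    ins.setString(2, customerSegment.getOrDefault(customerId, "unknown"));
                    ins.setBigDecimal(3, rs.getBigDecimal(2));
                    ins.addBatch();
                }
                ins.executeBatch();
            }
        }
    }
}
```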

Speed it up?!? Delta-batches…

every hour → ½ hour → ¼ hour

Source → ETL → Target (the same delta pipeline, just run more and more often)

Enjoy the fun when batches overlap. A sketch of such a delta batch follows.
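A minimal sketch of one such delta batch, assuming a hypothetical last_modified column on the source table and a local watermark file carried between runs. If a run takes longer than the schedule interval, the next run starts before this one has finished loading, which is exactly the overlap problem above.

```java
// Delta batch sketch: only pull rows modified since the last successful run.
import java.nio.file.*;
import java.sql.*;
import java.time.Instant;

public class DeltaBatch {
    public static void main(String[] args) throws Exception {
        Path watermarkFile = Paths.get("watermark.txt");
        Instant since = Files.exists(watermarkFile)
                ? Instant.parse(Files.readString(watermarkFile).trim())
                : Instant.EPOCH;
        Instant until = Instant.now();

        try (Connection src = DriverManager.getConnection("jdbc:postgresql://source/db");
             PreparedStatement q = src.prepareStatement(
                 "SELECT id, payload FROM orders WHERE last_modified >= ? AND last_modified < ?")) {
            q.setTimestamp(1, Timestamp.from(since));
            q.setTimestamp(2, Timestamp.from(until));
            try (ResultSet rs = q.executeQuery()) {
                while (rs.next()) {
                    // transform + load each changed row into the target here
                }
            }
        }

        // Advance the watermark only after the load succeeded.
        Files.writeString(watermarkFile, until.toString());
    }
}
```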

Batch is not enough

• It's basically always too late*
• Bumpy load-patterns >> shockwaves in the system
• Mostly in the night
• Testing becomes painful

*exceptions: on-purpose timings like interest rate calculations, monthly billing, etc.

You cannot go from batch to stream!

How does the stream look?

Source(s) → Event-by-Event Processing (there is often some state here!) → Target, continuously
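As an illustration, here is a minimal sketch of continuous, event-by-event processing with a bit of state, using the plain Apache Kafka Java client. The topic names, the per-customer count, and the String serialization are assumptions for illustration only.

```java
// Consume events one by one, keep a little state, and emit results downstream.
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.*;
import java.time.Duration;
import java.util.*;

public class EventByEvent {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "order-counter");
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pp = new Properties();
        pp.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pp.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pp.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        Map<String, Long> ordersPerCustomer = new HashMap<>();   // the "some state here"

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            consumer.subscribe(List.of("orders"));
            while (true) {                                        // runs continuously, not nightly
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    long count = ordersPerCustomer.merge(rec.key(), 1L, Long::sum);
                    producer.send(new ProducerRecord<>("order-counts", rec.key(), Long.toString(count)));
                }
            }
        }
    }
}
```

In real setups that in-memory state would live in a framework such as Kafka Streams or Flink so it survives restarts; the loop above only shows the shape of the processing.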

But you can go from stream to batch

Source(s) → Event-by-Event Processing (there is often some state here!) → Target, continuously

Off the same stream you can additionally Extract → Load into a batch target wherever one is still needed (a sketch follows).
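A minimal sketch of such an extract, assuming a Kafka topic as the stream and a flat CSV file as the hand-over to an existing batch Load. The consumer group remembers its position, so each run only drains what arrived since the previous run.

```java
// Drain the stream since the last run into a file a classic batch Load can use.
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.io.PrintWriter;
import java.time.Duration;
import java.util.*;

public class StreamToBatchExtract {
    public static void main(String[] args) throws Exception {
        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "nightly-extract");
        cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cp.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             PrintWriter out = new PrintWriter("orders-extract.csv")) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
                if (records.isEmpty()) break;                     // caught up: the extract is complete
                for (ConsumerRecord<String, String> rec : records) {
                    out.println(rec.key() + ";" + rec.value());
                }
            }
            out.flush();
            consumer.commitSync();   // the next run continues where this extract stopped
        }
    }
}
```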

Why now?


• Message queues have existed forever
• It's a cost and efficiency thing

What changed?
• LinkedIn / Netflix & friends
• Transaction guarantees of classic MQs not needed
• High-performance message store: Kafka
• Ubiquity of high-performance distributed stream processing: Storm, Samza, Kafka Streams, Spark Streaming, Heron, Flink, …

Performance & Transactional Guarantees

Classic MQ: Source → MQ → Target

Fully transactional: the MQ keeps track of all messages and transactions from all sources, so it needs to track the state of every message:
• re-available after a timeout
• back-out queues on rollback, etc.
• look-ups by correlation ID, etc.

A sketch of that style follows.
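For contrast, a minimal sketch of that fully transactional style using the JMS API, with ActiveMQ assumed as the broker and a made-up queue name. The broker tracks every message: after a rollback it is redelivered, and repeated failures typically push it to a back-out / dead-letter queue.

```java
// Classic MQ consumption in a transacted session: commit removes the message,
// rollback makes the broker redeliver it.
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class ClassicMqConsumer {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection conn = factory.createConnection();
        conn.start();
        Session session = conn.createSession(true, Session.SESSION_TRANSACTED);
        MessageConsumer consumer = session.createConsumer(session.createQueue("orders"));
        while (true) {
            Message msg = consumer.receive();
            try {
                process(((TextMessage) msg).getText());
                session.commit();    // only now is the message really gone
            } catch (Exception e) {
                session.rollback();  // the broker makes the message available again
            }
        }
    }

    static void process(String payload) { /* business logic */ }
}
```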


Enter the simple distributed log

Source → Log-File(s) → Target (oldest data at the front, latest data at the end)

Writing a message just appends to one of the log-files; the message essentially gets a file-position index (an offset).

Reading is essentially pulling at an index and just keeps reading forward.

Challenge: position tracking (Kafka helps with that). A toy version follows.
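To make the mechanics tangible, here is a toy, single-process version of such an append-only log (no distribution, no persistence). It only shows the two operations the slide mentions: appending hands back a position, and a reader keeps its own position and reads forward from it.

```java
// A toy append-only log: append returns a position, readers read forward.
import java.util.ArrayList;
import java.util.List;

public class ToyLog {
    private final List<String> entries = new ArrayList<>();

    // Writing just appends and returns the position (what Kafka calls an offset).
    public synchronized long append(String message) {
        entries.add(message);
        return entries.size() - 1;
    }

    // The log does not track who has read what; each reader brings its own
    // position and advances it itself (Kafka stores these positions for you).
    public synchronized List<String> readFrom(long position, int maxMessages) {
        int from = (int) Math.min(position, entries.size());
        int to = Math.min(from + maxMessages, entries.size());
        return new ArrayList<>(entries.subList(from, to));
    }

    public static void main(String[] args) {
        ToyLog log = new ToyLog();
        log.append("order-1 created");
        log.append("order-1 paid");

        long readerPosition = 0;
        for (String msg : log.readFrom(readerPosition, 10)) {
            System.out.println(msg);
            readerPosition++;          // advance only after processing succeeded
        }
    }
}
```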


Consequences of the distributed log

We lose
• Full transaction support
• Lookups by correlation ID, etc.

We win
• Very high throughput
• No overfilled queues, so we can batch into it!
• Strict ordering per log-file/partition*
• Multiple target systems can read independently
• Very simple testing

*which is quite useful given we're replacing batch...

Summary


Technologies in play

Source → Event processing → Target

Source side:
• DB: change-data-capture or batch export
• System: data feed

Target side:
• DB: just insert (with a decent commit-size)
• System: REST calls

A sketch of a DB target follows.
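A minimal sketch of the "DB: just insert with a decent commit-size" target, assuming a Kafka topic as the stream and a made-up order_events table. Each poll becomes one database commit, and the consumer position is only advanced after that commit.

```java
// Stream-to-DB sink: batch inserts per poll, then commit the DB and the offsets.
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.sql.*;
import java.time.Duration;
import java.util.*;

public class JdbcSink {
    public static void main(String[] args) throws SQLException {
        Properties cp = new Properties();
        cp.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cp.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-db-sink");
        cp.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cp.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cp.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             Connection db = DriverManager.getConnection("jdbc:postgresql://target/db");
             PreparedStatement ins = db.prepareStatement(
                 "INSERT INTO order_events(order_id, payload) VALUES (?, ?)")) {
            db.setAutoCommit(false);
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                for (ConsumerRecord<String, String> rec : records) {
                    ins.setString(1, rec.key());
                    ins.setString(2, rec.value());
                    ins.addBatch();
                }
                ins.executeBatch();
                db.commit();            // one DB commit per poll = a decent commit size
                consumer.commitSync();  // advance the offsets only after the DB commit
            }
        }
    }
}
```

If the process dies between db.commit() and consumer.commitSync(), the batch is replayed, which is one reason the next slide's advice to go idempotent matters.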


What should you take away from this?

• Think differently: streaming-first
• Go idempotent to keep it simple (see the sketch below)
• Partition to go fast & ordered
• Establish governance & standards as you go*

*data formats, naming conventions, operations, …
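A minimal sketch of what "go idempotent" can mean at a database target, assuming PostgreSQL and a unique key on order_id (both made up for illustration). Replaying the same event after a restart or a re-read of the log then overwrites the row instead of creating a duplicate.

```java
// Idempotent write: applying the same event twice leaves the same result.
import java.sql.*;

public class IdempotentWrite {
    public static void apply(Connection db, String orderId, String status) throws SQLException {
        try (PreparedStatement upsert = db.prepareStatement(
                "INSERT INTO order_status(order_id, status) VALUES (?, ?) " +
                "ON CONFLICT (order_id) DO UPDATE SET status = EXCLUDED.status")) {
            upsert.setString(1, orderId);
            upsert.setString(2, status);
            upsert.executeUpdate();
        }
    }
}
```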

Thank you
Questions?