Date post: 28-May-2020
Everything you wanted to know about Apache Kafka® (but were afraid to ask) @riferrei | @confluentinc | @bigdataconfeu
Transcript
Page 1: Everything you Wanted to Know about Apache Kafka (But Were ... · ETL/Data Integration Messaging Batch Expensive Time Consuming Difficult to Scale No Persistence After Consumption

Everything you wanted to know about Apache Kafka® (but were afraid to ask)

@riferrei | @confluentinc | @bigdataconfeu

Page 2

Who am I?

Page 3

Ricardo Ferreira

Page 4

@riferrei | @confluentinc | @itau

Page 5
Page 6

@riferrei

https://[email protected]

Page 7

What is Apache Kafka?

Page 8

Apache Kafka is a distributed streaming platform

Page 9
Page 10
Page 11
Page 12

What are streams?

Page 13
Page 14

Streams are unbounded sets of events
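To make "unbounded" concrete, here is a minimal Python sketch. The `click_events` source and its event fields are hypothetical, invented only for illustration; the point is that the producer never finishes, so a consumer can only ever take a finite slice.

```python
import itertools
from typing import Iterator


def click_events() -> Iterator[dict]:
    """Hypothetical event source: it yields events forever, so it has no final size."""
    for sequence in itertools.count():
        yield {"event_id": sequence, "type": "page_view"}


# A consumer can only ever look at a finite slice of an unbounded stream.
first_three = list(itertools.islice(click_events(), 3))
print(first_three)
```

Calling `list(click_events())` without the slice would never return, which is the practical difference between a stream and a bounded data set.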

Page 15

Current technologies were not designed to handle streams!

Page 16

Let's go back in time...

Page 17

Databases, 30 years ago...

Page 18

Developer, full of expectations...

Databases, these days...

Page 19

What problems do databases have?

Page 20

EXTREMELY LIMITED

MAKES LOTS OF MESS

Page 21

Limited? Are you kidding me?

Page 22

Why do you think we move data to the data warehouse (DWH) using batch jobs?

Yes, limited.

Page 23

mess

Business line 1 Business line 2 Business line 3

Page 24

Workarounds

Page 25
Page 26
Page 27

NoSQL

Page 28
Page 29

Result…

Page 30

mess

Business line 1 Business line 2 Business line 3

Page 31

Let's go back in time...

Page 32
Page 33

Jay Kreps

Neha Narkhede

Jun Rao

Page 34

First version of LinkedIn

[Diagram labels: database; web app; web app]

Page 35

Event: I changed my job from Oracle to Confluent

State: I work at Confluent
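The event/state distinction on this slide can be sketched in a few lines of Python. This is only an illustration: the event shape and the `current_employer` helper are assumptions, not something from the talk. An event records what happened; state is what you get by replaying events up to now.

```python
from typing import Optional

# Events record what happened; each one is an immutable fact.
events = [
    {"user": "riferrei", "type": "job_changed", "from": "Oracle", "to": "Confluent"},
]


def current_employer(events: list, user: str) -> Optional[str]:
    """Derive state ("I work at ...") by replaying the user's job-change events in order."""
    employer = None
    for event in events:
        if event["user"] == user and event["type"] == "job_changed":
            employer = event["to"]  # later events overwrite earlier state
    return employer


print(current_employer(events, "riferrei"))
```

The state alone ("I work at Confluent") cannot tell you that a job change happened or where the user came from; the event can.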

Page 36

Job change

Recommendation engine

Search engine

Email service

Page 37

Databases are very good at handling state

[Diagram labels: recommendation engine, search engine, and email service, each using SQL; database; LOG]

Page 38

But they don't handle large volumes very well

[Diagram labels: database; LOG; transactional events; non-transactional events; 1000x more volume]

Page 39

[Diagram labels: user tracking; historical data; operational metrics; NoSQL database; graph database; SQL database; microservices; ...; Hadoop; Elasticsearch; Grafana; machine learning; rec. engine; search; security; email; social graph]

Page 40

Conclusion: databases were not the best technology to solve LinkedIn's problems

Page 41

[Diagram labels: LOG; user tracking; historical data; operational metrics; NoSQL database; graph database; SQL database; microservices; ...; Hadoop; Elasticsearch; Grafana; machine learning; rec. engine; search; security; email; social graph]

Page 42

Why not messaging?

Page 43

[Diagram labels: LOG; user tracking; historical data; operational metrics; NoSQL database; graph database; SQL database; microservices; ...; Hadoop; Elasticsearch; Grafana; machine learning; rec. engine; search; security; email; social graph]

Messaging has no persistence after consumption and, like databases, has scalability limits

Page 44

ETL/Data Integration: Batch; Expensive; Time Consuming
Messaging: Difficult to Scale; No Persistence After Consumption; No Replay

Highly Scalable; Durable; Persistent; Ordered; Fast (Low Latency)

What is happening in the world | What happened in the world

Page 45

ETL/Data Integration: Batch; Expensive; Time Consuming
Messaging: Difficult to Scale; No Persistence After Consumption; No Replay

Highly Scalable; Durable; Persistent; Ordered; Fast (Low Latency)

What is happening in the world | What happened in the world

Distributed streaming platform: Highly Scalable; Durable; Persistent; Ordered; Fast (Low Latency)

Page 46

Distributed streaming platforms should be used to handle streams

Page 47

Apache Kafka is a distributed streaming platform

Page 48

Why did nobody explain this to me before?

Page 49

You will never understand Kafka...

if you think of it as just messaging!

Page 50

01 Well-done messaging

02 Scalable storage

03 Event stream processing

Page 51

“The truth is the log. The database is a cache of a subset of the log.”

Pat Helland, "Immutability Changes Everything"

http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf

Page 52

Inverted database: logs as first-class citizens

[Diagram labels: database; LOG with positions 0 1 2 3 4 5 6 7 8; reads; writes; Destination System A (time = 1); Destination System B (time = 3)]
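A minimal in-memory sketch of the idea on this slide, in Python. The `Log` class is a toy invented for illustration and is not Kafka's API; it also assumes the "time" labels on the two destination systems correspond to positions in the log. Writes only append, and each reader keeps its own position, so reading never removes anything.

```python
class Log:
    """Toy append-only log: records are never modified or removed by readers."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        """Append a record and return the offset it was stored at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 100) -> list:
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


log = Log()
for value in "abcdefghi":  # nine records, stored at offsets 0 through 8
    log.append(value)

# Each destination system tracks its own position independently.
positions = {"destination_a": 1, "destination_b": 3}
batch_a = log.read(positions["destination_a"], max_records=2)
batch_b = log.read(positions["destination_b"], max_records=2)
print(batch_a, batch_b)
```

Because positions live with the readers rather than in the log, a slow or newly added destination can start from any earlier offset and replay, which is what a traditional message queue with "no persistence after consumption" cannot offer.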

Page 53

new record

previous record

offset

Page 54

new record

previous record

offset

Page 55

http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf

Do you think a database is a table that you perform queries on?

Page 56

Duality between streams and tables

Stream:
{"user":"riferrei","score":"1001"}
{"user":"riferrei","score":"1002"}
{"user":"riferrei","score":"1003"}
{"user":"riferrei","score":"1004"}
{"user":"riferrei","score":"1005"}

Table:
{"user":"riferrei","score":"1005"}
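One direction of the stream/table duality can be sketched in Python using the records from this slide. The `materialize` helper is a hypothetical name, and keying the table by `user` is an assumption: replaying the stream and keeping the latest value per key yields the table.

```python
import json

# The stream: every score update ever emitted, in order.
stream = [
    '{"user":"riferrei","score":"1001"}',
    '{"user":"riferrei","score":"1002"}',
    '{"user":"riferrei","score":"1003"}',
    '{"user":"riferrei","score":"1004"}',
    '{"user":"riferrei","score":"1005"}',
]


def materialize(stream) -> dict:
    """Fold a stream of updates into a table holding the latest score per user."""
    table = {}
    for raw in stream:
        event = json.loads(raw)
        table[event["user"]] = event["score"]  # newer events overwrite older ones
    return table


table = materialize(stream)
print(table)
```

The reverse direction also holds: the sequence of changes applied to `table` is itself a stream, which is why the two views are described as dual.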

Page 57

Event streaming applications

-- Flag cards with more than 3 authorization attempts in a 5-minute tumbling window.
-- A windowed aggregation produces a table in KSQL, hence CREATE TABLE.
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 MINUTE)
  GROUP BY card_number
  HAVING count(*) > 3;

Fraud attempt topic | Possible fraud topic
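For readers unfamiliar with tumbling windows, the logic of the query above can be approximated in plain Python. This is a batch sketch over a finite list, not a streaming implementation and not how KSQL executes the query; the `(epoch_seconds, card_number)` tuple shape and the sample data are assumptions made for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 5 * 60  # tumbling window size: 5 minutes
THRESHOLD = 3            # flag when count(*) > 3


def possible_fraud(attempts) -> dict:
    """Count attempts per (card_number, window_start) and keep counts above THRESHOLD.

    attempts: iterable of (epoch_seconds, card_number) tuples.
    """
    counts = Counter()
    for timestamp, card_number in attempts:
        # Tumbling windows do not overlap: each timestamp belongs to exactly one window.
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        counts[(card_number, window_start)] += 1
    return {key: n for key, n in counts.items() if n > THRESHOLD}


attempts = [
    (0, "1111"), (60, "1111"), (120, "1111"), (240, "1111"),  # four attempts in window [0, 300)
    (310, "1111"),                                            # falls into the next window
    (30, "2222"),
]
flagged = possible_fraud(attempts)
print(flagged)
```

Only card "1111" in the first window is flagged; its fifth attempt lands in a new window and starts a fresh count, which is the defining behaviour of a tumbling window.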

Page 58
Page 59

Thank you

Page 60