Post on 28-May-2020

Everything You Wanted to Know About Apache Kafka® (But Were Afraid to Ask)

@riferrei | @confluentinc | @bigdataconfeu

Who am I?

Ricardo Ferreira

@riferrei | @confluentinc | @itau

https://riferrei.net

Ricardo@confluent.io

What is Apache Kafka?

Apache Kafka is a distributed streaming platform

What are streams?

Streams are unbounded sets of events
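One way to picture "unbounded" is a generator that never finishes. The sketch below is only an illustration: the function name `event_stream` and the `page_view` event type are made up for this example and do not come from the talk.

```python
import itertools
from typing import Iterator


def event_stream() -> Iterator[dict]:
    """Yield events forever; the stream has no defined end."""
    for sequence in itertools.count():
        # Hypothetical event shape, chosen only for illustration.
        yield {"sequence": sequence, "type": "page_view"}


# A consumer can only ever look at a finite slice of an unbounded stream.
first_three = list(itertools.islice(event_stream(), 3))
```

Calling `list(event_stream())` without a limit would never return, which is the practical difference between a stream and a bounded data set.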

Current technologies were not designed to handle streams!

Let's go back in time...

Databases, 30 years ago...

Developer, full of expectations...

Databases, these days...

What problems do databases have?

EXTREMELY LIMITED

MAKE A LOT OF MESS

Limited? Are you kidding me?

Why do you think we move data to the DWH using batch jobs?

Yes, limited.

Mess

[Diagram labels: Business line 1; Business line 2; Business line 3]

Workarounds

NoSQL

Result…

Mess

[Diagram labels: Business line 1; Business line 2; Business line 3]

Let's go back in time...

Jay Kreps

Neha Narkhede

Jun Rao

First version of LinkedIn

[Diagram labels: database; web app; web app]

Event > state

Event: "I changed my job from Oracle to Confluent"

State: "I work at Confluent"
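The relationship between the event and the state above can be sketched as a fold: state is what is left after applying every event in order. The field names (`user`, `type`, `from`, `to`) and the function `apply_events` are assumptions made for this example.

```python
# Hypothetical event records; each one describes something that happened.
events = [
    {"user": "riferrei", "type": "job_changed", "from": "Oracle", "to": "Confluent"},
]


def apply_events(events):
    """Fold job-change events into the current employer per user."""
    state = {}
    for event in events:
        if event["type"] == "job_changed":
            # The event says what changed; the state keeps only the outcome.
            state[event["user"]] = event["to"]
    return state


state = apply_events(events)
```

The state can always be rebuilt from the events, but the events cannot be rebuilt from the state: once only "works at Confluent" is stored, the previous employer is gone.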

[Diagram labels: Job change; Recommendation engine; Search engine; Email service]

[Diagram labels: SQL; SQL; SQL; Recommendation engine; Search engine; Email service]

Databases are very good at handling state

[Diagram labels: database; LOG]

But they don’t handle large volumes very well

[Diagram labels: database; 1000x more volume; non-transactional events; transactional events; LOG]

[Diagram labels: user tracking; historical data; operational metrics; NoSQL database; graph database; SQL database; microservices; ...; Hadoop; Elasticsearch; Grafana; machine learning; rec. engine; search; security; email; social graph]

Conclusion: databases were not the best technology to solve LinkedIn’s problems

[Diagram labels: LOG; user tracking; historical data; operational metrics; NoSQL database; graph database; SQL database; microservices; ...; Hadoop; Elasticsearch; Grafana; machine learning; rec. engine; search; security; email; social graph]

Why not messaging?

[Diagram: LOG with the same surrounding systems as in the previous diagram]

Messaging has no persistence and, like databases, has scalability limits

ETL/Data Integration (what happened in the world)

Drawbacks: Batch, Expensive, Time Consuming

Benefits: Highly Scalable, Durable, Persistent, Ordered

Messaging (what is happening in the world)

Drawbacks: Difficult to Scale, No Persistence After Consumption, No Replay

Benefits: Fast (Low Latency)

Distributed streaming platform: Highly Scalable, Durable, Persistent, Ordered, Fast (Low Latency)
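The "No Persistence After Consumption" and "No Replay" drawbacks can be made concrete with a toy comparison. This is a deliberately simplified sketch, not how any real broker is implemented; the class names `MessageQueue` and `Log` are invented for the example.

```python
from collections import deque


class MessageQueue:
    """Toy queue: receiving a message removes it for good."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        # Destructive read: once returned, the message cannot be read again.
        return self._messages.popleft() if self._messages else None


class Log:
    """Toy log: reading leaves the records in place."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)

    def read_from(self, offset):
        # Non-destructive read: the same offset can be read any number of times.
        return list(self._records[offset:])


queue = MessageQueue()
log = Log()
for payload in ["a", "b"]:
    queue.send(payload)
    log.append(payload)

consumed = [queue.receive(), queue.receive()]
after_consumption = queue.receive()  # nothing left to replay
first_read = log.read_from(0)
replayed = log.read_from(0)          # same records again
```

A second consumer that starts late gets nothing from the queue but can still read the whole log from offset 0.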

Distributed streaming platforms should be used to handle streams

Apache Kafka is a distributed streaming platform

Why did nobody explain this to me before?

You will never understand Kafka... if you think of it as just messaging!

01 Well-done messaging

02 Scalable storage

03 Event stream processing

“The truth is the log. The database is a cache of a subset of the log.”

Pat Helland, Immutability Changes Everything

http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf

Inverted database: logs as first-class citizens

[Diagram labels: database; LOG with offsets 0 1 2 3 4 5 6 7 8; writes; reads; Destination System A (time = 1); Destination System B (time = 3)]

[Diagram labels: previous record; new record; offset]
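The log diagrams above can be sketched as an append-only list in which each record gets an offset and each reader keeps its own position. This sketch reads the slide's "time = 1" and "time = 3" as positions in the log, which is an interpretation; the class names `AppendOnlyLog` and `DestinationSystem` and the `poll` method are invented for the example and are not Kafka's client API.

```python
class AppendOnlyLog:
    """Records are only ever appended; existing records never change."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store the record at the end and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records):
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


class DestinationSystem:
    """A reader that remembers its own position in the log."""

    def __init__(self, log, position=0):
        self.log = log
        self.position = position

    def poll(self, max_records=2):
        records = self.log.read(self.position, max_records)
        self.position += len(records)  # advance only this reader
        return records


log = AppendOnlyLog()
offsets = [log.append(f"record-{i}") for i in range(9)]  # offsets 0..8

system_a = DestinationSystem(log, position=1)
system_b = DestinationSystem(log, position=3)

batch_a = system_a.poll()  # reads offsets 1 and 2
batch_b = system_b.poll()  # reads offsets 3 and 4
```

Because the position lives in the reader and not in the log, a slow destination system never holds back a fast one, and either can rewind by resetting its own position.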

Do you think a database is a table that you perform queries on?

Duality between streams and tables

Stream:

{"user":"riferrei","score":"1001"}

{"user":"riferrei","score":"1002"}

{"user":"riferrei","score":"1003"}

{"user":"riferrei","score":"1004"}

{"user":"riferrei","score":"1005"}

Table:

{"user":"riferrei","score":"1005"}
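The duality can be sketched by replaying the stream records above into a dictionary keyed by user: the table is simply the latest value per key. The function name `materialize` is an assumption made for this example.

```python
import json

# The stream records from the slide, in arrival order.
stream = [
    '{"user":"riferrei","score":"1001"}',
    '{"user":"riferrei","score":"1002"}',
    '{"user":"riferrei","score":"1003"}',
    '{"user":"riferrei","score":"1004"}',
    '{"user":"riferrei","score":"1005"}',
]


def materialize(stream):
    """Replay a stream of updates into a table of latest values per user."""
    table = {}
    for raw in stream:
        event = json.loads(raw)
        table[event["user"]] = event["score"]  # later events overwrite earlier ones
    return table


table = materialize(stream)
halfway = materialize(stream[:3])  # the table as it looked after three events
```

Replaying only a prefix of the stream gives the table as of that point in time, which is why the stream carries more information than the table.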

Event streaming applications

-- A windowed aggregation produces a table in KSQL, so this uses CREATE TABLE
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 MINUTES)
  GROUP BY card_number
  HAVING count(*) > 3;

[Diagram labels: Fraud attempt topic; Possible fraud topic]
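To show what the windowed count in the query above computes, here is a rough Python equivalent over a finite list of attempts. It is a batch approximation, not a streaming engine: the function name `possible_fraud`, the `(timestamp_seconds, card_number)` tuple layout, and the sample data are all assumptions made for this example.

```python
from collections import Counter

WINDOW_SIZE_SECONDS = 5 * 60  # tumbling window of five minutes


def possible_fraud(attempts, threshold=3):
    """Count attempts per card per tumbling window; keep counts above threshold.

    attempts: iterable of (timestamp_seconds, card_number) tuples.
    """
    counts = Counter()
    for timestamp, card_number in attempts:
        # Tumbling windows do not overlap, so each event falls in exactly one.
        window_start = timestamp - (timestamp % WINDOW_SIZE_SECONDS)
        counts[(card_number, window_start)] += 1
    return {key: n for key, n in counts.items() if n > threshold}


# Hypothetical authorization attempts.
attempts = [
    (0, "1111"), (60, "1111"), (120, "1111"), (180, "1111"),
    (10, "2222"), (20, "2222"), (400, "2222"),
    (310, "1111"),
]

flagged = possible_fraud(attempts)
```

Card `1111` has four attempts in the window starting at 0 and is flagged; its fifth attempt at second 310 lands in the next window and starts a new count.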

Thank you
