Everything you wanted
to know about Apache Kafka®
(but were afraid to ask)
@riferrei | @confluentinc | @bigdataconfeu
Who am I?
Ricardo Ferreira
@riferrei | @confluentinc | @itau
What is
Apache Kafka?
Apache Kafka is a
distributed streaming
platform
What are
streams?
Streams are
unbounded sets
of events
Current technologies
were not designed to
handle streams!
Let's go back
in time...
Databases, 30
years ago...
Developer, full of
expectations...
Databases,
these days...
What problems
do databases have?
EXTREMELY
LIMITED
MAKES LOTS
OF MESS
Limited?
Are you
kidding me?
Why do you think we
move data to the DWH
using batch jobs?
yes,
limited.
mess
Business line 1 Business line 2 Business line 3
workarounds
NoSQL
Result…
mess
Business line 1 Business line 2 Business line 3
Let's go back
in time...
Jay Kreps
Neha Narkhede
Jun Rao
database
Web app Web app
First version of LinkedIn
Event: I changed my job from Oracle to Confluent
State: I work at Confluent
Job change
Recommendation engine
Search engine
Email service
SQL
SQL
SQL
Recommendation engine
Search engine
Email service
Databases are very good at handling state
database
LOG
But they don’t handle large volumes very well
database
1000x more volume
Non-transactional events
Transactional events
LOG
[Diagram: User tracking, Historical data, Operational metrics, NoSQL database, Graph database, SQL database, Microservices, ..., Hadoop, Elasticsearch, Grafana, Machine learning, Rec. engine, Search, Security, Email, Social graph]
Conclusion:
databases were not the best
technology to solve LinkedIn’s
problems
[Diagram: LOG, together with User tracking, Historical data, Operational metrics, NoSQL database, Graph database, SQL database, Microservices, ..., Hadoop, Elasticsearch, Grafana, Machine learning, Rec. engine, Search, Security, Email, Social graph]
Why not
messaging?
Messaging has no persistence
and, like databases, has
scalability limits
ETL/Data Integration: Batch, Expensive, Time Consuming
Messaging: Difficult to Scale, No Persistence After Consumption, No Replay
Highly Scalable, Durable, Persistent, Ordered, Fast (Low Latency)
What is happening in the world
What happened in the world
Distributed streaming platform: Highly Scalable, Durable, Persistent, Ordered, Fast (Low Latency)
Distributed streaming
platforms should be
used to handle streams
Apache Kafka is a
distributed streaming
platform
Why did nobody
explain this to
me before?
You will never understand Kafka...
if you think of it as just messaging!
01 Messaging done well
02 Scalable storage
03 Event stream processing
“The truth is the log.
The database is a cache
of a subset of the log.”
Pat Helland, "Immutability Changes Everything"
http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf
Inverted database: logs as first-class citizens
[Diagram: a database alongside a LOG with offsets 0 1 2 3 4 5 6 7 8, showing reads and writes; Destination System A (time = 1) and Destination System B (time = 3) each read from the log]
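A minimal sketch of the idea in the diagram, assuming a purely in-memory log: records are only ever appended, each record is addressed by its offset, and every destination system keeps its own position. The `Log` and `Reader` classes are hypothetical illustrations, not Kafka's client API, and treating the diagram's "time = 1" and "time = 3" as log positions is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Log:
    """Hypothetical in-memory append-only log (not Kafka's API)."""
    records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Append a record at the end and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Any]:
        """Return up to max_records starting at offset; nothing is removed."""
        return self.records[offset:offset + max_records]


@dataclass
class Reader:
    """A destination system that tracks its own position in the log."""
    name: str
    offset: int = 0

    def poll(self, log: Log, max_records: int = 10) -> List[Any]:
        batch = log.read(self.offset, max_records)
        self.offset += len(batch)  # advance only this reader's position
        return batch


log = Log()
for i in range(9):  # offsets 0..8, as in the diagram
    log.append(f"event-{i}")

system_a = Reader("Destination System A", offset=1)
system_b = Reader("Destination System B", offset=3)

batch_a = system_a.poll(log, max_records=2)  # offsets 1 and 2
batch_b = system_b.poll(log, max_records=2)  # offsets 3 and 4

# Replay is just resetting the offset, because reads never delete records.
system_b.offset = 0
replayed = system_b.poll(log, max_records=9)
```

Because reading does not consume anything, the two readers never interfere with each other, and either of them can rewind and reprocess the full history.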
new
record
previous
record
offset
Do you think a table is just
something you run queries against?
Duality between streams and tables
stream:
{"user":"riferrei","score":"1001"}
{"user":"riferrei","score":"1002"}
{"user":"riferrei","score":"1003"}
{"user":"riferrei","score":"1004"}
{"user":"riferrei","score":"1005"}
table:
{"user":"riferrei","score":"1005"}
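The duality can be sketched in a few lines, assuming the table is keyed by `user` and keeps only the latest `score` per key. The `materialize` function is a hypothetical helper for illustration; it is not part of Kafka or KSQL.

```python
import json
from typing import Dict, Iterable

# The stream: every event is kept, in order.
stream = [
    '{"user":"riferrei","score":"1001"}',
    '{"user":"riferrei","score":"1002"}',
    '{"user":"riferrei","score":"1003"}',
    '{"user":"riferrei","score":"1004"}',
    '{"user":"riferrei","score":"1005"}',
]


def materialize(events: Iterable[str]) -> Dict[str, str]:
    """Fold a stream into a table: the latest value per key wins."""
    table: Dict[str, str] = {}
    for raw in events:
        event = json.loads(raw)
        table[event["user"]] = event["score"]
    return table


table = materialize(stream)               # current state
table_earlier = materialize(stream[:2])   # state after the first two events
```

Replaying a prefix of the stream rebuilds the table as it was at that point, which is why the stream can be treated as the source of truth and the table as a view derived from it.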
Event streaming applications
-- In KSQL an aggregation (GROUP BY) produces a table, hence CREATE TABLE
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 MINUTE)
  GROUP BY card_number
  HAVING count(*) > 3;
[Diagram: "Fraud attempt" topic and "Possible fraud" topic]
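To make the windowing in the query above concrete, here is a rough sketch of the same logic over a finite list of events. It assumes each attempt is an `(epoch_seconds, card_number)` pair; the function name and the sample data are made up for illustration. A real stream processor evaluates this continuously as events arrive, while this sketch does it in one pass over a batch.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 5 * 60  # tumbling window of 5 minutes
THRESHOLD = 3            # flag cards with more than 3 attempts per window


def possible_fraud(
    attempts: Iterable[Tuple[int, str]],
) -> Dict[Tuple[int, str], int]:
    """Count attempts per (window start, card) and keep counts above THRESHOLD."""
    counts: Counter = Counter()
    for timestamp, card_number in attempts:
        # Tumbling windows do not overlap: each event falls in exactly one.
        window_start = timestamp - (timestamp % WINDOW_SECONDS)
        counts[(window_start, card_number)] += 1
    return {key: n for key, n in counts.items() if n > THRESHOLD}


attempts = [
    (0, "1111"), (60, "1111"), (120, "1111"), (180, "1111"),  # 4 in window [0, 300)
    (200, "2222"),                                            # 1 attempt only
    (290, "3333"), (295, "3333"),                             # 2 in window [0, 300)
    (310, "3333"), (320, "3333"),                             # 2 in window [300, 600)
]

flagged = possible_fraud(attempts)
```

Card `3333` has four attempts in total but is not flagged, because they are split two and two across adjacent windows; that is the behavior a tumbling window implies.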
Thank you