Posted on 15-Apr-2017 (transcript)
INTRO TO KAFKA
Jim Plush, Director of Cloud Engineering, CrowdStrike.com
Twitter: @jimplush
ABOUT ME
Jim Plush, Director of Cloud Engineering @ CrowdStrike.com
Architect of distributed cloud services for catching bad guys
Previously Director of Engineering at gravity.com
personalization service, ingesting clickstream from Yahoo!, New York Times, WSJ, etc…
wrote most of the ETL workflow
ABOUT CROWDSTRIKE
“Big Data” Security Company
Near term focus on targeted, state sponsored attacks and attribution
A single customer can generate 2.2TB of machine data per day, which we process in our cloud
Horizontally scalable, distributed infrastructure
Uses goodies like Kafka, Cassandra, Elasticsearch, Hadoop, Scala, Go
“Some people, when confronted with a problem, think “I know, I'll use a message queue.” Now they have two problems.”
–Said everyone, always
APACHE KAFKA
It’s not so much a queue as an activity stream system
Gains stability and speed at the cost of consumer complexity
It’s scalable by nature
Supports data replication
You can rewind time
It’s fast!
Persistent messaging with O(1) disk structures that provide constant time performance even with many TB of stored messages.
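The "rewind time" and O(1) claims both follow from Kafka's append-only commit log: messages are addressed by offset, so appending is constant time and a consumer can replay history just by reading from an older offset. A toy in-memory sketch (hypothetical class and method names, not the real broker) of that idea:

```python
class PartitionLog:
    """Toy append-only log for one partition: messages are addressed
    by offset, so a consumer rewinds simply by re-reading from an
    older offset."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        # O(1) append; returns the offset assigned to this message.
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        # O(1) seek by offset -- no per-message scan needed.
        return self._messages[offset:offset + max_messages]


log = PartitionLog()
for event in ["click", "view", "purchase"]:
    log.append(event)

print(log.read(0))  # full replay from the beginning
print(log.read(2))  # resume from a checkpointed offset
```

The real broker does the same thing against segment files on disk, which is why stored volume doesn't affect read or write cost.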
APACHE KAFKA - CONS
Consumer Complexity
Not “Rack Aware” replication
Lack of tooling/monitoring
Still pre-1.0 release
Operationally, it’s more manual than desired
Requires ZooKeeper
BASIC CONCEPTS
Topics - logical namespace for data (clickstream, app logs)
Partition - physical separation of data to allow for horizontal scalability
Consumer Groups/Offsets - where your consumer group last checkpointed its position in the stream
Replica - allows for partitions to be replicated across nodes for availability, only one is the active leader
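The four concepts above fit together compactly. A minimal sketch (hypothetical function names; real Kafka client APIs differ) of topics holding partitions, with each consumer group checkpointing its own offset per partition:

```python
# A topic is a logical namespace holding a list of partitions.
topic = {
    "clickstream": [[], []],  # two partitions, 0 and 1
}

# Each consumer group tracks its own offset per (topic, partition),
# so independent groups can read the same stream at their own pace.
offsets = {}


def produce(topic_name, partition, message):
    topic[topic_name][partition].append(message)


def consume(group, topic_name, partition):
    key = (group, topic_name, partition)
    pos = offsets.get(key, 0)
    messages = topic[topic_name][partition][pos:]
    offsets[key] = pos + len(messages)  # checkpoint the new position
    return messages


produce("clickstream", 0, "page_view")
produce("clickstream", 0, "click")
print(consume("group-a", "clickstream", 0))  # both messages
print(consume("group-a", "clickstream", 0))  # empty: offset advanced
```

Note that a second group reading the same partition starts from offset 0, untouched by group-a's checkpoint.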
USE CASES
First point for data ingestion, provide back pressure to downstream
Provide a data firehose for clients (with seeks)
Friendly to Blue/Green deployment architectures
Mirroring test data easily
Data Center log aggregation
Seamless Integration with Storm
Data Center Aggregation
[Diagram: an API server producer aggregates data from Customer A and Customer B into a Kafka data stream]
Serving a Firehose
Data Affinity w/ Key Partitioning
[Diagram: a producer writes to Data Stream partitions P0 and P1; Consumer A reads UserIds 0-100 from P0, Consumer B reads UserIds 101-200 from P1]
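Key partitioning gives that data affinity: hashing the message key deterministically picks a partition, so every event for a given user lands on the same partition and thus the same consumer. Kafka's default partitioner also hashes the message key; the CRC32 and partition count below are just illustrative assumptions:

```python
import zlib

NUM_PARTITIONS = 2  # matches the P0/P1 diagram; hypothetical count


def partition_for(user_id):
    # Deterministic hash of the key -> partition index. Same key,
    # same partition, every time.
    return zlib.crc32(str(user_id).encode("utf-8")) % NUM_PARTITIONS


# All of user 555's events route to one partition, giving that
# partition's consumer a complete view of the user's activity.
events = [(555, "login"), (42, "click"), (555, "purchase")]
routed = [(partition_for(uid), uid, action) for uid, action in events]
print(routed)
```

The trade-off: a hot key (one very active user) can skew load onto a single partition, so key choice matters.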
Blue/Green Deployment
[Diagram sequence: a producer writes to an ActiveTopic and an InactiveTopic; a ZooKeeper-backed controller tells the Blue consumer which topic is active; a Green consumer is brought up on the inactive topic, then the controller (here shown for User: 555) cuts traffic over from Blue to Green]
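The mechanism in those diagram frames is a shared controller flag: consumers ask which topic is currently active, so flipping one value (held in ZooKeeper in the talk's setup) promotes the green deployment without touching the producer. A toy sketch of the cut-over, with a plain dict standing in for ZooKeeper:

```python
# Hypothetical controller state; in the talk's architecture this
# flag would live in ZooKeeper, not in process memory.
controller = {"active_topic": "TopicA"}


class Consumer:
    def __init__(self, name, topic):
        self.name = name
        self.topic = topic

    def is_active(self):
        # A consumer only processes when its topic is the active one.
        return self.topic == controller["active_topic"]


blue = Consumer("blue", "TopicA")
green = Consumer("green", "TopicB")

print(blue.is_active(), green.is_active())  # blue serving, green warm
controller["active_topic"] = "TopicB"       # cut over to green
print(blue.is_active(), green.is_active())  # green serving, blue idle
```

Because both consumers were already attached to their topics before the flip, the cut-over is instant, and flipping the flag back rolls the deployment back just as fast.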
SCALING OUT
1 partition = at most 1 consumer (within a consumer group)
1 partition needs to fit on a single machine
Partitions = the scalability of your system from the producer and consumer side
For high scale apps you will probably start out with 100 partitions
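Those rules mean partition count caps your consumer parallelism: each partition goes to exactly one consumer in the group, and consumers beyond the partition count sit idle. A sketch of a simple round-robin assignment (illustrative only; Kafka's own rebalancing logic differs in detail):

```python
def assign_partitions(partitions, consumers):
    """Round-robin partition assignment: each partition is owned by
    exactly one consumer in the group; consumers beyond the partition
    count get nothing to do."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


print(assign_partitions([0, 1, 2], ["A"]))            # A owns everything
print(assign_partitions([0, 1, 2], ["A", "B", "C"]))  # one partition each
print(assign_partitions([0, 1], ["A", "B", "C"]))     # C is idle
```

This is why over-provisioning partitions up front (the ~100 suggested above) is cheap insurance: you can add consumers later without repartitioning.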
[Diagram: a producer writes to partitions P0, P1, and P2; first Consumer A reads all three alone, then the partitions are spread across Consumers A, B, and C, one each]
ZOOKEEPER
http://techblog.netflix.com/2012/04/introducing-exhibitor-supervisor-system.html
WE’RE HIRING!
jim@crowdstrike.com
@jimplush
crowdstrike.com/about-us/careers
[Diagram: Producers A and B write to ClickStream partitions 1 and 2; Consumer A reads the stream and commits its partition offsets to ZooKeeper]