Apache Kafka at Rocana
Persistent Machine Data Collection at Scale
2015 Rocana, Inc. All Rights Reserved.
I'm Alan Gardner. I'm here to talk about our use of Apache Kafka at Rocana.
Who am I?
Platform Engineer Based in Ottawa [email protected] @alanctgardner
Platform engineer at Rocana:
- work on data ingest, storage and processing
- distributed open-source systems: Hadoop, Kafka, Solr
- systems programming work as well
- work remotely from Ottawa, Canada
This is my cat.
Working at Rocana
Working at Rocana is great:
- everybody is remote
- very smart, very nice people
- quarterly onsites
Rocana Ops
What is Rocana Ops?
- a platform for IT operations data
- designed for tens of thousands of servers in multiple data centers
- distills the entire organization's IT infrastructure down to a single screen: what's wrong?
- scalable collection framework - out-of-the-box host data and app logs
- event data warehouse built on open source technologies and open schemas
- visualization, anomaly detection and machine learning to provide guided root cause analysis, as opposed to a wall of graphs or a pile of logs
- Apache Kafka is the Enterprise Data Bus; I'm going to talk about why we chose Kafka for that role
Kafka Principles
To explain why we chose Kafka, I'm going to start with how Kafka works and why it's designed the way it is.
History
Designed at LinkedIn
Documented in a 2013 blog post by Jay Kreps
LinkedIn moved from a monolith to multiple data stores and services
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Designed at LinkedIn to handle the explosion of different systems being created:
- Jay Kreps' blog post describes Kafka from first principles, including motivation
- some of these images are cribbed from that post, where appropriate
Complexity
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
LinkedIn's problem:
- lots of front-end services
- lots of back-end services
- hooking them together produces this complex spaghetti of dependencies
- the front-end has to be highly available and low-latency: if you write synchronously, you can only be as fast as your slowest backend service
Complexity
Centralized Data Bus
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Kafka acts as a central bus for data:
- every front-end service writes all events into Kafka
- backend services can take only the events they're interested in
- data doesn't live in Kafka forever
- Kafka is run as a utility within LinkedIn
- this solves one goal - a centralized data bus - but we still need horizontal scale and durability
Centralized Data Bus
This is much better
Design Goals
A centralized data bus that:
Scales horizontally
Delivers (some) events in order
Decouples producers and consumers
Has low latency end-to-end
A Horizontally Scalable Log
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Kafka is fundamentally a collection of logs:
- events are only appended
- events are always consumed in the same order
- a single partition is a log: an ordered set of events
- every event has an offset
- partitions are the units of scale, like shards
- log operations are constrained, so we can make them fast
- the example is sharding on users
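The partition model above can be sketched as a tiny in-memory append-only log. This is illustrative only - the class and method names are made up, not Kafka's API:

```python
class Partition:
    """An append-only log: each event gets a monotonically increasing offset."""

    def __init__(self):
        self._events = []

    def append(self, event):
        offset = len(self._events)     # the offset is just the position in the log
        self._events.append(event)
        return offset

    def read(self, offset, max_events=10):
        """Return events starting at `offset`, always in append order."""
        return self._events[offset:offset + max_events]

# Sharding on users: the same key always maps to the same partition,
# so each user's events stay ordered relative to each other.
NUM_PARTITIONS = 4
partitions = [Partition() for _ in range(NUM_PARTITIONS)]

def partition_for(key, n=NUM_PARTITIONS):
    return hash(key) % n
```

Because appends and sequential reads are the only operations, the log can be implemented as a flat file with sequential I/O, which is where the speed comes from.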
Asynchronous Consumers
Consuming and producing are completely decoupled:
- consumers maintain their own logical offset, representing the last event they consumed
- different consumers can consume at different rates
- producers continue to append new events in order
- events are retained until an expiry time, or a max log size; Kafka is not a durable long-term store
- consumers can go offline for an extended time, or start from scratch and consume all available events
- events are durably written and replicated
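The decoupling falls out of consumers owning their offsets. A minimal sketch (not the Kafka client API) with one shared log and two independent consumers:

```python
# A shared log and consumers that each track their own offset.
# A slow or offline consumer doesn't affect the producer or other consumers.
log = []  # stands in for one partition

class Consumer:
    def __init__(self):
        self.offset = 0  # logical position, owned by the consumer itself

    def poll(self, max_events=100):
        batch = log[self.offset:self.offset + max_events]
        self.offset += len(batch)
        return batch

log.extend(["e1", "e2", "e3"])
fast, slow = Consumer(), Consumer()
fast.poll()           # reads e1..e3, advances its own offset to 3
log.append("e4")      # the producer keeps appending regardless
# `slow` has polled nothing yet; it can still catch up from offset 0
```

The broker never tracks "delivered" state per consumer; it just serves reads at whatever offset each consumer asks for, which is why rewinding or replaying is cheap.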
Low-Latency, Durable Writes
Kafka writes all events to disk
Events are stored on disk in the wire protocol
Zero-copy reads and writes avoid events ever entering user space
Kafka relies on the page cache for low-latency serving of recent events
Kafka writes all data to disk, with lots of good tricks:
- low latency for recent data from the page cache
- data on the wire is the same as on disk
- no GC overhead for the page cache
- zero-copy ops
Putting it all together
This is an overview of a typical Kafka system:
- multiple producers, brokers and consumers
- each broker has ownership of a set of partitions (it's the primary)
- broker lists and partition assignments are stored in ZK
- the consumers here use ZK to store offsets, but that's not the only way
Our Experience
Let's revisit the Rocana architecture:
- thousands of agents writing into Kafka
- events are distributed across multiple partitions and written durably to disk
- multiple, separate consumers are decoupled from the producers and each other
Resource Constraints
Customer machines are doing real work
Agent footprint must be small
Can't depend on availability of back-end services
Batching is crucial
Resource limits on producer machines:
- these machines are doing real work that's important to the business
- our agent needs to quickly encode events and produce them
- batching is important to ensure efficiency
- latency to write to Kafka is still very low
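The batching idea can be sketched as a buffer that flushes when it fills up or when a linger timeout expires. This is a toy, not the Kafka producer's actual implementation (real clients flush on a background timer as well; here staleness is only checked on produce):

```python
import time

class BatchingProducer:
    """Sketch: buffer events, flush as one batch when full or stale."""

    def __init__(self, send, batch_size=4096, linger_ms=100):
        self.send = send                 # callable that ships a list of events
        self.batch_size = batch_size
        self.linger = linger_ms / 1000.0
        self.buffer = []
        self.last_flush = time.monotonic()

    def produce(self, event):
        self.buffer.append(event)
        full = len(self.buffer) >= self.batch_size
        stale = time.monotonic() - self.last_flush >= self.linger
        if full or stale:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)       # one network round trip for many events
            self.buffer = []
        self.last_flush = time.monotonic()
```

The trade-off is the usual one: a bigger batch size amortizes per-request overhead (better throughput, less agent CPU per event), while the linger bound caps the extra latency batching adds.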
Independent consumers
Consumers aren't coupled to each other
Maintenance and upgrades are simplified
Horizontal scale per consumer
Consumers don't affect each other:
- each maintains its own offsets
- one consumer can be taken offline, can be slow, etc., with little impact
- upgrades are very easy
- a single consumer can even be rewound (theoretically)
- consumers can scale horizontally with the number of partitions
Vendor Support
Kafka has critical mass within the industry:
- Cloudera, Hortonworks and MapR all support it
- Confluent has all the designers of Kafka working on a commercial stream processing platform
Those are all good things, but there are some sharp edges to watch out for.
Shamelessly stolen from https://aphyr.com/
Kingsbury tire fire slide:
- exactly-once delivery is very hard
- not all of our consumers are doing something idempotent
- you can play back the whole partition to find the last message which was written
Ephemeral Source:
{ "syslog_arrival_ts": "1444489076463", "syslog_conn_dns": "localhost", "syslog_conn_port": "57788", "body": ..., "id": "KLE5GZF7WB2WSA5" }

Durable Source:
{ "tailed_file_inode": "2371810", "tailed_file_offset": "384930", "timestamp": "", "body": ..., "id": "73XXMLRJNHKA76" }
Overview of a Rocana Event which would be published into Kafka:
- fixed fields and key-value pairs
- the ID is a hash of the event's fields, used for duplicate detection
- for durable sources we can use offset and inode, which gets us 99% of the way
- for ephemeral sources we use arrival time + internal fields
The ID is used for three things:
- assignment to a partition
- the deduplication filter
- the ID in Solr for idempotent inserts
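The ID scheme can be sketched like this. The field set and hash here are illustrative, not Rocana's actual internals; the point is that a deterministic ID derived from identifying fields lets you partition, deduplicate, and upsert idempotently with one value:

```python
import hashlib

def event_id(fields):
    """Deterministic ID from an event's identifying fields.

    For a durable source the fields might be inode + offset; for an
    ephemeral source, arrival time + internal fields. Same input
    fields always yield the same ID, so replayed events collide."""
    h = hashlib.sha1()
    for key in sorted(fields):          # sort for a stable ordering
        h.update(key.encode())
        h.update(str(fields[key]).encode())
    return h.hexdigest()[:16]

def assign_partition(eid, num_partitions):
    # Same ID -> same partition, so duplicates land next to the original.
    return int(eid, 16) % num_partitions

seen = set()  # stand-in for the deduplication filter

def is_duplicate(eid):
    if eid in seen:
        return True
    seen.add(eid)
    return False
```

Because a replayed durable-source event reproduces the same inode and offset, it hashes to the same ID, lands on the same partition, and is either filtered or overwrites itself in Solr.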
Durability
Kafka writes every message to disk:
- defaults to fsyncing every 10k messages, or every 3 seconds (at most)
- the ACK happens when a message is written but not fsynced
- OK, so I'll replicate data across multiple machines
Unclean Elections
Kafka maintains a set of up-to-date replicas in ZK
- the "in-sync replicas", or ISR
- the ISR can dynamically grow or shrink
- by default Kafka will accept writes with a single ISR
It is possible for the set to shrink to 0 nodes, which leads to either:
- partition unavailability until an in-sync replica returns to life, OR
- data loss when an out-of-sync node begins accepting writes
This is tunable with the unclean leader election property
- defaults to true in 0.8.2
http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen
The default in Kafka is to continue making progress in the presence of node failures (AP):
- unclean elections allow a replica which has not seen all writes to become the leader when the ISR shrinks to 0
- the minimum ISR size required to accept writes is only 1 by default
- when a previously in-sync replica comes back, those records are lost
- it can be disabled; see Jay's blog for more discussion
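If you want CP behavior instead, the relevant knobs look roughly like this. These property names are from the 0.8.2-era configuration; verify against the docs for your version before relying on them:

```properties
# Broker settings: favor consistency over availability
unclean.leader.election.enable=false   # never elect an out-of-sync replica as leader
min.insync.replicas=2                  # refuse writes when fewer replicas are in sync

# fsync cadence (the defaults described above)
log.flush.interval.messages=10000
log.flush.interval.ms=3000

# Old (0.8) producer: wait for an ACK from all in-sync replicas
request.required.acks=-1
```

With unclean elections disabled, an ISR that shrinks to zero makes the partition unavailable rather than silently losing acknowledged writes, which is the trade the Jepsen discussion is about.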
Some things aren't hard, but you need to look out for them:
Schema Versioning
http://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one
Schemas are absolutely necessary
Have a plan for how to evolve the schema before v1
A schema registry is a good investment
Data you put in Kafka really needs to have a schema:
- schemas really need to have an evolution strategy
- you probably want some notion of a schema registry
- Gwen's post is great
- we use Avro, where the consumer has to know the writer schema; we tried to mitigate this with nullable fields, with no luck
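The evolution problem can be illustrated with a toy version of Avro's resolution rule: a newer reader can decode an older writer's record only if every field the reader expects but the writer lacks has a default. This is a sketch of the rule, not the Avro library:

```python
# Toy reader schema: maps field name -> default value.
# None means "no default", i.e. the field is required. (In real Avro a
# default of null is expressible; this toy conflates the two for brevity.)
READER_FIELDS = {
    "id": None,                # required: every writer version had it
    "body": None,              # required
    "datacenter": "us-east",   # added in v2, with a default for old records
}

def decode(record, reader_fields=READER_FIELDS):
    """Resolve a writer's record against the reader's expectations."""
    out = {}
    for name, default in reader_fields.items():
        if name in record:
            out[name] = record[name]
        elif default is not None:
            out[name] = default        # old record: fill in the default
        else:
            raise ValueError(f"missing required field {name!r}")
    return out
```

The "plan before v1" advice amounts to this: if v1 ships required fields you later want to remove, or v2 adds fields without defaults, old and new records can no longer flow through the same consumers.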
Security
No encryption or authentication in 0.8.x
stunnel, encryption at the app layer are possible
Should be fixed in 0.9.0
There isn't any. No encryption on disk or in flight, and no authentication:
- you can use stunnel
- you could encrypt each byte buffer and decrypt on the client side
- these will probably both be fixed in 0.9.0 this month
Replication
Cross-DC clusters are not recommended
Kafka includes MirrorMaker for replication between two clusters
Replication is asynchronous
Offsets arent consistent
MirrorMaker is basically just a consumer/producer pair which pumps data between clusters:
- it doesn't preserve offsets, so consumers can't fail over
- you can send events between two different-sized clusters
- you can merge streams from two data centres
Operations
Everything is manual:
Rebalancing partitions
Rebalancing leaders
Decommissioning nodes
Watch for lagging consumers
Kafka operations are pretty basic; it comes with a giant `bin` dir full of tools:
- a CLI for rebalancing partitions and leaders
- leaders and partitions rebalance on node failure
- adding nodes requires reassignment
- decommissioning nodes is a giant pain right now
- there's a tool for spotting lagging consumers
Sizing
Consider both throughput and retention time
Overprovision number of partitions
Rebalancing is easy, but re-sharding breaks consistent hashing
http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
Factors to consider when sizing a cluster:
I/O:
- throughput
- retention time frame (throughput over time)
Partitions:
- limit the concurrency of consumers
- future growth (in terms of setting the number of partitions)
- growing a cluster online is manual but possible in 0.8.2
- growing the number of partitions breaks consistent hashing! (http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/)
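The re-sharding caveat is easy to demonstrate: with hash-mod partitioning, changing the partition count moves most keys, so events for a moved key start landing on a different partition than their history. The numbers below are illustrative:

```python
import hashlib

def partition(key, n):
    # Stable hash (Python's built-in hash() is salted per process).
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

keys = [f"user-{i}" for i in range(1000)]
before = {k: partition(k, 12) for k in keys}   # original cluster: 12 partitions
after = {k: partition(k, 16) for k in keys}    # grown to 16 partitions
moved = sum(1 for k in keys if before[k] != after[k])
# With mod-based assignment, roughly three quarters of keys change
# partition on a 12 -> 16 grow, breaking per-key ordering across the change.
print(f"{moved} of {len(keys)} keys changed partition")
```

This is why the slide recommends over-provisioning partitions up front: rebalancing which broker owns a partition is cheap, but changing the partition count is not.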
Performance
Jay Kreps ran an on-premises benchmark
18 spindles, 18 cores in 3 boxes could produce 2.5M events/sec
Aggressive batching is necessary
Synchronous ACKs halve throughput
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
Jay Kreps has a blog post about this, using 3 commodity broker boxes with 6 cores and 6 spindles each:
- it's a little weird: he only uses 6 partitions, so he never exercises all the spindles in the cluster
- he batches small messages really aggressively (8k batches of 100-byte messages)
- this is on-premises; he hits 2.5M records/sec producing and consuming
- requiring 3 ACKs for every message halved throughput
Performance
Reproduced on AWS with 3- and 5-node clusters
- d2.xlarge nodes have 3 spindles, 4 cores, 30.5GB RAM
- 5 producers on m3.xlarge instances
3 nodes accepted 2.6M events/s
- 24 partitions, one replica, one ACK
- dropped to 1.7M with 3x replication and 1 ACK
5 nodes accepted 3.6M events/s
- 48 partitions, one replica, one ACK
- dropped to 2.16M with 3x replication and 1 ACK
I used a similar methodology on EC2 to get some sizing numbers:
- used 4k batch sizes; results were broadly similar (1k and 2k hurt perf)
- over-provisioning partitions by 2x spindles doesn't give a benefit, but doesn't slow things down either
- over-provisioning by 2x and adding 3x replication did cause a slowdown
- one partition actually hit 700k events/s; there may be coordination issues in the producer
- synchronous ACKs were brutal - a 10x performance hit - almost definitely due to AWS network latency
- each node is ~$500/month
- at 250MB/sec, we'd only get ~18 hours of retention
- we've seen instances of only 12 hours of retention
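The retention point is worth making explicit: retention is just disk divided by ingest rate, so a fast cluster with modest disks keeps very little history. The ~6 TB of local disk per d2.xlarge-class node is an assumption here; the 250MB/s figure is from the slide:

```python
# retention_seconds = usable_disk_bytes / (ingest_rate * replication_factor)

def retention_hours(disk_per_node_tb, nodes, ingest_mb_s, replication=1):
    usable_bytes = disk_per_node_tb * 1e12 * nodes
    write_rate = ingest_mb_s * 1e6 * replication
    return usable_bytes / write_rate / 3600

# 3 nodes with ~6 TB of local disk each (assumed), ingesting 250 MB/s,
# no replication: about 20 hours before filesystem and index overhead,
# consistent with the ~18 hours observed on the slide.
hours = retention_hours(disk_per_node_tb=6, nodes=3, ingest_mb_s=250)
print(f"~{hours:.0f} hours of retention")
```

Note that 3x replication divides that figure by three again, which is why throughput and retention time have to be sized together rather than separately.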
Thank You!