Apache Kafka at Rocana
Persistent Machine Data Collection at Scale
2015 Rocana, Inc. All Rights Reserved.
I'm Alan Gardner. I'm here to talk about our use of Apache Kafka at Rocana.
Who am I?
Platform Engineer Based in Ottawa [email protected] @alanctgardner
Platform engineer at Rocana:
- work on data ingest, storage and processing
- distributed open-source systems: Hadoop, Kafka, Solr
- systems programming work as well
- work remotely from Ottawa, Canada
This is my cat.
Working at Rocana
Working at Rocana is great:
- everybody is remote
- very smart, very nice people
- quarterly onsites
Rocana Ops
What is Rocana Ops?
- a platform for IT operations data
- designed for tens of thousands of servers in multiple data centers
- distills the entire organization's IT infrastructure down to a single screen: what's wrong?
- scalable collection framework - out-of-the-box host data and app logs
- event data warehouse built on open source technologies and open schemas
- visualization, anomaly detection and machine learning to provide guided root cause analysis, as opposed to a wall of graphs or a pile of logs
- Apache Kafka is the Enterprise Data Bus; I'm going to talk about why we chose Kafka for that role
Kafka Principles
To explain why we chose Kafka, I'm going to start with how Kafka works and why it's designed the way it is.
History
Designed at LinkedIn
Documented in a 2013 blog post by Jay Kreps
LinkedIn moved from a monolith to multiple data stores and services
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Designed at LinkedIn to handle the explosion of different systems being created:
- Jay Kreps' blog post describes Kafka from first principles, including motivation
- some of these images are cribbed from that post, where appropriate
Complexity
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
LinkedIn's problem:
- lots of front-end services
- lots of back-end services
- hooking them together produces this complex spaghetti of dependencies
- the front-end has to be highly available and low-latency: if you write synchronously, you can only be as fast as your slowest backend service
Complexity
Centralized Data Bus
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Kafka acts as a central bus for data:
- every front-end service writes all events into Kafka
- backend services can take only the events they're interested in
- data doesn't live in Kafka forever
- Kafka is run as a utility within LinkedIn
- this solves one goal - a centralized data bus - but we still need horizontal scale and durability
Centralized Data Bus
This is much better
Design Goals
A centralized data bus that:
Scales horizontally
Delivers (some) events in order
Decouples producers and consumers
Has low latency end-to-end
A Horizontally Scalable Log
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Kafka is fundamentally a collection of logs:
- events are only appended
- events are always consumed in the same order
- a single partition is a log: an ordered set of events
- every event has an offset
- partitions are the units of scale, like shards
- log operations are constrained, so we can make them fast
- the example is sharding on users
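The partition model above can be sketched as a tiny in-memory append-only log. This is illustrative only - the class and method names are made up, not Kafka's API:

```python
class Partition:
    """An append-only log: each event gets a monotonically increasing offset."""

    def __init__(self):
        self._events = []

    def append(self, event):
        offset = len(self._events)     # the offset is just the position in the log
        self._events.append(event)
        return offset

    def read(self, offset, max_events=10):
        """Return events starting at `offset`, always in append order."""
        return self._events[offset:offset + max_events]

# Sharding on users: the same key always maps to the same partition,
# so each user's events stay ordered relative to each other.
NUM_PARTITIONS = 4
partitions = [Partition() for _ in range(NUM_PARTITIONS)]

def partition_for(key, n=NUM_PARTITIONS):
    return hash(key) % n
```

Because appends and sequential reads are the only operations, the log can be implemented as a flat file with sequential I/O, which is where the speed comes from.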
Asynchronous Consumers
Consuming and producing are completely decoupled:
- consumers maintain their own logical offset, representing the last event they consumed
- different consumers can consume at different rates
- producers continue to append new events in order
- events are retained until an expiry time, or a max log size; Kafka is not a durable long-term store
- consumers can go offline for an extended time, or start from scratch and consume all available events
- events are durably written and replicated
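The decoupling falls out of consumers owning their offsets. A minimal sketch (not the Kafka client API) with one shared log and two independent consumers:

```python
# A shared log and consumers that each track their own offset.
# A slow or offline consumer doesn't affect the producer or other consumers.
log = []  # stands in for one partition

class Consumer:
    def __init__(self):
        self.offset = 0  # logical position, owned by the consumer itself

    def poll(self, max_events=100):
        batch = log[self.offset:self.offset + max_events]
        self.offset += len(batch)
        return batch

log.extend(["e1", "e2", "e3"])
fast, slow = Consumer(), Consumer()
fast.poll()           # reads e1..e3, advances its own offset to 3
log.append("e4")      # the producer keeps appending regardless
# `slow` has polled nothing yet; it can still catch up from offset 0
```

The broker never tracks "delivered" state per consumer; it just serves reads at whatever offset each consumer asks for, which is why rewinding or replaying is cheap.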
Low-Latency, Durable Writes
Kafka writes all events to disk
Events are stored on disk in the wire protocol
Zero-copy reads and writes avoid events ever entering user space
Kafka relies on the page cache for low-latency serving of recent events
Kafka writes all data to disk, with lots of good tricks:
- low latency for recent data from the page cache
- data on the wire is the same as on disk
- no GC overhead for the page cache
- zero-copy ops
Putting it all together
This is an overview of a typical Kafka system:
- multiple producers, brokers and consumers
- each broker has ownership of a set of partitions (it's the primary)
- broker lists and partition assignments are stored in ZK
- the consumers here use ZK to store offsets, but that's not the only way
Our Experience
Let's revisit the Rocana architecture:
- thousands of agents writing into Kafka
- events are distributed across multiple partitions and written durably to disk
- multiple, separate consumers are decoupled from the producers and each other
Resource Constraints
Customer machines are doing real work
Agent footprint must be small
Can't depend on availability of back-end services
Batching is crucial
Resource limits on producer machines:
- these machines are doing real work that's important to the business
- our agent needs to quickly encode events and produce them
- batching is important to ensure efficiency
- latency to write to Kafka is still very low
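The batching idea can be sketched as a buffer that flushes when it fills up or when a linger timeout expires. This is a toy, not the Kafka producer's actual implementation (real clients flush on a background timer as well; here staleness is only checked on produce):

```python
import time

class BatchingProducer:
    """Sketch: buffer events, flush as one batch when full or stale."""

    def __init__(self, send, batch_size=4096, linger_ms=100):
        self.send = send                 # callable that ships a list of events
        self.batch_size = batch_size
        self.linger = linger_ms / 1000.0
        self.buffer = []
        self.last_flush = time.monotonic()

    def produce(self, event):
        self.buffer.append(event)
        full = len(self.buffer) >= self.batch_size
        stale = time.monotonic() - self.last_flush >= self.linger
        if full or stale:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)       # one network round trip for many events
            self.buffer = []
        self.last_flush = time.monotonic()
```

The trade-off is the usual one: a bigger batch size amortizes per-request overhead (better throughput, less agent CPU per event), while the linger bound caps the extra latency batching adds.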
Independent consumers
Consumers aren't coupled to each other
Maintenance and upgrades are simplified
Horizontal scale per consumer
Consumers don't affect each other:
- each maintains its own offsets
- one consumer can be taken offline, can be slow, etc., with little impact
- upgrades are very easy
- a single consumer can even be rewound (theoretically)
- consumers can scale horizontally with the number of partitions
Vendor Support
Kafka has critical mass within the industry:
- Cloudera, Hortonworks and MapR all support it
- Confluent has all the designers of Kafka working on a commercial stream processing platform
Those are all good things, but there are some sharp edges to watch out for.
Shamelessly stolen from https://aphyr.com/
Kingsbury tire fire slide:
- exactly-once delivery is very hard
- not all of our consumers are doing something idempotent
- you can play back the whole partition to find the last message which was written
Ephemeral Source:
{ "syslog_arrival_ts": "1444489076463", "syslog_conn_dns": "localhost", "syslog_conn_port": "57788", "body": ..., "id": "KLE5GZF7WB2WSA5" }

Durable Source:
{ "tailed_file_inode": "2371810", "tailed_file_offset": "384930", "timestamp": "", "body": ..., "id": "73XXMLRJNHKA76" }
Overview of a Rocana Event which would be published into Kafka:
- fixed fields and key-value pairs
- the ID is a hash of the event's fields, used for duplicate detection
- for durable sources we can use offset and inode, which gets us 99% of the way
- for ephemeral sources we use arrival time + internal fields
The ID is used for three things:
- assignment to a partition
- the deduplication filter
- the ID in Solr for idempotent inserts
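The ID scheme can be sketched like this. The field set and hash here are illustrative, not Rocana's actual internals; the point is that a deterministic ID derived from identifying fields lets you partition, deduplicate, and upsert idempotently with one value:

```python
import hashlib

def event_id(fields):
    """Deterministic ID from an event's identifying fields.

    For a durable source the fields might be inode + offset; for an
    ephemeral source, arrival time + internal fields. Same input
    fields always yield the same ID, so replayed events collide."""
    h = hashlib.sha1()
    for key in sorted(fields):          # sort for a stable ordering
        h.update(key.encode())
        h.update(str(fields[key]).encode())
    return h.hexdigest()[:16]

def assign_partition(eid, num_partitions):
    # Same ID -> same partition, so duplicates land next to the original.
    return int(eid, 16) % num_partitions

seen = set()  # stand-in for the deduplication filter

def is_duplicate(eid):
    if eid in seen:
        return True
    seen.add(eid)
    return False
```

Because a replayed durable-source event reproduces the same inode and offset, it hashes to the same ID, lands on the same partition, and is either filtered or overwrites itself in Solr.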
Durability
Kafka writes every message to disk:
- defaults to fsyncing every 10k messages, or every 3 seconds (at most)
- the ACK happens when a message is written but not fsynced
- OK, so I'll replicate data across multiple machines
Unclean Elections
Kafka maintains a set of up-to-date replicas in ZK
- the "in-sync replicas", or ISR
- the ISR can dynamically grow or shrink
- by default Kafka will accept writes with a single ISR
It is possible for the set to shrink to 0 nodes, which leads to either:
- partition unavailability until an in-sync replica returns to life, OR
- data loss when an out-of-sync node begins accepting writes
This is tunable with the unclean leader election property
- defaults to true in 0.8.2
http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen
The default in Kafka is to continue making progress in the presence of node failures (AP):
- unclean elections allow a replica which has not seen all writes to become the leader when the ISR shrinks to 0
- the minimum ISR size required to accept writes is only 1 by default
- when a previously in-sync replica comes back, those records are lost
- it can be disabled; see Jay's blog for more discussion
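If you want CP behavior instead, the relevant knobs look roughly like this. These property names are from the 0.8.2-era configuration; verify against the docs for your version before relying on them:

```properties
# Broker settings: favor consistency over availability
unclean.leader.election.enable=false   # never elect an out-of-sync replica as leader
min.insync.replicas=2                  # refuse writes when fewer replicas are in sync

# fsync cadence (the defaults described above)
log.flush.interval.messages=10000
log.flush.interval.ms=3000

# Old (0.8) producer: wait for an ACK from all in-sync replicas
request.required.acks=-1
```

With unclean elections disabled, an ISR that shrinks to zero makes the partition unavailable rather than silently losing acknowledged writes, which is the trade the Jepsen discussion is about.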
Some things aren't hard, but you need to look out for them:
Schema Versioning
http://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one
Schemas are absolutely necessary
Have a plan for how to evolve the schema before v1
A schema registry is a good investment
Data you put in Kafka really needs to have a schema:
- schemas really need to have an evolution strategy
- you probably want some notion of a schema registry
- Gwen's post is great
- we use Avro, where the consumer has to know the writer schema; we tried to mitigate this with nullable fields, with no luck
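The evolution problem can be illustrated with a toy version of Avro's resolution rule: a newer reader can decode an older writer's record only if every field the reader expects but the writer lacks has a default. This is a sketch of the rule, not the Avro library:

```python
# Toy reader schema: maps field name -> default value.
# None means "no default", i.e. the field is required. (In real Avro a
# default of null is expressible; this toy conflates the two for brevity.)
READER_FIELDS = {
    "id": None,                # required: every writer version had it
    "body": None,              # required
    "datacenter": "us-east",   # added in v2, with a default for old records
}

def decode(record, reader_fields=READER_FIELDS):
    """Resolve a writer's record against the reader's expectations."""
    out = {}
    for name, default in reader_fields.items():
        if name in record:
            out[name] = record[name]
        elif default is not None:
            out[name] = default        # old record: fill in the default
        else:
            raise ValueError(f"missing required field {name!r}")
    return out
```

The "plan before v1" advice amounts to this: if v1 ships required fields you later want to remove, or v2 adds fields without defaults, old and new records can no longer flow through the same consumers.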
Security
No encryption or authentication in 0.8.x
stunnel, encryption at the app layer are possible
Should be fixed in 0.9.0
There isn't any. No encryption on disk or in flight, and no authentication:
- you can use stunnel
- you could encrypt each byte buffer and decrypt on the client side
- these will probably both be fixed in 0.9.0 this month
Replication
Cross-DC clusters are not recommended
Kafka includes MirrorMaker for replication between two clusters
Replication is asynchronous
Offsets arent consistent
MirrorMaker is basically just a consumer/producer pair which pumps data between clusters:
- it doesn't preserve offsets, so consumers can't fail over
- you can send events between two different-sized clusters
- you can merge streams from two data centres
Operations
Everything is manual:
Rebalancing partitions
Rebalancing leaders
Decommissioning nodes
Watch for lagging consumers
Kafka operations are pretty basic; it comes with a giant `bin` dir full of tools:
- a CLI for rebalancing partitions and leaders
- leaders and partitions rebalance on node failure
- adding nodes requires reassignment
- decommissioning nodes is a giant pain right now
- there's a tool for spotting lagging consumers
Sizing
Consider both throughput and retention time
Overprovision number of partitions
Rebalancing is easy, but re-sharding breaks consistent hashing
http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
Factors to consider when sizing a cluster:
I/O:
- throughput
- retention time frame (throughput over time)
Partitions:
- limit the concurrency of consumers
- future growth (in terms of setting the number of partitions)
- growing a cluster online is manual but possible in 0.8.2
- growing the number of partitions breaks consistent hashing! (http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/)
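The re-sharding caveat is easy to demonstrate: with hash-mod partitioning, changing the partition count moves most keys, so events for a moved key start landing on a different partition than their history. The numbers below are illustrative:

```python
import hashlib

def partition(key, n):
    # Stable hash (Python's built-in hash() is salted per process).
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

keys = [f"user-{i}" for i in range(1000)]
before = {k: partition(k, 12) for k in keys}   # original cluster: 12 partitions
after = {k: partition(k, 16) for k in keys}    # grown to 16 partitions
moved = sum(1 for k in keys if before[k] != after[k])
# With mod-based assignment, roughly three quarters of keys change
# partition on a 12 -> 16 grow, breaking per-key ordering across the change.
print(f"{moved} of {len(keys)} keys changed partition")
```

This is why the slide recommends over-provisioning partitions up front: rebalancing which broker owns a partition is cheap, but changing the partition count is not.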
Performance
Jay Kreps ran an on-premises benchmark
18 spindles, 18 cores in 3 boxes could produce 2.5M events/sec
Aggressive batching is necessary
Synchronous ACKs halve throughput
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
Jay Kreps has a blog post about this, using 3 commodity broker boxes with 6 cores and 6 spindles each:
- it's a little weird: he only uses 6 partitions, so he never exercises all the spindles in the cluster
- he batches small messages really aggressively (8k batches of 100-byte messages)
- this is on-premises; he hits 2.5M records/sec producing and consuming
- requiring 3 ACKs for every message halved throughput
Performance
Reproduced on AWS with 3- and 5-node clusters
- d2.xlarge nodes have 3 spindles, 4 cores, 30.5GB RAM
- 5 producers on m3.xlarge instances
3 nodes accepted 2.6M events/s
- 24 partitions, one replica, one ACK
- dropped to 1.7M with 3x replication and 1 ACK
5 nodes accepted 3.6M events/s
- 48 partitions, one replica, one ACK
- dropped to 2.16M with 3x replication and 1 ACK
I used a similar methodology on EC2 to get some sizing numbers:
- used 4k batch sizes; results were broadly similar (1k and 2k hurt perf)
- over-provisioning partitions by 2x spindles doesn't give a benefit, but doesn't slow things down either
- over-provisioning by 2x and adding 3x replication did cause a slowdown
- one partition actually hit 700k events/s; there may be coordination issues in the producer
- synchronous ACKs were brutal - a 10x performance hit - almost definitely due to AWS network latency
- each node is ~$500/month
- at 250MB/sec, we'd only get ~18 hours of retention
- we've seen instances of only 12 hours of retention
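The retention point is worth making explicit: retention is just disk divided by ingest rate, so a fast cluster with modest disks keeps very little history. The ~6 TB of local disk per d2.xlarge-class node is an assumption here; the 250MB/s figure is from the slide:

```python
# retention_seconds = usable_disk_bytes / (ingest_rate * replication_factor)

def retention_hours(disk_per_node_tb, nodes, ingest_mb_s, replication=1):
    usable_bytes = disk_per_node_tb * 1e12 * nodes
    write_rate = ingest_mb_s * 1e6 * replication
    return usable_bytes / write_rate / 3600

# 3 nodes with ~6 TB of local disk each (assumed), ingesting 250 MB/s,
# no replication: about 20 hours before filesystem and index overhead,
# consistent with the ~18 hours observed on the slide.
hours = retention_hours(disk_per_node_tb=6, nodes=3, ingest_mb_s=250)
print(f"~{hours:.0f} hours of retention")
```

Note that 3x replication divides that figure by three again, which is why throughput and retention time have to be sized together rather than separately.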
Thank You!