DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine data

Transcript

Apache Kafka at Rocana
Persistent Machine Data Collection at Scale

2015 Rocana, Inc. All Rights Reserved.

I'm Alan Gardner, here to talk about our use of Apache Kafka at Rocana.

Who am I?

Platform Engineer
Based in Ottawa
[email protected]
@alanctgardner


Platform engineer at Rocana
Work on data ingest, storage and processing
Distributed open-source systems: Hadoop, Kafka, Solr
Systems programming work as well
Work remotely from Ottawa, Canada
This is my cat.

Working at Rocana


Working at Rocana is great:
Everybody is remote
Very smart, very nice people
Quarterly onsites

Rocana Ops


What is Rocana Ops?
A platform for IT operations data
Designed for tens of thousands of servers in multiple data centers
Distills the entire organization's IT infrastructure down to a single screen: what's wrong?
Scalable collection framework - out-of-the-box host data and app logs
Event data warehouse built on open source technologies and open schemas
Visualization, anomaly detection and machine learning to provide guided root cause analysis, as opposed to a wall of graphs or a pile of logs
Apache Kafka is the Enterprise Data Bus
Going to talk about why we chose Kafka in that role

Kafka Principles


To explain why we chose Kafka, I'm going to start with how Kafka works and why it's designed the way it is.

History

Designed at LinkedIn

Documented in a 2013 blog post by Jay Kreps

LinkedIn moved from a monolith to multiple data stores and services

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying


Designed at LinkedIn to handle the explosion of different systems being created
Jay Kreps' blog post describes Kafka from first principles, including motivation
Some of these images are cribbed from that post, where appropriate

Complexity

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying


LinkedIn's problem:
Lots of front-end services
Lots of back-end services
Hooking them together produces this complex spaghetti of dependencies
The front-end has to be highly available and low-latency
If you write synchronously, you can only be as fast as your slowest back-end service

Complexity


Centralized Data Bus

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying


Kafka acts as a central bus for data:
Every front-end service writes all events into Kafka
Back-end services can take only the events they're interested in
Data doesn't live in Kafka forever
Kafka is run as a utility within LinkedIn
This solves one goal, a centralized data bus; we still need horizontal scale and durability

Centralized Data Bus


This is much better

Design Goals

A centralized data bus that:

Scales horizontally

Delivers (some) events in order

Decouples producers and consumers

Has low latency end-to-end


A Horizontally Scalable Log

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying


Kafka is fundamentally a collection of logs
Events are only appended
Events are always consumed in the same order
A single partition is a log: an ordered set of events
Every event has an offset
Partitions are the units of scale, like shards
Log operations are constrained, so we can make them fast
The example here is sharding on users (a producer sketch follows below)
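
As a minimal sketch of how keys map to partitions with the Java producer client (the broker list, topic name and key below are placeholders, not Rocana's setup): records that share a key hash to the same partition, so they are appended and later read back in the order they were sent.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PartitionedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder broker list
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            // Same key -> same partition, so these ten events keep their relative order.
            for (int i = 0; i < 10; i++) {
                producer.send(new ProducerRecord<>("events", "user-42", "event-" + i));
            }
            producer.close();
        }
    }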

Asynchronous Consumers


Consuming and producing are completely decoupled:
Consumers maintain their own logical offset, representing the last event they consumed
Different consumers can consume at different rates
Producers continue to append new events in order
Events are retained until an expiry time, or a max log size
Kafka is not a durable long-term store
Consumers can go offline for an extended time, or start from scratch and consume all available events
Events are durably written and replicated
(A consumer sketch follows below.)
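
A minimal consumer sketch to go with that, using the newer Java consumer API (which shipped after this talk, in 0.9); the broker address, topic and group name are placeholders. Each consumer group tracks its own committed offset, so a slow or offline group simply falls behind without affecting producers or other groups.

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class IndependentConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");  // placeholder
            props.put("group.id", "metrics-rollup");         // each group keeps its own offsets
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Arrays.asList("events"));
            while (true) {
                // poll() returns whatever sits past this group's last committed offset
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync();
            }
        }
    }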

Low-Latency, Durable Writes

Kafka writes all events to disk

Events are stored on disk in the wire protocol

Zero-copy reads and writes avoid events ever entering user space

Kafka relies on the page cache for low-latency serving of recent events


Kafka writes all data to disk, with lots of good tricks:
Low latency for recent data served from the page cache
Data on the wire is the same as on disk
No GC overhead for the page cache
Zero-copy operations

Putting it all together


This is an overview of a typical Kafka system:
Multiple producers, brokers and consumers
Each broker has ownership of a set of partitions (it's the primary)
Broker lists and partition assignments are stored in ZK
Consumers are using ZK to store offsets here, but that's not the only way

Our Experience


Let's revisit the Rocana architecture:
Thousands of agents writing into Kafka
Events are distributed across multiple partitions, written durably to disk
Multiple, separate consumers are decoupled from the producers and each other

Resource Constraints

Customer machines are doing real work

Agent footprint must be small

Can't depend on availability of back-end services

Batching is crucial


Resource limits on producer machines:
These machines are doing real work that's important to the business
Our agent needs to quickly encode events and produce them
Batching is important to ensure efficiency (see the config sketch below)
Latency to write to Kafka is still very low
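
For illustration, the batching-related knobs on the Java producer look roughly like this; the values are arbitrary examples to show the trade-off, not Rocana's agent settings.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;

    public class BatchingProducerConfig {
        // Illustrative settings only: bigger batches and light compression keep the
        // agent's CPU and network footprint small on machines doing real work.
        public static KafkaProducer<byte[], byte[]> create() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder
            props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("batch.size", "65536");        // bytes buffered per partition before sending
            props.put("linger.ms", "50");            // wait up to 50 ms to fill a batch
            props.put("compression.type", "snappy"); // trade a little CPU for less network I/O
            props.put("acks", "1");                  // leader-only ack keeps write latency low
            return new KafkaProducer<>(props);
        }
    }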

Independent consumers

Consumers aren't coupled to each other

Maintenance and upgrades are simplified

Horizontal scale per consumer


Consumers don't affect each other:
Each maintains its own offsets
One consumer can be taken offline, can be slow, etc. with little impact
Upgrades are very easy
A single consumer can even be rewound (theoretically)
Consumers can scale horizontally with the number of partitions

Vendor Support


Kafka has critical mass within the industry:
Cloudera, Hortonworks and MapR all support it
Confluent has all the designers of Kafka working on a commercial stream processing platform


Those are all good things, but there are some sharp edges to watch out for.

Shamelessly stolen from https://aphyr.com/


Kingsbury tire fire slide
Exactly-once delivery is very hard
Not all of our consumers are doing something idempotent
You can play back the whole partition to find the last message which was written

Ephemeral Source:
{ "syslog_arrival_ts": "1444489076463", "syslog_conn_dns": "localhost", "syslog_conn_port": "57788", "body": ..., "id": "KLE5GZF7WB2WSA5", ... }

Durable Source:
{ "tailed_file_inode": "2371810", "tailed_file_offset": "384930", "timestamp": "", "body": ..., "id": "73XXMLRJNHKA76", ... }


Overview of a Rocana Event which would be published into Kafka:
Fixed fields plus key-value pairs
The ID is a hash of the event's fields, used for duplicate detection
For durable sources we can use offset and inode, which gets us 99% of the way
For ephemeral sources we use arrival time + internal fields
The ID is used for three things:
Assignment to a partition
The deduplication filter
The ID in Solr for idempotent inserts
(A hypothetical sketch of the idea follows below.)
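
The slides don't show how the ID is actually computed, so the following is a purely hypothetical sketch of the idea: derive a stable ID from the fields that identify the event at its source, then reuse that ID for partition assignment, the dedup filter, and idempotent inserts into Solr.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class EventId {
        // Hypothetical: for a durable source (a tailed file), inode + offset uniquely
        // identify an event, so hashing them yields the same ID on every retry.
        public static String durableSourceId(long inode, long offset) {
            try {
                MessageDigest md = MessageDigest.getInstance("SHA-1");
                byte[] digest = md.digest((inode + ":" + offset).getBytes(StandardCharsets.UTF_8));
                StringBuilder sb = new StringBuilder();
                for (byte b : digest) {
                    sb.append(String.format("%02x", b));
                }
                return sb.toString();
            } catch (NoSuchAlgorithmException e) {
                throw new RuntimeException(e);
            }
        }
    }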

Durability


Kafka writes every message to disk
Defaults to fsyncing every 10k messages, or every 3 seconds (at most)
The ACK happens when a message is written but not yet fsynced
OK, so I'll replicate data across multiple machines

Unclean Elections

Kafka maintains a set of up-to-date replicas in ZK, the in-sync replicas or ISR
The ISR can dynamically grow or shrink
By default Kafka will accept writes with a single in-sync replica
It is possible for the set to shrink to 0 nodes, which leads to either:
Partition unavailability until an in-sync replica returns to life
OR data loss when an out-of-sync node begins accepting writes
This is tunable with the unclean leader election property
It defaults to true in 0.8.2

http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen


The default in Kafka is to continue making progress in the presence of node failures (AP):
- Unclean elections allow a replica which has not seen all writes to become the leader when the ISR shrinks to 0
- The minimum ISR size to accept writes is only 1 by default
- When a previously in-sync replica comes back, those records are lost
- It can be disabled; see Jay's blog for more discussion (and the config sketch below)
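
If you'd rather trade availability (and latency) for durability, the relevant knobs look roughly like this; the values are illustrative, and the broker/topic properties are shown only as comments.

    import java.util.Properties;

    public class SafeWriteConfig {
        // Producer-side settings that favour durability over latency.
        public static Properties producerProps() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder
            props.put("acks", "all");   // wait for every in-sync replica, not just the leader
            props.put("retries", "3");  // retry transient failures instead of dropping events
            // Broker/topic side of the same trade-off (server.properties or per-topic config):
            //   unclean.leader.election.enable=false  -- prefer unavailability over data loss
            //   min.insync.replicas=2                 -- refuse writes when the ISR gets too small
            return props;
        }
    }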


Some things aren't hard, but you need to look out for:

Schema Versioning

http://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one

Schemas are absolutely necessary

Have a plan for how to evolve the schema before v1

A schema registry is a good investment


Data you put in Kafka really needs to have a schema
Schemas really need to have an evolution strategy
You probably want some notion of a schema registry
Gwen's post is great
We use Avro, where the consumer has to know the writer schema (see the sketch below)
We tried to mitigate this with nullable fields, with no luck
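
A minimal sketch of what "the consumer has to know the writer schema" means in Avro terms: decoding a payload requires both the schema it was written with and the schema the reader expects. The class and method names here are made up for illustration.

    import java.io.IOException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.DecoderFactory;

    public class EventDecoder {
        // Resolves a payload written with an older schema into the reader's current
        // schema. Knowing which writer schema was used is exactly the problem a
        // schema registry solves.
        public static GenericRecord decode(byte[] payload, Schema writerSchema, Schema readerSchema)
                throws IOException {
            GenericDatumReader<GenericRecord> reader =
                    new GenericDatumReader<>(writerSchema, readerSchema);
            return reader.read(null, DecoderFactory.get().binaryDecoder(payload, null));
        }
    }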

Security

No encryption or authentication in 0.8.x

stunnel, encryption at the app layer are possible

Should be fixed in 0.9.0


There isn't any. No encryption on disk or in flight, and no authentication:
You can use stunnel
You could encrypt each byte buffer and decrypt on the client side (see the sketch below)
These will probably both be fixed in 0.9.0 this month
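
A sketch of the "encrypt each byte buffer" workaround using standard javax.crypto, assuming a pre-shared AES key; this is only an illustration of the approach, not something the talk prescribes, and key distribution is the hard part that isn't shown.

    import java.security.SecureRandom;
    import javax.crypto.Cipher;
    import javax.crypto.spec.GCMParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    public class PayloadCrypto {
        // Encrypt an event payload before producer.send(); the consumer strips the
        // IV and decrypts after polling. Kafka 0.8.x never sees plaintext.
        public static byte[] encrypt(byte[] plaintext, byte[] aesKey) throws Exception {
            byte[] iv = new byte[12];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(aesKey, "AES"),
                    new GCMParameterSpec(128, iv));
            byte[] ciphertext = cipher.doFinal(plaintext);
            byte[] out = new byte[iv.length + ciphertext.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ciphertext, 0, out, iv.length, ciphertext.length);
            return out; // IV prepended so the consumer can decrypt
        }
    }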

Replication

Cross-DC clusters are not recommended

Kafka includes MirrorMaker for replication between two clusters

Replication is asynchronous

Offsets arent consistent


MirrorMaker is basically just a consumer/producer which pumps data between clusters:
It doesn't preserve offsets, so consumers can't fail over
You can send events between two different-sized clusters
You can merge streams from two data centres

Operations

Everything is manual:

Rebalancing partitions

Rebalancing leaders

Decommissioning nodes

Watch for lagging consumers


Kafka operations are pretty basic; it comes with a giant `bin` dir full of tools:
CLI tools for rebalancing partitions and leaders
Leaders and partitions rebalance on node failure
Adding nodes requires reassignment
Decommissioning nodes is a giant pain right now
There is a tool for spotting lagging consumers

Sizing

Consider both throughput and retention time

Overprovision number of partitions

Rebalancing is easy, but re-sharding breaks consistent hashing

http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/


Factors to consider when sizing a cluster:
I/O: both throughput and the retention time frame (throughput over time) - see the back-of-envelope below
Partitions: they limit the concurrency of consumers, so account for future growth when setting the number of partitions
Growing a cluster online is manual but possible in 0.8.2
Growing the number of partitions breaks consistent hashing! (http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/)
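
As a back-of-envelope for the throughput-over-time point (the numbers here are purely illustrative, not from the talk): at 100 MB/s of sustained ingest with 7 days of retention and 3x replication, the cluster has to hold roughly 100 MB/s x 86,400 s/day x 7 days x 3 ≈ 180 TB, before leaving any headroom for growth or rebalancing.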

Performance

Jay Kreps ran an on-premises benchmark

18 spindles, 18 cores in 3 boxes could produce 2.5M events/sec

Aggressive batching is necessary

Synchronous ACKs halve throughput

https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines


Jay Kreps has a blog post about this, using 3 commodity broker boxes with 6 cores and 6 spindles each:
It's a little weird: he only uses 6 partitions, so he never exercises all the spindles in the cluster
He batches small messages really aggressively (8k batches of 100-byte messages)
His benchmark is on-premises; he hits 2.5M records/sec producing and consuming
Requiring 3 ACKs for every message halved throughput

Performance

Reproduced on AWS with 3 and 5 node clusters
d2.xlarge nodes have 3 spindles, 4 cores, 30.5GB RAM
5 producers on m3.xlarge instances

3 nodes accepted 2.6M events/s
24 partitions, one replica, one ACK
Dropped to 1.7M with 3x replication and 1 ACK

5 nodes accepted 3.6M events/s
48 partitions, one replica, one ACK
Dropped to 2.16M with 3x replication and 1 ACK


I used a similar methodology on EC2 to get some sizing numbers:
- Used 4k batch sizes; results were broadly similar (1k and 2k hurt performance)
- Over-provisioning partitions by 2x the spindle count doesn't give a benefit, but doesn't slow things down either
- Over-provisioning by 2x and adding 3x replication did cause a slowdown
- One partition actually hit 700k events/s; there may be coordination issues in the producer
- Synchronous ACKs were brutal, a 10x performance hit; this is almost definitely due to AWS network latency
- Each node is ~$500/month
- At 250MB/sec, we'd only get ~18 hours of retention (see the back-of-envelope below)
- We've seen instances with only 12 hours of retention
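
As a rough back-of-envelope on that retention figure (assuming Kafka gets the full 3 x 2 TB of local storage per d2.xlarge, which is the AWS spec rather than something stated in the slides): 3 nodes x ~6 TB ≈ 18 TB of raw capacity, and 18 TB / 250 MB/s ≈ 72,000 seconds, or roughly 20 hours of unreplicated retention; replication and overhead bring that down toward, or below, the ~18 hours quoted above.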

Thank You!


