Date post: | 26-Jan-2017 |
Category: | Technology |
Upload: | signalfx |
SignalFx
Agenda
1. Why we wrote a Kafka consumer
2. Properties and limitations of modern hardware
3. Optimizations
4. Results
Why we wrote a Kafka consumer
SignalFx is built for monitoring modern infrastructure
• High resolution:
  • Any mix of resolutions up to 1 sec
• Streaming analytics:
  • Custom analytics pipelines at any scale that output in seconds
  • Streaming dashboards update in seconds
• Multidimensional metrics:
  • Dimensions allow arbitrary modeling, pivoting, filtering, and grouping of both raw and derived (from analytics) metrics interactively on streaming data
  • E.g. 99th-percentile-of-latency-by-service-by-customer
Why write a new Kafka consumer
• Designed to replace SimpleConsumer, not the 0.9 consumer
• Needed a non-blocking, single-threaded consumer
• Wanted it to be low overhead
  • 100s of thousands of messages/second
  • Sensitive to GC
• The Kafka 0.9 consumer wasn’t ready yet
Kafka consumer - a brief introduction
Brokers, topics and partitions
[Figure: partitions 0 to 11 of one topic spread across BROKER 1, BROKER 2 and BROKER 3 in three groups: 0, 3, 6, 9; 1, 4, 7, 10; and 2, 5, 8, 11.]
Metadata request and response
[Figure: the client sends a Metadata Request to the cluster and receives a Metadata Response that maps each partition to a broker ID.]
Partition   Broker ID
0           1
1           2
…           …
n           3
Offset request and response
[Figure: the client, which already holds the partition-to-broker table, obtains a starting offset for each partition.]
Offsets (consumer group / external source)
Partition   Offset
0           9024
1           1245
…           …
n           11645
Fetch request and response
[Figure: the client sends a Fetch request carrying the current offset of each partition; after the Fetch response the client's offsets have advanced.]
Partition   Offset in fetch request   Offset after fetch response
0           9024                      9026
1           1245                      1247
…           …                         …
n           11645                     11649
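The three steps above (metadata, offsets, fetch) can be sketched as plain bookkeeping over two arrays. This is a minimal in-memory model, not the Kafka wire protocol; the class and method names are hypothetical and only illustrate how the partition-to-broker and partition-to-offset tables evolve.

```java
/** Hypothetical in-memory model of the consumer's bookkeeping; no networking involved. */
final class ConsumerFlowSketch {
    // From a (simulated) metadata response: index = partition, value = broker ID.
    private final int[] partitionToBroker;
    // From a (simulated) offset response: index = partition, value = next offset to fetch.
    private final long[] partitionToOffset;

    ConsumerFlowSketch(int[] partitionToBroker, long[] startingOffsets) {
        this.partitionToBroker = partitionToBroker.clone();
        this.partitionToOffset = startingOffsets.clone();
    }

    /** Offset the next fetch request should carry for this partition. */
    long nextFetchOffset(int partition) {
        return partitionToOffset[partition];
    }

    /** Applies a simulated fetch response that returned messagesReturned messages. */
    void onFetchResponse(int partition, int messagesReturned) {
        partitionToOffset[partition] += messagesReturned;
    }

    /** Number of partitions that would go into one fetch request to brokerId. */
    int partitionsOnBroker(int brokerId) {
        int count = 0;
        for (int broker : partitionToBroker) {
            if (broker == brokerId) {
                count++;
            }
        }
        return count;
    }
}
```

With the numbers from the slides, partition 0 starts at offset 9024 and a response containing two messages moves it to 9026.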
Properties and limitations of modern hardware
[Figure: Core 1 and Core 2, each with its own L1 D, L1 I and L2 caches, sharing an L3 cache in front of main memory.]
Cache Lines
• Data is transferred between memory and cache in blocks of fixed size, called cache lines (typically 64 bytes)
• The memory subsystem makes a few bets to help us:
  • Temporal locality
  • Spatial locality
  • Prefetching
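As a rough illustration of why spatial locality matters, the sketch below works out how many 8-byte longs share one cache line and shows two traversals of the same array that compute the same sum in different memory orders. The 64-byte line size is the typical value quoted above and is hardware dependent; the class name is hypothetical, and array headers and alignment are ignored.

```java
/** Back-of-the-envelope cache line arithmetic; assumes 64-byte lines. */
final class CacheLineSketch {
    static final int CACHE_LINE_BYTES = 64; // typical value, not guaranteed

    /** How many longs fit in one cache line. */
    static int longsPerLine() {
        return CACHE_LINE_BYTES / Long.BYTES;
    }

    /** Cache lines spanned by numLongs contiguous longs (header and alignment ignored). */
    static int linesSpanned(int numLongs) {
        return (numLongs * Long.BYTES + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
    }

    /** Walks memory in order: each fetched line serves several consecutive reads. */
    static long sumSequential(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    /** Visits every element once, but jumps stride elements between reads. */
    static long sumStrided(long[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        return sum;
    }
}
```

Both methods return the same value; on real hardware the sequential walk usually touches each line once and benefits from prefetching, while a large stride tends to pull in a new line on almost every read.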
[Figure: the same cache hierarchy, animated with numbered cache lines (1 to 8) shown at different levels between main memory, L3, L2 and the L1 D caches of Core 1 and Core 2.]
Reference latency numbers for comparison
By Jeff Dean: http://research.google.com/people/jeff/
Operation                             Latency             In ms      Comparison
L1 Cache                              0.5 ns
Branch mispredict                     5 ns
L2 Cache                              7 ns                           14x L1 Cache
Mutex lock/unlock                     25 ns
Main memory                           100 ns                         20x L2 Cache, 200x L1 Cache
Compress 1K bytes (Zippy)             3,000 ns
Send 1K bytes over 1Gbps              10,000 ns           0.01 ms
Read 4K randomly from SSD             150,000 ns          0.15 ms
Read 1MB sequentially from memory     250,000 ns          0.25 ms
Round trip within same DC             500,000 ns          0.5 ms
Read 1MB sequentially from SSD        1,000,000 ns        1 ms       4x memory
Disk seek                             10,000,000 ns       10 ms      20x DC roundtrip
Read 1MB sequentially from disk       20,000,000 ns       20 ms      80x memory, 20x SSD
Send packet CA->Netherlands->CA       150,000,000 ns      150 ms
[Figures: the CORE shown alongside L1, then L2, then main memory.]
Optimizations
Optimization aims
• We are NOT aiming for more data/second
  • Even a very inefficient implementation will be bottlenecked by the network
• We are aiming to make the client get out of the way
  • The client is not the only thing running on the system
  • Leave all resources for the actual application
Efficiency VS raw speed
• We value efficiency more than raw speed for the client
  • Fewer cycles
  • Less cache usage and fewer cache misses
  • Less memory?
• Efficiency for the client == raw speed for the application
Efficiency from constraints
• No consumer group functionality needed
• A single topic
• Finite number of integer partitions
• Partition reassignment is rare and happens during startup and shutdown
• We are in control of the code that consumes the messages
Use cache conscious data structures
Use arrays and open addressing hash maps
• Single topic. Less than 1024 partitions
• Instead of maps we can use arrays
• Or use primitive specialized open addressing hash maps
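A minimal sketch of the array-instead-of-map idea: with a single topic and fewer than 1024 partitions, the partition ID can index directly into a `long[]`, so a lookup is one indexed read with no hashing, boxing or pointer chasing. The class name and the `UNKNOWN` sentinel are assumptions made for illustration.

```java
import java.util.Arrays;

/** Partition -> offset table backed by a primitive array (single topic assumed). */
final class PartitionOffsets {
    static final int MAX_PARTITIONS = 1024; // upper bound taken from the slide
    static final long UNKNOWN = -1L;        // hypothetical sentinel for "no offset yet"

    private final long[] offsets = new long[MAX_PARTITIONS];

    PartitionOffsets() {
        Arrays.fill(offsets, UNKNOWN);
    }

    void set(int partition, long offset) {
        offsets[partition] = offset;
    }

    long get(int partition) {
        return offsets[partition];
    }
}
```

Compared with a `HashMap<String, Long>` keyed by `"Foo:0"`, nothing is allocated per update and neighbouring partitions sit in the same cache line.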
Topic:Partition -> Offset
[Figure: a map keyed by Topic:Partition (Foo:0 -> 9026, Foo:1 -> 1247, …, Foo:n -> 11649) reduces, for the single topic Foo, to a map keyed by partition (0 -> 9026, 1 -> 1247, …, n -> 11649) and then to a plain Offsets array: 9026, 1247, …, 11649.]
Hash map implemented as an array of lists of key* | value*
[Figure: a lookup follows four dependent pointer hops (numbered 1 to 4) through List and Entry* nodes to the boxed partition* and offset*.]
Dependable cache miss generator
Sparse array
[Figure: offset 0 to offset 13 stored contiguously; a lookup is a single indexed read (numbered 1).]
In memory
• Hash map, per entry: (4 * 2 + 4 + 8) + (4 + 4 + 8) + (4 + 8) + (8 + 8) = 64 bytes
• Offsets array: 1024 * 8 + 4 + 8 = 8204 bytes
In cache
• Hash map lookup (steps 1 to 4): 4 * 64 = 256 bytes
• Offsets array lookup (step 1): 1 * 64 = 64 bytes
Low memory and cache friendly data structures
• Queues built from integer arrays. Negative -> partition lost
• Zero allocation hashed-wheel timer to close stuck connections
• Open addressing hash maps
• BitSets coded on top of long arrays whenever a set of partitions is required
  • Can be traversed in O(num set bits)
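A minimal sketch of a bit set over a `long[]` whose traversal cost depends on the number of set bits rather than the capacity. The class name and the `forEach` callback are hypothetical; the bit tricks (`Long.numberOfTrailingZeros` and clearing the lowest set bit with `w & (w - 1)`) are standard.

```java
import java.util.function.IntConsumer;

/** Set of partition IDs stored as bits in a long[]; names are illustrative only. */
final class PartitionBitSet {
    private final long[] words;

    PartitionBitSet(int maxPartitions) {
        words = new long[(maxPartitions + 63) >>> 6];
    }

    void set(int partition) {
        // A long shift only uses the low 6 bits of the shift count.
        words[partition >>> 6] |= 1L << partition;
    }

    void clear(int partition) {
        words[partition >>> 6] &= ~(1L << partition);
    }

    boolean contains(int partition) {
        return (words[partition >>> 6] & (1L << partition)) != 0;
    }

    /** Visits set bits in ascending order: one iteration per set bit, plus one per word. */
    void forEach(IntConsumer action) {
        for (int i = 0; i < words.length; i++) {
            long w = words[i];
            while (w != 0) {
                action.accept((i << 6) + Long.numberOfTrailingZeros(w));
                w &= w - 1; // clear the lowest set bit
            }
        }
    }
}
```

For 1024 partitions the whole set fits in 16 longs, that is, two 64-byte cache lines.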
Applicability and benefit to Kafka consumer 0.9
• Benefits - medium
  • Lots of hash map lookups
• Applicability - low
  • Multiple topics - sparse arrays not a great match
  • Open addressing hash maps - preserve most of the benefits
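For the multi-topic case, a primitive open addressing map keeps keys and values in flat arrays, so a lookup probes neighbouring slots instead of chasing `Entry` pointers. The sketch below uses linear probing with a fixed power-of-two capacity and no removal or resizing; it is an illustration under those assumptions, not the map used by either consumer, and all names are hypothetical.

```java
import java.util.Arrays;

/** Fixed-capacity int -> long map with linear probing; keys must be non-negative. */
final class IntLongOpenHashMap {
    private static final int EMPTY = -1;

    private final int[] keys;
    private final long[] values;
    private final int mask;
    private int size;

    IntLongOpenHashMap(int capacityPowerOfTwo) {
        if (Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        keys = new int[capacityPowerOfTwo];
        values = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
        Arrays.fill(keys, EMPTY);
    }

    private static int mix(int key) {
        final int h = key * 0x9E3779B9; // multiplicative scrambling of the key bits
        return h ^ (h >>> 16);
    }

    void put(int key, long value) {
        if (key < 0) {
            throw new IllegalArgumentException("negative keys are reserved");
        }
        int slot = mix(key) & mask;
        while (keys[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & mask; // probe the next slot, wrapping around
        }
        if (keys[slot] == EMPTY) {
            // Always keep one slot empty so that probing terminates.
            if (size == mask) {
                throw new IllegalStateException("map is full");
            }
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    long get(int key, long missingValue) {
        int slot = mix(key) & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == key) {
                return values[slot];
            }
            slot = (slot + 1) & mask;
        }
        return missingValue;
    }

    int size() {
        return size;
    }
}
```

A topic and partition pair could be packed into the int key, for example `topicId * 1024 + partition`, where `topicId` is a hypothetical small integer assigned per topic.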
Create buffers once, reuse
Eliminate redundant work
• A single topic. Finite number of partitions:
  • Topic and client strings are immutable
• The metadata request buffer can be created just once and kept around forever
• Other requests can have their fixed part written out once, so only the variable part is written on each request
  • Offset request = fixed_part + per_partition_part
  • Fetch request = fixed_part + per_partition_part
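One way to read "fixed part written once, variable part per request" is a template buffer built at startup together with an index of where each partition's offset lives. The sketch below follows the field order shown on the fetch request slides; the field widths are modeled on a version 0 style fetch request and should be treated as an assumption rather than a verified encoder, and the class name is hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Builds a fetch request buffer once; later requests only overwrite the offsets. */
final class FetchRequestTemplate {
    final ByteBuffer buffer;
    /** fetchRequestIndex[partitionId] = absolute buffer position of that partition's offset. */
    final int[] fetchRequestIndex;

    FetchRequestTemplate(String clientId, String topic, int numPartitionsInTopic,
                         int[] assignedPartitions, int maxWaitMs, int minBytes,
                         int maxBytesPerPartition) {
        final byte[] client = clientId.getBytes(StandardCharsets.UTF_8);
        final byte[] topicBytes = topic.getBytes(StandardCharsets.UTF_8);
        // Per-partition entry: partition (4) + offset (8) + max bytes (4) = 16 bytes.
        final int total = 4 + 2 + 2 + 4 + (2 + client.length) + 4 + 4 + 4
                + 4 + (2 + topicBytes.length) + 4 + assignedPartitions.length * 16;
        buffer = ByteBuffer.allocate(total);
        fetchRequestIndex = new int[numPartitionsInTopic];

        buffer.putInt(total - 4);                // SIZE: bytes that follow this field
        buffer.putShort((short) 1);              // API_KEY (fetch)
        buffer.putShort((short) 0);              // API_VERSION
        buffer.putInt(0);                        // CORRELATION_ID
        buffer.putShort((short) client.length);  // CLIENT_ID_STRING: length, then bytes
        buffer.put(client);
        buffer.putInt(-1);                       // REPLICA_ID (-1 for a normal consumer)
        buffer.putInt(maxWaitMs);                // MAX_WAIT_TIME
        buffer.putInt(minBytes);                 // MIN_BYTES
        buffer.putInt(1);                        // NUM_TOPICS
        buffer.putShort((short) topicBytes.length); // TOPIC_STRING: length, then bytes
        buffer.put(topicBytes);
        buffer.putInt(assignedPartitions.length);   // NUM_PARTITIONS
        for (int partition : assignedPartitions) {
            buffer.putInt(partition);
            fetchRequestIndex[partition] = buffer.position(); // remember the offset slot
            buffer.putLong(0L);                  // offset placeholder, rewritten per request
            buffer.putInt(maxBytesPerPartition);
        }
        buffer.flip();
    }

    /** Absolute write: patches one offset without touching position or limit. */
    void setOffset(int partition, long offset) {
        buffer.putLong(fetchRequestIndex[partition], offset);
    }
}
```

Because each entry is 16 bytes, consecutive index values are 16 apart, which matches the 1200, 1216, 1232 spacing shown on the slides.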
FETCH REQUEST BUFFER
Fields, in order: SIZE, API_KEY, API_VERSION, CORRELATION_ID, CLIENT_ID_STRING, REPLICA_ID, MAX_WAIT_TIME, MIN_BYTES, NUM_TOPICS, TOPIC_STRING, NUM_PARTITIONS, then one entry per partition. The slide marks the leading fields as Fixed and the per-partition entries as Variable.
Partition   Offset   Max bytes   Index (buffer position of the offset)
0           1266     1024        1200
1           1164     1024        1216
2           1900     1024        1232
[Figure: the new values in the Offsets array (1289, 1172, 1990) are written into the buffer at the indexed positions, so the entries become 0 1289 1024, 1 1172 1024 and 2 1990 1024.]
Code
    private void setNewOffsetsForFetchRequest() {
        final ByteBuffer buffer = this.fetchRequestBuffer;
        // Iterate through the partitions assigned to this broker
        // and write the offset directly on the buffer.
        for (int i = 0; i < partitionAssignment.length; i++) {
            // This loop runs in O(partitions assigned).
            long bitSet = partitionAssignment[i];
            while (bitSet != 0) {
                final long t = bitSet & -bitSet;
                final int partitionId = i * 64 + Long.bitCount(t - 1);
                // The position in the buffer that points to the
                // beginning of the offset for this partition.
                final int bufferPositionForOffset = fetchRequestIndex[partitionId];
                final long offset = partitionToOffset[partitionId];
                // Write the offset directly.
                buffer.putLong(bufferPositionForOffset, offset);
                bitSet ^= t;
            }
        }
    }
METADATA REQUEST BUFFER
Fixed: SIZE, API_KEY, API_VERSION, CORRELATION_ID, CLIENT_ID_STRING, NUM_TOPICS, TOPIC_STRING
OFFSET REQUEST BUFFER
Fixed: SIZE, API_KEY, API_VERSION, CORRELATION_ID, CLIENT_ID_STRING, REPLICA_ID, NUM_TOPICS, TOPIC_STRING
NUM_PARTITIONS (its place in the buffer is kept as NUM_PARTITIONS_POSITION), followed by the partitions: 0 1 2 3 4 5
Applicability and benefit to Kafka consumer 0.9
• Benefits - high
  • Reuse instead of allocating - temporal locality
  • Streaming through 3 arrays - prefetching
  • One fetch request per fetch response - common
  • Metadata or offset requests - rare
• Applicability - high
  • Internal detail so API doesn’t change
  • Even for consumer groups, partition reassignment and partition migration events are rare
Zero allocation response processing
Stream responses to application
• Pass each message to the application when it is ready
• Consume messages synchronously without a copy or allocation
  • No deserialization required
• Benefits add up when processing 100s of thousands of messages per second
Low level interface
    public interface KafkaMessageHandler {
        void handleMessage(ByteBuffer buffer, int position, int length);
    }

    public interface KafkaConsumer {
        void poll(KafkaMessageHandler handler, long timeoutMs);
        . . . . . .
    }
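To make the callback style concrete, here is a sketch of a handler that inspects each message in place using absolute reads, plus a walker that hands out (position, length) slices of one shared buffer. The `KafkaMessageHandler` interface is repeated from the slide so the block compiles on its own; `ByteCountingHandler`, `ResponseWalker` and the simple length-prefixed framing are hypothetical stand-ins, not the real fetch response layout.

```java
import java.nio.ByteBuffer;

interface KafkaMessageHandler {
    void handleMessage(ByteBuffer buffer, int position, int length);
}

/** Reads message bytes in place; nothing is copied or allocated per message. */
final class ByteCountingHandler implements KafkaMessageHandler {
    long messages;
    long payloadBytes;
    long byteSum;

    @Override
    public void handleMessage(ByteBuffer buffer, int position, int length) {
        messages++;
        payloadBytes += length;
        for (int i = position; i < position + length; i++) {
            byteSum += buffer.get(i) & 0xFF; // absolute get leaves buffer state untouched
        }
    }
}

/** Walks a buffer of [int length][payload] records (framing assumed for illustration). */
final class ResponseWalker {
    static void deliver(ByteBuffer response, KafkaMessageHandler handler) {
        int pos = response.position();
        final int end = response.limit();
        while (end - pos >= 4) {
            final int length = response.getInt(pos);
            pos += 4;
            if (length < 0 || length > end - pos) {
                break; // truncated trailing record: stop and wait for more data
            }
            handler.handleMessage(response, pos, length);
            pos += length;
        }
    }
}
```

The handler must finish with the bytes before returning, because the same buffer is reused for the next response; that is the trade-off behind the low-level API.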
FETCH RESPONSE PARSING
[Figure: a fetch response buffer that starts with the topic string, client string etc., followed by one row per partition (1 to 4) holding Message 1, Message 2, …, Message n. As parsing advances through the buffer, each message is handed to KafkaMessageHandler.handleMessage(ByteBuffer buffer, int position, int length).]
Applicability and benefit to Kafka consumer 0.9
• Benefits - very high
  • Reuse response buffer, no allocations - temporal locality
  • Data is processed right after being read from the socket - temporal locality
  • Streaming through a buffer - spatial locality + prefetching
  • Combine with DirectByteBuffers for zero copy
• Applicability - low
  • API too low level
  • Integrity of internal buffers compromised by bugs in application
  • Maybe a low level “with great power comes great responsibility” API
Some numbers
Caveats
• These are from running a very specific workload similar to our application
• There are many Pareto-optimal choices for a client. Ours is not better in any way - it’s just tuned for our workload
• It can and will prove bad for other workloads
Benchmark
• Single topic-partition
• Settings of fetch_max_wait, fetch_min_bytes, max_bytes_per_partition were identical
• Only 5000 messages per second produced by a single producer
• Each message is 23 bytes
• Warm up -> profile for 5 mins
• 5000/sec * 5 mins = 1.5 million
• Profiler = Java Mission Control
[Figure: 0.9 Consumer allocation profile : TLAB]
[Figure: SignalFx Consumer allocation profile : TLAB]
[Figure: 0.9 Consumer code profile]
[Figure: SignalFx Consumer code profile]
With 5,000 messages/second
Implementation       CPU       Allocation (TLAB)
0.9 consumer         6%        422.8 MB
SignalFx consumer    1.3%      217 KB
Ratio                4.6x      1944x

With 10,000 messages/second
Implementation       CPU       Allocation (TLAB)
0.9 consumer         6.122%    858 MB
SignalFx consumer    1.456%    400 KB
Ratio                4.2x      2145x
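To put the 5,000 messages/second figures on a per-message footing, the arithmetic below divides the reported TLAB allocation by the number of messages profiled. It assumes the allocation totals cover the full 5 minute profiling window and that MB and KB are decimal units; neither is stated on the slides, and the class name is hypothetical.

```java
/** Per-message allocation arithmetic for the 5,000 messages/second run. */
final class BenchmarkArithmetic {
    /** Messages produced during the profiling window. */
    static long messagesProfiled(long messagesPerSecond, long minutes) {
        return messagesPerSecond * 60L * minutes;
    }

    /** Average bytes allocated per message. */
    static double bytesPerMessage(double allocatedBytes, long messages) {
        return allocatedBytes / messages;
    }
}
```

Under those assumptions, `bytesPerMessage(422.8e6, 1_500_000)` is about 282 bytes per message for the 0.9 consumer and `bytesPerMessage(217e3, 1_500_000)` is about 0.14 bytes per message for the SignalFx consumer.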
Thank You!
Rajiv [email protected]
@rzidane360
WE’RE [email protected]
@SignalFx - signalfx.com/careers
Q&A