Exactly Once Semantics in Apache Kafka
Apache Kafka: A Distributed Streaming Platform
Consumers
Producers
Connectors
Processing
A Distributed What???
A Streaming Platform is a little like…
• A messaging system
• Except it scales horizontally, stores the streams persistently,
and allows continuous stream processing
• Hadoop
• But not batch oriented
Logs: A data structure for continuous streams
APIs
1. Producer and Consumer: Read and write streams
2. Connect: managed connectors that integrate Kafka with existing systems
3. Streams: Transformations of streams
Producer & Consumer API
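The original slide showed this API as a diagram. A minimal sketch of both sides in Java follows; the broker address, topic name, group id, and serializer choices here are illustrative assumptions, not from the deck.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;

public class ProduceConsumeSketch {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Write a record to a stream (topic)
        try (Producer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("events", "user-1", "clicked"));
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Read the stream back
        try (Consumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.println(r.key() + " -> " + r.value()));
        }
    }
}
```

This requires a running broker, so treat it as a shape of the API rather than something to paste and run.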
Consumers Scale With Groups
Connect API
Kafka Connect Does The Hard Parts
1. Scale out
2. Fault Tolerance
3. Central Management
4. Schemas
Streams API
• Full power of a modern stream processing framework
• Distributed and fault-tolerant
• Natively uses event-time
• Stateful processing: joins, aggregations, etc.
• Integrates tables and streams
• Easy re-processing
• Just a library
Wordcount Example
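The wordcount slides were diagrams; the canonical Kafka Streams topology they illustrate looks roughly like this. Topic names and the default string serdes for grouping are assumptions for the sketch.

```java
import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class WordCountSketch {
    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");
        KTable<String, Long> counts = lines
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
            .groupBy((key, word) -> word)   // repartition by word
            .count();                       // stateful aggregation -> a table
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));
        return builder;
    }
}
```

Note how the result is a KTable: this is the "integrates tables and streams" bullet above in action.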
Deploy as you wish
Not Limited To Java
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
The Semantics of Working With Streams
Two Problems
1. Duplicate writes
2. Exactly-once processing
Problem #1:
Duplicate Writes
Problem #2:
Duplicate Processing
1. Read from offset=0
2. Process and update state
3. Commit offset 0 as processed
4. Read from offset=1
5. Process and update state
6. App crashes!
7. Restore from offset=0, resume processing
Choose your undesirable semantics
1. Update state, then save offset => At-Least-Once Delivery
2. Save offset, then update state => At-Most-Once Delivery
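The trade-off can be reproduced in a few lines of plain Java that simulate a crash between the two steps. No Kafka is involved; the record values and crash points below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of a consumer that crashes between
// "update state" and "save offset".
public class DeliverySemantics {

    // Order: update state, THEN save offset. A crash in between replays a record.
    static List<String> runAtLeastOnce(List<String> records, int crashAt) {
        List<String> processed = new ArrayList<>();
        int committed = 0;                          // last saved offset
        for (int i = 0; i < records.size(); i++) {
            processed.add(records.get(i));          // 1) update state
            if (i == crashAt) break;                // crash before saving the offset
            committed = i + 1;                      // 2) save offset
        }
        for (int i = committed; i < records.size(); i++) {
            processed.add(records.get(i));          // restart: record at crashAt runs twice
            committed = i + 1;
        }
        return processed;
    }

    // Order: save offset, THEN update state. A crash in between loses a record.
    static List<String> runAtMostOnce(List<String> records, int crashAt) {
        List<String> processed = new ArrayList<>();
        int committed = 0;
        for (int i = 0; i < records.size(); i++) {
            committed = i + 1;                      // 1) save offset
            if (i == crashAt) break;                // crash before updating state
            processed.add(records.get(i));          // 2) update state
        }
        for (int i = committed; i < records.size(); i++) {
            committed = i + 1;
            processed.add(records.get(i));          // restart: record at crashAt is gone
        }
        return processed;
    }
}
```

With records [a, b, c] and a crash around offset 1, the first ordering processes b twice and the second never processes it at all.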
Two workarounds
1. Make processing idempotent
• Much harder than it sounds in practice
2. Store offset in the application DB and update transactionally
• Not all stores support transactions
• Must handle zombies
Solving These Problems
Solving Problem #1:
Avoiding Duplicate Writes with an Idempotent Producer
Basic idea
1. Unique ID for each message
2. Server deduplicates
Basic idea has problems
1. Random access database of all message ids?
2. Message IDs would be bulky
3. Must handle server fail-over
Better idea: Do it like TCP
1. Unique producer id for each producer (PID)
2. Each producer assigns a sequential number to each
message it sends
3. The unique identifier is the PID + sequence number
4. Sequence number and PID both stored in the log
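The four points above can be sketched as a small plain-Java check. This is not Kafka's actual broker code: it is simplified to a single partition and tracks only the last sequence per PID, whereas the real broker is more involved (in real Kafka a gap raises OutOfOrderSequenceException).

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the broker-side deduplication check.
public class SequenceCheck {
    enum Outcome { APPENDED, DUPLICATE, OUT_OF_ORDER }

    private final Map<Long, Integer> lastSeqByPid = new HashMap<>();

    Outcome append(long pid, int seq) {
        int last = lastSeqByPid.getOrDefault(pid, -1);
        if (seq <= last) {
            return Outcome.DUPLICATE;       // a retry of a batch already in the log: drop it
        }
        if (seq > last + 1) {
            return Outcome.OUT_OF_ORDER;    // a gap in the sequence: reject
        }
        lastSeqByPid.put(pid, seq);         // seq == last + 1: append to the log
        return Outcome.APPENDED;
    }
}
```

Because the state is just one small integer per PID, the dedup check needs neither a random-access database of message IDs nor bulky per-message identifiers.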
The idempotent producer
Idempotent Producer
• Works transparently – no API changes.
• Fast enough you don’t need to worry about it
• Will be on by default in the future
Solving Problem #2:
Avoiding Duplicate Processing with Transactions
It’s More Complex Than I’ve Let On
• Multiple partitions
• Multiple input streams
• Non-determinism
• Diverse data stores
• Zombies
Transactions in Kafka
Introducing transactions
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(record0);
    producer.send(record1);
    producer.commitTransaction();
} catch (KafkaException e) {
    producer.abortTransaction();
}
How a transaction proceeds:
1. Initializing transactions
2. Transactional sends – part 1
3. Transactional sends – part 2
4. Commit – phase 1
5. Commit – phase 2
6. Success!
Consumer returns only committed messages
Transactions => Stream Processing
Factor problem into two parts
1. Transforming input streams to output streams (Streams)
2. Connecting output streams to data systems (Connect)
Stream processing with Kafka
1. Read from input streams
2. Process and update state
3. Produce to output streams
4. Save offsets
Stream processing as a sequence of transactions
BEGIN
1. Read from input streams
2. Process and update state
3. Produce to output streams
4. Save Offsets
COMMIT
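The BEGIN/COMMIT loop above maps onto the producer API roughly as follows. The topic name, group id, and the process() helper are hypothetical; the consumer is assumed to be configured with enable.auto.commit=false and isolation.level=read_committed.

```java
producer.initTransactions();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    if (records.isEmpty()) continue;
    try {
        producer.beginTransaction();                              // BEGIN
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<String, String> record : records) {
            // 1-3: read input, process, produce to output streams
            producer.send(new ProducerRecord<>("output-topic",
                    record.key(), process(record.value())));      // process() is hypothetical
            offsets.put(new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset() + 1));
        }
        // 4: save offsets in the SAME transaction as the produced records
        producer.sendOffsetsToTransaction(offsets, "my-group");
        producer.commitTransaction();                             // COMMIT
    } catch (KafkaException e) {
        producer.abortTransaction();                              // roll back; records replayed
    }
}
```

sendOffsetsToTransaction is what makes steps 3 and 4 atomic: either the output records and the offset commit both become visible, or neither does.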
The Theory
• Two Generals
• Atomic Broadcast
• Consensus
In Practice
Performance
• Up to +20% producer throughput
• Up to +50% consumer throughput
• Up to -20% disk utilization
• Savings start when you batch
• Details: https://bit.ly/kafka-eos-perf
Cool!
But how do I use this?
Producer Configs
• enable.idempotence = true
• acks = “all”
• retries > 1 (preferably MAX_INT)
• transactional.id = ‘some unique id’
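As Java producer configuration, these settings look like this; the transactional.id value is a placeholder and must be unique and stable for each logical producer instance.

```java
Properties props = new Properties();
props.put("enable.idempotence", "true");
props.put("acks", "all");
props.put("retries", String.valueOf(Integer.MAX_VALUE));
props.put("transactional.id", "orders-processor-1"); // placeholder id
```

Set transactional.id only if you actually use transactions; idempotence alone needs just the first three.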
Consumer configs
• isolation.level:
• “read_committed”, or
• “read_uncommitted”
Streams config
• processing.guarantee = “exactly_once”
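In code this is one line of Streams configuration, using the constants from the kafka-streams library:

```java
Properties props = new Properties();
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
```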
Confluent
• Founded by the original creators of Apache Kafka
• Headquartered in Palo Alto, CA
KSQL: Streaming SQL for Apache Kafka
Developer Preview
(https://github.com/confluentinc/ksql)
Thank You!