Exactly Once Semantics in Apache Kafka
Apache Kafka: A Distributed Streaming Platform
Consumers
Producers
Connectors
Processing
A Distributed What???
A Streaming Platform is a little like…
• A messaging system
• Except it scales horizontally, stores the streams persistently,
and allows continuous stream processing
• Hadoop
• But not batch oriented
Logs: A data structure for continuous streams
APIs
1. Producer and Consumer: Read and write streams
2. Connect: managed connectors that integrate Kafka with existing systems
3. Streams: Transformations of streams
Producer & Consumer API
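The original slide showed this API as a diagram. A minimal sketch of both sides in Java follows; the broker address, topic name, group id, and serializer choices here are illustrative assumptions, not from the deck.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;

public class ProduceConsumeSketch {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Write a record to a stream (topic)
        try (Producer<String, String> producer = new KafkaProducer<>(p)) {
            producer.send(new ProducerRecord<>("events", "user-1", "clicked"));
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-group");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Read the stream back
        try (Consumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(List.of("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.println(r.key() + " -> " + r.value()));
        }
    }
}
```

This requires a running broker, so treat it as a shape of the API rather than something to paste and run.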
Consumers Scale With Groups
Connect API
Kafka Connect Does The Hard Parts
1. Scale out
2. Fault Tolerance
3. Central Management
4. Schemas
Streams API
• Full power of a modern stream processing framework
• Distributed and fault-tolerant
• Natively uses event-time
• Stateful processing: joins, aggregations, etc.
• Integrates tables and streams
• Easy re-processing
• Just a library
Wordcount Example
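The wordcount slides were diagrams; the canonical Kafka Streams topology they illustrate looks roughly like this. Topic names and the default string serdes for grouping are assumptions for the sketch.

```java
import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

public class WordCountSketch {
    public static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");
        KTable<String, Long> counts = lines
            .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
            .groupBy((key, word) -> word)   // repartition by word
            .count();                       // stateful aggregation -> a table
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));
        return builder;
    }
}
```

Note how the result is a KTable: this is the "integrates tables and streams" bullet above in action.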
Deploy as you wish
Not Limited To Java
CREATE TABLE possible_fraud AS
SELECT card_number, count(*)
FROM authorization_attempts
WINDOW TUMBLING (SIZE 5 SECONDS)
GROUP BY card_number
HAVING count(*) > 3;
The Semantics of Working With Streams
Two Problems
1. Duplicate writes
2. Exactly-once processing
Problem #1:
Duplicate Writes
Problem #2:
Duplicate Processing
1. Read from offset=0
2. Process and update state
3. Commit offset 0 as processed
4. Read from offset=1
5. Process and update state
6. App crashes!
7. Restore from offset=0, resume processing
Choose your undesirable semantics
1. Update state, then save offset => At-Least-Once Delivery
2. Save offset, then update state => At-Most-Once Delivery
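The trade-off can be reproduced in a few lines of plain Java that simulate a crash between the two steps. No Kafka is involved; the record values and crash points below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of a consumer that crashes between
// "update state" and "save offset".
public class DeliverySemantics {

    // Order: update state, THEN save offset. A crash in between replays a record.
    static List<String> runAtLeastOnce(List<String> records, int crashAt) {
        List<String> processed = new ArrayList<>();
        int committed = 0;                          // last saved offset
        for (int i = 0; i < records.size(); i++) {
            processed.add(records.get(i));          // 1) update state
            if (i == crashAt) break;                // crash before saving the offset
            committed = i + 1;                      // 2) save offset
        }
        for (int i = committed; i < records.size(); i++) {
            processed.add(records.get(i));          // restart: record at crashAt runs twice
            committed = i + 1;
        }
        return processed;
    }

    // Order: save offset, THEN update state. A crash in between loses a record.
    static List<String> runAtMostOnce(List<String> records, int crashAt) {
        List<String> processed = new ArrayList<>();
        int committed = 0;
        for (int i = 0; i < records.size(); i++) {
            committed = i + 1;                      // 1) save offset
            if (i == crashAt) break;                // crash before updating state
            processed.add(records.get(i));          // 2) update state
        }
        for (int i = committed; i < records.size(); i++) {
            committed = i + 1;
            processed.add(records.get(i));          // restart: record at crashAt is gone
        }
        return processed;
    }
}
```

With records [a, b, c] and a crash around offset 1, the first ordering processes b twice and the second never processes it at all.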
Two workarounds
1. Make processing idempotent
• Much harder than it sounds in practice
2. Store offset in the application DB and update transactionally
• Not all stores support transactions
• Must handle zombies
Solving These Problems
Solving Problem #1:
Avoiding Duplicate Writes with an Idempotent Producer
Basic idea
1. Unique ID for each message
2. Server deduplicates
Basic idea has problems
1. Random access database of all message ids?
2. Message IDs would be bulky
3. Must handle server fail-over
Better idea: Do it like TCP
1. Unique producer id for each producer (PID)
2. Each producer assigns a sequential number to each
message it sends
3. The unique identifier is the PID + sequence number
4. Sequence number and PID both stored in the log
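The four points above can be sketched as a small plain-Java check. This is not Kafka's actual broker code: it is simplified to a single partition and tracks only the last sequence per PID, whereas the real broker is more involved (in real Kafka a gap raises OutOfOrderSequenceException).

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the broker-side deduplication check.
public class SequenceCheck {
    enum Outcome { APPENDED, DUPLICATE, OUT_OF_ORDER }

    private final Map<Long, Integer> lastSeqByPid = new HashMap<>();

    Outcome append(long pid, int seq) {
        int last = lastSeqByPid.getOrDefault(pid, -1);
        if (seq <= last) {
            return Outcome.DUPLICATE;       // a retry of a batch already in the log: drop it
        }
        if (seq > last + 1) {
            return Outcome.OUT_OF_ORDER;    // a gap in the sequence: reject
        }
        lastSeqByPid.put(pid, seq);         // seq == last + 1: append to the log
        return Outcome.APPENDED;
    }
}
```

Because the state is just one small integer per PID, the dedup check needs neither a random-access database of message IDs nor bulky per-message identifiers.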
The idempotent producer
Idempotent Producer
• Works transparently – no API changes.
• Fast enough you don’t need to worry about it
• Will be on by default in the future
Solving Problem #2:
Avoiding Duplicate Processing with Transactions
It’s More Complex Than I’ve Let On
• Multiple partitions
• Multiple input streams
• Non-determinism
• Diverse data stores
• Zombies
Transactions in Kafka
Introducing transactions
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(record0);
    producer.send(record1);
    producer.commitTransaction();
} catch (KafkaException e) {
    producer.abortTransaction();
}
How a transaction proceeds:
1. Initializing transactions
2. Transactional sends – part 1
3. Transactional sends – part 2
4. Commit – phase 1
5. Commit – phase 2
6. Success!
Consumer returns only committed messages
Transactions => Stream Processing
Factor problem into two parts
1. Transforming input streams to output streams (Streams)
2. Connecting output streams to data systems (Connect)
Stream processing with Kafka
1. Read from input streams
2. Process and update state
3. Produce to output streams
4. Save offsets
Stream processing as a sequence of transactions
BEGIN
1. Read from input streams
2. Process and update state
3. Produce to output streams
4. Save Offsets
COMMIT
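The BEGIN/COMMIT loop above maps onto the producer API roughly as follows. The topic name, group id, and the process() helper are hypothetical; the consumer is assumed to be configured with enable.auto.commit=false and isolation.level=read_committed.

```java
producer.initTransactions();
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    if (records.isEmpty()) continue;
    try {
        producer.beginTransaction();                              // BEGIN
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<String, String> record : records) {
            // 1-3: read input, process, produce to output streams
            producer.send(new ProducerRecord<>("output-topic",
                    record.key(), process(record.value())));      // process() is hypothetical
            offsets.put(new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset() + 1));
        }
        // 4: save offsets in the SAME transaction as the produced records
        producer.sendOffsetsToTransaction(offsets, "my-group");
        producer.commitTransaction();                             // COMMIT
    } catch (KafkaException e) {
        producer.abortTransaction();                              // roll back; records replayed
    }
}
```

sendOffsetsToTransaction is what makes steps 3 and 4 atomic: either the output records and the offset commit both become visible, or neither does.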
The Theory
• Two Generals
• Atomic Broadcast
• Consensus
In Practice
Performance
• Up to +20% producer throughput
• Up to +50% consumer throughput
• Up to -20% disk utilization
• Savings start when you batch
• Details: https://bit.ly/kafka-eos-perf
Cool!
But how do I use this?
Producer Configs
• enable.idempotence = true
• acks = “all”
• retries > 1 (preferably MAX_INT)
• transactional.id = ‘some unique id’
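As Java producer configuration, these settings look like this; the transactional.id value is a placeholder and must be unique and stable for each logical producer instance.

```java
Properties props = new Properties();
props.put("enable.idempotence", "true");
props.put("acks", "all");
props.put("retries", String.valueOf(Integer.MAX_VALUE));
props.put("transactional.id", "orders-processor-1"); // placeholder id
```

Set transactional.id only if you actually use transactions; idempotence alone needs just the first three.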
Consumer configs
• isolation.level:
• “read_committed”, or
• “read_uncommitted”
Streams config
• processing.guarantee = “exactly_once”
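In code this is one line of Streams configuration, using the constants from the kafka-streams library:

```java
Properties props = new Properties();
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
```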
Confluent
• Founded by the original creators of Apache Kafka
• Headquartered in Palo Alto, CA
KSQL: Streaming SQL for Apache Kafka
Developer Preview
(https://github.com/confluentinc/ksql)
Thank You!