Kafka eos

Date post: 16-Mar-2018
Upload: nitin-kumar
Transcript
Page 1: Kafka eos

Exactly Once Semantics in Apache Kafka

Page 2: Kafka eos

Apache Kafka: A Distributed Streaming Platform

Consumers

Producers

Connectors

Processing

Page 3: Kafka eos

A Distributed What???

A Streaming Platform is a little like…

• A messaging system

• Except it scales horizontally, stores the streams persistently, and allows continuous stream processing

• Hadoop

• But not batch oriented

Page 4: Kafka eos

Logs: A data structure for continuous streams
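
A log in this sense is an append-only sequence of records, each identified by its position (offset). The sketch below is a minimal in-memory illustration under that assumption; the class name `SimpleLog` and its methods are hypothetical and do not reflect Kafka's storage implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory illustration of an append-only log; not Kafka's storage code.
class SimpleLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record at the end and returns the offset it was assigned.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Reads up to maxRecords starting at fromOffset; reading never modifies the log.
    List<String> read(long fromOffset, int maxRecords) {
        int start = (int) Math.max(0, Math.min(fromOffset, records.size()));
        int end = Math.min(start + Math.max(0, maxRecords), records.size());
        return new ArrayList<>(records.subList(start, end));
    }

    public static void main(String[] args) {
        SimpleLog log = new SimpleLog();
        log.append("a");
        log.append("b");
        System.out.println(log.append("c")); // 2
        System.out.println(log.read(1, 10)); // [b, c]
    }
}
```

Because readers only track an offset, many independent readers can consume the same log at their own pace.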

Page 5: Kafka eos

APIs

1. Producer and Consumer: Read and write streams

2. Connect: Managed Connectors that connect existing systems

3. Streams: Transformations of streams

Page 6: Kafka eos

Producer & Consumer API

Page 7: Kafka eos

Consumers Scale With Groups
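
The idea behind consumer groups is that the partitions of a topic are divided among the members of a group, so adding members spreads the load. The sketch below shows one simple way to do that (round-robin). The class `GroupAssignment` is hypothetical; Kafka's built-in assignors are more elaborate.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical round-robin assignment; Kafka's built-in assignors are more elaborate.
class GroupAssignment {
    // Spreads partitions 0..partitionCount-1 across the group members in turn.
    static Map<String, List<Integer>> assign(List<String> members, int partitionCount) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String member : members) {
            assignment.put(member, new ArrayList<>());
        }
        if (members.isEmpty()) {
            return assignment;
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = members.get(partition % members.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(assign(List.of("c1", "c2"), 4));       // {c1=[0, 2], c2=[1, 3]}
        System.out.println(assign(List.of("c1", "c2", "c3"), 4)); // {c1=[0, 3], c2=[1], c3=[2]}
    }
}
```

Each partition has exactly one owner within the group, which is why the partition count bounds the useful number of consumers in a group.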

Page 8: Kafka eos

Connect API

Page 9: Kafka eos

Kafka Connect Does The Hard Parts

1. Scale out

2. Fault Tolerance

3. Central Management

4. Schemas

Page 14: Kafka eos

Streams API

• Full power of a modern stream processing framework

• Distributed and fault-tolerant

• Natively uses event-time

• Stateful processing: joins, aggregations, etc.

• Integrates tables and streams

• Easy re-processing

• Just a library

Page 15: Kafka eos

Wordcount Example

Page 17: Kafka eos

Deploy as you wish

Page 18: Kafka eos

Not Limited To Java

CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
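
To make the query's semantics concrete, here is a plain-Java sketch of a tumbling-window count: each attempt is bucketed into a fixed 5-second window by its timestamp, and a card is flagged for a window once its count exceeds 3. The class `TumblingFraudCount` and its methods are hypothetical illustrations, not KSQL or Kafka Streams code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical plain-Java illustration of a tumbling-window count; not KSQL code.
class TumblingFraudCount {
    static final long WINDOW_MS = 5_000; // corresponds to SIZE 5 SECONDS

    // Key is "card@windowStart"; value is the number of attempts in that window.
    private final Map<String, Integer> counts = new HashMap<>();

    void onAttempt(String cardNumber, long timestampMs) {
        long windowStart = (timestampMs / WINDOW_MS) * WINDOW_MS; // align to window boundary
        counts.merge(cardNumber + "@" + windowStart, 1, Integer::sum);
    }

    // Corresponds to HAVING count(*) > 3: keep only windows above the threshold.
    Set<String> possibleFraud() {
        Set<String> flagged = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 3) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        TumblingFraudCount detector = new TumblingFraudCount();
        for (long t = 0; t < 4_000; t += 1_000) {
            detector.onAttempt("1234", t); // four attempts inside window [0, 5000)
        }
        detector.onAttempt("5678", 4_500);
        detector.onAttempt("5678", 5_500); // falls into the next window
        System.out.println(detector.possibleFraud()); // [1234@0]
    }
}
```

Tumbling windows do not overlap, so attempts that straddle a window boundary are counted separately.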

Page 19: Kafka eos

The Semantics of Working With Streams

Page 20: Kafka eos

Two Problems

1. Duplicate writes

2. Exactly-once processing

Page 21: Kafka eos

Problem #1:

Duplicate Writes

Page 22: Kafka eos

Duplicate Writes

Page 31: Kafka eos

Problem #2:

Duplicate Processing

Page 32: Kafka eos

Read From Offset=0

Page 33: Kafka eos

Process and Update State

Page 34: Kafka eos

Commit offset 0 as processed

Page 35: Kafka eos

Read from offset=1

Page 36: Kafka eos

Process and Update State

Page 37: Kafka eos

App crashes!

Page 38: Kafka eos

Restore from offset=0, resume processing

Page 39: Kafka eos

Choose your undesirable semantics

1. Update state, then save offset => At-Least-Once Delivery

2. Save offset, then update state => At-Most-Once Delivery
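
The effect of the two orderings can be reproduced with a small simulation. In the sketch below a crash is injected once, between the two steps, and processing then restarts from the last saved offset. All names are hypothetical; no Kafka APIs are involved.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the two orderings; names are illustrative, not Kafka APIs.
class DeliveryOrdering {
    // Processes input from the saved offset, crashing once at crashAtOffset between
    // the two steps and then restarting. Returns the records applied to the state.
    static List<String> run(List<String> input, boolean updateStateFirst, int crashAtOffset) {
        List<String> state = new ArrayList<>(); // durable application state
        int savedOffset = 0;                    // durable position: next record to read
        boolean crashed = false;
        while (savedOffset < input.size()) {
            int offset = savedOffset;
            if (updateStateFirst) {
                state.add(input.get(offset));
                if (offset == crashAtOffset && !crashed) {
                    crashed = true;
                    continue; // crash before the offset is saved: record is re-read
                }
                savedOffset = offset + 1;
            } else {
                savedOffset = offset + 1;
                if (offset == crashAtOffset && !crashed) {
                    crashed = true;
                    continue; // crash before the state is updated: record is skipped
                }
                state.add(input.get(offset));
            }
        }
        return state;
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c");
        System.out.println(run(input, true, 1));  // [a, b, b, c]  at-least-once: duplicate
        System.out.println(run(input, false, 1)); // [a, c]        at-most-once: loss
    }
}
```

Neither ordering helps on its own; the duplicate or the loss comes from the state and the offset being saved in two separate steps.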

Page 40: Kafka eos

Two workarounds

1. Make processing idempotent

• Much harder than it sounds in practice

2. Store offset in the application DB and update transactionally

• Not all stores support transactions

• Must handle zombies
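
The second workaround can be sketched as follows: state and offset are kept in one snapshot that is replaced atomically, so a redelivered record is recognised by its offset and skipped. This is a minimal sketch in which an `AtomicReference` stands in for a transactional store. The class `OffsetWithState` is hypothetical, and the sketch does not address zombies.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: an AtomicReference stands in for a transactional database.
class OffsetWithState {
    static final class Snapshot {
        final long total;      // application state: a running sum
        final long nextOffset; // offset of the next record to process
        Snapshot(long total, long nextOffset) {
            this.total = total;
            this.nextOffset = nextOffset;
        }
    }

    // Swapping the reference models a single atomic commit of state plus offset.
    private final AtomicReference<Snapshot> store = new AtomicReference<>(new Snapshot(0, 0));

    // Applies a record; redelivered records (offset below nextOffset) are ignored.
    void process(long offset, long value) {
        Snapshot current = store.get();
        if (offset < current.nextOffset) {
            return; // already applied before a crash or retry
        }
        store.set(new Snapshot(current.total + value, offset + 1));
    }

    long total() { return store.get().total; }
    long nextOffset() { return store.get().nextOffset; }

    public static void main(String[] args) {
        OffsetWithState app = new OffsetWithState();
        app.process(0, 10);
        app.process(1, 5);
        app.process(1, 5); // redelivery after a restart
        app.process(2, 1);
        System.out.println(app.total());      // 16
        System.out.println(app.nextOffset()); // 3
    }
}
```

On restart, the application would resume reading from `nextOffset` as stored, rather than from an offset committed separately.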

Page 41: Kafka eos

Solving These Problems

Page 42: Kafka eos

Solving Problem #1:

Avoiding Duplicate Writes with an Idempotent Producer

Page 43: Kafka eos

Basic idea

1. Unique ID for each message

2. Server deduplicates

Page 44: Kafka eos

Basic idea has problems

1. Random access database of all message ids?

2. Message IDs would be bulky

3. Must handle server fail-over

Page 45: Kafka eos

Better idea: Do it like TCP

1. Unique producer id for each producer (PID)

2. Each producer assigns a sequential number to each message it sends

3. The unique identifier is the PID + sequence number

4. Sequence number and PID both stored in the log
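
The scheme above can be sketched for a single partition: the server remembers the last sequence number it appended for each PID, drops anything at or below it as a retry, and rejects gaps. This is a simplified model with hypothetical names (`DedupPartition`, `append`); the real broker tracks sequences per batch and handles more cases.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of sequence-number deduplication for one partition.
class DedupPartition {
    private final List<String> log = new ArrayList<>();
    private final Map<Long, Integer> lastSeq = new HashMap<>(); // last appended sequence per PID

    // Returns true if the record was appended, false if it was a duplicate retry.
    // Throws on a gap, because a gap would mean an earlier message is missing.
    boolean append(long pid, int seq, String value) {
        int expected = lastSeq.getOrDefault(pid, -1) + 1;
        if (seq < expected) {
            return false; // retry of a message that is already in the log
        }
        if (seq > expected) {
            throw new IllegalStateException("expected sequence " + expected + " but got " + seq);
        }
        log.add(value);
        lastSeq.put(pid, seq);
        return true;
    }

    int size() { return log.size(); }

    public static void main(String[] args) {
        DedupPartition partition = new DedupPartition();
        System.out.println(partition.append(7L, 0, "a")); // true
        System.out.println(partition.append(7L, 1, "b")); // true
        System.out.println(partition.append(7L, 1, "b")); // false, retry is dropped
        System.out.println(partition.size());             // 2
    }
}
```

Only one integer per producer needs to be remembered, which avoids the bulky per-message ID database of the basic idea.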

Page 46: Kafka eos

The idempotent producer

Page 54: Kafka eos

Idempotent Producer

• Works transparently – no API changes.

• Fast enough you don’t need to worry about it

• Will be on by default in the future

Page 55: Kafka eos

Solving Problem #2:

Avoiding Duplicate Processing with Transactions

Page 57: Kafka eos

It’s More Complex Than I’ve Let On

• Multiple partitions

• Multiple input streams

• Non-determinism

• Diverse data stores

• Zombies

Page 58: Kafka eos

Transactions in Kafka

Page 59: Kafka eos

Introducing transactions

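producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(record0);
    producer.send(record1);
    producer.commitTransaction();
} catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
    // Fatal errors: this producer cannot continue, so close it instead of aborting.
    producer.close();
} catch (KafkaException e) {
    // Any other error: abort so that neither record becomes visible, then retry if desired.
    producer.abortTransaction();
}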

Page 60: Kafka eos

Introducing ‘transactions’

Page 61: Kafka eos

Initializing ‘transactions’

Page 62: Kafka eos

Transactional sends – part 1

Page 63: Kafka eos

Transactional sends – part 2

Page 64: Kafka eos

Commit – phase 1

Page 65: Kafka eos

Commit – phase 2

Page 67: Kafka eos

Success!

Page 68: Kafka eos

Consumer returns only committed messages
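
One way to picture this: the partition log holds data records plus commit and abort markers, and a consumer reading only committed data skips records of aborted transactions and does not read past the first record of a transaction that is still open. The sketch below is a simplified model of that behaviour. All class and method names are hypothetical, and it assumes every record is written transactionally.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of reading only committed records from one partition.
class ReadCommittedSketch {
    enum Kind { DATA, COMMIT, ABORT }

    static final class Entry {
        final long pid;     // producer that wrote the entry
        final Kind kind;    // data record or transaction marker
        final String value; // payload, null for markers
        Entry(long pid, Kind kind, String value) {
            this.pid = pid;
            this.kind = kind;
            this.value = value;
        }
    }

    static Entry data(long pid, String value) { return new Entry(pid, Kind.DATA, value); }
    static Entry marker(long pid, Kind kind) { return new Entry(pid, kind, null); }

    // Returns, in log order, the values a committed-only reader would be allowed to see.
    static List<String> readCommitted(List<Entry> log) {
        boolean[] committed = new boolean[log.size()];
        Map<Long, List<Integer>> open = new HashMap<>(); // pid -> positions of its open transaction
        for (int i = 0; i < log.size(); i++) {
            Entry e = log.get(i);
            if (e.kind == Kind.DATA) {
                open.computeIfAbsent(e.pid, k -> new ArrayList<>()).add(i);
            } else {
                List<Integer> positions = open.remove(e.pid);
                if (positions != null && e.kind == Kind.COMMIT) {
                    for (int p : positions) committed[p] = true;
                }
            }
        }
        // Nothing at or beyond the first record of a still-open transaction is returned.
        int lastStable = log.size();
        for (List<Integer> positions : open.values()) {
            lastStable = Math.min(lastStable, positions.get(0));
        }
        List<String> visible = new ArrayList<>();
        for (int i = 0; i < lastStable; i++) {
            if (committed[i]) visible.add(log.get(i).value);
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Entry> log = new ArrayList<>(List.of(
            data(1, "a"), data(2, "x"), data(1, "b"),
            marker(1, Kind.COMMIT), marker(2, Kind.ABORT), data(3, "c")));
        System.out.println(readCommitted(log)); // [a, b]
        log.add(marker(3, Kind.COMMIT));
        System.out.println(readCommitted(log)); // [a, b, c]
    }
}
```

Records keep their log order; a committed-only reader simply waits until the outcome of each transaction is known.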

Page 69: Kafka eos

Transactions => Stream Processing

Page 70: Kafka eos

Factor problem into two parts

1. Transforming input streams to output streams (Streams)

2. Connecting output streams to data systems (Connect)

Page 71: Kafka eos

Stream processing with Kafka

1. Read from input streams

2. Process and update state

3. Produce to output streams

4. Save offsets

Page 72: Kafka eos

Stream processing as a sequence of transactions

BEGIN

1. Read from input streams

2. Process and update state

3. Produce to output streams

4. Save Offsets

COMMIT
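
The BEGIN/COMMIT loop above can be simulated without Kafka: output and the new input offset are staged in private buffers and only become visible together at commit, so a crash before commit leaves no trace and the record is simply processed again. This is a minimal sketch with hypothetical names (`TransactionalLoop`, `run`); upper-casing stands in for the processing step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical simulation of processing as a sequence of transactions; no Kafka APIs.
class TransactionalLoop {
    final List<String> output = new ArrayList<>(); // committed output stream
    int committedOffset = 0;                        // committed input position

    // Processes input from the committed offset, crashing once before COMMIT at crashAtOffset.
    void run(List<String> input, int crashAtOffset) {
        boolean crashed = false;
        while (committedOffset < input.size()) {
            // BEGIN: work on private buffers that stay invisible until commit.
            int offset = committedOffset;
            List<String> pendingOutput = new ArrayList<>();
            pendingOutput.add(input.get(offset).toUpperCase(Locale.ROOT)); // read, process, produce
            int pendingOffset = offset + 1;                                 // save offsets
            if (offset == crashAtOffset && !crashed) {
                crashed = true;
                continue; // crash: buffers are dropped, which models an aborted transaction
            }
            // COMMIT: output and offset become visible together.
            output.addAll(pendingOutput);
            committedOffset = pendingOffset;
        }
    }

    public static void main(String[] args) {
        TransactionalLoop loop = new TransactionalLoop();
        loop.run(List.of("a", "b", "c"), 1);
        System.out.println(loop.output);          // [A, B, C]
        System.out.println(loop.committedOffset); // 3
    }
}
```

Because nothing from the failed attempt is visible, the retry produces neither a duplicate nor a gap.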

Page 73: Kafka eos

The Theory

• Two Generals

• Atomic Broadcast

• Consensus

Page 74: Kafka eos

In Practice

Page 75: Kafka eos

Performance

• Up to +20% producer throughput

• Up to +50% consumer throughput

• Up to -20% disk utilization

• Savings start when you batch

• Details: https://bit.ly/kafka-eos-perf

Page 76: Kafka eos

Cool!

But how do I use this?

Page 77: Kafka eos

Producer Configs

• enable.idempotence = true

• acks = “all”

• retries > 0 (preferably MAX_INT)

• transactional.id = ‘some unique id’
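
As a sketch, these settings can be collected in a `java.util.Properties` object of the kind a Kafka producer is constructed from. The class `EosProducerConfig`, the broker address, and the example transactional id are assumptions for illustration; `bootstrap.servers` is not on the slide but is needed by any producer, as are key and value serializers, which are left out here.

```java
import java.util.Properties;

// Hypothetical helper that gathers the producer settings listed above.
class EosProducerConfig {
    static Properties build(String bootstrapServers, String transactionalId) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers); // assumed broker address
        props.setProperty("enable.idempotence", "true");
        props.setProperty("acks", "all");
        props.setProperty("retries", Integer.toString(Integer.MAX_VALUE));
        props.setProperty("transactional.id", transactionalId);   // must be unique per producer instance
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("localhost:9092", "orders-processor-1");
        System.out.println(props.getProperty("acks"));    // all
        System.out.println(props.getProperty("retries")); // 2147483647
    }
}
```

The transactional id should stay the same across restarts of the same logical producer, since that is what lets the broker fence an older instance.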

Page 78: Kafka eos

Consumer configs

• isolation.level:

• “read_committed”, or

• “read_uncommitted”

Page 79: Kafka eos

Streams config

• processing.guarantee = “exactly_once”

Page 80: Kafka eos

Confluent

• Founded by the original creators of Apache Kafka

• Headquarters based in Palo Alto, CA

KSQL: Streaming SQL for Apache Kafka

Developer Preview

(https://github.com/confluentinc/ksql)

Page 81: Kafka eos

Thank You!

