Date post: 24-May-2020

Best Practices For Loading Data To Distributed Systems With Change Data Capture

Alexey Goncharuk

2019 © GridGain Systems

Change Data Capture In The Wild

Agenda

● What is CDC?
● What can I do with CDC?
● What is available in Ignite / GridGain?

What is Change Data Capture?

What is CDC?

What is Change Data Capture

● Have a data set of arbitrary size
● Determine which records changed since a given moment
● Many ways to achieve this...

What Is CDC?

Record Change Markers

● Timestamps
● Versions
● Statuses
● Attached to application data model

Record Change Markers

ID    …    UPDATE_TS
1     …    2019-10-10 00:01:02.000
2     …    2019-10-09 11:01:02.000
3     …    2018-10-09 18:36:13.000
4     …    2019-09-01 01:02:03.000
10    …    2019-06-13 11:12:04.000

Record Change Markers

ID    …    UPDATE_TS
1     …    2019-10-10 00:01:02.000
2     …    2019-10-09 11:01:02.000
3     …    2018-10-09 18:36:13.000
4     …    2019-11-01 23:59:59.000
10    …    2019-11-15 14:00:00.000

Record Change Markers

SELECT * FROM Table WHERE UPDATE_TS > '2019-11-01 00:00:00.000'

ID    …    UPDATE_TS
1     …    2019-10-10 00:01:02.000
2     …    2019-10-09 11:01:02.000
3     …    2018-10-09 18:36:13.000
4     …    2019-11-01 23:59:59.000
10    …    2019-11-15 14:00:00.000
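
The timestamp query above can be tried end to end with a small script. This is a minimal sketch, not part of the original slides: it assumes an in-memory SQLite database, a hypothetical table name `Items` (since `Table` is a reserved word in SQL), and timestamps stored as fixed-width text so that string comparison matches chronological order. The index on `UPDATE_TS` is the "additional index for change markers" option; without it each poll is a full scan.

```python
import sqlite3


def fetch_changed_since(conn, since_ts):
    """Return (ID, UPDATE_TS) rows whose marker is strictly later than since_ts."""
    cur = conn.execute(
        "SELECT ID, UPDATE_TS FROM Items WHERE UPDATE_TS > ? ORDER BY UPDATE_TS",
        (since_ts,),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (ID INTEGER PRIMARY KEY, UPDATE_TS TEXT NOT NULL)")
# Hypothetical index name; it lets the poll avoid scanning every row.
conn.execute("CREATE INDEX idx_items_update_ts ON Items (UPDATE_TS)")
conn.executemany(
    "INSERT INTO Items (ID, UPDATE_TS) VALUES (?, ?)",
    [
        (1, "2019-10-10 00:01:02.000"),
        (2, "2019-10-09 11:01:02.000"),
        (3, "2018-10-09 18:36:13.000"),
        (4, "2019-11-01 23:59:59.000"),
        (10, "2019-11-15 14:00:00.000"),
    ],
)

# Only the rows touched after the watermark (IDs 4 and 10) come back.
changed = fetch_changed_since(conn, "2019-11-01 00:00:00.000")
print(changed)
```

A consumer would typically remember the largest `UPDATE_TS` it has seen and pass it as the next watermark. Note that only the latest state of each row is visible: several updates between two polls collapse into one.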

Record Change Markers

Cons

● Detecting changes is tricky
  ○ Full scan
  ○ Additional index for change markers
● No previous value (change coalescing)

Record Change Markers

Pros

● May be implemented in application layer
● Delayed change consumption
● Negligible storage overhead

What Is CDC?

Callbacks

● Triggers / interceptors / etc...
● User code is supplied to the storage system

Callbacks

[Diagram: Update → Callback → User-defined Action]

Callbacks

Cons

● Invoked synchronously
● Tricky failover in distributed systems

Callbacks

Pros

● No system storage/insert overhead
● Previous value is usually available
● May have an ability to modify updated value
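
The callback approach can be illustrated with a toy in-process store. This is a sketch under stated assumptions, not any real system's trigger or interceptor API: `CallbackStore`, `register`, and the two example callbacks are hypothetical names. It shows the properties listed on the slides: callbacks run synchronously inside the update, they see the previous value, and they may modify the value being written.

```python
from typing import Callable, Dict, List, Optional

# Callback signature: (key, old_value, new_value) -> value that is actually stored.
Interceptor = Callable[[str, Optional[int], int], int]


class CallbackStore:
    """Toy key-value store that runs user callbacks synchronously on every put."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}
        self._interceptors: List[Interceptor] = []

    def register(self, interceptor: Interceptor) -> None:
        self._interceptors.append(interceptor)

    def put(self, key: str, value: int) -> None:
        old = self._data.get(key)  # the previous value is handed to each callback
        for interceptor in self._interceptors:
            # Runs in the caller's thread, so a slow callback slows every update.
            value = interceptor(key, old, value)
        self._data[key] = value

    def get(self, key: str) -> Optional[int]:
        return self._data.get(key)


audit_log = []


def audit(key, old, new):
    """Record the change and pass the value through unchanged."""
    audit_log.append((key, old, new))
    return new


def clamp_non_negative(key, old, new):
    """Example of a callback that alters the value being written."""
    return max(new, 0)


store = CallbackStore()
store.register(audit)
store.register(clamp_non_negative)
store.put("a", 5)
store.put("a", -3)
print(audit_log, store.get("a"))
```

The sketch keeps no record of changes on its own; anything the callbacks do not persist is lost, which is the failover difficulty the "Cons" slide points at.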

What Is CDC?

Change Feed

● Changes are stored as events (Event Sourcing)
● Or changes produce events
● Consumers subscribe to a change feed
● Database WAL is an events source!

Change Feed

[Diagram: Update → Subscriber / Consumer]

Change Feed

Cons

● Need additional storage to keep changes

Change Feed

Pros

● Previous values are usually available
● Full change history is preserved
● Possibly an ability to re-read the history
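
A change feed can be sketched as an append-only log kept next to the data. The example below is illustrative only and assumes hypothetical names (`ChangeFeedStore`, `ChangeEvent`, `read_from`); it is not modelled on a specific database's WAL. Every update appends an event carrying the old and new values, and a consumer re-reads history from any offset.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    offset: int
    key: str
    old_value: Optional[Any]
    new_value: Optional[Any]  # None means the key was deleted


class ChangeFeedStore:
    """Toy store that records every change in an append-only log."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._log: List[ChangeEvent] = []  # the additional storage the feed needs

    def put(self, key: str, value: Any) -> None:
        self._append(key, self._data.get(key), value)
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._append(key, self._data.get(key), None)
        self._data.pop(key, None)

    def _append(self, key: str, old: Optional[Any], new: Optional[Any]) -> None:
        self._log.append(ChangeEvent(len(self._log), key, old, new))

    def read_from(self, offset: int) -> List[ChangeEvent]:
        """Return all events at or after the given offset; history can be re-read."""
        return self._log[offset:]


store = ChangeFeedStore()
store.put("k1", 1)
store.put("k1", 2)
store.delete("k1")

for event in store.read_from(0):
    print(event)
```

A consumer that remembers the last offset it processed can resume with `read_from(last_offset + 1)` after a restart. Unlike record change markers, intermediate values are not coalesced away.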

CDC Applications

CDC Applications

Continuous Data Integration

● “Active” database produces changes
● The changes are applied to a secondary system

Continuous Data Integration

[Diagram: Captured Changes]

CDC Applications

Continuous Data Integration

● Read offloading
● Audit changelog
● Cross-system replication
● High availability

CDC Applications

Running function calculation

● Computationally expensive function over a large set of items?
● Calculate once, then apply deltas

CDC Applications

Running function calculation

● AVG (ITEMS) = SUM (ITEMS) / COUNT (ITEMS)
  ○ O(N) complexity
● On insert => SUM += New Value, COUNT += 1
● On delete => SUM -= Deleted Value, COUNT -= 1
● On update => SUM = SUM - Old Value + New Value
● Average is then an O(1) operation
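
The delta rules for the running average translate directly into code. This is a minimal sketch with a hypothetical `RunningAverage` class. It assumes the change source delivers the old value on updates and the deleted value on deletes, which is why a feed or callback that exposes previous values matters here.

```python
class RunningAverage:
    """Maintains SUM and COUNT incrementally so the average is O(1) to read."""

    def __init__(self) -> None:
        self.total = 0
        self.count = 0

    def on_insert(self, new_value):
        self.total += new_value
        self.count += 1

    def on_delete(self, deleted_value):
        self.total -= deleted_value
        self.count -= 1

    def on_update(self, old_value, new_value):
        # COUNT is unchanged; only the difference is applied to SUM.
        self.total += new_value - old_value

    def average(self):
        """Return SUM / COUNT, or None when there are no items."""
        return self.total / self.count if self.count else None


avg = RunningAverage()
for value in (10, 20, 30):
    avg.on_insert(value)
after_inserts = avg.average()   # (10 + 20 + 30) / 3

avg.on_update(30, 60)
after_update = avg.average()    # (10 + 20 + 60) / 3

avg.on_delete(10)
after_delete = avg.average()    # (20 + 60) / 2

print(after_inserts, after_update, after_delete)
```

The same pattern works for any aggregate that can be updated from a delta (sums, counts, histograms). Aggregates such as MIN or MAX need extra state, because a delete may remove the current extreme.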

CDC Applications

Cross-System Active-Active Replication

● The update feed goes both ways
● Need to resolve conflicts
● Conflict-free Replicated Data Types (CRDTs) can help

Basic CRDTs

• Grow-only counter
• Positive-negative counter
• Grow-only set
• Two-phase set
• Last-write-wins
• …
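
Two of the listed CRDTs, the grow-only counter and the positive-negative counter, can be sketched in a few lines. This is an illustrative state-based version with hypothetical class names, not taken from the talk. Each replica tracks per-node counts and merges by taking the element-wise maximum, so merges are commutative, associative, and idempotent.

```python
class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max makes repeated or reordered merges harmless.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


class PNCounter:
    """Positive-negative counter built from two grow-only counters."""

    def __init__(self, node_id: str) -> None:
        self.p = GCounter(node_id)  # total of increments
        self.n = GCounter(node_id)  # total of decrements

    def increment(self, amount: int = 1) -> None:
        self.p.increment(amount)

    def decrement(self, amount: int = 1) -> None:
        self.n.increment(amount)

    def value(self) -> int:
        return self.p.value() - self.n.value()

    def merge(self, other: "PNCounter") -> None:
        self.p.merge(other.p)
        self.n.merge(other.n)


# Two replicas accept writes independently, then exchange state both ways.
a = PNCounter("A")
b = PNCounter("B")
a.increment(5)
b.increment(3)
b.decrement(1)

a.merge(b)
b.merge(a)
a.merge(b)  # merging again changes nothing

print(a.value(), b.value())
```

In an active-active setup the change feed carries the replica state (or deltas of it) in both directions, and the merge function replaces ad hoc conflict resolution.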

CDC In Apache Ignite

CDC In Ignite

Applying Changes To Ignite

● IgniteDataStreamer to optimally deliver changes to data nodes
● A user can supply a custom stream receiver
● Out-of-the-box integrations
  ○ Kafka
  ○ MQTT
  ○ …

CDC In Ignite

Callbacks

● CacheInterceptor
  ○ Guarantees update order
  ○ May alter inserted value
  ○ Synchronous, may affect performance

CDC In Ignite

Callbacks

● Cache Events
  ○ Guarantee update order
  ○ Asynchronous

CDC In Ignite

Callbacks And Change Feed Combined

● ContinuousQuery
  ○ Client - server subscription
  ○ Remote filter acts as a synchronous callback
  ○ Local listener acts as a sink
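
The split between a remote filter and a local listener can be modelled without any Ignite dependency. The sketch below is a conceptual toy and explicitly not the Ignite `ContinuousQuery` API: `ServerNode`, `subscribe`, and `drain` are hypothetical names, and a `deque` stands in for the network hop between the server node and the subscriber.

```python
from collections import deque


class ServerNode:
    """Toy data node: applies updates and evaluates filters where the data lives."""

    def __init__(self) -> None:
        self._data = {}
        self._queries = []  # (remote_filter, outbox) pairs, one per subscriber

    def subscribe(self, remote_filter):
        """Register a filter and return the queue its matching events land in."""
        outbox = deque()
        self._queries.append((remote_filter, outbox))
        return outbox

    def put(self, key, value):
        old = self._data.get(key)
        self._data[key] = value
        for remote_filter, outbox in self._queries:
            # Evaluated synchronously on the node that applies the update.
            if remote_filter(key, old, value):
                outbox.append((key, old, value))  # queued for the subscriber


def drain(outbox, local_listener):
    """Deliver queued events, in order, to the subscriber-side sink."""
    while outbox:
        local_listener(*outbox.popleft())


node = ServerNode()
outbox = node.subscribe(lambda key, old, new: new > 100)

node.put("k1", 50)    # filtered out on the server side
node.put("k2", 150)
node.put("k1", 200)

received = []
drain(outbox, lambda key, old, new: received.append((key, old, new)))
print(received)
```

Filtering on the server side keeps uninteresting events off the wire, while the listener only has to consume what arrives.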

CDC In Ignite

[Diagram labels: Update (K1), Update (K2), Remote Filter (two instances), Consumer]

CDC In Ignite

Callbacks And Change Feed Combined

● Automatic failover in case of primary node crash
● Single-key ordering guarantees

CDC In Ignite

● Ingestion
  ○ IgniteDataStreamer
● Capturing Changes
  ○ CacheInterceptor
  ○ Events
  ○ ContinuousQuery

Summary

● CDC is a powerful and well-known technique
● Many systems have built-in support for CDC
● May improve both development time and performance

Apache Ignite

Want To Contribute?

[email protected]
[email protected]

Q&A

Thank you for your attention!

