
Adaptive Replication and Partitioning

in Data Systems

Brad Glasbergen, Michael Abebe, Khuzaima Daudjee

Middleware 2018

Data Systems Group

PAGE 2

PAGE 3

PAGE 4

PAGE 5

Single Node Architecture

[Diagram: a single server S1 holding data items A and B, overloaded]

PAGE 6

Single Node Architecture

How to scale beyond a single node?

Replicate and partition

PAGE 7

Single Node Architecture

How to scale beyond a single node?

Replicate and partition

PAGE 8

Replicated Architecture

[Diagram: S1, S2, S3 each hold replicas of A and B]

Handle more requests

PAGE 9

Replicated Architecture

Cost of coordination

PAGE 10

Replicated Architecture

How many replicas?

[Diagram: replicas of A and B spread unevenly across S1, S2, S3]

PAGE 11

Replicated Architecture

Where to place replicas?

[Diagram: writes W[A] arriving at the replicas of A]

PAGE 13

Replicated Architecture

How to propagate updates?

[Diagram: a write W[A] must reach the other replicas]

PAGE 14

● (A)synchronous

● Consistency

● How many replicas?

● Where to place replicas?

● How to propagate updates?

Replication Decisions

PAGE 15

Single Node Architecture

How to scale beyond a single node?

Replicate and partition

PAGE 16

Partitioned Architecture

[Diagram: partition A on S1, partition B on S2; distributes requests]

PAGE 18

Partitioned Architecture

How to form partitions?

[Diagram: A split into A1 and A2 across S1 and S2, alongside B]

Where to place partitions?

How to execute multi-partition operations?

[Diagram: an operation spanning A on S1 and B on S2]

PAGE 21

Partitioning Decisions

● How to form partitions?

● Where to place partitions?

● How to execute multi-partition operations?

PAGE 23

PAGE 23

How to make a partitioning or replication decision when access patterns change?

Static Decisions

Why do access patterns change?

PAGE 24

Why do accesses change?

Humans have follow-the-sun cycles

PAGE 25

Why do accesses change?

Load bursts

PAGE 26

Why do accesses change?

Shifting hot-spots

PAGE 27

How to make a partitioning or replication decision when access patterns change?

Static Decisions

Adaptively replicate and partition

PAGE 28

Where to place partitions?

[Diagram: partition A2 migrates from S1 to S2, which already holds B]

PAGE 29

How many replicas?

[Diagram: a read R[B] arrives at a site without B, so a replica of B is created there]

PAGE 32

● Adaptive Replication

● Adaptive Partitioning

● Outlook

Road Map

PAGE 33

Adaptive Replication

PAGE 34

● How many replicas?

● Where to place replicas?

● How to propagate updates?

Replication Decisions

PAGE 35

Adaptive Replication

● Decentralized

● Geo-Distributed

● Caching

● Availability

PAGE 36

Adaptive Replication (ADR)

(Wolfson et al., TODS 1997)

[Diagram: sites S1 to S6 connected in a tree; item A is stored at S1]

PAGE 39

Local Read

(Wolfson et al., TODS 1997)

[Diagram: R[A] at S1 is served by the local copy of A]

PAGE 40

Remote Read

(Wolfson et al., TODS 1997)

[Diagram: R[A] at S2 is forwarded to S1, which holds A]

More Messages!

PAGE 43

Replication for Reads

(Wolfson et al., TODS 1997)

[Diagram: a replica A' is created at S2 so R[A] is served locally]

PAGE 44

No Free Lunch

(Wolfson et al., TODS 1997)

[Diagram: a write W[A] must now update both A at S1 and A' at S2]

Reduces read cost, increases write cost

PAGE 46

When to Replicate?

(Wolfson et al., TODS 1997)

[Diagram: S2 issues 15 reads of A; S1 and S3 each issue 5 writes]

PAGE 48

When to Replicate?

(Wolfson et al., TODS 1997)

[Diagram: after creating A' at S2, the 15 remote reads become local, at the cost of 10 extra write propagations]

5 Fewer Messages!

PAGE 50
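The message-counting argument on these slides can be written down directly. A minimal sketch, assuming per-site read and write counters; the function and counter names are illustrative, not from the paper:

```python
def should_expand(remote_reads_from_site: int, writes_by_other_sites: int) -> bool:
    """ADR expansion test: create a replica at a neighbouring site when the
    read messages it currently sends us outnumber the write propagations
    the new replica would start receiving."""
    return remote_reads_from_site > writes_by_other_sites

def should_contract(local_reads_served: int, writes_received: int) -> bool:
    """ADR contraction test: drop a replica when propagating writes to it
    costs more messages than the remote reads it saves."""
    return writes_received > local_reads_served

# Slide numbers: 15 remote reads vs. 10 write propagations -> replicate (saves 5).
# Later: 20 propagated writes vs. 10 reads served -> drop the replica (saves 10).
```

Each site evaluates these tests on its own counters, which is what makes the scheme decentralized.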

When to Stop Replicating?

(Wolfson et al., TODS 1997)

[Diagram: the replica A' at S2 receives 20 writes but serves only 10 reads; dropping it saves 10 messages]

10 Fewer Messages!

PAGE 53

Decentralized Decisions

(Wolfson et al., TODS 1997)

[Diagram: each site independently asks "Replicate?"]

PAGE 54

Adapts to Changing Workloads

(Wolfson et al., TODS 1997)

[Diagram: as reads at a site increase, it decides to create a replica A']

PAGE 55

Extensions (Network Topology)

(Wolfson et al., TODS 1997)

[Diagram: a general network of sites S1 to S6 is reduced to a spanning tree]

Form Tree

PAGE 58

Extensions (Blocks)

(Wolfson et al., TODS 1997)

[Diagram: items A, B, C replicated together as a block under a mixed workload of ⅓ R[A] + ⅓ R[B] + ⅓ R[C] and ⅓ W[A] + ⅓ W[B] + ⅓ W[C]]

PAGE 59

Peer-to-Peer File Systems

PAGE 60

Hash-Based P2P File System

[Diagram: files are placed by hashing their names, e.g. h(s) = 275, onto nodes in the overlay]

Balanced load, but poor access locality!

PAGE 65

Hierarchical P2P File Systems

[Diagram: files are placed along the namespace hierarchy, near their users]

Good locality, but poor balance

PAGE 70

Locality and Load Balance

(Gopalakrishnan et al., ICDCS’04)

[Diagram sequence: nodes range from lightly loaded to overloaded; with no replication, popular files overload their home nodes. Replicate to balance load! Requesters hold decentralized replicas, spreading the load]

Decentralized Replicas: what about writes?

PAGE 78

Replication Decisions

(Wolfson et al., TODS 1997) (Gopalakrishnan et al., ICDCS’04)

● How many replicas? Decentralized decision

● Where to place replicas? At the requester

● How to propagate updates? ADR: synchronous; P2P: read-only

PAGE 79

Global Scale Replication

[Diagram sequence: world maps of client-to-replica latencies. Average latency is 31 ms for nearby clients, 113 ms once clients are global with a single site, and drops to 95 ms and then 64 ms as replicas are added around the world]

PAGE 86

Global Scale Replication

Place data around the world to minimize the cost of access

Take me to your leader! (Sharov et al., VLDB 2015)

GPlacer (Zakhary et al., EDBT 2018)

PAGE 87

Data Distribution Model

(Sharov et al., VLDB 2015)

[Diagram sequence: items A and B replicated at sites S1 to S4; S1, S2, S3 hold read-write replicas, while S4 holds read replicas that receive state changes; one read-write replica, the leader, coordinates]

PAGE 90

Global Replication Problem

(Sharov et al., VLDB 2015)

● Select replicas

● Assign replica roles (read or read-write)

● Assign leader

PAGE 91

Global Replication Problem

(Sharov et al., VLDB 2015)

● Select replicas

● Assign replica roles (read or read-write)

● Assign leader

PAGE 92

Assign Leader

(Sharov et al., VLDB 2015)

Leader: site that minimizes access costs

PAGE 93

Write Transaction Cost

(Sharov et al., VLDB 2015)

[Diagram: a client at S3 sends its write to the leader S1]

Quorum write cost = RTT(S3, S1) + median RTT(S1, {S1, S2, S3})

PAGE 94

Assign Leader

(Sharov et al., VLDB 2015)

Leader: site that minimizes access costs

Client cost: RTT(client, replica) + cost(transaction)

PAGE 95

Weighting Client Cost

(Sharov et al., VLDB 2015)

[Diagram: client sites issue different loads, e.g. 2 writes/2 reads, 10 writes/20 reads, 5 writes/5 reads]

PAGE 97

Assign Leader

(Sharov et al., VLDB 2015)

Leader: site that minimizes access costs

Client cost: RTT(client, replica) + cost(transaction)

Cost: weighted average of client costs

PAGE 98
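The leader-assignment objective can be sketched directly. A minimal sketch, assuming a symmetric RTT matrix and modelling the quorum write as the median RTT from the leader to the read-write replicas; the names and data shapes are illustrative:

```python
from statistics import median

def leader_cost(leader, rw_sites, clients, rtt):
    """Weighted average client cost if `leader` coordinates writes: each
    client pays its RTT to the leader plus a quorum round, modelled here
    as the median RTT from the leader to the read-write replicas."""
    quorum = median(rtt[leader][s] for s in rw_sites)
    total = sum(weight for _, weight in clients)
    return sum(w * (rtt[c][leader] + quorum) for c, w in clients) / total

def assign_leader(rw_sites, clients, rtt):
    """Pick the read-write replica that minimizes weighted client cost."""
    return min(rw_sites, key=lambda s: leader_cost(s, rw_sites, clients, rtt))
```

With most of the write weight near S1, the site closest to that weight wins the leader role.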

Global Replication Problem

(Sharov et al., VLDB 2015)

● Select replicas

● Assign replica roles (read or read-write)

● Assign leader

PAGE 99

Assign Replica Roles

(Sharov et al., VLDB 2015)

Leader: minimizes median RTT to read-write replicas

Read-write replicas: ?

PAGE 100

[Diagram: write transaction cost revisited; send the write to the leader, quorum cost = RTT(S3, S1) + median RTT(S1, {S1, S2, S3})]

PAGE 101

Assign Replica Roles

(Sharov et al., VLDB 2015)

Leader: minimizes median RTT to read-write replicas

Read-write replicas: lowest RTT to leader

PAGE 102

● Select replicas

● Assign replica roles (read or read-write)

● Assign leader

Global Replication Problem

(Sharov et al., VLDB 2015)PAGE 103

Replica selection

(Sharov et al., VLDB 2015)

Read-write replicas: Lowest RTT to leader

Read replicas:

Leader: minimizes median RTT to read-write replicas

PAGE 104

Replica Selection

(Sharov et al., VLDB 2015)

Client cost: RTT(client, replica) + cost(transaction)

Read replicas: lowest RTT to clients

Read-write replicas: lowest RTT to leader

Leader: minimizes median RTT to read-write replicas

PAGE 107

K-Means Replica Selection

(Sharov et al., VLDB 2015)

[Diagram: client sites are clustered with k-means; replicas are selected at the cluster centres, then the leader and read-write replicas are assigned. Average latency: 55 ms]

PAGE 110

Leaderless Protocols

(Zakhary et al., EDBT 2018)

[Diagram: any quorum member can coordinate a write; there is no fixed leader]

PAGE 111

Hinted Handoff

(Zakhary et al., EDBT 2018)

[Diagram: a request costing 301 at its local site is handed off to a site where the total cost is 295 < 301]

Hand off a request from S1 to S2 if:

cost(S1) > RTT(S1, S2) + cost(S2)

where cost(S) = the cost of executing the request at S

PAGE 114
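The handoff rule is a one-line predicate. A minimal sketch; the 301 figure is from the slide, while the split of 295 into an RTT of 95 plus a remote cost of 200 is an illustrative assumption:

```python
def should_hand_off(cost_s1: float, cost_s2: float, rtt_s1_s2: float) -> bool:
    """Hinted handoff: forward the request from S1 to S2 only when paying
    the extra network round trip still beats executing locally."""
    return cost_s1 > rtt_s1_s2 + cost_s2

# Slide example: 95 + 200 = 295 < 301, so the request is handed off.
```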

Replication Decisions

(Sharov et al., VLDB 2015) (Zakhary et al., EDBT 2018)

● How many replicas? Centralized, given the client workload

● Where to place replicas? Heuristic (clustering)

● How to propagate updates? Quorums / leader-based (Sharov)

PAGE 115

Inter-Region Latency

(Sivasubramanian et al., WWW 2005)

100+ ms

PAGE 116

Edge Nodes

(Sivasubramanian et al., WWW 2005)

PAGE 117

GlobeDB

(Sivasubramanian et al., WWW 2005)

Edge nodes support static data; what about dynamic data?

PAGE 118

Replication Granularity

(Sivasubramanian et al., WWW 2005)

ID ARTIST

1 Bryan Adams

2 Justin Bieber

3 Avril Lavigne

PAGE 119

Replication Granularity

(Sivasubramanian et al., WWW 2005)

Per-Record? High overhead

PAGE 120

ID ARTIST

1 Bryan Adams

2 Justin Bieber

3 Avril Lavigne

Replication Granularity

(Sivasubramanian et al., WWW 2005)

Per-Table? Inflexible

PAGE 121

ID ARTIST

1 Bryan Adams

2 Justin Bieber

3 Avril Lavigne

Access-Driven Replicas

(Sivasubramanian et al., WWW 2005)

When would these be replicated together?

PAGE 122

ID ARTIST

1 Bryan Adams

2 Justin Bieber

3 Avril Lavigne

Access-Driven Replicas

(Sivasubramanian et al., WWW 2005)

A_bryan = ⟨r1, ..., rn, w1, ..., wn⟩

PAGE 123

ID ARTIST

1 Bryan Adams

2 Justin Bieber

3 Avril Lavigne

Access-Driven Replicas

(Sivasubramanian et al., WWW 2005)

Sim(A_bryan, A_justin) ≥ τ ?

PAGE 124

A_justin = ⟨r1, ..., rn, w1, ..., wn⟩

ID ARTIST

1 Bryan Adams

2 Justin Bieber

3 Avril Lavigne

Access-Driven Replicas

(Sivasubramanian et al., WWW 2005)

Shared Replication Scheme

PAGE 125
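The access-vector grouping above can be sketched as a greedy clustering. The paper's exact similarity function Sim is not reproduced here, so cosine similarity stands in for it, and the threshold τ, the vectors, and the grouping strategy are illustrative assumptions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two access vectors (stand-in for Sim)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def group_records(access_vectors, tau):
    """Greedy grouping: each record joins the first group whose
    representative access vector is at least tau-similar, otherwise it
    starts a new group (its own vector becomes the representative)."""
    groups = []  # (representative vector, member record ids)
    for rid, vec in access_vectors.items():
        for rep, members in groups:
            if cosine(rep, vec) >= tau:
                members.append(rid)
                break
        else:
            groups.append((vec, [rid]))
    return [members for _, members in groups]
```

Records with similar read/write patterns across edge servers end up sharing one replication scheme.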

ID ARTIST

1 Bryan Adams

2 Justin Bieber

3 Avril Lavigne

Access-Driven Replicas

(Sivasubramanian et al., WWW 2005)

A_p1 = ⟨r1, ..., rn, w1, ..., wn⟩

PAGE 126

ID ARTIST

1 Bryan Adams

2 Justin Bieber

3 Avril Lavigne

Access-Driven Replicas

(Sivasubramanian et al., WWW 2005)

A_avril = ⟨r1, ..., rn, w1, ..., wn⟩

PAGE 127

ID ARTIST

1 Bryan Adams

2 Justin Bieber

3 Avril Lavigne

Sim(A_p1, A_avril) ≥ τ ?

Access-Driven Replicas

(Sivasubramanian et al., WWW 2005)PAGE 128

ID ARTIST

1 Bryan Adams

2 Justin Bieber

3 Avril Lavigne

Access-Driven Replicas

(Sivasubramanian et al., WWW 2005)PAGE 129

ID ARTIST

1 Bryan Adams

2 Justin Bieber

3 Avril Lavigne

4 Kanye West

5 Drake

6 David Guetta

7 Ed Sheeran

Transaction Processing

(Sivasubramanian et al., WWW 2005)

Origin server: decide partitions, place replicas, place master

PAGE 130

Transaction Processing

(Sivasubramanian et al., WWW 2005)

[Diagram sequence: a write W[B] is forwarded to the master copy M, which pushes updates to the replicas; subsequent reads R[B] are served at the edge]

PAGE 135

Replica and Master Placement

(Sivasubramanian et al., WWW 2005)

Read Latency vs. Bandwidth

[Diagram: replica placement trades read latency (weight 𝛼) against update bandwidth (weight 𝛽); where should the master M go?]

Master Partition Placement

(Sivasubramanian et al., WWW 2005)

PAGE 140

Placement Heuristic

(Sivasubramanian et al., WWW 2005)

[Diagram sequence: placement thresholds of 100%, 95%, 30%, 5%, and 0% trade off the number of replicas against update cost]

Minimize: 𝛼·r + 𝛽·b

PAGE 145

Replication Decisions

(Sivasubramanian et al., WWW 2005)

● How many replicas? Cost-based, given requests

● Where to place replicas? Cost-based, given requests

● How to propagate updates? Single-master, eventual consistency

PAGE 146

Web Workload Characteristics

(Glasbergen et al., EDBT 2018)

Read-heavy

Cache misses are very painful

PAGE 147

Web Workload Characteristics

(Glasbergen et al., EDBT 2018)

[Diagram: measurements of web workloads]

PAGE 150

Predictive Caching

(Glasbergen et al., EDBT 2018)

[Diagram: Q1 is forwarded to the database and its result returned; the system predicts Q2 will follow and caches its result ahead of time]

PAGE 151

Query Patterns (TPC-W)

(Glasbergen et al., EDBT 2018)

PAGE 153

Building a Predictive Model

(Glasbergen et al., EDBT 2018)

[Diagram sequence: scanning the query stream Q1 Q2 Q3 Q1 Q2 Q3 ... increments a count on graph edge Qi → Qj each time Qj directly follows Qi]

With every query executed 5 times, the edge counts give:

100% probability Q2 follows Q1

20% probability Q1 follows Q3
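The counting scheme above amounts to a first-order transition model. A minimal sketch; the query stream below is illustrative rather than the slide's exact trace:

```python
from collections import defaultdict

def build_model(stream):
    """Turn a query stream into follow probabilities: count every
    consecutive pair (a -> b), then normalize by a's outgoing total."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(stream, stream[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def predict(model, last_query, threshold=0.5):
    """Return the successors whose follow probability clears the
    threshold; these are the candidates to predictively cache."""
    return [q for q, p in model.get(last_query, {}).items() if p >= threshold]
```

After seeing Q1, the model suggests caching Q2's result before the client asks for it.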

Finding Parameter Mappings

(Glasbergen et al., EDBT 2018)

[Diagram sequence: the parameters of a predicted query are mapped from the parameters and results of earlier queries]

PAGE 161

Predictive Caching

(Glasbergen et al., EDBT 2018)

[Diagram: after Q1 runs, its likely successors Q2 and Q3 are predictively cached]

PAGE 164

Apollo Deployment

(Glasbergen et al., EDBT 2018)

[Diagram: Apollo caches (A) are deployed near clients; reads such as R[B] and R[C] hit the nearest cache]

PAGE 166

Invalidations

(Glasbergen et al., EDBT 2018)

[Diagram: a write W[B] invalidates cached copies of B]

Invalidations limit cache effectiveness

PAGE 168

Session Semantics

(Glasbergen et al., EDBT 2018)

[Diagram: within a session, a client's R[B] after its own W[B] stays consistent]

Good fit for web data!

PAGE 170

Replication Decisions

(Glasbergen et al., EDBT 2018)

● How many replicas? Predictively, based on requests

● Where to place replicas? Client edge cache, predictively

● How to propagate updates? Cache updates with sessions

PAGE 174

Replication for Availability

Failures are common

Data systems must remain available

PAGE 175

Replication for Availability

[Diagram: partitions A, B, C, D replicated across servers S1 to S5]

Tolerating r faults requires r + 1 replicas

PAGE 176

Lower Overhead

PAGE 177

Erasure Coding

[Diagram: instead of full replicas, data blocks B1 and B2 are stored with a parity block B1 xor B2; any one lost block can be rebuilt, e.g. B2 xor (B1 xor B2) = B1]

Data is split into k partitions plus r parity partitions (1, 2, ..., k, k+1, ..., k+r)

Tolerating r faults requires (k + r) / k space

PAGE 178
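The XOR identity on the slide is enough to implement the simplest erasure code (k = 2, r = 1); production systems use Reed-Solomon-style codes for larger r. A minimal sketch, assuming equal-length blocks:

```python
def encode(b1: bytes, b2: bytes):
    """Store k = 2 data blocks plus r = 1 XOR parity block.
    Assumes b1 and b2 have equal length."""
    parity = bytes(x ^ y for x, y in zip(b1, b2))
    return [b1, b2, parity]

def recover(blocks, lost: int) -> bytes:
    """Rebuild any single lost block by XOR-ing the two survivors,
    using B2 xor (B1 xor B2) = B1 and its permutations."""
    a, b = (blk for i, blk in enumerate(blocks) if i != lost)
    return bytes(x ^ y for x, y in zip(a, b))
```

Three stored blocks tolerate one fault at 1.5x space, versus 2x for full replication.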

Erasure Coding

[Diagram: W[A] encodes A into pieces A1 to A4 (k = 2, r = 2) stored on S1 to S4; R[A] reads enough pieces in parallel and decodes]

Reduces storage overhead; requires parallel retrieval

PAGE 181

Erasure Coded Storage

Where to place data? How to access data?

EC-Store (Abebe et al., ICDCS 2018)

PAGE 182

EC-Store Data Access

(Abebe et al., ICDCS 2018)

[Diagram: blocks A1 to A3, B1 to B3, C1 spread over S1 to S5 (k = 2, r = 1); requests R[A, B] and R[C] pick which sites to read in a load-aware way]

PAGE 183

EC-Store Data Access

(Abebe et al., ICDCS 2018)

Access Strategy: Minimize cost of access

Cost of site access: load at site + I/O at site

PAGE 184

EC-Store Data Movement

(Abebe et al., ICDCS 2018)

[Diagram: block A3 moves to a less-loaded site in response to the observed accesses (k = 2, r = 1)]

PAGE 185

EC-Store Data Movement

(Abebe et al., ICDCS 2018)

Move data to minimize cost of future accesses and balance system load

Model access patterns to predict future accesses

PAGE 186

Replication Decisions

(Abebe et al., ICDCS 2018)

● How many replicas? Fault-tolerance requirements

● Where to place replicas? Dynamic movement, using access costs

● How to propagate updates? Synchronous updates

PAGE 187

● Adaptive Replication

● Adaptive Partitioning

● Outlook

Road Map

PAGE 188

Adaptive Partitioning

PAGE 189

● How to form partitions? ● Where to place partitions?

● How to execute multi-partition operations?

Partitioning Decisions

PAGE 190

Adaptive Partitioning

● Iterative improvements

● Partitioning per request

● Considering the overall workload
  ○ Heuristics

PAGE 191

PAGE 192

PAGE 193

Physical Database Design

A B C D
1 2 4 8
2 4 6 3
3 6 7 10

[Diagram sequence: row-major storage lays the table out row by row: 1 2 4 8, then 2 4 6 3, then 3 6 7 10]

PAGE 197

SELECT AVG(C) FROM R WHERE R.D > 5;

Analytic Database Design

[Diagram: a column store keeps A, ..., C, D as separate columns; the query scans column D, or uses an index on D]

Need to know what to index upfront

PAGE 200

Adaptive Range Indexing

SELECT ... FROM R WHERE R.D > 5;
SELECT ... FROM R WHERE R.D > 5 AND R.D < 10;
SELECT ... FROM R WHERE R.D > 10 AND R.D < 20;

PAGE 201

Database Cracking

(Idreos et al., CIDR 2007)

[Diagram: Column D is copied into a "cracked" column that queries will physically reorganize]

PAGE 202

Indexes via Partitioning

(Idreos et al., CIDR 2007)

SELECT ... WHERE R.D > 5

[Diagram sequence: the scan swaps values in the cracked column so everything ≤ 5 ends up before everything > 5, one quicksort-style partitioning step]

SELECT ... WHERE R.D > 5 AND R.D < 10

[Diagram: only the piece already known to be > 5 needs to be considered; it is cracked again at 10]

SELECT ... WHERE R.D > 10 AND R.D < 20

[Diagram: again only the relevant piece is touched and cracked at 20]

Iterative Partitioning for Indexing

PAGE 214
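One cracking step is just an in-place partitioning pass. A minimal sketch of crack-in-two; the column values are illustrative:

```python
def crack(column, pivot):
    """Crack-in-two: one quicksort-style partitioning pass that moves
    values <= pivot to the front and returns the split position, which
    the cracker index remembers so later queries touch only the
    relevant piece."""
    col = list(column)
    lo, hi = 0, len(col) - 1
    while lo <= hi:
        if col[lo] <= pivot:
            lo += 1
        else:
            col[lo], col[hi] = col[hi], col[lo]
            hi -= 1
    return col, lo  # col[:lo] holds values <= pivot, col[lo:] values > pivot
```

A follow-up range query such as `D > 5 AND D < 10` cracks only the `> 5` piece again at 10, refining the index query by query.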

Cracking: Extensions

(Idreos et al., CIDR 2007)

[Diagram: the cracked column after several queries]

Advanced cracking methods also consider the value distribution

PAGE 215

Partitioning Decisions

(Idreos et al., CIDR 2007)

● How to form partitions? Iteratively, based on queries

● Where to place partitions? Sorted in memory

● How to execute multi-partition operations? N/A

PAGE 216

Exploratory Workloads?

App Usage Time

PAGE 217

Exploratory Workloads?

Usage By Device

PAGE 218

Exploratory Workloads?

Revenue By Device

PAGE 219

Exploratory Workloads?

Devices By Country

No upfront information, need generic partitioning!

PAGE 220

Initial Partitioning (KD-Tree)

(Shanbhag et al., SoCC 2017)

Depth limits division

[Diagram: KD-tree levels split 512 MB partitions into 256 MB, 128 MB, and 64 MB pieces]

PAGE 221

Heterogeneous Tree

(Shanbhag et al., SoCC 2017)

Contains more attributes!

PAGE 222

Building the Partitioning

(Shanbhag et al., SoCC 2017)

[Diagram sequence: per-attribute allocations start at A: 0.0, B: 0.0, C: 0.0, D: 0.0 and are updated split by split to A: 1.0, B: 0.75, C: 0.5, D: 0.5]

PAGE 228

Adaptive Partitioning

(Shanbhag et al., SoCC 2017)

Q1 = σ(D ≤ 45), Q2 = σ(A ≥ 125): refine the partitioning per the workload!

PAGE 230

Adaptive Partitioning: When?

(Shanbhag et al., SoCC 2017)

Observed stream: Q1, Q2, Q3, Q1, ...

[Diagram sequence: tree rewrites. A Swap operation exchanges split nodes; a Push Up operation moves a predicate toward the root (logical movement only); Divide and Conquer picks the best subtree on each side of Q1's predicate]

PAGE 235

Partitioning Decisions

(Shanbhag et al., SoCC 2017)

● How to form partitions? Upfront, then iteratively based on queries

● Where to place partitions? Rely on HDFS

● How to execute multi-partition operations? Rely on HDFS

PAGE 236

PAGE 237

Exploiting Workloads

● Known ahead of time

● Parameterized

● Repetitive

PAGE 238

Exploiting Workloads - OLTP

PAGE 239

Partitioning OLTP

Warehouse, District, Customer

Write [ W1, D1, C1 ]

[Diagram: S1 holds W1, D1, C1 and S2 holds W2, D2, C2; this transaction is local to S1]

PAGE 240

Partitioning OLTP

Write [ W1, D1, C2 ]

[Diagram: the transaction spans S1 (W1, D1) and S2 (C2); committing takes prepare-to-commit and commit rounds: two-phase commit]

PAGE 241

Partitioning OLTP

Per-transaction partitioning: G-Store (Das et al., SoCC 2010), L-Store (Lin et al., SIGMOD 2016)

Workload-based repartitioning: later in the tutorial

PAGE 242

Key Grouping

(Das et al., SoCC 2010)

Write [ W1, D1, C2 ]

[Diagram sequence: S1 creates a group for {W1, D1, C2} and sends a join request for C2 to S2; S2 joins and propagates C2's state; the transaction's operations run at S1; the group is then deleted and C2 is freed back to S2]

PAGE 248

Key Grouping

(Das et al., SoCC 2010)

On-demand transactional partitioning

Works best when groups are small and transactions contain multiple operations, but groups are transient

PAGE 249

Localizing Execution

(Lin et al., SIGMOD 2016)

Repartition data via localization for single-site execution

Dynamic partitioning based on transaction patterns

[Diagram: S1 holds W1, D1, C1 and S2 holds W2, D2, C2; an ownership table maps each key to its owner site]

PAGE 252

Localizing Execution

(Lin et al., SIGMOD 2016)

Write [ W1, D1, C1 ]

[Diagram: all three keys are owned by S1, so the transaction executes locally]

PAGE 253

Localizing Execution

(Lin et al., SIGMOD 2016)

Write [ W1, D1, C2 ]

[Diagram sequence: S1 sends an owner request for C2 to S2; S2 transfers ownership and responds with C2's state; the ownership table now maps C2 to S1, and the transaction's operations run entirely at S1]

PAGE 257

Localizing Execution

(Lin et al., SIGMOD 2016)

Dynamic partitioning based on per-transaction patterns

Does not consider the workload overall

PAGE 258

Partitioning Decisions

(Lin et al., SIGMOD 2016)

● How to form partitions? Transaction localization

● Where to place partitions? At the requester

● How to execute multi-partition operations? L-Store protocol

PAGE 259

Partitioning Decisions

(Das et al., SoCC 2010)

● How to form partitions? Key groups, temporarily

● Where to place partitions? Key group leader

● How to execute multi-partition operations? Key group protocol

PAGE 260

Localizing Transactions

[Diagram: partitions {A, B} and {C, D} are co-located, so W[A, B] commits locally without synchronization!]

PAGE 261

Constructing the Graph

ID Name

A Alice

B Bob

C Carol

From a workload trace

(Curino et al., VLDB 2010)

PAGE 262

Constructing the Graph

(Curino et al., VLDB 2010)

[Diagram sequence: for tuples A (Alice), B (Bob), C (Carol), the traced transactions R[A,B] and 3x W[A,C] add weighted edges; node weights capture size and load; the graph is min-cut into k = 2 partitions subject to weight imbalance, and naive cuts give either high edge cuts or imbalanced partitions]

PAGE 267

Adding Replica Support

(Curino et al., VLDB 2010)

[Diagram: for R[A,B], 3x W[A,C], tuples are replicated inside the graph so the cut can separate a read copy from the write copies]

Holistic partitioning/replication, offline and periodic

PAGE 269
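The first step of this approach, building the co-access graph from a workload trace, is straightforward to sketch; the trace below mirrors the slide's R[A,B] and 3x W[A,C]:

```python
from collections import Counter
from itertools import combinations

def build_graph(trace):
    """One node per tuple; every traced transaction adds weight 1 to the
    edge between each pair of tuples it accesses together. The resulting
    weighted graph is what the min-cut partitioner consumes."""
    nodes, edges = set(), Counter()
    for txn in trace:
        nodes.update(txn)
        for u, v in combinations(sorted(set(txn)), 2):
            edges[(u, v)] += 1
    return nodes, edges
```

Heavily co-accessed tuples end up connected by heavy edges, which the min-cut then tries not to sever.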

Access Patterns Change!

(Nicoara et al., EDBT 2015)

W[A,B], 3x W[A,C], W[A,D], W[C,D], 3x R[B,C]

PAGE 270

W(P1)=4, W(P2)=10, EC=8

Two Phases

(Nicoara et al., EDBT 2015)

Phase 1

PAGE 271

Two Phases

(Nicoara et al., EDBT 2015)

Phase 2

Rules:
- Movement doesn’t overload the target
- Move best-gain candidates
- If overloaded, must move!

PAGE 272

Logical Movement, then Migrate

Two Phases

(Nicoara et al., EDBT 2015)

Already Overloaded

Phase 1

W(P1)=4, W(P2)=10, EC=8, Bounds: (6,8)

PAGE 273

Two Phases

(Nicoara et al., EDBT 2015)

Phase 2

W(P1)=4, W(P2)=10, EC=8, Bounds: (6,8)

Gain=0

PAGE 274

Two Phases

(Nicoara et al., EDBT 2015)

Already Overloaded

Phase 1

W(P1)=5, W(P2)=9, EC=8, Bounds: (6,8)

PAGE 275

Two Phases

(Nicoara et al., EDBT 2015)

Phase 2

W(P1)=5, W(P2)=9, EC=8, Bounds: (6,8)

Gain=-1

PAGE 276

Convergence

(Nicoara et al., EDBT 2015)

W(P1)=6, W(P2)=8, EC=9, Bounds: (6,8)

PAGE 277

Stable

Partitioning Decisions

(Curino et al., VLDB 2010) (Nicoara et al., EDBT 2015)

● How to form partitions? Graph partitioning

● Where to place partitions? Based on the partitioning

● How to execute multi-partition operations? 2PC

PAGE 278

Graph Partitioning

[Diagram: a cut through tuples A, B, C, D]

Minimizes the total number of distributed transactions

Ignores per-node involvement

PAGE 279

Balance load and minimize distributed transactions

Adaptive Database Partitioning

PAGE 280

Database elasticity

Clay (Serafini et al., VLDB 2016)

P-Store (Taft et al., SIGMOD 2018)

E-Store (Taft et al., VLDB 2014)

Considering Distributed Cost

(Serafini et al., VLDB 2016)

General graph partitioning: minimize # edge cuts

such that load is balanced: load(Si) < (1 + ε) · avg load(S)

General: load(Si) = Σ_{v at Si} w(v)

Clay: load(Si) = Σ_{v at Si} w(v) + k · Σ_{v at Si, u not at Si} w(uv)   (distributed cost)

PAGE 281
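Clay's load definition can be evaluated directly from a placement. A small sketch; the weights, placement, and the value of k are illustrative:

```python
def site_load(site, placement, vertex_weight, edge_weight, k):
    """Clay's load(Si): the tuple (vertex) weight hosted at the site plus
    k times the weight of every co-access edge crossing to another site.
    A cross edge charges both of the sites it touches."""
    load = sum(w for v, w in vertex_weight.items() if placement[v] == site)
    for (u, v), w in edge_weight.items():
        if (placement[u] == site) != (placement[v] == site):
            load += k * w
    return load
```

The k term is what distinguishes Clay from plain balanced min-cut: a site is "loaded" not just by its tuples but by the distributed transactions it participates in.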

Repartitioning Cost

(Serafini et al., VLDB 2016)

General graph partitioning: minimize # edge cuts

Clay: minimize # edge cuts and the cost of repartitioning (# of vertices mapped to new partitions)

PAGE 282

Clumping

(Serafini et al., VLDB 2016)

[Diagram sequence: tuples A to G sit on sites S1 to S3 under low/medium/high load; a hot tuple G seeds a clump; migrating the clump alone would increase cost, so the clump is expanded to frequently co-accessed neighbours and the migration is re-evaluated, repeating until a move helps]

PAGE 292

Clumping: Termination

(Serafini et al., VLDB 2016)

Expands clumps to frequently accessed neighbours

Considers moving clumps to lightly loaded sites

Considers both re-partitioning and load costs

PAGE 293

Elasticity

(Taft et al., VLDB 2015)

[Diagram sequence: tuples A to F on sites S1 to S3 under low/medium/high load; a new site S4 is added and hot partitions migrate to it; when load drops, a site is drained and removed]

Repartition to elastically add or remove nodes

PAGE 299

Elasticity Decisions

(Taft et al., VLDB 2015)

When the average load:

increases: add nodes

decreases: remove nodes

PAGE 300

Two Tier Data Placement

(Taft et al., VLDB 2015)

Identify hot data

Evenly distribute hot data

Distribute cold data over remaining capacity
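A minimal sketch of the two-tier placement above: hot tuples are spread evenly first, then cold data fills the remaining capacity. The greedy least-loaded fill below is an illustrative simplification of E-Store's planner.

```python
# Two-tier placement sketch (illustrative; not E-Store's actual planner).
# Tier 1: distribute hot tuples one-by-one, each to the least-loaded server.
# Tier 2: pack blocks of cold tuples into whatever capacity remains.

def two_tier_place(hot, cold_blocks, servers, load):
    """hot: {tuple: access_weight}; cold_blocks: [(block, weight)];
    load: current load per server. Returns {server: [items]}."""
    plan = {s: [] for s in servers}
    # Tier 1: hottest first, always onto the currently least-loaded server.
    for t, wt in sorted(hot.items(), key=lambda kv: -kv[1]):
        s = min(servers, key=lambda s: load[s])
        plan[s].append(t)
        load[s] += wt
    # Tier 2: cold blocks fill the remaining capacity the same greedy way.
    for block, wt in cold_blocks:
        s = min(servers, key=lambda s: load[s])
        plan[s].append(block)
        load[s] += wt
    return plan

plan = two_tier_place(
    hot={"G": 10, "A": 8},
    cold_blocks=[("cold-1", 2), ("cold-2", 2)],
    servers=["S1", "S2"],
    load={"S1": 0, "S2": 0},
)
print(plan)   # {'S1': ['G', 'cold-2'], 'S2': ['A', 'cold-1']}
```

Handling hot tuples individually is what lets the planner balance skewed load; cold data is cheap enough to move in coarse blocks.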

PAGE 301

Identifying Hot Data

(Taft et al., VLDB 2015)

Monitor partition-level access frequency

If a partition becomes hot, enable tuple-level monitoring for it

Reacts to changes in load
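The two-level monitoring above can be sketched as: count accesses per partition cheaply, and only switch a partition to per-tuple counting once it crosses a hotness threshold. The threshold and counter layout below are assumptions, not E-Store's actual implementation.

```python
# Two-level access monitoring sketch (illustrative). Partition-level counters
# are cheap; tuple-level counters are only enabled for partitions whose access
# count crosses HOT_THRESHOLD.

from collections import Counter

HOT_THRESHOLD = 3  # hypothetical cut-off

class AccessMonitor:
    def __init__(self):
        self.partition_hits = Counter()
        self.tuple_hits = {}   # partition -> Counter, only for hot partitions

    def record(self, partition, tup):
        self.partition_hits[partition] += 1
        if self.partition_hits[partition] >= HOT_THRESHOLD:
            # Partition is hot: start (or continue) tuple-level monitoring.
            self.tuple_hits.setdefault(partition, Counter())[tup] += 1

mon = AccessMonitor()
for t in ["x", "y", "x", "x", "x"]:
    mon.record("P1", t)
mon.record("P2", "z")

print(mon.partition_hits["P1"])   # 5
print(mon.tuple_hits["P1"])       # Counter({'x': 3})
print("P2" in mon.tuple_hits)     # False (still cold)
```

Only the hot partition pays for fine-grained counters, which is what keeps monitoring overhead low while load changes.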

PAGE 302

Reactive Elasticity

(Taft et al., VLDB 2015)

[Figure: load vs. capacity over time; capacity is stepped up reactively after load exceeds it]

PAGE 303


Ideal Elasticity

(Taft et al., SIGMOD 2018)

[Figure: load vs. capacity over time]

predict the function

PAGE 310

Periodic Workloads

(Taft et al., SIGMOD 2018)

Daily load variations

Seasonal load spikes

PAGE 311

How to Predict Load

(Taft et al., SIGMOD 2018)

load(t) = avg_load(t − i · period) + change_in_load(t − j)

Load = Periodicity + Trend

SPAR: Sparse Periodic Auto-Regression
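A SPAR-style prediction combines samples from past periods (periodicity) with recent deltas (trend). The sketch below uses simple averages; the number of periods, trend window, and coefficients are illustrative choices, not P-Store's fitted values.

```python
# Sparse Periodic Auto-Regression (SPAR) sketch, in the spirit of the formula
# above: load(t) = average of past periodic samples + recent change in load.
# n_periods and trend_window are hypothetical parameters.

def spar_predict(history, t, period, n_periods=2, trend_window=2):
    """Predict the load at time t from samples at t - i*period (periodicity)
    plus the average recent delta against the previous period (trend)."""
    periodic = sum(history[t - i * period] for i in range(1, n_periods + 1))
    periodic /= n_periods
    trend = sum(history[t - j] - history[t - j - period]
                for j in range(1, trend_window + 1)) / trend_window
    return periodic + trend

# A daily-periodic load (period 4) that is running 10 higher in the latest cycle.
history = [50, 80, 60, 40,    # day 1
           50, 80, 60, 40,    # day 2
           60, 90]            # day 3 so far, up by 10
print(spar_predict(history, t=10, period=4))   # 60 + 10 = 70.0
```

The periodic term captures the daily shape; the trend term shifts it by how much the current cycle deviates from the last one.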

PAGE 312

decide the # of nodes

Ideal Elasticity

(Taft et al., SIGMOD 2018)

[Figure: load vs. capacity over time]

PAGE 313

Number of Nodes

(Taft et al., SIGMOD 2018)

# of nodes = Predicted Load / Load per Server

Assuming the workload is evenly partitionable
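Given a load prediction, the node count above is a ceiling division; `load_per_server` is whatever sustained load a single node can handle, and the even-partitionability assumption from the slide carries over.

```python
import math

def nodes_needed(predicted_load, load_per_server):
    """# of nodes = ceil(predicted load / per-server capacity),
    assuming the workload partitions evenly across nodes."""
    return math.ceil(predicted_load / load_per_server)

print(nodes_needed(predicted_load=70, load_per_server=30))   # 3
```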

PAGE 314

(Serafini et al., VLDB 2016)

● How to form partitions? Heuristically (Clumping versus 2 Tier)

● Where to place partitions? React or predict based on load

● How to execute multi-partition operations? 2PC

Partitioning Decisions

(Taft et al., VLDB 2015) (Taft et al., SIGMOD 2018)

PAGE 315

● Adaptive Replication

● Adaptive Partitioning

● Outlook

Road Map

PAGE 316

Outlook

PAGE 317

How to make a partitioning or replication decision when access patterns change?

Adaptive Systems

Adaptively replicate and partition

PAGE 318

● How to form partitions?

● Where to place partitions?

● How to execute multi-partition operations?

Partitioning Decisions

Iterative, Temporarily, Graph partitioning, Heuristic

Sorted, Leader, At requester, Graph partitioning, Reactively, Predictively

Novel protocols, 2PC

PAGE 319

● How many replicas?

● Where to place replicas?

● How to propagate updates?

Replication Decisions Decentralized, Client workload, Cost-based, Predictive, Fault tolerance

At requester, Heuristic, Cost-based, Predictive, Dynamic

Synchronous, Quorums, Single-master, Cache

PAGE 320

● How many replicas?

● Where to place replicas?

● Where to place partitions?

Decisions Predictively

Predictively

Predictively

PAGE 321

How to make a partitioning or replication decision when access patterns change?

Adaptive & Predictive Systems

Adaptively and predictively replicate and partition

PAGE 322

Predicting the Future

How can your system predict its future workload?

Apollo: Predict future queries (Markov Model)

P-Store: Predict future load (SPAR)

PAGE 323

Predicting the Future: QB5000

When, how many, and what queries will arrive?

(Ma et al., SIGMOD 2018)

Pre-process: remove parameters, creating templates

SELECT * FROM C WHERE id = “C1”

SELECT * FROM C WHERE id = $
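The pre-processing step above can be sketched with a regular expression that replaces literal constants with a placeholder; QB5000 itself templatizes parsed queries, so this string-level version is only an approximation.

```python
import re

# Replace string and numeric literals with "$" to turn queries into templates.
# A string-level regex is an approximation of QB5000's parser-based step.
_LITERAL = re.compile(r"'[^']*'|\"[^\"]*\"|\b\d+(\.\d+)?\b")

def to_template(sql):
    return _LITERAL.sub("$", sql)

print(to_template('SELECT * FROM C WHERE id = "C1"'))
# SELECT * FROM C WHERE id = $
print(to_template("SELECT * FROM O WHERE total > 19.99 AND cid = 'C1'"))
# SELECT * FROM O WHERE total > $ AND cid = $
```

Queries that differ only in their parameters now map to the same template, which is what makes the later clustering and forecasting steps tractable.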

PAGE 324

Predicting the Future: QB5000

When, how many, and what queries will arrive?

(Ma et al., SIGMOD 2018)

Cluster: group templates by arrival rate

[Figure: templates T1, T2, T3 grouped into clusters]

PAGE 325

Predicting the Future: QB5000

When, how many, and what queries will arrive?

(Ma et al., SIGMOD 2018)

Forecast: predict each cluster's arrival rate (ensemble of RNN, LR, KR)

PAGE 326

Predicting the Future: QB5000

When, how many, and what queries will arrive?

(Ma et al., SIGMOD 2018)

Pre-process: remove parameters, creating templates

Cluster: group templates by arrival rate

Forecast: predict each cluster's arrival rate (ensemble of RNN, LR, KR)

PAGE 327

Predicting the Future

How can your system predict its future workload?

Apollo: Predict future queries (Markov Model)

P-Store: Predict future load (SPAR)

QB5000: Predict query workloads (Ensemble of RNN, LR, KR)

PAGE 328

If your system knew the future workload, how could it partition and replicate data?

Predicting the Future

How can your system predict its future workload?

PAGE 329