9
Important take-away points
Language
SQL
Declarative languages
Functional languages
Optimizations
Query plans
Indices
11
Important take-away points
Consistency constraints
Tabular integrity
Domain integrity
Atomic integrity (1st normal form)
Boyce-Codd normal form
SQL
12
Important take-away points
Consistency constraints
Tabular integrity
Domain integrity
Atomic integrity (1st normal form)
2nd, 3rd, Boyce-Codd normal form
SQL
NoSQL
Heterogeneous data
Nested data
Denormalized data
14
Important take-away points
Transactions
Atomicity
Consistency
Isolation
Durability
Atomic Consistency
Availability
Partition tolerance
Eventual Consistency
CAP
ACID
16
The stack
Storage
Encoding
Syntax
Data models
Validation
Processing
Indexing
Data stores
User interfaces
Querying
20
The stack:
Data models
Data models
Tables: Relational model
Trees: XML Infoset, XDM
Graphs: RDF
Cubes: OLAP
22
The stack:
Processing
Processing
Two-phase processing:
MapReduce
DAG-driven processing:
Tez, Spark, Flink, Ray
Elastic computing:
EC2
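The two-phase model above can be sketched in plain Python. The word-count mapper and reducer below are illustrative examples, not from the slides:

```python
from collections import defaultdict

def map_phase(records, mapper):
    # Phase 1: apply the mapper to every record, emitting (key, value) pairs.
    return [pair for record in records for pair in mapper(record)]

def shuffle(pairs):
    # Shuffle step: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    # Phase 2: aggregate each group with the reducer.
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count expressed as a mapper and a reducer.
docs = ["big data", "big systems"]
mapper = lambda line: [(word, 1) for word in line.split()]
reducer = lambda key, values: sum(values)

counts = reduce_phase(shuffle(map_phase(docs, mapper)), reducer)
print(counts)  # {'big': 2, 'data': 1, 'systems': 1}
```

Real frameworks distribute the map and reduce tasks across machines; the structure of the computation is the same.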
23
The stack:
Indexing
Indexing
Key-value stores
Hash indices
B-Trees
Geographical indices
Spatial indices
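A minimal sketch of the difference between the first two index types: a hash table gives constant-time point lookups but no ordered scans, while a sorted structure (standing in for a B-tree here, via Python's `bisect`) also supports range queries. The records are made up:

```python
import bisect

# Made-up records keyed by an integer id.
records = {17: "alice", 3: "bob", 42: "carol", 8: "dave"}

# Hash index: a Python dict is itself a hash table -> O(1) point lookups,
# but the keys are unordered, so no efficient range scans.
hash_index = dict(records)
print(hash_index[42])  # carol

# B-tree-like index: keys kept sorted support point AND range lookups.
sorted_keys = sorted(records)

def range_scan(lo, hi):
    # Return the values for all keys in [lo, hi].
    i = bisect.bisect_left(sorted_keys, lo)
    j = bisect.bisect_right(sorted_keys, hi)
    return [records[k] for k in sorted_keys[i:j]]

print(range_scan(5, 20))  # ['dave', 'alice']  (keys 8 and 17)
```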
24
The stack:
Data stores
Data stores
RDBMS
(Oracle/IBM/Microsoft)
MongoDB
CouchBase
ElasticSearch
Hive
HBase
MarkLogic
Cassandra
...
31
File storage
(Figure: placeholder file names)
Files organized in a hierarchy
33
File Metadata
$ ls -l
total 48
drwxr-xr-x   5 gfourny  staff   170 Jul 29 08:11 2009
drwxr-xr-x  16 gfourny  staff   544 Aug 19 14:02 Exercises
drwxr-xr-x  11 gfourny  staff   374 Aug 19 14:02 Learning Objectives
drwxr-xr-x  18 gfourny  staff   612 Aug 19 14:52 Lectures
-rw-r--r--   1 gfourny  staff  1788 Aug 19 14:04 README.md
Fixed "schema"
37
Local storage
Local Machine
LAN (NAS)
WAN
LAN = local-area network
NAS = network-attached storage
WAN = wide-area network
40
Scaling Issues
Aleksandr Elesin / 123RF Stock Photo
1,000 files
1,000,000 files
1,000,000,000 files
43
Better performance: Explicit Block Storage
1 2 3
4 5 6
7 8
Application
(Control over locality of blocks)
44
So how do we make this scale?
(Figure: the same placeholder files, without a hierarchy)
1. We throw away the hierarchy!
51
"Black-box" objects
Flat and global key-value model
Flexible metadata
... and we get Object Storage
52
"Black-box" objects
Flat and global key-value model
Flexible metadata
Commodity hardware
... and we get Object Storage
63
Approach 3: be smart
Viktorija Reuta / 123RF Stock Photo
“You can have a second
computer once you’ve shown you know how to use the first one.”
Paul Barham
84
More about SLA
SLA Outage
99% 4 days/year
99.9% 9 hours/year
99.99% 53 minutes/year
99.999% 6 minutes/year
99.9999% 32 seconds/year
99.99999% 4 seconds/year
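The outage figures above follow from simple arithmetic; a sketch (assuming a 365-day year, which explains the rounding in the table):

```python
# Maximum yearly downtime implied by an availability SLA of p percent:
# (1 - p/100) of a 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 3600

def max_outage_seconds(sla_percent):
    return (1 - sla_percent / 100) * SECONDS_PER_YEAR

print(round(max_outage_seconds(99) / 86400, 2))  # 3.65 days ("4 days/year")
print(round(max_outage_seconds(99.99) / 60, 2))  # 52.56 min ("53 minutes/year")
```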
85
More about SLA
Amazon's approach:
Response time < 10 ms
in 99.9% of the cases
(rather than average or median)
98
Resources
Resource (URI)
http://www.ethz.ch/
http://www.mywebsite.ch/api/collection/foo/object/bar
urn:isbn:0123456789
mailto:[email protected]
99
Resources
Resource (URL)
http://www.ethz.ch/
http://www.mywebsite.ch/api/collection/foo/object/bar
102
Resources
Resource (URI)
http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head
scheme
103
Resources
Resource (URI)
http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head
authority
104
Resources
Resource (URI)
http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head
path
105
Resources
Resource (URI)
http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head
query
106
Resources
Resource (URI)
http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head
fragment
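The five components highlighted above can be pulled apart with Python's standard library (`urlsplit` calls the authority `netloc`):

```python
from urllib.parse import urlsplit

uri = "http://www.mywebsite.ch/api/collection/foo/object/bar?id=foobar#head"
parts = urlsplit(uri)
print(parts.scheme)    # http
print(parts.netloc)    # www.mywebsite.ch  (the authority)
print(parts.path)      # /api/collection/foo/object/bar
print(parts.query)     # id=foobar
print(parts.fragment)  # head
```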
109
Example
GET /index.html HTTP/1.1
Host: www.example.com
HTTP/1.1 200 OK
Date: Tue, 25 Sep 2018 09:48:34 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 138
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
ETag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Connection: close

<html>
  <head>
    <title>An Example Page</title>
  </head>
  <body>
    Hello World, this is a very simple HTML document.
  </body>
</html>

Source: Wikipedia
117
Example
GET /my-image.jpg HTTP/1.1
Host: bucket.s3.amazonaws.com
Date: Tue, 26 Sep 2017 10:55:00 GMT
Authorization: authorization string
118
Folders: is S3 a file system?
/food/fruits/orange
/food/fruits/strawberry
/food/vegetables/tomato
/food/vegetables/turnip
/food/vegetables/lettuce
Physical
(Object keys)
Logical
(Browsing)
food
  fruits
    orange
    strawberry
  vegetables
    tomato
    turnip
    lettuce
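The logical view can be reconstructed from the flat physical keys by grouping on a delimiter, which is essentially what S3's ListObjects API does with its Delimiter parameter. A self-contained simulation using the slide's keys:

```python
# The slide's flat object keys (the "physical" view).
keys = [
    "/food/fruits/orange",
    "/food/fruits/strawberry",
    "/food/vegetables/tomato",
    "/food/vegetables/turnip",
    "/food/vegetables/lettuce",
]

def list_prefix(keys, prefix, delimiter="/"):
    """Group flat keys under `prefix`: keys containing a further delimiter
    collapse into "common prefixes" (pseudo-folders); the rest are objects."""
    folders, objects = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            folders.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(folders), objects

print(list_prefix(keys, "/food/"))
# (['/food/fruits/', '/food/vegetables/'], [])
print(list_prefix(keys, "/food/fruits/"))
# ([], ['/food/fruits/orange', '/food/fruits/strawberry'])
```

So S3 is not a file system: the hierarchy exists only in how clients interpret the key strings.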
126
Storage Class
Standard: high availability
Standard – Infrequent Access: less availability, cheaper storage, cost for retrieving
Amazon Glacier: low-cost storage, hours to GET
128
Overall comparison Azure vs. S3
             S3                 Azure
Object ID    Bucket + Object    Account + Container + Blob
Object API   Blackbox           Block / Append / Page
Limit        5 TB               4.78 TB (block), 195 GB (append), 8 TB (page)
129
Azure Architecture: Storage Stamp
Front-Ends
Partition Layer
Stream Layer
Virtual IP address
Account name
Partition name
Object name
132
Storage Replication
Front-Ends
Partition Layer
Stream Layer
Intra-stamp replication (synchronous)
133
Storage Replication
Front-Ends
Partition Layer
Stream Layer
Front-Ends
Partition Layer
Stream Layer
Inter-stamp replication (asynchronous)
134
Location Services
Front-Ends
Partition Layer
Stream Layer
Location Services
DNS
Virtual IP (primary) Virtual IP
Account name
mapped to one
Virtual IP
(primary stamp)
Front-Ends
Partition Layer
Stream Layer
136
Location Services
Front-Ends
Partition Layer
Stream Layer
DNS
Front-Ends
Partition Layer
Stream Layer
Account name
Primary stamp's VIP
Partition + Object
152
Key-value stores: why do we simplify?
Simplicity vs. more features
Eventual consistency vs. consistency
Performance vs. overhead
153
Key-value stores: why do we simplify?
Simplicity vs. more features
Eventual consistency vs. consistency
Performance vs. overhead
Scalability vs. monolithic design
165
IDs are organized in a logical ring, mod 2^n
(Ring positions: 0000000000, 0100000000, 1000000000, 1100000000, ..., 1111111111)
2^n nodes
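A sketch of key-to-node assignment on such a ring, with made-up node positions and n = 10:

```python
BITS = 10            # n = 10: ring positions live in the space mod 2^n
RING = 2 ** BITS

# Made-up node positions on the ring.
nodes = sorted([0, 256, 512, 768])

def responsible_node(key_hash):
    """A key belongs to the first node at or after its ring position,
    wrapping around past the largest node."""
    pos = key_hash % RING
    for node in nodes:
        if node >= pos:
            return node
    return nodes[0]  # wrap-around

print(responsible_node(300))   # 512
print(responsible_node(1000))  # 0 (wrap-around)
```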
174
Adding and removing nodes
(Ring: 000... to 111..., mod 2^n)
Needs to be transferred
These nodes are not affected
179
Adding and removing nodes
(Ring: 000... to 111..., mod 2^n)
Needs to be transferred
But what if the node failed?
186
Dynamo: Preference lists
Key Nodes
a n1, n2
b n1, n2
c n2, n3
d n2, n3
e n2, n3
f n2, n3
g n2, n3
h n2, n3
i n3, n4
j n3, n4
k n3, n4
l n3, n4
Every node
knows about the ranges of
all other nodes
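A preference list can be derived from the ring: a key is stored on the node responsible for it plus the next node(s) clockwise. A sketch with made-up positions for n1..n4 and two replicas per key:

```python
RING = 16                           # tiny ring for the example
positions = sorted([2, 6, 11, 15])  # made-up ring positions of n1..n4
names = {2: "n1", 6: "n2", 11: "n3", 15: "n4"}

def preference_list(key_hash, n_replicas=2):
    """The responsible node plus the next node(s) clockwise on the ring."""
    pos = key_hash % RING
    start = next((i for i, p in enumerate(positions) if p >= pos), 0)
    return [names[positions[(start + k) % len(positions)]]
            for k in range(n_replicas)]

print(preference_list(1))   # ['n1', 'n2']
print(preference_list(7))   # ['n3', 'n4']
print(preference_list(12))  # ['n4', 'n1'] (wraps around)
```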
190
Dynamo: Preference lists
Key    Nodes
1, 2, 3, ...
Writes confirmed (synchronously) from at least W nodes
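The W-confirmation rule can be stated as a one-liner; the replica acknowledgements here are made-up booleans:

```python
def write_confirmed(acks, w):
    """A write is acknowledged to the client once at least `w` of the
    replicas on the preference list have confirmed it."""
    return sum(acks) >= w

W = 2
print(write_confirmed([True, True, False], W))   # True: 2 of 3 acks
print(write_confirmed([True, False, False], W))  # False: only 1 ack so far
```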
196
Distributed Hash Tables: Pros
Highly scalable
Robust against failure
Self organizing
Credits: Thomas Hofmann
197
Distributed Hash Tables: Cons
Lookup, no search
Data integrity
Security issues
Credits: Thomas Hofmann
201
So...
How can we
• artificially increase the number of nodes
and
• bring some elasticity to account for
performance differences?
213
Vector clocks
put by Node A
put by Node A
put by Node B / put by Node C
([A, 1])
([A, 2])
([A, 2], [B, 1]) ([A, 2], [C, 1])
214
Vector clocks
put by Node A
put by Node A
put by Node B / put by Node C
reconcile and put by Node A
([A, 1])
([A, 2])
([A, 2], [B, 1]) ([A, 2], [C, 1])
215
Vector clocks
put by Node A
put by Node A
put by Node B / put by Node C
reconcile and put by Node A
([A, 1])
([A, 2])
([A, 2], [B, 1]) ([A, 2], [C, 1])
([A, 3], [B, 1], [C, 1])
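The clock bookkeeping on these slides, as a sketch: each put increments the writing node's entry, and a reconcile first merges the two clocks entry-wise with max before the writer increments its own entry:

```python
def put(clock, node):
    """A put by `node` increments that node's entry in the clock."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def merge(c1, c2):
    """Reconciliation merges two clocks entry-wise with max."""
    return {n: max(c1.get(n, 0), c2.get(n, 0)) for n in set(c1) | set(c2)}

v = put({}, "A")                    # ([A, 1])
v = put(v, "A")                     # ([A, 2])
b = put(v, "B")                     # ([A, 2], [B, 1])  concurrent branch
c = put(v, "C")                     # ([A, 2], [C, 1])  concurrent branch
reconciled = put(merge(b, c), "A")
print(sorted(reconciled.items()))   # [('A', 3), ('B', 1), ('C', 1)]
```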
216
Context
put by Node A
put by Node A
put by Node B / put by Node C
reconcile and put by Node A
([A, 1])
([A, 2])
([A, 2], [B, 1]) ([A, 2], [C, 1])
([A, 3], [B, 1], [C, 1])
context
223
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
replication
replication
225
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
gathering all versions
gathering all versions
get(key1)
226
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
get(key1)
227
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
get(key1)
228
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
Returning A, [ (n1, 1) ]
context
get(key1)
229
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
incoming request
put(key1, [ (n1, 1) ], B)
B, [ (n1, 2) ]
new context (+1)
231
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
replication
replication
232
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
get(key1)
233
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
gathering all versions
gathering all versions
A, [ (n1, 1) ]
B, [ (n1, 2) ]
get(key1)
234
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
gathering all versions
gathering all versions
A, [ (n1, 1) ]
B, [ (n1, 2) ]
Returning
B, [ (n1, 2) ]
maximum element
get(key1)
235
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
incoming request
put(key1, [ (n1, 2) ], C)
C, [ (n1, 3) ]
new context (+1)
236
Vector clocks
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
network partition
237
Vector clocks
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
network partition
gathering all versions
interim
coordinator
get(key1)
238
Vector clocks
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
network partition
Returning
B, [ (n1, 2) ]
gathering all versions
interim
coordinator
get(key1)
239
Vector clocks
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
network partition
incoming request
put(key1, [ (n1, 2) ], D)
D, [ (n1, 2), (n2, 1) ]
new context
interim
coordinator
240
Vector clocks
interim coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
network partition
D, [ (n1, 2), (n2, 1) ]
241
Vector clocks
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
D, [ (n1, 2), (n2, 1) ]
242
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
D, [ (n1, 2), (n2, 1) ]
243
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
D, [ (n1, 2), (n2, 1) ]
get(key1)
244
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
D, [ (n1, 2), (n2, 1) ]
gathering all versions
gathering all versions
A, [ (n1, 1) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
D, [ (n1, 2), (n2, 1) ]
get(key1)
247
Directed Acyclic Graph
A, [ (n1, 1) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]    D, [ (n1, 2), (n2, 1) ]    (not comparable)
maximal elements
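Why C and D are not comparable: one vector clock dominates another only if it is at least as large in every entry. A sketch that also recovers the maximal elements of the version DAG:

```python
def dominates(c1, c2):
    """c1 >= c2 in every entry (missing entries count as 0)."""
    return all(c1.get(k, 0) >= c2.get(k, 0) for k in set(c1) | set(c2))

C = {"n1": 3}
D = {"n1": 2, "n2": 1}
print(dominates(C, D), dominates(D, C))  # False False -> concurrent versions

clocks = {"A": {"n1": 1}, "B": {"n1": 2}, "C": C, "D": D}
# A version is maximal if no other version's clock dominates its clock.
maximal = [name for name, clock in clocks.items()
           if not any(dominates(other, clock) and other != clock
                      for other_name, other in clocks.items()
                      if other_name != name)]
print(sorted(maximal))  # ['C', 'D'] -- the conflicting siblings to reconcile
```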
248
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
D, [ (n1, 2), (n2, 1) ]
get(key1)
gathering all versions
gathering all versions
A, [ (n1, 1) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
D, [ (n1, 2), (n2, 1) ]
249
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
D, [ (n1, 2), (n2, 1) ]
get(key1)
gathering all versions
gathering all versions
A, [ (n1, 1) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
D, [ (n1, 2), (n2, 1) ]
return
C, D, [ (n1, 3), (n2, 1) ]
250
Vector clocks
coordinator
n2
n3
A, [ (n1, 1) ]
A, [ (n1, 1) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ]
D, [ (n1, 2), (n2, 1) ]
251
Vector clocks
A, [ (n1, 1) ]
B, [ (n1, 2) ]
D, [ (n1, 2), (n2, 1) ]
C, [ (n1, 3) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
D, [ (n1, 2), (n2, 1) ]
C, [ (n1, 3) ]
A, [ (n1, 1) ]
B, [ (n1, 2) ]
D, [ (n1, 2), (n2, 1) ]
C, [ (n1, 3) ]
synchronization
252
Vector clocks
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
Cleanup
A, [ (n1, 1) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ] D, [ (n1, 2), (n2, 1) ]
253
Vector clocks
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
incoming request
put(key1,
[ (n1, 3), (n2, 1) ],
E)
(Client semantically solved
the conflict between C and D)
254
Vector clocks
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
incoming request
put(key1,
[ (n1, 3), (n2, 1) ],
E)
E, [ (n1, 4), (n2, 1) ]
255
Vector clocks
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
E, [ (n1, 4), (n2, 1) ]
E, [ (n1, 4), (n2, 1) ]
E, [ (n1, 4), (n2, 1) ]
256
Vector clocks
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
D, [ (n1, 3), (n2, 1) ]
C, [ (n1, 3) ]
E, [ (n1, 4), (n2, 1) ]
E, [ (n1, 4), (n2, 1) ]
E, [ (n1, 4), (n2, 1) ]
C, [ (n1, 3) ] D, [ (n1, 2), (n2, 1) ]
E, [ (n1, 4), (n2, 1) ]
257
Version history (all times)
A, [ (n1, 1) ]
B, [ (n1, 2) ]
C, [ (n1, 3) ] D, [ (n1, 2), (n2, 1) ]
E, [ (n1, 4), (n2, 1) ]    (absolute maximum)
259
Merkle Trees
What if hinted replicas get lost?
What if the complexity in replica deltas increases?
260
Merkle Trees
What if hinted replicas get lost?
What if the complexity in replica deltas increases?
Anti-entropy protocol
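A sketch of the Merkle-tree idea behind such an anti-entropy protocol: replicas compare root hashes and descend only into differing subtrees, so a single diverging key is located without exchanging all data. The key-value payloads below are made up:

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def merkle_levels(leaves):
    """Build the tree bottom-up (leaf count assumed to be a power of two);
    returns the list of levels, leaf hashes first, root level last."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

replica1 = [b"a=1", b"b=2", b"c=3", b"d=4"]
replica2 = [b"a=1", b"b=2", b"c=99", b"d=4"]   # one diverging entry
t1, t2 = merkle_levels(replica1), merkle_levels(replica2)

print(t1[-1] != t2[-1])      # True: roots differ, so the replicas diverged
print(t1[1][0] == t2[1][0])  # True: left subtree matches -> skip it entirely
diverging = [i for i, (x, y) in enumerate(zip(t1[0], t2[0])) if x != y]
print(diverging)             # [2]: only the entry for "c" must be synced
```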