Lecture 6B
CS4411: Databases II
• CAP Theorem • NoSQL Databases: Overview
Agenda
CAP Theorem
• Atomic
– A transaction cannot be subdivided: all or nothing
• Consistent
– Integrity constraints hold both before and after the transaction
– A transaction transforms a database from one consistent state to another consistent state
• Isolated
– Transactions execute independently of one another
– Database changes are not revealed to users until after the transaction has completed
• Durable
– Database changes are permanent and must not be lost
Transaction ACID Properties
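To make the "all or nothing" property concrete, here is a minimal sketch using Python's built-in sqlite3 module. The account table, its CHECK constraint, and the transfer helper are illustrative assumptions, not part of the lecture material.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a failing debit has something to undo.
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint was violated; nothing was changed

def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: balances may never become negative (integrity constraint).
    conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
    conn.execute("INSERT INTO account VALUES ('a', 100), ('b', 0)")
    conn.commit()
    first = transfer(conn, "a", "b", 60)   # succeeds
    second = transfer(conn, "a", "b", 60)  # debit would leave 'a' negative: rolled back
    return first, second, dict(conn.execute("SELECT id, balance FROM account"))
```

In the second transfer the credit to 'b' is applied and then undone when the debit fails, so the database moves only between consistent states.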
The limitations of distributed databases can be described by the so-called CAP theorem:
• Consistency: every node always sees the same data at any given instant (i.e., strict consistency)
• Availability: the system continues to operate even if nodes in a cluster crash, or some hardware or software parts are down due to upgrades
• Partition Tolerance: the system continues to operate in the presence of network partitions
CAP Properties of distributed databases
• Consistency in databases (ACID):
– The database has a set of integrity constraints
– A consistent database state is one where all integrity constraints are satisfied
– Each transaction run individually on a consistent database state must leave the database in a consistent state
• Consistency in distributed systems with replication:
– Strong consistency: a schedule with read and write operations on an object should give results and a final state equivalent to some schedule on a single copy of the object, with the order of operations from a single site preserved
– Weak consistency (several forms)
What is Consistency?
• When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
• For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
• Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
– Soft state: copies of a data item may be inconsistent
– Eventually consistent: copies become consistent at some later time if there are no more updates to that data item
Eventual Consistency
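The convergence behaviour described above can be sketched with a toy model. The Replica class, the last-writer-wins rule based on timestamps, and the converge helper are assumptions chosen for illustration; real systems use more elaborate anti-entropy and conflict resolution schemes.

```python
class Replica:
    """One copy of the data; each key maps to (timestamp, value)."""

    def __init__(self):
        self.store = {}

    def write(self, key, value, ts):
        # Last-writer-wins: keep the entry with the larger timestamp.
        current = self.store.get(key)
        if current is None or ts > current[0]:
            self.store[key] = (ts, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]

    def sync_from(self, other):
        # Pull every entry from another replica and merge it.
        for key, (ts, value) in other.store.items():
            self.write(key, value, ts)

def converge(replicas):
    """One round in which every replica pulls from every other replica."""
    for a in replicas:
        for b in replicas:
            if a is not b:
                a.sync_from(b)

def demo():
    r = [Replica(), Replica(), Replica()]
    r[0].write("x", "old", ts=1)
    r[2].write("x", "new", ts=2)
    before = [rep.read("x") for rep in r]   # copies disagree (soft state)
    converge(r)                             # no further updates arrive
    after = [rep.read("x") for rep in r]
    return before, after
```

Before the exchange a reader may see "old", "new", or nothing at all depending on the replica it contacts; once updates stop and the replicas have exchanged state, all copies agree.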
• Traditionally: availability of a centralized server
• For distributed systems: availability of the system to process requests
• For a large system, at almost any point in time there is a good chance that
– a node is down, or even
– the network is partitioned
• Distributed consensus algorithms will block during partitions to ensure consistency
• Many applications require continued operation even during a network partition, even at the cost of consistency
Availability
Also known as Brewer’s Theorem, after Prof. Eric Brewer of the University of California, Berkeley, who presented it in 2000.
CAP Theorem
“Of three properties of a shared data system: data consistency, system availability and tolerance to network partitions, only two can be achieved at any given moment.”
There are many levels of consistency. Strict Consistency – RDBMS. Tunable Consistency – Cassandra. Eventual Consistency – Amazon Dynamo.
• Traditional databases choose consistency
• Most Web applications choose availability
– Except for specific parts such as order processing
[Figure: App with nodes A and B, both holding the same data. Consistent and available, no partition.]
CAP Theorem
[Figure: App with nodes A (data) and B (old data). Available and partitioned, but not consistent: we get back old data.]
CAP Theorem
[Figure: App with nodes A (new data) and B (waiting for new data). Consistent and partitioned, but not available: waiting…]
CAP Theorem
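The three situations (no partition, available but stale, consistent but unavailable) can be mimicked with a deliberately simplified two-node model. The TwoNodeCluster class and its "CP"/"AP" mode switch are hypothetical names introduced only for this sketch.

```python
class TwoNodeCluster:
    """Toy model: writes go to node A, reads come from node B."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = "old"
        self.b = "old"
        self.partitioned = False

    def write_to_a(self, value):
        self.a = value
        if not self.partitioned:
            self.b = value  # replication only succeeds when A can reach B

    def read_from_b(self):
        if self.partitioned and self.mode == "CP":
            # Consistency preferred: B refuses to answer rather than risk stale data.
            raise TimeoutError("B cannot confirm it has the latest data")
        # Availability preferred (or no partition): B answers with what it has.
        return self.b
```

With mode "AP" a read during a partition returns the old value; with mode "CP" the same read fails instead. Without a partition both modes return the new value.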
Almost the opposite of ACID.
• Basically Available: nodes in a distributed environment can go down, but the whole system shouldn’t be affected.
• Soft State (scalable): the state of the system and its data changes over time.
• Eventual Consistency: given enough time, data will be consistent across the distributed system.
BASE, an ACID Alternative
BASE differs from ACID: it trades consistency for availability.
Eventual consistency does not make safety guarantees, i.e., an eventually consistent system can return any value before it converges.
ACID:
• Strong consistency
• Less availability
• Pessimistic concurrency
• Complex
BASE:
• Availability is the most important thing
• Willing to sacrifice for this (CAP)
• Weaker consistency (eventual)
• Best effort
• Simple and fast
• Optimistic
BASE vs ACID
• When companies such as Google and Amazon were designing large-scale databases, 24/7 availability was a key requirement
– A few minutes of downtime means lost revenue
• When horizontally scaling databases to 1000s of machines, the likelihood of a node or network failure increases tremendously
• Therefore, in order to have strong guarantees on Availability and Partition Tolerance, they had to sacrifice “strict” Consistency (implied by the CAP theorem)
Large-Scale Databases
Maintaining consistency means balancing the strictness of consistency against availability/scalability
• Good-enough consistency depends on your application
Strict Consistency
Generally hard to implement, and is inefficient
Loose Consistency
Easier to implement, and is efficient
Trading-Off Consistency
Examples:
• Acceptable in ATM withdrawals and cellphone calls
• Decouple updates to seller and buyer in a transaction
Abadi’s classification system: PACELC
• The CAP theorem only matters when there is a partition
• Even if partitions are rare, applications may trade off Consistency for Latency
– E.g., PNUTS allows inconsistent reads to reduce latency (critical for many applications)
– But its update protocol (via master) ensures consistency over availability
• Thus Abadi asks two questions:
– If there is Partitioning, how does the system trade off Availability for Consistency?
– Else (no partitioning), how does the system trade off Latency for Consistency?
PACELC: Availability vs Latency
• Google Megastore: PC/EC
• Yahoo PNUTS: PC/EL
• Amazon Dynamo (by default): PA/EL
NoSQL Databases: Overview
• From the CAP Theorem:
– CA, CP, AP databases
• Data model
– What data is being stored?
• CRUD interface
– API for Create, Read, Update, Delete
– Sometimes with a preceding S for Search
• Transaction consistency guarantees
• Replication and sharding model
– What’s automated and what’s manual?
NoSQL Database Features
More than 150 different NoSQL databases!
NoSQL Databases
Column-Family Store Key/Value Store
Document Store Graph Databases
NoSQL: we focus on 4 Data Models
NoSQL Data Models
Key-Value store
• Eventually-consistent Key-Value Stores
• Hierarchical Key-Value Stores
• Key-Value Stores in RAM
• Key-Value Stores on Disk
• Ordered Key-Value Stores
• Essentially, big distributed hash maps
• Origin attributed to Dynamo, Amazon’s DB for world-scale catalog/cart collections
– But Berkeley DB has been here for >20 years
• Data model: store pairs ⟨key, opaque-value⟩
– Opaque means that the DB does not associate any structure/semantics with the value; it is oblivious to values
– This may mean more work for the user: retrieving a large value and parsing it to extract an item of interest
– Keys are unique
• Sharding via partitioning of the key space
– Hashing, gossip and remapping protocols for load balancing and fault tolerance
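One common way to partition a key space by hashing is a consistent-hash ring. The sketch below is an assumption about how such sharding could look, not the scheme of any particular product; HashRing, node_for, and the number of virtual nodes are hypothetical choices.

```python
import bisect
import hashlib

def _hash(text):
    # Stable hash (Python's built-in hash() is randomized per process).
    return int(hashlib.md5(text.encode()).hexdigest(), 16)

class HashRing:
    """Maps keys to nodes; each node owns several points on a ring."""

    def __init__(self, nodes, vnodes=64):
        # Virtual nodes spread each server around the ring for load balancing.
        self._ring = sorted((_hash(f"{node}#{i}"), node)
                            for node in nodes for i in range(vnodes))
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # The first ring point clockwise from the key's hash owns the key.
        index = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]
```

The useful property for fault tolerance is that removing a node only remaps the keys that node owned; keys on the surviving nodes stay where they are.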
• Redis
• Amazon’s DynamoDB
– Originally designed for Amazon’s workload at peaks
– Offered as part of Amazon’s Web services
• Riak
– Focuses on high availability, BASE
– “As long as your Riak client can reach one Riak server, it should be able to write data.”
• FoundationDB
– Focus on transactions, ACID
• Berkeley DB
– First release 1994, by Berkeley, acquired by Oracle
– ACID, replication
Example: Key-Value databases
• Redis is the most popular key-value database
Redis
• Basically a data structure store for strings, numbers, hashes, lists, sets
• Simplistic "transaction" management
– Queuing of commands as blocks, really
– Among ACID, only Isolation is guaranteed: a block of commands is executed sequentially; no transaction interleaving; no rollback on errors
• In-memory store
– Persistence by periodic saves to disk
• Comes with
– A command-line API
– Clients for different programming languages: Perl, PHP, Ruby, Tcl, C, C++, C#, Java, R, …
Commands and the resulting state (key → value):
(simple value)  set x 10                       x = 10
(hash table)    hset h y 5                     h: y→5
                hset h1 name two
                hset h1 value 2                h1: name→two, value→2
                hmset p:22 name Alma age 25    p:22: name→Alma, age→25
(set)           sadd s 20
                sadd s Alma                    s: {20, Alma}
(list)          rpush l a
                rpush l b
                lpush l c                      l: (c, a, b)

Queries and their results (run in this order):
(simple value)  get x          >> 10
(hash table)    hget h y       >> 5
                hkeys p:22     >> name, age
(set)           smembers s     >> 20, Alma
                scard s        >> 2
(list)          llen l         >> 3
                lrange l 1 2   >> a, b
                lindex l 2     >> b
                lpop l         >> c
                rpop l         >> b
Example of Redis Commands
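To see what these commands do to the underlying data structures without a running server, the following sketch mirrors them with plain Python containers. It is an emulation for study purposes only, not the Redis client API; build_store and run_queries are hypothetical helper names, and values are kept as strings on the assumption that this matches what Redis returns for these cases.

```python
from collections import deque

def build_store():
    store = {}
    store["x"] = "10"                                # set x 10
    store["h"] = {"y": "5"}                          # hset h y 5
    store["h1"] = {"name": "two", "value": "2"}      # hset h1 name two / hset h1 value 2
    store["p:22"] = {"name": "Alma", "age": "25"}    # hmset p:22 name Alma age 25
    store["s"] = {"20", "Alma"}                      # sadd s 20 / sadd s Alma
    items = deque()
    items.append("a")                                # rpush l a
    items.append("b")                                # rpush l b
    items.appendleft("c")                            # lpush l c
    store["l"] = items
    return store

def run_queries(store):
    items = store["l"]
    # Dict entries are evaluated top to bottom, so the pops happen last.
    return {
        "get x": store["x"],
        "hget h y": store["h"]["y"],
        "hkeys p:22": list(store["p:22"]),
        "smembers s": sorted(store["s"]),
        "scard s": len(store["s"]),
        "llen l": len(items),
        "lrange l 1 2": list(items)[1:3],            # Redis ranges include both ends
        "lindex l 2": items[2],
        "lpop l": items.popleft(),
        "rpop l": items.pop(),
    }
```

Calling run_queries(build_store()) reproduces the results listed above; after the two pops only "a" remains in the list.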
• A value:
– Any binary string of up to 512 MB (e.g., a JPEG image)
– A list with up to 2^32 - 1 elements (more than 4 billion elements)
• Some key operations:
– Select database: select index (default index is 0)
– List all keys: keys *
– Remove all keys: flushall
– Check if a key exists: exists k
• You can configure the persistence model
– save m k means save every m seconds if at least k keys have changed
Redis: extra notes
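The "save m k" rule can be read as a simple predicate. The function below is a sketch of that reading, assuming several rules may be configured at once and any one of them may trigger a save; should_snapshot and the example rule values are illustrative, not taken from Redis source code.

```python
def should_snapshot(rules, seconds_since_last_save, changed_keys):
    """rules is a list of (m, k) pairs, one per configured 'save m k' line."""
    return any(seconds_since_last_save >= m and changed_keys >= k
               for m, k in rules)

# Example rules: save after 900 s if at least 1 key changed,
# or after 300 s if at least 10 keys changed.
EXAMPLE_RULES = [(900, 1), (300, 10)]
```

For instance, 400 seconds after the last save with 12 changed keys the second rule fires, while with only 3 changed keys neither rule does yet.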
• Add-on module for managing multi-node applications over Redis
• Master-slave architecture for sharding + replication
– Multiple masters holding pairwise disjoint sets of keys; every master has a set of slaves for replication and sharding
http://redis.io/presentation/Redis_Cluster.pdf
Redis Cluster
Document store
• Similar in nature to a key-value store, but the value is tree-structured as a Document
• Data model: store pairs ⟨key, Document⟩
• Motivation: avoid joins; ideally, all relevant joins are already encapsulated in the document structure
• A document is an atomic object that cannot be split across servers
– But a document collection will be split
• Moreover, transaction atomicity is typically guaranteed within a single document
"Documents" are encoded in a standard data exchange format such as XML, JSON (JavaScript Object Notation) or BSON (Binary JSON). Unlike the simple key-value stores, the value column in document databases contains semi-structured data. A single column can house hundreds of such attributes, and the number and type of attributes recorded can vary from row to row. Also, unlike simple key-value stores, both keys and values are fully searchable in document databases.
Document store
Model generalizes column-family and key-value stores
• MongoDB
• Apache CouchDB
– Emphasizes Web access
• RethinkDB
– Optimized for highly dynamic application data
• RavenDB
– Designed for .NET, ACID
• Clusterpoint Server
– XML and JSON, a combined SQL/JavaScript QL
Example: Document store databases
• Open source, 1st release 2009, JSON document store
– Actually, an extended format called BSON (binary JSON) for typing and better compression
• Supports replication (master/slave), sharding
– The developer provides the “shard key”; the collection is partitioned by ranges of values of this key
• Consistency guarantees, CP of CAP
• Used by Adobe (experience tracking), Craigslist, eBay, FIFA (video game), LinkedIn, McAfee
• Provides a connector to Hadoop
– Cloudera provides the MongoDB connector in distributions
MongoDB
• JavaScript Object Notation (JSON) model
• Database = set of named collections
• Collection = sequence of documents
• Document = BSON: {attribute_1: value_1, ..., attribute_k: value_k}
• Attribute = string (attribute_i ≠ attribute_j for i ≠ j)
• Value = primitive value (string, number, date, ...), or a document, or an array
• Array = [value_1, ..., value_n]
• Key properties: hierarchical (like XML), no schema
– Documents in a collection may have different attributes
MongoDB Data Model
An example record from MongoDB, using JSON format, might look like:
{
  "_id" : ObjectId("4fccbf281168a6aa3c215443"),
  "first_name" : "Thomas",
  "last_name" : "Jefferson",
  "address" : {
    "street" : "1600 Pennsylvania Ave NW",
    "city" : "Washington",
    "state" : "DC"
  }
}
Here the value of "address" is an embedded object.
Though records are called documents, they are not documents in the sense of a word processing document, although you can store binary data (using BSON format) in any of the fields in the document. You can also modify the structure of any document on the fly by adding and removing members from the document, either by reading the document into your program, modifying it and re-saving it, or by using various update commands.
MongoDB: Collection example
{ item: "ABC2", details: { model: "14Q3", manufacturer: "M1 Corporation" }, stock: [ { size: "M", qty: 50 } ], category: "clothing" }
{ item: "MNO2", details: { model: "14Q3", manufacturer: "ABC Company" }, stock: [ { size: "S", qty: 5 }, { size: "M", qty: 5 }, { size: "L", qty: 1 } ], category: "clothing" }
(docs.mongodb.org)
Collection inventory
Document insertion:
db.inventory.insert( { item: "ABC1", details: { model: "14Q3", manufacturer: "XYZ Company" }, stock: [ { size: "S", qty: 25 }, { size: "M", qty: 50 } ], category: "clothing" } )
MongoDB: Collection example
Collection orders:
{ _id: "a", cust_id: "abc123", status: "A", price: 25, items: [ { sku: "mmm", qty: 5, price: 3 }, { sku: "nnn", qty: 5, price: 2 } ] }
{ _id: "b", cust_id: "abc124", status: "B", price: 12, items: [ { sku: "nnn", qty: 2, price: 2 }, { sku: "ppp", qty: 2, price: 4 } ] }
Query:
db.orders.find( { status: "A" }, { cust_id: 1, price: 1, _id: 0 } )
The first argument is the selection, the second is the projection.
In SQL it would look like this: SELECT cust_id, price FROM orders WHERE status = 'A'
Result:
{ cust_id: "abc123", price: 25 }
MongoDB: Simple Query
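The roles of the two arguments to find can be illustrated with a small stand-in that works on a list of Python dicts. This is a simplified emulation, not the MongoDB driver API: it only supports equality conditions, and unlike MongoDB it omits every field that is not explicitly projected with 1 (including _id).

```python
ORDERS = [
    {"_id": "a", "cust_id": "abc123", "status": "A", "price": 25,
     "items": [{"sku": "mmm", "qty": 5, "price": 3}, {"sku": "nnn", "qty": 5, "price": 2}]},
    {"_id": "b", "cust_id": "abc124", "status": "B", "price": 12,
     "items": [{"sku": "nnn", "qty": 2, "price": 2}, {"sku": "ppp", "qty": 2, "price": 4}]},
]

def find(collection, selection, projection):
    """Return projected copies of the documents that match the selection."""
    fields = [name for name, keep in projection.items() if keep]
    return [
        {name: doc[name] for name in fields if name in doc}   # projection
        for doc in collection
        if all(doc.get(k) == v for k, v in selection.items()) # selection
    ]

result = find(ORDERS, {"status": "A"}, {"cust_id": 1, "price": 1, "_id": 0})
```

The selection plays the part of the SQL WHERE clause and the projection the part of the SELECT list.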
Collection orders:
{ _id: "a", cust_id: "abc123", status: "A", price: 25 }
{ _id: "b", cust_id: "abc124", status: "B", price: 12 }
{ _id: "c", cust_id: "abc123", status: "A", price: 20 }
Sum up the purchases per customer. In SQL it would look like this:
SELECT cust_id, sum(price) FROM orders GROUP BY cust_id;
Collection PurchasesPerCustomer:
{ _id: "abc123", price: 45 }
{ _id: "abc124", price: 12 }
But orders are distributed all over... We'll do it later.
2 options now:
(1) Built-in MongoDB aggregates
(2) MapReduce + custom JS code (more flexible, less smart)
MongoDB: Map-reduce
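The map-reduce option can be sketched in a few lines of Python over the orders collection shown above. The function names are hypothetical and the whole thing runs in one process; in a sharded deployment the map step would run where the documents live and only the grouped values would be shipped to the reducers.

```python
from collections import defaultdict

ORDERS = [
    {"_id": "a", "cust_id": "abc123", "status": "A", "price": 25},
    {"_id": "b", "cust_id": "abc124", "status": "B", "price": 12},
    {"_id": "c", "cust_id": "abc123", "status": "A", "price": 20},
]

def map_order(doc):
    # Emit one (key, value) pair per order: customer id and price.
    yield doc["cust_id"], doc["price"]

def reduce_prices(cust_id, prices):
    # Combine all values emitted for one key.
    return sum(prices)

def map_reduce(collection, mapper, reducer):
    groups = defaultdict(list)
    for doc in collection:
        for key, value in mapper(doc):
            groups[key].append(value)        # "shuffle": group values by key
    return [{"_id": key, "price": reducer(key, values)}
            for key, values in groups.items()]

purchases_per_customer = map_reduce(ORDERS, map_order, reduce_prices)
```

The result corresponds to the PurchasesPerCustomer collection and to the SQL GROUP BY query.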
• Similar to relational database model
• Structure:
– Column
– Super-column
– Column family
• Structure of database is defined by super-columns and column families.
• Data access is accomplished by specifying column family, key and column in order to get value, using following structure:
• <columnFamily>.<key>.<column> = <value>
Column family Model
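The access pattern <columnFamily>.<key>.<column> = <value> behaves like a three-level nested map. The sketch below models it with Python dicts; put, get, and the family names "personal" and "studies" are hypothetical and chosen only for illustration.

```python
def put(db, column_family, key, column, value):
    # Create the family and the row on demand: columns are added dynamically.
    db.setdefault(column_family, {}).setdefault(key, {})[column] = value

def get(db, column_family, key, column, default=None):
    # A missing column simply does not exist; there is no stored NULL.
    return db.get(column_family, {}).get(key, {}).get(column, default)

db = {}
put(db, "personal", 1, "name", "Alma")
put(db, "personal", 1, "address", "Haifa")
put(db, "studies", 1, "year", 2)
put(db, "studies", 2, "faculty", "CS")
```

Rows in the same column family need not have the same columns: row 1 of "studies" has a year but no faculty, and row 2 the other way around.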
keyspace
Standard RDB:
sid   name    address   year   faculty
861   Alma    Haifa     2      NULL
753   Amir    Jaffa     NULL   CS
955   Ahuva   NULL      2      IE
id sid
1 861
2 753
3 955
id name
1 Alma
2 Amir
3 Ahuva
id address
1 Haifa
2 Jaffa
id year
1 2
3 2
id faculty
2 CS
3 IE
Column Store: each column stored separately (still SQL)
Why? Efficiency (fetch only required columns), compression, sparse data for free
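The decomposition of the standard table into per-column tables can be sketched as follows. The function names are hypothetical; the row ids 1 to 3 follow the example above, and NULLs are represented by Python's None.

```python
ROWS = [
    {"sid": 861, "name": "Alma", "address": "Haifa", "year": 2, "faculty": None},
    {"sid": 753, "name": "Amir", "address": "Jaffa", "year": None, "faculty": "CS"},
    {"sid": 955, "name": "Ahuva", "address": None, "year": 2, "faculty": "IE"},
]

def to_columns(rows):
    """Split rows into one {row_id: value} map per column, skipping NULLs."""
    columns = {}
    for row_id, row in enumerate(rows, start=1):
        for name, value in row.items():
            if value is not None:            # sparse data costs no space
                columns.setdefault(name, {})[row_id] = value
    return columns

def fetch(columns, wanted):
    """Read only the requested columns, leaving the others untouched."""
    return {name: columns[name] for name in wanted}

COLUMNS = to_columns(ROWS)
```

A query that needs only name and year calls fetch(COLUMNS, ["name", "year"]) and never loads sid, address, or faculty.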
Column family 1:
1  sid:861  name:Alma   address:Haifa  ts:20
2  sid:753  name:Amir   address:Jaffa  ts:22
3  sid:955  name:Ahuva                 ts:32
Column family 2:
1  year:2               ts:26
2  faculty:CS           ts:25  email:{prime:c@d ext:c@e}
3  year:2  faculty:IE   ts:32  email:{prime:a@b ext:a@c}
Each name:value pair (e.g., sid:861) is a “column”; email, which groups prime and ext, is a “supercolumn”.
Column-Family Store: NoSQL
(Cassandra model) timestamp for conflicts
Two Types of Column Store
• The two are often mixed up as “column store” → confusion
– See Daniel Abadi’s blog
• Common idea: don’t keep a row in a consecutive block, split via projection
– Column store: each column is independent
– Column family store: each column family is independent
• Both provide some major efficiency benefits in common read-mainly workloads
– Given a query, load into memory only the relevant columns
– Columns can often be highly compressed due to value similarity
– Effective form for sparse information (no NULLs, no space)
Column store vs Column family store
• Column store (SQL):
– MonetDB (started 2002, Univ. Amsterdam)
– VectorWise (spawned from MonetDB)
– Vertica (M. Stonebraker)
– SAP Sybase IQ
– Infobright
• Column family store (NoSQL):
– Apache Cassandra
– Google’s BigTable (main inspiration for column families)
– Apache HBase (used by Facebook, LinkedIn, Netflix...)
– Hypertable
Example: Column store and Column-family store
• Initially developed by Facebook
– Open-sourced in 2008
• Used by 1500+ businesses, e.g., Comcast, eBay, GitHub, Hulu, Instagram, Netflix, Best Buy, ...
• Column-family store
– Supports a key-value interface
– Provides a SQL-like CRUD interface: CQL
• Uses Bloom filters
– An interesting membership test that can have false positives but never false negatives, and behaves well statistically
• BASE consistency model (AP)
– Gossip protocol (constant communication) to establish consistency
– Ring-based replication model
Apache Cassandra
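The Bloom filter property mentioned above (false positives possible, false negatives impossible) follows directly from how the structure works. Here is a minimal sketch; the bit-array size, the number of hash functions, and the use of SHA-256 are arbitrary assumptions and say nothing about Cassandra's actual implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

Because add only ever sets bits, every inserted item keeps all of its bits set, so it can never be reported absent. An item that was never inserted may still find all of its bits set by other items, which is the false-positive case.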
Cassandra Data Model
Columns are added and modified dynamically
Super-columns group columns under a common name
Cassandra Data Model
• Graph databases employ nodes, edges and properties
• Based on graph theory
• Nodes represent entities
• Edges are the lines that connect nodes to nodes
• Properties are pertinent information that relates to nodes
Graph Model
These databases are used when data can be represented as graphs, for example social networks, criminal rings, gated communities, etc.
• Graph with nodes/edges marked with labels and properties (labeled property graph)
– neo4j (Java, 1st release 2010)
– Sparksee (DEX) (Java, 1st release 2008)
– InfiniteGraph (Java/C++, 1st release 2010)
– OrientDB (Java, 1st release 2010)
• Triple stores: support W3C RDF and SPARQL, also viewed as graph databases
– MarkLogic, AllegroGraph, Blazegraph, IBM SystemG, Oracle Spatial & Graph, OpenLink Virtuoso, ontotext
Example: Graph databases
• Open source, written in Java
– First version released 2010
• Supports the Cypher query language
• Clustering support
– Replication and sharding through master-slave architectures
• Used by eBay, Walmart, Cisco, National Geographic, TomTom, Lufthansa, ...
neo4j
[Figure annotations: label, property, direction, name]
Cypher Graph for Social Networks
Cypher Graph: E-mail Exchange
CREATE (alice:User {username:'Alice'}),
       (bob:User {username:'Bob'}),
       (charlie:User {username:'Charlie'}),
       (davina:User {username:'Davina'}),
       (edward:User {username:'Edward'}),
       (alice)-[:ALIAS_OF]->(bob)
Creating Graph Data
MATCH p = (email:Email {id:'6'})<-[:REPLY_TO*1..4]-(:Reply)<-[:SENT]-(replier)
RETURN replier.username AS replier, length(p) - 1 AS depth
ORDER BY depth

replier   depth
Davina    1
Bob       1
Charlie   2
Bob       3
Path Assignment
MATCH (bob:User {username:'Bob'})-[:SENT]->(email)-[:CC]->(alias),
      (alias)-[:ALIAS_OF]->(bob)
RETURN email

email
Node{id:"1",content:"..."}
Query Example
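To show what the pattern in the last query matches, the sketch below walks the same path over a tiny edge list in Python. The edge data is invented for illustration (only the Alice-to-Bob ALIAS_OF edge comes from the CREATE statement above), and the helper names are hypothetical; a graph database would use indexes instead of scanning the edge list.

```python
# Each edge is (source, relationship, target).
EDGES = [
    ("Alice", "ALIAS_OF", "Bob"),
    ("Bob", "SENT", "email:1"),
    ("email:1", "CC", "Alice"),      # Bob CC'd his own alias
    ("Bob", "SENT", "email:2"),
    ("email:2", "CC", "Charlie"),    # Charlie is not an alias of Bob
]

def targets(edges, source, rel):
    """All nodes reachable from source over one edge of type rel."""
    return [dst for src, r, dst in edges if src == source and r == rel]

def emails_ccd_to_own_alias(edges, user):
    """(user)-[:SENT]->(email)-[:CC]->(alias), (alias)-[:ALIAS_OF]->(user)"""
    result = []
    for email in targets(edges, user, "SENT"):
        for alias in targets(edges, email, "CC"):
            if user in targets(edges, alias, "ALIAS_OF"):
                result.append(email)
    return result

matches = emails_ccd_to_own_alias(EDGES, "Bob")
```

Only the first email satisfies both parts of the pattern, which mirrors the single node returned by the Cypher query.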
Graph database
Discovering insurance fraud
http://info.neo4j.com/rs/neotechnology/images/Fraud%20Detection%20Using%20GraphDB%20-%202014.pdf
Graph representation of Insurance Fraud
Other Popular NoSQL Databases
• Largely replaced by Redis nowadays
• Designed to speed up dynamic web applications by alleviating database load
• RAM-resident key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering
• Simple interface
• Designed for quick deployment, ease of development
• APIs in many languages
Memcached
Why Redis beats Memcached
• A distributed key-value system
• Used at LinkedIn
• 10K-20K node operations/CPU
• Auto-sharding
• Graceful server failure handling
Voldemort
• Open source project by Apache Foundation
• Consists of two core components
– Hadoop Distributed File System (Storage)
– MapReduce (Compute)
• Column-oriented data store
• Java interface
• HBase is designed specifically to work with Hadoop
Hadoop / Hbase
• Apache document-oriented store
• Written in Erlang
• RESTful JSON API
• Distributed, featuring robust, incremental replication with bi-directional conflict detection and management
CouchDB
• Native XML database designed to be used for petabyte-scale data stores
• ACID compliant
• Heavy use by federal agencies, document publishers and "high-variability" data
• Arguably the most successful NoSQL company
MarkLogic
• OpenSource native XML database
• Strong support for XQuery and XQuery extensions
• Heavily used by the Text Encoding Initiative (TEI) community and XRX/XForms communities
• Ideal for metadata management
• Integrated Lucene search and structured search
eXist
• Open source
• Closely modeled after Google's Bigtable project
• High-performance distributed data storage system
• Designed to support applications requiring maximum performance, scalability, and reliability
• Hypertable Query Language (HQL) that is syntactically similar to SQL
Hypertable
• The data is not structured or structure is changing
• You need to have a denormalized representation of your data
• You need massive write performance
• You need fast key-value access
• You need flexible schema/data types
• You need schema migration
• You need easier maintainability
When to use NoSQL ?