Seminar.2010.NoSql

July 11th, 2010

Apples, Oranges and NOSQL

Roi Aldaag Architect & ConsultantNadav Wiener Architect & Consultant

3

» What is NoSQL?

» What’s “wrong” with RDBMS?

» Why now?

Introduction

Agenda

4

» Scaling

» CAP Theorem

» ACID vs. BASE

RDBMS vs. NoSQL

Agenda

5

» Key / Value

» Column

» Document

» Graph

NoSQL Taxonomy

Agenda

6

» Comparing Apples to Oranges

» Polyglot Persistence

How to choose?

Agenda

Introduction

8

Introduction

Question: What do they all have in common?

9

Before we answer – some facts:

Introduction

10

Before we answer – some facts:

Introduction

Daily Page Views

Daily Visitors

Data size

7.8x109

620x106

Petabytes

7.1x109

500x106

Petabytes

550x106

56x106

Petabytes

350x106

37x106

Terabytes

82x106

12x106

Terabytes

July, 2010: http://www.alexa.com

11

Introduction

Answer: They use NoSQL data stores

12

Why!?

Introduction

13

» ACID doesn’t scale well horizontally

Sharding breaks relations

Joins are inefficient

» Transactions overhead

» Schema is not flexible

Predfined

Hard to evolve

Relational DBs Have Scaling Limitations

Introduction

14

» NO SQL / Not Only SQL

» A collective description of Open Source, Non-relational, data stores

Highly distributed

Highly scalable

Not ACID and... doesn’t use SQL

» Term coined in a convention in 2009 called “NoSQL” (Eric Evans)

» Started a movement that is gaining momentum

What is NoSQL?

Introduction

15

Introduction

16

» NoSQL data stores predate RDBMS (1970)

But remained a niche

» RDBMS – most popular and generic option

» Web 2.0 introduced new requirements:

Exponential increase in data

Information connectivity

Semi-structured data

» NoSQL data stores had answers

When time was right

When RDBMSs didn’t

Why now?

Introduction

17

It’s theory time:

Introduction

18

Scaling

19

» Adding resources to a single node in a system

» Add more CPUs or memory

» Move system to a larger machine

» Pros:

Quick and Simple

» Cons:

Outgrowing the capacity of largestsystem available (More’s law)

Expensive

Creates vendor lock-in

Scaling Up

Scaling

20

» Add more nodes to a system

» Functional Scaling (vertical)

Grouping data by function and spreading functional groups across databases

» Sharding (horizontal)

Splitting same functional data across multiple databases

» Pros: More flexible

» Cons: More complex

Scaling Out

Scaling

Distributed

Databases

22

Distributed Databases

» Many nodes

» Same databaseNode 1 Node 2

Node 3

23

» Consistency

All clients can see the same data

» Availability

All clients can always access data

» Partition tolerance

The ability to continue working when the network topology is broken

The ability to recover once the network is healed

What are the requirements from distributed databases?


24

» You can fully satisfy at most 2 out of 3

Compromise on 3rd

» Not “all or nothing”

Choose various levels of consistency, availability or partition tolerance

» Recognize which of the CAP rules your business needs for the task

CAP Theorem (E. Brewer, N. Lynch)


25

» Partition Tolerance is compromised

» Single site clusters (easier to ensure all nodes are always in contact)

» When a network partition occurs, the system blocks

» e.g. Two Phase Commit (2PC)

CA: Consistency & Availability


Partition

Tolerance

26

» Availability is compromised

» Access to some data may be temporarily limited

» The rest is still consistent/accurate

» e.g. Sharded database

» TBD sample

CP: Consistency & Partitioning


Partition

Tolerance

27

» Consistency is compromised

» System is still available under partitioning

» Some data returned may be temporarily not up-to-date

» Requires conflict resolution strategy

» e.g. DNS, caches, Master/Slave replication

» TBD sample

AP: Availability & Partitioning


Partition

Tolerance

ACID vs. BASE

29

» Atomicity

When a part of the transaction fails -> the entire transaction fails; Database state is left unchanged

» Consistency

A transaction takes database from one consistent state to another

» Isolation

A transaction can't see dirty state from other transactions

» Durability

Commit means commit.

ACID – a quick recap

ACID vs. BASE

30

» The CAP compliment of ACID

Just had to be called BASE

Backronym:

» Basically Available

» Soft State

» Eventually Consistent

BASE

ACID vs. BASE

31

» RDBMSs strive to provide ACID guarantees

ACID forces consistency

» NoSQL solutions often scale through BASE

BASE accepts that conflicts will happen

RDBMS & ACID / NoSQL & BASE

ACID vs. BASE

Taxonomy

33

Taxonomy

Key / Value Column

Graph

TXT

XML

BIN

Document

34

Taxonomy

Key / Value Databases

35

» Simple Key / Value lookups (DHT)

» Value is opaque

» Focus on scaling to huge amounts of data

» Designed to handle massive load

» E.g.

Riak

Project Voldemort

Redis

Key/Value Stores

Taxonomy

Based on Amazon’s

Dynamo paper

36

Key/Value e.g.: Riak

Taxonomy

» No single point of failure

» No machines are special or central

» MapReduce queries (Erlang / Javascript)

» HTTP/JSON API

» Ring cluster with automatic replication

» Elastic / partition rebalancing» Written in: Erlang, C, Javascript

» Developed by: Basho Technologies

» Java client: (jonjlee / riak-java-client)

37

Data Model


» Key / Value pairs are stored in a Bucket

» A Bucket ~ a namespace

» Each update is tracked by a Vector Clock

An algorithm for determining ordering and detecting conflicts

» When in conflict

Last wins / manual resolution

Versioning

38

» Read an object

» Store a new object

» Store an object with existing key (update)


GET /riak/bucket/key

POST /riak/bucket

PUT /riak/bucket/key

Example: REST API

39

» A framework supporting distributed computing on large data sets on clusters of machines

» Leverage parallel processing power

» Introduced by Google

» Inspired by map / reduce functions in functional programming

» Map step

» Reduce step


MapReduce

40

» Map

» Parse each document

» Emit a sequence of <word, doc_id> pairs


MapReduce example: Inverted Index

<doc_id, doc_text>

<100, >,

<200, >,

<300, >

Node

1

Node

2

Node

3

TXT1

TXT2

TXT3

<word ,doc_id>

< word1 ,100>,< word2 ,100>,

< word2 ,200>,

< word2 ,300>

41

» Reduce

» Accept all pairs for a given word

» Sort the corresponding document IDs

» Emit a <word, list(document ID)> pair


MapReduce example: Inverted Index

<word, list(document_id)>

< word1 ,(100) >,

< word2 ,(100,200)>, < word3 ,(300) >

42

Taxonomy

BigTable and

Column Oriented Databases

43

» Conceptually a single, infinitely large table

» Each rows can have different number of columns

» Table is sparse: |rows|*|columns| > |values |

» Based on Google’s BigTable paper

» E.g.

Cassandra

Hbase

Hypertable

Column Stores – BigTable derivatives

Taxonomy

44

» RDBMS:

Create a central table with common attributes

Create a table per product with unique attributes

Use a join query

Alternatively create a table that holds meta data on products

» NoSQL:

Column oriented database

Use arbitrarily columns

Use Case: Manage products with diverse attributes

Taxonomy

45

» Data model: Google’s BigTable

» Infrastructure: Amazon Dynamo

» Incremental scalability

» Flexible schema

» No single point of failure (Distributed P2P)

» Optimistic replication (Gossip protocol)

» Written in: Java

» Developed by: Facebook

» Java client: e.g. Hector / Thrift

Column Store e.g.: Cassandra

Taxonomy

46

» Column

Smallest increment of data: tuple of name, value, timestamp

Data Model

Column e.g.: Cassandra

{

name: "emailAddress",

value: “[email protected]",

timestamp: 123456789

}

47

» SuperColumn

A sorted, associative, unbounded array of columns


{ // this is a SuperColumn

name: "homeAddress",

// with an unbounded array of Columns

value: {

// the keys is the name of the Column

street: {name: "street", value: "s", timestamp:...},

city: {name: "city", value: "c", timestamp:...},

zip: {name: "zip", value: "z", timestamp:...}

}

}

48

» ColumnFamily

A container (~Table) for columns sorted by their names

Column Families are referenced and sorted by row keys


Users = { // ColumnFamily

john: { // key to row in CF

"role" : "admin",

"status" : "offline",

"nick" : "dude1934"

}, // end row

fred: { // another row

"nick" : “freddy",

"email" :"[email protected]",

"age" : "25",

"gender" : "male",…

},… // more rows

} Column Family

49

» Keyspace

The outer most grouping of data (~DB Schema)

Contains ColumnFamily’s

There is no imposed relationship between ColumsFamily’s


50

» Example


Tweets CF

Timeline CFKeyspace

51

Taxonomy

Document Oriented DatabasesTXT

XML

BIN

52

» Store semi-structured documents (think JSON)

» Document versioning

» Map/Reduce based queries, sorting, aggregation, etc.

» DB is aware of internal structure

» E.g. MongoDB

CouchDB

JackRabbit (JCR JSR 170)

Document Store

Taxonomy

53

» RDBMS:

Table for each: posts, comments, tags

Foreign relations

» NoSQL:

Document storage

Store post + tags + comments as a document

Use Case: Blog with tagged posts and comments

Taxonomy

54

» MongoDB (from "humongous")

» Manages collections of JSON-like documents (BSON)

» Queries can return specific fields of documents

» Supports secondary indexes

» Atomic operations on single documents

» Developed by: 10gen

» Written in: C++

» Clients: Java, Scala and more

Document Store e.g: MongoDB

Taxonomy

55

» Suppose you host a blog, where each post is tagged:

» Notice how posts have an array of tags

Example: Blog posts

Docment e.g.: MongoDB

db.posts.save({

_id : 3,

author:"john",

title : “Apples, Oranges and NOSQL",

text : “This article will…",

tags : [ “database", “nosql" ]

});

56

» MongoDB supports secondary indexes and a query optimizer

Compound indexes are also supported


db.posts.ensureIndex({ tags: 1 });

db.posts.ensureIndex({ author: 1});

db.posts.find({ author: "john", tags: "nosql" });

// Result:

{

"_id" : 3,

"author" : "john",

"title" : "Apples, Oranges and NOSQL",

"text" : "This article will…",

"tags" : ["database", "nosql", "mongodb" ]

}

57

» Let's update our posts to include some comments:


db.posts.update({ _id: 3 }, {

$inc: { comments_count: 4},

$pushAll : {

comments: [

{ text: “Comment 1" },

{ text: “Comment 2", author: "Mr. T" },

{ text: “Comment 3" },

{ text: “Comment 4" }

]

}

});

58

Taxonomy

Graph Databases

59

» Inspired by mathematical graph theory G=(E,V)

» Models the structure of data

» Navigational data model

» Scalability / data complexity

» Data model: Key-Value pairs on Edges / Nodes

» Relationships: Edges between Nodes

» E.g. Neo4j

Pregel (Google’s PageRank)

AllegroGraph

Graph databases

Taxonomy

60

» RDBMS

Complex recursive algorithm

Multiple Self joins

Round trips to DB / bulk read and resolve in RAM

» NoSQL:

Graph Storage

Network traversal

Use Case: Connected data - deep relationship links between users in a social network

Taxonomy

61

» High-performance graph engine

» Embedded / disk based

» Work with OO model: nodes, relationships, properties

» ACID Transactions

JTA support – participate in 2PC with your RDBMS

» Developed by: Neo Technologies

» Written in: Java

» Clients: Java, client libraries in other platforms

Graph e.g.: Neo4J

Taxonomy

62

Graph e.g.: Neo4j

http://neo4j.org/

Comparing Apples to Oranges

64

» RDBMS

Databases contains tables, columns and rows

All rows the same structure

Inherent ORM mismatch

» NoSQL

Choose your data structure

Data is stored in natural structure (e.g. Documents, Graphs, Objects)

Comparing Data Structures


65

» RDBMS

Strict schema, difficult to evolve

Maintains relations and forces data integrity

» NoSQL

Structure of data can be changed dynamically

• e.g. Column stores – Cassandra

Data can sometimes be completely opaque

• e.g Key/Value – Project Voldemort

Comparing Schema Flexibility


66

» RDBMS

The data model is normalized to remove data duplication

Normalization establishes table relations

» NoSQL

Denormalization is not a dirty word

Relations are not explicitly defined

Related data is usually grouped and stored as one unit

• E.g. document, column

Comparing Normalization & Relations


67

» RDBMS

CRUD operations using SQL

Access data from multiple tables using SQL joins

Generic API such as JDBC

» NoSQL

Proprietary API and DSLs (e.g. Pig / Hive / Gremlin)

MapReduce, graph traversals

REST APIs, portable serialization formats• BSON, JSON, Apache Thrift, Memcached

Comparing Data Acces


68

» RDBMS

Slice and Dice data, then reassemble any way you like

» NoSQL

Hard to repurpose data for ad-hoc usage

• Plan ahead

Think in advance

• How and what you store

• Data access patterns

Comparing Reporting Capabilities


Summary

70

» ACID ruled exclusively in the last 40 years

doesn’t compromise on consistency

» Database industry neglected distributed DBs w/ availability

» Vacuum was filled with “NoSQL” BASE architectures

Strict A and P, minimize C compromise

» Relational databases are now trying to catch up

Why NOSQL / BASE

Summary

71

» Missing some query capabilities

joins / composite transaction

» Eventual consistency -- not for every problem

» Not a drop in replacement for RDBMS “on ACID”

» No standardization -> product lock-in

» Relatively immature (support, bugs, community)

NoSQL Limitations

Summary

72

» Relational databases and NoSQL databases are designed to meet different needs

» RDBMS-only should not be a default

» NOSQL databases outperform RDBMS’s in their particular niche

» No one size fits all / Silver bullet

...but you don’t have to choose one

Choose the right tool for the job

Summary

73

» Poly: many Glot: language

» Meshing up persistence mechanisms to best meet requirements

» Good integration stories:

E.g. Neo4j + JDBC using JTA

Polyglot Persistence

Summary

Date post:	20-Jan-2015
Category:	Documents
Upload:	roialdaag
View:	3,827 times
Download:	1 times