Making Sense of NoSQL

Date post: 24-Sep-2020
Making Sense of NoSQL Dan McCreary Wednesday, Nov. 13th 2014
Transcript
Page 1

Making Sense of NoSQL Dan McCreary

Wednesday, Nov. 13th 2014

Page 2

Agenda

• Why NoSQL?

• What are the key NoSQL architectures?

• How are they different from traditional RDBMS systems?

• What types of problems do they solve?

• How to learn more

Copyright Kelly-McCreary & Associates, LLC 2

Page 3

Background for Dan McCreary

• Co-founder of the NoSQL Now! conference

• Background

– Bell Labs

– NeXT Computer (Steve Jobs)

– Owner of 75-person software consulting firm

– US Federal data integration (National Information Exchange Model NIEM.gov)

– Native XML/XQuery for metadata management since 2006

– Advocate of web standards, NoSQL and XRX systems

– As of Monday – Principal Engineer with MarkLogic

Page 4

Making Sense of NoSQL

• Coauthor (with Ann Kelly) of "Making Sense of NoSQL"

• Guide for managers and architects

• Focus on NoSQL architectural tradeoff analysis

• Basis for a 40-hour course on database architectures

• http://manning.com/mccreary

Page 5

The Story of Property Tax Forms

• How did I get into NoSQL?

• In 2006 a state agency in Minnesota wanted to standardize property tax records across 87 counties

• A story about standards

– XML, XML Schema, XForms, XQuery, NIEM

• A story about agility

• A story about new technology adoption


Alex Bleasdale

Arun Batchu

Page 6

2006 eCRV Case Study

Page 11

Four Translations


• T1 – HTML into Java Objects

• T2 – Java Objects into SQL Tables

• T3 – Tables into Objects

• T4 – Objects into HTML

[Diagram: Web Browser, Object Middle Tier and Relational Database, with translations T1 to T4 between the tiers]

Page 12

Kurt's Suggestion

store($collection, $file-name, $data)

[Diagram: the Web Browser saves the Web Form to eXist-db]

Kurt Cagle: "Use a Native XML Database!"

Page 13

Zero Translation

• XML lives in the web browser (XForms)

• REST interfaces

• XML in the database (Native XML, XQuery)

• XRX Web Application Architecture

• No "impedance mismatch", No translation!

• Department tried it and then went back to HTML, Java and SQL

• …but I was forever changed…I began to question everything I had been taught about databases

[Diagram: Web Browser (XForms) connected directly to an XML database]
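The store($collection, $file-name, $data) call from Kurt's suggestion can be pictured with a minimal Python sketch. This is not eXist-db's API: the `XmlStore` class, the collection path and the sample record are all hypothetical, and an in-memory dict stands in for the database. It only illustrates the "zero translation" idea, where the XML produced by the form is saved and read back without being mapped to objects or tables.

```python
import xml.etree.ElementTree as ET


class XmlStore:
    """Hypothetical in-memory stand-in for a native XML database."""

    def __init__(self):
        self._collections = {}  # collection name -> {file name -> XML text}

    def store(self, collection, file_name, data):
        # Parse only to check well-formedness; the XML text is kept as-is.
        ET.fromstring(data)
        self._collections.setdefault(collection, {})[file_name] = data

    def get(self, collection, file_name):
        return self._collections[collection][file_name]


db = XmlStore()
form_xml = "<record><county>Example County</county><value>250000</value></record>"
db.store("/db/forms", "record-1.xml", form_xml)

# The document comes back exactly as the form produced it.
saved = db.get("/db/forms", "record-1.xml")
county = ET.fromstring(saved).findtext("county")
```

Compared with the four translations (T1 to T4), there is no object layer and no table layout to keep in sync with the form.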

Page 14

Anger, Wiki, Conference, Book


2011, 2012, 2013, 2014

Page 15

NoSQL on Google Trends

[Chart: Google Trends search interest for RDBMS and NoSQL]

http://www.google.com/trends/explore?q=NoSQL%2C+RDBMS#q=NoSQL%2C%20RDBMS&cmpt=q

Page 16

The NO-SQL Universe

• Document Stores

• Key-Value Stores

• Graph/Triple Stores

• Column-Family Stores

• XML

Page 17

Sample of NoSQL Jargon

Document orientation

Schema free

MapReduce

Horizontal scaling

Sharding and auto-sharding

Brewer's CAP Theorem

Consistency

Reliability

Partition tolerance

Single-point-of-failure

Object-Relational mapping

Key-value stores

Column stores

Document-stores

Memcached


Indexing

B-Tree

Configurable durability

Documents for archives

Functional programming

Document Transformation

Document Indexing and Search

Alternate Query Languages

Aggregates

OLAP

XQuery

MDX

RDF

SPARQL

Architecture Tradeoff Modeling

ATAM

Erlang

Note that within the context of NoSQL many of these terms have different meanings!

Page 18

Before NoSQL

• Relational

• Analytical (OLAP)

Page 19

After NoSQL

• Relational

• Analytical (OLAP)

• Key-Value

• Column-Family

• Document

• Graph

Page 20

Food for thought…

• What percentage of database transactions run on RDBMSs in the following organizations?

• What percentage of all transactions in Minnesota run on RDBMSs?

• Why is this number different?

• Is our data fundamentally different?

Page 21

NoSQL – The Big Tent

• "NoSQL" – a label for a "meme" that now encompasses a large body of innovative ideas on data management

• "Not Only SQL"

• Focus on non-relational databases and hybrids

• A community where new ideas are quickly recombined to create innovative new business solutions


http://www.flickr.com/photos/morgennebel/2933723145/

Page 22

Historical Context

Mainframe Era

• 1 CPU

• COBOL and FORTRAN

• Punchcards and flat files

• $10,000 per CPU hour

Commodity Processors

• 10,000 CPUs

• Functional programming

• MapReduce "farms"

• Pennies per CPU hour

Page 23

Two Approaches to Computation

Copyright 2010 Dan McCreary & Associates

1930s and 40s

• John von Neumann: Manage state with a program counter.

• Alonzo Church: Make computations act like math functions.

Which is simpler? Which is cheaper? Which will scale to 10,000 CPUs?

Page 24

Standard vs. MapReduce Prices

http://aws.amazon.com/elasticmapreduce/#pricing

[Pricing tables: standard pricing ("John's Way") next to MapReduce pricing ("Alonzo's Way")]

Page 25

MapReduce CPUs Cost Less!

[Bar chart: Cost Per CPU Hour (Cents) for EC2 and MapReduce, vertical axis from 0 to 14]

http://aws.amazon.com/elasticmapreduce/#pricing

Cut cost from 12 to 3 cents per CPU hour!

Perhaps Alonzo was right!

Why? (Hint: how "shareable" is this process?)

Page 26

Pressures on Single Node RDBMS Architectures

[Diagram: a Single Node RDBMS under pressure from the following]

• OLAP/BI/Data Warehouse

• Social Networks

• Scalability

• Agile

• Schema Free

Page 27

An evolving tree of data types

[Tree diagram with the labels: Read Mostly, Read/Write, Structured, Unstructured, Transactional, RDBMS, BI/DW, Web Crawlers, Documents, Log Files, XML, JSON, Binary, Open Linked Data, Graph]

Page 28

Many Uses of Data


• Transactions (OLTP)

• Analysis (OLAP)

• Search and Findability

• Enterprise Agility

• Discovery and Insight

• Speed and Reliability

• Consistency and Availability

Page 29

Three Eras of Enterprise Data

• NoSQL systems will not replace ERP or BI/DW systems, but they will complement them and facilitate the integration of unstructured document data

[Diagram: three eras of enterprise data]

• 1970s, 80s, 90s – Siloed Systems: HR, Inventory, Sales, Finance

• 1990s, 2000s – ERP Drives BI/DW: ERP, BI/Data Warehouse, Documents

• Today – NoSQL and Documents: ERP, BI/Data Warehouse, OLTP, OLAP, NoSQL, Documents

Page 30

Simplicity is a Virtue

• Many modern systems derive their strength from dramatically limiting the features in their system and focusing on a specific task

• Simplicity allows database designers to focus on the primary business drivers

• Simplicity promotes "separation of concerns"

Photo from flickr by PSNZ Images

Page 31

Google MapReduce

• 2004 paper that greatly increased the impact of functional programming on the entire community

• Copied by many organizations, including Yahoo
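As a rough illustration of the functional style behind MapReduce, here is a single-process word count in Python. It is a teaching sketch, not Google's or Hadoop's implementation: the function names and sample documents are made up, and the "shuffle" step that a real framework performs across machines is a plain dictionary here.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


documents = ["NoSQL scales out", "SQL scales up", "NoSQL is not only SQL"]

# Each map_phase call depends only on its own input document.
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

Because `map_phase` and `reduce_phase` behave like math functions with no shared state, each call could in principle run on a different machine, which is what makes the approach suitable for large clusters of commodity processors.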

Page 32

Google Bigtable Paper

• 2006 paper that gave focus to scalable databases

• Designed to reliably scale to petabytes of data and thousands of machines

Page 33

Amazon's Dynamo Paper

• Werner Vogels

• CTO - Amazon.com

• October 2, 2007

• Used to power Amazon's Web Sites

• One of the most influential papers in the NoSQL movement

Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “Dynamo: Amazon's Highly Available Key-Value Store”, in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.

Page 34

Scale Up vs. Scale Out

Scale Up

• Make a single CPU as fast as possible

• Increase clock speed

• Add RAM

• Make disk I/O go faster

Scale Out

• Make many CPUs work together

• Learn how to divide your problems into independent threads

Page 35

Automatic Sharding

• When one node in a cluster has too much load, the system should be able to automatically rebalance the data distribution

• Note: Auto-sharding is not the same as replication!

[Diagram: before and after a shard]

• Before shard: Warning, processor at 90% capacity! Time to "shard" – copy ½ of the data to a new processor

• After shard: the original processor and the new processor each get ½ the load
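The split at 90% capacity can be sketched in a few lines of Python. This is a simplified model, not the algorithm of any particular product: the `Cluster` class, the item-count definition of "capacity" and the range-based routing by sorted key are all assumptions made for the illustration.

```python
import bisect


class Cluster:
    """Hypothetical cluster that splits a shard once it is 90% full."""

    THRESHOLD_PERCENT = 90

    def __init__(self, capacity):
        self.capacity = capacity      # assumed: maximum items per shard
        self.lower_bounds = [""]      # smallest key each shard is responsible for
        self.shards = [{}]            # one dict per shard (one per "processor")

    def _index(self, key):
        # Rightmost shard whose lower bound is <= key.
        return bisect.bisect_right(self.lower_bounds, key) - 1

    def put(self, key, value):
        i = self._index(key)
        self.shards[i][key] = value
        if len(self.shards[i]) * 100 >= self.THRESHOLD_PERCENT * self.capacity:
            self._split(i)

    def get(self, key):
        return self.shards[self._index(key)][key]

    def _split(self, i):
        # Move the upper half of the keys to a new shard.
        keys = sorted(self.shards[i])
        middle = len(keys) // 2
        old = self.shards[i]
        new = {key: old.pop(key) for key in keys[middle:]}
        self.lower_bounds.insert(i + 1, keys[middle])
        self.shards.insert(i + 1, new)


cluster = Cluster(capacity=10)
for n in range(9):
    cluster.put(f"user-{n}", n)

sizes = [len(shard) for shard in cluster.shards]
```

Each key lives on exactly one shard after the split, which is the difference from replication: sharding divides the data, while replication keeps duplicate copies of it.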

Page 36

Schema-Free Integration

"We can easily store the data that we actually get, not the data we thought we would get."

[Diagram: XML v1, XML v2 and XML v3 messages flow from an Enterprise Messaging System into a NoSQL Database]
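A small Python sketch of the same idea: three versions of a message are stored side by side with no table definition and no migration step. The message layout, element names and the list used as a "database" are hypothetical.

```python
import xml.etree.ElementTree as ET

# Three versions of the same message type, as they might arrive over time.
messages = [
    "<person version='1'><name>Pat</name></person>",
    "<person version='2'><name>Pat</name><email>pat@example.com</email></person>",
    "<person version='3'><name>Pat</name><phone>555-0100</phone></person>",
]

# A plain list stands in for the schema-free store: every version is accepted.
store = [ET.fromstring(message) for message in messages]

# Queries simply skip documents that lack the element being asked for.
with_email = [doc.get("version") for doc in store if doc.find("email") is not None]
```

A relational design would need a schema change (or nullable columns) for each new version before the data could be saved; here the differences only matter at query time.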

Page 37

Horizontal Scalability

[Chart: Performance (vertical axis) against Number of processors (horizontal axis)]

• Non-scalable systems reach a plateau of performance where adding new processors does not add incremental performance

• Linearly scalable architectures provide a constant rate of additional performance as the number of processors increases

Figure 6.2

Page 38

Shared Nothing Architecture

• Every node in the cluster has its own CPU, RAM and disk

Page 39

Master-Slave vs. Peer-to-Peer

• The Master node may become a bottleneck in large clusters

• Many newer NoSQL architectures are moving toward a true peer-to-peer system

[Diagram: Master-Slave (requests go to the Master; the Standby Master is used only if the primary master fails) next to Peer-to-Peer (requests go to the Nodes)]

Page 40

The Tunable SLA

• NoSQL systems that use many commodity processors can be precisely tuned to meet an organization's service level agreements

Inputs:

• Max Read Time

• Max Write Time

• Reads Per Second

• Writes Per Second

• Duplicate Copies

• Multiple Datacenters

• Transaction Guarantees

Output:

• Estimated Price/Month: $435.50

Page 41

Key-Value Stores

• Keys used to access opaque blobs of data

• Values can contain any type of data (images, video)

• Pros: scalable, simple API (put, get, delete)

• Cons: no way to query based on the content of the value

[Diagram: a table of key/value pairs]

Examples: Berkeley DB, Memcached, DynamoDB, S3, Redis, Riak
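The put/get/delete interface is small enough to sketch in full. The `KeyValueStore` class below is a hypothetical in-memory illustration, not the client API of any of the products listed above.

```python
class KeyValueStore:
    """Hypothetical in-memory key-value store with the three basic operations."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # The value is an opaque blob: the store never looks inside it.
        self._items[key] = value

    def get(self, key):
        return self._items.get(key)

    def delete(self, key):
        self._items.pop(key, None)


kv = KeyValueStore()
kv.put("images/logo.png", b"\x89PNG...")
blob = kv.get("images/logo.png")
kv.delete("images/logo.png")
missing = kv.get("images/logo.png")
```

The interface also shows the stated drawback: there is no operation that selects items by what the value contains, only by key.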

Page 42

Column-Family

• Key includes a row, column family and column name

• Store versioned blobs in one large table

• Queries can be done on rows, column families and column names

• Pros: Good scale out, versioning

• Cons: Cannot query blob content, row and column designs are critical

Examples: Cassandra, HBase, Hypertable, Apache Accumulo, Bigtable

Page 43

Column store key-value

[Diagram: Key = Row-ID + Column Family + Column Name + Timestamp, mapped to a Value]

• The key is composed of:

– row id (string)

– Column family (grouping of columns)

– Column name (string)

– Timestamp (64-bit value)

• Value

– any blob (byte stream)
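The composite key can be modeled directly as a tuple. The sketch below is an assumption-laden illustration in Python (the `CellKey` type, the helper functions and the sample data are hypothetical), showing how the timestamp part of the key gives versioning.

```python
from typing import NamedTuple


class CellKey(NamedTuple):
    row_id: str
    column_family: str
    column_name: str
    timestamp: int  # e.g. a 64-bit version stamp


table = {}  # one large map from CellKey to a blob


def put(row_id, family, column, timestamp, value):
    table[CellKey(row_id, family, column, timestamp)] = value


def get_latest(row_id, family, column):
    """Return the newest version of a cell, or None if it does not exist."""
    versions = [key for key in table if key[:3] == (row_id, family, column)]
    if not versions:
        return None
    return table[max(versions, key=lambda key: key.timestamp)]


put("user42", "profile", "email", 1000, b"old@example.com")
put("user42", "profile", "email", 2000, b"new@example.com")
latest = get_latest("user42", "profile", "email")
```

Both versions stay in the table; a read picks the one with the highest timestamp. The linear scan in `get_latest` is only for brevity and says nothing about how real systems organize their keys.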

Page 44

Graph Store

• Data is stored in a series of nodes, relationships and properties

• Queries are really graph traversals

• Ideal when relationships between data are key, e.g. social networks

• Pros: fast network search, works with public linked data sets

• Cons: Poor scalability when graphs don't fit into RAM, specialized query languages (RDF uses SPARQL)

Examples: Neo4j, AllegroGraph, Bigdata triple store, InfiniteGraph, Stardog
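To make "queries are really graph traversals" concrete, here is a minimal Python sketch of a friends-of-friends query as a breadth-first traversal. The node names, the FRIEND relationship type and the helper functions are hypothetical; real graph stores use their own query languages and index structures.

```python
from collections import deque

# Nodes carry properties; relationships are (source, type, target) triples.
nodes = {
    "alice": {"city": "Minneapolis"},
    "bob": {"city": "St. Paul"},
    "carol": {"city": "Duluth"},
    "dave": {"city": "Rochester"},
}
relationships = [
    ("alice", "FRIEND", "bob"),
    ("bob", "FRIEND", "carol"),
    ("carol", "FRIEND", "dave"),
]


def neighbors(node, rel_type):
    for source, kind, target in relationships:
        if kind == rel_type and source == node:
            yield target


def within_hops(start, rel_type, max_hops):
    """Breadth-first traversal that follows rel_type up to max_hops away."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for nxt in neighbors(node, rel_type):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                queue.append((nxt, hops + 1))
    return found


friends_of_friends = within_hops("alice", "FRIEND", 2)
```

The query never joins tables; it walks relationships outward from a starting node, so its cost depends on the neighborhood visited rather than on the total size of the data set.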

Page 45

Document Store

• Data stored in nested hierarchies

• Logical data remains stored together as a unit

• Any item in the document can be queried

• Pros: No object-relational mapping layer, ideal for search

• Cons: Complex to implement, incompatible with SQL

Examples: MarkLogic, MongoDB, Couchbase, CouchDB, RavenDB, eXist-db
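The claim that any item in a nested document can be queried can be sketched with a tiny path lookup in Python. The `find` helper, its '/'-separated path syntax and the sample order are hypothetical; document stores provide far richer languages such as XQuery.

```python
order = {
    "id": "order-1",
    "customer": {"name": "Pat", "address": {"city": "Minneapolis"}},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}


def find(document, path):
    """Return every value reached by a '/'-separated path.

    Lists are searched item by item, so a path can cross repeated elements.
    """
    current = [document]
    for step in path.split("/"):
        next_level = []
        for node in current:
            candidates = node if isinstance(node, list) else [node]
            for candidate in candidates:
                if isinstance(candidate, dict) and step in candidate:
                    next_level.append(candidate[step])
        current = next_level
    return current


cities = find(order, "customer/address/city")
skus = find(order, "items/sku")
```

The order, its customer and its line items stay together as one unit, so no object-relational mapping layer is needed to reassemble them.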

Page 46

Two Models

"Bag of Words"

• All keywords in a single container

• Only frequency counts are stored with each word

"Retained Structure"

• Keywords associated with each sub-document component

[Diagrams: "Bag of Words" shows keywords such as 'love', 'hate', 'new' and 'fear' in one container; "Retained Structure" shows a doc-id tree with keywords attached to each component]

Page 47

Keywords and Node IDs

• Keywords in the reverse index are now associated with the node-id in every document

[Diagram: a document-id tree in which each Node-id has its own keywords]
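The difference between the "Bag of Words" and "Retained Structure" models can be shown by building both indexes over the same documents. This Python sketch uses made-up documents and node-id strings; it is an illustration of the idea, not how any specific search engine stores its index.

```python
from collections import defaultdict

# Each document is a set of nodes (sub-document components) with their text.
documents = {
    "doc-1": {
        "doc-1/title": "love and fear",
        "doc-1/para-1": "new love",
    },
    "doc-2": {
        "doc-2/title": "fear of the new",
    },
}

bag_of_words = defaultdict(lambda: defaultdict(int))  # word -> doc-id -> count
node_index = defaultdict(set)                         # word -> set of node-ids

for doc_id, nodes in documents.items():
    for node_id, text in nodes.items():
        for word in text.split():
            bag_of_words[word][doc_id] += 1  # only a frequency per document
            node_index[word].add(node_id)    # remembers where the word occurred

love_docs = dict(bag_of_words["love"])
love_nodes = sorted(node_index["love"])
```

The bag-of-words index can say that "love" appears twice in doc-1; the node-id index can additionally say that one occurrence is in the title and one in the first paragraph, which allows searches scoped to a part of the document.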

Page 48

Start Finish

Using the Wrong Architecture

Credit: Isaac Homelund – MN Office of the Revisor

Page 49

Using the Right Architecture

Start Finish

Find ways to remove barriers and empower the non-programmers on your team.

Page 50

Further Reading and Questions

Dan McCreary

[email protected]


http://manning.com/mccreary

@dmccreary

