1/37


CS-695 NoSQL Database

Polyglot Persistence; Or, The Many Ways We Store Data

Dr. Chuck Cartledge

27 Aug. 2015

2/37


Table of contents I

1 A little history

2 A change in the air

3 Database layouts

4 CRUDy stuff

5 Databases that I/we use

6 Conclusion

7 References

3/37


Hammer and nails . . .

“. . . it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.”

Abraham H. Maslow [8]

4/37


Miscellanea

Origin of “polyglot . . . ”

Popularized by Neal Ford [4]:

Talked about software development

How things are evolving (SQL, XML, .NET, etc.)

How multi-threading is hard (concurrency, coordination, etc.)

Promoted the idea of enterprise development via Java and .NET

Takeaway: choose the right tool for the job.

Different languages will continue to exist because each is good at something and all are necessary.

5/37


Miscellanea

The world BC (Before Codd).

Databases existed before Edgar Codd.

Hierarchical approach – alive and well in our file system

Network approach – currently underpinning ideas for graph databases

These suffered because people had to know lots of details about how the database was implemented.

6/37


Miscellanea

The world after Codd.

Separate representation from implementation

Changes in the database for optimization needn’t affect data queries

User interactions aren’t cluttered by “construction noise” (including indexing and sorting)

Codd’s relational data bank hides all implementation information.

Relational database management systems (RDBMS) hide information about how data is stored. The data language is independent of how data is stored [3].

7/37


Miscellanea

The world according to RDBMS.

Everything is neat and tidy

Everything can be defined in a set of tables that have relationships between them

If you make the database large enough, you can store anything and ask any question

Image from [10].

RDBMS reigned supreme for 30 to 40 years (starting in 1970). And then reality and Big Data started to hit.

8/37


How we turned and started to get to now.

And then things started changing.

Can’t point a finger at a specific incident; it might have been a critical mass.

The Internet made it easier to collect data.

A new generation of people thought about things in a different way.

The new data had three attributes: velocity, volume, variety [7].

New ways of looking at data encouraged new questions.

People wanted answers faster.

Many of these items couldn’t be supported by an RDBMS.

9/37


Make things faster.

Simple and complex ways

How to get more processing power to answer database questions?? Basically:

Scale up – buy a faster CPU and more RAM

Scale out – buy more CPUs and get them to work in parallel

Scaling up with custom CPUs gets expensive very, very quickly.

Image from [9].

Commodity CPUs are almost a dime a dozen, leading to clusters, network services, distributed applications, etc.

10/37


Make things faster.

Amdahl’s Law [1]

Division and measurement of serial and parallel operations appears time and again. (Shades of Mandelbrot.)

“Make the common fast.”

“Make the fast common.”

Understand what parts have to be done serially.

Understand what parts can be done in parallel.

Need to factor in “overhead” costs when computing speed up.

11/37


Make things faster.

Amdahl’s Law (A summary)

Time for serial execution, defined as $T(1)$

Portion that is NOT parallelizable, defined as $B \in (0, 1]$

Number of parallel resources, defined as $n$

$$T(n) = T(1)\left(B + \frac{1}{n}(1 - B)\right)$$

Speed-up, defined as $S(n)$

$$S(n) = \frac{T(1)}{T(n)} = \frac{1}{B + \frac{1}{n}(1 - B)}$$

Dr. Gene Amdahl (circa 1960)
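
The summary above can be tried numerically. The following is a minimal sketch, not part of the original slides; the function names and the value B = 0.1 are illustrative assumptions. It evaluates T(n) and S(n) exactly as defined on this slide and shows that the speed-up flattens out near 1/B no matter how many parallel resources are added.

```python
def amdahl_time(t1: float, serial_fraction: float, n: int) -> float:
    """T(n) = T(1) * (B + (1 - B) / n), with B the non-parallelizable portion."""
    if not 0.0 < serial_fraction <= 1.0:
        raise ValueError("serial_fraction (B) must be in (0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return t1 * (serial_fraction + (1.0 - serial_fraction) / n)


def amdahl_speedup(serial_fraction: float, n: int) -> float:
    """S(n) = T(1) / T(n) = 1 / (B + (1 - B) / n)."""
    return 1.0 / amdahl_time(1.0, serial_fraction, n)


# Illustrative assumption: 10% of the work must be done serially (B = 0.1).
B = 0.1
speedups = {n: round(amdahl_speedup(B, n), 2) for n in (1, 2, 10, 100, 1000)}
upper_bound = 1.0 / B  # the speed-up approaches this value as n grows

for n, s in speedups.items():
    print(f"n = {n:4d}  S(n) = {s}")
print(f"limit as n grows: {upper_bound}")
```

With B = 0.1, going from 100 to 1000 parallel resources barely changes the speed-up, which is the practical point behind "understand what parts have to be done serially."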

12/37


The questions changed.

We knew that we didn’t know.

Our questions and our data changed. RDBMS had limitations:

Supported ad hoc questions on predefined data

Didn’t support undefined or unstructured data

Could scale up but not out, so database size was practically limited

SQL predicate calculus made logic awkward

RDBMS are very, very good at some things, but user needs were changing.

13/37


The questions changed.

What happens when we ask a different question??

When the RDBMS database was designed, we thought we knew what we wanted to know. That was then.

Now suppose we want to look at family relationships (parent, child, sibling, extended family, etc.):

We can add a column to the table for up/down relationships

We can add a column for side-to-side relationships

We can add a column for extended family relationships

The database doesn’t look like how we think about the problem.

When the data representation doesn’t match how we think, then something has to change.
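
To make the mismatch concrete, here is a small sketch using Python's built-in sqlite3 module. The table name `person`, its columns, and the sample names are hypothetical and chosen only for illustration. A single `parent_id` column covers the up/down relationship, but "who are the siblings?" already needs a self-join, and "who are the ancestors?" needs a recursive query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER)"
)
# Hypothetical family: Ann is the parent of Bob and Cat; Bob is the parent of Dan.
conn.executemany(
    "INSERT INTO person (id, name, parent_id) VALUES (?, ?, ?)",
    [(1, "Ann", None), (2, "Bob", 1), (3, "Cat", 1), (4, "Dan", 2)],
)

# Side-to-side relationship: join the table to itself on a shared parent.
siblings = conn.execute(
    """
    SELECT a.name, b.name
    FROM person AS a
    JOIN person AS b ON a.parent_id = b.parent_id AND a.id < b.id
    ORDER BY a.name
    """
).fetchall()

# Up relationship over any number of generations: a recursive query.
ancestors = conn.execute(
    """
    WITH RECURSIVE up(id, name, parent_id) AS (
        SELECT id, name, parent_id FROM person WHERE name = 'Dan'
        UNION ALL
        SELECT p.id, p.name, p.parent_id
        FROM person AS p JOIN up ON p.id = up.parent_id
    )
    SELECT name FROM up WHERE name <> 'Dan'
    """
).fetchall()

print(siblings)
print(ancestors)
```

The queries work, but they describe joins and recursion rather than "family," which is the gap between representation and thinking that this slide points at.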

14/37


A collection of different database layouts.

A RDBMS

Can add well-formed data easily

Difficult to add new data fields or types

Each row is expected to have the same data

Supports unknown (ad hoc) queries well

Scales up, not out

Popular RDBMS: Oracle, MySQL, MS SQL Server, PostgreSQL

The “King of the World” for a very long time. (A version lives in your phone.)

15/37



A columnar database

Takes the idea of a row-oriented database and turns it on its side.

Can add new columns easily

Each row can use different columns

Scales up and out

Popular column-oriented databases: IBM DB2, Sybase IQ, Teradata

Image from [2].
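
The idea of turning a row store on its side can be sketched with plain Python dictionaries. This is an illustration of the layout only, not how any of the products named above store data; the records and column names are made up.

```python
# Row-oriented view: one record per row, every row carries its own fields.
rows = [
    {"id": 1, "name": "Ann", "city": "Norfolk"},
    {"id": 2, "name": "Bob"},  # this row has no "city" value
]

# Column-oriented view: one mapping per column, keyed by row id.
columns = {}
for row in rows:
    row_key = row["id"]
    for column_name, value in row.items():
        if column_name != "id":
            columns.setdefault(column_name, {})[row_key] = value

# Adding a new column touches only the rows that actually use it.
columns.setdefault("email", {})[2] = "bob@example.com"

# Reading one column does not have to visit the other columns at all.
all_names = list(columns["name"].values())
print(columns)
print(all_names)
```

Each row ends up using only the columns it needs, and a new column can appear without rewriting existing rows.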

16/37



A Key-Value design

A number (called the key) locates all other data (the value[s]).

Use math on some data (may be more than one piece)

The math (hash function) returns one value (the key)

Use the key to find the rest of the data

Locating data can be fast

Hash function should return unique values

Popular Key-Value DBMS: Redis, Memcached, Amazon DynamoDB, Riak

Key-value databases are fast when using the hash function. Not so fast if you aren’t.
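
A minimal sketch of the steps above, using only the Python standard library. The helper name `make_key`, the choice of SHA-256, and the sample session data are assumptions made for illustration; real key-value stores choose their own hash functions and storage.

```python
import hashlib


def make_key(*parts: str) -> int:
    """Do 'math' on one or more pieces of data and return a single number."""
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


store = {}  # key (a number) -> value (everything else)

key = make_key("session", "abc123")
store[key] = {"cart": ["book", "pen"]}

# Fast path: recompute the key from the same data and go straight to the value.
by_key = store[make_key("session", "abc123")]

# Slow path: without the key, every stored value has to be inspected.
by_scan = [value for value in store.values() if "pen" in value["cart"]]

print(by_key)
print(by_scan)
```

The lookup by key is a single step, while the scan grows with the size of the store, which is the trade-off noted at the bottom of the slide.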

17/37



An Online Analytical Processing (OLAP) design

A way to visualize and analyze data using a “data cube” and basic functions:

1 Consolidation (roll-up) of the multi-dimensional data

2 Drill-down into the data

3 Slicing and dicing

Fast execution time

Incorporates aspects of navigational, hierarchical, and relational databases

Popular OLAP databases: Hyperion Solutions, Cognos, MicroStrategy, Applix

Image from [15].

Target users are business analysts and business process management.
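
The three basic functions can be illustrated on a tiny, made-up data cube. The dimensions (year, region, product), the sales figures, and the helper names `roll_up` and `slice_cube` are assumptions for this sketch; OLAP products precompute and index these aggregates rather than recomputing them on every call.

```python
from collections import defaultdict

# Each cell of the cube: (year, region, product) -> units sold (made-up data).
cube = {
    (2014, "East", "books"): 10,
    (2014, "East", "pens"): 5,
    (2014, "West", "books"): 7,
    (2015, "East", "books"): 12,
    (2015, "West", "pens"): 3,
}


def roll_up(cells, keep):
    """Consolidate: sum over every dimension whose index is not in keep."""
    totals = defaultdict(int)
    for dims, value in cells.items():
        totals[tuple(dims[i] for i in keep)] += value
    return dict(totals)


def slice_cube(cells, axis, value):
    """Slice: keep only the cells where one dimension has a fixed value."""
    return {dims: v for dims, v in cells.items() if dims[axis] == value}


by_year = roll_up(cube, keep=[0])            # roll-up to one dimension
by_year_region = roll_up(cube, keep=[0, 1])  # drill-down: add region back in
east_only = slice_cube(cube, axis=1, value="East")

print(by_year)
print(by_year_region)
print(east_only)
```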

18/37



A Graph design

A very different way to think about data.

Consists of two parts:

1 Nodes (something that exists as an entity in the database)

2 Arcs (something that describes a relationship between nodes)

You can have nodes without arcs. You cannot have arcs without nodes. Arcs can be unidirectional.

Popular graph databases: Neo4j, OrientDB, Titan, Giraph

Image from [6].

Questions are driven by the relationships between nodes rather than the nodes themselves.
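
The node and arc rules above can be sketched in a few lines of Python. The node names, relationship labels, and helper functions are hypothetical and do not correspond to the API of any of the graph databases listed.

```python
nodes = {"Ann", "Bob", "Cat", "Dan"}
arcs = []  # each arc is (source, relationship, target) and is directed


def add_arc(source, relationship, target):
    """An arc may only connect nodes that already exist."""
    if source not in nodes or target not in nodes:
        raise ValueError("an arc needs two existing nodes")
    arcs.append((source, relationship, target))


def related(source, relationship):
    """Answer a question by following arcs, not by inspecting node contents."""
    return sorted(t for s, r, t in arcs if s == source and r == relationship)


add_arc("Ann", "PARENT_OF", "Bob")
add_arc("Ann", "PARENT_OF", "Cat")
add_arc("Bob", "FRIEND_OF", "Cat")

children_of_ann = related("Ann", "PARENT_OF")
# Nodes without arcs are allowed: Dan exists but is connected to nothing.
isolated = nodes - {n for s, _, t in arcs for n in (s, t)}

print(children_of_ann)
print(isolated)
```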

19/37



A document design

Document-oriented databases can be “viewed,” and can have internal document databases (recursively).

Database is organized based on “tags”

A tag’s meaning is instance dependent

Tags can be nested (recursively)

Database structure may be XML based and represented in different ways

Popular document databases: MongoDB, CouchDB, Couchbase, MarkLogic

Sometimes document databases show up in unexpected places.
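
A short sketch of nested, instance-dependent tags, using JSON as the markup and only the Python standard library. The document contents and the helper `find_tag` are made up for illustration; document databases provide their own query languages for this.

```python
import json

# A made-up document: the tag "name" appears at two levels with two meanings.
doc = json.loads(
    """
    {"order": {"id": 17,
               "customer": {"name": "Ann"},
               "items": [{"name": "book", "qty": 1},
                         {"name": "pen", "qty": 3}]}}
    """
)


def find_tag(node, tag):
    """Yield every value stored under tag, at any nesting depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == tag:
                yield value
            yield from find_tag(value, tag)
    elif isinstance(node, list):
        for item in node:
            yield from find_tag(item, tag)


names = list(find_tag(doc, "name"))
quantities = list(find_tag(doc, "qty"))
print(names)
print(quantities)
```

The same tag, "name", labels a customer in one place and an item in another, so what a tag means depends on where it sits in the document.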

20/37


Which design to use?

If I had a hammer, . . .

Questions to ask:

1 How much data will be in the database??

2 Will I be reading mostly??

3 Will I be writing mostly??

4 How accurate must the data be??

5 How many simultaneous readers and writers??

6 How robust/resilient must the database be??

7 How will the database be accessed??

8 What about ACID vs. BASE??

So many choices.

21/37



ACID vs. BASE

One is a design principle, the other is counter marketing.

ACID [5]

1 A – Atomicity: all or nothing

2 C – Consistency: the database is always valid

3 I – Isolation: concurrent operations give the same result as serial operations

4 D – Durability: the database is written to disk

A database action will complete completely.

BASE [12]

1 BA – Basically Available

2 S – Soft state: the user guarantees consistency

3 E – Eventually consistent

A database action will probably complete eventually.

ACID comes with SQL. BASE comes with NoSQL.
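
The "all or nothing" and "always valid" parts of ACID can be observed with Python's built-in sqlite3 module. This is a minimal sketch; the `account` table, the names, and the amounts are hypothetical. The second transfer violates a CHECK constraint halfway through, and the already-applied credit is rolled back with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"  # consistency rule: no overdrafts
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(connection, src, dst, amount):
    """Move amount from src to dst as one transaction: both updates or neither."""
    try:
        with connection:  # commits on success, rolls back if an exception escapes
            connection.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            connection.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = transfer(conn, "alice", "bob", 30)    # fits within alice's balance
second = transfer(conn, "alice", "bob", 500)  # would overdraw alice: rolled back
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(first, second, balances)
```

The failed transfer leaves no trace: bob's credit of 500 was applied inside the transaction and then undone when the debit broke the constraint.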

22/37



Consistency, Availability, Partition tolerance (CAP) Theorem

Sharing data in distributed systems is hard.

Data can be consistent across the system

Data can be available across the system

The system can continue to function if partitioned/split

You only get to choose two.

Image from [17].

An RDBMS on a single machine means partition is undefined. Distributed systems only get two.

23/37


Create — darkness was on the face of the deep.

Ex nihilo nihil fit (out of nothing, nothing comes).

The CRUD approach doesn’t say what happened before the C.

RDBMS

CREATE DATABASE db_name;

CREATE TABLE table_name (column_name1 data_type(size), column_name2 data_type(size), . . . );

Columnar

CREATE DATABASE

CREATE table_name, column_name1, column_name2, ...;

Key-Value, Graph, Document

CREATE DATABASE

CREATE table_name

Graph, Document

CREATE DATABASE

Image from [11].

Implementation agnostic.

24/37


Create — darkness was on the face of the deep.

Create an entry

RDBMS

INSERT INTO table_name VALUES (value1, value2, value3, ...);

Columnar

PUT table_name, row_name, column_name1:, “value”;

Key-Value

ADD table_name, key_value, value;

Graph

CREATE relationship_name, vertex_name1, vertex_name2

Document

INSERT table_name (GML/XML/JSON “marked up” data)

25/37


Report — databases aren’t much good if you can’t get stuff out.

Report/retrieve an entry from the database

RDBMS

SELECT column_name, column_name FROM table_name;

Columnar

GET table_name, row_name1:, column_name:;

Key-Value

GET table_name, key_value;

Graph (pipe operations)

GET VERTEX|EDGE FILTER(expression) (. . . )

Document

FIND document_id
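
For the RDBMS case, the generic create and report statements can be tried with Python's built-in sqlite3 module. This is a sketch under assumptions: the `purchase` table and its rows are made up, and SQLite has no CREATE DATABASE statement, so opening the connection stands in for that step.

```python
import sqlite3

# Opening a connection creates the (in-memory) database.
conn = sqlite3.connect(":memory:")

# CREATE TABLE table_name (column_name1 data_type, ...)
conn.execute("CREATE TABLE purchase (item TEXT, quantity INTEGER, price REAL)")

# INSERT INTO table_name VALUES (value1, value2, value3)
conn.execute("INSERT INTO purchase VALUES (?, ?, ?)", ("book", 1, 12.5))
conn.execute("INSERT INTO purchase VALUES (?, ?, ?)", ("pen", 3, 1.25))
conn.commit()

# SELECT column_name, column_name FROM table_name
rows = conn.execute("SELECT item, quantity FROM purchase ORDER BY item").fetchall()
print(rows)
```

The `?` placeholders pass the values separately from the SQL text, which is the usual way to supply data to sqlite3.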

26/37


Update — things change.

Update an entry

RDBMS

UPDATE table_name SET column1=value1, column2=value2, ... WHERE some_column=some_value;

Columnar

DELETE FROM table_name WHERE [expression];

PUT table_name, row_name, column_name1:, “value”;

Key-Value

SET table_name, key_value, value;

Graph

GET VERTEX | EDGE FILTER(expression) (. . . ) REMOVE property ADD property

Document

UPDATE document_id value (same format as CREATE)

27/37


Delete — to remove that which once was.

Delete an entry

RDBMS

DELETE FROM table_name WHERE some_column=some_value;

Columnar

DELETE FROM table_name WHERE [expression];

Key-Value

DROP table_name, key_value;

Graph

GET VERTEX|EDGE FILTER(expression) (. . . ) REMOVE

Document

REMOVE document_id value
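
The four generic Key-Value verbs from the CRUD slides (ADD, GET, SET, DROP) can be mimicked with a thin wrapper around a Python dictionary. The class `KeyValueTable` and its method names are hypothetical and mirror the slides' implementation-agnostic verbs, not the command set of any particular product.

```python
class KeyValueTable:
    """A toy key-value table exposing the slides' generic verbs."""

    def __init__(self):
        self._data = {}

    def add(self, key, value):
        """Create: store a value under a key that is not in use yet."""
        if key in self._data:
            raise KeyError(f"key already exists: {key!r}")
        self._data[key] = value

    def get(self, key):
        """Report: return the value stored under the key."""
        return self._data[key]

    def set(self, key, value):
        """Update: replace the value of an existing key."""
        if key not in self._data:
            raise KeyError(f"no such key: {key!r}")
        self._data[key] = value

    def drop(self, key):
        """Delete: remove the key and its value."""
        del self._data[key]

    def __contains__(self, key):
        return key in self._data


carts = KeyValueTable()
carts.add("session-42", ["book"])          # Create
created = carts.get("session-42")          # Report
carts.set("session-42", ["book", "pen"])   # Update
updated = carts.get("session-42")
carts.drop("session-42")                   # Delete
still_there = "session-42" in carts
print(created, updated, still_there)
```

Every operation needs the key; there is no verb here for "find the carts containing a pen," which is the limitation of the key-value layout noted earlier.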

28/37


Lots and they are hidden.

Shopping as an example

Firefox – SQLite for browser history

Shopping cart – Key-Value based on session ID

Recommended purchases – graph database

Credit card payment – SQL database

Excel record purchase – document

Save Excel file – hierarchical database

29/37


A continuum.

Things from a 50,000 foot perspective

[Figure: database types placed along two axes. Data axis: messy to neat and tidy. Queries axis: rigid to ad-hoc. Types plotted: free text, K-V, Doc., OLAP, Col., RDBMS.]

30/37


A continuum.

Notional strengths and weaknesses

[Table: database types (RDBMS, K-V, Col., Doc., Graph) rated against ACID, BASE, ad-hoc queries, ∆ hardware, and hardware failure. Legend: supported; not supported by data model; no statement.]

31/37


Where can I get these things??

Popular open source databases

RDBMS – MySQL, PostgreSQL, SQLite

Key-Value – Redis, Memcached, Riak

Columnar – HBase, Accumulo, Hypertable

Document – MongoDB, CouchDB, Couchbase

Graph – Neo4j, OrientDB, Titan

Image from [16].

Open source does not mean free; your time costs money.

32/37


In summary . . .

What can we say??

1 Each type of database design fills a specific need/niche.

2 Each type could do the work of the others

1 Each type has a data model tailored to its problem domain

2 Performance is tied to the hardware (CPU and I/O)

RDBMS has been the King for a long time. Expect it to remain so due to inertia.

33/37


NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence

by Sadalage and Fowler [14].

Book to be used and referred to during the course, ISBN 9780321826626.

34/37


Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement

by Redmond and Wilson [13].

A very nice and graspable tour of various NoSQL database types. Examples of each type are presented with exercises that can be completed in a weekend. Book to be used and referred to during the course, ISBN 9781934356920.

35/37


References I

[1] Gene M. Amdahl, Validity of the single processor approach to achieving large scale computing capabilities, Proceedings of the Spring Joint Computer Conference, ACM, 1967, pp. 483–485.

[2] Dale Anderson, Column oriented database technologies, http://www.dbbest.com/blog/column-oriented-database-technologies/, 2012.

[3] Edgar F. Codd, A relational model of data for large shared data banks, Communications of the ACM 13 (1970), no. 6, 377–387.

[4] Neal Ford, Polyglot programming, http://memeagora.blogspot.com/2006/12/polyglot-programming.html, 2006.

[5] Jim Gray, The transaction concept: Virtues and limitations, Very Large Databases, vol. 81, 1981, pp. 144–154.

36/37


References II

[6] Andy Hogg, Whiteboard it the power of graph databases, http://www.computerweekly.com/feature/Whiteboard-it-the-power-of-graph-databases, 2013.

[7] Doug Laney, 3D data management: Controlling data volume, velocity and variety, META Group Research Note 6 (2001).

[8] Abraham H. Maslow, The psychology of science, Henry Regency, 1966.

[9] Andrea Mauro, Storage scale-up vs. scale-out, http://vinfrastructure.it/2014/06/scale-out-vs-scale-in/, 2014.

[10] David Mertz, XML matters: Putting XML in context with hierarchical, relational, and object-oriented models, http://www.ibm.com/developerworks/library/x-matters8/, 2001.

37/37


References III

[11] Brian Panulla, If libraries were like relational databases, http://ghostednotes.com/2010/12/31/if-libraries-were-like-relational-databases, 2010.

[12] Dan Pritchett, BASE: An ACID alternative, Queue 6 (2008), no. 3, 48–55.

[13] Eric Redmond and Jim R. Wilson, Seven databases in seven weeks, Pragmatic Bookshelf, 2012.

[14] Pramod J. Sadalage and Martin Fowler, NoSQL distilled, Pearson Education, 2012.

[15] DatabaseJournal Staff, Examples of SQL Server implementations, Database Journal (2010).

[16] Wikipedia Staff, Database, https://en.wikipedia.org/wiki/Database, 2015.

[17] Saeid Zebardast, Said experts, http://blog.zebardast.ir/, 2015.
