NewSQL - Deliverance from BASE and back to SQL and ACID

There are a number of NewSQL products now on the market, such as VoltDB and Postgres-XL. These promise NoSQL performance and scalability, but with ACID and relational concepts implemented with ANSI SQL.

This session will cover why NoSQL came about, why it has had its day, and why NewSQL will become the backbone of the Enterprise for OLTP and Analytics.

Tony Rogerson, SQL Server MVP

@tonyrogerson

http://dataidol.com/tonyrogerson

Who am I?

Freelance SQL Server professional and Data Specialist

Fellow BCS, MSc in BI, PGCert in Data Science

28 years of development and database experience, 22 of them with SQL Server – starting out in 1986 with VSAM, System W, Application System, DB2 and Oracle, then crossing over to Client/Server and SQL Server from 4.21a in 1993

Awarded SQL Server MVP yearly since ’97

Founded UK SQL Server User Group back in ’99, founder member of DDD, SQL Bits, SQL Relay, SQL Santa

Interested in commodity based distributed processing of Data (naturally!)

Agenda

NoSQL

◦ Why the need?

◦ What products are available?

Transactions

◦ BASE

◦ ACID

SQL

◦ What is today’s SQL capable of?

◦ SQL Server performance – NoSQL required?

NewSQL

◦ SQL -> NoSQL -> NewSQL (distributed form of where we started)

◦ Distributed Data and ACID

Discussion

Not Only SQL (NoSQL)

WHY THE NEED?

Why the Need?

The year is 2001 and

◦ It’s that Big Data thing….

◦ Mainstream Relational Databases (that use SQL) are scale up

◦ More grunt required – buy a bigger box

◦ SAN based storage is ridiculously expensive and complicated, heavy TCO

Y2K + 1

◦ Developers twiddling their thumbs ;)

Web adoption accelerates

◦ Google, Yahoo, Amazon and the like are born

◦ MySQL does not scale – too inflexible

◦ Up front costs of kit for projects/business that may fail – need elasticity

http://www.tomshardware.co.uk/15-years-of-hard-drive-history-uk,review-1908-7.html

Products Available

Varied – type of NoSQL database

◦ Graph

◦ Key-Value

◦ Column store/Column Family

◦ Document Store

◦ Object

◦ Relational but without SQL

You name it and there is a product to do it

Performance Today [commodity]

[Chart: 64KiB, 100% Read; 100% sequential vs 100% random]

ACID

Atomicity

◦ The bounds of the transaction – everything within those bounds is a single unit of work

◦ All or nothing

Consistency

◦ Data must reside in the correct Domain of values

◦ Deferrable to the end of the unit of work

Isolation

◦ Changes are Isolated from other users

◦ Other connections cannot update what you have updated or are updating

◦ Multi-Version Concurrency Control (MVCC) – snapshots

◦ Locking

Durability

◦ In system failure your changes are still maintained – nothing is lost
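The "all or nothing" and "correct Domain of values" points can be seen in a few lines. This is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server; the `account` table, its CHECK constraint and the `transfer` helper are assumptions made up for illustration.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a failing debit has earlier work to undo.
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint (the column's domain) rejected the debit.
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM account"))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 50)")
conn.commit()
```

Calling `transfer(conn, 'alice', 'bob', 500)` credits bob, then fails on the debit; the rollback removes the credit as well, so neither balance changes.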

BASE (Basically Available, Soft-state, Eventually Consistent)

BASE is a transactional model of sorts (at the global level, rather than individual transactions)

Specific to Distributed database model

Basically Available – all or some of the system is available

[Diagram: Node 1, Node 2, Node 3]

Soft-state

Eventually Consistent

System may change over time [as replicas become up-to-date (consistent)]

[Diagram: Node 1, Node 2, Node 3; insert value ‘A’]
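A toy simulation of soft state and eventual consistency, assuming a three-node cluster where a write lands on one node and is queued for the others. The `Cluster` and `Replica` classes are hypothetical and do not model any particular product.

```python
from collections import deque

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}

class Cluster:
    """A write lands on one node and is queued for the others (asynchronous)."""

    def __init__(self, names):
        self.replicas = {name: Replica(name) for name in names}
        self.pending = deque()  # (target node, key, value) not yet applied

    def write(self, node, key, value):
        self.replicas[node].data[key] = value
        for other in self.replicas:
            if other != node:
                self.pending.append((other, key, value))

    def read(self, node, key):
        return self.replicas[node].data.get(key)

    def replicate_one(self):
        """Apply one queued change; returns False when nothing is pending."""
        if not self.pending:
            return False
        target, key, value = self.pending.popleft()
        self.replicas[target].data[key] = value
        return True

    def is_consistent(self, key):
        # Consistent when every replica reports the same value for the key.
        return len({r.data.get(key) for r in self.replicas.values()}) == 1
```

Right after `write("Node 1", "k", "A")`, a read from Node 2 still returns nothing; once `replicate_one()` has drained the queue, all three nodes agree.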

Eventual Consistency in SQL Server

Asynchronous Availability Groups/Database Mirroring

Replication

Eventual / Causal Consistency

◦ Eventual no good for order specific [and important] transactions

◦ Like Merge replication

◦ Causal: deliver messages in correct order [e.g. service broker]

◦ Like Transactional Replication

ACID - Distributed

2PC is clunky and doesn’t scale across many nodes

Paxos – Consensus theory – scales better

Remove the need for distributed ACID altogether

[Diagram: 2PC Transaction; a Coordinator sends the INSERT to three Subordinates; all or nothing]
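The coordinator/subordinate exchange in a two-phase commit can be sketched as below. This is an in-process toy, not DTC or any real protocol implementation: the `Subordinate` class, its `can_commit` flag and `two_phase_insert` are hypothetical, and there is no network, logging or timeout handling.

```python
class Subordinate:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # hypothetical switch to simulate a failing node
        self.staged = None
        self.rows = []

    def prepare(self, row):
        """Phase 1: stage the row and vote yes or no."""
        if not self.can_commit:
            return False
        self.staged = row
        return True

    def commit(self):
        """Phase 2, success path: make the staged row permanent."""
        self.rows.append(self.staged)
        self.staged = None

    def rollback(self):
        """Phase 2, failure path: discard any staged work."""
        self.staged = None

def two_phase_insert(subordinates, row):
    """Coordinator: commit everywhere only if every subordinate votes yes."""
    votes = [s.prepare(row) for s in subordinates]
    if all(votes):
        for s in subordinates:
            s.commit()
        return True
    for s in subordinates:
        s.rollback()
    return False
```

Every insert costs two round trips to every subordinate, and a single "no" vote discards the work on all of them, which is the clunkiness the slide refers to.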

Mixing BASE and ACID

ACID applied on the local data node

BASE applied to remote nodes

Relational

Sets

Tables with Rows x Columns

Relational Theory dictates the row/column intersection is an Atomic value i.e. contains only a single value from the domain modelled for that column

Chris Date:

◦ Atomicity cannot really be defined as absolute in Normal Form

◦ a column can contain “relational values” i.e. another table

Normal Form – the process used to define the schema around the data being modelled

OldSQL roots

Built for disk storage

Built for single machine, scale-up

Mature SQL language (decades of research) over the Relational Model

SQL extensions to deal with unstructured data (freetext)

OldSQL today

ACI [no Durability]

In-Memory

Modified design to work with Flash

Still scale-up

SQL Server

Delayed / No-Durability in SQL Server 2014

In-Memory extensions

Entity Attribute Value design combined with ColumnStore

Sparse Columns / Column sets

DEMOS

NewSQL

OLDSQL -> SQL -> NEWSQL

Describe NewSQL

NewSQL = OldSQL + Transparent_Data_Distribution + ACID

Also – add in the bells and whistles for new tech

◦ Flash

◦ RAM

◦ Processor cache improvements

◦ Better parallelisation across local processor cores

Basically -> Scale out with ACID

Latency in a Distributed environment

[Diagram: six servers connected through a switch over 1Gbit ethernet; a SQL Server table with columns FirstName, Surname, DOB]

Query returns 20,000 rows, 558KiBytes of data

Fastest / Slower / Slowest (Data Travel)
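As a rough lower bound on "data travel", the time to push that 558KiBytes result set over a 1Gbit link can be worked out directly. The sketch assumes 1 KiByte = 1024 bytes and 1Gbit = 10^9 bits per second, and ignores protocol overhead, switch latency and round trips, so real figures will be higher.

```python
def wire_time_ms(payload_bytes, link_bits_per_second):
    """Time on the wire, in milliseconds, for a payload at a given line rate."""
    return payload_bytes * 8 / link_bits_per_second * 1000

# 558 KiBytes over 1 Gbit ethernet
result_ms = wire_time_ms(558 * 1024, 1_000_000_000)
print(f"{result_ms:.2f} ms")  # about 4.57 ms
```

A few milliseconds per query is small on its own, but it is paid on every call that has to leave the machine, which is the case for keeping data and processing together.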

Reduce Latency – Data Locality

[Diagram: servers connected through a switch over 1Gbit ethernet, several of them labelled SQL Server]

Distributed SQL with ACID

[Diagram: SQL Server on Server1 and SQL Server on Server2, connected through a switch over 1Gbit ethernet]

BEGIN DISTRIBUTED TRAN

INSERT Server3.pres_NEWSQL.dbo.people( ….. )
INSERT Server2.pres_NEWSQL.dbo.people( ….. )
INSERT Server1.pres_NEWSQL.dbo.people( ….. )

COMMIT TRAN

• 2 Phase Commit using DTC

• High Latency

• All or nothing

Querying a Distributed Environment

Financial Trading – Global position of the book

TOP 10 customers

Not easy (at speed) in an OLTP setting

[Diagram: nodes N1, N2, N3, N4 connected through a Network Switch]

Couple {Data, Processing} with {Machine-n}
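One way to answer a "TOP 10 customers" query while keeping processing next to the data is scatter-gather: each node aggregates and ranks its own rows, and a coordinator merges the small per-node results. The sketch below assumes every customer's rows live on exactly one node (for example, data sharded by customer); without that assumption, merging local top-n lists is not guaranteed to be correct. The function names are hypothetical.

```python
import heapq

def local_top_n(rows, n):
    """Runs on each node: total the local rows, return that node's top n."""
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0) + amount
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

def global_top_n(nodes, n):
    """Runs on the coordinator: merge at most n candidates from each node."""
    candidates = [pair for rows in nodes for pair in local_top_n(rows, n)]
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])
```

Only `n` rows per node cross the network, rather than every row on every node.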

Partitioning

Chop big table up into “horizontal partitions”

Partition key required (Hash, Modulo, Key range)

Each partition is self-contained binding rows by the partitioning key

Access all data through logical view over all partitions (local database)

Table by table basis
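The three partition key strategies can each be written as a small function from key to partition number. This is an illustrative sketch only: the function names are made up, and CRC32 stands in for whatever hash a real product uses.

```python
import bisect
import zlib

def by_modulo(key, partitions):
    """Integer keys: partition is the remainder after division."""
    return key % partitions

def by_hash(key, partitions):
    """String keys: hash deterministically, then take the remainder."""
    return zlib.crc32(key.encode("utf-8")) % partitions

def by_range(key, upper_bounds):
    """Sorted upper bounds: partition i holds keys <= upper_bounds[i];
    keys above the last bound fall into one extra, final partition."""
    return bisect.bisect_left(upper_bounds, key)
```

For example, with `upper_bounds = [100, 200]` keys up to 100 go to partition 0, keys from 101 to 200 go to partition 1, and everything larger goes to partition 2.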

Shared Nothing

Partitioning+

Each Shard is self-contained and has all the procs, meta-data and of course your partition of data

Shard Key common to multiple tables, for example CustomerID, Email Address.

Greater autonomy across the distributed database

Seeing the entire database as a logical unit is more difficult – joining is a nightmare

[Diagram: Node 1, Node 2, Node 3]

Data Distribution using Hashing

Distributed Database Cluster has a fixed number of data nodes

Your data is spread across the database cluster

◦ 10 node cluster; each data item may reside on 3 nodes

◦ Which 3 nodes?

Data key is Hashed to a number – hashing algorithm is deterministic

data-node = f( data-key )

◦ print ( checksum( 'All hale to the ale' ) * 1.) % 10

◦ print ( checksum( 'And a glass of wine for the ladies' ) * 1.) % 10
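One possible answer to "which 3 nodes?" is sketched below: hash the key to a primary node, then place the copies on the next nodes in sequence. Both choices are assumptions for illustration. SHA-256 is used here instead of T-SQL's checksum, so the node numbers will not match the prints above, and consecutive placement is only one of several schemes a product might use.

```python
import hashlib

def replica_nodes(data_key, node_count=10, copies=3):
    """Deterministically map a key to `copies` distinct node numbers."""
    digest = hashlib.sha256(data_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    primary = int.from_bytes(digest[:8], "big") % node_count
    # Assumed placement rule: the primary plus the following nodes, wrapping round.
    return [(primary + i) % node_count for i in range(copies)]

print(replica_nodes("All hale to the ale"))
print(replica_nodes("And a glass of wine for the ladies"))
```

Because the function is deterministic, any client or node can work out where a key lives without asking a central directory. Python's built-in `hash()` is avoided because it is randomised per process for strings.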

Sharding Sync

[Diagram: Apps pick a node of the LOGICAL DATABASE (Node 1, Node 2, Node 3); labels: Full copy of data, Subset of data, Replication]

Postgres-XC

Coordinators (plans, 2pc trans, knows about data distribution)

Applications (issue SQL to coordinators)

Data Nodes

GTM (Global Transaction Manager)

http://de.slideshare.net/PavanDeolasee/postgresxc-28475161

Combine Sharding + Replication

Shard your big tables based on a hash (or something) around your business key e.g. Customer, EmailAddress etc.

Replicate static tables.
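A minimal routing sketch for the combined approach, assuming a three-node cluster, CRC32 as the shard hash, and two made-up static tables (`country`, `currency`). None of these names come from a real product.

```python
import zlib

NODES = ["Node 1", "Node 2", "Node 3"]
REPLICATED_TABLES = {"country", "currency"}  # hypothetical static tables

def shard_node(shard_key):
    """Sharded tables: the business key decides the single owning node."""
    return NODES[zlib.crc32(shard_key.encode("utf-8")) % len(NODES)]

def nodes_for_write(table, shard_key=None):
    if table in REPLICATED_TABLES:
        return list(NODES)  # every node keeps a full copy
    if shard_key is None:
        raise ValueError("a sharded table needs a shard key for writes")
    return [shard_node(shard_key)]

def nodes_for_read(table, shard_key=None, local_node="Node 1"):
    if table in REPLICATED_TABLES:
        return [local_node]  # any copy will do, so stay local
    if shard_key is None:
        return list(NODES)  # no key: the query must visit every shard
    return [shard_node(shard_key)]
```

Joins between a sharded table and a replicated one can then run entirely on the node that owns the shard, since the static data is already there.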

Discussion

@tonyrogerson




Date post:	08-Jul-2015
Category:	Data & Analytics
Upload:	tony-rogerson