Date post: | 08-Jul-2015 |
Category: |
Data & Analytics |
Upload: | tony-rogerson |
NewSQL - Deliverance from BASE and back to SQL and ACID
There are a number of NewSQL products now on the market, such as VoltDB and Postgres-XL. These promise NoSQL performance and scalability, but with ACID and relational concepts implemented with ANSI SQL.
This session will cover why NoSQL came about, why it's had its day and why NewSQL will become the backbone of the Enterprise for OLTP and Analytics.
Tony Rogerson, SQL Server MVP
@tonyrogerson
http://dataidol.com/tonyrogerson
Who am I?
Freelance SQL Server professional and Data Specialist
Fellow BCS, MSc in BI, PGCert in Data Science
28 years of development and database experience, 22 of them with SQL Server – starting out in 1986 with VSAM, System W, Application System, DB2 and Oracle, crossing over to Client/Server and SQL Server since 4.21a in 1993
Awarded SQL Server MVP yearly since 97
Founded UK SQL Server User Group back in ’99, founder member of DDD, SQL Bits, SQL Relay, SQL Santa
Interested in commodity based distributed processing of Data (naturally!)
Agenda
NoSQL
◦ Why the need?
◦ What products are available?
Transactions
◦ BASE
◦ ACID
SQL
◦ What is today’s SQL capable of?
◦ SQL Server performance – NoSQL required?
NewSQL
◦ SQL -> NoSQL -> NewSQL (distributed form of where we started)
◦ Distributed Data and ACID
Discussion
Not Only SQL (NoSQL)
WHY THE NEED?
Why the Need?
The year is 2001 and
◦ It’s that Big Data thing….
◦ Mainstream Relational Databases (that use SQL) are scale up
◦ More grunt required – buy a bigger box
◦ SAN based storage is ridiculously expensive and complicated, heavy TCO
Y2K + 1
◦ Developers twiddling their thumbs ;)
Web adoption accelerates
◦ Google, Yahoo, Amazon and the like are born
◦ MySQL does not scale – too inflexible
◦ Up front costs of kit for projects/business that may fail – need elasticity
http://www.tomshardware.co.uk/15-years-of-hard-drive-history-uk,review-1908-7.html
Products Available
Varied – type of NoSQL database
◦ Graph
◦ Key-Value
◦ Column store/Column Family
◦ Document Store
◦ Object
◦ Relational but without SQL
You name it and there is a product to do it
Performance Today [commodity]
[Chart: 64KiB, 100% read; 100% sequential vs. 100% random]
ACID
Atomicity
◦ The bounds of the transaction – everything within those bounds is a single unit of work
◦ All or nothing
Consistency
◦ Data must reside in the correct Domain of values
◦ Deferrable to the end of the unit of work
Isolation
◦ Changes are Isolated from other users
◦ Other connections cannot update what you have updated or are updating
◦ Multi-Version Concurrency Control (MVCC) – snapshots
◦ Locking
Durability
◦ In system failure your changes are still maintained – nothing is lost
BASE (Basically Available, Soft-state, Eventually Consistent)
BASE is a transactional model of sorts (applied at the global level, rather than to individual transactions)
Specific to Distributed database model
Basically Available – all or some of the system is available
Node 1 Node 2 Node 3
BASE (Basically Available, Soft-state, Eventually Consistent)
Soft-state
Eventually Consistent
System may change over time [as replicas become up-to-date (consistent)]
Node 1 Node 2 Node 3
Insert value ‘A’
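The soft-state and eventually consistent behaviour on this slide can be illustrated with a toy model. This is a minimal Python sketch, not how any particular product replicates: the class names `Replica` and `EventuallyConsistentStore` and the explicit `sync()` step are assumptions made for the illustration.

```python
from collections import deque


class Replica:
    """One data node holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class EventuallyConsistentStore:
    """Writes land on the first node; the others catch up when sync() runs."""

    def __init__(self, node_names):
        self.replicas = [Replica(n) for n in node_names]
        self.pending = deque()  # changes not yet applied to every replica

    def write(self, key, value):
        # Acknowledge as soon as the first node has the value.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, node_index, key):
        return self.replicas[node_index].data.get(key)

    def sync(self):
        # Apply queued changes, in order, to the remaining replicas.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas[1:]:
                replica.data[key] = value


store = EventuallyConsistentStore(["Node 1", "Node 2", "Node 3"])
store.write("k", "A")
before = [store.read(i, "k") for i in range(3)]  # soft state: nodes disagree
store.sync()
after = [store.read(i, "k") for i in range(3)]   # replicas have converged
print(before, after)
```

Between `write` and `sync` a read can return a different answer depending on which node serves it; that window is the "soft state".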
Eventual Consistency in SQL Server
Asynchronous Availability Groups/Database Mirroring
Replication
Eventual / Causal Consistency
◦ Eventual: no good for order-specific [and important] transactions
◦ Like Merge replication
◦ Causal: deliver messages in correct order [e.g. service broker]
◦ Like Transactional Replication
ACID - Distributed
2PC (two-phase commit) is clunky and doesn’t scale across many nodes
Paxos – consensus protocol – scales better
Remove the need for distributed ACID altogether
[Diagram: 2PC Transaction – a Coordinator sends an INSERT to three Subordinates; all or nothing]
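The coordinator/subordinate exchange in a two-phase commit can be sketched in a few lines. This is a simplified in-memory Python illustration under stated assumptions: the names `Subordinate` and `two_phase_insert` are hypothetical, and timeouts, logging and recovery (which real 2PC implementations need) are left out.

```python
class Subordinate:
    """A participant node; can_commit=False simulates a node that votes no."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.staged = None
        self.rows = []

    def prepare(self, row):
        # Phase 1: stage the work and vote yes or no.
        if not self.can_commit:
            return False
        self.staged = row
        return True

    def commit(self):
        # Phase 2 (all voted yes): make the staged row permanent.
        self.rows.append(self.staged)
        self.staged = None

    def rollback(self):
        # Phase 2 (someone voted no): discard the staged row.
        self.staged = None


def two_phase_insert(subordinates, row):
    """Coordinator logic: commit everywhere or nowhere."""
    votes = [s.prepare(row) for s in subordinates]
    if all(votes):
        for s in subordinates:
            s.commit()
        return True
    for s in subordinates:
        s.rollback()
    return False


nodes = [Subordinate("S1"), Subordinate("S2"), Subordinate("S3")]
print(two_phase_insert(nodes, {"id": 1}))
```

Every insert costs two round trips to every participant, and one slow or unavailable node holds up the rest, which is one way to read "clunky".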
Mixing BASE and ACID
ACID applied on the local data node
BASE applied to remote nodes
Relational
Sets
Tables with Rows x Columns
Relational Theory dictates the row/column intersection is an Atomic value i.e. contains only a single value from the domain modelled for that column
Chris Date:
◦ Atomicity cannot really be defined as absolute in Normal Form
◦ a column can contain “relational values” i.e. another table
Normal Form – the process used to define the schema around the data being modelled
OldSQL roots
Built for disk storage
Built for single machine, scale-up
Mature SQL language (decades of research) over the Relational Model
SQL extensions to deal with unstructured data (freetext)
OldSQL today
ACI [no Durability]
In-Memory
Modified design to work with Flash
Still scale-up
SQL Server
Delayed / No-Durability in SQL Server 2014
In-Memory extensions
Entity Attribute Value design combined with ColumnStore
Sparse Columns / Column sets
DEMOS
NewSQL
OLDSQL -> SQL -> NEWSQL
Describe NewSQL
NewSQL = OldSQL + Transparent_Data_Distribution + ACID
Also – add in the bells and whistles for new tech
◦ Flash
◦ RAM
◦ Processor cache improvements
◦ Better parallelisation across local processor cores
Basically -> Scale out with ACID
Latency in a Distributed environment
[Diagram: servers connected through a switch over 1Gbit ethernet; one runs SQL Server with a table of FirstName, Surname, DOB]
Query returns 20,000 rows, 558 KiBytes of data
Fastest / Slower / Slowest (Data Travel)
Reduce Latency – Data Locality
[Diagram: SQL Server instances on servers connected through a switch over 1Gbit ethernet]
Distributed SQL with ACID
[Diagram: SQL Server on Server1 and Server2, connected through a switch over 1Gbit ethernet]
BEGIN DISTRIBUTED TRAN
INSERT Server3.pres_NEWSQL.dbo.people( ….. )
INSERT Server2.pres_NEWSQL.dbo.people( ….. )
INSERT Server1.pres_NEWSQL.dbo.people( ….. )
COMMIT TRAN
• 2 Phase Commit using DTC
• High Latency
• All or nothing
Querying a Distributed Environment
Financial Trading – Global position of the book
TOP 10 customers
Not easy (at speed) in an OLTP setting
[Diagram: nodes N1, N2, N3, N4 connected through a network switch]
Couple {Data, Processing} with {Machine-n}
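Coupling processing with the machine that holds the data suggests a scatter/gather shape for a query such as TOP 10 customers: each node ranks its own rows and only the small per-node results travel across the switch. The Python sketch below is an illustration only. It assumes every customer lives on exactly one node (so local totals are complete), and the function names `local_top` and `global_top` are hypothetical.

```python
import heapq


def local_top(node_rows, n):
    """Runs on a data node: the n largest (customer, total) pairs held locally."""
    return heapq.nlargest(n, node_rows.items(), key=lambda kv: kv[1])


def global_top(nodes, n=10):
    """Runs on the coordinator: merge the per-node results into a global top n."""
    candidates = []
    for node_rows in nodes:
        # At most n rows per node cross the network, not the whole table.
        candidates.extend(local_top(node_rows, n))
    return heapq.nlargest(n, candidates, key=lambda kv: kv[1])


# Four nodes (N1..N4), each holding totals for its own customers.
nodes = [
    {"c1": 50, "c2": 10},
    {"c3": 70},
    {"c4": 5, "c5": 60},
    {"c6": 20},
]
print(global_top(nodes, n=2))
```

If a customer's rows were spread over several nodes, local rankings would no longer be safe to merge this way and totals would have to be combined first, which is part of why this is hard at speed.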
Partitioning
Chop big table up into “horizontal partitions”
Partition key required (Hash, Modulo, Key range)
Each partition is self-contained binding rows by the partitioning key
Access all data through logical view over all partitions (local database)
Table by table basis
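The three partition key strategies named above can each be written as a small function from key to partition number. This is a minimal Python sketch; the function names are hypothetical, and CRC32 is used only as a convenient deterministic hash, not because any specific database uses it.

```python
import bisect
import zlib


def by_modulo(key, partitions):
    """Modulo: integer key -> partition number."""
    return key % partitions


def by_hash(key, partitions):
    """Hash: any string key -> partition number via a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def by_key_range(key, upper_bounds):
    """Key range: upper_bounds is a sorted list of exclusive upper limits.

    Keys at or above the last bound fall into one final partition.
    """
    return bisect.bisect_right(upper_bounds, key)


print(by_modulo(17, 4))                  # customer id 17 across 4 partitions
print(by_hash("alice@example.com", 4))   # hypothetical email key
print(by_key_range(1500, [1000, 2000]))  # ranges: <1000, 1000-1999, >=2000
```

Modulo and hash spread rows evenly but scatter neighbouring keys; key ranges keep neighbours together at the risk of one hot partition.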
Shared Nothing
Partitioning+
Each Shard is self-contained and has all the procs, meta-data and of course your partition of data
Shard Key common to multiple tables, for example CustomerID, Email Address.
Greater autonomy across the distributed database
Seeing the entire database as a logical unit is more difficult – joining is a nightmare
Node 1
Node 2
Node 3
Data Distribution using Hashing
Distributed Database Cluster has fixed number of data nodes
Your data is spread across the database cluster
◦ 10 node cluster; each data item may reside on 3 nodes
◦ Which 3 nodes?
Data key is Hashed to a number – hashing algorithm is deterministic
data-node = f( data-key )
◦ print ( checksum( 'All hale to the ale' ) * 1.) % 10
◦ print ( checksum( 'And a glass of wine for the ladies' ) * 1.) % 10
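One possible answer to "which 3 nodes?" is to hash the key to a first node and place the copies on the nodes that follow it. The Python sketch below is an assumption-laden illustration: SHA-256 stands in for the T-SQL `checksum` function (the two will not give the same numbers), and the "next nodes in the ring" placement rule and the name `replica_nodes` are hypothetical choices, not something the slide specifies.

```python
import hashlib


def replica_nodes(data_key, node_count=10, copies=3):
    """Deterministically map a data key to the nodes that hold its copies."""
    digest = hashlib.sha256(data_key.encode("utf-8")).digest()
    # data-node = f(data-key): first 8 bytes of the hash, modulo cluster size.
    first = int.from_bytes(digest[:8], "big") % node_count
    # Assumed placement rule: copies go on the next nodes, wrapping around.
    return [(first + i) % node_count for i in range(copies)]


print(replica_nodes("All hale to the ale"))
print(replica_nodes("And a glass of wine for the ladies"))
```

Because the hash is deterministic, any client or coordinator can compute the same node list without a lookup table. With a plain modulo the mapping changes for most keys if the node count changes, which fits the slide's point that the cluster has a fixed number of data nodes.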
Sharding Sync
[Diagram: Apps pick a node within the LOGICAL DATABASE (Node 1, Node 2, Node 3); labels: Full copy of data, Subset of data, Replication]
Postgres-XC
Coordinators (plans, 2PC transactions, know about data distribution)
Applications (issue SQL to coordinators)
Data Nodes
GTM (Global Transaction Manager)
http://de.slideshare.net/PavanDeolasee/postgresxc-28475161
Combine Sharding + Replication
Shard your big tables based on a hash (or something) around your business key e.g. Customer, EmailAddress etc.
Replicate static tables.
Discussion
@tonyrogerson
http://dataidol.com/tonyrogerson