Date posted: 13-Apr-2017
Category: Technology
Uploaded by: david-simons
S Q L & N O S Q L
D a v i d S i m o n s @ S w a m W i t h Tu r t l e s
W H O A M I ?
• Tech Lead/Consultant at Softwire
• Background in Statistics & Computer Simulation
W H AT D O W E D O ?
• Business Analysis/Mapping
• Architecture
• Project Management
• Design (UI and User Workflows)
• Development
• QA
• Warranty
What problems are we solving?
How do we solve them?
Solving them now!
Are they still solving the problem?
T O D AY W E ’ R E G O I N G T O TA L K A B O U T
• Business Analysis/Mapping
• Architecture
• Project Management
• Design (UI and User Workflows)
• Development
• QA
• Warranty
H O W T O D O A R C H I T E C T U R E
E V O LV I N G D E S I G N
U P - F R O N T D E C I S I O N M A K I N G
T O D AY…
• Part 1: Looking at some SQL & Database Theory
• Part 2: Looking at a lot of NoSQL databases
W H AT I S A D ATA B A S E ?
PA R T 1 : T H E O R Y
- U N I V E R S I T Y O F G E O R G I A
“A database is a collection of information organized to provide efficient retrieval.”
T H E M Y T H I C A L D ATA B A S E D I V I D E
S Q L
N O S Q L
• NoSQL (apparently) has always meant Not Only SQL
• It means considering databases that don’t meet the SQL standard, which covers a wide range of databases
T H E S Q L S TA N D A R D
PA R T 1 : T H E O R Y
H I S T O R Y
• First defined by ANSI in 1986 (though around before then)
• Structured Query Language
• Different databases have implemented this standard way of storing, inserting and retrieving data
E X A M P L E S O F S Q L D ATA B A S E S
• MySQL
• Microsoft SQL Server
• Oracle
• PostgreSQL (mostly)
• IBM DB2 and more…
W H AT ’ S I N T H E S TA N D A R D ?
• Rules for how the language works
• No opinion as to what the database looks like
B U T…
• ‘SQL’ has come to mean a lot more than the language (especially in the context of NoSQL)
• Family of RDBMS databases that follow a set of rules
W H AT ’ S I N A N R D B M S ?
• Prescriptive Schema
• Set-based Operations
• Table-driven & Normalised
• ACID Transactions
S C H E M A D R I V E N
Name Species
S E T- B A S E D O P E R AT I O N
R E A D D A TA O U T W I T H
E V E R Y R O W I S A “ T H I N G ”
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
“ W H E R E ” ( I N T E R S E C T I O N )
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
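A minimal sketch of the “WHERE” slide, using Python's built-in sqlite3 module. The species values were icons on the slide and did not survive, so the ones below (and the table name pets) are assumptions made purely for illustration. The point is that the filter is applied to the whole set of rows at once rather than row by row.

```python
import sqlite3

# In-memory database; species values and the table name are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pets (id INTEGER PRIMARY KEY, name TEXT NOT NULL, species TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO pets (id, name, species) VALUES (?, ?, ?)",
    [(1, "Puss", "cat"), (2, "Dinah", "cat"), (3, "Einstein", "dog"), (4, "Jess", "cat")],
)


def names_of(species):
    """Return the names of every row whose species matches, in id order."""
    rows = conn.execute(
        "SELECT name FROM pets WHERE species = ? ORDER BY id", (species,)
    )
    return [name for (name,) in rows]


cats = names_of("cat")
```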
U N I O N S
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
5 Nemo
6 Moby Dick
7 Wanda
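The union slide can be sketched the same way. The split into two source tables (land_animals and sea_animals) is an assumption; the slide only shows the combined seven rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source tables; only the combined result appears on the slide.
conn.execute("CREATE TABLE land_animals (name TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE sea_animals (name TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO land_animals VALUES (?)",
    [("Puss",), ("Dinah",), ("Einstein",), ("Jess",)],
)
conn.executemany(
    "INSERT INTO sea_animals VALUES (?)",
    [("Nemo",), ("Moby Dick",), ("Wanda",)],
)

# UNION combines the two result sets into one (and removes duplicates).
everyone = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM land_animals UNION SELECT name FROM sea_animals ORDER BY name"
    )
]
```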
– R O N E R N E S T ( & T H E S Q L C O M M U N I T Y AT L A R G E )
“Cursors are evil.”
N O R M A L F O R M S
J O I N S
   Name      Species   Species Coolness Rating
1  Puss                0
2  Dinah               0
3  Einstein            10
4  Jess                0
R E L AT I O N S B E T W E E N D ATA
• We don’t like duplicating data
• Goes out of sync
• May not be the same everywhere
R E L AT I O N S B E T W E E N D ATA
• Objects have properties that come in groups
• For example: Landmarks have cities and countries.
• The same city will always have the same country
W E S O LV E T H AT W I T H …
• Normalisation
• Store each linked group as its own row in a separate table
• And store pointers to that table
• These are combined by query-time joins
J O I N S
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
Species Coolness Rating
1 0
2 10
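A sketch of the two tables above combined by a query-time join, again with sqlite3. The column names and the assignment of each animal to species 1 or 2 are assumptions, chosen so that the join reproduces the ratings on the earlier combined slide (0, 0, 10, 0).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE species (id INTEGER PRIMARY KEY, coolness_rating INTEGER NOT NULL);
    CREATE TABLE pets (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        species_id INTEGER NOT NULL REFERENCES species (id)  -- pointer to the other table
    );
    INSERT INTO species VALUES (1, 0), (2, 10);
    INSERT INTO pets VALUES (1, 'Puss', 1), (2, 'Dinah', 1), (3, 'Einstein', 2), (4, 'Jess', 1);
    """
)

JOIN_QUERY = """
    SELECT pets.name, species.coolness_rating
    FROM pets JOIN species ON species.id = pets.species_id
    ORDER BY pets.id
"""

joined = conn.execute(JOIN_QUERY).fetchall()

# The rating lives in exactly one row, so one update changes it everywhere.
conn.execute("UPDATE species SET coolness_rating = 5 WHERE id = 1")
after_update = conn.execute(JOIN_QUERY).fetchall()
```

Because the rating is stored once per species, it cannot go out of sync between animals of the same species.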
T R A N S A C T I O N S
W R I T E D A TA I N W I T H
“A unit of work you want to treat as a whole”
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
{ Donald, Pluto, Mickey }
Ducks aren’t mammals
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
The database is always in a valid state, as defined by a whole number of queries
regardless of: (1) invalid data;
(2) concurrent requests; (3) system failures
A C I D
• Atomicity
• Consistency
• Isolation
• Durability
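To make atomicity concrete, here is a small sketch of the Donald/Pluto/Mickey slide using sqlite3. The table name, the CHECK constraint and the species labels are assumptions; the idea is only that one invalid row causes the whole unit of work to be rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the constraint stands in for "this table only holds mammals".
conn.execute(
    "CREATE TABLE mammals ("
    " name TEXT PRIMARY KEY,"
    " species TEXT NOT NULL CHECK (species IN ('cat', 'dog', 'mouse')))"
)
conn.execute("INSERT INTO mammals VALUES ('Puss', 'cat')")
conn.commit()


def add_all(rows):
    """Insert every row in one transaction; return False if it was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany("INSERT INTO mammals VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError:
        return False


# Pluto and Mickey are valid, but Donald violates the constraint.
ok = add_all([("Pluto", "dog"), ("Mickey", "mouse"), ("Donald", "duck")])
names = [n for (n,) in conn.execute("SELECT name FROM mammals ORDER BY name")]
```

After the failed call the table holds only the row that was there before: none of the three new rows were kept, including the two valid ones.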
W H AT ’ S I N A N R D B M S ?
• Prescriptive Schema
• Set-based Operations
• Table-driven & Normalised
• ACID Transactions
C A PA C I T Y & S C A L A B I L I T Y
PA R T 1 : T H E O R Y
A S K I N G A S Y S T E M T O D O S O M E T H I N G U S E S R E S O U R C E S
W H AT H A P P E N S A S M O R E R E Q U E S T S C O M E I N ?
S Q L I S P R E T T Y G O O D F O R L A R G E A M O U N T S O F D ATA
T R U T H F U L LY
W I T H E N O U G H D ATA , Y O U H AV E T O S C A L E
T H E H A R D T R U T H
Y O U R C U R R E N T S Y S T E M
D ATA B A S E A P P L I C AT I O N
U S E R S
A S I T G R O W S
D ATA B A S E A P P L I C AT I O N
U S E R S
H O R I Z O N TA L S C A L A B I L I T Y
D ATA B A S E
A P P L I C AT I O N
U S E R S
D ATA B A S E
D ATA B A S E
V E R T I C A L S C A L A B I L I T Y
M O R E P O W E R F U L D ATA B A S E A P P L I C AT I O N
U S E R S
S Q L C A N S C A L E …
T H E H A R D T R U T H
S Q L C A N S C A L E V E R T I C A L LY
A N D …
• Scaling to meet the needs of read operations is very doable
• Master-Slave replication
B U T…
• Scaling writes is problematic
• How do atomic transactions work on a scaled database?
• How can SQL enforce constraints across multiple databases?
- J O E R I S E B R A C H T S
“To scale up write operations or the number of nodes in a cluster beyond a certain point you have
to be able to relax some of the ACID requirements”
T H E C A P T H E O R E M
PA R T 1 : T H E O R Y
T H E C O S T O F S C A L I N G
• You become vulnerable to network failures
C A P T H E O R E M
• Choose Two:
• Consistency
• Availability
• Partition Tolerance
• WARNING: These have specific definitions
P R O V I S O
There is a lot of thought in this area; I am giving a simplified description
that would make many database people pull their hair out.
https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
C A P T H E O R E M
CP: Consistent & Partition Tolerant
AP: Available & Partition Tolerant
C A P T H E O R E M
[Diagram, three frames: nodes A, B and C each hold Data = “Cat”; one node is then updated to Data = “Dog”; finally all three hold Data = “Dog”.]
A P S Y S T E M S
AVA I L A B L E ( “ A P ” ) S Y S T E M S
[Diagram, three frames: nodes A, B and C each hold Data = “Dog”; one node is then updated to Data = “Wolf”; finally two nodes hold “Wolf” while the third still holds “Dog”.]
C P S Y S T E M S
C O N S I S T E N T ( “ C P ” ) S Y S T E M
[Diagram, three frames: nodes A, B and C each hold Data = “Dog” in the first two frames; in the last frame two nodes hold “Wolf” and one still holds “Dog”.]
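The difference between the two kinds of system can be sketched with a toy model. This is a deliberately simplified illustration, not how any real database is implemented: the class, the choice of B as the cut-off node, and the reading that a CP node refuses to answer while an AP node answers with possibly stale data are all assumptions layered on top of the diagrams.

```python
class ToyCluster:
    """Three replicas of one value; a toy model, not a real database."""

    def __init__(self, mode, value):
        self.mode = mode  # "AP" or "CP"
        self.values = {"A": value, "B": value, "C": value}
        self.cut_off = set()  # replicas isolated by a network partition

    def partition(self, name):
        """Simulate a network failure that isolates one replica."""
        self.cut_off.add(name)

    def write(self, value):
        """Apply the write to every replica that can still be reached."""
        for name in self.values:
            if name not in self.cut_off:
                self.values[name] = value

    def read(self, name):
        """Ask one replica for the value; None means it refused to answer."""
        if name in self.cut_off and self.mode == "CP":
            return None  # consistent: no answer rather than a stale one
        return self.values[name]  # available: always answers, maybe stale


ap = ToyCluster("AP", "Dog")
ap.partition("B")
ap.write("Wolf")
ap_reads = {name: ap.read(name) for name in "ABC"}

cp = ToyCluster("CP", "Dog")
cp.partition("B")
cp.write("Wolf")
cp_reads = {name: cp.read(name) for name in "ABC"}
```

In both clusters the stored state ends up as Wolf, Dog, Wolf. What differs is what a client talking to the isolated node sees: the AP cluster answers “Dog”, the CP cluster declines to answer.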
part 1 done
What shape is your data?
Are you happy to pay?
What uses your data?
• Databases store data in an accessible way
• SQL databases meet a defined standard; NoSQL is a movement towards considering databases that don’t
• SQL uses tables and schemas to store data, and acts on it like sets in a transactional way.
I N C O N S I S T E N T D ATA B A S E S
PA R T 2 : E X A M P L E S
T H E R E ’ S A L O T O F VA L U E I N C O N S I S T E N C Y…
– D Y N A M O : A M A Z O N ’ S H I G H LY AVA I L A B L E K E Y- VA L U E S T O R E
“Reliability at massive scale is one of the biggest challenges we face at Amazon.com. Even the
slightest outage has significant financial consequences and impacts customer trust.”
– D Y N A M O : A M A Z O N ’ S H I G H LY AVA I L A B L E K E Y- VA L U E S T O R E
“Dynamo targets applications that operate with weaker consistency if this results in high
availability.”
D Y N A M O I M P L E M E N TAT I O N S
N O T G U A R A N T E E D C O N S I S T E N C Y
T H E C O S T ?
A M A Z O N S H O P P I N G
I S T H A T H O N E S T LY O K A Y ?
S M S H I S T O R I C L O G
I S T H A T H O N E S T LY O K A Y ?
D Y N A M O I M P L E M E N TAT I O N S
W E U S E D …
C A S S A N D R A
• All nodes communicate with each other through a Gossip protocol similar to Dynamo and Riak, exchanging information about themselves and other nodes they have gossiped with.
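As a rough illustration of the gossip idea (not Cassandra's actual protocol), the sketch below has each node start out knowing only itself and repeatedly exchange everything it knows with one peer. Real gossip protocols typically pick peers at random; this version uses a fixed ring neighbour so the result is repeatable, and the node names are made up.

```python
def gossip_until_converged(node_names):
    """Exchange membership knowledge until every node knows every other node.

    Returns the final knowledge map and the number of rounds it took.
    """
    everyone = set(node_names)
    knowledge = {name: {name} for name in node_names}  # each node knows itself
    rounds = 0
    while any(known != everyone for known in knowledge.values()):
        for i, name in enumerate(node_names):
            # Simplification: always gossip with the next node in the ring.
            peer = node_names[(i + 1) % len(node_names)]
            merged = knowledge[name] | knowledge[peer]
            # Both sides leave the exchange knowing the combined set,
            # including nodes the peer only heard about second-hand.
            knowledge[name] = set(merged)
            knowledge[peer] = set(merged)
        rounds += 1
    return knowledge, rounds


nodes = ["node-a", "node-b", "node-c", "node-d"]
final_knowledge, rounds_needed = gossip_until_converged(nodes)
```

No node acts as a coordinator here; information spreads purely through pairwise exchanges, which is what lets this style of cluster avoid a single point of failure.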
D Y N A M O I M P L E M E N TAT I O N S
C A S S A N D R A
No single point of failure
W H Y C A S S A N D R A
• We needed fast, highly available writes
• Data didn’t need to be real time - it was aggregate analytics so eventually consistent was enough.
C A S S A N D R A : T H E C O N S
• Data is only eventually consistent - so if you need 100% accuracy it’s not great
• Not as wide a range of support as SQL (but nothing has)
• Flexible schema makes it harder to integrate with OO languages
C A S S A N D R A : T H E P R O S
• Very fast write throughput
• SQL-like query language so you don’t need to relearn things
• Wide range of language drivers
• Highly available
H I G H LY R E L AT I O N A L D ATA
PA R T 2 : E X A M P L E S
E V E R Y R O W I S A “ T H I N G ”
Name Species
1 Puss
2 Dinah
3 Einstein
4 Jess
W H AT S Q L D O E S W E L L
• Modelling objects:
• With a fixed structure and shape
• With a limited number of relations
• With no opinion or knowledge of any deeper underlying domain
R D B M S ( R E L AT I O N A L D ATA B A S E M A N A G E M E N T S Y S T E M )
T H E R E A R E P R O B L E M S T H I S I S B A D F O R
B U T …
S I X D E G R E E S O F …
K E V I N B A C O N
E L E C T I O N D ATA
W O R L D ’ S L E A D I N G G R A P H D B : N E O 4 J
"embedded, disk-based, fully transactional Java persistence engine that stores data structured in
graphs rather than in tables"
D ATA S T O R A G E
• Nodes and edges are all:
• Stored as first-class objects on the file system
• “typed”
• Key-value stores
D ATA I N T H E R E L AT I O N S
• “Joins” are first class objects in the database that can be queried at no additional cost
• Certain queries become trivial (e.g. Joins)
• At a cost: high write-time cost
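To see why stored relations make this kind of query cheap, here is a plain-Python stand-in (not Neo4j's API) for the “six degrees” question. Each node holds direct references to its neighbours, so following a relation is a lookup rather than a join. The co-star graph and the actor placeholders are invented for illustration.

```python
from collections import deque


def degrees_of_separation(graph, start, goal):
    """Breadth-first search; returns the number of hops, or None if unconnected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


# Hypothetical "appeared in a film with" relations, stored in both directions.
co_stars = {
    "Kevin Bacon": ["Actor A"],
    "Actor A": ["Kevin Bacon", "Actor B"],
    "Actor B": ["Actor A", "Actor C"],
    "Actor C": ["Actor B"],
    "Actor D": [],
}

bacon_number = degrees_of_separation(co_stars, "Actor C", "Kevin Bacon")
```

In a table-based model each extra hop would typically mean another self-join on the relations table; here it is one more step along edges that already exist.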
P R O T O T Y P I N G
• Easy to see and work with data
• Schemaless
• Active community with a lot of libraries
N E O 4 J U S E R S
N E O 4 J : T H E C O N S
• More expensive writes to the database
• Not scalable
• Less mature tooling (especially in non-Java ecosystems)
N E O 4 J : T H E P R O S
• Models certain kinds of data very well
• Avoids costly join queries when working with lots of data
• Schemalessness allows for fast prototyping and flexible data models
• Commercial buy-in means language support is not far behind
S C H E M A L E S S N E S S
PA R T 2 : E X A M P L E S
NB: MongoDB claims there are a lot of use cases; we’re only covering this one
M O N G O D B : T H E C O N S
• Mongo was the first famous NoSQL database and got used before it was tested and mature. There are lots of articles about missing features and bugs
• Schemalessness makes data integrity checks and OO language integration tricky
M O N G O D B : T H E P R O S
• Schemalessness - if you want flexible data models
• People have used it for a while, and so library support is not bad
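A small sketch of what schemalessness means in practice, using plain Python dictionaries as stand-ins for stored documents (this is not MongoDB's API, and the fields are invented). Any shape of document can be stored, but the reading code has to cope with fields that may be missing, which is the integrity trade-off mentioned in the cons.

```python
# Three "documents" with different shapes; nothing enforces a common schema.
documents = [
    {"name": "Puss", "species": "cat"},
    {"name": "Einstein", "species": "dog", "tricks": ["fetch"]},
    {"name": "Nemo", "habitat": "reef"},  # no species field at all
]


def species_counts(docs):
    """Count documents per species, tolerating documents without the field."""
    counts = {}
    for doc in docs:
        species = doc.get("species", "unknown")  # the reader supplies the default
        counts[species] = counts.get(species, 0) + 1
    return counts


counts = species_counts(documents)
```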
H O W D O Y O U R E T R I E V E Y O U R D ATA
PA R T 2 : E X A M P L E S
F R E E - T E X T S E A R C H
D O C U M E N T S T O R E
ElasticSearch
E V E R Y R O W I S A “ T H I N G ”
N A M E = P U S S C O O L N E S S = 0
N A M E = J E S S C O O L N E S S = 0
N A M E = D I N A H C O O L N E S S = 0
N A M E = E I N S T E I N C O O L N E S S = 1 0
DOCUMENT
A PA C H E L U C E N E
“Apache Lucene is a high-performance, full-featured text search engine library … It is a
technology suitable for nearly any application that requires full-text search”
F O C U S E D A R O U N D T E X T S E A R C H I N G Q U E R I E S
Q U E R I E S A R E TA I L O R E D T O T H E Q U E S T I O N S Y O U ’ L L B E A S K I N G
{ "query": { "match": {"hobbies": "skateboard"} } }
{ "query": { "fuzzy": {"hobbies": "skateboarig"} } }
{ "query": { "match": {"hobbies": {"query": "writing reddit comments", "type": "phrase"}} } }
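To give a feel for what the fuzzy query is doing, here is a toy version in plain Python. It is a sketch of the idea (matching terms within a small edit distance), not ElasticSearch's implementation; the documents, the helper names and the limit of two edits are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete from a
                    current[j - 1] + 1,      # insert into a
                    previous[j - 1] + cost,  # substitute (or keep)
                )
            )
        previous = current
    return previous[-1]


def fuzzy_search(documents, field, term, max_edits=2):
    """Return names of documents with a word in `field` close to `term`."""
    hits = []
    for doc in documents:
        words = doc[field].lower().split()
        if any(edit_distance(term, word) <= max_edits for word in words):
            hits.append(doc["name"])
    return hits


# Hypothetical documents.
people = [
    {"name": "Ann", "hobbies": "skateboarding and chess"},
    {"name": "Bob", "hobbies": "writing reddit comments"},
]

matches = fuzzy_search(people, "hobbies", "skateboarig")
```

The misspelt “skateboarig” is two insertions away from “skateboarding”, so Ann's document matches even though no word matches exactly.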
W H AT C O N S U M E S Y O U R D ATA ?
E N D U S E R What is the average age of …?
W H AT C O N S U M E S Y O U R D ATA ?
E N D U S E R Er….
I think it was something like “Campbell”?
O U R C H O I C E I S I N F O R M E D B Y O U R P L A N S F O R T H E A P P L I C AT I O N
R E M E M B E R T H A T
E L A S T I C S E A R C H : T H E C O N S
• It only does one thing (even if it does it well)
E L A S T I C S E A R C H : T H E P R O S
• It has a lot of search related queries built into it - fuzzy/phonetic/sentence matching
• A lot of people use this, support is mature
• Integration with a large number of other languages and frameworks - this is the industry standard
W H E N I T G O E S W R O N G
PA R T 2 : E X A M P L E S
S Q L : T H E C O N S
• It’s very hard to scale writes
• It has a specific data model - not every data domain fits into it
• e.g. highly relational models, schemalessness
• Domain non-specific query languages
S Q L : T H E P R O S
• If a library exists for anything, it exists for SQL
• ACID transactions make everything easy
• Constraints and Schemas allow for automated data integrity checking
• Easy normalisation of data
part 2 done
What shape is your data?
Are you happy to pay?
What uses your data?
• Some sites are happy to sacrifice consistency for availability - Dynamo is a design that databases such as Cassandra follow to achieve that
• If you’ll be doing lots of joins, graph databases such as Neo4j can improve performance
• Sometimes you want the flexibility to store any objects - there is a range of schemaless databases available
• Consider what will retrieve your data, and ensure you have a database efficient for your use case.
ANY QUESTIONS?
D a v i d S i m o n s @ S w a m W i t h Tu r t l e s