NoSQL for great good [hanoi.rb talk]


NoSQL database for great good

@huydx hanoi.rb

$> whoami

Huy: software developer, Tokyo-based, Ruby/Scala user

nickname: @huydx

Disclaimer

This talk will not go into detail about any particular NoSQL database

I'm going to talk about how we should think when we need to choose a NoSQL database

What do people often think about NoSQL?

• As a cache

• As magic that can make "any" web system faster

Your system is slow

Just use NoSQL

RDBMS is shit

RDBMS is not slow

NoSQL is not the cure for everything

RDBMS is awesome

• Can scan 7m rows / sec with an index!

• Can handle very big data (Facebook)

• Has a very flexible query language (SQL)

• Has some awesome analytics features (window functions in PostgreSQL)

• Has ACID properties

https://www.percona.com/blog/2008/04/09/how-fast-can-mysql-process-data/

Why ACID is important

• Atomicity: protects transactions (all or nothing)

• Consistency: protects data correctness

• Isolation: protects data from concurrency

• Durability: protects data from failure

ACID makes a database something you can
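A minimal sketch of the "all or nothing" point, using Python's built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the balances are made up for illustration; they are not from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint stands in for a business rule: no negative balance.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates happen, or neither does."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


transfer(conn, "alice", "bob", 30)  # succeeds
try:
    transfer(conn, "alice", "bob", 500)  # would overdraw alice
except sqlite3.IntegrityError:
    pass  # the credit to bob was rolled back together with the failed debit
print(balances(conn))
```

The second transfer credits bob first and only then fails on alice's debit, yet bob's balance is unchanged afterwards. That is atomicity doing its job.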

So

• RDBMS is way better than you thought

• You should learn to do RDBMS the right way

• How to get the best performance from an RDBMS (index tuning, query optimization, data modeling, master-slave replication, monitoring, sharding the right way...)

https://www.percona.com/blog/2014/03/27/a-conversation-with-5-facebook-mysql-gurus/

But this talk is about NoSQL!!!!!!

Where is an RDBMS not a good fit?

• Nature of data: when data is not row-column style (multidimensional data)

• How your data scales: data sharding (you don't want to shard)

• ACID is great, but it degrades performance

• Single point of failure: single master

• Data usage: when you need real-time, fast data

https://www.percona.com/blog/2009/08/06/why-you-dont-want-to-shard/
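One reason manual sharding hurts, sketched under assumptions: suppose keys are routed to shards with a naive `hash(key) % num_shards` rule (a common first attempt, not something the talk prescribes). The helper names below are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard with naive modulo hashing."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys that land on a different shard after resizing."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
print(moved_fraction(keys, 4, 5))
```

Going from 4 to 5 shards with this rule relocates roughly four out of five keys, so adding one machine means migrating most of the data. Databases with built-in partitioning hide this work; with an RDBMS you usually do it yourself.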

Let's talk about NoSQL

We have plenty of players

But when, and how to use them?

We have a decent answer: It depends!!

What do you want to store?

• Geospatial data?

• Users' important data? (passwords, payment information...)

• Cache data?

• Real-time analytics data (write/read intensive)

Where do you want to store?

• In memory?

• On disk?

  • On slow disk (HDD)

  • On fast disk (SSD)

How big is your data?

• Able to fit into memory?

• Able to fit into a single machine?

• Not able to fit even into hundreds of machines?

Are there any factors for categorizing NoSQL databases?

Data Model

Query Model

NoSQL categorized by data model

Document

Pairs each key with a complex data structure known as a document (JSON, BSON).

MongoDB, CouchDB, RethinkDB

Column Family

One row key paired with many columns (compare rows in an RDBMS); easy to partition into blocks

Cassandra, HBase, Hypertable

Graph

Stores data as nodes plus links between nodes

Neo4j, FlockDB

KVS

Just a key plus a value (a value can be complex, but cannot be as wide as a column family)

Riak, Memcached, Redis, Couchbase
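To make the four data models concrete, here is a rough sketch of how the same made-up user data could be shaped in each of them, using plain Python structures only. None of this is any real database's API; the keys, fields, and the `followers` helper are assumptions for illustration.

```python
import json

# Document: one key -> a nested, self-describing structure.
document = {
    "user:1": {"name": "huy", "langs": ["ruby", "scala"], "address": {"city": "Tokyo"}},
}

# Column family: one row key -> many named columns.
# Rows do not have to share the same set of columns.
column_family = {
    "user:1": {"name": "huy", "city": "Tokyo", "lang:ruby": "1", "lang:scala": "1"},
    "user:2": {"name": "an"},
}

# Graph: nodes plus labeled links between nodes.
nodes = {"user:1": {"name": "huy"}, "user:2": {"name": "an"}}
edges = [("user:2", "FOLLOWS", "user:1")]

# KVS: one key -> one value that the store treats as an opaque blob.
kvs = {"user:1": json.dumps({"name": "huy", "city": "Tokyo"})}


def followers(edges, node_id):
    """Graph-style question: who follows this node?"""
    return [src for src, label, dst in edges if label == "FOLLOWS" and dst == node_id]


print(followers(edges, "user:1"))
```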

What about the merits and demerits of each data model?

The data model affects how we query data

Users always want the query method to be as flexible as possible

But sometimes we have to face the trade-off between flexibility and scalability

• Document: queries can be very flexible because documents are examinable (MongoDB has a very rich query language). The data model can also change very flexibly

• Column Family: just a key-value with very wide fields, which makes it very fast to look up a bunch of values

• Graph: for very special cases when you need to store and query relationships (followers on Twitter)

• KVS: when you really need high performance and just need to look up simple values
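The flexibility gap between a document store and a KVS can be sketched in a few lines. This is a toy in plain Python with hypothetical helpers (`find_documents`, `kvs_get`) and made-up records, not the query API of MongoDB or any other product.

```python
import json

documents = {
    "u1": {"name": "huy", "city": "Tokyo", "langs": ["ruby", "scala"]},
    "u2": {"name": "an", "city": "Hanoi", "langs": ["ruby"]},
}


def find_documents(store, **criteria):
    """Document-style query: the store can look inside each value,
    so any field can be used as a filter."""
    return sorted(
        key
        for key, doc in store.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    )


# The same records in a KVS: values are opaque bytes to the store.
kvs = {key: json.dumps(doc).encode("utf-8") for key, doc in documents.items()}


def kvs_get(store, key):
    """KVS-style query: exact key lookup only."""
    return store.get(key)


print(find_documents(documents, city="Hanoi"))
print(json.loads(kvs_get(kvs, "u1"))["city"])
```

With the KVS, "all users in Hanoi" has no direct answer: the application must know the keys in advance or maintain its own index. That is the price paid for a simpler, faster lookup path.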

So it really depends, right?

Data model for NoSQL is hard!

So be careful with your selection

Sometimes the borderline of data modeling is blurred

We need other factors to consider

Scalability

First we need to know about the CAP theorem

http://webpages.cs.luc.edu/~pld/353/gilbert_lynch_brewer_proof.pdf

We can only have two of them!!!!!!!!

NO MORE!!!!

http://blog.flux7.com/blogs/nosql/cap-theorem-why-does-it-matter

Just ask yourself: what do you care about?

• You need very fast writes and reads, and data can be a little bit stale -> A + P

• You need transactions, and everyone must see the same view, but sometimes something must be locked -> C + P

• You don't need a distributed system that is fault-tolerant against network problems -> C + A
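A toy illustration of the A + P versus C + P choice, assuming a two-replica cluster where a flag simulates a network partition. `TinyCluster`, `Replica`, and `Unavailable` are invented for this sketch; real systems use quorums, leader election, and repair mechanisms that are left out here.

```python
class Unavailable(Exception):
    """Raised when the cluster refuses a request to stay consistent."""


class Replica:
    def __init__(self):
        self.data = {}


class TinyCluster:
    """Two replicas, a and b. Clients write through a and may read from b."""

    def __init__(self, mode):
        if mode not in ("AP", "CP"):
            raise ValueError("mode must be 'AP' or 'CP'")
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True simulates a network split between a and b

    def write(self, key, value):
        if not self.partitioned:
            # Healthy network: both replicas get the write.
            self.a.data[key] = value
            self.b.data[key] = value
        elif self.mode == "AP":
            # Stay available: accept the write locally, b becomes stale.
            self.a.data[key] = value
        else:
            # Stay consistent: refuse the write rather than let replicas diverge.
            raise Unavailable("cannot reach replica b")

    def read_from_b(self, key):
        return self.b.data.get(key)


ap = TinyCluster("AP")
ap.write("x", 1)
ap.partitioned = True
ap.write("x", 2)             # accepted, but only replica a sees it
print(ap.read_from_b("x"))   # replica b still serves the old value

cp = TinyCluster("CP")
cp.write("x", 1)
cp.partitioned = True
try:
    cp.write("x", 2)
except Unavailable as err:
    print("write refused:", err)
```

In AP mode the write succeeds and a reader on the other side of the split sees stale data. In CP mode every reader sees the same value, but the write is rejected until the partition heals.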

So we have two factors to think about. What else?

Operation

Programmers may not care, but

infrastructure engineers do

What factors affect operation?

• What is your database's distribution model: how does it shard and replicate (master-slave or p2p)?

• Does your database run on the JVM? (operating a JVM system is waaaayyy more bothersome than a system written in C or C++)

• Does your database have a single point of failure?

• Is your database optimized for SSD only?

Operation is hard

When you fail at operation, you lose your data

So choose what you know very well

Conclusion

• It really depends!!!!

• Ask yourself: is NoSQL really needed?

• First know your requirements, know your data

• Investigate carefully before choosing any solution (when you choose wrong, you lose your data)
