
Oracle vs NoSQL – The good, the bad and the ugly

Date post: 11-Aug-2014
Upload: john-kanagaraj
Description:
A good understanding of NoSQL database technologies that can be used to support a Big Data implementation is essential for today’s Oracle professional. This was discussed in detail in a 2 hour deep-dive technical session at COLLABORATE 2014 - The Oracle User Group Conference. In this slide deck, you will learn what Big Data brings to the table as well as the concepts behind the underlying NoSQL data stores, in comparison to its ancestor you know well - the Oracle RDBMS. We will determine where and how to employ these NoSQL data stores effectively as well as point out some of the issues that you will have to think through (and prepare for) before your organization rushes headlong into a “Big Data” implementation. We will look specifically at MongoDB, CouchBase and Cassandra in this context. At the end of the session, we will provide pointers and links to help the audience take the next step in learning about these technologies for themselves
REMINDER Check in on the COLLABORATE mobile app Oracle vs. NoSQL The good, the bad and the ugly John Kanagaraj Member of Technical Staff, PayPal Database Engineering, An eBay Inc. company
Transcript
Page 1: Oracle vs NoSQL – The good, the bad and the ugly

REMINDER

Check in on the COLLABORATE mobile app

Oracle vs. NoSQL The good, the bad and the ugly John Kanagaraj Member of Technical Staff, PayPal Database Engineering, An eBay Inc. company

Page 2: Oracle vs NoSQL – The good, the bad and the ugly

Housekeeping

■  Check the font sizes ▪  Can you read this at the back of the room?

▪  Can you read this at the back of the room?

▪  Just kidding! ■  Silence your Phones! ■  Q & A : Ask as we go along (and I will repeat the question)

▪  Keep it relevant to the slide at hand ▪  I might defer the question to a later slide if I believe it is

addressed later ▪  If it gets too long, I humbly request we deal with it after the

break or after the session ■  It is a long day, so if you nod off it is ok (hopefully no snoring!)

Page 3: Oracle vs NoSQL – The good, the bad and the ugly

Agenda ■  Big Data – What it is, why should we care ■  NoSQL – What it is, and why do we need it ■  Concepts you need to understand

▪  CAP Theorem (and why it is important) ▪  Unstructured Data ▪  Sharding and Replication ▪  Data Modeling in the brave new world of NoSQL

■  Introduction to some popular NoSQL stores ■  A look into the (immediate) future: Moving forward

Page 4: Oracle vs NoSQL – The good, the bad and the ugly

Not on the Agenda ■  Not a Tutorial on various NoSQL datastores ■  NotAnInstallationGuide ■  NotAnAdministrationManual ■  If you already know the CAP Theorem and NoSQL:

▪  I will be covering the basics (so you know!)

▪  We are all here to share and learn: Maybe I can learn from your questions/inputs (time and context permitting)

▪  Let’s talk after the talk (or during the break)

Page 5: Oracle vs NoSQL – The good, the bad and the ugly

Speaker Qualifications

■  Currently Database Engineer @ PayPal
■  Has been working with Oracle Databases and UNIX for too many years ☺
■  Author and Technical editor
■  Frequent speaker at OOW, IOUG COLLABORATE and regional OUGs
■  Oracle ACE
■  Contributing Editor, IOUG SELECT Journal
■  Loves to mentor new speakers and authors!
■  http://www.linkedin.com/in/johnkanagaraj

Page 6: Oracle vs NoSQL – The good, the bad and the ugly

Big Data

Page 7: Oracle vs NoSQL – The good, the bad and the ugly

Big Data – The Why
■  2.5 quintillion bytes of data are generated every day
▪  (1 quintillion = 10^18 bytes): so that is ~2.3 billion GB
▪  Generated by humans (using devices) as well as machines (IoT)
—  Location data emitted by your smart phone
—  “Web-scale” Webserver logs and interactions
—  Sensor data emitted by almost every networked device: E.g. Cars’ fuel/pressure gauges, Personal fitness devices (wearables)
—  Multi-media sources: Security cameras, Face/Plate recognition
—  Data that matters to you: Medical, Scientific, Weather
▪  Lots of value in this data, but mostly untapped
▪  Most of this is never stored: Too big to store, but not too big to understand ☺
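The unit conversion on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the 2.5 quintillion figure is in bytes; the variable names are illustrative only.

```python
# Convert the quoted daily data volume into gigabytes.
BYTES_PER_DAY = 2.5 * 10**18           # 2.5 quintillion bytes

gib_per_day = BYTES_PER_DAY / 2**30    # binary gigabytes (GiB)
gb_per_day = BYTES_PER_DAY / 10**9     # decimal gigabytes (GB)

print(f"{gib_per_day:.2e} GiB/day")    # roughly 2.3 billion GiB
print(f"{gb_per_day:.2e} GB/day")      # 2.5 billion GB
```

Either way the result is in the billions of gigabytes per day, not trillions.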

Page 8: Oracle vs NoSQL – The good, the bad and the ugly

Big Data – The Why ■  Plummeting cost of technology

▪  Storage Cost/GB – 1980 : $437,500, 2013 : $0.05 ▪  Computing Cost – Moore’s law ▪  Network transportation Cost – WiFi, BLE, etc.

■  What is driving this? ▪  Cheaper to store data than to delete/ignore it ▪  Minimal cost to generate, transport and store ▪  Ubiquity of network, storage and data generation ▪  Accelerating advances in science and technology ▪  Machine learning and intelligence is growing

Source for storage cost: http://www.statisticbrain.com/average-cost-of-hard-drive-storage/

Page 9: Oracle vs NoSQL – The good, the bad and the ugly

Big Data – The Why

Infographic: http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Page 10: Oracle vs NoSQL – The good, the bad and the ugly

Big Data Characteristics: 4 V’s + 1
■  Volume – Scale at which data is generated
▪  Cannot be stored using traditional methods
▪  Cannot be stored in a monolithic store
■  Variety – Different forms of data
▪  Big Data is usually not structured; structure not known in advance; structure not controlled by consumer
▪  May not always be in text form (more than just binary)
■  Velocity – Data arrives in a continuous stream
▪  Multiple, varied sources produce data continuously
▪  Peaks and bursts unpredictable
▪  “Always on”: No down time for maintenance or re-orgs
▪  No “Known Users” – unpredictable, unknown patterns/scale

Page 11: Oracle vs NoSQL – The good, the bad and the ugly

Big Data Characteristics: 4 V’s + 1

■  Veracity – Uncertainty: Data is not always accurate ▪  Multiplicity of sources creates convergence of truth ▪  Eventual consistency (versus immediate consistency)

■  Value – Immediacy and hidden relationships ▪  In many use cases, value of Big Data declines quickly

—  Traffic reports do not matter after 30 minutes

—  Routing resupply trucks is counterproductive after the fact

—  However, some historical value may be derived post the event

▪  Concept of “Near Line” data (neither fully online nor offline) ▪  Easy to miss hidden relationships

—  Most data sets are correlated to other data sets, implicitly or explicitly

—  Not easy to detect due to volume and variety

—  Mine data using various techniques (Data Science)

Page 12: Oracle vs NoSQL – The good, the bad and the ugly

So how do we store this storm? ■  Big Data impossible to store using RDBMS

▪  Too big, too fast for RDBMS to ingest ▪  RDBMS needs “schema before write” ▪  Unknown structures = “schema during read”

■  So what is limiting RDBMS? ▪  ACID requirement drives “protection” mechanism ▪  Redo and Undo in Oracle provides ACID ▪  “Relational” imposes “schema before write” ▪  Easy to get “small bits”; hard to get “large pieces”
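The difference between “schema before write” and “schema during read” can be sketched with the Python standard library. This is an illustration only: SQLite stands in for the RDBMS, and the table, field names and sample readings are made up for the example.

```python
import json
import sqlite3

# Schema before write: the table layout must exist before any row is stored.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE reading (device_id TEXT NOT NULL, temp_c REAL NOT NULL)")
db.execute("INSERT INTO reading VALUES (?, ?)", ("car-1", 21.5))
try:
    # A record with an unforeseen attribute does not fit the declared columns.
    db.execute("INSERT INTO reading VALUES (?, ?, ?)", ("car-2", 19.0, "psi=32"))
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# Schema during read: store whatever arrives, interpret it when reading.
raw_store = [
    json.dumps({"device_id": "car-1", "temp_c": 21.5}),
    json.dumps({"device_id": "car-2", "temp_c": 19.0, "psi": 32}),
]

def read_temps(store):
    """Apply the 'schema' (the fields we care about) at read time."""
    return [(doc["device_id"], doc.get("temp_c")) for doc in map(json.loads, store)]

print(rejected, read_temps(raw_store))
```

The relational insert fails until someone alters the table, while the raw store accepts the new attribute and leaves it to each reader to decide what to use.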

Page 13: Oracle vs NoSQL – The good, the bad and the ugly

So how do we store this storm?
■  RDBMSs are essentially ACID
▪  Atomic: A transaction fully succeeds or fully fails
▪  Consistent: A transaction moves the database from one consistent state to another
▪  Isolated: Transactions cannot interfere with each other
▪  Durable: Committed transactions persist even during failure
■  RDBMS Clusters = “Shared everything” for ACID
■  Atomicity in a distributed database: Two Phase commit
▪  Essential for splitting workload
▪  Reduction in availability though!
■  New concept! BASE (Basically Available, Soft state, Eventual Consistency)
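Atomicity is easiest to see in a money transfer. The sketch below uses SQLite from the Python standard library as a stand-in for any ACID database; the `account` table and the `transfer` helper are hypothetical names for this example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
db.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
db.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates commit, or neither does."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    """Return {name: balance} for all accounts."""
    return dict(conn.execute("SELECT name, balance FROM account"))
```

An overdraft attempt such as `transfer(db, "alice", "bob", 150)` credits bob first, then fails the balance check on alice, and the rollback undoes the credit as well.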

Page 14: Oracle vs NoSQL – The good, the bad and the ugly

A common scalability inhibitor
■  Heap table with one or more “right growing” indexes
−  Primary Key: Unique index on a NUMBER column
−  Key value generated from an Oracle Sequence (incrementing by 1)
−  I.e. “monotonically” increasing ID value
−  High rate of insert (> 5000 inserts/second) from multiple sessions
−  Multiple indexes, typically leading date/time series or mono-valued
−  E.g. Oracle E-Business Suite’s FND_CONCURRENT_REQUESTS
■  Here’s the Problem:
−  All INSERTing sessions need one particular index block in CURRent mode (as well as one particular data block in CURRent mode)
−  Question: Would you use RAC to scale out this particular workload?
Confidential and Proprietary

Page 15: Oracle vs NoSQL – The good, the bad and the ugly

A quick deep dive
■  Here’s what happens to accommodate the INSERT
−  Assume the current value of the PK is 100, and the sequence increments by 1
−  Assume we have ‘N’ sessions simultaneously inserting into that table
−  Session 1 needs to update the Index block (add the Index entry for 100)
−  Session 2 wants the same block in CURRent mode (add another entry for 101; needs the same block because the entry fits in the same block)
−  Session 3… N also want the same block in CURRent mode at the very same time (as all sessions will have “nearby” values for index entry)
−  Block level pins/unpins (+ lots of other work – Redo/Undo) required
−  Same memory location (SGA buffer for Index block) accessed
−  Smaller but still impacting work for buffer for Data block
−  Rate of work constrained by CPU speed and RAM access speeds

Page 16: Oracle vs NoSQL – The good, the bad and the ugly

A quick deep dive
■  What if you use RAC to “scale out” this workload?
−  Assume “N” sessions simultaneously inserting from 2 RAC nodes (2xN)
−  In addition to previously described work, you need to
−  Obtain the Index block from remote node in CURRent mode
−  Session 1 (Node 1) updates Index block with value 100
−  Session 2 (Node 2) requests block in CURRent mode (value 101)
−  LMS processes on both nodes churn CPU co-ordinating messages and block transfers back and forth on the interconnect
−  Flush redo changes to disk on Node 1 before shipping CURRent block to Node 2 (gated by RedoWriter response!!!)
−  Sessions block on “gc current <state>” waits during this process
−  CPU, Redo IO, Interconnect, LMS/LMD processes involved

Page 17: Oracle vs NoSQL – The good, the bad and the ugly

A quick deep dive
■  Some solutions
−  Spread the pain for the right growing index
−  Use Reverse Key Indexes (cons: Range scan not possible)
−  Use Hash partitioned indexes (cons: All partitions probed for Range scan, Need Partitioning Option, Additional administration)
−  Prefix RAC node # (or some identifier per node) to key
−  Use a modified key: Use Java UUID, Other distinct prefix/suffixes
−  Use Range-Hash Partitioned tables with Time based ID as key
−  E.g. Epoch Time (# of seconds from Jan 1, 1970) + Sequence value for lower bits
−  Enables Date/Time based partitioning key
−  Unique values allow Local Index to be unique
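Two of the ideas above, a time-based ID and hash-style spreading, can be sketched in a few lines. This is a sketch under stated assumptions, not the layout used at any particular site: the 20-bit sequence width, the in-process counter standing in for an Oracle sequence, and the modulo used in place of a real hash function are all choices made for this example.

```python
import itertools
import time

SEQ_BITS = 20                      # assumed width of the sequence part
_seq = itertools.count()           # stands in for an Oracle sequence

def time_based_id(now=None):
    """Epoch seconds in the high bits, sequence value in the low bits."""
    seconds = int(time.time() if now is None else now)
    return (seconds << SEQ_BITS) | (next(_seq) % (1 << SEQ_BITS))

def id_timestamp(key):
    """Recover the epoch seconds, e.g. to choose a date-based partition."""
    return key >> SEQ_BITS

def hash_bucket(key, buckets=8):
    """Simplified hash partitioning: neighbouring keys land in different buckets."""
    return key % buckets
```

Because the timestamp occupies the high bits, the key can drive date-based range partitioning, while the bucket function shows how consecutive values stop competing for the same index block.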

Page 18: Oracle vs NoSQL – The good, the bad and the ugly

Relaxing ACID – Skip the Redo/Undo ☺ ■  BASE Model

▪  “In partitioned databases, trading some consistency for availability can lead to dramatic improvements in scalability”

▪  Proposed by Dan Pritchett (eBay) in 2008 ▪  ACID is pessimistic; enforces consistency at the end of a

transaction ▪  BASE is optimistic; accepts eventual consistency ▪  Supports partial failure without total failure

■  Enabled new paradigms ▪  New patterns for distributing workload emerge

—  Sharding and Replication

—  Less than perfect (but good enough) consistency

Page 19: Oracle vs NoSQL – The good, the bad and the ugly

A New Beginning - NoSQL

■  A new dawn emerges… ▪  Brewer proposes CAP theorem (2000) ▪  Google creates BigTable (~ 2006) ▪  Amazon creates Dynamo (~ 2007) ▪  eBay shards over Oracle Databases (2008) ▪  Inspires a new set of alternate data storage

projects ▪  NoSQL databases start appearing…

(~2008 – 2010) ▪  Becomes a buzz word (~ 2011 – 2013)

■  Now we all want “in”…

■  Picture courtesy Kamran Agayev via Twitter

Page 20: Oracle vs NoSQL – The good, the bad and the ugly

So What is NoSQL? ■  NoSQL – supposed to be “No SQL”, but it is NOT ■  NoSQL – Loosely it is “Not Only SQL” (i.e. NOSQL)

▪  Term coined by Eric Evans (developer at Rackspace) ▪  Adopted by Johan Oskarsson (another developer) ▪  For a meetup of like minds at SF, 2009 ▪  Meetup for “open-source, distributed, nonrelational

databases” [Voldemort, Cassandra, CouchDB, MongoDB, etc.] ■  NoSQL does not mean there is no “SQL-Like” interface

▪  Cassandra supports CQL (Cassandra Query Language) ■  NoSQL does NOT always mean Big Data

▪  But Big Data stores are almost always NoSQL based ▪  That is, if you count Hadoop as a NoSQL datastore *

* See: http://wiki.apache.org/hadoop/HadoopIsNot

Page 21: Oracle vs NoSQL – The good, the bad and the ugly

A small diversion: The Hadoop ecosystem ■  Let’s understand Hadoop vs. the Rest ■  Hadoop – The real Big Data Store

▪  Real Big platform to store data ▪  Store almost anything and everything ▪  Key components of Hadoop:

—  HDFS: A unified file system that combines all storage in the cluster

—  MapReduce: A programming model to handle large data sets

—  An extensible ecosystem: Other components to control, schedule and manage processing and the cluster

▪  Is NOT a database (although there is HBase)…. ▪  But supports SQL-like interface using Hive ▪  Not really meant for Online, Web-site facing implementation

Page 22: Oracle vs NoSQL – The good, the bad and the ugly

A small diversion: The Hadoop ecosystem

Page 23: Oracle vs NoSQL – The good, the bad and the ugly

Big Data / NoSQL Landscape

From http://www.bigdata-startups.com/open-source-tools/

Page 24: Oracle vs NoSQL – The good, the bad and the ugly

Why NoSQL?

■  Impedance Mismatch ▪  Real world data does not naturally possess structure ▪  A “Person” has many variable characteristics ▪  Applications deal with a “person” object ▪  This is then a set of In-memory structures ▪  Relational Databases require structured table/columns

though…. ▪  Thus, an “impedance mismatch” between Dev and DBA ▪  Which ORM’s try to bridge (the gap between Dev and DBA)

—  Cultural mismatch: “Agile” (Dev) seems to be “Fragile” (for a DBA)

—  Technical mismatch: “Objects” to “Relational Tables”

—  Storage structure mismatch: “Un-/Semi-structured” to “Structured”

Page 25: Oracle vs NoSQL – The good, the bad and the ugly

Why NoSQL? ■  Rapid “web-scale” growth for external entities/users

▪  Ability to support viral/burst traffic patterns ■  Most data does not (usually) need immediate consistency

▪  It is ok to lose some data; It is Ok not to have ACID ■  Commodity hardware and the Cloud

▪  RDBMS’ don’t run well on clusters (apologies: RAC world) — Shared Disk clusters are both a SPOF and expensive! — License costs for RDBMS on clusters — Failure of one component brings everything down

▪  Clustering cheaper commodity hardware is economical — Single or even a small number of failures affect a portion of

workload, not the whole application (due to sharding) ▪  Easier to create a “cloud” with commodity hardware

Page 26: Oracle vs NoSQL – The good, the bad and the ugly

Why NoSQL? ■  Open patterns

▪  Almost all NoSQL products are open-source ▪  Relatively open learning

—  Meetups; Open seminars run by vendors

—  Lively blogs and passionate contributors

▪  Quick-and-easy installs ▪  Community versions from vendors ▪  Easy to install on for-rent cloud environments ▪  Monitoring/Alerting through open frameworks (Nagios, Ganglia)

■  Enterprise support through vendor ▪  10gen for MongoDB; DataStax for Cassandra; CouchBase ▪  Cloudera, Hortonworks, MapR for Hadoop

■  Large Webscale companies building own NoSQL databases

Page 27: Oracle vs NoSQL – The good, the bad and the ugly

NoSQL Characteristics

■  “Schema before write” vs. “Schema during read” ▪  Caters to “unstructured” need ▪  Primarily solves Impedance mismatch ▪  Creates its own challenges

■  Modeled by read and write patterns ▪  “customer and orders” together for a customer centric view ▪  “product and orders” for a production/supply-chain centric view ▪  Alternative: Store twice

■  Data modeling driven by physical storage model ■  Read patterns

▪  Secondary indexing (overheads) ▪  Brute-force access via MapReduce jobs ▪  Store multiple, denormalized copies (“disk is cheap”)

Page 28: Oracle vs NoSQL – The good, the bad and the ugly

NoSQL Characteristics ■  ACID is “relaxed”

▪  A transaction is limited to an aggregate (k-v pair) ▪  Enables distributed, shared-nothing architectures ▪  Ideal for clustered deployments ▪  Optimistic locking ▪  Some loss of data and consistency is expected (and catered to)

■  Write patterns ▪  UPDATEs converted to INSERTs (timestamped/tombstoned) ▪  Time-To-Live (TTL) based DELETE’s/Purges ▪  Compaction based garbage collection ▪  Reduced Write latency due to memory only writes ▪  Transaction logging supported in some NoSQL stores
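The write patterns listed above (updates stored as new inserts, tombstones for deletes, TTL expiry and compaction) can be illustrated with a toy in-memory log. This is a minimal sketch and not how any specific NoSQL product implements them; the class and method names are made up for the example.

```python
TOMBSTONE = object()   # marker meaning "this key was deleted"

class AppendOnlyStore:
    def __init__(self):
        self.log = []  # entries are (key, timestamp, value, expires_at)

    def put(self, key, value, ts, ttl=None):
        expires = None if ttl is None else ts + ttl
        self.log.append((key, ts, value, expires))    # an UPDATE is another INSERT

    def delete(self, key, ts):
        self.log.append((key, ts, TOMBSTONE, None))   # a DELETE writes a tombstone

    def get(self, key, now):
        latest = None
        for k, ts, value, expires in self.log:
            if k == key and (latest is None or ts >= latest[0]):
                latest = (ts, value, expires)
        if latest is None or latest[1] is TOMBSTONE:
            return None
        if latest[2] is not None and latest[2] <= now:
            return None                                # TTL has passed
        return latest[1]

    def compact(self, now):
        """Garbage-collect: keep only the newest live version of each key."""
        newest = {}
        for entry in self.log:
            key, ts = entry[0], entry[1]
            if key not in newest or ts >= newest[key][1]:
                newest[key] = entry
        self.log = [e for e in newest.values()
                    if e[2] is not TOMBSTONE and (e[3] is None or e[3] > now)]
```

Writes only ever append, which is what keeps write latency low; the cost is paid later by reads that must find the newest version and by the compaction pass. Real stores typically keep tombstones for a while so that other replicas learn about the delete, which this sketch does not model.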

Page 29: Oracle vs NoSQL – The good, the bad and the ugly

Why use an RDBMS then?! ■  ACID may be a hard business requirement

▪  Data loss can never be tolerated ▪  Data inconsistency can never be tolerated (e.g. Money

movement) ■  Complex data models favor RDBMS

▪  Try modeling Oracle EBS in NoSQL ☺ ■  Standardized interface via SQL

▪  Broadly same across all RDBMS ▪  Well understood, skills availability

■  Inter-application integration ▪  Single platform for data created its own ecosystem

■  Cost to change is prohibitive

Page 30: Oracle vs NoSQL – The good, the bad and the ugly

Introducing the CAP Theorem

■  Eric Brewer’s conjecture at the July 2000 ACM Symposium ■  Formalized by Seth Gilbert and Nancy Lynch in 2002 ■  Any networked shared-data system can have at most two of

three desirable properties: ▪  At least one Consistent (C) up-to-date copy of the data ▪  high Availability (A) of that data (for both reads and updates) ▪  tolerance to network Partitions (P)

■  Core systemic requirements in a distributed environment ▪  Special symbiotic relationship ▪  Present during design and deployment of applications in a

distributed environment (whether acknowledged or not) ■  Applies well to the distributed NoSQL world

Page 31: Oracle vs NoSQL – The good, the bad and the ugly

Components of the CAP Theorem ■  (C)onsistency

▪  All clients see the same results from a query, even in the presence of an update at the same time as the query

■  High (A)vailability ▪  All clients can write or access data, even in the presence of

system failures. Requestors receive acknowledgment of success or failure

▪  Performance may degrade, but consuming applications are able to access data even though some parts of the system may not be operational at the time of a query

■  (P)artition Tolerance ▪  The system returns results regardless of failures in

communication between partitions in the distributed system; i.e. system property holds true even if there is a network partition

Page 32: Oracle vs NoSQL – The good, the bad and the ugly

General CAP Theorem

Page 33: Oracle vs NoSQL – The good, the bad and the ugly

Illustrating the CAP Theorem (adapted) ■  You start a small business: Provide phone reminders/information ■  Customers call with information; You call back/respond to remind ■  Start small: All information written down in your (single) notebook ■  Business grows: Wife is recruited (scale out, PBX shards calls) ■  Inconsistency: Response misses info updated in Wife’s notebook ■  Resolve inconsistency: All notebooks updated when call ends (lock) ■  Wife’s day off: You leave sticky notes (Inconsistent until next day) ■  Wife fights with you: Network Partition (sticky notes thrown away) ■  You have a choice here: CAP Theorem in play – Pick two

▪  (C) Always provide consistent information to clients

▪  (A) Business is always open if at least one of you is present

▪  (P) Business is open even during a loss of communication between 2

■  Run around clerk: Eventual consistency and Compaction

Page 34: Oracle vs NoSQL – The good, the bad and the ugly

Examples of CAP Theorem pairs ■  Consistency and Partition Tolerance (CP): Banking Transaction at an ATM

▪  Data needs to be consistent in the presence of updates

▪  If there is a network failure, dispense cash but limit the transaction amount

▪  Transaction still available, but system property changed due to network partition

■  Consistency and Availability (CA): Database System-of-Record

▪  Data Consistency is key

▪  During a network failure, clients stop writing (no redo), no write availability

▪  Present in Oracle Data Guard’s Maximum protection mode/Single node DB

■  Availability and Partition Tolerance (AP): Shopping cart in Amazon.com

▪  Spread data across multiple partitions to be always available

▪  Reconcile cart at checkout (may result in dual purchases!)

▪  Sacrifices consistency, but works for most cases, most of the time
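The AP shopping-cart case can be sketched as a merge of two cart copies that diverged during a partition. The merge rule below (keep every item seen on either side, with the larger quantity) is one simple choice made for illustration and is not claimed to be Amazon's actual algorithm.

```python
def merge_carts(replica_a, replica_b):
    """Reconcile two divergent cart copies by keeping every item seen on either side."""
    merged = dict(replica_a)
    for item, qty in replica_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

# During a partition the customer removed the pen and added a lamp on replica B,
# while replica A never saw either change.
cart_a = {"book": 1, "pen": 2}
cart_b = {"book": 1, "lamp": 1}
print(merge_carts(cart_a, cart_b))
```

Both replicas stayed writable throughout, but the merged cart contains the pen again. That is the kind of inconsistency the application has to tolerate or resolve at checkout.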

Page 35: Oracle vs NoSQL – The good, the bad and the ugly

CAP Theorem in the Oracle World ■  Application Scalability: Some well-known techniques

▪  Partition workload by function —  Schema level split: data unrelated to each other is segregated

—  Typically provides headroom for main workload/environment

▪  Distribute transactions —  For related data that still needs to be viewed together

—  Typically using Database links

—  Typically for master lookups and remote writes

—  Introduces dependencies (more on that soon)

▪  Decouple work asynchronously —  Use AQ to write tokens or keys to process later

—  Introduces a “delay”: Data not immediately consistent

Page 37: Oracle vs NoSQL – The good, the bad and the ugly

CAP Theorem in the Oracle World ■  Application Scalability: Some well-known techniques

▪  Offload reads using Active Data Guard (DB 11g and above) ▪  DG copy opened for reads during Real Time Apply ▪  DG allows Redo Data shipping in 3 modes

—  Maximum Protection: Zero loss but dependent on remote redo write

—  Maximum Performance: Remote redo written asynchronously

—  Maximum Availability: Switches to Max Performance mode on remote redo write failure, operates in Max protection mode otherwise

▪  Offers multiple shades of availability and protection ▪  ADG and “read your writes” pattern

—  Real Time Apply (RTA) is not the same as “instant” apply

—  Not “immediately consistent” but “eventually consistent”

Page 39: Oracle vs NoSQL – The good, the bad and the ugly

CAP Theorem in the NoSQL World ■  Realization of CAP enabled NoSQL to “break free”

▪  Opened minds of database developers ■  However, the “2 of 3” rule was somewhat misleading

▪  NoSQL datastores offer options to vary consistency/durability and availability levels

▪  MongoDB has “Write Concern” – Unacknowledged, Acknowledged, Journaled, Replica Acknowledged

▪  Cassandra has Write Consistency: From ANY to ALL ■  Reality is a spectrum between C and A in the presence of P

▪  Eventual Consistency is a given ▪  Some data loss is expected ▪  Application code/other techniques will need to cater for this
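The spectrum between consistency and availability is often reasoned about with the rule of thumb used for quorum-replicated stores: a read is guaranteed to see the latest acknowledged write when the read and write replica counts together exceed the number of replicas. The helper names below are illustrative and are not part of any product's API.

```python
def overlaps(n_replicas, write_acks, read_acks):
    """True when every read set shares at least one replica with every write set."""
    return write_acks + read_acks > n_replicas

def quorum(n_replicas):
    """Smallest majority of the replicas."""
    return n_replicas // 2 + 1

N = 3
print(overlaps(N, quorum(N), quorum(N)))   # quorum writes + quorum reads
print(overlaps(N, 1, 1))                   # one ack each way: fast, may be stale
print(overlaps(N, N, 1))                   # write to all, read from one
```

Lower acknowledgement counts keep the system available when replicas are unreachable, at the price of possibly stale reads; higher counts do the opposite.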

Page 40: Oracle vs NoSQL – The good, the bad and the ugly

Sharding and Replication in NoSQL ■  NoSQL datastores: essentially shared-nothing clusters ■  Relaxing ACID allows distributed processing (CAP applies!) ■  Ability to scale out reads/writes is the key ■  Achieved using two techniques: Sharding and Replication ■  Sharding: Divide and Rule

▪  Data is read/written to different servers (“shards”) ▪  Location determined applying a fixed function on a known key ▪  Different functions: Modulo, Hash, Range, Programmatic ▪  Efficacy of load balancing dependent on function and data ▪  Typically used for Write-scaling (more than Read-scaling) ▪  (Hash partitioned tables/indexes are essentially object level

sharding in Oracle databases to enable write scaling)
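The sharding functions named above (modulo, hash and range) can be sketched as follows. The shard names, the range boundaries and the choice of SHA-256 are assumptions made for this example.

```python
import bisect
import hashlib
from collections import Counter

SHARDS = ["shard0", "shard1", "shard2", "shard3"]

def modulo_shard(numeric_key):
    """Modulo: cheap, but sensitive to patterns in the key values."""
    return SHARDS[numeric_key % len(SHARDS)]

def hash_shard(key):
    """Hash: scrambles the key first, so patterns in the data matter less."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

RANGE_UPPER_BOUNDS = ["g", "n", "t"]   # shard0: < "g", shard1: < "n", shard2: < "t"

def range_shard(name):
    """Range: keeps neighbouring keys together, at the risk of hot ranges."""
    return SHARDS[bisect.bisect_right(RANGE_UPPER_BOUNDS, name[0].lower())]

# Load balancing depends on both function and data: keys that are all
# multiples of 4 end up on a single shard under the modulo function.
skewed = Counter(modulo_shard(k) for k in range(0, 400, 4))
print(skewed)
```

With these keys the modulo function sends all 100 rows to `shard0`, which is the "efficacy depends on function and data" point in practice.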

Page 41: Oracle vs NoSQL – The good, the bad and the ugly

Sharding and Replication in NoSQL ■  Sharding (contd.)

▪  Difficult, if not impossible to change function once implemented ▪  No consistency across shards, or across aggregates ▪  No joins allowed – no cross-shard dependencies ▪  Resilience does not improve (but enables partial availability) ▪  Not to be implemented lightly: Start single if you can ▪  Many NoSQL stores allow auto-sharding (e.g. CouchBase)

■  Replication: Allow multiple copies ▪  Master-Slave model: Simplest, Scales out reads only; Read

resilience; May need to cater for eventual consistency ▪  Peer-to-Peer or Multi-Master model: Scales out reads and

writes, but consistency/conflict resolution is a big problem ■  Can combine Sharding and Replication!
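Why the sharding function is so hard to change once implemented can be shown by counting how many keys would have to move. This is a minimal sketch assuming simple modulo sharding on integer keys; the function name is illustrative.

```python
def moved_fraction(keys, old_shards, new_shards):
    """Fraction of keys whose shard changes when a modulo function is resized."""
    moved = sum(1 for k in keys if k % old_shards != k % new_shards)
    return moved / len(keys)

keys = range(1000)
print(moved_fraction(keys, 4, 5))   # growing from 4 to 5 shards
print(moved_fraction(keys, 4, 8))   # doubling from 4 to 8 shards
```

Adding a single shard relocates most of the data, which is one reason to start with a single node if you can and to choose the function carefully when you cannot.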

Page 42: Oracle vs NoSQL – The good, the bad and the ugly

The NoSQL Datastore Landscape ■  Generally four types:

▪  Key-Value ▪  Document ▪  Column Family ▪  Graph

■  Not using the relational model, i.e. schema-less ▪  But not without a Data Model!

■  Runs on clusters of commodity hardware ■  Generally Open Source ■  Can be considered as storing/retrieving “aggregates”

▪  a collection of related objects that can be treated as a unit ■  Usually described by “Keys” and “Values” (i.e. K-V pairs)

Page 43: Oracle vs NoSQL – The good, the bad and the ugly

Key-Value NoSQL stores ■  The most basic of NoSQL stores ■  Simple K-V structure: A “blob” of data (“Value”) indexed and

accessed via a “Key” ■  “Value” part also known as Aggregate ■  Aggregate is a collection of related objects treated as a unit ■  Written/Updated/Read/Consistent as single, smallest unit ■  Typically, aggregate is limited in size (BLOB in Oracle) ■  Typically, expressed in JSON, and sometimes in XML ■  JSON/XML aggregates are self-describing ■  Value is “opaque” in a K-V store, but is simple ■  Scale out with sharding ■  Examples of K-V store: Riak, Oracle NoSQL

Page 44: Oracle vs NoSQL – The good, the bad and the ugly

Key-Value NoSQL stores

■  Typical Use cases ▪  Shines when you need simple GET/PUT operations ▪  Session state; Tokens – Enables web-scale ▪  User profiles and preferences – Typically latent caching layer ▪  Latency bridge: Support RYOW’s in some cases

■  Anti-patterns ▪  No ad-hoc query patterns - (i.e. need key to access) ▪  Not meant for analytics type workload ▪  When multi-key/multi-operation consistency is required ▪  Set based operations (i.e. related data)

Page 45: Oracle vs NoSQL – The good, the bad and the ugly

Document NoSQL stores

■  Datastore able to understand and manipulate structures ■  Needs to follow an agreed format

▪  usually JSON, but also BSON, XML and YAML ■  Support for secondary indexes

▪  Needs ability to understand/index K-V pairs in the aggregate ▪  Secondary indexes may throttle write rate

■  Aggregate size usually limited ■  Scale-out again supported via sharding

▪  Some stores support multiple sharding methods (MongoDB) ■  K-V stores sometimes evolve into Document stores

▪  E.g. CouchBase evolution ■  Needs embedding/linking support (size/other limitations)
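What separates a document store from a plain K-V store is that it can look inside the aggregate, for example to maintain a secondary index. The toy class below sketches that idea in memory; it is not modelled on the API of MongoDB, CouchBase or any other product.

```python
from collections import defaultdict

class TinyDocumentStore:
    def __init__(self, indexed_fields=()):
        self.docs = {}                                    # _id -> document (a dict)
        self.indexes = {f: defaultdict(set) for f in indexed_fields}

    def put(self, doc_id, doc):
        self.delete(doc_id)                               # drop stale index entries first
        self.docs[doc_id] = doc
        for field, index in self.indexes.items():        # extra work on every write
            if field in doc:
                index[doc[field]].add(doc_id)

    def delete(self, doc_id):
        old = self.docs.pop(doc_id, None)
        if old is not None:
            for field, index in self.indexes.items():
                if field in old:
                    index[old[field]].discard(doc_id)

    def find(self, field, value):
        """Look up documents by a non-key attribute via the secondary index."""
        return [self.docs[i] for i in sorted(self.indexes[field][value])]
```

Every `put` has to touch each index as well as the document itself, which is why secondary indexes may throttle the write rate.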

Page 46: Oracle vs NoSQL – The good, the bad and the ugly

Document NoSQL stores ■  Typical Use cases

▪  Of course, any collection of document-type models ▪  Easy-to-start NoSQL projects when moving from RDBMS ▪  Almost any NoSQL use case needing secondary index access ▪  Content and Metadata store: typically multiple keys ▪  Queries using materialized views (CouchBase) ▪  Non-trivial sharding (MongoDB) ▪  Horizontally scaled or Cached reads (MongoDB, CouchBase) ▪  Models requiring simple relationships (Blogs, User modeling)

■  Anti-patterns: ▪  Not a drop-in replacement for RDBMS ▪  Evolving relationships or query patterns ▪  Usually not good for write-heavy workloads

Page 47: Oracle vs NoSQL – The good, the bad and the ugly

Column Family NoSQL stores

■  Characteristics of CF Stores ▪  Data is mostly organized by sets of columns ▪  Key – Value based access ▪  “Value” consists of sets or ranges of columns ▪  Still unstructured ▪  No joins (except via another keyed table, using MapReduce)

■  Cassandra, HBase, Amazon SimpleDB are prime examples ▪  HDFS on a Hadoop cluster underlies HBase ▪  HBase evolved from Google’s BigTable ▪  Cassandra originated at Facebook ▪  Cassandra also supports CQL (a SQL like language)
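The column-family layout can be pictured as a nested map: row key, then column family, then a set of named columns. The sketch below uses plain Python dictionaries and made-up customer transaction data; it shows the shape of the data only, not the storage engine of any product.

```python
from collections import defaultdict

# row key -> column family -> {column name: value}
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, column, value):
    table[row_key][family][column] = value

def get_columns(row_key, family, start, end):
    """Return a sorted slice of columns, the typical 'range of columns' read."""
    cols = table[row_key][family]
    return [(c, cols[c]) for c in sorted(cols) if start <= c <= end]

# Rows need not share the same columns; each transaction is its own column.
put("cust42", "profile", "name", "John")
put("cust42", "txn", "2014-04-07T10:00", 25.0)
put("cust42", "txn", "2014-04-08T09:30", 12.5)
put("cust42", "txn", "2014-05-01T18:45", 99.0)
print(get_columns("cust42", "txn", "2014-04-01", "2014-04-30"))
```

Access is still by key, and the "value" is a range of columns, which suits repeated sets of values such as time series.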

Page 48: Oracle vs NoSQL – The good, the bad and the ugly

Column Family NoSQL stores

■  Typical Use cases ▪  Data is mostly organized by sets of columns (super columns) ▪  Key – Value based access ▪  “Value” consists of sets of columns (but still unstructured) ▪  Lots of repeated sets of values (e.g. Customer transactions) ▪  No joins (except via another keyed table, using MapReduce) ▪  Write-intensive patterns (Internet-of-Things type data) ▪  Rolling expiry patterns such as Time series data

■  Anti-patterns ▪  IMHO Low-latency reads (in comparison to other NoSQL stores) ▪  Need access via secondary or other keys

Page 49: Oracle vs NoSQL – The good, the bad and the ugly

Graph NoSQL stores ■  Stores Nodes and Edges ■  Provides “Index-free Adjacency” ■  Nodes are entities: People, Accounts, Items, Locations ■  Edges connect Nodes to other Nodes ■  Edges have properties ■  Can mine patterns present in these relationships ■  Supports graph-like queries:

▪  Shortest distance between two locations ▪  Social Graphing: Connecting people ▪  Products that your friends liked

■  Neo4j is a well-known graph database ■  Giraph: An open source graph processing system (FB!)
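A graph-like query such as "how are two people connected" boils down to a shortest-path search over nodes and edges. The sketch below runs a breadth-first search on a small, hypothetical friendship graph held in a dictionary; a graph database would do the equivalent work natively.

```python
from collections import deque

edges = {   # hypothetical friendship graph: node -> adjacent nodes
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns one shortest path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in graph.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None

print(shortest_path(edges, "ann", "eve"))
```

Each step follows edges directly from the current node, which is the idea behind "index-free adjacency".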

Page 50: Oracle vs NoSQL – The good, the bad and the ugly

Graph NoSQL stores ■  Typical Use Cases

▪  Social Graphs ▪  Recommendation Engines ▪  Graph traversal use cases ▪  Relationships with defined end-points ▪  Routing and Location based solutions ▪  Account Linking (e.g. for fraud detection; peer risk checking)

■  Anti-patterns ▪  Scale out via sharding typically not supported in some products ▪  Update all/Update most patterns ▪  Dangling end-points

Page 51: Oracle vs NoSQL – The good, the bad and the ugly

Some more concepts: JSON ■  You need to understand JSON

▪  JavaScript Object Notation
▪  Self describing, English text key-value pairs
▪  In other words, a simpler version of XML
▪  No externally imposed structure (hint: No tab/column mapping!)
{
  "id": 101,
  "first_name": "John",
  "second_name": "Kanagaraj",
  "residential_address": [{"add1": "20 First St", "city": "San Jose", "state": "CA"}],
  "phone": "408-555-9999"
}

▪  Can you spot some optimization here?
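Reading such a document takes one call in most languages. The sketch below parses the example with Python's `json` module and then tries one possible answer to the question above: since key names are stored with every document, shorter names make each document smaller. That reading is an assumption, as the slide does not state the answer, and the abbreviated key names are invented for the example.

```python
import json

text = """
{
  "id": 101,
  "first_name": "John",
  "second_name": "Kanagaraj",
  "residential_address": [{"add1": "20 First St", "city": "San Jose", "state": "CA"}],
  "phone": "408-555-9999"
}
"""
person = json.loads(text)
city = person["residential_address"][0]["city"]   # navigate by key, no column mapping

# Same data with abbreviated (hypothetical) key names.
short = {"id": 101, "fn": "John", "sn": "Kanagaraj",
         "addr": [{"a1": "20 First St", "c": "San Jose", "s": "CA"}],
         "ph": "408-555-9999"}
saved = len(json.dumps(person)) - len(json.dumps(short))
print(city, saved)
```

The saving per document is small, but it is repeated for every document in the store.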

Page 52: Oracle vs NoSQL – The good, the bad and the ugly

Some more concepts: Languages ■  You need to understand JVMs and some Java

▪  Many NoSQL stores use JVM-based programs ▪  E.g. Hadoop, Cassandra ▪  Ability to understand JVMs and their internals is key ▪  JVM Garbage Collection needs to be managed ▪  Need to understand/configure JMX (Java Management Extensions) ▪  Most NoSQL stores support Java APIs out of the box

■  Most NoSQL stores support more than just Java ▪  E.g. Python, Ruby, Perl, C/C++, Node.js, Go ▪  Lesser-known ones such as Erlang, Haskell, Scala ▪  Need to be able to install and troubleshoot app issues

■  Deploy/Management: Puppet, Nagios, Ganglia, Fab ▪  Frameworks can do more than just NoSQL!

Page 53: Oracle vs NoSQL – The good, the bad and the ugly

MongoDB: Document datastore

[Diagram: a Client connects to two MongoS processes, which route to Replica Set 1 and Replica Set 2. Each replica set has one MongoD (Master) and two MongoD (Slave) nodes. Two further MongoD boxes and the labels "MongoD Routers" and "Config Servers" also appear, with numbered callouts 1 to 4.]

•  Write scaling: sharding through MongoS
•  Read scaling via Replica sets
•  Writes to Master Node, reads from Master and Slave nodes (optional)

Page 54: Oracle vs NoSQL – The good, the bad and the ugly

MongoDB: Data Modeling

RDBMS          MongoDB
Database       Database
Table          Collection
Row            Document
RowID          _id
Index          Index
Join           Embedded Document (DBRef)
Foreign Key    Reference

Order ID: 1001
Customer: John

Order Line Items:
20001 – Tires – 2 x $84 - $168
45320 – Pump – 1 x $54 - $54

Payment Details:
Card: Amex
CC: 3425268768
Exp: 03/17
Total: $222

[Diagram: the relational tables behind this order: Order, Customer, Line Items, Financial Instrument, FinTrans Journal]

{ "order_id": "1001", "customer": "John",
  "orderitems": [ {"prodid": "20001", "prodname": "Tires", "Qty": 2, "price": 168},
                  {"prodid": "45320", "prodname": "Pump", "Qty": 1, "price": 54} ],
  "pcard": "Amex", "pcc": "3425268768", "pexp": "03/17", "ord_tot": 222 }
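The mapping from joined rows to an embedded document can be sketched in plain Python. This is only an illustration: the `orders` and `line_items` lists stand in for relational tables, `to_document` is a hypothetical helper, and no MongoDB driver is involved. As on the slide, `price` is treated as the line total.

```python
# Stand-ins for two relational tables, linked by order_id (a foreign key).
orders = [{"order_id": "1001", "customer": "John"}]
line_items = [
    {"order_id": "1001", "prodid": "20001", "prodname": "Tires", "Qty": 2, "price": 168},
    {"order_id": "1001", "prodid": "45320", "prodname": "Pump", "Qty": 1, "price": 54},
]


def to_document(order, items):
    """Fold the matching line-item rows into the order as an embedded array."""
    embedded = [
        {key: value for key, value in item.items() if key != "order_id"}
        for item in items
        if item["order_id"] == order["order_id"]
    ]
    doc = dict(order)
    doc["orderitems"] = embedded
    # The total is stored with the document instead of being computed by a join.
    doc["ord_tot"] = sum(item["price"] for item in embedded)
    return doc


order_doc = to_document(orders[0], line_items)
```

Once the line items are embedded, reading the whole order is a single lookup by `order_id`; the cost is that the same product details may be repeated across many order documents.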

Page 55: Oracle vs NoSQL – The good, the bad and the ugly

MongoDB: Essentials ■  Stands for “huMONGOus DataBase” ■  Reads and Writes using memory-mapped files

▪  Try and fit working set in memory ▪  Use SSDs for faster I/O

■  Very good index support on identified JSON fields ▪  Allows Key-Value, Range and text search queries ▪  Unique as well as Compound Indexes ▪  Special TTL (Time-to-Live) index to retire data

■  Stores documents in BSON format (Binary JSON) ■  Interact, manage, program through Mongo Shell ■  Many other drivers and interfaces ■  Support for Geospatial data and queries ■  Aggregation Framework and MapReduce support

Page 56: Oracle vs NoSQL – The good, the bad and the ugly

MongoDB Physical/Memory Mapping

Page 57: Oracle vs NoSQL – The good, the bad and the ugly

MongoDB: Essentials ■  Query optimizer exposes execution plan ■  Multiple sharding methods:

▪  Range-based sharding: Optimized for range queries ▪  Hash-based sharding: Ensure uniform distribution ▪  Tag-aware sharding: Partitioned by user-specified configuration

■  Write-ahead journaling ▪  Journal commits every 100ms (oplog is capped collection)

■  Configurable Write-availability via Write Concern ▪  Unacknowledged (memory only) ▪  Acknowledgement for specific levels:

—  Write to at least 2 replicas in the same datacenter

—  Write to at least 1 replica in remote datacenter

■  Commercially supported by 10gen (now called MongoDB)
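The difference between the range-based and hash-based sharding methods listed above can be sketched as two routing functions. This is a simplified illustration, not MongoDB's implementation: the shard names, the split points in `RANGE_BOUNDS` and the use of a modulo over an MD5 digest are all assumptions made for the example.

```python
import bisect
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]

# Hypothetical split points: keys below 1000 go to shard0,
# 1000 to 1999 go to shard1, 2000 and above go to shard2.
RANGE_BOUNDS = [1000, 2000]


def range_shard(key):
    """Range-based routing: neighbouring keys land on the same shard."""
    return SHARDS[bisect.bisect_right(RANGE_BOUNDS, key)]


def hash_shard(key):
    """Hash-based routing: keys are spread evenly, but neighbours are scattered."""
    digest = hashlib.md5(str(key).encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]
```

A range query such as "keys 10 to 20" touches a single shard under `range_shard`, while under `hash_shard` it may fan out to every shard; in exchange, a monotonically increasing key does not pile all new writes onto one shard.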

Page 58: Oracle vs NoSQL – The good, the bad and the ugly

MongoDB: The Not-so-good… ■  Reads block Writes (albeit for very short periods ~ microsecs)

▪  Be careful about aggregation/MapReduce: Intense reads ▪  Read lock yields when read has to go to disk ▪  Read locks can be shared by multiple readers

■  Writes block Reads (Writer-greedy, for very short periods) ■  Locks are at a “database” level

▪  Careful with your data model! ▪  Typically restrict one collection per database if possible ▪  Write to multiple documents will yield periodically

■  Index creation (writes) locks your entire database ■  Replicates to Slaves and locks all slaves in Replicaset ■  Compaction also locks the database ■  Secondaries block on replication writes

Page 59: Oracle vs NoSQL – The good, the bad and the ugly

CouchBase – Another Document Store

[Diagram: Clients and servers read/write from/to Documents (user/application data). Documents live on Data Buckets (multitenant architecture, based on bucket partitioning). Data Buckets live on Server Nodes, which form a dynamically scalable Couchbase Cluster.]

Page 60: Oracle vs NoSQL – The good, the bad and the ugly

CouchBase Single-Node Architecture

[Diagram: a single Couchbase node has two halves.
Data Manager: data access ports 11210 / 11211, object-managed cache, storage engine, query engine with the Query API on port 8092.
Cluster Manager (Erlang/OTP): admin console on port 8091 over http, REST management API/Web UI, and the replication, rebalance and shard state manager.]

Page 61: Oracle vs NoSQL – The good, the bad and the ugly

CouchBase: Background and Use cases

■  Created as a Merge of code and ideas: ▪  MemCache – An excellent memory only cache ▪  CouchDB – A Key-Value store ▪  Now a Persistent Cache ▪  Code in Erlang and C++ (??) ▪  Different ports for both products – now merging ▪  Lots of MemCache implementations ▪  Now can upgrade into CouchBase quickly – Moxi client

■  Primarily as a Caching solution ▪  Very fast for reads and writes ▪  Some concerns with cross data center replication ▪  IMHO - Not yet suited for RYOWs via secondary key

Page 62: Oracle vs NoSQL – The good, the bad and the ugly

Cassandra: Column-Family datastore

[Diagram: a Client writing to a ring of six nodes, Node 1 to Node 6]

•  Hash function(Key) => Token
•  Client writes to selected Node as per Token
•  Coordinator Node replicates to other nodes (Timed per Quorum setting)
•  Node acknowledges to coordinator
•  Acknowledgement to client
•  Data written to internal commit log
•  If node goes offline, writes stop
•  When node rejoins, a “hinted handoff” process completes the pending writes + “read repair”
•  Requests can range from ANY to ALL
   •  ANY: Write to commit log on at least 1 node
   •  ALL: Writes complete to memory and commit log on ALL replicas
•  Availability precedes Consistency (AP)
•  Read and Write Paths are separate
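The "Hash function(Key) => Token" step and the acknowledgement counting can be sketched as a toy token ring. Everything here is an assumption made for illustration: the `Ring` class, the MD5-based token, one token per node and the `write_succeeds` helper are not Cassandra's actual code, and hinted handoff is not modelled.

```python
import bisect
import hashlib


class Ring:
    """Toy token ring: each node owns one token; a key belongs to the next nodes clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = replication_factor
        self.tokens = sorted((self._token(node), node) for node in nodes)

    @staticmethod
    def _token(value):
        # Stand-in hash function; the partitioner in a real cluster may differ.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def replicas(self, key):
        """Return the rf nodes responsible for key, walking clockwise from its token."""
        idx = bisect.bisect_left(self.tokens, (self._token(key), ""))
        return [self.tokens[(idx + i) % len(self.tokens)][1] for i in range(self.rf)]


def write_succeeds(ring, key, down, required_acks):
    """A write succeeds if enough live replicas can acknowledge it."""
    acks = [node for node in ring.replicas(key) if node not in down]
    return len(acks) >= required_acks
```

With six nodes and a replication factor of 3, a write that needs all three acknowledgements fails as soon as one replica is down, while a write that needs only two still succeeds. That is the availability versus consistency dial the slide describes.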

Page 63: Oracle vs NoSQL – The good, the bad and the ugly

Cassandra: Column-Family datastore

Writes, in order:
(1) Write: (K1, {C1:V1})
(2) Write: (K1, {C2:V2})
(3) Write: (K2, {C1:V3, C2:V4})
(4) Write: (K1, {C1:V5, C3:V6})

[Diagram: each write is appended to the Commit log on disk and applied to the Memtable in memory. The Memtable is later flushed to disk as an SSTable with an Index. After all four writes the SSTable holds:
K1   C2:V2   C1:V5   C3:V6
K2   C1:V3   C2:V4]
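The four writes above can be replayed against a toy model of the write path. This is a minimal sketch under stated assumptions: `ColumnStore` is a hypothetical class, the "disk" structures are ordinary Python lists, and compaction, timestamps and commit-log truncation are left out.

```python
class ColumnStore:
    """Toy model of a commit log, a memtable and flushed SSTables."""

    def __init__(self):
        self.commit_log = []  # append-only record of every write (on disk in a real system)
        self.memtable = {}    # in memory: key -> {column: value}
        self.sstables = []    # flushed tables, oldest first; never modified after flush

    def write(self, key, columns):
        # Log first for durability, then apply the change in memory.
        self.commit_log.append((key, dict(columns)))
        self.memtable.setdefault(key, {}).update(columns)

    def flush(self):
        # Move the current memtable to "disk" as a new SSTable.
        if self.memtable:
            self.sstables.append(self.memtable)
            self.memtable = {}

    def read(self, key):
        # Merge oldest to newest so that later values override earlier ones.
        row = {}
        for table in self.sstables + [self.memtable]:
            row.update(table.get(key, {}))
        return row


store = ColumnStore()
store.write("K1", {"C1": "V1"})
store.write("K1", {"C2": "V2"})
store.flush()
store.write("K2", {"C1": "V3", "C2": "V4"})
store.write("K1", {"C1": "V5", "C3": "V6"})
```

Note that `write` never reads or rewrites an existing SSTable, which is why the write path is cheap, while `read` has to merge several sources, which is why the read path is the more complex one.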

Page 64: Oracle vs NoSQL – The good, the bad and the ugly

Cassandra: Essentials ■  Write Path is simpler; Reads are a little more complex

▪  Merge Memtable (Row/Key cache) and Row Reads from Disk ▪  Uses Bloom Filter to decide which SSTables to skip (false positives possible) ▪  In-memory caches are stored in Java heap (GC!!!!) ▪  Can return inconsistent data for RYOW (read-your-own-writes), depending on Quorum ▪  Consistent: (nodes_written + nodes_read) > replication_factor

■  Compaction: Merge SSTables; Expire Tombstoned data (TTL) ■  Data Modeling:

▪  Model your queries – Optimize for reads ▪  Denormalize – Reads: Slow; Writes: Fast; Disk: Cheap ▪  Column families are stored sorted by timestamp

■  CQL: Cassandra Query Language – A familiar interface ■  Maintaining the Cluster: Gossip and Snitch :)
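The consistency rule quoted above, (nodes_written + nodes_read) > replication_factor, is easy to check in code. The function names below are made up for this sketch; the arithmetic is the rule from the slide plus the usual majority definition of a quorum.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def is_consistent(nodes_written, nodes_read, replication_factor):
    """True when every read set must overlap every write set in at least one replica."""
    return nodes_written + nodes_read > replication_factor


# With 3 replicas: quorum writes plus quorum reads overlap; single-node writes and reads may not.
rf = 3
quorum_ok = is_consistent(quorum(rf), quorum(rf), rf)
one_one_ok = is_consistent(1, 1, rf)
```

The overlap is what makes read-your-own-writes safe: if a write reached 2 of 3 replicas and a read consults 2 of 3, at least one of the replicas read must hold the new value.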

Page 65: Oracle vs NoSQL – The good, the bad and the ugly

Choosing the right NoSQL database: ASCII the right question!

■  Is this a site-facing, P1 Application? ■  Is this a BI/Analytics type problem waiting to be solved? ■  Is this Write Intensive or Read Intensive? ■  Is this a Caching problem? ■  Can the application afford some data loss? ■  What about data consistency? ■  What is more important – consistency or availability? ■  How many data centers need to be supported? ■  What are the query patterns? Are they widely varying? ■  How many distinct clusters of data are present, and how are they related? ■  Is my organization ready to support this product?

Page 66: Oracle vs NoSQL – The good, the bad and the ugly

Generic problems ■  Consistency is and will be a problem in the NoSQL world ■  Data loss will be present - application should cater to this

▪  Consider the cost of workarounds/cost of data loss ■  The world of NoSQL is evolving:

▪  Maturing slowly: Peak -> Sliding into the trough ▪  Too many choices: 150 choices: http://nosql-database.org/ ▪  Many picking the wrong product…

—  (and had to change it later: Check my Delicious stream #nosql)

▪  Most NoSQL vendors still VC funded ▪  New Versions/Features every 6 months! ▪  We will learn lessons the hard way…..

Page 67: Oracle vs NoSQL – The good, the bad and the ugly

Real World problems ■  Need to break out of the RDBMS/ACID world

▪  Imagine a world with no COMMITs, no “Transactions” ▪  Data loss and data inconsistency are inevitable ▪  Data Owners/Architects shy away: FUD, real dangers

■  Everyone wants to become (or is!) a NoSQL expert ▪  Spell NoSQL and earn $$$ :) ▪  Best way to learn: Create a “Big Data” need and fulfill it ▪  Who makes the decisions?

■  Lack of skills and maturity ▪  Product choice: Knowledge/Experience/Forethought required ▪  Many NoSQL products still basic in functionality ▪  Be prepared to back out of your initial choice

Page 68: Oracle vs NoSQL – The good, the bad and the ugly

How to get there (from here)?

■  This presentation is just the beginning ■  Lots and lots of reading and experimenting required ■  Recommended Reading:

▪  NoSQL Distilled by Fowler and Sadalage ▪  Seven Databases in Seven Weeks: Redmond and Wilson ▪  Many NoSQL books – browse at Safari Online

■  Lots of links to read – Live links: ▪  Follow me on http://delicious.com/jkanagaraj - Tag #nosql

■  Play with the community versions: ▪  Available from the vendors: No support though ▪  Spin up/use Cloud based VMs – Rackspace or AWS

Page 69: Oracle vs NoSQL – The good, the bad and the ugly

A warning – And some advice

“Some people, when confronted with a big data problem, think, I’ll use Hadoop. Now, they have a big data problem and a big Hadoop cluster”
Dmitriy Ryaboy, Engineering Manager, Twitter

▪  Start small ▪  Grow with success ▪  Create your own expertise ▪  It is about the untapped potential in your data

Page 70: Oracle vs NoSQL – The good, the bad and the ugly

Please  fill  in  the  feedback  form!  Link  up  with  me  on  LinkedIn  

John  Kanagaraj,  PayPal,  an  eBay  Inc.  Company  

