Succeeding with Polyglot Persistence: Surviving and Thriving with NoSQL in the Enterprise
Dr. Vladimir Bacvanski [email protected]
@OnSoftware
www.scispike.com Copyright © SciSpike 2014
Who am I?
§ Dr. Vladimir Bacvanski
– Founder of SciSpike, a development, consulting, and training firm specializing in advanced software and data technologies: http://www.scispike.com/
– Consultant, trainer, and mentor focusing on making clients successful in adopting new data and software approaches
– Awarded the title of IBM Champion 2009-2014
@OnSoftware http://www.linkedin.com/in/VladimirBacvanski
SciSpike: What We Do
§ Custom Development
– Rapid development, innovative Web applications with Node.js & NoSQL
§ Consulting
– Software and Enterprise Architecture
– NoSQL
– Agile Development
§ Training
– Software Development
• Node.js, Java, JEE, Spring
NoSQL and Polyglot Persistence
§ NoSQL: "Not Only SQL"
– An unfortunate name!
• "Non-relational" would be the right name
– Non-relational, typically distributed data stores
– Address deficiencies of relational databases
– Various technologies and approaches
§ Polyglot persistence:
– Use of several different database systems
– Each being the best choice in its application area
Problems with Relational Stores
§ Data that does not naturally fit into tables → impedance mismatch
§ Development time often too long
§ Dealing with unstructured data
§ Performance problems
§ Difficult to run on clusters
§ Cost
Structured and Unstructured Data Sources
Structured Data Sources
• Existing databases
• ERP/CRM/BI systems
• Inventory
• Supply chain
Unstructured Data Sources
• Server logs
• Search engine logs
• Browsing logs
• E-Commerce records
• Social media
• Voice
• Video
• Sensor data
NoSQL Impact
[Figure: cost/performance plotted against data volume growing in steps of x1000 (1M, 1B, 1T, 1Q, ...), with disks and processors also scaling by x1000.]
§ Relational database
– Today: doable, but expensive and slow
– Tomorrow: volume is out of reach
§ NoSQL
– Stabilize cost and increase performance
– Enable unlimited volume growth
Scale Up vs. Scale Out
[Figure: two plots of cost against capability, one for scale up and one for scale out.]
Typical NoSQL Systems
§ Non-relational
§ Distributed
§ Horizontally scalable
§ No need for a fixed schema
§ Several established players
§ Systems are specialized
NoSQL Stores and Their Categories
§ Choose the store that is the best match for your application
§ It is fine to have several different stores used – "Polyglot persistence"
[Figure: the four categories of NoSQL stores: Key-Value, Column-Family, Document-Oriented, Graph DB.]
NoSQL Stores: Scale vs. Complexity of Data
[Figure: the store categories (Key-Value, Column-Family, Document-Oriented, Graph DB) plotted by scalability against complexity of data, with a region marking the needs of most applications.]
Key-Value Stores
§ Key → Value mapping
§ Large, persistent Map ("hashtable")
– Values could be lists and hashes
§ Easy to use
§ Scale very well
§ Data model may be too simple for most applications
§ Systems:
– Redis, Riak, Memcached, Amazon DynamoDB, Aerospike, FoundationDB
§ Use when the data model is very simple and scalability is essential
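The map-like model above can be sketched in a few lines. This is a minimal in-memory illustration, not the API of Redis, Riak, or any other system listed; the class `KeyValueStore`, its methods `put`, `get`, and `delete`, and the sample session data are hypothetical names chosen for this sketch.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Hypothetical in-memory stand-in for a key-value store (illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        # The store treats the value as opaque; it may be a string, list, or hash.
        self._data[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        # Lookup is by key only; there is no query over the values.
        return self._data.get(key, default)

    def delete(self, key: str) -> bool:
        # Returns True if the key existed and was removed.
        if key in self._data:
            del self._data[key]
            return True
        return False


store = KeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["book", "pen"]})
session = store.get("session:42")
missing = store.get("session:99")
```

Everything the application can do goes through the key, which is why this model scales well and also why it can be too simple when queries over the values are needed.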
Typical Use Cases
§ The data model is very simple! – Actual data can be JSON
§ Session data
§ User preferences and profiles
§ Shopping cart
§ If another NoSQL store is good enough, you may want to skip the key-value store and let a column-family or document store handle this data
Column-Family
§ "Column-family": similar to a table
– Table is sparse
§ Key → (Column:Value)*
§ Columns have names
§ Can be indexed
§ Can store complex data
– Denormalize!
§ Systems:
– Google BigTable, HBase, Cassandra, Amazon SimpleDB, Hypertable
§ Use when scalability is essential
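The Key → (Column:Value)* shape can be pictured as a map of maps. The sketch below uses plain Python dictionaries and is not the API of HBase, Cassandra, or any other listed system; the row keys, column names, values, and the helper `read_column` are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Optional

# Row key -> {column name -> value}. Rows are sparse: each row stores only
# the columns it actually has, so nothing is reserved for missing ones.
ColumnFamily = Dict[str, Dict[str, str]]

users: ColumnFamily = defaultdict(dict)

# Two rows in the same column family with different sets of columns.
users["user:1"]["name"] = "Alice"
users["user:1"]["email"] = "alice@example.com"
users["user:2"]["name"] = "Bob"
users["user:2"]["last_login"] = "2014-03-01"

# Denormalized data: the order summary is stored with the user row
# instead of being joined from a separate table at read time.
users["user:1"]["order:1001"] = "2 items, shipped"


def read_column(family: ColumnFamily, row_key: str, column: str) -> Optional[str]:
    """Return one column of one row, or None if the row lacks that column."""
    return family.get(row_key, {}).get(column)


alice_email = read_column(users, "user:1", "email")
bob_email = read_column(users, "user:2", "email")  # None: the column is absent
```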
Typical Use Cases
§ High insert volume: logging
§ Real-time updates
§ Content management
§ Expiring content
§ Cross-datacenter replication
§ MapReduce analytics over stored data
§ You don't need conventional (ACID) transactions
Document Stores
§ JSON, BSON, XML
§ No schema
§ Indexes improve performance
§ Easy transition from RDBMS
§ Systems:
– MongoDB, CouchDB, Couchbase
§ Use when data is in semi-structured form
§ Often seen in new Web applications
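As a rough illustration of schema-free documents and of why an index helps, the sketch below keeps JSON-like documents in a Python list. It does not use the API of MongoDB, CouchDB, or Couchbase; the `products` data and the helper `build_index` are hypothetical.

```python
from typing import Any, Dict, List

Document = Dict[str, Any]

# A "collection" of JSON-like documents. No schema is declared, so the
# documents are free to carry different fields.
products: List[Document] = [
    {"_id": 1, "type": "book", "title": "NoSQL Basics", "pages": 250},
    {"_id": 2, "type": "shirt", "title": "Logo Tee", "sizes": ["S", "M", "L"]},
    {"_id": 3, "type": "book", "title": "Hadoop Intro", "pages": 310},
]


def build_index(collection: List[Document], field: str) -> Dict[Any, List[Document]]:
    """Map each value of `field` to the documents holding it.

    Documents without the field are simply left out of the index.
    """
    index: Dict[Any, List[Document]] = {}
    for doc in collection:
        if field in doc:
            index.setdefault(doc[field], []).append(doc)
    return index


# With the index, finding all books is a lookup instead of a full scan.
by_type = build_index(products, "type")
book_titles = [doc["title"] for doc in by_type["book"]]
```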
Typical Use Cases
§ Logging – Especially with variable content
§ Product information
§ Customer information
§ Content management
§ Data to be stored has a format that varies over time
– Flexible schema
§ Web analytics
Graph Databases
§ Nodes with properties
§ Nodes connected through relationships
§ Can model very complex graph data
– Social networks
§ Systems:
– Neo4j, Infinite Graph, TitanDB, OrientDB
§ Use when data is a (complex) graph
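A small sketch of nodes with properties connected through relationships, with one typical traversal (friends of friends). It uses plain Python structures rather than the query language or API of any listed graph database; the people, the `FRIEND` relationship type, and both helper functions are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

# Nodes with properties, keyed by node id.
nodes: Dict[str, Dict[str, str]] = {
    "alice": {"name": "Alice", "city": "Austin"},
    "bob": {"name": "Bob", "city": "Boston"},
    "carol": {"name": "Carol", "city": "Chicago"},
    "dave": {"name": "Dave", "city": "Denver"},
}

# Directed, typed relationships: (source, relationship type, target).
relationships: List[Tuple[str, str, str]] = [
    ("alice", "FRIEND", "bob"),
    ("bob", "FRIEND", "carol"),
    ("bob", "FRIEND", "dave"),
    ("alice", "FRIEND", "dave"),
]


def neighbors(node_id: str, rel_type: str) -> Set[str]:
    """Nodes reachable from node_id over one relationship of rel_type."""
    return {dst for src, rel, dst in relationships
            if src == node_id and rel == rel_type}


def friends_of_friends(node_id: str) -> Set[str]:
    """Nodes two FRIEND hops away that are not already direct friends."""
    direct = neighbors(node_id, "FRIEND")
    second: Set[str] = set()
    for friend in direct:
        second |= neighbors(friend, "FRIEND")
    return second - direct - {node_id}


suggestions = friends_of_friends("alice")
```

In a relational store the same question needs a self-join per hop; a graph database is built to follow relationships directly.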
Typical Use Cases
§ Highly interconnected data
§ Social graphs
§ Party relationships in an enterprise
§ Location-based services
§ Purchasing analytics and recommendations
§ Often combined with other systems to store the bulk of data
– Graph database can focus on relationships
Hadoop and NoSQL Trends
Challenges of NoSQL
§ No standards
§ NoSQL categories are very different from each other
§ Typically schema is not enforced, or there is none
§ Systems are targeting specific areas only
– You may need multiple systems for different purposes
A Common Pattern for Processing Large Data
1. Load a large set of records onto a set of machines
2. Extract something interesting from each record ("Map")
3. Shuffle and sort intermediate results
4. Aggregate intermediate results ("Reduce")
5. Store end result
Data moves between the steps as key/value pairs.
The MapReduce Programming Model
§ "Map" step:
– Input is split into pieces
– Worker nodes process individual pieces in parallel (under global control of the Job Tracker node)
– Each worker node stores its result in its local file system, where a reducer is able to access it
§ "Reduce" step:
– Data from the map steps is aggregated ("reduced") by worker nodes (under control of the Job Tracker)
– Multiple reduce tasks can parallelize the aggregation
MapReduce Step-by-Step
[Figure: input pairs (k1, v1) through (k6, v6) are fed to four Map tasks, which emit the intermediate pairs (b, 2), (a, 1), (c, 6), (c, 3), (c, 2), (a, 5), (c, 8), (b, 7). Shuffle + Sort aggregates values by keys: a → 1 5, b → 2 7, c → 2 3 6 8. Three Reduce tasks then produce (r1, s1), (r2, s2), (r3, s3).]
Credit: Jimmy Lin, University of Maryland
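The shuffle-and-sort step in the figure can be reproduced with a short sketch. It runs in a single process and only mirrors the grouping logic, not how Hadoop moves data between machines. The function name `shuffle_and_sort` is hypothetical, and summing the values is an assumed reduce function, since the figure leaves the reduce operation abstract.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Intermediate (key, value) pairs as emitted by the map tasks in the figure.
intermediate: List[Tuple[str, int]] = [
    ("b", 2), ("a", 1), ("c", 6), ("c", 3),
    ("c", 2), ("a", 5), ("c", 8), ("b", 7),
]


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, with keys in sorted order.

    The values are sorted too, only to match the layout of the figure.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}


grouped = shuffle_and_sort(intermediate)

# Assumed reduce function for illustration: sum the values of each key.
reduced = {key: sum(values) for key, values in grouped.items()}
```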
Separation of Work
Programmers
• Map
• Reduce
Framework
• Deals with fault tolerance
• Assigns workers to map and reduce tasks
• Moves processes to data
• Shuffles and sorts intermediate data
• Deals with errors
Hadoop Distributed File System
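[Figure: data is broken into blocks (1 to 9). The blocks are spread over Node 1, Node 2, ..., Node n (for example blocks 1, 3, 5, 7, 9 on one node and 2, 4, 6, 8 on another), with the Name Node on the master node.]
Data is replicated across the nodes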
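The two ideas on this slide, breaking data into blocks and replicating each block on several nodes, can be sketched as follows. This is a toy model under stated assumptions: the round-robin placement rule, the tiny block size, and the function names `split_into_blocks` and `place_replicas` are all invented for the sketch, and real HDFS uses far larger blocks and its own placement policy.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Break the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin.

    The placement rule is an assumption made for this sketch; it only shows
    that every block ends up on several different nodes.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    placement: Dict[int, List[str]] = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"x" * 90, block_size=10)  # 9 blocks, as in the figure
placement = place_replicas(len(blocks), ["node1", "node2", "node3"], replication=2)
```

The `placement` dictionary plays the role of the metadata kept by the Name Node: it records which nodes hold each block, so losing one node still leaves a copy of every block elsewhere.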
Map and Reduce Tasks Running on Nodes
[Figure: the same cluster of Node 1, Node 2, ..., Node n holding the data blocks. The master node runs the Job Tracker next to the Name Node, and each data node runs Map and Reduce tasks.]
User programs are copied to all nodes
Big Data with Apache Hadoop
§ Open source framework for processing large amounts of data on cheap, unreliable clusters
§ Inspired by Google technologies, implemented in Java
§ Designed to scale from the outset
§ Popular in companies that deal with large amounts of data
– Often in PB size
§ Useful as a cost-effective solution for problems smaller than a PB as well!
Two Key Aspects of Hadoop
§ MapReduce framework – How Hadoop understands and assigns work to the nodes (machines)
§ Hadoop Distributed File System = HDFS
– Where Hadoop stores data
– A file system that spans all the nodes in a Hadoop cluster
– It links together the file systems on many local nodes to make them into one big file system
Logical MapReduce Example: Word Count
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Content of Input Documents
Document 1: Hello World Bye World
Document 2: Hello Hadoop

Map 1 emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2 emits: <Hello, 1> <Hadoop, 1>

Reduce (final output, after shuffle): <Bye, 1> <Hadoop, 1> <Hello, 2> <World, 2>
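For readers who want to run the example, the word count logic can be written as plain Python. This is a single-process sketch of the logical model only, not Hadoop code; the function names and the document names `doc1` and `doc2` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def map_words(doc_name: str, contents: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(documents: Dict[str, str]) -> Dict[str, int]:
    # Run the map step on every document and group the emitted values
    # by key, which is what the shuffle does between map and reduce.
    shuffled: Dict[str, List[int]] = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_words(name, contents):
            shuffled[word].append(count)
    # Run the reduce step once per key, in sorted key order.
    return dict(reduce_counts(word, shuffled[word]) for word in sorted(shuffled))


result = word_count({"doc1": "Hello World Bye World", "doc2": "Hello Hadoop"})
```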
How To Create MapReduce Jobs
§ Java API
– Low level, very flexible
– Time-consuming development
§ Streaming API
– A simple, productive model for Python and Ruby
§ Hive
– Open source language / Apache sub-project
– Provides a SQL-like interface to Hadoop
§ Pig
– Data flow language / Apache sub-project
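To give a feel for the streaming model, here is a sketch of a mapper and a reducer that exchange tab-separated text lines, with the reducer relying on its input arriving sorted by key. The functions take iterables of lines so they can be exercised locally; in an actual streaming job each would sit in its own script reading standard input and writing standard output. The function names and the local driver at the bottom are assumptions for illustration.

```python
import itertools
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word of input text."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum counts per word; input lines must arrive sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    # groupby only merges adjacent equal keys, hence the sorted input.
    for word, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local stand-in for the framework: map, sort by key, then reduce.
map_output = list(mapper(["Hello World Bye World", "Hello Hadoop"]))
reduce_output = list(reducer(sorted(map_output)))
```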
Flume
§ Distributed streaming tool for collecting, aggregating, and moving large amounts of log data
§ Horizontally scalable, centrally managed
§ Tunable data reliability
§ Push and pull sources
§ Many data sources
– Tail of a file, program output, logs, Twitter, AMQP, IRC, …
§ Many data outputs
– HDFS, HBase
§ Decorators: process data in flight
The Big Picture
[Figure: applications backed by several stores, each holding the data it suits best.]
• Columnar: price updates, logs
• Document: product info
• Graph: Customer Agent relationships
• RDB: XA data
• Hadoop: operational analytics, price analytics
• Key/Value: session data
Data Access Components
§ Do not allow dependencies on a specific technology to propagate through the whole application
§ Design for change
– But do not over-complicate
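One way to keep a store's API from leaking into the rest of the application is a small data access interface. The sketch below is one possible shape under stated assumptions, not a prescribed design: `SessionRepository`, `InMemorySessionRepository`, and `greet` are hypothetical names, and the in-memory class stands in for an implementation backed by a real store.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional


class SessionRepository(ABC):
    """Hypothetical data access interface the application codes against."""

    @abstractmethod
    def save(self, session_id: str, data: Dict[str, str]) -> None: ...

    @abstractmethod
    def load(self, session_id: str) -> Optional[Dict[str, str]]: ...


class InMemorySessionRepository(SessionRepository):
    """Stand-in implementation; a store-backed one could replace it."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Dict[str, str]] = {}

    def save(self, session_id: str, data: Dict[str, str]) -> None:
        self._sessions[session_id] = dict(data)

    def load(self, session_id: str) -> Optional[Dict[str, str]]:
        return self._sessions.get(session_id)


def greet(repo: SessionRepository, session_id: str) -> str:
    """Application logic: depends only on the interface, not on a store."""
    session = repo.load(session_id)
    return f"Welcome back, {session['user']}" if session else "Please log in"


repo = InMemorySessionRepository()
repo.save("s1", {"user": "alice"})
greeting = greet(repo, "s1")
```

Swapping the store then means writing one new class with the same two methods; `greet` and the rest of the application stay unchanged. A single narrow interface per kind of data is usually enough, which keeps the design from becoming over-complicated.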
NoSQL Data Services
§ A flexible way of separating data from particular technologies
§ Typically REST
§ Some databases already come with service APIs
§ Many APIs developed by the community
[Figure: a data store accessed through a REST API.]
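The idea of a data service can be sketched without any network code: a function that maps REST-style requests onto store operations. Everything here is hypothetical, including the `/items/<key>` resource layout, the `handle` function, and the dictionary standing in for the data store; it is not the service API of any particular database.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical backing store; any NoSQL store could sit behind the service.
_store: Dict[str, object] = {}


def handle(method: str, path: str, body: Optional[str] = None) -> Tuple[int, str]:
    """Map a REST-style request on /items/<key> to a store operation.

    Returns (HTTP status code, response body).
    """
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "items":
        return 404, json.dumps({"error": "unknown resource"})
    key = parts[1]

    if method == "PUT":
        if body is None:
            return 400, json.dumps({"error": "missing body"})
        _store[key] = json.loads(body)
        return 200, json.dumps(_store[key])
    if method == "GET":
        if key not in _store:
            return 404, json.dumps({"error": "not found"})
        return 200, json.dumps(_store[key])
    if method == "DELETE":
        if key not in _store:
            return 404, json.dumps({"error": "not found"})
        del _store[key]
        return 204, ""
    return 405, json.dumps({"error": "method not allowed"})


put_status, put_body = handle("PUT", "/items/p1", '{"name": "pen"}')
get_status, get_body = handle("GET", "/items/p1")
```

Clients see only resources and HTTP verbs, so the store behind `handle` can be replaced without changing them.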
Integrating Relational, Streams, and Hadoop
[Figure: integration of data and big data. Labels: Traditional / Relational Data Sources; Non-Traditional / Non-Relational Data Sources (varied data formats: semi-structured, unstructured, ...); Streams (in-motion analytics, ultra low latency results); Database & Warehouse (traditional warehouse, at-rest data analytics, results); data analytics results; Event System; NoSQL.]
Master Data Management and Governance
§ NoSQL stores can easily become a bigger mess than relational stores
§ Introduce a practical plan
– Avoid lengthy and cumbersome governance
– Actual use should be the driving force
§ Be ready for change
– The technologies change rapidly
§ Focus on business outcomes
§ MDM and NoSQL
– The good: performance
– The bad: few out-of-the-box solutions
Succeeding with Polyglot Persistence
1. Actively look for solutions where the right store can ease the pain
2. Make sure you deliver tangible value to clients
3. After you get your first apps to work: create a NoSQL/Big Data introduction and governance plan
4. Prioritize: do the most useful thing for the business first
5. Integrate with existing IT
6. Make sure you hire or grow your Big Data/NoSQL champions
7. The field is immature: look out for new tools and techniques
Conclusions
– NoSQL addresses the weak points of relational systems
• Hadoop and NoSQL
– Polyglot persistence: use the most suitable database for your task
• Including relational databases!
– Scale out to crunch Big Data
– Integrate with conventional technologies
Connect!
§ Dr. Vladimir Bacvanski
Email: [email protected]
Blog: http://www.OnBuildingSoftware.com/
Twitter: http://twitter.com/OnSoftware
LinkedIn: http://www.linkedin.com/in/VladimirBacvanski