
Succeeding with Polyglot Persistence: Surviving and Thriving with NoSQL in the Enterprise

Dr. Vladimir Bacvanski [email protected]

@OnSoftware

www.scispike.com Copyright © SciSpike 2014

Who am I?

§  Dr. Vladimir Bacvanski
   – Founder of SciSpike, a development, consulting and training firm specializing in advanced software and data technologies
     http://www.scispike.com/
   – Consultant, trainer, and mentor focusing on making clients successful in adopting new data and software approaches
   – Awarded the title of IBM Champion 2009-2014
   @OnSoftware
   http://www.linkedin.com/in/VladimirBacvanski


SciSpike: What We Do

§  Custom Development
   – Rapid development, innovative Web Applications with Node.js & NoSQL
§  Consulting
   – Software and Enterprise Architecture
   – NoSQL
   – Agile Development
§  Training
   – Software Development
     • Node.js, Java, JEE, Spring



NoSQL and Polyglot Persistence

§  NoSQL: "Not Only SQL"
   – An unfortunate name!
     • "Non-relational" would be the right name
   – Non-relational, typically distributed data stores
   – Address deficiencies of relational databases
   – Various technologies and approaches
§  Polyglot persistence:
   – Use of several different database systems
   – Each being the best choice in its application area


Problems with Relational Stores

§  Data that does not naturally fit into tables → impedance mismatch
§  Development time often too long
§  Dealing with unstructured data
§  Performance problems
§  Difficult to run on clusters
§  Cost



Structured and Unstructured Data Sources

Structured Data Sources

• Existing databases
• ERP/CRM/BI systems
• Inventory
• Supply chain

Unstructured Data Sources

• Server logs
• Search engine logs
• Browsing logs
• E-Commerce records
• Social media
• Voice
• Video
• Sensor data



NoSQL Impact

[Figure: cost/performance against data volume growing in steps of x1000 (1M, 1B, 1T, 1Q, ... HUGE!), with further x1000 labels for disks and processors. Relational database: today doable, but expensive and slow; tomorrow volume is out of reach. NoSQL: stabilize cost and increase performance; enable unlimited volume growth.]


Scale Up vs. Scale Out

[Figure: two charts relating capability and cost, one for Scale Up and one for Scale Out.]


Typical NoSQL Systems

§  Non-relational
§  Distributed
§  Horizontally scalable
§  No need for a fixed schema

§  Several established players

§  Systems are specialized
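To make "distributed" and "horizontally scalable" concrete, here is a minimal sketch of how keys might be spread over several nodes by hashing. The node names, the `node_for_key` helper, and the choice of simple modulo hashing are assumptions made for illustration; real stores typically use more elaborate schemes such as consistent hashing, so that adding a node does not move most keys.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick a node for a key using a hash that is stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def partition(keys, nodes):
    """Return a mapping node -> list of keys assigned to that node."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement


# Hypothetical three-node cluster and 1000 hypothetical user keys.
nodes = ["node-1", "node-2", "node-3"]
keys = ["user:%d" % i for i in range(1000)]
placement = partition(keys, nodes)

for node, assigned in placement.items():
    print(node, len(assigned))
```

Scaling out in this picture means adding entries to `nodes`; each node then holds a smaller share of the keys.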



NoSQL Stores and Their Categories

§  Choose a store that is the best match for your application
§  It is fine to have several different stores used
   – "Polyglot persistence"

[Figure: the four categories of NoSQL stores: Key-Value, Column-Family, Document-Oriented, Graph DB.]


NoSQL Stores: Scale vs. Complexity of Data

[Figure: Key-Value, Column-Family, Document-Oriented, and Graph DB stores placed on axes of scalability and complexity of data, with a marked region for the needs of most applications.]


Key-Value Stores

§  Key → Value mapping
§  Large, persistent Map ("hashtable")
   – Values could be lists and hashes
§  Easy to use
§  Scale very well
§  Data model may be too simple for most applications
§  Systems:
   – Redis, Riak, Memcached, Amazon DynamoDB, Aerospike, FoundationDB
§  Use when data model is very simple and scalability essential



Typical Use Cases

§  The data model is very simple!
   – Actual data can be JSON
§  Session data
§  User preferences and profiles
§  Shopping cart
§  If another NoSQL store you already use is good enough, you may want to skip a dedicated key-value store and let the column-family or document store handle this data
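The session-data use case can be sketched with an in-memory stand-in for a key-value store. The `KeyValueStore` class, its method names, and the time-to-live parameter are hypothetical and chosen only for this illustration; they are not the API of Redis, Riak, or any other product listed above.

```python
import json
import time


class KeyValueStore:
    """In-memory stand-in for a key-value store with JSON values."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (json_text, expires_at or None)
        self._clock = clock  # injectable so expiry can be tested

    def put(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (json.dumps(value), expires_at)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        payload, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # drop the expired entry lazily, on read
            return default
        return json.loads(payload)

    def delete(self, key):
        self._data.pop(key, None)


# Session data: one opaque JSON value looked up by a single key.
store = KeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["sku-1", "sku-2"]}, ttl_seconds=1800)
print(store.get("session:42")["cart"])
```

The store never looks inside the value, which is why this model scales easily and also why it is too simple for anything that needs queries over the data.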



Column-Family

§  "Column-family": similar to a table
   – Table is sparse
§  Key → (Column:Value)*
§  Columns have names
§  Can be indexed
§  Can store complex data
   – Denormalize!
§  Systems:
   – Google BigTable, HBase, Cassandra, Amazon SimpleDB, Hypertable
§  Use when scalability is essential
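The "Key → (Column:Value)*" model can be pictured as a map from row keys to small maps of named columns, where each row carries only the columns it actually has. The sketch below is a toy in-memory model with hypothetical names (`ColumnFamily`, `insert`, `rows_with`); it does not reproduce the API of HBase, Cassandra, or the other systems listed.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy sparse table: row key -> {column name: value}."""

    def __init__(self, name):
        self.name = name
        self._rows = defaultdict(dict)

    def insert(self, row_key, columns):
        # Only the supplied columns are stored; absent columns cost nothing.
        self._rows[row_key].update(columns)

    def get(self, row_key, column=None):
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def rows_with(self, column):
        """Row keys that have a value for the given column."""
        return sorted(k for k, row in self._rows.items() if column in row)


users = ColumnFamily("users")
users.insert("u1", {"name": "Ada", "email": "ada@example.com"})
users.insert("u2", {"name": "Bob", "last_order:total": 19.99})
# Denormalized: order data is written into the user's row instead of being joined in.
users.insert("u1", {"last_order:total": 5.0})

print(users.get("u1"))
print(users.rows_with("email"))
```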


Typical Use Cases

§  High insert volume: logging
§  Real-time updates
§  Content management
§  Expiring content
§  Cross-datacenter replication
§  MapReduce analytics over stored data
§  You don't need conventional (ACID) transactions



Document Stores

§  JSON, BSON, XML
§  No schema
§  Indexes improve performance
§  Easy transition from RDBMS
§  Systems:
   – MongoDB, CouchDB, Couchbase
§  Use when data is in semi-structured form
§  Often seen in new Web applications
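A minimal sketch of the document-store idea: documents in one collection may carry different fields, and an index on a field avoids scanning every document. The `DocumentCollection` class and its methods are hypothetical and exist only for this illustration (this is not the MongoDB or CouchDB API); indexed values are assumed to be hashable.

```python
class DocumentCollection:
    """Toy schema-less collection with optional single-field indexes."""

    def __init__(self):
        self._docs = {}      # document id -> document (a dict)
        self._indexes = {}   # field -> {value -> set of document ids}
        self._next_id = 1

    def create_index(self, field):
        index = {}
        for doc_id, doc in self._docs.items():
            if field in doc:
                index.setdefault(doc[field], set()).add(doc_id)
        self._indexes[field] = index

    def insert(self, doc):
        doc_id = self._next_id
        self._next_id += 1
        self._docs[doc_id] = dict(doc)
        for field, index in self._indexes.items():
            if field in doc:
                index.setdefault(doc[field], set()).add(doc_id)
        return doc_id

    def find(self, field, value):
        if field in self._indexes:
            ids = self._indexes[field].get(value, set())  # index lookup
        else:
            ids = [i for i, d in self._docs.items()       # full scan
                   if field in d and d[field] == value]
        return [self._docs[i] for i in sorted(ids)]


products = DocumentCollection()
products.create_index("category")
products.insert({"name": "Laptop", "category": "computers", "ram_gb": 16})
products.insert({"name": "Novel", "category": "books", "author": "A. Writer"})
products.insert({"name": "Tablet", "category": "computers"})

print([d["name"] for d in products.find("category", "computers")])
```

The three product documents do not share a schema: only one has `ram_gb` and only one has `author`, yet they live in the same collection.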



Typical Use Cases

§  Logging
   – Especially with variable content
§  Product information
§  Customer information
§  Content management
§  Data to be stored has a format that varies over time
   – Flexible schema
§  Web analytics



Graph Databases

§  Nodes with properties
§  Nodes connected through relationships
§  Can model very complex graph data
   – Social networks
§  Systems:
   – Neo4J, Infinite Graph, TitanDB, OrientDB
§  Use when data is a (complex) graph
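The graph model (nodes with properties, connected through relationships) can be sketched with a small in-memory structure and a typical social-network query. The `Graph` class, the `FRIEND` relationship type, and the sample people are hypothetical; relationships are treated as directed here, which is an assumption of this sketch rather than a statement about any listed product.

```python
from collections import defaultdict


class Graph:
    """Toy property graph: nodes with properties, typed directed relationships."""

    def __init__(self):
        self.nodes = {}                 # node id -> properties
        self._out = defaultdict(list)   # node id -> [(relationship type, target id)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def relate(self, source, rel_type, target):
        self._out[source].append((rel_type, target))

    def neighbors(self, node_id, rel_type):
        return [t for r, t in self._out.get(node_id, []) if r == rel_type]


def friends_of_friends(graph, person):
    """People two FRIEND hops away who are not already direct friends."""
    direct = set(graph.neighbors(person, "FRIEND"))
    result = set()
    for friend in direct:
        for candidate in graph.neighbors(friend, "FRIEND"):
            if candidate != person and candidate not in direct:
                result.add(candidate)
    return sorted(result)


g = Graph()
for name in ["alice", "bob", "carol", "dave", "erin"]:
    g.add_node(name, kind="person")
g.relate("alice", "FRIEND", "bob")
g.relate("alice", "FRIEND", "dave")
g.relate("bob", "FRIEND", "carol")
g.relate("bob", "FRIEND", "alice")
g.relate("dave", "FRIEND", "carol")
g.relate("carol", "FRIEND", "erin")

print(friends_of_friends(g, "alice"))
```

The query walks relationships directly instead of joining tables, which is the kind of access a graph database is built to make cheap.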



Typical Use Cases

§  Highly interconnected data
§  Social graphs
§  Party relationships in an enterprise
§  Location based services
§  Purchasing analytics and recommendations
§  Often combined with other systems to store the bulk of data
   – Graph database can focus on relationships



Hadoop and NoSQL Trends



Challenges of NoSQL

§  No standards
§  NoSQL categories are very different from each other
§  Typically schema is not enforced, or there is none
§  Systems are targeting specific areas only
   – You may need multiple systems for different purposes



A Common Pattern for Processing Large Data

1.  Load a large set of records onto a set of machines
2.  Extract something interesting from each record ("Map")
3.  Shuffle and sort intermediate results
4.  Aggregate intermediate results ("Reduce")
5.  Store end result

Intermediate results are key/value pairs.


The MapReduce Programming Model

§  "Map" step:
   – Input is split into pieces
   – Worker nodes process individual pieces in parallel (under global control of the Job Tracker node)
   – Each worker node stores its result in its local file system, where a reducer is able to access it
§  "Reduce" step:
   – Data from the map steps is aggregated ("reduced") by worker nodes (under control of the Job Tracker)
   – Multiple reduce tasks can parallelize the aggregation


MapReduce Step-by-Step

[Figure: input pairs (k1, v1) ... (k6, v6) feed four Map tasks, which emit the intermediate pairs (b, 2), (a, 1), (c, 6), (c, 3), (c, 2), (a, 5), (c, 8), (b, 7). Shuffle + Sort aggregates values by keys: a → 1, 5; b → 2, 7; c → 2, 3, 6, 8. Three Reduce tasks produce the outputs (r1, s1), (r2, s2), (r3, s3).]

Credit: Jimmy Lin, University of Maryland
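The shuffle-and-sort step of the figure can be reproduced in a few lines of Python using the same intermediate pairs. The figure does not say what the reducers compute, so summing the values is a hypothetical reduce function chosen only to have something to run; sorting the values within each group is likewise a choice made to match the figure's layout.

```python
from collections import defaultdict


def shuffle_and_sort(pairs):
    """Group intermediate (key, value) pairs by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return [(key, sorted(groups[key])) for key in sorted(groups)]


def run_reduce(grouped, reduce_fn):
    """Apply the reduce function once per key."""
    return [(key, reduce_fn(key, values)) for key, values in grouped]


# Intermediate pairs emitted by the Map tasks in the figure.
intermediate = [("b", 2), ("a", 1), ("c", 6), ("c", 3),
                ("c", 2), ("a", 5), ("c", 8), ("b", 7)]

grouped = shuffle_and_sort(intermediate)
print(grouped)

# Hypothetical reduce function: sum the values for each key.
totals = run_reduce(grouped, lambda key, values: sum(values))
print(totals)
```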


Separation of Work

Programmers

• Map
• Reduce

Framework

• Deals with fault tolerance
• Assigns workers to map and reduce tasks
• Moves processes to data
• Shuffles and sorts intermediate data
• Deals with errors



Hadoop Distributed File System

[Figure: data is broken into blocks 1 to 9. A Master Node runs the Name Node; the blocks are spread over Node 1, Node 2, ... Node n, shown in the groups 1, 3, 5, 7, 9 and 2, 4, 6, 8.]

Data is replicated across the nodes
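The two ideas in this figure, splitting data into blocks and replicating each block on several nodes, can be sketched as follows. The round-robin placement, the node names, and the tiny block size are assumptions made purely for illustration; they are not HDFS's actual placement policy, which takes rack topology into account.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


# 90 bytes in 10-byte blocks gives 9 blocks, as in the figure.
data = bytes(range(90))
blocks = split_into_blocks(data, block_size=10)
nodes = ["node-1", "node-2", "node-3", "node-4"]  # hypothetical cluster
placement = place_replicas(len(blocks), nodes)

for block, holders in placement.items():
    print("block", block + 1, "->", holders)
```

Because every block lives on several nodes, losing one node loses no data, which is what lets Hadoop run on cheap, unreliable machines.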


Map and Reduce Tasks Running on Nodes

[Figure: the same cluster, now with a Job Tracker shown alongside the Name Node on the Master Node, and Map and Reduce tasks running on Node 1, Node 2, ... Node n next to the data blocks.]

User programs are copied to all nodes


Big Data with Apache Hadoop

§  Open source framework for processing large amounts of data on cheap, unreliable clusters
§  Inspired by Google technologies, implemented in Java
§  Designed from the outset to scale
§  Popular in companies that deal with large amounts of data
   – Often petabytes (PB) in size
§  Also useful as a cost-effective solution for problems smaller than a petabyte!



Two Key Aspects of Hadoop

§  MapReduce framework
   – How Hadoop understands and assigns work to the nodes (machines)
§  Hadoop Distributed File System = HDFS
   – Where Hadoop stores data
   – A file system that spans all the nodes in a Hadoop cluster
   – It links together the file systems on many local nodes to make them into one big file system


Logical MapReduce Example: Word Count

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

Content of Input Documents

• Document 1: Hello World Bye World
• Document 2: Hello Hadoop

Map 1 emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1>
Map 2 emits: <Hello, 1> <Hadoop, 1>

Reduce (final output, after shuffle): <Bye, 1> <Hadoop, 1> <Hello, 2> <World, 2>
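The word-count example can be run locally as a single-process Python simulation. This is only a sketch of the logical model, not Hadoop code: the function names are made up for illustration, counts are plain integers rather than the strings used in the pseudocode, and the shuffle is an in-memory dictionary.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit (word, 1) for every word in the document."""
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    """Sum all counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    # Shuffle: collect the values emitted by all map calls under their key.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            intermediate[word].append(count)
    # Reduce: one call per key, in sorted key order.
    return [reduce_fn(word, intermediate[word]) for word in sorted(intermediate)]


documents = {
    "doc1": "Hello World Bye World",
    "doc2": "Hello Hadoop",
}
result = word_count(documents)
print(result)
# [('Bye', 1), ('Hadoop', 1), ('Hello', 2), ('World', 2)]
```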


How To Create MapReduce Jobs

§  Java API
   – Low level, very flexible
   – Time consuming development
§  Streaming API
   – A simple, productive model for Python and Ruby
§  Hive
   – Open source language / Apache sub-project
   – Provides a SQL-like interface to Hadoop
§  Pig
   – Data flow language / Apache sub-project
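As a rough illustration of the Streaming style, the mapper and reducer below work on lines of text: the mapper emits tab-separated "key, value" lines, and the reducer expects its input lines already sorted by key. The function names and the local `sorted()` call that stands in for Hadoop's shuffle are assumptions of this sketch; in an actual streaming job each function would live in its own script, reading standard input and writing standard output.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))


# Local simulation: sorted() plays the role of the shuffle-and-sort phase.
input_lines = ["Hello World Bye World", "Hello Hadoop"]
output = list(reducer(sorted(mapper(input_lines))))
for line in output:
    print(line)
```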


Flume

§  Distributed streaming tool for collecting, aggregating and moving large amounts of log data
§  Horizontally scalable, centrally managed
§  Tunable data reliability
§  Push and pull sources
§  Many data sources
   – Tail of a file, program output, logs, Twitter, AMQP, IRC, …
§  Many data outputs
   – HDFS, HBase
§  Decorators: process data in flight



The Big Picture

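[Figure: applications draw on several stores, each holding the data it suits best: Columnar (price updates, logs), Document (product info), Graph (customer and agent relationships), RDB (XA data), Key/Value (session data), and Hadoop (operational analytics, price analytics).]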


Data Access Components

§  Do not let dependencies on a specific technology propagate through the whole application
§  Design for change
   – But do not over-complicate
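One way to keep a store's technology from leaking into the rest of the application is a small data access interface that the application code depends on, with one implementation per store. The sketch below is a minimal example under that assumption; `ProductRepository`, `InMemoryProductRepository`, and `CatalogService` are hypothetical names, and the in-memory class stands in for what would be a MongoDB-backed or relational implementation.

```python
from typing import Optional, Protocol


class ProductRepository(Protocol):
    """What the application needs from a product store, and nothing more."""

    def save(self, product_id: str, product: dict) -> None: ...

    def find(self, product_id: str) -> Optional[dict]: ...


class InMemoryProductRepository:
    """Stand-in implementation; a real one would wrap a specific database."""

    def __init__(self):
        self._items = {}

    def save(self, product_id, product):
        self._items[product_id] = dict(product)

    def find(self, product_id):
        item = self._items.get(product_id)
        return dict(item) if item is not None else None


class CatalogService:
    """Application logic: depends on the interface, not on a database driver."""

    def __init__(self, products: ProductRepository):
        self._products = products

    def rename(self, product_id, new_name):
        product = self._products.find(product_id)
        if product is None:
            raise KeyError(product_id)
        product["name"] = new_name
        self._products.save(product_id, product)


repo = InMemoryProductRepository()
repo.save("p1", {"name": "Laptop", "price": 999})
CatalogService(repo).rename("p1", "Ultrabook")
print(repo.find("p1"))
```

Swapping the store then means writing one new repository class; `CatalogService` does not change. Keeping the interface this small is also a way to honor "do not over-complicate".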



NoSQL Data Services

§  A flexible way of separating data from particular technologies
§  Typically REST
§  Some databases already come with service APIs
§  Many APIs developed by community

[Figure: a data store exposed through a REST API.]
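The idea of a data service can be sketched without any web framework: a handler maps an HTTP-style method and path onto operations of the underlying store and returns a status code with a JSON-compatible body. Everything here is hypothetical (the `RestDataService` class, the `/items/<key>` path layout, the chosen status codes); a real service would sit behind an HTTP server, and the dictionary would be replaced by a database client.

```python
import json


class RestDataService:
    """Framework-free sketch of a REST-style facade over a data store."""

    def __init__(self):
        self._items = {}  # stand-in for the actual data store

    def handle(self, method, path, body=None):
        parts = [p for p in path.split("/") if p]
        if len(parts) != 2 or parts[0] != "items":
            return 404, {"error": "not found"}
        key = parts[1]
        if method == "PUT":
            try:
                self._items[key] = json.loads(body)
            except (TypeError, json.JSONDecodeError):
                return 400, {"error": "body must be JSON"}
            return 200, self._items[key]
        if method == "GET":
            if key not in self._items:
                return 404, {"error": "not found"}
            return 200, self._items[key]
        if method == "DELETE":
            if key not in self._items:
                return 404, {"error": "not found"}
            del self._items[key]
            return 204, None
        return 405, {"error": "method not allowed"}


service = RestDataService()
print(service.handle("PUT", "/items/p1", '{"name": "Laptop"}'))
print(service.handle("GET", "/items/p1"))
print(service.handle("DELETE", "/items/p1"))
```

Clients see only methods, paths, and JSON, so the store behind the service can be replaced without touching them.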


Integrating Relational, Streams, and Hadoop

[Figure: integration architecture. Labels: Traditional / Relational Data Sources; Non-Traditional / Non-Relational Data Sources (varied data formats: semi-structured, unstructured, ...); Streams with In-Motion Analytics and Ultra Low Latency Results; Event System; NoSQL; Database & Warehouse (Traditional Warehouse) with at-rest data analytics and results; Data + Big Data with data analytics results.]


Master Data Management and Governance

§  NoSQL stores can easily become a bigger mess than relational stores
§  Introduce a practical plan
   – Avoid lengthy and cumbersome governance
   – Actual use should be the driving force
§  Be ready for change
   – The technologies change rapidly
§  Focus on business outcomes
§  MDM and NoSQL
   – The good: performance
   – The bad: few out-of-the-box solutions



Succeeding with Polyglot Persistence

1.  Actively look for solutions where the right store can ease the pain
2.  Make sure you deliver tangible value to clients
3.  After you get your first apps to work: create a NoSQL/Big Data introduction and governance plan
4.  Prioritize: do the most useful thing for the business first
5.  Integrate with existing IT
6.  Make sure you hire or grow your Big Data/NoSQL champions
7.  Field is immature: look out for new tools and techniques



Conclusions

– NoSQL addresses the weak points of relational systems
  • Hadoop and NoSQL
– Polyglot persistence: use the most suitable database for your task
  • Including relational databases!
– Scale out to crunch Big Data
– Integrate with conventional technologies


Connect!

§  Dr. Vladimir Bacvanski

Email: [email protected]

Blog: http://www.OnBuildingSoftware.com/

Twitter: http://twitter.com/OnSoftware

LinkedIn: http://www.linkedin.com/in/VladimirBacvanski
