Why Every NoSQL Deployment Should be Paired With Hadoop

Post on 20-Jun-2015

description

The terms NoSQL and Big Data are frequently conflated; many view them as synonyms. This is understandable: both technologies eschew the relational data model and spread data across clusters of servers, whereas relational database technology favors centralized computing. But the problems these technologies address are quite different. Hadoop, the Big Data poster child, is focused on data analysis: gleaning insights from large volumes of data. NoSQL databases are transactional systems that deliver high-performance, cost-effective data management for modern real-time web and mobile applications; this is the Big User problem. Of course, if you have a lot of users, you are probably going to generate a lot of data. IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that this number will double every two years. The proliferation of user-generated data from interactive web and mobile applications is a key contributor to this growth.

These slides address:
- Why NoSQL and Big Data are similar, but different
- The categories of NoSQL systems, and the types of applications for which they are best suited
- How Cloudera’s Distribution Including Apache Hadoop and Couchbase can be used together to build better applications
- Real-world use cases where NoSQL and Hadoop technologies work in concert

To view Couchbase webinars on demand, visit http://www.couchbase.com/webinars

transcript

1  

Why  every  NoSQL  deployment  should  be  paired  with  Hadoop  

James Phillips, Co-founder and SVP Products, Couchbase

Amr Awadallah, Co-founder and CTO, Cloudera

2  

Agenda    

• Big Audience vs. Big Data
• NoSQL for Big Audience
• Hadoop for Big Data
• Big Audiences create and consume Big Data
  – NoSQL and Hadoop are highly synergistic
• Couchbase + Cloudera

3  

Aren’t  NoSQL,  Hadoop,  “Big  Data”  all  the  same?  

No.  

4  

Two  challenges  at  the  data  layer    

“Big Audience.” Most new interactive software systems are accessed via browser, with 2 billion potential users and a 24x7 uptime requirement.

“Big Data.” IDC estimates that more than 1.8 trillion gigabytes of information was created in 2011 and that it will double every two years.

5  

NoSQL for

“Big Audience”

6  

Changes in interactive software – NoSQL driver

7  

Modern interactive software architecture

Application Scales Out: Just add more commodity web servers

Database Scales Up: Get a bigger, more complex server

Note – Relational database technology is great for what it is great for, but it is not great for this.

8  

Extending  the  scope  of  RDBMS  technology  

• Data partitioning (“sharding”)
  – Disruptive to reshard – impacts application
  – No cross-shard joins
  – Schema management at every shard

• Denormalizing
  – Increases speed
  – At the limit, provides complete flexibility
  – Eliminates relational query benefits

• Distributed caching
  – Accelerate reads
  – Scale out
  – Another tier, no write acceleration, coherency management
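The "disruptive to reshard" point can be illustrated with a minimal sketch. It assumes the application routes keys with a hash-modulo-N scheme, which is a common approach but not one the slides specify; `shard_for`, `fraction_moved`, and the `user:<n>` key format are hypothetical names chosen for illustration.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Route a key to a shard using a stable hash (md5, so runs are repeatable)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical key set; any large set of distinct keys behaves similarly.
keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_moved(keys, old_shards=4, new_shards=5)
print(f"{moved:.0%} of keys change shards when growing from 4 to 5 shards")
```

Under this assumed routing scheme, roughly four out of five keys land on a different shard after one server is added, so the data migration and the application's routing logic have to change together.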

9  

Lacking market solutions, users forced to invent

Dynamo  October  2007  

Cassandra  August  2008  

Voldemort  February  2009  

Bigtable  November  2006  

• No schema required before inserting data
• No schema change required to change data format
• Auto-sharding without application participation
• Distributed queries
• Integrated main memory caching
• Data synchronization (mobile, multi-datacenter)
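One way to picture "auto-sharding without application participation" is a fixed set of virtual buckets whose assignment to servers is managed by the cluster. The sketch below is an illustration under that assumption, not a description of any particular product's internals; `NUM_BUCKETS`, the server names, and all function names are hypothetical.

```python
import zlib

NUM_BUCKETS = 64  # hypothetical fixed number of virtual buckets


def bucket_for(key: str) -> int:
    """Hash a key to a virtual bucket; this mapping never changes."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def initial_map(servers):
    """Assign buckets to servers round-robin."""
    return {b: servers[b % len(servers)] for b in range(NUM_BUCKETS)}


def rebalance(bucket_map, servers):
    """Even out load by moving buckets from the busiest to the idlest server.

    Assumes every server already in bucket_map is still in `servers`
    (servers are only added in this sketch, never removed).
    """
    owned = {s: [] for s in servers}
    for bucket, server in bucket_map.items():
        owned[server].append(bucket)
    while True:
        busiest = max(servers, key=lambda s: len(owned[s]))
        idlest = min(servers, key=lambda s: len(owned[s]))
        if len(owned[busiest]) - len(owned[idlest]) <= 1:
            break
        owned[idlest].append(owned[busiest].pop())
    return {b: s for s, buckets in owned.items() for b in buckets}


def server_for(key: str, bucket_map) -> str:
    """The only call the application makes: key -> bucket -> current server."""
    return bucket_map[bucket_for(key)]


old_map = initial_map(["s1", "s2", "s3", "s4"])
new_map = rebalance(old_map, ["s1", "s2", "s3", "s4", "s5"])
moved_buckets = [b for b in range(NUM_BUCKETS) if old_map[b] != new_map[b]]
print(len(moved_buckets), "of", NUM_BUCKETS, "buckets moved to the new server")
```

The application code path (`server_for`) is identical before and after the rebalance; only the bucket map changes, and only the buckets handed to the new server carry data that needs to move.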

10  

NoSQL database matches application logic tier architecture

Data layer now scales with linear cost and constant performance.

Application Scales Out: Just add more commodity web servers

Database Scales Out: Just add more commodity data servers

Scaling out flattens the cost and performance curves.

NoSQL  Database  Servers  

11  

Survey: Schema inflexibility #1 adoption driver

What is the biggest data management problem driving your use of NoSQL in the coming year?

• Lack of flexibility/rigid schemas: 49%
• Inability to scale out data: 35%
• High latency/low performance: 29%
• Costs: 16%
• All of these: 12%
• Other: 11%

Source: Couchbase NoSQL Survey, December 2011, n=1351

12  

Hadoop for

“Big Data”

©2012 Cloudera, Inc. All Rights Reserved. 13

The Problems with Current Data Systems

1. Can’t Explore Original High Fidelity Raw Data

2. Moving Data To Compute Doesn’t Scale

3. Archiving = Premature Data Death

[Diagram labels: Instrumentation; Collection; Storage Only Grid (original raw data); Mostly Append; ETL Compute Grid; RDBMS (aggregated data); BI Reports + Interactive Apps]

14

The Solution: A Combined Storage/Compute Layer

1. Data Exploration & Advanced Analytics

2. Scalable Throughput For ETL & Aggregation

3. Keep Data Alive Forever

[Diagram labels: Instrumentation; Collection; Hadoop: Storage + Compute Grid; Mostly Append; RDBMS (aggregated data); BI Reports + Interactive Apps]

The Key Benefit: Agility/Flexibility

15

Schema-on-Write (RDBMS):

•  Schema must be created before any data can be loaded.

•  An explicit load operation has to take place which transforms data to DB internal structure.

•  New columns must be added explicitly before new data for such columns can be loaded into the database.

Pros:

•  Read is Fast

•  Standards/Governance

Schema-on-Read (Hadoop):

•  Data is simply copied to the file store, no transformation is needed.

•  A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding).

•  New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it.

Pros:

•  Load is Fast

•  Flexibility/Agility
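The contrast between the two approaches can be sketched in a few lines. This is a toy illustration, not Hadoop or Hive code: an in-memory SQLite table stands in for the RDBMS, a list of raw JSON lines stands in for the file store, and `read_columns` is a hypothetical stand-in for a SerDe. The table, column, and field names are all made up for the example.

```python
import json
import sqlite3

# Schema-on-write: the table must exist before any data can be loaded.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id TEXT, action TEXT)")
db.execute("INSERT INTO events VALUES (?, ?)", ("ann", "login"))
try:
    # A new field cannot be loaded until its column is added explicitly.
    db.execute(
        "INSERT INTO events (user_id, action, device) VALUES (?, ?, ?)",
        ("bob", "login", "phone"),
    )
    new_column_rejected = False
except sqlite3.OperationalError:
    new_column_rejected = True

# Schema-on-read: raw records are copied to the store untouched.
raw_store = [
    '{"user_id": "ann", "action": "login"}',
    '{"user_id": "bob", "action": "login", "device": "phone"}',
]


def read_columns(raw_lines, columns):
    """Parse each raw line at read time and extract only the requested columns."""
    for line in raw_lines:
        record = json.loads(line)
        yield tuple(record.get(column) for column in columns)


# The original reader ignores the new field; the updated reader sees it
# retroactively, with no reload of the stored data.
rows_v1 = list(read_columns(raw_store, ["user_id", "action"]))
rows_v2 = list(read_columns(raw_store, ["user_id", "action", "device"]))
print(new_column_rejected, rows_v1, rows_v2)
```

The trade-off shows up directly: the schema-on-write side rejects data it has no column for, while the schema-on-read side accepts everything and pays the parsing cost on every read.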

Scalability: Scalable Software Development

16

Grows without requiring developers to re-architect their algorithms/application.

AUTO  SCALE  

Economics: Return on Byte

•  Return on Byte (ROB) = value to be extracted from that byte divided by the cost of storing that byte

•  If ROB is < 1 then it will be buried into tape wasteland, thus we need more economical active storage.
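The ROB rule can be worked through with a short sketch. The numbers below are hypothetical and chosen only to show how a lower storage cost moves the same data from below 1 to above 1; the function names are likewise made up, and treating ROB of exactly 1 as "keep active" is an assumption the slide does not address.

```python
def return_on_byte(value_per_byte: float, storage_cost_per_byte: float) -> float:
    """ROB = value to be extracted from a byte / cost of storing that byte."""
    return value_per_byte / storage_cost_per_byte


def stays_in_active_storage(rob: float) -> bool:
    """Data with ROB below 1 gets archived; otherwise it stays queryable."""
    return rob >= 1


# Hypothetical figures in arbitrary currency units per unit of data.
value = 2.0
expensive_storage_cost = 5.0
economical_storage_cost = 0.5

rob_expensive = return_on_byte(value, expensive_storage_cost)
rob_economical = return_on_byte(value, economical_storage_cost)

print("expensive storage:", rob_expensive, stays_in_active_storage(rob_expensive))
print("economical storage:", rob_economical, stays_in_active_storage(rob_economical))
```

The value of the data is unchanged between the two cases; only the denominator moves, which is the argument for cheaper active storage.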

17

Low ROB

High ROB

Hadoop in the Enterprise Data Stack

[Diagram labels. Data sources: Logs, Files, Web Data, Relational Databases. Data movement: Flume, Sqoop. Connected systems and tools: IDEs, BI / Analytics, Enterprise Reporting, Enterprise Data Warehouse, Modeling Tools, Meta Data/ETL Tools, Cloudera Manager, Web/Mobile Applications. Interfaces: ODBC, JDBC, NFS. Users: System Operators, Engineers, Analysts, Business Users, Data Scientists, Data Architects, Customers.]

18

19  

Big Audiences create and consume

Big Data.

20  

Two  peas.  One  pod.  

http://tinyurl.com/6tx42tw

21  

Hadoop as a Web application feeder or consumer

Pattern 1: Hadoop feeding a web application (“big data” -> insights -> Web application -> “big audience”)

Pattern 2: Hadoop consuming web application data (“big audience” -> Web application -> “big data” -> insights)

22  

Pattern 1 Case Study: AOL Ad Targeting

• One of the largest online ad targeting operations
• Ad slot filling optimization
  – Serve the most relevant ad to a given user
  – Meet contracted impression counts
• Relevancy criteria
  – Demographic
  – Psychographic
  – Current behavioral
• 40 milliseconds to fill all slots

23  

AOL Advertising: Hadoop as an ad targeting feeder

[Diagram, steps 1 to 3. Labels: affiliates; events; profiles, campaigns; profiles, real-time campaign statistics]

40 milliseconds to respond with the decision.
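The feeder pattern in this case study can be sketched as two separate steps: a batch step that turns raw events into profiles, and a serving step that answers each request with a single key lookup. This is a simplified illustration, not AOL's system: the batch function stands in for a Hadoop job, a plain dictionary stands in for the NoSQL store, and the event, campaign, and ad names are all hypothetical.

```python
from collections import Counter, defaultdict


def build_profiles(events):
    """Batch step (stand-in for a Hadoop job): summarize events per user."""
    interests = defaultdict(Counter)
    for user_id, category in events:
        interests[user_id][category] += 1
    # Keep each user's most frequent category as a minimal "profile".
    return {user: counts.most_common(1)[0][0] for user, counts in interests.items()}


def choose_ad(profile_store, campaigns, user_id, default_ad="house-ad"):
    """Serving step: one key lookup per request, no analysis on the hot path."""
    category = profile_store.get(user_id)
    return campaigns.get(category, default_ad)


# Hypothetical event log: (user, content category viewed).
events = [
    ("u1", "sports"),
    ("u1", "sports"),
    ("u1", "travel"),
    ("u2", "travel"),
]
profile_store = build_profiles(events)  # would be loaded into the NoSQL store
campaigns = {"sports": "ad-sneakers", "travel": "ad-flights"}

print(choose_ad(profile_store, campaigns, "u1"))
print(choose_ad(profile_store, campaigns, "u2"))
print(choose_ad(profile_store, campaigns, "unknown-user"))
```

Keeping the expensive aggregation out of the request path is what makes a tight response budget such as 40 milliseconds plausible: the serving side only reads precomputed results.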

24  

Pattern 2 Case Study: Social gaming user analysis

• Tens to hundreds of millions of users
• Game optimization requirements
  – Keep game fresh and retain audience
  – Maximize revenue through offer and experience tuning
• Very different data management tasks
  – Serving game data
    • System of record game data
    • Very low latency data access
    • Non-disruptive elasticity
    • Complex queries
  – Analyzing user behavior
    • Not game data, rather user behavior data
    • High-throughput data analysis

25  

Social Game: Game optimization via Hadoop

[Diagram, steps 1 to 5. Labels: User interacting with game; Validation and response; Game and user data system of record; User behavioral data; Insights]
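The consumer pattern can be sketched as a map step and a reduce step over a behavior log, kept apart from the game's system of record. This is a toy, single-process illustration of the map/reduce style rather than actual Hadoop code; the event fields (`user`, `level`, `action`) and the "quit" signal are hypothetical.

```python
from collections import defaultdict


def map_event(event):
    """Map step: emit (level, 1) for every event where a player quit."""
    if event["action"] == "quit":
        yield event["level"], 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Hypothetical behavior log written by the game's web tier.
behavior_log = [
    {"user": "u1", "level": 3, "action": "quit"},
    {"user": "u2", "level": 3, "action": "quit"},
    {"user": "u3", "level": 1, "action": "quit"},
    {"user": "u1", "level": 2, "action": "complete"},
]

pairs = [pair for event in behavior_log for pair in map_event(event)]
quits_by_level = reduce_counts(pairs)

# The insight fed back to the game: which level loses the most players.
level_to_tune = max(quits_by_level, key=quits_by_level.get)
print(quits_by_level, level_to_tune)
```

The analysis touches only behavioral data and can run at high throughput without adding load to the low-latency store that serves game state.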

26  

Couchbase and Cloudera

27  

Couchbase Sqoop connector for Cloudera

http://www.couchbase.com/develop/connectors/hadoop

Cloudera-certified connector
Bi-directional data movement
    - Hadoop -> Couchbase
    - Couchbase -> Hadoop

28  

Questions?