(Tugdual grall) no sql-hadoop

Date post: 27-Jan-2015
Upload: naver-d2
Page 1: (Tugdual grall)   no sql-hadoop

NoSQL & BigData
Why Every NoSQL Deployment Should be Paired with Hadoop

Tugdual Grall
Couchbase
@tgrall

Page 2: (Tugdual grall)   no sql-hadoop

About Me

• Tugdual "Tug" Grall
  - Couchbase
    • Technical Evangelist
  - eXo
    • CTO
  - Oracle
    • Developer/Product Manager
    • Mainly Java/SOA
  - Developer in consulting firms
• Web
  • @tgrall
  • http://blog.grallandco.com
  • tgrall
• NantesJUG co-founder
• Pet Project:
  • http://www.resultri.com

Page 6: (Tugdual grall)   no sql-hadoop

Big Data: High Data Variety and Velocity

[Chart: data volume in trillions of gigabytes (zettabytes), axis from 0 to 2.00, for 2000, 2006, and 2011, split into Structured Data and Unstructured and Semi-Structured Data (text, log files, click streams, blogs, tweets, audio, video, etc.)]

Source: IDC 2011 Digital Universe Study (http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm)

More Flexible Data Model Required

• When people talk about Big Data, they usually mean capturing huge amounts of data and analyzing it. Big Data in this sense is certainly a big trend.

• But Big Data also affects operational databases in a big way, for a different set of reasons.

• Two aspects of Big Data are pushing people toward NoSQL technologies.

• The first is that the vast majority of the increase in data is unstructured or semi-structured. This includes user-generated content such as consumer recommendations, and machine-generated data such as log files and website click data. Relational databases aren't well suited to storing this type of data, while NoSQL technologies such as document-oriented databases are ideally suited to it.

• The second is that application developers find new types of data they want to store all the time: new information in a user's account profile, new logging information, and so on. What developers want to store is changing very rapidly, and the amount of data they want to store is growing very rapidly. As a result, developers want a very flexible data model that they can evolve quickly.

• Relational databases have fixed schemas that often take weeks or months to change. NoSQL databases, on the other hand, are schema-less. As a result, you can far more easily add new types of data and iterate quickly on your application.
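As an illustration of the flexible data model described in these notes, here is a minimal sketch in plain Python. It does not use any real database client: the `bucket` dictionary, the `save` and `load` helpers, and the `user::N` key names are all hypothetical stand-ins for a document store.

```python
import json

# Hypothetical in-memory "bucket": document key -> JSON text.
bucket = {}

def save(key, doc):
    """Store a document as JSON; no schema is declared anywhere."""
    bucket[key] = json.dumps(doc)

def load(key):
    """Read a document back as a Python dict."""
    return json.loads(bucket[key])

# Two user profiles with different shapes live side by side.
save("user::1", {"name": "Ann", "email": "ann@example.com"})
save("user::2", {
    "name": "Bob",
    "email": "bob@example.com",
    "preferences": {"newsletter": True},
    "last_logins": ["2015-01-20", "2015-01-27"],
})

# Adding a new field is a read-modify-write on one document,
# not a schema migration across every record.
profile = load("user::1")
profile["twitter"] = "@ann"
save("user::1", profile)

print(sorted(load("user::1")))
print(sorted(load("user::2")))
```

The trade-off is that the application, not the database, becomes responsible for handling documents that lack a given field.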

Page 7: (Tugdual grall)   no sql-hadoop

Operational vs. Analytic Databases

• Analytic Databases: get insights from data (Cloudera, Hortonworks)
• Real-time, Interactive Databases (NoSQL): fast access to data (Couchbase, Mongo)

• There are two types of databases, each focused on a very different problem.

• Analytic databases were referred to in the past as OLAP databases. They focus on looking through every record in a huge database to answer a question or gain an insight about the data it contains. These analyses are batch processes that access every piece of data in the database, are very read-heavy, and produce results in seconds, minutes, or sometimes days. For analytic databases, "real time" means an analysis takes a few seconds to run.

• Real-time, interactive databases are often referred to as operational databases. They store a lot of data, but usually much less than an analytic database.

• They must provide access to individual records in milliseconds so that users of an application get good response times.

• Since the requirements of each type of database are very different, their architectures and capabilities are very different as well.

• When I refer to NoSQL in this presentation, I am referring to real-time, interactive databases. This is the type of NoSQL database Couchbase provides.
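The difference between the two access patterns can be sketched in a few lines of plain Python. The `records` data and both function names are hypothetical; the point is only that an operational read touches one record by key, while an analytic query reads every record.

```python
# Hypothetical records keyed by user id.
records = {
    "u1": {"country": "FR", "clicks": 12},
    "u2": {"country": "US", "clicks": 7},
    "u3": {"country": "FR", "clicks": 3},
}

def get_profile(user_id):
    """Operational access: fetch a single record by its key."""
    return records[user_id]

def clicks_by_country():
    """Analytic access: scan every record to compute an aggregate."""
    totals = {}
    for rec in records.values():
        totals[rec["country"]] = totals.get(rec["country"], 0) + rec["clicks"]
    return totals

print(get_profile("u2"))
print(clicks_by_country())
```

The cost of the first call does not depend on how many records exist, whereas the second grows with the size of the data set, which is why the two workloads tend to end up on different systems.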

Page 8: (Tugdual grall)   no sql-hadoop

• Lack of flexibility/rigid schemas: 49%
• Inability to scale out data: 35%
• Performance challenges: 29%
• Cost: 16%
• All of these: 12%
• Other: 11%

Source: Couchbase Survey, December 2011, n = 1351.

Page 9: (Tugdual grall)   no sql-hadoop

NoSQL catalog

[Diagram: products grouped by data model, with rows for Cache (memory only) and Database (memory/disk)]

• Key-Value: Memcached, Membase, Coherence
• Data Structure: Redis
• Document: Couchbase, MongoDB
• Column: Cassandra, HBase
• Graph: Neo4j, InfiniteGraph

Page 10: (Tugdual grall)   no sql-hadoop

Use Cases

Key Value
• Session Management
• User Profile/Preferences
• Shopping Cart

Document
• Event Logging
• Content Management
• Web Analytics
• E-Commerce Application

Column
• Event Logging
• Content Management
• Counters

Graph
• Connected Data / Social Networks
• Routing, Dispatch
• Recommendations based on Social Graph

Page 11: (Tugdual grall)   no sql-hadoop

Hadoop

Page 12: (Tugdual grall)   no sql-hadoop

What is Hadoop?

• Highly scalable

• Unstructured data

• Open source

• Big Data Operating System

• Changing the World One Petabyte at a Time

Page 13: (Tugdual grall)   no sql-hadoop

What is Hadoop?

• Simplest unit of compute and storage

[Diagram: CPU, Disks, Application, Data]

Page 14: (Tugdual grall)   no sql-hadoop

What is Hadoop?

• And when it grows?

[Diagram: Application, Data]

Page 15: (Tugdual grall)   no sql-hadoop

What is Hadoop?

• And when it grows more?

Page 16: (Tugdual grall)   no sql-hadoop

What is Hadoop?

• NoSQL to the rescue

[Diagram: Application, Data]

Page 17: (Tugdual grall)   no sql-hadoop

What is Hadoop?

• Hadoop is a different paradigm

[Diagram: Application, Data]

Page 18: (Tugdual grall)   no sql-hadoop

Hadoop is not a "NoSQL database" but rather a set of tools for working with Big Data: the ultimate Swiss Army knife for dealing with VERY VERY large volumes of data.

• Oozie: Workflow, coordination
• Sqoop: Data connector to import/export data
• Hive: SQL-like interface
• Pig: High-level programming language
• Mahout: Machine learning library
• Whirr: Hadoop management tools for cloud services
• Flume: Aggregator
• MapReduce: Framework to process large volumes of data
• HBase: Key-value data store
• ZooKeeper: Centralized configuration management
• HDFS: Distributed file system
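To make the MapReduce entry in this list more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Python. It does not use the Hadoop API; the log line format (`user,campaign,action`) and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (campaign, 1) for every click; the line format is hypothetical."""
    user, campaign, action = line.split(",")
    if action == "click":
        yield campaign, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse the values for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

logs = [
    "u1,summer_sale,click",
    "u2,summer_sale,view",
    "u3,summer_sale,click",
    "u1,new_phone,click",
]
print(run_job(logs))
```

In a real cluster the map calls run in parallel on the nodes that hold the data, and the shuffle moves the grouped values across the network to the reducers; the sketch only shows the shape of the computation.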

Page 19: (Tugdual grall)   no sql-hadoop

Hadoop and NoSQL

Page 20: (Tugdual grall)   no sql-hadoop

Ad and offer targeting

[Diagram, three numbered steps: events; profiles, campaigns; profiles, real-time campaign statistics]

40 milliseconds to respond with the decision.

Page 22: (Tugdual grall)   no sql-hadoop

Moving Parts

[Diagram: Ad Targeting Platform, Couchbase Server Cluster, and Hadoop Cluster. Logs reach Hadoop through a Flume flow; sqoop import and sqoop export move data between the Couchbase Server Cluster and the Hadoop Cluster.]

Page 23: (Tugdual grall)   no sql-hadoop

Content & Recommendation Targeting

[Diagram, three numbered steps: events; user profiles; make recommendations. Components: Content Oriented Site, Legacy Relational Database]

Page 25: (Tugdual grall)   no sql-hadoop

Moving Parts

[Diagram: Content Driven Web Site, Original RDBMS, Couchbase Server Cluster, and Hadoop Cluster, connected by logs (Flume flow), sqoop import, and sqoop export]

To keep up with changing needs for richer, more targeted content delivered very quickly to ever larger audiences, the data behind content-driven sites is shifting to Couchbase.

Hadoop excels at complex analytics, which may involve multiple processing steps and incorporate a number of different data sources.

Page 26: (Tugdual grall)   no sql-hadoop

Sqoop: What is this?

Page 27: (Tugdual grall)   no sql-hadoop

What is Sqoop?

Sqoop is a tool designed to transfer data between Hadoop and relational databases.

You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.

sqoop.apache.org

Page 28: (Tugdual grall)   no sql-hadoop

What is Sqoop?

• Traditional ETL

[Diagram: Data, Application, Data, with a single Transform (T) step]

Page 29: (Tugdual grall)   no sql-hadoop

What is Sqoop?

• A different paradigm

[Diagram: Data, Application, Data]

Page 30: (Tugdual grall)   no sql-hadoop

What is Sqoop?

• A very scalable different paradigm

[Diagram: several Application and Data pairs side by side]

Page 31: (Tugdual grall)   no sql-hadoop

What is Sqoop?

• Where did the Transform go?

[Diagram: Application and Data, with many Transform (T) steps running in parallel]

Page 32: (Tugdual grall)   no sql-hadoop

What is Sqoop?

• Sqoop: "SQL-Hadoop"
  - Default connection is via JDBC

• Lots of custom connectors
  - Couchbase, VoltDB, Vertica
  - Teradata, Netezza
  - Oracle, MySQL, Postgres

Page 34: (Tugdual grall)   no sql-hadoop

Sqoop: Import

sqoop import --connect jdbc:mysql://rdbms1.demo.com/CRM --table customers

Page 36: (Tugdual grall)   no sql-hadoop

Sqoop: Export

sqoop export --connect jdbc:mysql://rdbms1.demo.com/ANALYTICS --table sales --export-dir /user/hive/warehouse/zip_profits --input-fields-terminated-by '\0001'
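The `'\0001'` value in the export command above appears to denote the Ctrl-A character (code point 1), which is the default field delimiter Hive uses for text files in its warehouse directory. As a minimal sketch of what such a row looks like, the following plain Python splits one line on that delimiter. The column names `zip` and `profit` are assumptions; the slide does not show the table layout.

```python
# Ctrl-A, the default Hive text-file field delimiter
# (written as the octal escape '\0001' on the Sqoop command line).
HIVE_FIELD_DELIMITER = "\x01"

def parse_row(line, columns):
    """Split one delimited text line into a dict keyed by column name."""
    values = line.rstrip("\n").split(HIVE_FIELD_DELIMITER)
    if len(values) != len(columns):
        raise ValueError(
            "expected %d fields, got %d" % (len(columns), len(values))
        )
    return dict(zip(columns, values))

# Hypothetical column layout for the exported directory.
COLUMNS = ["zip", "profit"]

sample_line = "94103\x0112500.50\n"
print(parse_row(sample_line, COLUMNS))
```

If the delimiter passed to the export does not match the one the files were written with, each line is read as a single field, so it is worth checking a sample file before running the export.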

Page 38: (Tugdual grall)   no sql-hadoop

Sqoop: Import

sqoop import --connect http://localhost:8091/pools --table DUMP

Page 39: (Tugdual grall)   no sql-hadoop

Sqoop: Import

[Diagram: the Sqoop Client reads Metadata and Launches a MapReduce Job made up of several Map tasks, each paired with HDFS]
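The diagram shows one job fanned out over several Map tasks. One common way to divide the work, offered here as an assumption since the slide does not say how the split is made, is to cut a numeric key range into contiguous slices, one per Map task. The function name and the example range below are hypothetical.

```python
def split_range(lo, hi, num_mappers):
    """Split the inclusive range [lo, hi] into at most num_mappers
    contiguous inclusive sub-ranges of nearly equal size."""
    total = hi - lo + 1
    if total <= 0 or num_mappers <= 0:
        return []
    n = min(num_mappers, total)  # never create empty slices
    base, extra = divmod(total, n)
    ranges = []
    start = lo
    for i in range(n):
        # The first `extra` slices take one additional key each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# Each tuple would become the key range read by one Map task.
for low, high in split_range(1, 10, 4):
    print("map task reads keys %d..%d" % (low, high))
```

Each Map task then reads only its own slice and writes its own output file, which is what lets the transfer scale with the number of tasks.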

Page 42: (Tugdual grall)   no sql-hadoop

Sqoop: Export

sqoop export --connect http://localhost:8091/pools --table DUMP --export-dir /user/hive/profiles/recommendation --username social

Page 43: (Tugdual grall)   no sql-hadoop

Sqoop: Export

[Diagram: the Sqoop Client reads Metadata and Launches a MapReduce Job made up of several Map tasks, each paired with HDFS]

Page 45: (Tugdual grall)   no sql-hadoop

Demonstration

Page 46: (Tugdual grall)   no sql-hadoop

NoSQL & BigData
Why Every NoSQL Deployment Should be Paired with Hadoop

Tugdual Grall
Couchbase
@tgrall

Q&A

