Date posted: 27-Jan-2015
Category: Technology
Upload: naver-d2
NoSQL & BigData
Why Every NoSQL Deployment Should be Paired with Hadoop
Tugdual Grall, Couchbase, @tgrall
About Me
• Tugdual "Tug" Grall
• Couchbase
  • Technical Evangelist
• eXo
  • CTO
• Oracle
  • Developer/Product Manager
  • Mainly Java/SOA
• Developer in consulting firms
• Web
  • @tgrall
  • http://blog.grallandco.com
  • tgrall
• NantesJUG co-founder
• Pet Project:
  • http://www.resultri.com
Big Data: High Data Variety and Velocity
[Chart: data volume in trillions of gigabytes (zettabytes), axis from 0 to 2.00, for 2000, 2006, and 2011, split into Structured Data and Unstructured and Semi-Structured Data (text, log files, click streams, blogs, tweets, audio, video, etc.). Source: IDC 2011 Digital Universe Study (http://www.emc.com/collateral/demos/microsites/emc-digital-universe-2011/index.htm)]
More Flexible Data Model Required
• Usually when people talk about Big Data, they mean capturing huge amounts of data and analyzing it. That use of Big Data is certainly a major trend.
• But Big Data also affects operational databases in a big way, for a different set of reasons.
• There are two aspects of Big Data that are pushing people toward NoSQL technologies.
• The first is that the vast majority of the increase in data is unstructured or semi-structured: user-generated content such as consumer recommendations, and machine-generated data such as log files and website click data. Relational databases aren't well suited to storing this type of data, while NoSQL technologies such as document-oriented databases are well suited to it.
• The second is that application developers are finding new types of data they want to store all the time, such as new information in a user's account profile or new logging information. What developers want to store is changing rapidly, and the amount of data they want to store is growing rapidly. As a result, developers want a flexible data model that they can evolve quickly.
• Relational databases have fixed schemas that often take weeks or months to change. NoSQL databases, on the other hand, are schema-less. As a result, you can far more easily add new types of data and iterate quickly on your application.
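As a rough illustration of the schema-less point in the notes above, the sketch below stores user profiles as JSON documents in a plain Python dictionary that stands in for a document store. The `bucket` variable, the keys, and the field names are hypothetical and are not taken from any Couchbase API; the only point is that a new field can appear in one document without migrating the others.

```python
import json

# Hypothetical in-memory stand-in for a document store: key -> JSON string.
bucket = {}

def save(key, doc):
    """Serialize a document to JSON and store it under its key."""
    bucket[key] = json.dumps(doc)

def load(key):
    """Fetch a document by key and parse it back into a dict."""
    return json.loads(bucket[key])

# A first version of the application stores a minimal profile.
save("user::1", {"name": "Alice", "email": "alice@example.com"})

# A later version adds new fields. No schema change is needed and
# the older document is left untouched.
save("user::2", {"name": "Bob", "email": "bob@example.com",
                 "preferences": {"newsletter": True},
                 "last_login": "2015-01-27T10:00:00Z"})

def newsletter_opt_in(key):
    """Readers tolerate missing fields by supplying defaults."""
    doc = load(key)
    return doc.get("preferences", {}).get("newsletter", False)

print(newsletter_opt_in("user::1"))
print(newsletter_opt_in("user::2"))
```

The trade-off is that the application, not the database, becomes responsible for handling documents of different shapes, which is why the reader function supplies defaults.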
Operational vs. Analytic Databases
• Analytic Databases (Cloudera, Hortonworks): get insights from data
• Real-time, Interactive Databases (NoSQL: Couchbase, Mongo): fast access to data
• There are two types of databases, each focused on a very different problem.
• Analytic databases were referred to in the past as OLAP databases. They are focused on looking through every record in a huge database to answer a question or gain an insight about the data it contains. These analyses are batch processes that access every piece of data in the database, are very read-heavy, and produce results in seconds, minutes, or sometimes days. For analytic databases, "real time" means an analysis takes a few seconds to run.
• Real-time interactive databases are often referred to as operational databases. They store a lot of data, but usually much less than an analytic database.
• They must provide access to individual records in milliseconds so that users of an application get good response times.
• Since the requirements of each type of database are very different, their architectures and capabilities are very different as well.
• When I refer to NoSQL in this presentation, I am referring to real-time, interactive databases. This is the type of NoSQL database Couchbase provides.
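The difference between the two access patterns can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `orders` dictionary and its fields are hypothetical sample data, and a real deployment would use a database rather than an in-memory dict.

```python
# Hypothetical data set: order records keyed by order id.
orders = {
    "order::1": {"customer": "alice", "total": 30.0},
    "order::2": {"customer": "bob", "total": 12.5},
    "order::3": {"customer": "alice", "total": 20.0},
}

def get_order(order_id):
    """Operational access: fetch one record directly by its key."""
    return orders[order_id]

def revenue_by_customer():
    """Analytic access: scan every record to answer a question."""
    totals = {}
    for order in orders.values():
        customer = order["customer"]
        totals[customer] = totals.get(customer, 0.0) + order["total"]
    return totals

print(get_order("order::2"))
print(revenue_by_customer())
```

The first function touches a single record, so its cost does not depend on the size of the data set; the second touches every record, so its cost grows with the data and is a natural fit for batch processing.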
[Chart: survey results]
• Lack of flexibility/rigid schemas: 49%
• Inability to scale out data: 35%
• Performance challenges: 29%
• Cost: 16%
• All of these: 12%
• Other: 11%
Source: Couchbase Survey, December 2011, n = 1351.
NoSQL catalog
[Diagram: NoSQL products grouped by data model, in two rows: Cache (memory only) and Database (memory/disk)]
• Key-Value: Memcached, Coherence, Membase, Couchbase
• Data Structure: Redis
• Document: MongoDB
• Column: Cassandra, HBase
• Graph: Neo4j, InfiniteGraph
Use Cases
Key Value
• Session Management
• User Profile/Preferences
• Shopping Cart
Document
• Event Logging
• Content Management
• Web Analytics
• E-Commerce Application
Columns
• Event Logging
• Content Management
• Counters
Graph
• Connected Data / Social Networks
• Routing, Dispatch
• Recommendations based on Social Graph
Hadoop
What is Hadoop?
• Highly scalable
• Unstructured data
• Open source
• Big Data Operating System
• Changing the World One Petabyte at a Time
What is Hadoop?
• Simplest unit of compute and storage [Diagram: one machine with CPU and disks, holding the application and its data]
• And when it grows? [Diagram: application and data]
• And when it grows more?
• NoSQL to the rescue [Diagram: application and data]
• Hadoop is a different paradigm [Diagram: application and data]
Hadoop is not a "NoSQL database" but rather a set of tools for working with Big Data: the ultimate Swiss Army knife for dealing with very large volumes of data.
• Oozie: Workflow, coordination
• Sqoop: Data connector to import/export data
• Hive: SQL-like interface
• Pig: High-level programming language
• Mahout: Machine learning library
• Whirr: Hadoop management tools for cloud services
• Flume: Aggregator
• MapReduce: Framework to process large volumes of data
• HBase: Key-value data store
• Zookeeper: Centralized configuration management
• HDFS: Distributed file system
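To make the MapReduce entry in the list above more concrete, here is a minimal single-process sketch of the programming model in Python. It is not Hadoop's API: the function names, the `user,event` log format, and the sample lines are assumptions chosen for illustration. In Hadoop the map and reduce steps run in parallel across the cluster and the grouping step is handled by the framework.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (key, value) pairs for one input record ("user,event")."""
    _user, event = line.split(",")
    yield (event, 1)

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a list of records."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())

# Hypothetical log lines for illustration only.
log_lines = ["alice,click", "bob,view", "alice,view", "carol,click"]
print(run_job(log_lines))
```

Because each map call sees only one record and each reduce call sees only one key, both phases can be spread over many machines without changing the logic.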
Hadoop and NoSQL
Ad and offer targeting
[Diagram: three numbered steps exchanging the following data]
• events
• profiles, campaigns
• profiles, real-time campaign statistics
40 milliseconds to respond with the decision.
Moving Parts
[Diagram: Ad Targeting Platform, Logs, Couchbase Server Cluster, and Hadoop Cluster, connected by a flume flow, sqoop import, and sqoop export]
Content & Recommendation Targeting
[Diagram: a Content Oriented Site and a Legacy Relational Database, with three numbered steps: events, user profiles, make recommendations]
Moving Parts
[Diagram: Content Driven Web Site, Logs, Couchbase Server Cluster, Hadoop Cluster, and Original RDBMS, connected by a flume flow, sqoop import, and sqoop export]
• To keep up with changing needs for richer, more targeted content delivered very quickly to larger and larger audiences, the data behind content-driven sites is shifting to Couchbase.
• Hadoop excels at complex analytics, which may involve multiple processing steps that incorporate a number of different data sources.
Sqoop : What is this?
Sqoop is a tool designed to transfer data between Hadoop and relational databases.
You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS.
sqoop.apache.org
What is Sqoop?
• Traditional ETL [Diagram: data passes through a transform step (T) between the application's data stores]
• A different paradigm [Diagram: application and data]
• A very scalable different paradigm [Diagram: several application and data pairs]
• Where did the Transform go? [Diagram: many transform steps (T) spread across the data]
What is Sqoop?
• Sqoop: "SQL-Hadoop"
• Default connection is via JDBC
• Lots of custom connectors: Couchbase, VoltDB, Vertica, Teradata, Netezza, Oracle, MySQL, Postgres
Sqoop : Import
sqoop import --connect jdbc:mysql://rdbms1.demo.com/CRM --table customers
Sqoop : Export
sqoop export --connect jdbc:mysql://rdbms1.demo.com/ANALYTICS --table sales --export-dir /user/hive/warehouse/zip_profits --input-fields-terminated-by '\0001'
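The `'\0001'` passed to `--input-fields-terminated-by` is the Ctrl-A character (code point 1), which Hive uses as its default field delimiter for text tables. As a hedged sketch of what that means for the files under the export directory, the Python below splits one such line into columns. The column names and the sample line are hypothetical; the actual layout of `zip_profits` is not shown in the slides.

```python
FIELD_DELIMITER = "\x01"  # Ctrl-A, written as '\0001' on the sqoop command line

def parse_row(line, columns):
    """Split one delimited text line into a column -> value dict."""
    values = line.rstrip("\n").split(FIELD_DELIMITER)
    if len(values) != len(columns):
        raise ValueError(
            "expected %d fields, got %d" % (len(columns), len(values)))
    return dict(zip(columns, values))

# Hypothetical column names and sample line, for illustration only.
columns = ["zip", "profit"]
sample = "94107\x0112345.67\n"
print(parse_row(sample, columns))
```

If the delimiter given to the export does not match the one used when the files were written, each line is read as a single field and the export fails, which is why the option appears explicitly in the command.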
Sqoop : Import
sqoop import --connect http://localhost:8091/pools --table DUMP
Sqoop : Import
[Diagram: the Sqoop client reads metadata and launches a MapReduce job; each Map task writes its share of the data to HDFS]
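For a JDBC source, Sqoop parallelizes an import by dividing the range of a split column among the map tasks, so that each task reads a different slice of the table. The sketch below shows the idea in Python only; it is a simplified illustration, not Sqoop's own splitter code, and the function name and the sample id range are assumptions. A connector for a non-relational source such as Couchbase may divide the work differently.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous chunks,
    one per map task, with sizes that differ by at most one."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than rows: leave this task without work
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

# Hypothetical table whose ids run from 1 to 1000, imported with 4 map tasks.
print(split_ranges(1, 1000, 4))
```

Each resulting pair would become the bounds of one task's query, so the slices can be read and written to HDFS independently.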
Sqoop : Export
sqoop export --connect http://localhost:8091/pools --table DUMP --export-dir /user/hive/profiles/recommendation --username social
Sqoop : Export
[Diagram: the Sqoop client reads metadata and launches a MapReduce job; each Map task reads its share of the data from HDFS]
Demonstration
Q&A