Date posted: 22-Apr-2015
Category: Technology
Uploaded by: krishnan-parasuraman
Tweet about Enzee Universe using #enzee11
Hadoop and Netezza: Co-existence or competition?
Krishnan Parasuraman, CTO - Digital Media, Netezza
@kparasuraman
The Buzz
Fuelling the debate
A brief history of wannabe RDBMS killers
Open Source Distributed Storage and Processing Engine
Self-healing, distributed storage
Commodity hardware – inexpensive storage
+
Fault-tolerant distributed processing
Abstraction for parallel computing
Manage complex data – relational and non-relational – in a single repository
Store source data forever and analyze as and when needed
Process at source – eliminate data movement
Oozie: Workflow
Sqoop: Integration
Zookeeper: Service coordination
Flume, Chukwa, Scribe: Data collection
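The "abstraction for parallel computing" on this slide is the map/shuffle/reduce model. A minimal single-process sketch in Python follows; it only illustrates the programming model and is not Hadoop's API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration. On a real cluster the map and reduce calls would run on many nodes, next to the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["hadoop and netezza", "hadoop or netezza"]))
```

Each `map_phase` call depends only on its own line and each `reduce_phase` call only on its own key, which is what lets a framework spread the work across nodes.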
Hadoop: Origin and evolution
2003 2004 2005 2006 2007 2008 2009 2010 2011
Google: GFS paper
Google: MapReduce paper
Apache: Lucene subproject
Google: Bigtable paper
Apache: Hadoop project
Yahoo: 10K core cluster
Apache: HBase project
Netezza: Hadoop Connector, MapReduce support
Early research
Open source dev momentum
Initial success stories
Commercialization
Common Perceptions
Low cost
Cloud
Complex Analytics
Ad-hoc queries
Unstructured
Large Volumes
Parallel data warehouse systems
[Diagram labels: SQL; host controllers (hosts); network fabric; massively parallel compute nodes, each with FPGA, memory, and CPU; storage units]
Hadoop
[Diagram labels: MapReduce; master node (Name Node, Job Tracker); network fabric; parallel compute nodes, each a Data Node with a Task Tracker; storage units]
The similarities
Highly Available
Scalable
Execute code & algorithms next to data
Massive parallelism
The differences
Data loading = file copy (look Ma, no ETL)
Schema on Read – Data loading is fast
Batch mode data access
Not intended for real time access
Doesn’t support Random Access
No joins, no query engine, no types, no SQL
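"Schema on read" means the file is stored untouched and a schema is applied only when it is scanned. A small Python sketch of the idea follows; the sample data, the `SCHEMA` layout, and the function name `read_with_schema` are hypothetical and do not come from the slides.

```python
import csv
import io

# Hypothetical raw data, stored exactly as it arrived (loading is a file copy).
RAW = "2011-06-20,home,120\n2011-06-20,search,95\n2011-06-21,home,bad\n"

# The schema lives with the reader, not with the stored file.
SCHEMA = [("day", str), ("page", str), ("hits", int)]


def read_with_schema(raw_text, schema):
    """Apply column names and types while scanning; skip rows that do not fit."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        if len(fields) != len(schema):
            continue  # wrong column count: not rejected at load, skipped at read
        try:
            rows.append(
                {name: cast(value) for (name, cast), value in zip(schema, fields)}
            )
        except ValueError:
            continue  # type errors also surface at read time, not load time
    return rows


if __name__ == "__main__":
    for row in read_with_schema(RAW, SCHEMA):
        print(row)
```

The trade-off matches the slide: loading is fast because nothing is validated, and every read pays for parsing and for handling bad rows.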
Where does it work well?
1. Queryable Archive: Moving computation is cheaper than moving data
2. Exploratory analysis: Relationships not defined yet; Can’t put in a process for ETL; Evolving schema
3. Complex data: Parallel ETL in Java
Imperatives for co-existence
• Fast data loading: flexible schema until we figure out what we want to do
• Expressiveness of SQL coupled with the flexibility of procedural code (i.e., MapReduce)
• Low cost of storing and analyzing not-so-hot data
• Parse and analyze complex data such as video and images
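One way to picture the second bullet, SQL combined with procedural code, is a procedural function called from inside a SQL query. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in; it is not Netezza's or Hadoop's interface, and the table `clicks`, the function `url_host`, and the sample URLs are assumptions made for illustration.

```python
import sqlite3
from urllib.parse import urlparse


def url_host(url):
    """Procedural step: extract the host name from a URL."""
    return urlparse(url).netloc


def hits_by_host(urls):
    """Load URLs into an in-memory table and aggregate them with SQL."""
    conn = sqlite3.connect(":memory:")
    # Register the Python function so SQL can call it by name.
    conn.create_function("url_host", 1, url_host)
    conn.execute("CREATE TABLE clicks (url TEXT)")
    conn.executemany("INSERT INTO clicks VALUES (?)", [(u,) for u in urls])
    rows = conn.execute(
        "SELECT url_host(url) AS host, COUNT(*) "
        "FROM clicks GROUP BY url_host(url) ORDER BY host"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    sample = ["http://a.com/x", "http://b.com/y", "http://a.com/z"]
    print(hits_by_host(sample))
```

SQL handles the grouping and counting declaratively, while the parsing logic that is awkward to express in SQL stays in ordinary code.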
Netezza-Hadoop: Co-existence use cases
unstructured data
semi-structured data
structured data
Create context (classification, text mining)
Analyze
Parse, aggregate
Analyze, report
Analyze, report
Active archival
Long running queries
Pattern 1: Data ingestion
[Diagram labels: raw weblogs; Hadoop cluster (NameNode/JobTracker, three DataNode/TaskTracker nodes); Netezza environment; steps 1 to 4]
Pattern 2: Low cost storage and dynamic provisioning
[Diagram labels: Amazon Cloud; Amazon S3; Elastic MapReduce; steps 1 to 3]
Pattern 3: Queryable archive
[Diagram labels: data sources; steps 1 and 2]
Pattern 4: Support low interaction partners
[Diagram labels: data sources; steps 1 to 3]
Netezza and Hadoop integration
Hadoop/HDFS integration
High-speed data loader (bidirectional)
weblogs
• Move data back and forth between Netezza and Hadoop cluster
• Use Hadoop for ingesting/parsing web logs, offline analytics
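The weblog parsing step mentioned above can be sketched as a function that turns raw log lines into delimited records ready for a bulk loader. This is a hedged illustration only: the slides do not state the log format, so the Apache Common Log Format is assumed, and the names `LOG_PATTERN`, `parse_line`, and `to_delimited` are hypothetical. In a Hadoop job, `parse_line` would play the role of the map step.

```python
import re

# Assumption: logs use the Apache Common Log Format; the slides do not say.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return (host, time, method, path, status, size), or None if unparseable."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    # A "-" in the size field means no body was sent; store it as 0.
    size = 0 if match.group("size") == "-" else int(match.group("size"))
    return (
        match.group("host"),
        match.group("time"),
        match.group("method"),
        match.group("path"),
        int(match.group("status")),
        size,
    )


def to_delimited(lines, sep="|"):
    """Convert parseable log lines into delimited rows; drop the rest."""
    rows = []
    for line in lines:
        record = parse_line(line)
        if record is not None:
            rows.append(sep.join(str(field) for field in record))
    return rows


if __name__ == "__main__":
    sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
    print(to_delimited([sample, "not a log line"]))
```

The delimited output is the kind of structured file a warehouse loader can ingest, which is the hand-off point between the two systems in this pattern.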
Summary: Leveraging best of both worlds
1. Hadoop is not a replacement for a parallel data warehouse
2. Hadoop and Netezza are complementary technologies
3. Don’t let the hype drive the need
4. We have only solved the integration problem