Hadoop and Netezza - Co-existence or Competition?

Tweet about Enzee Universe using #enzee11

Hadoop and Netezza: Co-existence or competition?

Krishnan Parasuraman, CTO - Digital Media, Netezza


@kparasuraman



The Buzz



Fuelling the debate



A brief history of wannabe RDBMS killers



Open Source Distributed Storage and Processing Engine

Self healing, distributed storage

Commodity hardware – inexpensive storage

+

Fault tolerant distributed processing

Abstraction for parallel computing

Manage complex data – relational and non relational – in a single repository

Store source data forever and analyze as and when needed

Process at source – eliminate data movement

Oozie: Workflow

Sqoop: Integration

Zookeeper: Service coordination

Flume, Chukwa, Scribe: Data collection



Hadoop: Origin and evolution

2003 2004 2005 2006 2007 2008 2009 2010 2011

Google: GFS paper

Google: MapReduce paper

Apache: Lucene subproject

Google: Bigtable paper

Apache: Hadoop project

Yahoo: 10K core cluster

Apache: HBase project

Netezza: Hadoop Connector, MapReduce support

Early Research

Open source dev momentum

Initial success stories

Commercialization



Common Perceptions

Low cost

Cloud

Complex Analytics

Ad-hoc queries

Unstructured

Large Volumes



Parallel data warehouse systems

Each compute node: FPGA, Memory, CPU

Hosts

Storage Units

Massively parallel compute nodes

Network fabric

Host controllers

SQL



Hadoop

Storage Units

Parallel compute nodes

Network fabric

Master Node: Name Node, Job Tracker

Worker nodes: Data Node and Task Tracker on each

Map Reduce
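The Map Reduce model named on this slide can be sketched in a few lines. This is a minimal single-process illustration, assuming a word-count job; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for this example and are not Hadoop APIs. In a real cluster the Job Tracker would hand map and reduce tasks to the Task Trackers, and the shuffle would move data across the network fabric.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over a list of records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    lines = ["hadoop stores data", "netezza queries data"]
    print(run_job(lines))
```

Because each call to `map_phase` sees only one record and each call to `reduce_phase` sees only one key, both phases can be spread across many nodes without changing the code, which is the point of the abstraction.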



The similarities

Highly Available

Scalable

Execute code & algorithms next to data

Massive parallelism




The differences


Data Loading = File copy

Look Ma, No ETL

Schema on Read – Data loading is fast

Batch mode data access

Not intended for real time access

Doesn’t support Random Access

No joins, no query engine, no types, no SQL
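To make "Schema on Read" concrete, here is a small sketch, assuming pipe-delimited text records. The sample data, the `load` and `read_with_schema` helpers, and the field names are all hypothetical. Loading copies the data untouched; field names and types are applied only when the data is read, and rows that do not fit are skipped at that point.

```python
import io

# Hypothetical raw records, stored exactly as they arrived (no schema applied).
RAW_FILE = "2011-06-20|/home|200\n2011-06-20|/cart|500\nmalformed line\n"


def load(raw_text):
    """'Loading' is just a copy: keep the text as-is and validate nothing."""
    return io.StringIO(raw_text)


def read_with_schema(stored, schema):
    """Apply field names and types while reading; skip rows that do not fit."""
    rows = []
    for line in stored:
        parts = line.rstrip("\n").split("|")
        if len(parts) != len(schema):
            continue  # wrong number of fields for this schema
        try:
            rows.append(
                {name: cast(value) for (name, cast), value in zip(schema, parts)}
            )
        except ValueError:
            continue  # a field could not be converted to the requested type
    return rows


if __name__ == "__main__":
    schema = [("day", str), ("url", str), ("status", int)]
    for row in read_with_schema(load(RAW_FILE), schema):
        print(row)
```

The same stored text can be read later with a different schema, which is why loading is fast and why the cost of interpreting the data moves to every read.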


Where does it work well?

1. Queryable Archive: Moving computation is cheaper than moving data


2. Exploratory analysis: Relationships not defined yet; Can’t put in a process for ETL; Evolving schema

3. Complex data: Parallel ETL in Java



Imperatives for co-existence

• Fast data loading - flexible schema till we figure out what we want to do

• Expressiveness of SQL coupled with the flexibility of procedural code, i.e. MapReduce

• Low cost of storing and analyzing not-so-hot data

• Parse and analyze complex data such as video and images
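The second imperative, SQL combined with procedural code, can be illustrated with a toy example. This sketch uses Python's built-in `sqlite3` module purely as a stand-in for a SQL engine; it is not Netezza, and the `clicks` table, the `url_host` function, and the sample URLs are assumptions made for illustration. A procedural function does the parsing, and SQL does the grouping and counting.

```python
import sqlite3
from urllib.parse import urlparse


def url_host(url):
    """Procedural step: pull the host name out of a URL string."""
    return urlparse(url).netloc


def hits_per_host(urls):
    """Declarative step: count rows per host with plain SQL."""
    conn = sqlite3.connect(":memory:")
    # Expose the Python function to SQL under the name url_host.
    conn.create_function("url_host", 1, url_host)
    conn.execute("CREATE TABLE clicks (url TEXT)")
    conn.executemany("INSERT INTO clicks VALUES (?)", [(u,) for u in urls])
    rows = conn.execute(
        "SELECT url_host(url) AS host, COUNT(*) AS hits "
        "FROM clicks GROUP BY host ORDER BY host"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    sample = [
        "http://a.example/x",
        "http://b.example/y",
        "http://a.example/z",
    ]
    print(hits_per_host(sample))
```

The split mirrors the slide's point: logic that is awkward to express in SQL stays in procedural code, while aggregation stays in SQL where it is concise.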


Netezza-Hadoop: Co-existence use cases

unstructured data

semi-structured data

structured data

Create context (classification, text mining)

Analyze

Parse, aggregate

Analyze, report

Analyze, report

Active archival

Long running queries


Pattern 1: Data ingestion

Hadoop Cluster: NameNode/JobTracker, three DataNode/TaskTracker nodes

Netezza Environment

Raw Weblogs
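The ingestion pattern, where raw weblogs are parsed and aggregated on the Hadoop side before the result is loaded into the warehouse, might look roughly like the sketch below. It assumes the logs are in Apache Common Log Format and that the warehouse loader accepts pipe-delimited text; both are assumptions, as are all function names. The parsing runs in a single process here, whereas on a cluster it would run as parallel map tasks.

```python
import csv
import io
import re
from collections import Counter

# Assumed input format: Apache Common Log Format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Return (path, status) for a well-formed log line, else None."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return match.group("path"), int(match.group("status"))


def aggregate(lines):
    """Count hits per (path, status), ignoring lines that do not parse."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            counts[parsed] += 1
    return counts


def to_delimited(counts):
    """Render the aggregates as pipe-delimited rows ready for a bulk load."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    for (path, status), hits in sorted(counts.items()):
        writer.writerow([path, status, hits])
    return out.getvalue()


if __name__ == "__main__":
    logs = [
        '10.0.0.1 - - [20/Jun/2011:10:00:00 -0400] "GET /home HTTP/1.1" 200 512',
        '10.0.0.2 - - [20/Jun/2011:10:00:01 -0400] "GET /cart HTTP/1.1" 500 0',
    ]
    print(to_delimited(aggregate(logs)), end="")
```

Only the compact, structured aggregate crosses over to the warehouse; the bulky raw logs stay in the Hadoop cluster.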


Pattern 2: Low cost storage and dynamic provisioning

Elastic MapReduce


Amazon S3

Amazon Cloud



Pattern 3: Queryable archive

Data Sources



Pattern 4: Support low interaction partners

Data Sources



Netezza and Hadoop integration

Hadoop/HDFS integration

High speed data loader (bidirectional)

weblogs

• Move data back and forth between Netezza and Hadoop cluster

• Use Hadoop for ingesting/parsing web logs, offline analytics



Summary: Leveraging best of both worlds

1. Hadoop is not a replacement for a parallel data warehouse

2. Hadoop and Netezza are complementary technologies

3. Don’t let the hype drive the need

4. We have only solved the integration problem

Date post:	22-Apr-2015