Hadoop and Netezza - Co-existence or Competition?

Date post: 22-Apr-2015
Upload: krishnan-parasuraman
Description:
Hadoop is rapidly emerging as a viable platform for big data analytics. Thanks to early adoption by organizations like Yahoo and Facebook, and an active open source community, we have seen significant innovation around this platform. Because Hadoop now supports relational constructs and a SQL-like query interface, many experts believe it will eventually take over some data warehousing tasks. Even though Hadoop and parallel databases have some architectural similarities, they are designed to solve different problems. In this presentation, you will be introduced to the Hadoop architecture, its salient differences from Netezza, and its typical use cases. You will learn about common co-existence deployment models that Netezza's customers have put into practice to benefit from both technologies. You will also learn about Netezza's current support for Hadoop and its future strategy.
Transcript
Page 1: Hadoop and Netezza - Co-existence or Competition?

Tweet about Enzee Universe using #enzee11

Hadoop and Netezza

Co-existence or competition?

Krishnan Parasuraman, CTO - Digital Media, Netezza

@kparasuraman

Page 2: Hadoop and Netezza - Co-existence or Competition?

The Buzz

Page 3: Hadoop and Netezza - Co-existence or Competition?

Page 4: Hadoop and Netezza - Co-existence or Competition?

Fuelling the debate

Page 5: Hadoop and Netezza - Co-existence or Competition?

A brief history of wannabe RDBMS killers

Page 6: Hadoop and Netezza - Co-existence or Competition?

Open Source Distributed Storage and Processing Engine

Self healing, distributed storage

Commodity hardware – inexpensive storage

+

Fault tolerant distributed processing

Abstraction for parallel computing

Manage complex data – relational and non relational – in a single repository

Store source data forever and analyze as and when needed

Process at source – eliminate data movement

Oozie: Workflow

Sqoop: Integration

Zookeeper: Service coordination

Flume, Chukwa, Scribe: Data collection
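
The "abstraction for parallel computing" on this slide refers to the MapReduce programming model. The following is a minimal single-process sketch of that model in plain Python, not Hadoop's actual API: the function names (map_words, reduce_counts, run_mapreduce) and the sample lines are made up for illustration. It shows the three phases a job goes through: map each record to key/value pairs, group the values by key, then reduce each group.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce step: sum all the counts emitted for one key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Run mapper and reducer in one process (hypothetical helper)."""
    # Shuffle: group every mapped value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Each group is reduced independently of the others, which is what
    # lets a real cluster spread this work across many nodes.
    return dict(reducer(key, values) for key, values in groups.items())


# Made-up sample input.
lines = [
    "hadoop stores data",
    "netezza queries data",
    "hadoop processes data",
]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
```

Because the mapper sees one record at a time and the reducer sees one key at a time, neither needs to know how many machines are involved; distribution and fault tolerance are left to the framework.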

Page 7: Hadoop and Netezza - Co-existence or Competition?

Hadoop: Origin and evolution

Timeline, 2003 to 2011 (events in slide order):

Google: GFS paper

Google: MapReduce paper

Apache: Lucene subproject

Google: Bigtable paper

Apache: Hadoop project

Yahoo: 10K core cluster

Apache: HBase project

Netezza: Hadoop Connector, MapReduce support

Phases: Early research; Open source dev momentum; Initial success stories; Commercialization

Page 8: Hadoop and Netezza - Co-existence or Competition?

Common Perceptions

Low cost

Cloud

Complex Analytics

Ad-hoc queries

Unstructured

Large Volumes

Page 9: Hadoop and Netezza - Co-existence or Competition?

Parallel data warehouse systems

[Diagram labels: SQL; Hosts; Host controllers; Network fabric; Massively parallel compute nodes (three shown, each with FPGA, Memory, CPU); Storage Units]

Page 10: Hadoop and Netezza - Co-existence or Competition?

Hadoop

[Diagram labels: Map Reduce; Master Node (Name Node, Job Tracker); Network fabric; Parallel compute nodes (three shown, each a Data Node with a Task Tracker); Storage Units]

Page 11: Hadoop and Netezza - Co-existence or Competition?

The similarities

Highly Available

Scalable

Execute code & algorithms next to data

Massive parallelism

Page 12: Hadoop and Netezza - Co-existence or Competition?

The differences

Data loading = file copy ("Look Ma, no ETL")

Schema on Read – Data loading is fast

Batch mode data access

Not intended for real time access

Doesn’t support Random Access

No joins, no query engine, no types, no SQL
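
To illustrate "schema on read", here is a small sketch in plain Python rather than Hadoop code. The sample records, the SCHEMA tuple, and the read_with_schema helper are all hypothetical. Loading amounts to keeping the raw lines untouched; field names and types are applied only when someone reads the data, and any filtering or aggregation is hand-written code rather than SQL.

```python
# "Loading" is just storing the raw lines; nothing is parsed or checked yet.
# These tab-separated records are made up for illustration.
raw_store = [
    "2011-06-20\tu1\t/home\t200",
    "2011-06-20\tu2\t/cart\t500",
    "not a valid record",
    "2011-06-21\tu1\t/cart\t200",
]

# Hypothetical schema, chosen by the reader of the data, not by the loader.
SCHEMA = (("day", str), ("user", str), ("path", str), ("status", int))


def read_with_schema(lines, schema):
    """Apply names and types at read time, skipping lines that do not fit."""
    for line in lines:
        fields = line.split("\t")
        if len(fields) != len(schema):
            continue  # wrong number of fields for this schema
        try:
            yield {name: cast(value)
                   for (name, cast), value in zip(schema, fields)}
        except ValueError:
            continue  # a field could not be converted to its declared type


rows = list(read_with_schema(raw_store, SCHEMA))

# With no query engine, even a simple "GROUP BY path" is procedural code.
errors_by_path = {}
for row in rows:
    if row["status"] >= 500:
        errors_by_path[row["path"]] = errors_by_path.get(row["path"], 0) + 1
```

The malformed line costs nothing at load time and is only rejected when this particular schema is applied, which is why loading is fast and why a different reader could apply a different schema to the same files.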

Page 13: Hadoop and Netezza - Co-existence or Competition?

Where does it work well?

1. Queryable Archive: Moving computation is cheaper than moving data

2. Exploratory analysis: Relationships not defined yet; Can’t put in a process for ETL; Evolving schema

3. Complex data: Parallel ETL in Java

Page 14: Hadoop and Netezza - Co-existence or Competition?

Imperatives for co-existence

• Fast data loading - flexible schema till we figure out what we want to do

• Expressiveness of SQL coupled with the flexibility of procedural code, i.e. MapReduce

• Low cost of storing and analyzing not-so-hot data

• Parse and analyze complex data such as video and images

Page 15: Hadoop and Netezza - Co-existence or Competition?

Netezza-Hadoop: Co-existence use cases

unstructured data

semi-structured data

structured data

Create context (classification, text mining)

Analyze

Parse, aggregate

Analyze, report

Analyze, report

Active archival

Long running queries

Page 16: Hadoop and Netezza - Co-existence or Competition?

Pattern 1: Data ingestion

[Diagram labels: Raw Weblogs; Hadoop Cluster (NameNode/JobTracker and three DataNode/TaskTracker nodes); Netezza Environment; numbered steps 1 to 4]

Page 17: Hadoop and Netezza - Co-existence or Competition?

Pattern 2: Low cost storage and dynamic provisioning

[Diagram labels: Amazon Cloud; Amazon S3; Elastic MapReduce; numbered steps 1 to 3]

Page 18: Hadoop and Netezza - Co-existence or Competition?

Pattern 3: Queryable archive

[Diagram labels: Data Sources; numbered steps 1 and 2]

Page 19: Hadoop and Netezza - Co-existence or Competition?

Pattern 4: Support low interaction partners

[Diagram labels: Data Sources; numbered steps 1 to 3]

Page 20: Hadoop and Netezza - Co-existence or Competition?

Netezza and Hadoop integration

Hadoop/HDFS integration

High speed data loader (bidirectional)

weblogs

• Move data back and forth between Netezza and Hadoop cluster

• Use Hadoop for ingesting/parsing web logs, offline analytics
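
As a rough illustration of the "ingest and parse web logs" step, the sketch below parses lines in Common Log Format, aggregates hits, and produces pipe-delimited text of the kind a warehouse bulk loader could ingest. It is plain Python, not the Netezza connector or a Hadoop job: the sample log lines, the function names, and the choice of a pipe delimiter are assumptions made for this example.

```python
import csv
import io
import re
from collections import Counter

# Common Log Format: host, identity, user, [timestamp], "request", status, size.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Made-up sample input, including one line that cannot be parsed.
sample_logs = [
    '10.0.0.1 - - [20/Jun/2011:10:00:01 -0400] "GET /home HTTP/1.1" 200 512',
    '10.0.0.2 - - [20/Jun/2011:10:00:05 -0400] "GET /cart HTTP/1.1" 500 0',
    'corrupted line',
    '10.0.0.1 - - [20/Jun/2011:10:01:12 -0400] "GET /home HTTP/1.1" 200 512',
]


def parse_and_aggregate(lines):
    """Count hits per (day, path, status), skipping unparseable lines."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue
        hits[(match["day"], match["path"], int(match["status"]))] += 1
    return hits


def to_delimited(hits):
    """Render the aggregates as pipe-delimited rows, sorted by key."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    for (day, path, status), count in sorted(hits.items()):
        writer.writerow([day, path, status, count])
    return buffer.getvalue()


hits = parse_and_aggregate(sample_logs)
load_file_text = to_delimited(hits)
```

The point of the split is that the messy, high-volume parsing happens before the data reaches the warehouse, which then receives only compact, typed rows to load and query.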

Page 21: Hadoop and Netezza - Co-existence or Competition?

Summary: Leveraging best of both worlds

1. Hadoop is not a replacement for a parallel data warehouse

2. Hadoop and Netezza are complementary technologies

3. Don’t let the hype drive the need

4. We have only solved the integration problem

