
MPP vs Hadoop

Date post: 21-Apr-2017
Upload: alexey-grishchenko
1 Pivotal Confidential–Internal Use Only MPP vs Hadoop Alexey Grishchenko HUG Meetup 28.11.2015
Transcript
Page 1: MPP vs Hadoop

MPP vs Hadoop Alexey Grishchenko

HUG Meetup 28.11.2015

Page 2: MPP vs Hadoop

Agenda

• Distributed Systems
• MPP
• Hadoop
• MPP vs Hadoop
• Summary

Page 3: MPP vs Hadoop

Agenda

• Distributed Systems
• MPP
• Hadoop
• MPP vs Hadoop
• Summary

Page 4: MPP vs Hadoop

Distributed Systems

Avoid distributed systems for any problem that could potentially be solved with a non-distributed system

Page 5: MPP vs Hadoop

Distributed Systems

• Consensus problem
  – Paxos
  – RAFT
  – ZAB
  – etc.
• Transaction consistency
  – 2PC
  – 3PC
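
To make the 2PC bullet concrete, here is a minimal in-memory sketch of two-phase commit. The `Participant` class and `two_phase_commit` function are hypothetical names invented for illustration; a real coordinator would also need durable logging, timeouts, and recovery, none of which is shown.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical participant: votes in phase 1, applies the decision in phase 2."""
    name: str
    can_commit: bool = True
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: vote "yes" only if the local work can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list[Participant]) -> bool:
    """Return True if every participant committed, False if all were aborted."""
    # Phase 1 (voting): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only on a unanimous "yes", otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

A single "no" vote aborts the transaction on every participant, which is why the protocol blocks when the coordinator fails between the two phases; 3PC adds a phase to reduce that window.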

Page 6: MPP vs Hadoop

Distributed Systems

• CAP Theorem

Page 7: MPP vs Hadoop

Distributed Systems

http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf

Page 8: MPP vs Hadoop

Distributed Systems

Reasons to use
• Performance issues
  – More than 100,000 TPS
  – More than 4 GB/sec scan rate
  – More than 100,000 IOPS
• Capacity issues
  – More than 50 TB of data
• DR and Geo-Distribution
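
The thresholds above can be written down as a simple checklist. This is only a sketch: the function name `reasons_to_distribute` and its parameters are made up for illustration, and the cut-offs are the rule-of-thumb numbers from the slide, not hard limits.

```python
def reasons_to_distribute(tps=0, scan_gb_per_sec=0.0, iops=0,
                          data_tb=0.0, needs_dr_or_geo=False):
    """Return the slide's reasons that apply; an empty list suggests one server may be enough."""
    reasons = []
    if tps > 100_000:
        reasons.append("more than 100,000 TPS")
    if scan_gb_per_sec > 4:
        reasons.append("more than 4 GB/sec scan rate")
    if iops > 100_000:
        reasons.append("more than 100,000 IOPS")
    if data_tb > 50:
        reasons.append("more than 50 TB of data")
    if needs_dr_or_geo:
        reasons.append("DR or geo-distribution required")
    return reasons
```

For example, `reasons_to_distribute(tps=5000, data_tb=2)` returns an empty list, which matches the earlier advice to stay non-distributed when possible.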

Page 9: MPP vs Hadoop

Agenda

• Distributed Systems
• MPP
• Hadoop
• MPP vs Hadoop
• Summary

Page 10: MPP vs Hadoop

MPP

Main principles
• Shared Nothing
• Data Sharding
• Data Replication
• Distributed Transactions
• Parallel Processing
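
Two of these principles, data sharding and parallel processing, can be sketched in a few lines. The example below hash-distributes rows across a hypothetical number of segments, lets each segment aggregate only its own rows, and merges the partial results. All function names are invented for illustration, threads stand in for separate servers, and replication and distributed transactions are not modeled.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def shard_for(key, num_segments):
    """Pick a segment by hashing the distribution key (CRC32 is deterministic across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(rows, num_segments):
    """Shared nothing: every (key, amount) row is stored on exactly one segment."""
    segments = [[] for _ in range(num_segments)]
    for key, amount in rows:
        segments[shard_for(key, num_segments)].append((key, amount))
    return segments


def local_sum(segment):
    """Work done independently on one segment, using only its local rows."""
    totals = Counter()
    for key, amount in segment:
        totals[key] += amount
    return totals


def parallel_sum_by_key(rows, num_segments=4):
    """Run the per-segment aggregation in parallel, then merge on the coordinator."""
    segments = distribute(rows, num_segments)
    with ThreadPoolExecutor(max_workers=num_segments) as pool:
        partials = list(pool.map(local_sum, segments))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds the partial sums together
    return dict(merged)
```

Because rows are sharded on the grouping key, each key lives on a single segment and the final merge is a plain union; grouping on a different column would require redistributing rows between segments first.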

Page 11: MPP vs Hadoop

MPP

Page 12: MPP vs Hadoop

MPP

Works well for
• Relational data
• Batch processing
• Ad hoc analytical SQL
• Low concurrency
• Applications requiring ANSI SQL

Page 13: MPP vs Hadoop

MPP

Not the best choice for
• Non-relational data
• OLTP and event stream processing
• High concurrency
• 100+ server clusters
• Non-analytical use cases
• Geo-Distributed use cases

Page 14: MPP vs Hadoop

Agenda

• Distributed Systems
• MPP
• Hadoop
• MPP vs Hadoop
• Summary

Page 15: MPP vs Hadoop

Hadoop

Main Components
• HDFS
• YARN
• MapReduce
• HBase
• Hive / Hive+Tez

Page 16: MPP vs Hadoop

Hadoop

HDFS
• Distributed filesystem
• Block-level storage with big blocks
• Non-updatable
• Synchronous block replication
• No built-in Geo-Distribution support
• No built-in DR solution
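
A quick back-of-the-envelope calculation shows what big blocks and block replication mean for storage. The slide gives no numbers, so the 128 MiB block size and replication factor of 3 used below are assumptions (they are common defaults), and `hdfs_footprint` is a hypothetical helper.

```python
MIB = 1024 ** 2


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MIB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster) for one file.

    Assumes the last block only occupies as much space as the data it holds,
    so raw usage is simply file size times the replication factor.
    """
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes  # ceiling division
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes
```

Under these assumptions a 1 GiB file is split into 8 blocks and consumes 3 GiB of raw disk, while a 1-byte file still needs its own block entry, which is one reason many small files are a poor fit.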

Page 17: MPP vs Hadoop

Hadoop

HDFS

Page 18: MPP vs Hadoop

Hadoop

YARN
• Cluster resource manager
• Manages CPU and RAM allocation
• Schedulers are pluggable
• Can handle different resource pools
• Supports both MR and non-MR workload

Page 19: MPP vs Hadoop

Hadoop

YARN

Page 20: MPP vs Hadoop

Hadoop

MapReduce
• Framework for distributed data processing
• Two main operations: map and reduce
• Data hits disk after “map” and before “reduce”
• Scales to thousands of servers
• Can process petabytes of data
• Extremely reliable
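
The map and reduce operations are easiest to see on the classic word-count example. The sketch below runs in a single Python process and only imitates the data flow; the function names are illustrative and this is not the Hadoop API. The in-memory `shuffle` step marks the point where, per the slide, real MapReduce materializes map output on disk before reducers read it.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key; in Hadoop this is where map output hits disk."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Each `map_phase` call depends only on its own line and each `reduce_phase` call only on its own key, which is what lets the real framework spread both phases over thousands of servers.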

Page 21: MPP vs Hadoop

Hadoop

MapReduce

Page 22: MPP vs Hadoop

Hadoop

HBase
• Distributed key-value store
• Data is sharded by key
• Data is stored in sorted order
• Stores multiple versions of the row
• Easily scales
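
The three storage properties listed here (sorted keys, sharding by key, multiple versions) fit into a small in-memory model. `TinySortedStore` and `region_for` are hypothetical names and this is not the HBase client API; the `max_versions` value and the split points are arbitrary illustration values.

```python
import bisect


class TinySortedStore:
    """Toy model: row keys kept sorted, each row keeps its newest N versions."""

    def __init__(self, max_versions=3):
        self._keys = []    # row keys in sorted order
        self._cells = {}   # row key -> list of (timestamp, value), newest first
        self._max_versions = max_versions

    def put(self, key, value, timestamp):
        if key not in self._cells:
            bisect.insort(self._keys, key)
            self._cells[key] = []
        versions = self._cells[key]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self._max_versions:]  # drop versions beyond the limit

    def get(self, key, versions=1):
        """Return up to `versions` values for a row, newest first."""
        return [value for _, value in self._cells.get(key, [])[:versions]]

    def scan(self, start, stop):
        """Range scan over [start, stop); cheap because keys are sorted."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._cells[k][0][1]) for k in self._keys[lo:hi]]


def region_for(key, split_points):
    """Sharding by key: index of the key range (region) that owns this key."""
    return bisect.bisect_right(split_points, key)
```

With split points `["g", "p"]`, keys before "g" fall into region 0, keys from "g" up to but excluding "p" into region 1, and the rest into region 2; adding a split point is how such a store scales out.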

Page 23: MPP vs Hadoop

Hadoop

HBase

Page 24: MPP vs Hadoop

Hadoop

Hive
• Query engine with SQL-like syntax
• Translates HiveQL query to MR / Tez / Spark job
• Processes HDFS data
• Supports UDFs and UDAFs
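
A UDAF has to work when the query is split into a distributed job, so it aggregates in two steps: partial results per task, then a merge. The Python class below sketches that lifecycle for an average. The method names loosely follow the iterate / partial / merge / terminate stages of a UDAF, but this is an illustration, not Hive's Java interface, and `distributed_average` is a hypothetical driver.

```python
class AverageUDAF:
    """Average computed from mergeable (sum, count) partial states."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def iterate(self, value):
        # Called once per input row on the map side; NULLs are skipped.
        if value is not None:
            self.total += value
            self.count += 1

    def terminate_partial(self):
        # Partial state shipped from a map task to the reducer.
        return self.total, self.count

    def merge(self, partial):
        # Called on the reduce side for every partial state received.
        total, count = partial
        self.total += total
        self.count += count

    def terminate(self):
        return self.total / self.count if self.count else None


def distributed_average(partitions):
    """Simulate one map task per partition followed by a single reducer."""
    partials = []
    for rows in partitions:
        mapper = AverageUDAF()
        for value in rows:
            mapper.iterate(value)
        partials.append(mapper.terminate_partial())
    reducer = AverageUDAF()
    for partial in partials:
        reducer.merge(partial)
    return reducer.terminate()
```

Shipping `(sum, count)` rather than a per-task average matters: for partitions `[1, 2, 3]` and `[5]` the correct result is 2.75, while averaging the two task averages would give 3.5.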

Page 25: MPP vs Hadoop

Hadoop

Hive

Page 26: MPP vs Hadoop

Hadoop

Works well for
• Write Once Read Many
• 100+ server clusters
• Both relational and non-relational data
• High concurrency
• Batch processing and analytical workload
• Elastic scalability

Page 27: MPP vs Hadoop

Hadoop

Not the best choice for
• Write-heavy workloads
• Small clusters
• Analytical DWH cases
• OLTP and event stream processing
• Cost savings

Page 28: MPP vs Hadoop

Agenda

• Distributed Systems
• MPP
• Hadoop
• MPP vs Hadoop
• Summary

Page 39: MPP vs Hadoop

MPP vs Hadoop for Business

                        MPP                    Hadoop
Platform Openness       Mostly Closed          Open
Hardware Options        Mostly Appliances      Commodity
Vendor Lock-in          Typical                Not Common
Technology Price        $200K – $10M           $50K – $500K
Implementation Cost     Moderate               High
Extensibility           Vendor-provided APIs   Open Source
Supportability          Easy                   Complex
Scalability (servers)   Up to 100 servers      Up to 5000 servers
Scalability (data)      Up to 100-300 TB       Up to 100 PB
Target Systems          DWH                    Purpose-Built Batch
Target End Users        Business Analysts      Developers

Page 49: MPP vs Hadoop

MPP vs Hadoop for Architect

                        MPP              Hadoop
Query Optimization      Good             Poor to None
Debugging               Easy             Very Hard
Accessibility           SQL              Mainly Java
DBA Skill Level         Low              High
Single Job Redundancy   Low              High
Query Latency           10-20 ms         10-20 sec
Query Runtime           5-7 sec          10-15 mins
Query Max Runtime       1-2 hours        1-2 weeks
Min Collection Size     Megabytes        Gigabytes
Max Concurrency         10-15 queries    70-100 jobs

Page 50: MPP vs Hadoop

Agenda

• Distributed Systems
• MPP
• Hadoop
• MPP vs Hadoop
• Examples
• Summary

Page 51: MPP vs Hadoop

Summary

Use MPP for
• Analytical DWH
• Ad hoc analyst SQL queries and BI
• Data volumes under 100 TB

Use Hadoop for
• Specialized data processing systems
• Data volumes over 100 TB

Page 52: MPP vs Hadoop

Questions?

Page 53: MPP vs Hadoop

BUILT FOR THE SPEED OF BUSINESS

