Hadoop 3 (2017 hadoop taiwan workshop)

Date post: 21-Jan-2018
Upload: wei-chiu-chuang
Page 1: Hadoop 3 (2017 hadoop taiwan workshop)

© Cloudera, Inc. All rights reserved. 1

Hadoop 3 is coming — what’s new and what’s next?

Page 2: Hadoop 3 (2017 hadoop taiwan workshop)

About Wei-Chiu

Apache Hadoop committer
Software Engineer, Cloudera

Page 3: Hadoop 3 (2017 hadoop taiwan workshop)

Agenda

• The Problem
• What is Hadoop
• Major Hadoop 3 Features
• What’s Next?

Page 4: Hadoop 3 (2017 hadoop taiwan workshop)

“Data helps solve problems”

- Anne Wojcicki

Page 5: Hadoop 3 (2017 hadoop taiwan workshop)

Big Data - 3Vs

Volume Velocity Variety

Page 6: Hadoop 3 (2017 hadoop taiwan workshop)

Apache Hadoop

The de facto Big Data analytics platform
A distributed framework to support large-scale computation on commodity hardware
• Petabyte+ storage, 1,000+ compute nodes
• Inspired by Google
• Originally developed by Yahoo!, donated to the Apache Software Foundation
• Open source :)
• 183 committers, thousands of contributors

Page 7: Hadoop 3 (2017 hadoop taiwan workshop)

Apache Hadoop

Page 8: Hadoop 3 (2017 hadoop taiwan workshop)

Cloudera

Commercializes Hadoop* technology
Open source, open culture

CDH - Cloudera’s Distribution for Hadoop
• Platform. Open source

Cloudera Manager (CM), Cloudera Navigator, Key Trustee
• Cluster management, monitoring. Proprietary

*Hadoop and its associated projects

Page 9: Hadoop 3 (2017 hadoop taiwan workshop)

Hadoop Ecosystem

Page 10: Hadoop 3 (2017 hadoop taiwan workshop)

Well, data itself is a problem ...

• Clusters becoming larger
• More enterprise adoption
• New applications

• Storage: reduce storage cost
• Compute: much larger cluster

• High availability
• High performance
• Seamless experience

• More cloud usage
• Ease of development

Page 11: Hadoop 3 (2017 hadoop taiwan workshop)

HDFS Erasure Coding

Page 12: Hadoop 3 (2017 hadoop taiwan workshop)

Hadoop Distributed File System

A fault-tolerant, highly scalable storage system
POSIX-like semantics
Security: user authentication, authorization, at-rest encryption, transport encryption

[Diagram: 3x replication; write and read paths; NameNode, DataNode]

Page 13: Hadoop 3 (2017 hadoop taiwan workshop)

Hadoop Distributed File System

Advantage
• Failure tolerant

But
• 3x storage cost
• 3x datacenter space
• 3x power consumption

How to reduce storage overhead?

Page 14: Hadoop 3 (2017 hadoop taiwan workshop)

Erasure Coding 101

• Parity bit
• XOR
• If X is lost, X can be reconstructed using Y and X ^ Y
• 50% overhead ((3-2)/2)
• Can tolerate one failure
• Reed–Solomon
• RS(k,m) adds m parity cells to k data cells and tolerates the loss of any m of the k+m cells.
• XOR = RS(2,1)
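The XOR bullet can be made concrete with a few lines of Python. This is only a minimal sketch of the parity idea, not HDFS code; the helper name `xor_bytes` and the sample cell contents are made up for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length cells."""
    if len(a) != len(b):
        raise ValueError("cells must have the same length")
    return bytes(p ^ q for p, q in zip(a, b))


# Two data cells (arbitrary sample contents) and their parity cell X ^ Y.
x = b"hadoop"
y = b"taiwan"
parity = xor_bytes(x, y)

# If X is lost, XOR-ing the survivors (Y and X ^ Y) gives X back,
# and the same holds symmetrically for Y.
recovered_x = xor_bytes(y, parity)
recovered_y = xor_bytes(x, parity)
```

Three cells are stored for two cells of data, which is where the (3-2)/2 = 50% overhead comes from. Losing any two of the three cells is not recoverable.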

Page 15: Hadoop 3 (2017 hadoop taiwan workshop)

Erasure Coding 101

Reed-Solomon
• Compute parity bits for redundancy
• Blocks can be reconstructed after failures
• Configurable durability vs. storage overhead

RS(6,3)
• = 50% storage overhead
• (9-6)/6

[Diagram: K blocks, encode, M parity blocks]
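The overhead arithmetic on these slides can be checked with a short calculation. The sketch below uses hypothetical helper names and defines overhead as extra bytes stored per byte of data, so "3x storage cost" corresponds to 200% overhead.

```python
from fractions import Fraction


def replication_overhead(replicas: int) -> Fraction:
    """Extra copies stored per unit of data under n-way replication."""
    return Fraction(replicas - 1, 1)


def rs_overhead(k: int, m: int) -> Fraction:
    """Parity cells stored per data cell under RS(k, m): ((k + m) - k) / k."""
    return Fraction(m, k)


def as_percent(overhead: Fraction) -> str:
    return f"{float(overhead) * 100:.0f}%"


# name -> (storage overhead, number of lost copies/cells that can be tolerated)
schemes = {
    "3x replication": (replication_overhead(3), 2),
    "XOR = RS(2,1)": (rs_overhead(2, 1), 1),
    "RS(6,3)": (rs_overhead(6, 3), 3),
}

for name, (overhead, tolerated) in schemes.items():
    print(f"{name}: overhead {as_percent(overhead)}, tolerates {tolerated} failures")
```

Under these definitions RS(6,3) stores 9 cells for every 6 cells of data, so it tolerates one more failure than 3x replication at a quarter of the overhead.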

Page 16: Hadoop 3 (2017 hadoop taiwan workshop)

HDFS-EC: RS(6,3)

[Diagram: write and read paths for a block striped across strip 1, strip 2, strip 3, strip 4, strip 5, strip 6, with parity 1, parity 2, parity 3]
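One way to picture the striped layout is to map a logical file offset to the strip that holds it. The sketch below assumes plain round-robin striping with a 1 MiB cell and ignores block-group boundaries; `locate`, `CellLocation`, and the constants are illustrative names, not HDFS APIs.

```python
from typing import NamedTuple

CELL_SIZE = 1024 * 1024  # assumed striping cell size (1 MiB)
DATA_UNITS = 6           # k in RS(6,3)


class CellLocation(NamedTuple):
    stripe: int           # which row of cells, starting at 0
    strip: int            # which data strip, 1..k as labeled on the slide
    offset_in_strip: int  # byte offset inside that strip


def locate(offset: int, k: int = DATA_UNITS, cell_size: int = CELL_SIZE) -> CellLocation:
    """Map a logical file offset to its position in a round-robin striped layout."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    cell_index, offset_in_cell = divmod(offset, cell_size)
    stripe, strip_index = divmod(cell_index, k)
    return CellLocation(stripe, strip_index + 1, stripe * cell_size + offset_in_cell)
```

For example, `locate(0)` falls in strip 1, `locate(CELL_SIZE)` in strip 2, and the seventh cell wraps back to strip 1 on the next stripe. The three parity strips are computed per stripe from the six data cells and are not addressed by file offset.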

Page 17: Hadoop 3 (2017 hadoop taiwan workshop)

HDFS-EC: Failure Handling

Page 19: Hadoop 3 (2017 hadoop taiwan workshop)

YARN Federation

Page 20: Hadoop 3 (2017 hadoop taiwan workshop)

YARN

A resource management framework for Hadoop clusters
● Highly scalable, 4000 - 8000 nodes in production
● Hive, Oozie, Spark, …
● HBase

Page 21: Hadoop 3 (2017 hadoop taiwan workshop)

YARN

[Diagram: Client, Resource Manager, Node Manager, Application Master]

Page 22: Hadoop 3 (2017 hadoop taiwan workshop)

YARN Federation

Developed by Microsoft
Extreme scale
● 100,000 compute nodes
● Resource Manager becomes the bottleneck

Page 23: Hadoop 3 (2017 hadoop taiwan workshop)

YARN Federation

[Diagram: two Application Masters, RM Proxy, RM 1, RM 2]
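The idea of spreading applications over several Resource Managers can be sketched as a tiny router. Everything here is hypothetical: the `Router` and `SubCluster` classes and the least-loaded policy are assumptions chosen for illustration, not the policies YARN Federation actually ships.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SubCluster:
    """One Resource Manager and the nodes it manages (hypothetical model)."""
    name: str
    running_apps: int = 0


@dataclass
class Router:
    """Assigns each application a home sub-cluster and remembers the choice."""
    sub_clusters: List[SubCluster]
    home: Dict[str, str] = field(default_factory=dict)

    def submit(self, app_id: str) -> str:
        # An application that is already known keeps its home sub-cluster.
        if app_id in self.home:
            return self.home[app_id]
        # Assumed policy: pick the sub-cluster with the fewest running apps.
        target = min(self.sub_clusters, key=lambda sc: sc.running_apps)
        target.running_apps += 1
        self.home[app_id] = target.name
        return target.name


router = Router([SubCluster("RM 1"), SubCluster("RM 2")])
first = router.submit("app_1")
second = router.submit("app_2")
```

Because no single Resource Manager tracks every application, the per-RM load stays bounded as sub-clusters are added, which is the bottleneck the previous slide describes.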

Page 24: Hadoop 3 (2017 hadoop taiwan workshop)

YARN Timeline Service v2

Page 25: Hadoop 3 (2017 hadoop taiwan workshop)

Job History Server

● Keeps track of job progress
● Collects or retrieves information about MapReduce jobs
● Extensibility
○ MR only
● Usability
○ No YARN-level events
○ Metrics can only be retrieved after the job terminates

Page 26: Hadoop 3 (2017 hadoop taiwan workshop)

Application Timeline Server v2

● Development led by Twitter
● Usability
○ Flow: logical group of applications
● Scalability
○ HBase
● Use cases
○ Analyze application performance.
○ Cluster capacity planning.

Page 27: Hadoop 3 (2017 hadoop taiwan workshop)

ATSv2 Architecture

Use HBase for storage
Use cases:
● Analyze application performance.
● Cluster capacity planning.
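A flow, as a logical group of applications, lends itself to simple roll-ups for the two use cases listed. The sketch below is an assumption-laden illustration: the record fields (`app_id`, `flow`, `vcore_seconds`) and the sample values are invented, and it does not use the Timeline Service API.

```python
from collections import defaultdict
from typing import Dict, List

# Invented sample records; a real deployment would read these from the timeline store.
apps: List[dict] = [
    {"app_id": "app_1", "flow": "daily_etl", "vcore_seconds": 120},
    {"app_id": "app_2", "flow": "daily_etl", "vcore_seconds": 80},
    {"app_id": "app_3", "flow": "ad_hoc_report", "vcore_seconds": 45},
]


def aggregate_by_flow(records: List[dict]) -> Dict[str, int]:
    """Sum a per-application metric over each flow."""
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        totals[record["flow"]] += record["vcore_seconds"]
    return dict(totals)


usage = aggregate_by_flow(apps)
```

Per-flow totals like these are the kind of input that performance analysis and capacity planning would start from.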

Page 28: Hadoop 3 (2017 hadoop taiwan workshop)

HDFS Multi Standby NameNodes

Page 29: Hadoop 3 (2017 hadoop taiwan workshop)

NameNode High Availability

[Diagram: Client, Active NameNode, Standby NameNode, and a quorum of three JournalNodes, with an arrow labeled "Upload fsimage"]
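The "Quorum" label refers to majority agreement among the JournalNodes. As a rough sketch (the function names are hypothetical and this is not the Quorum Journal Manager implementation), an edit counts as durable once more than half of the JournalNodes acknowledge it:

```python
def is_committed(acks: int, journal_nodes: int) -> bool:
    """An edit is durable once a strict majority of JournalNodes acknowledged it."""
    if not 0 <= acks <= journal_nodes:
        raise ValueError("acks must be between 0 and the number of JournalNodes")
    return acks >= journal_nodes // 2 + 1


def tolerated_failures(journal_nodes: int) -> int:
    """How many JournalNodes can be down while a majority is still reachable."""
    return (journal_nodes - 1) // 2
```

With the three JournalNodes in the diagram, two acknowledgements are enough and one JournalNode may fail; five JournalNodes would tolerate two failures.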

Page 30: Hadoop 3 (2017 hadoop taiwan workshop)

Multiple Standby NameNode

Contributed by Salesforce.

[Diagram: Client, Active NameNode, two Standby NameNodes, and a quorum of three JournalNodes, with two arrows labeled "Upload fsimage"]
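With more than one standby, the cluster can survive losing a second NameNode. The toy model below is only a sketch under stated assumptions: the names `nn1` to `nn3`, the state strings, and the "promote the first standby" rule are all hypothetical and do not describe the real failover controller.

```python
from typing import Dict, List, Optional


def find_active(namenodes: List[str], states: Dict[str, str]) -> Optional[str]:
    """Return the first NameNode reporting the active state, if any."""
    for nn in namenodes:
        if states.get(nn) == "active":
            return nn
    return None


def fail_over(namenodes: List[str], states: Dict[str, str], failed: str) -> Optional[str]:
    """Mark one NameNode as down and promote the first remaining standby."""
    states[failed] = "down"
    for nn in namenodes:
        if states.get(nn) == "standby":
            states[nn] = "active"
            return nn
    return None


namenodes = ["nn1", "nn2", "nn3"]
states = {"nn1": "active", "nn2": "standby", "nn3": "standby"}

promoted_first = fail_over(namenodes, states, "nn1")           # nn2 takes over
promoted_second = fail_over(namenodes, states, promoted_first)  # nn3 takes over
```

With a single standby the second failure would leave no NameNode to promote; the extra standby is what keeps `find_active` returning a result here.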

Page 31: Hadoop 3 (2017 hadoop taiwan workshop)

Classpath Isolation

Page 32: Hadoop 3 (2017 hadoop taiwan workshop)

Dependency Hell

Page 33: Hadoop 3 (2017 hadoop taiwan workshop)

Dependency Hell

Hadoop was not initially designed as the foundation of many applications.
● More applications depend on Hadoop
● Harder for Hadoop to upgrade dependency libraries
● Potential risk of breaking existing applications
● Increased exposure to security vulnerabilities

Classpath Isolation
● Separate client-side classpath from server-side

Page 34: Hadoop 3 (2017 hadoop taiwan workshop)

Cloud

Page 35: Hadoop 3 (2017 hadoop taiwan workshop)

Other features

● Cloud connectors
○ Microsoft Azure Data Lake filesystem
○ Aliyun Object Storage Service

Page 36: Hadoop 3 (2017 hadoop taiwan workshop)

Misc.

Page 37: Hadoop 3 (2017 hadoop taiwan workshop)

Other features and incompatibility

● Shell script rewrite
● Requires Java 8
● Server ports
● Remove legacy features
○ S3 file system → S3A (recommended) or S3N
○ Hftp → webhdfs/httpfs
○ Bookkeeper Journal Manager → Quorum Journal Manager

Page 38: Hadoop 3 (2017 hadoop taiwan workshop)

What’s next?

Page 39: Hadoop 3 (2017 hadoop taiwan workshop)

Now what?

Developers
• Use it early, test it early and file bug reports.

Administrators
• Test upgradability

Users
• Expect better user experience.

[Diagram: Hadoop 3 timeline with milestones Alpha 1, Alpha 2, Alpha 3, Beta 1, GA, and CDH6 marked "?"; date labels 2016/09, 2016/12, 2017/01, 2017/0?, 2017/0?]

Page 40: Hadoop 3 (2017 hadoop taiwan workshop)

Future? Hadoop 4?

● We don’t know yet.
● Ozone (HDFS-7240)
○ Object store for HDFS
● HDFS over cloud (HDFS-9806)
● Emerging applications and use cases
○ Docker
○ Deep learning
● Hardware Trend
○ Cloud storage
○ Faster Ethernet (40 Gbps), high density (> 100TB) storage node
○ Memory technology
○ Locality will not be a deciding factor.

Page 41: Hadoop 3 (2017 hadoop taiwan workshop)

Ozone (HDFS-7240)

Status quo
● NameNode is becoming a bottleneck
● A general file system may not suit the specific need of an application

Solution
● Split HDFS namespace into blob stores

Page 42: Hadoop 3 (2017 hadoop taiwan workshop)

HDFS over Cloud (HDFS-9806)

Use case
● Use HDFS for temporary data
● Use cloud for permanent storage

The problem
● Data management
● Consistency

Solution
● HDFS as metastore and cache
● Cloud as backend data store
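The "HDFS as cache, cloud as backend" arrangement resembles a write-through cache. The sketch below is a generic illustration under that assumption; `TieredStore` and its methods are hypothetical names, in-memory dictionaries stand in for both HDFS and the cloud store, and it does not reflect the design in HDFS-9806.

```python
from typing import Dict, Optional


class TieredStore:
    """Toy two-tier store: a fast local cache in front of a durable backend."""

    def __init__(self, backend: Dict[str, bytes]) -> None:
        self.backend = backend             # stands in for the cloud data store
        self.cache: Dict[str, bytes] = {}  # stands in for HDFS

    def write(self, path: str, data: bytes) -> None:
        # Write-through: update both tiers so they never disagree.
        self.cache[path] = data
        self.backend[path] = data

    def read(self, path: str) -> Optional[bytes]:
        # Serve from the cache when possible, otherwise load from the backend.
        if path in self.cache:
            return self.cache[path]
        data = self.backend.get(path)
        if data is not None:
            self.cache[path] = data
        return data

    def evict(self, path: str) -> None:
        # Dropping cached data is safe because the backend keeps the permanent copy.
        self.cache.pop(path, None)


cloud: Dict[str, bytes] = {}
store = TieredStore(cloud)
store.write("/tmp/part-0000", b"rows")
store.evict("/tmp/part-0000")
reloaded = store.read("/tmp/part-0000")
```

Writing through to the backend is one simple answer to the consistency problem on the slide; it trades write latency for never having the two tiers disagree.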

Page 43: Hadoop 3 (2017 hadoop taiwan workshop)

Ask Bigger Questions

Page 44: Hadoop 3 (2017 hadoop taiwan workshop)

References

• Introduction to HDFS Erasure Coding in Apache Hadoop
• Enable YARN RM scale out via federation using multiple RM's
• Application Timeline Server - Past, Present and Future
• HDFS-6440 Support more than 2 NameNodes
• How-to: Use the New HDFS Intra-DataNode Disk Balancer in Apache Hadoop

