Debunking the Myths of HDFS Erasure Coding Performance

Replication is Expensive

- HDFS inherits 3-way replication from Google File System
  - Simple, scalable and robust
- 200% storage overhead
- Secondary replicas rarely accessed

Erasure Coding Saves Storage

Simplified example: storing 2 bits (1 0)

- Replication: store 1 0 twice (1 0 1 0), 2 extra bits
- XOR coding: store 1 0 plus the parity bit 1 ⊕ 0 = 1, 1 extra bit
- Same data durability: can lose any 1 bit
- Half the storage overhead
- Slower recovery

Erasure Coding Saves Storage

- Facebook
  - f4 stores 65PB of BLOBs in EC
- Windows Azure Storage (WAS)
  - A PB of new data every 1~2 days
  - All “sealed” data stored in EC
- Google File System
  - Large portion of data stored in EC

Roadmap

- Background of EC
  - Redundancy Theory
  - EC in Distributed Storage Systems
- HDFS-EC architecture
- Hardware-accelerated Codec Framework
- Performance Evaluation

Durability and Efficiency

- Data Durability = how many simultaneous failures can be tolerated?
- Storage Efficiency = what portion of storage holds useful data?

3-way Replication: 1 copy of useful data, 2 copies of redundant data
- Data Durability = 2
- Storage Efficiency = 1/3 (33%)



XOR: 2 cells of useful data (X, Y), 1 cell of redundant data (X ⊕ Y)
- Data Durability = 1
- Storage Efficiency = 2/3 (67%)

X  Y  X ⊕ Y
0  0  0
0  1  1
1  0  1
1  1  0

Recovery example: with X = 0 and X ⊕ Y = 1, Y = 0 ⊕ 1 = 1
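The XOR recovery above can be sketched in a few lines of Python. This is a minimal illustration, not HDFS code; the function names `xor_parity` and `recover_missing` are made up for this example.

```python
def xor_parity(cells):
    """Return the byte-wise XOR of equal-length byte strings."""
    parity = bytearray(len(cells[0]))
    for cell in cells:
        for i, b in enumerate(cell):
            parity[i] ^= b
    return bytes(parity)


def recover_missing(surviving_cells, parity):
    """Rebuild the single missing cell from the survivors and the parity."""
    return xor_parity(list(surviving_cells) + [parity])


# Same values as the slide: X = 0, Y = 1.
x = bytes([0])
y = bytes([1])
p = xor_parity([x, y])            # parity cell X xor Y
recovered_y = recover_missing([x], p)
```

Because XOR is its own inverse, the same routine computes the parity and rebuilds a lost cell. It can only repair one loss per group, which is why the durability of XOR is 1.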



Reed-Solomon (RS):
- Data Durability = 2
- Storage Efficiency = 4/6 (67%)
- Very flexible!



                        Data Durability   Storage Efficiency
Single Replica          0                 100%
3-way Replication       2                 33%
XOR with 6 data cells   1                 86%
RS (6,3)                3                 67%
RS (10,4)               4                 71%
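The figures in this comparison follow directly from the two definitions. The sketch below recomputes them, treating n-way replication as tolerating n - 1 failures with efficiency 1/n, and a code with k data cells and m parity cells as tolerating m failures with efficiency k/(k + m). The helper names are illustrative only.

```python
from fractions import Fraction


def replication(n):
    """(durability, efficiency) for n identical copies."""
    return n - 1, Fraction(1, n)


def erasure_code(k, m):
    """(durability, efficiency) for k data cells plus m parity cells."""
    return m, Fraction(k, k + m)


schemes = {
    "Single Replica": replication(1),
    "3-way Replication": replication(3),
    "XOR with 6 data cells": erasure_code(6, 1),
    "RS (6,3)": erasure_code(6, 3),
    "RS (10,4)": erasure_code(10, 4),
}

for name, (durability, efficiency) in schemes.items():
    print(f"{name:<24}{durability:>3}{float(efficiency):>7.0%}")
```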

EC in Distributed Storage

Block Layout: Contiguous Layout
- Data Locality 👍🏻
- Small Files 👎🏻

[Figure: a file is split into 128M ranges (0~128M, 128~256M, …, 640~768M); each range is stored whole as block 0 … block 5 on DataNode 0 … DataNode 5, with parity on DataNode 6 onward.]

EC in Distributed Storage

Block Layout: Striped Layout
- Data Locality 👎🏻
- Small Files 👍🏻
- Parallel I/O 👍🏻

[Figure: a file is split into 1M cells (0~1M, 1~2M, …, 5~6M, 6~7M, …) placed in turn on block 0 … block 5 on DataNode 0 … DataNode 5, with parity on DataNode 6 onward.]
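The difference between the two layouts comes down to how a file offset maps to a block. The sketch below assumes the sizes shown in the figures (128M blocks, 1M cells, 6 data blocks); the function names are hypothetical and parity blocks are ignored.

```python
MB = 1024 * 1024


def contiguous_location(offset, block_size=128 * MB):
    """Return (block index, offset inside that block) for contiguous layout."""
    return divmod(offset, block_size)


def striped_location(offset, cell_size=1 * MB, data_blocks=6):
    """Return (block index, offset inside that block) for striped layout."""
    cell_index, offset_in_cell = divmod(offset, cell_size)
    # Cells are dealt out to the data blocks in turn, one stripe at a time.
    stripe_index, block_index = divmod(cell_index, data_blocks)
    return block_index, stripe_index * cell_size + offset_in_cell


# Byte 6M is still in block 0 of a contiguous file,
# but starts the second cell of block 0 in a striped file.
print(contiguous_location(6 * MB))
print(striped_location(6 * MB))
```

Under these assumptions a 3M file touches one block in the contiguous layout but three in the striped layout, which is why striping allows parallel I/O and protects small files, at the cost of data locality.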

EC in Distributed Storage

Spectrum:

             Replication                    Erasure Coding
Striping     Ceph, Quantcast File System    Ceph, Quantcast File System
Contiguous   HDFS                           Facebook f4, Windows Azure




Choosing Block Layout

Assuming (6,3) coding:
- Small files: < 1 block
- Medium: 1~6 blocks
- Large: > 6 blocks (1 group)
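The size classes can be expressed as a small classifier. This is a sketch only: the 128M block size is an assumption taken from the layout figures, and treating exactly 1 block and exactly 6 blocks as "medium" is one reading of the boundaries, which the slide does not spell out.

```python
MB = 1024 * 1024


def classify(file_size, block_size=128 * MB, data_blocks=6):
    """Classify a file as small, medium or large for (6,3) coding."""
    if file_size < block_size:
        return "small"       # less than 1 block
    if file_size <= data_blocks * block_size:
        return "medium"      # 1 to 6 blocks
    return "large"           # more than 1 full block group


for size in (10 * MB, 128 * MB, 768 * MB, 769 * MB):
    print(size // MB, "MB ->", classify(size))
```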

Cluster A Profile

              small     medium    large
file count    96.29%    1.86%     1.85%
space usage   26.06%    9.33%     64.61%

Top 2% files occupy ~65% space

Cluster B Profile

              small     medium    large
file count    86.59%    11.38%    2.03%
space usage   23.89%    36.03%    40.08%

Top 2% files occupy ~40% space

Cluster C Profile

              small     medium    large
file count    99.64%    0.36%     0.00%
space usage   76.05%    20.75%    3.20%

Dominated by small files

Choosing Block Layout

Current HDFS

Generalizing Block (NameNode)

Mapping Logical and Storage Blocks

Too Many Storage Blocks?

Hierarchical Naming Protocol:
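The transcript preserves only the title of the hierarchical naming protocol. One way such a scheme can avoid tracking every storage block separately is to derive each storage block ID from a single block group ID. The sketch below assumes the low 4 bits of an ID hold the index within the group; the bit width and the function names are assumptions for illustration, not taken from the slides.

```python
INDEX_BITS = 4                        # assumed width of the in-group index
INDEX_MASK = (1 << INDEX_BITS) - 1


def storage_block_id(group_id, index):
    """Derive a storage block ID from its block group ID and index."""
    if group_id & INDEX_MASK != 0:
        raise ValueError("group ID must leave the index bits clear")
    if not 0 <= index <= INDEX_MASK:
        raise ValueError("index does not fit in the index bits")
    return group_id | index


def split_block_id(block_id):
    """Return (block group ID, index within the group)."""
    return block_id & ~INDEX_MASK, block_id & INDEX_MASK


# Nine storage blocks of a (6,3) group share one group ID.
group = 0x1230
ids = [storage_block_id(group, i) for i in range(9)]
```

With this kind of naming, a lookup for any storage block can be answered from one entry per group, since the group ID is recoverable from the storage block ID itself.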

Client Parallel Writing

[Figure: data flows through a queue to multiple streamers that write in parallel, with a Coordinator alongside them.]
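A toy version of parallel writing can be sketched with one queue and one thread per streamer. This is not the HDFS client: it uses a single XOR parity cell as a stand-in for Reed-Solomon, it omits the Coordinator, and the name `stripe_write` and the in-memory "blocks" are made up for the example.

```python
import queue
import threading


def stripe_write(data, cell_size, num_data):
    """Split data into cells and hand them to one streamer per block."""
    num_streamers = num_data + 1          # data blocks plus one XOR parity block
    queues = [queue.Queue() for _ in range(num_streamers)]
    stored = [bytearray() for _ in range(num_streamers)]

    def streamer(i):
        # Each streamer drains its own queue and appends to its own block.
        while True:
            cell = queues[i].get()
            if cell is None:              # sentinel: no more cells
                break
            stored[i] += cell

    threads = [threading.Thread(target=streamer, args=(i,))
               for i in range(num_streamers)]
    for t in threads:
        t.start()

    stripe_size = cell_size * num_data
    for start in range(0, len(data), stripe_size):
        stripe = data[start:start + stripe_size]
        parity = bytearray(cell_size)
        for j in range(num_data):
            cell = stripe[j * cell_size:(j + 1) * cell_size]
            if cell:
                queues[j].put(cell)
            for k, b in enumerate(cell):  # short cells count as zero-padded
                parity[k] ^= b
        queues[num_data].put(bytes(parity))

    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return [bytes(s) for s in stored]


blocks = stripe_write(bytes(range(10)), cell_size=2, num_data=3)
```

Each streamer only ever touches its own queue and its own block, so the writes proceed independently without locks in this sketch.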

Client Parallel Reading

[Figure: the client reads from multiple blocks in parallel, including parity.]

Reconstruction on DataNode

- Important to avoid delay on the critical path
  - Especially if original data is lost
- Integrated with Replication Monitor
  - Under-protected EC blocks scheduled together with under-replicated blocks
  - New priority algorithms
- New ErasureCodingWorker component on DataNode

Data Checksum Support

- Supports getFileChecksum for EC striped mode files
  - Checksums are comparable between striped files with the same content
  - Checksums of a contiguous file and a striped file can’t be compared
  - Can reconstruct on the fly if a block is found missing while computing
- Planning to introduce a new version of getFileChecksum
  - To achieve comparable checksums between contiguous and striped files




Acceleration with Intel ISA-L

- 1 legacy coder
  - From Facebook’s HDFS-RAID project
- 2 new coders
  - Pure Java: code improvement over HDFS-RAID
  - Native coder with Intel’s Intelligent Storage Acceleration Library (ISA-L)

Why is ISA-L Fast?

- pre-computed and reused
- parallel operation
- Direct ByteBuffer

Microbenchmark: Codec Calculation

Microbenchmark: HDFS I/O



DFSIO / MapReduce

Hive-on-MR — locality sensitive

Hive-on-Spark — locality sensitive

Conclusion

- Erasure coding expands effective storage space by ~50%!
- HDFS-EC phase I implements erasure coding in striped block layout
- Upstream effort (HDFS-7285):
  - Design finalized Nov. 2014
  - Development started Jan. 2015
  - 218 commits, ~25k LoC change
  - Broad collaboration: Cloudera, Intel, Hortonworks, Huawei, Yahoo, LinkedIn
- Phase II will support contiguous block layout for better locality

Acknowledgements

- Cloudera
  - Andrew Wang, Aaron T. Myers, Colin McCabe, Todd Lipcon, Silvius Rus
- Intel
  - Kai Zheng, Rakesh R, Yi Liu, Weihua Jiang, Rui Li
- Hortonworks
  - Jing Zhao, Tsz Wo Nicholas Sze
- Huawei
  - Vinayakumar B, Walter Su, Xinwei Qin
- Yahoo (Japan)
  - Gao Rui, Kai Sasaki, Takuya Fukudome, Hui Zheng

Questions?

Zhe Zhang, [email protected] | @oldcap | http://zhe-thoughts.github.io/

Uma Gangumalla, [email protected] | @UmaMaheswaraG

http://blog.cloudera.com/blog/2016/02/progress-report-bringing-erasure-coding-to-apache-hadoop/

Twitter #HS16SJ

Date post:	16-Apr-2017
Category:	Technology
Upload:	dataworks-summithadoop-summit