Date post: 18-Dec-2015
Anti-Skew: Single-Key Data Skew Mitigation for MapReduce Yue Chen Florida State University Advanced Database Systems
Anti-Skew: Single-Key Data Skew Mitigation for MapReduce

Yue Chen
Florida State University

Advanced Database Systems


Outline

• Background
• Data Skew
• Anti-skew Design
• Conclusion
• Related Work


Background

Skip


Big Data Trend


Big Data Trend

Mike Olson is a co-founder and former CEO of Cloudera.


Big Data Trend


History


Review of MapReduce Word Count


Data Skew


What is data skew?

[Chart: Attribute1 per key, Key1 to Key8; value axis 0 to 7]


What is data skew?

[Chart: Attribute1 per key, Key1 to Key8; value axis 0 to 25]


What is data skew?

[Chart: Attribute1 per key, Key1 to Key8; value axis 0 to 350]


The Most Skewed Key?

NULL

Reported by the data team of


Anti-skew Design


Problem

[Diagram: Reducer1 handles Key1, Key2, Key3; Reducer2 handles Key4, Key5, Key6]


Assumption 1

• The task can be divided into sub-tasks whose partial results can easily be reassembled into the final result.
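
Word count is one task that fits this assumption: counts computed by separate sub-tasks can simply be added together. A minimal sketch, in which `count_words` and `merge_counts` are hypothetical helper names and not part of the talk:

```python
from collections import Counter

def count_words(chunk):
    """Sub-task: count the words in one slice of the input."""
    return Counter(chunk)

def merge_counts(partials):
    """Reassembly: add the partial counts together."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

words = ["the", "cat", "the", "dog", "the", "the"]

# Split the input into two sub-tasks of arbitrary size.
partials = [count_words(words[:2]), count_words(words[2:])]
merged = merge_counts(partials)
print(merged)
```

The merged result is the same no matter where the input is cut, which is the property the assumption asks for.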


Assumption 2

• The key-value pairs are near-evenly distributed across the input data, so sampling would be effective, although pre-execution sampling is not required.
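
If keys are spread evenly through the input, the key shares seen in a small sample should approximate those of the whole input. A sketch under that reading of the assumption; `estimate_key_shares` and the toy records are illustrative only:

```python
from collections import Counter

def estimate_key_shares(records, sample_size):
    """Estimate each key's share of the input from the first sample_size records."""
    sample = records[:sample_size]
    if not sample:
        return {}
    counts = Counter(key for key, _ in sample)
    return {key: n / len(sample) for key, n in counts.items()}

# Toy input (made up): "the" is every other record, so it is evenly interleaved.
records = [("the", 1), ("cat", 1), ("the", 1), ("dog", 1)] * 250

shares = estimate_key_shares(records, sample_size=100)
print(shares)
```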


Basic Idea

[Diagram: Key1 is split into Key1.1, Key1.2, Key1.3, Key1.4, Key1.5, Key1.6]
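
One way to realise this idea is to append a sub-key index to each record of the skewed key before the reduce step and to strip it again when merging. The talk does not specify the mechanism; the round-robin assignment, the `.` separator, and all function names below are assumptions:

```python
from collections import defaultdict
from itertools import count

def split_key(records, hot_key, n_splits):
    """Rewrite hot_key as hot_key.1 .. hot_key.n_splits, round-robin."""
    counter = count()
    for key, value in records:
        if key == hot_key:
            key = f"{key}.{next(counter) % n_splits + 1}"
        yield key, value

def reduce_sum(records):
    """Reduce step: sum the values of each (sub-)key."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)

def merge_sub_keys(totals, hot_key):
    """Fold the sub-key totals back into hot_key.

    Assumes no genuine key starts with hot_key + ".".
    """
    merged = defaultdict(int)
    for key, value in totals.items():
        base = hot_key if key.startswith(hot_key + ".") else key
        merged[base] += value
    return dict(merged)

records = [("Key1", 1)] * 12 + [("Key2", 1)] * 2
partial = reduce_sum(split_key(records, "Key1", 6))
final = merge_sub_keys(partial, "Key1")
print(partial)
print(final)
```

The second merge step only sees one record per sub-key, so it stays cheap even when the original key was heavily skewed.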


Skew Perception

Needs visualization!


Skew Detection


Straggler Identification (tentative)

• A key is treated as a straggler when its count is more than 50% (100%? 200%?) above the median key count.

Straggler
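
The tentative rule can be sketched as a threshold relative to the median key count. The sketch reads the percentage as a margin above the median, which is an interpretation of the slide; `find_stragglers`, the `margin` parameter, and the counts are made up for illustration:

```python
from statistics import median

def find_stragglers(key_counts, margin=0.5):
    """Return the keys whose count exceeds the median count by more than margin."""
    threshold = median(key_counts.values()) * (1 + margin)
    return sorted(k for k, n in key_counts.items() if n > threshold)

# Toy counts (not from the talk): Key1 dominates.
key_counts = {"Key1": 24, "Key2": 5, "Key3": 4, "Key4": 6,
              "Key5": 5, "Key6": 3, "Key7": 5, "Key8": 4}

print(find_stragglers(key_counts, margin=0.5))
print(find_stragglers(key_counts, margin=2.0))
```

A small margin flags keys that are only slightly above typical, while a large one flags only the dominant key, which is presumably why the slide leaves the value open.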


Straggler Identification (tentative)

[Chart: Attribute1 per key, Key1 to Key8; value axis 0 to 25]


Key Splitting

the
  -> Hash   -> 8fc42c6ddf9966db3b09e84365034357
  -> Rehash -> 6e9b31333e61aad015fa16a3a5fe8e0d
  -> Rehash -> 2e20bfee9e4486f0ab651fc0bb988ffb
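
The 32-hex-digit values on this slide have the shape of MD5 digests, so the sketch below assumes MD5 and assumes that each rehash is applied to the hexadecimal string of the previous digest. Neither detail is stated in the talk, and `hash_chain` is a hypothetical helper:

```python
import hashlib

def hash_chain(key, n_splits):
    """Return n_splits sub-keys: hash(key), hash(hash(key)), and so on."""
    sub_keys = []
    current = key
    for _ in range(n_splits):
        # Assumption: the rehash input is the previous digest as a hex string.
        current = hashlib.md5(current.encode("utf-8")).hexdigest()
        sub_keys.append(current)
    return sub_keys

sub_keys = hash_chain("the", 3)
for sub_key in sub_keys:
    print(sub_key)
```

Because the chain is deterministic, every mapper can derive the same sub-keys for a given key without coordinating with the others.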


Key Splitting

[Diagram: "the" shown with 8fc42c6ddf9966db3b09e84365034357, 6e9b31333e61aad015fa16a3a5fe8e0d and 2e20bfee9e4486f0ab651fc0bb988ffb; label: Special Key]


Load Balancing (tentative)

• Can the hashing algorithm, combined with the platform’s partition algorithm, distribute the keys evenly across reducers?

Partitioner Function
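
This question can be probed empirically by pushing the sub-keys through a partitioner and counting the load per reducer. The sketch uses a stand-in partitioner (digest modulo reducer count) rather than any platform's actual one, and assigns the hot key's records to sub-keys round-robin; both choices, and the sub-key naming, are assumptions:

```python
import hashlib
from collections import Counter

def partition(key, n_reducers):
    """Stand-in partitioner: digest of the key modulo the reducer count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_reducers

def reducer_loads(keys, n_reducers):
    """Count how many records each reducer would receive."""
    loads = Counter({r: 0 for r in range(n_reducers)})
    loads.update(partition(k, n_reducers) for k in keys)
    return loads

n_reducers = 4

# Unsplit: every record of the hot key lands on the same reducer.
before = reducer_loads(["the"] * 1000, n_reducers)

# Split into 16 sub-keys, assigned round-robin.
after = reducer_loads([f"the.{i % 16}" for i in range(1000)], n_reducers)

print("max load before:", max(before.values()))
print("max load after: ", max(after.values()))
```

With only a handful of sub-keys the spread can still be uneven, so the number of sub-keys relative to the number of reducers matters.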


Privacy (if pre-processed)

Name             Net worth (USD)
Bill Gates       $79.2 billion
Carlos Slim      $77.1 billion
Warren Buffett   $72.7 billion
Amancio Ortega   $64.5 billion
Larry Ellison    $54.3 billion
Charles Koch     $42.9 billion
David Koch       $42.9 billion
Christy Walton   $41.7 billion


Privacy (if pre-processed)

Name             Net worth (USD)
8adc1a86f7       $79.2 billion
8ea0bb9a8f       $77.1 billion
9e640e0fe9       $72.7 billion
abf803fe43       $64.5 billion
bce5c74f58       $54.3 billion
4f589f4867       $42.9 billion
4867dbd572       $42.9 billion
e9ca9f808c       $41.7 billion
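
Replacing names with digests before the job runs could look like the following sketch. The slide does not say how its 10-character values were produced; SHA-256 truncated to 10 hex digits is an assumption made only for illustration, and `pseudonymize` is a hypothetical name:

```python
import hashlib

def pseudonymize(name, length=10):
    """Replace a name with the first `length` hex digits of its SHA-256 digest."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:length]

rows = [("Bill Gates", "$79.2 billion"), ("Carlos Slim", "$77.1 billion")]

# Hash only the key column; the value column is left untouched.
hashed_rows = [(pseudonymize(name), worth) for name, worth in rows]
for hashed_name, worth in hashed_rows:
    print(hashed_name, worth)
```

An unsalted digest of a guessable name can be matched by hashing candidate names, so this hides the keys from casual inspection only.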


Conclusion

• A simple way to handle single-key skew in the MapReduce programming model
• No extra OS-level resources needed
• Implemented as a wrapper: no need to modify the platform’s source code, so it can be used on online platforms (there are so many Hadoop distributions, versions, and patches!)


Hadoop Distributions


Related Work

Skip


SkewReduce

[Figure: data divided into 15 numbered partitions (1 to 15)]

• Varying granularities of partitions
• Can we automatically find a good partition plan and schedule?


SkewReduce

• Goal: minimize expected total runtime

[Diagram: SkewReduce Optimizer, shown with a sample, the cluster configuration, and cost functions, and with a plan of 15 numbered partitions (1 to 15)]


SkewTune

• Does what SkewReduce does, but while the program is running.

• Skew detected -> Stop -> Repartition -> Continue


SpongeFiles


Q&A

Suggestions?
