MapReduce Patterns (source: MapReduce Design Patterns, Miner & Shook, O’Reilly)

Date post: 15-Sep-2020

MapReduce: PATTERNS FOR PROCESS

Agenda

Overview of all the MapReduce Design Patterns

MapReduce Design Patterns Overview

Deep Dive into following Patterns

Filtering Patterns

Join Patterns

Input and Output Patterns

Other Patterns Overview

Summarization Patterns

Data Organization Patterns

MetaPatterns

Comparison chart of when to use which design patterns

Best Practices

MapReduce Patterns

Summarization Patterns

Filtering Patterns

Data Organization Patterns

Join Patterns

Meta Patterns

Input & Output Patterns

BIG DATA SERIES, Powered by Prognosive © 2015

Summarization Patterns

Numerical Summarizations

Inverted Indexes

Counters

Numerical Summarizations

Word Count

Record Counts

Min / Max

Average/Median/Std Deviation

Inverted Indexes

source: MapReduce Design Patterns, Miner & Shook, O’Reilly

Counters

Record Count

Unique Instances

Summations

if (StringUtils.startsWithLetter(token)) {
    // Increment a custom counter for tokens that begin with a letter
    context.getCounter(WordsNature.STARTS_WITH_LETTER).increment(1);
}
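
The counter call above can be tried outside Hadoop with a plain-Java stand-in. This is a minimal sketch, not the deck's code: the `WordsNature` values beyond `STARTS_WITH_LETTER`, the behaviour of `startsWithLetter`, and the `EnumMap` used in place of the job's counters are all assumptions made for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

final class CounterSketch {
    // Hypothetical counter group; only STARTS_WITH_LETTER appears on the slide.
    enum WordsNature { STARTS_WITH_LETTER, STARTS_WITH_OTHER }

    // Assumed behaviour of the slide's StringUtils.startsWithLetter helper.
    static boolean startsWithLetter(String token) {
        return !token.isEmpty() && Character.isLetter(token.charAt(0));
    }

    // The EnumMap stands in for context.getCounter(...).increment(1).
    static Map<WordsNature, Long> count(String line) {
        Map<WordsNature, Long> counters = new EnumMap<>(WordsNature.class);
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields one empty token
            }
            WordsNature counter = startsWithLetter(token)
                    ? WordsNature.STARTS_WITH_LETTER
                    : WordsNature.STARTS_WITH_OTHER;
            counters.merge(counter, 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        System.out.println(count("42 apples and 7 pears"));
    }
}
```

In a real job the framework aggregates the counters across all tasks, so no reducer is needed just to count.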

Filter Patterns

Filters

Bloom Filters

Top Ten

Distinct

Filters

Narrowing Views

Tracking Event Threads

Distributed Grep

Data Cleansing

Simple Random Sampling

Low Scoring Data

Bloom Filters

Similar to other filters: check each record and decide whether to keep or remove it

Different: the filter is based on set membership

Compares one list to another

Sometimes emits a false positive (never a false negative); often this is OK

Steps:

1. Train the filter on the list of values and store it in HDFS

2. Do the filtering
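
The two steps can be sketched in plain Java. This is an illustrative toy, assuming a `BitSet` and a simple double-hashing scheme derived from `String.hashCode()`; the class and method names are hypothetical. A real job would more likely use Hadoop's own `org.apache.hadoop.util.bloom.BloomFilter` and ship the trained filter to the mappers.

```java
import java.util.BitSet;
import java.util.List;

final class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    private int position(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    // Step 1: train the filter on the list of values.
    void train(List<String> values) {
        for (String value : values) {
            for (int i = 0; i < hashCount; i++) {
                bits.set(position(value, i));
            }
        }
    }

    // Step 2: keep a record only if every probed bit is set.
    // true may be a false positive; false is always correct.
    boolean mightContain(String value) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(value, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A larger bit array lowers the false-positive rate at the cost of memory on every mapper.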

Bloom Filters

source: MapReduce Design Patterns, Miner & Shook, O’Reilly

Top Ten

Distinct (De-dupe)

Several Methods:

HDFS & MapReduce Alone

HBase & HDFS

HDFS, MapReduce & Storage Controller

Streaming, HDFS & MapReduce

MapReduce with Blocking

Data Organization Patterns

Structured to Hierarchical

Partitioning

Binning

Total Ordering

Shuffling

Structured to Hierarchical

source: MapReduce Design Patterns, Miner & Shook, O’Reilly

Partitioning

Binning

Uses the MultipleOutputs class

Emits multiple distinct files

The mapper:

looks at each line

iterates through a list of criteria for each bin

If the record meets the criteria, it is sent to that bin

No combiner, partitioner, or reducer used

source: MapReduce Design Patterns, Miner & Shook, O’Reilly
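
The mapper logic for binning can be sketched without Hadoop. In this illustration an in-memory map of lists stands in for the files that MultipleOutputs would write; the bin names, the criteria, and the class name are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

final class BinningSketch {
    // Send each record to every bin whose criterion it meets.
    static Map<String, List<String>> bin(List<String> records,
                                         Map<String, Predicate<String>> criteria) {
        Map<String, List<String>> bins = new LinkedHashMap<>();
        for (String name : criteria.keySet()) {
            bins.put(name, new ArrayList<>());
        }
        for (String record : records) {
            for (Map.Entry<String, Predicate<String>> entry : criteria.entrySet()) {
                if (entry.getValue().test(record)) {
                    bins.get(entry.getKey()).add(record);
                }
            }
        }
        return bins;
    }

    // Small demonstration with two made-up bins.
    static Map<String, List<String>> demo() {
        Map<String, Predicate<String>> criteria = new LinkedHashMap<>();
        criteria.put("hadoop", r -> r.contains("hadoop"));
        criteria.put("hive", r -> r.contains("hive"));
        return bin(Arrays.asList("hadoop tips", "hive on hadoop", "pig latin"), criteria);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A record may land in several bins or in none, which is why the work fits in the map phase alone.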

Total Order Sort

Mapper extracts the sort key

Custom partitioner:

loads the partition file

takes data ranges from the file

decides which reducer to target

Reducer: identity reducer; the number of reducers must equal the number of partitions

$ hadoop fs -cat output/part-r-*
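
The partitioner's range lookup can be sketched as a binary search over sorted split points. This is a simplified illustration with hypothetical names and in-memory split points; Hadoop ships `TotalOrderPartitioner` and `InputSampler` for the real thing, with the split points read from the partition file.

```java
import java.util.Arrays;

final class RangePartitionerSketch {
    // Sorted split points; n split points define n + 1 partitions.
    private final String[] splitPoints;

    RangePartitionerSketch(String[] sortedSplitPoints) {
        this.splitPoints = sortedSplitPoints.clone();
    }

    // The job needs exactly this many reducers.
    int numPartitions() {
        return splitPoints.length + 1;
    }

    // Keys below the first split point go to reducer 0, and so on upward.
    int getPartition(String key) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // Exact match: the key opens the partition that starts at that split point.
        // No match: binarySearch returns -(insertionPoint) - 1.
        return idx >= 0 ? idx + 1 : -idx - 1;
    }
}
```

Because every key in reducer i sorts before every key in reducer i + 1, concatenating the part-r-* files in order yields one globally sorted result.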

Shuffling

Mapper just outputs a random key K for each K,V pair

Reducer sorts on these random keys, which randomizes the records further

Use case: random sampling

Load-balances well
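
A minimal single-process sketch of the idea, assuming a `TreeMap` as a stand-in for the framework's sort between map and reduce; the class name and the seed parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.TreeMap;

final class ShuffleSketch {
    static List<String> shuffle(List<String> records, long seed) {
        Random random = new Random(seed);
        // "Map" phase: tag each record with a random key.
        // The TreeMap keeps keys sorted, as the framework's sort would.
        TreeMap<Integer, List<String>> byRandomKey = new TreeMap<>();
        for (String record : records) {
            byRandomKey.computeIfAbsent(random.nextInt(), k -> new ArrayList<>())
                       .add(record);
        }
        // "Reduce" phase: emit the values in key order and drop the random key.
        List<String> shuffled = new ArrayList<>();
        for (List<String> group : byRandomKey.values()) {
            shuffled.addAll(group);
        }
        return shuffled;
    }
}
```

Random keys spread evenly over the key space, which is why the load balances well across reducers.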

Join Patterns

Reduce-Side Joins

Replicated Joins

Composite Joins

Cartesian Product

Review: Inner Join

Review: Outer Join

Review: Cartesian Product

Reduce-side Join

Replicated Join

Map-only

Mapper reads the join file from the cache at startup and stores it in memory

source: MapReduce Design Patterns, Miner & Shook, O’Reilly
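
The replicated join can be sketched in plain Java as a hash-map lookup. This is an illustration under assumptions: records are `key,value` lines, the join is an inner join, and the constructor plays the role of the mapper's startup step; all names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReplicatedJoinSketch {
    private final Map<String, String> small = new HashMap<>();

    // Startup step: load the small join file ("key,value" lines) into memory.
    ReplicatedJoinSketch(List<String> smallSideLines) {
        for (String line : smallSideLines) {
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {
                small.put(parts[0], parts[1]);
            }
        }
    }

    // Per-record step: inner-join one large-side record against the table.
    // Returns null when there is no match, so the record is dropped.
    String map(String largeSideLine) {
        String[] parts = largeSideLine.split(",", 2);
        if (parts.length < 2) {
            return null;
        }
        String match = small.get(parts[0]);
        return match == null ? null : parts[0] + "," + parts[1] + "," + match;
    }
}
```

The pattern only works when the small side fits in each mapper's memory; in exchange, no shuffle or reduce phase is needed.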

Composite Join

Map-only

Driver code handles most of the work

Hadoop does the rest

Cartesian Product

Map-only

Driver code handles most of the work

Simple mapper

Input/Output Patterns

Custom Input & Output

Generating Data

External Sources

Partition Pruning

MapReduce Input and Output

Custom Inputs

OutputFormats

FileOutputFormat<K,V>: superclass

TextOutputFormat<K,V>: default output format

SequenceFileOutputFormat<K,V>

MultipleOutputs<K,V>: sends to various destinations

NullOutputFormat<K,V>: null output

LazyOutputFormat<K,V>

Custom Output

Extend OutputFormat (usually FileOutputFormat)

Implement getRecordWriter(), returning a RecordWriter instance

Define write() in the RecordWriter class; it is invoked for each K-V pair

Class: BankRecordWriter

    // Writes one tab-separated line per key-value pair
    public void write(AccountKey key, Account value) {
        out.println(key.getAccountKeyId() + "\t" + value.getAccountNbr());
    }

OutputFormat

RecordWriter

Generating Data

Map-Only

Good for generating sample data

MapReduce is a good tool to use

Seldom done

External Outputs

Partition Pruning

source: MapReduce Design Patterns, Miner & Shook, O’Reilly

MetaPatterns

Job Chaining

Chain Folding

Job Merging

End of Chapter

Lab
