MapReduce

source: MapReduce Design Patterns, Miner & Shook, O’Reilly

Post on 15-Sep-2020

MapReduce: Patterns for Process

Agenda

Overview of all the MapReduce Design Patterns

MapReduce Design Patterns Overview

Deep Dive into following Patterns

Filtering Patterns

Join Patterns

Input and Output Patterns

Other Patterns Overview

Summarization Patterns

Data Organization Patterns

MetaPatterns

Comparison chart of when to use which design patterns

Best Practices

MapReduce Patterns

Summarization Patterns

Filtering Patterns

Data Organization Patterns

Join Patterns

Meta Patterns

Input & Output Patterns

BIG DATA SERIES. Powered by Prognosive © 2015

Summarization Patterns

Numerical Summarizations

Inverted Indexes

Counters

Numerical Summarizations

Word Count

Record Counts

Min / Max

Average/Median/Std Deviation
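A minimal sketch of how count, min, max, and average can be carried in one mergeable record, written in plain Java with no Hadoop dependency. The names SummaryTuple and NumericalSummarization are hypothetical stand-ins, not Hadoop classes. Median is left out because it cannot be merged from these partial results.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a custom Writable that holds partial results.
class SummaryTuple {
    long count;
    double sum;
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;

    // Fold one raw value into the partial result.
    void add(double v) {
        count++;
        sum += v;
        min = Math.min(min, v);
        max = Math.max(max, v);
    }

    // Merging two partials is associative and commutative, so the same
    // logic could serve as both a combiner and a reducer.
    void merge(SummaryTuple other) {
        count += other.count;
        sum += other.sum;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
    }

    double average() {
        return sum / count;
    }
}

class NumericalSummarization {
    // Groups values by key (the "shuffle") and folds each group (the "reduce").
    static Map<String, SummaryTuple> summarize(List<Map.Entry<String, Double>> records) {
        Map<String, SummaryTuple> out = new TreeMap<>();
        for (Map.Entry<String, Double> r : records) {
            out.computeIfAbsent(r.getKey(), k -> new SummaryTuple()).add(r.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, SummaryTuple> result = summarize(List.of(
                Map.entry("a", 1.0), Map.entry("a", 3.0), Map.entry("b", 5.0)));
        result.forEach((k, t) -> System.out.println(
                k + " count=" + t.count + " min=" + t.min + " max=" + t.max + " avg=" + t.average()));
    }
}
```

Carrying the count and sum rather than the average itself is what keeps the merge step correct when partial results arrive from several mappers.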

Inverted Indexes

source: MapReduce Design Patterns, Miner & Shook, O’Reilly

Counters

Record Count

Unique Instances

Summations

if (StringUtils.startsWithLetter(token)) {
    context.getCounter(WordsNature.STARTS_WITH_LETTER).increment(1);
}

Filter Patterns

Filters

Bloom Filters

Top Ten

Distinct

Filters

Narrowing Views

Tracking Event Threads

Distributed Grep

Data Cleansing

Simple Random Sampling

Low Scoring Data

Bloom Filters

Similar to other filters: check each record and decide whether to keep or remove it

Different: the filter is based on set membership

Each record is tested for membership in a predefined set of values

Effectively compares one list to another

Sometimes emits a false positive (never a false negative); often this is OK

Steps:

Train the filter on the list of values and store it in HDFS

Do the filtering
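The two steps (train, then filter) can be sketched in plain Java. This is an illustrative toy, not Hadoop's own Bloom filter class: the name SimpleBloomFilter, the bit-array size, and the double-hashing scheme are all assumptions chosen for brevity.

```java
import java.util.BitSet;

// Toy Bloom filter for illustration only; names and hashing are assumptions.
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    SimpleBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd; illustrative, not tuned
        return Math.floorMod(h1 + i * h2, size);
    }

    // Step 1, training: set every bit position for each known value.
    void add(String value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    // Step 2, filtering: false means "definitely not in the set";
    // true means "probably in the set" (false positives are possible).
    boolean mightContain(String value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter hotUsers = new SimpleBloomFilter(1024, 3);
        hotUsers.add("alice");
        hotUsers.add("bob");
        for (String user : new String[] {"alice", "carol", "bob"}) {
            if (hotUsers.mightContain(user)) {
                System.out.println("keep " + user);
            }
        }
    }
}
```

In a real job the trained filter would be serialized to HDFS and loaded by each mapper before the filtering pass; here both steps run in one process.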

Bloom Filters

source: MapReduce Design Patterns, Miner & Shook, O’Reilly

Top Ten

Distinct (De-dupe)

Several Methods:

HDFS & MapReduce Alone

HBase & HDFS

HDFS, MapReduce & Storage Controller

Streaming, HDFS & MapReduce

MapReduce with Blocking

Data Organization Patterns

Structured to Hierarchical

Partitioning

Binning

Total Ordering

Shuffling

Structured to Hierarchical

source: MapReduce Design Patterns, Miner & Shook, O’Reilly

Partitioning

Binning

Uses the MultipleOutputs class

Emits multiple distinct files

The mapper:

looks at each line

iterates through a list of criteria for each bin

If the record meets the criteria, it is sent to that bin

No combiner, partitioner, or reducer used

source: MapReduce Design Patterns, Miner & Shook, O’Reilly
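The mapper logic for binning can be sketched in plain Java, with in-memory lists standing in for the files that MultipleOutputs would write. The class name Binning and the keyword-based criteria are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class Binning {
    // Checks each line against every bin's criterion (here: "contains the keyword")
    // and copies the line into each bin whose criterion it meets.
    static Map<String, List<String>> bin(List<String> lines, List<String> keywords) {
        Map<String, List<String>> bins = new LinkedHashMap<>();
        for (String keyword : keywords) {
            bins.put(keyword, new ArrayList<>());
        }
        for (String line : lines) {
            for (String keyword : keywords) {
                if (line.contains(keyword)) {
                    bins.get(keyword).add(line); // a record may land in several bins
                }
            }
        }
        return bins;
    }

    public static void main(String[] args) {
        Map<String, List<String>> bins = bin(
                List.of("hadoop tips", "pig latin", "hadoop and pig", "hive"),
                List.of("hadoop", "pig"));
        bins.forEach((name, records) -> System.out.println(name + " -> " + records));
    }
}
```

Records that match no criterion are simply dropped, which mirrors a map-only job that writes nothing for them.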

Total Order Sort

Mapper extracts the sort key

Custom partitioner:

loads the partition file

takes data ranges from the file

decides which reducer to target

Reducer: identity reducer; the number of reducers must equal the number of partitions

$ hadoop fs -cat output/part-r-*
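A sketch of the range-partitioning idea in plain Java: split points play the role of the partition file, and concatenating the sorted partitions in order gives a total order, which is what reading the part-r-* files in sequence relies on. The class name RangePartitioner and the sample split points are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class RangePartitioner {
    // Sorted split points; n split points define n + 1 key ranges (partitions).
    private final String[] splitPoints;

    RangePartitioner(String[] sortedSplitPoints) {
        this.splitPoints = sortedSplitPoints.clone();
    }

    int numPartitions() {
        return splitPoints.length + 1;
    }

    // Decides which "reducer" a key goes to by binary search over the ranges.
    int getPartition(String key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Exact hit on split point i: the key starts range i + 1.
        // Miss: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Simulates the whole job: partition, sort within each partition
    // (the framework's sort), then read the partitions back in order.
    static List<String> sortViaPartitions(List<String> keys, String[] splitPoints) {
        RangePartitioner partitioner = new RangePartitioner(splitPoints);
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < partitioner.numPartitions(); i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partitioner.getPartition(key)).add(key);
        }
        List<String> out = new ArrayList<>();
        for (List<String> partition : partitions) {
            Collections.sort(partition);
            out.addAll(partition);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sortViaPartitions(
                List.of("pear", "apple", "kiwi", "zebra", "grape"),
                new String[] {"g", "p"}));
    }
}
```

The split points only affect how evenly the work is spread; any sorted set of split points still produces a correctly ordered result.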

Shuffling

Mapper outputs a random key for each key-value pair

Reducer sorts on these keys, which randomizes the records further

Use case: random sampling

Load-balances well
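The shuffling pattern can be sketched in plain Java as three steps: tag each record with a random key, sort by that key, then emit only the values. The class name Shuffling and the seed parameter are assumptions; the seed is there only to make the sketch repeatable.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

class Shuffling {
    static <T> List<T> shuffle(List<T> records, long seed) {
        Random random = new Random(seed);

        // "Map": pair every record with a random key.
        List<Map.Entry<Integer, T>> tagged = new ArrayList<>();
        for (T record : records) {
            tagged.add(new AbstractMap.SimpleEntry<>(random.nextInt(), record));
        }

        // "Shuffle and sort": ordering by a random key randomizes the records.
        tagged.sort(Map.Entry.comparingByKey());

        // "Reduce": drop the random key and emit the record alone.
        List<T> out = new ArrayList<>();
        for (Map.Entry<Integer, T> entry : tagged) {
            out.add(entry.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(List.of("r1", "r2", "r3", "r4", "r5"), 42L));
    }
}
```

Because the keys are uniformly random, a hash partitioner would spread them evenly across reducers, which is the load-balancing property noted above.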

Join Patterns

Reduce-Side Joins

Replicated Joins

Composite Joins

Cartesian Product

Review: Inner Join

Review: Outer Join

Review: Cartesian Product

Reduce-side Join

Replicated Join

Map-only

Mapper reads the join file from the distributed cache at startup and stores it in memory

source: MapReduce Design Patterns, Miner & Shook, O’Reilly
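A sketch of the replicated (map-side) join in plain Java: the small data set sits in a hash map, as it would after being loaded at mapper startup, and each record of the large data set is joined by a lookup. The class name ReplicatedJoin and the tab-separated record layout are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ReplicatedJoin {
    // small: join key -> attribute, held entirely in memory.
    // large: lines of the form "joinKey<TAB>payload", streamed one at a time.
    static List<String> innerJoin(Map<String, String> small, List<String> large) {
        List<String> out = new ArrayList<>();
        for (String line : large) {
            String joinKey = line.split("\t", 2)[0];
            String match = small.get(joinKey);
            if (match != null) {
                out.add(line + "\t" + match); // inner join: unmatched records are dropped
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> users = Map.of("1", "alice", "2", "bob");
        List<String> comments = List.of("1\thello", "3\torphan", "2\thi");
        innerJoin(users, comments).forEach(System.out::println);
    }
}
```

No shuffle or reduce phase is needed, which is why the pattern only works when the smaller data set fits in each mapper's memory.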

Composite Join

Map-only

Driver code handles most of the work

Hadoop does the rest

Cartesian Product

Map-only

Driver code handles most of the work

Simple mapper

Input/Output Patterns

Custom Input & Output

Generating Data

External Sources

Partition Pruning

MapReduce Input and Output

Custom Inputs

OutputFormats

FileOutputFormat<K,V>: superclass

TextOutputFormat<K,V>: default output format

SequenceFileOutputFormat<K,V>

MultipleOutputs<K,V>: sends to various destinations

NullOutputFormat<K,V>: null output

LazyOutputFormat<K,V>

Custom Output

Extend OutputFormat, usually via FileOutputFormat

Implement getRecordWriter(), returning a RecordWriter instance

Define write() in the RecordWriter class; it is invoked for each key-value pair

public void write(AccountKey key, Account value) {
    // One tab-separated line per key-value pair
    out.println(key.getAccountKeyId() + "\t" + value.getAccountNbr());
}

Class: BankRecordWriter

OutputFormat

RecordWriter

Generating Data

Map-Only

Good for generating sample data

MapReduce is a good tool to use

Seldom done

External Outputs

Partition Pruning

source: MapReduce Design Patterns, Miner & Shook, O’Reilly

MetaPatterns

Job Chaining

Chain Folding

Job Merging

End of Chapter

Lab
