MapReduces: Patterns for Process

source: MapReduce Design Patterns, Miner & Shook, O’Reilly

Agenda

Overview of all the MapReduce Design Patterns

MapReduce Design Patterns Overview

Deep Dive into the Following Patterns

Filtering Patterns

Join Patterns

Input and Output Patterns

Other Patterns Overview

Summarization Patterns

Data Organization Patterns

MetaPatterns

Comparison chart of when to use which design patterns

Best Practices

MapReduce Patterns

Filtering Patterns

Join Patterns

Meta Patterns

Input & Output Patterns

Numerical Summarizations

Inverted Indexes

Counters

Numerical Summarizations

Word Count

Record Counts

Min / Max

Average/Median/Std Deviation

Inverted Indexes

source: MapReduce Design Patterns, Miner & Shook, O’Reilly

Counters

Record Count

Unique Instances

Summations

if (StringUtils.startsWithLetter(token)) {
    // Increment a custom counter instead of emitting a record
    context.getCounter(WordsNature.STARTS_WITH_LETTER).increment(1);
}
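The counter idea can be tried without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: an EnumMap stands in for the job's counters, and `counters.merge(...)` plays the role of `context.getCounter(...).increment(1)`. The class name `CounterSketch` and the second enum constant `STARTS_WITH_OTHER` are assumptions made for illustration, and `Character.isLetter` is assumed to approximate the slide's `StringUtils.startsWithLetter` helper.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Stand-alone sketch of counter-style counting; no Hadoop dependency.
class CounterSketch {
    // Hypothetical counter group modeled on the WordsNature enum in the slide.
    enum WordsNature { STARTS_WITH_LETTER, STARTS_WITH_OTHER }

    // Tallies tokens into named counters, one increment per token.
    static Map<WordsNature, Long> count(List<String> tokens) {
        Map<WordsNature, Long> counters = new EnumMap<>(WordsNature.class);
        for (WordsNature name : WordsNature.values()) {
            counters.put(name, 0L);
        }
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // nothing to classify
            }
            WordsNature name = Character.isLetter(token.charAt(0))
                    ? WordsNature.STARTS_WITH_LETTER
                    : WordsNature.STARTS_WITH_OTHER;
            counters.merge(name, 1L, Long::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("apple", "42", "banana", "_x")));
    }
}
```

Nothing is emitted per record here; only the tallies survive, which is the point of using counters.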

Filter Patterns

Filters

Bloom Filters

Top Ten

Distinct

Filters

Narrowing Views

Tracking Event Threads

Distributed Grep

Data Cleansing

Simple Random Sampling

Low Scoring Data

Bloom Filters

Similar to other filters: check each record and decide whether to keep or remove it

Different: filters based on set membership

Set membership is evaluated as well

Compares one list to another

Sometimes emits a false positive; often this is OK

Steps:

Train the filter on the list of values and store it in HDFS

Do the filtering
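The two steps (train, then filter) can be sketched in plain Java. This is an illustrative stand-in, not Hadoop's own Bloom filter class: the name `BloomFilterSketch`, the bit-array size, and the double-hashing scheme (Arrays.hashCode combined with FNV-1a) are all assumptions chosen to keep the example short. Storing the trained filter in HDFS is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Minimal Bloom filter: no false negatives, occasional false positives.
class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // 32-bit FNV-1a hash, used as the second hash function.
    private static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    // Double hashing: the i-th bit index is (h1 + i * h2) mod numBits.
    private int index(byte[] data, int i) {
        int h1 = Arrays.hashCode(data);
        int h2 = fnv1a(data);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Step 1 (training): set the bits for one member of the list.
    void add(String value) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(data, i));
        }
    }

    // Membership test: false means definitely absent, true means probably present.
    boolean mightContain(String value) {
        byte[] data = value.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }

    // Step 2 (filtering): keep only records that pass the membership test.
    static List<String> filter(List<String> records, BloomFilterSketch trained) {
        List<String> kept = new ArrayList<>();
        for (String record : records) {
            if (trained.mightContain(record)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        BloomFilterSketch trained = new BloomFilterSketch(1024, 3);
        trained.add("alice");
        trained.add("bob");
        System.out.println(filter(List.of("alice", "carol", "bob"), trained));
    }
}
```

Every trained value is always kept. A value that was never trained may occasionally slip through, which is the false positive the slide mentions.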

Bloom Filters

Top Ten

Distinct (De-dupe)

Several Methods:

HDFS & MapReduce Alone

HBase & HDFS

HDFS, MapReduce & Storage Controller

Streaming, HDFS & MapReduce

MapReduce with Blocking

Data Organization Patterns

Structured to Hierarchical

Partitioning

Binning

Total Ordering

Shuffling

Structured to Hierarchical

Partitioning

Binning

Uses the MultipleOutputs class

Emits multiple distinct files

The mapper:

looks at each line

iterates through a list of criteria for each bin

If the record meets the criteria, it is sent to that bin

No combiner, partitioner, or reducer
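The mapper logic described above can be sketched in plain Java. This is not the MultipleOutputs API: an in-memory map of lists stands in for the per-bin output files, and the class name `BinningSketch`, the bin names, and the predicates are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Map-only binning: each record is tested against every bin's criterion.
class BinningSketch {
    static Map<String, List<String>> bin(List<String> records,
                                         Map<String, Predicate<String>> criteria) {
        // One output list per bin, standing in for one output file per bin.
        Map<String, List<String>> bins = new LinkedHashMap<>();
        for (String name : criteria.keySet()) {
            bins.put(name, new ArrayList<>());
        }
        for (String record : records) {
            for (Map.Entry<String, Predicate<String>> bin : criteria.entrySet()) {
                if (bin.getValue().test(record)) {
                    bins.get(bin.getKey()).add(record); // record meets this bin's criterion
                }
            }
        }
        return bins;
    }

    public static void main(String[] args) {
        Map<String, Predicate<String>> criteria = new LinkedHashMap<>();
        criteria.put("hadoop", r -> r.contains("hadoop"));
        criteria.put("pig", r -> r.contains("pig"));
        System.out.println(bin(List.of("hadoop tips", "pig latin", "hadoop and pig", "hive"), criteria));
    }
}
```

A record can land in more than one bin or in none, and no shuffle is involved, which matches the "no combiner, partitioner, or reducer" note.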

Total Order Sort

Mapper extracts the sort key

Custom partitioner:

loads the partition file

takes data ranges from the file

decides which reducer to target

Reducer: identity reducer; the number of reducers must equal the number of partitions

$ hadoop fs -cat output/part-r-*
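The partitioner's decision can be sketched in plain Java. This is a simplified stand-in for a range partitioner, assuming the partition file has already been read into a sorted array of split points; the class name `RangePartitionerSketch` and the sample split points are hypothetical.

```java
import java.util.Arrays;

// Range partitioner: n sorted split points define n + 1 key ranges (partitions).
class RangePartitionerSketch {
    private final String[] splitPoints;

    RangePartitionerSketch(String... splitPoints) {
        this.splitPoints = splitPoints.clone();
        Arrays.sort(this.splitPoints);
    }

    int numPartitions() {
        return splitPoints.length + 1;
    }

    // Keys below the first split point go to partition 0;
    // keys at or above split point i go to partition i + 1.
    int getPartition(String key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        RangePartitionerSketch partitioner = new RangePartitionerSketch("g", "p");
        for (String key : new String[] {"apple", "g", "hadoop", "zebra"}) {
            System.out.println(key + " -> reducer " + partitioner.getPartition(key));
        }
    }
}
```

Because every key in partition i sorts before every key in partition i + 1, concatenating the reducers' sorted output files in order yields a totally ordered result.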

Shuffling

Mapper just outputs a random key K for each K,V pair

Sorting on these random keys at the reducer randomizes the data further

Use case: random sampling

Load-balances well
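A single-process sketch of the same idea in plain Java: tag each record with a random key, sort by that key, and emit the records in the resulting order. The class name `ShuffleSketch` and the seed parameter (added only to make runs repeatable) are assumptions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Shuffling by random key: the "mapper" tags records, the sort does the rest.
class ShuffleSketch {
    static List<String> shuffle(List<String> records, long seed) {
        Random random = new Random(seed);
        List<Map.Entry<Integer, String>> keyed = new ArrayList<>();
        for (String record : records) {
            // Map phase: emit (random key, record).
            keyed.add(new AbstractMap.SimpleEntry<Integer, String>(random.nextInt(), record));
        }
        // Shuffle/sort phase: ordering by the random key scrambles the records.
        keyed.sort(Map.Entry.comparingByKey());
        List<String> shuffled = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : keyed) {
            shuffled.add(entry.getValue()); // Reduce phase: drop the key, keep the record.
        }
        return shuffled;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(List.of("a", "b", "c", "d", "e"), 7L));
    }
}
```

Taking the first n records of the shuffled output gives a simple random sample, and uniformly random keys spread the records evenly across reducers.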

Join Patterns

Reduce-Side Joins

Replicated Joins

Composite Joins

Cartesian Product

Review: Inner Join

Review: Outer Join

Review: Cartesian Product

Reduce-side Join

Replicated Join

Map-only

Mapper reads the join file from the cache at startup and stores it in memory
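The replicated join can be sketched in plain Java, with no Hadoop dependency. `setup()` stands in for the mapper's startup step that loads the small join file into memory, and `map()` stands in for the per-record map calls. The class name `ReplicatedJoinSketch`, the tab-separated record layout, and the inner-join behavior are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Map-side (replicated) join: the small data set lives in a HashMap.
class ReplicatedJoinSketch {
    private final Map<String, String> smallTable = new HashMap<>();

    // Startup: load "joinKey<TAB>value" lines from the small file into memory.
    void setup(List<String> joinFileLines) {
        for (String line : joinFileLines) {
            String[] parts = line.split("\t", 2);
            if (parts.length == 2) {
                smallTable.put(parts[0], parts[1]);
            }
        }
    }

    // Map: look up each large-side record's key; emit only matches (inner join).
    List<String> map(List<String> largeRecords) {
        List<String> joined = new ArrayList<>();
        for (String record : largeRecords) {
            String[] parts = record.split("\t", 2);
            if (parts.length < 2) {
                continue; // malformed record
            }
            String match = smallTable.get(parts[0]);
            if (match != null) {
                joined.add(parts[0] + "\t" + match + "\t" + parts[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        ReplicatedJoinSketch join = new ReplicatedJoinSketch();
        join.setup(List.of("1\tAlice", "2\tBob"));
        System.out.println(join.map(List.of("1\thello", "3\torphan", "2\tbye")));
    }
}
```

No shuffle or reduce phase is needed, but the approach only works when the smaller data set fits in each mapper's memory.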

Composite Join

Map-only

Driver code handles most of the work

Hadoop does the rest

Cartesian Product

Map-only

Driver code handles most of the work

Simple mapper

Input/Output Patterns

Custom Input & Output

Generating Data

External Sources

Partition Pruning

MapReduce Input and Output

Custom Inputs

OutputFormats

FileOutputFormat<K,V>: superclass

TextOutputFormat<K,V>: default output format

SequenceFileOutputFormat<K,V>

MultipleOutputs<K,V>: sends to various destinations

NullOutputFormat<K,V>: null output

LazyOutputFormat<K,V>

Custom Output

Extend OutputFormat (usually FileOutputFormat)

Implement getRecordWriter(), returning a RecordWriter instance

Define write() in the RecordWriter class; it is invoked for each key-value pair

public void write(AccountKey key, Account value) {
    // One tab-separated line per record: key ID, then account number
    out.println(key.getAccountKeyId() + "\t" + value.getAccountNbr());
}

Class: BankRecordWriter

OutputFormat

RecordWriter
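The write() logic can be exercised outside Hadoop. The sketch below is plain Java and deliberately does not extend Hadoop's RecordWriter; it only shows the per-record formatting. The field types and constructors of `AccountKey` and `Account`, and the `Writer` passed to `BankRecordWriter`, are assumptions, since the slides show only the getters.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical key type; only getAccountKeyId() appears in the slides.
class AccountKey {
    private final long accountKeyId;
    AccountKey(long accountKeyId) { this.accountKeyId = accountKeyId; }
    long getAccountKeyId() { return accountKeyId; }
}

// Hypothetical value type; only getAccountNbr() appears in the slides.
class Account {
    private final String accountNbr;
    Account(String accountNbr) { this.accountNbr = accountNbr; }
    String getAccountNbr() { return accountNbr; }
}

// Stand-in for a custom RecordWriter: one output line per key-value pair.
class BankRecordWriter {
    private final PrintWriter out;

    BankRecordWriter(Writer target) {
        this.out = new PrintWriter(target);
    }

    // Called once per key-value pair; writes "keyId<TAB>accountNbr".
    void write(AccountKey key, Account value) {
        out.println(key.getAccountKeyId() + "\t" + value.getAccountNbr());
    }

    // Flushes and closes the underlying writer.
    void close() {
        out.close();
    }

    public static void main(String[] args) {
        StringWriter sink = new StringWriter();
        BankRecordWriter writer = new BankRecordWriter(sink);
        writer.write(new AccountKey(42L), new Account("ACC-001"));
        writer.close();
        System.out.print(sink);
    }
}
```

In a real job, the OutputFormat would create the writer and open the output stream in the task's output directory; that wiring is omitted here.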

Generating Data

Map-Only

Good for generating sample data

MapReduce is a good tool to use

Seldom done

External Outputs

Partition Pruning

MetaPatterns

Job Chaining

Chain Folding

Job Merging

End of Chapter
