20130201 MapReduce Design Patterns



MapReduce Design Patterns

Will Shen 2013/02/01

Outline
Part I: MapReduce Basics
• Map and Reduce
• A WordCount example
• Open-source framework: Hadoop

Part II: MapReduce Design Patterns
• Summarization Patterns
• Filtering Patterns
• Data Organization Patterns
• Join Patterns
• Meta Patterns
• Input and Output Patterns

2

Reference: Donald Miner and Adam Shook, "MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems", O'Reilly Media, 1st edition, December 22, 2012, 230 pages.

Part I: MapReduce Basics

Motivation: Large-Scale Data Processing
• Process lots of data (>1 TB)
• Want to use hundreds of CPUs

MapReduce - Google (2005), US patent (2010)
• Automatic parallelization and distribution
• Fault tolerance
• I/O scheduling
• Status and monitoring

3

Google, “MapReduce: Simplified Data Processing on Large Clusters”, 2005/04/06

What is Map and Reduce?
Borrows from functional programming:
• Functional operations do not modify data structures; they create new ones
• Functional operations are stateless and have no side effects, so the order of operations does not matter

map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [(k3, v3)]

4

fun foo(li: int list) = sum(li) + mul(li) + length(li)

What is MapReduce

5

map: (k1, v1) → [(k2, v2)]

reduce: (k2, [v2]) → [(k3, v3)]

Parallel Execution
Bottleneck: the Reduce phase cannot start until the Map phase completes.

6

Big Picture of MapReduce
• Input Reader - divides the input into appropriately sized 'splits' (16 to 128 MB)
• Map - partitions the work across the data (computes part of the problem on each of several servers)
• Shuffle - groups together the values returned by the map function
• Reduce - processes the partitions (aggregates the partial results from all servers into a single result set)
• Output Writer - writes the output of the Reducer

7

Example – counting words in documents

8

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(output_key, AsString(result));

Open-source framework Apache Hadoop

Hadoop - http://hadoop.apache.org/
Hadoop is not only a Map/Reduce implementation!
• HDFS – distributed file system
• Pig – high-level query language (SQL-like)
• HBase – distributed column store
• Hive – Hadoop-based data warehouse
• ZooKeeper, Chukwa, Pipes/Streaming, …

9

How Hadoop runs a MapReduce Job

10

• Client submits the MapReduce job.
• JobTracker coordinates the job run.
• TaskTrackers run the tasks that the job has been split into.
• HDFS is used for sharing job files between the other entities.

WordCount Java Code in Hadoop

11
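The WordCount source is shown as an image in the original slide; as a stand-in, here is a minimal sketch of such a job against the org.apache.hadoop.mapreduce API (class names and argument handling are illustrative, not the slide's exact code).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map: (offset, line) -> [(word, 1)]
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, count)
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // aggregate partial counts on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}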

General Considerations
• Map execution order is not deterministic
• Map processing time cannot be predicted
• Reduce tasks cannot start before all Maps have finished (the dataset needs to be fully partitioned)
• Not suitable for continuous input streams
• There will be a spike in network utilization after the Map / before the Reduce phase
• Number & size of key/value pairs: object creation & serialisation overhead (Amdahl's law!)
• Aggregate partial results when possible! Use Combiners

12

Using MapReduce to Solve Problems
Map
• Word Count: texts → (word, 1)
• Inverted Index: documents → (word, doc_id)
• Max Temperature: formatted data → (year, temperature)
• Mean Rain Precipitation: daily data → (<year-month, lat, long>, precipitation)

Reduce applies a count, list, max, and average to the set of values for each key, respectively.
Reusable solutions?

13

What is a "Design Pattern"?
Design Pattern: a general, reusable solution to a commonly occurring problem within a given context in software design. (GoF)

14

Part II: MapReduce Design Patterns
1. Summarization: get a top-level view by summarizing and grouping data
2. Filtering: view data subsets such as records generated from one user
3. Data Organization: reorganize data to work with other systems, or to make MapReduce analysis easier
4. Join: analyze different datasets together to discover interesting relationships
5. Metapatterns: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
6. Input and Output: customize the way you use Hadoop to load or store data

15

23 patterns in total. Each pattern is a template for solving a common and general data manipulation problem with MapReduce.

The 23 Patterns of MapReduce Summarization • Numerical Summarizations • Inverted Index Summarizations • Counting with Counters

Filtering • Filtering • Bloom Filtering • Top Ten • Distinct

Data Organization • Structured to Hierarchical • Partitioning • Binning • Total Order Sorting • Shuffling

Join • Reduce Side Join • Replicated Join • Composite Join • Cartesian Product

Metapatterns • Job Chaining • Chain Folding • Job Merging

Input and Output • Generating Data • External Source Output • External Source Input • Partition Pruning

16

The End

Thanks for your attention. Any questions?

17

Pattern Template in this Book
• Name: a well-selected name for the pattern
• Intent: a quick problem description
• Motivation: why you would want to solve this problem, or where it would appear
• Applicability: a set of criteria that must be true to be able to apply this pattern to a problem
• Structure: the layout of the MapReduce job itself
• Consequences: the end goal of the output this pattern produces
• Resemblances: analogies of how this problem would be solved with other languages, like SQL and Pig
• Known Uses: some common use cases
• Performance Analysis: the performance profile of the analytic produced by the pattern

18

2.1 Summarization Patterns

Your data is large and vast, with more data coming into the system every day, e.g., web user logs.
• You want to produce a top-level, summarized view of the data.
• You can glean insights not available from looking at a localized set of records alone.

Patterns
• Numerical Summarizations
• Inverted Index Summarizations
• Counting with Counters

19

Numerical Summarizations 1/4
Intent - Group records together by a key field and calculate a numerical aggregate per group to get a top-level view of the larger data set.
Motivation
• Many data sets these days are too large for a human to get any real meaning out of them by reading through them manually, e.g., terabytes of website log files.
• Typical aggregates: minimum, maximum, average, median, and standard deviation.
Applicability
• You are dealing with numerical data or counting.
• The data can be grouped by specific fields.

20

Numerical Summarizations 2/4
Structure
• Mapper: outputs keys that consist of the fields to group by, and values consisting of any pertinent numerical items.
• Reducer: receives the set of numerical values (v1, v2, v3, …, vn) associated with a group-by key and applies the aggregation function λ. The value of λ is output with the given input key. (A Java sketch follows the SQL/Pig resemblances below.)

21

Numerical Summarizations 3/4
Consequences
• A set of part files containing a single record per reducer input group. Each record consists of the key and all aggregate values.
Known uses
• Word count, record count
• Min, max, count of a particular event
• Average, median, standard deviation
Resemblances
• SQL
• Pig

22

SELECT MIN(numericalcol1), MAX(numericalcol1), COUNT(*) FROM table GROUP BY groupcol2;

b = GROUP a BY groupcol2; c = FOREACH b GENERATE group, MIN(a.numericalcol1), MAX(a.numericalcol1), COUNT_STAR(a);
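Not from the book or the slides, but as a hedged illustration of the structure above: a mapper can emit (groupcol2, numericalcol1) pairs and a reducer can fold MIN, MAX, and COUNT over each group, mirroring the SQL and Pig resemblances (the comma-separated input layout is an assumption).

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: key = groupcol2, value = numericalcol1 (assumes comma-separated input lines)
public class MinMaxCountMapper extends Mapper<Object, Text, Text, DoubleWritable> {
  private final Text group = new Text();
  private final DoubleWritable number = new DoubleWritable();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] cols = value.toString().split(",");
    group.set(cols[1]);                         // groupcol2 (assumed column position)
    number.set(Double.parseDouble(cols[0]));    // numericalcol1 (assumed column position)
    context.write(group, number);
  }
}

// Reducer: applies MIN, MAX and COUNT to the values of each group
class MinMaxCountReducer extends Reducer<Text, DoubleWritable, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
      throws IOException, InterruptedException {
    double min = Double.MAX_VALUE;
    double max = -Double.MAX_VALUE;
    long count = 0;
    for (DoubleWritable v : values) {
      min = Math.min(min, v.get());
      max = Math.max(max, v.get());
      count++;
    }
    context.write(key, new Text(min + "\t" + max + "\t" + count));
  }
}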

Numerical Summarizations 4/4
Performance analysis
• Aggregations perform well when the combiner is properly used.
• Data skew of reduce groups: if there are many more intermediate key/value pairs with a specific key than with other keys, one reducer will have a lot more work to do than the others.

23

Inverted Index Summarizations 1/4 Intent - Generate an index from a data set to allow for faster searches.

24

(An inverted index stores a mapping from content to its locations.)

Inverted Index Summarizations 2/4
Motivation
• To index large data sets on keywords, so that searches can trace terms back to records that contain specific values.
• Improves the search performance of a search engine.
Applicability
• You require quick query responses.
• The results of such a query can be preprocessed and ingested into a database.

25

Inverted Index Summarizations 3/4

Structure

26
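The structure slide is a diagram in the original deck; purely as an illustration, a mapper could emit (keyword, document ID) pairs and a reducer could collect the unique IDs per keyword. The "docId<TAB>text" input layout is an assumption.

import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: documents -> (word, doc_id); assumes each line is "docId<TAB>document text"
public class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
  private final Text word = new Text();
  private final Text docId = new Text();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t", 2);
    if (parts.length != 2) return;              // skip malformed records
    docId.set(parts[0]);
    for (String token : parts[1].toLowerCase().split("\\W+")) {
      if (token.isEmpty()) continue;
      word.set(token);
      context.write(word, docId);
    }
  }
}

// Reducer: (word, [doc_id, ...]) -> (word, list of unique doc_ids)
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Set<String> ids = new LinkedHashSet<>();
    for (Text v : values) {
      ids.add(v.toString());
    }
    context.write(key, new Text(String.join(",", ids)));
  }
}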

Inverted Index Summarizations 4/4
Consequences
• "field value" -> [unique IDs of records]
Performance analysis
• Parsing the content in the Mapper is the most computationally expensive part.
• The cardinality of the index keys: more keys → more reducers → increased parallelism.
• The number of content identifiers per key: a hot key such as "the" means a few reducers will take much longer than the others, which may require a custom partitioner.

27

Counting with Counters 1/3
Intent
• An efficient means to retrieve count summarizations of large data sets.
Motivation
• A count or summation can tell you a lot about your data as a whole.
• Simply use the framework's counters → no reduce phase and no summation needed.
Applicability
• You want to gather counts or summations over large data sets.
• The number of counters you are going to create is small.

28

Counting with Counters 2/3
Structure (see the sketch below)
• Mapper: processes each input record and increments counters based on certain criteria.
• Counter: (a) incremented by one if counting a single instance; (b) incremented by some number if executing a summation.

29
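A minimal sketch of this structure, assuming a map-only job that counts records per country code; the counter group name and column layout are illustrative.

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: increments a counter per record instead of writing any output
public class CountryCounterMapper extends Mapper<Object, Text, NullWritable, NullWritable> {
  private static final String COUNTER_GROUP = "Records per country";  // illustrative group name

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] cols = value.toString().split(",");
    String country = cols.length > 2 ? cols[2] : "UNKNOWN";           // assumed column layout
    context.getCounter(COUNTER_GROUP, country).increment(1);          // no context.write(...) at all
  }
}

In the driver you would set job.setNumReduceTasks(0) and, after waitForCompletion(), read the results with job.getCounters().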

Counting with Counters 3/3
Consequences
• The final output is a set of counters grabbed from the job framework (no actual output files).
Known uses
• Count number of records (over a given time period)
• Count a small number of unique instances
• Counters can be used to sum fields of data together.
Performance analysis
• Using counters is very fast, as data is simply read in through the mapper and no output is written.
• Performance depends largely on the number of map tasks being executed and how much time it takes to process each record.

30

2.2 Filtering Patterns

To understand a smaller piece of data
• Find a subset of the data: a top-ten listing, the results of a de-duplication
• Sampling

Filtering Patterns:
• Filtering
• Bloom Filtering
• Top Ten
• Distinct

31

Filtering 1/4
Intent
• Filter out records that are not of interest.
Motivation
• Your data set is large and you want to take a subset of this data to focus in on it and perhaps do follow-on analysis.
Applicability
• The data can be parsed into "records" that can be categorized through some well-specified criterion determining whether they are to be kept.

32

Filtering 2/4

Structure • No “Reducer”

33

map(key, record):
  if we want to keep record then
    emit(key, value)
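A hedged Java version of this pseudocode, here filtering with a regular expression (a distributed grep); the "grep.pattern" configuration key is an assumed name set in the driver.

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only filter: emit a record only if it matches the configured pattern
public class GrepMapper extends Mapper<Object, Text, NullWritable, Text> {
  private Pattern pattern;

  @Override
  protected void setup(Context context) {
    // "grep.pattern" is an assumed configuration key set in the driver
    pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", ".*"));
  }

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    if (pattern.matcher(value.toString()).find()) {
      context.write(NullWritable.get(), value);   // keep the record as-is
    }
  }
}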

Filtering 3/4
Consequences
• A subset of the records that pass the selection criteria.
• If the format was kept the same, any job that ran over the larger data set should be able to run over this filtered data set as well.
Known uses
• Closer view of data
• Tracking a thread of events
• Distributed grep
• Data cleansing
• Simple random sampling
• Removing low-scoring data (if you can score your data)

34

Filtering 4/4
Resemblances
• SQL: SELECT * FROM table WHERE value < 3
• Pig: b = FILTER a BY value < 3;
Performance analysis
• No reducers: data never has to be transmitted between the map and reduce phases.
• Most of the map tasks pull data off of their locally attached disks and then write back out to that node.
• Both the sort phase and the reduce phase are cut out.

35

Bloom Filtering 1/5
Intent
• Filter such that we keep only records that are members of some predefined set of values (hot values).
Motivation
• To filter the records based on some sort of set membership operation against the hot values.
• The set membership is evaluated with a Bloom filter.

36

(Figure: a Bloom filter with m = 18 bits and k = 3 hash functions; w is not in the set {x, y, z}.)

Bloom Filtering 2/5
Applicability
• Data can be separated into records, as in filtering.
• A feature can be extracted from each record that could be in a set of hot values.
• There is a predetermined set of items for the hot values.
• Some false positives are acceptable (i.e., some records will get through when they should not have).

37

Bloom Filtering 3/5
Structure – training + actual filtering (see the sketch below)

38
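As an illustrative sketch of the filtering half (not the book's exact code): Hadoop ships a Bloom filter implementation in org.apache.hadoop.util.bloom, so a mapper can load a pre-trained filter from the distributed cache in setup() and test each record against it. The record layout, the cached filter file, and the training job that produced it are assumptions.

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

// Filtering step: keep only records whose user ID hits the pre-trained Bloom filter.
// The filter file is assumed to have been produced by a separate "training" job
// and shipped to every mapper via job.addCacheFile(...).
public class BloomFilteringMapper extends Mapper<Object, Text, NullWritable, Text> {
  private final BloomFilter filter = new BloomFilter();

  @Override
  protected void setup(Context context) throws IOException {
    URI[] cacheFiles = context.getCacheFiles();
    // The localized file name matches the last path component of the cached URI
    String localName = new Path(cacheFiles[0].getPath()).getName();
    try (DataInputStream in = new DataInputStream(new FileInputStream(localName))) {
      filter.readFields(in);                      // deserialize the trained filter
    }
  }

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String userId = value.toString().split(",")[0];   // assumed record layout
    if (filter.membershipTest(new Key(userId.getBytes()))) {
      context.write(NullWritable.get(), value);       // possible false positives, no false negatives
    }
  }
}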

Bloom Filtering 4/5
Consequences
• A subset of the records that passed the Bloom filter membership test.
• False positive records exist.
Known uses
• Removing most of the non-watched values
• Prefiltering a data set for an expensive set membership check

39

Bloom Filtering 5/5
Performance analysis
• Loading up the Bloom filter is not that expensive, since the file is relatively small.
• Checking a value against the Bloom filter is also a relatively cheap operation: O(1) hashing.

40

Top Ten 1/4
Intent
• Retrieve a relatively small number of top K records, according to a ranking scheme in your data set, no matter how large the data.
Motivation
• Finding the records that are typically the most interesting.
• To find the best records for a specific criterion.
Applicability
• It is possible to compare one record to another to determine which is "larger".
• The number of output records should be significantly fewer than the number of input records; otherwise a total ordering of the data set makes more sense.

41

Top Ten 2/4
Structure (see the sketch below)
• Mapper: finds its local top K.
• (Only one) Reducer: receives the K*M candidate records and finds the final top K.

42
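A hedged sketch of this structure: each mapper keeps only its local top 10 in a TreeMap and flushes it in cleanup(); a single reducer (job.setNumReduceTasks(1)) merges the K*M candidates. The numeric score in the second comma-separated column is an assumption, and records with tied scores overwrite each other in this simplified version.

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: keeps only its local top 10 (by an assumed numeric score in column 1)
public class TopTenMapper extends Mapper<Object, Text, NullWritable, Text> {
  private final TreeMap<Long, Text> topTen = new TreeMap<>();

  @Override
  protected void map(Object key, Text value, Context context) {
    long score = Long.parseLong(value.toString().split(",")[1]);   // assumed layout
    topTen.put(score, new Text(value));
    if (topTen.size() > 10) {
      topTen.remove(topTen.firstKey());       // drop the current smallest
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    for (Text record : topTen.values()) {
      context.write(NullWritable.get(), record);   // at most 10 records per mapper
    }
  }
}

// Single reducer: merges the K*M candidates and emits the global top 10
class TopTenReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
  @Override
  protected void reduce(NullWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    TreeMap<Long, Text> topTen = new TreeMap<>();
    for (Text value : values) {
      long score = Long.parseLong(value.toString().split(",")[1]);
      topTen.put(score, new Text(value));
      if (topTen.size() > 10) {
        topTen.remove(topTen.firstKey());
      }
    }
    for (Text record : topTen.descendingMap().values()) {
      context.write(NullWritable.get(), record);
    }
  }
}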

Top Ten 3/4
Consequences
• The top K records are returned.
Known uses
• Outlier analysis
• Selecting interesting data (the most valuable data)
• Catchy dashboards
Resemblances
• SQL: SELECT * FROM table ORDER BY col4 DESC LIMIT 10;
• Pig: B = ORDER A BY col4 DESC; C = LIMIT B 10;

43

Top Ten 4/4
Performance analysis – a single Reducer
• How many records (K*M) is the reducer getting?
• The sort can become an expensive operation when it has too many records and has to do most of the sorting on local disk instead of in memory.
• The reducer host will receive a lot of data over the network → a network resource hot spot.
• Naturally, scanning through all the data in the reducer will take a long time if there are many records to look through.
• Any sort of memory growth in the reducer has the possibility of blowing through the Java virtual machine's memory.
• Writes to the output file are not parallelized.

44

Distinct 1/4
Intent
• To find a unique set of values from similar records.
Motivation
• Reducing a data set to a unique set of values has several uses.
Applicability
• You have duplicate values in the data set; it is silly to use this pattern otherwise.

45

Distinct 2/4
Structure
• It exploits MapReduce's ability to group keys together to remove duplicates.
• The Mapper transforms the data; not much work is done in the reducer.
• Duplicate records are often located close to one another in a data set, so a combiner will deduplicate them in the map phase.
• The Reducer groups the nulls together by key, so we'll have one null per key → simply output the key.

46

map(key, record):
  emit(record, null)

reduce(key, records):
  emit(key);
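The same pseudocode against the Hadoop API, as a hedged sketch; note the reducer can also be registered as the combiner, which is how the map-phase deduplication mentioned above happens.

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: the whole record becomes the key, the value is null
public class DistinctMapper extends Mapper<Object, Text, Text, NullWritable> {
  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(value, NullWritable.get());
  }
}

// Reducer: the framework has already grouped identical records; emit each key once.
// The same class can be set as the combiner via job.setCombinerClass(...).
class DistinctReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
  @Override
  protected void reduce(Text key, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    context.write(key, NullWritable.get());
  }
}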

Distinct 3/4
Consequences
• The output records are guaranteed to be unique, but no order has been preserved, due to the random partitioning of the records.
Known uses
• Deduplicating data
• Getting distinct values
• Protecting from an inner join explosion
Resemblances
• SQL: SELECT DISTINCT * FROM table;
• Pig: b = DISTINCT a;

47

Distinct 4/4
Performance analysis
• Determine the number of reducers you think you will need.
• Basically, if duplicates are very rare within an input split, pretty much all of the data is going to be sent to the reduce phase.

48

2.3 Data Organization patterns

The value of individual records is often multiplied by the way they are partitioned, sharded, or sorted; this is especially true in distributed systems.
Patterns:
• Structured to Hierarchical
• Partitioning
• Binning
• Total Order Sorting
• Shuffling

49

Structured to Hierarchical 1/3
Intent
• Transform your row-based data to a hierarchical format (JSON or XML).
Motivation
• Migrating data from an RDBMS to Hadoop → table join.
• Reformatting your data into a more conducive structure.
Applicability
• You have data sources that are linked by some set of foreign keys.
• Your data is structured and row-based.

50

(Figure: a hierarchy of Posts, each with nested Comments.)

Structured to Hierarchical 2/3
Structure
• Mapper: loads the data and parses the records into one cohesive format.
• A Combiner isn't going to help.
• Reducer: builds the hierarchical data structure from the list of data items.

51

Structured to Hierarchical 3/3
Consequences
• The output will be in a hierarchical form, grouped by the key that you specified.
Known uses
• Pre-joining data
• Preparing data for HBase or MongoDB
Performance analysis
• How much data is being sent to the reducers from the mappers?
• The memory footprint of the object that the reducer builds - what about a post that has a million comments?

52

Partitioning 1/3
Intent
• Move the records into categories; the order of records does not matter.
• Take similar records in a data set and partition them into distinct, smaller data sets.
Motivation
• If you want to look at a particular set of data, but the data items are spread out across the entire data set, finding them requires a scan of all of the data.
Applicability
• You know how many partitions you are going to have ahead of time, e.g., partitioning by day of the week → 7 partitions.

53

Partitioning 2/3
Structure - a custom partitioner determines which partition a record will go to (see the sketch below)

54
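The structure slide is a diagram in the original; as an illustration only, a custom Partitioner that routes date-keyed records to one partition per day of the week might look like this. The driver would call job.setPartitionerClass(DayOfWeekPartitioner.class) and job.setNumReduceTasks(7); the ISO date key is an assumption.

import java.time.LocalDate;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: records keyed by an ISO date string go to one
// partition per day of the week (7 reducers -> 7 output partitions)
public class DayOfWeekPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    LocalDate date = LocalDate.parse(key.toString());   // e.g. "2013-02-01"
    // DayOfWeek.getValue() is 1..7, so map it onto 0..numPartitions-1
    return (date.getDayOfWeek().getValue() - 1) % numPartitions;
  }
}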

Partitioning 3/3
Known uses
• Partition pruning by continuous value (e.g., date)
• Partition pruning by category: country, phone area code, language
• Sharding (to different disks)
Performance analysis
• The resulting partitions will likely not have similar numbers of records; perhaps one partition holds 50% of the data.
• If implemented naively, all of this data will get sent to one reducer and will slow down processing significantly.

55

Binning 1/3
Intent
• For each record in the data set, file each one into one or more categories.
Motivation
• Binning is very similar to partitioning and can often be used to solve the same problem.
• Binning splits data up in the map phase instead of in the partitioner.
• Each mapper will now have one file per possible output bin: 1,000 bins x 1,000 mappers = 1,000,000 files.

56

Binning 2/3
Structure
• Mapper: if the record meets the criteria for a bin, it is sent to that bin.
• No combiner, partitioner, or reducer is used in this pattern.

57

Binning 3/3
Consequences
• Each mapper outputs one small file per bin.
Resemblances
• Pig (the SPLIT statement below; a Java sketch follows it)
Performance analysis
• Map-only job → efficient processing of records
• No sort, shuffle, or reduce to be performed
• Most of the processing is going to be done on data that is local.

58

SPLIT data INTO eights IF col1 == 8, bigs IF col1 > 8, smalls IF (col1 < 8 AND col1 > 0);
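A hedged Java counterpart to the SPLIT statement above, using MultipleOutputs in a map-only job; the driver is assumed to call MultipleOutputs.addNamedOutput(job, "bins", TextOutputFormat.class, Text.class, NullWritable.class) and job.setNumReduceTasks(0).

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Map-only binning: write each record into "eights", "bigs" or "smalls" files,
// mirroring the Pig SPLIT above. Assumes col1 is the first comma-separated field.
public class BinningMapper extends Mapper<Object, Text, Text, NullWritable> {
  private MultipleOutputs<Text, NullWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    int col1 = Integer.parseInt(value.toString().split(",")[0]);
    if (col1 == 8) {
      mos.write("bins", value, NullWritable.get(), "eights/part");
    } else if (col1 > 8) {
      mos.write("bins", value, NullWritable.get(), "bigs/part");
    } else if (col1 > 0) {
      mos.write("bins", value, NullWritable.get(), "smalls/part");
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();   // flush all bin files
  }
}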

Total Order Sorting 1/3
Intent
• Sort your data in parallel on a sort key.
Motivation
• Each Reducer will sort its data by key, but that is not a global sort across all data.
• Sorting in parallel is not easy.
Applicability
• Your sort key has to be comparable so the data can be ordered.

59

Total Order Sorting 2/3
Structure (see the driver sketch below)
• Analyze phase - determines the ranges
  • Idea: partitions that evenly split a random sample should evenly split the larger data set as well.
  • The Mapper does a random sampling; given the number of records in the total data set, decide the percentage of records you'll need to analyze.
  • Only one reducer collects the sort keys together into a sorted list; the list of keys is then sliced into the data range boundaries.
• Order phase - actually sorts the data.
  • # of Reducers == # of Partitions
  • A custom partitioner loads up the partition file's data ranges.

60
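Hadoop ships the two pieces this pattern needs: InputSampler for the analyze phase and TotalOrderPartitioner for the order phase. The driver sketch below assumes the data has already been prepared as a SequenceFile whose key is the Text sort key; the sampling parameters, paths, and reducer count are illustrative.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Order phase driver: identity map + identity reduce; the framework's shuffle/sort
// plus TotalOrderPartitioner produce globally sorted output across all reducers.
public class TotalOrderSortDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "total order sort");
    job.setJarByClass(TotalOrderSortDriver.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(Mapper.class);        // identity map
    job.setReducerClass(Reducer.class);      // identity reduce
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(4);                // # of reducers == # of partitions

    // Analyze phase: sample 0.1% of the keys (at most 10,000, from at most 10 splits)
    // and write the partition boundary file used by the partitioner.
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path(args[1] + "_partitions"));
    InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<Text, Text>(0.001, 10000, 10));
    job.addCacheFile(new URI(TotalOrderPartitioner.getPartitionFile(job.getConfiguration())));

    job.setPartitionerClass(TotalOrderPartitioner.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}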

Total Order Sorting 3/3
Consequences
• The output files will contain sorted data.
Resemblances
• SQL: SELECT * FROM data ORDER BY col1;
• Pig: c = ORDER b BY col1;
Performance analysis
• Expensive! It loads and parses the data twice:
  • Step 1. Build the partition ranges.
  • Step 2. Actually sort the data.

61

Shuffling 1/3
Intent
• To completely randomize a set of records.
Motivation
• Shuffling for 綺夢
• Shuffling for anonymizing the data.
• Shuffling for repeatable random sampling.

62

Shuffling 2/3
Structure (see the sketch below)
• Mappers emit [random key, record].
• The Reducer sorts the random keys → randomizing the data.
Consequences
• Each reducer outputs a file containing random records.
Resemblances
• SQL: SELECT * FROM data ORDER BY RAND();
• Pig: c = GROUP b BY RANDOM(); d = FOREACH c GENERATE FLATTEN(b);

63
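A minimal sketch of this structure (an illustration, not the book's exact code): the mapper tags each record with a random key, and the reducer throws the key away.

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: tag every record with a random key so the shuffle scatters records evenly
public class ShuffleMapper extends Mapper<Object, Text, IntWritable, Text> {
  private final Random random = new Random();
  private final IntWritable randomKey = new IntWritable();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    randomKey.set(random.nextInt());
    context.write(randomKey, value);
  }
}

// Reducer: drop the random key and write the records back out in their new order
class ShuffleReducer extends Reducer<IntWritable, Text, Text, NullWritable> {
  @Override
  protected void reduce(IntWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(value, NullWritable.get());
    }
  }
}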

Shuffling 3/3
Performance analysis
• Nice performance properties.
• Data distribution across reducers is completely balanced.
• With more reducers, the data will be more spread out.
• The size of the files will also be very predictable: each is the size of the data set divided by the number of reducers. This makes it easy to get a specific desired file size as output.

64

2.4 Join Patterns

Refresher on RDBMS joins
• Inner Join
• Outer Join
• Cartesian Product
• Anti Join = full outer join - inner join
Patterns
• Reduce Side Join
• Replicated Join
• Composite Join
• Cartesian Product

65

An SQL query walks into a bar, sees two tables and asks them “May I join you?”

Reduce Side Join 1/3
Intent
• Join multiple large data sets together by some foreign key.
Motivation
• Simple to implement in the Reducer.
• Supports all the different join operations.
• No limitation on the size of your data sets.
Applicability
• Multiple large data sets are being joined by a foreign key.
• You want the flexibility of being able to execute any join operation.
• A large amount of network bandwidth is available.

66

Reduce Side Join 2/3
Structure (see the sketch below)
• Mapper: prepares [(foreign key, tagged record)].
• Reducer: performs the join operation.

67
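A hedged sketch of an inner reduce-side join for the users/comments example in the SQL resemblance below: one mapper per data set tags its records (wired up with MultipleInputs.addInputPath in the driver), and the reducer pairs the two lists per foreign key. Column positions are assumptions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One mapper per input data set; each tags its records so the reducer can tell the sides apart.
public class UserJoinMapper extends Mapper<Object, Text, Text, Text> {
  private final Text outKey = new Text();
  private final Text outValue = new Text();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] cols = value.toString().split(",");
    outKey.set(cols[0]);                  // users.ID (assumed layout)
    outValue.set("U" + value);            // tag: record comes from the users data set
    context.write(outKey, outValue);
  }
}

// A CommentJoinMapper would do the same with comments.UserID and a "C" tag.

// Reducer: performs an inner join between the two tagged lists per foreign key.
class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> users = new ArrayList<>();
    List<String> comments = new ArrayList<>();
    for (Text value : values) {
      String record = value.toString();
      if (record.charAt(0) == 'U') users.add(record.substring(1));
      else comments.add(record.substring(1));
    }
    for (String user : users) {           // inner join: emit every pairing
      for (String comment : comments) {
        context.write(new Text(user), new Text(comment));
      }
    }
  }
}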

Reduce Side Join 3/3
Consequences
• # of part files == # of reduce tasks.
• Each part file contains its portion of the joined records.
Resemblances
• SQL (below)
Performance analysis
• The cluster's network bandwidth matters!
• Use relatively more reducers than you would for a typical analytic.

68

SELECT users.ID, users.Location, comments.upVotes FROM users [INNER|LEFT|RIGHT] JOIN comments ON users.ID=comments.UserID

Replicated Join 1/3
Intent
• Eliminate the need to shuffle any data to the reduce phase.
Motivation
• All the data sets except the very large one are read into memory during the setup phase of each map task, which is limited by the JVM heap.
Applicability
• All of the data sets, except for the large one, can fit into the main memory of each map task.

69

Replicated Join 2/3
Structure (see the sketch below)
• Map-only pattern.
• Read all the small files from the distributed cache and store them in in-memory lookup tables.

70
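A hedged sketch of this structure: the small users file is distributed with job.addCacheFile(...), loaded into a HashMap during setup(), and probed for every record of the large comments data set. File and column layouts are assumptions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only replicated join: the small "users" data set is shipped to every map task
// via the distributed cache and loaded into an in-memory lookup table in setup().
public class ReplicatedJoinMapper extends Mapper<Object, Text, Text, Text> {
  private final Map<String, String> userById = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    URI[] cacheFiles = context.getCacheFiles();
    String localName = new Path(cacheFiles[0].getPath()).getName();  // localized file name
    try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] cols = line.split(",");          // assumed layout: userId,location,...
        userById.put(cols[0], line);
      }
    }
  }

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] cols = value.toString().split(",");  // large "comments" data set
    String user = userById.get(cols[1]);          // comments.UserID (assumed column)
    if (user != null) {                           // inner join; emit unmatched too for a left outer join
      context.write(new Text(user), value);
    }
  }
}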

Replicated Join 3/3
Consequences
• # of part files == # of map tasks.
• The part files contain the full set of joined records.
Performance analysis
• A replicated join can be the fastest type of join executed because no reducer is required.
• Limited by the amount of data that can be stored safely inside the JVM heap.

71

Composite Join 1/4
Intent
• Performed on the map side with many very large formatted inputs.
• Completely eliminates the need to shuffle and sort all the data to the reduce phase.
• Requires the data to be already organized or prepared in a very specific way.
Motivation
• Particularly useful if you want to join very large data sets together.
• The data sets must first be sorted by foreign key, partitioned by foreign key, and read in a very particular manner.

Composite Join 2/4
Applicability
• An inner or full outer join is desired.
• All the data sets are sufficiently large.
• All data sets can be read with the foreign key as the input key to the mapper.
• All data sets have the same number of partitions.
• Each partition is sorted by foreign key, and all the foreign keys reside in the associated partition of each data set.
• The data sets do not change often (since they have to be prepared).

73

Composite Join 3/4
Structure
• Map-only.
• The Mapper is very trivial: two values are retrieved from the input tuple and output to the file system.

74

Composite Join 4/4
Consequences
• Output # of part files == # of map tasks.
Performance analysis
• Can be executed relatively quickly over large data sets.
• Data preparation = sorting cost.
• The cost of producing these prepared data sets is amortized over all of the runs.

75

Cartesian Product 1/3
Intent
• Pair up and compare every single record with every other record in a data set.
Motivation
• Simply pairs every record of a data set with every record of all the other data sets.
• To analyze relationships between one or more data sets.
Applicability
• You want to analyze relationships between all pairs of individual records.
• You've exhausted all other means to solve this problem.
• You have no time constraints on execution time.

76

Cartesian Product 2/3
Structure
• Map-only.
• A custom input format / RecordReader pairs up the input splits and feeds the cross product to the mappers.

77

Cartesian Product 3/3
Consequences
• The final data set is made up of tuples whose arity is equivalent to the number of input data sets.
• Every possible tuple combination from the input records is represented in the final output.
Resemblances
• SQL: SELECT * FROM tableA, tableB;
Performance Analysis
• A massive explosion in data size: O(n^2).
• If a single left input split contains a thousand records, the right input split needs to be read a thousand times before the task can finish.
• If a single task fails for an odd reason, the whole thing needs to be restarted.

78

2.5 Metapatterns (skipped)

Patterns about using patterns
• Job Chaining - piecing together several patterns to solve complex, multistage problems
• Chain Folding
• Job Merging - an optimization for performing several analytics in the same MapReduce job

79

2.6 Input and Output Patterns
Customizing Input and Output in Hadoop - how data on disk is loaded and written
• Configuring how contiguous chunks of input are generated from blocks in HDFS
• Configuring how records appear in the map phase
• RecordReader and InputFormat classes
• RecordWriter and OutputFormat classes
Patterns
• Generating Data
• External Source Output
• External Source Input
• Partition Pruning

80

Generating Data 1/3
Intent
• You want to generate a lot of data from scratch.
Motivation
• This pattern doesn't load data; it generates the data and stores it back in the distributed file system.

81

Generating Data 2/3

Structure • map-only

82

Generating Data 3/3
Consequences
• Each mapper outputs a file containing random data.
Performance analysis
• How many worker map tasks are needed to generate the data?
• In general, the more map tasks you have, the faster you can generate data.

83

External Source Output 1/3
Intent
• Write MapReduce output to a nonnative location (outside of Hadoop and HDFS).
Motivation
• To output data from the MapReduce framework directly to an external source.
• This is extremely useful for direct loading into a system, instead of staging the data to be delivered to the external source.

84

External Source Output 2/3

Structure

85

External Source Output 3/3
Consequences
• The output data has been sent to the external source, and that external source has loaded it successfully.
Performance analysis
• The receiver of the data must be able to handle the parallel connections.
• Having a thousand tasks writing to a single SQL database is not going to work well.

86

External Source Input 1/3
Intent
• You want to load data in parallel from a source that is not part of your MapReduce framework.
Motivation
• The typical model for using MapReduce to analyze data is to store it in HDFS first.
• With this pattern, you can hook the MapReduce framework up to an external source, such as a database or a web service, and pull the data directly into the mappers.

87

External Source Input 2/3

Structure

88

External Source Input 3/3
Consequences
• Data is loaded from the external source into the MapReduce job.
• The map phase doesn't care where that data came from.
Performance analysis
• Bottleneck: the source or the network.
• The source may not scale well with multiple connections (e.g., a single-threaded SQL database).
• If the source is not on the cluster's network, the connections may be reaching out over a single, slower public network connection.

89

Partition Pruning 1/3
Intent
• You have a set of data that is partitioned by a predetermined value, which you can use to dynamically load only the data requested by the application.
Motivation
• Loading all of the files is a large waste of processing time.
• By partitioning the data by a common value, you can avoid significant amounts of processing time by looking only where the data would exist.

90

Partition Pruning 2/3

Structure

91

Partition Pruning 3/3
Consequences
• Partition pruning changes only the amount of data that is read by the MapReduce job, not the eventual outcome of the analytic.
Performance analysis
• Utilizing this pattern can provide massive gains by reducing the number of tasks that would not have generated any output anyway.
• Outside of the I/O, performance depends on the other pattern being applied in the map and reduce phases of the job.

92

The End (Finally…)

Thanks for your attention.
• MapReduce has proven to be a useful abstraction.
• It greatly simplifies large-scale computations.
• Hadoop is widely used.
• Focus on your problems; let MapReduce deal with the messy details.
Any questions?

93