Homework 2 (slide transcript)
Date posted: 21-Dec-2015
Page 1: Homework 2

Homework 2

- In the docs folder of your Berkeley DB, have a careful look at the documentation on how to configure BDB in main memory.
- A new description of Homework 2 is posted.
- If you do not see the memory limitation pointed out in Part 1, change your record size to be variable length: [1 + (id % 10)] * 1024.
- When you have debugged your program using a small memory size, scale up to large memory sizes, e.g., 512 MB. Play around with the system.
- It is OK if your observations do not correspond to those stipulated by Homework 2. Simply state your observation and provide a zipped version of your software.
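As a quick sketch, the variable-length record size formula above can be computed as follows (the function name is illustrative; the formula is from the slide):

```python
def record_size(record_id: int) -> int:
    """Variable-length record size in bytes: [1 + (id % 10)] * 1024."""
    return (1 + (record_id % 10)) * 1024

# Sizes cycle from 1 KB (id % 10 == 0) up to 10 KB (id % 10 == 9).
print([record_size(i) for i in range(5)])  # [1024, 2048, 3072, 4096, 5120]
```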

Page 2

MapReduce Execution

- Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
- Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a hash function: hash(key) mod R.
- R and the partitioning function are specified by the programmer.
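A minimal sketch of the hash-partitioning scheme above (the function name is hypothetical; `zlib.crc32` stands in for the framework's hash function so the result is stable across runs):

```python
import zlib

def partition(key: str, R: int) -> int:
    """Assign an intermediate key to one of R reduce partitions: hash(key) mod R."""
    return zlib.crc32(key.encode()) % R

# Every occurrence of the same key lands in the same partition,
# no matter which map task emitted it.
R = 4
print(partition("Jim", R) == partition("Jim", R))  # True
```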

Page 3

MapReduce

Any questions?

Page 4

Question 1

- The master takes the location of input files (GFS) and their replicas into account. It strives to schedule a map task on a machine that contains a replica of the corresponding input file (or near it). This minimizes contention for the network bandwidth.
- Can the reduce tasks be scheduled on the same nodes that hold the intermediate data on their local disks, to further reduce network traffic?

Page 5

Answer to Question 1

- Probably not, because every Map task will have some data for each Reduce task.
- A Map task produces R output files, each to be consumed by one Reduce task.
- If there is 1 Map task and 10 Reduce tasks, then the 1 Map task produces 10 output files. Each file corresponds to one partition of the intermediate key/value pairs, e.g., intermediate key % R.
- If there are 200 Map tasks and 10 Reduce tasks, the Map phase produces 2000 files (10 files produced by each Map task). Each Reduce task processes the 200 files (produced by 200 different Map tasks) that map to the same partition, e.g., intermediate key % R.
- The master may assign a Reduce task to one node; it must pick the node of one of the 200 Map tasks.
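The file counts above follow from a simple product, since each of the M map tasks writes one file per reduce partition; a quick sketch (function name is illustrative):

```python
def map_output_files(M: int, R: int) -> int:
    """Total intermediate files: each of the M map tasks writes R partition files."""
    return M * R

print(map_output_files(1, 10))    # 10
print(map_output_files(200, 10))  # 2000
```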

Page 6

Question 1.a

- Probably not, because every Map task will have some data for each Reduce task.
- A Map task produces R output files, each to be consumed by one Reduce task.
- If there is 1 Map task and 10 Reduce tasks, then the 1 Map task produces 10 output files. Each file corresponds to one partition of the intermediate key/value pairs, e.g., intermediate key % R.
- If there are 200 Map tasks and 10 Reduce tasks, the Map phase produces 2000 files (10 files produced by each Map task). Each Reduce task processes the 200 files (produced by 200 different Map tasks) that map to the same partition, e.g., intermediate key % R.
- The master may assign a Reduce task to one node; it must pick the node of one of the 200 Map tasks.
- What if there are 200 Map tasks and 200 Reduce tasks?

Page 7

Answer to Question 1.a

- Probably not, because every Map task will have some data for each Reduce task.
- A Map task produces R output files, each to be consumed by one Reduce task.
- If there is 1 Map task and 10 Reduce tasks, then the 1 Map task produces 10 output files. Each file corresponds to one partition of the intermediate key/value pairs, e.g., intermediate key % R.
- If there are 200 Map tasks and 10 Reduce tasks, the Map phase produces 2000 files (10 files produced by each Map task). Each Reduce task processes the 200 files (produced by 200 different Map tasks) that map to the same partition, e.g., intermediate key % R.
- The master may assign a Reduce task to one node; it must pick the node of one of the 200 Map tasks.
- What if there are 200 Map tasks and 200 Reduce tasks?
  - There will be a total of 40,000 files to process.
  - Each Reduce task must retrieve 200 different files from 200 different Map tasks.
  - Scheduling a Reduce task on one node requires transmission of 199 other files across the network.
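The last point above generalizes: co-locating a reduce task with one map task saves only 1 of M input files. A rough sketch, assuming all map outputs are equal in size (function name is illustrative):

```python
def remote_fraction(M: int) -> float:
    """Fraction of a reduce task's input that must still cross the network
    when the task is co-located with one of the M map tasks."""
    return (M - 1) / M

# With 200 map tasks, 199 of a reduce task's 200 input files are remote.
print(remote_fraction(200))  # 0.995
```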

Page 8

Question 2

Given R reduce tasks, once reduce task ri is assigned to a worker, all partitioned intermediate key values that map to ri MUST be sent to this worker. Why?

Page 9

Question 2

- Given R Reduce tasks, once Reduce task ri is assigned to a worker, all partitioned intermediate key values logically assigned to ri MUST be sent to this worker. Why?
- Reduce task ri does aggregation and must have all instances of the intermediate keys produced by different Map tasks.
- In our example, [“Jim”, “1 1 1”] produced by five different map tasks must be directed to the same reduce task so that it computes [“Jim”, “15”] as its output.
- If directed to five different reduce tasks, each reduce task will produce [“Jim”, “3”] and there is no mechanism to merge them together!
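The word-count example above can be sketched in a few lines (a hypothetical reducer, not the paper's actual code), showing the correct merged total versus the unmergeable split:

```python
def reduce_count(key, values):
    """Word-count reducer: sum all partial counts for one key."""
    return key, sum(int(v) for v in values)

# Five map tasks each emit ("Jim", "1 1 1").
partials = ["1 1 1"] * 5

# One reduce task sees all 15 ones -> correct total.
all_values = " ".join(partials).split()
print(reduce_count("Jim", all_values))  # ('Jim', 15)

# Split across five reduce tasks, each sees only 3 ones -> five
# separate ('Jim', 3) outputs that nothing merges.
print([reduce_count("Jim", p.split()) for p in partials])
```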

Page 10

Question 3

Are the renaming operations at the end of a Reduce task protected by locks? Is it possible for a file to become corrupted if two threads attempt to rename it to the same name at essentially the same time? Or does the rename operation happen so fast that the chances of this happening are very remote?

Page 11

Question 3

- Are the renaming operations at the end of a Reduce task protected by locks? Is it possible for a file to become corrupted if two threads attempt to rename it to the same name at essentially the same time? Or does the rename operation happen so fast that the chances of this happening are very remote?
- The rename operations are performed on two different files, produced by different Reduce tasks that performed the same computation.
- A file produced by a Reduce task corresponds to a range, i.e., a tablet of Bigtable.
- To update the meta-data, the tablet server must update the meta-data on Chubby.
- There is one instance of this meta-data.
- Chubby serializes the rename operations.
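A toy sketch of the serialization idea above. This is not Chubby's or GFS's actual API; a single in-process lock stands in for the one serialized meta-data record, so only the first of two duplicate reduce tasks commits its output file:

```python
import threading

class MetaDataStore:
    """Stands in for the single meta-data record: all renames are serialized."""
    def __init__(self):
        self._lock = threading.Lock()
        self.committed = {}  # final name -> temp file that won

    def commit_rename(self, temp_name: str, final_name: str) -> bool:
        with self._lock:  # serializes concurrent rename attempts
            if final_name in self.committed:
                return False  # a duplicate task already committed this output
            self.committed[final_name] = temp_name
            return True

store = MetaDataStore()
# Two reduce tasks that performed the same computation try to commit.
results = [store.commit_rename(f"reduce-0.tmp.{i}", "reduce-0.out") for i in range(2)]
print(results)  # [True, False] -- exactly one rename wins
```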

Page 12

Question 4

- I have a hard time picturing a useful non-deterministic function. Can you give an example of a non-deterministic function that could be implemented by Map/Reduce?
- How to construct a non-deterministic function?
- What are some examples that may use such a non-deterministic function?

Page 13

Question 4

- I have a hard time picturing a useful non-deterministic function. Can you give an example of a non-deterministic function that could be implemented by Map/Reduce?
- How to construct a non-deterministic function?
  - A computation that uses a random number generator.
  - An optimization with a large search space, such that it requires a heuristic search starting with a randomly chosen node in the space.
- What are some examples that may use such a non-deterministic function?
  - Given a term not encountered before, what are the best advertisements to offer the user to maximize profits?
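As a minimal illustration of the first construction above (a map function that uses a random number generator), here is a hypothetical Monte Carlo map/reduce pair; running the map twice on the same input may emit different values:

```python
import random

def map_estimate_pi(num_samples: int):
    """Map task using a random number generator: count random points
    that land inside the unit quarter-circle. Non-deterministic output."""
    hits = sum(1 for _ in range(num_samples)
               if random.random() ** 2 + random.random() ** 2 <= 1.0)
    return ("pi_hits", hits)

def reduce_estimate_pi(values, num_samples_total: int):
    """Reduce task: combine hit counts from all map tasks into one estimate of pi."""
    return 4.0 * sum(values) / num_samples_total

key, hits = map_estimate_pi(10_000)
print(reduce_estimate_pi([hits], 10_000))  # close to 3.14, but varies run to run
```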

Page 14

Performance Numbers

- A cluster consisting of 1800 PCs:
  - 2 GHz Intel Xeon processors
  - 4 GB of memory (1-1.5 GB reserved for other tasks sharing the nodes)
  - 320 GB storage: two 160 GB IDE disks
- Grep through 1 TB of data looking for a pre-specified pattern (M = 15,000 splits of 64 MB each, R = 1):
  - Execution time is 150 seconds.

Page 15

Performance Numbers

- A cluster consisting of 1800 PCs:
  - 2 GHz Intel Xeon processors
  - 4 GB of memory (1-1.5 GB reserved for other tasks sharing the nodes)
  - 320 GB storage: two 160 GB IDE disks
- Grep through 1 TB of data looking for a pre-specified pattern (M = 15,000 splits of 64 MB each, R = 1):
  - Execution time is 150 seconds.
- 1764 workers are assigned!
- Time to schedule tasks; startup.
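A quick back-of-the-envelope check of the grep numbers above (binary units and the helper name are assumptions):

```python
def aggregate_scan_rate_gb_per_s(data_tb: float, seconds: float) -> float:
    """Aggregate read rate implied by scanning data_tb terabytes in `seconds`."""
    return data_tb * 1024 / seconds  # TB -> GB

# 1 TB in 150 s: roughly 6.8 GB/s across the cluster, i.e.,
# only a few MB/s per worker across 1764 workers.
print(round(aggregate_scan_rate_gb_per_s(1.0, 150.0), 1))  # 6.8
```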

Page 16

Startup with Grep

- Startup includes:
  - Propagation of the program to all worker machines,
  - Delays interacting with GFS to open the set of 1000 input files,
  - Information needed for the locality optimization.

Page 17

Sort

- The Map function extracts a 10-byte sorting key from a text line, emitting the key and the original text line as the intermediate key/value pair.
  - Each intermediate key/value pair will be sorted.
- Identity function as the reduce operator.
- R = 4000.
- The partitioning function has built-in knowledge of the distribution of keys. If this information is missing, add a pre-pass MapReduce to collect a sample of the keys and compute the partitioning information.
- Final sorted output is written to a set of 2-way replicated GFS files.
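A toy sketch of the sort pipeline above, including the sampling pre-pass for range partitioning (all helper names are illustrative, and R is scaled down to 4):

```python
import random

def map_extract_key(line: str):
    """Map: the first 10 bytes of the line are the sort key; the value is the line."""
    return line[:10], line

def compute_boundaries(sample_keys, R):
    """Pre-pass: pick R-1 split points from a key sample so partitions balance."""
    s = sorted(sample_keys)
    return [s[(i * len(s)) // R] for i in range(1, R)]

def partition(key, boundaries):
    """Range partition: count how many boundaries the key exceeds."""
    return sum(1 for b in boundaries if key > b)

R = 4
lines = ["%010d payload" % random.randrange(10**6) for _ in range(1000)]
pairs = [map_extract_key(line) for line in lines]
bounds = compute_boundaries([k for k, _ in random.sample(pairs, 100)], R)
buckets = [[] for _ in range(R)]
for k, v in pairs:
    buckets[partition(k, bounds)].append((k, v))
for b in buckets:
    b.sort()  # identity reduce: each partition is sorted locally
# Concatenating the buckets in order yields a globally sorted output.
```

The design point: because partitions are ranges rather than hashes, the R sorted output files are themselves in global order, so no final merge is needed.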

Page 18

Sort Results

