Transcript
Page 1:

Large-scale file systems and Map-Reduce

Single-node architecture

[Diagram: a single node with CPU, memory, and disk.]

Google example:

• 20+ billion web pages x 20 KB = 400+ TB
• 1 computer reads 30–35 MB/sec from disk
• ~4 months to read the web

• ~1,000 hard drives to store the web
• Takes even more to do something useful with the data
• New standard architecture is emerging:
– Cluster of commodity Linux nodes
– Gigabit Ethernet interconnect

Slides based on www.mmds.com

Page 2:

Distributed File Systems

• Files are very large, read/append.
• They are divided into chunks.
– Typically 64 MB to a chunk.

• Chunks are replicated at several compute nodes.
• A master (possibly replicated) keeps track of the locations of all chunks.


Page 3:

Commodity clusters: compute nodes
• Organized into racks.
• Intra-rack connection typically gigabit speed.
• Inter-rack connection faster by a small factor.
• Recall that chunks are replicated.

Some implementations:
• GFS (Google File System – proprietary). In Aug 2006 Google had ~450,000 machines.
• HDFS (Hadoop Distributed File System – open source).
• CloudStore (Kosmos File System – open source).


Page 4:

Problems with large-scale computing on commodity hardware
• Challenges:
– How do you distribute computation?
– How can we make it easy to write distributed programs?
– Machines fail:
• One server may stay up 3 years (1,000 days)
• If you have 1,000 servers, expect to lose 1/day
• People estimated Google had ~1M machines in 2011
– 1,000 machines fail every day!


Page 5:


• Issue: Copying data over a network takes time
• Idea:
– Bring computation close to the data
– Store files multiple times for reliability
• Map-reduce addresses these problems
– Google's computational/data manipulation model
– Elegant way to work with big data
– Storage infrastructure – file system
• Google: GFS. Hadoop: HDFS
– Programming model
• Map-Reduce

Page 6:


• Problem:
– If nodes fail, how do we store data persistently?
• Answer:
– Distributed File System:
• Provides a global file namespace
• Google: GFS. Hadoop: HDFS.
• Typical usage pattern
– Huge files (100s of GB to TB)
– Data is rarely updated in place
– Reads and appends are common

Page 7:

[Diagram: racks of compute nodes; a file is divided into chunks, which are replicated across nodes.]


Page 8:

Replication: 3-way replication of files, with copies on different racks.


Page 9:

Map-Reduce
• You write two functions, Map and Reduce.
– They each have a special form, to be explained.
• The system (e.g., Hadoop) creates a large number of tasks for each function.
– Work is divided among tasks in a precise way.


Page 10:

Map-Reduce Algorithms
• Map tasks convert inputs to key-value pairs.
– "Keys" are not necessarily unique.
• Outputs of Map tasks are sorted by key, and each key is assigned to one Reduce task.
• Reduce tasks combine values associated with a key.
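To make the shape of the two functions concrete, here is a minimal single-machine sketch of the Map → group-by-key → Reduce pipeline. It is plain Python; `run_mapreduce` and its signature are illustrative, not part of Hadoop or any real framework.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Simulate Map -> sort/group by key -> Reduce on one machine."""
    # Map phase: each input item may yield any number of (key, value) pairs.
    pairs = [kv for item in inputs for kv in map_fn(item)]
    # Shuffle phase: sort by key so equal keys become adjacent, then group.
    pairs.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        # Reduce phase: combine all values that share this key.
        values = [value for _, value in group]
        results.extend(reduce_fn(key, values))
    return results

# Tiny demo: count characters.
print(run_mapreduce("abca", lambda ch: [(ch, 1)], lambda k, vs: [(k, sum(vs))]))
# [('a', 2), ('b', 1), ('c', 1)]
```

The later relational-operator slides all instantiate this same skeleton with different map and reduce functions.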


Page 11:

Simple map-reduce example: Word Count
• We have a large file of words, one word to a line
• Count the number of times each distinct word appears in the file
• Sample application: analyze web server logs to find popular URLs
• Different scenarios:
– Case 1: Entire file fits in main memory
– Case 2: File too large for main memory, but all <word, count> pairs fit in main memory
– Case 3: File on disk, too many distinct words to fit in memory

Page 12:

Word Count
• Map task: For each word, e.g. CAT, output (CAT, 1)
• Total output: (w1,1), (w1,1), …, (w1,1), (w2,1), (w2,1), …, (w2,1), …
Hash each (w,1) to bucket h(w) in [0, r−1] in a local intermediate file, where r is the number of reducers.
• Master: Group by key: (w1,[1,1,…,1]), (w2,[1,1,…,1]), … Push group (w,[1,1,…,1]) to reducer h(w).
• Reduce task: Reducer h(w)
Read: (w,[1,1,…,1])
Aggregate: each (w,[1,1,…,1]) into (w,sum)
Output: (w,sum) into a common output file
• Since addition is commutative and associative, the map task could instead have sent partial sums: (w1,sum1), (w2,sum2), …
• The reduce task would then receive (wi,sumi,1), (wi,sumi,2), …, (wj,sumj,1), (wj,sumj,2), … and output (wi,sumi), (wj,sumj), …
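A minimal sketch of this word-count pipeline in plain Python, with the shuffle simulated by a dictionary (the function names and sample lines are illustrative only):

```python
from collections import defaultdict

def map_word_count(line):
    # Map task: emit (word, 1) for every word on the line.
    for word in line.split():
        yield (word, 1)

def reduce_word_count(word, counts):
    # Reduce task: sum the 1s -- or partial sums, if map tasks pre-aggregated,
    # which is safe because addition is commutative and associative.
    yield (word, sum(counts))

lines = ["cat dog cat", "dog fish"]

# Simulated shuffle: group map outputs by key (word).
groups = defaultdict(list)
for line in lines:
    for word, count in map_word_count(line):
        groups[word].append(count)

for word in sorted(groups):
    print(list(reduce_word_count(word, groups[word])))
# [('cat', 2)]  [('dog', 2)]  [('fish', 1)]
```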


Page 13:

Partition Function

• Inputs to map tasks are created by contiguous splits of the input file
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker
• System uses a default partition function, e.g., hash(key) mod R
• Sometimes useful to override
– E.g., hash(hostname(URL)) mod R ensures URLs from the same host end up in the same output file
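A sketch of the default and overridden partition functions in plain Python (`R`, `stable_hash`, and the function names are illustrative assumptions, not any framework's API):

```python
import hashlib
from urllib.parse import urlparse

R = 8  # number of reduce tasks

def stable_hash(s):
    # A stable hash; Python's built-in hash() is salted per process.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def default_partition(key):
    # Default: hash(key) mod R.
    return stable_hash(key) % R

def url_host_partition(url):
    # Override: hash(hostname(URL)) mod R, so all URLs from the same
    # host go to the same reducer, and hence the same output file.
    return stable_hash(urlparse(url).netloc) % R

print(url_host_partition("http://example.com/a"))
print(url_host_partition("http://example.com/b"))  # same bucket as /a
```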


Page 14:

Coordination

• Master data structures
– Task status: (idle, in-progress, completed)
– Idle tasks get scheduled as workers become available
– When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
– Master pushes this info to reducers
• Master pings workers periodically to detect failures

Page 15:

Data flow
• Input and final output are stored on a distributed file system
– Scheduler tries to schedule map tasks "close" to the physical storage location of the input data
• Intermediate results are stored on the local FS of map and reduce workers
• Output is often the input to another map-reduce task



Page 16:

Failures

• Map worker failure
– Map tasks completed or in-progress at the worker are reset to idle (the result sits locally at the worker)
– Reduce workers are notified when a task is rescheduled on another worker
• Reduce worker failure
– Only in-progress tasks are reset to idle
• Master failure
– The map-reduce task is aborted and the client is notified


Page 17:

How many Map and Reduce jobs?

• M map tasks, R reduce tasks
• Rule of thumb:
– Make M and R much larger than the number of nodes in the cluster
– One DFS chunk per map task is common
– Improves dynamic load balancing and speeds recovery from worker failure
• Usually R is smaller than M, because the output is spread across R files


Page 18:

Relational operators with map-reduce

Selection σ_C(R)

Map task: If C(t) is true, output the pair (t, t)

Reduce task: With input (t, t), output t

Selection is not really suitable for map-reduce; everything could have been done in the map task.

Page 19:

Relational operators with map-reduce

Projection π_L(R)

Map task: Let t′ be the projection of t onto the attributes in L. Output the pair (t′, t′)

Reduce task: With input (t′, [t′, t′, …, t′]), output t′

Here the duplicate elimination is done by the reduce task.
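A sketch of projection with reduce-side duplicate elimination in plain Python (the attribute list `L` and the sample tuples are made up for illustration):

```python
from collections import defaultdict

L = (0, 2)  # hypothetical projection list: keep attributes 0 and 2

def map_project(t):
    t_prime = tuple(t[i] for i in L)
    yield (t_prime, t_prime)          # key and value are both t'

def reduce_project(t_prime, values):
    yield t_prime                     # (t', [t', ..., t']) collapses to one t'

R_tuples = [(1, 'a', 'x'), (1, 'b', 'x'), (2, 'c', 'y')]
groups = defaultdict(list)            # simulated shuffle
for t in R_tuples:
    for key, value in map_project(t):
        groups[key].append(value)
for key in groups:
    print(list(reduce_project(key, groups[key])))
# [(1, 'x')]  [(2, 'y')] -- the two tuples with the same projection yield one output
```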

Page 20:

Relational operators with map-reduce

• Union R ∪ S
Map task: for each tuple t of the chunk of R or S, output (t, t)
Reduce task: input is (t, [t]) or (t, [t, t]); output t
• Intersection R ∩ S
Map task: for each tuple t of the chunk, output (t, t)
Reduce task: if input is (t, [t, t]), output t; if input is (t, [t]), output nothing
• Difference R – S
Map task: for each tuple t of R output (t, R); for each tuple t of S output (t, S)
Reduce task: if input is (t, [R]), output t; if input is (t, [R, S]), output nothing
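The three reduce functions, sketched in plain Python (the names and the 'R'/'S' tags are illustrative; relations are assumed to be duplicate-free sets, as in the relational model):

```python
def reduce_union(t, values):
    # Input is (t, [t]) or (t, [t, t]); either way t belongs to the union.
    yield t

def reduce_intersection(t, values):
    # (t, [t, t]) means t appeared in both R and S.
    if len(values) == 2:
        yield t

def reduce_difference(t, tags):
    # The map tasks emitted (t, 'R') for tuples of R and (t, 'S') for tuples of S.
    if 'S' not in tags:               # t is in R only, so t is in R - S
        yield t

print(list(reduce_difference((1, 2), ['R'])))       # [(1, 2)]
print(list(reduce_difference((3, 4), ['R', 'S']))) # []
```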

Page 21:

Joining by Map-Reduce
• Suppose we want to compute R(A,B) JOIN S(B,C), using k Reduce tasks.
– I.e., find tuples with matching B-values.
• R and S are each stored in a chunked file.
• Use a hash function h from B-values to k buckets.
– Bucket = Reduce task.
• The Map tasks take chunks from R and S, and send:
– Tuple R(a,b) to Reduce task h(b). Key = b, value = R(a,b).
– Tuple S(b,c) to Reduce task h(b). Key = b, value = S(b,c).


Page 22:

Reduce task i receives from the Map tasks:
• R(a,b) if h(b) = i
• S(b,c) if h(b) = i
and produces all (a,b,c) such that h(b) = i, (a,b) is in R, and (b,c) is in S.

• Key point: If R(a,b) joins with S(b,c), then both tuples are sent to Reduce task h(b).
• Thus, their join (a,b,c) will be produced there and shipped to the output file.
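A single-machine sketch of this hash join in plain Python. Here the grouping key is the bucket h(b) rather than b itself, so the reducer re-checks b, since distinct B-values can share a bucket (k and the sample tuples are illustrative):

```python
from collections import defaultdict

k = 4                                 # number of Reduce tasks
def h(b):
    return b % k                      # illustrative hash from B-values to k buckets

def map_R(a, b):
    yield (h(b), ('R', a, b))         # send R(a,b) to Reduce task h(b)

def map_S(b, c):
    yield (h(b), ('S', b, c))         # send S(b,c) to the same Reduce task

def reduce_join(bucket, values):
    # Pair every R-tuple with every S-tuple that agrees on b.
    r_side = [(a, b) for tag, a, b in values if tag == 'R']
    s_side = [(b, c) for tag, b, c in values if tag == 'S']
    for a, b in r_side:
        for b2, c in s_side:
            if b == b2:
                yield (a, b, c)

groups = defaultdict(list)            # simulated shuffle
for a, b in [(1, 2), (4, 2)]:         # R
    for key, value in map_R(a, b):
        groups[key].append(value)
for b, c in [(2, 3), (5, 6)]:         # S
    for key, value in map_S(b, c):
        groups[key].append(value)
for bucket, values in groups.items():
    print(list(reduce_join(bucket, values)))
# bucket h(2): [(1, 2, 3), (4, 2, 3)]; bucket h(5): []
```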


Page 23:

Mapping tuples in joins

Mapper for R(1,2): R(1,2) → (2, (R,1))
Mapper for R(4,2): R(4,2) → (2, (R,4))
Mapper for S(2,3): S(2,3) → (2, (S,3))
Mapper for S(5,6): S(5,6) → (5, (S,6))

Reducer for B = 2 receives (2, [(R,1), (R,4), (S,3)])
Reducer for B = 5 receives (5, [(S,6)])


Page 24:

Output of the Reducers

Reducer for B = 2, with input (2, [(R,1), (R,4), (S,3)]), outputs (1,2,3) and (4,2,3).
Reducer for B = 5, with input (5, [(S,6)]), has no matching R-tuples and outputs nothing.


Page 25:

Relational operators with map-reduce

Grouping and aggregation: γ_{A, agg(B)}(R(A,B,C))

Map task: for each tuple (a,b,c), output (a, [b])

Reduce task: if input is (a, [b1, b2, …, bn]), output (a, agg(b1, b2, …, bn)), for example (a, b1 + b2 + … + bn)
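A sketch in plain Python (the sample tuples and the `agg=sum` default are illustrative):

```python
from collections import defaultdict

def map_group(a, b, c):
    yield (a, b)                      # key = grouping attribute A, value = B

def reduce_group(a, bs, agg=sum):
    yield (a, agg(bs))                # e.g. (a, b1 + b2 + ... + bn)

R_tuples = [(1, 10, 'x'), (1, 5, 'y'), (2, 7, 'z')]
groups = defaultdict(list)            # simulated shuffle
for a, b, c in R_tuples:
    for key, value in map_group(a, b, c):
        groups[key].append(value)
for a in sorted(groups):
    print(list(reduce_group(a, groups[a])))
# [(1, 15)]  [(2, 7)]
```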

Page 26:

Matrix-vector multiplication using map-reduce

Compute x = Mv, where x_i = Σ_{j=1}^{n} m_ij · v_j

Page 27:

If the vector doesn't fit in main memory

Divide the matrix and the vector into stripes.

Each map task gets a chunk of stripe i of the matrix and the entire stripe i of the vector, and produces pairs (i, m_ij · v_j).

Reduce task i gets all pairs (i, m_ij · v_j) and produces the pairs (i, x_i).
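A single-machine sketch of the striped computation in plain Python (the sparse-dict matrix representation and striping by j mod num_stripes are illustrative choices):

```python
from collections import defaultdict

# Compute x = M v. M is sparse: {(i, j): m_ij}. The matrix is cut into
# stripes of columns and the vector into matching stripes, so each map
# task only needs one stripe of v in memory.
M = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
v = [10.0, 20.0]
num_stripes = 2                       # stripe s holds columns j with j % num_stripes == s

def map_stripe(stripe_entries, v_stripe):
    # stripe_entries: list of ((i, j), m_ij); v_stripe: {j: v_j} for this stripe.
    for (i, j), m_ij in stripe_entries:
        yield (i, m_ij * v_stripe[j])  # partial product for row i

def reduce_row(i, partials):
    yield (i, sum(partials))           # x_i = sum over j of m_ij * v_j

groups = defaultdict(list)             # simulated shuffle
for s in range(num_stripes):
    stripe_entries = [((i, j), m) for (i, j), m in M.items() if j % num_stripes == s]
    v_stripe = {j: v[j] for j in range(len(v)) if j % num_stripes == s}
    for i, partial in map_stripe(stripe_entries, v_stripe):
        groups[i].append(partial)
for i in sorted(groups):
    print(list(reduce_row(i, groups[i])))
# [(0, 50.0)]  [(1, 110.0)]  since x_0 = 1*10 + 2*20 and x_1 = 3*10 + 4*20
```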


Page 28:

Example: [diagram omitted — the mappers emit (i, m_ij · v_j) pairs and the reducers sum them into (i, x_i)]

Page 29:

Examples:

• Hamming distance 1 between bit-strings

• Matrix multiplication in one MR-round

• Matrix multiplication in two MR-rounds

• Three-way joins in two rounds and in one round

Page 30:

Page 31:

Relational operators with map-reduce

Three-Way Join

• We shall consider a simple join of three relations, the natural join R(A,B) ⋈ S(B,C) ⋈ T(C,D).

• One way: cascade of two 2-way joins, each implemented by map-reduce.

• Fine, unless the 2-way joins produce large intermediate relations.


Page 32:

Another 3-Way Join

• Reduce processes use hash values of entire S(B,C) tuples as key.

• Choose a hash function h that maps B- and C-values to k buckets.

• There are k² Reduce processes, one for each (B-bucket, C-bucket) pair.


Page 33:

Job of the Reducers

• Each reducer gets, for certain B-values b and C-values c:
1. All tuples from R with B = b,
2. All tuples from T with C = c, and
3. The tuple S(b,c), if it exists.
• Thus it can create every tuple of the form (a, b, c, d) in the join.


Page 34:

Mapping for 3-Way Join

We map each tuple S(b,c) to ((h(b), h(c)), (S, b, c)).

We map each R(a,b) tuple to ((h(b), y), (R, a, b)) for all y = 1, 2,…,k.

We map each T(c,d) tuple to ((x, h(c)), (T, c, d)) for all x = 1, 2,…,k.

In each pair, the first component is the key and the second is the value.

Aside: even normal map-reduce allows inputs to map to several key-value pairs.
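The three map functions, sketched in plain Python (k, h, and the 0-based bucket numbering are illustrative assumptions):

```python
k = 2                                 # h maps B- and C-values to k buckets

def h(x):
    return x % k                      # illustrative hash function

def map_S(b, c):
    # S(b,c) goes to exactly one reducer: (h(b), h(c)).
    yield ((h(b), h(c)), ('S', b, c))

def map_R(a, b):
    # R(a,b) is needed by every reducer in row h(b).
    for y in range(k):
        yield ((h(b), y), ('R', a, b))

def map_T(c, d):
    # T(c,d) is needed by every reducer in column h(c).
    for x in range(k):
        yield ((x, h(c)), ('T', c, d))

print(list(map_R(1, 1)))   # [((1, 0), ('R', 1, 1)), ((1, 1), ('R', 1, 1))]
print(list(map_S(1, 2)))   # [((1, 0), ('S', 1, 2))]
```

So each S-tuple is sent once, while each R- and T-tuple is replicated k times: the communication cost of avoiding the intermediate relation.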


Page 35:

Assigning Tuples to Reducers

[Diagram: a k × k grid of reducers, rows indexed by h(b) = 0, 1, 2, 3 and columns by h(c) = 0, 1, 2, 3. A tuple S(b,c) with h(b)=1, h(c)=2 goes to the single reducer (1,2); a tuple R(a,b) with h(b)=2 goes to every reducer in row h(b)=2; a tuple T(c,d) with h(c)=3 goes to every reducer in column h(c)=3.]


Page 36:

Example with k = 2 (values 1 and 2 hash to buckets 1 and 2):

DB =
R(1,1) R(1,2) R(2,1) R(2,2)
S(1,1) S(1,2) S(2,1) S(2,2)
T(1,1) T(1,2) T(2,1) T(2,2)

Mapper for R(1,1) emits ((1,1), (R,1)) and ((1,2), (R,1)).
Mapper for S(1,2) emits ((1,2), (S,1,2)). …etc.

Reducer (1,1) receives: R(1,1), R(2,1); S(1,1); T(1,1), T(1,2)
Reducer (1,2) receives: R(1,1), R(2,1); S(1,2); T(2,1), T(2,2)
Reducer (2,1) receives: R(1,2), R(2,2); S(2,1); T(1,1), T(1,2)
Reducer (2,2) receives: R(1,2), R(2,2); S(2,2); T(2,1), T(2,2)

Reducer (1,1) then produces the joined tuples:
R(1,1) ⋈ S(1,1) ⋈ T(1,1), R(2,1) ⋈ S(1,1) ⋈ T(1,1), R(2,1) ⋈ S(1,1) ⋈ T(1,2), R(1,1) ⋈ S(1,1) ⋈ T(1,2)

