
Post on 03-Aug-2020


Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4


We’ll put lists on our doors (after class) and meet with you one by one to discuss grades, goals, …


GFS Architecture: Client/Master/Chunkservers


GFS Consistency Model (Metadata)

• Changes to namespace (i.e., metadata) are atomic
  • Done by single master server!
  • Master uses WAL to define global total order of namespace-changing operations
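
To make the role of the log concrete, here is a minimal sketch of a single master that appends every namespace change to a write-ahead log before applying it. The class `NamespaceMaster`, the JSON-lines log format, and the `create`/`delete` operations are assumptions made for illustration; they are not the GFS implementation.

```python
import json
import os
import tempfile


class NamespaceMaster:
    """Toy single master: each namespace change is logged, then applied."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.namespace = set()  # paths that currently exist
        self.next_seq = 0       # position in the global total order

    def _append_to_log(self, op, path):
        record = {"seq": self.next_seq, "op": op, "path": path}
        with open(self.log_path, "a") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())  # make the record durable before applying it
        self.next_seq += 1
        return record

    def _apply(self, record):
        if record["op"] == "create":
            self.namespace.add(record["path"])
        elif record["op"] == "delete":
            self.namespace.discard(record["path"])

    def create(self, path):
        self._apply(self._append_to_log("create", path))

    def delete(self, path):
        self._apply(self._append_to_log("delete", path))

    @classmethod
    def recover(cls, log_path):
        """Rebuild the namespace by replaying the log in the order it was written."""
        master = cls(log_path)
        with open(log_path) as log:
            for line in log:
                record = json.loads(line)
                master._apply(record)
                master.next_seq = record["seq"] + 1
        return master


# Usage: log three operations, then rebuild the namespace from the log alone.
log_file = os.path.join(tempfile.mkdtemp(), "oplog.jsonl")
master = NamespaceMaster(log_file)
master.create("/a")
master.create("/b")
master.delete("/a")
recovered = NamespaceMaster.recover(log_file)
```

Because one process assigns the sequence numbers, the log order is the total order of namespace operations, and replaying it reproduces the same state.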


GFS Consistency Model (Data)

• Changes to data are ordered as chosen by a primary
  • But multiple writes from the same client may be interleaved or overwritten by concurrent operations from other clients
• Record append completes at least once, at offset of GFS’s choosing
  • Applications must cope with possible duplicates
• Failures can cause inconsistency
  • E.g., different data across chunk servers (failed append)
  • Behavior is worse for writes than appends
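
One common way for an application to cope with at-least-once appends is to tag each record with a unique ID and drop repeats when reading. The sketch below assumes records are Python dicts and that the helper names `make_record` and `read_unique` are hypothetical; it illustrates the idea, not a GFS client API.

```python
import uuid


def make_record(payload):
    """Writer side: attach a unique ID to the record before appending it."""
    return {"id": uuid.uuid4().hex, "payload": payload}


def read_unique(records):
    """Reader side: yield each payload once, skipping retried duplicates."""
    seen = set()
    for record in records:
        if record["id"] in seen:
            continue  # same record appended again after a retry
        seen.add(record["id"])
        yield record["payload"]
```

A reader that iterates over a file containing a retried append then sees each logical record exactly once, at the cost of remembering the IDs it has already seen.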


GFS Summary

• Success: used actively by Google
  • Availability and recoverability on cheap hardware
  • High throughput by decoupling control and data
  • Supports massive data sets and concurrent appends
• Semantics not transparent to apps
  • Must verify file contents to avoid inconsistent regions, repeated appends (at-least-once semantics)
• Performance not good for all apps
  • Assumes read-once, write-once workload (no client caching!)
• Successor: Colossus
  • Eliminates master node as single point of failure
  • Storage efficiency: Reed-Solomon (1.5x) instead of Replicas (3x)
  • Reduces block size to be between 1~8 MB
  • Few details public ☹
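
The 1.5x and 3x figures are raw bytes stored per byte of user data. The short calculation below reproduces them, assuming a Reed-Solomon layout of 6 data blocks plus 3 parity blocks; that layout is an assumption chosen because it yields 1.5x, and the slides do not state the actual parameters.

```python
def storage_overhead(data_blocks, redundancy_blocks):
    """Raw blocks stored per block of user data."""
    return (data_blocks + redundancy_blocks) / data_blocks


# 3-way replication: one original block plus two extra copies.
replication_overhead = storage_overhead(1, 2)

# Assumed Reed-Solomon layout: 6 data blocks plus 3 parity blocks.
reed_solomon_overhead = storage_overhead(6, 3)

# Fraction of raw storage saved by switching from replication to the code.
savings = 1 - reed_solomon_overhead / replication_overhead
```

Under these assumptions the erasure-coded layout stores half as many raw bytes as 3-way replication for the same user data.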

[Figure: GFS shown with BigTable and Spanner and the applications Maps, GMail, Books, YT, and Ads]


Apache Hadoop DFS


Hmm… looks familiar


GFS vs. HDFS


| GFS | HDFS |
|---|---|
| Master | NameNode |
| chunkserver | DataNode |
| operation log | journal, edit log |
| chunk | block |
| random file writes possible | only append is possible |
| multiple writer, multiple reader model | single writer, multiple reader model |
| chunk: 32bit checksum over 64KB data pieces (1024 per chunk) | per HDFS block, two files created on a DataNode: data file & metadata file (checksums, timestamp) |
| default block size: 64MB | default block size: 128MB |
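
The "1024 per chunk" figure follows from dividing a 64MB chunk into 64KB pieces. The sketch below checks that arithmetic and shows per-piece checksumming, using CRC32 from Python's `zlib` as a stand-in 32-bit checksum; the choice of CRC32 and the helper names are assumptions, since the table only says "32bit checksum".

```python
import zlib

CHUNK_SIZE = 64 * 1024 * 1024                 # 64MB chunk
PIECE_SIZE = 64 * 1024                        # 64KB checksummed piece
PIECES_PER_CHUNK = CHUNK_SIZE // PIECE_SIZE   # 1024 pieces per chunk


def piece_checksums(data):
    """Return one 32-bit checksum per 64KB piece of the given bytes."""
    return [
        zlib.crc32(data[offset:offset + PIECE_SIZE])
        for offset in range(0, len(data), PIECE_SIZE)
    ]


def find_corrupt_pieces(data, expected_checksums):
    """Return the indices of pieces whose checksum no longer matches."""
    actual = piece_checksums(data)
    return [i for i, (a, e) in enumerate(zip(actual, expected_checksums)) if a != e]
```

Checksumming small pieces instead of the whole chunk means a read only has to verify the pieces it touches, and a mismatch pinpoints which 64KB range is damaged.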


[Figure: processes P1, P2, P3, P4, P5 with the label "Message Passing"]


How do you find the frequency of words such as “440”, “error”, “rmi”, “p4”?


How do you count the number of mutual friendships for all pairs of people, e.g., "you and Joe have 147 friends in common"?


[Figure: ⟨key, value⟩ pairs moving through the phases below, with a sum (∑) computed per group]

1) Mapping Phase

2) Shuffling / Sorting Phase

3) Reduce Phase
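
The three phases can be sketched on a single machine for the word-frequency question. This is a minimal illustration in plain Python, not a distributed implementation; the function names are hypothetical and the input is assumed to be a list of strings split on whitespace.

```python
from collections import defaultdict


def map_phase(documents):
    """1) Mapping: emit a (word, 1) pair for every word in every document."""
    pairs = []
    for document in documents:
        for word in document.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle_phase(pairs):
    """2) Shuffling / sorting: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """3) Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    return reduce_phase(shuffle_phase(map_phase(documents)))
```

For example, `word_count(["error 440 rmi", "p4 error", "error"])` counts "error" three times and each of the other words once. In a real deployment the map and reduce calls run on different machines and the shuffle moves data between them.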


Count the number of mutual friendships, e.g., "you and Joe have 147 friends in common"

How to do this in the MapReduce framework?
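
One possible answer, sketched below: every person is a common friend of each pair drawn from their own friend list, so the map step emits one pair per combination and the reduce step sums. The sketch assumes friendships are symmetric and that the input is a dict from person to a set of friends; the function names are hypothetical, and this is one of several workable formulations.

```python
from collections import defaultdict
from itertools import combinations


def map_person(person, friends):
    """Map: `person` is a mutual friend of every pair of their own friends."""
    for a, b in combinations(sorted(friends), 2):
        yield (a, b), 1


def mutual_friend_counts(friend_lists):
    """Shuffle by pair, then reduce by summing the emitted ones."""
    groups = defaultdict(list)
    for person, friends in friend_lists.items():
        for pair, one in map_person(person, friends):
            groups[pair].append(one)
    return {pair: sum(ones) for pair, ones in groups.items()}
```

Sorting the friends before pairing them ensures that (a, b) and (b, a) land on the same key. Pairs with no friend in common never appear in the output.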


[Figure labels: HDFS, HDFS]

#tasks >> #processors

dynamic task assignment
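
Dynamic task assignment with many more tasks than processors can be sketched with a shared queue: each worker pulls the next task whenever it becomes free, so faster workers naturally take on more tasks. The sketch uses threads on one machine as stand-ins for processors; `run_tasks` and the task tuple layout are assumptions for illustration.

```python
import queue
import threading


def run_tasks(tasks, num_workers):
    """Run (task_id, func, arg) tuples on workers that pull work when idle."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)

    results = {}
    assigned = {worker_id: 0 for worker_id in range(num_workers)}
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                task_id, func, arg = pending.get_nowait()
            except queue.Empty:
                return  # no tasks left for this worker
            value = func(arg)
            with lock:
                results[task_id] = value
                assigned[worker_id] += 1

    threads = [
        threading.Thread(target=worker, args=(worker_id,))
        for worker_id in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, assigned
```

Nothing is assigned ahead of time, so a slow worker holds up only the task it is currently running instead of a fixed share of the whole job.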


[Figure: a Task Manager and multiple Mappers]


hK


[Figure: four alternating Map and Reduce stages, labeled "Map/Reduce"]


popular?

depends on popularity of her followers

What’s the algorithm?
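
One algorithm that fits this description is PageRank; that identification is an assumption, since the surviving slide text does not name it. A minimal iterative sketch follows, where `follows` maps each user to the users they follow, and the damping factor and iteration count are conventional illustrative defaults.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Iteratively score users by the scores of the users who follow them."""
    users = set(follows) | {v for targets in follows.values() for v in targets}
    n = len(users)
    rank = {user: 1.0 / n for user in users}

    for _ in range(iterations):
        # Map-like step: each user splits their rank among the users they follow.
        received = {user: 0.0 for user in users}
        dangling = 0.0  # rank held by users who follow nobody
        for user in users:
            targets = follows.get(user, [])
            if targets:
                share = rank[user] / len(targets)
                for target in targets:
                    received[target] += share
            else:
                dangling += rank[user]

        # Reduce-like step: combine what each user received into a new rank.
        rank = {
            user: (1 - damping) / n + damping * (received[user] + dangling / n)
            for user in users
        }
    return rank
```

Each iteration has the shape of one map step and one reduce step, so the computation can be expressed as a chain of MapReduce rounds, one round per iteration.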
