Date post: 27-Feb-2021

Introduction to Data Management CSE 344

Lecture 24: MapReduce

CSE 344 - Fall 2016 1


HW8 is out

• Last assignment!
  – Get Amazon credits now (see instructions)
• Spark with Hadoop
• Due next Wednesday


Parallel Data Processing @ 1990


Review: Shared Nothing

• Cluster of machines on a high-speed network
• Called "clusters" or "blade servers"
• Each machine has its own memory and disk: lowest contention

NOTE: Because all machines today have many cores and many disks, shared-nothing systems typically run many "nodes" on a single physical machine.

Characteristics:
• Today, this is the most scalable architecture.
• Most difficult to administer and tune.

We discuss only Shared Nothing in class


Review: Approaches to Parallel Query Evaluation

• Inter-query parallelism
  – Transaction per node
  – OLTP
• Inter-operator parallelism
  – Operator per node
  – Both OLTP and Decision Support
• Intra-operator parallelism
  – Operator on multiple nodes
  – Decision Support

We study only intra-operator parallelism: most scalable

[Figure: query plans joining Purchase with Product (pid=pid) and Customer (cid=cid), illustrating each approach]


Distributed Query Processing

• Data is horizontally partitioned on many servers

• Operators may require data reshuffling


Horizontal Data Partitioning

[Figure: Data, a table with attributes K, A, B, to be distributed across Servers 1, 2, . . ., P]


Horizontal Data Partitioning

[Figure: the table with attributes K, A, B is split into horizontal fragments, one per server 1, 2, . . ., P; each fragment has the same attributes K, A, B]

Which tuples go to which server?


Horizontal Data Partitioning

• Block Partition:
  – Partition tuples arbitrarily s.t. size(R_1) ≈ … ≈ size(R_P)
• Hash partitioned on attribute A:
  – Tuple t goes to chunk i, where i = h(t.A) mod P + 1
• Range partitioned on attribute A:
  – Partition the range of A into -∞ = v_0 < v_1 < … < v_P = ∞
  – Tuple t goes to chunk i, if v_{i-1} < t.A ≤ v_i
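
The three partitioning schemes can be sketched in a few lines of Python. This is only an illustration, not how any particular DBMS implements them: the function names, the dictionary representation of tuples, and the sample relation are assumptions made for this sketch. Block partitioning is done round-robin, Python's built-in `hash` stands in for h, and a tuple that falls exactly on a range boundary is assigned to the lower chunk.

```python
from bisect import bisect_left

def block_partition(tuples, P):
    """Deal tuples round-robin so that chunk sizes differ by at most one."""
    chunks = [[] for _ in range(P)]
    for n, t in enumerate(tuples):
        chunks[n % P].append(t)
    return chunks

def hash_partition(tuples, P, attr, h=hash):
    """Tuple t goes to chunk i = h(t[attr]) mod P + 1 (stored at list index i - 1)."""
    chunks = [[] for _ in range(P)]
    for t in tuples:
        i = h(t[attr]) % P + 1
        chunks[i - 1].append(t)
    return chunks

def range_partition(tuples, boundaries, attr):
    """boundaries = [v_1, ..., v_{P-1}]; chunk i holds v_{i-1} < t[attr] <= v_i."""
    chunks = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        chunks[bisect_left(boundaries, t[attr])].append(t)
    return chunks

# Hypothetical relation with a unique key K and a low-cardinality attribute A
R = [{"K": k, "A": k % 3} for k in range(10)]
by_block = block_partition(R, 3)
by_hash = hash_partition(R, 3, "K")
by_range = range_partition(R, [3, 6], "K")
```

Only the choice of chunk differs between the three functions; everything downstream (scanning, local operators) sees a list of P fragments either way.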


Parallel Group By

Data: R(K,A,B,C)
Query: γ_{A,sum(C)}(R)

How to compute if:

• R is hash-partitioned on A

• R is block-partitioned

• R is hash-partitioned on K


Parallel Group By

Data: R(K,A,B,C)
Query: γ_{A,sum(C)}(R)
• R is block-partitioned or hash-partitioned on K

[Figure: fragments R1, R2, . . ., RP are reshuffled on attribute A into R1’, R2’, . . ., RP’]
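
A minimal sketch of the reshuffle followed by a local group-by, simulating the servers as Python lists. The helper names and the sample tuples are hypothetical, and Python's built-in `hash` stands in for the hash function h. After the reshuffle, all tuples with the same A value sit on the same server, so each server can aggregate on its own and the per-server results are simply concatenated.

```python
from collections import defaultdict

def reshuffle(fragments, attr_index, P):
    """Send each tuple to server hash(t.A) mod P so equal A values meet on one server."""
    out = [[] for _ in range(P)]
    for fragment in fragments:
        for t in fragment:
            out[hash(t[attr_index]) % P].append(t)
    return out

def local_group_sum(fragment, group_index, sum_index):
    """Group-by with sum, computed on a single server's fragment."""
    totals = defaultdict(int)
    for t in fragment:
        totals[t[group_index]] += t[sum_index]
    return dict(totals)

def parallel_group_by(fragments, P):
    # R(K, A, B, C): group on A (index 1), sum C (index 3)
    result = {}
    for fragment in reshuffle(fragments, 1, P):
        # Groups are disjoint across servers, so update never overwrites a key
        result.update(local_group_sum(fragment, 1, 3))
    return result

# Hypothetical data, block-partitioned over two servers
R1 = [(1, "x", 0, 10), (2, "y", 0, 5)]
R2 = [(3, "x", 0, 7), (4, "z", 0, 1)]
totals = parallel_group_by([R1, R2], P=2)
```

If R were already hash-partitioned on A, the `reshuffle` step could be skipped and only `local_group_sum` would run.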


Parallel Join

• Data: R(K1, A, B), S(K2, B, C)
• Query: R(K1,A,B) ⋈ S(K2,B,C)

Initially, R is horizontally partitioned on K1 and S on K2

[Figure: fragments (R1, S1), (R2, S2), . . ., (RP, SP) → reshuffle R on R.B and S on S.B → (R’1, S’1), (R’2, S’2), . . ., (R’P, S’P); each server computes the join locally]


Data: R(K1, A, B), S(K2, B, C)
Query: R(K1,A,B) ⋈ S(K2,B,C)

Partition (R tuples shown as (K1, B), S tuples as (K2, B)):
• M1: R1 = (1, 20), (2, 50); S1 = (101, 50), (102, 50)
• M2: R2 = (3, 20), (4, 20); S2 = (201, 20), (202, 50)

Shuffle on B, then Local Join (⋈) on each machine:
• M1: R1’ = (1, 20), (3, 20), (4, 20); S1’ = (201, 20)
• M2: R2’ = (2, 50); S2’ = (101, 50), (102, 50), (202, 50)
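
The same example can be replayed in Python as a rough simulation of partition, shuffle, and local join. The function names are hypothetical, and Python's built-in `hash` is used as the shuffle function, so the machine that a given B value lands on may differ from the slide; the join result is the same regardless.

```python
def shuffle_on(fragments, key_index, P):
    """Route every tuple to machine hash(join attribute) mod P."""
    out = [[] for _ in range(P)]
    for fragment in fragments:
        for t in fragment:
            out[hash(t[key_index]) % P].append(t)
    return out

def local_join(r_fragment, s_fragment):
    """Hash join on B within one machine; R tuples are (K1, B), S tuples are (K2, B)."""
    k1_by_b = {}
    for k1, b in r_fragment:
        k1_by_b.setdefault(b, []).append(k1)
    return [(k1, b, k2) for k2, b in s_fragment for k1 in k1_by_b.get(b, [])]

# Initial fragments from the example
R = [[(1, 20), (2, 50)], [(3, 20), (4, 20)]]           # R1, R2
S = [[(101, 50), (102, 50)], [(201, 20), (202, 50)]]   # S1, S2

P = 2
R_shuffled = shuffle_on(R, 1, P)
S_shuffled = shuffle_on(S, 1, P)
joined = [row for m in range(P)
          for row in local_join(R_shuffled[m], S_shuffled[m])]
```

Because R and S are shuffled with the same function on the same attribute, any pair of tuples that agree on B is guaranteed to meet on one machine, which is what makes the purely local join correct.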


Speedup and Scaleup

• Consider:
  – Query: γ_{A,sum(C)}(R)
  – Runtime: dominated by reading chunks from disk
• If we double the number of nodes P, what is the new running time?
  – Half (each server holds ½ as many chunks)
• If we double both P and the size of R, what is the new running time?
  – Same (each server holds the same # of chunks)
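
The arithmetic behind both answers can be written down as a tiny cost model. This is a back-of-the-envelope sketch under strong assumptions: chunks are spread perfectly evenly, every chunk takes the same time to read, and reshuffling and network costs are ignored. The function name and the numbers are made up for illustration.

```python
def scan_time(num_chunks, P, seconds_per_chunk=1.0):
    """Estimated runtime when P servers read their chunks in parallel."""
    chunks_per_server = num_chunks / P
    return chunks_per_server * seconds_per_chunk

base = scan_time(num_chunks=1000, P=10)      # baseline
speedup = scan_time(num_chunks=1000, P=20)   # double P: half the time
scaleup = scan_time(num_chunks=2000, P=20)   # double P and |R|: same time
```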


Uniform Data vs. Skewed Data

• Let R(K,A,B,C); which of the following partition methods may result in skewed partitions?
• Block partition: Uniform
• Hash-partition
  – On the key K: Uniform (assuming good hash function)
  – On the attribute A: May be skewed. E.g. when all records have the same value of the attribute A, then all records end up in the same partition
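
The worst case described above is easy to reproduce. The sketch below uses a hypothetical relation in which K is unique and every tuple carries the same A value, and counts how many tuples each of P = 4 partitions receives. Python's built-in `hash` stands in for a good hash function, and the function name is an assumption of this example.

```python
from collections import Counter

def partition_sizes(tuples, attr, P):
    """Number of tuples each partition receives under hash partitioning on attr."""
    sizes = Counter(hash(t[attr]) % P for t in tuples)
    return [sizes.get(i, 0) for i in range(P)]

# Hypothetical relation: K is a unique key, A has a single value
R = [{"K": k, "A": 7} for k in range(1000)]
on_key = partition_sizes(R, "K", P=4)    # evenly spread
on_attr = partition_sizes(R, "A", P=4)   # one partition gets everything
```

With skew like this, one server does all the work, so adding servers no longer reduces the running time.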


Loading Data into a Parallel DBMS

Example using Teradata

AMP = “Access Module Processor” = unit of parallelism


Example Parallel Query Execution

Find all orders from today, along with the items ordered

Order(oid, item, date), Line(item, …)

SELECT *
FROM Order o, Line i
WHERE o.item = i.item
  AND o.date = today()

[Figure: query plan: scan Order o → select date = today(); scan Item i; join on o.item = i.item]


Example Parallel Query Execution

Order(oid, item, date), Line(item, …)

[Figure: each of AMP 1, AMP 2, AMP 3 runs scan Order o → select date = today() → hash h(o.item); the hashed tuples are sent to AMP 1, AMP 2, or AMP 3. Inset: the query plan (scan Order o, select date = today(), join on o.item = i.item)]


Example Parallel Query Execution

Order(oid, item, date), Line(item, …)

[Figure: each of AMP 1, AMP 2, AMP 3 runs scan Item i → hash h(i.item); the hashed tuples are sent to AMP 1, AMP 2, or AMP 3. Inset: the query plan (scan Order o with date = today(), scan Item i, join on o.item = i.item)]


Example Parallel Query Execution

Order(oid, item, date), Line(item, …)

[Figure: AMP 1, AMP 2, and AMP 3 each compute the join o.item = i.item locally]
• AMP 1 contains all orders and all lines where hash(item) = 1
• AMP 2 contains all orders and all lines where hash(item) = 2
• AMP 3 contains all orders and all lines where hash(item) = 3
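
The three steps of this plan (select and hash the orders, hash the lines, join locally) can be put together in one small simulation. This is a sketch only and has nothing to do with Teradata's actual interfaces: the AMPs are plain Python lists, `amp_for` and `run_query` are hypothetical names, the sample rows are invented, and AMPs are numbered from 0 rather than 1.

```python
from datetime import date

NUM_AMPS = 3

def amp_for(item):
    """Destination AMP for a tuple, chosen by hashing the join attribute."""
    return hash(item) % NUM_AMPS

def run_query(order_fragments, line_fragments, today):
    orders_at = [[] for _ in range(NUM_AMPS)]
    lines_at = [[] for _ in range(NUM_AMPS)]
    # Step 1: scan Order, select date = today, hash on o.item
    for fragment in order_fragments:
        for oid, item, odate in fragment:
            if odate == today:
                orders_at[amp_for(item)].append((oid, item, odate))
    # Step 2: scan Line, hash on i.item
    for fragment in line_fragments:
        for line in fragment:
            lines_at[amp_for(line[0])].append(line)
    # Step 3: each AMP joins the orders and lines it received
    result = []
    for orders, lines in zip(orders_at, lines_at):
        for o in orders:
            for ln in lines:
                if o[1] == ln[0]:
                    result.append(o + ln)
    return result

today = date(2016, 12, 1)
orders = [[(1, 10, today)], [(2, 11, date(2016, 11, 30))], [(3, 12, today)]]
lines = [[(10, "pen")], [(11, "ink"), (12, "pad")], []]
rows = run_query(orders, lines, today)
```

Note that the selection runs before the hash step, so orders from other days are never sent over the network.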


Parallel Data Processing @ 2000


Optional Reading

• Original paper: https://www.usenix.org/legacy/events/osdi04/tech/dean.html
• Rebuttal to a comparison with parallel DBs: http://dl.acm.org/citation.cfm?doid=1629175.1629198
• Chapter 2 (Sections 1,2,3 only) of Mining of Massive Datasets, by Rajaraman and Ullman: http://i.stanford.edu/~ullman/mmds.html


Distributed File System (DFS)

• For very large files: TBs, PBs
• Each file is partitioned into chunks, typically 64MB
• Each chunk is replicated several times (≥3), on different racks, for fault tolerance
• Implementations:
  – Google’s DFS: GFS, proprietary
  – Hadoop’s DFS: HDFS, open source
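
To get a feel for the numbers, the sketch below computes how many chunks and chunk copies a file occupies. It assumes 64MB means 64 × 2^20 bytes and a replication factor of exactly 3; the function name is hypothetical and is not part of GFS or HDFS.

```python
import math

def dfs_footprint(file_bytes, chunk_bytes=64 * 2**20, replicas=3):
    """Return (number of chunks, number of stored chunk copies, raw bytes stored)."""
    chunks = math.ceil(file_bytes / chunk_bytes)
    return chunks, chunks * replicas, file_bytes * replicas

# A 1 TB (2^40 byte) file
chunks, copies, raw_bytes = dfs_footprint(2**40)
```

Under these assumptions a 1 TB file becomes 16384 chunks and 49152 stored copies, and the chunks are also the natural unit of parallelism for the jobs that read the file.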


MapReduce

• Google: paper published 2004
• Free variant: Hadoop

• MapReduce = high-level programming model and implementation for large-scale parallel data processing


Typical Problems Solved by MR

• Read a lot of data
• Map: extract something you care about from each record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, transform
• Write the results

Paradigm stays the same, change map and reduce functions for different problems

slide source: Jeff Dean


MapReduce Data Model

Instance: Files!
• where a file = a bag of (key, value) pairs

Schema: None!
• just like other key-value data models

Query language: a MapReduce program:
• Input: a bag of (inputkey, value) pairs
• Output: a bag of (outputkey, value) pairs


Step 1: the MAP Phase

User provides the MAP function:
• Input: (input key, value)
• Output: bag of (intermediate key, value)

System applies the map function in parallel to all (input key, value) pairs in the input file


Step 2: the REDUCE Phase

User provides the REDUCE function:
• Input: (intermediate key, bag of values)
• Output: bag of output (values)

System groups all pairs with the same intermediate key, and passes the bag of values to the REDUCE function
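
The contract between the system and the two user-provided functions can be sketched as a single-process Python function. This is a toy model, not Hadoop's API: `run_mapreduce`, `map_fn`, and `reduce_fn` are names invented here, everything runs sequentially in memory, and there is no fault tolerance. As an example it computes the earlier group-by query γ_{A,sum(C)}(R) on made-up data.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # MAP phase: apply map_fn to every (input key, value) pair
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Group all pairs with the same intermediate key
    groups = defaultdict(list)
    for ikey, ivalue in intermediate:
        groups[ikey].append(ivalue)
    # REDUCE phase: one call per intermediate key, with its bag of values
    output = []
    for ikey, values in groups.items():
        output.extend(reduce_fn(ikey, values))
    return output

# Group by A, sum C, over R(K, A, B, C); input key = K, value = (A, B, C)
def map_fn(k, row):
    a, b, c = row
    return [(a, c)]

def reduce_fn(a, cs):
    return [(a, sum(cs))]

R = [(1, ("x", 0, 10)), (2, ("y", 0, 5)), (3, ("x", 0, 7))]
result = run_mapreduce(R, map_fn, reduce_fn)
```

Only `map_fn` and `reduce_fn` change from problem to problem; the grouping step in the middle is what a real system implements as the distributed shuffle.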


Example

• Counting the number of occurrences of each word in a large collection of documents
• Each Document
  – The key = document id (did)
  – The value = set of words (word)

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
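
For readers who want to run the example, here is a rough Python equivalent of the pseudocode. It is a single-process sketch, not the Hadoop or Google API: the shuffle is simulated with a dictionary, counts are integers rather than the strings used above, `reduce_fn` returns a (word, count) pair instead of calling `Emit`, and the two sample documents are invented.

```python
from collections import defaultdict

def map_fn(doc_id, contents):
    """Emit (word, 1) for every word occurrence in the document."""
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Sum the list of counts collected for one word."""
    return (word, sum(counts))

def word_count(documents):
    groups = defaultdict(list)
    for doc_id, contents in documents.items():
        for word, one in map_fn(doc_id, contents):
            groups[word].append(one)          # simulated shuffle: group by word
    return dict(reduce_fn(w, ones) for w, ones in groups.items())

docs = {"did1": "a rose is a rose", "did2": "is it"}
counts = word_count(docs)
```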


[Figure: word count data flow]
• MAP: (did1,v1), (did2,v2), (did3,v3), . . . → (w1,1), (w2,1), (w3,1), (w1,1), (w2,1), . . .
• Shuffle: (w1, (1,1,1,…,1)), (w2, (1,1,…)), (w3, (1…))
• REDUCE: (w1, 25), (w2, 77), (w3, 12)


Jobs vs. Tasks

• A MapReduce Job
  – One single “query,” e.g., count the words in all docs
  – More complex queries may consist of multiple jobs
• A Map Task, or a Reduce Task
  – A group of instantiations of the map or reduce function, which are scheduled on a single worker
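
One way to picture the grouping of function instantiations into tasks is sketched below. It is an illustration under assumptions, not the scheduler of any real system: inputs are split into contiguous groups (real systems typically create one map task per file chunk), intermediate keys are assigned to reduce tasks by hashing, and both function names are hypothetical.

```python
def make_map_tasks(inputs, num_map_tasks):
    """Split the input pairs into at most num_map_tasks contiguous groups."""
    size = -(-len(inputs) // num_map_tasks)   # ceiling division
    return [inputs[i:i + size] for i in range(0, len(inputs), size)]

def reduce_task_for(intermediate_key, num_reduce_tasks):
    """All pairs with the same intermediate key go to the same reduce task."""
    return hash(intermediate_key) % num_reduce_tasks

# Ten hypothetical documents handled by three map tasks
inputs = [(f"did{n}", f"contents {n}") for n in range(10)]
map_tasks = make_map_tasks(inputs, num_map_tasks=3)
```

Each element of `map_tasks` is the batch of map invocations one worker would run; a worker that finishes early can be handed another task.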

