
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Date post: 27-Jan-2015
Upload: npinto
Description:
http://cs264.org
Page 1: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Introduction to Hadoop

Zak Stone <[email protected]>
PhD candidate, Harvard School of Engineering and Applied Sciences
Advisor: Todd Zickler (Computer Vision)

Page 2: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Hadoop distributes data and computation across a large number of computers.

Page 3: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Outline

1. Why should you care about Hadoop?

2. What exactly is Hadoop?

3. An overview of Hadoop Map-Reduce

4. The Hadoop Distributed File System (HDFS)

5. Hadoop advantages and disadvantages

6. Getting started with Hadoop

7. Useful resources

Page 5: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Why should you care? - Lots of Data

LOTS OF DATA EVERYWHERE

Page 6: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Why should you care? - Lots of Data

LOTS!

Page 7: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Why should you care? - Lots of Data

Page 8: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Why should you care? - Even Grocery Stores Care

...

Page 9: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Why Hadoop for big data?

• Most credible open-source toolset for large-scale, general-purpose computing

• Backed by [logos not preserved in this transcript]

• Used by [logos not preserved in this transcript], many others

• Increasing support from web services

• Hadoop closely imitates infrastructure developed by Google

• Hadoop processes petabytes daily, right now

Page 10: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Why Hadoop for big data?

Page 11: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

DISCLAIMER

• Don’t use Hadoop if your data and computation fit on one machine

• Getting easier to use, but still complicated

http://www.wired.com/gadgetlab/2008/07/patent-crazines/

Page 12: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Outline

1. Why should you care about Hadoop?

2. What exactly is Hadoop?

3. An overview of Hadoop Map-Reduce

4. The Hadoop Distributed File System (HDFS)

5. Hadoop advantages and disadvantages

6. Getting started with Hadoop

7. Useful resources

Page 13: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

What exactly is Hadoop?

• Actually a growing collection of subprojects

Page 14: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

What exactly is Hadoop?

• Actually a growing collection of subprojects; focus on two right now

Page 15: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Outline

1. Why should you care about Hadoop?

2. What exactly is Hadoop?

3. An overview of Hadoop Map-Reduce

4. The Hadoop Distributed File System (HDFS)

5. Hadoop advantages and disadvantages

6. Getting started with Hadoop

7. Useful resources

Page 16: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

An overview of Hadoop Map-Reduce

Traditional computing (one computer)

Hadoop (many computers)

Page 17: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

An overview of Hadoop Map-Reduce

(Actually more like this)

(many computers, little communication, stragglers and failures)

Page 18: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Map-Reduce: Three phases

1. Map

2. Sort

3. Reduce

Page 19: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Map-Reduce: Map phase

Only specify operations on key-value pairs!

INPUT PAIR: (key, value)

OUTPUT PAIRS: (key, value), (key, value), (key, value) (zero or more output pairs)

(each “elephant” works on an input pair; doesn’t know other elephants exist)

Page 20: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Map-Reduce: Map phase, word-count example

(line1, “Hello there.”) → (“hello”, 1), (“there”, 1)

(line2, “Why, hello.”) → (“why”, 1), (“hello”, 1)
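The mapping on this slide can be reproduced with a small sketch. The slide's output words are lowercased and free of punctuation, so the mapper below normalizes each word. That normalization step and the name `map_words` are assumptions for illustration, not something the slides specify.

```python
import string

def map_words(key, value):
    """Emit (word, 1) for every word in one input line."""
    for raw in value.split():
        # Assumed normalization: strip surrounding punctuation, then lowercase.
        word = raw.strip(string.punctuation).lower()
        if word:
            yield word, 1

# Each call sees only its own input pair, like one "elephant" on the slide.
pairs = list(map_words("line1", "Hello there.")) + list(map_words("line2", "Why, hello."))
print(pairs)
```

The input key (`line1`, `line2`) is ignored here; for word-count only the line's text matters.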

Page 21: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Map-Reduce: Sort phase

(key1, value289), (key1, value43), (key1, value3), ...

(key2, value512), (key2, value11), (key2, value67), ...

Page 22: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Map-Reduce: Sort phase, word-count example

(“hello”, 1)

(“there”, 1)

(“why”, 1)

(“hello”, 1)
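As a rough single-machine sketch of what the sort phase does to these four pairs: sort by key, then collect the values that share a key so the reducer can see them together. The helper name `sort_and_group` is hypothetical; on a cluster Hadoop performs this step itself, across machines.

```python
from itertools import groupby
from operator import itemgetter

def sort_and_group(pairs):
    """Sort pairs by key, then gather each key's values into one list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

mapped = [("hello", 1), ("there", 1), ("why", 1), ("hello", 1)]
grouped = sort_and_group(mapped)
print(grouped)
```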

Page 23: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Map-Reduce: Reduce phase

(key1, value289), (key1, value43), (key1, value3), ... → (key1, output1)
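A reducer can be any function of one key and that key's values. As a minimal sketch, the reducer below keeps the largest value per key. Treating the slide's value subscripts (289, 43, 3, ...) as actual numbers is an assumption made only so the example has something to compute, and `reduce_max` is a hypothetical name.

```python
def reduce_max(key, values):
    """Collapse all values for one key into a single output pair."""
    yield key, max(values)

# Grouped input as the sort phase would deliver it (numbers are illustrative).
grouped = [("key1", [289, 43, 3]), ("key2", [512, 11, 67])]
outputs = [pair for key, values in grouped for pair in reduce_max(key, values)]
print(outputs)
```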

Page 24: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Map-Reduce: Reduce phase, word-count example

Input pairs: (“hello”, 1), (“there”, 1), (“why”, 1), (“hello”, 1)

Output pairs: (“hello”, 2), (“there”, 1), (“why”, 1)

Page 25: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Map-Reduce: Code for word-count

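def mapper(key, value):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum the counts collected for one word.
    yield key, sum(values)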
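To try the slide's mapper and reducer without a cluster, a single-process driver is enough. The sketch below is a hypothetical stand-in for what Hadoop does across many machines; `run_local` is an invented name, and the input text is pre-cleaned because the slide's mapper does not strip punctuation or change case.

```python
from collections import defaultdict

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

def run_local(records, mapper, reducer):
    """Hypothetical local driver: map, then sort/group, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for key in sorted(groups):
        results.extend(reducer(key, groups[key]))
    return results

records = [("line1", "hello there"), ("line2", "why hello")]
counts = run_local(records, mapper, reducer)
print(counts)
```

Running on a tiny dataset like this is also a cheap way to catch bugs before submitting a job to a cluster.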

Page 26: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Seems like too much work for a word-count!

Page 27: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Map-Reduce: Imagine word-count on the Web

Page 28: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Map-Reduce: The main advantage

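def mapper(key, value):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # Sum the counts collected for one word.
    yield key, sum(values)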

With Hadoop, this very same code could run on the entire Web! (In theory, at least)

Page 29: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Outline

1. Why should you care about Hadoop?

2. What exactly is Hadoop?

3. An overview of Hadoop Map-Reduce

4. The Hadoop Distributed File System (HDFS)

5. Hadoop advantages and disadvantages

6. Getting started with Hadoop

7. Useful resources

Page 30: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

HDFS: Hadoop Distributed File System

Data

(chunks of data on computers)

(each chunk replicated more than once for reliability)
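The idea of chunking plus replication can be illustrated with a toy placement function. This is only a sketch: the names `place_chunks` and `readable_after_failure`, the round-robin policy, the tiny chunk size, and the replication factor of 3 are all assumptions for illustration, and HDFS's real placement policy is more sophisticated.

```python
def place_chunks(data, chunk_size, nodes, replication=3):
    """Split data into chunks and assign each chunk to several distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    placement = {}
    for index, chunk in enumerate(chunks):
        # Toy round-robin placement, not the policy HDFS actually uses.
        holders = [nodes[(index + r) % len(nodes)] for r in range(replication)]
        placement[index] = (chunk, holders)
    return placement

def readable_after_failure(placement, failed_node):
    """True if every chunk still has at least one replica on a live node."""
    return all(
        any(holder != failed_node for holder in holders)
        for _, holders in placement.values()
    )

placement = place_chunks(b"abcdefghij", chunk_size=4,
                         nodes=["n1", "n2", "n3", "n4"], replication=3)
print(placement)
print(readable_after_failure(placement, "n2"))
```

Because every chunk lives on more than one node, losing a single node leaves all of the data readable.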

Page 31: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

HDFS: Hadoop Distributed File System

(key1, value1), (key2, value2), ...

(key1, value1), (key2, value2), ...

Computation is local to the data

Key-value pairs processed independently in parallel
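A small sketch of "independent and parallel": each chunk of key-value pairs is mapped by its own worker, and no worker needs anything from another chunk. Threads stand in for separate machines here, which is an assumption made purely to keep the example runnable on one computer; `map_chunk` is a hypothetical helper.

```python
from concurrent.futures import ThreadPoolExecutor

def mapper(key, value):
    for word in value.split():
        yield word, 1

def map_chunk(chunk):
    """Run the mapper over one chunk; it never reads any other chunk."""
    return [pair for key, value in chunk for pair in mapper(key, value)]

# Two chunks, as if stored on two different machines.
chunks = [
    [("line1", "hello there")],
    [("line2", "why hello")],
]

with ThreadPoolExecutor(max_workers=2) as pool:
    # pool.map returns results in the same order as the input chunks.
    per_chunk = list(pool.map(map_chunk, chunks))
print(per_chunk)
```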

Page 32: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

HDFS: Inspired by the Google File System

Page 33: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Outline

1. Why should you care about Hadoop?

2. What exactly is Hadoop?

3. An overview of Hadoop Map-Reduce

4. The Hadoop Distributed File System (HDFS)

5. Hadoop advantages and disadvantages

6. Getting started with Hadoop

7. Useful resources

Page 34: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Hadoop Map-Reduce and HDFS: Advantages

• Distribute data and computation

• Computation local to data avoids network overload

• Tasks are independent

• Easy to handle partial failures - entire nodes can fail and restart

• Avoid crawling horrors of failure-tolerant synchronous distributed systems

• Speculative execution to work around stragglers

• Linear scaling in the ideal case

• Designed for cheap, commodity hardware

• Simple programming model

• The “end-user” programmer only writes map-reduce tasks

Page 35: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Hadoop Map-Reduce and HDFS: Disadvantages

• Still rough - software under active development

• e.g. HDFS only recently added support for append operations

• Programming model is very restrictive

• Lack of central data can be frustrating

• “Joins” of multiple datasets are tricky and slow

• No indices! Often, entire dataset gets copied in the process

• Cluster management is hard (debugging, distributing software, collecting logs...)

• Still single master, which requires care and may limit scaling

• Managing job flow isn’t trivial when intermediate data should be kept

• Optimal configuration of nodes not obvious (# mappers, # reducers, mem. limits)
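To make the "joins are tricky" point concrete, here is a minimal sketch of one common pattern, often called a reduce-side join: the mapper tags each record with the dataset it came from, and the reducer pairs up records that share a key. The dataset names, records, and function names are invented for illustration, and the grouping loop stands in for Hadoop's sort phase.

```python
from collections import defaultdict
from itertools import product

def join_mapper(source, record):
    """Key each record by its join key and tag it with its source dataset."""
    key, payload = record
    yield key, (source, payload)

def join_reducer(key, tagged_values):
    """Pair every 'users' payload with every 'orders' payload for one key."""
    left = [payload for source, payload in tagged_values if source == "users"]
    right = [payload for source, payload in tagged_values if source == "orders"]
    for user, order in product(left, right):
        yield key, (user, order)

users = [(1, "ada"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "ink")]

# Stand-in for the sort phase: group tagged values by join key.
groups = defaultdict(list)
for source, dataset in (("users", users), ("orders", orders)):
    for record in dataset:
        for key, value in join_mapper(source, record):
            groups[key].append(value)

joined = [out for key in sorted(groups) for out in join_reducer(key, groups[key])]
print(joined)
```

Every record from both datasets passes through the sort phase, even those with no match, which hints at why such joins are slow without indices.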

Page 36: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Outline

1. Why should you care about Hadoop?

2. What exactly is Hadoop?

3. An overview of Hadoop Map-Reduce

4. The Hadoop Distributed File System (HDFS)

5. Hadoop advantages and disadvantages

6. Getting started with Hadoop

7. Useful resources

Page 37: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Getting started: Installation options

• Cloudera virtual machine

• Your own virtual machine (install Ubuntu in VirtualBox, which is free)

• Elastic MapReduce on EC2

• StarCluster with Hadoop on EC2

• Cloudera’s distribution of Hadoop on EC2

• Install Cloudera’s distribution of Hadoop on your own machine

• Available for RPM and Debian deployments

• Or download Hadoop directly from http://hadoop.apache.org/

Page 38: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Getting started: Language choices

• Hadoop is written in Java

• However, Hadoop Streaming allows mappers and reducers in any language!

• Binary data is a little tricky with Hadoop Streaming

• Could use base64 encoding, but TypedBytes are much better

• For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo

• The Python word-count example and others come with Dumbo

• Dumbo makes binary data with TypedBytes easy

• Also consider Hadoopy: https://github.com/bwhite/hadoopy
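As a rough sketch of the Hadoop Streaming style: the mapper and reducer are ordinary programs that read text lines and write tab-separated key/value lines, and the reducer receives its input already sorted by key. The functions below take any iterable of lines so they can be tried locally; in a real streaming script they would be fed from `sys.stdin` and their output printed. The function names are hypothetical. The last few lines show the base64 workaround for binary values mentioned above.

```python
import base64
from itertools import groupby

def stream_map(lines):
    """Mapper side: turn raw text lines into 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def stream_reduce(sorted_lines):
    """Reducer side: sum the counts for each run of identical keys."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

mapped = list(stream_map(["hello there", "why hello"]))
# sorted() stands in for the sort Hadoop performs between the two programs.
reduced = list(stream_reduce(sorted(mapped)))
print(reduced)

# Binary values can contain tabs and newlines, so encode them as text first.
blob = bytes([0, 9, 10, 255])
line = "img1\t" + base64.b64encode(blob).decode("ascii")
key, encoded = line.split("\t", 1)
decoded = base64.b64decode(encoded)
```

Base64 inflates the data by about a third, which is one reason a binary-safe format is attractive.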

Page 39: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Outline

1. Why should you care about Hadoop?

2. What exactly is Hadoop?

3. An overview of Hadoop Map-Reduce

4. The Hadoop Distributed File System (HDFS)

5. Hadoop advantages and disadvantages

6. Getting started with Hadoop

7. Useful resources

Page 40: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Useful resources and tips

• The Hadoop homepage: http://hadoop.apache.org/

• Cloudera: http://cloudera.com/

• Dumbo: http://wiki.github.com/klbostee/dumbo

• Hadoopy: https://github.com/bwhite/hadoopy

• Amazon Elastic Compute Cloud Getting Started Guide: http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/

• Always test locally on a tiny dataset before running on a cluster!

Page 41: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

...

Page 42: [Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Thanks for your attention!

