[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Post on 27-Jan-2015

http://cs264.org

Introduction to Hadoop

Zak Stone <zak@eecs.harvard.edu>

PhD candidate, Harvard School of Engineering and Applied Sciences

Advisor: Todd Zickler (Computer Vision)

Hadoop distributes data and computation across a large number of computers.

Outline

1. Why should you care about Hadoop?

2. What exactly is Hadoop?

3. An overview of Hadoop Map-Reduce

4. The Hadoop Distributed File System (HDFS)

5. Hadoop advantages and disadvantages

6. Getting started with Hadoop

7. Useful resources

Why should you care? - Lots of Data

LOTS OF DATA EVERYWHERE

LOTS!

Why should you care? - Even Grocery Stores Care

...

Why Hadoop for big data?

• Most credible open-source toolset for large-scale, general-purpose computing

• Backed by ,

• Used by , , many others

• Increasing support from web services

• Hadoop closely imitates infrastructure developed by Google

• Hadoop processes petabytes daily, right now

Why Hadoop for big data?

• Don’t use Hadoop if your data and computation fit on one machine

• Getting easier to use, but still complicated

DISCLAIMER

http://www.wired.com/gadgetlab/2008/07/patent-crazines/

What exactly is Hadoop?

• Actually a growing collection of subprojects; this talk focuses on two of them: Map-Reduce and HDFS

An overview of Hadoop Map-Reduce

Traditional computing (one computer)

Hadoop (many computers)

An overview of Hadoop Map-Reduce

(Actually more like this)

(many computers, little communication, stragglers and failures)

Map-Reduce: Three phases

1. Map

2. Sort

3. Reduce

Map-Reduce: Map phase

INPUT PAIR: (key, value)

OUTPUT PAIRS: (key, value) (key, value) (key, value) (zero or more output pairs)

Only specify operations on key-value pairs!

(each “elephant” works on one input pair; it doesn’t know other elephants exist)

Map-Reduce: Map phase, word-count example

(line1, “Hello there.”) -> (“hello”, 1), (“there”, 1)

(line2, “Why, hello.”) -> (“why”, 1), (“hello”, 1)
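
The example above emits lowercase words without punctuation, which a bare `value.split()` would not do on its own. A minimal sketch of a mapper that reproduces those pairs, assuming words are normalized by lowercasing and stripping surrounding punctuation (the name `normalizing_mapper` and the normalization rule are illustrative assumptions, not part of the talk):

```python
import string


def normalizing_mapper(key, value):
    """Emit (word, 1) per word, lowercased, with surrounding punctuation removed."""
    for token in value.split():
        # Assumed normalization: "Hello" -> "hello", "there." -> "there"
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that consisted only of punctuation
            yield word, 1


# The two input pairs from the slide
pairs = list(normalizing_mapper("line1", "Hello there."))
pairs += list(normalizing_mapper("line2", "Why, hello."))
print(pairs)
```

Each call sees only its own input pair, which is what lets the map phase run on many machines at once.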

Map-Reduce: Sort phase

(key1, value289)(key1, value43)(key1, value3)

(key2, value512)(key2, value11)(key2, value67)

...

...

Map-Reduce: Sort phase, word-count example

Before sorting: (“hello”, 1), (“there”, 1), (“why”, 1), (“hello”, 1)

After sorting: (“hello”, 1), (“hello”, 1), (“there”, 1), (“why”, 1)
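
The sort phase can be imitated on one machine by sorting the pairs by key and collecting the values for each key. This is only a single-process sketch; `sort_and_group` is a hypothetical helper name, and a real cluster performs this step across machines:

```python
from itertools import groupby
from operator import itemgetter


def sort_and_group(pairs):
    """Sort (key, value) pairs by key, then gather the values for each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


# Output of the map phase in the word-count example
map_output = [("hello", 1), ("there", 1), ("why", 1), ("hello", 1)]
grouped = sort_and_group(map_output)
print(grouped)
```

After this step every key appears once with all of its values, which is the input shape the reduce phase expects.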

Map-Reduce: Reduce phase

(key1, value289) (key1, value43) (key1, value3) -> (key1, output1)

...

Map-Reduce: Reduce phase, word-count example

(“hello”, 1), (“hello”, 1) -> (“hello”, 2)

(“there”, 1) -> (“there”, 1)

(“why”, 1) -> (“why”, 1)

Map-Reduce: Code for word-count

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

Seems like too much work for a word-count!

Map-Reduce: Imagine word-count on the Web

Map-Reduce: The main advantage

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

With Hadoop, this very same code could run on the entire Web! (In theory, at least)
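
To see the three phases work together without a cluster, the mapper and reducer can be driven by a small single-process loop. The sketch below assumes nothing about Hadoop itself; `run_local` is a hypothetical stand-in for the framework, and the input lines are already lowercase so that `value.split()` is enough:

```python
from collections import defaultdict


def mapper(key, value):
    for word in value.split():
        yield word, 1


def reducer(key, values):
    yield key, sum(values)


def run_local(mapper, reducer, records):
    """Hypothetical local driver: map every record, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:                      # map phase
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    results = []
    for key in sorted(groups):                      # sort phase
        results.extend(reducer(key, groups[key]))   # reduce phase
    return results


records = [("line1", "hello there"), ("line2", "why hello")]
counts = run_local(mapper, reducer, records)
print(counts)
```

The mapper and reducer never reference the driver, so swapping the local loop for a distributed framework does not require changing them.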

HDFS: Hadoop Distributed File System

Data

(chunks of data on computers)

(each chunk replicated more than once for reliability)

HDFS: Hadoop Distributed File System

(key1, value1) (key2, value2) ...

(key1, value1) (key2, value2) ...

Computation is local to the data

Key-value pairs processed independently in parallel
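
The idea of chunking and replication can be illustrated with a toy model. This is not HDFS's actual placement policy: the round-robin rule, the node names, the chunk size, and the replication factor of 3 are all assumptions chosen for the example.

```python
def split_into_chunks(data, chunk_size):
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks, nodes, replication=3):
    """Toy placement: put each chunk on `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        chunk_id: [nodes[(chunk_id + r) % len(nodes)] for r in range(replication)]
        for chunk_id in range(num_chunks)
    }


def all_chunks_readable(placement, failed_nodes):
    """True if every chunk still has at least one replica on a live node."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )


nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical machines
chunks = split_into_chunks(b"x" * 10, chunk_size=4)
placement = place_replicas(len(chunks), nodes, replication=3)
print(placement)
print(all_chunks_readable(placement, failed_nodes={"node-a", "node-b"}))
```

With three replicas per chunk, losing any two nodes still leaves one copy of every chunk, and a map task can be scheduled on whichever live node holds the chunk it needs.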

HDFS: Inspired by the Google File System

Hadoop Map-Reduce and HDFS: Advantages

• Distribute data and computation

• Computation local to data avoids network overload

• Tasks are independent

• Easy to handle partial failures - entire nodes can fail and restart

• Avoid crawling horrors of failure-tolerant synchronous distributed systems

• Speculative execution to work around stragglers

• Linear scaling in the ideal case

• Designed for cheap, commodity hardware

• Simple programming model

• The “end-user” programmer only writes map-reduce tasks
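
Because tasks are independent, a failed task can simply be run again without disturbing the others. The sketch below simulates that idea in one process; `run_with_retries` and `make_flaky` are hypothetical names, and a `RuntimeError` stands in for a node failure. It does not model speculative execution.

```python
def run_with_retries(tasks, max_attempts=3):
    """Run independent tasks; re-run only the ones that raise."""
    results, attempts = {}, {}
    for task_id, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = task()
                break  # success: move on to the next task
            except RuntimeError:
                if attempt == max_attempts:
                    raise  # give up after the last allowed attempt
    return results, attempts


def make_flaky(value, failures):
    """Hypothetical task that fails `failures` times, then returns `value`."""
    state = {"remaining": failures}

    def task():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated node failure")
        return value

    return task


tasks = {
    "chunk-0": make_flaky(10, failures=0),
    "chunk-1": make_flaky(20, failures=2),
}
results, attempts = run_with_retries(tasks)
print(results, attempts)
```

Only the failing task is repeated; the result of `chunk-0` is kept as-is, which is the property that makes partial failures cheap to handle.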

Hadoop Map-Reduce and HDFS: Disadvantages

• Still rough - software under active development

• e.g. HDFS only recently added support for append operations

• Programming model is very restrictive

• Lack of central data can be frustrating

• “Joins” of multiple datasets are tricky and slow

• No indices! Often, entire dataset gets copied in the process

• Cluster management is hard (debugging, distributing software, collecting logs...)

• Still single master, which requires care and may limit scaling

• Managing job flow isn’t trivial when intermediate data should be kept

• Optimal configuration of nodes not obvious (# mappers, # reducers, mem. limits)
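
One common way to express a join in this model is a reduce-side join: the mapper tags every record with its source table and keys it by the join column, and the reducer pairs up records that share a key. The sketch below is an illustration with made-up tables (`users`, `orders`) and a local grouping loop in place of the framework; note that every record from both datasets passes through the sort phase, which is part of why joins are slow.

```python
from collections import defaultdict


def join_mapper(key, value):
    """Key each record by user id and tag it with the table it came from."""
    table, user_id, payload = value
    yield user_id, (table, payload)


def join_reducer(key, values):
    """Pair every user name with every order that shares the same user id."""
    names = [payload for table, payload in values if table == "users"]
    orders = [payload for table, payload in values if table == "orders"]
    for name in names:
        for order in orders:
            yield key, (name, order)


# Hypothetical records from two datasets, mixed together as map input
records = [
    (0, ("users", 1, "ada")),
    (1, ("users", 2, "bob")),
    (2, ("orders", 1, "book")),
    (3, ("orders", 1, "pen")),
    (4, ("orders", 2, "lamp")),
]

groups = defaultdict(list)
for key, value in records:
    for out_key, out_value in join_mapper(key, value):
        groups[out_key].append(out_value)

joined = [row for k in sorted(groups) for row in join_reducer(k, groups[k])]
print(joined)
```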

Getting started: Installation options

• Cloudera virtual machine

• Your own virtual machine (install Ubuntu in VirtualBox, which is free)

• Elastic MapReduce on EC2

• StarCluster with Hadoop on EC2

• Cloudera’s distribution of Hadoop on EC2

• Install Cloudera’s distribution of Hadoop on your own machine

• Available for RPM and Debian deployments

• Or download Hadoop directly from http://hadoop.apache.org/

Getting started: Language choices

• Hadoop is written in Java

• However, Hadoop Streaming allows mappers and reducers in any language!

• Binary data is a little tricky with Hadoop Streaming

• Could use base64 encoding, but TypedBytes are much better

• For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo

• The Python word-count example and others come with Dumbo

• Dumbo makes binary data with TypedBytes easy

• Also consider Hadoopy: https://github.com/bwhite/hadoopy
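
With Hadoop Streaming, a mapper or reducer is an ordinary program that reads lines and writes lines, with key and value separated by a tab by default. The sketch below shows word-count in that style as plain functions over iterables of lines so it can be tried without a cluster; the function names are illustrative, and wiring them to `sys.stdin` and `print` is left as an assumption about how they would be deployed.

```python
from itertools import groupby


def stream_mapper(lines):
    """Turn each input line into 'word<TAB>1' output lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def stream_reducer(sorted_lines):
    """Sum the counts of consecutive lines that share the same key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local stand-in for the pipeline: map, sort by line, reduce
mapped = list(stream_mapper(["hello there\n", "why hello\n"]))
reduced = list(stream_reducer(sorted(mapped)))
print(reduced)
```

The reducer relies on its input arriving sorted by key, so the same logic can be checked locally by placing a sort step between the two functions, as done here.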

Useful resources and tips

• The Hadoop homepage: http://hadoop.apache.org/

• Cloudera: http://cloudera.com/

• Dumbo: http://wiki.github.com/klbostee/dumbo

• Hadoopy: https://github.com/bwhite/hadoopy

• Amazon Elastic Compute Cloud Getting Started Guide: http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/

• Always test locally on a tiny dataset before running on a cluster!

...

Thanks for your attention!