[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)

Introduction to Hadoop

Zak Stone <zak@eecs.harvard.edu>

PhD candidate, Harvard School of Engineering and Applied Sciences

Advisor: Todd Zickler (Computer Vision)

Hadoop distributes data and computation across a large number of computers.

Outline

1. Why should you care about Hadoop?

2. What exactly is Hadoop?

3. An overview of Hadoop Map-Reduce

4. The Hadoop Distributed File System (HDFS)

5. Hadoop advantages and disadvantages

6. Getting started with Hadoop

7. Useful resources

Why should you care? - Lots of Data

LOTS OF DATA EVERYWHERE

Why should you care? - Even Grocery Stores Care

Why Hadoop for big data?

• Most credible open-source toolset for large-scale, general-purpose computing

• Backed by ,

• Used by , , many others

• Increasing support from web services

• Hadoop closely imitates infrastructure developed by Google

• Hadoop processes petabytes daily, right now

Why Hadoop for big data?

• Don’t use Hadoop if your data and computation fit on one machine

• Getting easier to use, but still complicated

DISCLAIMER

http://www.wired.com/gadgetlab/2008/07/patent-crazines/

What exactly is Hadoop?

• Actually a growing collection of subprojects; focus on two right now

An overview of Hadoop Map-Reduce

Traditional computing (one computer)

Hadoop (many computers)

An overview of Hadoop Map-Reduce

(Actually more like this)

(many computers, little communication, stragglers and failures)

Map-Reduce: Three phases

1. Map

2. Sort

3. Reduce

Map-Reduce: Map phase

Only specify operations on key-value pairs!

INPUT PAIR: (key, value)

OUTPUT PAIRS: (key, value) (key, value) (key, value) (zero or more output pairs)

(each “elephant” works on an input pair; doesn’t know other elephants exist)

Map-Reduce: Map phase, word-count example

INPUT PAIR: (line1, “Hello there.”)

OUTPUT PAIRS: (“hello”, 1) (“there”, 1)

INPUT PAIR: (line2, “Why, hello.”)

OUTPUT PAIRS: (“why”, 1) (“hello”, 1)
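The output pairs in this example are lowercased and have their punctuation removed, which a plain `split()` would not do on its own. A minimal sketch of a mapper that would produce them, assuming a simple regular-expression tokenizer (the tokenizer is an illustrative choice, not something the slides specify):

```python
import re

def mapper(key, value):
    # Assumed normalization: lowercase the line, then keep runs of letters
    # and apostrophes as words. The slides do not say how words are cleaned.
    for word in re.findall(r"[a-z']+", value.lower()):
        yield word, 1

lines = [("line1", "Hello there."), ("line2", "Why, hello.")]

# Each input pair is mapped on its own; no call sees any other input pair.
pairs = [pair for key, value in lines for pair in mapper(key, value)]
print(pairs)
```

Note that the mapper ignores its input key entirely; for word-count only the line's text matters.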

Map-Reduce: Sort phase

(key1, value289) (key1, value43) (key1, value3)

Map-Reduce: Sort phase, word-count example

(“hello”, 1)

(“there”, 1)

(“why”, 1)

(“hello”, 1)
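On a single machine, the effect of the sort phase on these four pairs can be imitated by sorting on the key and collecting the values that share it. This is only a sketch: `sort_phase` is a hypothetical helper name, and Hadoop does this work across many machines rather than in one list.

```python
from itertools import groupby
from operator import itemgetter

def sort_phase(pairs):
    # Sort by key so that equal keys become adjacent, then gather the
    # values for each key into one list.
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

map_output = [("hello", 1), ("there", 1), ("why", 1), ("hello", 1)]
grouped = sort_phase(map_output)
print(grouped)
```

After this step every key appears once, paired with all of its values, which is the shape the reduce phase expects.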

Map-Reduce: Reduce phase

(key1, output1)

Map-Reduce: Reduce phase, word-count example

INPUT PAIRS: (“hello”, 1) (“there”, 1) (“why”, 1) (“hello”, 1)

OUTPUT PAIRS: (“hello”, 2) (“there”, 1) (“why”, 1)

Map-Reduce: Code for word-count

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

Seems like too much work for a word-count!

Map-Reduce: Imagine word-count on the Web

Map-Reduce: The main advantage

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

With Hadoop, this very same code could run on the entire Web! (In theory, at least)
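Before going anywhere near the Web, the same mapper and reducer can be exercised on one machine with a small in-memory harness. The sketch below is an assumption-laden stand-in: `run_local` is a hypothetical helper, not part of Hadoop or Dumbo, and it provides none of the distribution or fault tolerance that Hadoop does.

```python
from collections import defaultdict

def mapper(key, value):
    for word in value.split():
        yield word, 1

def reducer(key, values):
    yield key, sum(values)

def run_local(mapper, reducer, records):
    # Map: apply the mapper to each input pair independently and
    # collect the emitted values under their keys.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Sort: visit keys in sorted order. Reduce: one reducer call per key.
    results = []
    for out_key in sorted(groups):
        results.extend(reducer(out_key, groups[out_key]))
    return results

records = [(0, "hello there"), (1, "why hello")]
counts = run_local(mapper, reducer, records)
print(counts)
```

The mapper and reducer never refer to the harness, which is the point of the slide: only the driver changes when the job moves from one machine to a cluster.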

HDFS: Hadoop Distributed File System

(chunks of data on computers)

(each chunk replicated more than once for reliability)

HDFS: Hadoop Distributed File System

(key1, value1) (key2, value2) ...

Computation is local to the data

Key-value pairs processed independently in parallel
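The idea of chunking and replication can be illustrated with a toy model. Everything here is an assumption made for illustration: the function names, the tiny chunk size, and the round-robin placement are hypothetical and do not reflect HDFS's actual placement policy.

```python
def place_chunks(data, chunk_size, nodes, replicas):
    # Split the data into fixed-size chunks.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    placement = {}
    for index in range(len(chunks)):
        # Toy round-robin policy: each replica of a chunk goes to a
        # different node (requires replicas <= len(nodes)).
        placement[index] = [nodes[(index + r) % len(nodes)]
                            for r in range(replicas)]
    return chunks, placement

def readable_after_failure(placement, failed_node):
    # The data stays readable if every chunk still has at least one
    # replica on a node that has not failed.
    return all(any(node != failed_node for node in holders)
               for holders in placement.values())

chunks, placement = place_chunks("abcdefghij", chunk_size=4,
                                 nodes=["n0", "n1", "n2"], replicas=2)
print(placement)
print(readable_after_failure(placement, "n1"))
```

With two replicas per chunk, losing any single node in this toy model leaves every chunk readable; with one replica it would not.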

HDFS: Inspired by the Google File System

Hadoop Map-Reduce and HDFS: Advantages

• Distribute data and computation

• Computation local to data avoids network overload

• Tasks are independent

• Easy to handle partial failures - entire nodes can fail and restart

• Avoid crawling horrors of failure-tolerant synchronous distributed systems

• Speculative execution to work around stragglers

• Linear scaling in the ideal case

• Designed for cheap, commodity hardware

• Simple programming model

• The “end-user” programmer only writes map-reduce tasks

Hadoop Map-Reduce and HDFS: Disadvantages

• Still rough - software under active development

• e.g. HDFS only recently added support for append operations

• Programming model is very restrictive

• Lack of central data can be frustrating

• “Joins” of multiple datasets are tricky and slow

• No indices! Often, entire dataset gets copied in the process

• Cluster management is hard (debugging, distributing software, collecting logs...)

• Still single master, which requires care and may limit scaling

• Managing job flow isn’t trivial when intermediate data should be kept

• Optimal configuration of nodes not obvious (# mappers, # reducers, mem. limits)
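To make the point about joins concrete: with no indices, both datasets have to pass through the map and sort phases so that matching records meet at the same reducer. A minimal sketch of one common approach, a reduce-side join, run in memory; the `users` and `orders` datasets and the tagging scheme are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def mapper(source, record):
    # Tag each record with the dataset it came from so the reducer
    # can tell the two sides of the join apart.
    user_id, payload = record
    yield user_id, (source, payload)

def reducer(user_id, tagged_values):
    names = [p for tag, p in tagged_values if tag == "users"]
    items = [p for tag, p in tagged_values if tag == "orders"]
    # Emit every matching combination (an inner join on user_id).
    for name in names:
        for item in items:
            yield user_id, (name, item)

users = [(1, "ada"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]

# Stand-in for the map and sort phases: every record of BOTH datasets
# is mapped and grouped by key before any joining can happen.
groups = defaultdict(list)
for source, dataset in (("users", users), ("orders", orders)):
    for record in dataset:
        for key, value in mapper(source, record):
            groups[key].append(value)

joined = [row for key in sorted(groups) for row in reducer(key, groups[key])]
print(joined)
```

Even though only user 1 has matching records, all five input records are shuffled, which hints at why joins on large datasets are slow.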

Getting started: Installation options

• Cloudera virtual machine

• Your own virtual machine (install Ubuntu in VirtualBox, which is free)

• Elastic MapReduce on EC2

• StarCluster with Hadoop on EC2

• Cloudera’s distribution of Hadoop on EC2

• Install Cloudera’s distribution of Hadoop on your own machine

• Available for RPM and Debian deployments

• Or download Hadoop directly from http://hadoop.apache.org/

Getting started: Language choices

• Hadoop is written in Java

• However, Hadoop Streaming allows mappers and reducers in any language!

• Binary data is a little tricky with Hadoop Streaming

• Could use base64 encoding, but TypedBytes are much better

• For Python, try Dumbo: http://wiki.github.com/klbostee/dumbo

• The Python word-count example and others come with Dumbo

• Dumbo makes binary data with TypedBytes easy

• Also consider Hadoopy: https://github.com/bwhite/hadoopy
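Hadoop Streaming talks to mappers and reducers through standard input and output, one record per line, with a tab separating key from value by default. A minimal sketch of word-count written in that style, without Dumbo or Hadoopy; the function names are hypothetical, and the in-memory `sorted` call stands in for the sort that Hadoop performs between the two steps.

```python
from itertools import groupby

def map_lines(lines):
    # Emit one "word<TAB>1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(lines):
    # Reducer input arrives sorted by key, so equal keys are adjacent
    # and can be summed with a single pass.
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local dry run on a tiny dataset.
mapped = sorted(map_lines(["hello there", "why hello"]))
reduced = list(reduce_lines(mapped))
print(reduced)
```

In an actual Streaming job the two functions would live in separate scripts that read `sys.stdin` and print each output line. Because every record travels as a line of text, raw binary values cannot be passed this way unchanged, which is the difficulty the base64 and TypedBytes bullets refer to.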

Useful resources and tips

• The Hadoop homepage: http://hadoop.apache.org/

• Cloudera: http://cloudera.com/

• Dumbo: http://wiki.github.com/klbostee/dumbo

• Hadoopy: https://github.com/bwhite/hadoopy

• Amazon Elastic Compute Cloud Getting Started Guide: http://docs.amazonwebservices.com/AWSEC2/latest/GettingStartedGuide/

• Always test locally on a tiny dataset before running on a cluster!

Thanks for your attention!
