CS 626 Large Scale Data Science
Jun Zhang
January 30, 2020
Originally created by Dr. Licong Cui
Lecture 4 – Hadoop System
Outline
Hadoop Distributed File System (HDFS)
MapReduce
Hands On
Review: Basic Scalable Computing Concepts
Distributed File Systems
Scalable Computing over the Internet
Programming Models for Big Data
Hadoop Ecosystem
Hadoop Ecosystem – Layer Diagram
What is Hadoop?
Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.
Goals/Requirements
Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
High scalability and availability
Use commodity hardware (cheap!)
Fault tolerance
Move computation to data
Hadoop Architecture
Hadoop Architecture (cont.)
Figure: HDFS layer (Name Node, Data Nodes) and MapReduce layer (Job Tracker, Task Trackers).
Hadoop Distributed File System (HDFS)
Scalability: Split files into blocks across nodes for parallel access
Figure: a file's blocks A, B, C, D are distributed across Node 1 through Node 4.
Default block size: 128MB
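A file's block layout can be inspected programmatically. Below is a minimal sketch using the Hadoop FileSystem API, assuming a reachable HDFS; the path cs626/words.txt is only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    // Connect using the cluster configuration found on the classpath
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("cs626/words.txt"));
    // One BlockLocation per block: offset, length, and the hosting Data Nodes
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println(b);
    }
  }
}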
Hadoop Distributed File System (HDFS)
Reliability: Replication for fault tolerance
Figure: each block A, B, C, D of the file is stored on three of the four nodes, so no single node failure loses data.
Default: replicates 3 times
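The replication factor can also be changed per file. A minimal sketch with the same FileSystem API; the path and the factor are illustrative (3 is the HDFS default):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Ask HDFS to keep 3 replicas of each block of this file
    fs.setReplication(new Path("cs626/words.txt"), (short) 3);
  }
}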
Hadoop Architecture
HDFS Components
Name Node (admin/master)
Metadata
Manages blocks
Data Node (slave)
Actual data
Block storage
Backup Node (name node)
Checkpoints
HDFS Components (cont.)
Hadoop Rack Aware Replication
HDFS Name Node
Stores metadata for the files, like the directory structure of a typical FS.
Transaction log for file deletes/adds, etc.
Handles creation of more replica blocks when necessary after a Data Node failure
HDFS Data Node
Stores the actual data in HDFS
Notifies Name Node of what blocks it has
Replicates blocks 2x in local rack, 1x elsewhere
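Rack awareness depends on a cluster-provided topology mapping. A hedged sketch naming the standard Hadoop property (the script path is hypothetical; in practice this is set in core-site.xml on the Name Node rather than in client code):

import org.apache.hadoop.conf.Configuration;

public class RackConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Script that maps a node's address to its rack ID, e.g. /rack1
    conf.set("net.topology.script.file.name", "/etc/hadoop/rack-topology.sh");
  }
}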
Write Files to HDFS
Hadoop Rack Aware Replication
Read Files from HDFS
MapReduce
Programming model for the Hadoop ecosystem
Based on functional programming
Map = apply an operation to all elements
Reduce = summarize the elements with an aggregate operation
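As a loose analogy (plain Java, not Hadoop), the two functional steps look like this; the numbers are made up for illustration:

import java.util.List;

public class MapReduceAnalogy {
  public static void main(String[] args) {
    List<Integer> nums = List.of(1, 2, 3, 4);
    int sum = nums.stream()
        .map(x -> x * x)          // Map: apply an operation to every element
        .reduce(0, Integer::sum); // Reduce: summarize the results into one value
    System.out.println(sum);      // 30
  }
}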
MapReduce Engine
Job Tracker & Task Tracker
The Job Tracker splits the data up into smaller tasks ("Map") and sends them to the Task Tracker process on each node
Each Task Tracker reports job progress back to the Job Tracker, sends back data ("Reduce"), or requests new tasks
MapReduce Job Tracker
Runs on NameNode
Receives MapReduce execution requests from the client
Talks to NameNode to determine the location of the data
Finds the best TaskTracker nodes to execute tasks
Monitors individual TaskTrackers and reports the overall status of the job back to the client
When the JobTracker is down, HDFS remains functional, but new MapReduce jobs cannot be started and existing MapReduce jobs are halted
MapReduce Task Tracker
Runs on DataNode
Executes Mapper and Reducer tasks assigned by the JobTracker
Constantly communicates with the JobTracker, signaling the progress of the task in execution
TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, the JobTracker reassigns its tasks to another node.
Heartbeats from DataNode
A DataNode sends heartbeats to the NameNode to report its status
The default interval is 3 seconds
If a DataNode does not send a heartbeat to the NameNode for ten minutes, the NameNode considers it out of service and the block replicas hosted by that DataNode unavailable
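Both numbers are configurable. A hedged sketch naming the standard HDFS properties with their default values (in practice these are set in hdfs-site.xml rather than in client code):

import org.apache.hadoop.conf.Configuration;

public class HeartbeatDefaults {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong("dfs.heartbeat.interval", 3);                       // seconds between heartbeats
    conf.setLong("dfs.namenode.heartbeat.recheck-interval", 300000); // ms; drives the ~10-minute timeout
  }
}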
MapReduce Model
Figure: Map → Sort & Shuffle → Reduce
Word Count Example
Given a large file of words
Count the number of times each distinct word appears in the file
Sample application
Analyze web server logs to find popular URLs
Map + Reduce
Map
Accepts an input <key, value> pair
Emits intermediate <key, value> pairs
Reduce
Accepts an intermediate <key, list(value)> pair
Emits output <key, value> pairs
Map: (k1, v1) -> list(k2, v2)
Reduce: (k2, list(v2)) -> list(k3, v3)
Word Count using MapReduce (Pseudocode)
map(key, value):
    // key: line offset (not used); value: line text
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
Word Count using MapReduce in Java
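Below is a minimal sketch of the classic Hadoop WordCount, modeled on the example in the Apache Hadoop MapReduce tutorial; it mirrors the pseudocode above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}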
Hands On Materials
Hadoop MapReduce Example
http://docs.cloudera.com/documentation/other/tutorial/CDH5/topics/Hadoop-Tutorial.html
Get this example to work on your machine
MapReduce Pros and Cons
MapReduce architecture provides
Automatic parallelization & distribution
Fault tolerance
I/O scheduling
Monitoring & status updates
MapReduce is not suitable for
Frequently changing data
Dependent tasks
Interactive analysis
MapReduce
Simplifies parallel programming
Best for applications with independent, data-parallel tasks
Hands On: Basic File Manipulation in HDFS
Create a directory in HDFS
hadoop fs -mkdir cs626
Copy a file to HDFS
hadoop fs -copyFromLocal words.txt cs626
List files in an HDFS directory
hadoop fs -ls cs626
Copy a file within HDFS
hadoop fs -cp cs626/words.txt cs626/words2.txt
Copy a file from HDFS
hadoop fs -copyToLocal cs626/words2.txt
Delete a file in HDFS
hadoop fs -rm cs626/words2.txt
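The same operations are available programmatically. A minimal sketch with the FileSystem API, reusing the paths above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HdfsBasics {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    fs.mkdirs(new Path("cs626"));                                   // create directory
    fs.copyFromLocalFile(new Path("words.txt"), new Path("cs626")); // copy file to HDFS
    for (FileStatus s : fs.listStatus(new Path("cs626"))) {         // list directory
      System.out.println(s.getPath());
    }
    fs.copyToLocalFile(new Path("cs626/words.txt"), new Path("words-local.txt")); // copy from HDFS
    fs.delete(new Path("cs626/words.txt"), false);                  // delete (non-recursive)
  }
}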
Hands On: Run the Word Count program
Execute the Word Count application
hadoop jar wordcount.jar cs626/words.txt cs626/output/
Copy the results from Word Count out of HDFS
hadoop fs -copyToLocal cs626/output/part-r-00000 local.txt
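The results can also be viewed in place, without copying them out of HDFS (assuming the job wrote to cs626/output):
hadoop fs -cat cs626/output/part-r-00000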
Hands On Materials
Create and Execute MapReduce in Eclipse
https://www.youtube.com/watch?v=VzKGdM4hc74
Build a MapReduce Code Using Maven in Eclipse
https://www.youtube.com/watch?v=JwnUl42-JSE
Apache Hadoop Main 2.9.1 API
https://hadoop.apache.org/docs/current/api/
Reading Materials
The Hadoop Distributed File System
by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler
MapReduce: Simplified Data Processing on Large Clusters
by Jeffrey Dean and Sanjay Ghemawat