CS 626 Large Scale Data Science
Jun Zhang
January 30, 2020
Originally created by Dr. Licong Cui
Lecture 4 – Hadoop System
Outline
Hadoop Distributed File System (HDFS)
MapReduce
Hands On
Review: Basic Scalable Computing Concepts
Distributed File Systems
Scalable Computing over the Internet
Programming Models for Big Data
Hadoop Ecosystem
Hadoop Ecosystem – Layer Diagram
What is Hadoop?
Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.
Goals/Requirements
Abstract and facilitate the storage and processing of large and/or rapidly growing data sets
High scalability and availability
Use commodity hardware (cheap!)
Fault tolerance
Move computation to data
Hadoop Architecture
Hadoop Architecture (cont.)
Figure: HDFS layer (Name Node, Data Nodes) and MapReduce layer (Job Tracker, Task Trackers).
Hadoop Distributed File System (HDFS)
Scalability: Split files into blocks across nodes for parallel access
Figure: a file's blocks A, B, C, D are distributed across Node 1 through Node 4.
Default block size: 128MB
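A file's block layout can be inspected programmatically. Below is a minimal sketch using the Hadoop FileSystem API, assuming a reachable HDFS; the path cs626/words.txt is only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    // Connect using the cluster configuration found on the classpath
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("cs626/words.txt"));
    // One BlockLocation per block: offset, length, and the hosting Data Nodes
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println(b);
    }
  }
}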
Hadoop Distributed File System (HDFS)
Reliability: Replication for fault tolerance
Figure: each block A, B, C, D of the file is stored on three of the four nodes, so no single node failure loses data.
Default: replicates 3 times
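The replication factor can also be changed per file. A minimal sketch with the same FileSystem API; the path and the factor are illustrative (3 is the HDFS default):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Ask HDFS to keep 3 replicas of each block of this file
    fs.setReplication(new Path("cs626/words.txt"), (short) 3);
  }
}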
Hadoop Architecture
HDFS Components
Name Node (admin/master)
Metadata
Manages blocks
Data Node (slave)
Actual data
Block storage
Backup Node (name node)
Checkpoints
HDFS Components (cont.)
Hadoop Rack Aware Replication
HDFS Name Node
Stores metadata for the files, like the directory structure of a typical FS.
Transaction log for file deletes/adds, etc.
Handles creation of more replica blocks when necessary after a Data Node failure
HDFS Data Node
Stores the actual data in HDFS
Notifies Name Node of what blocks it has
Replicates blocks 2x in local rack, 1x elsewhere
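Rack awareness depends on a cluster-provided topology mapping. A hedged sketch naming the standard Hadoop property (the script path is hypothetical; in practice this is set in core-site.xml on the Name Node rather than in client code):

import org.apache.hadoop.conf.Configuration;

public class RackConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Script that maps a node's address to its rack ID, e.g. /rack1
    conf.set("net.topology.script.file.name", "/etc/hadoop/rack-topology.sh");
  }
}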
Write Files to HDFS
Hadoop Rack Aware Replication
Read Files from HDFS
MapReduce
Programming model for the Hadoop ecosystem
Based on functional programming
Map = apply an operation to all elements
Reduce = summarize the elements with an aggregate operation
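As a loose analogy (plain Java, not Hadoop), the two functional steps look like this; the numbers are made up for illustration:

import java.util.List;

public class MapReduceAnalogy {
  public static void main(String[] args) {
    List<Integer> nums = List.of(1, 2, 3, 4);
    int sum = nums.stream()
        .map(x -> x * x)          // Map: apply an operation to every element
        .reduce(0, Integer::sum); // Reduce: summarize the results into one value
    System.out.println(sum);      // 30
  }
}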
MapReduce Engine
Job Tracker & Task Tracker
The Job Tracker splits the data up into smaller tasks ("Map") and sends them to the Task Tracker process on each node
Each Task Tracker reports job progress back to the Job Tracker, sends back data ("Reduce"), or requests new tasks
MapReduce Job Tracker
Runs on NameNode
Receives MapReduce execution requests from the client
Talks to NameNode to determine the location of the data
Finds the best TaskTracker nodes to execute tasks
Monitors individual TaskTrackers and reports the overall status of the job back to the client
When the JobTracker is down, HDFS remains functional, but new MapReduce jobs cannot be started and existing MapReduce jobs are halted
MapReduce Task Tracker
Runs on DataNode
Executes Mapper and Reducer tasks assigned by the JobTracker
Constantly communicates with the JobTracker, signaling the progress of the task in execution
TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, the JobTracker reassigns its tasks to another node.
Heartbeats from DataNode
A DataNode sends heartbeats to the NameNode to report its status
The default interval is 3 seconds
If a DataNode does not send a heartbeat to the NameNode for ten minutes, the NameNode considers it out of service and the block replicas hosted by that DataNode unavailable
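Both numbers are configurable. A hedged sketch naming the standard HDFS properties with their default values (in practice these are set in hdfs-site.xml rather than in client code):

import org.apache.hadoop.conf.Configuration;

public class HeartbeatDefaults {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setLong("dfs.heartbeat.interval", 3);                       // seconds between heartbeats
    conf.setLong("dfs.namenode.heartbeat.recheck-interval", 300000); // ms; drives the ~10-minute timeout
  }
}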
MapReduce Model
Figure: Map → Sort & Shuffle → Reduce
Word Count Example
Given a large file of words
Count the number of times each distinct word appears in the file
Sample application
Analyze web server logs to find popular URLs
Map + Reduce
Map
Accepts an input <key, value> pair
Emits intermediate <key, value> pairs
Reduce
Accepts an intermediate <key, list(value)> pair
Emits output <key, value> pairs
Map: (k1, v1) -> list(k2, v2)
Reduce: (k2, list(v2)) -> list(k3, v3)
Word Count using MapReduce (Pseudocode)
map(key, value):
    // key: line offset (not used); value: line text
    for each word w in value:
        emit(w, 1)

reduce(key, values):
    // key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
Word Count using MapReduce in Java
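Below is a minimal sketch of the classic Hadoop WordCount, modeled on the example in the Apache Hadoop MapReduce tutorial; it mirrors the pseudocode above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input line
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: sum the counts emitted for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}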
Hands On Materials
Hadoop MapReduce Example
http://docs.cloudera.com/documentation/other/tutorial/CDH5/topics/Hadoop-Tutorial.html
Get this example to work on your machine
MapReduce Pros and Cons
MapReduce architecture provides
Automatic parallelization & distribution
Fault tolerance
I/O scheduling
Monitoring & status updates
MapReduce is not suitable for
Frequently changing data
Dependent tasks
Interactive analysis
MapReduce
Simplifies parallel programming
Best for applications with independent, data-parallel tasks
Hands On: Basic File Manipulation in HDFS
Create a directory in HDFS
hadoop fs -mkdir cs626
Copy a file to HDFS
hadoop fs -copyFromLocal words.txt cs626
List files in an HDFS directory
hadoop fs -ls cs626
Copy a file within HDFS
hadoop fs -cp cs626/words.txt cs626/words2.txt
Copy a file from HDFS
hadoop fs -copyToLocal cs626/words2.txt
Delete a file in HDFS
hadoop fs -rm cs626/words2.txt
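The same operations are available programmatically. A minimal sketch with the FileSystem API, reusing the paths above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HdfsBasics {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    fs.mkdirs(new Path("cs626"));                                   // create directory
    fs.copyFromLocalFile(new Path("words.txt"), new Path("cs626")); // copy file to HDFS
    for (FileStatus s : fs.listStatus(new Path("cs626"))) {         // list directory
      System.out.println(s.getPath());
    }
    fs.copyToLocalFile(new Path("cs626/words.txt"), new Path("words-local.txt")); // copy from HDFS
    fs.delete(new Path("cs626/words.txt"), false);                  // delete (non-recursive)
  }
}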
Hands On: Run the Word Count program
Execute the Word Count application
hadoop jar wordcount.jar cs626/words.txt cs626/output/
Copy the results from Word Count out of HDFS
hadoop fs -copyToLocal cs626/output/part-r-00000 local.txt
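The results can also be viewed in place, without copying them out of HDFS (assuming the job wrote to cs626/output):
hadoop fs -cat cs626/output/part-r-00000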
Hands On Materials
Create and Execute MapReduce in Eclipse
https://www.youtube.com/watch?v=VzKGdM4hc74
Build a MapReduce Code Using Maven in Eclipse
https://www.youtube.com/watch?v=JwnUl42-JSE
Apache Hadoop Main 2.9.1 API
https://hadoop.apache.org/docs/current/api/
Reading Materials
The Hadoop Distributed File System
by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler
MapReduce: Simplified Data Processing on Large Clusters
by Jeffrey Dean and Sanjay Ghemawat