Date posted: 18-Jul-2015
Category: Technology
Uploaded by: akram-al-kouz
The Next Frontier for Innovation, Competition and Productivity
Big Data
Hadoop
Dr. Akram Alkouz
Princess Sumaya University for Technology
• What is Hadoop
• Terminology Review
• Hadoop Architecture
• HDFS
• MapReduce
• Type of Nodes
• Topology awareness
• Writing a file to HDFS
• Open source project
• Written in Java
• Optimized to handle:
• Massive amounts of structured, unstructured, and semi-structured data through parallelism on inexpensive commodity hardware
• Great performance
• Not for Online Transaction Processing (OLTP) or Online Analytical Processing (OLAP); it is good for batch processing of Big Data
• Current version: 2.6
• Hadoop is a framework that clusters many
machines (Nodes) to analyze a large data set.
• Hadoop consists of 2 components:
• Hadoop Distributed File System (HDFS)
• MapReduce
• A Hadoop cluster is a set of machines (Nodes) running HDFS and MapReduce
• Reliability is provided through replication (HDFS)
• A Hadoop cluster can range from a single Node to thousands of Nodes
[Diagram: a Hadoop Cluster made up of Rack 1 … Rack n, each rack containing Node 1 … Node n]
• The Hadoop file system runs on top of the existing file system
• Handles very large files with streaming data access patterns
• Uses Blocks to store a file or parts of a file
• Advantages of Blocks:
1. Fixed size makes it easy to calculate how many fit on a disk
2. A file can be larger than any single disk in the cluster
3. If a file, or any chunk of it, is smaller than the block size, only the needed space is used
4. Fits well with replication to provide fault tolerance
• Example: a 420 MB file is split (with 128 MB blocks) as:
128 MB | 128 MB | 128 MB | 36 MB
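The splitting rule above can be sketched in a few lines (a sketch only; the function name and the use of Python are ours, not Hadoop's):

```python
# Sketch (not Hadoop code): split a file size into fixed-size HDFS blocks.
# The last block occupies only the space it actually needs.
def split_into_blocks(file_size_mb, block_size_mb=128):
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(420))  # [128, 128, 128, 36]
```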
• The data processing component of Hadoop
• Based on Google's MapReduce technology
• Distributes tasks across the nodes
• Tasks on each node process data locally
• MapReduce Program:
• Map function
• Reduce function
• MapReduce job is divided into phases that run in
parallel:
• Map phase
• Reduce phase
• In between Map and Reduce there is a shuffle and
sort phase
• We have the following text file:
Hello
The car is the target
How are you
Are you ok
• We need to find the frequency of each word
• The output should look like this:
(hello,1), (the,2), (car,1), (is,1), (target,1), (how,1), (are,2), (you,2), (ok,1)
• In MapReduce every input is viewed as a Key-Value pair
• In this example:
Key (line number) | Value (line content)
1 | Hello
2 | The car is the target
3 | How are you
4 | Are you ok
• Map program (Mapper):
For each word in Value
Emit(word,1)
Input splits: Hello | The car is the target | How are you | Are you ok
              Map     Map                     Map           Map
Map outputs:
(hello,1)
(the,1), (car,1), (is,1), (the,1), (target,1)
(how,1), (are,1), (you,1)
(are,1), (you,1), (ok,1)
Shuffle and Sort
(hello,1), (the,1,1), (car,1), (is,1), (target,1), (how,1), (are,1,1), (you,1,1), (ok,1)
• Reduce program (Reducer):
For each v in Value
sum = sum + v
Emit(k, sum)
Output:
(hello,1), (the,2), (car,1), (is,1), (target,1), (how,1), (are,2), (you,2), (ok,1)
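The whole word-count job above — map, shuffle and sort, reduce — can be simulated in a single process (a sketch; real Hadoop distributes each phase across nodes):

```python
from itertools import groupby

# Input: each line is a Value, its line number the Key.
lines = ["Hello", "The car is the target", "How are you", "Are you ok"]

# Map phase: for each word in the value, emit (word, 1).
mapped = [(word.lower(), 1) for line in lines for word in line.split()]

# Shuffle and sort: sort by key and group each key's values together.
mapped.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=lambda kv: kv[0])}

# Reduce phase: for each key, sum the values and emit (key, sum).
counts = {k: sum(vs) for k, vs in grouped.items()}
print(counts)  # counts['the'] == 2, counts['are'] == 2, counts['hello'] == 1
```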
• Hadoop 1
• HDFS Nodes:
• NameNode
• DataNode
• MapReduce Nodes:
• JobTracker
• TaskTracker
• Secondary NameNode
• Backup Node
• Checkpoint Node
[Diagram: a Hadoop Cluster with a Client; the Master runs the NameNode and JobTracker; the Slaves run the DataNodes and TaskTrackers]
• NameNode:
• Only one per Hadoop cluster
• Manages the filesystem namespace and
metadata
• A single point of failure, so use hardware with good specifications (large memory)
• DataNode:
• Many per Hadoop cluster
• Manages Blocks of data and serves them to clients
• Periodically reports the list of Blocks it stores to the NameNode
• Fault tolerant, so inexpensive commodity hardware can be used
• JobTracker Node:
• One per Hadoop cluster
• Receives job requests submitted by client
• Schedules and monitors MapReduce jobs (Map
Tasks and Reduce Tasks) on TaskTrackers
• TaskTracker Node:
• Many per Hadoop cluster
• Uses a JVM to execute MapReduce operations in parallel
• Process on the same Node
• TaskTracker & DataNode on the same Node
[Diagram: block B1 is on Rack1/Node1, and the Map task runs on that same Node]
• Process on the same Rack
• TaskTracker & DataNode on the same Rack
[Diagram: block B1 is on Rack1/Node1, and the Map task runs on a different Node in Rack1]
• Process in the same data center
• TaskTracker & DataNode in the same physical data center
[Diagram: block B1 is on Rack1/Node1, and the Map task runs on a Node in Rack2, within the same data center]
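The three preferences above (same Node, then same Rack, then same data center) amount to a simple placement rule. A minimal sketch, with invented names and a toy topology:

```python
# Sketch (assumed scheduling order, simplified): run a Map task where its
# data block lives, else on the same rack, else anywhere in the data center.
def place_task(block_location, free_nodes, rack_of):
    # block_location: node holding the block; rack_of: node -> rack map
    if block_location in free_nodes:
        return block_location, "node-local"
    same_rack = [n for n in free_nodes if rack_of[n] == rack_of[block_location]]
    if same_rack:
        return same_rack[0], "rack-local"
    return free_nodes[0], "off-rack"

rack_of = {"r1n1": "rack1", "r1n2": "rack1", "r2n1": "rack2"}
print(place_task("r1n1", ["r1n2", "r2n1"], rack_of))  # ('r1n2', 'rack-local')
```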
[Diagram sequence: Writing a file to HDFS. An HDFS Client writes File1 (blocks B1, B2, B3) to a cluster of Rack1, Rack2, and Rack3, each containing Node1–Node3]
1. The Client sends a create request to the NameNode, which checks that: (1) the file does not exist, and (2) the Client has permission to create the file.
2. B1 is written to a DataNode and forwarded along a pipeline of DataNodes until the default 3 replicas are stored.
3. An acknowledgment travels back through the pipeline to the Client.
4. B2 and B3 are written in the same way, each with 3 replicas.
5. The Client sends a "Complete" signal to the NameNode.
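The write flow above can be sketched as follows (hypothetical names; a real pipeline streams packets between DataNodes and handles failures):

```python
# Sketch: each DataNode in the pipeline stores the block and forwards it
# on; the acknowledgment returns to the client from the end of the chain.
def write_block(block, pipeline, storage):
    for node in pipeline:
        storage.setdefault(node, []).append(block)  # store, then forward
    return "ack"  # acknowledgment back to the client

storage = {}
for block in ["B1", "B2", "B3"]:  # File1's blocks
    ack = write_block(block, ["dn1", "dn2", "dn3"], storage)  # 3 replicas
print(storage)  # each of the 3 DataNodes holds a replica of B1, B2, B3
```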
• Scalability:
• Maximum cluster size: 4,000 DataNodes
• Maximum concurrent tasks: 4,000
• A single JobTracker synchronizes up to 4,000 TaskTrackers
• The NameNode is a single point of failure:
• A failure kills all running jobs
• All jobs need to be re-submitted
• No other paradigms:
• Only MapReduce
• Iterative processing is 10 times slower, and graph processing is poor
• The key design concept is to:
• Split up the two main functions of the JobTracker:
• Cluster resource management
• Application life-cycle management
• Make MapReduce just one of the applications inside Hadoop
• Key features of YARN:
1. Resource Manager
2. Application Master
3. Scales up to 10,000 nodes
4. Supports multiple frameworks, such as MapReduce, graph processing, ...
Terminologies:
1. Application:
• A job submitted to the framework, e.g. a MapReduce job
2. Container: the basic unit of allocation, e.g. Container A = 2 GB, 1 CPU
3. Resource Manager:
• Global resource scheduler
• Hierarchical queues
4. Node Manager:
• Per-machine
• Manages the life-cycle of containers
• Monitors container resources
5. Application Master:
• Per-application
• Manages application scheduling and task execution
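The terminology fits together roughly as follows — a toy sketch (class and field names are ours, not YARN's) of a Resource Manager handing out containers from per-node capacity:

```python
# Sketch (hypothetical, simplified): a YARN-style Resource Manager grants
# containers -- the basic unit of allocation -- from NodeManager capacity.
class ResourceManager:
    def __init__(self, node_memory_gb):
        self.free = dict(node_memory_gb)  # NodeManager -> free memory (GB)

    def allocate(self, mem_gb):
        for node, free in self.free.items():
            if free >= mem_gb:
                self.free[node] -= mem_gb
                return {"node": node, "memory_gb": mem_gb}  # a container
        return None  # no node has enough free memory

rm = ResourceManager({"node1": 4, "node2": 8})
c1 = rm.allocate(2)  # fits on node1
c2 = rm.allocate(6)  # only node2 has 6 GB free
print(c1, c2)
```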