Date posted: 18-Jul-2015
Category: Technology
Uploaded by: akram-al-kouz
The Next Frontier for Innovation, Competition and Productivity
Big Data
Hadoop
Dr. Akram Alkouz
Princess Sumaya University for Technology
• What is Hadoop
• Terminology Review
• Hadoop Architecture
• HDFS
• MapReduce
• Type of Nodes
• Topology awareness
• Writing a file to HDFS
• Open source project
• Written in Java
• Optimized to handle:
• Massive amounts of structured, unstructured, and semi-structured data through parallelism on inexpensive commodity hardware
• Great performance
• Not for Online Transaction Processing (OLTP) or Online Analytical Processing (OLAP); it is good for batch processing of Big Data
• Current version: 2.6
• Hadoop is a framework that clusters many
machines (Nodes) to analyze a large data set.
• Hadoop consists of 2 components:
• Hadoop Distributed File System (HDFS)
• MapReduce
• A Hadoop cluster is a set of machines (Nodes) running HDFS and MapReduce
• Reliability is provided through replication (HDFS)
• A Hadoop cluster can range from a single Node to thousands of Nodes
[Diagram: a Hadoop Cluster made up of Rack 1 … Rack n, each rack containing Node 1 … Node n]
• The Hadoop file system runs on top of the existing file system
• Handles very large files with streaming data access patterns
• Uses Blocks to store a file or parts of a file
• Advantages of Blocks:
1. Fixed size makes it easy to calculate how many fit on a disk
2. A file can be larger than any single disk in the cluster
3. If a file, or any chunk of it, is smaller than the block size, only the needed space is used
4. Fits well with replication to provide fault tolerance
• Example: a 420 MB file is split (with 128 MB blocks) as:
128 MB | 128 MB | 128 MB | 36 MB
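The splitting rule above can be sketched in a few lines (a sketch only; the function name and the use of Python are ours, not Hadoop's):

```python
# Sketch (not Hadoop code): split a file size into fixed-size HDFS blocks.
# The last block occupies only the space it actually needs.
def split_into_blocks(file_size_mb, block_size_mb=128):
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(420))  # [128, 128, 128, 36]
```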
• The data processing component of Hadoop
• Based on Google's MapReduce technology
• Distributes tasks across the nodes
• Tasks on each node process data locally
• MapReduce Program:
• Map function
• Reduce function
• MapReduce job is divided into phases that run in
parallel:
• Map phase
• Reduce phase
• In between Map and Reduce there is a shuffle and
sort phase
• We have the following text file:
Hello
The car is the target
How are you
Are you ok
• We need to find the frequency of each word
• The output should look like this:
(hello,1), (the,2), (car,1), (is,1), (target,1), (how,1), (are,2), (you,2), (ok,1)
• In MapReduce every input is viewed as a Key-Value pair
• In this example:
Key (line number) | Value (line content)
1 | Hello
2 | The car is the target
3 | How are you
4 | Are you ok
• Map program (Mapper):
For each word in Value
Emit(word,1)
Input splits: Hello | The car is the target | How are you | Are you ok
              Map     Map                     Map           Map
Map outputs:
(hello,1)
(the,1), (car,1), (is,1), (the,1), (target,1)
(how,1), (are,1), (you,1)
(are,1), (you,1), (ok,1)
Shuffle and Sort
(hello,1), (the,1,1), (car,1), (is,1), (target,1), (how,1), (are,1,1), (you,1,1), (ok,1)
• Reduce program (Reducer):
For each v in Value
sum = sum + v
Emit(k, sum)
Output:
(hello,1), (the,2), (car,1), (is,1), (target,1), (how,1), (are,2), (you,2), (ok,1)
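The whole word-count job above — map, shuffle and sort, reduce — can be simulated in a single process (a sketch; real Hadoop distributes each phase across nodes):

```python
from itertools import groupby

# Input: each line is a Value, its line number the Key.
lines = ["Hello", "The car is the target", "How are you", "Are you ok"]

# Map phase: for each word in the value, emit (word, 1).
mapped = [(word.lower(), 1) for line in lines for word in line.split()]

# Shuffle and sort: sort by key and group each key's values together.
mapped.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=lambda kv: kv[0])}

# Reduce phase: for each key, sum the values and emit (key, sum).
counts = {k: sum(vs) for k, vs in grouped.items()}
print(counts)  # counts['the'] == 2, counts['are'] == 2, counts['hello'] == 1
```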
• Hadoop 1
• HDFS Nodes:
• NameNode
• DataNode
• MapReduce Nodes:
• JobTracker
• TaskTracker
• Secondary NameNode
• Backup Node
• Checkpoint Node
[Diagram: a Hadoop Cluster with a Client; the Master runs the NameNode and JobTracker; the Slaves run the DataNodes and TaskTrackers]
• NameNode:
• Only one per Hadoop cluster
• Manages the filesystem namespace and
metadata
• A single point of failure, so use hardware with good specifications (large memory)
• DataNode:
• Many per Hadoop cluster
• Manages Blocks of data and serves them to clients
• Periodically reports the list of Blocks it stores to the NameNode
• Fault tolerant, so inexpensive commodity hardware can be used
• JobTracker Node:
• One per Hadoop cluster
• Receives job requests submitted by client
• Schedules and monitors MapReduce jobs (Map
Tasks and Reduce Tasks) on TaskTrackers
• TaskTracker Node:
• Many per Hadoop cluster
• Uses a JVM to execute MapReduce operations in parallel
• Process on the same Node
• TaskTracker & DataNode on the same Node
[Diagram: block B1 is on Rack1/Node1, and the Map task runs on that same Node]
• Process on the same Rack
• TaskTracker & DataNode on the same Rack
[Diagram: block B1 is on Rack1/Node1, and the Map task runs on a different Node in Rack1]
• Process in the same data center
• TaskTracker & DataNode in the same physical data center
[Diagram: block B1 is on Rack1/Node1, and the Map task runs on a Node in Rack2, within the same data center]
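The three preferences above (same Node, then same Rack, then same data center) amount to a simple placement rule. A minimal sketch, with invented names and a toy topology:

```python
# Sketch (assumed scheduling order, simplified): run a Map task where its
# data block lives, else on the same rack, else anywhere in the data center.
def place_task(block_location, free_nodes, rack_of):
    # block_location: node holding the block; rack_of: node -> rack map
    if block_location in free_nodes:
        return block_location, "node-local"
    same_rack = [n for n in free_nodes if rack_of[n] == rack_of[block_location]]
    if same_rack:
        return same_rack[0], "rack-local"
    return free_nodes[0], "off-rack"

rack_of = {"r1n1": "rack1", "r1n2": "rack1", "r2n1": "rack2"}
print(place_task("r1n1", ["r1n2", "r2n1"], rack_of))  # ('r1n2', 'rack-local')
```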
[Diagram sequence: Writing a file to HDFS. An HDFS Client writes File1 (blocks B1, B2, B3) to a cluster of Rack1, Rack2, and Rack3, each containing Node1–Node3]
1. The Client sends a create request to the NameNode, which checks that: (1) the file does not exist, and (2) the Client has permission to create the file.
2. B1 is written to a DataNode and forwarded along a pipeline of DataNodes until the default 3 replicas are stored.
3. An acknowledgment travels back through the pipeline to the Client.
4. B2 and B3 are written in the same way, each with 3 replicas.
5. The Client sends a "Complete" signal to the NameNode.
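The write flow above can be sketched as follows (hypothetical names; a real pipeline streams packets between DataNodes and handles failures):

```python
# Sketch: each DataNode in the pipeline stores the block and forwards it
# on; the acknowledgment returns to the client from the end of the chain.
def write_block(block, pipeline, storage):
    for node in pipeline:
        storage.setdefault(node, []).append(block)  # store, then forward
    return "ack"  # acknowledgment back to the client

storage = {}
for block in ["B1", "B2", "B3"]:  # File1's blocks
    ack = write_block(block, ["dn1", "dn2", "dn3"], storage)  # 3 replicas
print(storage)  # each of the 3 DataNodes holds a replica of B1, B2, B3
```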
• Scalability:
• Maximum cluster size: 4,000 DataNodes
• Maximum concurrent tasks: 4,000
• A single JobTracker synchronizes up to 4,000 TaskTrackers
• The NameNode is a single point of failure:
• A failure kills all running jobs
• All jobs need to be re-submitted
• No other paradigms:
• Only MapReduce
• Iterative processing is 10 times slower, and graph processing is poor
• The key design concept is to:
• Split up the two main functions of the JobTracker:
• Cluster resource management
• Application life-cycle management
• Make MapReduce just one of the applications inside Hadoop
• Key features of YARN:
1. Resource Manager
2. Application Master
3. Scales up to 10,000 nodes
4. Supports multiple frameworks, such as MapReduce, graph processing, ...
Terminologies:
1. Application:
• A job submitted to the framework, e.g. a MapReduce job
2. Container: the basic unit of allocation, e.g. Container A = 2 GB, 1 CPU
3. Resource Manager:
• Global resource scheduler
• Hierarchical queues
4. Node Manager:
• Per-machine
• Manages the life-cycle of containers
• Monitors container resources
5. Application Master:
• Per-application
• Manages application scheduling and task execution
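The terminology fits together roughly as follows — a toy sketch (class and field names are ours, not YARN's) of a Resource Manager handing out containers from per-node capacity:

```python
# Sketch (hypothetical, simplified): a YARN-style Resource Manager grants
# containers -- the basic unit of allocation -- from NodeManager capacity.
class ResourceManager:
    def __init__(self, node_memory_gb):
        self.free = dict(node_memory_gb)  # NodeManager -> free memory (GB)

    def allocate(self, mem_gb):
        for node, free in self.free.items():
            if free >= mem_gb:
                self.free[node] -= mem_gb
                return {"node": node, "memory_gb": mem_gb}  # a container
        return None  # no node has enough free memory

rm = ResourceManager({"node1": 4, "node2": 8})
c1 = rm.allocate(2)  # fits on node1
c2 = rm.allocate(6)  # only node2 has 6 GB free
print(c1, c2)
```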