
Single Node Cluster Using Hadoop


    Cloud Computing Using Hadoop

    Rahul Poddar 11500110119

    Santosh Kumar 11500110006

    Shubham Raj 11500110054

    Vinayak Raj 11500110019

    6th Semester, CSE-B, BPPIMT


    Outline

    Brief introduction of Cloud Computing

    Requirements for this project

    What is Hadoop and its properties

    What led to the development of Hadoop?

    MapReduce

    HDFS

    An example application on Hadoop


    What is cloud computing?

    Cloud computing is the use of computing resources (hardware and
    software) that are delivered as a service over a network (typically the Internet).

    The Cloud aims to cut costs, and help the users focus on their core business
    instead of being impeded by IT obstacles.

    The main enabling technologies for Cloud Computing are virtualization and
    autonomic computing.


    With cloud computing, other companies host your computers.


    Cloud Computing Architecture

    Software as a Service (SaaS)

    Platform as a Service (PaaS)

    Infrastructure as a Service (IaaS)

    These three services encapsulate the basic
    components of cloud computing.


    Software requirements for the Hadoop project

    Java requirements: Hadoop is a Java-based system. Recent versions of Hadoop
    require Sun Java 1.6.

    Operating system: Linux (e.g. Ubuntu 12.04 LTS) or Mac OS X. Hadoop can also be
    run on Windows, but Windows requires Cygwin to be installed.

    Installing Hadoop: Hadoop 1.0.3 or above, installed either single-node or multi-node.
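
    As a quick sanity check before installing Hadoop, a short illustrative Python
    snippet (not part of Hadoop; the script name and messages here are made up)
    can confirm that a working `java` is on the PATH:

    # check_java.py -- quick sanity check before installing Hadoop
    # (illustrative only; assumes `java` is on the PATH)
    import subprocess
    import sys

    try:
        # `java -version` prints its version banner on stderr
        out = subprocess.check_output(
            ["java", "-version"], stderr=subprocess.STDOUT
        ).decode("utf-8", "replace")
    except (OSError, subprocess.CalledProcessError):
        sys.exit("No working `java` found on PATH -- install a JDK (1.6+) first")

    print("Found Java:")
    print(out.strip())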


    Hardware requirements for Hadoop (small cluster, 5-50 nodes)

    Hadoop and HBase require two types of machines:

    1) Masters (the HDFS NameNode, the MapReduce JobTracker, and the HBase Master)

    2) Slaves (the HDFS DataNodes, the MapReduce TaskTrackers, and the HBase RegionServers)

    Two quad-core CPUs

    12 GB to 24 GB of RAM per node.


    Here comes Hadoop

    Hadoop is a scalable, fault-tolerant grid operating system for data
    storage and processing.

    Its scalability comes from the combination of:

    HDFS: self-healing, high-bandwidth clustered storage

    MapReduce: fault-tolerant distributed processing

    It operates on structured and unstructured data.


    Here comes Hadoop

    A large and active ecosystem (many
    developers and additions like HBase, Pig, and Hive)

    Open source under the Apache License

    http://wiki.apache.org/hadoop/


    Characteristics of Hadoop

    Commodity HW: add inexpensive servers

    Use replication across servers to deal with unreliable storage/servers

    Support for moving computation close to data

    Servers have 2 purposes: data storage and computation


    Need for Hadoop: Big data

    We live in the age of very large and complex data, called BIG DATA.

    IDC estimates that the total size of the digital universe is 1.8 zettabytes
    (one zettabyte = 10^21 bytes).

    That is roughly equivalent to each person in the world having one hard disk drive
    (about 250 GB each, assuming roughly 7 billion people).


    Need for Hadoop: Big data

    Every day, 2.5 quintillion (2.5 x 10^18) bytes of data are generated.

    90% of the world's total data has been generated in just the last two years.

    Such a large amount of ever-increasing data is getting difficult for traditional RDBMSs and
    grid computing systems to manage.


    Sources of Big data

    The New York Stock Exchange generates about one terabyte of new trade data per day.

    Facebook hosts approximately 10 billion photos, taking up about 1 petabyte of storage.

    The Large Hadron Collider at CERN, Geneva, produces about 15 petabytes of data per
    year.

    The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per
    month.


    Inefficiency and high expenses

    The high cost of high-end server computers and other proprietary hardware and
    software for processing and storing large amounts of data, as well as their
    maintenance, is unbearable for many industrial organisations. Upgrading and
    maintaining these servers to scale up their capacity also requires huge cost.


    Not Robust

    The traditional single-server architecture is not robust, because one large
    computer takes care of all the computing. If it fails or shuts down, the whole
    system breaks down and the enterprise incurs huge losses.

    Also, during repairs or upgrades the computer has to be switched off, and in the
    meantime no useful tasks are executed, so computations lag behind.


    MapReduce algorithm

    MapReduce is a programming model for processing large data sets,
    typically used to do distributed computing on clusters of computers.

    MapReduce gives regular programmers the ability to write parallel,
    distributed programs much more easily.

    MapReduce consists of two simple functions:

    map()

    reduce()


    MapReduce algorithm

    "Map" step: The master node takes the input, divides it into
    smaller sub-problems, and distributes them to worker nodes.

    A worker node may do this again in turn, leading to a multi-level
    tree structure.

    The worker node processes the smaller problem, and passes the
    answer back to its master node.


    MapReduce algorithm

    "Reduce" step: The master node collects the answers to all the sub-problems from
    the slaves.

    The master then combines the answers in some way to form the output: the answer
    to the problem it was originally trying to solve.
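
    A minimal local sketch of these two steps, using a word-count job as the
    example (plain Python only, for illustration; real Hadoop distributes the map
    and reduce work across nodes and performs the grouping itself):

    # wordcount_local.py -- toy simulation of the Map and Reduce steps
    # (illustrative only; a real cluster spreads this work over many machines)
    from collections import defaultdict

    def map_fn(_, text):
        # emit (word, 1) for every word in one chunk of input
        return [(word, 1) for word in text.split()]

    def reduce_fn(word, counts):
        # combine all counts for one word into a single total
        return (word, sum(counts))

    chunks = ["the quick brown fox", "the lazy dog", "the fox"]

    # "Map" step: apply map_fn to every input split
    intermediate = []
    for i, chunk in enumerate(chunks):
        intermediate.extend(map_fn(i, chunk))

    # Shuffle: group the intermediate values by key
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)

    # "Reduce" step: combine each group into the final answer
    for word in sorted(groups):
        print(reduce_fn(word, groups[word]))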


    MapReduce: High Level

    A MapReduce job is submitted by a client computer to the JobTracker running on
    the master node. The JobTracker assigns work to the TaskTrackers on the slave
    nodes, and each TaskTracker runs task instances to execute its share of the job.


    Some MapReduce Terminology

    Job: a full program, an execution of a Mapper and Reducer across a data set.

    Task: an execution of a Mapper or a Reducer on a slice of data,
    a.k.a. a Task-In-Progress (TIP).

    Task Attempt: a particular instance of an attempt to execute a task on a machine.


    Terminology Example

    Running WordCount across 20 files is one job.

    20 files to be mapped imply 20 map tasks + some number of reduce tasks.

    At least 20 map task attempts will be performed, and more if a machine crashes, etc.


    HDFS (Hadoop Distributed File System)

    The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
    commodity hardware.

    HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

    HDFS provides high-throughput access to application data and is suitable for applications that
    have large data sets.

    HDFS is part of the Apache Hadoop project, which originated as a subproject of Apache Lucene.


    HDFS Architecture

    Master-slave architecture

    DFS master (Namenode):

    Manages the filesystem namespace

    Maintains the file name to list-of-blocks + location mapping

    Manages block allocation/replication

    Checkpoints the namespace and journals namespace changes for reliability

    Controls access to the namespace

    DFS slaves (Datanodes) handle block storage:

    Store blocks using the underlying OS's files

    Clients access the blocks directly from the datanodes

    Periodically send block reports to the Namenode

    Periodically check block integrity
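
    To make the Namenode's bookkeeping concrete, here is a toy, in-memory sketch of
    the file-to-blocks-to-datanodes mapping (purely illustrative; the block size,
    node names, and naive round-robin placement are assumptions, and real HDFS also
    handles rack awareness, journaling, heartbeats, and re-replication):

    # toy_namenode.py -- illustrative model of the file -> blocks -> datanodes mapping
    # (not real HDFS code; block size and replication factor chosen for illustration)
    import itertools

    BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB blocks, the classic HDFS default
    REPLICATION = 3                 # default replication factor

    datanodes = ["dn1", "dn2", "dn3", "dn4"]
    rr = itertools.cycle(datanodes) # naive round-robin placement

    def allocate(file_name, file_size):
        """Split a file into blocks and assign each block to REPLICATION datanodes."""
        num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
        block_map = {}
        for i in range(num_blocks):
            block_id = "%s_blk_%d" % (file_name, i)
            block_map[block_id] = [next(rr) for _ in range(REPLICATION)]
        return block_map

    # a 200 MB file needs 4 blocks of 64 MB (the last one partially filled)
    for block, replicas in sorted(allocate("weather.txt", 200 * 1024 * 1024).items()):
        print("%s -> %s" % (block, replicas))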


    An Example: Weather Data Mining

    Weather sensors all across the globe are collecting climatic data.

    The data can be obtained from the National Climatic Data Centre
    (http://www.ncdc.noaa.gov/).

    We will focus only on temperature, for simplicity.

    The input will be data from the NCDC, which will be given as key-value pairs to map().

    The output given by reduce() will be the maximum temperature of each year.
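
    As a sketch of how those key-value pairs flow (the sample values below are made
    up, chosen to match the output shown on the "Running the program" slide; NCDC
    records store temperature in tenths of a degree Celsius):

    # dataflow_sketch.py -- shape of the key-value pairs in the weather example
    # (sample values are illustrative only)
    mapped = [("1949", 111), ("1949", 78), ("1950", 0), ("1950", 22), ("1950", -11)]

    # group the map() output by year, then reduce each group to its maximum
    by_year = {}
    for year, temp in mapped:
        by_year.setdefault(year, []).append(temp)

    for year in sorted(by_year):
        print("%s\t%s" % (year, max(by_year[year])))
    # prints: 1949  111
    #         1950  22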


    Weather Data Mining

    Mapper.py:

    #!/usr/bin/env python
    import re
    import sys

    # read NCDC records from standard input, one fixed-width record per line
    for line in sys.stdin:
        val = line.strip()
        # year, temperature, and quality code sit at fixed offsets in the record
        (year, temp, q) = (val[15:19], val[87:92], val[92:93])
        # skip missing readings (+9999) and bad quality codes
        if temp != "+9999" and re.match("[01459]", q):
            print "%s\t%s" % (year, temp)


    Weather Data Mining

    Reduce.py:

    #!/usr/bin/env python
    import sys

    (last_key, max_val) = (None, 0)
    # input arrives sorted by key, so all values for one year are adjacent
    for line in sys.stdin:
        (key, val) = line.strip().split("\t")
        if last_key and last_key != key:
            # finished one year: emit its maximum, then start the next year
            print "%s\t%s" % (last_key, max_val)
            (last_key, max_val) = (key, int(val))
        else:
            (last_key, max_val) = (key, max(max_val, int(val)))

    # emit the final year
    if last_key:
        print "%s\t%s" % (last_key, max_val)


    Running the program

    To run a test:

    % cat input/ncdc/sample.txt | src/main/ch02/python/max_temperature_map.py | \
        sort | src/main/ch02/python/max_temperature_reduce.py

    Output:

    1949 111

    1950 22


    References

    Hadoop Wiki: http://hadoop.apache.org/core/

    http://wiki.apache.org/hadoop/GettingStartedWithHadoop

    http://wiki.apache.org/hadoop/HadoopMapReduce

    http://hadoop.apache.org/core/docs/current/hdfs_design.html
