
Single Node Cluster Using Hadoop


    Cloud Computing Using Hadoop

    Rahul Poddar 11500110119

    Santosh Kumar 11500110006

    Shubham Raj 11500110054

    Vinayak Raj 11500110019

    6th Semester, CSE-B, BPPIMT


    Outline

    Brief introduction of Cloud Computing

    Requirements for this project

    What is Hadoop and its properties

    What led to the development of Hadoop?

    MapReduce

    HDFS

    An example application on Hadoop


    What is cloud computing?

    Cloud computing is the use of computing resources (hardware and
    software) that are delivered as a service over a network (typically the Internet).

    The Cloud aims to cut costs, and help the users focus on their core business
    instead of being impeded by IT obstacles.

    The main enabling technologies for Cloud Computing are virtualization and
    autonomic computing.


    With cloud computing, other companies host your computers.


    Cloud Computing Architecture

    Software as a Service (SaaS)

    Platform as a Service (PaaS)

    Infrastructure as a Service (IaaS)

    These three services encapsulate the basic
    components of cloud computing.


    Software requirements for the Hadoop project

    Java requirements: Hadoop is a Java-based system. Recent versions of Hadoop
    require Sun Java 1.6.

    Operating system: Linux (e.g. Ubuntu 12.04 LTS) or Mac OS X. Hadoop can also be
    run on Windows, but Windows requires Cygwin to be installed.

    Installing Hadoop: Hadoop 1.0.3 or above, installed either single-node or multi-node.
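
    As a quick sanity check before installing Hadoop, a short illustrative Python
    snippet (not part of Hadoop; the script name and messages here are made up)
    can confirm that a working `java` is on the PATH:

    # check_java.py -- quick sanity check before installing Hadoop
    # (illustrative only; assumes `java` is on the PATH)
    import subprocess
    import sys

    try:
        # `java -version` prints its version banner on stderr
        out = subprocess.check_output(
            ["java", "-version"], stderr=subprocess.STDOUT
        ).decode("utf-8", "replace")
    except (OSError, subprocess.CalledProcessError):
        sys.exit("No working `java` found on PATH -- install a JDK (1.6+) first")

    print("Found Java:")
    print(out.strip())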


    Hardware requirements for Hadoop (small cluster, 5-50 nodes)

    Hadoop and HBase require two types of machines:

    1) Masters (the HDFS NameNode, the MapReduce JobTracker, and the HBase Master)

    2) Slaves (the HDFS DataNodes, the MapReduce TaskTrackers, and the HBase RegionServers)

    Two quad-core CPUs

    12 GB to 24 GB of RAM per node.


    Here comes Hadoop

    Hadoop is a scalable, fault-tolerant grid operating system for data
    storage and processing.

    Its scalability comes from the combination of:

    HDFS: self-healing, high-bandwidth clustered storage

    MapReduce: fault-tolerant distributed processing

    It operates on structured and unstructured data.


    Here comes Hadoop

    A large and active ecosystem (many
    developers and additions like HBase, Pig, and Hive)

    Open source under the Apache License

    http://wiki.apache.org/hadoop/


    Characteristics of Hadoop

    Commodity HW: add inexpensive servers

    Use replication across servers to deal with unreliable storage/servers

    Support for moving computation close to data

    Servers have 2 purposes: data storage and computation


    Need for Hadoop: Big data

    We live in the age of very large and complex data, called BIG DATA.

    IDC estimates that the total size of the digital universe is 1.8 zettabytes
    (one zettabyte = 10^21 bytes).

    That is roughly equivalent to each person in the world having one hard disk drive
    (about 250 GB each, assuming roughly 7 billion people).


    Need for Hadoop: Big data

    Every day, 2.5 quintillion (2.5 x 10^18) bytes of data are generated.

    90% of the world's total data has been generated in just the last two years.

    Such a large amount of ever-increasing data is getting difficult for traditional RDBMSs and
    grid computing systems to manage.


    Sources of Big data

    The New York Stock Exchange generates about one terabyte of new trade data per day.

    Facebook hosts approximately 10 billion photos, taking up about 1 petabyte of storage.

    The Large Hadron Collider at CERN, Geneva, produces about 15 petabytes of data per
    year.

    The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per
    month.


    Inefficiency and high expenses

    The high cost of high-end server computers and other proprietary hardware and
    software for processing and storing large amounts of data, as well as their
    maintenance, is unbearable for many industrial organisations. Upgrading and
    maintaining these servers to scale up their capacity also requires huge cost.


    Not Robust

    The traditional single-server architecture is not robust, because one large
    computer takes care of all the computing. If it fails or shuts down, the whole
    system breaks down and the enterprise incurs huge losses.

    Also, during repairs or upgrades the computer has to be switched off, and in the
    meantime no useful tasks are executed, so computations lag behind.


    MapReduce algorithm

    MapReduce is a programming model for processing large data sets,
    typically used to do distributed computing on clusters of computers.

    MapReduce gives regular programmers the ability to write parallel,
    distributed programs much more easily.

    MapReduce consists of two simple functions:

    map()

    reduce()


    MapReduce algorithm

    "Map" step: The master node takes the input, divides it into
    smaller sub-problems, and distributes them to worker nodes.

    A worker node may do this again in turn, leading to a multi-level
    tree structure.

    The worker node processes the smaller problem, and passes the
    answer back to its master node.


    MapReduce algorithm

    "Reduce" step: The master node collects the answers to all the sub-problems from
    the slaves.

    The master then combines the answers in some way to form the output: the answer
    to the problem it was originally trying to solve.
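
    A minimal local sketch of these two steps, using a word-count job as the
    example (plain Python only, for illustration; real Hadoop distributes the map
    and reduce work across nodes and performs the grouping itself):

    # wordcount_local.py -- toy simulation of the Map and Reduce steps
    # (illustrative only; a real cluster spreads this work over many machines)
    from collections import defaultdict

    def map_fn(_, text):
        # emit (word, 1) for every word in one chunk of input
        return [(word, 1) for word in text.split()]

    def reduce_fn(word, counts):
        # combine all counts for one word into a single total
        return (word, sum(counts))

    chunks = ["the quick brown fox", "the lazy dog", "the fox"]

    # "Map" step: apply map_fn to every input split
    intermediate = []
    for i, chunk in enumerate(chunks):
        intermediate.extend(map_fn(i, chunk))

    # Shuffle: group the intermediate values by key
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)

    # "Reduce" step: combine each group into the final answer
    for word in sorted(groups):
        print(reduce_fn(word, groups[word]))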


    MapReduce: High Level

    A MapReduce job is submitted by a client computer to the JobTracker running on
    the master node. The JobTracker assigns work to the TaskTrackers on the slave
    nodes, and each TaskTracker runs task instances to execute its share of the job.


    Some MapReduce Terminology

    Job: a full program, an execution of a Mapper and Reducer across a data set.

    Task: an execution of a Mapper or a Reducer on a slice of data,
    a.k.a. a Task-In-Progress (TIP).

    Task Attempt: a particular instance of an attempt to execute a task on a machine.


    Terminology Example

    Running WordCount across 20 files is one job.

    20 files to be mapped imply 20 map tasks + some number of reduce tasks.

    At least 20 map task attempts will be performed, and more if a machine crashes, etc.


    HDFS (Hadoop Distributed File System)

    The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
    commodity hardware.

    HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

    HDFS provides high-throughput access to application data and is suitable for applications that
    have large data sets.

    HDFS is part of the Apache Hadoop project, which originated as a subproject of Apache Lucene.


    HDFS Architecture

    Master-slave architecture

    DFS master (Namenode):

    Manages the filesystem namespace

    Maintains the file name to list-of-blocks + location mapping

    Manages block allocation/replication

    Checkpoints the namespace and journals namespace changes for reliability

    Controls access to the namespace

    DFS slaves (Datanodes) handle block storage:

    Store blocks using the underlying OS's files

    Clients access the blocks directly from the datanodes

    Periodically send block reports to the Namenode

    Periodically check block integrity
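
    To make the Namenode's bookkeeping concrete, here is a toy, in-memory sketch of
    the file-to-blocks-to-datanodes mapping (purely illustrative; the block size,
    node names, and naive round-robin placement are assumptions, and real HDFS also
    handles rack awareness, journaling, heartbeats, and re-replication):

    # toy_namenode.py -- illustrative model of the file -> blocks -> datanodes mapping
    # (not real HDFS code; block size and replication factor chosen for illustration)
    import itertools

    BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB blocks, the classic HDFS default
    REPLICATION = 3                 # default replication factor

    datanodes = ["dn1", "dn2", "dn3", "dn4"]
    rr = itertools.cycle(datanodes) # naive round-robin placement

    def allocate(file_name, file_size):
        """Split a file into blocks and assign each block to REPLICATION datanodes."""
        num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE
        block_map = {}
        for i in range(num_blocks):
            block_id = "%s_blk_%d" % (file_name, i)
            block_map[block_id] = [next(rr) for _ in range(REPLICATION)]
        return block_map

    # a 200 MB file needs 4 blocks of 64 MB (the last one partially filled)
    for block, replicas in sorted(allocate("weather.txt", 200 * 1024 * 1024).items()):
        print("%s -> %s" % (block, replicas))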


    An Example: Weather Data Mining

    Weather sensors all across the globe are collecting climatic data.

    The data can be obtained from the National Climatic Data Centre
    (http://www.ncdc.noaa.gov/).

    We will focus only on temperature, for simplicity.

    The input will be data from the NCDC, which will be given as key-value pairs to map().

    The output given by reduce() will be the maximum temperature of each year.
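
    As a sketch of how those key-value pairs flow (the sample values below are made
    up, chosen to match the output shown on the "Running the program" slide; NCDC
    records store temperature in tenths of a degree Celsius):

    # dataflow_sketch.py -- shape of the key-value pairs in the weather example
    # (sample values are illustrative only)
    mapped = [("1949", 111), ("1949", 78), ("1950", 0), ("1950", 22), ("1950", -11)]

    # group the map() output by year, then reduce each group to its maximum
    by_year = {}
    for year, temp in mapped:
        by_year.setdefault(year, []).append(temp)

    for year in sorted(by_year):
        print("%s\t%s" % (year, max(by_year[year])))
    # prints: 1949  111
    #         1950  22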


    Weather Data Mining

    Mapper.py:

    #!/usr/bin/env python
    import re
    import sys

    # read NCDC records from standard input, one fixed-width record per line
    for line in sys.stdin:
        val = line.strip()
        # year, temperature, and quality code sit at fixed offsets in the record
        (year, temp, q) = (val[15:19], val[87:92], val[92:93])
        # skip missing readings (+9999) and bad quality codes
        if temp != "+9999" and re.match("[01459]", q):
            print "%s\t%s" % (year, temp)


    Weather Data Mining

    Reduce.py:

    #!/usr/bin/env python
    import sys

    (last_key, max_val) = (None, 0)
    # input arrives sorted by key, so all values for one year are adjacent
    for line in sys.stdin:
        (key, val) = line.strip().split("\t")
        if last_key and last_key != key:
            # finished one year: emit its maximum, then start the next year
            print "%s\t%s" % (last_key, max_val)
            (last_key, max_val) = (key, int(val))
        else:
            (last_key, max_val) = (key, max(max_val, int(val)))

    # emit the final year
    if last_key:
        print "%s\t%s" % (last_key, max_val)


    Running the program

    To run a test:

    % cat input/ncdc/sample.txt | src/main/ch02/python/max_temperature_map.py | \
        sort | src/main/ch02/python/max_temperature_reduce.py

    Output:

    1949 111

    1950 22


    References

    Hadoop Wiki: http://hadoop.apache.org/core/

    http://wiki.apache.org/hadoop/GettingStartedWithHadoop

    http://wiki.apache.org/hadoop/HadoopMapReduce

    http://hadoop.apache.org/core/docs/current/hdfs_design.html
