Introduction to hadoop

Date post: 21-Jul-2015
Upload: ganesh-sanap
Page 1: Introduction to hadoop
Page 2: Introduction to hadoop

What is BigData?

The term “Big Data” describes collections of data so large and complex that they are difficult to capture, store, search, process, and analyze with a traditional database management system.

Basically, this data comes from everywhere, for example:

• Social media sites
• Traffic and satellite systems
• The digital world
• Software logs
• Business data
• And many more…

Page 3: Introduction to hadoop

• Big Data includes both structured and unstructured data.

• Big Data is difficult to work with using most relational database management systems.

• Big Data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, and to make your business more agile.

• Why is it so important?

1. More data leads to more accurate analyses.
2. More accurate analyses lead to better decision making.
3. Better decisions mean greater operational efficiencies, cost reductions, and reduced risk.

Page 4: Introduction to hadoop

What is Hadoop…?

“Apache Hadoop is an open-source software library framework used to process large data sets across a distributed cluster, using simple programming models on commodity (widely available) hardware.”

Hadoop processes data in parallel on a large cluster. Google created its own distributed computing framework and published papers about it; Hadoop was developed on the basis of those papers.

Core Hadoop consists of two components:
- The Hadoop Distributed File System (HDFS)
- MapReduce
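The MapReduce model mentioned above can be sketched in plain Python. This is only an illustrative simulation of the map, shuffle, and reduce phases on one machine, not Hadoop's actual (Java) API; all function names here are made up for the example:

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase("big data needs big clusters")))
print(counts["big"])  # → 2
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves intermediate pairs across the network; the logic per record is the same as above.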

Page 5: Introduction to hadoop

Why Hadoop?

• Economical (cost-effective)
• Flexible
• Scalable
• Solves Big Data problems
• Reliable
• Smart

Page 6: Introduction to hadoop

How Hadoop works

[Diagram: a Client Program sends data into the cluster, which has one Master Node and several Slave Nodes.]

Master Node:
• HDFS NameNode
• MapReduce JobTracker

Each Slave Node:
• HDFS DataNode
• MapReduce TaskTracker

Page 7: Introduction to hadoop

Steps:

Step 1: Data is broken into file splits of 64 MB or 128 MB, and the blocks are moved to different nodes.

Step 2: Once all the blocks are moved, the Hadoop framework passes the program on to each node.

Step 3: The JobTracker then starts scheduling the programs on the individual nodes.

Step 4: Once all the nodes are done, the output is returned.
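Step 1 above can be sketched as a toy Python illustration of splitting a file into fixed-size blocks and assigning each block to a node. The block size is shrunk from 64 MB to 8 bytes so the example stays readable, and the node names are hypothetical; real HDFS placement is also rack-aware and replicated, which this sketch ignores:

```python
BLOCK_SIZE = 8                        # toy value; real HDFS defaults are 64 MB or 128 MB
NODES = ["node1", "node2", "node3"]   # hypothetical slave nodes

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Break the file into fixed-size splits, as HDFS does on ingest.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks, nodes=NODES):
    # Place blocks on nodes round-robin (a simplification of real placement).
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}

blocks = split_into_blocks(b"a 32-byte file becomes 4 blocks!")
placement = assign_blocks(blocks)
print(len(blocks), placement[3])  # → 4 node1
```

Steps 2–4 then ship the program to each of those nodes, run it against the local block, and collect the results.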

Page 8: Introduction to hadoop

History…

Hadoop was inspired by Google’s MapReduce, a software framework in which an application is broken down into numerous small parts. Any of those parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop’s creator, named the framework after his child’s stuffed toy elephant.

In 2002, Doug Cutting created an open-source web crawler project. In 2004, Google published the MapReduce and GFS papers. In 2006, Doug Cutting developed the open-source MapReduce and HDFS project. In 2008, Yahoo ran a 4,000-node Hadoop cluster, and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched SQL support for Hadoop.

Page 9: Introduction to hadoop

Hadoop Eco-System

Page 10: Introduction to hadoop

PIG

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. Introduced by Yahoo.

HIVE

Apache Hive is data warehouse software used for querying and managing large data sets on a distributed cluster. Introduced by Facebook.

HBase

Apache HBase is a distributed, column-oriented database that runs on top of HDFS and Hadoop.

Page 11: Introduction to hadoop

SQOOP

Sqoop is a contraction of “SQL-to-Hadoop”. Sqoop is an import and export utility: a data transfer tool used to get data into Hadoop from relational systems, and to put data back into an RDBMS for analysis with BI tools.
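The import and export directions can be sketched as command lines. The host, database, table, and directory names below are made up for illustration; the `import`/`export` subcommands and the `--connect`, `--table`, `--target-dir`, and `--export-dir` flags are real Sqoop options:

```python
# Sketch of typical Sqoop invocations (connection details are hypothetical).
import_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/shop",  # source RDBMS (made-up host/db)
    "--table", "orders",                      # made-up source table
    "--target-dir", "/data/orders",           # HDFS destination directory
]
export_cmd = [
    "sqoop", "export",
    "--connect", "jdbc:mysql://dbhost/shop",  # destination RDBMS
    "--table", "order_reports",               # made-up destination table
    "--export-dir", "/data/reports",          # HDFS directory to push back out
]

print(" ".join(import_cmd))
print(" ".join(export_cmd))
```

On a machine with Sqoop installed, either list could be handed to `subprocess.run` to execute the transfer.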

Zookeeper

Apache ZooKeeper is a coordination service for distributed systems; it is fast and scalable.

Oozie

Oozie is a workflow engine that runs on a server; it provides a job scheduling service within a Hadoop cluster.

Page 12: Introduction to hadoop

FLUME

Flume is a service that lets you ingest data (typically file data) into HDFS. It is defined as a distributed, reliable, and available service for moving large amounts of data as it is produced.

Ganesh L. Sanap

[email protected]

