Introduction to hadoop

Date post: 21-Jul-2015
Upload: ganesh-sanap
Page 1: Introduction to hadoop
Page 2: Introduction to hadoop

What is BigData?

The term “Big Data” describes collections of data so large and complex that they are difficult to capture, store, search, process, and analyze with a traditional database management system.

Basically, this data comes from everywhere, for example:

• Social media sites
• Traffic and satellite systems
• The digital world
• Software logs
• Business data
• And many more…

Page 3: Introduction to hadoop

• Big Data includes both structured and unstructured data.

• Big Data is difficult to work with using most relational database management systems.

• Big Data is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, and to make your business more agile.

• Why is it so important?

1. More data leads to more accurate analyses.
2. More accurate analyses lead to better decision making.
3. Better decisions mean greater operational efficiencies, cost reductions, and reduced risk.

Page 4: Introduction to hadoop

What is Hadoop…?

“Apache Hadoop is an open-source software library framework used to process large data sets across a distributed cluster, using simple programming models on commodity (widely available) hardware.”

Hadoop processes data in parallel on a large cluster. Google created its own distributed computing framework and published papers about it; Hadoop was developed on the basis of those papers.

Core Hadoop consists of two components:
- The Hadoop Distributed File System (HDFS)
- MapReduce
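The MapReduce model mentioned above can be sketched in plain Python. This is only an illustrative simulation of the map, shuffle, and reduce phases on one machine, not Hadoop's actual (Java) API; all function names here are made up for the example:

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in text.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase("big data needs big clusters")))
print(counts["big"])  # → 2
```

In a real cluster the map and reduce functions run in parallel on many nodes, and the shuffle moves intermediate pairs across the network; the logic per record is the same as above.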

Page 5: Introduction to hadoop

Why Hadoop?

• Economical (cost-effective)
• Flexible
• Scalable
• Solves Big Data problems
• Reliable
• Smart

Page 6: Introduction to hadoop

How Hadoop works

[Diagram: a Client Program sends data into the cluster, which has one Master Node and several Slave Nodes.]

Master Node:
• HDFS NameNode
• MapReduce JobTracker

Each Slave Node:
• HDFS DataNode
• MapReduce TaskTracker

Page 7: Introduction to hadoop

Steps:

Step 1: Data is broken into file splits of 64 MB or 128 MB, and the blocks are moved to different nodes.

Step 2: Once all the blocks are moved, the Hadoop framework passes the program on to each node.

Step 3: The JobTracker then starts scheduling the programs on the individual nodes.

Step 4: Once all the nodes are done, the output is returned.
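Step 1 above can be sketched as a toy Python illustration of splitting a file into fixed-size blocks and assigning each block to a node. The block size is shrunk from 64 MB to 8 bytes so the example stays readable, and the node names are hypothetical; real HDFS placement is also rack-aware and replicated, which this sketch ignores:

```python
BLOCK_SIZE = 8                        # toy value; real HDFS defaults are 64 MB or 128 MB
NODES = ["node1", "node2", "node3"]   # hypothetical slave nodes

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Break the file into fixed-size splits, as HDFS does on ingest.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def assign_blocks(blocks, nodes=NODES):
    # Place blocks on nodes round-robin (a simplification of real placement).
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}

blocks = split_into_blocks(b"a 32-byte file becomes 4 blocks!")
placement = assign_blocks(blocks)
print(len(blocks), placement[3])  # → 4 node1
```

Steps 2–4 then ship the program to each of those nodes, run it against the local block, and collect the results.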

Page 8: Introduction to hadoop

History…

Hadoop was inspired by Google’s MapReduce, a software framework in which an application is broken down into numerous small parts. Any of those parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop’s creator, named the framework after his child’s stuffed toy elephant.

In 2002, Doug Cutting created an open-source web crawler project. In 2004, Google published the MapReduce and GFS papers. In 2006, Doug Cutting developed the open-source MapReduce and HDFS project. In 2008, Yahoo ran a 4,000-node Hadoop cluster, and Hadoop won the terabyte sort benchmark. In 2009, Facebook launched SQL support for Hadoop.

Page 9: Introduction to hadoop

Hadoop Eco-System

Page 10: Introduction to hadoop

PIG

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs. Introduced by Yahoo.

HIVE

Apache Hive is data warehouse software used for querying and managing large data sets on a distributed cluster. Introduced by Facebook.

HBase

Apache HBase is a distributed, column-oriented database that runs on top of HDFS and Hadoop.

Page 11: Introduction to hadoop

SQOOP

Sqoop is a contraction of “SQL-to-Hadoop”. Sqoop is an import and export utility: a data transfer tool used to get data into Hadoop from relational systems, and to put data back into an RDBMS for analysis with BI tools.
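The import and export directions can be sketched as command lines. The host, database, table, and directory names below are made up for illustration; the `import`/`export` subcommands and the `--connect`, `--table`, `--target-dir`, and `--export-dir` flags are real Sqoop options:

```python
# Sketch of typical Sqoop invocations (connection details are hypothetical).
import_cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/shop",  # source RDBMS (made-up host/db)
    "--table", "orders",                      # made-up source table
    "--target-dir", "/data/orders",           # HDFS destination directory
]
export_cmd = [
    "sqoop", "export",
    "--connect", "jdbc:mysql://dbhost/shop",  # destination RDBMS
    "--table", "order_reports",               # made-up destination table
    "--export-dir", "/data/reports",          # HDFS directory to push back out
]

print(" ".join(import_cmd))
print(" ".join(export_cmd))
```

On a machine with Sqoop installed, either list could be handed to `subprocess.run` to execute the transfer.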

Zookeeper

Apache ZooKeeper is a coordination service for distributed systems; it is fast and scalable.

Oozie

Oozie is a workflow engine that runs on a server; it provides a job scheduling service within a Hadoop cluster.

Page 12: Introduction to hadoop

FLUME

Flume is a service that lets you ingest data (typically file data) into HDFS. It is defined as a distributed, reliable, and available service for moving large amounts of data as it is produced.

Ganesh L. Sanap

[email protected]

