Date post: | 21-Jul-2015 |
Category: |
Software |
Upload: | ganesh-sanap |
What is BigData?
The term “BigData” describes collections of data so large and complex that they are difficult to capture, store, search, process and analyze using a traditional database management system.
Basically, the data comes from everywhere, for example:
• Social media sites
• Traffic
• Satellites
• The digital world
• Software logs
• Business data
• And many more
• BigData includes both structured and unstructured data.
• BigData is difficult to work with using most relational database management systems.
• BigData is more than simply a matter of size; it is an opportunity to find insights in new and emerging types of data and content, and to make your business more agile.
• Why is it so important?
1. More data leads to more accurate analyses.
2. More accurate analyses lead to better decision making.
3. Better decisions mean greater operational efficiency, cost reductions and reduced risk.
What is Hadoop?
“Apache Hadoop is an open source software library and framework used to process large data sets across a distributed cluster, using simple programming models on commodity (inexpensive, widely available) hardware.”
Hadoop processes the data in parallel across a large cluster.
Google created its own distributed computing framework and published papers about it. Hadoop was developed on the basis of those papers.
Core Hadoop consists of two components:
- The Hadoop Distributed File System (HDFS)
- MapReduce
Why Hadoop?
• Economical (cost effective)
• Flexible
• Scalable
• Solves BigData problems
• Reliable
• Smart
How Hadoop works
[Diagram: the client program and data are submitted to the master node, which coordinates three slave nodes.]
Master Node: HDFS Name Node, MapReduce Job Tracker
Each Slave Node: HDFS Data Node, MapReduce Task Tracker
STEPS:
Step 1 : The data is broken into blocks of 64 MB or 128 MB, and the blocks are moved to different nodes.
Step 2 : Once all the blocks are moved, the Hadoop framework passes the program to each node.
Step 3 : The Job Tracker then starts scheduling the program on the individual nodes.
Step 4 : Once all the nodes are done, the output is returned.
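The four steps above can be sketched in a few lines of Python. This is only an illustration, not how Hadoop itself is implemented: the function names (`split_into_blocks`, `map_block`, `reduce_pairs`, `run_job`) are hypothetical, blocks are counted in lines rather than in 64 MB or 128 MB of data, and a local thread pool stands in for the slave nodes. Word counting is used as the example program.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(lines, lines_per_block):
    """Step 1: break the input into fixed-size blocks.

    Assumption: a block is a few lines here, standing in for a 64/128 MB block.
    """
    return [lines[i:i + lines_per_block]
            for i in range(0, len(lines), lines_per_block)]


def map_block(block):
    """Steps 2-3: the same program runs on every block and emits (word, 1) pairs."""
    return [(word.lower(), 1) for line in block for word in line.split()]


def reduce_pairs(mapped_blocks):
    """Step 4: combine the per-block outputs into one result."""
    totals = defaultdict(int)
    for pairs in mapped_blocks:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)


def run_job(lines, lines_per_block=2, workers=3):
    """Run the whole flow; the thread pool plays the role of the slave nodes."""
    blocks = split_into_blocks(lines, lines_per_block)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_block, blocks))
    return reduce_pairs(mapped)


if __name__ == "__main__":
    data = [
        "big data needs hadoop",
        "hadoop stores data in hdfs",
        "mapreduce processes data",
    ]
    print(run_job(data))
```

Each block is processed independently, which is what lets the work be spread across many nodes; only the final combining step needs to see every block's output.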
History
Hadoop was inspired by Google’s MapReduce, a software framework in which an application is broken down into numerous small parts. Any of those parts (also called fragments or blocks) can be run on any node in the cluster.
Doug Cutting, Hadoop’s creator, named the framework after his child’s stuffed toy elephant.
• In 2002, Doug Cutting created an open source web crawler project.
• In 2003 and 2004, Google published the GFS and MapReduce papers.
• In 2006, Doug Cutting developed the open source MapReduce and HDFS project.
• In 2008, Yahoo ran a 4,000-node Hadoop cluster and Hadoop won the terabyte sort benchmark.
• In 2009, Facebook launched SQL support for Hadoop.
Hadoop Eco-System
PIG
Apache PIG is a platform for analyzing large data sets. It consists of a high-level language for expressing data analysis programs. Introduced by Yahoo.
HIVE
Apache HIVE is data warehouse software used for querying and managing large data sets on a distributed cluster. Introduced by Facebook.
HBase
Apache HBase is a distributed, column-oriented database that runs on top of HDFS and Hadoop.
SQOOP
The name SQOOP is a combination of SQL and Hadoop. SQOOP is an import and export utility: a data transfer tool that brings data into Hadoop from relational systems and puts data back into an RDBMS for analysis with BI tools.
Zookeeper
Apache ZooKeeper is a coordination service for distributed systems. It is fast and scalable.
OOZiE
OOZiE is a workflow engine that runs on a server. It is the job scheduling service within a Hadoop cluster.
FLUME
FLUME is a service that lets you ingest data (typically file data) into HDFS. It is defined as a distributed, reliable and available service for moving large amounts of data as it is produced.
Ganesh L. Sanap