Date post: | 11-Jan-2015 |
Category: |
Engineering |
Upload: | nalini-mehta |
MANAGING BIG DATA WITH HADOOP
Presented by:
Nalini Mehta, Student (MLVTEC Bhilwara) Email: [email protected]
Introduction
Big Data:
• Big data is a term used to describe very large volumes of unstructured and semi-structured data.
• It is data that would take too much time and cost too much money to load into a relational database for analysis.
• Big data doesn't refer to any specific quantity, but the term is often used when speaking about petabytes and exabytes of data.
General framework of Big Data Networking
The driving forces behind a Big Data implementation are infrastructure and analytics, which together constitute the software.
Hadoop is the Big Data management software used to distribute, catalogue, manage and query data across multiple, horizontally scaled server nodes.
Managing Big Data
Overview of Hadoop
• Hadoop is a platform for processing large amounts of data in a distributed fashion.
• It provides a scheduling and resource management framework to execute the map and reduce phases in a cluster environment.
• The Hadoop Distributed File System (HDFS) is Hadoop’s data storage layer, which is designed to handle petabytes and exabytes of data distributed over multiple nodes in parallel.
Hadoop Cluster
• DataNode- The DataNodes are the repositories for the data, and they consist of multiple smaller database infrastructures.
• Client- The client represents the user interface to the big data implementation and query engine. The client could be a server or PC with a traditional user interface.
• NameNode- The NameNode acts as the address router: it tracks the location of every DataNode.
• Job Tracker- The job tracker represents the software tracking mechanism to distribute and aggregate search queries across multiple nodes for ultimate client analysis.
Apache Hadoop
• Apache Hadoop is an open source distributed software platform for storing and processing data.
• It is a framework for running applications on large clusters built of commodity hardware.
• A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. The Hadoop Distributed File System (HDFS) takes care of this problem.
• MapReduce is a simple programming model for processing and generating large data sets.
What is MapReduce?
MapReduce is a programming model. Programs written in this model are automatically parallelized and executed on a large cluster of commodity machines.
Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
MapReduce
MAP: a map function that processes a key/value pair to generate a set of intermediate key/value pairs.
REDUCE: a reduce function that merges all intermediate values associated with the same intermediate key.
The Programming Model Of MapReduce
Map, written by the user, takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key and a set of values for that key. It merges together these values to form a possibly smaller set of values.
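The programming model can be illustrated with a word count, the usual introductory example. The sketch below is a single-process toy, not Hadoop's API: the names map_fn, reduce_fn and run_mapreduce are hypothetical, and run_mapreduce only stands in for the grouping step that the MapReduce library performs across a cluster.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: key is a document name, value is its text; emit (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge all counts emitted for one word into a single total."""
    return key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Toy driver (assumption): groups intermediate values by key in memory,
    which is the job the MapReduce library does between the two phases."""
    groups = defaultdict(list)
    for key, value in inputs:
        for inter_key, inter_value in map_fn(key, value):
            groups[inter_key].append(inter_value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))


docs = [("doc1", "big data with hadoop"), ("doc2", "hadoop stores big data")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

The user writes only map_fn and reduce_fn; everything inside run_mapreduce is what a real framework would distribute over many machines.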
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Apache Hadoop comes with a distributed file system called HDFS, which stands for Hadoop Distributed File System.
HDFS is designed to hold very large amounts of data (terabytes or even petabytes), and to provide high-throughput access to this information.
HDFS is designed for scalability and fault tolerance and provides APIs for MapReduce applications to read and write data in parallel.
The capacity and performance of HDFS can be scaled by adding DataNodes, while a single NameNode manages data placement and monitors server availability.
Assumptions and Goals
1. Hardware Failure
• An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data.
• There are a huge number of components, and each component has a non-trivial probability of failure.
• Detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.
2. Streaming Data Access
• Applications that run on HDFS need streaming access to their data sets.
• HDFS is designed more for batch processing than for interactive use by users.
• The emphasis is on high throughput of data access rather than low latency of data access.
3. Large Data Sets
• A typical file in HDFS is gigabytes to terabytes in size.
• Thus, HDFS is tuned to support large files.
• It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster.
4. Simple Coherency Model
• HDFS applications need a write-once-read-many access model for files.
• A file once created, written, and closed need not be changed.
• This assumption simplifies data coherency issues and enables high throughput data access.
5. “Moving Computation is Cheaper than Moving Data”
• A computation requested by an application is much more efficient if it is executed near the data it operates on, especially when the size of the data set is huge.
• This minimizes network congestion and increases the overall throughput of the system.
6. Portability across Heterogeneous Hardware and Software Platforms
• HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications.
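The idea that moving computation is cheaper than moving data can be sketched as a scheduling rule: run each task on a node that already holds a replica of its input block, and fall back to a remote node only when no such node has a free slot. This is a greedy illustration under assumed inputs, not Hadoop's actual scheduler; assign_tasks and the node and block names are hypothetical.

```python
def assign_tasks(block_locations, slots):
    """Greedy locality-aware assignment (illustrative assumption only).

    block_locations: block id -> set of nodes holding a replica of that block.
    slots: node -> number of free task slots.
    Returns block id -> (chosen node, True if the block is read locally).
    """
    free = dict(slots)  # copy so the caller's slot counts are not modified
    plan = {}
    for block, holders in sorted(block_locations.items()):
        # Prefer nodes that already store the block and have a free slot.
        local = [n for n in sorted(holders) if free.get(n, 0) > 0]
        # Otherwise take any node with a free slot; the block crosses the network.
        candidates = local or [n for n in sorted(free) if free[n] > 0]
        if not candidates:
            raise RuntimeError("no free task slots")
        node = candidates[0]
        free[node] -= 1
        plan[block] = (node, node in holders)
    return plan


locations = {"b1": {"n1", "n2"}, "b2": {"n1"}, "b3": {"n3"}}
plan = assign_tasks(locations, {"n1": 2, "n2": 1, "n3": 0, "n4": 1})
print(plan)
```

In this run only the task for b3 reads its block remotely, because n3, the sole holder of b3, has no free slot.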
Concepts of HDFS:
NameNode and DataNodes
An HDFS cluster has two types of node operating in a master-slave pattern: a NameNode (the master) and a number of DataNodes (slaves).
The NameNode manages the file system namespace. It maintains the file system tree and the metadata for all the files and directories in the tree.
Internally a file is split into one or more blocks and these blocks are stored in a set of DataNodes.
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories.
DataNodes store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of blocks that they are storing.
The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Without the NameNode, the file system cannot be used. In fact, if the machine running the NameNode were destroyed, all the files on the file system would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the DataNodes.
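The division of labour between the NameNode and the DataNodes can be sketched with plain dictionaries. This is a toy model under stated assumptions, not the HDFS API: ToyCluster, BLOCK_SIZE and the round-robin placement are hypothetical, and the block size is tiny on purpose so the split is visible.

```python
BLOCK_SIZE = 4  # bytes; an assumption for illustration, real HDFS blocks are far larger


class ToyCluster:
    """Toy model: the NameNode keeps only metadata, the DataNodes keep the bytes."""

    def __init__(self, datanode_names):
        self.datanodes = {name: {} for name in datanode_names}  # node -> {block id: bytes}
        self.namenode = {}  # path -> ordered list of (block id, node)

    def write(self, path, data):
        names = sorted(self.datanodes)
        entries = []
        for index, start in enumerate(range(0, len(data), BLOCK_SIZE)):
            block_id = f"{path}#{index}"
            node = names[index % len(names)]  # round-robin placement (assumption)
            self.datanodes[node][block_id] = data[start:start + BLOCK_SIZE]
            entries.append((block_id, node))
        self.namenode[path] = entries

    def read(self, path):
        # Without the NameNode's block list there is no way to order the blocks.
        return b"".join(self.datanodes[node][bid] for bid, node in self.namenode[path])


cluster = ToyCluster(["dn1", "dn2", "dn3"])
cluster.write("/logs/a.txt", b"hello hadoop!")
restored = cluster.read("/logs/a.txt")
print(cluster.namenode["/logs/a.txt"])
```

If cluster.namenode were lost, every block would still sit on its DataNode, but nothing would record which blocks belong to which file or in what order, which is the failure described above.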
File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an application can create and remove files, move a file from one directory to another, rename a file, create directories and store files inside these directories.
The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode.
An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
Data Replication
The blocks of a file are replicated for fault tolerance.
The block size and replication factor are configurable per file.
The NameNode makes all decisions regarding replication of blocks.
A Block report contains a list of all blocks on a DataNode.
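One way to picture how block reports feed replication decisions: the NameNode counts the live copies of each block across all reports and compares the count with the file's replication factor. The function below is a simplified sketch with hypothetical names (under_replicated, block ids such as "b1"); it is not the NameNode's real logic.

```python
from collections import Counter


def under_replicated(block_reports, expected):
    """Return {block id: number of missing copies} for blocks below target.

    block_reports: DataNode name -> list of block ids it reports storing.
    expected: block id -> replication factor requested for that block.
    """
    live = Counter()
    for blocks in block_reports.values():
        live.update(set(blocks))  # count each DataNode at most once per block
    return {
        block: want - live[block]
        for block, want in expected.items()
        if live[block] < want
    }


reports = {"dn1": ["b1", "b2"], "dn2": ["b1"], "dn3": ["b1", "b2"]}
missing = under_replicated(reports, {"b1": 3, "b2": 3, "b3": 2})
print(missing)
```

Here b1 is fully replicated, b2 is one copy short, and b3 appears in no report at all, so a NameNode would schedule new copies for b2 and treat b3 as currently unavailable.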
Hadoop as a Service in the Cloud (HaaS):
Hadoop is economical for large-scale, data-driven companies like Yahoo or Facebook.
The ecosystem around Hadoop now offers tools like Hive and Pig that make Big Data processing more accessible, letting users focus on what to do with the data rather than on the complexity of programming.
Consequently, a minimal Hadoop as a Service provides a managed Hadoop cluster that is ready to use, without the need to install or configure Hadoop services such as the JobTracker, TaskTracker, NameNode or DataNode on any cluster node.
Depending on the level of service, abstraction and tools provided, Hadoop as a Service (HaaS) can be placed in the cloud stack as a Platform as a Service or Software as a Service solution, between infrastructure services and cloud clients.
Limitations:
It places several requirements on the network:
Data locality
The distributed Hadoop nodes running jobs in parallel cause east-west network traffic that can be adversely affected by suboptimal network connectivity.
The network should provide high bandwidth, low latency and any to any connectivity between the nodes for optimal Hadoop performance.
Scale out
Deployments might start with a small cluster and then scale out over time as the customer realizes initial success and its needs grow.
The underlying network architecture should also scale seamlessly with Hadoop clusters and should provide predictable performance.
Conclusion
The growth of communication and connectivity has led to the emergence of Big Data. Apache Hadoop is an open source framework that has become a de-facto standard for big data platforms deployed today.
To sum up, we conclude that promising progress has been made in the area of Big Data but much remains to be done. Almost all proposed approaches are evaluated to a limited scale, and further research is required for large scale evaluations.