Page 1: Big data & Hadoop

BIG DATA & HADOOP

By: Ahmed Gamil

Page 2: Big data & Hadoop

Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, manage, and process data within a tolerable elapsed time.

Definition

Page 3: Big data & Hadoop

Archives: Scanned documents, statements, medical records, e-mails, etc.

Docs: XLS, PDF, CSV, HTML, JSON, etc.

Business Apps: CRM, ERP systems, HR, project management, etc.

Big Data Sources

Page 4: Big data & Hadoop

Media: Images, video, audio, etc.
Social Networks: Twitter, Facebook, Google+, LinkedIn, etc.
Public Web: Wikipedia, news, weather, public finance, etc.
Data Storages: RDBMS, NoSQL, Hadoop, file systems, etc.

Big Data Sources (Cont’d)

Page 5: Big data & Hadoop

Machine Log Data: Application logs, event logs, server data, CDRs, clickstream data etc.

Sensor Data: Smart electric meters, medical devices, car sensors, road cameras etc.

Big Data Sources (Cont’d)

Page 6: Big data & Hadoop

Characteristics of Big Data

Volume: data quantity
Velocity: data speed
Variety: data types

Page 7: Big data & Hadoop

• Facebook ingests 500 terabytes of new data every day.

• A Boeing 737 generates 240 terabytes of data in a single flight.

• Smartphones, the data they create and consume, and the sensors embedded in everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.

Volume

Page 8: Big data & Hadoop

Clickstreams and ad impressions capture user behavior at millions of events per second.

High-frequency stock trading algorithms reflect market changes within microseconds.

Machine to machine processes exchange data between billions of devices.

Infrastructure and sensors generate massive log data in real-time.

On-line gaming systems support millions of concurrent users, each producing multiple inputs per second.

Velocity

Page 9: Big data & Hadoop

Big Data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media.

Traditional database systems were designed to handle smaller volumes of structured data, with fewer updates and a predictable, consistent data structure.

Big Data analysis includes these different types of data.

Variety

Page 10: Big data & Hadoop

Every day we create 2.5 quintillion (10^18) bytes of data.

90% of the data in the world today has been created in the last two years alone.

Facts

Page 11: Big data & Hadoop

Symbol  Prefix  Decimal            Binary
k       kilo    10^3  = 1000^1     2^10 = 1024^1
M       mega    10^6  = 1000^2     2^20 = 1024^2
G       giga    10^9  = 1000^3     2^30 = 1024^3
T       tera    10^12 = 1000^4     2^40 = 1024^4
P       peta    10^15 = 1000^5     2^50 = 1024^5
E       exa     10^18 = 1000^6     2^60 = 1024^6
Z       zetta   10^21 = 1000^7     2^70 = 1024^7
Y       yotta   10^24 = 1000^8     2^80 = 1024^8

Information Units

Page 12: Big data & Hadoop

Examining large amounts of data
Appropriate information
Identification of hidden patterns and unknown correlations
Competitive advantage
Better business decisions: strategic and operational
Effective marketing, customer satisfaction, increased revenue

Big Data Advantages

Page 13: Big data & Hadoop

Data Storage (Standard disk is 1 TB)

Data Processing

Data Transfer (100 MB/s)

Big Data Issues
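To get a feel for why these numbers are a problem together, a rough back-of-the-envelope calculation (using the ballpark figures above): reading a single 1 TB disk end to end at about 100 MB/s takes roughly 1,000,000 MB / 100 MB/s = 10,000 seconds, i.e. close to three hours, and pushing 100 TB through a single 100 MB/s channel would take on the order of two weeks. Splitting the data over many disks and machines and reading them in parallel is the practical way out, which is exactly the resolution on the next slide.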

Page 14: Big data & Hadoop

Resolution

• Fragment data into small pieces

• Process data in parallel

• Collect results
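As a toy illustration of these three steps (a minimal sketch, with plain Java threads standing in for a real cluster and a few hard-coded text fragments as hypothetical input), the code below counts words in each fragment in parallel and then merges the partial results:

import java.util.*;
import java.util.concurrent.*;

// Toy version of the resolution above: fragment the data,
// process the fragments in parallel, then collect (merge) the results.
public class ParallelCount {

    // Process one fragment: count how often each word occurs in it.
    static Map<String, Integer> countWords(String fragment) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : fragment.split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Fragment data into small pieces -- here, three hard-coded lines.
        List<String> fragments = Arrays.asList("SQL DW SQL", "BI SQL DW", "SQL BI SQL");

        // Process data in parallel -- one task per fragment.
        ExecutorService pool = Executors.newFixedThreadPool(fragments.size());
        List<Future<Map<String, Integer>>> partials = new ArrayList<>();
        for (String fragment : fragments) {
            partials.add(pool.submit(() -> countWords(fragment)));
        }

        // Collect results -- merge the partial counts into one total.
        Map<String, Integer> total = new TreeMap<>();
        for (Future<Map<String, Integer>> partial : partials) {
            partial.get().forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        pool.shutdown();
        System.out.println(total);   // {BI=2, DW=2, SQL=5}
    }
}

Hadoop automates exactly these steps, but across machines and disks instead of threads, with fault tolerance built in.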

Page 15: Big data & Hadoop


Page 16: Big data & Hadoop


Big Data Technology

Page 17: Big data & Hadoop

Open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters.

Apache HADOOP

Page 18: Big data & Hadoop

Google File System – 2003

MapReduce – 2004

Hadoop 0.1.0 released – 2006

Hadoop Release 2.6.4 – 2016

HADOOP History

Page 19: Big data & Hadoop

Storage part: Hadoop Distributed File System (HDFS)

Processing part: MapReduce

Apache Hadoop components

Page 20: Big data & Hadoop

Distributed

Scalable

Portable file-system

Written in Java

Hadoop distributed file system (HDFS)

Page 21: Big data & Hadoop

HDFS Architecture

Page 22: Big data & Hadoop

An HDFS cluster consists of:

◦ A single NameNode: a master server that manages the file system namespace and regulates access to files by clients.

◦ A number of DataNodes, usually one per computer node in the cluster, which manage the storage attached to the nodes that they run on.

HDFS Cluster

Page 23: Big data & Hadoop

An HDFS file consists of a number of blocks.

Each block is typically 64 MB.

Each block is replicated a specified number of times.

The replicas of the blocks are stored on different DataNodes, chosen to reflect the load on each DataNode as well as to provide both speed of transfer and resiliency in case of failure of a rack.

HDFS Files
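A minimal sketch of how a client can look at these blocks and replicas through Hadoop’s Java FileSystem API (the NameNode address hdfs://namenode:8020 and the file path /data/sample.txt are hypothetical placeholders):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path file = new Path("/data/sample.txt");   // hypothetical HDFS file
        FileStatus status = fs.getFileStatus(file);

        // Block size and replication factor recorded for this file.
        System.out.println("Block size:  " + status.getBlockSize() + " bytes");
        System.out.println("Replication: " + status.getReplication());

        // Which DataNodes hold a replica of each block of the file.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}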

Page 24: Big data & Hadoop

A standard directory structure is used in HDFS: HDFS files exist in directories, which may in turn be sub-directories of other directories, and so on.

There is no concept of a current directory within HDFS.

The NameNode executes HDFS file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

HDFS File System

Page 25: Big data & Hadoop

The list of blocks belonging to each HDFS file, the current location of the block replicas on the DataNodes, the state of the file, and the access control information make up the metadata for the cluster and are managed by the NameNode.

DataNodes are responsible for serving read and write requests from the HDFS file system’s clients. The DataNodes also perform block replica creation, deletion, and replication upon instruction from the NameNode.

HDFS File System (Cont’d)

Page 26: Big data & Hadoop
Page 27: Big data & Hadoop

MapReduce is the heart of Hadoop®.

It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.

Map Reduce

Page 28: Big data & Hadoop

Provides a parallel programming model.

Moves computation to where the data is.

Handles scheduling and fault tolerance.

Provides status reporting and monitoring.

Introduced by Google.

Map Reduce

Page 29: Big data & Hadoop

The data set should be big enough to ensure that splitting up the data will increase overall performance and will not be detrimental to it.

The computations are generally not dependent on external input.

The calculations/processing that run on one subset of the data need to be mergeable with the results from the other subsets.

The resultant data set should be smaller than the initial data set.

Map Reduce Requirements

Page 30: Big data & Hadoop

Map
◦ Takes an input pair and produces intermediate key/value pairs.
◦ All intermediate pairs are then grouped according to a common intermediate key.

Reduce
◦ Accepts an intermediate key and the set of values for that key.
◦ Merges these values together to form a possibly smaller set of values.
◦ Typically produces zero or one output value per invocation.

Programming Model
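To make the two phases concrete, here is a small in-memory sketch of the same model in plain Java (no Hadoop involved; the input lines are made up): map emits (word, 1) pairs, the pairs are grouped by their intermediate key, and reduce merges the values for each key into a single count.

import java.util.*;

// In-memory illustration of the Map, group-by-key, and Reduce phases described above.
public class MapReduceModel {
    public static void main(String[] args) {
        List<String> input = Arrays.asList("SQL DW SQL", "BI SQL DW");

        // Map: each input record produces intermediate (key, value) pairs.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String record : input) {
            for (String word : record.split("\\s+")) {
                intermediate.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }

        // Group: collect all values that share the same intermediate key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce: merge the values for each key into one (smaller) output value.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            System.out.println(entry.getKey() + " -> " + sum);   // BI -> 1, DW -> 2, SQL -> 3
        }
    }
}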

Page 31: Big data & Hadoop

User Program

Map Workers

Reduce Workers

Return to the User Program

Execution Overview

Page 32: Big data & Hadoop

Execution typically begins with the user program. MapReduce libraries imported into the program are used to split the input data set.

Every machine in the cluster has a separate copy of the program running on it.

One of the copies of the program is special. It is called the Master.

The rest of the programs are assigned to work under the master and are referred to as Workers.

There are X map tasks and Y reduce tasks to perform. The Master picks idle workers and assigns each of them a map task or a reduce task.

User Program

Page 33: Big data & Hadoop

The worker that is assigned the Map task takes the split input data and generates the key/value pair for each segment of input data.

The worker then invokes the user-defined Map function. The resultant values of the Map function are buffered in memory. The data in these temporary buffers is later written to disk.

The physical address of these contents is passed to the Master.

The Master then finds idle workers and passes these physical memory addresses to them to perform the Reduce task.

Map Workers

Page 34: Big data & Hadoop

A reduce worker, when notified by the Master, uses remote procedure calls to access the buffered data from the Map workers.

When a reduce worker has read all the intermediate data, it groups together all the data of the same intermediate key.

Many different keys map to the same task because of the parallel processing nature of the tasks. Hence the above sorting step is required.

Each unique key and its data are passed by the reduce worker to the user-defined Reduce function.

The output of the Reduce function is written to an output file, usually on a distributed file system.

Reduce Workers

Page 35: Big data & Hadoop

After all Map and Reduce functions have been run, the Master sends control back to the user program.

Return to the User Program

Page 36: Big data & Hadoop
Page 37: Big data & Hadoop
Page 38: Big data & Hadoop

(Diagram: sample input text and the resulting word-count output.)

MapReduce Word Count Example

Page 39: Big data & Hadoop
Page 40: Big data & Hadoop

Input: In this step, the sample file is input to MapReduce.

Split: In this step, Hadoop splits/divides our sample input file into four parts, each part made up of one line from the input file.

Map: In this step, each split is fed to a mapper, i.e. the map() function containing the logic on how to process the input data, which in our case is the line of text present in the split.

Combine: This is an optional step and is often used to improve the performance by reducing the amount of data transferred across the network. This is essentially the same as the reducer (reduce() function) and acts on output from each mapper. In our example, the key value pairs from first mapper "(SQL, 1), (DW, 1), (SQL, 1)" are combined and the output of the corresponding combiner becomes "(SQL, 2), (DW, 1)".

Shuffle and Sort: In this step, the output of all the mappers is collected, shuffled, sorted, and arranged to be sent to the reducer.

Reduce: In this step, the collective data from various mappers, after being shuffled and sorted, is combined / aggregated and the word counts are produced as (key, value) pairs like (BI, 1), (DW, 2), (SQL, 5), and so on.

Output: In this step, the output of the reducer is written to a file on HDFS.
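For reference, here is how this word count is typically written against Hadoop’s Java MapReduce API (a minimal, standard-style sketch; the input and output paths are taken from the command line, and the combiner is simply the reducer class, matching the optional Combine step described above):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce (also used as the combiner): sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional Combine step
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file(s) on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would normally be packaged into a jar and launched with the hadoop command, for example: hadoop jar wordcount.jar WordCount /input /output.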

Page 41: Big data & Hadoop

Thank You

