
Big Data

Date post: 14-Aug-2015

By: Priyanka Tuteja (2k14-mtech(cse)-mrce-012)

Introduction

Outlines

1. What is Big Data
2. Big Data generators
3. Why Big Data
4. Characteristics of Big Data
5. Big Data – a worldwide problem
6. Solution for Big Data
7. Hadoop
   – HDFS
   – Map Reduce
8. How Big Data impacts IT
9. Future of Big Data

What is big data?

Big data is a collection of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications.

In simpler terms, Big Data is a term given to the large volumes of data that organizations store and process.

Huge amount of data

+ From the beginning of recorded time until 2003, we created 5 billion gigabytes (5 exabytes) of data.

+ In 2011, the same amount was created every two days.

+ In 2013, the same amount of data was created every 10 minutes.
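A quick back-of-the-envelope check of the rates these figures imply, taking 5 exabytes as 5 × 10^18 bytes:

```python
# Back-of-the-envelope data-generation rates implied by the slide's figures.
EXABYTE = 10**18
total_until_2003 = 5 * EXABYTE        # bytes created up to 2003

seconds_2011 = 2 * 24 * 3600          # the same amount every two days in 2011
seconds_2013 = 10 * 60                # ... and every ten minutes in 2013

rate_2011 = total_until_2003 / seconds_2011   # bytes per second
rate_2013 = total_until_2003 / seconds_2013

print(f"2011: {rate_2011:.2e} B/s")
print(f"2013: {rate_2013:.2e} B/s")
print(f"speed-up 2011 -> 2013: {rate_2013 / rate_2011:.0f}x")
```

Going from "every two days" to "every ten minutes" is a 288× jump in the rate of data creation in just two years.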

Types of Data Generators

This data comes from everywhere:
• sensors used to gather climate information
• posts to social media sites
• digital pictures
• online shopping
• hospitality data
• airlines
• purchase transaction records, and many more…

This data is “big data.”

Comparison

              1990’s          2014
Hard disk:    1 GB – 20 GB    1 TB
RAM:          64 – 128 MB     4 – 16 GB
Read speed:   10 KBPS         100 MBPS

What does Big Data require?

• The growth of Big Data calls for:
  – an increase in storage capacity
  – an increase in processing power
  – availability of data (of different types)

• Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been created in the last two years alone.

Big Data stores

• Choose the correct data store based on your data’s characteristics.

• Data-center staff maintain these servers, which can be IBM, EMC servers, etc.

• Whenever you want to process data:
  – fetch the data,
  – bring it to your local machine,
  – then process it.

Three Characteristics of Big Data: The 3 Vs

• Volume – data quantity
• Velocity – data speed
• Variety – data types

1st Characteristic of Big Data: Volume

• It refers to the vast amount of data generated every second.

•The size of available data has been growing at an increasing rate.

•Today, Facebook ingests 500 terabytes of new data every day.

• Smartphones, the data they create and consume, and sensors embedded in everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, and other information, including video.

2nd Characteristic of Big Data: Velocity

• It refers to the speed at which new data is being generated.

• The speed at which data moves around.

• Clickstreams and ad impressions capture user behavior at millions of events per second.

• Machine-to-machine processes exchange data between billions of devices.

• Online gaming systems support millions of concurrent users, each producing multiple inputs per second.

3rd Characteristic of Big Data: Variety

• It refers to the different types of data we are now using.

• In the past we focused only on structured data that fitted neatly into tables and relational databases.

• Nowadays 80% of data is unstructured (text, images, video, voice) or semi-structured (log files).

• Big Data analysis includes different types of data.

Big Data! A Worldwide Problem?

It is becoming very difficult for companies to store, retrieve and process the ever-increasing data.

The problem lies in the use of traditional systems to store enormous data.

These systems were a success a few years ago, but with the increasing amount and complexity of data, they are quickly becoming obsolete.

Contd..

• When data is small, processing speed is adequate.
• As soon as data increases, processing can no longer keep up.
• Processing capacity must therefore scale with the data.
• Thus, HADOOP was introduced as the best solution.

Solution for Big Data!

The good news is Hadoop – a panacea for all those companies working with BIG DATA in a variety of applications. It has become an integral part of storing, handling, evaluating and retrieving hundreds of terabytes or even petabytes of data.

Apache Hadoop!

Hadoop was developed by Doug Cutting and Michael J. Cafarella.

Hadoop is an open-source software framework. It supports data-intensive distributed applications. Hadoop is licensed under the Apache v2 license, and is therefore known as Apache Hadoop.

Core concepts of Hadoop

• HDFS (Hadoop Distributed File System)
  – A technique for storing huge amounts of data.

• Map Reduce
  – A technique for processing the data which we store in HDFS.

HDFS

• It is a file system specially designed for storing huge data sets on a cluster of commodity hardware with a streaming access pattern.
  – cluster: a set of machines working together
  – commodity h/w: cheap hardware
  – streaming access pattern: write once, read any number of times, but do not try to change the contents of a file once you have stored it in HDFS

CONTD.

• HDFS (Hadoop Distributed File System) splits files into large blocks (64 MB or 128 MB by default) and distributes the blocks amongst the nodes in the cluster.

• For processing the data, Hadoop Map/Reduce ships code to the nodes that have the required data, and the nodes then process the data in parallel.
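The block-splitting idea can be sketched in a few lines of Python. This is a toy illustration only, assuming the 128 MB default block size and simple round-robin placement; real HDFS placement is replica- and rack-aware, and the node names here are made up:

```python
import itertools
import math

BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size (128 MB)

def split_into_blocks(file_size, nodes, block_size=BLOCK_SIZE):
    """Split a file of `file_size` bytes into fixed-size blocks and
    assign each block to a node, round-robin (a simplification of
    HDFS's actual placement policy)."""
    n_blocks = math.ceil(file_size / block_size)
    node_cycle = itertools.cycle(nodes)
    return [(i, next(node_cycle)) for i in range(n_blocks)]

# A 500 MB file becomes 4 blocks, spread across a 3-node cluster.
placement = split_into_blocks(500 * 1024 * 1024, ["node1", "node2", "node3"])
print(placement)
```

Because each block lives on a specific node, the framework can ship the processing code to that node rather than moving the data.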

Map Reduce

• It is a technique for processing the data which we store in HDFS.

• Hadoop runs Map Reduce in the form of key/value pairs.
• The Mapper and the Reducer also work with key/value pairs.
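The classic word-count job shows this key/value contract. The sketch below is plain Python, not Hadoop’s Java API; the `mapper`/`reducer` function names are illustrative, and the grouping step stands in for the framework’s shuffle:

```python
from collections import defaultdict

def mapper(line):
    # Map step: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # Reduce step: sum all the counts that arrived for one key.
    return (key, sum(values))

lines = ["how is big data", "big data is big"]

# Map phase: every input record produces key/value pairs.
intermediate = [pair for line in lines for pair in mapper(line)]

# The framework groups values by key (the shuffle), then calls the reducer.
groups = defaultdict(list)
for k, v in intermediate:
    groups[k].append(v)

result = dict(reducer(k, vs) for k, vs in groups.items())
print(result)   # {'how': 1, 'is': 2, 'big': 3, 'data': 2}
```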

Contd..

• A record reader is an interface between an input split and a mapper.
• For every input split and mapper there is one record reader.
• The record reader is taken care of by the Hadoop framework itself by default.
• In the Mapper code we write the logic that produces key/value pairs.
• The record reader converts records into key/value pairs based on three file formats:
  – Text Input Format (the default)
  – KeyValueText Input Format
  – SequenceFile Input Format
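A minimal sketch of what a Text Input Format-style record reader produces: the key is the offset of each line within the split and the value is the line’s text. This is a simplified stand-in for the real Hadoop classes, not their implementation:

```python
def text_input_format(split_text):
    """Yield (offset, line) records from one input split, the way a
    Text Input Format record reader hands key/value pairs to a mapper."""
    offset = 0
    for line in split_text.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))   # key = offset, value = line text
        offset += len(line)

split = "big data\nhadoop hdfs\nmap reduce\n"
records = list(text_input_format(split))
print(records)   # [(0, 'big data'), (9, 'hadoop hdfs'), (21, 'map reduce')]
```

Each record then becomes one call to the mapper, which is why the mapper never has to parse the raw file itself.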

• Shuffling:
  – A phase on the intermediate data that combines all key/value pairs into a collection associated with the same key, e.g.
      (how, [1 1 1 1 1])
      (is, [1 1 1 1 1])

• Sorting:
  – Another phase on the intermediate data that sorts all key/value pairs by key.
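The two phases can be sketched in plain Python. This is a toy illustration of what the framework does between map and reduce, not Hadoop’s actual shuffle implementation:

```python
from itertools import groupby
from operator import itemgetter

# Unsorted intermediate output from the map phase.
pairs = [("how", 1), ("is", 1), ("how", 1), ("is", 1), ("how", 1)]

# Sorting: order the pairs by key so that equal keys become adjacent.
pairs.sort(key=itemgetter(0))

# Shuffling: collect all values belonging to the same key into one
# list, giving the (how, [1 1 1]) shape shown above.
shuffled = [(k, [v for _, v in grp]) for k, grp in groupby(pairs, key=itemgetter(0))]
print(shuffled)   # [('how', [1, 1, 1]), ('is', [1, 1])]
```

Each `(key, [values])` pair is then handed to one reducer call.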

How Big Data impacts IT

• Big Data is a disruptive force, presenting opportunities along with challenges to IT organizations.

• By 2015 there will be 4.4 million IT jobs in Big Data, 1.9 million of them in the US alone.

• India will require a minimum of 1 lakh (100,000) data scientists in the next couple of years, in addition to data analysts and data managers, to support the Big Data space.

Future of Big Data

• $15 billion has been spent on software firms specializing solely in data management and analytics.

• This industry on its own is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business as a whole.

• In February 2012, the open-source analyst firm Wikibon released the first market forecast for Big Data, listing $5.1B in revenue in 2012 with growth to $53.4B in 2017.

• The McKinsey Global Institute estimates that data volume is growing 40% per year.

Thank you

