Date post: | 28-Nov-2014 |
Category: |
Data & Analytics |
Upload: | mahantesh-angadi |
FEARLESS engineering
Acharya institute of technology, Bengaluru.
A grand welcome to the world of the “bigdata_community”
Introduction to BigData and Installation of a Single-Node Apache Hadoop Cluster
Presented by: Mahantesh C. Angadi, Nagarjuna DN, Manoj PT
Under the guidance of: Prof. Manjunath TN, Prof. Amogh PK, Dept. of ISE, AIT, Bengaluru
Session-1: 24 March 2014
Contents
Purpose of This Talk
Introduction
Terminologies
Cloud Computing
BigData
Traditional Approaches to Solve BigData Problems
Hadoop and its Characteristics
Architecture of Hadoop
Why Only Hadoop?
Advantages of Hadoop
Limitations of Hadoop
Job Opportunities in BigData & Hadoop
Conclusion
References
Purpose of This Talk
To understand the terminologies
To introduce you to BigData and Hadoop
To clearly differentiate between Cloud, BigData, and Hadoop
To review traditional approaches to handling BigData
To describe the characteristics of Hadoop
To explain how Hadoop works
To get friendly with MapReduce and HDFS
To become aware of the Hadoop ecosystem
To weigh the advantages and limitations of Hadoop
To outline job opportunities in Hadoop
Conclusion
introduction
Today we live in the Data Age.
Due to the Internet of Things (IoT), the speed of
data ingestion keeps increasing.
So, the world is getting “hungrier and hungrier
for data”.
terminologies
Cloud Computing
BigData
Hadoop
Distributed Computing
Parallel Computing
Utility Computing
Data Scientist
Let's start the journey…!
Cloud computing
“Computing may someday be organized as a public
utility, just as the telephone system is organized as a
public utility”
- John McCarthy, 1961
The word “Cloud” was first used in a technical
sense by people at HP and Compaq.
Cloud Computing is a form of Utility Computing in which a
large number of computers, connected through a
communication network such as the Internet, provide
services on demand.
utility computing
Utility computing is the packaging of computing
resources, such as computation, storage and services, as
a metered service.
This model has the advantage of a low or no initial cost
to acquire computer resources.
Distributed computing
Distributed computing refers to the use of distributed
systems to solve computational problems.
Here a problem is divided into many small
tasks, each of which is solved by one or more
computers that communicate with each other by
passing messages.
Parallel Computing is a form of computation in which
many calculations are carried out simultaneously,
operating on the principle that large problems can often
be divided into smaller ones, which are then
solved concurrently.
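The divide-and-combine idea above can be sketched in a few lines of Python. This is only an illustration: the function names (`split_into_chunks`, `parallel_sum`), the chunk size, and the use of a standard-library thread pool are assumptions made for this sketch, not part of the talk. A thread pool shows the structure; for CPU-heavy work in Python a process pool would be the more usual choice.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(numbers, chunk_size):
    """Divide one large problem (a list) into many small tasks (sublists)."""
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]


def parallel_sum(numbers, chunk_size=1000, workers=4):
    """Sum each chunk concurrently, then combine the partial results."""
    chunks = split_into_chunks(numbers, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    data = list(range(1, 10001))
    print(parallel_sum(data))  # same value as sum(data)
```

Each chunk is an independent small task, and the final `sum` over the partial results is the combining step.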
Why does BigData deserve our attention?
Every day we create 2.5 quintillion bytes of data; 90% of
this data is unstructured.
90% of the data in the world today has been created
in the last two years alone.
By the end of 2015, Cisco estimates that global
Internet traffic will reach 4.8 zettabytes a year.
BigData would create 4.4 million jobs by 2015.
There is a shortage of 140,000-190,000 BigData
professionals in the United States alone…!
what happens in an internet minute…?
what is bigdata…?
BigData is any amount of structured and/or
unstructured data that is beyond the
storage and processing capabilities of a single
physical machine and traditional database techniques.
Data that has extra-large Volume, comes from a Variety
of sources in a Variety of formats, and arrives with
great Velocity is normally referred to as BigData.
3 v’s of bigdata
Rise of bigdata adoption
Data Scientist is the hottest job of the 21st century…!
- Harvard Business Review Magazine
Positions such as Data Scientist and Data Analyst
did not exist a few years ago.
Today companies are fighting to recruit these specialists.
The market is not growing at the rate it wants to,
because a skills shortage is looming, and this pushes
salaries up…!
Data Scientists take huge amounts of data & attempt to
pull useful “Business Insights” from that raw data.
Traditional approaches to solve bigdata problems
Traditional approach: Storage Area Network (SAN)
[Figure: Application Servers]
Characteristics of Storage Area Network (SAN)
A SAN can be visualized as one massive store that
appears to offer unlimited storage.
Data is moved to the computational nodes.
It has multiple Application Servers.
Programs run on each Application Server.
All the data is stored in one SAN.
Before Execution, each server Gets the data from SAN.
After Execution, each server Writes the output to SAN.
Problems with Storage Area Network (SAN)
Huge dependency between networks
Huge bandwidth demand
Scaling up and scaling down is not a smooth process
Partial failures are also difficult to handle
A lot of processing power is spent on Transferring the Data
Data Synchronization is required during exchange
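A rough calculation shows why the bandwidth demand matters. The figures used here (1 TB of input, a 1 Gbit/s link, no protocol overhead or contention) are illustrative assumptions, not measurements from the talk.

```python
def transfer_time_seconds(data_bytes, link_bits_per_second):
    """Ideal time to move data over a link, ignoring overhead and contention."""
    return data_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    one_terabyte = 10**12      # assumed dataset size, in bytes
    one_gigabit_link = 10**9   # assumed link speed, in bits per second
    seconds = transfer_time_seconds(one_terabyte, one_gigabit_link)
    print(f"{seconds:.0f} s (~{seconds / 3600:.1f} hours)")
```

Under these assumptions a server spends more than two hours just fetching its input before any processing starts, and that is before several servers compete for the same storage network.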
Moore’s Law
“The number of transistors that can be placed on a
silicon chip will double approximately every
two years, for half the cost.”
It is named after Gordon Moore, a co-founder of Intel.
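The doubling rule can be written as N(t) = N0 × 2^(t/2), with t in years. A minimal sketch of that arithmetic follows; the starting count of 1,000 transistors is an arbitrary illustrative number and the function name is hypothetical.

```python
def projected_transistors(initial_count, years, doubling_period_years=2):
    """Transistor count after `years`, doubling once per doubling period."""
    return initial_count * 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    for years in (0, 2, 10, 20):
        print(years, projected_transistors(1000, years))
```

Ten doublings in twenty years multiply the count by 1,024, which is why the growth feels slow at first and explosive later.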
hadoop
Inspired by Google.
Google was founded in 1998.
In the early 2000s they faced a serious challenge in
handling BigData.
In 2003 and 2004 Google released two papers:
- GFS: Google File System (2003)
- MapReduce: A Programming Model (2004)
what is hadoop…?
Apache Hadoop is an open-source software framework,
used to manage BigData.
It is built and used by a global community of contributors and
users.
It’s not just a single tool; it’s a framework of tools.
Moving computation is cheaper than moving data.
Most important Hadoop sub-projects:
i. HDFS: Hadoop Distributed File System
ii. MapReduce: A Programming Model
founders of hadoop
why the name hadoop…?
“Hadoop” is simply the name of a stuffed toy elephant that belonged to the son of its creator, Doug Cutting.
Characteristics of Hadoop
Scalable - New nodes can be added without changing
data formats.
Cost-effective - It processes huge datasets in parallel on
large clusters of commodity computers.
Efficient and Flexible - It is schema-less, and can
absorb any type of data, from any number of sources.
Fault-tolerant and Reliable - It handles node failures
easily because of replication.
Easy to use - It uses simple Map and Reduce functions
to process the data.
It is developed in Java, but it can support Python and
other languages too.
who uses hadoop…?
Hadoop ecosystem
Hadoop core components
Hadoop core has two major components:
1. HDFS
a. Name Node
b. Secondary Name Node
c. Data Node
2. MapReduce Engine
a. Job Tracker
b. Task Tracker
Architecture of hadoop
overview of hadoop
Hadoop Distributed File System
Inspired by the Google File System (GFS)
It consists of three major components -
i. Name Node
• It keeps the file system metadata and is responsible for how data is distributed throughout the Hadoop cluster.
ii. Secondary Name Node (a checkpointing helper, not a hot standby)
• It regularly contacts the Name Node and maintains up-to-date snapshots of the Name Node's directory information.
iii. Data Node
• It is responsible for storing the chunks (blocks) of data assigned to it by the Name Node.
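The division of labour described above can be sketched with a toy model. Everything in it is a simplification made for illustration: the class name `ToyNameNode`, the round-robin placement, and the tiny block size are assumptions of this sketch. Real HDFS uses much larger blocks and rack-aware placement; three replicas is the commonly used default.

```python
def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


class ToyNameNode:
    """Keeps only metadata: which data nodes hold which block."""

    def __init__(self, data_nodes, replication=3):
        if replication > len(data_nodes):
            raise ValueError("need at least as many data nodes as replicas")
        self.data_nodes = list(data_nodes)
        self.replication = replication
        self.block_map = {}  # (file name, block index) -> list of node names

    def place_file(self, name, data, block_size):
        """Assign every block of the file to `replication` distinct nodes."""
        blocks = split_into_blocks(data, block_size)
        for index in range(len(blocks)):
            # Simple round-robin placement (an assumption of this sketch).
            self.block_map[(name, index)] = [
                self.data_nodes[(index + r) % len(self.data_nodes)]
                for r in range(self.replication)
            ]
        return len(blocks)

    def readable_after_failure(self, name, failed_node):
        """True if every block still has a replica on a surviving node."""
        return all(
            any(node != failed_node for node in nodes)
            for (file_name, _), nodes in self.block_map.items()
            if file_name == name
        )


if __name__ == "__main__":
    name_node = ToyNameNode(["dn1", "dn2", "dn3", "dn4"], replication=3)
    print(name_node.place_file("log.txt", b"x" * 1000, block_size=256))
    print(name_node.block_map[("log.txt", 0)])
    print(name_node.readable_after_failure("log.txt", "dn1"))
```

The Name Node in this sketch never touches the file contents; it only records where each block lives, which is why losing one Data Node does not make the file unreadable.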
Mapreduce
Pioneered by Google, Popularized by Yahoo (Apache).
It consists of two major components –
i. Job Tracker
• It is responsible for scheduling tasks on the slave nodes.
• It consults the Name Node and assigns each task to a node that already holds the data the task will process.
ii. Task Tracker
• It runs the actual task logic: it performs the Map and Reduce functions on the data assigned to it by the Job Tracker (the master node).
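The data-locality rule described above can be sketched as a small scheduling function. This is a toy under stated assumptions: the function name `assign_tasks`, the slot counts, and the tie-breaking rule (pick the local node with the most free slots) are all hypothetical, and the real Job Tracker is considerably more involved.

```python
def assign_tasks(block_locations, free_slots):
    """Assign each block's map task to a node that already stores the block.

    block_locations: dict of block id -> list of nodes holding a replica
    free_slots: dict of node -> number of tasks it can still accept
    Returns a dict of block id -> chosen node (None if no local node is free).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, nodes in block_locations.items():
        local = [n for n in nodes if slots.get(n, 0) > 0]
        if local:
            # Prefer the local node with the most free slots.
            chosen = max(local, key=lambda n: slots[n])
            slots[chosen] -= 1
            assignment[block] = chosen
        else:
            # A real scheduler would fall back to a non-local node here.
            assignment[block] = None
    return assignment


if __name__ == "__main__":
    locations = {"b0": ["n1", "n2"], "b1": ["n1"], "b2": ["n3"]}
    print(assign_tasks(locations, {"n1": 1, "n2": 2, "n3": 1}))
```

Because each task lands on a node that already holds its block, no input data has to cross the network before the Map function runs.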
Distributed model
Task tracker and data nodes
Master/slave architecture
Mapreduce example: wordcount
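A word count can be sketched in plain Python to show the three phases. This is a single-process illustration, not Hadoop code: the function names are chosen for this sketch, and in a real cluster the map and reduce steps run on different nodes while the framework performs the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(word_count(["Deer Bear River", "Car Car River", "Deer Car Bear"]))
```

Each input line can be mapped independently and each word can be reduced independently, which is what lets the same logic spread across many machines.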
Advantages of hadoop
Moving Computation is far better than Moving Data
Runs on commodity hardware
It’s a Master/Slave architecture
It detects node failures through live heartbeats
It handles assigning tasks to nodes
It is rack-aware when placing data on nodes
So, programmers only need to concentrate on getting
business value from BigData
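Heartbeat-based failure detection can be sketched as follows. The function name, the timestamps, and the 30-second timeout are illustrative assumptions, not Hadoop's actual settings.

```python
def find_dead_nodes(last_heartbeat, now, timeout_seconds):
    """Return nodes whose most recent heartbeat is older than the timeout."""
    return sorted(
        node for node, seen_at in last_heartbeat.items()
        if now - seen_at > timeout_seconds
    )


if __name__ == "__main__":
    # Time (in seconds) at which the master last heard from each node.
    last_heartbeat = {"dn1": 100.0, "dn2": 95.0, "dn3": 40.0}
    print(find_dead_nodes(last_heartbeat, now=105.0, timeout_seconds=30.0))
```

Once a node is reported dead, the master can reschedule its tasks elsewhere and rely on the remaining replicas of its data.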
limitations of hadoop
Do you think Hadoop is a “Silver Bullet” that can solve all
kinds of problems…?
- The answer is NO…!!!
Not suitable if the data is too small.
Not suitable if there are dependencies within the data.
Not suitable if the job cannot be divided into small chunks.
Not suitable for real-time and stream-based
processing.
Closer home
AADHAAR – Government of India’s UIDAI project is considered one of the largest
BigData projects in the World...!
Closer home
Feb 14th 2011 – IBM’s supercomputer “WATSON” was built using BigData technology.
It is not online, and it processes information like a human brain…!
Job opportunities
Roles and Profiles:
Hadoop Administrator
Pre-req: Networking, Admin
Hadoop Developer
Pre-req: Programming Expertise
Preferably Java/Python
Data Scientist and Data Analyst
Pre-req: Mathematics, Statistical Background
Scripting languages like Perl etc.
A common perception of BigData
“BigData is like teenage sex:
everyone talks about it, nobody really
knows how to do it, everyone thinks
that everyone else is doing it, so everyone
claims they are doing it…! But anyone
who actually tries it will be terrible at
it...!”
- Dan Ariely, Behavioral Economics Guru.
conclusion
• BigData brings new and exciting
opportunities to companies that make use of the
available platforms.
• In this Information Era, BigData technology
has become important for businesses.
• It offers many opportunities in the coming
years.
references
WWW.APCHE.ORG
WWW.BIGDATAUNIVERSITY.COM
WWW.CLOUDERA.ORG
WWW.HORTONWORKS.COM
WWW.WIKIPEDIA.ORG
WWW.EDUREKHA.COM
WWW.EXPLAININGCOMPUTERS.COM
WWW.YAHOO.COM
WWW.GOOGLE.COM
WWW.INFORMATIONWEEK.COM
WWW.CS.BERKELEY.EDU
WWW.IBM.COM
WWW.INTEL.COM
WWW.STACKOVERFLOW.COM
WWW.TECHTARGET.COM
WWW.MICHAEL-NOLL.COM
WWW.MEETUP.COM
AND MANY MORE…!!!
Any queries…???
Thank you, one and all, for your patience <3