www.edureka.co/big-data-and-hadoop
Introduction to Big Data and Hadoop
Slide 2 www.edureka.co/big-data-and-hadoop
Slide 3 www.edureka.co/big-data-and-hadoop
At the end of this session, you will understand:
→ Big Data Learning Paths
→ Big Data Introduction
→ Hadoop and Its Eco-System
→ Hadoop Architecture
→ Next Step on How to Setup Hadoop
Slide 4 www.edureka.co/big-data-and-hadoop
→ Developer/Testing: Java / Python / Ruby, Hadoop Eco-system, NoSQL DB, Spark
→ Administration: Linux Administration, Cluster Management, Cluster Performance, Virtualization
→ Data Analyst: Statistics Skills, Machine Learning, Hadoop Essentials, Expertise in R
Big Data and Hadoop
MapReduce Design Patterns
Apache Spark & Scala
Apache Cassandra
Linux Administration
Hadoop Administration
Data Science
Business Analytics Using R
Advanced Predictive Modelling in R
Talend for Big Data
Data Visualization Using Tableau
Slide 5 www.edureka.co/big-data-and-hadoop
→ Lots of Data (Terabytes or Petabytes)
→ Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
→ The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
Slide 6 www.edureka.co/big-data-and-hadoop
→ Systems and enterprises generate huge amounts of data, from Terabytes to Petabytes of information.
The stock market generates about one terabyte of new trade data per day; stock-trading analytics process this data to determine trends for optimal trades.
Slide 7 www.edureka.co/big-data-and-hadoop
→ IDC (International Data Corporation) predicts that by 2020 the world's digital data will have reached 40,000 Exabytes (EB), or 40 Zettabytes (ZB)
→ The world’s information is doubling every two years. By 2020 the world will generate 50 times the amount of information and 75 times the number of information containers
Slide 8 www.edureka.co/big-data-and-hadoop
IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
→ VOLUME: the sheer scale of the data
→ VARIETY: the many forms of data, e.g. web logs, images, videos, audios, sensor data
→ VELOCITY: the speed at which data is generated and must be processed
→ VERACITY: the uncertainty and noise in the data
Slide 9 www.edureka.co/big-data-and-hadoop
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.
Slide 10 www.edureka.co/big-data-and-hadoop
Map the following to the corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM etc.)
Slide 11 www.edureka.co/big-data-and-hadoop
Ans.
XML files, e-mail body → Semi-structured data
Audio, Video, Images, Archived documents → Unstructured data
Data from Enterprise systems (ERP, CRM etc.) → Structured data
Slide 12 www.edureka.co/big-data-and-hadoop
More on Big Data
• http://www.edureka.in/blog/the-hype-behind-big-data/
Why Hadoop?
• http://www.edureka.in/blog/why-hadoop/
Opportunities in Hadoop
• http://www.edureka.in/blog/jobs-in-hadoop/
Big Data
• http://en.wikipedia.org/wiki/Big_Data
IBM’s definition – Big Data Characteristics
• http://www-01.ibm.com/software/data/bigdata/
Slide 13 www.edureka.co/big-data-and-hadoop
→ Web and e-tailing
» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection
→ Telecommunications
» Customer Churn Prevention
» Network Performance Optimization
» Calling Data Record (CDR) Analysis
» Analysing Network to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy
Slide 14 www.edureka.co/big-data-and-hadoop
→ Government
» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice
→ Healthcare and Life Sciences
» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety
http://wiki.apache.org/hadoop/PoweredBy
Slide 15 www.edureka.co/big-data-and-hadoop
→ Banks and Financial services
» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis
→ Retail
» Point of Sales Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis
http://wiki.apache.org/hadoop/PoweredBy
Slide 16 www.edureka.co/big-data-and-hadoop
→ Insight into data can provide a business advantage.
→ Some key early indicators can mean fortunes to a business.
→ More data enables more precise analysis.
*Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process customer activity and sales data.
Case Study: Sears Holding Corporation
Slide 17 www.edureka.co/big-data-and-hadoop
Data flow in the existing architecture:
Instrumentation → Collection → Storage-only Grid (original raw data) → ETL Compute Grid → RDBMS (aggregated data, mostly append) → BI Reports + Interactive Apps

Storage: a meagre 10% of the ~2 PB of data is available for BI; 90% of the ~2 PB is archived.
Processing limitations:
1. Can't explore the original high-fidelity raw data
2. Moving data to compute doesn't scale
3. Premature data death
Slide 18 www.edureka.co/big-data-and-hadoop
Data flow with Hadoop:
Instrumentation → Collection → Hadoop: Storage + Compute Grid (both storage and processing) → RDBMS (aggregated data, mostly append) → BI Reports + Interactive Apps

The entire ~2 PB of data is available for processing, with no data archiving:
1. Data exploration & advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than a meagre 10% as was the case with the existing non-Hadoop solutions.
Slide 19 www.edureka.co/big-data-and-hadoop
Read 1 TB of data:
→ 1 machine: 4 I/O channels, each channel at 100 MB/s
→ 10 machines: 4 I/O channels each, 100 MB/s per channel
Slide 20 www.edureka.co/big-data-and-hadoop
Read 1 TB of data:
→ 1 machine: 4 I/O channels, each channel at 100 MB/s → 43 minutes
→ 10 machines: 4 I/O channels each, 100 MB/s per channel
Slide 21 www.edureka.co/big-data-and-hadoop
Read 1 TB of data:
→ 1 machine: 4 I/O channels, each channel at 100 MB/s → 43 minutes
→ 10 machines: 4 I/O channels each, 100 MB/s per channel → 4.3 minutes
Slide 22 www.edureka.co/big-data-and-hadoop
→ Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
→ It is open-source data management software with scale-out storage and distributed processing.
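The "simple programming model" is MapReduce. As a concrete taste, here is a minimal WordCount job in Java, essentially the canonical example from the Hadoop 2.x MapReduce tutorial (the class name and the input/output paths are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with something like: hadoop jar wordcount.jar WordCount /input /output (the jar name and HDFS paths are placeholders).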
Slide 23 www.edureka.co/big-data-and-hadoop
Hadoop is a framework that allows for the distributed processing of:
» Small Data Sets
» Large Data Sets
Slide 24 www.edureka.co/big-data-and-hadoop
Ans. Large Data Sets. Hadoop is also capable of processing small data sets; however, to experience the true power of Hadoop, one needs data in terabytes, because this is where an RDBMS takes hours (and may fail) whereas Hadoop does the same job in a couple of minutes.
Slide 25 www.edureka.co/big-data-and-hadoop
Hadoop 2.0 Eco-System:
→ HDFS (Hadoop Distributed File System): distributed storage layer
→ YARN: cluster resource management
→ MapReduce Framework, HBase, and other YARN frameworks (MPI, GRAPH): run on top of YARN
→ Pig Latin: data analysis
→ Hive: DW (data warehouse) system
→ Mahout: machine learning
→ Apache Oozie: workflow
→ Sqoop: ingests structured data
→ Flume: ingests unstructured or semi-structured data
Slide 26 www.edureka.co/big-data-and-hadoop
→ HDFS cluster: a NameNode (master) and multiple DataNodes (slaves)
→ YARN: a Resource Manager (master) and a Node Manager running alongside each DataNode
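To make the HDFS side concrete, here is a hedged sketch of a client writing to and reading from HDFS through the org.apache.hadoop.fs.FileSystem Java API; the fs.defaultFS URI and the file path are placeholders for your own cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder: point at your NameNode (host and port depend on your setup)
    conf.set("fs.defaultFS", "hdfs://master:8020");
    FileSystem fs = FileSystem.get(conf);

    // Write a small file; the NameNode records metadata, the DataNodes store blocks
    Path file = new Path("/user/edureka/hello.txt");
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("Hello, HDFS!");
    }

    // Read it back
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}

Note that the client only asks the NameNode for metadata; the actual bytes stream directly to and from the DataNodes.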
Slide 27 www.edureka.co/big-data-and-hadoop
Typical hardware configuration:
→ Active NameNode / Standby NameNode – RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: redundant power supply
→ Secondary NameNode – RAM: 32 GB, Hard disk: 1 TB, Processor: Xeon with 4 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: redundant power supply
→ DataNodes – RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS
Slide 28 www.edureka.co/big-data-and-hadoop
→ Master: NameNode (http://master:50070/) and ResourceManager (http://master:8088)
→ Slave01 … Slave05: each slave runs a DataNode and a NodeManager
Slide 29 www.edureka.co/big-data-and-hadoop
HDFS + YARN daemons across the cluster:
→ HDFS: one NameNode, with a DataNode on every slave
→ YARN: one ResourceManager, with a NodeManager on every slave, paired with each DataNode
Slide 30 www.edureka.co/big-data-and-hadoop
HDFS Federation (Hadoop 1.0 vs Hadoop 2.0):
→ Hadoop 1.0: a single NameNode holds the entire Namespace and manages the Block Storage of all DataNodes.
→ Hadoop 2.0: multiple independent NameNodes (NN-1 … NN-k … NN-n), each managing its own Namespace and its own Block Pool (Pool 1 … Pool k … Pool n).
→ Block Management: the DataNodes (Datanode 1, Datanode 2 … Datanode m) act as Common Storage shared by all the Block Pools.
Slide 31 www.edureka.co/big-data-and-hadoop
How does HDFS Federation help HDFS scale horizontally?
a. Reduces the load on any single NameNode by using multiple, independent NameNodes to manage individual parts of the file system namespace.
b. Provides cross-data-centre (non-local) support for HDFS, allowing a cluster administrator to split the Block Storage outside the local cluster.
Slide 32 www.edureka.co/big-data-and-hadoop
Ans. Option (a). In order to scale the name service horizontally, HDFS Federation uses multiple independent NameNodes. The NameNodes are federated; that is, they are independent and do not require coordination with each other.
Slide 33 www.edureka.co/big-data-and-hadoop
You have configured two NameNodes to manage /marketing and /finance respectively. What will happen if you try to put a file in the /accounting directory?
Slide 34 www.edureka.co/big-data-and-hadoop
Ans. The put will fail. Neither of the namespaces will manage the file, and you will get an IOException with a "No such file or directory" error.
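A hedged sketch of how that failure surfaces through the Java API (the file names are illustrative, and the scenario assumes only /marketing and /finance are mapped to the two configured NameNodes):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederationPut {
  public static void main(String[] args) throws Exception {
    // Assumes a federated cluster in which only /marketing and /finance
    // are managed by the two configured NameNodes (illustrative setup).
    FileSystem fs = FileSystem.get(new Configuration());
    try {
      // No NameNode owns /accounting, so the copy cannot succeed
      fs.copyFromLocalFile(new Path("ledger.csv"),
                           new Path("/accounting/ledger.csv"));
    } catch (IOException e) {
      // Expected outcome: a "No such file or directory" style error
      System.err.println("put failed: " + e.getMessage());
    }
  }
}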
Slide 35 www.edureka.co/big-data-and-hadoop
HDFS HIGH AVAILABILITY (NameNode High Availability):
→ An Active NameNode and a Standby NameNode share edit logs through Shared Edit Logs on NFS storage: all namespace edits are logged to the shared storage by a single writer (fencing), and the Standby NameNode reads the edit logs and applies them to its own namespace.
→ The DataNodes report to both NameNodes; the Client works against whichever NameNode is currently active.
→ *With HA, it is not necessary to configure a Secondary NameNode.

Next Generation MapReduce (YARN):
→ A Resource Manager arbitrates cluster resources, while a Node Manager on each slave hosts Containers and per-application App Masters.
Slide 36 www.edureka.co/big-data-and-hadoop
Slide 37 www.edureka.co/big-data-and-hadoop
HDFS HA was developed to overcome which of the following disadvantages in Hadoop 1.0?
a. Single Point of Failure of the NameNode
b. Only one version can be run in classic MapReduce
c. Too much burden on the Job Tracker
Slide 38 www.edureka.co/big-data-and-hadoop
Ans. (a) Single Point of Failure of the NameNode.
Slide 39 www.edureka.co/big-data-and-hadoop
…
Slide 40 www.edureka.co/big-data-and-hadoop
→ We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.
→ Currently we have 2 major clusters:
» A 1100-machine cluster with 8800 cores and about 12 PB raw storage.
» A 300-machine cluster with 2400 cores and about 3 PB raw storage.
» Each (commodity) node has 8 cores and 12 TB of storage.
» We are heavy users of both streaming and the Java APIs. We have built a higher-level data warehousing framework using these features called Hive (see http://Hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
Slide 41 www.edureka.co/big-data-and-hadoop
Hadoop can run in any of the following three modes:

Standalone (or Local) Mode
→ No daemons; everything runs in a single JVM.
→ Suitable for running MapReduce programs during development.
→ Has no DFS.

Pseudo-Distributed Mode
→ Hadoop daemons run on the local machine.

Fully-Distributed Mode
→ Hadoop daemons run on a cluster of machines.
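A note on how the mode is selected (a hedged summary; exact property names can vary between releases): the mode is purely a matter of configuration. Leaving fs.defaultFS at its file:/// default gives Standalone mode; pointing it at an hdfs:// URI on localhost and setting the replication factor to 1 gives Pseudo-Distributed mode; pointing it at a real master host, with the slave hosts listed for the daemons, gives Fully-Distributed mode.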
Slide 42 www.edureka.co/big-data-and-hadoop
→ Apache Hadoop and HDFS
→ http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
→ Apache Hadoop HDFS Architecture
→ http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Slide 43 www.edureka.co/big-data-and-hadoop
• Referring to the documents present in the LMS under the assignment, solve the problem below:
How many DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
Slide 44
Your feedback is important to us, be it a compliment, a suggestion or a complaint. It helps us to make the course better!
Please spare a few minutes to take the survey after the webinar.
www.edureka.co/big-data-and-hadoop