Introduction to Big data & Hadoop -I

Date post: 15-Jul-2015
Upload: edureka
Page 1: Introduction to Big data & Hadoop -I

www.edureka.co/big-data-and-hadoop

Introduction to big data and hadoop

Page 2: Introduction to Big data & Hadoop -I

Slide 2

Objectives

At the end of this session, you will understand:

Big Data Learning Paths

Big Data Introduction

Hadoop and Its Eco-System

Hadoop Architecture

Next Steps on How to Set Up Hadoop

Page 3: Introduction to Big data & Hadoop -I

Slide 3

Big Data Learning Path

Developer/Testing:
• Java / Python / Ruby
• Hadoop Eco-system
• NoSQL DB
• Spark

Administration:
• Linux Administration
• Cluster Management
• Cluster Performance
• Virtualization

Data Analyst:
• Statistics Skills
• Machine Learning
• Hadoop Essentials
• Expertise in R

Big Data and Hadoop

MapReduce Design Patterns

Apache Spark & Scala

Apache Cassandra

Linux Administration

Hadoop Administration

Data Science

Business Analytics Using R

Advanced Predictive Modelling in R

Talend for Big Data

Data Visualization Using Tableau

Page 4: Introduction to Big data & Hadoop -I

Slide 4

Unstructured Data is Exploding

Source: Twitter

Page 5: Introduction to Big data & Hadoop -I

Slide 5

IBM’s Definition of Big Data

Big Data Characteristics: http://www-01.ibm.com/software/data/bigdata/

Page 6: Introduction to Big data & Hadoop -I

Slide 6

Annie’s Introduction

Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.

Page 7: Introduction to Big data & Hadoop -I

Slide 7

Annie’s Question

Map the following to the corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM etc.)

Page 8: Introduction to Big data & Hadoop -I

Slide 8

Annie’s Answer

Ans.
» XML files, e-mail body → Semi-structured data
» Audio, Video, Images, Archived documents → Unstructured data
» Data from Enterprise systems (ERP, CRM etc.) → Structured data

Page 9: Introduction to Big data & Hadoop -I

Slide 9

Further Reading

More on Big Data

http://www.edureka.in/blog/the-hype-behind-big-data/

Why Hadoop?

http://www.edureka.in/blog/why-hadoop/

Opportunities in Hadoop

http://www.edureka.in/blog/jobs-in-hadoop/

Big Data

http://en.wikipedia.org/wiki/Big_Data

IBM’s definition – Big Data Characteristics

http://www-01.ibm.com/software/data/bigdata/

Page 10: Introduction to Big data & Hadoop -I

Slide 10

Common Big Data Customer Scenarios

Web and e-tailing

» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection

Telecommunications

» Customer Churn Prevention
» Network Performance Optimization
» Calling Data Record (CDR) Analysis
» Analysing Network to Predict Failure

http://wiki.apache.org/hadoop/PoweredBy

Page 11: Introduction to Big data & Hadoop -I

Slide 11

Government

» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice

Healthcare and Life Sciences

» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety

http://wiki.apache.org/hadoop/PoweredBy

Common Big Data Customer Scenarios (Contd.)

Page 12: Introduction to Big data & Hadoop -I

Slide 12

Common Big Data Customer Scenarios (Contd.)

Banks and Financial services

» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis

Retail

» Point of Sales Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis

http://wiki.apache.org/hadoop/PoweredBy

Page 13: Introduction to Big data & Hadoop -I

Slide 13

Why DFS?

Read 1 TB Data

1 Machine: 4 I/O Channels, Each Channel – 100 MB/s

10 Machines: 4 I/O Channels, Each Channel – 100 MB/s

Page 14: Introduction to Big data & Hadoop -I

Slide 14

Why DFS? (Contd.)

Read 1 TB Data

1 Machine: 4 I/O Channels, Each Channel – 100 MB/s → 43 Minutes

10 Machines: 4 I/O Channels, Each Channel – 100 MB/s

Page 15: Introduction to Big data & Hadoop -I

Slide 15

Why DFS? (Contd.)

Read 1 TB Data

1 Machine: 4 I/O Channels, Each Channel – 100 MB/s → 43 Minutes

10 Machines: 4 I/O Channels, Each Channel – 100 MB/s → 4.3 Minutes
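The speed-up above can be checked with quick arithmetic. A minimal sketch, assuming binary units (1 TB = 2^20 MB) and perfectly parallel reads; the slide's 43- and 4.3-minute figures then match to within rounding:

```python
# Read-time arithmetic behind the "Why DFS?" slides.
# Assumptions: 1 TB = 2**20 MB (binary units), 4 I/O channels per
# machine at 100 MB/s each, and reads that parallelize perfectly.

TB_IN_MB = 2 ** 20
CHANNELS_PER_MACHINE = 4
MB_PER_SEC_PER_CHANNEL = 100

def read_minutes(machines: int, data_mb: int = TB_IN_MB) -> float:
    """Minutes to read data_mb megabytes spread evenly over `machines`."""
    aggregate_mb_per_sec = machines * CHANNELS_PER_MACHINE * MB_PER_SEC_PER_CHANNEL
    return data_mb / aggregate_mb_per_sec / 60

print(f"1 machine:   {read_minutes(1):.1f} minutes")   # ~43.7; the slide quotes 43
print(f"10 machines: {read_minutes(10):.1f} minutes")  # ~4.4; the slide quotes 4.3
```

Ten machines cut the read time by exactly 10x because the aggregate channel count scales linearly with machines; this is the core motivation for a distributed file system.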

Page 16: Introduction to Big data & Hadoop -I

Slide 16

Page 17: Introduction to Big data & Hadoop -I

Slide 17

Hadoop Cluster: A Typical Use Case

Active NameNode – RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: Redundant Power Supply

StandBy NameNode – RAM: 64 GB, Hard disk: 1 TB, Processor: Xeon with 8 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: Redundant Power Supply

Secondary NameNode – RAM: 32 GB, Hard disk: 1 TB, Processor: Xeon with 4 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS, Power: Redundant Power Supply

DataNode (x2) – RAM: 16 GB, Hard disk: 6 x 2 TB, Processor: Xeon with 2 cores, Ethernet: 3 x 10 GB/s, OS: 64-bit CentOS
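Raw disk on the DataNodes does not translate directly into usable HDFS space. A rough sizing sketch, assuming the two DataNodes shown above and HDFS's default replication factor of 3 (the replication factor is not stated on the slide):

```python
# Rough HDFS capacity estimate for the example cluster above.
# Assumptions: DataNodes with 6 x 2 TB disks as on the slide, and the
# HDFS default replication factor of 3 (not stated on the slide).

DISKS_PER_DATANODE = 6
TB_PER_DISK = 2
REPLICATION_FACTOR = 3

def hdfs_usable_tb(datanodes: int) -> float:
    """Usable HDFS capacity in TB, ignoring OS and reserved-space overhead."""
    raw_tb = datanodes * DISKS_PER_DATANODE * TB_PER_DISK
    return raw_tb / REPLICATION_FACTOR

print(hdfs_usable_tb(2))  # 24 TB raw across 2 DataNodes -> 8.0 TB usable
```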

Page 18: Introduction to Big data & Hadoop -I

Slide 18

Hidden Treasure

Insight into data can provide a business advantage.

Some key early indicators can mean fortunes to a business.

More data enables more precise analysis.

*Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process its customer activity and sales data.

Case Study: Sears Holding Corporation

http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?

Page 19: Introduction to Big data & Hadoop -I

Slide 19

Limitations of Existing Data Analytics Architecture

Data flow: Instrumentation → Collection → Storage-only Grid (Original Raw Data) → ETL Compute Grid → RDBMS (Aggregated Data, Mostly Append) → BI Reports + Interactive Apps

Storage: a meagre 10% of the ~2 PB of data is available for BI; the remaining 90% of the ~2 PB is archived.

1. Can’t explore the original high-fidelity raw data
2. Moving data to compute doesn’t scale
3. Premature data death

Page 20: Introduction to Big data & Hadoop -I

Slide 20

Solution: A Combined Storage + Compute Layer

Data flow: Instrumentation → Collection → Hadoop: Storage + Compute Grid (both storage and processing) → RDBMS (Aggregated Data, Mostly Append) → BI Reports + Interactive Apps

The entire ~2 PB of data is available for processing; no data archiving is needed.

1. Data Exploration & Advanced Analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with its existing non-Hadoop solutions.

Page 21: Introduction to Big data & Hadoop -I

Slide 21

Annie’s Question

Hadoop is a framework that allows for the distributed processing of:
» Small Data Sets
» Large Data Sets

Page 22: Introduction to Big data & Hadoop -I

Slide 22

Annie’s Answer

Ans. Large Data Sets. Hadoop is also capable of processing small data sets; however, to experience its true power you need data in terabytes. That is where an RDBMS takes hours or fails outright, whereas Hadoop does the same job in a couple of minutes.

Page 23: Introduction to Big data & Hadoop -I

Slide 23

Hadoop Ecosystem

Hadoop 2.0:
• HDFS (Hadoop Distributed File System)
• YARN (Cluster Resource Management)
• MapReduce Framework
• Other YARN Frameworks (MPI, GRAPH)
• HBase
• Pig Latin (Data Analysis)
• Hive (DW System)
• Mahout (Machine Learning)
• Apache Oozie (Workflow)
• Sqoop (Structured Data)
• Flume (Unstructured or Semi-structured Data)

Page 24: Introduction to Big data & Hadoop -I

Slide 24

YARN – Moving beyond MapReduce

• BATCH (MapReduce)
• INTERACTIVE (Tez)
• ONLINE (HBase)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• OTHER (Search, Weave, …)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

Page 25: Introduction to Big data & Hadoop -I

Slide 25

Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

Standalone (or Local) Mode: No daemons; everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.

Pseudo-Distributed Mode: Hadoop daemons run on the local machine.

Fully-Distributed Mode: Hadoop daemons run on a cluster of machines.

Page 26: Introduction to Big data & Hadoop -I

Slide 26

Learning Path to Certification

Course:
• LIVE Online Class
• Class Recording in LMS
• 24/7 Post Class Support
• Module-wise Quiz and Assignment
• Project Work
• Verifiable Certificate

1. Assistance from Peers and Support team
2. Review for Certification

Page 27: Introduction to Big data & Hadoop -I

Slide 27

CAP Theorem

Here is a brief description of the three combinations CA, CP, and AP:

CA – Single-site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.

CP – Some data may not be accessible, but the rest is still consistent/accurate.

AP – The system is still available under partitioning, but some of the data returned may be inaccurate.

Page 28: Introduction to Big data & Hadoop -I

Slide 28

Further Reading

Apache Hadoop and HDFS

http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/

Apache Hadoop HDFS Architecture

http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/

Page 29: Introduction to Big data & Hadoop -I

Slide 29

Assignment

Referring to the documents in the LMS under Assignment, solve the problem below.

How many such DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
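One way to approach the sizing is to reuse the arithmetic from the "Why DFS?" slides. A back-of-the-envelope sketch, not the official LMS solution, assuming each DataNode reads at 4 channels x 100 MB/s, reads are perfectly parallel, and binary units (1 TB = 2^20 MB):

```python
# Back-of-the-envelope sizing for the assignment, reusing the
# "Why DFS?" assumptions: each DataNode reads at 4 x 100 = 400 MB/s,
# reads are perfectly parallel, and 1 TB = 2**20 MB. A rough sketch,
# not the official LMS solution.

import math

NODE_MB_PER_SEC = 4 * 100        # per-DataNode read throughput
DATA_MB = 100 * 2 ** 20          # 100 TB expressed in MB
TARGET_SEC = 5 * 60              # 5-minute target

required_mb_per_sec = DATA_MB / TARGET_SEC            # aggregate MB/s needed
datanodes_needed = math.ceil(required_mb_per_sec / NODE_MB_PER_SEC)
print(datanodes_needed)          # -> 874 under these assumptions
```

Real clusters need more headroom: replication traffic, disk contention, and stragglers all reduce the effective per-node throughput below the raw channel bandwidth.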

Page 30: Introduction to Big data & Hadoop -I

Slide 30

Your feedback is important to us, be it a compliment, a suggestion, or a complaint. It helps us make the course better!

Please spare a few minutes to take the survey after the webinar.


Survey
