Introduction to Big Data & Hadoop

www.edureka.co/big-data-and-hadoop
CMC Contact : aparna.jaiswal@cmcltd.com
Edureka Contact : corp@edureka.co

Objectives

At the end of this session, you will understand:

Big Data Introduction

Use Cases of Big Data in Multiple Industry Verticals

Hadoop and Its Eco-System

Hadoop Architecture

Learning Path for Developers, Administrators, Testing Professionals and Aspiring Data Scientists

Un-structured Data is Exploding

Source: Twitter

IBM’s Definition – Big Data Characteristics

http://www-01.ibm.com/software/data/bigdata/

IBM’s Definition of Big Data

Annie’s Introduction

Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you think and answer my questions.

Annie’s Question

Map the following to corresponding data type:
» XML files, e-mail body
» Audio, Video, Images, Archived documents
» Data from Enterprise systems (ERP, CRM etc.)

Annie’s Answer

Ans.
» XML files, e-mail body: Semi-structured data
» Audio, Video, Images, Archived documents: Unstructured data
» Data from Enterprise systems (ERP, CRM etc.): Structured data

More on Big Data

http://www.edureka.in/blog/the-hype-behind-big-data/

Why Hadoop?

http://www.edureka.in/blog/why-hadoop/

Opportunities in Hadoop

http://www.edureka.in/blog/jobs-in-hadoop/

Big Data

http://en.wikipedia.org/wiki/Big_Data

IBM’s definition – Big Data Characteristics

http://www-01.ibm.com/software/data/bigdata/

Common Big Data Customer Scenarios

Web and e-tailing

» Recommendation Engines
» Ad Targeting
» Search Quality
» Abuse and Click Fraud Detection

Telecommunications

» Customer Churn Prevention
» Network Performance Optimization
» Call Detail Record (CDR) Analysis
» Analysing the Network to Predict Failure

http://wiki.apache.org/hadoop/PoweredBy

Government

» Fraud Detection and Cyber Security
» Welfare Schemes
» Justice

Healthcare and Life Sciences

» Health Information Exchange
» Gene Sequencing
» Serialization
» Healthcare Service Quality Improvements
» Drug Safety

Common Big Data Customer Scenarios (Contd.)

Banks and Financial services

» Modeling True Risk
» Threat Analysis
» Fraud Detection
» Trade Surveillance
» Credit Scoring and Analysis

Retail

» Point of Sales Transaction Analysis
» Customer Churn Analysis
» Sentiment Analysis

Why DFS?

Read 1 TB Data

» 1 Machine, 4 I/O Channels, Each Channel – 100 MB/s: 43 Minutes
» 10 Machines, 4 I/O Channels per machine, Each Channel – 100 MB/s: 4.3 Minutes
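The two timings follow from simple throughput arithmetic. A minimal sketch, assuming 1 TB is taken as 1024 x 1024 MB and that the read is split evenly across machines with no coordination overhead (the function name and its defaults are illustrative, not from the slides):

```python
def read_time_minutes(
    data_tb: float,
    machines: int,
    channels_per_machine: int = 4,
    mb_per_s_per_channel: float = 100.0,
) -> float:
    """Minutes needed to read data_tb terabytes split evenly across machines."""
    total_mb = data_tb * 1024 * 1024  # assumption: 1 TB = 1024 * 1024 MB
    # Aggregate throughput of all channels on all machines, in MB/s.
    throughput_mb_per_s = machines * channels_per_machine * mb_per_s_per_channel
    return total_mb / throughput_mb_per_s / 60


if __name__ == "__main__":
    for n in (1, 10):
        print(f"{n} machine(s): {read_time_minutes(1, n):.1f} minutes")
```

Under these assumptions the script reports about 43.7 minutes for one machine and about 4.4 minutes for ten, in line with the 43 and 4.3 minutes quoted on the slide.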

Hadoop Cluster: A Typical Use Case

NameNode roles shown: Active NameNode, Secondary NameNode, StandBy NameNode

NameNode specification shown:
» RAM: 64 GB
» Hard disk: 1 TB
» Processor: Xeon with 8 cores
» Ethernet: 3 x 10 GB/s
» OS: 64-bit CentOS
» Power: Redundant Power Supply

DataNode specification (two DataNodes shown, same specification):
» RAM: 16 GB
» Hard disk: 6 x 2 TB
» Processor: Xeon with 2 cores
» Ethernet: 3 x 10 GB/s
» OS: 64-bit CentOS

Hidden Treasure

Insight into data can provide Business Advantage.

Some key early indicators can mean Fortunes to Business.

More Precise Analysis with more data.

*Sears was using traditional systems such as Oracle Exadata, Teradata and SAS to store and process its customer activity and sales data.

Case Study: Sears Holding Corporation

http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?

Limitations of Existing Data Analytics Architecture

Layers listed on the slide:
» BI Reports + Interactive Apps
» RDBMS (Aggregated Data)
» ETL Compute Grid
» Storage only Grid (Original Raw Data)
» Collection
» Instrumentation

Labels: Mostly Append; Storage; Processing

A meagre 10% of the ~2PB data is available; 90% of the ~2PB is archived.

1. Can’t explore original high fidelity raw data
2. Moving data to compute doesn’t scale
3. Premature data death

Solution: A Combined Storage Compute Layer

Layers listed on the slide:
» BI Reports + Interactive Apps
» RDBMS (Aggregated Data)
» Hadoop: Storage + Compute Grid (Both Storage And Processing)
» Collection
» Instrumentation

Labels: Mostly Append; No Data Archiving

Entire ~2PB Data is available for processing

1. Data Exploration & Advanced analytics
2. Scalable throughput for ETL & aggregation
3. Keep data alive forever

*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% that was available with its existing non-Hadoop solutions.

Annie’s Question

Hadoop is a framework that allows for the distributed processing of:
» Small Data Sets
» Large Data Sets

Annie’s Answer

Ans. Large Data Sets. Hadoop can also process small data sets, but its real strength shows when the data runs into terabytes: at that scale an RDBMS can take hours or fail, whereas Hadoop can do the same work in a couple of minutes.

Hadoop Ecosystem

Hadoop 2.0

» Apache Oozie (Workflow)
» Pig Latin (Data Analysis)
» Hive (DW System)
» Mahout (Machine Learning)
» MapReduce Framework
» HBase
» Other YARN Frameworks (MPI, GRAPH)
» YARN (Cluster Resource Management)
» HDFS (Hadoop Distributed File System)

Data sources: Unstructured or Semi-structured Data; Structured Data

Hadoop Cluster: Facebook

Facebook

We use Hadoop to store copies of internal log and dimension data sources and use it as a source for reporting/analytics and machine learning.

Currently we have 2 major clusters:

» A 1100-machine cluster with 8800 cores and about 12 PB raw storage.

» A 300-machine cluster with 2400 cores and about 3 PB raw storage.

» Each (commodity) node has 8 cores and 12 TB of storage.

» We are heavy users of both streaming as well as the Java APIs. We have built a higher level data warehousing framework using these features called Hive (see http://Hadoop.apache.org/hive/). We have also developed a FUSE implementation over HDFS.
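The "streaming" interface mentioned above lets mappers and reducers be ordinary programs that exchange tab-separated key/value lines, with the framework sorting records by key between the two stages. A minimal word-count sketch in that style; the function names and sample input are hypothetical, and the local `sorted()` call only stands in for the framework's sort step:

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; records must already be sorted by key."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Local stand-in for a cluster run: map, sort by key, then reduce.
    sample = ["big data big clusters", "Big Data"]
    for record in reducer(sorted(mapper(sample))):
        print(record)
```

In a real streaming job the two functions would live in separate scripts that read standard input and write standard output; keeping them as generators here makes the sketch easy to run and check on one machine.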

» BATCH (MapReduce)
» INTERACTIVE (Tez)
» ONLINE (HBase)
» STREAMING (Storm, S4, …)
» GRAPH (Giraph)
» IN-MEMORY (Spark)
» HPC MPI (OpenMPI)
» OTHER (Search) (Weave..)

http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html

YARN – Moving beyond MapReduce

Hadoop Cluster Modes

Hadoop can run in any of the following three modes:

» Standalone (or Local) Mode: No daemons, everything runs in a single JVM. Suitable for running MapReduce programs during development. Has no DFS.
» Pseudo-Distributed Mode: Hadoop daemons run on the local machine.
» Fully-Distributed Mode: Hadoop daemons run on a cluster of machines.
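One practical way to tell the three cluster modes apart is the file system URI a node is configured with. A rough sketch, assuming the `fs.defaultFS` property follows the common pattern of `file:///` for standalone mode and an `hdfs://` URI otherwise; the `guess_mode` helper, the example host `namenode.example.com`, and port 9000 are illustrative assumptions, not taken from the slides:

```python
from urllib.parse import urlparse

# Hosts treated as "this machine" for the purposes of the heuristic.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def guess_mode(fs_default_fs: str) -> str:
    """Heuristically map an fs.defaultFS value to a Hadoop cluster mode."""
    uri = urlparse(fs_default_fs)
    if uri.scheme == "file":
        # Local file system only: no DFS, so no HDFS daemons are involved.
        return "standalone"
    if uri.scheme == "hdfs":
        if uri.hostname in LOCAL_HOSTS:
            return "pseudo-distributed"
        return "fully-distributed"
    raise ValueError(f"unrecognised file system URI: {fs_default_fs!r}")


if __name__ == "__main__":
    examples = (
        "file:///",
        "hdfs://localhost:9000",
        "hdfs://namenode.example.com:9000",
    )
    for value in examples:
        print(f"{value} -> {guess_mode(value)}")
```

This is only a heuristic: a real deployment may use other file systems or host names, so the result should be read as a hint rather than a definitive classification.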

Big Data Learning Path

Developer/Testing
• Java / Python / Ruby
• Hadoop Eco-system
• NoSQL DB
• Spark

Administration
• Linux Administration
• Cluster Management
• Cluster Performance
• Virtualization

Data Analyst
• Statistics Skills
• Machine Learning
• Hadoop Essentials
• Expertise in R

Big Data and Hadoop

MapReduce Design Patterns

Apache Spark & Scala

Apache Cassandra

Linux Administration

Hadoop Administration

Data Science

Business Analytics Using R

Advance Predictive Modelling in R

Talend for Big Data

Data Visualization Using Tableau

Learning Path to Certification

Course

LIVE Online Class

Class Recording in LMS

24/7 Post Class Support

Module Wise Quiz and Assignment

Project Work

Verifiable Certificate

1. Assistance from Peers and Support team

2. Review for Certification
