Post on 22-Dec-2015
Today's agenda
• Hadoop Introduction and Architecture
• Hadoop Distributed File System
• MapReduce
• Spark
What is Hadoop? (2)
• A solution for big data processing
• Sequential data access – a brute-force approach
• Simplified data structures (no relational model)
• Ideal for ad-hoc data analytics
• In contrast to the indexed-lookup approach, where:
  • clever data lookups require indexing etc.
  • the analytics use cases have to be known beforehand
  • the data design is complex
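The contrast above can be sketched in plain Python (this is illustrative only, not the Hadoop API; the records and keys are made up): a brute-force sequential scan answers any ad-hoc question, while an indexed lookup only serves queries the index was designed for beforehand.

```python
# Illustrative contrast (plain Python, not Hadoop): brute-force scan
# vs. indexed lookup over some made-up key/value records.
records = [("alice", 3), ("bob", 7), ("alice", 5), ("carol", 1)]

# Brute-force sequential scan: works for any ad-hoc predicate,
# no upfront data design needed.
alice_total = sum(v for k, v in records if k == "alice")

# Indexed lookup: fast, but the index ("by key") had to be
# designed before the query was known.
index = {}
for k, v in records:
    index.setdefault(k, []).append(v)
alice_total_indexed = sum(index["alice"])

assert alice_total == alice_total_indexed == 8
```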
What is Hadoop? (3)
• Data locality (shared nothing) – scales out
[Diagram: Node 1 … Node X, each with its own CPU, memory, and disks, connected by an interconnect network]
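The shared-nothing idea can be sketched in plain Python (illustrative only, not HDFS; the node count and record keys are made up): records are spread deterministically across nodes, so each node can later process its own local slice in parallel.

```python
# Illustrative sketch (plain Python, not HDFS): hash-partition
# records across a shared-nothing cluster. NUM_NODES and the
# record keys are hypothetical.
from hashlib import md5

NUM_NODES = 4

def node_for(key: str) -> int:
    """Pick a node deterministically from the record key."""
    return int(md5(key.encode()).hexdigest(), 16) % NUM_NODES

partitions = {n: [] for n in range(NUM_NODES)}
for key in ["event-1", "event-2", "event-3", "event-4", "event-5"]:
    partitions[node_for(key)].append(key)

# Every record lands on exactly one node; adding nodes spreads
# the same data (and work) more thinly -- that is the scale-out.
assert sum(len(p) for p in partitions.values()) == 5
```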
What is Hadoop? (4)
• Optimized storage access (for HDDs)
• Big data blocks (>=128 MB)
• Sequential IO instead of random IO

A 7200 rpm HDD typically delivers:
- Sequential IO: ~120 MB/s
- Random IO: 0.5 – 50 MB/s
Hadoop ecosystem
• HDFS – Hadoop Distributed File System
• YARN – Cluster resource manager
• MapReduce
• HBase – NoSQL columnar store
• Hive – SQL
• Pig – Scripting
• Flume – Log data collector
• Sqoop – Data exchange with RDBMS
• Oozie – Workflow manager
• Mahout – Machine learning
• Zookeeper – Coordination
• Impala – SQL
• Spark – Large scale data processing
Hadoop cluster architecture
• Master and slaves approach
[Diagram: Node 1 … Node X connected by an interconnect network. Every node runs an HDFS DataNode, a YARN NodeManager, and various component agents and daemons; dedicated nodes additionally host the masters: the HDFS NameNode, the YARN ResourceManager, and the Hive metastore]
What not to use Hadoop for?
• Online Transaction Processing systems
  • No transactions
  • No locks
  • No data updates (only appends and overwrites)
  • Response time in seconds rather than milliseconds
• Not good for systems with relational data
  • Interactive applications
  • Accounting systems
  • Etc.
What to use Hadoop for?
• For Big Data!
  • Storing
  • Analysis
• Write once – read many
• Scale-out system (CPU, IO, RAM)
  • transparent to the users (data placement, data analysis)
• Good for data exploration:
  • in a batch fashion
  • statistics, aggregations, correlations
• Data warehouses
• Logs
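The batch "statistics and aggregation" style can be sketched in plain Python (illustrative only, not the Hadoop MapReduce API; the log lines are made up): extract a field from every record in one sequential pass, then aggregate.

```python
# Minimal sketch of batch log aggregation in plain Python
# (not Hadoop): count requests per status code.
from collections import Counter

log_lines = [
    "GET /index 200",
    "GET /data 404",
    "POST /upload 200",
    "GET /index 200",
]

# Map: extract the status code from each line.
codes = (line.split()[-1] for line in log_lines)

# Reduce: aggregate the counts in one sequential pass.
counts = Counter(codes)
assert counts == {"200": 3, "404": 1}
```

On a real cluster the same map and reduce steps would run in parallel, each node scanning its local blocks of the logs.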
Hadoop @CERN
• 4 main clusters (provided by IT)
  • 16–20 machines each
  • 24 GB – 256 GB of RAM
• Main users
  • ATLAS (EventIndex, PandaMon, Rucio)
  • CASTOR logs
  • WLCG Dashboards
  • IT Monitoring
  • Computer Security
  • …
• Available services
  • HDFS, YARN (MR), HBase, Hive, Pig, Spark, Impala (upcoming)
• Contact
  • SNOW: https://cern.service-now.com/service-portal/report-ticket.do?name=request&se=Hadoop-Service
Summary
• Hadoop is a solution for massive data processing
• Designed to scale out
  • On commodity hardware
  • Optimized for sequential reads
• Hadoop architecture
  • HDFS is the core
  • Many components with multiple functionalities distributed across cluster nodes