Description: These slides give a basic overview of distributed computing with Hadoop.
Page 1: Hadoop

Overview on HADOOP Distributed Computing

RAGHU JULURI Senior Member Technical Staff Oracle India Development Center.

2/7/2011 1

Page 2: Hadoop

Dealing with lots of Data
– 20 billion web pages * 20 KB = 400 TB
– 1,000 hard disks just to store the web
– 1 computer can read ~50 MB/sec from disk => ~3 months to read 400 TB

Solution: spread the work over many machines

Hardware & Software
– Software must handle communication & coordination, recovery from failure, status reporting, and debugging.
– Every application needs to implement the above functionality (Google search indexing, page ranking, Trends, Picasa…)
– In 2003 Google came up with the MapReduce runtime library.


Page 5: Hadoop

Standard Model


Page 6: Hadoop

Hadoop Ecosystem


Page 9: Hadoop

Hadoop, Why?
– Need to process multi-petabyte datasets
– Expensive to build reliability into each application
– Nodes fail every day
  – Failure is expected, rather than exceptional
  – The number of nodes in a cluster is not constant
– Need common infrastructure
  – Efficient, reliable, open source (Apache License)
– The above goals are the same as Condor's, but workloads are IO bound and not CPU bound


Page 12: Hadoop

HDFS splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.

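The effect of replication can be sketched with a toy placement model (illustrative only, not HDFS's actual placement policy): each block is copied to 3 nodes, so data survives as long as at least one replica sits on a live node.

```python
import itertools

REPLICATION = 3  # HDFS's default replication factor

def place_blocks(num_blocks, nodes):
    """Assign each block to REPLICATION nodes, round-robin style."""
    placement = {}
    ring = itertools.cycle(nodes)
    for block in range(num_blocks):
        placement[block] = [next(ring) for _ in range(REPLICATION)]
    return placement

def lost_blocks(placement, failed):
    """A block is lost only if every one of its replicas is on a failed node."""
    return [b for b, holders in placement.items() if all(n in failed for n in holders)]

placement = place_blocks(num_blocks=6, nodes=["n1", "n2", "n3", "n4", "n5"])
# Two simultaneous node failures: every block still has a live replica.
print(lost_blocks(placement, failed={"n1", "n2"}))  # []
```

With three failed nodes some blocks can lose all replicas, which is why real clusters re-replicate under-replicated blocks as soon as failures are detected.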

Page 13: Hadoop

Goals of HDFS
– Very large distributed file system
  – 10K nodes, 100 million files, 10 PB
– Assumes commodity hardware
  – Files are replicated to handle hardware failure
  – Detects failures and recovers from them
– Optimized for batch processing
  – Data locations exposed so that computations can move to where the data resides
  – Provides very high aggregate bandwidth
– User space, runs on heterogeneous OSes


Page 14: Hadoop

HDFS Architecture
[Diagram: the Client sends (1) a filename to the NameNode, gets back (2) the BlockIds and the DataNodes holding them, then (3) reads the data directly from those DataNodes. DataNodes report cluster membership to the NameNode.]
– NameNode: maps a file to a file-id and a list of DataNodes
– DataNode: maps a block-id to a physical location on disk
– SecondaryNameNode: periodic merge of the transaction log

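The three-step read path above can be sketched as a toy in-memory model (the names and data structures are illustrative, not the real HDFS API): the client asks the NameNode for a file's blocks and their locations, then reads each block directly from a DataNode.

```python
# Toy NameNode/DataNode model of the HDFS read path (illustrative only).
namenode = {  # filename -> list of (block_id, [DataNodes holding it])
    "/logs/part-0": [("blk_1", ["dn1", "dn3"]), ("blk_2", ["dn2", "dn3"])],
}
datanodes = {  # DataNode -> block_id -> bytes on disk
    "dn1": {"blk_1": b"Hello, "},
    "dn2": {"blk_2": b"HDFS!"},
    "dn3": {"blk_1": b"Hello, ", "blk_2": b"HDFS!"},
}

def read_file(filename):
    data = b""
    # 1. The client sends the filename to the NameNode.
    # 2. The NameNode answers with BlockIds and the DataNodes holding each block.
    for block_id, holders in namenode[filename]:
        # 3. The client reads each block directly from one of its DataNodes;
        #    the block data itself never flows through the NameNode.
        data += datanodes[holders[0]][block_id]
    return data

print(read_file("/logs/part-0"))  # b'Hello, HDFS!'
```

Keeping the NameNode out of the data path is what lets a single metadata server front a cluster moving very high aggregate bandwidth.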

Page 15: Hadoop

MapReduce: Programming Model
[Diagram: word count over two input records, "How now brown cow" and "How does it work now".]
– Map output: <How,1> <now,1> <brown,1> <cow,1> and <How,1> <does,1> <it,1> <work,1> <now,1>
– Grouped by key: <brown,1> <cow,1> <does,1> <How,1 1> <it,1> <now,1 1> <work,1>
– Reduce output: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1
– Flow: Input → Map → MapReduce framework (group by key) → Reduce → Output

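The word-count flow on this slide can be reproduced with a minimal in-process sketch of the map, group, and reduce phases:

```python
from collections import defaultdict

def map_fn(record):
    # Emit an intermediate <word, 1> pair for every word in the record.
    return [(word, 1) for word in record.split()]

def reduce_fn(word, counts):
    # Combine all values for one key into a single output value.
    return (word, sum(counts))

def mapreduce(records):
    grouped = defaultdict(list)
    for record in records:                # map phase
        for key, value in map_fn(record):
            grouped[key].append(value)    # the framework groups by key
    return [reduce_fn(k, v) for k, v in sorted(grouped.items())]  # reduce phase

print(mapreduce(["How now brown cow", "How does it work now"]))
# [('How', 2), ('brown', 1), ('cow', 1), ('does', 1), ('it', 1), ('now', 2), ('work', 1)]
```

In real Hadoop the grouping step is the distributed shuffle/sort; here a dictionary stands in for it.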

Page 16: Hadoop

MapReduce: Programming Model
– Process data using special map() and reduce() functions
– The map() function is called on every item in the input and emits a series of intermediate key/value pairs
– All values associated with a given key are grouped together
– The reduce() function is called on every unique key and its value list, and emits a value that is added to the output


Page 17: Hadoop

MapReduce Benefits
– Greatly reduces parallel programming complexity
– Reduces synchronization complexity
– Automatically partitions data
– Provides failure transparency
– Handles load balancing

Practical
– Approximately 1,000 Google MapReduce jobs run every day.


Page 18: Hadoop

MapReduce Examples: Word Frequency
[Diagram: a document flows into Map, which emits <word,1> pairs; the runtime system groups them into <word,1,1,1>; Reduce emits <word,3>.]


Page 19: Hadoop

A Brief History
– Functional programming (e.g., Lisp)
  – map() function: applies a function to each value of a sequence
  – reduce() function: combines all elements of a sequence using a binary operator

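The same two primitives survive directly in Python's standard library, which makes the functional lineage easy to see:

```python
from functools import reduce
import operator

# map: apply a function to each value of a sequence
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# reduce: combine all elements of a sequence using a binary operator
total = reduce(operator.add, squares)

print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```

MapReduce generalizes this pair to key/value data spread over thousands of machines, but the shape of the computation is the same.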

Page 20: Hadoop

MapReduce Execution Overview
1. The user program, via the MapReduce library, shards the input data.
[Diagram: the User Program splits the Input Data into Shard 0 … Shard 6.]
* Shards are typically 16-64 MB in size


Page 21: Hadoop

MapReduce Execution Overview
2. The user program creates process copies distributed on a machine cluster. One copy will be the "Master" and the others will be worker threads.
[Diagram: the User Program forks a Master and a pool of Workers.]


Page 22: Hadoop

MapReduce Execution Overview
3. The master distributes M map and R reduce tasks to idle workers.
  – M == the number of shards
  – R == the intermediate key space is divided into R parts
[Diagram: the Master sends a Do_map_task message to an idle Worker.]


Page 23: Hadoop

MapReduce Execution Overview
4. Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs, buffered in RAM.
[Diagram: Shard 0 → Map worker → key/value pairs.]


Page 24: Hadoop

MapReduce Execution Overview
5. Each worker flushes its intermediate values, partitioned into R regions, to disk and notifies the Master process.
[Diagram: the Map worker writes to local storage and reports the disk locations to the Master.]


Page 25: Hadoop

MapReduce Execution Overview
6. The Master process gives the disk locations to an available reduce-task worker, which reads all of the associated intermediate data.
[Diagram: the Master sends disk locations to a Reduce worker, which reads from remote storage.]


Page 26: Hadoop

MapReduce Execution Overview
7. Each reduce-task worker sorts its intermediate data, then calls the reduce function for each unique key with its associated list of values. The reduce function's output is appended to the reduce task's partition output file.
[Diagram: the Reduce worker sorts its data and writes a partition output file.]


Page 27: Hadoop

MapReduce Execution Overview
8. The Master process wakes up the user process when all tasks have completed. The output is contained in R output files.
[Diagram: the Master wakes up the User Program; the R output files are ready.]

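Steps 1-8 can be condensed into one sequential sketch (a toy simulation, not the distributed runtime): shard the input, run one map task per shard, partition the intermediate pairs into R regions, then run one reduce task per region, each producing one output "file".

```python
from collections import defaultdict

R = 2  # number of reduce partitions

def partition(key):
    # Which of the R regions an intermediate key belongs to (step 5).
    return hash(key) % R

def run_job(shards):
    # Map phase: one task per shard, output split into R regions (steps 3-5).
    regions = [defaultdict(list) for _ in range(R)]
    for shard in shards:
        for word in shard.split():
            regions[partition(word)][word].append(1)
    # Reduce phase: one task per region, each yielding one output file (steps 6-8).
    return [{word: sum(counts) for word, counts in sorted(region.items())}
            for region in regions]

files = run_job(["How now brown cow", "How does it work now"])
print(files)  # R dictionaries; together they hold the complete word count
```

Which words land in which file depends on the hash, but the union of the R output files is always the full result, which is why a consumer of a real MapReduce job reads all R partition files.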

Page 29: Hadoop

Pig
– Data-flow oriented language "Pig Latin"
– Datatypes include sets, associative arrays, tuples
– High-level language for routing data; allows easy integration of Java for complex tasks
– Developed at Yahoo!

Hive
– SQL-based data warehousing app
– Feature set is similar to Pig
  – Language is more strictly SQL
– Supports SELECT, JOIN, GROUP BY, etc.
– Features for analyzing very large data sets
  – Partition columns
  – Sampling
  – Buckets
– Developed at Facebook


Page 30: Hadoop

HBase
– Column-store database
  – Based on the design of Google BigTable
  – Provides interactive access to information
– Holds extremely large datasets (multi-TB)
– Constrained access model
  – (key, value) lookup
  – Limited transactions (single row only)


Page 31: Hadoop

ZooKeeper
– Distributed consensus engine
– Provides well-defined concurrent access semantics:
  – Leader election
  – Service discovery
  – Distributed locking / mutual exclusion
  – Message board / mailboxes


Page 32: Hadoop

Some more projects…
– Chukwa – Hadoop log aggregation
– Scribe – more general log aggregation
– Mahout – machine learning library
– Cassandra – column-store database on a P2P backend
– Dumbo – Python library for streaming
– Ganglia – distributed monitoring


Page 33: Hadoop

Conclusions
– Computing with big datasets is a fundamentally different challenge than doing "big compute" over a small dataset
– New ways of thinking about problems are needed
  – New tools provide the means to capture this
  – MapReduce, HDFS, etc. can help
