Overview on HADOOP Distributed Computing
RAGHU JULURI, Senior Member Technical Staff, Oracle India Development Center
2/7/2011 1
Dealing with Lots of Data
– 20 billion web pages * 20 KB = 400 TB
– ~1,000 hard disks just to store the web
– One computer can read ~50 MB/sec from disk => ~3 months to read it all
Solution: spread the work over many machines.
Hardware & Software
– Software must handle communication & coordination, recovery from failure, status reporting, and debugging.
– Every application would otherwise need to implement all of this itself (Google search indexing, page ranking, Trends, Picasa, ...).
– In 2003, Google came up with the MapReduce runtime library to factor this out.
Standard Model
Hadoop Ecosystem
Hadoop, Why?
– Need to process multi-petabyte datasets.
– It is expensive to build reliability into each application.
– Nodes fail every day: failure is expected rather than exceptional, and the number of nodes in a cluster is not constant.
– Need common infrastructure: efficient, reliable, open source (Apache License).
– These goals are the same as Condor's, but the workloads are I/O-bound rather than CPU-bound.
HDFS splits user data across servers in a cluster. It uses replication to ensure that even multiple node failures will not cause data loss.
Goals of HDFS
– Very large distributed file system: 10K nodes, 100 million files, 10 PB.
– Assumes commodity hardware: files are replicated to handle hardware failure; failures are detected and recovered from.
– Optimized for batch processing: data locations are exposed so computations can move to where the data resides; provides very high aggregate bandwidth.
– Runs in user space, on heterogeneous operating systems.
HDFS Architecture
[Diagram: the Client sends (1) a filename to the NameNode, receives (2) block IDs and DataNode locations, and then (3) reads the data directly from the DataNodes; a SecondaryNameNode runs alongside the NameNode, and the DataNodes report cluster membership.]
– NameNode: maps a file to a file-id and a list of DataNodes.
– DataNode: maps a block-id to a physical location on disk.
– SecondaryNameNode: performs a periodic merge of the transaction log.
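The three-step read path above can be sketched as a toy in-memory model (the names and structures here are illustrative only, not the real HDFS API):

```python
# Toy model of the HDFS read path: the client asks the NameNode for a
# file's block IDs and replica locations, then reads each block
# directly from a DataNode.

# NameNode state: filename -> ordered list of (block_id, [replica DataNodes])
namenode = {
    "/logs/web.log": [("blk_1", ["dn1", "dn2", "dn3"]),
                      ("blk_2", ["dn2", "dn4", "dn5"])],
}

# DataNode state: node name -> {block_id: block contents}
datanodes = {
    "dn1": {"blk_1": b"GET /index"},
    "dn2": {"blk_1": b"GET /index", "blk_2": b".html 200\n"},
    "dn4": {"blk_2": b".html 200\n"},
}

def read_file(path):
    data = b""
    # Steps 1-2: filename -> block IDs and DataNode locations (NameNode).
    for block_id, locations in namenode[path]:
        # Step 3: read each block from the first replica that has it;
        # replication means any surviving replica will do.
        for dn in locations:
            block = datanodes.get(dn, {}).get(block_id)
            if block is not None:
                data += block
                break
    return data

print(read_file("/logs/web.log"))  # b'GET /index.html 200\n'
```

Because each block is replicated, losing a DataNode only removes one candidate from the location list; the client falls through to the next replica.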
MapReduce: Programming Model
[Diagram: word count flowing through the MapReduce framework, from input documents through Map and Reduce to the output.]
– Input (two documents): "How now brown cow" and "How does it work now"
– Map output (intermediate pairs): <How,1> <now,1> <brown,1> <cow,1> <How,1> <does,1> <it,1> <work,1> <now,1>
– Grouped by key: <How,1 1> <now,1 1> <brown,1> <cow,1> <does,1> <it,1> <work,1>
– Reduce output: brown 1, cow 1, does 1, How 2, it 1, now 2, work 1
2/7/2011 15
MapReduce: Programming Model
– Process data using special map() and reduce() functions.
– The map() function is called on every item in the input and emits a series of intermediate key/value pairs.
– All values associated with a given key are grouped together.
– The reduce() function is called on every unique key and its value list, and emits a value that is added to the output.
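The model above fits in a few lines of plain Python (a single-machine sketch, not the Hadoop API):

```python
from collections import defaultdict

def map_fn(doc):
    # Called on every input item; emits intermediate (key, value) pairs.
    for word in doc.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Called on every unique key with its grouped value list.
    return sum(values)

def map_reduce(inputs):
    # Group all intermediate values by key, then reduce each group.
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

docs = ["How now brown cow", "How does it work now"]
print(map_reduce(docs))
# {'How': 2, 'now': 2, 'brown': 1, 'cow': 1, 'does': 1, 'it': 1, 'work': 1}
```

The framework's real value is that the grouping ("shuffle") step and the fan-out of map and reduce calls run in parallel across machines, while the user supplies only these two functions.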
MapReduce Benefits
– Greatly reduces parallel programming complexity.
– Reduces synchronization complexity.
– Automatically partitions data.
– Provides failure transparency.
– Handles load balancing.
Practical: approximately 1,000 Google MapReduce jobs run every day.
MapReduce Examples: Word Frequency
[Diagram: a document flows into Map, which emits <word,1> pairs; the runtime system groups them into <word,1,1,1>; Reduce emits <word,3>.]
A Brief History
– Rooted in functional programming (e.g., Lisp).
– map(): applies a function to each value of a sequence.
– reduce(): combines all elements of a sequence using a binary operator.
MapReduce Execution Overview
1. The user program, via the MapReduce library, shards the input data.
[Diagram: the user program splits the input data into Shard 0 through Shard 6.]
* Shards are typically 16-64 MB in size.
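Sharding amounts to slicing the input into fixed-size chunks; a simplified in-memory sketch (real shards are 16-64 MB and, in Hadoop, follow HDFS block boundaries):

```python
def shard(data: bytes, shard_size: int):
    # Slice the input into fixed-size shards; the last may be shorter.
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]

input_data = b"x" * 100
shards = shard(input_data, shard_size=16)
print(len(shards), len(shards[-1]))  # 7 shards; the last holds 4 bytes
```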
MapReduce Execution Overview
2. The user program creates process copies distributed across a machine cluster. One copy becomes the "Master" and the others become workers.
[Diagram: the user program forks one Master process and many Worker processes.]
MapReduce Execution Overview
3. The master distributes M map tasks and R reduce tasks to idle workers, where M is the number of shards and R is the number of parts into which the intermediate key space is divided.
[Diagram: the Master sends a Do_map_task message to an idle Worker.]
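Dividing the intermediate key space into R parts is conventionally done with a hash partitioner, so that every occurrence of a key, whichever map task emitted it, lands in the same partition. A sketch (crc32 stands in for the framework's hash function):

```python
import zlib

R = 4  # number of reduce tasks, i.e. partitions of the key space

def partition(key: str, num_partitions: int = R) -> int:
    # Stable hash of the key, modulo R. Python's built-in hash() is
    # randomized per process for strings, so use crc32 instead.
    return zlib.crc32(key.encode()) % num_partitions

print(partition("How"), partition("now"))  # each falls in 0..R-1
```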
MapReduce Execution Overview
4. Each map-task worker reads its assigned input shard and outputs intermediate key/value pairs, buffered in RAM.
[Diagram: a Map worker reads Shard 0 and emits key/value pairs.]
MapReduce Execution Overview
5. Each worker flushes its intermediate values, partitioned into R regions, to local disk and notifies the Master process of the disk locations.
[Diagram: the Map worker writes to local storage and reports the disk locations to the Master.]
MapReduce Execution Overview
6. The Master process gives the disk locations to an available reduce-task worker, which reads all of the associated intermediate data from remote storage.
[Diagram: the Master passes disk locations to a Reduce worker, which reads them remotely.]
MapReduce Execution Overview
7. Each reduce-task worker sorts its intermediate data, then calls the reduce function, passing in each unique key and its associated values. The reduce function's output is appended to the reduce task's partition output file.
[Diagram: the Reduce worker sorts its data and writes the partition output file.]
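The sort-group-reduce loop of step 7 can be sketched as follows (illustrative, not Hadoop's internal code):

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_task(intermediate, reduce_fn):
    # Sort the (key, value) pairs so equal keys become adjacent, then
    # call reduce once per unique key with its list of values.
    output = []
    for key, group in groupby(sorted(intermediate, key=itemgetter(0)),
                              key=itemgetter(0)):
        values = [value for _, value in group]
        output.append((key, reduce_fn(key, values)))
    return output  # in practice, appended to the partition output file

pairs = [("now", 1), ("How", 1), ("now", 1), ("How", 1), ("cow", 1)]
print(run_reduce_task(pairs, lambda key, values: sum(values)))
# [('How', 2), ('cow', 1), ('now', 2)]
```

Sorting rather than building a hash table is what lets a reduce task stream through intermediate data far larger than RAM.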
MapReduce Execution Overview
8. The Master process wakes up the user program when all tasks have completed. The output is contained in R output files.
[Diagram: the Master wakes the user program; results are in the R output files.]
Pig
– Data-flow-oriented language, "Pig Latin".
– Datatypes include sets, associative arrays, and tuples.
– High-level language for routing data; allows easy integration of Java for complex tasks.
– Developed at Yahoo!
Hive
– SQL-based data warehousing application; the feature set is similar to Pig's, but the language is more strictly SQL.
– Supports SELECT, JOIN, GROUP BY, etc.
– Features for analyzing very large data sets: partition columns, sampling, buckets.
– Developed at Facebook.
HBase
– Column-store database, based on the design of Google BigTable.
– Provides interactive access to information.
– Holds extremely large datasets (multi-TB).
– Constrained access model: (key, value) lookup, with limited transactions (single-row only).
ZooKeeper
– Distributed consensus engine.
– Provides well-defined concurrent access semantics for leader election, service discovery, distributed locking / mutual exclusion, and message boards / mailboxes.
Some more projects...
– Chukwa: Hadoop log aggregation.
– Scribe: more general log aggregation.
– Mahout: machine learning library.
– Cassandra: column-store database on a P2P backend.
– Dumbo: Python library for streaming.
– Ganglia: distributed monitoring.
Conclusions
– Computing with big datasets is a fundamentally different challenge from doing "big compute" over a small dataset.
– New ways of thinking about problems are needed; new tools such as MapReduce and HDFS provide the means to capture this.