Copyright (c) 2008-2011 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
A very short Intro to Hadoop
Photo by: i_pinz, flickr
Scale Unlimited, Inc
Monday, December 19, 2011
Welcome to Intro to Hadoop
Usually a 4-hour online class
Focus on Why, What, How of Hadoop
Meet Your Instructor
Developer of Bixo web mining toolkit
Committer on Apache Tika
Hadoop developer and trainer
Solr developer and trainer
A very short Intro to Hadoop
Overview
Photo by: exfordy, flickr
How to Crunch a Petabyte?
Lots of disks, spinning all the time
Redundancy, since disks die
Lots of CPU cores, working all the time
Retry, since network errors happen
Hadoop to the Rescue
Scalable - many servers with lots of cores and spindles
Reliable - detect failures, redundant storage
Fault-tolerant - auto-retry, self-healing
Simple - use many servers as one really big computer
Logical Architecture
Logically, Hadoop is simply a computing cluster that provides:
a Storage layer, and
an Execution layer
Storage Layer
Hadoop Distributed File System (aka HDFS, or Hadoop DFS)
Runs on top of regular OS file system, typically Linux ext3
Fixed-size blocks (64MB by default) that are replicated
Write once, read many; optimized for streaming in and out
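The fixed-size block model can be made concrete with a little arithmetic. This is an illustrative sketch, not HDFS code; the 64 MB block size and 3x replication are the defaults mentioned above, and the function names are invented:

```python
# Sketch: how a file is chunked into fixed-size blocks and replicated.
# Illustrative only -- not the real HDFS implementation.

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the HDFS default block size
REPLICATION = 3                # default replication factor

def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to store a file (the last block may be partial)."""
    return (file_size + block_size - 1) // block_size  # ceiling division

def total_stored_bytes(file_size: int, replication: int = REPLICATION) -> int:
    """Raw bytes consumed on disk once every block is replicated."""
    return file_size * replication

# A 200 MB file needs 4 blocks (3 full + 1 partial),
# and consumes 600 MB of raw disk across the cluster.
size = 200 * 1024 * 1024
print(block_count(size))         # 4
print(total_stored_bytes(size))  # 629145600 (600 MB)
```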
Execution Layer
Hadoop Map-Reduce
Responsible for running a job in parallel on many servers
Handles re-trying a task that fails, validating complete results
Jobs consist of special “map” and “reduce” operations
Scalable
Virtual execution and storage layers span many nodes (servers)
Scales linearly (sort of) with cores and disks.
[Diagram: a Cluster of Racks, each Rack containing Nodes; every Node contributes CPU to the Execution layer and disk to the Storage layer]
Reliable
Each block is replicated, typically three times.
Each task must succeed, or the job fails.
Fault-tolerant
Failed tasks are automatically retried.
Failed data transfers are automatically retried.
Servers can join and leave the cluster at any time.
Simple
Reduces complexity
Conceptual “operating system” that spans many CPUs & disks
Typical Hadoop Cluster
Has one “master” server - high quality, beefy box.
NameNode process - manages file system
JobTracker process - manages tasks
Has multiple “slave” servers - commodity hardware.
DataNode process - manages file system blocks on local drives
TaskTracker process - runs tasks on server
Uses a high-speed network between all servers
Architectural Components
Solid boxes are unique applications. Dashed boxes are child JVM instances (on the same node as the parent). Dotted boxes are blocks of managed files.
[Diagram: a Client submits jobs to the JobTracker on the Master; each TaskTracker on a Slave runs mapper and reducer tasks in child JVMs; the NameNode on the Master tracks DataNodes on the Slaves, which manage data blocks]
A very short Intro to Hadoop
Distributed File System
Photo by: Graham Racher, flickr
Virtual File System
Treats many disks on many servers as one huge, logical volume
Data is stored in 1...n blocks
The “DataNode” process manages blocks of data on a slave.
The “NameNode” process keeps track of file metadata on the master.
Replication
Each block is stored on several different disks (default is 3)
Hadoop tries to copy blocks to different servers and racks.
Protects data against disk, server, rack failures.
Reduces the need to move data to code.
Error Recovery
Slaves constantly “check in” with the master.
Data is automatically replicated if a disk or server “goes away”.
Limitations
Optimized for streaming data in/out
So no random access to data in a file
Data rates ≈ 30% - 50% of max raw disk rate
No append support currently
Write once, read many
NameNode
Runs on master node
Is a single point of failure
There are no built-in software hot failover mechanisms
Maintains filesystem metadata, the “Namespace”
files and hierarchical directories
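The Namespace the NameNode maintains is essentially a map from file paths to block lists, plus (at runtime) block-to-DataNode locations. A toy sketch to make the bookkeeping concrete; the data structures and names here are invented for illustration, not the real NameNode internals:

```python
# Toy model of NameNode metadata: which blocks make up each file,
# and which DataNodes hold each block. Invented for illustration.

namespace = {
    "/logs/2011/12/19.log": ["blk_001", "blk_002"],  # file -> ordered block list
}

block_locations = {
    "blk_001": ["datanode-a", "datanode-b", "datanode-c"],  # 3 replicas each
    "blk_002": ["datanode-b", "datanode-c", "datanode-d"],
}

def locate(path: str):
    """Resolve a file path to (block id, replica locations) pairs --
    the answer a client needs before reading directly from DataNodes."""
    return [(blk, block_locations[blk]) for blk in namespace[path]]

for blk, nodes in locate("/logs/2011/12/19.log"):
    print(blk, nodes)
```

Note that the NameNode only answers metadata queries like this; the actual bytes move between the client and the DataNodes.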
DataNodes
Files stored on HDFS are chunked and stored as blocks on DataNodes
Each DataNode manages the storage attached to the node it runs on
Data never flows through the NameNode, only DataNodes
A very short Intro to Hadoop
Map-Reduce Paradigm
Photo by: Graham Racher, flickr
Definitions
Key Value Pair -> two units of data, exchanged between Map & Reduce
Map -> the ‘map’ function in the MapReduce algorithm
user defined
converts each input Key Value Pair to 0...n output Key Value Pairs
Reduce -> the ‘reduce’ function in the MapReduce algorithm
user defined
converts each input Key + all of its Values to 0...n output Key Value Pairs
Group -> a built-in operation that happens between Map and Reduce
ensures each Key passed to Reduce includes all of its Values
All Together
[Diagram: five Mappers feed three Reducers through the Shuffle]
How MapReduce Works
Map translates input keys and values to new keys and values
The system Groups each unique key with all of its values
Reduce translates the values of each unique key to new keys and values

Map:    [K1,V1] -> [K2,V2]
Group:  [K2,V2] -> [K2,{V2,V2,...}]
Reduce: [K2,{V2,V2,...}] -> [K3,V3]
Canonical Example - Word Count
Word Count
Read a document, parse out the words, count the frequency of each word
Specifically, in MapReduce
With a document consisting of lines of text
Translate each line of text into key = word and value = 1
e.g. <“the”,1> <“quick”,1> <“brown”,1> <“fox”,1> <“jumped”,1>
Sum the values (ones) for each unique word
Word Count Example
Input ([K1,V1] = line number, line text):
[1, When in the Course of human events, it becomes]
[2, necessary for one people to dissolve the political bands]
[3, which have connected them with another, and to assume]
Map (user defined) emits [K2,V2] = [k,1]:
[When,1] [in,1] [the,1] [Course,1] ...
Group (built in) emits [K2,{V2,V2,...}] = [k,{v,v,...}]:
[When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] ...
Reduce (user defined) emits [K3,V3] = [k,sum(v,v,v,...)]:
[When,4] [people,6] [dissolve,3] [connected,5] ...
Divide & Conquer (splitting data)
Because
The Map function only cares about the current key and value, and
The Reduce function only cares about the current key and its values
Then
A Mapper can invoke Map on an arbitrary number of input keys and values
or just some fraction of the input data set
A Reducer can invoke Reduce on an arbitrary number of the unique keys
but must receive all the values for each of those keys
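What guarantees that "all the values for that key" reach a single Reducer is the partition function applied during the shuffle. A sketch, assuming Hadoop's default behavior of hashing the key modulo the number of reducers (the helper names and the toy hash are invented for illustration):

```python
# Sketch of shuffle partitioning: every [key, value] pair is routed to
# exactly one reducer by hashing the key, so each reducer sees all the
# values for each of its keys. Invented names; not Hadoop's code.

NUM_REDUCERS = 3

def partition(key: str, num_reducers: int = NUM_REDUCERS) -> int:
    """Deterministic key -> reducer assignment (in the spirit of
    Hadoop's default HashPartitioner: hash(key) mod num_reducers)."""
    return sum(key.encode()) % num_reducers  # simple, stable toy hash

def shuffle(pairs, num_reducers: int = NUM_REDUCERS):
    """Route mapper output into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

pairs = [("the", 1), ("fox", 1), ("the", 1), ("fox", 1)]
buckets = shuffle(pairs)
# Every occurrence of a given key lands in the same bucket,
# so the reducer that owns that bucket sees all of its values.
```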
[Diagram: a file is divided into splits (split1, split2, split3, split4, ...); each Mapper processes one split, emitting [K2,V2]; the Shuffle routes each pair to a Reducer, which groups it into [K2,{V2,V2,...}] and emits [K3,V3]; each Reducer writes one file (part-00000, part-00001, ... part-000N) to the output directory]
Mappers must complete before Reducers can begin
Divide & Conquer (parallelizable)
Because
Each Mapper is independent and processes part of the whole, and
Each Reducer is independent and processes part of the whole
Then
Any number of Mappers can run on each node, and
Any number of Reducers can run on each node, and
The cluster can contain any number of nodes
JobTracker
Is a single point of failure
Determines # Mapper tasks from file splits via InputFormat
Uses a predefined value for # Reducer tasks
Client applications use JobClient to submit jobs and query status
Command line: hadoop job <commands>
Web status console: http://jobtracker-server:50030/
TaskTracker
Spawns each Task as a new child JVM
Max # mapper and reducer tasks set independently
Can pass child JVM opts via mapred.child.java.opts
Can re-use JVM to avoid overhead of task initialization
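The knobs above live in mapred-site.xml (property names as in Hadoop 0.20/1.x, the version this deck covers). The values below are illustrative, not recommendations:

```xml
<!-- mapred-site.xml fragment: illustrative values only -->
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>  <!-- max concurrent mapper tasks per TaskTracker -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>  <!-- max concurrent reducer tasks per TaskTracker -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>  <!-- JVM options passed to each child task -->
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>  <!-- -1 means reuse a JVM for unlimited tasks of a job -->
  </property>
</configuration>
```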
A very short Intro to Hadoop
Hadoop (sort of) Deep Thoughts
Photo by: Graham Racher, flickr
Avoiding Hadoop
Hadoop is a big hammer - but not every problem is a nail
Small data
Real-time data
Beware the Hadoopaphile
Avoiding Map-Reduce
Writing Hadoop code is painful and error prone
Hive & Pig are good solutions for People Who Like SQL
Cascading is a good solution for complex, stable workflows
Leveraging the Eco-System
Many open source projects built on top of Hadoop
HBase - scalable NoSQL data store
Sqoop - getting SQL data in/out of Hadoop
Other projects work well with Hadoop
Kafka/Scribe - getting log data in/out of Hadoop
Avro - data serialization/storage format
Get Involved
Join the mailing list - http://hadoop.apache.org/mailing_lists.html
Go to user group meetings - e.g. http://hadoop.meetup.com/
Learn More
Buy the book - Hadoop: The Definitive Guide, 2nd edition
Try the tutorials
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
Get training (danger, personal plug)
http://www.scaleunlimited.com/training
http://www.cloudera.com/hadoop-training
Resources
Scale Unlimited Alumni list - [email protected]
Hadoop mailing lists - http://hadoop.apache.org/mailing_lists.html
Users groups - e.g. http://hadoop.meetup.com/
Hadoop API - http://hadoop.apache.org/common/docs/current/api/
Hadoop: The Definitive Guide, 2nd edition by Tom White
Cascading: http://www.cascading.org
Datameer: http://www.datameer.com