Date post: | 20-Aug-2015 |
Category: |
Technology |
Upload: | ryan-tabora |
View: | 8,449 times |
Download: | 0 times |
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
Introduction to HDFS and MapReduce
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved2
Who Am I- Ryan Tabora
- Data Developer at Think Big Analytics
- Big Data Consulting
- Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved2
Who Am I- Ryan Tabora
- Data Developer at Think Big Analytics
- Big Data Consulting
- Experience working with Hadoop, HBase, Hive, Solr, Cassandra, etc.
Thursday, January 10, 13
Confidential Think Big Analytics
• One of Silicon Valley’s Fastest Growing Big Data start ups• 100% Focus on Big Data consulting & Data Science solution services• Management Background:
Cambridge Technology, C-bridge, Oracle, Sun Microsystems, Quantcast, Accenture
C-bridge Internet Solutions (CBIS) founder 1996 & executives, IPO 1999• Clients: 40+• North America Locations
• US East: Boston, New York, Washington D.C.• US Central: Chicago, Austin• US West: HQ Mountain View, San Diego, Salt Lake City
• EMEA & APAC
3
Think Big is the leading professional services firm that’s purpose built for Big Data.
Thursday, January 10, 13
Confidential Think Big Analytics 01/04/13
Think Big Recognized as a Top Pure-Play Big Data Vendor
Source: Forbes February 2012
4Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved5
Agenda- Big Data
- Hadoop Ecosystem
- HDFS
- MapReduce in Hadoop
- The Hadoop Java API
- Conclusions
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved6
Big Data
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved7
A Data Shift...
Source: EMC Digital Universe Study*
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved8
“Simple algorithms and lots of data trump complex
models. ”
Motivation
Halevy, Norvig, and Pereira (Google), IEEE Intelligent Systems
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved9
Pioneers• Google and Yahoo:
- Index 850+ million websites, over one trillion URLs.
• Facebook ad targeting:
- 840+ million users, > 50% of whom are active daily.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved10
Hadoop Ecosystem
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved11
Common Tool?• Hadoop
- Cluster: distributed computing platform.
- Commodity*, server-class hardware.
- Extensible Platform.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved12
Hadoop Origins• MapReduce and Google File System (GFS)
pioneered at Google.
• Hadoop is the commercially-supported open-source equivalent.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved13
What Is Hadoop?• Hadoop is a platform.
• Distributes and replicates data.
• Manages parallel tasks created by users.
• Runs as several processes on a cluster.
• The term Hadoop generally refers to a toolset, not a single tool.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved14
Why Hadoop?• Handles unstructured to semi-structured to
structured data.
• Handles enormous data volumes.
• Flexible data analysis and machine learning tools.
• Cost-effective scalability.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL like syntax with a meta store to allow SQL manipulation of data stored on HDFS.
• Pig - A top down scripting language to manipulate.
• HBase - A NoSQL, non-sequential data store.
15
The Hadoop Ecosystem
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL like syntax with a meta store to allow SQL manipulation of data stored on HDFS.
• Pig - A top down scripting language to manipulate.
• HBase - A NoSQL, non-sequential data store.
15
The Hadoop Ecosystem
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved16
HDFS
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved17
What Is HDFS?• Hadoop Distributed File System.
• Stores files in blocks across many nodes in a cluster.
• Replicates the blocks across nodes for durability.
• Master/Slave architecture.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
HDFS Traits
18
• Not fully POSIX compliant.
• No file updates.
• Write once, read many times.
• Large blocks, sequential read patterns.
• Designed for batch processing.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
HDFS Master
19
• NameNode
- Runs on a single node as a master process
‣ Holds file metadata (which blocks are where)
‣ Directs client access to files in HDFS
• SecondaryNameNode
- Not a hot failover
- Maintains a copy of the NameNode metadata
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
HDFS Slaves
20
• DataNode
- Generally runs on all nodes in the cluster
‣ Block creation/replication/deletion/reads
‣ Takes orders from the NameNode
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
NameNode
DataNode 1 DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
HDFS Illustrated
21
File
Put File
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
NameNode
DataNode 1 DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
HDFS Illustrated
21
File
Put File
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
NameNode
DataNode 1 DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
HDFS Illustrated
21
3
12Put File
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
NameNode
DataNode 1 DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
HDFS Illustrated
21
3
12,4,6
Put File
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
NameNode
DataNode 1 DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
HDFS Illustrated
21
3
12,4,6,5,3Put File
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
NameNode
DataNode 1 DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
HDFS Illustrated
21
3
12,2,6
,4,6,5,3Put File
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
NameNode
DataNode 1 DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
HDFS Illustrated
21
3
12,2,6
,4,6,5,3Put File
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
DataNode 1
NameNode
DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
22
Power of Hadoop
3
12,2,6
,4,6,5,3Read File
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
DataNode 1
NameNode
DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
22
Power of Hadoop
3
12,2,6
,4,6,5,3Read File
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
DataNode 1
NameNode
DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
22
Power of Hadoop
3
12,2,6
,4,6,5,3Read File
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
NameNode
DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
22
Power of Hadoop
32,2,6
,4,6,5,3Read File
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
NameNode
DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
22
Power of Hadoop
32,2,6
,4,6,5,3Read File
5
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
NameNode
DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
22
Power of Hadoop
32,2,6
,4,6,5,3Read File
5
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
NameNode
DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
22
Power of Hadoop
32,2,6
,4,6,5,3Read File
Read time =
Transfer Rate x
Number of Machines*
5
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
NameNode
DataNode 2 DataNode 3
DataNode 4 DataNode 5 DataNode 6
22
Power of Hadoop
32,2,6
,4,6,5,3Read File
Read time =
Transfer Rate x
Number of Machines*
100 MB/sx3=
300MB/s
5
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
HDFS Shell
23
• Easy to use command line interface.
• Create, copy, move, and delete files.
• Administrative duties - chmod, chown, chgrp.
• Set replication factor for a file.
• Head, tail, cat to view files.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL like syntax with a meta store to allow SQL manipulation of data stored on HDFS.
• Pig - A top down scripting language to manipulate.
• HBase - A NoSQL, non-sequential data store.
24
The Hadoop Ecosystem
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL like syntax with a meta store to allow SQL manipulation of data stored on HDFS.
• Pig - A top down scripting language to manipulate.
• HBase - A NoSQL, non-sequential data store.
24
The Hadoop Ecosystem
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved25
MapReduce in
Hadoop
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
MapReduce Basics
26
• Logical functions: Mappers and Reducers.
• Developers write map and reduce functions, then submit a jar to the Hadoop cluster.
• Hadoop handles distributing the Map and Reduce tasks across the cluster.
• Typically batch oriented.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
MapReduce Daemons
27
•JobTracker (Master)
- Manages MapReduce jobs, giving tasks to different nodes, managing task failure
•TaskTracker (Slave)
- Creates individual map and reduce tasks
- Reports task status to JobTracker
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved28
MapReduce in Hadoop
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved28
MapReduce in Hadoop
Let’s look at how MapReduce actually works in Hadoop,
using WordCount.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
There is a Map phase
Hadoop uses MapReduce
Input Mappers Sort,Shuffle
Reducers
map 1mapreduce 1phase 2
a 2hadoop 1is 2
Output
There is a Reduce phase
reduce 1there 2uses 1
(hadoop, 1)
(uses, 1)(mapreduce, 1)
(is, 1), (a, 1)
(there, 1)
(there, 1), (reduce 1)
(phase,1)
(map, 1),(phase,1)
(is, 1), (a, 1)
29
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
There is a Map phase
Hadoop uses MapReduce
Input Mappers Sort,Shuffle
Reducers
map 1mapreduce 1phase 2
a 2hadoop 1is 2
Output
There is a Reduce phase
reduce 1there 2uses 1
(hadoop, 1)
(uses, 1)(mapreduce, 1)
(is, 1), (a, 1)
(there, 1)
(there, 1), (reduce 1)
(phase,1)
(map, 1),(phase,1)
(is, 1), (a, 1)
29
We need to convert the Input
into the Output.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved30
There is a Map phase
Hadoop uses MapReduce
Input Mappers Sort,Shuffle
Reducers
map 1mapreduce 1phase 2
a 2hadoop 1is 2
Output
There is a Reduce phase
reduce 1there 2uses 1
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved31
There is a Map phase
Hadoop uses MapReduce
Input
(doc1, "…")
(doc2, "…")
(doc3, "")
Mappers
There is a Reduce phase (doc4, "…")
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved32
There is a Map phase
Hadoop uses MapReduce
Input
(doc1, "…")
(doc2, "…")
(doc3, "")
Mappers
There is a Reduce phase (doc4, "…")
(hadoop, 1)(uses, 1)(mapreduce, 1)
(there, 1) (is, 1)(a, 1) (reduce, 1)(phase, 1)
(there, 1) (is, 1)(a, 1) (map, 1)(phase, 1)
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved33
There is a Map phase
Hadoop uses MapReduce
Input
(doc1, "…")
(doc2, "…")
(doc3, "")
Mappers Sort,Shuffle
Reducers
There is a Reduce phase (doc4, "…")
(hadoop, 1)
(uses, 1)(mapreduce, 1)
(is, 1), (a, 1)
(there, 1)
(there, 1), (reduce 1)
(phase,1)
(map, 1),(phase,1)
(is, 1), (a, 1)
0-9, a-l
m-q
r-z
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved34
There is a Map phase
Hadoop uses MapReduce
Input
(doc1, "…")
(doc2, "…")
(doc3, "")
Mappers Sort,Shuffle
(a, [1,1]),(hadoop, [1]),
(is, [1,1])
(map, [1]),(mapreduce, [1]),
(phase, [1,1])
Reducers
There is a Reduce phase (doc4, "…")
(reduce, [1]),(there, [1,1]),
(uses, 1)
(hadoop, 1)
(uses, 1)(mapreduce, 1)
(is, 1), (a, 1)
(there, 1)
(there, 1), (reduce 1)
(phase,1)
(map, 1),(phase,1)
(is, 1), (a, 1)
0-9, a-l
m-q
r-z
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved35
There is a Map phase
Hadoop uses MapReduce
Input
(doc1, "…")
(doc2, "…")
(doc3, "")
Mappers Sort,Shuffle
(a, [1,1]),(hadoop, [1]),
(is, [1,1])
(map, [1]),(mapreduce, [1]),
(phase, [1,1])
Reducers
map 1mapreduce 1phase 2
a 2hadoop 1is 2
Output
There is a Reduce phase (doc4, "…")
(reduce, [1]),(there, [1,1]),
(uses, 1)
reduce 1there 2uses 1
(hadoop, 1)
(uses, 1)(mapreduce, 1)
(is, 1), (a, 1)
(there, 1)
(there, 1), (reduce 1)
(phase,1)
(map, 1),(phase,1)
(is, 1), (a, 1)
0-9, a-l
m-q
r-z
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved36
There is a Map phase
Hadoop uses MapReduce
Input
(doc1, "…")
(doc2, "…")
(doc3, "")
Mappers Sort,Shuffle
(a, [1,1]),(hadoop, [1]),
(is, [1,1])
(map, [1]),(mapreduce, [1]),
(phase, [1,1])
Reducers
map 1mapreduce 1phase 2
a 2hadoop 1is 2
Output
(doc4, "…")
(reduce, [1]),(there, [1,1]),
(uses, 1)
(hadoop, 1)
(uses, 1)(mapreduce, 1)
(is, 1), (a, 1)
(there, 1)
(there, 1), (reduce 1)
(phase,1)
(map, 1),(phase,1)
(is, 1), (a, 1)
0-9, a-l
m-q
r-z
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved36
There is a Map phase
Hadoop uses MapReduce
Input
(doc1, "…")
(doc2, "…")
(doc3, "")
Mappers Sort,Shuffle
(a, [1,1]),(hadoop, [1]),
(is, [1,1])
(map, [1]),(mapreduce, [1]),
(phase, [1,1])
Reducers
map 1mapreduce 1phase 2
a 2hadoop 1is 2
Output
(doc4, "…")
(reduce, [1]),(there, [1,1]),
(uses, 1)
(hadoop, 1)
(uses, 1)(mapreduce, 1)
(is, 1), (a, 1)
(there, 1)
(there, 1), (reduce 1)
(phase,1)
(map, 1),(phase,1)
(is, 1), (a, 1)
0-9, a-l
m-q
r-z
Map:
• Transform one input to 0-N outputs.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved36
There is a Map phase
Hadoop uses MapReduce
Input
(doc1, "…")
(doc2, "…")
(doc3, "")
Mappers Sort,Shuffle
(a, [1,1]),(hadoop, [1]),
(is, [1,1])
(map, [1]),(mapreduce, [1]),
(phase, [1,1])
Reducers
map 1mapreduce 1phase 2
a 2hadoop 1is 2
Output
(doc4, "…")
(reduce, [1]),(there, [1,1]),
(uses, 1)
(hadoop, 1)
(uses, 1)(mapreduce, 1)
(is, 1), (a, 1)
(there, 1)
(there, 1), (reduce 1)
(phase,1)
(map, 1),(phase,1)
(is, 1), (a, 1)
0-9, a-l
m-q
r-z
Map:
• Transform one input to 0-N outputs.
Reduce:
• Collect multiple inputs into one output.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
JobTracker
TaskTracker
DataNode
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
Cluster View of MapReduce
37
jar
M R
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
JobTracker
TaskTracker
DataNode
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
Cluster View of MapReduce
37
jar
M R
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
JobTracker
TaskTracker
DataNode
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
Cluster View of MapReduce
37
MMM
jar
M R
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
JobTracker
TaskTracker
DataNode
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
Cluster View of MapReduce
37
MMM
jar
M R
Map Phase
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
JobTracker
TaskTracker
DataNode
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
Cluster View of MapReduce
37
MMM
jar
M R
k,v k,vk,v k,v k,vMap Phase * Intermediate Data Is Stored Locally
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
JobTracker
TaskTracker
DataNode
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
Cluster View of MapReduce
37
jar
M R
k,v k,vk,v k,v k,vMap Phase
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
JobTracker
TaskTracker
DataNode
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
Cluster View of MapReduce
37
jar
M R
k,v k,vk,v k,v k,v
Shuffle/Sort
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
JobTracker
TaskTracker
DataNode
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
Cluster View of MapReduce
37
jar
M R
k,v k,v k,v k,v k,v
Shuffle/Sort
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
JobTracker
TaskTracker
DataNode
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
Cluster View of MapReduce
37
RR R
jar
M R
k,v k,v k,v k,v k,v
Reduce Phase
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
JobTracker
TaskTracker
DataNode
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
Cluster View of MapReduce
37
RR R
jar
M R
Reduce Phase
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
JobTracker
TaskTracker
DataNode
NameNode
TaskTracker
DataNode
TaskTracker
DataNode
Cluster View of MapReduce
37
jar
M R
Job Complete!
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved38
The Hadoop Java API
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved39
MapReduce in Java
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved39
MapReduce in Java
Let’s look at WordCountwritten in the
MapReduce Java API.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved40
public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("\\s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } }}
Map Code
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved40
public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("\\s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } }}
Map Code
Let’s drill into this code...
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved41
public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("\\s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } }}
Map Code
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved41
public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("\\s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } }}
Map CodeMapper class with 4
type parameters for the input key-value types and
output types.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved42
public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("\\s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } }}
Map Code
Output key-value objects we’ll reuse.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved43
public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("\\s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } }}
Map Code
Map method with input, output “collector”, and
reporting object.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved44
public class SimpleWordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
static final Text word = new Text(); static final IntWritable one = new IntWritable(1); @Override public void map(LongWritable key, Text documentContents, OutputCollector<Text, IntWritable> collector, Reporter reporter) throws IOException { String[] tokens = documentContents.toString().split("\\s+"); for (String wordString : tokens) { if (wordString.length() > 0) { word.set(wordString.toLowerCase()); collector.collect(word, one); } } }}
Map Code
Tokenize the line, “collect” each (word, 1)
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved45
public class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
@Override public void reduce(Text key, Iterator<IntWritable> counts, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int count = 0; while (counts.hasNext()) { count += counts.next().get(); } output.collect(key, new IntWritable(count)); }}
Reduce Code
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved45
public class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
@Override public void reduce(Text key, Iterator<IntWritable> counts, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int count = 0; while (counts.hasNext()) { count += counts.next().get(); } output.collect(key, new IntWritable(count)); }}
Reduce Code
Let’s drill into this code...
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
public class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
@Override public void reduce(Text key, Iterator<IntWritable> counts, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int count = 0; while (counts.hasNext()) { count += counts.next().get(); } output.collect(key, new IntWritable(count)); }}
46
Reduce Code
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
public class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
@Override public void reduce(Text key, Iterator<IntWritable> counts, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int count = 0; while (counts.hasNext()) { count += counts.next().get(); } output.collect(key, new IntWritable(count)); }}
46
Reduce CodeReducer class with 4
type parameters for the input key-value types and
output types.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved47
Reduce Code
Reduce method with input, output “collector”,
and reporting object.
public class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
@Override public void reduce(Text key, Iterator<IntWritable> counts, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int count = 0; while (counts.hasNext()) { count += counts.next().get(); } output.collect(key, new IntWritable(count)); }}
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved48
Reduce Code
Count the counts per word and emit(word, N)
public class SimpleWordCountReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
@Override public void reduce(Text key, Iterator<IntWritable> counts, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int count = 0; while (counts.hasNext()) { count += counts.next().get(); } output.collect(key, new IntWritable(count)); }}
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL like syntax with a meta store to allow SQL manipulation of data stored on HDFS.
• Pig - A top down scripting language to manipulate.
• HBase - A NoSQL, non-sequential data store.
49
Other Options
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL like syntax with a meta store to allow SQL manipulation of data stored on HDFS.
• Pig - A top down scripting language to manipulate.
• HBase - A NoSQL, non-sequential data store.
49
Other Options
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL like syntax with a meta store to allow SQL manipulation of data stored on HDFS.
• Pig - A top down scripting language to manipulate.
• HBase - A NoSQL, non-sequential data store.
49
Other Options
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved50
Conclusions
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• A cost-effective, scalable way to:- Store massive data sets.- Perform arbitrary analyses on
those data sets.
51
Hadoop Benefits
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• Offers a variety of tools for:- Application development.- Integration with other platforms
(e.g., databases).
52
Hadoop Tools
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• A rich, open-source ecosystem.- Free to use.- Commercially-supported
distributions.
53
Hadoop Distributions
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved54
Thank You!- Feel free to contact me at
- Or our solutions consultant
- As always, THINK BIG!
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved55
Bonus Content
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL like syntax with a meta store to allow SQL manipulation of data stored on HDFS.
• Pig - A top down scripting language to manipulate.
• HBase - A NoSQL, non-sequential data store.
56
The Hadoop Ecosystem
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL like syntax with a meta store to allow SQL manipulation of data stored on HDFS.
• Pig - A top down scripting language to manipulate.
• HBase - A NoSQL, non-sequential data store.
56
The Hadoop Ecosystem
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved57
Hive: SQL for Hadoop
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved58
Hive
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved58
Hive
Let’s look at WordCountwritten in Hive,
the SQL for Hadoop.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved59
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts ASSELECT word, count(1) AS count FROM(SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved59
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts ASSELECT word, count(1) AS count FROM(SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
Let’s drill into this code...
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved60
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts ASSELECT word, count(1) AS count FROM(SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved60
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts ASSELECT word, count(1) AS count FROM(SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
Create a table to hold the raw text we’re
counting. Each line is a “column”.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved61
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts ASSELECT word, count(1) AS count FROM(SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
Load the text in the “docs” directory into the
table.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved62
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'docs' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts ASSELECT word, count(1) AS count FROM(SELECT explode(split(line, '\s')) AS word FROM docs) w GROUP BY word ORDER BY word;
Create the final table and fill it with the results from a nested query of
the docs table that performs WordCount
on the fly.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved63
Hive
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved63
Hive
Because so many Hadoop userscome from SQL backgrounds,
Hive is one of the mostessential tools in the ecosystem!!
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL like syntax with a meta store to allow SQL manipulation of data stored on HDFS.
• Pig - A top down scripting language to manipulate.
• HBase - A NoSQL, non-sequential data store.
64
The Hadoop Ecosystem
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved
• HDFS - Hadoop Distributed File System.
• Map/Reduce - A distributed framework for executing work in parallel.
• Hive - A SQL like syntax with a meta store to allow SQL manipulation of data stored on HDFS.
• Pig - A top down scripting language to manipulate.
• HBase - A NoSQL, non-sequential data store.
64
The Hadoop Ecosystem
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved65
Pig: Data Flow
for Hadoop
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved66
Pig
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved66
Pig
Let’s look at WordCountwritten in Pig,
the Data Flow language for Hadoop.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved67
inpt = LOAD 'docs' using TextLoader AS (line:chararray);
words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
cntd = FOREACH grpd GENERATE group, COUNT(words);
STORE cntd INTO 'output';
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved67
inpt = LOAD 'docs' using TextLoader AS (line:chararray);
words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
cntd = FOREACH grpd GENERATE group, COUNT(words);
STORE cntd INTO 'output'; Let’s drill into this code...
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved68
inpt = LOAD 'docs' using TextLoader AS (line:chararray);
words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
cntd = FOREACH grpd GENERATE group, COUNT(words);
STORE cntd INTO 'output';
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved68
inpt = LOAD 'docs' using TextLoader AS (line:chararray);
words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
cntd = FOREACH grpd GENERATE group, COUNT(words);
STORE cntd INTO 'output';
Like the Hive example, load “docs” content,each line is a “field”.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved69
inpt = LOAD 'docs' using TextLoader AS (line:chararray);
words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
cntd = FOREACH grpd GENERATE group, COUNT(words);
STORE cntd INTO 'output';
Tokenize into words (an array) and “flatten” into
separate records.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved70
inpt = LOAD 'docs' using TextLoader AS (line:chararray);
words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
cntd = FOREACH grpd GENERATE group, COUNT(words);
STORE cntd INTO 'output';
Collect the same words together.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved71
inpt = LOAD 'docs' using TextLoader AS (line:chararray);
words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
cntd = FOREACH grpd GENERATE group, COUNT(words);
STORE cntd INTO 'output';
Count each word.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved72
inpt = LOAD 'docs' using TextLoader AS (line:chararray);
words = FOREACH inpt GENERATE flatten(TOKENIZE(line)) AS word;
grpd = GROUP words BY word;
cntd = FOREACH grpd GENERATE group, COUNT(words);
STORE cntd INTO 'output'; Save the results.Profit!
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved73
Pig
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved73
Pig
Pig and Hive overlap, but Pig is popular for ETL, e.g., data transformation, cleansing, ingestion, etc.
Thursday, January 10, 13
Copyright © 2012-2013, Think Big Analytics, All Rights Reserved74
Questions?
Thursday, January 10, 13