Date posted: 04-Nov-2014
Category: Education
Uploaded by: jigsawacademy2014
Topic 1
HDFS – Hands On (Part – 1)
Class 2 – Hadoop Distributed File System
AGENDA
• What is Big Data?
• Hadoop Distributed File System
• MapReduce
• Understanding Hadoop Ecosystem
• Setting up a Hadoop Cluster
• HDFS – Hands On
• MapReduce – Hands On
Pre-requisites
HDFS – Hands On
The virtual machine is up and running.
You are connected to the virtual machine through PuTTY as ‘hduser’.
Command Syntax
hadoop fs -ls / (To list directory contents)
hadoop fs -<command> <args>
hadoop: The binary executable.
fs: Invokes the Hadoop file system shell, which here operates on HDFS.
<command>: Indicates the purpose of the statement; it is always preceded by a ‘-’.
<args>: The arguments that apply to the command.
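The four parts of the command line can be illustrated with a small Python sketch that assembles them into an argument list. The helper name `hadoop_fs` is hypothetical and not part of Hadoop; it only shows how the pieces fit together, and it uses a plain ASCII hyphen in front of the command, which is what the shell expects.

```python
import shlex

def hadoop_fs(command, *args):
    """Build the argument list for `hadoop fs -<command> <args>` (hypothetical helper)."""
    if command.startswith("-"):
        raise ValueError("pass the command name without the leading '-'")
    # executable, file system shell, '-' + command, then the command's arguments
    return ["hadoop", "fs", "-" + command, *args]

argv = hadoop_fs("ls", "/")
print(shlex.join(argv))  # hadoop fs -ls /
```

On a machine where Hadoop is installed, the same list could be handed to `subprocess.run(argv)` to execute the command.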
Where do DataNodes store data?
hadoop.tmp.dir = /tmp/hadoop
dfs.data.dir = ${hadoop.tmp.dir}/dfs/data = /tmp/hadoop/dfs/data
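How the data directory is derived from hadoop.tmp.dir can be sketched in a few lines of Python. This is only an illustration of the substitution step, not Hadoop's own code; the `resolve` function and the `props` dictionary are assumptions made for the example.

```python
import re

def resolve(props, key, _depth=0):
    """Expand ${name} references in a property value (illustrative only)."""
    if _depth > 20:
        raise ValueError("too many nested references")
    value = props[key]

    def substitute(match):
        # Replace ${name} with the resolved value of the property called name.
        return resolve(props, match.group(1), _depth + 1)

    return re.sub(r"\$\{([^}]+)\}", substitute, value)

props = {
    "hadoop.tmp.dir": "/tmp/hadoop",
    "dfs.data.dir": "${hadoop.tmp.dir}/dfs/data",
}
print(resolve(props, "dfs.data.dir"))  # /tmp/hadoop/dfs/data
```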
VERSION >> Java properties file
blk_********* >> Raw data of a file
blk_******.meta >> Metadata of the block
How come there is a block when we have not loaded any file?
The block belongs to the jobtracker.info file.
fsck
Generates a summary report that lists the overall health of the filesystem.
Total size: Indicates the size of the directory (root directory in our case). Does not account for replication.
Total dirs: Indicates the number of directories in HDFS
Total files: Indicates the number of files in HDFS
Total blocks: Indicates the number of blocks
Other fields in the report:
Default replication factor
Average replication factor
Corrupt blocks
Missing replicas
Number of data nodes
Number of racks
Edit .bashrc
Navigate to the home directory.
cd
List hidden files.
ls -a
Edit the .bashrc file.
vi .bashrc
Update HADOOP paths using ‘export’ command.
export HADOOP_CONF=/home/hduser/hadoop/conf
export HADOOP_PREFIX=/home/hduser/hadoop
# Add Hadoop bin/ directory to path
export PATH=$PATH:$HADOOP_PREFIX/bin
Execute the updated contents of the .bashrc file.
source ~/.bashrc
copyFromLocal
Copies a file from the local file system to HDFS.
hadoop fs -copyFromLocal <Path to source file on Local File System> <Target path in HDFS>
hadoop fs -copyFromLocal NOTICE.txt noticehdfs.txt
Internally, the copyFromLocal command results in:
the file getting split into multiple blocks.
the client contacting the NameNode to find out where each block should be copied in the cluster.
replication of the blocks to the nodes assigned by the NameNode.
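The placement and replication steps above can be pictured with a toy Python model. This is a sketch under stated assumptions, not how the NameNode actually chooses nodes: real placement takes racks and free space into account, whereas the hypothetical `plan_write` function below simply walks the DataNode list round-robin. The block and node names are made up for the example.

```python
def plan_write(block_ids, datanodes, replication):
    """Toy NameNode: map each block to `replication` distinct DataNodes (round-robin)."""
    if not datanodes:
        raise ValueError("no DataNodes available")
    # A block cannot have more replicas than there are DataNodes.
    copies = min(replication, len(datanodes))
    plan = {}
    for i, block_id in enumerate(block_ids):
        plan[block_id] = [datanodes[(i + r) % len(datanodes)] for r in range(copies)]
    return plan

plan = plan_write(["blk_1", "blk_2", "blk_3"], ["dn1", "dn2", "dn3", "dn4"], replication=3)
for block_id, nodes in plan.items():
    print(block_id, "->", nodes)
```

On a single-node setup the same function returns one replica per block, because only one DataNode is available to hold it.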
How many blocks were created?
RECAP
HDFS Commonly used commands
HDFS Concepts
Topic 2
HDFS – Hands On (Part – 2)
Class 2 – Hadoop Distributed File System
Load a file larger than the block size
Load a 200 MB file and see how many blocks were created.
Command to generate a 200 MB dummy file:
dd if=/dev/zero of=file.txt count=1024 bs=204800
Load the file into HDFS and list the DataNode's block directory:
hadoop fs -copyFromLocal file.txt file.txt
cd /tmp/hadoop/dfs/data/current
ls -lrt
Block 1 = 64 MB
Block 2 = 64 MB
Block 3 = 8 MB
Block 4 = 64 MB
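The block sizes listed above follow from simple arithmetic, sketched here in Python. The 64 MB block size is taken from the slide; the function name `block_sizes` is an assumption made for the example.

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB  # block size shown on the slide

def block_sizes(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block a file of file_size bytes occupies."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])

file_size = 1024 * 204800  # dd wrote count=1024 chunks of bs=204800 bytes
sizes = block_sizes(file_size)
print(file_size // MB, "MB ->", [s // MB for s in sizes])  # 200 MB -> [64, 64, 64, 8]
```

Three blocks are full and the last one holds only the remaining 8 MB; a block occupies only as much disk space as the data it contains.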
fsck
fsck after loading 2 additional files.
Total size has increased.
Total dirs: 7. Additions: /user and /user/hduser directories.
Total files: 3. Additions: 2 newly loaded files.
Total blocks: 6. Additions: 1 block of the 1st file and 4 blocks of the 2nd file.
cat
Displays the contents of a file in the terminal.
hadoop fs -cat <Path of file in HDFS>
hadoop fs -cat noticehdfs.txt
copyToLocal
Copies a file from HDFS to the local file system.
hadoop fs -copyToLocal <Path of file in HDFS> <Path of file in Local File System>
hadoop fs -copyToLocal noticehdfs.txt noticelocal.txt
mkdir
Creates a directory inside HDFS. HDFS paths that do not start with ‘/’ are relative to the current user’s home directory.
Creates a directory in the current user’s home directory:
hadoop fs -mkdir newdir
Creates a new directory under root:
hadoop fs -mkdir /newdir
rm
Removes file(s).
hadoop fs -rm <File Name>
Removes files and empty directories.
hadoop fs -rm noticehdfs.txt
Trash feature
Prevents accidental deletion of files and directories.
Disabled by default.
To enable it, configure the fs.trash.interval property in the core-site.xml file.
RECAP
HDFS Commonly used commands
HDFS Concepts
Topic 3
HDFS – Web UI
Class 2 – Hadoop Distributed File System
NameNode Web Interface
HDFS Web Interface URL.
http://<namenode_host>:50070/
From the Virtual Machine:
http://localhost:50070/
From outside the Virtual Machine:
http://<IP Address of VM or Hostname of VM>:50070/
Example: http://192.168.234.135:50070/
The NameNode web interface shows:
Server name and port
Last start time of the NameNode
Hadoop version, followed by the Subversion source code repository
Link to browse the files in HDFS
Link to view the NameNode log files
Number of files, directories and blocks
Heap memory utilized/available
Storage capacity of the machines in the cluster
How much space is utilized in HDFS
Space utilized by the O/S, applications etc.
Amount of space available on HDFS
How many blocks have fewer replicas than the replication factor
Nodes that are active and in contact with the NameNode
Nodes that are NOT in contact with the NameNode
Nodes administratively removed from the cluster
RECAP
HDFS Web UI
Topic 4
Class 2 – Hadoop Distributed File System
MapReduce – Hands On (Part – 1)
How does MapReduce work?
[Diagram: in the Mapping Phase, the Mapper turns the Map Input List into the Map Output List; in the Reducing Phase, the Reducer turns the Reduce Input List into the Reduce Output List.]
Hadoop MapReduce
[Word count diagram. The stages and the data at each stage are:]
Input:
King Queen King
Minister King Soldier
Queen Soldier King
Splitting:
<1, King Queen King>
<2, Minister King Soldier>
<3, Queen Soldier King>
Map (Map Output):
<King, 1><Queen, 1><King, 1>
<Minister, 1><King, 1><Soldier, 1>
<Queen, 1><Soldier, 1><King, 1>
Shuffling:
<King, 1><King, 1><King, 1><King, 1>
<Minister, 1>
<Queen, 1><Queen, 1>
<Soldier, 1><Soldier, 1>
Reduce:
<King, (1,1,1,1)><Minister, 1>
<Queen, (1,1)><Soldier, (1,1)>
Result:
<King, 4><Minister, 1>
<Queen, 2><Soldier, 2>
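The word count flow on this slide can be reproduced in plain Python to see what each stage does to the data. This is a single-process sketch of the idea, not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` are chosen for the example.

```python
from collections import defaultdict

def map_phase(record):
    """Mapper: emit (word, 1) for every word in one input record."""
    _line_number, text = record
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Group the values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reducer: add up the ones collected for a word."""
    return key, sum(values)

lines = ["King Queen King", "Minister King Soldier", "Queen Soldier King"]
records = list(enumerate(lines, start=1))                         # splitting
mapped = [pair for record in records for pair in map_phase(record)]  # map
grouped = shuffle(mapped)                                         # shuffling
result = [reduce_phase(key, values) for key, values in grouped]   # reduce
print(result)  # [('King', 4), ('Minister', 1), ('Queen', 2), ('Soldier', 2)]
```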
Hadoop MapReduce – Roles: User vs. Framework
[The same word count diagram as in the preceding slides, annotated step by step with the following labels:]
Load data into HDFS
Specify Path & Input Format
Create ‘Input Splits’
Create individual Records
User Defined Logic (appears twice in the diagram)
Specify Path & Output format
Replication, Rack Awareness etc.
MapReduce Execution Framework
[Diagram, built up step by step in the slides. The recoverable elements are:]
Components: Driver, InputFormat, Record Reader, Mapper, Reducer, OutputFormat, Writer
Input HDFS File - inputFile.txt: Block A, Block B, Block C
Input splits: Input Split 1, Input Split 2, Input Split 3, Input Split 4
Process boxes: Mapper Process, Reduce Process
Arrow labels: Defines, Calculates, Reads, Passes <K,V> pairs, Writes
Flow: the InputFormat calculates the input splits; a Record Reader reads a split and passes <K,V> pairs to its Mapper; the <K,V> pairs emitted by the Mappers go through Partition, Shuffle and Sort to the Reducers; each Reducer passes <K,V> pairs to a Writer, which writes the Output Data.
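The Partition, Shuffle and Sort steps between the Mappers and the Reducers can be sketched in Python as follows. This is a simplified single-process model under stated assumptions: each split holds one line, there are two reducers, and `zlib.crc32` stands in for the key hash so that the result is stable across runs. It is not the hash function Hadoop uses, and all function names are chosen for the example.

```python
import zlib
from collections import defaultdict

def record_reader(split):
    """Turn one input split (a list of lines) into (line_number, line) pairs."""
    return list(enumerate(split))

def mapper(_key, line):
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def partition(key, num_reducers):
    """Stand-in for a hash partitioner: the same key always goes to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_job(splits, num_reducers=2):
    # Map side: one mapper per split; each output pair is routed to a partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, line in record_reader(split):
            for word, one in mapper(key, line):
                partitions[partition(word, num_reducers)][word].append(one)
    # Reduce side: each reducer receives its keys in sorted order and sums the values.
    return [
        [(word, sum(ones)) for word, ones in sorted(groups.items())]
        for groups in partitions
    ]

splits = [["King Queen King"], ["Minister King Soldier"], ["Queen Soldier King"]]
outputs = run_job(splits)
for reducer_id, rows in enumerate(outputs):
    print("reducer", reducer_id, rows)
```

Because the partition is computed from the key alone, every occurrence of a word ends up at the same reducer, which is what allows each reducer to produce a complete count for its keys.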
RECAP
MapReduce Execution Framework
Topic 5
Class 2 – Hadoop Distributed File System
MapReduce – Hands On (Part – 2)
Java MapReduce Programming
MapReduce
Hello World of MapReduce >> Word Count program
Eclipse – Integrated Development Environment (IDE)
https://www.eclipse.org/downloads/
RECAP
Part two of Java MapReduce program