Jian Wang
Based on “Meet Hadoop! Open Source Grid Computing” by Devaraj Das
Yahoo! Inc. Bangalore & Apache Software Foundation
Need to process 10TB datasets
◦ On 1 node: scanning @ 50MB/s = 2.3 days
◦ On a 1000-node cluster: scanning @ 50MB/s = 3.3 min
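The scan-time figures above follow from a back-of-the-envelope calculation (decimal units assumed):

```python
# Check the quoted scan times: 10 TB at 50 MB/s, on 1 node vs. 1000 nodes.
DATASET_BYTES = 10 * 10**12      # 10 TB
SCAN_RATE = 50 * 10**6           # 50 MB/s per node

single_node_days = DATASET_BYTES / SCAN_RATE / 86400          # seconds -> days
cluster_minutes = DATASET_BYTES / (SCAN_RATE * 1000) / 60     # 1000-way parallel scan

print("1 node: %.1f days" % single_node_days)        # ~2.3 days
print("1000 nodes: %.1f min" % cluster_minutes)      # ~3.3 min
```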
Need an efficient, reliable and usable framework; Hadoop's design draws on:
◦ Google File System (GFS) paper
◦ Google's MapReduce paper
Hadoop uses HDFS, a distributed file system based on GFS, as its shared file system
◦ Files are divided into large blocks (64MB) and distributed across the cluster
◦ Blocks are replicated to handle hardware failure
◦ Current block replication is 3 (configurable)
◦ HDFS cannot be directly mounted by an existing operating system
Once you use the DFS (put something in it), relative paths are resolved from /user/{your user id}. E.g. if your id is jwang30, your "home dir" is /user/jwang30.
Master-Slave Architecture
MapReduce Master "Jobtracker" (irkm-1)
◦ Accepts MR jobs submitted by users
◦ Assigns Map and Reduce tasks to Tasktrackers
◦ Monitors task and Tasktracker status, re-executes tasks upon failure
MapReduce Slaves "Tasktrackers" (irkm-1 to irkm-6)
◦ Run Map and Reduce tasks upon instruction from the Jobtracker
◦ Manage storage and transmission of intermediate output
The HDFS daemons are split the same way: the Namenode on the master manages file system metadata, while the Datanodes on the slaves store and serve the blocks.
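The map → shuffle/sort → reduce flow that these daemons coordinate can be sketched in-process (plain Python, no Hadoop involved; all names here are illustrative):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # Map phase: each input record yields zero or more (key, value) pairs.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle/sort: group pairs by key, as Hadoop does between the phases.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per key with all of its values.
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

# Word count expressed in this model:
mapper = lambda line: [(w, 1) for w in line.split()]
reducer = lambda word, counts: (word, sum(counts))

print(run_mapreduce(["a b", "b b"], mapper, reducer))  # [('a', 1), ('b', 3)]
```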
Hadoop is locally "installed" on each machine
◦ Version 0.19.2
◦ Install location: /home/tmp/hadoop
◦ Slave nodes store their data in /tmp/hadoop-${user.name} (configurable)
If it is the first time you use it, you need to format the namenode:
◦ log in to irkm-1
◦ cd /home/tmp/hadoop
◦ bin/hadoop namenode -format
Most commands follow the same pattern:
◦ bin/hadoop <command> [options]
◦ Typing bin/hadoop with no arguments lists all possible commands (including undocumented ones)
hadoop dfs
◦ [-ls <path>]
◦ [-du <path>]
◦ [-cp <src> <dst>]
◦ [-rm <path>]
◦ [-put <localsrc> <dst>]
◦ [-copyFromLocal <localsrc> <dst>]
◦ [-moveFromLocal <localsrc> <dst>]
◦ [-get [-crc] <src> <localdst>]
◦ [-cat <src>]
◦ [-copyToLocal [-crc] <src> <localdst>]
◦ [-moveToLocal [-crc] <src> <localdst>]
◦ [-mkdir <path>]
◦ [-touchz <path>]
◦ [-test -[ezd] <path>]
◦ [-stat [format] <path>]
◦ [-help [cmd]]
bin/start-all.sh – starts all slave nodes and master node
bin/stop-all.sh – stops all slave nodes and master node
Run jps to check the status
Log in to irkm-1
rm -fr /tmp/hadoop/$userID
cd /home/tmp/hadoop
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example
After that: bin/hadoop dfs -ls
Mapper.py
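The slides don't reproduce the contents of mapper.py. A minimal sketch of a Hadoop-streaming wordcount mapper (whitespace tokenization is an assumption) might be:

```python
#!/usr/bin/env python
# Streaming mapper sketch: read lines from stdin, emit "word<TAB>1" per token.
import sys

def map_line(line):
    # One (word, 1) pair per whitespace-separated token.
    return [(word, 1) for word in line.split()]

if __name__ == "__main__":
    for line in sys.stdin:
        for word, count in map_line(line):
            print("%s\t%d" % (word, count))
```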
Reducer.py
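Likewise, a sketch of the matching reducer. Hadoop streaming delivers mapper output to the reducer sorted by key, so equal words arrive on consecutive lines and can be summed with a running total:

```python
#!/usr/bin/env python
# Streaming reducer sketch: sum counts for consecutive identical words.
import sys

def reduce_sorted(lines):
    # Expects "word<TAB>count" lines already sorted by word.
    totals = []
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current:
            total += int(count)
        else:
            if current is not None:
                totals.append((current, total))
            current, total = word, int(count)
    if current is not None:
        totals.append((current, total))
    return totals

if __name__ == "__main__":
    for word, total in reduce_sorted(sys.stdin):
        print("%s\t%d" % (word, total))
```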
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal example example
bin/hadoop jar contrib/streaming/hadoop-0.19.2-streaming.jar \
  -file wordcount-py.example/mapper.py -mapper wordcount-py.example/mapper.py \
  -file wordcount-py.example/reducer.py -reducer wordcount-py.example/reducer.py \
  -input example -output java-output
bin/hadoop dfs -cat java-output/part-00000
bin/hadoop dfs -copyToLocal java-output/part-00000 java-output-local
Hadoop job tracker◦ http://irkm-1.soe.ucsc.edu:50030/jobtracker.jsp
Hadoop task tracker◦ http://irkm-1.soe.ucsc.edu:50060/tasktracker.jsp
Hadoop dfs checker◦ http://irkm-1.soe.ucsc.edu:50070/dfshealth.jsp