
Hadoop Workshop on EC2 : March 2015

Date post: 15-Jul-2015
Upload: imc-institute

Danairat T., 2013, [email protected] Data Hadoop – Hands On Workshop 1

Big Data using Hadoop

Hands On Workshop

March 2015

Dr. Thanachart Numnonda, Certified Java Programmer

[email protected]

Danairat T., Certified Java Programmer, TOGAF – Silver

[email protected], +66-81-559-1446

Danairat T., , [email protected]: Thanachart Numnonda, [email protected] Aug 2013Big Data Hadoop on Amazon EMR – Hands On Workshop

Hands-On: Launch a virtual server on EC2 Amazon Web Services

Hadoop Installation

Hadoop provides three installation choices:

1. Local mode: an unzip-and-run mode that gets you started right away; all parts of Hadoop run within the same JVM

2. Pseudo-distributed mode: the different parts of Hadoop run as separate Java processes, but on a single machine

3. Distributed mode: the real setup, which spans multiple machines

Virtual Server

This lab uses an EC2 virtual server to install a Hadoop server, with the following configuration:

● Ubuntu Server 14.04 LTS
● m3.medium: 1 vCPU, 3.75 GB memory
● Security group: default
● Keypair: imchadoop

Select the EC2 service and click Launch Instance

Select an Amazon Machine Image (AMI): Ubuntu Server 14.04 LTS (PV)

Choose the m3.medium instance type for the virtual server

Leave configuration details as default

Add Storage: 20 GB

Name the instance

Select an existing security group > Select Security Group Name: default

Click Launch and choose imchadoop as a key pair

Review the instance, then click Connect for instructions on connecting to it

Connect to an instance from Mac/Linux

Connect to an instance from Windows using PuTTY

Connect to the instance

Hands-On: Installing Hadoop

Installing Hadoop and Ecosystem

1. Update the system

2. Configuring SSH

3. Installing JDK 1.7

4. Download/Extract Hadoop

5. Installing Hadoop

6. Configure xml files

7. Formatting HDFS

8. Start Hadoop

9. Hadoop Web Console

10. Stop Hadoop

Notes:

Hadoop and IPv6: Apache Hadoop is not currently supported on IPv6 networks. It has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Source: http://wiki.apache.org/hadoop/HadoopIPv6

1) Update the system: sudo apt-get update

2. Configuring SSH: ssh-keygen

Enabling SSH access to your local machine

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Testing the SSH setup by connecting to your local machine

$ ssh 54.68.149.232

Type Exit

$ exit

3) Install JDK 1.7: sudo apt-get install openjdk-7-jdk

(Enter Y when prompted)

Type command > java -version

4) Download/Extract Hadoop

1) Type command > wget http://mirror.issp.co.th/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz

2) Type command > tar -xvzf hadoop-1.2.1.tar.gz

3) Type command > sudo mv hadoop-1.2.1 /usr/local/hadoop

5) Installing Hadoop

1) Type command > sudo vi $HOME/.bashrc

2) Add config as figure below

3) Type command > exec bash

4) Type command > sudo vi /usr/local/hadoop/conf/hadoop-env.sh

5) Edit the file as figure below

6) Configuring Hadoop conf/*-site.xml

1. core-site.xml (hadoop.tmp.dir, fs.default.name)

2. hdfs-site.xml (dfs.replication)

3. mapred-site.xml (mapred.job.tracker)

Configuring core-site.xml

1) Type command > sudo vi /usr/local/hadoop/conf/core-site.xml

2) Add the private IP of the server as in the figure below

(in this case a private IP is 172.31.12.11)

Configuring mapred-site.xml

1) Type command > sudo vi /usr/local/hadoop/conf/mapred-site.xml

2) Add the private IP of the JobTracker server as in the figure below

(in this case a private IP is 172.31.12.11)

Configuring hdfs-site.xml

1) Type command > sudo vi /usr/local/hadoop/conf/hdfs-site.xml

2) Add the configuration as in the figure below

7) Formatting HDFS

1) Type command > sudo mkdir /usr/local/hadoop/tmp

2) Type command > sudo chown ubuntu /usr/local/hadoop

3) Type command > sudo chown ubuntu /usr/local/hadoop/tmp

4) Type command > hadoop namenode -format

Starting Hadoop

ubuntu@ip-172-31-12-11:~$ start-all.sh

Starting up a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.

ubuntu@ip-172-31-12-11:~$ jps

11567 Jps

10766 NameNode

11099 JobTracker

11221 TaskTracker

10899 DataNode

11018 SecondaryNameNode

ubuntu@ip-172-31-12-11:~$

Checking the Java processes: you are now running Hadoop in pseudo-distributed mode

Hadoop is up!

Viewing the Hadoop HDFS using WebUI http://54.68.149.232:50070/

Stopping Hadoop

ubuntu@ip-172-31-12-11:~$ /usr/local/hadoop/bin/stop-all.sh

stopping jobtracker

localhost: stopping tasktracker

stopping namenode

localhost: stopping datanode

localhost: stopping secondarynamenode

Hands-On: Importing Data to HDFS using Hadoop Command Line

Importing Data to Hadoop

Download War and Peace Full Text

www.gutenberg.org/ebooks/2600

Download the file pg2600.txt

$ wget https://dl.dropboxusercontent.com/u/12655380/pg2600.txt

Import to Hadoop

$ hadoop fs -mkdir /input

$ hadoop fs -mkdir /output

$ hadoop fs -copyFromLocal pg2600.txt /input

Hands-On: Reviewing, Retrieving, Deleting Data from HDFS

Review file in Hadoop HDFS

List HDFS File

Read HDFS File

ubuntu@ip-172-31-12-11:~$ hadoop fs -cat /input/pg2600.txt

Retrieve HDFS File to Local File System

ubuntu@ip-172-31-12-11:~$ hadoop fs -copyToLocal /input/pg2600.txt /tmp/file.txt

Please see also http://hadoop.apache.org/docs/r1.0.4/commands_manual.html

Review file in Hadoop HDFS using WebUI

Hadoop Port Numbers

Daemon                    Default Port   Configuration Parameter in conf/*-site.xml
HDFS  Namenode            50070          dfs.http.address
HDFS  Datanodes           50075          dfs.datanode.http.address
HDFS  Secondarynamenode   50090          dfs.secondary.http.address
MR    JobTracker          50030          mapred.job.tracker.http.address
MR    Tasktrackers        50060          mapred.task.tracker.http.address

Review Content from System shell

Removing data from HDFS using Shell Command

[hdadmin@localhost detach]$ hadoop dfs -rm /input/input_test.txt

Deleted hdfs://localhost:54310/input/input_test.txt

Lecture: Understanding MapReduce Processing

[Figure: Client; Name Node and Job Tracker; three nodes, each with a Data Node and a Task Tracker; Map Reduce]

High Level Architecture of MapReduce

Before MapReduce…

● Large scale data processing was difficult!
– Managing hundreds or thousands of processors
– Managing parallelization and distribution
– I/O Scheduling
– Status and monitoring
– Fault/crash tolerance

● MapReduce provides all of these, easily!

Source: http://labs.google.com/papers/mapreduce-osdi04-slides/index-auto-0002.html

MapReduce Overview

● What is it?
– Programming model used by Google
– A combination of the Map and Reduce models with an associated implementation
– Used for processing and generating large data sets

● How does it solve our previously mentioned problems?
– MapReduce is highly scalable and can be used across many computers.
– Many small machines can be used to process jobs that normally could not be processed by a large machine.

MapReduce Framework

Source: www.bigdatauniversity.com

How Map and Reduce Work Together

● Map returns information
● Reduce accepts information
● Reduce applies a user defined function to reduce the amount of data

Map Abstraction

● Inputs a key/value pair
– Key is a reference to the input value
– Value is the data set on which to operate
● Evaluation
– Function defined by user
– Applies to every value in value input
● Might need to parse input
● Produces a new list of key/value pairs
– Can be different type from input pair

Reduce Abstraction

● Starts with intermediate Key / Value pairs
● Ends with finalized Key / Value pairs
● Starting pairs are sorted by key
● Iterator supplies the values for a given key to the Reduce function.

● Typically a function that:
– Starts with a large number of key/value pairs
  ● One key/value for each word in all files being grepped (including multiple entries for the same word)
– Ends with very few key/value pairs
  ● One key/value for each unique word across all the files with the number of instances summed into this entry
● Broken up so a given worker works with input of the same key.
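
The word-count case described in the Reduce Abstraction can be sketched in plain Java, with no Hadoop dependency. The class and method names (WordCountSketch, map, reduce, run) are illustrative assumptions, not part of the workshop code, and a TreeMap stands in for the framework's sort-by-key step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the map -> group-by-key -> reduce flow (no Hadoop involved).
class WordCountSketch {

    // Map step: one (word, 1) pair per word, duplicates included.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sums all values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Group the intermediate pairs by key; TreeMap keeps the keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        // One output entry per unique word, with its occurrences summed.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input lines, chosen only for illustration.
        System.out.println(run(List.of("deer bear river", "car car river", "deer car bear")));
    }
}
```

The map step emits many pairs (one per word occurrence) and the reduce step ends with one pair per unique word, which is the shrinkage the slide describes.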

Other Applications

● Yahoo!
– Webmap application uses Hadoop to create a database of information on all known webpages
● Facebook
– Hive data center uses Hadoop to provide business statistics to application developers and advertisers
● Rackspace
– Analyzes server log files and usage data using Hadoop

Why is this approach better?

● Creates an abstraction for dealing with complex overhead
– The computations are simple, the overhead is messy
● Removing the overhead makes programs much smaller and thus easier to use
– Less testing is required as well. The MapReduce libraries can be assumed to work properly, so only user code needs to be tested
● Division of labor is also handled by the MapReduce libraries, so programmers only need to focus on the actual computation

MapReduce Framework

map: (K1, V1) -> list(K2, V2)

reduce: (K2, list(V2)) -> list(K3, V3)
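
These two signatures can be written down as generic Java interfaces. The sketch below is a single-process illustration under assumed names (MapFn, ReduceFn, run, wordsPerLength are hypothetical, not Hadoop classes). The example job deliberately uses different types at each stage to show that K1/V1, K2/V2 and K3/V3 are independent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the two signatures as generic Java interfaces (hypothetical names).
class MapReduceTypes {

    // map: (K1, V1) -> list(K2, V2)
    interface MapFn<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    // reduce: (K2, list(V2)) -> list(K3, V3)
    interface ReduceFn<K2, V2, K3, V3> {
        List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }

    // Minimal driver: map every input pair, group by K2 (sorted), then reduce each group.
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapFn,
            ReduceFn<K2, V2, K3, V3> reduceFn) {
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> out : mapFn.map(in.getKey(), in.getValue())) {
                grouped.computeIfAbsent(out.getKey(), k -> new ArrayList<>()).add(out.getValue());
            }
        }
        List<Map.Entry<K3, V3>> result = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> group : grouped.entrySet()) {
            result.addAll(reduceFn.reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    // Example job: (line number, line) -> (word length, word) -> (word length, word count).
    static List<Map.Entry<Integer, Integer>> wordsPerLength(List<String> lines) {
        List<Map.Entry<Integer, String>> input = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            input.add(Map.entry(i, lines.get(i)));
        }
        MapFn<Integer, String, Integer, String> mapFn = (lineNo, line) -> {
            List<Map.Entry<Integer, String>> out = new ArrayList<>();
            for (String w : line.trim().split("\\s+")) {
                if (!w.isEmpty()) {
                    out.add(Map.entry(w.length(), w));
                }
            }
            return out;
        };
        ReduceFn<Integer, String, Integer, Integer> reduceFn =
                (length, words) -> List.of(Map.entry(length, words.size()));
        return run(input, mapFn, reduceFn);
    }

    public static void main(String[] args) {
        System.out.println(wordsPerLength(List.of("deer bear river", "car car river")));
    }
}
```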

How does the MapReduce work?

[Figure: MapReduce data flow. Labels: InputSplit; RecordReader; "Output in a list of (Key, Value) in the intermediate file"; Partitioning; Sorting; "Output in a list of (Key, List of Values) in the intermediate file"; RecordWriter]

[Figure: the same data flow with an added Combining step and sample values: Car, 2; Car, 2; Bear, {1,1}; Car, {2,1}; River, {1,1}; Deer, {1,1}]
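
The sample values on this slide (Car, 2 feeding into Car, {2,1}) are what a combining step would produce: each map task sums its own counts locally before the shuffle. A minimal plain-Java sketch follows; the class name, method names and the three input lines are assumptions, chosen so the result reproduces the slide's values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of a combiner: a local, per-split reduce that runs before the shuffle.
class CombinerSketch {

    // Map + combine for one input split: (word, 1) pairs are summed locally.
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                local.merge(word, 1, Integer::sum);
            }
        }
        return local;
    }

    // Shuffle: gathers the combined values from every split under their key.
    static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> combinedOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : combinedOutputs) {
            output.forEach((k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        return grouped;
    }

    static Map<String, List<Integer>> run(List<String> splits) {
        List<Map<String, Integer>> combined = new ArrayList<>();
        for (String split : splits) {
            combined.add(mapAndCombine(split));
        }
        return shuffle(combined);
    }

    public static void main(String[] args) {
        // Assumed input: one line per split.
        System.out.println(run(List.of("Deer Bear River", "Car Car River", "Deer Car Bear")));
    }
}
```

Without the combiner the reducer for Car would receive {1,1,1}; with it, the second split ships a single (Car, 2) record, so less data crosses the network.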

MapReduce Processing – The Dataflow

1. InputFormat, InputSplits, RecordReader

2. Mapper - your focus is here

3. Partition, Shuffle & Sort

4. Reducer - your focus is here

5. OutputFormat, RecordWriter

InputFormat

InputFormat               Description                                        Key                                        Value
TextInputFormat           Default format; reads lines of text files          The byte offset of the line                The line contents
KeyValueTextInputFormat   Parses lines into key, val pairs                   Everything up to the first tab character   The remainder of the line
SequenceFileInputFormat   A Hadoop-specific high-performance binary format   user-defined                               user-defined

InputSplit

An InputSplit describes a unit of work that comprises a single map task.

InputSplit presents a byte-oriented view of the input.

You can influence the split size by setting the mapred.min.split.size parameter in core-site.xml, or by overriding the parameter in the JobConf object used to submit a particular MapReduce job.

RecordReader

RecordReader reads <key, value> pairs from an InputSplit.

Typically the RecordReader converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view to the Mapper.
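
To make the byte-oriented versus record-oriented distinction concrete, here is a plain-Java sketch of a line reader over one split. It is not Hadoop's RecordReader; the class name, method name and the boundary rule (a line belongs to the split that contains its first byte) are simplifying assumptions. Keys are byte offsets and values are line contents, as in the TextInputFormat row of the InputFormat table.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: turn a byte range (the split) into (byte offset, line) records.
class LineRecordSketch {

    // Returns the records for the split [start, start + length).
    // Assumed rule: a line belongs to the split that contains its first byte,
    // so the reader may read past the end of its split to finish the last line.
    static List<Map.Entry<Long, String>> readRecords(byte[] file, int start, int length) {
        int end = Math.min(start + length, file.length);
        int pos = start;
        // If the split starts mid-line, skip ahead: that line belongs to the previous split.
        if (pos > 0) {
            while (pos < file.length && file[pos - 1] != '\n') {
                pos++;
            }
        }
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        while (pos < end) {
            int lineEnd = pos;
            while (lineEnd < file.length && file[lineEnd] != '\n') {
                lineEnd++;
            }
            String line = new String(file, pos, lineEnd - pos, StandardCharsets.UTF_8);
            records.add(Map.entry((long) pos, line));
            pos = lineEnd + 1; // step over the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = "deer bear\ncar river\nbear\n".getBytes(StandardCharsets.UTF_8);
        // Two splits that cut the second line in the middle.
        System.out.println(readRecords(file, 0, 12));
        System.out.println(readRecords(file, 12, file.length - 12));
    }
}
```

Although the split boundary at byte 12 falls inside "car river", that line is read in full by the first split and skipped by the second, so every line reaches exactly one mapper.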

Mapper

Mapper: The Mapper applies the user-defined logic to each input (key, value) pair and emits (key, value) pair(s), which are forwarded to the Reducers.

Partition, Shuffle & Sort

After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers.

The Partitioner controls how map outputs are assigned to reduce tasks. The total number of partitions is the same as the number of reduce tasks for the job.

The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.

This process of moving map outputs to the reducers is known as shuffling.
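
The partitioning rule can be sketched in a few lines of plain Java. The hash-modulo formula below mirrors the scheme of Hadoop's default HashPartitioner; everything else (class name, method names, sample data) is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of hash partitioning followed by a per-partition sort by key.
class PartitionSketch {

    // The same key always maps to the same reduce task.
    static int partitionFor(String key, int numReduceTasks) {
        // Masking the sign bit keeps the result non-negative for negative hash codes.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Distributes map output into one bucket per reduce task; TreeMap sorts each bucket by key.
    static List<Map<String, List<Integer>>> partition(
            List<Map.Entry<String, Integer>> mapOutput, int numReduceTasks) {
        List<Map<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : mapOutput) {
            partitions.get(partitionFor(pair.getKey(), numReduceTasks))
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Hypothetical map output.
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("Deer", 1), Map.entry("Bear", 1), Map.entry("River", 1),
                Map.entry("Car", 1), Map.entry("Car", 1), Map.entry("River", 1));
        List<Map<String, List<Integer>>> partitions = partition(mapOutput, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("reduce task " + i + ": " + partitions.get(i));
        }
    }
}
```

Because the partition index depends only on the key, all values for a given key end up at one reducer, whichever map task emitted them.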

Reducer

This is an instance of user-provided code that reads each key, and the iterator of values for that key, in the partition assigned to it. The OutputCollector object in the Reducer phase has a method named collect(), which collects a (key, value) output.

OutputFormat, RecordWriter

OutputFormat governs the format written through the OutputCollector, and RecordWriter writes the output into HDFS.

OutputFormat               Description
TextOutputFormat           Default; writes lines in "key \t value" form
SequenceFileOutputFormat   Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat           Generates no output files

Hands-On: Writing your own MapReduce Program

WordCount (Hello World in Hadoop)

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    // Mapper: emits (word, 1) for every token of each input line.
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: sums the counts collected for each word.
    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Hands-On: Packaging Map Reduce and Deploying to Hadoop Runtime Environment

Packaging Map Reduce Program

Usage

Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile WordCount.java and create a jar:

$ wget https://dl.dropboxusercontent.com/u/12655380/WordCount.java

$ mkdir hduser
$ javac -classpath /usr/local/hadoop/hadoop-core-1.2.1.jar -d hduser WordCount.java
$ jar -cvf ./wordcount.jar -C hduser/ .

$ hadoop jar ./wordcount.jar org.myorg.WordCount /input/* /output/wordcount_output_dir

Output:

…….

$ hadoop fs -cat /output/wordcount_output_dir/part-00000

Reviewing MapReduce Output Result

Hands-On: Writing Map/Reduce Program on Eclipse

Starting Eclipse

Create a Java Project

Let's name it HadoopWordCount

Add dependencies to the project

● Add the following two JARs to your build path: hadoop-common.jar and hadoop-mapreduce-client-core.jar. Both can be found at /usr/lib/hadoop/client
● Perform the following steps:
– Add a folder named lib to the project
– Copy the mentioned JARs into this folder
– Right-click on the project name >> select Build Path >> then Configure Build Path
– Click on Add Jars, select these two JARs from the lib folder

Writing a source code

● Right-click the project, then select New >> Package
● Name the package org.myorg
● Right-click org.myorg, then select New >> Class
● Name the class WordCount
● Write the source code as shown in the previous slides

Building a Jar file

● Right-click the project, then select Export
● Select Java and then JAR file
● Provide the JAR name, as wordcount.jar
● Leave the JAR package options as default
● In the JAR Manifest Specification section, at the bottom, specify the Main class
● In this case, select WordCount
● Click on Finish
● The JAR file will be built and will be located at cloudera/workspace

Note: you may need to resize the dialog font by selecting Window >> Preferences >> Appearance >> Colors and Fonts

Lecture: Understanding Hive

Introduction: A Petabyte Scale Data Warehouse Using Hadoop

Hive was developed by Facebook and is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called HiveQL, which is based on SQL.

What Hive is NOT

Hive is not designed for online transaction processing and does not offer real-time queries and row level updates. It is best used for batch jobs over large sets of immutable data (like web logs, etc.).

Hive Metastore

● Store Hive metadata

● Configurations

– Embedded: in-process metastore, in-process database

– Local: in-process metastore, out-of-process database

– Remote: out-of-process metastore, out-of-process database

Hive Schema-On-Read

● Faster loads into the database (simply copy or move)

● Slower queries

● Flexibility – multiple schemas for the same data
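The schema-on-read idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample lines, the two schemas, and the `read_with_schema` helper are hypothetical and are not part of Hive. It shows that "loading" is just a copy of raw text and that parsing cost is paid at query time.

```python
# Schema-on-read sketch: raw lines are stored untouched; a schema is
# applied only when the data is read. All names here are illustrative.

RAW_LINES = ["1,USA", "66,Thailand", "oops"]  # "loading" is just a copy

def read_with_schema(lines, schema, delimiter=","):
    """Parse each line at read time; unparseable fields become None."""
    rows = []
    for line in lines:
        fields = line.split(delimiter)
        row = {}
        for i, (name, cast) in enumerate(schema):
            try:
                row[name] = cast(fields[i])
            except (IndexError, ValueError):
                row[name] = None  # assumption: modeled on NULL for bad fields
        rows.append(row)
    return rows

# Two different schemas over the same stored text.
schema_a = [("id", int), ("country", str)]
schema_b = [("raw_id", str)]

rows_a = read_with_schema(RAW_LINES, schema_a)
rows_b = read_with_schema(RAW_LINES, schema_b)
```

Because nothing is validated at load time, the malformed third line only shows up as empty fields when it is read, which is the trade-off behind "faster loads, slower queries".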


HiveQL

● Hive Query Language
● SQL dialect
● Limited or no support, depending on the Hive version, for:
– UPDATE, DELETE
– Transactions
– Indexes
– HAVING clause in SELECT
– Updatable or materialized views
– Stored procedures


Hive Tables

● Managed: CREATE TABLE
– LOAD: file is moved into Hive's data warehouse directory
– DROP: both data and metadata are deleted
● External: CREATE EXTERNAL TABLE
– LOAD: no file is moved
– DROP: only metadata is deleted
– Use when sharing data between Hive and other Hadoop applications, or when you want to use multiple schemas on the same data
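The difference in DROP behavior can be modeled with a toy metastore in Python. This is a sketch under assumptions, not Hive's implementation: the `ToyMetastore` class, the directory names, and the sample record are all hypothetical, and a local temporary directory stands in for HDFS.

```python
import os
import shutil
import tempfile

class ToyMetastore:
    """Toy model: table name -> (data directory, is_external)."""
    def __init__(self, warehouse_dir):
        self.warehouse_dir = warehouse_dir
        self.tables = {}

    def create_table(self, name, external_location=None):
        if external_location is None:
            # Managed table: data lives under the warehouse directory.
            location = os.path.join(self.warehouse_dir, name)
            os.makedirs(location, exist_ok=True)
            self.tables[name] = (location, False)
        else:
            # External table: only the location is recorded.
            self.tables[name] = (external_location, True)

    def drop_table(self, name):
        location, is_external = self.tables.pop(name)
        if not is_external:
            shutil.rmtree(location)  # managed: data removed with metadata
        # external: only the metadata entry is removed

root = tempfile.mkdtemp()
warehouse = os.path.join(root, "warehouse")
shared = os.path.join(root, "shared_users")
os.makedirs(shared)
with open(os.path.join(shared, "u.user"), "w") as f:
    f.write("1|24|M|technician|85711\n")  # illustrative sample record

ms = ToyMetastore(warehouse)
ms.create_table("managed_tbl")
ms.create_table("users", external_location=shared)
managed_dir = ms.tables["managed_tbl"][0]

ms.drop_table("managed_tbl")
ms.drop_table("users")

managed_data_survives = os.path.exists(managed_dir)
external_data_survives = os.path.exists(os.path.join(shared, "u.user"))
shutil.rmtree(root)  # clean up the temporary directory
```

After both drops the metadata is gone for both tables, but only the external table's file is still on disk, which is why external tables suit data shared with other applications.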


Running Hive

Hive Shell

● Interactive

hive

● Script

hive -f myscript

● Inline

hive -e 'SELECT * FROM mytable'

Hive.apache.org


System Architecture and Components

• Metastore: Stores the metadata.

• Query compiler and execution engine: Converts SQL queries into a sequence of map/reduce jobs that are then executed on Hadoop.

• SerDe and ObjectInspectors: Programmable interfaces and implementations of common data formats and types. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record and translates it into a Java object that Hive can manipulate. The Serializer takes a Java object that Hive has been working with and turns it into something that Hive can write to HDFS or another supported system.

• UDF and UDAF: Programmable interfaces and implementations for user-defined functions (scalar and aggregate functions).

• Clients: Command-line client similar to the MySQL command line.
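The SerDe round trip described above can be illustrated with a small Python class. This is a hedged sketch only: `DelimitedSerDe` and its methods are hypothetical names, and Hive's real SerDe is a Java interface that works with ObjectInspectors rather than dictionaries.

```python
class DelimitedSerDe:
    """Toy SerDe for delimited text; not Hive's actual Java interface."""
    def __init__(self, columns, delimiter=","):
        self.columns = columns      # list of (name, type) pairs
        self.delimiter = delimiter

    def deserialize(self, line):
        """Text record -> dict that a query engine could manipulate."""
        fields = line.rstrip("\n").split(self.delimiter)
        return {name: cast(value)
                for (name, cast), value in zip(self.columns, fields)}

    def serialize(self, row):
        """Dict -> text record suitable for writing back to storage."""
        return self.delimiter.join(str(row[name]) for name, _ in self.columns)

serde = DelimitedSerDe([("id", int), ("country", str)])
row = serde.deserialize("66,Thailand\n")
line = serde.serialize(row)
```

Swapping the class for one that reads JSON or a binary format would leave the query side unchanged, which is the point of keeping the format behind a SerDe.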

hive.apache.org


Architecture Overview

[Figure: Hive architecture overview. Labeled components: Hive CLI, Queries, Browsing, DDL, Mgmt. Web UI, Thrift API, Hive QL, Parser, Planner, Execution, MetaStore, SerDe (Thrift, Jute, JSON, ..), Map Reduce, HDFS. Source: hive.apache.org]


Sample HiveQL

The query compiler uses the information stored in the metastore to convert SQL queries into a sequence of map/reduce jobs, e.g. the following queries:

SELECT * FROM t where t.c = 'xyz'

SELECT t1.c2 FROM t1 JOIN t2 ON (t1.c1 = t2.c1)

SELECT t1.c1, count(1) from t1 group by t1.c1
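To make the compilation idea concrete, the GROUP BY query above can be pictured as one map phase and one reduce phase. The Python below is a conceptual sketch, assuming a hypothetical in-memory table `t1`; it does not show the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical rows of table t1; only column c1 matters for this query.
t1 = [{"c1": "a"}, {"c1": "b"}, {"c1": "a"}, {"c1": "c"}, {"c1": "a"}]

def map_phase(rows):
    """Emit (group key, 1) for every input row."""
    for row in rows:
        yield row["c1"], 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the ones for each key to get count(1) per group."""
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle(map_phase(t1)))
```

The grouping column becomes the map output key, so all rows with the same `c1` reach the same reducer.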

Hive.apache.org


Hands-On: Creating Table and Retrieving Data using Hive


Hive Hands-On Labs

1. Installing Hive

2. Configuring / Starting Hive

3. Creating Hive Table

4. Reviewing Hive Table in HDFS

5. Alter and Drop Hive Table

6. Preparing Dataset

7. Loading Data to Hive Table

8. Querying Data from Hive Table

9. Reviewing Hive Table Content from HDFS Command and WebUI


1. Installing Hive

# wget http://apache.mesi.com.ar/hive/hive-1.1.0/apache-hive-1.1.0-bin.tar.gz

# tar -xvzf apache-hive-1.1.0-bin.tar.gz

# sudo mv apache-hive-1.1.0-bin /usr/local

# rm apache-hive-1.1.0-bin.tar.gz

Install Hive binary file


1. Installing Hive

Edit $HOME/.bashrc

# sudo vi $HOME/.bashrc


2. Configuring Hive: Creating HDFS Directories for Hive

Create the HDFS /tmp/hive and /user/hive/warehouse directories

[hdadmin@localhost ~]$ hadoop fs -mkdir /tmp/hive

[hdadmin@localhost ~]$ hadoop fs -mkdir /user/hive/warehouse

[hdadmin@localhost ~]$ hadoop fs -chmod 777 /tmp/hive

[hdadmin@localhost ~]$ hadoop fs -chmod 777 /user/hive/warehouse


2. Start Hive

Starting Hive

hive> quit;

Quit from Hive


3. Creating Hive Table

hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

OK

Time taken: 4.069 seconds

hive (default)> show tables;

OK

test_tbl

Time taken: 0.138 seconds

hive (default)> describe test_tbl;

OK

id int

country string

Time taken: 0.147 seconds

hive (default)>

See also: https://cwiki.apache.org/Hive/languagemanual-ddl.html


4. Reviewing Hive Table in HDFS

[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse

Found 1 items

drwxr-xr-x - hdadmin supergroup 0 2013-03-17 17:51 /user/hive/warehouse/test_tbl

[hdadmin@localhost hdadmin]$

Review Hive Table from HDFS WebUI


5. Alter and Drop Hive Table

hive (default)> alter table test_tbl add columns (remarks STRING);

hive (default)> describe test_tbl;

OK

id int

country string

remarks string

Time taken: 0.077 seconds

hive (default)> drop table test_tbl;

OK

Time taken: 0.9 seconds

See also: https://cwiki.apache.org/Hive/adminmanual-metastoreadmin.html


6. Preparing Large Dataset

http://grouplens.org/datasets/movielens/


MovieLens Dataset

1) Type command > wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

2) Type command > sudo apt-get install unzip

3) Type command > unzip ml-100k.zip

4) Type command > more ml-100k/u.user


6. Loading Data to Hive Table

hive (default)> exit;

ubuntu@ip-172-31-12-11:~/ml-100k$ hadoop fs -put u.user /dataset/movielens/users

Loading data to Hive table

$ hive

hive (default)> CREATE EXTERNAL TABLE users (userid INT, age INT,

gender STRING, occupation STRING, zipcode STRING) ROW FORMAT

DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE

LOCATION '/dataset/movielens/users';

Creating Hive table


7. Querying Data from Hive Table


8. Loading Data to test_tbl Table

$ hive

hive (default)> CREATE TABLE test_tbl(id INT, country STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;

Creating Hive table

hive (default)> LOAD DATA LOCAL INPATH '/tmp/test_tbl_data.csv' INTO TABLE test_tbl;

Copying data from file:/tmp/test_tbl_data.csv

Copying file: file:/tmp/test_tbl_data.csv

Loading data to table default.test_tbl

OK

Time taken: 0.241 seconds

hive (default)>

Loading data to Hive table


9. Reviewing Hive Table Content from HDFS Command and WebUI

[hdadmin@localhost hdadmin]$ hadoop fs -ls /user/hive/warehouse/test_tbl

Found 1 items

-rw-r--r-- 1 hdadmin supergroup 59 2013-03-17 18:08 /user/hive/warehouse/test_tbl/test_tbl_data.csv

[hdadmin@localhost hdadmin]$

[hdadmin@localhost hdadmin]$ hadoop fs -cat /user/hive/warehouse/test_tbl/test_tbl_data.csv

1,USA

62,Indonesia

63,Philippines

65,Singapore

66,Thailand

[hdadmin@localhost hdadmin]$


Loading Data to Hive Table

$ hive

hive (default)> CREATE TABLE products

(

prod_name STRING,

description STRING,

category STRING,

qty_on_hand INT,

prod_num STRING,

packaged_with ARRAY<STRING>

)

row format delimited

fields terminated by ','

collection items terminated by ':'

stored as textfile;

Creating Hive table
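How the two delimiters in this table definition split one text row can be sketched in Python. The `parse_product` helper and the sample row are hypothetical and only mirror the delimiters declared above (',' between fields, ':' between array items).

```python
def parse_product(line):
    """Split one text row the way the table's delimiters describe it."""
    fields = line.rstrip("\n").split(",")          # fields terminated by ','
    name, description, category, qty, num, packaged = fields
    return {
        "prod_name": name,
        "description": description,
        "category": category,
        "qty_on_hand": int(qty),
        "prod_num": num,
        # collection items terminated by ':'
        "packaged_with": packaged.split(":") if packaged else [],
    }

# Hypothetical sample row, not taken from the workshop data.
sample = "Widget,Small widget,tools,12,P-001,manual:battery:case"
product = parse_product(sample)
```

The last field becomes a list, which corresponds to the ARRAY<STRING> column packaged_with.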


Lecture: Understanding Pig


Introduction: A high-level platform for creating MapReduce programs using Hadoop

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.


Pig Components

● Two Components
– Language (Pig Latin)
– Compiler

● Two Execution Environments
– Local: pig -x local
– Distributed: pig -x mapreduce


Running Pig

● Script

pig myscript

● Command line (Grunt)

pig

● Embedded

Writing a Java program


Pig Latin


Pig Execution Stages

Source: Introduction to Apache Hadoop-Pig: Prashant Kommireddi


Why Pig?

● Makes writing Hadoop jobs easier
– 5% of the code, 5% of the time
– You don't need to be a programmer to write Pig scripts

● Provides major functionality required for Data Warehouse and Analytics
– Load, Filter, Join, Group By, Order, Transform

● Users can write custom UDFs (User Defined Functions)

Source: Introduction to Apache Hadoop-Pig: Prashant Kommireddi


Pig vs. Hive


Hands-On: Running a Pig script


Installing Pig

# wget http://archive.apache.org/dist/hadoop/pig/stable/pig-0.7.0.tar.gz

# tar -xvzf pig-0.7.0.tar.gz

# sudo mv pig-0.7.0 /usr/local/

# rm pig-0.7.0.tar.gz

Install Pig binary file


Installing Pig

Edit $HOME/.bashrc

# sudo vi $HOME/.bashrc


Starting Pig Command Line


countryFilter.pig

A = load 'hdi-data.csv' using PigStorage(',') AS (id:int, country:chararray, hdi:float, lifeex:int, mysch:int, eysch:int, gni:int);

B = FILTER A BY gni > 2000;

C = ORDER B BY gni;

dump C;
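For readers new to Pig Latin, the data flow in countryFilter.pig (load, filter, order) can be mirrored in plain Python. This is a rough equivalent under assumptions: the three sample rows are hypothetical stand-ins for hdi-data.csv, and Pig would run these steps as parallel jobs rather than in memory.

```python
import csv
import io

# Hypothetical rows in the column order declared in countryFilter.pig:
# id, country, hdi, lifeex, mysch, eysch, gni
SAMPLE_CSV = """1,Alpha,0.9,81,12,17,47000
2,Beta,0.5,60,5,9,1500
3,Gamma,0.7,72,8,13,9000
"""

def load(text):
    """LOAD ... USING PigStorage(',') AS (...): parse and type each field."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        rows.append({
            "id": int(rec[0]), "country": rec[1], "hdi": float(rec[2]),
            "lifeex": int(rec[3]), "mysch": int(rec[4]),
            "eysch": int(rec[5]), "gni": int(rec[6]),
        })
    return rows

a = load(SAMPLE_CSV)
b = [row for row in a if row["gni"] > 2000]   # B = FILTER A BY gni > 2000
c = sorted(b, key=lambda row: row["gni"])     # C = ORDER B BY gni
countries = [row["country"] for row in c]
```

Each Pig relation (A, B, C) maps to one intermediate list here, which is how a Pig script reads: as a sequence of named transformations.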

#Preparing Data

ubuntu@ip-172-31-12-11:~$ wget https://www.dropbox.com/s/pp168a6oiwqkxyu/hdi-data.csv

#Edit Your Script

ubuntu@ip-172-31-12-11:~$ vi countryFilter.pig

Writing a Pig Script


ubuntu@ip-172-31-12-11:~$ pig -x local

grunt> run countryFilter.pig

Running a Pig Script


Lecture: Understanding Sqoop


Introduction

Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:

• Imports individual tables or entire databases to files in HDFS

• Generates Java classes to allow you to interact with your imported data

• Provides the ability to import from SQL databases straight into your Hive data warehouse

See also: http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html


Architecture Overview


Hands-On: Loading Data from DBMS to Hadoop HDFS


Sqoop Hands-On Labs

1. Loading Data into MySQL DB

2. Installing Sqoop

3. Configuring Sqoop

4. Installing DB driver for Sqoop

5. Importing data from MySQL to Hive Table

6. Reviewing data from Hive Table

7. Reviewing HDFS Database Table files


1. MySQL RDS Server on AWS

An RDS server is running on AWS with the following configuration:

> database: imc_db

> username: admin

> password: imcinstitute

> addr: imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com

[This address may change]


1. country_tbl data

Testing data query from MySQL DB

Table name > country_tbl


2. Installing Sqoop

# wget http://apache.osuosl.org/sqoop/1.4.5/sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

# tar -xvzf sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz

# sudo mv sqoop-1.4.5.bin__hadoop-1.0.0 /usr/local/

# rm sqoop-1.4.5.bin__hadoop-1.0.0.tar.gz


Installing Sqoop

Edit $HOME/.bashrc

# sudo vi $HOME/.bashrc


3. Configuring Sqoop

ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/conf/

ubuntu@ip-172-31-12-11:~$ vi sqoop-env.sh


4. Installing DB driver for Sqoop

ubuntu@ip-172-31-12-11:~$ cd /usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib/

ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ wget https://www.dropbox.com/s/6zrp5nerrwfixcj/mysql-connector-java-5.1.23-bin.jar

ubuntu@ip-172-31-12-11:/usr/local/sqoop-1.4.5.bin__hadoop-1.0.0/lib$ exit


5. Importing data from MySQL to Hive Table

[hdadmin@localhost ~]$ sqoop import --connect jdbc:mysql://imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com/imc_db --username admin -P --table country_tbl --hive-import --hive-table country -m 1

Warning: /usr/lib/hbase does not exist! HBase imports will fail.

Please set $HBASE_HOME to the root of your HBase installation.

Warning: $HADOOP_HOME is deprecated.

Enter password: <enter here>
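Since the import command is long and easy to mistype, one option is to assemble it from its parts. The Python sketch below is an assumption-laden convenience, not part of Sqoop: `build_sqoop_import` is a hypothetical helper, and it uses only the flags that appear in the command above.

```python
import shlex

def build_sqoop_import(host, database, username, table, hive_table, mappers=1):
    """Return the argument list for the sqoop import shown on the slide."""
    return [
        "sqoop", "import",
        "--connect", f"jdbc:mysql://{host}/{database}",
        "--username", username,
        "-P",                      # prompt for the password interactively
        "--table", table,
        "--hive-import",
        "--hive-table", hive_table,
        "-m", str(mappers),        # number of parallel map tasks
    ]

args = build_sqoop_import(
    host="imcinstitutedb.cmw65obdqfnx.us-west-2.rds.amazonaws.com",
    database="imc_db", username="admin",
    table="country_tbl", hive_table="country",
)
command_line = shlex.join(args)   # for display; requires Python 3.8+
```

Passing `args` to `subprocess.run` would execute the import, assuming sqoop is on the PATH; keeping `-P` means the password is still typed at the prompt rather than stored in the script.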


6. Reviewing data from Hive Table


7. Reviewing HDFS Database Table files

Open a web browser at http://54.68.149.232:50070, then navigate to /user/hive/warehouse


Lecture: Understanding HBase


Introduction: An open source, non-relational, distributed database

HBase is an open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.


HBase Features

● Hadoop database modelled after Google's Bigtable
● Column-oriented data store, known as the Hadoop Database
● Supports random real-time CRUD operations (unlike HDFS)
● NoSQL database
● Open source, written in Java
● Runs on a cluster of commodity hardware


When to use HBase?

● When you need high-volume data to be stored
● Unstructured data
● Sparse data
● Column-oriented data
● Versioned data (same data template, captured at various times; time-lapse data)
● When you need high scalability


Which one to use?

● HDFS
– Append-only dataset (no random write)
– Read the whole dataset (no random read)

● HBase
– Need random write and/or read
– Has thousands of operations per second on TB+ of data

● RDBMS
– Data fits on one big node
– Need full transaction support
– Need real-time query capabilities


HBase Components

● Region
– Where rows of a table are stored

● Region Server
– Hosts the tables

● Master
– Coordinates the Region Servers

● ZooKeeper
● HDFS
● API
– The Java Client API


HBase Architecture


HBase Shell Commands


Hands-On: Running HBase


Installing HBase

# wget http://apache.cs.utah.edu/hbase/hbase-1.0.0/hbase-1.0.0-bin.tar.gz

# tar -xvzf hbase-1.0.0-bin.tar.gz

# sudo mv hbase-1.0.0 /usr/local/

# rm hbase-1.0.0-bin.tar.gz


Installing HBase

Edit $HOME/.bashrc

# sudo vi $HOME/.bashrc


Starting HBase shell

ubuntu@ip-172-31-12-11:~$ start-hbase.sh

starting master, logging to /usr/local/hbase-0.94.10/logs/hbase-hdadmin-master-localhost.localdomain.out

ubuntu@ip-172-31-12-11:~$ jps

3064 TaskTracker

2836 SecondaryNameNode

2588 NameNode

3513 Jps

3327 HMaster

2938 JobTracker

2707 DataNode

ubuntu@ip-172-31-12-11:~$ hbase shell

HBase Shell; enter 'help<RETURN>' for list of supported commands.

Type "exit<RETURN>" to leave the HBase Shell

Version 0.94.10, r1504995, Fri Jul 19 20:24:16 UTC 2013

hbase(main):001:0>


Create a table and insert data in HBase

hbase(main):009:0> create 'test', 'cf'

0 row(s) in 1.0830 seconds

hbase(main):010:0> put 'test', 'row1', 'cf:a', 'val1'

0 row(s) in 0.0750 seconds

hbase(main):011:0> scan 'test'

ROW COLUMN+CELL

row1 column=cf:a, timestamp=1375363287644, value=val1

1 row(s) in 0.0640 seconds

hbase(main):002:0> get 'test', 'row1'

COLUMN CELL

cf:a timestamp=1375363287644, value=val1

1 row(s) in 0.0370 seconds
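The shell session above maps onto a simple nested structure: row key, then column (family:qualifier), then timestamped versions. The Python below is a toy model under assumptions; `ToyHBaseTable` is a hypothetical class that imitates the create/put/get/scan commands in memory and says nothing about how HBase stores data on disk.

```python
import time

class ToyHBaseTable:
    """Toy model of HBase's layout: row -> 'family:qualifier' -> versions."""
    def __init__(self, name, families):
        self.name = name
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, column, value, timestamp=None):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = timestamp if timestamp is not None else int(time.time() * 1000)
        versions = self.rows.setdefault(row_key, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(reverse=True)        # newest version first

    def get(self, row_key):
        """Return the newest (timestamp, value) for each column of a row."""
        return {col: versions[0]
                for col, versions in self.rows.get(row_key, {}).items()}

    def scan(self):
        """Yield rows in sorted row-key order."""
        for row_key in sorted(self.rows):
            yield row_key, self.get(row_key)

test = ToyHBaseTable("test", ["cf"])                    # create 'test', 'cf'
test.put("row1", "cf:a", "val1", timestamp=1375363287644)
test.put("row1", "cf:a", "val2", timestamp=1375363287700)
latest = test.get("row1")
all_rows = list(test.scan())
```

Only cells that were actually written take up space, which is the sense in which the model suits sparse data, and the older value stays available as an earlier version of the same cell.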


Recommendation to Further Study


Thank you

www.imcinstitute.com

www.facebook.com/imcinstitute

