Date posted: 22-Feb-2017
Uploaded by: purna-chander
Hadoop Tech Talk
By : Purna Chander
Agenda
● Big Data
● Traditional System Workflow
● Hadoop
● Hadoop Tools
● Hadoop Testing
Big Data
What is Big Data?
Data that exceeds the storage capacity and the processing power of traditional systems.
VOLUME VELOCITY VARIETY
Assume we live in a world of 100% data: roughly 90% of it was generated in the last 3-4 years, and only 10% was generated in all the years before, when these systems were first introduced.
Volume
1. Transaction-based data stored in relational databases over the years.
2. Unstructured data generated on social media.
3. Sensor and machine-to-machine data.
4. Facebook alone generates about 600 terabytes of data every day.
Velocity
1. Reacting fast enough to incoming data is one of the challenges.
2. Computation is process bound.
3. Unstructured data must be processed at speed.
Variety
1. Data arrives in different formats.
2. Structured data resides in traditional RDBMSs or flat files.
3. Unstructured data includes text documents, videos, email, audio, log files, etc.
4. Managing, merging, and governing different varieties of data is the biggest challenge.
5. The data must be connected and correlated to extract useful information from it.
Challenges reading from a single disk
Reading data from a warehouse:
200 GB FB (India)
200 GB FB (US)
200 GB FB (Japan)
200 GB FB (UK)
200 GB FB (China)
Data warehouse: 1 TB and more
Reading just 100 GB over a 100 Mbps link takes (100 × 1000 × 8) / 100 = 8000 seconds, about 2.2 hours.
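The arithmetic above can be checked with a small helper (the function name and unit conversions are my own, for illustration):

```python
def read_time_seconds(size_gb, bandwidth_mbps):
    """Seconds to read size_gb gigabytes over a bandwidth_mbps (megabits/s) link."""
    megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    return megabits / bandwidth_mbps

print(read_time_seconds(100, 100))         # 8000.0 seconds
print(read_time_seconds(100, 100) / 3600)  # ~2.2 hours
```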
Traditional workflow
Sources: OLTP (RDBMS), social networking, logs (XML / .txt, e.g. Apache logs)
→ Data warehouse (expensive storage: spread across systems, not easily accessible, limited capacity)
→ Reports
Hadoop Workflow
Sources: OLTP (RDBMS), social networking, logs (XML / .txt, e.g. Apache logs)
→ Hadoop
→ Data warehouse
→ Reports
Vertical Scaling
Increasing the resources (RAM, processor, hard disk) on a single machine is vertical scaling.
1990 - 512 MB RAM, 2-core processor.
2000 - 4/8 GB RAM, 8-core processor.
Production systems - 64 GB RAM / 16-core processor (cost and maintenance become the limiting factors).
Horizontal Scaling (Distributed Computing)
Instead of one large machine, the data warehouse workload is spread across many commodity nodes (e.g. four 8 GB machines), each holding a share of the data; capacity grows by adding nodes. This is the approach HADOOP takes.
Hadoop
Hadoop is an Apache software framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model.
It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
Hadoop Core Features
HDFS - stores data on a cluster of machines.
MapReduce - processes the data stored in HDFS.
Data Replication
Default replication factor: 3. A file is split into blocks (Block 1, Block 2, Block 3), and each block is stored on three machines spread across racks (Rack 1: computers 1-6; Rack 2: computers 7-12), so losing any one machine or rack does not lose data.
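A toy sketch of the idea (round-robin placement is my simplifying assumption here; real HDFS placement is rack-aware and more sophisticated):

```python
def place_blocks(blocks, machines, replication=3):
    """Assign each block to `replication` distinct machines (toy round-robin model)."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [machines[(i + r) % len(machines)] for r in range(replication)]
    return placement

machines = ['computer%d' % n for n in range(1, 13)]  # computers 1-12 across two racks
layout = place_blocks(['block1', 'block2', 'block3'], machines)
for block, copies in sorted(layout.items()):
    print(block, '->', copies)
```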
Map Reduce
What is MapReduce?
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
Example: take a small and a big text file and count the occurrences of each word.
MapReduce programs can be written in Perl, Python, Java, or Ruby.
For non-Java languages, Hadoop Streaming (hadoop-streaming.jar) runs the Python/Perl scripts as mapper and reducer processes: Hadoop pipes input records to each script on stdin, reads key/value pairs from its stdout, executes the scripts in parallel, and aggregates the counts in the reducer.
Mapper code in Python

#!/usr/bin/env python
import sys

# Emit "<word>\t1" for every word on every input line.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))
words_count.txt
pramati yahoo facebook aol facebook IBM
kony google pramati
Reducer code in Python

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Input arrives sorted by key, so identical words are adjacent.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Flush the final group.
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
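The two scripts can be checked locally without a cluster; this sketch (my own helper functions mirroring the mapper and reducer logic above) chains them the way `cat words_count.txt | mapper.py | sort | reducer.py` would:

```python
def map_words(lines):
    # Mapper: emit "<word>\t1" per word.
    for line in lines:
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)

def reduce_counts(sorted_pairs):
    # Reducer: sum counts of adjacent identical words.
    current_word, current_count = None, 0
    for pair in sorted_pairs:
        word, count = pair.split('\t', 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield '%s\t%s' % (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield '%s\t%s' % (current_word, current_count)

lines = ['pramati yahoo facebook aol', 'facebook IBM', 'kony google pramati']
counts = dict(p.split('\t') for p in reduce_counts(sorted(map_words(lines))))
print(counts['facebook'], counts['pramati'], counts['yahoo'])  # 2 2 1
```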
Execution of MapReduce

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
  -D mapred.reduce.tasks=4 \
  -file /home/hduser/mapper.py \
  -file /home/hduser/reducer.py \
  -mapper "python mapper.py" \
  -reducer "python reducer.py" \
  -input /test/words_count.txt \
  -output /test_output
Output of mapreduce job
Hadoop Tools
Pig Latin
● Apache Pig is a tool for analyzing large amounts of data by representing them as data flows. Using the Pig Latin scripting language, operations like ETL (Extract, Transform, Load), ad-hoc data analysis, and iterative processing can be achieved easily.
● It can handle the variety problem: structured, semi-structured, and unstructured data.
● Pig was first built at Yahoo! and later became a top-level Apache project. In this section we walk through different features of Pig using a sample dataset.
Pig Access
● Interactive mode
● Batch mode
● $ pig -x local - runs the Grunt shell against the local file system.
● $ pig - runs the Grunt shell against HDFS.
Execution of Pig
Pig can perform joins:
● Self join
● Equi join
● Left outer join
● Right outer join
Customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

Orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
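Roughly, the equi join Pig would perform over these two files can be sketched locally in Python; the column meanings below are my reading of the sample files, not something the slides state:

```python
# Customers.txt fields (assumed): id -> (name, age, city, balance)
customers = {
    1: ('Ramesh', 32, 'Ahmedabad', 2000.00),
    2: ('Khilan', 25, 'Delhi', 1500.00),
    3: ('kaushik', 23, 'Kota', 2000.00),
    4: ('Chaitali', 25, 'Mumbai', 6500.00),
}
# Orders.txt fields (assumed): order_id, date, customer_id, amount
orders = [
    (102, '2009-10-08 00:00:00', 3, 3000),
    (100, '2009-10-08 00:00:00', 3, 1500),
    (101, '2009-11-20 00:00:00', 2, 1560),
    (103, '2008-05-20 00:00:00', 4, 2060),
]
# Equi join on customer id: pair each order with its customer's name.
joined = [(oid, customers[cid][0], amount)
          for (oid, date, cid, amount) in orders if cid in customers]
print(joined)
```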
Output of Pig
Hive
● Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analysis easy.
● Hive is a database technology that can define databases and tables to analyze structured data. The theme of structured data analysis is to store the data in a tabular manner and pass queries to analyze it.
● Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
Hive Features
Hive is not:
1. A relational database.
2. Designed for online transaction processing (OLTP).
3. A language for real-time queries and row-level updates.
Hive is:
4. A tool that stores schema in a database and processed data in HDFS.
5. Designed for online analytical processing (OLAP).
6. A provider of an SQL-like query language called HiveQL (HQL).
Loading structured data into a table using Hive

hive> CREATE DATABASE EMP;
hive> USE EMP;
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, deptno int)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> load data local inpath '/home/purnar/emp_data.txt' into table employee;
hive> select * from employee;
EMP_DATA.txt
1201,chandu,10000.00,20
1202,shekar,2000.00,10
1203,ravi,1000.00,10
1204,kiran,2000.00,20
1205,sharma,30000.00,30
1206,sri,4000.00,40
Difference between Hive and Pig
● Hive is mainly used by data analysts, whereas Pig is used by researchers and programmers.
● Hive is mainly used for structured data, whereas Pig is used for semi-structured and unstructured data.
● Hive is mainly used for creating reports, whereas Pig is used for building data pipelines.
● Hive supports partitions, so you can process a subset of data by date or in alphabetical order, whereas Pig has no notion of partitions, though one might achieve a similar effect with filters.
SQOOP
● Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.
● The traditional application management system, that is, the interaction of applications with a relational database through an RDBMS, is one of the sources that generate Big Data. Such Big Data is stored in relational database servers in the relational structure.
● SQOOP = "SQL to Hadoop and Hadoop to SQL"
Export data from HDFS to MySQL
1. Create a database test and a table (employee) as below:

CREATE TABLE employee (
  id INT,
  name VARCHAR(20),
  deg VARCHAR(20),
  salary INT,
  dept VARCHAR(10)
);

2. Create a text file with the data given below and put it into the Hadoop file system.

Emp.txt
========
1201, gopal, manager,50000, TP
1202, manisha, preader,50000, TP
1203, kalil, php dev,30000, AC
1204, prasanth, php dev,30000, AC
1205, kranthi, admin,20000, TP
1206, satish p, grp des,20000, GR

3. hadoop fs -mkdir /emp
4. hadoop fs -put emp.txt /emp
5. Execute the sqoop command below to export the data from the text file to MySQL:

sqoop export --verbose --connect jdbc:mysql://localhost/hive_db --username ***** --password ****** -m 4 --table employee --export-dir /emp/emp.txt
Import data from MySQL to a flat file

sqoop import --connect jdbc:mysql://localhost/test --username ******* --password ******* --table employee -m 1 --target-dir /chandu
Hadoop Testing
Hadoop Testing
● Unix-style commands such as mkdir, ls, cat, etc.
● Testing the MapReduce scripts.
● Test the mapper and reducer scripts separately with different input files.
● Example: parse an apache.log file for the Gmail users and count the number of times each logged in on a particular day.
● Here we pass a 12 MB file to the Hadoop file system, extract only the Gmail users, and count how many times they logged in on that day.
Test Scenarios
1. Add special characters to the pattern object.
2. Provide extra spaces around the patterns.
3. Test the boundary conditions of the pattern.
4. Add special characters in between the pattern.
5. Count the number of patterns using the reducer.
...and so on.
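A few of these scenarios can be exercised locally against the mapper's core logic (the `map_line` helper is my stand-in for one line of the word-count mapper, not code from the slides):

```python
def map_line(line):
    # One line of the word-count mapper: emit "<token>\t1" per whitespace token.
    return ['%s\t1' % word for word in line.strip().split()]

# Extra spaces: tokens are unaffected by runs of whitespace.
assert map_line('  gmail.com   gmail.com ') == ['gmail.com\t1', 'gmail.com\t1']
# Boundary condition: an empty line emits nothing.
assert map_line('') == []
# Special characters stay attached to the token.
assert map_line('user@gmail.com!') == ['user@gmail.com!\t1']
print('all scenarios pass')
```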
Hive Test

hive> create database test;
hive> create table emp (id int, name string, salary float, deptno int, designation string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY “,”
    > LINES TERMINATED BY '\n';
hive> load data local inpath '/home/purnar/emp.txt' into table emp;
hive> select * from emp;
OK
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
Time taken: 0.1 seconds, Fetched: 6 row(s)

Every column comes back NULL: the field delimiter (note the curly quotes around the comma) never matched the comma-separated file, so no row could be parsed.

EMP.txt
1201, pavan,4000,30,Dev
1202, ravi,3000, 10,QA
1203, kalil,30000,10,phpdev
1204, prasanth,30000,20,QA
1205, kranthi,20000,30,QA
1206, satishp,20000,40,Admin
Hive Test 2

hive> create database test;
hive> create table emp (id int, name string, salary float, deptno int, designation string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> load data local inpath '/home/purnar/emp.txt' into table emp;
hive> select * from emp;
OK
1201 pavan 4000.0 30 Dev
1202 ravi 3000.0 NULL QA
1203 kalil 30000.0 10 phpdev
1204 prasanth 30000.0 20 QA
1205 kranthi 20000.0 30 QA
1206 satishp 20000.0 40 Admin
Time taken: 0.096 seconds, Fetched: 6 row(s)
EMP.txt
1201, pavan,4000,30,Dev
1202, ravi,3000, 10,QA
1203, kalil,30000,10,phpdev
1204, prasanth,30000,20,QA
1205, kranthi,20000,30,QA
1206, satishp,20000,40,Admin
Note: the space before 10 in ravi's row (' 10') cannot be parsed as an int, so deptno is displayed as NULL.
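Hive's behavior here can be mimicked with a small helper (a simplification I am assuming for illustration; this is not Hive's actual SerDe code):

```python
def hive_int(field):
    """Parse a field the way a strict Hive int column does (simplified assumption):
    surrounding whitespace or non-digit characters make the value unparseable, i.e. NULL."""
    if field != field.strip() or not field.lstrip('-').isdigit():
        return None  # displayed as NULL by Hive
    return int(field)

print(hive_int('30'))   # 30
print(hive_int(' 10'))  # None -> displayed as NULL
```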
Questions ?