Date posted: 22-Feb-2017
Uploaded by: purna-chander
Hadoop Tech Talk
By : Purna Chander
Agenda
● Big Data
● Traditional System Workflow
● Hadoop
● Hadoop Tools
● Hadoop Testing
Big Data
What is Big Data?
Data that exceeds the storage capacity and the processing power of traditional systems.
VOLUME VELOCITY VARIETY
Assume we live in a world of 100% data: roughly 90% of it was generated in the last 3-4 years, and only 10% was generated in all the years before, when these systems were first introduced.
Volume
1. Transaction-based data stored in relational databases over the years.
2. Unstructured data generated on social media.
3. Sensor and machine-to-machine data.
4. Facebook alone generates about 600 terabytes of data every day.
Velocity
1. Reacting fast enough to incoming data is one of the challenges.
2. Computation is process bound.
3. Unstructured data must be processed at speed.
Variety
1. Data arrives in different formats.
2. Structured data resides in traditional RDBMSs or flat files.
3. Unstructured data includes text documents, videos, email, audio, log files, etc.
4. Managing, merging, and governing different varieties of data is the biggest challenge.
5. The data must be connected and correlated to extract useful information from it.
Challenges reading from a single disk
Reading data from a warehouse:
200 GB FB (India)
200 GB FB (US)
200 GB FB (Japan)
200 GB FB (UK)
200 GB FB (China)
Data warehouse: 1 TB and more
Reading just 100 GB over a 100 Mbps link takes (100 × 1000 × 8) / 100 = 8000 seconds, about 2.2 hours.
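The arithmetic above can be checked with a small helper (the function name and unit conversions are my own, for illustration):

```python
def read_time_seconds(size_gb, bandwidth_mbps):
    """Seconds to read size_gb gigabytes over a bandwidth_mbps (megabits/s) link."""
    megabits = size_gb * 1000 * 8  # GB -> MB -> megabits
    return megabits / bandwidth_mbps

print(read_time_seconds(100, 100))         # 8000.0 seconds
print(read_time_seconds(100, 100) / 3600)  # ~2.2 hours
```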
Traditional workflow
Sources: OLTP (RDBMS), social networking, logs (XML / .txt, e.g. Apache logs)
→ Data warehouse (expensive storage: spread across systems, not easily accessible, limited capacity)
→ Reports
Hadoop Workflow
Sources: OLTP (RDBMS), social networking, logs (XML / .txt, e.g. Apache logs)
→ Hadoop
→ Data warehouse
→ Reports
Vertical Scaling
Increasing the resources (RAM, processor, hard disk) on a single machine is vertical scaling.
1990 - 512 MB RAM, 2-core processor.
2000 - 4/8 GB RAM, 8-core processor.
Production systems - 64 GB RAM / 16-core processor (cost and maintenance become the limiting factors).
Horizontal Scaling (Distributed Computing)
Instead of one large machine, the data warehouse workload is spread across many commodity nodes (e.g. four 8 GB machines), each holding a share of the data; capacity grows by adding nodes. This is the approach HADOOP takes.
Hadoop
Hadoop is an Apache software framework that allows the distributed processing of large data sets across clusters of computers using a simple programming model.
It is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Rather than relying on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.
Hadoop Core Features
HDFS - stores data on a cluster of machines.
MapReduce - processes the data stored in HDFS.
Data Replication
Default replication factor: 3. A file is split into blocks (Block 1, Block 2, Block 3), and each block is stored on three machines spread across racks (Rack 1: computers 1-6; Rack 2: computers 7-12), so losing any one machine or rack does not lose data.
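A toy sketch of the idea (round-robin placement is my simplifying assumption here; real HDFS placement is rack-aware and more sophisticated):

```python
def place_blocks(blocks, machines, replication=3):
    """Assign each block to `replication` distinct machines (toy round-robin model)."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [machines[(i + r) % len(machines)] for r in range(replication)]
    return placement

machines = ['computer%d' % n for n in range(1, 13)]  # computers 1-12 across two racks
layout = place_blocks(['block1', 'block2', 'block3'], machines)
for block, copies in sorted(layout.items()):
    print(block, '->', copies)
```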
Map Reduce
What is MapReduce?
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
Example: take a small and a big text file and count the occurrences of each word.
MapReduce programs can be written in Perl, Python, Java, or Ruby.
For non-Java languages, Hadoop Streaming (hadoop-streaming.jar) runs the Python/Perl scripts as mapper and reducer processes: Hadoop pipes input records to each script on stdin, reads key/value pairs from its stdout, executes the scripts in parallel, and aggregates the counts in the reducer.
Mapper code in Python

#!/usr/bin/env python
import sys

# Emit "<word>\t1" for every word on every input line.
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t%s' % (word, 1))
words_count.txt
pramati yahoo facebook aol facebook IBM
kony google pramati
Reducer code in Python

#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# Input arrives sorted by key, so identical words are adjacent.
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Flush the final group.
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
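The two scripts can be checked locally without a cluster; this sketch (my own helper functions mirroring the mapper and reducer logic above) chains them the way `cat words_count.txt | mapper.py | sort | reducer.py` would:

```python
def map_words(lines):
    # Mapper: emit "<word>\t1" per word.
    for line in lines:
        for word in line.strip().split():
            yield '%s\t%s' % (word, 1)

def reduce_counts(sorted_pairs):
    # Reducer: sum counts of adjacent identical words.
    current_word, current_count = None, 0
    for pair in sorted_pairs:
        word, count = pair.split('\t', 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield '%s\t%s' % (current_word, current_count)
            current_word, current_count = word, count
    if current_word is not None:
        yield '%s\t%s' % (current_word, current_count)

lines = ['pramati yahoo facebook aol', 'facebook IBM', 'kony google pramati']
counts = dict(p.split('\t') for p in reduce_counts(sorted(map_words(lines))))
print(counts['facebook'], counts['pramati'], counts['yahoo'])  # 2 2 1
```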
Execution of MapReduce

hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
  -D mapred.reduce.tasks=4 \
  -file /home/hduser/mapper.py \
  -file /home/hduser/reducer.py \
  -mapper "python mapper.py" \
  -reducer "python reducer.py" \
  -input /test/words_count.txt \
  -output /test_output
Output of mapreduce job
Hadoop Tools
Pig Latin
● Apache Pig is a tool for analyzing large amounts of data by representing them as data flows. Using the Pig Latin scripting language, operations like ETL (Extract, Transform, Load), ad-hoc data analysis, and iterative processing can be achieved easily.
● It can handle the variety problem: structured, semi-structured, and unstructured data.
● Pig was first built at Yahoo! and later became a top-level Apache project. In this section we walk through different features of Pig using a sample dataset.
Pig Access
● Interactive mode
● Batch mode
● $ pig -x local - runs the Grunt shell against the local file system.
● $ pig - runs the Grunt shell against HDFS.
Execution of Pig
Pig can perform joins:
● Self join
● Equi join
● Left outer join
● Right outer join
Customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

Orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
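Roughly, the equi join Pig would perform over these two files can be sketched locally in Python; the column meanings below are my reading of the sample files, not something the slides state:

```python
# Customers.txt fields (assumed): id -> (name, age, city, balance)
customers = {
    1: ('Ramesh', 32, 'Ahmedabad', 2000.00),
    2: ('Khilan', 25, 'Delhi', 1500.00),
    3: ('kaushik', 23, 'Kota', 2000.00),
    4: ('Chaitali', 25, 'Mumbai', 6500.00),
}
# Orders.txt fields (assumed): order_id, date, customer_id, amount
orders = [
    (102, '2009-10-08 00:00:00', 3, 3000),
    (100, '2009-10-08 00:00:00', 3, 1500),
    (101, '2009-11-20 00:00:00', 2, 1560),
    (103, '2008-05-20 00:00:00', 4, 2060),
]
# Equi join on customer id: pair each order with its customer's name.
joined = [(oid, customers[cid][0], amount)
          for (oid, date, cid, amount) in orders if cid in customers]
print(joined)
```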
Output of Pig
Hive
● Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and makes querying and analysis easy.
● Hive is a database technology that can define databases and tables to analyze structured data. The theme of structured data analysis is to store the data in a tabular manner and pass queries to analyze it.
● Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive.
Hive Features
Hive is not:
1. A relational database.
2. Designed for online transaction processing (OLTP).
3. A language for real-time queries and row-level updates.
Hive is:
4. A tool that stores schema in a database and processed data in HDFS.
5. Designed for online analytical processing (OLAP).
6. A provider of an SQL-like query language called HiveQL (HQL).
Loading structured data into a table using Hive

hive> CREATE DATABASE EMP;
hive> USE EMP;
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, deptno int)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> load data local inpath '/home/purnar/emp_data.txt' into table employee;
hive> select * from employee;
EMP_DATA.txt
1201,chandu,10000.00,20
1202,shekar,2000.00,10
1203,ravi,1000.00,10
1204,kiran,2000.00,20
1205,sharma,30000.00,30
1206,sri,4000.00,40
Difference between Hive and Pig
● Hive is mainly used by data analysts, whereas Pig is used by researchers and programmers.
● Hive is mainly used for structured data, whereas Pig is used for semi-structured and unstructured data.
● Hive is mainly used for creating reports, whereas Pig is used for building data pipelines.
● Hive supports partitions, so you can process a subset of data by date or in alphabetical order, whereas Pig has no notion of partitions, though one might achieve a similar effect with filters.
SQOOP
● Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases such as MySQL and Oracle into Hadoop HDFS, and to export data from the Hadoop file system back to relational databases.
● The traditional application management system, that is, the interaction of applications with a relational database through an RDBMS, is one of the sources that generate Big Data. Such Big Data is stored in relational database servers in the relational structure.
● SQOOP = "SQL to Hadoop and Hadoop to SQL"
Export data from HDFS to MySQL
1. Create a database test and a table (employee) as below:

CREATE TABLE employee (
  id INT,
  name VARCHAR(20),
  deg VARCHAR(20),
  salary INT,
  dept VARCHAR(10)
);

2. Create a text file with the data given below and put it into the Hadoop file system.

Emp.txt
========
1201, gopal, manager,50000, TP
1202, manisha, preader,50000, TP
1203, kalil, php dev,30000, AC
1204, prasanth, php dev,30000, AC
1205, kranthi, admin,20000, TP
1206, satish p, grp des,20000, GR

3. hadoop fs -mkdir /emp
4. hadoop fs -put emp.txt /emp
5. Execute the sqoop command below to export the data from the text file to MySQL:

sqoop export --verbose --connect jdbc:mysql://localhost/hive_db --username ***** --password ****** -m 4 --table employee --export-dir /emp/emp.txt
Import data from MySQL to a flat file

sqoop import --connect jdbc:mysql://localhost/test --username ******* --password ******* --table employee -m 1 --target-dir /chandu
Hadoop Testing
Hadoop Testing
● Unix-style commands such as mkdir, ls, cat, etc.
● Testing the MapReduce scripts.
● Test the mapper and reducer scripts separately with different input files.
● Example: parse an apache.log file for the Gmail users and count the number of times each logged in on a particular day.
● Here we pass a 12 MB file to the Hadoop file system, extract only the Gmail users, and count how many times they logged in on that day.
Test Scenarios
1. Add special characters to the pattern object.
2. Provide extra spaces around the patterns.
3. Test the boundary conditions of the pattern.
4. Add special characters in between the pattern.
5. Count the number of patterns using the reducer.
...and so on.
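A few of these scenarios can be exercised locally against the mapper's core logic (the `map_line` helper is my stand-in for one line of the word-count mapper, not code from the slides):

```python
def map_line(line):
    # One line of the word-count mapper: emit "<token>\t1" per whitespace token.
    return ['%s\t1' % word for word in line.strip().split()]

# Extra spaces: tokens are unaffected by runs of whitespace.
assert map_line('  gmail.com   gmail.com ') == ['gmail.com\t1', 'gmail.com\t1']
# Boundary condition: an empty line emits nothing.
assert map_line('') == []
# Special characters stay attached to the token.
assert map_line('user@gmail.com!') == ['user@gmail.com!\t1']
print('all scenarios pass')
```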
Hive Test

hive> create database test;
hive> create table emp (id int, name string, salary float, deptno int, designation string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY “,”
    > LINES TERMINATED BY '\n';
hive> load data local inpath '/home/purnar/emp.txt' into table emp;
hive> select * from emp;
OK
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL
Time taken: 0.1 seconds, Fetched: 6 row(s)

Every column comes back NULL: the field delimiter (note the curly quotes around the comma) never matched the comma-separated file, so no row could be parsed.

EMP.txt
1201, pavan,4000,30,Dev
1202, ravi,3000, 10,QA
1203, kalil,30000,10,phpdev
1204, prasanth,30000,20,QA
1205, kranthi,20000,30,QA
1206, satishp,20000,40,Admin
Hive Test 2

hive> create database test;
hive> create table emp (id int, name string, salary float, deptno int, designation string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > LINES TERMINATED BY '\n'
    > STORED AS TEXTFILE;
hive> load data local inpath '/home/purnar/emp.txt' into table emp;
hive> select * from emp;
OK
1201 pavan 4000.0 30 Dev
1202 ravi 3000.0 NULL QA
1203 kalil 30000.0 10 phpdev
1204 prasanth 30000.0 20 QA
1205 kranthi 20000.0 30 QA
1206 satishp 20000.0 40 Admin
Time taken: 0.096 seconds, Fetched: 6 row(s)
EMP.txt
1201, pavan,4000,30,Dev
1202, ravi,3000, 10,QA
1203, kalil,30000,10,phpdev
1204, prasanth,30000,20,QA
1205, kranthi,20000,30,QA
1206, satishp,20000,40,Admin
Note: the space before 10 in ravi's row (' 10') cannot be parsed as an int, so deptno is displayed as NULL.
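Hive's behavior here can be mimicked with a small helper (a simplification I am assuming for illustration; this is not Hive's actual SerDe code):

```python
def hive_int(field):
    """Parse a field the way a strict Hive int column does (simplified assumption):
    surrounding whitespace or non-digit characters make the value unparseable, i.e. NULL."""
    if field != field.strip() or not field.lstrip('-').isdigit():
        return None  # displayed as NULL by Hive
    return int(field)

print(hive_int('30'))   # 30
print(hive_int(' 10'))  # None -> displayed as NULL
```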
Questions ?