Post on 11-Nov-2014
Slide 1
Hadoop for Java Professionals
View Hadoop Courses at : www.edureka.in/hadoop
Twitter @edurekaIN, Facebook /edurekaIN, use #AskEdureka for Questions
Slide 2
Objectives of this Session
• Big Data and Hadoop
• Why Hadoop?
• Job Trends: Hadoop and Java
• Hadoop Ecosystem
• MapReduce Programming and Java
• User Defined Functions (UDFs) in Pig and Hive
• HBase and Java
For queries during the session and class recording:
Post on Twitter @edurekaIN: #askEdureka
Post on Facebook /edurekaIN
Slide 3
Big Data
Lots of Data (Terabytes or Petabytes)
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
[Word cloud around “Big Data”: cloud, tools, statistics, NoSQL, compression, storage, support, database, analyze, information, terabytes, processing, mobile]
Slide 4
Unstructured Data is Exploding
2,500 exabytes of new information in 2012, with the internet as the primary driver. “Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 zettabytes this year.”
Slide 5
Big Data Challenges: Increasing Data Volumes
New data sources and types
Email and documents
Social Media, Web Logs
Machine Device(Scientific)
Transactions, OLTP, OLAP
Slide 6
Job Trends: Hadoop and Java
Slide 7
Job Trends: Hadoop and Java
Slide 8
Jobs in Hadoop
Big Data has opened the door to new job opportunities, to name a few:
• Hadoop Developer
• Hadoop Architect
• Hadoop Engineer
• Hadoop Application Developer
• Data Analyst
• Data Scientist
• Business Intelligence (BI) Architect
• Big Data Engineer
Slide 9
Hadoop for Java Professionals
Hadoop is red-hot because it:
• Allows distributed processing of large data sets across clusters of computers using a simple programming model.
• Has become the de facto standard for storing, processing, and analyzing hundreds of terabytes or petabytes of data.
• Is cheaper to use than traditional proprietary technologies such as Oracle and IBM, and runs on low-cost commodity hardware.
• Can handle all types of data from disparate systems, such as server logs, emails, sensor data, pictures, and videos.
Slide 10
Hadoop for Java Professionals (Contd.)
Hadoop is a natural career progression for Java professionals: it is a Java-based framework, written entirely in Java.
The combination of Hadoop and Java skills is the most in-demand combination among all Hadoop jobs.
Java skills come in handy when writing code for the following in Hadoop:
• MapReduce programming in Java
• User Defined Functions (UDFs) in Pig and Hive
• Scripts for Hadoop applications
• Client applications in HBase
Slide 11
Hadoop for Big Data
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
It is an open-source data management framework with scale-out storage and distributed processing.
Slide 12
Hadoop and MapReduce
Hadoop is a system for large scale data processing.
It has two main components:
HDFS – Hadoop Distributed File System (Storage)
• Highly fault-tolerant
• High-throughput access to application data
• Suitable for applications with large data sets
• Natively redundant

MapReduce (Processing)
• A software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) in a reliable, fault-tolerant manner
• Splits a task across processors, operating on (key, value) pairs
Slide 13
Important Hadoop Ecosystem Components
• HDFS (Hadoop Distributed File System)
• MapReduce Framework
• Pig Latin – Data Analysis
• Hive – DW System
• HBase
Slide 14
What is MapReduce?
• MapReduce is a programming model
• It is neither platform- nor language-specific
• Record-oriented data processing (keys and values)
• Tasks are distributed across multiple nodes
• Where possible, each node processes data stored on that node
• It consists of two phases: Map and Reduce
Slide 15
What is MapReduce? (Contd.)
The process can be considered similar to a Unix pipeline:

cat /my/log | grep '\.html' | sort | uniq -c > /my/outfile

Here cat and grep correspond to MAP, sort to SORT (the shuffle), and uniq -c to REDUCE.
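For Java readers, the same grep/sort/uniq shape can be sketched with plain Java streams. This is an illustrative stand-in, not Hadoop code; the class name and log lines are invented for the example:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class PipelineSketch {
    // Equivalent of: cat /my/log | grep '\.html' | sort | uniq -c
    public static Map<String, Long> htmlHits(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains(".html"))   // grep '\.html'
                .collect(Collectors.groupingBy(           // uniq -c (count per key)
                        line -> line,
                        TreeMap::new,                     // sorted keys, like `sort`
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> log = List.of("/index.html", "/img/logo.png",
                                   "/index.html", "/about.html");
        System.out.println(htmlHits(log));
        // {/about.html=1, /index.html=2}
    }
}
```

The filter plays the map role, and the grouping collector plays the sort-and-reduce role; Hadoop performs the same steps, only spread across many machines.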
Slide 16
A Sample MapReduce program in Java
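The program itself did not survive in this transcript. As a dependency-free stand-in, the word-count logic that a typical first Hadoop program implements can be sketched in plain Java: the map phase emits (word, 1) pairs and the reduce phase sums them per word. The class name is hypothetical; a real job would use Hadoop's Mapper and Reducer classes instead:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle + reduce phase: group the pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static Map<String, Integer> wordCount(List<String> lines) {
        return reduce(map(lines));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or not to be")));
    }
}
```

In a real Hadoop job these two functions become a Mapper and a Reducer class, and the framework handles the shuffle between them across nodes.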
Slide 17
Problem – Data Processing
Slide 18
[Diagram] Huge raw XML files with unstructured data, like reviews, are stored in HDFS and processed with MapReduce. Output table: Category | hash | url | +tive | -tive | total
Slide 19
Other Applications of Java Skills in Hadoop – UDFs
Slide 20
Pig is a high-level, declarative data flow language.
It sits on top of Hadoop and makes it possible to create complex jobs to process large volumes of data quickly and efficiently.
It is similar to SQL in that the user specifies the “what” and leaves the “how” to the underlying processing engine.
User Defined Functions (UDFs) in PIG
Slide 21
import java.io.IOException;

import org.apache.pig.FilterFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;

public class IsOfAge extends FilterFunc {
    @Override
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() == 0) {
            return false;
        }
        try {
            Object object = tuple.get(0);
            if (object == null) {
                return false;
            }
            int i = (Integer) object;
            return i == 18 || i == 19 || i == 21 || i == 23 || i == 27;
        } catch (ExecException e) {
            throw new IOException(e);
        }
    }
}
A Program to create UDF:
Pig Latin – Creating UDF
Slide 22
How to call a UDF?
register myudf.jar;
X = filter A by IsOfAge(age);
Pig and UDF
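As a side note on the UDF above, the chain of == comparisons is usually written as a set lookup. A plain-Java sketch of the same predicate, with no Pig dependencies (class name hypothetical):

```java
import java.util.Set;

public class AgeCheck {
    // Same whitelist of ages as the IsOfAge UDF, expressed as a set lookup.
    private static final Set<Integer> ALLOWED_AGES = Set.of(18, 19, 21, 23, 27);

    public static boolean isOfAge(Integer age) {
        return age != null && ALLOWED_AGES.contains(age);
    }
}
```

Inside the real UDF, the body of exec would simply delegate to a check like this, which keeps the allowed values in one place.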
Slide 23
Questions?
Buy Complete Course at: www.edureka.in/hadoop
Interested in learning “Big-Data & Hadoop”? Let us know by mailing us at sales@edureka.in