Date posted: 17-Feb-2017 | Category: Technology | Uploaded by: springpeople
Presented by Stephen Peter
The Hadoop Data Access Layer
Stephen Peter
E-Mail: [email protected] - https://in.linkedin.com/in/stephenepeter
Hortonworks Certified Trainer. Hortonworks Certified Developer (Apache Pig & Hive). Digital Badge: http://bcert.me/sxohnqiq
Professional Experience: Over 20 years of IT experience specializing in Business Intelligence, Data Warehousing, and Big Data. Has worked at organizations such as HCL Tech, Oracle, and Cisco Systems. Presently working as a Hadoop trainer at Spring People.
Area of interest: coexistence of the enterprise data warehouse and Hadoop.
Introduction
• The motivation for Hadoop
  ▫ The need for ingesting, storing, and analyzing big data.
  ▫ Use cases on the value of big data.
• Hadoop as an integral part of a modern data architecture.
• The HDP (Hortonworks Data Platform) reference architecture.
  ▫ The HDP Data Access Layer: the different components, their functions, and applications.
• Use case – data warehouse optimization using Hadoop.
  ▫ To achieve better insight and cost effectiveness.
Agenda
Emerging Data Landscape
• In the past, the world's data doubled every century; now it doubles every two years.
• The flood of data is driven by IoT, mobile devices, server logs, geolocation coordinates, social media, and sensor data.
• Big data is characterized by:
  ▫ Velocity – 90% of the world's data was created in the last two years.
  ▫ Volume – from 8 ZB in 2015, expected to grow to 40 ZB by 2020.
  ▫ Variety – 80% of enterprise data is unstructured, ranging from documents, emails, images, web logs, sensor data, and geospatial coordinates to server logs.
Big Data Use Cases
Source: https://hortonworks.com
Hadoop – An integral part of modern Data Architecture
Source: https://hortonworks.com
Hortonworks Hadoop Platform - HDP
www.hortonworks.com
• Batch processing using the MapReduce framework.
• Interactive SQL queries using Hive on the Tez framework.
• The Apache Pig scripting language can run on MapReduce or Tez.
• Low-latency data access via the NoSQL database HBase.
• Apache Storm processes and analyzes streams of data in real time as they flow into HDFS.
• Apache Spark is a fast, in-memory data processing engine that enables batch, real-time, and advanced analytics on the Apache Hadoop platform.
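The batch-processing model mentioned above can be illustrated with a minimal, framework-free sketch in plain Python — a toy word count showing the map, shuffle, and reduce phases. The function names are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit (word, 1) pairs for each word in one input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reducer: sum all intermediate counts for one key."""
    return (word, sum(counts))

def run_job(lines):
    # Shuffle: group intermediate pairs by key, as the framework would.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(run_job(["big data", "big insights"]))  # {'big': 2, 'data': 1, 'insights': 1}
```

In the real framework, the shuffle step happens across machines and the mappers and reducers run as distributed tasks; the data flow, however, is the same.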
HDP - Data Access Layer
www.hortonworks.com
Ingest Data into HDFS using Sqoop
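A typical Sqoop import from a relational database into HDFS looks like the sketch below; the JDBC URL, credentials, table name, and target directory are hypothetical placeholders:

```shell
# Import one table from a (hypothetical) MySQL database into HDFS.
# --num-mappers controls how many parallel map tasks perform the copy.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.dbpass \
  --table stockinfo \
  --target-dir /user/train/stockinfo \
  --num-mappers 4
```

Sqoop translates this command into a MapReduce job, so the import parallelizes across the cluster.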
▫ The primary use case:
   Stream log entries from multiple machines.
   Aggregate them into a centralized, persistent store such as the Hadoop Distributed File System.
   Log entries can then be analyzed by other Hadoop tools.
▫ Flume is not limited to log entries. It is used to collect many types of streaming data; examples include network traffic data, social-media-generated data, machine sensor data, and email messages.
▫ Flume is not the best choice where data is not generated regularly.
Ingest Data into HDFS using Flume
• Use the Twitter streaming API as the source.
• Create a Twitter application.
• Configure the Flume agent by modifying the Flume configuration:
  ▫ Configure the source, channel, and sink.
  ▫ Source type: org.apache.flume.source.twitter.TwitterSource
  ▫ Channel type: MemChannel
  ▫ Sink type: HDFS
• Run the flume-ng command to extract data from Twitter, for example (-n names the agent defined in twitter.conf):
$ flume-ng agent --conf ./conf/ -f conf/twitter.conf -n TwitterAgent
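A minimal twitter.conf matching the source, channel, and sink types above might look like the following sketch; the agent name TwitterAgent, the HDFS path, and the credential placeholders are assumptions you must fill in:

```
# Name the agent's components (agent name must match flume-ng -n).
TwitterAgent.sources  = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks    = HDFS

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your-consumer-key>
TwitterAgent.sources.Twitter.consumerSecret = <your-consumer-secret>
TwitterAgent.sources.Twitter.accessToken = <your-access-token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your-access-token-secret>

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
```

The memory channel buffers events between the Twitter source and the HDFS sink; for durability across agent restarts, a file channel can be used instead.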
Importing Twitter data into HDFS
Query Data using Hive
Example HiveQL commands

Create a Hive-managed table:
CREATE TABLE stockinfo (symbol STRING, price FLOAT, change FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Create a Hive external table:
CREATE EXTERNAL TABLE salaries (gender STRING, age INT, salary DOUBLE, zip INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/train/salaries/';

Load data from a file in HDFS:
LOAD DATA INPATH '/user/me/stockdata.csv' OVERWRITE INTO TABLE stockinfo;

View everything in the table:
SELECT * FROM stockinfo;
Performance tuning in Hive
• Hive partitioned tables
• Hive buckets
• Optimized Row Columnar (ORC) storage format
• Cost-based SQL optimization
• Hive on Tez for low-latency queries
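The first and third tuning techniques can be combined in one DDL sketch: a partitioned, ORC-backed variant of the earlier stockinfo table. The partition column trade_date is an assumption added for illustration:

```sql
-- Sketch: partitioned, ORC-stored version of stockinfo.
-- Partition pruning lets queries read only the relevant date's files,
-- and ORC adds columnar compression and predicate pushdown.
CREATE TABLE stockinfo_orc (symbol STRING, price FLOAT, change FLOAT)
PARTITIONED BY (trade_date STRING)
STORED AS ORC;

-- Populate one partition from the plain-text table.
INSERT OVERWRITE TABLE stockinfo_orc PARTITION (trade_date = '2017-02-17')
SELECT symbol, price, change FROM stockinfo;
```

A query filtered on trade_date then scans a single partition directory instead of the whole table.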
Use cases for Apache Pig
• Pig can extract data from multiple sources, transform it, and store it in HDFS.
• Research on raw data.
• Iterative data processing.
[Diagram: Pig extracts database, log, and sensor data; transforms it; and loads it into HDFS, where Hive, Pig, and other analysis tools consume it.]
Load data from a file and apply a schema:
stockinfo = LOAD 'stockdata.csv' USING PigStorage(',')
    AS (symbol:chararray, price:float, change:float);

Display the data in stockinfo:
DUMP stockinfo;

Filter the stockinfo data and write the filtered data to HDFS:
IBM_only = FILTER stockinfo BY (symbol == 'IBM');
STORE IBM_only INTO 'ibm_stockinfo';

Load data from a file without applying a schema:
a = LOAD 'flightdelays' USING PigStorage(',');

Apply a schema on read:
c = FOREACH a GENERATE $0 AS year:int, $1 AS month:int, $4 AS name:chararray;
Example Pig Statements
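Building on the statements above, a common next step in iterative processing is grouping and aggregation. The sketch below assumes the flightdelays relation c with the schema applied earlier; the relation names by_year and counts are illustrative:

```
-- Count records per year in the flightdelays data (sketch).
by_year = GROUP c BY year;
counts  = FOREACH by_year GENERATE group AS year, COUNT(c) AS num_flights;
STORE counts INTO 'flightdelays_by_year';
```

GROUP produces one tuple per key with a bag of matching records; the FOREACH then collapses each bag to a count.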
Create workflow using Apache Oozie
[Diagram: an Oozie workflow chaining distcp, MapReduce, Hive, Pig, and Sqoop actions over the data.]
Apache Oozie is a server-based workflow engine used to execute Hadoop jobs.
It is used to build and schedule complex data transformations by combining MapReduce, Apache Hive, Apache Pig, and Apache Sqoop jobs into a single, logical unit of work.
Oozie can also perform Java, Linux shell, distcp, SSH, email, and other operations.
Oozie runs as a Java web application in Apache Tomcat.
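A minimal workflow.xml with a single Hive action might look like the sketch below; the workflow name, script filename, and the jobTracker/nameNode properties are placeholder assumptions supplied at submission time:

```xml
<workflow-app name="dw-optimization-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-step"/>
    <action name="hive-step">
        <hive xmlns="uri:oozie:hive-action:0.5">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>load_stockinfo.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each action declares its success and failure transitions, which is how Oozie chains several job types into one logical unit of work.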
Use Case – Data Warehouse Optimization with Hadoop
Thank you
Visit: www.springpeople.com call +91 80 6567 9700
Our Partners