Date posted: 17-Feb-2017 | Category: Technology | Uploaded by: springpeople
Presented by Stephen Peter
The Hadoop Data Access Layer
Stephen Peter
E-Mail: [email protected] - https://in.linkedin.com/in/stephenepeter
Hortonworks Certified Trainer. Hortonworks Certified Developer (Apache Pig & Hive). Digital Badge: http://bcert.me/sxohnqiq
Professional Experience: Over 20 years of IT experience specializing in Business Intelligence, Data Warehousing, and Big Data. Has worked at organizations such as HCL Tech, Oracle, and Cisco Systems. Presently working as a Hadoop trainer at Spring People.
Area of interest: coexistence of the enterprise data warehouse and Hadoop.
Introduction
• The motivation for Hadoop
  ▫ The need for ingesting, storing, and analyzing big data.
  ▫ Use cases on the value of big data.
• Hadoop as an integral part of a modern data architecture.
• The HDP (Hortonworks Data Platform) reference architecture.
  ▫ The HDP Data Access Layer: the different components, their functions, and applications.
• Use case – data warehouse optimization using Hadoop.
  ▫ To achieve better insight and cost effectiveness.
Agenda
Emerging Data Landscape
• In the past, the world's data doubled every century; now it doubles every two years.
• The flood of data is driven by IoT, mobile devices, server logs, geolocation coordinates, social media, and sensor data.
• Big data is characterized by:
  ▫ Velocity – 90% of the world's data was created in the last two years.
  ▫ Volume – from 8 ZB in 2015, expected to grow to 40 ZB by 2020.
  ▫ Variety – 80% of enterprise data is unstructured, ranging from documents, emails, images, web logs, sensor data, and geospatial coordinates to server logs.
Big Data Use Cases
Source: https://hortonworks.com
Hadoop – An integral part of modern Data Architecture
Source: https://hortonworks.com
Hortonworks Hadoop Platform - HDP
www.hortonworks.com
• Batch processing using the MapReduce framework.
• Interactive SQL queries using Hive on the Tez framework.
• The Apache Pig scripting language can run on MapReduce or Tez.
• Low-latency data access via the NoSQL database HBase.
• Apache Storm processes and analyzes streams of data in real time as they flow into HDFS.
• Apache Spark is a fast, in-memory data processing engine that enables batch, real-time, and advanced analytics on the Apache Hadoop platform.
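The batch-processing model mentioned above can be illustrated with a minimal, framework-free sketch in plain Python — a toy word count showing the map, shuffle, and reduce phases. The function names are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit (word, 1) pairs for each word in one input line."""
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    """Reducer: sum all intermediate counts for one key."""
    return (word, sum(counts))

def run_job(lines):
    # Shuffle: group intermediate pairs by key, as the framework would.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

print(run_job(["big data", "big insights"]))  # {'big': 2, 'data': 1, 'insights': 1}
```

In the real framework, the shuffle step happens across machines and the mappers and reducers run as distributed tasks; the data flow, however, is the same.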
HDP - Data Access Layer
www.hortonworks.com
Ingest Data into HDFS using Sqoop
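A typical Sqoop import from a relational database into HDFS looks like the sketch below; the JDBC URL, credentials, table name, and target directory are hypothetical placeholders:

```shell
# Import one table from a (hypothetical) MySQL database into HDFS.
# --num-mappers controls how many parallel map tasks perform the copy.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user \
  --password-file /user/etl/.dbpass \
  --table stockinfo \
  --target-dir /user/train/stockinfo \
  --num-mappers 4
```

Sqoop translates this command into a MapReduce job, so the import parallelizes across the cluster.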
▫ The primary use case:
   Stream log entries from multiple machines.
   Aggregate them into a centralized, persistent store such as the Hadoop Distributed File System.
   Log entries can then be analyzed by other Hadoop tools.
▫ Flume is not limited to log entries. It is used to collect many types of streaming data; examples include network traffic data, social-media-generated data, machine sensor data, and email messages.
▫ Flume is not the best choice where data is not generated regularly.
Ingest Data into HDFS using Flume
• Use the Twitter streaming API as the source.
• Create a Twitter application.
• Configure the Flume agent by modifying the Flume configuration:
  ▫ Configure the source, channel, and sink.
  ▫ Source type: org.apache.flume.source.twitter.TwitterSource
  ▫ Channel type: MemChannel
  ▫ Sink type: HDFS
• Run the flume-ng command to extract data from Twitter, for example (-n names the agent defined in twitter.conf):
$ flume-ng agent --conf ./conf/ -f conf/twitter.conf -n TwitterAgent
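A minimal twitter.conf matching the source, channel, and sink types above might look like the following sketch; the agent name TwitterAgent, the HDFS path, and the credential placeholders are assumptions you must fill in:

```
# Name the agent's components (agent name must match flume-ng -n).
TwitterAgent.sources  = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks    = HDFS

TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = <your-consumer-key>
TwitterAgent.sources.Twitter.consumerSecret = <your-consumer-secret>
TwitterAgent.sources.Twitter.accessToken = <your-access-token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your-access-token-secret>

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = /user/flume/tweets/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
```

The memory channel buffers events between the Twitter source and the HDFS sink; for durability across agent restarts, a file channel can be used instead.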
Importing Twitter data into HDFS
Query Data using Hive
Example HiveQL commands

Create a Hive-managed table:
CREATE TABLE stockinfo (symbol STRING, price FLOAT, change FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Create a Hive external table:
CREATE EXTERNAL TABLE salaries (gender STRING, age INT, salary DOUBLE, zip INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/train/salaries/';

Load data from a file in HDFS:
LOAD DATA INPATH '/user/me/stockdata.csv' OVERWRITE INTO TABLE stockinfo;

View everything in the table:
SELECT * FROM stockinfo;
Performance tuning in Hive
• Hive partitioned tables
• Hive buckets
• Optimized Row Columnar (ORC) storage format
• Cost-based SQL optimization
• Hive on Tez for low-latency queries
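The first and third tuning techniques can be combined in one DDL sketch: a partitioned, ORC-backed variant of the earlier stockinfo table. The partition column trade_date is an assumption added for illustration:

```sql
-- Sketch: partitioned, ORC-stored version of stockinfo.
-- Partition pruning lets queries read only the relevant date's files,
-- and ORC adds columnar compression and predicate pushdown.
CREATE TABLE stockinfo_orc (symbol STRING, price FLOAT, change FLOAT)
PARTITIONED BY (trade_date STRING)
STORED AS ORC;

-- Populate one partition from the plain-text table.
INSERT OVERWRITE TABLE stockinfo_orc PARTITION (trade_date = '2017-02-17')
SELECT symbol, price, change FROM stockinfo;
```

A query filtered on trade_date then scans a single partition directory instead of the whole table.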
Use cases for Apache Pig
• Pig can extract data from multiple sources, transform it, and store it in HDFS.
• Research on raw data.
• Iterative data processing.
[Diagram: Pig extracts database, log, and sensor data; transforms it; and loads it into HDFS, where Hive, Pig, and other analysis tools consume it.]
Load data from a file and apply a schema:
stockinfo = LOAD 'stockdata.csv' USING PigStorage(',')
    AS (symbol:chararray, price:float, change:float);

Display the data in stockinfo:
DUMP stockinfo;

Filter the stockinfo data and write the filtered data to HDFS:
IBM_only = FILTER stockinfo BY (symbol == 'IBM');
STORE IBM_only INTO 'ibm_stockinfo';

Load data from a file without applying a schema:
a = LOAD 'flightdelays' USING PigStorage(',');

Apply a schema on read:
c = FOREACH a GENERATE $0 AS year:int, $1 AS month:int, $4 AS name:chararray;
Example Pig Statements
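Building on the statements above, a common next step in iterative processing is grouping and aggregation. The sketch below assumes the flightdelays relation c with the schema applied earlier; the relation names by_year and counts are illustrative:

```
-- Count records per year in the flightdelays data (sketch).
by_year = GROUP c BY year;
counts  = FOREACH by_year GENERATE group AS year, COUNT(c) AS num_flights;
STORE counts INTO 'flightdelays_by_year';
```

GROUP produces one tuple per key with a bag of matching records; the FOREACH then collapses each bag to a count.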
Create workflow using Apache Oozie
[Diagram: an Oozie workflow chaining distcp, MapReduce, Hive, Pig, and Sqoop actions over the data.]
Apache Oozie is a server-based workflow engine used to execute Hadoop jobs.
It is used to build and schedule complex data transformations by combining MapReduce, Apache Hive, Apache Pig, and Apache Sqoop jobs into a single, logical unit of work.
Oozie can also perform Java, Linux shell, distcp, SSH, email, and other operations.
Oozie runs as a Java web application in Apache Tomcat.
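A minimal workflow.xml with a single Hive action might look like the sketch below; the workflow name, script filename, and the jobTracker/nameNode properties are placeholder assumptions supplied at submission time:

```xml
<workflow-app name="dw-optimization-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="hive-step"/>
    <action name="hive-step">
        <hive xmlns="uri:oozie:hive-action:0.5">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>load_stockinfo.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Each action declares its success and failure transitions, which is how Oozie chains several job types into one logical unit of work.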
Use Case – Data Warehouse Optimization with Hadoop
Thank you
Visit: www.springpeople.com call +91 80 6567 9700
Our Partners