
Hadoop data access layer v4.0

Date post: 17-Feb-2017
Upload: springpeople
Page 1: Hadoop data access layer v4.0

Presented by Stephen Peter

The Hadoop Data Access Layer

Page 2: Hadoop data access layer v4.0

Stephen Peter
E-Mail: [email protected] – https://in.linkedin.com/in/stephenepeter

Hortonworks Certified Trainer
Hortonworks Certified Developer (Apache Pig & Hive)
Digital Badge: http://bcert.me/sxohnqiq

Professional Experience: Over 20 years of IT experience, specializing in Business Intelligence, Data Warehousing, and Big Data. Has worked at organizations such as HCL Tech, Oracle, and Cisco Systems. Currently working as a Hadoop trainer at SpringPeople.

Area of interest: the coexistence of the enterprise data warehouse and Hadoop

Introduction

Page 3: Hadoop data access layer v4.0

• The motivation for Hadoop
  ▫ The need for ingesting, storing, and analyzing big data
  ▫ Use cases on the value of big data
• Hadoop as an integral part of the Modern Data Architecture
• The HDP (Hortonworks Data Platform) reference architecture
  ▫ The HDP Data Access Layer: its components, their functions, and their applications
• Use case – data warehouse optimization using Hadoop
  ▫ Achieving better insight and cost effectiveness

Agenda

Page 4: Hadoop data access layer v4.0

Emerging Data Landscape

• In the past, the world's data doubled every century; now it doubles every two years.
• The flood of data is driven by IoT, mobile devices, server logs, geolocation coordinates, social media, and sensor data.
• Big data is characterized by:
  ▫ Velocity – 90% of the world's data was created in the last two years.
  ▫ Volume – from 8 ZB in 2015, expected to grow to 40 ZB by 2020.
  ▫ Variety – 80% of enterprise data is unstructured, ranging from documents, emails, images, and web logs to sensor data, geospatial coordinates, and server logs.

Page 5: Hadoop data access layer v4.0

Big Data Use Cases

Source: https://hortonworks.com

Page 6: Hadoop data access layer v4.0

Hadoop – An integral part of modern Data Architecture

Source: https://hortonworks.com

Page 7: Hadoop data access layer v4.0

Hortonworks Hadoop Platform - HDP

www.hortonworks.com

Page 8: Hadoop data access layer v4.0

• Batch processing using the MapReduce framework.

• Interactive SQL query using Hive on the Tez framework.

• The Apache Pig scripting language, which can run on MapReduce or Tez.

• Low-latency data access via the NoSQL database HBase.

• Apache Storm, which processes and analyzes streams of data in real time as they flow into HDFS.

• Apache Spark, a fast, in-memory data processing engine that enables batch, real-time, and advanced analytics on the Apache Hadoop platform.

HDP - Data Access Layer

www.hortonworks.com

Page 9: Hadoop data access layer v4.0

Ingest Data into HDFS using Sqoop
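A minimal sketch of a Sqoop import, assuming a relational source reachable over JDBC. The connection string, table name, credentials, and target directory below are placeholders for illustration, not values from the deck:

```shell
# Import one table from a relational database into HDFS with Sqoop.
# All connection details here are hypothetical.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /user/train/orders \
  --num-mappers 4
```

By default Sqoop writes the rows as comma-delimited text files, one part file per mapper, which downstream tools such as Hive or Pig can then read.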

Page 10: Hadoop data access layer v4.0

▫ The primary use case:
   Stream log entries from multiple machines.
   Aggregate them to a centralized, persistent store such as the Hadoop Distributed File System.
   Log entries can then be analyzed by other Hadoop tools.
▫ Flume is not limited to log entries. It is used to collect many types of streaming data; examples include network traffic data, social-media-generated data, machine sensor data, and email messages.
▫ Flume is not the best choice where data is not regularly generated.

Ingest Data into HDFS using Flume

Page 11: Hadoop data access layer v4.0

• Use the Twitter streaming API as the source.
• Create a Twitter application.
• Configure the Flume agent by modifying the Flume configuration:
  ▫ Configure the source, channel, and sink.
  ▫ Source type: org.apache.flume.source.twitter.TwitterSource
  ▫ Channel type: MemChannel
  ▫ Sink type: HDFS
• Run the flume-ng command to extract data from Twitter, for example:

$ flume-ng agent --conf ./conf/ -f conf/twitter.conf

Importing Twitter data into HDFS
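A sketch of what the twitter.conf file referenced above might contain, wiring together the source, channel, and sink types named on the slide. The agent name, credential placeholders, and HDFS path are assumptions for illustration, not values from the deck:

```properties
# conf/twitter.conf – hypothetical Flume agent definition
TwitterAgent.sources  = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks    = HDFS

# Source: the Twitter streaming source named on the slide
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey    = <your-consumer-key>
TwitterAgent.sources.Twitter.consumerSecret = <your-consumer-secret>
TwitterAgent.sources.Twitter.accessToken       = <your-access-token>
TwitterAgent.sources.Twitter.accessTokenSecret = <your-access-token-secret>

# Channel: in-memory buffer between source and sink
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000

# Sink: write the incoming events into HDFS
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.hdfs.path = hdfs:///user/train/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
```

Note that `flume-ng agent` also takes the agent name via `-n`/`--name` (here it would be `-n TwitterAgent`), which must match the prefix used in the configuration file.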

Page 12: Hadoop data access layer v4.0

Query Data using Hive

Page 13: Hadoop data access layer v4.0

Example HiveQL commands

Create a Hive managed table:
CREATE TABLE stockinfo (symbol STRING, price FLOAT, change FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Create a Hive external table:
CREATE EXTERNAL TABLE salaries (gender STRING, age INT, salary DOUBLE, zip INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/train/salaries/';

Load data from a file in HDFS:
LOAD DATA INPATH '/user/me/stockdata.csv' OVERWRITE INTO TABLE stockinfo;

View everything in the table:
SELECT * FROM stockinfo;

Page 14: Hadoop data access layer v4.0

Performance tuning in Hive
• Hive partitioned tables
• Hive buckets
• Optimized Row Columnar (ORC) storage format
• Cost-based SQL optimization
• Hive on Tez for low-latency queries
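The techniques above can be sketched in HiveQL. The table and column names here are illustrative, not from the deck:

```sql
-- Hypothetical partitioned, bucketed table stored as ORC
CREATE TABLE stock_history (
  symbol STRING,
  price  FLOAT,
  change FLOAT
)
PARTITIONED BY (trade_year INT)        -- lets queries prune whole partitions
CLUSTERED BY (symbol) INTO 16 BUCKETS  -- bucketing helps joins and sampling
STORED AS ORC;                         -- columnar storage with lightweight indexes

-- Cost-based optimization and the Tez engine are enabled via settings:
SET hive.cbo.enable=true;
SET hive.execution.engine=tez;
```

Queries that filter on the partition column (e.g. `WHERE trade_year = 2015`) then read only the matching partition directories instead of scanning the whole table.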

Page 15: Hadoop data access layer v4.0

Use cases for Apache Pig
• Pig can extract data from multiple sources, transform it, and store it in HDFS.
• Researching raw data.
• Iterative data processing.

(Diagram: Pig extracts data from database, log, and sensor sources, transforms it, and loads it into HDFS, where Hive and other analysis tools consume it.)

Page 16: Hadoop data access layer v4.0

Load data from a file and apply a schema:
stockinfo = LOAD 'stockdata.csv' USING PigStorage(',') AS (symbol:chararray, price:float, change:float);

Display the data in stockinfo:
DUMP stockinfo;

Filter the stockinfo data and write the filtered data to HDFS:
IBM_only = FILTER stockinfo BY (symbol == 'IBM');
STORE IBM_only INTO 'ibm_stockinfo';

Load data from a file without applying a schema:
a = LOAD 'flightdelays' USING PigStorage(',');

Apply a schema on read:
c = FOREACH a GENERATE $0 AS year:int, $1 AS month:int, $4 AS name:chararray;

Example Pig Statements

Page 17: Hadoop data access layer v4.0

Create workflow using Apache Oozie

(Diagram: an example Oozie workflow combining Sqoop, Pig, Hive, MapReduce, distcp, and email actions.)

Apache Oozie is a server-based workflow engine used to execute Hadoop jobs.

Used to build and schedule complex data transformations by combining MapReduce, Apache Hive, Apache Pig, and Apache Sqoop jobs into a single, logical unit of work.

Oozie can also perform Java, Linux shell, distcp, SSH, email, and other operations.

Oozie runs as a Java Web application in Apache Tomcat.
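A minimal sketch of an Oozie workflow definition chaining two of the action types mentioned above. The workflow name, action names, script paths, and the `${jobTracker}`/`${nameNode}` properties are placeholder assumptions, not values from the deck:

```xml
<!-- workflow.xml – hypothetical two-step workflow: a Pig transform,
     then a Hive query; a failure in either step routes to the kill node. -->
<workflow-app name="dw-optimization-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="pig-transform"/>

  <action name="pig-transform">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>transform.pig</script>
    </pig>
    <ok to="hive-load"/>
    <error to="fail"/>
  </action>

  <action name="hive-load">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>load.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Each action declares an `ok` and an `error` transition, which is how Oozie combines individual Hadoop jobs into the single, logical unit of work described above.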

Page 18: Hadoop data access layer v4.0

Use Case – Data Warehouse Optimization with Hadoop

Page 19: Hadoop data access layer v4.0

Thank you

Visit www.springpeople.com or call +91 80 6567 9700
