Pivotal eXtension Framework
Sameer Tiwari, Hadoop Storage Architect, Pivotal
[email protected], @sameertech
Data Analysis Timeline
• ISAM files, queried with COBOL/JCL
• RDBMS, queried with SQL
• HDFS files, queried with Map Reduce/Hive
• Next: HDFS files, queried with SQL
Simplified View of Co-existence
• HDFS Files, accessed via Map Reduce, Hive, HBase
• RDBMS Files, accessed via SQL
• Both sets of files live on HDFS
The Great Divide
PXF addresses the divide.
Pivotal eXtension Framework (PXF)
• History
o Based on external table functionality of RDBMS
o Built at Pivotal by a small team in Israel
• Goals
o Single hop
o No materialization of data
o Fully parallel for high throughput
o Extensible
Motivation for building PXF
• Use the SQL engine’s statistical/analytic functions (e.g. MADlib) on third-party data stores, e.g.
o HBase data
o Hive data
o Native data on HDFS in a variety of formats
• Join in-database dimension tables with external fact tables
• Fast ingest of data into the SQL engine’s native format (insert into … select * from …)
Motivation for building PXF
• Enterprises love the cheap storage offered by HDFS and want to keep their data there
• M/R alone is very limiting
• Integration with third-party systems, e.g. Accumulo
• Existing techniques involved copying data to HDFS, which is brittle and inefficient
High Level Flow
[Diagram: the SQL engine asks the NameNode "Where is the data for table foo?" and gets back "On DataNodes 1, 3 and 5". The protocol is HTTP, and PXF end points run on all DataNodes (DataNode1 through DataNode5).]
Major components
• Fragmenter
o Gets the locations of the fragments of a table
• Accessor
o Understands and reads a fragment, returns records
• Resolver
o Converts the records into the SQL engine's format
• Analyzer
o Provides source statistics to the query optimizer
PXF Architecture
[Sequence diagram, steps 0 to 6, spanning the HAWQ Master, HAWQ Segments, and PXF end points hosted in containers on each Data Node: (0) a PSQL client runs select * from external table foo location="pxf://namenode:50070/financedata". (1) The HAWQ Master calls the PXF Fragmenter end point, pxf://dn:8080/financedata?getFragments(). (2) getPXFWorkers() returns PXFWorkers[..] via Zookeeper. (3, 4) getSplits() returns splits[..]. (5, 6) Each HAWQ Segment issues getSplit(0) to the PXF Accessor/Resolver on its local Data Node, which reads local HDFS and returns metadata and data as PXFWritable records. M/R, Pig and Hive continue to read the same Hadoop data directly, and results can land in native PHD storage.]
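To make step 1 concrete, here is a minimal sketch of the Fragmenter call on the wire, assuming the end point shape shown in the diagram (the host, port 8080, and getFragments() query string are taken from the slide; the exact path and response format of the shipped PXF service may differ):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FragmenterCall {
    public static void main(String[] args) throws Exception {
        // Ask the PXF end point on a Data Node for the fragments of "financedata".
        URL url = new URL("http://dn:8080/financedata?getFragments()");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Print the response (assumed here to be a JSON list of fragments).
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}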
Classes
• The four major components are defined as interfaces and base classes that can be extended, e.g. the Fragmenter:
/*
 * Class holding information about fragments (FragmentInfo)
 */
public class FragmentsOutput {
    public FragmentsOutput();
    public void addFragment(String sourceName, String[] replicas, byte[] metadata);
    public void addFragment(String sourceName, String[] replicas, byte[] metadata, String userData);
    public List<FragmentInfo> getFragments();
}
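As an illustration, a minimal sketch of a custom Fragmenter built on this class, assuming a Fragmenter base class with a GetFragments() hook and an InputData constructor argument (the base-class names follow the Analyzer slide later in this deck, but the class name, path and host below are hypothetical):

/*
 * Hypothetical Fragmenter that reports the whole data source as a single
 * fragment hosted on one replica. Illustrative only.
 */
public class SingleFragmentFragmenter extends Fragmenter {
    public SingleFragmentFragmenter(InputData metaData) {
        super(metaData);
    }

    public FragmentsOutput GetFragments() throws Exception {
        FragmentsOutput output = new FragmentsOutput();
        // One fragment covering the whole source, located on one host.
        String[] replicas = new String[] { "dn1" };
        output.addFragment("/financedata/part-00000", replicas, new byte[0]);
        return output;
    }
}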
Accessor Interface

/*
 * Internal interface that defines the access to data on the source
 * data store (e.g., a file on HDFS, a region of an HBase table, etc).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IReadAccessor {
    public boolean openForRead() throws Exception;
    public OneRow readNextObject() throws Exception;
    public void closeForRead() throws Exception;
}

/*
 * An interface for writing data into a data store
 * (e.g., a sequence file on HDFS).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IWriteAccessor {
    public boolean openForWrite() throws Exception;
    public OneRow writeNextObject(OneRow onerow) throws Exception;
    public void closeForWrite() throws Exception;
}
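For example, a minimal sketch of an IReadAccessor that returns one record per line of a local text file. The OneRow (key, data) constructor is an assumption, and real accessors would take the file path from the fragment metadata rather than a plain constructor argument:

import java.io.BufferedReader;
import java.io.FileReader;

/* Hypothetical accessor: one OneRow per line of a local file. */
public class TextFileReadAccessor implements IReadAccessor {
    private final String path;
    private BufferedReader reader;
    private long lineNumber = 0;

    public TextFileReadAccessor(String path) {
        this.path = path;
    }

    public boolean openForRead() throws Exception {
        reader = new BufferedReader(new FileReader(path));
        return true;
    }

    public OneRow readNextObject() throws Exception {
        String line = reader.readLine();
        if (line == null) {
            return null; // end of this fragment
        }
        return new OneRow(lineNumber++, line); // assumed (key, data) constructor
    }

    public void closeForRead() throws Exception {
        reader.close();
    }
}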
Resolver Interface

/*
 * Interface that defines the deserialization of one record brought from
 * the data Accessor. Every implementation of a deserialization method
 * (e.g., Writable, Avro, ...) must implement this interface.
 */
public interface IReadResolver {
    public List<OneField> getFields(OneRow row) throws Exception;
}

/*
 * Interface that defines the serialization of data read from the DB
 * into a OneRow object. Every implementation of a serialization method
 * (e.g., Writable, Avro, ...) must implement this interface.
 */
public interface IWriteResolver {
    public OneRow setFields(DataInputStream inputStream) throws Exception;
}
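A minimal sketch of an IReadResolver that splits a comma-delimited text record into fields. It assumes OneField takes a type code and a value and that OneRow exposes its payload via getData(); both are assumptions, and the shipped classes may differ:

import java.util.ArrayList;
import java.util.List;

/* Hypothetical resolver for comma-delimited text records. */
public class CsvReadResolver implements IReadResolver {
    private static final int TEXT = 25; // assumed type code for text columns

    public List<OneField> getFields(OneRow row) throws Exception {
        List<OneField> fields = new ArrayList<OneField>();
        String line = row.getData().toString(); // assumed accessor
        for (String value : line.split(",")) {
            fields.add(new OneField(TEXT, value)); // assumed (type, value) ctor
        }
        return fields;
    }
}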
Analyzer Interface

/*
 * Abstract class that defines getting statistics for ANALYZE.
 * GetEstimatedStats returns statistics for a given path
 * (block size, number of blocks, number of tuples (rows)).
 * Used when calling ANALYZE on a PXF external table, to get the
 * table's statistics that are used by the optimizer to plan queries.
 */
public abstract class Analyzer extends Plugin {
    public Analyzer(InputData metaData) {
        super(metaData);
    }

    /*
     * path is a data source name (e.g., file, dir, wildcard, table name).
     * Returns the data statistics in JSON format.
     *
     * NOTE: It is highly recommended to implement an extremely fast logic
     * that returns *estimated* statistics. Scanning all the data for exact
     * statistics is considered bad practice.
     */
    public String GetEstimatedStats(String data) throws Exception {
        /* Return default values */
        return DataSourceStatsInfo.dataToJSON(new DataSourceStatsInfo());
    }
}
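Following the NOTE above, here is a sketch of a concrete Analyzer that estimates statistics from HDFS file metadata instead of scanning the data. The DataSourceStatsInfo(blockSize, blockCount, rowEstimate) constructor and the fixed average row width are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/* Hypothetical analyzer: estimates stats from file metadata only. */
public class QuickHdfsAnalyzer extends Analyzer {
    private static final long AVG_ROW_BYTES = 100; // assumed average row width

    public QuickHdfsAnalyzer(InputData metaData) {
        super(metaData);
    }

    public String GetEstimatedStats(String data) throws Exception {
        FileStatus status = FileSystem.get(new Configuration())
                                      .getFileStatus(new Path(data));
        long size = status.getLen();
        long blockSize = status.getBlockSize();
        long blocks = (size + blockSize - 1) / blockSize;  // ceiling division
        long rows = size / AVG_ROW_BYTES;                  // rough tuple count
        return DataSourceStatsInfo.dataToJSON(
                new DataSourceStatsInfo(blockSize, blocks, rows)); // assumed ctor
    }
}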
Syntax - Long Form
CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?
FRAGMENTER=com.pivotal.pxf.fragmenters.HdfsDataFragmenter&
ACCESSOR=com.pivotal.pxf.accessors.LineBreakAccessor&
RESOLVER=com.pivotal.pxf.resolvers.StringPassResolver&
ANALYZER=com.pivotal.pxf.analyzers.HdfsAnalyzer')
format 'TEXT' (delimiter = ',');
Say WHAT???
Syntax - Short Form
CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?profile=HdfsTextSimple')
format 'TEXT' (delimiter = ',');
Whew!!
Built-in Profiles
• A number of profiles are built in, and more are being contributed
o HBase, Hive, HDFS Text, Avro, SequenceFiles, GemFireXD, Accumulo, Cassandra, JSON
o PXF will be fully open-sourced, for use with your favorite SQL engine
o But you can write your own connectors right now and use them with HAWQ
Predicate Pushdown
• SQL engines may push parts of the “WHERE” clause down to PXF
• e.g. “where id > 500 and id < 1000”
• PXF provides a FilterBuilder class
• Filters can be combined
• Simple expression: “constant <OP> column”
• Complex expression: “object(s) <OP> object(s)” (sketched below)
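A minimal sketch of how such a pushed-down filter can be modeled as simple expressions combined into a complex one. The class names are illustrative, not the shipped FilterBuilder API:

/* Hypothetical model of a pushed-down filter: "where id > 500 and id < 1000". */
public class FilterSketch {
    /* Simple expression: constant <OP> column. */
    static class Simple {
        final String column, op;
        final Object constant;
        Simple(String column, String op, Object constant) {
            this.column = column; this.op = op; this.constant = constant;
        }
        public String toString() { return column + " " + op + " " + constant; }
    }

    /* Complex expression: object(s) <OP> object(s). */
    static class Complex {
        final String logicalOp;
        final Object left, right;
        Complex(String logicalOp, Object left, Object right) {
            this.logicalOp = logicalOp; this.left = left; this.right = right;
        }
        public String toString() {
            return "(" + left + ") " + logicalOp + " (" + right + ")";
        }
    }

    public static void main(String[] args) {
        Object filter = new Complex("AND",
                new Simple("id", ">", 500),
                new Simple("id", "<", 1000));
        System.out.println(filter); // prints: (id > 500) AND (id < 1000)
    }
}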
Demo
• Create a text file on HDFS
• Create a table using a SQL engine (HAWQ) on HDFS
• Create an external table using PXF
• Select from both tables separately
• Finally run a join across both tables
More info online...
• http://docs.gopivotal.com/pivotalhd/PXFInstallationandAdministration.html
• http://docs.gopivotal.com/pivotalhd/PXFExternalTableandAPIReference.html
Questions?
Pivotal eXtension Framework
Sameer Tiwari, Hadoop Storage Architect, Pivotal
[email protected], @sameertech