Pivotal eXtension Framework
Sameer Tiwari, Hadoop Storage Architect, Pivotal
[email protected], @sameertech
Data Analysis Timeline
• ISAM files, queried with COBOL/JCL
• RDBMS, queried with SQL
• HDFS files, queried with Map Reduce/Hive
• Next: HDFS files, queried with SQL
Simplified View of Co-existence
• HDFS Files, accessed via Map Reduce, Hive, HBase
• RDBMS Files, accessed via SQL
• Both sets of files live on HDFS
The Great Divide
PXF addresses the divide.
Pivotal eXtension Framework (PXF)
• History
o Based on external table functionality of RDBMS
o Built at Pivotal by a small team in Israel
• Goals
o Single hop
o No materialization of data
o Fully parallel for high throughput
o Extensible
Motivation for building PXF
• Use the SQL engine’s statistical/analytic functions (e.g. MADlib) on third-party data stores, e.g.
o HBase data
o Hive data
o Native data on HDFS in a variety of formats
• Join in-database dimension tables with external fact tables
• Fast ingest of data into the SQL engine’s native format (insert into … select * from …)
Motivation for building PXF
• Enterprises love the cheap storage offered by HDFS and want to keep their data there
• M/R alone is very limiting
• Integration with third-party systems, e.g. Accumulo
• Existing techniques involved copying data to HDFS, which is brittle and inefficient
High Level Flow
[Diagram: the SQL engine asks the NameNode "Where is the data for table foo?" and gets back "On DataNodes 1, 3 and 5". The protocol is HTTP, and PXF end points run on all DataNodes (DataNode1 through DataNode5).]
Major components
• Fragmenter
o Gets the locations of the fragments of a table
• Accessor
o Understands and reads a fragment, returns records
• Resolver
o Converts the records into the SQL engine's format
• Analyzer
o Provides source statistics to the query optimizer
PXF Architecture
[Sequence diagram, steps 0 to 6, spanning the HAWQ Master, HAWQ Segments, and PXF end points hosted in containers on each Data Node: (0) a PSQL client runs select * from external table foo location="pxf://namenode:50070/financedata". (1) The HAWQ Master calls the PXF Fragmenter end point, pxf://dn:8080/financedata?getFragments(). (2) getPXFWorkers() returns PXFWorkers[..] via Zookeeper. (3, 4) getSplits() returns splits[..]. (5, 6) Each HAWQ Segment issues getSplit(0) to the PXF Accessor/Resolver on its local Data Node, which reads local HDFS and returns metadata and data as PXFWritable records. M/R, Pig and Hive continue to read the same Hadoop data directly, and results can land in native PHD storage.]
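To make step 1 concrete, here is a minimal sketch of the Fragmenter call on the wire, assuming the end point shape shown in the diagram (the host, port 8080, and getFragments() query string are taken from the slide; the exact path and response format of the shipped PXF service may differ):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FragmenterCall {
    public static void main(String[] args) throws Exception {
        // Ask the PXF end point on a Data Node for the fragments of "financedata".
        URL url = new URL("http://dn:8080/financedata?getFragments()");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Print the response (assumed here to be a JSON list of fragments).
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
        conn.disconnect();
    }
}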
Classes
• The four major components are defined as interfaces and base classes that can be extended, e.g. the Fragmenter:
/*
 * Class holding information about fragments (FragmentInfo)
 */
public class FragmentsOutput {
    public FragmentsOutput();
    public void addFragment(String sourceName, String[] replicas, byte[] metadata);
    public void addFragment(String sourceName, String[] replicas, byte[] metadata, String userData);
    public List<FragmentInfo> getFragments();
}
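As an illustration, a minimal sketch of a custom Fragmenter built on this class, assuming a Fragmenter base class with a GetFragments() hook and an InputData constructor argument (the base-class names follow the Analyzer slide later in this deck, but the class name, path and host below are hypothetical):

/*
 * Hypothetical Fragmenter that reports the whole data source as a single
 * fragment hosted on one replica. Illustrative only.
 */
public class SingleFragmentFragmenter extends Fragmenter {
    public SingleFragmentFragmenter(InputData metaData) {
        super(metaData);
    }

    public FragmentsOutput GetFragments() throws Exception {
        FragmentsOutput output = new FragmentsOutput();
        // One fragment covering the whole source, located on one host.
        String[] replicas = new String[] { "dn1" };
        output.addFragment("/financedata/part-00000", replicas, new byte[0]);
        return output;
    }
}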
Accessor Interface

/*
 * Internal interface that defines the access to data on the source
 * data store (e.g., a file on HDFS, a region of an HBase table, etc).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IReadAccessor {
    public boolean openForRead() throws Exception;
    public OneRow readNextObject() throws Exception;
    public void closeForRead() throws Exception;
}

/*
 * An interface for writing data into a data store
 * (e.g., a sequence file on HDFS).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IWriteAccessor {
    public boolean openForWrite() throws Exception;
    public OneRow writeNextObject(OneRow onerow) throws Exception;
    public void closeForWrite() throws Exception;
}
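For example, a minimal sketch of an IReadAccessor that returns one record per line of a local text file. The OneRow (key, data) constructor is an assumption, and real accessors would take the file path from the fragment metadata rather than a plain constructor argument:

import java.io.BufferedReader;
import java.io.FileReader;

/* Hypothetical accessor: one OneRow per line of a local file. */
public class TextFileReadAccessor implements IReadAccessor {
    private final String path;
    private BufferedReader reader;
    private long lineNumber = 0;

    public TextFileReadAccessor(String path) {
        this.path = path;
    }

    public boolean openForRead() throws Exception {
        reader = new BufferedReader(new FileReader(path));
        return true;
    }

    public OneRow readNextObject() throws Exception {
        String line = reader.readLine();
        if (line == null) {
            return null; // end of this fragment
        }
        return new OneRow(lineNumber++, line); // assumed (key, data) constructor
    }

    public void closeForRead() throws Exception {
        reader.close();
    }
}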
Resolver Interface

/*
 * Interface that defines the deserialization of one record brought from
 * the data Accessor. Every implementation of a deserialization method
 * (e.g., Writable, Avro, ...) must implement this interface.
 */
public interface IReadResolver {
    public List<OneField> getFields(OneRow row) throws Exception;
}

/*
 * Interface that defines the serialization of data read from the DB
 * into a OneRow object. Every implementation of a serialization method
 * (e.g., Writable, Avro, ...) must implement this interface.
 */
public interface IWriteResolver {
    public OneRow setFields(DataInputStream inputStream) throws Exception;
}
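A minimal sketch of an IReadResolver that splits a comma-delimited text record into fields. It assumes OneField takes a type code and a value and that OneRow exposes its payload via getData(); both are assumptions, and the shipped classes may differ:

import java.util.ArrayList;
import java.util.List;

/* Hypothetical resolver for comma-delimited text records. */
public class CsvReadResolver implements IReadResolver {
    private static final int TEXT = 25; // assumed type code for text columns

    public List<OneField> getFields(OneRow row) throws Exception {
        List<OneField> fields = new ArrayList<OneField>();
        String line = row.getData().toString(); // assumed accessor
        for (String value : line.split(",")) {
            fields.add(new OneField(TEXT, value)); // assumed (type, value) ctor
        }
        return fields;
    }
}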
Analyzer Interface

/*
 * Abstract class that defines getting statistics for ANALYZE.
 * GetEstimatedStats returns statistics for a given path
 * (block size, number of blocks, number of tuples (rows)).
 * Used when calling ANALYZE on a PXF external table, to get the
 * table's statistics that are used by the optimizer to plan queries.
 */
public abstract class Analyzer extends Plugin {
    public Analyzer(InputData metaData) {
        super(metaData);
    }

    /*
     * path is a data source name (e.g., file, dir, wildcard, table name).
     * Returns the data statistics in JSON format.
     *
     * NOTE: It is highly recommended to implement an extremely fast logic
     * that returns *estimated* statistics. Scanning all the data for exact
     * statistics is considered bad practice.
     */
    public String GetEstimatedStats(String data) throws Exception {
        /* Return default values */
        return DataSourceStatsInfo.dataToJSON(new DataSourceStatsInfo());
    }
}
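Following the NOTE above, here is a sketch of a concrete Analyzer that estimates statistics from HDFS file metadata instead of scanning the data. The DataSourceStatsInfo(blockSize, blockCount, rowEstimate) constructor and the fixed average row width are assumptions for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/* Hypothetical analyzer: estimates stats from file metadata only. */
public class QuickHdfsAnalyzer extends Analyzer {
    private static final long AVG_ROW_BYTES = 100; // assumed average row width

    public QuickHdfsAnalyzer(InputData metaData) {
        super(metaData);
    }

    public String GetEstimatedStats(String data) throws Exception {
        FileStatus status = FileSystem.get(new Configuration())
                                      .getFileStatus(new Path(data));
        long size = status.getLen();
        long blockSize = status.getBlockSize();
        long blocks = (size + blockSize - 1) / blockSize;  // ceiling division
        long rows = size / AVG_ROW_BYTES;                  // rough tuple count
        return DataSourceStatsInfo.dataToJSON(
                new DataSourceStatsInfo(blockSize, blocks, rows)); // assumed ctor
    }
}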
Syntax - Long Form
CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?
FRAGMENTER=com.pivotal.pxf.fragmenters.HdfsDataFragmenter&
ACCESSOR=com.pivotal.pxf.accessors.LineBreakAccessor&
RESOLVER=com.pivotal.pxf.resolvers.StringPassResolver&
ANALYZER=com.pivotal.pxf.analyzers.HdfsAnalyzer')
format 'TEXT' (delimiter = ',');
Say WHAT???
Syntax - Short Form
CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?profile=HdfsTextSimple')
format 'TEXT' (delimiter = ',');
Whew!!
Built-in Profiles
• A number of profiles are built in, and more are being contributed
o HBase, Hive, HDFS Text, Avro, SequenceFiles, GemFireXD, Accumulo, Cassandra, JSON
o PXF will be fully open-sourced, for use with your favorite SQL engine
o But you can write your own connectors right now and use them with HAWQ
Predicate Pushdown
• SQL engines may push parts of the “WHERE” clause down to PXF
• e.g. “where id > 500 and id < 1000”
• PXF provides a FilterBuilder class
• Filters can be combined
• Simple expression: “constant <OP> column”
• Complex expression: “object(s) <OP> object(s)” (sketched below)
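A minimal sketch of how such a pushed-down filter can be modeled as simple expressions combined into a complex one. The class names are illustrative, not the shipped FilterBuilder API:

/* Hypothetical model of a pushed-down filter: "where id > 500 and id < 1000". */
public class FilterSketch {
    /* Simple expression: constant <OP> column. */
    static class Simple {
        final String column, op;
        final Object constant;
        Simple(String column, String op, Object constant) {
            this.column = column; this.op = op; this.constant = constant;
        }
        public String toString() { return column + " " + op + " " + constant; }
    }

    /* Complex expression: object(s) <OP> object(s). */
    static class Complex {
        final String logicalOp;
        final Object left, right;
        Complex(String logicalOp, Object left, Object right) {
            this.logicalOp = logicalOp; this.left = left; this.right = right;
        }
        public String toString() {
            return "(" + left + ") " + logicalOp + " (" + right + ")";
        }
    }

    public static void main(String[] args) {
        Object filter = new Complex("AND",
                new Simple("id", ">", 500),
                new Simple("id", "<", 1000));
        System.out.println(filter); // prints: (id > 500) AND (id < 1000)
    }
}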
Demo
• Create a text file on HDFS
• Create a table using a SQL engine (HAWQ) on HDFS
• Create an external table using PXF
• Select from both tables separately
• Finally run a join across both tables
More info online...
• http://docs.gopivotal.com/pivotalhd/PXFInstallationandAdministration.html
• http://docs.gopivotal.com/pivotalhd/PXFExternalTableandAPIReference.html
Questions?
Pivotal eXtension Framework
Sameer Tiwari, Hadoop Storage Architect, Pivotal
[email protected], @sameertech