PXF BDAM 2016

Post on 19-Jan-2017

Shivram Mani (Pivotal)

PXF: A Unified Access Framework for HDFS Datasets

Agenda

● Motivations
● PXF Introduction
● Architecture/Design
● Developer View
● Usage/Plugins
● Value Proposition to new applications
● What's coming

Motivations: SQL on Hadoop

RDBMS

● ANSI SQL
● Cost-based optimizer
● Transactions
● ...

?

Various formats and storage systems supported on HDFS

Foreign Tables!

What is PXF?

PXF is an extension framework that does the following:

● Provides a uniform tabular view of heterogeneous data sources
● Exploits parallelism for data access
● Offers a pluggable framework for custom connectors
● Provides built-in connectors for accessing data in HDFS files, Hive/HBase tables, etc.

PXF Communication

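[Diagram: HAWQ segments read native tables through libhdfs3 (written in C) and external tables over HTTP (port 51200) through the PXF webapp's REST API. The PXF webapp runs in Apache Tomcat and reaches the data stores through their Java API or Java/Thrift.]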

Deployment Architecture

Node                    Components
HAWQ Master Node / NN   pxf, HBase Master
DN1                     pxf, HAWQ seg1, HBase Region Server1
DN2                     pxf, HAWQ seg2, HBase Region Server2
DN3                     pxf, HAWQ seg3, HBase Region Server3
DN4                     pxf, HAWQ seg4

* PXF needs to be installed on all DNs
* PXF is recommended to be installed on the NN

PXF Components

Fragmenter   Splits the dataset into partitions; returns the location of each partition
Accessor     Understands and reads/writes a fragment; returns records
Resolver     Converts records to a consumable format (data types)
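
The division of labor among the three components can be sketched in a few lines. The following Python is only an illustration of the roles described above, not PXF's actual Java API: the class names `LineFragmenter`, `LineAccessor` and `CsvResolver`, the in-memory `DATASET`, and the one-line fragment size are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical in-memory "dataset": host name -> lines stored on that host.
DATASET = {
    "dn1": ["1,alice", "2,bob"],
    "dn2": ["3,carol"],
}


@dataclass(frozen=True)
class Fragment:
    host: str    # location of the partition
    start: int   # index of the first line in the partition
    length: int  # number of lines in the partition


class LineFragmenter:
    """Splits the dataset into partitions and returns their locations."""

    def __init__(self, max_lines=1):
        self.max_lines = max_lines

    def get_fragments(self):
        fragments = []
        for host, lines in sorted(DATASET.items()):
            for start in range(0, len(lines), self.max_lines):
                length = min(self.max_lines, len(lines) - start)
                fragments.append(Fragment(host, start, length))
        return fragments


class LineAccessor:
    """Reads one fragment and returns its raw records."""

    def read(self, fragment):
        lines = DATASET[fragment.host]
        return lines[fragment.start:fragment.start + fragment.length]


class CsvResolver:
    """Converts a raw record into typed fields."""

    def resolve(self, record):
        ident, name = record.split(",")
        return (int(ident), name)


def read_all():
    """Run the fragmenter, accessor and resolver over the whole dataset."""
    accessor, resolver = LineAccessor(), CsvResolver()
    rows = []
    for fragment in LineFragmenter().get_fragments():
        for record in accessor.read(fragment):
            rows.append(resolver.resolve(record))
    return rows
```

Keeping the three roles separate is what lets a new data source reuse two of them and replace only the third, for example a new Resolver for a different record encoding.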

Architecture - Read Data Flow

[Diagram: HAWQ Master Node, NN running pxf, and DN1 running pxf and HAWQ seg1. Inside pxf: Fragmenter, Accessor, Resolver. Numbered steps:]

0. select * from ext_table
1. getFragments() API call to pxf://<location>:<port>/<path>
2. Fragments (JSON)
3. Split mapping (fragment -> segment)
4. Query dispatched to segments 1, 2, 3… (Interconnect)
5. Read() REST
6. Records
7. Records (stream)
8. Query result
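
The split-mapping step (fragment -> segment) can be illustrated with a small sketch. The slide does not say how the mapping is computed, so the policy below (prefer a segment on the same host as the fragment, otherwise pick the least-loaded segment) is purely an assumption, as are the function name and the tuple layout.

```python
from collections import defaultdict


def map_fragments_to_segments(fragments, segments):
    """Assign each fragment to one segment.

    fragments: list of (fragment_id, host) pairs
    segments:  list of (segment_id, host) pairs

    Assumed policy: prefer a segment on the fragment's host; if there is
    none, fall back to the least-loaded segment overall.
    """
    load = {seg_id: 0 for seg_id, _ in segments}
    by_host = defaultdict(list)
    for seg_id, host in segments:
        by_host[host].append(seg_id)

    assignment = {}
    for frag_id, host in fragments:
        candidates = by_host.get(host) or list(load)
        # Break ties on segment id so the result is deterministic.
        chosen = min(candidates, key=lambda seg: (load[seg], seg))
        assignment[frag_id] = chosen
        load[chosen] += 1
    return assignment
```

With segments on dn1, dn2 and dn3, two fragments stored on dn1 both go to the dn1 segment, while a fragment on a host without a segment goes to whichever segment has the least work so far.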

Read Data Flow - Take 2

PXF Developer View

PXF Usage

Built-in with Plugins

● HDFS

● Hive

● HBase

Community (https://bintray.com/big-data/maven/pxf-plugins/view)

● Cassandra

● Accumulo

● Redis

● ...

CREATE [READABLE|WRITABLE] EXTERNAL TABLE table_name
    ( column_name data_type [, ...] )
LOCATION ('pxf://host[:port]/path-to-data?PROFILE=<profile-name>[&custom-option=value...]')
FORMAT '[TEXT | CSV | CUSTOM]' (<formatting_properties>);
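
To make the syntax concrete, here is a minimal sketch that assembles the LOCATION URI and the full statement from their parts. The helper names `pxf_location` and `external_table_ddl`, the host `namenode`, the path `/data/sales.csv` and the column list are all hypothetical; the sketch does no escaping or validation and only mirrors the template above.

```python
def pxf_location(host, path, profile, port=None, **custom_options):
    """Build a pxf:// URI following the LOCATION template (no escaping)."""
    authority = host if port is None else f"{host}:{port}"
    params = [("PROFILE", profile), *custom_options.items()]
    query = "&".join(f"{key}={value}" for key, value in params)
    return f"pxf://{authority}/{path.lstrip('/')}?{query}"


def external_table_ddl(table, columns, location, fmt="TEXT",
                       fmt_props="", writable=False):
    """Render the CREATE EXTERNAL TABLE statement as a string."""
    kind = "WRITABLE" if writable else "READABLE"
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    format_clause = f"FORMAT '{fmt}'"
    if fmt_props:
        format_clause += f" ({fmt_props})"
    return (
        f"CREATE {kind} EXTERNAL TABLE {table} ({cols})\n"
        f"LOCATION ('{location}')\n"
        f"{format_clause};"
    )


# Hypothetical example: a delimited text file read with HdfsTextSimple.
location = pxf_location("namenode", "/data/sales.csv", "HdfsTextSimple",
                        port=51200)
ddl = external_table_ddl("sales", [("id", "int"), ("cmts", "text")],
                         location, fmt="TEXT", fmt_props="delimiter=','")
```

Printing `ddl` gives a statement in the same shape as the template, with the profile carried as a query parameter of the URI.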

PXF HDFS Plugin

Fragment - Splits (blocks)

● Supports reading multiple formats (see the profiles below)
● Supports writing to Sequence Files
● Chunked read optimization
● Support for stats

Profile          Description
HdfsTextSimple   Read delimited single-line records (plain text)
HdfsTextMulti    Read delimited multiline records (plain text)
Avro             Read Avro records
JSON             Supports simple/pretty-printed JSON with field projection

PXF Hive Plugin

Fragment - Splits of the file stored in table

● Text based

● SequenceFile

● RCFile

● ORCFile

● Parquet

● Avro

*Complex types are converted to text

Partition Filtering

Metadata API *

Profile    Description
Hive       Read all Hive tables (all types)
HiveRC     Hive tables stored in RC (serialized with ColumnarSerDe/LazyBinaryColumnarSerDe)
HiveText   Faster access for Hive tables stored as Text
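
Partition filtering, listed above, means that fragments belonging to partitions excluded by the query's filter never need to be read. The slide gives no detail, so the following is only a sketch of the idea under simple assumptions: partitions are described by plain dictionaries, the filter is a set of equality conditions on partition keys, and the name `prune_partitions` is hypothetical.

```python
def prune_partitions(partitions, predicate):
    """Return the files of partitions whose key values satisfy the predicate.

    partitions: list of dicts with 'values' (partition key -> value)
                and 'files' (list of file paths)
    predicate:  dict of partition key -> required value (equality only,
                an assumption made for this sketch)
    """
    selected = []
    for part in partitions:
        values = part["values"]
        if all(values.get(key) == want for key, want in predicate.items()):
            selected.extend(part["files"])
    return selected


# Hypothetical table partitioned by year and month.
PARTITIONS = [
    {"values": {"year": "2015", "month": "12"}, "files": ["p1/a", "p1/b"]},
    {"values": {"year": "2016", "month": "01"}, "files": ["p2/a"]},
    {"values": {"year": "2016", "month": "02"}, "files": ["p3/a"]},
]
```

Filtering on `{"year": "2016"}` keeps only the files of the last two partitions, so the first partition's files are never turned into fragments.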

PXF HBase Plugin

Fragment - Regions

● Read only. Uses profile 'HBase'
● Filter push down to the HBase scanner
  ○ (Operators: EQ, NE, LT, GT, LE, GE & AND)
● Direct Mapping
● Indirect Mapping
  ○ Lookup table - pxflookup
  ○ Maps attribute name to HBase <cf:qualifier>

(row key)   mapping
sales       id=cf1:saleid
sales       cmts=cf8:comments
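
The indirect mapping can be pictured as a dictionary lookup from attribute name to column family and qualifier. The sketch below copies the two pxflookup rows shown above into an in-memory dictionary; the function name `resolve_column` is hypothetical, and treating an attribute that already has the form cf:qualifier as a direct mapping is an assumption about how the two modes coexist.

```python
# In-memory stand-in for the pxflookup table: row key -> attribute mappings.
PXFLOOKUP = {
    "sales": {"id": "cf1:saleid", "cmts": "cf8:comments"},
}


def resolve_column(table, attribute, lookup=PXFLOOKUP):
    """Return (column_family, qualifier) for an external-table attribute."""
    mapping = lookup.get(table, {})
    if attribute in mapping:
        # Indirect mapping: the lookup table supplies cf:qualifier.
        target = mapping[attribute]
    elif ":" in attribute:
        # Direct mapping (assumed): the attribute is already cf:qualifier.
        target = attribute
    else:
        raise KeyError(f"no HBase mapping for {table}.{attribute}")
    family, qualifier = target.split(":", 1)
    return family, qualifier
```

For the table `sales`, the attribute `id` resolves to family `cf1` and qualifier `saleid`, which lets the external table use short, SQL-friendly column names.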

Value Proposition of PXF

● Abstracts the application from external datasources/APIs/versions
● Focus on one data layout
● Off-the-shelf support for various datasources
● Extensibility: ease of supporting custom datasources
● Provides means for filter push down
● Dataset statistics for performance optimization

PXF with Postgres

● Using FDW callback functions that will interact with PXF.

[Diagram: a Postgres FDW talks over HTTP (port 51200) to the PXF webapp's REST API, which runs in Apache Tomcat and reaches the data stores through their Java API or Java/Thrift.]

What's coming

● HA
● Schema Auto Discovery (Metadata)
● Support for more dataset statistics
● Time series data optimization
● More plugins (Gemfire, Solr, etc.)
● Additional filter push down support
● Custom Output Format

Contribution

Feature Areas                   Custom Plugins (storage, formats), Push Down Filters, Custom Applications
Documentation                   Wiki/Docs: cwiki.apache.org/confluence/display/HAWQ/PXF, http://hawq.incubator.apache.org/docs/pxf/javadoc
Code / Review                   Github (Apache): github.com/apache/incubator-hawq/tree/master/pxf
JIRA                            issues.apache.org/jira/browse/HAWQ (Component = PXF)
Join Discussion/Ask Questions   Apache DLs: dev@hawq.incubator.apache.org, user@hawq.incubator.apache.org
Github (Field)                  github.com/Pivotal-Field-Engineering/pxf-field

Thank you!