Date posted: 16-Apr-2017
Category: Technology
Spark + HBase
Bringing HBase Data Efficiently into Spark with DataFrame Support
Zhan Zhang, Software Engineer
04/08/2016
Page 2 © Hortonworks Inc. 2014
About Zhan Zhang
Zhan Zhang (Software Engineer at Hortonworks)
Currently focuses on Apache Spark, Hadoop, etc.
Contributes to Apache Spark, YARN, HBase, Ambari, etc.
Experience in Computer Networks, Distributed Systems, and Machine Learning Platforms
Why Revamp the Existing HBase Connector?
Limited Spark Support in HBase Upstream
– Scalability
– RDD level, but Spark is moving to DataFrame/Dataset
– Data Loss and Data Duplication
Stability
– Correctness
– Stability Impact with Co-processor
– Serialized RDD Lineage to HBase
– Maintenance Overhead: Internal Hacks
What Improvements Have We Made?
Combine Spark and HBase
– Spark Catalyst Engine for Query Plan and Optimization
– HBase for Fast Access KV Store
– Implement Standard External Data Source with Built-in Filter
High Performance
– Data Locality: Move Computation to Data
– Partition Pruning: Tasks Only Performed on Region Servers (RS) Holding Requested Data
– Column Pruning / Predicate Pushdown: Reduce Network Overhead
Full-Fledged DataFrame Support
– Spark SQL
– Language Integrated Query
Run on Top of Existing HBase Table
– Native Support for Java Primitive Types
More …
Composite Key
Avro Format
Customized Serdes
Usage - Define the Catalog
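The code shown on this slide is not part of the transcript. As a rough illustration only, a catalog could look like the sketch below. It assumes the JSON layout (`table`, `rowkey`, `columns`) commonly used with the Hortonworks connector; the table name `table1`, column family `cf1`, and column names `col0`/`col1` are hypothetical.

```python
import json

# Hypothetical catalog: maps DataFrame columns onto an HBase table layout.
# The key names ("table", "rowkey", "columns", "cf", "col", "type") are an
# assumption about the connector's JSON format; every value is made up.
catalog = {
    "table": {"namespace": "default", "name": "table1"},
    "rowkey": "key",
    "columns": {
        # The row key is exposed as an ordinary column in the pseudo family "rowkey".
        "col0": {"cf": "rowkey", "col": "key", "type": "string"},
        # A regular cell: column family cf1, qualifier col1, stored as an int.
        "col1": {"cf": "cf1", "col": "col1", "type": "int"},
    },
}

# The catalog is passed around as a JSON string.
catalog_json = json.dumps(catalog, indent=2)


def rowkey_columns(cat):
    """Return the DataFrame column names that are mapped to the row key."""
    return sorted(name for name, spec in cat["columns"].items()
                  if spec["cf"] == "rowkey")
```

Keeping the mapping in one JSON document means the same string can be reused for both writing and reading, so the schema is declared once.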
Usage - Write to HBase
Usage - Construct DataFrame
Usage - Language Integrated Query
Usage - Spark SQL
Usage - With Other Data Sources
Spark HBase Connector Architecture
Byte Array Order: SHORT/INT/LONG
Byte order: 0, 1, 2, …, MAX, MIN, …, -2, -1
WHERE X <= 2
WHERE X >= -2
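The ordering problem on this slide can be reproduced in a few lines. The sketch below assumes values are stored as big-endian two's-complement bytes and compared as unsigned byte strings, which is how HBase orders keys; the helper names are hypothetical and not taken from the connector.

```python
import struct

INT_MIN, INT_MAX = -2**31, 2**31 - 1


def encode_int(x):
    """Big-endian two's-complement bytes of a 32-bit int (assumed encoding)."""
    return struct.pack(">i", x)


# Python compares bytes lexicographically as unsigned values, like HBase does,
# so sorting by the encoding shows the order in which HBase would see the values.
values = [-2, -1, 0, 1, 2, INT_MIN, INT_MAX]
byte_order = sorted(values, key=encode_int)


def scan_ranges_le(bound):
    """Inclusive byte ranges that together cover WHERE X <= bound."""
    if bound >= 0:
        # Non-negative part [0, bound] plus every negative value.
        return [(encode_int(0), encode_int(bound)),
                (encode_int(INT_MIN), encode_int(-1))]
    return [(encode_int(INT_MIN), encode_int(bound))]


def scan_ranges_ge(bound):
    """Inclusive byte ranges that together cover WHERE X >= bound."""
    if bound >= 0:
        return [(encode_int(bound), encode_int(INT_MAX))]
    # Every non-negative value plus the negative part [bound, -1].
    return [(encode_int(0), encode_int(INT_MAX)),
            (encode_int(bound), encode_int(-1))]


def matches(x, ranges):
    """True if the encoded value falls inside any of the byte ranges."""
    key = encode_int(x)
    return any(lo <= key <= hi for lo, hi in ranges)
```

Because negative numbers sort after positive ones in byte order, a single numeric range such as `X <= 2` becomes two byte ranges, one on each side of the sign boundary.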
Implementation
Partition Pruning:
– Split into Multiple Ranges, e.g., WHERE X < 2
Data Locality:
– Each RDD Partition Has a Preferred Location
Column Pruning:
– Only Required Columns in Scan/BulkGet
Predicate Pushdown:
– HBase Built-in Filters
Scan/BulkGets:
– Grouped by Region Server
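To make partition pruning and grouping by region server concrete, here is a minimal sketch. It is not the connector's code: the region boundaries, the host names `rs1` to `rs3`, and all function names are hypothetical, and keys are assumed to be big-endian 32-bit ints. Scan ranges are half-open (start inclusive, stop exclusive), with an empty byte string meaning "unbounded".

```python
import struct
from collections import defaultdict


def enc(x):
    """Big-endian two's-complement bytes of a 32-bit int (assumed key encoding)."""
    return struct.pack(">i", x)


# Hypothetical regions: (start key inclusive, end key exclusive, region server).
REGIONS = [
    (b"", enc(1000), "rs1"),
    (enc(1000), enc(-2**31), "rs2"),
    (enc(-2**31), enc(-1000), "rs3"),
    (enc(-1000), b"", "rs1"),
]


def overlaps(scan, region):
    """True if the half-open scan range intersects the region's key range."""
    s_start, s_stop = scan
    r_start, r_end, _ = region
    scan_starts_before_region_ends = r_end == b"" or s_start < r_end
    region_starts_before_scan_stops = s_stop == b"" or r_start < s_stop
    return scan_starts_before_region_ends and region_starts_before_scan_stops


def min_stop(a, b):
    """Smaller of two stop keys, where b'' means unbounded."""
    if a == b"":
        return b
    if b == b"":
        return a
    return min(a, b)


def plan_tasks(scans):
    """Clip each scan to the regions it touches and group the pieces by server."""
    tasks = defaultdict(list)
    for scan in scans:
        for region in REGIONS:
            if overlaps(scan, region):
                r_start, r_end, host = region
                tasks[host].append((max(scan[0], r_start),
                                    min_stop(scan[1], r_end)))
    return dict(tasks)


# WHERE X < 2 in byte order: [0, 2) plus every negative value [MIN, end).
tasks = plan_tasks([(enc(0), enc(2)), (enc(-2**31), b"")])
```

In this example `rs2` holds no requested keys, so no task is created for it, while the two pieces that land on `rs1` end up in the same group. Each group's host would then serve as the preferred location of the corresponding RDD partition.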
BACK UP
Kerberos Cluster
Kerberos Ticket
Token Retrieval and Renewal
Long Running Service
FLOAT/DOUBLE: IEEE-754
Byte order: 0.0, 0.2, …, MAX, -0.0, …, -2.0, …, MIN
WHERE X <= 2.0D
WHERE X >= -2.0D
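The same experiment for doubles shows why they need separate handling. The sketch assumes raw big-endian IEEE-754 bytes compared as unsigned byte strings, and it ignores infinities and NaN; the helper names are hypothetical.

```python
import struct
import sys

DBL_MAX = sys.float_info.max


def encode_double(x):
    """Raw big-endian IEEE-754 bytes of a double (assumed encoding)."""
    return struct.pack(">d", x)


values = [-DBL_MAX, -2.0, -0.2, -0.0, 0.0, 0.2, 2.0, DBL_MAX]
# Non-negative values sort ascending; negative values follow in reverse
# numeric order, starting at -0.0 and ending at the most negative value.
byte_order = sorted(values, key=encode_double)


def scan_ranges_le(bound):
    """Inclusive byte ranges covering WHERE X <= bound (finite values only)."""
    if bound >= 0:
        return [(encode_double(0.0), encode_double(bound)),
                (encode_double(-0.0), encode_double(-DBL_MAX))]
    # In byte order, values below a negative bound come after its encoding.
    return [(encode_double(bound), encode_double(-DBL_MAX))]


def scan_ranges_ge(bound):
    """Inclusive byte ranges covering WHERE X >= bound (finite values only)."""
    if bound > 0:
        return [(encode_double(bound), encode_double(DBL_MAX))]
    # All non-negative values, plus negatives from -0.0 down to the bound.
    # -abs(bound) keeps a negative bound as-is and turns 0.0 into -0.0.
    return [(encode_double(0.0), encode_double(DBL_MAX)),
            (encode_double(-0.0), encode_double(-abs(bound)))]


def matches(x, ranges):
    """True if the encoded value falls inside any of the byte ranges."""
    key = encode_double(x)
    return any(lo <= key <= hi for lo, hi in ranges)
```

Unlike the integer case, the negative half is reversed in byte order, so the negative range for `X >= -2.0D` starts at `-0.0` and ends at `-2.0`.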
HBase Meta Table