Date post: | 05-Apr-2017 |
Category: |
Technology |
Upload: | john-park |
1 © Hortonworks Inc. 2011 – 2017 All Rights Reserved
SQL on Hadoop: Batch, Interactive and Beyond
SoCal Big Data Day
John Park, Solution Engineer, Hortonworks
Rm 138-140
Disclaimer
This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.
Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
Presenter
John Park
• Solution Engineer, SoCal
• Data Science ETL, data warehousing, software design, architecture
• Previous – Various Startups, Qlik, DW consultant, NCR
• Current – Helping customers implement and understand open source big data platforms
• Twitter: @jpark328
• Email: [email protected]
Before We Begin
• We have a raffle
• 2 winners at the end of the presentation
• Prize – Amazon Echo Dot
• Ask questions
Survey Link: https://www.surveymonkey.com/r/940amSQLHadoopBatch
SQL is King
Why?
– Familiarity
  • Primary technical language of business analysts
– Powerful
  • Maturation of RDBMS, EDW, OLTP
  • ACID Compliant
– Flexible
  • Covers Transactional Processes to Analytics
– Pervasive
  • Emergence of BI tools (Tableau, BOBJ, Cognos)
  • Deep ecosystem of tools
Overview of SQL on Hadoop Solutions
• SparkSQL: Spark's module for working with structured data. Run SQL queries alongside complex analytic algorithms.
• Apache Hive: A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
• Apache Phoenix: High performance relational database layer over HBase for low latency applications.
• Traditional MPP on Hadoop: Many traditionally architected MPP solutions have been ported to Hadoop and some new ones have been developed from scratch.
SQL on Hadoop: Vitals
Project | First GA Release | Lines of Code (June 2015) (*) | Most Typical Use
Apache Hive | April, 2009 (7 Years) | 1 Million | EDW / ETL Offload
SparkSQL | March, 2015 (2 Years) | 56.6k | Exploratory Analytics
Apache Phoenix | March, 2014 (3 Years) | 200k | Low-Latency Dashboards
Apache Hive: Fast Facts
• Most Queries Per Hour: 100,000 Queries Per Hour (Yahoo Japan)
• Analytics Performance: 100 Million rows/s Per Node (with Hive LLAP)
• Largest Hive Warehouse: 300+ PB Raw Storage (Facebook)
• Largest Cluster: 4,500+ Nodes (Yahoo)
Phoenix and HBase: Fast Facts
• Largest Database: 5 Petabytes (Flurry)
• Best Known App: Facebook Messages (Facebook)
• Fastest Ingestion: 10 Million Events/s (Yahoo)
• Biggest SQL App: Real-Time SQL on 140m+ Records (PubMatic)
Apache Hive: Strengths and Cautions
Strengths (+)
• Huge Datasets
• Deep SQL Analytics
• EDW Offload
• BI Integration
Cautions (?)
• Near-Real-Time
SparkSQL: Strengths and Cautions
Strengths (+)
• Language-Integrated Query
• Exploratory Analytics
Cautions (?)
• Large Datasets
• High Concurrency
• EDW Offload
Apache Phoenix: Strengths and Cautions
Strengths (+)
• High Concurrency
• Near-Real-Time Query
• Fast Updates
Cautions (?)
• Deep SQL Analytics
• Full-Table Scans / Scaled Analytics
• Existing BI Integrations
SQL on Hadoop - Good to Know
• No one-size-fits-all solution
• Use cases and query patterns are important
• Prototype and fail fast
• Define scalability and performance criteria
SQL on Hadoop Use Cases
Hive: Analytics Use Case
Financial Services Company:
– Analyze large dataset to identify potential fraud.
– Re-platformed from a mature EDW platform.
– Selection drivers: Breadth of SQL support, query performance, cloud consumption.
Use Case Vitals:
– Analyze > 25 billion transactions per week.
– More than 1.5 TB new data per day.
– > 4 PB historical data available for analysis through cloud infrastructure.
Hive Performance with Scaling - Customer results on HDP 2.2
[Chart: Scalability on Hive. Elapsed time (seconds, axis 0 to 3,000) for the Multi join - Allocation, Aggregation, and Total workloads at 5, 10, 20, 40, and 60 nodes.]
Benchmark test | 5 nodes | 10 nodes | 20 nodes | 40 nodes* | 60 nodes*
Multi join | 24:02 | 14:33 | 10:32 | 06:54 | 05:49
Aggregation | 21:59 | 12:20 | 07:55 | 05:16 | 02:38
Total | 46:02 | 26:53 | 18:27 | 12:10 | 08:27
Same Workload on EDW (Full Rack): 8:00
(*) Projected times based on 5, 10 and 20 node results.
Aggregation Workload
• 5% more time required on Hive.
• < 50% solution cost versus traditional EDW.
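The scaling figures in the benchmark table can be checked with a little arithmetic. The sketch below assumes the times are minutes:seconds (consistent with the chart axis being in seconds) and uses the 5-node run as the baseline; the names `totals`, `to_seconds`, and `scaling` are illustrative only. Note that the 40- and 60-node totals are projections, so the derived ratios inherit that uncertainty.

```python
# Total elapsed times from the benchmark table, read as mm:ss (an assumption).
totals = {5: "46:02", 10: "26:53", 20: "18:27", 40: "12:10", 60: "08:27"}
EDW_FULL_RACK = "08:00"  # same workload on the EDW, full rack


def to_seconds(mmss):
    """Convert an 'mm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


baseline_nodes = 5
baseline = to_seconds(totals[baseline_nodes])

scaling = {}
for nodes, elapsed in totals.items():
    # Speedup relative to the 5-node run.
    speedup = baseline / to_seconds(elapsed)
    # Efficiency: achieved speedup divided by the increase in node count.
    efficiency = speedup / (nodes / baseline_nodes)
    scaling[nodes] = (speedup, efficiency)
    print(f"{nodes:>2} nodes: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")

# Extra time on 60 Hive nodes compared with the EDW full rack.
extra_vs_edw = to_seconds(totals[60]) / to_seconds(EDW_FULL_RACK) - 1
print(f"60 nodes vs EDW full rack: {extra_vs_edw:.1%} more time")
```

Under these assumptions the 60-node total (507 s) is about 5.6% slower than the EDW's 480 s, which lines up with the roughly 5% figure quoted on the slide.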
SparkSQL Use Case: Medical
[Diagram components: Sensor Data, HDFS, Aggregations (Hive), HCatalog, SparkSQL, JDBC Connector, Analytical Tools]
- Sensor data streamed into HDFS
- Large-scale pre-aggregations done using Hive
- SparkSQL powered dashboard for fast analytics.
Phoenix at PubMatic
Near-Real-Time SQL over >15 TB of Data Using Apache Phoenix
Apache Phoenix at PubMatic
Key Concerns
• PubMatic offers marketing automation with real-time analytics that enable publishers to make smarter and faster decisions.
• To empower publishers to make real-time decisions, PubMatic needs a SQL solution that scales to terabytes of data yet can process hundreds of thousands of queries daily with near-real-time SLAs.
Solution
• Phoenix is the only Open Source SQL Solution for Hadoop designed for near-real-time querying, giving PubMatic’s publishers the timely insight they need to optimize their advertising strategies.
• Phoenix’s linear scalability enables PubMatic to offer real-time query over more than 15 terabytes of data using commodity hardware.
• Phoenix’s ANSI SQL interface makes it easy for publishers to slice and dice data the way they want.
Read more at http://phoenix.apache.org/who_is_using.html
SQL on Hadoop Next Evolution
Evolution of Hive
Phase 1 (Delivered: HDP 2.2): Batch/ETL
• Transactions with ACID allowing insert, update and delete
• Temporary Tables
• Cost Based Optimizer optimizes complex join queries well.
Phase 2 (Delivered: HDP 2.5): Faster SQL
• Tech Preview: Sub-5-Second queries with LLAP
• Usability: SQL Query Editor, Visual Explain and Debugging
• Transparent Data Encryption
• Cross-Site Replication
• SQL, Performance Improvements
• Hive-on-Spark (Alpha / Beta)
Phase 3 (Planned: HDP 2.6*): Sub-Second with Rich Analytics
• Rich SQL:2011 Analytics
• Tech Preview: Druid OLAP Index for Hive
• GA: Sub-Second queries with LLAP
• Transaction Improvements (BEGIN/COMMIT/ROLLBACK, MERGE)
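To make the MERGE item concrete, here is a minimal sketch of the semantics a SQL-standard MERGE provides: each incoming change either updates a matching row, deletes it, or is inserted as a new row. It is plain Python over dictionaries, not Hive code; the `merge` function, the `deleted` flag, and the sample rows are all hypothetical and chosen only for illustration.

```python
def merge(target, source, key="id"):
    """Return a new table with MERGE-style changes from `source` applied.

    Rows are dicts matched on `key`. A change row carrying a truthy
    'deleted' flag (a convention invented for this sketch) removes the match.
    """
    # Copy the target rows so the input table is left untouched.
    result = {row[key]: dict(row) for row in target}
    for change in source:
        values = {k: v for k, v in change.items() if k != "deleted"}
        if change[key] in result:
            if change.get("deleted"):
                del result[change[key]]              # WHEN MATCHED AND deleted THEN DELETE
            else:
                result[change[key]].update(values)   # WHEN MATCHED THEN UPDATE
        elif not change.get("deleted"):
            result[change[key]] = values             # WHEN NOT MATCHED THEN INSERT
    return [result[k] for k in sorted(result)]


customers = [{"id": 1, "city": "LA"}, {"id": 2, "city": "SD"}]
changes = [
    {"id": 2, "city": "SF"},       # matches id 2: update
    {"id": 3, "city": "NY"},       # no match: insert
    {"id": 1, "deleted": True},    # matches id 1: delete
]
merged = merge(customers, changes)
print(merged)
```

The point of a single MERGE statement is that all three outcomes are decided in one pass over the change set, which is the usual pattern for applying change-data-capture feeds to a warehouse table.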
Apache Hive: Modern Architecture
[Diagram layers and components:]
• Storage: Columnar Storage (ORCFile, Parquet); Unstructured Data (JSON, CSV, Text, Avro, Custom, Weblog)
• Engine: SQL Engines (Row Engine, Vector Engine)
• SQL: SQL Support (SQL:2011, Optimizer, HCatalog, HiveServer2)
• Cache: Block Cache (Linux Cache); Vector Cache (LLAP)
• Distributed Execution: Hadoop 1 (MapReduce); Hadoop 2 (Tez, Spark); Persistent Server
Legend: Historical, Current, In Development
Sub-Second Hive with LLAP
Sub Second:
• LLAP: Persistent server to instantly execute SQL queries.
• Caches hottest data in RAM.
• Overcomes latencies associated with Hive on Tez or Hive on Spark.
SQL Compatibility:
• 100% Compatible with Hive SQL.
• Compatible with existing tools (BI, ETL, etc.)
Security:
• Security via HiveServer2.
• Integrates with Apache Ranger.
[Diagram: Hive SQL is submitted to HiveServer2 and served by LLAP servers (1 per Hadoop node), each with its own vector cache.]
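The idea of keeping the hottest data in RAM can be sketched with a small least-recently-used cache. This is only an illustration of the caching concept: the slides do not say which eviction policy LLAP uses, so LRU is an assumption here, and `HotDataCache` and `load_from_storage` are hypothetical names.

```python
from collections import OrderedDict


class HotDataCache:
    """Tiny LRU cache: recently read blocks stay in memory, cold ones are evicted."""

    def __init__(self, capacity, load_from_storage):
        self.capacity = capacity
        self._load = load_from_storage   # stand-in for a slow read from deep storage
        self._data = OrderedDict()       # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._data[key]
        self.misses += 1
        value = self._load(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value


cache = HotDataCache(capacity=2, load_from_storage=lambda key: f"block-{key}")
for key in ["a", "b", "a", "c", "a", "b"]:
    cache.get(key)
print(cache.hits, cache.misses)
```

In this access pattern the frequently read block "a" stays cached while "b" and "c" take turns being evicted, which is the behaviour a persistent server with a shared cache is meant to exploit.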
Hive 2 with LLAP: Architecture Overview
[Diagram components:]
• HiveServer2 (Query Endpoint): receives ODBC / JDBC SQL queries
• Query Coordinators
• YARN Cluster: LLAP Daemons, each running Query Executors, with an In-Memory Cache (Shared Across All Users)
• Deep Storage: HDFS, S3 + Other HDFS Compatible Filesystems
Apache Hive: Journey to SQL:2011 Analytics
Data Types
• Numeric: FLOAT, DOUBLE; DECIMAL; INT, TINYINT, SMALLINT, BIGINT; BOOLEAN
• String: CHAR, VARCHAR; BLOB (BINARY), CLOB (String)
• Date, Time: DATE, TIMESTAMP, Interval Types
• Complex Types: ARRAY / MAP / STRUCT / UNION
• Nested Data Analytics: Nested Data Traversal; Lateral Views
• Procedural Extensions: HPL/SQL
SQL Features
• Core SQL Features: Date, Time and Arithmetical Functions; INNER, OUTER, CROSS and SEMI Joins; Derived Table Subqueries; Correlated + Uncorrelated Subqueries; UNION ALL; UDFs, UDAFs, UDTFs; Common Table Expressions; UNION DISTINCT
• Advanced Analytics: OLAP and Windowing Functions; OLAP: Partition, Order by UDAF; CUBE and Grouping Sets
• ACID Transactions: INSERT / UPDATE / DELETE
• Constraints: Primary / Foreign Key (Non Validated)
File Formats
• Columnar: ORCFile, Parquet
• Text: CSV, Logfile
• Nested / Complex: Avro, JSON, XML
• Custom Formats
• Other Features: XPath Analytics
Futures
• ACID MERGE
• Multi Subquery
• Scalar Subqueries
• Non-Equijoins
• INTERSECT / EXCEPT
• Recursive CTEs
• NOT NULL Constraints
• Default Values
• Multi Table Transactions
Legend: New, Projected: HDP 3.0, HDP 2.6
Track Hive SQL:2011 Complete: HIVE-13554
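Two of the listed features, common table expressions and windowing functions, are easiest to appreciate in a query. The sketch below runs such a query through Python's built-in `sqlite3` module purely as a stand-in SQL engine (it needs SQLite 3.25 or later for window functions); it is not Hive, and the `sales` table and its rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("west", 1, 100), ("west", 2, 150), ("west", 3, 120),
        ("east", 1, 200), ("east", 2, 180),
    ],
)

query = """
WITH monthly AS (                      -- common table expression
    SELECT region, month, SUM(amount) AS total
    FROM sales
    GROUP BY region, month
)
SELECT region, month, total,
       SUM(total) OVER (               -- windowing function: running total
           PARTITION BY region ORDER BY month
       ) AS running_total
FROM monthly
ORDER BY region, month
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The same shape of query (a CTE feeding an `OVER (PARTITION BY ... ORDER BY ...)` aggregate) is what BI tools typically generate for running totals and rankings, which is why these features matter for EDW offload.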
Phoenix SQL: Today and Tomorrow
Phoenix: SQL for HBase
• SQL Datatypes (VARCHAR, INTEGER, etc.)
• UNION ALL
• JOINs: Inner, Left/Right Outer, Cross
• UPSERT / DELETE
• Derived Tables
• GROUP BY, ORDER BY, HAVING
• AVG, COUNT, MIN, MAX, SUM
• Primary keys, NOT NULL constraints
• CASE, COALESCE
• VIEWs
• Secondary Indexes
• Flexible Schema
• Functional Indexes
• Date / Time Functions
• UDFs
• Multi Table Transactions
• SQL GRANT / REVOKE
• Replication Management
• Column Constraints and Defaults
• OLAP, Cubing, Rollup
• UNION
Legend: Current, Future, Phoenix 4.4
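UPSERT is the write primitive that stands out in this list: a row is inserted if its primary key is new and overwritten if the key already exists. The sketch below imitates that behaviour with SQLite's `INSERT OR REPLACE` through Python's built-in `sqlite3` module. It is a stand-in, not Phoenix syntax, and the `metrics` table and the `upsert` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE metrics (
        host TEXT NOT NULL,
        ts   INTEGER NOT NULL,
        cpu  REAL,
        PRIMARY KEY (host, ts)
    )
    """
)


def upsert(host, ts, cpu):
    # Insert the row, or replace the existing row with the same primary key.
    conn.execute(
        "INSERT OR REPLACE INTO metrics (host, ts, cpu) VALUES (?, ?, ?)",
        (host, ts, cpu),
    )


upsert("web1", 1000, 0.40)
upsert("web1", 1000, 0.75)   # same key: overwrites the earlier reading
upsert("web2", 1000, 0.10)   # new key: inserted

conn.execute("DELETE FROM metrics WHERE host = ?", ("web2",))
rows = conn.execute("SELECT host, ts, cpu FROM metrics ORDER BY host").fetchall()
print(rows)
```

Because writers never need to check whether a row exists first, this style of write suits the high-concurrency, fast-update workloads listed among Phoenix's strengths.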
Looking forward - What Is Druid?
Druid is a distributed, real-time, column-oriented datastore designed to quickly ingest and index large amounts of data and make it available for real-time query.
Features:
• Streaming Data Ingestion
• Sub-Second Queries
• Merge Historical and Real-Time Data
• Approximate Computation
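"Approximate computation" means trading a small, bounded error for a large saving in memory and time, for example when counting distinct users. The sketch below uses a K-minimum-values estimator, picked here only because it is short; the slides do not say which algorithms Druid uses, and `approx_distinct` and its parameters are hypothetical names.

```python
import hashlib
import heapq


def approx_distinct(items, k=1024):
    """Estimate the number of distinct items while remembering only k hash values."""
    heap = []        # max-heap (stored negated) holding the k smallest hashes seen
    kept = set()     # the same hashes, for duplicate detection
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big") / 2**64   # map the hash into [0, 1)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)              # fewer than k distinct values: count is exact
    # The k-th smallest hash sits near k / n, so n is roughly (k - 1) / that hash.
    return int((k - 1) / -heap[0])


events = [f"user-{i % 50000}" for i in range(200000)]   # 50,000 distinct users
estimate = approx_distinct(events)
print(estimate)
```

With `k=1024` the expected relative error is on the order of a few percent, while memory stays fixed no matter how many events are scanned. Sketches of this kind can also be merged across data segments, which fits a system that combines historical and real-time data.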
Druid’s Role in Scalable Data Warehousing
[Diagram components:]
• UI: Superset UI (Fast Exploration), Builder UI, SQL BI Tools, MDX Tools
• Unified SQL and MDX Layer: HiveServer2 (Hive SQL), Thrift Server (SparkSQL), MDX (Fast SQL MDX)
• Core Platform: Hive, Druid OLAP Indexes, Realtime Feeds (Kafka, Storm, etc.), S3 or HDFS
• Management: Ambari, SmartSense, Ranger, Atlas
SQL on Hadoop Conclusion
Hive 2 with LLAP: 26x Performance Boost
[Chart: Hive 1 / Tez time (s) and Hive 2 / LLAP time (s) (lower is better), plus speedup (x factor), for query43, query73, query63, query3, query7, query89, query34, query42, query27, query52, query55, query13, query79, query98, and query19.]
Hive 2 with LLAP averages 26x faster than Hive 1.
SQL on Hadoop: Investment Areas
Interactive Performance
• Caching in Flash / SSD
• Fast Analytics on Raw Text
• Materialized Views
SQL Compliance
• Comprehensive SQL:2011 Support
• SQL ACID
• SQL Standard MERGE
EDW Integrations
• Joint AtScale / Syncsort Roadmap
• OLAP Indexes with Druid
SQL on Hadoop Summary
Project | Strengths | Use Cases | Unique Capabilities
Apache Hive | Most Comprehensive SQL; Scale; Maturity | ETL Offload; Reporting; Large-Scale Aggregations | Robust Cost-Based Optimizer; Mature Ecosystem (BI, Backup, Security, Replication)
SparkSQL | In-Memory; Low Latency | Exploratory Analytics; Dashboards | Language-Integrated Query
Apache Phoenix | Real-Time Read/Write; Transactions; High Concurrency | Dashboards; System-of-Engagement; Drill-Down / Drill-Up | Real-Time Read/Write
Scalable Data Warehousing on Hadoop: Overview
[Diagram: pipeline stages and the components shown]
• Stages: Ingest and Store; ETL, Data Mining, Advanced Analytics; Interactive SQL, Reporting, OLAP
• Components: Kafka, HDFS, NiFi, Druid (Future), Hive LLAP, HAWQ, AtScale, Spark, Hive, HPL/SQL, ACID, Syncsort DMX-h ETL, Other ETL Tools
• Front ends: Zeppelin, Ambari Hive View, BI Tools, Reporting Tools
• Governance and security: Atlas (Governance and Lineage), Ranger (Advanced Security)
Thank You
https://www.surveymonkey.com/r/940amSQLHadoopBatch