Cloudera Impala
Justin Erickson | Senior Product Manager
May 2013
Agenda
• Why Impala?
• Architectural Overview
• Real-World Use Cases
• Alternative Approaches
• The Platform for Big Data
©2013 Cloudera, Inc. All Rights Reserved. 2
Why Hadoop?
• Scalability
  • Simply scales just by adding nodes
  • Local processing to avoid network bottlenecks
• Flexibility
  • All kinds of data (blobs, documents, records, etc.)
  • In all forms (structured, semi-structured, unstructured)
  • Store anything, then later analyze what you need
• Efficiency
  • Cost efficiency (<$1k/TB) on commodity hardware
  • Unified storage, metadata, security (no duplication or synchronization)
What’s Impala?
• Interactive SQL
  • Typically 5-65x faster than Hive (observed up to 100x faster)
  • Responses in seconds instead of minutes (sometimes sub-second)
• Nearly ANSI-92 standard SQL queries with Hive SQL
  • Compatible SQL interface for existing Hadoop/CDH applications
  • Based on industry standard SQL
• Natively on Hadoop/HBase storage and metadata
  • Flexibility, scale, and cost advantages of Hadoop
  • No duplication/synchronization of data and metadata
  • Local processing to avoid network bottlenecks
• Separate runtime from MapReduce
  • MapReduce is designed for batch and is great at it
  • Impala is purpose-built for low-latency SQL queries on Hadoop
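To make "seconds instead of minutes" concrete, the short sketch below applies the quoted speedup factors (5x, 65x, and the observed 100x) to a Hive baseline. The 10-minute baseline is a hypothetical value chosen only for illustration; it is not a measurement from this deck.

```python
def projected_seconds(hive_seconds, speedup):
    """Runtime implied by a given speedup factor over a Hive baseline."""
    return hive_seconds / speedup


# Hypothetical baseline: a Hive query that takes 10 minutes.
HIVE_BASELINE_SECONDS = 600

for speedup in (5, 65, 100):
    seconds = projected_seconds(HIVE_BASELINE_SECONDS, speedup)
    print(f"{speedup}x faster: {seconds:.1f} s")
```

Under this assumed baseline, the low end of the range still takes a couple of minutes, while the high end lands in single-digit seconds.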
Benefits of Impala
More & Faster Value from “Big Data”
§ BI tools impractical on Hadoop before Impala
§ Move from 10s of Hadoop users per cluster to 100s of SQL users
§ No delays from data migration

Flexibility
§ Query across existing data
§ Select best-fit file formats (Parquet, Avro, etc.)
§ Run multiple frameworks on the same data at the same time

Cost Efficiency
§ Reduce movement, duplicate storage & compute
§ 10% to 1% of the cost of an analytic DBMS

Full Fidelity Analysis
§ No loss from aggregations or fixed schemas
Impala Query Execution
[Diagram: a SQL app sends a SQL request over ODBC to one of three impalad nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DataNode and HBase. Shared services: Hive Metastore, HDFS NameNode, Statestore.]
1) Request arrives via ODBC/JDBC/Beeswax/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
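The five steps above can be sketched as a toy, single-process simulation. This is not Impala code: the node names, the row data, and every function and class name below are hypothetical, and the fragments run one after another here rather than in parallel across machines. The sketch only illustrates the shape of the flow: plan fragments are assigned to the nodes that hold the data, each node pre-aggregates its local rows, and partial results are streamed to a coordinator that merges them.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical data placement: rows of (channel, clicks) stored on each node.
BLOCKS = {
    "node1": [("web", 3), ("mobile", 5)],
    "node2": [("web", 7)],
    "node3": [("mobile", 1), ("web", 2)],
}


@dataclass
class PlanFragment:
    node: str  # node expected to run this fragment, chosen because it holds the data


def plan(block_locations):
    """Step 2: turn the request into one plan fragment per node holding data."""
    return [PlanFragment(node) for node in sorted(block_locations)]


def execute_fragment(fragment):
    """Steps 3-4: scan local rows, pre-aggregate, and stream partial results."""
    partial = defaultdict(int)
    for channel, clicks in BLOCKS[fragment.node]:
        partial[channel] += clicks
    yield from partial.items()


def coordinate(fragments):
    """Steps 3-5: start each fragment, merge streamed partials, stream results."""
    totals = defaultdict(int)
    for fragment in fragments:
        for channel, subtotal in execute_fragment(fragment):
            totals[channel] += subtotal
    yield from sorted(totals.items())


def run_query():
    """Step 1 stand-in for: SELECT channel, SUM(clicks) ... GROUP BY channel."""
    return list(coordinate(plan(BLOCKS)))


print(run_query())  # [('mobile', 6), ('web', 12)]
```

Only the small per-node subtotals cross the simulated node boundary, which is the point of running fragments local to the data.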
Impala and Hive
Shares Everything Client-Facing
§ Metadata (table definitions)
§ ODBC/JDBC drivers
§ SQL syntax (Hive SQL)
§ Flexible file formats
§ Machine pool
§ Hue GUI

But Built for Different Purposes
§ Hive: runs on MapReduce and ideal for batch processing
§ Impala: native MPP query engine ideal for interactive SQL
[Diagram: Hive (SQL syntax on the MapReduce compute framework, for batch processing) and Impala (SQL syntax plus its own compute framework, for interactive SQL) share metadata, resource management, integration, and storage: HDFS (TEXT, RCFILE, PARQUET, AVRO, etc.) and HBase (records).]
Impala Use Cases
Cost-effective, ad hoc query environment that offloads the data warehouse for:
Interactive BI/analytics on more data
Asking new questions
Query-able archive w/ full fidelity
Data processing with tight SLAs
Global Financial Services Company
Saving 90% on incremental EDW spend & improving performance by 5x
Offload data warehouse for query-able archive
Store decades of data cost-effectively
Process & analyze on the same system
Improve capabilities through interactive query on more data
Six3 Systems
Boosting performance by 20X for mission-critical, real-time cyber security
Analyze unstructured data with flexibility & real-time response
Integrate with existing desktop & BI tools
Deploy in minutes with Cloudera Manager
Expedia
Implementing self-service BI on big data, reducing data latency by 50%
Offload data warehouse for archiving, ETL & analytics
Unify IT environment
Continuously ingest & analyze at scale
Drive greater usability & adoption of big data stack
Our Design Strategy
[Diagram: three engines, batch processing (MapReduce, Hive & Pig), interactive SQL (Impala), and machine learning (Mahout, DataFu), sit on shared metadata, resource management, integration, and storage: HDFS (TEXT, RCFILE, PARQUET, AVRO, etc.) and HBase (records).]
One pool of data
One metadata model
One security framework
One set of system resources
An Integrated Part of the Hadoop System
Not All SQL on Hadoop is Created Equal
§ Batch MapReduce: Make MapReduce faster. Slow, still batch.
§ Remote Query: Pull data from HDFS over the network to the DW compute layer. Slow, expensive.
§ Siloed DBMS: Load data into a proprietary database file. Rigid, siloed data, slow ETL.
§ Impala: Native MPP query engine that’s integrated into Hadoop. Fast, flexible, cost-effective.
The Impala Advantage
BI Partners: Building on the Enterprise Standard
Powered by Impala
It’s Not Just About SQL on Hadoop
The Platform for Big Data
[Diagram: three engines, batch processing (MapReduce, Hive & Pig), interactive SQL (Impala), and machine learning (Mahout, DataFu), sit on shared metadata, resource management, integration, and storage: HDFS (TEXT, RCFILE, PARQUET, AVRO…) and HBase (records), with management and support alongside.]
Single platform for processing & analytics
Scales to 1,000s of servers
No upfront schema
10% the cost per TB
Open source platform