Date posted: 26-Jan-2015
Category: Technology
Uploaded by: alexander-alten-lorenz
Cloudera Impala: Real-Time Query for HDFS and HBase
Alexander Alten-Lorenz, Cloudera, Inc.
Thursday, July 4, 13
Beyond Batch
What is Impala
Capability
Architecture
Demo
Beyond Batch
For some things MapReduce is just too slow
Apache Hive:
  MapReduce execution engine
  High latency, low throughput
  High runtime overhead
Google realized this early on
  Analysts wanted fast, interactive results
Dremel
Google paper (2010)
  “scalable, interactive ad-hoc query system for analysis of read-only nested data”
Columnar storage format
Distributed scalable aggregation
  “capable of running aggregation queries over trillion-row tables in seconds”
http://research.google.com/pubs/pub36632.html
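
Why a columnar layout helps aggregation queries can be sketched in a few lines. This is only an illustration, not Dremel's actual storage format: the toy table, its column names, and the helper functions are all assumptions made up for the example.

```python
# Illustrative only: a toy table stored two ways. Column names are hypothetical.
rows = [
    {"user": "a", "country": "DE", "bytes": 120},
    {"user": "b", "country": "US", "bytes": 300},
    {"user": "c", "country": "DE", "bytes": 80},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def sum_row_store(rows, field):
    # Row layout: every full record is visited although one field is needed.
    return sum(row[field] for row in rows)


def sum_column_store(columns, field):
    # Columnar layout: only the requested column is read.
    return sum(columns[field])


columns = to_columns(rows)
total_from_rows = sum_row_store(rows, "bytes")
total_from_columns = sum_column_store(columns, "bytes")
```

Both functions return the same total. The difference is how much data is touched: on disk, the columnar version would skip the `user` and `country` values entirely.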
Impala: Goals
General-purpose SQL query engine for Hadoop
  For analytical and transactional workloads
  Support queries that take μs to hours
Run directly with Hadoop
  Collocated daemons
  Same file formats
  Same storage managers (NN, metastore)
Impala: Goals
High performance
  C++
  Runtime code generation (LLVM)
  Direct access to data (no MapReduce)
Retain user experience
  Easy for Hive users to migrate
100% open-source
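
The idea behind runtime code generation can be sketched as follows. This is an analogy only: Impala emits native code through LLVM, whereas this sketch generates Python source at "query time". The tuple-based expression format and every function name here are hypothetical.

```python
# Analogy only: a generic interpreter versus a function generated per query.


def interpret(expr, row):
    """Generic evaluator: walks the expression tree again for every row."""
    op = expr[0]
    if op == "col":
        return row[expr[1]]
    if op == "const":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    if op == "gt":
        return left > right
    if op == "and":
        return left and right
    raise ValueError(f"unknown operator: {op}")


def to_source(expr):
    """Translate the expression tree into a Python expression string."""
    op = expr[0]
    if op == "col":
        return f"row[{expr[1]!r}]"
    if op == "const":
        return repr(expr[1])
    symbols = {"gt": ">", "and": "and"}
    if op not in symbols:
        raise ValueError(f"unknown operator: {op}")
    return f"({to_source(expr[1])} {symbols[op]} {to_source(expr[2])})"


def generate(expr):
    """Compile the expression once into a function specialized for it."""
    source = f"def predicate(row):\n    return {to_source(expr)}\n"
    namespace = {}
    exec(source, namespace)
    return namespace["predicate"]


# Hypothetical filter: WHERE bytes > 100 AND hits > 2
where = ("and",
         ("gt", ("col", "bytes"), ("const", 100)),
         ("gt", ("col", "hits"), ("const", 2)))
predicate = generate(where)

rows = [
    {"bytes": 120, "hits": 3},
    {"bytes": 300, "hits": 1},
    {"bytes": 80, "hits": 9},
]
selected = [row for row in rows if predicate(row)]
```

The generated `predicate` does no tree walking or operator dispatch per row. Removing that kind of per-row overhead is what query-specific code generation is meant to achieve.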
Impala: Capability
HiveQL (subset of SQL92)
  select, project, join, union, subqueries, aggregation, insert, alter, order by (with limit)
  DDL
Directly queries data in HDFS & HBase
  Text files (compressed)
  Sequence files (snappy/gzip)
  Avro & Parquet
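
Several of the listed constructs (select, join, aggregation, order by with limit) are combined in the small query below. Python's built-in `sqlite3` module is used purely as a local stand-in so the example runs anywhere. It is not Impala, and the tables and column names are invented for illustration.

```python
import sqlite3

# In-memory database as a stand-in engine; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, country TEXT);
    CREATE TABLE visits (user_id INTEGER, bytes INTEGER);
    INSERT INTO users VALUES (1, 'DE'), (2, 'US'), (3, 'DE');
    INSERT INTO visits VALUES (1, 120), (2, 300), (3, 80), (1, 50);
""")

# Join + aggregation + ORDER BY paired with LIMIT.
query = """
    SELECT u.country, COUNT(*) AS visit_count, SUM(v.bytes) AS total_bytes
    FROM visits v
    JOIN users u ON u.id = v.user_id
    GROUP BY u.country
    ORDER BY total_bytes DESC
    LIMIT 10
"""
result = conn.execute(query).fetchall()
conn.close()
```

The `LIMIT` clause is paired with `ORDER BY` on purpose, mirroring the "order by (with limit)" item in the list above.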
Impala: Capability
Familiar and unified platform
  Uses Hive’s metastore
  Submit queries via ODBC or the Beeswax Thrift API
Query is distributed to nodes with relevant data
  Process-to-process data exchange
  Kerberos authentication
  No fault tolerance
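
Distributing a query to the nodes that hold the relevant data can be sketched roughly as follows. This is a simplified model, not Impala's scheduler: the block placement, the node names, and the least-loaded tie-breaking rule are all assumptions for illustration.

```python
# Hypothetical block placement: block id -> nodes holding a replica.
block_locations = {
    "blk_1": ["node1", "node2"],
    "blk_2": ["node2", "node3"],
    "blk_3": ["node1", "node3"],
    "blk_4": ["node2", "node3"],
}
# Hypothetical block contents: the values a SUM query would scan.
block_data = {"blk_1": [1, 2], "blk_2": [3], "blk_3": [4, 5], "blk_4": [6]}


def assign_blocks(block_locations):
    """Give each block to the least-loaded node that stores a replica."""
    assignment = {}
    for block in sorted(block_locations):
        node = min(
            block_locations[block],
            key=lambda n: (len(assignment.get(n, [])), n),
        )
        assignment.setdefault(node, []).append(block)
    return assignment


def run_sum(assignment, block_data):
    """Each node sums its local blocks; the coordinator merges the partials."""
    partials = {
        node: sum(sum(block_data[block]) for block in blocks)
        for node, blocks in assignment.items()
    }
    return partials, sum(partials.values())


assignment = assign_blocks(block_locations)
partials, total = run_sum(assignment, block_data)
```

Every block is scanned on a node that already stores it, and only the small partial sums travel to the coordinator. The sketch has no retry logic: if a node failed mid-query, the whole query would fail, which is what "No fault tolerance" implies.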
Impala: Performance
Greater disk throughput
  ~100 MB/sec/disk
  I/O-bound workloads faster by 3-4x
Queries that require multiple MapReduce phases in Hive are significantly faster in Impala (up to 45x)
Queries that run against in-memory cached data see a significant speedup (up to 90x)
Impala: Architecture
impalad
  runs on every node
  handles client requests (ODBC, Thrift)
  handles query planning & execution
statestored
  provides name service
  metadata distribution
  used for finding data
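
The division of labor between the two daemons can be modeled with a toy name service. This is a conceptual sketch only: the class names, method names, and addresses are hypothetical and do not correspond to Impala's actual interfaces.

```python
class StateStore:
    """Toy name service: tracks which daemons are currently registered."""

    def __init__(self):
        self._members = {}

    def register(self, name, address):
        self._members[name] = address

    def unregister(self, name):
        self._members.pop(name, None)

    def members(self):
        # Return a copy so callers cannot mutate the registry.
        return dict(self._members)


class Impalad:
    """Toy query daemon: registers itself and can look up its peers."""

    def __init__(self, name, address, statestore):
        self.name = name
        self.address = address
        self.statestore = statestore
        statestore.register(name, address)

    def peers(self):
        """Other daemons this one could hand parts of a query to."""
        return sorted(n for n in self.statestore.members() if n != self.name)


statestore = StateStore()
# One daemon per node; names and addresses are made up.
daemons = [Impalad(f"impalad-{i}", f"host{i}:0", statestore) for i in range(1, 4)]
peers_before = daemons[0].peers()

statestore.unregister("impalad-3")  # simulate a daemon leaving the cluster
peers_after = daemons[0].peers()
```

Any daemon can accept a client request and then ask the name service which other daemons are available, which is why the query daemon runs on every node while the name service is a separate process.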
Current limitations
1.0.1 (available since May 2013)
  No SerDes
  No user-defined functions (UDFs)
  impalad daemons read the metastore at startup
  Metadata is refreshed from the command line
Futures
DDL support (CREATE)
Rudimentary cost-based optimizer (CBO)
Metadata distribution through statestored
Columnar storage format like Dremel’s
  Impala + Parquet = Dremel superset
Demo
[email protected]@cloudera.com
Twitter: @mapredit
Blog: mapredit.blogspot.com
Web: http://goo.gl/7sxdp