Analytical Query Processing - Marco Serafini

transcript

Analytical Query Processing

Marco Serafini

COMPSCI 532Lecture 7

Announcement• Midterm date and location confirmed

• October 22 at 7-9pm in ILC S331

MapReduce vs. DBMSs

Advantages of DBMSs• Abstract data representation

• Relational model• Data storage is delegated to the DBMS

• Functional query language (SQL)• Queries specified as simple relational operators• Actual query execution delegated to the DBMS…• … including parallelism, distribution, pipelining etc.

• Support for indexing

Disadvantages of DBMSs• SQL is a limited interface for complex analytics

• E.g. image analysis, creating maps• Need to define a schema for data a priori• High cost of loading data and indexing

• Can be amortized only if same data and schema reused• Too complex for “one shot” analytics

Advantages of MapReduce• Support for arbitrary UDFs• Support for a variety of arbitrary data formats• Simple API• Scalability

Disadvantages• Many of the optimizations of DBMS must be reimplemented, for example

• Indices• Query execution plans (logical + physical)• Column-based storage• Data format specifications (ProtoBuf)• Support for updates

• Several efforts towards closing the gap for analytics

Data Analytics

In Situ Analytics• Data dumped on GFS/HDFS (data lake)• Some of this data is relational• Several systems to execute relational queries on HDFS data

• SQL-like language• Query optimization• Columnar data representation

• Can build on top of MR/Spark (e.g. Hive, SparkSQL) or not (e.g. Dremel, Impala, Presto)

• We will discuss both classes

Analytical Queries• Long-running, complex queries• Often aggregates• Run on read-only data or snapshots of dynamic data• Data characteristics

• Tuples (rows) have many possible attributes (columns)• A row will have only a subset of attributes set

Star Schema: Facts and Dimensions• Popular schema for analytics/data warehousing

• Many others exist!• At the center is a large fact table• Foreign-key references to small dimension tables

Dremel• In-situ analytics• Independent query executor (not on top of MR)• Uses columnar store

Data Model: Column Families• Also called column groups, nested columns• Common to many systems, e.g. Cassandra, HBase

New record Nested in repeated Language

Nested in repeated Name

Assembling a Row• Finite state machine

Pros and Cons of Columnar Model• Pros

• Compression: columns have uniform values• Less data to scan on projections (which are common)

• Cons• Additional CPU load to decompress columns and rebuild rows

Query Execution

SparkSQL: Spark + DBMS• Extend Spark with

• Simple, high-level SQL-like operators• Query optimization

• No need to transfer data across systems• ETL, query processing, complex analytics in one system

Architecture

DataFrames• Collection of rows with homogeneous schema

• Like a table in a DBMS• Can be manipulated like an RDD

• DataFrame operations• Similar to Python Pandas or R data frames• Evaluated lazily (query planning is postponed)• Can optimize across multiple queries

SparkSQL Query Execution

Advantages• Relational structure enables query optimization• In-memory caching using columnar representation

• Better compression• Mix SQL-like operators and arbitrary code

• More flexible than UDFs in DBMSs• Can optimize across multiple SQL operations

Catalyst• Query optimizer of SparkSQL• Rule-based optimization

• Rule: find pattern and transform• Used for both logical and physical plans• Can customize rules

• Code generation• Directly outputs bytecode (as opposed to interpreting a plan)• Much more CPU efficient

• Flexible data sources• Can change the physical representation of DataFrames• Still use the optimizer

Catalyst: Rule-Based Optimization• Apply rules to subtree until fixed point

Execution tree Transformation rules

Catalyst: Code Generation• Faster than interpreting a physical plan

Analytical Query Processing - Marco Serafini

Documents