
Performance Engineering for Apache Spark and Databricks Runtime

Transcript:
  • Performance Engineering for Apache Spark and

    Databricks Runtime

    ETHZ, Big Data HS19


    Bart Samwel, Sabir Akhadov

  • About Databricks & the Presenters

    Databricks: "startup" by the original creators of Apache SparkTM with 1000+ employees (engineering in SF and Amsterdam).

    Bart Samwel: software engineer @ European Development Center, Tech Lead of performance engineering teams

    Sabir Akhadov: software engineer @ EDC, performance benchmarking team

  • Our Job: More Speed

    • Make Databricks Runtime (= Apache Spark + extensions) faster
      • Find the bottlenecks
      • Translate research & insights
      • Invent novel ways to speed it up

    • Make sure it doesn't slow down
      • Regression performance benchmarking

    • Make sure it's faster than the competition
      • Competitive benchmarking & analysis

  • The Paths to Speed

    1. Do less (reduce I/O or data volume)
    2. Be prepared (do stuff ahead of time, e.g. indexes, clustering)
    3. Do things once (caching)
    4. Be smarter (better algorithms / query plans)
    5. Go faster (better raw execution speed)

    You need all of these to win the race!

  • Query plan over an RDD: Scan -> Filter -> Project

    SELECT Store, Amount
    FROM Sales
    WHERE day_of_week = "Friday" (*)

    (*) You can't actually write Spark SQL and get plain RDDs. This is just to make a point.

  • Scan over an RDD

    SELECT Store, Amount
    FROM Sales
    WHERE day_of_week = "Friday"

    We don't know what's in there!

  • The same plan over a DataFrame: Scan -> Filter -> Project

    ● Filter in the data source
    ● Read only the columns you need!

  • Traditional MapReduce / RDDs are opaque.
    SQL / DataFrames are transparent.

    Key Insight:
    • Opaque / operational != optimizable
    • Transparent / declarative = optimizable
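    A minimal spark-shell style sketch of this contrast; the table name "sales", the column names, and the record type are assumptions for illustration, not part of the slides:

      import org.apache.spark.sql.functions.col

      // Opaque: the predicate is an arbitrary lambda, so Spark must materialize
      // every row and every column before it can evaluate it.
      case class Sale(store: String, amount: Long, dayOfWeek: String)
      val salesRdd = spark.sparkContext.parallelize(Seq(Sale("Zurich", 100L, "Friday")))
      val opaque = salesRdd.filter(_.dayOfWeek == "Friday").map(s => (s.store, s.amount))

      // Transparent: column names and the predicate are visible to Catalyst,
      // which can prune columns and push the filter into the data source.
      val declarative = spark.table("sales")
        .where(col("day_of_week") === "Friday")
        .select("Store", "Amount")
      declarative.explain(true)   // optimized plan shows the pushed filter and pruned columns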

  • 1. Do Less

    • Read only the columns you need
      • Columnar file formats (Parquet, ORC)
      • NOT Avro, JSON, CSV, ...

    • Can you avoid reading files at all?
      • Yes, if query has filters!
      • But: need clustering!
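    For example, a sketch of reading a columnar Parquet dataset (the path and column names are assumptions): selecting only the needed columns and filtering lets Spark prune the read schema and push the predicate into the Parquet reader, which can skip row groups via min/max statistics.

      import org.apache.spark.sql.functions.col

      val ch = spark.read.parquet("/data/sales")     // hypothetical path
        .where(col("Country") === "CH")              // pushed down to the Parquet reader
        .select("Store", "Amount")                   // only these columns (plus Country) are read
      ch.explain(true)                               // check PushedFilters and ReadSchema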

  • Unclustered data spread over three files:

    File 1: ('CH', 2019-12-03, 3231), ('NL', 2019-12-03, 2216), ('CH', 2019-11-29, 3283), ('CH', 2019-12-02, 1823)
    File 2: ('NL', 2019-12-02, 2731), ('NL', 2019-12-01, 812), ('CH', 2019-11-30, 12833), ('NL', 2019-11-30, 1823)
    File 3: ('CH', 2019-11-29, 5122), ('NL', 2019-11-28, 8975), ('NL', 2019-11-29, 2617), ('CH', 2019-11-28, 8537)

    SELECT Store, Amount
    FROM Sales
    WHERE Country = 'CH'

    Every file contains 'CH' rows, so none of the files can be skipped.


  • After clustering by Country:

    File 1: ('CH', 2019-12-03, 3231), ('CH', 2019-11-30, 12833), ('CH', 2019-11-29, 3283), ('CH', 2019-12-02, 1823)
    File 2: ('NL', 2019-12-02, 2731), ('NL', 2019-12-01, 812), ('NL', 2019-12-03, 2216), ('NL', 2019-11-30, 1823)
    File 3: ('CH', 2019-11-29, 5122), ('NL', 2019-11-28, 8975), ('NL', 2019-11-29, 2617), ('CH', 2019-11-28, 8537)

    SELECT Store, Amount
    FROM Sales
    WHERE Country = 'CH'

    File 2 contains only 'NL' rows, so its min/max metadata lets the scan skip it entirely.


  • Knowing How to Do Less

    • Know which file contains what data
    • For each file, for each column:
      • min/max value
      • bloom filters

    • Parquet files are self-describing!
      • So you have to read the file to skip it ?!?

    • Cloud storage is high latency
      • Need consolidated metadata cache / storage
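    A purely illustrative sketch of min/max file skipping; FileStats and its values are hypothetical stand-ins for the per-file metadata that a consolidated cache would hold:

      case class FileStats(path: String, minCountry: String, maxCountry: String)

      val files = Seq(
        FileStats("file1.parquet", "CH", "NL"),   // mixed countries
        FileStats("file2.parquet", "NL", "NL"),   // only NL
        FileStats("file3.parquet", "CH", "NL"))   // mixed countries

      // For WHERE Country = 'CH', read only files whose [min, max] range can contain 'CH'.
      val toRead = files.filter(f => f.minCountry <= "CH" && "CH" <= f.maxCountry)
      // file2.parquet is skipped without ever being opened.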

  • 2. Be Prepared

    • Traditional database: indexes (e.g. B-trees)
    • File-based big data: partitioning and clustering

  • Partitioning in Parquet

    month=1/
      file1.parquet
      file2.parquet
    month=2/
      file3.parquet
    month=3/
      ...etc...

    • Low cardinality columns only!
    • Can skip files per partition!
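    A sketch of producing this layout with the DataFrame writer; the paths are assumptions:

      spark.read.parquet("/data/sales_raw")     // hypothetical input
        .write
        .partitionBy("month")                   // creates month=1/, month=2/, ... directories
        .mode("overwrite")
        .parquet("/data/sales_by_month")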

  • Order Based Clustering

    • Sort data by search columns X, Y
    • Split sorted data into files of reasonable size

    ✔ filter on X = …
    ✔ filter on X = … AND Y = …
    ❌ filter on Y = … only (Y values are spread across all files)
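    A sketch of order-based clustering with standard DataFrame operations; the paths, column names, and file count are assumptions:

      import org.apache.spark.sql.functions.col

      spark.read.parquet("/data/sales_raw")
        .repartitionByRange(200, col("x"), col("y"))   // ~200 output files of reasonable size
        .sortWithinPartitions("x", "y")                // sort by the search columns within each file
        .write
        .mode("overwrite")
        .parquet("/data/sales_clustered")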

  • Better: Z-Order Clustering

    • X, Y => 2D plane
    • Map onto 1D space using a space-filling curve (e.g. Z-Order)
    • Sort by the 1D value
    • Observe: every curve range has narrow min/max for both X and Y => good for skipping

    Blog post: https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html
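    On Databricks Delta (see the blog post above), Z-Order clustering is requested with an OPTIMIZE command; the command is Databricks-specific and the table and column names here are illustrative:

      spark.sql("OPTIMIZE sales ZORDER BY (x, y)")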

  • Reorganization & Concurrency

    • Clustering = reorganization
    • Reorganization = replacing files
      • BUT: concurrent readers may see inconsistent data

    • Partitioning != reorganization
      • Happens immediately at write time
      • But produces many small files for incremental insertions
      • Compacting small files = reorganization

  • Consistency: Delta Lake

    • Transaction log for file sets
      • 1 transaction = atomically add files & remove files
      • Stores file names and metadata (min/max per column)
      • No more file listing
      • No more opening files for skipping metadata

    • SELECT/INSERT/UPDATE/DELETE with serializable isolation
    • Reorganize data safely

    https://delta.io/
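    A minimal Delta Lake sketch (requires the delta-core library from https://delta.io/; the paths are assumptions). Every write is recorded in the transaction log, so concurrent readers see a consistent snapshot even while files are being reorganized:

      spark.read.parquet("/data/sales_raw")
        .write.format("delta").mode("overwrite").save("/data/sales_delta")

      val sales = spark.read.format("delta").load("/data/sales_delta")
      sales.where("Country = 'CH'").show()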

  • 3. Do Things Once

    • 0x is best
    • 1x is next best
    • 2x is a waste

  • File Caching

    • Cloud storage (e.g. S3) has low bandwidth compared to SSD
    • "Delta Cache": cache cloud files on local SSD
    • Changes queries from I/O bound to CPU bound
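    On Databricks this file cache is controlled by a runtime setting; the config key below follows Databricks documentation, is not part of open-source Apache Spark, and should be treated as an assumption about your runtime version:

      spark.conf.set("spark.databricks.io.cache.enabled", "true")   // cache remote files on local SSD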

  • Result Caching

    • Spark already allows you to cache DataFrames/RDDs explicitly
    • Automatic caching is more difficult
      • Use case: everybody in the company opens the same dashboard all the time
      • Safely reuse a result from a different user's session?
        – Same settings?
        – Same permissions?
        – Is the data not stale?
        – How stale is acceptable?
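    Explicit caching remains the simple answer where it applies; a sketch, with table and column names assumed for illustration:

      import org.apache.spark.sql.functions.{col, sum}

      val perStore = spark.table("sales")
        .groupBy(col("Store"), col("Country"))
        .agg(sum(col("Amount")).as("total"))
        .cache()                                   // materialized on first action, reused afterwards

      perStore.where(col("Country") === "CH").show()
      perStore.where(col("Country") === "NL").show()   // served from the cached result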

  • 4. Be Smarter

    • Pick the best algorithm for your query
    • Catalyst optimizer

  • Rule based optimizations

    Transform the logical query plan based on rules, e.g.:
    • Push down work below operators to reduce data size:
      • filter before join, aggregation, projection, ...
      • aggregation before join
      • projection before everything (drop columns!)
    • Simplify expressions
    • Precompute constant expressions
    • Turn EXISTS subqueries into semijoins
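    You can watch the rule-based optimizer work by comparing the analyzed and optimized plans that explain(true) prints; a sketch, with table and column names assumed for illustration:

      import org.apache.spark.sql.functions.col

      spark.table("sales")
        .join(spark.table("stores"), "store_id")
        .where(col("day_of_week") === "Friday")     // written after the join...
        .select("store_id", "Amount")
        .explain(true)                              // ...but pushed below it in the optimized plan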

  • Cost based optimization

    Select final plan using cost + rules:
    • Cost = based on table statistics
    • Compare multiple options by cost
    • Join reordering
    • Join method (sort-merge join, hash join, broadcast join)
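    Cost-based optimization only helps if statistics exist; a sketch of collecting them and enabling the optimizer (the command and configs exist in open-source Spark, while the table and column names are assumptions):

      spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS Country, Store")
      spark.conf.set("spark.sql.cbo.enabled", "true")
      spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")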

  • Are Statistics Robust?

    No.

  • Averages are often wrong

    SELECT * FROM Sales WHERE
    • country = 'USA'
    • country = 'Greenland'

    The same predicate shape can be very unselective ('USA') or highly selective ('Greenland'); an estimate based on the average rows per country gets both cases wrong.

  • Filters interact in unexpected ways

    SELECT * FROM People WHERE
    • city = 'Amsterdam' AND favorite_team = 'Ajax'
    • city = 'Rotterdam' AND favorite_team = 'Ajax'

    The columns are correlated (Ajax is Amsterdam's club): the first combination is common and the second is rare, yet independence-based estimates treat them the same.

    A good plan for average data can be really bad for actual data!

  • Robust: Adaptive Query Execution

    Be smart at execution time!
    • In Spark 3.0: automatic Broadcast Join (fast, but works only when one input is "small", and the actual input sizes are often known only at run time)
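    In Spark 3.0 this behavior is gated by a configuration flag; a sketch, where the broadcast threshold value is illustrative:

      spark.conf.set("spark.sql.adaptive.enabled", "true")   // adaptive query execution
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (10L * 1024 * 1024).toString)   // ~10 MB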

  • 5. Go Faster

    When you really can't avoid doing work... do the work fast!

    Multiple possible execution model choices:
    • Row-at-a-time vs. column-at-a-time
      • Columnar enables vectorization
    • Interpreted vs. compiled
    • Native vs. JVM

  • Execution Models in Spark / Databricks

    • Classic: interpreted, row-at-a-time, JVM
    • Tungsten: JIT-compiled, row-at-a-time, JVM
      • Based on ideas from the HyPer paper (but JVM bytecode instead of LLVM)
      • Code specialized for each query; multiple operators pipelined over the same data, e.g. filter + project + aggregate
      • Efficient because data stays in registers
    • Databricks Parquet Reader: interpreted, columnar, native
      • Only for scans
    • Future: Tungsten+LLVM? Columnar+JVM? Columnar+Native?

    https://www.vldb.org/pvldb/vol4/p539-neumann.pdf
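    To look at Tungsten's whole-stage generated code for a query, Spark ships a debug helper; a sketch, with table and column names assumed for illustration:

      import org.apache.spark.sql.execution.debug._
      import org.apache.spark.sql.functions.col

      val q = spark.table("sales").where(col("day_of_week") === "Friday").select("Store", "Amount")
      q.explain()        // operators prefixed with '*' are whole-stage code generated
      q.debugCodegen()   // prints the specialized Java code compiled for this query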

  • Final Notes

    • There is no silver bullet -- every workload has its own bottleneck

    • Bottlenecks are ever changing, e.g.:
      • Make CPU faster -> I/O bound
      • Reduce I/O -> CPU bound
      • Speed up aggregation -> shuffle becomes the problem

    • Benchmark, benchmark, benchmark
      • "If you say you care about performance and you don't have a benchmark, then you don't really care about performance"

  • And now...

  • Dynamic Partition Pruning in Apache Spark 3.0

    *Do less*

  • TPCDS Q98 on 10 TB

    How to Make a Query 100x Faster?

  • Static Partition Pruning

    SELECT * FROM Sales WHERE date_year = 2019

    (Diagram: basic data flow Scan -> Filter; with filter push-down the Filter moves into the Scan; with partitioned files the scan reads only the date_year = 2019 partition and skips the others, e.g. 2014.)


  • Star Schema

  • Joining Tables

    SELECT * FROM Sales JOIN Date ON date_id = Date.id
    WHERE Date.year = 2019

    • Static pruning not possible: the filter is on the Date dimension, not on the partitioned Sales table
      (Plan: Scan Sales -> Join <- Filter year = 2019 <- Scan Date)

    • Alternative: precompute a denormalized table (*Be prepared*)
      (Plan: Scan the denormalized table -> Filter year = 2019)
      • Duplicate data
      • Join maintenance
      • Wide table

  • Dynamic Pruning

    SELECT * FROM Sales JOIN Date ON date_id = Date.id
    WHERE Date.year = 2019

    • Evaluate the filter on Date first, then use its result at run time to prune the Sales scan
      (Plan: Scan Sales -> Join <- Filter year = 2019 <- Scan Date, with the pruning information flowing from the Date side into the Sales scan)
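    In Spark 3.0, dynamic partition pruning is controlled by a configuration flag (enabled by default); a sketch reusing the slide's query, where the tables are assumed to exist and Sales is assumed to be partitioned by date_id:

      spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

      val q = spark.sql("""
        SELECT * FROM Sales JOIN Date ON date_id = Date.id
        WHERE Date.year = 2019""")
      q.explain()   // the Sales scan should show a dynamic pruning filter derived from Date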

  • Dynamic Pruning

    (Diagram: a Scan of the partitioned FACT TABLE joined on date_id with a Filter over a Scan of the non-partitioned DIM TABLE.)

  • A Simple Approach

    (Diagram: as above, but the filtered DIM TABLE scan (year = 2019) is duplicated as a subquery whose result is pushed into the partitioned FACT TABLE scan, so partition files can be skipped before the join.)

  • A Simple Approach

    (Same plan as above: the DIM TABLE scan and its filter now run twice, once for the pushed-down subquery and once for the join.)

    Double the work

    *Do things once*

  • Broadcast Hash Join

    (Diagram: on the worker nodes, a FileScan of the non-partitioned dataset is joined with a filtered dimension FileScan through a BroadcastExchange.)

    • Execute the smaller side
    • Broadcast the smaller side result
    • Execute the join locally without a shuffle

  • Reusing Broadcast Results

    (Diagram: the broadcast result of the filtered dimension FileScan (year = 2019) is reused as a dynamic filter on the FileScan of the partitioned fact table, so partition files are skipped before the Broadcast Hash Join.)


  • Experimental Setup

    Workload Selection
    - TPC-DS, scale factors 1-10 TB

    Cluster Configuration
    - 10 i3.xlarge machines, 40 cores total

    Data-Processing Framework
    - Apache Spark 3.0

  • TPCDS 1 TB

    60 of 102 queries show a speedup between 2x and 18x

  • Data Skipped

    Very effective in skipping data

  • TPCDS 10 TB

    Even better speedups at 10x the scale

  • Query 98

    SELECT i_item_desc, i_category, i_class, i_current_price,
           sum(ss_ext_sales_price) as itemrevenue,
           sum(ss_ext_sales_price)*100/sum(sum(ss_ext_sales_price)) over (partition by i_class) as revenueratio
    FROM store_sales, item, date_dim
    WHERE ss_item_sk = i_item_sk
      and i_category in ('Sports', 'Books', 'Home')
      and ss_sold_date_sk = d_date_sk
      and cast(d_date as date) between cast('1999-02-22' as date)
                                   and (cast('1999-02-22' as date) + interval '30' day)
    GROUP BY i_item_id, i_item_desc, i_category, i_class, i_current_price
    ORDER BY i_category, i_class, i_item_id, i_item_desc, revenueratio

  • TPCDS 10 TB, Q98

    Highly selective dimension filter that retains only one month out of 5 years of data

  • Conclusion

    Apache Spark 3.0 introduces Dynamic Partition Pruning

    Significant speedup, exhibited in many TPC-DS queries

    This optimization improves Spark performance for star-schema queries, making it unnecessary to denormalize tables.

  • Thanks!

    We're hiring for internships (3 months) and full-time engineers in Amsterdam and San Francisco!

    https://databricks.com/company/careers

    Bart Samwel: linkedin.com/in/bsamwel
    Sabir Akhadov: linkedin.com/in/akhadov
