
In-memory Aggregation for Big Data Analytics

Puya Memarzia, Virendra C. Bhavsar, and Suprio Ray

Faculty of Computer Science, University of New Brunswick, Fredericton, New Brunswick, Canada

ABSTRACT

Aggregation is widely used to extract useful information from large volumes of data. In-memory databases are rising in popularity due to the demands of big data analytics applications. Many different algorithms and data structures can be used for in-memory aggregation, but their relative performance characteristics are inadequately studied. Prior studies of aggregation focused primarily on small selections of query workloads and data structures. We undertook a comprehensive analysis of in-memory aggregation that encompasses 20 popular and state-of-the-art algorithms and data structures. Insights gained from theoretical and empirical evaluation are used to identify the trade-offs of each algorithm, with the goal of offering guidance to practitioners. Our results allow us to identify the best approach in different situations, based on specific characteristics of the query workload and dataset.


Motivation

- Aggregation: a ubiquitous and expensive operation commonly used in data analytics
- Applications: data warehousing, data mining, business intelligence tools, ...
- Plethora of algorithms and data structures could be used to implement aggregation
- What are the trade-offs and opportunities to improve performance? How to select the best tool for the job?
- Memory getting faster, denser, and cheaper
- Data resides in main memory, giving significant speedups (orders of magnitude) over disk-based query processing
- Performance highly influenced by memory access patterns [1]

Figure 1. Aggregation overview and motivation

Figure 3. Six analysis dimensions (important factors that can be evaluated independently):

1. Algorithm and Data Structure
2. Query and Aggregate Function
3. Key Distribution and Skew
4. Group-By Cardinality - the number of unique values in the group-by columns; determines the size of the output
5. Dataset Size and Memory Usage - size of the input data, algorithm memory efficiency, and query memory requirements
6. Concurrency and Multithreaded Scaling - support for concurrent (shared-memory) access and the ability to scale up with additional threads

Figure 2. Comparison of disk-based and memory-based approaches: with in-memory query processing, the data resides in RAM next to the CPU (with optional disk snapshots), whereas disk-based processing operates on disk-resident data.

Experimental Methodology

- Hash-based, sort-based, and tree-based approaches, including popular, state-of-the-art, and custom implementations
- Aggregation query categories: scalar or vector; distributive, algebraic, or holistic; range predicate
- Data key distribution: Zipfian, heavy hitter, moving cluster, etc. (an illustrative key generator is sketched below)
- Data ordering (randomness/locality)
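
The study's datasets extend the synthetic data generators of Gray et al. [2]; the C++ sketch below is only a rough illustration of how a skewed (Zipfian) group-by key column could be produced, and the function name make_zipf_keys and its parameters are ours, not the paper's.

// make_zipf_keys -- illustrative sketch only; not the generator used in the study.
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Draw `n` group-by keys from `cardinality` distinct values, where the value of
// rank k is chosen with probability proportional to 1 / k^s.
// s = 0 gives a uniform distribution; larger s gives heavier skew.
std::vector<std::uint64_t> make_zipf_keys(std::size_t n, std::size_t cardinality,
                                          double s, std::uint64_t seed = 42) {
    std::vector<double> weights(cardinality);
    for (std::size_t k = 1; k <= cardinality; ++k)
        weights[k - 1] = 1.0 / std::pow(static_cast<double>(k), s);

    std::mt19937_64 rng(seed);
    std::discrete_distribution<std::size_t> pick(weights.begin(), weights.end());

    std::vector<std::uint64_t> keys(n);
    for (auto& key : keys)
        key = static_cast<std::uint64_t>(pick(rng));  // key id in [0, cardinality)
    return keys;
}

Leaving the keys in generation order or shuffling them afterwards is one way to vary the data ordering and locality, in the spirit of the Rseq versus Rseq-Shf variants in Figure 4.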

Experimental Results

- Evaluating the performance impact of each of the six dimensions
- Runtime measured with an accurate timer, reporting the average of five runs
- Please see the paper [3] for additional results and details

Figure 4. Variable dataset distribution - Vector COUNT (Q1) - 100M records - 1M group-by cardinality. Query execution time (billions of CPU cycles) per dataset distribution (Rseq, Rseq-Shf, Hhit, Hhit-Shf, Zipf, MovC) for ART, Judy, Btree, Hash_SC, Hash_LP, Hash_Sparse, Hash_Dense, Hash_LC, Introsort, and Spreadsort.

- Expensive to process: Zipf and Rseq-Shf
- Cheaper to process: Rseq and MovC (locality matters!)


Table 1. Algorithms and Data Structures

Label         Type
ART           Tree
Judy          Tree
Btree         Tree
Hash_SC       Hash Table
Hash_LP       Hash Table
Hash_Sparse   Hash Table
Hash_Dense    Hash Table
Hash_LC       Hash Table
Hash_TBBSC    Hash Table
Hash_LCMC     Hash Table
Introsort     Sort Algorithm
Spreadsort    Sort Algorithm
Sort_QSLB     Sort Algorithm
Sort_BI       Sort Algorithm

Figure 5. Variable cardinality - 100M records. (a) Vector COUNT (Q1); (b) Vector MEDIAN (Q3). Query execution time (billions of CPU cycles) versus group-by cardinality (10^2 to 10^7) for the same ten single-threaded algorithms as Figure 4.

- Higher cardinality = more unique values => more cache/TLB misses

Figure 6. Multithreaded scaling (concurrent algorithms) - 100M records. (a) Vector COUNT (Q1) - 1K groups; (b) Vector COUNT (Q1) - 1M groups. Query execution time (billions of CPU cycles) versus number of threads (1-8) for Hash_TBBSC, Hash_LCMC, Sort_QSLB, and Sort_BI.

- Limited improvement beyond 4 threads (reminder: the machine has four physical cores with hyperthreading)
- Runtime generally faster and more consistent with Hash_TBBSC
- Hash_LCMC more efficient at high cardinality (1M groups)
- (A sketch of one way to parallelize the COUNT aggregation follows below.)
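
As a rough illustration of the kind of scaling Figure 6 measures, the sketch below parallelizes the Q1 COUNT aggregation by giving each thread a private hash table over its slice of the input and merging the partial tables at the end. It is an illustration under our own assumptions, not one of the concurrent implementations compared above; the function name parallel_count is ours.

// parallel_count -- sketch of one way to scale the Q1 COUNT aggregation:
// each thread aggregates its slice into a private hash table, and the
// partial tables are merged at the end.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <vector>

using CountMap = std::unordered_map<std::uint64_t, std::uint64_t>;

CountMap parallel_count(const std::vector<std::uint64_t>& keys, unsigned num_threads) {
    std::vector<CountMap> partials(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = (keys.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&partials, &keys, chunk, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(keys.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                ++partials[t][keys[i]];              // thread-local: no synchronization needed
        });
    }
    for (auto& w : workers) w.join();

    CountMap result;                                  // merge the per-thread partial counts
    for (const auto& part : partials)
        for (const auto& [key, cnt] : part)
            result[key] += cnt;
    return result;
}

In a design like this the final merge grows with the group-by cardinality, which is one reason multithreaded behaviour can differ noticeably between the 1K-group and 1M-group panels.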

Dataset
- Six dataset distributions, extending popular datasets originally proposed in [2]
- Up to 100M key-value pairs
- Variable cardinality/skew

Query
- Seven query workloads
- Covers all fundamental aggregation categories

Algorithm
- Initial pool of 20 algorithms selected for evaluation
- Microbenchmarks used to filter out inefficient implementations

Distributive query example (Q1):
SELECT product_id, COUNT(*) FROM sales GROUP BY product_id

Holistic query example (Q3):
SELECT product_id, MEDIAN(amount) FROM sales GROUP BY product_id

(A sketch of how hash-based and sort-based approaches evaluate Q1 and Q3 follows below.)
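
To make the distinction concrete, below is a minimal C++ sketch of the two broad strategies: a hash-based COUNT per group for Q1 and a sort-based MEDIAN per group for Q3. These are illustrations only, not the tuned implementations listed in Table 1 (such as Hash_LP or Spreadsort), and the helper names count_per_group and median_per_group are ours.

// Q1 and Q3 sketches -- illustrations of hash-based vs. sort-based aggregation.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Q1 (distributive): hash-based COUNT(*) ... GROUP BY product_id.
// Each row touches one hash-table entry; partial results merge by simple addition.
std::unordered_map<std::uint64_t, std::uint64_t>
count_per_group(const std::vector<std::uint64_t>& product_ids) {
    std::unordered_map<std::uint64_t, std::uint64_t> counts;
    for (std::uint64_t id : product_ids)
        ++counts[id];
    return counts;
}

// Q3 (holistic): sort-based MEDIAN(amount) ... GROUP BY product_id.
// MEDIAN needs all of a group's values, so the rows are sorted by key, making
// each group a contiguous (and value-sorted) run that is inspected as a whole.
std::vector<std::pair<std::uint64_t, double>>                          // (product_id, median)
median_per_group(std::vector<std::pair<std::uint64_t, double>> rows) { // (product_id, amount)
    std::sort(rows.begin(), rows.end());
    std::vector<std::pair<std::uint64_t, double>> medians;
    for (std::size_t i = 0; i < rows.size();) {
        std::size_t j = i;
        while (j < rows.size() && rows[j].first == rows[i].first) ++j;
        const std::size_t mid = i + (j - i) / 2;                       // lower median of the run
        medians.emplace_back(rows[i].first, rows[mid].second);
        i = j;
    }
    return medians;
}

COUNT can be rebuilt by merging partial counts (distributive), whereas MEDIAN cannot (holistic); this is consistent with the conclusion that hash-based approaches shine on distributive workloads while sort-based approaches do best on holistic ones.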

Conclusions

- Aggregation is heavily impacted by data and workload characteristics
- Sort-based approaches: generally perform best on holistic workloads
- Hash-based approaches: generally the fastest on distributive workloads
- Tree-based approaches: too slow for write-once-read-once (WORO) workloads; potential for write-once-read-many (WORM) workloads or situations requiring dynamic growth

References

[1] Puya Memarzia, Suprio Ray, and Virendra C. Bhavsar. "On Improving Data Skew Resilience in Main-memory Hash Joins." 22nd International Database Engineering & Applications Symposium (IDEAS), 2018.
[2] Jim Gray, Prakash Sundaresan, Susanne Englert, Ken Baclawski, and Peter J. Weinberger. "Quickly Generating Billion-record Synthetic Databases." ACM SIGMOD Record, vol. 23, no. 2, pp. 243-252, 1994.
[3] Puya Memarzia, Suprio Ray, and Virendra C. Bhavsar. "A Six-dimensional Analysis of In-memory Aggregation." International Conference on Extending Database Technology (EDBT), 2019.