Date post: 15-Jan-2016
Improving Hash Join Performance By Exploiting Intrinsic Data Skew by Bryce Cutt supervised by Dr. Ramon Lawrence

Improving Hash Join Performance By Exploiting Intrinsic Data Skew

by Bryce Cutt

supervised by Dr. Ramon Lawrence

Introduction

• Databases are part of our lives
• Hash Join is a core database algorithm
  o Very I/O intensive for large databases; queries may take hours
  o Any performance improvement is significant
• Real datasets contain skew
  o Skew is when some values occur more frequently
  o Skew can greatly reduce hash join performance
• Skew traditionally considered a bad thing for join algorithms
  o Try to mitigate negative effects of skew
• Adapt hash join
  o No longer just mitigate
  o Use foreknowledge of skew to improve performance

Relational Model Definitions

Example Relations

Build Relation: Part

Probe Relation: Purchase

DHJ Algorithm Build Phase

Hash Function: modulo 5
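
The build-phase figures did not survive in this transcript, so the following is only a minimal sketch of what they likely walk through. It assumes DHJ is a dynamic hash join that keeps every partition in memory until space runs out and then freezes (spills) partitions to disk. The memory limit, the sample `parts` rows, and the "freeze the largest partition" victim rule are all hypothetical; only the modulo 5 hash function comes from the slide.

```python
from collections import defaultdict

NUM_PARTITIONS = 5  # the slide's hash function is "modulo 5"


def hash_key(key):
    """Map a join key to a partition index."""
    return key % NUM_PARTITIONS


def build_phase(build_tuples, key_of, memory_limit):
    """Partition build tuples, freezing partitions when memory overflows.

    memory_limit is a hypothetical budget counted in tuples.
    Returns (in_memory, on_disk, frozen).
    """
    in_memory = defaultdict(list)
    on_disk = defaultdict(list)  # stand-in for partition files on disk
    frozen = set()
    used = 0
    for t in build_tuples:
        p = hash_key(key_of(t))
        if p in frozen:
            # A frozen partition accepts no more in-memory tuples.
            on_disk[p].append(t)
            continue
        in_memory[p].append(t)
        used += 1
        while used > memory_limit:
            # Assumed victim rule: freeze the largest resident partition.
            victim = max(in_memory, key=lambda q: len(in_memory[q]))
            spilled = in_memory.pop(victim)
            used -= len(spilled)
            on_disk[victim].extend(spilled)
            frozen.add(victim)
    return dict(in_memory), dict(on_disk), frozen


# Hypothetical Part rows: (partid, name)
parts = [(1, "bolt"), (2, "screw"), (3, "nut"), (4, "gear"),
         (5, "cog"), (6, "washer"), (7, "pin")]
in_memory, on_disk, frozen = build_phase(parts, key_of=lambda t: t[0], memory_limit=4)
```

With a budget of four tuples, partitions 1 and 2 end up frozen and partitions 0, 3 and 4 stay resident. Which partitions survive depends entirely on arrival order and the victim rule, not on how useful their tuples will be during the probe.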

DHJ Algorithm Build Phase, cont. (11 figure-only slides)

Probe Relation

DHJ Algorithm Probe Phase
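
As a rough sketch of the probe phase, under the same assumptions about DHJ as a dynamic hash join: a probe tuple whose partition is still in memory is joined immediately, and a probe tuple whose partition was frozen is written out for later. The function name, the sample rows, and the tuple layouts are hypothetical.

```python
from collections import defaultdict

NUM_PARTITIONS = 5  # the slide's hash function is "modulo 5"


def hash_key(key):
    return key % NUM_PARTITIONS


def probe_phase(in_memory, frozen, probe_tuples, build_key, probe_key):
    """Join probe tuples against resident build partitions.

    Probe tuples that hash to a frozen partition are deferred to disk.
    Returns (results, probe_on_disk).
    """
    # Index the resident build tuples by join key.
    table = defaultdict(list)
    for tuples in in_memory.values():
        for b in tuples:
            table[build_key(b)].append(b)

    results = []
    probe_on_disk = defaultdict(list)  # stand-in for probe partition files
    for t in probe_tuples:
        k = probe_key(t)
        p = hash_key(k)
        if p in frozen:
            probe_on_disk[p].append(t)   # costs a write now and a read later
        else:
            for b in table.get(k, []):
                results.append((b, t))   # produced with no extra I/O
    return results, dict(probe_on_disk)


# Hypothetical state after a build phase: partitions 1 and 2 were frozen.
in_memory = {3: [(3, "nut")], 4: [(4, "gear")], 0: [(5, "cog")]}
frozen = {1, 2}
# Hypothetical Purchase rows: (purchaseid, partid)
purchases = [(101, 3), (102, 2), (103, 3), (104, 6), (105, 4)]

results, probe_on_disk = probe_phase(
    in_memory, frozen, purchases,
    build_key=lambda b: b[0], probe_key=lambda t: t[1])
```

Three of the five purchases join immediately; the other two are deferred because their partitions were frozen during the build phase.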

DHJ Algorithm Probe Phase, cont. (7 figure-only slides)

DHJ Algorithm Cleanup Phase
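
A minimal sketch of the cleanup phase, assuming it joins each frozen build partition with the probe tuples deferred to the same partition. It also assumes every frozen build partition fits in memory once reloaded; a real system would need recursive partitioning otherwise. All names and sample rows are hypothetical.

```python
from collections import defaultdict


def cleanup_phase(build_on_disk, probe_on_disk, build_key, probe_key):
    """Join each frozen build partition with its deferred probe tuples."""
    results = []
    for p, build_tuples in build_on_disk.items():
        # Reload one build partition and index it by join key.
        table = defaultdict(list)
        for b in build_tuples:
            table[build_key(b)].append(b)
        # Re-read the probe tuples that were deferred to this partition.
        for t in probe_on_disk.get(p, []):
            for b in table.get(probe_key(t), []):
                results.append((b, t))
    return results


# Hypothetical spilled partitions: Part rows and Purchase rows.
build_on_disk = {1: [(1, "bolt"), (6, "washer")], 2: [(2, "screw"), (7, "pin")]}
probe_on_disk = {2: [(102, 2)], 1: [(104, 6)]}

late_results = cleanup_phase(
    build_on_disk, probe_on_disk,
    build_key=lambda b: b[0], probe_key=lambda t: t[1])
```

Every tuple handled here has been written to disk and read back, which is the I/O cost the rest of the talk tries to avoid for frequently occurring join values.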

DHJ Algorithm Cleanup Phase, cont. (3 figure-only slides)

Skewed Probe Relation

Statistics and Hash Joins

• Modern database systems maintain statistics such as histograms for query optimization

• What if hash join could use the statistics to choose the best build tuples to keep in memory?
  o Does not have to generate own statistics
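
To make the idea concrete, here is a small sketch of picking the "best" build keys from existing statistics. It assumes the statistics can be read as a mapping from join-key value to an estimated count in the probe relation; the counts, the function names, and the `max_hot_keys` budget are hypothetical.

```python
def choose_hot_keys(probe_key_counts, max_hot_keys):
    """Return the join-key values expected to match the most probe tuples."""
    # Sort by descending count, breaking ties on the key for determinism.
    ranked = sorted(probe_key_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [key for key, _ in ranked[:max_hot_keys]]


def expected_in_memory_fraction(probe_key_counts, hot_keys):
    """Fraction of probe tuples that would find their match in memory."""
    total = sum(probe_key_counts.values())
    covered = sum(probe_key_counts[k] for k in hot_keys)
    return covered / total


# Hypothetical histogram of partid values in the probe relation.
histogram = {1: 5, 2: 40, 3: 35, 4: 10, 5: 10}

hot_keys = choose_hot_keys(histogram, max_hot_keys=2)
coverage = expected_in_memory_fraction(histogram, hot_keys)
```

In this made-up histogram, keeping only the build tuples for two key values in memory would let three quarters of the probe tuples join without any disk I/O.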

Histojoin Algorithm General Idea

• Same basic form as DHJ
• Determines best build tuples from histogram
  o In this case the tuples with partid 2 and 3
• Create partitions for the best build tuples
  o In addition to regular partitions
  o Freeze regular partitions first
• Perform a highly optimized multi-stage check
  o To determine the partition tuples belong in
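
The slides do not say what the stages of the check are, so the sketch below is one plausible reading rather than the Histojoin implementation: a cheap range test first, an exact set lookup second, and the regular modulo 5 hash as the fallback. The `HOT` label and `make_partitioner` are hypothetical names.

```python
NUM_PARTITIONS = 5   # regular partitions use the "modulo 5" hash
HOT = "hot"          # hypothetical label for the privileged partition


def make_partitioner(hot_keys):
    """Build a function that maps a join key to its partition."""
    hot = frozenset(hot_keys)
    if not hot:
        return lambda key: key % NUM_PARTITIONS
    lo, hi = min(hot), max(hot)

    def partition_of(key):
        # Stage 1 (assumed): a range test rejects most keys cheaply.
        if lo <= key <= hi:
            # Stage 2 (assumed): exact membership test.
            if key in hot:
                return HOT
        # Stage 3: everything else goes to a regular hash partition.
        return key % NUM_PARTITIONS

    return partition_of


# partid 2 and 3 are the best build tuples in the slide's example.
partition_of = make_partitioner([2, 3])
assignments = {key: partition_of(key) for key in (2, 3, 4, 7)}
```

Build and probe tuples would both go through the same function, so a probe tuple with partid 2 or 3 always lands in the partition that holds its matching build tuple. Because regular partitions are frozen first, that partition is the last one to leave memory.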

Histojoin Algorithm Build Phase

Histojoin Algorithm Probe Phase

Implementation Details

• Avoided in algorithm description
  o General enough to fit any database system
• But ultimately important
  o Core of algorithm is implementation specific
• Implemented in
  o Stand alone Java app: optimistic implementation
  o PostgreSQL HHJ: conservative implementation

Inaccurate Statistics

• Selections
• Multi-join plans
  o Sampling
  o SITs
• Handling dependent on implementation
  o PostgreSQL conservative memory usage

Experimental Results

• TPC-H
  o Database commonly used to test database system performance
  o Skewed versions
  o 1GB dataset used in Java tests
  o 10GB dataset used in PostgreSQL tests

Experimental Results, cont.

• Java, Lineitem/Part, skewed, 1GB: approx. 20% faster
• Java, Lineitem/Part, high skew, 1GB: approx. 60% faster
• Java, Various Joins, Percent Improvement, 1GB: approx. 20% for skewed and 60% for high skew
• Java, Lineitem/Part, Inaccurate Histogram, 1GB
• Java, Lineitem/Part/Supplier, high skew, 1GB: approx. 75% faster
• PostgreSQL, Lineitem/Part, skewed, 10GB: approx. 10% faster
• PostgreSQL, Lineitem/Part, high skew, 10GB: approx. 60% faster
• PostgreSQL, Various Joins, Percent Improvement, 10GB: 5-10% for skewed and 50-60% for high skew

Conclusion

• Histojoin
  o Significantly outperforms standard hash joins in the presence of skew
• Smart implementation mitigates pitfalls
• Two papers have been published from this work
• PostgreSQL patch currently in review
  o Will be used by millions of users

Thank you

Thank you Dr. Lawrence

