Date post: | 29-Aug-2014 |
Category: |
Technology |
Upload: | tilmann-rabl |
View: | 302 times |
Download: | 1 times |
Variations of the Star Schema Benchmark to Test the Effects of Data Skew on Query PerformanceTILMANN RABL, MEIKEL POESS, HANS-ARNO JACOBSEN, PATRICK AND ELIZABETH O’NEIL
MIDDLEWARE SYSTEMSRESEARCH GROUP
MSRG.ORG
ICPE 2013, PRAGUE, 24/04/2013
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 2
Real Life Data is Distributed Uniformly…
◦ Customers zip codes typically clustered around metropolitan areas◦ Seasonal items (lawn mowers, snow shovels, …) sold mostly during specific
periods◦ US retail sales:
◦ peak during Holiday Season◦ December sales are 2x of
January sales
Source: US Census Data
Well, Not Really
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 3
Student Seminar Signup Distribution
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 4
How Can Skew Effect Database Systems?
Data placement◦ Partitioning◦ Indexing
Data structures◦ Tree balance◦ Bucket fill ratio◦ Histograms
Optimizer finding the optimal query plan◦ Index vs. non-index driven plans◦ Hash join vs. merge join◦ Hash group by vs. sort group by
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 5
Agenda Data Skew in Current Benchmarks
Star Schema Benchmark (SSB)
Parallel Data Generation Framework (PDGF)
Introducing Skew in SSB
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 6
Data Skew in Benchmarks
TPC-D (1994-1999): only uniform data◦ SIGMOD 1997 - “Successor of TPC-D
should include data skew”◦ No effect until …
TPC-DS (released 2012)◦ Contains comparability zones ◦ Not fully utilized
TPC-D/H variations◦ Chaudhuri and Narayasa: Zipfian distribution on all columns◦ Crolotte and Ghazal: comparability zones
Still lots of open potential
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 7
Star Schema Benchmark I
Star schema version of TPC-H◦ Merged Order and Lineitem◦ Date dimension ◦ Dropped Partsupp◦ Selectivity hierarchies
◦ C_City C_Nation C_Region◦ …
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 8
Star Schema Benchmark II
Completely new set of queries
4 flights of 3-4 queries◦ Designed for functional coverage and selectivity coverage◦ Drill down in dimension hierarchies◦ Predefined selectivity
select sum(lo_extendedprice*lo_discount) as revenue from lineorder, date where lo_orderdate = d_datekey and d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;
select sum(lo_extendedprice*lo_discount) as revenue from lineorder, date where lo_orderdate = d_datekey and d_yearmonthnum = 199301 and lo_discount between 1 and 3 and lo_quantity between 26 and 35;
DrilldownQ1.1
Q1.2
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 9
Parallel Data Generation Framework
Generic data generation framework
Relational model◦ Schema specified in configuration file◦ Post-processing stage for alternative representations
Repeatable computation◦ Based on XORSHIFT random number generators◦ Hierarchical seeding strategy
Frank, Poess, and Rabl: Efficient Update Data Generation for DBMS Benchmarks. ICPE '12.Rabl and Poess: Parallel Data Generation for Performance Analysis of Large, Complex RDBMS. DBTest '11.Poess, Rabl, Frank, and Danisch: A PDGF Implementation for TPC-H. TPCTC '11.Rabl, Frank, Sergieh, and Kosch: A Data Generator for Cloud-Scale Benchmarking. TPCTC '10.
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 10
Configuring PDGF Schema configuration
Relational model◦ Tables, fields
Properties◦ Table size, characters, …
Generators◦ Simple generators◦ Metagenerators
Update definition◦ Insert, update, delete◦ Generated as change data capture
<table name="SUPPLIER"> <size>${S}</size> <field name="S_SUPPKEY" size="" type="NUMERIC“ primary="true" unique="true"> <gen_IdGenerator /> </field> <field name="S_NAME" size="25" type="VARCHAR"> <gen_PrePostfixGenerator> <gen_PaddingGenerator> <gen_OtherFieldValueGenerator> <reference field="S_SUPPKEY" /> </gen_OtherFieldValueGenerator > <character>0</character> <padToLeft>true</padToLeft> <size>9</size> </gen_PaddingGenerator > <prefix>Supplier </prefix> </gen_PrePostfixGenerator> </field>[..]
PDGFXML DB
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 11
Opportunities to Inject Data Skew in
Foreign key relations◦ E.g., L_PARTKEY
One fact table measures◦ E.g., L_Quantity
Single dimension hierarchy◦ E.g., P_Brand → P_Category → P_Mfgr
Multiple dimension hierarchies◦ E.g., City → Nation in Supplier and Customer
Experimental methodology◦ One experiment series for each of the above◦ Comparison to original SSB◦ Comparison of index-forced, non-index, and automatic optimizer mode◦ SSB scale factor 100 (100 GB), x86 server
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 12
Skew in Foreign Key Relations
Very realistic Easy to implement in PDGF
◦ Just add a distribution to the reference
But! Dimension attributes uniformly distributed Dimension keys uncorrelated to dimension attributes Very limited effect on selectivity Focus on attributes in selectivity predicates
<distribution name="Exponential“ lambda="0.26235" />
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 13
Lo_Quantity distribution◦ Values range between 0 and 50◦ Originally uniform distribution with:
◦ P(X=x)=0.02◦ Coefficient of variation of 0.00000557
◦ Proposed skewed distribution with:◦
Query 1.1◦ lo_quantity < x, x ∈ [2, 51]
Results◦ Switches too early to non-index plan◦ Switches too late to non-index plan◦ Optimizer agnostic to distribution
Skew in Fact Table Measure – Lo_Quantity
xxXP3.13.0)(
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 14
Skew in Single Dimension Hierarchy - Part
P_Category distribution◦ Uniform P(X=x)=0.04◦ Skewed P(X=x)= 0.01 - 48.36◦ Probabilities explicitly defined
Query 2.1 ◦ Restrictions on two dimensions
Results uniform case◦ Index driven superior◦ Optimizer chooses non-index driven
Results skewed case◦ Switches too early to non-index plan
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 15
Skewed S_City & C_City◦ Probabilites exponentially
distributed
Query 3.3◦ Restrictions on 3 dimensions◦ Variation on Supplier and Customer
city
Results uniform and skewed cases◦ Automatic plan performs best◦ Cross over between automatic
uniform and skewed too late
Skew in Multiple Dimension Hierarchies – S_City & C_City
Join Cardinality Elapsed Time
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 16
Conclusion & Future Work
PDGF implementation of SSB
Introduction of skew in SSB
Extensive performance analysis◦ Several interesting optimizer effects◦ Performance impact of skew
Future Work Further analysis on impact of skew
Skew in query generation
Complete suite for testing skew effects
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 17
Thanks
Questions?
Download and try PDGF:
http://www.paralleldatageneration.org
(scripts used in the study available on website above)
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 18
Back-up Slides
RABL, POESS, JACOBSEN, O'NEIL, O'NEIL - SSB SKEW VARIATIONS 19
Configuring PDGF Generation
Generation configuration
Defines the output◦ Scheduling◦ Data format◦ Sorting◦ File name and location
Post processing◦ Filtering of values◦ Merging of tables◦ Splitting of tables◦ Templates (e.g. XML / queries)
<table name="QUERY_PARAMETERS" exclude="false" > <output name="CompiledTemplateOutput" > [..] <template ><!-- int y = (fields [0]. getPlainValue ()).intValue (); int d = (fields [1]. getPlainValue ()).intValue (); int q = (fields [2]. getPlainValue ()).intValue (); String n = pdgf.util.Constants.DEFAULT_LINESEPARATOR; buffer.append("-- Q1.1" + n); buffer.append("select sum(lo_extendedprice *"); buffer.append(" lo_discount) as revenue" + n); buffer.append(“ from lineorder , date" + n); buffer.append(“ where lo_orderdate = d_datekey" + n); buffer.append(“ and d_year = " + y + n); buffer.append(“ and lo_disc between " + (d - 1)); buffer.append(“ and " + (d + 1) + n); buffer.append(“ and lo_quantity < " + q + ";" + n); --></template > </output ></table >