Christian Winther Kristensen


APPLICATION OF SQL SERVER COLUMNSTORE INDEXES IN BI SOLUTIONS

Theme day: Modern Analytical Database Technology

28 October 2014, Aalborg University

Christian Winther Kristensen
Managing consultant
cwk@rehfeld.dk

Agenda

• SQL Server columnstore index
• Practical case
• New updateable clustered columnstore in SQL Server 2014
• Comparison: Pros and cons
• Questions

03-11-2014

SQL Server columnstore index

• Introduced in SQL Server 2012
• Shares Microsoft xVelocity columnstore technology with Analysis Services Tabular model and PowerPivot
• Highly compressed
• Memory optimized
• Not updateable: the underlying table is read-only!

Star schema

[Diagram: star schema with FactSales linked to DimCustomer, DimDate, DimEmployee and DimStore]

FactSales (CustomerKey int, ProductKey int, EmployeeKey int, StoreKey int, OrderDateKey int, SalesAmount money)
-- note: lots of ints in fact tables

DimCustomer (CustomerKey int, FirstName nvarchar(50), LastName nvarchar(50), Birthdate date, EmailAddress nvarchar(50))

DimProduct (…

Best Practice: Integer keys!

How do columnstore indexes optimize performance?

• Columnstore indexes store data column-wise
– Each page stores data from a single column
• Highly compressed
– About 2x better than PAGE compression
– More data fits in memory
• Each column accessed independently
– Fetch only needed columns
– Can dramatically decrease I/O

[Diagram: columns C1, C2, C3, C4]

Heaps, B-trees store data row-wise
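As a rough illustration of why fetching only the needed columns reduces I/O, the sketch below compares the uncompressed bytes scanned for the FactSales table from the star schema slide. The 4-byte int and 8-byte money sizes are SQL Server's fixed storage sizes; the row count and the choice of three query columns are assumptions made for illustration, and compression and page overhead are ignored.

```python
# Uncompressed storage size in bytes per FactSales column (int = 4, money = 8).
COLUMN_BYTES = {
    "CustomerKey": 4,
    "ProductKey": 4,
    "EmployeeKey": 4,
    "StoreKey": 4,
    "OrderDateKey": 4,
    "SalesAmount": 8,
}


def bytes_scanned_rowwise(row_count):
    """A row store reads every column of every row it scans."""
    return row_count * sum(COLUMN_BYTES.values())


def bytes_scanned_columnwise(row_count, needed_columns):
    """A column store reads only the columns the query references."""
    return row_count * sum(COLUMN_BYTES[name] for name in needed_columns)


ROW_COUNT = 20_000_000  # assumed table size, for illustration only
NEEDED = ["ProductKey", "OrderDateKey", "SalesAmount"]  # assumed query columns

row_bytes = bytes_scanned_rowwise(ROW_COUNT)
col_bytes = bytes_scanned_columnwise(ROW_COUNT, NEEDED)
print(f"row-wise scan:    {row_bytes / 1e6:.0f} MB")
print(f"column-wise scan: {col_bytes / 1e6:.0f} MB")
```

Compression widens the gap further, since each column segment is also stored compressed.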

Columnstore index architecture

• Row Group
– 1 million logically contiguous rows
• Column Segment
– Segment contains values from one column for a set of rows
– Segments for the same set of rows comprise a row group
– Segments are compressed
– Each segment stored in a separate LOB
– Segment is unit of transfer between disk and memory

[Diagram: columns C1 to C6, with one Segment and one Row Group marked]

Columnstore index example

OrderDateKey  ProductKey  StoreKey  RegionKey  Quantity  SalesAmount
20101107      106         01        1          6         30.00
20101107      103         04        2          1         17.00
20101107      109         04        2          2         20.00
20101107      103         03        2          1         17.00
20101107      106         05        3          4         20.00
20101108      106         02        1          5         25.00
20101108      102         02        1          1         14.00
20101108      106         03        2          5         25.00
20101108      109         01        1          1         10.00
20101109      106         04        2          4         20.00
20101109      106         04        2          5         25.00
20101109      103         01        1          1         17.00

1. Horizontally partition (Row Groups)

Row group 1

OrderDateKey  ProductKey  StoreKey  RegionKey  Quantity  SalesAmount
20101107      106         01        1          6         30.00
20101107      103         04        2          1         17.00
20101107      109         04        2          2         20.00
20101107      103         03        2          1         17.00
20101107      106         05        3          4         20.00
20101108      106         02        1          5         25.00

Row group 2

OrderDateKey  ProductKey  StoreKey  RegionKey  Quantity  SalesAmount
20101108      102         02        1          1         14.00
20101108      106         03        2          5         25.00
20101108      109         01        1          1         10.00
20101109      106         04        2          4         20.00
20101109      106         04        2          5         25.00
20101109      103         01        1          1         17.00

2. Vertically partition via columns (segments)

Row group 1 segments

OrderDateKey: 20101107, 20101107, 20101107, 20101107, 20101107, 20101108
ProductKey:   106, 103, 109, 103, 106, 106
StoreKey:     01, 04, 04, 03, 05, 02
RegionKey:    1, 2, 2, 2, 3, 1
Quantity:     6, 1, 2, 1, 4, 5
SalesAmount:  30.00, 17.00, 20.00, 17.00, 20.00, 25.00

Row group 2 segments

OrderDateKey: 20101108, 20101108, 20101108, 20101109, 20101109, 20101109
ProductKey:   102, 106, 109, 106, 106, 103
StoreKey:     02, 03, 01, 04, 04, 01
RegionKey:    1, 2, 1, 2, 2, 1
Quantity:     1, 5, 1, 4, 5, 1
SalesAmount:  14.00, 25.00, 10.00, 20.00, 25.00, 17.00
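Steps 1 and 2 can be sketched in a few lines of Python using the twelve sample rows from the example slide. This is only a toy model of the layout: the function names are made up for illustration, the row group size of 6 matches the slide rather than the real size of about 1 million rows, and StoreKey is held as a plain integer.

```python
COLUMNS = ["OrderDateKey", "ProductKey", "StoreKey", "RegionKey", "Quantity", "SalesAmount"]

# The twelve sample rows from the example slide.
ROWS = [
    (20101107, 106, 1, 1, 6, 30.00),
    (20101107, 103, 4, 2, 1, 17.00),
    (20101107, 109, 4, 2, 2, 20.00),
    (20101107, 103, 3, 2, 1, 17.00),
    (20101107, 106, 5, 3, 4, 20.00),
    (20101108, 106, 2, 1, 5, 25.00),
    (20101108, 102, 2, 1, 1, 14.00),
    (20101108, 106, 3, 2, 5, 25.00),
    (20101108, 109, 1, 1, 1, 10.00),
    (20101109, 106, 4, 2, 4, 20.00),
    (20101109, 106, 4, 2, 5, 25.00),
    (20101109, 103, 1, 1, 1, 17.00),
]


def to_row_groups(rows, group_size):
    """Step 1: horizontal partitioning into runs of contiguous rows."""
    return [rows[i:i + group_size] for i in range(0, len(rows), group_size)]


def to_segments(row_group):
    """Step 2: vertical partitioning; zip(*rows) transposes rows into columns."""
    return {name: list(values) for name, values in zip(COLUMNS, zip(*row_group))}


row_groups = to_row_groups(ROWS, group_size=6)
segments = [to_segments(group) for group in row_groups]

for number, group in enumerate(segments, start=1):
    print(f"Row group {number}: ProductKey segment = {group['ProductKey']}")
```

Each dictionary entry corresponds to one column segment, the unit that is compressed and read independently.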

3. Compress each segment*

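[Diagram: the segments of both row groups shown compressed. The OrderDateKey segment is shown with only the values 20101107 and 20101108 in the first row group and 20101108 and 20101109 in the second; the ProductKey, StoreKey, RegionKey, Quantity and SalesAmount segments are shown at varying sizes.]

Some segments will compress more than others

*Encoding and reordering not shown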
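To make "some segments will compress more than others" concrete, here is a minimal sketch that applies run-length encoding to two segments of the first row group. Run-length encoding is used here only as a simple stand-in; the slide does not show the actual encoding, and this is not presented as SQL Server's exact algorithm.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


# Segments of the first row group, taken from the example slides.
order_date_segment = [20101107, 20101107, 20101107, 20101107, 20101107, 20101108]
product_segment = [106, 103, 109, 103, 106, 106]

encoded_dates = run_length_encode(order_date_segment)
encoded_products = run_length_encode(product_segment)

print(f"OrderDateKey: {len(order_date_segment)} values -> {len(encoded_dates)} runs")
print(f"ProductKey:   {len(product_segment)} values -> {len(encoded_products)} runs")
```

The date segment collapses to two runs because equal values sit next to each other, while the product segment barely shrinks; this is why the order of the rows inside a row group matters for compression.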

4. Fetch only needed columns and row groups

[Diagram: the compressed segments of both row groups, as in step 3]

SELECT ProductKey, SUM(SalesAmount)
FROM SalesTable
WHERE OrderDateKey < 20101108
GROUP BY ProductKey
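The effect of step 4 on this query can be sketched as follows. The code holds only the three segments the query references (the other columns are never touched) and skips a row group when the smallest OrderDateKey in it cannot satisfy the filter. Keeping a minimum value per segment as metadata is an assumption of this toy model, and the function name is hypothetical.

```python
# Only the segments referenced by the query; values from the example slides.
ROW_GROUPS = [
    {
        "OrderDateKey": [20101107, 20101107, 20101107, 20101107, 20101107, 20101108],
        "ProductKey": [106, 103, 109, 103, 106, 106],
        "SalesAmount": [30.0, 17.0, 20.0, 17.0, 20.0, 25.0],
    },
    {
        "OrderDateKey": [20101108, 20101108, 20101108, 20101109, 20101109, 20101109],
        "ProductKey": [102, 106, 109, 106, 106, 103],
        "SalesAmount": [14.0, 25.0, 10.0, 20.0, 25.0, 17.0],
    },
]

# Assumed per-segment metadata: smallest OrderDateKey in each row group.
MIN_ORDER_DATE = [min(group["OrderDateKey"]) for group in ROW_GROUPS]


def sales_by_product(date_limit):
    """Sum SalesAmount per ProductKey for rows with OrderDateKey < date_limit."""
    totals = {}
    groups_read = 0
    for group, min_date in zip(ROW_GROUPS, MIN_ORDER_DATE):
        if min_date >= date_limit:
            continue  # no row in this row group can qualify, so skip it entirely
        groups_read += 1
        rows = zip(group["OrderDateKey"], group["ProductKey"], group["SalesAmount"])
        for order_date, product, amount in rows:
            if order_date < date_limit:
                totals[product] = totals.get(product, 0.0) + amount
    return totals, groups_read


totals, groups_read = sales_by_product(20101108)
print(totals)
print(f"row groups read: {groups_read} of {len(ROW_GROUPS)}")
```

For the filter on the slide, the second row group starts at 20101108 and is skipped, so only three segments of one row group are read.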

Practical case

• Scenario:
– Energy trading company migrates BI solution to SQL Server 2012
• Problems:
– ETL flow and intermediary calculations take too long
– Loading fact tables with many indexes is slow, and the indexes consume much storage
– Processing of the Analysis Services OLAP cube is slow
– End-user reporting on the relational data mart has long response times in certain scenarios

Solution 1: Optimize complex ETL calculations

Before optimization: 1 hour for 6 million rows

Step                       Time
Stage basic trade data     5 min
Do derived calculations    50 min
Load fact table            5 min

After optimization: 13 min for 6 million rows

Step                       Time
Drop columnstore           0 min
Stage basic trade data     5 min
Create columnstore         2 min
Do derived calculations    1 min
Load fact table            5 min

Solution 2: Reduce fact load time and save disk space

Before optimization: 41/45 min for 20 million rows, 8 GB index space

Step                            Time
Drop non-clustered indexes      1 min
Load fact table                 25 min (45 min when not dropping indexes)
Create non-clustered indexes    15 min

After optimization: 32 min for 20 million rows, 1 GB index space

Step                            Time
Drop columnstore index
Load fact table                 25 min
Create columnstore              7 min

Some queries got a bit slower!

Solution 3: Slow processing of OLAP cube

Before optimization: 1 hour for 30 million rows

Step                              Time
Load switch-in table              30 min
Switch partition to fact table
Process OLAP cube                 30 min

After optimization: 55 min for 30 million rows + better performance for other queries

Step                              Time
Drop columnstore                  0 min
Load switch-in table              30 min
Create columnstore                5 min
Switch partition to fact table    0 min
Process OLAP cube                 20 min

SSAS MOLAP cube with partitions like the fact table. 300 million rows total. Partition switching is used for the fact table load, with an average change of 30 million rows per day.

Solution 3: Slow processing of OLAP cube (continued)

• Only a small time saving on cube processing…
• But what if storage mode was changed from MOLAP to ROLAP or HOLAP?
• Small experiment
– Some OLAP queries got slower
– Processing got a lot faster, especially ROLAP due to no aggregations
– Saved OLAP storage space

Solution 4: Reduce reporting query time

Before optimization: 210 seconds for doing star schema join and aggregation

After optimization: 10 seconds for doing same query

Add columnstore index to fact table in ETL

21x faster!
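The before and after totals of the four solutions can be compared with a little arithmetic. The sketch below uses only the totals stated on the slides (minutes for solutions 1 to 3, seconds for solution 4); for solution 2 it assumes the 41-minute variant in which the indexes are dropped before loading. The short labels are paraphrased.

```python
# (before, after) totals as stated on the slides.
# Minutes for solutions 1-3, seconds for solution 4; the ratio is unit-free.
CASES = {
    "Solution 1: ETL calculations": (60, 13),
    "Solution 2: fact load": (41, 32),  # assumes the 41 min variant (indexes dropped)
    "Solution 3: OLAP cube processing": (60, 55),
    "Solution 4: reporting query": (210, 10),
}


def speedup(before, after):
    """How many times faster the optimized run is."""
    return before / after


def time_saved_percent(before, after):
    """Share of the original duration that was saved."""
    return 100 * (before - after) / before


for name, (before, after) in CASES.items():
    print(f"{name}: {speedup(before, after):.1f}x faster, "
          f"{time_saved_percent(before, after):.0f}% less time")
```

The figures show that the gain is concentrated in the query-heavy steps (solutions 1 and 4), while load-dominated flows (solutions 2 and 3) improve far less in elapsed time.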

Columnstore in SQL 2014

• New: Clustered Columnstore
– Dependency on conventional b-tree structures has been removed
– Potential for significant disk space savings if workload is satisfied without conventional indexes
• Note: Non-clustered columnstore is still supported & is still a read-only structure
– Required if:
    • Constraints are required
    • Workload requires b-tree non-clustered indexes
• Fully Read/Write
– Less complicated ETL
– But partition switching & BULK INSERT remain best practices
• Data type support expanded:
– All data types except: (n)varchar(max), varbinary(max), XML, Spatial, CLR (blob datatypes)
• “Batch mode” query plan improved
– New support for:
    • All joins (including OUTER, HASH, SEMI (NOT IN, IN))
    • UNION ALL
    • Scalar aggregates
    • “Mixed mode” plans

Columnstore in SQL 2014: Insert & Updating Data

• Bulk insert
– Creates row groups of 1 million rows; the last row group is probably not full
– But if <100K rows, will be left in Row Store
• Insert/Update
– Collects rows in Row Store
• Tuple Mover
– When Row Store reaches 1 million rows, converts it to a Columnstore Row Group
– Runs every 5 minutes by default
– Started explicitly by ALTER INDEX <name> ON <table> REORGANIZE
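The interplay of bulk insert, single-row inserts and the tuple mover can be sketched with a toy model. The class and method names below are hypothetical, the default thresholds are the rounded figures from the slide, and the model ignores compression and updates or deletes of already converted rows. The Row Store on the slide is what SQL Server documentation calls the delta store.

```python
class ToyClusteredColumnstore:
    """Toy model of how rows reach columnstore row groups (not SQL Server code)."""

    def __init__(self, row_group_size=1_000_000, bulk_threshold=100_000):
        self.row_group_size = row_group_size  # rows per full row group
        self.bulk_threshold = bulk_threshold  # smaller bulk chunks stay in the row store
        self.row_store = []                   # rows waiting in row-wise form
        self.row_groups = []                  # rows already converted to row groups

    def insert(self, row):
        """Single-row inserts are always collected in the row store."""
        self.row_store.append(row)

    def bulk_insert(self, rows):
        """Large chunks become row groups directly; small ones go to the row store."""
        rows = list(rows)
        for start in range(0, len(rows), self.row_group_size):
            chunk = rows[start:start + self.row_group_size]
            if len(chunk) < self.bulk_threshold:
                self.row_store.extend(chunk)
            else:
                self.row_groups.append(chunk)

    def tuple_mover(self):
        """Convert full batches from the row store into row groups."""
        while len(self.row_store) >= self.row_group_size:
            self.row_groups.append(self.row_store[:self.row_group_size])
            del self.row_store[:self.row_group_size]


# Scaled-down thresholds so the behaviour is visible with a few rows.
table = ToyClusteredColumnstore(row_group_size=10, bulk_threshold=3)
table.bulk_insert(range(25))  # chunks of 10, 10 and 5 rows become row groups
table.bulk_insert(range(2))   # below the threshold: stays in the row store
for row in range(8):
    table.insert(row)         # the row store now holds 10 rows
table.tuple_mover()           # the full row store is converted
print(len(table.row_groups), len(table.row_store))
```

Loading in large batches avoids the row store altogether, which is in line with the slide's remark that BULK INSERT remains a best practice.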

Comparison: Pros and cons

Index type                  Pros                               Cons
Non-clustered columnstore   • Fastest for queries              • Not updateable
                            • Allows other row-based indexes   • Uses more storage
                                                               • More complex ETL design
Clustered columnstore       • Allows updating the table        • No unique or key constraints!
                            • Easier ETL design                • No non-clustered indexes
                            • Faster load                      • Requires periodic index maintenance
                            • Minimal storage usage


Questions