Automating DWH Patterns Through Metadata

Post on 21-Nov-2014

description

Around 80% of the work to create a data warehouse/BI solution is spent on the ETL phase. Although building an ETL solution can be a challenge, you can break the project down into at least two separate processes for easier management. One process is strictly related to business modeling, and therefore cannot be replicated. The other is made up of purely technical processes that are always the same, regardless of the business environment we operate in, and thus can be highly automated. In this session, we will look at well-known patterns for solving common problems and at how they can be automated with specific tools and techniques that use metadata to reduce development time and bugs. Using these engineering techniques, you will be able to adopt an Agile approach to your BI solution.

transcript

Automating Data Warehouse Patterns Through Metadata
Davide Mauri
dmauri@solidq.com

Davide Mauri
20 years of experience on the SQL Server Platform
– Specialized in Data Solution Architecture, Database Design, Performance Tuning, Business Intelligence, Data Warehouse, Big Data & Analytics
Microsoft SQL Server MVP
President of UGISS (Italian SQL Server UG)
Mentor @ SolidQ
– Regular Speaker @ SQL Server events
– Projects, Consulting, Mentoring & Training
Find me here:
– Blog: http://sqlblog.com/blogs/davide_mauri/default.aspx
– Twitter: @mauridb

Building a DWH in 2013
Is still an (almost) manual process

A *lot* of repetitive low-value work

No (or very few) standard tools available

How it should be
Semi-automatic process

– “develop by intent”

Define the mapping logic from a semantic perspective
– Source to Dimensions / Measures

• (Metadata anyone?)

Design the model and let the tool build it for you

CREATE DIMENSION Customer
FROM SourceCustomerTable
MAP USING CustomerMetadata

ALTER DIMENSION Customer
ADD ATTRIBUTE LoyaltyLevel
AS TYPE 1

CREATE FACT Orders
FROM SourceOrdersTable
MAP USING OrdersMetadata

ALTER FACT Orders
ADD DIMENSION Customer

The perfect BI process & architecture

AGILE BI

Iterative!

DWH PROCESSES
Is automation possible?

Invest in Automation?
Faster development
– Reduce Costs
– Embrace Changes

Fewer bugs

Increase solution quality and make it consistent throughout the whole product

Automation Pre-Requisites
Split the process into two separate types of processes
– What can be automated
– What can NOT be automated

Create and impose a set of rules that defines
– How to solve common technical problems
– How to implement the identified solutions

No Monkey Work!
Let the people think and let the machines do the «monkey» work.

Design Pattern
“A general reusable solution to a commonly occurring problem within a given context”

Design Pattern
Generic ETL Pattern
– Partition Load
– Incremental/Differential Load

Generic BI Design Pattern
– Slowly Changing Dimension
• SCD1, SCD2, etc.
– Fact Table
• Transactional, Snapshot, Temporal Snapshot

Design Pattern
Specific SQL Server Patterns
– Change Data Capture
– Change Tracking
– Partition Load
– SSIS Parallelism

Sample Rules
• Always add a «last_update» column
• Always log Inserted/Updated/Deleted rows to the log.load_info table
• Use MD5 – binary(16) for checksums
• Use views to expose data
– Dimension & Fact views MUST use the same column names for lookup columns
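The checksum rule above can be illustrated with a minimal sketch. It assumes rows arrive as Python dictionaries, and the helper name `row_checksum`, the separator, and the NULL marker are illustrative choices, not something the slides prescribe. MD5 produces a 16-byte digest, which is what fits a binary(16) column.

```python
import hashlib

def row_checksum(row, columns):
    """Return a 16-byte MD5 digest over the given columns of a row (a dict)."""
    # Mark NULLs explicitly so that None and the empty string hash differently,
    # and join values with a separator that is unlikely to appear in the data.
    parts = ["\x00NULL" if row[c] is None else str(row[c]) for c in columns]
    payload = "\x1f".join(parts).encode("utf-8")
    return hashlib.md5(payload).digest()

# Hypothetical sample rows: same customer, one attribute changed.
old = {"customer_id": 1, "name": "Ada", "city": "Rome"}
new = {"customer_id": 1, "name": "Ada", "city": "Milan"}
cols = ["name", "city"]  # columns that take part in the checksum

changed = row_checksum(old, cols) != row_checksum(new, cols)
```

Comparing two 16-byte values is cheaper than comparing every column, which is why the later SCD pattern looks up the stored checksum instead of the full row.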

High-Level Vision
[Diagram: OLTP, STG and DWH connected by ETL steps; the steps are labeled Technical Process, Business Process, Technical Process]

ETL Phases
«E» and «L» must be
– Simple, Easy and Straightforward
– Completely Automated
– Completely Reusable

«E» and «L» have ZERO value in a BI Solution
– Should be done in the most economical way

PATTERN
Well-known solutions to common problems

Source Full Load E

Source Incremental Load E
In this scenario, “ID” is an IDENTITY/SEQUENCE. Probably a PK.
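A minimal sketch of this scenario, using an in-memory SQLite database as a stand-in for the source system. The table name `src_orders`, its columns, and the idea of remembering the last loaded ID as a watermark are assumptions made for illustration.

```python
import sqlite3

def extract_incremental(conn, last_loaded_id):
    """Return source rows whose ID is greater than the last one already loaded."""
    cur = conn.execute(
        "SELECT id, amount FROM src_orders WHERE id > ? ORDER BY id",
        (last_loaded_id,),
    )
    return cur.fetchall()

# Hypothetical source table with an ever-increasing ID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO src_orders VALUES (?, ?)",
    [(1, 10.0), (2, 20.0), (3, 30.0)],
)

# Row 1 was loaded by a previous run, so only rows 2 and 3 are extracted.
new_rows = extract_incremental(conn, last_loaded_id=1)

# Remember the highest ID seen so the next run starts from there.
next_watermark = new_rows[-1][0] if new_rows else 1
```

This only detects inserts. Rows that are updated in place keep their ID, which is why the differential scenarios need a different approach.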

Source Differential Load/1 E

In this scenario the source table doesn’t offer any specific way to understand what has changed
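One common way to handle a source with no change indicator is to compare the current snapshot against checksums saved at the previous load. The sketch below assumes that approach; the slide's own diagram is not available, so this is not necessarily the exact flow presented. The function names and the sample data are hypothetical.

```python
import hashlib

def checksum(values):
    """16-byte MD5 digest of a tuple of column values."""
    return hashlib.md5("\x1f".join(map(str, values)).encode("utf-8")).digest()

def diff_snapshot(source_rows, staged_checksums):
    """Classify keys as inserted, updated or deleted.

    source_rows: dict mapping key -> tuple of values (current source content)
    staged_checksums: dict mapping key -> checksum stored at the previous load
    """
    inserted = [k for k in source_rows if k not in staged_checksums]
    updated = [
        k for k, v in source_rows.items()
        if k in staged_checksums and checksum(v) != staged_checksums[k]
    ]
    deleted = [k for k in staged_checksums if k not in source_rows]
    return inserted, updated, deleted

# Hypothetical previous and current content of the source table.
previous = {1: ("Ada", "Rome"), 2: ("Bob", "Oslo"), 3: ("Cy", "Bern")}
staged = {k: checksum(v) for k, v in previous.items()}
current = {1: ("Ada", "Milan"), 2: ("Bob", "Oslo"), 4: ("Dee", "Kyiv")}

result = diff_snapshot(current, staged)
```

The whole source still has to be read on every run, so this costs more than the incremental case, but only the differences travel further down the pipeline.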

Source Differential Load/2 E

In this scenario the source table has a TimeStamp-Like column

Source Differential Load
• SQL Server 2012 features that can help with incremental/differential load
– Change Data Capture
• Natively supported in SSIS 2012
• http://www.mattmasson.com/2011/12/cdc-in-ssis-for-sql-server-2012-2/
– Change Tracking
• Underused feature in BI…not as rich as CDC but MUCH simpler and easier

E

SCD 1 & SCD 2 L
Start
1. Lookup Dimension Id and MD5 Checksum from Business Key
2. Calculate MD5 Checksum of Non-SCD-Key Columns
3. Dimension Id is Null?
– Yes: Insert new members into DWH
– No: Checksums are different?
• Yes: Store into temp table
4. Merge data from temp table to DWH
End
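The flow above can be sketched for the SCD 1 case (overwrite in place) as follows. SQLite stands in for the DWH, and the table names `dim_customer` and `tmp_changed`, the single `city` attribute, and the helper names are assumptions for illustration only.

```python
import hashlib
import sqlite3

def md5_of(*values):
    """16-byte MD5 digest of the non-key attribute values."""
    return hashlib.md5("\x1f".join(map(str, values)).encode("utf-8")).digest()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- dimension id
    bk TEXT UNIQUE,                                 -- business key
    city TEXT,
    checksum BLOB
);
CREATE TABLE tmp_changed (bk TEXT, city TEXT, checksum BLOB);
""")
conn.execute(
    "INSERT INTO dim_customer (bk, city, checksum) VALUES (?, ?, ?)",
    ("C1", "Rome", md5_of("Rome")),
)

def load_scd1(conn, source_rows):
    for bk, city in source_rows:
        new_sum = md5_of(city)
        # Lookup dimension id and stored checksum from the business key.
        found = conn.execute(
            "SELECT customer_id, checksum FROM dim_customer WHERE bk = ?", (bk,)
        ).fetchone()
        if found is None:
            # Dimension id is null: insert the new member.
            conn.execute(
                "INSERT INTO dim_customer (bk, city, checksum) VALUES (?, ?, ?)",
                (bk, city, new_sum),
            )
        elif found[1] != new_sum:
            # Checksums differ: park the row in the temp table.
            conn.execute("INSERT INTO tmp_changed VALUES (?, ?, ?)", (bk, city, new_sum))
    # Merge the changed rows from the temp table into the dimension.
    conn.execute("""
        UPDATE dim_customer
        SET city = (SELECT t.city FROM tmp_changed t WHERE t.bk = dim_customer.bk),
            checksum = (SELECT t.checksum FROM tmp_changed t WHERE t.bk = dim_customer.bk)
        WHERE bk IN (SELECT bk FROM tmp_changed)
    """)
    conn.execute("DELETE FROM tmp_changed")

load_scd1(conn, [("C1", "Milan"), ("C2", "Oslo")])
rows = conn.execute("SELECT bk, city FROM dim_customer ORDER BY bk").fetchall()
```

Staging the changed rows first keeps the final merge a single set-based statement instead of one update per row.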

SCD 2 Special Note
• Merge => UPDATE Interval + INSERT New Row

L
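For SCD 2, the merge step becomes two statements: close the validity interval of the current row, then insert the new version. A minimal sketch, assuming a `valid_from`/`valid_to` pair and an open-ended marker date, none of which are specified in the slides:

```python
import sqlite3

OPEN_END = "9999-12-31"  # assumed marker meaning "current version"

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
        bk TEXT, city TEXT, valid_from TEXT, valid_to TEXT
    )
""")
conn.execute(
    "INSERT INTO dim_customer (bk, city, valid_from, valid_to) VALUES (?, ?, ?, ?)",
    ("C1", "Rome", "2013-01-01", OPEN_END),
)

def apply_scd2_change(conn, bk, city, change_date):
    with conn:  # both statements are committed together, or neither is
        # UPDATE interval: close the currently open version.
        conn.execute(
            "UPDATE dim_customer SET valid_to = ? WHERE bk = ? AND valid_to = ?",
            (change_date, bk, OPEN_END),
        )
        # INSERT new row: the new version becomes the open one.
        conn.execute(
            "INSERT INTO dim_customer (bk, city, valid_from, valid_to) VALUES (?, ?, ?, ?)",
            (bk, city, change_date, OPEN_END),
        )

apply_scd2_change(conn, "C1", "Milan", "2013-06-01")
history = conn.execute(
    "SELECT city, valid_from, valid_to FROM dim_customer WHERE bk = 'C1' ORDER BY valid_from"
).fetchall()
```

Running both statements in one transaction avoids a window in which the business key has no current version.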

FACT TABLE LOAD L

Partition Load EL

Parallel Load
• Logically split the work in several steps
– E.g.: Load/Process one customer at a time
• Create a «queue» table that stores information for each step
– Step 1 -> Load Customer «A»
– Step 2 -> Load Customer «B»
• Create a Package that
1. Picks the first step not already picked up
2. Does the work
3. Goes back to step 1
• Call the Package «n» times simultaneously

EL
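The queue idea can be sketched outside SSIS with threads standing in for the «n» package instances. The in-memory `queue_table` list, the status values, and the worker count are assumptions for illustration; in a real solution the queue would be a database table and the claim would be an atomic UPDATE.

```python
import threading

# Stand-in for the «queue» table: one row per unit of work.
queue_table = [
    {"step": i, "customer": c, "status": "pending"}
    for i, c in enumerate(["A", "B", "C", "D", "E"], start=1)
]
queue_lock = threading.Lock()
loaded = []

def pick_next_step():
    """Atomically claim the first step that nobody has picked up yet."""
    with queue_lock:
        for row in queue_table:
            if row["status"] == "pending":
                row["status"] = "picked"
                return row
    return None

def worker():
    while True:
        step = pick_next_step()      # 1. pick the first step not already picked up
        if step is None:
            return                   # queue is empty, this worker is finished
        with queue_lock:             # 2. do the work (here: just record the load)
            loaded.append(step["customer"])
            step["status"] = "done"
        # 3. loop back and pick the next step

# Call the worker «n» times simultaneously (n = 3 here).
threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The important part is that claiming a step is atomic, so two workers never load the same customer.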

Other SSIS Specific Patterns
• Range Lookup
– Not natively supported
– Matt Masson has the answer in his blog
• http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx
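To show what a range lookup does, independent of SSIS, here is a minimal sketch based on binary search. It is not the approach from the linked blog post, and the ranges and labels are made up for illustration.

```python
import bisect

# Hypothetical ranges: (inclusive lower bound, label), sorted by lower bound.
RANGES = [(0, "Bronze"), (1000, "Silver"), (5000, "Gold")]
LOWER_BOUNDS = [low for low, _ in RANGES]

def range_lookup(value):
    """Return the label of the range containing value, or None if below the first range."""
    # bisect_right gives the number of lower bounds <= value;
    # the matching range is the last of those.
    pos = bisect.bisect_right(LOWER_BOUNDS, value) - 1
    return RANGES[pos][1] if pos >= 0 else None

label = range_lookup(1200)
```

An equality lookup cannot express "find the row whose interval contains this value", which is why the pattern needs special handling.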

METADATA
A key ingredient in automation

Metadata
Provide context information
– Which columns are used to build/feed a Dimension?
– Which columns are Business Keys?
– Which table is the Fact Table?
– How are Fact and Dimension connected?
• Which columns are used?

How to manage Metadata?
• Naming Convention
• Extended Properties
• Specific, Ad Hoc Database or Tables
• Other (XML, File, etc.)

Naming Convention
• The easiest and cheapest
– No additional (hidden) costs
– No need to be maintained
– Never out-of-sync
– No documentation needed
• Actually, it IS PART of the documentation
– Imposes a Standard
• Very limited in terms of flexibility and usage
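A sketch of how a naming convention can act as metadata. The prefixes `bk_`, `scd1_` and `scd2_` are a hypothetical convention invented for this example; the slides do not say which convention is used.

```python
def classify_columns(columns):
    """Derive each column's role from its name prefix (hypothetical convention)."""
    prefixes = {"bk_": "business_key", "scd1_": "scd1", "scd2_": "scd2"}
    roles = {"business_key": [], "scd1": [], "scd2": [], "other": []}
    for col in columns:
        for prefix, role in prefixes.items():
            if col.startswith(prefix):
                roles[role].append(col)
                break
        else:
            # No known prefix: the convention says nothing about this column.
            roles["other"].append(col)
    return roles

meta = classify_columns(
    ["bk_customer_code", "scd1_loyalty_level", "scd2_city", "last_update"]
)
```

The metadata cannot go out of sync because it is the column name itself, but a prefix can only carry a small amount of information, which is the flexibility limit noted above.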

Extended Properties
Support most of metadata needs

No additional software needed

Very verbose usage
– Development of a wrapper to make usage simpler is feasible and encouraged

Metadata Objects
Dedicated Ad-Hoc Database and Tables

As Flexible as you need

Maintenance Overhead to keep metadata in-sync with data
– Development of an automatic check procedure is needed
– DMVs can help a lot here

External Metadata Objects
Really expensive to keep them in-sync
– A tool is needed, otherwise too much manual work

Does not give any specific benefits with respect to Ad-Hoc Database/Tables

DEMO

AUTOMATION
Let’s make it possible!

Automation Scenarios
• Run-Time: «Auto-Configuring» Packages
– Really hard to customize packages
– SSIS limitations must be managed
• E.g.: Data Flow cannot be changed at runtime
• On-the-fly creation of packages may be needed
• Design-Time: Package Generators / Package Templates
– Easy to customize created packages
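The design-time idea, generating an artifact from metadata and a template, can be shown with a small sketch that emits a SQL statement rather than an SSIS package. The metadata dictionary layout, the object names, and the function name are assumptions for illustration.

```python
def generate_incremental_extract(meta):
    """Build an incremental staging load statement from a metadata dictionary.

    Identifiers are taken as-is, so the metadata must come from a trusted source.
    """
    cols = ", ".join(meta["columns"])
    return (
        f"INSERT INTO {meta['target']} ({cols})\n"
        f"SELECT {cols}\n"
        f"FROM {meta['source']}\n"
        f"WHERE {meta['key']} > ?;"
    )

# Hypothetical metadata describing one source-to-staging mapping.
customer_meta = {
    "source": "src.Customer",
    "target": "stg.Customer",
    "key": "ID",
    "columns": ["ID", "Name", "City"],
}

sql = generate_incremental_extract(customer_meta)
```

Because the output is generated once at design time, it can be inspected and edited by hand afterwards, which is the customization advantage noted above.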

Automation Solutions
• Specific Tool/frameworks
– BIML / MIST
• SQL Server Platform
– SQL, PowerShell, .NET
– SMO, AMO

Package Generators
Required Assemblies
– Microsoft.SqlServer.ManagedDTS
– Microsoft.SqlServer.DTSRuntimeWrap
– Microsoft.SqlServer.DTSPipelineWrap

Path: C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies

DEMO

Useful Resources
• «STOCK» Tasks:
– http://msdn.microsoft.com/en-us/library/ms135956.aspx
• How to set Task properties at runtime:
– http://technet.microsoft.com/en-us/library/microsoft.sqlserver.dts.runtime.executables.add.aspx

BIML – BI Markup Language
• Developed by Varigence
– http://www.varigence.com
– http://bimlscript.com/
– MIST: BIML Full-Featured IDE
• Free via BIDS Helper
– Support “limited” to SSIS package generation
– http://bidshelper.codeplex.com

THANK YOU!
• For attending this session and PASS SQLRally Nordic 2013, Stockholm