Post on 19-Jan-2015
04/10/2023LearnDataVault.com 1
Data Vault Modeling
Methodology: A Primer…
© Dan Linstedt 2009-2012. All Rights Reserved
http://LearnDataVault.com
A bit about me…
• Author, Inventor, Speaker – and part time photographer…
• 25+ years in the IT industry
• Worked in DoD, US Gov’t, Fortune 50, and so on…
• Find out more about the Data Vault:
  o http://www.youtube.com/LearnDataVault
  o http://LearnDataVault.com
• Full profile on http://www.LinkedIn.com/dlinstedt
What IS a Data Vault? (Business Definition)
• Data Vault Model
  o Detail oriented
  o Historical traceability
  o Uniquely linked set of normalized tables
  o Supports one or more functional areas of business
[Diagram: functional areas (Procurement, Sales, Delivery, Contracts, Finance, Planning, Operations); Business Keys span / cross lines of business]
• Data Vault Methodology
  – CMMI Level 5 Project Plan
  – Risk, Governance, Versioning
  – Peer Reviews, Release Cycles
  – Repeatable, Consistent, Optimized
  – Complete with Best Practices for BI/DW
What Does One Look Like?
[Diagram: Customer, Product, and Order Hubs, each with several Satellites, connected by a Link that has its own Satellite]
Elements:
• Hub
• Link
• Satellite
Records a history of the interaction
Hub = List of Unique Business Keys
Link = List of Relationships, Associations
Satellites = Descriptive Data
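The three element types can be sketched as plain record types. This is a minimal illustration, not the author's physical design: the field names (load_date, record_source, attributes) and the sample keys are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Hub:
    """One row per unique business key."""
    business_key: str
    load_date: datetime      # assumed audit field
    record_source: str       # assumed audit field


@dataclass(frozen=True)
class Link:
    """One row per relationship between two or more Hub keys."""
    hub_keys: tuple          # business keys of the Hubs being related
    load_date: datetime
    record_source: str


@dataclass(frozen=True)
class Satellite:
    """Descriptive data for one parent (a Hub or a Link) at one load time."""
    parent_key: object       # a Hub business key or a Link's hub_keys tuple
    load_date: datetime
    attributes: tuple        # (name, value) pairs


# Hypothetical sample rows.
loaded = datetime(2012, 1, 1)
customer = Hub("CUST-001", loaded, "sales_system")
product = Hub("PROD-042", loaded, "sales_system")
order_line = Link((customer.business_key, product.business_key), loaded, "sales_system")
customer_name = Satellite(customer.business_key, loaded, (("name", "Acme Corp"),))
```

Note that the Hub carries only the key, the Link carries only the combination of keys, and everything descriptive lives in a Satellite.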
Who’s Using It?
The PAIN!!
Issues in Current EDW Projects
EDW Architecture: Generation 1
[Diagram: Sales, Finance, and Contracts sources; Staging (EDW); Star Schemas; Enterprise BI Solution; all loaded in batch]
• Staging + History
• Complex Business Rules
• Complex Business Rules + Dependencies
• Conformed Dimensions
• Junk Tables
• Helper Tables
• Factless Facts
Kick-Starting Data Warehousing
1. HR Asks IT to build the FIRST Data Warehouse / Prototype
2. IT Says… OK: $125k and 90 days…
3. HR Says: Great! Get Started
Everyone’s Happy!
4. IT Delivers. On-Time & In Budget!
[Diagram: First Star! One fact table (Customer_ID, Customer_Name, Customer_Addr, Customer_Addr1, Customer_City, Customer_State, Customer_Zip, Customer_Phone, Fact_ABC, Fact_DEF, Fact_PDQ, Fact_MYFACT) surrounded by dimension tables (Customer_ID, Customer_Name, Customer_Addr, Customer_Addr1, Customer_City, Customer_State, Customer_Zip, Customer_Phone, Customer_Tag, Customer_Score, Customer_Region, Customer_Stats, Customer_Phone, Customer_Type)]
5. HR Says: Thank-you! We’re Happy!
So Where’s the PAIN?
The PAIN is RIGHT HERE!!
1. Contracts sees success, wants the same for their systems.
2. IT Says… Ok, but… It won’t be $125k and 90 days… Because we have to “merge it” with HR, it will be $250k and 180 days.
3. Contracts Says: Ouch! That’s not reasonable, but we need it, so go ahead…
And HERE….
Finance, Sales, and Marketing want in….
IT Says… Ok, but… It won’t be $250k and 90 days… Because we have to “merge it” with HR and Contracts, it will be $350k and 250 days.
And this continues….
Business Says... “Can’t you just make a copy of the Star Schema, and give me my own for cheaper & in less time?”
Silo Building / IT Non-Agility
• SALES: We built our own because IT costs too much
• FINANCE: We built our own because IT took too long
• MARKETING: We built our own because we need customized dimension data
[Diagram: each department's silo is a copy of the First Star]
Why is this happening? What’s Causing this Problem?
Root Cause of Pain: Re-Engineering!
IT is forced to Re-Engineer ETL loading code + SQL BI Queries WHENEVER:
• Table structures change
• New systems are introduced
• Business Rules Change (causing ETL Loading to change, and forcing Engineers to RELOAD existing data)
1. Adding fields to Dimensions
2. Adding fields to Facts
3. Adding Dimensions to Facts
[Diagram: a star schema with one fact table (Customer_ID, Customer_Name, Customer_Addr, Customer_Addr1, Customer_City, Customer_State, Customer_Zip, Customer_Phone, Fact_ABC, Fact_DEF, Fact_PDQ, Fact_MYFACT) and its dimension tables (Customer_ID, Customer_Name, Customer_Addr, Customer_Addr1, Customer_City, Customer_State, Customer_Zip, Customer_Phone, Customer_Tag, Customer_Score, Customer_Region, Customer_Stats, Customer_Phone, Customer_Type)]
Why Re-Engineering?
Require Re-Engineering!
Adding fields to a conformed dimension….
Adding fields to a shared fact….
Require adding/changing fields in target tables!
Changing code to match new business rules…
Other Pains?
Dimension-Itis?
Deformed Dimensions?
IT – Non-Agility?
WHAT ABOUT THE “DATA” YOU DON’T SEE?
WHAT ABOUT THE “BAD” DATA LEFT IN THE SOURCE SYSTEMS?
The Solution
Go the Data Vault Route!
EDW Architecture: Generation 2
[Diagram: Sales, Finance, and Contracts sources plus SOA (real-time); Staging (batch); DV EDW; Star Schemas, Error Marts, and Report Collections (batch); Enterprise BI Solution]
Business Rules Downstream! (the Lens Filter)
Unstructured Data And Data Vault
Unstructured Data Sets
• Email
• Docs
• Images
• Movies
• Sound
[Diagram: Unstructured Data Sets and Ontologies/Taxonomies; Unstructured Processing Engine; Data Vault EDW; Joins through LINK Structures; On-Demand Cubes]
IT Agility
[Diagram: Source; Staging; Data Vault (EDW); RAW “what-is” Star Schemas; Complex Business Rules (ETL-T); Business Driven Star Schemas]
1. Fast Load & Fast Integration
2. Business Gap Analysis
   • Unknown Time…
   • Business Requirements
   • Start new phase
3. IT Implementation of Business Rules
What are the Facts Jack?
Generation 1 EDWs tried to provide “One version of the truth”
Generation 2 (Data Vaults) provide… “One version of the facts, for each point in time.”
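To make “one version of the facts, for each point in time” concrete, here is a small sketch of an as-of lookup over a Satellite's history. The history rows and the value_as_of helper are hypothetical, and the sketch assumes rows are kept sorted by load date.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical Satellite history for one customer: (load_date, city), sorted by load_date.
history = [
    (date(2010, 1, 1), "Denver"),
    (date(2011, 6, 1), "Boston"),
    (date(2012, 3, 1), "Austin"),
]


def value_as_of(rows, as_of):
    """Return the value in effect on `as_of`, or None if nothing was loaded yet."""
    load_dates = [load_date for load_date, _ in rows]
    idx = bisect_right(load_dates, as_of)  # number of rows loaded on or before as_of
    return rows[idx - 1][1] if idx else None


city_2011 = value_as_of(history, date(2011, 12, 31))
city_2009 = value_as_of(history, date(2009, 12, 31))
```

Because no row is ever overwritten, the same question asked for a different date returns the fact as it stood on that date.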
Business Gap Analysis
[Diagram: Gap Analysis between two views]
• The way the business perceives its business to be running
• The way the source systems see the business running
[Diagram labels: Operational Reports; Dynamic Cubes (Data Marts)]
Secured/Protected Information Systems
[Diagram: a Non-Classified DV (Hubs, Links, Satellites) feeds a Classified Data Vault through a Data Copy and a Model Copy; the Classified Data Vault adds its own Hub, Link, and Satellites. Yellow = New Tables]
• Model changes are absorbed seamlessly into the classified system
• Classified world can add all their own structures while maintaining congruence with standard unclassified Data Vault
Extensibility Factor
[Diagram: Product Supplier Link (Product Shipped Dates, Billed Amounts, Product Quantities); Suppliers (Descriptions, Stock Quantities, Address, Rating Score); Products (Descriptions, Stock Quantities, Availability Dates, Defect Reasons)]
Existing EDW: No Impact!
New Additions: New Code
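A minimal sketch of the “no impact” idea, using an in-memory SQLite database. The table and column names are hypothetical. A new requirement is met by adding a new Satellite table, and the definitions of the existing tables are compared before and after to confirm that none of them changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Existing model: a Product Hub and one Satellite (hypothetical names).
conn.executescript("""
    CREATE TABLE hub_product (product_key TEXT PRIMARY KEY);
    CREATE TABLE sat_product_description (
        product_key TEXT REFERENCES hub_product(product_key),
        load_date   TEXT,
        description TEXT
    );
    INSERT INTO hub_product VALUES ('PROD-042');
    INSERT INTO sat_product_description VALUES ('PROD-042', '2012-01-01', 'Widget');
""")


def table_definitions(connection):
    """Return {table_name: CREATE statement} for every table in the database."""
    rows = connection.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return dict(rows)


before = table_definitions(conn)

# New requirement: track defect reasons. Add a new Satellite; no ALTER TABLE on existing ones.
conn.execute("""
    CREATE TABLE sat_product_defects (
        product_key   TEXT REFERENCES hub_product(product_key),
        load_date     TEXT,
        defect_reason TEXT
    )
""")
conn.execute(
    "INSERT INTO sat_product_defects VALUES ('PROD-042', '2012-02-01', 'Cracked case')"
)

after = table_definitions(conn)
unchanged = all(after[name] == sql for name, sql in before.items())
new_tables = sorted(set(after) - set(before))
```

Existing load code and queries keep running against hub_product and sat_product_description; only the new Satellite needs new code.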
Where’s the Solution?
Handle Changes Wherever… Whenever… with EASE!
Re-Engineering
The Three Vehicles…
Pros and Cons of the Modeling Methodologies
3rd Normal Form Pros/Cons as an EDW
PROS (as 3NF)
• Many to many linkages
• Handle lots of information
• Tightly integrated information
• Highly structured
• Conducive to near-real time loads
• Relatively easy to extend
CONS (as EDW)
• Time driven PK issues
• Parent-child complexities
• Cascading change impacts
• Difficult to load
• Not conducive to BI tools
• Not conducive to drill-down
• Difficult to architect for an enterprise
• Not conducive to spiral/scope controlled implementation
• Physical design usually doesn’t follow business processes
Star Schema Pros/Cons as an EDW
PROS (as Data Mart)
• Good for multi-dimensional analysis
• Subject oriented answers
• Excellent for aggregation points
• Rapid development / deployment
• Great for some historical storage
CONS (as EDW)
• Not cross-business functional
• Use of junk / helper tables
• Trouble with VLDW
• Unable to provide integrated enterprise information
• Can’t handle ODS or exploration warehouse requirements
• Trouble with data explosion in near-real-time environments
• Trouble with updates to type 2 dimension primary keys
• Trouble with late arriving data in dimensions to support real-time arriving transactions
• Not granular enough information to support real-time data integration
Data Vault Pros/Cons as an EDW
PROS (as EDW)
• Supports near-real time and batch feeds
• Supports functional business linking
• Extensible / flexible
• Provides rapid build / delivery of star schemas
• Supports VLDB / VLDW
• Designed for EDW
• Supports data mining and AI
• Provides granular detail
• Incrementally built
CONS (as EDW)
• Not conducive to OLAP processing
• Requires business analysis to be firm
• Introduces many join operations
The Three Vehicles…
• Which would you use to win a race?
• Which would you use to move a house?
• Would you adapt the truck and enter a race with Porsches and expect to win?
#1 complaint about DV architecture
So you want to deal with Joins do you?
Joins, Everywhere!
Yes, the DV is full of joins but…
• These are highly normalized tables (thin & narrow), reducing the I/Os needed to read large numbers of rows, at high speed, in parallel.
• Joins occur in RAM instead of on disk.
• The Optimizer is given a chance to “drop tables” from the join that aren’t necessary.
When Parallelism is too much…
• Not enough CPU or RAM to handle the extra work-load
• Not enough rows being queried (the overhead of starting the threads takes longer than an original scan)
End Result? The DV Scales to the Petabyte Levels when necessary…
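A back-of-the-envelope sketch of why thin, narrow rows reduce I/O. The 8 KB page size and the 512-byte and 64-byte row widths are assumptions chosen for the arithmetic, and page header overhead is ignored; real database engines differ.

```python
PAGE_BYTES = 8192  # assumed page size; real engines vary


def pages_to_scan(row_count, row_bytes, page_bytes=PAGE_BYTES):
    """Pages needed to read `row_count` rows of `row_bytes` each (rows never span pages)."""
    rows_per_page = page_bytes // row_bytes
    return -(-row_count // rows_per_page)  # ceiling division


wide_pages = pages_to_scan(1_000_000, 512)  # a wide, denormalized row: 16 rows per page
thin_pages = pages_to_scan(1_000_000, 64)   # a thin, narrow row: 128 rows per page
```

Under these assumptions the wide table needs 62,500 page reads for a million rows and the thin table needs 7,813, so each narrow table in the join costs roughly one eighth of the I/O.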
Mathematics Behind the Data Vault Model
*** The Data Vault is BACKED by Mathematical Principles***
• Parallel versus sequential execution models
• Set Logic
• I/O Bandwidth & Throughput
• Compression (for query performance gains)
• Process Repeatability (tuning & predictability measurements)
• RAM versus electromagnetic disk (Solid-State Drives are not measured)
• http://osl.cs.uiuc.edu/docs/IPDPS-TR04/TCA_TR04.pdf
Know when to hold ‘em, know when to fold ‘em
When to use DV, and when not…
The Challenger….
The challenger says:
• My system works fine, why should I use the Data Vault?
• I don’t have volume problems…
• I don’t have compliance/auditability problems…
• I don’t have real-time problems…
• My system produces matching results across lines of business…
• I’ve never had to “re-state” the data in the warehouse…
• I can still build new marts, and conform dimensions in 30 days or less…
• My business doesn’t acquire new systems often (if ever)
• My incoming data sets don’t change
I Say… That’s wonderful, don’t fix what isn’t broken. Have a nice day. Oh, but call me when or if you ever run into these problems…
When to Apply the Data Vault
Benefits:
• Scalability
• Auditability
• Flexibility
• Adaptability
Leads To…
• IT Agility
• IT and Business Accountability
• Reduction in Spread-Marts
• Corporate Asset Development
• Money Savings
• Risk Mitigation
• Successful EDW Implementations
How to build a data vault
In 10 easy steps…
Step 1
Identify your business processes, followed by your business keys (that are used to identify the data that flows through the business processes)
** NOTE: Along the way, document your assumptions, document your reasons for choosing keys, and modeling designs, develop a list of questions to be answered by business users…
Step 2
Identify the issues/problems that might be carried with the identified business keys, annotate the risks, and mitigate each one.
Step 3
Identify the units of work, the associations – LINK tables, where keys combine to form a notion, a concept, and a relationship.
Step 4
Identify the descriptive data that belongs to SINGLE Hub Keys, ensure that the data doesn’t represent or rely on a relationship.
Step 5
Identify the Satellite data that depends on relationships – move it to the appropriate LINK table.
HINT: If you “want” to put a Foreign Key in a Satellite, you have a clear sign that the Satellite is in the WRONG place, and needs to be assigned to a LINK table rather than a HUB.
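The hint can be turned into a simple check. This is a hypothetical helper, not part of the methodology's tooling: given the columns proposed for a Satellite and the set of known Hub keys, it suggests a LINK parent whenever the columns include the key of another Hub.

```python
def satellite_parent(parent_hub_key, columns, hub_keys):
    """Suggest where a Satellite belongs, following the Step 5 hint.

    If the columns include the key of any Hub other than the parent, the data
    depends on a relationship and should hang off a Link instead.
    """
    foreign_keys = sorted(c for c in columns if c in hub_keys and c != parent_hub_key)
    if foreign_keys:
        return "LINK", [parent_hub_key] + foreign_keys
    return "HUB", [parent_hub_key]


hub_keys = {"customer_id", "product_id", "order_id"}  # hypothetical Hub keys

name_sat = satellite_parent("customer_id", ["customer_id", "name", "city"], hub_keys)
discount_sat = satellite_parent(
    "customer_id", ["customer_id", "product_id", "discount"], hub_keys
)
```

Here the name and city stay on the Customer Hub, while a discount that depends on both customer and product is flagged as belonging to a Link between the two.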
Step 6
Scope the Model Down to a manageable chunk. Implement the first two Hubs, Hub Satellites, and first Link. BUILD IN INCREMENTS!
Step 7
Set up the key generation load routines, set up the staging area, and begin loading data.
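One possible shape for a key generation routine, sketched under assumptions: surrogate keys come from a simple in-memory sequence, and business keys are normalized by trimming whitespace and upper-casing before lookup. Neither choice is prescribed by the slides; in practice the sequence would usually live in the database.

```python
from itertools import count


class HubKeyGenerator:
    """Hands out one surrogate key per distinct business key (sequence-based sketch)."""

    def __init__(self):
        self._next = count(1)
        self._keys = {}

    def key_for(self, business_key):
        """Return the existing surrogate key, or assign the next one in sequence."""
        normalized = business_key.strip().upper()  # assumed normalization rule
        if normalized not in self._keys:
            self._keys[normalized] = next(self._next)
        return self._keys[normalized]


gen = HubKeyGenerator()
staged = ["cust-001", "CUST-002", " CUST-001 "]  # hypothetical staged business keys
assigned = [gen.key_for(bk) for bk in staged]
```

The first and third staged values resolve to the same surrogate key, so the Hub keeps a single row for that business key.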
Step 8
Review any “truncation” errors, or any data-type conversion problems, fix the staging area, and remove duplicates.
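A small sketch of the staging review, with assumptions stated up front: a value is treated as a truncation suspect when it exactly fills (or exceeds) the declared column width, and only exact duplicate rows are removed. The check_staging helper, column widths, and sample rows are hypothetical.

```python
def check_staging(rows, max_lengths):
    """Return (clean_rows, truncation_suspects).

    A value is a truncation suspect when it fills its column's declared width.
    Exact duplicate rows are dropped, keeping the first occurrence.
    """
    seen = set()
    clean, suspects = [], []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue  # exact duplicate of an earlier staged row
        seen.add(fingerprint)
        clean.append(row)
        for column, limit in max_lengths.items():
            if len(str(row.get(column, ""))) >= limit:
                suspects.append((column, row[column]))
    return clean, suspects


staged = [
    {"customer_id": "C1", "name": "Acme Corpo"},  # exactly 10 characters
    {"customer_id": "C2", "name": "Bolt"},
    {"customer_id": "C2", "name": "Bolt"},        # duplicate row
]
clean, suspects = check_staging(staged, {"name": 10})
```

Suspects are only candidates for review; a value can legitimately be exactly as long as the column allows.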
Step 9
Begin loading the Data Vault. Load all Hubs, then all Hub Satellites, then all Links, and finish with all Link Satellites.
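The load sequence can be expressed as a sort over load jobs. The job names and the type labels below are hypothetical; the only thing taken from the step is the order Hubs, Hub Satellites, Links, Link Satellites.

```python
LOAD_ORDER = ["hub", "hub_satellite", "link", "link_satellite"]


def order_load_jobs(jobs):
    """Sort (table_name, table_type) jobs into the Step 9 load sequence."""
    rank = {table_type: i for i, table_type in enumerate(LOAD_ORDER)}
    # sorted() is stable, so jobs of the same type keep their original order.
    return sorted(jobs, key=lambda job: rank[job[1]])


jobs = [
    ("sat_order_line", "link_satellite"),
    ("link_order_line", "link"),
    ("sat_customer", "hub_satellite"),
    ("hub_customer", "hub"),
    ("hub_product", "hub"),
]
ordered = [name for name, _ in order_load_jobs(jobs)]
```

Loading in this order means every Satellite and Link row finds its parent keys already in place.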
Step 10
Reconcile the Data Vault to the source system, then build a first data mart from the results. Bring business value FAST!
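A minimal reconciliation sketch, assuming that reconciling means comparing the distinct business keys in the source with the keys loaded into a Hub. Real reconciliation would typically also compare row counts and attribute values; the reconcile helper and sample keys are hypothetical.

```python
def reconcile(source_keys, hub_keys):
    """Compare distinct source business keys with the keys loaded into a Hub."""
    source, vault = set(source_keys), set(hub_keys)
    return {
        "missing_from_vault": sorted(source - vault),
        "unexpected_in_vault": sorted(vault - source),
        "reconciled": source == vault,
    }


report = reconcile(["C1", "C2", "C2", "C3"], ["C1", "C2"])
```

In this example the report flags C3 as missing from the vault, which should be resolved before building the first data mart.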
Instructor led lab
10 minutes to find the Hubs….
Possible Hubs From Northwind
10 Minutes to find the Links…
Possible Links From Northwind
10 minutes to find the Satellites…
Possible Satellites From Northwind
What did we learn?
• We often deal with more than 1 system at a time… this was a lab with only one model.
• We didn’t have any business requirements telling us which questions we might need to answer, but doesn’t that reflect real life?
• The data set is extremely dirty (you never have that in your systems, right?)
• Time Zone based data can be a problem
• Lack of metadata causes integration issues and complicates modeling decisions
The Experts Say…
“The Data Vault is the optimal choice for modeling the EDW in the DW 2.0 framework.” Bill Inmon
“The Data Vault is foundationally strong and exceptionally scalable architecture.” Stephen Brobst
“The Data Vault is a technique which some industry experts have predicted may spark a revolution as the next big thing in data modeling for enterprise warehousing....” Doug Laney
More Notables…
“This enables organizations to take control of their data warehousing destiny, supporting better and more relevant data warehouses in less time than before.” Howard Dresner
“[The Data Vault] captures a practical body of knowledge for data warehouse development which both agile and traditional practitioners will benefit from..” Scott Ambler
Where To Learn More
• The Technical Modeling Book: http://LearnDataVault.com
• The Discussion Forums & events: http://LinkedIn.com – Data Vault Discussions
• Contact me:
  o http://DanLinstedt.com - web site
  o DanLinstedt@gmail.com - email
• World wide User Group (Free): http://dvusergroup.com