EDW Data Model Storming for
Integration of NoSQL with RDBMS
SQL Saturday #497, April 2, 2016
Daniel Upton
DW-BI Architect, Data Modeler
DecisionLab.Net Serving Orange County and San Diego County since 2007
[email protected] blog: www.decisionlab.net
linkedin.com/in/DanielUpton
__________________________________________________________________________________________________________________________________________________________________________________
Page 2 of 20
Open Questions
o With DW-BI now a mainstream I.T. career specialization with an established set of best-practices, why do many real-world implementations still fall short of satisfying business stakeholder expectations?
o What influence has Lean and Agile thinking had on DW-BI?
o What parts of DW implementation have been most resistant to Agile?
o Are established DW data modeling methods an asset or a liability?
o What factors are driving change in data modeling for business intelligence?
o What is Data Model Storming?
o What do we mean by Integration?
o What challenges does NoSQL introduce to data modeling intended for integration with RDBMS data?
o What does End-to-End Model Storming mean?
Objectives
o Describe a data modeling method and demonstrate how it differs from both dimensional modeling and 3rd Normal Form according to…
o Agile: quickly and iteratively deliver minimally viable products (MVPs) to users.
o Lean: design in loose coupling to minimize or eliminate functional dependencies.
o PMBOK: break down work (including design) into small-yet-cohesive chunks.
o Review BEAM Dimensional Model Storming (Corr and Stagnitto).
o Demonstrate some best-practice NoSQL data models as major variations from 3rd Normal Form.
o Introduce and perform EDW Model Storming with a simple use case involving unpredictable, last-minute changes to business rules.
o Extend the Model Storm with a last-minute requirement for NoSQL integration.
__________________________________________________________________________________________________________________________________________________________________________________
Page 3 of 20
Traditional Data Modeling Methods
o 3rd Normal Form: OLTP and EDW
o Dimensional Warehouse / Mart: Star Schema with Facts and Dimensions
__________________________________________________________________________________________________________________________________________________________________________________
Page 4 of 20
3rd Normal OLTP Source to Data Vault (Aliases: Lean DW, Hyper-Normal Model)
o One Hub and all of its dependent Satellites are known as an Ensemble: a stand-alone set of tables that always has zero functional dependencies on other Ensembles.
o Hubs store business keys (unique identifiers that are well known to non-techies and enterprise-wide).
o Satellites store and historize all attribute fields.
__________________________________________________________________________________________________________________________________________________________________________________
Page 5 of 20
o Links store all relationships as associations
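The Hub / Satellite / Link pattern above can be sketched as DDL. Here is a minimal sketch using SQLite, with illustrative table and column names (Hub_Customer, Sat_Customer, Link_CustomerOrder) that are assumptions of mine, not tables from the deck's example:

```python
import sqlite3

# One Data Vault ensemble, sketched in SQLite
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: business key only, plus a surrogate key and load metadata
CREATE TABLE Hub_Customer (
    HubCustomer_SQN INTEGER PRIMARY KEY,
    CustomerNumber  TEXT NOT NULL UNIQUE,   -- business key
    DW_Load_DTS     TEXT NOT NULL
);

-- Satellite: all descriptive attributes, historized by load timestamp
CREATE TABLE Sat_Customer (
    HubCustomer_SQN    INTEGER NOT NULL REFERENCES Hub_Customer,
    DW_Load_DTS        TEXT NOT NULL,
    DW_Load_Expire_DTS TEXT,
    CustomerName       TEXT,
    City               TEXT,
    PRIMARY KEY (HubCustomer_SQN, DW_Load_DTS)
);

-- Link: pure association between hubs; no business attributes
CREATE TABLE Link_CustomerOrder (
    LinkCustomerOrder_SQN INTEGER PRIMARY KEY,
    HubCustomer_SQN INTEGER NOT NULL REFERENCES Hub_Customer,
    HubOrder_SQN    INTEGER NOT NULL,
    DW_Load_DTS     TEXT NOT NULL,
    UNIQUE (HubCustomer_SQN, HubOrder_SQN)
);
""")
```

Note how every join in this sketch is a single-field equi-join on a surrogate key, which is the loose coupling the rest of the deck relies on.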
__________________________________________________________________________________________________________________________________________________________________________________
Page 6 of 20
BEAM Model Storming (Corr and Stagnitto)
o Accelerates agile dimensional design with a great shorthand notation on eye-friendly visual information displays, enabling real-time dimensional design during requirements meetings with business stakeholders.
o Begins with a user information story.
o Ends with artifacts that capture the business requirement while also specifying the logic for a star schema.
o One such artifact is an event matrix (minimal example).
o Includes source-data column profiling at column/record level; ignores source data structure.
__________________________________________________________________________________________________________________________________________________________________________________
Page 7 of 20
Best-Practice NoSQL (Wide-Table, No-Joins) Data Model: Why not in 3rd Normal Form?
o Fields are duplicated and/or pivoted to balance join minimization against redundant storage.
o Just an example, not to be integrated in our scenario…
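For illustration, here is a hedged sketch (in Python, with hypothetical record and field names) of the same data in 3rd Normal Form versus the wide, join-free document shape described above:

```python
# 3NF: customer attributes live exactly once, referenced by key
customers = {"C1": {"name": "Acme", "city": "Carlsbad"}}
orders_3nf = [
    {"order_id": 101, "customer_id": "C1", "total": 50.0},
    {"order_id": 102, "customer_id": "C1", "total": 75.0},
]

# Wide-table / document form: customer fields duplicated into each
# order record, trading redundant storage for zero joins at read time
orders_wide = [
    {"order_id": o["order_id"], "total": o["total"],
     "customer_name": customers[o["customer_id"]]["name"],
     "customer_city": customers[o["customer_id"]]["city"]}
    for o in orders_3nf
]
# Each wide record is now self-contained; a rename of "Acme" would
# have to touch every order, which is why this is not 3NF.
```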
__________________________________________________________________________________________________________________________________________________________________________________
Page 8 of 20
More on Lean Data Warehouse (Hyper-Normal / Data Vault): Objectives
o Fully enforced, simple referential integrity (single-field equi-joins only).
o Identify a business key and store its values as unique records in a Hub table; a surrogate PK removes all functional dependencies (tight couplings) to this identifier from other tables' FKs.
o Store the history of value changes to all attributes in a child table using LoadDTS and LoadEndDTS.
o Store all table relationships via an associative join table, to accommodate any current or future real-world cardinality (1-to-1, 1-to-M, and M-to-M).
Why
o While preserving all actual relationships between records in related tables, all DW table relationships are now abstracted as Hub PK, related to Link FK, related to Hub PK.
o For Satellite identifier fields that were used as foreign keys in the source (and were thus tightly coupled), remove these functional dependencies to other DV Ensembles.
Benefits
o Zero functional dependencies between DW Ensembles, so small increments may be designed, loaded, and released based only on the definition of a Minimally Viable Product (MVP), rather than forcing larger, slower, more functionally intertwined releases.
o When a directly related data subject area is added later, this is accomplished with zero refactoring of the existing Ensembles.
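The LoadDTS / LoadEndDTS historization objective above can be sketched as follows. This is a simplified in-memory illustration rather than the deck's actual ETL, and the field names are assumptions:

```python
HIGH_DATE = "9999-12-31T00:00:00"  # sentinel for "still active"

def load_satellite_row(sat_rows, hub_sqn, attributes, load_dts):
    """Historize one satellite version for the given hub key."""
    # Expire the currently active version for this hub key, if any
    for row in sat_rows:
        if row["hub_sqn"] == hub_sqn and row["load_end_dts"] == HIGH_DATE:
            row["load_end_dts"] = load_dts
    # Insert the new version as the active record
    sat_rows.append({"hub_sqn": hub_sqn, "load_dts": load_dts,
                     "load_end_dts": HIGH_DATE, **attributes})

sat = []
load_satellite_row(sat, 1, {"city": "Carlsbad"}, "2016-01-01")
load_satellite_row(sat, 1, {"city": "San Diego"}, "2016-03-01")
# The first version is now expired as of 2016-03-01; the second is
# active, so no old value is ever overwritten, unlike in the source.
```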
__________________________________________________________________________________________________________________________________________________________________________________
Page 9 of 20
Mindset for Lean DW ModelStorm Design:
o K.I.S.S.: once a source table is determined in-scope, include all fields and records, so you never have to add them later.
o Other than creating Hubs, Satellites, and Links, perform no other transformations in this layer: no calculations, aggregations, or business rules (yet).
o As such, we are NOT, or at least NOT YET, attempting to define a single version of the truth (SVOT), nor a data presentation / reporting RDBMS layer.
o Instead, we are…
o Loosely integrating data from multiple data sources
o Aligning it around business keys
o Tracking the history of attributes whose old values may be overwritten in source systems
o Supporting all actual (intended and otherwise) relationships among records in related tables
o Doing all of the above while enforcing simple referential integrity exclusively with single-field equi-joins
__________________________________________________________________________________________________________________________________________________________________________________
Page 10 of 20
DW ModelStorm Design Steps:
o Begin where BEAM ModelStorming ends. From there…
o Define business keys
o Identify in-scope source tables
o Reverse-engineer the in-scope tables into a data modeling tool
o Identify and define the cardinality of physical and logical (non-instantiated) relationships
o Classify each source table as a bona fide Entity or merely an Association
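The entity-versus-association classification step can be approximated with a common heuristic, which is an assumption on my part rather than a rule stated in the deck: a table whose primary key consists entirely of foreign keys to other tables is usually a mere association.

```python
def classify(table):
    """Heuristic: PK made up entirely of FKs -> association."""
    pk, fks = set(table["pk"]), set(table["fks"])
    return "association" if pk and pk <= fks else "entity"

# Hypothetical source tables for illustration
customer = {"pk": ["customer_id"], "fks": []}
order_line = {"pk": ["order_id", "product_id"],
              "fks": ["order_id", "product_id"]}
print(classify(customer), classify(order_line))  # entity association
```

In practice this heuristic is only a starting point; a modeler still confirms each classification with business stakeholders.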
__________________________________________________________________________________________________________________________________________________________________________________
Page 11 of 20
__________________________________________________________________________________________________________________________________________________________________________________
Page 12 of 20
Now, group the source tables into distinct Subject Areas
o Make copies of all of the above tables and place them into a new submodel
__________________________________________________________________________________________________________________________________________________________________________________
Page 13 of 20
Next, for each new table-copy…
o Remove all (source-based) foreign-key relationships without removing the underlying identifier fields.
o Remove the primary-key constraint.
o Add the following control / metadata fields: DWLoadBatchID_SourceSys, DW_Load_DTS, DW_Load_Expire_DTS, Placeholder_SurrogateKey (explained later).
o Create a new composite primary key: Placeholder_SurrogateKey + Load_DTS.
o Satellite-splitting: if a subset of fields is updated in source much more frequently than the others, and the table will be large enough that ETL processing of the frequent updates would result in excessive loading times, split the table into two or more subsets.
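The satellite-splitting rule above can be sketched as a simple partition of columns by update frequency; the column names and change rates below are hypothetical:

```python
def split_satellite(change_rates, threshold):
    """Partition columns into a fast-changing and a slow-changing
    satellite, given updates-per-day observed for each column."""
    fast = sorted(c for c, r in change_rates.items() if r >= threshold)
    slow = sorted(c for c, r in change_rates.items() if r < threshold)
    return fast, slow

# Hypothetical profiling results (updates per day per column)
rates = {"balance": 500.0, "status": 120.0, "name": 0.1, "address": 0.5}
fast_sat, slow_sat = split_satellite(rates, threshold=1.0)
# fast_sat -> ['balance', 'status']; slow_sat -> ['address', 'name']
```

Frequent loads then touch only the small fast-changing satellite, keeping ETL windows short.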
__________________________________________________________________________________________________________________________________________________________________________________
Page 14 of 20
__________________________________________________________________________________________________________________________________________________________________________________
Page 15 of 20
Then, starting with tables classified earlier as bona fide entities…
o In the new submodel, rename the Placeholder_SurrogateKey field to Hub[EntityName]_SQN (or …_HashId) for all tables split from the source entity table.
o Copy one of these tables again.
o In the newest table-copy, delete all fields except the new PK, the new control fields, and the Business Key.
o Rename the table as "Hub_[Enter Entity Name Here]".
o Remove Load_DTS from the primary key.
o Add a unique constraint to the Business Key.
o Rename each corresponding table as "Sat_[Enter Entity Name Here_&Something]".
o Create a defining relationship between the Hub (parent / 1) and each "Sat_[Enter Entity Name Here_&Something]" so that the child table's FK is also part of its PK.
Once all entity tables are converted into Hub-Satellite sets, start on the mere-Association tables…
o Still in the new submodel, repeat the steps above to add control fields.
o Add a new "Link_[Assoc_Name]_SQN" (or _HashID) field.
o As above, set the PK as …SQN + Load_DTS.
o Rename the table to "Sat_Link_[Enter Assoc. Name Here]".
o Create another copy of the table, and rename it "Link_[Enter Assoc. Name Here]".
o Follow the same remaining steps as with Hubs, except that no Business Key remains in the Link.
o Create defining relationships from the Link (child) to its directly related Hubs (parents), so that each Hub_[ParentHub]_SQN is included in the Link.
o Create a unique key on the composite of the Hub_[ParentHub]_SQN fields.
o Create a defining relationship from the Link (parent) to the LinkSat (child).
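The finished Link and Link-Satellite pair described above might look like the following SQLite sketch; the hub, link, and attribute names are illustrative assumptions, not the deck's actual model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Hub_Customer (HubCustomer_SQN INTEGER PRIMARY KEY);
CREATE TABLE Hub_Product  (HubProduct_SQN  INTEGER PRIMARY KEY);

-- Link: one row per distinct combination of parent-hub keys
CREATE TABLE Link_CustomerProduct (
    LinkCustomerProduct_SQN INTEGER PRIMARY KEY,
    HubCustomer_SQN INTEGER NOT NULL REFERENCES Hub_Customer,
    HubProduct_SQN  INTEGER NOT NULL REFERENCES Hub_Product,
    DW_Load_DTS     TEXT NOT NULL,
    UNIQUE (HubCustomer_SQN, HubProduct_SQN)  -- unique key on hub SQNs
);

-- Link-Satellite: historized attributes of the relationship itself
CREATE TABLE Sat_Link_CustomerProduct (
    LinkCustomerProduct_SQN INTEGER NOT NULL
        REFERENCES Link_CustomerProduct,
    DW_Load_DTS        TEXT NOT NULL,
    DW_Load_Expire_DTS TEXT,
    Quantity           INTEGER,
    PRIMARY KEY (LinkCustomerProduct_SQN, DW_Load_DTS)
);
""")
```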
__________________________________________________________________________________________________________________________________________________________________________________
Page 16 of 20
When all Hubs, Links, and Satellites are done, our example looks like this…
__________________________________________________________________________________________________________________________________________________________________________________
Page 17 of 20
At this time, in the 11th hour before our release, a new requirement is announced…
o With a truly elegant display of back-pedaling and dissembling by our primary business stakeholder, standing alongside the organization's new data scientist.
o Remember that "not to be integrated" NoSQL example? Well, it does need to integrate after all, and, oops, before the release.
o For what it's worth, the data scientist assures us that, with his astonishing coding skills, he neither needs nor wants a data presentation layer or SVOT.
__________________________________________________________________________________________________________________________________________________________________________________
Page 18 of 20
Your team huddles privately afterwards…
Amid the grumbling, the PM politely asks, "How long will this take to design and load?"
You smile and answer: 1-2 days. An hour later, you show these model additions…
__________________________________________________________________________________________________________________________________________________________________________________
Page 19 of 20
Questions:
Does Lean Data Warehouse (Data Vault / Hyper Normal) extend to complex data models with many source systems?
__________________________________________________________________________________________________________________________________________________________________________________
Page 20 of 20
DecisionLab.Net
_____________________________________________________________________
Data Warehouse / Business Intelligence envisioning, implementation, oversight, and assessment ________________________________________________________________________________________________________________
This slide deck available now at… slideshare.net/DanielUpton/
_______________________________________________________________________________________________________________
Daniel Upton [email protected] Carlsbad, CA blog: http://www.decisionlab.net phone 760.525.3268