Date post: | 24-Nov-2014 |
Upload: | ramasatyam |
1
Dimensional Design
Presented by Dr. Debashis Parida
2
Course Agenda
Rationale for dimensional modeling
Dimensional modeling basics
Dimensional modeling details
Fact table details
Dimension table details
Design process
Aggregate schemas
Multiple fact tables
Architected data marts
3
Rationale for Dimensional Modeling
4
OLTP Design Characteristics
Focus of OLTP design
• Individual data elements
• Data relationships
Design goals
• Accurately model business
• Remove redundancy
5
OLTP Design Shortcomings
Complex
Unfamiliar to business people
Incomplete history
Slow query performance
6
Emergence of Dimensional Model
Logical modeling technique
• For designing relational database structures
Addresses OLTP design shortcomings
• For use in analytic systems
First developed in the early 1980s
• Packaged goods industry
Popularized by Ralph Kimball, PhD.
• 1996 book: 'The Data Warehouse Toolkit'
7
Dimensional Modeling Basics
8
Facts
Coffee Maker Fulfillment Report
Brand           Product                Units Sold  Units Shipped  % Shipped
Captain Coffee  Standard Coffee Maker  5,000       3,800          76%
                Thermal Coffee Maker   2,400       1,632          68%
                Deluxe Coffee Maker    2,073       1,658          80%
                All Products           9,473       7,090          75%
Process Measurement
Measures
• Metrics or indicators by which people evaluate a business process
• Referred to as “Facts”
Examples
• Margin
• Inventory Amount
• Sales Dollars
• Receivable Dollars
• Return Rate
9
Perspective Focus
Process-oriented business perspectives
[Figure: business areas (Operations, Sales and Marketing, Customer Services, Product Development) with example perspectives: product, category, warehouse, G/L account, supplier]
10
Dimensions
[Figure: Coffee Maker Fulfillment Report, repeated from the Facts slide]
Process Perspectives
Dimensions
• The parameters by which measures are viewed
• Used to break out, filter or roll up measures
• Often found after the word “by” in a business question
• Descriptive business terms
Examples
• Product
• Warehouse
• Customer
• Supplier
11
Dimensional Model
Definition
• Logical data model used to represent the measures and dimensions that pertain to one or more business subject areas
Dimensional Model = Star Schema
• Serves as basis for the design of a relational database schema
• Can easily translate into multi-dimensional database design if required
• Overcomes OLTP design shortcomings
12
Dimensional Model Advantages
Understandable
Systematically represents history
Reliable join paths
High performance query
Enterprise scalability
13
Schema Simplicity
[Figure: Star Schema with a central Facts table joined to the Store, Time and Product dimensions]
Fewer tables
• Denormalized
• Consolidated
Dimensional
• Familiar to users
• Facts go in the fact tables
• Dimensions in dimension tables
Increases understandability
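As a minimal sketch of the star layout described above, the following uses Python's built-in sqlite3 module. The table names, column names and sample rows (store_dim, sales_facts, units_sold and so on) are assumptions made for illustration; they are not taken from the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT, brand TEXT);
    CREATE TABLE time_dim    (time_key    INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
    -- Facts sit in the center; each row points at one row in every dimension.
    CREATE TABLE sales_facts (
        store_key   INTEGER REFERENCES store_dim(store_key),
        product_key INTEGER REFERENCES product_dim(product_key),
        time_key    INTEGER REFERENCES time_dim(time_key),
        units_sold  INTEGER,
        PRIMARY KEY (store_key, product_key, time_key)
    );
""")
conn.executemany("INSERT INTO store_dim VALUES (?, ?, ?)",
                 [(1, "Downtown", "East"), (2, "Airport", "West")])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                 [(1, "Standard Coffee Maker", "Captain Coffee"),
                  (2, "Thermal Coffee Maker", "Captain Coffee")])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)",
                 [(1, "2014-11-24", "2014-11", 2014)])
conn.executemany("INSERT INTO sales_facts VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 30), (1, 2, 1, 12), (2, 1, 1, 20)])

# "Units sold by region": join the fact table to the one dimension needed, then group.
units_by_region = dict(conn.execute("""
    SELECT s.region, SUM(f.units_sold)
    FROM sales_facts f
    JOIN store_dim s ON s.store_key = f.store_key
    GROUP BY s.region
""").fetchall())
print(units_by_region)
```

Every business question follows the same pattern: join the fact table to whichever dimensions appear after the word "by", then sum the facts.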
14
Data Familiarity
[Figure: single source field ord_date expanded into a Time Dimension with year, quarter, month, date, day of the week and holiday flag]
Adding business context
• Single source field
• Expanded into parts
• Decoded into business terms
• Add special indicators and flags
• e.g. time dimension
Increases understandability
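The expansion of a single source date into business-friendly attributes could look roughly like this in Python. The function name, the output column names and the holiday list are hypothetical; only the attribute list (year, quarter, month, date, day of the week, holiday flag) comes from the slide.

```python
from datetime import date

# Hypothetical holiday list; a real load would read this from a calendar source.
HOLIDAYS = {date(2014, 12, 25), date(2015, 1, 1)}

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]

def time_dimension_row(ord_date):
    """Expand one source date into descriptive time dimension attributes."""
    return {
        "date": ord_date.isoformat(),
        "year": ord_date.year,
        "quarter": f"Q{(ord_date.month - 1) // 3 + 1}",
        "month": MONTH_NAMES[ord_date.month - 1],
        "day_of_week": DAY_NAMES[ord_date.weekday()],  # weekday(): Monday is 0
        "holiday_flag": "Y" if ord_date in HOLIDAYS else "N",
    }

row = time_dimension_row(date(2014, 12, 25))
print(row)
```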
15
Representing History
[Figure: star schema (Store, Product, Time Dimension, Facts); Time Dimension attributes: year, quarter, month, date, day of the week, holiday flag]
Time dimension
• Part of every star schema
• Marks the date when the facts (process measurements) occurred
• Allows the schema to easily add and query data over time
• Especially useful for performing comparison queries
16
Fewer Join Paths
Star schema joins
• Defined during schema design, not at runtime
• Business people can easily understand these relationships
• One-to-many relations between dimensions and facts
• Referential integrity always enforced
17
High Performance Design
Fewer joins mean less 'expensive' queries
Deterministic query patterns
Star schema query optimization supported by all major RDBMS vendors
18
Subject Area Models
Subject area E/R models
• Operations
• Sales and Marketing
• Customer Services
• Product Development
Subject area dimensional models
• Manufacturing and Process Control
• Sales Order Entry and Campaign Management
• Customer Support and Relationship Management
• Shipping and Inventory Management
19
Enterprise Models
Enterprise scope E/R model
Enterprise scope dimensional model
20
Dimensional Design Details
21
Star Schema Dimension Tables
[Figure: three Dimension tables]
Dimension tables
• Store dimension values
• Textual content
• Dimension tables usually referred to simply as 'dimensions'
• Spend extra effort to add dimensional attributes
22
Dimension Keys
[Figure: three Dimension tables, each with a key column]
Synthetic keys
• Each table assigned a unique primary key, specifically generated for the data warehouse
• Primary keys from source systems may be present in the dimension, but are not used as primary keys in the star schema
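One way to picture synthetic key assignment is a small lookup that hands out warehouse-generated integers and keeps the source system key as an ordinary attribute. The class name, the SKU values and the column names below are hypothetical and only illustrate the idea.

```python
from itertools import count

class DimensionKeyGenerator:
    """Hand out warehouse-generated (synthetic) keys for one dimension table."""

    def __init__(self):
        self._next_key = count(1)
        self._by_source_key = {}

    def key_for(self, source_key):
        # Reuse the synthetic key if this source record was seen before.
        if source_key not in self._by_source_key:
            self._by_source_key[source_key] = next(self._next_key)
        return self._by_source_key[source_key]

product_keys = DimensionKeyGenerator()
rows = [
    # The source key is kept as an attribute, not as the primary key.
    {"product_key": product_keys.key_for(sku), "source_sku": sku, "product": name}
    for sku, name in [("CM-STD", "Standard Coffee Maker"),
                      ("CM-THM", "Thermal Coffee Maker")]
]
print(rows)
```

This sketch maps each source key to exactly one synthetic key. A dimension that keeps several historical profiles per source record would need more than one synthetic key per source key, which is not covered here.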
23
Dimension Columns
[Figure: three Dimension tables, each with a key and several attribute columns]
Dimension attributes
• Specify the way in which measures are viewed: rolled up, broken out or summarized
• Often follow the word “by” as in “Show me Sales by Region and Quarter”
• Frequently referred to as 'Dimensions'
24
Star Schema Fact Table
[Figure: Fact Table with columns fact1, fact2, fact3]
Process measures
• Start by assigning one fact table per business subject area
• Fact tables store the process measures (aka Facts)
• Compared to dimension tables, fact tables usually have a very large number of rows
25
Fact Table Primary Key
[Figure: Fact Table with three key columns and fact1, fact2, fact3]
Every fact table
• Multi-part primary key added
• Made up of foreign keys referencing dimensions
26
Fact Table Sparsity
Sparsity
• Term used to describe the very common situation where a fact table does not contain a row for every combination of every dimension table row for a given time period
• Because fact tables contain a very small percentage of all possible combinations, they are said to be "sparsely populated" or "sparse"
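To make the idea of sparsity concrete, the share of populated combinations can be estimated from row counts alone. The function name and the row counts below are assumed example values, not figures from the course.

```python
from math import prod

def fact_table_density(fact_rows, dimension_row_counts):
    """Fraction of all possible dimension combinations that have a fact row."""
    possible_combinations = prod(dimension_row_counts)
    return fact_rows / possible_combinations

# Assumed example: 500 products x 100 stores x 365 days, 2,000,000 fact rows.
density = fact_table_density(2_000_000, [500, 100, 365])
print(f"{density:.1%} of possible combinations are populated")
```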
27
Fact Table Grain
Grain
• The level of detail represented by a row in the fact table
• Must be identified early
• Cause of greatest confusion during design process
Example
• Each row in the fact table represents the daily item sales total
28
Designing a Star Schema
Five initial design steps
• Based on Kimball's six steps
• Start designing in order
• Re-visit and adjust over project life
29
Step One
1. Identify fact table
• Start by naming the fact table with the name of the business subject area
30
Step Two
2. Identify fact table grain
• Describe what a row in the fact table represents, in business terms
31
Step Three
3. Identify dimensions
32
Step Four
4. Select facts
33
Step Five
5. Identify dimensional attributes
34
Fact Table Details
35
Example Fact Table
Sales Facts
• model_key
• dealer_key
• time_key
• revenue
• quantity
36
Facts
Fully additive
• Can be summed across any and all dimensions
• Stored in fact table
• Examples: revenue, quantity
37
Facts
Semi-additive
• Can be summed across most dimensions but not all
• Anything that measures a “level”
• Must be careful with ad-hoc reporting
• Often aggregated across the “forbidden dimension” by averaging
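A small worked example may help here. It assumes an inventory "level" (units on hand) recorded per product per day, with time as the forbidden dimension; the sample values and names are made up for illustration.

```python
from collections import defaultdict

# Assumed sample: units on hand (a "level") per product per day.
inventory = [
    ("Mon", "Standard", 100), ("Mon", "Thermal", 40),
    ("Tue", "Standard", 90),  ("Tue", "Thermal", 50),
]

# Summing across products for a single day is meaningful.
on_hand_by_day = defaultdict(int)
for day, _product, units in inventory:
    on_hand_by_day[day] += units

# Summing across days is not: that many units were never in stock at once.
misleading_total = sum(on_hand_by_day.values())

# Across time, average the daily totals instead.
average_on_hand = misleading_total / len(on_hand_by_day)
print(dict(on_hand_by_day), misleading_total, average_on_hand)
```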
38
Facts
Non-additive
• Cannot be summed across any dimension
• All ratios are non-additive
• Break down to fully additive components, store them in fact table
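The Coffee Maker Fulfillment Report from the Facts slide illustrates why ratios should be rebuilt from additive components. The sketch below reuses the units sold and units shipped from that report; the variable and function names are illustrative only.

```python
# (units sold, units shipped) per product, from the Coffee Maker Fulfillment Report.
report = {
    "Standard Coffee Maker": (5000, 3800),
    "Thermal Coffee Maker": (2400, 1632),
    "Deluxe Coffee Maker": (2073, 1658),
}

def pct_shipped(sold, shipped):
    return 100 * shipped / sold

# Not reliable: averaging the per-product ratios.
average_of_ratios = sum(pct_shipped(s, sh) for s, sh in report.values()) / len(report)

# Reliable: sum the additive components first, then compute the ratio once.
total_sold = sum(sold for sold, _ in report.values())
total_shipped = sum(shipped for _, shipped in report.values())
pct_all_products = pct_shipped(total_sold, total_shipped)

print(round(average_of_ratios, 2), round(pct_all_products, 2))
```

For these numbers the two results differ by only a fraction of a percentage point, but the gap can grow when product volumes are more uneven, which is why only the additive components belong in the fact table.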
39
Factless Fact Table
A fact table with no measures in it
• Nothing to measure...
• …Except the convergence of dimensional attributes
• Sometimes store a “1” for convenience
• Examples: Attendance, Customer Assignments, Coverage
40
Dimension Table Details
41
Example Dimension Tables
Dealer
• dealer_key
• region
• state
• city
• dealer
Model
• model_key
• brand
• category
• line
• model
Time
• time_key
• year
• quarter
• month
• date
42
Dimension Tables
Characteristics
• Hold the dimensional attributes
• Usually have a large number of attributes (“wide”)
• Add flags and indicators that make it easy to perform specific types of reports
• Have small number of rows in comparison to fact tables (most of the time)
43
Don’t Normalize Dimensions
Saves very little space
Impacts performance
Can confuse matters when multiple hierarchies exist
A star schema with normalized dimensions is called a "snowflake schema"
• Usually advocated by software vendors whose products require snowflake for performance
44
Slowly Changing Dimensions
Dimension source data may change over time
Relative to fact tables, dimension records change slowly
Allows dimensions to have multiple 'profiles' over time to maintain history
Each profile is a separate record in a dimension table
45
Slowly Changing Dimension Example
Example: A woman gets married
Possible changes to customer dimension
• Last Name
• Marriage Status
• Address
• Household Income
Existing facts need to remain associated with her single profile
New facts need to be associated with her married profile
46
Slowly Changing Dimension Types
Three types of slowly changing dimensions
Type 1
• Updates existing record with modifications
• Does not maintain history
Type 2
• Adds new record
• Does maintain history
• Maintains old record
Type 3
• Keep old and new values in the existing row
• Requires a design change
47
Designing Loads to Handle SCD
Design and implementation guidelines
• Gather SCD requirements when designing data mapping and loading
• SCD needs to be defined and implemented at the dimensional attribute level
• Each column in a dimension table needs to be identified as a Type 1 or a Type 2 SCD
• If one Type 1 column changes, then all Type 1 columns will be updated
• If one Type 2 column changes, then a new record will be inserted into the dimension table
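The column-level rules above can be sketched as a small in-memory load routine. The column classification, the customer_id and customer_key names and the "current" flag are assumptions for illustration; a real load would run against the dimension table and typically track effective dates as well.

```python
from itertools import count

# Assumed classification of customer attributes (hypothetical column names).
SCD_TYPE = {"phone": 1, "last_name": 2, "marital_status": 2}
_next_key = count(1)

def apply_change(dimension, customer_id, new_values):
    """Apply one source-system record to the in-memory dimension rows."""
    current = next((r for r in dimension
                    if r["customer_id"] == customer_id and r["current"]), None)
    if current is None:
        # First time this customer is seen: insert the initial profile.
        dimension.append({"customer_key": next(_next_key), "customer_id": customer_id,
                          "current": True, **new_values})
        return
    changed = [c for c, v in new_values.items() if current[c] != v]
    if any(SCD_TYPE[c] == 2 for c in changed):
        # Type 2: keep the old profile and insert a new record with a new key.
        current["current"] = False
        dimension.append({**current, **new_values,
                          "customer_key": next(_next_key), "current": True})
    elif changed:
        # Type 1: overwrite all Type 1 columns in place, no history kept.
        for column, value in new_values.items():
            if SCD_TYPE[column] == 1:
                current[column] = value

customers = []
apply_change(customers, "C1", {"phone": "555-0100", "last_name": "Smith", "marital_status": "Single"})
apply_change(customers, "C1", {"phone": "555-0199", "last_name": "Smith", "marital_status": "Single"})
apply_change(customers, "C1", {"phone": "555-0199", "last_name": "Jones", "marital_status": "Married"})
print(customers)
```

After the three calls the dimension holds two profiles for the same customer: the original one with the overwritten phone number, and a new one created by the Type 2 change.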
48
Designing Loads to Handle SCD
Design and implementation guidelines
• For large dimension tables, change data capture techniques may be used to minimize the data volume
• For smaller dimension tables, compare all OLTP records with dimension table records
• Balance data volume with change data capture logic complexities
49
Degenerate Dimensions
Dimensions with no other place to go
Stored in the fact table
Are not facts
Common examples include invoice numbers or order numbers
50
Dimensional Design Process
Project Context
51
Data Mart Development
[Figure: Design Phase, Development Phase, Deployment Phase]
Dimensional modeling is a critical part of the data mart development effort
52
Data Mart Development
Design phase
• Determine requirements and design schema
Development phase
• Iterative build and feedback
Deployment phase
• Automate load, document, train users
53
Project Deliverables
Design
• Project definition document
• Project plan
• Schema design
• Mapping document
• Report design
Development
• Populated data mart
• Load routines (Sagent “Plans”)
• Query and reporting environment
Deployment
• Automation
• Documentation
• Training materials
54
Project Approach
[Figure: Design Phase, Development Phase, Deployment Phase]
The dimensional model is developed during the design stage
Scope of the project has already been determined
55
Design Stage Activities
[Figure: Design Phase, Development Phase, Deployment Phase]
Gather requirements through requirements workshops
Develop star schema
Conduct design review
56
Gather Requirements
Requirements definition
• User workshops
• Spreadsheets
• Sample reports
Source systems analysis
• DBA interviews
• Copybooks
• E/R diagrams
57
Design Deliverables
Deliverables
• The star schema itself
• Load mapping document
How these primary components are delivered will depend on needs and format chosen
• Modeling tools
• Spreadsheets
• Text documents
58
Notation
No recognized standard
ER semantics unnecessary
Clarity is the only characteristic that really matters
59
Design Naming Standards
Responsibility of data administration
• Extended to the data warehouse
• Important to start early in the project
Suggested conventions
• Fact tables
• Dimension tables
• Aggregate tables
• Keys
60
Data Element Definitions
Clear descriptions
• Facts
• Calculated formulae
• Dimensional attributes
• Multiple meanings/synonymous terms
• Aliases
61
Data Element Instances
Example of Data
As it will exist in the warehouse
After decoding
Adds to model understanding
Removes ambiguity/uncertainty
62
Data Element Mapping
Where is the data coming from
Source system
Table
Column
Record
Field
63
Data Transformation
Changing the data
Serves as spec for ETL process
Decodes
Type conversion
Conditional logic
Handling of NULLs
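A transformation spec of this kind might translate into code along the following lines. The status codes, the field names, the 1000 threshold and the choice to turn missing amounts into 0.0 are all assumptions made for the example, not rules from the course.

```python
# Hypothetical source codes and their business meanings.
STATUS_DECODE = {"A": "Active", "I": "Inactive"}

def transform(source):
    """Turn one raw source record into a warehouse-ready row."""
    raw_amount = source.get("amount")
    # Type conversion with NULL handling: missing amounts become 0.0 (an assumed rule).
    amount = float(raw_amount) if raw_amount not in (None, "") else 0.0
    return {
        # Decode: replace cryptic codes with business terms.
        "status": STATUS_DECODE.get(source.get("status"), "Unknown"),
        "amount": amount,
        # Conditional logic: derive a flag from another field.
        "large_order_flag": "Y" if amount >= 1000 else "N",
    }

print(transform({"status": "A", "amount": "1250.50"}))
print(transform({"status": None, "amount": None}))
```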
64
Aggregate Schemas
65
Aggregate Designs
Aggregates
• Pre-stored fact summaries
• Along one or more dimensions
• The most effective tool for improving performance
Examples
• Summary of sales by region, by product, by category
• Monthly sales
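As a rough illustration of a pre-stored summary, the sketch below rolls assumed daily sales rows up to a monthly aggregate. The sample rows and the function name are invented for the example.

```python
from collections import defaultdict

# Assumed base-level facts: (date, product, sales_dollars).
daily_sales = [
    ("2014-10-30", "Standard", 100.0),
    ("2014-10-31", "Standard", 150.0),
    ("2014-11-01", "Standard", 200.0),
    ("2014-11-01", "Thermal", 80.0),
]

def build_monthly_aggregate(facts):
    """Pre-summarize an additive fact along the time dimension (day to month)."""
    monthly = defaultdict(float)
    for day, product, dollars in facts:
        monthly[(day[:7], product)] += dollars  # "YYYY-MM" identifies the month
    return dict(monthly)

monthly_sales = build_monthly_aggregate(daily_sales)
print(monthly_sales)
```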
66
Aggregate Background
Aggregate rationale
• Improve end user query performance
• Reduce required CPU cycles
• Powerful cost saving tool
Restrictions
• Additive facts only
• Must use dimensional design
67
Aggregate Guidelines
Don’t start with aggregates
Design and build based on usage
Sooner or later you'll need to build aggregates
68
Aggregate Types
Level field
Separate fact tables
69
Aggregate Types
Level field
• Old technique
• Requires “level” attribute in appropriate dimensions
• Aggregates and base-level facts stored in same table
• Same number of total fact records as separate table approach
Drawbacks
• Every query must constrain on the level field
• Possibility of double counting
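The double-counting risk is easy to reproduce. The sketch below assumes a single table that mixes day-level rows with a month-level summary row, distinguished only by a level field; the row layout is hypothetical.

```python
# Assumed single fact table holding base rows and a pre-summarized row,
# distinguished only by a "level" field.
facts = [
    {"level": "day", "period": "2014-11-01", "units": 10},
    {"level": "day", "period": "2014-11-02", "units": 15},
    {"level": "month", "period": "2014-11", "units": 25},  # summary of the two rows above
]

def total_units(rows, level=None):
    """Sum units, optionally constraining on the level field."""
    return sum(r["units"] for r in rows if level is None or r["level"] == level)

unconstrained = total_units(facts)       # counts the month row on top of its days
constrained = total_units(facts, "day")  # the correct total
print(unconstrained, constrained)
```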
70
Aggregate Types
Separate tables
• Separate fact table for every aggregate
• Separate dimension table for every aggregate dimension
• Same number of fact records as level field tables
Advantage
• Removes possibility of double counting
• Schema clarity
Caveat
• Requires software with aggregate navigation capability
71
Aggregate Pitfalls
Sparsity failure
• Term used to describe the result of building too many aggregate fact tables that do not summarize enough rows
• When sparsity failure occurs, a relatively small star schema can grow (in terms of disk size) thousands of times
• Sparsity failure = aggregate explosion
72
Aggregate Design Guidelines
Rule of twenty
• To avoid aggregate explosion
• Make sure each aggregate record summarizes 20 or more lower-level records
Remember
• Total number of possible fact tables in any given dimensional model = number of combinations in the cartesian product of the levels of all the dimensions
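The count of possible fact tables can be checked by enumerating the cartesian product directly. The dimensions and hierarchy levels below are assumed for illustration.

```python
from itertools import product

# Assumed hierarchy levels per dimension, from most to least detailed.
levels = {
    "time": ["date", "month", "quarter", "year"],
    "product": ["product", "category", "brand"],
    "store": ["store", "city", "region"],
}

# Each combination of one level per dimension is a candidate fact table.
candidate_tables = list(product(*levels.values()))
print(len(candidate_tables))
print(candidate_tables[0])   # the base-level fact table
print(candidate_tables[-1])  # the most summarized aggregate
```

With only three small hierarchies there are already 4 x 3 x 3 = 36 candidates, one of which is the base-level table, which is why aggregates have to be chosen selectively.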
73
Hierarchies & Aggregate Design
Time hierarchy
Level    Per year  Over 5 years
Year     1         5 years
Quarter  4         20 quarters
Month    12        60 months
Date     365       1825 days
Hierarchy diagram
• Helps visualize options for building aggregates
• Adding cardinalities ensures following the rule of 20
• Not required to build initial star schema
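Using the cardinalities from the time hierarchy above, a quick check against the rule of twenty could look like this. It only considers one hierarchy in isolation, which is a simplification: the real ratio depends on how many fact rows each aggregate row summarizes across all dimensions.

```python
# Row counts for the time hierarchy over five years, as on the slide.
time_levels = [("date", 1825), ("month", 60), ("quarter", 20), ("year", 5)]

def rollup_ratios(levels):
    """Average number of lower-level rows summarized by each row one level up."""
    return {
        f"{lower} -> {upper}": n_lower / n_upper
        for (lower, n_lower), (upper, n_upper) in zip(levels, levels[1:])
    }

ratios = rollup_ratios(time_levels)
meets_rule_of_twenty = {step: ratio >= 20 for step, ratio in ratios.items()}
print(ratios)
print(meets_rule_of_twenty)
```

On these figures, rolling dates up to months clears the threshold, while months to quarters and quarters to years do not on their own. An aggregate may also skip levels, for example date to quarter.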
74
Aggregate Navigation
Description
• Function provided by software layer: Aggregate Navigator
• Directs user queries to the most favorable available aggregate
• Transparent to the end user
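A toy aggregate navigator can be sketched in a few lines. The catalog, table names, row counts and hierarchies are hypothetical, and "most favorable" is assumed here to mean the table with the fewest rows that is still detailed enough to answer the query.

```python
# Hypothetical catalog: the levels each table stores and its row count.
CATALOG = [
    {"table": "sales_facts", "levels": {"time": "date", "product": "product"}, "rows": 1_000_000},
    {"table": "sales_by_month", "levels": {"time": "month", "product": "product"}, "rows": 40_000},
    {"table": "sales_by_month_brand", "levels": {"time": "month", "product": "brand"}, "rows": 600},
]
# Position in each list gives the level of detail (0 is the most detailed).
HIERARCHY = {"time": ["date", "month", "year"], "product": ["product", "brand"]}

def can_answer(table_levels, query_levels):
    """A table can answer a query if it is at least as detailed in every dimension."""
    return all(
        HIERARCHY[dim].index(table_levels[dim]) <= HIERARCHY[dim].index(query_levels[dim])
        for dim in query_levels
    )

def navigate(query_levels):
    """Pick the smallest table that can still answer the query."""
    candidates = [t for t in CATALOG if can_answer(t["levels"], query_levels)]
    return min(candidates, key=lambda t: t["rows"])["table"]

print(navigate({"time": "year", "product": "brand"}))
print(navigate({"time": "month", "product": "product"}))
print(navigate({"time": "date", "product": "brand"}))
```

The user asks for levels, not tables; the navigator decides which physical table to read, so aggregates can be added or dropped without changing user queries.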
75
Aggregate Framework
[Figure: Business View and Designer View]
76
Aggregate Deployment
Incremental
Based on usage
Transparent to users
Typically warehouse DBA responsibility
77
Aggregate Deployment
[Figure: deployment timeline]
Build Subject Area 1, no aggregates
Build Subject Area 2, no aggregates
Build aggregates for Subject Area 1
Build Subject Area 3, no aggregates
Build aggregates for Subject Area 2
Build Subject Area 4, no aggregates
Build aggregates for Subject Area 3
Some re-work required
78
Multiple Fact Tables
79
Multiple Fact Tables
Different business processes usually require different fact tables
There are also several cases where a single business process will require multiple fact tables
• Core and custom
• Snapshot and transaction
• Coverage
• Aggregates
80
Different Business Processes
Different business processes usually require different fact tables
In practice, it may be hard to identify what a “process” is
Sometimes you can spot different processes because measures are recorded
• With different dimensions
• At differing grains
81
Different Dimensions or Grain
Don’t take shortcuts with grain
The 'not applicable' dimension value
• Using a 'not applicable' row in a dimension confuses the grain and can introduce reporting difficulty
82
Different Points in Time
Sometimes, it is not easy to identify the discrete business processes
All measures may have the same dimensionality or grain
Different measures are recorded at different times
• Quantity sold is not recorded at the same time as quantity shipped
83
Different Timing
Building a single fact table would require recording zero or null for measures that are not applicable at a point in time
Reports would contain a confusing combination of zeros, nulls, and absence of data
84
Identifying Different Processes
Look at the measures in question
Sort them into fact tables based on
• Dimensions
• Grain
• Differing timings of events measured
85
Design Tools for Multiple Tables
Create a set of matrices
• Facts vs dimension
• Facts vs dimensional attributes
Mark where facts apply to dimensions
Mark where facts apply to dimensional attributes
When facts don't apply, assume separate fact table
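The facts-vs-dimension matrix can be represented as a simple mapping and used to propose fact tables. The facts and dimensions below are assumed for illustration, and the sketch considers dimensionality only; grain and timing still need to be reviewed by hand.

```python
from collections import defaultdict

# Assumed facts-vs-dimension matrix: the dimensions that apply to each fact.
applies_to = {
    "quantity_sold": {"product", "day", "store"},
    "sales_dollars": {"product", "day", "store"},
    "quantity_shipped": {"product", "day", "warehouse"},
    "units_on_hand": {"product", "month", "warehouse"},
}

def group_into_fact_tables(matrix):
    """Facts that share the same set of dimensions land in the same fact table."""
    tables = defaultdict(list)
    for fact, dimensions in matrix.items():
        tables[frozenset(dimensions)].append(fact)
    return dict(tables)

fact_tables = group_into_fact_tables(applies_to)
for dimensions, facts in fact_tables.items():
    print(sorted(dimensions), facts)
```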
86
Multiple Fact Table Summary
Different processes need different tables
Identified with
• Grain
• Dimensionality
• Timing
Same process may need multiple fact tables
• Heterogeneous attributes
• Coverage
• Snapshot and transaction
• Aggregates
87
Architected Data Marts
88
Data Mart
Meaning of the term 'data mart' has shifted over the last several years...
89
Data Mart Architecture 1993
[Figure: Operational Systems, E.T.L. Software, Data Warehouse, E.T.L. Software, Data Marts, Query & Reporting Software, Analysis Users]
90
Data Mart Architecture 1997
[Figure: Operational Systems, E.T.L. Software, Data Marts, Query & Reporting Software, Analysis Users]
91
Architected Data Marts
[Figure: Operational Systems, E.T.L. Software, Data Warehouse, Data Mart, Query & Reporting Software, Analysis Users]
92
Data Mart
Warehouse Subject Area
Incremental warehouse development
Centralized architecture
Not new
Well-suited to star schemas
93
“Stovepipe” Data Marts
[Figure: three separate star schemas. Sales Facts: Store, Product, Time (Day). Shipments Facts: Product, Time (Day), Warehouse. Inventory Facts: Warehouse, Product, Month]
“Stovepipe” data marts
• Inconsistent and overlapping data
• Difficult and costly to maintain
• Redundant data load
• Can’t drill across
• Integration requires starting over
• Dimensions not conformed
94
Conformed Dimensions
Definition
• Dimensions are conformed when they are the same
• -or-
• When one dimension is a strict rollup of another
95
Conformed Dimensions
Same dimensions must:
1. ... have exactly the same set of primary keys
and
2. ... have the same number of records
96
Conformed Dimensions
Rolled up dimension
• When one dimension is a strict rollup of another
Which means
• Two conformed dimensions can be combined into a single logical dimension by creating a union of the attributes
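Both conformance tests can be approximated with simple set comparisons. The function names and sample rows are hypothetical, and the rollup check encodes one reasonable reading of "strict rollup": the rolled-up dimension contains exactly the distinct values of the shared attributes found in the detailed dimension.

```python
def is_same_dimension(dim_a, dim_b, key):
    """Two copies are 'the same' if their primary key sets and record counts match."""
    keys_a = {row[key] for row in dim_a}
    keys_b = {row[key] for row in dim_b}
    return keys_a == keys_b and len(dim_a) == len(dim_b)

def is_strict_rollup(detail, rollup, shared_attributes):
    """The rollup must hold exactly the distinct shared-attribute values of the detail."""
    detail_values = {tuple(row[a] for a in shared_attributes) for row in detail}
    rollup_values = {tuple(row[a] for a in shared_attributes) for row in rollup}
    return detail_values == rollup_values

day_dim = [
    {"time_key": 1, "date": "2014-11-01", "month": "2014-11", "year": 2014},
    {"time_key": 2, "date": "2014-11-02", "month": "2014-11", "year": 2014},
    {"time_key": 3, "date": "2014-12-01", "month": "2014-12", "year": 2014},
]
month_dim = [
    {"month_key": 1, "month": "2014-11", "year": 2014},
    {"month_key": 2, "month": "2014-12", "year": 2014},
]

print(is_same_dimension(day_dim, list(day_dim), "time_key"))
print(is_strict_rollup(day_dim, month_dim, ["month", "year"]))
```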
97
Conformed Dimensions
Description
• Shared common dimensions
• Integrates logical design
• Ensures consistency between data marts
• Allows incremental development
• Independent of physical location
• Some re-work may be required
98
Conformed Dimensions
Advantages
• Enables an incremental development approach
• Easier and cheaper to maintain
• Drastically reduces extraction and loading complexity
• Answers business questions that cross data marts
• Supports both centralized and distributed architectures
99
Conformed Dimensions
Interlocking Star Schemas
[Figure: Sales Facts, Shipment Facts and Inventory Facts sharing the Store, Product, Time, Warehouse and Month dimensions]
100
Kimball’s Data Warehouse Bus
[Figure: Sales Facts, Shipment Facts and Inventory Facts attached to a bus of dimensions: Store, Product, Day, Warehouse, Month]
101
Course Review
Rationale for dimensional modeling
Dimensional modeling basics
Dimensional modeling details
Fact table details
Dimension table details
Design process
Aggregate schemas
Multiple fact tables
Architected data marts