Dimensional_Modeling[1]

Post on 24-Nov-2014

1

Dimensional Design

Presented by Dr. Debashis Parida

2

Course Agenda

• Rationale for dimensional modeling
• Dimensional modeling basics
• Dimensional modeling details
• Fact table details
• Dimension table details
• Design process
• Aggregate schemas
• Multiple fact tables
• Architected data marts

3

Rationale for Dimensional Modeling

4

OLTP Design Characteristics

• Focus of OLTP design
  • Individual data elements
  • Data relationships
• Design goals
  • Accurately model the business
  • Remove redundancy

5

OLTP Design Shortcomings

• Complex
• Unfamiliar to business people
• Incomplete history
• Slow query performance

6

Emergence of Dimensional Model

• Logical modeling technique
  • For designing relational database structures
• Addresses OLTP design shortcomings
  • For use in analytic systems
• First developed in the early 1980s
  • Packaged goods industry
• Popularized by Ralph Kimball, PhD.
  • 1996 book: 'The Data Warehouse Toolkit'

7

Dimensional Modeling Basics

8

Facts

Coffee Maker Fulfillment Report

Brand           Product                Units Sold  Units Shipped  % Shipped
Captain Coffee  Standard Coffee Maker       5,000          3,800        76%
                Thermal Coffee Maker        2,400          1,632        68%
                Deluxe Coffee Maker         2,073          1,658        80%
                All Products                9,473          7,090        75%

Process Measurement

• Measures
  • Metrics or indicators by which people evaluate a business process
  • Referred to as “Facts”
• Examples
  • Margin
  • Inventory Amount
  • Sales Dollars
  • Receivable Dollars
  • Return Rate

9

Perspective Focus

Process-oriented business perspectives

[Diagram labels: Operations; Sales and Marketing; Customer Services; Product Development; product, category, warehouse, G/L account, supplier]

10

Dimensions

Coffee Maker Fulfillment Report

Brand           Product                Units Sold  Units Shipped  % Shipped
Captain Coffee  Standard Coffee Maker       5,000          3,800        76%
                Thermal Coffee Maker        2,400          1,632        68%
                Deluxe Coffee Maker         2,073          1,658        80%
                All Products                9,473          7,090        75%

Process Perspectives

• Dimensions
  • The parameters by which measures are viewed
  • Used to break out, filter or roll up measures
  • Often found after the word “by” in a business question
  • Descriptive business terms
• Examples
  • Product
  • Warehouse
  • Customer
  • Supplier

11

Dimensional Model

• Definition
  • Logical data model used to represent the measures and dimensions that pertain to one or more business subject areas
• Dimensional Model = Star Schema
  • Serves as basis for the design of a relational database schema
  • Can easily translate into multi-dimensional database design if required
  • Overcomes OLTP design shortcomings

12

Dimensional Model Advantages

• Understandable
• Systematically represents history
• Reliable join paths
• High performance query
• Enterprise scalability

13

Schema Simplicity

[Diagram: Star Schema, a central Facts table joined to the Store, Time and Product dimensions]

• Fewer tables
  • Denormalized
  • Consolidated
• Dimensional
  • Familiar to users
  • Facts go in the fact tables
  • Dimensions in dimension tables
• Increases understandability

14

Data Familiarity

[Diagram: source field ord_date expanded into the Time Dimension attributes year, quarter, month, date, day of the week, holiday flag]

• Adding business context
  • Single source field
  • Expanded into parts
  • Decoded into business terms
  • Add special indicators and flags
  • e.g. time dimension
• Increases understandability
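A minimal sketch of this expansion in Python: one source date (such as ord_date) is broken out into the time dimension attributes listed on the slide. The function name, the output field names and the holiday calendar are assumptions made for illustration.

```python
from datetime import date

# Hypothetical holiday calendar; a real warehouse would load a business calendar.
HOLIDAYS = {date(2014, 1, 1), date(2014, 12, 25)}

def expand_order_date(ord_date: date) -> dict:
    """Expand one source date into descriptive time dimension attributes."""
    return {
        "date": ord_date.isoformat(),
        "year": ord_date.year,
        "quarter": f"Q{(ord_date.month - 1) // 3 + 1}",
        "month": ord_date.strftime("%B"),
        "day_of_week": ord_date.strftime("%A"),
        "holiday_flag": "Y" if ord_date in HOLIDAYS else "N",
    }

row = expand_order_date(date(2014, 12, 25))
```

Each attribute becomes a column that business users can filter or group by without knowing any date arithmetic.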

15

Representing History

[Diagram: star schema with Store, Product and Time dimensions around Facts; the Time Dimension holds year, quarter, month, date, day of the week, holiday flag]

• Time dimension
  • Part of every star schema
  • Marks the date when the facts (process measurements) occurred
  • Allows the schema to easily add and query data over time
  • Especially useful for performing comparison queries

16

Fewer Join Paths

• Star schema joins
  • Defined during schema design, not at runtime
  • Business people can easily understand these relationships
  • One-to-many relations between dimensions and facts
  • Referential integrity always enforced

17

High Performance Design

Fewer joins mean less 'expensive' queries

Deterministic query patterns

Star schema query optimization supported by all major RDBMS vendors

18

Subject Area Models

[Diagram: subject area E/R models and subject area dimensional models for Operations, Sales and Marketing, Customer Services, and Product Development]

• Manufacturing and Process Control
• Sales Order Entry and Campaign Management
• Customer Support and Relationship Management
• Shipping and Inventory Management

19

Enterprise Models

Enterprise Scope E/R model

Enterprise scope dimensional model

20

Dimensional Design Details

21

Star Schema Dimension Tables

[Diagram: three dimension tables]

• Dimension tables
  • Store dimension values
  • Textual content
• Dimension tables usually referred to simply as 'dimensions'
• Spend extra effort to add dimensional attributes

22

Dimension Keys

[Diagram: three dimension tables, each with a key column]

• Synthetic keys
  • Each table is assigned a unique primary key, specifically generated for the data warehouse
  • Primary keys from source systems may be present in the dimension, but are not used as primary keys in the star schema
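A minimal sketch of synthetic key assignment in Python. The class name, the source-system identifiers ("CM-100", "CM-200") and the column names are hypothetical; the point is only that the warehouse generates its own key and carries the source key as an ordinary attribute.

```python
from itertools import count

class SurrogateKeyGenerator:
    """Assigns warehouse-generated integer keys to source-system keys."""

    def __init__(self):
        self._next_key = count(1)
        self._by_source_key = {}

    def key_for(self, source_key):
        # Reuse the key already issued for this source key, otherwise issue a new one.
        if source_key not in self._by_source_key:
            self._by_source_key[source_key] = next(self._next_key)
        return self._by_source_key[source_key]

product_keys = SurrogateKeyGenerator()
dimension_rows = [
    {"product_key": product_keys.key_for(src_id), "source_product_id": src_id, "product": name}
    for src_id, name in [("CM-100", "Standard Coffee Maker"), ("CM-200", "Thermal Coffee Maker")]
]
```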

23

Dimension Columns

[Diagram: three dimension tables, each with a key column and attribute columns]

• Dimension attributes
  • Specify the way in which measures are viewed: rolled up, broken out or summarized
  • Often follow the word “by” as in “Show me Sales by Region and Quarter”
  • Frequently referred to as 'Dimensions'

24

Star Schema Fact Table

[Diagram: fact table with columns fact1, fact2, fact3]

• Process measures
  • Start by assigning one fact table per business subject area
  • Fact tables store the process measures (aka Facts)
  • Compared to dimension tables, fact tables usually have a very large number of rows

25

Fact Table Primary Key

[Diagram: fact table with three key columns and fact1, fact2, fact3]

• Every fact table
  • Multi-part primary key added
  • Made up of foreign keys referencing dimensions

26

Fact Table Sparsity

• Sparsity
  • Term used to describe the very common situation where a fact table does not contain a row for every combination of every dimension table row for a given time period
  • Because fact tables contain a very small percentage of all possible combinations, they are said to be "sparsely populated" or "sparse"
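A minimal sketch of how sparsity can be quantified, in Python: divide the rows actually present by the number of possible dimension combinations. The dimension names, cardinalities and row count are invented for illustration.

```python
from math import prod

def fact_table_density(actual_rows: int, dimension_cardinalities: dict) -> float:
    """Fraction of all possible dimension combinations that have a fact row."""
    possible_rows = prod(dimension_cardinalities.values())
    return actual_rows / possible_rows

# Assumed sizes: 10,000 products, 200 stores, one year of days.
cardinalities = {"product": 10_000, "store": 200, "day": 365}
density = fact_table_density(actual_rows=7_300_000, dimension_cardinalities=cardinalities)
```

With these assumed numbers only 1% of the 730 million possible combinations hold a row, because most products do not sell in most stores on most days.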

27

Fact Table Grain

• Grain
  • The level of detail represented by a row in the fact table
  • Must be identified early
  • Cause of greatest confusion during design process
• Example
  • Each row in the fact table represents the daily item sales total

28

Designing a Star Schema

• Five initial design steps
  • Based on Kimball's six steps
  • Start designing in order
  • Re-visit and adjust over project life

29

Step One

1. Identify fact table

Start by naming the fact table with the name of the business subject area

30

Step Two

2. Identify fact table grain

Describe what a row in the fact table represents, in business terms

31

Step Three

3. Identify dimensions

32

Step Four

4. Select facts

33

Step Five

5. Identify dimensional attributes

34

Fact Table Details

35

Example Fact Table

Sales Facts
  model_key
  dealer_key
  time_key
  revenue
  quantity

36

Facts

• Fully additive
  • Can be summed across any and all dimensions
  • Stored in fact table
  • Examples: revenue, quantity

37

Facts

• Semi-additive
  • Can be summed across most dimensions but not all
  • Anything that measures a “level”
  • Must be careful with ad-hoc reporting
  • Often aggregated across the “forbidden dimension” by averaging
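A minimal sketch of a semi-additive fact in Python, using inventory on hand as the “level” and time as the “forbidden dimension”. The snapshot rows are invented for illustration.

```python
from collections import defaultdict

# (date, product, quantity_on_hand): hypothetical inventory snapshot rows.
inventory = [
    ("2014-11-01", "Standard", 100), ("2014-11-01", "Deluxe", 40),
    ("2014-11-02", "Standard", 90), ("2014-11-02", "Deluxe", 50),
]

# Valid: sum across the product dimension within each day.
on_hand_by_day = defaultdict(int)
for day, _product, qty in inventory:
    on_hand_by_day[day] += qty

# Misleading: summing across time counts the same stock once per snapshot.
naive_total = sum(on_hand_by_day.values())

# Usual alternative: average the daily levels across the time dimension.
average_on_hand = sum(on_hand_by_day.values()) / len(on_hand_by_day)
```

Here each day holds 140 units, so the average of 140 is meaningful while the naive total of 280 describes stock that never existed.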

38

Facts

• Non-additive
  • Cannot be summed across any dimension
  • All ratios are non-additive
  • Break down to fully additive components, store them in fact table
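A minimal sketch of why a ratio is stored as its additive components, in Python, using the figures from the Coffee Maker Fulfillment Report earlier in the deck. The variable and function names are assumptions.

```python
rows = [
    {"product": "Standard Coffee Maker", "units_sold": 5000, "units_shipped": 3800},
    {"product": "Thermal Coffee Maker", "units_sold": 2400, "units_shipped": 1632},
    {"product": "Deluxe Coffee Maker", "units_sold": 2073, "units_shipped": 1658},
]

def pct_shipped(sold: int, shipped: int) -> float:
    return shipped / sold

# Correct: sum the additive components first, then compute the ratio.
total_sold = sum(r["units_sold"] for r in rows)
total_shipped = sum(r["units_shipped"] for r in rows)
all_products_pct = pct_shipped(total_sold, total_shipped)

# Incorrect: summing the per-product ratios gives a number with no meaning.
summed_ratios = sum(pct_shipped(r["units_sold"], r["units_shipped"]) for r in rows)
```

The component totals (9,473 sold, 7,090 shipped) reproduce the report's 75% for All Products, whereas the summed ratios exceed 200%.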

39

Factless Fact Table

• A fact table with no measures in it
• Nothing to measure...
• ...Except the convergence of dimensional attributes
• Sometimes store a “1” for convenience
• Examples: Attendance, Customer Assignments, Coverage

40

Dimension Table Details

41

Example Dimension Tables

Dealer
  dealer_key
  region
  state
  city
  dealer

Model
  model_key
  brand
  category
  line
  model

Time
  time_key
  year
  quarter
  month
  date
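A minimal sketch of the example star schema, built with Python's sqlite3 module. The column names come from the slides; the table names (dealer_dim, model_dim, time_dim, sales_facts), the data types and all sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dealer_dim (dealer_key INTEGER PRIMARY KEY, region TEXT, state TEXT, city TEXT, dealer TEXT);
CREATE TABLE model_dim  (model_key INTEGER PRIMARY KEY, brand TEXT, category TEXT, line TEXT, model TEXT);
CREATE TABLE time_dim   (time_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT, month TEXT, date TEXT);
CREATE TABLE sales_facts (
    model_key  INTEGER REFERENCES model_dim(model_key),
    dealer_key INTEGER REFERENCES dealer_dim(dealer_key),
    time_key   INTEGER REFERENCES time_dim(time_key),
    revenue    REAL,
    quantity   INTEGER,
    PRIMARY KEY (model_key, dealer_key, time_key)  -- multi-part key of foreign keys
);
INSERT INTO dealer_dim VALUES (1, 'West', 'CA', 'Fresno', 'Dealer A');
INSERT INTO dealer_dim VALUES (2, 'East', 'NY', 'Albany', 'Dealer B');
INSERT INTO model_dim  VALUES (1, 'Brand X', 'Sedan', 'Line 1', 'Model S');
INSERT INTO time_dim   VALUES (1, 2014, 'Q3', 'September', '2014-09-30');
INSERT INTO time_dim   VALUES (2, 2014, 'Q4', 'November', '2014-11-24');
INSERT INTO sales_facts VALUES (1, 1, 1, 20000.0, 1);
INSERT INTO sales_facts VALUES (1, 1, 2, 40000.0, 2);
INSERT INTO sales_facts VALUES (1, 2, 2, 20000.0, 1);
""")

# "Show me revenue and quantity by region and quarter."
revenue_by_region_quarter = conn.execute("""
    SELECT d.region, t.quarter, SUM(f.revenue), SUM(f.quantity)
    FROM sales_facts f
    JOIN dealer_dim d ON d.dealer_key = f.dealer_key
    JOIN time_dim   t ON t.time_key   = f.time_key
    GROUP BY d.region, t.quarter
    ORDER BY d.region, t.quarter
""").fetchall()
```

The query joins the fact table to only the dimensions it groups by, which is the join pattern the star schema is designed around.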

42

Dimension Tables

• Characteristics
  • Hold the dimensional attributes
  • Usually have a large number of attributes (“wide”)
  • Add flags and indicators that make it easy to perform specific types of reports
  • Have small number of rows in comparison to fact tables (most of the time)

43

Don’t Normalize Dimensions

• Saves very little space
• Impacts performance
• Can confuse matters when multiple hierarchies exist
• A star schema with normalized dimensions is called a "snowflake schema"
• Usually advocated by software vendors whose products require snowflake for performance

44

Slowly Changing Dimensions

• Dimension source data may change over time
• Relative to fact tables, dimension records change slowly
• Allows dimensions to have multiple 'profiles' over time to maintain history
• Each profile is a separate record in a dimension table

45

Slowly Changing Dimension Example

• Example: A woman gets married
• Possible changes to customer dimension
  • Last Name
  • Marriage Status
  • Address
  • Household Income
• Existing facts need to remain associated with her single profile
• New facts need to be associated with her married profile

46

Slowly Changing Dimension Types

• Three types of slowly changing dimensions
• Type 1
  • Updates existing record with modifications
  • Does not maintain history
• Type 2
  • Adds new record
  • Does maintain history
  • Maintains old record
• Type 3
  • Keeps old and new values in the existing row
  • Requires a design change

47

Designing Loads to Handle SCD

Design and implementation guidelines

Gather SCD requirements when designing data mapping and loading

SCD needs to be defined and implemented at the dimensional attribute level

Each column in a dimension table needs to be identified as a Type 1 or a Type 2 SCD

If one Type 1 column changes, then all Type 1 columns will be updated

If one Type 2 column changes, then a new record will be inserted into the dimension table
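A minimal sketch of column-level SCD handling in Python, following the guidelines above. The customer dimension, its columns, the Type 1/Type 2 assignments and the bookkeeping fields (effective_date, current) are assumptions; applying Type 1 overwrites to every profile of the customer is one possible design choice, not the only one.

```python
from datetime import date

# Hypothetical column-level policy: each attribute is Type 1 or Type 2.
SCD_TYPE = {"email": 1, "last_name": 2, "marital_status": 2}

def apply_change(dimension, customer_id, incoming, today):
    """Apply one changed source record to a list of dimension rows (dicts)."""
    current = next(r for r in dimension
                   if r["customer_id"] == customer_id and r["current"])
    changed = {c for c in SCD_TYPE if incoming[c] != current[c]}
    if any(SCD_TYPE[c] == 2 for c in changed):
        # Type 2: keep the old profile and insert a new record with a new key.
        current["current"] = False
        dimension.append({**current, **incoming,
                          "customer_key": max(r["customer_key"] for r in dimension) + 1,
                          "effective_date": today, "current": True})
    if any(SCD_TYPE[c] == 1 for c in changed):
        # Type 1: overwrite all Type 1 columns in place; no history is kept.
        for r in dimension:
            if r["customer_id"] == customer_id:
                for c, scd in SCD_TYPE.items():
                    if scd == 1:
                        r[c] = incoming[c]
    return dimension

customers = [{"customer_key": 1, "customer_id": "C42", "email": "a@example.com",
              "last_name": "Smith", "marital_status": "Single",
              "effective_date": date(2010, 1, 1), "current": True}]
apply_change(customers, "C42",
             {"email": "a@example.com", "last_name": "Jones", "marital_status": "Married"},
             today=date(2014, 11, 24))
```

After the call the dimension holds two profiles for the same customer: existing facts still point at key 1 (the single profile) and new facts would be loaded against key 2 (the married profile).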

48

Designing Loads to Handle SCD

Design and implementation guidelines

For large dimension tables, change data capture techniques may be used to minimize the data volume

For smaller dimension tables, compare all OLTP records with dimension table records

Balance data volume with change data capture logic complexities

49

Degenerate Dimensions

• Dimensions with no other place to go
• Stored in the fact table
• Are not facts
• Common examples include invoice numbers or order numbers

50

Dimensional Design Process

Project Context

51

Data Mart Development

[Diagram: Design Phase, Development Phase, Deployment Phase]

Dimensional modeling is a critical part of the data mart development effort

52

Data Mart Development

• Design phase
  • Determine requirements and design schema
• Development phase
  • Iterative build and feedback
• Deployment phase
  • Automate load, document, train users

53

Project Deliverables

• Design
  • Project definition document
  • Project plan
  • Schema design
  • Mapping document
  • Report design
• Development
  • Populated data mart
  • Load routines (Sagent “Plans”)
  • Query and reporting environment
• Deployment
  • Automation
  • Documentation
  • Training materials

54

Project Approach

[Diagram: Design Phase, Development Phase, Deployment Phase]

• The dimensional model is developed during the design stage
• Scope of the project has already been determined

55

Design Stage Activities

[Diagram: Design Phase, Development Phase, Deployment Phase]

• Gather requirements through requirements workshops
• Develop star schema
• Conduct design review

56

Gather Requirements

• Requirements definition
  • User workshops
  • Spreadsheets
  • Sample reports
• Source systems analysis
  • DBA interviews
  • Copybooks
  • E/R diagrams

57

Design Deliverables

• Deliverables
  • The star schema itself
  • Load mapping document
• How these primary components are delivered will depend on needs and format chosen
  • Modeling tools
  • Spreadsheets
  • Text documents

58

Notation

• No recognized standard
• ER semantics unnecessary
• Clarity is the only characteristic that really matters

59

Design Naming Standards

• Responsibility of data administration
  • Extended to the data warehouse
  • Important to start early in the project
• Suggested conventions
  • Fact tables
  • Dimension tables
  • Aggregate tables
  • Keys

60

Data Element Definitions

• Clear descriptions
  • Facts
  • Calculated formulae
  • Dimensional attributes
  • Multiple meanings/synonymous terms
  • Aliases

61

Data Element Instances

Example of Data

As it will exist in the warehouse

After decoding

Adds to model understanding

Removes ambiguity/uncertainty

62

Data Element Mapping

Where is the data coming from

Source system

Table

Column

Record

Field

63

Data Transformation

Changing the data

Serves as spec for ETL process

Decodes

Type conversion

Conditional logic

Handling of NULLs
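A minimal sketch of the kinds of transformation rules such a spec would capture, in Python. The source field names (cust_status, credit_limit), the decode table and the 10,000 threshold are hypothetical.

```python
STATUS_DECODE = {"A": "Active", "I": "Inactive"}  # hypothetical code table

def transform(source_row: dict) -> dict:
    """Turn one operational record into warehouse-ready attribute values."""
    status_code = source_row.get("cust_status")
    raw_limit = source_row.get("credit_limit")
    # Type conversion with NULL handling: a missing amount defaults to 0.0.
    credit_limit = float(raw_limit) if raw_limit is not None else 0.0
    return {
        # Decode: replace the operational code with a business term.
        "customer_status": STATUS_DECODE.get(status_code, "Unknown"),
        "credit_limit": credit_limit,
        # Conditional logic: derive a flag from the converted value.
        "high_credit_flag": "Y" if credit_limit >= 10000 else "N",
    }

clean = transform({"cust_status": "A", "credit_limit": "12500.00"})
```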

64

Aggregate Schemas

65

Aggregate Designs

• Aggregates
  • Pre-stored fact summaries
  • Along one or more dimensions
  • The most effective tool for improving performance
• Examples
  • Summary of sales by region, by product, by category
  • Monthly sales

66

Aggregate Background

• Aggregate rationale
  • Improve end user query performance
  • Reduce required CPU cycles
  • Powerful cost saving tool
• Restrictions
  • Additive facts only
  • Must use dimensional design

67

Aggregate Guidelines

• Don’t start with aggregates
• Design and build based on usage
• Sooner or later you'll need to build aggregates

68

Aggregate Types

Level field

Separate fact tables

69

Aggregate Types

• Level field
  • Old technique
  • Requires “level” attribute in appropriate dimensions
  • Aggregates and base-level facts stored in same table
  • Same number of total fact records as separate table approach
• Drawbacks
  • Every query must constrain on the level field
  • Possibility of double counting
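A minimal sketch of the double-counting risk in Python. The rows and the "level" values are invented; the single list stands in for a fact table that mixes base-level rows with stored aggregates.

```python
# Base rows and a pre-stored monthly summary held in the same table.
facts = [
    {"level": "day", "period": "2014-11-01", "units": 10},
    {"level": "day", "period": "2014-11-02", "units": 15},
    {"level": "month", "period": "2014-11", "units": 25},
]

def total_units(rows, level=None):
    """Sum units, optionally constraining on the level field."""
    return sum(r["units"] for r in rows if level is None or r["level"] == level)

unconstrained = total_units(facts)             # counts the same sales twice
from_base_rows = total_units(facts, level="day")
from_aggregate = total_units(facts, level="month")
```

Only the two constrained totals agree; any query that forgets the level constraint silently reports double the real figure.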

70

Aggregate Types

• Separate tables
  • Separate fact table for every aggregate
  • Separate dimension table for every aggregate dimension
  • Same number of fact records as level field tables
• Advantage
  • Removes possibility of double counting
  • Schema clarity
• Caveat
  • Requires software with aggregate navigation capability

71

Aggregate Pitfalls

• Sparsity failure
  • Term used to describe the result of building too many aggregate fact tables that do not summarize enough rows.
  • When sparsity failure occurs, a relatively small star schema can grow (in terms of disk size) thousands of times.
  • Sparsity failure = aggregate explosion

72

Aggregate Design Guidelines

• Rule of twenty
  • To avoid aggregate explosion
  • Make sure each aggregate record summarizes 20 or more lower-level records
• Remember
  • Total number of possible fact tables in any given dimensional model = cartesian product of all levels in all the dimensions

73

Hierarchies & Aggregate Design

[Diagram: Time hierarchy with cardinalities Year (1), Quarter (4), Month (12), Date (365); over 5 years: 5 years, 20 quarters, 60 months, 1825 days]

• Hierarchy diagram
  • Helps visualize options for building aggregates
  • Adding cardinalities ensures following the rule of 20
  • Not required to build initial star schema
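A minimal sketch of checking the rule of twenty against the time hierarchy cardinalities shown on the diagram (five years of data), in Python. The function names are assumptions, and the ratio of level cardinalities is only an approximation: the real number of fact rows summarized also depends on sparsity.

```python
# Level cardinalities over five years, as on the hierarchy diagram.
time_levels = {"date": 1825, "month": 60, "quarter": 20, "year": 5}

def rows_summarized(levels: dict, lower: str, higher: str) -> float:
    """Average number of lower-level members rolled into one higher-level member."""
    return levels[lower] / levels[higher]

def passes_rule_of_twenty(levels: dict, lower: str, higher: str, threshold: int = 20) -> bool:
    return rows_summarized(levels, lower, higher) >= threshold

date_to_month = rows_summarized(time_levels, "date", "month")
month_to_quarter = rows_summarized(time_levels, "month", "quarter")
```

On these figures a monthly aggregate built from daily rows summarizes about 30 members per row and qualifies, while a quarterly aggregate built from monthly rows summarizes only 3 and does not.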

74

Aggregate Navigation

• Description
  • Function provided by software layer: Aggregate Navigator
  • Directs user queries to the most favorable available aggregate
  • Transparent to the end user
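A minimal sketch of the navigation idea in Python: among the tables that can answer a query, pick the one with the fewest rows. The catalog, table names, level names and row counts are hypothetical and do not describe any particular product.

```python
# Hypothetical catalog: the levels each table can answer, and its row count.
CATALOG = [
    {"table": "sales_facts",
     "levels": {"date", "month", "year", "product", "category"}, "rows": 5_000_000},
    {"table": "sales_agg_month_product",
     "levels": {"month", "year", "product", "category"}, "rows": 400_000},
    {"table": "sales_agg_year_category",
     "levels": {"year", "category"}, "rows": 500},
]

def choose_table(requested_levels) -> str:
    """Return the smallest table whose levels cover everything the query asks for."""
    candidates = [t for t in CATALOG if set(requested_levels) <= t["levels"]]
    return min(candidates, key=lambda t: t["rows"])["table"]
```

For example, `choose_table({"year", "category"})` selects the small yearly aggregate, while `choose_table({"date", "product"})` falls back to the base fact table; the user writes the same query either way.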

75

Aggregate Framework

[Diagram: Business View and Designer View]

76

Aggregate Deployment

Incremental

Based on usage

Transparent to users

Typically warehouse DBA responsibility

77

Aggregate Deployment

[Diagram: staggered build sequence]

• Build Subject Area 1, no aggregates
• Build Subject Area 2, no aggregates
• Build aggregates for Subject Area 1
• Build Subject Area 3, no aggregates
• Build aggregates for Subject Area 2
• Build Subject Area 4, no aggregates
• Build aggregates for Subject Area 3
• Some re-work required

78

Multiple Fact Tables

79

Multiple Fact Tables

• Different business processes usually require different fact tables
• There are also several cases where a single business process will require multiple fact tables
  • Core and custom
  • Snapshot and transaction
  • Coverage
  • Aggregates

80

Different Business Processes

• Different business processes usually require different fact tables
• In practice, it may be hard to identify what a “process” is
• Sometimes you can spot different processes because measures are recorded
  • With different dimensions
  • At differing grains

81

Different Dimensions or Grain

• Don’t take shortcuts with grain
• The 'not applicable' dimension value
  • Using a 'not applicable' row in a dimension confuses the grain and can introduce reporting difficulty

82

Different Points in Time

• Sometimes, it is not easy to identify the discrete business processes
• All measures may have the same dimensionality or grain
• Different measures are recorded at different times
  • Quantity sold is not recorded at the same time as quantity shipped

83

Different Timing

Building a single fact table would require recording zero or null for measures that are not applicable at a point in time

Reports would contain a confusing combination of zeros, nulls, and absence of data

84

Identifying Different Processes

• Look at the measures in question
• Sort them into fact tables based on
  • Dimensions
  • Grain
  • Differing timings of events measured

85

Design Tools for Multiple Tables

• Create a set of matrices
  • Facts vs dimensions
  • Facts vs dimensional attributes
• Mark where facts apply to dimensions
• Mark where facts apply to dimensional attributes
• When facts don't apply, assume separate fact table
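A minimal sketch of the facts-versus-dimensions matrix in Python: facts that apply to the same set of dimensions are grouped as candidates for one fact table. The facts and dimensions listed are illustrative assumptions, and grain and timing still have to be checked separately.

```python
from collections import defaultdict

# Facts-versus-dimensions matrix: each fact lists the dimensions it applies to.
MATRIX = {
    "quantity_sold": {"product", "store", "day"},
    "sales_dollars": {"product", "store", "day"},
    "quantity_shipped": {"product", "warehouse", "day"},
    "quantity_on_hand": {"product", "warehouse", "month"},
}

def group_into_fact_tables(matrix: dict) -> dict:
    """Group facts by identical dimensionality; each group suggests one fact table."""
    groups = defaultdict(list)
    for fact, dims in matrix.items():
        groups[frozenset(dims)].append(fact)
    return {tuple(sorted(dims)): sorted(facts) for dims, facts in groups.items()}

fact_tables = group_into_fact_tables(MATRIX)
```

With this matrix three candidate fact tables emerge, matching the sales, shipment and inventory split used later in the deck.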

86

Multiple Fact Table Summary

• Different processes need different tables
• Identified with
  • Grain
  • Dimensionality
  • Timing
• Same process may need multiple fact tables
  • Heterogeneous attributes
  • Coverage
  • Snapshot and transaction
  • Aggregates

87

Architected Data Marts

88

Data Mart

Meaning of the term 'data mart' has shifted over the last several years...

89

Data Mart Architecture 1993

[Diagram: Operational Systems -> E.T.L. Software -> Data Warehouse -> E.T.L. Software -> Data Marts -> Query & Reporting Software -> Analysis Users]

90

Data Mart Architecture 1997

[Diagram: Operational Systems -> E.T.L. Software -> Data Marts -> Query & Reporting Software -> Analysis Users]

91

Architected Data Marts

[Diagram: Operational Systems -> E.T.L. Software -> Data Warehouse (containing Data Mart) -> Query & Reporting Software -> Analysis Users]

92

Data Mart

• Warehouse Subject Area
• Incremental warehouse development
• Centralized architecture
• Not new
• Well-suited to star schemas

93

“Stovepipe” Data Marts

[Diagram: three unconnected star schemas: Store Sales Facts with Product and Time (Day); Shipments Facts with Product, Time (Day) and Warehouse; Inventory Facts with Warehouse, Product and Month]

• “Stovepipe” data marts
  • Inconsistent and overlapping data
  • Difficult and costly to maintain
  • Redundant data load
  • Can’t drill across
  • Integration requires starting over
• Dimensions not conformed

94

Conformed Dimensions

• Definition
  • Dimensions are conformed when they are the same
  • -or-
  • When one dimension is a strict rollup of another

95

Conformed Dimensions

Same dimensions must:

1. ... have exactly the same set of primary keys, and
2. ... have the same number of records

96

Conformed Dimensions

• Rolled up dimension
  • When one dimension is a strict rollup of another
• Which means
  • Two conformed dimensions can be combined into a single logical dimension by creating a union of the attributes
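A minimal sketch of the two conformance tests in Python. The function names, the row layout (dicts with a "key" column) and the sample day and month dimensions are assumptions; the rollup check shown is one simple interpretation of "strict rollup", namely that the shared attributes carry exactly the same values in both tables and the rollup has one row per combination.

```python
def same_dimension(dim_a, dim_b) -> bool:
    """Same set of primary keys, which also implies the same number of records."""
    return {r["key"] for r in dim_a} == {r["key"] for r in dim_b}

def is_strict_rollup(detail, rollup, shared_attrs) -> bool:
    """Shared attribute values match exactly and the rollup has one row per combination."""
    detail_values = {tuple(r[a] for a in shared_attrs) for r in detail}
    rollup_values = {tuple(r[a] for a in shared_attrs) for r in rollup}
    return detail_values == rollup_values and len(rollup) == len(rollup_values)

day_dim = [
    {"key": 1, "date": "2014-11-24", "month": "2014-11", "year": 2014},
    {"key": 2, "date": "2014-12-25", "month": "2014-12", "year": 2014},
]
month_dim = [
    {"key": 201411, "month": "2014-11", "year": 2014},
    {"key": 201412, "month": "2014-12", "year": 2014},
]
# A month table that labels months differently does not conform.
mismatched_month_dim = [{"key": 201411, "month": "Nov-14", "year": 2014}]
```

Here `is_strict_rollup(day_dim, month_dim, ["month", "year"])` holds, so facts stored by day and facts stored by month can be compared at the month level.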

97

Conformed Dimensions

Description Shared common dimensions

Integrates logical design

Ensures consistency between data marts

Allows incremental development

Independent of physical location

Some re-work may be required

98

Conformed Dimensions

• Advantages
  • Enables an incremental development approach
  • Easier and cheaper to maintain
  • Drastically reduces extraction and loading complexity
  • Answers business questions that cross data marts
  • Supports both centralized and distributed architectures

99

Conformed Dimensions

Interlocking Star Schemas

[Diagram: Sales Facts, Shipment Facts and Inventory Facts sharing the Store, Product, Time, Warehouse and Month dimensions]

100

Kimball’s Data Warehouse Bus

[Diagram: Sales Facts, Shipment Facts and Inventory Facts plugged into a bus of conformed dimensions: Store, Product, Day, Warehouse, Month]

101

Course Review

• Rationale for dimensional modeling
• Dimensional modeling basics
• Dimensional modeling details
• Fact table details
• Dimension table details
• Design process
• Aggregate schemas
• Multiple fact tables
• Architected data marts