Date post: 25-Jan-2020

IST722 Data Warehousing

Dimensional Modeling

Michael A. Fudge, Jr.

Presenter
Presentation Notes
The PowerPoint slides have quizzes in them. Since this presentation is not interactive, I’ve left those out. I encourage you to download the corresponding PowerPoint and quiz yourself.

Pop Quiz: T/F
1. The business meaning of a fact table row is known as a dimension.
2. A dimensional data model is optimized for maximum query performance / ease of use.
3. An attribute is a business performance measurement.
4. Order date & Shipping date use the same data. This is an example of a conformed dimension.
5. A degenerate dimension represents a dimensional key with no attributes.

Presenter
Presentation Notes
1. F (fact table grain) 2. T 3. F (fact) 4. T 5. T

Pop Quiz: T/F - Answers
1. The business meaning of a fact table row is known as a dimension. False (Fact table grain)
2. A dimensional data model is optimized for maximum query performance / ease of use. True
3. An attribute is a business performance measurement. False (Fact)
4. Order date & Shipping date use the same data. This is an example of a conformed dimension. True
5. A degenerate dimension represents a dimensional key with no attributes. True

Objective: Define and Explain “dimensional modeling”

Presenter
Presentation Notes
We will cover a lot of concepts today. You might have to watch this video more than once to really get a fix on what’s happening here. There will be a lot of new concepts, and you’ll need to become intimate with them to succeed in the course.

Recall: Kimball Lifecycle

Presenter
Presentation Notes
Describes an approach for data warehouse projects

Dimensional Modeling
• A Logical design technique for structuring data with the following objectives:
  1. Intuitive: Easy for business users to understand
  2. Fast: Excellent query performance

Presenter
Presentation Notes
Remember, one of the key reasons the discipline of data warehousing exists is that the STRUCTURE of data we have in our transactional systems is not very conducive to ad-hoc querying and analytics. The goal of dimensional modeling is to re-shape our data into a form more queryable by end-users.

E-R Models vs. Dimensional Models

Entity-Relationship
• Complex.
• Designed to eliminate data redundancy.
• Optimized for storage.
• Supports transaction processing.
• Operational Data.
• Highly Normalized.

Dimensional
• Easy to Understand.
• Designed to support data redundancy.
• Optimized for information retrieval.
• Decision support processing.
• De-Normalized.

The CIF & Dimensional Models

[Figure: the CIF architecture. Red: Relational Models. Green: Dimensional Models.]

Components of the Dimensional Model

• Fact Table – A database table of quantifiable performance measurements (facts).
  o Ex. Sales Amount, Days To Ship, Quantity on Hand.
• Dimension Table – A table of contexts for the facts.
  o Ex. Date/Time, Location, Customer, Product
• Attribute – A characteristic of a dimension.
  o Ex. Product: Name, Category, Department
• Star Schema – Connections among facts and dimensions which define a business process.
  o Ex: Sales, Inventory Management

I like to think about it this way:

• Facts are the business process measurement events.
• Dimensions provide the context for that event.

“How many sneakers did we sell last week?”

• “How many” – Quantity (Fact)
• “sneakers” – Product Type (Attribute of a Product Dimension)
• “last week” – Duration of Time (Attribute of a Sales Date Dimension)
• “sell” – Business Process (Sales)
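
To make the sneaker question concrete, here is a minimal sketch of how it could be answered against a star schema. The table names, column names, and sample rows are all hypothetical (they do not come from the slides), and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical star schema: one fact table joined to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    product_type TEXT
);
CREATE TABLE dim_date (
    date_key   INTEGER PRIMARY KEY,
    full_date  TEXT,
    week_label TEXT
);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER
);
INSERT INTO dim_product VALUES (1, 'Trail Runner', 'Sneakers'), (2, 'Dress Oxford', 'Shoes');
INSERT INTO dim_date VALUES (20200113, '2020-01-13', 'Last Week'), (20200120, '2020-01-20', 'This Week');
INSERT INTO fact_sales VALUES (20200113, 1, 3), (20200113, 2, 5), (20200113, 1, 4), (20200120, 1, 10);
""")

# The fact (quantity) is summed; dimension attributes filter the rows.
sneakers_last_week = conn.execute("""
    SELECT SUM(f.quantity)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date    AS d ON d.date_key    = f.date_key
    WHERE p.product_type = 'Sneakers'
      AND d.week_label   = 'Last Week'
""").fetchone()[0]

print(sneakers_last_week)
```

Notice that the query only aggregates the fact and only filters on attributes, which is the division of labor the slide describes.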

Recall: The Star Schema

[Figure: star schema diagram with callouts for Attribute, Dimension, Fact Table, Fact, Primary Key, and Foreign Key]

3 Types of Facts
• Additive
  o Fact can be summed across all dimensions.
  o The most useful kind of fact.
  o Ex. Quantity Sold, Hours Billed
• Semi-Additive
  o Cannot be summed across all dimensions, such as time periods.
  o Sometimes these are averaged across the time dimension.
  o Ex. Account Balance, Quantity on Hand
• Non-Additive
  o Cannot be summed across any dimension.
  o These do not belong in the fact table, but with the dimension.
  o Ex. Building square footage, Product retail price
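
A small sketch of why an account balance is only semi-additive. The snapshot rows below are made-up illustration data: summing balances across accounts on one day is meaningful, but summing the same balances across days double counts, so the daily totals are averaged instead.

```python
from collections import defaultdict

# Hypothetical daily balance snapshots: (day, account, balance).
balances = [
    ("Mon", "A", 100.0), ("Mon", "B", 50.0),
    ("Tue", "A", 120.0), ("Tue", "B", 50.0),
]

# Summing across the account dimension within a single day is valid.
total_per_day = defaultdict(float)
for day, account, balance in balances:
    total_per_day[day] += balance

# Summing across the time dimension is not valid for a balance.
misleading_sum = sum(total_per_day.values())

# Averaging across the time dimension gives a usable number instead.
average_daily_total = sum(total_per_day.values()) / len(total_per_day)

print(dict(total_per_day), misleading_sum, average_daily_total)
```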

Is that a Fact?
• Not every numeric value is a fact.
• Good Fact-Detecting Rules
  o If it is Additive (does it sum up across dimensions), then it is a fact.
  o If it is used for filtering or labeling, then it’s not a fact but an attribute of a dimension.
    Ex: Basketball Player’s height.
  o If it is used in calculations, then it should be treated as a fact.
    Ex: Employee hourly wage is used to calculate weekly pay.

Facts or Attributes? Additive? Semi? Non?

1. Number of page views on a website?
2. The amount of taxes withheld on an employee’s weekly paycheck?
3. Credit card balance.
4. Pants waist size? 32, 34, etc…
5. Tracking when a student attends class?
6. Product Retail Price?
7. Vehicle’s MPG rating?
8. The number of minutes late employees arrive to work each day.

Facts or Attributes? Additive? Semi? Non?

1. Number of page views on a website? F/A
2. The amount of taxes withheld on an employee’s weekly paycheck? F/A
3. Credit card balance. F/S
4. Pants waist size? 32, 34, etc… N/A
5. Tracking when a student attends class? F/A
6. Product Retail Price? N/A
7. Vehicle’s MPG rating? N/A
8. The number of minutes late employees arrive to work each day. F/A

Presenter
Presentation Notes
1. Fact, Additive 2. Fact, Additive 3. Fact, Semi? 4. Attribute of product Dimension 5. Fact, Additive (Factless fact) 6. Attribute of the product dimension 7. Attribute of the vehicle dimension 8. Fact, Additive

Fact Table Design
• The Primary Key of your fact table uses the minimum number of columns possible & no surrogate keys. (Made up of FK’s and Degenerate Dimensions)
• Referential Integrity is a must. Every foreign key in the fact table must have a value.
• Avoid NULLs in the foreign key by using flags, which are special values in place of null.
  o Ex. “No Shopper Card” in Customer Dimension
• The granularity of your fact table should be at the lowest, most detailed atomic grain captured by a business process. (more on this later)

Dimensions
• Dimensions provide context for our facts.
• We can easily identify dimensions because of the “by” and/or “for” words.
  o Ex. Total accounts receivables for the IT Department by Month.
• Dimensions have attributes which describe and categorize their values.
  o Ex. Student: Major, Year, Dormitory, Gender.
• The attributes help constrain and summarize facts.

Dimension Table Design
Characteristics of a Good Dimension table:
• Verbose labels with full words
• Descriptive columns
• Complete – no null / empty values
• Discretely valued – one value per row.
• Quality Assured – data is clean and consistent.
• Always have a Surrogate Primary Key

What's Wrong w/This Dimension?

Prod Id  Prod Name  Prod Cat  Prod Price  Prod Region Code
A        Apple      Fruit     $2.00       E
B        Carrot     Veg       $1.50       S
C        Cherries   Friut     $3.00       S
D        Lettuce    Veg       $1.50
E        Apple      Fruit     $2.00       E

Can you find the 6 things wrong with the implementation of this dimension?

Presenter
Presentation Notes
1. No surrogate key 2. Not discretely valued 3. Poor data quality 4. Incomplete / missing data 5. Poor descriptions 6. Non-verbose data

What's Wrong w/This Dimension?

Callouts on the table above:
• No Surrogate Key
• Not Verbose (What do S & E mean?)
• Incomplete
• Poor Data Quality
• Not Discretely Valued
• Poor Descriptions

Dimension Table Key
• Surrogate keys (identities, sequences e.g. 1,2,3,…) are used for the primary key constraint.
• They yield best performance for Star Schema
  o most efficient joins,
  o smaller indexes in fact table,
  o more rows per block in the fact table
• They have no dependency on primary key in operational source data.
  o Makes it easier to deal with changes to the source data.
• Dimension table always has a natural key used to identify a unique row.
  o Ex: Customer’s email address, Employee’s SSN.

Conformed Dimensions
• Master or common reference dimensions.
• Shared across business processes (fact tables) in the DW.
• Reusable, can be used for drill-across, lower time to develop next star schema.
• Two types:
  o Identical Dimensions – exactly the same dimensions (Ex. Dates)
  o Perfect Subset of an existing dimension.

Ex. Conformed Dimensions

Sales Fact Table
• Date key FK
• Product key FK
• … other FKeys…
• Sales quantity
• Sales amount

Product Dimension
• Product key PK
• Product description
• SKU number
• Brand description
• Class description
• Department description

Sales Forecast Fact Table
• Month key FK
• Brand key FK
• … other FKeys…
• Forecast quantity
• Forecast amount

Brand Dimension (Subset of the Product Dimension)
• Brand key PK
• Brand description
• Class description
• Department description

Date and Time Dimensions

• Just about every fact table has a date dimension.
• This is the most common of conformed dimensions.
• Usually generated programmatically during the ETL process or imported from a spreadsheet.
• Acceptable to use PK in the form YYYYMMDD.
• If you need time of day, use a separate dimension.
• Time of day should only be used if there are meaningful textual descriptions of time.
  o Ex. Lunch, Dinner, 1st shift, 2nd Shift, Etc…
• Elapsed time intervals are facts, not attributes.
  o Ex. Minutes between when order was received and shipped
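
As an illustration of generating a date dimension programmatically, here is a minimal sketch. The column names and the choice of attributes are assumptions for the example; a real date dimension usually carries many more attributes (fiscal periods, holidays, and so on).

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            # Integer key in YYYYMMDD form, e.g. 20200125.
            "date_key": int(current.strftime("%Y%m%d")),
            "full_date": current.isoformat(),
            "month_number": current.month,
            "quarter": f"Q{(current.month - 1) // 3 + 1}",
            "year": current.year,
            # Verbose label rather than a cryptic flag.
            "day_type": "Weekend" if current.weekday() >= 5 else "Weekday",
        })
        current += timedelta(days=1)
    return rows


date_dim = build_date_dimension(date(2020, 1, 1), date(2020, 12, 31))
print(len(date_dim), date_dim[0]["date_key"], date_dim[-1]["date_key"])
```

The rows could then be bulk-loaded into the dimension table during ETL.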

Ex. Date Dimension

Handling Time Zones?
• Express time in coordinated universal time (UTC)
• Express in local time, too.
• Other options: use a single time zone (for example, ET) to express all times in this zone.

Call Center Activity Fact
• Local call date key FK -> Local call date dimension
• UTC call date key FK -> UTC call date dimension
• Local call time of day FK -> Local call time of day dimension
• UTC call time of day FK -> UTC call time of day dimension

Degenerate Dimensions
• Occur in transaction fact tables that have a parent-child (One to Many) structure.
  o Ex. Order -> Order Detail, Airline Ticket -> Flights
• Dimensions we store in the fact table (because there are too many of them for their own dimension)
• Allow us to drill-through to operational data.
• Usually end up as part of the primary key of the fact table.

Slowly Changing Dimensions

• Dimensional data changes infrequently, but when it does you need a strategy for addressing the change.
  o Ex: Customer has a new address, Employee has a name change
• 4 Popular strategies
  o Type 1: Overwrite the existing attribute
  o Type 2: Add a new Dimension row
  o Type 3: Add a new Dimension attribute
  o Mini-Dimension: Add a new Dimension
• These strategies are not mutually exclusive!

Type 1: Overwrite
• Appropriate for:
  o correcting mistakes or errors
  o changes where historical associations do not matter
  o the old value has no significance
• If the previous value matters, don’t use this strategy.
• Problems will occur with data aggregated on old values.
• Ex. Employee Name Changes, Corrections, Natural Key Edits.

Type 2: Add New Dimension Row

• Most popular strategy, preserves history
• Natural key is repeated.
• Old and new values are stored along with effective dates and indicator of current row

Product Key  Product Descr.  Product Code  Department       Effective Date  Expiration Date  Current Row
11981        Stapler, Red    ST901         Accessories      4/7/2010        9/1/2011         N
20344        Stapler, Red    ST901         Supplies         9/2/2011        3/31/2013        N
45393        Stapler, Red    ST901         Office Supplies  4/1/2013        12/31/9999       Y
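
The stapler history above can be reproduced with a short sketch of a Type 2 change. The function name, the dictionary layout, and the rule of expiring the old row on the day before the change are assumptions made for this example, not a prescribed implementation.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # "no expiration yet" marker, as in the table


def apply_type2_change(rows, product_code, changes, change_date, next_key):
    """Expire the current row for a natural key and append a new version."""
    current = next(
        r for r in rows
        if r["product_code"] == product_code and r["current_row"] == "Y"
    )
    # Close out the old version the day before the change takes effect.
    current["expiration_date"] = change_date - timedelta(days=1)
    current["current_row"] = "N"
    # New version: same natural key, new surrogate key, changed attributes.
    new_row = {
        **current,
        **changes,
        "product_key": next_key,
        "effective_date": change_date,
        "expiration_date": HIGH_DATE,
        "current_row": "Y",
    }
    rows.append(new_row)
    return new_row


product_dim = [{
    "product_key": 11981, "product_descr": "Stapler, Red",
    "product_code": "ST901", "department": "Accessories",
    "effective_date": date(2010, 4, 7), "expiration_date": HIGH_DATE,
    "current_row": "Y",
}]
apply_type2_change(product_dim, "ST901", {"department": "Supplies"},
                   date(2011, 9, 2), next_key=20344)
apply_type2_change(product_dim, "ST901", {"department": "Office Supplies"},
                   date(2013, 4, 1), next_key=45393)

for row in product_dim:
    print(row["product_key"], row["department"], row["expiration_date"], row["current_row"])
```

Existing fact rows keep pointing at the surrogate key that was current when they were loaded, which is how history is preserved.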

Type 3: Add A New Dimension Attribute

• Infrequently used, preserves history
• Useful for “Soft” changes where users might want to choose between the old and new attribute
• The new value is written to the existing column, the old value is stored in a new column.
• This way queries do not have to be re-written to access the new attribute.
• Ex. Redistricting sales territories. Re-charting accounting codes.

Mini-Dimensions: Add a new Dimension

• If attributes change frequently consider placing them in their own “mini-dimensions”
• Most effective when you have banded values, or ranges of discrete values.

Fact Table
• Customer Key FK
• Customer Demographics Key FK
• … other FKeys…
• … Facts…

Customer Dimension
• Customer key PK
• Customer ID (Nat. Key)
• Customer Name

Customer Demographics Dimension
• Customer Demographics Key PK
• Customer Age Band
• Customer Gender
• Customer Income Band
• …

Role-Playing Dimensions
• The same physical dimension plays more than one logical dimensional role.
• Common among the date dimension
• Stored in the same physical table, just aliased as a view.
• Examples:
  o Date: Order Date, Shipping Date, Delivery Date
  o Address: Ship to, Bill to
  o Airport: Arrival, Departure
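
A minimal sketch of the "one physical table, aliased as views" idea, using SQLite from Python. All table, view, and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One physical date dimension.
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);

-- Two logical roles, each exposed as a view with role-specific column names.
CREATE VIEW dim_order_date AS
    SELECT date_key AS order_date_key, full_date AS order_date FROM dim_date;
CREATE VIEW dim_ship_date AS
    SELECT date_key AS ship_date_key, full_date AS ship_date FROM dim_date;

CREATE TABLE fact_orders (
    order_date_key INTEGER REFERENCES dim_date(date_key),
    ship_date_key  INTEGER REFERENCES dim_date(date_key),
    quantity       INTEGER
);

INSERT INTO dim_date VALUES (20200106, '2020-01-06'), (20200108, '2020-01-08');
INSERT INTO fact_orders VALUES (20200106, 20200108, 2);
""")

# Each foreign key joins to its own role of the same underlying table.
rows = conn.execute("""
    SELECT o.order_date, s.ship_date, f.quantity
    FROM fact_orders AS f
    JOIN dim_order_date AS o ON o.order_date_key = f.order_date_key
    JOIN dim_ship_date  AS s ON s.ship_date_key  = f.ship_date_key
""").fetchall()

print(rows)
```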

Junk Dimensions
• Miscellaneous Flags and text attributes which do not fit within any other dimension.
• Place them in their own “Junk” dimension

Invoice Indicator Id  Payment Terms  Order Mode  Ship Mode
1                     Net 10         Web         Freight
2                     Net 10         Web         Air
3                     Net 10         Fax         Freight
4                     Net 10         Fax         Air
5                     Net 10         Phone       Freight
6                     Net 10         Phone       Air
7                     Net 15         Web         Freight
8                     Net 15         Web         Air

Don’t Create a Junk Dimension Row Until You Need It
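
One way to read "don't create a junk dimension row until you need it" is a get-or-create lookup during the fact load: a combination of flags only receives a key the first time it appears in the source data. The sketch below assumes an in-memory dictionary and a hypothetical helper name; in practice the lookup would run against the dimension table.

```python
# Maps (payment_terms, order_mode, ship_mode) -> surrogate key.
junk_dim = {}


def get_junk_key(payment_terms, order_mode, ship_mode):
    """Return the key for this combination, creating the row on first use."""
    combo = (payment_terms, order_mode, ship_mode)
    if combo not in junk_dim:
        junk_dim[combo] = len(junk_dim) + 1  # next sequential surrogate key
    return junk_dim[combo]


# Only combinations that actually occur in the source data get a row.
first = get_junk_key("Net 10", "Web", "Freight")
second = get_junk_key("Net 10", "Web", "Air")
repeat = get_junk_key("Net 10", "Web", "Freight")

print(first, second, repeat, len(junk_dim))
```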

Snowflake & Outrigger Dimensions

• When the redundant attributes are moved to a separate table to eliminate redundancy, we get a snowflaked dimension.
• Pros: Data is back in 3NF, saves space
• Cons: More complex for users, decreased performance.
• Sometimes this is desirable when there are a significant number of attributes in the outrigger dimension. These are the exception, not the rule!

Product Dimension
• Product Key PK
• Product Name
• Product Size Key FK

Product Size Dimension
• Product Size Key PK
• Product Size (S,M,L)
• Product Size Fee

Hierarchies in Dimensions

• Fixed hierarchies – Simply de-normalize as attributes
  o Ex. Product: Department -> Type
• Variable-depth hierarchies – implement with a bridge table (used to resolve M-M relationships)
• Should be used only when absolutely necessary
  o Negatively affects usability
  o Decreases performance

Fact Table
• Date Key FK
• Customer Key FK
• More Foreign Keys…
• Facts ….

Customer Dimension
• Customer Key PK
• Customer Name
• ….

Customer Hierarchy Bridge
• Parent Customer Key PK,FK
• Subsidiary Cust. Key PK,FK
• # Levels from Parent
• Bottom Flag
• Top Flag

Multi-Valued Dimensions
• Almost all Fact-Dimension relationships are M-1
• Sometimes there’s a M-M relationship between fact and Dimension.
• The Weighting factor is between 0 and 1 and should add up to 1 for each unique group key.

Health Care Billing Fact
• Billing Date Key FK
• Patient Key FK
• Diagnosis Group Key FK
• Bill Amount
• More Facts ….

Diagnosis Group Bridge
• Diagnosis Group Key PK,FK
• Diagnosis Key PK,FK
• Weighting Factor

Diagnosis Dimension
• Diagnosis Key PK
• ICD-9 Code
• Diagnosis Description
• ….
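
To show why the weights should add up to 1 per group, here is a small sketch that allocates each bill amount to diagnoses through the bridge. The keys, weights, and amounts are invented for the example.

```python
import math
from collections import defaultdict

# Hypothetical bridge rows: (diagnosis_group_key, diagnosis_key, weighting_factor).
diagnosis_group_bridge = [
    (1, 101, 0.5), (1, 102, 0.3), (1, 103, 0.2),
    (2, 101, 1.0),
]
# Hypothetical fact rows: (diagnosis_group_key, bill_amount).
billing_fact = [(1, 200.0), (2, 80.0)]

# Check the rule from the slide: weights sum to 1 within each group.
weight_totals = defaultdict(float)
for group_key, _, weight in diagnosis_group_bridge:
    weight_totals[group_key] += weight
assert all(math.isclose(total, 1.0) for total in weight_totals.values())

# Allocate each bill across its diagnoses using the weights.
amount_by_diagnosis = defaultdict(float)
for fact_group_key, bill_amount in billing_fact:
    for group_key, diagnosis_key, weight in diagnosis_group_bridge:
        if group_key == fact_group_key:
            amount_by_diagnosis[diagnosis_key] += bill_amount * weight

# Because weights sum to 1, the allocated total matches the billed total.
print(dict(amount_by_diagnosis))
```

Without the weights, joining through the bridge would count the full bill once per diagnosis and overstate the total.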

What Kind of Dimension?
1. Customers (for orders and sales leads)
2. The various classrooms on a college campus?
3. Items on a restaurant menu?
4. Parts required to repair an automobile as part of a service record?
5. The instructors who teach a college class?

• Conformed?
• Degenerate?
• Slowly Changing? & Type?
• Role Playing?
• Junk?
• Outrigger?
• M-M (Bridge)?

3 Fact Table Grains
• Transaction
• Periodic Snapshot
• Accumulating Snapshot

Transaction Fact
• The most basic fact grain
• One row per line in a transaction
• Corresponds to a point in space and time
• Once inserted, it is not revisited for update
• Rows inserted into fact table when transaction occurs
• Examples:
  o Sales, Returns, Telemarketing, Registration Events

Periodic Snapshot Fact
• At predetermined intervals, snapshots of the same level of detail are taken and stacked consecutively in the fact table
• Snapshots can be taken daily, weekly, monthly, hourly, etc…
• Complements detailed transaction facts but does not replace them
• Shares the same conformed dimensions but has fewer dimensions
• Examples:
  o Financial reports, Bank account values, Semester class schedules, Daily classroom Lab Logins

Accumulating Snapshot Fact
• Less frequently used, application specific.
• Used to capture a business process workflow.
• Fact row is initially inserted, then updated as milestones occur
• Fact table has multiple date FK that correspond to each milestone
• Special facts: milestone counters and lag facts for length of time between milestones
• Examples:
  o Order fulfillment, Job Applicant tracking, Rental Cars

Which Fact Table Grain?
1. Concert ticket purchases?
2. Voter exit polls in an election?
3. Mortgage loan application and approval?
4. Auditing software use in a computer lab?
5. Daily summaries of visitors to websites?
6. Tracking Law School applications?
7. Attendance at sporting events?
8. Admissions to sporting events at 15 minute intervals?

• Transaction
• Periodic Snapshot
• Accumulating Snapshot

Which Fact Table Grain?
1. Concert ticket purchases? T
2. Voter exit polls in an election? T
3. Mortgage loan application and approval? AS
4. Auditing software use in a computer lab? T
5. Daily summaries of visitors to websites? PS
6. Tracking Law School applications? AS
7. Attendance at sporting events? T
8. Admissions to sporting events at 15 minute intervals? PS

T = Transaction, PS = Periodic Snapshot, AS = Accumulating Snapshot

Facts of Different Granularity == NO

• A single fact table cannot have facts with different levels of granularity
• All measurements must be at the same level of detail
• Example:
  o Measurements are captured for each line order except for the shipping charge, which is for the entire order
• Solutions:
  o Allocating higher level facts to a lower granularity (split shipping charge among each item)
  o Create two separate fact tables (Orders fact & Line Order fact)
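
A sketch of the first solution, allocating an order-level shipping charge down to the line grain. The slides do not say how to split the charge; allocating in proportion to each line's amount is an assumption made here, as is working in integer cents so the pieces always add back up to the original charge.

```python
def allocate_shipping(line_amounts_cents, shipping_cents):
    """Split an order-level charge across lines in proportion to line amount.

    Amounts are integer cents. Any rounding remainder is added to the last
    line so the allocated parts sum exactly to the original charge.
    """
    total = sum(line_amounts_cents)
    if total == 0:
        raise ValueError("cannot allocate against an order with zero total")
    shares = [shipping_cents * amount // total for amount in line_amounts_cents]
    shares[-1] += shipping_cents - sum(shares)
    return shares


# Three order lines worth $20, $10 and $5 sharing a $7.00 shipping charge.
proportional = allocate_shipping([2000, 1000, 500], 700)

# Equal lines with a charge that does not divide evenly.
with_remainder = allocate_shipping([1000, 1000, 1000], 1000)

print(proportional, with_remainder)
```

After allocation, every measurement in the fact table sits at the order-line grain and the shipping fact stays additive.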

Multiple currencies / Units of Measure

• Measurements are provided in a local currency
• Measurements should be converted to a standardized currency or else conversion rates must be stored
• Similarly, in case of multiple units of measure, conversions to all different units of measure should be provided
  o Ex. Items received are by box (12 in a box = Received unit factor)
    Received Price = Received unit factor x unit price

Factless Fact tables
• Business processes that do not generate quantifiable measurements
  o Ex: Student attendance, College admissions
• Can be easily converted into traditional fact tables by adding an attribute Count, which is always equal to 1.
• Helps to perform aggregations
  o Ex: Attendance Count
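
A minimal sketch of the count-of-1 trick. The event rows and class labels are hypothetical; the point is that once each row carries an attendance count of 1, attendance can be summed like any other additive fact.

```python
from collections import Counter

# Hypothetical attendance events: (date_key, student_key, class_key).
attendance_events = [
    (20200121, 1, "CLASS-A"),
    (20200121, 2, "CLASS-A"),
    (20200121, 1, "CLASS-B"),
    (20200123, 1, "CLASS-A"),
]

# Factless fact rows plus a constant count column equal to 1.
attendance_fact = [
    {"date_key": d, "student_key": s, "class_key": c, "attendance_count": 1}
    for d, s, c in attendance_events
]

# Aggregating is now an ordinary sum over the count column.
attendance_by_class = Counter()
for row in attendance_fact:
    attendance_by_class[row["class_key"]] += row["attendance_count"]

print(dict(attendance_by_class))
```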

Consolidated fact tables
• Fact tables populated from different sources may be consolidated into a single fact table
  o Level of granularity must be the same
  o Measurements are listed side-by-side
  o Ex. by combining forecast and actual sales amounts, a forecast/actual sales variance amount can be easily calculated and stored

Sales Fact
• Date Key FK
• Customer Key FK
• Region Key FK
• Actual Sales

Forecast Fact
• Date Key FK
• Customer Key FK
• Region Key FK
• Forecast Sales

Sales & Forecast Fact
• Date Key FK
• Customer Key FK
• Region Key FK
• Actual Sales
• Forecast Sales
• Sales Variance
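
The consolidation can be sketched as matching rows from the two facts on their shared dimension keys and storing the measurements side by side. The sample values are invented, and defining the variance as actual minus forecast is an assumption; the slides do not give the formula.

```python
# Both facts share one grain: (date_key, customer_key, region_key).
sales_fact = {
    (20200101, 1, 10): 500.0,
    (20200101, 2, 10): 300.0,
}
forecast_fact = {
    (20200101, 1, 10): 450.0,
    (20200101, 2, 10): 350.0,
}

sales_and_forecast_fact = []
# Only keys present in both sources are consolidated in this sketch.
for key in sorted(sales_fact.keys() & forecast_fact.keys()):
    date_key, customer_key, region_key = key
    actual = sales_fact[key]
    forecast = forecast_fact[key]
    sales_and_forecast_fact.append({
        "date_key": date_key,
        "customer_key": customer_key,
        "region_key": region_key,
        "actual_sales": actual,
        "forecast_sales": forecast,
        # Assumed definition: positive means sales beat the forecast.
        "sales_variance": actual - forecast,
    })

for row in sales_and_forecast_fact:
    print(row)
```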

Finally: Do’s and Don'ts
• Do not take a “report centric” approach
  o Reuse your dimensional models for multiple reports
• Dimensional models should not be departmentally bound.
  o Reuse your dimensional models for multiple departments
• Create dimensional models with the finest level of granularity.
  o This will be the most flexible and scalable option.
• Use Conformed dimensions
  o Helps with integration efforts
  o Simplifies the process of creating the next data mart.
