Data Warehouse and the Star Schema CSCI 242 ©Copyright 2014, David C. Roberts, all rights reserved.

Data Warehouse and the Star Schema

CSCI 242

©Copyright 2014, David C. Roberts, all rights reserved

Agenda

Definition
Why data warehouse
Data warehouse in the enterprise
Data warehouse design
Relevance of normalization
Star schema
Processing the star schema

Definition

Data warehouse: A repository of integrated information, available for queries and analysis. Data and information are extracted from heterogeneous sources as they are generated.

The point is that it is not used for transaction processing; for the people querying it, it is read-only. The data can come from heterogeneous sources, and all of it can be queried in one database.

Why Data Warehouse

A read lock on a table will prevent any updating of a table

A long-running analytic operation on all rows of a table will prevent any updates

Analysis (a.k.a. decision support) can seriously interfere with updates

Using a duplicate table for analysis, recopied once a day, allows unlimited analysis and doesn’t interfere with OLTP.
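The duplicate-table idea can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module with an in-memory database, and the table names `orders` and `orders_snapshot`, the sample rows, and the function `refresh_snapshot` are all hypothetical. A real warehouse load would copy into a separate database on a schedule.

```python
import sqlite3


def refresh_snapshot(conn, source="orders", snapshot="orders_snapshot"):
    """Recopy `source` into `snapshot` so analysis never reads the live table."""
    # Hypothetical table names; they come from trusted code, not user input.
    conn.execute(f"DROP TABLE IF EXISTS {snapshot}")
    conn.execute(f"CREATE TABLE {snapshot} AS SELECT * FROM {source}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (2, 250)])
conn.commit()

# In practice this call would be made once a day by a scheduled job.
refresh_snapshot(conn)

# OLTP keeps updating the live table after the copy was taken...
conn.execute("INSERT INTO orders VALUES (3, 75)")
conn.commit()

# ...while analysis reads only the snapshot from the last refresh.
snapshot_total = conn.execute("SELECT SUM(amount) FROM orders_snapshot").fetchone()[0]
live_total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(snapshot_total, live_total)
```

The analysis total lags the live total until the next refresh, which is the price paid for keeping long-running queries away from the OLTP table.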

Data Warehouse vs. OLTP

| | OLTP | DW |
| --- | --- | --- |
| Purpose | Automate day-to-day operations | Analysis |
| Structure | RDBMS | RDBMS |
| Data Model | Normalized | Dimensional |
| Access | SQL | SQL and business analysis programs |
| Data | Data that runs the business | Current and historical information |
| Condition of data | Changing, incomplete | Historical, complete, descriptive |

How It Fits into the Enterprise

[Figure: enterprise architecture diagram. Labels: OLTP1, OLTP2, OLTP3; Integration; Extract, Transform and Load; Data Warehouse; Data Mart (four); Application A, Application B, Application C; User (seven).]

Data Warehouse Database Design

A conventional database design for a data warehouse would lead to joins on large amounts of data, and those joins would run slowly.

The star schema allows fast processing of very large quantities of data in the data warehouse.

A Sample OLTP Schema

[Figure: sample OLTP schema with tables orders, orderitems, products, and customers.]

Transformed to a Star Schema

[Figure: star schema with the fact table sales and four dimension tables: products, customers, channels, and times.]

Star Schema

[Figure: star schema diagram. Labels: Fact Table, Customer, Item, Supplier, Time, Location.]

Fact Table

The fact table contains the actual business process measurements or metrics called facts, usually numbers.

Other aspects of the business process are represented by foreign keys to “dimension” tables.

These foreign keys are usually generated keys, in order to save fact table space.

If you are building a DW of monthly sales in dollars, your fact table will contain monthly sales, one row per month.

If you are building a DW of retail sales, the fact table might have one row for each item sold.
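One way to picture generated keys: each distinct natural key (for example a product SKU) is assigned a small integer once, and the fact table stores only that integer. The sketch below is a minimal illustration, not a prescribed method. The class name `SurrogateKeys` and the starting value 1001 (chosen to resemble the sample Dim_Id values in this deck) are assumptions.

```python
from itertools import count


class SurrogateKeys:
    """Hand out one compact integer key per distinct natural key."""

    def __init__(self, start=1001):  # starting value is an assumption
        self._next = count(start)
        self._by_natural = {}

    def key_for(self, natural_key):
        # Reuse the key if this natural key was seen before, else generate one.
        if natural_key not in self._by_natural:
            self._by_natural[natural_key] = next(self._next)
        return self._by_natural[natural_key]


product_keys = SurrogateKeys()
ids = [product_keys.key_for(sku) for sku in ["DOVE6K", "MLK66F#", "DOVE6K", "SMKSAL55"]]
print(ids)
```

The repeated SKU gets the same integer both times, so a fact row can carry a fixed-width integer instead of a variable-length string.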

Fact Table Design

A fact table may contain one or more facts. Usually you create one fact table per business process or event. For example, if you want to analyze both sales numbers and advertising spending, those are two separate business processes, so you create two separate fact tables: one for sales data and one for advertising cost data. On the other hand, if you want to track the sales tax in addition to the sales number, you simply add one more fact column, called Tax, to the Sales fact table.
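The distinction between a new fact table and a new fact column can be sketched with sqlite3. The table and column names below (`sales_fact`, `advertising_fact`, `ad_cost`, `tax`) and the choice of dimension keys on the advertising table are hypothetical, chosen only to illustrate the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two business processes, so two fact tables (names are assumptions).
CREATE TABLE sales_fact (
    tm_dim_id  INTEGER,
    pr_dim_id  INTEGER,
    loc_dim_id INTEGER,
    sales      INTEGER
);
CREATE TABLE advertising_fact (
    tm_dim_id  INTEGER,
    loc_dim_id INTEGER,
    ad_cost    INTEGER
);
-- Sales tax belongs to the same process as sales, so it is just one more column.
ALTER TABLE sales_fact ADD COLUMN tax INTEGER;
""")

# PRAGMA table_info returns one row per column; index 1 is the column name.
sales_columns = [row[1] for row in conn.execute("PRAGMA table_info(sales_fact)")]
print(sales_columns)
```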

Dimension Table

Dimension tables provide context for the measurements in the fact table. Think of the context of a measurement as the who, what, where, when, and how of that measurement.

In the example business process Sales, the characteristics of the 'monthly sales number' measurement can be Location (where), Time (when), and Product Sold (what).

Dimension attributes may contain hierarchical relationships, such as grouping time into day, week, month, year, etc.

Time Dimension Schema

| Field Name | Type |
| --- | --- |
| Dim_Id | INTEGER (4) |
| Month | SMALL INTEGER (2) |
| Month_Name | VARCHAR (3) |
| Quarter | SMALL INTEGER (4) |
| Quarter_Name | VARCHAR (2) |
| Year | SMALL INTEGER (2) |

Time Dimension Data

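| TM_Dim_Id | TM_Month | TM_Month_Name | TM_Quarter | TM_Quarter_Name | TM_Year |
| --- | --- | --- | --- | --- | --- |
| 1001 | 1 | Jan | 1 | Q1 | 2003 |
| 1002 | 2 | Feb | 1 | Q1 | 2003 |
| 1003 | 3 | Mar | 1 | Q1 | 2003 |
| 1004 | 4 | Apr | 2 | Q2 | 2003 |
| 1005 | 5 | May | 2 | Q2 | 2003 |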
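Time dimension rows like these are typically generated rather than typed in. The following is a minimal sketch that reproduces the pattern of the sample rows for one year. The function name `time_dimension_rows` and the assumption that keys start at 1001 and increase by one per month are illustrative only.

```python
MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def time_dimension_rows(year, first_id=1001):
    """Build (id, month, month_name, quarter, quarter_name, year) rows for one year."""
    rows = []
    for month in range(1, 13):
        quarter = (month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, and so on
        rows.append((first_id + month - 1, month, MONTH_NAMES[month - 1],
                     quarter, f"Q{quarter}", year))
    return rows


rows_2003 = time_dimension_rows(2003)
for row in rows_2003[:5]:
    print(row)
```

Because month and quarter are stored as plain attributes on each row, a query can group by either level of the hierarchy without date arithmetic.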

Location Dimension Schema

| Field Name | Type |
| --- | --- |
| Dim_Id | INTEGER (4) |
| Loc_Code | VARCHAR (4) |
| Name | VARCHAR (50) |
| State_Name | VARCHAR (20) |
| Country_Name | VARCHAR (20) |

Location Dimension Data

| Dim_Id | Loc_Code | Name | State_Name | Country_Name |
| --- | --- | --- | --- | --- |
| 1001 | IL01 | Chicago Loop | Illinois | USA |
| 1002 | IL02 | Arlington Hts | Illinois | USA |
| 1003 | NY01 | Brooklyn | New York | USA |
| 1004 | TO01 | Toronto | Ontario | Canada |
| 1005 | MX01 | Mexico City | Distrito Federal | Mexico |

Product Data Schema

| Field Name | Type |
| --- | --- |
| Dim_Id | INTEGER (4) |
| SKU | VARCHAR (10) |
| Name | VARCHAR (30) |
| Category | VARCHAR (30) |

Product Data

| Dim_Id | SKU | Name | Category |
| --- | --- | --- | --- |
| 1001 | DOVE6K | Dove Soap 6Pk | Sanitary |
| 1002 | MLK66F# | Skim Milk 1 Gal | Dairy |
| 1003 | SMKSAL55 | Smoked Salmon 6oz | Meat |

Categories in Dimension Tables

Categories may or may not be hierarchical; a dimension table can hold both kinds.

Categories provide canned values that can be given to users for queries

Granularity (Grain) of the Fact Table

The level of detail of the fact table is known as the grain of the fact table. In this example the grain of the fact table is the monthly sales number per location per product.

Note about Granularity

There may be multiple star schemas at different levels of granularity, especially for very large data warehouses

The first could be the finest—say, each transaction such as a sale

The next could be an aggregation, like the previous example

There could be more levels of aggregation
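Moving from the finest grain to an aggregated one is a GROUP BY over the fine-grained fact table. The sketch below uses sqlite3 with made-up rows. The table names `sale_fact` and `monthly_sales_fact`, the date-string column, and every value are assumptions for illustration, not data from this deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale_fact (sale_date TEXT, pr_dim_id INTEGER, loc_dim_id INTEGER, amount INTEGER);
CREATE TABLE monthly_sales_fact (month TEXT, pr_dim_id INTEGER, loc_dim_id INTEGER, sales INTEGER);
""")

# Finest grain: one row per individual sale (illustrative values only).
conn.executemany("INSERT INTO sale_fact VALUES (?, ?, ?, ?)", [
    ("2003-01-05", 1001, 1003, 40),
    ("2003-01-20", 1001, 1003, 60),
    ("2003-01-20", 1002, 1001, 25),
    ("2003-02-02", 1001, 1003, 10),
])

# Coarser grain: one row per month, product, and location.
conn.execute("""
INSERT INTO monthly_sales_fact
SELECT substr(sale_date, 1, 7), pr_dim_id, loc_dim_id, SUM(amount)
FROM sale_fact
GROUP BY substr(sale_date, 1, 7), pr_dim_id, loc_dim_id
""")

monthly = conn.execute(
    "SELECT * FROM monthly_sales_fact ORDER BY month, pr_dim_id, loc_dim_id"
).fetchall()
for row in monthly:
    print(row)
```

The aggregated table answers monthly questions with far fewer rows, while the fine-grained table remains available for questions the aggregate cannot answer.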

Design Approach

1. Identify the business process. Determine which business process your data warehouse represents. This process will be the source of your metrics or measurements.

2. Identify the grain. Determine what one row of the fact table means. In the previous example the grain is 'monthly sales per location per product'. It could instead be daily sales, or each individual sale could be one row.

3. Identify the dimensions. Your dimensions should be descriptive (SQL VARCHAR or CHARACTER) as much as possible and should conform to your grain.

4. Finally, identify the facts. Identify your measurements (or metrics or facts). The facts should be numeric and should conform to the grain defined in step 2.

Monthly Sales Fact Table Schema

| Field Name | Type |
| --- | --- |
| TM_Dim_Id | INTEGER (4) |
| PR_Dim_Id | INTEGER (4) |
| LOC_Dim_Id | INTEGER (4) |
| Sales | INTEGER (4) |

Monthly Sales Fact Table Data

| TM_Dim_Id | PR_Dim_Id | LOC_Dim_Id | Sales |
| --- | --- | --- | --- |
| 1001 | 1001 | 1003 | 435677 |
| 1002 | 1002 | 1001 | 451121 |
| 1003 | 1001 | 1003 | 98765 |
| 1001 | 1004 | 1001 | 65432 |
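A typical query against a star schema joins the fact table to the dimensions it needs and aggregates the facts by dimension attributes. The sketch below loads the sample fact rows together with the matching time and location rows from the earlier slides into an in-memory SQLite database, then totals sales by state and quarter. The lowercase table and column names (`monthly_sales`, `time_dim`, `location_dim`) are assumptions, and only the dimension columns used by the query are included.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (tm_dim_id INTEGER PRIMARY KEY, tm_month_name TEXT,
                       tm_quarter_name TEXT, tm_year INTEGER);
CREATE TABLE location_dim (loc_dim_id INTEGER PRIMARY KEY, name TEXT, state_name TEXT);
CREATE TABLE monthly_sales (tm_dim_id INTEGER, pr_dim_id INTEGER,
                            loc_dim_id INTEGER, sales INTEGER);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)", [
    (1001, "Jan", "Q1", 2003), (1002, "Feb", "Q1", 2003), (1003, "Mar", "Q1", 2003),
])
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?)", [
    (1001, "Chicago Loop", "Illinois"), (1003, "Brooklyn", "New York"),
])
conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?, ?)", [
    (1001, 1001, 1003, 435677), (1002, 1002, 1001, 451121),
    (1003, 1001, 1003, 98765), (1001, 1004, 1001, 65432),
])

# Star join: the fact table in the middle, one join per dimension used.
totals = conn.execute("""
SELECT l.state_name, t.tm_quarter_name, t.tm_year, SUM(f.sales)
FROM monthly_sales AS f
JOIN time_dim AS t ON t.tm_dim_id = f.tm_dim_id
JOIN location_dim AS l ON l.loc_dim_id = f.loc_dim_id
GROUP BY l.state_name, t.tm_quarter_name, t.tm_year
ORDER BY l.state_name
""").fetchall()
for row in totals:
    print(row)
```

Each join matches an integer key in the fact table against a small dimension table, which is the access pattern the star schema is designed around. The product dimension is left out here because the query does not group or filter by product.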

Data Mart

A data mart is a collection of subject areas organized for decision support based on the needs of a given department. Examples: finance has their data mart, marketing has theirs, sales has theirs and so on.

Each department generally runs its own data mart. Ownership of the data mart allows each department to bypass the control that might coordinate the data found in the different departments.

Each department's data mart is peculiar to and specific to its own needs. Typically, the database design for a data mart is built around a star-join structure designed for that department.

The data mart contains only a modicum of historical information and is granular only to the point that it suits the needs of the department.

The data mart may also include data from outside the organization, such as normative salary data purchased by an HR department.

About the Data Mart

The structure of the data in the data mart may or may not be compatible with the structure of data in the data warehouse.

The amount of historical data found in the data mart is different from the history of the data found in the warehouse. Data warehouses contain robust amounts of history, while data marts usually contain modest amounts of history.

The subject areas found in the data mart are only faintly related to the subject areas found in the data warehouse.

The relationships found in the data mart may not be those relationships that are found in the data warehouse.

The types of queries satisfied in the data mart are quite different from those queries found in the data warehouse.

Walmart’s Data Warehouse

Half a petabyte in capacity (0.5 x 10^15 bytes)
World’s largest DW
Tracks 100 million customers buying billions of products every week
Every sale from every store is transmitted to Bentonville every night
Walmart has more than 18,000 retail stores, employs 2.2 million, serves 245 million customers every week

Typical Questions

How much orange juice did we sell last year, last month, last week in store X?

What internal factors (position in store, advertising campaigns...) influence orange juice sales?

How much orange juice are we going to sell next week, next month, next year?
