Date post: | 02-Jun-2018 |
Upload: | vishnuselva |
8/10/2019 Is Data Warehouse
Data Warehousing
Spring 2009, 2010
Overview
  ER and Normalization
  Dimensional Modeling
    DM Compared to ER-models
    Fact tables
    Dimension tables
    Retail example
  Database Design
    Enterprise-wide data warehouse
    Data mart approach
Entity-Relationship Modeling
  Built on the relational model proposed by E.F. Codd in 1970 (IBM)
    12 Rules for Relational DBMS
  Collection of relations (tables)
  Normalization
  Grouped by subject areas
    Customer, Product, Finance, etc.
Normalization
  Purpose
    Reduce redundancy
    Reduce errors
    Simplify updates
  Series of rules to test relationships in the data
    First (1NF), second (2NF), and third (3NF) normal forms proposed by E.F. Codd in 1972
    Boyce-Codd normal form (BCNF) 1974
  Database designer's oath (William Kent 1983)
    Every attribute must provide a fact about the key, the whole key, and nothing but the key, so help me Codd
Normalization: An Example

StaffBranch
staffNo  sName        position    salary  branchNo  bAddress
SL21     John White   Manager     30000   B005      22 Deer Rd, London
SG37     Ann Beech    Assistant   12000   B003      163 Main st, Glasgow
SG14     David Ford   Supervisor  18000   B003      163 Main st, Glasgow
SA9      Mary Howe    Assistant   9000    B007      16 Argyll St, Aberdeen
SG5      Susan Brand  Manager     24000   B003      163 Main st, Glasgow
SL41     Julie Lee    Assistant   9000    B005      22 Deer Rd, London
Problems with StaffBranch
  Insertion problems
    Cannot add a new branch if no staff is associated
    Details repeated for each staff member
    Potential inconsistencies and errors
  Deletion
    Delete the last member of a branch
    The branch is lost (e.g., SA9 -> B007 is lost)
  Update problems
    E.g., change the address of one branch
    Must be done for all entries
Relationships / Dependencies
  One-to-one relationship (denoted 1:1)
    Functional dependencies
      For each value of one attribute there can be only one value of the other attribute
      E.g., staffNo and position (1:1): one staff number can be associated with only one position
      staffNo determines position unambiguously
      position is functionally dependent upon staffNo
      staffNo -> position
    Transitive dependencies
      If A -> B and B -> C, then A -> C
      E.g., staffNo -> branchNo and branchNo -> bAddress, so staffNo -> bAddress
  One-to-many relationship (1:*)
    Not functionally dependent
    E.g., position and staffNo (1:*): one position can be associated with several staffNos
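
A functional dependency can be tested against sample data. The sketch below is illustrative only: the helper name fd_holds is an assumption, not something from the slides, and a sample can only refute a dependency, never prove that it holds for all future data.

```python
# StaffBranch rows from the normalization example.
COLUMNS = ("staffNo", "sName", "position", "salary", "branchNo", "bAddress")
STAFF_BRANCH = [
    ("SL21", "John White", "Manager", 30000, "B005", "22 Deer Rd, London"),
    ("SG37", "Ann Beech", "Assistant", 12000, "B003", "163 Main st, Glasgow"),
    ("SG14", "David Ford", "Supervisor", 18000, "B003", "163 Main st, Glasgow"),
    ("SA9", "Mary Howe", "Assistant", 9000, "B007", "16 Argyll St, Aberdeen"),
    ("SG5", "Susan Brand", "Manager", 24000, "B003", "163 Main st, Glasgow"),
    ("SL41", "Julie Lee", "Assistant", 9000, "B005", "22 Deer Rd, London"),
]
ROWS = [dict(zip(COLUMNS, row)) for row in STAFF_BRANCH]


def fd_holds(rows, lhs, rhs):
    """Return True if each combination of lhs values maps to one rhs value.

    fd_holds is a hypothetical helper written for this example.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it
        # on later visits, so a mismatch means the dependency is violated.
        if seen.setdefault(key, value) != value:
            return False
    return True


print(fd_holds(ROWS, ["staffNo"], ["position"]))   # staffNo -> position
print(fd_holds(ROWS, ["position"], ["staffNo"]))   # position -> staffNo
print(fd_holds(ROWS, ["branchNo"], ["bAddress"]))  # branchNo -> bAddress
```

In this sample staffNo -> position holds while position -> staffNo does not, because two staff members share the position Manager.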
Example Case Dependencies
  staffNo -> sName, position, salary, branchNo, bAddress
  branchNo -> bAddress
  bAddress -> branchNo
  branchNo, position -> salary
  bAddress, position -> salary
Better?

Staff
staffNo  sName        position    salary  branchNo
SL21     John White   Manager     30000   B005
SG37     Ann Beech    Assistant   12000   B003
SG14     David Ford   Supervisor  18000   B003
SA9      Mary Howe    Assistant   9000    B007
SG5      Susan Brand  Manager     24000   B003
SL41     Julie Lee    Assistant   9000    B005

Branch
branchNo  bAddress
B005      22 Deer Rd, London
B007      16 Argyll St, Aberdeen
B003      163 Main st, Glasgow

Redundancy?
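
The effect of splitting StaffBranch into Staff and Branch can be tried out with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the column types, the branch B999, and the new Glasgow address are invented for illustration and do not come from the example.

```python
import sqlite3

STAFF = [
    ("SL21", "John White", "Manager", 30000, "B005"),
    ("SG37", "Ann Beech", "Assistant", 12000, "B003"),
    ("SG14", "David Ford", "Supervisor", 18000, "B003"),
    ("SA9", "Mary Howe", "Assistant", 9000, "B007"),
    ("SG5", "Susan Brand", "Manager", 24000, "B003"),
    ("SL41", "Julie Lee", "Assistant", 9000, "B005"),
]
BRANCH = [
    ("B005", "22 Deer Rd, London"),
    ("B007", "16 Argyll St, Aberdeen"),
    ("B003", "163 Main st, Glasgow"),
]

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE Branch (branchNo TEXT PRIMARY KEY, bAddress TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE Staff ("
    " staffNo TEXT PRIMARY KEY, sName TEXT, position TEXT, salary INTEGER,"
    " branchNo TEXT NOT NULL REFERENCES Branch(branchNo))"
)
conn.executemany("INSERT INTO Branch VALUES (?, ?)", BRANCH)
conn.executemany("INSERT INTO Staff VALUES (?, ?, ?, ?, ?)", STAFF)

# Insertion: a branch without staff can now be stored (B999 is hypothetical).
conn.execute("INSERT INTO Branch VALUES ('B999', 'Hypothetical address')")

# Update: changing a branch address touches exactly one row.
cur = conn.execute(
    "UPDATE Branch SET bAddress = '1 New St, Glasgow' WHERE branchNo = 'B003'"
)
rows_changed = cur.rowcount

# The original StaffBranch view is recovered by joining on branchNo.
joined = conn.execute(
    "SELECT s.staffNo, s.sName, s.position, s.salary, s.branchNo, b.bAddress"
    " FROM Staff s JOIN Branch b ON b.branchNo = s.branchNo"
    " ORDER BY s.staffNo"
).fetchall()

print(rows_changed)
for row in joined:
    print(row)
```

Each address is stored once, so the join shows the new address for every Glasgow staff member after a single-row update.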
Keys
  Candidate key
    Attribute, or group of attributes (composite key), that uniquely identifies each tuple
  Primary key
    Candidate key selected to identify tuples uniquely within a relation
  Foreign key
    Attribute, or set of attributes, within a relation that matches the candidate key of some relation
Un-normalized Example

clientNo  cName          propertyNo  pAddress                rentStart  rentFinish  rent  ownerNo  oName
CR76      John Kay       PG4         6 Lawrence st, Glasgow  1-Jul-00   31-Aug-01   350   CO40     Tina Murphy
                         PG16        5 Novar Dr, Glasgow     1-Sep-01   1-Sep-02    450   CO93     Tony Shaw
CR56      Aline Stewart  PG4         6 Lawrence st, Glasgow  1-Sep-99   10-Jun-00   350   CO40     Tina Murphy
                         PG36        2 Manor Rd, Glasgow     10-Oct-00  1-Dec-01    375   CO93     Tony Shaw
                         PG16        5 Novar Dr, Glasgow     1-Nov-02   10-Aug-03   450   CO93     Tony Shaw

Example from Connolly and Begg, 2002, pp. 388-397
1NF
  A relation in which the intersection of each row and column contains one and only one value
  I.e., all rows must have an equal number of columns (and vice-versa), repeating groups are eliminated
  (clientNo, propertyNo) becomes the new primary key

clientNo  propertyNo  cName          pAddress                rentStart  rentFinish  rent  ownerNo  oName
CR76      PG4         John Kay       6 Lawrence st, Glasgow  1-Jul-00   31-Aug-01   350   CO40     Tina Murphy
CR76      PG16        John Kay       5 Novar Dr, Glasgow     1-Sep-01   1-Sep-02    450   CO93     Tony Shaw
CR56      PG4         Aline Stewart  6 Lawrence st, Glasgow  1-Sep-99   10-Jun-00   350   CO40     Tina Murphy
CR56      PG36        Aline Stewart  2 Manor Rd, Glasgow     10-Oct-00  1-Dec-01    375   CO93     Tony Shaw
CR56      PG16        Aline Stewart  5 Novar Dr, Glasgow     1-Nov-02   10-Aug-03   450   CO93     Tony Shaw
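
Eliminating a repeating group amounts to copying the client attributes onto every rental row. The sketch below shows this for the example data; the nested-record layout and the names UNNORMALIZED and to_1nf are assumptions made for illustration.

```python
# Un-normalized form: one record per client holding a repeating group of rentals.
UNNORMALIZED = [
    {
        "clientNo": "CR76",
        "cName": "John Kay",
        "rentals": [
            ("PG4", "6 Lawrence st, Glasgow", "1-Jul-00", "31-Aug-01", 350, "CO40", "Tina Murphy"),
            ("PG16", "5 Novar Dr, Glasgow", "1-Sep-01", "1-Sep-02", 450, "CO93", "Tony Shaw"),
        ],
    },
    {
        "clientNo": "CR56",
        "cName": "Aline Stewart",
        "rentals": [
            ("PG4", "6 Lawrence st, Glasgow", "1-Sep-99", "10-Jun-00", 350, "CO40", "Tina Murphy"),
            ("PG36", "2 Manor Rd, Glasgow", "10-Oct-00", "1-Dec-01", 375, "CO93", "Tony Shaw"),
            ("PG16", "5 Novar Dr, Glasgow", "1-Nov-02", "10-Aug-03", 450, "CO93", "Tony Shaw"),
        ],
    },
]
RENTAL_FIELDS = (
    "propertyNo", "pAddress", "rentStart", "rentFinish", "rent", "ownerNo", "oName",
)


def to_1nf(records):
    """Flatten the repeating group: one output row per (client, rental) pair."""
    rows = []
    for record in records:
        for rental in record["rentals"]:
            row = {"clientNo": record["clientNo"], "cName": record["cName"]}
            row.update(zip(RENTAL_FIELDS, rental))
            rows.append(row)
    return rows


FLAT = to_1nf(UNNORMALIZED)
for row in FLAT:
    print(row["clientNo"], row["propertyNo"], row["cName"], row["rentStart"])
```

Every flattened row has the same nine attributes, and (clientNo, propertyNo) is unique across the five rows.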
2NF
  A relation that is in first normal form and in which every non-primary-key attribute is fully functionally dependent upon the primary key.
  I.e., eliminate non-identifier attributes that are not functionally dependent upon the whole of the identifier
  E.g., clientNo, propertyNo -> pAddress
    pAddress is functionally dependent upon a subset of (clientNo, propertyNo), namely propertyNo
    Partial dependency
2NF (2)
  Dependencies:
    Fd1: clientNo, propertyNo -> rentStart, rentFinish (Primary key)
    Fd2: clientNo -> cName (Partial dependency)
    Fd3: propertyNo -> pAddress, rent, ownerNo, oName (Partial dependency)
    Fd4: ownerNo -> oName (Transitive dependency)
    Fd5: clientNo, rentStart -> propertyNo, pAddress, rentFinish, rent, ownerNo, oName (Candidate key)
    Fd6: propertyNo, rentStart -> clientNo, cName, rentFinish (Candidate key)
2NF (3)
Client
clientNo cName
CR76 John Kay
CR56 Aline Stewart
Rental
clientNo PropertyNo rentStart rentFinish
CR76 PG4 1-Jul-00 31-Aug-01
CR76 PG16 1-Sep-01 1-Sep-02
CR56 PG4 1-Sep-99 10-Jun-00
CR56 PG36 10-Oct-00 1-Dec-01
CR56 PG16 1-Nov-02 10-Aug-03
PropertyOwner
PropertyNo pAddress rent ownerNo oName
PG4 6 Lawrence st, Glasgow 350 CO40 Tina Murphy
PG16 5 Novar Dr, Glasgow 450 CO93 Tony Shaw
PG36 2 Manor Rd, Glasgow 375 CO93 Tony Shaw
3NF
  A relation that is in first and second normal form, and in which no non-primary-key attribute is transitively dependent on the primary key.
  Eliminate functional dependencies between non-key attributes
  propertyNo -> ownerNo -> oName
3NF (2)
Client
clientNo cName
CR76 John Kay
CR56 Aline Stewart
Rental
clientNo propertyNo rentStart rentFinish
CR76 PG4 1-Jul-00 31-Aug-01
CR76 PG16 1-Sep-01 1-Sep-02
CR56 PG4 1-Sep-99 10-Jun-00
CR56 PG36 10-Oct-00 1-Dec-01
CR56 PG16 1-Nov-02 10-Aug-03
PropertyForRent
PropertyNo pAddress rent ownerNo
PG4 6 Lawrence st, Glasgow 350 CO40
PG16 5 Novar Dr, Glasgow 450 CO93
PG36 2 Manor Rd, Glasgow 375 CO93
Owner
ownerNo oName
CO40 Tina Murphy
CO93 Tony Shaw
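
The step from PropertyOwner to PropertyForRent and Owner is a pair of projections. The sketch below is one way to express it; the helper project is hypothetical, and joining the two results on ownerNo gives back the original rows.

```python
COLUMNS = ("propertyNo", "pAddress", "rent", "ownerNo", "oName")
PROPERTY_OWNER = [
    ("PG4", "6 Lawrence st, Glasgow", 350, "CO40", "Tina Murphy"),
    ("PG16", "5 Novar Dr, Glasgow", 450, "CO93", "Tony Shaw"),
    ("PG36", "2 Manor Rd, Glasgow", 375, "CO93", "Tony Shaw"),
]


def project(rows, columns, keep):
    """Relational projection: keep the named columns and drop duplicate rows."""
    indexes = [columns.index(name) for name in keep]
    result = []
    for row in rows:
        projected = tuple(row[i] for i in indexes)
        if projected not in result:  # duplicates collapse; first-seen order kept
            result.append(projected)
    return result


# Remove the transitive dependency propertyNo -> ownerNo -> oName.
PROPERTY_FOR_RENT = project(
    PROPERTY_OWNER, COLUMNS, ("propertyNo", "pAddress", "rent", "ownerNo")
)
OWNER = project(PROPERTY_OWNER, COLUMNS, ("ownerNo", "oName"))

# Joining on ownerNo reconstructs the 2NF relation without loss.
owner_name = dict(OWNER)
REJOINED = [row + (owner_name[row[3]],) for row in PROPERTY_FOR_RENT]

print(OWNER)
print(REJOINED == PROPERTY_OWNER)
```

Tony Shaw's name is now stored once in Owner instead of once per property.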
Relationships
Normalization
  Advantages
    Single updates
    Redundancy limited
    Reduced errors
  Disadvantages
    Complex schemas
    Complex queries
    Not process-oriented
    Multiple joins lead to poor query performance
    Difficult to index
Dimensional Modeling
  Also relational but denormalized structure
  Divided into fact tables and dimension tables
  Fact tables
    Generally numeric data, the specific values of the transaction
    E.g., number of products ordered, price paid, etc.
  Dimension tables
    Descriptive information, provides the context of the transaction
    E.g., date, customer, product, location, etc.
  Often called star schemas because of their structure
Star Schema
  Star join schema, or star schema
  Old concept
    One of the oldest ER-schemas
  Easier to understand
    Reduced number of tables
    Meaningful descriptors
    Few joins
Snowflake Schema
  Star schema, but dimensions normalized to a certain degree
    E.g., zip codes, packaging codes, etc.
  Gives snowflake-like structure
  Saves space and simplifies updating
  However:
    Increases complexity
    Decreases performance
    Space saved is marginal compared to size of fact table
Fact Tables
  Granularity
    Transaction, periodic snapshot, or accumulating snapshot
  One row is a measurement
    Same granularity!!
    Intersection of the dimensions (product, time, sales point)
  Should be additive (summarizable)
    E.g., dollar amounts, number of products
    Rarely look at individual rows
  Read-only data
Fact Tables (2)
  Text belongs in the dimension tables
    Unless unique for each transaction
  Often 90% or more of the database
  No null entries if no transactions
  Few columns but many rows
Fact Tables (3)
  Foreign keys
    Two or more
    Connect to primary keys in the dimension tables
    E.g., product keys
    Referential integrity
  Own primary key
    Consists of a subset of the foreign keys
    Composite key
    Avoid unique rowID if possible, size constraints
  Many-to-many relationship = fact table
Dimension Tables
  Textual descriptors of business
  Many columns (attributes)
    Describe the rows in the fact table
    May have 50-100 columns!
  Few rows
  Single primary key
  Textual and discrete
Dimension Tables (2)
  Serve as query constraints, groupings and report labels (dimensional attributes)
    The by-words
    E.g., dollar sales by week by brand
  Key to making the data warehouse or data mart understandable and useful
    Spend time on the dimensions!!
Dimension Tables (3)
  Attributes: real words, not abbreviations
    Not operational codes
  Surrogate keys
    Branch ID instead of Branch No
    Buffer towards operational changes
    Multiple, conflicting sources for operational codes
    Shorter
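
Surrogate key assignment can be as simple as a lookup table that hands out the next integer for each new operational code. The sketch below illustrates the idea; the class name SurrogateKeyMap and the source system names payroll and billing are hypothetical.

```python
class SurrogateKeyMap:
    """Maps (source system, operational code) pairs to small integer keys."""

    def __init__(self):
        self._keys = {}

    def key_for(self, source, natural_key):
        """Return the existing surrogate key, or assign the next integer."""
        pair = (source, natural_key)
        if pair not in self._keys:
            self._keys[pair] = len(self._keys) + 1
        return self._keys[pair]


branch_keys = SurrogateKeyMap()
k1 = branch_keys.key_for("payroll", "B005")
k2 = branch_keys.key_for("payroll", "B003")
# The same code arriving from another source system gets its own key,
# which is how conflicting operational codes are kept apart.
k3 = branch_keys.key_for("billing", "B005")
# Looking up a known pair returns the key assigned earlier.
k4 = branch_keys.key_for("payroll", "B005")

print(k1, k2, k3, k4)
```

The warehouse key stays stable even if the operational system later renumbers its branches; only the mapping needs to change.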
Dimension Tables (4)
  Attributes
    Split meaningful operational codes into separate attributes (i.e., groupings)
      E.g., line of business, region, etc.
    Hierarchical relationships as attributes
      E.g., products -> brands -> categories, store -> region -> country
  Redundancy
    Problem?
    Dimension tables usually less than 10% of database size
    Normalizing has little effect
    Snowflake schema
  Heavily indexed
Dimensional Modeling Usage
  No predefined entry point
    All dimensions are equal
  See as a report
    Dimensions provide labeling
    Facts provide numeric values
  One normalized, enterprise-wide ER-model breaks down into several dimensional models
    One business process or department
    Beware of too many dimensions!! Size!
  Conformed dimensions
    Same dimensions for other fact tables
Designing Dimensional Models
1. Select the business process to model
   E.g., purchasing, orders, inventory, etc.
2. Select the granularity
   E.g., a single line on a retail receipt
3. Select the dimensions
   How do business people describe the data from the business process?
4. Identify the facts
   What are we measuring?
Source: Kimball et al. 2002
Example of Dimensional Modeling
  Retail case study from Kimball et al. 2002.
  Grocery store chain
    100 stores in 5 states
    Each store has a number of departments
      E.g., frozen foods, dairy, meat, etc.
    Roughly 60,000 items in each store
  Data captured by POS-system (point-of-sales)
    Bar codes (UPCs, universal product codes)
    Called SKUs (stock keeping units)
  Promotions
1. Business process: Sales
2. Granularity: Single line item from POS
3. Dimensions: Date, product, store, promotion
4. Facts: Sales quantity, dollar amount, cost dollar amount, gross profit dollar amount
Identifying the Dimensions
POS Retail Sales Fact
  Date Key (FK)
  Product Key (FK)
  Store Key (FK)
  Promotion Key (FK)
  POS Transaction no.
  ...
Date Dimension
  Date Key (PK)
  ...
Product Dimension
  Product Key (PK)
  ...
Store Dimension
  Store Key (PK)
  ...
Promotion Dimension
  Promotion Key (PK)
  ...
Populating the Fact Table
POS Retail Sales Fact
  Date Key (FK)
  Product Key (FK)
  Store Key (FK)
  Promotion Key (FK)
  POS Transaction no.
  Sales Quantity
  Sales Dollar Amount
  Cost Dollar Amount
  Gross Profit Dollar Amount
Date Dimension
  Date Key (PK)
  ...
Product Dimension
  Product Key (PK)
  ...
Store Dimension
  Store Key (PK)
  ...
Promotion Dimension
  Promotion Key (PK)
  ...
Note that the facts are additive across all dimensions
Gross profit is calculated, benefits and trade-offs
A ratio, gross margin, would not have been additive
The same applies to unit price
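
The difference between an additive fact and a ratio shows up in a few lines of arithmetic. The line items below are hypothetical figures chosen for the illustration, not data from the case study.

```python
# Hypothetical line items: (product, sales dollar amount, cost dollar amount).
LINE_ITEMS = [
    ("milk", 10.0, 8.0),
    ("bread", 30.0, 15.0),
    ("steak", 60.0, 45.0),
]


def gross_profit(sales, cost):
    return sales - cost


def gross_margin(sales, cost):
    return (sales - cost) / sales


total_sales = sum(sales for _, sales, _ in LINE_ITEMS)
total_cost = sum(cost for _, _, cost in LINE_ITEMS)

# Additive fact: summing the stored row values gives the correct total.
profit_sum_of_rows = sum(gross_profit(s, c) for _, s, c in LINE_ITEMS)
profit_of_totals = gross_profit(total_sales, total_cost)

# Non-additive ratio: summing per-row margins is meaningless, so the margin
# is computed from the summed additive facts instead.
margin_sum_of_rows = sum(gross_margin(s, c) for _, s, c in LINE_ITEMS)
margin_of_totals = gross_margin(total_sales, total_cost)

print(profit_sum_of_rows, profit_of_totals)
print(margin_sum_of_rows, margin_of_totals)
```

This is why the fact table stores the dollar amounts and leaves ratios such as gross margin to be derived at query time from the summed components.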
Product Dimension
  Product Key (PK)
  Product Description
  SKU Number (Natural Key)
  Brand Description
  Category Description
  Department Description
  Package Style Description
  Packaging Size
  Fat Content
  Diet Type
  Weight
  Weight Units of Measure
  Storage Type
  Shelf Life Type
  Shelf Width
  Shelf Height
  Shelf Depth
  Etc.
POS Retail Sales Fact
  Date Key (FK)
  Product Key (FK)
  Store Key (FK)
  Promotion Key (FK)
  POS Transaction no.
  Sales Quantity
  Sales Dollar Amount
  Cost Dollar Amount
  Gross Profit Dollar Amount
Date Dimension
Store Dimension
  Store Key (PK)
  ...
Promotion Dimension
  Promotion Key (PK)
  ...
Units roll up into brands
Brands roll up into categories
Categories roll up into departments
Redundancy, but not a problem
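
Because every level of the hierarchy is stored as a plain attribute on each product row, rolling up to any level is the same operation with a different attribute name. The sketch below uses hypothetical products, brands, and amounts; the function name roll_up is also an assumption.

```python
from collections import defaultdict

# Hypothetical product dimension rows, keyed by surrogate product key.
PRODUCT_DIM = {
    1: {"brand": "Frosty Scoop", "category": "Ice Cream", "department": "Frozen Foods"},
    2: {"brand": "Polar Pie", "category": "Frozen Pizza", "department": "Frozen Foods"},
    3: {"brand": "Dairyland", "category": "Milk", "department": "Dairy"},
    4: {"brand": "Dairyland", "category": "Milk", "department": "Dairy"},
}
# Hypothetical fact rows: (product key, sales dollar amount).
SALES_FACTS = [(1, 4.0), (2, 5.0), (3, 2.5), (4, 1.5), (1, 6.0)]


def roll_up(facts, dimension, attribute):
    """Sum the sales amounts grouped by one attribute of the product dimension."""
    totals = defaultdict(float)
    for product_key, amount in facts:
        totals[dimension[product_key][attribute]] += amount
    return dict(totals)


by_brand = roll_up(SALES_FACTS, PRODUCT_DIM, "brand")
by_category = roll_up(SALES_FACTS, PRODUCT_DIM, "category")
by_department = roll_up(SALES_FACTS, PRODUCT_DIM, "department")

print(by_brand)
print(by_category)
print(by_department)
```

The department name is repeated on every product row, which is the redundancy the slide refers to; in exchange no extra join is needed to group at any level.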
Simple Retail Dimensional Model
POS Retail Sales Fact
  Date Key (FK)
  Product Key (FK)
  Store Key (FK)
  Promotion Key (FK)
  POS Transaction no.
  Sales Quantity
  Sales Dollar Amount
  Cost Dollar Amount
  Gross Profit Dollar Amount
Date Dimension
  Date Key (PK)
  Date
  Full Date Description
  Day of Week
  Day Number in Epoch
  Week Number in Epoch
  Month Number in Epoch
  Etc.
Store Dimension
  Store Key (PK)
  Store Name
  Store Number (Natural Key)
  Store Street Address
  Store City
  Store Country
  Store State
  Etc.
Product Dimension
  Product Key (PK)
  Product Description
  SKU Number (Natural Key)
  Brand Description
  Category Description
  Department Description
  Package Style Description
  Etc.
Promotion Dimension
  Promotion Key (PK)
  Promotion Name
  Price Reduction Type
  Promotion Media Type
  Ad Type
  Display Type
  Coupon Type
  Etc.
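
A cut-down version of this model can be built and queried with Python's built-in sqlite3 module. The sketch below is illustrative only: the table and column names are adapted from the diagram, the sample rows are hypothetical, and the choice of (POS transaction no., product key) as the fact table's primary key is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE date_dim (
    date_key INTEGER PRIMARY KEY, full_date TEXT, day_of_week TEXT);
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, sku_number TEXT,
    product_description TEXT, brand_description TEXT);
CREATE TABLE store_dim (
    store_key INTEGER PRIMARY KEY, store_name TEXT, store_city TEXT);
CREATE TABLE promotion_dim (
    promotion_key INTEGER PRIMARY KEY, promotion_name TEXT);
CREATE TABLE pos_retail_sales_fact (
    date_key INTEGER NOT NULL REFERENCES date_dim(date_key),
    product_key INTEGER NOT NULL REFERENCES product_dim(product_key),
    store_key INTEGER NOT NULL REFERENCES store_dim(store_key),
    promotion_key INTEGER NOT NULL REFERENCES promotion_dim(promotion_key),
    pos_transaction_no INTEGER NOT NULL,
    sales_quantity INTEGER,
    sales_dollar_amount REAL,
    cost_dollar_amount REAL,
    gross_profit_dollar_amount REAL,
    PRIMARY KEY (pos_transaction_no, product_key)  -- assumed composite key
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?)", [
    (1, "2002-01-05", "Saturday"), (2, "2002-01-06", "Sunday")])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?)", [
    (1, "SKU-001", "Vanilla ice cream 1L", "Frosty Scoop"),
    (2, "SKU-002", "Whole milk 1L", "Dairyland")])
conn.execute("INSERT INTO store_dim VALUES (1, 'Store 1', 'Springfield')")
conn.execute("INSERT INTO promotion_dim VALUES (1, 'No promotion')")
conn.executemany(
    "INSERT INTO pos_retail_sales_fact VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", [
        (1, 1, 1, 1, 1001, 2, 8.0, 5.0, 3.0),
        (1, 2, 1, 1, 1001, 1, 1.5, 1.0, 0.5),
        (2, 1, 1, 1, 1002, 1, 4.0, 2.5, 1.5),
        (2, 2, 1, 1, 1003, 2, 3.0, 2.0, 1.0),
        (2, 2, 1, 1, 1004, 1, 1.5, 1.0, 0.5),
    ])

# Dollar sales by day of week by brand: one join per dimension used.
SALES_BY_DAY_BRAND = conn.execute("""
    SELECT d.day_of_week, p.brand_description, SUM(f.sales_dollar_amount)
    FROM pos_retail_sales_fact f
    JOIN date_dim d ON d.date_key = f.date_key
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY d.day_of_week, p.brand_description
    ORDER BY d.day_of_week, p.brand_description
""").fetchall()

for row in SALES_BY_DAY_BRAND:
    print(row)
```

The dimension attributes supply the grouping and the labels, while the fact table supplies the numbers being summed.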
Data Warehouse Design
1. Business requirements analysis
2. Data design
3. Architecture design
4. Implementation
5. Deployment
Source: Sen and Sinha, 2005
Data Warehouse Design (1): Business Requirements Analysis
  Identify the business questions needing answers
  Prioritization of the questions
  User involvement!!
  Conceptual model (high-level), blueprint for the requirements of the organization
Data Warehouse Design (3): Architecture Design
  Overall schema for the data warehouse
  Several approaches:
    Top-down,
    Bottom-up, or
    Mixed
  Different design philosophies:
    Enterprise-wide data warehouse design
    Data mart design
Data Warehouse Design (4): Implementation
  Data sourcing
  ETL
  User applications
  Two important things:
    Data quality management
    Meta data management
Data Warehouse Design (5): Deployment
  Solution integration
  Data warehouse tuning
  Data warehouse maintenance
    One of the leading causes of data warehousing failures!!
Enterprise Wide Warehouse
  First construct an enterprise-wide data warehouse, then departmental data marts
  Corporate Information Factory
  Top-down approach
  Data-driven approach
    Not systems development life cycle (SDLC), or waterfall approach
    Starting point: data, not requirements
    Data are gathered, integrated, and tested
    Programs are written against the data and results analyzed
    Requirements are formulated
  Iterative approach
    Often called reversed SDLC, or CLDS
  Normalized
Enterprise Wide Warehouse Levels of Architecture
1. Operational
   Detailed, day-to-day, current operational data
2. Data Warehouse
   Granular, time-variant, integrated, subject-oriented, some summarization
3. Departmental (Data Mart)
   Summarized, departmental needs
4. Individual
   Resides on PCs, temporary, ad-hoc
[Figure: Operational -> Atomic/data warehouse -> Departmental data mart -> Individual]
Disadvantages with Enterprise Wide Warehouses
  Data normalized, requires data marts for efficient access
  Expensive to build
  Takes a long time
  Centralized development
  Data-driven approach
    Risk that the data marts are not used
  Commitment of the organization
    Especially in organizations new to business analytics
Data Mart Approach
  Conforming data marts
    Subsets of data warehouse
    Data warehouse bus
  Business dimensional lifecycle
    Based upon SDLC
    Focus on analytic business requirements by executives / managers
  Dimensional approach
    Granularity
  Data marts
    Business processes
    Not enterprise-wide approach
      Not practical or possible
  Spiral (prototyping)
  Normalized only in the staging area
Data Mart Approach
[Figure: Operational data sources (three shown) -> ETL -> Data Mart 1, Data Mart 2, Data Mart 3, connected by the Data Warehouse Bus (conformed facts and attributes) -> Data Access Tools]
Advantages with the Data Mart Approach
  Ease of creation
  Phased development
  Limits budget requirements
  Learn from mistakes
  Clearly defined user group
Problems with the Data Mart Approach
  Overlap -> redundant data
  Inconsistency of data
    Lack of integration of data
    Different answers from different departments
  Data marts are built according to the requirements of one department, unit, or subdivision
    Not according to corporate requirements
    Different granularity, degree of summarization, key structure, amount of historical data, etc.
Problems with the Data Mart Approach (2)
  Technical scalability
    Own hardware and software suitable for corporate-wide implementation?
  Interfaces between data marts
    Becomes a considerable burden as the number of data marts increases
  Budget issues
    Departments not willing to use own funds for corporate-wide considerations
  Risk of creating stovepipe applications
  Criticism: Bill Inmon heavily criticizes the data mart approach as being something sold by vendors who are only interested in making sales!
Considerations for the Data Mart Approach
  Kimball: Bottom-up, but with a data warehouse bus (skeletal frame)
  Not for summary data only
    Low granularity!!
  Dimensional schemas
  Should not be based upon functional organization, but on business processes
  Not multiple extracts from the same source
    No inconsistencies
  Scalability not necessarily an issue
    Data marts are already huge -> scalability inherent
References
  Connolly and Begg, Database Systems, 2002
  Inmon, Building the Data Warehouse, 2002
  Kimball and Ross, The Data Warehouse Toolkit, 2002
  Sen and Sinha, A Comparison of Data Warehousing Methodologies, Communications of the ACM, 48(3), pp. 79-84, 2005
  Gardner, Building the Data Warehouse, Communications of the ACM, 41(9), pp. 52-60, 1998
  Chenoweth, Corral, and Demirkan, Seven Key Interventions for Data Warehouse Success, Communications of the ACM, 49(1), pp. 115-119, 2006