Is Data Warehouse

Date post: 02-Jun-2018
Upload: vishnuselva

Transcript
  • 8/10/2019 Is Data Warehouse

    1/63

    Data Warehousing

    Spring 2009, 2010

    Overview
        ER and Normalization
        Dimensional Modeling
            DM Compared to ER-models
            Fact tables
            Dimension tables
            Retail example
        Database Design
            Enterprise-wide data warehouse
            Data mart approach

    Entity-Relationship Modeling
        Based on the relational model proposed by E.F. Codd in 1970 (IBM)
            12 Rules for Relational DBMS
        Collection of relations (tables)
        Normalization
        Grouped by subject areas
            Customer, Product, Finance, etc.

    Normalization
        Purpose
            Reduce redundancy
            Reduce errors
            Simplify updates
        Series of rules to test relationships in the data
            First (1NF), second (2NF), and third (3NF) normal forms proposed by E.F. Codd in 1972
            Boyce-Codd normal form (BCNF), 1974
        Database designer's oath (William Kent, 1983)
            "Every attribute must provide a fact about the key, the whole key, and nothing but the key, so help me Codd"

    Normalization: An Example

    StaffBranch

    staffNo  sName        position    salary  branchNo  bAddress
    SL21     John White   Manager     30000   B005      22 Deer Rd, London
    SG37     Ann Beech    Assistant   12000   B003      163 Main St, Glasgow
    SG14     David Ford   Supervisor  18000   B003      163 Main St, Glasgow
    SA9      Mary Howe    Assistant   9000    B007      16 Argyll St, Aberdeen
    SG5      Susan Brand  Manager     24000   B003      163 Main St, Glasgow
    SL41     Julie Lee    Assistant   9000    B005      22 Deer Rd, London

    Problems with StaffBranch
        Insertion problems
            Cannot add a new branch if no staff is associated
            Details repeated for each staff member
            Potential inconsistencies and errors
        Deletion problems
            Delete the last member of a branch
            The branch is lost (e.g., SA9 -> B007 is lost)
        Update problems
            E.g., change the address of one branch
            Must be done for all entries

    Relationships / Dependencies
        One-to-one relationship (denoted 1:1)
            Functional dependencies
                For one attribute there can only be one value for the other attribute
                E.g., staffNo and position (1:1): one staff number can be associated with only one position
                staffNo unambiguously determines position
                position is functionally dependent upon staffNo
                staffNo -> position
            Transitive dependencies
                If A -> B and B -> C, then A -> C
                E.g., staffNo -> branchNo and branchNo -> bAddress, then staffNo -> bAddress
        One-to-many relationship (1:*)
            Not functionally dependent
            E.g., position and staffNo (1:*): one position can be associated with several staffNo values

    Example Case Dependencies
        staffNo -> sName, position, salary, branchNo, bAddress
        branchNo -> bAddress
        bAddress -> branchNo
        branchNo, position -> salary
        bAddress, position -> salary
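
    The dependencies in the StaffBranch example can be checked against the sample rows. The Python sketch below is an illustration added to the transcript, not part of the original slides; the helper name holds is an assumption. A sample can only refute a functional dependency, it cannot prove that the dependency holds for all possible data.

```python
# Sample rows of the StaffBranch relation from the normalization example.
COLUMNS = ("staffNo", "sName", "position", "salary", "branchNo", "bAddress")
STAFF_BRANCH = [
    dict(zip(COLUMNS, row))
    for row in [
        ("SL21", "John White", "Manager", 30000, "B005", "22 Deer Rd, London"),
        ("SG37", "Ann Beech", "Assistant", 12000, "B003", "163 Main St, Glasgow"),
        ("SG14", "David Ford", "Supervisor", 18000, "B003", "163 Main St, Glasgow"),
        ("SA9", "Mary Howe", "Assistant", 9000, "B007", "16 Argyll St, Aberdeen"),
        ("SG5", "Susan Brand", "Manager", 24000, "B003", "163 Main St, Glasgow"),
        ("SL41", "Julie Lee", "Assistant", 9000, "B005", "22 Deer Rd, London"),
    ]
]


def holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


print(holds(STAFF_BRANCH, ["staffNo"], ["position"]))
print(holds(STAFF_BRANCH, ["branchNo", "position"], ["salary"]))
print(holds(STAFF_BRANCH, ["position"], ["staffNo"]))
```

    In this sample the first two checks succeed and the third fails, because two staff members share the position Manager.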

    Better?

    Staff

    staffNo  sName        position    salary  branchNo
    SL21     John White   Manager     30000   B005
    SG37     Ann Beech    Assistant   12000   B003
    SG14     David Ford   Supervisor  18000   B003
    SA9      Mary Howe    Assistant   9000    B007
    SG5      Susan Brand  Manager     24000   B003
    SL41     Julie Lee    Assistant   9000    B005

    Branch

    branchNo  bAddress
    B005      22 Deer Rd, London
    B007      16 Argyll St, Aberdeen
    B003      163 Main St, Glasgow

    Redundancy?
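
    A minimal sketch of the Staff/Branch split, using Python's built-in sqlite3 module with an in-memory database. It is an added illustration, not the slide author's code; branch B010 and the new Glasgow address are made-up values. It revisits the three StaffBranch problems: insertion, update, and deletion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Branch (
        branchNo TEXT PRIMARY KEY,
        bAddress TEXT NOT NULL
    );
    CREATE TABLE Staff (
        staffNo  TEXT PRIMARY KEY,
        sName    TEXT NOT NULL,
        position TEXT NOT NULL,
        salary   INTEGER NOT NULL,
        branchNo TEXT NOT NULL REFERENCES Branch(branchNo)
    );
""")
conn.executemany("INSERT INTO Branch VALUES (?, ?)", [
    ("B005", "22 Deer Rd, London"),
    ("B007", "16 Argyll St, Aberdeen"),
    ("B003", "163 Main St, Glasgow"),
])
conn.executemany("INSERT INTO Staff VALUES (?, ?, ?, ?, ?)", [
    ("SL21", "John White", "Manager", 30000, "B005"),
    ("SG37", "Ann Beech", "Assistant", 12000, "B003"),
    ("SG14", "David Ford", "Supervisor", 18000, "B003"),
    ("SA9", "Mary Howe", "Assistant", 9000, "B007"),
    ("SG5", "Susan Brand", "Manager", 24000, "B003"),
    ("SL41", "Julie Lee", "Assistant", 9000, "B005"),
])

# Insertion: a branch can exist before any staff is assigned (B010 is hypothetical).
conn.execute("INSERT INTO Branch VALUES ('B010', '1 Example St, Leeds')")

# Update: the address is stored once, so exactly one row changes.
cursor = conn.execute(
    "UPDATE Branch SET bAddress = ? WHERE branchNo = ?",
    ("200 Main St, Glasgow", "B003"),
)
rows_updated = cursor.rowcount

# Deletion: removing the last staff member of B007 keeps the branch row.
conn.execute("DELETE FROM Staff WHERE staffNo = 'SA9'")
branch_count = conn.execute("SELECT COUNT(*) FROM Branch").fetchone()[0]

# A join restores the combined view; every Glasgow row sees the new address.
glasgow_staff = conn.execute("""
    SELECT s.staffNo, b.bAddress
    FROM Staff AS s
    JOIN Branch AS b ON b.branchNo = s.branchNo
    WHERE b.branchNo = 'B003'
    ORDER BY s.staffNo
""").fetchall()
print(rows_updated, branch_count, glasgow_staff)
```

    The price of the split is the join needed to read staff together with their branch address, which is the trade-off the later slides on dimensional modeling return to.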

    Keys
        Candidate keys
            Attribute, or group of attributes (composite key), that uniquely identifies each tuple
        Primary key
            Candidate key selected to identify tuples uniquely within a relation
        Foreign key
            Attribute, or set of attributes, within a relation that matches the candidate key of some relation

    Un-normalized Example

    clientNo  cName          propertyNo  pAddress                rentStart  rentFinish  rent  ownerNo  oName
    CR76      John Kay       PG4         6 Lawrence St, Glasgow  1-Jul-00   31-Aug-01   350   CO40     Tina Murphy
                             PG16        5 Novar Dr, Glasgow     1-Sep-01   1-Sep-02    450   CO93     Tony Shaw
    CR56      Aline Stewart  PG4         6 Lawrence St, Glasgow  1-Sep-99   10-Jun-00   350   CO40     Tina Murphy
                             PG36        2 Manor Rd, Glasgow     10-Oct-00  1-Dec-01    375   CO93     Tony Shaw
                             PG16        5 Novar Dr, Glasgow     1-Nov-02   10-Aug-03   450   CO93     Tony Shaw

    Example from Connolly and Begg, 2002, pp. 388-397

    1NF
        A relation in which the intersection of each row and column contains one and only one value
            I.e., all rows must have an equal number of columns (and vice versa); repeating groups are eliminated
        (clientNo, propertyNo) becomes the new primary key

    clientNo  propertyNo  cName          pAddress                rentStart  rentFinish  rent  ownerNo  oName
    CR76      PG4         John Kay       6 Lawrence St, Glasgow  1-Jul-00   31-Aug-01   350   CO40     Tina Murphy
    CR76      PG16        John Kay       5 Novar Dr, Glasgow     1-Sep-01   1-Sep-02    450   CO93     Tony Shaw
    CR56      PG4         Aline Stewart  6 Lawrence St, Glasgow  1-Sep-99   10-Jun-00   350   CO40     Tina Murphy
    CR56      PG36        Aline Stewart  2 Manor Rd, Glasgow     10-Oct-00  1-Dec-01    375   CO93     Tony Shaw
    CR56      PG16        Aline Stewart  5 Novar Dr, Glasgow     1-Nov-02   10-Aug-03   450   CO93     Tony Shaw
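
    The step from the un-normalized table to 1NF amounts to flattening the repeating group of rentals. A short Python sketch, added for illustration: the nested dictionary layout and the function name to_first_normal_form are assumptions, and only a few attributes are carried to keep it brief.

```python
# Un-normalized data: each client carries a repeating group of rentals.
unnormalized = [
    {
        "clientNo": "CR76",
        "cName": "John Kay",
        "rentals": [
            {"propertyNo": "PG4", "rentStart": "1-Jul-00", "rent": 350},
            {"propertyNo": "PG16", "rentStart": "1-Sep-01", "rent": 450},
        ],
    },
    {
        "clientNo": "CR56",
        "cName": "Aline Stewart",
        "rentals": [
            {"propertyNo": "PG4", "rentStart": "1-Sep-99", "rent": 350},
            {"propertyNo": "PG36", "rentStart": "10-Oct-00", "rent": 375},
            {"propertyNo": "PG16", "rentStart": "1-Nov-02", "rent": 450},
        ],
    },
]


def to_first_normal_form(clients):
    """Emit one flat row per (client, rental) pair, repeating the client data."""
    rows = []
    for client in clients:
        for rental in client["rentals"]:
            rows.append({
                "clientNo": client["clientNo"],
                "cName": client["cName"],
                **rental,
            })
    return rows


flat_rows = to_first_normal_form(unnormalized)
# (clientNo, propertyNo) identifies each flat row in this sample.
keys = [(row["clientNo"], row["propertyNo"]) for row in flat_rows]
print(len(flat_rows), len(set(keys)))
```

    Each flat row now has exactly one value per column, at the cost of repeating cName on every rental of the same client.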

    2NF
        A relation that is in first normal form and in which every non-primary-key attribute is fully functionally dependent upon the primary key
            I.e., eliminate non-identifier attributes that are not functionally dependent upon the whole of the identifier
        E.g., clientNo, propertyNo -> pAddress
            pAddress is functionally dependent upon a subset of (clientNo, propertyNo), namely propertyNo
            Partial dependency

    2NF (2)
        Dependencies:
            Fd1: clientNo, propertyNo -> rentStart, rentFinish (Primary key)
            Fd2: clientNo -> cName (Partial dependency)
            Fd3: propertyNo -> pAddress, rent, ownerNo, oName (Partial dependency)
            Fd4: ownerNo -> oName (Transitive dependency)
            Fd5: clientNo, rentStart -> propertyNo, pAddress, rentFinish, rent, ownerNo, oName (Candidate key)
            Fd6: propertyNo, rentStart -> clientNo, cName, rentFinish (Candidate key)

    2NF (3)

    Client

    clientNo  cName
    CR76      John Kay
    CR56      Aline Stewart

    Rental

    clientNo  propertyNo  rentStart  rentFinish
    CR76      PG4         1-Jul-00   31-Aug-01
    CR76      PG16        1-Sep-01   1-Sep-02
    CR56      PG4         1-Sep-99   10-Jun-00
    CR56      PG36        10-Oct-00  1-Dec-01
    CR56      PG16        1-Nov-02   10-Aug-03

    PropertyOwner

    propertyNo  pAddress                rent  ownerNo  oName
    PG4         6 Lawrence St, Glasgow  350   CO40     Tina Murphy
    PG16        5 Novar Dr, Glasgow     450   CO93     Tony Shaw
    PG36        2 Manor Rd, Glasgow     375   CO93     Tony Shaw

    3NF
        A relation that is in first and second normal form, and in which no non-primary-key attribute is transitively dependent on the primary key
        Eliminate functional dependencies between non-key attributes
            propertyNo -> ownerNo -> oName

    3NF (2)

    Client

    clientNo  cName
    CR76      John Kay
    CR56      Aline Stewart

    Rental

    clientNo  propertyNo  rentStart  rentFinish
    CR76      PG4         1-Jul-00   31-Aug-01
    CR76      PG16        1-Sep-01   1-Sep-02
    CR56      PG4         1-Sep-99   10-Jun-00
    CR56      PG36        10-Oct-00  1-Dec-01
    CR56      PG16        1-Nov-02   10-Aug-03

    PropertyForRent

    propertyNo  pAddress                rent  ownerNo
    PG4         6 Lawrence St, Glasgow  350   CO40
    PG16        5 Novar Dr, Glasgow     450   CO93
    PG36        2 Manor Rd, Glasgow     375   CO93

    Owner

    ownerNo  oName
    CO40     Tina Murphy
    CO93     Tony Shaw
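
    The 3NF relations are projections of the 1NF table, and joining them back recovers the original rows (a lossless decomposition). The Python sketch below is an added illustration; the helper names project and natural_join are assumptions, not taken from the slides or from Connolly and Begg.

```python
COLUMNS = ("clientNo", "propertyNo", "cName", "pAddress",
           "rentStart", "rentFinish", "rent", "ownerNo", "oName")
ROWS = [
    ("CR76", "PG4", "John Kay", "6 Lawrence St, Glasgow", "1-Jul-00", "31-Aug-01", 350, "CO40", "Tina Murphy"),
    ("CR76", "PG16", "John Kay", "5 Novar Dr, Glasgow", "1-Sep-01", "1-Sep-02", 450, "CO93", "Tony Shaw"),
    ("CR56", "PG4", "Aline Stewart", "6 Lawrence St, Glasgow", "1-Sep-99", "10-Jun-00", 350, "CO40", "Tina Murphy"),
    ("CR56", "PG36", "Aline Stewart", "2 Manor Rd, Glasgow", "10-Oct-00", "1-Dec-01", 375, "CO93", "Tony Shaw"),
    ("CR56", "PG16", "Aline Stewart", "5 Novar Dr, Glasgow", "1-Nov-02", "10-Aug-03", 450, "CO93", "Tony Shaw"),
]
first_nf = [dict(zip(COLUMNS, row)) for row in ROWS]


def project(rows, attrs):
    """Keep only attrs from each row and drop duplicate rows."""
    result = []
    for row in rows:
        projected = {a: row[a] for a in attrs}
        if projected not in result:
            result.append(projected)
    return result


def natural_join(left, right):
    """Combine rows that agree on every attribute the two relations share."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[a] == r_row[a] for a in shared):
                joined.append({**l_row, **r_row})
    return joined


client = project(first_nf, ["clientNo", "cName"])
rental = project(first_nf, ["clientNo", "propertyNo", "rentStart", "rentFinish"])
property_for_rent = project(first_nf, ["propertyNo", "pAddress", "rent", "ownerNo"])
owner = project(first_nf, ["ownerNo", "oName"])

# Joining the four relations back together reproduces the 1NF rows.
rebuilt = natural_join(natural_join(natural_join(rental, client), property_for_rent), owner)
print(len(client), len(rental), len(property_for_rent), len(owner), len(rebuilt))
```

    The row counts match the four tables above: each client, property, and owner is stored once, while Rental keeps one row per rental.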

    Relationships

    Normalization
        Advantages
            Single updates
            Redundancy limited
            Reduced errors
        Disadvantages
            Complex schemas
            Complex queries
            Not process-oriented
            Multiple joins lead to poor query performance
            Difficult to index

    Dimensional Modeling
        Also relational, but a denormalized structure
        Divided into fact tables and dimension tables
        Fact tables
            Generally numeric data, the specific values of the transaction
            E.g., number of products ordered, price paid, etc.
        Dimension tables
            Descriptive information, provides the context of the transaction
            E.g., date, customer, product, location, etc.
        Often called star schemas because of their structure

    Star Schema
        Star join schema, or star schema
            Old concept
            One of the oldest ER-schemas
        Easier to understand
            Reduced number of tables
            Meaningful descriptors
            Few joins

    Snowflake Schema
        Star schema, but dimensions normalized to a certain degree
            E.g., zip codes, packaging codes, etc.
            Gives snowflake-like structure
        Saves space and simplifies updating
        However:
            Increases complexity
            Decreases performance
            Space saved is marginal compared to size of fact table

    Fact Tables
        Granularity
            Transaction, periodic snapshot, or accumulating snapshot
        One row is a measurement
            Same granularity!!
            Intersection of the dimensions (product, time, sales point)
        Should be additive (summarizable)
            E.g., dollar amounts, number of products
            Rarely look at individual rows
        Read-only data

    Fact Tables (2)
        Text belongs in the dimension tables
            Unless unique for each transaction
        Often 90% or more of the database
        No null entries if no transactions
        Few columns but many rows

    Fact Tables (3)
        Foreign keys
            Two or more
            Connect to primary keys in the dimension tables
            E.g., product keys
            Referential integrity
        Own primary key
            Consists of a subset of the foreign keys
            Composite key
            Avoid unique rowID if possible, size constraints
        Many-to-many relationship = fact table

    Dimension Tables
        Textual descriptors of business
        Many columns (attributes)
            Describe the rows in the fact table
            May have 50-100 columns!
        Few rows
        Single primary key
        Textual and discrete

    Dimension Tables (2)
        Serve as query constraints, groupings and report labels (dimensional attributes)
        The "by" words
            E.g., dollar sales by week by brand
        Key to making the data warehouse or data mart understandable and useful
            Spend time on the dimensions!!

    Dimension Tables (3)
        Attributes: real words, not abbreviations
            Not operational codes
        Surrogate keys
            Branch ID instead of Branch No
            Buffer against operational changes
            Multiple, conflicting sources for operational codes
            Shorter
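
    A minimal sketch of surrogate key assignment, added for illustration. The class name SurrogateKeyGenerator and the source names sales_system and hr_system are hypothetical; the point is only that the warehouse key is a meaningless integer that stays stable even when two operational sources reuse the same code.

```python
class SurrogateKeyGenerator:
    """Assign a stable integer key to each (source system, operational code) pair."""

    def __init__(self):
        self._keys = {}

    def key_for(self, source, natural_key):
        pair = (source, natural_key)
        if pair not in self._keys:
            # Next unused integer; it carries no business meaning.
            self._keys[pair] = len(self._keys) + 1
        return self._keys[pair]


branch_keys = SurrogateKeyGenerator()
# Two hypothetical source systems that both use the code B005.
key_a = branch_keys.key_for("sales_system", "B005")
key_b = branch_keys.key_for("hr_system", "B005")
# Asking again for a known pair returns the key assigned earlier.
key_c = branch_keys.key_for("sales_system", "B005")
print(key_a, key_b, key_c)
```

    In a real warehouse the mapping would be persisted, typically in the dimension table itself, rather than held in memory.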

    Dimension Tables (4)
        Attributes
            Split meaningful operational codes into separate attributes (i.e., groupings)
                E.g., line of business, region, etc.
            Hierarchical relationships as attributes
                E.g., products -> brands -> categories, store -> region -> country
        Redundancy
            Problem?
            Dimension tables usually less than 10% of database size
            Normalizing has little effect
            Snowflake schema
        Heavily indexed

    Dimensional Modeling Usage
        No predefined entry point
            All dimensions are equal
        See as a report
            Dimensions provide labeling
            Facts provide numeric values
        One normalized, enterprise-wide ER-model breaks down into several dimensional models
            One business process or department
        Beware of too many dimensions!!
            Size!
        Conformed dimensions
            Same dimensions for other fact tables

    Designing Dimensional Models
        1. Select the business process to model
            E.g., purchasing, orders, inventory, etc.
        2. Select the granularity
            E.g., a single line on a retail receipt
        3. Select the dimensions
            How do business people describe the data from the business process?
        4. Identify the facts
            What are we measuring?

    Source: Kimball et al. 2002

    Example of Dimensional Modeling
        Retail case study from Kimball et al. 2002
        Grocery store chain
            100 stores in 5 states
            Each store has a number of departments
                E.g., frozen foods, dairy, meat, etc.
            Roughly 60,000 items in each store
        Data captured by POS system (point of sale)
            Bar codes (UPCs, universal product codes)
            Called SKUs (stock keeping units)
        Promotions

    1. Business process: Sales
    2. Granularity: Single line item from POS
    3. Dimensions: Date, product, store, promotion
    4. Facts: Sales quantity, dollar amount, cost dollar amount, gross profit dollar amount

    Identifying the Dimensions

    POS Retail Sales Fact
        Date Key (FK)
        Product Key (FK)
        Store Key (FK)
        Promotion Key (FK)
        POS Transaction no.
        ...
    Date Dimension
        Date Key (PK)
        ...
    Product Dimension
        Product Key (PK)
        ...
    Store Dimension
        Store Key (PK)
        ...
    Promotion Dimension
        Promotion Key (PK)
        ...

    Populating the Fact Table

    POS Retail Sales Fact
        Date Key (FK)
        Product Key (FK)
        Store Key (FK)
        Promotion Key (FK)
        POS Transaction no.
        Sales Quantity
        Sales Dollar Amount
        Cost Dollar Amount
        Gross Profit Dollar Amount
    Date Dimension
        Date Key (PK)
        ...
    Product Dimension
        Product Key (PK)
        ...
    Store Dimension
        Store Key (PK)
        ...
    Promotion Dimension
        Promotion Key (PK)
        ...

    Notes
        The facts are additive across all dimensions
        Gross profit is calculated: benefits and trade-offs
        A ratio, gross margin, would not have been additive
        The same applies to unit price
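
    Why a ratio such as gross margin is not additive can be seen with two fact rows. The Python sketch below is an added illustration with invented dollar amounts: summing the additive facts and then dividing gives the margin of the combined rows, while summing the per-row ratios does not.

```python
# Two hypothetical fact rows (values invented for the illustration).
facts = [
    {"sales_dollar_amount": 100.0, "cost_dollar_amount": 60.0},
    {"sales_dollar_amount": 300.0, "cost_dollar_amount": 270.0},
]
for row in facts:
    # Gross profit is calculated from the other two facts and is itself additive.
    row["gross_profit_dollar_amount"] = row["sales_dollar_amount"] - row["cost_dollar_amount"]
    # Gross margin is a ratio and is only meaningful for its own row.
    row["gross_margin"] = row["gross_profit_dollar_amount"] / row["sales_dollar_amount"]

total_sales = sum(row["sales_dollar_amount"] for row in facts)
total_profit = sum(row["gross_profit_dollar_amount"] for row in facts)

# Correct: sum the additive facts first, then take the ratio.
true_margin = total_profit / total_sales
# Incorrect: adding (or averaging) the stored ratios gives a different number.
summed_margin = sum(row["gross_margin"] for row in facts)
mean_margin = summed_margin / len(facts)
print(true_margin, summed_margin, mean_margin)
```

    Storing the additive components (sales, cost, profit) and computing ratios at query time avoids this problem.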

    Product Dimension
        Product Key (PK)
        Product Description
        SKU Number (Natural Key)
        Brand Description
        Category Description
        Department Description
        Package Style Description
        Packaging Size
        Fat Content
        Diet Type
        Weight
        Weight Units of Measure
        Storage Type
        Shelf Life Type
        Shelf Width
        Shelf Height
        Shelf Depth
        Etc.
    POS Retail Sales Fact
        Date Key (FK)
        Product Key (FK)
        Store Key (FK)
        Promotion Key (FK)
        POS Transaction no.
        Sales Quantity
        Sales Dollar Amount
        Cost Dollar Amount
        Gross Profit Dollar Amount
    Date Dimension
    Store Dimension
        Store Key (PK)
        ...
    Promotion Dimension
        Promotion Key (PK)
        ...

    Notes
        Units roll up into brands
        Brands roll up into categories
        Categories roll up into departments
        Redundancy, but not a problem
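
    Because brand, category, and department are stored as plain attributes on every product row, rolling up is a matter of grouping by a different column. A small Python sketch, added for illustration; the product rows, amounts, and the function name roll_up are made up.

```python
from collections import defaultdict

# Hypothetical product dimension rows, keyed by surrogate product key.
# Brand, category, and department are repeated on each row (denormalized).
product_dim = {
    1: {"brand": "DairyBest", "category": "Milk", "department": "Dairy"},
    2: {"brand": "FarmFresh", "category": "Milk", "department": "Dairy"},
    3: {"brand": "CheddarCo", "category": "Cheese", "department": "Dairy"},
    4: {"brand": "GreenField", "category": "Vegetables", "department": "Frozen Foods"},
}
# Hypothetical fact rows: (product_key, sales_dollar_amount).
sales_facts = [(1, 4.0), (2, 3.0), (3, 6.0), (4, 2.5), (1, 2.0)]


def roll_up(level):
    """Total the sales dollar amount by one attribute of the product dimension."""
    totals = defaultdict(float)
    for product_key, amount in sales_facts:
        totals[product_dim[product_key][level]] += amount
    return dict(totals)


by_brand = roll_up("brand")
by_category = roll_up("category")
by_department = roll_up("department")
print(by_brand, by_category, by_department)
```

    No extra lookup tables are needed to move between levels, which is the benefit bought with the redundancy noted above.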

    Simple Retail Dimensional Model

    POS Retail Sales Fact
        Date Key (FK)
        Product Key (FK)
        Store Key (FK)
        Promotion Key (FK)
        POS Transaction no.
        Sales Quantity
        Sales Dollar Amount
        Cost Dollar Amount
        Gross Profit Dollar Amount
    Date Dimension
        Date Key (PK)
        Date
        Full Date Description
        Day of Week
        Day Number in Epoch
        Week Number in Epoch
        Month Number in Epoch
        Etc.
    Store Dimension
        Store Key (PK)
        Store Name
        Store Number (Natural Key)
        Store Street Address
        Store City
        Store Country
        Store State
        Etc.
    Product Dimension
        Product Key (PK)
        Product Description
        SKU Number (Natural Key)
        Brand Description
        Category Description
        Department Description
        Package Style Description
        Etc.
    Promotion Dimension
        Promotion Key (PK)
        Promotion Name
        Price Reduction Type
        Promotion Media Type
        Ad Type
        Display Type
        Coupon Type
        Etc.
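
    The retail model can be sketched as a small star schema in SQLite through Python's built-in sqlite3 module. This is an added illustration, not Kimball's reference implementation: only a few attributes per dimension are kept, the sample rows are invented, and the choice of (POS transaction no., product key) as the fact table's primary key is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT, day_of_week TEXT);
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, sku_number TEXT,
                              product_description TEXT, brand_description TEXT);
    CREATE TABLE store_dim (store_key INTEGER PRIMARY KEY, store_name TEXT, store_city TEXT);
    CREATE TABLE promotion_dim (promotion_key INTEGER PRIMARY KEY, promotion_name TEXT);
    CREATE TABLE sales_fact (
        date_key      INTEGER NOT NULL REFERENCES date_dim(date_key),
        product_key   INTEGER NOT NULL REFERENCES product_dim(product_key),
        store_key     INTEGER NOT NULL REFERENCES store_dim(store_key),
        promotion_key INTEGER NOT NULL REFERENCES promotion_dim(promotion_key),
        pos_transaction_no INTEGER NOT NULL,
        sales_quantity INTEGER,
        sales_dollar_amount REAL,
        cost_dollar_amount REAL,
        gross_profit_dollar_amount REAL,
        PRIMARY KEY (pos_transaction_no, product_key)
    );
""")
# Invented sample rows.
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?)",
                 [(1, "2009-01-05", "Monday"), (2, "2009-01-06", "Tuesday")])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?)", [
    (1, "SKU-001", "Whole milk 1L", "DairyBest"),
    (2, "SKU-002", "Skim milk 1L", "DairyBest"),
    (3, "SKU-003", "Frozen peas", "GreenField"),
])
conn.execute("INSERT INTO store_dim VALUES (1, 'Store 1', 'Springfield')")
conn.execute("INSERT INTO promotion_dim VALUES (1, 'No promotion')")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 1001, 2, 4.0, 3.0, 1.0),
    (1, 3, 1, 1, 1001, 1, 2.5, 1.5, 1.0),
    (2, 2, 1, 1, 1002, 3, 6.0, 4.5, 1.5),
    (2, 1, 1, 1, 1003, 1, 2.0, 1.5, 0.5),
])

# "Dollar sales by day by brand": dimensions supply the labels, the fact the numbers.
sales_by_day_and_brand = conn.execute("""
    SELECT d.day_of_week, p.brand_description, SUM(f.sales_dollar_amount)
    FROM sales_fact AS f
    JOIN date_dim AS d ON d.date_key = f.date_key
    JOIN product_dim AS p ON p.product_key = f.product_key
    GROUP BY d.date_key, d.day_of_week, p.brand_description
    ORDER BY d.date_key, p.brand_description
""").fetchall()
print(sales_by_day_and_brand)
```

    Every query follows the same shape: join the fact table to the dimensions of interest, constrain and group by dimension attributes, and aggregate the additive facts.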

    Data Warehouse Design
        1. Business requirements analysis
        2. Data design
        3. Architecture design
        4. Implementation
        5. Deployment

    Source: Sen and Sinha, 2005

    Data Warehouse Design (1): Business Requirements Analysis
        Identify the business questions needing answers
            Prioritization of the questions
            User involvement!!
        Conceptual model (high-level), blueprint for the requirements of the organization

    Data Warehouse Design (3): Architecture Design
        Overall schema for the data warehouse
        Several approaches:
            Top-down,
            Bottom-up, or
            Mixed
        Different design philosophies:
            Enterprise-wide data warehouse design
            Data mart design

    Data Warehouse Design (4): Implementation
        Data sourcing
            ETL
        User applications
        Two important things:
            Data quality management
            Metadata management

    Data Warehouse Design (5): Deployment
        Solution integration
        Data warehouse tuning
        Data warehouse maintenance
            One of the leading causes of data warehousing failures!!

    Enterprise Wide Warehouse
        First construct an enterprise-wide data warehouse, then departmental data marts
            Corporate Information Factory
            Top-down approach
        Data-driven approach
            Not systems development life cycle (SDLC), or waterfall approach
            Starting point: data, not requirements
                Data are gathered, integrated, and tested
                Programs are written against the data and results analyzed
                Requirements are formulated
            Iterative approach
            Often called reversed SDLC, or CLDS
        Normalized

    Enterprise Wide Warehouse Levels of Architecture
        1. Operational
            Detailed, day-to-day, current operational data
        2. Data Warehouse
            Granular, time-variant, integrated, subject-oriented, some summarization
        3. Departmental (Data Mart)
            Summarized, departmental needs
        4. Individual
            Resides on PCs, temporary, ad hoc

    [Diagram: Operational -> Atomic/data warehouse -> Departmental data mart -> Individual]

    Disadvantages with Enterprise Wide Warehouses
        Data normalized, requires data marts for efficient access
        Expensive to build
        Takes a long time
        Centralized development
        Data-driven approach
            Risk that the data marts are not used
        Commitment of the organization
            Especially in organizations new to business analytics

    Data Mart Approach
        Conforming data marts
            Subsets of data warehouse
            Data warehouse bus
        Business dimensional lifecycle
            Based upon SDLC
            Focus on analytic business requirements by executives / managers
        Dimensional approach
            Granularity
            Data marts
            Business processes
        Not enterprise-wide approach
            Not practical or possible
        Spiral (prototyping)
        Normalized only in the staging area

    Data Mart Approach

    [Diagram: three operational data sources feed ETL, which loads Data Mart 1, Data Mart 2, and Data Mart 3; the data marts are linked by the Data Warehouse Bus (conformed facts and attributes) and are reached through Data Access Tools]

    Advantages with the Data Mart Approach
        Ease of creation
        Phased development
            Limits budget requirements
            Learn from mistakes
        Clearly defined user group

    Problems with the Data Mart Approach
        Overlap -> redundant data
        Inconsistency of data
        Lack of integration of data
            Different answers from different departments
            Data marts are built according to the requirements of one department, unit, or subdivision
                Not according to corporate requirements
            Different granularity, degree of summarization, key structure, amount of historical data, etc.

    Problems with the Data Mart Approach (2)
        Technical scalability
            Own hardware and software suitable for corporate-wide implementation?
        Interfaces between data marts
            Becomes a considerable burden as the number of data marts increases
        Budget issues
            Departments not willing to use own funds for corporate-wide considerations
        Risk of creating stovepipe applications
        Criticism:
            Bill Inmon heavily criticizes the data mart approach as being something sold by vendors who are only interested in making sales!

    Considerations for the Data Mart Approach
        Kimball: Bottom-up, but with a data warehouse bus (skeletal frame)
        Not for summary data only
            Low granularity!!
        Dimensional schemas
        Should not be based upon functional organization, but on business processes
        Not multiple extracts from the same source
            No inconsistencies
        Scalability not necessarily an issue
            Data marts are already huge -> scalability inherent

    References
        Connolly and Begg, Database Systems, 2002
        Inmon, Building the Data Warehouse, 2002
        Kimball and Ross, The Data Warehouse Toolkit, 2002
        Sen and Sinha, A Comparison of Data Warehousing Methodologies, Communications of the ACM, 48(3), pp. 79-84, 2005
        Gardner, Building the Data Warehouse, Communications of the ACM, 41(9), pp. 52-60, 1998
        Chenoweth, Corral, and Demirkan, Seven Key Interventions for Data Warehouse Success, Communications of the ACM, 49(1), pp. 115-119, 2006

