Data Warehousing Concepts and Design
Introduction & Ground Rules
Objectives
Data Warehousing Concepts
• What is Business Intelligence (BI)?
• Evolution of BI
• Characteristics of an OLTP system
• Why is OLTP not suitable for complex analysis?
• Characteristics of a Data Warehouse
• Define DWH and its properties: Subject Oriented, Integrated, Time Variant, Non-Volatile
• Define Grain/Granularity
• Differentiate between OLTP and Data Warehouse
• User expectations and user community
• Enterprise Data Warehouse
• Data Warehouse versus Data Marts
• Dependent Data Marts
• Independent Data Marts
• Data Warehouse components: Source systems, Staging area, Presentation area, Access tools
Objectives
Data Warehousing Concepts
• Goals of a Data Warehouse
• Data Warehouse development approaches: Top-down, Bottom-up, Hybrid, Federated
• Incremental approach to warehouse development
• Dimensional Modeling
• Star Schema: Fact and Dimension tables
• Dimensions and measure objects
• Snowflake Schema
• Types of Fact tables
• Factless Fact table
• OLAP storage modes: MOLAP, ROLAP, HOLAP, DOLAP
• Slowly and Rapidly Changing Dimensions: Type I, II, III
• Degenerate Dimension
• Junk Dimension
• Case studies
What is Business Intelligence (BI)?
“Business Intelligence (BI) is the process of transforming data into information, information into knowledge and through iterative discoveries turning knowledge into Intelligence.”
— Gartner Group
Objective of Business Intelligence
Value
Volume
Intelligence
Knowledge
Information
Data
BI can be defined as making decisions based on data. The objective of BI is to transform large volumes of data into useful information.
Evolution of BI
– Executive Information Systems (EIS)
– Management Information Systems (MIS)
– Decision Support Systems (DSS)
– Business Intelligence (BI)
EIS
MIS
DSS
BI
Information
Information in an organization can exist in two different types of systems:
– Online Transaction Processing (OLTP) systems (Operational Systems)
– Data Warehouse (DWH) systems
OLTP and DWH systems serve different purposes, business needs, and users.
Features of OLTP Systems
OLTP systems handle day-to-day transactions and operations of the business. They are high performance, high throughput systems. They run mission critical applications.
OLTP systems store, update and retrieve Operational Data. Operational Data is the data that runs the business.
Some of the Operational systems that we interact with are Net Banking system, Tax Accounting system, Payroll package, Order-processing system, SAP, Airline reservation system etc.
Why are OLTP systems not suitable for analysis?
OLTP | Analytical Reporting
Supports day-to-day operations | Requires historical information for analysis
Data stored at transaction level | Data required at summary level
Islands of operational systems | Data needs to be integrated
Database design: normalized | Database design: dimensional
OLTP Versus Data Warehouse
Property | OLTP | Data Warehouse
Response Time | Sub-seconds to seconds | Seconds to hours
Operations | DML; data goes in | Primarily read-only; data goes out
Age of Data | Current (30-60 days to 1-2 years) | Historical (snapshots over time: quarter, month, etc.)
Data Organization | Application | Subject, time
Size | Small to large (a few MB to GB) | Large to very large (a few GB to TB)
OLTP Versus Data Warehouse
Property | OLTP | Data Warehouse
Data Sources | Operational, internal | Operational, internal, external
Activities | Processes | Analysis
No. of Records | One record at a time | Thousands to millions of records
Grain | Atomic (detail), transaction level, highest granularity | Atomic and/or summarized (aggregate), lower granularity
Database Design | Normalized | De-normalized, star schema
Data Extract Processing
A logical progression towards a data warehouse – Data Extracts
– End-user computing offloaded from the operational environment
– Users' own data
Operational systems → Extracts → Decision makers
Issues with Data Extract Programs
Operational systems → Extracts → Decision makers
Extract Explosion
Data Quality Issues with Extract Processing
– No common time basis
– Different calculation algorithms
– Different levels of extraction
– Different levels of granularity
– Different data field names
– Different data field meanings
– Missing information
– No data correction rules
– No metadata
– No drill-down capability
Data Warehousing and Business Intelligence
Advances Enabling Data Warehousing
Technology
– Hardware
– Operating system
– Database
– BI tools & applications
Business
– Competition
Definition of a Data Warehouse
“A data warehouse is a subject oriented, integrated, non-volatile,
and time-variant collection of data to support management decisions.”
— Bill Inmon
Data Warehouse Properties
Integrated
Time-variant
Nonvolatile
Subject-oriented
Data Warehouse
Subject-Oriented
• Data is categorized and stored by business subject rather than by application.
OLTP Applications
Equity Plans
Shares
Insurance
Loans
Savings
Data Warehouse
Subject
Customer financial information
Integrated
• Data on a given subject is collected from various sources and stored once.
OLTP Applications
Customer
Savings
Current Accounts
Loans
Data Warehouse
Time-Variant
• Data is stored as a series of snapshots, each representing a period of time.
Non-volatile
• Typically data in the data warehouse is not updated or deleted.
Warehouse
Read
Load
Operational
Insert, Update, Delete, or Read
Changing Warehouse Data
Operational Databases Warehouse Database
First time load
Refresh
Refresh
Refresh
Purge or Archive
Goals of a Data Warehouse
• The Data Warehouse must assist in decision making process
• The Data Warehouse must meet the requirements of the business community
• The Data Warehouse must provide easy access to information
• The Data Warehouse must present information consistently and accurately
• The Data Warehouse must be adaptive and resilient to change
• The Data Warehouse must provide a secured access to information
Usage Curves
– Operational system usage is predictable
– Data warehouse usage is variable and random
User Expectations
– Control expectations
– Set achievable targets for query response
– Set SLAs
– Educate business and end users
– Growth and use are exponential
Enterprisewide Data Warehouse
– Large-scale implementation
– Scopes the entire business
– Data from all subject areas
– Developed incrementally
– Single source of enterprise-wide data
– Synchronized enterprise-wide data
– Single distribution point to dependent data marts
Data Warehouse Vocabulary
– Grain of Data - Granularity
Grain is defined as the level of detail of the data captured in the data warehouse. The more detail, the higher the granularity, and vice versa.
– Fact table
It is similar to a transaction table in an OLTP system. It stores the facts, or measures, of the business, e.g. SALES, ORDERS.
– Dimension table
It is similar to a master table in an OLTP system. It stores the textual descriptors of the business, e.g. CUSTOMER, PRODUCT.
Data Marts
• A data mart is a subset of the data warehouse.
• A data mart is designed for a single line of business (LOB) or functional area such as sales, finance, or marketing.
Data Warehouses Versus Data Marts
Property | Data Warehouse | Data Mart
Scope | Enterprise | Department
Subjects | Multiple | Single subject, LOB
Data Sources | Many | Few
Implementation Time | Months to years | Months
Size | 100 GB to > 1 TB | < 100 GB
Initial Effort, Cost, Risk | Higher | Lower
Next Level of Migration | Data Mart | Data Warehouse
Approach | Top-down | Bottom-up
Dependent Data Mart
Data Warehouse
Data Marts
Flat Files
Marketing
Sales
Finance
Marketing
Sales
Finance
HR
Operational Systems
External Data
Operations Data
Legacy Data
External Data
Independent Data Mart
Sales or Marketing
Flat Files
Operational Systems
External Data
Operations Data
Legacy Data
External Data
Warehouse Development Approaches
• Top-down approach(Big-Bang)
• Bottom-up approach
• Hybrid approach(Combination)
• Federated approach
Top-Down Approach
Build the Data Warehouse
Build the Data Marts
Top-Down Approach
Data Warehouse
Data Marts
Flat Files
Marketing
Sales
Finance
Marketing
Sales
Finance
HR
Operational Systems
External Data
Operations Data
Legacy Data
External Data
Bottom-Up Approach
Build Data Marts
Build the Data Warehouse
Bottom-Up Approach
Data Warehouse
Data Marts
Marketing
Sales
Finance
Operational Systems
External Data
Operations Data
Legacy Data
Hybrid Approach
The hybrid approach tries to blend the best of the "top-down" and "bottom-up" approaches.
– Start by designing the DW and DM models synchronously
– Build out the first 2-3 DMs that are mutually exclusive and critical
– Backfill a DW behind the DMs
– Build the enterprise model and move atomic data to the DW
Federated Approach
This approach is referred to as "an architecture of architectures".
It emphasizes the need to integrate new and existing heterogeneous BI environments.
Data Warehouse Components
Source Systems
Staging Area
Presentation Area
Access Tools
ODS
Operational
External
Legacy
Metadata Repository
Data Marts
Data Warehouse
Examining Data Sources
– Production– Archive– Internal– External
Production Data
– Operating system platforms– File systems– Database systems – Vertical applications
IMS
DB2
Oracle
Sybase
Informix
VSAM
SAP
Dun and Bradstreet Financials
Oracle Financials
Baan
PeopleSoft
Archive Data
– Historical data
– Useful for analysis over long periods of time
– Useful for the first-time load
Operation databases
Warehouse database
Internal Data
– Planning, sales, and marketing organization data
– Maintained in the form of:
  • Spreadsheets (structured)
  • Documents (unstructured)
– Treated like any other source data
Warehouse database
Planning
Accounting
Marketing
External Data
– Information from outside the organization
– Issues of frequency, format, and predictability
– Described and tracked using metadata
A.C. Nielsen, IRI, IMRB, ORG-MARG
Barron's
Dun and Bradstreet
Purchased databases
Wall Street Journal
Economic forecasts
Competitive information
Warehousingdatabases
Extraction, Transformation and Loading (ETL)
Extraction, Transformation and Loading (ETL)
• “Effective data extract, transform and load (ETL) processes represent the number one success factor for your data warehouse project and can absorb up to 70 percent of the time spent on a typical data warehousing project.”
— DM Review, March 2001
Source → Staging Area → Target
Staging Models
• Remote staging model
• Onsite staging model
Remote Staging Model
– Data staging area within the warehouse environment
– Data staging area in its own independent environment

In both variants the flow is: Operational system → Extract → Transform (staging area) → Load → Warehouse
On-site Staging Model
• Data staging area within the operational environment, possibly affecting the operational system
Operational system → Extract → Transform (staging area) → Load → Warehouse
Extraction Methods
– Logical extraction methods:
  • Full extraction
  • Incremental extraction
Extraction Methods
– Physical extraction methods:
  • Online extraction
  • Offline extraction
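As a minimal sketch of the full-versus-incremental distinction, incremental extraction pulls only rows changed since the last run, typically by filtering on a last-modified timestamp (the table and column names below are illustrative):

```python
from datetime import datetime

# Illustrative source rows carrying a last-modified timestamp.
orders = [
    {"order_id": 1, "amount": 250.0, "modified": datetime(2024, 1, 10)},
    {"order_id": 2, "amount": 990.0, "modified": datetime(2024, 2, 3)},
    {"order_id": 3, "amount": 120.0, "modified": datetime(2024, 2, 20)},
]

def full_extract(rows):
    """Full extraction: take every row on every run."""
    return list(rows)

def incremental_extract(rows, last_run):
    """Incremental extraction: only rows modified after the last run."""
    return [r for r in rows if r["modified"] > last_run]

print(len(full_extract(orders)))                               # all 3 rows
print(len(incremental_extract(orders, datetime(2024, 2, 1))))  # 2 changed rows
```

In practice the `last_run` watermark is persisted between loads; a real change-data-capture setup would use database logs rather than timestamps.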
ETL Techniques
– Programs: C, C++, COBOL, PL/SQL, Java
– Gateways: Transparent Database Access
– Tools:
  • In-house developed tools
  • Vendor ETL tools (the ideal technique)
Mapping Data
• Mapping data defines:
  – Which operational attributes to use
  – How to transform the attributes for the warehouse
  – Where the attributes exist in the warehouse
Metadata
File A                  Staging File One
F1  123                 Number  USA123
F2  Bloggs              Name    Mr. Bloggs
F3  10/12/56            DOB     10-Dec-56
Transformation Routines
– Cleaning data
– Eliminating inconsistencies
– Adding elements
– Merging data
– Integrating data
– Transforming data before load
Transforming Data: Problems and Solutions
– Data anomalies
– Multipart keys
– Multiple local standards
– Multiple files
– Missing values
– Duplicate values
– Element names
– Element meanings
– Input formats
– Referential integrity constraints
– Name and address
Data Anomalies
– No unique key
– Data naming and coding anomalies
– Data meaning anomalies between groups
– Spelling and text inconsistencies
CUSNUM NAME ADDRESS
90233479 Oracle Limited 100 N.E. 1st St.
90233489 Oracle Computing 15 Main Road, Ft. Lauderdale
90234889 Oracle Corp. UK 15 Main Road, Ft. Lauderdale, FLA
90345672 Oracle Corp UK Ltd 181 North Street, Key West, FLA
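A minimal cleaning pass for such anomalies might normalize the name field before matching. The rules below are purely illustrative (a hard-coded suffix list), not a complete matching algorithm:

```python
import re

def normalize_name(name):
    """Crude name standardization: uppercase, strip punctuation,
    and drop common corporate suffixes so variants compare equal."""
    n = name.upper()
    n = re.sub(r"[.,]", "", n)
    # Illustrative suffix list; a real rule set would be data-driven.
    for suffix in (" LIMITED", " COMPUTING", " CORP UK LTD",
                   " CORP UK", " LTD", " CORP", " INC"):
        if n.endswith(suffix):
            n = n[: -len(suffix)]
    return n.strip()

names = ["Oracle Limited", "Oracle Computing",
         "Oracle Corp. UK", "Oracle Corp UK Ltd"]
print({normalize_name(n) for n in names})  # all four collapse to {'ORACLE'}
```

Production-grade deduplication would add fuzzy matching and address comparison on top of this kind of standardization.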
Multipart Keys Problem
• Multipart keys
Country code
Sales territory
Productnumber
Salesperson code
Product code = 12 M 654313 45
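During transformation such a composite code is split into its named parts. A sketch in Python, with fixed field widths assumed from the example code `12 M 654313 45`:

```python
def split_product_code(code):
    """Split a multipart key into named components (fixed widths assumed)."""
    return {
        "country_code": code[0:2],       # e.g. '12'
        "sales_territory": code[2:3],    # e.g. 'M'
        "product_number": code[3:9],     # e.g. '654313'
        "salesperson_code": code[9:11],  # e.g. '45'
    }

parts = split_product_code("12M65431345")
print(parts)
```

Each component can then feed its own dimension attribute instead of remaining buried inside one opaque key.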
Multiple Local Standards Problem
– Multiple local standards
– Tools or filters to preprocess
Units: cm, inches
Currencies: USD 600; 1,000 GBP; FF 9,990
Date formats: DD/MM/YY, MM/DD/YY, DD-Mon-YY
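A preprocessing filter for the date-format problem can try each known local format and emit one warehouse standard. A sketch, assuming ISO dates as the target and the three formats listed above as the sources:

```python
from datetime import datetime

# Formats seen in the sources; the warehouse standard here is ISO (assumed).
SOURCE_FORMATS = ["%d/%m/%y", "%m/%d/%Y", "%d-%b-%y"]

def to_iso_date(raw):
    """Try each known local format and emit one standard representation."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

print(to_iso_date("25/12/23"))    # DD/MM/YY
print(to_iso_date("12/25/2023"))  # MM/DD/YYYY
print(to_iso_date("25-Dec-23"))   # DD-Mon-YY
```

Note that ambiguous values such as 04/05/23 cannot be resolved by format trial alone; the source system's convention must be recorded in metadata. Unit and currency standardization follows the same pattern with conversion tables.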
Multiple Source Files Problem
– Added complexity of multiple source files
Multiple source files
Logic to detect the correct source
Transformed data
Missing Values Problem
• Solutions:
  – Ignore
  – Wait
  – Mark rows
  – Extract when time-stamped

Example rule: if NULL then field = 'A'
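The "mark rows" option can be sketched as filling an agreed default into NULL fields during transformation (the field name and default value below are illustrative):

```python
DEFAULT_STATUS = "A"  # assumed business default for a missing status

def fill_missing(rows, field, default):
    """Mark rows whose field is NULL (None) with an agreed default value."""
    for row in rows:
        if row.get(field) is None:
            row[field] = default
    return rows

rows = [{"id": 1, "status": None}, {"id": 2, "status": "B"}]
fill_missing(rows, "status", DEFAULT_STATUS)
print(rows)  # row 1 now carries the default status 'A'
```

The chosen default should itself be documented in metadata so analysts can distinguish real values from marked ones.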
Duplicate Values Problem
• Solutions:
  – SQL self-join techniques
  – RDBMS constraint utilities
ACME Inc
ACME Inc
ACME Inc
SELECT ...
FROM table_a, table_b
WHERE table_a.key (+) = table_b.key
UNION
SELECT ...
FROM table_a, table_b
WHERE table_a.key = table_b.key (+);
Element Names Problem
• Solution:
  – Common naming conventions
Customer
Customer
Client
Contact
Name
Element Meaning Problem
– Avoid misinterpretation
– Complex solution
– Document meaning in metadata
Product number
p_no
Purchase order number Policy number
Input Format Problem
ASCII vs. EBCDIC
12373 vs. "123-73"
ACME Co.
ברוכים הבאים Beer (Pack of 8)
• Different character sets or data-types
Referential Integrity Problem
• Solutions:
  – SQL anti-join (outer join)
  – Server constraints
  – Dedicated tools
Department table: 10, 20, 30, 40

Emp | Name | Department
1099 | Smith | 10
1289 | Jones | 20
1234 | Doe | 50
6786 | Harris | 60
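The anti-join idea can be sketched in Python: find fact rows whose foreign key has no matching parent in the dimension (the employee/department data mirrors the example above):

```python
departments = {10, 20, 30, 40}  # valid dimension keys

employees = [
    {"emp": 1099, "name": "Smith", "dept": 10},
    {"emp": 1289, "name": "Jones", "dept": 20},
    {"emp": 1234, "name": "Doe", "dept": 50},
    {"emp": 6786, "name": "Harris", "dept": 60},
]

def orphans(rows, valid_keys):
    """Anti-join: rows whose foreign key has no parent in the dimension."""
    return [r for r in rows if r["dept"] not in valid_keys]

print([r["emp"] for r in orphans(employees, departments)])  # [1234, 6786]
```

Orphans like these (departments 50 and 60) must be rejected, routed to an error table, or mapped to a placeholder dimension row before loading.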
Name and Address Problem
– Single-field format
– Multiple-field format
Mr. J. Smith,100 Main St., Bigtown, County Luth, 23565
Database 1:
NAME | LOCATION
DIANNE ZIEFELD | N100
HARRY H. ENFIELD | M300

Database 2:
NAME | LOCATION
ZIEFELD, DIANNE | 100
ENFIELD, HARRY H | 300
Name Mr. J. Smith
Street 100 Main St.
Town Bigtown
County County Luth
Code 23565
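Converting the single-field format into the multiple-field format can be sketched as a comma split with assumed field order, as in the example above; real data needs dedicated name-and-address software:

```python
def split_address(single_field):
    """Split a comma-delimited single-field address into named parts.
    Assumes the fixed part order of the example; real addresses vary."""
    parts = [p.strip() for p in single_field.split(",")]
    keys = ["name", "street", "town", "county", "code"]
    return dict(zip(keys, parts))

addr = split_address("Mr. J. Smith, 100 Main St., Bigtown, County Luth, 23565")
print(addr["town"])  # Bigtown
print(addr["code"])  # 23565
```

Addresses with missing or extra components break the fixed-order assumption, which is exactly why name-and-address cleansing is usually bought rather than built.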
Transformation Timing and Location
– Transformation is performed:
  • Before load
  • In parallel while loading
– Can be initiated at different points:
  • On the operational platform
  • In a separate staging area
Adding a Date Stamp: Fact Tables and Dimensions
Sales Fact Table: Item_id, Store_id, Time_key, Sales_dollars, Sales_units
Item Table: Item_id, Dept_id, Time_key
Store Table: Store_id, District_id, Time_key
Time Table: Time_key, Week_id, Period_id, Year_id
Product Table: Product_id, Time_key, Product_desc
Summarizing Data
1. During extraction, in the staging area
2. After loading to the warehouse server
Operational databases → Staging area → Warehouse database
Loading Data into the Warehouse
– Loading moves the data into the warehouse
– Loading can be time-consuming:
  • Consider the load window
  • Schedule and automate the loading
– The initial load moves large volumes of data
– Subsequent refreshes move smaller volumes of data
Operational databases → Extract → Transform (staging area) → Transport, Load → Warehouse database
Load Window Requirements
– Time available for the entire ETL process
– Plan
– Test
– Prove
– Monitor
(Timeline: over a 24-hour day, load windows are scheduled outside the user access period)
Planning the Load Window
– Plan and build processes according to a strategy.
– Consider volumes of data.
– Identify technical infrastructure.
– Ensure currency of data.
– Consider user access requirements first.
– High-availability requirements may mean a small load window.
Initial Load and Refresh
• Initial load:
  – Single event that populates the database with historical data
  – Involves large volumes of data
  – Employs distinct ETL tasks
  – Involves large amounts of processing after load
• Refresh:
  – Performed according to a business cycle
  – Less data to load than the first-time load
  – Complex ETL tasks
  – Smaller amounts of post-load processing
Data Refresh Models
Extract processing environment:
– After each time interval, build a new snapshot of the database.
– Purge old snapshots.
T1 T2 T3
Operationaldatabases
Data Refresh Models
Warehouse environment:
– Build a new database the first time.
– After each time interval, add delta changes to the database.
– Archive or purge the oldest data.
T1 T2 T3
Operationaldatabases
Post-Processing of Loaded Data
Post-processing of loaded data:
– Create indexes
– Generate keys
– Summarize
– Filter

Flow: Extract → Transform (staging area) → Load → Warehouse
Unique Indexes
– Disable constraints before load.
– Enable constraints after load.
– Re-create indexes if necessary.
Load data
Disable constraints → Load data → Enable constraints → Create index
Catch errors → Reprocess
Creating Derived Keys
• The use of a derived key (sometimes referred to as a generalized, artificial, synthetic, surrogate, or warehouse key) is recommended to maintain the uniqueness of a row.
• Methods:
  – Concatenate keys
  – Assign a number sequentially from a list
109908 → 01109908 (concatenated)
109908 → 100 (assigned from a sequence)
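The sequential method can be sketched as a generator that remembers its mapping, so the same natural key always receives the same surrogate. The starting value of 100 below simply mirrors the example and is otherwise arbitrary:

```python
class SurrogateKeyGenerator:
    """Assigns warehouse (surrogate) keys from a sequence, remembering
    the mapping so the same natural key always gets the same surrogate."""

    def __init__(self, start=100):
        self.mapping = {}
        self.next_key = start

    def key_for(self, natural_key):
        if natural_key not in self.mapping:
            self.mapping[natural_key] = self.next_key
            self.next_key += 1
        return self.mapping[natural_key]

gen = SurrogateKeyGenerator()
print(gen.key_for("109908"))  # 100
print(gen.key_for("109908"))  # 100 again: the mapping is stable
print(gen.key_for("109909"))  # 101
```

In a real warehouse this mapping lives in a key-lookup table or a database sequence, not in process memory.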
Metadata repository
Metadata Users
End users
Developers
IT Professionals
Metadata Documentation Approaches
– Automated:
  • Data modeling tools
  • ETL tools
– Manual
Data Warehouse Design
Dimensional Modeling
I. Identify the ‘Business Process’
II. Determine the ‘Grain’
III. Identify the ‘Facts’
IV. Identify the ‘Dimensions’
Existing Metadata
Production ERD Model
Business Requirements
Research
Business Requirements Drive the Design Process
– Primary input
– Secondary input
Perform Strategic Analysis
– Identify crucial business processes
– Understand business processes
– Prioritize and select the business processes to implement
(Matrix: business processes plotted by Business Benefit vs. Feasibility, each from Low to High)
Using a Business Process Matrix
DW Bus Architecture
Business Dimensions
Business Processes: Sales, Returns, Inventory
Customer
Date
Product
Channel
Promotion
Conformed Dimensions
• Dimensions are conformed when they are exactly the same including the keys or one is a perfect subset of the other.
• DW bus architecture provides a standard set of conformed dimensions
Determine the Grain
YEAR?
QUARTER?
MONTH?
WEEK?
DAY?
Documenting the Granularity
• Is an important design consideration
• Determines the level of detail
• Is determined by business needs
Low-level grain (Transaction-level data)
High-level grain (Summary data)
Defining Time Granularity
Fiscal Time Hierarchy
Current dimension grain
Fiscal Year
Fiscal Quarter
Fiscal Month
Fiscal Week
Day
Future dimension grain
Identify the Facts and Dimensions
• An attribute perceived as constant or discrete becomes a Dimension:
  – Product
  – Location
  – Time
  – Size
• An attribute that varies continuously becomes a Fact (Measure):
  – Balance
  – Units Sold
  – Cost
  – Sales
Data Warehouse Environment Data Structures
The data structures that are commonly found in a data warehouse environment:
– Third normal form (3NF)– Star schema– Snowflake schema
Star Schema
Customer Location
Sales
Supplier Product
Star Schema Model
Product Table: Product_id, Product_desc, ...
Time Table: Day_id, Month_id, Year_id, ...
Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...
Item Table: Item_id, Item_desc, ...
Store Table: Store_id, District_id, ...

Central fact table with denormalized dimensions
Fact Table Characteristics
– Contain numerical metrics of the business
– Can hold large volumes of data
– Can grow quickly
– Can contain base, derived, and summarized data
– Are typically additive
– Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables

Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...
Dimension Table Characteristics
– Contain descriptors of the business: textual information that represents the attributes of the business
– Contain relatively static data
– Are usually smaller than fact tables
– Are joined to a fact table through a foreign key reference

Item Table: Item_id, Item_desc, ...
Advantages of Using a Star Dimensional Model
– Design improves performance by reducing table joins.
– The model is easy for users to understand.
– Supports multidimensional analysis.
– Provides an extensible design.
– Primary keys represent a dimension.
– Non-foreign-key columns are values.
– Facts are usually highly normalized.
– Dimensions are completely de-normalized.
– End users can express complex queries.
Base and Derived Data
Payroll table
Emp_FK | Month_FK | Salary (base) | Comm (base) | Comp (derived)
101 | 05 | 1,000 | 0 | 1,000
102 | 05 | 1,500 | 100 | 1,600
103 | 05 | 1,000 | 200 | 1,200
104 | 05 | 1,500 | 1,000 | 2,500
Translating Business Measures into a Fact Table
Business measures
Facts
Business Measure | Fact | Type
Number of Items | Number of Items | Base
Amount | Item Amount | Base
Cost | Item Cost | Base
Profit | Profit | Derived
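Computing a derived measure at load time can be sketched as below; the field names and Profit = Amount − Cost formula follow the example, and the row values are illustrative:

```python
# Base measures come from the source; Profit is derived during the load.
order_lines = [
    {"items": 3, "amount": 300.0, "cost": 210.0},
    {"items": 1, "amount": 120.0, "cost": 100.0},
]

for line in order_lines:
    line["profit"] = line["amount"] - line["cost"]  # derived measure

print([line["profit"] for line in order_lines])  # [90.0, 20.0]
```

Storing the derived value trades space for query simplicity and consistency: every user sees the same Profit rather than recomputing it ad hoc.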
Snowflake Schema Model
Sales Fact Table: Item_id, Store_id, Product_id, Week_id, Sales_amount, Sales_units
Time Table: Week_id, Period_id, Year_id
Product Table: Product_id, Product_desc
Item Table: Item_id, Item_desc, Dept_id
Dept Table: Dept_id, Dept_desc, Mgr_id
Mgr Table: Dept_id, Mgr_id, Mgr_name
Store Table: Store_id, Store_desc, District_id
District Table: District_id, District_desc
Snowflake Model
Order History fact table: History_PK, Customer_FK, Product_FK, Channel_FK, Item_nbr, Item_desc, Quantity, Discnt_price, Unit_price, Order_amt, ...
Customer dimension: Customer_PK, ...
Product dimension: Product_PK, ...
Channel dimension: Channel_PK, Channel_desc, Web_PK
Web outrigger: Web_PK, Web_url
Snowflake Schema Model
– Provides for speedier data loading
– Can become large and unmanageable
– Degrades query performance
– More complex metadata
– Facts are usually highly normalized
– Dimensions are also normalized
Country → State → County → City
Constellation Configuration
Atomic fact
Fact Table Measures
– Nonadditive: cannot be added along any dimension
– Semiadditive: can be added along some dimensions
– Additive: can be added across all dimensions
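The distinction can be sketched with toy fact rows: a unit count sums cleanly across every dimension, while a price-like measure must not simply be summed (the data below is illustrative):

```python
# Toy fact rows: sales_units is additive; unit_price is non-additive
# (summing or plainly averaging a price column is meaningless).
facts = [
    {"store": "S1", "day": "Mon", "sales_units": 10, "unit_price": 2.0},
    {"store": "S1", "day": "Tue", "sales_units": 5,  "unit_price": 4.0},
    {"store": "S2", "day": "Mon", "sales_units": 20, "unit_price": 2.0},
]

# Additive: summing units across all dimensions is meaningful.
total_units = sum(f["sales_units"] for f in facts)

# Non-additive: a sensible "average price" weights by units sold.
weighted_price = sum(f["sales_units"] * f["unit_price"] for f in facts) / total_units

print(total_units)     # 35
print(weighted_price)  # (10*2 + 5*4 + 20*2) / 35 ≈ 2.29
```

A classic semiadditive example is an account balance: it sums across accounts but not across time.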
More on Factless Fact Tables
Factless fact table: Emp_FK, Sal_FK, Age_FK, Ed_FK, Grade_FK
Employee dimension: Emp_PK
Salary dimension: Sal_PK
Age dimension: Age_PK
Education dimension: Ed_PK
Grade dimension: Grade_PK
(PK = primary key, FK = foreign key)
Factless Fact Tables
– Event tracking
– Coverage
Bracketed Dimensions
– Enhance performance and analytical capabilities
– Create groups of values for attributes with many unique values, such as income ranges and age brackets
– Minimize the need for full table scans by pre-aggregating data
Bracketing Dimensions
Customer dimension: Customer_PK, Bracket_FK
Bracket dimension: Bracket_PK
Income fact: Customer_PK, Bracket_FK
Bracket_PK Income (10Ks) Marital Status Gender Age
1 60-90 Single Male <21
2 60-90 Single Male 21-35
3 60-90 Single Male 35-55
4 60-90 Single Male >55
5 60-90 Single Female <21
6 60-90 Single Female 21-35
Identifying Analytical Hierarchies
Store dimension
Store ID, Store Desc, Location, Size, Type, District ID, District Desc, Region ID, Region Desc
Business hierarchies describe organizational structure and logical parent-child relationships within the data.
Region
District
Store
Organization hierarchy
Multiple Hierarchies
Store ID, Store Desc, Location, Size, Type, District ID, District Desc, Region ID, Region Desc, City ID, City Desc, County ID, County Desc, State ID, State Desc
Region
District
Store
Organization hierarchy
Store dimension
Region
District
Store
Geography hierarchy
Multiple Time Hierarchies
Fiscal year
Fiscal quarter
Fiscal month
Fiscal time hierarchy
Fiscal week
Calendar year
Calendar quarter
Calendar month
Calendar time hierarchy
Calendar week
Drilling Up and Drilling Down

Market hierarchy: Group → Region → District → Store
(Example: Region 1 contains Districts 1 and 2; Region 2 contains Districts 3 and 4; the districts contain Stores 1-6.)
Region
District
Drilling Across
Market hierarchy: Group → Region → District → Store
City hierarchy: City → Store
(Example: drilling across hierarchies to find stores > 20,000 sq. ft.)
Using Time in the Data Warehouse
– Defining standards for time is critical.
– Aggregation based on time is complex.
– Time is critical to the data warehouse. A consistent representation of time is required for extensibility.
Where should the element of time be stored?
Time dimension
Sales fact
Date Dimension
– Should Date Dimension be modeled?
Applying the Changes to Data
• You have a choice of techniques:
  – Overwrite a record
  – Add a record
  – Add a field
  – Maintain history
  – Add version numbers
OLAP Models
– Relational (ROLAP)
– Multidimensional (MOLAP)
– Hybrid (HOLAP)
– Desktop (DOLAP)
Slowly Changing Dimensions (SCDs)
What is a SCD?
It is a dimension whose attribute data changes, and therefore needs updating, slowly over time.
There are three standard ways outlined by Kimball (and others) to handle this situation:
– Type I
– Type II
– Type III
Type I - Overwriting a Record
– Easy to implement– Loses all history– Not recommended
42135 John Doe Single
42135 John Doe Married
Type II - Adding a New Record
– History is preserved; dimensions grow.
– A generalized key is created.
42135 John Doe Single
42135_01 John Doe Married
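A Type II update can be sketched in Python (the field names and the `_01` key suffix scheme follow the example above and are otherwise illustrative):

```python
def scd_type2_update(dimension, natural_key, new_attrs):
    """Type II: never overwrite; append a new row with a versioned
    generalized key so earlier rows preserve history."""
    versions = [r for r in dimension if r["natural_key"] == natural_key]
    new_row = {
        "key": f"{natural_key}_{len(versions):02d}",  # 42135 -> 42135_01
        "natural_key": natural_key,
        **new_attrs,
    }
    dimension.append(new_row)
    return new_row

dim = [{"key": "42135", "natural_key": "42135",
        "name": "John Doe", "status": "Single"}]
scd_type2_update(dim, "42135", {"name": "John Doe", "status": "Married"})
print([r["key"] for r in dim])  # ['42135', '42135_01']
```

Real implementations usually also stamp each version with effective-from/to dates and a current-row flag so facts join to the right version.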
Type III - Adding a Current Field
– Maintains some history
– Loses intermediate values
– Is enhanced by adding an Effective Date field
42135 John Doe Single
42135 John Doe Single Married 1-Jan-01
Maintain History
History tables:
– One-to-many relationships
– One current record and many history records
Product
Time
Sales
HIST_CUST
CUSTOMER
Versioning
– Avoid double counting
– Facts hold the version number
Time
Product
Customer
Customer.CustId Version Customer Name
1234 1 Comer
1234 2 Comer
Sales.CustId Version Sales Facts
1234 1 $11,000
1234 2 $12,000
Sales
Rapidly Changing Dimensions (RCDs)
It is a dimension whose attribute data changes, and therefore needs updating, rapidly over time.
Also referred to as a Rapidly Changing Monster Dimension.
Solution: split the rapidly changing attributes into a separate dimension, referred to as a mini dimension.
Demographics Key | Age | Children | Income
1 | 20-24 | 0 | <20,000
2 | 20-24 | 1-2 | 20,000-30,000
3 | 20-24 | >2 | >30,000
4 | 25-30 | 0 | <20,000
5 | 25-30 | 1-2 | 20,000-30,000
... | ... | ... | ...
Mini Dimension
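Assigning a customer to a mini-dimension row can be sketched as bucketing the raw values and looking up the pre-built key. The brackets below are illustrative and, for brevity, the Children attribute from the table is omitted:

```python
# Bracket boundaries are illustrative; a real mini dimension is a
# pre-built table covering every bracket combination.
def age_bracket(age):
    return "20-24" if age < 25 else "25-30"

def income_bracket(income):
    if income < 20000:
        return "<20000"
    if income <= 30000:
        return "20000-30000"
    return ">30000"

MINI_DIM = {  # (age bracket, income bracket) -> demographics key
    ("20-24", "<20000"): 1,
    ("20-24", "20000-30000"): 2,
    ("25-30", "<20000"): 4,
}

def demographics_key(age, income):
    """Map a customer's raw values to the mini-dimension surrogate key."""
    return MINI_DIM[(age_bracket(age), income_bracket(income))]

print(demographics_key(22, 15000))  # 1
print(demographics_key(27, 18000))  # 4
```

Because only the small demographics key changes as a customer's values drift, the monster customer dimension itself stays stable.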
Junk Dimension
A junk dimension is an abstract dimension holding the decodes for a group of low-cardinality flags and indicators, thereby removing them from the fact table.
Junk Key Payment Type Order type Order Mode
1 Cash Normal Web
2 Cash Urgent Web
3 Credit Normal Fax
4 Credit Urgent Fax
... ... ... ...
Junk Dimension
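Building such a dimension can be sketched as enumerating every combination of the flag values once and keying them; the flag values follow the table above, while the key numbering below is generated and illustrative:

```python
from itertools import product

# Enumerate every combination of the low-cardinality flags once;
# the fact table then stores only the small junk key.
payment_types = ["Cash", "Credit"]
order_types = ["Normal", "Urgent"]
order_modes = ["Web", "Fax"]

junk_dimension = {}
for key, combo in enumerate(product(payment_types, order_types, order_modes),
                            start=1):
    junk_dimension[combo] = key

def junk_key(payment, order_type, order_mode):
    """Replace three flag columns on the fact with one small key."""
    return junk_dimension[(payment, order_type, order_mode)]

print(junk_key("Cash", "Normal", "Web"))  # 1
print(len(junk_dimension))                # 2 * 2 * 2 = 8 rows
```

Even with several flags the full cross product stays tiny, which is what makes pre-building the whole dimension practical.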
Secret of Success
Think big, start small!
References
Useful web sites:
http://www.dmreview.com
http://www.rkimball.com
http://www.billinmon.com
http://www.dmforum.org
http://www.freedatawarehouse.com
Thank you