Date post: | 28-Nov-2014 |
Category: |
Documents |
Upload: | bharat-kumar-kakani |
Introduction to Data Warehousing
©Copyright 2004, Cognizant Academy, All Rights Reserved
Session Objectives
• Overview of Data Warehousing
• Data Warehouse Architectures
• How to create a data warehouse
• How to design a data warehouse
• Understand the ETL process
• What is metadata
• How to administer a data warehouse
Operational Systems
What is an Operational System?
• Operational systems are just what their name implies; they are the
systems that help us run the day-to-day enterprise operations.
• These are the backbone systems of any enterprise, such as order entry and inventory.
• Classic examples are airline reservations, credit-card authorizations, and ATM withdrawals.
Characteristics of Operational Systems
• Continuous availability
• Predefined access paths
• Transaction integrity
• Volume of transaction - High
• Data volume per query - Low
• Used by operational staff
• Supports day to day control operations
• Large number of users
Historical Look at Informational Processing
The goal of Informational Processing is to turn data into
information!
Why?
Because business questions are answered using information and
the knowledge of how to apply that information to a given problem.
Data → Information → Knowledge
Need for a Separate Informational System
• Data: Informational data is distinctly different from operational data in its structure and content.
• Processing: Informational processing is distinctly different from operational processing in its characteristics and use of data.
The Information Center
• Management requires business information
• A request for a report is made to the
Information Center
• Information Center works on developing the
report
• Requirements for the report must be clarified
The Information Center
• Report provided to analyst
• Analyst manipulates data for decision making
• Management receives information, but...
What took so long? and
How do I know it’s right?
The Information Center
Too Many Steps Involved!
Tactical Information
Example: Inventory Control System (OLTP Server)
• Production quantity
• Transported quantity
• Order quantity
• Supports day to day control operations
• Transaction processing
• High performance operational systems
• Fast response time
• Initiates immediate action
Strategic Information
• Understand Business Issues
• Analyze Trends and Relationships
• Analyze Problems
• Discover Business Opportunities
• Plan for the Future
Finance, Payroll, Marketing, Production & Inventory
Need for Tactical and Strategic Information
Operational data helps the organization meet operational and tactical requirements for data, while Data Warehouse data helps the organization meet strategic requirements for information.
OLTP Server (operational data, tactical information) → periodic refresh → Data Warehouse Server (strategic information)
Operational Vs Analytical Systems
Operational | Analytical
Primarily primitive | Primarily derived
Current; accurate as of now | Historical; accuracy maintained over time
Constantly updated | Less frequently updated
Minimal redundancy | Managed redundancy
Highly detailed data | Summarized data
Referential integrity | Historical integrity
Supports day-to-day business functions | Supports long-term informational requirements
Normalized design | De-normalized design
Data Warehousing
Data Warehouse Definition
The Data Warehouse is a
• Subject oriented
• Integrated
• Time variant
• Non-volatile
collection of data in support of management decision processes.
Data Warehouse - Differences from Operational Systems
• Operational systems (e.g. Accounting, Order Entry, Billing): operational data is organized by specific processes or tasks and is maintained by separate systems
• Data warehouse (e.g. Customer, Usage, Revenue): warehoused data is organized by subject area and is populated from many operational systems
Data Warehouse - Differences from Operational Systems
Operational systems (application specific):
• Applications and their databases were designed and built separately
• Evolved over long periods of time
Data warehouse (integrated):
• Integrated from the start
• Designed (or “Architected”) at one time, implemented iteratively over short periods of time
Data Warehouse - Differences from Operational Systems
• Operational systems: primarily concerned with current data
• Data warehouse: generally concerned with historical data
Data Warehouse - Differences from Operational Systems (Load/Update)
Operational systems database (constant change):
• Updated constantly through inserts, updates and deletes
• Data changes according to need, not a fixed schedule
Data warehouse (consistent points in time):
• Populated by an initial load followed by incremental loads
• Added to regularly, but loaded data is rarely directly changed
• Does NOT mean the data warehouse is never updated or never changes!
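The load pattern above can be sketched in a few lines. This is only an illustration under stated assumptions: the `warehouse` list, the `load_batch` helper, the `loaded_on` stamp and the sample orders are hypothetical names and values, not part of the original material. The point it shows is that each load appends a new snapshot and leaves previously loaded rows untouched.

```python
from datetime import date

# Hypothetical in-memory "warehouse" table: a list of snapshot rows.
warehouse = []

def load_batch(rows, loaded_on):
    """Append a batch of source rows, stamped with the load date.

    Rows already in the warehouse are never modified; history accumulates.
    """
    for row in rows:
        warehouse.append({**row, "loaded_on": loaded_on})
    return len(rows)

# Initial load: everything the source holds at that point in time.
load_batch([{"order_id": 1, "status": "open"}], date(2004, 1, 1))

# Incremental load: the source row changed, so a new snapshot is appended.
load_batch([{"order_id": 1, "status": "shipped"},
            {"order_id": 2, "status": "open"}], date(2004, 1, 2))

history = [r["status"] for r in warehouse if r["order_id"] == 1]
print(history)  # ['open', 'shipped']
```

An operational system would instead update the single row for order 1 in place, so only the latest status would survive.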
Data in a Data Warehouse
What about the data in the Datawarehouse?
• Separate DSS data base
• Storage of data only, no data is created
• Integrated and Scrubbed data
• Historical data
• Read only (no recasting of history)
• Various levels of summarization
• Meta data
• Subject oriented
• Easily accessible
Data Warehousing Features
• Strategic enterprise level decision support
• Multi-dimensional view on the enterprise data
• Caters to the entire spectrum of management
• Descriptive, standard business terms
• High degree of scalability
• High analytical capability
• Historical data only
Datawarehouse - Business Benefits
Benefits To Business
• Understand business trends
• Better forecasting decisions
• Bring better products to market in a timely manner
• Analyze daily sales information and make quick decisions
• Solution for maintaining your company's competitive edge
Data Warehouse- Application Areas
Following are some Business Applications of a data warehouse:
• Risk management
• Financial analysis
• Marketing programs
• Profit trends
• Procurement analysis
• Inventory analysis
• Statistical analysis
• Claims analysis
• Manufacturing optimization
• Customer relationship management
Data Marts
What is a Data mart?
• A data mart is a decentralized subset of data, either drawn from a data warehouse or built as a standalone store, designed to support the unique requirements of a specific business unit or decision-support system.
• Data marts have specific business-related purposes, such as measuring the impact of marketing promotions or measuring and forecasting sales performance.
Diagram labels: Enterprise Data Warehouse, Data Mart, Data Mart
Data marts - Main Features
Main Features:
• Low cost
• Controlled locally rather than centrally, conferring power on the user group.
• Contain less information than the warehouse
• Rapid response
• More easily understood and navigated than an enterprise data warehouse.
• Within the range of divisional or departmental budgets
Advantages of Data Mart over Data Warehouse
• Typically single subject area and fewer dimensions
• Limited feeds
• Very quick time to market (30-120 days to pilot)
• Quick impact on bottom line problems
• Focused user needs
• Limited scope
• Optimum model for DW construction
• Demonstrates ROI
• Allows prototyping
Disadvantages of Data Mart
• Does not provide an integrated view of business information
• Uncontrolled proliferation of data marts results in redundancy
• A larger number of data marts is more complex to maintain
• Scalability issues for a large number of users and increased data volume
Different Approaches for Implementing Data Marts
Non-architected Data Marts
Q: When is a Data Warehouse not a Data Warehouse?
A: When it’s an unarchitected collection of data marts
Non-architected Data Marts
Source systems → Data marts → End user access
Significant and expensive duplication of effort and data.
Upsides and Downsides of Non-architected Data Marts
The upsides of non-architected data marts are:
1. Speed
2. Low cost
The downsides of non-architected data marts are:
1. Multiple extraction processes
2. Multiple business rules
3. Multiple semantics
4. Extremely challenging to integrate
Architected Data Warehouse
Diagram labels: Source systems, Data Staging, Enterprise Data Warehouse, Metadata, End user access
Unarchitected Data Marts Vs Data Warehouse
Unarchitected data marts (easy to do, not architected):
? Are the extracts, transformations, integrations & loads consistent?
? Is the redundancy managed?
? What is the impact on the sources?
Enterprise Data Warehouse (architected):
• Data and results consistent
• Redundancy is managed
• Detailed history available for drill-down
• Metadata is consistent!
The Operational Data Store
ODS Definition
The ODS is defined to be a structure that is:
• Integrated
• Subject oriented
• Volatile, where update can be done
• Current valued, containing data that is a day or perhaps a month old
• Contains detailed data only.
Why Do We Need an Operational Data Store?
Need
• To obtain a “system of record” that contains the best data that exists in a
legacy environment as a source of information
• Best here implies data to be
– Complete
– Up to date
– Accurate
• In conformance with the organization’s information model
Operational Data Store - Insulated from OLTP
• ODS data resolves data integration issues
• Data is physically separated from the production environment to insulate it from the processing demands of reporting and analysis
• Access to current data is facilitated
OLTP Server → ODS → Tactical Analysis
Operational Data Store - Data
• Detailed data
– Records of business events (e.g. order capture)
• Data from heterogeneous sources
• Does not store summary data
• Contains current data
ODS- Benefits
• Integrates the data
• Synchronizes the structural differences in data
• High transaction performance
• Serves the operational and DSS environment
• Transaction level reporting on current data
Sources feeding the Operational Data Store: flat files (e.g. 60,5.2,”JOHN” 72,6.2,”DAVID”), relational databases, Excel files
Operational Data Store - Update Schedule
ODS data:
• Update schedule: daily or more frequent
• Detail of data is mostly between 30 and 90 days
• Addresses operational needs
Data warehouse data:
• Update schedule: weekly or less frequent
• Potentially infinite history
• Addresses strategic needs
ODS Vs Data Warehouse Characteristics
Parameters compared for the ODS and the data warehouse:
• Integrated and subject oriented
• Updated by transactions
• Stores summarized data
• Used for strategic decisions
• Used at managerial level
• Used for tactical decisions
• Contains current and detailed data
• Lengthy historical perspective
OLAP
What is OLAP
• OLAP tools are used for analyzing data
• They help users gain insight into the organization’s data
• They help users carry out multi-dimensional analysis on the available data
• Using OLAP techniques, users can view the data from different perspectives
• Helps in decision making and business planning
• Converting OLTP data into information
• Solution for maintaining your company's competitive edge
OLAP Terminology
• Drill Down and Drill Up
• Slice and Dice
• Multi dimensional analysis
• What IF analysis
Data Warehouse Architecture
Basic Data Warehouse Architecture
Components shown in the diagram:
• Operational & External data
• ODS
• Data Staging layer
• Data warehouse
• Data Marts
• Information Servers
• Information Access: Reporting tools, Web Browsers, OLAP, Mining
• Meta Data Management
• Administration
Operational & External Data layer
• The database-of-record
• Consists of system specific reference data and event data
• Source of data for the data warehouse
• Contains detailed data
• Continually changes due to updates
• Stores data up to the last transaction
Data Staging layer
• Extracts data from operational and external databases.
• Transforms the data and loads it into the data warehouse.
• This includes decoding production data and merging records from multiple DBMS formats.
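The staging steps above (extract, decode, merge, load) might look roughly like this. It is a minimal sketch, not a description of any particular ETL tool: the two source layouts, the decoding rules and the names `transform`, `load` and `customer_dim` are all assumptions made for illustration.

```python
# Hypothetical extracts from two source systems that encode gender differently.
source_a = [{"cust": "C1", "gender": "M"}, {"cust": "C2", "gender": "F"}]
source_b = [{"customer_no": "C3", "sex": 1}, {"customer_no": "C2", "sex": 0}]

# Assumed decoding rules for each source's production codes.
DECODE_A = {"M": "Male", "F": "Female"}
DECODE_B = {1: "Male", 0: "Female"}

def transform(rows, key_field, code_field, decode):
    """Map one source's layout and codes onto the common warehouse layout."""
    return [{"cust_code": r[key_field], "gender": decode[r[code_field]]}
            for r in rows]

def load(target, rows):
    """Merge rows into the target, keeping one record per customer key."""
    for row in rows:
        target[row["cust_code"]] = row
    return target

customer_dim = {}
load(customer_dim, transform(source_a, "cust", "gender", DECODE_A))
load(customer_dim, transform(source_b, "customer_no", "sex", DECODE_B))

print(sorted(customer_dim))  # ['C1', 'C2', 'C3']
```

Customer C2 appears in both sources and ends up as a single merged record, which is the integration step the staging layer is responsible for.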
Data Warehouse layer
• Stores data used for informational analysis
• Presents summarized data to the end-user for analysis
• The nature of the operational data, the end-user requirements and the business objectives of the enterprise determine the structure
Meta Data layer
• Metadata is data about data.
• Stored in a repository.
• Contains all corporate metadata resources: database catalogs, data dictionaries
Process Management layer
• Scheduler or the high-level job control
• To build and maintain the data warehouse and data directory information
• To keep the data warehouse up-to-date
Information Access layer
• Interfaced with the data warehouse through an OLAP server.
• Performs analytical operations and presents data for analysis.
• End-users generate ad-hoc reports and perform multidimensional analysis using OLAP tools
Data Warehouse Architecture - Implementation
The following should be considered for a successful implementation of a Data Warehousing solution:
Architecture:
• Open Data Warehousing architecture with common interfaces for product integration
• Data warehouse database server
Tools:
• Data Modeling tools
• Extraction and Transformation/propagation tools
• Analysis/end-user tools: OLAP and Reporting
• Metadata Management tools
Different Approaches for Implementing an Enterprise Data Warehouse
What is an Enterprise Data Warehouse (EDW)?
• An Enterprise Data Warehouse (EDW) contains detailed as well as summarized data
• Separate subject-oriented database
• Supports detailed analysis of business trends over a period of time
• Used for short- and long-term business planning and decision making covering multiple business units
EDW - “Top Down” Approach
Layers shown in the diagram:
• Heterogeneous source systems: Source 1, Source 2, Source 3
• Staging: common staging interface layer
• Enterprise Data Warehouse
• Data mart bus architecture layer
• Incremental architected data marts: DM 1, DM 2, DM 3
EDW - “Top Down” Approach - Implementation
• An EDW is composed of multiple subject areas, such as Finance, Human Resources, Marketing, Sales, Manufacturing, etc.
• In a top down scenario, the entire EDW is architected, and then a small slice of a subject area is chosen for construction
• Subsequent slices are constructed, until the entire EDW is complete
Upsides and Downsides of Top-Down Approach
The upsides to a “Top Down” approach are:
1. Coordinated environment
2. Single point of control & development
The downsides to a “Top Down” approach are:
1. “Cross everything” nature of enterprise project
2. Analysis paralysis
3. Scope control
4. Time to market
5. Risk and exposure
EDW - “Bottom Up” Approach
Layers shown in the diagram:
• Heterogeneous source systems: Source 1, Source 2, Source 3
• Staging: common staging interface layer
• Data mart bus architecture layer
• Incremental architected data marts: DM 1, DM 2, DM 3
• Enterprise Data Warehouse
EDW - “Bottom Up” Approach - Implementation
• Initially an Enterprise Data Mart Architecture (EDMA) is developed
• Once the EDMA is complete, an initial subject area is selected for the first incremental Architected Data Mart (ADM).
• The EDMA is expanded in this area to include the full range of detail required for the design and development of the incremental ADM.
Upsides and Downsides of Bottom Up Approach
The upsides to a “bottom up” approach are:
1. Quick ROI
2. Low risk, low political exposure learning and development environment
3. Lower level, shorter-term political will required
4. Fast delivery
5. Focused problem, focused team
6. Inherently incremental
The downsides to a “bottom up” approach are:
1. Multiple team coordination
2. Must have an EDMA to integrate incremental data marts
Data Warehouse Architecture - Summary
• Many tools and technologies
• Data warehouse system architectures
• Top down approach
• Bottom up approach
Building a Data Warehouse
Building a Data Warehouse
The initiatives involved in building a data warehouse are
• Identify the need and justify the cost
• Architect the warehouse
• Choose product and vendors
• Create a dimensional business model
• Create the physical model
• Design & develop extract, transform and load systems
• Test and refine the data warehouse
Data Warehouse design is driven by business users, not by the IS team.
Data Warehouse Life Cycle
Steps and components shown in the diagram:
• Business Requirement
• Map Data sources
• Reverse Engg.
• Map Req. to OLTP
• OLTP System
• External Data Storage
• Logical Modeling
• Refine Model
• ETL
• Enterprise Data Warehouse
• Info Access: Reporting tools, Web Browsers, OLAP, Mining
ER Modeling
Review of Logical Modeling Terms & Symbols
• Entities define specific groups of information
Entity: Sales Organization (Sales Org ID, Distribution Channel)
• Entities are made up of attributes
Attributes: Sales Org ID, Distribution Channel
• One or more attributes uniquely identify an instance of an entity
Identifier: Sales Org ID
• The logical model identifies relationships between entities
Relationship: Sales Detail (Sales Record ID) and Sales Rep (Sales Rep ID)
Logical Data Model
Entities and their identifiers:
• Sales Detail (Sales Record ID)
• Customer (Customer ID)
• Product (Product SKU)
• Suppliers (Supplier ID)
• Manufacturing Group (Manufacturing Org ID)
• Factory (Factory ID)
• Sales Organization (Sales Org ID, Distribution Channel)
• Sales Rep (Sales Rep ID)
• Retail
• Market
• Product Sales Plan (Plan ID)
• Wholesale
• Industry
Dimensional Modeling
Facts and Measures
• Facts or measures are the Key Performance Indicators of an enterprise
• Factual data about the subject area
• Numeric, summarized
Examples: Net Profit, Sales Revenue, Gross Margin, Profitability, Cost
Dimension
• Dimensions put measures in perspective
• What, when and where qualifiers to the measures
• Dimensions could be products, customers, time, geography etc.
Example for the measure Sales Revenue: What was sold? Whom was it sold to? When was it sold? Where was it sold?
Some Examples of Data Warehousing Dimensions
The following dimensions are common in all data warehouses in various forms:
• Product dimension
• Service dimension
• Geographic dimension
• Time dimension
Dimension Elements
• Components of a dimension
• Represent the natural elements in the business dimension
• Directly related to the dimension
• Facilitate analysis from different perspectives of a dimension
• Often referred to as levels of a dimension
Example dimensions: Time, Product, Geography
Dimension Hierarchy (Time Dimension)
• Represents the natural business hierarchy within dimension elements
• Clarifies the drill up, drill down directions
• Each element represents a different level of aggregation
• End users may need custom hierarchies
Year: 1999
Month: April, May
Date: 9/4/99, 28/4/99 (April); 5/5/99, 17/5/99 (May)
Drill up: Date → Month → Year
Drill down: Year → Month → Date
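The time hierarchy above can be illustrated with a small roll-up. This is a sketch only: the dates come from the slide, while the sales amounts and the `drill_up` helper are made up for the example.

```python
from collections import defaultdict
from datetime import date

# Dates from the slide; the sales amounts are made-up illustration values.
daily_sales = {
    date(1999, 4, 9): 10,
    date(1999, 4, 28): 20,
    date(1999, 5, 5): 30,
    date(1999, 5, 17): 40,
}

def drill_up(values, level):
    """Aggregate date-level values to the 'month' or 'year' level."""
    totals = defaultdict(int)
    for day, amount in values.items():
        key = (day.year, day.month) if level == "month" else day.year
        totals[key] += amount
    return dict(totals)

by_month = drill_up(daily_sales, "month")
by_year = drill_up(daily_sales, "year")
print(by_month)  # {(1999, 4): 30, (1999, 5): 70}
print(by_year)   # {1999: 100}
```

Each level of the hierarchy is just a coarser grouping key over the same detail rows.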
Multi-Dimensional Analysis
• Characteristic of online analytical processing (OLAP)
Dimensions: Geography, Time, Product
Region | Product | 1st Qtr | 2nd Qtr | 3rd Qtr | 4th Qtr
East | A | 20.4 | 27.4 | 90.0 | 20.4
East | B | 19.8 | 26.6 | 87.3 | 19.8
West | A | 30.6 | 38.6 | 34.6 | 31.6
West | B | 29.7 | 37.4 | 33.6 | 30.7
North | A | 45.9 | 46.9 | 45.0 | 43.9
North | B | 44.5 | 45.5 | 43.7 | 42.6
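The table above can be treated as a small three-dimensional cube. The sketch below uses the slide's figures and shows "slice" and "dice" in their commonly used sense (fixing one dimension member, and restricting several dimensions to chosen members). The helper names `slice_cube` and `dice_cube` are hypothetical.

```python
QUARTERS = ["1st Qtr", "2nd Qtr", "3rd Qtr", "4th Qtr"]

# Figures from the slide's table, keyed by (region, product).
rows = {
    ("East", "A"): [20.4, 27.4, 90.0, 20.4],
    ("East", "B"): [19.8, 26.6, 87.3, 19.8],
    ("West", "A"): [30.6, 38.6, 34.6, 31.6],
    ("West", "B"): [29.7, 37.4, 33.6, 30.7],
    ("North", "A"): [45.9, 46.9, 45.0, 43.9],
    ("North", "B"): [44.5, 45.5, 43.7, 42.6],
}

# Cube cells keyed by (region, product, quarter).
cube = {(region, product, quarter): value
        for (region, product), values in rows.items()
        for quarter, value in zip(QUARTERS, values)}

def slice_cube(cube, quarter):
    """Slice: fix one member of the time dimension, keep the other two."""
    return {(r, p): v for (r, p, q), v in cube.items() if q == quarter}

def dice_cube(cube, regions, products):
    """Dice: keep a sub-cube restricted to chosen regions and products."""
    return {k: v for k, v in cube.items()
            if k[0] in regions and k[1] in products}

third_quarter = slice_cube(cube, "3rd Qtr")
print(third_quarter[("East", "A")])  # 90.0
```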
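Drill Up & Drill Down
• Drill down is the process of requesting more detailed information
• Drill up is the process of summarizing the existing information
Yearly totals (1999):
Region | 1999
East | 158.2
West | 135.4
North | 181.7
Quarterly detail:
Region | 1st Qtr | 2nd Qtr | 3rd Qtr | 4th Qtr
East | 20.4 | 27.4 | 90 | 20.4
West | 30.6 | 38.6 | 34.6 | 31.6
North | 45.9 | 46.9 | 45 | 43.9
Drilling up moves from the quarterly detail to the yearly totals; drilling down moves the other way.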
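The relationship between the two result sets above can be checked with a short sketch that uses the slide's own figures. The function names are hypothetical, and rounding to one decimal is an assumption made to match how the slide displays the totals.

```python
# Quarterly figures per region, taken from the slide.
quarterly = {
    "East":  [20.4, 27.4, 90.0, 20.4],
    "West":  [30.6, 38.6, 34.6, 31.6],
    "North": [45.9, 46.9, 45.0, 43.9],
}

def drill_up_to_year(by_quarter):
    """Summarize quarter-level values into one yearly total per region."""
    return {region: round(sum(values), 1)
            for region, values in by_quarter.items()}

def drill_down(by_quarter, region):
    """Return the quarter-level detail behind one region's yearly total."""
    labels = ["1st Qtr", "2nd Qtr", "3rd Qtr", "4th Qtr"]
    return dict(zip(labels, by_quarter[region]))

yearly = drill_up_to_year(quarterly)
print(yearly)  # {'East': 158.2, 'West': 135.4, 'North': 181.7}
```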
Dimensional Modeling
Subject Area What do you want to know about?
Atomic Detail What level of detail do you need?
Dimensions Analyze key performance indicators
Facts Measures
Frequency of Update How fresh do you need it?
Depth of History How far back do you need to know it?
Requirements for a Dimensional model
• Clean, current, accurate logical models
• Physical models
• A subject area model
• Star / Snowflake schema design
Dimensional Modeling Methodology
Steps shown in the diagram: Business Req, Data Sources (OLTP System, External), Map Req. to OLTP, Logical Modeling, Refine model.
Techniques for Implementing a Dimensional model
• Star Schema
• Snow-flake Schema
• Hybrid Schema
• Optimal Snow-flake Schema
Star schema - Logical structure
Fact table: Employee, Product, Customer, Day, Units sold, Revenue
Dimensions: Time, Product, Customer, Employee
Star schema: Physical view
Time_dim: day_code, date, day_of_week, month_seq, month_num, month_long_name, month_short_name, qtr_seq, qtr_num, quarter, year
Geography_dim: emp_code, emp_name, city_code, city, state_code, state, region_code, region
Product_dim: prod_code, prod_name, brand, color_code
Customer_dim: cust_code, cust_name, age_code, age, sex_code, sex, city_code, city
Fact table: emp_code, prod_code, day_code, cust_code, units, revenue
Star schema characteristics
• A star schema is a highly denormalized, query-centric model where the basic premise is that information can be broken into two groups: facts and dimensions.
• In a star schema, facts are in a single place (the fact table) and the descriptions (or elements) that lead to those facts are in dimension tables.
• The star schema is built for simplicity and speed. The assumption behind it is that the database is static, with no updates being performed online.
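A star schema query joins the fact table to one or more dimension tables and aggregates the measures. The sketch below uses Python's standard `sqlite3` module with a cut-down version of the tables. The keys and measures loosely follow the deck's sample fact rows, while the product names and brand are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (prod_code INTEGER PRIMARY KEY, prod_name TEXT, brand TEXT);
CREATE TABLE time_dim    (day_code INTEGER PRIMARY KEY, month_num INTEGER, year INTEGER);
CREATE TABLE sales_fact  (
    day_code  INTEGER REFERENCES time_dim(day_code),
    prod_code INTEGER REFERENCES product_dim(prod_code),
    units     INTEGER,
    revenue   INTEGER
);
""")

# Product names, brand and calendar values are invented for illustration.
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)",
                 [(22, "Widget", "Acme"), (112, "Gadget", "Acme")])
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1211, 4, 1999), (1212, 4, 1999)])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                 [(1211, 22, 12, 264), (1211, 112, 6, 672),
                  (1212, 112, 76, 8512), (1212, 22, 22, 484)])

# One join per dimension used, then aggregate the measure.
revenue_by_product = conn.execute("""
    SELECT p.prod_name, SUM(f.revenue)
    FROM sales_fact AS f
    JOIN product_dim AS p ON p.prod_code = f.prod_code
    GROUP BY p.prod_name
    ORDER BY p.prod_name
""").fetchall()
print(revenue_by_product)  # [('Gadget', 9184), ('Widget', 748)]
```

Because each dimension is a single denormalized table, any descriptive attribute is reachable with one join from the fact table.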
Star schema: Dimension Table
Geography_dim
empl_code | empl_name | city_code | city | state_code | state | region_code | region
2341 | Mike King | 101 | Atlantic city | NJ | New Jersey | 1 | New Jersey
3424 | Jim McCann | 106 | Chicago | IL | Illinois | 2 | Illinois
1232 | Kitty Stokes | 104 | Austin | PA | Pennsylvania | 1 | New Jersey
3554 | Clem Akins | 102 | Medford | NJ | New Jersey | 1 | New Jersey
3963 | Duncan Moore | 101 | Atlantic city | NJ | New Jersey | 1 | New Jersey
2924 | Dawn McGuire | 103 | Englewood | NJ | New Jersey | 1 | New Jersey
2673 | Joe Becker | 105 | Alverton | PA | Pennsylvania | 1 | New Jersey
3253 | Geoff Bergren | 107 | Springfield | IL | Illinois | 2 | Illinois
234 | Garth Boyd | 106 | Chicago | IL | Illinois | 2 | Illinois
2342 | Lin Cepele | 104 | Austin | PA | Pennsylvania | 1 | New Jersey
PK: empl_code
Elements: Region, State, City, Employee (each with its attributes)
• De-normalized structure
• Easy navigation within the dimension
Star schema: Fact Table
sales_fact
day_code | prod_code | cust_code | empl_code | units sold | revenue
1211 | 345 | 1231123 | 1232 | 23 | 7935
1211 | 22 | 1245223 | 3554 | 12 | 264
1211 | 112 | 1522342 | 3963 | 6 | 672
1212 | 233 | 1524665 | 2924 | 34 | 7922
1212 | 112 | 1366454 | 2673 | 76 | 8512
1212 | 22 | 1403453 | 3554 | 22 | 484
Dimension keys: day_code, prod_code, cust_code, empl_code
Measures: units sold, revenue
• Contains columns for measures and dimensions
Snow-flake schema
Fact table measures: Revenue, Units Sold, Net Profit
Dimensions: Product, Time, Customer
Normalized sub-dimensions: City, Brand, Color, Region, Country
Snow-flake: Physical view
Fact table: emp_code, cust_code, prod_code, day_code, units, revenue
Dimension tables (one per line):
• emp_code, emp_name
• emp_code, city_code, cityname
• city_code, state_code, statename
• state_code, region_code, regionname
• region_code, country_code, countryname
• prod_code, brand_code, prod_name
• brand_code, brand_name, color_code
• color_code, color_name
• day_code, day_name, week_code
• week_code, week_name, month_code
• month_code, month_name, quarter_code, year
• cust_code, cust_name, age_code, age, sex_code, sex, city_code, city
Hybrid schema: Physical view
Fact table: emp_code, cust_code, prod_code, day_code, units, revenue
Normalized dimension tables (one per line):
• prod_code, brand_code, prod_name
• brand_code, brand_name, color_code
• color_code, color_name
• day_code, day_name, week_code
• week_code, week_name, month_code
• month_code, month_name, quarter_code, year
De-normalized dimension tables (one per line):
• emp_code, emp_name, city_code, city, state_code, state, region_code, region
• cust_code, cust_name, age_code, age, sex_code, sex, city_code, city
Optimal Snow-flake schema
Fact table: emp_code, cust_code, prod_code, day_code, brand_code, units, revenue
Normalized dimension tables (one per line):
• prod_code, brand_code, prod_name
• brand_code, brand_name, color_code
• color_code, color_name
• day_code, day_name, week_code
• week_code, week_name, month_code
• month_code, month_name, quarter_code, year
De-normalized dimension tables (one per line):
• emp_code, emp_name, city_code, city, state_code, state, region_code, region
• cust_code, cust_name, age_code, age, sex_code, sex, city_code, city
What is a Slowly Changing Dimension?
• Although dimension tables are typically static lists, most dimension tables do change over time.
• Since these changes are smaller in magnitude compared to changes in fact tables, these dimensions are known as slowly growing or slowly changing dimensions.
Slowly Changing Dimension -Classification
Slowly changing dimensions are classified into three different types
• TYPE I
• TYPE II
• TYPE III
Slowly Changing Dimensions Type I
Source and target columns: Emp id, Name, Email
Example row in source and target: Emp id 1001, Name Shane (Email value not shown)
In Type I the changed source value overwrites the existing target row, so no history is kept.
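A Type I change is commonly implemented as a plain overwrite. The sketch below assumes a target keyed by employee id. The e-mail values are invented, since the slide's values are not available, and `apply_type1` is a hypothetical helper.

```python
# Target dimension keyed by employee id. E-mail values are invented.
target = {1001: {"name": "Shane", "email": "shane@old.example"}}

def apply_type1(target, source_row):
    """Overwrite the target row in place; the previous value is not kept."""
    emp_id = source_row["emp_id"]
    target[emp_id] = {"name": source_row["name"],
                      "email": source_row["email"]}

apply_type1(target, {"emp_id": 1001, "name": "Shane",
                     "email": "shane@new.example"})
print(target[1001]["email"])  # shane@new.example
```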
Slowly Changing Dimensions Type II - Versioning
Source columns: Emp id, Name, Email. The target adds PM_PRIMARYKEY and PM_VERSION_NUMBER.
• Initial load: target row with PM_PRIMARYKEY 1000, Emp id 10, Name Shane, PM_VERSION_NUMBER 0
• After the first change: target rows 1000 and 1001 for Emp id 10
• After the second change: target rows 1000, 1001 and 1003 for Emp id 10
Each change inserts a new target row with a new version number; earlier rows are kept as history.
Slowly Changing Dimensions Type II - Flag Current
Source columns: Emp id, Name, Email. The target adds PM_PRIMARYKEY and PM_CURRENT_FLAG.
• Initial load: target row with PM_PRIMARYKEY 1000, Emp id 10, Name Shane, PM_CURRENT_FLAG 1
• After the first change: target rows 1000 and 1001 for Emp id 10
• After the second change: target rows 1000, 1001 and 1003 for Emp id 10
Each change inserts a new target row flagged as current; earlier rows are kept as history.
Slowly Changing Dimensions Type II
[Figure: Source row: Emp id 10, Name Shane, Email. Target row: PM_PRIMARYKEY 1000, Emp id 10, Name Shane, PM_BEGIN_DATE 01/01/00, PM_END_DATE empty.]
Slowly Changing Dimensions - Effective Date
[Figure: Source row: Name Shane, Email. Target table, Email column omitted:]
PM_PRIMARYKEY  Emp id  Name   PM_BEGIN_DATE  PM_END_DATE
1000           10      Shane  01/01/00       03/01/00
1001           10      Shane  03/01/00
Slowly Changing Dimensions - Effective Date
[Figure: Source row: Name Shane, Email. Target table, Email column omitted:]
PM_PRIMARYKEY  Emp id  Name   PM_BEGIN_DATE  PM_END_DATE
1000           10      Shane  01/01/00       03/01/00
1001           10      Shane  03/01/00       05/02/00
1003           10      Shane  05/02/00
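The effective-date variant can be sketched as closing the currently open row and opening a new one on each change. The sketch below assumes an open row is marked by an empty end date; `apply_type2_dates` and the lower-case column names are hypothetical, and dates are kept as strings only to mirror the slide.

```python
def apply_type2_dates(target, source_row, change_date, next_key):
    """Close the open row for this Emp id and append a new open row."""
    for row in target:
        if row["emp_id"] == source_row["emp_id"] and row["pm_end_date"] is None:
            row["pm_end_date"] = change_date  # old version stops being current
    target.append({"pm_primarykey": next_key, **source_row,
                   "pm_begin_date": change_date, "pm_end_date": None})
    return target


target = []
apply_type2_dates(target, {"emp_id": 10, "name": "Shane"}, "01/01/00", 1000)
apply_type2_dates(target, {"emp_id": 10, "name": "Shane"}, "03/01/00", 1001)
for row in target:
    print(row["pm_primarykey"], row["pm_begin_date"], row["pm_end_date"])
# 1000 01/01/00 03/01/00
# 1001 03/01/00 None
```

A query for the state of the dimension on a given day then only needs to pick the row whose begin and end dates bracket that day.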
Slowly Changing Dimensions Type III
[Figure: Source row: Emp id 10, Name Shane, Email. Target row: PM_PRIMARYKEY 1, Emp id 10, Name Shane, Email, PM_Prev_ColumnName, PM_EFFECT_DATE 01/01/00.]
Slowly Changing Dimensions Type III
[Figure: Source row: Name Shane, Email. Target row: PM_PRIMARYKEY 1, Emp id 10, Name Shane, Email, PM_Prev_ColumnName, PM_EFFECT_DATE 01/02/00.]
Slowly Changing Dimensions Type III
[Figure: Source row: Name Shane, Email. Target row: PM_PRIMARYKEY 1, Emp id 10, Name Shane, Email, PM_Prev_ColumnName, PM_EFFECT_DATE 01/03/00.]
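Type III keeps a single row per key and remembers only the previous value in an extra column. A minimal sketch, assuming the tracked column is Email; `apply_type3`, the column name `pm_prev_email` and the email values are hypothetical.

```python
def apply_type3(row, new_email, effect_date):
    """Shift the current email into the 'previous' column, then overwrite it."""
    row["pm_prev_email"] = row["email"]
    row["email"] = new_email
    row["pm_effect_date"] = effect_date
    return row


row = {"pm_primarykey": 1, "emp_id": 10, "name": "Shane",
       "email": "a@example.com", "pm_prev_email": None,
       "pm_effect_date": "01/01/00"}
apply_type3(row, "b@example.com", "01/02/00")
apply_type3(row, "c@example.com", "01/03/00")
print(row["email"], row["pm_prev_email"], row["pm_effect_date"])
# c@example.com b@example.com 01/03/00
```

After the second change the first value is gone, so this approach retains exactly one step of history.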
Conformed Dimensions
• Conformed dimensions are those which are consistent across Data marts.
• Essential for integrating the Data marts into an Enterprise Data warehouse
Causal Dimensions
• Causal dimensions can be used to explain why a record exists in a fact table
• Causal dimensions should not change the grain of the fact table
Causal Dimension - Example
Example:
• Why did a customer buy a particular product?
• Why did a customer use a particular ATM?
Factless Fact Tables
The two types of factless fact tables are:
• Coverage tables
• Event tracking tables
Factless Fact Tables - Coverage Tables
Coverage tables are required when a primary fact table is sparse
Example: Tracking products in a store that did not sell
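One way to read the coverage example: the coverage table lists every store and product combination that was on offer, the sales fact table lists only what sold, and the difference is what did not sell. A small sketch with hypothetical keys:

```python
# Coverage table: every (store, product) pair that was on offer. No measures.
coverage = {("S1", "P1"), ("S1", "P2"), ("S1", "P3")}

# Sales fact table reduced to its keys: only products that actually sold.
sales = {("S1", "P1"), ("S1", "P3")}

# Products that were offered but did not sell.
not_sold = sorted(coverage - sales)
print(not_sold)  # [('S1', 'P2')]
```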
Factless Fact Tables - Event Tracking
These tables are used for tracking an event.
Example: Tracking student attendance
Helper Tables
• Helper tables are used when there are multi-valued dimensions, that is, when there is a many-to-many relationship between a fact table and a dimension table
• A helper table can be placed between two dimension tables or between a dimension table and a fact table.
Helper Tables - Example
Example : A customer having more than one bank account
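The bank account example can be sketched with a helper table that holds one row per customer and account pair. All names and rows below are hypothetical; account 101 is assumed to be a joint account so that the relationship is many-to-many.

```python
# Dimension tables (hypothetical rows).
customers = {1: "Alice", 2: "Bob"}
accounts = {100: "Savings", 101: "Checking", 102: "Savings"}

# Helper table: one row per (customer key, account key) pair.
customer_account = [(1, 100), (1, 101), (2, 101), (2, 102)]


def accounts_for(customer_key):
    """Resolve a customer's accounts through the helper table."""
    return [accounts[a] for c, a in customer_account if c == customer_key]


def owners_of(account_key):
    """Resolve an account's owners through the helper table."""
    return [customers[c] for c, a in customer_account if a == account_key]


print(accounts_for(1))  # ['Savings', 'Checking']
print(owners_of(101))   # ['Alice', 'Bob']
```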
Surrogate Keys
• Joins between fact and dimension tables should be based on surrogate keys
• Surrogate keys should not be composed of natural keys glued together
• Users should not obtain any information by looking at these keys
• These keys should be simple integers
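The points above can be illustrated with a small key generator that hands out plain integers and never derives them from the natural key. This is a sketch only; the class name `SurrogateKeyGenerator` and the natural key strings are hypothetical.

```python
import itertools


class SurrogateKeyGenerator:
    """Hand out meaningless integer keys, one per natural key."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._keys = {}

    def key_for(self, natural_key):
        # The same natural key always maps to the same surrogate key.
        if natural_key not in self._keys:
            self._keys[natural_key] = next(self._counter)
        return self._keys[natural_key]


gen = SurrogateKeyGenerator()
print(gen.key_for("CUST-US-0042"))  # 1
print(gen.key_for("CUST-IN-0007"))  # 2
print(gen.key_for("CUST-US-0042"))  # 1
```

Nothing about the customer can be read off the integers 1 and 2, which is the property the slide asks for.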
Why Existing Keys Should Not Be Used
• Keys may be reused after they have been purged, even though they are still used in the warehouse
• A product description or a customer description could be changed without changing the key
• Key formats may be generalized to handle some new situation
• A mistake could be made and a key could be reused
ETL - Extraction, Transformation & Loading
What is ETL?
• ETL (Extraction, Transformation and Loading) is a process by which data is integrated and transformed from the operational systems into the data warehouse environment
[Figure: ETL flow with the components Operational systems, Filters and Extractors, Integrator, Transformation Engine (with Transformation Rules 1 to 3), Cleanser (with Cleaning Rules 1 to 3), Loader and Warehouse, plus an Error step: View, Check, Correct.]
Operational Data - Challenges
• Data from heterogeneous sources
• Format differences
• Data Variations
• Context
– Across locations the same code could represent different customers
– Across periods of time a product code could have been reused
Extraction
[Figure: Extraction from three sources into the Target. Oracle (80 tables): data from 30 tables. Sybase (50 tables): Filter, data from 10 tables where Date < 10/12/99. Text files: data from files.]
Transformation
Source:
Emp id  FirstName  LastName
10001   Indiana    Jones
10002   Sherlock   Holmes
Staging Area:
Name = Concat(First Name, Last Name)
Indiana Jones
Sherlock Holmes
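The concatenation on this slide can be sketched as a step applied to rows in the staging area. The row layout and the lower-case field names are assumptions; a single space is used as the separator, matching the result shown on the slide.

```python
staging = [
    {"emp_id": 10001, "first_name": "Indiana", "last_name": "Jones"},
    {"emp_id": 10002, "first_name": "Sherlock", "last_name": "Holmes"},
]

# Name = Concat(First Name, Last Name), with a space between the two parts.
for row in staging:
    row["name"] = f'{row["first_name"]} {row["last_name"]}'

print([row["name"] for row in staging])
# ['Indiana Jones', 'Sherlock Holmes']
```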
Loading
[Figure: Two paths from Source to Data Warehouse. Direct Load goes straight from the Source to the Data Warehouse. On the other path, raw data is cleaned, transformed and integrated in the Staging Area, and the clean, transformed and integrated data is then loaded.]
Volume of ETL in a Data Warehouse
[Figure: Flow from Source OLTP Systems through the Enterprise Data Warehouse to Data Marts, with the stages: Design, Mapping; Extract, Scrub, Transform; Load, Index, Aggregation; Replication, Data Set Distribution; Access & Analysis, Resource Scheduling & Distribution. Meta Data and System Monitoring run alongside all stages.]
60 to 80% of the work is in ETL
Factors Influencing ETL Architecture
• Volume at each warehouse component.
• The time window available for extraction.
• The extraction type (Full, Periodic, etc.)
• Complexity of the processes at each stage.
Extraction Types
[Figure: Extraction divides into Full Extract and Periodic/Incremental Extract.]
Full Extract
[Figure, two steps: (1) before the Full Extract from the Source System, the Data Mart holds existing data; (2) after it, the Data Mart holds the new data.]
Incremental Extract
[Figure, three steps: (1) an Incremental Extract takes incremental data from the Source System while the Data Mart holds existing data; (2) the incremental data consists of new data and changed data; (3) existing data is updated using the changed data, and the new data is an incremental addition to the Data Mart.]
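Applying an incremental extract can be sketched as an upsert: rows whose key already exists in the data mart are updated, and rows with an unseen key are added. The function `apply_incremental` and the sample rows are hypothetical.

```python
def apply_incremental(data_mart, incremental):
    """Update rows whose key exists, add rows whose key is new.

    Returns a (updated, added) pair of counts.
    """
    updated, added = 0, 0
    for key, row in incremental.items():
        if key in data_mart:
            updated += 1  # changed data replaces the existing row
        else:
            added += 1    # new data is appended to the data mart
        data_mart[key] = row
    return updated, added


data_mart = {1: "existing A", 2: "existing B"}
incremental = {2: "changed B", 3: "new C"}
print(apply_incremental(data_mart, incremental))  # (1, 1)
print(data_mart)  # {1: 'existing A', 2: 'changed B', 3: 'new C'}
```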
Transformation
Data Transformation
• Conversions
– Data type (e.g. Char to Date)
– Bring data to common units (Currency,Measuring Units)
• Classifications
– Changing continuous values to discrete ranges (e.g. Temperatures to
Temperature Ranges)
• Splitting of fields
• Merging of fields
• Aggregations (e.g. Sum, Avg., Count)
• Derivations (Percentages, Ratios, Indicators)
Structural Transformations
• Additive
[Figure: Two flows from OLTP to Data warehouse. Orders arrive every two minutes: Aggregate, Average. Daily productivity figures: Average.]
Format transformation
Data Type Conversions
Source Schema: Age as a String, "32"
Target Schema: Age as an Integer, 32
Splitting
Source Schema: Date as a String, "15-10-1992"
Target Schema: Date as a combination of 3 integer fields, Day 15, Month 10, Year 1992
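Both format transformations can be sketched in a few lines. The function names `age_to_int` and `split_date` are hypothetical, and the date string is assumed to be in day-month-year order as the target fields suggest.

```python
def age_to_int(age_text):
    """Data type conversion: age stored as a string becomes an integer."""
    return int(age_text.strip())


def split_date(date_text):
    """Splitting: a DD-MM-YYYY string becomes (day, month, year) integers."""
    day, month, year = (int(part) for part in date_text.split("-"))
    return day, month, year


print(age_to_int("32"))          # 32
print(split_date("15-10-1992"))  # (15, 10, 1992)
```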
Simple Conversions
• Transformations using Simple Conversions
Source Schema: Revenue in Rupees, Rs. 10000
Multiply by 1/43
Target Schema: Revenue in Dollars, $232.56
Source Schema: Production in Pounds, 1000 lbs.
Multiply by 0.4536
Target Schema: Production in Kilograms, 453.6 kg
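The two unit conversions can be checked with a short sketch. The constants are the factors given on the slide (43 rupees per dollar is the slide's rate, not a current one); the function names are hypothetical.

```python
RUPEES_PER_DOLLAR = 43   # rate used on the slide, not a current exchange rate
KG_PER_POUND = 0.4536    # factor used on the slide


def rupees_to_dollars(rupees):
    return round(rupees / RUPEES_PER_DOLLAR, 2)


def pounds_to_kg(pounds):
    return round(pounds * KG_PER_POUND, 2)


print(rupees_to_dollars(10000))  # 232.56
print(pounds_to_kg(1000))        # 453.6
```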
Classification
Name              Age
John Black        27
Richard Wayne     53
Jennifer Goldman  45
Helmut Koch       37
Anna Ludwig       32
Shito Maketha     28
Tracy Withman     39
Ada Zhesky        25
David Rosenberg   33
Pankaj Sharma     29
Zhu Ling          44
George Kurtz      27
Rita Hartman      34
Grouping
Age Group  Frequency
20-25      1
26-30      4
31-35      3
36-40      2
41-45      2
46-50      0
51-55      1
56-60      0
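The grouping step can be reproduced by assigning each age to its range and counting. The sketch below uses the thirteen ages listed on the slide and treats the range bounds as inclusive; the names `groups` and `age_group` are hypothetical.

```python
from collections import Counter

ages = [27, 53, 45, 37, 32, 28, 39, 25, 33, 29, 44, 27, 34]

# (low, high) bounds of each age group, both inclusive.
groups = [(20, 25), (26, 30), (31, 35), (36, 40),
          (41, 45), (46, 50), (51, 55), (56, 60)]


def age_group(age):
    """Return the label of the range that contains `age`."""
    for low, high in groups:
        if low <= age <= high:
            return f"{low}-{high}"
    raise ValueError(f"age {age} is outside every group")


counts = Counter(age_group(a) for a in ages)
# Report every group, including those with no members.
frequency = {f"{low}-{high}": counts.get(f"{low}-{high}", 0)
             for low, high in groups}
for label, count in frequency.items():
    print(label, count)
```

The frequencies always sum to the number of input rows, which is a quick sanity check on any classification table.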
Data Consistency Transformations
Source 1 Gender: Male – M, Female – F
Source 2 Gender: Male – Male, Female – Female
Source 3 Gender: Male – 1, Female – 2
Target Gender: Male – M, Female – F
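A consistency transformation like this one is often a simple lookup from every known source encoding to the target code. A minimal sketch; `GENDER_MAP` and `standardize_gender` are hypothetical names, and unknown encodings are deliberately left to raise an error so they can be reviewed.

```python
# Every source encoding, normalised to lower-case text, mapped to the target code.
GENDER_MAP = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "2": "F",
}


def standardize_gender(value):
    """Map any known source encoding to the target codes M / F."""
    return GENDER_MAP[str(value).strip().lower()]


print([standardize_gender(v) for v in ["M", "Female", 1, 2]])
# ['M', 'F', 'M', 'F']
```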
Reconciliation of Duplicated Data
• Joe Smith, 123 Maine St., MA - 70127
• Joseph Smith, 123 Maine St., MA - 70127
• J.R. Smith, 123 Maine St., MA - 70127
• Joseph R Smith, 123 Maine St., MA - 70127
Data Aggregation - Design Requirements
• Aggregates must be stored in their own fact tables and each level should have its own fact table
• Dimension tables attached to the aggregate fact tables should, wherever possible, be shrunken versions of the dimension tables attached to the base fact table
• The base fact table and all of its related aggregate fact tables must be associated together as a family of schemas
Loading
Types of Data warehouse Loading
• Target update types
– Insert
– Update
Types of Data Warehouse Updates
[Figure: Source data flows through Data Staging into the Data Warehouse. Update types shown: Insert, Full Replace, Selective Replace, Update, Update plus Retain History. Data shown: Point in Time Snapshots, New Data, Changed Data.]
New Data and Point-In-Time Data Insert
[Figure: New data, or a point-in-time snapshot (e.g. monthly), from the source data is added to the existing data.]
Changed Data Insert
[Figure: Changed data from the source data is added to the existing data.]
Change of Dimension values
When the value of a dimension in a data warehouse changes:
• The history of the change needs to be maintained.
• Changed data alone needs to be identified.
• Changed data should be easier to access.
• Reconstruction of the dimension table at any point in time should be easier.
ETL - Approach in a nutshell
1) Identify the Operational systems based on data islands in the
target
2) Map source-target dependencies.
3) Define cleaning and transformation rules
4) Validate source-target mapping
5) Consolidate Meta data for ETL
6) Draw the ETL architecture
7) Build the cleaning, transformation and auditing routines
using either a tool or customized programs
Meta Data in a Data Warehouse
What is Metadata?
• Data about data and the processes
• Metadata is stored in a data dictionary and repository.
• Insulates the data warehouse from changes in the schema of
operational systems.
• It serves to identify the contents and location of data in the
data warehouse
Why Do You Need Meta Data?
• Share resources
– Users
– Tools
• Document system
• Without meta data
– Not Sustainable
– Not able to fully utilize resource
The Role of Meta Data in the Data Warehouse
Meta Data enables data to become information, because with it you
• Know what data you have
and
• Can trust it!
Meta Data Answers...
How have business definitions and terms changed over time?
How do product lines vary across organizations?
What business assumptions have been made?
How do I find the data I need?
What is the original source of the data?
How was this summarization created?
What queries are available to access the data?
Meta Data Process
• Integrated with entire process and data flow
– Populated from beginning to end
– Begin population at design phase of project
– Dedicated resources throughout
• Build
• Maintain
[Figure: Process flow with the stages: Design, Mapping; Extract, Scrub, Transform; Load, Index, Aggregation; Replication, Data Set Distribution; Access & Analysis, Resource Scheduling & Distribution. Meta Data and System Monitoring run alongside all stages.]
Types of ETL Meta Data
[Figure: ETL Meta data divides into Technical Meta data and Operational Meta data.]
Classification of ETL Meta Data
• Data Warehouse Meta data
This Meta data stores descriptive information about the physical
implementation details of the data warehouse.
• Source Meta data
This Meta data stores information about the source data and
the mapping of source data to data warehouse data
ETL Meta Data
• Transformations & Integrations
This Meta data stores comprehensive information about the
transformation and loading.
• Processing Information
This Meta data stores information about the activities involved in the
processing of data, such as scheduling and archives.
• End User Information
This Meta data records information about the user profile and security.
ETL - Planning for the Movement
The following may be helpful for planning the movement:
• Develop an ETL plan
• Specifications
• Implementation
Data Warehouse Administration
Data Warehouse Administrative Tasks
• Build and maintain the data warehouse
• Maintain the meta data
• Keep the data warehouse up to date
• Tune the data warehouse
• General administrative tasks
Dormant Data
• Data that is hardly ever used in a data warehouse is called dormant data
• The faster a data warehouse grows, the more data becomes dormant. Over time, the amount of dormant data in a data warehouse increases
Origins of Dormant Data
• Storing history data that is not required
• Storing columns that are never used
• Storing detail level data when only summary level data is used
• Creating summary data that is never used
Strategy For Removing Dormant Data
The strategy for removing dormant data might include:
• Removing data after a period of time, say after two years
• Removing summary data that has not been accessed in the past six months
• Removing columns that have never or only very infrequently been accessed
• Storing data for high profile users even though that data has not been accessed
• Storing data for selected accounts even though that data has not been accessed
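A strategy like this can be sketched as a rule over last-access dates with an exemption list for data that must be kept. Everything below is hypothetical: the function `find_dormant`, the table names, the dates, and the 183-day threshold standing in for "six months".

```python
from datetime import date, timedelta


def find_dormant(last_access, today, max_idle_days=183, keep=()):
    """Return names idle longer than `max_idle_days`, minus exemptions."""
    cutoff = today - timedelta(days=max_idle_days)
    return sorted(name for name, accessed in last_access.items()
                  if accessed < cutoff and name not in keep)


# Hypothetical last-access log, one entry per summary table.
last_access = {
    "sales_summary_2001": date(2003, 1, 15),
    "sales_summary_2004": date(2004, 6, 1),
    "vip_accounts_2001": date(2002, 12, 1),
}

# The VIP table is idle too, but it is exempt from removal.
dormant = find_dormant(last_access, today=date(2004, 7, 1),
                       keep={"vip_accounts_2001"})
print(dormant)  # ['sales_summary_2001']
```

In practice the last-access information would have to come from query logs or a monitoring tool; the sketch only shows the selection rule.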
Tuning a Data Warehouse
Some of the techniques that can be used for tuning a data warehouse are:
• Handling dormant data
• Storing pre-summarized data based on data usage patterns
• Creating indexes for data that is frequently used
• Merging tables that have common and regular access