Theory, Practice & Methodology of Relational Database Design and Programming
Copyright © Ellis Cohen 2002-2006
Introduction to Data Warehouse Design
These slides are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.
For more information on how you may use them, please see http://www.openlineconsult.com/db
© Ellis Cohen, 2003-2006 2
Topics
  Overview
  Star Schema: Fact & Dimension Tables
  The Star Schema & Denormalization
  The Data Cube
  ETL: Extraction, Transformation & Loading
Overview
Data Warehousing & Data Mining
Data Warehousing
  Techniques for representing & querying large amounts of relatively static data
  Potentially stored in Multi-Dimensional Databases
  On-line Analysis & Decision Support
Data Mining
  Automated analysis: Discovering (potentially) unexpected patterns in large amounts of data
Operational vs Analytical DBs
Operational Database
  Data needed and updated constantly to directly support business operations
  Focus on OLTP (on-line transaction processing): Transactional access & modification of relatively small # of data points at a time
Analytical Database: Data Warehouse & Data Mart
  Copious amounts of relatively static data, culled & integrated across enterprise, cleansed & summarized, maintained historically, used for decision support and business intelligence (BI)
  Focus on OLAP (on-line analytical processing): Querying large amounts of data, scheduled modifications
Operational vs Analytical DBs
                Operational                     Warehouse
Usage           Transactional (OLTP)            Analytical (OLAP)
Organized for   Modifications                   Queries
Modifications   Continual                       Periodic
Queries         Narrow-scope, Low-complexity    Broad-scope, High-complexity
Database        Relational                      Relational/Dimensional
Data            Normalized                      Denormalized, Aggregated & Derived
Central Data Warehouse
(from Oracle 9i Data Warehousing Guide)
Warehouse Questions
How many red Bally shoes did we sell by region in the third quarter of each of the last 5 years?
What are the top 25 selling products by category and region for this past quarter?
What percent of the market do we own for each product we make?
Which of our customers' zipcodes were responsible for the top 10% of total sales over the last year?
Star Schema: Fact & Dimension Tables
Star Schema
Data Warehouses are organized using Star Schema models
A Star Schema has a central fact table, with a composite primary key, which references multiple Dimension tables
  Stores (Dimension): storid, …
  Products (Dimension): prodid, …
  DailySales (Fact): storid, prodid, date, price, units
    storid, prodid: foreign keys to the dimension tables
    price, units: Measures (what each fact measures)
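The star schema above can be sketched as runnable code. This is only an illustration, assuming Python's built-in sqlite3 module in place of the slides' Oracle database; the column types, the sample rows, and the helper name insert_duplicate_fact are assumptions, not part of the slides.

```python
import sqlite3

# Hypothetical in-memory database; column types and sample rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Stores   (storid INTEGER PRIMARY KEY, stornam TEXT);
CREATE TABLE Products (prodid INTEGER PRIMARY KEY, prodnam TEXT);
CREATE TABLE DailySales (
    storid INTEGER REFERENCES Stores(storid),     -- foreign key to a dimension
    prodid INTEGER REFERENCES Products(prodid),   -- foreign key to a dimension
    date   TEXT,
    price  REAL,      -- measure
    units  INTEGER,   -- measure
    PRIMARY KEY (storid, prodid, date)            -- composite primary key
);
""")
conn.execute("INSERT INTO Stores VALUES (1, 'Pittsburgh')")
conn.execute("INSERT INTO Products VALUES (10, 'Beanie Baby')")
conn.execute("INSERT INTO DailySales VALUES (1, 10, '2005-02-12', 4.99, 3)")


def insert_duplicate_fact():
    """Try to record a second fact for the same store, product and date."""
    try:
        conn.execute("INSERT INTO DailySales VALUES (1, 10, '2005-02-12', 4.99, 5)")
        return True
    except sqlite3.IntegrityError:
        return False  # rejected by the composite primary key
```

The composite key is what makes each combination of dimension values identify exactly one fact, so the second insert is rejected.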
Subjects (Facts) & Dimensions
Instead of thinking about entities & relationships, design a data warehouse by thinking about
Subjects (represented by fact tables)
Sales, Distribution, Purchases
Dimensions (represented by dimension tables)
How to uniquely identify the facts about each subject
– Sales: Product, Stores, Dates (maybe also Employee, Customer: depends what you want to analyze)
– Distribution: Warehouses, Products, Stores, Dates (maybe Employees & Trucks)
– Purchases: Products, Vendors, Dates (maybe also Employees)
Fact & Dimension Tables
Fact Tables
  Composite primary key
    • identify dimensions
    • uniquely identify each fact (or measurement)
  Additional attributes: measures
    • what is measured about each fact
Dimension Tables
  Primary key
    Surrogate key uniquely identifies each dimension value
  Additional attributes
    Properties of each dimension value
Dimensions & Granularity
Dimensions have different levels of granularity
  Stores dimension (finest to coarsest): Stores, Districts, Regions
  Products dimension (finest to coarsest): Products, ProductTypes, SubCategories, Categories
  ProductTypes also roll up to Manufacturers
Snowflake Schema (with Normalized Dimensions)
  DailySales (Fact): storid, prodid, date, price, units
  Stores (Dimension): storid, stornam, city, state, distid
  Districts: distid, distnam, distarea, regid
  Regions: regid, regnam
  Products (Dimension): prodid, color, size, prodtyp
  ProductTypes: prodtyp, prodnam, prodescr, subcatid, manfid
  SubCategories: subcatid, subnam, subdescr, catid
  Categories: catid, catnam, catdescr
  Manufacturers: manfid, manfnam
Typical Warehouse Query
How many red Bally shoes did we sell in each region in 2002?
  SELECT r.regnam AS region, sum(f.units) AS sumunits
  FROM DailySales f
    NATURAL JOIN Stores
    NATURAL JOIN Districts
    NATURAL JOIN Regions r
    NATURAL JOIN Products p
    NATURAL JOIN ProductTypes
    NATURAL JOIN SubCategories s
    NATURAL JOIN Manufacturers m
  WHERE to_char(f.date,'YYYY') = '2002'
    AND p.color = 'red'
    AND m.manfnam = 'Bally'
    AND s.subnam = 'Shoe'
  GROUP BY r.regnam
The Star Schema & Denormalization
Snowflake Schema is Normalized
Snowflake Schema has normalized dimension tables
• Each dimension is represented by multiple sub-dimension tables at different levels of granularity (Product, ProductType, Category, etc.)
• Each sub-dimension table has attributes appropriate to the level of granularity
– Product: color, size
– ProductType: prodnam, prodescr
– etc.
Denormalization
Normalized:
  Products (Dimension): prodid, color, size, prodtyp
  ProductTypes: prodtyp, prodnam, prodescr, subcatid, manfid
  SubCategories: subcatid, subnam, subdescr, catid
  Categories: catid, catnam, catdescr
  Manufacturers: manfid, manfnam
Denormalized:
  Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Why is there redundancy here?
Star Schema is Denormalized
The Star Schema has denormalized dimension tables
• Each dimension is formed by joining together its sub-dimension tables into a single dimension table
• The dimension table has attributes at different levels of granularity
• The dimension tables contain lots of redundancy, but queries use far fewer joins
• Does not dramatically impact space: dimension tables usually < 1% size of fact table (but some descriptions may need to be stored separately)
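A minimal sketch of this denormalization step, again assuming Python's sqlite3 module rather than the slides' Oracle setup. It uses a reduced subset of the sub-dimension tables (no SubCategories or Categories); the table name SnowProducts and all sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized (snowflake) sub-dimension tables; sample rows are invented.
CREATE TABLE Manufacturers (manfid INTEGER PRIMARY KEY, manfnam TEXT);
CREATE TABLE ProductTypes  (prodtyp INTEGER PRIMARY KEY, prodnam TEXT, manfid INTEGER);
CREATE TABLE SnowProducts  (prodid INTEGER PRIMARY KEY, color TEXT, size TEXT,
                            prodtyp INTEGER);

INSERT INTO Manufacturers VALUES (1, 'Bally');
INSERT INTO ProductTypes  VALUES (7, 'Loafer', 1);
INSERT INTO SnowProducts  VALUES (100, 'red', '9', 7), (101, 'black', '9', 7);

-- Denormalize: join the sub-dimension tables into one wide dimension table.
CREATE TABLE Products AS
SELECT prodid, color, size, prodtyp, prodnam, manfid, manfnam
FROM SnowProducts NATURAL JOIN ProductTypes NATURAL JOIN Manufacturers;
""")

rows = conn.execute(
    "SELECT prodid, prodnam, manfnam FROM Products ORDER BY prodid"
).fetchall()
```

Both product rows now repeat the same prodnam and manfnam values, which is the redundancy the slide refers to; in exchange, queries against Products need no further joins.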
Star Schema (Fully Denormalized Dimensions)
  DailySales (Fact): storid, prodid, date, price, units
  Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
  Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
Maybe catdescr not included here if it is a GIF or a 4000 byte description
Why should date (in DailySales) be replaced by a dateid?
Query with Denormalized Schema
How many red Bally shoes did we sell in each region in 2002?
  SELECT s.regnam AS region, sum(f.units) AS sumunits
  FROM DailySales f
    NATURAL JOIN Stores s
    NATURAL JOIN Products p
  WHERE to_char(f.date,'YYYY') = '2002'
    AND p.color = 'red'
    AND p.manfnam = 'Bally'
    AND p.subnam = 'Shoe'
  GROUP BY s.regnam
Costly: the to_char(f.date,'YYYY') = '2002' test
Typical Date Dimension Attributes
Requires Month + Year to identify a month within a year. Might want to add a single MonthYr field to represent the pair
  Field        Example Value
  Year         2005
  Month        Feb
  Quarter      1
  DayOfMonth   12
  DayOfYear    43
  WeekOfYear   7
  DayOfWeek    Sat
Note: Quarter is less granular than Month. Also, DayOfYear, WeekOfYear & DayOfWeek can be derived from the other fields
It is common and almost always more efficient to treat Dates as a dimension with a number of attributes
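As a rough illustration of how a row of the date dimension could be derived from a single calendar date, here is a Python sketch. The function name date_dimension_row is hypothetical, and the week numbering (week 1 starts on Jan 1 and weeks are 7-day blocks) is an assumption chosen because it reproduces the slide's example value of 7; other conventions such as ISO weeks give different numbers.

```python
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def date_dimension_row(d: date) -> dict:
    """Derive the date dimension attributes for one calendar date."""
    day_of_year = d.timetuple().tm_yday
    month = MONTHS[d.month - 1]
    return {
        "Year": d.year,
        "Month": month,
        "Quarter": (d.month - 1) // 3 + 1,
        "DayOfMonth": d.day,
        "DayOfYear": day_of_year,
        # Assumption: week 1 starts on Jan 1 (not ISO week numbering).
        "WeekOfYear": (day_of_year - 1) // 7 + 1,
        "DayOfWeek": DAYS[d.weekday()],   # weekday(): Monday is 0
        "MonthYr": f"{month}{d.year}",    # single field for the Month + Year pair
    }


row = date_dimension_row(date(2005, 2, 12))
```

For Feb 12, 2005 this yields the example values in the table above, plus the MonthYr value Feb2005.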
Extended Date Dimension Hierarchy
  Date (e.g. Feb 12, 2005)
  DayOfWeek (e.g. Sat)
  WeekYr (e.g. 2005Wk7)
  MonthYr (e.g. Feb2005)
  QuarterYr (e.g. 2005Q1)
  Year (e.g. 2005)
  Quarter (e.g. 1)
  Month (e.g. Feb)
  WeekOfYear (e.g. 7)
  DayOfYear (e.g. 43)
  DayOfMonth (e.g. 12)
Star Schema with Date Dimension
In general, represent dates by a Dates dimension table
  DailySales (Fact): storid, prodid, dateid, price, units
  Stores (Dimension): storid, stornam, city, state, distid, distnam, distarea, regid, regnam
  Products (Dimension): prodid, color, size, prodtyp, prodnam, prodescr, manfid, manfnam, subcatid, subnam, subdescr, catid, catnam, catdescr
  Dates (Dimension): dateid, date, dayofweek, dayofmonth, dayofyear, weekyr, weekofyear, monthyr, month, quarteryr, quarter, year
Query using Dates Dimension
How many red Bally shoes did we sell in each region in 2002?
  SELECT s.regnam AS region, sum(f.units) AS sumunits
  FROM DailySales f
    NATURAL JOIN Stores s
    NATURAL JOIN Products p
    NATURAL JOIN Dates d
  WHERE d.year = 2002
    AND p.color = 'red'
    AND p.manfnam = 'Bally'
    AND p.subnam = 'Shoe'
  GROUP BY s.regnam
Needs an extra join, but simpler query. Executes faster if Dates is indexed by year
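To make the query concrete, the sketch below runs it against a tiny star schema, assuming Python's sqlite3 module instead of Oracle. The tables carry only the columns the query touches, the index name dates_year and every sample row are invented, and an ORDER BY is added purely so the result order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced star schema: only the columns the query needs (sample data invented).
CREATE TABLE Stores   (storid INTEGER PRIMARY KEY, regnam TEXT);
CREATE TABLE Products (prodid INTEGER PRIMARY KEY, color TEXT, manfnam TEXT, subnam TEXT);
CREATE TABLE Dates    (dateid INTEGER PRIMARY KEY, year INTEGER);
CREATE INDEX dates_year ON Dates(year);   -- "indexed by year"
CREATE TABLE DailySales (storid INTEGER, prodid INTEGER, dateid INTEGER,
                         price REAL, units INTEGER,
                         PRIMARY KEY (storid, prodid, dateid));

INSERT INTO Stores   VALUES (1, 'East'), (2, 'West');
INSERT INTO Products VALUES (10, 'red', 'Bally', 'Shoe'), (11, 'blue', 'Bally', 'Shoe');
INSERT INTO Dates    VALUES (500, 2002), (501, 2003);
INSERT INTO DailySales VALUES
    (1, 10, 500, 90.0, 4),   -- red Bally shoe, East, 2002
    (2, 10, 500, 90.0, 2),   -- red Bally shoe, West, 2002
    (1, 11, 500, 80.0, 7),   -- blue: filtered out
    (1, 10, 501, 95.0, 9);   -- 2003: filtered out
""")

QUERY = """
SELECT s.regnam AS region, sum(f.units) AS sumunits
FROM DailySales f NATURAL JOIN Stores s NATURAL JOIN Products p NATURAL JOIN Dates d
WHERE d.year = 2002 AND p.color = 'red' AND p.manfnam = 'Bally' AND p.subnam = 'Shoe'
GROUP BY s.regnam
ORDER BY s.regnam
"""
result = conn.execute(QUERY).fetchall()
```

Only the two red 2002 facts survive the filters, giving one total per region.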
The Data Cube
Data Cube Representation
[Figure: Sales Cube with three axes: Products dimension, Stores dimension (Pgh, NYC), Dates dimension. One cell is labeled "Sales of Beanie Babies in Pittsburgh Store Today", another "Sales of Beanie Babies in Pittsburgh Store Yesterday"; a slice is labeled "All Sales (of all products over time) in NYC Store".]
Data Cube Characteristics
Each axis represents a dimension
– Elements along axis are at lowest granularity for that dimension
Measures are the data within the cells at intersections of the cube
– Information about the topic of the cube
– e.g. units & price for each sales fact (i.e. sales in a store of a product on a date)
Data Cube Views
Slice
  View data relative to a point in one or more dimensions
    View sales today (for each store & each product category)
    View Bally shoe sales at the NYC store (for each date)
Dice
  View data relative to (sets of) ranges in one or more dimensions
    View sales for the last 4 days (for each store & each product category)
    View sales for each type of shoes at all the NY and NJ stores for each of the last 10 quarters
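Slice and dice can be illustrated with a toy cube held in a Python dictionary. This is only a sketch: the cell values are made up, the function names slice_cube and dice_cube are hypothetical, and a real multidimensional database would use a far more efficient representation.

```python
# Toy sales cube: key = (store, product, date), value = units sold.
# Only cells with sales are stored; all values are made up.
cube = {
    ("Pgh", "Beanie Baby", "2005-02-11"): 5,
    ("Pgh", "Beanie Baby", "2005-02-12"): 3,
    ("NYC", "Beanie Baby", "2005-02-12"): 8,
    ("NYC", "Bally Shoe",  "2005-02-12"): 2,
}


def slice_cube(cube, store=None, product=None, date=None):
    """Slice: fix a single point in one or more dimensions."""
    return {
        key: units for key, units in cube.items()
        if (store is None or key[0] == store)
        and (product is None or key[1] == product)
        and (date is None or key[2] == date)
    }


def dice_cube(cube, stores=None, products=None, dates=None):
    """Dice: restrict one or more dimensions to a set (or range) of values."""
    return {
        key: units for key, units in cube.items()
        if (stores is None or key[0] in stores)
        and (products is None or key[1] in products)
        and (dates is None or key[2] in dates)
    }


# Slice: all sales (of all products over time) in the NYC store.
nyc_units = sum(slice_cube(cube, store="NYC").values())
# Dice: Pittsburgh sales over a two-day range.
pgh_recent = dice_cube(cube, stores={"Pgh"}, dates={"2005-02-11", "2005-02-12"})
```

A slice pins one value per chosen dimension, while a dice accepts a set of values per dimension, so a slice is the special case of a dice with one-element sets.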
MDDB: MultiDimensional DataBase
Knows about Fact & Dimension Tables
Uses direct (n dimensional) hypercube representation to provide fast access to fact elements in query
Supports sparse representations
– The Pittsburgh store doesn't sell lingerie
– The Cape Cod store is not open in the winter
– Baked Beanie Babies are only sold in the NE region
Uses specialized query language
  e.g. MDX (used by Microsoft OLAP Server)
  with basic data types: cube, slice, dice
ETL: Extraction, Transformation & Loading
ETL: Extraction, Transformation & Loading
80% of total cost of building warehouse
Extraction => Transformation => Loading
Extraction
Sources
  Multiple DB's
  Flat Files
  External Data Sources
    • e.g. Census, Geographic, Weather, Financial, Unemployment Data
    • Standard DB/Spreadsheet format or semi-structured data from the web
Frequency
  Periodic (hourly, daily, weekly, …)
  Triggered
    • Single event
    • #, sequence, pattern of events
Mechanisms
  Snapshots / Materialized Views / Replication
  Database Triggers
  Process Logs
  Query Sources (full vs incremental)
Transformation
Cleaning
  Scrubbing
  Filtering
  Conformance
Integration
  Renaming
  Fusion & Merging
  Determine Surrogate Keys
  Timestamping
  Summarization
Schema Organization
  Dimension Tables
  Pre-Aggregation via Materialized Views
  Derivation
(Transformation) Cleaning
Scrubbing
  Use domain-specific knowledge
  e.g. SS#, phone-number, zipcode
Filtering
  Check for inconsistent data
  Use data validation rules
Conformance
  Map similarly typed data to standard representation
  Convert
    units (inch => cm, $ => euro)
    scale (mm => cm)
    formats (string => integer, string with/wo $)
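The cleaning steps above might look roughly like the following in Python. All function names are hypothetical, the zipcode rule assumes US 5-digit or ZIP+4 codes, and the validation rule is just one invented example of a data validation rule.

```python
import re


def scrub_zipcode(raw: str):
    """Scrubbing: accept 5-digit or ZIP+4 codes and return the 5-digit part."""
    match = re.fullmatch(r"\s*(\d{5})(?:-\d{4})?\s*", raw)
    return match.group(1) if match else None


def parse_amount(raw: str) -> float:
    """Format conformance: strings with or without '$' become numbers."""
    return float(raw.strip().lstrip("$").replace(",", ""))


def inches_to_cm(inches: float) -> float:
    """Unit conformance: inch => cm (1 inch is exactly 2.54 cm)."""
    return inches * 2.54


def is_valid_sale(row: dict) -> bool:
    """Filtering: an example validation rule that rejects inconsistent rows."""
    return row["units"] > 0 and row["price"] >= 0
```

For instance, scrub_zipcode("15213-3890") returns "15213", while a malformed value such as "1521" returns None and can be routed to an error report instead of the warehouse.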
(Transformation) Integration
Renaming
  Resolve name conflicts
Fusion - e.g. merge
  – properties in city db
  – properties in developer lists
Determine Surrogate Keys
  Do not use keys from operational data as primary key in warehouse data
Timestamping
  Add timestamps to fact data where missing to enable historical queries
Reorganization & Evolution
  Support Data Reorganization & Schema Evolution
Summarization
  Summarize original operational data and combine into less detailed tables
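One way to picture surrogate key assignment is a lookup that hands out warehouse keys independent of the operational keys. The class SurrogateKeyMap, the source names, and the key values below are all hypothetical; this sketch also does not attempt fusion (deciding that records from two sources describe the same property), which would have to happen before keys are assigned.

```python
from itertools import count


class SurrogateKeyMap:
    """Maps (source system, operational key) pairs to warehouse surrogate keys."""

    def __init__(self):
        self._next_key = count(1)
        self._keys = {}

    def key_for(self, source: str, operational_key: str) -> int:
        natural = (source, operational_key)
        if natural not in self._keys:
            # First time this operational key is seen: issue a new warehouse key.
            self._keys[natural] = next(self._next_key)
        return self._keys[natural]


keys = SurrogateKeyMap()
a = keys.key_for("city_db", "P-17")
b = keys.key_for("developer_list", "P-17")  # same operational key, other source
c = keys.key_for("city_db", "P-17")         # seen before: same surrogate key
```

Because the warehouse key never reuses the operational value, two sources that happen to use the same key for different things cannot collide.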
Integration (Data Reorganization)
What do we do when attributes change?
Suppose districts are reorganized and a store is now part of a different district
Consistently changing mapping of store to district
  – Allows new and old data to be compared reasonably by district
  – But causes incorrect comparisons by district among older data alone
Solutions
  1. Keep fields for both old and new mapping -- in fact, potentially a separate field for each reorganization
  2. Add effective date to store dimension. Have multiple rows for same store - each with different effective date
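Solution 2 can be sketched as follows in Python. The rows, district names, and the helper district_on are invented for illustration; a real warehouse would typically store these rows in the store dimension table and resolve them in SQL.

```python
from datetime import date

# Store dimension with an effective date: several rows for the same store.
# Each row is (storid, district, effective_date); all values are made up.
store_rows = [
    (1, "North", date(2000, 1, 1)),
    (1, "Central", date(2004, 7, 1)),   # store 1 moved in a reorganization
    (2, "North", date(2000, 1, 1)),
]


def district_on(storid: int, day: date):
    """Return the district a store belonged to on a given day, or None."""
    candidates = [
        (effective, district)
        for sid, district, effective in store_rows
        if sid == storid and effective <= day
    ]
    # The row with the latest effective date not after `day` applies.
    return max(candidates)[1] if candidates else None
```

With this layout, a sale dated 2003 for store 1 is attributed to North and a sale dated 2005 to Central, so comparisons among older data stay correct.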
(Integration) Summarization
Operational tables
  CustomerTransaction: transid, custid, empid, posid, time
  ItemPurchase: transid, lineno, prodid, price, units
  PointOfSaleTerminals: posid, postyp, storid, loc
Summarized fact table
  DailySales (Fact): storid, prodid, date, price, units
Might build different fact tables for different purposes:
  e.g. ones involving Customers
  ones involving Store Locations
Tradeoff
  Smaller Fact Tables vs. Missed Relationships
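A possible shape for this summarization, sketched with Python's sqlite3 module. The operational tables are trimmed to the columns needed, the sample rows are invented, and taking max(price) as the daily price is an assumption (it presumes the price does not vary within a day); the slides do not say how price is summarized.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Operational tables, reduced to the columns used here (sample rows invented).
CREATE TABLE PointOfSaleTerminals (posid INTEGER PRIMARY KEY, storid INTEGER);
CREATE TABLE CustomerTransaction  (transid INTEGER PRIMARY KEY, custid INTEGER,
                                   posid INTEGER, time TEXT);
CREATE TABLE ItemPurchase (transid INTEGER, lineno INTEGER, prodid INTEGER,
                           price REAL, units INTEGER,
                           PRIMARY KEY (transid, lineno));

INSERT INTO PointOfSaleTerminals VALUES (1, 100), (2, 100);
INSERT INTO CustomerTransaction VALUES
    (1, 7, 1, '2005-02-12 09:15:00'),
    (2, 8, 2, '2005-02-12 17:40:00');
INSERT INTO ItemPurchase VALUES (1, 1, 55, 4.99, 2), (2, 1, 55, 4.99, 3);

-- Summarize to one row per store, product and day; customer detail is dropped.
CREATE TABLE DailySales AS
SELECT t.storid AS storid, i.prodid AS prodid, date(c.time) AS date,
       max(i.price) AS price,      -- assumption: price is constant within a day
       sum(i.units) AS units
FROM ItemPurchase i
  JOIN CustomerTransaction c ON c.transid = i.transid
  JOIN PointOfSaleTerminals t ON t.posid = c.posid
GROUP BY t.storid, i.prodid, date(c.time);
""")

daily = conn.execute("SELECT storid, prodid, date, units FROM DailySales").fetchall()
```

Two purchases by different customers at different terminals collapse into one DailySales row, which shows the tradeoff: the fact table is smaller, but the relationship to customers is no longer available.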
Loading
Alternatives
  – Incremental vs Full Refresh: most data is incrementally added to the warehouse
  – Off-line vs on-line
  – Frequency
    • Nightly
    • Weekly
    • Monthly
  – All-at-once vs Staged
What indices to create or drop?
What statistics to collect (& use)?
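The difference between a full refresh and an incremental load can be shown with a deliberately small Python sketch. The functions and the use of a date string as a high-water mark are assumptions for illustration; the dictionaries stand in for source and warehouse tables keyed by day.

```python
def full_refresh(source_rows: dict) -> dict:
    """Full refresh: rebuild the warehouse table from all source rows."""
    return dict(source_rows)


def incremental_load(warehouse: dict, source_rows: dict, last_loaded: str) -> str:
    """Incremental: add only rows dated after the last load.

    Returns the new high-water mark (the latest day now in the warehouse).
    """
    new_rows = {day: units for day, units in source_rows.items() if day > last_loaded}
    warehouse.update(new_rows)
    return max(new_rows, default=last_loaded)


# Made-up daily unit totals keyed by ISO date strings (which sort chronologically).
source = {"2005-02-10": 4, "2005-02-11": 6, "2005-02-12": 3}
warehouse = {"2005-02-10": 4}
mark = incremental_load(warehouse, source, last_loaded="2005-02-10")
```

The incremental path touches only the two new days, whereas full_refresh(source) would copy every row again.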
Constellation Schema
Data warehouses often are designed as constellations
  • Multiple fact tables
  • Shared/related dimension tables
Examples
  – Sales: store, product, date
  – Distribution: distributor, store, product, carrier, period
  – Advertising: store, medium, product, period
Query across same or related dimensions
  – Compare advertising and sales by store within various periods
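A sketch of such a cross-fact comparison, assuming Python's sqlite3 module. The column names (units, cost), the sample rows, and the simplification that Sales has already been rolled up from dates to periods are all assumptions. Each fact table is aggregated to the shared dimensions first and only then joined, so rows from one fact table do not multiply rows from the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Two fact tables sharing the store, product and period dimensions (data invented).
CREATE TABLE Sales       (storid INTEGER, prodid INTEGER, period TEXT, units INTEGER);
CREATE TABLE Advertising (storid INTEGER, medium TEXT, prodid INTEGER,
                          period TEXT, cost REAL);

INSERT INTO Sales VALUES (1, 10, '2005Q1', 5), (1, 11, '2005Q1', 3), (2, 10, '2005Q1', 4);
INSERT INTO Advertising VALUES
    (1, 'radio', 10, '2005Q1', 100.0),
    (1, 'print', 10, '2005Q1', 50.0),
    (2, 'radio', 10, '2005Q1', 70.0);
""")

QUERY = """
SELECT s.storid, s.period, s.units, a.cost
FROM (SELECT storid, period, sum(units) AS units
      FROM Sales GROUP BY storid, period) s
JOIN (SELECT storid, period, sum(cost) AS cost
      FROM Advertising GROUP BY storid, period) a
  ON a.storid = s.storid AND a.period = s.period
ORDER BY s.storid, s.period
"""
comparison = conn.execute(QUERY).fetchall()
```

Each result row pairs a store's total units sold with its total advertising cost for the same period.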
Data Marts
Store different fact tables (or different groups of fact tables) in separate data marts
Data Mart Architectures
Subset of Data Warehouse
Meets needs of subgroup of users
• Top-down:
  – Extracted from Data Warehouse
  – Problem: early availability
• Bottom-up:
  – Built directly from staging area
  – Can be combined to form warehouse
  – Problem: Conformance. ETL tool must provide metadata
• Hybrid:
  – Some data marts built directly from staging area
  – Others extracted from Data Warehouse
Metadata Management
Identify & define each attribute
  – Source(s)
  – Transformation(s) applied
  – How aggregated
  – Description of what it represents
  – Relationships to other attributes
  – History
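As one possible way to record this metadata, the items above map naturally onto a small record type. The class AttributeMetadata, its field names, and the example entry are hypothetical; actual ETL tools keep this information in their own repositories.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeMetadata:
    """Metadata kept for one warehouse attribute (field names are illustrative)."""
    name: str
    description: str                                      # what it represents
    sources: list                                         # source(s)
    transformations: list = field(default_factory=list)   # transformation(s) applied
    aggregation: str = ""                                 # how aggregated
    related_to: list = field(default_factory=list)        # relationships to other attributes
    history: list = field(default_factory=list)           # changes over time


# Example entry for the units measure of the DailySales fact table.
units_meta = AttributeMetadata(
    name="DailySales.units",
    description="Units of a product sold in a store on a date",
    sources=["ItemPurchase.units"],
    transformations=["summed per store, product and day"],
    aggregation="sum",
    related_to=["DailySales.price"],
)
```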