Data Warehousing

Dale-Marie Wilson, Ph.D.

Evolution of Data Warehousing

Since the 1970s, organizations have gained competitive advantage through:
• Automated business processes
• More efficient and cost-effective services to the customer

This resulted in the accumulation of growing amounts of data in operational databases

Evolution of Data Warehousing

Increased focus on ways to use operational data to support decision-making
• Means of gaining competitive advantage

Operational systems not designed to support such business activities
• Typically numerous operational systems with overlapping and contradictory definitions

Organizations need to turn archives of data into source of knowledge
• Goal: single integrated / consolidated view of organization’s data presented to user

Solution: Data Warehouse
• Provides system capable of supporting decision-making, receiving data from multiple operational data sources

Data Warehousing Concepts

A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process (Inmon, 1993)

Subject-oriented Data

Warehouse organized around major subjects of the enterprise (e.g. customers, products, and sales)
• Not major application areas (e.g. customer invoicing, stock control, and product sales)

Stores decision-support data, not application-oriented data

Integrated Data

Integrates corporate application-oriented data from different source systems
• Includes inconsistent data

Integrated data source made consistent
• Presents unified view of data to users

Time-variant Data

Data accurate and valid only at some point in time or over some time interval

Time-variance shown in:
• Extended time data held
• Implicit/explicit association of time with data
• Data represents series of snapshots

Non-volatile Data

Data not updated in real time

Refreshed from operational systems on regular basis

New data added as supplement not replacement

Data Webhouse

Web is source of behavioral data
• Clickstream – user’s path through Web site and Web history

Data webhouse is a distributed data warehouse with no central data repository that is implemented over the Web to harness clickstream data

Benefits of Data Warehouse

Potential high returns on investment

Competitive advantage

Increased productivity of corporate decision-makers

Comparison of OLTP Systems and Data Warehousing

Data Warehouse Queries

Queries
• Range from relatively simple to highly complex
• Dependent on end-user access tools used

End-user access tools:
• Reporting, query, and application development tools
• Executive information systems (EIS)
• OLAP tools
• Data mining tools

Examples of Typical Data Warehouse Queries

What was the total revenue for Scotland in the third quarter of 2004?

What was the total revenue for property sales for each type of property in Great Britain in 2003?

What are the three most popular areas in each city for the renting of property in 2004 and how does this compare with the figures for the previous two years?

What is the monthly revenue for property sales at each branch office, compared with rolling 12-monthly prior figures?

What would be the effect on property sales in the different regions of Britain if legal costs went up by 3.5% and Government taxes went down by 1.5% for properties over £100,000?

Which type of property sells for prices above the average selling price for properties in the main cities of Great Britain and how does this correlate to demographic data?

What is the relationship between the total annual revenue generated by each branch office and the total number of sales staff assigned to each branch office?
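The simplest of these questions can be sketched as a single aggregate query. The example below is only an illustration: the `property_sale` table, its columns, and the sample rows are assumptions, not the DreamHome schema, and SQLite stands in for the warehouse DBMS.

```python
import sqlite3

# Hypothetical, simplified table: one row per property sale.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE property_sale (region TEXT, sale_date TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO property_sale VALUES (?, ?, ?)",
    [
        ("Scotland", "2004-07-15", 1200.0),
        ("Scotland", "2004-09-30", 800.0),
        ("Scotland", "2004-10-01", 500.0),  # fourth quarter, excluded
        ("England", "2004-08-01", 3000.0),  # other region, excluded
    ],
)


def total_revenue(conn, region, start, end):
    """Sum revenue for one region over the half-open date range [start, end)."""
    row = conn.execute(
        "SELECT COALESCE(SUM(revenue), 0) FROM property_sale "
        "WHERE region = ? AND sale_date >= ? AND sale_date < ?",
        (region, start, end),
    ).fetchone()
    return row[0]


# Third quarter of 2004 runs from 1 July up to (not including) 1 October.
q3_scotland = total_revenue(conn, "Scotland", "2004-07-01", "2004-10-01")
print(q3_scotland)
```

The later questions (rolling comparisons, what-if analysis, correlation with demographic data) need OLAP or data mining tools rather than a single query of this kind.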

Problems of Data Warehousing

Underestimation of resources for data loading

Hidden problems with source systems

Required data not captured

Increased end-user demands

Data homogenization

High demand for resources

Data ownership

High maintenance

Long duration projects

Complexity of integration

Typical Architecture of Data Warehouse

Operational Data Resources

Mainframe first-generation hierarchical and network databases

Departmental proprietary file systems (e.g. VSAM, RMS)

Relational DBMSs (e.g. Informix, Oracle)

Private workstations and servers

External systems
• Internet
• Commercially available databases
• Databases associated with organization’s suppliers or customers

Operational Data Store (ODS)

Repository of current and integrated operational data used for analysis

Structured and supplied with data like data warehouse

May act as staging area for data to be moved into warehouse

Created when legacy operational systems incapable of achieving reporting requirements

Benefits:
• Provides users with ease-of-use of relational database
• Remains distant from decision support functions of data warehouse

Load Manager

Performs operations associated with extraction and loading of data

Size and complexity vary between data warehouses

Constructed using combination of vendor data loading tools and custom-built programs

Warehouse Manager

Performs operations associated with management of data

Constructed using vendor data management tools and custom-built programs

Warehouse Manager

Operations:
• Data analysis to ensure consistency
• Transformation and merging of source data from temporary storage
• Creation of indexes and views on base tables
• Generation of denormalizations (if necessary)
• Generation of aggregations (if necessary)
• Backing-up and archiving data

Warehouse Manager

Generates query profiles to determine which indexes and aggregations are appropriate

Query profile
• Can be generated for each user, group of users, or the data warehouse
• Describes characteristics of queries: frequency, target table(s), size of results set
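A query profile of this kind can be sketched as a summary over a query log. This is a minimal illustration only: the `QueryLogEntry` structure, the `build_profile` function, and the sample log are hypothetical, and a real warehouse would take these figures from its monitoring tools.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class QueryLogEntry:
    user: str
    tables: tuple      # tables the query touched
    result_rows: int   # size of the results set


def build_profile(entries):
    """Summarize frequency, target tables, and average result size."""
    entries = list(entries)
    table_counts = Counter(t for e in entries for t in e.tables)
    total_rows = sum(e.result_rows for e in entries)
    return {
        "frequency": len(entries),
        "target_tables": dict(table_counts),
        "avg_result_rows": total_rows / len(entries) if entries else 0.0,
    }


log = [
    QueryLogEntry("ann", ("PropertySale", "Branch"), 120),
    QueryLogEntry("ann", ("PropertySale",), 40),
    QueryLogEntry("bob", ("Lease",), 10),
]

# A profile for the whole warehouse, and one for a single user.
warehouse_profile = build_profile(log)
ann_profile = build_profile(e for e in log if e.user == "ann")
```

Tables that appear most often in `target_tables` are the natural candidates for indexes and aggregations.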

Query Manager

Performs operations associated with management of user queries

Constructed using vendor end-user data access tools, data warehouse monitoring tools, database facilities, and custom-built programs

Complexity determined by facilities provided by end-user access tools and database

Operations:
• Directing queries to appropriate tables
• Scheduling execution of queries

Can generate query profiles
• Allows warehouse manager to determine appropriate indexes and aggregations

Detailed Data

Detailed data stored in database schema
• Usually not stored online
• Instead aggregated to next level of detail

Regularly added to warehouse to supplement aggregated data

Lightly and Highly Summarized Data

Stores pre-defined lightly and highly aggregated data generated by warehouse manager

Transient - changes to respond to changing query profiles

Purpose of summary information
• Improve query performance

Removes requirement to continually perform summary operations in answering user queries

Summary data updated continuously as new data loaded into warehouse
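The relationship between detailed, lightly summarized, and highly summarized data can be sketched as follows. Table names, columns, and sample rows are assumptions made for illustration, and the sketch rebuilds the summaries from scratch after each load for simplicity, whereas a real warehouse manager would more likely update them incrementally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale_detail (branch TEXT, sale_month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sale_detail VALUES (?, ?, ?)",
    [
        ("B003", "2004-01", 100.0),
        ("B003", "2004-01", 50.0),
        ("B003", "2004-02", 70.0),
        ("B005", "2004-01", 30.0),
    ],
)


def refresh_summaries(conn):
    """Rebuild both summary tables from the detailed data."""
    conn.execute("DROP TABLE IF EXISTS sales_by_branch_month")
    conn.execute("DROP TABLE IF EXISTS sales_by_branch")
    # Lightly summarized: one row per branch per month.
    conn.execute(
        "CREATE TABLE sales_by_branch_month AS "
        "SELECT branch, sale_month, SUM(amount) AS total "
        "FROM sale_detail GROUP BY branch, sale_month"
    )
    # Highly summarized: one row per branch, built from the light summary.
    conn.execute(
        "CREATE TABLE sales_by_branch AS "
        "SELECT branch, SUM(total) AS total "
        "FROM sales_by_branch_month GROUP BY branch"
    )


refresh_summaries(conn)
light = conn.execute(
    "SELECT branch, sale_month, total FROM sales_by_branch_month "
    "ORDER BY branch, sale_month"
).fetchall()
high = dict(conn.execute("SELECT branch, total FROM sales_by_branch").fetchall())
```

A query for total sales per branch can now read the small `sales_by_branch` table instead of scanning and summing `sale_detail` each time.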

Archive/Backup Data

Stores detailed and summarized data for archiving and backup

Data transferred to storage archives - magnetic tape or optical disk

Metadata

Stores metadata (data about data) definitions used by all processes in warehouse

Used for:
• Extraction and loading processes: used to map data sources to common view of information within warehouse
• Warehouse management process: used to automate production of summary tables
• Query management process: used to direct query to most appropriate data source
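The first use, mapping data sources to a common view, can be sketched with a small lookup structure. The source system names, field names, and the `to_common_view` helper are all hypothetical and only illustrate the idea.

```python
# Hypothetical metadata: how each source system's fields map to the
# warehouse's common view. All names are illustrative.
FIELD_MAP = {
    "sales_sys": {"cust_no": "customer_id", "amt": "amount"},
    "rental_sys": {"clientNo": "customer_id", "rent": "amount"},
}


def to_common_view(source, record):
    """Rename a source record's fields using the metadata mapping.

    Fields with no mapping are dropped rather than loaded.
    """
    mapping = FIELD_MAP[source]
    return {
        mapping[field]: value
        for field, value in record.items()
        if field in mapping
    }


a = to_common_view("sales_sys", {"cust_no": "C1", "amt": 250.0})
b = to_common_view(
    "rental_sys", {"clientNo": "C1", "rent": 400.0, "internal_flag": 1}
)
```

Both records come out with the same field names, so later processes can treat them uniformly.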

Metadata

Metadata structure differs between processes because each has a different purpose

Issues:
• Multiple copies of metadata describe same data item

Vendor tools and end-user data access use own versions of metadata

Copy management tools use metadata to understand mapping rules that are applied to convert source data into common form

End-user access tools use metadata to understand how to build a query

The management of metadata within a data warehouse is a very complex task that should not be underestimated

End-User Access Tools

Principal purpose of data warehousing:
• To provide information to business users for strategic decision-making

Users interact with warehouse using end-user access tools

Data warehouse must efficiently support ad hoc and routine analysis

High performance achieved by pre-planning requirements for joins, summations, and periodic reports by end-users (where possible)

Main groups of access tools:
• Data reporting and query tools
• Application development tools
• Executive information system (EIS) tools
• Online analytical processing (OLAP) tools
• Data mining tools

Data Warehouse Information Flows

Data Warehouse Information Flows

Inflow - Processes associated with extraction, cleansing, and loading data from source systems

Upflow - Processes associated with adding value to data in warehouse through summarizing, packaging, and distribution

Downflow - Processes associated with archiving and backing-up/recovery of data

Outflow - Processes associated with making data available to end-users

Metaflow - Processes associated with management of metadata

Data Warehousing Tools and Technologies

Building data warehouse is complex task
• No vendor provides an ‘end-to-end’ set of tools

Necessitates data warehouse built using multiple products from different vendors

Major challenge:
• Ensuring products work well together and are fully integrated

Data Warehousing Tools and Technologies

Tasks of capturing data from source systems, cleansing and transforming it, and loading results into target system can be carried out either by separate products, or by a single integrated solution

Integrated solutions include:
• Code generators
• Database data replication tools
• Dynamic transformation engines

Data Warehouse DBMS Requirements

Load performance
Load processing
Data quality management
Query performance
Terabyte scalability
Mass user scalability
Networked data warehouse
Warehouse administration
Integrated dimensional analysis
Advanced query functionality

Administration and Management Tools

Monitoring data loading from multiple sources

Data quality and integrity checks

Managing and updating metadata

Monitoring database performance to ensure efficient query response times and resource utilization

Auditing data warehouse usage to provide user chargeback information

Administration and Management Tools

Replicating, subsetting, and distributing data

Maintaining efficient data storage management

Purging data

Archiving and backing-up data

Implementing recovery following failure

Security management

Typical Data Warehouse and Data Mart Architecture

Data Mart

A subset of a data warehouse that supports the requirements of a particular department or business function

Characteristics:
• Focuses on requirements of one department or business function
• Does not normally contain detailed operational data, unlike data warehouses
• More easily understood and navigated

Reasons for Creating a Data Mart

Give users access to data they need to analyze most often

Provide data in form that matches collective view of data by group of users in a department or business function area

Improve end-user response time
• Reduction in volume of data to be accessed

Provide appropriately structured data as dictated by requirements of end-user access tools

Building data mart is simpler compared with establishing corporate data warehouse

Cost of implementing data marts less than that required to establish data warehouse

Potential users of data mart more clearly defined
• More easily targeted to obtain support for data mart project

Designing Data Warehouses

Initially, need answers for questions such as:
• Which user requirements are most important and which data should be considered first?
• Should the project be scaled down into something more manageable?
• Should the infrastructure for a scaled-down project be capable of ultimately delivering a full-scale enterprise-wide data warehouse?


Use of data marts avoids complexities associated with designing a data warehouse

Difficult to commit to enterprise-wide design that must meet all user requirements

Interim solution => build data marts
• Goal: creation of data warehouse that supports requirements of enterprise


Requirements collection and analysis stage:

Involves interviewing appropriate members of staff (such as marketing users, finance users, and sales users)
• Identify prioritized set of requirements data warehouse must meet

Interviews conducted with members of staff responsible for operational systems
• Identify which data sources can provide clean, valid, and consistent data that will remain supported over next few years

Interviews provide necessary information for top-down view (user requirements) and bottom-up view (available data sources)

Database component of data warehouse described using technique called dimensionality modeling

Dimensionality Modelling

Logical design technique that aims to present data in standard, intuitive form that allows for high-performance access

Uses Entity-Relationship modeling concepts with important restrictions:
• Every dimensional model (DM) composed of one table with a composite primary key, called fact table, and set of smaller tables called dimension tables
• Each dimension table has simple (non-composite) primary key that corresponds exactly to one component of composite key in fact table
• Forms ‘star-like’ structure called star schema or star join


Natural keys replaced with surrogate keys
• Every join between fact and dimension tables based on surrogate keys, not natural keys

Surrogate key – generalized structure based on integers
• Gives data in warehouse independence from data used and produced by OLTP systems
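Surrogate key assignment during loading can be sketched as follows. The `SurrogateKeyGenerator` class and the sample branch numbers are hypothetical, and a production load would persist the mapping rather than hold it in memory.

```python
class SurrogateKeyGenerator:
    """Assigns sequential integer surrogate keys to natural keys."""

    def __init__(self):
        self._keys = {}   # natural key -> surrogate key
        self._next = 1

    def key_for(self, natural_key):
        # Reuse the existing surrogate key, or allocate the next integer.
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next
            self._next += 1
        return self._keys[natural_key]


branch_keys = SurrogateKeyGenerator()

# Source rows carry the OLTP natural key (branch number) and an amount.
source_rows = [("B003", 100.0), ("B005", 30.0), ("B003", 50.0)]

# Fact rows carry only the surrogate key, so they no longer depend on
# how the operational system identifies a branch.
fact_rows = [(branch_keys.key_for(b), amt) for b, amt in source_rows]
```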

Star schema for property sales of DreamHome


Star schema - logical structure
• Has fact table containing factual data in center
• Surrounded by dimension tables containing reference data, which can be denormalized

Facts generated by events that occurred in the past
• Unlikely to change, regardless of how analyzed


Fact tables:
• Hold bulk of data in data warehouse
• Can be extremely large

Important to treat fact data as read-only reference data that will not change over time

Most useful fact tables contain one or more numerical measures, or ‘facts’ that occur for each record and are numeric and additive


Dimension tables:
• Usually contain descriptive textual information
• Dimension attributes used as constraints in data warehouse queries

Star schemas speed up query performance by denormalizing reference information into single dimension table
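A minimal star schema and a typical query against it can be sketched as below. The table and column names are assumptions for illustration and are not the DreamHome schema shown in the figure; the fact table is assumed to hold one row per branch per quarter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: simple integer surrogate keys, descriptive attributes.
CREATE TABLE branch_dim (branch_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);

-- Fact table: composite primary key made up of the dimension keys.
CREATE TABLE property_sale_fact (
    branch_id INTEGER REFERENCES branch_dim(branch_id),
    time_id INTEGER REFERENCES time_dim(time_id),
    sales_revenue REAL,
    PRIMARY KEY (branch_id, time_id)
);

INSERT INTO branch_dim VALUES (1, 'Glasgow', 'Scotland'), (2, 'London', 'England');
INSERT INTO time_dim VALUES (1, 2004, 3), (2, 2004, 4);
INSERT INTO property_sale_fact VALUES (1, 1, 1000.0), (1, 2, 400.0), (2, 1, 900.0);
""")

# Dimension attributes (year, quarter, region) act as the constraints and
# grouping columns; the numeric fact is what gets aggregated.
rows = conn.execute("""
    SELECT b.region, SUM(f.sales_revenue)
    FROM property_sale_fact AS f
    JOIN branch_dim AS b ON b.branch_id = f.branch_id
    JOIN time_dim AS t ON t.time_id = f.time_id
    WHERE t.year = 2004 AND t.quarter = 3
    GROUP BY b.region
    ORDER BY b.region
""").fetchall()
```

Because `region` is stored directly in `branch_dim` (denormalized), the query needs only one join per dimension.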


Snowflake schema
• Variant of the star schema where dimension tables do not contain denormalized data

Starflake schema
• Hybrid structure that contains mixture of star (denormalized) and snowflake (normalized) schemas
• Allows dimensions to be present in both forms to cater for different query requirements

Property sales with normalized version of Branch dimension table

Dimensionality Modelling

Advantages of predictable, standard form of underlying dimensional model:
• Efficiency
• Ability to handle changing requirements: star schema handles ad hoc user queries well
• Extensibility: supports changes, e.g. adding new dimensions, facts
• Ability to model common business situations
• Predictable query processing

Comparison of DM and ER models

ER model
• Reduces data redundancy
• Beneficial to transaction processing

Single ER model normally decomposes into multiple DMs

Multiple DMs are associated through ‘shared’ dimension tables

Database Design Methodology for Data Warehouses

‘Nine-Step Methodology’:
1. Choosing the process
2. Choosing the grain
3. Identifying and conforming the dimensions
4. Choosing the facts
5. Storing pre-calculations in the fact table
6. Rounding out the dimension tables
7. Choosing the duration of the database
8. Tracking slowly changing dimensions
9. Deciding the query priorities and the query modes

Step 1: Choosing the process

The process (function) refers to subject matter of particular data mart

First data mart built should be:
• Most likely to be delivered on time
• Within budget
• Able to answer the most commercially important business questions

Business process of DreamHome case study

Example – Chosen Data Mart

Step 2: Choosing the grain

Decide what a record of fact table represents

Identify dimensions of fact table

Grain decision for fact table also determines grain of each dimension table

Include time as core dimension
• Always present in star schemas

Step 3: Identifying and Conforming dimensions

Dimensions set context for asking questions about the facts in fact table

If any dimension occurs in two data marts:
• Must be exactly same dimension
• Or one must be mathematical subset of other

Dimension used in more than one data mart referred to as being conformed
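The conformance rule can be sketched as a simple check. This is one possible reading, in which "mathematical subset" is taken to mean a subset of rows with identical attribute values; the `is_conformed` function and the sample data are hypothetical.

```python
def is_conformed(dim_a, dim_b):
    """Check that two copies of a dimension are identical or one is a subset.

    Each dimension is a dict mapping surrogate key -> attribute tuple.
    Every row of the smaller copy must appear unchanged in the larger one.
    """
    small, large = sorted((dim_a, dim_b), key=len)
    return all(
        key in large and large[key] == row for key, row in small.items()
    )


# Branch dimension as used by two data marts (illustrative data).
sales_branch = {1: ("Glasgow", "Scotland"), 2: ("London", "England")}
advert_branch = {1: ("Glasgow", "Scotland")}   # row subset: conformed
conflicting = {1: ("Aberdeen", "Scotland")}    # same key, different meaning

ok = is_conformed(sales_branch, advert_branch)
bad = is_conformed(sales_branch, conflicting)
```

A check like this matters because data marts that disagree about what a key means cannot later be combined into one warehouse.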

Star schemas for property sales and property advertising

Step 4: Choosing the facts

Grain of fact table determines which facts can be used in data mart

Facts should be numeric and additive

Unusable facts include:
• Non-numeric facts
• Non-additive facts
• Facts at different granularity from other facts in table
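Why additivity matters can be shown with a small worked example. The figures and column meanings below are invented for illustration: counts can be summed across rows, while a ratio such as an occupancy rate cannot be combined by summing or averaging.

```python
rows = [
    # (branch, properties_available, properties_rented)
    ("B003", 10, 9),
    ("B005", 100, 50),
]

# Additive facts: summing across rows gives meaningful totals.
total_available = sum(r[1] for r in rows)
total_rented = sum(r[2] for r in rows)

# Non-additive fact: the occupancy rate of each row.
rates = [r[2] / r[1] for r in rows]

# Averaging the stored ratios weights a 10-property branch the same as
# a 100-property branch, which overstates the overall rate.
naive_rate = sum(rates) / len(rates)

# Recomputing the ratio from the additive facts gives the true rate.
correct_rate = total_rented / total_available
```

Storing the additive components (counts) in the fact table and deriving ratios at query time avoids this problem.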

Property rentals with a badly structured fact table

Property rentals with fact table corrected

Step 5: Storing pre-calculations in the fact table

Once facts selected, re-examine each to determine whether there are opportunities to use pre-calculations

Step 6: Rounding out the dimension tables

Text descriptions are added to dimension tables

Text descriptions should be intuitive and understandable to users

Usefulness of data mart determined by scope and nature of attributes of dimension tables

Step 7: Choosing the duration of the database

Duration measures how far back in time fact table goes

Very large fact tables raise two very significant data warehouse design issues:
• Often difficult to source increasingly old data
• Mandatory that old versions of important dimensions be used, not the most current versions - aka ‘Slowly Changing Dimension’ problem

Step 8: Tracking slowly changing dimensions

Slowly changing dimension problem
• Proper description of old dimension data must be used with old fact data

Generalized key assigned to important dimensions
• Allows distinction between multiple snapshots of dimensions over period of time

Three basic types of slowly changing dimensions:
• Type 1 - where changed dimension attribute is overwritten
• Type 2 - where changed dimension attribute causes new dimension record to be created
• Type 3 - where changed dimension attribute causes alternate attribute to be created, so that both the old and new values of attribute are simultaneously accessible in the same dimension record
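The three types can be sketched on a small in-memory dimension. The dictionary layout, function names, and the `previous_` prefix for the alternate attribute are assumptions made for illustration; real implementations usually also record effective dates for Type 2.

```python
def scd_type1(dim, key, attr, new_value):
    """Type 1: overwrite the attribute in place; history is lost."""
    dim[key][attr] = new_value


def scd_type2(dim, key, attr, new_value):
    """Type 2: add a new record under a new surrogate key; return that key."""
    new_key = max(dim) + 1
    dim[new_key] = {**dim[key], attr: new_value}
    return new_key


def scd_type3(dim, key, attr, new_value):
    """Type 3: keep the old value in an alternate attribute on the same record."""
    dim[key]["previous_" + attr] = dim[key][attr]
    dim[key][attr] = new_value


def new_dim():
    # Dimension keyed by integer surrogate key (illustrative data).
    return {1: {"staff_no": "SG37", "branch": "B003"}}


d1 = new_dim()
scd_type1(d1, 1, "branch", "B005")

d2 = new_dim()
k2 = scd_type2(d2, 1, "branch", "B005")

d3 = new_dim()
scd_type3(d3, 1, "branch", "B005")
```

With Type 2, old fact rows keep pointing at surrogate key 1 and so retain the old branch, while new fact rows use the new key; this is why the generalized (surrogate) key is needed.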

Step 9: Deciding the query priorities and the query modes

Most critical physical design issues affecting end-user’s perception include:
• Physical sort order of fact table on disk
• Presence of pre-stored summaries or aggregations

Additional physical design issues:
• Administration
• Backup
• Indexing performance
• Security

Database Design Methodology for Data Warehouses

Methodology designs data mart that:
• Supports requirements of particular business process
• Allows easy integration with other related data marts to form enterprise-wide data warehouse

A dimensional model that contains more than one fact table sharing one or more conformed dimension tables is referred to as a fact constellation

Fact and dimension tables for each business process of DreamHome

Dimensional model (fact constellation) for the DreamHome data warehouse

Chapters 31 & 32
Omit material specific to Oracle

Date post:	21-Dec-2015