(Dwh Fundamentals)

Post on 12-Jan-2016


04/21/23 TCS Confidential 1

Course Roadmap

• Why we use Data Warehousing
• Difference between Operational System and Data Warehouse
• Introduction to Data Warehousing
• Emergence of Decision Support Systems
• Data Warehousing Approaches
• Data Warehouse Technical Architecture
• Data Modelling concepts
• Operational Data Store
• Schema Design of Data Warehouse
• Data Acquisition

Why We Need Data Warehousing?

• Better business intelligence for end-users
• Reduction in time to locate, access, and analyze information
• Consolidation of disparate information sources
• To store large volumes of historical detail data from mission-critical applications
• Strategic advantage over competitors
• Faster time-to-market for products and services
• Replacement of older, less-responsive decision support systems
• Reduction in demand on IS to generate reports

OPERATIONAL DATABASE:

Online Transaction Processing (OLTP)

Designed for running the business. It is not suitable for analyzing the business from the perspective of business executives, because the data is volatile (it keeps changing).

It does not maintain historical data.

It contains only current data.

If you insert a new value, it overwrites the existing one. E.g., for Acnthno 1072, the Acnthsal value 13,000 is replaced by 20,000.
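The difference between overwriting a value and keeping its history can be sketched in a few lines of Python. This is only an illustration: the dictionary, the history list and the dates are assumptions, and only the account number and the two amounts come from the slide.

```python
from datetime import date

# Operational (OLTP) style: one entry per account, holding only the current value.
oltp_accounts = {1072: 13_000}

# Warehouse style: every change is appended together with the date it applies to.
# (The layout of the history rows and the dates are illustrative assumptions.)
history = [(1072, 13_000, date(2016, 1, 1))]


def update_salary(account_no, new_salary, as_of):
    """Apply a change the OLTP way (overwrite) and the warehouse way (append)."""
    oltp_accounts[account_no] = new_salary            # the old value is lost
    history.append((account_no, new_salary, as_of))   # the old value is kept


update_salary(1072, 20_000, date(2016, 2, 1))

print(oltp_accounts[1072])
print([salary for acct, salary, _ in history if acct == 1072])
```

After the update, the operational structure can only answer "what is the value now", while the history can also answer "what was it before".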

OLTP Systems Vs Data Warehouse

• Users are different
• Data content is different
• Data structures are different
• Hardware is different

Understanding the differences is the key.

OLTP Vs Data Warehouse

Operational System | Data Warehouse
Transaction Processing | Query Processing
Predictable CPU Usage | Random CPU Usage
Time Sensitive | History Oriented
Operator View | Managerial View
Normalized, Efficient Design for TP | Denormalized Design for Query Processing


OLTP Vs Warehouse

Operational System | Data Warehouse
Designed for Atomicity, Consistency, Isolation and Durability | Designed for quiet or static database
Organized by transactions (Order, Input, Inventory) | Organized by subject (Customer, Product)
Relatively smaller database | Large database size
Many concurrent users | Relatively few concurrent users
Volatile Data | Non Volatile Data


Operational System | Data Warehouse
Stores all data | Stores relevant data
Performance Sensitive | Less Sensitive to performance
Not Flexible | Flexible
Efficiency | Effectiveness


What is a Data Warehouse?

A Data Warehouse is a

• Subject-Oriented
• Integrated
• Time-Variant
• Non-volatile

collection of data in support of management's decisions.

W. H. Inmon - Regarded as the Father of Data Warehousing


Subject Oriented Analysis

Transactional Storage (Process Oriented): Entry, Sales Rep, Quantity Sold, Part Number, Date, Customer Name, Product Description, Unit Price, Mail Address

Data Warehouse Storage (Subject Oriented): Sales, Customers, Products

Integration of Data

Aspect | Transactional Storage | Data Warehouse Storage (after integration)
Encoding | Appl. A - M, F; Appl. B - 1, 0; Appl. C - X, Y | M, F
Unit of Attributes | Appl. A - pipeline cm; Appl. B - pipeline inches; Appl. C - pipeline mcf | pipeline cm
Physical Attributes | Appl. A - balance dec(13,2); Appl. B - balance PIC 9(9)V99; Appl. C - balance float | balance dec(13,2)
Naming Conventions | Appl. A - bal-on-hand; Appl. B - current_balance; Appl. C - balance | balance
Data Consistency | Appl. A - date (Julian); Appl. B - date (yymmdd); Appl. C - date (absolute) | date (Julian)
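As a rough illustration of the encoding and unit rows above, the sketch below maps application-specific values onto one warehouse standard. The application names A, B and C follow the slide. Which source code stands for which gender is an assumption made for the example, and no conversion is attempted for mcf, since the slide gives no rule for it.

```python
# Per-application rules for bringing source values to the warehouse standard.
# ASSUMPTION: "1" and "X" are taken to mean male, "0" and "Y" female.
GENDER_MAPS = {
    "A": {"M": "M", "F": "F"},
    "B": {"1": "M", "0": "F"},
    "C": {"X": "M", "Y": "F"},
}

CM_PER_INCH = 2.54


def integrate_gender(app, value):
    """Translate an application-specific gender code to the warehouse code."""
    return GENDER_MAPS[app][str(value)]


def pipeline_to_cm(value, unit):
    """Convert a pipeline length to centimetres, the warehouse unit."""
    if unit == "cm":
        return float(value)
    if unit == "inches":
        return float(value) * CM_PER_INCH
    # Units without a known rule are rejected rather than guessed.
    raise ValueError(f"no conversion rule defined for unit {unit!r}")


print(integrate_gender("B", 1))
print(integrate_gender("C", "Y"))
print(pipeline_to_cm(10, "inches"))
```

Keeping the rules in one lookup table per attribute means a new source application only adds an entry and leaves the loading logic unchanged.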


Volatility of Data

Transactional Storage (Volatile): record-by-record data manipulation (Insert, Change, Delete, Access)

Data Warehouse Storage (Non-Volatile): mass load / access of data (Load, Access)

Time Variant Data Analysis

Transactional Storage: Current Data

Data Warehouse Storage: Historical Data

[Chart: Sales (Region, Year - Year 97 - 1st Qtr). Sales (in lakhs), axis from 0 to 20, for January, February and March; series: East, West, North]

Decision Support Systems (DSS)

What is DSS?

Need for DSS

Comparison of OLTP & DSS

Transition from Data Processing to Information Processing

What is DSS?

Decision Support Systems (DSS) are interactive computer-based systems intended to help decision makers use data and models to identify and solve problems and make decisions. The Data Warehouse is the foundation of the DSS process. It is a strategy and a process for staging corporate data.

• Enable users to get a “Business View” of the data
• Facilitate data-based decision making that drives and improves the business
• Discover “Hidden Trends”

Why DSS?: How to answer these Business Queries?

What is the sales distribution region wise?

What is Defaulter’s Profile?

What are the slow movers in my product line?

How did my revenue improve in the past 5 years?

Which of my Sales Agents are doing better?

Who are my profitable customers?

Currency Risk, Interest Rate Risk, Liquidity Risk

Strategic Planning / Budgeting

Which channel costs me more and pays less?

OLTP v/s DSS Environment

OLTP Environment

• get data IN

• large volumes of simple transaction queries

• continuous data changes

• low processing time

• mode of processing

• transaction details

• data inconsistency

• mostly current data

DSS Environment

• get information OUT

• small number of diverse queries

• periodic updates only

• high processing time

• mode of discovery

• subject oriented - summaries

• data consistency

• historical data is relevant

OLTP v/s DSS Environment

OLTP Environment

• high concurrent usage

• highly normalized data structure

• static applications

• automates routines

DSS Environment

• low concurrent usage

• fewer tables, but more columns per table

• dynamic applications

• facilitates creativity

DW Implementation Approaches

• Top Down

• Bottom-up

• Combination of both

• Choices depend on:
– current infrastructure
– resources
– architecture
– ROI
– implementation speed

Top Down Implementation

Bottom Up Implementation

DW Implementation Approaches

Top Down

• More planning and design initially
• Involve people from different work-groups, departments
• Data marts may be built later from Global DW
• Overall data model to be decided up-front

Bottom Up

• Can plan initially without waiting for global infrastructure
• Built incrementally
• Can be built before or in parallel with Global DW
• Less complexity in design

DW Implementation Approaches

Top Down

• Consistent data definition and enforcement of business rules across enterprise
• High cost, lengthy process, time consuming
• Works well when there is a centralized IS department responsible for all H/W and resources

Bottom Up

• Data redundancy and inconsistency between data marts may occur
• Integration requires great planning
• Less cost of H/W and other resources
• Faster pay-back

DW Architectures

Data Warehousing Architecture

[Figure: architecture diagram. Components shown:]

• Sources: Source 1, Source 2, Source 3, ... Source n
• Extract - Push/Pull
• Staging Layer: Cleansing, Transformation & Loading
• ODS
• Data Warehouse: Detail Data, Summaries / Aggregations
• Transformation, Summarization, Aggregation
• Data Marts: Cubes - Conformed Dimensions
• Reporting Layer: Canned Reports, Ad-hoc analysis
• Metadata

Benefits of DWH

• To formulate effective business, marketing and sales strategies.
• To precisely target promotional activity.
• To discover and penetrate new markets.
• To successfully compete in the marketplace from a position of informed strength.
• To build predictive rather than retrospective models.

Data Modeling

WHAT IS A DATA MODEL?

A data model is an abstraction of some aspect of the real world (system).

WHY A DATA MODEL?

• Helps to visualize the business

• A model is a means of communication.

• Models help elicit and document requirements.

• Models reduce the cost of change.

• The model is the essence of the DW architecture, on the basis of which the DW will be implemented.

STEPS in DATA MODELING

• Problem & scope definition
• Requirement Gathering
• Analysis
• Logical Database Design
• Deciding Database
• Physical Database Design
• Schema Generation

Levels of modeling

• Conceptual modeling
– Describe data requirements from a business point of view without technical details

• Logical modeling
– Refine conceptual models
– Data structure oriented, platform independent

• Physical modeling
– Detailed specification of what is physically implemented using specific technology

Conceptual Model

• A conceptual model shows data through business eyes.

• All entities which have business meaning.

• Important relationships

• Few significant attributes in the entities.

• Few identifiers or candidate keys.

Logical Model

• Replaces many-to-many relationships with associative entities.

• Defines a full population of entity attributes.

• May use non-physical entities for domains and sub-types.

• Establishes entity identifiers.

• Has no specifics for any RDBMS or configuration.
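The first point above, replacing a many-to-many relationship with an associative entity, can be sketched with Python's built-in sqlite3 module. The tables book, author and book_author and all rows are hypothetical and chosen only to show the pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book   (book_id   INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE author (author_id INTEGER PRIMARY KEY, name  TEXT);

    -- Associative entity: one row per (book, author) pair turns the
    -- many-to-many relationship into two one-to-many relationships.
    CREATE TABLE book_author (
        book_id   INTEGER REFERENCES book(book_id),
        author_id INTEGER REFERENCES author(author_id),
        PRIMARY KEY (book_id, author_id)
    );

    INSERT INTO book   VALUES (1, 'Book One'), (2, 'Book Two');
    INSERT INTO author VALUES (10, 'Author A'), (11, 'Author B');
    INSERT INTO book_author VALUES (1, 10), (1, 11), (2, 10);
""")

# All authors of book 1, reached through the associative table.
authors_of_book_1 = [
    name
    for (name,) in conn.execute(
        """
        SELECT a.name
        FROM book_author AS ba
        JOIN author AS a ON a.author_id = ba.author_id
        WHERE ba.book_id = 1
        ORDER BY a.name
        """
    )
]
print(authors_of_book_1)
```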

Physical Model

• A Physical data model may include
– Referential Integrity
– Indexes
– Views
– Alternate keys and other constraints
– Tablespaces and physical storage objects

Modeling Techniques

• Entity-Relationship Modeling

– Traditional modeling technique

– Technique of choice for OLTP

– Suited for corporate data warehouse

• Dimensional Modeling

– Analyzing business measures in the specific business context

– Helps visualize very abstract business questions

– End users can easily understand and navigate the data structure

Entity-Relationship Modeling - Basic Concepts

• Relationship
– Relationship between entities - structural interaction and association
– Described by a verb
– Cardinality: 1-1, 1-M, M-M
– Example: Books belong to Printed Media

• Attributes
– Characteristics and properties of entities
– Example: Book Id, Description, book category are attributes of entity “Book”
– Attribute name should be unique and self-explanatory
– Primary Key, Foreign Key, Constraints are defined on Attributes

Examples: ER Model

Limitations of E-R Modeling

• Poor Performance

• Tend to be very complex and difficult to navigate.


Dimensional Modeling

• Dimensional modeling uses three basic concepts: measures, facts, dimensions.

• Is powerful in representing the requirements of the business user in the context of database tables.

• Focuses on numeric data, such as values, counts, weights, balances and occurrences.

• Must identify
– Business process to be supported
– Grain (level of detail)
– Dimensions
– Facts

Dimensional modeling

What is a Fact?

• A fact is a collection of related data items, consisting of measures and context data.

• Each fact typically represents a business item, a business transaction, or an event that can be used in analyzing the business or business process.

• Facts are measured, “continuously valued”, rapidly changing information. They can be calculated and/or derived.

Types of Facts

• Additive
– Can be added along all the dimensions
– Discrete numerical measures, e.g. retail sales in $

• Semi Additive
– Snapshot, taken at a point in time
– Measures of intensity
– Not additive along the time dimension, e.g. account balance, inventory balance
– Summed and divided by the number of time periods to get a time average

• Non Additive
– Numeric measures that cannot be added across any dimension
– Intensity measure averaged across all dimensions, e.g. room temperature
– Textual facts - AVOID THEM
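A small worked example may help separate additive from semi-additive facts. The snapshot rows below are made up for illustration: deposits are treated as an additive fact and month-end balance as a semi-additive one.

```python
# Illustrative month-end snapshots: (account, month, balance, deposits_in_month).
# All figures are invented for this sketch.
snapshots = [
    ("A1", "Jan", 100, 100),
    ("A1", "Feb", 150, 50),
    ("A1", "Mar", 110, 0),
    ("A2", "Jan", 200, 200),
    ("A2", "Feb", 200, 0),
    ("A2", "Mar", 260, 60),
]

# Additive fact: deposits can be summed across accounts and across months.
total_deposits = sum(dep for _, _, _, dep in snapshots)

# Semi-additive fact: balances can be summed across accounts for one month...
march_total_balance = sum(bal for _, month, bal, _ in snapshots if month == "Mar")

# ...but along the time dimension they are averaged, not summed.
a1_balances = [bal for acct, _, bal, _ in snapshots if acct == "A1"]
a1_average_balance = sum(a1_balances) / len(a1_balances)

print(total_deposits, march_total_balance, a1_average_balance)
```

Summing the three A1 balances would give 360, a number with no business meaning; dividing by the three periods gives the time average instead.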

Dimensions

• A dimension is a collection of members or units of the same type of views.

• Dimensions determine the contextual background for the facts.

• Dimensions represent the way business people talk about the data resulting from a business process, e.g., who, what, when, where, why, how

Dimensional Hierarchy

Geography Dimension

World Level: World
Continent Level: America, Asia, Europe
Country Level: USA, Canada, Argentina
State Level: FL, GA, VA, CA, WA
City Level: Tampa, Miami, Orlando, Naples

Each level is linked to the one above it by a parent relation. Each node (for example, USA) is a dimension member / business entity.

Attributes: Population, Tourist’s Place
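One way to picture the parent relation is as a child-to-parent lookup along which a measure is rolled up. The sketch below uses the member names from the slide. Placing all four cities under FL is the usual reading of the figure but is an assumption here, and the population figures are invented.

```python
from collections import defaultdict

# Child -> parent links for the geography dimension.
# ASSUMPTION: all four cities belong to FL; the slide's figure is not available.
PARENT = {
    "Tampa": "FL", "Miami": "FL", "Orlando": "FL", "Naples": "FL",
    "FL": "USA", "GA": "USA", "VA": "USA", "CA": "USA", "WA": "USA",
    "USA": "America", "Canada": "America", "Argentina": "America",
    "America": "World", "Asia": "World", "Europe": "World",
}

# Invented population attribute values, in arbitrary units.
city_population = {"Tampa": 4, "Miami": 5, "Orlando": 3, "Naples": 1}


def roll_up(values, parent):
    """Add each leaf value to the leaf itself and to every ancestor."""
    totals = defaultdict(int)
    for member, value in values.items():
        node = member
        while node is not None:
            totals[node] += value
            node = parent.get(node)   # None once the top level is reached
    return dict(totals)


totals = roll_up(city_population, PARENT)
print(totals["FL"], totals["USA"], totals["World"])
```

Drilling down is the reverse walk: start at a member such as USA and list the members whose parent it is.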

Dimension Types

• Conformed Dimension
• Junk Dimension
• Dirty Dimension
• Monster Dimension
• Slowly Changing Dimension
• Degenerate Dimension

Data marts

A data mart

• Is a powerful and natural extension of the data warehouse
• Extends information to the departmental environment from an enterprise environment
• Interprets and structures data to suit departments’ specific needs

Data marts (DM)

Several names for DMs:

• departmental DSS DBs

• OLAP databases

• multi-dimensional DBs (MDDB)

• lightly summarized tables

DM - Types

• Embedded data marts are marts that are stored within the central DW. They can be stored relationally as files or cubes.

• Dependent data marts are marts that are fed directly by the DW, sometimes supplemented with other feeds, such as external data.

• Independent data marts are marts that are fed directly by external sources and do not use the DW.

Operational Data Store (ODS)

An ODS

• pulls together, validates, cleanses and integrates data
• is the foundation for providing an integrated view of enterprise data
• supports tactical decision support, day-to-day operations and management reporting

Characteristics

• Integrated
• Subject-oriented
• Volatile (including update)
• Current valued

ODS - Types

• Class I – Immediate Load
• Class II – Delayed Load
• Class III – Overnight Load
• Class IV – Data Warehouse Load

OLTP Vs ODS Vs DWH

Characteristic | OLTP | ODS | Data Warehouse
Data redundancy | Non-redundant within system; unmanaged redundancy among systems | Somewhat redundant with operational databases | Managed redundancy
Data stability | Dynamic | Somewhat dynamic | Static
Data update | Field by field | Field by field | Controlled batch
Data usage | Highly structured, repetitive | Somewhat structured, some analytical | Highly unstructured, heuristic or analytical
Database size | Moderate | Moderate | Large to very large
Database structure stability | Stable | Somewhat stable | Dynamic

Star Schema Design

– Single fact table surrounded by denormalized dimension tables
– The fact table primary key is the composite of the foreign keys (primary keys of dimension tables)
– Fact table contains transaction type information
– Many star schemas in a data mart
– Easily understood by end users, more disk storage required

Example of Star Schema
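A minimal star schema can be sketched with Python's built-in sqlite3 module. The table and column names (fact_sales, dim_product, dim_region) and the rows are hypothetical and do not come from the slide's figure. The point is the shape: one fact table keyed by its dimension keys, and one join per dimension in a query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized dimension tables (names and columns are hypothetical).
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY,
                              description TEXT, category TEXT);
    CREATE TABLE dim_region  (region_id INTEGER PRIMARY KEY, region TEXT);

    -- Fact table: its primary key is the composite of the dimension keys.
    CREATE TABLE fact_sales (
        product_id    INTEGER REFERENCES dim_product(product_id),
        region_id     INTEGER REFERENCES dim_region(region_id),
        quantity_sold INTEGER,
        PRIMARY KEY (product_id, region_id)
    );

    INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'),
                                   (2, 'Notebook', 'Stationery');
    INSERT INTO dim_region  VALUES (1, 'East'), (2, 'West');
    INSERT INTO fact_sales  VALUES (1, 1, 10), (1, 2, 5), (2, 1, 7);
""")

# A typical star join: join the fact table to a dimension, then group
# by one of that dimension's attributes.
sales_by_region = conn.execute("""
    SELECT r.region, SUM(f.quantity_sold)
    FROM fact_sales AS f
    JOIN dim_region AS r ON r.region_id = f.region_id
    GROUP BY r.region
    ORDER BY r.region
""").fetchall()
print(sales_by_region)
```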

Snowflake Schema

– Single fact table surrounded by normalized dimension tables
– Normalizes dimension tables to save data storage space
– Used when dimensions become very large
– Less intuitive, slower performance due to joins

• May want to use both approaches, especially if supporting multiple end-user tools.

Example of Snowflake Schema

Snowflake - Disadvantages

• Normalization of dimension makes it difficult for user to understand

• Decreases the query performance because it involves more joins

• Dimension tables are normally smaller than fact tables - space may not be a major issue to warrant snowflaking
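The extra join that snowflaking introduces can be seen in a small sqlite3 sketch. All table names and rows are hypothetical. Here the product dimension has its category moved out into a separate table, so reaching the category from the fact table takes two joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflaked product dimension: category lives in its own table.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY,
                               description TEXT,
                               category_id INTEGER
                                   REFERENCES dim_category(category_id));
    CREATE TABLE fact_sales   (product_id INTEGER
                                   REFERENCES dim_product(product_id),
                               quantity_sold INTEGER);

    INSERT INTO dim_category VALUES (1, 'Stationery'), (2, 'Toys');
    INSERT INTO dim_product  VALUES (1, 'Pen', 1), (2, 'Notebook', 1),
                                    (3, 'Kite', 2);
    INSERT INTO fact_sales   VALUES (1, 10), (2, 7), (3, 4);
""")

# Two joins are needed to reach the category. With a denormalized
# dimension the category column would sit in dim_product and one join
# would be enough.
sales_by_category = conn.execute("""
    SELECT c.category, SUM(f.quantity_sold)
    FROM fact_sales AS f
    JOIN dim_product  AS p ON p.product_id  = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.category
    ORDER BY c.category
""").fetchall()
print(sales_by_category)
```

The space saved is the repeated category text in dim_product, which is small next to the fact table; that is the trade-off the last bullet above describes.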


On-Line Analytical Processing (OLAP)

OLAP Cubes

OLAP is a category of applications/technology for collecting, managing, processing and presenting multidimensional data for analysis and management purposes.

OLAP Features

• Subject oriented approach to Decision Support
• Calculations applied across dimensions, through hierarchies and/or across members
• Trend analysis over sequential time periods, What If scenarios
• Slicing/Dicing subsets for on-screen viewing
• Drill-down/up along the hierarchy
• Reach-through to underlying detail data
• Rotation to new dimensional comparisons in the viewing area
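Slicing, dicing and drilling up can be sketched on a toy cube held in a Python dictionary. The dimension names, members and sales values are invented, and the functions are a simplified illustration rather than how any particular OLAP product implements these operations.

```python
from collections import defaultdict

# A tiny cube stored as {(region, product, month): sales}; values are made up.
cube = {
    ("East", "Pen", "Jan"): 10, ("East", "Pen", "Feb"): 12,
    ("East", "Ink", "Jan"): 4,  ("West", "Pen", "Jan"): 7,
    ("West", "Ink", "Feb"): 3,
}
DIMS = ("region", "product", "month")


def slice_cube(cube, dim, member):
    """Slice: fix one dimension to a single member."""
    i = DIMS.index(dim)
    return {k: v for k, v in cube.items() if k[i] == member}


def dice_cube(cube, **allowed):
    """Dice: keep cells whose members fall in the allowed sets."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in members for d, members in allowed.items())
    }


def roll_up(cube, keep):
    """Drill-up: aggregate away every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[tuple(k[i] for i in idx)] += v
    return dict(totals)


print(slice_cube(cube, "month", "Jan"))
print(dice_cube(cube, region={"East"}, product={"Pen"}))
print(roll_up(cube, ["region"]))
```

Rotation (pivoting) needs no new data: it is the same roll_up call with a different choice or order of kept dimensions, for example ["month", "region"].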

OLAP Categories

• Multi-dimensional OLAP (MOLAP)
• Relational OLAP (ROLAP)
• Hybrid OLAP (HOLAP)

MOLAP

• Use pre-calculated data set – CUBE

• Cube contains all possible answers to given range of questions

Features:

• Very fast response

• Ability to quickly write data into the cube

Downsides:

• Limited Scalability

• Inability to contain detailed data

• Load time
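The idea of a pre-calculated cube, and why load time and scalability are its downsides, can be sketched as follows. The detail rows are invented, and the "*" marker for an aggregated-away dimension is a convention chosen for this example.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("region", "product", "month")

# Detail rows: (region, product, month, sales); values are illustrative.
rows = [
    ("East", "Pen", "Jan", 10),
    ("East", "Ink", "Jan", 4),
    ("West", "Pen", "Jan", 7),
    ("West", "Pen", "Feb", 3),
]


def build_cube(rows):
    """Pre-aggregate the measure for every subset of the dimensions."""
    cube = defaultdict(int)
    for *members, value in rows:
        for size in range(len(DIMS) + 1):
            for kept in combinations(range(len(DIMS)), size):
                # "*" marks a dimension that has been aggregated away.
                key = tuple(m if i in kept else "*"
                            for i, m in enumerate(members))
                cube[key] += value
    return dict(cube)


cube = build_cube(rows)   # the cost is paid once, at load time

# Queries are now plain lookups rather than scans of the detail rows.
print(cube[("*", "*", "*")])      # grand total
print(cube[("East", "*", "*")])   # total for one region
print(cube[("*", "Pen", "Jan")])  # one product in one month
```

Each detail row contributes to 2^n cells for n dimensions, so the work at load time and the size of the cube grow quickly as dimensions are added.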


ROLAP

• Do not use pre-calculated CUBE

• Intercept query & pose it to the Relational DB

Features:

• Ask any question (not limited to the contents of the cube)

• Ability to drill down

Downsides:

• Slow Response

• Some limitations on scalability


HOLAP

• Combines MOLAP & ROLAP

• Utilizes both pre-calculated cubes & relational data sources

Features:

• For summary type info – cube (faster response)

• Ability to drill down – relational data sources (drill through detail to underlying data)

• Source of data transparent to end-user

Data Acquisition

• Data Extraction

• Data Transformation

• Data Loading
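The three steps can be sketched end to end with Python's csv and sqlite3 modules. The source data, the cleansing rules and the target table sales are all assumptions made for illustration; a real acquisition process would also log rejected rows and record load metadata.

```python
import csv
import io
import sqlite3

# Stand-in for a source system export; contents are illustrative.
SOURCE_CSV = """customer,gender,amount
 alice ,F,100.50
BOB,m,20
carol,F,not-a-number
"""


def extract(text):
    """Extraction: read raw records from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transformation: cleanse and standardize; drop rows that fail."""
    clean = []
    for rec in records:
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # a real pipeline would log or quarantine this row
        clean.append((rec["customer"].strip().title(),
                      rec["gender"].strip().upper(),
                      amount))
    return clean


def load(rows, conn):
    """Loading: bulk-insert the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(customer TEXT, gender TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
print(conn.execute("SELECT * FROM sales ORDER BY customer").fetchall())
```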
