Data Warehouse Architecture
Define Data Warehouse Architecture
Define Data Warehouse and Data Mart
Present a Data Warehouse Architectural Framework
Demo – Data Enterprise Integration Server
Objectives
Information Systems Architecture is the process of making the key choices that are essential to the development of an information system. Architecture includes:
◦ Guiding principles
◦ Approaches/philosophies
◦ “Logical” representations of a system
◦ Hardware/operating system
◦ Computing model: client/server vs traditional vs Web-based
◦ Tools and technologies
When making these choices, it is key that they are:
◦ Requirements driven
◦ Made with operational, technical and financial feasibility in mind
◦ Made within an architectural framework
Information Systems Architecture
Architecture has many drivers: business plan, corporate politics, system qualities, current systems, end user requirements and emerging technologies.
Architecture Drivers
It is not: architecture can be considered ‘high-level’ design.
Architecture includes those aspects of the design that are essential to the information system.
Architecture example:
◦ Users must be able to self-serve (guiding principle)
◦ We will use a “hub and spoke” design where data will be placed in a central data warehouse, then be propagated to one or more data marts (approach)
◦ We will normalize data in the central warehouse and use a dimensional design in the data marts (approach)
◦ We will use Oracle 8i as our DBMS (technical architecture)
How is Architecture Different from Design?
Not architecture:
◦ The Order subject area will be composed of the following tables: order_fact, customer_dim, product_dim and time_dim
◦ The customer_dim table will have the following attributes…
Architecture vs Design
Communication:
◦ To business sponsors and business users
◦ Between members of the project team
Planning:
◦ Cross-check for the project plan
◦ Ensure that all important components of the data warehouse are accounted for
Flexibility and growth:
◦ Thinking about overall architecture will reduce risk associated with the ‘success’ of the data warehouse
Learning
Productivity and reuse
The Value of Architecture
Transaction processing systems – growth is (relatively) predictable
Example:
◦ A company uses SAP for order processing
◦ They are opening a new retail store
◦ They predict (based on experience) 2000 transactions per week
◦ To process this volume, we need 3 workstations to capture the transactions
◦ Peak time each day is 11-2, when 50% of transactions occur
What’s different about DW Architecture?
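The sizing reasoning in the retail-store example can be worked through as simple arithmetic. This is only a sketch: the seven trading days per week and the even split of work across workstations are assumptions that the slide does not state.

```python
# Worked arithmetic for the retail-store example.
# Assumptions (not in the original slide): the store trades 7 days a week
# and transactions are spread evenly across the workstations.
TRANSACTIONS_PER_WEEK = 2000
TRADING_DAYS_PER_WEEK = 7   # assumption
WORKSTATIONS = 3
PEAK_SHARE = 0.5            # 50% of each day's transactions fall in the peak
PEAK_HOURS = 3              # 11:00 to 14:00


def peak_rate_per_workstation(per_week, days, workstations, peak_share, peak_hours):
    """Return the transactions per hour each workstation handles at peak."""
    per_day = per_week / days
    peak_per_hour = per_day * peak_share / peak_hours
    return peak_per_hour / workstations


rate = peak_rate_per_workstation(
    TRANSACTIONS_PER_WEEK, TRADING_DAYS_PER_WEEK, WORKSTATIONS, PEAK_SHARE, PEAK_HOURS
)
print(f"{rate:.1f} transactions per workstation per peak hour")
```

Under these assumptions each workstation sees roughly 16 transactions per peak hour, which is the kind of predictable figure a transaction system can be sized against.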
Success drives explosive growth
◦ More users
◦ More (complex) queries
◦ More data
Performance is unpredictable
◦ Unpredictable queries
◦ Unpredictable use patterns
What’s Different About Data Warehouse Architecture?
[Figure: growth over time for Siebel, SAP R/3 and the data warehouse]
Bill Inmon: “The enterprise data warehouse”
Ralph Kimball: “data marts”
The compromise: “Hub and Spoke” or “Federated” models
The Great Data Warehouse Architecture Debate
If you build it, they will come
A data mart is a collection of subject areas organized for decision support based on the specific needs of a given user group.
Each mart may differ widely from the others (as we will see).
Typically, data marts are built on the dimensional data model:
◦ Facts – things that the organization wants to measure: revenue, orders, shipments, purchases, etc.
◦ Dimensions – the means by which the organization wants to analyze the measures (facts): by customer, by time, by product, or by ANY combination
What is a Data Mart?
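To make the fact/dimension idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names echo the order_fact, customer_dim and product_dim tables mentioned earlier, but the columns, the sample rows and the revenue_by helper are invented for illustration.

```python
import sqlite3

# Tiny in-memory star schema: one fact table and two dimension tables.
# All columns and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE product_dim  (product_key  INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE order_fact (
    customer_key INTEGER REFERENCES customer_dim,
    product_key  INTEGER REFERENCES product_dim,
    revenue      REAL
);
INSERT INTO customer_dim VALUES (1, 'East'), (2, 'West');
INSERT INTO product_dim  VALUES (10, 'Books'), (20, 'Games');
INSERT INTO order_fact VALUES (1, 10, 100.0), (1, 20, 50.0), (2, 10, 70.0), (2, 10, 30.0);
""")

ALLOWED_DIMENSIONS = {"region", "category"}


def revenue_by(*dimension_columns):
    """Sum the revenue fact grouped by any combination of dimension columns."""
    if not dimension_columns or not set(dimension_columns) <= ALLOWED_DIMENSIONS:
        raise ValueError(f"choose one or more of {sorted(ALLOWED_DIMENSIONS)}")
    cols = ", ".join(dimension_columns)
    sql = f"""
        SELECT {cols}, SUM(f.revenue)
        FROM order_fact f
        JOIN customer_dim c ON c.customer_key = f.customer_key
        JOIN product_dim  p ON p.product_key  = f.product_key
        GROUP BY {cols}
        ORDER BY {cols}
    """
    return conn.execute(sql).fetchall()


by_region = revenue_by("region")
by_region_and_category = revenue_by("region", "category")
print(by_region)
print(by_region_and_category)
```

The same fact (revenue) is analyzed by one dimension or by several simply by changing the grouping columns, which is the point of the dimensional model.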
There are two kinds of data marts: dependent and independent.
A dependent data mart is one whose source is a data warehouse.
An independent data mart is one whose source is the legacy applications environment. All dependent data marts are fed by the same source: the data warehouse. Each independent data mart is fed uniquely and separately by the legacy applications environment.
Dependent data marts are architecturally and structurally sound.
Independent data marts have a number of significant issues.
What is a Data Mart?
Data Warehouse vs. Data Marts
What comes first?
From the Data Warehouse to Data Marts
[Figure: individually structured, departmentally structured and organizationally structured (data warehouse) levels, on a scale from less to more history, normalization and detail; data at one end, information at the other]
Data Warehouse and Data Marts
Data mart: OLAP, lightly summarized, departmentally structured
Data warehouse: organizationally structured, atomic, detailed data
Data Mart Centric
[Figure: data sources, data marts, data warehouse]
Problems with Data Mart Centric Solution
If you end up creating multiple warehouses, integrating them is a problem
True Warehouse
[Figure: data sources, data warehouse, data marts]
Generic Two-Level Architecture
Independent Data Mart
Dependent Data Mart and Operational Data Store
Logical Data Mart and Real-Time Data Warehouse
Three-Layer Architecture
Data Warehouse Architectures
All involve some form of extraction, transformation and loading (ETL)
Generic two-level data warehousing architecture
ETL
One company-wide warehouse
Periodic extraction: data is not completely current in the warehouse
Independent data mart data warehousing architecture
Data marts: mini-warehouses, limited in scope
ETL
Separate ETL for each independent data mart
Data access complexity due to multiple data marts
Dependent data mart with operational data store: a three-level architecture
ETL
Single ETL for the enterprise data warehouse (EDW)
Simpler data access
ODS provides option for obtaining current data
Dependent data marts loaded from EDW
ETL
Near real-time ETL for the data warehouse
ODS and data warehouse are one and the same
Data marts are NOT separate databases, but logical views of the data warehouse
Easier to create new data marts
Logical data mart and real time warehouse architecture
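A logical data mart can be pictured as a database view over the warehouse. The sketch below uses Python's built-in sqlite3 module; the warehouse_sales table, the marketing_mart view and the rows are hypothetical and only illustrate the idea.

```python
import sqlite3

# Hypothetical warehouse table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE warehouse_sales (department TEXT, sale_date TEXT, amount REAL);
INSERT INTO warehouse_sales VALUES
    ('marketing', '2023-04-01', 120.0),
    ('finance',   '2023-04-01', 300.0),
    ('marketing', '2023-04-02', 80.0);
-- The "data mart" is only a view: no data is copied out of the warehouse.
CREATE VIEW marketing_mart AS
    SELECT sale_date, amount FROM warehouse_sales WHERE department = 'marketing';
""")

rows_before = conn.execute("SELECT COUNT(*) FROM marketing_mart").fetchone()[0]

# A new warehouse row is visible through the view immediately, with no reload.
conn.execute("INSERT INTO warehouse_sales VALUES ('marketing', '2023-04-03', 40.0)")
rows_after = conn.execute("SELECT COUNT(*) FROM marketing_mart").fetchone()[0]

print(rows_before, rows_after)
```

Creating another mart is just another CREATE VIEW statement, which is why new logical data marts are easier to add than separate physical databases.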
Three-layer data architecture for a data warehouse
Independent data marts
Hub and spoke architecture
Data mart bus architecture
Federated data warehouse
The Major Data Warehouse Architectures
Independent data mart architecture
• Developed independently.
• No conformed dimensions (i.e., the data marts do not share the same categories and labels for data elements, which would allow data to be combined across data marts).
• Built for a business unit or functional area.
[Figure: data sources, data staging, independent data marts, end user access/applications]
Hub and spoke architecture
• Key spokesperson: Bill Inmon (1992, 1998, 2001).
• Detailed, enterprise-oriented view of data.
• Built in an iterative manner, subject area by subject area.
• Dependent data marts support user needs for dimensional data.
[Figure: data sources, data staging, central data store and ODS, dependent data marts, end user access/applications]
Data mart bus architecture
• Key spokesperson: Ralph Kimball (1996, 1999).
• First data mart built as proof of concept.
• Built sequentially according to a master suite of conformed dimensions and fact tables, resulting in logically integrated marts.
• Conformed dimensions provide the capability to access data across architected marts.
[Figure: data sources, data staging, architected data marts, end user access/applications]
Federated architecture
• Key spokesperson: Doug Hackney (2000, 2002).
• Combines data in an organization’s existing data warehousing environment.
• Characterized by combining key metrics and measures in existing data marts, data warehouses and legacy systems.
[Figure: data warehouse, data mart, data stores, data staging, federated data store, end user access/applications]
Data Warehouse Architecture Selection
Architecture selection
Data warehouse architectures:
◦ Independent data mart
◦ Data mart bus architecture
◦ Hub and spoke architecture
◦ Federated
Architecture selection factors:
◦ Information interdependence
◦ Upper management’s information needs
◦ Urgency of need
◦ View of the data warehouse
◦ Compatibility with existing systems
◦ Nature of end user tasks
◦ Resource constraints
◦ Perceived ability of the IT staff
◦ Source of sponsorship
◦ Expert influence
April 8, 2023DW Architecture Best Practices
Use a data model that is optimized for information retrieval
◦ dimensional model
◦ denormalized
◦ hybrid approach
Best Practice #1
Extract Transform Load (ETL)
◦ the process of unloading or copying data from the source systems, transforming it into the format and data model required in the BI environment, and loading it to the DW
◦ also, a software development tool for building ETL processes (an ETL tool)
◦ many production DWs use COBOL or other general-purpose programming languages to implement ETL
Data Acquisition Processes
Capture/Extract
Scrub or data cleansing
Transform
Load and Index
The ETL Process
ETL = Extract, transform, and load
Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
Capture/Extract…obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
Steps in data reconciliation
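The difference between a static and an incremental extract can be sketched in a few lines of Python. The source rows and the updated_at change-tracking column are assumptions made for illustration; real sources may expose changes through logs or other mechanisms instead.

```python
from datetime import datetime

# Hypothetical source rows; 'updated_at' is an assumed change-tracking column.
SOURCE = [
    {"id": 1, "name": "Ann", "updated_at": datetime(2023, 4, 1, 9, 0)},
    {"id": 2, "name": "Raj", "updated_at": datetime(2023, 4, 3, 14, 30)},
    {"id": 3, "name": "Mei", "updated_at": datetime(2023, 4, 5, 8, 15)},
]


def static_extract(rows):
    """Snapshot: copy every source row as it stands at this point in time."""
    return [dict(row) for row in rows]


def incremental_extract(rows, last_extract_time):
    """Capture only the rows that changed after the previous extract."""
    return [dict(row) for row in rows if row["updated_at"] > last_extract_time]


snapshot = static_extract(SOURCE)
changes = incremental_extract(SOURCE, last_extract_time=datetime(2023, 4, 2))
print(len(snapshot), [row["id"] for row in changes])
```

The snapshot carries all three rows, while the incremental extract carries only the two rows changed after 2 April.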
Scrub/Cleanse…uses pattern recognition and AI techniques to upgrade data quality
Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
Steps in data reconciliation
(cont.)
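A few of the fixes listed above (inconsistent formatting, erroneous dates, missing data, duplicates) can be sketched with plain rules. This is deliberately simpler than the pattern recognition and AI techniques the slide mentions; the field names and sample rows are hypothetical.

```python
from datetime import datetime

# Hypothetical raw rows showing a duplicate, an impossible date and a missing value.
RAW = [
    {"customer": " alice smith ", "signup": "2023-04-01", "city": "Boston"},
    {"customer": "Alice Smith", "signup": "2023-04-01", "city": "Boston"},
    {"customer": "Bob Jones", "signup": "2023-02-30", "city": "Denver"},
    {"customer": "Cara Lee", "signup": "2023-04-02", "city": ""},
]


def cleanse(rows):
    """Return (clean_rows, rejected_rows) after simple rule-based scrubbing."""
    clean, rejected, seen = [], [], set()
    for row in rows:
        # Reformat: collapse stray whitespace and normalize capitalization.
        name = " ".join(row["customer"].split()).title()
        # Error detection/logging: set aside rows whose date cannot exist.
        try:
            signup = datetime.strptime(row["signup"], "%Y-%m-%d").date()
        except ValueError:
            rejected.append((row, "erroneous date"))
            continue
        # Duplicate data: keep only the first row for each (name, date) pair.
        key = (name, signup)
        if key in seen:
            continue
        seen.add(key)
        # Missing data: record an explicit None instead of an empty string.
        clean.append({"customer": name, "signup": signup, "city": row["city"] or None})
    return clean, rejected


clean_rows, rejected_rows = cleanse(RAW)
print(len(clean_rows), len(rejected_rows))
```

Rejected rows are returned with a reason rather than silently dropped, so they can be logged and corrected at the source.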
Transform = convert data from the format of the operational system to the format of the data warehouse
Record-level:
◦ Selection – data partitioning
◦ Joining – data combining
◦ Aggregation – data summarization
Field-level:
◦ Single-field – from one field to one field
◦ Multi-field – from many fields to one, or one field to many
Steps in data reconciliation
(cont.)
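The record-level and field-level transformations can each be shown in a line or two of Python. The orders and customers data below is invented for illustration, as are the field names.

```python
from collections import defaultdict

# Hypothetical operational data.
orders = [
    {"order_id": 1, "customer_id": 7, "amount_cents": 1250, "status": "shipped"},
    {"order_id": 2, "customer_id": 7, "amount_cents": 800, "status": "cancelled"},
    {"order_id": 3, "customer_id": 9, "amount_cents": 4300, "status": "shipped"},
    {"order_id": 4, "customer_id": 7, "amount_cents": 750, "status": "shipped"},
]
customers = {7: {"first": "Ann", "last": "Lee"}, 9: {"first": "Raj", "last": "Patel"}}

# Record-level selection: keep only the partition of rows that is wanted.
shipped = [o for o in orders if o["status"] == "shipped"]

# Record-level joining: combine each order with its customer record.
joined = [{**o, **customers[o["customer_id"]]} for o in shipped]

# Field-level transformations.
transformed = [
    {
        "amount": r["amount_cents"] / 100,             # single-field: one field to one field
        "customer_name": f'{r["first"]} {r["last"]}',  # multi-field: many fields to one
    }
    for r in joined
]

# Record-level aggregation: summarize the amounts per customer.
totals = defaultdict(float)
for r in transformed:
    totals[r["customer_name"]] += r["amount"]

print(dict(totals))
```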
Load/Index = place transformed data into the warehouse and create indexes
Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse
Steps in data reconciliation
(cont.)
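Refresh mode and update mode differ only in how much of the target is rewritten. The sketch below uses Python's built-in sqlite3 module; the table, the index and the rows are hypothetical, and INSERT OR REPLACE is just one SQLite-specific way to apply changes.

```python
import sqlite3

# Hypothetical warehouse table with an index, as in the Load/Index step.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_dim (customer_id INTEGER PRIMARY KEY, city TEXT)")
conn.execute("CREATE INDEX idx_customer_city ON customer_dim (city)")


def refresh_load(rows):
    """Refresh mode: bulk rewrite of the whole target table."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM customer_dim")
        conn.executemany("INSERT INTO customer_dim VALUES (?, ?)", rows)


def update_load(changed_rows):
    """Update mode: write only the rows that changed in the source."""
    with conn:
        conn.executemany("INSERT OR REPLACE INTO customer_dim VALUES (?, ?)", changed_rows)


refresh_load([(1, "Boston"), (2, "Denver")])   # periodic full rewrite
update_load([(2, "Austin"), (3, "Miami")])     # one changed row, one new row
loaded = conn.execute("SELECT * FROM customer_dim ORDER BY customer_id").fetchall()
print(loaded)
```

After the update load, customer 1 is untouched, customer 2 carries the new city, and customer 3 has been added.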
Data cleansing
◦ the process of validating and enriching the data as it is published to the DW
◦ also, a software development tool for building data cleansing processes (a data cleansing tool)
◦ many production DWs have only very rudimentary data quality assurance processes
Data Quality Assurance
Getting data loaded efficiently and correctly is critical to the success of your DW
◦ implementation of data acquisition & cleansing processes represents from 50 to 80% of effort on typical DW projects
◦ inaccurate data content can be ‘the kiss of death’ for user acceptance
Data Acquisition & Cleansing
Carefully design the data acquisition and cleansing processes for your DW
◦ Ensure the data is processed efficiently and accurately
◦ Consider acquiring ETL and Data Cleansing tools
◦ Use them well!
Best Practice #2
The benefits of a dimensional model have already been discussed
No matter whether dimensional modeling or any other design approach is used, the data model must be documented
Data Model
The best practice is to use some kind of data modeling tool
◦ CA ERwin
◦ Sybase PowerDesigner
◦ Oracle Designer
◦ IBM Rational Rose
◦ Etc.
Different tools support different modeling notations, but the notations are more or less equivalent
Most tools allow sharing of their metadata with an ETL tool
Documenting the Data Model
data model standards appropriate for the environment and tools chosen in your data warehouse should be adopted
consideration should be given to data access tool(s) and integration with overall enterprise standards
standards must be documented and enforced within the DW team
◦ someone must ‘own’ the data model
to ensure a quality data model, all changes should be reviewed through some formal process
Data Model Standards
Business definitions should be recorded for every field (unless they are technical fields only)
Domain of data should be recorded
Sample values should be included
As more metadata is populated into the modeling tool it becomes increasingly important to be able to share this data across ETL and Data Access tools
Data Model Metadata
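One way to picture this metadata is as a small record per field that can be checked and exported. The FieldMetadata class, its attribute names and the JSON export below are assumptions for illustration and do not reflect the format of any particular modeling tool.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class FieldMetadata:
    """Illustrative metadata record for one field of the data model."""
    table: str
    column: str
    business_definition: str
    domain: str
    sample_values: list = field(default_factory=list)
    technical_only: bool = False


def missing_definitions(fields):
    """List business fields that still lack a business definition."""
    return [
        f"{f.table}.{f.column}"
        for f in fields
        if not f.technical_only and not f.business_definition.strip()
    ]


catalog = [
    FieldMetadata("customer_dim", "region", "Sales region of the customer",
                  "text, one of the defined regions", ["East", "West"]),
    FieldMetadata("customer_dim", "segment", "", "text"),
    FieldMetadata("customer_dim", "load_batch_id", "", "integer", technical_only=True),
]

undocumented = missing_definitions(catalog)
# A neutral format such as JSON is one simple way to share metadata between tools.
exported = json.dumps([asdict(f) for f in catalog], indent=2)
print(undocumented)
```

Here only customer_dim.segment is flagged, since load_batch_id is marked as a technical field.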
The strategy for sharing data model and other metadata should be formalized and documented
Metadata management tools should be considered & the overall metadata architecture should be carefully planned
Metadata Architecture
Design a metadata architecture that allows sharing of metadata between components of your DW
Best Practice #3
Bill Inmon: “Corporate Information Factory”
Hub and Spoke philosophy
“JBOC” – just a bunch of cubes
Let it evolve naturally
Alternative Architecture Approaches
In most cases, business and IT agree that the data warehouse should provide a ‘single version of the truth’
Any approach that can result in disparate data marts or cubes is undesirable
This is known as data silos or…
What We Want (Architectural Principle)
how to design an enterprise data warehouse and ensure a ‘single version of the truth’?
according to Kimball:
◦ start with an overall data architecture phase
◦ use “Data Warehouse Bus” design to integrate multiple data marts
◦ use incremental approach by building one data mart at a time
Enterprise DW Architecture
named for the bus in a computer
◦ standard interface that allows you to plug in CD-ROM, disk drive, etc.
◦ these peripherals work together smoothly
provides framework for data marts to fit together
allows separate data marts to be implemented by different groups, even at different times
Data Warehouse Bus Architecture
data mart is a complete subset of the overall data warehouse
◦ a single business process OR
◦ a group of related business processes
think of a data mart as a collection of related fact tables sharing conformed dimensions, aka a ‘fact constellation’
Data Mart Definition
determine which dimensions will be shared across multiple data marts
conform the shared dimensions
produce a master suite of shared dimensions
determine which facts will be shared across data marts
conform the facts
standardize the definitions of facts
Designing The DW Bus
conformed dimensions will usually be granular
◦ makes it easy to integrate with various base level fact tables
◦ easy to extend fact table by adding new facts
◦ no need to drop or reload fact tables, and no keys have to be changed
Dimension Granularity
by adhering to standards, the separate data marts can be plugged together
◦ e.g. customer, product, time
they can even share data usefully, for example in a drill across report
ensures reports or queries from different data marts share the same context
Conforming Dimensions
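A drill-across report can be sketched with two fact tables that share one conformed dimension. The example uses Python's built-in sqlite3 module; the sales_fact and shipments_fact tables, their columns and the rows are hypothetical.

```python
import sqlite3

# Two hypothetical data marts (sales, shipments) sharing one conformed
# product dimension; all names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim    (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE sales_fact     (product_key INTEGER, units_sold INTEGER);
CREATE TABLE shipments_fact (product_key INTEGER, units_shipped INTEGER);
INSERT INTO product_dim VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO sales_fact VALUES (1, 10), (1, 5), (2, 7);
INSERT INTO shipments_fact VALUES (1, 12), (2, 7);
""")

# Drill across: summarize each fact table separately at the grain of the
# shared dimension, then line the results up on the conformed key.
DRILL_ACROSS = """
SELECT p.product_name, s.units_sold, h.units_shipped
FROM product_dim p
JOIN (SELECT product_key, SUM(units_sold) AS units_sold
      FROM sales_fact GROUP BY product_key) s ON s.product_key = p.product_key
JOIN (SELECT product_key, SUM(units_shipped) AS units_shipped
      FROM shipments_fact GROUP BY product_key) h ON h.product_key = p.product_key
ORDER BY p.product_name
"""
report = conn.execute(DRILL_ACROSS).fetchall()
print(report)
```

Because both marts use the same product keys and labels, the two measures land on the same report row; without a conformed dimension there would be no reliable column to join on.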
a current trend in BI/DW is ‘data consolidation’
from a software vendor perspective, it is tempting to simplify this:
◦ ‘we can keep all the tables for all your disparate applications in one physical database’
Data Consolidation
To truly achieve ‘a single version of the truth’, we must do more than simply consolidate application databases
We must integrate data models and establish common terms of reference
Data Integration
Take an approach that consolidates data into ‘a single version of the truth’
◦ Data Warehouse Bus: conformed dimensions & facts
◦ OR?
Best Practice #4
a single point of integration for disparate operational systems
contains integrated data at the most detailed level (transactional)
may be loaded in ‘near real time’ or periodically
can be used for centralized operational reporting
Operational Data Store (ODS)
Consider implementing an ODS only when information retrieval requirements are near the bottom of the data abstraction pyramid and/or when there are multiple operational sources that need to be accessed
◦ Must ensure that the data model is integrated, not just consolidated
◦ May consider 3NF data model
◦ Avoid at all costs a ‘data dumping ground’
Best Practice #5
DW workloads are typically very demanding, especially for I/O capacity
Successful implementations tend to grow very quickly, both in number of users and data volume
Rules of thumb do exist for sizing the hardware platform to provide adequate initial performance
◦ typically based on the estimated ‘raw’ data size of the proposed database, e.g. 100-150 GB per modern CPU
Capacity Planning
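The rule of thumb above turns into a quick estimate. The 1200 GB raw data size is a hypothetical input; only the 100-150 per-CPU range comes from the slide, read here as gigabytes.

```python
import math


def cpus_needed(raw_data_gb, gb_per_cpu):
    """Initial CPU count for a given raw data size, rounded up to whole CPUs."""
    return math.ceil(raw_data_gb / gb_per_cpu)


RAW_DATA_GB = 1200  # hypothetical estimate for the proposed database

optimistic = cpus_needed(RAW_DATA_GB, 150)    # each CPU handles 150 GB
conservative = cpus_needed(RAW_DATA_GB, 100)  # each CPU handles 100 GB
print(f"{optimistic} to {conservative} CPUs")
```

For this hypothetical database the rule gives a starting range of 8 to 12 CPUs, which only covers initial performance and says nothing about later growth.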
Scaling performance within a single SMP server is referred to as ‘scale up’
Database benchmarks suggest Windows scalability is near that of Linux
IBM claims near-linear scalability for Linux (on commodity hardware) up to about 4 processors
◦ Probably not cost effective to scale up Linux much beyond 4 processors
IBM claims near-linear scalability for AIX on POWER5 up to about 8 processors
SMP Server Scale Up
To obtain the total number of processors required for the estimated DW workload, must plan either to scale up or scale out
Both options are viable but, all other things being equal, scaling up is less disruptive to end users and requires less work to implement
◦ scaling up can offer lower hardware investment, if practical
◦ however, network bandwidth or latency issues can limit effectiveness of parallelism
Scale Up vs. Scale Out
Create a capacity plan for your BI application & monitor it carefully
Consider future additional performance demands
◦ Establish standard performance benchmark queries and regularly run them
◦ Implement capacity monitoring tools
◦ Build scalability into your architecture
◦ May need to allow for scaling both up and out!
Best Practice #6
Another emerging trend in IT generally is to utilize Open Source software running on commodity hardware
◦ this is expected to offer lower total cost of ownership
◦ certainly, GNU/Linux and other Open Source initiatives do provide very good functionality and quality for minimal cost
This trend also applies to BI & DW:
◦ most traditional RDBMSs are now supported on Linux
◦ however, open source RDBMSs lag behind on providing good performance for DW queries
Open Source Affordability
DW appliances, consisting of packaged solutions providing all required software and hardware, are beginning to offer very promising price/performance
production experience is limited so far, so this is not yet a ‘best practice’
DW Appliances
In the case where an ODS is a necessary component of the overall DW, it should be carefully integrated into the overall architecture
Can also be used for:
◦ Staging area
◦ Master/reference data management
◦ Etc…
Role of an ODS in DW Architecture