F4: DW Architecture and Lifecycle
Erik Perjons, DSV, SU/[email protected]
The data warehouse architecture
[Figure: operational source systems and external sources pass through Extract/Transform/Load into the data warehouse and the data marts, which serve the tools for query/reporting, analysis/OLAP and data mining.]
The four components, in Ralph Kimball's (RK) terms:
• Operational source systems (RK): legacy systems, OLTP/TP systems
• Data staging area (RK): back end tools
• Data presentation area (RK): "the data warehouse", presentation (OLAP) servers
• Data access tools (RK): end user applications, Business Intelligence tools
The figure also divides the architecture into the back room and the front room.
Operational Source Systems
Operational source systems characteristics:
• the source data is often in OLTP (Online Transaction Processing) systems, also called TPS (Transaction Processing Systems)
• high level of performance and availability
• often one-record-at-a-time queries
• already occupied by the normal operations of the organisation
OLTP vs. DSS (Decision Support Systems)
OLTP vs. OLAP (Online Analytical Processing)
More operational source systems characteristics:
• an OLTP system may be reliable and consistent, but there are often inconsistencies between different OLTP systems
• different OLTP systems use different data formats and data structures, and different semantics
Kimball et al's assumptions (p. 7):
• Source systems are not queried in broad and unexpected ways
• Source systems maintain little historical data
• Each source system is often a natural stovepipe application
DW architecture: Data staging area
The Data Staging Area
Often the most complex part of the architecture. It involves:
• Extraction (E)
• Transformation (T)
• Load (L)
• Indexing
ETL tools can be used, or scripts for extraction, transformation and load are implemented.
Extraction means reading and understanding the source data and copying the data needed for the data warehouse into the staging area for further manipulation, i.e. transformation.
Transformation involves…
• data conversion/transformation (specify transformation rules to convert to a common data format and common terms/semantics)
• data cleaning/cleansing
– data scrubbing (use domain-specific knowledge, e.g. postal addresses, to check the data)
– data auditing (discover suspicious patterns, discover violations of stated rules)
• combining data from multiple sources
• assigning warehouse (surrogate) keys
• data aggregation
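The transformation steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source extracts, their field names, the city lookup table and the "amounts must be non-negative" rule are all hypothetical and not taken from the lecture.

```python
from collections import defaultdict

# Hypothetical extracts from two source systems with different formats and semantics.
SOURCE_A = [
    {"cust": "Anna N", "city": "stockholm ", "amount": "25.00", "date": "1999-10-11"},
    {"cust": "Danny B", "city": "STOCKHOLM", "amount": "12.00", "date": "1999-10-11"},
]
SOURCE_B = [
    {"customer_name": "Anna N", "town": "Sthlm", "sum_ore": 500, "day": "991011"},
    {"customer_name": "Erik P", "town": "Rattvik?", "sum_ore": -100, "day": "991011"},
]

# Hypothetical domain knowledge used for scrubbing.
KNOWN_CITIES = {"stockholm": "Stockholm", "sthlm": "Stockholm", "malmö": "Malmö"}


def convert_a(row):
    """Conversion: map source A onto the common format (date as YYMMDD)."""
    return {"customer": row["cust"], "city": row["city"],
            "amount": float(row["amount"]),
            "date": row["date"].replace("-", "")[2:]}


def convert_b(row):
    """Conversion: source B uses other field names and stores amounts in öre."""
    return {"customer": row["customer_name"], "city": row["town"],
            "amount": row["sum_ore"] / 100, "date": row["day"]}


def scrub(row):
    """Scrubbing: normalise the city name; return None if it is unknown."""
    city = KNOWN_CITIES.get(row["city"].strip().lower())
    return {**row, "city": city} if city else None


def audit(row):
    """Auditing: the stated rule assumed here is that amounts are non-negative."""
    return row["amount"] >= 0


def transform(sources):
    surrogate_keys = {}           # natural key -> warehouse (surrogate) key
    clean, rejected = [], []
    for convert, rows in sources:  # combining data from multiple sources
        for raw in rows:
            row = scrub(convert(raw))
            if row is None or not audit(row):
                rejected.append(raw)
                continue
            key = surrogate_keys.setdefault(row["customer"], len(surrogate_keys) + 1)
            clean.append({**row, "customer_key": key})
    totals = defaultdict(float)   # aggregation per customer key and date
    for row in clean:
        totals[(row["customer_key"], row["date"])] += row["amount"]
    return clean, rejected, dict(totals)


clean, rejected, totals = transform([(convert_a, SOURCE_A), (convert_b, SOURCE_B)])
print(totals)
print(len(rejected), "row(s) rejected")
```

Rejected rows are kept rather than silently dropped, so that they can be inspected and corrected in the staging area.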
A debate question:
Should the data in the data staging area be stored in a 3NF relational database and loaded into the presentation area for querying and reporting?
Kimball (p. 8-9): a 3NF relational database in the data staging area requires more time and resources for development, periodic loading and updating, and more storage capacity for the multiple copies of the data
A Real World Example
[Figure: data flow from various source files, via a staging area, to the target data warehouse. Labels recovered from the diagram:]
• Various source files: flat file (C); customer data (F, G); start balance (H); fees (I), manually adjusted to individual agreements
• DB2 table(s) (D'), accessed via DB2 Connect
• Staging area for checking, analysing, cleaning, complementing etc. transaction data (SQL, C++ ??)
• Note in the diagram: some cleansing and scrubbing may be needed here
• DB2 preliminary target DW (E): three star/join schemas comprising altogether 8 tables
– Fact tables: transactions (10 attributes), fees (7 attributes), start balance (4 attributes)
– Dimension tables: time (7 attr), customer (> 40 attr), company (> 90 attr), product (13 attr), "Service charged" (2 attr)
• DB2 final target DW (E'): E complemented with some aggregated tables (+ aggregation, new program)
DW architecture: Data presentation area
Data presentation area
• What is OLAP?
• Dimensional modelling vs. 3NF modelling
• Data marts
• ROLAP/MOLAP servers
What is OLAP?
• Acronym for "On-line analytical processing"
• A decision support system (DSS) that supports ad-hoc querying, i.e. enables managers and analysts to interactively manipulate data. The idea is to allow the users to easily and quickly manipulate and visualise the data through multidimensional views, i.e. different perspectives.
Kimball: Dimensional modelling
[Figure: a data cube with the axes quarter, office and product, shown next to a Facts table with the dimensions Service, Office and Quarter.]
Dimensional modelling

Time Dimension
Date/Key  Month  Quarter  Year
991011    9910   4 - 99   99
991012    9910   4 - 99   99

Customer Dimension
Key   Customer  Address    Region     Income group
C210  Anna N    Stockholm  Stockholm  B
C211  Lars S    Malmö      Skåne      B
C212  Erik P    Rättvik    Dalarna    C
C213  Danny B   Stockholm  Stockholm  A
C214  Åsa S     Stockholm  Stockholm  A

Service Dimension
Key  Service       Service group
S1   Local call    Group A
S2   Intern. call  Group A
S3   SMS           Group B
S4   WAP           Group C

Sales Dimension
Key  Seller    Office
F11  Anders C  Sundsvall
F12  Lisa B    Sundsvall
F13  Janis B   Kista

Fact table - Transactions
Customer  Service  Seller  Date    Sum    Number of calls
C210      S1       F11     991011  25:00  3
C210      S3       F11     991011  05:00  1
C212      S2       F13     991011  89:00  1
C213      S1       F13     991011  12:00  1
C214      S4       F13     991012  08:00  1
Relationships between the tables: each of the four dimension tables relates to the fact table as 1 to 0..*, i.e. one dimension row can occur in any number of fact rows.
Query: For how much did customers in Stockholm use the service "Local call" in October 1999?
Σ = 37:00
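The query can be run as a star join against the example tables. The sketch below uses Python's built-in sqlite3 module. The table and column names are chosen here for illustration, and the "Sum" values are assumed to be amounts, so that 25:00 is stored as 25.0.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim     (date_key TEXT PRIMARY KEY, month TEXT, quarter TEXT, year TEXT);
CREATE TABLE customer_dim (cust_key TEXT PRIMARY KEY, customer TEXT, address TEXT,
                           region TEXT, income_group TEXT);
CREATE TABLE service_dim  (service_key TEXT PRIMARY KEY, service TEXT, service_group TEXT);
CREATE TABLE sales_dim    (seller_key TEXT PRIMARY KEY, seller TEXT, office TEXT);
CREATE TABLE transactions (cust_key TEXT, service_key TEXT, seller_key TEXT,
                           date_key TEXT, amount REAL, number_of_calls INTEGER);
""")
con.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?)", [
    ("991011", "9910", "4 - 99", "99"),
    ("991012", "9910", "4 - 99", "99"),
])
con.executemany("INSERT INTO customer_dim VALUES (?, ?, ?, ?, ?)", [
    ("C210", "Anna N", "Stockholm", "Stockholm", "B"),
    ("C211", "Lars S", "Malmö", "Skåne", "B"),
    ("C212", "Erik P", "Rättvik", "Dalarna", "C"),
    ("C213", "Danny B", "Stockholm", "Stockholm", "A"),
    ("C214", "Åsa S", "Stockholm", "Stockholm", "A"),
])
con.executemany("INSERT INTO service_dim VALUES (?, ?, ?)", [
    ("S1", "Local call", "Group A"),
    ("S2", "Intern. call", "Group A"),
    ("S3", "SMS", "Group B"),
    ("S4", "WAP", "Group C"),
])
con.executemany("INSERT INTO sales_dim VALUES (?, ?, ?)", [
    ("F11", "Anders C", "Sundsvall"),
    ("F12", "Lisa B", "Sundsvall"),
    ("F13", "Janis B", "Kista"),
])
con.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?, ?, ?)", [
    ("C210", "S1", "F11", "991011", 25.0, 3),
    ("C210", "S3", "F11", "991011", 5.0, 1),
    ("C212", "S2", "F13", "991011", 89.0, 1),
    ("C213", "S1", "F13", "991011", 12.0, 1),
    ("C214", "S4", "F13", "991012", 8.0, 1),
])

# Star join: constrain the dimensions, then sum the matching facts.
QUERY = """
SELECT SUM(f.amount)
FROM transactions f
JOIN customer_dim c ON c.cust_key = f.cust_key
JOIN service_dim  s ON s.service_key = f.service_key
JOIN time_dim     t ON t.date_key = f.date_key
WHERE c.region = 'Stockholm' AND s.service = 'Local call' AND t.month = '9910'
"""
total = con.execute(QUERY).fetchone()[0]
print(total)  # 37.0
```

Only the rows for C210 and C213 with service S1 satisfy all three constraints, which gives 25 + 12 = 37.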
3NF modelling vs. Dimensional modelling [Kimball et al, p. 10-11]
Key difference between 3NF and dimensional modelling:
- the degree of normalisation
3NF modelling
- a logical design technique that eliminates data redundancy to keep consistency and storage efficiency, and makes transactions simple and deterministic
- ER models for an enterprise are usually complex, e.g. they often have hundreds, or even thousands, of entities/tables
Dimensional modelling
- a logical design technique that presents data in an intuitive way, i.e. easier for the user to navigate
- allows high-performance access/queries (the complexity of 3NF models overwhelms the database system's optimizer, which means bad performance)
- aims at modelling decision support data
Data presentation area – Data marts
Kimball et al (p. 10-12 and 396):
"we refer to the presentation area as a series of integrated data marts"
"a data mart is a flexible set of data, ideally based on the most atomic (granular) data possible to extract from operational source, and presented in a symmetric (dimensional) model that is resilient when faced with unexpected user queries"
"in its most simplistic form a data mart represents data from a single business process" (business process = purchase order, store inventory and so on)
Data marts
[Figure: two data marts, Calls and Subscription orders, each drawn with the dimensions Service, Office and Quarter.]
The data warehouse bus architecture [Kimball et al, p. 78-79]
[Figure: the data marts Orders and Production attached to a bus of dimensions: Time, Sales Rep, Customer, Promotion, Product, Plant, Distr. Center.]
Data marts
• A dimensional model for a large data warehouse consists of between 10 and 25 similar-looking data marts. Each data mart will have 5 to 15 dimension tables.
Kimball et al's strong opinions (p. 10-12)
• all data in the presentation area should be presented, stored and accessed in dimensional models
• the data marts must contain detailed, atomic data (it is unacceptable that the detailed data should be locked up in 3NF models for drill-down)
• the dimensions of the data marts should be conformed for drill-across techniques, which tie the data marts together in the data warehouse bus architecture
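A small sketch may help to show why conformed dimensions matter for drill-across. It assumes two data marts, Calls and Subscription orders, that share the Office and Quarter dimensions. The fact rows below are hypothetical. Each mart is first aggregated to the same grain, and the results are then lined up on the shared dimension values.

```python
from collections import defaultdict

# Hypothetical fact rows from two data marts with conformed Office and Quarter dimensions.
CALLS = [  # (office, quarter, service, number_of_calls)
    ("Sundsvall", "4-99", "Local call", 3),
    ("Sundsvall", "4-99", "SMS", 1),
    ("Kista", "4-99", "Local call", 1),
]
SUBSCRIPTION_ORDERS = [  # (office, quarter, service, orders)
    ("Sundsvall", "4-99", "SMS", 2),
    ("Kista", "4-99", "WAP", 5),
    ("Kista", "1-00", "WAP", 1),
]


def aggregate(rows):
    """Roll one fact table up to the grain (office, quarter)."""
    totals = defaultdict(int)
    for office, quarter, _service, measure in rows:
        totals[(office, quarter)] += measure
    return totals


def drill_across(calls, orders):
    """Query each mart separately, then merge on the conformed dimension values."""
    calls_by, orders_by = aggregate(calls), aggregate(orders)
    keys = sorted(set(calls_by) | set(orders_by))
    return [(office, quarter,
             calls_by.get((office, quarter), 0),
             orders_by.get((office, quarter), 0))
            for office, quarter in keys]


report = drill_across(CALLS, SUBSCRIPTION_ORDERS)
for line in report:
    print(line)
```

The merge only works because both marts spell offices and quarters identically. If one mart wrote "Q4 1999" and the other "4-99", the rows would not line up.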
More about data marts:
• far smaller data volumes, fewer data sources
• easier data cleaning process, faster roll-out
• allows a "piecemeal" approach to some of the enormous integration problems involved in creating an enterprise-wide data model, but leads to complex integration in the long term
Dependent vs. Independent Data marts
[Figure: two diagrams contrasting dependent data marts and independent data marts in relation to the data warehouse.]
The presentation/OLAP servers
Extended Relational DBMS (ROLAP servers)
– data stored in RDB
– star-join schemas
– support SQL extensions
– index structures
Multidimensional DBMS (MOLAP servers)
– data stored in arrays (n-dimensional array)
– direct access to array data structure
– excellent indexing properties
– poor storage utilisation, especially when the data is sparse
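The trade-off of array storage can be illustrated with a toy cube. The sketch below is not how any particular MOLAP server is implemented. It only assumes that each dimension member maps to an array position. The dimension members and the stored values are hypothetical.

```python
# Dimension members determine array positions (hypothetical small cube).
PRODUCTS = ["S1", "S2", "S3", "S4"]
OFFICES = ["Sundsvall", "Kista"]
QUARTERS = ["1-99", "2-99", "3-99", "4-99"]


def empty_cube():
    """Allocate one cell for every combination of dimension members."""
    return [[[None for _ in QUARTERS] for _ in OFFICES] for _ in PRODUCTS]


def position(product, office, quarter):
    """Translate dimension members into array indexes."""
    return PRODUCTS.index(product), OFFICES.index(office), QUARTERS.index(quarter)


def store(cube, product, office, quarter, value):
    p, o, q = position(product, office, quarter)
    cube[p][o][q] = value


def lookup(cube, product, office, quarter):
    """Direct access: no join or search over fact rows is needed."""
    p, o, q = position(product, office, quarter)
    return cube[p][o][q]


def utilisation(cube):
    """Share of allocated cells that actually hold a value."""
    cells = [value for plane in cube for row in plane for value in row]
    return sum(value is not None for value in cells) / len(cells)


cube = empty_cube()
store(cube, "S1", "Sundsvall", "4-99", 25.0)
store(cube, "S3", "Sundsvall", "4-99", 5.0)
store(cube, "S2", "Kista", "4-99", 89.0)
store(cube, "S4", "Kista", "4-99", 8.0)

print(lookup(cube, "S2", "Kista", "4-99"))
print(utilisation(cube))
```

With 4 x 2 x 4 = 32 allocated cells and only 4 values stored, most of the array is empty, which is the sparsity problem named in the last bullet.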
More about presentation servers
What characterises data warehouse servers, according to Chaudhuri & Dayal:
• Index structures (bitmap indexes, join indexes)
• SQL extensions (operators like Cube, Crossjoin)
• Materialised views (pre-aggregations)
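As an illustration of the first item, a bitmap index keeps one bit vector per distinct column value, and a combined condition becomes a bitwise AND. The sketch below uses Python integers as bit vectors over the rows of the Customer dimension from the earlier example. It is a simplified illustration, not the implementation used by any specific DBMS.

```python
# Rows of the customer dimension: (key, region, income group).
ROWS = [
    ("C210", "Stockholm", "B"),
    ("C211", "Skåne", "B"),
    ("C212", "Dalarna", "C"),
    ("C213", "Stockholm", "A"),
    ("C214", "Stockholm", "A"),
]


def build_bitmap_index(rows, column):
    """One bitmap per distinct value; bit n is set if row n has that value."""
    index = {}
    for row_number, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << row_number)
    return index


def matching_rows(bitmap):
    """Row numbers whose bit is set in the bitmap."""
    return [n for n in range(bitmap.bit_length()) if (bitmap >> n) & 1]


region_index = build_bitmap_index(ROWS, 1)
income_index = build_bitmap_index(ROWS, 2)

# region = 'Stockholm' AND income group = 'A' is a single bitwise AND.
hits = region_index["Stockholm"] & income_index["A"]
print([ROWS[n][0] for n in matching_rows(hits)])
```

Bitmap indexes suit columns with few distinct values, such as region or income group, because only one bitmap per value is needed.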
DW architecture: Metadata repository
[Figure: the data warehouse architecture, now also showing Refresh after Extract/Transform/Load, a metadata repository, monitoring & administration, and OLAP servers.]
What is metadata?
"Data about data" / "Information about data"
Main functions are to give...
• data definitions
• the origin of data
• the structure of data
• rules for the selection and transfer of data
• qualitative and quantitative data about data
Contained in the metadata repository
The metadata repository
An integrated, complete source of metadata
• is at the heart of the data warehouse architecture
• supports the information needs of...
– system developers
– data administrators
– system administrators
– users
– applications on the data warehouse
• very complex data structure
• must contain full version history
• must always be up to date
Metadata life cycle activities
• Collection: identify and capture metadata in a central repository
• Maintenance: establish processes to synchronise metadata with the changing data structure
• Deployment: provide metadata to users in the right form and with the right tools
Different types of metadata
• Administrative metadata (includes all information necessary for setting up and using a DW, e.g. information about source databases, DW schemas, dimensions, hierarchies, predefined queries, physical organisation, rules and scripts for extraction, transformation and load, back-end and front-end tools)
• Business metadata (business terms and definitions, ownership of data)
• Operational metadata (information collected during the operation of the DW, e.g. usage statistics, error reports)
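To make the repository idea more concrete, the sketch below models a metadata entry with a type, a definition, an origin and a version history, together with the three life cycle activities (collection, maintenance, deployment). The class names, fields and sample entries are hypothetical. A real repository would be far richer.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataEntry:
    name: str
    kind: str            # "administrative", "business" or "operational"
    definition: str
    origin: str          # where the described data comes from
    history: list = field(default_factory=list)  # earlier definitions, oldest first


class MetadataRepository:
    def __init__(self):
        self._entries = {}

    def collect(self, entry):
        """Collection: capture metadata in the central repository."""
        self._entries[entry.name] = entry

    def maintain(self, name, new_definition):
        """Maintenance: record a change while keeping the old version."""
        entry = self._entries[name]
        entry.history.append(entry.definition)
        entry.definition = new_definition

    def deploy(self, kind):
        """Deployment: hand out the entries of one metadata type."""
        return [e for e in self._entries.values() if e.kind == kind]


repo = MetadataRepository()
repo.collect(MetadataEntry("income_group", "business",
                           "Customer income band A-C", "customer source file"))
repo.collect(MetadataEntry("load_errors", "operational",
                           "Rows rejected per load", "ETL log"))
repo.maintain("income_group", "Customer income band A-D")

business = repo.deploy("business")
print(business[0].definition, business[0].history)
```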
DW architecture: End user applications
End user applications
• OLAP tools, BI apps, DSS
• Query/Reporting tools
• Data mining
Spreadsheet output of OLAP tool
Dimension hierarchies shown: product → product group; month → quarter; office → region

Product Group  Region  First Quarter - 1997
Group A        ABC     1245
Group A        XYZ     34534
Group B        ABC     45543
Group B        XYZ     34533

Annotations in the figure: row headers; column headers (join constraints); column header (application constraint); answer set representing focal event
[Figure: Graphical output of OLAP tool]
Functionalities of OLAP tools
• Drill-down - decreasing the level of aggregation
• Drill-up/Roll-up/Consolidation - increasing the level of aggregation
• Drill-across - move between different star-join schemas using conformed dimensions and joins
• Slicing and dicing - ability to look at the database from different views, e.g. one slice shows all sales of product type within regions, another slice shows all sales by sales channel within each product type
• Pivoting - e.g. change columns to rows, rows to columns
• Ranking - sorting
"Think of an OLAP data structure as a Rubik's Cube of data that users can twist and twirl in different ways to work through what-if and what-happened scenarios" [Lee Thé]
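Several of these operations can be shown on the four cells of the spreadsheet example (product group by region, first quarter 1997). The sketch below represents the cells as a Python dictionary. The function names are chosen for illustration and do not belong to any OLAP product.

```python
from collections import defaultdict

# Cells from the spreadsheet example: (product group, region) -> value.
CELLS = {
    ("Group A", "ABC"): 1245,
    ("Group A", "XYZ"): 34534,
    ("Group B", "ABC"): 45543,
    ("Group B", "XYZ"): 34533,
}


def roll_up(cells, axis):
    """Roll-up: keep only the dimension at position `axis`, sum over the rest."""
    totals = defaultdict(int)
    for key, value in cells.items():
        totals[key[axis]] += value
    return dict(totals)


def slice_cells(cells, axis, member):
    """Slice: fix one dimension to a single member."""
    return {key: value for key, value in cells.items() if key[axis] == member}


def pivot(cells):
    """Pivot: swap the two dimensions (rows become columns)."""
    return {(b, a): value for (a, b), value in cells.items()}


def rank(totals):
    """Ranking: sort members by value, largest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


by_group = roll_up(CELLS, 0)               # increase aggregation: region is summed away
abc_only = slice_cells(CELLS, 1, "ABC")    # one slice: region ABC
by_region_first = pivot(CELLS)             # region now comes first in each key
ranking = rank(by_group)

print(by_group)
print(ranking)
```

Drill-down is the reverse of the roll-up: going from `by_group` back to the detailed `CELLS`, which is only possible if the detailed data has been kept.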
Business Intelligence (BI) apps
Strategic
Who: strategic leaders
What: formulate strategy and monitor corporate performance
Examples: Balanced scorecard, strategic planning
Operational
Who: operational managers
What: execution of strategy against objectives
Examples: Budgeting, sales forecasting
Analytical
Who: analysts, knowledge workers, controllers
What: ad-hoc analysis
Examples: Financial and sales analysis, customer segmentation, clickstream analysis
Problems of Data Warehousing
• Complexity of integration
– Hidden problems with source systems
– Data homogenisation
– Underestimation of resources for data loading
• Required data not captured
• High maintenance
• Long duration projects
• Why not integrate the legacy applications (OLTP systems) instead?
Operational Data Store (ODS)
No single universal definition...
ODS definition 1: Implemented to deliver operational reporting, especially when neither the legacy nor the modern OLTP systems provide adequate operational reports; used for fixed queries and tactical decision making
ODS definition 2: Built to support real-time interactions, especially in Customer Relationship Management applications; the traditional data warehouse typically is not in a position to support the demand for near-real-time data
OMG's standards
Layer     Content        Example
M3 layer  Metametamodel  Meta Object Facility (MOF)
M2 layer  Metamodel      UML Metamodel, CWM Metamodel
M1 layer  Model
M0 layer  Instances      Invoice no 34, Helen Nagy
Common Warehouse Metamodel (CWM)
[Figure: a data warehousing environment with data sources, operational data store, ETL, data warehouse, data marts, and tools for analysis, reporting, visualization and data mining.]
The collection of metamodels in CWM can be used to model the whole data warehousing environment, i.e. from data sources to end-user analysis, and data warehouse management
• Common Warehouse Metamodel (CWM) is a language specifically designed to model data warehousing and data mining applications, i.e. for integrating data warehousing and business analysis (business intelligence) tools
• CWM has a lot in common with the UML metamodel but has a number of special metamodels (metaclasses), e.g. for modelling relational databases, multidimensional databases, OLAP, schema transformations, XML
[Kleppe et al, p. 139-140 (2003)]
Why meta-modelling? [Rosemann, Green, 2002]
[Figure with three levels:
Model level: two process models of the same order handling, with the elements "Order received", "Capture ordered items", "Ordered item [captured]", "Check material on stock", "Material on stock [checked]", "Material is on stock" and "Material is not on stock".
Meta-model level: State precedes/succeeds Activity; Event precedes/succeeds Function.
Meta-metamodel level or reference model: State, Event and Transformation, linked by "consists of".]
CWM packages (packages/metamodels)
Management:    Warehouse Process, Warehouse Operation
Analysis:      Transformation, OLAP, Data Mining, Information Visualization, Business Nomenclature
Resource:      Relational, Record, Multi-Dimensional, XML
Foundation:    Business Information, Data Types, Expressions, Keys and Indexes, Software Deployment, Type Mapping
Object Model:  Core, Behavioral, Relationships, Instance
CWM package layers [Poole et al, p. 36-40 (2002)]
• Object layer - base metamodels/packages, which are (re)used by the other metamodels/packages
• Foundation layer - extends the object layer with required services which are (re)used by the other metamodels/packages, e.g. "unique key" in the Keys and Indexes metamodel/package is used by relational, OO and record-oriented databases
• Resource layer - defines metamodels/packages for various types of data resources
• Analysis layer - analysis-oriented metadata
• Management layer - describes the data warehousing process as a whole
CWM packages relations
[Figure: class diagram relating the Relational package to the Core package and the Datatype package. Classes and associations shown: Element, ModelElement, Namespace, Classifier, Class, Feature, ClassifierFeature, StructuralFeature, Attribute, Expression, ProcedureExpression, QueryExpression, ColumnSet, NamedColumnSet, QueryColumnSet, Table, View, Column.]
CWM classifier equality
Metamodel         Package     Classifier (Class)  Feature (Attribute)
Object            Package     Classifier          Feature
Relational        Schema      Table               Column
Record            RecordFile  RecordDef           Field
MultiDimensional  Schema      Dimension           Dimensioned Object
XML               Schema      ElementType         Attribute
More about CWM
[Figure: the CWM packages (<<metamodels>>) serve as a common representation between the metamodels of Tool X, Tool Y and Tool Z.]
Business Dimensional Lifecycle
[Figure: lifecycle diagram with the following steps]
• Project Planning
• Business Requirement Definition
• Three parallel tracks:
– Technical Architecture Design, then Product Selection & Installation
– Dimensional Modeling, then Physical Design, then Data Staging Design & Development
– End-User Application Specification, then End-User Application Development
• Deployment
• Maintenance and Growth
• Project Management (runs alongside all steps)
The Data Warehouse Architecture Framework
[Table: level of detail (rows) by architecture area (columns).]
Architecture areas: Data, Back room, Front room, Infrastructure
Levels of detail: Business reqs and audit; Architecture models and documents; Detailed models and specs; Implementation
Cell contents:
• Info needed for better decisions; enterprise models
• How to get, transform and make data available
• Capabilities needed to get and transform data; major data stores
• User's needs; major classes of analyses; priorities
• Where is data coming from; calc and storage reqs
• HW/SW capabilities needed vs. what we have
• Major business issues; how to measure; how to analyse
• Focal events, facts, dimensions
• Dimensional models
• Install and test infrastructure; connect sources to targets to desktop
• Standards and products to provide capabilities; how to hook together
• Report layouts, derivation; for whom, when
• How to interact with capabilities; system utilities, calls, APIs ...
• Write extracts and loads; automate process
• Implement report and analysis environment; build reports; train users
• Logical and physical models; domains, derivation rules
• DB, indexes, backup ...