Date post: | 27-Dec-2015 |
Upload: | claud-godfrey-malone |
Data Warehousing Seminar
Chapter 5. Data Warehouse Design Methodology
Data Warehousing Lab.
HyeYoung Cho
Index
The Information Utility's Infrastructure
The Preferred Architecture: Integration Layer and High Performance Query Structures
  Data Store 1 - The Source Systems
  Data Flow 1 - From the Data Sources to the Integration Layer
  Data Store 2 - The Integration Layer
  Data Flow 2 - From the Integration Layer to the High Performance Query Structures
  Data Store 3 - High Performance Query Structures (HPQS)
  Data Flow 3 - From the High Performance Query Structures to the End User Reporting Applications
  Data Store 4 - Data in the End User's Hands
Alternate Warehousing Architectures
The Information Utility's Infrastructure
A warehouse must:
  extract data from a variety of sources
  integrate data into a common repository
  put data into a format that users can use
  provide users with tools to access the warehouse
The Preferred Architecture: Integration Layer and High Performance Query Structures
4 data stores and 3 data flows.
Data Store 1 - The Source Systems
Source systems provide data to the warehouse:
  enterprise resource planning (ERP) packages
    SAP, PeopleSoft, Oracle applications
  home-grown applications
    OASIS system
  outside sources
    data purchased from outside vendors
[Figure: source systems (sales, accounting, distribution, etc.) -> warehouse data]
Data Flow 1 - From the Data Sources to the Integration Layer
data extraction step
  gets data out of its sources
  occurs at the beginning of every data flow
  very complex step
    variety of data storage technologies, e.g. Oracle, DB2, Informix, IMS, other formats
    -> each requires its own select statements and extraction code
considerations for extraction
Is This Extract Supporting the Initial Load of the Warehouse or a Periodic Refresh Load?
problems with complete refreshes
  the warehouse is a record of history!
  -> history is frequently lost by source systems, so a complete refresh would discard it
  warehouses tend to be very large!
  -> a complete refresh is a poor use of computing and telecommunications bandwidth
two architectures to load the warehouse
  initial load
    history data from offline storage, online data
    bring it all over
  periodic refresh
    changed source records only
    use special logic for timestamps
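The difference between the two load types can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the row layout (key, value, last-updated timestamp), the sample data, and the function name `extract` are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical source rows: (key, value, last_updated)
SOURCE_ROWS = [
    ("C001", "Acme", datetime(2015, 1, 10)),
    ("C002", "Globex", datetime(2015, 3, 2)),
    ("C003", "Initech", datetime(2015, 3, 9)),
]


def extract(rows, last_load=None):
    """Return every row for an initial load, or only rows changed since last_load."""
    if last_load is None:
        # Initial load: bring it all over.
        return list(rows)
    # Periodic refresh: keep only rows stamped after the previous load.
    return [row for row in rows if row[2] > last_load]


initial = extract(SOURCE_ROWS)
refresh = extract(SOURCE_ROWS, last_load=datetime(2015, 3, 1))
```

In this sketch the refresh moves two of the three rows, which is the bandwidth saving the slide points to.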
How Will I Determine What Records to Extract?
change data capture
  what source records have changed
  how those records are moved to the warehouse
delete question!
  no trace, the deleted record is just gone!
Techniques for recognizing changes
  Timestamps
    stamp each record whenever it is inserted or updated
    narrows the search for records that have changed
  Triggers
    put triggers on the source tables
    write a corresponding (insert, update, delete) message in a log file
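The trigger technique can be tried out with Python's built-in `sqlite3` module. This is only a sketch under stated assumptions: SQLite stands in for the source database, a `change_log` table stands in for the log file, and the table and trigger names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

-- Stand-in for the log file: one row per captured change.
CREATE TABLE change_log (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT,
    customer_id INTEGER,
    logged_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER customer_ins AFTER INSERT ON customer
BEGIN
    INSERT INTO change_log (op, customer_id) VALUES ('insert', NEW.id);
END;

CREATE TRIGGER customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO change_log (op, customer_id) VALUES ('update', NEW.id);
END;

CREATE TRIGGER customer_del AFTER DELETE ON customer
BEGIN
    INSERT INTO change_log (op, customer_id) VALUES ('delete', OLD.id);
END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme')")
conn.execute("UPDATE customer SET name = 'Acme Corp' WHERE id = 1")
conn.execute("DELETE FROM customer WHERE id = 1")

changes = conn.execute(
    "SELECT op, customer_id FROM change_log ORDER BY seq"
).fetchall()
```

Unlike a timestamp column, the delete trigger leaves a trace of the deleted record, which addresses the delete question raised above.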
  Application Integration Software (AIS)
    MQ Series, Mercator, Tibco...
    links applications: when a transaction occurs in one, transmits it to all the others
    all transactions in AIS-enabled systems
    real-time access to data
  File Compares
    compare today's file to the last loaded file
    difficult implementation and less accuracy
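A file compare can be sketched as a comparison of two keyed snapshots. The example below assumes each file has already been parsed into a dictionary of record key to record contents; the function name and sample data are hypothetical.

```python
def compare_snapshots(previous, current):
    """Classify changes between two snapshots, each a dict of key -> record."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserted, updated, deleted


last_loaded = {"C1": "Acme", "C2": "Globex", "C3": "Initech"}
today = {"C1": "Acme", "C2": "Globex Corp", "C4": "Umbrella"}

inserted, updated, deleted = compare_snapshots(last_loaded, today)
```

The sketch only sees the net difference between the two files: a record that changed twice, or was inserted and deleted between loads, is not captured, which is one reason the technique is less accurate.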
How Will I Format the Extracted Records?
  store each extracted record together with:
    what source system generated the record
    when the record was obtained
    the key of the record
What Will I Do with the Extracted Records?
  data loading programs
    read flat files / load the data into the warehouse
  "loosely coupled" warehousing architectures
    separate extract programs and load programs
    -> more flexible and maintainable warehouse!
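A loosely coupled extract and load pair might look like the sketch below, where the only contract between the two programs is the flat-file layout. The field names, the CSV format, and both function names are assumptions for illustration; an in-memory buffer stands in for the flat file.

```python
import csv
import io
from datetime import datetime, timezone

# Assumed flat-file layout shared by the extract and load programs.
FIELDS = ["source_system", "extracted_at", "record_key", "payload"]


def write_extract(rows, source_system, out, extracted_at):
    """Extract program: write (key, payload) rows with source and time attached."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for key, payload in rows:
        writer.writerow({
            "source_system": source_system,
            "extracted_at": extracted_at.isoformat(),
            "record_key": key,
            "payload": payload,
        })


def load_extract(inp):
    """Load program: read the flat file back, knowing nothing about the source."""
    return list(csv.DictReader(inp))


flat_file = io.StringIO()  # stands in for a file on disk
write_extract(
    [("1001", "Acme"), ("1002", "Globex")],
    source_system="ERP",
    out=flat_file,
    extracted_at=datetime(2015, 12, 27, tzinfo=timezone.utc),
)
flat_file.seek(0)
loaded = load_extract(flat_file)
```

Because the load program depends only on the file layout, an extract program can be replaced when a source system changes without touching the loader.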
A Few Notes About Dirty Data
  dirty in several ways
    Format violations
    Referential integrity violations
    Cross-system matching violations
    Internal consistency violations
  dirty data
    makes the warehouse unreliable
    should be corrected in the source systems before extracting
    applies to both refresh data and history data
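Two of the violation types listed above, format and referential integrity, can be detected with simple checks. The sketch below assumes order records with an ISO-formatted `order_date` and a `customer_id` that must exist in a known set of customers; the field names and the date rule are illustrative assumptions.

```python
import re

# Assumed format rule for this example: dates must look like YYYY-MM-DD.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def find_violations(orders, known_customer_ids):
    """Return (order_id, violation type) pairs for the records that fail a check."""
    violations = []
    for order in orders:
        if not ISO_DATE.match(order["order_date"]):
            violations.append((order["order_id"], "format"))
        if order["customer_id"] not in known_customer_ids:
            violations.append((order["order_id"], "referential integrity"))
    return violations


orders = [
    {"order_id": 1, "customer_id": "C1", "order_date": "2015-12-01"},
    {"order_id": 2, "customer_id": "C9", "order_date": "2015-12-02"},
    {"order_id": 3, "customer_id": "C1", "order_date": "12/03/15"},
]
violations = find_violations(orders, known_customer_ids={"C1", "C2"})
```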
Data Store 2 - The Integration Layer
a normalized database in a single place
  normalization
    break a flat file into smaller files to store the data more efficiently.
Why Build an Integration Layer?
  Avoids extraction repetition
    multiple data marts use data from the same source systems
    -> read from only one source (already integrated, clean data)
  Ensures standard interpretation of enterprise data
    multiple groups interpret the same data differently
    -> develop common definitions shared across the organization
  Provides a more flexible repository than the denormalized structures in the HPQS layer
    denormalized data structures in the HPQS, built for querying, are inflexible
    -> changing them is complex and requires reintegration and recleansing
Introduction to Database Normalization
  - data model in third normal form
[Figure: completely denormalized data -> 1NF]
First Normal Form
  eliminate repeating groups!
[Figure: 2NF]
Second Normal Form
  all non-key attributes of a table must rely on the entire key of the table
[Figure: 3NF]
Third Normal Form
  all non-key fields must depend solely on the table's primary key, not on other non-key fields
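The three normal forms can be illustrated by splitting one flat order record step by step. The data and field names below are invented for the example; plain Python lists and dictionaries stand in for tables.

```python
# Completely denormalized: customer details and a repeating group of items
# are all stored on each order record.
flat_orders = [
    {"order_id": 1, "customer_id": "C1", "customer_name": "Acme",
     "customer_city": "Seoul", "items": [("P1", 2), ("P2", 1)]},
    {"order_id": 2, "customer_id": "C1", "customer_name": "Acme",
     "customer_city": "Seoul", "items": [("P1", 5)]},
]

# 1NF: eliminate the repeating group, one row per (order_id, product_id).
# 2NF: quantity relies on the entire key (order_id, product_id), so it stays here.
order_lines = [
    {"order_id": order["order_id"], "product_id": product_id, "quantity": quantity}
    for order in flat_orders
    for product_id, quantity in order["items"]
]

# 2NF: customer_id relies on order_id alone, so it moves to an orders table.
orders = [
    {"order_id": order["order_id"], "customer_id": order["customer_id"]}
    for order in flat_orders
]

# 3NF: name and city depend on customer_id, a non-key field of orders,
# so they move to a customers table keyed by customer_id.
customers = {
    order["customer_id"]: {
        "name": order["customer_name"],
        "city": order["customer_city"],
    }
    for order in flat_orders
}
```

After the split, the customer's name and city are stored once instead of once per order.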
What "Extra" Data Must the Integration Layer Hold?
  surrogate keys
    sequential numbers generated by warehouse load programs
    have no business meaning
    Benefits
      a single surrogate key for the same entity that has different keys in different source systems
      easier tracking of information as it moves
  dates, statuses, and other fields
    support auditing and make it easy to identify data to send to the data marts
    additional information in the warehouse
    Ex. insert date, last update date, status flag, etc.
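A surrogate key generator can be sketched as a counter plus two lookup tables. This is an illustration under assumptions: the class and method names are hypothetical, and `entity_id` stands for whatever matching attribute the load program uses to recognize that two source records describe the same entity.

```python
from itertools import count


class SurrogateKeys:
    """Hand out sequential keys with no business meaning."""

    def __init__(self):
        self._sequence = count(1)
        self._by_entity = {}  # matching attribute -> surrogate key
        self._by_source = {}  # (source_system, source_key) -> surrogate key

    def assign(self, source_system, source_key, entity_id):
        """Return the surrogate key for a source record, creating one if needed."""
        source_ref = (source_system, source_key)
        if source_ref in self._by_source:
            return self._by_source[source_ref]  # reuse, never regenerate
        key = self._by_entity.get(entity_id)
        if key is None:
            key = next(self._sequence)
            self._by_entity[entity_id] = key
        self._by_source[source_ref] = key
        return key


keys = SurrogateKeys()
erp_key = keys.assign("ERP", "1001", entity_id="tax-123")
crm_key = keys.assign("CRM", "A-77", entity_id="tax-123")
other_key = keys.assign("ERP", "1002", entity_id="tax-456")
```

Here the ERP and CRM records carry different source keys but receive the same surrogate key, which is the first benefit listed above.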
Another Note About Dirty Data
  Techniques for handling bad records
    Ignoring them.
    Rejecting bad records, but saving them in a separate file for manual review.
    Loading the bad record and pointing out the errors for later review.
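The three techniques can be compared side by side in a small loader sketch. The strategy names, the `error` field, and the validation rule are assumptions made for the example; a list stands in for the separate reject file.

```python
def load_records(records, is_valid, strategy):
    """Load records, handling bad ones by 'ignore', 'reject', or 'flag'."""
    if strategy not in ("ignore", "reject", "flag"):
        raise ValueError(f"unknown strategy: {strategy}")
    loaded, rejected = [], []
    for record in records:
        if is_valid(record):
            loaded.append({**record, "error": None})
        elif strategy == "reject":
            # Keep the bad record aside for manual review.
            rejected.append(record)
        elif strategy == "flag":
            # Load the bad record, but point out the error for later review.
            loaded.append({**record, "error": "failed validation"})
        # strategy == "ignore": the bad record is silently dropped.
    return loaded, rejected


records = [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]


def non_negative(record):
    return record["amount"] >= 0


ignored = load_records(records, non_negative, "ignore")
reject_result = load_records(records, non_negative, "reject")
flagged = load_records(records, non_negative, "flag")
```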
Data Flow 2 - From the Integration Layer to the High Performance Query Structures
data is extracted from the integration layer and inserted into the data marts
  ETL: extract, transform, and load to populate data marts
  benefits of loading from the integration layer
    no cleansing and integration needed
    records to load are identified using timestamps
    no new surrogate keys are created (only reuse!)
  use of summary tables
    data marts differ from the data warehouse
    they hold some summaries of their atomic-level detail
    -> load both the atomic-level data and summary tables
    Oracle8i
      create materialized view
      automatic refresh on every commit
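Loading a summary table alongside the atomic-level data can be sketched with Python's built-in `sqlite3` module. SQLite has no materialized views, so a plain table built with `CREATE TABLE ... AS SELECT` stands in for one here and would have to be rebuilt by the load program; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Atomic-level detail: one row per sale.
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2015-12-01", "widget", 10.0),
        ("2015-12-02", "widget", 15.0),
        ("2015-12-02", "gadget", 7.5),
    ],
)

# Summary table derived from the atomic data during the same load.
conn.execute("""
    CREATE TABLE sales_by_product AS
    SELECT product, SUM(amount) AS total_amount, COUNT(*) AS sale_count
    FROM sales
    GROUP BY product
""")

summary = conn.execute(
    "SELECT product, total_amount, sale_count "
    "FROM sales_by_product ORDER BY product"
).fetchall()
```

Queries that only need totals read the small summary table, while the atomic rows remain available for drill-down.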
Data Store 3 - High Performance Query Structures(HPQS)
databases and data structures to support end-user queries
databases managed by either relational database engines or multidimensional database engines
logical structure, not physical structure
  can share the same computer with the data warehouse
  physically different table designs
easier and faster for end users to access than normalized database formats.
Data Flow 3 - From the High Performance Query Structures to the End User Reporting Applications
Query tools issue SQL calls to relational databases
data is returned to the tools and formatted
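What a query tool does in this flow can be sketched in a few lines: issue a SQL call, receive rows, and format them for the user. The example uses Python's built-in `sqlite3` module as a stand-in for the relational database; the `sales_summary` table and the report layout are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_summary (region TEXT, total REAL)")
conn.executemany(
    "INSERT INTO sales_summary VALUES (?, ?)",
    [("West", 950.5), ("East", 1200.0)],
)

# Step 1: the tool issues a SQL call to the database.
rows = conn.execute(
    "SELECT region, total FROM sales_summary ORDER BY region"
).fetchall()

# Step 2: the returned data is formatted into a fixed-width report.
report = "\n".join(f"{region:<10}{total:>10.2f}" for region, total in rows)
```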
Data Store 4 - Data in the End User's Hands
reports and analysis in the end user's hands
the last data store in the warehousing architecture
"How can I prevent a bad employee from selling warehouse data to one of our competitors?"
  the only way is to deny him access to that data in the first place
Alternate Warehousing Architectures
Alternate Architecture 1 - No Warehouse
  if there is no demand for a warehouse, don't build it
  fits when transaction systems are strong and end-user queries are limited
Alternate Architecture 2 - Normalized Design
  data is integrated in the integration layer
  users query directly out of the integration layer
  integration benefits, but poor usability and query performance
Alternate Architecture 3 - Just Data Marts
  building one or more data marts without a normalized integration layer
  fits when there is no need to integrate data from multiple systems.