Data-Warehouse-, Data-Mining- und OLAP-Technologien
Chapter 4: Extraction, Transformation, Load
Bernhard Mitschang, Universität Stuttgart
Winter Term 2014/2015


Overview

• Monitoring
• Extraction
  - Export, Import, Filter, Load
  - Direct Integration
• Load
  - Bulk Load
  - Replication
  - Materialized Views
• Transformation
  - Schema Integration
  - Data Integration
  - Data Cleansing
• Tools

Architecture

[Figure: reference architecture of a data warehouse system (Source: [BG04]). Monitors observe the data sources; extraction moves the source data into the data staging area, where it is transformed; load moves it into the data warehouse, from which end user data access is served. The data warehouse manager controls the whole process (control flow), and a metadata manager maintains the metadata repository; the remaining arrows denote data flow.]


Monitoring

• Goal: Discover changes in a data source incrementally.
• Approaches:

Based on   | Changes identified by
Trigger    | triggers defined in the source DBMS; a trigger writes a copy of the changed data to files
Replica    | replication support of the source DBMS; replication provides the changed rows in a separate table
Timestamp  | a timestamp assigned to each row; use the timestamps to identify changes (supported by temporal DBMSs)
Log        | the log of the source DBMS; read the log
Snapshot   | periodic snapshots of the data source; compare snapshots

Snapshot Differentials

• Two snapshot files: F1 was taken before F2.
• Records contain key fields and data fields.
• Goal: Provide UPDATEs/INSERTs/DELETEs in a snapshot differential file Fout.
• Sort Merge Outerjoin:
  - Sort F1 and F2 on their keys.
  - Read the sorted files F1' and F2' and compare records.
  - Snapshot files may be compressed.
  - Snapshots are read multiple times.
• Window Algorithm:
  - Maintain a moving window of records in memory for each snapshot (aging buffer).
  - Assumes that matching records are "physically" nearby.
  - Reads the snapshots only once.

Example (records as ⟨key, data⟩):
F1 = ⟨200,A⟩ ⟨201,U⟩ ⟨100,A⟩ ⟨300,Z⟩
F2 = ⟨100,A⟩ ⟨201,X⟩ ⟨400,Y⟩ ⟨202,U⟩
Snapshot differential Fout: ⟨201,X⟩ UPDATE · ⟨200,A⟩ DELETE · ⟨300,Z⟩ DELETE · ⟨400,Y⟩ INSERT · ⟨202,U⟩ INSERT
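The sort-merge outerjoin can be sketched in a few lines of Python. This is an illustrative sketch reproducing the example above; the function name `snapshot_diff` and its structure are ours, not taken from [LG96].

```python
# Sort-merge outerjoin for snapshot differentials (illustrative sketch).
def snapshot_diff(f1, f2):
    """f1, f2: lists of (key, data) records; f1 is the older snapshot."""
    s1, s2 = sorted(f1), sorted(f2)        # sort both snapshots on the key
    i = j = 0
    out = []
    while i < len(s1) and j < len(s2):
        (k1, d1), (k2, d2) = s1[i], s2[j]
        if k1 == k2:
            if d1 != d2:
                out.append((k2, d2, "UPDATE"))   # same key, changed data
            i += 1
            j += 1
        elif k1 < k2:
            out.append((k1, d1, "DELETE"))       # key vanished in F2
            i += 1
        else:
            out.append((k2, d2, "INSERT"))       # key new in F2
            j += 1
    out += [(k, d, "DELETE") for k, d in s1[i:]]  # leftover F1 records
    out += [(k, d, "INSERT") for k, d in s2[j:]]  # leftover F2 records
    return out

F1 = [(200, "A"), (201, "U"), (100, "A"), (300, "Z")]
F2 = [(100, "A"), (201, "X"), (400, "Y"), (202, "U")]
print(snapshot_diff(F1, F2))
# [(200, 'A', 'DELETE'), (201, 'X', 'UPDATE'), (202, 'U', 'INSERT'),
#  (300, 'Z', 'DELETE'), (400, 'Y', 'INSERT')]
```

Note that both snapshots are read (and sorted) completely before comparison, which is exactly the cost the window algorithm below tries to avoid.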

Window Algorithm

INPUT: F1, F2, n
OUTPUT: Fout /* the snapshot differential */
(1)  Input Buffer1 ← Read n blocks from F1
(2)  Input Buffer2 ← Read n blocks from F2
(3)  while ((Input Buffer1 ≠ EMPTY) and (Input Buffer2 ≠ EMPTY))
(4)    Match Input Buffer1 against Input Buffer2
(5)    Match Input Buffer1 against Aging Buffer2
(6)    Match Input Buffer2 against Aging Buffer1
(7)    Put contents of Input Buffer1 to Aging Buffer1
(8)    Put contents of Input Buffer2 to Aging Buffer2
(9)    Input Buffer1 ← Read n blocks from F1
(10)   Input Buffer2 ← Read n blocks from F2
(11) Report records in Aging Buffer1 as deletes
(12) Report records in Aging Buffer2 as inserts

[Figure: each snapshot has an input buffer and an aging buffer organized as hash buckets with an age queue (TAIL to HEAD). In case of a buffer overflow, the oldest record is replaced and reported as deleted.]

Source: [LG96]
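A simplified single-pass variant of the window algorithm can be sketched as follows, with the aging buffers as bounded, insertion-ordered dicts. This is an illustrative sketch, not the implementation from [LG96]; the names `window_diff`, `n`, and `max_age` are ours, and hashing/blocking details are omitted.

```python
# Simplified single-pass window algorithm (sketch).
# Aging buffers are insertion-ordered dicts: key -> data, oldest entry first.
from collections import OrderedDict

def window_diff(f1, f2, n=2, max_age=4):
    out = []
    aging1, aging2 = OrderedDict(), OrderedDict()
    i = j = 0
    while i < len(f1) or j < len(f2):
        in1, in2 = dict(f1[i:i+n]), dict(f2[j:j+n])   # read n records each
        i += n
        j += n
        # match new F1 records against new and aged F2 records
        for k in list(in1):
            if k in in2:
                if in1[k] != in2[k]:
                    out.append((k, in2[k], "UPDATE"))
                del in1[k], in2[k]
            elif k in aging2:
                if in1[k] != aging2[k]:
                    out.append((k, aging2[k], "UPDATE"))
                del in1[k], aging2[k]
        # match remaining new F2 records against aged F1 records
        for k in list(in2):
            if k in aging1:
                if aging1[k] != in2[k]:
                    out.append((k, in2[k], "UPDATE"))
                del in2[k], aging1[k]
        # unmatched records age; on overflow the oldest record is reported
        for k, d in in1.items():
            aging1[k] = d
            if len(aging1) > max_age:
                old = aging1.popitem(last=False)
                out.append((old[0], old[1], "DELETE"))
        for k, d in in2.items():
            aging2[k] = d
            if len(aging2) > max_age:
                old = aging2.popitem(last=False)
                out.append((old[0], old[1], "INSERT"))
    out += [(k, d, "DELETE") for k, d in aging1.items()]
    out += [(k, d, "INSERT") for k, d in aging2.items()]
    return out

F1 = [(200, "A"), (201, "U"), (100, "A"), (300, "Z")]
F2 = [(100, "A"), (201, "X"), (400, "Y"), (202, "U")]
print(sorted(window_diff(F1, F2)))
```

On the slide's example this produces the same differential as the sort-merge outerjoin, but each snapshot is read exactly once; the price is that records whose match lies outside the window may be misreported as delete/insert pairs.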


Data Staging Area

[Figure: ETL processing around the data staging area (DSA). Monitors observe the data sources; extraction moves the source data to the data staging area; data moves within the DSA as part of the transformations; load moves the data from the DSA to the data warehouse database; finally, data is moved on to the data marts.]

Requirements:
• copy vs. transformation
• files vs. tables
• same system vs. heterogeneous systems
• automatic vs. user-driven
• availability of the source and target system

Extraction

• Support of monitoring and extraction by the source system: replica, active db / trigger, snapshot, export / db dump, logging, no support
• Accessing data sources: application / application API, database API, log files
• Requirements for integration:
  - extract current data
  - limited time frame
  - the service of the source system should not be restricted
  - heterogeneous source systems
  - performance

Export and Import

EXPORT TO c:\cust_berlin_I\cust.data OF DEL
  MODIFIED BY COLDEL |
  MESSAGES c:\cust_berlin_I\msg1.txt
  SELECT * FROM customer_data WHERE new_customer = true
(DB2)

• Export: ASCII files or a proprietary format
• Import: import command or bulk load

[Figure: data is exported on the source system, compressed, transferred over the network, decompressed, and imported on the target system.]

Import vs. Load

IMPORT FROM c:\cust_berlin_I\cust.data OF DEL
  MODIFIED BY COLDEL | COMMITCOUNT 1000
  MESSAGES c:\cust_berlin_I\msg2.txt
  INSERT INTO cust_berlin_1
(DB2)

LOAD FROM c:\cust_berlin_I\cust.data OF DEL
  MODIFIED BY COLDEL | SAVECOUNT 1000
  MESSAGES c:\cust_berlin_I\msg3.txt
  REPLACE INTO cust_berlin_1 STATISTICS YES
(DB2)

                 | IMPORT               | LOAD
COMMIT           | explicit             | automatic
Logging          | complete, mandatory  | optional
Integrity check  | all constraints      | local constraints only
Trigger          | all                  | none
Locking          | read access possible | table lock

Filter and Load

• Bulk load tools provide some filter and transformation functionality:
  - Load data from multiple datafiles during the same load session.
  - Load data into multiple tables during the same load session.
  - Specify the character set of the data.
  - Selectively load data, i.e., load records based on the records' values.
  - Manipulate the data before loading it, using SQL functions.
  - Generate unique sequential key values in specified columns.

[Figure: SQL*Loader (Oracle) reads the input datafiles under control of a loader control file, writes a log file, bad files, and discard files, and loads database tables and indexes.]
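The filter stage of such a bulk loader can be mimicked with a small Python sketch: malformed records go to a "bad file", records failing the load condition go to a "discard file", and the rest is transformed and loaded with a generated sequence key. The sample data, field names, and the Berlin filter are invented for illustration.

```python
# Sketch of a bulk loader's filter/transform stage (illustrative only).
import csv
import io

raw = io.StringIO("100|Ortmann|Berlin\n"
                  "bad line\n"
                  "101|Martin|Hamburg\n"
                  "102|Torry|Berlin\n")

loaded, bad, discarded = [], [], []
seq = 1000                                  # generated sequential key column
for row in csv.reader(raw, delimiter="|"):
    if len(row) != 3 or not row[0].isdigit():
        bad.append(row)                     # malformed record -> bad file
        continue
    key, name, city = row
    if city != "Berlin":                    # WHEN-style selective load
        discarded.append(row)               # filtered record -> discard file
        continue
    seq += 1
    loaded.append((seq, int(key), name.upper(), city))   # transform + load
print(loaded)
# [(1001, 100, 'ORTMANN', 'Berlin'), (1002, 102, 'TORRY', 'Berlin')]
```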

Direct Integration

Three variants for accessing external data directly from the database system:

• External tables:
  - register external data as tables
  - allows reading external data via SQL
  - no query optimization
• Table functions:
  - a user-defined function (Java, C, C++, …) provides a table as its result
  - the function reads the data from external sources
• Federated database:
  - register external data as tables
  - define a wrapper for access to the external data
  - exploit the capabilities of the external source for query optimization


Load

Transfer data from the data staging area into the data warehouse and the data marts.

• Bulk load is used to move huge amounts of data.
• Data has to be added to existing tables:
  - add rows
  - replace rows or values
• A flexible insert mechanism is needed to:
  - add and update rows based on a single data source
  - add rows from a single data source to multiple tables in the data warehouse
• Consider complex criteria in load processing:
  - write an application program
  - use procedural extensions of SQL

Update and Insert

IMPORT FROM c:\cust_berlin_I\cust.data OF DEL
  MODIFIED BY COLDEL | COMMITCOUNT 1000
  MESSAGES c:\cust_berlin_I\msg2.txt
  INSERT INTO cust_berlin_1
(DB2)

Insert semantics:
• INSERT: adds the imported data to the table without changing the existing table data.
• INSERT_UPDATE: adds rows of imported data to the target table, or updates existing rows (of the target table) with matching primary keys.
• REPLACE: deletes all existing data from the table by truncating the data object, and inserts the imported data. The table definition and the index definitions are not changed.
• REPLACE_CREATE: if the table exists, deletes all existing data from the table by truncating the data object, and inserts the imported data without changing the table definition or the index definitions. If the table does not exist, creates the table and index definitions, as well as the row contents.

MERGE INTO

MERGE INTO customer AS c1
USING ( SELECT key, name, address, …
        FROM cust_berlin_I
        WHERE … ) AS c2
ON ( c1.custkey = c2.key )
WHEN MATCHED THEN
  UPDATE SET c1.address = c2.address
WHEN NOT MATCHED THEN
  INSERT (custkey, name, address, …)
  VALUES (key, name, address, …)

• The 'transaction table' (cust_berlin_I) contains updates to existing rows in the data warehouse and/or new rows that should be inserted.
• The MERGE statement of SQL:2003 allows to:
  - update rows that have a matching counterpart in the master table
  - insert rows that do not have a matching counterpart in the master table

customer (custkey, Name, Address): 100 Ortmann Rauchstr. · 101 Martin Pariser Platz · 105 Fagiolo Hiroshimastr. · 106 Byrt Lassenstr.
cust_berlin_I (key, Name, Address): 100 Ortmann Rauchstrasse 1 · 101 Martin Pariser Platz · 102 Torry Wilhelmstr.
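The MERGE semantics can be traced on the sample tables with a small Python sketch; in-memory dicts stand in for the tables, so this mimics the upsert logic rather than executing SQL.

```python
# MERGE semantics on the slide's sample data, mimicked with dicts (sketch).
customer = {100: ("Ortmann", "Rauchstr."),
            101: ("Martin", "Pariser Platz"),
            105: ("Fagiolo", "Hiroshimastr."),
            106: ("Byrt", "Lassenstr.")}

cust_berlin_I = {100: ("Ortmann", "Rauchstrasse 1"),
                 101: ("Martin", "Pariser Platz"),
                 102: ("Torry", "Wilhelmstr.")}

for key, (name, address) in cust_berlin_I.items():
    if key in customer:                       # WHEN MATCHED THEN UPDATE
        old_name, _ = customer[key]
        customer[key] = (old_name, address)   # SET c1.address = c2.address
    else:                                     # WHEN NOT MATCHED THEN INSERT
        customer[key] = (name, address)

print(customer[100])   # ('Ortmann', 'Rauchstrasse 1')  -- updated address
print(customer[102])   # ('Torry', 'Wilhelmstr.')       -- newly inserted
```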

Multiple Inserts

INSERT ALL
  INTO customer VALUES (key, name, address)
  INTO location VALUES (address, 'Berlin')
SELECT * FROM cust_berlin_I WHERE …
(Oracle)

INSERT ALL /* INSERT FIRST */
  WHEN key < 100 THEN INTO customer VALUES (key, name, address)
  WHEN key < 1000 THEN INTO location VALUES (address, 'Berlin')
SELECT * FROM cust_berlin_I WHERE …
(Oracle)

• Insert rows (partly) into several target tables.
• Allows inserting the same row several times.
• Allows defining conditions to select the target table.
• INSERT FIRST specifies that each row is inserted only once, into the first target whose condition it satisfies.

customer (custkey, Name, Address): 105 Fagiolo Hiroshimastr. · 106 Byrt Lassenstr.
cust_berlin_I (key, Name, Address): 100 Ortmann Rauchstrasse 1 · 101 Martin Pariser Platz · 102 Torry Wilhelmstr.
location (Street, Town): Hiroshimastr. Berlin · Lassenstr. Berlin
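The difference between INSERT ALL and INSERT FIRST can be mimicked in Python. The WHEN conditions follow the Oracle example above, but the sample rows (in particular the key 90) are invented so that both conditions can fire for one row.

```python
# INSERT ALL vs. INSERT FIRST routing, mimicked in Python (sketch).
WHEN = [(lambda key: key < 100, "customer"),
        (lambda key: key < 1000, "location")]

def multi_insert(rows, first_only):
    tables = {"customer": [], "location": []}
    for key, name, address in rows:
        for cond, target in WHEN:
            if cond(key):
                tables[target].append((key, name, address))
                if first_only:        # INSERT FIRST: stop after first match
                    break
    return tables

rows = [(90, "Neu", "Kantstr."), (101, "Martin", "Pariser Platz")]
ins_all = multi_insert(rows, first_only=False)
ins_first = multi_insert(rows, first_only=True)
# INSERT ALL: key 90 satisfies both WHEN clauses and lands in both tables;
# INSERT FIRST: key 90 is inserted only once, into customer.
```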

Replication

[Figure: three replication topologies. Data distribution: one source system (UPDATE/SELECT) replicates to several read-only target systems (SELECT). Data consolidation: several source systems (UPDATE/SELECT) replicate into one target system (SELECT). Update anywhere: all systems accept updates (UPDATE/SELECT) and replicate among each other.]

Materialized Views

• Data marts provide extracts of the data warehouse for a specific application.
• Applications often need aggregated data.
• Materialized views (MVs) allow to:
  - define the content of each data mart as views on data warehouse tables
  - automatically update the content of a data mart
• Important issues: MV selection, MV refresh, MV usage

[Figure: dependent data marts are loaded from the data warehouse; end users access the data through the data marts.]

Materialized Views

CREATE TABLE DM2_Orders AS
  ( SELECT ANR, SUM(Count)
    FROM DW_Orders
    GROUP BY ANR )
  DATA INITIALLY DEFERRED
  REFRESH DEFERRED;
REFRESH TABLE DM2_Orders;
(DB2)

• Materialized views are created like views.
• A strategy for refreshing has to be specified:
  - DEFERRED: refresh via the REFRESH TABLE statement
  - IMMEDIATE: refreshed as part of each update of the source table

Example: Data Mart 1 needs the number of open orders per customer (DM1_Orders); Data Mart 2 needs the number of products on order (DM2_Orders). Both are refreshed from DW_Orders.

DW_Orders (Status, Count, SNR, ANR, CNR):
open 1 2 100101 7017 · open 1 2 120113 7537 · open 1 2 120113 7456 · ok 12 1 210704 7564 · open 1 2 100105 7017 · open 1 2 210204 7017 · open 5 1 100101 7017 · ok 3 1 100104 7098 · …

DM2_Orders (ANR, Count): 100101 6 · 100104 3 · 210704 12 · 100105 1 · 210204 1 · 120113 2

DM1_Orders (CNR, Open, Ok): 7017 1 - · 7089 - 1 · 7564 - 1 · 7017 3 - · 7537 1 - · 7456 1 -
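A deferred refresh of DM2_Orders amounts to re-evaluating the GROUP BY query over DW_Orders; sketched in Python on a few rows taken from the slide (the function name is ours).

```python
# Deferred refresh of DM2_Orders:
# re-evaluate SELECT ANR, SUM(Count) FROM DW_Orders GROUP BY ANR (sketch).
from collections import defaultdict

# a few DW_Orders rows from the slide: (Status, Count, SNR, ANR, CNR)
dw_orders = [("open", 1, 2, 100101, 7017),
             ("open", 5, 1, 100101, 7017),
             ("ok",   3, 1, 100104, 7098)]

def refresh_dm2(rows):
    agg = defaultdict(int)
    for status, count, snr, anr, cnr in rows:
        agg[anr] += count                 # SUM(Count) per ANR
    return dict(agg)

print(refresh_dm2(dw_orders))   # {100101: 6, 100104: 3}
```

An IMMEDIATE strategy would instead apply the delta of each source-table update to the aggregate as part of that update.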


Transformation

Convert the data into something representable to the users and valuable to the business.

• Transformation of structure and content:
  - Semantics → identify the proper semantics
  - Structure → schema integration
  - Data → data integration and data cleansing

Transformation: Semantics

• Information on the same object is covered by several data sources; e.g., customer information is provided by several source systems.
• Identify synonyms (e.g., car/automobile, student/pupil, baby/infant) and homonyms (e.g., cash/cache, bare/bear, sight/site).
• Identifying the proper semantics depends on the context.
• Users have to define the proper semantics for the data warehouse.
• Describe the semantics in the metadata repository.

Schema Integration

• Schema integration is the activity of integrating the schemata of various sources to produce a homogeneous description of the data of interest.
• Properties of the integrated schema: completeness, correctness, minimality, understandability.
• Steps of schema integration: pre-integration → schema comparison → schema conforming → schema merging and restructuring.

Pre-Integration

Analysis of the schemata to decide on the general integration policy.

• Decide on:
  - the schemata to be integrated
  - the order of integration / the integration process
  - preferences

Schema Integration Process

[Figure: integration strategies transform the source schemata, possibly via intermediate schemata, into the target schema. Binary strategies integrate two schemata at a time (as a left-deep tree or a balanced tree); n-ary strategies integrate several schemata at once (one shot) or iteratively.]

Schema Matching

• Goal: Take two schemas as input and produce a mapping between elements of the two schemas that correspond semantically to each other.
• Typically performed manually, supported by a graphical user interface: tedious, time-consuming, error-prone, expensive.
• General architecture of generic match:
  - tools = schema-related applications
  - an internal schema representation (ER, XML, …) plus schema import/export is needed
  - use global libraries (dictionaries, schemas) to help find matches
  - only determine match candidates; the user may accept or reject them

Source: [RB01]
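As a toy illustration of automatic match-candidate generation, a name-based matcher can score element pairs by string similarity; real matchers combine name-, structure-, and instance-based techniques ([RB01]). The schemas and the threshold here are invented.

```python
# Toy match-candidate generation via string similarity (illustrative only).
from difflib import SequenceMatcher

source_schema = ["CustKey", "CustName", "Addr"]            # invented
target_schema = ["customer_key", "customer_name", "address", "phone"]

def match_candidates(src, tgt, threshold=0.5):
    cands = []
    for s in src:
        for t in tgt:
            sim = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if sim >= threshold:
                cands.append((s, t, round(sim, 2)))
    # these are only candidates: a user still accepts or rejects each pair
    return sorted(cands, key=lambda c: -c[2])

for s, t, sim in match_candidates(source_schema, target_schema):
    print(f"{s} ~ {t} ({sim})")
```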

Schema Comparison

Analysis to determine the correlations among concepts of different schemata and to detect possible conflicts.

• Schema matching is part of this step.
• Types of conflicts in relational systems (at the schema level: table/table, attribute/attribute, and table/attribute conflicts):
  - table/table: naming (synonyms, homonyms); structure (missing attributes, implied attributes); integrity constraints
  - attribute/attribute: naming (synonyms, homonyms); default values; integrity constraints; data types; range of values

Schema Conforming

Conform and align schemata to make them compatible for integration.

• Conflict resolution:
  - based on the application context
  - cannot be fully automated
  - human intervention, supported by graphical interfaces
• Sample steps: attributes vs. entity sets, composite primary keys, redundancies, simplification

Schema Conforming

[Figure: the four sample conforming steps illustrated on ER schemata. Attribute vs. entity set: attribute X of entity set E (with attributes A, B, X) becomes a separate entity set EX related 1:n to E. Composite primary keys: E with key attributes A and B is split into entity sets EA and EB connected by an n:m relationship carrying X. Redundancies: a redundant relationship among E1, E2, E3 is removed. Simplification: E2 and E3, each related to E1 and E4, are generalized into a single entity set E23.]

Schema Merging and Restructuring

Conformed schemata are superimposed, thus obtaining a global schema.

• Main steps:
  - superimpose the conformed schemata
  - quality tests against the quality dimensions (completeness, correctness, minimality, understandability, …)
  - further transformation of the obtained schema

[Figure: the source schemata are integrated into intermediate/conformed schemata and then transformed into the target schema.]

Schema Integration in Data Warehousing

• Integration in data warehousing:
  - schema integration and data integration
  - schema integration is a prerequisite for data integration
  - schema integration is mainly used for the data staging area
  - the final data warehouse schema is defined from a global point of view, i.e., it is more than only integrating all source schemata
  - schema matching between a source schema and the data warehouse schema provides the basis for defining transformations
• Integration in federated systems:
  - focus on schema integration
  - the integrated schema is used

Data Integration

• Normalization / denormalization: depending on the source schema and the data warehouse schema.
• Surrogate keys: keys should not depend on the source system. Example:
  customer system | local key | global key
  1 | 107 | 5400345
  1 | 109 | 5401340
  2 | 107 | 4900342
  2 | 214 | 5401340
• Data type conversion: if the data types of the source attribute and the target attribute differ. Examples: character → date; character → integer; 'MM-DD-YYYY' → 'DD.MM.YYYY'
• Coding: text → coding; coding → text; coding A → coding B. Examples: gross sales → 1; net sales → 2; 3 → price; 2 → GS
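Surrogate key assignment can be sketched as a lookup table keyed by (source system, local key); the function name and the start value below are illustrative.

```python
# Surrogate keys: global keys independent of the source systems (sketch).
import itertools

next_key = itertools.count(5400345)      # arbitrary start value
key_map = {}                             # (system, local key) -> global key

def surrogate(system, local_key):
    """Assign a stable global key to a (system, local key) pair."""
    pair = (system, local_key)
    if pair not in key_map:
        key_map[pair] = next(next_key)
    return key_map[pair]

g1 = surrogate(1, 107)
g2 = surrogate(2, 107)   # same local key in another system: new global key
g3 = surrogate(1, 107)   # repeated lookup: same global key as g1
```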

Data Integration

• Convert strings (standardization): 'Video' → 'video'; 'VIDEO' → 'video'; 'Miller, Max' → 'Max Miller'
• Convert dates to the date format of the target system: 2004, 05, 31 → 31.05.2004; 04, 05, 31 → 31.05.2004; '05/31/2004' → 31.05.2004
• Convert measures: inch → cm; km → m; mph → km/h
• Combine / separate attributes: 2004, 05, 31 → 31.05.2004; 'video', 1598 → 'video 1598'
• Derived attributes: net sales + tax → gross sales; on_stock − on_order → remaining
• Aggregation: sales_per_day → sales_per_month
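The date and measure conversions can be sketched as small normalization functions; the list of accepted source formats below is an invented excerpt, and the function names are ours.

```python
# Normalization helpers for dates and measures (sketch).
from datetime import datetime

def to_target_date(value):
    """Convert several source date formats to the target format DD.MM.YYYY."""
    for fmt in ("%m/%d/%Y", "%Y, %m, %d", "%d.%m.%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%d.%m.%Y")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

def inch_to_cm(x):
    return x * 2.54            # exact conversion factor

print(to_target_date("05/31/2004"))    # 31.05.2004
print(to_target_date("2004, 05, 31"))  # 31.05.2004
```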

Data Cleansing

• Elementizing: identify the fields
• Standardizing: format, coding
• Verification: contradictions? should lead to corrections in the source system(s)
• Matching: is 'David Miller' and/or 'Clara Miller' already present in the data warehouse? If so, are there changed fields?
• Householding: 'David Miller' and 'Clara Miller' constitute a household
• Documenting

Example:
Raw record:
  David and Clara Miller, Ste. 116, 13150 Hiway 9, Box 1234, Boulder Crk, Colo 95006
After elementizing:
  first name 1: David; last name 1: Miller; first name 2: Clara; last name 2: Miller; suite: 116; number: 13150; street: Hiway 9; post box: 1234; city: Boulder Crk; state: Colo; zip: 95006
After standardizing:
  street: Highway 9; city: Boulder Creek; state: Colorado
After verification:
  state: California (zip 95006 belongs to California, not Colorado)
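The standardizing and verification steps of this example can be sketched in Python; the abbreviation table and zip directory below are tiny invented excerpts of the reference data a real cleansing tool would ship.

```python
# Standardizing + verification on the address example (sketch).
ABBREV = {"Hiway": "Highway", "Crk": "Creek", "Colo": "Colorado"}
ZIP_STATE = {"95006": "California"}   # authoritative zip directory assumed

def standardize(value):
    return " ".join(ABBREV.get(tok, tok) for tok in value.split())

record = {"street": "Hiway 9", "city": "Boulder Crk",
          "state": "Colo", "zip": "95006"}
clean = {field: standardize(v) for field, v in record.items()}

# verification: the zip contradicts the standardized state
expected = ZIP_STATE.get(clean["zip"])
if expected and expected != clean["state"]:
    clean["state"] = expected   # correct here, report back to the source system
print(clean["street"], "|", clean["city"], "|", clean["state"])
# Highway 9 | Boulder Creek | California
```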

Dimensions of Data Cleansing

• single record, single source: attribute dependencies (contradictions), spelling mistakes, missing values, illegal values
• single record, multiple sources: duplicates / matching, householding, contradictions, standardization and coding
• multiple records, single source: primary key / foreign key
• multiple records, multiple sources: duplicates / matching, householding

Data Quality

• Consistency: Are there contradictions in the data and/or metadata?
• Correctness: Do data and metadata provide an exact picture of the reality?
• Completeness: Are there missing attributes or values?
• Exactness: Are exact numeric values available? Are different objects identifiable? Homonyms?
• Reliability: Is there a Standard Operating Procedure (SOP) that describes the provision of the source data?
• Understandability: Does a description for the data and coded values exist?
• Relevance: Does the data contribute to the purpose of the data warehouse?

Improving Data Quality

• Assumption: Various projects can be undertaken to improve the quality of warehouse data.
• Goal: Identify the data quality enhancement projects that maximize the value to the users of the data.
• Tasks for the data warehouse manager:
  - Determine the organizational activities the data warehouse will support.
  - Identify all sets of data needed to support the organizational activities.
  - Estimate the quality of each data set on each relevant data quality dimension.
  - Identify a set of potential projects (and their cost) that could be undertaken for enhancing or affecting data quality.
  - Estimate for each project the likely effect of that project on the quality of the various data sets, by data quality dimension.
  - Determine for each project, data set, and relevant data quality dimension the change in utility should a particular project be undertaken.

Source: [BT99]

Improving Data Quality

Notation:
• I: index of organizational activities supported by the data warehouse
• J: index for data sets
• K: index for data quality attributes or dimensions
• L: index for possible data quality projects P(1) … P(S)
• Current quality: CQ(J, K); required quality: RQ(I, J, K); anticipated quality: AQ(J, K, L)
• Priority of organizational activities: Weight(I)
• Cost of data quality enhancement: Cost(L)
• Value added: Utility(I, J, K, L)
• Decision variable: X(L) = 1 if project L is selected, 0 otherwise

Value of project L:
  Value(L) = Σ_I Σ_J Σ_K Weight(I) · Utility(I, J, K, L)

Maximize the total value from all projects:
  Σ_L X(L) · Value(L)

Resource constraint:
  Σ_L X(L) · Cost(L) ≤ Budget

Exclusiveness constraint (for mutually exclusive projects):
  X(P(1)) + X(P(2)) + … + X(P(S)) ≤ 1

Interaction constraint (e.g.):
  X(P(1)) + X(P(2)) + X(P(3)) ≤ 1
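For a small number of projects, this selection model can be solved by brute-force enumeration; the sketch below uses invented values, costs, and budget, with Value(L) standing for the precomputed weighted utility sum.

```python
# Brute-force project selection for the model above (toy instance).
from itertools import combinations

value = {"P1": 70, "P2": 50, "P3": 40}   # Value(L), invented
cost = {"P1": 60, "P2": 30, "P3": 25}    # Cost(L), invented
budget = 60
exclusive = [("P1", "P2")]               # mutually exclusive project pair

best, best_val = (), 0
projects = list(value)
for r in range(len(projects) + 1):
    for sel in combinations(projects, r):
        if sum(cost[p] for p in sel) > budget:
            continue                              # resource constraint
        if any(a in sel and b in sel for a, b in exclusive):
            continue                              # exclusiveness constraint
        v = sum(value[p] for p in sel)
        if v > best_val:
            best, best_val = sel, v
print(best, best_val)   # ('P2', 'P3') 90
```

For realistically sized instances one would hand the same model to an integer programming solver instead of enumerating all subsets.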


ETL Market Size 2001-2006

[Figure: ETL market size 2001-2006; the chart itself is not preserved in this transcript. Source: Forrester Research, 2004]

ETL Market

Vendors come from several different backgrounds and perspectives (Source: Gartner, 2004):

• "Pure play" vendors:
  - ETL represents a core competency.
  - ETL accounts for most of the license revenue.
  - This class of vendors is driving the bulk of innovation and "mind share" in the ETL market.
• Business intelligence vendors:
  - Business intelligence tools and platforms are their core competency.
  - For most of these vendors, ETL technology plays a supporting role to their flagship business intelligence offerings, or is one component of a broad offering including business intelligence and ETL.
• Database management system (DBMS) vendors:
  - Database vendors have an increasing impact on this market as they continue to bundle ETL functionality closer to the relational DBMS.
• Other infrastructure providers:
  - They provide various types of technical infrastructure components beyond the DBMS.
  - ETL is typically positioned as yet another technical toolset in their portfolios.

ETL Vendors/Products Ranked By Market Share Percentages

[Figures: two charts ranking ETL vendors/products by market share percentages, with each vendor assigned to one of the vendor classes above; the vendor names are not preserved in this transcript. Source: Forrester Research, 2004]

ETL Tools

[Figure: overview of ETL tools; the chart itself is not preserved in this transcript. Source: META Group, 2004]

Summary

• Moving data is part of most steps of the ETL process:
  - extraction, transformation, loading
  - data warehouse and data marts
• Several approaches are available:
  - export, import, load
  - direct integration
  - replication
  - materialized views
• Transformation steps include:
  - semi-automatic schema matching and integration
  - data integration steps
  - data cleansing

Papers

[BT99] D. Ballou, G. K. Tayi: Enhancing Data Quality in Data Warehouse Environments. Communications of the ACM, Vol. 42, No. 1, 1999.

[LG96] W. Labio, H. Garcia-Molina: Efficient Snapshot Differential Algorithms for Data Warehousing. Proc. of the 22nd International Conference on Very Large Data Bases, Mumbai (Bombay), India, 1996.

[RB01] E. Rahm, P. Bernstein: A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 10(4):334-350, 2001.

