Anwendersoftware / Anwendungssoftware
Data-Warehouse-, Data-Mining- und OLAP-Technologien
Chapter 4: Extraction, Transformation, Load
Bernhard Mitschang Universität Stuttgart
Winter Term 2014/2015
Overview
• Monitoring
• Extraction
  - Export, Import, Filter, Load
  - Direct Integration
• Load
  - Bulk Load
  - Replication
  - Materialized Views
• Transformation
  - Schema Integration
  - Data Integration
  - Data Cleansing
• Tools
Architecture
[Figure: data warehouse reference architecture. Monitors watch the data sources; Extraction moves source data into the Data Staging Area; Transformation and Load move it on into the Data Warehouse, from which End User Data Access is served. The Data Warehouse Manager controls the data flow and control flow; a Metadata Manager with a Metadata Repository accompanies all steps. (Source: [BG04])]
Monitoring
• Goal: Discover changes in a data source incrementally.
• Approaches:

  Approach  | Based on …                             | Changes identified by …
  Trigger   | triggers defined in the source DBMS    | a trigger writes a copy of changed data to files
  Replica   | replication support of the source DBMS | replication provides changed rows in a separate table
  Timestamp | a timestamp assigned to each row       | use timestamps to identify changes (supported by temporal DBMS)
  Log       | log of the source DBMS                 | read the log
  Snapshot  | periodic snapshots of the data source  | compare snapshots
Snapshot Differentials
• Two snapshot files: F1 was taken before F2.
• Records contain key fields and data fields.
• Goal: Provide UPDATEs / INSERTs / DELETEs in a snapshot differential file.
• Sort Merge Outerjoin:
  - sort F1 and F2 on their keys
  - read the sorted files F1' and F2' and compare records
  - snapshot files may be compressed
  - snapshots are read multiple times
• Window Algorithm:
  - maintain a moving window of records in memory for each snapshot (aging buffer)
  - assumes that matching records are "physically" nearby
  - reads each snapshot only once
Example (records are key/data pairs):

  F1:      F2:      Fout (snapshot differential):
  200 A    100 A    201 X  UPDATE
  201 U    201 X    200 A  DELETE
  100 A    400 Y    300 Z  DELETE
  300 Z    202 U    400 Y  INSERT
                    202 U  INSERT
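The sort-merge outerjoin can be sketched in a few lines of Python (a minimal illustration, not from the slides; records are assumed to be (key, data) tuples):

```python
def snapshot_diff(f1, f2):
    """Sort-merge outerjoin: return (updates, inserts, deletes) between
    an old snapshot f1 and a new snapshot f2; records are (key, data)."""
    s1, s2 = sorted(f1), sorted(f2)          # sort both snapshots on the key
    i = j = 0
    updates, inserts, deletes = [], [], []
    while i < len(s1) and j < len(s2):
        (k1, d1), (k2, d2) = s1[i], s2[j]
        if k1 == k2:
            if d1 != d2:
                updates.append((k2, d2))     # same key, changed data -> UPDATE
            i += 1
            j += 1
        elif k1 < k2:
            deletes.append((k1, d1))         # key only in old snapshot -> DELETE
            i += 1
        else:
            inserts.append((k2, d2))         # key only in new snapshot -> INSERT
            j += 1
    deletes.extend(s1[i:])                   # remaining old records -> DELETE
    inserts.extend(s2[j:])                   # remaining new records -> INSERT
    return updates, inserts, deletes
```

On the example snapshots above this yields the differential shown: UPDATE (201, X), DELETEs (200, A) and (300, Z), INSERTs (202, U) and (400, Y).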
Window Algorithm
INPUT:  F1, F2, n
OUTPUT: Fout /* the snapshot differential */
 (1) Input Buffer1 ← Read n blocks from F1
 (2) Input Buffer2 ← Read n blocks from F2
 (3) while ((Input Buffer1 ≠ EMPTY) and (Input Buffer2 ≠ EMPTY))
 (4)   Match Input Buffer1 against Input Buffer2
 (5)   Match Input Buffer1 against Aging Buffer2
 (6)   Match Input Buffer2 against Aging Buffer1
 (7)   Put contents of Input Buffer1 to Aging Buffer1
 (8)   Put contents of Input Buffer2 to Aging Buffer2
 (9)   Input Buffer1 ← Read n blocks from F1
(10)   Input Buffer2 ← Read n blocks from F2
(11) Report records in Aging Buffer1 as deletes
(12) Report records in Aging Buffer2 as inserts
[Figure: each snapshot feeds an input buffer; unmatched records move to the corresponding aging buffer, organized as hash buckets with an age queue (TAIL to HEAD). In case of a buffer overflow, the oldest record is replaced and reported as deleted.]
Source: [LG96]
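The pseudocode above can be sketched in Python as follows (an illustrative simplification, not the implementation from [LG96]: the aging buffers here are unbounded, so the overflow handling from the figure is omitted, and buffers hold records rather than blocks):

```python
from collections import OrderedDict
from itertools import zip_longest

def window_diff(f1, f2, n):
    """Single-pass window-algorithm sketch: both snapshots are read
    chunk by chunk; records without a partner wait in an aging buffer."""
    aging1, aging2 = OrderedDict(), OrderedDict()   # key -> data, oldest first
    updates, inserts, deletes = [], [], []

    def chunks(f):
        for i in range(0, len(f), n):
            yield f[i:i + n]

    for b1, b2 in zip_longest(chunks(f1), chunks(f2), fillvalue=[]):
        buf1 = dict(b1)                      # input buffer 1
        for k, d in b2:                      # input buffer 2, record by record
            if k in buf1:                    # match against input buffer 1
                if buf1.pop(k) != d:
                    updates.append((k, d))
            elif k in aging1:                # match against aging buffer 1
                if aging1.pop(k) != d:
                    updates.append((k, d))
            else:
                aging2[k] = d                # no partner seen yet: park it
        for k, d in buf1.items():            # leftover F1 records
            if k in aging2:                  # match against aging buffer 2
                d2 = aging2.pop(k)
                if d != d2:
                    updates.append((k, d2))  # report with the new (F2) data
            else:
                aging1[k] = d
    deletes.extend(aging1.items())           # unmatched F1 records -> DELETE
    inserts.extend(aging2.items())           # unmatched F2 records -> INSERT
    return updates, inserts, deletes
```

On the example snapshots of the previous slide (chunk size n = 2) this reads each snapshot once and produces the same differential as the sort-merge outerjoin.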
Overview
• Monitoring
• Extraction
  - Export, Import, Filter, Load
  - Direct Integration
• Load
  - Bulk Load
  - Replication
  - Materialized Views
• Transformation
  - Schema Integration
  - Data Integration
  - Data Cleansing
• Tools
Data Staging Area
ETL processing moves data in several steps:
• move source data to the data staging area (Extraction)
• move data as part of transformations (Transformation)
• move data from the DSA to the data warehouse database (Load)
• move data to the data marts

Requirements:
• copy vs. transformation
• files vs. tables
• same system vs. heterogeneous systems
• automatic vs. user-driven
• availability of source and target system
Extraction
• Support of monitoring and extraction:
  - replica
  - active DB / trigger
  - snapshot
  - export / DB dump
  - logging
  - no support
• Accessing data sources:
  - application / application API
  - database API
  - log files
• Requirements:
  - extract current data
  - limited time frame
  - the service of the source system should not be restricted
• Challenges: heterogeneous source systems, performance, integration
Export and Import
EXPORT TO c:\cust_berlin_I\cust.data OF DEL
  MODIFIED BY COLDEL|
  MESSAGES c:\cust_berlin_I\msg1.txt
  SELECT * FROM customer_data
  WHERE new_customer = true                       (DB2)

• Export: ASCII files or a proprietary format
• Import: import command or bulk load
• Data is exported on the source system, optionally compressed, transferred over the network, decompressed, and imported on the target system.
Import vs. Load
IMPORT FROM c:\cust_berlin_I\cust.data OF DEL
  MODIFIED BY COLDEL|
  COMMITCOUNT 1000
  MESSAGES c:\cust_berlin_I\msg2.txt
  INSERT INTO cust_berlin_1                       (DB2)

LOAD FROM c:\cust_berlin_I\cust.data OF DEL
  MODIFIED BY COLDEL|
  SAVECOUNT 1000
  MESSAGES c:\cust_berlin_I\msg3.txt
  REPLACE INTO cust_berlin_1
  STATISTICS YES                                  (DB2)

                     IMPORT                 LOAD
  COMMIT             explicit               automatic
  Logging            complete, mandatory    optional
  Integrity check    all constraints        local constraints only
  Trigger            all                    none
  Locking            read access possible   table lock
Filter and Load
• Bulk load tools provide some filter and transformation functionality:
  - load data from multiple datafiles during the same load session
  - load data into multiple tables during the same load session
  - specify the character set of the data
  - selectively load data, i.e., load records based on the records' values
  - manipulate the data before loading it, using SQL functions
  - generate unique sequential key values in specified columns

[Figure: SQL*Loader reads the input datafiles under control of a loader control file and writes database tables and indexes; rejected records go to bad files, filtered records to discard files, and processing messages to a log file. (Oracle)]
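These features are driven by the loader control file. The following fragment is a hypothetical sketch (table, file names, and columns are invented for illustration; see the Oracle SQL*Loader documentation for the exact syntax) showing multiple datafiles, selective loading via WHEN, a generated sequential key, and an SQL function applied before loading:

```text
LOAD DATA
  CHARACTERSET UTF8                    -- character set of the input data
  INFILE 'cust1.data'                  -- multiple datafiles in one session
  INFILE 'cust2.data'
  BADFILE 'cust.bad'                   -- rejected records
  DISCARDFILE 'cust.dsc'               -- records filtered out by WHEN
  APPEND
  INTO TABLE customer
  WHEN new_customer = 'Y'              -- selective load on field values
  FIELDS TERMINATED BY '|'
  ( custkey  SEQUENCE(MAX, 1),         -- generated unique sequential key
    name,
    address,
    new_customer,
    city     "UPPER(:city)" )          -- SQL function applied before loading
```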
Direct Integration
• External tables:
  - register external data as tables
  - allows reading external data via SQL
  - no query optimization
• Table functions:
  - a user-defined function (Java, C, C++, …) provides a table as its result
  - the function reads data from external sources
• Federated database:
  - register external data as tables
  - define a wrapper for access to the external data source
  - exploit the capabilities of the external source for query optimization
Overview
• Monitoring
• Extraction
  - Export, Import, Filter, Load
  - Direct Integration
• Load
  - Bulk Load
  - Replication
  - Materialized Views
• Transformation
  - Schema Integration
  - Data Integration
  - Data Cleansing
• Tools
Load
Transfer data from the data staging area into the data warehouse and data marts.
• Bulk load is used to move huge amounts of data.
• Data has to be added to existing tables:
  - add rows
  - replace rows or values
• A flexible insert mechanism is needed to:
  - add and update rows based on a single data source
  - add rows from a single data source to multiple tables in the data warehouse
• Consider complex criteria in load processing:
  - write an application program
  - use procedural extensions of SQL
Update and Insert
Insert semantics of IMPORT:

IMPORT FROM c:\cust_berlin_I\cust.data OF DEL
  MODIFIED BY COLDEL|
  COMMITCOUNT 1000
  MESSAGES c:\cust_berlin_I\msg2.txt
  INSERT INTO cust_berlin_1                       (DB2)

• INSERT: Adds the imported data to the table without changing the existing table data.
• INSERT_UPDATE: Adds rows of imported data to the target table, or updates existing rows (of the target table) with matching primary keys.
• REPLACE: Deletes all existing data from the table by truncating the data object, and inserts the imported data. The table definition and the index definitions are not changed.
• REPLACE_CREATE: If the table exists, deletes all existing data from the table by truncating the data object, and inserts the imported data without changing the table definition or the index definitions. If the table does not exist, creates the table and index definitions, as well as the row contents.
MERGE INTO
MERGE INTO customer AS c1
USING ( SELECT key, name, address, …
        FROM cust_berlin_I WHERE … ) AS c2
ON ( c1.custkey = c2.key )
WHEN MATCHED THEN
  UPDATE SET c1.address = c2.address
WHEN NOT MATCHED THEN
  INSERT (custkey, name, address, …)
  VALUES (key, name, address, …)

• The 'transaction table' (cust_berlin_I) contains updates to existing rows in the data warehouse and/or new rows that should be inserted.
• The MERGE statement of SQL:2003 allows to
  - update rows that have a matching counterpart in the master table
  - insert rows that do not have a matching counterpart in the master table

Example:

  customer (master table)             cust_berlin_I (transaction table)
  custkey  Name     Address           key  Name     Address
  100      Ortmann  Rauchstr.         100  Ortmann  Rauchstrasse 1
  101      Martin   Pariser Platz     101  Martin   Pariser Platz
  105      Fagiolo  Hiroshimastr.     102  Torry    Wilhelmstr.
  106      Byrt     Lassenstr.
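The MERGE semantics can be mimicked with a simple upsert, shown here as a minimal Python sketch of the example above (tables modeled as dicts mapping custkey to a row; this illustrates the matched/not-matched behavior, not the SQL engine):

```python
def merge_into(master, transactions):
    """MERGE semantics sketch: update rows with a matching key
    (WHEN MATCHED), insert rows without one (WHEN NOT MATCHED)."""
    merged = dict(master)
    for key, row in transactions.items():
        merged[key] = row   # matched -> UPDATE, not matched -> INSERT
    return merged

# Example data from the slide: key -> (name, address)
customer = {100: ('Ortmann', 'Rauchstr.'), 101: ('Martin', 'Pariser Platz'),
            105: ('Fagiolo', 'Hiroshimastr.'), 106: ('Byrt', 'Lassenstr.')}
cust_berlin_I = {100: ('Ortmann', 'Rauchstrasse 1'),
                 101: ('Martin', 'Pariser Platz'),
                 102: ('Torry', 'Wilhelmstr.')}
```

Merging cust_berlin_I into customer updates the address of key 100, leaves 101 unchanged, inserts 102, and does not touch 105 and 106.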
Multiple Inserts

INSERT ALL
  INTO customer VALUES (key, name, address)
  INTO location VALUES (address, 'Berlin')
SELECT * FROM cust_berlin_I WHERE …               (Oracle)

INSERT ALL /* INSERT FIRST */
  WHEN key < 100  INTO customer VALUES (key, name, address)
  WHEN key < 1000 INTO location VALUES (address, 'Berlin')
SELECT * FROM cust_berlin_I WHERE …               (Oracle)

• Insert rows (partly) into several target tables.
• Allows inserting the same row several times.
• Allows defining conditions to select the target table.
• INSERT FIRST specifies that each row is inserted only once, into the first matching table.

Example:

  cust_berlin_I                    customer                        location
  key  Name     Address            custkey  Name     Address       Street         Town
  100  Ortmann  Rauchstrasse 1     105      Fagiolo  Hiroshimastr. Hiroshimastr.  Berlin
  101  Martin   Pariser Platz      106      Byrt     Lassenstr.    Lassenstr.     Berlin
  102  Torry    Wilhelmstr.
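The routing behavior of INSERT ALL vs. INSERT FIRST can be sketched as follows (an illustration with invented keys, not Oracle's implementation; targets are (predicate, table) pairs checked in order):

```python
def multi_insert(rows, targets, first=False):
    """Append each row to every matching table (INSERT ALL) or only
    to the first matching table (INSERT FIRST)."""
    for row in rows:
        for pred, table in targets:
            if pred(row):
                table.append(row)
                if first:
                    break   # INSERT FIRST: one target table per row
```

With conditions key < 100 and key < 1000 and input keys 50, 500, 5000: INSERT ALL puts 50 into both tables, while INSERT FIRST puts it only into the first one; 5000 matches no condition and is dropped.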
Replication
• Replication patterns:
  - data distribution: one source system accepts updates and queries; its changes are replicated to several read-only (SELECT) target systems
  - data consolidation: several updatable source systems are replicated into one read-only target system
  - update anywhere: all systems accept both updates and queries and replicate their changes to each other
Materialized Views
• Data marts provide extracts of the data warehouse for a specific application.
• Applications often need aggregated data.
• Materialized views (MVs) allow to:
  - define the content of each data mart as views on data warehouse tables
  - automatically update the content of a data mart
• Important issues: MV selection, MV refresh, MV usage

[Figure: dependent data marts — the load process fills the data warehouse; each data mart is derived from it and serves end user data access.]
Materialized Views
CREATE TABLE DM2_Orders AS
  ( SELECT ANR, SUM(Count)
    FROM DW_Orders
    GROUP BY ANR )
  DATA INITIALLY DEFERRED
  REFRESH DEFERRED;

REFRESH TABLE DM2_Orders;                         (DB2)

• Materialized views are created like views.
• A strategy for refreshing has to be specified:
  - DEFERRED: use the REFRESH TABLE statement
  - IMMEDIATE: refreshed as part of each update of the source table

Example:

  DW_Orders
  Status  Count  SNR  ANR     CNR
  open    1      2    100101  7017
  open    1      2    120113  7537
  open    1      2    120113  7456
  ok      12     1    210704  7564
  open    1      2    100105  7017
  open    1      2    210204  7017
  open    5      1    100101  7017
  ok      3      1    100104  7098
  …       …      …    …       …

  DM2_Orders (Data Mart 2: number of products on order)
  ANR     Count
  100101  6
  100104  3
  210704  12
  100105  1
  210204  1
  120113  2

  DM1_Orders (Data Mart 1: number of open orders per customer)
  CNR   Open  Ok
  7017  1     -
  7089  -     1
  7564  -     1
  7017  3     -
  7537  1     -
  7456  1     -

Both data marts are filled by refreshing their materialized views from DW_Orders.
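A full refresh of DM2_Orders is just the GROUP BY aggregation recomputed over DW_Orders; a minimal Python sketch of that recomputation (rows as tuples in the slide's column order; not DB2's incremental refresh machinery):

```python
from collections import defaultdict

def refresh_dm2(dw_orders):
    """Recompute SELECT ANR, SUM(Count) FROM DW_Orders GROUP BY ANR."""
    mv = defaultdict(int)
    for status, cnt, snr, anr, cnr in dw_orders:
        mv[anr] += cnt                     # aggregate order counts per ANR
    return dict(mv)

# DW_Orders rows from the slide: (Status, Count, SNR, ANR, CNR)
dw = [('open', 1, 2, 100101, 7017), ('open', 1, 2, 120113, 7537),
      ('open', 1, 2, 120113, 7456), ('ok', 12, 1, 210704, 7564),
      ('open', 1, 2, 100105, 7017), ('open', 1, 2, 210204, 7017),
      ('open', 5, 1, 100101, 7017), ('ok', 3, 1, 100104, 7098)]
```

Applied to these rows, the result matches the DM2_Orders table on the slide (e.g., ANR 100101 → 1 + 5 = 6).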
Overview
• Monitoring
• Extraction
  - Export, Import, Filter, Load
  - Direct Integration
• Load
  - Bulk Load
  - Replication
  - Materialized Views
• Transformation
  - Schema Integration
  - Data Integration
  - Data Cleansing
• Tools
Transformation
Convert the data into something presentable to the users and valuable to the business.
• Transformation of structure and content:
  - Semantics → identify the proper semantics
  - Structure → schema integration
  - Data → data integration and data cleansing
Transformation: Semantics
• Information on the same object is covered by several data sources.
• E.g., customer information is provided by several source systems.
• Identify synonyms (e.g., car/automobile, student/pupil, baby/infant) and homonyms (e.g., cash/cache, bare/bear, sight/site).
• Identifying the proper semantics depends on the context.
• Users have to define the proper semantics for the data warehouse.
• Describe the semantics in the metadata repository.
Schema Integration
• Schema integration is the activity of integrating the schemata of various sources to produce a homogeneous description of the data of interest.
• Properties of the integrated schema: completeness, correctness, minimality, understandability
• Steps of schema integration:
  1. preintegration
  2. schema comparison
  3. schema conforming
  4. schema merging and restructuring
Pre-Integration
Analysis of the schemata to decide on the general integration policy.
• Decide on:
  - the schemata to be integrated
  - the order of integration / the integration process
  - preferences
Schema Integration Process
Integration strategies combine the source schemata, possibly via intermediate schemata, into the target schema:
• binary: left deep tree or balanced tree
• n-ary: one shot or iterative
Schema Matching
• Goal: Take two schemas as input and produce a mapping between elements of the two schemas that correspond semantically to each other.
• Typically performed manually, supported by a graphical user interface: tedious, time-consuming, error-prone, and expensive.
• General architecture of generic match:
  - tools = schema-related applications
  - an internal schema representation (ER, XML, …) plus schema import and export is needed
  - use global libraries (dictionaries, schemas) to help find matches
  - only match candidates are determined; the user may accept or reject them
Source: [RB01]
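One simple family of matchers from the survey is name-based matching; a toy sketch (a hypothetical similarity matcher using Python's difflib, not a tool from [RB01]) that proposes candidate attribute pairs for the user to accept or reject:

```python
from difflib import SequenceMatcher

def match_candidates(attrs1, attrs2, threshold=0.6):
    """Propose attribute pairs whose lowercased names are similar;
    the threshold is an assumed tuning parameter."""
    cands = []
    for a in attrs1:
        for b in attrs2:
            sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if sim >= threshold:
                cands.append((a, b, round(sim, 2)))
    return sorted(cands, key=lambda t: -t[2])   # best candidates first
```

For example, 'CustKey' is proposed as a match for 'customer_key' but not for 'phone'; a real matcher would combine such name similarity with structural and instance-based evidence.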
Schema Comparison
Analysis to determine the correlations among concepts of different schemata and to detect possible conflicts.
• Schema matching is part of this step.
• Types of conflicts in relational systems:
  - table / table conflicts: naming (synonyms, homonyms), structure (missing attributes, implied attributes), integrity constraints
  - attribute / attribute conflicts: naming (synonyms, homonyms), default values, integrity constraints (data types, range of values)
  - table / attribute conflicts: the same concept is modeled as a table in one schema and as an attribute in another
Schema Conforming
Conform and align schemata to make them compatible for integration.
• Conflict resolution:
  - based on the application context
  - cannot be fully automated
  - human intervention, supported by graphical interfaces
• Sample steps: attributes vs. entity sets, composite primary keys, redundancies, simplification
Schema Conforming
[Figure: four sample conforming transformations on ER diagrams — turning an attribute into a related entity set (attribute vs. entity set), splitting an entity set with a composite primary key into related entity sets, removing redundant relationships, and simplification via generalization.]
Schema Merging and Restructuring
Conformed schemata are superimposed, thus obtaining a global schema.
• Main steps:
  - superimpose the conformed schemata
  - quality tests against the quality dimensions (completeness, correctness, minimality, understandability, …)
  - further transformation of the obtained schema
• The source schemata are first transformed into conformed/intermediate schemata, which are then integrated into the target schema.
Schema Integration in Data Warehousing
Integration in data warehousing:
• schema integration and data integration
• schema integration is a prerequisite for data integration
• schema integration is mainly used for the data staging area
• the final data warehouse schema is defined from a global point of view, i.e., it is more than only integrating all source schemata
• schema matching between a source schema and the data warehouse schema provides the basis for defining transformations

Integration in federated systems:
• focus on schema integration
• the integrated schema is used …
Data Integration
• Normalization / denormalization: depending on the source schema and the data warehouse schema
• Surrogate keys: keys should not depend on the source system
• Data type conversion: if the data types of source and target attribute differ (e.g., character → date, character → integer, 'MM-DD-YYYY' → 'DD.MM.YYYY')
• Coding: text → coding, coding → text, coding A → coding B (e.g., gross sales → 1, net sales → 2; 3 → price, 2 → GS)

Example of a surrogate key mapping:

  customer system  local key  global key
  1                107        5400345
  1                109        5401340
  2                107        4900342
  2                214        5401340
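Surrogate key assignment boils down to a lookup table from (source system, local key) to a generated global key; a minimal Python sketch (the start value and names are invented; note that mapping two different local keys to the same global key, as in the table above for key 5401340, additionally requires duplicate matching, which this sketch omits):

```python
from itertools import count

def make_key_mapper(start=5400000):
    """Return a lookup function that maps (system, local key) pairs to
    stable global surrogate keys, independent of any source system."""
    mapping = {}
    seq = count(start)   # generator of fresh global keys
    def global_key(system, local_key):
        if (system, local_key) not in mapping:
            mapping[(system, local_key)] = next(seq)
        return mapping[(system, local_key)]
    return global_key, mapping
```

The same local key in two different source systems gets two different global keys, and repeated lookups are stable.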
Data Integration
• Convert strings (standardization): 'Video' → 'video', 'VIDEO' → 'video', 'Miller, Max' → 'Max Miller'
• Convert dates to the date format of the target system: 2004, 05, 31 → 31.05.2004; 04, 05, 31 → 31.05.2004; '05/31/2004' → 31.05.2004
• Convert measures: inch → cm, km → m, mph → km/h
• Combine / separate attributes: 2004, 05, 31 → 31.05.2004; 'video', 1598 → 'video 1598'
• Derived attributes: net sales + tax → gross sales; on_stock - on_order → remaining
• Aggregation: sales_per_day → sales_per_month
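A few of these conversions as small Python helpers (an illustrative sketch; the accepted input formats are assumptions, and the two-digit-year case from the slide would need extra century handling that is omitted here):

```python
from datetime import datetime

def to_target_date(s):
    """Normalize source date formats to the assumed target DD.MM.YYYY."""
    for fmt in ('%Y, %m, %d', '%m/%d/%Y', '%d.%m.%Y'):
        try:
            return datetime.strptime(s, fmt).strftime('%d.%m.%Y')
        except ValueError:
            pass
    raise ValueError(f'unrecognized date format: {s!r}')

def inch_to_cm(v):
    """Measure conversion: 1 inch = 2.54 cm."""
    return v * 2.54

def standardize(s):
    """String standardization: trim and lowercase."""
    return s.strip().lower()
```

For example, both '2004, 05, 31' and '05/31/2004' normalize to '31.05.2004'.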
Data Cleansing
• Elementizing: identify fields
• Standardizing: format, coding
• Verification: contradictions? should lead to corrections in the source system(s)
• Matching: are 'David Miller' and/or 'Clara Miller' already present in the data warehouse? If so, are there changed fields?
• Householding: 'David Miller' and 'Clara Miller' constitute a household
• Documenting

Example:

  Raw record:
    David and Clara Miller, Ste. 116, 13150 Hiway 9, Box 1234, Boulder Crk Colo 95006

  After elementizing:
    first name 1: David / last name 1: Miller / first name 2: Clara / last name 2: Miller /
    suite: 116 / number: 13150 / street: Hiway 9 / post box: 1234 /
    city: Boulder Crk / state: Colo / zip: 95006

  After standardizing:
    street: Highway 9 / city: Boulder Creek / state: Colorado

  After verification (the zip code contradicts the state):
    state: California
Dimensions of Data Cleansing
                     single source                  multiple sources
  single record      • attribute dependencies       • duplicates / matching
                       (contradictions)             • householding
                     • spelling mistakes            • contradictions
                     • missing values               • standardization, coding
                     • illegal values

  multiple records   • primary key /                • duplicates / matching
                       foreign key                  • householding
Data Quality
• consistency: Are there contradictions in the data and/or metadata?
• correctness: Do data and metadata provide an exact picture of reality?
• completeness: Are there missing attributes or values?
• exactness: Are exact numeric values available? Are different objects identifiable? Homonyms?
• reliability: Is there a Standard Operating Procedure (SOP) that describes the provision of the source data?
• understandability: Does a description for the data and the coded values exist?
• relevance: Does the data contribute to the purpose of the data warehouse?
Improving Data Quality
• Assumption: Various projects can be undertaken to improve the quality of warehouse data.
• Goal: Identify the data quality enhancement projects that maximize the value to the users of the data.
• Tasks for the data warehouse manager:
  1. Determine the organizational activities the data warehouse will support.
  2. Identify all sets of data needed to support the organizational activities.
  3. Estimate the quality of each data set on each relevant data quality dimension.
  4. Identify a set of potential projects (and their cost) that could be undertaken for enhancing or affecting data quality.
  5. Estimate for each project the likely effect of that project on the quality of the various data sets, by data quality dimension.
  6. Determine for each project, data set, and relevant data quality dimension the change in utility should a particular project be undertaken.

Source: [BT99]
Improving Data Quality
• Indices and quantities:
  - I: index of the organizational activities supported by the data warehouse
  - J: index of the data sets
  - K: index of the data quality attributes or dimensions
  - L: index of the possible data quality projects P(1) … P(S)
  - current quality: CQ(J, K); required quality: RQ(I, J, K); anticipated quality: AQ(J, K, L)
  - priority of organizational activities: Weight(I)
  - cost of data quality enhancement: Cost(L)
  - value added: Utility(I, J, K, L)

• Value of project L:
  Value(L) = ∑_I ∑_J ∑_K Weight(I) · Utility(I, J, K, L)

• Decision variable:
  X(L) = 1 if project L is selected, 0 otherwise

• Maximize the total value from all projects:
  ∑_L X(L) · Value(L)

• Resource constraint:
  ∑_L X(L) · Cost(L) ≤ Budget

• Exclusiveness constraint (at most one of a set of mutually exclusive projects):
  X(P(1)) + X(P(2)) + … + X(P(S)) ≤ 1

• Interaction constraint (e.g., among the interacting projects P(1), P(2), P(3)):
  X(P(1)) + X(P(2)) + X(P(3)) ≤ 1
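For small project sets, this 0/1 selection model can be solved by brute force; a minimal Python sketch (project names, costs, and values are invented for illustration; real instances would use an integer programming solver):

```python
from itertools import combinations

def select_projects(projects, budget, exclusive=()):
    """Brute-force 0/1 selection: projects maps name -> (cost, value);
    exclusive is a list of sets from which at most one project may be
    chosen. Returns the best feasible subset and its total value."""
    names = list(projects)
    best, best_value = set(), 0
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            s = set(subset)
            if sum(projects[p][0] for p in s) > budget:
                continue                       # resource constraint
            if any(len(s & grp) > 1 for grp in exclusive):
                continue                       # exclusiveness constraint
            value = sum(projects[p][1] for p in s)
            if value > best_value:
                best, best_value = s, value
    return best, best_value
```

With three hypothetical projects of cost/value (3, 10), (4, 6), (5, 12), a budget of 8, and the first and third mutually exclusive, the best feasible choice is the first two projects with total value 16.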
Overview
• Monitoring
• Extraction
  - Export, Import, Filter, Load
  - Direct Integration
• Load
  - Bulk Load
  - Replication
  - Materialized Views
• Transformation
  - Schema Integration
  - Data Integration
  - Data Cleansing
• Tools
ETL Market Size 2001-2006
Source: Forrester Research, 2004
ETL Market
• Vendors come from several different backgrounds and perspectives:
  A. "Pure play" vendors:
     - ETL represents a core competency
     - ETL accounts for most of the license revenue
     - this class of vendors is driving the bulk of innovation and "mind share" in the ETL market
  B. Business intelligence vendors:
     - business intelligence tools and platforms are their core competency
     - for most of these vendors, ETL technology plays a supporting role to their flagship business intelligence offerings, or is one component of a broad offering including business intelligence and ETL
  C. Database management system (DBMS) vendors:
     - database vendors have an increasing impact on this market as they continue to bundle ETL functionality closer to the relational DBMS
  D. Other infrastructure providers:
     - they provide various types of technical infrastructure components beyond the DBMS
     - ETL is typically positioned as yet another technical toolset in their portfolios

Source: Gartner, 2004
ETL Vendors/Products Ranked By Market Share Percentages
Source: Forrester Research, 2004
[Figure: ranked vendor list; the letters A-D mark the vendor classes from the previous slide.]
ETL Vendors/Products Ranked By Market Share Percentages
Source: Forrester Research, 2004
[Figure: continuation of the ranked vendor list, again with the vendor classes A-D marked.]
ETL Tools
Source: META Group, 2004
Summary
• Moving data is part of most steps of the ETL process:
  - extraction, transformation, loading
  - data warehouse and data marts
• Several approaches are available:
  - export, import, load
  - direct integration
  - replication
  - materialized views
• Transformation steps include:
  - semi-automatic schema matching and integration
  - data integration steps
  - data cleansing
Papers
[BG04] A. Bauer, H. Günzel (eds.): Data-Warehouse-Systeme: Architektur, Entwicklung, Anwendung. 2nd edition, dpunkt.verlag, 2004.
[BT99] D. Ballou, G. K. Tayi: Enhancing Data Quality in Data Warehouse Environments. Communications of the ACM, Vol. 42, No. 1, 1999.
[LG96] W. Labio, H. Garcia-Molina: Efficient Snapshot Differential Algorithms for Data Warehousing. Proc. of the 22nd International Conference on Very Large Data Bases (VLDB), Mumbai (Bombay), India, 1996.
[RB01] E. Rahm, P. Bernstein: A Survey of Approaches to Automatic Schema Matching. The VLDB Journal, Vol. 10, No. 4, pp. 334-350, 2001.