ETL Process
ETL: Overview
Two steps
- From the sources to the staging area
  - Extraction of data from the sources
  - Creation / detection of differential updates
  - Creation of LOAD files
- From the staging area to the base database
  - Data cleaning and tagging
  - Preparation of integrated data sets

- Continuous data provision for the DWH
- Assurance of consistency with the DWH data sources

Efficient methods essential → minimize offline time
Rigorous tests essential → ensure data quality
© Sattler / Saake / Köppen: Data Warehouse Technologies, last change: 06.01.2019
ETL Process
Frequently the most elaborate part of data warehousing
- Variety of sources
- Heterogeneity
- Data volume
- Complexity of the transformation
  - Schema and instance integration
  - Data cleansing
- Hardly any consistent methods and system support, but a variety of tools available
ETL Process
Extraction: selecting a section of the data from the sources and providing it for transformation
Transformation: fitting the data to predefined schema and quality requirements
Load: physical insertion of the data from the staging area into the data warehouse (including necessary aggregations)
Definition Phase of the ETL Process
[Figure: definition phase of the ETL process — source data analysis over the data sources (OLTP, legacy, external sources) yields documentation and an operational data catalog; the selection of objects is driven by the analysis needs and the data model and its conventions; the creation of transformations is governed by a rule set for data quality and transformation rules; the creation of ETL routines (ETL jobs, mappings, key transformations, normalization) is checked against success criteria for load routines; metadata management ties everything together via a repository; the result feeds the DWH]
Extraction of Data from Sources
Extraction
Task
- Regular extraction of change data from the sources
- Data provision for the DWH

Distinction
- Time of extraction
- Type of extracted data
Point in Time
Synchronous notification
- Source propagates each change

Asynchronous notification
- Periodically
  - Sources produce extracts regularly
  - DWH regularly scans the dataset
- Event-driven
  - DWH requests changes, e.g., before each annual reporting
  - Source notifies after every X changes
- Query-driven
  - DWH queries for changes before any actual access
Type of Data
Flow: integrate all changes in the DWH
- Short positions, trades
- Accommodate changes

Stock: the point in time is essential and must be set
- Number of employees at the end of the month in a store
- Stock at the end of the year

Value per unit: depends on the unit and other dimensions
- Exchange rate at a point in time
- Gold price on a stock exchange
Type of Data
Snapshots: source always provides the complete data set
- New supplier directory, new price list, etc.
- Detect changes
- Depict the history correctly

Logs: source provides every change
- Transaction logs, application-controlled logging
- Import changes efficiently

Net logs: source provides net changes
- Catalog updates, snapshot deltas
- No complete history possible
- Changes efficiently importable
Point in Time of Data Provision

Source ...                  | Method                 | Timeliness             | Workload on DWH        | Workload on sources
creates files periodically  | batch runs, snapshots  | depending on frequency | low                    | low
propagates each change      | triggers, replication  | maximum                | high                   | very high
creates extracts on request | before use (very hard) | maximum                | medium                 | medium
application-driven          | application-driven     | depending on frequency | depending on frequency | depending on frequency
Point in Time of Data Provision
Comments on three of the previous options:
- Many systems (mainframes) are not accessible online
- Contradicts the idea of the DWH: more workload on the sources
- Technically not efficiently implementable
Extraction from Legacy Systems
Very dependent on the application
Access to host systems without online access
- Access via batch, report writer, scheduling
Data in non-standard databases without APIs
- Programming in PL/1, COBOL, Natural, IMS, ...
Unclear semantics, double occupancy of fields, speaking keys, missing documentation, domain knowledge held by only a few people
But: commercial tools are available
Differential Snapshot Problem
Many sources provide only the full dataset
- Molecular biology databases
- Customer lists, employee lists
- Product catalogues

Problem
- Repeated import of all data is inefficient
- Duplicates need to be detected

Algorithms to compute delta files
- Hard for very large files

[Labio, Garcia-Molina 1996]
Scenario
Sources provide snapshots as a file F
- Unordered set of records (K, A1, ..., An)

Given: F1, F2 with f1 = |F1|, f2 = |F2|
Compute the smallest sequence O = {INS, DEL, UPD}* with O(F1) = F2

O is not unique!
E.g., O1 = {INS(X), DEL(X)} ≡ O2 = ∅ (an insert followed by a delete has no net effect)
Differential Snapshot Problem
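The problem statement can be made concrete with a minimal in-memory sketch (Python, for illustration only — real snapshots are files larger than main memory, and the helper name is made up):

```python
def differential_snapshot(f1, f2):
    """Compute O = {INS, DEL, UPD} with O(F1) = F2.

    f1, f2: dicts mapping key K -> attribute tuple (A1, ..., An).
    Assumes both snapshots fit in main memory (illustration only).
    """
    ops = []
    for k, attrs in f2.items():
        if k not in f1:
            ops.append(("INS", k, attrs))      # new key in F2
        elif f1[k] != attrs:
            ops.append(("UPD", k, attrs))      # same key, changed attributes
    for k in f1:
        if k not in f2:
            ops.append(("DEL", k))             # key vanished from F2
    return ops
```

On the example of the next slide (F1 = {K4, K102, K104, K202}, F2 = {K3, K102, K103, K104, K202}) this yields exactly INS K3, INS K103, UPD K202, DEL K4.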
Scenario
[Figure: a differential snapshot algorithm compares F1 = {(K4, t, r, ...), (K102, p, q, ...), (K104, k, k, ...), (K202, a, a, ...)} with F2 = {(K3, t, r, ...), (K102, p, q, ...), (K103, t, h, ...), (K104, k, k, ...), (K202, b, b, ...)} and sends INS K3, DEL K4, INS K103, UPD K202 to the DWH]
Assumptions
Computing a consecutive series of DSs
- Files from 1.1.2010, 1.2.2010, 1.3.2010, ...

Cost model
- All operations in main memory are free
- IO counts the number of records: sequential read
- No consideration of block sizes

Size of main memory: M (records)
File size |Fx| = fx (records)
Files are generally larger than main memory
DSnaive – Nested Loop
Computing O
- Read record R from F1
- Read F2 sequentially and compare to R
  - R not in F2 → O := O ∪ {DEL(R)}
  - R in F2 → O := O ∪ {UPD(R)} / ignore

Problem: INS is not found
- Auxiliary structure necessary
- Array with IDs from F2 (generated on the fly)
- Mark each R accordingly, final pass for INS

Number of IO operations: f1 · f2 + δ

Improvements?
- Abort the search in F2 once R has been found
- Load partitions of size M from F1: (f1 / M) · f2
DSsmall – small files
Assumption: main memory M > f1 (or f2)
Computing O
- Read F1 completely
- Read F2 sequentially (record S)
  - S ∈ F1: O := O ∪ {UPD(S)} / ignore
  - S ∉ F1: O := O ∪ {INS(S)}
  - Mark S in F1 (bit array)
- Finally: records R ∈ F1 without marks: O := O ∪ {DEL(R)}

Number of IO operations: f1 + f2 + δ

Improvements
- Sort F1 in main memory → faster lookup
DSsort – Sort-Merge
General case: M ≪ f1 and M ≪ f2
Assumption: F1 is sorted
Sort F2 in secondary storage
- Read F2 in partitions Pi with |Pi| = M
- Sort Pi in main memory and write it out as run Fi
- Merge all Fi
- Assumption: M > √f2 → IO: 4 · f2
Keep the sorted F2 for the next DS (it becomes F1 there)
- Per DS only F2 needs to be sorted
Computing O
- Open the sorted F1 and F2
- Merge (parallel reads with skipping)

Number of IO operations: f1 + 5 · f2 + δ
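The final merge step of DSsort can be sketched as follows (a Python illustration over in-memory lists, not the file-based original; the external sorting of F2 is omitted and the function name is made up):

```python
def ds_sort_merge(f1, f2):
    """Merge two snapshots sorted by key K and emit the delta O.

    f1, f2: lists of (K, attrs) pairs in ascending key order.
    Sketch of the merge phase of DSsort with skipping.
    """
    ops = []
    i = j = 0
    while i < len(f1) and j < len(f2):
        k1, a1 = f1[i]
        k2, a2 = f2[j]
        if k1 == k2:
            if a1 != a2:
                ops.append(("UPD", k2, a2))   # same key, changed attributes
            i += 1
            j += 1
        elif k1 < k2:
            ops.append(("DEL", k1))           # key no longer present in F2
            i += 1
        else:
            ops.append(("INS", k2, a2))       # new key in F2
            j += 1
    ops += [("DEL", k) for k, _ in f1[i:]]    # F1 leftovers were deleted
    ops += [("INS", k, a) for k, a in f2[j:]] # F2 leftovers are new
    return ops
```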
DSsort2 – Interleaved
Sorted F1 given
Computing O
- Read F2 in partitions Pi with |Pi| = M
- Sort Pi in main memory and write it out as run F2,i
- Merge all F2,i and simultaneously compare to F1

Number of IO operations: f1 + 4 · f2 + δ
DShash – Partitioned Hash
Computing O
- Hash F2 into partitions Pi with |Pi| = M/2
- The hash function has to guarantee: Pi ∩ Pj = ∅ for all i ≠ j
- Partitions are "equivalence classes" w.r.t. the hash function
- F1 is already partitioned
- F1 and F2 have been partitioned by the same hash function
- Read and merge P1,i and P2,i in parallel

Number of IO operations: f1 + 3 · f2 + δ
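The partition-and-compare idea of DShash can be sketched in Python (an in-memory illustration with made-up names; a real implementation streams partitions from disk, and Python's built-in hash stands in for the required non-overlapping hash function):

```python
from collections import defaultdict

def partition_by_hash(records, num_partitions):
    """Hash-partition (K, attrs) records: equal keys always land in the
    same partition, so P_i and P_j never share a key for i != j."""
    parts = defaultdict(list)
    for k, attrs in records:
        parts[hash(k) % num_partitions].append((k, attrs))
    return parts

def ds_hash(f1, f2, num_partitions=4):
    """Sketch of DShash: partition both files with the same hash function,
    then compare only the two matching partitions of each pair."""
    p1 = partition_by_hash(f1, num_partitions)
    p2 = partition_by_hash(f2, num_partitions)
    ops = []
    for i in range(num_partitions):
        old = dict(p1.get(i, []))
        new = dict(p2.get(i, []))
        ops += [("INS", k, a) for k, a in new.items() if k not in old]
        ops += [("UPD", k, a) for k, a in new.items()
                if k in old and old[k] != a]
        ops += [("DEL", k) for k in old if k not in new]
    return ops
```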
Why not simply . . .
UNIX diff?
- diff requires / considers the surroundings of records
- Here: records are not ordered

In the database with SQL?
- Requires reading each relation three times

INSERT INTO delta
SELECT 'UPD', ...
FROM F1, F2
WHERE F1.K = F2.K AND F1.W <> F2.W
UNION
SELECT 'INS', ...
FROM F2
WHERE NOT EXISTS (...)
UNION
SELECT 'DEL', ...
FROM F1
WHERE NOT EXISTS (...)
Comparison – Features
Algorithm | IO          | Remarks
DSnaive   | f1 · f2     | not competitive, auxiliary data structure required
DSsmall   | f1 + f2     | only for smaller files
DSsort2   | f1 + 4 · f2 |
DShash    | f1 + 3 · f2 | non-overlapping hash function, partition size hard to estimate, assumptions about the distribution (sampling)

Extensions of DShash for "worse" hash functions are known
Further DS Approaches
Number of partitions / runs larger than the number of file descriptors in the OS
- Hierarchical external sorting methods

Compression: compress the files
- Larger partitions / runs
- Better chance of performing comparisons within main memory
- Faster in practice (cf. the assumptions of the cost model)

"Windows" algorithm
- Assumption: files have a "fuzzy" order
- Merge with a sliding window over both files
- Returns many redundant INS-DEL pairs
- Number of IO operations: f1 + f2
DS with Timestamp
Assumption: records are (K, A1, ..., An, T)
T: timestamp of the last change
Computing O
- Keep Told: the last update (max{T} of F1)
- Read F2 sequentially
- Entries with T > Told are of interest
- But: INS or UPD?

Another problem: DEL is not found
The timestamp only spares the attribute comparison
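Both limitations are easy to see in a small Python sketch (names and signature are made up for illustration): the timestamp filter finds the candidates cheaply, but distinguishing INS from UPD still needs a key lookup, and a deleted record simply never shows up in F2.

```python
def ds_with_timestamp(f2, t_old, known_keys):
    """Timestamp-based change detection on records (K, attrs, T).

    Only records with T > t_old are candidates; whether such a record
    is an INS or an UPD requires checking the key against F1's keys.
    DELs remain invisible: a deleted record no longer appears in F2.
    """
    ops = []
    for k, attrs, t in f2:
        if t > t_old:  # changed since the last extraction
            ops.append(("UPD" if k in known_keys else "INS", k, attrs))
    return ops
```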
Data Load
Load
Task
- Efficient incorporation of external data into the DWH

Critical point
- Loading operations may block the entire DWH (write lock on the fact table)

Aspects:
- Triggers
- Integrity constraints
- Index updates
- Update or insert?
Set based
Use of standard interfaces: PRO*SQL, JDBC, ODBC, ...
Works in the normal transaction context
Triggers, indexes and constraints remain active
- Manual deactivation possible
No large-scale locks
Locks can be reduced by COMMIT
- Not in Oracle: read operations are never blocked (MVCC)
Use of prepared statements
Partly proprietary extensions (arrays) available
BULK Load
DB-specific extensions for loading large amounts of data
Runs (usually) in a special context
- Oracle: direct path option in the loader
- Complete table lock
- No consideration of triggers or constraints
- Indexes are not updated until afterwards
- No transactional context
- No logging
- Checkpoints for recovery
Practice: BULK uploads
Example: ORACLE sqlldr
[Figure: SQL*Loader reads input files and a loader control file, writes rejected records to bad files, discarded records to discard files, and a log file, and loads tables and indexes in the database]
[Oracle 11g Documentation]
Example: ORACLE sqlldr (2)
Control file:

LOAD DATA
INFILE 'bier.dat'
REPLACE INTO TABLE getraenke (
  bier_name POSITION(1) CHAR(35),
  bier_preis POSITION(37) ZONED(4,2),
  bier_bestellgroesse POSITION(42) INTEGER,
  getraenk_id "getraenke_seq.nextval")

Data file bier.dat:

Ilmenauer Pils        4490 100
Erfurter Bock         6400 80
Magdeburger Weisse    1290 20
Anhaltinisch Flüssig  8800 200
BULK Load Example
Many options
- Treatment of exceptions (bad file)
- Data transformations
- Checkpoints
- Optional fields
- Conditional loading into multiple tables
- Conditional loading of records
- REPLACE or APPEND
- Parallel load
- ...
Direct Path Load
[Figure: conventional path vs. direct path load in Oracle — on the conventional path, SQL*Loader and user processes generate SQL commands that pass through SQL command processing, space management (get new extents, adjust the high-water mark, find and fill partial blocks) and the buffer cache (queue management, conflict resolution, reading and writing database blocks); on the direct path, SQL*Loader writes database blocks directly to the database]
[Oracle 11g Documentation]
Multi-Table-Insert in Oracle
Insert into multiple tables or multiple times (e.g., for pivoting)

INSERT ALL
  INTO Quartal_Verkauf VALUES (Produkt_Nr, Jahr || '/Q1', Umsatz_Q1)
  INTO Quartal_Verkauf VALUES (Produkt_Nr, Jahr || '/Q2', Umsatz_Q2)
  INTO Quartal_Verkauf VALUES (Produkt_Nr, Jahr || '/Q3', Umsatz_Q3)
  INTO Quartal_Verkauf VALUES (Produkt_Nr, Jahr || '/Q4', Umsatz_Q4)
SELECT ... FROM ...
Multi-Table-Insert in Oracle (2)
Conditional insert

INSERT ALL
  WHEN ProdNr IN (SELECT ProdNr FROM Werbe_Aktionen) THEN
    INTO Aktions_Verkauf VALUES (ProdNr, Quartal, Umsatz)
  WHEN Umsatz > 1000 THEN
    INTO Top_Produkte VALUES (ProdNr)
SELECT ... FROM ...
Merge in Oracle
Merge (upsert): attempt an insert; on error (violation of a key condition) → update

MERGE INTO Kunden K USING Neukunden N
ON (N.Name = K.Name AND N.GebDatum = K.GebDatum)
WHEN MATCHED THEN
  UPDATE SET K.Name = N.Name, K.Vorname = N.Vorname,
             K.GebDatum = N.GebDatum
WHEN NOT MATCHED THEN
  INSERT VALUES (MySeq.NextVal, N.Name, N.Vorname, N.GebDatum)
The ETL Process: Transformation Tasks
[Figure: the ETL process — extraction from the operational sources into a staging area, followed by instance extraction and transformation, instance matching and integration, and filtering and aggregation into the data warehouse; the whole pipeline is accompanied by scheduling, logging, monitoring, recovery and backup; the metadata flow comprises instance characteristics (real metadata), translation rules, mappings from source to target schemas, and filtering and aggregation rules]
[Rahm, Do 2000]
Method: Source – Staging Area – BaseDB
[Figure: source 1 (RDBMS) and source 2 (IMS) are loaded into relational schemas Q1 and Q2 in the staging area and then integrated into a data cube with an integrated schema]
BULK load is only the first step
Subsequent loads
- INSERT INTO ... SELECT ...
- Logging can be switched off
- Parallelizable
Transformation Tasks
When loading
- Simple conversions (for the LOAD file)
- Record orientation (tuples)
- Preparation for the BULK loader → mostly scripts or 3GL

In the data staging area
- Set-oriented calculations
- Inter- and intra-relation comparisons
- Comparison with the base database → duplicates
- Tagging of records
- SQL

Loading into the BaseDB
- Bulk load
- Set-oriented inserts without logging
Task: Source – Staging Area – BaseDB
What to do, where and when?
- No defined task assignment

                     | Extraction (Source → Staging Area)            | Load (Staging Area → BaseDB)
Access type          | record-oriented                               | set-oriented
Available databases  | one source (update file)                      | many sources
Available datasets   | depending on the source: all, all changes, deltas | BaseDB additionally available
Programming language | scripts (Perl, AWK, ...) or 3GL               | SQL, PL/SQL
Transformation Tasks
Transformation
Problem
- Data in the staging area is not in the format of the base database
- The structure of the data varies
  - Staging area: schema close to the source
  - BaseDB: multidimensional schema
  - Structural heterogeneity

Aspects
- Data transformation
- Schema transformation
Data and schema heterogeneity
Main data source: OLTP systems
Secondary sources:
- Documents from in-house legacy archives
- Documents from the Internet via WWW, FTP
  - Unstructured: access via search engines, ...
  - Semi-structured: access via search engines, mediators, wrappers, etc., as XML documents or similar

Basic problem: heterogeneity of the sources
Aspects of Heterogeneity
Various data models
- Due to autonomous decisions on the acquisition of systems in the divisions
- Various modeling constructs of differing power
- Application semantics can be specified to varying degrees → the mapping between data models is ambiguous

Example: relational model vs. object-oriented modeling vs. XML
[Figure: the entity Kunde (Name, Vorname, PLZ, ...) modeled as a relation, as a class, and as an XML document]
Aspects of Heterogeneity (2)
Different models for the same real-world facts
- Due to design autonomy
- Even within the same data model, different modeling is possible, e.g., due to the different modeling perspectives of the DB designers

[Figure: Kunde (Name, Vorname, Geschlecht, ...) with a gender attribute vs. Kunde (Name, Vorname, ...) specialized into the subclasses Mann and Frau]
Aspects of Heterogeneity (3)
Different representations of the data
- Different data types possible
- Different scopes of the supported data types
- Different internal representations of the data
- Also, different "values" of a data type may represent the same information
Data Error Classification
[Figure: classification of data errors into single data sources vs. integrated data sources, each at schema level and at data level. Single sources, schema level (missing integrity constraints, poor schema design): illegal values, violated attribute dependencies, violated uniqueness, violated referential integrity. Single sources, data level (errors in the data itself): missing values, misspellings, wrong values, wrong references, cryptic values, embedded values, wrong assignments, contradictory values, transpositions, duplicates, data conflicts. Integrated sources, schema level (heterogeneous data models and schemas): structural, semantic and schematic heterogeneity. Integrated sources, data level (overlapping, contradictory and inconsistent data): contradictory values, different representations, different precision, different aggregation levels, duplicates]
[Rahm, Do 2000; Leser, Naumann 2007]
Schema Heterogeneity
Schema Heterogeneity
Cause: design autonomy → different models
- Different normalization
- What is a relation, what is an attribute, what is a value?
- Distribution of data across tables
- Redundancies from the source systems
- Keys

Not well supported in SQL
- INSERT has only one target table
- SQL accesses data, not schema elements
- Usually requires programming
Schema Mapping
Data transformation between heterogeneous schemas
- Old but recurrent problem
- Usually, experts write complex queries or programs
- Time intensive
  - Requires experts for the domain, for the schemas and for the queries
  - XML makes it even more difficult: XML Schema, XQuery

Idea: automation
- Given: two schemas and a high-level mapping between them
- Wanted: a query for the data transformation
Why is schema mapping difficult?
Generation of the "right" query, taking into account
- the source and target schema
- the mapping
- and the user intention: semantics!

Guarantee that the transformed data conforms to the target schema
- Flat or nested
- Integrity constraints

Efficient data transformation
Schema Mapping: Normalized vs. Denormalized

1:1 associations are represented differently
- By occurrence in the same tuple
- By a foreign key relationship

[Schemas: Bier(bID, name, alkoholgehalt); Produkt(pID, name, hersteller, produktsorte); Produktsorte(pFK, bezeichnung)]

SELECT bID AS pID, name, NULL AS hersteller,
       NULL AS produktsorte FROM Bier
UNION
SELECT NULL AS pID, NULL AS name, NULL AS hersteller,
       bezeichnung AS produktsorte FROM Produktsorte
Schema Mapping:Normalized vs. Denormalized (2)
[Schemas: Bier(bID, name, alkoholgehalt); Produkt(pID, name, hersteller, produktsorte); Produktsorte(pFK, bezeichnung)]

SELECT bID AS pID, name, NULL AS hersteller,
       bezeichnung AS produktsorte
FROM Bier, Produktsorte
WHERE bID = pFK
Only one of four possible interpretations!
Schema Mapping:Normalized vs. Denormalized (3)
[Schemas: source Produkt(name, hersteller, produktsorte); target Bier(bID, name, alkoholgehalt) and Produktsorte(pFK, bezeichnung)]

Requires key generation: a Skolem function SK supplying a unique value with respect to its input (e.g., a concatenation of all values)

Bier := SELECT SK(name) AS bID, name,
               NULL AS alkoholgehalt FROM Produkt
Produktsorte := SELECT SK(name) AS pFK,
                       produktsorte AS bezeichnung FROM Produkt
Schema mapping: Nested vs. Flat
1:1 associations are represented differently
- E.g., as nested elements
- By a foreign key relationship

[Schemas: flat relations Bier(bID, name, alkoholgehalt), Produkt(pID, name, produktsorte) and Produktsorte(bezeichnung) vs. a nested structure Produkt(name, hersteller, produktsorte) containing Bier(name) and Produktsorte(bezeichnung)]
Difficulties
Example: Source(ID, Name, Street, ZIP-Code, Revenue)

Target schema #1:
Customer(ID, Name, Revenue)
Address(ID, Street, ZIP-Code)
- Requires 2 scans of the source table:

INSERT INTO Customer ... SELECT ...
INSERT INTO Address ... SELECT ...

Target schema #2:
PremCustomer(ID, Name, Revenue)
NormCustomer(ID, Name, Revenue)
- Requires 2 scans of the source table:

INSERT INTO PremCustomer ... SELECT ... WHERE Revenue >= X
INSERT INTO NormCustomer ... SELECT ... WHERE Revenue < X
Difficulties (2)
Schema
P1(Id, Name, Gender)
P2(Id, Name, M, W)
P31(Id, Name), P32(Id, Name)

P1 → P2:

INSERT INTO P2 (id, name, 'T', 'F') ... SELECT ...
INSERT INTO P2 (id, name, 'F', 'T') ... SELECT ...

P3 → P1:

INSERT INTO P1 (id, name, 'female') ... SELECT ... FROM P31
INSERT INTO P1 (id, name, 'male') ... SELECT ... FROM P32

The number of values must be fixed; a new gender value → change all queries
Data Errors
Data Errors
KNr | Name        | Geb.datum  | Alter | Geschl. | Telefon | PLZ
34  | Meier, Tom  | 21.01.1980 | 35    | M       | 999-999 | 39107
34  | Tina Möller | 18.04.78   | 29    | W       | 763-222 | 36999
35  | Tom Meier   | 32.05.1969 | 27    | F       | 222-231 | 39107

(plus an Email column containing only null values, and a lookup table PLZ/Ort with 39107 Magdeburg, 36996 Spanien, 55555 Illmenau)

Annotated errors: uniqueness violated (KNr 34 appears twice); different representations (name and date formats); contradictory values (birth date vs. age); missing values, e.g., default values (null emails, Telefon 999-999); referential integrity violated (PLZ 36999 has no entry in PLZ/Ort); duplicates (Meier, Tom vs. Tom Meier); spelling or typing errors (Illmenau); wrong or illegal values (date 32.05.1969); incomplete data
Avoiding Data Errors

Avoiding of                    | by
wrong data types               | data type definitions, domain constraints
wrong values                   | CHECK constraints
missing values                 | NOT NULL
invalid foreign key references | FOREIGN KEY
duplicates                     | UNIQUE, PRIMARY KEY
inconsistencies                | transactions
outdated data                  | replication, materialized views

However, in practice:
- Lack of metadata and integrity constraints, ...
- Input errors, ignorance, ...
- Heterogeneity
- ...
Phases of Data Processing
[Figure: phases of data processing — collection/selection; data profiling (identify and quantify DQ problems, recognize error types and causes); data cleaning (standardization/normalization, error correction, duplicate detection and merging); transformation (aggregation/feature extraction, dimensionality reduction/sampling, discretization); usage]
Data Profiling
Analysis of the content and structure of individual attributes
- Data type, range, distribution and variance, occurrence of null values, uniqueness, patterns (e.g., dd/mm/yyyy)

Analysis of dependencies between attributes of a relation
- "Fuzzy" keys
- Functional dependencies, potential primary keys, "fuzzy" dependencies
- Needed because:
  - No explicit constraints are specified
  - But most of the data satisfies them

Analysis of overlaps between attributes of different relations
- Redundancies, foreign key relationships
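The single-attribute part of profiling can be sketched as a small Python helper (illustration only; the function name and result keys are made up):

```python
def profile_column(values):
    """Profile a single attribute: cardinality, null share, and range —
    the per-column statistics described above."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),                     # total number of values
        "nulls": len(values) - len(non_null),     # occurrence of null values
        "distinct": len(set(non_null)),           # attribute cardinality
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }
```

Comparing "count" against "distinct" already hints at duplicates; a high null share hints at missing values or undocumented defaults.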
Data Profiling (2)
Missing or incorrect values
- Calculated vs. expected cardinality (e.g., number of branches, gender of clients)
- Number of null values, minimum / maximum, variance

Data or input errors
- Sorting and manual inspection
- Similarity tests

Duplicates
- Number of tuples vs. attribute cardinality
Data Profiling with SQL
SQL queries for simple profiling tasks
- Schema, data types: queries against the schema catalog
- Range of values:

SELECT MIN(A), MAX(A), COUNT(DISTINCT A)
FROM Tabelle

- Data errors, default values:

SELECT City, COUNT(*) AS Numb
FROM Customer GROUP BY City ORDER BY Numb

  - Ascending: input errors, e.g., Illmenau: 1, Ilmenau: 50
  - Descending: undocumented default values, e.g., AAA: 80
Data Cleaning
Detect and eliminate inconsistencies, contradictions, and errors in the data with the aim of improving its quality
Also called cleansing or scrubbing
Up to 80% of the effort in DW projects
Cleaning in the DW: part of the ETL process
Data Quality and Data Cleaning
[Figure: data quality dimensions and analysis techniques. Column analysis (validity of individual values): consistency of data types, field lengths and value ranges; correctness via statistical control (min, max, mean, median, standard deviation, ...); completeness via fill-level analysis of entities and attributes; accuracy via analysis of digit counts (total and decimal places of numeric attributes); uniformity via format analysis (numeric attributes, time units, strings). Dependency analysis (validity of multiple values): key uniqueness of primary and candidate keys; freedom from redundancy (normalization degree 1st-3rd NF, duplicate checks); uniqueness via metadata analysis. Relationship analysis: referential integrity (integrity violations, orphans, cardinalities). Rule-based analysis: consistency w.r.t. business and data rules (defects)]
Normalization and Standardization
Data type conversion: varchar → int
Encodings: 1: address unknown, 2: old address, 3: current address, 4: address of spouse, ...
Normalization: mapping into a unified format
- Date: 03/01/11 → 1 March 2011
- Currency: $ → €
- Strings to uppercase
Tokenization: "Saake, Gunter" → "Saake", "Gunter"
Discretization of numeric values
Domain-specific transformations
- Codd, Edgar Frank → Edgar Frank Codd
- Str. → Street
- Addresses from address databases
- Industry-specific product names
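The tokenization and normalization steps above can be sketched together in Python (a minimal illustration; the helper name and the uppercase convention are assumptions, not part of the slides):

```python
def normalize_name(raw):
    """Tokenize 'Last, First' into its parts and map both to a
    unified (uppercase, trimmed) format."""
    last, _, first = raw.partition(",")   # split at the first comma
    return last.strip().upper(), first.strip().upper()
```

Applied to the slide's example, normalize_name("Saake, Gunter") yields the tokens ("SAAKE", "GUNTER").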
Data Transformation
Well supported in SQL
- Many functions in the language standard
- String functions, decoding, date conversion, formulas, system variables, ...
- Create functions in PL/SQL for use in SQL

Data:
"Pause, Lilo" ⇒ "Pause", "Lilo"
"Prehn, Leo" ⇒ "Prehn", "Leo"

SQL:
INSERT INTO customers (last_name, first_name)
SELECT SubStr(name, 0, InStr(name, ',') - 1),
       SubStr(name, InStr(name, ',') + 1)
FROM rawdata;
Duplicate Detection
Identify semantically equivalent records, i.e., records that represent the same real-world object
See also: record linkage, object identification, duplicate elimination, merge/purge
- Merge: detect duplicates
- Purge: selection / calculation of the "best" representative per class

CustomerNr | Name          | Address
3346       | Just Vorfan   | Hafenstrasse 12
3346       | Justin Forfun | Hafenstr. 12
5252       | Lilo Pause    | Kuhweg 42
5268       | Lisa Pause    | Kuhweg 42
⊥          | Ann Joy       | Domplatz 2a
⊥          | Anne Scheu    | Domplatz 28
Duplicate Detection: Comparisons
Typical comparison rules:

if ssn1 = ssn2 then match
else if name1 = name2 then
  if firstname1 = firstname2 then
    if adr1 = adr2 then match
    else unmatch
  else if adr1 = adr2 then match_household
else if adr1 = adr2 then ...

Naive approach: "all-vs-all"
- O(n²) comparisons
- Maximum accuracy (depending on the rules)
- Far too expensive
Duplicate Detection: Principle
[Figure: two record sets R = {r1, r2, r3, ...} and S = {s1, s2, s3, ...}; the search space R × S is reduced by partitioning, and a comparison function classifies the remaining pairs into matches (M) and non-matches (U)]
Partitioning
Blocking
- Division of the search space into disjoint blocks
- Duplicates are searched only within a block

Sorted neighborhood [Hernandez, Stolfo 1998]
- Sort the data based on a generated key
- Compare within a sliding window

Multi-pass technique
- Transitive closure over different sort orders
Sorted Neighborhood
1. Compute a key for each record
   - E.g., SSN + first 3 characters of the name + ...
   - Account for typical errors: 0 vs. O, Soundex, neighboring keys, ...
2. Sort by key
3. Scan the list sequentially
4. Compare within a window W with |W| = w
   - Which tuples really need to be compared?

Complexity
- Key generation: O(n); sorting: O(n · log n); comparing: O((n/w) · w²) = O(n · w)
- Total: O(n · log n) or O(n · w)
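The four steps can be sketched in Python (an in-memory illustration with made-up parameter names; key generation and the match rule are passed in as functions):

```python
def sorted_neighborhood(records, key_func, window, match):
    """Sorted-neighborhood method: sort by a generated key, then compare
    each record only against its window-1 predecessors in the sorted order."""
    rs = sorted(records, key=key_func)            # steps 1 + 2
    pairs = []
    for i in range(len(rs)):                      # step 3: sequential scan
        for j in range(max(0, i - window + 1), i):  # step 4: sliding window
            if match(rs[j], rs[i]):
                pairs.append((rs[j], rs[i]))
    return pairs
```

With a key of the first few characters, near-identical records sort next to each other and are found even with a small window, at O(n · w) comparisons instead of O(n²).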
Sorted Neighborhood: Problems
Poor accuracy
- The sorting criterion always prefers certain attributes
- Are the first letters more important for identity than the last ones?
- Is the surname more important than the house number?
Increase the window size?
- Not helpful
- The dominance of one attribute remains the same, but the runtime deteriorates rapidly
Data Errors
Multi-pass technique
- Sort by multiple criteria and identify duplicates in each pass
- Form the transitive closure of the duplicates up to a given path length
[Figure: records A, B, C meet in different sliding windows across two runs]

1. Run: "A matches B"
2. Run: "B matches C"
Transitivity: "A matches C"
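Merging the match pairs of several passes via transitivity is a union-find problem. A sketch under that standard formulation (helper names are mine, not from the slides):

```python
class UnionFind:
    """Union-find structure to form the transitive closure of matches."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def duplicate_clusters(match_pairs):
    """Merge match pairs from several passes into duplicate clusters."""
    uf = UnionFind()
    for a, b in match_pairs:
        uf.union(a, b)
    clusters = {}
    for x in uf.parent:
        clusters.setdefault(uf.find(x), set()).add(x)
    return list(clusters.values())
```

Given the matches "A–B" from run 1 and "B–C" from run 2, the closure yields one cluster {A, B, C}, even though A and C never met in any window.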
Data Errors
Comparison functions
Comparison functions for fields (strings A and B), including:
- Edit distance: number of edit operations (insert, delete, replace) to transform A into B
- q-grams: comparison of the sets of all substrings of A and B of length q
- Jaro distance and Jaro-Winkler distance: consideration of common characters (within half the string length) and transposed characters (at another position)
Data Errors
Edit Distance
Levenshtein distance:
- Number of edit operations (insert, delete, replace) to transform A into B
- Example:
  edit_distance("Qualität", "Quantität") = 2
  ⇒ update(3,'n') ⇒ insert(4,'t')
- Application:

  select P1.Name, P2.Name
  from   Produkt P1, Produkt P2
  where  edit_distance(P1.Name, P2.Name) <= 2
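The edit_distance function used in the SQL above can be sketched with the standard dynamic-programming recurrence (a generic Levenshtein implementation, not the specific UDF assumed by the query):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming, one row at a time.
    Allowed operations: insert, delete, replace."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # replace (or keep)
        prev = cur
    return prev[-1]
```

Keeping only two rows reduces the space from O(|A|·|B|) to O(|B|), while the running time stays O(|A|·|B|).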
Data Errors
q-Grams
Set of all substrings of length q
  Qualität (q = 3): { __Q, _Qu, Qua, ual, ali, lit, itä, tät, ät_, t__ }
Observation: strings with small edit distance have many common q-grams; for edit distance at most k, at least
  max(|A|, |B|) − 1 − (k − 1) · q
common (padded) q-grams
Positional q-grams: extension with the position in the string
  Qualität := { (-1, __Q), (0, _Qu), (1, Qua), ... }
- Filtering for efficient comparison:
  - COUNT: number of common q-grams
  - POSITION: position difference between corresponding q-grams ≤ k
  - LENGTH: difference in string lengths ≤ k
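Padded q-grams and the COUNT filter from above can be sketched as follows (function names are illustrative):

```python
def qgrams(s, q=3, pad="_"):
    """Padded q-grams of a string; padding with q-1 fill characters makes
    the boundary characters participate in q q-grams each."""
    s = pad * (q - 1) + s + pad * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def count_filter(a, b, q=3, k=2):
    """COUNT filter: strings within edit distance k share at least
    max(|a|, |b|) - 1 - (k - 1) * q padded q-grams."""
    threshold = max(len(a), len(b)) - 1 - (k - 1) * q
    return len(qgrams(a, q) & qgrams(b, q)) >= threshold
```

Pairs that fail the filter can be discarded without computing the (more expensive) edit distance; pairs that pass are still only candidates and must be verified.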
Data Errors
Data Conflicts
Data conflict: two duplicates have different attribute values for a semantically identical attribute
- In contrast to conflicts with integrity constraints
Data conflicts arise
- within an information system (intra-source) and
- with the integration of multiple information systems (inter-source)
Prerequisite: duplicates, i.e., identity has already been established
Required: conflict resolution (purging, reconciliation)
Data Errors
Data Conflicts: Origins
Lack of integrity constraints or consistency checks
In the case of redundant schemas
Due to partial information
With the emergence of duplicates
Incorrect entries
- Typing errors, transmission errors
- Incorrect calculation results
Obsolete entries
- Different update times
  - Inadequate timeliness of a source
  - Delayed update
- Forgotten update
Data Errors
Data Conflicts: Remedies
Reference tables for exact value mappings
- e.g., cities, countries, product names, codes, ...
Similarity measures
- For typos, language variants (Meier, Mayer, ...)
Standardization and transformation
Use of background knowledge (metadata)
- e.g., conventions (typical spellings)
- Ontologies, thesauri, dictionaries for the treatment of homonyms, synonyms, ...
At integration time
- Preference ordering over data sources according to relevance, trust, availability, etc.
- Conflict resolution functions
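A conflict resolution function based on a preference ordering over sources can be sketched as follows; the source names and ranking are hypothetical examples, not from the slides.

```python
def resolve(values, source_rank):
    """Conflict resolution by source preference: among the non-null values
    for one attribute, pick the one from the most trusted source
    (lower rank = more trusted). values: {source_name: value}."""
    candidates = [(src, v) for src, v in values.items() if v is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda sv: source_rank[sv[0]])[1]
```

Other common resolution functions follow the same shape, e.g., most-recent-wins (rank by timestamp) or aggregation (average of numeric values).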
ELT
ETL vs. ELT
ELT = Extract-Load-Transform
- Variant of the ETL process in which the data is transformed after the load
- Objective: transformation with SQL statements in the target database
- No dedicated ETL engines required
[Figure: data flows from the sources via Extract (E) and Load (L) into the Data Warehouse, where the Transform (T) step runs]
ELT
ELT
Extraction
- Queries optimized for the source database (e.g., SQL)
- Extraction can also be monitored with monitors
- Automatic extraction is difficult (e.g., when data structures change)
Load
- Parallel processing of SQL statements
- Bulk load (assumption: no write access to the target system)
- No record-based logging
Transformation
- Use of set-oriented operations of the DW transformation component
- Complex transformations by means of procedural languages (e.g., PL/SQL)
- Specific statements (e.g., CTAS in Oracle: CREATE TABLE AS SELECT)
ELT
Summary
ETL as the process of transferring data from the source systems into the DWH
Topics of ETL and data quality typically make up 80% of the effort in DWH projects!
- Slow queries are annoying
- Incorrect results make the DWH useless
Part of the transformation step
- Schema level: schema mapping and schema transformation
- Instance level: data cleaning