ETL (Extract Transform Load) - Wikispaces

Date post: 20-Apr-2018
Upload: vungoc
11/27/2010

Data Warehousing

ETL ( Extract Transform Load )

Acknowledgement

Data warehousing (Fall 2010), Saleha Raza

• Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals

By: Paulraj Ponniah

• The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition

By: Ralph Kimball, Margy Ross

ETL

• Extract

• Transform

• Load

• It is not uncommon for a project team to spend 50–70% of the project effort on ETL tasks.

Data Extraction

Difficulties in Source System

Major Steps in ETL Process

Data Extraction

Data Extraction

Current vs Periodic attributes in operational system

Data Extraction

• Immediate Data Extraction

– Through transaction logs

– Through database triggers

– Capture in source system

• Deferred Data Extraction

– Capture based on datetime stamp (what if a source record gets deleted?)

– Capture by comparing files (also called snapshot differential)
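
The two deferred techniques can be sketched roughly as follows. This is a minimal illustration rather than a production extractor: the row layout (dicts with `id` and `updated_at`) and the function names `capture_by_timestamp` and `snapshot_differential` are assumptions made for this example. It also hints at why the deleted-record question matters: a deleted row leaves no timestamp behind, so only the snapshot comparison notices it.

```python
from datetime import datetime


def capture_by_timestamp(rows, last_extract_time):
    """Deferred capture: keep rows changed since the previous extraction run."""
    return [r for r in rows if r["updated_at"] > last_extract_time]


def snapshot_differential(previous, current, key="id"):
    """Deferred capture: compare the previous and current full snapshots."""
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    inserted = [curr[k] for k in curr if k not in prev]
    updated = [curr[k] for k in curr if k in prev and curr[k] != prev[k]]
    deleted = [prev[k] for k in prev if k not in curr]
    return inserted, updated, deleted


# Hypothetical source snapshots taken on two different days.
yesterday = [
    {"id": 1, "city": "Karachi", "updated_at": datetime(2010, 11, 25)},
    {"id": 2, "city": "Lahore", "updated_at": datetime(2010, 11, 25)},
]
today = [
    {"id": 1, "city": "Islamabad", "updated_at": datetime(2010, 11, 27)},
    {"id": 3, "city": "Quetta", "updated_at": datetime(2010, 11, 27)},
]

# Timestamp capture sees rows 1 and 3, but has no way to see that row 2 is gone.
changed = capture_by_timestamp(today, datetime(2010, 11, 26))

# The snapshot comparison reports row 2 as deleted.
inserted, updated, deleted = snapshot_differential(yesterday, today)
```

The trade-off is cost: the snapshot approach needs a full copy of the previous extract, while the timestamp approach only needs the time of the last run.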

Immediate Data Extraction

Deferred Data Extraction

Data Extraction techniques - Summary

Data Transformation

• Format Revision

– Type/length conversions, datetime formatting, etc.

• Decoding of fields

– Cryptic fields, Boolean values

• Calculated and derived values

• Splitting of single field

– E.g. Address, FullName, etc.

• Merging of information

– Different attributes coming from different sources

• Character set conversion

– EBCDIC, ASCII, Unicode, etc.

• Conversion of units of measurement

– Amounts in different currencies across different global branches, quantities in different units

• Date/time conversions

– Different date formats (American/British date formats)

• Summarization

– Generation of summary tables

• Key restructuring

– Generation of surrogate keys to avoid business keys

• Deduplication

– Resolution among different records coming from different sources pointing to the same object
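
A few of these transformation types can be illustrated on a single record. The sketch below is only an example under stated assumptions: the field names (`cust_code`, `full_name`, `joined`, and so on), the `GENDER_CODES` lookup table, and the helper names are hypothetical and do not come from the slides.

```python
from datetime import datetime
from itertools import count

GENDER_CODES = {"M": "Male", "F": "Female"}  # hypothetical code table
_next_key = count(1)                         # surrogate key generator
_key_map = {}                                # business key -> surrogate key


def surrogate_key(business_key):
    """Key restructuring: assign a stable integer key with no business meaning."""
    if business_key not in _key_map:
        _key_map[business_key] = next(_next_key)
    return _key_map[business_key]


def transform(record):
    """Apply several field-level transformations to one source record."""
    # Splitting of a single field: FullName -> first and last name.
    first, _, last = record["full_name"].partition(" ")
    # Date conversion: parse an American-style date (MM/DD/YYYY).
    joined = datetime.strptime(record["joined"], "%m/%d/%Y")
    return {
        "customer_key": surrogate_key(record["cust_code"]),
        "first_name": first,
        "last_name": last,
        # Decoding of a cryptic field.
        "gender": GENDER_CODES.get(record["gender"], "Unknown"),
        # Format revision: emit the date as ISO 8601 text.
        "joined": joined.date().isoformat(),
        # Type conversion: string amount -> float.
        "amount": round(float(record["amount"]), 2),
    }


row = transform({
    "cust_code": "C-366",
    "full_name": "Jane Doe",
    "gender": "F",
    "joined": "11/27/2010",
    "amount": "1250.50",
})
```

Calling `surrogate_key` again with the same business key returns the same integer, which is what lets fact rows loaded later point at the same dimension row.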

Data Integration & Consolidation

• Entity Identification problem

• Multiple sources problem

• Transformation of dimension attributes

– Incorporating dimension changes (Type 1/Type 2/Type 3 changes)
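
The slide only names the change types. In the usual dimensional-modeling terminology (as in the Kimball text cited in the acknowledgement), a Type 1 change overwrites the old value and a Type 2 change adds a new dimension row so that history is preserved. The sketch below illustrates those two under that assumption; the column names (`surrogate_key`, `valid_from`, `valid_to`, `is_current`) and function names are hypothetical. Type 3, which keeps the previous value in an extra column, is not shown.

```python
from datetime import date


def apply_type1(dimension, business_key, attr, value):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    for row in dimension:
        if row["business_key"] == business_key:
            row[attr] = value


def apply_type2(dimension, business_key, attr, value, change_date):
    """Type 2: expire the current row and add a new row with a new surrogate key."""
    current = next(
        r for r in dimension
        if r["business_key"] == business_key and r["is_current"]
    )
    current["is_current"] = False
    current["valid_to"] = change_date
    new_row = dict(current, **{attr: value})
    new_row.update(
        surrogate_key=max(r["surrogate_key"] for r in dimension) + 1,
        valid_from=change_date,
        valid_to=None,
        is_current=True,
    )
    dimension.append(new_row)


customers = [{
    "surrogate_key": 1, "business_key": "C-366",
    "name": "J. Doe", "city": "Karachi",
    "valid_from": date(2009, 1, 1), "valid_to": None, "is_current": True,
}]

apply_type1(customers, "C-366", "name", "Jane Doe")               # correction
apply_type2(customers, "C-366", "city", "Lahore", date(2010, 11, 27))  # real change
```

After both calls the dimension holds two rows for the same business key: the expired Karachi row and the current Lahore row.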

Data Loading

Data Loading Techniques

• Load

• Append

• Destructive Merge

• Constructive Merge
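
The slide lists the four modes without defining them. The sketch below follows the way they are usually described (for instance in the Ponniah text cited in the acknowledgement); those definitions, the in-memory list-of-dicts table, and the `current` flag are assumptions of this example rather than something stated on the slide.

```python
def load(target, incoming):
    """Load: wipe whatever is in the target and apply the incoming rows."""
    target.clear()
    target.extend(dict(r) for r in incoming)


def append(target, incoming):
    """Append: keep the existing rows and add the incoming ones."""
    target.extend(dict(r) for r in incoming)


def destructive_merge(target, incoming, key="id"):
    """Destructive merge: update rows whose key matches, add the rest."""
    by_key = {r[key]: r for r in target}
    for r in incoming:
        if r[key] in by_key:
            by_key[r[key]].update(r)
        else:
            new = dict(r)
            by_key[r[key]] = new
            target.append(new)


def constructive_merge(target, incoming, key="id"):
    """Constructive merge: keep the old row, add the new one as superseding it."""
    for r in incoming:
        for old in target:
            if old[key] == r[key] and old.get("current", True):
                old["current"] = False
        target.append(dict(r, current=True))


existing = [{"id": 1, "qty": 5}]
incoming = [{"id": 1, "qty": 9}, {"id": 2, "qty": 3}]

merged = [dict(r) for r in existing]
destructive_merge(merged, incoming)     # id 1 is overwritten

versioned = [dict(r) for r in existing]
constructive_merge(versioned, incoming)  # id 1 now has two rows
```

The difference between the two merges is whether the old version of a matching row survives the load.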

Before loading data into the data warehouse, indexes are usually dropped from the tables and recreated after loading.
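
The drop, load, recreate sequence can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the table `sales_fact`, the index `idx_sales_product`, and the generated batch are hypothetical, and a real warehouse would use its own database's bulk loader.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (product_key INTEGER, qty INTEGER)")
conn.execute("CREATE INDEX idx_sales_product ON sales_fact (product_key)")

rows = [(i % 100, i) for i in range(10_000)]  # hypothetical incoming batch

# 1. Drop the index so it is not maintained row by row during the load.
conn.execute("DROP INDEX idx_sales_product")

# 2. Bulk-load the batch inside a single transaction.
with conn:
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)

# 3. Rebuild the index once, over the fully loaded table.
conn.execute("CREATE INDEX idx_sales_product ON sales_fact (product_key)")

loaded = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('sales_fact')")]
```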

Data Loading Techniques

Loading changes in dimension tables

Data Quality

The three critical aspects of data in a data warehouse are quality, quality, and quality.

Data Quality

• Data quality implies that data is fit for the purpose for which it is intended.

• Data quality vs Data accuracy

Some explicit data quality problems

• Dummy values in fields

– E.g. 11111111111 in zip code, spaces in mandatory fields

• Absence of data values

– Data is not important for the operational system and hence is not mandatory, but is crucial in analysis.

• Unofficial use of fields

– E.g. phone no/fax in address line 3, customer comments in Contact field, product features in handling instructions

• Cryptic fields

– Cryptic codes / magic numbers

• Contradicting values

– Home address vs Home phone, State vs Zip code, DOB vs Age

• Violation of business rules

– Sell price > Cost price, Profit percent between 1 and 100, Probability between 0 and 1, Qty Produced = Qty Accepted + Qty Rejected

• Reused primary keys

Some explicit data quality problems

• Non-unique identifiers

– Entity identification problem: product code 366 points to different records in the inventory system and the POS system

• Inconsistent values

– Student vs Faculty record for students who teach as well

• Incorrect values

– @. in email

• Multipurpose fields

– FacultyID / StudentID in LoginID

• Erroneous integration

– Auction example
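
Several of the problems listed on these two slides can be detected with simple rule checks during ETL. The sketch below covers dummy or blank zip codes and three of the business rules named above; the record layout and the function name `check_record` are assumptions for illustration, and the dummy-value test (a zip code made of one repeated character) is a deliberately crude heuristic.

```python
def check_record(rec):
    """Return a list of data quality violations found in one record."""
    problems = []

    zip_code = rec.get("zip", "")
    if not zip_code.strip():
        problems.append("zip: blank in mandatory field")
    elif len(set(zip_code)) == 1:
        problems.append("zip: dummy value")  # e.g. 11111111111

    if not rec["sell_price"] > rec["cost_price"]:
        problems.append("sell price must exceed cost price")
    if not 0 <= rec["probability"] <= 1:
        problems.append("probability outside 0..1")
    if rec["qty_produced"] != rec["qty_accepted"] + rec["qty_rejected"]:
        problems.append("qty produced != accepted + rejected")
    return problems


good = {"zip": "74600", "sell_price": 12.0, "cost_price": 10.0,
        "probability": 0.4, "qty_produced": 100,
        "qty_accepted": 95, "qty_rejected": 5}
bad = {"zip": "11111111111", "sell_price": 8.0, "cost_price": 10.0,
       "probability": 1.4, "qty_produced": 100,
       "qty_accepted": 90, "qty_rejected": 5}

good_problems = check_record(good)
bad_problems = check_record(bad)
```

Records that fail a rule would typically be routed to a reject or review table rather than silently loaded.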

Data Quality

Incorrect codes, states, status, etc.

int value stored in string format, datetime in string

Null values, empty strings not allowed in DW

Data Quality

From date < To date, Sell price > Cost price, Loan balance >= 0

Logical parts of attributes

Address line 3 for phone/email; Res phone, Home phone, Cell

e.g. PK must not be null, FK must be properly referenced
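
The type and key checks mentioned here (integers or dates held as strings, null primary keys, unreferenced foreign keys) could be tested along the following lines. The helper names, the assumed ISO date format, and the sample `products`/`sales` rows are all hypothetical.

```python
from datetime import datetime


def is_int_string(value):
    """True if a string-typed field actually holds an integer."""
    try:
        int(value)
        return True
    except (TypeError, ValueError):
        return False


def is_date_string(value, fmt="%Y-%m-%d"):
    """True if a string-typed field holds a date in the expected format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False


def null_primary_keys(rows, pk):
    """Rows whose primary key is missing or empty."""
    return [r for r in rows if r.get(pk) in (None, "")]


def orphan_foreign_keys(child_rows, fk, parent_rows, pk):
    """Child rows whose foreign key does not reference an existing parent."""
    parent_keys = {p[pk] for p in parent_rows}
    return [c for c in child_rows if c[fk] not in parent_keys]


products = [{"product_id": 366}, {"product_id": 367}]
sales = [
    {"sale_id": 1, "product_id": 366, "qty": "12", "sold_on": "2010-11-27"},
    {"sale_id": None, "product_id": 999, "qty": "twelve", "sold_on": "27/11/2010"},
]

bad_pks = null_primary_keys(sales, "sale_id")
orphans = orphan_foreign_keys(sales, "product_id", products, "product_id")
```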

Take a break!

Sources of Data Pollution

• System conversions

• Data aging

• Heterogeneous system integration

• Poor database design

• Incomplete information at data entry

• Input errors

• Internationalization/localization

• Fraud

• Lack of policies

Validation of Names and Addresses

