Extract Transform Load Cycle
Challenges in the ETL Process
ETL functions are challenging because of the nature of the source systems:
- Diverse and disparate source systems
- Different operating systems/platforms
- Historical data may not be preserved
- Data quality may not be guaranteed in the older operational source systems
- Structures keep changing with time
- Data inconsistency is prevalent in the source systems
- Data may be stored in cryptic form
- Data types, formats, and naming conventions may differ
Steps for ETL process
- Determine all the target data needed
- Identify all the data sources, both internal and external
- Prepare data mapping for target data elements from the sources
- Determine the data transformation and cleansing rules
- Plan for aggregate tables
- Organize the data staging area and test the tools
- Write procedures for all data loads
- ETL for dimension tables
- ETL for fact tables
The ETL Process
- Capture/extract
- Scrub (data cleansing)
- Transform
- Load
ETL = Extract, Transform, and Load
The ETL Process
[Diagram] Source Systems → (Extract) → Staging Area → (Transform) → (Load) → Presentation System
Data Extraction
- Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated metadata)
- Sometimes source data is copied to the target database using the replication capabilities of a standard RDBMS (not recommended because of dirty data in the source systems)
- Specialized ETL software
Data Extraction Techniques
Immediate data extraction (real time):
- Capture through transaction logs
- Capture through database triggers
- Capture through source applications
Deferred data extraction (capture happens later):
- Capture based on date and timestamp
- Capture by comparing files
Capture through transaction logs
- Does not provide much flexibility for capture specifications
- Does not affect the performance of the source systems
- Does not require any revisions to the existing source applications
- Cannot be used on file-oriented systems
Capture through database triggers
- Does not provide much flexibility for capture specifications
- Adds overhead to the source systems, since trigger procedures execute as part of transaction processing
- Does not require any revisions to the existing source applications
- Cannot be used on file-oriented systems
- Cannot be used on legacy systems
Capture in Source Application
- Provides flexibility for capture specifications
- Affects the performance of the source systems because of the additional processing in the applications
- Requires the existing source systems to be revised
- Can be used on file-oriented systems
- Can be used on legacy systems
Capture based on date and timestamp
- Provides flexibility for capture specifications
- Does not affect the performance of the source systems
- Requires the existing source systems to be revised
- Can be used on file-oriented systems
- Cannot be used on legacy systems
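A minimal sketch of timestamp-based capture in Python, assuming every source row carries a last_updated column (the row layout and column name are illustrative):

    from datetime import datetime

    # Illustrative source rows; in practice these come from a source table
    # that maintains a last_updated timestamp on every insert and update.
    SOURCE_ROWS = [
        {"id": 1, "name": "Alice", "last_updated": datetime(2019, 8, 1)},
        {"id": 2, "name": "Bob", "last_updated": datetime(2019, 8, 6)},
    ]

    def extract_changes(rows, last_extract_time):
        """Return only the rows changed since the previous extraction run."""
        return [r for r in rows if r["last_updated"] > last_extract_time]

    # The time of the previous run would normally be persisted by the ETL job.
    previous_run = datetime(2019, 8, 5)
    print(extract_changes(SOURCE_ROWS, previous_run))  # only Bob's row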
Capture by comparing files
- Provides flexibility for capture specifications
- Does not affect the performance of the source systems
- Does not require the existing source systems to be revised
- May be used on file-oriented systems
- May be used on legacy systems
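A minimal sketch of file-comparison capture, diffing yesterday's snapshot against today's to derive inserts, updates, and deletes (the record layout and key name are illustrative):

    def capture_by_comparison(previous, current, key="id"):
        """Diff two snapshots (lists of records) of the same file, keyed on
        the record key, to derive inserts, updates, and deletes."""
        prev = {r[key]: r for r in previous}
        curr = {r[key]: r for r in current}
        inserts = [r for k, r in curr.items() if k not in prev]
        updates = [r for k, r in curr.items() if k in prev and r != prev[k]]
        deletes = [r for k, r in prev.items() if k not in curr]
        return inserts, updates, deletes

    yesterday = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
    today = [{"id": 2, "qty": 9}, {"id": 3, "qty": 1}]
    print(capture_by_comparison(yesterday, today))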
Data Cleansing
- Source systems contain dirty data that must be cleansed
- ETL software contains only rudimentary data cleansing capabilities
- Specialized data cleansing software is often used; it is important for performing name and address correction and householding functions
- Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)
Reasons for Dirty Data
- Dummy values
- Absence of data
- Multipurpose fields
- Cryptic data
- Contradicting data
- Inappropriate use of address lines
- Violation of business rules
- Reused primary keys, non-unique identifiers
- Data integration problems
Parsing
- Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
- Examples include parsing the first, middle, and last name; street number and street name; and city and state.
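A minimal sketch of the idea in Python; the field layouts are illustrative, and real parsers handle far more name and address variants:

    def parse_name(full_name):
        """Split a raw name field into first / middle / last components."""
        parts = full_name.split()
        first = parts[0]
        last = parts[-1]
        middle = " ".join(parts[1:-1]) or None
        return {"first": first, "middle": middle, "last": last}

    def parse_street(street_line):
        """Separate the street number from the street name."""
        number, _, name = street_line.partition(" ")
        return {"number": number, "street": name}

    print(parse_name("John Q Public"))
    print(parse_street("221B Baker Street"))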
Correcting
- Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
- Examples include replacing a vanity address and adding a ZIP code.
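A minimal sketch of the ZIP-code example, assuming a hypothetical (city, state) to ZIP lookup table as the secondary data source:

    # Hypothetical secondary data source: a (city, state) -> ZIP lookup table.
    ZIP_LOOKUP = {("Springfield", "IL"): "62701"}

    def add_missing_zip(record):
        """Fill in a missing ZIP code from the secondary reference table."""
        if not record.get("zip"):
            record["zip"] = ZIP_LOOKUP.get((record["city"], record["state"]))
        return record

    print(add_missing_zip({"city": "Springfield", "state": "IL", "zip": ""}))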
Standardizing
- Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
- Examples include adding a prename, replacing a nickname with the formal name, and using a preferred street name.
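A minimal sketch with illustrative nickname and street-suffix rules standing in for the business rules:

    # Illustrative custom business rules for standardization.
    NICKNAMES = {"Bill": "William", "Bob": "Robert", "Liz": "Elizabeth"}
    STREET_SUFFIXES = {"St.": "Street", "Ave.": "Avenue", "Rd.": "Road"}

    def standardize(record):
        """Apply conversion rules so every record uses the preferred format."""
        record["first"] = NICKNAMES.get(record["first"], record["first"])
        words = [STREET_SUFFIXES.get(w, w) for w in record["street"].split()]
        record["street"] = " ".join(words)
        return record

    print(standardize({"first": "Bill", "street": "12 Main St."}))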
Matching
- Searches for and matches records within and across the parsed, corrected, and standardized data based on predefined business rules, in order to eliminate duplicates.
- Examples include identifying similar names and addresses.
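A minimal sketch using Python's standard-library difflib for string similarity; the 0.8 threshold and the name-plus-address rule are illustrative business rules, not a prescribed matching algorithm:

    from difflib import SequenceMatcher

    def similarity(a, b):
        """Crude string similarity score in [0, 1]."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def is_match(rec_a, rec_b, threshold=0.8):
        """Flag two records as probable duplicates when both the name and
        the address are similar enough (a simple business rule)."""
        return (similarity(rec_a["name"], rec_b["name"]) >= threshold
                and similarity(rec_a["address"], rec_b["address"]) >= threshold)

    a = {"name": "Jon Smith", "address": "42 Oak Street"}
    b = {"name": "John Smith", "address": "42 Oak St"}
    print(is_match(a, b))  # True: similar name and address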
Consolidating
- Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
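A minimal sketch that merges a group of matched records into one surviving record, preferring the first non-empty value per field (a deliberately simplified survivorship rule):

    def consolidate(records):
        """Merge a group of matched duplicate records into ONE
        representation, keeping the first non-empty value per field."""
        merged = {}
        for rec in records:
            for field, value in rec.items():
                if not merged.get(field) and value:
                    merged[field] = value
        return merged

    dupes = [
        {"name": "John Smith", "phone": "", "email": "js@example.com"},
        {"name": "J. Smith", "phone": "555-0100", "email": ""},
    ]
    print(consolidate(dupes))  # one record with name, phone, and email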
Data Transformation
- Transforms the data in accordance with the business rules and standards that have been established
- Examples include: format changes, de-duplication, splitting up fields, replacement of codes, derived values, and aggregates
- Deals with rectifying any inconsistencies
- Attribute naming inconsistency is an issue: once all the data elements have the right names, they must be converted into common formats
- Data formats have to be standardized
- All the transformation activities are automated
- Example tool: DataMapper
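A minimal sketch of a few of these transformations applied to one row; the row layout, code table, and rules are all illustrative:

    CODE_TABLE = {"M": "Male", "F": "Female"}  # illustrative code replacement

    def transform(row):
        """Apply several typical transformations to one source row."""
        first, _, last = row["customer_name"].partition(" ")  # split up a field
        return {
            "first_name": first,
            "last_name": last,
            "gender": CODE_TABLE.get(row["gender"], "Unknown"),  # replace a code
            "order_date": row["order_date"].replace("/", "-"),   # format change
            "total": row["qty"] * row["unit_price"],             # derived value
        }

    row = {"customer_name": "Jane Doe", "gender": "F",
           "order_date": "2019/08/07", "qty": 3, "unit_price": 9.5}
    print(transform(row))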
Basic tasks in Transformation
- Selection
- Splitting/joining
- Conversion
- Summarization
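As an example of the last task, summarization can be sketched as a simple group-and-total pass (the field names are illustrative):

    from collections import defaultdict

    def summarize(sales):
        """Roll individual sales up to per-product totals, the kind of
        summarization that feeds an aggregate table."""
        totals = defaultdict(float)
        for sale in sales:
            totals[sale["product"]] += sale["amount"]
        return dict(totals)

    sales = [{"product": "A", "amount": 10.0},
             {"product": "A", "amount": 5.0},
             {"product": "B", "amount": 7.5}]
    print(summarize(sales))  # {'A': 15.0, 'B': 7.5}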
Data Loading
- Data are physically moved to the data warehouse
- The loading takes place within a time window
- The trend is toward near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications
Different modes in which data can be applied to the warehouse:
- Load
- Append
- Merge
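A minimal sketch of the three modes, modeling the target table as a dict keyed on the primary key; the append behavior shown (skip rows whose key already exists) is one common convention:

    def apply_mode(warehouse, incoming, mode, key="id"):
        """Apply incoming rows to a target table (modeled as a dict)."""
        if mode == "load":             # load: wipe the target, then write
            warehouse.clear()
        for row in incoming:
            if mode == "append" and row[key] in warehouse:
                continue               # append: never overwrite existing rows
            warehouse[row[key]] = row  # load and merge: insert or update
        return warehouse

    table = {1: {"id": 1, "qty": 5}}
    print(apply_mode(table, [{"id": 1, "qty": 9}, {"id": 2, "qty": 3}], "merge"))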
Loading Techniques
- Initial load
- Incremental load
- Full refresh
Sample ETL Tools
- Teradata Warehouse Builder from Teradata
- DataStage from Ascential Software
- SAS System from SAS Institute
- PowerMart/PowerCenter from Informatica
- Sagent Solution from Sagent Software
- Hummingbird Genio Suite from Hummingbird Communications
Steps in data reconciliation
Capture/extract = obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse
- Static extract: capturing a snapshot of the source data at a point in time
- Incremental extract: capturing changes that have occurred since the last static extract
Steps in data reconciliation (continued)
Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality
- Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies
- Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data
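A minimal sketch of a scrub pass that logs and rejects rows with missing data, reformats a field, and drops duplicates (the fields and rules are illustrative):

    import logging

    def scrub(rows):
        """Reject and log rows with missing data, reformat a field, and
        drop duplicate records."""
        seen, clean = set(), []
        for row in rows:
            if not row.get("customer_id"):
                logging.warning("missing customer_id, row rejected: %s", row)
                continue                                 # error detection/logging
            row["state"] = row["state"].strip().upper()  # reformatting
            key = (row["customer_id"], row["state"])
            if key in seen:
                continue                                 # duplicate data dropped
            seen.add(key)
            clean.append(row)
        return clean

    rows = [{"customer_id": "C1", "state": " il"},
            {"customer_id": "C1", "state": "IL"},
            {"customer_id": "", "state": "NY"}]
    print(scrub(rows))  # one clean row survives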
Steps in data reconciliation (continued)
Transform = convert data from the format of the operational system to the format of the data warehouse
- Record-level: selection (data partitioning), joining (data combining), aggregation (data summarization)
- Field-level: single-field (from one field to one field), multi-field (from many fields to one, or one field to many)
Steps in data reconciliation (continued)
Load/index = place transformed data into the warehouse and create indexes
- Refresh mode: bulk rewriting of target data at periodic intervals
- Update mode: only changes in the source data are written to the data warehouse
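A minimal sketch contrasting the two modes, with the warehouse table again modeled as a dict keyed on an illustrative id field:

    def refresh(target, full_snapshot):
        """Refresh mode: bulk rewrite of the target at a periodic interval."""
        target.clear()
        target.update({row["id"]: row for row in full_snapshot})

    def update(target, changed_rows):
        """Update mode: write only the rows that changed in the source."""
        for row in changed_rows:
            target[row["id"]] = row

    warehouse = {}
    refresh(warehouse, [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
    update(warehouse, [{"id": 2, "v": "b2"}])
    print(warehouse)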