Tips & Tricks Every ETL Developer Should Know
Sean Desmond, Informatica
Vijay Viswanathan, Cognicase
Agenda
• Objectives
• Who Are The Presenters
• Top Ten ETL Tips & Tricks – Overview
• The Ministry Project – Overview
• Top Ten ETL Tips & Tricks – “The Meat and Potatoes”
• Summary
• Questions & Answers
Introduction
Objectives
• By the end of this session you should…
  • Understand how a warehouse design is tightly integrated into the PowerCenter architecture
  • Receive several mapping ‘tips & tricks’ garnered from a successful implementation
  • Weigh the pros and cons of applying these ‘tips & tricks’ to one of your own project solutions
The Presenters
• Sean Desmond
  • Regional Manager, Informatica Professional Services, New England/E. Canada
  • 6 years in Data Warehousing, Metadata Management, Project Delivery
• Vijay Viswanathan
  • BIDW Consultant, Cognicase (Toronto)
  • Specializes in data warehousing / ETL design
  • Over 5 years in Data Warehousing
Top Ten ETL Tips & Tricks
Top Ten ETL Tips & Tricks
10. Dedicate Time to Infrastructure and Standards Prior to Development (Baseline Architecture Deployment)
9. Use Velocity
8. Reduce Reliance on Stored Procedures
7. Audit Your Loads
6. Track Data Errors
Top Ten ETL “Tips & Tricks”
5. Bless the ROUTER!
4. Be careful of “Lookup Gotchas”!
3. Determine the Record Type in Staging
2. Use Parameter Files
1. Create a Common Library of Sources, Targets and Transformations
The Ministry Project
Ministry Project - Overview
• Combined data about students, course marks, schools, teachers, funding, socio-economic demographics, standardized testing results
• Tools Used:
  • ErWin v4.0
  • Informatica PowerCenter v5.1
  • Cognos PowerPlay Web v7.0, Cognos IWR v7.0
  • DB2 v7.2
• 3 target areas:
  • Stage (based on source file layouts)
  • Data Warehouse (mainly normalized)
  • Data Marts (dimensional)
Project Architecture
[ESDW Architecture diagram: Operational Databases → Extract, Cleanse & Load (extract cleansing; depersonalising – removal of personal identifying information, generation of a unique record identifier) → Staging Area (ALL INFORMATION DEPERSONALIZED; NO USER ACCESS) → Extract, Transform & Load (data atomisation) → Central Store (data, index, metadata; source information & metadata updates) → Extract, Transform & Load (computation of reference elements; data aggregation where necessary) → Datamart → User Data Mart (access via BI tools; Business Intelligence & Reporting Tools)]
Top Ten ETL Tips & Tricks – “The Meat & Potatoes”
Tip #10 – Dedicate Time to Infrastructure and Standards Prior to Development
Dedicate Time to Infrastructure and Standards Prior to Development
• Should take place at least 2-3 weeks BEFORE beginning any development (BLA)
  • Development Standards
  • Folder Architecture
  • Security Measures
  • Naming Convention Standards
  • Metadata Documentation Standards
  • Lifecycle Strategy
  • Shared Objects Strategy
Tip #9 – Use Velocity
Use Velocity
• A methodology for the development of analytic solutions based on Informatica platform products, Informatica PowerCenter® and Informatica PowerMart®
• Applications & PowerAnalyzer being incorporated
• Informatica Velocity covers each of the major phases of analytic solution development efforts, including Manage, Architect, Design, Build, Deploy, and Operate
• ‘Project Roadmap’
• Best Practices – Mapping Design, P&T, Migration
• Sample Deliverables – Mapping Inventory, Mapping Spec, System Test Plan
Tip #8 – Reduce Reliance on Stored Procedures
Reduce Reliance on Stored Procedures
• Stored procedures are a big performance hit!
• Avoid external calls to stored procedures unless absolutely necessary
• For surrogate key generation, use the native Informatica Sequence Generator or the IDENTITY datatype
• Personally saw the performance of a mapping increase from 5 rows/sec to over 500 rows/sec once a stored procedure was replaced!
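The per-row cost is the reason for the gain above: a stored-procedure call makes a database round trip for every row, while a Sequence Generator is essentially an in-memory counter. A minimal sketch in plain Python (not Informatica code; all names here are hypothetical):

```python
from itertools import count

# Hypothetical sketch: surrogate keys come from an in-memory counter,
# like Informatica's Sequence Generator, instead of a per-row database
# call to a stored procedure.
surrogate_keys = count(start=1)

def assign_keys(rows):
    """Attach a surrogate key to each incoming row in memory."""
    return [dict(row, dim_key=next(surrogate_keys)) for row in rows]

loaded = assign_keys([{"name": "Sandy Rubble"}, {"name": "Fred Flint"}])
```

The counter never leaves the process, so throughput is bounded by the transformation pipeline rather than by database latency.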
Tip #7 – Audit Your Loads
Audit Your Loads
• A key area that is quite often ignored
• You must match to the source systems or be able to explain the differences
• Audit data loads (when did a load start, and what is its status?)
• Audit information increases the end-user’s confidence in the quality of data contained in the Data Warehouse
• Without proof, you will lose all credibility!
Audit Data Model
ETL_AUDIT
  etl_load_key: INTEGER NOT NULL
  academic_yr: CHAR(9)
  prev_etl_load_key: INTEGER
  most_rcnt_fy_ind: CHAR NOT NULL
  system_cd: VARCHAR(5) NOT NULL (FK)
  load_status_flg: VARCHAR(12)
  load_type_flg: CHAR
  stage_archvd_date: DATE
  wh_archvd_date: DATE
  stage_start_ts: TIMESTAMP
  warehouse_start_ts: TIMESTAMP
  num_rows_read: INTEGER
  fct_cleanup_ind: CHAR
  acad_yr_transt_ind: CHAR

ETL_Source_System
  system_cd: VARCHAR(5) NOT NULL
  system_name: VARCHAR(20)
  system_desc: VARCHAR(255)
  sys_req_file_cnt: INTEGER

ETL_AUDIT_TABLE_LOADS
  etl_load_key: INTEGER NOT NULL (FK)
  source_name: VARCHAR(80) NOT NULL
  num_rows_read: INTEGER
  num_records_reqd: INTEGER
  load_status_flg: VARCHAR(12)
  extract_num: INTEGER
  extract_ts: TIMESTAMP
  stop_source_row_id: INTEGER
  load_session_name: VARCHAR(80)
  load_start_ts: TIMESTAMP
  load_stop_ts: TIMESTAMP
Audit Your Loads - Methodology
Step: Staging Load
• At the beginning of this stage, the status code is set to ‘Stg-Loading’
• At completion, it is set to ‘Stg-Complete’ only if all the source extract files have been processed
• If any one of the source files is not processed, the status code is set to ‘Stg-Fail’
Audit Your Loads - Methodology
Step: Warehouse Load
• At the beginning of this stage, the status is set to ‘WH-Loading’
• At the end, if all the tables are properly loaded, it is set to ‘WH-Complete’
• If any of the expected tables did not load completely, the load status is set to ‘WH-Fail’
Audit Your Loads - Methodology
Step: Data Mart Load
• At the beginning of a Data Mart load, the Load_Status_Flag is set to ‘DM-Loading’
• When the Data Mart load is completed, the status flag is set to ‘DM-Loaded’
• If the Data Mart load does not succeed, the status is set to ‘DM-Fail’
• The load status of ‘DM-Completed’ is used only after the data is reviewed and the business metrics match what is expected from the source system
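The three lifecycles above follow one pattern: set a "Loading" status on entry, then "Complete"/"Loaded" or "Fail" on exit, with a final sign-off for the Data Mart. A minimal Python sketch; only the status strings come from the presentation, the class and method names are illustrative:

```python
# Illustrative sketch of the load_status_flg lifecycle; the status
# strings are from the methodology slides, everything else is assumed.
class LoadAudit:
    STATUS = {
        "stage":     ("Stg-Loading", "Stg-Complete", "Stg-Fail"),
        "warehouse": ("WH-Loading", "WH-Complete", "WH-Fail"),
        "datamart":  ("DM-Loading", "DM-Loaded", "DM-Fail"),
    }

    def __init__(self):
        self.load_status_flg = None

    def begin(self, layer):
        # entering a layer always sets the 'Loading' status
        self.load_status_flg = self.STATUS[layer][0]

    def finish(self, layer, success):
        # 'Complete'/'Loaded' only if every expected source/table loaded
        _, complete, fail = self.STATUS[layer]
        self.load_status_flg = complete if success else fail

    def sign_off(self):
        # 'DM-Completed' only after business metrics are reviewed
        # against the source system
        if self.load_status_flg == "DM-Loaded":
            self.load_status_flg = "DM-Completed"

audit = LoadAudit()
audit.begin("stage")                 # 'Stg-Loading'
audit.finish("stage", success=True)  # 'Stg-Complete'
```

In the actual implementation these transitions would be writes to the ETL_AUDIT row for the current etl_load_key.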
Audit Your Loads – Informatica Mapping
Tip #6 – Track Data Errors
Track Data Errors
• Use Informatica to conduct Data Validity, IsNull, IsDate, IsNumber and other pre-defined error checks
• Errors are logged by calling the INSERT ERROR RECS stored procedure from Informatica
• Invalid values can be either skipped or passed through with default values
• Hold or point to the original source record and be able to recreate it
• Best practices exist, but design is key
Error Correction Model
[Error Correction Model diagram: Source → Stage Load Process → Target; where an error exists, records are routed for reload]
Error Correction Data Model
Error_type
  error_type_cd: VARCHAR(2) NOT NULL
  error_type_desc: VARCHAR(255)
  last_update_ts: TIMESTAMP NOT NULL
  record_expiry_ts: TIMESTAMP

Severity_Level
  severity_cd: VARCHAR(3) NOT NULL
  severity_desc: VARCHAR(255)
  last_update_ts: TIMESTAMP NOT NULL
  record_expiry_ts: TIMESTAMP

ETL_ERROR
  etl_load_key: INTEGER NOT NULL (FK)
  sys_load_col_name: VARCHAR(30) NOT NULL
  source_name: VARCHAR(80) NOT NULL (FK)
  error_type_cd: VARCHAR(2) NOT NULL (FK)
  source_row_id: INTEGER
  severity_cd: VARCHAR(3) NOT NULL (FK)
General Rules for Non-Lookup Errors

Error Type Code – Error Type Description
MN – Missing or Null
DL – Data Length
DD – Is Delete
ID – Is Valid Date
DM – Datatype Mismatch
IR – Inconsistent Record
Error Severity

Severity Code – Severity Description
NF – Non Fatal
FTL – Fatal Error
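The pre-defined checks and the codes above can be sketched as a single validation pass that emits ETL_ERROR-style rows. The error type codes (MN, ID, DM) follow the tables above; the function name, date format and severity assignments are illustrative assumptions:

```python
from datetime import datetime

# Hypothetical sketch of IsNull / IsDate / IsNumber checks feeding
# ETL_ERROR-style rows; codes MN/ID/DM and NF/FTL are from the slides,
# the rest is assumed for illustration.
def check_record(row_id, source_name, record, date_cols=(), num_cols=()):
    errors = []

    def log(col, error_type_cd, severity_cd):
        errors.append({"source_row_id": row_id,
                       "source_name": source_name,
                       "sys_load_col_name": col,
                       "error_type_cd": error_type_cd,
                       "severity_cd": severity_cd})

    for col, value in record.items():
        if value is None or value == "":
            log(col, "MN", "NF")                  # Missing or Null
        elif col in date_cols:
            try:
                datetime.strptime(value, "%d-%b-%Y")
            except ValueError:
                log(col, "ID", "FTL")             # Is Valid Date
        elif col in num_cols:
            try:
                float(value)
            except ValueError:
                log(col, "DM", "FTL")             # Datatype Mismatch
    return errors

errs = check_record(7, "SCHOOL", {"name": "", "opened": "31-Feb-2002"},
                    date_cols=("opened",))
```

Each emitted row carries the load key context (etl_load_key, source_row_id), which is what lets you point back to and recreate the original source record.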
Error Correction Checks
Pulling Audit & Error Correction Together
[Combined ER diagram: ETL_AUDIT, ETL_Source_System and ETL_AUDIT_TABLE_LOADS joined to Error_type, Severity_Level and ETL_ERROR; attributes as shown in the Audit Data Model and Error Correction Data Model slides above]
Tip #5 – Bless The Router!
Bless the Router!
• New feature introduced in PowerCenter 5
• Similar to a Filter, since both allow the developer to use a condition to test data
• Big Difference – a Router allows you to test multiple conditions!
• Use the Router instead of multiple Filter transformations
• Big Advantage – the data is only read once!
• Considerable performance gains
• Crucial in dealing with both Type 1 and Type 2 dimensions
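The "read once" advantage can be shown in plain Python: stacked Filters each rescan the input, while a Router evaluates all group conditions in a single pass, with a default group for rows that match nothing. Group names here are illustrative:

```python
# Plain-Python sketch of Router-style grouping: one pass over the data,
# every group condition tested per row, unmatched rows -> DEFAULT.
# Group names are hypothetical.
def route(rows, groups):
    """groups: ordered mapping of group name -> predicate; a row is
    written to every group whose condition it satisfies."""
    out = {name: [] for name in groups}
    out["DEFAULT"] = []
    for row in rows:                       # the data is read only once
        matched = False
        for name, predicate in groups.items():
            if predicate(row):
                out[name].append(row)
                matched = True
        if not matched:
            out["DEFAULT"].append(row)
    return out

routed = route([{"flag": "N"}, {"flag": "M"}, {"flag": "?"}],
               {"INSERTS": lambda r: r["flag"] == "N",
                "UPDATES": lambda r: r["flag"] == "M"})
```

With n conditions, n Filters cost n scans of the source; the Router costs one, which is where the performance gain comes from.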
Router At Work
Groups in a Router
Tip #4 – Lookup Gotchas
Lookup Gotchas
• Use a Dynamic Lookup if conducting a lookup on the target and you want the lookup to be synchronized with the target. MUST BE CONNECTED!
• You cannot explicitly set the ORDER BY clause in the SQL Override. You can trick Informatica by ordering ports in the desired ORDER BY sequence
• Make sure the datatypes and precision of the ports being compared are the same, otherwise you might get undesired results
• If lookups are large, adjust the lookup data cache and lookup index cache sizes in the session properties to improve performance
• You can also take advantage of the persistent cache feature. This feature is valuable if you know the lookup table does not change between session runs
Tip #3 – Determination of Record Type in Staging Layer
Determination of Record Type in Staging Layer
• First, a quick refresher on dimensions:
  • Type 1 – No history
  • Type 2 – All history
  • Type 3 – Some history
More Dimension Types… Combinations
• Type 3 Prime – Types 1 and 2 (the most common)
• Type 4 – Types 1 and 3
• Type 5 – Types 2 & 3
• Type 6 – Types 1, 2, and 3 (the second most common)
Type 1 – No History
Source Transaction #1:
  Id: 1 | Name: Sandy Rubble | Address: 23 Boulder Rd | City: Bedrock | Salutation: Ms.
Warehouse Transaction #1:
  Key: 100 | Id: 1 | Name: Sandy Rubble | Address: 23 Boulder Rd | City: Bedrock | Salutation: Ms. | Date: 01-Jan-2002
Type 1 – No History (continued)
Source Transaction #1:
  Id: 1 | Name: Sandy Rubble | Address: 23 Boulder Rd | City: Bedrock | Salutation: Ms.
Source Transaction #2:
  Id: 1 | Name: Sandy Rubble | Address: 42 Slate Ave | City: GravelPit | Salutation: Mrs.
Warehouse Transaction #1:
  Key: 100 | Id: 1 | Name: Sandy Rubble | Address: 42 Slate Ave | City: GravelPit | Salutation: Mrs. | Date: 01-Jan-2002
Type 2 – All History
Source Transaction #1:
  Id: 1 | Name: Sandy Rubble | Address: 23 Boulder Rd | City: Bedrock | Salutation: Ms.
Warehouse Transaction #1:
  Key: 100 | Id: 1 | Name: Sandy Rubble | Address: 23 Boulder Rd | City: Bedrock | Salutation: Ms. | Date: 01-Jan-2002
Source Transaction #2:
  Id: 1 | Name: Sandy Rubble | Address: 42 Slate Ave | City: GravelPit | Salutation: Mrs.
Warehouse Transaction #2:
  Id: 1 | Name: Sandy Rubble | Address: 42 Slate Ave | City: GravelPit | Salutation: Mrs. | Date: 15-Nov-2002
Type 3 – Some History
Source Transaction #1:
  Id: 1 | Name: Sandy Rubble | Address: 23 Boulder Rd | City: Bedrock | Salutation: Ms.
Warehouse Transaction #1:
  Key: 100 | Id: 1 | Name: Sandy Rubble | Address: 23 Boulder Rd | City: Bedrock | Salutation: Ms. | Original Salutation: Ms. | Date: 01-Jan-2002
Type 3 – Some History (continued)
Source Transaction #1:
  Id: 1 | Name: Sandy Rubble | Address: 23 Boulder Rd | City: Bedrock | Salutation: Ms.
Source Transaction #2:
  Id: 1 | Name: Sandy Rubble | Address: 42 Slate Ave | City: GravelPit | Salutation: Mrs.
Warehouse Transaction #1:
  Key: 100 | Id: 1 | Name: Sandy Rubble | Address: 42 Slate Ave | City: GravelPit | Salutation: Mrs. | Original Salutation: Ms. | Date: 15-Nov-2002
Methodology
• As each record from the source file is processed into the staging area, a record type indicator is added to identify how the staging record should later be processed (e.g. as an insert, delete or update)
• This indicator is set based on a comparison to the previous successful data load for that table
• The flag then dictates what path the record will take when loaded into the Warehouse Layer
Record Type Identification

Record Type Identifier – Description
N – Record is new, and will be treated as an insert
L – Record contains a trigger field update, and should be treated as a Type 2 update
M – Record contains a non-trigger field update, and should be treated as a Type 1 update
X – Record is unchanged. Only the ETL load key and fields should be updated in the target table
D – Record should be expired in the warehouse by performing a Type 1 update to the previous instance of the record and populating the expiry_ts column with the current date/time. (Fatal error for systems in which deletions are prohibited.)
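The flag assignment reduces to a comparison against the previous successful load. A minimal sketch; the flags come from the table above, while the trigger-field set and function name are illustrative assumptions:

```python
# Hypothetical sketch of staging-layer record type determination.
# Flags N/X/L/M follow the Record Type Identification table; the
# trigger-field list is assumed for illustration.
TRIGGER_FIELDS = {"school_name"}          # a change here forces a Type 2 row

def record_type(current, previous):
    """Return N, X, L or M for a staged record. D (deletion) is detected
    separately, for keys present in the previous load but absent now."""
    if previous is None:
        return "N"                        # new record -> insert
    changed = {k for k in current if current[k] != previous.get(k)}
    if not changed:
        return "X"                        # unchanged -> refresh load key only
    if changed & TRIGGER_FIELDS:
        return "L"                        # trigger field -> Type 2 update
    return "M"                            # non-trigger field -> Type 1 update

prev = {"school_name": "John Elway HS", "semester_cd": "1"}
curr = {"school_name": "Terrell Davis HS", "semester_cd": "1"}
flag = record_type(curr, prev)
```

Doing this comparison once in staging is what lets the warehouse-layer mappings simply route on the flag instead of re-deriving the change type.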
Record Type Identification – An Example

Stage Key | Academic Yr | School Number | School Name      | Semester Code | Record Type Flag
1         | 1995-1996   | B0099         | John Elway HS    | 1             | N
2         | 1996-1997   | B0099         | Terrell Davis HS | 1             | L
3         | 1997-1998   | B0099         | Terrell Davis HS | 2             | M
4         | 1998-1999   | B0099         | Terrell Davis HS | 2             | X
Early Detection - Advantages
• Reduces complexity in Warehouse Layer mappings
• Shifts the focus of Warehouse Layer mappings to error checking and error handling
• Improved performance of the Warehouse load
Tip #2 – Use Parameter Files
Using Parameter Files
• Parameter Files
  • A mapping parameter represents a constant value that you can define before running a session
  • A mapping parameter retains the same value throughout the entire session
  • In a parameter file for the session, one defines the value of the parameter
  • During the session, the Informatica Server evaluates all references to the parameter to that value
Parameter Files Syntax
• Use the following format to define parameters and variables in a session. The folder name is optional:

[(folder_name.)session_name]
parameter_name=value
parameter2_name=value

• An example:

[s_m_DM_BOARD]
$DBConnection_DW=ESDWP_DW
$DBConnection_DM=ESDWP_DM
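As an aside, the [session_name] / name=value layout above happens to be readable with Python's standard configparser, which makes it easy to script checks on parameter files before promotion. This is only an illustration; it is not how the Informatica Server parses the file:

```python
import configparser

# Illustrative only: validate a PowerCenter-style parameter file with
# Python's configparser. The file content is the example from the slide.
PARAM_FILE = """\
[s_m_DM_BOARD]
$DBConnection_DW=ESDWP_DW
$DBConnection_DM=ESDWP_DM
"""

parser = configparser.ConfigParser()
parser.optionxform = str          # preserve the case of the '$...' names
parser.read_string(PARAM_FILE)

dw_conn = parser["s_m_DM_BOARD"]["$DBConnection_DW"]
```

A small script like this can verify that every session in a batch has the expected connection parameters defined before the file is deployed to a new environment.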
How do you call a Parameter File?
• Can be specified at the session level
How do you call a Parameter File?
• Can also be specified at the batch level
Why Use a Parameter File?
BENEFITS:
• Portable across environments (Dev, SIT, UAT, Prod)
• Simplifies and automates the code promotion process
• Removes the manual step of updating the database connection(s) in a session
MAIN DRAWBACK:
• The file needs to be modified if new mappings are created and become part of the load process
Tip #1 – Creation of Common Library of Components
Creation of Common Library of Components
• Components include: Sources, Targets, Reusable Transformations (Mapplets, Lookups), Variables, Parameter Files, Database Connections
Advantages:
• Reduces redundancy
• Increases standardization/common structure in mappings
• Consistency among mappings
• Reduces the chance of mapping errors due to “designer license”
Summary
Summary – Always remember…
• PLAN PLAN PLAN!
  • Short term pain – long term gain
  • Promote standardization and structure
  • The net effect is more consistent ETL mappings and a more robust ETL load process
• Credibility is everything
  • Remember to audit the load process
  • Have a strong error detection and correction methodology
• Leverage resources
  • Methodology, DevNet, etc.
Thank You
Sean Desmond, [email protected]
Vijay Viswanathan, [email protected]
Questions & Answers