Informatica Tips and Tricks Every ETL Developer Should Know

Post on 28-Oct-2015

436 views 12 download

Tags:

description

NA

transcript

Tips & Tricks Every ETL Developer Should Know

Sean Desmond, InformaticaVijay Viswanathan, Cognicase

Agenda

� Objectives

� Who Are The Presenters

� Top Ten ETL Tips & Tricks - Overview

� The Ministry Project – Overview

� Top Ten ETL Tips & Tricks – “The Meat and Potatoes”

� Summary

� Questions & Answers

Introduction

Objectives

� By the end of this session you should…� Understand how a warehouse design is tightly

integrated into the PowerCenter Architecture� Receive several mapping ‘tips & tricks’ garnered from a

successful implementation� Weigh the pros and cons of applying these ‘tips &

tricks’ to one of your own project solutions

The Presenters

� Sean Desmond� Regional Manager, Informatica Professional Services

New England/E. Canada� 6 years in Data Warehousing, Metadata Management,

Project Delivery

� Vijay Viswanathan� BIDW Consultant, Cognicase (Toronto)� Specializes in data warehousing / ETL design� Over 5 years in Data Warehousing

Top Ten ETL Tips & Tricks

Top Ten ETL Tips & Tricks

10. Use Velocity

9. Dedicate Time to Infrastructure and Standards Prior to Development (Baseline Architecture Deployment)

8. Reduce Reliance on Stored Procedures

7. Audit Your Loads

6. Track Data Errors

Top Ten ETL “Tips & Tricks”

5. Bless the ROUTER!

4. Be careful of “Lookup Gotchas”!

3. Determine the Record Type in Staging

2. Use Parameter Files

1. Create a Common Library of Sources, Targets and Transformations

The Ministry Project

Ministry Project - Overview

� Combined data about students, course marks, schools, teachers, funding, socio-economic demographics, standardized testing results

� Tools Used:� ErWin v4.0� Informatica PowerCenter v5.1� Cognos PowerPlay Web v7.0, Cognos IWR v7.0� DB2 v7.2

� 3 target areas� Stage (based on source file layouts)� Data Warehouse (mainly normalized)� Data Marts (dimensional)

Project Architecture

Datamart

Staging Area

ALL INFORMATION DEPERSONALIZED

NO USER ACCESS

Transformation• Computation of

referenceelements

• Data aggregationwhere necessary

ESDW Architecture

Operational Databases

• Removal of personal identifying information

• Generation of unique record identifier

Extract, Cleanse&

LoadCentral Store

Metadata

Index

Data

Extract,Transform

& Load

UserData Mart

(Access via BITools etc)

Business Intelligence&

Reporting Tools

Load

Extract,Transform &

Load

DataAtomisation

Central Store

Source information&

Metadata (Updates)

Extract Cleansing

Depersonalising

Top Ten ETL Tips & Tricks“The Meat & Potatoes”

Tip #10 - Dedicate Time to Infrastructure and Standards

Prior to Development

Dedicate Time to Infrastructure and Standards Prior to Development

� Should take place at least 2-3 weeks BEFORE beginning any development (BLA)� Development Standards� Folder Architecture� Security Measures� Naming Convention Standards� Metadata Documentation Standards� Lifecycle Strategy� Shared Objects Strategy

Tip #9 – Use Velocity

Use Velocity

� A methodology for the development of analytic solutions based on Informatica platform products, InformaticaPowerCenter® and Informatica PowerMart®.

� Applications & PowerAnalyzer being incorporated

� Informatica Velocity covers each of the major phases of analytic solution development efforts, including Manage, Architect, Design, Build, Deploy, and Operate.

� ‘Project Roadmap’

� Best Practices – Mapping Design, P&T, Migration

� Sample Deliverables – Mapping Inventory, Mapping Spec, System Test Plan

Tip #8 – Reduce Reliance on Stored Procedures

Reduce Reliance on Stored Procedures

� Stored Procedures are a big performance hit!

� Try to avoid external calls to stored procedures unless it is a necessity

� For surrogate key generation use native Informatica sequence generator or the IDENTITY datatype.

� Personally saw performance of a mapping increase from 5 rows/sec to over 500 rows/sec once stored proc was replaced!

Tip #7 – Audit Your Loads

Audit Your Loads

� A key area that is quite often ignored

� You must match to the source systems or be able to explain the differences

� Auditing data loads (when did we start a load and what is the status?)

� Audit information is provided to increase the end-user’s confidence in the quality of data contained in the Data Warehouse

� Without proof, you will lose all credibility!

Audit Data Model

ETL_AUDITetl_load_key: INTEGER NOT NULL

academic_yr: CHAR(9)prev_etl_load_key: INTEGERmost_rcnt_fy_ind: CHAR NOT NULLsystem_cd: VARCHAR(5) NOT NULL (FK)load_status_flg: VARCHAR(12)load_type_flg: CHARstage_archvd_date: DATEwh_archvd_date: DATEstage_start_ts: TIMESTAMPwarehouse_start_ts: TIMESTAMPnum_rows_read: INTEGERfct_cleanup_ind: CHARacad_yr_transt_ind: CHAR

ETL_Source_Systemsystem_cd: VARCHAR(5) NOT NULL

system_name: VARCHAR(20)system_desc: VARCHAR(255)sys_req_file_cnt: INTEGER

ETL_AUDIT_TABLE_LOADSetl_load_key: INTEGER NOT NULL (FK)source_name: VARCHAR(80) NOT NULL

num_rows_read: INTEGERnum_records_reqd: INTEGERload_status_flg: VARCHAR(12)extract_num: INTEGERextract_ts: TIMESTAMPstop_source_row_id: INTEGERload_session_name: VARCHAR(80)load_start_ts: TIMESTAMPload_stop_ts: TIMESTAMP

Audit Your Loads - Methodology

At the beginning of this stage the status code is set to ‘Stg-Loading’

At completion, it is set to ‘Stg-Complete’ only if all the source extract files have been processed.

If any one of the source files is not processed, the status code is set to ‘Stg-Fail’.

Staging LoadLoad Status CodesStep

Audit Your Loads - Methodology

At the beginning of this stage, the status is set to ‘WH-Loading’

At the end, if all the tables are properly loaded, it is set to ‘WH-Complete’

If any of the expected tables did not load completely, the load status is set to ‘WH-Fail’

Warehouse LoadLoad Status CodesStep

Audit Your Loads - Methodology

At the beginning of a Data Mart load, the Load_Status_Flag is set to ‘DM-Loading’

When the Data Mart load is completed, the status flag is set as ‘DM-Loaded’

If the Data Mart load does not succeed, the status will be set to ‘DM-Fail’

The load status of ‘DM-Completed’ is used only after the data is reviewed and the business metrics match what is expected from the source system

Data Mart LoadLoad Status CodesStep

Audit Your Loads – Informatica Mapping

Audit Your Loads – Informatica Mapping

Tip #6 – Track Data Errors

Track Data Errors

� Use Informatica to conduct Data Validity, IsNull, IsDate, IsNumber and other Pre Defined error checks

� Errors are logged by calling the INSERT ERROR RECS stored procedure from Informatica

� Invalid values can be either skipped or passed through with default values

� Hold or point to the original source record and be able to recreate it

� Best Practices exist, but design is key

Error Correction Model

Target

Reload

ErrorExists

Source StageLoadProcess

Error Correction Data Model

Error_typeerror_type_cd: VARCHAR(2) NOT NULL

error_type_desc: VARCHAR(255)last_update_ts: TIMESTAMP NOT NULLrecord_expiry_ts: TIMESTAMP

Severity_Levelseverity_cd: VARCHAR(3) NOT NULL

severity_desc: VARCHAR(255)last_update_ts: TIMESTAMP NOT NULLrecord_expiry_ts: TIMESTAMP ETL_ERROR

etl_load_key: INTEGER NOT NULL (FK)sys_load_col_name: VARCHAR(30) NOT NULLsource_name: VARCHAR(80) NOT NULL (FK)error_type_cd: VARCHAR(2) NOT NULL (FK)source_row_id: INTEGER

severity_cd: VARCHAR(3) NOT NULL (FK)

General Rules for Non-Lookup Errors

Missing or NullMN

Data LengthDL

Is DeleteDD

Is Valid DateID

Datatype MismatchDM

Inconsistent RecordIR

Error Type DescriptionError Type Code

Error Severity

Non Fatal NF

Fatal ErrorFTL

Severity DescriptionSeverity Code

Error Correction Checks

Pulling Audit & Error Correction Together

ETL_AUDITetl_load_key: INTEGER NOT NULL

academic_yr: CHAR(9)prev_etl_load_key: INTEGERmost_rcnt_fy_ind: CHAR NOT NULLsystem_cd: VARCHAR(5) NOT NULL (FK)load_status_flg: VARCHAR(12)load_type_flg: CHARstage_archvd_date: DATEwh_archvd_date: DATEstage_start_ts: TIMESTAMPwarehouse_start_ts: TIMESTAMPnum_rows_read: INTEGERfct_cleanup_ind: CHARacad_yr_transt_ind: CHAR

ETL_Source_Systemsystem_cd: VARCHAR(5) NOT NULL

system_name: VARCHAR(20)system_desc: VARCHAR(255)sys_req_file_cnt: INTEGER

ETL_AUDIT_TABLE_LOADSetl_load_key: INTEGER NOT NULL (FK)source_name: VARCHAR(80) NOT NULL

num_rows_read: INTEGERnum_records_reqd: INTEGERload_status_flg: VARCHAR(12)extract_num: INTEGERextract_ts: TIMESTAMPstop_source_row_id: INTEGERload_session_name: VARCHAR(80)load_start_ts: TIMESTAMPload_stop_ts: TIMESTAMP

Error_typeerror_type_cd: VARCHAR(2) NOT NULL

error_type_desc: VARCHAR(255)last_update_ts: TIMESTAMP NOT NULLrecord_expiry_ts: TIMESTAMP

Severity_Levelseverity_cd: VARCHAR(3) NOT NULL

severity_desc: VARCHAR(255)last_update_ts: TIMESTAMP NOT NULLrecord_expiry_ts: TIMESTAMP ETL_ERROR

etl_load_key: INTEGER NOT NULL (FK)sys_load_col_name: VARCHAR(30) NOT NULLsource_name: VARCHAR(80) NOT NULL (FK)error_type_cd: VARCHAR(2) NOT NULL (FK)source_row_id: INTEGER

severity_cd: VARCHAR(3) NOT NULL (FK)

Tip #5 – Bless The Router!

Bless the Router!

� New feature introduced in Powercenter 5

� Similar to a Filter since both allow the developer to use a condition to test data

� Big Difference – Router allows you to test multiple conditions!

� Use the Router instead of multiple Filter transformations

� Big Advantage – Only reads the data once!

� Considerable Performance Gains

� Crucial in dealing with both Type 1 Dimensions and Type 2 Dimensions

Router At Work

Groups in a Router

Tip #4 – Lookup Gotchas

Lookup Gotchas

� Use Dynamic Lookup if conducting a lookup on the target and want the lookup to be synchronized with target. MUST BE CONNECTED!

� Cannot explicitly set the ORDER BY clause in the SQL Overide. Can trick Informatica by ordering ports in desired ORDER BY sequence

� Make sure Datatypes and Precision of ports being compared are the same otherwise you might get undesired results

� If Lookups are large adjust the lookup data cache and lookup index cache size in the session properties to improve performance

� Can also take advantage of the persistent cache feature. This feature is valuable if you know the lookup table does not changebetween sessions runs

Tip #3 – Determination of Record Type in Staging Layer

Determination of Record Type in Staging Layer

� First a Quick Refresher on Dimensions:� Type 1 – No history� Type 2 – All history� Type 3 – Some history

More Dimension Types…Combinations

� Type 3 Prime – Types 1 and 2 (the most common)

� Type 4 – Types 1 and 3

� Type 5 – Types 2 & 3

� Type 6 – Types 1, 2, and 3 (the second most common)

Type 1 – No History

Source Transaction #1

Ms.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id 100Key

Warehouse Transaction #1

01-Jan-2002DateMs.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id

Type 1 – No HistorySource Transaction #1

Ms.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id

100KeyWarehouse Transaction #1

01-Jan-2002DateMrs.SalutationGravelPitCity42 Slate AveAddressSandy RubbleName1Id

Source Transaction #2

Mrs.SalutationGravelPitCity42 Slate AveAddressSandy RubbleName1Id

Type 2 – All HistorySource Transaction #1

Ms.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id

100KeyWarehouse Transaction #1

01-Jan-2002DateMs.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id

Source Transaction #2

Mrs.SalutationGravelPitCity42 Slate AveAddressSandy RubbleName1Id 100Key

Warehouse Transaction #2

15-Nov-2002DateMrs.SalutationGravelPitCity42 Slate AveAddressSandy RubbleName1Id

Type 3 – Some History

Source Transaction #1

Ms.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id

Ms.Original Salutation

100KeyWarehouse Transaction #1

01-Jan-2002DateMs.Salutation

BedrockCity23 Boulder RdAddressSandy RubbleName1Id

Type 3 – Some HistorySource Transaction #1

Ms.SalutationBedrockCity23 Boulder RdAddressSandy RubbleName1Id

Ms.Original Salutation

100KeyWarehouse Transaction #1

15-Nov-2002DateMrs.Salutation

GravelPitCity42 Slate AveAddressSandy RubbleName1Id

Source Transaction #2

Mrs.SalutationGravelPitCity42 Slate AveAddressSandy RubbleName1Id

Methodology

� As each record from the source file is processed into the staging area, a record type indicator is added to identify how the staging record should later be processed (e.g. as an insert, delete or update)

� This indicator is set based on a comparison to the previous successful data load for that table

� The flag then dictates what path the record will take when loaded into the Warehouse Layer

Record Type Identification

Record should be expired in the warehouse by performing a type 1 update to the previous instance of the record and populating the expiry_ts column with the current date/time. (Fatal error for systems in which deletions are prohibited.)

D

Record contains a trigger field update, and should be treated as a type II update

L

Record contains a non-trigger field update, and should be treated as a type I update.

M

Record is New, and will be treated as an insert.N

Record is unchanged. Only the ETL load key and fields should be updated in the target table.

X

DescriptionRecord Type Identifier

Record Type Identification – An Example

M2Terrell Davis HS

B00991997-19983

X2Terrell Davis HS

B00991998-19994

1

1

SemesterCode

Terrell Davis HS

John Elway HS

School Name

NB00991995-19961

LB00991996-19972

Record Type Flag

School Number

Academic Yr

Stage Key

Early Detection - Advantages

� Reduces Complexity in Warehouse Layer Mappings

� Shift focus of Warehouse Layer Mappings to Error Checking and Error Handling

� Improved Performance of Warehouse Load

Tip #2 – Use Parameter Files

Using Parameter Files

� Parameter Files� A mapping parameter represents a constant value that

you can define before running a session� A mapping parameter retains the same value

throughout the entire session� In a parameter file for the session, one defines the

value of the parameter� During the session, the Informatica Server evaluates

all references to the parameter to that value

Parameter Files Syntax

� Use the following format to define parameters and variables in a session. The folder name is optional:

[(folder_name.)session_name]parameter_name=valueparameter2_name=value

� An Example:[s_m_DM_BOARD]$DBConnection_DW=ESDWP_DW$DBConnection_DM=ESDWP_DM

How do you call a Parameter File?

� Can be specified at the session level

How do you call a Parameter File?

� Also can be specified at the the batch level

Why Use a Parameter File?

BENEFITS:

� Portable Across Environments (Dev, SIT, UAT, Prod)

� Simplifies and automates the Code Promotion Process

� Removes the manual step of updating the database connection(s) in a session

MAIN DRAWBACK:

� File needs to be modified if new mappings are created and are part of the load process

Tip #1 – Creation of Common Library of Components

Creation of Common Library of Components

� Components include: Sources, Targets, Reusable Transformations (Mapplets, Lookups), Variables, Parameter Files, Database Connections

Advantages:

� Reduces redundancy

� Increases standardization/common structure in mappings

� Consistency among mappings

� Reduces chance of mapping errors due to “designer license”

Summary

Summary – Always remember…

� PLAN PLAN PLAN! � Short Term Pain – Long Term Gain� Promote Standardization and Structure� Net Net Effect is more consistent ETL Mappings and a

more robust ETL Load Process

� Credibility is Everything� Remember to audit the load process� Have a strong error detection and correction

methodology

� Leverage Resources� Methodology, Devnet, etc.

Thank YouSean Desmond, sdesmond@informatica.com

Vijay Viswanathan, vijay.viswanathan@cognicase.com

A N S W E R SA N S W E R SQ U E S T I O N SQ U E S T I O N S