1
Using Informatica Data Explorer 5
Informatica Corporation, 2005-2006. All rights reserved.
Education Services
Version IDE-25102006
2
Agenda
• Overview of Informatica Data Explorer
• Importing Metadata and Accessing Source Data
• Column Profiling
• Data Rules
• Single Table Structural Analysis
• Cross Table Profiling
• Validating Table and Cross Table Analysis
• Normalization
• Repository
• Using the Repository Navigator
• Using Repository Reports
• Integration with PowerCenter
3
Introduction
4
Introduction Objectives
• Identify the components of the Informatica Data Explorer product suite
• Describe the Informatica Data Explorer process flow
Informatica Data Explorer Product Suite
[Diagram: the IDE Client (Windows XP, 2000) and IDE Server (UNIX or Windows) work against an IDE Project and IDE Repository, browsed through the Repository Navigator. Importers feed the Project: IDE Import Relational for UDB, Informix, Sybase, Oracle, and MS SQL Server (via ODBC); IDE Import for Flat Files and DDL; IDE Import IMS and IDE Import VSAM for mainframe sources (VSAM, IMS, Sequential, PDS, OS/390 DB2 Unload) driven by command files and JCL. The IDE Source Profiler handles COBOL programs, and IDE FTM / XML produces DDL, XML, and DTDs.]
6
Product Architecture
[Diagram: Sources (RDBMS/ODBC, Flat File, Mainframe*) feed Data Tables 1..n through connectors. IDE Profiling performs Content Profiling, Single Table Profiling, and Cross Table Profiling; IDE Design produces a Consolidated Schema, a Fixed Target Mapping, and a Source Data Knowledge Base covering Structure, Content, and Quality. Results are browsed with the Repository Navigator and ported to Informatica CWM.]
* The IMS and VSAM importers actually use a GUI (Source Profiler) to read a Copybook and generate a program to extract mainframe data into Flat Files for use by IDE
7
Technical Diagram
• IDE Server performs the actual profiling
• IDE Server platforms: Windows (2000/2003/XP), Sun Solaris (7, 8, 9), HP-UX (11 or later), AIX (4.3, 5L)
• Relational Importers for: IBM DB2 UDB 5.2, 6.1, 7.1, 7.2, 8.1; Informix 7.24, 7.31, 9.1, 9.2, 9.3; Oracle 7.3, 8, 8i, 9i; Sybase 10, 11, 12; ODBC (SQL Server, etc.)
• Flat File Importer for: Fixed Length, Delimited, DB2 Unload
• IDE Client, Repository Navigator, and FTM/XML run on Windows 2000/XP workstations; the client connects to the server via TCP/IP and works with Project files, data/header files, and XML format files (Ports)
• Profiling results are stored initially on the IDE Server; completed profiling results move to the Repository
• Repository DBMS and server platform: IBM DB2 UDB 7.2, 8.1; Informix 7.31, 9.2, 9.3; Microsoft SQL Server 7 and 2000; Oracle 8i, 9i; Sybase 12 and 12.5
• The Repository does not need to be on the same server as IDE; connectivity is via ODBC/JDBC (ODBC driver at API 3.x conformance level 2)
IDE Process Flow
[Diagram: source Data (Relational, Flat Files, VSAM, IMS, ODBC) and Documented Metadata feed IDE Data Prep / Import, then IDE Data Profiling, then IDE Schema Development. Specifications for Data Extraction, Cleansing, and Transformation drive FTM / XML Metadata Mapping toward a Target Design, delivered via DB Load, Messaging, a Target DB, or a Message. The IDE Repository and Navigator provide Metadata Management across the flow.]
9
Introduction Review
• The Informatica Data Explorer product line:
• Informatica Data Explorer
• Importer for Flat Files
• Import for Relational Databases
• Import for VSAM
• Import for IMS
• DDL Generators
• Source Profiler
• FTM
• Repository Navigator
• Repository
10
Introduction Review (cont.)
• The IDE Process Flow consists of five major processes:
• Data Preparation and Import
• Data Profiling
• Schema Validation and Development
• Metadata Development
• Metadata Management
11
Lesson 1
Importing Metadata and Accessing Source Data
12
Lesson 1 Objectives
• Explain what an Informatica Data Explorer Project is, and how it is used
• Create and setup Informatica Data Explorer Projects
• Define the term “metadata” as used by Informatica Software
• Explain the importance of metadata in Data Profiling using Informatica Data Explorer
13
Lesson 1 Objectives (cont.)
• Explain what source data are, and the ways in which they may be imported into Informatica Data Explorer
• Explain what the Informatica Data Explorer Flat File Importer does
• Describe the format of an Informatica Data Explorer Flat File, including the minimum requirements for Informatica Data Explorer to use it to access source data
14
Case Study Description
• The Customer Order system is a mainframe application accessed through a CICS user interface
• It was developed 10 years ago
• The Employee Identification system is an Oracle database created 2 years ago
• Business users are sure they know the data
• Senior executives suspect the quality of the data is bad
[Diagram: the Informatica Data Explorer product suite architecture, repeated from the Introduction]
Informatica Data Explorer Project
16
Informatica Data Explorer Projects
• The persistent data store used by IDE
• A Project is a UNIX or Windows NT container (directory, folder etc.)
• Projects contain:• Metadata
• Data
• Profiling as well as Mapping information
• Projects are opened and closed by the IDE Server
17
What is Metadata?
• Informatica Data Explorer defines metadata as:• Data that describes data
• Information about the characteristics of source data
• In Informatica Data Explorer, metadata is information that will create: • Schemas
• Tables
• Columns
• Other objects
18
Why Import Metadata?
• Must be imported into an Informatica Data Explorer Project before any subsequent tasks or activities can be started
• Informatica Data Explorer needs to know the names of the Columns in order to store Data Profiling results
• Informatica Data Explorer needs to know how to interpret the source data (Fixed vs. Delimited)
• Provides basis for automated quality assessments in data profiling
IDE Data Sources
[Diagram: the product suite architecture, repeated from the Introduction, highlighting the data sources — relational databases (UDB, Informix, Sybase, Oracle, MS SQL Server via ODBC), flat files, DDL, and mainframe sources (VSAM, IMS, Sequential, PDS, DB2 Unload) via OS/390 command files and JCL]
20
IDE Flat Files
• Consist of two components
• Header File• Contains metadata describing contents of a data file
• Data file
• Data in delimited or fixed column format, as well as DB2 Load format
21
IDE Flat Files (cont.)
• Header and Data files may be • Separate files or
• Combined into one file
• A header file should not contain duplicate column names (IDE will automatically rename them)
• IDE Flat Files may not contain Arrays (repeating groups or occurs)
Informatica Data Explorer Flat File Components
Header File:
header:
file=empinfo.dat
attribute:EMPID
data_type=INTEGER
null_rule=NOT NULL
min_value=1000
max_value=9999
attribute:LAST_NAME
data_type=CHAR(20)
null_rule=NOT NULL
attribute:FIRST_NAME
data_type=CHAR(20)
null_rule=NOT NULL
attribute:GENDER
data_type=CHAR(1)
null_rule=NOT NULL
attribute:DEPTID
data_type=CHAR(4)
null_rule=NOT NULL
min_value=100

Data File:
149,Francis,Lynn,3,200,MIS,Database Administrator,"120 Co
249,Venkatachalam,Nagarajan,3,200,MIS,Project Leader,"300
289,Kim,Suk,3,200,MIS,Staff Consultant,"4040 N Fairfax Dr
216,Masood,Airaj,,200,MIS,MIS Analyst,"300 N Wakefield Dr
134,Swenson,Allison,F,200,MIS,Database Administrator,"900
164,Park,Allison,F,200,MIS,Database Analyst,"PO BOX 1471"
323,Blaskiewicz,Allison,F,200,MIS,Technical Specialist,"3
255,Barbles,Amy,F,100,Sales,Sales Executive,"4019 Rice Bl
273,Karneh,Anna,1,200,MIS,Sr Prog Analyst,"12601 Fair Lak

The header file example shows some of the documented information that can be loaded into Informatica Data Explorer. The data file example shows the associated comma-delimited file to which this header file refers.
23
Header and Data Files
• Header and data can be in one file
• We recommend creating two separate files when they are built manually
• The more information that is provided in the header file, the more automatic comparisons Informatica Data Explorer can make
24
Login
25
Open Project
26
Import Metadata
27
Lab Exercises 1.1–1.6
28
Lesson 1 Review
• An Informatica Data Explorer Project is:
• The persistent data store used by Informatica Data Explorer
• Used to organize and partition the work effort
• A structure that contains:
• Metadata
• Data
• Profiling and Mapping information
• Metadata describes the data source and is used by Informatica Data Explorer to access the source data
29
Lesson 1 Review (cont.)
• Informatica Data Explorer can import data from:
• Relational Databases
• Oracle 7.3, 8, 8i, 9i or 10g
• Informix 7, 9.1, or 9.2
• Sybase 10, 11, 12 or 12.5
• IBM DB2 UDB 5.2, 6.1, 7.1 or 7.2
• Microsoft SQL Server 7 and 2000 (using an ODBC driver)
• Flat Files
• Delimited Format
• Fixed Length Format
• DB2 Load Format
30
Lesson 1 Review (cont.)
• Informatica Data Explorer Flat Files must be:• ASCII or EBCDIC character format (no binary data)
• Binary data is supported via the DB2 Load Utility format
• Informatica Data Explorer Flat File may not contain:• Arrays (repeating groups or occurs)
• Duplicate column names
31
Lesson 1 Review (cont.)
• Informatica Data Explorer Flat Files must have a header file along with the data file
• Additional information on data preparation is available in the Using Informatica Data Explorer Source Profiler course and the documentation
32
Lesson 2
Column Profiling
33
Lesson 2 Objectives
• Explain what Column Profiling is, and why it should be performed.
• Execute the Column Profile function of Informatica Data Explorer.
• Navigate and review the results of Column Profiling.
• Explain informational Tags.
• Describe when and how to apply informational Tags to Informatica Data Explorer objects.
34
What is Column Profiling?
• A process of discovering physical characteristics of each column in a file
• Comparing documented Metadata against Metadata inferred from the data source
• Column Profiling is done against data in the form of:
• ASCII flat files
• DB2 Load Utility files
• RDBMS tables
35
Why Profile Columns?
• Not all database metadata and documentation are accurate pictures of the data source
• Documented descriptions of data elements may be inconsistent with the way the element is actually used
• Informatica Data Explorer Column Profiling builds a description of a column (its metadata) based on the data it contains
36
Column Lists
• The results of Column Profiling are stored with the Columns in a Table
• Column List viewers can be opened from the Navigation Tree
• Column List viewers provide information about Documented and Inferred Metadata
• Documented Metadata are supplied from the header file or source table
• Inferred Metadata are those that Informatica Data Explorer determined from examining the data
37
Column Profiling
38
Column Viewer
39
Lab Exercises 2.1–2.4
40
Drill Down
• Allows you to perform ad hoc drill downs through data presented in the Informatica Data Explorer viewers.
• Used to interrogate any data sources that can be accessed via an ODBC connection or Informatica Data Explorer Importer.
• Searches are issued against the selected data, and rows are returned for the specified search.
Drill Downs
41
Column Details
• Lists of properties about a Column that have been inferred by Informatica Data Explorer
• Columns can have several potential sets of characteristics
• The potential sets of characteristics are dependent on the physical view that is chosen
42
Drill Down
43
Drill Down Results
44
Lab Exercises 2.5–2.7
45
Column Value Pairs
• Informatica Data Explorer will store per Column:
• Up to 16,000 distinct values
• These are the most frequently occurring values from the set of all values observed during the Column Profile execution
• The frequency with which each value was observed
• Informatica Data Explorer will calculate:
• % Distribution for each distinct value, based on the frequency divided by the total rows profiled
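The % Distribution arithmetic above can be sketched in a few lines of Python (a hypothetical illustration, not IDE's implementation): count each distinct value, keep the most frequent ones, and report each value's frequency divided by the total rows profiled.

```python
from collections import Counter

def value_pairs(column_values, max_distinct=16_000):
    """Return (value, frequency, pct_of_rows) for the most frequent
    distinct values, mimicking a Column Value Pair list."""
    total = len(column_values)
    counts = Counter(column_values)
    top = counts.most_common(max_distinct)  # most frequent values first
    return [(value, freq, 100.0 * freq / total) for value, freq in top]

# Made-up DEPT values for illustration
pairs = value_pairs(["MIS", "MIS", "Sales", "MIS", "Sales", "HR"])
for value, freq, pct in pairs:
    print(f"{value:<6} {freq:>3} {pct:6.2f}%")
```

The `max_distinct` cap mirrors the 16,000-value limit noted on the slide.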
46
Value Pair Review
• Issues to evaluate during Column Value Pair analysis:• Are the values/range of values correct?
• Is the data type correct?
• Is there a pattern or format to the data for this Column? Do all of the values match this pattern/format?
• Is there a difference in case for alpha characters? Are some values mixed case while others are all upper or lower case? Is this an issue?
• Are there different representations (different abbreviations/misspellings) of the same data?
• Are there duplicate values in a field that should be unique?
47
Sorting Viewers
• It is possible to sort any of the tables displayed in Informatica Data Explorer
• Clicking on a column header sorts the results in ascending order; clicking it again sorts the list in descending order
Value Pair Review
48
Sort Order
• Sorting is based on the character codes of the values in the data:
• Spaces sort to the top of an ascending sort. When the caret (^) symbol is displayed, the sort is based on the actual “space” character, not the caret (^)
• Special characters (e.g. #, &, ‘)
• Nulls
• Numbers
• Alpha characters
49
Lab Exercises 2.8–2.10
50
Tags
• Informatica Data Explorer Tags come in various forms, depending on the type of information you want to convey:• Notes – general text
• Action Items – things that need to be done
• Rules – business rules defining nature of object
• Transformations – requirements to change the data to fit the object
51
Tags (cont.)
• Think of Tags as high-tech Post-Its™ that you can attach to many types of objects in an Informatica Data Explorer Project
• Note: All of the pull down menu items in Tags can be configured through server configuration files
52
Action Tag
53
Note Tag
54
Rule Tag
55
Lab Exercises 2.11–2.14
56
Content Presentation
• Constant Analysis
• Empty Column Analysis
• Inferred Data Type Analysis
• Null Rule Analysis
• Source Data Type Analysis
• Unique Analysis
• Frequency Analysis
• Pattern Analysis
• Domain Analysis
57
Content Presentations
58
Content Presentation (Continued)
59
Constant Analysis
60
Lesson 2 Review
• Column Profiling is about the analysis of column content and format
• Column Profiling scans data files and stores the resulting profile information in an Informatica Data Explorer Project
• Column Profiling information can be viewed by opening a Column List for a Table
61
Lesson 2 Review
• The results of Column Profiling are stored with a Column
• The results of Column Profiling include:
• Primary and Alternate Data Types
• Null Rules
• Minimum/Maximum Value ranges
• Value Pairs
• Patterns
• Tags can be added to Columns or Tables to convey additional information or instructions about the Column
62
Lesson 3
Data Rules
63
Data Rules - Objectives
• What is a Data Rule?
• Using Data Rules in Informatica Data Explorer
• How to test for Data Rules
• Execute Data Rules tasks
• When to apply Data Rules in the data discovery process.
64
Define Data Rules
• What is a Business Rule?
• A Business Rule describes the main characteristics of the data
• What is a Data Rule?
• A Data Rule is a constraint written against one or more Tables that is used to find incorrect data
• Data Rules can be viewed as business rules for data
65
Define Data Rules (cont.)
• Data Rules are often embedded in application programs
• The Informatica Data Explorer Practitioner can discover, document and test Data Rules against the initial source.
66
Using Data Rules in Informatica Data Explorer
• Data Rule analysis is the process of using Informatica Data Explorer to determine whether externally proposed data relationships are fully supported by the source data.
• Discover if the source data supports the relationships and business needs.
• Data Rules are tested against the initial source, stored and then can be re-run after the data has been cleansed or moved.
67
Business Rules and Data Rules
• Employees with 2 or more years of service are paid 3 weeks vacation.
• Fulltime employees are assigned to a salary band.
• Employees in Dept C – salaries cannot be greater than $40,000.
• Department number contained in the employee record must correspond to an existing Department number.
• Does the Column contain a particular string of characters?
68
Business Rules and Data Rules (cont.)
• Does one Column include the full contents of another Column?
• In an address, is there a line of blanks followed by a line of non-blanks?
• Are all three fields of a key null?
• Is the date Column in the wrong format?
• Does the Column contain the right type of data for this type of record?
69
Create and Execute Data Rules
• Data Rules can be created from two locations:
• Rules Tag
• Drill down
• Execute Data Rules from the Rules Tag viewer or Data Rules Management.
70
Drill Down
71
New Rule Tag
72
Lab Exercises 3.1–3.6
73
When to Apply Data Rules
• Tightly coupled to Drill Down
• Data Rules can be executed against different sources.
• Data Rules can be applied at any point in time during the data discovery process.
• Data Rules can be saved and re-run:
• After a data load has occurred, or
• A feed is supplied, or
• Data has changed for any reason
74
Complex Data Rules
RULE LoanTypeAmtTerm
SELECT "Loan_ID","Loan_Type","Loan_Amt","Loan_Term"
FROM <Use Table in Data Source>
WHERE (UPPER(LOAN_TYPE) = 'AUTO' and
(LOAN_AMT not between 3000 and 50000 or
LOAN_TERM not between 12 and 60)) or
(UPPER(LOAN_TYPE) = 'REAL' and
(LOAN_AMT not between 10000 and 500000 or
LOAN_TERM not between 36 and 360)) or
LOAN_TYPE is null or LOAN_AMT is null or LOAN_TERM is null
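The WHERE clause of the LoanTypeAmtTerm rule above can be tested outside IDE as well. A minimal sketch using SQLite with made-up loan rows (the table and data are hypothetical; IDE would run the rule against the configured data source):

```python
import sqlite3

# Hypothetical loan rows; the last two violate the rule above.
rows = [
    (1, "AUTO", 15000, 48),     # valid auto loan
    (2, "REAL", 250000, 240),   # valid real-estate loan
    (3, "AUTO", 90000, 48),     # amount outside the AUTO range
    (4, "REAL", 250000, None),  # null term
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE loans (Loan_ID, Loan_Type, Loan_Amt, Loan_Term)")
con.executemany("INSERT INTO loans VALUES (?, ?, ?, ?)", rows)

# The WHERE clause from the LoanTypeAmtTerm rule returns the violators.
bad = con.execute("""
    SELECT Loan_ID FROM loans
    WHERE (UPPER(Loan_Type) = 'AUTO' AND
           (Loan_Amt NOT BETWEEN 3000 AND 50000 OR
            Loan_Term NOT BETWEEN 12 AND 60)) OR
          (UPPER(Loan_Type) = 'REAL' AND
           (Loan_Amt NOT BETWEEN 10000 AND 500000 OR
            Loan_Term NOT BETWEEN 36 AND 360)) OR
          Loan_Type IS NULL OR Loan_Amt IS NULL OR Loan_Term IS NULL
    """).fetchall()
print(bad)  # rows 3 and 4 violate the rule
```

Note the trailing IS NULL tests: NOT BETWEEN evaluates to NULL (not true) for a null value, so nulls must be caught explicitly.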
75
Data Rule Management
76
Lesson 3 Review
• Data Rules can be created on Columns that we think are volatile
• Data Rules can be created, saved, and run on different data sources
• Data Rules can be created from two locations:
• Rules Tag
• Drill down
• Execute Data Rules from the Rules Tag viewer or Data Rules Management
77
Lesson 4
Single Table Structural Analysis
78
Lesson 4 Objectives
• Explain what Table Structural Profiling is, and why it should be performed
• Define the term “Functional Dependency” as used by Informatica Data Explorer, and explain the significance
• Contrast a Single-Column Determinant with a Multiple-Column (or compound) Determinant as used by Informatica Data Explorer
79
Lesson 4 Objectives (cont.)
• Define the terms “Inferred Dependencies” and “Model Dependencies” as used by Informatica Data Explorer
• Explain why and when an Inferred Dependency should be added to the set of Model Dependencies
• Define the term “Sample Data” as used by Informatica Data Explorer, and explain the use of Sample Data in Dependency Profiling
• Understand when and how to apply Informational Tags in Dependency Profiling
80
What is Table Structural Profiling?
• A process that discovers the interrelationships between columns in your source data
• Is performed against samples of data that you have imported into Informatica Data Explorer
• It identifies Columns that determine the value of other Columns
81
Why Profile Table Structure?
• Functional Dependencies determine the structure of a data model and/or database design
• Functional Dependencies can be equated to an elementary form of Business Rule
• Dependencies between data items suggest organization of data storage that is both natural and efficient
82
Why Profile Table Structure? (cont.)
• Quickly validate expected Dependencies (Keys)
• If data does not conform to expected or required dependency rules, you most likely have a data integrity problem
83
[Diagram: the IDE Server importing sample data from an RDBMS or from Flat Files / DB2 Load Utility files]
What is Sample Data?
• Sample Data is actual data that you import into an Informatica Data Explorer Table, either from:
• Downloaded flat files, or
• Directly from a relational database
• Sample Data is a subset of the data in the source database:
• Multiple data samples can be loaded into Informatica Data Explorer
• Each data sample is stored in the Project
• Sample Data is associated with a particular Table
84
Why Import Sample Data?
• Sample data is used in Table Structural Profiling to examine relationships of all columns of a given record
[Diagram: Source Data feeds Column Profiling (which stores results only) and the Import Sample Data task; imported Data Samples #1 and #2 are used by Table Structural Profiling, which examines entire records]
85
A value of EMPNO always determines the same value of ENAME throughout the sample data (EMPNO → ENAME)

EMPNO  ENAME
123    John Doe
456    Jane Smith
789    Eduardo Sanchez
012    Jane Smith
345    John Doe
789    Eduardo Sanchez
Functional Dependencies
• A Column is functionally dependent on the other Columns that determine its value
86
Functional Dependencies (cont.)
• A Functional Dependency is written as:
• A → B
• ‘A’ is the Determinant Column
• ‘B’ is the Dependent Column
• The statement is ALWAYS read left to right:
• ‘A functionally determines B’, or
• “If I know a value for A, I can determine the value for B”, or
• For each distinct value of ‘A’ there can only be one value of ‘B’
87
Functional Dependencies (cont.)
• The determinant side can be compound:
• A + B → C
• ‘A’ and ‘B’ together are the Determinant Column
• ‘C’ is the Dependent Column
• The determinant side can be Null:
• Ø → C
• Nothing is the Determinant Column
• ‘C’ is the Dependent Column
• ‘C’ has only one value, or one value and nulls, in the whole sample
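The notation above translates directly into a check over sample data. A hypothetical Python sketch (not IDE's inference algorithm) that tests whether a determinant, possibly compound, maps each distinct value to exactly one dependent value:

```python
def holds(rows, determinant, dependent):
    """True if determinant -> dependent holds in the sample:
    each distinct determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in determinant)  # supports compound determinants
        value = row[dependent]
        if key in seen and seen[key] != value:
            return False  # same determinant value, two dependent values
        seen[key] = value
    return True

# Made-up sample rows echoing the EMPNO/ENAME example
sample = [
    {"EMPNO": 1234, "ENAME": "John Doe",   "DEPT": "MIS"},
    {"EMPNO": 5678, "ENAME": "Jane Smith", "DEPT": "MIS"},
    {"EMPNO": 1234, "ENAME": "John Doe",   "DEPT": "MIS"},
]
print(holds(sample, ("EMPNO",), "ENAME"))  # True: EMPNO -> ENAME
```

Passing two column names as the determinant checks a compound dependency such as A + B → C.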
88
Reviewing Inferred Dependencies
• You must review the set of Inferred Dependencies
• The Dependencies inferred by Informatica Data Explorer exist implicitly in the data
• You must make decisions as to which of the Inferred Dependencies explicitly represent the current use of the data
• The review process is to determine:
• Which dependencies should be included in the set of dependencies from which the Normalized Schema will be generated
89
Sample Data
90
Exercises 4.1 - 4.4
91
Adding an Inferred Dependency to the Model
• Inferred Dependencies added to the model establish the Tables that will be created in Normalization
• Normalization breaks a single Table into multiple Tables
• For example, the “Employee” Table in the source system represents two Tables (Employee and Department) once the dependencies are created and the model is normalized
92
Adding an Inferred Dependency to the Model (cont.)
• Columns that do not participate as a Dependent are automatically included in the Primary Key
• Informatica Data Explorer considers all Columns part of the key until a relationship is established
• Dependency Profiling is an iterative process
93
Exercises 4.5 - 4.6
94
Dependency Subject Area
• Inferred Dependencies
• The set of dependencies that are inferred from a sample of data for a Table
• Table Dependencies
• A subset of the Model Dependencies that are wholly contained in a Table
95
Dependency Subject Area (cont.)
• Model Dependencies
• The set of dependencies that you determine fit into your design and are supported by the data
• Model Dependencies are associated at the schema level
• Model Dependencies are the set of all dependencies across all Tables
• Model Dependencies are used to create the normalized schema
96
Dependencies
97
Inferred Dependencies
98
Key Dependencies
99
Model Dependencies
100
Filter Dependencies
101
Add Dependencies to Model or Filter
102
When to Add an Inferred Dependency
• Review each Inferred Dependency and add to the model only those that have an explicit reason for existing:
• Is the application enforcing the dependency?
• Is the user/business enforcing the dependency?
• Is some outside source enforcing the dependency?
103
Types of Dependencies
• True
• The dependency is true for 100% of the data analyzed
• Example: Every time a unique value is known for EMPID, additional information is available (e.g. Employee Name, Address, Phone, etc.)
• Gray
• The dependency is almost, but not quite, 100% true for the data analyzed
• One row causes the violation
104
Types of Dependencies (cont.)
• Unsupported
• Two or more rows in the sample data do not support the dependency
• Unknown
• The dependency has not yet been validated against the sample data (dependencies awaiting validation appear as Unknown)
105
• Questions to Ask:
• What caused the dependency to be gray?
• Should another sample be imported for verification?
• Review each Inferred Gray Dependency and add to the model only those that have an explicit reason for existing:
• Is the application supposed to be enforcing the dependency?
• Is the user/business supposed to be enforcing the dependency?
• Is some outside source supposed to be enforcing the dependency?
When to Add an Inferred Gray Dependency
106
Lab Exercise 4.7
107
Tagging Dependencies
• You cannot tag an Inferred or Model Dependency
• You add Tags to the Column that is causing the problem
108
Compound Determinants
• Two or more Columns that uniquely identify the Dependent Column
• This often represents an M-to-1 relationship in the data
• This happens quite often in older file-based systems
109
Lab Exercise 4.8 – 4.9
110
Lesson 4 Review
• Importing Sample Data stores the data inside an Informatica Data Explorer Project
• Sample Data is used as input to Dependency Profiling
• You must import Sample Data before you can perform the Profile Dependencies task using Informatica Data Explorer
• Data samples are imported using the Import Sample Data feature
• Data samples can be retained from doing a Drill Down or executing a Data Rule
111
Lesson 4 Review (cont.)
• Dependency Profiling finds the relationships between Columns in the same source file or table
• All Inferred Dependencies are associated with sets of Sample Data
• Table Dependencies are dependencies that have been added to the model, and are associated with a specific Table
112
Lesson 4 Review (cont.)
• Model Dependencies are the set of dependencies from all Tables in the schema
• Only Model Dependencies are used as input to the generation of a Normalized Schema
• All Dependencies inferred by Informatica Data Explorer exist implicitly in the data
113
Lesson 4 Review (cont.)
• You will find many Inferred Dependencies that have no meaning in context of the application or business use of the data
• These are Implicit Dependencies that have no explicit meaning
• Dependency Profiling is an iterative process
114
Lesson 5
Cross Table Profiling
115
Lesson 5 Objectives
• Explain what Cross Table Profiling is, and why it should be performed
• Execute the Cross Table Profiling function in Informatica Data Explorer
• Navigate and review the results of Cross Table Profiling
• Define the terms “Synonym” and “Homonym” as used in Informatica Data Explorer
116
Lesson 5 Objectives (cont.)
• Understand what data is used for Cross Table Profiling, and how potential Synonyms are identified
• Describe why and when a Synonym should be created
• Create a Synonym
• Understand the significance of creating Synonyms
117
Cross Table Profiling
• The process that identifies similarity between the values in Columns of different Tables
• Performed using the value sets associated with the Column objects inside Informatica Data Explorer
• These are the Value Frequency Lists that were created by Column Profiling
118
Why Profile Redundancies?
• To uncover Columns that actually represent the same business facts
• Informatica Data Explorer can uncover two types of redundancies:
• Synonyms
• Redundant data that you would like to eliminate through the creation of Synonyms
• Redundant data that is intended to improve database performance
• Homonyms
• Data that looks redundant but actually represents quite different business facts
119
Comparing Value Sets
[Diagram: Value Set 1 contains A, B, C; Value Set 2 contains B, C, D, E; the Value Set Overlap is B and C]
120
Inferred Redundancies
121
Exercise 5.1 - 5.2
122
• Two or more Columns having the same business meaning
• Comparing common values between columns can identify candidate Synonyms
[Diagram: the SP_NO Value Set and the EMPID Value Set show a 28% overlap]
Synonyms
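The candidate-Synonym screen sketched on this slide amounts to comparing distinct-value sets. A hypothetical Python illustration (made-up SP_NO and EMPID values, not IDE's implementation):

```python
def overlap_pct(values_a, values_b):
    """Percent of A's distinct values that also occur in B --
    a rough screen for candidate Synonyms."""
    a, b = set(values_a), set(values_b)
    return 100.0 * len(a & b) / len(a)

# Hypothetical value sets: two of SP_NO's seven distinct values occur in EMPID
sp_no = [101, 102, 103, 104, 105, 106, 107]
empid = [104, 105, 300, 301]
print(f"{overlap_pct(sp_no, empid):.1f}% overlap")
```

A high overlap suggests a Synonym candidate; a high overlap between same-named columns with different meanings would instead flag a Homonym.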
123
Effect of Synonyms
• If the Primary Keys of two Tables are synonyms, they will collapse into a single Table in the Normalized Schema
[Diagram: an Inventory table (TransactionID (PK), ProductID, ProductName) and a Supplier table (TransactionID (PK), SupplierName, SupplierAddress) whose Primary Keys are synonyms collapse into a single Table (TransactionID (PK), ProductID, ProductName, SupplierName, SupplierAddr) in the Normalized Schema]
124
Effect of Synonyms (cont.)
• If two Columns that are synonyms represent a parent-child relationship, they will result in two Columns in two Tables with one Column participating in a Primary Key and the other in the corresponding Foreign Key
[Diagram: a source Order table (OrderNumber, ProductID, ProductName) and Payment table (PaymentID, OrderID, CheckNumber) become an Order table (OrderNumber (PK), ProductID, ProductName) and a Payment table (PaymentID, OrderNumber (FK), CheckNumber) in the Normalized Schema]
125
Homonyms Defined
• Two or more Columns having the same name yet different business meanings
[Diagram: the STATE Value Set and the SHIPPING_STATE Value Set show a 70% overlap yet carry different business meanings]
126
Making Synonyms
127
Synonyms
128
Exercise 5.3 - 5.4
129
Lesson 5 Review
• Cross Table Profiling is about data integration between sets of data
• Cross Table Profiling comprises 2 activities:
• Comparing value lists
• Use Foreign Key or Join analysis to compare value lists greater than 16,000 values
• Assigning Synonyms
• Rule of Thumb:
• Be conservative about making Synonyms
• You can always come back after you’ve normalized the schema and make more
130
Lesson 5 Review
• You cannot make intra-table Synonyms, only inter-table
• You must have built Value Lists either during the Profile Columns task, or during the Import Sample Data task, before you can perform Cross Table Profiling
• Creation of Synonyms participates in Normalization
131
Lesson 6
Validating Table and Cross Table Analysis
132
• Understand how Validation differs from Cross Table Profiling
• Define and discuss the term Referential Integrity
• Explain various methods of validation and how they can be used
• Execute Validation tasks
• View Validation results
Lesson 6 Objectives
133
• Validation can be used to:
• Define the exact overlap characteristics of two redundant Columns
• Validate a single or multi-Column foreign key
• Validate that the keys of two tables do not overlap (Vertical Merge)
• Validate single or multiple Column keys (Validate Keys)
• Validate a Join
• Validate against a reference table
• Validate against Domain values
• Execute Validation from the Single Table Structural Analysis and Cross Table Structural Analysis
Validation
134
Referential Integrity
• Example A: An Order File contains an OrderID that uniquely identifies each customer order. There should be no OrderID values in the Order or Detail file that do not exist in the other.
Example A
135
Referential Integrity (cont.)
• Example B: An Order file may have OrderID values that do not exist in the Payment file (outstanding payments or unbilled customers). The Payment file should not have any OrderID values that do not occur in the Order file.
Example B
136
• Validation compares sets of Columns between two relations to discover the quality of the overlap.
• Validation exhaustively tests all the data.
• Cross Table Profiling discovers potential overlap between Columns.
• Cross Table Profiling estimates overlap.
• Results of Validation – sets of statistics about the overlapping and non-overlapping values
Validation and Cross Table Profiling
137
• To understand the exact overlap:
• Execute Validation from the Cross Table Profiling
• Create a relationship (Primary Key / Foreign Key, Join, …) between the two Columns and choose Validate
Profile Redundant Columns
138
Exercise 6.1-6.2
139
• Validate a Single or Multi-Column Foreign Key
• Primary use – test the Referential Integrity of primary and foreign key relationships.
• Each row in a child table must reference a row in the parent table.
• Every order detail record must reference an order.
• Information discovered can be used to help write logic to perform the data integration.
Foreign Key Analysis
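The referential-integrity test described above reduces to finding child rows with no matching parent. A hypothetical sketch (illustrative Order/Detail data, not IDE's implementation):

```python
def orphans(child_rows, child_fk, parent_rows, parent_key):
    """Child rows whose foreign-key value has no matching parent row."""
    parent_keys = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[child_fk] not in parent_keys]

# Made-up data: detail 11 references a non-existent order
orders = [{"OrderID": 1}, {"OrderID": 2}]
details = [{"DetailID": 10, "OrderID": 1},
           {"DetailID": 11, "OrderID": 3}]
print(orphans(details, "OrderID", orders, "OrderID"))
```

Swapping the arguments (parents as "children") finds parents without children, as shown on the following slide.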
140
Foreign Key Analysis Results
141
Parents Without Children
142
Exercise 6.3-6.5
143
• Primary use – when two similar systems are merged together.
• Company A merges with Company B: payroll master records are merged.
• It is expected that all rows in the parent and child tables are orphans:
• Employees of Company A are not on Company B’s payroll master file
• Employees of Company B are not on Company A’s payroll master file
Vertical Merge Analysis
144
Vertical Merge
145
Vertical Merge Analysis Results
146
Exercise 6.6-6.8
147
• Primary use – validate keys in a single Table
• Validation examines the table and checks that every value of the key is unique.
• Use this feature to find any duplicate rows for keys discovered in Single Table Structural Analysis.
Validate Key Analysis
148
Validate Key
149
Validate Key Analysis Results
150
New Alternate Key
151
Validate Alternate Key
152
Exercise 6.9 - 6.12
153
Lesson 7
Normalization
154
Lesson 7 Objectives
• Explain what Normalization is and when it should be performed
• Execute the Normalization function of Informatica Data Explorer
• Navigate and review the results of Normalization
155
Lesson 7 Objectives (cont.)
• Describe what a Column Trace is, and how it is used
• Understand how to modify the Normalized Schema by making changes to the Source Schema
• Explain the iterative nature of Normalization
156
Normalization
• A process that transforms an initial schema into a schema with greater integrity
• A process of transforming the Source Schema into a:
• Non-redundant
• Anomaly-free
• Third Normal Form model
• Normalization is based upon:
• Dependencies added to the model in Single Table Structural Analysis, and
• Synonyms made in Cross Table Structural Analysis
157
Why Normalize?
• A Third Normal Form (3NF) schema has no:
• Redundant Columns other than Foreign Keys
• Columns that are only partially dependent on the key
• Transitive Dependencies
• The Normalized Schema provides a checkpoint for the completeness and accuracy of the decisions you made during the profiling tasks
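The dependencies that drive this decomposition can be tested mechanically. A hedged sketch (rows and the `fd_holds` helper are illustrative, not an IDE function) of checking a functional dependency DEPTID → DEPTNM:

```python
# Testing a functional dependency DEPTID -> DEPTNM, the kind of dependency
# that lets normalization split a lookup table out to reach 3NF.
def fd_holds(rows, lhs, rhs):
    """True if every value of column lhs determines a single value of rhs."""
    seen = {}
    for r in rows:
        if seen.setdefault(r[lhs], r[rhs]) != r[rhs]:
            return False
    return True

rows = [
    {"EMPID": 1, "DEPTID": 10, "DEPTNM": "Sales"},
    {"EMPID": 2, "DEPTID": 10, "DEPTNM": "Sales"},
    {"EMPID": 3, "DEPTID": 20, "DEPTNM": "Support"},
]
# DEPTID -> DEPTNM holds, so (DEPTID, DEPTNM) can become its own table;
# DEPTID -> EMPID does not hold.
```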
158
Exercise 7.1
Normalized Schema vs. Source Schema
Source Schema:

custord (Key: ORDER_NO: char(4), ITEM_NO: char(6))
  ORDER_DATE: datetime, SHIPDT: datetime, PO_NUM: smallint, LAST_NAME: varchar(10), FIRST_NAME: varchar(11), CNAME: varchar(36), CON_TTL: varchar(27), SHIPPING_STREET: varchar(40), SHIPPING_CITY: varchar(20), SHIPPING_STATE: char(2), SHIPPING_ZIP: varchar(10), PHONENUM: varchar(12), SP_NO: smallint, QUANTITY: smallint, ITEM_DSC: varchar(25), SUPID: smallint, UNIT_COST: money, TAX_RATE: decimal(5,4), BILL_CODE: char(10)

empinfo (Key: EMPID: smallint)
  LAST_NAME: varchar(17), FIRST_NAME: varchar(12), GENDER: char(1), DEPTID: smallint, DEPTNM: varchar(14), TITLE: varchar(30), STREET: varchar(40), CITY: varchar(15), STATE: varchar(3), ZIP: varchar(10), PHONE: varchar(14)

Normalized Schema:

All_Constant_Attributes
  BILL_CODE: char(10), TAX_RATE: decimal(5,4)

ITEM_NO (Key: ITEM_NO: char(6))
  SUPID: smallint, ITEM_DSC: varchar(25)

ITEM_NO_ORDER_DATE (Key: ITEM_NO: char(6), ORDER_DATE: datetime)
  UNIT_COST: money

ORDER_NO (Key: ORDER_NO: char(4))
  PHONENUM: varchar(12), SHIPPING_ZIP: varchar(10), SHIPPING_STATE: char(2), SHIPPING_CITY: varchar(20), SHIPPING_STREET: varchar(40), CON_TTL: varchar(27), CNAME: varchar(36), FIRST_NAME: varchar(11), LAST_NAME: varchar(10), PO_NUM: smallint

ITEM_NO_ORDER_NO (Key: ITEM_NO: char(6), ORDER_NO: char(4), ORDER_DATE: datetime, EmployeeID: smallint, DEPTID: smallint)
  QUANTITY: smallint, SHIPDT: datetime

DEPTID (Key: DEPTID: smallint)
  DEPTNM: varchar(14)

EmployeeID (Key: EmployeeID: smallint, DEPTID: smallint)
  PHONE: varchar(14), ZIP: varchar(10), STATE: varchar(3), CITY: varchar(15), STREET: varchar(40), TITLE: varchar(30), GENDER: char(1), FIRST_NAME: varchar(12), LAST_NAME: varchar(17)
160
Normalized Schema Anomalies
• Observable normalization anomalies may include:
• Unexpected Tables
• Duplicate Tables
• Tables with strange/unexpected keys
• Columns in the wrong locations
161
Column Tracing
• Allows you to find the origin of a Column in another schema
• Used to determine the Source Model Dependencies and Synonyms (or the lack thereof) which may be causing the anomaly
162
Schema Locking
• The existence of a Normalized Schema causes Informatica Data Explorer to lock various objects in the Source Schema
• In order to modify Dependencies in the Source Schema, you must remove the Normalized Schema
163
Re-Normalizing
• In order to change the Normalized Schema, you must Remove the Normalized Schema
• Modify the Source Schema, then Re-run Normalization
• The next exercises:
• Remove a dependency
• Add another Table
• Renormalize schema
• Review the new Normalized Schema
164
Lab Exercises 7.2 – 7.3
165
New Normalized Schema
All_Constant_Attributes
  BILL_CODE: char(10), TAX_RATE: decimal(5,4)

ITEM_NO (Key: ITEM_NO: char(6))
  SUPID: smallint, ITEM_DSC: varchar(25)

SHIPPING_ZIP (Key: SHIPPING_ZIP: varchar(10))
  SHIPPING_STATE: char(2), SHIPPING_CITY: varchar(20), SHIPPING_STREET: varchar(40)

ORDER_NO (Key: ORDER_NO: char(4))
  PHONENUM: varchar(12), SHIPPING_ZIP: varchar(10), CON_TTL: varchar(27), CNAME: varchar(36), FIRST_NAME: varchar(11), LAST_NAME: varchar(10), PO_NUM: smallint

ITEM_NO_ORDER_NO (Key: ITEM_NO: char(6), ORDER_NO: char(4))
  UNIT_COST: money, QUANTITY: smallint, EmployeeID: smallint, SHIPDT: datetime, ORDER_DATE: datetime

DEPTID (Key: DEPTID: smallint)
  DEPTNM: varchar(14)

EmployeeID (Key: EmployeeID: smallint)
  PHONE: varchar(14), ZIP: varchar(10), STATE: varchar(3), CITY: varchar(15), STREET: varchar(40), TITLE: varchar(30), DEPTID: smallint, GENDER: char(1), FIRST_NAME: varchar(12), LAST_NAME: varchar(17)
166
Lesson 7 Review
• Normalization is a 100% automated process
• The only inputs to the normalization process are:
• Dependencies added to the Model
• Column Synonyms
• Refinement of the Normalized Schema is an iterative process
167
Lesson 7 Review (cont.)
• The Normalized Schema is most often used as a basis for:
• Baseline view
• Review for anomalies
• Comparison to business requirements
• Staging Area
• The Normalized Schema is not a business model
168
Lesson 7 Review (cont.)
• Normalized Schema anomalies stem from either:
• Dependencies added to the model
• Dependencies not added to the model
• Incorrect (or unmade) Synonyms
• You can Normalize the Source Schema as soon as you have added dependencies to the model during Single Table Structural Analysis
• Actually, you can Normalize at any time, but if you have not added any dependencies it will just make a copy of your existing schema
169
Lesson 7 Review (cont.)
• If you have not established inter-relational Synonyms, you will get duplicate Tables and/or Columns in the Normalized Schema
• Duplicate Tables will appear in the Normalized Model with an extension, such as:
• EmployeeID
• EmployeeID_1
• Suggestions:
• Make only one change at a time and then renormalize
• Often making one change in the Source Schema can result in several changes in the Normalized Schema
170
Lesson 8
Exporting to the Repository
171
Lesson 8 Objective
• Export Projects to the IDE Repository
172
What is the Repository?
• A series of relational database tables that store the results from the Informatica Data Explorer Product Suite
173
Repository Export
• The Repository Export dialog box enables you to export an IDE catalog to the Repository
• The Repository Export dialog box provides the ability to limit some of the data that is exported to the Repository
• Once in the Repository, the Catalog becomes available to a variety of DBMS tools, such as SQL, report generators, and so on
• All schemas in the Catalog will be exported to the Repository
IDE Repository Architecture
[Diagram: the IDE Client (Windows XP, 2000) sends the Project to the IDE Server (UNIX or Windows NT), which uses ODBC Drivers to write to the Repository RDBMS (UNIX or Windows NT)]
175
Exporting to Repository
176
Lab Exercise 8.1
177
Lesson 8 Review
• You control what information from a Project is included in the Export process
• The more you export, the longer the process will take
• Information exported to the IDE Repository becomes available to:• Informatica Data Explorer Repository Navigator
• Report Writing tools
• SQL tools
178
Lesson 9
Using the Repository Navigator
179
Lesson 9 Objectives
• Understand use of the Repository Navigator
• Access the IDE Repository and browse its contents using the Navigator
• Explain Tags
• Understand how to share information among departments
180
IDE Repository Navigator
• A browser for the contents of the IDE Repository
• Can be used by anyone in your enterprise
[Diagram: the Repository holds knowledge about corporate systems: Structure, Content, Quality]
181
IDE Repository Architecture
[Diagram: on the client side (Windows XP, 2000), the IDE Client, IDE Source Profiler, IDE FTM/XML, and Repository Navigator connect through ODBC Drivers to the server side (UNIX or Windows NT), where the IDE Server and the IDE Repository RDBMS hold the Project]
182
Schema Viewer
• The Schema Viewer functions similarly to the Navigation Tree in Informatica Data Explorer:
• You expand/contract objects
• You right-click the mouse to view properties
• The Schema Viewer lets users query profiling information for Tables and Columns (Properties, Tags, Sample Data, Value Frequency Lists) within each schema
183
Exercise 9.1-9.3
184
The Link Viewer
• The Link Viewer shows links between any two schemas in the current project.
• Link Viewer uses:
• View Links between Columns
• Find information on compatibility problems
• Access Tags associated with Links
Link Viewer
185
Table Viewer
• Provides SQL access to the IDE Repository
• Has several pre-built SQL queries
• Allows you to run your own custom queries
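The custom-query workflow can be sketched against any SQL database. A hedged example, using an in-memory SQLite database: the `column_profile` table and its columns are invented purely to illustrate the shape of such a query, since the IDE Repository's real table layout is product-specific:

```python
# Illustrative custom query of the kind the Table Viewer supports.
# The table and column names below are assumptions, not the real
# IDE Repository schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE column_profile "
            "(table_name TEXT, column_name TEXT, null_count INTEGER)")
con.executemany("INSERT INTO column_profile VALUES (?, ?, ?)",
                [("custord", "PO_NUM", 12), ("custord", "SHIPDT", 0)])

# Custom query: columns with at least one NULL, most NULLs first.
rows = con.execute("SELECT column_name, null_count FROM column_profile "
                   "WHERE null_count > 0 ORDER BY null_count DESC").fetchall()
```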
186
Exercise 9.4-9.6
187
Lesson 9 Review
• The IDE Repository provides:• Rapid access to source data knowledge
• Team collaboration
• Enhanced communication
• Flexible ad hoc reporting
188
Lesson 10
Repository Reports
189
Lesson 10 Objectives
• Understand what IDE Repository Reports are
• Demonstrate how to use Repository Reports
• Create a report using a Crystal Reports template
• Export a report using Crystal Reports
190
What are Repository Reports
• IDE Repository Reports are a series of reports that provide specific management information from the IDE Repository.
• Reports are written with Crystal Reports.
191
Why use Crystal Reports?
• Provides a user interface that guides the design of reports from data stored in a relational database
• Can export data to other programs such as Excel, Word or HTML pages
• Provides the flexibility to create custom or ad hoc reports. The user is not limited to the reports provided in the Informatica Data Explorer Product Suite
• Accesses the IDE Repository through an ODBC connection
192
Report Templates
• A series of reports are provided as an easy means of obtaining documentation from the IDE Repository
• The Report Templates can be modified to meet individual needs
193
List of Reports
Column Profile - By File: Column Profiling results sorted by File
Column Profile - By Field: Column Profiling results sorted by Field
Null Rule Exceptions: List of Attributes with Null, Zero or Blanks
Value Frequency: Value Frequency Lists for Attributes
Supported Relationships: Inferred Dependencies for each Data Sample
Model Relationships: Dependencies that have been added to the Model
Overlapping Data: Redundancy Profiling Overlap Report
Notes: Note Tag Report
Action Items: Action Item Tag Report
Rules: Rule Tag Report
Transformations: Transformation Tag Report
Attribute Links: Reports Links between Attributes
194
Selection Criteria
• Allow users to select values for certain fields within the templates
• Limit the amount of data reported from the IDE Repository
• Each Template provides selection on ProjectName and SchemaName at a minimum
195
Exercises 10.1 – 10.4
196
Exporting Reports
• Crystal Reports provides an option to export report data into other file formats
• Useful for sharing data with individuals who do not have access to Crystal Reports
197
Exercise 10.5
198
Lesson 10 Review
• IDE Repository Reports provide reporting capability from the IDE Repository
• Additional reports can be created to meet business needs
199
Lesson 11
Integration with PowerCenter
200
PowerCenter Integration
• Informatica Data Explorer has the ability to share metadata with PowerCenter. This allows the business users to share knowledge that was found during the data discovery process with the PowerCenter developers.
• Objects that can be shared are: • Source and target schemas
• Filters
• Expressions (transformation tags in IDE).
201
Create a Transformation Tag
202
Transformation Tag
203
Set Physical Properties
204
Export to Repository
205
Open Fixed Target Mapping (FTM)
206
Open Your Project
207
Export to PowerCenter
208
Import Object into PowerCenter
209
Open Customer in Source Analyzer
210
Open a new Transform
211
Open Ports Tab
212
Informatica Resources
213
Informatica – The Data Integration Company
Informatica provides data integration tools for both batch and real-time applications:
• Data Migration
• Data Synchronization
• Data Warehousing
• Data Hubs
• Business Activity Monitoring
214
• Founded in 1993
• Leader in enterprise solution products
• Headquarters in Redwood City, CA
• Public company since April 1999 (INFA)
• 2000+ customers, including over 80% of Fortune 100
• Strategic partnerships with IBM Global Services, HP, Accenture, SAP, and many others
• Technology partnership with Composite Software for Enterprise Information Integration (EII) – real-time federated views and reporting across multiple data sources
• Worldwide distribution
Informatica – Company Information
215
Informatica Affiliations
216
Informatica Resources
www.informatica.com – provides information (under Services) on:
• Professional Services
• Education Services

my.informatica.com – customers and contractual partners can sign up to access:
• Technical Support
• Product documentation (under Tools – online documentation)
• Velocity Methodology (under Services)
• Knowledgebase
• Mapping templates

devnet.informatica.com – sign up for the Informatica Developers Network:
• Discussion forums
• Web seminars
• Technical papers
217