Posted: 09-Nov-2014
Best Informatica Interview Questions & Answers
Deleting duplicate row using Informatica
Q1. Suppose we have duplicate records in the source system and we want to load only the unique records into the target system, eliminating the duplicate rows. What will be the approach?
Ans.
Let us assume that the source system is a relational database and the source table contains duplicate rows. To eliminate the duplicates, we can check the Distinct option of the Source Qualifier of the source table and load the target accordingly. (If the source were a flat file, where this option is not available, a Sorter transformation with the Distinct option, or an Aggregator grouping on all ports, would serve the same purpose.)
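The Distinct option effectively makes the Source Qualifier issue a SELECT DISTINCT against the source. A runnable sketch of that behaviour, using SQLite as a stand-in for the relational source (the table and column names are hypothetical):

```python
import sqlite3

# In-memory stand-in for the relational source table (names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_customer (cust_id INT, cust_name TEXT)")
conn.executemany(
    "INSERT INTO src_customer VALUES (?, ?)",
    [(1, "Sam"), (1, "Sam"), (2, "John"), (2, "John"), (3, "Tom")],
)

# The Distinct option of the Source Qualifier effectively issues this query,
# so only unique rows reach the target.
unique_rows = conn.execute(
    "SELECT DISTINCT cust_id, cust_name FROM src_customer ORDER BY cust_id"
).fetchall()
print(unique_rows)  # [(1, 'Sam'), (2, 'John'), (3, 'Tom')]
```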
Informatica Join Vs Database Join
Which is faster: Informatica or Oracle?
In our previous article, we tested the performance of the ORDER BY operation in Informatica and Oracle and found that, under our test conditions, Oracle performs sorting 14% faster than Informatica. This time we will look into the JOIN operation, not only because JOIN is the single most important data set operation, but also because the performance of JOIN gives a developer crucial data for manually implementing proper pushdown optimization.
Informatica is one of the leading data integration tools in today's world. More than 4,000 enterprises worldwide rely on Informatica to access, integrate and trust their information assets. On the other hand, Oracle is arguably the most successful and powerful RDBMS, trusted since the 1980s across all sorts of business domains and all major platforms. Both systems are best-in-class in the technologies they support. But when it comes to application development, developers often face the challenge of striking the right balance of operational load sharing between the two. This article will help them make an informed decision.
Which JOINs data faster? Oracle or Informatica?
As an application developer, you have the choice of either using join syntax at the database level to join your data or using a JOINER transformation in Informatica to achieve the same outcome. The question is: which system performs this faster?
Test Preparation
We will perform the same test with 4 different data points (data volumes) and log the results. We will start with 1 million records in the detail table and 0.1 million in the master table. Subsequently we will test with 2, 4 and 6 million detail-table records against 0.2, 0.4 and 0.6 million master-table records respectively. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCentre 8.5 as ETL tool
3. Database and Informatica setup on different physical servers using HP UNIX
4. Source database table has no constraint, no index, no database statistics and no partition
5. Source database table is not available in Oracle shared pool before the same is read
6. There is no session level partition in Informatica PowerCentre
7. There is no parallel hint provided in extraction SQL query
8. Informatica JOINER has enough cache size
We have used two Informatica PowerCentre mappings created in PowerCentre Designer. The first mapping, m_db_side_join, uses an INNER JOIN clause in the source qualifier to join data at the database level. The second mapping, m_Infa_side_join, uses an Informatica JOINER to join data at the Informatica level. We have executed these mappings with the different data points and logged the results.
Further to the above test, we will execute the m_db_side_join mapping once again, this time with proper database-side indexes and statistics, and log the results.
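The database-side mapping effectively pushes the INNER JOIN into the source qualifier query. A minimal sketch of that join, using SQLite as a stand-in for Oracle (the master/detail schema below is hypothetical, not the actual test schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master (m_id INT PRIMARY KEY, m_name TEXT)")
conn.execute("CREATE TABLE detail (d_id INT, m_id INT, amount INT)")
conn.executemany("INSERT INTO master VALUES (?, ?)", [(1, "A"), (2, "B")])
conn.executemany(
    "INSERT INTO detail VALUES (?, ?, ?)",
    [(10, 1, 100), (11, 1, 150), (12, 2, 200)],
)

# m_db_side_join: the join happens inside the database, so Informatica
# receives already-joined rows from the source qualifier.
joined = conn.execute(
    "SELECT d.d_id, m.m_name, d.amount "
    "FROM detail d INNER JOIN master m ON d.m_id = m.m_id "
    "ORDER BY d.d_id"
).fetchall()
print(joined)  # [(10, 'A', 100), (11, 'A', 150), (12, 'B', 200)]
```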
Result
The following graph shows the performance of Informatica and the database in terms of the time taken by each system to join data. The average time is plotted along the vertical axis and the data points along the horizontal axis.
Data Point | Master Table Record Count | Detail Table Record Count
1 | 0.1 M | 1 M
2 | 0.2 M | 2 M
3 | 0.4 M | 4 M
4 | 0.6 M | 6 M
Verdict
In our test environment, Oracle 10g performs the JOIN operation 24% faster than the Informatica Joiner transformation without an index, and 42% faster with database indexes.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
1. This data can only be used for performance comparison but cannot be used for performance
benchmarking.
2. This data is only indicative and may vary in different testing conditions.
In this DWBI Concepts original article, we put the Oracle database and Informatica PowerCentre head to head to determine which one handles the data SORT operation faster. The article gives application developers crucial insight for making informed performance-tuning decisions.
Comparing Performance of SORT operation (Order By) in Informatica and Oracle
Which is faster: Informatica or Oracle?
Think about a typical ETL operation often used in enterprise-level data integration. A lot of the data processing can be redirected either to the database or to the ETL tool. In general, both are reasonably capable of performing such operations with almost the same efficiency. But in order to achieve optimal performance, a developer must carefully consider which system to entrust with each individual processing task.
In this article, we will take a basic database operation – Sorting, and we will put these two systems to test in
order to determine which does it faster than the other, if at all.
Which sorts data faster? Oracle or Informatica?
As an application developer, you have the choice of either using ORDER BY at the database level to sort your data or using a SORTER transformation in Informatica to achieve the same outcome. The question is: which system performs this faster?
Test Preparation
We will perform the same test with different data points (data volumes) and log the results. We will start with 1 million records and double the volume for each subsequent data point. Here are the details of the setup we will use:
1. Oracle 10g database as relational source and target
2. Informatica PowerCentre 8.5 as ETL tool
3. Database and Informatica setup on different physical servers using HP UNIX
4. Source database table has no constraint, no index, no database statistics and no partition
5. Source database table is not available in Oracle shared pool before the same is read
6. There is no session level partition in Informatica PowerCentre
7. There is no parallel hint provided in extraction SQL query
8. The source table has 10 columns and first 8 columns will be used for sorting
9. Informatica sorter has enough cache size
We have used two Informatica PowerCentre mappings created in PowerCentre Designer. The first mapping, m_db_side_sort, uses an ORDER BY clause in the source qualifier to sort data at the database level. The second mapping, m_Infa_side_sort, uses an Informatica sorter to sort data at the Informatica level. We have executed these mappings with the different data points and logged the results.
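The two mappings push the same sort to different engines. A minimal sketch of both approaches, using SQLite as a stand-in for Oracle and Python's sorted() as a stand-in for the Sorter transformation (names and data are hypothetical):

```python
import sqlite3

rows = [(3, "c"), (1, "a"), (2, "b")]

# m_db_side_sort: ORDER BY in the source qualifier query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INT, val TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)", rows)
db_sorted = conn.execute("SELECT id, val FROM src ORDER BY id").fetchall()

# m_Infa_side_sort: the ETL tier sorts after reading unsorted rows,
# the way a Sorter transformation would.
etl_sorted = sorted(rows, key=lambda r: r[0])

print(db_sorted == etl_sorted)  # True: same result, different engine
```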
Result
The following graph shows the performance of Informatica and Database in terms of time taken by each
system to sort data. The time is plotted along vertical axis and data volume is plotted along horizontal axis.
Verdict
The above experiment demonstrates that, on average, the Oracle database is 14% faster than Informatica in the SORT operation.
Assumption
1. Average server load remains the same during all the experiments
2. Average network speed remains the same during all the experiments
Note
This data can only be used for performance comparison but cannot be used for performance benchmarking.
The Informatica-versus-Oracle performance comparison for the JOIN operation is covered earlier in this document.
Informatica Reject File - How to Identify rejection reason
When we run a session, the Integration Service may create a reject file for each target instance in the mapping to store the rejected target records. With the help of the session log and the reject file we can identify the cause of data rejection in the session. Eliminating the cause of rejection will lead to rejection-free loads in subsequent session runs. If the Informatica Writer or the target database rejects data for any valid reason, the Integration Service logs the rejected records into the reject file. Every time we run the session, the Integration Service appends the rejected records to the reject file.
Working with Informatica Bad Files or Reject Files
By default the Integration Service creates the reject files (bad files) in the $PMBadFileDir process-variable directory. It writes the entire rejected row to the bad file, although the problem may lie in any one of the columns. Reject files follow the default naming convention [target_instance_name].bad. If we open a reject file in an editor we will see comma-separated values containing some indicator tags along with the data values. There are two types of indicators in the reject file: the Row Indicator and the Column Indicator.
The easiest way to read a bad file is to copy its contents and save them as a CSV (Comma Separated Values) file; opening the CSV gives a spreadsheet-like view. The first column in the reject file is the Row Indicator, which tells whether the row was destined for insert, update, delete or reject. It is basically a flag reflecting the Update Strategy for the data row. When the Commit Type of the session is configured as User-defined, the row indicator also shows whether the transaction was rolled back due to a non-fatal error, or whether the committed transaction was in a failed target connection group.
List of Values of Row Indicators:
Row Indicator | Significance | Rejected By
0 | Insert | Writer or target
1 | Update | Writer or target
2 | Delete | Writer or target
3 | Reject | Writer
4 | Rolled-back insert | Writer
5 | Rolled-back update | Writer
6 | Rolled-back delete | Writer
7 | Committed insert | Writer
8 | Committed update | Writer
9 | Committed delete | Writer
Next come the column data values, each followed by its Column Indicator, which describes the data quality of the corresponding column.
List of Values of Column Indicators:
Column Indicator | Type of Data | Writer Treats As
D | Valid data (good data) | Writer passes it to the target database. The target accepts it unless a database error occurs, such as finding a duplicate key while inserting.
O | Overflowed numeric data | Numeric data exceeded the specified precision or scale for the column. Bad data, if you configured the mapping target to reject overflow or truncated data.
N | Null value | The column contains a null value. Good data. Writer passes it to the target, which rejects it if the target database does not accept null values.
T | Truncated string data | String data exceeded a specified precision for the column, so the Integration Service truncated it. Bad data, if you configured the mapping target to reject overflow or truncated data.
Note also that the second column of each record contains the column indicator value 'D', which signifies that the row indicator preceding it is valid.
Now let us see how Data in a Bad File looks like:
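The screenshot of the sample bad file did not survive extraction, so here is a hypothetical reject-file line decoded with the row and column indicator values listed above (the data values are invented for illustration):

```python
# Decode tables taken from the row/column indicator lists above.
ROW_IND = {0: "Insert", 1: "Update", 2: "Delete", 3: "Reject",
           4: "Rolled-back insert", 5: "Rolled-back update",
           6: "Rolled-back delete", 7: "Committed insert",
           8: "Committed update", 9: "Committed delete"}
COL_IND = {"D": "Valid data", "O": "Overflowed numeric data",
           "N": "Null value", "T": "Truncated string data"}

def decode_reject_line(line):
    """Decode one reject-file record: row indicator, a 'D' validating it,
    then (column value, column indicator) pairs."""
    fields = line.split(",")
    row_action = ROW_IND[int(fields[0])]
    pairs = fields[2:]  # skip the 'D' that validates the row indicator
    columns = [(pairs[i], COL_IND[pairs[i + 1]])
               for i in range(0, len(pairs), 2)]
    return row_action, columns

# A hypothetical reject-file line (values invented for illustration):
action, cols = decode_reject_line("0,D,1921,D,,N,9999999999,O")
print(action)  # Insert
print(cols)    # [('1921', 'Valid data'), ('', 'Null value'), ...]
```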
Implementing Informatica Incremental Aggregation
Using incremental aggregation, we apply captured changes in the source data (the CDC part) to aggregate calculations in a session. If the source changes incrementally and we can capture those changes, we can configure the session to process only them. This allows the Integration Service to update the target incrementally, rather than forcing it to delete the previously loaded data, reprocess the entire source and recalculate the same aggregates on every session run.
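The idea can be sketched in a few lines: rather than recomputing aggregates from the full source, the previously stored aggregate values are updated with only the captured changes. A simplified illustration (not Informatica's actual cache format; the keys and amounts are hypothetical):

```python
def incremental_aggregate(stored_totals, changed_rows):
    """Apply captured source changes (CDC rows) to previously
    computed per-key totals instead of recalculating from scratch."""
    for key, amount in changed_rows:
        stored_totals[key] = stored_totals.get(key, 0) + amount
    return stored_totals

# Totals persisted from the previous session run...
totals = {"Store1": 1600, "Store2": 2200}
# ...updated with only the new/changed source rows of this run.
incremental_aggregate(totals, [("Store1", 700), ("Store3", 250)])
print(totals)  # {'Store1': 2300, 'Store2': 2200, 'Store3': 250}
```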
Using Informatica Normalizer Transformation
Normalizer, a native transformation in Informatica, can ease many complex data transformation requirements.
Learn how to use the Normalizer effectively here.
Using Normalizer Transformation
A Normalizer is an active transformation that returns multiple rows from a single source row, returning duplicate data for the single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. It can be used to transpose data in columns to rows.
Normalizer effectively does the opposite of what Aggregator does!
Example of Data Transpose using Normalizer
Think of a relational table that stores four quarters of sales by store, where we need to create a row for each sales occurrence. We can configure a Normalizer transformation to return a separate row for each quarter, as below.
The following source rows contain four quarters of sales by store:
Source Table
Store Quarter1 Quarter2 Quarter3 Quarter4
Store1 100 300 500 700
Store2 250 450 650 850
The Normalizer returns a row for each store and sales combination. It also returns an index (GCID) that identifies the quarter number:
Target Table
Store Sales Quarter
Store 1 100 1
Store 1 300 2
Store 1 500 3
Store 1 700 4
Store 2 250 1
Store 2 450 2
Store 2 650 3
Store 2 850 4
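The column-to-row transpose shown above can be sketched directly; GCID here plays the role of the generated column index for the multiple-occurring Quarter columns (a simplified model, not the actual transformation):

```python
def normalize(rows, repeating_cols):
    """Return one output row per (store, repeating column), with a
    GCID-style index identifying which occurrence the value came from."""
    out = []
    for row in rows:
        for gcid, col in enumerate(repeating_cols, start=1):
            out.append((row["Store"], row[col], gcid))
    return out

source = [
    {"Store": "Store1", "Q1": 100, "Q2": 300, "Q3": 500, "Q4": 700},
    {"Store": "Store2", "Q1": 250, "Q2": 450, "Q3": 650, "Q4": 850},
]
target = normalize(source, ["Q1", "Q2", "Q3", "Q4"])
print(target[:2])  # [('Store1', 100, 1), ('Store1', 300, 2)]
```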
How Informatica Normalizer Works
Suppose we have the following data in source:
Name Month Transportation House Rent Food
Sam Jan 200 1500 500
John Jan 300 1200 300
Tom Jan 300 1350 350
Sam Feb 300 1550 450
John Feb 350 1200 290
Tom Feb 350 1400 350
and we need to transform the source data and populate this as below in the target table:
Name Month Expense Type Expense
Sam Jan Transport 200
Sam Jan House rent 1500
Sam Jan Food 500
John Jan Transport 300
John Jan House rent 1200
John Jan Food 300
Tom Jan Transport 300
Tom Jan House rent 1350
Tom Jan Food 350
.. like this.
Now below is the screen-shot of a complete mapping which shows how to achieve this result using
Informatica PowerCenter Designer. Image: Normalization Mapping Example 1
I will explain the mapping further below.
Setting Up Normalizer Transformation Property
First we need to set the number of occurrences property of the Expense head to 3 in the Normalizer tab of the Normalizer transformation, since we have Food, House rent and Transportation.
This in turn creates the corresponding 3 input ports in the Ports tab, along with the fields Individual and Month.
In the Ports tab of the Normalizer the ports will be created automatically as configured in the Normalizer tab.
Interestingly we will observe two new columns namely,
GK_EXPENSEHEAD
GCID_EXPENSEHEAD
The GK field generates a sequence number starting from the value defined in the Sequence field, while GCID holds the value of the occurrence field, i.e. the column number of the input Expense head.
Here 1 is for FOOD, 2 is for HOUSERENT and 3 is for TRANSPORTATION.
The GCID thus tells us which expense corresponds to which field while converting columns to rows.
Below is the screen-shot of the expression to handle this GCID efficiently:
Image: Expression to handle GCID
This is how we will accomplish our task!
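The expression referenced above essentially maps each GCID back to its expense-type label. A sketch of that decode (a simplified model of the DECODE-style expression; the labels follow the numbering stated above):

```python
# DECODE-style expression: map the Normalizer's GCID back to a label.
GCID_TO_EXPENSE = {1: "Food", 2: "House rent", 3: "Transport"}

def expense_type(gcid):
    return GCID_TO_EXPENSE[gcid]

# One normalized row per (name, month, expense head, amount):
normalized = [("Sam", "Jan", 3, 200), ("Sam", "Jan", 2, 1500),
              ("Sam", "Jan", 1, 500)]
target = [(n, m, expense_type(g), amt) for n, m, g, amt in normalized]
print(target[0])  # ('Sam', 'Jan', 'Transport', 200)
```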
Informatica Dynamic Lookup Cache
A lookup cache does not change once built. But what if the underlying lookup table changes after the lookup cache is created? Is there a way to keep the cache up-to-date even as the underlying table changes?
Dynamic Lookup Cache
Let's think about this scenario. You are loading your target table through a mapping. Inside the mapping you have a Lookup, and in the Lookup you are actually looking up the same target table you are loading. You may ask, "So? What's the big deal? We all do it quite often...". And yes, you are right. There is no "big deal" because Informatica (generally) caches the lookup table at the very beginning of the mapping, so whatever records get inserted into the target table through the mapping will have no effect on the lookup cache. The lookup will still hold the previously cached data, even as the underlying target table changes.
But what if you want your lookup cache to be updated as and when the target table changes? What if you want your lookup cache to always show the exact snapshot of the data in your target table at that point in time? Clearly this requirement will not be fulfilled with a static cache. You will need a dynamic cache to handle this.
But why would anyone need a dynamic cache?
To understand this, let's first understand a static cache scenario.
Informatica Dynamic Lookup Cache - What is Static Cache
STATIC CACHE SCENARIO
Let's suppose you run a retail business and maintain all your customer information in a customer master table (an RDBMS table). Every night, all the customers from your customer master table are loaded into a Customer Dimension table in your data warehouse. Your source customer table is a transaction-system table, probably in 3rd normal form, and does not store history: if a customer changes his address, the old address is overwritten with the new one. But your data warehouse table stores the history (perhaps in the form of SCD Type-II). A mapping loads your data warehouse table from the source table. Typically you do a lookup on the target (static cache) and check every incoming customer record to determine whether the customer already exists in the target. If the customer does not exist, you conclude the customer is new and INSERT the record, whereas if the customer already exists, you may want to update the target record with the new record (if it has changed). This is illustrated below; you don't need a dynamic lookup cache for this.
Image: A static Lookup Cache to determine if a source record is new or updatable
Informatica Dynamic Lookup Cache - What is Dynamic Cache
DYNAMIC LOOKUP CACHE SCENARIO
Notice in the previous example I mentioned that your source table is an RDBMS table, which ensures that your source does not contain duplicate records.
But what if you had a flat file as source, with many duplicate records? Would the scenario be the same? No; see the illustration below.
Image: A Scenario illustrating the use of dynamic lookup cache
Here are some more examples of when you may consider using a dynamic lookup:
- Updating a master customer table with both new and updated customer information coming together, as shown above
- Loading data into a slowly changing dimension table and a fact table at the same time. Remember, you typically look up the dimension while loading the fact, so you normally load the dimension table before the fact table. But using a dynamic lookup, you can load both simultaneously.
- Loading data from a file with many duplicate records, eliminating duplicates in the target by updating duplicate rows, i.e. keeping the most recent (or the initial) row
- Loading the same data from multiple sources in a single mapping. Just consider the previous retail-business example: if you have more than one shop and Linda has visited two of your shops for the first time, the customer record for Linda will arrive twice during the same load.
Informatica Dynamic Lookup Cache - How does dynamic cache work
So, How does dynamic lookup work?
When the Integration Service reads a row from the source, it updates the lookup cache by performing one of
the following actions:
- Inserts the row into the cache: If the incoming row is not in the cache, the Integration Service inserts the row into the cache based on the input ports or a generated Sequence-ID, and flags the row as insert.
- Updates the row in the cache: If the row exists in the cache, the Integration Service updates the row in the cache based on the input ports, and flags the row as update.
- Makes no change to the cache: This happens when the row exists in the cache and the lookup is configured to insert new rows only; or the row is not in the cache and the lookup is configured to update existing rows only; or the row is in the cache but, based on the lookup condition, nothing changes. The Integration Service flags the row as unchanged.
Notice that Integration Service actually flags the rows based on the above three conditions.
And that's a great thing, because, if you know the flag you can actually reroute the row to achieve different
logic. This flag port is called
NewLookupRow
Using the value of this port, the rows can be routed for insert, update or to do nothing. You just need to use
a Router or Filter transformation followed by an Update Strategy.
Oh, forgot to tell you the actual values that you can expect in NewLookupRow port are:
0 = Integration Service does not update or insert the row in the cache.
1 = Integration Service inserts the row into the cache.
2 = Integration Service updates the row in the cache.
When the Integration Service reads a row, it changes the lookup cache depending on the results of the
lookup query and the Lookup transformation properties you define. It assigns the value 0, 1, or 2 to the
NewLookupRow port to indicate if it inserts or updates the row in the cache, or makes no change.
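The behaviour above can be sketched as a small cache-update function that returns the NewLookupRow flag (a simplified model of the dynamic cache, not Informatica internals):

```python
def dynamic_lookup(cache, key, row, insert_new=True, update_existing=True):
    """Return the NewLookupRow flag: 0 = no change, 1 = inserted, 2 = updated."""
    if key not in cache:
        if not insert_new:
            return 0          # lookup configured to update existing rows only
        cache[key] = row
        return 1              # row inserted into the cache
    if update_existing and cache[key] != row:
        cache[key] = row
        return 2              # row updated in the cache
    return 0                  # nothing changed

cache = {}
print(dynamic_lookup(cache, "Linda", {"city": "Paris"}))  # 1: first visit, insert
print(dynamic_lookup(cache, "Linda", {"city": "Lyon"}))   # 2: changed, update
print(dynamic_lookup(cache, "Linda", {"city": "Lyon"}))   # 0: unchanged
```

The flag returned here is exactly what a downstream Router or Filter plus Update Strategy would use to route the row for insert, update or no action.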
Informatica Dynamic Lookup Cache - Dynamic Lookup Mapping Example
Example of Dynamic Lookup Implementation
I have designed a mapping to show the dynamic lookup implementation, and a full screenshot of the mapping is given. Since the screenshot is slightly big, I link it below; just click to expand the image.
Here is the screenshot of the lookup. Lookup Ports tab first: Image: Dynamic Lookup Ports Tab
And here is Dynamic Lookup Properties Tab
If you check the mapping screenshot, there I have used a router to reroute the INSERT group and UPDATE
group. The router screenshot is also given below. New records are routed to the INSERT group and existing
records are routed to the UPDATE group.
Router Transformation Groups Tab
Informatica Dynamic Lookup Cache - Dynamic Lookup Sequence ID
While using a dynamic lookup cache, we must associate each lookup/output port with an input/output port or
a sequence ID. The Integration Service uses the data in the associated port to insert or update rows in the
lookup cache. The Designer associates the input/output ports with the lookup/output ports used in the
lookup condition.
When we select Sequence-ID in the Associated Port column, the Integration Service generates a sequence
ID for each row it inserts into the lookup cache.
When the Integration Service creates the dynamic lookup cache, it tracks the range of values in the cache associated with any port using a sequence ID, and it generates a key for the port by incrementing the greatest existing sequence ID value by one whenever it inserts a new row of data into the cache.
When the Integration Service reaches the maximum number for a generated sequence ID, it starts over at one and increments each sequence ID by one until it reaches the smallest existing value minus one. If the Integration Service runs out of unique sequence ID numbers, the session fails.
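That generation rule can be sketched as follows (a simplified model; MAX_ID stands in for the Integration Service's actual maximum sequence ID):

```python
MAX_ID = 10  # stand-in for the real maximum generated sequence ID

def next_sequence_id(existing_ids):
    """Next ID = greatest existing value + 1; on overflow, wrap to 1 and
    climb toward the smallest existing value; fail when none remain."""
    if not existing_ids:
        return 1
    candidate = max(existing_ids) + 1
    if candidate <= MAX_ID:
        return candidate
    for candidate in range(1, min(existing_ids)):
        if candidate not in existing_ids:
            return candidate
    raise RuntimeError("Session fails: no unique sequence IDs left")

print(next_sequence_id({3, 4, 5}))   # 6: greatest existing + 1
print(next_sequence_id({5, 9, 10}))  # 1: maximum reached, wraps around
```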
Informatica Dynamic Lookup Cache - Dynamic Lookup Ports
About the Dynamic Lookup Output Port
The lookup/output port output value depends on whether we choose to output old or new values when the
Integration Service updates a row:
Output old values on update: The Integration Service outputs the value that existed in the cache
before it updated the row.
Output new values on update: The Integration Service outputs the updated value that it writes in
the cache. The lookup/output port value matches the input/output port value.
Note: We can configure to output old or new values using the Output Old Value On Update transformation
property.
Informatica Dynamic Lookup Cache - NULL handling in LookUp
Handling NULL in dynamic LookUp
If the input value is NULL and we select the Ignore Null inputs for Update property for the associated input port, the input value does not equal the lookup value or the value out of the input/output port. When you select the Ignore Null property, the lookup cache and the target table might become unsynchronized if you pass null values to the target, so you must verify that you do not pass null values to the target.
When you update a dynamic lookup cache and target table, the source data might contain some null values.
The Integration Service can handle the null values in the following ways:
Insert null values: The Integration Service uses null values from the source and updates the
lookup cache and target table using all values from the source.
Ignore Null inputs for Update property : The Integration Service ignores the null values in the
source and updates the lookup cache and target table using only the not null values from the source.
If we know the source data contains null values, and we do not want the Integration Service to update the
lookup cache or target with null values, then we need to check the Ignore Null property for the corresponding
lookup/output port.
When we choose to ignore NULLs, we must verify that we output the same values to the target that the Integration Service writes to the lookup cache. We can configure the mapping based on the value we want the Integration Service to output from the lookup/output ports when it updates a row in the cache, so that the lookup cache and the target table do not become unsynchronized:
New values: Connect only lookup/output ports from the Lookup transformation to the target.
Old values: Add an Expression transformation after the Lookup transformation and before the Filter or Router transformation. Add output ports in the Expression transformation for each port in the target table and create expressions to ensure that we do not output null input values to the target.
Informatica Dynamic Lookup Cache - Other Details
When we run a session that uses a dynamic lookup cache, the Integration Service compares the values in
all lookup ports with the values in their associated input ports by default.
It compares the values to determine whether or not to update the row in the lookup cache. When a value in
an input port differs from the value in the lookup port, the Integration Service updates the row in the cache.
But what if we don't want to compare all ports? We can choose the ports we want the Integration Service to
ignore when it compares ports. The Designer only enables this property for lookup/output ports when the
port is not used in the lookup condition. We can improve performance by ignoring some ports during
comparison.
We might want to do this when the source data includes a column that indicates whether or not the row
contains data we need to update. Select the Ignore in Comparison property for all lookup ports except
the port that indicates whether or not to update the row in the cache and target table.
Note: We must configure the Lookup transformation to compare at least one port; otherwise the Integration Service fails the session when we ignore all ports.
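The comparison rule can be sketched as: the cache row needs an update only when some non-ignored port differs (a simplified model; the port names are hypothetical):

```python
def needs_update(cache_row, input_row, ignore_ports=()):
    """Compare associated ports, skipping any marked Ignore in Comparison."""
    return any(
        cache_row[port] != input_row[port]
        for port in cache_row
        if port not in ignore_ports
    )

cached = {"cust_id": 7, "city": "Delhi", "load_ts": "2009-01-01"}
incoming = {"cust_id": 7, "city": "Delhi", "load_ts": "2009-06-15"}

print(needs_update(cached, incoming))                            # True: load_ts differs
print(needs_update(cached, incoming, ignore_ports={"load_ts"}))  # False: rest match
```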
Pushdown Optimization In Informatica
Pushdown optimization, a relatively new concept in Informatica PowerCentre, allows developers to balance the data transformation load among servers. This article describes pushdown techniques.
What is Pushdown Optimization?
Pushdown optimization is a way of load-balancing among servers in order to achieve optimal performance.
Veteran ETL developers often come across issues when they need to determine the appropriate place to
perform ETL logic. Suppose an ETL logic needs to filter out data based on some condition. One can either
do it in database by using WHERE condition in the SQL query or inside Informatica by using Informatica
Filter transformation. Sometimes, we can even "push" some transformation logic to the target database
instead of doing it on the source side (especially in the case of ELT rather than ETL). Such optimization is
crucial for overall ETL performance.
How does Push-Down Optimization work?
One can push transformation logic to the source or target database using pushdown optimization. The
Integration Service translates the transformation logic into SQL queries and sends the SQL queries to the
source or the target database which executes the SQL queries to process the transformations. The amount
of transformation logic one can push to the database depends on the database, transformation logic, and
mapping and session configuration. The Integration Service analyzes the transformation logic it can push to
the database and executes the SQL statement generated against the source or target tables, and it
processes any transformation logic that it cannot push to the database.
Pushdown Optimization In Informatica - Using Pushdown Optimization
Using Pushdown Optimization
Use the Pushdown Optimization Viewer to preview the SQL statements and mapping logic that the
Integration Service can push to the source or target database. You can also use the Pushdown Optimization
Viewer to view the messages related to pushdown optimization.
Let us take an example:
Image: Pushdown Optimization Example 1
Filter Condition used in this mapping is: DEPTNO>40
Suppose a mapping contains a Filter transformation that filters out all employees except those with a
DEPTNO greater than 40. The Integration Service can push the transformation logic to the database. It
generates the following SQL statement to process the transformation logic:
INSERT INTO EMP_TGT(EMPNO, ENAME, SAL, COMM, DEPTNO)
SELECT
EMP_SRC.EMPNO,
EMP_SRC.ENAME,
EMP_SRC.SAL,
EMP_SRC.COMM,
EMP_SRC.DEPTNO
FROM EMP_SRC
WHERE (EMP_SRC.DEPTNO >40)
The Integration Service generates an INSERT SELECT statement and it filters the data using a WHERE
clause. The Integration Service does not extract data from the database at this time.
We can configure pushdown optimization in the following ways:
Using source-side pushdown optimization:
The Integration Service pushes as much transformation logic as possible to the source database. The
Integration Service analyzes the mapping from the source to the target or until it reaches a downstream
transformation it cannot push to the source database and executes the corresponding SELECT statement.
Using target-side pushdown optimization:
The Integration Service pushes as much transformation logic as possible to the target database. The
Integration Service analyzes the mapping from the target to the source or until it reaches an upstream
transformation it cannot push to the target database. It generates an INSERT, DELETE, or UPDATE
statement based on the transformation logic for each transformation it can push to the database and
executes the DML.
Using full pushdown optimization:
The Integration Service pushes as much transformation logic as possible to both source and target
databases. If you configure a session for full pushdown optimization, and the Integration Service cannot
push all the transformation logic to the database, it performs source-side or target-side pushdown
optimization instead. Also the source and target must be on the same database. The Integration Service
analyzes the mapping starting with the source and analyzes each transformation in the pipeline until it
analyzes the target. When it can push all transformation logic to the database, it generates an INSERT
SELECT statement to run on the database. The statement incorporates transformation logic from all the
transformations in the mapping. If the Integration Service can push only part of the transformation logic to
the database, it does not fail the session, it pushes as much transformation logic to the source and target
database as possible and then processes the remaining transformation logic.
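The way the Integration Service walks the pipeline and stops at the first transformation it cannot push can be sketched in Python (a conceptual model only; the names are invented, and the real analysis is far more involved):

```python
# Sketch of source-side pushdown analysis: transformations are pushed
# until the first one the database cannot execute; the rest run in the
# Integration Service.
def split_for_source_pushdown(pipeline, pushable):
    """pipeline: ordered transformation names; pushable: names the
    database can execute. Returns (pushed_to_db, run_in_infa)."""
    for i, t in enumerate(pipeline):
        if t not in pushable:
            return pipeline[:i], pipeline[i:]
    return pipeline, []

pipeline = ["SourceQualifier", "Aggregator", "Rank", "Expression"]
pushed, remaining = split_for_source_pushdown(
    pipeline, pushable={"SourceQualifier", "Aggregator", "Expression"})
print(pushed)     # ['SourceQualifier', 'Aggregator']
print(remaining)  # ['Rank', 'Expression']
```

This mirrors the Rank example below: everything upstream of the non-pushable Rank goes to the source database, and processing falls back to the Integration Service from that point on.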
For example, a mapping contains the following transformations:
SourceDefn -> SourceQualifier -> Aggregator -> Rank -> Expression -> TargetDefn
Here the Aggregator computes SUM(SAL) and SUM(COMM) grouped by DEPTNO, the Rank transformation
ranks on the SAL port, and the Expression computes TOTAL = SAL + COMM.
Image: Pushdown Optimization Example 2
The Rank transformation cannot be pushed to the database. If the session is configured for full pushdown
optimization, the Integration Service pushes the Source Qualifier transformation and the Aggregator
transformation to the source, processes the Rank transformation, and pushes the Expression transformation
and target to the target database.
When we use pushdown optimization, the Integration Service converts the expression in the transformation
or in the workflow link by determining equivalent operators, variables, and functions in the database. If there
is no equivalent operator, variable, or function, the Integration Service itself processes the transformation
logic. The Integration Service logs a message in the workflow log and the Pushdown Optimization Viewer
when it cannot push an expression to the database. Use the message to determine the reason why it could
not push the expression to the database.
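The fallback behavior described above can be pictured with a small Python sketch (the translation table here is purely illustrative, not the real operator mapping):

```python
# Sketch: an expression is pushable only if every function it uses has
# a database equivalent; otherwise the Integration Service keeps it.
INFA_TO_ORACLE = {"IIF": "CASE WHEN", "INSTR": "INSTR", "TO_DATE": "TO_DATE"}

def blocking_functions(functions_used, translation=INFA_TO_ORACLE):
    """Return the functions with no database equivalent (empty = pushable)."""
    return [f for f in functions_used if f not in translation]

print(blocking_functions(["IIF", "TO_DATE"]))    # [] -> expression can be pushed
print(blocking_functions(["IIF", "METAPHONE"]))  # ['METAPHONE'] -> Infa processes it
```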
Pushdown Optimization In Informatica - Pushdown Optimization in Integration Service
How does Integration Service handle Push Down Optimization?
To push transformation logic to a database, the Integration Service might create temporary objects in the
database. The Integration Service creates a temporary sequence object in the database to push Sequence
Generator transformation logic to the database. The Integration Service creates temporary views in the
database when it pushes a Source Qualifier transformation or a Lookup transformation with a SQL override,
an unconnected relational lookup, or a filtered lookup to the database.
1. To push Sequence Generator transformation logic to a database, we must configure the session
for pushdown optimization with Sequence.
2. To enable the Integration Service to create the view objects in the database, we must configure the
session for pushdown optimization with View.
3. After the database transaction completes, the Integration Service drops the sequence and view objects
created for pushdown optimization.
Pushdown Optimization In Informatica - Configuring Pushdown Optimization
Configuring Parameters for Pushdown Optimization
Depending on the database workload, we might want to use source-side, target-side, or full pushdown
optimization at different times, and for that we can use the $$PushdownConfig mapping parameter. The
settings in the $$PushdownConfig parameter override the pushdown optimization settings in the session
properties. Create the $$PushdownConfig parameter in the Mapping Designer, select $$PushdownConfig
for the Pushdown Optimization attribute in the session properties, and define the parameter in the
parameter file.
The possible values are:
1. None, i.e. the Integration Service itself processes all the transformations
2. Source [Seq View]
3. Target [Seq View]
4. Full [Seq View]
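A parameter file entry for this might look like the following sketch (the folder, workflow, and session names are hypothetical):

```
[MyFolder.WF:wf_load_sales.ST:s_m_load_sales]
$$PushdownConfig=Source [Seq View]
```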
Pushdown Optimization In Informatica - Using Pushdown Optimization Viewer
Pushdown Optimization Viewer
Use the Pushdown Optimization Viewer to examine the transformations that can be pushed to the database.
Select a pushdown option or pushdown group in the Pushdown Optimization Viewer to view the
corresponding SQL statement that is generated for the specified selections. When we select a pushdown
option or pushdown group, we do not change the pushdown configuration. To change the configuration, we
must update the pushdown option in the session properties.
Databases that support Informatica Pushdown Optimization
We can configure sessions for pushdown optimization against any of the following databases: Oracle,
IBM DB2, Teradata, Microsoft SQL Server, Sybase ASE, or databases that use ODBC drivers.
When we use native drivers, the Integration Service generates SQL statements using native database SQL.
When we use ODBC drivers, the Integration Service generates SQL statements using ANSI SQL. The
Integration Service can translate more functions when it generates SQL statements in the native database
language than in ANSI SQL.
Pushdown Optimization In Informatica - Pushdown Optimization Error Handling
Handling Error when Pushdown Optimization is enabled
When the Integration Service pushes transformation logic to the database, it cannot track errors that occur in
the database.
When the Integration Service runs a session configured for full pushdown optimization and an error occurs,
the database handles the errors. When the database handles errors, the Integration Service does not write
reject rows to the reject file.
If we configure a session for full pushdown optimization and the session fails, the Integration Service cannot
perform incremental recovery because the database processes the transformations. Instead, the database
rolls back the transactions. If the database server fails, it rolls back transactions when it restarts. If the
Integration Service fails, the database server rolls back the transaction.
Informatica Tuning - Step by Step Approach
This is the first of a number of articles in the series on Data Warehouse application performance tuning scheduled to come every week. This one is on Informatica performance tuning.
Please note that this article is intended to be a quick guide. A more detailed Informatica performance tuning
guide can be found here: Informatica Performance Tuning Complete Guide
Source Query/ General Query Tuning
1.1 Calculate the original query cost
1.2 Can the query be re-written to reduce cost?
- Can an IN clause be changed to EXISTS?
- Can a UNION be replaced with UNION ALL if we are not using any DISTINCT clause in the query?
- Is there a redundant table join that can be avoided?
- Can we include an additional WHERE clause to further limit data volume?
- Is there a redundant column used in GROUP BY that can be removed?
- Is there a redundant column selected in the query but not used anywhere in the mapping?
1.3 Check if all the major joining columns are indexed
1.4 Check if all the major filter conditions (WHERE clause) are indexed
- Can a function-based index improve performance further?
1.5 Check if any exclusive query hint reduces query cost
- Check if a parallel hint improves performance and reduces cost
1.6 Recalculate the query cost
- If the query cost is reduced, use the changed query
Tuning Informatica LookUp
2.1 Redundant Lookup transformation
- Is there a lookup which is no longer used in the mapping?
- If there are consecutive lookups, can those be replaced inside a single lookup override?
2.2 Lookup conditions
- Are all the lookup conditions indexed in the database? (uncached lookup only)
- An unequal condition should always be mentioned after an equal condition
2.3 Lookup override query
- Should follow all guidelines from the Source Query section above
2.4 Ensure there is no unnecessary column selected in the lookup (to reduce cache size)
2.5 Cached/Uncached
- Carefully consider whether the lookup should be cached or uncached; general guidelines:
- Generally do not use a cached lookup if the lookup table size is > 300 MB
- Generally do not use a cached lookup if the lookup table row count is > 2,000,000
- Generally do not use a cached lookup if the driving table (source table) row count is < 1,000
2.6 Persistent cache
- If the same lookup is cached and used in different mappings, consider a persistent cache
2.7 Lookup cache building
- Consider "Additional Concurrent Pipelines" in the session properties to build caches concurrently
- "Prebuild Lookup Cache" should be enabled only if the lookup is surely called in the mapping
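The cached/uncached guidelines above can be condensed into a sketch decision helper (Python used only to state the rule of thumb; the thresholds come from the checklist above and are guidelines, not hard limits):

```python
# Rule-of-thumb helper: should this lookup be cached?
def should_cache_lookup(table_mb, table_rows, driving_rows):
    if table_mb > 300:            # lookup table too large to cache
        return False
    if table_rows > 2_000_000:    # too many rows to cache
        return False
    if driving_rows < 1_000:      # too few source rows to justify a cache
        return False
    return True

print(should_cache_lookup(table_mb=50, table_rows=100_000, driving_rows=50_000))   # True
print(should_cache_lookup(table_mb=500, table_rows=100_000, driving_rows=50_000))  # False
```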
Tuning Informatica Joiner
3.1 Unless unavoidable, join database tables in the database only (homogeneous join) and do not use a Joiner
3.2 If an Informatica Joiner is used, always use sorted input and try to sort the data in the SQ query itself using ORDER BY (if a Sorter transformation is used, make sure the Sorter has enough cache to perform a 1-pass sort)
3.3 The smaller of the two joining tables should be the master
Tuning Informatica Aggregator
4.1 When possible, sort the input for the Aggregator at the database end (ORDER BY clause)
4.2 If the input is not already sorted, use a Sorter; if possible, use the SQ query to sort the records
Tuning Informatica Filter
5.1 Unless unavoidable, filter at the source query in the Source Qualifier
5.2 Use the Filter as close to the source as possible
Tuning Informatica Sequence Generator
6.1 Cache the sequence generator
Setting Correct Informatica Session Level Properties
7.1 Disable "High Precision" if not required (High Precision allows decimals of up to 28 digits)
7.2 Use "Terse" mode for the tracing level
7.3 Enable pipeline partitioning (thumb rule: maximum no. of partitions = no. of CPUs/1.2; also remember that increasing partitions will multiply the cache memory requirement accordingly)
Tuning Informatica Expression
8.1 Use variables to reduce redundant calculations
8.2 Remove the default value ERROR('transformation error') for output columns
8.3 Try to reduce code complexity, like nested IIFs
8.4 Try to reduce unnecessary type conversions in calculations
Implementing Informatica Partitions
Why use Informatica Pipeline Partition?
Identification and elimination of performance bottlenecks will obviously optimize session performance. After
tuning all the mapping bottlenecks, we can further optimize session performance by increasing the number
of pipeline partitions in the session. Adding partitions can improve performance by utilizing more of the
system hardware while processing the session.
PowerCenter Informatica Pipeline Partition
Different Types of Informatica Partitions
We can define the following partition types: Database partitioning, Hash auto-keys, Hash user keys, Key
range, Pass-through, Round-robin.
Informatica Pipeline Partitioning Explained
Each mapping contains one or more pipelines. A pipeline consists of a source qualifier, all the
transformations and the target. When the Integration Service runs the session, it can achieve higher
performance by partitioning the pipeline and performing the extract, transformation, and load for each
partition in parallel.
A partition is a pipeline stage that executes in a single reader, transformation, or writer thread. The number
of partitions in any pipeline stage equals the number of threads in the stage. By default, the Integration
Service creates one partition in every pipeline stage. If we have the Informatica Partitioning option, we
can configure multiple partitions for a single pipeline stage.
Setting partition attributes includes partition points, the number of partitions, and the partition types. In the
session properties we can add or edit partition points. When we change partition points we can define the
partition type and add or delete partitions (i.e. change the number of partitions).
We can set the following attributes to partition a pipeline:
Partition point: Partition points mark thread boundaries and divide the pipeline into stages. A stage is a
section of a pipeline between any two partition points. The Integration Service redistributes rows of data at
partition points. When we add a partition point, we increase the number of pipeline stages by one.
Increasing the number of partitions or partition points increases the number of threads. We cannot create
partition points at Source instances or at Sequence Generator transformations.
Number of partitions: A partition is a pipeline stage that executes in a single thread. If we purchase the
Partitioning option, we can set the number of partitions at any partition point. When we add partitions, we
increase the number of processing threads, which can improve session performance. We can define up to
64 partitions at any partition point in a pipeline. When we increase or decrease the number of partitions at
any partition point, the Workflow Manager increases or decreases the number of partitions at all partition
points in the pipeline. The number of partitions remains consistent throughout the pipeline. The Integration
Service runs the partition threads concurrently.
Partition types: The Integration Service creates a default partition type at each partition point. If we have
the Partitioning option, we can change the partition type. The partition type controls how the Integration
Service distributes data among partitions at partition points. We can define the following partition types:
Database partitioning, Hash auto-keys, Hash user keys, Key range, Pass-through, Round-robin.
Database partitioning: The Integration Service queries the database system for table partition information.
It reads partitioned data from the corresponding nodes in the database.
Pass-through: The Integration Service processes data without redistributing rows among partitions. All
rows in a single partition stay in the partition after crossing a pass-through partition point. Choose pass-
through partitioning when we want to create an additional pipeline stage to improve performance, but do not
want to change the distribution of data across partitions.
Round-robin: The Integration Service distributes data evenly among all partitions. Use round-robin
partitioning when we want each partition to process approximately the same number of rows, i.e. for load
balancing.
Hash auto-keys: The Integration Service uses a hash function to group rows of data among partitions. The
Integration Service groups the data based on a partition key. The Integration Service uses all grouped or
sorted ports as a compound partition key. We may need to use hash auto-keys partitioning at Rank, Sorter,
and unsorted Aggregator transformations.
Hash user keys: The Integration Service uses a hash function to group rows of data among partitions. We
define the number of ports to generate the partition key.
Key range: The Integration Service distributes rows of data based on a port or set of ports that we define as
the partition key. For each port, we define a range of values. The Integration Service uses the key and
ranges to send rows to the appropriate partition. Use key range partitioning when the sources or targets in
the pipeline are partitioned by key range.
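The behavior of round-robin, hash, and key-range distribution can be sketched in Python (a conceptual model of how rows land in partitions, not PowerCenter code):

```python
# Sketch of three partition types: round-robin, hash keys, key range.
def round_robin(rows, n):
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)                   # even distribution
    return parts

def hash_keys(rows, n, key):
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)      # same key -> same partition
    return parts

def key_range(rows, ranges, key):
    # ranges: one (low, high) pair per partition, end-exclusive
    parts = [[] for _ in ranges]
    for row in rows:
        for i, (lo, hi) in enumerate(ranges):
            if lo <= row[key] < hi:
                parts[i].append(row)
                break
    return parts

rows = [{"DEPTNO": d} for d in (10, 20, 30, 40)]
print(round_robin(rows, 2))                          # two partitions of 2 rows each
print(key_range(rows, [(0, 25), (25, 99)], "DEPTNO"))  # 10,20 vs 30,40
```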
We cannot create a partition key for hash auto-keys, round-robin, or pass-through partitioning.
Add, delete, or edit partition points on the Partitions view on the Mapping tab of session properties of a
session in Workflow Manager.
The PowerCenter® Partitioning Option increases the performance of PowerCenter through parallel data
processing. This option provides a thread-based architecture and automatic data partitioning that optimizes
parallel processing on multiprocessor and grid-based hardware environments.
Implementing Informatica Persistent Cache
You must have noticed that the time Informatica takes to build the lookup cache can sometimes be quite
long, depending on the lookup table size/volume. Using a persistent cache, you may save a lot of this time.
This article describes how to do it.
What is Persistent Cache?
Lookups are cached by default in Informatica. This means that Informatica by default brings in the entire
data of the lookup table from database server to Informatica Server as a part of lookup cache building
activity during session run. If the lookup table is huge, this can take quite some time. Now consider
this scenario: what if you are looking up the same table multiple times using different lookups in different
mappings? Do you want to spend the time building the lookup cache again and again for each lookup?
Of course not! Just use the persistent cache option!
Yes, Lookup cache can be either non-persistent or persistent. The Integration Service saves or deletes
lookup cache files after a successful session run based on whether the Lookup cache is checked as
persistent or not.
Where and when we shall use persistent cache:
Suppose we have a lookup table with same lookup condition and return/output ports and the lookup table is
used many times in multiple mappings. Let us say a Customer Dimension table is used in many mappings to
populate the surrogate key in the fact tables based on their source system keys. Now if we cache the same
Customer Dimension table multiple times in multiple mappings that would definitely affect the SLA loading
timeline.
There can be some functional reasons also for selecting to use persistent cache. Please read the
article Advantage and Disadvantage of Persistent Cache Lookup to know how persistent cache can be used
to ensure data integrity in long running ETL sessions where underlying tables are also changing.
So the solution is to use Named Persistent Cache.
In the first mapping we will create the Named Persistent Cache file by setting three properties in the
Properties tab of Lookup transformation.
Lookup cache persistent: To be checked i.e. a Named Persistent Cache will be used.
Cache File Name Prefix: user_defined_cache_file_name i.e. the Named Persistent cache file name that will
be used in all the other mappings using the same lookup table. Enter the prefix name only. Do not enter .idx
or .dat
Re-cache from lookup source: To be checked i.e. the Named Persistent Cache file will be rebuilt or
refreshed with the current data of the lookup table.
Next in all the mappings where we want to use the same already built Named Persistent Cache we need to
set two properties in the Properties tab of Lookup transformation.
Lookup cache persistent: To be checked, i.e. the lookup will use a Named Persistent Cache that is
already saved in the Cache Directory; if the cache file is not there, the session will not fail, it will just create
the cache file instead.
Cache File Name Prefix: user_defined_cache_file_name i.e. the Named Persistent cache file name that
was defined in the mapping where the persistent cache file was created.
Note:
If there is any Lookup SQL Override, then the SQL statement in all the lookups should match exactly;
even an extra blank space will fail the session that is using the already built persistent cache file.
So if the incoming source data volume is high, the lookup table's data volume that needs to be cached is
also high, and the same lookup table is used in many mappings, then the best way to handle the situation is
to use an already created, one-time-built named persistent cache.
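The named persistent cache idea, build once under a user-defined prefix and reuse later, can be sketched in Python (the file naming and build function are invented for illustration; real cache files are binary .idx/.dat structures managed by the Integration Service):

```python
# Sketch: build the lookup cache once, save it under a prefix, and let
# later "sessions" reuse the saved file instead of hitting the database.
import os
import pickle
import tempfile

CACHE_DIR = tempfile.gettempdir()   # stands in for $PMCacheDir

def get_lookup_cache(prefix, build_fn, recache=False):
    path = os.path.join(CACHE_DIR, prefix + ".dat")
    if recache or not os.path.exists(path):
        cache = build_fn()                      # expensive: queries the lookup table
        with open(path, "wb") as f:
            pickle.dump(cache, f)
        return cache
    with open(path, "rb") as f:                 # cheap: reuse the saved cache
        return pickle.load(f)

calls = []
def build():
    calls.append(1)
    return {101: "CUST_A", 102: "CUST_B"}

c1 = get_lookup_cache("cust_dim_cache", build, recache=True)   # first mapping rebuilds
c2 = get_lookup_cache("cust_dim_cache", build)                 # later mappings reuse
print(len(calls))  # 1 -> the "database" was hit only once
```

The recache flag plays the role of "Re-cache from lookup source": checked in the mapping that refreshes the file, unchecked everywhere the file is merely consumed.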
Aggregation without Informatica Aggregator
Since Informatica processes data row by row, it is generally possible to handle a data aggregation operation
even without an Aggregator transformation. In certain cases, you may get a huge performance gain using
this technique!
General Idea of Aggregation without Aggregator Transformation
Let us take an example: Suppose we want to find the SUM of SALARY for Each Department of the
Employee Table. The SQL query for this would be:
SELECT DEPTNO,SUM(SALARY) FROM EMP_SRC GROUP BY DEPTNO;
If we need to implement this in Informatica, it would be very easy, as we would obviously go for an
Aggregator transformation. By taking the DEPTNO port as GROUP BY and one output port as
SUM(SALARY), the problem can be solved easily.
Now the trick is to use only an Expression transformation to achieve the functionality of the Aggregator.
We would use the ability of the expression transformation to hold the value of an attribute of the previous
row.
But wait... why would we do this? Aren't we complicating things here?
Yes, we are. But as it turns out, in many cases, it may have a performance benefit (especially if the input is
already sorted or when you know the input data will not violate the order, like when you are loading daily
data and want to sort it by day). Remember that Informatica holds all the rows in the Aggregator cache for
the aggregation operation. This needs time and cache space, and it also voids the normal row-by-row
processing in Informatica. By replacing the Aggregator with an Expression, we reduce the cache space
requirement and ease row-by-row processing. The mapping below shows how to do this.
Image: Aggregation with Expression and Sorter 1
Sorter (SRT_SAL) Ports Tab
Now I am showing a Sorter here just to illustrate the concept. If you already have sorted data from the source, you need not use it, thereby increasing the performance benefit.
Expression (EXP_SAL) Ports Tab
Image: Expression Ports Tab Properties
Sorter (SRT_SAL1) Ports Tab
Expression (EXP_SAL2) Ports Tab
Filter (FIL_SAL) Properties Tab
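The expression-plus-filter technique above can be sketched in Python: with the input sorted on DEPTNO, we hold the previous row's department in a "variable", keep a running sum, and emit one row per group (a conceptual model of the mapping, not generated code):

```python
# Sketch: aggregation without an Aggregator. Input must be sorted on
# DEPTNO; we accumulate a running SUM(SALARY) and emit a row whenever
# the group changes (the role the Filter plays in the mapping).
def aggregate_sorted(rows):
    out, prev_dept, running = [], None, 0
    for row in rows:
        if row["DEPTNO"] != prev_dept:
            if prev_dept is not None:
                out.append({"DEPTNO": prev_dept, "SUM_SALARY": running})
            running = 0                     # new group: reset the variable
        running += row["SALARY"]
        prev_dept = row["DEPTNO"]
    if prev_dept is not None:               # flush the final group
        out.append({"DEPTNO": prev_dept, "SUM_SALARY": running})
    return out

rows = [{"DEPTNO": 10, "SALARY": 1000},
        {"DEPTNO": 10, "SALARY": 2000},
        {"DEPTNO": 20, "SALARY": 1500}]
print(aggregate_sorted(rows))
# [{'DEPTNO': 10, 'SUM_SALARY': 3000}, {'DEPTNO': 20, 'SUM_SALARY': 1500}]
```

Note that unlike the Aggregator, this needs only one row's worth of state at a time, which is exactly why the cache requirement disappears.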
This is how we can implement aggregation without using Informatica aggregator transformation. Hope you
liked it!
What are the differences between Connected and Unconnected Lookup?
Connected Lookup:
- Participates in the dataflow and receives input directly from the pipeline
- Can use both dynamic and static cache
- Can return more than one column value (output ports)
- Caches all lookup columns
- Supports user-defined default values (i.e. the value to return when the lookup conditions are not satisfied)

Unconnected Lookup:
- Receives input values from the result of a :LKP expression in another transformation
- Cache can NOT be dynamic
- Can return only one column value, i.e. the return port
- Caches only the lookup output ports used in the lookup conditions and the return port
- Does not support user-defined default values
What is the difference between Router and Filter?
Router:
- Divides the incoming records into multiple groups based on some condition; such groups can be mutually inclusive (different groups may contain the same record)
- Does not itself block any record; if a certain record does not match any of the routing conditions, the record is routed to the default group
- Acts like a CASE.. WHEN statement in SQL (or a switch().. case statement in C)

Filter:
- Restricts or blocks the incoming record set based on one given condition
- Does not have a default group; if a record does not match the filter condition, the record is blocked
- Acts like a WHERE condition in SQL
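The behavioral difference can be sketched in Python (conceptual only; function names are invented):

```python
# Sketch: a filter drops non-matching rows, while a router sends every
# row somewhere (default group included) and groups may overlap.
def filter_rows(rows, cond):
    return [r for r in rows if cond(r)]        # non-matching rows are blocked

def route_rows(rows, groups):
    # groups: ordered {name: condition}; unmatched rows go to DEFAULT
    routed = {name: [] for name in groups}
    routed["DEFAULT"] = []
    for r in rows:
        matched = False
        for name, cond in groups.items():
            if cond(r):
                routed[name].append(r)          # groups can overlap
                matched = True
        if not matched:
            routed["DEFAULT"].append(r)
    return routed

rows = [{"DEPTNO": 10}, {"DEPTNO": 45}]
print(filter_rows(rows, lambda r: r["DEPTNO"] > 40))           # one row survives
print(route_rows(rows, {"HIGH": lambda r: r["DEPTNO"] > 40}))  # the other hits DEFAULT
```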
What can we do to improve the performance of Informatica Aggregator Transformation?
Aggregator performance improves dramatically if records are sorted before passing to the aggregator and
"sorted input" option under aggregator properties is checked. The record set should be sorted on those
columns that are used in Group By operation.
It is often a good idea to sort the record set at the database level (why?), e.g. inside a Source Qualifier
transformation, unless there is a chance that the already sorted records from the Source Qualifier can
become unsorted again before reaching the Aggregator.
What are the different lookup cache?
Lookups can be cached or uncached (no cache). A cached lookup can be either static or dynamic. A static
cache is one which does not modify the cache once it is built, and it remains the same during the session
run. On the other hand, a dynamic cache is refreshed during the session run by inserting or updating the
records in the cache based on the incoming source data.
A lookup cache can also be classified as persistent or non-persistent based on whether Informatica retains
the cache even after the session run is complete or not, respectively.
How can we update a record in target table without using Update strategy?
A target table can be updated without using an Update Strategy. For this, we need to define the key of the
target table at the Informatica level and then connect the key and the field we want to update in the
mapping target. At the session level, we should set the target property to "Update as Update" and check
the "Update" check-box.
Let's assume we have a target table "Customer" with fields "Customer ID", "Customer Name" and
"Customer Address". Suppose we want to update "Customer Address" without an Update Strategy. Then we
have to define "Customer ID" as the primary key at the Informatica level and connect the Customer ID
and Customer Address fields in the mapping. If the session properties are set correctly as described above,
the mapping will update the customer address field for all matching customer IDs.
Deleting duplicate row using Informatica
Q1. Suppose we have Duplicate records in Source System and we want to load only the unique records in
the Target System eliminating the duplicate rows. What will be the approach?
Ans.
Let us assume that the source system is a relational database and the source table has duplicate rows.
Now, to eliminate duplicate records, we can check the Distinct option of the Source Qualifier of the source
table and load the target accordingly.
Source Qualifier Transformation DISTINCT clause
But what if the source is a flat file? How can we remove the duplicates from flat file source?