
Performance Tuning Best Practices

Project Name:
Release: 2.2
Date: March 2018
Primary Author: Matthew Lee
Document Owner: Tony Young
Client:
Document Location:

Purpose: Provides best practice guidelines so that developers may make performance-tuning decisions that optimize TDV performance, while highlighting important considerations when making certain changes to query behavior for performance gains.

TIBCO Software empowers executives, developers, and business users with Fast Data solutions that make the right data available in real time for faster answers, better decisions, and smarter action. Over the past 15 years, thousands of businesses across the globe have relied on TIBCO technology to integrate their applications and ecosystems, analyze their data, and create real-time solutions. Learn how TIBCO turns data—big or small—into differentiation at www.tibco.com.

Professional Services
TIBCO Software Global Headquarters
3303 Hillview Avenue
Palo Alto, CA 94304
Tel: +1 650-846-1000 / +1 800-420-8450
Fax: +1 650-846-1005
www.tibco.com


Revision History

Version Date Author Comments

1.4 October 2012 Matthew Lee Initial revision

1.7 January 2013 Matthew Lee Corrections and updates for TDV 6.2.3

1.8 October 2013 Calvin Goodrich Added incremental caching section

1.9 April 2014 Matthew Lee Updated for TDV 6.2.6

2.0 April 2015 Matthew Lee Updated for TDV 7.0

2.1 June 2015 Matthew Lee Updated section on Netezza distribution columns. Additional corrections and minor content updates

2.2 March 2018 Deane Harding Updated with TIBCO branding

Approvals

This document requires the following approvals. Signed approval forms are filed in the project files.

Name Signature Title Company Date of Issue Version

Distribution

This document has been distributed to:

Name Title Company Date of Issue Version

Related Documents

This document is related to:

Document File Name Author

TDV Reference Manual

TDV User Guide


Copyright Notice

COPYRIGHT © TIBCO Software Inc. This document is unpublished and the foregoing notice is affixed to protect TIBCO Software Inc. in the event of inadvertent publication. All rights reserved. No part of this document may be reproduced in any form, including photocopying or transmission electronically to any computer, without prior written consent of TIBCO Software Inc. The information contained in this document is confidential and proprietary to TIBCO Software Inc. and may not be used or disclosed except as expressly authorized in writing by TIBCO Software Inc. Copyright protection includes material generated from our software programs displayed on the screen, such as icons, screen displays, and the like.

Trademarks

All brand and product names are trademarks or registered trademarks of their respective holders and are hereby acknowledged. Technologies described herein are either covered by existing patents or are the subject of patent applications in progress.

Confidentiality

The information in this document is subject to change without notice. This document contains information that is confidential and proprietary to TIBCO Software Inc. and its affiliates and may not be copied, published, or disclosed to others, or used for any purposes other than review, without written authorization of an officer of TIBCO Software Inc. Submission of this document does not represent a commitment to implement any portion of this specification in the products of the submitters.

Content Warranty

The information in this document is subject to change without notice. THIS DOCUMENT IS PROVIDED "AS IS" AND TIBCO MAKES NO WARRANTY, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO ALL WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. TIBCO Software Inc. shall not be liable for errors contained herein or for incidental or consequential damages in connection with the furnishing, performance or use of this material.

Export

This document and related technical data are subject to U.S. export control laws, including without limitation the U.S. Export Administration Act and its associated regulations, and may be subject to export or import regulations of other countries. You agree not to export or re-export this document in any form in violation of the applicable export or import laws of the United States or any foreign jurisdiction.

For more information, please contact:

TIBCO Software Inc. 3303 Hillview Avenue Palo Alto, CA 94304

USA


Table of Contents

1 Introduction
1.1 Purpose
1.2 Goals for Optimizing Queries
2 Execution Plans
2.1 Viewing the Execution Plan
2.2 Execute and Show Statistics
2.3 Nodes of an Execution Plan
3 Specifying JOIN Algorithms
3.1 Sort Merge
3.2 Hash Join
3.3 Nested Loop Join
3.4 Semi-join and Partitioned Semi-join
3.5 Procedure Join
3.6 Star Schema Semi-join
3.7 Outer Joins and Streaming
3.8 Inferred Where Clause Filters for Outer Joins
3.9 Null Rejecting Filters Cause Outer Joins to Convert to Inner Joins
3.10 Data Ship Join Optimization
4 Influencing Execution Plans
4.1 Rules Based Optimization
4.1.1 Influencing the Join Ordering
4.1.2 Check for Extra Join Nodes
4.1.3 SQL Script with TDV SQL Query Engine
4.1.4 Using Packaged Queries with the TDV SQL Query Engine
4.1.5 Using Parameterized Queries
4.1.6 Use SQL-92 Syntax
4.1.7 Fundamental Join Algorithms
4.1.8 Structuring Views to Enable Join Pruning
4.2 Cost Based Optimization
4.2.1 Statistical Processing
4.2.2 Cardinality Hints
5 Virtual Indexes
6 Caching
6.1 File Caching
6.2 Database Caching
6.3 File versus Database Caching
6.4 Supported Database Caching Models
6.4.1 Single Table / User Specified Caching
6.4.2 Index Limitations of Single Table Caching
6.4.3 Netezza Distribution Key Limitations for Single Table Caching
6.4.4 Cache Update Concurrency Limitations of Netezza
6.4.5 Multi-Table Caching
6.4.6 Limitations of Multi-Table Caching
6.4.7 Incremental Caching
6.4.8 Data Requirements of Incremental Caching
6.4.9 Limitations of Incremental Caching
6.5 Cache Loading Optimizations
6.5.1 Native Cache Loading
6.5.2 Parallel Cache Loading
6.5.3 Netezza Bulk Data Loading
7 Case Sensitivity and Trailing Spaces Mismatches
7.1 Overview of Case Sensitivity and Trailing Spaces Issue
7.2 Determining Whether Settings are Affecting Query Performance
7.3 Dealing with Settings Mismatches
8 Advanced Tuning Concepts
8.1 Externally Generated (Ad hoc) Queries
9 Conclusion


1 Introduction

1.1 Purpose

To ensure that the TIBCO Data Virtualization Server (TDV) is delivering data in the fastest and most efficient way possible, developers should consider tuning queries for performance. In some cases, there are trade-offs between the expected behavior of queries related to case sensitivity and trailing spaces and raw performance.

This document provides best practices guidelines so that developers may make performance-tuning decisions that optimize TDV performance, while highlighting important considerations when making certain changes to query behavior for performance gains.

The goals of TDV query optimization are clarified here so that more meaningful adjustments for performance optimization may be made.

Note that performance improvements may also be realized by changing the configuration properties of the TIBCO Data Virtualization Server to affect the run-time behavior of the query engine. Refer to the DVBU AS white paper Configuring the TIBCO Data Virtualization Server for details on setting the TDV server's configuration properties.

1.2 Goals for Optimizing Queries

Minimize Network Load

Problem: The finite network bandwidth between the data source(s) and the integration server is one of the primary limiting factors for integrating information from physically disparate data sources. Network and data transmission latency is the biggest limiting factor for performance when gathering data from disparate physical sources.

Goal: Retrieve data efficiently to minimize network traffic.

Traditional database query tuning reduces the amount of disk I/O required to satisfy queries; in much the same way, performance tuning for an integration server reduces the amount of network traffic caused by data retrieval from the sources. In databases, disk I/O latency is the most common performance-limiting factor. In information integration, data retrieval is limited by network bandwidth and speed, so efficient retrieval becomes a top priority and the first goal for performance optimization.

DBAs and developers analyze execution plans and query statistics looking for unneeded table scans and other inefficiencies. By modifying queries with SQL option hints, using appropriate table indexes, and by pushing work to the sources, TDV performance tuning causes less data to be passed across the network.

Leverage Data Source Efficiencies

Problem: By design, TDV does not store the data. Physical data sources are optimized to take advantage of the native data type definitions, indexes, and other system efficiencies, and TDV is designed to leverage those source efficiencies so that users get the fastest and most efficient method to retrieve the requested data. The developer must assess existing data source implementations to most effectively push processing to the sources.

Goal: Tune SQL and query execution plans to optimize use of native data source indexes, filtering, and sorting to pre-process data prior to data integration. Push as much processing to the physical data sources as is practical.

TDV should leverage the natural advantages inherent in processing data on the original data sources. The original data sources can often execute joins and functions faster than TDV, since the data is both local and indexed. Sorting enables additional processing efficiencies that make join integration incrementally faster. Pushing work to the physical data sources means fewer rows need to be returned over the network, which reduces both network and processing overhead in TDV.

Minimize Memory Usage

Problem: TDV has a finite amount of processing power and memory that must be managed and shared across all running queries and procedures. Each active request requires some part of that memory for execution. A poorly tuned query may require much more memory than a well-tuned query, forcing the query or other users' queries to use file-based memory or to sit in a wait queue. Requests using file-based memory take much longer to process and may place heavy processing load on the TDV server as results are swapped to disk.

Goal: Efficiently distribute query work by pushing operations to the original data sources. While a TDV instance can simultaneously serve many hundreds of well-tuned requests, when processing usage rises above a minimum memory threshold imposed by JRE limits and a TDV configurable safety factor, requests are swapped out to file-based memory or are sent to a wait queue to wait for release of processing resources. A few poorly written requests could occupy memory resources, holding back other requests by reducing processing capacity.

TDV Solution Architecture

Problem: Data must be integrated from different data sources that have completely different scopes, sizes, data architectures, and data types. To further complicate the task: JDBC and ODBC client applications must be able to request and retrieve data without foreknowledge of data source idiosyncrasies.

Solution: TDV automatically adjusts messaging and interactions with different databases to gracefully handle data retrieval from the sources. Some functionality and certain settings may be used on some data sources and not with others. Regardless of the database functionality and handling, it is always best to manually review queries and execution plans to take advantage of potential data source efficiencies.

TDV uses multiple strategies to accomplish these goals, but well-formed and efficient SQL execution plans will yield the greatest performance gains by pushing operations to the data sources. Use the Execution Plan for any given SQL to review, analyze, and modify how TDV breaks down the query into individual tasks.


2 Execution Plans

For distributed queries, performance depends very much on the query execution plan. Execution plans are generated from the SQL and strongly influenced by the SQL options, or execution plan hints, which are written by the developer and used to select optimal join algorithms. A lot of TDV performance tuning is accomplished by rewriting SQL select statements to force the generation of a more optimal query execution plan.

SQL query tuning requires inspection, evaluation, and revision of the associated execution plan. Execute and analyze SQL query plans to see whether changes might more effectively push equality or comparison conditions for filtering, sorting, or otherwise processing data for consumption.

How do you know how much is being pushed? Look at the query's execution plan (EP). Each node that you see in the EP represents some work that is being done in TDV. So you want as few nodes as possible.

2.1 Viewing the Execution Plan

Open a view in TIBCO Data Virtualization Studio and choose the "Show Execution Plan" button to see what operations TDV will perform to generate that view.

The TDV Execution Plan enables insight into the actions taken to implement a SQL request. In this respect TDV is no different from most databases, which provide a tool or command that allows a developer or DBA to see the execution plan. For instance, people working with DB2 may use "Visual Explain", while people working with Oracle would use the command "EXPLAIN PLAN". While every vendor's details are a little different, all these tools are conceptually very similar.

2.2 Execute and Show Statistics

In the Execution Plan pane, use the "Execute and Show Statistics" button to execute the query, display how many rows were produced by each node, and display the percentage of the overall execution time that was required for that execution path.

2.3 Nodes of an Execution Plan

The execution plan is displayed as a tree of nodes, where each node represents a unit of work for TDV. Execution plan nodes represent local TDV work, with the only exception being the FETCH node, which consists of SQL passed to the physical data source.

For performance optimization the fewer nodes present in the execution plan the better, as it is almost always faster for the physical data sources to execute tasks leveraging indexes, localized data, and other efficiencies rather than transmitting data en masse to be processed by TDV.

Some of the nodes have one child, some have two children, and some have no children at all. The childless nodes actually generate or retrieve data and produce a row set. The nodes with one child transform a row set in some way, and the nodes with two children merge two row sets into a single row set. Select any node to get more detailed information about it.
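As an illustrative sketch (view and source names hypothetical), a simple two-source join might produce a tree like this, with a JOIN node merging two childless FETCH nodes and a SELECT node projecting the result:

SELECT
  JOIN
    FETCH (SELECT ... FROM ordersDB.ORDERS)
    FETCH (SELECT ... FROM crmDB.CUSTOMERS)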

The bottom of the tree will always have childless nodes such as FETCH or PROCEDURE.

• FETCH - executes a SQL select statement against some data source. Selecting a FETCH node displays the exact SQL being pushed to the underlying physical data source. To find other potential efficiencies, copy this SQL into the native data source 'explain' tool to analyze it further.

• PROCEDURE - executes a stored procedure and returns its results as a row set. Procedures cannot be broken down further and must return all rows. Packaged Queries, Excel data retrieval, and Web Services require execution as a single node.

Six common nodes perform operations on a single child node: SELECT, FILTER, GROUP BY, ORDER BY, DISTINCT, and FUNCTION.

• SELECT - takes a row set and transforms it into a new row set with the same number of rows. Columns, however, may be renamed, and derived columns may be calculated in a SELECT node.

• FILTER - applies a filter condition to each row, either passing it on or discarding it. All FILTER nodes are processed by TDV and are not passed to the physical data source. Filter conditions include expressions specified in a WHERE or a HAVING clause. For instance, applying a function that is not supported by the underlying source to the result of a query will cause the server to apply the filter on those results, which is typically costly. It is extremely important to analyze and understand the filter nodes. The question to ask when you see a filter node is: is the function being evaluated on the server a known function on the underlying database; that is, can it be pushed? (See the sketch after this list.)

• GROUP BY - takes a row set and aggregates it.

• ORDER BY - takes a row set and sorts it.

• DISTINCT - takes a row set and removes duplicates.


• FUNCTION - indicates that TDV is performing a function call on one or more columns in a row set. Ordinarily, it's

preferable to see these function nodes pushed to a data source. However, when a function is being applied to a column in every row in the ultimate result set, TDV will sometimes decide to perform the function itself, even if pushing the function is possible. This situation does not generally affect query performance.
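As an illustrative sketch (table and function names hypothetical), consider a WHERE clause that mixes a predicate the source understands with one it does not. TDV can typically push the supported predicate into the FETCH, while the unsupported one remains behind as a FILTER node evaluated in TDV:

SELECT *
FROM ordersDB.ORDERS
WHERE REGION = 'EMEA'                     -- known to the source; pushed inside the FETCH
AND myCustomFiscalWeek(ORDER_DATE) = 23   -- not known to the source; evaluated in a TDV FILTER node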

Four common nodes merge two child row sets into a single row set: EXCEPT, INTERSECT, JOIN and UNION.

• EXCEPT - generates a row set where a row appears in the first source row set, but not in the other.

• INTERSECT - generates a row set where a row appears in both source row sets.

• JOIN - takes two row sets and merges them into one row set where each row reflects data from both source row sets. It can implement any of several join operations: INNER JOIN, LEFT/RIGHT/FULL OUTER JOIN, and CROSS JOIN. The JOIN algorithm may also be optimized by specifying the best fundamental join algorithm as an {option} of the JOIN. Double-click any JOIN node to view and modify the JOIN properties.

• UNION - generates a row set where each row comes entirely from one or the other source row set.


3 Specifying JOIN Algorithms

Specify the preferred join algorithm using an {option <algorithm>} hint in the SQL. The join algorithm option is specified within the SQL with text like "{option HASH}" or "{option SEMIJOIN}" appearing just before the JOIN keyword. The execution plan will always use the specified JOIN algorithm option when valid.
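For example (table names hypothetical), a hash join can be requested by placing the hint just before the JOIN keyword:

SELECT o.ORDER_ID, c.CUSTOMER_NAME
FROM ORDERS o {option HASH} JOIN CUSTOMERS c
ON o.CUST_ID = c.CUST_ID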

Note: Join algorithm options may also be specified using the Join Properties pane in Studio. Call up the Join Properties pane by double-clicking any JOIN line in the execution plan model. The join algorithm can then be chosen by UI selection instead of manually adding option hints to the SQL. If you change the algorithm in the Join Properties pane, the appropriate query hint will be added to the SQL automatically.

Recommendation: Developers should familiarize themselves with all the available join algorithms so that the optimal or correct join may be used in a given situation.

3.1 Sort Merge

The sort merge is a streaming join that relies on the underlying data sources to pre-sort the data, so it can only be used to join two FETCH nodes. An ORDER BY clause based on the join criteria is added to the SQL under each side of the join, so only a few rows from each side need to be maintained in memory at any time. Consequently, this algorithm uses very little memory and should perform better than a hash join up to about 500,000 rows, depending on data source indexing and buffer sizes.
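Conceptually (table and column names hypothetical), a sort merge join causes each side's FETCH to be rewritten so that the sources return pre-sorted rows:

SELECT CUST_ID, ORDER_TOTAL FROM ORDERS ORDER BY CUST_ID    -- pushed to source 1
SELECT CUST_ID, CUST_NAME FROM CUSTOMERS ORDER BY CUST_ID   -- pushed to source 2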

SQL usage hint: {option SORTMERGE} JOIN

Recommendation: When both sides are relational sources that accept ORDER BY clauses and the result sets aren't too big, sort merge is the default join algorithm. Try both the sort merge and hash joins to compare the results when the table sizes are large.

There are two weaknesses, however, with the sort merge:

First, the underlying sources must be able to evaluate an ORDER BY clause. Some databases are better than others and proper indexing may have a large impact on the join. If TDV needs to sort the rows itself, a hash join might as well be used.

Second, most databases perform poorly when asked to sort more rows than can fit in the database’s sort buffer. Buffers are typically sized to accommodate between 100,000 and 300,000 rows. When fetching the next row after exhausting the buffer, the database will re-execute the query, discard the first n rows, and then load the buffer with the next group. Sorting a million rows potentially means evaluating the query 10 times.

For these reasons, it is best to avoid a sort merge for large data sets.

3.2 Hash Join

The hash join is one of the most common join algorithms, and it is fairly efficient. The left side of the join is completely evaluated and loaded into an in-memory table. At the same time, a hash-based index is calculated on the columns that form the join criteria. After the left side of the join has completed evaluation, the right side is evaluated one row at a time against the hash. For each row, the hash of the join criteria is calculated and a quick lookup is done to find the matching row or rows from the in-memory table. Each successfully joined row is streamed up as it is generated.


Note that where statistics or cardinality estimates are available for both sides of a join, TDV will automatically reorder the join to hash the view or table with the smallest cardinality. It is essential that statistics and cardinality estimates are kept accurate to ensure that the query engine does not choose to hash the wrong view based on incorrect cardinality information.

SQL usage hint: {option HASH} JOIN

Recommendation: Provided that the size of the left side is reasonable, the hash join performs very well. The relative quickness of hash lookup means that this join algorithm scales well for large right sides. Be aware that this join algorithm does tend to use more memory than sort merge due to the need to build an in-memory table.

3.3 Nested Loop Join

The nested loop join is the least efficient join algorithm available, but it is occasionally necessary for evaluating inequality join conditions. The nested loop join works by completely retrieving the row set on the left side of the join, then retrieving one row at a time from the right hand side and comparing the values in that row with each and every row in the in-memory table, which can get quite large. The number of comparisons is equal to the product of the cardinalities of the two sides.

Fortunately, the nested loop join comes up fairly infrequently in real life. Most join criteria are based on equality conditions, which can be more efficiently evaluated by other algorithms.

The left side of the join is completely evaluated and loaded in memory, and then the right side is retrieved one row at a time.

SQL usage hint: {option NESTEDLOOP} JOIN

Recommendation: Set the data source table with the smaller cardinality as the LHS to reduce the memory footprint.

If the larger-cardinality side of the join is not very large and the other side is significantly slower to respond, memory isn't critical and the time to first row may be reduced by putting the faster data source on the LHS. The entire query will probably still take about as long, but the client will get the first rows more quickly, because the entire LHS must be loaded before the RHS starts streaming.

3.4 Semi-join and Partitioned Semi-join

The semi-join is a very fast optimization that reduces the number of rows retrieved from the RHS by rewriting the FETCH pushed to the second data source with selective criteria built from the join values of the LHS. While the other join algorithms can be found in traditional database products, the semi-join is exclusively an information integration tool.

In the semi-join, the left side is evaluated and loaded into an in-memory hash table. Then the cardinality is evaluated based on available statistics or cardinality estimates. If the cardinality is small enough, an IN clause or an OR expression is created containing all the values in the join criteria from the left side. That is then appended to the WHERE clause on the right hand side and pushed to the database. In this way, only rows with matches are retrieved from the right side.
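As an illustrative sketch (table names hypothetical), a semi-join between a small dimension table and a large fact table rewrites the right hand FETCH with the values found on the left:

SELECT d.NAME, f.AMOUNT
FROM SMALL_DIM d {option SEMIJOIN} JOIN BIG_FACT f
ON d.ID = f.DIM_ID

-- FETCH pushed to the fact table's source, assuming the LHS returned the ids 3, 17, and 42:
SELECT DIM_ID, AMOUNT FROM BIG_FACT WHERE DIM_ID IN (3, 17, 42)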


Because various database vendors restrict how large a SQL statement or IN clause can be, the semi-join is restricted by a configurable default to a left hand side cardinality of 100 or less. If the cardinality is larger, a partitioned semi-join may be attempted, where the IN list is broken up into chunks of 100 or fewer and multiple queries are executed against the right hand source. If the cardinality is too large, the system will fall back to a hash join.
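For example (continuing the hypothetical tables above), a left hand side of 250 values could be satisfied by three partitioned queries against the right hand source:

SELECT DIM_ID, AMOUNT FROM BIG_FACT WHERE DIM_ID IN (<values 1-100>)
SELECT DIM_ID, AMOUNT FROM BIG_FACT WHERE DIM_ID IN (<values 101-200>)
SELECT DIM_ID, AMOUNT FROM BIG_FACT WHERE DIM_ID IN (<values 201-250>)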

Note that the global default cardinality and minimum ratio settings for semi-join can be overridden for individual data sources in the "Advanced" tab of the data source's Connection Information. Refer to the TIBCO Data Virtualization User's Guide section on the semi-join optimization operation for more information.

Be aware that the TDV query engine may abandon use of the semi-join algorithm at run time if the actual cardinality of the result set returned from the left hand side exceeds the configured default cardinality for a semi-join. In this case, TDV reverts to using the hash-join algorithm. The query engine will not abandon the semi-join algorithm if the view contains a query hint specifying a semi-join should be used.

SQL usage hint: {option SEMIJOIN} JOIN

The semi-join can only be attempted if the right hand side may be queried as a single node that fetches against a data source supporting an IN or an OR clause.

Recommendation: Consider overriding the default semi-join settings for individual data sources to improve the chances of semi-join being chosen and to optimize the performance of the semi-join for different database platforms.

3.5 Procedure Join

The procedure join optimization allows developers to define a join between a table and a procedure and is logically equivalent to performing a semi-join between two tables where the procedure is located on the right hand side of the join. The procedure is called once for each unique set of input values retrieved from the table on the left side of the join. Results returned by the procedure for each distinct set of input values are retained in memory for the duration of the query, eliminating the need to re-execute the procedure for repeated input values.

SQL usage:

<table_expression>
[LEFT OUTER | RIGHT OUTER | INNER | FULL OUTER] PROCEDURE JOIN
<procedure> ProcedureAlias
ON <condition_expression>

Example:

(T1 LEFT OUTER JOIN T2 ON T1.x = T2.x)
INNER PROCEDURE JOIN
MyProc(T1.y+T2.y) P1 ON (T1.z = P1.z)

3.6 Star Schema Semi-join

The star schema semi-join optimization allows TDV to execute a view that contains multiple join operations on the same field, using semi-joins for each join, subject to the same criteria as a normal semi-join operation. All join sources must return data from the same underlying data source.


While this can allow for faster execution of a view containing multiple joins that qualify to be evaluated as semi-joins, the star schema semi-join operation can very quickly overwhelm a data source that is not very powerful or where one or more of the join sources return a large number of rows.

For this reason, star schema semi-join support is not enabled by default in TDV and must be explicitly enabled for a specific data source to guard against unintentionally overwhelming that source.

SQL usage hint: {option SEMIJOIN} JOIN

The star schema semi-join can only be attempted if the data source used has star schema semi-join support specifically enabled and the view’s joins meet the criteria for attempting a semi-join.

Recommendation: Star schema semi-join should only be used for views whose sources return a fairly small number of rows or that make use of a very powerful data source. It is strongly recommended that a view using star schema semi-join be thoroughly performance tested on production volumes of data to ensure that the data source used is not overwhelmed.

3.7 Outer Joins and Streaming

TDV attempts to maintain as little data in memory as possible, streaming data whenever it can. If at any time join memory consumption exceeds the allotted memory, processing will spool to disk with roughly a 10x performance decrease. For this reason, most outer joins are evaluated as follows:

First, LEFT OUTER JOINS are rewritten as RIGHT OUTER JOINS.

Next, the left side (the optional side) is completely evaluated and loaded into an in-memory table.

Finally, the right side is retrieved one row at a time. If a match is found in the in-memory table, it is merged and the combined row is emitted. However, if no match is found, then it is emitted immediately.

3.8 Inferred Where Clause Filters for Outer Joins

When resolving an outer join, TDV will automatically push any WHERE clause filters defined on join criteria columns from one side of the outer join to the join criteria column on the other view in the join.

For example, the following query:

SELECT *
FROM CUST_ORDER RIGHT OUTER JOIN ORDERS
ON CUST_ORDER.CUST_ID = ORDERS.CUST_ID
WHERE ORDERS.CUST_ID = 244 AND
CUST_ORDER.ORDER_TIMESTAMP = '2015-01-01 00:00:00'

will be rewritten as:

SELECT *
FROM CUST_ORDER RIGHT OUTER JOIN ORDERS
ON CUST_ORDER.CUST_ID = ORDERS.CUST_ID
WHERE ORDERS.CUST_ID = 244 AND CUST_ORDER.CUST_ID = 244 AND
CUST_ORDER.ORDER_TIMESTAMP = '2015-01-01 00:00:00'

Note that TDV will automatically apply appropriate casts to inferred filters on date and timestamp columns.

3.9 Null Rejecting Filters Cause Outer Joins to Convert to Inner Joins

TDV will automatically convert a left, right, or full outer join between two views or tables to an inner join at execution time if a null rejecting WHERE clause filter is applied to one or more columns of the outer table.

For example, the query:

SELECT *
FROM ORDERS LEFT OUTER JOIN CUST_ORDER
ON ORDERS.CUST_ID = CUST_ORDER.CUST_ID
WHERE CUST_ORDER.CUST_ORDER_STATUS = 'CANCELLED'

will be rewritten as:

SELECT *
FROM ORDERS INNER JOIN CUST_ORDER
ON ORDERS.CUST_ID = CUST_ORDER.CUST_ID
WHERE CUST_ORDER.CUST_ORDER_STATUS = 'CANCELLED'

Developers who wish to prevent the outer join from being converted to an inner join should restructure their query to place the filter inline with the outer table, as in the following example:

SELECT *
FROM ORDERS LEFT OUTER JOIN (
    SELECT *
    FROM CUST_ORDER
    WHERE CUST_ORDER.CUST_ORDER_STATUS = 'CANCELLED'
) filteredCustOrders
ON ORDERS.CUST_ID = filteredCustOrders.CUST_ID

3.10 Data Ship Join Optimization

The data ship optimization accomplishes efficient federated query execution by transforming federated queries into locally executed SQL operations. When a SQL operation involves a very large table (with hundreds of thousands or millions of rows) and a small table with significantly fewer rows, the TDV instance can enhance query performance by shipping a copy of the smaller table to the data source of the larger table. This optimization can yield significantly faster results than a traditional federated query.
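Conceptually (table and temporary table names hypothetical, and the exact mechanism varies by target database), data ship moves the small side to the large side's source and executes the join there:

-- Rows from the small source are shipped into a temporary table on the target source:
CREATE TEMPORARY TABLE TDV_SHIP_SMALL AS <rows fetched from the small source>

-- The join then executes entirely on the target source:
SELECT f.ORDER_ID, s.REGION_NAME
FROM BIG_FACT f INNER JOIN TDV_SHIP_SMALL s
ON f.REGION_ID = s.REGION_ID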


The following SQL operators can take advantage of the data ship optimization:

• Join

• Union

• Intersect

• Except

The data ship optimization is currently supported on the following data sources and targets:

• Microsoft SQL Server

• Netezza

• Oracle

• PostgreSQL

• Sybase IQ

• Teradata

• Vertica

Please consult the TIBCO Data Virtualization User Guide for the list of currently supported data source versions.

Recommendation: When executing a federated join between a small and a large table located on a data source that supports the data ship optimization, data ship can significantly improve the performance of the join by taking advantage of database optimizations. Improvements include potentially reducing the size of the result set returned to TDV, the amount of network overhead associated with the operation, and the amount of memory used on the TDV instance.

When considering the use of data ship, be aware of the following considerations:

The data ship optimization requires that statistics be gathered on both the source and target tables in TDV for the optimizer to accurately determine whether the data ship optimization should be used. Collection of statistics requires some additional configuration and does put some additional load on the data sources. Care must also be taken to ensure that statistics are regularly updated to reflect the current state of the data source.


Excessive use of the data ship optimization can place additional load on the target data source, impacting the performance and availability of the database. Developers should be careful to avoid overuse of data ship join for this reason.

For some target data sources TDV is not able to cancel a join operation that has been delegated to the data source. This means that a poorly performing join operation cannot be cancelled easily.

The case sensitivity and trailing space settings of the TDV instance and data ship target must match in order for TDV to successfully delegate the join operation to the data source. In the case of a mismatch, TDV will perform the join operation itself.

For these reasons use of the data ship optimization should be carefully considered before implementation.

Refer to the TIBCO Data Virtualization User’s Guide section on data ship performance optimization for more information on Data Ship Join.


4 Influencing Execution Plans

TDV uses a combination of both rules based optimization and cost based optimization.

Rules based optimization enables the TDV query engine to interpret SQL into an execution plan that efficiently retrieves data from disparate sources based upon rule interpretation. Option hints placed in the SQL and JOIN ordering help to optimize query execution plans, based upon automatic aggregation of multiple calls to the same data source and the ordering of table selection in the "SELECT... FROM..." clause, while also implementing the specific fundamental join algorithm option suggested by the developer.

Cost based optimization relies on a set of statistics that enumerate and categorize unique values gathered from the data source, or tracked from previous results, in order to estimate the actual cost (in time and resources) of executing a particular plan.

A description of both optimization strategies follows.

4.1 Rules Based Optimization

4.1.1 Influencing the Join Ordering

Consider the order of table loading when joining data from disparate sources. Faster response times may be obtained when filter conditions or more restrictive conditions are first pushed to the data sources and then quicker joins are performed on the smaller subset of data returned. Change the execution plan's table loading order by rephrasing the SQL with parentheses.

Parenthetical phrasing specifies execution plan order based on SQL in the FROM clause. The following two four-way joins illustrate how algebraic phrasing will determine the execution plan order of table joins. The join criteria are excluded for simplicity.

FROM A INNER JOIN ( B INNER JOIN ( C INNER JOIN D ))

The entire contents of both A and B will be loaded into memory. First C will be joined with D by some algorithm, and then each resulting row will be immediately joined with matching rows from B, and then each resulting row will be joined with matching rows from A.

This is referred to as chain joining and generally executes faster but consumes more memory. Chain joins are best used when result sets are relatively small or retrieval time of the data set is critical.


FROM ( A INNER JOIN B ) INNER JOIN ( C INNER JOIN D )

In this case, A will be joined with B and loaded into an in-memory table. Then C will be joined with D by some unspecified algorithm, and each resulting row is joined with the matching rows from the in-memory results of the inner join of A and B.

The first FETCH element under a JOIN is the left-hand side (LHS) and the second FETCH element is the right (RHS).

This is referred to as T joining and generally consumes less memory but runs slower than chain joining. T joins are best used for larger data sets that may consume a large amount of memory or when the fetch time for one or more data sources is significant.

4.1.2 Check for Extra Join Nodes

TDV's query engine by default attempts to merge all the SELECT/JOIN statements against a single data source into a single select. This optimization, called data source grouping, pushes the join operations to the database and will almost always result in fewer rows moved over the network and much faster response time. TDV evaluates many different permutations of a SQL statement, looking for the best way to combine the various pieces into a plan with the fewest possible FETCH nodes.

However, for even moderately complex SQL it is impossible to evaluate all permutations in a reasonable amount of time. Instead, TDV cuts the plan evaluation process short to get to work using the best plan found after a reasonable computation period. In these cases, the plan may show multiple fetch operations against the same database, and joins being evaluated in TDV rather than in the database. Manually reordering the joins often helps the query engine find a more efficient plan.

Recommendation: Manually reorder SQL so that fetches from the same database are physically close and the left-hand join is called first. Additionally, circular joins are more difficult to evaluate and seem to cause this issue fairly often. For instance, this type of join will pose problems with optimization:

FROM A INNER JOIN B
ON A.col1 = B.col1
INNER JOIN C
ON A.col1 = C.col1
AND B.col2 = C.col2

In this case A is joined to B, B is joined to C, and C is joined back to A, creating a circle. In the vast majority of cases, this expression could be rewritten as follows:


FROM A INNER JOIN B
ON A.col1 = B.col1
INNER JOIN C
ON B.col1 = C.col1
AND B.col2 = C.col2

Re-writing that type of join in the fashion shown will often generate better query plans.

4.1.3 SQL Script with TDV SQL Query Engine

SQL Script is the TDV procedural scripting language that provides finely tunable control over how the data is processed; however, use of SQL Script limits the ways in which the TDV Query Engine can manipulate the query. SQL Script cannot be pushed to the data source, so resource and network optimizations are not available for those parts. Note that queries executed inside a SQL Script are optimized within the scope of the script.

SQL Script may be successfully employed in many cases where control increases efficiencies. For example, when the left-hand side of a semi-join exceeds default size limitations, it may be better to implement SQL Script than to change the configurable semi-join maximum size settings.

Recommendation: Consider rewriting SQL Script that returns CURSOR type objects to return PIPES type objects instead if it is desirable to begin returning results before the entire result set has been retrieved. Be aware that the use of PIPES can place additional load on the TDV instance due to additional concurrency requirements.

Recommendation: To enable automatic SQL optimization at all levels of the query, consider avoiding SQL Script unless:

• The task can't be accomplished in a view
• Maintenance of a view delivering equivalent functionality would be more difficult than a SQL Script
• The task runs substantially faster in a procedural manner

4.1.4 Using Packaged Queries with the TDV SQL Query Engine

Packaged queries, like SQL Script, also reduce the ability of the SQL Query Engine to optimize a query. Even though they may prove more efficient for quick implementation, they must be executed as blocks.

Recommendation: Use packaged queries when the query requires very non-standard SQL, such as Oracle's tree-walking 'connect by' or 'start with' keywords.

If the packaged query contains only a single select operation, then enabling the Single Select checkbox in the packaged query will allow TDV to treat the packaged query as if it were a derived table, allowing the query engine to push filters down to the data source. See the TIBCO Data Virtualization User's Guide section on creating a packaged query for more information.


4.1.5 Using Parameterized Queries

Recommendation: Use parameterized queries at the client access / presentation layers when possible to prevent clients from issuing queries without reasonable predicates. This protects the source systems from unnecessary load.
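As an illustrative sketch (view path and predicates hypothetical), a client would bind values to placeholders rather than issue an unbounded select:

-- Avoid: SELECT * FROM /shared/sales/ordersView
-- Prefer:
SELECT ORDER_ID, ORDER_TOTAL
FROM /shared/sales/ordersView
WHERE ORDER_DATE >= ? AND REGION = ?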

4.1.6 Use SQL-92 Syntax

TDV evaluates SQL-92 join syntax more cleanly than other syntax.

Recommendation: Use the INNER JOIN ... ON form instead of separating tables with commas and placing the join criteria in the WHERE clause.
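For example (table names hypothetical):

-- Avoid: comma-separated tables with the join criteria in the WHERE clause
SELECT * FROM A, B WHERE A.col1 = B.col1

-- Prefer: SQL-92 join syntax
SELECT * FROM A INNER JOIN B ON A.col1 = B.col1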

4.1.7 Fundamental Join Algorithms

Now that the maximum amount of work has been pushed to the underlying systems, the next step is to analyze the impact of the join algorithm on the queries.

As discussed earlier in this document, TDV offers five join methods: Hash Join, Nested Loop Join, Sort Merge Join, Semi Join, and Data Ship Join. TDV will automatically choose a join algorithm based upon data source statistics (when available) or based upon an option hint specified by the view developer. While TDV will choose the join algorithm for you, the developer's knowledge of both the query and the data source should be used to validate or to preferentially bias the selection of the optimal join algorithm for that situation.

4.1.8 Structuring Views to Enable Join Pruning

Under the right circumstances, when a query is executed against a view that joins data from two or more tables or views, the TDV optimizer is capable of trimming references to any tables or views that are not actually used in the query. This feature is designed to improve performance and reduce the execution cost of a query by reducing retrieval of unneeded data.

Recommendation: Use right- or left-outer joins where the non-outer side is joined on a primary key. For example:

SELECT c.y
FROM (SELECT a.x, b.y
      FROM a RIGHT OUTER JOIN b
      ON a.primaryKey = b.col1
     ) c

is logically equivalent to

SELECT c.y
FROM (SELECT b.y
      FROM b
     ) c

which is much more efficient.

NOTE: The optimizer will not use index metadata, so putting "primary key" metadata in TDV on a joined resource will be ignored by the optimizer for the purposes of join pruning. Gathering statistics on the underlying data tables will yield a similar performance improvement to creating a primary key on a data column.

NOTE: The SELECT hint DISABLE_OUTER_ON_PK_REMOVER will tell the optimizer not to perform join pruning on the query.

4.2 Cost Based Optimization

Cost-based optimization uses statistical processing at scheduled off-peak intervals to provide realistic cardinality estimates so that the TDV query engine can generate efficient execution plans.

Cost based optimization estimates the cardinality of each FETCH and then orders the joins and chooses the join algorithms appropriately. This is most effective when statistical processing is enabled or cardinality estimates are provided for both data sources.

NOTE: Rule based optimization alone may yield sufficient performance when dealing with tables where the number of unique values in a particular column is nearly static or changes at a predictable rate.

Be aware that joins defined in an externally generated ad hoc query may not be able to take full advantage of rule based optimization and may require the use of cost based optimizations to be tuned.

See the section entitled Externally Generated (Ad hoc) Queries below under Advanced Tuning Concepts for information on tuning externally generated queries.

4.2.1 Statistical Processing

When the number of unique values in a column is known, the cost based optimizer can estimate the size of the result set returned from a SQL statement and make the query execution plan most efficient for joins and other logical processes that are pushed to the data source.

Statistics processing may be enabled for any TDV defined data source from the "Cardinality Statistics" tab of the data source properties window.

Developers can fine-tune statistics processing behavior for individual introspected tables from the table’s “Cardinality Statistics” tab. This option becomes available after enabling statistics processing for the parent data source.

Developers may schedule periodic statistics processing during off-peak hours to quantify unique values and ranges in the data source. If the cardinality of the data in the table is mostly static, then it may make sense to perform statistics processing manually just once during development.

Caution developers against overuse of statistics processing, which can place unnecessary load on the data source. Configurable administrative settings on the SQL Engine disable multiple consecutive data source statistics requests received within the first 30 minutes after a request has been processed. New data source statistics don’t necessarily change the cached execution plan, as changes in the SQL or turnover in the execution plan cache are the triggers for new execution plan creation. Query execution plans are cached so that requests may be executed without the computing overhead of analysis and execution plan generation.

Recommendation: Enable periodic statistics processing on data sources to inventory database cardinality if the cardinality is unknown, unpredictable, or subject to frequent changes. Consider defaulting statistics mode to table boundary statistics for a data source and selecting more granular statistics gathering modes for individual introspected tables. This will allow developers to gather only the level of statistics that are required, minimizing the impact statistics gathering has on the data source.

Usually developers will be better at determining when and how join order, semi-joins, and sort merge joins should be applied. But if the number of table rows returned in a runtime result set varies enough to create doubt about which query result component will have fewer rows, then the cost-based query optimizer and statistical analysis of the data sources will prove useful.

4.2.2 Cardinality Hints

If table cardinality is not specified, then the SQL Query Engine creates the execution plan based on the assumption that the table may have a million rows.

Recommendation: When relying on cost-based optimization, specify table cardinality estimates so the SQL Engine can create execution plans using the number of expected unique values.

Table and procedure cardinality estimates may be explicitly specified in the “Cardinality Statistics” tab of the resource. Specify the expected cardinality, and optionally provide maximum and minimum estimates of cardinality, so that when the table or procedure is joined with another, the execution plan will make a logical selection of the most appropriate join algorithm.

Cardinality estimates may also be set using the SQL option syntax to override or provide missing cardinality information, which the cost-based optimizer will then use for tuning. Cardinality can be provided with SQL such as “{option left_cardinality=55, right_cardinality=1250}” placed just before the JOIN keyword in the “SQL” tab of the view.
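For example, a minimal sketch showing the hint in context (the orders and customers tables are hypothetical; the option list and its placement immediately before the JOIN keyword follow the syntax described above):

SELECT
    o.order_id,
    c.name
FROM
    orders o {option left_cardinality=55, right_cardinality=1250} JOIN customers c
    ON o.customer_id = c.customer_id

Here the optimizer is told to expect roughly 55 rows on the left side and 1,250 on the right, which biases both join ordering and algorithm selection.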

Note that these options may also be set by choosing the pertinent options within the Join Properties pane in Studio, which can be called up by double-clicking the join line in the model. The left and right cardinality can be chosen here instead of adding hints to the SQL portion of the view. If you specify the cardinality in the Join Properties pane, the appropriate query hint will be added to the SQL.


5 Virtual Indexes

Recommendation: Establish virtual indexes for use by application clients. Virtual indexes can be used to expose indexes defined on source tables to JDBC and ODBC clients that connect to TDV. Developers can also define virtual indexes on columns that are known to contain only unique values. Client applications like Cognos and Business Objects require index information for best operation.

TDV and the SQL Query Engine generally do not make use of virtual index information internally. Queries that are pushed down to the data sources are generated to take advantage of native indexing, which enables direct retrieval of selected values.

Virtual index information is only used internally when creating multi-table caches and when publishing views as OData services.


6 Caching

TDV enables the definition of caches on views and procedures in order to temporarily store the results they return. While the primary goal of caching is to protect data sources from overuse, caching can often be used to improve performance by locally storing data for faster access.

Caching aids in improving performance for heavily federated views, which process large volumes of data to satisfy client requests. Procedures are typically used to implement complex business rules, which can result in significant execution times, and caching these results can improve performance. Caching can also reduce processing loads on TDV and underlying source systems. Performance gains can be realized by caching results returned from views that have a significant data retrieval time and for views or procedures that require significant execution time to compute joins or other results.

TDV allows developers to make use of either file-based or database-based caches. A description of both cache types follows.

Caution: Maintenance of caches needs to be carefully managed to ensure that cached data is refreshed frequently enough to meet service level agreements, and not so frequently that the refresh operations negatively impact the performance of the TDV instance or the data sources involved.

6.1 File Caching

A file cache can improve performance for a query that takes a long time to run, but a file cache can also significantly degrade performance when used improperly.

File Cache Limitations: File caches do not store index information, nor do they allow SQL logical operators and filters to be pushed to the data source. Since there is no index information, any selection of a subset of a view requires loading the cached view into TDV memory for a row-by-row scan. For this reason, file caches are well suited to instances where an entire data set needs to be read each time it is used, or where the data set is very small.

To illustrate how this can affect performance, suppose a view, V, represents a million rows and your query is

select name from V where id=1

If V is not cached, the "id=1" filter would be pushed down and TDV only needs to process a single row from the data source. If V is cached, TDV will be forced to read a million rows of data from the file cache (if there are 100 columns, all 100 columns are read into memory as well) and evaluate the filter on every single row to produce that single row. This is a lot of overhead!

CAUTION: Warn your developers about the limitations and the potential performance degradation when using a file cache in the wrong circumstances.

CAUTION: File caches are not recommended in clustered TDV environments. Each node must maintain its own copy of the cache. (With database caching, the cache is shared across all nodes.)

Recommendation: Consider creating file caches for views requiring a full table scan, or that return a small result set. File caches are not recommended for data requests that use SQL operators, logical WHERE conditions, or requests that would benefit from use of an indexed column. These are best handled using a database cache.


6.2 Database Caching

Database caching removes the major drawback of file caching. With a database cache, conditional operators on the cached data can be pushed to the database holding the cached data.

Recommendation: Allow for very fast query pushdown of joins by caching data from two different data sources into the same cache database.

Use a database-managed cache if queries will apply filters against the cache (leveraging indexed columns). Database-managed caches also allow the pushdown of aggregations to the underlying data source, letting you use the power of the underlying DBMS engine.
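As an illustration, consider a query against a hypothetical cached view (the path and column names are illustrative). With a database cache, the WHERE clause below is pushed to the cache database, where it can take advantage of an index on the region column; with a file cache, TDV would have to scan the entire cached dataset in memory instead:

SELECT
    account_name,
    balance
FROM
    /shared/cached/accounts_cache
WHERE
    region = 'EMEA'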

There are presently two drawbacks with database caching:

1) Result set and data table must match exactly. TDV will not allow caching a view to a table that does not have enough columns of the proper data types. This means that if any changes are made to the schema of an introspected table or TDV view that has caching configured, the cache table must be regenerated.

2) Case sensitivity and trailing space behavior mismatches between the TDV instance and the cache database may result in poor performance of cache retrieval and unexpected rows being returned in the result set. For these reasons, it is critical that the case sensitivity and trailing space behavior settings be kept in sync between the TDV instance and the cache database.

By default, TDV refreshes the cache in batches, preventing large transaction slowdowns.

6.3 File versus Database Caching

Consider the following when deciding whether to cache a view to a file or to a database.

• Compare relative refresh cost: For example, let's say you've got a 1000-row view which takes an hour to execute and you only need to refresh the information once a day. This would be a good use case for the file cache. If the data is relatively unchanging though costly to retrieve, and many different views are derived from that same data, then the view should probably be cached in a table to avoid issues due to concurrent data access.

• Size of dataset: The bigger the file cache, the more expensive it will be to use in queries. Remember, every access requires a full table scan.

• Filters: File caches do not allow filters to be pushed down, meaning that the TDV instance must completely load the cached dataset and apply filters in memory. If the view is usually used with filters, this can result in significant performance issues. Additionally, if filters applied to a cached dataset will greatly reduce the cardinality of the returned result set, consider using a database cache to speed up processing.

• Clustering: If the cache is going to be used in a clustered environment, database caching is recommended.

All caches may be set to refresh periodically based on an hourly, daily, or weekly interval. Caching may also be used to get snapshots of consistent data when the data source’s contents are volatile.

6.4 Supported Database Caching Models

TDV 6.2 Service Pack 1 (SP1) or later allows developers to define multi-table caches when configuring database-based caching for introspected tables and SQL views. Multi-table caching allows developers to create multiple cache target tables, each of which is used to store a single data snapshot. All prior versions of TDV support only single table caches.

Please note that all versions of TDV prior to 6.2 SP1 refer to Single Table caching as User Specified Caching.


6.4.1 Single Table / User Specified Caching

Single table caching is supported in all versions of TDV. In this model, a single database table is used to store the result set returned from a view; for a procedure, a single table is created to store all scalar values, and an additional table is created for each cursor returned by the procedure.

Single table caching is useful when caching relatively small result sets of data that do not need to be retained for long periods of time. If multiple snapshots of cache data need to be retained, or if a developer is implementing incremental caching, then single table caching should be used instead of multi-table caching.

This is the only option available when creating a cache for a procedure in all versions of TDV.

6.4.2 Index Limitations of Single Table Caching

When using single table caching, several snapshots of the result set returned by a view or procedure may be stored in the same target table. This can potentially reduce the effectiveness of database indexes on the cache table due to overlapping data. To distinguish between data from different snapshots, all single table cache targets include a cacheKey column, which contains a unique value identifying the data from each snapshot. It is strongly recommended that the cacheKey column be included in any index definitions on the underlying cache table.
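A minimal sketch of such an index, assuming a hypothetical cache_target table with a customer_id column; the key point is that cacheKey participates in the index:

CREATE INDEX idx_cache_lookup
    ON cache_target (cacheKey, customer_id);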

The absence of a database index on the cache target table will generally result in poor performance when fetching data from the cache, so it is generally advisable to maintain an index on the cache table to improve the response times of cache requests. However, a cache refresh operation will be slowed considerably if the cache target table has an index defined. It is highly recommended that all indexes present on a cache target be dropped prior to performing a refresh operation and recreated after the refresh completes. If requests for cache data are expected while a cache refresh operation is in progress, then it is up to the developer to determine whether the index should be retained to improve the performance of requests for cache data, or dropped to improve the performance of the cache refresh operation. To minimize such trade-offs, it is recommended that cache refreshes be scheduled during off hours.

Because of these performance constraints, TDV does not support automatic dropping and creation of database indexes during a cache refresh when using single table caching; it is up to the developer to create automated scripts to perform this operation.
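A sketch of the pattern such a script would execute against the cache database (the index and table names are hypothetical, the DROP INDEX syntax varies by database dialect, and the refresh itself is triggered through TDV rather than through this SQL):

-- Drop the index so the bulk load performed by the refresh is not slowed down.
DROP INDEX idx_cache_lookup;

-- Trigger or wait for the TDV cache refresh to complete here.

-- Recreate the index, with cacheKey included, so subsequent cache reads are fast.
CREATE INDEX idx_cache_lookup
    ON cache_target (cacheKey, customer_id);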

6.4.3 Netezza Distribution Key Limitations for Single Table Caching

Defining a cache on a Netezza database requires some additional work to avoid sub-optimal query performance. Instead of a table index, Netezza appliances use the concept of a distribution key to partition a data set across multiple physical servers, improving search performance by executing a parallel search across distributed subsets of the dataset contained in a table.

Distribution of a table’s dataset is based upon a hash generated using the unique values of the distribution key. It is critical to select a distribution key that contains sufficient unique values to produce optimal distribution of data across the Netezza cluster. Failure to select a sufficiently unique distribution key can result in data not being evenly distributed across the Netezza appliance, which will result in poor performance due to limited parallelism. Unlike a conventional index, creation of a distribution key is mandatory and can only be performed when a table is created. If a distribution key definition is not explicitly provided in the table creation DDL, Netezza will use the first column of the table. This is a concern when creating a cache table by directly executing the TDV system-generated DDL.

TDV single table caches use the cacheKey column to act as the primary key for identifying a unique cache snapshot. For this reason, the cacheKey value will always be the same for an entire cached dataset. This will result in poor distribution of a cache’s dataset across the Netezza appliance, which can result in long cache access times.


Recommendation: When creating cache storage tables on a Netezza database, implementations using TDV 6.2 or earlier should not directly execute the TDV-generated DDL for the cache table from TDV Studio. Instead, developers should copy the DDL text from Studio, modify the DDL to specify a distribution key that contains sufficient unique values, manually execute the DDL against the target cache database, and introspect the cache table back into TDV. Once the table has been introspected, it can be selected as a cache storage target as normal.
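A minimal sketch of such modified DDL (the non-key columns are hypothetical; DISTRIBUTE ON is the standard Netezza clause for specifying a distribution key, and cacheKey is deliberately not used as the key):

CREATE TABLE cache_target
(
    cacheKey    BIGINT,
    customer_id INTEGER,
    balance     NUMERIC(18,2)
)
DISTRIBUTE ON (customer_id);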

Implementations using TDV 7.0 and later allow developers to specify the columns on which to distribute a cache storage table directly in TDV Studio. Developers should create the cache storage table in Studio as normal, remembering to specify the columns to distribute on.

Note that this issue only applies to single table caches stored to a Netezza data source. Multi-table caches are not affected.

For more information about distribution keys, please consult your Netezza documentation.

6.4.4 Cache Update Concurrency Limitations of Netezza

Netezza databases are designed to efficiently support a very high concurrent volume of select operations and a low volume of writes or updates to stored data. As a consequence, Netezza supports only table-level locking, which prevents multiple processes from updating a single table at the same time. When a process obtains a table-level lock, no other processes may read from or update the contents of the locked table.

This can create issues for TDV caching implementations on Netezza when multiple cache refresh operations are executing at the same time. In this case it is likely that one or more cache refresh processes may fail while attempting to read from or update the cache_status table because another cache refresh process has locked it.

Recommendation: Cache implementations on Netezza should be limited to use cases that require very infrequent refresh operations to minimize the likelihood of experiencing concurrency issues. Caches that require more frequent updates should be implemented on a platform with support for row-level locking.

6.4.5 Multi-Table Caching

TDV 6.2 SP1 or later allows developers to configure multi-table cache storage for introspected tables and TDV views. Under this model, multiple identical cache target tables are created for a single cached view. When a cache refresh is performed, a single snapshot of cache data is loaded into one of the target tables. TDV automatically uses the most recently refreshed target table when servicing requests. This allows TDV to continue servicing cache requests while a cache refresh operation is in progress on another target table.

Under this model, indexes on cache target tables tend not to suffer from the same performance concerns outlined above, because only a single snapshot of the cached resource is retained in each table. Additionally, TDV does not require the presence of a cacheKey column in the target tables to distinguish between snapshots.

Because individual cache snapshots are stored in separate target tables, concerns about whether to retain a database index on the target table during refresh operations do not arise when using a multi-table cache. TDV will service requests for data from a different target table than the one being refreshed, so the index on the table servicing requests does not impact the refresh and may be kept. Under the multi-table model, TDV will automatically drop and re-create database indexes during a cache refresh operation. Indexes are added to target tables based on the virtual indexes defined on the TDV view being cached.

Multi-table caches are generally recommended when applying caching to views so long as the additional space required for keeping parallel copies of cached data is not a concern. It is not recommended to use multi-table caching when implementing an incremental cache.


6.4.6 Limitations of Multi-Table Caching

Multi-table caching is only available when configuring caching for TDV views or introspected tables. TDV does not support multi-table caching for procedures of any kind.

Multi-table caching may also be subject to additional specific limitations based on the database platform chosen to host the cache data. Please refer to the Caching Limitations section of the TDV User Guide for more information on platform specific limitations of multi-table caching.

6.4.7 Incremental Caching

Incremental caching is a mechanism that detects and applies only the changes made to the source data since the last time the cache was refreshed. For example, if 10,000 rows of data were cached, an incremental caching solution might detect that only 100 rows were added, updated, or deleted since the original refresh. Thus, during the next refresh, only 100 rows of data would be affected instead of all 10,000 rows.

Two different means of incremental refresh exist in TDV today: push- and pull-based refreshing.

Push-based incremental caching can be implemented on all versions of TDV, though this may require significant development work. With this refresh strategy, a changed-data detection mechanism notifies TDV when data has been updated in a data source. TDV then determines which caches need to be updated and updates them immediately. In this manner, changes to cache data are "pushed" to TDV. This strategy requires the use of Oracle GoldenGate as the change detection mechanism and TIBCO’s ESB as the message bus that queues changed-data notifications to TDV.

In TDV 6.2 SP1, a pull-based incremental caching framework was introduced to allow developers to create their own ways of detecting changed data and applying those changes to an existing target table. With this framework, two procedures are required: one to initialize the cache data in the same way that full data refreshes currently work in TDV, and one to detect and apply changed data to the cache after the initialization step has run. In this way, the refresh is initiated by, and changes are "pulled" to, TDV through the cached resource (with caching temporarily disabled) and applied to the existing cache data. This framework utilizes the same scheduling and expiration policies as the full refresh mechanism, so incremental refreshes can be scheduled to occur on a regular basis.
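A rough sketch of what the second (delta) procedure might look like in TDV SQL Script, assuming a hypothetical updated_at change-tracking column on the source table and a refresh_log table recording the last refresh time; the exact procedure signatures and contract expected by the framework should be taken from the TDV User Guide:

PROCEDURE applyChanges()
BEGIN
    DECLARE last_refresh TIMESTAMP;

    -- Determine when the cache was last brought up to date (hypothetical log table).
    SELECT max(refreshed_at) INTO last_refresh
    FROM /shared/caching/refresh_log;

    -- Apply rows inserted or updated since the last refresh to the cache target.
    -- Deletes would require equivalent handling based on the change-tracking
    -- scheme available on the source.
    INSERT INTO /shared/caching/cache_target
    SELECT *
    FROM /shared/sources/orders o
    WHERE o.updated_at > last_refresh;
END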

Note that pull-based incremental caches can be implemented for procedural caches, but may involve additional implementation complexity.

6.4.8 Data Requirements of Incremental Caching

In order to implement an incremental cache on a result set, the following conditions must be met:

• The result set must provide a mechanism for tracking changes made to the data, such as an update timestamp or version number. This value must allow the tracking of insert, update, and delete operations.

• Changes to the cached result set should not affect a significant portion of the data within a single refresh interval. As a general rule of thumb, updates should affect no more than 25-35% of the data set during a single refresh interval, depending on the cost of a refresh operation. If this threshold is exceeded, there may be little to no performance savings compared to performing a full cache refresh operation.

6.4.9 Limitations of Incremental Caching

Both methods of incremental caching require a significant amount of configuration and/or development to implement. The push-based solution also requires the acquisition of additional software from third-party vendors.

While technically allowed by TDV, incremental caching of multi-table caches is discouraged due to potential data synchronization issues.


See the section titled “What Incremental Caching Options Are Available in TDV” in the TDV User Guide for more details on incremental caching.

6.5 Cache Loading Optimizations

6.5.1 Native Cache Loading

The TDV native loading option improves cache performance when the cache data is being saved to a Vertica, Microsoft SQL Server, Oracle, or Netezza target, and is enabled by default for these data sources when used as a cache target.

Native loading makes use of functionality that is inherent to the data source to provide cache loading performance gains that are not possible when using the TDV JDBC connections.

The following data source targets support native cache loading as of this writing:

• Microsoft SQL Server (Uses bulk import and bulk export functionality)

• Oracle 10g and later (Uses database link functionality)

• Netezza 5 and later (Uses NZLOAD functionality)

• Vertica

• Teradata (Uses bulk loading functionality)

• PostgreSQL (Uses bulk loading functionality)

For a definitive list of data source targets that support native cache loading, check the TDV User Guide that came with your TDV instance.

Recommendations: Native cache loading can significantly improve the performance of cache refresh operations by taking advantage of bulk load functionality available on some database data sources.

Because native cache loading pushes more work to the target cache database, it can place greater system load on that database than TDV’s standard JDBC-based cache loading.

The capabilities of native cache loading are defined by the bulk load functionality available on the target database. This can affect the data types that can be loaded, the types of caches that can be refreshed using native cache loading, and the level of control that TDV has over the load operation.

For these reasons, the use of native cache loading should be carefully considered to ensure that it is chosen only for scenarios where the performance gains realized for cache refresh operations justify the additional load placed on the cache system.

Please refer to the TDV User Guide for more details on native cache loading, including platform specific limitations and configuration information.

6.5.2 Parallel Cache Loading

TDV provides a parallel caching option that is used to maximize the performance of data caching. This option uses statistics derived from a unique numeric key to create multiple partitions that can be used to load the data using parallel processes. This performance option is available for all supported cache targets but requires some additional setup beyond TDV configuration parameters.


Parallel cache loading is only available for views that meet the following criteria:

• The view uses a simple numeric key

• TDV cardinality statistics are available on the key column

Based on the number of unique partitions found by statistics collection and available system resources, TDV will create multiple threads to load data into the cache during a refresh operation.

Recommendation: Parallel cache loading can significantly reduce the amount of time required to perform a cache refresh for cached views that contain a large number of rows provided that the view meets the prerequisite criteria for parallel loading. This performance gain comes at the cost of having to regularly gather statistics on the view’s key column and potentially placing higher load on the TDV instance due to the creation of multiple cache loading threads. Developers should use care to schedule both statistics collection and cache refreshes during periods of low activity on the TDV server when possible to reduce any performance impact upon user requests.

Please refer to the TDV User Guide for more details on parallel cache loading including configuration information.

6.5.3 Netezza Bulk Data Loading

The execution time of SQL insert operations on a Netezza database is generally higher than on other SQL databases. This can result in extremely poor performance of normal cache refresh operations when data is cached to a Netezza database, especially for large datasets.

Recommendation: Use the configuration property Enable Bulk Data Loading (TDV instance > Configuration > Debugging > Enable Bulk Data Loading) to enable use of the Netezza bulk data loader when performing cache refresh operations to a Netezza database. Enabling Netezza bulk cache refresh functionality will generally result in a significant reduction of cache refresh times for large datasets. It is strongly recommended that this option be enabled whenever data is to be cached to a Netezza database.


7 Case Sensitivity and Trailing Spaces Mismatches

7.1 Overview of Case Sensitivity and Trailing Spaces Issue

Case sensitivity and trailing space mismatches are often encountered in enterprise environments with many different database systems. TDV’s primary goal in this regard is to ensure reproducible and accurate results; however, there is often a trade-off of slower performance in certain cases when TDV must query databases with different case sensitivity or trailing spaces settings. Case sensitivity and trailing spaces mismatches only occur under the following conditions:

• There is a mismatch between TDV and the underlying data source’s case sensitivity and/or trailing spaces settings

• There is a join or WHERE clause with a CHAR or VARCHAR field in the test condition.

It should be noted here that TDV, like database management systems, necessarily provides a contract to the clients regarding case sensitivity and trailing spaces. TDV has a unique position within the enterprise in that it must deal with databases that have “conflicting” settings in this regard. TDV handles this by always following the convention configured via the Administration/Configuration menu.

This section documents how these settings may be overridden on a query-by-query basis; however, this practice should be considered very carefully to avoid providing queries to clients that could produce unexpected results. Consider the following example.

A client submits a simple SQL statement such as:


SELECT v1.balance FROM accounts v1 WHERE v1.account_name = 'bob'

The client is aware of what case sensitivity it wants to use. If it submits this to a case sensitive database, then it expects to get only accounts with exactly 'bob' as the name. If it submits this to a case insensitive database, it expects to get accounts with 'bob', 'BOB', and 'Bob'. If the client knows the database is case sensitive and it wants an insensitive compare, then it would submit "WHERE UPPER(v1.account_name) = UPPER('bob')".

The same is true of TDV. However, in the case where TDV is not case sensitive and the underlying database is case sensitive, TDV will add the UPPERs to the SQL sent to the underlying database to ensure TDV’s contract with the client is maintained. Unfortunately, doing this may invalidate an existing index: in the previous example, the index on “account_name” would be invalidated, causing a table scan.
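In that scenario, the SQL actually sent to the underlying database would look like the following (reconstructed from the example above):

SELECT v1.balance FROM accounts v1 WHERE UPPER(v1.account_name) = UPPER('bob')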

7.2 Determining Whether Settings are Affecting Query Performance

Evaluate any FILTER nodes or the SQL underlying each FETCH node in the Execution Plan in Studio to determine if case sensitivity or trailing spaces settings are impacting the query. Focus primarily on the WHERE clause or any filter nodes.

The first major issue is that some string comparisons in the WHERE clause have RTRIM or UPPER functions applied to them, which is manifested in the FETCH node.

Here is a partial screenshot of a comparison being pushed to a Sybase source with a conflicting case sensitivity setting:

The issue is that wrapping a column with a function like UPPER or RTRIM may prevent the underlying system from using an index on that column. This is necessary to provide correct results, but there is a performance trade-off. Suddenly a quick lookup through an index becomes a full table scan.

Another issue that arises with string comparisons is that settings differences may force TDV to perform filter operations itself instead of pushing them to the data source. A filter applied in TDV requires that all rows of data be fetched from the underlying table, which can significantly impact performance.


Review the following matrix to determine possible impact of differing case sensitivity and trailing spaces settings:

TDV Setting | Underlying Data Source Setting | TDV Effect on Joins | TDV Effect on WHERE Clause
----------- | ------------------------------ | ------------------- | --------------------------
case_sensitivity=true | case_sensitivity=true | None | None
case_sensitivity=true | case_sensitivity=false | Prevents JOINs involving CHAR or VARCHAR fields from being pushed down to the same data source. | Performs the WHERE clause string comparison in TDV in addition to pushing it down to the database.
case_sensitivity=false | case_sensitivity=true | Cannot use the more efficient Sort Merge algorithm between sources with conflicting settings; the optimizer reverts to the Hash Join algorithm. | Adds UPPER to both sides.
case_sensitivity=false | case_sensitivity=false | None | None
ignore_trailing_spaces=true | ignore_trailing_spaces=true | None | None
ignore_trailing_spaces=true | ignore_trailing_spaces=false | Cannot use the more efficient Sort Merge algorithm between sources with conflicting settings; the optimizer reverts to the Hash Join algorithm. | Adds RTRIM to both sides.
ignore_trailing_spaces=false | ignore_trailing_spaces=true | Prevents JOINs involving VARCHAR fields from being pushed down to the same data source. | Performs the WHERE clause string comparison in TDV in addition to pushing it down to the database.
ignore_trailing_spaces=false | ignore_trailing_spaces=false | None | None

The TDV query engine is designed to get the “correct” and consistent answer regardless of the SQL database implementation. If you find an RTRIM in the WHERE clause, it is because TDV is configured with a contract to ignore trailing spaces while the underlying database does not ignore them. Likewise, if you find an UPPER, it means that TDV is configured to ignore case while the underlying database is sensitive to case.

This situation is avoidable when the underlying data sources do not have conflicting settings. If all underlying data sources have the same settings in this regard, it is strongly recommended that TDV be set to the exact same case sensitivity and trailing spaces settings. If this is possible, the developer need not worry about this issue.

7.3 Dealing with Settings Mismatches

There are three ways to deal with these issues.

First, the system-wide configuration values for case sensitivity and trailing spaces can be modified via the Administration/Configuration menu. This is only useful if the data sources are fairly homogeneous in this regard. Changes to this setting should be carefully considered or avoided, as they will cause all other query plans to be re-evaluated to accommodate the new setting.

Second, and more “dangerous” with respect to affecting consistent results: if the data sources have varying policies for case sensitivity and/or trailing spaces, these values can be modified on a per-query basis by using SQL query options. This option is useful when numerous types of data sources are used with varying case-sensitivity and/or trailing space settings.

WARNING: These query hints should be used with an understanding that the global contract provided by TDV is overridden. It must be communicated to clients querying this published resource that the contractual behavior has been overridden.

In the previous example, the developer would use this syntax immediately after the SELECT keyword: {option ignore_trailing_spaces="false", case_sensitive="true"}

With these options set, the RTRIMs (and any UPPERs) no longer appear in the SQL sent to the database.
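A sketch of the earlier accounts query with the hint applied (the option list follows the syntax documented above; the formatting is illustrative):

SELECT {option ignore_trailing_spaces="false", case_sensitive="true"}
    v1.balance
FROM
    accounts v1
WHERE
    v1.account_name = 'bob'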

The third option is to create function-based indexes on the affected data source. This capability is only available on certain database platforms (for example, Oracle, SQL Server, and DB2 on z/OS). The trade-off is that the index takes up additional space in the data source.
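For example, on Oracle a function-based index can cover the UPPER comparison from the earlier accounts example, allowing the wrapped predicate to use the index instead of forcing a table scan:

CREATE INDEX idx_accounts_name_upper
    ON accounts (UPPER(account_name));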


8 Advanced Tuning Concepts

8.1 Externally Generated (Ad hoc) Queries

Problem: Some applications, such as BI tools, may contain their own query generation engine that a developer may have limited or no ability to influence. These applications often abstract away the task of query generation by allowing a developer to graphically define relationships between data. Analyzing and debugging potential query performance issues can be challenging when such an application connects to TDV as a client.

Unfortunately, performance tuning of these automatically generated queries can be difficult as the developer typically does not have the same degree of control over how queries are generated.

Solution: First, understand the query that is being generated by the client application. If the generated query received by TDV contains operations that make it inherently inefficient, such as excessive GROUP BY operations, multiple levels of complex nested queries, or complicated conditional logic, then it may be necessary to restructure the generated query.

If the structure of the incoming query appears to be free of issues, then it is useful to create a view in TIBCO Data Virtualization Studio that executes the same resolved SQL as the incoming query. This allows the developer to take advantage of Studio’s developer tools to more easily analyze the impact of tuning operations.

Analyze the performance of the queries passed from TDV to the underlying data sources. Look for potential performance issues with the data sources. Are we pushing a bad query down? Would adding an index to the source table(s) speed up the query? Is the data source just slow?

Be aware that the rule-based optimizer is less able to affect the execution plan of an externally generated query than it is for queries generated from internally modeled views and procedures. The use of statistics to enable cost-based optimization is therefore more important when tuning the performance of externally generated queries.

Analyze the current explain plan of the query, tune as described above, and reanalyze the explain plan to measure the impact of changes. Measure the impact of applying cardinality constraints, changing joins, and so on.

TDV enables some BI tools, such as MicroStrategy, to create short-duration cache tables to enable the execution of multi-pass SQL. See the section “Configuring DDL for a Data Service” in the TDV User Guide for details.

Keep in mind that in some scenarios it may be necessary to convert externally generated queries into TDV views to give developers a greater degree of control over how the query is structured. Converting generated queries to TDV views allows developers to take advantage of performance optimizations beyond cost-based optimization, such as using caching to pre-compute joins.


9 Conclusion

In this document, considerations and guidelines for performance tuning of resources built in TDV have been outlined. For more information, please consult the information sources referenced in-line.

