DB2 UDB

DB2 INFORMATION INTEGRATOR

PLANNING AND SIZING

MICKS PURNELL

IBM – ADVANCED TECHNICAL SUPPORT-AMERICAS

NOVEMBER 11, 2004

Trademarks

The following terms are trademarks or registered trademarks of the IBM Corporation in the United States and/or other countries: IBM, AIX, DataHub, DB2, DB2 Universal Database, DRDA, iSeries, and z/OS.

Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc.

Other company, product or service names may be the trademarks or service marks of others.

© Copyright International Business Machines Corporation 2004. All rights reserved.

Introduction

IBM® DB2® Information Integrator (DB2 II) is a product that lets you take advantage of the variety of types of data that may be stored in various formats in the enterprise. You can use it to access data in both relational and non-relational formats, and in both DB2 and other non-IBM relational databases, in an integrated fashion, transparent to your applications.

As you can imagine, when you are accessing data that may be distributed across the enterprise in a variety of formats, performance is an important issue, and to prepare for high performance, you need to consider both hardware and network early in your planning cycle. This article discusses hardware and network planning for three different major uses of DB2 Information Integrator:

• Online and batch applications that read, update, or read/update non-DB2 data (federated database applications).

• Data analysis and reporting applications that access DB2 and/or non-DB2 data (federated query applications).

• Federated data replication (DB2 to non-DB2; non-DB2 to DB2).

The primary performance factors for DB2 II are:

• Capacity and tuning at data sources

• Network speed

• Amount of data moved or replicated. For federated queries, though the final result to the user may be small, it is the amount of data that must move between data sources and DB2 II to get this result that affects performance.

• For multi-data source queries: the complexity of the query compared to the structure of the distributed data. For instance, will the user’s query require only one interaction between DB2 II and each data source, or will multiple interactions be required with some data sources? And will DB2 II have to assemble intermediate results at the DB2 II server, for instance to process GROUP BY and ORDER BY clauses in the user’s SQL? See the detailed discussion in the section on multi-data source queries.

• Additionally for replication: the location of the Apply component.

The hardware for the federated server or replication server is in most cases not the primary factor in determining performance of federated database applications, federated queries and replication.

The detailed discussion in this article is segregated into 8 sections:

• DB2 II basics, including the minimum hardware requirements to install DB2 II on a server and make it active.

• Planning for situations where application SQL statements submitted to DB2 II will require DB2 II to access only one data source per SQL statement. This section is relevant for both online/batch applications that use DB2 II to read, update, or read/update data, and also for data analysis and reporting applications.

• Planning for situations where application SQL statements submitted to DB2 II will require DB2 II to access multiple data sources per SQL statement. This section is relevant for both online/batch applications that use DB2 II to read, update, or read/update data, and also for data analysis and reporting applications.

• Planning for federated data replication with DB2 II.

• DB2 II integration with DB2 Universal Database (UDB) partitioned systems

• Scaling, load-balancing, and fail-over planning for DB2 II

• Conclusion

• Other information sources for DB2 II planning

The appropriate way to plan for DB2 II capacity requirements for all types of workloads is based on the number of data sources that will be accessed by application or user SQL statements, regardless of workload type. Thus you will not find separate sections for planning for online/batch workloads, and for data analysis/reporting workloads. Online/batch applications typically submit statements that call for DB2 II to access only one data source per statement, but multi-data source statements are certainly possible for them. And while data analysis/reporting users typically use DB2 II because it can provide results from joins across multiple data sources, they also typically submit SQL statements to DB2 II that analyze data or make reports from data at only one data source because DB2 II is the user’s one place to go to access all data.

In each of the three main sections (single-data-source workloads, multi-data-source workloads, replication workloads), we will cover the following topics:

• Architecture

• Sample SQL statements with discussion of how they are processed by DB2 II.

• Primary performance factors

• Discussion of how DB2 II uses hardware components (processor, memory, disk, and network)

• Planning questions

This article discusses the planning for using DB2 II functionality only. If there will be user data in the DB2 II database, or DB2 II will be integrated with a DB2 UDB system, then you should plan for the capacity requirements to manage the local DB2 data as you would normally for a DB2 UDB system. If the DB2 II and DB2 UDB functionality will be used together or at the same time, the capacity requirements for both running together would be additive. If DB2 II will be integrated with a DB2 UDB partitioned system, see also the section in this article on “DB2 II integration with DB2 UDB partitioned systems.” While the capacity requirements of DB2 UDB and DB2 II are additive when they are integrated, it should be clear in the sections that follow that if DB2 II pushes all or most work down to the federated data sources, the additional capacity requirements for DB2 II at the DB2 UDB server will not be large.

A DB2 UDB capability that you can use with DB2 II, whether it is integrated with a DB2 UDB system or not, is the Materialized Query Table (MQT). An MQT is a real (that is, persistent) table that contains the result of a query against tables, nicknames, or a combination of the two. When a user query contains SQL that matches the SQL used to create the MQT, the optimizer has the option of using the materialized query table, instead of the tables and/or nicknames in the user’s query, in order to return the result set faster. For DB2 II systems (or DB2 UDB/II integrated systems) that will implement MQTs, these are local DB2 tables, and capacity planning needs to consider them as such. You will see an example of using an MQT in the section on planning for use of DB2 II to access multiple data sources per SQL statement.

Predicting capacity requirements with greater accuracy

If it is an option for you, we recommend using the DB2/II explain tools with sample SQL statements of the proposed workload to find out the access plan that DB2 II will use to process SQL statements from users and applications. From the access plan, you can see what SQL statements DB2 II will send to data sources (including which columns will be referenced in predicates), the estimated size of result sets from the data source, and any processing DB2 II will do before giving the final result to the user. In this article, I have used DB2 II’s db2exfmt tool to find out the access plan that DB2 II uses for the example SQL statement.

The explain tools can be used effectively in a test environment that does not access production data, as long as the index information and statistics of the data that will be accessed when DB2 II is implemented in production are known. After nicknames are created for test tables that have the same columns as production tables, but less data, the nickname statistics can be updated via the DB2 II SYSSTAT catalog views. If the test tables do not have the same indexes as the production tables, index information can be added to the test nicknames with ‘CREATE INDEX…SPECIFICATION ONLY’ statements. With nickname statistics and index information defined as they would be against production tables, you can see, in the DB2 II test environment, the access plans that DB2 II would use in production, using DB2 explain tools like db2exfmt.
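As a rough illustration of the test-environment preparation described above, the following sketch builds the text of the two kinds of statements involved: an update of the CARD column through the SYSSTAT.TABLES catalog view, and an index specification for a nickname. The helper names, the index name LI_OK_IX, and the row count are hypothetical and chosen only for illustration; the sketch produces statement text and does not connect to a database, and it does not quote or escape identifiers.

```python
def nickname_card_update(schema: str, nickname: str, card: int) -> str:
    """Build an UPDATE on the SYSSTAT.TABLES view that sets a nickname's row count (CARD).

    Hypothetical helper: it only returns SQL text; names are inserted as given.
    """
    if card < 0:
        raise ValueError("card must be non-negative")
    return (
        f"UPDATE SYSSTAT.TABLES SET CARD = {card} "
        f"WHERE TABSCHEMA = '{schema}' AND TABNAME = '{nickname}'"
    )


def index_specification(index_name: str, schema: str, nickname: str,
                        columns: list, unique: bool = False) -> str:
    """Build a CREATE INDEX ... SPECIFICATION ONLY statement for a nickname.

    The statement records index information in the catalog for the optimizer;
    no index is built at the data source.
    """
    if not columns:
        raise ValueError("at least one column is required")
    kind = "CREATE UNIQUE INDEX" if unique else "CREATE INDEX"
    return (
        f"{kind} {index_name} ON {schema}.{nickname} "
        f"({', '.join(columns)}) SPECIFICATION ONLY"
    )


if __name__ == "__main__":
    # Illustrative values only: pretend the production LINEITEM table has 6,000,000 rows.
    print(nickname_card_update("ORA01", "LINEITEM", 6000000))
    print(index_specification("LI_OK_IX", "ORA01", "LINEITEM", ["L_ORDERKEY"]))
```

The generated statements would then be run against the DB2 II test database before the explain tools are used, so that the optimizer sees production-like statistics and index information.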

1. DB2 Information Integrator basics

This article is not intended to be the reader’s first introduction to DB2 Information Integrator. However, it will be helpful to first briefly review the software components of a DB2 Information Integrator system configuration.

Figure 1 shows the software components involved when DB2 II is used by online/batch applications and data analysis/reporting applications. I’ll discuss the components in this figure from right to left, starting with the data source client software that DB2 II uses to access data sources.

Figure 1. Software components involved when DB2 II is used for batch/online or data analysis

Data source client and Client configuration file/directory. DB2 II accesses data sources using the normal client software of the data source. Connection to a specific data source is configured the same way connections are configured for any software to access the data source. For instance, to access Oracle, DB2 II uses the Oracle client and relies on an entry for a specific Oracle instance to be created in the Oracle tnsnames.ora file on the DB2 II system.

Wrapper libraries. DB2 Information Integrator comes with wrapper libraries that provide the interface between the DB2 federated server at the heart of DB2 II and the data source client. DB2 II has relational wrappers (shown in Figure 1) for use with relational data sources such as Oracle, Microsoft SQL Server, DB2, and others. DB2 II also has non-relational wrappers (not shown) for accessing files, XML documents, Web services, life sciences data sources, and more. A wrapper includes:

• Interfaces to data source client APIs to make connections to data sources, send SQL statements, and receive results. The wrapper provides the logic for any translation that is required between the DB2 SQL issued by the DB2 II user/applications and the syntax of the data source.

• Default type mappings that are used when nicknames are created for tables at data sources. The type mappings tell DB2 II how to map data source data types to DB2 UDB data types.

• Default function mappings that are used when user SQL statements are processed. The function mappings tell DB2 II pushdown analysis what functions can be included in SQL sent to the data source. If a function is supported by the data source, then the final decision on whether to push it down is made by DB2 II’s cost-based optimizer.

• Server attributes that are used when user SQL statements are processed. The server attributes tell DB2 II pushdown analysis what SQL operations can be included in SQL sent to the data source. If an SQL operation is supported by the data source, then the final decision on whether to push it down is made by DB2 II’s cost-based optimizer.

By the way, DB2 UDB V8 Enterprise Server Edition comes with the wrappers for accessing DB2 family data sources and Informix IDS/XPS; the DB2 Information Integrator product itself is not required for federated access to these data sources.

DB2 instance. At the heart of DB2 II is the DB2 UDB federated server, and DB2 II makes the data of federated data sources appear to be in DB2 UDB. The installation of a DB2 II standalone system starts with the installation of the DB2 Enterprise Server Edition (ESE) that comes with DB2 II for use with DB2 II, and the creation of a DB2 instance. DB2 II can also be added to an existing DB2 UDB ESE instance. DB2 UDB functionality in DB2 II provides the following for federated queries:

• Preparation of execution plans for application/user queries that reference data at the data sources, including the following functions:

o Query re-write that optimizes the structure of the user/application’s SQL for better performance

o Pushdown analysis that determines what functions/operations in the SQL are supported by the data sources and therefore are eligible for inclusion in SQL sent to data sources. Inputs to pushdown analysis include function mappings, server attributes, data source version, and server options like COLLATING_SEQUENCE. The output of pushdown analysis tells the optimizer which operations/functions are allowed to be pushed down, and which are not.

o Cost optimization, which evaluates alternative access plans for executing the user’s or application’s SQL and picks the plan with the lowest cost estimate for execution. The access plans generated meet the guidelines given by pushdown analysis. Inputs to the cost estimation are index information and statistics about tables at data sources, server options, and the indicators from pushdown analysis on which operations/functions are allowed to be pushed down. The optimizer makes the final determination on which operations to push down based on relative cost estimates of alternative plans that push the operation down to the data source and plans that perform the operation/function locally at the DB2 II server.

• Interfaces to the wrappers for requesting connections to data sources and sending SQL statements and receiving results.

• A DB2 engine for processing SQL functions and operations that are not pushed down to data sources.

• Connection interface for DB2 II applications and users.

DB2 II database. DB2 II makes data in data sources appear to applications and users to be in a DB2 UDB database. If no user data is stored in the database, then the only persistent tables of the database are the tables of its catalog in which information about data sources and their tables is stored. The only tablespaces used are the one for the catalog tables and the temporary tablespace; the latter is used only for temporary tables that DB2 II may have to use for processing portions of user SQL that are not pushed down to data sources; even then, there may be no disk I/O involved if the temporary table will fit in the memory available to the bufferpool for the temporary tablespace.

Wrapper, server, user mapping, nickname definitions. Information about data sources and their data is defined in the catalog of the DB2 II database. The definitions can be put in place either by command (described in the DB2 UDB SQL Reference) or by an administrative GUI (DB2 Control Center).

• Wrapper – Registers a type of wrapper in the DB2 II database. A wrapper library works with a specific type of data source client.

• Server – Registers a particular data source’s data store in the DB2 II database. Indicates an entry in the configuration file or directory of the data source client, and, if appropriate, a database at the data source.

• User mapping – Indicates to DB2 II what userid/password to put in connections to a data source for a particular userid that accesses DB2 II.

• Nickname – Registers a data source table in the DB2 II database. DB2 II users and applications can refer to nicknames in their SQL statements as if they were tables in the DB2 II database. Information for a nickname includes:

o Name of the table at the data source and the DB2 II server definition for the data source where the table is located.

o Column definitions that indicate the characteristics of the columns of the table at the data source and the DB2 characteristics of these columns as ‘seen’ by the DB2 II user/application.

o Index information and statistics about the table at the data sources. This information is very important to the DB2 II cost-based optimizer. It is collected from the data source’s catalog when the nickname is created and when the statistics for the nickname are updated using the DB2 II 8.2 ‘Update Nickname Statistics’ procedure. If a nickname is created for a view, which has no index information or statistics in the data source’s catalog, then the index information and statistics for a nickname can be provided by SQL statement executed after the nickname is created.

Client applications. DB2 II applications and users use DB2 client interfaces to interact with DB2 II. Applications or users on workstations (Linux, UNIX, Windows) use the interfaces of the DB2 UDB client which include JDBC, ODBC/CLI, and DB2 UDB embedded SQL interfaces. Applications on mainframes and iSeries™ use the native interfaces of those platforms and DRDA connectivity to access DB2 II.

Replication components. When DB2 II is used for replication to or from non-DB2 data sources, the Apply component of DB2 UDB SQL replication is, in effect, the application that is using DB2 II. When replicating from DB2 to non-DB2, the Apply component that comes with DB2 II is typically used to replicate the changes staged by the SQL Capture at a DB2 source to the tables of a non-DB2 target. Apply replicates to DB2 II nicknames for the target tables; DB2 II takes care of establishing the connection to the data source for Apply and sending the insert/update/delete statements to the data source for the actual target tables. When replicating from non-DB2 to DB2, Apply normally runs at the DB2 target system and connects to the DB2 II database and selects changes from nicknames for the staging tables located at the data source; DB2 II makes the actual connection to the data source and selects the changes from the actual staging tables.

Minimum hardware requirements for DB2 II

This section covers the minimum hardware configuration just to install DB2 Information Integrator and allow it to be active on a server.

Consult the DB2 II Version 8.2 Installation Guide and DB2 UDB Version 8.2 Quick Beginnings for DB2 Servers for minimum hardware requirements for memory and disk.

• Add enough disk space to install non-DB2 data source client software (Oracle client, Sybase Open Client, Informix SDK, Teradata utilities, DataDirect Connect (for DB2 II UNIX to access SQL Server), and ODBC Driver for other data sources).

System: A system that will run an operating system on which DB2 II is supported:

• PC-Intel Windows NT, XP, 2000, 2003
• Red Hat Linux 2.7 or later, or Red Hat Enterprise Linux 2.1 or 3.0
• SUSE Linux Enterprise Server 8
• IBM pSeries AIX 4.3, AIX 5
• Sun Solaris 7, 8, 9
• HP HP-UX 11i (11.11)

Memory:

1. Operating system
2. DB2 ESE base – 256MB
3. DB2 II wrappers – no additional memory when DB2 II is initially started. I’ll discuss memory for user connections to the DB2 II database in the section ‘Single data source query/update workloads’ in the sub-topic on ‘Hardware planning.’

Disk:

1. DB2 ESE:
   a. Typical installation – 450-500MB
   b. Compact installation – 300–350MB
   c. Custom installation – 200-800MB
2. Relational wrappers – 5-20MB
3. Nonrelational wrappers – 5-20MB
4. Non-DB2 data source client software – consult vendor documentation
   a. Oracle client
   b. Sybase Open Client
   c. Informix Client SDK
   d. DataDirect Technologies Connect ODBC
      i. On UNIX/Linux, required for access to Microsoft SQL Server
      ii. On Windows, the Microsoft SQL Server ODBC driver that comes with the operating system can be used by DB2 II to access MS SQL Server.
   e. Teradata Tools and Utilities – includes CLIV2 and BTEQ
   f. ODBC 3.0 driver for other relational data sources
5. DB2 II installation – 140MB of free disk is needed during installation.

2. Single data source query/update workloads

2.1 Architecture for federated queries with 1 data source

Figure 2 depicts the typical scenario when a DB2 II application submits an SQL statement that requires DB2 II to access only one data source.

Figure 2. DB2 II accessing a single data source

If the user or application issues select, update, or select and update SQL statements to DB2 II that reference nicknames for only one data source at a time, most of the work is expected to be pushed down to the data source. The time to execute the SQL statements using DB2 II will in most cases be approximately the same as the time to access the data source directly from its own client and run the same SQL. The DB2 optimizer still generates plans that push down operations to the data source and plans that don’t, but the cost estimates of the alternative plans usually lead to the conclusion that the plan that pushes down all, or almost all, the SQL operations and functions to the data source has the lowest cost and so will be the one that is used. All or most of the processing required for the SQL statement is done at the data source, where indexes, if available, can be used to quickly identify the records meeting join and filter criteria in the statement. Just the final result set, or an intermediate result set the same size as the final result set, has to transit the network back to the DB2 II server, and DB2 II has no (or very little) processing that it has to do to give the user the final result. If there are any transient intermediate result sets, they are usually small and can easily fit in the memory on the DB2 II server that is allocated for DB2 II use via the SORTHEAP parameter and the bufferpool for the temporary tablespace. If there are large result sets that have to cross the network, then network speed and latency are factors. CPU speed of the DB2 II server can also be a factor if there are very large result sets, but the processing at the data sources and the network speed are still more important.

There are some circumstances where the response time will be longer than direct access. These are covered in the discussion of performance factors below. If you can, use the DB2 explain tools that come with DB2 Information Integrator in advance to determine the amount of SQL pushdown.

2.2 Sample SQL statement – single data source

Listing 1 shows a sample SQL statement for a data analysis/reporting application. In this case, all the DB2 II nicknames in the statement reference tables at the same data source:

Listing 1. Sample SQL referencing tables at the same data source

SELECT N_NAME as NATION,
       SUM(L_EXTENDEDPRICE * (1-L_DISCOUNT)) AS REVENUE
FROM ORA01.CUSTOMER, ORA01.ORDERS, ORA01.LINEITEM,
     ORA01.SUPPLIER, ORA01.NATION, ORA01.REGION
WHERE C_CUSTKEY = O_CUSTKEY
  AND L_ORDERKEY = O_ORDERKEY
  AND L_SUPPKEY = S_SUPPKEY
  AND C_NATIONKEY = S_NATIONKEY
  AND S_NATIONKEY = N_NATIONKEY
  AND N_REGIONKEY = R_REGIONKEY
  AND R_NAME = 'ASIA'
  AND O_ORDERDATE >= DATE('1994-01-01')
  AND O_ORDERDATE < DATE('1994-01-01') + 1 YEAR
GROUP BY N_NAME
ORDER BY REVENUE DESC

Here is the access plan graph for this statement in the db2exfmt output:

Listing 2. Access plan for Listing 1 SQL

                             Rows
                            RETURN
                            ( 1)
                             Cost
                             I/O
                              |
                              25
                            TBSCAN
                            ( 2)
                            363719
                            35207.7
                              |
                              25
                             SORT
                            ( 3)
                            363719
                            35207.7
                              |
                              25
                             SHIP
                            ( 4)
                            363719
                            35207.7
      +-----------------+---------+-------+-----------------+
      5                 25            1.5e+006         6.00122e+006
NICKNM: ORA01     NICKNM: ORA01    NICKNM: ORA01      NICKNM: ORA01
    REGION            NATION           ORDERS            LINEITEM

We can see the SQL statement that DB2 II will send to the data source. It is in the RMTQTXT field in the db2exfmt detail section for the SHIP ( 4) operator:

Listing 3. SQL actually sent to the data source

SELECT A3."N_NAME",
       SUM( (A5."L_EXTENDEDPRICE" * (+1.00000000000000E+000 - (A5."L_DISCOUNT"))))
FROM "SALES"."SUPPLIER" A0, "SALES"."ORDERS" A1, "SALES"."REGION" A2,
     "SALES"."NATION" A3, "SALES"."CUSTOMER" A4, "SALES"."LINEITEM" A5
WHERE (TO_DATE('19940101 000000','YYYYMMDD HH24MISS') <= A1."O_ORDERDATE")
  AND (A1."O_ORDERDATE" < TO_DATE('19950101 000000','YYYYMMDD HH24MISS'))
  AND (A2."R_NAME" = 'ASIA ')
  AND (A3."N_REGIONKEY" = A2."R_REGIONKEY")
  AND (A4."C_NATIONKEY" = A3."N_NATIONKEY")
  AND (A4."C_CUSTKEY" = A1."O_CUSTKEY")
  AND (A5."L_ORDERKEY" = A1."O_ORDERKEY")
  AND (A5."L_SUPPKEY" = A0."S_SUPPKEY")
  AND (A4."C_NATIONKEY" = A0."S_NATIONKEY")
GROUP BY A3."N_NAME"

We see that all operations and functions in the user’s SQL statement to DB2 II are pushed down to the data source, except for the ORDER BY REVENUE clause at the end. So a result set the same size as the final result set (estimated by the optimizer to be 25 records) is returned from the data source to DB2 II and put into a SORT at the DB2 II server so that the ORDER BY clause that was not pushed down can be processed. The SORT is estimated to have a record-length of 36 bytes. 113 such records would fit in one 4KByte sortheap or bufferpool page so this sort should require only one sortheap page. The ORDER BY will be processed very quickly even though it is not pushed down to the data source since the sort/temp table required for it is small and won’t require any disk I/O to the file containers of the temporary tablespace at the DB2 II server.
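The page arithmetic in the paragraph above can be reproduced with a small sketch. It assumes, as the paragraph does, that records are packed into 4 KB pages with no per-page or per-record overhead; the function names are hypothetical and the result is a planning estimate rather than a statement of how DB2 lays out sort pages.

```python
import math


def records_per_page(record_length: int, page_size: int = 4096) -> int:
    """Whole records that fit in one page, ignoring any page or record overhead."""
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    return page_size // record_length


def pages_needed(num_records: int, record_length: int, page_size: int = 4096) -> int:
    """Pages required to hold num_records records of record_length bytes."""
    per_page = records_per_page(record_length, page_size)
    if per_page == 0:
        raise ValueError("record does not fit in a single page")
    return math.ceil(num_records / per_page)


if __name__ == "__main__":
    # Values from the SORT ( 3) discussion: 25 records of 36 bytes each.
    print(records_per_page(36))   # 113
    print(pages_needed(25, 36))   # 1
```

Plugging in the estimated row count and record length of any intermediate result set from an access plan gives a first estimate of whether it is likely to stay in SORTHEAP or bufferpool memory.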

The optimizer estimates the size of the final result set, the size of the result from the SHIP ( 4) operation, and the size of the intermediate result set in SORT ( 3) by using the statistics and index information for the nicknames involved. This information is gathered from the data source when the nicknames are created, and can be updated with the DB2 II update nickname statistics procedure (NNSTAT).

If the ORDER BY REVENUE clause is not included in the user’s SQL statement, all the SQL is pushed down and the result passes through DB2 II to the user without any SQL processing or transient intermediate result sets at the DB2 II server. Listing 4 shows the access plan graph if the ORDER BY clause is not included in the SQL statement:

Listing 4. Access plan without the ORDER BY clause

                             Rows
                            RETURN
                            ( 1)
                             Cost
                             I/O
                              |
                              5
                             SHIP
                            ( 2)
                            536804
                            134355
      +-----------------+---------+-------+-----------------+
      5                 25            1.5e+006         6.00122e+006
NICKNM: ORA01     NICKNM: ORA01    NICKNM: ORA01      NICKNM: ORA01
    REGION            NATION           ORDERS            LINEITEM

In this case with complete pushdown, DB2 II simply receives a block of records from the data source into the DB2/II client communications buffer (size determined by RQRIOBLK configuration parameter) and then sends the records in that block to the DB2 II application or user.

2.3 Primary performance factors

When DB2 II has to access only one data source to process an SQL statement from a user or application, the primary performance factors are usually:

• Available capacity at the data source to process the SQL sent to it. If the application workload coming to the data source via DB2 II is simultaneous with the original application workload of the data source, then the capacity requirements of both workloads need to be added to find out the new capacity requirements at the data source.

• Tuning the data source. If the application workload coming to the data source via DB2 II uses different columns in filters and joins (WHERE clauses) than the original application workload of the data source, consider adding indexes (preferably unique indexes if possible) to help the data source process the DB2 II application workload faster.

• Size of result sets, or number of selects/updates attempted in a short interval.

• Network speed to data source.

The circumstances where response time will be slower than direct access to the single data source are:

• Very large result sets. Capacity and tuning at the data source and network speed are still the most important factors. But it also takes noticeable time and capacity for DB2 II to pass a large number of records received on the connection from the data source through DB2 II and out onto the connection to the DB2 II application program. For instance, if you connect to the data source directly using its client software, issue the SQL statement, and record the elapsed time to get the large result, and then issue the same SQL statement using DB2 II nickname(s) and record the elapsed time, there will be a noticeable difference between the two times. A faster processor for the DB2 II server will decrease the additional time attributable to passing the result sets through DB2 II.

• Short simple statements that return small results. If the response time per statement is sub-second with direct access, it should be sub-second with DB2 II. But if lots of simple statements with small results are submitted in rapid-fire fashion, the aggregate of the additional overhead going through DB2 II will be noticeable.

• Operations in the SQL statement that DB2 II determines are not supported by the data source. DB2 II makes the data source data appear to be in DB2 UDB, and it is DB2 UDB’s range of SQL operations and functions that establish the range of operations and functions that DB2 II accepts in SQL statements coming from users and applications. Most of these operations and functions are supported by other relational DBMS’s, but there are some exceptions. These exceptions are specific to data source types and versions; the input to pushdown analysis on these exceptions is from the server attributes and default function mappings within the wrapper used to access a specific type of data source.
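To get a feel for when result-set size and network speed start to matter, the following sketch estimates the minimum time for a result set to cross the network between the data source and the DB2 II server. It is a simplification under stated assumptions: it ignores latency, protocol overhead, and processing time at both ends, and the function name, the `efficiency` parameter, and the example figures are hypothetical, not taken from the article.

```python
def transfer_seconds(num_records: int, record_length_bytes: int,
                     network_mbps: float, efficiency: float = 1.0) -> float:
    """Lower-bound transfer time for a result set over a network link.

    network_mbps is the nominal data rate in megabits per second.
    efficiency (0 < efficiency <= 1) is a hypothetical factor for the share
    of the nominal rate that is actually usable; 1.0 means the full rate.
    """
    if network_mbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("network_mbps must be > 0 and efficiency in (0, 1]")
    total_bits = num_records * record_length_bytes * 8
    return total_bits / (network_mbps * 1_000_000 * efficiency)


if __name__ == "__main__":
    # Illustrative only: one million 100-byte records.
    print(transfer_seconds(1_000_000, 100, 100))    # 8.0 seconds at 100 Mbit/s
    print(transfer_seconds(1_000_000, 100, 1000))   # 0.8 seconds at 1000 Mbit/s
    # A 25-record, 36-byte result like the one in Listing 2 is negligible.
    print(transfer_seconds(25, 36, 100))
```

For small result sets the estimate is effectively zero, and per-statement overhead and latency dominate instead, which is consistent with the rapid-fire small-statement case described above.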

2.4 Hardware planning

Processors: CPU speed is usually not the primary hardware performance factor when the work of SQL statements is pushed down to the data source. To process an individual statement, DB2 II uses only one CPU; it can take advantage of a second CPU only if there are multiple concurrent queries. If final result sets are large (millions of records), the CPU speed of the DB2 II server becomes a factor. Most of the work to process the query is still expected to be performed at the data source, and passing the result set through the federated server is a small part of the total workload, but with large result sets a faster network to the data source and a faster CPU at the federated server can reduce the time to deliver the total result set to the user.

The exception to the one-CPU-per-query rule is DB2 II V8.2 in a partitioned configuration. If the DB2 II instance is implemented with the data partitioning feature (multiple physical or logical nodes), then for queries where not all processing is pushed down to data sources, the DB2 II optimizer has the option, based on cost, to distribute the intermediate result sets and processing to multiple nodes, thereby taking advantage of the multiple CPUs and of the memory for sortheap and bufferpool of each node. You can expect this distribution to be the ‘low-cost’ plan selected by the optimizer only if the intermediate result sets are extremely large or a great deal of CPU is required for the processing that was not pushed down. Even with this capability available, if given a choice, fewer and faster CPUs are recommended over more and slower CPUs for the DB2 II server.

Memory: If the SQL of a query or update operation is pushed down to the data source, there are no major demands for memory at the federated server to process the user or application queries and updates. But if the SQL includes an ORDER BY, GROUP BY, or column function that is not pushed down, DB2 II has to put the result set into a transient temporary sort at the federated server to process the operation that is not pushed down. If there is memory available and allocated to SORTHEAP or to the bufferpool for the temporary tablespace, the operation can be processed without any disk I/O to the file containers of the temporary tablespace. The answers to the planning questions below about the number of records and the record length of result sets are relevant here.

If the user/application SQL joins nicknames for tables at the same data source, in most cases the join is pushed down and requires no memory resources at the DB2 II server. For the uncommon cases where joins of tables at a single data source are not pushed down, two of the three join techniques available to DB2 II use allocations of SORTHEAP, so memory to contain the joins is useful. See the discussion of join technique and memory in the section on multi-data source queries.

If there will be many concurrent connections to the DB2 II server, calculate and add the memory for the db2agent processes for these connections. Memory requirements for db2agent processes are application dependent. The base memory requirement for a db2agent process just to support a user connection, before any nicknames are referenced in SQL, is expected to be 2-5 MB. When the user/application references nicknames, DB2 II loads the appropriate wrapper(s) and data source client software into the db2agent process. Base memory for the db2agent process plus the wrapper and data source client software can be estimated at 20 MB per user. The best way to determine the per-user memory requirement is to monitor memory usage while varying numbers of users of a particular type of application are connected to the DB2 II server.

Memory savings for wrappers can be achieved with the wrapper option DB2_FENCED 'Y'. If the wrapper definition has this option, then instead of loading the wrapper into the db2agent process created for each user connection, DB2 II creates a fenced mode procedure (FMP) process to load the wrapper and data source client. For data source clients that are thread-safe (that is, they can support multiple connection threads), DB2 II creates just one FMP process per wrapper and uses it to support all user connections for the type of data source supported by the wrapper.

Disk: If all SQL is pushed down, or result sets are small and only ORDER BY or GROUP BY clauses are not pushed down, you can anticipate that DB2 II will be able to do all its work without any disk I/O to the file containers of the temporary tablespace. Disk is then required only to store DB2 II, the data source client software, and the catalog of the DB2 II database. The exception is when ORDER BY or GROUP BY is not pushed down and result sets are so large that they overflow SORTHEAP and the memory of the bufferpool for the temporary tablespace; in that case there needs to be enough disk to hold these transient (temporary) sorts.

Network adapter: If the result sets for the SQL submitted by the application are small, or the application performs very few update statements in short time intervals, network data rate will not have a great impact on performance. But if large result sets are expected, or the application will issue many quick select or update statements that call for fast turn-around, then a higher network data rate and low network latency will help performance, and a network adapter for the high-speed network is called for on the DB2 II server.
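
As an illustration of the DB2_FENCED wrapper option discussed under Memory, the statements below sketch how a wrapper might be registered or switched to fenced mode. This is a minimal sketch, not a tested configuration: NET8 is used only as an example wrapper name, and the two statements are alternatives rather than a script to run in sequence.

```sql
-- Sketch only. NET8 is an example wrapper name (assumption);
-- substitute the wrapper registered in your federated database.

-- Alternative 1: register a new wrapper in fenced mode.
CREATE WRAPPER NET8 OPTIONS (DB2_FENCED 'Y');

-- Alternative 2: switch an already registered wrapper to fenced mode.
-- Use ADD instead of SET if the option has never been set on this wrapper.
ALTER WRAPPER NET8 OPTIONS (SET DB2_FENCED 'Y');
```

Whether fenced mode actually saves memory depends on the number of concurrent connections and on whether the data source client is thread-safe, so it is worth measuring both settings.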

2.5 Planning Questions

Here are some questions to ask when planning for use of DB2 II by online/batch applications or data analysis/reporting applications, if only one data source will be accessed per SQL statement:

1. What will the types of data sources be (for example, DB2 z/OS®, DB2 UDB, Oracle, Sybase, and so on)?

2. How many users will there be for the application that accesses DB2 II?

a. How many will be concurrently connected to DB2 II server?

3. How many SQL statements will be submitted to DB2 II during the peak period?

a. How long is the expected response time for these SQL statements?

b. What, then, is the number of SQL statements that DB2 II will be processing at any point in time?

4. How large will the results be? Number of records? Row-length?

5. Will the join and filter columns in the SQL statements be in unique indexes or primary keys at the data source?

6. Will the DB2 II server be standalone, or will DB2 II be installed on the same system as DB2 UDB or a non-DB2 data source?

7. What will be the network speed from DB2 II to the data source?

8. How many routers will there be between DB2 II and the data source?
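
Question 3b can be answered from the answers to questions 3 and 3a with Little's law, which relates the average number of statements in progress to the arrival rate and the response time. The numbers below are purely illustrative assumptions, not measurements.

```latex
% N       = average number of statements DB2 II is processing at any point in time
% \lambda = arrival rate of statements during the peak period
% R       = average response time per statement
N = \lambda \times R

% Illustrative values (assumed): 7200 statements in a peak hour, 3 s response time
\lambda = \frac{7200\ \text{statements}}{3600\ \text{s}} = 2\ \text{statements/s},
\qquad R = 3\ \text{s}
\quad\Longrightarrow\quad N = 2 \times 3 = 6
```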

3. Multi-data source query/update workloads

3.1 Architecture for multi-data source federated queries

Figure 3 depicts the processing that occurs when DB2 II joins the data from one table at one data source with the data of one table at another data source.

Figure 3. DB2 II joining data from two data sources

Applications that require DB2 II to access multiple data sources for each SQL statement should expect that the response will be slower than if all the data were at one data source. In some cases, the difference may not be noticeable, but this should not be the initial expectation. One of the major factors affecting performance is the structure of the data distributed across the data sources and the structure of the SQL that will be submitted to DB2 II by the application. The factors affecting performance if there is only one data source per SQL statement also apply; that is, tuning and capacity at data sources and network speed and latency will have major impacts on performance. But even in the cases where all work of the user’s SQL statement except the join between data sources is pushed down to the data sources, there is work that has to be done at the DB2 II server, namely the work of the join.

Regarding applications that update data, DB2 II does not support updating multiple data sources in a single transaction. A transaction that updates nicknames can read data at many data sources, but data at only one DB2 II data source can be updated. DB2 II intends to support multi-site update / two-phase commit in the future.

Next we’ll discuss the types of complexity in the SQL statements that affect performance and look at several examples of different complexity.

3.1.1 How DB2 II handles user and application SQL – multiple data source workloads

For user and application SQL statements that reference nicknames for data at multiple data sources, joins between data at different data sources have to be performed at the DB2 II server. However, DB2 II performs these joins and the rest of the user/application SQL as efficiently as possible. Query rewrite determines whether the user/application SQL could be better structured. Pushdown analysis determines what operations and functions are supported at the data sources, and the optimizer generates alternative plans for processing the SQL, estimates the cost of each, and prepares the one with the lowest estimated cost for execution. If there are two or more tables from the same data source in the user SQL, the optimizer will evaluate whether a plan that pushes that join to the data source has a lower cost than performing that same-data-source join at the DB2 II server.

The SQL statements that DB2 II will send to the data sources will make the processing at the data sources and the transfer of data across the network as efficient as possible. The SQL statements will ask only for the columns that are needed for the join and the result requested by the user; filters from the user’s SQL statement and predicates for the join will be included when possible to reduce the number of records retrieved and exploit indexes at the data sources.

DB2 II has a number of join techniques that it can use for joining data between data sources: hash join, nested loop join, and merge scan join. The optimizer will typically evaluate plans that use these different alternatives for the same join and will pick the plan that has the lowest cost estimate. For the hash-join and merge-scan join techniques, DB2 II creates temporary table structures, using memory allocations of sortheap. If sortheap is not large enough to hold the temporary hash table or merge table, these tables overflow into the memory pages of the bufferpool for the temporary tablespace; if that bufferpool does not have enough available pages to hold the overflow, the pages of these temporary table structures are written temporarily to the file containers of the temporary tablespace. For nested loop joins, DB2 II does not create temporary tables, so the NLJOIN technique does not require as much memory, but it is often slower than the hash-join technique when that can be used. The amount of available memory (sortheap and bufferpool) to contain the temporary table for the hash join affects the optimizer’s choice among the three techniques.

The most important input to the DB2 II optimizer for making a precise estimate of the relative cost of alternative access plans is the statistics and index information about the nicknames in the user/application SQL. Statistics and index information for nicknames are gathered from data sources at the time nicknames are defined; with DB2 II V8.2 they can also be updated after a nickname is created. The DB2 II administrator and the administrators of the data sources should coordinate so that statistics are updated at the data source before nicknames are created and whenever the size of tables at the data sources changes. Also, if given a choice of creating nicknames for tables or views at the data source, create nicknames for tables. Views do not have statistics or index information in the catalog at the data source; if nicknames are created for views, the DB2 II administrator must use a manual process to provide good statistics and index information for the nicknames.
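
The manual process mentioned above could look roughly like the following sketch, which writes statistics into the updatable SYSSTAT catalog views and records an index specification for a nickname. The cardinalities and the index name LINEITEM_PK_IX are placeholders chosen for illustration, not values taken from a real system.

```sql
-- Sketch only: all numbers and the index name are assumptions.

-- Record the row count of the remote table for the nickname.
UPDATE SYSSTAT.TABLES
   SET CARD = 6000000
 WHERE TABSCHEMA = 'ORA01'
   AND TABNAME   = 'LINEITEM';

-- Record the number of distinct values in a join column.
UPDATE SYSSTAT.COLUMNS
   SET COLCARD = 200000
 WHERE TABSCHEMA = 'ORA01'
   AND TABNAME   = 'LINEITEM'
   AND COLNAME   = 'L_PARTKEY';

-- Tell the optimizer that an index exists at the data source.
-- SPECIFICATION ONLY records catalog information; no index is built
-- at the federated server.
CREATE INDEX ORA01.LINEITEM_PK_IX
    ON ORA01.LINEITEM (L_PARTKEY)
    SPECIFICATION ONLY;
```

Values entered this way are only as good as their source, so they should be derived from current statistics or counts at the data source.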

3.1.2 Complexity of query against the distributed data structure:

If there are multiple data sources in each query, there are many factors in the analysis that the user/application is trying to do that affect performance. The main ones are:

• The number of interactions DB2 II will have to do with the data sources. For instance, if nicknames for 2 tables at the same data source are joined in a query, it is likely DB2 II will be able to send one SQL statement to the data source to get the data for both tables. But if the 2 nicknames are joined in the query only with nicknames for table(s) at other data sources and not with each other, DB2 II will not be able to engineer a single SQL statement to send to the data source to get the data from both for the final result; a minimum of 2 interactions (SHIP operations in db2exfmt output) with the data source will be required.

• The amount of SQL pushdown to data sources. For instance, are filters, functions, and join operations involving tables at the same data source pushed down so that more processing takes place at the data sources?

• The amount of data that will have to move from the data sources to DB2 II.

Here are some examples of characteristics of multi-data source queries that affect performance:

• A SELECT that joins one table at the first data source with one table at a second data source.

• A SELECT that joins 2+ tables at one data source with 2+ tables at a second data source, and the tables at each data source can be joined at the data sources, so minimum interactions to each data source can be 1.

• A SELECT that joins 2+ tables at one data source with tables at a second data source, but tables at the first data source are not joined together, but instead are joined separately to tables at the second data source; therefore, minimum interactions to the first data source are 2+.

• SQL contains sub-selects or table expressions, and both main select and the sub-select/table expressions contain joins of data at multiple data sources; therefore minimum interactions to each data source will be 2: one interaction to each data source for the join in the main select and one interaction to each data source for the join in the sub-select/table-expression.

• Join between data sources, and many rows from both data sources will meet criteria of the join. Transient intermediate result from the first data source will have many values requiring many probes to the second data source.

• SQL contains outer join of tables at different data sources.

If you can, use DB2 explain tools such as db2exfmt in advance with sample queries and nicknames that have statistics and index information that accurately reflect the characteristics of the production data that DB2 II will be using.
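
One way to capture such a plan in advance is sketched below. It assumes that the explain tables have already been created in the federated database, and it uses FEDDB as a placeholder database name and a simplified query over the nicknames from the samples in this article.

```sql
-- Sketch only: FEDDB and the query are placeholders (assumptions).

-- Populate the explain tables without executing the statement.
SET CURRENT EXPLAIN MODE EXPLAIN;

SELECT COUNT(*)
  FROM ORA01.LINEITEM, MSS02.PART
 WHERE P_PARTKEY = L_PARTKEY
   AND P_BRAND   = 'Brand#12';

-- Return to normal execution.
SET CURRENT EXPLAIN MODE NO;

-- Then, from the operating system command line, format the most
-- recently explained statement into a text file:
--   db2exfmt -d FEDDB -1 -o plan.txt
```

In the formatted output, the SHIP operators and their RMTQTXT show what will be sent to each data source.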

Another factor of distributed data that can affect performance when nicknames join tables at two or more data sources is the characteristics of the join columns. Though the same value may indicate the same customer in two application systems, the two systems may not store the customer number in columns of the same data type and/or length. DB2 II can do the join most efficiently if the join columns are the same type and length. If they are not, DB2 II is not able to use all of its join techniques (in particular, hash join cannot be used), and because of the difference in type and length of the join columns, DB2 II may need to use extra transient temporary tables. There are three techniques that can help DB2 II performance when the join columns are not the same type and length in the two systems:

• Alter the data type of the join column in one of the nicknames to make it the same type and length as in the nickname that is joined to. This technique is valid if the existing and new data type are compatible (for instance, existing and new type are character types or both are numeric types) and the change won’t cause the values to be truncated or padded in a way that causes join results to be invalid.

• Create a view over the table at one of the data sources. In that view, cast the join column to the same data type and length as the join column in the table at the other data source. Create a nickname in DB2 II for that view and use that in the join. To help the DB2 II optimizer, it is recommended to still create a nickname for the table beneath the view and use the index information and statistics for that nickname to update the index information and statistics for the view nickname that is used in the join.

• Add a column to one of the tables at the data source. The data type and length of the new column should match that of the join column in the other data source. Then update the new column with values from the former join column. And add an index to the table that includes the new column. Update statistics for the table and then create nickname for the table and use this nickname in the join, joining on the new column.
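
The first two techniques might be sketched as follows. All object names here (the ORA01.CUSTOMER nickname, its C_CUSTNO and C_NAME columns, the SALES schema, and the ORASERV server definition) are hypothetical and exist only for illustration.

```sql
-- Sketch only: every object name below is hypothetical.

-- Technique 1: change the local data type of the join column in the
-- nickname. Only the federated catalog changes; the remote table is
-- not altered.
ALTER NICKNAME ORA01.CUSTOMER
    ALTER COLUMN C_CUSTNO LOCAL TYPE CHAR(10);

-- Technique 2, step 1: at the data source (not in DB2 II), create a
-- view that casts the join column to the type and length used at the
-- other data source.
CREATE VIEW SALES.CUSTOMER_V AS
    SELECT CAST(C_CUSTNO AS CHAR(10)) AS C_CUSTNO, C_NAME
      FROM SALES.CUSTOMER;

-- Technique 2, step 2: in DB2 II, create a nickname for that view
-- and use it in the join.
CREATE NICKNAME ORA01.CUSTOMER_V FOR ORASERV.SALES.CUSTOMER_V;
```

With either technique, check with explain output that the join columns now match and that hash join has become available to the optimizer.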

3.2 Sample SQL Statement #1 – join multiple data sources

In our first sample, we’ll use an SQL statement with just one table at each data source:

Listing 5. SQL statement with one table at each data source

SELECT SUM(L_EXTENDEDPRICE * (1-L_DISCOUNT)) AS REVENUE
FROM ORA01.LINEITEM, MSS02.PART
WHERE ( P_PARTKEY = L_PARTKEY
        AND P_BRAND = 'Brand#12'
        AND P_CONTAINER IN ('SM CASE','SM BOX','SM PACK','SM PKG')
        AND L_QUANTITY >= 1 AND L_QUANTITY <= 1 + 10
        AND P_SIZE BETWEEN 1 AND 5
        AND L_SHIPMODE IN ('AIR','AIR REG')
        AND L_SHIPINSTRUCT = 'DELIVER IN PERSON' )
   OR ( P_PARTKEY = L_PARTKEY
        AND P_BRAND = 'Brand#23'
        AND P_CONTAINER IN ('MED BAG','MED BOX','MED PKG','MED PACK')
        AND L_QUANTITY >= 10 AND L_QUANTITY <= 10 + 10
        AND P_SIZE BETWEEN 1 AND 10
        AND L_SHIPMODE IN ('AIR','AIR REG')
        AND L_SHIPINSTRUCT = 'DELIVER IN PERSON' )

Here is the access plan graph from db2exfmt for this query:

Listing 6. Access plan for sample statement #1

              Rows
             RETURN
             (   1)
              Cost
               I/O
                |
                1
              GRPBY
             (   2)
             451347
             113368
                |
            0.175222
             HSJOIN
             (   3)
             451347
             113368
        /-------+-------\
     167863          595.84
      SHIP            SHIP
     (   4)          (   6)
     436070          15263.8
     109567           3801
        |               |
  6.00122e+006       200000
  NICKNM: ORA01   NICKNM: MSS02
    LINEITEM          PART

The graph shows that DB2 II’s optimizer has chosen a hash join with the LINEITEM table at ORA01 as the outer table. The SQL statement sent to the first data source (from the RMTQTXT of the SHIP ( 4) operator in the db2exfmt output) is:

Listing 7. SQL statement from sample statement #1 that is sent to first data source

SELECT A0."L_PARTKEY", A0."L_QUANTITY", A0."L_EXTENDEDPRICE", A0."L_DISCOUNT"
FROM "SALES"."LINEITEM" A0
WHERE (((A0."L_QUANTITY" >= +1.00000000000000E+000)
        AND (A0."L_QUANTITY" <= +1.10000000000000E+001))
    OR ((A0."L_QUANTITY" >= +1.00000000000000E+001)
        AND (A0."L_QUANTITY" <= +2.00000000000000E+001)))
  AND (A0."L_SHIPMODE" IN ('AIR', 'AIR REG'))
  AND (A0."L_SHIPINSTRUCT" = 'DELIVER IN PERSON ')

So we can see that DB2 II retrieves only the columns it needs from the first data source and sends filters from the user’s query to reduce the number of records returned to go into the hash table. The SQL statement to the second data source (from the RMTQTXT of the SHIP ( 6) operator) is:

Listing 8. SQL statement from original sample statement #1 query that is sent to second data source

SELECT A0."P_PARTKEY", A0."P_BRAND", A0."P_CONTAINER", A0."P_SIZE"
FROM "inventory"."PART" A0
WHERE ((A0."P_BRAND" = 'Brand#12 ') OR (A0."P_BRAND" = 'Brand#23 '))
  AND ((A0."P_CONTAINER" IN ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG'))
    OR (A0."P_CONTAINER" IN ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')))
  AND ((A0."P_SIZE" <= 5) OR (A0."P_SIZE" <= 10))
  AND (A0."P_SIZE" >= 1)

Again we can see that DB2 II retrieves only the columns it needs from the second data source. It also sends filters from the user’s query to reduce the number of rows returned from the second data source.

In this case, DB2 II uses a hash join. This is possible because the join columns are the same data type and length in both data sources. Other join techniques DB2 II has available include nested loop join and merge scan join. The DB2 II optimizer evaluated all three techniques for this query and determined that the plan we see using the hash-join technique would be the fastest, though it requires the creation of a transient hash table. In this case, the hash table is not estimated to be large, so it will fit in memory.

Major input to these cost estimates were the statistics and index information in the DB2 II catalog for the nicknames involved. The DB2 II optimizer will make good decisions if the statistics for the tables at the data sources are current either when nicknames are created or right before the nickname statistics are updated. There is also a manual technique for providing statistics and index information for nicknames if the tables at the data sources do not have accurate statistics, or if nicknames are created for views, which do not have statistics or index information in the data source’s catalog.

Finally, the only operation that is not pushed down, besides the join between data sources, is the application of the SUM function. That is represented by the GRPBY ( 2) operator in the access plan graph.

3.2.2 Sample SQL Statement #2 – join multiple data sources

Next we’ll look at a SQL statement that is much more complex. It contains a sub-select. The main ‘select’ joins tables at 2 data sources; the sub-select joins the same tables at the same 2 data sources. The fact that 2 data sources are joined in both the main select and the sub-select keeps DB2 II from combining the data retrieval for both the main select and the sub-select into a single SHIP operator to each data source. Here is the SQL statement:

Listing 9. SQL statement #2, joining multiple data sources

SELECT S_ACCTBAL, S_NAME, N_NAME, P_PARTKEY, P_MFGR,
       S_ADDRESS, S_PHONE, S_COMMENT
FROM MSS02.PART, MSS02.SUPPLIER, MSS02.PARTSUPP,
     ORA01.NATION, ORA01.REGION
WHERE P_PARTKEY = PS_PARTKEY
  AND S_SUPPKEY = PS_SUPPKEY
  AND P_SIZE = 15
  AND P_TYPE LIKE '%BRASS'
  AND S_NATIONKEY = N_NATIONKEY
  AND R_NAME = 'EUROPE'
  AND PS_SUPPLYCOST =
      (SELECT MIN(PS_SUPPLYCOST)
       FROM MSS02.PARTSUPP, MSS02.SUPPLIER,
            ORA01.NATION, ORA01.REGION
       WHERE P_PARTKEY = PS_PARTKEY
         AND S_SUPPKEY = PS_SUPPKEY
         AND S_NATIONKEY = N_NATIONKEY
         AND N_REGIONKEY = R_REGIONKEY
         AND R_NAME = 'EUROPE' )
ORDER BY S_ACCTBAL DESC, N_NAME, S_NAME, P_PARTKEY
FETCH FIRST 100 ROWS ONLY

Here is the access plan graph from DB2 II’s db2exfmt for this statement:

Listing 10. Access plan for sample statement #2

                                                                   Rows
                                                                  RETURN
                                                                  (   1)
                                                                   Cost
                                                                    I/O
                                                                     |
                                                                 0.046503
                                                                  NLJOIN
                                                                  (   2)
                                                               1.29299e+006
                                                                  58503.7
                                                              /------+------\
                                                          0.0186012        2.5
                                                           TBSCAN         SHIP
                                                           (   3)        (  38)
                                                        1.29297e+006     25.0283
                                                           58502.7          1
                                                              |             |
                                                          0.0186012         5
                                                            SORT      NICKNM: ORA01
                                                           (   4)        REGION
                                                        1.29297e+006
                                                           58502.7
                                                              |
                                                          0.0186012
                                                           NLJOIN
                                                           (   5)
                                                        1.29297e+006
                                                           58502.7
                                             /----------------+-----------------\
                                         0.0186012                              1
                                          NLJOIN                              SHIP
                                          (   6)                             (  35)
                                       1.29294e+006                          25.0252
                                          58501.7                               1
                         /-------------------+------------------\               |
                       1600                               1.16257e-005         25
                      TBSCAN                                 FILTER       NICKNM: ORA01
                      (   7)                                 (  21)          NATION
                      62223.3                                300.192
                      9286.9                                   12
                         |                                      |
                       1600                                     1
                       SORT                                   GRPBY
                      (   8)                                 (  22)
                      62223.1                                300.192
                      9286.9                                   12
                         |                                      |
                       1600                                     2
                       SHIP                                  HSJOIN
                      (   9)                                 (  23)
                      62221.7                                300.192
                      9286.9                                   12
       +-----------------+-----------------+         /----------+----------\
     10000            200000            800000     12.5                    4
 NICKNM: MSS02     NICKNM: MSS02     NICKNM: MSS02 SHIP                  SHIP
   SUPPLIER            PART            PARTSUPP   (  24)                (  29)
                                                  25.0629               275.127
                                                     1                    11
                                                     |              /------+------\
                                                     5            10000        800000
                                               NICKNM: ORA01  NICKNM: MSS02 NICKNM: MSS02
                                                  REGION        SUPPLIER      PARTSUPP

We can see from the graph that:

1. DB2 II is able to push down the join of SUPPLIER, PART, and PARTSUPP in the main select. The RMTQTXT for SHIP ( 9) shows that the following filters are also pushed down: (A0."P_SIZE" = 15) AND (A0."P_TYPE" LIKE '%BRASS')

2. DB2 II is able to push down the join of SUPPLIER and PARTSUPP in the sub-select. The RMTQTXT for SHIP ( 29) shows that the following filter is also pushed down: (:H0 = A0."PS_PARTKEY"), with the values for :H0 coming from the PS_PARTKEY values from SHIP ( 9).

3. Because of the joins with NATION and REGION at ORA01 in both the main select and the sub-select, DB2 II is not able to devise an access plan that pushes down a single join with filters to MSS02 that gets the results needed for the joins in both the main select and the sub-select.

4. Three temporary table structures will be created during the execution of this plan:

a. SORT ( 8), which will have a record length of 228 bytes and is estimated to have 1600 records, so it should take ninety-four 4 KB memory pages.

b. HSJOIN ( 23), which should have a record length of 16 bytes and is estimated to have 12 records, so it should take only one 4 KB memory page.

c. SORT ( 4), which will have a record length of 240 bytes and is estimated to have only 1 record, so it should require only one 4 KB memory page.

So the temporary tables are not large and should not overflow the memory available to sortheap and the bufferpool for the temporary tablespace. The primary factor affecting performance of this query is the number of interactions DB2 II will have with the data sources: 2 SHIPs to MSS02 and 3 SHIPs to ORA01.

3.2.3 Sample SQL Statement #2 with MQTs

DB2 UDB allows tables to be created that are the result of a SQL statement. These tables are called ‘materialized query tables’, or MQTs. Once an MQT is created, the optimizer has the option to use it when a user submits a query where all or part of the SQL matches the SQL used to create the MQT. The optimizer generates access plans that use the tables referenced in the user’s query and access plans that use the MQT, and picks the plan with the lowest cost estimate; the plan that uses the MQT is often chosen if the MQT has indexes like those of the tables in the SQL that created it and runstats has been run for the MQT.

With DB2 II, the MQT capability has been extended to allow MQTs to be created that include references to nicknames for data at both relational and nonrelational data sources. If an MQT references nicknames, refreshing the data in the MQT has to be done by a command that is run manually or on a regular schedule. Alternatively, if the MQT is created for a single nickname with no aggregation, DB2 SQL replication can be set up from the source table at the data source to the MQT in the federated database so that changes to the source table are automatically replicated into the MQT. The DB2 Control Center’s Cache Table wizard can set up the SQL replication from the table at the data source into the MQT in the DB2 II database.

When a user query references nicknames, and the data in one or more of the remote tables is relatively static, the performance of the user’s query can be improved without compromising the accuracy of the result by making an MQT in the federated database for the static data. The optimizer then has the option of accessing the data in the MQT on local disk.

For ‘Sample SQL statement #2’, the data in the PART, SUPPLIER, and PARTSUPP tables at the MSS02 data source is updated all the time. If an MQT were created in the DB2 II database for the data in these tables, it might be difficult to keep the data in the MQT current enough for the DB2 II users. However, data in the NATION and REGION tables at the ORA01 data source is updated infrequently, and data in an MQT for these tables can easily be kept current per the requirements of the DB2 II users. We create two MQTs in the DB2 II database:

• MQT01.ORA01_REGNAT for the join of the REGION and NATION nicknames

• MQT01.ORA01_REGION for the REGION nickname
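
The article does not show the definitions of these MQTs, so the following is only a sketch of how such tables might be created. The column lists and the index column are assumptions based on the columns used in the sample query; the table and index names are the ones that appear in this section.

```sql
-- Sketch only: column lists and the index column are assumptions.

CREATE TABLE MQT01.ORA01_REGION AS
    (SELECT R_REGIONKEY, R_NAME
       FROM ORA01.REGION)
    DATA INITIALLY DEFERRED REFRESH DEFERRED;

CREATE TABLE MQT01.ORA01_REGNAT AS
    (SELECT N_NATIONKEY, N_NAME, N_REGIONKEY, R_REGIONKEY, R_NAME
       FROM ORA01.NATION, ORA01.REGION
      WHERE N_REGIONKEY = R_REGIONKEY)
    DATA INITIALLY DEFERRED REFRESH DEFERRED;

-- Populate the MQTs from the data source; repeat manually or on a
-- schedule whenever the remote data changes.
REFRESH TABLE MQT01.ORA01_REGION;
REFRESH TABLE MQT01.ORA01_REGNAT;

-- Index to help the optimizer choose the MQT (column is assumed).
CREATE INDEX MQT01.ORA01_REGNAT_NK
    ON MQT01.ORA01_REGNAT (N_NATIONKEY);

-- Allow the optimizer to consider REFRESH DEFERRED MQTs for
-- dynamic SQL in this session.
SET CURRENT REFRESH AGE ANY;
```

Running runstats on the MQTs after the refresh gives the optimizer the statistics it needs to cost the MQT plans.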

We submit the original query, still referencing the ORA01.REGION and ORA01.NATION nicknames, but this time we get the following access plan:

Listing 11. Access plan for Statement #2 when materialized query tables are used

                                                                          Rows
                                                                         RETURN
                                                                         (   1)
                                                                          Cost
                                                                           I/O
                                                                            |
                                                                           100
                                                                         TBSCAN
                                                                         (   2)
                                                                         107050
                                                                         7459.36
                                                                            |
                                                                           100
                                                                          SORT
                                                                         (   3)
                                                                         107050
                                                                         7459.36
                                                                            |
                                                                           128
                                                                         NLJOIN
                                                                         (   4)
                                                                         107050
                                                                         7459.36
                                                            /---------------+---------------\
                                                           128                              1
                                                         HSJOIN                          TBSCAN
                                                         (   5)                          (  34)
                                                         107037                          12.872
                                                         7458.36                            1
                                          /-----------------+------------------\            |
                                         128                                  25            5
                                       NLJOIN                                FETCH    TABLE: MQT01
                                       (   6)                               (  32)    ORA01_REGION
                                       107024                               12.8927
                                       7457.36                                 1
                         /----------------+-----------------\           /------+------\
                       3200                               0.04         25            25
                       SHIP                              FILTER      IXSCAN     TABLE: MQT01
                      (   7)                             (  19)      (  33)     ORA01_REGNAT
                      106224                             325.292    0.0303773
                      7443.87                            13.4911        0
       +-----------------+-----------------+                |           |
     10000            200000            800000              1          25
 NICKNM: MSS02     NICKNM: MSS02     NICKNM: MSS02        GRPBY   INDEX: MQT01
   SUPPLIER            PART            PARTSUPP          (  20)  ORA01_REGNAT_NK
                                                         325.292
                                                         13.4911
                                                            |
                                                           0.8
                                                         HSJOIN
                                                         (  21)
                                                         325.292
                                                         13.4911
                                               /------------+------------\
                                               4                         5
                                             SHIP                      FETCH
                                            (  22)                    (  30)
                                            312.394                   12.8968
                                            12.4911                      1
                                        /------+------\           /------+------\
                                      10000        800000        25            25
                                  NICKNM: MSS02 NICKNM: MSS02  IXSCAN     TABLE: MQT01
                                    SUPPLIER      PARTSUPP     (  31)     ORA01_REGNAT
                                                              0.0303773
                                                                  0
                                                                  |
                                                                 25
                                                            INDEX: MQT01
                                                          OR01_REGNAT_NKRK

Notice that the access plan has the same complexity as the one before, when we did not have the MQTs, but in this case the optimizer will access the local MQT tables to get the data for the NATION and REGION nicknames, instead of sending SQL statements over the network to the ORA01 data source for the NATION and REGION data. The query runs faster with the MQTs.

We won’t go into the planning for MQTs in this article. They are essentially tables in a DB2 UDB database, and the principles that apply for configuring DB2 UDB tables for good performance apply to MQTs. In this case the MQTs are small: 25 records and 5 records respectively. We allowed the tables and their indexes to be created in the default user tablespace of the DB2 II database. For larger MQTs, more attention to their placement on disk and to the bufferpool that is used with them would be appropriate.

3.2.4 Sample statement #3 – union of multiple data sources

In our next example we will see how DB2 II handles unions of data at multiple data sources.

We will start with a query of a single ‘Union All’ view for nicknames for ORDERS tables at two data sources, where the ORDERS tables at both data sources have identical structures. DB201.ORDERS is the nickname for the ORDERS table at the first data source and DB202.ORDERS is the nickname for the ORDERS table at the second data source. Here is the SQL that creates the view UNN01.ORDERS:

Listing 12. SQL creating the view used in sample statement #3

create view UNN01.ORDERS as
  select O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE, O_ORDERDATE
    from DB201.ORDERS
  UNION ALL
  select O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE, O_ORDERDATE
    from DB202.ORDERS

Our initial query asks simply for the orders at both data sources where the total price was over $525,000.

Listing 13. Sample SQL statement accessing data from 2 data sources

select O_CUSTKEY, O_TOTALPRICE
  from UNN01.ORDERS
 where O_TOTALPRICE > 525000.00

The access plan for this query looks like this:

Listing 14. Access plan for sample statement #3

              Rows
             RETURN
             (   1)
              Cost
               I/O
                |
             105423
              UNION
             (   2)
             350136
              88832
        /-------+-------\
     52711.6         52711.6
      SHIP            SHIP
     (   3)          (   5)
     175068          175068
      44416           44416
        |               |
    1.5e+006        1.5e+006
  NICKNM: DB202   NICKNM: DB201
     ORDERS          ORDERS

DB2 II queries the two data sources serially; it sends the same SQL statement to both, which looks like this:

Listing 15. SQL statement sent to data sources

SELECT A0."O_CUSTKEY", A0."O_TOTALPRICE"
FROM "TPCD"."ORDERS" A0
WHERE (525000.00 < A0."O_TOTALPRICE")

If O_TOTALPRICE is indexed at the data sources, we expect the index to be used to determine whether any ORDERS records meet the criteria; otherwise the data sources will do table scans. Each data source returns data to DB2 II only if it has ORDERS meeting the criteria, and if it does, only the records meeting the criteria cross the network to DB2 II. In this case, DB2 II can send the results to the user as soon as they start arriving from the first data source.

Now let’s add some aggregation operations to the query, so it now looks like this:

Listing 16. Sample statement #3 with aggregation

select O_CUSTKEY, O_TOTALPRICE
  from UNN01.ORDERS
 where O_TOTALPRICE > 525000.00
 GROUP BY O_CUSTKEY, O_TOTALPRICE
 ORDER BY O_TOTALPRICE DESC

DB2 II’s access plan now looks like this:

Listing 17. Access plan for sample statement #3 with aggregation

              Rows
             RETURN
             (   1)
              Cost
               I/O
                |
             21084.6
              GRPBY
             (   2)
             350315
              88832
                |
             105423
             TBSCAN
             (   3)
             350306
              88832
                |
             105423
              SORT
             (   4)
             350296
              88832
                |
             105423
              UNION
             (   5)
             350136
              88832
        /-------+-------\
     52711.6         52711.6
      SHIP            SHIP
     (   6)          (   8)
     175068          175068
      44416           44416
        |               |
    1.5e+006        1.5e+006
  NICKNM: DB202   NICKNM: DB201
     ORDERS          ORDERS

DB2 II still sends the same SQL statement containing the filter for O_TOTALPRICE to both data sources, but the GROUP BY and ORDER BY are processed at the DB2 II server. DB2 II uses a transient temporary table (SORT ( 4)) to process the GROUP BY/ORDER BY. The size of this temporary table is the size of the results received from the two data sources. If the DB2 II server has enough memory to contain this temporary table, and the configuration of DB2 II’s SORTHEAP and/or buffer pool for temporary tables allow, the processing of the GROUP BY/ORDER BY can be completed in memory without any disk I/O to the file containers of DB2 II’s temporary tablespace. DB2 II will not be able to provide the first row of the result to the user until it has received the results from both data sources and processed the GROUP BY and ORDER BY.
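
If sorts of this kind are expected to be large, one option is to give the temporary tablespace its own bufferpool so that overflow from SORTHEAP can stay in memory. The following is a rough sketch only; the bufferpool name, tablespace name, size, and container path are all placeholders, and suitable sizes depend on the row counts and row lengths of the result sets.

```sql
-- Sketch only: names, size, and path are placeholders (assumptions).

-- A dedicated bufferpool of 5000 4 KB pages (about 20 MB).
CREATE BUFFERPOOL TEMPBP SIZE 5000 PAGESIZE 4K;

-- A system temporary tablespace that uses that bufferpool.
CREATE SYSTEM TEMPORARY TABLESPACE TEMPSPACE4K
    PAGESIZE 4K
    MANAGED BY SYSTEM USING ('/db2/temp4k')
    BUFFERPOOL TEMPBP;

-- SORTHEAP itself is a database configuration parameter and is
-- changed with a command rather than SQL, for example:
--   UPDATE DATABASE CONFIGURATION FOR FEDDB USING SORTHEAP 4096
```

Here FEDDB is again a placeholder database name, and 4096 is an arbitrary illustrative value.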

3.2.5 Sample statement #4 – Union of multiple data sources

We will now take the query used in the section above and add to it a second UNION ALL view, UNN01.LINEITEM, that is a union of the DB201.LINEITEM and DB202.LINEITEM nicknames for identical LINEITEM tables at the same data sources that contain the ORDERS tables for which we created the UNN01.ORDERS view. The SQL that creates the view looks like this:

Listing 18. SQL to create the view for sample statement #4

create view UNN01.LINEITEM as
  select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER,
         L_QUANTITY, L_EXTENDEDPRICE
    from DB201.LINEITEM
  UNION ALL
  select L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER,
         L_QUANTITY, L_EXTENDEDPRICE
    from DB202.LINEITEM

Our query that includes columns of both the ORDERS and LINEITEM views looks like this:

Listing 19. Sample statement #4

select O_CUSTKEY, O_TOTALPRICE, L_PARTKEY, L_EXTENDEDPRICE
  from UNN01.ORDERS, UNN01.LINEITEM
 where O_TOTALPRICE > 525000.00
   AND O_ORDERKEY = L_ORDERKEY
 GROUP BY O_CUSTKEY, O_TOTALPRICE, L_PARTKEY, L_EXTENDEDPRICE
 ORDER BY O_TOTALPRICE DESC, L_EXTENDEDPRICE DESC

As we can see, the query is now asking for the break-down by PART for the orders whose total price was greater than $525,000.00. DB2 II’s access plan for this query looks like this:

Listing 20. Access plan for sample statement #4

                                                         Rows
                                                        RETURN
                                                        (   1)
                                                         Cost
                                                          I/O
                                                           |
                                                        801154
                                                         GRPBY
                                                        (   2)
                                                     4.07986e+006
                                                     1.02414e+006
                                                           |
                                                        801154
                                                        TBSCAN
                                                        (   3)
                                                     4.07979e+006
                                                     1.02414e+006
                                                           |
                                                        801154
                                                         SORT
                                                        (   4)
                                                     4.06711e+006
                                                     1.01735e+006
                                                           |
                                                        801154
                                                         UNION
                                                        (   5)
                                                     3.97759e+006
                                                     1.01056e+006
              +-----------------------------+--------------+--------------+-----------------------------+
           200289                        200289                        200289                        200289
           HSJOIN                         SHIP                         HSJOIN                         SHIP
           (   6)                        (  11)                        (  15)                        (  20)
           994398                        994398                        994398                        994398
           252640                        252640                        252640                        252640
       /------+-------\              /------+-------\              /------+-------\              /------+-------\
 6.00122e+006      52711.6       1.5e+006     6.00122e+006   6.00122e+006      52711.6       1.5e+006     6.00122e+006
     SHIP           SHIP       NICKNM: DB201  NICKNM: DB201      SHIP           SHIP       NICKNM: DB202  NICKNM: DB202
    (   7)         (   9)         ORDERS        LINEITEM        (  16)         (  18)         ORDERS        LINEITEM
    818930         175068                                       818930         175068
    208224          44416                                       208224          44416
       |              |                                            |              |
 6.00122e+006     1.5e+006                                   6.00122e+006     1.5e+006
 NICKNM: DB201  NICKNM: DB202                                NICKNM: DB202  NICKNM: DB201
   LINEITEM        ORDERS                                      LINEITEM        ORDERS


In the access plan, SHIP (11) and SHIP (20) each send to one data source a join of that data source’s ORDERS and LINEITEM tables. This is what we wanted DB2 II to do. But the access plan also contains HSJOIN (6) and HSJOIN (15), in which DB2 II joins ORDERS and LINEITEM tables across the two data sources, even though we know there is no result from these joins. DB2 II performs these extra joins because nothing in its catalog, or in the way we structured the query, tells it that the inter-data source joins produce no result.

For queries where the desired result is a union of joins at multiple data sources, there are several techniques for giving DB2 II the information it needs to know that inter-data source joins such as those in the access plan graph above are not necessary:

1. Write the query explicitly as a union of the joins of the nicknames for the tables at the respective data sources, instead of as a join of unions as we did.

2. Add informational check constraints to all the nicknames specified in the union-all views, and include the columns of these check constraints, with values or ranges, in the WHERE clause of the query. For instance, if both ORDERS and LINEITEM had an order-date column, the range of order dates for each data source could be specified in DB2 II in informational check constraints for each of the nicknames, and the order-date columns for both ORDERS and LINEITEM, with a range, would be specified in the WHERE clause of the query.

3. Specify informational foreign key relationships between the nicknames for tables at the same data source, and an informational check constraint for the range of primary key values for the ‘parent’ nickname. Then include in the query a predicate, with a range of values, for the ‘primary key’ column of the parent nickname. In our example, we would alter the respective ORDERS and LINEITEM nicknames to add a foreign key relationship based on the order key, and would alter the respective ORDERS nicknames to add an informational check constraint for the range of O_ORDERKEY values. Then we would include O_ORDERKEY with a range of values in the WHERE clause of the query.

4. In the UNION ALL views, add a ‘Server’ column with an assigned value based on the data source for each nickname, and add that column to the join predicates of the query. For instance, in our example, in the union all views we would add these columns:

• O_SERVER in the UNN01.ORDERS union all view with assigned value ‘DB201’ for the DB201.ORDERS nickname and ‘DB202’ for the DB202.ORDERS nickname,

• L_SERVER in the UNN01.LINEITEM union all view with assigned value ‘DB201’ for the DB201.LINEITEM nickname and ‘DB202’ for the DB202.LINEITEM nickname.

In the query, we would add the additional join predicate O_SERVER = L_SERVER, as in the example below:

Listing 21. Adding an additional join predicate to the query

    select O_CUSTKEY, O_TOTALPRICE, L_PARTKEY, L_EXTENDEDPRICE
      from UNN01.ORDERS, UNN01.LINEITEM
     where O_TOTALPRICE > 525000.00
       AND O_ORDERKEY = L_ORDERKEY
       AND O_SERVER = L_SERVER
     GROUP BY O_CUSTKEY, O_TOTALPRICE, L_PARTKEY, L_EXTENDEDPRICE
     ORDER BY O_TOTALPRICE DESC, L_EXTENDEDPRICE DESC
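The text describes the ‘Server’ columns but does not show the view definitions that carry them. The following is a minimal sketch of what those definitions might look like. The ORDERS column list is an assumption (only O_ORDERKEY, O_CUSTKEY, and O_TOTALPRICE appear in the sample queries), and the existing views would have to be dropped before being recreated.

```sql
-- Sketch only: the column list for ORDERS is assumed from the sample queries.
-- If the views already exist, drop them first (drop view UNN01.ORDERS; ...).

create view UNN01.ORDERS as
  select 'DB201' as O_SERVER, O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE
    from DB201.ORDERS
  UNION ALL
  select 'DB202' as O_SERVER, O_ORDERKEY, O_CUSTKEY, O_TOTALPRICE
    from DB202.ORDERS;

create view UNN01.LINEITEM as
  select 'DB201' as L_SERVER, L_ORDERKEY, L_PARTKEY, L_SUPPKEY,
         L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE
    from DB201.LINEITEM
  UNION ALL
  select 'DB202' as L_SERVER, L_ORDERKEY, L_PARTKEY, L_SUPPKEY,
         L_LINENUMBER, L_QUANTITY, L_EXTENDEDPRICE
    from DB202.LINEITEM;
```

Because each branch of a view assigns a constant to the server column, the predicate O_SERVER = L_SERVER can only be true for branches that refer to the same data source. That is the information the optimizer needs to discard the inter-data source joins.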

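Technique 1 in the list above, writing the query as a union of joins, could be sketched as follows. This is an illustration rather than a tested statement. It returns the same result as sample statement #4 only under the assumption made in the text, namely that an order and its line items are always stored at the same data source.

```sql
-- Sketch of technique 1: each branch joins the nicknames of one data source,
-- so every join can be pushed down to that source as a whole.
select O_CUSTKEY, O_TOTALPRICE, L_PARTKEY, L_EXTENDEDPRICE
  from (select O.O_CUSTKEY, O.O_TOTALPRICE, L.L_PARTKEY, L.L_EXTENDEDPRICE
          from DB201.ORDERS O, DB201.LINEITEM L
         where O.O_TOTALPRICE > 525000.00
           and O.O_ORDERKEY = L.L_ORDERKEY
        UNION ALL
        select O.O_CUSTKEY, O.O_TOTALPRICE, L.L_PARTKEY, L.L_EXTENDEDPRICE
          from DB202.ORDERS O, DB202.LINEITEM L
         where O.O_TOTALPRICE > 525000.00
           and O.O_ORDERKEY = L.L_ORDERKEY) as U
 group by O_CUSTKEY, O_TOTALPRICE, L_PARTKEY, L_EXTENDEDPRICE
 order by O_TOTALPRICE DESC, L_EXTENDEDPRICE DESC;
```

The trade-off is that the query no longer uses the union-all views, so a new data source means editing every such query instead of two view definitions.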

3.3 Primary performance factors

When DB2 II has to access multiple data sources to process a SQL statement from a user or application, the primary performance factors are:

• Structure of the distributed data that needs to be accessed to get the result. This topic was discussed in detail in the SQL statement examples above.

• Similarity in data type and length of the join columns between the different data sources. This was also discussed above with techniques to use to achieve better performance if the join columns are dissimilar.

• Available capacity at data sources to process SQL sent by the federated server. If the application workload coming to the data sources via DB2 II is simultaneous with the original application workload of the data sources, then the capacity requirements of both workloads need to be added to find out the new capacity requirements at the data sources. Use DB2 explain tools in advance with sample queries and data to see the SQL that will be sent by DB2 II to data sources.

• Tuning at data sources for the SQL sent by the federated server; for instance, the availability of unique indexes or primary keys for the columns specified in joins and filters by the DB2 II users.

• Network speed and latency to data sources.

• Result set size.

• Capacity of the DB2 II server, since it will have to do some of the work. But the factors above will usually have greater impact than the capacity of the DB2 II server on the total response time to the user.

• Size of the tables accessed. This is a factor, but it is listed last because it is not the primary one. More critical are the size of transient intermediate result sets (sorts and temporary tables) at the DB2 II server during the execution of the query, the amount of data that crosses the network combined with network speed, and the number of rows coming from one data source that have to be used in a join with another data source. The size of the data only compounds the impact of these factors.

3.4 Hardware planning – multiple data sources per SQL statement

Processors:

With the ideal access plan, all operations except the joins between data sources are performed at the data sources, and the demands on CPU at the DB2 II server may not be great. Still, because these joins are executed at the DB2 II server, there are more demands on CPU than in the single data source case, where all work is pushed down to the data source. To process an individual query, DB2 II will use only 1 CPU; DB2 II can take advantage of a second CPU if there are multiple concurrent queries. If final result sets will be large, see the note about large result sets in the ‘processor’ comments of the previous section on ‘single data source.’

The exception to the 1-CPU-per-query rule is DB2 II v8.2 in a partitioned configuration. If the DB2 II instance is implemented with the data partitioning feature (multiple physical or logical partitions), then for queries where not all processing is pushed down to data sources, the DB2 II optimizer has the option, based on cost, to distribute the intermediate result sets and processing to multiple partitions, thereby taking advantage of the multiple CPUs and the memory for sortheap and bufferpool of each partition. One can expect that this distribution of the intermediate result sets to multiple partitions will be the ‘low-cost’ plan selected by the optimizer only if the intermediate result sets are extremely large or much CPU is required to do the processing that is not pushed down.

Memory:

We recommend more memory per concurrent federated query in the multi-data source case than in the single data source case because:

• The join between data sources has to be performed at the DB2 II server, and several of the join techniques available to DB2 II use memory. The hash-join and merge-scan join techniques use transient temporary tables in memory; the nested loop join technique does not. DB2 II’s cost-based optimizer will pick the best join technique to use for a particular query.

• In the execution of multi-data source federated queries, there is greater likelihood that transient intermediate result sets (SORT and TEMP in explain output) will be used. DB2 tries to contain these intermediate result sets in memory.

If the space required for the join technique or for any transient or temporary intermediate results exceeds the memory defined for the DB2 II database, the pages for the join technique or the intermediate result set are spilled to the file containers of the temporary tablespace. This causes disk I/O, which degrades performance. Hardware memory, and its allocation to DB2 II via the definitions for SORTHEAP/SHEAPTHRES and the bufferpool for the temporary tablespace, can reduce this spilling.

If there will be multiple users concurrently connected to the DB2 II database, allow memory for the db2agent process created for each user connection. The wrappers and data source clients are loaded into each user’s db2agent process. See the comments about concurrent connections in the ‘memory’ section of section 2, “Single data source query/update workloads”.

Disk:

If there will be large transient/temporary intermediate result sets that cannot be contained in memory, then besides the disk space for the DB2 II base, the data source client software, and the catalog of the DB2 II database, there also needs to be enough free disk space available to the temporary tablespace to contain these transient/temporary result sets. A faster disk I/O rate will also help performance, as will lack of contention with other disk I/O activity.

Network adapter:

If the structure of the queries requires DB2 II to do many interactions with data sources, or great amounts of data have to pass between the data sources and DB2 II, maximize the network throughput between the DB2 II system and the data sources, and configure the DB2 II system with a high speed network adapter that will attach to this high speed network.


3.5 Planning questions – multiple data sources

• What are the names of the data sources and what are their types (e.g. DB2 z/OS, DB2 UDB, Oracle, Sybase, etc.)?

• What percentage of queries will need data from only one data source? What percentage will need data from multiple data sources?

• What are the different combinations of data sources (by name and type) that will be combined together in queries?

• What are the different structures of the SQL statements that will be used in the multi-data source queries?

• Will join columns and filters in the SQL statements be in indexes or primary keys at the data sources?

• Will join columns be the same data type and length in the data sources?

• Will the queries join data in data marts and large operational data stores?

• Will the queries join many records from one data source with large tables at a second data source?

• How large will final result sets be? How many records? What record length?

• Will there be large transient intermediate result sets (if an option, use DB2 II explain tools with sample queries to obtain this information)?

o How much memory can be made available for sorts and temp tables?

o How much disk space needs to be available to hold large sorts and temp tables that overflow SORTHEAP and the bufferpool for tempspace?

• How many users will there be in total?

• How many will be connected to the DB2 II server concurrently?

• At what rate will queries be submitted at the peak hour?

• What will be the average response time?

• What then is the average number of queries that DB2 II will be processing concurrently?

• Will DB2 II be a standalone server, or will DB2 II be at one of the data sources, either part of a DB2 UDB system or on the same system as a non-DB2 data source?


• How fast will the network be between DB2 II and each of the data sources?

• How many routers will there be between DB2 II and each data source?
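The questions above on peak query rate, average response time, and average concurrency are related: given the first two, the third follows from Little's law. The worked example below uses hypothetical inputs (1,800 queries in the peak hour and a 6-second average response time) purely to show the arithmetic.

```latex
% Little's law: average concurrency = arrival rate x average response time.
% Inputs are hypothetical: 1,800 queries in the peak hour, 6 s average response.
\[
N \;=\; \lambda \cdot R
  \;=\; \frac{1800\ \text{queries}}{3600\ \text{s}} \times 6\ \text{s}
  \;=\; 3\ \text{concurrent queries}
\]
```

This is an average. Peaks within the hour will be higher, so the result is better treated as a lower bound when estimating memory for concurrent sorts and db2agent processes.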


4. Federated data replication

This section discusses replication from DB2 to non-DB2 and non-DB2 to DB2 using Information Integrator SQL-replication. This is the type of replication that has been supported for many years.

Federated data replication with DB2 II supports the following DB2 sources and targets:

• DB2 UDB on Linux, UNIX, Windows

• DB2 UDB for z/OS

• DB2 for iSeries

Federated data replication with DB2 Information Integrator supports the following non-DB2 targets:

• Oracle

• Sybase

• Microsoft SQL Server

• Teradata

• Informix

The following non-DB2 sources are supported:

• Oracle

• Sybase

• Microsoft SQL Server

• Informix

The new Information Integrator Q-replication will not be discussed; it only supports replication from DB2 to DB2. It will support replication to non-DB2 soon. Support for Q-Replication from non-DB2 is much further in the future.

4.1 Architecture of federated data replication

Figure 4 shows the architecture of SQL-replication from DB2 to non-DB2.


Figure 4. SQL replication from DB2 to non-DB2

Figure 5 below shows the architecture of SQL-replication from non-DB2 to DB2.

Figure 5. SQL-based replication from non-DB2 to DB2

For a detailed description of SQL-replication, see the IBM Redbook A Practical Guide to DB2 UDB Data Replication V8 (IBM publication number SG24-6828 / ISBN 0-7384-2761-6), Chapter 10, ‘Performance.’

This article only discusses planning for Apply, which is the component that actually replicates the changes from the source system to the target. We won’t discuss the performance of Capture or the Capture Triggers which put changes into staging tables at the source system.


Very little work is done by the Apply process itself. Apply creates SQL that is sent to the source server and the target server, and those servers do most of the work.

1. Apply connects to the source server (or via DB2 II to a non-DB2 source server) and sends SELECT statements to the source server which are processed by the source server; Apply receives the results into the transient Apply spill files. On Linux, UNIX, and Windows, the Apply spill files must be written to disk; on z/OS, the spill files can be in memory. Apply itself makes no changes to the data in the spill files. Any and all transformations of data per the replication mapping definitions were made by the source server when it processed the select statement sent to it by Apply. When replicating from a non-DB2 source, Apply gives the select statements meant for the source server to DB2 Information Integrator, and DB2 II pushes them down to the non-DB2 source.

2. Apply then connects to the target server (or via DB2 II to a non-DB2 target server) and takes records from the spill file and sends record-level INSERT, UPDATE, and DELETE statements to the target server, and these statements are processed by the target server. When replicating to a non-DB2 target, Apply gives the record-level INSERT, UPDATE, and DELETE statements to DB2 Information Integrator, and DB2 II pushes them down to the non-DB2 target.

Number of Apply processes:

The maximum number of Apply processes recommended is 4; if there are more than 4 Apply processes replicating simultaneously to the same target server (DB2 or non-DB2), the contention for resources at the target server will likely degrade replication performance.

Replication (Apply) throughput:

We don’t have any published benchmarks of replication with DB2 Information Integrator. A published benchmark performed in 2002 is available for DB2 UDB-to-DB2 UDB replication, where the source server was an IBM pSeries S80 8-CPU with 32GB of memory and the target server was an IBM pSeries M80 8-CPU with 32GB of memory. Apply was on the target system, so the workload on that system included not only Apply’s own small application workload but also the processing of Apply’s SQL INSERT/UPDATE/DELETEs to the target tables. Average record length was 145 bytes. One Apply achieved a throughput of 2800 change records per second. With 2 simultaneous Applys, the total replication throughput achieved was 3800 records per second. For more details, see the IBM Redbook A Practical Guide to DB2 UDB Data Replication V8 (IBM publication number SG24-6828 / ISBN 0-7384-2761-6), chapter ‘Performance,’ section ‘Development benchmarks.’

Hardware resources and the configuration at the target server and source server, and also the network speed and network configuration, affect replication throughput. You will see greater or lesser throughput than that in the cited benchmark with different hardware, network, and configuration.
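To put the benchmark figures in perspective, the arithmetic below converts them into an approximate rate of raw change data. It uses only the numbers quoted above and ignores SQL, protocol, and logging overhead, so it indicates scale rather than measuring network load.

```latex
% Raw change-data rate implied by the cited benchmark (145-byte records).
\[
\begin{aligned}
\text{1 Apply:}\quad  & 2800\ \tfrac{\text{records}}{\text{s}} \times 145\ \text{bytes}
                        = 406{,}000\ \tfrac{\text{bytes}}{\text{s}} \approx 0.41\ \text{MB/s} \\
\text{2 Applys:}\quad & 3800\ \tfrac{\text{records}}{\text{s}} \times 145\ \text{bytes}
                        = 551{,}000\ \tfrac{\text{bytes}}{\text{s}} \approx 0.55\ \text{MB/s} \\
\text{Scaling:}\quad  & \frac{3800}{2800} \approx 1.36
\end{aligned}
\]
```

The second Apply added roughly 36 percent to total throughput rather than doubling it, which is consistent with the point that contention at the target server limits what additional Apply processes can deliver.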

4.2 SQL Statements

Federated data replication using DB2 Information Integrator uses SQL to fetch change data from the replication source and to apply the changes to the target. The SQL statements are created by the Apply process based on information in the Apply control tables. Each statement created by Apply references only one nickname.

When DB2 II is used to replicate to a non-DB2 target, the SQL statements created by Apply to replicate updates and deletes include a WHERE clause that references columns that are indexed at the data source; DB2 II includes the WHERE clause in the update and delete statements it sends to the non-DB2 target. When DB2 II is used to replicate from non-DB2 sources, the SQL statement that Apply uses to fetch changes from staging tables includes a WHERE clause for the change sequence columns of the staging tables. DB2 II includes the WHERE clause for these columns in the SQL statement it sends to the non-DB2 source.
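The general shape of these statements can be sketched as follows. This is an illustration, not the literal SQL that Apply generates. The nicknames TGT01.ORDERS and SRC01.CCD_ORDERS and their data columns are hypothetical, and the IBMSNAP_ columns are assumed to follow the usual SQL replication staging table layout.

```sql
-- Illustrative shapes only; not the literal statements Apply generates.
-- TGT01.ORDERS and SRC01.CCD_ORDERS are hypothetical nicknames.

-- Replicating to a non-DB2 target: one record-level statement per change,
-- with a WHERE clause on a column that is indexed at the target.
update TGT01.ORDERS
   set O_TOTALPRICE = ?
 where O_ORDERKEY = ?;

delete from TGT01.ORDERS
 where O_ORDERKEY = ?;

-- Replicating from a non-DB2 source: fetch a range of changes from the
-- staging table, bounded by the change sequence column.
select IBMSNAP_OPERATION, IBMSNAP_COMMITSEQ, IBMSNAP_INTENTSEQ,
       O_ORDERKEY, O_TOTALPRICE
  from SRC01.CCD_ORDERS
 where IBMSNAP_COMMITSEQ > ?
   and IBMSNAP_COMMITSEQ <= ?
 order by IBMSNAP_COMMITSEQ, IBMSNAP_INTENTSEQ;
```

Since each statement references a single nickname and carries a simple indexed predicate, DB2 II can push the whole statement down to the data source.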

4.3 Primary performance factors – federated data replication:

The following are important performance factors in federated data replication configurations.

• Location of the DB2 replication Apply component in relationship to the targets, and network speed to the targets. Best performance is always achieved if Apply is on the same system as the targets so that its record-level SQL operations to apply changes to the target are done locally. When replicating to non-DB2 targets, use the Apply that comes with DB2 Information Integrator, and, if you can, install DB2 II on the non-DB2 target system. When replicating from a non-DB2 source, run Apply on the DB2 target system.

• The network from Apply to the target system. Ideally, Apply is on the same system as the replication target. This is often not the case when replicating to non-DB2 targets, and DB2 Information Integrator (which includes Apply) is on a different system. In that case, maximize the throughput capacity of the network between the DB2 II system and the non-DB2 target system.

• Available capacity at the data source to process the SQL operations that Apply sends to apply changes. If the data is updated by Apply at the same time as it is queried by the users of the targets, then capacity needs to be available for both workloads.

• Indexes on the target tables. Apply needs only one unique index (or primary key) to apply changes to the target efficiently. Additional indexes slow down replication.

• Network from Apply to the source server. This can be a factor, though it is not as important as the network from Apply to the target since Apply can fetch changes from the source in multi-record blocks but does record-level SQL operations to the target.

• The source server’s processing of Apply’s request to select changes from the staging tables. In most cases, this is done by exploiting the unique indexes on the staging tables, even if the source is non-DB2 and accessed via DB2 II. Statistics on staging tables should be updated when the staging tables are full, not right after Capture pruning.

4.4 Hardware planning for data replication

Processors:

CPU speed for the Apply process itself is not the primary hardware performance factor for Apply’s workload. A single Apply process can use only 1 CPU; a second Apply process can take advantage of a second CPU.

Memory:

The Apply process itself does not require much memory.

Disk:

Besides the minimum disk space required to install DB2 II and the data source client and to hold a DB2 II database, space must be available for the transient Apply spill files where Apply stores changes temporarily while replicating. Apply fetches changes from the staging tables at the source server, puts the result into spill files in the Apply working directory on the system where Apply is running, applies the change records from the spill files to the target tables, and then erases the spill files. Use the answers from the planning questions on the number of change records and average row length to calculate the space needed for the spill files. If there will be multiple Apply processes replicating simultaneously, you can reduce contention for resources on the Apply server by giving each Apply a separate I/O path to the disk space of its working directory. A faster disk I/O rate to the disk space of the Apply working directories will also help performance, as will lack of disk I/O contention.

Network adapter:

The network data rate to the target server will affect the turn-around time of the record-level SQL operations Apply sends to the target server. The network data rate to the source server will affect the time to fetch changes in multi-record blocks from the source server. A fast network and corresponding network adapter on the DB2 II server are recommended.
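As a rough illustration of the spill file calculation, the example below multiplies a number of change records by an average record length. Both inputs are assumptions: 500,000 change records per Apply cycle is hypothetical, and the 145-byte record length is borrowed from the benchmark cited in section 4.1. Any per-record overhead in the spill file format is not accounted for.

```latex
% Spill-file space estimate for one Apply cycle (hypothetical inputs).
\[
\text{spill space} \;\approx\; n_{\text{changes}} \times \bar{\ell}
  \;=\; 500{,}000 \times 145\ \text{bytes}
  \;=\; 72{,}500{,}000\ \text{bytes} \;\approx\; 72.5\ \text{MB}
\]
```

With several Apply processes running simultaneously, the estimate applies to each working directory separately, and it is prudent to add headroom on top of the computed figure.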

4.5 Planning questions – heterogeneous data replication

Here are some questions that will help as you begin to plan for your own DB2 II installation.

• What are the types of replication sources? What are the types of replication targets?

• If replicating to non-DB2 target, where will DB2 II be installed?

o Standalone (neither source nor target system)? On the non-DB2 target system? On the DB2 source system?

o Where will Apply run? On the DB2 II system? The source system?

o What will be the network throughput from DB2 II to the non-DB2 target? From DB2 source to the DB2 II system?

• If replicating from a non-DB2 source, where will DB2 II be installed?

o Standalone system (neither non-DB2 source nor DB2 target)? Non-DB2 source system? DB2 target system?

o Where will Apply run? DB2 target? DB2 II system?

o What is the network throughput from non-DB2 source to DB2 II system? From DB2 II system to DB2 target system?

• How often will replication occur? Continuously? At specified intervals?


o Will there be contention between replication and other workloads, such as user queries of the target tables?

• How much data needs to be replicated at the peak time? How many change records? Average record length?

• How many indexes will there be on the target tables?

• How many Apply processes will there be?

• How much disk space is needed for the Apply working directories?

5. DB2 II integration with DB2 UDB partitioned systems

DB2 Information Integrator can be added to a multi-partition DB2 UDB ESE system. When a user connects to the DB2 multi-partition system, the connection is made to a coordinator node of that system. If the user references nicknames, DB2 II/DB2 UDB makes the connection to the federated data source(s) from the coordinator node that the user is connected to. The data source client software for the data source must be installed on the computer system that has the coordinator node, as must the DB2 II relational or nonrelational wrapper that is used for access to the data source. The data source client on that computer system must be configured to access the data source. If the wrapper for accessing the data source has the option DB2_FENCED ‘Y’, the fenced mode procedure (FMP) process that contains the wrapper and the data source client software is also at the coordinator node that the user is connected to. If the DB2 UDB administrator configures multiple partitions of the DB2 multi-partition system as coordinator nodes for load balancing or other purposes, the data source client software and DB2 II wrappers must be installed on each computer system that has a coordinator node.

Starting with DB2 UDB and DB2 II 8.2 (or DB2 UDB/II 8.1 fix pack 7), if the wrapper definitions have the option DB2_FENCED ‘Y’, and a user query references both partitioned tables and nicknames, the optimizer has the option of sending data received from federated data sources to the partitions of the partitioned DB2 database, so that the join of federated data and DB2 partitioned data can be done where the partitioned data is stored. The optimizer will also evaluate access plans that move the partitioned data to the coordinator node for the join with the federated data. The optimizer decides which access plan to use based on its cost estimates for both access plans. If the plan that joins the federated data and partitioned data in the DB2 partitions is chosen, the data from the federated data sources passes through the wrapper FMP process at the coordinator node to the partitions where the DB2 data is stored.

Also, if DB2 II is integrated with a DB2 UDB partitioned system, a user query references only nicknames, and there are SQL operations that are not pushed down to the federated data source(s), the optimizer has the option to evaluate access plans that send the intermediate results from the data sources to the DB2 UDB partitions to perform the residual SQL operations in parallel. A DB2 Computational Node Group needs to be configured to enable this alternative. The optimizer will also generate access plans that do not send the intermediate result to the partitions, and it will select the plan with the lowest cost estimate. The intermediate result would have to be large, and the residual SQL operations would have to require significant CPU and/or memory buffers, for the plan that does the residual operations in the partitions to have the lowest cost estimate.


6. Scaling, load-balancing, and failover with DB2 II

When thinking about how to make a DB2 II implementation scale to support a larger number of users, first remember that when DB2 II is used efficiently, most of the processing is done at the data sources, and SQL statements and data flow over the network between the DB2 II server and the data sources. So if you increase the number of DB2 II users, consider scaling up the capacity of the data sources and the network as well.

Scaling, load-balancing, and failover for online/batch and data analysis/reporting applications

Regarding scaling the DB2 II server itself, there are 2 approaches:

• Add more processors and memory to an existing system so that it can handle more users.

• Add a second DB2 II server and implement a load-balancer (such as IBM WebSphere Edge Server Network Dispatcher) between the users and applications and the DB2 II servers.

For the latter approach, create all the federated objects (wrappers, servers, user mappings, nicknames) with scripts, and run the scripts at both servers so that both DB2 II servers will have identical definitions. Both systems must also have all the wrappers and data source client software to be able to connect to the data sources. The second approach is also the approach for implementing load-balancing and fail-over for multiple DB2 II servers. Figure 6 shows a load-balancing configuration for DB2 Information Integrator.
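A script of the kind described above might look like the following sketch for an Oracle data source. All object names (ORA01, ora01_node, REPLUSER, SCOTT) and the password are placeholders, and the server version and options would need to match the actual data source and client configuration.

```sql
-- federated_objects.sql: sketch of a script to run unchanged on each DB2 II server.
-- All names and the password below are placeholders.

-- Wrapper for Oracle data sources (NET8 is the default Oracle wrapper name).
create wrapper NET8;

-- Server definition. NODE names an entry in the Oracle client configuration,
-- so the same entry must exist on every DB2 II server.
create server ORA01 type ORACLE version 9 wrapper NET8
  options (NODE 'ora01_node');

-- Map a DB2 II authorization ID to a remote Oracle user.
create user mapping for REPLUSER server ORA01
  options (REMOTE_AUTHID 'SCOTT', REMOTE_PASSWORD 'change_me');

-- Nickname for a remote table.
create nickname ORA01.ORDERS for ORA01."SCOTT"."ORDERS";
```

After connecting to the DB2 II database on each server, such a script could be run with the DB2 command line processor, for example `db2 -tvf federated_objects.sql`. Keeping the script under version control makes it easier to confirm that both servers carry identical definitions.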

Figure 6. Load balancing configuration for DB2 Information Integrator

Fail-over for replication with DB2 II

When planning to use DB2 II to replicate from a DB2 source to a non-DB2 target, consider this approach for a fail-over configuration:

• DB2 II primary and backup servers have identical configurations of the data source client and of wrapper/server/user-mapping/nicknames in the DB2 II database.

• Apply control tables are located at the source server. Apply reads its control tables to get information about the tables that are replicated, and Apply updates its control tables with information on its progress in replicating changes from source to target.

With this configuration, if the primary DB2 II server fails, Apply at the backup DB2 II server can be started. It will replicate the same tables that were replicated by the Apply on the primary system, and it can continue the flow of changes from source to target from the point at which that flow stopped when the primary DB2 II server with its Apply process failed. Figure 7 shows the replication fail-over configuration.

Figure 7. Replication failover configuration

7. Conclusion

When planning for DB2 Information Integrator and trying to size a server system for it to run on, first determine the type of workload that will be used: batch/online, data analysis/reporting, or data replication. If the workload is batch/online or data analysis/reporting, determine whether SQL statements submitted to DB2 II will reference data at one data source or at more than one. With this basic information about the workload, the appropriate section of this article can be used to try to predict how DB2 II will use resources at the data sources, in the network, and at the DB2 II server. If possible, create a DB2 II test environment with nickname statistics and index information for the production data that DB2 II will work with, so that DB2 II’s explain tools can be used to see the access plans that DB2 II will use for the workload.

8. Other sources of information for DB2 II planning

Product manuals - online at http://www.ibm.com/software/data/db2/udb/support/manualsv8.html

• DB2 Information Integrator V8.2 Federated System Guide

• DB2 Information Integrator V8.2 Installation Guide

• DB2 Information Integrator V8.2 Data Source Configuration Guide

• DB2 Information Integrator V8.2 SQL Replication Guide and Reference


Also, the commands to create, alter, and drop wrappers, servers, user mappings, and nicknames are described in the DB2 UDB SQL Reference Volume 2 which can be found at the same website.

Redbooks – online at http://www.redbooks.ibm.com

• Data Federation with IBM DB2 Information Integrator V8.1 (SG24-7052)

• DB2 Information Integrator V8.2 Performance Monitoring, Tuning, and Capacity Planning Guide (SG24-7073)

• A Practical Guide to DB2 UDB Data Replication V8 (SG24-6828)

About the author

Micks Purnell is a Data Management Software IT Specialist in IBM’s Advanced Technical Support Americas organization. Micks has been working with DB2 Information Integrator and its predecessor products, DB2 Relational Connect and DataJoiner, since 1995. He can be reached at [email protected].
