DB2 Information Integrator Version 8.1
Liz Drolet, April 1, 2004
Integration is a strategic priority
Different Types of Integration Required
"Legacy" Apps
Information Integration
Process Integration
User Interaction
Consumers
Trading Partners
Service Providers
Suppliers
B2B Portal
B2C Site
Value Chain
Heterogeneous Environments
New e-business Applications
Supply Chain Management
Customer Relationship Management
Service Provider Integration Into
ERP and HR Systems
Product Development Management
Application Connectivity
Build to Integrate
What is Information Integration?
• Information integration refers to a category of middleware that lets applications access data as though it were in a single database, whether or not it is. It enables the integration of data and content sources to provide real-time read and write access, to transform data for business analysis and data interchange, and to manage data placement for performance, currency, and availability.
DB2 Information Integrator
• Federation
  – Define Integrated Access across Diverse and Distributed Data
  – Transparent Access to Distributed, Disparate Data, Query as a Single Source
• Replication
  – Consolidation and Placement of Distributed, Disparate Data
History of DB2 federated database products
DataJoiner (V1, V2)
  – Heterogeneous Data Access
  – Distributed Joins
  – Heterogeneous Replication
  – Spatial Data Support
Parallel Edition V1, Common Server V2
DB2 UDB V5
  – Object Relational
  – OLAP, OLTP and BI
  – Full Parallelism
  – Integrated Tools
  – Web Integration
DB2 UDB V7
  – Heterogeneous Data Access (Read-only)
  – Limited Data Sources
  – Distributed Joins
  – Enhanced O-R support
  – OLAP, OLTP, and BI
  – Spatial Data Support
DB2 UDB & II V8
  – Heterogeneous Data Access (Read and Write)
  – Heterogeneous Replication
  – Most Data Sources: Relational, Non-relational, Life Sciences
  – Custom Wrappers
  – Materialized Query Tables over Relational Nicknames
  – Net Search Extender over Relational Nicknames
  – Web Services
  – MQ
  – and more
IBM's Approach to Data Federation
Functions
• Transparent
  – Appears to be one source
• Heterogeneous
  – Integrates data from diverse sources
  – Relational, Structured, XML, messages, Web, …
• Extensible
  – Federate almost any data source.
  – Development tooling provided
• High Function
  – Full query support against all data
  – Capabilities of sources as well
• Autonomous
  – Non-disruptive to data sources, existing applications, systems.
• High Performance
  – Optimization of distributed queries
What Problems Does Federation Address?
• Real-time or near real-time access to rapidly changing data is required.
• Direct immediate write access to the original data is required.
• It is technically difficult to use copies of the source data.
• The cost of copying the data exceeds that of accessing it remotely.
• It is illegal or forbidden to make copies of the source data.
• The users’ needs are not known in advance.
Federated Database Architecture
Federated Database
Server
Data
Relational Data Source Data
Global Catalog
SQL API(JDBC/ODBC)
Wrappers
00001|SONY|Television|... 00002|RCA|VideoPlayer|.. 00004|SONY|DVDPlayer
00003|SONY|VideoRecorder.......
Database Application
SELECT I.man, count(*)
FROM transactions T, items I
WHERE I.id = T.item_id
  AND I.category = 'Television'
  AND YEAR(T.tran_date) = 2001
GROUP BY I.man;
SELECT tran_date, item_id
FROM transactions
ITEMS
TRANSACTIONS
List the number of TV sales per manufacturer in 2001
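The decomposition shown in this diagram can be imitated in a few lines. What follows is only an illustrative Python sketch, not DB2 code: an in-memory SQLite database stands in for the relational source holding TRANSACTIONS, a pipe-delimited string stands in for the ITEMS flat file (only the first three fields shown on the slide are used, since the rest are elided), and the transaction rows are invented for the example. The helper names `load_items`, `make_remote_source`, and `tv_sales_per_manufacturer` are hypothetical.

```python
import sqlite3
from collections import Counter

# Flat-file stand-in: id|manufacturer|category (first three fields from the slide).
ITEMS_FILE = """00001|SONY|Television
00002|RCA|VideoPlayer
00004|SONY|DVDPlayer
00003|SONY|VideoRecorder"""


def load_items(text):
    """Parse pipe-delimited item records, roughly what a flat-file wrapper exposes."""
    items = {}
    for line in text.splitlines():
        item_id, man, category = line.split("|")
        items[item_id] = (man, category)
    return items


def make_remote_source():
    """Stand-in relational source holding TRANSACTIONS (rows are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE transactions (tran_date TEXT, item_id TEXT)")
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?)",
        [
            ("2001-03-14", "00001"),
            ("2001-07-02", "00001"),
            ("2000-12-24", "00001"),
            ("2001-05-05", "00004"),
        ],
    )
    return conn


def tv_sales_per_manufacturer(conn, items, year="2001"):
    # Remote fragment from the slide: only the needed columns are fetched.
    rows = conn.execute("SELECT tran_date, item_id FROM transactions").fetchall()
    counts = Counter()
    for tran_date, item_id in rows:
        man, category = items.get(item_id, (None, None))
        # Local work: join on item id, filter by category and year, group by manufacturer.
        if category == "Television" and tran_date.startswith(year):
            counts[man] += 1
    return dict(counts)


result = tv_sales_per_manufacturer(make_remote_source(), load_items(ITEMS_FILE))
print(result)
```

With the invented rows above, this prints `{'SONY': 2}`. In a federated system the application would submit only the single SQL statement; the fetch, join, filter, and grouping steps written out here are the kind of work the federated server plans on the application's behalf.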
Flexible Access with Standard APIs
• SQL
  – Familiar language with widely deployed skills
  – Rich analytical capabilities
  – Traditional database clients
  – Extensions for XML data (SQL/XML)
• XML
  – Emerging standard for interchange
  – XML extensions to SQL language
  – XQuery - XML Query Language (future)
    • Based on a formal algebra
    • IBM is co-submitter of XML Query specification (http://www.w3.org/TR/xquery)
  – Exploit unique features within XML data model: hierarchy, sequence
  – Web services
Financial Services Sector to spend $8.3B (US) on XML and Web Services by 2005 (ZapThink, March 2002)
Federated Access to Diverse Data
Reaching Relational Data Sources
Application: WebSphere, Query Tool, Other vendor SW (C, Java, Cobol, RPG; JDBC/ODBC, Embedded SQL)
DB2 interface: DDF - z/OS; RDB - iSeries; DB2 RT Client - UNIX, Linux, Windows
Data Source: DB2 (z/OS, iSeries, LUW (UDB), VM and VSE), Oracle, Sybase, Informix, Teradata, SQL Server, ODBC Sources
DB2 instance: federated server
Wrappers: DRDA, NET8, CTLIB, INFORMIX, TERADATA, DJXMSSQL3, ODBC
Client: DB2 Connect - built-in, Oracle client, Sybase Open Client, Informix Client SDK, Teradata CLI, ODBC Drivers
Operating system: Windows 2000, AIX, Solaris, HP-UX, Linux/Intel
DB2 database: wrapper, server, user mappings, nicknames
Config File/Directory: DB2 DB Directory, tnsnames.ora, interfaces, sqlhosts, /etc/hosts, odbc.ini/System DSN
• Relational wrappers require client libraries/configurations.
Reaching Non-Relational Data Sources
• Same architecture as for Life Sciences Data Connect
Application: WebSphere, Query Tool, Other vendor SW (C, Java, Cobol, RPG; JDBC/ODBC, Embedded SQL)
DB2 interface: DDF - z/OS; RDB - iSeries; DB2 RT Client - UNIX, Linux, Windows
DB2 instance: federated server
Wrappers: Flatfile, XML, BLAST, Excel, Documentum, HMMER, Entrez, BioRS, Extended Search
Operating system (check for specific wrappers): Windows 2000, AIX, Solaris, HP-UX, Linux/Intel
DB2 database: wrapper, server, user mappings, nicknames
Data sources and clients:
  – Table-Structured File
  – XML file
  – Excel 97/2000, Excel file
  – Dctm client, Documentum data store
  – HMMER daemon, HMMER data source
  – NCBI Website
  – BioRS Server
  – ES client, ES Server
  – BLAST daemon, BLAST data source
local DB2 for "scratch" temp tables
Comparing performance of distributed queries in an application with and without DB2 Federated
DB2 Federated Server
Data sources: Oracle, Excel/ODBC, DB2
Federated Application: connection to Federated server
Non-Federated Application: connection to all individual data sources
Comparing performance of distributed queries with and without DB2 Information Integrator
• Without DB2 Federated: Application connects to each source, issues SQL in its dialect, retrieves appropriate data from each, inserts into local temporary tables, and processes query locally
• With DB2 Federated: Application connects only to Federated server and submits queries against nicknames to several sources
  – Federated manages the decomposition and processing of the query
  – Can also create join and union views over nicknames to make multiple remote tables appear as one to the application
• Results w/J2EE servlets issuing queries involving 3 remote data sources
  – 40 - 65% less code with DB2 II
  – 50 - 65% less time required to develop with DB2 II
Query   Time with Federated   Time without Federated
1       3.5 sec               3.4 sec
2       0.24 sec              0.16 sec
3       54.2 sec              170.1 sec
4       6.5 sec               81.2 sec
5       15.1 sec              9.9 sec
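To make the timing table easier to read, the ratios can be worked out directly. The short Python sketch below uses only the figures from the table and assumes the column order given in its header (time with Federated first, time without second); the names `TIMINGS` and `speedup` are introduced here for illustration.

```python
# (query number, seconds with Federated, seconds without Federated), from the table.
TIMINGS = [
    (1, 3.5, 3.4),
    (2, 0.24, 0.16),
    (3, 54.2, 170.1),
    (4, 6.5, 81.2),
    (5, 15.1, 9.9),
]


def speedup(with_fed, without_fed):
    """Ratio above 1 means the federated run was faster."""
    return without_fed / with_fed


ratios = {query: round(speedup(w, wo), 2) for query, w, wo in TIMINGS}
faster_with_federation = [query for query, ratio in ratios.items() if ratio > 1]

for query, ratio in ratios.items():
    print(f"query {query}: {ratio}x")
print("faster with Federated:", faster_with_federation)
```

Under that reading, queries 3 and 4 run roughly 3 and 12 times faster through the federated server, while queries 1, 2, and 5 are somewhat slower, which fits the slide's emphasis on reduced code and development time rather than uniformly faster execution.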
Replication architecture for integrating information
Sources: IMS, DB2/zOS, DB2/iSeries, DB2/UDB, Sybase, SQL Server, IBM Informix, Oracle, ANY source
Capture: LOG based, Trigger based
Staging tables
Federated Engine
Targets: DB2/zOS, DB2/iSeries, DB2/UDB, Sybase, SQL Server, IBM Informix, Oracle, Teradata, External application
Combined with the federation engine -> a powerful integration tool!
What Problems Does Replication Address?
• Read-only access to reasonably stable data is required.
• Users need historical or trending data.
• Data access performance or availability are overriding requirements.
• Users routinely want quick data access, necessitating that a local, pre-processed copy of the data be made available.
• Users' needs are repeatable and can be predicted in advance.
• Transformations or joins needed are complex or long-running.
Federated Concepts
• Wrapper: the wrapper code module itself
• Server: a specific data source, e.g. a database on a DB2 instance
• User Mapping: information needed to connect to a specific server
• Passthru Session: a special mode that allows users to submit SQL statements directly to a data source
• Nickname: a specific data set managed by a server, mapped to rows and columns in DB2 UDB
• Index Specification: an index catalog entry for a nickname
• Type Mapping: a mapping between a data source type and a DB2 UDB data type
• Function Mapping: a mapping between a data source function and a DB2 UDB function
• Function Template: a virtual function definition for a data source function that cannot be executed on DB2 UDB
• Option: an additional attribute specific to each source to customize an object
Global Catalogs
Object               Catalog Views
wrapper              syscat.wrappers (wraptype='R'/'N' for Relational/Non-relational wrapper)
                     syscat.wrapoptions
server               syscat.servers
                     syscat.serveroptions
user mapping         syscat.useroptions
nicknames            syscat.tables (type='N'), sysstat.tables
                     syscat.taboptions
                     syscat.columns, sysstat.columns
                     syscat.coloptions
                     syscat.indexes, sysstat.indexes
                     syscat.indexoptions
                     syscat.keycoluse
Global Catalogs (continued)
Object               Catalog Views
index specification  syscat.indexes, sysstat.indexes
type mapping         syscat.typemappings (mappingdirection = 'F'/'R')
function template    syscat.functions, sysstat.functions
                     syscat.routines, sysstat.routines (replacing sys*.functions in DB2 V8)
function mapping     syscat.funcmappings
                     syscat.funcmapoptions
                     syscat.funcmapparmoptions
passthru privilege   syscat.passthruauth
Cost-based Optimization
• SERVER
  – Server Type/Version
  – Server Options
  – CPU Ratio, IO Ratio
  – Commrate
• Physical Properties: Federated system configuration
• Query Properties: Optimization class, data distribution, operators used, available alternatives, cost models, FIRST N ROWS?
• Statistics: SYSTABLES, SYSCOLUMNS, SYSINDEXES
  – Object Statistics
  – Column Statistics
  – Index Statistics
• Non-Relational Wrapper
  – Wrapper Plans
  – Cost Models
Query Compiler Flow for Distributed Queries
1. Parse Query
2. Check Semantics
3. Query Rewrite
4. Relational Data Source Pushdown Analysis OR Non-Relational Wrappers
5. Cost-Based Plan Selection
6. Remote SQL Generation
7. Code Generation
Relational Data Source Pushdown Analysis
• Determine what portion of a query can be executed outside DB2 for relational sources
• In most cases, does not dictate how much actually gets pushed down to the data source
• What matters:
  – data source capabilities
    • e.g. Can it handle a join operation?
  – characteristics of the data
    • e.g. Can the sorting sequence affect predicates on a column?
  – function mappings
    • e.g. Can this source evaluate COUNT(*)?
  – and more
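The idea behind pushdown analysis can be sketched as a capability check. The Python below is a toy model only, under the assumption that a query and a source can each be reduced to a set of operation names; it says nothing about how DB2 actually represents capabilities, and `pushdown_analysis`, `query_ops`, and `source_caps` are hypothetical names.

```python
def pushdown_analysis(required_ops, source_capabilities):
    """Split a query's operations into those the source could run and those it cannot.

    Both arguments are plain sets of operation names (a deliberate simplification).
    """
    pushable = sorted(op for op in required_ops if op in source_capabilities)
    local_only = sorted(op for op in required_ops if op not in source_capabilities)
    return pushable, local_only


# Hypothetical source that can join and count but has no mapping for YEAR().
query_ops = {"JOIN", "COUNT(*)", "YEAR()"}
source_caps = {"JOIN", "COUNT(*)", "GROUP BY"}

pushable, local_only = pushdown_analysis(query_ops, source_caps)
print("candidates for pushdown:", pushable)
print("must run at the federated server:", local_only)
```

The result is only a list of candidates: consistent with the slide, whether a pushable operation is actually sent to the source is a separate, cost-based decision.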
A Simple Join Query
Show me: the number of TV sales per manufacturer in year 2001
SQL to DB2:
SELECT I.man, count(*)
FROM transactions T, items I
WHERE I.id = T.item_id
  AND I.cat = 'Television'
  AND YEAR(T.tran_date) = 2001
GROUP BY I.man;
Global Plan
RETURN
   |
GROUP BY
   |
SORT
   |
Hash Join
   /                          \
SHIP (Relational Query)       Evaluate cat='Television'
   |                             |
Nickname: TRANSACTIONS        RPD (Non-Relational Plan)
                                 |
                              Nickname: ITEMS
A Frequently Asked Question
• If I am joining two nicknames (N1 and N2) from the same source, does the optimizer always push down this join to the source to execute?
  – The answer is IT DEPENDS.
  – If possible, the optimizer will generate the following:
    • Remote join plans
    • Local join plans
      – different join sequence (N1 as outer, N2 as outer)
      – different join methods (nested loop join, merge join, hash join)
      – other ordering requirements
• The cheapest final query plan that satisfies the cost criterion will win unless DB2_MAXIMAL_PUSHDOWN is set to 'Y'.
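The selection rule on this slide can be illustrated with a small Python sketch. It is a simplification under stated assumptions: the `Plan` class, the cost figures, and the `pushed_ops` counts are invented for the example, and treating DB2_MAXIMAL_PUSHDOWN='Y' as "prefer the plan that pushes the most operations" is a rough stand-in for the real behavior, not a description of the optimizer's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    cost: float       # estimated total cost, arbitrary units (made up)
    pushed_ops: int   # operations evaluated at the remote source (made up)


def choose_plan(plans, maximal_pushdown=False):
    if maximal_pushdown:
        # Simplified stand-in for DB2_MAXIMAL_PUSHDOWN='Y': favor the plan that
        # pushes the most work to the source; cost only breaks ties.
        return max(plans, key=lambda p: (p.pushed_ops, -p.cost))
    # Default behavior: the cheapest plan wins.
    return min(plans, key=lambda p: p.cost)


candidates = [
    Plan("remote join", cost=900.0, pushed_ops=3),
    Plan("local hash join, N1 outer", cost=400.0, pushed_ops=2),
    Plan("local merge join, N2 outer", cost=650.0, pushed_ops=2),
]

cost_based = choose_plan(candidates)
forced = choose_plan(candidates, maximal_pushdown=True)
print("cost-based choice:", cost_based.name)
print("maximal pushdown choice:", forced.name)
```

With these invented numbers the cost-based choice is a local hash join even though both nicknames live on the same source, while the maximal-pushdown setting selects the remote join regardless of its higher estimated cost.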
Actual pushdown is cost-based
• Just because processing can be pushed down doesn't mean it will be.
• Influenced by estimates of rows processed/returned.
  – Example: two-table join with nicknames ORA.T1 and ORA.T2 to a single remote source that is "nearly" a Cartesian product. It may be better to do the join at the Federated server to avoid retrieval of many rows.
  – Example: Retrieving (10,000 + 25) rows to do a local join is faster than retrieving a (10,000 * 25) = 250,000-row remote join result.
SELECT .... from ORA.T1, ORA.T2 where T1.a = T2.b
ORA.T1: 25 rows
ORA.T2: 10,000 rows
Single remote Oracle source
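The arithmetic behind this example can be checked in a few lines. The Python sketch below counts only rows shipped across the network, using the table sizes from the slide and treating the remote join as a full Cartesian product (the slide says "nearly"); the function names are introduced here for illustration, and a real optimizer weighs more than row counts.

```python
def rows_shipped_for_local_join(rows_t1, rows_t2):
    """Both tables are fetched in full and joined at the federated server."""
    return rows_t1 + rows_t2


def rows_shipped_for_remote_join(rows_t1, rows_t2):
    """The join runs remotely; a Cartesian-like result is shipped back."""
    return rows_t1 * rows_t2


T1_ROWS = 25       # ORA.T1
T2_ROWS = 10_000   # ORA.T2

local_rows = rows_shipped_for_local_join(T1_ROWS, T2_ROWS)
remote_rows = rows_shipped_for_remote_join(T1_ROWS, T2_ROWS)
print(f"local join ships {local_rows} rows; remote join ships {remote_rows} rows")
```

Shipping 10,025 rows instead of 250,000 is why the local join can win here even though the source is capable of performing the join itself.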
Highlights: Federated Features in DB2 II
• Transparent DDL
  – Allows CREATE TABLE to be issued on DB2 to create a remote table and a nickname for this remote table in one step.
• Insert/Update/Delete against relational nicknames
  – Supported operations include:
    • insert with values, insert with subselect
    • searched updates/deletes
    • positioned updates/deletes
  – Used by Heterogeneous replication in DPropR.
• Materialized Query Tables over relational nicknames
  – Allows previously cached query results to be used to answer queries.
• LOB retrieval for all relational wrappers
• LOB Insert/Update/Delete for Oracle
Federated Features in DB2 II (continued)
• Expanded data source support
  – ODBC R/W wrapper
  – Teradata wrapper
  – XML wrapper
  – HMMER wrapper
  – Entrez for Nucleotide and PubMed wrapper
  – BioRS wrapper
  – Extended Search wrapper
• Expanded operating system support
  – In addition to AIX® and Windows NT, servers that use Linux, HP-UX, Solaris™ Operating Environment, and Windows 2000 operating systems can now be used as DB2 federated servers.
• Enhanced net8 wrapper for Oracle
• Garlic wrapper planning technology
  – Allows non-relational wrappers to be written easily
  – 'Quirky' data sources can be modeled more accurately.
• Non-relational Life Sciences wrappers
  – LSDC wrappers are all now using Garlic wrapper planning technology.
Federated Features in DB2 II (continued)
• Net Search Extender: text search against relational nicknames
  – Useful for those sources that do not support text searches
  – Text indexes can be defined over relational nicknames on DB2 II
• MQ UDF with transaction support (1PC)*
  – Content of a message queue can be recovered on rollback request
• MQ Listener*
• DB2 as Web Service Provider/Consumer*
• Enhanced Control Center GUI for Federated System*
• Heterogeneous Replication
* Included in DB2 UDB v8
Materialized Query Tables over Relational Nicknames
Remote data source
Local data
optimization
DB2 Federated Server
nickname
MQT
• MQTs can be defined over combinations of local data and relational nicknames.
• Users can indicate whether MQTs can be considered for query evaluation.
Using XML Wrapper
Federated Database Server
Data
Relational Data Source Data
Global Catalog
SQL API(JDBC/ODBC)
Wrappers
Database Application
SELECT C.name,
       GSE.ST_DISTANCE(geocode(C.address), geocode(S.address), 'mile')
FROM Customers C, Orders O, Items I, Stores S
WHERE C.cid = O.cid
  AND O.oid = I.oid
  AND I.desc = 'TV'
SELECT address FROM stores
CUSTOMERS, ORDERS, ITEMS
STORES
<doc>
  <customer id='123'>
    <name>...</name>
    <address>...</address>
    <order>
      <amount> ...</amount>
      <item quant=1>
        <desc> ...</desc>
        ....
    </order>
    ....
</doc>
Display the distance between each customer who bought a TV and each store
Extended Search: Unstructured access
• IBM Extended Search
  – Brokered search architecture for searching thousands of existing data sources
  – Results are aggregated, ranked, and returned in a single hit list
  – Easily embeddable into any application
  – Lotus databases, document systems, full text indexes, e-mail, directories, WWW, syndicated content, relational, file systems
• Combined with federation
  – Generated search arguments
  – Sophisticated ordering
  – XML document generation
Web Content Mgmt
Taxonomy & Indices
ILES Server
Content Management
Broker
Document Repositories
Relational Databases
Agent
Internet
Web Crawler
User Defined Functions Extend Access
• Provide additional logic which is invoked as part of SQL processing
• Can return either scalar, row, or table results
• Can be used to compose standard views
• Simple to develop and configure
• Can exploit parallelism
• Built-in UDFs for WebSphere MQ, OLE DB, Web Services
SELECT MQSend(followup.service, a.custid || ' ' || a.ordid)
FROM account a
WHERE a.status = 'overdue'
Increase Messaging System Value
• Simplify integration between database and messaging systems
– Brings message queue access into a familiar paradigm for database programmers
• Analyze message queue data with standard analytical software
– Virtual or physical queue snapshots can be analyzed using standard SQL-based analytical tools
• Compose and publish XML messages
– A single query can access diverse and distributed data, compose a complex document, validate the document against a DTD or XML schema, and publish it to a message queue
INSERT INTO PENDING_ORDERS
  SELECT t.msg FROM table(MQRECEIVEALL()) AS t;
Access to Web Services
• Integrate SQL statements and Web Service invocations
• One statement can access local and federated data and web services
• Support for generating SQL scalar and table UDFs based on WSDL web service description
  – Command line version
  – Tooling integrated into WebSphere Studio
Web
Airline Fare
Language Translate
Currency Rate
Temperature
Stock Quote
Service Providers
SELECT city, GetTemperature(city)
FROM location
Administration
Wrappers, Data servers, Data objects, Transformations
Views, Transformations, Topology
Configure, Rollout, Administer/Tune, Monitor
• Control Center
  – Tools to configure and administer standard wrappers
  – Plug-in architecture allows custom wrappers to be administered
Wrapper Administration Tools: Plug-in architecture
Administration Tools
Sybase, Oracle, SQL Server, DB2, Informix, ODBC, Teradata
Wrappers which support discovery: HMMER, Entrez, XML, Flat File, Excel, Extended Search
"Create Nicknames" window
Launches customized GUI
Returns Nickname definitions
Customized "Discover" GUI
Discovery for Nicknames
Replication Administration
• Definitions
  – Manage control definitions for replication
  – Customize names and sizes of objects
• Operations
  – Start Capture, Apply, Monitor, Analyzer, and Trace
  – Issue commands such as STOP or STATUS
• Monitoring
  – Perform static and dynamic monitoring
DB2 Information Integrator Considerations
• Not designed for OLTP-like transaction throughput: a federated layer adds cost while masking complexity
  – Suited to decision support style queries
• Updates support 1 phase commit
  – Roll back one data source at a time, when DB2 is transaction coordinator (no support for external TM)
  – Customers who require two-phase commit should postpone migrating from DB2 DataJoiner
• DB2 II should not be installed on top of partitioned DB2 ESE (“EEE”)
  – Environment requires care: federated processing will be performed from a single node
• Some wrappers not available on all platforms (client availability)
Summary: II Federated Technology
• Open infrastructure for integrating diverse and distributed data for real time access
  – Structured, semi-structured, and unstructured data, standards-based extensibility
  – Leverage native source capabilities
  – Query engine provides optimized cross-source queries
• An extension of industrial-strength technology
  – Robust, high function, high performance
  – Benefit from R & D in relational DBMS
  – Benefit from experience in content management
• Eases application development
  – Transparency, heterogeneity, high function
  – APIs for modern environments (XML, J2EE, Web Services)
• Application Autonomy
  – Does not disrupt existing applications
DB2 Information Integrator Packaging
II Advanced Edition
• Full Function DB2 ESE
• Net Search Extender
II Standard Edition
• Heterogeneous Federation
• Caching
• Limited Use DB2 ESE
•••
•••
II Replication Edition
• Heterogeneous Replication
• Limited Use DB2 ESE
DB2 ESE
• Full Function DBMS
• Includes II Foundation for DB2
• Homogeneous Federation & Replication to DB2 & IDS
• XML store
• DB2 II Advanced Edition Unlimited
  – Per CPU charge
  – No Connector Charges
• DB2 II Advanced Edition
  – Per CPU
  – Per Connector
• DB2 II Standard Edition
  – Per CPU
  – Per Connector
• DB2 II Replication Edition
  – Per CPU
  – Per Connector
• DB2 II Developer's Edition
  – Per Developer
  – Not for Production Deployment
Note: 1 Connector License is required for each remote relational database instance or for unlimited access to one non-relational data type (see announcement letter for details)
Connector Licensing
• DB2 UDB, Informix IDS, Informix XPS, WebSphere MQ, Web services, and OLE DB - provided with base server license.
• Microsoft Excel - access to one or more Excel documents that reside on one or more servers.
• Flat files - access to all flat files that reside on one or more servers.
• XML - access to all XML documents that reside on one or more servers.
• Extended Search - access to all data sources accessible through Extended Search.
• Documentum - access to one repository of Documentum documents.
• Relational databases - access to only one database instance of Oracle, Microsoft SQL Server, Sybase or Teradata databases, or any one data source instance accessed using the ODBC connector.
• BLAST - access to all data sources accessible through one BLAST daemon.
• Entrez - access to all supported data sources accessible through Entrez. Access to PubMed and Nucleotide databases is supported by DB2 Information Integrator.
• HMMER - access to all data sources accessible through one HMMER daemon.
• BioRS - access to all the databanks supported by a single installation of BioRS.
For More Information
• DB2 Information Integrator Home Page
  – http://www-3.ibm.com/software/data/integration/db2ii/
• IBM Systems Journal on Information Integration (Vol. 41, No. 4, 2002)
  – Online: http://www.research.ibm.com/journal/sj41-4.html
  – Order from Publication Center: G321-0147-00
• Announcement Letters
  – DB2 Information Integrator Version 8.1: 203-134
  – DataJoiner, Relational Connect Withdrawal from Marketing: 903-106
• Whitepaper: "Creating a flexible infrastructure for integrating information"
  – http://www-3.ibm.com/software/data/pubs/pdfs/ii.pdf
• DB2 Information Integrator on Xtreme Leverage
  – http://d25web1.torolab.ibm.com/db2info/dminflib.nsf/dm/integration
• Announcing IBM DB2 Information Integrator 8.1 General Availability
  – DB2 INFObrief #1020