
BLENDING ORACLE 9I ETL OPTIONS WITH PROVEN ETL TOOLS AND TECHNIQUES

Brad Cowdrey, Data Warehouse Architect, Partner, Clear Peak Solutions, LLC

Summary

Oracle 9i provides a new set of ETL options that can be effectively integrated into the ETL architecture. In order to develop the correct approach to implementing new technology in the ETL architecture, it is important to understand the components, architecture options, and best practices when designing and developing a data warehouse. With this background, each option will be explored in detail, from its syntax to its behavior and performance (where appropriate). Based on the results of the examples, combined with a solid understanding of the ETL architecture, strategies and approaches to leverage the new options in the ETL architecture will be discussed. To provide a complete view, the options will also be shown working together, building on examples throughout the paper. The Oracle 9i ETL options that will be discussed include:

• External Tables

• Multiple Table Insert

• Upsert / MERGE INTO (Add and Update Combined Statement)

• Table Functions

The information in the document is targeted for information technology managers, data warehouse developers, and data warehouse architects.

Overview of the Extract, Transform, & Load Architecture (ETL)

The warehouse architect can assemble ETL architectures in many different forms using an endless variety of technologies. Due to this fact, the warehouse can take advantage of the software, skill-sets, hardware, and standards already in place within an organization. The potential weakness of the warehouse arises when a loosely managed project, which does not adhere to a standard approach, results in an increase in scope, budget, and maintenance. This weakness may result in vulnerabilities to unforeseen data integrity limitations in the source systems as well. The key to eliminating this weakness is to develop a technical design that employs solid warehouse expertise and data warehouse best practices. Professional experience and the data warehouse fundamentals are key elements to eliminating failure on a warehouse project.

Potential problems are exposed in this document not to deliver fear or conjure the popular cliché that “warehouse projects fail.” It is simply important to understand that new technologies, such as database options, are not a replacement for the principles of data warehousing and ETL processing. New technologies should, and many times will, advance or complement the warehouse. They should make its architecture more efficient, scalable, and stable. That is where the new Oracle 9i features play nicely. These features will be explored while looking at their appropriate uses in the ETL architecture. In order to determine where the new Oracle 9i features may fit into the ETL architecture, it is important to look at ETL approaches and components.

Approaches to ETL Architecture

Within the ETL architecture two distinct, but not mutually exclusive, approaches are traditionally used in the ETL design. The custom approach is the oldest and was once the only approach for data warehousing. In effect, this approach takes the technologies and hardware that an organization has on hand and develops a data warehouse using those technologies. The second approach includes the use of packaged ETL software. This approach focuses on performing the majority of connectivity, extraction, transformation, and data loading within the ETL tool itself. However, this software comes with an additional cost. The potential benefits of an ETL package include a reduction in development time as well as a reduction in maintenance overhead.


ETL Components

The ETL architecture is traditionally designed into two components:

• The source to stage component is intended to focus the efforts of reading the source data (“sourcing”) and replicating the data to the staging area. The staging area typically comprises several schemas that house individual source systems or sets of related source systems. Within each schema, all of the source system tables are usually “mirrored”. The structure of the stage table is identical to that of the source table, with the addition of data elements to support referential integrity and future ETL processing. (A minimal stage-table sketch follows this list.)

• The stage to warehouse component focuses the effort of standardizing and centralizing the data from the source systems into a single view of the organization’s information. This centralized target can be a data warehouse, data mart, operational data store, customer list store, reporting database, or any other reporting/data environment. (The examples in this document assume the final target is a data warehouse.) This portion of the architecture should not be concerned with translation, data formats, or data type conversion. It can now focus on the complex task of cleansing, standardizing, and transforming the source data according to the business rules.
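To make the “mirroring” concrete, a minimal sketch is shown below; the source table and the extra audit columns (a source system code and a load date/time stamp) are hypothetical examples of the added data elements, not a prescribed standard.

CREATE TABLE stg_customer_mirror (
  customer_key     NUMBER(10),     -- columns identical to the hypothetical source table
  first_name       VARCHAR2(25),
  customer_status  CHAR(1),
  src_system_cd    VARCHAR2(10),   -- added: identifies the originating source system
  load_dttm        DATE            -- added: load date/time stamp supporting future ETL processing
);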

It is important to note that an ETL tool is strictly “Extract, Transform, and Load”. Separate tools or external service organizations, which may require additional cost, accomplish the work of name and address cleansing and standardization. These “data cleansing” tools can work in conjunction with the ETL packaged software in a variety of ways. Many organizations exist that are able to perform the same work offsite under a contractual basis. The task of “data cleansing” can occur in the staging environment prior to or during stage to warehouse processing. In any case, it is a good practice to house a copy of the “data cleansing” output in the staging area for auditing purposes.

The following sections include diagrams and overviews of:

• Custom source to stage,

• Packaged ETL tool source to stage,

• Custom stage to warehouse, and

• Packaged ETL tool stage to warehouse architectures.

This document assumes that the staging and warehouse databases are Oracle 9i instances hosted on separate systems.


Custom ETL – Source to Stage

Figure 1a outlines the source to stage portion of a custom ETL architecture.

[Figure 1a: Custom ETL source to stage architecture (diagram). Internal source systems and externally FTPed files feed the host system and the Oracle staging database through replication software, flat files read by SQL*Loader, KSH/CSH/PERL/SQL and other scripts, Java/C and other programs, PL/SQL procedures, and native database reads.]

Figure 1a exposes several methods of data “connections”. These methods include:

• Replicating data through the use of data replication software (“mirroring” software) that detects or “sniffs” changes from a database or file system logs

• Generating flat files by pulling or pushing data from a client program connected to the source system

• “FTPing” internal data from the source system in a native or altered format

• Connecting natively to source system data and/or files (e.g. a DB2 connection to an AS/400 file system)

• Reading data from a native database connection

• Reading data over a database link from an Oracle instance to the target Oracle staging instance, and “ftping” data from an external site to the staging host system

Other data connection options may include a tape delivered on site and copied, reading data from a queue (i.e. MQSeries), reading data from an enterprise application integration (EAI) message, reading data via a database bridge or other third party broker for data access (i.e. DB2 Connect, DBAnywhere), etc…

After a connection is established to the source systems, many methods are used to read and load the data into the staging area as described in the diagram. These methods include the use of:

• Replication software (combines read and write replication into a single software package)

• A shell or other scripting tool such as KSH, CSH, PERL, and SQL reading data from a flat file

• A shell or other scripting tool reading data from a database connection (i.e. over PERL DBI)

• A packaged or custom executable such as C, C++, AWK, SED, or Java reading data from a flat file

• A packaged or custom executable reading data from a database connection, and SQL*Loader reading from a flat file

Packaged ETL Tool – Source to Stage

Figure 1b outlines the source to stage portion of an ETL architecture using a packaged ETL tool.


[Figure 1b: Packaged ETL tool source to stage architecture (diagram). Internal source systems, replication software output, flat files, and externally FTPed files are read by the packaged ETL tool on the host system, which writes the data to the Oracle staging database.]

Figure 1b exposes several methods of data “connections” which are similar to those used in the custom source to stage processing model. The connection method with a packaged ETL tool typically allows for all of the connections one would expect from a custom development effort. In most cases, each type of source connection requires a license. For example, if connections are required to Sybase, DB2, and Oracle databases, three separate licenses are needed. If licensing is an issue, the ETL architecture typically embraces a hybrid solution using other custom methods to replicate source data in addition to the packaged ETL tool.

Connection methods include:

• Replicating data using data replication software (“mirroring” software) that detects or “sniffs” changes from the database or file system logs

• “FTPing” internal data from the source system in native or altered format

• Connecting natively to the system data and/or files (e.g. a DB2 connection to an AS/400 file system)

• Reading data from a native database connection

• Reading data over a database link from a source Oracle instance into the target Oracle staging database

• “Ftping” data from an external site to the staging host system

Other options may include a tape delivered on site and copied, reading data from a queue (i.e. MQSeries), reading data from an enterprise application integration (EAI) message / queue, reading data via a database bridge or other third party broker for data access (i.e. DB2 Connect, DBAnywhere), etc…

After a connection is established to the source systems, the ETL tool is used to read, perform simple transformations such as rudimentary cleansing (i.e. trimming spaces), perform data type conversion, convert data formats, and load the data into the staging area. Advanced transformations are recommended to take place in the stage to warehouse component and not in the source to stage processing (explained in the next section). Because the packaged ETL tool is designed to handle all of the transformations and conversions, all the work is done within the ETL server itself. Within the ETL tool’s server repository, separate mappings exist to perform the individual ETL tasks.

Custom ETL – Stage to Warehouse

Figure 2a outlines the stage to warehouse portion of a custom ETL architecture.


[Figure 2a: Custom ETL stage to warehouse architecture (diagram). Business logic is applied as PL/SQL procedures, scripts, C, Java, and other programs read from the Oracle staging database and write to the Oracle data warehouse database, either directly or via flat files and export files loaded with SQL*Loader or import.]

The ETL stage to warehouse component is where the data standardization and centralization occurs. The work of gathering, formatting, and converting data types has been completed by the source to stage component. Now the ETL work can focus on the task of creating a single view of the organization’s data in the warehouse.

This diagram exposes several typical methods of standardizing and/or centralizing data to the data warehouse. These methods include the use of a:

• PL/SQL procedure reading and writing directly to the data warehouse from the staging database (this could be done just as easily if the procedure was located in the warehouse database)

• PL/SQL procedure reading from the staging database and writing to flat files (i.e. via a SQL script)

• SQL*Plus client writing data to a flat file from stage, SQL*Loader importing files into the warehouse for loading or additional processing by a PL/SQL procedure

• An Oracle table export-import process from staging to the warehouse for loading or additional processing by a PL/SQL procedure

• Shell or other scripting tool such as KSH, CSH, PERL, or SQL reading data natively or from a flat file and writing data into the warehouse

• Packaged or custom executable such as C, C++, AWK, SED, or Java reading data natively or from a flat file and writing data into the warehouse


Packaged ETL Tool – Stage to Warehouse

Figure 2b outlines the stage to warehouse portion of a packaged ETL tool architecture.

[Figure 2b: Packaged ETL tool stage to warehouse architecture (diagram). The packaged ETL tool reads from the Oracle staging database, applies the business logic, and writes to the Oracle data warehouse database.]

Figure 2b diagrams the packaged ETL application performing the standardization and centralization of data to the warehouse all within one application. This is the strength of a packaged ETL tool. In addition, this is the component of ETL architecture where the ETL tool is best suited to apply the organization’s business rules. The packaged ETL tool will source the data through a native connection to the staging database. It will perform transformations on each record after pulling the data from the stage database through a pipe. From there it will load each record into the warehouse through a native connection to the database. Again, not all packaged ETL architectures look like this, due to many factors. Typically a deviation in the architecture is due to requirements that the ETL software cannot, or is not licensed to, fulfill. In these instances one of the custom stage to warehouse methods is most commonly used.

Business Logic and the ETL Architecture

In any warehouse development effort, the business logic is the core of the warehouse. The business logic is applied to the proprietary data from the organization’s internal and external data sources. The application process combines the heterogeneous data into a single view of the organization’s information. The logic to create a central view of the information is often a complex task. In order to properly manage this task, it is important to consolidate the business rules into the stage to warehouse ETL component, regardless of the ETL architecture. If this best practice is ignored, much of the business logic may be spread throughout the source to stage and stage to warehouse components. This will ultimately hamper the organization’s ability to maintain the warehouse solution long term and may lead to an error prone system.

Within the packaged ETL tool architecture, the centralization of the business logic becomes a less complex task. Due to the fact that the mapping and transformation logic is managed by the ETL software package, the centralization of rules is offered as a feature of the software. However, using packaged ETL tools does not guarantee a proper ETL implementation. Good warehouse development practices are still necessary when developing any type of ETL architecture.

In the custom ETL architecture, it becomes critical to place the application of business logic in the stage to warehouse component due to the large number of individual modules. The custom solution will typically store business logic in a custom repository or in the code of the ETL transformations. This is the greatest disadvantage to the custom warehouse. Developing a custom repository requires additional development effort, solid warehousing design experience, and strict attention to detail. Due to this difficulty, the choice may be made to develop the rules into the ETL transformation code to speed the time of delivery. Whether or not the decision is made to store the rules in a custom repository, it is important to have a well thought out design. The business rules are the heart of the warehouse. Any problems with the rules will create errors in the system.

It is important to understand some of the best practices and risks when developing ETL architectures to better appreciate how the new technology will fit into the architecture. With this background it is apparent that new technology or database options will not be a silver bullet for ETL processing. New technology will not increase a solution’s effectiveness nor replace the need for management of the business rules. The new Oracle 9i ETL options provide a great complement to custom and packaged ETL tool architectures.

Oracle 9i ETL Options

External Tables

Overview

External tables are a new feature in Oracle 9i that effectively combine a SQL*Loader process with a DDL construct. Because the DDL structure of the external table presents itself just like a table, all normal read-only SQL operations can be performed on the external table. This feature will provide several new ETL architectural options. These options will be explored in later sections. The DDL statements to create external tables, for delimited and fixed implementations, are listed below.

DELIMITED FILE

CREATE TABLE external_table (
  attribute1 number(10),
  attribute2 varchar2(25),
  attribute3 varchar2(3)
)
ORGANIZATION EXTERNAL
(TYPE oracle_loader
 DEFAULT DIRECTORY external_source
 ACCESS PARAMETERS
 (
   RECORDS DELIMITED BY newline
   BADFILE oracle_dir:'external_file.bad'          --optional
   DISCARDFILE oracle_dir:'external_file.discard'  --optional
   LOGFILE oracle_dir:'external_file.log'          --optional
   FIELDS TERMINATED BY ','                        --comma delimited
   OPTIONALLY ENCLOSED BY '"'                      --double quote qualified
 )
 LOCATION ('external_file_name.txt')
)
REJECT LIMIT 0  --optional
/

FIXED WIDTH FILE

CREATE TABLE external_table (
  attribute1 number(10),
  attribute2 varchar2(25),
  attribute3 varchar2(3)
)
ORGANIZATION EXTERNAL
(TYPE oracle_loader
 DEFAULT DIRECTORY oracle_dir
 ACCESS PARAMETERS
 (
   RECORDS FIXED 39  --38 characters + "\n" = 39
   BADFILE oracle_dir:'external_file.bad'          --optional
   DISCARDFILE oracle_dir:'external_file.discard'  --optional
   LOGFILE oracle_dir:'external_file.log'          --optional
   FIELDS
   (
     attribute1 CHAR(10),
     attribute2 CHAR(25),
     attribute3 CHAR(3)
   )
 )
 LOCATION ('external_file_name.txt')
)
REJECT LIMIT 0  --optional
/

The syntax of the code is fairly straightforward. The table definition in the statement is similar to the CREATE TABLE syntax with the exception that constraints are not allowed. The next section of code, starting with ORGANIZATION EXTERNAL, is standard for all external tables. The Oracle directory will need to be specified here. The directory, in this case “oracle_dir”, will need to be established prior to the creation of the external table. The owner of the external table will also require read privileges on the directory. The syntax to create a directory is as follows:

CREATE OR REPLACE DIRECTORY oracle_dir as '/oracle/external_data';
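If the external table is created in a schema other than the one that owns the directory, read access on the directory is typically granted as well (the grantee name below is hypothetical); write access is also needed when the bad, discard, or log files are written to the same directory:

GRANT READ ON DIRECTORY oracle_dir TO stage_owner;
GRANT WRITE ON DIRECTORY oracle_dir TO stage_owner;  --needed when bad, discard, and log files are written here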

All of the syntax under “ACCESS PARAMETERS” is very similar to that of SQL*Loader. For reference, the syntax of the external table “record_format_info” is listed below:


RECORDS {FIXED integer | VARIABLE integer | DELIMITED BY {NEWLINE | string}}
{ CHARACTERSET string
| DATA IS {LITTLE | BIG} ENDIAN
| STRING SIZES ARE IN {BYTES | CHARACTERS}
| LOAD WHEN condition_spec
| {NOBADFILE | BADFILE [directory object name : ] filename}
| {NODISCARDFILE | DISCARDFILE [directory object name : ] filename}
| {NOLOGFILE | LOGFILE [directory object name : ] filename}
| SKIP integer
}

The final section of the code, starting with LOCATION, identifies the name of the file to read and the reject limit configuration if applicable.

External Table Limitations

• The structure of the external table is not enforced until the time a query is issued against it. This means that only the data elements accessed in the SQL query are actually loaded by the SQL*Loader process. In other words, if an invalid data element is not accessed, it will not be caught by the SQL*Loader process. However, if a different query exposes this field, an error will occur on those records with invalid data. This could result in a loss of records depending on the “REJECT LIMIT” parameter. (A small illustration follows this list.)

• External tables are read only.

• Check constraints and indexes cannot be enforced against an external table.

• It is not possible to suppress checking for byte-order marks in an external table load where the file contains a character set of UTF8 or UTF16. (This is possible for SQL*Loader.)
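A hypothetical illustration of the first limitation: suppose one row of external_file_name.txt carries a non-numeric value in the attribute1 field. A query that never references that column completes cleanly, while a query that does reference it rejects the offending row (or fails outright, depending on the REJECT LIMIT setting).

--attribute1 is never parsed, so the invalid value goes unnoticed
select count(attribute2) from external_table;

--attribute1 is now parsed; the offending record is rejected and written to the bad file
select attribute1, attribute2 from external_table;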

External Table Notes

• Performance of any flat file import tool, including external tables, is impacted by the amount of parsing it must perform. Fixed width files do not require the level of parsing that delimited files do. Although fixed width files are larger than delimited files, they will parse and load faster. (However, the difference in performance may be negligible.)

• The external table can be queried in parallel. Just like a normal table, the external table parallel degree can be set in the DDL or in a hint added to the query.

• I recommend that the parallel degree option be withheld from the DDL. If you find that a parallel query is to your benefit, consider applying the parallel hint. This will give you additional flexibility in your code. Different data volumes of the same file layout may react differently to the parallel query and in a few cases will slow the process down.

• My experience has been that the hints are not always implemented by the database (even if it is to your benefit in the rule or cost based optimizer). To ensure parallel processing occurs, consider altering the session to force parallel DML:

ALTER SESSION FORCE PARALLEL DML;

select /*+ PARALLEL(EXTERNAL_TABLE_NAME, 3) */ * from external_table_name;

Example – Delimited Vs. Fixed

The next set of examples use files that contain identical sets of data with one million rows each. One file is delimited with commas and is double quote qualified. The other file is fixed width. Each is associated with an external table DDL structure.

DELIMITED (CSV) SAMPLE DATA

$ head customer.txt
"9100001","THOMAS","718","R","A","8761548"
"9100002","ROBERT","718","R","A","8760955"
"9100003","SUSAN","718","R","A","8845303"
"9100004","LEWIS","718","R","A","8768860"
"9100005","JOHN","718","R","A","8766615"
"9100006","FRANK","0","R","I","8761565"
"9100007","JOHN","718","R","A","8765060"
"9100008","CHIARINA","718","R","A","8757494"
"9100009","JANET","718","R","I","8758429"
"9100010","DESMOND","718","R","I","8751427"

FIXED SAMPLE DATA

$ head customer_fixed.txt
9100001   THOMAS                   718RA8761548
9100002   ROBERT                   718RA8760955
9100003   SUSAN                    718RA8845303
9100004   LEWIS                    718RA8768860
9100005   JOHN                     718RA8766615
9100006   FRANK                    0  RI8761565
9100007   JOHN                     718RA8765060
9100008   CHIARINA                 718RA8757494
9100009   JANET                    718RI8758429
9100010   DESMOND                  718RI8751427

The following are external table DDL statements that reference the delimited and fixed width files.

EXTERNAL TABLE – DELIMITED (CSV) FILE

drop table stg_customer;

CREATE TABLE STG_CUSTOMER (
  CUSTOMER_KEY    number(10),
  FIRST_NAME      varchar2(25),
  RES_AREACODE    varchar2(3),
  CUSTOMER_TYPE   char(1),
  CUSTOMER_STATUS char(1),
  PREMISE_KEY     number(10)
)
ORGANIZATION EXTERNAL
(TYPE oracle_loader
 DEFAULT DIRECTORY external_source
 ACCESS PARAMETERS
 (
   RECORDS DELIMITED BY newline
   BADFILE external_source:'customer.bad'
   DISCARDFILE external_source:'customer.discard'
   LOGFILE external_source:'customer.log'
   FIELDS TERMINATED BY ','
   OPTIONALLY ENCLOSED BY '"'
 )
 LOCATION ('customer.txt')
)
--default noparallel (parallel clause goes here)
REJECT LIMIT 1;

EXTERNAL TABLE – FIXED FILE

drop TABLE STG_CUSTOMER_fixed;

CREATE TABLE STG_CUSTOMER_fixed (
  CUSTOMER_KEY    number(10),
  FIRST_NAME      varchar2(25),
  RES_AREACODE    varchar2(3),
  CUSTOMER_TYPE   char(1),
  CUSTOMER_STATUS char(1),
  PREMISE_KEY     number(10)
)
ORGANIZATION EXTERNAL
(TYPE oracle_loader
 DEFAULT DIRECTORY external_source
 ACCESS PARAMETERS
 (
   RECORDS FIXED 51  --50 characters + "\n" = 51
   BADFILE external_source:'customer.bad'
   DISCARDFILE external_source:'customer.discard'
   LOGFILE external_source:'customer.log'
   FIELDS
   (
     CUSTOMER_KEY    CHAR(10),
     FIRST_NAME      CHAR(25),
     RES_AREACODE    CHAR(3),
     CUSTOMER_TYPE   CHAR(1),
     CUSTOMER_STATUS CHAR(1),
     PREMISE_KEY     CHAR(10)
   )
 )
 LOCATION ('customer_fixed.txt')
)
--default noparallel (parallel clause goes here)
REJECT LIMIT 1;

The next example includes a set of queries that will test the performance of aggregations against external tables. Notice that adding a filter or a group by to a basic aggregation adds very little overhead to the performance. In most cases, the entire overhead of the query lies in the SQL*Loader processing of the records (when an aggregation query is used without a join).

RESULTS EXTERNAL TABLE - CSV FILE

  1* select count(*) from stg_customer;

  COUNT(*)
----------
   1000000
Elapsed: 00:00:06.06

SQL> select /*+ parallel(stg_customer,2) */ count(*)
  2  from stg_customer;

  COUNT(*)
----------
   1000000
Elapsed: 00:00:04.01

  1* select min(customer_key), max(premise_key) from stg_customer;

MIN(CUSTOMER_KEY) MAX(PREMISE_KEY)
----------------- ----------------
          9100001         10698005
Elapsed: 00:00:07.05

SQL> select count(*) from stg_customer where customer_status = 'A';

  COUNT(*)
----------
    395334
Elapsed: 00:00:06.08

SQL> select customer_status, count(*)
  2  from stg_customer
  3  group by customer_status;

C   COUNT(*)
- ----------
A     395334
I     604665
Elapsed: 00:00:07.01

RESULTS EXTERNAL TABLE - FIXED FILE

SQL> select count(*) from stg_customer_fixed;

  COUNT(*)
----------
   1000000
Elapsed: 00:00:05.02

SQL> select /*+ parallel(stg_customer_fixed,2) */ count(*)
  2  from stg_customer_fixed;

  COUNT(*)
----------
   1000000
Elapsed: 00:00:03.01

SQL> select min(customer_key), max(premise_key) from stg_customer_fixed;

MIN(CUSTOMER_KEY) MAX(PREMISE_KEY)
----------------- ----------------
          9100001         10698005
Elapsed: 00:00:06.03

SQL> select count(*) from stg_customer_fixed where customer_status = 'A';

  COUNT(*)
----------
    395334
Elapsed: 00:00:05.07

SQL> select customer_status, count(*)
  2  from stg_customer_fixed
  3  group by customer_status;

C   COUNT(*)
- ----------
A     395334
I     604666
Elapsed: 00:00:06.01

This set of examples will test the performance of the external table query as it processes every column. The first two queries filter on the first record of each file and the last record of each file respectively. The third query tests a sort of the full data set. The last query demonstrates the external table functionality when a parallel hint is used. Note that only a sort on the full dataset adds noticeable overhead to the performance.

RESULTS EXTERNAL TABLE - CSV FILE

$ head -1 customer.txt
"9100001","THOMAS","718","R","A","8761548"

SQL> SELECT * FROM stg_customer
  2  WHERE customer_key = 9100001
  3  AND first_name = 'THOMAS'
  4  AND res_areacode = '718'
  5  AND customer_type = 'R'
  6  AND customer_status = 'A'
  7  AND premise_key = 8761548;
…
Elapsed: 00:00:07.08

$ tail -1 customer.txt
"13014586","TERRY","318","R","I","10649703"

SQL> SELECT * FROM stg_customer
  2  WHERE customer_key = 13014586
  3  AND first_name = 'TERRY'
  4  AND res_areacode = '318'
  5  AND customer_type = 'R'
  6  AND customer_status = 'I'
  7  AND premise_key = 10649703;
…
Elapsed: 00:00:07.09

SQL> SELECT * FROM stg_customer order by customer_key desc;  --full data set
…
Elapsed: 00:00:17.31

SELECT /*+ PARALLEL(STG_CUSTOMER,4) */ *
FROM stg_customer
ORDER BY customer_key DESC;
…
Elapsed: 00:00:9.18

RESULTS EXTERNAL TABLE - FIXED FILE

$ head -1 customer_fixed.txt
9100001   THOMAS                   718RA8761548

SQL> SELECT * FROM stg_customer_fixed
  2  WHERE customer_key = 9100001
  3  AND first_name = 'THOMAS'
  4  AND res_areacode = '718'
  5  AND customer_type = 'R'
  6  AND customer_status = 'A'
  7  AND premise_key = 8761548;
…
Elapsed: 00:00:06.20

$ tail -1 customer_fixed.txt
13014586  TERRY                    318RI10649703

SQL> SELECT * FROM stg_customer_fixed
  2  WHERE customer_key = 13014586
  3  AND first_name = 'TERRY'
  4  AND res_areacode = '318'
  5  AND customer_type = 'R'
  6  AND customer_status = 'I'
  7  AND premise_key = 10649703;
…
Elapsed: 00:00:06.09

SQL> SELECT * FROM stg_customer_fixed order by customer_key desc;  --full data set
…
Elapsed: 00:00:15.01

SELECT /*+ PARALLEL(STG_CUSTOMER_FIXED,4) */ *
FROM stg_customer_fixed
ORDER BY customer_key DESC;
…
Elapsed: 00:00:7.12

The ETL Architecture & External Tables

CUSTOM ETL – SOURCE TO STAGE, STAGE TO WAREHOUSE

Consider external tables as an alternative to SQL*Loader, export-import operations, and scripts or custom executables to load flat files in the custom source to stage and stage to warehouse ETL architecture. External tables, which combine SQL*Loader and the SQL query, will reduce the overhead of SQL*Loader operations, reduce the need for placing logic in the SQL*Loader job itself, and make scalability and performance easier to achieve by simply calling a parallel query (vs. breaking up SQL*Loader jobs to achieve parallel processing). New overhead, such as examining the log files produced by external tables, is required to ensure that the data was processed properly. However, this is a small compromise compared to the advantages of this new option (in regard to the custom ETL architecture).

There may be times that an external table will present a better solution than the script or custom executable for flat file loading. For existing implementations, this would require converting file system I/O code to code that processes a result set from a native Oracle database connection. In many cases, this may not make sense if the current solution is adequate and does not require future expansion. For new or add-on development, the external table may be considered as a replacement for the script or custom executable. The advantages of the external table, including overhead, scalability, and performance, should be considered when comparing it to scripting or custom code.
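As a sketch of this kind of replacement (the target table customer_stage and its load date column are assumptions for illustration; stg_customer is the external table defined earlier), a SQL*Loader control file plus a follow-on load step collapses into a single insert that can be parallelized with a hint:

--direct-path insert from the external table into a hypothetical physical staging table
INSERT /*+ APPEND */ INTO customer_stage
SELECT /*+ PARALLEL(stg_customer, 4) */
       customer_key, first_name, res_areacode,
       customer_type, customer_status, premise_key,
       SYSDATE              --load date/time stamp added during the load
FROM   stg_customer;

COMMIT;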

PACKAGED ETL TOOL – SOURCE TO STAGE

Similar to the custom solution, external tables are an alternative to using ETL tool flat file extractions. This alternative should be considered cautiously, however. The use of external tables in the packaged ETL tool source to stage ETL architecture may include the replacement of the ETL tool flat file import. Most packaged ETL tools already provide a facility to load flat files in a very rapid manner. The drawback is that not all ETL tools provide a facility to load a file in parallel when reading from a single file. In these cases where parallel processing is advantageous, it is a simple matter to divide the source file into sections for parallel processing. However, this approach will add to the overhead of batching these processes together. This may be where an external table provides an advantage, as performing a parallel query will eliminate this overhead.


Almost all ETL tools produce logs and provide error-handling capabilities on a record-by-record basis. The downside of using external tables with a packaged ETL tool is bypassing the error-handling capability within the tool. This means that code will be required to check the bad, discard, and log files for errors after external table processing. This disadvantage may outweigh the advantage of parallel processing against a single file. When loading a flat file, the performance is typically constrained by the write I/O, not the read. Reading a single file and processing it in parallel may not make sense in light of gaining marginal performance and needing to process the bad, discard, and log files of external tables. To be certain of the correct approach, small performance test cases may be examined to aid the decision.

Multiple Table Insert Statement

Overview

The multiple table insert extends SQL as a new feature in Oracle 9i. This new SQL statement goes beyond the typical INSERT SELECT statement by allowing multiple tables to be loaded from a single SQL query. Within the multiple table insert statement, two different constructs are available: the INSERT ALL construct and the INSERT FIRST construct. The INSERT ALL construct applies all records in the SELECT query to each individual WHEN condition. The INSERT FIRST construct applies the data from the SELECT query in a mutually exclusive fashion to each WHEN condition. The two types of multiple insert statements are listed below:

INSERT ALL

INSERT ALL
  WHEN condition_1 THEN
    INTO table_name_1
    INTO table_name_2 (attrib1, attrib2) VALUES (val1, val2)
  WHEN condition_2 THEN
    INTO table_name_3
  ELSE
    INTO table_name_4
SELECT val1, val2, val3
FROM table_name_0
WHERE filter_condition_0;

INSERT FIRST

INSERT FIRST
  WHEN condition_1 THEN
    INTO table_name_1
    INTO table_name_2 (attrib1, attrib2) VALUES (val1, val2)
  WHEN condition_2 THEN
    INTO table_name_3
  ELSE
    INTO table_name_4
SELECT val1, val2, val3
FROM table_name_0
WHERE filter_condition_0;

The INSERT ALL statement allows all records from the SELECT statement to be applied to every condition. However, any records that do not satisfy any of the WHEN conditions in the INSERT ALL can be caught by using the ELSE construct. Looking at the INSERT FIRST statement, data applied to meet condition_1 will not be applied to condition_2, regardless of how condition_2 is defined. This is similar to an “if / else if / else” statement.

Example – INSERT ALL vs. INSERT FIRST

This set of examples stems from the use of external tables in the previous section. The logic in the SQL query does not make sense for a real world scenario (there is no reason to place active customers in the inactive customer table). However, this approach is taken in order to demonstrate the behavior of the ELSE condition in the INSERT ALL operation.

CREATION OF ACTIVE, INACTIVE, & OTHER CUSTOMER TABLES

SQL> create table active_cust
  2  as select * from stg_customer_fixed
  3  where customer_key is null;
Table created.

SQL> create table other_cust
  2  as select * from stg_customer_fixed
  3  where customer_key is null;
Table created.

SQL> create table inactive_cust
  2  as select * from stg_customer_fixed
  3  where customer_key is null;
Table created.

--no records will exist in these tables due to the fact that the key is always populated

RESULTS – INSERT ALL

  1  INSERT ALL
  2  WHEN customer_status = 'A' THEN
  3  INTO active_cust
  4  WHEN customer_status = 'I' or customer_status = 'A' THEN
  5  INTO inactive_cust
  6  ELSE
  7  INTO other_cust
  8* SELECT * FROM STG_CUSTOMER_FIXED;
1395334 rows created.
Elapsed: 00:00:21.03

SQL> select count(*) from inactive_cust;
  COUNT(*)
----------
   1000000

SQL> select count(*) from active_cust;
  COUNT(*)
----------
    395334

SQL> select count(*) from other_cust;
  COUNT(*)
----------
         0

--no records contain a status of "B"
  1  INSERT ALL
  2  WHEN customer_status = 'B' THEN
  3  INTO active_cust
  4  ELSE
  5  INTO other_cust
  6* SELECT * FROM STG_CUSTOMER_FIXED;
1000000 rows created.

SQL> select count(*) from other_cust;
  COUNT(*)
----------
   1000000

SQL> select count(*) from active_cust;
  COUNT(*)
----------
         0

RESULTS – INSERT FIRST

  1  INSERT FIRST
  2  WHEN customer_status = 'A' THEN
  3  INTO active_cust
  4  WHEN customer_status = 'I' or customer_status = 'A' THEN
  5  INTO inactive_cust
  6  ELSE
  7  INTO other_cust
  8* SELECT * FROM STG_CUSTOMER_FIXED;
1000000 rows created.
Elapsed: 00:00:14.09

SQL> select count(*) from inactive_cust;
  COUNT(*)
----------
    604666

SQL> select count(*) from active_cust;
  COUNT(*)
----------
    395334

SQL> select count(*) from other_cust;
  COUNT(*)
----------
         0

--no records contain a status of "B"
  1  INSERT FIRST
  2  WHEN customer_status = 'B' THEN
  3  INTO active_cust
  4  ELSE
  5  INTO other_cust
  6* SELECT * FROM STG_CUSTOMER_FIXED;
1000000 rows created.

SQL> select count(*) from other_cust;
  COUNT(*)
----------
   1000000

SQL> select count(*) from active_cust;
  COUNT(*)
----------
         0

The ETL Architecture & Multiple Table Insert Statement

CUSTOM ETL – SOURCE TO STAGE, STAGE TO WAREHOUSE

The INSERT ALL/FIRST statement provides a facility to split or redundantly load data from a query into multiple target tables. There are many areas where this is beneficial. The multiple table insert statement can be used for dataset fork/split operations, full refresh + archive operations, and insert only transformation and load operations in the custom source to stage and stage to warehouse ETL architecture. These categories of processing most commonly occur within scripts, custom executables, and PL/SQL procedures.

Consider using this feature to split an incoming dataset into multiple tables, such as dividing source records into two categories: records that are required and records that are not. In warehousing, it is critical to have a full audit trail for every record in a source table in order to trace back to the source system. It is a good practice to store records that are not used by the warehouse to maintain a full audit trail versus filtering this data out altogether. The unused data can be accessed in the future if requirements change or a problem with the data occurs. A scenario may include a source system order table that contains service and sales orders. If only sales orders are required for the warehouse, the INSERT FIRST statement could be used to load the warehouse fact table while NON-sales orders are loaded into an “invalid order” table. If we find in the future that the requirements change to include both sales and service orders, we can make a modification to our INSERT FIRST code and reload the required service orders into the fact table from the “invalid order” table. (A sketch of this scenario follows.)
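The sketch below uses hypothetical table and column names (stg_order, fact_sales_order, invalid_order, order_type); a real fact load would normally include key lookups and transformations as well:

INSERT FIRST
  WHEN order_type = 'SALES' THEN
    INTO fact_sales_order (order_key, customer_key, order_amt)
    VALUES (order_key, customer_key, order_amt)
  ELSE
    INTO invalid_order (order_key, customer_key, order_type, order_amt)
    VALUES (order_key, customer_key, order_type, order_amt)
SELECT order_key, customer_key, order_type, order_amt
FROM stg_order;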

A full refresh and archive operation is a practical place to apply the multiple table insert statement. In many warehouses it is necessary to perform a “full refresh” on certain tables within the system. In some cases, this table will have an archive counterpart that houses a “snapshot” of each full refresh dataset. An example may include a monthly aggregated revenue-reporting table. This table houses revenue calculations for the last month’s business. Every time this table is truncated and reloaded, that month’s snapshot of revenue figures is placed into the full refresh table as well as the archive table. Data loaded into the archive table usually contains additional columns, such as a date stamp, to uniquely identify that snapshot. With an INSERT ALL statement we can perform all of the load operations in a single statement, as sketched below.
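A sketch of the refresh-plus-archive pattern, again with hypothetical names (rpt_monthly_revenue, rpt_monthly_revenue_arch, stg_sales_detail); the archive copy simply picks up a snapshot date during the same pass over the data:

TRUNCATE TABLE rpt_monthly_revenue;

INSERT ALL
  INTO rpt_monthly_revenue      (region_cd, revenue_amt)
  VALUES (region_cd, revenue_amt)
  INTO rpt_monthly_revenue_arch (region_cd, revenue_amt, snapshot_dt)
  VALUES (region_cd, revenue_amt, SYSDATE)          --date stamp uniquely identifies the snapshot
SELECT region_cd, SUM(sale_amt) AS revenue_amt
FROM   stg_sales_detail
GROUP BY region_cd;

COMMIT;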

Upsert Statement

Overview

The upsert / MERGE INTO statement is a new feature in Oracle 9i. This new SQL statement replaces the need to perform separate update and insert operations, and increases performance when compared to using these separate operations. The MERGE INTO statement combines the “add” and “update” logic into a single statement. The basic structure of the MERGE INTO statement and several working examples are provided below. They stem from examples in the previous sections:

UPSERT (MERGE INTO) STATEMENT

MERGE INTO target_table tgt
USING (select * from source_table) src
ON (src.pk = tgt.pk)
WHEN MATCHED THEN UPDATE SET
  tgt.attrib1 = src.attrib1,
  tgt.attrib2 = src.attrib2,
  …
  tgt.attribN = src.attribN
WHEN NOT MATCHED THEN INSERT
  (tgt.attrib1, tgt.attrib2, … tgt.attribN)
VALUES
  (src.attrib1, src.attrib2, … src.attribN);
commit;

PRE-ORACLE 9I STATEMENTS

INSERT INTO target_table tgt
SELECT * FROM source_table src
WHERE NOT EXISTS (
  SELECT 1
  FROM target_table tgt2
  WHERE tgt2.pk = src.pk
);

UPDATE target_table tgt SET
  (tgt.attrib1, tgt.attrib2, … tgt.attribN) =
  (SELECT src.attrib1, src.attrib2, … src.attribN
   FROM source_table src
   WHERE src.pk = tgt.pk);
commit;

Example – Upsert MERGE INTO vs. Traditional INSERT SELECT and UPDATE

The performance of MERGE INTO is superior to that of the traditional separate INSERT SELECT and UPDATE constructs by an unexpected margin. Notice how fast the update in the MERGE INTO statement is compared to a traditional update. In fact, the MERGE INTO statement only ran slightly slower than the INSERT INTO SELECT alone. Due to this performance increase, consider using the MERGE INTO statement as a replacement for the UPDATE statement.

RESULTS - UPSERT (MERGE INTO) STATEMENT

SQL> create unique index idx_dim_customer_pk on dim_customer(customer_key);
Index created.

--preload the dim_customer table with the
--first 500000 records from stg_customer_fixed
SQL> insert into dim_customer
  2  select * from stg_customer_fixed
  3  where rownum < 500001;
500000 rows created.
Elapsed: 00:00:27.07

  1  MERGE INTO dim_customer dim
  2  USING (select * from stg_customer_fixed) stg
  3  ON (stg.customer_key = dim.customer_key)
  4  WHEN MATCHED THEN UPDATE SET
  5  dim.FIRST_NAME = stg.FIRST_NAME,
  6  dim.RES_AREACODE = stg.RES_AREACODE,
  7  dim.CUSTOMER_TYPE = stg.CUSTOMER_TYPE,
  8  dim.CUSTOMER_STATUS = stg.CUSTOMER_STATUS,
  9  dim.PREMISE_KEY = stg.PREMISE_KEY
 10  WHEN NOT MATCHED THEN INSERT VALUES
 11  (
 12  stg.CUSTOMER_KEY,
 13  stg.FIRST_NAME,
 14  stg.RES_AREACODE,
 15  stg.CUSTOMER_TYPE,
 16  stg.CUSTOMER_STATUS,
 17  stg.PREMISE_KEY
 18  );
1000000 rows merged.
Elapsed: 00:01:21.05

SQL> select count(*) from dim_customer;
  COUNT(*)
----------
   1000000

RESULTS - TRADITIONAL INSERT SELECT AND UPDATE

SQL> create unique index idx_dim_customer_pk on dim_customer(customer_key);
Index created.

--preload the dim_customer table with the
--first 500000 records from stg_customer_fixed
SQL> insert into dim_customer
  2  select * from stg_customer_fixed
  3  where rownum < 500001;
500000 rows created.
Elapsed: 00:00:27.05

  1  INSERT INTO dim_customer dim
  2  SELECT * FROM stg_customer_fixed stg
  3  WHERE NOT EXISTS(
  4  SELECT 1
  5  FROM dim_customer dim2
  6* WHERE dim2.customer_key = stg.customer_key);
500000 rows created.
Elapsed: 00:00:56.00

  1  UPDATE dim_customer dim SET
  2  (
  3  dim.FIRST_NAME,
  4  dim.RES_AREACODE,
  5  dim.CUSTOMER_TYPE,
  6  dim.CUSTOMER_STATUS,
  7  dim.PREMISE_KEY
  8  ) =
  9  (
 10  SELECT
 11  stg.FIRST_NAME,
 12  stg.RES_AREACODE,
 13  stg.CUSTOMER_TYPE,
 14  stg.CUSTOMER_STATUS,
 15  stg.PREMISE_KEY
 16  FROM stg_customer_fixed stg
 17  WHERE stg.customer_key = dim.customer_key
 18  );
500000 rows updated.
Elapsed: 00:12:22.09

SQL> select count(*) from dim_customer;
  COUNT(*)
----------
   1000000


The ETL Architecture & Upsert (Merge Into) Statement

CUSTOM ETL – SOURCE TO STAGE, STAGE TO WAREHOUSE

The MERGE INTO statement provides a facility to add and update target tables in one optimized operation. The MERGE INTO statement becomes a key element in the application of incremental changes after transformations have been performed. Consider using the upsert statement for incremental load change application in the custom source to stage and stage to warehouse ETL architecture. This category of processing most commonly occurs within scripts, custom executables, and PL/SQL procedures.

The MERGE INTO statement becomes a key component in the staging area. Each staging table typically represents a “mirror” of the source table with additional fields, such as a create date/time stamp and a source system identifier. By simply adding the primary key and the change identification attribute (such as a date and time stamp) into the ON portion of the upsert statement, the work of inserting and updating will be performed in one operation, as sketched below.
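One hedged sketch of that staging upsert follows, with hypothetical names: stg_account mirrors a source account table, ext_account_feed is an external table over the daily extract, and last_chg_dttm is the change date/time stamp carried from the source.

MERGE INTO stg_account tgt
USING (select * from ext_account_feed) src
ON (    tgt.account_key   = src.account_key
    AND tgt.last_chg_dttm = src.last_chg_dttm )    --primary key plus change date/time stamp
WHEN MATCHED THEN UPDATE SET
  tgt.account_status = src.account_status,         --an unchanged row is simply re-applied in place
  tgt.balance_amt    = src.balance_amt
WHEN NOT MATCHED THEN INSERT
  (account_key, account_status, balance_amt, last_chg_dttm, load_dttm)
VALUES
  (src.account_key, src.account_status, src.balance_amt, src.last_chg_dttm, SYSDATE);
  --a changed source row (new change stamp) arrives as an additional version,
  --preserving the audit trail in the staging mirror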

The upsert statement becomes relevant in the warehouse after transformation processing. The MERGE INTO statement can perform a lookup via an outer join to a cross reference table to retrieve the warehouse primary key (in the USING clause). From there, any matches to the primary key in the target table can be treated as updates. All other records, where the cross reference table did not surface a key, are inserted into the target table as new records. This example will be explored in more detail in the “Bringing It Together” section.
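In the meantime, one hedged reading of the cross-reference pattern is sketched below; the dimension, cross reference table, and sequence are all hypothetical, and surrogate key generation is shown with a sequence only for brevity.

MERGE INTO dim_product dim
USING (
  SELECT xref.dim_product_key,                        --NULL when the source product has not been seen before
         stg.product_cd,
         stg.product_name
  FROM   stg_product       stg,
         product_key_xref  xref
  WHERE  stg.product_cd = xref.source_product_cd (+)  --outer join to the cross reference table
) src
ON (dim.dim_product_key = src.dim_product_key)
WHEN MATCHED THEN UPDATE SET
  dim.product_name = src.product_name                 --key surfaced: treat as an update
WHEN NOT MATCHED THEN INSERT
  (dim_product_key, product_cd, product_name)
VALUES
  (product_key_seq.NEXTVAL, src.product_cd, src.product_name);  --no key surfaced: insert as a new record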

Oracle has provided the custom ETL system with a real solution to the incremental load process. Combining the MERGE INTO statement with table functions and external tables will allow a true “source to transformation to target” approach.

Table Functions

Overview

Table functions extend PL/SQL and are a new option in Oracle 9i. This option allows a function to accept, process, and return multiple rows. For each record read by a table function, zero, one, or many records may be returned. These functions can act as an independent transformation. This is possible because they are defined without static cursors for input and output tables. This is an advantage over a pre-Oracle 9i procedure. The code of a table function and the objects that must be created in order to support it will be explored in the following example.

Example – Pipelined Table Function

This example takes an incoming “dim_customer” record-set and outputs two records for every record in the input. A query against the external table stg_customer_fixed (from earlier sections) will be used as the input to the table function. The output from the function is targeted for the fact_areacode table. The first output record will contain the customer_key and the second will contain the premise_key. This output record-set acts just like a query result-set. Note the new object dependencies that now require management when deploying this type of function.


TABLE FUNCTION – SUPPORTING OBJECTS & CODE

--create a TYPE to support the output record
SQL> create type type_fact_areacode as object
  2  (key number, type varchar2(20), areacode varchar2(3));
Type created.

--create an object that acts like a table of records with the new TYPE
--used for the return of output records
SQL> create TYPE typeset_fact_areacode as TABLE OF type_fact_areacode;
Type created.

--create a reference cursor TYPE to support the input record
SQL> create or replace package pkg_areacode IS
  2  TYPE refcursor_areacode IS REF CURSOR RETURN dim_customer%ROWTYPE;
  3  end pkg_areacode;
  4  /
Package created.

--create the table function
SQL> CREATE OR REPLACE FUNCTION
  2  create_areacodefact(input_cursor pkg_areacode.refcursor_areacode) --the new REF CURSOR input
  3  RETURN typeset_fact_areacode --the new typeset for output
  4  PIPELINED --allows processing as data is piped into the function (doesn't wait for the fetch to complete)
  5  PARALLEL_ENABLE(PARTITION input_cursor BY ANY) --allows parallel query input via the REF CURSOR
  6  IS
  7  input_record input_cursor%ROWTYPE;
  8  --must initialize variable return_record to avoid a runtime error
  9  return_record type_fact_areacode := type_fact_areacode(NULL,NULL,NULL);
 10
 11  BEGIN
 12  LOOP
 13  FETCH input_cursor INTO input_record;
 14  EXIT WHEN input_cursor%NOTFOUND;
 15  return_record.KEY := input_record.CUSTOMER_KEY; --creating first record for the cust key
 16  return_record.TYPE := 'CUSTOMER';
 17  return_record.AREACODE := input_record.RES_AREACODE;
 18  PIPE ROW (return_record); --output record one
 19  return_record.KEY := input_record.PREMISE_KEY; --creating second record for the premise key
 20  return_record.TYPE := 'PREMISE';
 21  return_record.AREACODE := input_record.RES_AREACODE;
 22  PIPE ROW (return_record); --output record two
 23  END LOOP;
 24  RETURN;
 25  END create_areacodefact;
 26  /
Function created.

Now that the table function is created, a query will be used with the new TABLE keyword combined with the new CURSOR construct to produce the desired result-set. Notice that the CURSOR query is passed to the function using the REF CURSOR typeset that was created above.

RESULTS - TABLE FUNCTION QUERY PROCESSING

$ head -5 customer_fixed.txt
9100001   THOMAS                   718RA8761548
9100002   ROBERT                   718RA8760955
9100003   SUSAN                    718RA8845303
9100004   LEWIS                    718RA8768860
9100005   JOHN                     718RA8766615

SQL> select *
  2  from TABLE(create_areacodefact(
  3  cursor(select * from stg_customer_fixed where rownum < 6)));

       KEY TYPE                 ARE
---------- -------------------- ---
   9100001 CUSTOMER             718
   8761548 PREMISE              718
   9100002 CUSTOMER             718
   8760955 PREMISE              718
   9100003 CUSTOMER             718
   8845303 PREMISE              718
   9100004 CUSTOMER             718
   8768860 PREMISE              718
   9100005 CUSTOMER             718
   8766615 PREMISE              718

10 rows selected.
Elapsed: 00:00:00.02

CUSTOM ETL – SOURCE TO STAGE, STAGE TO WAREHOUSE

The intent of the table function is to provide a facility to hand off record-sets to a PL/SQL function, which in turn can output a record-set. This allows processing to occur through a pipe without having to use temporary or intermediary tables. Table functions are the final element necessary to allow the creation of a full-featured ETL solution. Consider using table functions as transformation components in the custom source to stage and stage to warehouse ETL architecture. This category of processing most commonly occurs within scripts, custom executables, and PL/SQL procedures.

The table function is the component that allows the development of transformation “modules”. These modules can be applied to any dataset that conforms to the record input specification. A transformation can output zero, one, or many records for every record input. Although packaged ETL tools have always provided this type of functionality, these types of transformations are now available within the database for a custom ETL solution.

The drawback to be aware of with table functions is their dependent supporting objects. These objects define the input and output specification for the table function. They will require the same management as your code and must be deployed together with the table function.

For existing implementations, the decision to change your code over to table functions should be approached carefully. Because the overall ETL design approach may change when using table functions, adding these functions as a replacement for existing functionality may cause an increase in the development time and the maintenance overhead.

For add-on or new development efforts, table functions have a lot to offer when combined with all of the options explored in this paper. Note that undertaking this type of effort may require a minor adjustment to the design approach due to the nature of these options. Testing code samples is advised, as early as your planning phase, in order to fully comprehend the behavior and potential bugs of this new Oracle technology. This will provide the architect with the necessary experience to appropriately scope the effort.


Bringing It Together

After gaining a sense of what each ETL option is capable of, it is important to see the whole picture with the appropriate options working together.

Stemming from the examples in the previous sections, the following is provided to demonstrate an “end to end” custom ETL scenario.

EXAMPLE – COMBINING THE EXTERNAL TABLE, MULTIPLE TABLE INSERT, MERGE INTO, AND TABLE FUNCTION OPTIONS

SQL> truncate table fact_areacode
  2  /
Table truncated.

SQL> create table invalid_fact_areacode
  2  as select * from fact_areacode
  3  /
Table created.

--insert just the records where type = 'CUSTOMER' into the fact_areacode table
--send all other records to the invalid_fact_areacode table
  1  INSERT FIRST
  2  WHEN TYPE = 'CUSTOMER' THEN
  3  INTO fact_areacode
  4  ELSE
  5  INTO invalid_fact_areacode
  6  select *
  7  from TABLE(create_areacodefact(
  8  cursor(
  9* select * from stg_customer_fixed)))
SQL> /
2000000 rows created.

--there are 1,000,000 records each in the fact_areacode and invalid_fact_areacode tables
SQL> select count(*) from fact_areacode;
  COUNT(*)
----------
   1000000

SQL> select count(*) from invalid_fact_areacode;
  COUNT(*)
----------
   1000000

--perform a merge into to load all records, via update and insert, into the fact_areacode table
  1  MERGE INTO fact_areacode fact
  2  USING (select *
  3  from TABLE(create_areacodefact(
  4  cursor(
  5  select * from stg_customer_fixed)))) stg
  6  ON (stg.KEY = fact.key)
  7  WHEN MATCHED THEN UPDATE SET
  8  fact.TYPE = stg.TYPE,
  9  fact.AREACODE = stg.AREACODE
 10  WHEN NOT MATCHED THEN INSERT VALUES
 11  (
 12  stg.KEY,
 13  stg.TYPE,
 14  stg.AREACODE
 15* )
SQL> /
2000000 rows merged.

--all records from the merge loaded properly
SQL> select count(*) from fact_areacode;
  COUNT(*)
----------
   2000000

Conclusion

A look at the best practices and risks when developing ETL architectures allows for a better understanding of how to leverage new technology in the warehouse development effort. With this understanding, it is apparent that the new database options will not be a silver bullet for ETL processing. However, the new Oracle 9i ETL options provide a complement to the custom and packaged ETL tool architectures. The packaged ETL tool architecture will benefit from the external table in some cases. Although the other options provide solid new capabilities, almost all packaged ETL software currently provides similar functionality. On the other hand, all of the new Oracle 9i ETL options become key elements in the custom ETL architecture. The custom ETL architecture can now perform true source to stage and stage to warehouse operations with greater flexibility, performance, and scalability. Careful planning and design will make the use of these new options in the ETL architecture successful.

Acknowledgements

All technical information on Oracle 9i ETL options was gathered from the Oracle Technology Network (otn.oracle.com). All code examples and test data were derived by the author. Text and opinions regarding data warehouse development and the ETL architecture are original compositions based on the author’s experiences. A warm thank you goes out to Hasan Hicsasmaz, Kim Lau, Marc Tormey, Ed Peters, Jeff Meyer, and Chanda Cowdrey for their exhaustive review and input.

About the Author

Brad Cowdrey is a Data Warehouse Architect and Partner with Clear Peak Solutions, LLC in Littleton, Colorado. Mr. Cowdrey has over twelve years of professional experience in the IT industry, with leadership expertise in the delivery of enterprise data warehouse and data driven solutions. He works with every level of management, from Executive to Project Management, in his professional consulting roles as software architect, project manager, and engagement manager. His technical expertise spans many technologies, with strength in Relational Database Technology, Extract, Transform, and Load (ETL) Software, OLAP Front End Reporting Software, Campaign Management Software, Data Mining Software, Data Modeling Software, Enterprise Customer Relationship Management (CRM) Packages, Object Oriented Technology, and Shell Scripting Technology. Brad can be reached via email at [email protected].

About Clear Peak Solutions, LLC

Clear Peak is an information technology services company based in Littleton, Colorado, dedicated to the delivery of data warehouse and data driven solutions. Clear Peak focuses on Data Warehouse Solutions, Executive Information System (EIS), Balanced Scorecard / Executive Dashboard, On Line Analytical Processing (OLAP), Data Integration, Data Mining, and Data Processing Solutions. These solutions allow Clear Peak clients to achieve real financial results by combining their data, strategy, and key resources with Clear Peak’s delivery experience and expertise. Clear Peak can be reached through their website at www.clear-peak.com or via email at [email protected].

About Rocky Mountain Oracle Users Group (RMOUG)

RMOUG is one of the largest Oracle user groups in the world, with over 1,400 members. RMOUG offers general membership meetings, a professional newsletter, an annual training event, and an information-packed World Wide Web site. Members include professional analysts, project managers, database administrators, developers, and designers who work with Oracle products to produce high-quality business solutions. RMOUG is an alliance partner with the International Oracle Users Group - Americas. RMOUG is a not-for-profit organization incorporated in Colorado.

