A Syntel White Paper on Change Data Capture (CDC)


Table of Contents

1. Change Data Capture: An Introduction

2. Why CDC?

3. CDC Classifications

4. CDC Implementation

5. CDC Tools

6. CDC Considerations

7. CDC Benefits to Business

About Syntel

1 Change Data Capture: An Introduction

Change Data Capture (CDC) is the process by which one system identifies the data that has changed in another system since a previous point in time, in order to take an action based on that changed data. The system that contains the changed data is the source, and the system that identifies the changed data and performs an action is the target. The source and the target may be physically the same system, or they may be geographically separated and built on heterogeneous technologies.

CDC reduces the volume of extracted data, enabling data integration to scale, and speeds up data integration cycles, enabling fast-paced business activities. With the rapid increase in data volumes and in the pace of business, CDC is an important enabler of data integration scalability and of right-time business.

2 Why CDC?

The primary problems in data integration, data synchronization, data propagation, and ETL and data warehousing projects today are:

Exponential data growth

Ever-increasing volumes of data in operational databases put ever-increasing loads on data integration processes.

Real-time operation

Although critical data could change in a database at any moment, the majority of data integration processes run overnight, which limits how quickly and frequently a business can act on the data.

Shrinking integration windows

In companies moving toward the around-the-clock operations of a global business, traditional nightly windows for IT administration and integration are shrinking or have already closed.

CDC can overcome these problems by:

Reducing the amount of data extracted from operational databases

Nonselective methods like table dumps or storage snapshots extract too much data, which puts a load on a downstream integration server where the data is filtered to reveal the changes since the last extract. With CDC, the filtering task is distributed upstream to the data source, which results in less data to move. With less to integrate, networks and downstream servers for data transformation or data quality carry a lighter load, thereby giving the total data integration architecture more room to scale up.

Providing more frequent opportunities to act on data

Instead of waiting overnight for data integration processes to deliver information to its consumers, CDC enables business processes to act on new or changed data multiple times daily in support of time-sensitive business practices like performance management, activity monitoring, or customer service.

Replacing lengthy load windows

CDC extracts data incrementally as it appears or changes in a DBMS, thus alleviating the need to extract data in bulk during integration windows that are gradually disappearing.

3 CDC Classifications

CDC can be implemented in two ways, namely:

1. Event-driven (push) approach

2. Interval-driven (pull) approach

Event-driven (push) approach

Updates are applied in response to an event on the data source. Change capture agents identify and send changes to the target system as soon as the changes occur. This model enables customers to update their analytical applications on demand with the latest information.

Interval-driven (pull) approach

Updates are applied at regular intervals in response to requests from the target. Requests might occur every five minutes or every five days. The size of the interval is usually based on the volatility of the data and the latency requirements of the application.

4 CDC Implementation

CDC implementation can be approached in two ways, namely:

1. Home-grown solution implementation (involves coding):

a. Source timestamps (any RDBMS)

b. Source triggers (any RDBMS)

c. Database logs (Oracle 10g)

2. Tool implementation (GUI-based, with minimal coding)

Source timestamps (timestamps on rows)

Source system tables have a placeholder column holding the last-updated timestamp for each row. Every time a row is added or updated, this column should be set to the current timestamp.

Control tables are maintained by the ETL system as part of the load metadata (refer to Table 1 for the structure of this table). Before every load, regardless of frequency, a row is inserted into the control table indicating that the process has started and is currently running. The following fields are populated: process_id, process_nm, process_start_tms, record_start_tms, record_end_tms and status_flg.

If a particular process is running for the first time (identified by checking whether any records exist for this process name in the table), the record_start_tms field is loaded with an early seed date such as 01/01/1900. The record_end_tms field is loaded with the current timestamp, and the status_flg field with the value F. The process_id field is a running sequence number, the process_nm field holds the name of the process, and the process_start_tms field is loaded with the current timestamp. When the process runs for the nth time, the only difference is that record_start_tms is loaded with the value stored in the record_end_tms field of the record written by the (n-1)th run; everything else remains the same.

Once the process completes, the record that was inserted at the beginning of the process is updated: the process_end_tms field is loaded with the current timestamp, and the status_flg is overwritten with the value P, indicating that the process completed successfully.

While the process is running, at any given point in time there will be only one record in this table for that process name with status flag F and a null process end timestamp. All mappings/jobs/graphs/procedures that run as part of the process extract records from the respective source tables that changed between the record_start_tms and the record_end_tms of this process. This ensures that each load captures only the changed data from the source systems.

Load control table structure (Table 1):

Column name         Column type    Description
process_id          Number(30)     Sequence-generated number uniquely identifying a record
process_nm          Varchar(30)    Name of the process
process_start_tms   Timestamp      The time the process started (current time)
process_end_tms     Timestamp      The time the process ended (current time)
record_start_tms    Timestamp      The time from which source record changes are captured
record_end_tms      Timestamp      The time up to which source record changes are captured
status_flg          Char(1)        Process completion status (F = running, P = completed)
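A minimal SQL sketch of this pattern is shown below. The control table columns follow Table 1; the table name load_control, the sequence proc_seq, the process name CUSTOMER_LOAD and the source table src_customer with its last_upd_tms column are illustrative assumptions, and Oracle-style syntax is assumed.

  -- Start of load: register the run and compute the extraction window.
  -- On the first run record_start_tms falls back to 01/01/1900; on run n
  -- it is the record_end_tms written by run (n-1).
  INSERT INTO load_control
    (process_id, process_nm, process_start_tms,
     record_start_tms, record_end_tms, status_flg)
  SELECT proc_seq.NEXTVAL, 'CUSTOMER_LOAD', SYSTIMESTAMP,
         COALESCE((SELECT MAX(record_end_tms)
                     FROM load_control
                    WHERE process_nm = 'CUSTOMER_LOAD'
                      AND status_flg = 'P'),
                  TO_TIMESTAMP('01/01/1900', 'MM/DD/YYYY')),
         SYSTIMESTAMP, 'F'
    FROM dual;

  -- Extraction: pick up only the rows that changed inside the window.
  SELECT s.*
    FROM src_customer s, load_control c
   WHERE c.process_nm = 'CUSTOMER_LOAD'
     AND c.status_flg = 'F'
     AND s.last_upd_tms >  c.record_start_tms
     AND s.last_upd_tms <= c.record_end_tms;

  -- End of load: mark the run as successfully completed.
  UPDATE load_control
     SET process_end_tms = SYSTIMESTAMP,
         status_flg      = 'P'
   WHERE process_nm = 'CUSTOMER_LOAD'
     AND status_flg = 'F';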

Source Triggers

Triggers are created on each of the source tables from which data is extracted, covering the after-insert, after-update and after-delete scenarios.

For every source table, a new change table is created with just the primary key columns and two additional columns, action_flg and target_cd. The action_flg column is of datatype Char(1); the target_cd column is of datatype Varchar(30).

The action_flg column stores one of the following values:

I – Insert record

U – Update record

D – Delete record

The target_cd column stores the name of the target system that uses this change table. If more than one target system is interested in capturing the changes occurring in a source table, this column identifies which row belongs to which target.

Each time an operation is performed on the source table, the triggers write one or more records into the corresponding change table with the primary key values. The number of records written per change depends on the number of target systems that subscribe to change capture on the source table. Not every target system subscribes to all changes: a target can subscribe to inserts, updates and deletes, to one of these operations, or to any combination, and it can subscribe to changes in a single column, a combination of columns, or the whole row. The after-insert trigger loads I in the action_flg column, the after-update trigger loads U, and the after-delete trigger loads D. This flag identifies the change that was performed on the record.
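As an illustration, the sketch below shows the change table and the after-insert trigger for a hypothetical source table src_customer with primary key customer_id and a single subscribing target; the after-update and after-delete triggers differ only in the action_flg value they write (and a delete trigger would read :OLD rather than :NEW values).

  CREATE TABLE src_customer_chg (
    customer_id NUMBER(10)   NOT NULL,  -- primary key of the changed source row
    action_flg  CHAR(1)      NOT NULL,  -- I, U or D
    target_cd   VARCHAR2(30) NOT NULL   -- subscribing target system
  );

  CREATE OR REPLACE TRIGGER src_customer_ai
  AFTER INSERT ON src_customer
  FOR EACH ROW
  BEGIN
    -- One row per target system subscribed to inserts on this table;
    -- a single hard-coded target is shown here, but the list could
    -- equally be driven from a (hypothetical) subscription metadata table.
    INSERT INTO src_customer_chg (customer_id, action_flg, target_cd)
    VALUES (:NEW.customer_id, 'I', 'DATA_WAREHOUSE');
  END;
  /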

Similarly, for each source table and target system combination that subscribes to change capture, a view is created that joins the source table with the respective change table on the primary keys. The structure of the view is similar to that of the source table; the only additional column is action_flg, which classifies the changed record.

Once the target system has extracted and responded to the changes, it must purge those records from the change table (essentially as soon as the view has been read) to make room for subsequent change capture. The target system should purge all records in the change table for which it is the subscriber.
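Continuing the same hypothetical example, the subscriber view and the post-extract purge might look like this:

  -- View presenting changed source rows, plus the action flag,
  -- to one target system.
  CREATE OR REPLACE VIEW src_customer_dw_v AS
  SELECT s.*, c.action_flg
    FROM src_customer s
    JOIN src_customer_chg c ON c.customer_id = s.customer_id
   WHERE c.target_cd = 'DATA_WAREHOUSE';
  -- Note: rows deleted from the source no longer join back to it,
  -- so deletes have to be read from the change table directly.

  -- After the target has consumed the changes, purge its rows so the
  -- next capture cycle starts clean.
  DELETE FROM src_customer_chg
   WHERE target_cd = 'DATA_WAREHOUSE';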

Database logs: asynchronous (supported only in Oracle 10g)

In this mode, changes are captured from the database redo log files after they have been committed to the source database. This mode of CDC depends on the level of supplemental logging enabled at the source database; supplemental logging adds redo logging overhead at the source database.

Before getting into the details, we need to understand the following terms, which are used extensively in the sections that follow.

Publisher – The person or system that captures and publishes the changed data.

Subscriber – One or more applications or individuals that access the changed data.

Source database – The production database that contains the data of interest.

Change source – A logical representation of the source database.

Staging database – The database to which the captured change data is applied.

Change table – A relational table into which change data for a single source table is loaded. To subscribers, a change table is known as a publication.

Change set – A set of change data that is guaranteed to be transactionally consistent; it contains one or more change tables.

Asynchronous CDC can be achieved in three ways:

Asynchronous hotlog mode

Asynchronous distributed hotlog mode

Asynchronous autolog mode

Asynchronous hotlog mode

In the asynchronous hotlog mode, change data is captured from the online redo log file on the source database. There is a brief latency between the act of committing source database transactions and the arrival of the change data.

There is a single, predefined hotlog change source, HOTLOG_SOURCE, that represents the current online redo log files of the source database. This is the only hotlog change source and cannot be altered or dropped. Change tables for this mode of change data capture must reside locally in the source database.

Figure 1 illustrates the asynchronous hotlog configuration. The log writer process (LGWR) records committed transactions in the online redo log files on the source database. CDC uses Oracle Streams to automatically populate the change tables in the change sets within the HOTLOG_SOURCE change source as newly committed transactions arrive.

Figure 1: Asynchronous hotlog configuration
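In Oracle 10g, publishing in this mode is done through the DBMS_CDC_PUBLISH package. The sketch below is indicative rather than a complete recipe (publisher privileges, supplemental logging and parameter tuning are omitted), and all object names are hypothetical assumptions:

  BEGIN
    -- Define a change set on the predefined hotlog change source.
    DBMS_CDC_PUBLISH.CREATE_CHANGE_SET(
      change_set_name    => 'CUST_DAILY_SET',
      description        => 'Customer changes',
      change_source_name => 'HOTLOG_SOURCE',
      stop_on_ddl        => 'y');

    -- Create a change table (a publication) for one source table.
    DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE(
      owner             => 'cdcpub',
      change_table_name => 'customer_ct',
      change_set_name   => 'CUST_DAILY_SET',
      source_schema     => 'SALES',
      source_table      => 'CUSTOMER',
      column_type_list  => 'CUSTOMER_ID NUMBER(10), NAME VARCHAR2(50)',
      capture_values    => 'both',  -- capture old and new values on update
      rs_id             => 'y',
      row_id            => 'n',
      user_id           => 'n',
      timestamp         => 'n',
      object_id         => 'n',
      source_colmap     => 'n',
      target_colmap     => 'y',
      options_string    => NULL);

    -- Start capturing changes into the change set.
    DBMS_CDC_PUBLISH.ALTER_CHANGE_SET(
      change_set_name => 'CUST_DAILY_SET',
      enable_capture  => 'y');
  END;
  /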

Asynchronous distributed hotlog

In the asynchronous distributed hotlog mode, change data is captured from the online redo log file on the source database. There is no predefined distributed hotlog change source. Unlike other modes of CDC, this mode splits CDC activities and objects across the source and staging databases. Change sources are defined on the source database by the staging database publisher.

A distributed hotlog change source represents the current online redo log files of the source database. However, staging database publishers can define multiple distributed hotlog change sources, each of which contains change sets on a different staging database. The source and staging database can be on different hardware platforms and be running different operating systems.

Figure 2 illustrates the asynchronous distributed hotlog configuration. The change source on the source database captures change data from the online redo log files and uses Streams to propagate it to the change set on the staging database, which in turn populates its change tables.

Two publishers are required for this mode of change data capture: one on the source database and one on the staging database. The source database publisher defines a database link on the source database to connect to the staging database as the staging database publisher. The staging database publisher defines a database link on the staging database to connect to the source database as the source database publisher. All publishing operations are performed by the staging database publisher.

Figure 2: Asynchronous distributed hotlog configuration
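The two database links described above might be created as follows (user names, passwords and TNS aliases are hypothetical):

  -- On the source database, as the source database publisher:
  CREATE DATABASE LINK staging_db
    CONNECT TO stg_pub IDENTIFIED BY stg_pub_pw
    USING 'staging_tns';

  -- On the staging database, as the staging database publisher:
  CREATE DATABASE LINK source_db
    CONNECT TO src_pub IDENTIFIED BY src_pub_pw
    USING 'source_tns';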

Asynchronous autolog mode

In this mode, change data is captured from a set of redo log files managed by redo transport services, which control the automated transfer of redo log files from the source database to the staging database. Using database initialization parameters, the publisher configures redo transport services to copy the redo log files from the source database system to the staging database system and to register them automatically.

Asynchronous AutoLog mode can obtain change data either from the source database online redo log or from source database archived redo logs; these options are called asynchronous AutoLog online and asynchronous AutoLog archive.

With the AutoLog online option, redo transport services are set up to copy redo data from the online redo log at the source database to the standby redo log at the staging database. Change sets are populated after individual source database transactions commit. There can be only one AutoLog online change source on a given staging database, and it can contain only one change set.

With the AutoLog archive option, redo transport services are set up to copy archived redo logs from the source database to the staging database. Change sets are populated as new archived redo log files arrive on the staging database. The degree of latency depends on the frequency of redo log file switches on the source database. The AutoLog archive option has a higher latency than the AutoLog online option, but there can be as many AutoLog archive change sources as desired on a given staging database.

There is no predefined AutoLog change source. The publisher provides information about the source database to create an AutoLog change source.

Figure 3 shows a CDC asynchronous AutoLog online configuration in which the LGWR process on the source database writes redo data both to the online redo log file on the source database and to the standby redo log files on the staging database, as specified by the LOG_ARCHIVE_DEST_2 parameter. (Although the figure shows this parameter as LOG_ARCHIVE_DEST_2, the integer in the parameter name can be any value between 1 and 10.)
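For the AutoLog online option, this redo transport configuration is an initialization-parameter change on the source database. The setting below is a hedged example: the service name is hypothetical and the exact attribute list should be verified against the Oracle documentation for the release in use.

  -- Ship online redo to the staging database's standby redo logs.
  ALTER SYSTEM SET log_archive_dest_2 =
    'SERVICE=stagingdb LGWR ASYNC OPTIONAL NOREGISTER VALID_FOR=(ONLINE_LOGFILE,PRIMARY_ROLE) DB_UNIQUE_NAME=stagingdb'
    SCOPE=BOTH;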

Note that the LGWR process uses Oracle Net to send redo data over the network to the remote file server (RFS) process. Transmitting redo data to a remote destination requires uninterrupted connectivity through Oracle Net.

On the staging database, the RFS process writes the redo data to the standby redo log files. Then, CDC uses Oracle Streams downstream capture to populate the change tables in the change sets within the AutoLog change source.

The source database and the staging database must be running on the same hardware, operating system, and Oracle version.

Figure 3: Asynchronous AutoLog online configuration

Figure 4 shows a typical Change Data Capture asynchronous AutoLog archive configuration: when the redo log file switches on the source database, archiver processes archive the redo log file on the source database to the destination specified by the LOG_ARCHIVE_DEST_1 parameter and copy it to the staging database as specified by the LOG_ARCHIVE_DEST_2 parameter. (Although the figure shows these parameters as LOG_ARCHIVE_DEST_1 and LOG_ARCHIVE_DEST_2, the integer in the parameter names can be any value between 1 and 10.)
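An indicative parameter setting for the archive option follows, again with hypothetical paths and service name, and subject to verification against the Oracle documentation:

  -- Local archiving destination on the source database.
  ALTER SYSTEM SET log_archive_dest_1 =
    'LOCATION=/u01/oracle/arch MANDATORY' SCOPE=BOTH;

  -- Ship archived redo logs to the staging database.
  ALTER SYSTEM SET log_archive_dest_2 =
    'SERVICE=stagingdb ARCH OPTIONAL NOREGISTER TEMPLATE=/u02/oracle/stg_arch_%t_%s_%r.dbf'
    SCOPE=BOTH;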

Note that the archiver processes use Oracle Net to send redo data over the network to the remote file server (RFS) process. Transmitting redo log files to a remote destination requires uninterrupted connectivity through Oracle Net.

On the staging database, the RFS process writes the redo data to the copied log files. Then, Change Data Capture uses Oracle Streams downstream capture to populate the change tables in the change sets within the AutoLog change source.

Figure 4: Asynchronous AutoLog archive configuration

5 CDC Tools

Popular CDC tools available in the market are listed in Table 2.

Tool name                                  Supported sources
Informatica PowerExchange                  NetWeaver, Salesforce, Adabas, Datacom, DB2, IDMS, IMS DB, SQL Server, Oracle, VSAM
WebSphere DataStage Change Data Capture    SQL Server, Oracle, DB2, IMS
Attunity Stream                            SQL Server, Oracle, DB2, VSAM, IMS, DB2/400, Adabas, Enscribe, SQL/MP
Oracle asynchronous CDC                    Oracle 10g
Sybase ASE Real-Time Data Services         Sybase
Connx DataSync                             RMS, Oracle, DB2, Sybase, Rdb, DBMS, C-ISAM, Informix, Micro Focus, MySQL, SQL Server, VSAM, IMS, DataFlex, POWERFlex, Adabas
DataMirror                                 Oracle, UDB, DB2, SQL Server
GoldenGate                                 Oracle, SQL Server, DB2, UDB, Sybase, Enscribe, SQL/MP, SQL/MX, Teradata

Table 2: CDC tools

6 CDC Considerations

Points to be considered for a home-grown CDC solution implementation:

1. Schema changes

Changing or adding to the source system tables is not always well received by the existing system administrators. CDC by source timestamps involves a schema change to add a new timestamp field. CDC by triggers requires additional tables and views to be created in order to capture the changed data.

2. Minimal overhead on existing application

CDC, when implemented, should not degrade the existing system's performance beyond a mutually acceptable SLA. CDC by timestamps involves the additional operation of logging the timestamp in each record. CDC by triggers involves the additional operation of inserting the changed record's primary key fields into the change table for any action performed on the source table. These will marginally increase the time the existing application takes to perform the same task; with more than a million record changes a day, however, even this marginal overhead can become significant.

3. Physical vs. Virtual logging

Logging can be done in two ways. One is physical logging, i.e., copying each changed record to the log. The other is a virtual approach, which maintains a list of pointers to changed records without copying their contents. Physical logging increases the load on the source system and needs disk storage, but it is very fast for the target system to access and apply. Virtual logging is much easier on the source system but increases the load on the target system when it accesses the changes.

4. Latency

The extent to which the target systems can wait before the changed data is reflected. Some systems need this in real time, some can accept a latency of hours, and some can wait a day.

5. Cost & Time

The budget and the timeframe within which CDC has to be implemented. Tools available in the market are more expensive than home-grown solution approaches, but their implementation time is much shorter.

Key features to be considered for a CDC tool implementation:

1. Non-intrusive

2. Low operational overhead

3. Heterogeneous environments (multiple source/target environments)

4. Reliable (guaranteed delivery, failover and recoverability)

5. Performance and throughput

6. Ease of use

7. High volumes

8. Batch and near-real-time (right-time) delivery

9. Integration with ETL/EAI tools

10. Metadata management

Parties involved in the CDC decision:

Business users of the downstream system

Operational users of the upstream system

Administrator (Database) of the upstream system

Design lead of the upstream system

7 CDC Benefits to Business

The benefits that a business can derive from implementing CDC are described below.

1 Less extracted data means more data integration scalability

As data volumes swell and integration windows shrink, CDC's ability to reduce the amount of extracted data (as well as to produce data that requires less downstream processing) will be increasingly important to achieving ever-increasing scalability for data integration processes.

2 Capturing change as it occurs means businesses can react sooner

As the pace of business continues to accelerate, CDC becomes yet another option for recognizing time-sensitive events (like a customer interaction, inventory outage, or shortfall in sales), so managers can react quickly to correct a problem or leverage an opportunity.

About Syntel

Syntel is a global Applications Outsourcing and e-Business company that delivers real-world technology solutions to Global 2000 corporations. Founded in 1980, Syntel's portfolio of services includes complex application development, management, product engineering, and enterprise application integration services, as well as e-Business development and integration, wireless solutions, data warehousing, CRM, and ERP. We maximize outsourcing investments through an onsite/offshore Global Delivery Service, increasing the efficiency of how complex IT projects are delivered. Syntel's global approach also makes a significant and positive impact on speed-to-market, budgets, and quality. We deploy a custom delivery model that is a seamless extension of your IT organization to fit your business goals and a proprietary knowledge transfer methodology to guarantee knowledge continuity.

SYNTEL INC.

525 E. Big Beaver, Third Floor

Troy, MI 48083

Phone: 248.619.3503

[email protected]

© 2007 Syntel Limited

ALL RIGHTS RESERVED

Copyright in the whole and part of this Change Data Capture paper belongs to Syntel Limited. This work may not be used, sold, transferred, adapted, abridged, copied or reproduced in whole or in part in any manner or form or in any media without the prior written consent of Syntel.

December 2007
