London - New York - Dubai - Hong Kong - Mumbai 2011
Technical Data Migration in Six Steps by
Wael O. Elrifai, July 2011
Confidential - not for redistribution
Abstract
Data migration is often dismissed as a simple step between retiring an existing platform and bringing a new one to life. The reality is that we are constantly migrating data between systems, implementing new applications, setting up co-existence scenarios, and sharing information in ways that could be thought of as data migration.
The actual task of data migration is a complex one, requiring experience and a detailed understanding of the steps to be taken, both in terms of business process and technical implementation. This includes a clear delineation of responsibility between IT and the business: far too often the job of deciding how data should be manipulated is relegated to IT, well outside of its core competency. Enabling the migration requires a team experienced in abstracting the technical details so that the business team can implement its rules for migration without ever writing a single line of code; the coding itself remains a task for IT.
This document focuses on technical architecture rather than process. We describe an architecture
that we’ve found to be ideal in dozens of large data migration scenarios at organizations ranging
from global Fortune 500 companies to local small businesses. This architecture is portable between
different software toolsets; we’ve implemented it in hand-coded systems as well as open source and
best-of-breed ETL/DQ solutions.
This document is aimed at anyone who’s about to embark on a data migration task, or who is trying
to rescue an existing project that is “off-plan”. To gain the most from this document, you need only a rudimentary understanding of the terminology of data handling, terms such as ETL (Extract, Transform, and Load) and Reference Data (e.g. ISO currency codes, lists of valid postcodes).
Introduction
The Peak solution for data migration involves a six-stage process designed to isolate and validate, at the lowest level, elemental data changes, hierarchy changes, referential integrity checks, duplicate checks, and metadata capture as distinct steps, so that they can be easily tracked and reversed and a complete lineage exists for audit purposes. The diagram describes the process at a high level; a more detailed description can be found below. Please bear in mind that a “stage” refers to the staging area plus the preceding technical ETL/DQ process between the previous stage and that stage. This document is by no means a complete description of the systems and processes required, but it is a good indicator of the technical elements and requirements of a good data migration system.
This architecture is not tied to a particular ETL tool, but Peak does recommend the use of tools to aid re-usability and visibility of process. Some tools are better suited than others to the “pivot” tasks required at certain points, though all major commercial and open source ETL tools can provide the necessary functionality. No data is deleted at any point within the stages; any data that does not pass a particular stage is excluded via the error logging mechanism, with a “warning” or “error” flag attached to the record by Stage and Surrogate ID (SID).
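As a minimal sketch of what one row of such an error-logging mechanism might hold (the names ErrorLogEntry, ERROR_LOG, and log_error are our own illustration, not the actual Peak data model):

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class ErrorLogEntry:
        stage: int      # which of the six stages excluded the record
        sid: int        # Surrogate ID assigned at Stage 1
        flag: str       # "warning" or "error"
        reason: str     # human-readable description of the rule that fired
        logged_at: str  # timestamp, part of the audit lineage

    ERROR_LOG = []  # stands in for the audit/error tables

    def log_error(stage, sid, flag, reason):
        # Excluded records are flagged and logged, never deleted.
        ERROR_LOG.append(ErrorLogEntry(
            stage, sid, flag, reason,
            datetime.now(timezone.utc).isoformat()))

The helpers sketched in the stage examples below assume this log_error function.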
Stage 1: Data is read directly from source systems in the native code page and put into database
staging areas in a format and ER model that exactly matches the source systems, whether they are
ERPs, bespoke systems, flat files, or semi-structured data. Metadata is added to aid tracking of records: a surrogate ID (SID), a timestamp, the source system, and other fields as appropriate for the system.
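A minimal sketch of a Stage 1 load, assuming a Python environment; sqlite3 stands in for the staging database, and the table and column names are hypothetical:

    import sqlite3
    from datetime import datetime, timezone

    def load_stage1(conn, source_name, source_rows):
        """Land source rows unchanged; add SID, timestamp, and source metadata."""
        cur = conn.cursor()
        # The staging table mirrors the source ER model exactly; the SID is
        # generated by the database, the other metadata columns are added here.
        cur.execute("""CREATE TABLE IF NOT EXISTS stage1_customer (
            sid        INTEGER PRIMARY KEY AUTOINCREMENT,
            cust_no    TEXT, name TEXT, country TEXT,  -- source columns, as-is
            src_system TEXT, loaded_at TEXT)""")
        loaded_at = datetime.now(timezone.utc).isoformat()
        cur.executemany(
            "INSERT INTO stage1_customer (cust_no, name, country, src_system, loaded_at) "
            "VALUES (?, ?, ?, ?, ?)",
            [(r["cust_no"], r["name"], r["country"], source_name, loaded_at)
             for r in source_rows])
        conn.commit()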
Stage 2: Data is moved from the Stage 1 area to the Stage 2 area. Only the subset of columns considered necessary in the data map to the target systems is moved, including any fields needed to perform calculations but not necessarily destined to be loaded to the target, as well as all metadata. SIDs are maintained to provide easy “back tracking”. Any filtering rules that remove data known to be irrelevant to the migration are applied. Duplicates are removed as deemed appropriate (e.g. duplicate customer master records might be removed while transactions are persisted). Data is still maintained in the entity-relationship model of the source systems, less the columns that have been deemed extraneous. All actions are logged in the audit and error data model, with error data being excluded from the target.
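A sketch of the Stage 2 move, using the log_error helper above; the column names and the specific filter and duplicate rules are purely illustrative:

    def stage2_move(stage1_rows):
        """Keep only mapped columns, filter known-irrelevant rows, drop duplicates.
        Nothing is deleted: excluded rows are logged with a flag, by stage and SID."""
        mapped_cols = ("sid", "cust_no", "name", "country", "src_system")
        seen, out = set(), []
        for row in stage1_rows:
            if row.get("status") == "CLOSED":        # illustrative filter rule
                log_error(2, row["sid"], "warning", "filtered: closed account")
                continue
            if row["cust_no"] in seen:               # illustrative duplicate rule
                log_error(2, row["sid"], "error", "duplicate cust_no")
                continue
            seen.add(row["cust_no"])
            out.append({k: row[k] for k in mapped_cols})
        return out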
Stage 3: Data is mapped from the old entity-relationship model to the new one; metadata is transferred, SIDs are maintained, and all actions are logged, while error records are excluded from the target. It is important to note that no data translations or calculations have taken place yet: the data is in the target ER model but not necessarily in the target data formats. All data types are defined as maximum-length VARCHAR to avoid rejecting any source data.
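To illustrate the deliberately loose typing at this stage (the table and column names are invented for the example):

    # Stage 3 tables follow the *target* ER model, but every business column
    # is a maximum-length string so that no source value is rejected before
    # the Stage 4 translations; real typing is enforced at Stage 4.
    STAGE3_DDL = """
    CREATE TABLE stage3_business_partner (
        sid        BIGINT NOT NULL,       -- SID carried through from Stage 1
        src_system VARCHAR(64) NOT NULL,  -- lineage metadata
        partner_no VARCHAR(4000),         -- target columns, deliberately untyped
        name       VARCHAR(4000),
        country    VARCHAR(4000)
    )"""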
Stage 4: Data is translated to meet target-system rules and implement functional requirements, and is checked against target-system reference data. All metadata and SIDs are maintained and all actions are logged, both audit and error information. A second duplicate check is performed here to account for the potential for “new” duplicates created by data translation; of course, the concept of a duplicate is one requiring both a business and a technical definition. The target data model, including data types, has now been put in place, and a column-by-column check is enforced to ensure that no data violates the datatype constraints in the target system. Error records are excluded from the target and logged.
“Dictionary” translations, those translations where a given input from a given source system translates to a particular output (e.g. “Source System A” uses a three-character country code while “Target System X” uses a two-character one), have the code abstracted such that the translation is stored in a value mapping table. These tables can be easily maintained by the business users via a spreadsheet that gets loaded into the value mapping table. This ensures that the business retains responsibility for the data translation while the development team is responsible for correctly coding the abstract rule.
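A sketch of that division of labour, assuming the business spreadsheet is exported as CSV; the file layout and function names are our own illustration:

    import csv

    def load_value_map(path):
        """Load the business-maintained spreadsheet (exported as CSV) into an
        in-memory map: (source system, field, source value) -> target value."""
        with open(path, newline="") as f:
            return {(r["src_system"], r["field"], r["src_value"]): r["tgt_value"]
                    for r in csv.DictReader(f)}

    def translate(value_map, src_system, field, value, sid):
        """The abstract rule IT codes once; the mapping rows belong to the
        business, e.g. value_map[("SystemA", "country", "GBR")] == "GB"."""
        try:
            return value_map[(src_system, field, value)]
        except KeyError:
            log_error(4, sid, "error", f"no mapping for {field}={value!r}")
            return None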
Stage 5: Referential integrity checks are performed to ensure that the data structures are completely ready to be written to the target system. Audit and error data are logged and metadata is maintained. At this point the data should be error free: any “child” records that did not pass error checks have been excluded, and any parent records that did not pass error checks have also been excluded along with their related child records, such that orphan records cannot be written to the target system. Alteration of the parent-child exclusion rules is possible but heavily dependent on the nature of the target systems (e.g. SAP will not allow creation of a parent-child hierarchy if the parent and child legal entities do not already exist in the system).
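A sketch of the orphan-exclusion rule, again using the log_error helper; the record keys (sid, parent_sid, has_error) are assumptions for the example:

    def stage5_exclude_orphans(parents, children):
        """Drop failed parents together with all their children so that no
        orphan record can be written to the target system."""
        good_parents = {p["sid"] for p in parents if not p["has_error"]}
        kept_children = []
        for c in children:
            if c["has_error"] or c["parent_sid"] not in good_parents:
                log_error(5, c["sid"], "error", "failed or orphaned child record")
                continue
            kept_children.append(c)
        return [p for p in parents if p["sid"] in good_parents], kept_children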
Stage 6: This stage involves invoking any APIs necessary and passing data in the structure required by those methods. It also includes writing data directly in cases requiring a direct write rather than API, IDoc, or similar calls.
Reconciliation: The Stage 6 reconciliation is performed by reading the data structures back from the target system and comparing them to the Stage 5 data, performing logical DELETEs against exact matches.
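A minimal sketch of that comparison; mark_migrated is a hypothetical callback that performs the logical DELETE (a flag update, not a physical delete):

    def reconcile(stage5_rows, target_rows, business_cols, mark_migrated):
        """Read rows back from the target and compare to Stage 5; rows with
        an exact match are logically DELETEd, the rest are surfaced."""
        target_keys = {tuple(t[c] for c in business_cols) for t in target_rows}
        unmatched = []
        for row in stage5_rows:
            if tuple(row[c] for c in business_cols) in target_keys:
                mark_migrated(row["sid"])   # logical DELETE against Stage 5
            else:
                unmatched.append(row)       # left visible for fault-and-fix
        return unmatched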
Fault, fix, and follow-through: Using our internally developed audit and error data model along with the technical steps detailed here provides a single view of all excluded data, showing why and where it was excluded, and also exposes a simple interface in which a business user can change the data (with further audit information logged, of course). The system can be built to log exclusions based on minor differences (e.g. ‘-’ vs. ‘/’ separators in dates) or only to indicate major failures (e.g. variations in transaction values). Ultimately the system allows data to be “fixed” in situ and “re-migrated” to the new system.
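As a toy illustration of how a minor/major severity rule might be expressed (the field names and rules here are invented, not Peak's actual configuration):

    def classify_difference(field, source_val, target_val):
        """Illustrative severity rules: separator drift in dates is a minor
        'warning'; a changed transaction amount is a major 'error'."""
        if source_val == target_val:
            return "ok"
        if field == "trade_date" and \
           source_val.replace("/", "-") == target_val.replace("/", "-"):
            return "warning"   # minor, e.g. '2011/07/01' vs '2011-07-01'
        return "error"         # major, e.g. a variation in transaction value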
This system is designed to support repeated test data migrations so that all rules can be verified and enforced, with the goal of zero errors in the stages by the time of the final cutover.
Added value
The occasion of a data migration often involves work that can be used to support other initiatives, bringing new value to the enterprise.
Master Data Management
As part of the data and application migration, it is often necessary to capture and validate various “master data”, including:
- Customer data
- Product data
- Reference data
- Value mappings for translation: “dictionary” translations as described in Stage 4
Having developed the data model and captured the requisite information, the possibility of implementing a full-featured Master Data Hub can be considered. Master data could be created, maintained, or simply exposed via this central repository, with all systems synchronized in real-time or batch mode to this Single Version of Truth.
Co-existence
When the translations and transformations of data and hierarchies are written in the migration work stream, “reverse functions” are written for those transformations that are easily reversed (e.g. VAT-added calculations can be easily reversed). Transformations that are not easily reversible (e.g. concatenations) are “reversed” by writing the rule as though it were to support a migration from the target system(s) to the source system(s), with the added benefit of being able to examine the audit trails programmatically. Because all functions are written in an object-oriented style, they are easily re-used to support a co-existence scenario in which legacy and new systems replicate data between one another on a real-time or batch basis, as appropriate. These transformations can be configured to be called via an Enterprise Service Bus (ESB) or other methods, as required.
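A toy illustration of such a forward/reverse pair; the 20% VAT rate is illustrative (real rules would be looked up per date and jurisdiction):

    VAT_RATE = 0.20  # illustrative rate, not a lookup against real tax rules

    def add_vat(net):
        """Forward transformation used when migrating to the target system."""
        return round(net * (1 + VAT_RATE), 2)

    def remove_vat(gross):
        """The 'reverse function', re-used to replicate data from the new
        system back to the legacy system in a co-existence scenario.
        Note: rounding to two decimal places makes the round trip exact
        only to the penny."""
        return round(gross / (1 + VAT_RATE), 2)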
Archiving
During any data migration, business requirements should be captured around the availability of data: which data should be available under which SLAs, in order to ensure that the operational system continues to run at maximum efficiency. A typical plan might envision segregating former customers to a secondary system outside of the main physical database (for performance reasons) 12 months after they terminate the client relationship. The same could be true of older transactions and other data. The ideal technical system would be able to treat data in an “object-oriented” form, archiving off “objects” such as a customer, automatically capturing all information related to that customer and moving it to the secondary system. The data can still be interrogated but no longer causes performance degradation as data volumes expand. This tactic ensures that the sizing of databases, and therefore software and hardware requirements, can be easily predicted; it varies only insofar as operational behaviors vary or the client base expands, at which time additional hardware or software may become necessary. This function can also
support a full Information Lifecycle Management system that includes creation of data for
system testing with features such as object movement (with referential integrity
maintained) and data masking for security. Additional work required to support this would
include creation of a security model that defines how datasets can be created, who can view
unmasked data (at the branch/column/customer/etc. level), and how the datasets can be
extracted or moved.
Existing software products can best leverage the work that will already have been performed to support the Information Lifecycle Management system.
Getting Started
Our experience shows that if a well-defined business process is implemented along with the technical steps described above, you can significantly reduce the cost of your data migration and eliminate many sources of risk, especially when the migration is repetitive in nature (e.g. multiple countries migrated independently to a new system over a number of months or years). A typical cost curve levels out at around 30% of the initial cost per migration after the second migration. Furthermore, you can leverage the artifacts created to gain new value, as described in “Added value” above.
You can contact me directly via email at [email protected] or via our website at
www.peakconsulting.eu to learn more about how Peak Consulting can help you successfully
complete your data migration or other “big data” initiative.
Diagram 1: Data Migration Architecture