Service Oriented Architecture
white paper
ascential.com
TABLE OF CONTENTS
Enabling Authoritative Source with Service-Oriented Architecture
What is MDM?
Good Data is a Key to Attracting and Retaining Customers
Companies Lack Good Customer Data Today
A Reference Architecture for Master Customer Data Management
A Proven Implementation Process
Profile Source Systems
Cleanse and Standardize Data
Match and Load the Master Cross-Reference
Load Analytical Data into the Data Warehouse
Define a Data Model for the Services
Create your Data Service Requests
Create Update Services
Joint Solution by IBM and BEA Systems
Companies Must Manage Master Customer Data as an Asset
Conclusion
ENABLING AUTHORITATIVE SOURCE WITH SERVICE-ORIENTED ARCHITECTURE
A JOINT WHITE PAPER BY BEA SYSTEMS AND IBM
Customers are the lifeblood of your business. In order to attract and retain them, your
customer-facing processes must be as efficient and effective as possible. Complete
visibility into all current information on the customer will have an enormous impact on
these processes. In addition, every customer interaction should refine the bigger-picture
understanding of the customer. But this information is often locked away in multiple
systems, making it difficult to obtain the complete, accurate, and current view of the
customer required. Traditional solutions to this problem often involve replicating all data in
one or multiple places. But replication is expensive and creates latency-related problems,
as well as errors and inconsistency in the data.
This paper describes a Service Oriented Architecture for the federated management
of master customer data. The architecture creates a set of shared data services that
allow for consistent accessing and updating of data across multiple systems, providing
complete, accurate, and current customer data to any process or application at any
time. This approach involves creating cross-reference keys linking multiple systems and
creating services against a logical representation of the data in multiple systems. These
data services on a logical model, combined with “clean” cross-reference keys, create a
solution that can be up to 10X cheaper to build than one based on replicating the data.
WHAT IS MDM?
Every business has elements of core business reference data which are used in multiple
types of applications and business processes. Often this is the most important data
that the business has, since it represents the business’ understanding of its customers,
suppliers, products, inventory, bills of materials, or parts. This type of data is called Master
Data, and it is one of the most important assets that a company owns, although it is often
not treated that way.
Because of its importance to the business, this data is often stored in multiple systems
for different purposes. For example, customer data is captured and stored during the
sales cycle in many business applications, and later it is also stored in support systems to
provide ongoing customer support after the sale. The problem with this is that keeping it
synchronized and aggregated is very difficult. In our example, it is likely that after the sale
is made, the support system will have better information on the customer than the sales
system. If Sales now wants to sell a new product to that customer, it has no way to benefit
from the new information in the support system since this data is kept in separate stove-
piped systems, with little or no visibility across them.
Master data management (MDM) focuses on creating a single logical representation of
this data that can be shared across multiple applications and processes. Rather than
creating separate copies of master data with no way to tie them together, MDM provides
linkages between separate instances of this data, allowing businesses to maintain a
consistent and complete view of it at all times. In effect, MDM allows enterprises to
leverage this data as a consistent and important corporate asset.
GOOD DATA IS A KEY TO ATTRACTING AND RETAINING CUSTOMERS
For many businesses, customer data is the most important information in the business.
This information is vital to understanding customer buying patterns, identifying up-sell
opportunities, providing a higher level of customer service, tailoring and optimizing
marketing activities, and predicting and addressing business triggers like renewals,
recalls, and upgrades. Today many processes and applications have a direct impact on
customers, and most of these are starved for information that could and should have
an impact on their operations and decisions. For example, knowing that a customer has
recently made a large purchase may impact the level of customer service. Unfortunately,
this data is simply not available in an aggregated form in most businesses.
“By 2007/08, 30% of the Global 2000 will have created a comprehensive framework for the
management of reference data.” – META Group
A complete and accurate view of customer data, sometimes referred to as a single or
“360° view,” provides enormous business benefits. These benefits include a reduction in
sales and marketing costs, improved customer satisfaction, higher service renewal rates,
and optimized allocation of resources.
The basic requirements to provide the optimal value from this data are the following:
• Complete – It must represent all known information about customers across all systems
• Accurate – It must be cleansed, de-duplicated, and verified against established business rules
• Current – It must be up-to-date, reflecting the latest relevant business events
• Accessible – It must be easily available to enterprise processes whenever they need it
It is easy to point out everyday examples of the customer service frustration that can
be caused by a lack of integrated customer data. For example, anyone who has had to
repeat their personal information over and over again on the telephone to various support
people in order to get a simple question answered knows this frustration. However, the
true business impact goes well beyond this.
According to the META Group, “Customer data integration can provide real ROI benefits
by improving the underlying quality and real-time accessibility of synchronized customer
data.”
CASE STUDY:
A Global 1000 manufacturing company needed a better understanding of their customers.
While their existing EAI infrastructure and CRM systems were excellent at processing
customer-based transactions, they were unable to consolidate a complete customer
record from their multiple source systems. At the same time, their databases contained
many duplicate entries for the same customer with no way of linking records together
across systems, or even within the same system.
As a result, they were spending too much money on their marketing programs without
seeing the level of results commensurate with the level of investment.
By implementing an MDM solution, they were able to reduce the average cost per
qualified marketing lead by 80%. In addition, they now have a single source for complete
and accurate customer data across all systems which has a positive impact on many
other processes, as well.
COMPANIES LACK GOOD CUSTOMER DATA TODAY
The problem of distributed and duplicated customer data is not a new one. Companies
have used a variety of mechanisms to try to deal with this issue; however, none of
them have completely solved the problem of providing complete, accurate, and current
data that is easily plugged into customer-centric processes. In fact, they are all likely
participants in a customer-oriented master data management solution.
CIF — Traditionally, many companies have used Customer Information Files (CIFs)
to centralize customer data. CIFs are usually created by extracting data from source
systems in batch, and loading customer data into a common location. CIFs fail to
meet the requirements for current, accurate, and complete data, since they are loaded
infrequently and do little to correct data or link records together across systems.
CRM — Probably the most common misconception is that Customer Relationship
Management (CRM) systems solve this problem. These systems allow management of
customer-centric processes like sales, marketing, and customer service, but data accuracy
is often a problem, and they are unable to provide a complete, cross-system view of data.
Data Warehouse — Data warehousing has also been commonly used as a mechanism
to consolidate customer data, since data warehouses are excellent sources of complete and accurate
data. However, they fail to meet the requirement of current data since they are incapable
of providing up-to-date data and immediate recognition of events. In addition, warehouses
are usually designed around dimensional schemas that are optimized for multi-conditional
queries, but not well-suited to rapid cross-referencing of core data elements.
ODS — Operational data stores (ODS) are another mechanism commonly thought to
provide customer data consolidation. However, ODS implementations fail to meet the
completeness requirement, since they simply aggregate raw transactions, and do not
provide context around how individual transactions or their underlying data elements are
related, making it impossible to obtain a complete view of any given customer.
More recently, as companies have begun to focus specifically on MDM, other alternatives
have emerged. Many of the enterprise application vendors have created master data
components that augment their base offering. These products can be effective, but they
do not automatically accommodate data completeness or accuracy, so additional software
is required. In addition, these products can be difficult to plug into enterprise processes,
focusing instead on providing an application to access the data. As a result, achieving a
synchronized view in this way is virtually impossible.
All of these and other similar solutions attempt to create a monolithic database for
customer data. This approach has fundamental limitations as replication involves bulk
movement of data and creates latency, synchronization and inconsistency issues.
A REFERENCE ARCHITECTURE FOR MASTER CUSTOMER DATA MANAGEMENT
The following diagram describes the service-oriented architecture for federated master
data management. In this architecture:
1. A matching service (such as IBM® WebSphere® QualityStage™) is used to create cross-reference
keys across multiple systems. While creating the cross-reference keys, best practices include
data profiling, data cleansing, de-duplication, and standardizing data formats.
2. An ETL tool (such as IBM® WebSphere® DataStage®) is used to perform the initial load of the
cross reference database, and to create “historical” aggregates, which can be stored in a
separate data warehouse.
3. An EII tool (such as BEA Liquid Data) is used to create composite data services that span
all sources, the cross-reference database, and the data warehouse. Note that the
cross-reference keys created in step 1 are essential to link data in different systems. All data
access and update logic, as well as policies like security and caching, are defined in this layer,
which allows for consistent definition of these policies across all systems.
4. Data cleansing and matching functions are also exposed as services by the EII tool. This
means applications can link “Phil” on the phone to “Phillip” in the customer database.
Thus, all of this data is available to calling applications (like dashboards, workflows, and
portals) as a service from one logical source. This dramatically simplifies application
development and solves the traditional problems of data latency and inconsistency.
A PROVEN IMPLEMENTATION PROCESS
Implementing this service oriented architecture is not complex and can be up to 10X
faster than implementing data replication and related synchronization processes.
Following is a suggested approach that we have seen work for mutual customers.
PROFILE SOURCE SYSTEMS
The first step to an MDM strategy is to understand the data in source systems. This
requires the ability to profile the data in source systems to understand its content,
structure, and relationships. Automated profiling tools accelerate this process, allowing
analysis of column values and structures, and uncovering data anomalies, primary and
foreign keys, relationships, and table normalization opportunities. Profiling also helps to
uncover business rules within the data that will later be used to provide ongoing validation
of data quality.
Profiling is a critical activity for accelerating MDM efforts. When choosing a profiling tool,
it is important to select tools that can access and understand the various data sources
you are trying to reach. In addition, the profiling tool needs to automate the process of
understanding data and assist in building a common metadata understanding across
systems. Other considerations for profiling tools include the ability to deal with large
volumes of data. Most tools will suggest profiling on a sample set of data, but often these
approaches will miss important data trends that do not show up in the sample.
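As a rough illustration of what column profiling computes, the sketch below counts missing
and distinct values for one column and flags whether the column could serve as a key. The
function name, the sample rows, and the column names are hypothetical, and a commercial
profiling tool does far more (cross-table relationships, pattern analysis, metadata capture).

```python
from collections import Counter


def profile_column(rows, column):
    """Summarize one column: row count, missing values, distinct values, key candidacy."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        # Candidate key: at least one row, no missing values, and no repeated value.
        "candidate_key": len(values) > 0 and len(counts) == len(values),
        "most_common": counts.most_common(1),
    }


# Hypothetical extract from a source system.
rows = [
    {"cust_id": "1", "state": "MA"},
    {"cust_id": "2", "state": "MA"},
    {"cust_id": "3", "state": ""},
]
id_profile = profile_column(rows, "cust_id")
state_profile = profile_column(rows, "state")
```

Running the same function over the full data set rather than a sample is what exposes rare
anomalies, such as the single blank state value above.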
CLEANSE AND STANDARDIZE DATA
The second step is to clean and standardize the data records. Cleansing and
standardizing involves assigning categories to individual elements within the data and
applying rules to these data categories based on their business content. For example,
the text “100 St. Virginia St.” would be parsed into four fields of data. A lexical analysis
of those fields would determine that the “100” is the street number, the “Virginia” is the
street name, and the second “St.” is the street type. The first “St.” could be mistaken for
a repeated street type, but a contextual analysis would show that it is actually part of the
street name: “St. Virginia”. Once the data is categorized, it can first be cleansed, removing
any unexpected characters or flagging anything that can’t be categorized. Next, it is
standardized to ensure the same abbreviations and standards are applied consistently
to all elements in the same category. In this example, it could be determined that “Street”
will always be abbreviated as “St”. In this case, the period after the second “St” would be
removed. Address verification could then be run to verify that this address is an actual
address according to local postal records.
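The address example above can be sketched in a few lines. This is a deliberately minimal
illustration, not how any particular cleansing product works: it assumes the first token is
the street number and uses position as the only "context" (only the final token is treated
as the street type). The function name and the abbreviation table are hypothetical.

```python
# Hypothetical standardization table: every variant maps to one abbreviation.
STREET_TYPES = {"st": "St", "street": "St", "ave": "Ave", "avenue": "Ave"}


def standardize_address(text):
    """Categorize the tokens of a street address and standardize the street type."""
    tokens = text.split()
    if len(tokens) < 3 or not tokens[0].isdigit():
        raise ValueError(f"cannot categorize address: {text!r}")
    # Only the last token is considered a street type, so an earlier "St."
    # stays part of the street name.
    street_type = STREET_TYPES.get(tokens[-1].rstrip(".").lower())
    if street_type is None:
        raise ValueError(f"unknown street type in: {text!r}")
    return {
        "number": tokens[0],
        "name": " ".join(tokens[1:-1]),
        "type": street_type,
    }


parsed = standardize_address("100 St. Virginia St.")
```

Anything the rules cannot categorize raises an error here; a real cleansing job would more
likely flag the record for review than stop.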
This ensures that the data is as clean as possible, establishes a foundation for ongoing
data cleansing, and lays the groundwork for matching and record linkage. It’s important to
select a cleansing tool that is flexible enough to handle data other than names and
addresses, such as product and material descriptions. Some tools only work with names and
addresses. Cleansing tools should be able to validate and certify location data on a global
basis, so Unicode support and a global reference file for location data are critical to success.
While cleansing the data, you may also need to create a temporary copy in a staging
database. This staging database can be used to build cleansing, matching, and cross-
reference creation logic – such that heavy interactions will not impact source systems.
MATCH AND LOAD THE MASTER CROSS-REFERENCE
The next step is to create a master cross-reference database. This entails applying
matching and survivorship rules to the data in all pertinent source systems to create a
database that stores the key structures of the various systems involved in each matched
record.
At its most basic level, the master cross reference simply defines the key structure
relationships between systems for any particular customer. This cross-reference must
also contain enough information on the customer to identify a positive match when new
inbound data is received. This may just be name and address information, but it may also
include other information, such as hierarchical information that describes the relationship
of this customer to other customer entities. For example, an employer or parent may be
used to help identify an individual.
Creating this cross-reference is one of the most challenging aspects of customer MDM. It
requires a strong understanding of the source systems, and complex matching and
survivorship rules must be in place. The engine used to load the data must be
capable of working through the large amounts of data in all the source systems. It also
must be capable of maintaining a metadata lineage of the sources and processing of the
data.
Matching and linking records between systems involves identifying common elements
across systems, and determining how data will be matched together. It also involves
determining a precedence order for how data elements will be selected when they exist in
multiple systems. Most matching products employ a deterministic matching algorithm to
determine when records match. This mechanism looks for matches across multiple data
sets, or across multiple records within a single data set, using full agreement across a set
of common variables (e.g., name, phone number, birth date). Some matching products
employ probabilistic matching, which also considers the frequency of data values within
the database when determining a match (effectively giving less common entries in the
database a higher weighting – two John Smiths are less likely to match than two Zeke
Durgans). Probabilistic matching is preferable in cases where a reliable and accurate
identifying field is not present in all records of data, as it has been shown to produce
higher match rates and lower chances of false positives in these cases.
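The difference between the two matching styles can be sketched as follows. The weighting
used here (the base-2 logarithm of population size divided by value frequency) is a common
simplification chosen for illustration, not the algorithm of any specific product; the
function names and the sample population are hypothetical.

```python
import math
from collections import Counter


def deterministic_match(a, b, fields):
    """Match only when every compared field agrees exactly."""
    return all(a[f] == b[f] for f in fields)


def agreement_score(a, b, fields, frequencies, total):
    """Sum a weight for each agreeing field; rarer values earn larger weights."""
    score = 0.0
    for f in fields:
        if a[f] == b[f]:
            count = max(frequencies[f][a[f]], 1)  # guard against unseen values
            score += math.log2(total / count)
    return score


# Hypothetical reference population used to estimate value frequencies.
population = [{"name": "John Smith"}] * 3 + [{"name": "Zeke Durgan"}]
frequencies = {"name": Counter(r["name"] for r in population)}

common = agreement_score({"name": "John Smith"}, {"name": "John Smith"},
                         ["name"], frequencies, len(population))
rare = agreement_score({"name": "Zeke Durgan"}, {"name": "Zeke Durgan"},
                       ["name"], frequencies, len(population))
```

Agreement on the rare name scores higher than agreement on the common one, which is the
behavior described above. A fuller implementation would also subtract weight for
disagreeing fields and compare the total against match and review thresholds.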
Once the cross-reference database is in place, an ongoing mechanism must be created
to ensure that new records do not dramatically degrade the quality of the initial load
by creating duplicates and unmatched entries. This is accomplished by packaging as
services the same matching rules used to create the cross-reference database. This
enables the determination of whether or not new customer data actually already exists
in any system. If a match is determined, the complete record can be assembled and
returned. If a match is not determined, the customer is a new record, and can be entered
into the systems. Because the matching logic is packaged as a service, it can be easily
exposed and reused by other applications and environments. This ensures that the logic is
applied consistently, and that there is only one point of maintenance moving forward.
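A minimal sketch of this match-or-create behavior is shown below. It assumes a very simple
deterministic rule (normalized name plus postal code) standing in for the real matching
rules, and the class name, method name, and key format are all hypothetical.

```python
import itertools


class CrossReference:
    """Toy matching service: return an existing master key or mint a new one."""

    def __init__(self):
        self._by_identity = {}            # (name, postal code) -> master key
        self._ids = itertools.count(1)

    def resolve(self, record):
        """Return (master_key, created) for an inbound customer record."""
        identity = (record["name"].strip().lower(), record["postal_code"])
        key = self._by_identity.get(identity)
        if key is not None:
            return key, False             # customer already known
        key = f"M-{next(self._ids)}"
        self._by_identity[identity] = key
        return key, True                  # new customer entered


xref = CrossReference()
first = xref.resolve({"name": "Phillip Jones", "postal_code": "01581"})
again = xref.resolve({"name": " phillip jones ", "postal_code": "01581"})
other = xref.resolve({"name": "Zeke Durgan", "postal_code": "01581"})
```

Because every caller goes through the same `resolve` method, the matching rule has a single
point of maintenance.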
When choosing a data quality product, the ability to create reusable services from
matching logic should be considered in order to meet this requirement. In addition, the
product must be able to handle high real-time processing volumes, and provide high
availability to ensure that outages will not occur during critical operating hours.
LOAD ANALYTICAL DATA INTO THE DATA WAREHOUSE
Concurrent with loading the cross reference database, many customers also load the related
analytical data into the relevant Data Warehouse for ongoing historical and trend analysis.
Incremental updates to the cross-reference database and the data warehouse are a regular
part of the production schedule. Unlike the cross-reference database, the data warehouse
may receive full record data from the source systems, which may be interesting from an
analytical perspective, for example, a customer’s purchasing habits over the past 90 days.
Loading the data warehouse involves transforming the data into a structure that is
optimized for analysis. Many customers choose a star schema or snowflake schema
for this purpose. This requires the different dimensions of the data to be split out into
separate tables. Aggregates and calculations are also often applied to the data to provide
additional analytical information. These calculations are performed as the data is loaded
into the data warehouse.
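To make the load step concrete, the sketch below splits incoming order records into a
customer dimension and a fact table, and computes a per-customer total while loading. The
table layout, field names, and sample orders are hypothetical; a production load would run
in an ETL tool against real tables rather than Python dictionaries.

```python
def load_star(orders):
    """Split orders into a customer dimension and fact rows, aggregating on load."""
    customer_dim = {}   # natural customer id -> surrogate key
    fact_rows = []      # one row per order, keyed by surrogate key
    totals = {}         # surrogate key -> total order amount
    for order in orders:
        # Assign the next surrogate key the first time a customer is seen.
        sk = customer_dim.setdefault(order["customer_id"], len(customer_dim) + 1)
        fact_rows.append(
            {"customer_sk": sk, "order_date": order["date"], "amount": order["amount"]}
        )
        totals[sk] = totals.get(sk, 0) + order["amount"]
    return customer_dim, fact_rows, totals


orders = [
    {"customer_id": "A", "date": "2005-01-10", "amount": 100},
    {"customer_id": "B", "date": "2005-01-11", "amount": 50},
    {"customer_id": "A", "date": "2005-02-02", "amount": 25},
]
customer_dim, fact_rows, totals = load_star(orders)
```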
When choosing a data integration product, consider its ability to load very large volumes of
data within very short processing windows and to trickle-feed data as required.
DEFINE A DATA MODEL FOR THE SERVICES
With the cross-reference database and matching services in place, the next step is to
develop data services for your applications. Liquid Data uses a model-based approach
to develop data services. A model helps you to create and maintain data services in an
organized fashion. In this approach all the complexity of data access, such as transforms,
data integration, validation rules, caching, and security is hidden behind the data model
– which creates breakthrough productivity for application development.
The data model in Liquid Data is defined based on the application’s data requirements.
The data model is mapped to the underlying physical sources. The data model can
encompass the underlying operational data sources, analytical data sources and other
XML and non-relational sources. It uses the cross-reference information to link the various
sources. The data model can be used to get data from single or multiple sources. Hence
all the complexity of accessing or updating the data is hidden from the developers.
The mapping in the data model can be simple or complex. It can map to physical data
sources and also to services such as IBM’s matching engine. For example, in order to
get an accurate response when identification data is either unreliable or not provided,
the mapping in the data model is defined to call out to the matching service to obtain a
definitive match, or a list of probable candidates. Similarly, the data model mapping may
also call out to a service to get the survivorship information for the data model that has
multiple similar potential sources.
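The idea of a logical model mapped onto physical sources can be sketched in plain Python.
This is not Liquid Data syntax or its API; the source systems, column names, keys, and the
`query` function are hypothetical stand-ins that show how a caller can ask for logical
fields without knowing where each one lives.

```python
# Hypothetical physical sources, each with its own keys and column names.
SOURCES = {
    "crm": {"C-9": {"CUST_NM": "Acme Corp", "PHONE_NO": "555-0100"}},
    "billing": {"B-3": {"BAL_DUE": 1200}},
}

# Cross-reference: master key -> local key in each source system.
XREF = {"M-7": {"crm": "C-9", "billing": "B-3"}}

# Logical model: logical field -> (source system, physical column).
MODEL = {
    "name": ("crm", "CUST_NM"),
    "phone": ("crm", "PHONE_NO"),
    "balance_due": ("billing", "BAL_DUE"),
}


def query(master_key, fields):
    """Resolve logical fields against the physical sources at request time."""
    result = {}
    for field in fields:
        system, column = MODEL[field]
        local_key = XREF[master_key][system]
        result[field] = SOURCES[system][local_key][column]
    return result


customer = query("M-7", ["name", "balance_due"])
```

A mapping entry could equally point at a matching or survivorship service instead of a
column; the caller's request would not change.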
CREATE YOUR DATA SERVICE REQUESTS
With the logical data model in place, the application developer can write data services
across the virtual data source. The logical data model hides all the complexity of different
types of sources, different APIs, the matching service, survivorship rules, and validation rules
from the application developer. The application developer defines the services against the
one virtual data source, and does not need to understand the source data structures or
how they relate to that view.
The data services can be invoked by multiple types of applications. For Java applications,
an XML/SDO-oriented approach generally works better, while reporting applications
may require JDBC/SQL access to the data. Finally, the queries can also be saved as
WSDLs for SOA-centric environments.
Once you understand the type of services your applications will use, you can configure
caching strategies, security policies and validation rules on the logical data model itself.
Caching allows the data to be held in memory, so that subsequent requests for the same
data do not impact source systems. In most cases, the size and refresh rate of the cache
are configurable. Security allows you to control who can get access to what type of data.
Generally, you will need the ability to specify security by data source and by user. In some
cases of sensitive or financial information, you may implement query-level or field-level
security policies. Validation rules on the logical data model allow you to specify valid updates and
hence make sure only “good” data goes back into your source systems.
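As a hedged sketch of two of these policies, the code below wraps a data service in a
time-based cache and filters a record down to the fields a role may see. The class, the
function names, the roles, and the 60-second refresh interval are assumptions made for
illustration; in the architecture described here these policies are configured on the
logical model rather than hand-coded.

```python
import time


class CachedService:
    """Serve repeated requests from memory until the entry's refresh time passes."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # function that reads from the source systems
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}             # key -> (expiry time, value)

    def get(self, key):
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]            # fresh entry: source systems are not touched
        value = self._fetch(key)
        self._cache[key] = (now + self._ttl, value)
        return value


def redact(record, role, allowed_fields):
    """Field-level security: keep only the fields the role is entitled to see."""
    permitted = allowed_fields.get(role, set())
    return {k: v for k, v in record.items() if k in permitted}


def fetch_from_sources(key):
    """Hypothetical stand-in for a federated read across source systems."""
    return {"name": "Acme Corp", "credit_limit": 5000}


service = CachedService(fetch_from_sources, ttl_seconds=60)
record = service.get("M-1")
support_view = redact(record, "support", {"support": {"name"}})
```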
When creating the data services, the key issues are performance tuning and debugging.
Understanding the performance characteristics of a service request that spans multiple
sources and optimizing its execution path requires rich tooling.
CREATE UPDATE SERVICES
When any event occurs that creates or updates customer information, these changes
need to be reflected in source systems and in the cross-reference database. The
logic for updating this information across systems can be very complex. It involves an
understanding of the important data elements across systems, along with the mapping
rules. The design of these processes can leverage metadata and business rules
discovered during profiling to jump start the development process. In addition, these
update processes will likely reuse the existing matching services to ensure that they are
not creating duplicate records.
These update processes can be published as services that are callable from any other
process or application. This ensures that these common business rules will be shared
from project to project rather than re-created. The result is a higher level of consistency
across all processes. Like the query services, these processes need to be secured, to
ensure that only entitled resources can call them.
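An update service of this kind might look roughly like the sketch below, which validates a
change set and then applies it to every source record linked through the cross-reference.
The function signature, the field mappings, the sample data, and the phone validation rule
are all hypothetical, and the sketch leaves out transactions, security checks, and the
duplicate check against the matching service.

```python
def update_customer(master_key, changes, xref, sources, field_map, validators):
    """Validate a change set, then apply it to every linked source record."""
    # Validate everything first so a bad value never causes a partial update.
    for field, value in changes.items():
        check = validators.get(field)
        if check is not None and not check(value):
            raise ValueError(f"invalid value for {field}: {value!r}")
    touched = []
    for system, local_key in xref[master_key].items():
        for field, value in changes.items():
            column = field_map[system].get(field)  # None if the system lacks the field
            if column is not None:
                sources[system][local_key][column] = value
                touched.append((system, column))
    return touched


sources = {
    "sales": {"S-17": {"PHONE": "555-0100"}},
    "support": {"T-904": {"phone_number": "555-0100", "tier": "gold"}},
}
xref = {"M-1": {"sales": "S-17", "support": "T-904"}}
field_map = {"sales": {"phone": "PHONE"}, "support": {"phone": "phone_number"}}
validators = {"phone": lambda v: len(v) == 8 and v[3] == "-"}

touched = update_customer(
    "M-1", {"phone": "555-0199"}, xref, sources, field_map, validators
)
```

The mapping rules live in one place (`field_map`), so each calling application does not
need to know that the two systems name the phone column differently.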
JOINT SOLUTION BY IBM AND BEA SYSTEMS
BEA and IBM are working together on a service-oriented architecture for federated Master
Customer Data Management.
In this joint solution, shown in the picture below, IBM® WebSphere® ProfileStage™ is
used to profile the data sources, IBM® WebSphere® QualityStage™ is used to cleanse,
standardize, and match the data, and WebSphere DataStage is used to create and load
the cross-reference database and the data warehouse.
BEA Liquid Data is used to create a logical model spanning all underlying
data sources, and define composite data services for the applications. All the matching
logic and cross-reference keys generated in the IBM products can be exposed in BEA
Liquid Data.
The combined solution gives you a service-oriented approach to federated master
customer data management.
COMPANIES MUST MANAGE MASTER CUSTOMER DATA AS AN ASSET
Master data management initiatives are significant undertakings for most companies
because so much investment has gone into creating and maintaining separate instances
of reference customer data, and so many processes are linked to this data. The only
way to meet the challenges of providing complete, accurate, and current data that is
accessible to enterprise processes is to take a comprehensive approach to managing
master data.
In this paper, we have described a comprehensive service-oriented architecture and
approach that we see as applicable to most Fortune 500 companies. The advantages of this
approach are:
• Our initial customer successes indicate that this approach is up to 10X cheaper than
approaches based on replicating the data. Replicating the data is expensive due to cost of data
migration and synchronization. Further, this approach reduces the fundamental limitations of
replication, such as latency and inconsistency issues.
• Ongoing data quality issues are resolved as updates are always consistently applied
across multiple sources; and
• The shared services-oriented approach allows for reuse in the enterprise. It is not a problem
that every application developer has to solve again and again.
50 Washington Street Westboro, MA 01581
About Ascential Software Ascential Software Corporation, an IBM company, is the leader in enterprise information integration. Customers
and partners worldwide use the IBM Websphere Enterprise Integration Suite™ to confidently transform data into accurate, reliable and complete business
information to improve operational performance and decision-making across every critical business dimension. Our comprehensive end-to-end solutions
provide on demand information integration complemented by our professional services, industry expertise, and methodologies. Ascential Software is
headquartered in Westboro, Mass., and has customers and partners globally across such industries as financial services and banking, insurance,
healthcare, retail, manufacturing, consumer packaged goods, telecommunications and government. For more information call 1-800-966-9875
(508-366-3888 if calling from outside the US or Canada) or visit the Ascential Software website at www.ascential.com.
© 2005 Ascential Software Corporation, an IBM company. All rights reserved. Ascential and Ascential DataStage are trademarks of Ascential Software Corporation or its affiliates and may be registered in the United States or other jurisdictions. IBM, WebSphere, WebSphere Data Integration Suite, and WebSphere DataStage are trademarks of International Business Machines Corporation. Other marks are the property of the owners of those marks.
800.966.9875, Option 2 508.366.3888 ascential.com
CONCLUSION
Customer data is the lifeblood of a business. A complete and accurate understanding of
this data across systems is vital to providing top-tier customer service and maximizing
revenue opportunities. With a federated approach, achieving this requires a master data
management strategy that involves data profiling and de-duplication, the creation of a
cross-reference database, and the creation of a virtual queryable view across multiple
source data systems. The ideal architecture for this is a service oriented architecture,
where shared data services provide consistent access and update of data across multiple
systems in real time.
The benefits of implementing this approach are enormous, allowing any new or existing
application, process, or user to get a complete and accurate view of any customer at
any time. The quality processes embedded in the design provide ongoing assurance of
the validity of the data, and the federated approach reduces the risk, latency, and cost of
replicating data across databases.