CHAPTER 2
What Is Data Integration?
INFORMATION IN THIS CHAPTER
Data in motion
Integrating into a common format—transforming data
Migrating data from one system to another
Moving data around the organization
Pulling information from unstructured data
Moving process to data
Data in motion
Planning the management of data in data stores is about “persistent” data that sits
still. Managing the data that travels between systems, applications, data stores,
and organizations—the “data in motion”—is central to the effectiveness of any
organization and the primary subject of this book.
It shouldn’t be news that available, trusted data is absolutely critical to the success
of every organization. The process of making the data “trusted” is the subject
of data governance and data quality, but making the data “available” (getting the
data to the right place, at the right time, and in the right format) is the subject of
data integration.
The practice associated with managing data that travels between applications,
data stores, systems, and organizations is traditionally called data integration
(DAMA International, 2009). This terminology may be a little misleading to those
who are not used to the term. Data integration intuitively sounds as if it is
about the consolidation of data, but the focus is the movement of data, not its
persistence. A data interface is an application written to implement the movement
of data between systems.
Integrating into a common format—transforming data
Usually, the most complex and difficult part of integrating data is transforming
data into a common format. Understanding the data to be combined and
understanding (and possibly defining) the structure of the combined data require
both a technical and a business understanding of the data and data structures in
order to define how the data needs to be transformed.
In Figure 2.1, multiple sources of data of different formats are transformed
into an integrated target data set. Many data transformations are accomplished
simply by changing the technical format of the data, but frequently, as depicted in
the diagram, additional information needs to be provided to look up how the
source data should be transformed from one set of values to another.
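The two kinds of transformation described above can be illustrated with a minimal Python sketch. The source names (`crm`, `billing`), the field names, and the `STATUS_LOOKUP` table are hypothetical and chosen only for illustration. The sketch shows a simple technical format change alongside a value lookup that depends on which source the record came from.

```python
# Hypothetical lookup table: maps each source system's status codes
# to the value set used by the integrated target.
STATUS_LOOKUP = {
    ("crm", "A"): "ACTIVE",
    ("crm", "I"): "INACTIVE",
    ("billing", "1"): "ACTIVE",
    ("billing", "0"): "INACTIVE",
}


def to_common_format(source, record):
    """Transform one source record into the common target structure."""
    status = STATUS_LOOKUP.get((source, str(record["status"])))
    if status is None:
        raise ValueError(f"no mapping for {source!r} status {record['status']!r}")
    return {
        "customer_id": str(record["id"]).strip(),  # technical format change
        "name": record["name"].strip().title(),    # technical format change
        "status": status,                          # value lookup
    }


crm_row = {"id": " 42 ", "name": "ada lovelace", "status": "A"}
billing_row = {"id": 42, "name": "ADA LOVELACE", "status": 1}

integrated = [
    to_common_format("crm", crm_row),
    to_common_format("billing", billing_row),
]
```

In this sketch both source rows end up in the same target shape even though they started with different types, casing, and status codes. In practice the lookup information would more likely live in a reference table than in code.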
Migrating data from one system to another
When an application in an organization is replaced, either by a new custom application
or by a purchased package, data from the old system needs to be migrated
to the new application. The new application may already be in production use and
additional data is being added, or the application may not yet be in use and the
data being added will populate empty data structures.
As shown in Figure 2.2, the data conversion process interacts with the source
and target application systems to move the data and transform it from the technical
format needed by the source system to the format and structure needed by the target
system. Working through the application interfaces is best practice, especially because
it allows a data update to be performed by the owning application code rather than by
updating the target data structures directly.
There are times, however, when the data migration process interacts directly with
the source or target data structures instead of the application interfaces.
FIGURE 2.1
Transforming Data into a Common Format.
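As a rough illustration of loading through the owning application rather than writing to its data structures, consider the following Python sketch. The `TargetApplication` class, its `create_customer` method, and the legacy column names `CUST_NO` and `CUST_NM` are all hypothetical stand-ins, not part of any real product.

```python
class TargetApplication:
    """Stand-in for the new application's own update interface (hypothetical)."""

    def __init__(self):
        self._customers = {}  # target data structure, owned by the application

    def create_customer(self, customer_id, name):
        # The owning application enforces its own rules on every update.
        if not name:
            raise ValueError("name is required")
        if customer_id in self._customers:
            raise ValueError(f"duplicate customer {customer_id}")
        self._customers[customer_id] = {"name": name}

    def count(self):
        return len(self._customers)


def migrate(source_rows, target):
    """Move and transform source rows, loading through the target's interface."""
    rejected = []
    for row in source_rows:
        try:
            # Transform from the source's technical format to the target's.
            target.create_customer(int(row["CUST_NO"]), row["CUST_NM"].strip())
        except (ValueError, KeyError) as err:
            rejected.append((row, str(err)))
    return rejected


legacy_rows = [
    {"CUST_NO": "001", "CUST_NM": " Acme Ltd "},
    {"CUST_NO": "002", "CUST_NM": "   "},             # blank name
    {"CUST_NO": "001", "CUST_NM": "Acme Duplicate"},  # repeated key
]
new_app = TargetApplication()
rejects = migrate(legacy_rows, new_app)
```

Because every row passes through `create_customer`, the blank name and the repeated key are rejected by the application's own rules. A migration that wrote to `_customers` directly would bypass those checks.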
Moving data around the organization
Most organizations of medium to large size have hundreds or, more probably,
thousands of applications, each with its own various databases and other data
stores. Whether the data stores are from traditional technologies and database
management systems, emerging technologies, or other types of structures such as
documents, messages, or audio files, it is critical to the organization that these
applications can share information between them. Independent, stand-alone applications
that do not share data with the rest of the organization are becoming less
and less useful.
The focus of information technology planning in most organizations tends to
be the efficient management of data in databases and other data stores.
This may be because ownership of the spaces between the applications running in
an organization is often unclear, and so somewhat ignored. Data integration
solutions have tended to be implemented as accompaniments to persistent data solutions
such as data warehouses, master data management, business intelligence
solutions, and metadata repositories.
FIGURE 2.2
Migrating Data from One Application to Another.
Although traditional data interfaces were usually built between two systems
“point to point,” with one sending and another receiving data, most data integration
requirements really involve multiple application systems that want to be
informed in real time of changes to data from multiple source application systems.
Implementing all data interfaces as point-to-point solutions quickly becomes
overwhelmingly complex and practically impossible for an organization to manage.
As depicted in Figure 2.3, specific data management solutions, such as data
warehousing and master data management, have been designed to centralize data for
particular uses in order to simplify and standardize data integration for an
organization. Real-time data integration strategies and solutions now involve designs of data
movement that are significantly more efficient than point to point, as depicted in
Figure 2.4.
FIGURE 2.3
Moving Data into and out of Central Consolidation Points.
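A small calculation suggests why point-to-point interfaces become unmanageable. The Python sketch below assumes the worst case, in which every application must send data to every other application, and compares it with a design in which each application exchanges data only with a central point. The function names are illustrative, and real organizations fall somewhere between the two extremes.

```python
def point_to_point_interfaces(n):
    """One-directional interfaces if each of n applications feeds every other."""
    return n * (n - 1)


def hub_interfaces(n):
    """Interfaces if each application only sends to and receives from a hub."""
    return 2 * n


for n in (5, 50, 500):
    print(n, point_to_point_interfaces(n), hub_interfaces(n))
```

Under these assumptions the point-to-point count grows with the square of the number of applications, while the centralized count grows linearly.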
Pulling information from unstructured data
In the past, most data integration projects involved almost exclusively data stored
in databases. Now, it is imperative that organizations integrate their database (or
structured) data with data in documents, e-mail, websites, social media, audio,
and video files. The common term for data outside of databases is unstructured
data. Integration of data of various types and formats usually involves use of
the keys or tags (or metadata) associated with unstructured data that contain
information relating the data to a customer, product, employee, or other piece of
master data. By analyzing unstructured data containing text, it may be possible to
associate the unstructured data with a customer or product. Thus, an e-mail may
contain references to customers or products that can be identified from the text
and added as tags to the e-mail. A video may contain images of a customer that
can be matched to a stored image of the customer, tagged, and linked to the customer
information. Metadata and master data are important concepts that are used to integrate
structured and unstructured data.
FIGURE 2.4
Moving Data around the Organization.
As shown in Figure 2.5, data found outside of databases, such as documents,
e-mail, audio, and video files, can be searched for customers, products, employees,
or other important master data references. Master data references are attached
to the unstructured data as metadata tags that then can be used to integrate the
data with other sources and types.
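The tagging step described above can be sketched in Python for the simplest case, plain text matched against known names. The `MASTER_DATA` identifiers and names are hypothetical, and a real solution would likely need fuzzier matching than whole-word search.

```python
import re

# Hypothetical master data: identifier -> names as they might appear in text.
MASTER_DATA = {
    "CUST-001": ["Acme Ltd", "Acme"],
    "PROD-778": ["TurboWidget"],
}


def tag_document(text):
    """Return the sorted master data identifiers referenced in the text."""
    tags = set()
    for master_id, names in MASTER_DATA.items():
        for name in names:
            # Whole-word, case-insensitive match on a known name.
            if re.search(r"\b" + re.escape(name) + r"\b", text, re.IGNORECASE):
                tags.add(master_id)
                break
    return sorted(tags)


email = {"body": "Hi, Acme reported a fault in their TurboWidget order."}
email["tags"] = tag_document(email["body"])
```

Once the identifiers are attached as metadata, the e-mail can be joined to structured customer and product records using the same keys.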
Moving process to data
In an age of huge expansion in the volume of data available to an organization
(big data), it is sometimes more efficient to distribute processing to the multiple
locations of the data rather than collecting the data together (and thus duplicating it)
in order to process it. Big data solutions frequently approach data integration
from a significantly different perspective than traditional data integration
solutions.
FIGURE 2.5
Pulling Information from Unstructured Data.
As shown in Figure 2.6, in some cases of working with very large volumes, it
is more effective to move the process to the data and then consolidate the much
smaller results.
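The idea of moving the process to the data can be sketched in Python as follows. The three "locations" are simulated here as in-memory lists, and the names `LOCATIONS`, `local_summary`, and `consolidate` are illustrative. In a real big data solution the local step would run on the machines that hold the data.

```python
from collections import Counter

# Hypothetical data held at three separate locations.
LOCATIONS = {
    "site_a": ["error", "ok", "ok", "error"],
    "site_b": ["ok", "ok"],
    "site_c": ["error", "ok", "timeout"],
}


def local_summary(records):
    """Runs where the data lives and returns only a small summary."""
    return Counter(records)


def consolidate(summaries):
    """Runs centrally and merges the small per-location results."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


summaries = [local_summary(records) for records in LOCATIONS.values()]
totals = consolidate(summaries)
```

Only the per-location counts travel to the central step, so the amount of data moved depends on the number of distinct values rather than on the number of records.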
Emerging big data solutions are mostly used by programmers and technologists
or highly skilled specialists such as data scientists.
FIGURE 2.6
Moving Process to Data.