Page 1: ATHABASCA UNIVERSITY Applying Fuzzy Logic for …dtpr.lib.athabascau.ca/action/download.php?filename=scis-07/open/... · ATHABASCA UNIVERSITY Applying Fuzzy Logic for Data Governance

ATHABASCA UNIVERSITY

Applying Fuzzy Logic for Data Governance

BY

XiaoHai Lu

A project submitted in partial fulfillment

Of the requirements for the degree of

MASTER OF SCIENCE in INFORMATION SYSTEMS

Athabasca, Alberta

November, 2014

© XiaoHai Lu, 2014

DEDICATION

This essay is dedicated to my supportive wife Winnie and my boys Andrew and Michale.

ABSTRACT

Every day, as we browse the internet, we consume big data from the various search

engines and social networks that we visit. Like individuals, enterprises also confront

a vast stream of information from individuals, communities, corporations, and

governments. The combination of vast volumes of information, long retention cycles, and

high-velocity decision-making has the potential to undermine the usefulness of information and

do more harm than good to enterprises. The axiom 'better data means better

decisions' becomes critical. Without solid data governance in place, data can be

inaccurate and unfit for usage.

This essay will describe the history and future of data governance. It will also

explain the current process of data governance before demonstrating a prototype of

a data governance application in the banking industry.

Data governance processes such as matching and linking related records require

mathematical support in the decision-making process. Fuzzy logic, which is an

approach to computing that is based on varying degrees of truth, was found to be a

good solution to this issue. As such, this essay successfully applies fuzzy logic to

improve the matching and linking process, reduce human intervention, and raise the

data quality of data governance processes.

   

ACKNOWLEDGMENTS

I thank all who were involved in the support and review process of this essay. Without

their support, the essay could not have been satisfactorily completed.

Thanks go to all those who provided their insightful and constructive comments, in

particular, to Professor Richard Huntrods of Athabasca University, who provided

priceless suggestions and feedback on my essay.

Table of Contents

DEDICATION

ABSTRACT

ACKNOWLEDGMENTS

CHAPTER 1 – INTRODUCTION

Data Governance: The History

Data Governance: The current literature on the topic

Data Governance: The Future

CHAPTER 2 – DATA GOVERNANCE PROCESS

Data Governance Process

CHAPTER 3 – ISSUES, CHALLENGES AND TRENDS

The Potential Overlay Task

Match Duplicate Suspects to Create a New Master Record

Link Related Records from Multiple Sources

CHAPTER 4 – FUZZY LOGIC

Traditional Logic

Fuzzy Logic History

The Basic Concept of Fuzzy Logic

A Fuzzy Implementation

Brief Discussion

CHAPTER 5 – CONCLUSIONS

References

List of Figures

Figure 1: Data Governance Process

Figure 2: MDM Process

Figure 3: MDM Initial Load Process

Figure 4: MDM Delta Load Process

Figure 5: Quality Stage Initial Load Process

Figure 6: Quality Stage Delta Load Process

Figure 7: Case 5

Figure 8: Case 3

Figure 9: Case 2

Figure 10: Cases

Figure 11: Training Set

Figure 12: Traditional Decision Tree

Figure 13: Fuzzy MF

Figure 14: Traditional Decision Tree

Figure 15: Decision Matrix

CHAPTER 1 – INTRODUCTION

Data Governance: The History

Data governance is an emerging discipline with an ever-evolving definition. The

discipline embodies a convergence of data quality, data management, data policies, business

process management, and risk management surrounding the handling of data in an

organization [1]. The central point of this definition of data governance is related to data quality.

From the point of view of businesses, data governance needs to be able to provide qualified

information. The data governance process is the practice of transforming data into qualified

information, which can be used by businesses. Incidentally, the concept of data governance

has been around since the beginning of relational databases. Data is stored across

referenced tables. Businesses can retrieve information by joining the data through cross

referencing those tables. With the growth of information technology, databases are gradually

becoming a central part of information systems. In order to insert qualified data into databases,

data governance is extended from databases into a set of processes which are defined as

extracting, transforming, and loading (ETL) areas in order to provide databases with clean,

accurate, and prompt data feeds. New terms such as metadata, data source, target, and

staging are emerging with the ETL approach. There are numerous ETL tools available on the

market, such as Informatica and Ab Initio. However, the motivation for ETL comes from an

information technology (IT) perspective and focuses on IT techniques. In 2004, IBM started

to introduce data governance as a discipline for treating data as an enterprise asset [3]. As a

financial asset, data has to be treated like other financial assets, just as one would treat

plant and equipment. A data inventory is required for enterprises with existing data, in much

the same way as inventories are needed for physical assets. Preventing unauthorized

changes to critical data should also be considered, since such changes can affect the integrity of

financial reporting, as well as the quality and reliability of daily business decisions [3]. Protecting

sensitive data and intellectual property from both internal and external threats is

another element that falls under data governance. Since data is a business asset, the

question of how to maximize its value is also under the umbrella of data governance.

Data Governance: The current literature on the topic

As an emerging form of technology, data governance has been mainly supported by business

vendors rather than academic research. For example, performing a query on the subject

“data governance” in the ACM (Association for Computing Machinery) Digital Library

yields only 2,824 results (queried on Aug 27, 2014). In contrast, when the same query is performed

on Google, 36,200,000 results are yielded (queried on Aug 27, 2014). The technologies

pushed by business vendors share common challenges, such as having broad fundamental

concepts with aspects being emphasized differently by each vendor. For example, Oracle

does not subscribe to the unified process introduced in IBM's white paper. In addition, the

concepts and practices of data governance are still overshadowed by their

predecessors, such as ETL, data warehouse, and ERP products. “MDM is effectively Data

Warehousing branded with ERP market rhetoric and contains an added repository of 'master

data'. We see MDM as another attempt at data integration due to the failure of previous Data

Warehousing, ERP and ERPII/BI initiatives.” [17] Although many companies prefer specialized

MDM solutions, the three main players in the MDM market are IBM, Oracle and SAP.

Data Governance: The Future

Data governance is constantly evolving and morphing into new forms. This process of

evolution has brought with it the next generation of data, which is beginning to enter companies.

Unlike traditional data, next-generation data will be a part of companies' daily routine.

For example, when we make a cellphone call, the relationship data (which includes the

caller's name, phone number, and location) will have been collected. Likewise, the

transactional data (which includes the time of the call and the duration of the call) will have

been collected as well. Such big data includes, but is not limited to, mobile data, GPS

coordinates, location awareness data, and social interactions on sites such as LinkedIn and

Facebook. The way that next-generation data is captured through the cloud will definitely

change the way we deal with traditional data. It's one thing to be flooded with big data; it's

another thing to be able to make sense of it and then be able to act on it or make

recommendations for a human or another system to act on it [6]. Big data by itself is merely

unstructured data; we have to analyze the data in order to understand it. MDM and data

governance processes will make the analysis more efficient. Through data governance's

identity resolution, we can have a single view of an entire company's data. With data

governance, we will not be drowned by next-generation big data; instead, we can understand

its relationships and react to them quickly.

Big data and the cloud, which generates and delivers real time data, will require us to react in

real time, while next generation data governance will help us with understanding and reacting

to real time data.

In addition, unlike traditional data, big data may be owned by a number of brokers or a third

party. The next-generation data governance process should also have the ability to accept

different protocols.

CHAPTER 2 – DATA GOVERNANCE PROCESS

Data Governance Process

Below is a diagram detailing IBM's data governance process [6]:

Figure 1: Data Governance Process

Note. Adapted from “The IBM Data Governance Unified Process” by Sunil Soares, 2010, p. 8. Copyright 2010 by MC Press

Online, LLC. Adapted with permission.

1) Define the business problem

The main reason for the failure of data governance programs is that they do not identify a

tangible business problem. It is imperative that the organization defines the initial scope of the

data governance program around a specific business problem, such as a failed audit, a data

breach, or the need for improved data quality for risk-management purposes. Once the data

governance program begins to tackle the identified business problems, it will receive support

from the business functions to extend its scope to additional areas.

2) Obtain executive sponsorship

It is important to establish sponsorship from key IT and business executives for the data

governance program. The best way to obtain this sponsorship is to establish value in terms of

a business case and quick hits. For example, the business case might be focused on

householding and name-matching in order to improve the quality of data to support a

customer-centricity program.

3) Conduct a maturity assessment

Every organization needs to conduct an assessment of its data governance maturity,

preferably on an annual basis. The IBM Data Governance Council has developed a maturity

model based on 11 categories (discussed in Chapter 5), such as Data Risk Management and

Compliance, Value Creation, and Stewardship. The data governance organization needs to

assess the company’s current level of maturity (current state) and the desired future level of

maturity (future state). The company's future state is usually projected at a time frame

spanning 12 to 18 months ahead. This duration must be long enough to produce results.

However, at the same time, it must be short enough to ensure the continued buy-in from key

stakeholders.
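
The comparison between the current state and the future state can be sketched in a few lines of Python. This is only an illustration: the 1-to-5 scoring scale and the scores themselves are assumptions made for the example and are not taken from the IBM maturity model; only the three category names come from the text above.

```python
# Hypothetical maturity scores on an assumed 1-5 scale (illustrative values only).
current_state = {
    "Data Risk Management and Compliance": 2,
    "Value Creation": 1,
    "Stewardship": 3,
}
future_state = {
    "Data Risk Management and Compliance": 4,
    "Value Creation": 4,
    "Stewardship": 4,
}


def maturity_gaps(current, future):
    """Return (category, gap) pairs, largest gap first."""
    gaps = {category: future[category] - current[category] for category in current}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


for category, gap in maturity_gaps(current_state, future_state):
    print(f"{category}: gap of {gap} level(s)")
```

Sorting by gap size is one possible way to suggest which categories deserve attention first when the roadmap is drawn up.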

4) Build a roadmap

The data governance organization needs to develop a roadmap to bridge the gap between

the current state and the desired future state for the eleven categories of data governance

maturity. For example, the data governance organization might review the maturity gap for

stewardship and determine that the enterprise needs to appoint data stewards who will focus

on targeted subject areas such as the customer, vendor, and product. The data governance

program also needs to include quick hit areas where the initiative can drive near-term

business value.

5) Establish an organizational blueprint

The data governance organization needs to build a charter to govern its operations, and to

ensure that it has enough authority to act as a tiebreaker in critical situations. Data

governance organizations operate best in a three-tier format. The top tier is the data

governance council, which consists of the key functional business leaders who rely on data

as an enterprise asset. The middle tier is the data governance working group, which consists

of middle managers. The final tier consists of the data stewardship community, which is

responsible for the quality of the data on a day-to-day basis.

6) Build a data dictionary

The effective management of business terms can help ensure that the same descriptive

language applies throughout the organization. A data dictionary or business glossary is a

repository with definitions of key terms. It is used to gain consistency and agreement between

the technical and business sides of an organization. For example, what is the definition of a

“customer”? Is a customer someone who has made a purchase, or someone who is

considering a purchase? Is a former employee still categorized as an “employee”? Are the

terms “partner” and “reseller” synonymous? These questions can be answered by building a

common data dictionary. Once implemented, the data dictionary can span the organization to

ensure that business terms are tied via metadata to technical terms and that the organization

has a single, common understanding.

7) Understand data

Someone once said, “You cannot govern what you do not first understand.” Few applications

stand alone today. Rather, they are made up of systems, and “systems of systems”, with

applications and databases spread across the enterprise yet integrated, or at least interrelated. The

relational database model worsens the situation by fragmenting business

entities for storage. How, then, is everything related? The data governance team needs to

discover the critical data relationships across the enterprise. Data discovery may include

simple and hard-to-find relationships, as well as the locations of sensitive data within the

enterprise’s IT systems.

8) Create a metadata repository

Metadata is data that has the purpose of giving information about other data. It is information

regarding the characteristics of any data artifact, such as its technical name, business name,

location, perceived importance, and relationships to other data artifacts in the enterprise. The

data governance program will generate a lot of business metadata from the data dictionary

and a lot of technical metadata during the discovery phase. This metadata needs to be stored

in a repository so that it can be shared and leveraged across multiple projects.

9) Define metrics

Data governance needs to have robust metrics to measure and track progress. The data

governance team must recognize that when something is measured, performance improves.

As a result, the data governance team must pick a few key performance indicators (KPIs) to

measure the ongoing performance of the program. For example, a bank will want to assess

the overall credit exposure by industry. In that case, the data governance program might

select the percentage of null Standard Industrial Classification (SIC) codes as a KPI, to track the

quality of risk management information.
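
A KPI of this kind is simple to compute. The sketch below is a minimal illustration in Python; the record layout and the field name `sic_code` are hypothetical, and both missing and blank values are treated as null, which is an assumption about how the bank would define the metric.

```python
def null_sic_percentage(records):
    """Percentage of records whose SIC code is missing or blank."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if not (r.get("sic_code") or "").strip())
    return 100.0 * missing / len(records)


# Hypothetical loan records used only to exercise the function.
loans = [
    {"customer": "A", "sic_code": "6021"},
    {"customer": "B", "sic_code": None},
    {"customer": "C", "sic_code": ""},
    {"customer": "D", "sic_code": "1311"},
]

kpi = null_sic_percentage(loans)
print(f"Null SIC codes: {kpi:.1f}%")
```

Tracking this number over time gives the data governance team a concrete measure of whether the quality of risk management information is improving.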

10) Govern master data

The most valuable information within an enterprise, which is critical data about customers,

products, materials, vendors, and accounts, is commonly known as master data. Despite its

importance, master data is often replicated and scattered across business processes,

systems, and applications throughout the enterprise. Governing master data is an ongoing

practice, whereby business leaders define the principles, policies, processes, business rules,

and metrics for achieving business objectives, by managing the quality of their master data.

Challenges regarding master data tend to bedevil most organizations, but it is not always

easy to get the right level of business sponsorship to fix the root cause of the issues. As a

result, it is important to justify an investment in a master data initiative. For example, consider

an organization such as a bank, which sends multiple pieces of mail to the same household.

The bank can establish a quick return on investment by cleansing its customer data to create

a single view of the “household.” The bottom line is that the vast majority of data governance

programs deal with issues around data stewardship, data quality, master data, and

compliance.
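
The household example can be sketched as follows. This is a deliberately naive illustration, assuming that two customers belong to the same household when their surname and normalized address match exactly; the field names and the normalization rules are hypothetical, and a production matching process would need to tolerate far more variation than this.

```python
from collections import defaultdict


def household_key(record):
    """Build a crude household key from the surname and a normalized address."""
    address = " ".join(record["address"].lower().replace(".", "").split())
    return (record["surname"].lower(), address)


def group_households(records):
    """Group customer names under their household key."""
    households = defaultdict(list)
    for record in records:
        households[household_key(record)].append(record["name"])
    return dict(households)


# Hypothetical customer records with inconsistent address formatting.
customers = [
    {"name": "Ann Lee", "surname": "Lee", "address": "12 Main St."},
    {"name": "Bob Lee", "surname": "Lee", "address": "12  main st"},
    {"name": "Cy Wong", "surname": "Wong", "address": "7 Oak Ave"},
]

households = group_households(customers)
print(f"{len(customers)} customers, {len(households)} households")
```

With one mailing per household rather than per customer, the saving in this toy data set is one piece of mail out of three.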

11) Govern analytics

Enterprises have invested huge sums of money to build data warehouses to gain competitive

insight. However, these investments have not always yielded results. As a consequence,

businesses are increasingly scrutinizing their investments. We define the “analytics

governance” track as the setting of policies and procedures to better align business users

with the investments in analytic infrastructure. Data governance organizations need to ask the

following questions:

❏ How many users do we have for our data, by business area?

❏ How many reports do we create, by business area?

❏ Do the users derive value from these reports?

❏ How many report executions do we have per month?

❏ How long does it take to produce a new report?

❏ What is the cost of producing a new report?

❏ Can we train the users to produce their own reports?

Many organizations will want to set up a Business Intelligence Competency Centre (BICC) to

educate users, increase business intelligence, and develop reports.

12) Manage security and privacy

Data governance leaders, especially those who report to the chief

information security officer, often have to deal with issues around data security and privacy.

Some of the common data security and privacy challenges include:

❏ Where is our sensitive data?

❏ Has the organization masked its sensitive data in non-production

environments (for example, in development, testing, and training) to comply with privacy

regulations?

❏ Are database audit controls in place to prevent privileged users, such as DBAs, from

accessing private data, such as employees' salaries and customer lists?
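
To make the masking question concrete, the following Python sketch replaces sensitive fields with one-way tokens before a record is copied to a non-production environment. The field names, the salt, and the token format are all assumptions made for illustration. Hashing low-entropy values such as salaries can be reversed by guessing, so this is not a substitute for a vetted masking tool.

```python
import hashlib


def mask_value(value, salt="non-prod"):
    """Replace a sensitive value with a stable one-way token (illustrative only)."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return "MASKED-" + digest[:10]


def mask_record(record, sensitive_fields):
    """Return a copy of the record with the sensitive fields masked."""
    return {
        key: mask_value(value) if key in sensitive_fields else value
        for key, value in record.items()
    }


# Hypothetical employee record.
employee = {"id": 17, "name": "J. Smith", "salary": 85000}
masked = mask_record(employee, {"name", "salary"})
print(masked)
```

Because the same input always yields the same token, masked records can still be joined on a masked field in development and testing.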

13) Govern the information lifecycle

Unstructured content makes up more than 80 percent of the data within

the typical enterprise. As organizations move from data governance to

information governance, they start to consider the governance of this

unstructured content.

The lifecycle of information starts with data creation and ends with

its removal from production. Data governance organizations have to deal with the following

issues regarding the lifecycle of information:

❏ What is our policy regarding digitizing paper documents?

❏ What is our records management policy for paper documents,

electronic documents, and email? (In other words, which documents do

we maintain as records and for how long?)

❏ How do we archive structured data to reduce storage costs and improve performance?

❏ How do we bring structured and unstructured data together under a

common framework of policies and management?

14) Measure the results

Data governance organizations must ensure continuous improvement by constantly

monitoring metrics. In step nine, the data governance team sets up the metrics. In this step,

the data governance team reports progress against those metrics to senior stakeholders

from IT and the business.

Data Governance Business Application

Today, banking systems establish and maintain line-of-business (LoB) specific customer

views with associated accounts and product holdings, either in product systems or in LoB

specific Customer Information Files (CIFs). Thus, the customer, account, and product

relationship information resides in siloed applications. This limits the ability to

understand the customer holistically (across LoBs) and does not provide an enterprise view of

the customer.

The Master Data Management (MDM) initiative enables a complete 360-degree operational

view of customers across the bank (enterprise goal). At the target state, the key capabilities of

MDM are to:

❏ Provide consistent and accurate data about essential business entities derived from a

single trusted source.

❏ Uniquely identify a customer and all the associated relationships/holdings with the

bank, based on the customer's privacy preferences.

To achieve the target state objective, the MDM solution will integrate/interface between the

numerous LoB specific applications, consolidate the data, and create a single golden master

record.

Below is a typical data governance business (Master Data Management) application diagram:

Figure 2: MDM Process

The solution overview diagram depicts the various sub-systems in the solution. At a high

level, the entire solution is classified into the following layers:

❏ Presentation Layer

❏ OCIF Sub-system

❏ Data Integration and Quality Layer

❏ Application Layer

❏ Database Layer

Presentation Layer

The presentation layer of the solution consists of user interface applications. The

following user interface applications are included:

❏ Reporting User Interface

❏ Data Stewardship User Interface

❏ Business Administration User Interface

The Reporting user interface will generate business and stewardship reports on the data

available in MDM. The Data Stewardship user interface will provide various options for

working with customer information, along with searching for and handling duplicate or potential

duplicate customers. The Business Administration user interface will manage reference data

and other metadata in the MDM database.

OCIF Sub-System

The OCIF is an existing authoritative operational source of customer information being used

by multiple systems. This sub-system is presently being considered as a ‘Book of Record’ in

the enterprise. The key objective of this system is creating and maintaining standardized and

consistent customer information across the systems, reducing potential duplicate customers

and improving customer data integrity significantly so that it can be treated as a single source

of truth. In the current solution context, this system is considered the only source system

from which customer information will be loaded into the MDM database. Based on the solution

overview diagram, there will be two approaches to data synchronization between the

systems. These are:

❏ The initial load – The entire content of the database

❏ The delta load – The difference in content between the last day and the current day

To populate data into MDM from OCIF, an OCIF component/utility is required to

extract the required data. The new component that will be developed will be responsible for

providing extracts on a daily basis, which will be the input for downstream sub-systems to

transform and load into MDM and thus synchronize the two systems.
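
The idea behind the delta load can be illustrated with a short Python sketch that compares yesterday's snapshot with today's. The snapshot structure (a dictionary keyed by customer identifier) and the sample values are assumptions made for this example; they do not describe the actual OCIF extract format.

```python
def compute_delta(previous, current):
    """Compare two snapshots keyed by customer id and return what changed."""
    added = {k: v for k, v in current.items() if k not in previous}
    removed = {k: v for k, v in previous.items() if k not in current}
    changed = {
        k: v for k, v in current.items() if k in previous and previous[k] != v
    }
    return added, changed, removed


# Hypothetical daily snapshots.
yesterday = {
    "C1": {"name": "Ann Lee", "city": "Calgary"},
    "C2": {"name": "Bob Lee", "city": "Edmonton"},
}
today = {
    "C1": {"name": "Ann Lee", "city": "Athabasca"},
    "C3": {"name": "Cy Wong", "city": "Calgary"},
}

added, changed, removed = compute_delta(yesterday, today)
print("added:", list(added), "changed:", list(changed), "removed:", list(removed))
```

Only the added, changed, and removed records would need to be passed downstream, which is what keeps a daily delta load much smaller than the initial load.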

Information Integration Layer

Information Integration Layer – Data Stage

The information integration layer is a key component that is responsible for integrating OCIF

and the MDM server application. The data format provided by OCIF is not compatible with the

MDM server and hence it is not directly consumable. DataStage, being the part of the

integration layer, is responsible for transforming OCIF extracts to an MDM specific format.

The key objectives of this layer are to:

Read extracts provided by the source system

Transform the extract into the required format based on a synchronization

mechanism

Transform the extract file into the format required by the data quality component

for standardization during the initial load

Transform reference values to MDM specific codes depending on the source

system reference value

Load the transformed data into a database/file

The IIS DataStage component is responsible for reading extracts from the source system,

transforming them into SIF format and pushing data into MDM in two different ways:

Directly into the database during the initial load

Writing into files (ExSIF) for the delta load

The following sections detail the approaches to be followed in the ETL layer.

The diagram below describes the high level steps to be performed in DataStage during the

initial data load.

Figure 3: MDM Initial Load Process

1. A custom DataStage extract job will be developed to read extract files from the ETL

receiving zone and parse each record based on the record type and sub-type into

individual records of the SIF format, which is a pipe delimited standard interface file.

2. Validation jobs will be responsible for data standardization. They will also perform SIN

validation and phone number validation. Any failed record information will be logged

into an error log file through error handling jobs.

3. An ETL job will be invoked to populate a separate file for standardization which will be

used by QualityStage. The above steps will generate the SIF files for consumption of

the BIL jobs.

4. The BIL import job imports the SIF file for processing.

5. A validation job validates the code column values and invokes error handling framework

jobs in the case of failure. In such scenarios, the records that caused the issue are

dropped from the requested SIF file. Under the initial load strategy, the dropping of

records is minimized so that MDM stays synchronized with the source system to the

highest degree possible.

6. The party referential integrity validation job ensures every party has either a valid

PersonName or OrgName record and also verifies that a valid party record exists for

the “Provided By” Source System Key (SSK).

7. The BIL consists of one job for each Record Type or Sub Type (RT/ST) that performs

key assignment and database loading. For example the Contact key assignment job

assigns CONT_ID, PERSON_ID, ORG_ID and CONTEQUIV_ID to CONTACT,

PERSON, ORG and CONTEQUIV records respectively and inserts them into the MDM

database. Before loading the records into MDM, an MDM Involved Party ID will be

generated within ETL jobs. At a high level, the new MDM Involved Party ID will be

18 characters long, where the first 2 characters indicate the version of the BIL and

the last 16 characters are a random number.

8. The data quality error consolidation process reads the data quality error files created

during the import SIF, validation, and referential integrity validation phases and drops

any records associated with the records in the error file.

The diagram below describes the high level steps to be performed in DataStage during the

delta data load.

Figure 4: MDM Delta Load Process

1. Custom DataStage extract jobs for the process of the initial load will be re-used to

read extract files from the ETL receiving zone and to parse each record based on the

record type or sub-type.

2. Data validation jobs are responsible for CII data standardization. They will also perform

SIN validation and phone number validation. Any failed records will be logged into the

log file through an error handling mechanism. The above two steps essentially

generate the SIF files for consumption of the BIL asset.

3. The DataStage import job imports the SIF file for processing.

4. Applicable business transformation rules are invoked using DataStage transformation

jobs which are responsible for generating extended SIF files for MDM to consume.

Errors are logged using DataStage's out-of-the-box error handling mechanism for further

analysis and action.
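The SIN and phone number validation mentioned in both load processes can be illustrated with a short sketch. It assumes the usual Luhn checksum used for Canadian Social Insurance Numbers and a ten-digit North American phone format. The function names and the accepted separators are choices made for this example, not the actual DataStage validation jobs.

```python
import re


def is_valid_sin(sin):
    """Check a 9-digit SIN with the Luhn checksum (spaces and hyphens allowed)."""
    cleaned = sin.replace(" ", "").replace("-", "")
    if not re.fullmatch(r"[0-9]{9}", cleaned):
        return False
    total = 0
    for position, char in enumerate(cleaned):
        digit = int(char)
        if position % 2 == 1:      # double every second digit
            digit *= 2
            if digit > 9:          # same as adding the two digits of the product
                digit -= 9
        total += digit
    return total % 10 == 0


# Assumed phone layout: optional parentheses and space, dot or hyphen separators.
PHONE_PATTERN = re.compile(r"\(?[0-9]{3}\)?[\s.-]?[0-9]{3}[\s.-]?[0-9]{4}")


def is_valid_phone(phone):
    """Accept ten-digit numbers such as 416-549-7061."""
    return PHONE_PATTERN.fullmatch(phone.strip()) is not None
```

A record failing either check would be routed to the error log rather than loaded.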

Data Quality Management – Quality Stage

The master data hub solution is about providing complete, accurate, standardized information

about the customers stored in the MDM system. Even though OCIF maintains its own data

quality, customer attributes need further standardization before they are stored in MDM as it

will be the single version of truth on customer data throughout the enterprise. The

QualityStage component is primarily responsible for data standardization, the improvement of

overall quality of the data asset of the enterprise, and identification of duplicate/potentially

duplicate customers. The current solution assigns QualityStage the following objectives:

Standardize name and address related attributes

Validate and correct customers' addresses against the Canada Post address

repository implemented through SERP

Perform probabilistic matching to identify potential duplicate customers

The IIS QualityStage component is responsible for maintaining data quality stored in MDM.

The key objectives of QualityStage are:

Name and address standardization

Identifying duplicate/potentially duplicate customers

Matching

The diagram below describes the high level steps to be performed during the initial load.

Figure 5: Quality Stage Initial Load Process (diagram elements: Source Extract File, Source to SIF DS jobs, QS Stan Jobs, MDM Code conformation, MDM DB)

The diagram below describes the high level steps to be performed during the delta load.

Figure 6: Quality Stage Delta Load Process (diagram elements: Source Extract File, Source to SIF DS jobs, MDM Code conformation, MDM DB, MDM, QS Stan Jobs, QS Stan Jobs No Code value conformation)

Individual Customer Name Standardization

This standardization procedure will receive an individual name from MDM before processing

the individual name through the MNNAME rule set. The MNNAME rule set will parse the

individual name into separate name elements and create an analysis value or phonetic

representation value for the first and last name of the individual.

For example:

If an individual by the name of “Mr William Chen” was passed to the individual

standardization procedure, this would be the standardization result.
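To make the idea of parsing a name into elements and deriving a phonetic value concrete, here is a minimal sketch. It is not the MNNAME rule set: the title list, the output field names and the use of the classic Soundex code as the phonetic representation are all assumptions chosen for illustration.

```python
TITLES = {"MR", "MRS", "MS", "MISS", "DR"}  # hypothetical title list

_CODES = {}
for _digit, _letters in (("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                         ("4", "L"), ("5", "MN"), ("6", "R")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(word):
    """Classic four-character Soundex code, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = []
    previous = _CODES.get(letters[0], "")
    for char in letters[1:]:
        if char in "HW":           # H and W do not separate equal codes
            continue
        code = _CODES.get(char, "")
        if code and code != previous:
            encoded.append(code)
        previous = code            # vowels reset the previous code
    return (letters[0] + "".join(encoded) + "000")[:4]


def standardize_individual_name(raw):
    """Split a free-form name into elements and add phonetic values."""
    tokens = raw.replace(".", " ").split()
    title = tokens.pop(0).upper() if tokens and tokens[0].upper() in TITLES else ""
    first = tokens[0] if tokens else ""
    last = tokens[-1] if len(tokens) > 1 else ""
    return {
        "original": raw,
        "title": title,
        "first_name": first,
        "middle_name": " ".join(tokens[1:-1]),
        "last_name": last,
        "first_phonetic": soundex(first),
        "last_phonetic": soundex(last),
    }


result = standardize_individual_name("Mr William Chen")
```

The original string is kept alongside the parsed elements, so nothing in the incoming name is overwritten.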

Organizational Customer Name Standardization

This standardization procedure will receive an organization name from MDM and process the

organization name through the MNNAME rule set. The MNNAME rule set will parse the

organization name into separate word elements and create an analysis value or phonetic

representation value for word1 and word2 of the organization name.

For example:

If the organization name of “Bank of Example” was passed to this organization

standardization procedure, this would be the standardization result.

The important thing to note is that the original name fed into QualityStage from MDM will be

passed back to MDM. QualityStage does not change or enhance the organization name in

any way. QualityStage parses the name into smaller elements for matching purposes only.

MDM will receive the original name, the phonetic representation of organization name, and

the standardized name.

Address Standardization

This standardization procedure will receive an address from MDM and process the address

through the MDMCAADDR and MDMCAAREA rule sets. The MDMCAADDR rule set will parse

the address into separate address elements and create an analysis value or phonetic

representation value for the street name. The MDMCAAREA rule set will parse the city, province,

and postal code into separate address elements and create an analysis value or phonetic

representation value for the city name.

For example:

If the address “123 Maple Street Unit 5” was passed to this address standardization

procedure, this would be the standardization result.
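A rough sketch of what parsing an address line into separate elements could look like is shown below. The street type table, the unit keywords and the output field names are hypothetical. The real rule sets cover far more patterns than this example does.

```python
# Hypothetical lookup tables; a production rule set would be far larger.
STREET_TYPES = {"STREET": "ST", "ST": "ST", "AVENUE": "AVE", "AVE": "AVE",
                "ROAD": "RD", "RD": "RD"}
UNIT_WORDS = {"UNIT", "APT", "SUITE"}


def tokenize_address(line):
    """Split a civic address line into unit, street number, name and type."""
    tokens = line.replace(",", " ").split()
    parsed = {"unit": "", "street_number": "", "street_name": "", "street_type": ""}
    # A trailing "Unit 5" style suffix becomes the unit number.
    if len(tokens) >= 2 and tokens[-2].upper() in UNIT_WORDS:
        parsed["unit"] = tokens[-1]
        tokens = tokens[:-2]
    # A leading numeric token is taken as the street number.
    if tokens and tokens[0].isdigit():
        parsed["street_number"] = tokens[0]
        tokens = tokens[1:]
    # A recognized final token is normalized to a standard street type.
    if tokens and tokens[-1].upper() in STREET_TYPES:
        parsed["street_type"] = STREET_TYPES[tokens[-1].upper()]
        tokens = tokens[:-1]
    parsed["street_name"] = " ".join(tokens).upper()
    return parsed


parsed = tokenize_address("123 Maple Street Unit 5")
```

Because "Street" and "St" normalize to the same street type here, two spellings of the same address produce the same elements.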

Matching

In order to maintain data quality, adding and updating a customer will trigger the matching

process.

Individual and organizational customers will be processed by different match specifications in

QualityStage, each of which consists of blocking parameters and scoring specifications for different

passes.

The MDM service will provide QualityStage (QS) with a set of candidates by searching the

MDM database through blocking parameters for different passes. The QS matching process

will compare and score each candidate and return the match result to the MDM.

In order to implement the match specification and respond to MDM requests, an ISD job and

shared containers are created for the interface.
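The interplay of blocking and scoring can be sketched as follows. This is not QualityStage's probabilistic matching algorithm: the blocking key, the field weights, the threshold and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made to keep the example small.

```python
from difflib import SequenceMatcher

# Hypothetical weights; a real match specification would tune these per pass.
WEIGHTS = {"family_name": 4.0, "given_name": 3.0, "address": 2.0, "phone": 1.0}


def blocking_key(record):
    """Cheap key used to select candidates: surname initial plus postal code."""
    return (record["family_name"][:1].upper(),
            record["postal_code"].replace(" ", "").upper())


def similarity(a, b):
    """String similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def score(incoming, candidate):
    """Weighted average of the field similarities."""
    weighted = sum(w * similarity(incoming[f], candidate[f])
                   for f, w in WEIGHTS.items())
    return weighted / sum(WEIGHTS.values())


def match(incoming, stored, threshold=0.85):
    """Score only the records in the same block and keep those above threshold."""
    key = blocking_key(incoming)
    scored = [(score(incoming, r), r) for r in stored if blocking_key(r) == key]
    hits = [pair for pair in scored if pair[0] >= threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)


stored = [
    {"family_name": "VERKIN", "given_name": "SMITH", "postal_code": "T2T1C1",
     "address": "987 Village Ave", "phone": "416-222-3333"},
    {"family_name": "VERKIN", "given_name": "JANE", "postal_code": "X2Y1X1",
     "address": "10 Main Street", "phone": "915-123-4213"},
]
incoming = {"family_name": "VERKI", "given_name": "SMITH", "postal_code": "T2T1C1",
            "address": "987 Village Ave", "phone": "416-222-3333"}
matches = match(incoming, stored)
```

Blocking keeps the comparison cheap because only records in the same block are scored. Raising or lowering the threshold makes the rule stricter or looser.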

Application Layer

The MDM server is an application component of the solution that interfaces with the data

source where the master data will be stored. It is also responsible for providing various

features for managing and maintaining master data to keep the data source as a single version

of truth. The application is responsible for:

Interfacing with the master data source through various protocols

Managing master data through the exposed interfaces with other sub-

systems/external sources

Controlling access in terms of data visibility and enhancing data security

Identifying and providing a candidate list of potential duplicate

customers to assist QualityStage in determining detailed information on

customer duplication, and storing the results in the data source

Merging two or more customers to enforce the MDM data source as a single view of

the customer and a single version of truth

Providing a user interface to merge and maintain customer information when

duplicates are suspected but not confirmed

Providing a user interface to manage and configure metadata

Database Layer

The database layer in the solution is responsible for storing all the master data. It also stores

the history data, audit, and meta-data required for the MDM application to execute. During the

initial load, the database is populated directly by the information integration layer. Once the

initial population is successfully completed, daily extracts from source systems will be loaded

into the MDM database through the MDM batch framework and maintenance services.

Apart from business data, the database layer also contains meta-data which is required for

the MDM application. Meta-data is another key set of information which is configured for the

MDM application and controls the behaviour and functionality of the MDM application.

Data Quality Management in Detail

For example, if we have the following input file:

File Name | Profile ID | Name | Address | Phone Number | Party Type
B2B Personal Cardholders B2BPC | 1 | John Smith | 123 Main Street, Toronto, Ontario, Canada X1X1X1 | 416-549-7061 | <Blank>
B2B Personal Cardholders B2BPC | 2 | ABC Limited | 456 King Avenue, Calgary, Alberta, Y2Y2Y2 | 416-549-7061 | <Blank>
B2B Personal Cardholders B2BPC | 3 | John and Jane Smith | 123 Main Street, Toronto, Ontario, Canada X1X1X1 | <Blank> | <Blank>
B2B Personal Cardholders B2BPC | 4 | A | <Blank> | <Blank> |

There are several baseline data quality requirements that need to be followed in order to

maintain data quality:

Requirement | Description

Name Formatting and Standardization

1. If a free form name, i.e. one that is unparsed as a single string, is received

as an input, the MDM matching solution should parse or tokenize

the name to the common format required for processing. E.g.:

John Smith may need to be tokenized into First Name = John and

Last Name = Smith

Address Formatting and Standardization

1. If a free form address, i.e. it is unparsed as a single string, is

received as an input, the MDM matching solution should parse or

tokenize the address to the common format required for

processing – for both Canadian and US addresses. E.g.: 123 Main

Street may need to be tokenized into Street Number = 123, Street

Name = Main Street. If the country code / name is missing in the

incoming files, the Canadian address standardization rules will be

applied as a default.

Address Validation and Correction

1. All addresses received in the input files should be validated and

corrected based on checks with Canada Post. In case of an

address correction, the address as provided by Canada Post will

be applied.

Phone Number Formatting and Standardization

1. If a free form phone number, i.e. it is unparsed as a single string,

is received as an input, the MDM matching solution should parse

or tokenize the phone number to the common format required for

processing. E.g.: 416-549-7061 may need to be tokenized into

Area Code = 416, Number = 549-7061

Name Patterns

1. The MDM matching solution should develop data processing rules

to handle the following patterns that may occur in the ‘name’

fields:

For individuals, the connectors that will identify such patterns are

a. Space And Space (e.g.: John And Jane Smith)

b. Space and Space (e.g.: John and Jane Smith)

c. & (e.g.: John&Jane Smith)

d. Space & Space (e.g.: John & Jane Smith)

e. / (e.g.: John/Jane Smith)

f. Space / space (e.g.: John / Jane Smith)

g. \ (e.g.: John\Jane Smith)

h. Space \ Space (e.g.: John \ Jane Smith)

For organizations, the connectors that will identify such patterns are

i. / (e.g.: John/ABC Limited)

j. Space / space (e.g.: John / ABC Limited)

k. \ (e.g.: John\ABC Limited)

l. Space \ Space (e.g.: John \ ABC Limited)

2. If a name pattern has any of the ‘And’, ‘and’ or ‘&’ connectors, the

following requirements should be developed:

a. A lookup with the organization name directory should be

performed.

b. If the name pattern matches with an organization name

from the directory, the record should not be split into

discrete records.

c. If the name pattern does not match with an organization

name from the directory, the record should be split into

discrete records.
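The name pattern requirements above can be sketched as a small splitting routine. The organization directory contents, the function name and the choice to copy the shared surname onto a bare given name are assumptions for this illustration. The requirements themselves only state when a record should or should not be split.

```python
import re

# Hypothetical organization name directory used for the lookup.
ORG_DIRECTORY = {"JOHNSON AND JOHNSON", "SMITH & WESSON"}

# ' and ' / ' And ' (surrounded by spaces) or '&' with optional spaces.
AND_CONNECTOR = re.compile(r"\s+and\s+|\s*&\s*", re.IGNORECASE)
# '/' or '\' with optional spaces.
SLASH_CONNECTOR = re.compile(r"\s*[/\\]\s*")


def split_name(raw):
    """Return the discrete names contained in one 'name' field."""
    name = " ".join(raw.split())
    if AND_CONNECTOR.search(name):
        # Requirement 2: look the pattern up in the organization directory first.
        if name.upper() in ORG_DIRECTORY:
            return [name]
        parts = [p for p in AND_CONNECTOR.split(name) if p]
        if len(parts) < 2:
            return [name]
        # Assumption: a bare given name inherits the last part's surname.
        surname = parts[-1].split()[-1]
        return [p if " " in p else f"{p} {surname}" for p in parts]
    if SLASH_CONNECTOR.search(name):
        parts = [p for p in SLASH_CONNECTOR.split(name) if p]
        return parts if len(parts) >= 2 else [name]
    return [name]
```

With these assumptions, "John and Jane Smith" yields two discrete records, while a name found in the directory stays as one record.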

Matching Process Requirement:

Req. ID | Requirement Description

FR 2.3.1

Rules - Overview

1. The MDM matching rules should be designed and developed to match

incoming records across all input files – i.e. match all input files with

each other

2. The MDM matching rules should be designed and developed to match

the incoming records with all records stored in the MDM

FR 2.3.2

Rules - List

The following matching rules should be designed and developed in the

MDM environment:

1. Rule 1: Individual Matching - Individual Full Name and Full Address

2. Rule 2: Organizational Matching - Organization Name and Full Address

3. Rule 3: Household Matching Individual - Last Name and Full Address

4. Rule 4: Address Matching - Full Address

5. Rule 5: Phone Number Matching – Full Phone Number

NOTE: Each of the above matching rules should generate independent

match IDs/keys

FR 2.3.3

Rules - Data Elements

1. Full Name – When the matching rules are based on Full Names, the

following discrete data elements should be used:

a. First Name a.k.a Given Name

b. Last Name

c. Name Suffix

d. Organization Name (as applicable)

2. Full Address – When the matching rules are based on Full Address, the

following discrete data elements should be used:

a. Apartment / Unit Number

b. Street Number

c. Street Name

d. Street Type

e. City

f. Province

g. Postal Code

h. Country

i. Non Civic Address Info (as applicable)

3. Full Phone Number – When the matching rules are based on Full

Phone Number, the following discrete data elements should be used:

a. Country Code

b. Area Code

c. Number

NOTE: The Individual Matching Process uses ‘Residential Primary’ or

equivalent addresses only, while Organizational Matching Process uses

‘Business Primary’ or equivalent addresses only

NOTE: Phonetic representations of first name, last name, and street name

are used by the current MDM matching process

FR 2.3.4

Rules – Guidelines

1. The corrected postal address should be used by the MDM matching

process.

2. Each record from each input file should undergo each of the matching rules

stated above.

NOTE: For example, a record identified as ‘Individual’ should undergo the

Organizational match rule as well.

3. Wherever applicable, the Match IDs/keys as generated by the individual

matching rules should be cross referenced in the output files. E.g.: A

record could have an Individual Match Key as 123 and a Household

Match Key as 456.

4. A separate match ID / key should be generated for records within the

MDM that do not have a match with records in the input files.

FR 2.3.5

Rules – Weights, Thresholds and Categories

1. The MDM matching solution should be designed and developed for

‘looser’ matching rules.

2. The weights and thresholds that are currently assigned in the MDM

environment should be used as a starting point for the design and

development.

3. The match categories that are currently identified in the MDM

environment should be used as a starting point for the design.

FR 2.3.6

Rules – Error Condition

1. If an incoming record cannot be processed by the MDM

matching solution, it should be highlighted in the output file.

2. A description of the reason why the record could not undergo the

matching process should be included in the output file.

NOTE: These error descriptions should be as provided by the MDM

matching solution with no new requirements.

Output file:

Input Data (first six columns), followed by the output columns:

Input File Name | Input Profile ID | Input Name | Input Address | Phone Number | Input Party Type | Sequence Number | Split Name | Address Validation or Correction Indicator | Address Validation or Correction Description | Corrected Address or Address from MDM | Match Process
B2B Personal Cardholders B2BPC | 1 | John Smith | 123 Main Street, Toronto, Ontario, X1X1X1 | 416-549-7061 | <Blank> | <Blank> | John Smith | Corrected | Street name not found | 123 Main Street, Toronto, Ontario, X1X1X1 | Success
B2B Personal Cardholders B2BPC | 2 | ABC Limited | 456 King Avenue, Calgary, Alberta, Y2Y2Y2 | 416-549-7061 | <Blank> | 1 | ABC Limited | Corrected | Street name not found | 456 King Avenue, Calgary, Alberta Y5M2Y2 | Success
B2B Personal Cardholders B2BPC | 2 | ABC Limited | 456 King Avenue, Calgary, Alberta, Y2Y2Y2 | 416-549-7061 | <Blank> | 2 | ABC Limited | Corrected | Postal Code Incorrect | 456 King Avenue, Calgary, Alberta Y5M2Y2 | Success
B2B Personal Cardholders B2BPC | 3 | John and Jane Smith | 123 Main Street, Toronto, Ontario, X1X1X1 | <Blank> | <Blank> | 3 | John Smith | Valid | Accurate | 123 Main Street, Toronto, Ontario, X1X1X1 | Success
B2B Personal Cardholders B2BPC | 3 | John and Jane Smith | 123 Main Street, Toronto, Ontario, X1X1X1 | <Blank> | <Blank> | 4 | Jane Smith | Valid | Accurate | 123 Main Street, Toronto, Ontario, X1X1X1 | Success
B2B Personal Cardholders B2BPC | 4 | A | <Blank> | <Blank> | <Blank> | <Blank> | <Blank> | <Blank> | Insufficient (or blank) address information | | Fail
<Blank> | <Blank> | <Blank> | <Blank> | <Blank> | <Blank> | 5 | David Johnson | <Blank> | <Blank> | 789 Poplar Road, Ottawa, Ontario, A6A6A6 | <Blank>

CHAPTER 3 – ISSUES, CHALLENGES AND TRENDS

Unfortunately, during the MDM matching process, there are still processes that need human

intervention, such as the following tasks:

The Potential Overlay Task:

A potential overlay occurs when a record is updated with information that is radically different

from the data already in the record. For example, consider the situation illustrated below:

Case | Party ID | Family Name | Given Name | Gender | Date of Birth | Address | Phone | Last Modified
5 | ### | LEWIS | JANE | F | 26-Jun-71 | 100 Kumar Avenue, Markham, Ontario, Canada A2B2C2 | 416-549-7070 | 08/24/06
  | ### | XIANG | LINDA | F | 13-Jan-78 | 456 King Avenue, calgary, Alberta, Y2Y2Y2 | 416-549-7070 | 02/28/98

Figure 7: Case 5

The data steward will mark the record as a potential overlay record because the ID field in

both records is the same. However, when we look closely at these two records, we can see

that Linda Xiang and Jane Lewis are clearly not the same person. The ID

388293023980000000 was created on Feb 28, 1998 and belongs to Linda Xiang. Somehow,

on Aug 24, 2006, the record was updated. It now appears to belong to a woman named Jane

Lewis. This may have been caused by a common data entry mistake in which

Linda Xiang's record was open on the screen when the customer service representative

started typing, not realizing that he or she was typing over someone else's data.

There are also some situations in which this scenario would be perfectly valid. In cases of

events such as marriage, divorce, a move, or phone-number change, a person's data would

change significantly enough to flag a potential overlay task by a data steward application.

Data mining and fuzzy logic can be used to resolve potential overlay tasks automatically.
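How a steward application might raise a potential overlay flag can be sketched with a simple field-count rule. The field names and the cut-off of three changed fields are assumptions for illustration only. The source does not state how the flag is actually computed.

```python
IDENTITY_FIELDS = ("family_name", "given_name", "date_of_birth", "address", "phone")


def changed_fields(before, after):
    """List the identity fields whose values differ between two versions."""
    return [f for f in IDENTITY_FIELDS
            if before.get(f, "").strip().upper() != after.get(f, "").strip().upper()]


def is_potential_overlay(before, after, min_changed=3):
    """Flag an update when at least `min_changed` identity fields changed at once."""
    return len(changed_fields(before, after)) >= min_changed


# The two versions of the record shown in Case 5.
linda = {"family_name": "XIANG", "given_name": "LINDA", "date_of_birth": "13-Jan-78",
         "address": "456 King Avenue, calgary, Alberta, Y2Y2Y2", "phone": "416-549-7070"}
jane = {"family_name": "LEWIS", "given_name": "JANE", "date_of_birth": "26-Jun-71",
        "address": "100 Kumar Avenue, Markham, Ontario, Canada A2B2C2",
        "phone": "416-549-7070"}

overlay_flag = is_potential_overlay(linda, jane)
```

A rule like this also flags legitimate updates such as a marriage combined with a move, which is why the flagged records still need a further decision step.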

Match Duplicate Suspects to Create a New Master Record:

As a solution for data warehouse applications, data governance will match the records from

multiple lines of business (LOB). There are situations where customers from multiple LOBs may

have similar names, addresses, and telephone numbers, and may have fields that are blank or

hold values that are not available (N/A). For instance, see below:

Case | Party ID | Family Name | Given Name | Date of Birth | Address | Phone
3 | ### | VERKIN | SMITH | 5-Aug-60 | 987 Village Ave, Toronto, Ontario, Canada T2T1C1 | 416-222-3333
  | ### | VERKI | SMITH | 5-Aug-60 | 987 Village Ave, Toronto, Ontario, Canada T2T1C1 | 416-222-3333

Figure 8: Case 3

The two records above have the same Given Name, Date of Birth, Address, and Phone number.

However, the Family Name is slightly different. Are these two records the same customer?

Data governance applications currently available on the market stop here and wait for human

intervention to decide. Through the application of data mining and fuzzy logic we would be

able to identify such cases without human intervention and generate a single customer profile

with the best data from all sources.

Link Related Records from Multiple Sources:

With overlays, the task verifies the existing records in the system. With duplicate suspects,

the task gets rid of extra records. This task links records between systems. The current

data steward applications available on the market may not be able to automatically link such

records because they do not have enough data in common, as in the example below:

Case | Party ID | Family Name / Given Name | Date of Birth | Address | Phone | Source
2 | ### | GUGGENHEIM REAL ESTATE LLC | 10-May-98 | 123 Main Street, Toronto, Ontario, Canada X1X1X1 | 647-123-4567 | Market
  | ### | GUGGENHEIM REAL ESTATE LLC | 10-May-98 | 123 Main St Unit 10, Toronto, Ontario, Canada X1X1X1 | 647-123-2352 | Auto

Figure 9: Case 2

For the above example, when you first look at these two records, they appear to be the same

company. However, a closer look reveals some differences. First, the

addresses are different: “Unit 10” appears in only one of the records' address fields. Second,

the phone numbers are different: one is “647-123-4567” and the other is “647-123-2352”.

Data mining and fuzzy logic can automatically verify that these two records are the same company

and link them together.

CHAPTER 4 – FUZZY LOGIC

Traditional Logic:

Now let's suppose that we generate the following training set based on the Data Steward

application output including the potential overlay task, duplicate suspects, and related records

from multiple sources. We would have:

Case | Party ID | Family Name / Given Name | Date of Birth | Address | Phone | Source | Class
1 | ### | VERKIN YOUSSOU | 5-Aug-60 | 10 Main Street, Markham, Ontario, Canada X2Y1X1 | 915-123-4213 | Mortgage | N
  | ### | VERKIN JANE | 5-Aug-60 | 10 Main Street, Markham, Ontario, Canada X2Y1X1 | 915-123-4213 | Auto |
2 | ### | GUGGENHEIM REAL ESTATE LLC | 10-May-98 | 123 Main Street, Toronto, Ontario, Canada X1X1X1 | 647-123-4567 | Market | Y
  | ### | GUGGENHEIM REAL ESTATE LLC | 10-May-98 | 123 Main St Unit 10, Toronto, Ontario, Canada X1X1X1 | 647-123-2352 | Auto |
3 | ### | VERKIN SMITH | 5-Aug-60 | 987 Village Ave, Toronto, Ontario, Canada T2T1C1 | 416-222-3333 | Life | Y
  | ### | V. SMITH | 5-Aug-60 | 987 Village Ave, Toronto, Ontario, Canada T2T1C1 | 416-222-3333 | Auto |
4 | ### | CREATIVE LEADERSHIM GROUM LTD | 5-Aug-60 | 456 King Avenue, calgary, Alberta, Y2Y2Y2 | 416-549-7070 | | Y
  | ### | CREATIVE LEADERSHIM GROUM LTD | 5-Aug-60 | 456 King Avenue, calgary, Alberta, Y2Y2Y2 | 416-549-7070 | |
5 | ### | LEWIS JANE | 26-Jun-71 | 100 Kumar Avenue, Markham, Ontario, Canada A2B2C2 | 416-549-7070 | 24/08/2006 | N
  | ### | XIANG LINDA | 13-Jan-78 | 456 King Avenue, calgary, Alberta, Y2Y2Y2 | 416-549-7070 | 28/02/1998 |

Figure 10: Cases

Traditional logic is the idea that an outcome can only be either true or false, 1 or 0, right or

wrong. This form of logic dates back to ancient Greece and is perfectly adequate to answer

simple questions in single dimensions. For example, if A is 1 and B is 0, what is A AND B? It

can be extended, as is done in Boolean algebra, to more complex questions, as long as all the

parts can be described using the same restricted alphabet of two symbols. Such logic is a

deductive way of understanding consequences and is a highly valuable intellectual technique.

12

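If we use the above traditional logic, we will get the following training set, in which an attribute is T when the two records of a case hold the same value and F otherwise:

Figure 11: Training Set

| Case | Family Name | Given Name | Date of Birth | Address | Phone | Class |
|---|---|---|---|---|---|---|
| 1 | T | F | T | T | T | N |
| 2 | T | T | T | F | F | Y |
| 3 | F | T | T | T | T | Y |
| 4 | T | T | T | T | T | Y |
| 5 | F | F | F | F | T | N |

Computing the information gain of each attribute on the above training set gives: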


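$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

$$Gain(A) = Info(D) - Info_A(D)$$

For the training set above (terms of the form $0 \log_2 0$ are taken to be 0):

$$Info(D) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.97$$

$$Info_{FamilyName}(D) = \frac{3}{5}\left(-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right) + \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) = 0.95$$

$$Info_{GivenName}(D) = \frac{3}{5}\left(-\frac{3}{3}\log_2\frac{3}{3} - \frac{0}{3}\log_2\frac{0}{3}\right) + \frac{2}{5}\left(-\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2}\right) = 0$$

$$Info_{DateofBirth}(D) = \frac{4}{5}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) + \frac{1}{5}\left(-\frac{1}{1}\log_2\frac{1}{1} - \frac{0}{1}\log_2\frac{0}{1}\right) = 0.65$$

$$Info_{Address}(D) = \frac{3}{5}\left(-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right) + \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) = 0.95$$

$$Info_{Phone}(D) = \frac{4}{5}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{1}{5}\left(-\frac{0}{1}\log_2\frac{0}{1} - \frac{1}{1}\log_2\frac{1}{1}\right) = 0.8$$

Hence, the gain in information from such a partitioning would be:

$$Gain(FamilyName) = Info(D) - Info_{FamilyName}(D) = 0.97 - 0.95 = 0.02$$

$$Gain(GivenName) = Info(D) - Info_{GivenName}(D) = 0.97 - 0 = 0.97$$

$$Gain(DateofBirth) = Info(D) - Info_{DateofBirth}(D) = 0.97 - 0.65 = 0.32$$

$$Gain(Address) = Info(D) - Info_{Address}(D) = 0.97 - 0.95 = 0.02$$

$$Gain(Phone) = Info(D) - Info_{Phone}(D) = 0.97 - 0.8 = 0.17$$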
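The calculation above can be checked with a short script. The following is a minimal sketch, assuming the T/F training set of Figure 11 is encoded as tuples; the names `TRAINING_SET`, `entropy`, `info_after_split` and `gain` are illustrative and do not come from the Data Steward application.

```python
from collections import Counter
from math import log2

# Attribute order follows the columns of the training set (Figure 11).
ATTRIBUTES = ["FamilyName", "GivenName", "DateOfBirth", "Address", "Phone"]

# One tuple per case: five T/F attribute values followed by the class label.
TRAINING_SET = [
    ("T", "F", "T", "T", "T", "N"),
    ("T", "T", "T", "F", "F", "Y"),
    ("F", "T", "T", "T", "T", "Y"),
    ("T", "T", "T", "T", "T", "Y"),
    ("F", "F", "F", "F", "T", "N"),
]


def entropy(labels):
    """Info(D): expected information needed to classify a tuple in D."""
    total = len(labels)
    # Counter only holds classes that occur, so log2(0) is never evaluated.
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def info_after_split(rows, attr_index):
    """Info_A(D): weighted entropy of the partitions induced by one attribute."""
    total = len(rows)
    result = 0.0
    for value in {row[attr_index] for row in rows}:
        subset = [row[-1] for row in rows if row[attr_index] == value]
        result += len(subset) / total * entropy(subset)
    return result


def gain(rows, attr_index):
    """Gain(A) = Info(D) - Info_A(D)."""
    return entropy([row[-1] for row in rows]) - info_after_split(rows, attr_index)


gains = {name: gain(TRAINING_SET, i) for i, name in enumerate(ATTRIBUTES)}
best_attribute = max(gains, key=gains.get)

for name, value in gains.items():
    print(f"Gain({name}) = {value:.2f}")
print("Splitting attribute:", best_attribute)
```

The attribute with the largest gain is the one chosen as the splitting attribute, which is how the root of the decision tree is selected.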


Since GivenName has the highest information gain among the attributes, it is selected as the splitting attribute. So we get the following decision tree:

Figure 12: Traditional Decision Tree (root node GivenName: T leads to class Y, F leads to class N)

One issue with the above decision tree is the uncertainty of the attributes. For example, is the name “John Smith” the same as “J.Smith”? The above model provides only two states for an attribute: the Given Name is either the same or not the same. Below, I illustrate how fuzzy logic can be used to tackle the uncertainty associated with the description of knowledge.


Fuzzy Logic History

The term "fuzzy logic" was introduced with the 1965 proposal of fuzzy set theory by Lotfi A. Zadeh.[2][3] Fuzzy logic has been applied to many fields, from control theory to artificial intelligence. Fuzzy logic had, however, been studied since the 1920s as infinite-valued logic, notably by Łukasiewicz and Tarski.[4]

The Basic Concept of Fuzzy Logic

Fuzzy mathematics forms a branch of mathematics related to fuzzy set theory and fuzzy logic. It started in 1965 after the publication of Lotfi Asker Zadeh's seminal work Fuzzy sets.[1] A fuzzy subset A of a set X is a function A: X→L, where L is the interval [0,1]. This function is also called a membership function. A membership function is a generalization of a characteristic function or an indicator function of a subset defined for L = {0,1}. More generally, one can use a complete lattice L in a definition of a fuzzy subset A. [9]

A Fuzzy Implementation:

For each input and output variable selected, I define two or more membership functions (MF). Each one has a qualitative category, for example: true or false. The shape of these functions can vary, but I will work with a triangle, which needs three points to define one MF of one variable. Figure 13 shows the triangle for the variable GivenName.

If we take GivenName as a variable, with 'true' as the triangle and 'false' as the trapezoid (see Figure 13),

– the MF 'true' will be defined by three points: (x0, x1, x2), where x0 is any negative value.

– the MF 'false' will be defined by four points: (x1, x2, x3, x4), where x4 is any positive value > x3. This means that 'false' stays at 1 for every x beyond x2.

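We have the following MF for Given Name:

Triangle:

$$y_{true}(x; x_0, x_1, x_2) = \max\left(\min\left(\frac{x - x_0}{x_1 - x_0}, \frac{x_2 - x}{x_2 - x_1}\right), 0\right)$$

Trapezoid:

$$y_{false}(x; x_1, x_2, x_3, x_4) = \max\left(\min\left(\frac{x - x_1}{x_2 - x_1}, 1, \frac{x_4 - x}{x_4 - x_3}\right), 0\right)$$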
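The two membership functions translate almost directly into code. The sketch below is one possible reading rather than the author's implementation: the names `mf_true`, `mf_false` and the helper `_ramp` are hypothetical, and an infinite end point (such as x0 = −∞ or x4 = ∞) is assumed to make the corresponding slope term equal to 1.

```python
import math


def _ramp(numerator, denominator):
    """One slope term of an MF; an infinitely wide slope is treated as 1 (assumption)."""
    if math.isinf(denominator):
        return 1.0
    return numerator / denominator


def mf_true(x, x0, x1, x2):
    """Triangular MF: rises from x0 to its peak at x1, then falls to 0 at x2."""
    rising = _ramp(x - x0, x1 - x0)
    falling = _ramp(x2 - x, x2 - x1)
    return max(min(rising, falling), 0.0)


def mf_false(x, x1, x2, x3, x4):
    """Trapezoidal MF: rises from x1 to 1 at x2, stays at 1 until x3, falls to 0 at x4."""
    rising = _ramp(x - x1, x2 - x1)
    falling = _ramp(x4 - x, x4 - x3)
    return max(min(rising, 1.0, falling), 0.0)


# Illustrative points only: x0 = -inf, x1 = 0, x2 = 5, x3 = 15, x4 = +inf.
for x in (0, 3, 5, 7):
    y_true = mf_true(x, -math.inf, 0, 5)
    y_false = mf_false(x, 0, 5, 15, math.inf)
    print(f"x = {x}: true = {y_true:.1f}, false = {y_false:.1f}")
```

With these illustrative points the two degrees always sum to 1 between x1 and x2, so a value that is 'true' to degree 0.4 is 'false' to degree 0.6.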

For the Given Name variable, I use the Levenshtein distance to calculate the value of x:

Figure 13: Fuzzy MF (horizontal axis: x0, x1, x2, x3, x4; vertical axis: y, with maximum 1; curves: 'true' and 'false')


5, then the two values are somewhat similar. If the distance is greater than 5, then the two

values are not the same at all.

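After the above specification, we have the fuzzified real value for GivenName. For example, for “kitten” and “sitting”, with distance 3, we get the fuzzified values below, taking x0 = −∞, x1 = 0, x2 = 5, x3 = 15 and x4 = ∞ (terms involving ∞ are read as limits and equal 1):

$$y_{true} = \max\left(\min\left(\frac{3 - (-\infty)}{0 - (-\infty)}, \frac{5 - 3}{5 - 0}\right), 0\right) = 0.4$$

$$y_{false} = \max\left(\min\left(\frac{3 - 0}{5 - 0}, 1, \frac{\infty - 3}{\infty - 15}\right), 0\right) = 0.6$$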
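The worked example can be reproduced end to end. The sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and then fuzzifies it. The function names and the `limit` parameter are hypothetical; the threshold of 5 is taken from the example above, and the infinite end points are folded into the simplified formulas.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzify_distance(distance: float, limit: float = 5.0):
    """Return (y_true, y_false) for a distance, assuming x1 = 0 and x2 = limit."""
    y_true = max(min(1.0, (limit - distance) / limit), 0.0)
    y_false = max(min(distance / limit, 1.0), 0.0)
    return y_true, y_false


distance = levenshtein("kitten", "sitting")
y_true, y_false = fuzzify_distance(distance)
print(f"distance = {distance}, true = {y_true:.1f}, false = {y_false:.1f}")
```

Identical strings give a distance of 0 and are fully 'true'; any distance of 5 or more is fully 'false'.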

Decision Tree definition:

Now let's reconsider the decision tree we introduced before:

Figure 14: Traditional Decision Tree

For this simple case, we have the following rule based on the decision tree above:

IF GivenName is equal (T), THEN two records are equal.

The next step is to compute the degree of membership to the MF (true, false) of the output (the THEN part). Once a variable such as the Given Name is fuzzified, it takes a value between 0 and 1, indicating the degree of membership to a given MF of the specific variable. The degrees of membership of the input variables have to be combined to get the degree of membership of the output. For a single input variable, as in the rule specified above, we can for example have the fuzzy rules shown below:

IF GivenName is equal (T), THEN two records are equal;

IF GivenName is not equal (F), THEN two records are not equal;

According to these rules, if we suppose that the degree of membership for GivenName is 0.6 to MF 'false', then the degree to which the two records are not equal is also 0.6.

When we have more than one input variable, the degree of membership for the output value is the minimum of the degrees of membership of the different inputs. For example, suppose we have two input variables (GivenName X and FamilyName Y) and the decision matrix below:

Figure 15: Decision Matrix

| | FamilyName equal | FamilyName not equal |
|---|---|---|
| GivenName equal | equal | not equal |
| GivenName not equal | not equal | not equal |

If we calculated the attributes as having the following fuzzified values:

$$y_{\text{GivenName equal}} = 0.8, \quad y_{\text{FamilyName not equal}} = 0.9$$

Then we have the following rule satisfied:

IF GivenName is equal (degree of 0.8) and FamilyName is not equal (degree of 0.9) THEN the two records are not equal (degree of 0.8).

Similarly, for the fuzzified values:

$$y_{\text{GivenName not equal}} = 0.8, \quad y_{\text{FamilyName equal}} = 0.2$$

The following rule would also be satisfied:

IF GivenName is not equal (degree of 0.8) and FamilyName is equal (degree of 0.2) THEN the two records are not equal (degree of 0.2).
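The two rules above can be evaluated mechanically. This is a minimal sketch, assuming the decision matrix of Figure 15 is stored as a lookup table and that AND is taken as the minimum of the input degrees, as described above; `DECISION_MATRIX` and `fire_rule` are illustrative names.

```python
# Decision matrix (Figure 15): (GivenName state, FamilyName state) -> outcome.
DECISION_MATRIX = {
    ("equal", "equal"): "equal",
    ("equal", "not equal"): "not equal",
    ("not equal", "equal"): "not equal",
    ("not equal", "not equal"): "not equal",
}


def fire_rule(given_state, given_degree, family_state, family_degree):
    """Return the rule outcome and its degree (minimum of the input degrees)."""
    outcome = DECISION_MATRIX[(given_state, family_state)]
    return outcome, min(given_degree, family_degree)


# The two examples from the text.
print(fire_rule("equal", 0.8, "not equal", 0.9))
print(fire_rule("not equal", 0.8, "equal", 0.2))
```

Using the minimum means a conclusion is never held more strongly than the weakest condition that supports it.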

Brief Discussion:

By applying fuzzy logic to the data governance process, we can get a more accurate decision tree, which will enhance the decision-making process. In the above example, the traditional decision tree model can only record whether FamilyName and GivenName match exactly, even when they differ only slightly. If FamilyName and GivenName are different, the conclusion may be drawn that the two records belong to different persons. However, when we apply fuzzy logic, we may say that FamilyName is not equal only to some extent (say, to a degree of 0.2) and that GivenName is equal to a degree of 0.3. In that case, the records could still be considered to belong to the same person based on the fuzzified logic. The result can therefore be more accurate.


CHAPTER 5 - CONCLUSIONS

In this essay, the history of data governance was discussed, as well as the current literature and the future of this process. The data governance process itself was then explained, and it was found that the central point of data governance is data quality. In order to improve the data quality of the master data repository, fuzzy logic was applied to the data governance process. With data governance constantly evolving, we need to safeguard the quality of data governance. Applying fuzzy logic can help to improve that quality: as the record-matching example shows, it supports not only the data quality process but also process automation.


REFERENCES

1. Data Governance (November 7, 2013). In Wikipedia, the free encyclopedia. Retrieved December 5, 2013, from http://en.wikipedia.org/wiki/Data_governance

2. A Brief History of Data Quality (March 25, 2009). Data Governance Insider: Covering the world of big data and data governance. Retrieved from http://data-governance.blogspot.ca/2009/03/brief-history-of-data-quality.html

3. Nigel Turner (Nov 15, 2013). Kindling the Flames: The Future of Data Governance. Retrieved December 11, 2013, from http://smartdatacollective.com/dat-mai/167531/kindling-flames-future-data-governance

4. Rick Sherman (2011). A must to avoid: Worst practices in enterprise data governance. Retrieved from http://searchdatamanagement.techtarget.com/feature/A-must-to-avoid-Worst-practices-in-enterprise-data-governance

5. Marketing Data Governance in the Era of “Big Data”. Retrieved from http://www.kbmg.com/wp-content/uploads/2013/07/Winterberry-Group-White-Paper-Marketing-Data-Governance-July-2013.pdf

6. Sunil Soares (Sept 2010). The IBM Data Governance Unified Process. Ketchum, USA: MC Press Online, LLC.

7. Julie Langenkamp-Muenkel (Oct 2013). MDM and Next-Generation Data Sources. Information Management.

8. Huey-Li Chen, Long-Hui Chen and Chien-Yu Huang (2009). Fuzzy Goal Programming Approach To Solve The Equipments-Purchasing Problem of an FMC. International Journal of Industrial Engineering, 16(4), 270-281.

9. Fuzzy Mathematics (Nov 28, 2013). In Wikipedia, the free encyclopedia. Retrieved Feb 2, 2014, from http://en.wikipedia.org/wiki/Fuzzy_mathematics

10. A Fuzzy implementation. Retrieved Nov 15, 2014, from http://apps.ensic.inpl-nancy.fr/benchmarkWWTP/RiskAnalysis/RiskWeb/RiskModule_070423_fichiers/Fuzzy_implementation_070423.pdf

11. Risk Analysis (April 2007). Retrieved Nov 15, 2014, from http://apps.ensic.inpl-nancy.fr/benchmarkWWTP/RiskAnalysis/RiskWeb/RiskModule_070423_fichiers/

12. Fuzzy Multidimensional Logic (March 2004). Retrieved Feb 18, 2014, from http://www.calresco.org/lucas/fuzzy.htm

13. Levenshtein distance (Feb 2014). Retrieved Feb 19, 2014, from http://en.wikipedia.org/wiki/Levenshtein_distance

14. Adler. Big Data Governance Maturity (March 2012). Retrieved Feb 23, 2014, from https://www.ibm.com/developerworks/community/blogs/adler/entry/big_data_governance_maturity?lang=en

15. DataFlux Data Management. The Intersection of Big Data, Data Governance and MDM. Retrieved Feb 23, 2014, from http://digital.info-mgmt.com/info-mgmt/DataFlux_SAS2012#pg1

16. Sammon, D. and Adam, F. “Making Sense of the Master Data Management (MDM) Concept: Old Wine in New Bottles or New Wine in Old Bottles?” Proceedings of the 2010 conference on Bridging the Socio-technical Gap in Decision Support Systems: Challenges for the Next Decade, pages 175-186.
