Date post: 19-Dec-2015

Data Quality

Class 2

David Loshin

Goals

Overview of Databases

Cost of low data quality

The information chain

Use of Mini Tools

Overview of Databases

Data Model

Tables and Attributes

Relational Databases

Data Model

A representation of an object

Describes the properties associated with a modeled entity

Also describes how those properties are represented

Tables and Attributes

A collection of object instances resides in a single table

Each instance is represented as a row in the table

Each property is manifested as a column in the table

Relational Databases

We attempt to manage data in “normal form”

Data in different tables are related via “keys”

A key is a set of one or more attributes that uniquely identifies an entity within a table

Tables are related to each other via foreign keys

Relational Databases, cont.

Referential Integrity Constraint – If a value is used as a foreign key to a different table, a record must exist in that table that has that value as its primary key

Functional Dependence – An attribute B is functionally dependent on attribute A if each distinct value j of A has exactly one corresponding value k of B, so that in every instance where A contains j, B contains k

Databases and Data Quality

Data quality implies validation of all integrity constraints associated with a database
– Existence of a primary key
– Referential Integrity
– Null value constraints
– Functional Dependence

Data quality will also encompass higher level content-oriented rules
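
The four constraint checks listed above can be sketched in a few lines. This is a minimal illustration, not a method from the class: the tables, column names, and the rule that `zip` determines `city` are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical sample data: each table is a list of rows (dicts).
departments = [
    {"dept_id": 10, "dept_name": "Sales"},
    {"dept_id": 20, "dept_name": "Support"},
]
employees = [
    {"emp_id": 1, "name": "Ann", "dept_id": 10, "zip": "10001", "city": "New York"},
    {"emp_id": 2, "name": "Bob", "dept_id": 30, "zip": "10001", "city": "Newark"},
    {"emp_id": 2, "name": None, "dept_id": 20, "zip": "60601", "city": "Chicago"},
]


def duplicate_keys(rows, key):
    """Primary key check: return key values that are missing or appear more than once."""
    seen, bad = set(), set()
    for row in rows:
        value = row.get(key)
        if value is None or value in seen:
            bad.add(value)
        seen.add(value)
    return bad


def orphan_foreign_keys(child_rows, fk, parent_rows, pk):
    """Referential integrity: return foreign key values with no matching parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return {row[fk] for row in child_rows if row[fk] not in parent_keys}


def null_violations(rows, column):
    """Null value constraint: return indexes of rows where the column is missing."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def fd_violations(rows, a, b):
    """Functional dependence a -> b: return values of a that map to more than one b."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[a]].add(row[b])
    return {value for value, targets in seen.items() if len(targets) > 1}


print(duplicate_keys(employees, "emp_id"))
print(orphan_foreign_keys(employees, "dept_id", departments, "dept_id"))
print(null_violations(employees, "name"))
print(fd_violations(employees, "zip", "city"))
```

Each function returns the offending values rather than a yes/no answer, so the results can be counted and fed into a score.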

Cost of Low Data Quality

Data quality is measured using anecdotes

“Hazy” feeling of wrongness

Desire to gauge the true cost of poor data quality

Evidence of Economic Impact

Frequent service interruptions and system failures

Drop in productivity vs. volume

High employee turnover

High new business/continued business ratio

Increased customer service requirements

Customer Attrition

The Data Quality Scorecard

Use scorecard as a tool to manage the corporate information asset

Precise methods to measure level of data quality

Evaluate the costs and impacts associated with low data quality

Build an ROI model

Knowledge Integrity Incorporated

Building and Using the Data Quality Scorecard

1. Map the flow of information

2. Find the critical points of pain

3. Locate the origin of the problems

4. Identify the impacts

5. Calculate the cost

6. Identify targets for improvement


Map the Flow of Information

Data processing can be likened to an “information factory”

Data sets from multiple sources are used as “raw input”

Final products are created in the form of business processes, information products, strategic reports, etc.


Stages in the Information Map

Data Supply

Data Acquisition

Data Creation

Data Processing

Data Packaging

Decision Making

Decision Implementation

Data Delivery

Data Consumption

Directed Information Channels

Indicates the flow of information from one processing stage to another

Example: supplier data is delivered to an acquisition stage through an information channel

Directed indicates the direction in which data flows

This effectively maps all points at which a data fault or nonconformance may appear

Find the Critical Points of Pain

Look for evidence of impact
– Frequent system failures
– Drops in productivity
– High employee turnover
– Increased customer service requirements
– Inability to scale
– Decreased margins
– Customer attrition


Customer Interviews

Ask about potential for data errors associated with:
– Billing
– Customer service
– Attrition
– Recommendations

Ask about customer perception of the organization


Employee Interviews

Look for instances where low data quality affects smooth operation

Seek out scrap and rework:
– Where do data problems affect ability to do job?
– How often must processes be rerun due to data problems?
– How much time is spent fixing data problems?
– How does error correction scale within organization?

What keeps employees from being able to get their job done successfully and on time?


Preliminary Expectations

It is still early in the process for a formal definition of rules

But it is not too early for a “gross-level” statement of expectations

Example: All addresses must contain street name, city, state, and ZIP code


Initial Assessment

Simple tests for non-conformance to expectations

Highest level assessment may be done using sampling

Get a gross-level score for conformance
– Define rules
– Test
– Measure
– Score
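
The define, test, measure, score cycle can be sketched with the address expectation from the previous slide. This is only an illustration under stated assumptions: the field names, the sample records, and the helper names `address_is_complete` and `conformance_score` are hypothetical.

```python
import random

# Rule definition: the fields every address is expected to carry.
REQUIRED_FIELDS = ("street", "city", "state", "zip")


def address_is_complete(address):
    """Rule: every required field is present and non-blank."""
    return all(str(address.get(field) or "").strip() for field in REQUIRED_FIELDS)


def conformance_score(records, rule, sample_size, seed=0):
    """Test a random sample against the rule and return the fraction that conform."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    if not sample:
        return 1.0  # nothing to test; treat as vacuously conformant
    passed = sum(1 for record in sample if rule(record))
    return passed / len(sample)


# Hypothetical records; two of the four are missing a required field.
addresses = [
    {"street": "1 Main St", "city": "Springfield", "state": "IL", "zip": "62701"},
    {"street": "", "city": "Springfield", "state": "IL", "zip": "62701"},
    {"street": "9 Elm Ave", "city": "Dover", "state": "DE", "zip": None},
    {"street": "4 Oak Rd", "city": "Salem", "state": "OR", "zip": "97301"},
]

score = conformance_score(addresses, address_is_complete, sample_size=4)
print(f"conformance: {score:.0%}")
```

On a real data set, `sample_size` would be much smaller than the number of records, which is what keeps the highest-level assessment cheap.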

Isolate Flawed Data

Where are data problems recognized?

Who finds the problem data?
– Customers?
– Call center personnel?
– Internal Knowledge Workers?
– What are the workers’ rewards for finding bad data?


Trace Back to Origin of Fault

Follow path of information from its insertion into information flow through its exit points

Trace backward through the information chain to find the point at which the information became flawed.
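
The backward trace can be sketched as a walk over the directed channels in reverse. This is a rough sketch, assuming we already know, for each stage, whether the flaw is visible in the data there. The stage names, the `flawed` test results, and the function `trace_origin` are hypothetical.

```python
from collections import defaultdict

# Hypothetical information chain: (source stage, target stage) directed channels.
channels = [
    ("supply", "acquisition"),
    ("acquisition", "processing"),
    ("creation", "processing"),
    ("processing", "packaging"),
    ("packaging", "delivery"),
    ("delivery", "consumption"),
]

# Assumed result of testing the data at each stage: True means the flaw is visible there.
flawed = {
    "supply": False, "acquisition": True, "creation": False,
    "processing": True, "packaging": True, "delivery": True, "consumption": True,
}


def trace_origin(channels, flawed, start):
    """Walk upstream from `start` and return the stages where the flaw first appears."""
    upstream = defaultdict(list)
    for source, target in channels:
        upstream[target].append(source)

    origins, visited, stack = set(), set(), [start]
    while stack:
        stage = stack.pop()
        if stage in visited or not flawed.get(stage, False):
            continue
        visited.add(stage)
        flawed_parents = [p for p in upstream[stage] if flawed.get(p, False)]
        if flawed_parents:
            stack.extend(flawed_parents)
        else:
            origins.add(stage)  # flawed here, clean in every stage feeding it
    return origins


print(trace_origin(channels, flawed, "consumption"))
```

In this made-up chain the trace stops at the acquisition stage: the supplier data is clean, so the fault was introduced while acquiring it.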


Identify the Impacts

The quality of information at any consumption stage can affect any of these variables:

– Increase in Profit
– Decrease in Profit
– Cost increase
– Cost decrease
– Delay
– Speedup
– Increased satisfaction
– Decreased satisfaction


Soft Impacts

Difficulty in decision-making

Time delays in operation

Organizational mistrust

Lowered ability to compete

Data ownership conflicts

Lowered customer satisfaction

Lowered employee satisfaction


Hard Impacts

Customer attrition

Scrap and rework

Error prevention

Increased customer service costs

Costs associated with fixing customer problems

Spin control

Loss of equity value

Enterprisewide data inconsistency

Calculate the Costs

Some impacts are easily tied to hard $$$

Some impacts are hard to characterize exactly, but are clearly felt

Assign some cost to each problem


Cost Categories

Detection

Correction

Rollback

Rework

Prevention

Warranty

Reduction

Attrition

Blockading

Information Chains and Data Flows

Multiple impacts may be attributed to the same data problem

Tracing problems back through the information chain provides insight into overall impact of poor data quality


The Assessment Matrix

Axis 1:
– Log each data quality problem

Axis 2:
– Specify activities associated with each problem

Axis 3:
– Impact areas for each activity

Each cell contains the estimated cost associated with the impact


Aggregate Costs, Build the Model

Superimpose matrix onto spreadsheet

Tally and summarize across the model

Use the spreadsheet as a simulation model
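
The three-axis assessment matrix, the tallies, and the "what if" simulation can be sketched without a spreadsheet. This is an illustration only: every problem name, activity, impact area, and dollar figure below is invented, and `total_by_axis` and `simulate_fix` are hypothetical helpers.

```python
from collections import defaultdict

# Hypothetical entries: (data quality problem, activity, impact area) -> estimated cost.
# All names and dollar figures are invented for illustration.
assessment = {
    ("bad addresses", "returned mail handling", "rework"): 12000.0,
    ("bad addresses", "customer calls", "customer service"): 8000.0,
    ("duplicate customers", "manual merge", "correction"): 5000.0,
    ("duplicate customers", "customer calls", "customer service"): 3000.0,
}


def total_by_axis(matrix, axis):
    """Sum the cell costs along one axis: 0 = problem, 1 = activity, 2 = impact area."""
    totals = defaultdict(float)
    for key, cost in matrix.items():
        totals[key[axis]] += cost
    return dict(totals)


def simulate_fix(matrix, problem, reduction):
    """Return a copy of the matrix with one problem's costs cut by `reduction` (0 to 1)."""
    return {
        key: cost * (1 - reduction) if key[0] == problem else cost
        for key, cost in matrix.items()
    }


print(total_by_axis(assessment, 0))  # cost per data quality problem
print(total_by_axis(assessment, 2))  # cost per impact area
after = simulate_fix(assessment, "bad addresses", 0.5)
print(sum(assessment.values()) - sum(after.values()))  # estimated savings
```

The savings figure from `simulate_fix` is the kind of number an ROI model would compare against the cost of the improvement project.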


Putting it Together

Map the information chain

Conduct interviews to locate data quality problems

Annotate information chain with location of data quality problems

Identify impact domains for each problem

Characterize economic impact (=cost!)

Aggregate totals

The Information Chain

Represented as a directed graph

Vertices are processing stages

Edges are directed channels

At each intersection point, we manage a collection of objects representing data objects passing through that intersection

Information Chain, cont.

Each intersection data object contains:
– The model of the data passing through the intersection
– A set of rules describing validity for those data
– The named reference objects that are related to the data passing through

At each point in the information chain, we can measure conformance to our data quality validation criteria

Information Chain, 3

We can model the information chain in a database

Table for vertices

Table for edges

Table for data objects

Table for rules

Table for reference objects

See the book for details
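
One possible layout for these five tables, sketched with SQLite. The slides name the tables but not their columns, so every column name here is an assumption, as is the choice to attach data objects to edges. The book's actual schema may differ.

```python
import sqlite3

# Assumed schema: the slides name the tables but not their columns.
SCHEMA = """
CREATE TABLE vertices (
    vertex_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    stage     TEXT NOT NULL            -- e.g. 'Data Supply', 'Data Processing'
);
CREATE TABLE edges (
    edge_id   INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES vertices(vertex_id),
    target_id INTEGER NOT NULL REFERENCES vertices(vertex_id)
);
CREATE TABLE data_objects (
    object_id INTEGER PRIMARY KEY,
    edge_id   INTEGER NOT NULL REFERENCES edges(edge_id),
    model     TEXT NOT NULL            -- description of the data passing through
);
CREATE TABLE rules (
    rule_id     INTEGER PRIMARY KEY,
    object_id   INTEGER NOT NULL REFERENCES data_objects(object_id),
    description TEXT NOT NULL
);
CREATE TABLE reference_objects (
    reference_id INTEGER PRIMARY KEY,
    object_id    INTEGER NOT NULL REFERENCES data_objects(object_id),
    name         TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript(SCHEMA)

# Hypothetical rows: one channel from a supplier feed to a loader, with one rule.
conn.execute("INSERT INTO vertices VALUES (1, 'vendor feed', 'Data Supply')")
conn.execute("INSERT INTO vertices VALUES (2, 'loader', 'Data Acquisition')")
conn.execute("INSERT INTO edges VALUES (1, 1, 2)")
conn.execute("INSERT INTO data_objects VALUES (1, 1, 'customer address record')")
conn.execute("INSERT INTO rules VALUES (1, 1, 'address has street, city, state, ZIP')")

# List each rule together with the channel (source -> target) it applies to.
rows = conn.execute("""
    SELECT s.name, t.name, r.description
    FROM rules r
    JOIN data_objects d ON d.object_id = r.object_id
    JOIN edges e ON e.edge_id = d.edge_id
    JOIN vertices s ON s.vertex_id = e.source_id
    JOIN vertices t ON t.vertex_id = e.target_id
""").fetchall()
print(rows)
```

The foreign keys make the chain model itself subject to referential integrity: a rule cannot point at a data object that does not exist.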

Data Quality: Using the Tools

Example data: Hierarchical Department name data

Source: 2 data sources

Goal: If we wanted to go with 1 data source, how would it impact the other?

Example, cont.

Goals:
– Determine overlap between source A and source B
– Determine what is not in the intersection between the 2 data sources
– Look for duplicates that are and are not exact matches
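
The three goals can be sketched with set operations plus a string-similarity pass. This is not the class's Mini Tools, just an illustration using Python's standard `difflib`. The department names, the `normalize` step, and the 0.85 similarity threshold are assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical department names from the two sources.
source_a = ["Human Resources", "Accounts Payable", "Research & Development", "Legal"]
source_b = ["Human Resources", "Accounts Payabel", "Research and Development", "Marketing"]


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not hide exact matches."""
    return " ".join(name.lower().split())


def compare_sources(a, b, threshold=0.85):
    """Return exact overlap, names unique to each source, and near-duplicate pairs."""
    set_a = {normalize(n) for n in a}
    set_b = {normalize(n) for n in b}
    overlap = set_a & set_b
    only_a = set_a - set_b
    only_b = set_b - set_a
    # Inexact duplicates: names outside the overlap that are still very similar.
    near = sorted(
        (x, y)
        for x in only_a
        for y in only_b
        if SequenceMatcher(None, x, y).ratio() >= threshold
    )
    return overlap, only_a, only_b, near


overlap, only_a, only_b, near = compare_sources(source_a, source_b)
print("overlap:", overlap)
print("only in A:", only_a)
print("only in B:", only_b)
print("near-duplicates:", near)
```

Names that appear only in one source and have no near-duplicate in the other are the ones that would be lost, or would need mapping, if that source were dropped.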

