Data Model
A representation of an object
Describes the properties associated with a modeled entity
Also describes how those properties are represented
Tables and Attributes
A collection of object instances resides in a single table
Each instance is represented as a row in the table
Each property is manifested as a column in the table
Relational Databases
We attempt to manage data in “normal form”
Data in different tables are related via “keys”
A key is a set of one or more attributes that uniquely identify an entity within a table
Tables are related to each other via foreign keys
Relational Databases, cont.
Referential Integrity Constraint – If a value is used as a foreign key to a different table, a record must exist in that table that has that value as its primary key
Functional Dependence – An attribute B is functionally dependent on attribute A if each value j of A corresponds to exactly one value k of B, so that in every instance where A contains j, B contains k
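The two definitions above can be checked mechanically. The following is a minimal sketch, assuming tables are held as lists of Python dictionaries; the function names, table names, and sample rows are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def referential_integrity_violations(child_rows, fk, parent_rows, pk):
    """Return child rows whose foreign key matches no primary key in the parent table."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


def functional_dependence_violations(rows, a, b):
    """Return each value of attribute a that appears with more than one value of b."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[a]].add(row[b])
    return {value: values for value, values in seen.items() if len(values) > 1}


# Hypothetical sample tables
departments = [{"dept_id": 1, "name": "Sales"}, {"dept_id": 2, "name": "Support"}]
employees = [
    {"emp_id": 10, "dept_id": 1, "zip": "10001", "city": "New York"},
    {"emp_id": 11, "dept_id": 3, "zip": "10001", "city": "Newark"},
]

# Employee 11 points at a department that does not exist.
orphans = referential_integrity_violations(employees, "dept_id", departments, "dept_id")
# ZIP 10001 appears with two different cities, so city is not dependent on zip here.
fd_breaks = functional_dependence_violations(employees, "zip", "city")
```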
Databases and Data Quality
Data quality implies validation of all integrity constraints associated with a database
– Existence of a primary key
– Referential Integrity
– Null value constraints
– Functional Dependence
Data quality will also encompass higher level content-oriented rules
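As an illustration of the simpler constraints in this list, here is a minimal sketch of primary key and null value checks. It assumes rows are Python dictionaries with None standing in for a database NULL; the function names and the customer data are hypothetical.

```python
from collections import Counter


def primary_key_violations(rows, pk):
    """Count rows with a missing key and list key values used by more than one row."""
    counts = Counter(row.get(pk) for row in rows)
    missing = counts.pop(None, 0)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    return {"missing": missing, "duplicates": duplicates}


def null_violations(rows, not_null_columns):
    """Count, per column, the rows holding None where a value is required."""
    return {
        column: sum(1 for row in rows if row.get(column) is None)
        for column in not_null_columns
    }


# Hypothetical sample table
customers = [
    {"customer_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"customer_id": 1, "name": "Bob", "email": None},
    {"customer_id": None, "name": None, "email": "cy@example.com"},
]

pk_report = primary_key_violations(customers, "customer_id")
null_report = null_violations(customers, ["name", "email"])
```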
Cost of Low Data Quality
Data quality is measured using anecdotes
“Hazy” feeling of wrongness
Desire to gauge the true cost of poor data quality
Evidence of Economic Impact
Frequent service interruptions and system failures
Drop in productivity vs. volume
High employee turnover
High new business/continued business ratio
Increased customer service requirements
Customer attrition
The Data Quality Scorecard
Use scorecard as a tool to manage the corporate information asset
Precise methods to measure level of data quality
Evaluate the costs and impacts associated with low data quality
Build an ROI model
Knowledge Integrity Incorporated
Building and Using the Data Quality Scorecard
1. Map the flow of information
2. Find the critical points of pain
3. Locate the origin of the problems
4. Identify the impacts
5. Calculate the cost
6. Identify targets for improvement
Map the Flow of Information
Data processing can be likened to an “information factory”
Data sets from multiple sources are used as “raw input”
Final products are created in the form of business processes, information products, strategic reports, etc.
Stages in the Information Map
Data Supply
Data Acquisition
Data Creation
Data Processing
Data Packaging
Decision Making
Decision Implementation
Data Delivery
Data Consumption
Directed Information Channels
Indicates the flow of information from one processing stage to another
Example: supplier data is delivered to an acquisition stage through an information channel
Directed indicates the direction in which data flows
This effectively maps all points at which a data fault or nonconformance may appear
Find the Critical Points of Pain
Look for evidence of impact
– Frequent system failures
– Drops in productivity
– High employee turnover
– Increased customer service requirements
– Inability to scale
– Decreased margins
– Customer attrition
Customer Interviews
Ask about potential for data errors associated with:
– Billing
– Customer service
– Attrition
– Recommendations
Ask about customer perception of the organization
Employee Interviews
Look for instances where low data quality affects smooth operation
Seek out scrap and rework:
– Where do data problems affect ability to do job?
– How often must processes be rerun due to data problems?
– How much time is spent fixing data problems?
– How does error correction scale within organization?
What keeps employees from being able to get their job done successfully and on time?
Preliminary Expectations
Too early in the process for a formal definition of rules
Not too early for a “gross-level” statement of expectations
Example: All addresses must contain street name, city, state, and ZIP code
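A gross-level expectation like the address example can already be tested with very little code. The sketch below assumes address records are Python dictionaries keyed by hypothetical field names (street, city, state, zip); it checks only that each field is present and non-blank, not that the values are valid.

```python
# Hypothetical field names for the four parts named in the expectation.
REQUIRED_ADDRESS_FIELDS = ("street", "city", "state", "zip")


def missing_address_fields(address):
    """List the required fields that are absent or blank in one address record."""
    return [
        field
        for field in REQUIRED_ADDRESS_FIELDS
        if not str(address.get(field) or "").strip()
    ]


# Hypothetical sample records
addresses = [
    {"street": "1 Main St", "city": "Springfield", "state": "IL", "zip": "62701"},
    {"street": "22 Oak Ave", "city": "", "state": "NY", "zip": None},
]

# One list of missing fields per record; an empty list means the record conforms.
report = [missing_address_fields(address) for address in addresses]
```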
Initial Assessment
Simple tests for non-conformance to expectations
Highest level assessment may be done using sampling
Get a gross-level score for conformance
– Define rules
– Test
– Measure
– Score
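The define, test, measure, score sequence can be sketched as follows. This is an illustration only, assuming rules are plain Python predicates and the score is the fraction of sampled records that pass every rule; the function name, the rule, and the records are hypothetical.

```python
import random


def conformance_score(records, rules, sample_size, seed=0):
    """Score a random sample: the fraction of sampled records passing every rule."""
    if not records:
        raise ValueError("no records to sample")
    rng = random.Random(seed)  # fixed seed so the assessment can be repeated
    sample = rng.sample(records, min(sample_size, len(records)))
    passed = sum(1 for record in sample if all(rule(record) for rule in rules))
    return passed / len(sample)


# Define rules (hypothetical): every record must carry a non-empty ZIP code.
def has_zip(record):
    return bool(record.get("zip"))


# Hypothetical records; one of the four lacks a ZIP code.
records = [
    {"zip": "62701"},
    {"zip": "10001"},
    {"zip": ""},
    {"zip": "94105"},
]

# Test and measure on a sample, then report the score.
# Here the sample covers all four records, so three of four conform.
score = conformance_score(records, [has_zip], sample_size=4)
```

With a sample smaller than the data set the score is an estimate, which fits the gross-level purpose of this stage.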
Isolate Flawed Data
Where are data problems recognized?
Who finds the problem data?
– Customers?
– Call center personnel?
– Internal Knowledge Workers?
– What are the workers’ rewards for finding bad data?
Trace Back to Origin of Fault
Follow path of information from its insertion into information flow through its exit points
Trace backward through the information chain to find the point at which the information became flawed.
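Tracing backward can be sketched as a walk against the direction of the channels. The example below is a minimal sketch: the stage names come from the information map, but the particular channels, the function names, and the set of flawed stages are hypothetical. It treats a stage as an origin when the data is flawed there and at none of its direct predecessors.

```python
from collections import deque

# Hypothetical directed channels (source stage, target stage).
channels = [
    ("Data Supply", "Data Acquisition"),
    ("Data Acquisition", "Data Processing"),
    ("Data Creation", "Data Processing"),
    ("Data Processing", "Data Packaging"),
    ("Data Packaging", "Data Delivery"),
    ("Data Delivery", "Data Consumption"),
]


def build_predecessors(channels):
    """Map each stage to the stages that feed it directly."""
    predecessors = {}
    for source, target in channels:
        predecessors.setdefault(target, []).append(source)
    return predecessors


def upstream_stages(channels, stage):
    """Breadth-first walk against the channel direction, nearest stages first."""
    predecessors = build_predecessors(channels)
    order, seen, queue = [], {stage}, deque([stage])
    while queue:
        current = queue.popleft()
        for source in predecessors.get(current, []):
            if source not in seen:
                seen.add(source)
                order.append(source)
                queue.append(source)
    return order


def fault_origins(channels, stage, is_flawed_at):
    """Return flawed stages at or upstream of `stage` with no flawed direct predecessor."""
    predecessors = build_predecessors(channels)
    candidates = [stage] + upstream_stages(channels, stage)
    flawed = [s for s in candidates if is_flawed_at(s)]
    return [s for s in flawed if not any(p in flawed for p in predecessors.get(s, []))]


# Hypothetical findings: the data is clean until it reaches Data Processing.
flawed_stages = {"Data Processing", "Data Packaging", "Data Delivery", "Data Consumption"}
origins = fault_origins(channels, "Data Consumption", lambda s: s in flawed_stages)
```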
Identify the Impacts
The quality of information at any consumption stage can affect any of these variables:
– Increase in Profit
– Decrease in Profit
– Cost increase
– Cost decrease
– Delay
– Speedup
– Increased satisfaction
– Decreased satisfaction
Soft Impacts
Difficulty in decision-making
Time delays in operation
Organizational mistrust
Lowered ability to compete
Data ownership conflicts
Lowered customer satisfaction
Lowered employee satisfaction
Hard Impacts
Customer attrition
Scrap and rework
Error prevention
Increased customer service costs
Costs associated with fixing customer problems
Spin control
Loss of equity value
Enterprisewide data inconsistency
Calculate the Costs
Some impacts are easily tied to hard $$$
Some impacts are hard to characterize exactly, but are clearly felt
Assign some cost to each problem
Cost Categories
Detection
Correction
Rollback
Rework
Prevention
Warranty
Reduction
Attrition
Blockading
Information Chains and Data Flows
Multiple impacts may be attributed to the same data problem
Tracing problems back through the information chain provides insight into overall impact of poor data quality
The Assessment Matrix
Axis 1:
– Log each data quality problem
Axis 2:
– Specify activities associated with each problem
Axis 3:
– Impact areas for each activity
Each cell contains the estimated cost associated with the impact
Aggregate Costs, Build the Model
Superimpose matrix onto spreadsheet
Tally and summarize across the model
Use the spreadsheet as a simulation model
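The same tally can be sketched outside a spreadsheet. In this minimal sketch the assessment matrix is a Python dictionary keyed by (problem, activity, impact area), with the estimated cost as the cell value; all problems, activities, and cost figures are hypothetical, and the simulation simply assumes that an improvement scales every cost of one problem by the same fraction.

```python
from collections import defaultdict

# (problem, activity, impact area) -> estimated cost. All entries are hypothetical.
assessment_matrix = {
    ("Missing ZIP codes", "Manual address lookup", "Rework"): 12000.0,
    ("Missing ZIP codes", "Returned mail handling", "Correction"): 4500.0,
    ("Duplicate customer records", "Call center reconciliation", "Detection"): 8000.0,
    ("Duplicate customer records", "Double billing refunds", "Warranty"): 15000.0,
}


def totals_by_axis(matrix, axis):
    """Sum cell costs along one axis: 0 = problem, 1 = activity, 2 = impact area."""
    totals = defaultdict(float)
    for key, cost in matrix.items():
        totals[key[axis]] += cost
    return dict(totals)


def simulate_improvement(matrix, problem, reduction):
    """Return a copy in which every cost of one problem is scaled by (1 - reduction)."""
    return {
        key: cost * (1 - reduction) if key[0] == problem else cost
        for key, cost in matrix.items()
    }


by_problem = totals_by_axis(assessment_matrix, 0)
total_cost = sum(assessment_matrix.values())

# What-if: halve the costs attributed to missing ZIP codes.
after_fix = simulate_improvement(assessment_matrix, "Missing ZIP codes", 0.5)
savings = total_cost - sum(after_fix.values())
```

Comparing the savings of several such what-if runs is one way to rank candidate targets for improvement.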
Putting it Together
Map the information chain
Conduct interviews to locate data quality problems
Annotate information chain with location of data quality problems
Identify impact domains for each problem
Characterize economic impact (=cost!)
Aggregate totals
The Information Chain
Represented as a directed graph
Vertices are processing stages
Edges are directed channels
At each intersection point, we manage a collection of objects representing data objects passing through that intersection
Information Chain, cont.
Each intersection data object contains:
– The model of the data passing through the intersection
– A set of rules describing validity for those data
– The named reference objects that are related to the data passing through
At each point in the information chain, we can measure conformance to our data quality validation criteria
Information Chain, 3
We can model the information chain in a database
Table for vertices
Table for edges
Table for data objects
Table for rules
Table for reference objects
See the book for details
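One possible layout for these five tables is sketched below using Python's built-in sqlite3 module. This is not the schema from the book: every table name, column name, and sample row is a hypothetical choice, and attaching data objects to edges (channels) is an assumption about where an intersection is recorded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Hypothetical schema: one table per item in the list above.
conn.executescript("""
CREATE TABLE vertices (
    vertex_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE edges (
    edge_id   INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES vertices(vertex_id),
    target_id INTEGER NOT NULL REFERENCES vertices(vertex_id)
);
CREATE TABLE data_objects (
    object_id INTEGER PRIMARY KEY,
    edge_id   INTEGER NOT NULL REFERENCES edges(edge_id),
    model     TEXT NOT NULL
);
CREATE TABLE rules (
    rule_id     INTEGER PRIMARY KEY,
    object_id   INTEGER NOT NULL REFERENCES data_objects(object_id),
    description TEXT NOT NULL
);
CREATE TABLE reference_objects (
    reference_id INTEGER PRIMARY KEY,
    object_id    INTEGER NOT NULL REFERENCES data_objects(object_id),
    name         TEXT NOT NULL
);
""")

# Hypothetical sample: one channel carrying one data object with one rule.
conn.executemany(
    "INSERT INTO vertices VALUES (?, ?)",
    [(1, "Data Supply"), (2, "Data Acquisition")],
)
conn.execute("INSERT INTO edges VALUES (1, 1, 2)")
conn.execute("INSERT INTO data_objects VALUES (1, 1, 'customer address record')")
conn.execute(
    "INSERT INTO rules VALUES (1, 1, 'address must contain street, city, state, and ZIP code')"
)

# List every rule together with the channel on which it applies.
rows = conn.execute("""
SELECT s.name, t.name, r.description
FROM rules r
JOIN data_objects d ON d.object_id = r.object_id
JOIN edges e ON e.edge_id = d.edge_id
JOIN vertices s ON s.vertex_id = e.source_id
JOIN vertices t ON t.vertex_id = e.target_id
""").fetchall()
```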
Data Quality: Using the Tools
Example data: Hierarchical Department name data
Source: 2 data sources
Goal: If we wanted to go with 1 data source, how would it impact the other?