
Data Quality Class 5. Goals Project Data Quality Rules (Continued) Example Use of Data Quality...

Date post: 20-Dec-2015
Transcript
Page 1: Data Quality Class 5. Goals Project Data Quality Rules (Continued) Example Use of Data Quality Rules.

Data Quality

Class 5

Page 2:

Goals

• Project

• Data Quality Rules (Continued)

• Example

• Use of Data Quality Rules

Page 3:

Data Quality Rules Classes

• 1) Null value rules
• 2) Value rules
• 3) Domain membership rules
• 4) Domain mappings
• 5) Relation rules
• 6) Table, cross-table, and cross-message assertions
• 7) In-process directives
• 8) Operational directives
• 9) Other rules

Page 4:

Representing Data Quality Rules

• Data is divided into 2 sets:
  – conformers
  – violators

• Sets can be represented using SQL

• Create SQL statements representing violating set

Page 5:

Using SQL

• Direct queries
• Embedded queries
  – Using ODBC/JDBC, can create validation scripts in
    • C
    • C++
    • Java
    • Visual Basic
    • Etc.

Page 6:

Null Value Representations

• Maintain a table of null representation types and names:

create table nullreps (
  name varchar(30),
  nulltype char(1),
  description varchar(1024),
  source varchar(512),
  nullval varchar(100),
  nullrepid integer
);

Page 7:

Null Value Rules

• Allows nulls
  – If the rule is “allows nulls” without any additional characterization
    • Nothing necessary
  – If the rule is “allows nulls,” but only of a specific type
    • Must check for real nulls (and possibly blanks and spaces):
    • SELECT * from <table> WHERE <table>.<attribute> is NULL;

Page 8:

Null Value Rules

• Does not allow nulls
  – Must check for nulls (and possibly blanks and spaces):
    • SELECT * from <table> WHERE <table>.<attribute> is NULL;
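
The null test can be sketched with Python's built-in sqlite3 module and an in-memory database. The table `customers` and column `phone` are hypothetical, and treating empty or all-space strings as missing is an assumption based on the slide's "possibly blanks and spaces" remark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, phone TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "555-0100"), (2, None), (3, ""), (4, "   ")],
)

# Violators: real NULLs plus blank or all-space strings.
null_violators = conn.execute(
    "SELECT id FROM customers "
    "WHERE phone IS NULL OR TRIM(phone) = '' ORDER BY id"
).fetchall()
print(null_violators)
```

Dropping the `TRIM` clause leaves only the real-null check shown on the slide.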

Page 9:

Value Rules

• Value rule is specified as some set of constraints

• Makes use of operators and functions:
  – +, -, *, /, <, <=, >, >=, !=, ==, AND, OR
  – User defined functions

• Example:
  – value >= 0 AND value <= 100

Page 10:

Value Rules 2

• Validation test is opposite of constraint

• Use De Morgan’s laws
  – If the constraint was “value >= 0 AND value <= 100”, use:

SELECT * from <table> where <table>.<attribute> < 0 OR <table>.<attribute> > 100;
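
As a small illustration of negating a constraint with De Morgan's laws, the sketch below runs the violator query against a hypothetical `scores` table in an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER, value INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [(1, 50), (2, -3), (3, 101), (4, 0), (5, 100)],
)

# Constraint:            value >= 0 AND value <= 100
# Negation (De Morgan):  value <  0 OR  value >  100
range_violators = conn.execute(
    "SELECT id FROM scores WHERE value < 0 OR value > 100 ORDER BY id"
).fetchall()
print(range_violators)
```

Under SQL's three-valued logic a NULL value satisfies neither the constraint nor its negation, so a value rule is usually paired with a separate null rule.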

Page 11:

Domain Membership

• Domains are stored in a database table

• Test for domain membership of an attribute is a test to make sure that all values are represented in domain table

Page 12:

Domain Reference Tables

create table domainref (
  name varchar(30),
  dtype char(1),
  description varchar(1024),
  source varchar(512),
  domainid integer
);

Page 13:

Domain Reference Tables

create table domainvals (
  domainid integer,
  value varchar(128)
);

Page 14:

Domain Membership

• Test for membership of attribute foo in the domain named bar:

SELECT * from <table> where foo not in
  (SELECT value from domainvals where domainid =
    (SELECT domainid from domainref
      where domainref.name = 'bar'));
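
A runnable sketch of the membership test, using the `domainref` and `domainvals` tables from the earlier slides. The data table `orders`, the sample values, and the `dtype` code 'E' are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE domainref (name VARCHAR(30), dtype CHAR(1),
                        description VARCHAR(1024), source VARCHAR(512),
                        domainid INTEGER);
CREATE TABLE domainvals (domainid INTEGER, value VARCHAR(128));
CREATE TABLE orders (id INTEGER, foo TEXT);
INSERT INTO domainref VALUES ('bar', 'E', 'demo domain', 'demo source', 1);
INSERT INTO domainvals VALUES (1, 'red'), (1, 'green');
INSERT INTO orders VALUES (1, 'red'), (2, 'blue'), (3, 'green');
""")

# Violators: rows whose foo value is not listed in domain 'bar'.
domain_violators = conn.execute(
    """
    SELECT id, foo FROM orders WHERE foo NOT IN
      (SELECT value FROM domainvals WHERE domainid =
        (SELECT domainid FROM domainref WHERE domainref.name = ?))
    ORDER BY id
    """,
    ("bar",),
).fetchall()
print(domain_violators)
```

Passing the domain name as a bound parameter avoids quoting problems in the string literal.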

Page 15:

Domain Assignment

• The values in the attribute define the domain:
  – Find all the values not in the domain already
  – Update domain tables with those values

Page 16:

Domain Assignment 2

• SELECT * from <table> where foo not in
    (SELECT value from domainvals where domainid =
      (SELECT domainid from domainref
        where domainref.name = 'bar'));

For all values in this set, create a record with (the value, the domain id for “bar”), and insert into domainvals.
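
The assignment steps could look roughly like this in Python with sqlite3. The `orders` table and its contents are hypothetical, and `domainref` is trimmed to the two columns the example needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE domainref (name VARCHAR(30), domainid INTEGER);
CREATE TABLE domainvals (domainid INTEGER, value VARCHAR(128));
CREATE TABLE orders (id INTEGER, foo TEXT);
INSERT INTO domainref VALUES ('bar', 1);
INSERT INTO domainvals VALUES (1, 'red');
INSERT INTO orders VALUES (1, 'red'), (2, 'blue'), (3, 'blue'), (4, 'green');
""")

# Look up the domain id for 'bar'.
domain_id = conn.execute(
    "SELECT domainid FROM domainref WHERE name = ?", ("bar",)
).fetchone()[0]

# Find the distinct attribute values not yet in the domain.
missing = conn.execute(
    "SELECT DISTINCT foo FROM orders WHERE foo NOT IN "
    "(SELECT value FROM domainvals WHERE domainid = ?)",
    (domain_id,),
).fetchall()

# Insert one (domain id, value) record per missing value.
conn.executemany(
    "INSERT INTO domainvals (domainid, value) VALUES (?, ?)",
    [(domain_id, value) for (value,) in missing],
)

bar_values = [row[0] for row in conn.execute(
    "SELECT value FROM domainvals WHERE domainid = ? ORDER BY value",
    (domain_id,),
)]
print(bar_values)
```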

Page 17:

Mapping Membership

• Similar to domain membership, except:
  – Must include domain membership tests for both values
  – The pair of values must also be looked up in the mapping tables

Page 18:

Completeness

• Defines when a record is complete
  – Ex: IF (Orders.Total > 0.0), Complete With {Orders.Billing_Street, Orders.Billing_City, Orders.Billing_State, Orders.Billing_ZIP}

• Format:
  – Condition
  – List of fields that must be complete

Page 19:

Completeness 2

• Equivalent to a set of null tests using condition

• Select * from <table> where <condition is true> and (<field 1> is NULL or <field 2> is NULL or ...);
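
One way to turn a completeness rule into its violator query is sketched below: a record violates the rule when the condition holds and at least one listed field is null. The helper `completeness_query`, the `id` column, and the sample rows are hypothetical. The rule text is interpolated directly into the SQL, so this assumes the rules come from a trusted source.

```python
import sqlite3

def completeness_query(table, condition, fields):
    """Build the violator query for a completeness rule (hypothetical helper)."""
    null_tests = " OR ".join(f"{field} IS NULL" for field in fields)
    return (f"SELECT id FROM {table} "
            f"WHERE ({condition}) AND ({null_tests}) ORDER BY id")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (id INTEGER, Total REAL, Billing_Street TEXT, "
    "Billing_City TEXT, Billing_State TEXT, Billing_ZIP TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 25.0, "1 Main St", "Springfield", "IL", "62701"),  # complete
        (2, 10.0, None, "Springfield", "IL", "62701"),         # street missing
        (3, 0.0, None, None, None, None),                      # condition false
    ],
)

sql = completeness_query(
    "Orders", "Total > 0.0",
    ["Billing_Street", "Billing_City", "Billing_State", "Billing_ZIP"],
)
incomplete = conn.execute(sql).fetchall()
print(incomplete)
```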

Page 20:

Exemption

• Defines which fields may be missing
  – Ex: IF (Orders.Item_Class != “CLOTHING”) Exempt {Orders.Color, Orders.Size}

• Format:
  – Condition
  – List of fields that may be missing

Page 21:

Exemption 2

• If condition is true, the fields may be null

• Therefore, if condition is false, fields may not be null

• Equivalent to a test for records where the condition is false and any of the listed fields is null
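
A sketch of the exemption test for the Orders example: the violators are records where the exemption condition is false and an exempt field is null. The `id` column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (id INTEGER, Item_Class TEXT, Color TEXT, Size TEXT)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, "CLOTHING", "blue", "M"),  # not exempt, fields present
        (2, "CLOTHING", None, "L"),    # not exempt, Color missing: violator
        (3, "BOOKS", None, None),      # exempt, nulls are allowed
    ],
)

# Opposite of the exemption condition, combined with the null tests.
exemption_violators = conn.execute(
    "SELECT id FROM Orders "
    "WHERE NOT (Item_Class != 'CLOTHING') "
    "AND (Color IS NULL OR Size IS NULL) ORDER BY id"
).fetchall()
print(exemption_violators)
```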

Page 22:

Consistency

• Define a relationship between attributes based on field content
  – IF (Employees.title == “Staff Member”) Then (Employees.Salary >= 20000 AND Employees.Salary < 30000)
  – Format:
    • Condition
    • Assertion

Page 23:

Consistency 2

• If condition is true, the assertion must be true

• Equivalent to test for cases where the condition is true and the assertion is false:

Select * from <table> where <condition> and not (<assertion>);
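
Applied to the Employees rule from the previous slide, the test might look like the following sketch. The `id` column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (id INTEGER, title TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        (1, "Staff Member", 25000),  # condition true, assertion true
        (2, "Staff Member", 35000),  # condition true, assertion false
        (3, "Manager", 80000),       # condition false, rule does not apply
        (4, "Staff Member", 30000),  # upper bound is exclusive: violator
    ],
)

# Violators: condition holds and the assertion does not.
consistency_violators = conn.execute(
    "SELECT id FROM Employees "
    "WHERE title = 'Staff Member' "
    "AND NOT (Salary >= 20000 AND Salary < 30000) ORDER BY id"
).fetchall()
print(consistency_violators)
```

Wrapping the assertion in parentheses matters: without them, `NOT` would bind only to the first comparison.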

Page 24:

Derivation

• Prescriptive form of consistency rule
• Details how one attribute’s value is determined based on other attributes
  – IF (Orders.NumberOrdered > 0) Then {Orders.Total = (Orders.NumberOrdered * Orders.Price) * 1.05}

• Format:
  – Condition
  – Assignment

Page 25:

Derivation 2

• The assigned fields must be updated if condition is true

• Find all records where the condition is true

• Generate update SQL calls with updated values

• Execute updates
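
The three steps above can be sketched for the Orders derivation rule as follows. The `id` column and the sample rows are hypothetical. In practice a single `UPDATE ... WHERE NumberOrdered > 0` statement would do the same work, but the version below mirrors the find, generate, execute sequence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (id INTEGER, NumberOrdered INTEGER, Price REAL, Total REAL)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [(1, 2, 10.0, 0.0), (2, 0, 5.0, 0.0), (3, 4, 2.5, 99.0)],
)

# Step 1: find all records where the condition is true.
matches = conn.execute(
    "SELECT id, NumberOrdered, Price FROM Orders WHERE NumberOrdered > 0"
).fetchall()

# Step 2: generate the update parameters with the derived values.
updates = [((number * price) * 1.05, order_id)
           for order_id, number, price in matches]

# Step 3: execute the updates.
conn.executemany("UPDATE Orders SET Total = ? WHERE id = ?", updates)

totals = dict(conn.execute("SELECT id, Total FROM Orders"))
print(totals)
```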

Page 26:

Functional Dependence

• Functional Dependence between columns X and Y:
  – For any two records R1 and R2 in a table,
    • if field X of record R1 contains value x and field X of record R2 contains the same value x, then if field Y of record R1 contains the value y, then field Y of record R2 must contain the value y.

• In other words, attribute Y is said to be determined by attribute X.

Page 27:

Functional Dependence 2

• Rule Format:
  – Attribute X determines Attribute Y

• Validation test makes sure that the functional dependence criterion is met

• This means that if we extract the X value from the set of all distinct value pairs, that set should have no duplicates

Page 28:

Functional Dependence 3

• Create view FD as select distinct X, Y from <table>;

• Select count (*) from FD;

• Select count (distinct X) from <table>;

• These should be the same numbers.
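
A sketch of the count comparison on a hypothetical `addresses` table, where `zip` plays the role of X and `city` the role of Y. The final `GROUP BY` query is an addition not on the slide; it lists the X values that break the dependence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (zip TEXT, city TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO addresses VALUES (?, ?)",
    [
        ("10001", "New York"),
        ("10001", "New York"),
        ("60601", "Chicago"),
        ("60601", "Chigaco"),  # misspelling breaks zip -> city
    ],
)

# Number of distinct (X, Y) pairs.
conn.execute("CREATE VIEW FD AS SELECT DISTINCT zip, city FROM addresses")
pair_count = conn.execute("SELECT COUNT(*) FROM FD").fetchone()[0]

# Number of distinct X values.
x_count = conn.execute(
    "SELECT COUNT(DISTINCT zip) FROM addresses"
).fetchone()[0]

dependence_holds = pair_count == x_count

# X values that map to more than one Y value.
offenders = conn.execute(
    "SELECT zip FROM addresses GROUP BY zip HAVING COUNT(DISTINCT city) > 1"
).fetchall()
print(pair_count, x_count, dependence_holds, offenders)
```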

Page 29:

Primary Key/Uniqueness

• A set of attributes defined as a primary key must uniquely identify a record

• Can also be viewed as a uniqueness constraint

• Format:
  – {attribute list} is PRIMARY
  – {attribute list} is UNIQUE

Page 30:

Primary

• Test to make sure that the number of distinct records with the expected key is the same as the number of records

• Select count(*) from <table>;
• Select count (distinct <attribute list>) from <table>;

• These numbers should be the same
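
The comparison can be sketched for a two-column key on a hypothetical `line_items` table. SQLite accepts only one argument in `COUNT(DISTINCT ...)`, so the distinct key count is taken over a `SELECT DISTINCT` subquery instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE line_items (order_id INTEGER, line_no INTEGER, sku TEXT)"
)
conn.executemany(
    "INSERT INTO line_items VALUES (?, ?, ?)",
    [(1, 1, "A-100"), (1, 2, "B-200"), (1, 2, "C-300")],  # (1, 2) repeats
)

# Total number of records.
total_rows = conn.execute("SELECT COUNT(*) FROM line_items").fetchone()[0]

# Number of distinct values of the candidate key (order_id, line_no).
distinct_keys = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT order_id, line_no FROM line_items)"
).fetchone()[0]

key_is_unique = total_rows == distinct_keys
print(total_rows, distinct_keys, key_is_unique)
```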

Page 31:

Uniqueness

• Test for multiple record occurrences with the same set of values that should have been unique, if there is a separate known primary key

SELECT t1.<attribute>, t2.<attribute>
FROM <table> AS t1, <table> AS t2
WHERE t1.<attribute> = t2.<attribute> and t1.<primary> <> t2.<primary>;
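
A sketch of the self-join on a hypothetical `employees` table, where `emp_id` is the known primary key and `ssn` should be unique. It uses `<` rather than `<>` on the key, an adjustment so that each duplicate pair is reported once instead of twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, ssn TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [(1, "111-11-1111"), (2, "222-22-2222"), (3, "111-11-1111")],
)

# Pairs of distinct records that share a value meant to be unique.
duplicate_pairs = conn.execute(
    "SELECT t1.emp_id, t2.emp_id, t1.ssn "
    "FROM employees AS t1, employees AS t2 "
    "WHERE t1.ssn = t2.ssn AND t1.emp_id < t2.emp_id "
    "ORDER BY t1.emp_id, t2.emp_id"
).fetchall()
print(duplicate_pairs)
```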

Page 32:

Foreign Key

• When the values in field f in table T are chosen from the key values in field g in table S, field T.f is said to be a foreign key referencing field S.g

• If f is a foreign key, the key must exist in table S, column g (=referential integrity)

Page 33:

Foreign Key 2

• Similar to primary key

• Test is to make sure that all values in foreign key field exist in target table

Select * from <source table> where <attribute> not in (Select distinct <attribute> from <target table>);
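
A sketch of the referential integrity test, with hypothetical `customers` (target) and `orders` (source) tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1), (2);
INSERT INTO orders VALUES (10, 1), (11, 3), (12, 2);
""")

# Violators: foreign key values with no matching key in the target table.
orphans = conn.execute(
    "SELECT order_id, customer_id FROM orders "
    "WHERE customer_id NOT IN (SELECT DISTINCT id FROM customers) "
    "ORDER BY order_id"
).fetchall()
print(orphans)
```

If the target column can contain NULL, `NOT IN` never evaluates to true for non-matching values and the query returns no rows; adding `WHERE id IS NOT NULL` to the subquery guards against that.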

Page 34:

Use of Data Quality Rules

• Data Validation

• Root Cause Analysis

• Message Transformation

• Data-driven GUIs

• Metadata Collection

Page 35:

Data Validation

• Translate rule set into select statements

• Create a program that:
  – Loads select statements into an array, indexed by a unique integer
  – Connects to database via ODBC
  – Iterates through the array of select statements, executing each one and collecting its results

Page 36:

Data Validation 2

– Each type of rule has an expected result; check against the expected result

– Outputs the result of each statement to output file, tagged by rule identifier

– Results can be tallied to yield an overall percentage of valid records to total records
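
The validation program described on these two slides might be sketched as below. It assumes sqlite3 in place of an ODBC connection, a hypothetical `people` table, two illustrative rules, and printing instead of writing to an output file. Every rule here is a violator query, so the expected result for each is an empty set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER, age INTEGER, email TEXT);
INSERT INTO people VALUES (1, 30, 'a@example.com'), (2, -5, 'b@example.com'),
                          (3, 40, NULL), (4, 200, NULL);
""")

# Violator queries, indexed by a unique integer rule identifier.
rules = {
    1: "SELECT id FROM people WHERE email IS NULL ORDER BY id",
    2: "SELECT id FROM people WHERE age < 0 OR age > 120 ORDER BY id",
}

total = conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]
violations = {}   # rule id -> ids of the violating records
bad_ids = set()   # records that violate at least one rule

for rule_id, sql in rules.items():
    ids = [row[0] for row in conn.execute(sql)]
    violations[rule_id] = ids
    bad_ids.update(ids)
    print(f"rule {rule_id}: {len(ids)} violation(s) {ids}")

# Overall percentage of valid records to total records.
percent_valid = 100.0 * (total - len(bad_ids)) / total
print(f"{percent_valid:.1f}% of records are valid")
```

Collecting violating ids in a set keeps a record that breaks several rules from being counted more than once in the overall percentage.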

Page 37:

Root Cause Analysis

• Root cause analysis can be started by looking at the counts of violated rules

• Use the most frequently violated rule as a starting place
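
Picking the starting rule from the violation counts is a one-liner with `collections.Counter`. The tallies below are made-up numbers for illustration only.

```python
from collections import Counter

# Hypothetical tallies: rule identifier -> number of violating records.
violation_counts = Counter({1: 12, 2: 340, 3: 7})

# The most frequently violated rule is the starting place.
starting_rule, count = violation_counts.most_common(1)[0]
print(starting_rule, count)
```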

Page 38:

Message Transformation

• Electronic Data Interchange

• Use DQ rules to validate incoming messages

• Use DQ rules (derivations, mappings) to transform incoming messages into an internal format

Page 39:

Data-driven GUIs

• Data dependence is specified in a collection of rules

• Generate equivalence classes of data values based on dependence specification

Page 40:

Data-driven GUIs

• First, look for all independent attributes – this is class 0

• For class i, collect all attributes that depend on class (i – 1)

• The GUI will be constructed to iteratively request data from class 0..n

• Based on the results from collecting data at step j, the rules associated with the actual values are applied, determining which values are requested at step j + 1
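
The class construction can be sketched as a small dependency-level computation. The function `dependency_classes` and the dependence map are hypothetical, loosely based on the Orders rules from earlier slides. Placing an attribute one class above its highest-class dependency is one interpretation of "depends on class (i – 1)" for attributes with several dependencies.

```python
def dependency_classes(depends_on):
    """Assign each attribute a class: 0 if independent,
    otherwise 1 + the highest class among its dependencies."""
    classes = {}

    def class_of(attr, path=()):
        if attr in classes:
            return classes[attr]
        if attr in path:
            raise ValueError(f"cyclic dependence involving {attr}")
        deps = depends_on.get(attr, ())
        if deps:
            classes[attr] = 1 + max(class_of(d, path + (attr,)) for d in deps)
        else:
            classes[attr] = 0
        return classes[attr]

    for attr in depends_on:
        class_of(attr)
    return classes

# Hypothetical dependence specification: attribute -> attributes it depends on.
depends_on = {
    "Item_Class": [],
    "NumberOrdered": [],
    "Price": [],
    "Color": ["Item_Class"],
    "Size": ["Item_Class"],
    "Total": ["NumberOrdered", "Price"],
    "Billing_ZIP": ["Total"],
}

classes = dependency_classes(depends_on)

# Group attributes by class: the order in which the GUI would request them.
by_class = {}
for attr, level in sorted(classes.items()):
    by_class.setdefault(level, []).append(attr)

for level in sorted(by_class):
    print(level, by_class[level])
```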

Page 41:

Metadata Collection

• Use domain and mapping derivation rules to collect metadata

• Use other rules as a documentation of business operations

