COLOGNE UNIVERSITY OF APPLIED SCIENCES

    February 15, 2015

    Mohammad Aminul Islam (MatrNo: 11103812)

    Business Intelligence

    Data Quality / Information Quality for Northwind


List of Contents

    1. Introduction
    2. Data and Information
    3. What is Data/Information Quality?
    4. Dimensions of Data Quality
    5. Different Approaches for Data Quality
        5.1. Data Profiling
        5.2. Cleaning and Conforming
    6. Data Quality Analysis on the Northwind Database
    7. Summary
    8. Conclusion
    9. References


1 Introduction

Most business owners now prefer to use a data warehouse for their business. A data warehouse is a convenient way to manage the overall business: management can observe the overall business condition, make decisions more easily, and predict the future of the business. In practice, the management of a business is not much interested in looking at every individual activity; it is more interested in the reports or summaries of the business, which are different calculations over the data in the database. So the data in the database should be of good quality, because it impacts the decision making of the business. Otherwise many problems arise, such as more user complaints or a wrong business direction. Kimball's book discusses three important reasons why executives care about data quality. First, "if I could see the data, then I could manage my business better." Second, most data sources are distributed, so integrating disparate data sources is required. Third, a sharply increased number of complaints indicates a lack of qualified data.

    2 Data and Information

    Data is raw, unorganized facts that need to be processed. Data can be something simple

    and seemingly random and useless until it is organized.1

Example: Each student's marks are data.

    When data is processed, organized, structured or presented in a given context so as to make it

    useful, it is called information.

    Example: Average marks of all students are information.

    1 See (Data Vs Information)
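The difference can be sketched in a few lines of Python (the student names and marks are invented for illustration):

```python
# Raw, unorganized facts: each student's marks are data.
marks = {"Alice": 78, "Bob": 85, "Carol": 91}

# Processing the data into a useful summary turns it into information.
average = sum(marks.values()) / len(marks)
print(average)  # the average mark of all students is information
```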


    3 What is Data/Information Quality?

In computing, data quality is the reliability and application efficiency of data, particularly when kept in a data warehouse. Data quality assurance (DQA) is the process of verifying the reliability and efficiency of data.2

    Data quality is an essential characteristic that determines the reliability of data for

    making decisions. 3

    Data are of high quality if, "they are fit for their intended uses in operations, decision

    making and planning." (J. M. Juran). Alternatively, data are deemed of high quality if they

    correctly represent the real-world construct to which they refer.

From these different definitions of data quality we can say that data should be reliable, represent the real world, and serve the purpose of decision making.

    2 See (Rouse, M. 2015)

    3 See (IBM)


4 Data Quality Dimensions

Data quality dimensions concern accuracy, availability, completeness, conformance, consistency, credibility, processability, relevance and timeliness. 4

    Figure 1: Dimension of Data Quality

Accuracy: The accuracy of data is how correctly the data represent the real-world object, situation or event.

    4 See (Danette McGilvray)



Example: A wrong employee name may be a typing mistake.

Availability: The availability of data means the data are accessible over a long time without any problem.

Example: Suppose our source data come from a URL. When the URL is not available, a 404 Not Found error is shown.

Completeness: The completeness of data means the data contain all the items or data points necessary to support the application for which they are intended.

Example: The full name of the customer, the address, etc.

Conformance: The conformance of data means a set of rules or regulations for capturing and describing the data.

Example: A standard date format.

Consistency: Consistency means the data cannot violate the rules of the database itself.

Example: A character value cannot be inserted into an integer column.

Credibility: The credibility of data means the data come from trusted sources.

Processability: Processability means that the data given as input are understandable by the machine or software.

Relevance: Relevance means the data contain the information necessary to support the application.

Timeliness: The current state of the data is available without unnecessary delays.


5 Different Approaches for Data Quality

5.1 Data Profiling

According to Kimball, data profiling is the technical analysis of data to describe its content, consistency and structure. 5

Data profiling plays two roles, strategic and tactical. When a data source is identified, a data profiling assessment determines its suitability for the data warehouse and supports a go/no-go decision. Data profiling is a very critical stage when initiating any database project that incorporates source data from external systems. Allocating sufficient time and analysis to the data profiling assessment gives the designer a better solution and reduces project risk by identifying potential data issues.

Best practices for data profiling6

Distinct count and percent: Analyzing the distinct values of each column helps to identify the unique values within the source data. Identifying unique keys is a fundamental requirement for database and ETL architecture: whenever we need to insert or update data in the database, we need these unique keys to act on a specific record.

Fields profiled in the Orders table: Order ID, Order Date, Shipped Date, Ship Via, Ship Name, Ship City, Ship Region, Ship Country.

    5 See (Kimball, R., Ross)

    6 See (tdwi)

Fields profiled in the Customers table: Customer ID, Customer Name, Address, City, Region, Postal Code.
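A minimal sketch of a distinct-count profile in Python (the sample rows and column names are invented; a real profile would run against the source tables):

```python
# Profile the distinct values per column to find candidate unique keys.
rows = [
    {"order_id": 1, "ship_city": "Cologne"},
    {"order_id": 2, "ship_city": "Dhaka"},
    {"order_id": 3, "ship_city": "Cologne"},
]

for column in rows[0]:
    values = [r[column] for r in rows]
    distinct = len(set(values))
    percent = 100 * distinct / len(values)
    # 100% distinct suggests the column can serve as a unique key.
    print(f"{column}: {distinct} distinct ({percent:.0f}%)")
```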


Zero, blank, null percent: Analyzing missing, blank and null values helps to identify potential data issues. This information helps the database or ETL architect to set appropriate default values, or to allow nulls in a target database column where the data is unknown.

Field         Zero  Blank  Null  Percent
Order ID         0      0     0     100%
Order Date     500    200    30      30%
Shipped Date    50     20    40      40%
Ship Via        40    100    10      35%
Ship Name       20    400    30      22%
Ship City       20    400    12      25%
Ship Country   400    200    40      20%
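This count can be sketched in Python (the field names and sample values are invented, with None standing in for null):

```python
# Count zero, blank and null values per field to spot potential data issues.
values = {"ship_name": ["XXX", "", None, "ACME", ""],
          "freight":   [0, 12.5, None, 0, 3.0]}

for field, data in values.items():
    zeros  = sum(1 for v in data if v == 0)
    blanks = sum(1 for v in data if v == "")
    nulls  = sum(1 for v in data if v is None)
    bad_pct = 100 * (zeros + blanks + nulls) / len(data)
    print(f"{field}: zero={zeros} blank={blanks} null={nulls} "
          f"({bad_pct:.0f}% problematic)")
```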

Minimum, maximum string length and type: Analyzing the string lengths of the source data helps to set the length and type of the database columns. This is very important for large databases: it saves space and increases query performance by minimizing table scan time. If the field is part of an index, keeping the data type in check helps to minimize index size, overhead and scan times.

Field         Minimum  Maximum  Type
Order ID            6        8  Integer
Order Date         10       16  Date
Shipped Date       10       16  Date
Ship Via            3       15  Varchar
Ship Name           2       14  Varchar
Ship City           4       15  Varchar
Ship Country        3       11  Varchar
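A sketch of the length analysis in Python (the city values are made-up examples):

```python
# Derive minimum and maximum string lengths to size a VARCHAR column.
ship_cities = ["Dhaka", "Cologne", "Rio de Janeiro", "Bonn"]

lengths = [len(s) for s in ship_cities]
print(min(lengths), max(lengths))  # here the maximum suggests VARCHAR(14)
```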


Numerical and date range analysis: This analysis helps to identify the ranges of numerical and date values. Suppose we need only integer values: if we declare the column with a decimal precision instead, it takes more space than an integer. And after observing the date values, we can decide which format is appropriate for the database.

Field         Data 1      Data 2      Data 3
Order ID      123456      123457      123458
Order Date    01.01.2015  2015.01.02  03.01.2015
Shipped Date  03.01.2015  04.01.2015  05.01.2015
Ship Via      Air         Bus
Ship Name     XXX         XXX
Ship City     Dhaka       Kln         Cologne
Ship Country  Bangladesh  Germany     Deutschland

Pattern analysis: Checking the pattern of the data confirms that a data field is formatted correctly.

Field          Data 1             Data 2
Customer ID    123456             123457
Customer Name  Md. Aminul         Mohammad Islam
Address        Fuldaer str        Oranienstr
Mobile No      017564879954       +4914756214789
E-mail         [email protected]  [email protected]
Website        www.aminul.com     Go.com

The annotations in the original table flag: blank values; values in different formats; values with the same meaning but written differently, which causes confusion; names split into first name and last name; and differing formats for phone numbers and websites.
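A pattern check can be sketched with regular expressions in Python (the patterns and the example values below are illustrative assumptions, not the project's actual rules):

```python
import re

# Check that phone numbers and e-mail addresses follow an agreed pattern.
PHONE = re.compile(r"^\+?\d{8,15}$")           # assumed: 8-15 digits, optional +
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")

print(bool(PHONE.match("+4914756214789")))    # True
print(bool(EMAIL.match("info@example.com")))  # True
print(bool(EMAIL.match("Go.com")))            # False: not a valid address
```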


5.2 Cleaning and Conforming

According to Kimball, cleaning and conforming are the critical ETL system tasks. Extracting and delivering simply move and load the data, but cleaning and conforming add value to the data and enhance value to the organization (Kimball, Ralph, The Data Warehouse Lifecycle Toolkit, second edition, page 330).

Kimball lists nine things that help to address data quality:

Declare a high-level commitment to a data quality culture.

Drive process reengineering at the executive level.

Spend money to improve the data entry environment.

Spend money to improve application integration.

Spend money to change how processes work.

Promote end-to-end team awareness.

Promote interdepartmental cooperation.

Publicly celebrate data quality excellence.

Continuously measure and improve data quality.

Data cleansing system: The ETL data cleansing system fixes dirty data, while at the same time the data warehouse keeps providing an accurate picture of the data captured by the organization's production systems. The goal is to develop an ETL system that is capable of correcting, rejecting or loading data, with easy-to-use structures, rules and standardization.

Quality screens: Quality screens are the heart of the ETL system. A quality screen is a test run against the data. If the test passes, nothing happens; if the test finds wrong data, it records the failure in the error event schema. There are three types of quality screen tests:

Column screen test: This test happens within a single column, for example whether the column contains wrong values or null values, or whether a value fails the required format.

Structure screen test: This tests the relationships of data among columns. Structure screens test primary keys, foreign keys, and one-to-many relationships between fields in two columns.

Business rule screen: These implement more complex tests that do not resemble column or structure screens, for example rules involving the shipment date.
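The three screen types can be sketched in Python (the screen names, the sample row and the flat log structure are assumptions; Kimball's error event schema is a proper dimensional table):

```python
from datetime import datetime

# Error event log: one record per failed screen, with a timestamp.
error_events = []

def run_screen(name, passed, row):
    """Record a failed screen in the error event log."""
    if not passed:
        error_events.append({"screen": name, "row": row,
                             "logged_at": datetime.now().isoformat()})

row = {"order_id": 10369, "order_date": "2015-01-03",
       "shipped_date": "2015-01-01"}

# Column screen: a required field must not be null.
run_screen("order_id not null", row["order_id"] is not None, row)

# Business rule screen: an order cannot ship before it was placed.
run_screen("order_date <= shipped_date",
           row["order_date"] <= row["shipped_date"], row)

print(len(error_events))  # 1: only the business rule screen failed
```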


Error event schema: The error event schema is a centralized schema whose purpose is to record every error that occurs in the database, with the date and time. By reviewing the recorded errors it is possible to improve data quality.

Figure 2: Error event schema

Audit dimension assembler: The audit dimension is a special dimension the ETL system attaches to each fact table. When each record is created, it adds metadata to the table. This metadata is available to BI applications, giving visibility into data quality.


6 Data Quality Analysis on the Northwind Database

A structured address entry form should capture every part of the address separately:

Full Name:
House No.:
Street:
City:
Region:
Postal Code:
Country:
Contact:
Optional: name of the person entering the data

Rules for data entry:

Every input should be specific; the address must not be entered in one line.
Use constrained values for country, city, street and postal code.
The contact number should follow a valid format.
All fields are mandatory.
If the entry is made by a data entry operator, his or her name or ID should be included.

Order and shipment: Order IDs and shipment IDs are generated automatically. When an order or a shipment is created, the date and time are added automatically, and order date < shipment date must always hold.
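These entry rules can be sketched as a validation function in Python (the field names, the phone pattern and the sample record are assumptions for illustration):

```python
import re

def validate_order(order):
    """Return a list of rule violations for one order record."""
    errors = []
    # All address fields are mandatory.
    for field in ("full_name", "country", "city", "street", "postal_code"):
        if not order.get(field):
            errors.append(f"{field} is mandatory")
    # The contact number must follow a valid format (assumed: 8-15 digits).
    if not re.fullmatch(r"\+?\d{8,15}", order.get("contact", "")):
        errors.append("contact number has an invalid format")
    # The order date must precede the shipment date.
    if order.get("order_date") and order.get("shipped_date"):
        if order["order_date"] >= order["shipped_date"]:
            errors.append("order date must precede shipment date")
    return errors

bad = {"full_name": "Md. Aminul", "country": "Germany", "city": "Cologne",
       "street": "Fuldaer Str.", "postal_code": "51103",
       "contact": "abc", "order_date": "2015-02-10",
       "shipped_date": "2015-02-08"}
print(validate_order(bad))  # two violations: contact format, date order
```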

SAS code for region and ZIP:

if region eq 'Not Specified' or trim(region) eq '' then region = '*';
if ZIP eq 'Not Specified' or trim(ZIP) eq '' then ZIP = '*';

Here, if the region or the ZIP code is not specified, it is set to '*'.

Names should be captured as first name, middle name and last name.


Finding null values in the Leader column

    Figure: Dim Employee Dimension

SAS code:

proc sql;
  select NMISS(Leader)
  from West3.DIM_EMPLOYEE;
quit;

    Results:


    Replacing null value of Leader column

SAS code:

proc sql;
  select COALESCE(Leader, 0)
  from West3.DIM_EMPLOYEE;
quit;

    Results:

Here, null values are replaced by 0 because the data type of the column is integer.


Checking the uniqueness of Bestell_Nr

SAS code for the row count:

proc sql;
  select count(Bestell_Nr)
  from WEST3.BESTELLUNGEN;
quit;

Results:

SAS code for the distinct value count:

proc sql;
  select count(distinct Bestell_Nr)
  from WEST3.BESTELLUNGEN;
quit;

Results:

Here, the total row count is 832 and the distinct value count is 830. That means two values are duplicated. Let's find the duplicate values.


SAS code for the duplicate values:

proc sql;
  select Bestell_Nr
  from WEST3.BESTELLUNGEN
  group by Bestell_Nr
  having count(Bestell_Nr) > 1;
quit;

Result:

Bestell_Nr 10369 and 10830 are not unique values.


7 Summary

Distinct count and percent: find the unique values in a dimension.

Zero, blank, null percent: find zero, blank and null values and make rules to avoid them.

Minimum, maximum string length and type: find the string lengths and types and set the proper column length and type.

Numerical and date range analysis: analyze whether numerical values need fractions or not.

Pattern analysis: define patterns to capture real data.

Quality screens: column screen tests, structure screen tests and business rule screens give us more accurate data.

8 Conclusion

In the end, I want to say that data quality is a continual process. But if we analyze the data before building the warehouse and use these techniques, it is possible to minimize data errors and increase data quality.


    References

1. Data vs Information. Available from: http://www.diffen.com/difference/Data_vs_Information [Accessed: 20th December 2014]

2. Rouse, M. (2015) Data Quality. [Online] Available from: http://searchdatamanagement.techtarget.com/definition/data-quality [Accessed: 11th February 2015]

3. IBM, Data Quality. http://www-01.ibm.com/software/data/quality/ [Accessed: 21st December 2014]

4. Danette McGilvray, Granite Falls Consulting, Inc. Excerpted from Executing Data Quality Projects. http://www.gfalls.com/storage/book/individual-downloads-quick-ref/10steps_DQDimen.pdf [Accessed: 25th December 2015]

5. Kimball, R., Ross, M., Thornthwaite, W., Mundy, J., & Becker, B. (2008). The Data Warehouse Lifecycle Toolkit. 2nd Edition. Wiley Publishing.

6. tdwi (3 February 2010), The Necessity of Data Profiling. http://tdwi.org/Articles/2010/02/03/Data-Profiling-Value.aspx?Page=1 [Accessed: 27th December 2015]

7. Open Data & Metadata Quality. Authors: Makx Dekkers, Nikolaos Loutas, Michiel De Keyzer and Stijn Goedertier

8. Data Quality Management: The Most Critical Initiative You Can Implement. Author: Jonathan G. Geiger, Intelligent Solutions, Inc., Boulder

9. http://tdwi.org/Articles/2010/02/03/Data-Profiling-Value.aspx?Page=1

10. Wikipedia: http://en.wikipedia.org/wiki/Data_quality

