Role of Data Cleaning in Data Warehouse
Presentation by
Ramakant Soni, Assistant Professor, BKBIET, Pilani
What is a Data Warehouse?
A data warehouse is an information delivery system in which we integrate and transform data into information used largely for strategic decision making. Historic data from the enterprise's various operational systems is collected and combined with other relevant data from outside sources to form the integrated content of the data warehouse.
What is Data Cleaning?
Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve data quality.
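The idea can be sketched in a few lines of Python. This is a minimal illustration of detecting and removing errors and inconsistencies, not any specific tool; the field names and validity rules are illustrative assumptions.

```python
def clean_record(record):
    """Return a cleaned copy of the record, or None if it is unusable."""
    cleaned = dict(record)
    # Inconsistency: normalize city names to one casing and trim whitespace.
    if cleaned.get("city"):
        cleaned["city"] = cleaned["city"].strip().title()
    # Error: an age outside a plausible range is treated as missing.
    age = cleaned.get("age")
    if age is not None and not (0 <= age <= 120):
        cleaned["age"] = None
    # A record without a key cannot be matched or used at all.
    if not cleaned.get("id"):
        return None
    return cleaned

records = [
    {"id": 1, "city": "  pilani ", "age": 34},
    {"id": 2, "city": "PILANI", "age": 430},   # implausible age
    {"id": None, "city": "Delhi", "age": 25},  # missing key
]
cleaned = [r for r in (clean_record(rec) for rec in records) if r]
```

Here the inconsistent spellings of the same city collapse to one form, the implausible age is flagged as missing, and the unusable record is removed.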
Introduction
RAMAKANT SONI, BKBIET
Need for Data Cleaning
• Data warehouses require and provide extensive support for data cleaning.
• They load and continuously refresh huge amounts of data from a variety of
sources so the probability of “dirty data” is high.
• Data warehouses are used for decision making, so the correctness of data
is vital to avoid wrong conclusions.
Requirements
A data cleaning approach should satisfy several requirements:
• Detect and remove all major errors and inconsistencies, both in individual data sources and when integrating multiple sources. The approach should be supported by tools that limit manual inspection and programming effort.
• Data cleaning should not be performed in isolation, but together with schema-related data transformations based on comprehensive metadata.
• Mapping functions should be specified in a declarative way for data cleaning and be reusable for other data sources as well as for query processing.
• A workflow infrastructure should be supported to execute all data transformation steps for multiple sources and large data sets in a reliable and efficient way.
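The requirement for declarative, reusable mapping functions can be sketched as follows. The rule table and field names are assumptions for illustration: each rule names a field and a transformation, so the same rule set can be applied unchanged to another data source.

```python
# Declarative mapping rules: (field, transformation) pairs, defined once
# and reusable across sources. All names here are illustrative.
RULES = [
    ("name",  str.strip),
    ("name",  str.title),
    ("phone", lambda v: v.replace("-", "").replace(" ", "")),
]

def apply_rules(record, rules=RULES):
    """Apply each mapping rule to its field, skipping missing values."""
    out = dict(record)
    for field, fn in rules:
        if out.get(field) is not None:
            out[field] = fn(out[field])
    return out

out = apply_rules({"name": "  john SMITH ", "phone": "011-234 567"})
```

Because the rules are data rather than code, a second source with the same fields can reuse them directly, which is the point of the declarative requirement.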
Single-source problems
The data quality of a source largely depends on the degree to which it is governed by schema and integrity constraints controlling permissible data values.
• Sources without a schema, such as files, place few restrictions on what data can be entered and stored, giving rise to a high probability of errors and inconsistencies.
• Database systems enforce the restrictions of a specific data model (e.g., the relational approach requires simple attribute values, referential integrity, etc.) as well as application-specific integrity constraints.
Schema-level problems occur because of the lack of appropriate model-specific or application-specific integrity constraints.
Instance-level problems relate to errors and inconsistencies that cannot be prevented at the schema level (e.g., misspellings).
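The distinction between the two levels can be illustrated with a small sketch (the domain values and reference list are assumptions): a schema-level check enforces a domain constraint that a data model could express, while an instance-level check catches an error the schema cannot prevent, such as a misspelled city name.

```python
VALID_STATES = {"RJ", "DL", "MH"}             # schema-level domain constraint
KNOWN_CITIES = {"Pilani", "Delhi", "Mumbai"}  # reference data for instance checks

def schema_level_errors(record):
    """Violations a schema could have prevented (domain constraint)."""
    errors = []
    if record.get("state") not in VALID_STATES:
        errors.append("state outside permitted domain")
    return errors

def instance_level_errors(record):
    """Errors that pass every schema check, e.g. misspellings."""
    errors = []
    # A misspelled city satisfies any type/domain constraint
    # but fails a lookup against reference data.
    if record.get("city") not in KNOWN_CITIES:
        errors.append("unknown (possibly misspelled) city")
    return errors

rec = {"state": "RJ", "city": "Pilanni"}  # valid schema, misspelled instance
```

The record passes the schema-level check yet still contains an instance-level error, which is exactly why cleaning cannot rely on constraints alone.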
Multi-source problems
The problems present in single sources are aggravated when multiple sources are integrated. Each source may contain dirty data, and because the sources are independent, their data may be represented differently, overlap, or contradict.
Result: A large degree of heterogeneity.
Problem in cleaning: Identifying overlapping data, in particular matching records that refer to the same real-world entity. This is also referred to as the object identity problem or the duplicate elimination problem.
Frequently, the information is only partially redundant, and the sources may complement each other by providing additional information about an entity.
Solution: Duplicate information should be purged and complementary information should be consolidated and merged in order to achieve a consistent view of real-world entities.
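A minimal sketch of this purge-and-merge step, under stated assumptions: records from two sources are matched by a naive normalized-name key (real matchers are far more sophisticated), duplicates are collapsed, and complementary fields from each source fill each other's gaps.

```python
def merge_records(records):
    """Collapse records that share a matching key, keeping the first
    non-empty value seen for each field (later sources only fill gaps)."""
    merged = {}
    for rec in records:
        key = rec["name"].strip().lower()   # naive matching key (assumption)
        target = merged.setdefault(key, {})
        for field, value in rec.items():
            if value and not target.get(field):
                target[field] = value
    return list(merged.values())

# Two independent sources describing the same real-world entity:
source_a = [{"name": "Acme Corp", "phone": "555-0101", "email": ""}]
source_b = [{"name": "acme corp ", "phone": "", "email": "info@acme.example"}]
merged = merge_records(source_a + source_b)
```

The duplicate is eliminated and the merged record carries the phone from one source and the email from the other, giving a single consistent view of the entity.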
Data cleaning Phases
In general, data cleaning involves several phases:
• Data analysis
• Definition of transformation workflow and mapping rules
• Verification
• Transformation
• Backflow of cleaned data
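The phases above can be sketched as a simple pipeline. Every function body here is an illustrative placeholder for what a real tool would do; only the ordering of the stages comes from the source.

```python
def analyze(data):
    # Data analysis: profile the data to find quality problems.
    return {"rows": len(data)}

def transform(data):
    # Transformation per the defined workflow and mapping rules
    # (placeholder rule: trim and lowercase every value).
    return [row.strip().lower() for row in data]

def verify(data):
    # Verification: test the correctness of the transformed data.
    return all(row == row.strip().lower() for row in data)

def backflow(data, source):
    # Backflow: replace the dirty data in the source with cleaned data
    # so the source also benefits from the cleaning.
    source[:] = data
    return source

source = ["  Alpha", "BETA  "]
profile = analyze(source)
cleaned = transform(source)
ok = verify(cleaned)
backflow(cleaned, source)
```

In practice the definition of the workflow and mapping rules sits between analysis and transformation, and verification may send the workflow back for refinement before the backflow step.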
Data cleaning process
Figure 3. Data Cleaning Process: data analysis and definition of the transformation workflow and mapping rules → verification and transformation → backflow of cleaned data
Data cleaning Tool support
A large variety of tools is available to support data transformation and data cleaning:
• Data analysis tools: 1. data profiling tools, e.g. MigrationArchitect (Evoke Software); 2. data mining tools, e.g. WizRule (WizSoft).
• Data reengineering tools use the discovered patterns and rules for cleaning, e.g. Integrity (Vality Software).
• Specialized cleaning tools deal with a particular domain: 1. special-domain cleaning, e.g. IDCentric (FirstLogic); 2. duplicate elimination, e.g. MatchIt (HelpIT Systems).
• ETL tools use a repository built on a DBMS to manage all metadata about data sources, target schemas, mapping scripts, etc., in a uniform way, e.g. Extract (ETI), CopyManager (Information Builders).
References
1. Erhard Rahm, Hong Hai Do: "Data Cleaning: Problems and Current Approaches", University of Leipzig.
2. Shridhar B. Dandin: "Data Cleaning, a Problem that is Redolent of Data Integration in Data Warehousing", BKBIET Pilani.
3. Arthur D. Chapman: "Principles and Methods of Data Cleaning".