Data Preprocessing


1. Introduction

2. Data Quality: Why Preprocess the Data?

3. Data Preprocessing tasks

4. Data Cleaning

5. Data integration

6. Data reduction

7. Data Transformation and Data Discretization

8. Conclusion

• Data preprocessing is the process that comes before applying data mining techniques.

• Low-quality data will lead to low-quality mining results.

• So we need to attend to data quality and apply data preprocessing techniques such as:

– Data cleaning

– Data integration

– Data reduction

– Data transformation

– Data discretization

• Data have quality if they satisfy the requirements of the intended use.

• There are many factors comprising data quality, including:

– Accuracy

– Completeness

– Consistency

– Timeliness

– Believability

– Interpretability

• Data cleaning routines attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.

• Basic methods of data cleaning:

– Missing values

– Noisy Data

– Data Cleaning as a process

• Ignore the tuple

• Fill in the missing value manually [time consuming and often infeasible]

• Fill it in automatically with a global constant [e.g., “Unknown” or ∞]

• Use the most probable value to fill in the missing value [regression, inference-based tools using a Bayesian formalism, or decision tree induction]
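A minimal sketch of the fill-in strategies above, assuming a pandas DataFrame with a hypothetical income column:

```python
import pandas as pd

# Hypothetical sample data; None becomes NaN (a missing value).
df = pd.DataFrame({"income": [52000.0, None, 61000.0, None, 48000.0]})

# Fill with a global constant (a sentinel standing in for "Unknown").
const_filled = df["income"].fillna(-1)

# Fill with the attribute mean, a simple stand-in for the "most probable
# value"; regression or Bayesian inference would give a better estimate.
mean_filled = df["income"].fillna(df["income"].mean())

print(mean_filled.tolist())  # the two gaps become ~53666.67
```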

• Noise is the random error or variance in a measured variable.

• Binning:

Binning methods smooth a sorted data value by consulting its “neighborhood”, that is, the values around it.

The sorted values are distributed into a number of “buckets”, or “bins”.

• Smoothing by bin means:

Each value in a bin is replaced by the mean value of the bin [e.g., a bin holding 4, 8, 15 has mean 9].

• Smoothing by bin medians:

Each value in a bin is replaced by the bin median.

• Smoothing by bin boundaries:

The minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.

Binning is also used as a discretization technique.
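A small sketch of equal-frequency binning with smoothing by bin means and by bin boundaries; the data values and bin size are illustrative:

```python
# Sorted data, distributed into equal-frequency bins of three values each.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = [data[i:i + 3] for i in range(0, len(data), 3)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]
# -> [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]

# Smoothing by bin boundaries: every value snaps to the nearer of the
# bin's minimum and maximum.
by_bounds = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]
# -> [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```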

• Regression:

Data smoothing can also be done by regression, a technique that conforms data values to a function.

– Linear regression involves finding the “best” line to fit two attributes, so that one attribute can be used to predict the other.

– Multiple linear regression is an extension of linear regression involving more than two attributes.
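A minimal sketch of regression-based smoothing with NumPy; the two attributes and their values are made up:

```python
import numpy as np

# Two attributes: x is used to predict y; y carries some noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Find the "best" (least-squares) line y ~ slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Smooth: replace each y by the value the fitted line predicts.
y_smoothed = slope * x + intercept
```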

• Outlier analysis:

Outliers may be detected by clustering, where similar values are organized into groups or clusters; values that fall outside of the clusters may be considered outliers.
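One way to realize this, sketched with scikit-learn's k-means; the cluster count and the toy values are arbitrary choices:

```python
import numpy as np
from sklearn.cluster import KMeans

# Mostly similar values plus one value far away from the rest.
X = np.array([[1.0], [1.2], [0.9], [1.1], [10.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Values falling outside the main groups end up in tiny clusters of
# their own; flag those as candidate outliers.
sizes = np.bincount(km.labels_)
outliers = X[sizes[km.labels_] == 1]
print(outliers)  # [[10.]]
```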

• The first step in data cleaning as a process is discrepancy detection [detecting inconsistent data].

• The data should be examined regarding:

– Unique rule [each value of the attribute must be different from all other values of that attribute]

– Consecutive rule [no missing values between the lowest and highest values of the attribute]

– Null rule [specifies the use of blanks, question marks, or special characters to indicate a null condition]

• Use commercial tools

Data scrubbing tools: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections

Data auditing tools: analyze the data to discover rules and relationships, and to detect data that violate them (e.g., using correlation and clustering to find outliers)

• Data migration and integration

Data migration tools: allow transformations to be specified

ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface

• Data integration is the merging of data from multiple data stores.

• Careful integration can help avoid and reduce redundancies and inconsistencies in the resulting data set.

• Schema integration: [integrate metadata from different sources]

• Entity identification problem: [identify real-world entities from multiple data sources]

• Redundancy analysis: [an attribute may be redundant if it can be derived from another attribute; such redundancy can be detected by correlation analysis]
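A quick sketch of correlation-based redundancy detection; the two attributes and the 0.95 cut-off are invented for illustration:

```python
import numpy as np

fahrenheit = np.array([32.0, 50.0, 68.0, 86.0, 104.0])
celsius = (fahrenheit - 32.0) * 5.0 / 9.0  # derivable, hence redundant

# A correlation coefficient near +/-1 signals that one attribute can be
# derived from the other.
r = np.corrcoef(fahrenheit, celsius)[0, 1]
if abs(r) > 0.95:
    print(f"correlation {r:.2f}: one of the attributes is likely redundant")
```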

• Data reduction techniques are applied to obtain a reduced representation of the data set.

• Data reduction strategies include

– Dimensionality reduction :

Remove unimportant attributes

Its methods include wavelet transforms and principal components analysis (PCA), which transform the original data onto a smaller space (see the PCA sketch after this list).

– Numerosity reduction:

Replace the original data volume by alternative, smaller forms of data representation.

– Data compression:

Transformations are applied to obtain a reduced or “compressed” representation of the original data.
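A minimal PCA sketch with scikit-learn, projecting toy 3-attribute records onto a smaller 2-dimensional space; all values are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data set: 5 records, 3 attributes each.
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.9],
    [2.2, 2.9, 0.4],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 0.2],
])

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)      # same records, now 2 attributes each
print(X_reduced.shape)                # (5, 2)
print(pca.explained_variance_ratio_)  # variance retained per component
```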

• If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called “lossless”.

• If we can reconstruct only an approximation of the original data, the data reduction is called “lossy”.

• Dimensionality reduction and numerosity reduction techniques can also be considered forms of “data compression”.
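A tiny illustration of the lossless case using Python's zlib; the round trip reconstructs the original bytes exactly, whereas a lossy scheme would return only an approximation:

```python
import zlib

# Repetitive data compresses well.
original = b"2015-07-13,2015-07-13,2015-07-14," * 100

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))  # compressed is far smaller
assert restored == original            # lossless: no information lost
```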

[Figure: Data compression. Lossless: the original data can be fully reconstructed from the compressed data. Lossy: only an approximation of the original data can be reconstructed.]

• Data transformation routines convert the data into appropriate forms for mining.

• Strategies for data transformation include:

Smoothing: Remove noise from the data.

Attribute/feature construction: New attributes are constructed from the given ones to help the mining process.

Aggregation: Summarization, data cube construction (e.g., daily sales are aggregated to compute monthly or annual totals).

Normalization: Attribute data are scaled to fall within a smaller, specified range, e.g., min-max normalization to −1.0 to 1.0 or 0.0 to 1.0 (a minimal sketch follows this list).
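A minimal sketch of min-max normalization to a specified range; the values are made up:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    # Linearly scale so the smallest value maps to new_min and the
    # largest maps to new_max.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

print(min_max([200, 300, 400, 600, 1000]))
# [0.0, 0.125, 0.25, 0.5, 1.0]
```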

• Data discretization transforms numeric data by mapping values to interval or concept labels.

• Discretization and concept hierarchy generation can also be useful, where raw attribute values are replaced by ranges or by higher conceptual levels.

• For example, raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0-10, 11-20, etc.) or by higher-level concepts (e.g., youth, adult, senior).

• Three types of attributes

– Nominal: values from an unordered set, e.g., color, profession

– Ordinal: values from an ordered set, e.g., military or academic rank

– Numeric: real numbers, e.g., integer or real values

• Discretization:

Divide the range of a continuous attribute into intervals

– Interval labels can then be used to replace actual data values

– Reduce data size by discretization

– Supervised vs. unsupervised

– Split (top-down) vs. merge (bottom-up)

– Discretization can be performed recursively on an attribute

– Prepare for further analysis, e.g., classification
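A small sketch of unsupervised discretization of a numeric age attribute with pandas; the bin edges and labels are illustrative:

```python
import pandas as pd

ages = pd.Series([3, 9, 15, 22, 38, 45, 67, 80])

# Replace raw values with interval labels...
intervals = pd.cut(ages, bins=[0, 10, 20, 40, 60, 100])

# ...or with higher-level concept labels.
concepts = pd.cut(ages, bins=[0, 20, 60, 100],
                  labels=["youth", "adult", "senior"])
print(concepts.tolist())
# ['youth', 'youth', 'youth', 'adult', 'adult', 'adult', 'senior', 'senior']
```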

Although numerous methods of data preprocessing have been developed, data preprocessing remains an active area of research, due to the huge amount of inconsistent or dirty data and the complexity of the problem.