DATA MINING
UMADEVI.A (ASST.PROF),DEPT IT/CT, HICAS,CBE-28
ELECTIVE : DATA MINING
UNIT I: Basic Data Mining Tasks – Data Mining Versus Knowledge Discovery in Databases –
Data Mining Issues – Data Mining Metrics – Social Implications of Data Mining – Data
Mining from a Database Perspective.
UNIT II: Data Mining Techniques – A Statistical Perspective on Data Mining – Similarity
Measures – Decision Trees – Neural Networks – Genetic Algorithms.
UNIT III: Classification: Introduction – Statistical-Based Algorithms – Distance-Based
Algorithms – Decision Tree-Based Algorithms – Neural Network-Based Algorithms – Rule-
Based Algorithms – Combining Techniques.
UNIT IV: Clustering: Introduction – Similarity and Distance Measures – Outliers –
Hierarchical Algorithms – Partitional Algorithms.
UNIT V: Association Rules : Introduction - Large Item Sets – Basic Algorithms – Parallel &
Distributed Algorithms – Comparing Approaches – Incremental Rules – Advanced Association
Rules Techniques – Measuring the Quality of Rules.
TEXT BOOK: Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics",
Pearson Education, 2003.
REFERENCE BOOK: Jiawei Han & Micheline Kamber, "Data Mining: Concepts &
Techniques", Academic Press, 2001.
UNIT-I
What is Data Mining?
Data Mining is defined as extracting information from huge sets of data. In other words, data
mining is the procedure of mining knowledge from data. The information or knowledge
extracted in this way can be used for any of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
BASIC DATA MINING TASKS
Classification maps data into predefined groups or classes
Supervised learning
Pattern recognition
Prediction
Regression is used to map a data item to a real valued prediction variable.
Clustering groups similar data together into clusters.
Unsupervised learning
Segmentation
Partitioning
Summarization maps data into subsets with associated simple descriptions.
Characterization
Generalization
Link Analysis uncovers relationships among data.
Affinity Analysis
Association Rules
Sequential Analysis determines sequential patterns.
Ex: Time Series Analysis
Example: Stock Market
o Predict future values
o Determine similar patterns over time
o Classify behavior
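As an illustration of the first two tasks above, the following minimal sketch (not from the text; all values are invented) contrasts classification, which maps data into predefined classes, with clustering, which discovers the groups itself:

# Sketch: classification vs. clustering (all values invented).

# Classification: map each item into a PREDEFINED class using a fixed rule.
def classify_income(income):
    return "high" if income >= 50000 else "low"   # hypothetical rule

# Clustering: classes are NOT predefined; here, a tiny 1-D 2-means.
def cluster_two_means(values, iterations=10):
    c1, c2 = min(values), max(values)             # initial centroids
    for _ in range(iterations):                   # assumes both groups stay non-empty
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return g1, g2

print(classify_income(62000))                     # -> high
print(cluster_two_means([10, 12, 11, 50, 52]))    # -> ([10, 12, 11], [50, 52])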
DATA MINING VERSUS KNOWLEDGE DISCOVERY IN DATABASES
Knowledge Discovery in Databases (KDD): process of finding useful information
and patterns in data.
Data Mining: Use of algorithms to extract the information and patterns derived by
the KDD process.
KDD Process
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to user in meaningful manner.
KDD Process Ex: Web Log
Selection: Select log data (dates and locations) to use
Preprocessing:
Remove identifying URLs
Remove error logs
Transformation: Sessionize (sort and group)
Data Mining: Identify and count patterns
Construct data structure
Interpretation/Evaluation: Identify and display frequently accessed sequences.
Potential User Applications:
Cache prediction
Personalization
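The "sessionize" transformation step above can be sketched as follows; the log record format and the 30-minute idle gap are assumptions made for illustration:

# Sketch: sessionizing a web log. Assumed record format: (user, seconds, url);
# a new session starts when the same user is idle for more than 30 minutes.
def sessionize(log, gap=30 * 60):
    log = sorted(log)                      # sort by user, then by time
    sessions, current = [], []
    for user, t, url in log:
        if current and (user != current[-1][0] or t - current[-1][1] > gap):
            sessions.append(current)       # close the previous session
            current = []
        current.append((user, t, url))
    if current:
        sessions.append(current)
    return sessions

log = [("u1", 0, "/a"), ("u1", 60, "/b"), ("u1", 4000, "/c"), ("u2", 10, "/a")]
for s in sessionize(log):
    print([url for _, _, url in s])        # ['/a', '/b'], then ['/c'], then ['/a']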
Data Mining Development
Databases
Relational Data Model
SQL
Association Rule Algorithms
Data Warehousing
Scalability Techniques
Information Retrieval
Similarity Measures
Hierarchical Clustering
IR Systems
Imprecise Queries
Textual Data
Web Search Engines
Statistics
Machine Learning
Neural Networks
Decision Tree Algorithms
Algorithms
Algorithm Design Techniques
Algorithm Analysis
Data Structures
DATA MINING ISSUES
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application
DATA MINING METRICS
Usefulness
Return on Investment (ROI)
Accuracy
Space/Time
SOCIAL IMPLICATIONS OF DATA MINING
Privacy
Profiling
Unauthorized use
DATABASE PERSPECTIVE ON DATA MINING
Scalability
Real World Data
Updates
Ease of Use
Visualization Techniques
Graphical
Geometric
Icon-based
Pixel-based
Hierarchical
Hybrid
UNIT-II
DATA MINING TECHNIQUES
Introduction
There are two types of models:
1. Parametric Models: describe the relationship between input and output through the use of
algebraic equations in which some parameters are not specified.
2. Nonparametric Models: data-driven; no explicit equations are used to determine the
model.
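A minimal sketch of the distinction, with invented data: the parametric model fixes an equation form (a line, y = c0 + c1 x) and estimates its unspecified parameters, while the nonparametric model (here 1-nearest neighbour) predicts directly from the data with no equation:

# Sketch: parametric (least-squares line) vs. nonparametric (1-nearest neighbour).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]        # invented data, roughly y = 2x

# Parametric: assume y = c0 + c1*x and estimate the unspecified parameters.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
c1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
c0 = my - c1 * mx
print(round(c0 + c1 * 2.5, 4))   # -> 5.05, prediction from the fitted equation

# Nonparametric: no explicit equation; the answer comes straight from the data.
def nn_predict(x):
    return min(zip(xs, ys), key=lambda p: abs(p[0] - x))[1]
print(nn_predict(2.5))           # -> 4.0 (nearest x is 2.0; ties go to the first)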
A STATISTICAL PERSPECTIVE ON DATA MINING
1. Statistical
Point Estimation
Models Based on Summarization
Bayes Theorem
Hypothesis Testing
Regression and Correlation
2. Similarity Measures
3. Decision Trees
4. Neural Networks
Activation Functions
5. Genetic Algorithms
Point Estimation
Point Estimate: an estimate of a population parameter. It may be made by calculating the
parameter for a sample, and may be used to predict values for missing data.
Ex:
· R contains 100 employees
· 99 have salary information
· Mean salary of these is $50,000
· Use $50,000 as value of remaining employee’s salary.
Estimation Error
Bias: difference between the expected value and the actual value.
Mean Squared Error (MSE): expected value of the squared difference between the estimate and
the actual value: MSE = E[(θ̂ - θ)²], where θ̂ is the estimate and θ the actual parameter value.
Jackknife Estimate
Jackknife Estimate: estimate of parameter is obtained by omitting one value from the set of
observed values.
Ex: estimate of the mean for X = {x1, …, xn}: omitting xi gives the estimate
(x1 + … + x(i-1) + x(i+1) + … + xn) / (n - 1).
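A small sketch of the jackknife estimates of the mean (one estimate per omitted observation), with invented values:

# Sketch: jackknife estimates of the mean, omitting one observation at a time.
X = [10, 20, 30, 40, 50]         # invented sample

def jackknife_means(xs):
    n = len(xs)
    # The i-th estimate omits xi: the mean of the remaining n - 1 values.
    return [(sum(xs) - xs[i]) / (n - 1) for i in range(n)]

print(jackknife_means(X))        # -> [35.0, 32.5, 30.0, 27.5, 25.0]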
Maximum Likelihood Estimate (MLE)
Obtain parameter estimates that maximize the probability that the sample data occurs for the
specific model. The joint probability of observing the sample data is found by multiplying the
individual probabilities. Likelihood function:
L(Θ | x1, …, xn) = ∏ f(xi | Θ)
Maximize L.
MLE Example
Coin toss five times: {H, H, H, H, T}
Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:
L(0.5) = (0.5)^4 (0.5) = 0.03125
However, if the probability of an H is 0.8, then:
L(0.8) = (0.8)^4 (0.2) = 0.08192
General likelihood formula:
L(p | x1, …, x5) = ∏ p^xi (1 - p)^(1 - xi), where xi = 1 for H and xi = 0 for T.
The estimate for p is then 4/5 = 0.8.
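The likelihood values above can be checked with a short sketch; a grid search confirms that p = 0.8 maximizes L:

# Sketch: likelihood of the sequence {H,H,H,H,T} as a function of p = P(H).
def likelihood(p, heads=4, tails=1):
    return (p ** heads) * ((1 - p) ** tails)

print(round(likelihood(0.5), 5))   # -> 0.03125 (fair coin)
print(round(likelihood(0.8), 5))   # -> 0.08192 (the MLE, p = 4/5)

# Crude check that p = 0.8 maximizes L over a grid of candidate values.
best = max((likelihood(i / 100), i / 100) for i in range(101))
print(best[1])                     # -> 0.8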
Expectation-Maximization (EM)
Solves estimation with incomplete data: obtain initial estimates for the parameters, then
iteratively use the estimates for the missing data and continue until convergence.
[Figures: EM example and EM algorithm]
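A minimal sketch of the EM idea for incomplete data, under the assumption (invented for illustration) that we estimate a mean when two of six values are missing: fill the missing values with the current estimate, re-estimate, and repeat until convergence:

# Sketch: EM-style estimation of a mean with missing data (None = missing).
data = [1.0, 5.0, 10.0, 4.0, None, None]    # invented, two values unobserved

mu = 0.0                                    # initial estimate
for _ in range(30):
    # E-step: replace each missing value with the current estimate mu.
    filled = [x if x is not None else mu for x in data]
    # M-step: re-estimate mu from the completed data.
    new_mu = sum(filled) / len(filled)
    if abs(new_mu - mu) < 1e-9:             # converged
        break
    mu = new_mu

print(round(mu, 4))   # -> 5.0, the mean of the observed values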
Models Based on Summarization
Visualization: frequency distribution, mean, variance, median, mode, etc.
[Figures: box plot and scatter diagram]
Bayes Theorem
Posterior Probability: P(h1 | xi)
Prior Probability: P(h1)
Bayes Theorem: P(h1 | xi) = P(xi | h1) P(h1) / P(xi)
Assigns probabilities of hypotheses given a data value.
Bayes Theorem Example
Credit authorizations (hypotheses): h1=authorize purchase, h2 = authorize after further
identification, h3=do not authorize, h4= do not authorize but contact police
Assign twelve data values for all combinations of credit and income:
From training data: P(h1) = 60%; P(h2)=20%; P(h3)=10%; P(h4)=10%.
Training Data:
ID Income Credit Class xi
1 4 Excellent h1 x4
2 3 Good h1 x7
3 2 Excellent h1 x2
4 3 Good h1 x7
5 4 Good h1 x8
6 2 Excellent h1 x2
7 3 Bad h2 x11
8 2 Bad h2 x10
9 3 Bad h3 x11
10 1 Bad h4 x9
Calculate P(xi|hj) and P(xi)
Ex: P(x7|h1)=2/6; P(x4|h1)=1/6; P(x2|h1)=2/6; P(x8|h1)=1/6; P(xi|h1)=0 for all other xi.
Predict the class for x4:
Calculate P(hj|x4) for all hj.
Place x4 in class with largest value.
Ex:
P(h1|x4) = P(x4|h1) P(h1) / P(x4)
= (1/6)(0.6) / 0.1 = 1
x4 in class h1.
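The posterior above can be reproduced from the counts in the training data:

# Sketch: P(h1|x4) for the credit example, using counts from the training data.
p_h1 = 0.6              # prior P(h1), from the text
p_x4_given_h1 = 1 / 6   # one of the six h1 tuples has value x4
p_x4 = 1 / 10           # x4 occurs once among the ten training tuples

posterior = p_x4_given_h1 * p_h1 / p_x4
print(round(posterior, 4))   # -> 1.0, so x4 is placed in class h1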
Hypothesis Testing
Find model to explain behavior by creating and then testing a hypothesis about the data.
Exact opposite of usual DM approach.
H0 – Null hypothesis; Hypothesis to be tested.
H1 – Alternative hypothesis
Chi Squared Statistic: χ² = Σ (O - E)² / E
O – observed value
E – expected value based on the hypothesis
Ex:
O = {50, 93, 67, 78, 87}
E = 75
χ² = 15.55, and therefore the deviation is significant.
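The value 15.55 follows directly from the definition:

# Sketch: chi-squared statistic, the sum of (O - E)^2 / E.
O = [50, 93, 67, 78, 87]
E = 75
chi2 = sum((o - E) ** 2 / E for o in O)
print(round(chi2, 2))   # -> 15.55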
Regression
Predict future values based on past values. Linear regression assumes that a linear relationship
exists.
y = c0 + c1 x1 + … + cn xn
Find the coefficient values that best fit the data.
Correlation
Examine the degree to which the values for two variables behave similarly.
Correlation coefficient r:
r = Σ (xi - mx)(yi - my) / sqrt( Σ (xi - mx)² Σ (yi - my)² ), where mx and my are the sample means.
1 = perfect correlation
-1 = perfect but opposite correlation
0 = no correlation
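A sketch computing both the least-squares line and r for invented data:

# Sketch: simple linear regression (y = c0 + c1*x) and correlation coefficient r.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]         # invented data

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

c1 = sxy / sxx               # slope
c0 = my - c1 * mx            # intercept
r = sxy / (sxx * syy) ** 0.5

print(round(c0, 4), round(c1, 4))   # -> 2.2 0.6
print(round(r, 3))                  # -> 0.775, a fairly strong positive correlation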
Similarity Measures
Determine similarity between two objects.
Similarity characteristics: sim(ti, ti) = 1, and sim(ti, tj) = 0 when the two objects are completely
dissimilar (the standard measures, such as Dice, Jaccard, cosine, and overlap, are given as
formulas in the text).
Alternatively, distance measures measure how unlike or dissimilar objects are.
Distance Measures
Measure dissimilarity between objects.
Ex: the Twenty Questions game
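A sketch of three widely used measures; the specific choice of Jaccard, cosine, and Euclidean here is an assumption, since the text gives its measures as figures:

# Sketch: two similarity measures and one distance measure.
def jaccard(a, b):               # similarity of two SETS, in [0, 1]
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(u, v):                # similarity of two VECTORS
    dot = sum(x * y for x, y in zip(u, v))
    norm = (sum(x * x for x in u) * sum(y * y for y in v)) ** 0.5
    return dot / norm

def euclidean(u, v):             # distance: 0 means identical objects
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

print(round(jaccard({"Bread", "Milk"}, {"Bread", "Beer"}), 3))   # -> 0.333
print(round(cosine([1, 0, 1], [1, 1, 1]), 3))                    # -> 0.816
print(euclidean([0, 0], [3, 4]))                                 # -> 5.0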
Decision Trees
Decision Tree (DT): Tree where the root and each internal node is labeled with a question.
The arcs represent each possible answer to the associated question.
Each leaf node represents a prediction of a solution to the problem.
Popular technique for classification; Leaf node indicates class to which the
corresponding tuple belongs.
A Decision Tree Model is a computational model consisting of three parts:
Decision Tree
Algorithm to create the tree
Algorithm that applies the tree to data
Creation of the tree is the most difficult part.
Processing is basically a search similar to that in a binary search tree (although DT may
not be binary).
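A minimal sketch of the third part of the model, the algorithm that applies the tree to data; the tree itself is hand-built and hypothetical:

# Sketch: applying a decision tree to a tuple.
# Internal node = (question, yes-subtree, no-subtree); a leaf is a class label.
tree = (lambda t: t["income"] >= 40000,
        (lambda t: t["credit"] == "good", "authorize", "further id"),
        "do not authorize")

def apply_tree(node, t):
    if isinstance(node, str):            # leaf: the predicted class
        return node
    question, yes, no = node
    return apply_tree(yes if question(t) else no, t)

print(apply_tree(tree, {"income": 50000, "credit": "good"}))  # -> authorize
print(apply_tree(tree, {"income": 30000, "credit": "bad"}))   # -> do not authorize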
Advantages/Disadvantages
Advantages:
Easy to understand.
Easy to generate rules
Disadvantages:
May suffer from overfitting.
Classifies by rectangular partitioning.
Does not easily handle nonnumeric data.
Can be quite large – pruning is necessary.
NEURAL NETWORKS
Based on the observed functioning of the human brain (Artificial Neural Networks, ANN). Our
view of neural networks is very simplistic: we view a neural network (NN) from a graphical
viewpoint, although a NN may also be viewed from the perspective of matrices. Used in pattern
recognition, speech recognition, computer vision, and classification.
Neural Network (NN): a directed graph F = <V, A> with vertices V = {1, 2, …, n} and arcs
A = {<i, j> | 1 <= i, j <= n}, with the following restrictions:
1. V is partitioned into a set of input nodes, VI, hidden nodes, VH, and output
nodes, VO.
2. The vertices are also partitioned into layers
3. Any arc <i,j> must have node i in layer h-1 and node j in layer h.
4. Arc <i,j> is labeled with a numeric value wij.
5. Node i is labeled with a function fi.
[Figures: an example NN and a single NN node]
NN Activation Functions
Functions associated with nodes in graph. Output may be in range [-1,1] or [0,1]
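A sketch of a single node: a weighted sum of the inputs passed through an activation function; sigmoid output lies in (0, 1) and tanh output in (-1, 1). The weights are invented:

import math

# Sketch: one NN node = weighted sum of inputs + activation function.
def sigmoid(s):                          # output in (0, 1)
    return 1.0 / (1.0 + math.exp(-s))

def node_output(inputs, weights, activation=sigmoid):
    s = sum(x * w for x, w in zip(inputs, weights))   # weighted sum
    return activation(s)

x = [0.5, 0.9]                           # input values
w = [0.3, 0.7]                           # arc weights wij (invented)
print(round(node_output(x, w), 4))             # sigmoid output, in (0, 1)
print(round(node_output(x, w, math.tanh), 4))  # tanh output, in (-1, 1)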
NN Learning
Propagate input values through the graph. Compare the output to the desired output. Adjust the
weights in the graph accordingly. A Neural Network Model is a computational model consisting
of three parts:
· Neural Network graph
· Learning algorithm that indicates how learning takes place.
· Recall techniques that determine how information is obtained from the network.
Advantages
Learning
Can continue learning even after training set has been applied.
Easy parallelization
Solves many problems
Disadvantages
Difficult to understand
May suffer from overfitting
Structure of graph must be determined a priori.
Input values must be numeric.
Verification difficult.
GENETIC ALGORITHMS
Optimization search type algorithms. Creates an initial feasible solution and iteratively creates
new "better" solutions. Based on human evolution and survival of the fittest. A solution must be
represented as an individual.
Individual: a string I = I1, I2, …, In where each Ij is from a given alphabet A. Each character Ij is
called a gene.
Population: set of individuals.
A Genetic Algorithm (GA) is a computational model consisting of five parts:
1. A starting set of individuals, P.
2. Crossover: technique to combine two parents to create offspring.
3. Mutation: randomly change an individual.
4. Fitness: determine the best individuals.
5. Algorithm which applies the crossover and mutation techniques to P iteratively,
using the fitness function to determine the best individuals in P to keep.
[Figure: crossover examples]
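A minimal sketch of the five parts for a toy problem (maximize the number of 1s in a bit string); the representation, fitness function, and rates are invented for illustration:

import random

# Sketch: GA maximizing the number of 1s in a bit string.
random.seed(1)
N, LEN = 10, 8                                   # population size, genes per individual

def fitness(ind):                                # part 4: fittest = most 1s
    return sum(ind)

def crossover(p1, p2):                           # part 2: one-point crossover
    k = random.randrange(1, LEN)
    return p1[:k] + p2[k:]

def mutate(ind, rate=0.1):                       # part 3: flip genes at random
    return [1 - g if random.random() < rate else g for g in ind]

P = [[random.randint(0, 1) for _ in range(LEN)] for _ in range(N)]   # part 1
for _ in range(30):                              # part 5: iterate, keep the fittest
    P.sort(key=fitness, reverse=True)
    parents = P[: N // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(N - len(parents))]
    P = parents + children

print(max(P, key=fitness))   # typically converges to [1, 1, 1, 1, 1, 1, 1, 1]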
Advantages
Easily parallelized
Disadvantages
Difficult to understand and explain to end users.
Abstraction of the problem and method to represent individuals is quite difficult.
Determining fitness function is difficult.
Determining how to perform crossover and mutation is difficult.
UNIT-V
ASSOCIATION RULES
INTRODUCTION
Associations and Item-sets: An association is a rule of the form: if X then Y. It is denoted as
X → Y.
Example:
If India wins in cricket, sales of sweets go up.
For any rule, if X → Y and Y → X, then X and Y are called an "interesting item-set".
Example:
People buying school uniforms in June also buy school bags
(and people buying school bags in June also buy school uniforms).
Support and Confidence: The support for a rule R is the ratio of the number of occurrences of R,
given all occurrences of all rules. The confidence of a rule X → Y is the ratio of the number of
occurrences of Y given X, among all other occurrences given X. For example, in a ten-transaction
database where eight transactions contain Bag and five contain both Bag and Uniform:
Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
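These figures can be reproduced over a hypothetical transaction database chosen to match the counts:

# Sketch: support and confidence from transactions (hypothetical data chosen
# to match the example: 5 of 10 contain both items, 8 contain Bag).
transactions = [{"bag", "uniform"}] * 5 + [{"bag"}] * 3 + [{"uniform"}] * 2

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):                  # confidence of the rule x -> y
    return support(x | y) / support(x)

print(support({"bag", "uniform"}))                # -> 0.5
print(round(confidence({"bag"}, {"uniform"}), 4)) # -> 0.625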
LARGE ITEM SETS
Set of items: I = {I1, I2, …, Im}
Transactions: D = {t1, t2, …, tn}, tj ⊆ I
Itemset: {Ii1, Ii2, …, Iik} ⊆ I
Support of an itemset: percentage of transactions which contain that itemset.
Large (frequent) itemset: itemset whose number of occurrences is above a threshold.
I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%
Association Rule (AR): implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
Support of AR (s) X ⇒ Y: percentage of transactions that contain X ∪ Y
Confidence of AR (α) X ⇒ Y: ratio of the number of transactions that contain X ∪ Y to the
number that contain X
Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where
ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules
X ⇒ Y with a minimum support and confidence.
BASIC ALGORITHMS
APRIORI ALGORITHM
Large Itemset Property:
Any subset of a large itemset is large.
Contrapositive:
If an itemset is not large, none of its supersets are large.
[Figure: the large itemset property]
Apriori Algorithm
C1 = itemsets of size one in I;
Determine all large itemsets of size 1, L1;
i = 1;
repeat
    i = i + 1;
    Ci = Apriori-Gen(Li-1);
    Count Ci to determine Li;
until no more large itemsets found;
Apriori-Gen
Generate candidates of size i+1 from large itemsets of size i.
Approach used: join large itemsets of size i if they agree on their first i-1 items.
Candidates that have a subset which is not large may also be pruned.
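A compact sketch of Apriori together with Apriori-Gen (join itemsets that agree on all but their last item, then prune); the transactions and the 50% threshold are invented:

from itertools import combinations

# Sketch: Apriori. Itemsets are sorted tuples; the data and minsup are invented.
transactions = [{"Bread", "Jelly", "PeanutButter"},
                {"Bread", "PeanutButter"},
                {"Bread", "Milk", "PeanutButter"},
                {"Beer", "Bread"},
                {"Beer", "Milk"}]
minsup = 0.5

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def apriori_gen(L):      # join pairs agreeing on all but the last item, then prune
    C = set()
    for a in L:
        for b in L:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                if all(s in L for s in combinations(cand, len(cand) - 1)):
                    C.add(cand)
    return C

items = sorted({i for t in transactions for i in t})
L = {(i,) for i in items if support((i,)) >= minsup}       # large 1-itemsets
large = set(L)
while L:
    L = {c for c in apriori_gen(L) if support(c) >= minsup}
    large |= L
print(sorted(large))     # -> Bread, PeanutButter, and {Bread, PeanutButter}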
Advantages:
Uses large itemset property.
Easily parallelized
Easy to implement.
Disadvantages:
Assumes transaction database is memory resident.
Requires up to m database scans.
Sampling
Used for large databases: sample the database and apply Apriori to the sample.
Potentially Large Itemsets (PL): large itemsets from the sample.
Negative Border (BD−): the minimal set of itemsets which are not in PL, but all of whose subsets
are in PL. It is obtained by a generalization of Apriori-Gen applied to itemsets of varying sizes.
Sampling Algorithm
Ds = sample of database D;
PL = large itemsets in Ds using smalls (a support threshold smaller than s);
C = PL ∪ BD−(PL);
Count C in database D using s;
ML = large itemsets in BD−(PL);
if ML = ∅ then done
else C = repeated application of BD−;
Count C in database D;
Example:
Find AR assuming s = 20%
Ds = {t1, t2}
smalls = 10%
PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter},
{Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
BD−(PL) = {{Beer}, {Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD− generates all remaining itemsets.
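The negative border in the example can be computed with a short sketch (PL is taken from the example above):

from itertools import combinations

# Sketch: BD-(PL) = minimal itemsets NOT in PL all of whose proper subsets are in PL.
def negative_border(PL, items):
    max_size = max(len(s) for s in PL)
    border = set()
    for k in range(1, max_size + 2):
        for cand in map(frozenset, combinations(items, k)):
            if cand not in PL and all(frozenset(s) in PL
                                      for s in combinations(cand, k - 1) if s):
                border.add(cand)
    return border

PL = {frozenset(s) for s in ({"Bread"}, {"Jelly"}, {"PeanutButter"},
                             {"Bread", "Jelly"}, {"Bread", "PeanutButter"},
                             {"Jelly", "PeanutButter"},
                             {"Bread", "Jelly", "PeanutButter"})}
items = ["Beer", "Bread", "Jelly", "Milk", "PeanutButter"]
print(negative_border(PL, items))   # -> {Beer} and {Milk}, as in the example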
Advantages:
Reduces the number of database scans to one in the best case and two in the worst.
Scales better.
Disadvantages:
Potentially large number of candidates in the second pass.
Partitioning
Divide database into partitions D1,D2,…,Dp . Apply Apriori to each partition. Any large
itemset must be large in at least one partition.
Algorithm
Divide D into partitions D1, D2, …, Dp;
for i = 1 to p do
    Li = Apriori(Di);
C = L1 ∪ … ∪ Lp;
Count C on D to generate L;
[Figure: partitioning example]
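A self-contained sketch of the partitioning idea with two invented partitions; the brute-force helper enumerates all itemsets and is only suitable for tiny examples:

from itertools import combinations

# Sketch: partitioning. An itemset large in D must be large in at least one
# partition, so union the locally large itemsets and recount them globally.
def large_itemsets(transactions, minsup):   # brute force, tiny data only
    items = sorted({i for t in transactions for i in t})
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sum(set(c) <= t for t in transactions) / len(transactions) >= minsup}

D1 = [{"Bread", "Milk"}, {"Bread"}]                      # invented partitions
D2 = [{"Beer"}, {"Beer", "Bread"}]
D = D1 + D2
C = large_itemsets(D1, 0.5) | large_itemsets(D2, 0.5)    # local large itemsets
L = {c for c in C if sum(c <= t for t in D) / len(D) >= 0.5}
print(L)   # -> {Bread} and {Beer}: globally large, found with one extra scan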
Advantages:
Adapts to available main memory
Easily parallelized
Maximum number of database scans is two.
Disadvantages:
May have many candidates during second scan.
PARALLEL AND DISTRIBUTED ALGORITHMS
Parallelizing AR Algorithms
Based on Apriori
Techniques differ in:
What is counted at each site
How data (transactions) are distributed
Data Parallelism
Data partitioned
Count Distribution Algorithm
Task Parallelism
Data and candidates partitioned
Data Distribution Algorithm
Count Distribution Algorithm (CDA)
Place a data partition at each site.
In parallel at each site do:
    C1 = itemsets of size one in I;
    Count C1;
    Broadcast counts to all sites;
    Determine global large itemsets of size 1, L1;
    i = 1;
    repeat
        i = i + 1;
        Ci = Apriori-Gen(Li-1);
        Count Ci;
        Broadcast counts to all sites;
        Determine global large itemsets of size i, Li;
    until no more large itemsets found;
[Figure: CDA example]
Data Distribution Algorithm (DDA)
Place a data partition at each site.
In parallel at each site do:
    Determine local candidates of size 1 to count;
    Broadcast local transactions to other sites;
    Count local candidates of size 1 on all data;
    Determine large itemsets of size 1 for local candidates;
    Broadcast large itemsets to all sites;
    Determine L1;
    i = 1;
    repeat
        i = i + 1;
        Ci = Apriori-Gen(Li-1);
        Determine local candidates of size i to count;
        Count, broadcast, and find Li;
    until no more large itemsets found;
COMPARING APPROACHES
Comparing AR Techniques:
Target
Type
Data Type
Data Source
Technique
Itemset Strategy and Data Structure
Transaction Strategy and Data Structure
Optimization
Architecture
Parallelism Strategy
INCREMENTAL ASSOCIATION RULES
Generate association rules in a dynamic database.
Problem: the algorithms above assume a static database.
Objective:
Know the large itemsets for D
Find the large itemsets for D ∪ ΔD, where ΔD is a set of newly added transactions
An itemset must be large in either D or ΔD to be large in D ∪ ΔD
Save the Li and their counts
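A small sketch of the update step, with invented counts: the counts saved for D are combined with counts over ΔD, so only itemsets newly large in ΔD force a rescan of D:

# Sketch: incremental update of large itemsets. Counts over D were saved.
saved = {"Bread": 60, "Milk": 45}     # counts over D, which had n = 100 (invented)
n_D, minsup = 100, 0.45
dD = [{"Bread", "Milk"}, {"Bread"}, {"Bread", "Beer"}]   # the increment

n = n_D + len(dD)
for item, old in saved.items():
    new = old + sum(item in t for t in dD)
    print(item, "large" if new / n >= minsup else "no longer large")
# -> Bread large; Milk no longer large. An item such as Beer, large only in
#    the increment, has no saved count, so one rescan of D is needed for it.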
Many applications outside market basket data analysis
Prediction (telecom switch failure)
Web usage mining
Many different types of association rules
Temporal
Spatial
Causal
ADVANCED ASSOCIATION RULE TECHNIQUES
Generalized Association Rules
Multiple-Level Association Rules
Quantitative Association Rules
Using multiple minimum supports
Correlation Rules
MEASURING QUALITY OF RULES
Support
Confidence
Interest
Conviction
Chi Squared Test
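Several of these measures can be computed from supports alone; a sketch with invented supports, where interest (lift) greater than 1 and conviction greater than 1 both indicate a rule better than chance:

# Sketch: rule-quality measures for X -> Y, from supports (invented values).
s_x, s_y, s_xy = 0.4, 0.5, 0.3    # support(X), support(Y), support(X and Y)

confidence = s_xy / s_x
interest = s_xy / (s_x * s_y)                # lift: 1 means X and Y independent
conviction = (1 - s_y) / (1 - confidence)    # undefined when confidence = 1

print(round(confidence, 4))   # -> 0.75
print(round(interest, 4))     # -> 1.5 (positively associated)
print(round(conviction, 4))   # -> 2.0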