DATA MINING
UMADEVI.A (ASST.PROF),DEPT IT/CT, HICAS,CBE-28
ELECTIVE : DATA MINING
UNIT I: Basic Data Mining Tasks – Data Mining Versus Knowledge Discovery in Databases –
Data Mining Issues – Data Mining Metrics – Social Implications of Data Mining – Data
Mining from a Database Perspective.
UNIT II: Data Mining Techniques – A Statistical Perspective on Data Mining – Similarity
Measures – Decision Trees – Neural Networks – Genetic Algorithms.
UNIT III: Classification: Introduction – Statistical-Based Algorithms – Distance-Based
Algorithms – Decision Tree-Based Algorithms – Neural Network-Based Algorithms – Rule-
Based Algorithms – Combining Techniques.
UNIT IV: Clustering: Introduction – Similarity and Distance Measures – Outliers –
Hierarchical Algorithms – Partitional Algorithms.
UNIT V: Association Rules : Introduction - Large Item Sets – Basic Algorithms – Parallel &
Distributed Algorithms – Comparing Approaches – Incremental Rules – Advanced Association
Rules Techniques – Measuring the Quality of Rules.
TEXT BOOK: Margaret H. Dunham, "Data Mining: Introductory and Advanced Topics",
Pearson Education, 2003.
REFERENCE BOOK: Jiawei Han & Micheline Kamber, "Data Mining: Concepts &
Techniques", Academic Press, 2001.
UNIT-I
What is Data Mining?
Data Mining is defined as extracting information from huge sets of data. In other words, data
mining is the procedure of mining knowledge from data. The information or knowledge
extracted in this way can be used for any of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
BASIC DATA MINING TASKS
Classification maps data into predefined groups or classes
Supervised learning
Pattern recognition
Prediction
Regression is used to map a data item to a real valued prediction variable.
Clustering groups similar data together into clusters.
Unsupervised learning
Segmentation
Partitioning
Summarization maps data into subsets with associated simple descriptions.
Characterization
Generalization
Link Analysis uncovers relationships among data.
Affinity Analysis
Association Rules
Sequential Analysis determines sequential patterns.
Ex: Time Series Analysis
Example: Stock Market
o Predict future values
o Determine similar patterns over time
o Classify behavior
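As an illustration of the first two tasks above, the following minimal sketch (not from the text; all values are invented) contrasts classification, which maps data into predefined classes, with clustering, which discovers the groups itself:

# Sketch: classification vs. clustering (all values invented).

# Classification: map each item into a PREDEFINED class using a fixed rule.
def classify_income(income):
    return "high" if income >= 50000 else "low"   # hypothetical rule

# Clustering: classes are NOT predefined; here, a tiny 1-D 2-means.
def cluster_two_means(values, iterations=10):
    c1, c2 = min(values), max(values)             # initial centroids
    for _ in range(iterations):                   # assumes both groups stay non-empty
        g1 = [v for v in values if abs(v - c1) <= abs(v - c2)]
        g2 = [v for v in values if abs(v - c1) > abs(v - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return g1, g2

print(classify_income(62000))                     # -> high
print(cluster_two_means([10, 12, 11, 50, 52]))    # -> ([10, 12, 11], [50, 52])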
DATA MINING VERSUS KNOWLEDGE DISCOVERY IN DATABASES
Knowledge Discovery in Databases (KDD): process of finding useful information
and patterns in data.
Data Mining: Use of algorithms to extract the information and patterns derived by
the KDD process.
KDD Process
Selection: Obtain data from various sources.
Preprocessing: Cleanse data.
Transformation: Convert to common format. Transform to new format.
Data Mining: Obtain desired results.
Interpretation/Evaluation: Present results to user in meaningful manner.
KDD Process Ex: Web Log
Selection: Select log data (dates and locations) to use
Preprocessing:
Remove identifying URLs
Remove error logs
Transformation: Sessionize (sort and group)
Data Mining: Identify and count patterns
Construct data structure
Interpretation/Evaluation: Identify and display frequently accessed sequences.
Potential User Applications:
Cache prediction
Personalization
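The "sessionize" transformation step above can be sketched as follows; the log record format and the 30-minute idle gap are assumptions made for illustration:

# Sketch: sessionizing a web log. Assumed record format: (user, seconds, url);
# a new session starts when the same user is idle for more than 30 minutes.
def sessionize(log, gap=30 * 60):
    log = sorted(log)                      # sort by user, then by time
    sessions, current = [], []
    for user, t, url in log:
        if current and (user != current[-1][0] or t - current[-1][1] > gap):
            sessions.append(current)       # close the previous session
            current = []
        current.append((user, t, url))
    if current:
        sessions.append(current)
    return sessions

log = [("u1", 0, "/a"), ("u1", 60, "/b"), ("u1", 4000, "/c"), ("u2", 10, "/a")]
for s in sessionize(log):
    print([url for _, _, url in s])        # ['/a', '/b'], then ['/c'], then ['/a']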
Data Mining Development
Databases
Relational Data Model
SQL
Association Rule Algorithms
Data Warehousing
Scalability Techniques
Information Retrieval
Similarity Measures
Hierarchical Clustering
IR Systems
Imprecise Queries
Textual Data
Web Search Engines
Statistics
Machine Learning
Neural Networks
Decision Tree Algorithms
Algorithms
Algorithm Design Techniques
Algorithm Analysis
Data Structures
DATA MINING ISSUES
Human Interaction
Overfitting
Outliers
Interpretation
Visualization
Large Datasets
High Dimensionality
Multimedia Data
Missing Data
Irrelevant Data
Noisy Data
Changing Data
Integration
Application
DATA MINING METRICS
Usefulness
Return on Investment (ROI)
Accuracy
Space/Time
SOCIAL IMPLICATIONS OF DATA MINING
Privacy
Profiling
Unauthorized use
DATABASE PERSPECTIVE ON DATA MINING
Scalability
Real World Data
Updates
Ease of Use
Visualization Techniques
Graphical
Geometric
Icon-based
Pixel-based
Hierarchical
Hybrid
UNIT-II
DATA MINING TECHNIQUES
Introduction
There are two types of models:
1. Parametric Models: describe the relationship between input and output through the use of
algebraic equations in which some parameters are not specified.
2. Nonparametric Models: data-driven; no explicit equations are used to determine the
model.
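A minimal sketch of the distinction, with invented data: the parametric model fixes an equation form (a line, y = c0 + c1 x) and estimates its unspecified parameters, while the nonparametric model (here 1-nearest neighbour) predicts directly from the data with no equation:

# Sketch: parametric (least-squares line) vs. nonparametric (1-nearest neighbour).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]        # invented data, roughly y = 2x

# Parametric: assume y = c0 + c1*x and estimate the unspecified parameters.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
c1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
c0 = my - c1 * mx
print(round(c0 + c1 * 2.5, 4))   # -> 5.05, prediction from the fitted equation

# Nonparametric: no explicit equation; the answer comes straight from the data.
def nn_predict(x):
    return min(zip(xs, ys), key=lambda p: abs(p[0] - x))[1]
print(nn_predict(2.5))           # -> 4.0 (nearest x is 2.0; ties go to the first)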
A STATISTICAL PERSPECTIVE ON DATA MINING
1. Statistical
Point Estimation
Models Based on Summarization
Bayes Theorem
Hypothesis Testing
Regression and Correlation
2. Similarity Measures
3. Decision Trees
4. Neural Networks
Activation Functions
5. Genetic Algorithms
Point Estimation
Point Estimate: an estimate of a population parameter. It may be made by calculating the
parameter for a sample, and may be used to predict values for missing data.
Ex:
· R contains 100 employees
· 99 have salary information
· Mean salary of these is $50,000
· Use $50,000 as value of remaining employee’s salary.
Estimation Error
Bias: difference between the expected value and the actual value.
Mean Squared Error (MSE): expected value of the squared difference between the estimate and
the actual value: MSE = E[(θ̂ - θ)²], where θ̂ is the estimate and θ the actual parameter value.
Jackknife Estimate
Jackknife Estimate: estimate of parameter is obtained by omitting one value from the set of
observed values.
Ex: estimate of the mean for X = {x1, …, xn}: omitting xi gives the estimate
(x1 + … + x(i-1) + x(i+1) + … + xn) / (n - 1).
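A small sketch of the jackknife estimates of the mean (one estimate per omitted observation), with invented values:

# Sketch: jackknife estimates of the mean, omitting one observation at a time.
X = [10, 20, 30, 40, 50]         # invented sample

def jackknife_means(xs):
    n = len(xs)
    # The i-th estimate omits xi: the mean of the remaining n - 1 values.
    return [(sum(xs) - xs[i]) / (n - 1) for i in range(n)]

print(jackknife_means(X))        # -> [35.0, 32.5, 30.0, 27.5, 25.0]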
Maximum Likelihood Estimate (MLE)
Obtain parameter estimates that maximize the probability that the sample data occurs for the
specific model. The joint probability of observing the sample data is found by multiplying the
individual probabilities. Likelihood function:
L(Θ | x1, …, xn) = ∏ f(xi | Θ)
Maximize L.
MLE Example
Coin toss five times: {H, H, H, H, T}
Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:
L(0.5) = (0.5)^4 (0.5) = 0.03125
However, if the probability of an H is 0.8, then:
L(0.8) = (0.8)^4 (0.2) = 0.08192
General likelihood formula:
L(p | x1, …, x5) = ∏ p^xi (1 - p)^(1 - xi), where xi = 1 for H and xi = 0 for T.
The estimate for p is then 4/5 = 0.8.
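The likelihood values above can be checked with a short sketch; a grid search confirms that p = 0.8 maximizes L:

# Sketch: likelihood of the sequence {H,H,H,H,T} as a function of p = P(H).
def likelihood(p, heads=4, tails=1):
    return (p ** heads) * ((1 - p) ** tails)

print(round(likelihood(0.5), 5))   # -> 0.03125 (fair coin)
print(round(likelihood(0.8), 5))   # -> 0.08192 (the MLE, p = 4/5)

# Crude check that p = 0.8 maximizes L over a grid of candidate values.
best = max((likelihood(i / 100), i / 100) for i in range(101))
print(best[1])                     # -> 0.8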
Expectation-Maximization (EM)
Solves estimation with incomplete data: obtain initial estimates for the parameters, then
iteratively use the estimates for the missing data and continue until convergence.
[Figures: EM example and EM algorithm]
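A minimal sketch of the EM idea for incomplete data, under the assumption (invented for illustration) that we estimate a mean when two of six values are missing: fill the missing values with the current estimate, re-estimate, and repeat until convergence:

# Sketch: EM-style estimation of a mean with missing data (None = missing).
data = [1.0, 5.0, 10.0, 4.0, None, None]    # invented, two values unobserved

mu = 0.0                                    # initial estimate
for _ in range(30):
    # E-step: replace each missing value with the current estimate mu.
    filled = [x if x is not None else mu for x in data]
    # M-step: re-estimate mu from the completed data.
    new_mu = sum(filled) / len(filled)
    if abs(new_mu - mu) < 1e-9:             # converged
        break
    mu = new_mu

print(round(mu, 4))   # -> 5.0, the mean of the observed values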
Models Based on Summarization
Visualization: frequency distribution, mean, variance, median, mode, etc.
[Figures: box plot and scatter diagram]
Bayes Theorem
Posterior Probability: P(h1 | xi)
Prior Probability: P(h1)
Bayes Theorem: P(h1 | xi) = P(xi | h1) P(h1) / P(xi)
Assigns probabilities of hypotheses given a data value.
Bayes Theorem Example
Credit authorizations (hypotheses): h1=authorize purchase, h2 = authorize after further
identification, h3=do not authorize, h4= do not authorize but contact police
Assign twelve data values for all combinations of credit and income:
From training data: P(h1) = 60%; P(h2)=20%; P(h3)=10%; P(h4)=10%.
Training Data:
ID Income Credit Class xi
1 4 Excellent h1 x4
2 3 Good h1 x7
3 2 Excellent h1 x2
4 3 Good h1 x7
5 4 Good h1 x8
6 2 Excellent h1 x2
7 3 Bad h2 x11
8 2 Bad h2 x10
9 3 Bad h3 x11
10 1 Bad h4 x9
Calculate P(xi|hj) and P(xi)
Ex: P(x7|h1)=2/6; P(x4|h1)=1/6; P(x2|h1)=2/6; P(x8|h1)=1/6; P(xi|h1)=0 for all other xi.
Predict the class for x4:
Calculate P(hj|x4) for all hj.
Place x4 in class with largest value.
Ex:
P(h1|x4) = P(x4|h1) P(h1) / P(x4)
= (1/6)(0.6) / 0.1 = 1
x4 in class h1.
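The posterior above can be reproduced from the counts in the training data:

# Sketch: P(h1|x4) for the credit example, using counts from the training data.
p_h1 = 0.6              # prior P(h1), from the text
p_x4_given_h1 = 1 / 6   # one of the six h1 tuples has value x4
p_x4 = 1 / 10           # x4 occurs once among the ten training tuples

posterior = p_x4_given_h1 * p_h1 / p_x4
print(round(posterior, 4))   # -> 1.0, so x4 is placed in class h1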
Hypothesis Testing
Find model to explain behavior by creating and then testing a hypothesis about the data.
Exact opposite of usual DM approach.
H0 – Null hypothesis; Hypothesis to be tested.
H1 – Alternative hypothesis
Chi Squared Statistic: χ² = Σ (O - E)² / E
O – observed value
E – expected value based on the hypothesis
Ex:
O = {50, 93, 67, 78, 87}
E = 75
χ² = 15.55, and therefore the deviation is significant.
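The value 15.55 follows directly from the definition:

# Sketch: chi-squared statistic, the sum of (O - E)^2 / E.
O = [50, 93, 67, 78, 87]
E = 75
chi2 = sum((o - E) ** 2 / E for o in O)
print(round(chi2, 2))   # -> 15.55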
Regression
Predict future values based on past values. Linear regression assumes that a linear relationship
exists.
y = c0 + c1 x1 + … + cn xn
Find the coefficient values that best fit the data.
Correlation
Examine the degree to which the values for two variables behave similarly.
Correlation coefficient r:
r = Σ (xi - mx)(yi - my) / sqrt( Σ (xi - mx)² Σ (yi - my)² ), where mx and my are the sample means.
1 = perfect correlation
-1 = perfect but opposite correlation
0 = no correlation
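A sketch computing both the least-squares line and r for invented data:

# Sketch: simple linear regression (y = c0 + c1*x) and correlation coefficient r.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]         # invented data

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

c1 = sxy / sxx               # slope
c0 = my - c1 * mx            # intercept
r = sxy / (sxx * syy) ** 0.5

print(round(c0, 4), round(c1, 4))   # -> 2.2 0.6
print(round(r, 3))                  # -> 0.775, a fairly strong positive correlation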
Similarity Measures
Determine similarity between two objects.
Similarity characteristics: sim(ti, ti) = 1, and sim(ti, tj) = 0 when the two objects are completely
dissimilar (the standard measures, such as Dice, Jaccard, cosine, and overlap, are given as
formulas in the text).
Alternatively, distance measures measure how unlike or dissimilar objects are.
Distance Measures
Measure dissimilarity between objects.
Ex: the Twenty Questions game
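A sketch of three widely used measures; the specific choice of Jaccard, cosine, and Euclidean here is an assumption, since the text gives its measures as figures:

# Sketch: two similarity measures and one distance measure.
def jaccard(a, b):               # similarity of two SETS, in [0, 1]
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(u, v):                # similarity of two VECTORS
    dot = sum(x * y for x, y in zip(u, v))
    norm = (sum(x * x for x in u) * sum(y * y for y in v)) ** 0.5
    return dot / norm

def euclidean(u, v):             # distance: 0 means identical objects
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

print(round(jaccard({"Bread", "Milk"}, {"Bread", "Beer"}), 3))   # -> 0.333
print(round(cosine([1, 0, 1], [1, 1, 1]), 3))                    # -> 0.816
print(euclidean([0, 0], [3, 4]))                                 # -> 5.0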
Decision Trees
Decision Tree (DT): Tree where the root and each internal node is labeled with a question.
The arcs represent each possible answer to the associated question.
Each leaf node represents a prediction of a solution to the problem.
Popular technique for classification; Leaf node indicates class to which the
corresponding tuple belongs.
A Decision Tree Model is a computational model consisting of three parts:
Decision Tree
Algorithm to create the tree
Algorithm that applies the tree to data
Creation of the tree is the most difficult part.
Processing is basically a search similar to that in a binary search tree (although DT may
not be binary).
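A minimal sketch of the third part of the model, the algorithm that applies the tree to data; the tree itself is hand-built and hypothetical:

# Sketch: applying a decision tree to a tuple.
# Internal node = (question, yes-subtree, no-subtree); a leaf is a class label.
tree = (lambda t: t["income"] >= 40000,
        (lambda t: t["credit"] == "good", "authorize", "further id"),
        "do not authorize")

def apply_tree(node, t):
    if isinstance(node, str):            # leaf: the predicted class
        return node
    question, yes, no = node
    return apply_tree(yes if question(t) else no, t)

print(apply_tree(tree, {"income": 50000, "credit": "good"}))  # -> authorize
print(apply_tree(tree, {"income": 30000, "credit": "bad"}))   # -> do not authorize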
Advantages/Disadvantages
Advantages:
Easy to understand.
Easy to generate rules
Disadvantages:
May suffer from overfitting.
Classifies by rectangular partitioning.
Does not easily handle nonnumeric data.
Can be quite large – pruning is necessary.
NEURAL NETWORKS
Based on the observed functioning of the human brain (Artificial Neural Networks, ANN). Our
view of neural networks is very simplistic: we view a neural network (NN) from a graphical
viewpoint, although a NN may also be viewed from the perspective of matrices. Used in pattern
recognition, speech recognition, computer vision, and classification.
Neural Network (NN): a directed graph F = <V, A> with vertices V = {1, 2, …, n} and arcs
A = {<i, j> | 1 <= i, j <= n}, with the following restrictions:
1. V is partitioned into a set of input nodes, VI, hidden nodes, VH, and output
nodes, VO.
2. The vertices are also partitioned into layers
3. Any arc <i,j> must have node i in layer h-1 and node j in layer h.
4. Arc <i,j> is labeled with a numeric value wij.
5. Node i is labeled with a function fi.
[Figures: an example NN and a single NN node]
NN Activation Functions
Functions associated with nodes in graph. Output may be in range [-1,1] or [0,1]
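A sketch of a single node: a weighted sum of the inputs passed through an activation function; sigmoid output lies in (0, 1) and tanh output in (-1, 1). The weights are invented:

import math

# Sketch: one NN node = weighted sum of inputs + activation function.
def sigmoid(s):                          # output in (0, 1)
    return 1.0 / (1.0 + math.exp(-s))

def node_output(inputs, weights, activation=sigmoid):
    s = sum(x * w for x, w in zip(inputs, weights))   # weighted sum
    return activation(s)

x = [0.5, 0.9]                           # input values
w = [0.3, 0.7]                           # arc weights wij (invented)
print(round(node_output(x, w), 4))             # sigmoid output, in (0, 1)
print(round(node_output(x, w, math.tanh), 4))  # tanh output, in (-1, 1)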
NN Learning
Propagate input values through the graph. Compare the output to the desired output. Adjust the
weights in the graph accordingly. A Neural Network Model is a computational model consisting
of three parts:
· Neural Network graph
· Learning algorithm that indicates how learning takes place.
· Recall techniques that determine how information is obtained from the network.
Advantages
Learning
Can continue learning even after training set has been applied.
Easy parallelization
Solves many problems
Disadvantages
Difficult to understand
May suffer from overfitting
Structure of graph must be determined a priori.
Input values must be numeric.
Verification difficult.
GENETIC ALGORITHMS
Optimization search type algorithms. Creates an initial feasible solution and iteratively creates
new "better" solutions. Based on human evolution and survival of the fittest. A solution must be
represented as an individual.
Individual: a string I = I1, I2, …, In where each Ij is from a given alphabet A. Each character Ij is
called a gene.
Population: set of individuals.
A Genetic Algorithm (GA) is a computational model consisting of five parts:
1. A starting set of individuals, P.
2. Crossover: technique to combine two parents to create offspring.
3. Mutation: randomly change an individual.
4. Fitness: determine the best individuals.
5. Algorithm which applies the crossover and mutation techniques to P iteratively,
using the fitness function to determine the best individuals in P to keep.
[Figure: crossover examples]
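A minimal sketch of the five parts for a toy problem (maximize the number of 1s in a bit string); the representation, fitness function, and rates are invented for illustration:

import random

# Sketch: GA maximizing the number of 1s in a bit string.
random.seed(1)
N, LEN = 10, 8                                   # population size, genes per individual

def fitness(ind):                                # part 4: fittest = most 1s
    return sum(ind)

def crossover(p1, p2):                           # part 2: one-point crossover
    k = random.randrange(1, LEN)
    return p1[:k] + p2[k:]

def mutate(ind, rate=0.1):                       # part 3: flip genes at random
    return [1 - g if random.random() < rate else g for g in ind]

P = [[random.randint(0, 1) for _ in range(LEN)] for _ in range(N)]   # part 1
for _ in range(30):                              # part 5: iterate, keep the fittest
    P.sort(key=fitness, reverse=True)
    parents = P[: N // 2]
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(N - len(parents))]
    P = parents + children

print(max(P, key=fitness))   # typically converges to [1, 1, 1, 1, 1, 1, 1, 1]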
Advantages
Easily parallelized
Disadvantages
Difficult to understand and explain to end users.
Abstraction of the problem and method to represent individuals is quite difficult.
Determining fitness function is difficult.
Determining how to perform crossover and mutation is difficult.
UNIT-V
ASSOCIATION RULES
INTRODUCTION
Associations and Item-sets: An association is a rule of the form: if X then Y. It is denoted as
X → Y.
Example:
If India wins in cricket, sales of sweets go up.
For any rule, if X → Y and Y → X, then X and Y are called an "interesting item-set".
Example:
People buying school uniforms in June also buy school bags
(and people buying school bags in June also buy school uniforms).
Support and Confidence: The support for a rule R is the ratio of the number of occurrences of R,
given all occurrences of all rules. The confidence of a rule X → Y is the ratio of the number of
occurrences of Y given X, among all other occurrences given X. For example, in a ten-transaction
database where eight transactions contain Bag and five contain both Bag and Uniform:
Support for {Bag, Uniform} = 5/10 = 0.5
Confidence for Bag → Uniform = 5/8 = 0.625
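These figures can be reproduced over a hypothetical transaction database chosen to match the counts:

# Sketch: support and confidence from transactions (hypothetical data chosen
# to match the example: 5 of 10 contain both items, 8 contain Bag).
transactions = [{"bag", "uniform"}] * 5 + [{"bag"}] * 3 + [{"uniform"}] * 2

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):                  # confidence of the rule x -> y
    return support(x | y) / support(x)

print(support({"bag", "uniform"}))                # -> 0.5
print(round(confidence({"bag"}, {"uniform"}), 4)) # -> 0.625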
LARGE ITEM SETS
Set of items: I = {I1, I2, …, Im}
Transactions: D = {t1, t2, …, tn}, tj ⊆ I
Itemset: {Ii1, Ii2, …, Iik} ⊆ I
Support of an itemset: percentage of transactions which contain that itemset.
Large (frequent) itemset: itemset whose number of occurrences is above a threshold.
I = {Beer, Bread, Jelly, Milk, PeanutButter}
Support of {Bread, PeanutButter} is 60%
Association Rule (AR): implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅
Support of AR (s) X ⇒ Y: percentage of transactions that contain X ∪ Y
Confidence of AR (α) X ⇒ Y: ratio of the number of transactions that contain X ∪ Y to the
number that contain X
Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn} where
ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules
X ⇒ Y with a minimum support and confidence.
BASIC ALGORITHMS
APRIORI ALGORITHM
Large Itemset Property:
Any subset of a large itemset is large.
Contrapositive:
If an itemset is not large, none of its supersets are large.
[Figure: the large itemset property]
Apriori Algorithm
C1 = itemsets of size one in I;
Determine all large itemsets of size 1, L1;
i = 1;
repeat
    i = i + 1;
    Ci = Apriori-Gen(Li-1);
    Count Ci to determine Li;
until no more large itemsets found;
Apriori-Gen
Generate candidates of size i+1 from large itemsets of size i.
Approach used: join large itemsets of size i if they agree on their first i-1 items.
Candidates that have a subset which is not large may also be pruned.
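A compact sketch of Apriori together with Apriori-Gen (join itemsets that agree on all but their last item, then prune); the transactions and the 50% threshold are invented:

from itertools import combinations

# Sketch: Apriori. Itemsets are sorted tuples; the data and minsup are invented.
transactions = [{"Bread", "Jelly", "PeanutButter"},
                {"Bread", "PeanutButter"},
                {"Bread", "Milk", "PeanutButter"},
                {"Beer", "Bread"},
                {"Beer", "Milk"}]
minsup = 0.5

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def apriori_gen(L):      # join pairs agreeing on all but the last item, then prune
    C = set()
    for a in L:
        for b in L:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                cand = a + (b[-1],)
                if all(s in L for s in combinations(cand, len(cand) - 1)):
                    C.add(cand)
    return C

items = sorted({i for t in transactions for i in t})
L = {(i,) for i in items if support((i,)) >= minsup}       # large 1-itemsets
large = set(L)
while L:
    L = {c for c in apriori_gen(L) if support(c) >= minsup}
    large |= L
print(sorted(large))     # -> Bread, PeanutButter, and {Bread, PeanutButter}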
Advantages:
Uses large itemset property.
Easily parallelized
Easy to implement.
Disadvantages:
Assumes transaction database is memory resident.
Requires up to m database scans.
Sampling
Used for large databases: sample the database and apply Apriori to the sample.
Potentially Large Itemsets (PL): large itemsets from the sample.
Negative Border (BD−): the minimal set of itemsets which are not in PL, but all of whose subsets
are in PL. It is obtained by a generalization of Apriori-Gen applied to itemsets of varying sizes.
Sampling Algorithm
Ds = sample of database D;
PL = large itemsets in Ds using smalls (a support threshold smaller than s);
C = PL ∪ BD−(PL);
Count C in database D using s;
ML = large itemsets in BD−(PL);
if ML = ∅ then done
else C = repeated application of BD−;
Count C in database D;
Example:
Find AR assuming s = 20%
Ds = {t1, t2}
smalls = 10%
PL = {{Bread}, {Jelly}, {PeanutButter}, {Bread, Jelly}, {Bread, PeanutButter},
{Jelly, PeanutButter}, {Bread, Jelly, PeanutButter}}
BD−(PL) = {{Beer}, {Milk}}
ML = {{Beer}, {Milk}}
Repeated application of BD− generates all remaining itemsets.
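The negative border in the example can be computed with a short sketch (PL is taken from the example above):

from itertools import combinations

# Sketch: BD-(PL) = minimal itemsets NOT in PL all of whose proper subsets are in PL.
def negative_border(PL, items):
    max_size = max(len(s) for s in PL)
    border = set()
    for k in range(1, max_size + 2):
        for cand in map(frozenset, combinations(items, k)):
            if cand not in PL and all(frozenset(s) in PL
                                      for s in combinations(cand, k - 1) if s):
                border.add(cand)
    return border

PL = {frozenset(s) for s in ({"Bread"}, {"Jelly"}, {"PeanutButter"},
                             {"Bread", "Jelly"}, {"Bread", "PeanutButter"},
                             {"Jelly", "PeanutButter"},
                             {"Bread", "Jelly", "PeanutButter"})}
items = ["Beer", "Bread", "Jelly", "Milk", "PeanutButter"]
print(negative_border(PL, items))   # -> {Beer} and {Milk}, as in the example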
Advantages:
Reduces the number of database scans to one in the best case and two in the worst.
Scales better.
Disadvantages:
Potentially large number of candidates in the second pass.
Partitioning
Divide database into partitions D1,D2,…,Dp . Apply Apriori to each partition. Any large
itemset must be large in at least one partition.
Algorithm
Divide D into partitions D1, D2, …, Dp;
for i = 1 to p do
    Li = Apriori(Di);
C = L1 ∪ … ∪ Lp;
Count C on D to generate L;
[Figure: partitioning example]
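A self-contained sketch of the partitioning idea with two invented partitions; the brute-force helper enumerates all itemsets and is only suitable for tiny examples:

from itertools import combinations

# Sketch: partitioning. An itemset large in D must be large in at least one
# partition, so union the locally large itemsets and recount them globally.
def large_itemsets(transactions, minsup):   # brute force, tiny data only
    items = sorted({i for t in transactions for i in t})
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sum(set(c) <= t for t in transactions) / len(transactions) >= minsup}

D1 = [{"Bread", "Milk"}, {"Bread"}]                      # invented partitions
D2 = [{"Beer"}, {"Beer", "Bread"}]
D = D1 + D2
C = large_itemsets(D1, 0.5) | large_itemsets(D2, 0.5)    # local large itemsets
L = {c for c in C if sum(c <= t for t in D) / len(D) >= 0.5}
print(L)   # -> {Bread} and {Beer}: globally large, found with one extra scan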
Advantages:
Adapts to available main memory
Easily parallelized
Maximum number of database scans is two.
Disadvantages:
May have many candidates during second scan.
PARALLEL AND DISTRIBUTED ALGORITHMS
Parallelizing AR Algorithms
Based on Apriori
Techniques differ in:
What is counted at each site
How data (transactions) are distributed
Data Parallelism
Data partitioned
Count Distribution Algorithm
Task Parallelism
Data and candidates partitioned
Data Distribution Algorithm
Count Distribution Algorithm (CDA)
Place a data partition at each site.
In parallel at each site do:
    C1 = itemsets of size one in I;
    Count C1;
    Broadcast counts to all sites;
    Determine global large itemsets of size 1, L1;
    i = 1;
    repeat
        i = i + 1;
        Ci = Apriori-Gen(Li-1);
        Count Ci;
        Broadcast counts to all sites;
        Determine global large itemsets of size i, Li;
    until no more large itemsets found;
[Figure: CDA example]
Data Distribution Algorithm (DDA)
Place a data partition at each site.
In parallel at each site do:
    Determine local candidates of size 1 to count;
    Broadcast local transactions to other sites;
    Count local candidates of size 1 on all data;
    Determine large itemsets of size 1 for local candidates;
    Broadcast large itemsets to all sites;
    Determine L1;
    i = 1;
    repeat
        i = i + 1;
        Ci = Apriori-Gen(Li-1);
        Determine local candidates of size i to count;
        Count, broadcast, and find Li;
    until no more large itemsets found;
COMPARING APPROACHES
Comparing AR Techniques:
Target
Type
Data Type
Data Source
Technique
Itemset Strategy and Data Structure
Transaction Strategy and Data Structure
Optimization
Architecture
Parallelism Strategy
INCREMENTAL ASSOCIATION RULES
Generate association rules in a dynamic database.
Problem: the algorithms above assume a static database.
Objective:
Know the large itemsets for D
Find the large itemsets for D ∪ ΔD, where ΔD is a set of newly added transactions
An itemset must be large in either D or ΔD to be large in D ∪ ΔD
Save the Li and their counts
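A small sketch of the update step, with invented counts: the counts saved for D are combined with counts over ΔD, so only itemsets newly large in ΔD force a rescan of D:

# Sketch: incremental update of large itemsets. Counts over D were saved.
saved = {"Bread": 60, "Milk": 45}     # counts over D, which had n = 100 (invented)
n_D, minsup = 100, 0.45
dD = [{"Bread", "Milk"}, {"Bread"}, {"Bread", "Beer"}]   # the increment

n = n_D + len(dD)
for item, old in saved.items():
    new = old + sum(item in t for t in dD)
    print(item, "large" if new / n >= minsup else "no longer large")
# -> Bread large; Milk no longer large. An item such as Beer, large only in
#    the increment, has no saved count, so one rescan of D is needed for it.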
Many applications outside market basket data analysis
Prediction (telecom switch failure)
Web usage mining
Many different types of association rules
Temporal
Spatial
Causal
ADVANCED ASSOCIATION RULE TECHNIQUES
Generalized Association Rules
Multiple-Level Association Rules
Quantitative Association Rules
Using multiple minimum supports
Correlation Rules
MEASURING QUALITY OF RULES
Support
Confidence
Interest
Conviction
Chi Squared Test
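Several of these measures can be computed from supports alone; a sketch with invented supports, where interest (lift) greater than 1 and conviction greater than 1 both indicate a rule better than chance:

# Sketch: rule-quality measures for X -> Y, from supports (invented values).
s_x, s_y, s_xy = 0.4, 0.5, 0.3    # support(X), support(Y), support(X and Y)

confidence = s_xy / s_x
interest = s_xy / (s_x * s_y)                # lift: 1 means X and Y independent
conviction = (1 - s_y) / (1 - confidence)    # undefined when confidence = 1

print(round(confidence, 4))   # -> 0.75
print(round(interest, 4))     # -> 1.5 (positively associated)
print(round(conviction, 4))   # -> 2.0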