Page 1: Data mining

Data Mining

Rajendra Akerkar

July 7, 2009 Data Mining: R. Akerkar 1

Page 2: Data mining

What Is Data Mining?

• Data mining (knowledge discovery from data)
 – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Is everything “data mining”?
 – (Deductive) query processing
 – Expert systems or small ML/statistical programs


Page 3: Data mining

Definition

• Several definitions:
 – Non-trivial extraction of implicit, previously unknown and potentially useful information from data
 – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns


Page 4: Data mining

From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996



Page 5: Data mining

Classification


Page 6: Data mining

Classification: Definition

• Given a collection of records (training set)
 – Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
 – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
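To make the training/test protocol concrete, here is a minimal sketch; the toy records, the trivial majority-class "model" and all names in it are invented for illustration and are not from the slides:

```python
import random

# Toy labelled records (attribute dict, class label); invented purely for illustration.
data = [({"outlook": o, "windy": w}, label)
        for o, w, label in [("sunny", False, "play"), ("sunny", True, "don't play"),
                            ("overcast", False, "play"), ("overcast", True, "play"),
                            ("rainy", False, "play"), ("rainy", True, "don't play")] * 5]

random.seed(1)
random.shuffle(data)
split = int(0.7 * len(data))              # 70% for training, 30% held out for testing
train, test = data[:split], data[split:]

# A deliberately trivial "model": always predict the majority class seen in the training set.
train_labels = [label for _, label in train]
model = max(set(train_labels), key=train_labels.count)

# Accuracy is measured only on the held-out test set.
accuracy = sum(label == model for _, label in test) / len(test)
print(f"model predicts '{model}', test accuracy = {accuracy:.2f}")
```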


Page 7: Data mining

Classification: Introduction
• A classification scheme generates a tree and a set of rules from a given data set.
• The attributes of the records are categorised into two types:
 – Attributes whose domain is numerical are called numerical attributes.
 – Attributes whose domain is not numerical are called categorical attributes.


Page 8: Data mining

Decision Tree

• A decision tree is a tree with the following properties:
 – An inner node represents an attribute.
 – An edge represents a test on the attribute of the father node.
 – A leaf represents one of the classes.
• Construction of a decision tree
 – Based on the training data
 – Top-down strategy


Page 9: Data mining


Page 10: Data mining

Decision Tree Example

• The data set has five attributes.
• There is a special attribute: the attribute class is the class label.
• The attributes temp (temperature) and humidity are numerical attributes.
• The other attributes are categorical, that is, they cannot be ordered.
• Based on the training data set, we want to find a set of rules to know what values of outlook, temperature, humidity and wind determine whether or not to play golf.


Page 11: Data mining

Decision Tree Example

• We have five leaf nodes.
• In a decision tree, each leaf node represents a rule.
• We have the following rules corresponding to the tree given in the figure:
• RULE 1: If it is sunny and the humidity is not above 75%, then play.
• RULE 2: If it is sunny and the humidity is above 75%, then do not play.
• RULE 3: If it is overcast, then play.
• RULE 4: If it is rainy and not windy, then play.
• RULE 5: If it is rainy and windy, then don't play.
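These five rules translate directly into code. A minimal sketch, assuming the attributes are passed as plain values (the function name and signature are illustrative, not from the slides):

```python
def play_golf(outlook: str, humidity: float, windy: bool) -> bool:
    """Return True when the five rules above say 'play'."""
    if outlook == "sunny":
        return humidity <= 75        # RULE 1 (play) / RULE 2 (do not play)
    if outlook == "overcast":
        return True                  # RULE 3
    if outlook == "rainy":
        return not windy             # RULE 4 (play) / RULE 5 (don't play)
    raise ValueError(f"unexpected outlook value: {outlook}")

print(play_golf("sunny", humidity=70, windy=False))   # True  (RULE 1)
print(play_golf("rainy", humidity=80, windy=True))    # False (RULE 5)
```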


Page 12: Data mining


Page 13: Data mining

Iterative Dichotomizer (ID3)
• Quinlan (1986)
• Each node corresponds to a splitting attribute.
• Each arc is a possible value of that attribute.
• At each node, the splitting attribute is selected to be the most informative among the attributes not yet considered in the path from the root.
• Entropy is used to measure how informative a node is.
• The algorithm uses the criterion of information gain to determine the goodness of a split.
 – The attribute with the greatest information gain is taken as the splitting attribute, and the data set is split for all distinct values of the attribute.


Page 14: Data mining

Training Dataset
This follows an example from Quinlan’s ID3.

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Page 15: Data mining

Extracting Classification Rules from Trees

• Represent the knowledge in the form of IF-THEN rules.
• One rule is created for each path from the root to a leaf.
• Each attribute-value pair along a path forms a conjunction.
• The leaf node holds the class prediction.
• Rules are easier for humans to understand.
• What are the rules?
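A minimal sketch of this path-walking idea, assuming a decision tree encoded as nested dicts; the encoding and the example tree (which mirrors the golf tree from the earlier slides) are illustrative choices, not the slides' representation:

```python
# Inner node = {"attr": ..., "branches": {value: subtree}}; leaf = class label (plain string).
tree = {
    "attr": "outlook",
    "branches": {
        "sunny":    {"attr": "humidity>75", "branches": {"no": "play", "yes": "don't play"}},
        "overcast": "play",
        "rainy":    {"attr": "windy", "branches": {"no": "play", "yes": "don't play"}},
    },
}

def extract_rules(node, conditions=()):
    """Yield one IF-THEN rule per root-to-leaf path; conditions along a path form a conjunction."""
    if isinstance(node, str):                       # leaf: holds the class prediction
        yield "IF " + " AND ".join(conditions) + f" THEN {node}"
        return
    for value, subtree in node["branches"].items():
        yield from extract_rules(subtree, conditions + (f"{node['attr']} = {value}",))

for rule in extract_rules(tree):
    print(rule)
```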


Page 16: Data mining

Attribute Selection Measure: Information Gain (ID3/C4.5)

• Select the attribute with the highest information gain.
• S contains si tuples of class Ci, for i = 1, …, m.
• Information (encoded in bits) measures the info required to classify any arbitrary tuple:

  I(s1, s2, …, sm) = − Σi=1..m (si/s) log2(si/s)

• Entropy of attribute A with values {a1, a2, …, av}:

  E(A) = Σj=1..v ((s1j + … + smj)/s) · I(s1j, …, smj)

• Information gained by branching on attribute A:

  Gain(A) = I(s1, s2, …, sm) − E(A)

Page 17: Data mining

Class P: buys_computer = “yes”
Class N: buys_computer = “no”
I(p, n) = I(9, 5) = 0.940

Compute the entropy for age:

age     pi  ni  I(pi, ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

Page 18: Data mining

Attribute Selection by Information Gain Computation

E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

(5/14) I(2, 3) means “age <= 30” has 5 out of 14 samples, with 2 yes's and 3 no’s. Hence

Gain(age) = I(p, n) − E(age) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
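These figures can be reproduced in a few lines of code. A minimal sketch using only the count table from the slide (the helper name info and the dictionary layout are illustrative):

```python
from math import log2

def info(*counts):
    """I(s1, ..., sm): expected information (entropy) for a class distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class counts per value of 'age' from the buys_computer table: (yes, no)
age_counts = {"<=30": (2, 3), "31…40": (4, 0), ">40": (3, 2)}
n = 14                                     # total number of tuples

i_total = info(9, 5)                       # I(9, 5)  = 0.940
e_age = sum((p + q) / n * info(p, q) for p, q in age_counts.values())   # E(age) = 0.694
gain_age = i_total - e_age                 # Gain(age) = 0.246

print(f"I(9,5) = {i_total:.3f}, E(age) = {e_age:.3f}, Gain(age) = {gain_age:.3f}")
```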

Page 19: Data mining

Exercise 1
• The following table consists of training data from an employee database.
• Let status be the class attribute. Use the ID3 algorithm to construct a decision tree from the given data.


Page 20: Data mining

Clustering


Page 21: Data mining

Clustering: Definition

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
 – Data points in one cluster are more similar to one another.
 – Data points in separate clusters are less similar to one another.
• Similarity measures:
 – Euclidean distance if attributes are continuous.
 – Other problem-specific measures.


Page 22: Data mining

The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:
 – Partition objects into k nonempty subsets.
 – Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
 – Assign each object to the cluster with the nearest seed point.
 – Go back to Step 2; stop when there are no more new assignments.
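A minimal sketch of these four steps in plain Python; the random initial partition, the use of math.dist and the assumption that no cluster ever becomes empty are illustrative simplifications, not part of the slide's algorithm statement:

```python
import math, random

def kmeans(points, k, seed=0):
    """Plain k-means: partition, recompute centroids, reassign, repeat until assignments stop changing."""
    random.seed(seed)
    # Step 1: partition objects into k nonempty subsets (here via a shuffled round-robin assignment).
    assign = [i % k for i in range(len(points))]
    random.shuffle(assign)
    while True:
        # Step 2: compute seed points as the centroids (mean points) of the current clusters.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]   # assumed nonempty
            centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_assign = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Step 4: go back to Step 2; stop when there are no new assignments.
        if new_assign == assign:
            return centroids, assign
        assign = new_assign

centroids, assign = kmeans([(0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)], k=2)
print(centroids, assign)
```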



Page 23: Data mining

Visualization of the k-means algorithm


Page 24: Data mining

Exercise 2

• Apply the k-means algorithm to the following 1-dimensional points (for k = 2): 1; 2; 3; 4; 6; 7; 8; 9.

• Use 1 and 2 as the starting centroids.


Page 25: Data mining

K-Means for a 2-dimensional Database

• Let us consider {x1, x2, x3, x4, x5} with the following coordinates as a two-dimensional sample for clustering:
• x1 = (0, 2), x2 = (0, 0), x3 = (1.5, 0), x4 = (5, 0), x5 = (5, 2)
• Suppose that the required number of clusters is 2.
• Initially, clusters are formed from a random distribution of samples:
• C1 = {x1, x2, x4} and C2 = {x3, x5}.


Page 26: Data mining

Centroid Calculation
• Suppose that the given set of N samples in an n-dimensional space has somehow been partitioned into K clusters {C1, C2, …, CK}.
• Each Ck has nk samples, and each sample is in exactly one cluster; therefore, Σk nk = N, where k = 1, …, K.
• The mean vector Mk of cluster Ck is defined as the centroid of the cluster:

  Mk = (1/nk) Σi=1..nk xik

  where xik is the ith sample belonging to cluster Ck.

• In our example, the centroids for these two clusters are
• M1 = {(0 + 0 + 5)/3, (2 + 0 + 0)/3} = {1.66, 0.66}
• M2 = {(1.5 + 5)/2, (0 + 2)/2} = {3.25, 1.00}

Page 27: Data mining

The Square-error of the Cluster

• The square-error for cluster Ck is the sum of squared Euclidean distances between each sample in Ck and its centroid.
• This error is called the within-cluster variation:

  ek² = Σi=1..nk (xik – Mk)²

• The within-cluster variations, after the initial random distribution of samples, are
• e1² = [(0 – 1.66)² + (2 – 0.66)²] + [(0 – 1.66)² + (0 – 0.66)²] + [(5 – 1.66)² + (0 – 0.66)²] = 19.36
• e2² = [(1.5 – 3.25)² + (0 – 1)²] + [(5 – 3.25)² + (2 – 1)²] = 8.12

Page 28: Data mining

Total Square-error

• The square-error for the entire clustering space containing K clusters is the sum of the within-cluster variations:

  E² = Σk=1..K ek²

• The total square-error is

  E² = e1² + e2² = 19.36 + 8.12 = 27.48


Page 29: Data mining

• When we reassign all samples, depending on the minimum distance from centroids M1 and M2, the new redistribution of samples inside the clusters will be:

• d(M1, x1) = (1.66² + 1.34²)^1/2 = 2.14 and d(M2, x1) = 3.40 ⇒ x1 ∈ C1
• d(M1, x2) = 1.79 and d(M2, x2) = 3.40 ⇒ x2 ∈ C1
• d(M1, x3) = 0.83 and d(M2, x3) = 2.01 ⇒ x3 ∈ C1
• d(M1, x4) = 3.41 and d(M2, x4) = 2.01 ⇒ x4 ∈ C2
• d(M1, x5) = 3.60 and d(M2, x5) = 2.01 ⇒ x5 ∈ C2

The above calculation is based on the Euclidean distance formula:

  d(xi, xj) = ( Σk=1..m (xik – xjk)² )^1/2


Page 30: Data mining

• The new clusters C1 = {x1, x2, x3} and C2 = {x4, x5} have new centroids
• M1 = {0.5, 0.67}
• M2 = {5.0, 1.0}

• The corresponding within-cluster variations and the total square-error are:
• e1² = 4.17
• e2² = 2.00
• E² = 6.17
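The hand calculation on the last few slides can be checked with a short script. A minimal sketch (the helper names are illustrative; the printed values match the slides up to rounding in the hand calculation):

```python
samples = {"x1": (0, 2), "x2": (0, 0), "x3": (1.5, 0), "x4": (5, 0), "x5": (5, 2)}
clusters = {"C1": ["x1", "x2", "x4"], "C2": ["x3", "x5"]}    # the initial random distribution

def centroid(names):
    pts = [samples[n] for n in names]
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))

def within_cluster_variation(names, centre):
    # Sum of squared Euclidean distances between each sample and the cluster centroid.
    return sum(sum((a - b) ** 2 for a, b in zip(samples[n], centre)) for n in names)

M1, M2 = centroid(clusters["C1"]), centroid(clusters["C2"])
e1 = within_cluster_variation(clusters["C1"], M1)
e2 = within_cluster_variation(clusters["C2"], M2)
# M1 ≈ (1.67, 0.67), M2 = (3.25, 1.0); e1² ≈ 19.3, e2² ≈ 8.1, E² ≈ 27.5
# (small differences from the slides come from rounding in the hand calculation).
print(M1, M2, round(e1, 2), round(e2, 2), round(e1 + e2, 2))

# Reassigning every sample to its nearest centroid gives C1 = {x1, x2, x3} and C2 = {x4, x5},
# i.e. the new centroids (0.5, 0.67) and (5.0, 1.0) and E² = 4.17 + 2.00 = 6.17 shown above.
```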


Page 31: Data mining

Exercise 3

Let the set X consist of the following sample points in 2-dimensional space:

X = {(1, 2), (1.5, 2.2), (3, 2.3), (2.5,-1), (0, 1.6), (-1,1.5)}

Let c1 = (1.5, 2.5) and c2 = (3, 1) be initial estimates of centroids for X.

What are the revised values of c1 and c2 after 1 iteration of k-means clustering (k = 2)?


Page 32: Data mining

Association Rule Discovery


Page 33: Data mining

Associations Discovery

• Associations discovery uncovers affinities amongst a collection of items.
• Affinities are represented by association rules.
• Associations discovery is an unsupervised approach to data mining.


Page 34: Data mining

Association discovery is one of the most common forms of data mining, and the one that people most closely associate with data mining: namely, mining for gold through a vast database. The gold in this case is a rule that tells you something about your database that you did not already know, and were probably unable to explicitly articulate.


Page 35: Data mining

Association discovery is done using rule induction, which basically tells a user how strong a pattern is and how likely it is to happen again. For instance, a database of items scanned in a consumer market basket helps find interesting patterns such as: if bagels are purchased, then cream cheese is purchased 90% of the time, and this pattern occurs in 3% of all shopping baskets.

You tell the database to find the rules; the rules pulled from the database are extracted and ordered for presentation to the user according to the percentage of times they are correct and how often they apply. You often get a lot of rules, and the user almost needs a second pass to find his or her gold nugget.
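The two percentages in the bagels example are the rule's confidence (how often it is correct) and its support (how often it applies). A minimal sketch of computing both, with an invented set of baskets purely for illustration:

```python
baskets = [
    {"bagels", "cream cheese", "milk"},
    {"bagels", "cream cheese"},
    {"bagels", "juice"},
    {"milk", "juice"},
    {"bread", "milk"},
]

def support(itemset):
    # Fraction of all baskets that contain every item in the itemset.
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(lhs, rhs):
    # Of the baskets containing the left-hand side, the fraction that also contain the right-hand side.
    return support(lhs | rhs) / support(lhs)

# Rule: {bagels} -> {cream cheese}
print(support({"bagels", "cream cheese"}))        # 0.4  -> the rule applies in 40% of baskets
print(confidence({"bagels"}, {"cream cheese"}))   # ~0.67 -> correct about 67% of the times bagels appear
```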



Page 36: Data mining

Associations

• The problem of deriving associations from data is known as market-basket analysis.
 – The popular algorithms are thus concerned with determining the set of frequent itemsets in a given set of operational databases.
 – The problem is to compute the frequency of occurrence of each itemset in the database.


Page 37: Data mining

Definition


Page 38: Data mining

Association Rules

• Algorithms that obtain association rules from data usually divide the task into two parts:
 – find the frequent itemsets, and
 – form the rules from them.


Page 39: Data mining

Association Rules

• The problem of mining association rules can be divided into two sub-problems:


Page 40: Data mining

Apriori Algorithm
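The slide's diagram is not reproduced in this transcript. As a rough sketch of the candidate-generation (join and prune) steps that the exercise on the next slide refers to, assuming the standard Apriori formulation with itemsets kept as sorted tuples (the function name apriori_gen is illustrative):

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Generate candidate (k+1)-itemsets from the frequent k-itemsets (join step, then prune step)."""
    L_k = {tuple(sorted(s)) for s in frequent_k}
    k = len(next(iter(L_k)))
    # Join step: merge two k-itemsets that agree on their first k-1 items.
    candidates = {tuple(sorted(set(a) | set(b)))
                  for a in L_k for b in L_k
                  if a < b and a[:k - 1] == b[:k - 1]}
    # Prune step: drop any candidate that has a k-subset which is not frequent (not in L_k).
    return {c for c in candidates
            if all(subset in L_k for subset in combinations(c, k))}

L2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(apriori_gen(L2))   # {('a', 'b', 'c')}: ('b', 'c', 'd') is pruned because ('c', 'd') is not frequent
```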


Page 41: Data mining

Exercise 4

Suppose that L3 is the list

{{a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}, {b,c,w}, {b,c,x}, {p,q,r}, {p,q,s}, {p,q,t}, {p,r,s}, {q,r,s}}

Which itemsets are placed in C4 by the join step of the Apriori algorithm? Which are then removed by the prune step?


Page 42: Data mining

Exercise 5

• Given a dataset with four attributes w, x, y and z, each with three values, how many rules can be generated with one term on the right-hand side?


Page 43: Data mining

References
• R. Akerkar and P. Lingras. Building an Intelligent Web: Theory & Practice. Jones & Bartlett, 2008 (in India: Narosa Publishing House, 2009).
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
• U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001.
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
• D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.

