Page 1

Advanced databases –

Inferring new knowledge from data(bases):

Knowledge Discovery in Databases

Bettina Berendt

Katholieke Universiteit Leuven, Department of Computer Science

http://people.cs.kuleuven.be/~bettina.berendt/teaching

Last update: 15 November 2011

Page 2

Agenda

Motivation II: Types of reasoning

The process of knowledge discovery (KDD)

A short overview of key KDD techniques

Clustering: k-means

Classification (classifier learning): ID3

Association-rule learning: apriori

Motivation I: Application examples

Page 3

Which cells are cancerous?

Proof positive — The difference between a normal and cancerous liver cell is shown clearly by the location of mitochondria […]. The healthy cell shows very few mitochondria near the outer cell wall; they cluster densely (red coloration) as they approach the cell's nucleus (depicted here as the black central hole). In the cancerous cell, the mitochondria are spread throughout the cell, do not cluster, and under the same lighting produce a more subdued effect.

Page 4

What is the impact of genetically modified organisms?

Page 5

What's spam and what isn't?

Page 6

What makes people happy?

Page 7

What „circles“ of friends do you have?

Page 8

What should we recommend to a customer/user?

Page 9

What topics exist in a collection of texts, and how do they evolve?

News texts, scientific publications, …

Page 10

Agenda

Motivation II: Types of reasoning

The process of knowledge discovery (KDD)

A short overview of key KDD techniques

Clustering: k-means

Classification (classifier learning): ID3

Association-rule learning: apriori

Motivation I: Application examples

Page 11

Type of inference used before in this course - example

foaf:mbox
• domain: Agent
• range: Thing (well, in fact a Mailbox)
• an inverse functional property

If mbox(MaryPoppins, [email protected]) and mbox(PeterParker, [email protected]),
then MaryPoppins = PeterParker.
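To make the inference pattern concrete, here is a minimal Python sketch (added here, not from the slides and not the course's tooling): it merges two individuals as soon as they share a value of an inverse functional property such as foaf:mbox. The names and the scrubbed mailbox are taken from the example above.

```python
# Illustrative triples: (person, mailbox). "[email protected]" stands in for the
# scrubbed address in the slide; any shared mailbox triggers the inference.
mbox_triples = [
    ("MaryPoppins", "[email protected]"),
    ("PeterParker", "[email protected]"),
]

# Because foaf:mbox is an inverse functional property, two subjects with the
# same mailbox must denote the same individual (an owl:sameAs conclusion).
owner_of = {}
same_as = []
for person, mailbox in mbox_triples:
    if mailbox in owner_of and owner_of[mailbox] != person:
        same_as.append((owner_of[mailbox], person))
    owner_of[mailbox] = person

print(same_as)  # [('MaryPoppins', 'PeterParker')]
```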

Page 12

Styles of reasoning: „All swans are white“

Deductive: towards the consequences
  All swans are white.
  Tessa is a swan.
  → Tessa is white.

Inductive: towards a generalisation of observations
  Joe and Lisa and Tex and Wili and ... (all observed swans) are swans.
  Joe and Lisa and Tex and Wili and ... (all observed swans) are white.
  → All swans are white.

Abductive: towards the (most likely) explanation of an observation
  Tessa is white.
  All swans are white.
  → Tessa is a swan.

Page 13

What about truth?

Deductive:
• Given the truth of the assumptions, a valid deduction guarantees the truth of the conclusion.

Inductive:
• The premises of an argument (are believed to) support the conclusion but do not ensure it.
• Induction has been attacked several times by logicians and philosophers.

Abductive:
• Formally equivalent to the logical fallacy of affirming the consequent.

Page 14

What about new knowledge?

C.S. Peirce:

Introduced „abduction“ to modern logic

(after 1900): used „abduction“ to mean: creating new rules to explain new observations (this meaning is actually closest to induction)

"Abduction is the only logical process that actually creates anything new."

essential for scientific discovery

Page 15

Agenda

Motivation II: Types of reasoning

The process of knowledge discovery (KDD)

A short overview of key KDD techniques

Clustering: k-means

Classification (classifier learning): ID3

Association-rule learning: apriori

Motivation I: Application examples

Page 16

„Data mining“ and „knowledge discovery“

(informal definition):

data mining is about discovering knowledge in (huge amounts of) data

Therefore, it is clearer to speak about “knowledge discovery in data(bases)” (KDD)

Second reason for preferring the term “KDD”:

“data mining” is not uniquely defined:

Some people use it to denote certain types of knowledge discovery (e.g., finding association rules, but not classifier learning)

Page 17

„Data mining“ is generally inductive

(informal definition):

data mining is about discovering knowledge in (huge amounts of) data :

... Looking at all the empirically observed swans ...

... Finding they are white

... Concluding that swans are white

Page 18

The KDD process

"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" - Fayyad, Piatetsky-Shapiro, Smyth (1996)

• non-trivial process: involves multiple steps
• valid: justified patterns/models
• novel: previously unknown
• useful: can be used
• understandable: by human and machine

Page 19

The process part of knowledge discovery

CRISP-DM
• CRoss Industry Standard Process for Data Mining
• a data mining process model that describes commonly used approaches that expert data miners use to tackle problems

Page 20

Knowledge discovery, machine learning, data mining

Knowledge discovery

= the whole process

Machine learning

the application of induction algorithms and other algorithms that can be said to „learn.“

= „modeling“ phase

Data mining

sometimes = KD, sometimes = ML

Page 21

Data Mining: Confluence of Multiple Disciplines

[Diagram: data mining at the confluence of database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.]

Page 22

Agenda

Motivation II: Types of reasoning

The process of knowledge discovery (KDD)

A short overview of key KDD techniques

Clustering: k-means

Classification (classifier learning): ID3

Association-rule learning: apriori

Motivation I: Application examples

Page 23

[Diagram: data + mining algorithm → general patterns. Example: cancerous cell data + classification algorithm (rule induction, decision tree, neural network) → classification: "What factors determine cancerous cells?"]

Page 24

Classification: Rule Induction
"What factors determine whether a cell is cancerous?"

If Color = light and Tails = 1 and Nuclei = 2
Then Healthy Cell (certainty = 92%)

If Color = dark and Tails = 2 and Nuclei = 2
Then Cancerous Cell (certainty = 87%)
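As an illustration of how such induced rules would be applied, here is a small Python sketch (added here, not from the slides; the record format is hypothetical):

```python
def classify(cell):
    """Apply the two induced rules above; return (class, certainty)."""
    if cell["color"] == "light" and cell["tails"] == 1 and cell["nuclei"] == 2:
        return ("healthy", 0.92)
    if cell["color"] == "dark" and cell["tails"] == 2 and cell["nuclei"] == 2:
        return ("cancerous", 0.87)
    return ("unknown", None)  # neither rule fires

print(classify({"color": "dark", "tails": 2, "nuclei": 2}))  # ('cancerous', 0.87)
```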

Page 25

Classification: Decision Trees

[Decision tree diagram: splits on Color (light / dark), # nuclei (1 / 2), and # tails (1 / 2); leaves are labelled "healthy" or "cancerous".]

Page 26

Classification: Neural Networks
"What factors determine whether a cell is cancerous?"

[Neural network diagram: input features such as Color = dark, # nuclei = 1, # tails = 2; output nodes Healthy / Cancerous.]

Page 27

Clustering
"Are there clusters of similar cells?"

[Diagram of cell clusters, e.g.: light color with 1 nucleus; dark color with 2 tails, 2 nuclei; 1 nucleus and 1 tail; dark color with 1 tail and 2 nuclei.]

Page 28

Association Rule Discovery

Task: discovering association rules among items in a transaction database.

An association between two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.

In general: A1, A2, ... => B

Page 29

Association Rule Discovery
"Are there any associations between the characteristics of the cells?"

If color = light and # nuclei = 1, then # tails = 1 (support = 12.5%; confidence = 50%)

If # nuclei = 2 and Cell = Cancerous, then # tails = 2 (support = 25%; confidence = 100%)

If # tails = 1, then Color = light (support = 37.5%; confidence = 75%)

Page 30

Many Other Data Mining Techniques
• Genetic algorithms
• Statistics
• Bayesian networks
• Rough sets
• Time series
• Text mining

Page 31

Agenda

Motivation II: Types of reasoning

The process of knowledge discovery (KDD)

A short overview of key KDD techniques

Clustering: k-means

Classification (classifier learning): ID3

Association-rule learning: apriori

Motivation I: Application examples

Page 32

The basic idea of clustering: group similar things

[Scatterplot of objects along Attribute 1 and Attribute 2, with two groups of nearby points marked as Group 1 and Group 2.]

Page 33

Concepts in Clustering

Defining distance between points
• Euclidean distance: d(Q, R) = ||Q - R|| = sqrt(sum_i (q_i - r_i)^2)
• any other distance (cityblock metric, Levenshtein, Jaccard similarity, ...)

A good clustering is one where
• (Intra-cluster distance) the sum of distances between objects in the same cluster is minimized,
• (Inter-cluster distance) while the distances between different clusters are maximized.
• Objective to minimize: F(Intra, Inter)

Clusters can be evaluated with "internal" as well as "external" measures
• Internal measures are related to the inter/intra-cluster distance.
• External measures are related to how well the current clusters represent the "true" classes.
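A small Python sketch of these distance notions (added here for concreteness; the function names are illustrative):

```python
from math import sqrt

def euclidean(q, r):
    """Euclidean distance ||Q - R|| between two points given as equal-length tuples."""
    return sqrt(sum((qi - ri) ** 2 for qi, ri in zip(q, r)))

def intra_cluster(cluster):
    """Sum of pairwise distances within one cluster (to be minimized)."""
    return sum(euclidean(p, q) for i, p in enumerate(cluster) for q in cluster[i + 1:])

def inter_cluster(c1, c2):
    """Sum of pairwise distances between two clusters (to be maximized)."""
    return sum(euclidean(p, q) for p in c1 for q in c2)

a = [(1.0, 1.0), (1.5, 2.0)]
b = [(8.0, 8.0), (9.0, 9.0)]
print(intra_cluster(a), intra_cluster(b), inter_cluster(a, b))
```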

Page 34

K Means Example (K=2)

[Animation: pick seeds → reassign clusters → compute centroids → reassign clusters → compute centroids → reassign clusters → converged!]

Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt

Page 35

K-means algorithm
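The algorithm itself did not survive in the transcript of this slide; the following is a minimal Python sketch of standard k-means (added here), matching the steps of the example on the previous page:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on points given as tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)                       # pick seeds
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:                      # converged
            break
        centroids = new_centroids
    return centroids, clusters

print(kmeans([(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.0)], k=2))
```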

Page 36

Agenda

Motivation II: Types of reasoning

The process of knowledge discovery (KDD)

A short overview of key KDD techniques

Clustering: k-means

Classification (classifier learning): ID3

Association-rule learning: apriori

Motivation I: Application examples

Page 37

Input data ... Q: when does this person play tennis?

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Page 38

Terminology (using a popular data example)

(The same weather table as on the previous page.)

Rows:
• Instances (think of them as objects): days, described by ...

Columns:
• Features: Outlook, Temp, ...

In this case, there is a feature with a special role:
• The class: Play (does X play tennis on this day?)

This is "relational DB mining". We will later see other types of data and the mining applied to them.

Page 39

The goal: a decision tree for classification / prediction

In which weather will someone play (tennis etc.)?

Page 40

Constructing decision trees

Strategy: top down, in recursive divide-and-conquer fashion
• First: select an attribute for the root node; create a branch for each possible attribute value
• Then: split the instances into subsets, one for each branch extending from the node
• Finally: repeat recursively for each branch, using only the instances that reach the branch
• Stop if all instances have the same class

A minimal sketch of this strategy in code follows below.
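The sketch (in Python; added here, not the course's code — instances are assumed to be dicts mapping feature names to values, with class labels in a parallel list):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting `rows` (dicts) on attribute `attr`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    """Top-down, recursive divide-and-conquer construction of a decision tree."""
    if len(set(labels)) == 1:                 # all instances share one class: stop
        return labels[0]
    if not attrs:                             # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))   # root attribute
    tree = {best: {}}
    for value in {row[best] for row in rows}:                     # one branch per value
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attrs if a != best])
    return tree
```

Run on the weather data above, this selects Outlook at the root (the gain computation on the following pages) and then recurses on the Sunny and Rainy branches.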

Page 41

Which attribute to select?

Page 42

Which attribute to select?

Page 43

Criterion for attribute selection

Which is the best attribute?
• We want to get the smallest tree.
• Heuristic: choose the attribute that produces the "purest" nodes.
• Popular impurity criterion: information gain.
• Information gain increases with the average purity of the subsets.
• Strategy: choose the attribute that gives the greatest information gain.

Page 44

Computing information

• Measure information in bits.
• Given a probability distribution, the info required to predict an event is the distribution's entropy.
• Entropy gives the information required in bits (can involve fractions of bits!).
• Formula for computing the entropy:
  entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn

Page 45

Example: attribute Outlook

info([2,3]) = entropy(2/5, 3/5) = -2/5 log(2/5) - 3/5 log(3/5) = 0.971 bits

info([4,0]) = entropy(1, 0) = -1 log(1) - 0 log(0) = 0 bits

info([3,2]) = entropy(3/5, 2/5) = -3/5 log(3/5) - 2/5 log(2/5) = 0.971 bits

info([2,3], [4,0], [3,2]) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.693 bits

Page 46

Computing information gain

Information gain: information before splitting – information after splitting

Information gain for the attributes from the weather data:
• gain(Outlook) = 0.247 bits
• gain(Temperature) = 0.029 bits
• gain(Humidity) = 0.152 bits
• gain(Windy) = 0.048 bits

gain(Outlook) = info([9,5]) - info([2,3], [4,0], [3,2]) = 0.940 - 0.693 = 0.247 bits
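The numbers above can be checked with a few lines of Python (a sketch added here, using base-2 logarithms):

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info(partitions):
    """Weighted average entropy of the subsets produced by a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

before = entropy([9, 5])                # whole data set: 9 yes, 5 no -> 0.940 bits
after = info([[2, 3], [4, 0], [3, 2]])  # split on Outlook -> 0.693 bits
print(round(before - after, 3))         # gain(Outlook) = 0.247 bits
```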

Page 47

Continuing to split

gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits

Page 48

Final decision tree

Note: not all leaves need to be pure; sometimes identical instances have different classes

Splitting stops when data can’t be split any further

Page 49

Wishlist for a purity measure

Properties we require from a purity measure:
• When a node is pure, the measure should be zero.
• When impurity is maximal (i.e. all classes equally likely), the measure should be maximal.
• The measure should obey the multistage property (i.e. decisions can be made in several stages).

Entropy is the only function that satisfies all three properties!

Page 50

Properties of the entropy

The multistage property:
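The accompanying formula did not survive in the transcript; a standard way to state the multistage (grouping) property of entropy is

  entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q / (q + r), r / (q + r))   for p + q + r = 1.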

Simplification of computation:

Note: instead of maximizing info gain we could just minimize information

Page 51

Discussion / outlook decision trees

Top-down induction of decision trees: ID3, algorithm developed by Ross Quinlan

Various improvements, e.g. C4.5:
• deals with numeric attributes, missing values, noisy data
• gain ratio instead of information gain [see Witten & Frank slides, ch. 4, pp. 40-45]

Similar approach: CART …

Page 52

Agenda

Motivation II: Types of reasoning

The process of knowledge discovery (KDD)

A short overview of key KDD techniques

Clustering: k-means

Classification (classifier learning): ID3

Association-rule learning: apriori

Motivation I: Application examples

Page 53

Motivation for association-rule learning/mining: store layout (Amazon, earlier: Wal-Mart, ...)

Where to put: spaghetti, butter?

Page 54

Data

"Market basket data": attributes with boolean domains

In a table each row is a basket (aka transaction)

Transaction ID Attributes (basket items)

1 Spaghetti, tomato sauce

2 Spaghetti, bread

3 Spaghetti, tomato sauce, bread

4 bread, butter

5 bread, tomato sauce

Page 55

Solution approach: The apriori principle and the pruning of the search tree (1)

[Itemset lattice over {spaghetti, tomato sauce, bread, butter}: the four 1-itemsets, the six 2-itemsets, the four 3-itemsets, and the 4-itemset {spaghetti, tomato sauce, bread, butter}.]

Page 56

Solution approach: The apriori principle and the pruning of the search tree (2)

[The same itemset lattice as on page 55, at the next pruning step.]

Page 57

Solution approach: The apriori principle and the pruning of the search tree (3)

[The same itemset lattice as on page 55, at the next pruning step.]

Page 58

Solution approach: The apriori principle and the pruning of the search tree (4)

[The same itemset lattice as on page 55, at the final pruning step.]

Page 59

More formally: Generating large k-itemsets with Apriori

Min. support = 40%

step 1: candidate 1-itemsets

Spaghetti: support = 3 (60%)

tomato sauce: support = 3 (60%)

bread: support = 4 (80%)

butter: support = 1 (20%)

Transaction ID   Attributes (basket items)

1 Spaghetti, tomato sauce

2 Spaghetti, bread

3 Spaghetti, tomato sauce, bread

4 bread, butter

5 bread, tomato sauce

Page 60

Contd.

step 2: large 1-itemsets

Spaghetti

tomato sauce

bread

candidate 2-itemsets

{Spaghetti, tomato sauce}: support = 2 (40%)

{Spaghetti, bread}: support = 2 (40%)

{tomato sauce, bread}: support = 2 (40%)

Transaction ID   Attributes (basket items)

1 Spaghetti, tomato sauce

2 Spaghetti, bread

3 Spaghetti, tomato sauce, bread

4 bread, butter

5 bread, tomato sauce

Page 61

Contd.

step 3: large 2-itemsets

{Spaghetti, tomato sauce}

{Spaghetti, bread}

{tomato sauce, bread}

candidate 3-itemsets

{Spaghetti, tomato sauce, bread}: support = 1 (20%)

step 4: large 3-itemsets

{ }

Transaction ID   Attributes (basket items)

1 Spaghetti, tomato sauce

2 Spaghetti, bread

3 Spaghetti, tomato sauce, bread

4 bread, butter

5 bread, tomato sauce
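A compact Python sketch of this level-wise generation (added here; min. support 40% on the five transactions above — a demonstration of the principle, not an efficient implementation):

```python
from itertools import combinations

transactions = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(min_support=0.4):
    """Level-wise generation of large (frequent) k-itemsets."""
    items = sorted({i for t in transactions for i in t})
    large = [frozenset([i]) for i in items if support({i}) >= min_support]
    k, result = 1, {1: large}
    while large:
        k += 1
        # Candidate k-itemsets: unions of large (k-1)-itemsets, pruned so that
        # every (k-1)-subset is itself large (the apriori principle).
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        candidates = [c for c in candidates
                      if all(frozenset(s) in large for s in combinations(c, k - 1))]
        large = [c for c in candidates if support(c) >= min_support]
        if large:
            result[k] = large
    return result

# Reproduces the worked example: large 1-itemsets {spaghetti}, {tomato sauce}, {bread};
# large 2-itemsets {spaghetti, tomato sauce}, {spaghetti, bread}, {tomato sauce, bread};
# no large 3-itemsets ({spaghetti, tomato sauce, bread} has support 20% only).
print(apriori())
```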


Page 62

From itemsets to association rules

Schema: if {subset} then {large k-itemset}, with support s and confidence c

s = (support of large k-itemset) / # tuples

c = (support of large k-itemset) / (support of subset)

Example:

If {spaghetti} then {spaghetti, tomato sauce}

Support: s = 2 / 5 (40%)

Confidence: c = 2 / 3 (66%)
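The same two numbers, computed directly (a small self-contained Python sketch added here):

```python
transactions = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule "if {spaghetti} then {spaghetti, tomato sauce}":
s = support({"spaghetti", "tomato sauce"})                           # 2/5 = 40%
c = support({"spaghetti", "tomato sauce"}) / support({"spaghetti"})  # 2/3, about 66%
print(s, c)
```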

Page 63

Outlook

Motivation II: Types of reasoning

The process of knowledge discovery (KDD)

A short overview of key KDD techniques

Clustering: k-means

Classification (classifier learning): ID3

Association-rule learning: apriori

Motivation I: Application examples

Text mining

Page 64

References / background reading

Knowledge discovery is now an established area with some excellent general textbooks. I recommend the following as examples of the 3 main perspectives:

a databases / data warehouses perspective: Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann. http://www.cs.sfu.ca/%7Ehan/dmbook

a machine learning perspective: Witten, I.H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html

a statistics perspective: Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press. http://mitpress.mit.edu/catalog/item/default.asp?tid=3520&ttype=2

The CRISP-DM manual can be found at http://www.spss.ch/upload/1107356429_CrispDM1.0.pdf

Page 65

Acknowledgements

The overview of data mining was taken from (with minor modifications):
• Tzacheva, A.A. (2006). SIMS 422. Knowledge Inference Systems & Applications. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewI.ppt
• Tzacheva, A.A. (2006). Knowledge Discovery and Data Mining. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewII.ppt

p. 21 was taken from Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques — Chapter 1 — Introduction. http://www.cs.sfu.ca/%7Ehan/bk/1intro.ppt

The ID3 part is based on Witten, I.H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html

In particular, the instructor slides for that book, available at http://books.elsevier.com/companions/9780120884070/ (chapters 1-4):
• http://books.elsevier.com/companions/9780120884070/revisionnotes/01~PDFs/chapter1.pdf (and ...chapter2.pdf, chapter3.pdf, chapter4.pdf) or
• http://books.elsevier.com/companions/9780120884070/revisionnotes/02~ODP%20Files/chapter1.odp (and ...chapter2.odp, chapter3.odp, chapter4.odp)

