1Berendt: Advanced databases, first semester 2011, http://people.cs.kuleuven.be/~bettina.berendt/teaching
Advanced databases –
Inferring new knowledge from data(bases):
Knowledge Discovery in Databases
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://people.cs.kuleuven.be/~bettina.berendt/teaching
Last update: 15 November 2011
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
Which cells are cancerous?
Proof positive — The difference between a normal and cancerous liver cell is shown clearly by the location of mitochondria […]. The healthy cell shows very few mitochondria near the outer cell wall; they cluster densely (red coloration) as they approach the cell's nucleus (depicted here as the black central hole). In the cancerous cell, the mitochondria are spread throughout the cell, do not cluster, and under the same lighting produce a more subdued effect.
What is the impact of genetically modified organisms?
What's spam and what isn't?
What makes people happy?
What "circles" of friends do you have?
What should we recommend to a customer/user?
What topics exist in a collection of texts, and how do they evolve?
News texts, scientific publications, …
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
Type of inference used before in this course - example
foaf:mbox
  domain: Agent
  range: Thing (well, in fact a Mailbox)
  an inverse functional property
If
mbox(MaryPoppins, [email protected]) and mbox(PeterParker, [email protected]),
then
MaryPoppins = PeterParker
Styles of reasoning: "All swans are white"
Deductive (towards the consequences):
  All swans are white.
  Tessa is a swan.
  Therefore: Tessa is white.
Inductive (towards a generalisation of observations):
  Joe and Lisa and Tex and Wili and ... (all observed swans) are swans.
  Joe and Lisa and Tex and Wili and ... (all observed swans) are white.
  Therefore: All swans are white.
Abductive (towards the most likely explanation of an observation):
  Tessa is white.
  All swans are white.
  Therefore: Tessa is a swan.
What about truth?
Deductive:
  Given the truth of the assumptions, a valid deduction guarantees the truth of the conclusion.
Inductive:
  The premises of an argument (are believed to) support the conclusion but do not ensure it.
  Has been attacked several times by logicians and philosophers.
Abductive:
  Formally equivalent to the logical fallacy of affirming the consequent.
What about new knowledge?
C.S. Peirce:
Introduced "abduction" to modern logic.
(after 1900): used "abduction" to mean creating new rules to explain new observations (this meaning is actually closest to induction).
"Abduction is the only logical process that actually creates anything new."
essential for scientific discovery
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
"Data mining" and "knowledge discovery"
(informal definition):
data mining is about discovering knowledge in (huge amounts of) data
Therefore, it is clearer to speak about “knowledge discovery in data(bases)” (KDD)
Second reason for preferring the term “KDD”:
“data mining” is not uniquely defined:
Some people use it to denote certain types of knowledge discovery (e.g., finding association rules, but not classifier learning)
"Data mining" is generally inductive
(informal definition):
data mining is about discovering knowledge in (huge amounts of) data :
... Looking at all the empirically observed swans ...
... Finding they are white
... Concluding that swans are white
The KDD process
"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" - Fayyad, Piatetsky-Shapiro, & Smyth (1996)
non-trivial process: a multi-step process
valid: justified patterns/models
novel: previously unknown
useful: can be acted upon
understandable: by human and machine
The process part of knowledge discovery
CRISP-DM
  CRoss Industry Standard Process for Data Mining
  a data mining process model that describes commonly used approaches that expert data miners use to tackle problems
Knowledge discovery, machine learning, data mining
Knowledge discovery
= the whole process
Machine learning
= the application of induction algorithms and other algorithms that can be said to "learn"
= the "modeling" phase
Data mining
sometimes = KD, sometimes = ML
Data Mining: Confluence of Multiple Disciplines
Data mining lies at the confluence of database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
From data to general patterns
Example: cancerous cell data
Classification: "What factors determine cancerous cells?"
A mining algorithm (here: a classification algorithm) extracts the patterns, e.g.:
  rule induction
  decision trees
  neural networks
Classification: Rule Induction
"What factors determine whether a cell is cancerous?"
If Color = light and Tails = 1 and Nuclei = 2 then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and Nuclei = 2 then Cancerous Cell (certainty = 87%)
Classification: Decision Trees
[Figure: a decision tree that first splits on Color (light/dark), then on #nuclei (1/2) and #tails (1/2), with leaves labelled healthy or cancerous.]
Classification: Neural Networks
"What factors determine whether a cell is cancerous?"
[Figure: a neural network mapping inputs such as Color = dark, # nuclei = 1, …, # tails = 2 to the outputs Healthy / Cancerous.]
Clustering
"Are there clusters of similar cells?"
[Figure: four clusters: light color with 1 nucleus; dark color with 2 tails and 2 nuclei; 1 nucleus and 1 tail; dark color with 1 tail and 2 nuclei.]
Association Rule Discovery
Task: discovering association rules among items in a transaction database.
An association between two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.
In general: A1, A2, … => B
"Are there any associations between the characteristics of the cells?"
If Color = light and # nuclei = 1 then # tails = 1 (support = 12.5%; confidence = 50%)
If # nuclei = 2 and Cell = Cancerous then # tails = 2 (support = 25%; confidence = 100%)
If # tails = 1 then Color = light (support = 37.5%; confidence = 75%)
Many Other Data Mining Techniques
genetic algorithms, statistics, Bayesian networks, rough sets, time series, text mining, …
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
The basic idea of clustering: group similar things
[Figure: two groups of points (Group 1, Group 2) in a two-dimensional attribute space, Attribute 1 vs. Attribute 2.]
Concepts in Clustering
Defining distance between points
Euclidean distance: d(Q, R) = ||Q − R||
any other distance (city-block metric, Levenshtein, Jaccard similarity, ...)
A good clustering is one where
(Intra-cluster distance) the sum of distances between objects in the same cluster are minimized,
(Inter-cluster distance) while the distances between different clusters are maximized
Objective to minimize: F(Intra,Inter)
Clusters can be evaluated with “internal” as well as “external” measures
Internal measures are related to the inter/intra cluster distance
External measures are related to how well the current clusters represent the "true" classes
K Means Example (K=2)
Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!
Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
K-means algorithm
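The algorithm itself appears only as a figure in the original slides. As a minimal sketch of the loop illustrated on the previous slide (function and variable names are mine, not from the slides):

```python
import random

def kmeans(points, k, iterations=100):
    """Plain k-means on 2-D points: pick seeds, then alternate between
    (re)assigning each point to its nearest centroid and recomputing
    centroids, until the assignment stops changing."""
    centroids = random.sample(points, k)   # pick seeds
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to the nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                              + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:   # converged!
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its cluster.
        for c in range(k):
            cluster = [p for p, a in zip(points, assignment) if a == c]
            if cluster:
                centroids[c] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, assignment
```

Note that the objective F(Intra, Inter) from the previous slide is only minimized locally: the result depends on the randomly picked seeds.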
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
Input data ... Q: when does this person play tennis?
Outlook  | Temp | Humidity | Windy | Play
Rainy    | Mild | High     | True  | No
Overcast | Hot  | Normal   | False | Yes
Overcast | Mild | High     | True  | Yes
Sunny    | Mild | Normal   | True  | Yes
Rainy    | Mild | Normal   | False | Yes
Sunny    | Cool | Normal   | False | Yes
Sunny    | Mild | High     | False | No
Overcast | Cool | Normal   | True  | Yes
Rainy    | Cool | Normal   | True  | No
Rainy    | Cool | Normal   | False | Yes
Rainy    | Mild | High     | False | Yes
Overcast | Hot  | High     | False | Yes
Sunny    | Hot  | High     | True  | No
Sunny    | Hot  | High     | False | No
Terminology (using a popular data example)
[Weather data table as on the previous slide.]
Rows: instances (think of them as objects); here: days, described by:
Columns: features; here: Outlook, Temp, …
In this case, there is a feature with a special role:
  the class: Play (does X play tennis on this day?)
This is "relational DB mining". We will later see other types of data and the mining applied to them.
The goal: a decision tree for classification / prediction
In which weather will someone play (tennis etc.)?
Constructing decision trees
Strategy: top-down, in recursive divide-and-conquer fashion
First: select an attribute for the root node; create a branch for each possible attribute value
Then: split the instances into subsets, one for each branch extending from the node
Finally: repeat recursively for each branch, using only the instances that reach the branch
Stop if all instances have the same class
Which attribute to select?
Criterion for attribute selection
Which is the best attribute?
  Want to get the smallest tree
  Heuristic: choose the attribute that produces the "purest" nodes
Popular impurity criterion: information gain
  Information gain increases with the average purity of the subsets
  Strategy: choose the attribute that gives the greatest information gain
Computing information
Measure information in bits
  Given a probability distribution, the info required to predict an event is the distribution's entropy
  Entropy gives the information required in bits (can involve fractions of bits!)
Formula for computing the entropy:
  entropy(p1, p2, …, pn) = −p1 log2(p1) − p2 log2(p2) − … − pn log2(pn)
Example: attribute Outlook
info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
info([2,3]) = entropy(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits
info([3,2], [4,0], [3,2]) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.693 bits
Computing information gain
Information gain: information before splitting − information after splitting
gain(Outlook) = info([9,5]) − info([2,3],[4,0],[3,2]) = 0.940 − 0.693 = 0.247 bits
Information gain for the attributes from the weather data:
  gain(Outlook) = 0.247 bits
  gain(Temperature) = 0.029 bits
  gain(Humidity) = 0.152 bits
  gain(Windy) = 0.048 bits
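These numbers can be checked mechanically. A short sketch (not from the slides; helper names are mine) that recomputes info([9,5]), the weighted information after splitting on Outlook, and the resulting gain:

```python
from math import log2

def entropy(counts):
    """info([c1, c2, ...]) in bits: entropy of the class distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_info(partitions):
    """Weighted average entropy after splitting on an attribute."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

# Splitting the 14 weather instances on Outlook gives the [yes, no]
# counts: Sunny -> [2, 3], Overcast -> [4, 0], Rainy -> [3, 2].
before = entropy([9, 5])                      # info([9,5]) = 0.940 bits
after = split_info([[2, 3], [4, 0], [3, 2]])  # 0.693 bits
gain_outlook = before - after                 # 0.247 bits
print(round(gain_outlook, 3))
```

Replacing the partition counts with those for Temperature, Humidity, and Windy reproduces the other three gains.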
Continuing to split
gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits
Final decision tree
Note: not all leaves need to be pure; sometimes identical instances have different classes
Splitting stops when data can’t be split any further
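The final tree is shown only as a figure in the source. Assuming the tree usually reported for this dataset (Outlook at the root, Humidity under Sunny, Windy under Rainy), it can be written directly as nested conditions:

```python
def will_play(outlook, humidity, windy):
    """Decision tree commonly learned from the weather data:
    Outlook at the root, Humidity below Sunny, Windy below Rainy."""
    if outlook == "Overcast":
        return True                  # all Overcast days have Play = Yes
    if outlook == "Sunny":
        return humidity == "Normal"  # High humidity -> No
    return not windy                 # Rainy: play only when not windy
```

Note that Temperature does not appear at all: no split on it ever gave the greatest information gain.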
Wishlist for a purity measure
Properties we require from a purity measure:
  When a node is pure, the measure should be zero
  When impurity is maximal (i.e. all classes equally likely), the measure should be maximal
  The measure should obey the multistage property (i.e. decisions can be made in several stages)
Properties of the entropy
The multistage property:
  entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q / (q + r), r / (q + r))
Simplification of computation, e.g.:
  info([2,3,4]) = −(2/9) log(2/9) − (3/9) log(3/9) − (4/9) log(4/9)
                = [−2 log 2 − 3 log 3 − 4 log 4 + 9 log 9] / 9
Note: instead of maximizing info gain we could just minimize information
Discussion / outlook: decision trees
Top-down induction of decision trees: ID3, an algorithm developed by Ross Quinlan
Various improvements, e.g. C4.5:
  deals with numeric attributes, missing values, noisy data
  uses gain ratio instead of information gain [see Witten & Frank slides, ch. 4, pp. 40-45]
Similar approach: CART, …
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
Motivation for association-rule learning/mining: store layout (Amazon, earlier: Wal-Mart, ...)
Where to put: spaghetti, butter?
Data
"Market basket data": attributes with boolean domains
In a table, each row is a basket (aka transaction):
Transaction ID | Attributes (basket items)
1 | spaghetti, tomato sauce
2 | spaghetti, bread
3 | spaghetti, tomato sauce, bread
4 | bread, butter
5 | bread, tomato sauce
Solution approach: The apriori principle and the pruning of the search tree (1)
[Itemset lattice:]
Level 1: spaghetti; tomato sauce; bread; butter
Level 2: {spaghetti, tomato sauce}; {spaghetti, bread}; {spaghetti, butter}; {tomato sauce, bread}; {tomato sauce, butter}; {bread, butter}
Level 3: {spaghetti, tomato sauce, bread}; {spaghetti, tomato sauce, butter}; {spaghetti, bread, butter}; {tomato sauce, bread, butter}
Level 4: {spaghetti, tomato sauce, bread, butter}
Solution approach: The apriori principle and the pruning of the search tree (2)
Solution approach: The apriori principle and the pruning of the search tree (3)
Solution approach: The apriori principle and the pruning of the search tree (4)
More formally: Generating large k-itemsets with Apriori
Min. support = 40%
step 1: candidate 1-itemsets
spaghetti: support = 3 (60%)
tomato sauce: support = 3 (60%)
bread: support = 4 (80%)
butter: support = 1 (20%)
Contd.
step 2: large 1-itemsets
Spaghetti
tomato sauce
bread
candidate 2-itemsets
{Spaghetti, tomato sauce}: support = 2 (40%)
{Spaghetti, bread}: support = 2 (40%)
{tomato sauce, bread}: support = 2 (40%)
step 3: large 2-itemsets
{Spaghetti, tomato sauce}
{Spaghetti, bread}
{tomato sauce, bread}
candidate 3-itemsets
{Spaghetti, tomato sauce, bread}: support = 1 (20%)
step 4: large 3-itemsets
{ }
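The level-wise search above can be sketched in Python (an illustration, not the slides' code; function and variable names are mine), using the same five baskets and a minimum support of 2 transactions (40%):

```python
from itertools import combinations

baskets = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support(itemset):
    """Number of baskets containing all items of the itemset."""
    return sum(itemset <= b for b in baskets)

def apriori(min_support):
    """Level-wise Apriori: a k-itemset is a candidate only if all of
    its (k-1)-subsets were large; candidates are then counted."""
    items = sorted({i for b in baskets for i in b})
    large = [frozenset([i]) for i in items if support({i}) >= min_support]
    all_large, k = list(large), 2
    while large:
        # Candidate k-itemsets: unions of large (k-1)-itemsets ...
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # ... pruned by the apriori principle, then checked for support.
        large = [c for c in candidates
                 if all(frozenset(s) in set(all_large)
                        for s in combinations(c, k - 1))
                 and support(c) >= min_support]
        all_large += large
        k += 1
    return all_large

for itemset in apriori(min_support=2):
    print(sorted(itemset), support(itemset))
```

On the example data this yields exactly the large itemsets derived above: three large 1-itemsets, three large 2-itemsets, and no large 3-itemsets.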
From itemsets to association rules
Schema: if subset, then large k-itemset (with support s and confidence c)
  s = (support of the large k-itemset) / #tuples
  c = (support of the large k-itemset) / (support of the subset)
Example:
If {spaghetti} then {spaghetti, tomato sauce}
  Support: s = 2/5 (40%)
  Confidence: c = 2/3 (66%)
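Continuing the example, a rule's support and confidence follow directly from the basket counts. A sketch with hypothetical helper names (the slide truncates 2/3 to 66%; rounding gives 67%):

```python
baskets = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def rule_stats(antecedent, itemset):
    """Support and confidence of the rule: if `antecedent`, then
    `itemset`, where the antecedent is a subset of the large itemset."""
    count = lambda s: sum(s <= b for b in baskets)
    support = count(itemset) / len(baskets)        # fraction of all baskets
    confidence = count(itemset) / count(antecedent)  # conditional frequency
    return support, confidence

# If {spaghetti} then {spaghetti, tomato sauce}
s, c = rule_stats({"spaghetti"}, {"spaghetti", "tomato sauce"})
print(f"support = {s:.0%}, confidence = {c:.0%}")
```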
Outlook
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
Text mining
References / background reading
Knowledge discovery is now an established area with some excellent general textbooks. I recommend the following as examples of the 3 main perspectives:
a databases / data warehouses perspective: Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco,CA: Morgan Kaufmann. http://www.cs.sfu.ca/%7Ehan/dmbook
a machine learning perspective: Witten, I.H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
a statistics perspective: Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press. http://mitpress.mit.edu/catalog/item/default.asp?tid=3520&ttype=2
The CRISP-DM manual can be found at http://www.spss.ch/upload/1107356429_CrispDM1.0.pdf
Acknowledgements
The overview of data mining was taken from (with minor modifications):
Tzacheva, A.A. (2006). SIMS 422. Knowledge Inference Systems & Applications. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewI.ppt
Tzacheva, A.A. (2006). Knowledge Discovery and Data Mining. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewII.ppt
p. 21 was taken from Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques — Chapter 1 — Introduction. http://www.cs.sfu.ca/%7Ehan/bk/1intro.ppt
The ID3 part is based on Witten, I.H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
In particular, the instructor slides for that book available at http://books.elsevier.com/companions/9780120884070/ (chapters 1-4):
http://books.elsevier.com/companions/9780120884070/revisionnotes/01~PDFs/chapter1.pdf (and ...chapter2.pdf, chapter3.pdf, chapter4.pdf) or
http://books.elsevier.com/companions/9780120884070/revisionnotes/02~ODP%20Files/chapter1.odp (and ...chapter2.odp, chapter3.odp, chapter4.odp)