1Berendt: Advanced databases, first semester 2011, http://people.cs.kuleuven.be/~bettina.berendt/teaching
Advanced databases –
Inferring new knowledge from data(bases):
Knowledge Discovery in Databases
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://people.cs.kuleuven.be/~bettina.berendt/teaching
Last update: 15 November 2011
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
Which cells are cancerous?
Proof positive — The difference between a normal and cancerous liver cell is shown clearly by the location of mitochondria […]. The healthy cell shows very few mitochondria near the outer cell wall; they cluster densely (red coloration) as they approach the cell's nucleus (depicted here as the black central hole). In the cancerous cell, the mitochondria are spread throughout the cell, do not cluster, and under the same lighting produce a more subdued effect.
What is the impact of genetically modified organisms?
What's spam and what isn't?
What makes people happy?
What "circles" of friends do you have?
What should we recommend to a customer/user?
What topics exist in a collection of texts, and how do they evolve?
News texts, scientific publications, …
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
Type of inference used before in this course - example
foaf:mbox
  domain: Agent
  range: Thing (well, in fact a Mailbox)
  an inverse functional property
If
mbox(MaryPoppins, [email protected]) and mbox(PeterParker, [email protected]),
then
MaryPoppins = PeterParker
Styles of reasoning: "All swans are white"
Deductive (towards the consequences):
  All swans are white.
  Tessa is a swan.
  Therefore: Tessa is white.
Inductive (towards a generalisation of observations):
  Joe and Lisa and Tex and Wili and ... (all observed swans) are swans.
  Joe and Lisa and Tex and Wili and ... (all observed swans) are white.
  Therefore: All swans are white.
Abductive (towards the most likely explanation of an observation):
  Tessa is white.
  All swans are white.
  Therefore: Tessa is a swan.
What about truth?
Deductive:
  Given the truth of the assumptions, a valid deduction guarantees the truth of the conclusion.
Inductive:
  The premises of an argument (are believed to) support the conclusion but do not ensure it.
  Has been attacked several times by logicians and philosophers.
Abductive:
  Formally equivalent to the logical fallacy of affirming the consequent.
What about new knowledge?
C.S. Peirce:
Introduced "abduction" to modern logic.
(after 1900): used "abduction" to mean creating new rules to explain new observations (this meaning is actually closest to induction).
"Abduction is the only logical process that actually creates anything new."
essential for scientific discovery
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
"Data mining" and "knowledge discovery"
(informal definition):
data mining is about discovering knowledge in (huge amounts of) data
Therefore, it is clearer to speak about “knowledge discovery in data(bases)” (KDD)
Second reason for preferring the term “KDD”:
“data mining” is not uniquely defined:
Some people use it to denote certain types of knowledge discovery (e.g., finding association rules, but not classifier learning)
"Data mining" is generally inductive
(informal definition):
data mining is about discovering knowledge in (huge amounts of) data :
... Looking at all the empirically observed swans ...
... Finding they are white
... Concluding that swans are white
The KDD process
"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" - Fayyad, Piatetsky-Shapiro, & Smyth (1996)
non-trivial process: a multi-step process
valid: justified patterns/models
novel: previously unknown
useful: can be acted upon
understandable: by human and machine
The process part of knowledge discovery
CRISP-DM
  CRoss Industry Standard Process for Data Mining
  a data mining process model that describes commonly used approaches that expert data miners use to tackle problems
Knowledge discovery, machine learning, data mining
Knowledge discovery
= the whole process
Machine learning
= the application of induction algorithms and other algorithms that can be said to "learn"
= the "modeling" phase
Data mining
sometimes = KD, sometimes = ML
Data Mining: Confluence of Multiple Disciplines
Data mining lies at the confluence of database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
From data to general patterns
Example: cancerous cell data
Classification: "What factors determine cancerous cells?"
A mining algorithm (here: a classification algorithm) extracts the patterns, e.g.:
  rule induction
  decision trees
  neural networks
Classification: Rule Induction
"What factors determine whether a cell is cancerous?"
If Color = light and Tails = 1 and Nuclei = 2 then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and Nuclei = 2 then Cancerous Cell (certainty = 87%)
Classification: Decision Trees
[Figure: a decision tree that first splits on Color (light/dark), then on #nuclei (1/2) and #tails (1/2), with leaves labelled healthy or cancerous.]
Classification: Neural Networks
"What factors determine whether a cell is cancerous?"
[Figure: a neural network mapping inputs such as Color = dark, # nuclei = 1, …, # tails = 2 to the outputs Healthy / Cancerous.]
Clustering
"Are there clusters of similar cells?"
[Figure: four clusters: light color with 1 nucleus; dark color with 2 tails and 2 nuclei; 1 nucleus and 1 tail; dark color with 1 tail and 2 nuclei.]
Association Rule Discovery
Task: discovering association rules among items in a transaction database.
An association between two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.
In general: A1, A2, … => B
"Are there any associations between the characteristics of the cells?"
If Color = light and # nuclei = 1 then # tails = 1 (support = 12.5%; confidence = 50%)
If # nuclei = 2 and Cell = Cancerous then # tails = 2 (support = 25%; confidence = 100%)
If # tails = 1 then Color = light (support = 37.5%; confidence = 75%)
Many Other Data Mining Techniques
genetic algorithms, statistics, Bayesian networks, rough sets, time series, text mining, …
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
The basic idea of clustering: group similar things
[Figure: two groups of points (Group 1, Group 2) in a two-dimensional attribute space, Attribute 1 vs. Attribute 2.]
Concepts in Clustering
Defining distance between points
Euclidean distance: d(Q, R) = ||Q − R||
any other distance (city-block metric, Levenshtein, Jaccard similarity, ...)
A good clustering is one where
(Intra-cluster distance) the sum of distances between objects in the same cluster are minimized,
(Inter-cluster distance) while the distances between different clusters are maximized
Objective to minimize: F(Intra,Inter)
Clusters can be evaluated with “internal” as well as “external” measures
Internal measures are related to the inter/intra cluster distance
External measures are related to how well the current clusters represent the "true" classes
K Means Example (K=2)
Pick seeds
Reassign clusters
Compute centroids
Reassign clusters
Compute centroids
Reassign clusters
Converged!
Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt
K-means algorithm
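The algorithm itself appears only as a figure in the original slides. As a minimal sketch of the loop illustrated on the previous slide (function and variable names are mine, not from the slides):

```python
import random

def kmeans(points, k, iterations=100):
    """Plain k-means on 2-D points: pick seeds, then alternate between
    (re)assigning each point to its nearest centroid and recomputing
    centroids, until the assignment stops changing."""
    centroids = random.sample(points, k)   # pick seeds
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to the nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                              + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:   # converged!
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its cluster.
        for c in range(k):
            cluster = [p for p, a in zip(points, assignment) if a == c]
            if cluster:
                centroids[c] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, assignment
```

Note that the objective F(Intra, Inter) from the previous slide is only minimized locally: the result depends on the randomly picked seeds.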
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
Input data ... Q: when does this person play tennis?
Outlook  | Temp | Humidity | Windy | Play
Rainy    | Mild | High     | True  | No
Overcast | Hot  | Normal   | False | Yes
Overcast | Mild | High     | True  | Yes
Sunny    | Mild | Normal   | True  | Yes
Rainy    | Mild | Normal   | False | Yes
Sunny    | Cool | Normal   | False | Yes
Sunny    | Mild | High     | False | No
Overcast | Cool | Normal   | True  | Yes
Rainy    | Cool | Normal   | True  | No
Rainy    | Cool | Normal   | False | Yes
Rainy    | Mild | High     | False | Yes
Overcast | Hot  | High     | False | Yes
Sunny    | Hot  | High     | True  | No
Sunny    | Hot  | High     | False | No
Terminology (using a popular data example)
[Weather data table as on the previous slide.]
Rows: instances (think of them as objects); here: days, described by:
Columns: features; here: Outlook, Temp, …
In this case, there is a feature with a special role:
  the class: Play (does X play tennis on this day?)
This is "relational DB mining". We will later see other types of data and the mining applied to them.
The goal: a decision tree for classification / prediction
In which weather will someone play (tennis etc.)?
Constructing decision trees
Strategy: top-down, in recursive divide-and-conquer fashion
First: select an attribute for the root node; create a branch for each possible attribute value
Then: split the instances into subsets, one for each branch extending from the node
Finally: repeat recursively for each branch, using only the instances that reach the branch
Stop if all instances have the same class
Which attribute to select?
Criterion for attribute selection
Which is the best attribute?
  Want to get the smallest tree
  Heuristic: choose the attribute that produces the "purest" nodes
Popular impurity criterion: information gain
  Information gain increases with the average purity of the subsets
  Strategy: choose the attribute that gives the greatest information gain
Computing information
Measure information in bits
  Given a probability distribution, the info required to predict an event is the distribution's entropy
  Entropy gives the information required in bits (can involve fractions of bits!)
Formula for computing the entropy:
  entropy(p1, p2, …, pn) = −p1 log2(p1) − p2 log2(p2) − … − pn log2(pn)
Example: attribute Outlook
info([4,0]) = entropy(1, 0) = −1 log(1) − 0 log(0) = 0 bits
info([2,3]) = entropy(3/5, 2/5) = −3/5 log(3/5) − 2/5 log(2/5) = 0.971 bits
info([3,2], [4,0], [3,2]) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.693 bits
Computing information gain
Information gain: information before splitting − information after splitting
gain(Outlook) = info([9,5]) − info([2,3],[4,0],[3,2]) = 0.940 − 0.693 = 0.247 bits
Information gain for the attributes from the weather data:
  gain(Outlook) = 0.247 bits
  gain(Temperature) = 0.029 bits
  gain(Humidity) = 0.152 bits
  gain(Windy) = 0.048 bits
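These numbers can be checked mechanically. A short sketch (not from the slides; helper names are mine) that recomputes info([9,5]), the weighted information after splitting on Outlook, and the resulting gain:

```python
from math import log2

def entropy(counts):
    """info([c1, c2, ...]) in bits: entropy of the class distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def split_info(partitions):
    """Weighted average entropy after splitting on an attribute."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

# Splitting the 14 weather instances on Outlook gives the [yes, no]
# counts: Sunny -> [2, 3], Overcast -> [4, 0], Rainy -> [3, 2].
before = entropy([9, 5])                      # info([9,5]) = 0.940 bits
after = split_info([[2, 3], [4, 0], [3, 2]])  # 0.693 bits
gain_outlook = before - after                 # 0.247 bits
print(round(gain_outlook, 3))
```

Replacing the partition counts with those for Temperature, Humidity, and Windy reproduces the other three gains.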
Continuing to split
gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits
Final decision tree
Note: not all leaves need to be pure; sometimes identical instances have different classes
Splitting stops when data can’t be split any further
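The final tree is shown only as a figure in the source. Assuming the tree usually reported for this dataset (Outlook at the root, Humidity under Sunny, Windy under Rainy), it can be written directly as nested conditions:

```python
def will_play(outlook, humidity, windy):
    """Decision tree commonly learned from the weather data:
    Outlook at the root, Humidity below Sunny, Windy below Rainy."""
    if outlook == "Overcast":
        return True                  # all Overcast days have Play = Yes
    if outlook == "Sunny":
        return humidity == "Normal"  # High humidity -> No
    return not windy                 # Rainy: play only when not windy
```

Note that Temperature does not appear at all: no split on it ever gave the greatest information gain.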
Wishlist for a purity measure
Properties we require from a purity measure:
  When a node is pure, the measure should be zero
  When impurity is maximal (i.e. all classes equally likely), the measure should be maximal
  The measure should obey the multistage property (i.e. decisions can be made in several stages)
Properties of the entropy
The multistage property:
  entropy(p, q, r) = entropy(p, q + r) + (q + r) × entropy(q / (q + r), r / (q + r))
Simplification of computation, e.g.:
  info([2,3,4]) = −(2/9) log(2/9) − (3/9) log(3/9) − (4/9) log(4/9)
                = [−2 log 2 − 3 log 3 − 4 log 4 + 9 log 9] / 9
Note: instead of maximizing info gain we could just minimize information
Discussion / outlook: decision trees
Top-down induction of decision trees: ID3, an algorithm developed by Ross Quinlan
Various improvements, e.g. C4.5:
  deals with numeric attributes, missing values, noisy data
  uses gain ratio instead of information gain [see Witten & Frank slides, ch. 4, pp. 40-45]
Similar approach: CART, …
Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
Motivation for association-rule learning/mining: store layout (Amazon, earlier: Wal-Mart, ...)
Where to put: spaghetti, butter?
Data
"Market basket data": attributes with boolean domains
In a table, each row is a basket (aka transaction):
Transaction ID | Attributes (basket items)
1 | spaghetti, tomato sauce
2 | spaghetti, bread
3 | spaghetti, tomato sauce, bread
4 | bread, butter
5 | bread, tomato sauce
Solution approach: The apriori principle and the pruning of the search tree (1)
[Itemset lattice:]
Level 1: spaghetti; tomato sauce; bread; butter
Level 2: {spaghetti, tomato sauce}; {spaghetti, bread}; {spaghetti, butter}; {tomato sauce, bread}; {tomato sauce, butter}; {bread, butter}
Level 3: {spaghetti, tomato sauce, bread}; {spaghetti, tomato sauce, butter}; {spaghetti, bread, butter}; {tomato sauce, bread, butter}
Level 4: {spaghetti, tomato sauce, bread, butter}
Solution approach: The apriori principle and the pruning of the search tree (2)
Solution approach: The apriori principle and the pruning of the search tree (3)
Solution approach: The apriori principle and the pruning of the search tree (4)
More formally: Generating large k-itemsets with Apriori
Min. support = 40%
step 1: candidate 1-itemsets
spaghetti: support = 3 (60%)
tomato sauce: support = 3 (60%)
bread: support = 4 (80%)
butter: support = 1 (20%)
Contd.
step 2: large 1-itemsets
Spaghetti
tomato sauce
bread
candidate 2-itemsets
{Spaghetti, tomato sauce}: support = 2 (40%)
{Spaghetti, bread}: support = 2 (40%)
{tomato sauce, bread}: support = 2 (40%)
step 3: large 2-itemsets
{Spaghetti, tomato sauce}
{Spaghetti, bread}
{tomato sauce, bread}
candidate 3-itemsets
{Spaghetti, tomato sauce, bread}: support = 1 (20%)
step 4: large 3-itemsets
{ }
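The level-wise search above can be sketched in Python (an illustration, not the slides' code; function and variable names are mine), using the same five baskets and a minimum support of 2 transactions (40%):

```python
from itertools import combinations

baskets = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def support(itemset):
    """Number of baskets containing all items of the itemset."""
    return sum(itemset <= b for b in baskets)

def apriori(min_support):
    """Level-wise Apriori: a k-itemset is a candidate only if all of
    its (k-1)-subsets were large; candidates are then counted."""
    items = sorted({i for b in baskets for i in b})
    large = [frozenset([i]) for i in items if support({i}) >= min_support]
    all_large, k = list(large), 2
    while large:
        # Candidate k-itemsets: unions of large (k-1)-itemsets ...
        candidates = {a | b for a in large for b in large if len(a | b) == k}
        # ... pruned by the apriori principle, then checked for support.
        large = [c for c in candidates
                 if all(frozenset(s) in set(all_large)
                        for s in combinations(c, k - 1))
                 and support(c) >= min_support]
        all_large += large
        k += 1
    return all_large

for itemset in apriori(min_support=2):
    print(sorted(itemset), support(itemset))
```

On the example data this yields exactly the large itemsets derived above: three large 1-itemsets, three large 2-itemsets, and no large 3-itemsets.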
From itemsets to association rules
Schema: if subset, then large k-itemset (with support s and confidence c)
  s = (support of the large k-itemset) / #tuples
  c = (support of the large k-itemset) / (support of the subset)
Example:
If {spaghetti} then {spaghetti, tomato sauce}
  Support: s = 2/5 (40%)
  Confidence: c = 2/3 (66%)
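Continuing the example, a rule's support and confidence follow directly from the basket counts. A sketch with hypothetical helper names (the slide truncates 2/3 to 66%; rounding gives 67%):

```python
baskets = [
    {"spaghetti", "tomato sauce"},
    {"spaghetti", "bread"},
    {"spaghetti", "tomato sauce", "bread"},
    {"bread", "butter"},
    {"bread", "tomato sauce"},
]

def rule_stats(antecedent, itemset):
    """Support and confidence of the rule: if `antecedent`, then
    `itemset`, where the antecedent is a subset of the large itemset."""
    count = lambda s: sum(s <= b for b in baskets)
    support = count(itemset) / len(baskets)        # fraction of all baskets
    confidence = count(itemset) / count(antecedent)  # conditional frequency
    return support, confidence

# If {spaghetti} then {spaghetti, tomato sauce}
s, c = rule_stats({"spaghetti"}, {"spaghetti", "tomato sauce"})
print(f"support = {s:.0%}, confidence = {c:.0%}")
```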
Outlook
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques
  Clustering: k-means
  Classification (classifier learning): ID3
  Association-rule learning: apriori
Text mining
References / background reading
Knowledge discovery is now an established area with some excellent general textbooks. I recommend the following as examples of the 3 main perspectives:
a databases / data warehouses perspective: Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco,CA: Morgan Kaufmann. http://www.cs.sfu.ca/%7Ehan/dmbook
a machine learning perspective: Witten, I.H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
a statistics perspective: Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press. http://mitpress.mit.edu/catalog/item/default.asp?tid=3520&ttype=2
The CRISP-DM manual can be found at http://www.spss.ch/upload/1107356429_CrispDM1.0.pdf
Acknowledgements
The overview of data mining was taken from (with minor modifications):
Tzacheva, A.A. (2006). SIMS 422. Knowledge Inference Systems & Applications. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewI.ppt
Tzacheva, A.A. (2006). Knowledge Discovery and Data Mining. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewII.ppt
p. 21 was taken from Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques — Chapter 1 — Introduction. http://www.cs.sfu.ca/%7Ehan/bk/1intro.ppt
The ID3 part is based on Witten, I.H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
In particular, the instructor slides for that book available at http://books.elsevier.com/companions/9780120884070/ (chapters 1-4):
http://books.elsevier.com/companions/9780120884070/revisionnotes/01~PDFs/chapter1.pdf (and ...chapter2.pdf, chapter3.pdf, chapter4.pdf) or
http://books.elsevier.com/companions/9780120884070/revisionnotes/02~ODP%20Files/chapter1.odp (and ...chapter2.odp, chapter3.odp, chapter4.odp)