1
Data Mining
Chapter 1
Kirk Scott
2
Iris virginica
3
Iris versicolor
4
Iris setosa
5
1.1 Data Mining and Machine Learning
6
Definition of Data Mining
• The process of discovering patterns in data.
• (The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one.)
• Useful patterns allow us to make predictions on new data.
7
Automatic or Semi-Automatic Pattern Discovery
• Pattern discovery by, or with the help of computers is of interest
• Hence machine “learning”
• The “learning” part comes from the algorithms used
• Stay tuned for a brief discussion of whether this is learning
8
Expression of Patterns
• 1. Black box
• 2. Transparent box
• A transparent box expression reveals the structure of the pattern
• The structure can be examined, reasoned about, and used to inform future decisions
9
Black Box vs. Transparent Box
• Black vs. transparent box is not a trivial distinction
• Some modern computational approaches are black box in nature
• They seek “answers” without necessarily revealing the structure of the problem at hand
10
Genetic Algorithms
• The book in particular says that genetic algorithms are beyond the realm of consideration
• They are explicitly designed for optimization, not the revelation of structure
• They exemplify black box thinking
• They give a single (sub)optimal answer without providing additional information about the problem being optimized
11
What is the book about?
• Techniques for finding and describing structural patterns in data
• Description of the patterns is an inherent part of data mining as it will be considered
• Techniques that lead to black box predictors will not be considered
12
Describing Structural Patterns
• First example:
• Contact lens table, page 6 in textbook
• The first 4 columns are effectively 4 factors under consideration
• The 5th column is the result
• Inspection reveals that this is essentially a comprehensive listing of all possible combinations
13
14
• The information in the table could be stored syntactically in the form of a set of rules
• For example:
• If tear production rate = reduced then recommendation = none
• Otherwise, if age = young and astigmatic = no then recommendation = soft
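The two rules above can be sketched as code. This is only an illustrative sketch; the function and the fallback behavior are invented here, not taken from the book:

```python
def recommend_lenses(age, spectacle_prescription, astigmatic, tear_production_rate):
    """Sketch of the two example rules above. The remaining cases
    would need further rules, so we fall back to None here."""
    if tear_production_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no":
        return "soft"
    return None  # not covered by the two rules shown

print(recommend_lenses("young", "myope", "no", "reduced"))  # none
print(recommend_lenses("young", "myope", "no", "normal"))   # soft
```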
15
• The same information could be encapsulated graphically in a decision tree
• The representation is up to the person working with the scenario
• The point is that the ability to represent signifies that the structure has been revealed
16
Input and Output
• The 4 factors represent input
• The 5th column represents output, or “prediction”
• Overall, this distinction is typically present in a data mining problem
17
Completeness
• This example is simplistic because it is complete
• In interesting, practical, applied problems, not all combinations may be present
• Individual data values will be missing
• The goal of data mining is to be able to generalize from the given data so that correct predictions can be made for other cases where the results aren’t specified
18
Machine Learning
• High level learning in humans seems to presuppose self-conscious adaptability
• Whether or not machines are capable of learning in the human sense is open to question
• No current machine/software combination demonstrates behavior exactly analogous to the behavior seen in biological systems
19
• My personal take on this is that the phrase “machine learning” is an unfortunate example of inflated naming
• The phrase “data mining” is neutral
• To ask whether a machine can mine is not as fraught with difficulty as asking whether it can learn
20
• ACM publishes a research journal with the title “Transactions on Knowledge Discovery from Data”
• Similarly, to ask whether a machine can discover is not as fraught with difficulty as asking whether it can learn
21
• “Machine learning” isn’t as inflated as “Artificial intelligence”
• It is a step back from that level of hype
• In the field of artificial intelligence, researchers have scaled back their expectations of what they might accomplish
• They have not been able to mimic general human intelligence
22
• If you look at data mining from an AI point of view, you might see it as a step along the road to mimicking learning and term it machine learning
• If you come from a database management systems background you might say the point of view is results oriented rather than process oriented
23
• The data are not animate
• The machines are not animate
• The algorithms are not animate
24
• The human mind cannot readily discern some patterns, whether due to the quantity of the data or the subtlety and complexity of the patterns
• Human programmers devise algorithms which are able to discern such patterns
• This is not so different from devising an algorithm to solve any problem which a computer might solve more easily than a human being
25
Naming Problems in Math
• Consider these terms from math:
• Imaginary numbers
• Complex numbers
• Chaos theory
• I’ve always marveled at how harmful tendentious naming can be to achieving understanding of what’s going on
• You might as well talk about magic spells
26
A Biological Perspective
• Although I don’t worship at the church of St. Charles of Darwin, the biologists make an interesting point:
• When reasoning about animals, it’s a mistake to anthropomorphize
• In other words, don’t ascribe human characteristics to non-human organisms
27
• The shortcoming of this point of view is that biologists tend towards the rigidity of theologians:
• Animals are not like us; they are no more than machines
• A dog cannot feel anything approaching human emotion, etc., etc.
28
• Whatever the truth of the emotional state of dogs, isn’t this a valid question:
• Why do certain computer scientists persist in naming technologies in such a way that they seem to be anthropomorphizing machines?
• The day may come when there is truth to this, but isn’t it a wee bit premature?
29
• Ideas on this question?
30
Practically Speaking, What Does Data Mining Consist of?
• Algorithms that examine data sets and extract knowledge
• This time the authors have chosen the relatively neutral words examine and extract
• The word knowledge is admittedly a bit tendentious itself
31
• In more detail, the idea is simply this:
• The computer is programmed to implement an algorithm
• The algorithm is designed to run through or process all of a data set
• The goal of the algorithm is to summarize the data set
32
• The summary is a generalization
• For our purposes, the summary consists of inferences that can be drawn about the data set
• The inferences may be between values for the attributes of a given data point
• They may also be about the relationships among various data points in the set
33
• The contact lens data set illustrated the idea that an inference is a rule
• For our purposes, such a rule takes the form of “if x, then y”
• It is a comprehensive collection of such rules that summarizes the structure of the data points in the set
34
The Value of the Summary
• A set of rules is definitely useful for predictions
• Once again the distinction is made between black and transparent box techniques
• The structural description is also of great value because it helps the human understand the data and what it means
35
• In this sense, at least, a degree of learning is evident in the whole complex
• If the human consumer of the end product ultimately learns something previously unknown about the data set—
• Then the machine and algorithm must have genuinely learned something about the data that was previously unknown
36
• It may be of value to contrast this with black box systems again
• A black box system achieves results without a structural description
• This is still a valid question:
• Just because the end result is not human learning, does that in fact lessen in any way the degree of learning that the computer system achieves?
37
• There are no fixed answers to these side questions
• This is a 400 level course
• There’s no time like now to at least think about such things, before you step out the door with your sparkling new sheepskin
• Now, back to the grindstone
38
1.2 Simple Examples: The Weather and Other Problems
39
• The book introduces some comparatively simple (and unrealistic) data sets by way of further illustration
• Note that complex, real data sets tend to be proprietary anyway
• Databases and data sets are among the most valuable assets of organizations that have them
• This kind of stuff isn’t given away
40
The Weather Problem
• This is a fictitious data set
• Four symbolic categories (non-numeric attributes) determine whether a game should be played
• There are 36 possible combinations
• A table exists with only 14 of the combinations
• See the following overhead
41
An (Incomplete) Table of Data Point Values
42
Decision Lists
• More ideas about representing structure with rules
• Sets of rules can be arranged in order
• You work your way through the list
• You accept the outcome of a rule if it applies
• Otherwise you continue down the list
• This is called a decision list
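The decision-list idea can be sketched as an ordered list of (condition, outcome) pairs. The specific rules below are hypothetical examples in the spirit of the weather data, not the book's rule set:

```python
# A decision list: an ordered sequence of (condition, outcome) pairs.
# Earlier rules take precedence; later rules implicitly assume the
# earlier ones failed. The last rule is a catch-all default.
rules = [
    (lambda d: d["outlook"] == "sunny" and d["humidity"] == "high", "no"),
    (lambda d: d["outlook"] == "rainy" and d["windy"] == "true", "no"),
    (lambda d: True, "yes"),  # default rule at the end of the list
]

def classify(day):
    for condition, outcome in rules:
        if condition(day):      # accept the first rule that applies
            return outcome

day = {"outlook": "sunny", "humidity": "high", "windy": "false"}
print(classify(day))  # no
```

Note that the second rule, taken in isolation, says nothing about sunny days; it only gives the correct result because the first rule is consulted before it.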
43
• The following rules represent the idea
• No claim is made that this set of rules is complete or necessarily very good
44
• In an ordered list of rules:
• Each rule is potentially (most likely) fragmentary
• It doesn’t include every factor
• Because they are ordered, the rules depend on each other
• Individual rules, applied in isolation, do not necessarily (most likely don’t) give the correct result
45
Handling Numeric Data
• If all of the determining attributes are numeric, you can refer to a numeric attribute problem
• If some are numeric and some are categorical, you can refer to a mixed attribute problem
• Consider the table of the weather problem with some numerical attributes on the following overhead
46
47
• If numeric attributes are present, the decision rules typically become inequalities rather than equalities
• For example
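As a sketch of what such a rule might look like in code, with a humidity threshold that is a hypothetical value invented here for illustration:

```python
# With numeric attributes, rule conditions become inequalities
# rather than equality tests on categories.
HUMIDITY_THRESHOLD = 83  # hypothetical cutoff for illustration

def play_rule(humidity):
    if humidity > HUMIDITY_THRESHOLD:
        return "no"
    return "yes"

print(play_rule(90))  # no
print(play_rule(70))  # yes
```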
48
Classification vs. Association
• The premise of the foregoing discussion:
• A set of independent variables determines the value of a dependent variable
• This is reminiscent of multivariate statistical regression
• It is classification
• If relationships can be found among the supposedly independent variables, this is association
49
• These are examples of association rules taken from the original weather data set
• It is not hard to imagine that outlook, temperature, humidity, and wind depend on each other in whole or in part
50
• The data set is fictitious
• The data set is not exhaustive
• Unlike the contact lens example, it is not a list of all possible combinations
• It is presumably based on some set of real observations
51
• The book observes that many possible association rules can be derived from the data set
• Many would be true for 100% of the data points in the set
• Many others would be true for a high percent of the data points in the set
52
• In summary, the ability to derive association rules depends on several things
• The variables are dependent in reality
• The data set is representative of that reality
• The scheme for eliciting the structure produces a good model
53
Contact Lenses: An Idealized Problem
• This subsection of the book presents the problem again, which I will not repeat
• Recall again that the data set was a complete list of all possible combinations
• The new part considers a complete set of rules derived from the data set
• See the following overhead
54
55
Representing Structure
• Converting an exhaustive listing of all possible cases into a complete set of rules represents the structure of the problem
• Questions to consider:
• Would a smaller set of rules be possible?
• In practice, would we be willing to accept a smaller rule set, even though this might be an imperfect representation of the structure?
56
A Decision Tree
• The figure on the following overhead shows a decision tree for the data
• This is an alternative, graphical or visual representation of the structure
• In summary, it illustrates 3 binary decisions in a row
• As a matter of fact, this is a simplification
• It doesn’t correctly classify 2 of the cases
57
58
Irises: A Classic Numeric Dataset
• In this problem, the attributes are numeric
• However, the result is still categorical classification
• See the table illustrating the data set and a sample set of rules on the following overheads
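Such rules can be sketched in code as a chain of inequality tests on the four numeric attributes. The thresholds below are plausible for the iris data but are invented here for illustration, not taken from the book's rule set:

```python
def classify_iris(sepal_length, sepal_width, petal_length, petal_width):
    """Sketch of inequality rules classifying irises from numeric
    attributes (measurements in cm; thresholds are illustrative)."""
    if petal_length < 2.45:
        return "Iris setosa"
    if petal_width < 1.75:
        return "Iris versicolor"
    return "Iris virginica"

print(classify_iris(5.1, 3.5, 1.4, 0.2))  # Iris setosa
```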
59
60
61
CPU Performance: Introducing Numeric Prediction
• Regression was mentioned earlier
• In general, it takes this form:
• y = b1x1 + b2x2 + … + bnxn + c
• An example of a data set with a numeric result is shown on the following overhead
• Being able to derive an equation is nice
• Numeric data sets may yield data mining results whose structure can’t be represented by a single equation
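As a minimal one-variable sketch of fitting an equation of this form by least squares, using made-up data and the standard closed-form formulas (not any particular method from the book):

```python
def fit_line(xs, ys):
    """Fit y = b*x + c by ordinary least squares:
    b = cov(x, y) / var(x), c = mean(y) - b * mean(x)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    c = mean_y - b * mean_x
    return b, c

b, c = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # data lies exactly on y = 2x + 1
print(b, c)  # 2.0 1.0
```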
62
63
Labor Negotiations: A More Realistic Example
• This is a mixed attribute problem
• There are missing attribute values in some data points
• It is based on actual results of Canadian labor negotiations
64
• The classification is of an “acceptable” labor contract
• The meaning of “acceptable”, i.e., the classification, is open to interpretation
• For all of these reasons, this is a more realistic example than the others presented so far
65
• A table for this example is shown on the following overhead
• For presentation purposes the attributes are shown down the left hand side
• In other words, the attributes are the rows, not the columns
• The 40 cases are abbreviated with ellipses in the columns
66
67
Decision Trees, Again
• The book gives two possible decision trees for the contract data, tree (a) and tree (b)
• (a) is simpler than (b)
• (a) is also somewhat less accurate than (b)
• (a) was obtained by trimming (b)
• See the figures on the following overhead
68
69
• The trees represent structure
• This makes it possible to reason about what the data/structure mean
• Because the trees differ, it is possible to draw inferences about the differences in how they define or represent the structure of the problem
70
• Consider the bottom left 3 options in tree (b)
• Why is half of healthcare good, but both none and full are bad?
• This comes down to the question of the meaning of a good, or acceptable contract
71
• In negotiations, both labor and management have to agree
• Half of healthcare may be “good” because it represents an option that both parties can accept
• In other words, compromise is a good outcome
72
• On the other hand, it’s also possible that the derived model is too dependent on the data set
• The book introduces the terminology “overtrained”
• If the data set were expanded to include more cases, we might find that the structure we think is there is not reflected in a more comprehensive view of the problem space
73
Soybean Classification: A Classic Machine Learning Success
• The soybean classification problem was based on an initial data set of
• 680 cases
• 35 attributes
• 19 disease categories (the classification)
• A human expert’s diagnosis of the cases
• See the table on the following overhead
74
75
• A subset of ~300 cases was selected from the overall data set
• The subset was selected so that the data points were spread out through the space
• Based on the subset, a set of rules could be derived that correctly diagnosed plant diseases based on the 35 attributes 97.5% of the time
76
• The expert who collaborated and provided diagnoses also participated in a rule-building exercise
• Instead of doing automated pattern discovery, the computer people worked with the domain expert to build a set of classification rules based on human understanding of the heuristics used
77
• Back in the day, this was the human side of artificial intelligence
• I don’t know if it’s still commonly used
• In any case, the expert-derived rule set was only successful in classifying 72% of the time
78
• In short, an automated system for discovering patterns was more successful than a human based system
• This is nice, but not overwhelming
• One of the strengths of computing is the ability to manage large amounts of data
• This is true in many areas and it’s not surprising that it’s also true in pattern discovery
79
• Continuing the philosophical discussion from earlier, the algorithms for discovering the patterns were still devised by humans
• Meta-pattern discovery would be the next step up the food chain
• Can algorithms be devised that allow programs to devise new algorithms for diverse purposes?
80
1.3 Fielded Applications
81
Web Mining
• The relevance of pages to queries can be ranked by applying data mining techniques
• User clicks/choices can be mined to select suitable ads to display
• User clicks/choices can be mined to select suitable products to suggest
• Social network pages and other online information can be mined to form profiles and target users
82
Decisions Involving Judgment
• Lending money is both profitable and risky
• Data mining can be used to assess risk vs. reward
• First cut analysis can be done using statistics
• People above threshold x are accepted, people below threshold y are rejected
83
• What to do with the 10% in the middle?
• These are potentially “good” customers
• In other words, people who are borderline financially are likely to be a good pool of customers to market loans to
• How to determine who among them is a good risk?
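The two-threshold first cut can be sketched as follows; the score scale and threshold values are hypothetical, invented here for illustration:

```python
# First cut triage by statistical score: clear accepts above threshold x,
# clear rejects below threshold y, and a middle band referred to a
# data-mined model for the harder judgment call.
ACCEPT_ABOVE = 0.8   # threshold x (hypothetical)
REJECT_BELOW = 0.4   # threshold y (hypothetical)

def triage(score):
    if score >= ACCEPT_ABOVE:
        return "accept"
    if score < REJECT_BELOW:
        return "reject"
    return "refer to data mining model"

print(triage(0.9))  # accept
print(triage(0.6))  # refer to data mining model
```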
84
• A project was done on 1000 cases with ~20 attributes
• The data mined system had a success rate of 2/3 in predicting repayment of loans
• Professional judgment had a success rate of ½
• If you can do better than flipping a coin, go with it
85
Screening Images
• Basic problem: Identify oil slicks from satellite images
• What was fielded:
• A system designed to process satellite images
• The system could then be trained with sample data to identify slicks
86
• Fielding a trainable, as opposed to pre-trained, system meant that:
• It was customizable by the end user
• In particular, the user could tune the undetected spill vs. false alarm rate
• (This is what statisticians refer to as type I and type II error)
87
Load Forecasting
• The basic problem here is predicting demand for electricity
• In general it depends on time of day, day of week, and season
• A model was built and enhanced with input based on current weather conditions
• The goal was hourly prediction of demand two days in advance
88
• The book’s description makes this sound like more of an analytical and statistical model and less of a data mining problem
• It certainly includes traditional elements
• On the other hand, traditional elements are valid as larger or smaller elements of overall data mining
89
Diagnosis
• This refers to the diagnosis of problems in mechanical equipment
• A traditional approach might rely on eliciting rules from an expert
• This can be costly and time-consuming
• It can also be less accurate than what results from data mining
90
• Specific example:
• ~300 examples of mechanical faults
• The goal was not to classify fault/no fault
• The goal was to identify fault type when there was a fault
• The book seems to think this is an important example as much for the human factor as the technical factor
91
• The system mined data
• The expert wasn’t satisfied
• The data was manipulated until the rule set that resulted satisfied the expert
• The expert was not satisfied with rules that did not conform to his understanding of the problem domain
92
• He was satisfied with a rule set that was consistent with his understanding of the domain
• The derived rule set got better performance
• It expanded the expert’s understanding of the domain
93
• In short, the expert was not satisfied with a black box solution
• The expert didn’t trust results based on improved performance alone
• The expert would only accept the results when the structural representation derived by the system was understandable
94
Marketing and Sales
• Overall, this is a big area of data mining application
• Banks and cell phone companies are good examples
• They involve large sums of money and they keep extensive transaction records
95
• They can classify customers
• They can target customers to keep them from changing provider
• They can target customers who have behavior patterns that make them likely candidates for profitable services
96
• Market basket analysis is another area of data mining application
• Grocery stores are the classic example
• Your card allows them to profile you as a customer
• Whether or not you individually are a highly profitable customer, your profile in aggregate with others is valuable
97
• Classification depends on a collection of profiles
• Once classification is done, potentially profitable customers can be targeted with tailored offers
98
• Direct marketing/direct mail/telemarketing/online marketing is a profitable area for data mining
• A brick and mortar store may have virtually no information on its customers
• Once direct marketers have made a sale, they potentially have extensive information on buyers
99
• Direct marketers certainly solicit repeat customers
• They also have a virtually limitless potential market
• How do they avoid wasting money contacting unlikely prospects?
• How do they target marketing only towards likely customers?
100
• Profile your customers
• Once again, one customer’s profile in isolation may not be particularly valuable
• However, compare these profiles with the general population
• Use this information to identify a high probability market segment
101
Other Applications
• This is a grab bag, without detail
• There’s no reason to even list all of these examples
• The foregoing examples were reasonably representative
102
1.4 Machine Learning and Statistics
103
• This can be summarized with a simple observation:
• Data mining is an area where CS and statistics converge
• Statistical thinking and techniques are an integral part of data mining
104
• There is a continuum between classical statistical techniques, like multivariate regression, and the most recent, computer-aided techniques
• It’s not either/or
• The middle ground is a mix
105
1.5 Generalization as Search
• This is an optional section
• I’m not going to cover it
106
1.6 Data Mining and Ethics
• In summary:
• If you’re using data to make decisions, you should consider whether the decisions are based on factors that raise ethical or legal concerns, like race, creed, sex, national origin, sexual preference, etc.
107
Reidentification
• You can read the details in the book
• The idea is this:
• Even if data is “anonymized”, quite frequently the remaining attributes can be used to infer identity
108
• This is a conundrum
• Completely anonymizing would remove attributes
• At some point you would remove so much that a data set would be of limited or no value
109
• Note that this is really a database and security question
• A structural representation of a problem by definition is more abstract than individual cases
• The problem arises when you post data sets that have supposedly been anonymized by the removal of key attributes
110
Using Personal Information
• This is also fundamentally a database and security problem
• Here is the data mining twist:
• When a user submits data and clicks “I accept” on the use policy, have they meaningfully been informed that they may be data mined?
111
• Mining data and making decisions based on it may affect that user or others
• It is not the same as static data collection and dissemination
112
Wider Issues
• The book doesn’t express it this way, but these are questions that statisticians have long faced:
• A result may be statistically significant, but is it of practical significance?
• If so, what?
113
• As noted earlier, what are the chances and consequences of false positives and false negatives?
• Also, if you can find correlations (associations), do they imply cause and effect, either one way or the other?
• Be leery of falsely inferring cause and effect, which statistics alone can’t establish
114
• The eternal question is, if you gain new knowledge, how do you use it?
• Do you use your powers for good or evil?
• Who defines good and evil?
115
1.7 Further Reading
• In general, I’m not going to cover the further reading sections
• You may want to look at them to get ideas for your paper or project
116
The End