
1

Data Mining
Chapter 1

Kirk Scott

2

Iris virginica

3

Iris versicolor

4

Iris setosa

5

1.1 Data Mining and Machine Learning

6

Definition of Data Mining

• The process of discovering patterns in data.

• (The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one.)

• Useful patterns allow us to make predictions on new data.

7

Automatic or Semi-Automatic Pattern Discovery

• Pattern discovery by, or with the help of, computers is of interest

• Hence machine “learning”

• The “learning” part comes from the algorithms used

• Stay tuned for a brief discussion of whether this is learning

8

Expression of Patterns

• 1. Black box

• 2. Transparent box

• A transparent box expression reveals the structure of the pattern

• The structure can be examined, reasoned about, and used to inform future decisions

9

Black Box vs. Transparent Box

• Black vs. transparent box is not a trivial distinction

• Some modern computational approaches are black box in nature

• They seek “answers” without necessarily revealing the structure of the problem at hand

10

Genetic Algorithms

• The book in particular says that genetic algorithms are beyond the realm of consideration

• They are explicitly designed for optimization, not the revelation of structure

• They exemplify black box thinking

• They give a single (sub)optimal answer without providing additional information about the problem being optimized

11

What is the book about?

• Techniques for finding and describing structural patterns in data

• Description of the patterns is an inherent part of data mining as it will be considered

• Techniques that lead to black box predictors will not be considered

12

Describing Structural Patterns

• First example:

• Contact lens table, page 6 in textbook

• The first 5 columns are effectively 5 factors under consideration

• The 6th column is the result

• Inspection reveals that this is essentially a comprehensive listing of all possible combinations

13

14

• The information in the table could be stored syntactically in the form of a set of rules

• For example:

• If tear production rate = reduced then recommendation = none

• Otherwise, if age = young and astigmatic = no then recommendation = soft
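
As a minimal sketch (illustrative only; the attribute names and string values are assumptions, and this is not code from the textbook), those two rules could be expressed as an ordered if/then chain:

```python
# Illustrative sketch only: the two example rules expressed as code.
# Attribute names and values are assumptions, not the book's notation.

def recommend(age, astigmatic, tear_production_rate):
    """Return a contact lens recommendation for one case."""
    if tear_production_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no":
        return "soft"
    return "undetermined"  # the remaining cases would need further rules

print(recommend("young", "no", "normal"))   # -> soft
print(recommend("young", "no", "reduced"))  # -> none
```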

15

• The same information could be encapsulated graphically in a decision tree

• The representation is up to the person working with the scenario

• The point is that the ability to represent signifies that the structure has been revealed

16

Input and Output

• The 5 factors represent input

• The 6th column represents output, or “prediction”

• Overall, this distinction is typically present in a data mining problem

17

Completeness

• This example is simplistic because it is complete

• In interesting, practical, applied problems, not all combinations may be present

• Individual data values may also be missing

• The goal of data mining is to be able to generalize from the given data so that correct predictions can be made for other cases where the results aren’t specified

18

Machine Learning

• High level learning in humans seems to presuppose self-conscious adaptability

• Whether or not machines are capable of learning in the human sense is open to question

• No current machine/software combination demonstrates behavior exactly analogous to the behavior seen in biological systems

19

• My personal take on this is that the phrase “machine learning” is an unfortunate example of inflated naming

• The phrase “data mining” is neutral

• To ask whether a machine can mine is not as fraught with difficulty as asking whether it can learn

20

• IEEE publishes a research journal titled “Transactions on Knowledge and Data Engineering”, and ACM publishes one titled “Transactions on Knowledge Discovery from Data”

• Similarly, to ask whether a machine can discover is not as fraught with difficulty as asking whether it can learn

21

• “Machine learning” isn’t as inflated as “Artificial intelligence”

• It is a step back from that level of hype

• In the field of artificial intelligence, researchers have scaled back their expectations of what they might accomplish

• They have not been able to mimic general human intelligence

22

• If you look at data mining from an AI point of view, you might see it as a step along the road to mimicking learning and term it machine learning

• If you come from a database management systems background you might say the point of view is results oriented rather than process oriented

23

• The data are not animate

• The machines are not animate

• The algorithms are not animate

24

• The human mind cannot readily discern some patterns, whether due to the quantity of the data or the subtlety and complexity of the patterns

• Human programmers devise algorithms which are able to discern such patterns

• This is not so different from devising an algorithm to solve any problem which a computer might solve more easily than a human being

25

Naming Problems in Math

• Consider these terms from math:

• Imaginary numbers

• Complex numbers

• Chaos theory

• I’ve always marveled at how harmful tendentious naming can be to achieving understanding of what’s going on

• You might as well talk about magic spells

26

A Biological Perspective

• Although I don’t worship at the church of St. Charles of Darwin, the biologists make an interesting point:

• When reasoning about animals, it’s a mistake to anthropomorphize

• In other words, don’t ascribe human characteristics to non-human organisms

27

• The shortcoming of this point of view is that biologists tend towards the rigidity of theologians:

• Animals are not like us; they are no more than machines

• A dog cannot feel anything approaching human emotion, etc., etc.

28

• Whatever the truth of the emotional state of dogs, isn’t this a valid question:

• Why do certain computer scientists persist in naming technologies in such a way that they seem to be anthropomorphizing machines?

• The day may come when there is truth to this, but isn’t it a wee bit premature?

29

• Ideas on this question?

30

Practically Speaking, What Does Data Mining Consist of?

• Algorithms that examine data sets and extract knowledge

• This time the authors have chosen the relatively neutral words examine and extract

• The word knowledge is admittedly a bit tendentious itself

31

• In more detail, the idea is simply this:

• The computer is programmed to implement an algorithm

• The algorithm is designed to run through or process all of a data set

• The goal of the algorithm is to summarize the data set

32

• The summary is a generalization

• For our purposes, the summary consists of inferences that can be drawn about the data set

• The inferences may be about relationships between the attribute values of a given data point

• They may also be about the relationships among various data points in the set

33

• The contact lens data set illustrated the idea that an inference is a rule

• For our purposes, such a rule takes the form of “if x, then y”

• It is a comprehensive collection of such rules that summarizes the structure of the data points in the set

34

The Value of the Summary

• A set of rules is definitely useful for predictions

• Once again the distinction is made between black and transparent box techniques

• The structural description is also of great value because it helps the human understand the data and what it means

35

• In this sense, at least, a degree of learning is evident in the whole complex

• If the human consumer of the end product ultimately learns something previously unknown about the data set—

• Then the machine and algorithm must have genuinely learned something about the data that was previously unknown

36

• It may be of value to contrast this with black box systems again

• A black box system achieves results without a structural description

• This is still a valid question:

• Just because the end result is not human learning, does that in fact lessen in any way the degree of learning that the computer system achieves?

37

• There are no fixed answers to these side questions

• This is a 400-level course

• There’s no time like now to at least think about such things, before you step out the door with your sparkling new sheepskin

• Now, back to the grindstone

38

1.2 Simple Examples: The Weather and Other Problems

39

• The book introduces some comparatively simple (and unrealistic) data sets by way of further illustration

• Note that complex, real data sets tend to be proprietary anyway

• Databases and data sets are among the most valuable assets of organizations that have them

• This kind of stuff isn’t given away

40

The Weather Problem

• This is a fictitious data set

• Four symbolic (non-numeric) attributes determine whether a game should be played

• There are 36 possible combinations

• A table exists with only 14 of the combinations

• See the following overhead

41

An (Incomplete) Table of Data Point Values

42

Decision Lists

• More ideas about representing structure with rules

• Sets of rules can be arranged in order

• You work your way through the list

• You accept the outcome of a rule if it applies

• Otherwise you continue down the list

• This is called a decision list

43

• The following rules represent the idea

• No claim is made that this set of rules is complete or necessarily very good
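
The rule set itself appeared as a figure in the original overheads and is not reproduced here. As an illustrative stand-in (the conditions below are assumptions in the spirit of the book's weather example, not its exact rules), a decision list can be coded as an ordered scan that accepts the first rule that applies:

```python
# Illustrative decision list for the weather problem (rule details assumed).

def play(outlook, temperature, humidity, windy):
    rules = [
        (lambda: outlook == "sunny" and humidity == "high", "no"),
        (lambda: outlook == "rainy" and windy,              "no"),
        (lambda: outlook == "overcast",                     "yes"),
        (lambda: humidity == "normal",                      "yes"),
    ]
    for condition, outcome in rules:
        if condition():   # accept the outcome of the first rule that applies
            return outcome
    return "yes"          # default when no rule fires

print(play("sunny", "hot", "high", False))     # -> no
print(play("overcast", "mild", "high", True))  # -> yes
```

Note that temperature never appears in any of these sketch rules; as the next overhead points out, individual rules are fragmentary and only make sense when applied in order.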

44

• In an ordered list of rules:

• Each rule is potentially (most likely) fragmentary

• It doesn’t include every factor

• Because they are ordered, the rules depend on each other

• Individual rules, applied in isolation, do not necessarily (most likely don’t) give the correct result

45

Handling Numeric Data

• If all of the determining attributes are numeric, you can refer to a numeric attribute problem

• If some are numeric and some are categorical, you can refer to a mixed attribute problem

• Consider the table of the weather problem with some numerical attributes on the following overhead

46

47

• If numeric attributes are present, the decision rules typically become inequalities rather than equalities

• For example
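
(The book’s example appears on a figure overhead that is not reproduced here. The following is an illustrative sketch, with an assumed threshold of 83 rather than a value taken from the book.)

```python
# Illustrative only: a rule whose test is an inequality over a numeric attribute.
outlook, humidity = "sunny", 90   # one hypothetical case
play = "no" if outlook == "sunny" and humidity > 83 else "yes"
print(play)                       # -> no
```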

48

Classification vs. Association

• The premise of the foregoing discussion:

• A set of independent variables determines the value of a dependent variable

• This is reminiscent of multivariate statistical regression

• It is classification

• If relationships can be found among the supposedly independent variables, this is association

49

• The book gives examples of association rules taken from the original weather data set

• It is not hard to imagine that outlook, temperature, humidity, and wind depend on each other in whole or in part
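
The rules themselves were shown on a figure overhead that is not reproduced here. As a sketch of how one candidate rule, “if temperature = cool then humidity = normal”, can be checked for support and confidence (the rows below follow the commonly reproduced version of the book's weather data and should be treated as illustrative):

```python
# Sketch: score one candidate association rule against the weather data.
# Rows follow the commonly reproduced version of the book's weather set
# (treat them as illustrative, not authoritative).
data = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

# Rule under test: if temperature = cool then humidity = normal
covered = [row for row in data if row[1] == "cool"]
support = len(covered)
confidence = sum(row[2] == "normal" for row in covered) / support
print(support, confidence)   # 4 covered cases, confidence 1.0
```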

50

• The data set is fictitious

• The data set is not exhaustive

• Unlike the contact lens example, it is not a list of all possible combinations

• It is presumably based on some set of real observations

51

• The book observes that many possible association rules can be derived from the data set

• Many would be true for 100% of the data points in the set

• Many others would be true for a high percent of the data points in the set

52

• In summary, the ability to derive association rules depends on several things

• The variables are dependent in reality

• The data set is representative of that reality

• The scheme for eliciting the structure produces a good model

53

Contact Lenses: An Idealized Problem

• This subsection of the book presents the problem again, which I will not repeat

• Recall again that the data set was a complete list of all possible combinations

• The new part considers a complete set of rules derived from the data set

• See the following overhead

54

55

Representing Structure

• Converting an exhaustive listing of all possible cases into a complete set of rules represents the structure of the problem

• Questions to consider:

• Would a smaller set of rules be possible?

• In practice, would we be willing to accept a smaller rule set, even though this might be an imperfect representation of the structure?

56

A Decision Tree

• The figure on the following overhead shows a decision tree for the data

• This is an alternative, graphical or visual representation of the structure

• In summary, it illustrates 3 binary decisions in a row

• As a matter of fact, this is a simplification

• It doesn’t correctly classify 2 of the cases

57

58

Irises: A Classic Numeric Dataset

• In this problem, the attributes are numeric

• However, the result is still categorical classification

• See the table illustrating the data set and a sample set of rules on the following overheads
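
As a sketch of the same idea with current tools (this assumes scikit-learn is installed; it is not the book's own workflow), a shallow decision tree fit to the iris data yields threshold rules over the numeric attributes:

```python
# Sketch only (assumes scikit-learn is available; not the book's workflow):
# fit a shallow decision tree to the iris data and print its rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The printed rules are inequalities over the numeric attributes
# (splits on petal measurements), ending in class labels.
print(export_text(tree, feature_names=iris.feature_names))
```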

59

60

61

CPU Performance: Introducing Numeric Prediction

• Regression was mentioned earlier

• In general, it takes this form:

• y = b1x1 + b2x2 + … + bnxn + c

• An example of a data set with a numeric result is shown on the following overhead

• Being able to derive an equation is nice

• Numeric data sets may yield data mining results whose structure can’t be represented by a single equation
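
As a minimal sketch of numeric prediction by least-squares regression (the numbers are synthetic, not the book's CPU performance data):

```python
# Minimal least-squares sketch of y = b1*x1 + b2*x2 + c (synthetic data).
import numpy as np

X = np.array([[1.0, 2.0],   # two numeric input attributes per case
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

A = np.hstack([X, np.ones((len(X), 1))])   # extra column for the intercept c
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)   # approximately [1., 2., 0.] -> y = 1*x1 + 2*x2 + 0
```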

62

63

Labor Negotiations: A More Realistic Example

• This is a mixed attribute problem

• There are missing attribute values in some data points

• It is based on actual results of Canadian labor negotiations

64

• The classification is of an “acceptable” labor contract

• The meaning of “acceptable”, i.e., the classification, is open to interpretation

• For all of these reasons, this is a more realistic example than the others presented so far

65

• A table for this example is shown on the following overhead

• For presentation purposes the attributes are shown down the left hand side

• In other words, the attributes are the rows, not the columns

• The 40 cases are abbreviated with ellipses in the columns

66

67

Decision Trees, Again

• The book gives two possible decision trees for the contract data, tree (a) and tree (b)

• (a) is simpler than (b)

• (a) is also somewhat less accurate than (b)

• (a) was obtained by trimming (b)

• See the figures on the following overhead

68

69

• The trees represent structure

• This makes it possible to reason about what the data/structure mean

• Because the trees differ, it is possible to draw inferences about the differences in how they define or represent the structure of the problem

70

• Consider the bottom left 3 options in tree (b)

• Why is half of healthcare good, but both none and full are bad?

• This comes down to the question of the meaning of a good, or acceptable contract

71

• In negotiations, both labor and management have to agree

• Half of healthcare may be “good” because it represents an option that both parties can accept

• In other words, compromise is a good outcome

72

• On the other hand, it’s also possible that the derived model is too dependent on the data set

• The book introduces the terminology “overtrained”

• If the data set were expanded to include more cases, we might find that the structure we think is there is not reflected in a more comprehensive view of the problem space

73

Soybean Classification: A Classic Machine Learning Success

• The soybean classification problem was based on an initial data set of:

• 680 cases

• 35 attributes

• 19 disease categories (the classification)

• A human expert’s diagnosis of the cases

• See the table on the following overhead

74

75

• A subset of ~300 cases was selected from the overall data set

• The subset was selected so that the data points were spread out through the space

• Based on the subset, a set of rules could be derived that correctly diagnosed plant diseases based on the 35 attributes 97.5% of the time

76

• The expert who collaborated and provided diagnoses also participated in a rule-building exercise

• Instead of doing automated pattern discovery, the computer people worked with the domain expert to build a set of classification rules based on human understanding of the heuristics used

77

• Back in the day, this was the human side of artificial intelligence

• I don’t know if it’s still commonly used

• In any case, the expert-derived rule set was only successful in classifying 72% of the time

78

• In short, an automated system for discovering patterns was more successful than a human based system

• This is nice, but not overwhelming

• One of the strengths of computing is the ability to manage large amounts of data

• This is true in many areas and it’s not surprising that it’s also true in pattern discovery

79

• Continuing the philosophical discussion from earlier, the algorithms for discovering the patterns were still devised by humans

• Meta-pattern discovery would be the next step up the food chain

• Can algorithms be devised that allow programs to devise new algorithms for diverse purposes?

80

1.3 Fielded Applications

81

Web Mining

• The relevance of pages to queries can be ranked by applying data mining techniques

• User clicks/choices can be mined to select suitable ads to display

• User clicks/choices can be mined to select suitable products to suggest

• Social network pages and other online information can be mined to form profiles and target users

82

Decisions Involving Judgment

• Lending money is both profitable and risky

• Data mining can be used to assess risk vs. reward

• First cut analysis can be done using statistics

• People above threshold x are accepted, people below threshold y are rejected
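
A minimal sketch of that two-threshold first cut (the score scale and cutoff values are assumptions for illustration):

```python
# Two-threshold screening sketch (score scale and cutoffs are assumptions).
def screen(score, accept_at=700, reject_below=600):
    if score >= accept_at:
        return "accept"
    if score < reject_below:
        return "reject"
    return "borderline: refer to the mined model"

print(screen(650))   # -> borderline: refer to the mined model
```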

83

• What to do with the 10% in the middle?

• These are potentially “good” customers

• In other words, people who are borderline financially are likely to be a good pool of customers to market loans to

• How to determine who among them is a good risk?

84

• A project was done on 1000 cases with ~20 attributes

• The data mined system had a success rate of 2/3 in predicting repayment of loans

• Professional judgment had a success rate of 1/2

• If you can do better than flipping a coin, go with it

85

Screening Images

• Basic problem: Identify oil slicks from satellite images

• What was fielded:

• A system designed to process satellite images

• The system could then be trained with sample data to identify slicks

86

• Fielding a trainable, as opposed to pre-trained, system meant that:

• It was customizable by the end user

• In particular, the user could tune the undetected spill vs. false alarm rate

• (This is what statisticians refer to as type I and type II error)

87

Load Forecasting

• The basic problem here is predicting demand for electricity

• In general it depends on time of day, day of week, and season

• A model was built and enhanced with input based on current weather conditions

• The goal was hourly prediction of demand two days in advance

88

• The book’s description makes this sound like more of an analytical and statistical model and less of a data mining problem

• It certainly includes traditional elements

• On the other hand, traditional techniques can validly play a larger or smaller part in overall data mining

89

Diagnosis

• This refers to the diagnosis of problems in mechanical equipment

• A traditional approach might rely on eliciting rules from an expert

• This can be costly and time-consuming

• It can also be less accurate than what results from data mining

90

• Specific example:

• ~300 examples of mechanical faults

• The goal was not to classify fault/no fault

• The goal was to identify fault type when there was a fault

• The book seems to think this is an important example as much for the human factor as the technical factor

91

• The system mined data

• The expert wasn’t satisfied

• The data was manipulated until the rule set that resulted satisfied the expert

• The expert was not satisfied with rules that did not conform to his understanding of the problem domain

92

• He was satisfied with a rule set that was consistent with his understanding of the domain

• The derived rule set got better performance

• It expanded the expert’s understanding of the domain

93

• In short, the expert was not satisfied with a black box solution

• The expert didn’t trust results based on improved performance alone

• The expert would only accept the results when the structural representation derived by the system was understandable

94

Marketing and Sales

• Overall, this is a big area of data mining application

• Banks and cell phone companies are good examples

• They involve large sums of money and they keep extensive transaction records

95

• They can classify customers

• They can target customers to keep them from changing providers

• They can target customers who have behavior patterns that make them likely candidates for profitable services

96

• Market basket analysis is another area of data mining application

• Grocery stores are the classic example

• Your card allows them to profile you as a customer

• Whether or not you individually are a highly profitable customer, your profile in aggregate with others is valuable

97

• Classification depends on a collection of profiles

• Once classification is done, potentially profitable customers can be targeted with tailored offers

98

• Direct marketing/direct mail/telemarketing/online marketing is a profitable area for data mining

• A brick and mortar store may have virtually no information on its customers

• Once a sale is made, direct marketers potentially have extensive information on their buyers

99

• Direct marketers certainly solicit repeat customers

• They also have a virtually limitless potential market

• How do they avoid wasting money contacting unlikely prospects?

• How do they target marketing only towards likely customers?

100

• Profile your customers

• Once again, one customer’s profile in isolation may not be particularly valuable

• However, compare these profiles with the general population

• Use this information to identify a high-probability market segment

101

Other Applications

• This is a grab bag, without detail

• There’s no reason to even list all of these examples

• The foregoing examples were reasonably representative

102

1.4 Machine Learning and Statistics

103

• This can be summarized with a simple observation:

• Data mining is an area where CS and statistics converge

• Statistical thinking and techniques are an integral part of data mining

104

• There is a continuum between classical statistical techniques, like multivariate regression, and the most recent, computer-aided techniques

• It’s not either/or

• The middle ground is a mix

105

1.5 Generalization as Search

• This is an optional section

• I’m not going to cover it

106

1.6 Data Mining and Ethics

• In summary:

• If you’re using data to make decisions, you should consider whether the decisions are based on ethically or legally impermissible factors, like race, creed, sex, national origin, sexual preference, etc.

107

Reidentification

• You can read the details in the book

• The idea is this:

• Even if data is “anonymized”, quite frequently the remaining attributes can be used to infer identity

108

• This is a conundrum

• Completely anonymizing would remove attributes

• At some point you would remove so much that a data set would be of limited or no value

109

• Note that this is really a database and security question

• A structural representation of a problem by definition is more abstract than individual cases

• The problem arises when you post data sets that have supposedly been anonymized by the removal of key attributes

110

Using Personal Information

• This is also fundamentally a database and security problem

• Here is the data mining twist:

• When a user submits data and clicks “I accept” on the use policy, have they meaningfully been informed that they may be data mined?

111

• Mining data and making decisions based on it may affect that user or others

• It is not the same as static data collection and dissemination

112

Wider Issues

• The book doesn’t express it this way, but these are questions that statisticians have long faced:

• A result may be statistically significant, but is it of practical significance?

• If so, what?

113

• As noted earlier, what are the chances and consequences of false positives and false negatives?

• Also, if you can find correlations (associations), do they imply cause and effect, either one way or the other?

• Be leery of falsely inferring cause and effect, which statistics alone can’t establish

114

• The eternal question is, if you gain new knowledge, how do you use it?

• Do you use your powers for good or evil?

• Who defines good and evil?

115

1.7 Further Reading

• In general, I’m not going to cover the further reading sections

• You may want to look at them to get ideas for your paper or project

116

The End