
1 Data Mining Chapter 1 Kirk Scott. Iris virginica 2.

Date post: 28-Dec-2015

Data Mining, Chapter 1

Kirk Scott


Iris virginica


Iris versicolor


Iris setosa


1.1 Data Mining and Machine Learning


Definition of Data Mining

• The process of discovering patterns in data.

• (The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one.)

• …Useful patterns allow us to make predictions on new data.


Automatic or Semi-Automatic Pattern Discovery

• Pattern discovery by, or with the help of, computers is of interest

• Hence machine “learning”

• The “learning” part comes from the algorithms used

• Stay tuned for a brief discussion of whether this is learning


Expression of Patterns

• 1. Black box

• 2. Transparent box

• A transparent box expression reveals the structure of the pattern

• The structure can be examined, reasoned about, and used to inform future decisions


Black Box vs. Transparent Box

• Black vs. transparent box is not a trivial distinction

• Some modern computational approaches are black box in nature

• They seek “answers” without necessarily revealing the structure of the problem at hand


Genetic Algorithms

• The book in particular says that genetic algorithms are beyond the realm of consideration

• They are explicitly designed for optimization, not the revelation of structure

• They exemplify black box thinking

• They give a single (sub)optimal answer without providing additional information about the problem being optimized


What is the book about?

• Techniques for finding and describing structural patterns in data

• Description of the patterns is an inherent part of data mining as it will be considered

• Techniques that lead to black box predictors will not be considered


Describing Structural Patterns

• First example:

• Contact lens table, page 6 in textbook

• The first 5 columns are effectively 5 factors under consideration

• The 6th column is the result

• Inspection reveals that this is essentially a comprehensive listing of all possible combinations


• The information in the table could be stored syntactically in the form of a set of rules

• For example:

• If tear production rate = reduced then recommendation = none

• Otherwise, if age = young and astigmatic = no then recommendation = soft
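The two example rules can be written directly as code. This is only a sketch: the function name `recommend_lens` and the string encodings of the attribute values are assumptions, and since only two rules are given here, every other case is left undecided.

```python
from typing import Optional


def recommend_lens(tear_production_rate: str, age: str, astigmatic: str) -> Optional[str]:
    """Apply the two example rules in order; return None if neither applies."""
    # Rule 1: reduced tear production means no lenses are recommended.
    if tear_production_rate == "reduced":
        return "none"
    # Rule 2 ("otherwise"): only reached when rule 1 did not apply.
    if age == "young" and astigmatic == "no":
        return "soft"
    # Only two rules are sketched, so the remaining cases stay undecided.
    return None


print(recommend_lens("reduced", "young", "no"))
print(recommend_lens("normal", "young", "no"))
```

The "otherwise" in the second rule is what the ordering of the two `if` statements expresses: the second test is never consulted when the first rule has already given an answer.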


• The same information could be encapsulated graphically in a decision tree

• The representation is up to the person working with the scenario

• The point is that the ability to represent signifies that the structure has been revealed


Input and Output

• The 5 factors represent input

• The 6th column represents output, or “prediction”

• Overall, this distinction is typically present in a data mining problem


Completeness

• This example is simplistic because it is complete

• In interesting, practical, applied problems, not all combinations may be present

• Individual data values will be missing

• The goal of data mining is to be able to generalize from the given data so that correct predictions can be made for other cases where the results aren’t specified


Machine Learning

• High level learning in humans seems to presuppose self-conscious adaptability

• Whether or not machines are capable of learning in the human sense is open to question

• No current machine/software combination demonstrates behavior exactly analogous to the behavior seen in biological systems


• My personal take on this is that the phrase “machine learning” is an unfortunate example of inflated naming

• The phrase “data mining” is neutral

• To ask whether a machine can mine is not as fraught with difficulty as asking whether it can learn


• ACM publishes a research journal with the title “Transactions on Knowledge Discovery from Data”

• Similarly, to ask whether a machine can discover is not as fraught with difficulty as asking whether it can learn


• “Machine learning” isn’t as inflated as “Artificial intelligence”

• It is a step back from that level of hype

• In the field of artificial intelligence researchers have scaled back their expectations of what they might accomplish

• They have not been able to mimic general human intelligence


• If you look at data mining from an AI point of view, you might see it as a step along the road to mimicking learning and term it machine learning

• If you come from a database management systems background you might say the point of view is results oriented rather than process oriented


• The data are not animate

• The machines are not animate

• The algorithms are not animate


• The human mind cannot readily discern some patterns, whether due to the quantity of the data or the subtlety and complexity of the patterns

• Human programmers devise algorithms which are able to discern such patterns

• This is not so different from devising an algorithm to solve any problem which a computer might solve more easily than a human being


Naming Problems in Math

• Consider these terms from math:

• Imaginary numbers

• Complex numbers

• Chaos theory

• I’ve always marveled at how harmful tendentious naming can be to achieving understanding of what’s going on

• You might as well talk about magic spells


A Biological Perspective

• Although I don’t worship at the church of St. Charles of Darwin, the biologists make an interesting point:

• When reasoning about animals, it’s a mistake to anthropomorphize

• In other words, don’t ascribe human characteristics to non-human organisms


• The shortcoming of this point of view is that biologists tend towards the rigidity of theologists:

• Animals are not like us; they are no more than machines

• A dog cannot feel anything approaching human emotion, etc., etc.


• Whatever the truth of the emotional state of dogs, isn’t this a valid question:

• Why do certain computer scientists persist in naming technologies in such a way that they seem to be anthropomorphizing machines?

• The day may come when there is truth to this, but isn’t it a wee bit premature?


• Ideas on this question?


Practically Speaking, What Does Data Mining Consist of?

• Algorithms that examine data sets and extract knowledge

• This time the authors have chosen the relatively neutral words examine and extract

• The word knowledge is admittedly a bit tendentious itself


• In more detail, the idea is simply this:

• The computer is programmed to implement an algorithm

• The algorithm is designed to run through or process all of a data set

• The goal of the algorithm is to summarize the data set


• The summary is a generalization

• For our purposes, the summary consists of inferences that can be drawn about the data set

• The inferences may be between values for the attributes of a given data point

• They may also be about the relationships among various data points in the set


• The contact lens data set illustrated the idea that an inference is a rule

• For our purposes, such a rule takes the form of “if x, then y”

• It is a comprehensive collection of such rules that summarizes the structure of the data points in the set


The Value of the Summary

• A set of rules is definitely useful for predictions

• Once again the distinction is made between black and transparent box techniques

• The structural description is also of great value because it helps the human understand the data and what it means


• In this sense, at least, a degree of learning is evident in the whole complex

• If the human consumer of the end product ultimately learns something previously unknown about the data set—

• Then the machine and algorithm must have genuinely learned something about the data that was previously unknown


• It may be of value to contrast this with black box systems again

• A black box system achieves results without a structural description

• This is still a valid question:

• Just because the end result is not human learning, does that in fact lessen in any way the degree of learning that the computer system achieves?


• There are no fixed answers to these side questions

• This is a 400 level course

• There’s no time like now to at least think about such things, before you step out the door with your sparkling new sheepskin

• Now, back to the grindstone


1.2 Simple Examples: The Weather and Other Problems


• The book introduces some comparatively simple (and unrealistic) data sets by way of further illustration

• Note that complex, real data sets tend to be proprietary anyway

• Databases and data sets are among the most valuable assets of organizations that have them

• This kind of stuff isn’t given away


The Weather Problem

• This is a fictitious data set

• Four symbolic categories (non-numeric attributes) determine whether a game should be played

• There are 36 possible combinations

• A table exists with only 14 of the combinations

• See the following overhead
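The count of 36 can be checked by enumerating the combinations. The attribute names and values below are assumed from the textbook's weather data (3 outlooks, 3 temperatures, 2 humidity levels, 2 wind settings); they are not listed on these overheads.

```python
from itertools import product

# Assumed attribute values; the overheads only state that there are four
# symbolic attributes and 36 combinations.
attributes = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": ["true", "false"],
}

# Every possible data point is one choice of value per attribute.
combinations = list(product(*attributes.values()))
print(len(combinations))  # 3 * 3 * 2 * 2 = 36

# Only 14 of these appear in the table, so most of the space is unobserved.
observed = 14
print(len(combinations) - observed)  # 22
```

The gap between the 36 possible cases and the 14 observed ones is what makes this a generalization problem rather than a lookup.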


An (Incomplete) Table of Data Point Values


Decision Lists

• More ideas about representing structure with rules

• Sets of rules can be arranged in order

• You work your way through the list

• You accept the outcome of a rule if it applies

• Otherwise you continue down the list

• This is called a decision list


• The following rules represent the idea

• No claim is made that this set of rules is complete or necessarily very good


• In an ordered list of rules:

• Each rule is potentially (most likely) fragmentary

• It doesn’t include every factor

• Because they are ordered, the rules depend on each other

• Individual rules, applied in isolation, do not necessarily (most likely don’t) give the correct result
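A decision list can be sketched as an ordered list of (condition, outcome) pairs that is scanned from the top. The helper `apply_decision_list` and the three rules below are illustrative assumptions, not a claim about the rule set shown in the book.

```python
from typing import Callable, Dict, List, Optional, Tuple

Instance = Dict[str, str]
Rule = Tuple[Callable[[Instance], bool], str]


def apply_decision_list(
    rules: List[Rule], instance: Instance, default: Optional[str] = None
) -> Optional[str]:
    """Return the outcome of the first rule whose condition holds."""
    for condition, outcome in rules:
        if condition(instance):
            return outcome
    return default


# Hypothetical, fragmentary rules: none of them mentions every attribute.
rules: List[Rule] = [
    (lambda x: x["outlook"] == "sunny" and x["humidity"] == "high", "no"),
    (lambda x: x["outlook"] == "rainy" and x["windy"] == "true", "no"),
    (lambda x: True, "yes"),  # catch-all, only sensible as the last rule
]

day = {"outlook": "sunny", "humidity": "high", "windy": "false"}
print(apply_decision_list(rules, day))                  # no
print(apply_decision_list(list(reversed(rules)), day))  # yes
```

Reversing the list changes the answer for the same day, which is the sense in which the rules depend on each other: the catch-all rule gives the wrong result when applied in isolation.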


Handling Numeric Data

• If all of the determining attributes are numeric, you can refer to a numeric attribute problem

• If some are numeric and some are categorical, you can refer to a mixed attribute problem

• Consider the table of the weather problem with some numerical attributes on the following overhead


• If numeric attributes are present, the decision rules typically become inequalities rather than equalities

• For example
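A minimal sketch of a rule over a numeric attribute follows. The function name `sunny_and_humid` and the threshold of 83 are assumptions for illustration; what matters is that the test is `>` against a cut point rather than `==` against a category.

```python
def sunny_and_humid(outlook: str, humidity: float, threshold: float = 83.0) -> bool:
    """True when the hypothetical rule 'outlook = sunny and humidity > threshold' applies."""
    # The categorical attribute is still tested with equality,
    # while the numeric attribute is tested with an inequality.
    return outlook == "sunny" and humidity > threshold


# When the rule applies, the (assumed) conclusion is play = no.
print(sunny_and_humid("sunny", 90))
print(sunny_and_humid("sunny", 70))
```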


Classification vs. Association

• The premise of the foregoing discussion:

• A set of independent variables determines the value of a dependent variable

• This is reminiscent of multivariate statistical regression

• It is classification

• If relationships can be found among the supposedly independent variables, this is association


• These are examples of association rules taken from the original weather data set

• It is not hard to imagine that outlook, temperature, humidity, and wind depend on each other in whole or in part


• The data set is fictitious

• The data set is not exhaustive

• Unlike the contact lens example, it is not a list of all possible combinations

• It is presumably based on some set of real observations


• The book observes that many possible association rules can be derived from the data set

• Many would be true for 100% of the data points in the set

• Many others would be true for a high percentage of the data points in the set
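One way to make "true for 100%" versus "true for a high percentage" concrete is to measure, among the data points a rule applies to, the fraction for which its conclusion holds. The five rows below are invented for illustration and are not the book's table; `rule_accuracy` is a hypothetical helper.

```python
from typing import Dict, List, Optional

Row = Dict[str, str]

# Toy rows, made up for this sketch.
rows: List[Row] = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "false"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "true"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": "true"},
    {"outlook": "rainy", "temperature": "cool", "humidity": "normal", "windy": "false"},
    {"outlook": "rainy", "temperature": "mild", "humidity": "high", "windy": "false"},
]


def rule_accuracy(rows: List[Row], antecedent: Row, consequent: Row) -> Optional[float]:
    """Fraction of rows matching the 'if' part that also match the 'then' part."""
    covered = [r for r in rows if all(r[k] == v for k, v in antecedent.items())]
    if not covered:
        return None  # the rule never applies, so its accuracy is undefined
    correct = [r for r in covered if all(r[k] == v for k, v in consequent.items())]
    return len(correct) / len(covered)


# "if temperature = cool then humidity = normal" holds for every covered row.
print(rule_accuracy(rows, {"temperature": "cool"}, {"humidity": "normal"}))
# "if humidity = high then outlook = sunny" holds for 2 of the 3 covered rows.
print(rule_accuracy(rows, {"humidity": "high"}, {"outlook": "sunny"}))
```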


• In summary, the ability to derive association rules depends on several things

• The variables are dependent in reality

• The data set is representative of that reality

• The scheme for eliciting the structure produces a good model


Contact Lenses: An Idealized Problem

• This subsection of the book presents the problem again, which I will not repeat

• Recall again that the data set was a complete list of all possible combinations

• The new part considers a complete set of rules derived from the data set

• See the following overhead


Representing Structure

• Converting an exhaustive listing of all possible cases into a complete set of rules represents the structure of the problem

• Questions to consider:

• Would a smaller set of rules be possible?

• In practice, would we be willing to accept a smaller rule set, even though this might be an imperfect representation of the structure?


A Decision Tree

• The figure on the following overhead shows a decision tree for the data

• This is an alternative, graphical or visual representation of the structure

• In summary, it illustrates 3 binary decisions in a row

• As a matter of fact, this is a simplification

• It doesn’t correctly classify 2 of the cases


Irises: A Classic Numeric Dataset

• In this problem, the attributes are numeric

• However, the result is still categorical classification

• See the table illustrating the data set and a sample set of rules on the following overheads


CPU Performance: Introducing Numeric Prediction

• Regression was mentioned earlier

• In general, it takes this form:

• y = b_1 x_1 + b_2 x_2 + … + b_n x_n + c

• An example of a data set with a numeric result is shown on the following overhead

• Being able to derive an equation is nice

• Numeric data sets may yield data mining results which have structure which can’t be represented by a single equation
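The regression form above is just a weighted sum plus a constant. A minimal sketch, with made-up coefficients (the function name `predict` and the numbers are assumptions, not values fitted to the CPU data):

```python
from typing import Sequence


def predict(coefficients: Sequence[float], intercept: float, features: Sequence[float]) -> float:
    """Evaluate y = b_1*x_1 + ... + b_n*x_n + c for one data point."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return sum(b * x for b, x in zip(coefficients, features)) + intercept


# Illustrative numbers only: 2.0*3.0 + 0.5*4.0 + 1.0 = 9.0
print(predict([2.0, 0.5], 1.0, [3.0, 4.0]))
```

Finding good values for the coefficients is the mining step; once they are known, prediction is this single expression.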


Labor Negotiations: A More Realistic Example

• This is a mixed attribute problem

• There are missing attribute values in some data points

• It is based on actual results of Canadian labor negotiations


• The classification is of an “acceptable” labor contract

• The meaning of “acceptable”, i.e., the classification, is open to interpretation

• For all of these reasons, this is a more realistic example than the others presented so far


• A table for this example is shown on the following overhead

• For presentation purposes the attributes are shown down the left hand side

• In other words, the attributes are the rows, not the columns

• The 40 cases are abbreviated with ellipses in the columns


Decision Trees, Again

• The book gives two possible decision trees for the contract data, tree (a) and tree (b)

• (a) is simpler than (b)

• (a) is also somewhat less accurate than (b)

• (a) was obtained by trimming (b)

• See the figures on the following overhead


• The trees represent structure

• This makes it possible to reason about what the data/structure mean

• Because the trees differ, it is possible to draw inferences about the differences in how they define or represent the structure of the problem


• Consider the bottom left 3 options in tree (b)

• Why is half of healthcare good, but both none and full are bad?

• This comes down to the question of the meaning of a good, or acceptable contract


• In negotiations, both labor and management have to agree

• Half of healthcare may be “good” because it represents an option that both parties can accept

• In other words, compromise is a good outcome


• On the other hand, it’s also possible that the derived model is too dependent on the data set

• The book introduces the terminology “overtrained”

• If the data set were expanded to include more cases, we might find that the structure we think is there is not reflected in a more comprehensive view of the problem space


Soybean Classification: A Classic Machine Learning Success

• The soybean classification problem was based on an initial data set of:

• 680 cases

• 35 attributes

• 19 disease categories (the classification)

• A human expert’s diagnosis of the cases

• See the table on the following overhead


• A subset of ~300 cases was selected from the overall data set

• The subset was selected so that the data points were spread out through the space

• Based on the subset, a set of rules could be derived that correctly diagnosed plant diseases based on the 35 attributes 97.5% of the time


• The expert who collaborated and provided diagnoses also participated in a rule-building exercise

• Instead of doing automated pattern discovery, the computer people worked with the domain expert to build a set of classification rules based on human understanding of the heuristics used


• Back in the day, this was the human side of artificial intelligence

• I don’t know if it’s still commonly used

• In any case, the expert-derived rule set was only successful in classifying 72% of the time


• In short, an automated system for discovering patterns was more successful than a human-based system

• This is nice, but not overwhelming

• One of the strengths of computing is the ability to manage large amounts of data

• This is true in many areas and it’s not surprising that it’s also true in pattern discovery


• Continuing the philosophical discussion from earlier, the algorithms for discovering the patterns were still devised by humans

• Meta-pattern discovery would be the next step up the food chain

• Can algorithms be devised that allow programs to devise new algorithms for diverse purposes?


1.3 Fielded Applications


Web Mining

• The relevance of pages to queries can be ranked by applying data mining techniques

• User clicks/choices can be mined to select suitable ads to display

• User clicks/choices can be mined to select suitable products to suggest

• Social network pages and other online information can be mined to form profiles and target users


Decisions Involving Judgment

• Lending money is both profitable and risky

• Data mining can be used to assess risk vs. reward

• First cut analysis can be done using statistics

• People above threshold x are accepted, people below threshold y are rejected
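The first cut can be sketched as a three-way split on a single score. The function name `triage`, the idea of one numeric score per applicant, and the example thresholds are all assumptions made for illustration.

```python
def triage(score: float, accept_above: float, reject_below: float) -> str:
    """Split applicants into accept, reject, or borderline using thresholds x and y."""
    if reject_below > accept_above:
        raise ValueError("reject threshold must not exceed accept threshold")
    if score > accept_above:
        return "accept"
    if score < reject_below:
        return "reject"
    # Everyone between y and x is left for a closer look.
    return "borderline"


# Hypothetical thresholds x = 0.7 and y = 0.3.
for s in (0.9, 0.5, 0.1):
    print(s, triage(s, accept_above=0.7, reject_below=0.3))
```

The statistical cut handles the clear cases cheaply; the borderline group is where a mined model can add value.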


• What to do with the 10% in the middle?

• These are potentially “good” customers

• In other words, people who are borderline financially are likely to be a good pool of customers to market loans to

• How to determine who among them is a good risk?


• A project was done on 1000 cases with ~20 attributes

• The data mined system had a success rate of 2/3 in predicting repayment of loans

• Professional judgment had a success rate of ½

• If you can do better than flipping a coin, go with it


Screening Images

• Basic problem: Identify oil slicks from satellite images

• What was fielded:

• A system designed to process satellite images

• The system could then be trained with sample data to identify slicks


• Fielding a trainable, as opposed to pre-trained, system meant that:

• It was customizable by the end user

• In particular, the user could tune the undetected spill vs. false alarm rate

• (Statisticians refer to these as type II and type I errors, respectively)
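The trade-off the user tunes can be sketched with a single detection threshold. The scores and labels below are invented, and `error_rates` is a hypothetical helper that assumes the data contains at least one real slick and one non-slick.

```python
from typing import List, Tuple


def error_rates(scores: List[float], labels: List[bool], threshold: float) -> Tuple[float, float]:
    """Return (miss rate, false alarm rate) when flagging every score >= threshold."""
    flagged = [s >= threshold for s in scores]
    positives = sum(labels)              # real slicks
    negatives = len(labels) - positives  # everything else
    missed = sum(1 for f, y in zip(flagged, labels) if y and not f)
    false_alarms = sum(1 for f, y in zip(flagged, labels) if f and not y)
    return missed / positives, false_alarms / negatives


# Made-up detector scores; True marks a real slick.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

print(error_rates(scores, labels, 0.7))  # strict: misses a slick, no false alarms
print(error_rates(scores, labels, 0.2))  # lenient: no misses, more false alarms
```

Moving the threshold lowers one error rate only by raising the other, which is why leaving the setting to the end user is a meaningful form of customization.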


Load Forecasting

• The basic problem here is predicting demand for electricity

• In general it depends on time of day, day of week, and season

• A model was built and enhanced with input based on current weather conditions

• The goal was hourly prediction of demand two days in advance


• The book’s description makes this sound like more of an analytical and statistical model and less of a data mining problem

• It certainly includes traditional elements

• On the other hand, traditional elements are valid as larger or smaller elements of overall data mining


Diagnosis

• This refers to the diagnosis of problems in mechanical equipment

• A traditional approach might rely on eliciting rules from an expert

• This can be costly and time-consuming

• It can also be less accurate than what results from data mining


• Specific example:

• ~300 examples of mechanical faults

• The goal was not to classify fault/no fault

• The goal was to identify fault type when there was a fault

• The book seems to think this is an important example as much for the human factor as the technical factor


• The system mined data

• The expert wasn’t satisfied

• The data was manipulated until the rule set that resulted satisfied the expert

• The expert was not satisfied with rules that did not conform to his understanding of the problem domain


• He was satisfied with a rule set that was consistent with his understanding of the domain

• The derived rule set got better performance

• It expanded the expert’s understanding of the domain


• In short, the expert was not satisfied with a black box solution

• The expert didn’t trust results based on improved performance alone

• The expert would only accept the results when the structural representation derived by the system was understandable


Marketing and Sales

• Overall, this is a big area of data mining application

• Banks and cell phone companies are good examples

• They involve large sums of money and they keep extensive transaction records


• They can classify customers

• They can target customers to keep them from changing provider

• They can target customers who have behavior patterns that make them likely candidates for profitable services


• Market basket analysis is another area of data mining application

• Grocery stores are the classic example

• Your card allows them to profile you as a customer

• Whether or not you individually are a highly profitable customer, your profile in aggregate with others is valuable


• Classification depends on a collection of profiles

• Once classification is done, potentially profitable customers can be targeted with tailored offers


• Direct marketing/direct mail/telemarketing/online marketing is a profitable area for data mining

• A brick and mortar store may have virtually no information on its customers

• Once direct marketers have made a sale, they potentially have extensive information on buyers


• Direct marketers certainly solicit repeat customers

• They also have a virtually limitless potential market

• How do they avoid wasting money contacting unlikely prospects?

• How do they target marketing only towards likely customers?


• Profile your customers

• Once again, one customer’s profile in isolation may not be particularly valuable

• However, compare these profiles with the general population

• Use this information to identify a high probability market segment


Other Applications

• This is a grab bag, without detail

• There’s no reason to even list all of these examples

• The foregoing examples were reasonably representative


1.4 Machine Learning and Statistics


• This can be summarized with a simple observation:

• Data mining is an area where CS and statistics converge

• Statistical thinking and techniques are an integral part of data mining


• There is a continuum between classical statistical techniques, like multivariate regression, and the most recent, computer-aided techniques

• It’s not either/or

• The middle ground is a mix


1.5 Generalization as Search

• This is an optional section

• I’m not going to cover it


1.6 Data Mining and Ethics

• In summary:

• If you’re using data to make decisions, you should consider whether the decisions are based on ethically or legally sensitive factors, like race, creed, sex, national origin, sexual preference, etc.


Reidentification

• You can read the details in the book

• The idea is this:

• Even if data is “anonymized”, quite frequently the remaining attributes can be used to infer identity
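The idea can be illustrated by counting how many records are the only ones with their particular combination of remaining attributes. The records, the attribute names, and the helper `unique_fraction` are invented for this sketch.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Made-up "anonymized" records: names removed, other attributes kept.
records: List[Dict[str, object]] = [
    {"zip": "99501", "birth_year": 1980, "sex": "F"},
    {"zip": "99501", "birth_year": 1980, "sex": "F"},
    {"zip": "99501", "birth_year": 1975, "sex": "M"},
    {"zip": "99508", "birth_year": 1990, "sex": "F"},
]


def unique_fraction(records: List[Dict[str, object]], attributes: Sequence[str]) -> float:
    """Fraction of records whose combination of the given attributes is unique."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# One attribute alone singles out few records; the combination singles out more.
print(unique_fraction(records, ["sex"]))                       # 0.25
print(unique_fraction(records, ["zip", "birth_year", "sex"]))  # 0.5
```

A record that is unique on its remaining attributes can be matched against any outside source that lists the same attributes alongside a name.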


• This is a conundrum

• Completely anonymizing would remove attributes

• At some point you would remove so much that a data set would be of limited or no value


• Note that this is really a database and security question

• A structural representation of a problem by definition is more abstract than individual cases

• The problem arises when you post data sets that have supposedly been anonymized by the removal of key attributes


Using Personal Information

• This is also fundamentally a database and security problem

• Here is the data mining twist:

• When a user submits data and clicks “I accept” on the use policy, have they meaningfully been informed that they may be data mined?


• Mining data and making decisions based on it may affect that user or others

• It is not the same as static data collection and dissemination


Wider Issues

• The book doesn’t express it this way, but these are questions that statisticians have long faced:

• A result may be statistically significant, but is it of practical significance?

• If so, what?


• As noted earlier, what are the chances and consequences of false positives and false negatives?

• Also, if you can find correlations (associations), do they imply cause and effect, either one way or the other?

• Be leery of falsely inferring cause and effect, which statistics alone can’t establish


• The eternal question is, if you gain new knowledge, how do you use it?

• Do you use your powers for good or evil?

• Who defines good and evil?


1.7 Further Reading

• In general, I’m not going to cover the further reading sections

• You may want to look at them to get ideas for your paper or project


The End
