
1

Data Mining
Chapter 1

Kirk Scott

2

Iris virginica

3

Iris versicolor

4

Iris setosa

5

1.1 Data Mining and Machine Learning

6

Definition of Data Mining

• The process of discovering patterns in data.

• (The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one.)

• Useful patterns allow us to make predictions on new data.

7

Automatic or Semi-Automatic Pattern Discovery

• Pattern discovery by, or with the help of, computers is of interest

• Hence machine “learning”

• The “learning” part comes from the algorithms used

• Stay tuned for a brief discussion of whether this is learning

8

Expression of Patterns

• 1. Black box

• 2. Transparent box

• A transparent box expression reveals the structure of the pattern

• The structure can be examined, reasoned about, and used to inform future decisions

9

Black Box vs. Transparent Box

• Black vs. transparent box is not a trivial distinction

• Some modern computational approaches are black box in nature

• They seek “answers” without necessarily revealing the structure of the problem at hand

10

Genetic Algorithms

• The book in particular says that genetic algorithms are beyond the realm of consideration

• They are explicitly designed for optimization, not the revelation of structure

• They exemplify black box thinking

• They give a single (sub)optimal answer without providing additional information about the problem being optimized

11

What is the book about?

• Techniques for finding and describing structural patterns in data

• Description of the patterns is an inherent part of data mining as it will be considered

• Techniques that lead to black box predictors will not be considered

12

Describing Structural Patterns

• First example:

• Contact lens table, page 6 in textbook

• The first 5 columns are effectively 5 factors under consideration

• The 6th column is the result

• Inspection reveals that this is essentially a comprehensive listing of all possible combinations

13

14

• The information in the table could be stored syntactically in the form of a set of rules

• For example:

• If tear production rate = reduced then recommendation = none

• Otherwise, if age = young and astigmatic = no then recommendation = soft
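
As a minimal sketch (illustrative only; the attribute names and string values are assumptions, and this is not code from the textbook), those two rules could be expressed as an ordered if/then chain:

```python
# Illustrative sketch only: the two example rules expressed as code.
# Attribute names and values are assumptions, not the book's notation.

def recommend(age, astigmatic, tear_production_rate):
    """Return a contact lens recommendation for one case."""
    if tear_production_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no":
        return "soft"
    return "undetermined"  # the remaining cases would need further rules

print(recommend("young", "no", "normal"))   # -> soft
print(recommend("young", "no", "reduced"))  # -> none
```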

15

• The same information could be encapsulated graphically in a decision tree

• The representation is up to the person working with the scenario

• The point is that the ability to represent signifies that the structure has been revealed

16

Input and Output

• The 5 factors represent input

• The 6th column represents output, or “prediction”

• Overall, this distinction is typically present in a data mining problem

17

Completeness

• This example is simplistic because it is complete

• In interesting, practical, applied problems, not all combinations may be present

• Individual data values may also be missing

• The goal of data mining is to be able to generalize from the given data so that correct predictions can be made for other cases where the results aren’t specified

18

Machine Learning

• High level learning in humans seems to presuppose self-conscious adaptability

• Whether or not machines are capable of learning in the human sense is open to question

• No current machine/software combination demonstrates behavior exactly analogous to the behavior seen in biological systems

19

• My personal take on this is that the phrase “machine learning” is an unfortunate example of inflated naming

• The phrase “data mining” is neutral

• To ask whether a machine can mine is not as fraught with difficulty as asking whether it can learn

20

• IEEE publishes a research journal titled “Transactions on Knowledge and Data Engineering”, and ACM publishes one titled “Transactions on Knowledge Discovery from Data”

• Similarly, to ask whether a machine can discover is not as fraught with difficulty as asking whether it can learn

21

• “Machine learning” isn’t as inflated as “Artificial intelligence”

• It is a step back from that level of hype

• In the field of artificial intelligence, researchers have scaled back their expectations of what they might accomplish

• They have not been able to mimic general human intelligence

22

• If you look at data mining from an AI point of view, you might see it as a step along the road to mimicking learning and term it machine learning

• If you come from a database management systems background you might say the point of view is results oriented rather than process oriented

23

• The data are not animate

• The machines are not animate

• The algorithms are not animate

24

• The human mind cannot readily discern some patterns, whether due to the quantity of the data or the subtlety and complexity of the patterns

• Human programmers devise algorithms which are able to discern such patterns

• This is not so different from devising an algorithm to solve any problem which a computer might solve more easily than a human being

25

Naming Problems in Math

• Consider these terms from math:

• Imaginary numbers

• Complex numbers

• Chaos theory

• I’ve always marveled at how harmful tendentious naming can be to achieving understanding of what’s going on

• You might as well talk about magic spells

26

A Biological Perspective

• Although I don’t worship at the church of St. Charles of Darwin, the biologists make an interesting point:

• When reasoning about animals, it’s a mistake to anthropomorphize

• In other words, don’t ascribe human characteristics to non-human organisms

27

• The shortcoming of this point of view is that biologists tend towards the rigidity of theologians:

• Animals are not like us; they are no more than machines

• A dog cannot feel anything approaching human emotion, etc., etc.

28

• Whatever the truth of the emotional state of dogs, isn’t this a valid question:

• Why do certain computer scientists persist in naming technologies in such a way that they seem to be anthropomorphizing machines?

• The day may come when there is truth to this, but isn’t it a wee bit premature?

29

• Ideas on this question?

30

Practically Speaking, What Does Data Mining Consist of?

• Algorithms that examine data sets and extract knowledge

• This time the authors have chosen the relatively neutral words examine and extract

• The word knowledge is admittedly a bit tendentious itself

31

• In more detail, the idea is simply this:

• The computer is programmed to implement an algorithm

• The algorithm is designed to run through or process all of a data set

• The goal of the algorithm is to summarize the data set

32

• The summary is a generalization

• For our purposes, the summary consists of inferences that can be drawn about the data set

• The inferences may be about relationships between the attribute values of a given data point

• They may also be about the relationships among various data points in the set

33

• The contact lens data set illustrated the idea that an inference is a rule

• For our purposes, such a rule takes the form of “if x, then y”

• It is a comprehensive collection of such rules that summarizes the structure of the data points in the set

34

The Value of the Summary

• A set of rules is definitely useful for predictions

• Once again the distinction is made between black and transparent box techniques

• The structural description is also of great value because it helps the human understand the data and what it means

35

• In this sense, at least, a degree of learning is evident in the whole complex

• If the human consumer of the end product ultimately learns something previously unknown about the data set—

• Then the machine and algorithm must have genuinely learned something about the data that was previously unknown

36

• It may be of value to contrast this with black box systems again

• A black box system achieves results without a structural description

• This is still a valid question:

• Just because the end result is not human learning, does that in fact lessen in any way the degree of learning that the computer system achieves?

37

• There are no fixed answers to these side questions

• This is a 400-level course

• There’s no time like now to at least think about such things, before you step out the door with your sparkling new sheepskin

• Now, back to the grindstone

38

1.2 Simple Examples: The Weather and Other Problems

39

• The book introduces some comparatively simple (and unrealistic) data sets by way of further illustration

• Note that complex, real data sets tend to be proprietary anyway

• Databases and data sets are among the most valuable assets of organizations that have them

• This kind of stuff isn’t given away

40

The Weather Problem

• This is a fictitious data set

• Four symbolic (non-numeric) attributes determine whether a game should be played

• There are 36 possible combinations

• A table exists with only 14 of the combinations

• See the following overhead

41

An (Incomplete) Table of Data Point Values

42

Decision Lists

• More ideas about representing structure with rules

• Sets of rules can be arranged in order

• You work your way through the list

• You accept the outcome of a rule if it applies

• Otherwise you continue down the list

• This is called a decision list

43

• The following rules represent the idea

• No claim is made that this set of rules is complete or necessarily very good
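
The rule set itself appeared as a figure in the original overheads and is not reproduced here. As an illustrative stand-in (the conditions below are assumptions in the spirit of the book's weather example, not its exact rules), a decision list can be coded as an ordered scan that accepts the first rule that applies:

```python
# Illustrative decision list for the weather problem (rule details assumed).

def play(outlook, temperature, humidity, windy):
    rules = [
        (lambda: outlook == "sunny" and humidity == "high", "no"),
        (lambda: outlook == "rainy" and windy,              "no"),
        (lambda: outlook == "overcast",                     "yes"),
        (lambda: humidity == "normal",                      "yes"),
    ]
    for condition, outcome in rules:
        if condition():   # accept the outcome of the first rule that applies
            return outcome
    return "yes"          # default when no rule fires

print(play("sunny", "hot", "high", False))     # -> no
print(play("overcast", "mild", "high", True))  # -> yes
```

Note that temperature never appears in any of these sketch rules; as the next overhead points out, individual rules are fragmentary and only make sense when applied in order.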

44

• In an ordered list of rules:

• Each rule is potentially (most likely) fragmentary

• It doesn’t include every factor

• Because they are ordered, the rules depend on each other

• Individual rules, applied in isolation, do not necessarily (most likely don’t) give the correct result

45

Handling Numeric Data

• If all of the determining attributes are numeric, you can refer to a numeric attribute problem

• If some are numeric and some are categorical, you can refer to a mixed attribute problem

• Consider the table of the weather problem with some numerical attributes on the following overhead

46

47

• If numeric attributes are present, the decision rules typically become inequalities rather than equalities

• For example
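
(The book’s example appears on a figure overhead that is not reproduced here. The following is an illustrative sketch, with an assumed threshold of 83 rather than a value taken from the book.)

```python
# Illustrative only: a rule whose test is an inequality over a numeric attribute.
outlook, humidity = "sunny", 90   # one hypothetical case
play = "no" if outlook == "sunny" and humidity > 83 else "yes"
print(play)                       # -> no
```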

48

Classification vs. Association

• The premise of the foregoing discussion:

• A set of independent variables determines the value of a dependent variable

• This is reminiscent of multivariate statistical regression

• It is classification

• If relationships can be found among the supposedly independent variables, this is association

49

• The book gives examples of association rules taken from the original weather data set

• It is not hard to imagine that outlook, temperature, humidity, and wind depend on each other in whole or in part
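
The rules themselves were shown on a figure overhead that is not reproduced here. As a sketch of how one candidate rule, “if temperature = cool then humidity = normal”, can be checked for support and confidence (the rows below follow the commonly reproduced version of the book's weather data and should be treated as illustrative):

```python
# Sketch: score one candidate association rule against the weather data.
# Rows follow the commonly reproduced version of the book's weather set
# (treat them as illustrative, not authoritative).
data = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

# Rule under test: if temperature = cool then humidity = normal
covered = [row for row in data if row[1] == "cool"]
support = len(covered)
confidence = sum(row[2] == "normal" for row in covered) / support
print(support, confidence)   # 4 covered cases, confidence 1.0
```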

50

• The data set is fictitious

• The data set is not exhaustive

• Unlike the contact lens example, it is not a list of all possible combinations

• It is presumably based on some set of real observations

51

• The book observes that many possible association rules can be derived from the data set

• Many would be true for 100% of the data points in the set

• Many others would be true for a high percent of the data points in the set

52

• In summary, the ability to derive association rules depends on several things

• The variables are dependent in reality

• The data set is representative of that reality

• The scheme for eliciting the structure produces a good model

53

Contact Lenses: An Idealized Problem

• This subsection of the book presents the problem again, which I will not repeat

• Recall again that the data set was a complete list of all possible combinations

• The new part considers a complete set of rules derived from the data set

• See the following overhead

54

55

Representing Structure

• Converting an exhaustive listing of all possible cases into a complete set of rules represents the structure of the problem

• Questions to consider:

• Would a smaller set of rules be possible?

• In practice, would we be willing to accept a smaller rule set, even though this might be an imperfect representation of the structure?

56

A Decision Tree

• The figure on the following overhead shows a decision tree for the data

• This is an alternative, graphical or visual representation of the structure

• In summary, it illustrates 3 binary decisions in a row

• As a matter of fact, this is a simplification

• It doesn’t correctly classify 2 of the cases

57

58

Irises: A Classic Numeric Dataset

• In this problem, the attributes are numeric

• However, the result is still categorical classification

• See the table illustrating the data set and a sample set of rules on the following overheads
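
As a sketch of the same idea with current tools (this assumes scikit-learn is installed; it is not the book's own workflow), a shallow decision tree fit to the iris data yields threshold rules over the numeric attributes:

```python
# Sketch only (assumes scikit-learn is available; not the book's workflow):
# fit a shallow decision tree to the iris data and print its rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# The printed rules are inequalities over the numeric attributes
# (splits on petal measurements), ending in class labels.
print(export_text(tree, feature_names=iris.feature_names))
```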

59

60

61

CPU Performance: Introducing Numeric Prediction

• Regression was mentioned earlier

• In general, it takes this form:

• y = b1x1 + b2x2 + … + bnxn + c

• An example of a data set with a numeric result is shown on the following overhead

• Being able to derive an equation is nice

• Numeric data sets may yield data mining results whose structure can’t be represented by a single equation
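
As a minimal sketch of numeric prediction by least-squares regression (the numbers are synthetic, not the book's CPU performance data):

```python
# Minimal least-squares sketch of y = b1*x1 + b2*x2 + c (synthetic data).
import numpy as np

X = np.array([[1.0, 2.0],   # two numeric input attributes per case
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

A = np.hstack([X, np.ones((len(X), 1))])   # extra column for the intercept c
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coeffs)   # approximately [1., 2., 0.] -> y = 1*x1 + 2*x2 + 0
```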

62

63

Labor Negotiations: A More Realistic Example

• This is a mixed attribute problem

• There are missing attribute values in some data points

• It is based on actual results of Canadian labor negotiations

64

• The classification is of an “acceptable” labor contract

• The meaning of “acceptable”, i.e., the classification, is open to interpretation

• For all of these reasons, this is a more realistic example than the others presented so far

65

• A table for this example is shown on the following overhead

• For presentation purposes the attributes are shown down the left hand side

• In other words, the attributes are the rows, not the columns

• The 40 cases are abbreviated with ellipses in the columns

66

67

Decision Trees, Again

• The book gives two possible decision trees for the contract data, tree (a) and tree (b)

• (a) is simpler than (b)

• (a) is also somewhat less accurate than (b)

• (a) was obtained by trimming (b)

• See the figures on the following overhead

68

69

• The trees represent structure

• This makes it possible to reason about what the data/structure mean

• Because the trees differ, it is possible to draw inferences about the differences in how they define or represent the structure of the problem

70

• Consider the bottom left 3 options in tree (b)

• Why is half of healthcare good, but both none and full are bad?

• This comes down to the question of the meaning of a good, or acceptable contract

71

• In negotiations, both labor and management have to agree

• Half of healthcare may be “good” because it represents an option that both parties can accept

• In other words, compromise is a good outcome

72

• On the other hand, it’s also possible that the derived model is too dependent on the data set

• The book introduces the terminology “overtrained”

• If the data set were expanded to include more cases, we might find that the structure we think is there is not reflected in a more comprehensive view of the problem space

73

Soybean Classification: A Classic Machine Learning Success

• The soybean classification problem was based on an initial data set of:

• 680 cases

• 35 attributes

• 19 disease categories (the classification)

• A human expert’s diagnosis of the cases

• See the table on the following overhead

74

75

• A subset of ~300 cases was selected from the overall data set

• The subset was selected so that the data points were spread out through the space

• Based on the subset, a set of rules could be derived that correctly diagnosed plant diseases based on the 35 attributes 97.5% of the time

76

• The expert who collaborated and provided diagnoses also participated in a rule-building exercise

• Instead of doing automated pattern discovery, the computer people worked with the domain expert to build a set of classification rules based on human understanding of the heuristics used

77

• Back in the day, this was the human side of artificial intelligence

• I don’t know if it’s still commonly used

• In any case, the expert-derived rule set was only successful in classifying 72% of the time

78

• In short, an automated system for discovering patterns was more successful than a human based system

• This is nice, but not overwhelming

• One of the strengths of computing is the ability to manage large amounts of data

• This is true in many areas and it’s not surprising that it’s also true in pattern discovery

79

• Continuing the philosophical discussion from earlier, the algorithms for discovering the patterns were still devised by humans

• Meta-pattern discovery would be the next step up the food chain

• Can algorithms be devised that allow programs to devise new algorithms for diverse purposes?

80

1.3 Fielded Applications

81

Web Mining

• The relevance of pages to queries can be ranked by applying data mining techniques

• User clicks/choices can be mined to select suitable ads to display

• User clicks/choices can be mined to select suitable products to suggest

• Social network pages and other online information can be mined to form profiles and target users

82

Decisions Involving Judgment

• Lending money is both profitable and risky

• Data mining can be used to assess risk vs. reward

• First cut analysis can be done using statistics

• People above threshold x are accepted, people below threshold y are rejected
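
A minimal sketch of that two-threshold first cut (the score scale and cutoff values are assumptions for illustration):

```python
# Two-threshold screening sketch (score scale and cutoffs are assumptions).
def screen(score, accept_at=700, reject_below=600):
    if score >= accept_at:
        return "accept"
    if score < reject_below:
        return "reject"
    return "borderline: refer to the mined model"

print(screen(650))   # -> borderline: refer to the mined model
```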

83

• What to do with the 10% in the middle?

• These are potentially “good” customers

• In other words, people who are borderline financially are likely to be a good pool of customers to market loans to

• How to determine who among them is a good risk?

84

• A project was done on 1000 cases with ~20 attributes

• The data mined system had a success rate of 2/3 in predicting repayment of loans

• Professional judgment had a success rate of 1/2

• If you can do better than flipping a coin, go with it

85

Screening Images

• Basic problem: Identify oil slicks from satellite images

• What was fielded:

• A system designed to process satellite images

• The system could then be trained with sample data to identify slicks

86

• Fielding a trainable, as opposed to pre-trained, system meant that:

• It was customizable by the end user

• In particular, the user could tune the undetected spill vs. false alarm rate

• (This is what statisticians refer to as type I and type II error)

87

Load Forecasting

• The basic problem here is predicting demand for electricity

• In general it depends on time of day, day of week, and season

• A model was built and enhanced with input based on current weather conditions

• The goal was hourly prediction of demand two days in advance

88

• The book’s description makes this sound like more of an analytical and statistical model and less of a data mining problem

• It certainly includes traditional elements

• On the other hand, traditional techniques can validly play a larger or smaller part in overall data mining

89

Diagnosis

• This refers to the diagnosis of problems in mechanical equipment

• A traditional approach might rely on eliciting rules from an expert

• This can be costly and time-consuming

• It can also be less accurate than what results from data mining

90

• Specific example:

• ~300 examples of mechanical faults

• The goal was not to classify fault/no fault

• The goal was to identify fault type when there was a fault

• The book seems to think this is an important example as much for the human factor as the technical factor

91

• The system mined data

• The expert wasn’t satisfied

• The data was manipulated until the rule set that resulted satisfied the expert

• The expert was not satisfied with rules that did not conform to his understanding of the problem domain

92

• He was satisfied with a rule set that was consistent with his understanding of the domain

• The derived rule set got better performance

• It expanded the expert’s understanding of the domain

93

• In short, the expert was not satisfied with a black box solution

• The expert didn’t trust results based on improved performance alone

• The expert would only accept the results when the structural representation derived by the system was understandable

94

Marketing and Sales

• Overall, this is a big area of data mining application

• Banks and cell phone companies are good examples

• They involve large sums of money and they keep extensive transaction records

95

• They can classify customers

• They can target customers to keep them from changing providers

• They can target customers who have behavior patterns that make them likely candidates for profitable services

96

• Market basket analysis is another area of data mining application

• Grocery stores are the classic example

• Your card allows them to profile you as a customer

• Whether or not you individually are a highly profitable customer, your profile in aggregate with others is valuable

97

• Classification depends on a collection of profiles

• Once classification is done, potentially profitable customers can be targeted with tailored offers

98

• Direct marketing/direct mail/telemarketing/online marketing is a profitable area for data mining

• A brick and mortar store may have virtually no information on its customers

• Once a sale is made, direct marketers potentially have extensive information on their buyers

99

• Direct marketers certainly solicit repeat customers

• They also have a virtually limitless potential market

• How do they avoid wasting money contacting unlikely prospects?

• How do they target marketing only towards likely customers?

100

• Profile your customers

• Once again, one customer’s profile in isolation may not be particularly valuable

• However, compare these profiles with the general population

• Use this information to identify a high-probability market segment

101

Other Applications

• This is a grab bag, without detail

• There’s no reason to even list all of these examples

• The foregoing examples were reasonably representative

102

1.4 Machine Learning and Statistics

103

• This can be summarized with a simple observation:

• Data mining is an area where CS and statistics converge

• Statistical thinking and techniques are an integral part of data mining

104

• There is a continuum between classical statistical techniques, like multivariate regression, and the most recent, computer-aided techniques

• It’s not either/or

• The middle ground is a mix

105

1.5 Generalization as Search

• This is an optional section

• I’m not going to cover it

106

1.6 Data Mining and Ethics

• In summary:

• If you’re using data to make decisions, you should consider whether the decisions are based on ethically or legally impermissible factors, like race, creed, sex, national origin, sexual preference, etc.

107

Reidentification

• You can read the details in the book

• The idea is this:

• Even if data is “anonymized”, quite frequently the remaining attributes can be used to infer identity

108

• This is a conundrum

• Completely anonymizing would remove attributes

• At some point you would remove so much that a data set would be of limited or no value

109

• Note that this is really a database and security question

• A structural representation of a problem by definition is more abstract than individual cases

• The problem arises when you post data sets that have supposedly been anonymized by the removal of key attributes

110

Using Personal Information

• This is also fundamentally a database and security problem

• Here is the data mining twist:

• When a user submits data and clicks “I accept” on the use policy, have they meaningfully been informed that they may be data mined?

111

• Mining data and making decisions based on it may affect that user or others

• It is not the same as static data collection and dissemination

112

Wider Issues

• The book doesn’t express it this way, but these are questions that statisticians have long faced:

• A result may be statistically significant, but is it of practical significance?

• If so, what?

113

• As noted earlier, what are the chances and consequences of false positives and false negatives?

• Also, if you can find correlations (associations), do they imply cause and effect, either one way or the other?

• Be leery of falsely inferring cause and effect, which statistics alone can’t establish

114

• The eternal question is, if you gain new knowledge, how do you use it?

• Do you use your powers for good or evil?

• Who defines good and evil?

115

1.7 Further Reading

• In general, I’m not going to cover the further reading sections

• You may want to look at them to get ideas for your paper or project

116

The End