

Data Mining
Chapter 4

Algorithms: The Basic Methods

Kirk Scott


Basic Methods

• A good rule of thumb is to try the simple things first

• Quite frequently the simple things will be good enough or will provide useful insights for further explorations

• One of the meta-tasks of data mining is figuring out which algorithm is the right one for a given data set


• Certain data sets have a certain structure

• Certain algorithms are designed to elicit particular kinds of structures

• The right algorithm applied to the right set will give straightforward results

• A mismatch between algorithm and data set will give complicated, cloudy results


• Here are the remaining sections of Chapter 4: four more basic algorithm descriptions plus a ninth topic

• 4.5 Mining Association Rules
• 4.6 Linear Models
• 4.7 Instance-Based Learning
• 4.8 Clustering
• 4.9 Multi-Instance Learning


4.5 Mining Association Rules


• In theory you could mine association rules in a way similar to the way you mine classification rules

• The problem is that there are too many potential rules of this form:

• (some subset of attributes) → (some other subset of attributes)

• The concepts of support and confidence help limit the number to consider


Defining Support, Again

• Depending on how you read the book, there is some confusion about how to define support

• If you're looking at X → Y, the original explanation of support seemed to say this:

• Support = (number of instances containing X) / (total number of instances)

• Confidence = (number of instances containing X and Y) / (number of instances containing X)


• This is the updated definition of support:

• Support = (number of instances containing X and Y) / (total number of instances)

• In other words, it's not how frequent the protasis is, but how frequent the rule is

• Confidence is unchanged

• Confidence = (number of instances containing X and Y) / (number of instances containing X)
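• To make these definitions concrete, here is a minimal Python sketch (my own illustration, not from the text) that computes support and confidence for a candidate rule X → Y over a small, made-up set of instances; the attribute names and values are hypothetical:

instances = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true",  "play": "no"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true",  "play": "no"},
]

def matches(instance, conditions):
    # True if the instance agrees with every attribute-value pair in conditions
    return all(instance.get(attr) == val for attr, val in conditions.items())

def support(instances, x, y):
    # Updated definition: fraction of all instances containing both X and Y
    both = sum(1 for inst in instances if matches(inst, x) and matches(inst, y))
    return both / len(instances)

def confidence(instances, x, y):
    # Fraction of the instances containing X that also contain Y
    x_count = sum(1 for inst in instances if matches(inst, x))
    both = sum(1 for inst in instances if matches(inst, x) and matches(inst, y))
    return both / x_count if x_count else 0.0

x = {"outlook": "sunny"}   # protasis
y = {"play": "no"}         # apodosis
print(support(instances, x, y))     # 0.5
print(confidence(instances, x, y))  # 1.0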


• The definition of support differs in the two cases, but the confidence measure is the same

• To summarize, the initial presentation of support says a rule is interesting if the protasis alone occurs frequently

• The second presentation of support says a rule is interesting if the protasis + apodosis together occur frequently


• In other words, this is an absolute measure of the frequency of the truth of a rule

• We are interested in rules if they are frequently true

• We are not necessarily interested in rules just because the protasis occurs frequently


• In both cases the measure of confidence is the same

• It is a relative measure of the frequency of the truth of a rule

• Out of all the times that the protasis is true, how many times is the rule true?

• Looking back, the second presentation of support is probably the more useful one


• An X that implies results all over the map would not be useful

• An X that frequently implies Y is worth considering further

• It will be of genuine interest if those cases where it implies Y are in the majority

• The initial presentation of the concept of support was incomplete


Item Sets

• The first step in mining association rules is to set a minimum support level

• Support is based on X and Y occurring together

• We are interested in mining rules where combinations of values of both occur sufficiently frequently


• The set of combinations of values which meet the minimum support threshold is known as the item set

• Note that we aren’t talking about a single pair of X and Y

• The item set is all X’s and Y’s that meet the support threshold


Finding Item Sets

• There is a straightforward way to determine support for potential rules

• Logically, it falls into 2 parts:

• Part 1:

• Count how many times particular pairs of values occur in instances in the data set, for every possible pairing of attributes

• Do the same for triplets, quadruplets, etc.


• These tuples are candidates for rules

• If the count of the number of occurrences meets the support threshold, then they will be considered

• This is the basic item set


• Part 2 of the process:

• Pick any one of the elements of the basic item set

• Find all of the possible binary partitionings of the item

• The partitionings aren't just breaking a sequence of attributes in 2, left and right

• The partitionings are all those possible among the set of attributes in the item set


• Once you’ve done the partitioning, one half of the attributes is the protasis and the other half is the apodosis

• In other words, we’ve picked an X and a Y for a rule X Y

• (Now that the different attributes are grouped, you can speak in terms of the left hand side and the right hand side)


• For each potential rule so constructed, you calculate the confidence

• As part of including the item in the item set, you counted the occurrence of X and Y together

• Now count how many times only X occurs in the data set

• Confidence = Count(X and Y) / Count(X)


In Summary

• Find the item set of tuples which meet the support threshold

• Form every possible partitioning of the item set tuples into two disjoint subsets, forming a rule X → Y

• Test each such rule to see if it meets the confidence threshold

• The end result is the set of association rules mined from the data set
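• As a rough Python sketch of this two-part process (my own illustration, not the book's code), the following takes one frequent item set, enumerates every binary partitioning of it into a protasis X and an apodosis Y, and keeps the rules that meet a confidence threshold:

from itertools import combinations

# Each instance is a frozenset of attribute-value pairs; so is an item set
instances = [
    frozenset({("outlook", "sunny"), ("windy", "false"), ("play", "no")}),
    frozenset({("outlook", "sunny"), ("windy", "true"), ("play", "no")}),
    frozenset({("outlook", "rainy"), ("windy", "false"), ("play", "yes")}),
]

def count(instances, items):
    # Number of instances that contain every pair in the item set
    return sum(1 for inst in instances if items <= inst)

def rules_from_item_set(item_set, instances, min_confidence):
    rules = []
    items = list(item_set)
    # Every non-empty proper subset can serve as the protasis X;
    # the remaining pairs form the apodosis Y
    for r in range(1, len(items)):
        for x in combinations(items, r):
            x = frozenset(x)
            y = item_set - x
            conf = count(instances, item_set) / count(instances, x)
            if conf >= min_confidence:
                rules.append((x, y, conf))
    return rules

item_set = frozenset({("outlook", "sunny"), ("play", "no")})
for x, y, conf in rules_from_item_set(item_set, instances, 0.8):
    print(dict(x), "->", dict(y), "confidence", conf)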


Generating Rules Efficiently

• The process for finding rules described above is not efficient

• It’s not as bad as starting at ground 0 and blindly checking all possible rules

• However, it’s still not efficient enough to be practical

• Both support and confidence can be used to limit the number of things that have to be checked


• Reducing work based on support:

• The critical observation is this:

• At every stage, if an n-element item set doesn't have support, a superset of it containing (n + 1) elements won't have support


• Why is this true?

• Item sets are sets of values for a given set of attributes

• Suppose x1, x2, …, xn is a set of values that doesn't meet the support threshold

• No matter what xn+1 is, x1, …, xn, xn+1 will also not meet the support threshold

• At most, it will have the same level of support


• This is the work-reducing enhancement to the association rule mining algorithm based on this observation about support:

• When determining support for item sets, work from 2 element item sets to 3 element item sets, …, to n element item sets


• Whenever you find an item set of size n which doesn’t have support, there is no need to consider any supersets of that item set

• For a given item set that didn’t achieve support, adding an attribute-value pair to the item set cannot result in a higher level of support
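• The following Python sketch (a rough illustration, not the book's code) shows the level-wise idea: candidate item sets of size n + 1 are built and counted only when their n-element subsets already met the support threshold:

from itertools import combinations

def frequent_item_sets(instances, min_support_count):
    # instances: list of frozensets of attribute-value pairs
    # Level 1: single attribute-value pairs that meet the support count
    pairs = {p for inst in instances for p in inst}
    current = [frozenset([p]) for p in pairs
               if sum(1 for inst in instances if p in inst) >= min_support_count]
    all_frequent = list(current)
    while current:
        # Merge n-element frequent sets into (n + 1)-element candidates
        candidates = {a | b for a in current for b in current if len(a | b) == len(a) + 1}
        # Prune: every n-element subset must itself be frequent; then count support
        current = [c for c in candidates
                   if all(frozenset(s) in all_frequent for s in combinations(c, len(c) - 1))
                   and sum(1 for inst in instances if c <= inst) >= min_support_count]
        all_frequent.extend(current)
    return all_frequent

instances = [
    frozenset({("outlook", "sunny"), ("play", "no")}),
    frozenset({("outlook", "sunny"), ("play", "no")}),
    frozenset({("outlook", "rainy"), ("play", "yes")}),
]
print(frequent_item_sets(instances, min_support_count=2))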


• Reducing work based on confidence:

• A similar kind of logic holds when evaluating rules

• We have already seen this in action

• A strong rule is one that implies a weaker rule

• Analytically, a strong rule has more conditions in the apodosis than a weaker one


• Work from weaker, simpler rules to stronger ones

• If the weaker rule doesn’t meet the confidence threshold, then stronger rules that imply that weaker rule won’t meet the confidence threshold

• In other words, if X → y1 doesn't hold, then X → (y1 and y2) won't hold


• From an algorithmic point of view, given candidate rules of the forms X → y1 and X → (y1 and y2)

• Check rules with fewer conditions in the apodosis first

• For any of those that don’t meet the confidence threshold, you don’t have to check the rules with the same protasis and more conditions in the apodosis


• At every stage, certain rules can be eliminated from consideration because other rules were eliminated at an earlier stage

• Any rules left standing will be those that met the confidence threshold


Discussion

• Association rule mining is computationally intensive

• Scanning many possibilities is costly

• Good algorithms are at a premium

• If you can't make progress towards a better algorithm, a better scheme for representing/compressing the data may be an approach to achieving better performance


4.6 Linear Models


• The techniques covered so far have been related to trees and rules

• Trees and rules work most naturally with nominal attributes

• They can also be adapted to work with numeric attributes


• Linear models are distinct from tree and rule based approaches in this way:

• They are a classic numeric method designed to work with numeric attributes

• On the whole, it is less straightforward to apply numeric methods to nominal attributes

• However, adaptations can be made to accomplish this


Numeric Prediction: Linear Regression

• Ideas based on linear regression were discussed in an earlier set of overheads

• You can do a pure statistical linear regression

• You can also build a tree and then do prediction in the leaves based on a numeric model

• The new material in the following sections consists largely of adaptations and extensions of these ideas


Linear Classification: Logistic Regression

• Recall: A simple linear model does numeric prediction

• The idea now is to adapt numeric methods on numeric attributes to arrive at a binary 1/0 decision whether an instance falls into a given class

• This subsection actually contains two topics: Multi-response linear regression and logistic regression


Multi-Response Linear Regression

• This is the basic approach:

• Identify the classes

• We are explicitly thinking in terms of data sets where there are potentially n > 2 classes

• Let the value 1 signify membership in a class

• The value 0 then signifies non-membership


• Do linear regression on the training set for each of the n classes separately

• Set the equation = 1 for those instances in a given class

• Set the equation = 0 for those instances not in a given class


• Solving the regression will give you n equations, one for each class, where for members of the class the equation gives 1 or a value close to 1

• For instances that are not elements of the class, the equation gives 0 or a value close to 0


Using the Regression Equations

• When you get an instance to predict, plug its attribute values into each of the n regression equations

• Each equation is liable to give different values

• Classify the instance according to which equation gives the highest value (the one which gives the value closest to 1, as opposed to 0)
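• A small Python/NumPy sketch of this idea (my own illustration, not the book's code): fit one least-squares linear model per class against 0/1 membership targets, then classify a new instance by whichever model's output is largest:

import numpy as np

def fit_multi_response(X, y, n_classes):
    # X: (m, n) numeric attribute matrix; y: integer class labels 0 .. n_classes - 1
    # Append a column of 1s so each regression has an intercept term
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    weights = []
    for c in range(n_classes):
        target = (y == c).astype(float)      # 1 for members of the class, 0 otherwise
        w, *_ = np.linalg.lstsq(Xb, target, rcond=None)
        weights.append(w)
    return np.array(weights)                 # one weight vector per class

def classify(weights, x):
    xb = np.append(x, 1.0)
    return int(np.argmax(weights @ xb))      # class whose equation gives the largest value

# Tiny made-up example: two numeric attributes, three classes
X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 5.1], [4.8, 5.3], [9.0, 1.0], [8.7, 1.2]])
y = np.array([0, 0, 1, 1, 2, 2])
w = fit_multi_response(X, y, n_classes=3)
print(classify(w, np.array([5.1, 5.0])))     # expected: 1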


Shortcomings of Multi-Response Linear Regression

• Multi-response linear regression can give good results

• On the other hand, it’s not theoretically perfect

• Problem 1:

• You want function output values of 1 (or 0), but a linear equation isn't constrained to this range


• Problem 2:

• The classic technique of linear regression depends on the assumption that you're working with a normal distribution

• You’re trying to produce discrete results, 0 or 1, so by definition, the normality assumption is actually violated


Logistic Regression

• This technique is designed to avoid the shortcomings of multi-response regression

• The math is a little complicated

• Note that in the text the mathematical explanation is in a box, which means it's beyond the scope of this course

• Still, the basic idea is simple


• Instead of fitting the linear expression directly to the 0/1 class value, you fit it to a transformed target: the log odds of class membership, log(P / (1 − P)), where P is the probability that the instance belongs to the class

• With that transformation, you have a continuous quantity to which classical statistical fitting methods can be applied
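• As a rough sketch of the idea (my own, not the boxed math in the text): the model treats the log odds of class membership as a linear function of the attributes, and inverting that transform gives the familiar sigmoid, so the predicted probability always stays between 0 and 1:

import math

def log_odds(p):
    # The transformed target that is modeled as a linear function of the attributes
    return math.log(p / (1.0 - p))

def probability(weights, x):
    # Invert the transform: linear score -> probability, via the sigmoid function
    score = sum(w * xi for w, xi in zip(weights, x))   # assumes x includes a leading 1 for the intercept
    return 1.0 / (1.0 + math.exp(-score))

p = probability([0.5, -1.0, 2.0], [1.0, 3.0, 1.5])     # hypothetical weights and attribute values
print(p, log_odds(p))                                  # p is in (0, 1); log_odds(p) equals the linear score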


• There may come a time later when you want to apply the technique and a better understanding of the math may be required

• If that time comes, you can delve into the contents of the box

• In the meantime, you don’t have to worry about why this works or how it was derived


Linear Classification Using the Perceptron

• Don’t be confused by the funny name• It will be explained at the end


• Suppose you have 2 classifications which are represented by 1, 0

• If your data set has n other attributes, you’re in n space

• Now suppose your data points are linearly separable into the two classifications

• Informally, this means the clusters aren’t mixed together


• Formally, linear separability means there is a hyperplane (a linear construct in n space) that falls midway between the clusters

• If the data points are linearly separable, there is a simple iterative algorithm that will converge on the equation of the hyperplane


• The hyperplane is a linear equation in the attributes, ai, with coefficients, wi, for each of them

• The equation takes this form:

• w0a0 + w1a1 + … + wnan = 0, where a0 is an extra attribute whose value is always 1

• Including the constant attribute a0 = 1 and setting the sum = 0 is just a mathematical convenience

• That way you aren't dealing with solving for a separate constant in the equation


• The extra attribute a0, which is always 1, is technically known as the bias

• Doing the equation this way affects the values for wi that you derive

• But it doesn’t affect the fact that the complete set of wi’s will separate the two clusters


• Given the right set of wi’s, for some particular set of ai values

• w0a0 + w1a1 + … + wnan > 0

• That is, the instance falls “above” the line


• Similarly, for some particular set of ai values

• w0a0 + w1a1 + … + wnan < 0

• That is, the instance falls "below" the line

• For the purposes of binary classification (separating the instances into two clusters), arbitrarily let one class be "above" the line and the other be "below" it


Algorithm for Finding the wi

• For the foregoing scenario, this is the algorithm for finding the wi:

• Initialize all wi = 0

• For each instance in the training set:

• If the instance is classified correctly by the equation with the wi as they stand, do nothing


• If the instance isn’t classified correctly:• If the instance is a member of the “above”

class (sum of eqn should be > 0) add its ai to the wi

• If the instance is a member of the “below” class (sum of eqn should be < 0) subtract its ai from the wi


• After every iteration through the training set where an adjustment was made, you’ll have to make another iteration through to see if everything (else still) works

• Stop when every instance is classified correctly by the equation with the derived wi’s
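• Here is a minimal Python sketch of the perceptron learning rule as just described (my own illustration; attribute 0 is the bias attribute, always 1):

def train_perceptron(instances, labels, max_epochs=100):
    # instances: attribute vectors, each with a leading bias attribute of 1
    # labels: +1 for the "above" class, -1 for the "below" class
    w = [0.0] * len(instances[0])                  # initialize all wi = 0
    for _ in range(max_epochs):                    # terminates early only if the data is linearly separable
        changed = False
        for a, label in zip(instances, labels):
            s = sum(wi * ai for wi, ai in zip(w, a))
            if label > 0 and s <= 0:               # should be "above" but isn't: add its ai to the wi
                w = [wi + ai for wi, ai in zip(w, a)]
                changed = True
            elif label < 0 and s >= 0:             # should be "below" but isn't: subtract its ai from the wi
                w = [wi - ai for wi, ai in zip(w, a)]
                changed = True
        if not changed:                            # every instance classified correctly: stop
            break
    return w

# Tiny linearly separable example: bias attribute first, then one real attribute
data = [[1.0, 2.0], [1.0, 3.0], [1.0, -2.0], [1.0, -3.0]]
labels = [+1, +1, -1, -1]
print(train_perceptron(data, labels))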


• If the data set is linearly separable, this algorithm will converge

• If it’s not linearly separable, it won’t converge

• The algorithm gets its name because it is analogous to the computation done in a one-layer neural network, called a perceptron


Linear Classification Using Winnow

• Winnow can be summarized as follows:

• It finds a separating hyperplane, like the perceptron

• It is designed for use with data sets where all of the attributes are binary (1, 0)

• Instead of comparing the weighted sum to 0, the user can specify a threshold value, θ


• The user chooses another factor, α

• If an instance should classify "above" the line but doesn't, you multiply by α the wi of the attributes whose value is 1 in that instance

• If an instance should classify "below" the line but doesn't, you multiply those wi by 1/α

• Winnow is especially good at focusing classification on those attributes that make a difference and ignoring those that don’t
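• A rough Python sketch of the Winnow update for binary attributes (my own illustration; θ and α are the user-chosen threshold and multiplier, and only the weights of attributes whose value is 1 get adjusted):

def train_winnow(instances, labels, alpha=2.0, theta=None, epochs=20):
    # instances: lists of 0/1 attribute values; labels: 1 for "above" the threshold, 0 for "below"
    n = len(instances[0])
    theta = n / 2.0 if theta is None else theta
    w = [1.0] * n                                   # multiplicative updates, so start the weights at 1
    for _ in range(epochs):
        for a, label in zip(instances, labels):
            above = sum(wi * ai for wi, ai in zip(w, a)) > theta
            if label == 1 and not above:
                # Should be above but isn't: promote the weights of the attributes that are 1
                w = [wi * alpha if ai == 1 else wi for wi, ai in zip(w, a)]
            elif label == 0 and above:
                # Should be below but isn't: demote the weights of the attributes that are 1
                w = [wi / alpha if ai == 1 else wi for wi, ai in zip(w, a)]
    return w

# Toy example: only the first attribute actually matters for the class
data = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
labels = [1, 1, 0, 0]
print(train_winnow(data, labels))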


• Both the perceptron and winnow approaches are good for situations where new instances keep getting added to the data set

• Just begin iteration again in order to find the new separating hyperplane


4.7 Instance-Based Learning


• This is the basic idea behind instance-based learning:

• You don’t derive rules• You don’t derive a tree• You keep the complete training set at hand• You don’t use it to train anything in

advance


• When a new instance arrives, this is how you classify it:

• Based on its attribute values, it has a location in n-space

• You give the new instance the classification of whatever instance in the training set is closest to it in n-space


• This simple idea leads to all sorts of considerations:

• How do you define distance?

• Manhattan distance, Euclidean distance, distance with powers higher than 2

• Note that the higher the power, the more emphasis on those attributes with greater differences between instance values


• This leads to other questions:

• Are all the measurements on the same scale?

• Are they equally important?

• Does it make sense to somehow equalize attributes by normalizing them to a scale of 0-1?


• A random philosophical observation:

• Consider measurements of humans, like height and eye color

• Aren't they intrinsically incommensurable?


• Can a data mining scheme “objectively” distinguish the validity of two classification schemes, one heavily influenced by one attribute, another heavily influenced by the other attribute?

• Or is human judgment needed?


• Another aspect of this:

• Calculating distance assumes that attributes are numeric

• With nominal attributes, like eye color, it's not just eye color and height that are incommensurable

• Ordering and distance between the eye colors are not well defined


• The solution is to make assumptions like the following:

• If two nominal values are equal, the distance between them is 0

• In a normalized scale, if two nominal values are different, the distance between them is 1

• The different nominal values are equally, maximally distant from each other


• Similar questions can be asked about the treatment of missing attribute values

• Similar solutions are possible

• The distance between a missing value and an existing value can be deemed to be the maximum distance possible under whatever scale is being used
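• These conventions are easy to express in code; the following Python sketch (my own, assuming numeric attributes are already normalized to the 0-1 scale and None marks a missing value) computes a per-attribute distance along these lines:

def attribute_distance(a, b):
    # Missing value: assume the maximum possible distance on the normalized scale
    if a is None or b is None:
        return 1.0
    # Nominal values: 0 if equal, otherwise the maximum distance of 1
    if isinstance(a, str) or isinstance(b, str):
        return 0.0 if a == b else 1.0
    # Numeric values, assumed already normalized to the 0-1 range
    return abs(a - b)

def euclidean_distance(x, y):
    return sum(attribute_distance(a, b) ** 2 for a, b in zip(x, y)) ** 0.5

print(euclidean_distance([0.2, "blue", None], [0.5, "brown", 0.9]))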


Finding Nearest Neighbors Effectively

• Finding nearest neighbors by brute force is computationally daunting

• For m points in the training set and n attributes per instance:

• m times you would be calculating an n-dimensional distance and comparing

• The order of complexity of search/classification is linear in the size of the training set


kD-tree Representations of Data Sets

• Representing the data in a convenient form can ease computational problems

• Suppose the training set space can be represented in a roughly balanced tree

• Looking for the nearest neighbor would be based on tree traversal

• With binary splits, the complexity of the algorithm would be on the order of log2(m), where m is the number of training instances


Explaining the kD-tree Representation

• In the naming convention, kD stands for k-dimensional

• Instead of referring to the number of attributes as n, you refer to it as k

• The highest number of dimensions that can be represented unambiguously on the printed page is two

• Before this is over with, illustrations will be given with k = 2


• The basic idea of a kD-tree can be summarized in this way:

• An internal node in the tree represents a point on a boundary line in k-space

• How to understand this?

• At each internal node, you make a decision based on one attribute: yes/no, in/out, < / >


• The node is binary and represents a boundary

• The left branch represents the region on one side, including all of the data points located there

• The right branch represents the region on the other side, including all of the data points located there


• In 2-space the boundary line is either horizontal or vertical

• In this case you would have 2 attributes, x and y

• A decision based on a value of x would be vertical

• A decision based on a value of y would be horizontal


• In k-space, the boundary is a hyperplane orthogonal to the axis the decision was made on

• This is not too hard to understand, because it’s still possible to visualize what happens in 3 dimensions

• If you set x = constant and let y and z vary, what you get is a 2-dimensional plane perpendicular to the x axis and parallel to the y and z axes


• In the book’s example, the leaf nodes contain single points in the partitions defined by all of their parents

• We have already observed that trees that partition the space into single instances are not necessarily very useful

• It is possible to devise an algorithm (rule of thumb) that stops the partitioning at some point


• Stopping the partitioning limits the depth of the tree

• The result is that leaf nodes contain some minimum number of data points in the lowest level partitions

• (The rule of thumb could be as simple as stopping partitioning when the minimum is reached)


An Illustration

• Figure 4.12 is shown on the following overhead

• It illustrates the partitioning of space using a kD-tree

• The figure will be followed by comments

[Figure 4.12: partitioning of the instance space by a kD-tree]

• In presenting this, the book explains what you see without fully explaining why

• I will just repeat what you see—but all the answers to why are simply not known

• The partitioning is based on arriving instances of ordered pairs

• The tree is two levels deep and at each level a decision can be made on each coordinate


• The first arrival point is (7, 4) and the decision is made on the y value

• At the first node I would say you’re testing y < 4 or y > 4

• (7, 4) is an internal node, so it’s a boundary node

• It defines the boundary line y = 4


• Without further explanation, the point (2, 2) is not expanded further

• The second internal node is based on the point (6, 7)

• This gives the boundary line x = 6

• Notice that this boundary does not extend down into the first partition

• No further partitioning is done on (3, 8)


Using a kD-tree for Classification

• A full explanation wasn’t given about why the tree was formed as it was

• The point was more to provide a simple tree that can be used to illustrate classification

• Regardless of how the tree was derived, it is a representation of the data space based on the points that have arrived so far


• Given the kD-tree representation of the data space, how do you use it for classification?

• Let a new instance arrive

• Traverse the tree, determining which path to follow by comparing instance values to node values

• Eventually, the new point will be placed in a leaf


• Find the new point’s nearest neighbor in the leaf

• Naturally, this means that you found the distance, d, between the point and its nearest neighbor in the leaf

• Now work your way back up the tree (backtrack)


• Check to see whether any part of any of the other partitions is within the distance d of the point

• In other words, compare d with the distance between the point and the boundary lines of any of the higher partitions

• Checking this is easier than using the distance formula


• Suppose the distance from the point to a higher partition’s boundary is less than d

• Then the point’s actual nearest neighbor in the set may be in that other partition

• Check the distance between the point and the points in that partition

• Pick the absolute nearest neighbor out of all of the partitions checked
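• In practice you would rarely code the build-and-backtrack search by hand; as a quick sketch (my own, not from the text), a library kD-tree implementation such as scipy's can do the nearest-neighbor search, with the training labels supplying the classification:

import numpy as np
from scipy.spatial import KDTree

# Made-up 2-D training points and their class labels
points = np.array([[7, 4], [6, 7], [3, 8], [2, 2], [8, 8]], dtype=float)
labels = ["a", "b", "b", "a", "b"]

tree = KDTree(points)                      # builds the kD-tree representation of the data space
dist, idx = tree.query([5.5, 6.0], k=1)    # nearest-neighbor search; backtracking is handled internally
print(labels[int(idx)], dist)              # classify the new point by its nearest neighbor's label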


• The moral of the story is that the kD-tree is a way of representing the space

• However, you don’t blindly put instances into the partitions defined by the kD-tree

• The tree serves as a tool for finding a nearest neighbor without having to calculate all distances between the new point and the existing points


• In effect, it’s another kind of computation rule of thumb

• See Figure 4.13 on the following overhead

• It illustrates finding the nearest neighbor of a new point using the kD-tree illustrated earlier

[Figure 4.13: finding the nearest neighbor of a new point using the kD-tree]

Efficiency

• Even with trees that aren’t perfectly balanced

• Even with backtracking

• The order of complexity of classification using a kD-tree can approach log2(m), where m is the number of training instances


• This raises these follow-on issues:

• What are the best heuristics for forming trees so that they're well-balanced?

• Speaking generically, this means splitting so that the rectangular partitions tend to be square-like


• This goes back to the unanswered questions from when the example tree was formed

• Why were certain points chosen as the internal nodes to partition on?

• Why were some nodes left as leaves and not expanded?


• The questions can be phrased as whether to split, and if so, whether to split horizontally or vertically

• Guidelines include the following:

• Split on the dimension that has the greatest variance (the greatest spread)

• Choose a split point that is near the mean of the distribution along that axis


The Next Stage

• There is a mismatch inherent in the scheme as described thus far:

• The partitions are rectangular

• But the distance calculation for the nearest neighbor leads to a circle (hypersphere)

• In essence, this is what makes backtracking necessary


Hyperspheres (balls) Instead of Rectangular Partitions

• Rectangles have the advantage that they can partition a space without overlapping

• They have the disadvantage that a given circle may overlap >1 rectangle

• In 2-space, for example, where corners of rectangles meet, part of a neighborhood circle might lie in >1 rectangle


• A natural alternative approach would be to partition space using spheres

• The complication here is that if spheres are to cover the whole space, they will have to overlap

• Covering space with spheres leads to a data structure known as a ball tree (instead of a kD-tree)


• Ball trees are formed from the top down

• In the space of m points, find the centroid (I call this the center of gravity)

• This will be the center of the original hypersphere

• Then pick a radius large enough to enclose all of the points


• How to subdivide a sphere:

• Pick the point farthest from the center

• Pick the point farthest from that point

• Partition the points in the sphere according to which of these two points they're closest to


• Take the centers of gravity of those two clusters as the centers of two "sub-spheres"

• Pick radii large enough to enclose the points of each
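• A rough Python sketch of this subdivision step (my own illustration, not the book's code):

import numpy as np

def split_ball(points):
    # points: (m, k) array of the points inside one ball
    center = points.mean(axis=0)                                       # centroid of the ball
    p1 = points[np.argmax(np.linalg.norm(points - center, axis=1))]    # point farthest from the center
    p2 = points[np.argmax(np.linalg.norm(points - p1, axis=1))]        # point farthest from that point
    closer_to_p1 = (np.linalg.norm(points - p1, axis=1)
                    <= np.linalg.norm(points - p2, axis=1))
    sub_balls = []
    for cluster in (points[closer_to_p1], points[~closer_to_p1]):
        c = cluster.mean(axis=0)                                       # center of gravity of the sub-cluster
        r = np.linalg.norm(cluster - c, axis=1).max()                  # radius large enough to enclose its points
        sub_balls.append((c, r, cluster))
    return sub_balls

pts = np.array([[0.0, 0.0], [1.0, 0.5], [9.0, 9.0], [8.5, 9.5]])
for center, radius, members in split_ball(pts):
    print(center, radius, len(members))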


• Note that using this process, in effect you would quit partitioning when you reached spheres containing only two points

• You could also quit partitioning when you reached some arbitrary minimum


• Balls, including leaf nodes, may overlap

• By definition, superset-subset balls will contain the same points

• Leaf nodes may also share points

• This can happen at those points on the edges of the circles where they intersect, not in their interiors where they overlap


• The hypersphere partitioning of space can be represented as a tree data structure, the ball tree

• See Figure 4.14 on the following overhead

[Figure 4.14: a ball tree and the hyperspheres it represents]

Traversing the Ball Tree

• You use the ball tree like a kD-tree

• Given a new instance, traverse the tree until you reach the leaf node that contains it

• Find the point’s nearest neighbor in the leaf node


• Now consider this case:

• The distance from the point to its nearest neighbor in the ball is d

• The distance from the point to some other ball is < d

• This means that the point's nearest neighbor may be in the other ball


• Like with kD-trees, you have to backtrack through the tree finding any such balls if they exist

• If they do exist, you search for the point’s nearest neighbor(s) in those balls

• The algorithm concludes by picking the actual nearest neighbor out of all of the candidates found in various leaf balls


• Like with the trees, the balls are a representation of the data space

• They support a scheme for finding the nearest neighbor

• But backtracking means that new instances are not simply stored in the balls of the tree


• In other words, the balls partition the space, but the balls themselves may not represent the classification of the instances


Discussion

• In the schemes covered so far, kD-trees and ball trees, classification was ultimately done by finding the nearest neighbor

• This falls short if the data is noisy

• What this means is that classifications tend to be interspersed rather than clustered


• The term noisy suggests that the interspersion occurs because some of the classification values are wrong

• Interspersion can also result because your data is incomplete

• Add another attribute, another dimension, and it could be that there is a clean boundary between classes


• At any rate, a practical solution to the problem in the absence of any more knowledge is a k nearest neighbor approach

• In this phrase, k doesn’t mean the number of attributes (dimensions in space)

• It means the number of instances that the algorithm is based on


• You don’t classify a new instance by its single nearest neighbor

• You classify it by the majority vote (count of the classifications) of its k nearest neighbors, for some selected value of k

• For data sets conforming to certain desirable mathematical/statistical properties, the k nearest neighbor approach has a high level of success
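• A minimal Python sketch of the k-nearest-neighbor vote (my own illustration):

from collections import Counter

def knn_classify(train_points, train_labels, new_point, k=3):
    # Sort the training instances by (squared) Euclidean distance to the new point
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, new_point)), label)
        for p, label in zip(train_points, train_labels)
    )
    # Majority vote among the k nearest neighbors
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(points, labels, (2, 2), k=3))   # "a"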


• A brute force implementation of a nearest neighbor algorithm is not computationally attractive

• kD-trees are much better, but above about dimension 10 they become unattractive

• More general structures like ball trees are amenable to algorithms that will effectively handle much greater numbers of attributes


• The nearest neighbor approach can be simplified

• The simplification isn’t very accurate, but it provides a tool for initial analysis of data

• The idea is to more or less do a statistical summary of the classifications in a data set

• I.e., find a location for the classification in space


• Thinking visually, the idea is that you more or less say that a given ball, with center X and radius R is the region associated with a classification

• Then when new instances arrive, you just check to see what ball they fall into

• In effect, you’re saying now that the balls of a tree do correspond to the classification


• We have come full circle

• This method is no longer an instance-based method

• It is a method where "rules" (spatial locations) are derived from the data

• Then you test instances against the rules (see whether they fall into those locations)


4.8 Clustering


• Clustering algorithms can lead to different results:

• Clusters may be mutually exclusive

• They may overlap

• Instances may be assigned probabilities that they are in certain clusters

• They may be hierarchical (recall dendrograms)


Iterative Distance-Based Clustering

• The classic clustering technique is called k-means

• You specify how many clusters you want to find, k

• Choose k of the data points at random

• Partition the space by assigning points to clusters based on which of the k points they're nearest to


• Now find the centers of gravity of these initial clusters

• Repeat the process of assigning points based on their nearness to the centers of gravity

• Iterate until no point changes cluster
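• A compact Python/NumPy sketch of the k-means loop just described (my own illustration; it assumes no cluster ever ends up empty):

import numpy as np

def k_means(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Choose k of the data points at random as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Assign every point to the nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Recompute each center as the center of gravity of its cluster
        new_centers = np.array([points[assignment == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):      # stop when no point changes cluster
            return centers, assignment
        centers = new_centers

pts = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 8.5], [9.0, 7.5]])
centers, assignment = k_means(pts, k=2)
print(centers)
print(assignment)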


• This is simple, effective, and highly unsatisfying

• The algorithm converges, achieving stability

• But the clusters you get are highly dependent on the choice of the initial k points

• And who said there were k clusters in the data set anyway?


• In effect, when clustering, you're trying to minimize the sum of the squared distances between the points in each cluster and the center point of that cluster

• These centers are dependent on the initial choice of k points

• The algorithm only finds a local optimum


• How would you find a global optimum?

• Consider all possible numbers of clusters and all possible assignments of points to those clusters

• Pick that clustering which minimizes the distances over all points and all cases


• The cost of actually trying all possibilities to find the global optimum is prohibitive

• To see why, just try coming up with an expression for the number of possible subsets of a data set

• Then contemplate calculating the distance for each point in them


• There is something else to consider here

• The algorithm is designed to find "compact" clusters

• Consider once again the yin and yang symbol, with some space separating the two halves


• Would k-means successfully group the points in the tails with their respective heads, or would it group them with the heads of the other half, for example?

• Without adjustments, most simple-minded approaches are prejudiced in favor of finding disjoint "spheres" as the clusters

• But the reality is that clusters may come in various shapes


• A simple approach to improving the results of the k-means algorithm is this:

• Perform the algorithm several times with different initial seed points

• Then simply pick the one that seems to give the best results (lowest distance calculation)


• A further refinement of k-means is known as k-means++

• Choose the first of the seed points at random

• Then choose successive seed points with a probability proportional to the square of their distance from the nearest seed already chosen

• This will tend to spread the seeds out through space in a desirable, random way
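• A short Python/NumPy sketch of k-means++ seeding (my own illustration); each new seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far:

import numpy as np

def k_means_pp_seeds(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # First seed: chosen uniformly at random from the data points
    seeds = [points[rng.integers(len(points))]]
    while len(seeds) < k:
        # Squared distance from every point to its nearest existing seed
        d2 = np.min(np.linalg.norm(points[:, None, :] - np.array(seeds)[None, :, :], axis=2) ** 2, axis=1)
        probs = d2 / d2.sum()
        seeds.append(points[rng.choice(len(points), p=probs)])
    return np.array(seeds)

pts = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8], [10.0, 0.0]])
print(k_means_pp_seeds(pts, k=3))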


Faster Distance Calculations


4.9 Multi-Instance Learning


The End

