

Data Mining
Chapter 4

Algorithms: The Basic Methods

Kirk Scott


Basic Methods

• A good rule of thumb is to try the simple things first

• Quite frequently the simple things will be good enough or will provide useful insights for further explorations

• One of the meta-tasks of data mining is figuring out which algorithm is the right one for a given data set


• Certain data sets have a certain structure

• Certain algorithms are designed to elicit particular kinds of structures

• The right algorithm applied to the right set will give straightforward results

• A mismatch between algorithm and data set will give complicated, cloudy results


• Here are the remaining sections of Chapter 4: four more basic algorithm descriptions plus a ninth topic

• 4.5 Mining Association Rules
• 4.6 Linear Models
• 4.7 Instance-Based Learning
• 4.8 Clustering
• 4.9 Multi-Instance Learning


4.5 Mining Association Rules


• In theory you could mine association rules in a way similar to the way you mine classification rules

• The problem is that there are too many potential rules of this form:

• (some subset of attributes) → (some other subset of attributes)

• The concepts of support and confidence help limit the number to consider


Defining Support, Again

• Depending on how you read the book, there is some confusion about how to define support

• If you're looking at X → Y, the original explanation of support seemed to say this:

• Support = (number of instances containing X) / (total number of instances)

• Confidence = (number of instances containing X and Y) / (number of instances containing X)


• This is the updated definition of support:

• Support = (number of instances containing X and Y) / (total number of instances)

• In other words, it's not how frequent the protasis is, but how frequent the rule is

• Confidence is unchanged

• Confidence = (number of instances containing X and Y) / (number of instances containing X)
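• To make these definitions concrete, here is a minimal Python sketch (my own illustration, not from the text) that computes support and confidence for a candidate rule X → Y over a small, made-up set of instances; the attribute names and values are hypothetical:

instances = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true",  "play": "no"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true",  "play": "no"},
]

def matches(instance, conditions):
    # True if the instance agrees with every attribute-value pair in conditions
    return all(instance.get(attr) == val for attr, val in conditions.items())

def support(instances, x, y):
    # Updated definition: fraction of all instances containing both X and Y
    both = sum(1 for inst in instances if matches(inst, x) and matches(inst, y))
    return both / len(instances)

def confidence(instances, x, y):
    # Fraction of the instances containing X that also contain Y
    x_count = sum(1 for inst in instances if matches(inst, x))
    both = sum(1 for inst in instances if matches(inst, x) and matches(inst, y))
    return both / x_count if x_count else 0.0

x = {"outlook": "sunny"}   # protasis
y = {"play": "no"}         # apodosis
print(support(instances, x, y))     # 0.5
print(confidence(instances, x, y))  # 1.0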


• The definition of support differs in the two cases, but the confidence measure is the same

• To summarize, the initial presentation of support says a rule is interesting if the protasis alone occurs frequently

• The second presentation of support says a rule is interesting if the protasis + apodosis together occur frequently


• In other words, this is an absolute measure of the frequency of the truth of a rule

• We are interested in rules if they are frequently true

• We are not necessarily interested in rules just because the protasis occurs frequently


• In both cases the measure of confidence is the same

• It is a relative measure of the frequency of the truth of a rule

• Out of all the times that the protasis is true, how many times is the rule true?

• Looking back, the second presentation of support is probably the more useful one


• An X that implies results all over the map would not be useful

• An X that frequently implies Y is worth considering further

• It will be of genuine interest if those cases where it implies Y are in the majority

• The initial presentation of the concept of support was incomplete


Item Sets

• The first step in mining association rules is to set a minimum support level

• Support is based on X and Y occurring together

• We are interested in mining rules where combinations of values of both occur sufficiently frequently


• The set of combinations of values which meet the minimum support threshold is known as the item set

• Note that we aren’t talking about a single pair of X and Y

• The item set is all X’s and Y’s that meet the support threshold


Finding Item Sets

• There is a straightforward way to determine support for potential rules

• Logically, it falls into 2 parts:

• Part 1:

• Count how many times particular pairs of values occur in instances in the data set, for every possible pairing of attributes

• Do the same for triplets, quadruplets, etc.


• These tuples are candidates for rules

• If the count of the number of occurrences meets the support threshold, then they will be considered

• This is the basic item set


• Part 2 of the process:

• Pick any one of the elements of the basic item set

• Find all of the possible binary partitionings of the item

• The partitionings aren't just breaking a sequence of attributes in 2, left and right

• The partitionings are all those possible among the set of attributes in the item set


• Once you’ve done the partitioning, one half of the attributes is the protasis and the other half is the apodosis

• In other words, we’ve picked an X and a Y for a rule X Y

• (Now that the different attributes are grouped, you can speak in terms of the left hand side and the right hand side)


• For each potential rule so constructed, you calculate the confidence

• As part of including the item in the item set, you counted the occurrence of X and Y together

• Now count how many times only X occurs in the data set

• Confidence = Count(X and Y) / Count(X)


In Summary

• Find the item set of tuples which meet the support threshold

• Form every possible partitioning of the item set tuples into two disjoint subsets, forming a rule X → Y

• Test each such rule to see if it meets the confidence threshold

• The end result is the set of association rules mined from the data set
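• As a rough Python sketch of this two-part process (my own illustration, not the book's code), the following takes one frequent item set, enumerates every binary partitioning of it into a protasis X and an apodosis Y, and keeps the rules that meet a confidence threshold:

from itertools import combinations

# Each instance is a frozenset of attribute-value pairs; so is an item set
instances = [
    frozenset({("outlook", "sunny"), ("windy", "false"), ("play", "no")}),
    frozenset({("outlook", "sunny"), ("windy", "true"), ("play", "no")}),
    frozenset({("outlook", "rainy"), ("windy", "false"), ("play", "yes")}),
]

def count(instances, items):
    # Number of instances that contain every pair in the item set
    return sum(1 for inst in instances if items <= inst)

def rules_from_item_set(item_set, instances, min_confidence):
    rules = []
    items = list(item_set)
    # Every non-empty proper subset can serve as the protasis X;
    # the remaining pairs form the apodosis Y
    for r in range(1, len(items)):
        for x in combinations(items, r):
            x = frozenset(x)
            y = item_set - x
            conf = count(instances, item_set) / count(instances, x)
            if conf >= min_confidence:
                rules.append((x, y, conf))
    return rules

item_set = frozenset({("outlook", "sunny"), ("play", "no")})
for x, y, conf in rules_from_item_set(item_set, instances, 0.8):
    print(dict(x), "->", dict(y), "confidence", conf)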


Generating Rules Efficiently

• The process for finding rules described above is not efficient

• It’s not as bad as starting at ground 0 and blindly checking all possible rules

• However, it’s still not efficient enough to be practical

• Both support and confidence can be used to limit the number of things that have to be checked


• Reducing work based on support:

• The critical observation is this:

• At every stage, if an n-element item set doesn't have support, a superset of it containing (n + 1) elements won't have support


• Why is this true?

• Item sets are sets of values for a given set of attributes

• Suppose x1, x2, …, xn is a set of values that doesn't meet the support threshold

• No matter what xn+1 is, x1, …, xn, xn+1 will also not meet the support threshold

• At most, it will have the same level of support


• This is the work-reducing enhancement to the association rule mining algorithm based on this observation about support:

• When determining support for item sets, work from 2 element item sets to 3 element item sets, …, to n element item sets


• Whenever you find an item set of size n which doesn’t have support, there is no need to consider any supersets of that item set

• For a given item set that didn’t achieve support, adding an attribute-value pair to the item set cannot result in a higher level of support
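• The following Python sketch (a rough illustration, not the book's code) shows the level-wise idea: candidate item sets of size n + 1 are built and counted only when their n-element subsets already met the support threshold:

from itertools import combinations

def frequent_item_sets(instances, min_support_count):
    # instances: list of frozensets of attribute-value pairs
    # Level 1: single attribute-value pairs that meet the support count
    pairs = {p for inst in instances for p in inst}
    current = [frozenset([p]) for p in pairs
               if sum(1 for inst in instances if p in inst) >= min_support_count]
    all_frequent = list(current)
    while current:
        # Merge n-element frequent sets into (n + 1)-element candidates
        candidates = {a | b for a in current for b in current if len(a | b) == len(a) + 1}
        # Prune: every n-element subset must itself be frequent; then count support
        current = [c for c in candidates
                   if all(frozenset(s) in all_frequent for s in combinations(c, len(c) - 1))
                   and sum(1 for inst in instances if c <= inst) >= min_support_count]
        all_frequent.extend(current)
    return all_frequent

instances = [
    frozenset({("outlook", "sunny"), ("play", "no")}),
    frozenset({("outlook", "sunny"), ("play", "no")}),
    frozenset({("outlook", "rainy"), ("play", "yes")}),
]
print(frequent_item_sets(instances, min_support_count=2))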


• Reducing work based on confidence:

• A similar kind of logic holds when evaluating rules

• We have already seen this in action

• A strong rule is one that implies a weaker rule

• Analytically, a strong rule has more conditions in the apodosis than a weaker one


• Work from weaker, simpler rules to stronger ones

• If the weaker rule doesn’t meet the confidence threshold, then stronger rules that imply that weaker rule won’t meet the confidence threshold

• In other words, if X → y1 doesn't hold, then X → (y1 and y2) won't hold


• From an algorithmic point of view, given candidate rules of the forms X → y1 and X → (y1 and y2)

• Check rules with fewer conditions in the apodosis first

• For any of those that don’t meet the confidence threshold, you don’t have to check the rules with the same protasis and more conditions in the apodosis


• At every stage, certain rules can be eliminated from consideration because other rules were eliminated at an earlier stage

• Any rules left standing will be those that met the confidence threshold


Discussion

• Association rule mining is computationally intensive

• Scanning many possibilities is costly

• Good algorithms are at a premium

• If you can't make progress towards a better algorithm, a better scheme for representing/compressing the data may be an approach to achieving better performance


4.6 Linear Models


• The techniques covered so far have been related to trees and rules

• Trees and rules work most naturally with nominal attributes

• They can also be adapted to work with numeric attributes


• Linear models are distinct from tree and rule based approaches in this way:

• They are a classic numeric method designed to work with numeric attributes

• On the whole, it is less straightforward to apply numeric methods to nominal attributes

• However, adaptations can be made to accomplish this


Numeric Prediction: Linear Regression

• Ideas based on linear regression were discussed in an earlier set of overheads

• You can do a pure statistical linear regression

• You can also build a tree and then do prediction in the leaves based on a numeric model

• The new material in the following sections consists largely of adaptations and extensions of these ideas


Linear Classification: Logistic Regression

• Recall: A simple linear model does numeric prediction

• The idea now is to adapt numeric methods on numeric attributes to arrive at a binary 1/0 decision whether an instance falls into a given class

• This subsection actually contains two topics: Multi-response linear regression and logistic regression


Multi-Response Linear Regression

• This is the basic approach:

• Identify the classes

• We are explicitly thinking in terms of data sets where there are potentially n > 2 classes

• Let the value 1 signify membership in a class

• The value 0 then signifies non-membership


• Do linear regression on the training set for each of the n classes separately

• Set the equation = 1 for those instances in a given class

• Set the equation = 0 for those instances not in a given class


• Solving the regression will give you n equations, one for each class, where for members of the class the equation gives 1 or a value close to 1

• For instances that are not elements of the class, the equation gives 0 or a value close to 0


Using the Regression Equations

• When you get an instance to predict, plug its attribute values into each of the n regression equations

• Each equation is liable to give different values

• Classify the instance according to which equation gives the highest value (the one which gives the value closest to 1, as opposed to 0)
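• A small Python/NumPy sketch of this idea (my own illustration, not the book's code): fit one least-squares linear model per class against 0/1 membership targets, then classify a new instance by whichever model's output is largest:

import numpy as np

def fit_multi_response(X, y, n_classes):
    # X: (m, n) numeric attribute matrix; y: integer class labels 0 .. n_classes - 1
    # Append a column of 1s so each regression has an intercept term
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    weights = []
    for c in range(n_classes):
        target = (y == c).astype(float)      # 1 for members of the class, 0 otherwise
        w, *_ = np.linalg.lstsq(Xb, target, rcond=None)
        weights.append(w)
    return np.array(weights)                 # one weight vector per class

def classify(weights, x):
    xb = np.append(x, 1.0)
    return int(np.argmax(weights @ xb))      # class whose equation gives the largest value

# Tiny made-up example: two numeric attributes, three classes
X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 5.1], [4.8, 5.3], [9.0, 1.0], [8.7, 1.2]])
y = np.array([0, 0, 1, 1, 2, 2])
w = fit_multi_response(X, y, n_classes=3)
print(classify(w, np.array([5.1, 5.0])))     # expected: 1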


Shortcomings of Multi-Response Linear Regression

• Multi-response linear regression can give good results

• On the other hand, it’s not theoretically perfect

• Problem 1:

• You want function output values of 1 (or 0), but a linear equation isn't constrained to this range


• Problem 2:

• The classic technique of linear regression depends on the assumption that you're working with a normal distribution

• You’re trying to produce discrete results, 0 or 1, so by definition, the normality assumption is actually violated


Logistic Regression

• This technique is designed to avoid the shortcomings of multi-response regression

• The math is a little complicated

• Note that in the text the mathematical explanation is in a box, which means it's beyond the scope of this course

• Still, the basic idea is simple


• Instead of fitting the linear expression directly to the 0/1 class value, you fit it to a transformed target: the log odds of class membership, log(P / (1 − P)), where P is the probability that the instance belongs to the class

• With that transformation, you have a continuous quantity to which classical statistical fitting methods can be applied
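• As a rough sketch of the idea (my own, not the boxed math in the text): the model treats the log odds of class membership as a linear function of the attributes, and inverting that transform gives the familiar sigmoid, so the predicted probability always stays between 0 and 1:

import math

def log_odds(p):
    # The transformed target that is modeled as a linear function of the attributes
    return math.log(p / (1.0 - p))

def probability(weights, x):
    # Invert the transform: linear score -> probability, via the sigmoid function
    score = sum(w * xi for w, xi in zip(weights, x))   # assumes x includes a leading 1 for the intercept
    return 1.0 / (1.0 + math.exp(-score))

p = probability([0.5, -1.0, 2.0], [1.0, 3.0, 1.5])     # hypothetical weights and attribute values
print(p, log_odds(p))                                  # p is in (0, 1); log_odds(p) equals the linear score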


• There may come a time later when you want to apply the technique and a better understanding of the math may be required

• If that time comes, you can delve into the contents of the box

• In the meantime, you don’t have to worry about why this works or how it was derived


Linear Classification Using the Perceptron

• Don’t be confused by the funny name• It will be explained at the end


• Suppose you have 2 classifications which are represented by 1, 0

• If your data set has n other attributes, you’re in n space

• Now suppose your data points are linearly separable into the two classifications

• Informally, this means the clusters aren’t mixed together


• Formally, linear separability means there is a hyperplane (a linear construct in n space) that falls midway between the clusters

• If the data points are linearly separable, there is a simple iterative algorithm that will converge on the equation of the hyperplane


• The hyperplane is a linear equation in the attributes, ai, with coefficients, wi, for each of them

• The equation takes this form:

• w0a0 + w1a1 + … + wnan = 0, where a0 is an extra attribute whose value is always 1

• Including the constant attribute a0 = 1 and setting the sum = 0 is just a mathematical convenience

• That way you aren't dealing with solving for a separate constant in the equation


• The extra attribute a0, which is always 1, is technically known as the bias

• Doing the equation this way affects the values for wi that you derive

• But it doesn’t affect the fact that the complete set of wi’s will separate the two clusters


• Given the right set of wi’s, for some particular set of ai values

• w0a0 + w1a1 + … + wnan > 0

• That is, the instance falls “above” the line


• Similarly, for some particular set of ai values

• w0a0 + w1a1 + … + wnan < 0

• That is, the instance falls "below" the line

• For the purposes of binary classification (separating the instances into two clusters), arbitrarily let one class be "above" the line and the other be "below" it


Algorithm for Finding the wi

• For the foregoing scenario, this is the algorithm for finding the wi:

• Initialize all wi = 0

• For each instance in the training set:

• If the instance is classified correctly by the equation with the wi as they stand, do nothing


• If the instance isn’t classified correctly:• If the instance is a member of the “above”

class (sum of eqn should be > 0) add its ai to the wi

• If the instance is a member of the “below” class (sum of eqn should be < 0) subtract its ai from the wi


• After every iteration through the training set where an adjustment was made, you’ll have to make another iteration through to see if everything (else still) works

• Stop when every instance is classified correctly by the equation with the derived wi’s
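• Here is a minimal Python sketch of the perceptron learning rule as just described (my own illustration; attribute 0 is the bias attribute, always 1):

def train_perceptron(instances, labels, max_epochs=100):
    # instances: attribute vectors, each with a leading bias attribute of 1
    # labels: +1 for the "above" class, -1 for the "below" class
    w = [0.0] * len(instances[0])                  # initialize all wi = 0
    for _ in range(max_epochs):                    # terminates early only if the data is linearly separable
        changed = False
        for a, label in zip(instances, labels):
            s = sum(wi * ai for wi, ai in zip(w, a))
            if label > 0 and s <= 0:               # should be "above" but isn't: add its ai to the wi
                w = [wi + ai for wi, ai in zip(w, a)]
                changed = True
            elif label < 0 and s >= 0:             # should be "below" but isn't: subtract its ai from the wi
                w = [wi - ai for wi, ai in zip(w, a)]
                changed = True
        if not changed:                            # every instance classified correctly: stop
            break
    return w

# Tiny linearly separable example: bias attribute first, then one real attribute
data = [[1.0, 2.0], [1.0, 3.0], [1.0, -2.0], [1.0, -3.0]]
labels = [+1, +1, -1, -1]
print(train_perceptron(data, labels))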


• If the data set is linearly separable, this algorithm will converge

• If it’s not linearly separable, it won’t converge

• The algorithm gets its name because it is analogous to the computation done in a one-layer neural network, called a perceptron


Linear Classification Using Winnow

• Winnow can be summarized as follows:

• It finds a separating hyperplane, like the perceptron

• It is designed for use with data sets where all of the attributes are binary (1, 0)

• Instead of comparing the weighted sum to 0, the user can specify a threshold value, θ


• The user chooses another factor, α

• If an instance should classify "above" the line but doesn't, you multiply by α the wi of the attributes whose value is 1 in that instance

• If an instance should classify "below" the line but doesn't, you multiply those wi by 1/α

• Winnow is especially good at focusing classification on those attributes that make a difference and ignoring those that don’t
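• A rough Python sketch of the Winnow update for binary attributes (my own illustration; θ and α are the user-chosen threshold and multiplier, and only the weights of attributes whose value is 1 get adjusted):

def train_winnow(instances, labels, alpha=2.0, theta=None, epochs=20):
    # instances: lists of 0/1 attribute values; labels: 1 for "above" the threshold, 0 for "below"
    n = len(instances[0])
    theta = n / 2.0 if theta is None else theta
    w = [1.0] * n                                   # multiplicative updates, so start the weights at 1
    for _ in range(epochs):
        for a, label in zip(instances, labels):
            above = sum(wi * ai for wi, ai in zip(w, a)) > theta
            if label == 1 and not above:
                # Should be above but isn't: promote the weights of the attributes that are 1
                w = [wi * alpha if ai == 1 else wi for wi, ai in zip(w, a)]
            elif label == 0 and above:
                # Should be below but isn't: demote the weights of the attributes that are 1
                w = [wi / alpha if ai == 1 else wi for wi, ai in zip(w, a)]
    return w

# Toy example: only the first attribute actually matters for the class
data = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
labels = [1, 1, 0, 0]
print(train_winnow(data, labels))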


• Both the perceptron and winnow approaches are good for situations where new instances keep getting added to the data set

• Just begin iteration again in order to find the new separating hyperplane


4.7 Instance-Based Learning


• This is the basic idea behind instance-based learning:

• You don’t derive rules• You don’t derive a tree• You keep the complete training set at hand• You don’t use it to train anything in

advance


• When a new instance arrives, this is how you classify it:

• Based on its attribute values, it has a location in n-space

• You give the new instance the classification of whatever instance in the training set is closest to it in n-space


• This simple idea leads to all sorts of considerations:

• How do you define distance?

• Manhattan distance, Euclidean distance, distance with powers higher than 2

• Note that the higher the power, the more emphasis on those attributes with greater differences between instance values


• This leads to other questions:

• Are all the measurements on the same scale?

• Are they equally important?

• Does it make sense to somehow equalize attributes by normalizing them to a scale of 0-1?


• A random philosophical observation:

• Consider measurements of humans, like height and eye color

• Aren't they intrinsically incommensurable?


• Can a data mining scheme “objectively” distinguish the validity of two classification schemes, one heavily influenced by one attribute, another heavily influenced by the other attribute?

• Or is human judgment needed?


• Another aspect of this:

• Calculating distance assumes that attributes are numeric

• With nominal attributes, like eye color, it's not just eye color and height that are incommensurable

• Ordering and distance between the eye colors are not well defined


• The solution is to make assumptions like the following:

• If two nominal values are equal, the distance between them is 0

• In a normalized scale, if two nominal values are different, the distance between them is 1

• The different nominal values are equally, maximally distant from each other


• Similar questions can be asked about the treatment of missing attribute values

• Similar solutions are possible

• The distance between a missing value and an existing value can be deemed to be the maximum distance possible under whatever scale is being used
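• These conventions are easy to express in code; the following Python sketch (my own, assuming numeric attributes are already normalized to the 0-1 scale and None marks a missing value) computes a per-attribute distance along these lines:

def attribute_distance(a, b):
    # Missing value: assume the maximum possible distance on the normalized scale
    if a is None or b is None:
        return 1.0
    # Nominal values: 0 if equal, otherwise the maximum distance of 1
    if isinstance(a, str) or isinstance(b, str):
        return 0.0 if a == b else 1.0
    # Numeric values, assumed already normalized to the 0-1 range
    return abs(a - b)

def euclidean_distance(x, y):
    return sum(attribute_distance(a, b) ** 2 for a, b in zip(x, y)) ** 0.5

print(euclidean_distance([0.2, "blue", None], [0.5, "brown", 0.9]))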


Finding Nearest Neighbors Effectively

• Finding nearest neighbors by brute force is computationally daunting

• For m points in the training set and n attributes per instance:

• m times you would be calculating an n-dimensional distance and comparing

• The order of complexity of search/classification is linear in the size of the training set


kD-tree Representations of Data Sets

• Representing the data in a convenient form can ease computational problems

• Suppose the training set space can be represented in a roughly balanced tree

• Looking for the nearest neighbor would be based on tree traversal

• With binary splits, the complexity of the algorithm would be on the order of log2(m), where m is the number of training instances


Explaining the kD-tree Representation

• In the naming convention, kD stands for k-dimensional

• Instead of referring to the number of attributes as n, you refer to it as k

• The highest number of dimensions that can be represented unambiguously on the printed page is two

• Before this is over with, illustrations will be given with k = 2


• The basic idea of a kD-tree can be summarized in this way:

• An internal node in the tree represents a point on a boundary line in k-space

• How to understand this?

• At each internal node, you make a decision based on one attribute: yes/no, in/out, < / >


• The node is binary and represents a boundary

• The left branch represents the region on one side, including all of the data points located there

• The right branch represents the region on the other side, including all of the data points located there


• In 2-space the boundary line is either horizontal or vertical

• In this case you would have 2 attributes, x and y

• A decision based on a value of x would be vertical

• A decision based on a value of y would be horizontal


• In k-space, the boundary is a hyperplane orthogonal to the axis the decision was made on

• This is not too hard to understand, because it’s still possible to visualize what happens in 3 dimensions

• If you set x = constant and let y and z vary, what you get is a 2-dimensional plane perpendicular to the x axis and parallel to the y and z axes


• In the book’s example, the leaf nodes contain single points in the partitions defined by all of their parents

• We have already observed that trees that partition the space into single instances are not necessarily very useful

• It is possible to devise an algorithm (rule of thumb) that stops the partitioning at some point


• Stopping the partitioning limits the depth of the tree

• The result is that leaf nodes contain some minimum number of data points in the lowest level partitions

• (The rule of thumb could be as simple as stopping partitioning when the minimum is reached)


An Illustration

• Figure 4.12 is shown on the following overhead

• It illustrates the partitioning of space using a kD-tree

• The figure will be followed by comments

[Figure 4.12: partitioning of the instance space by a kD-tree]

• In presenting this, the book explains what you see without fully explaining why

• I will just repeat what you see—but all the answers to why are simply not known

• The partitioning is based on arriving instances of ordered pairs

• The tree is two levels deep and at each level a decision can be made on each coordinate


• The first arrival point is (7, 4) and the decision is made on the y value

• At the first node I would say you’re testing y < 4 or y > 4

• (7, 4) is an internal node, so it’s a boundary node

• It defines the boundary line y = 4


• Without further explanation, the point (2, 2) is not expanded further

• The second internal node is based on the point (6, 7)

• This gives the boundary line x = 6

• Notice that this boundary does not extend down into the first partition

• No further partitioning is done on (3, 8)


Using a kD-tree for Classification

• A full explanation wasn’t given about why the tree was formed as it was

• The point was more to provide a simple tree that can be used to illustrate classification

• Regardless of how the tree was derived, it is a representation of the data space based on the points that have arrived so far


• Given the kD-tree representation of the data space, how do you use it for classification?

• Let a new instance arrive

• Traverse the tree, determining which path to follow by comparing instance values to node values

• Eventually, the new point will be placed in a leaf


• Find the new point’s nearest neighbor in the leaf

• Naturally, this means that you found the distance, d, between the point and its nearest neighbor in the leaf

• Now work your way back up the tree (backtrack)


• Check to see whether any part of any of the other partitions is within the distance d of the point

• In other words, compare d with the distance between the point and the boundary lines of any of the higher partitions

• Checking this is easier than using the distance formula


• Suppose the distance from the point to a higher partition’s boundary is less than d

• Then the point’s actual nearest neighbor in the set may be in that other partition

• Check the distance between the point and the points in that partition

• Pick the absolute nearest neighbor out of all of the partitions checked
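• In practice you would rarely code the build-and-backtrack search by hand; as a quick sketch (my own, not from the text), a library kD-tree implementation such as scipy's can do the nearest-neighbor search, with the training labels supplying the classification:

import numpy as np
from scipy.spatial import KDTree

# Made-up 2-D training points and their class labels
points = np.array([[7, 4], [6, 7], [3, 8], [2, 2], [8, 8]], dtype=float)
labels = ["a", "b", "b", "a", "b"]

tree = KDTree(points)                      # builds the kD-tree representation of the data space
dist, idx = tree.query([5.5, 6.0], k=1)    # nearest-neighbor search; backtracking is handled internally
print(labels[int(idx)], dist)              # classify the new point by its nearest neighbor's label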


• The moral of the story is that the kD-tree is a way of representing the space

• However, you don’t blindly put instances into the partitions defined by the kD-tree

• The tree serves as a tool for finding a nearest neighbor without having to calculate all distances between the new point and the existing points


• In effect, it’s another kind of computation rule of thumb

• See Figure 4.13 on the following overhead

• It illustrates finding the nearest neighbor of a new point using the kD-tree illustrated earlier

[Figure 4.13: finding the nearest neighbor of a new point using the kD-tree]

Efficiency

• Even with trees that aren’t perfectly balanced

• Even with backtracking

• The order of complexity of classification using a kD-tree can approach log2(m), where m is the number of training instances


• This raises these follow-on issues:

• What are the best heuristics for forming trees so that they're well-balanced?

• Speaking generically, this means splitting so that the rectangular partitions tend to be square-like


• This goes back to the unanswered questions from when the example tree was formed

• Why were certain points chosen as the internal nodes to partition on?

• Why were some nodes left as leaves and not expanded?


• The questions can be phrased as whether to split, and if so, whether to split horizontally or vertically

• Guidelines include the following:

• Split on the dimension that has the greatest variance (the greatest spread)

• Choose a split point that is near the mean of the distribution along that axis


The Next Stage

• There is a mismatch inherent in the scheme as described thus far:

• The partitions are rectangular

• But the distance calculation for the nearest neighbor leads to a circle (hypersphere)

• In essence, this is what makes backtracking necessary


Hyperspheres (balls) Instead of Rectangular Partitions

• Rectangles have the advantage that they can partition a space without overlapping

• They have the disadvantage that a given circle may overlap >1 rectangle

• In 2-space, for example, where corners of rectangles meet, part of a neighborhood circle might lie in >1 rectangle


• A natural alternative approach would be to partition space using spheres

• The complication here is that if spheres are to cover the whole space, they will have to overlap

• Covering space with spheres leads to a data structure known as a ball tree (instead of a kD-tree)


• Ball trees are formed from the top down

• In the space of m points, find the centroid (I call this the center of gravity)

• This will be the center of the original hypersphere

• Then pick a radius large enough to enclose all of the points


• How to subdivide a sphere:

• Pick the point farthest from the center

• Pick the point farthest from that point

• Partition the points in the sphere according to which of these two points they're closest to


• Take the centers of gravity of those two clusters as the centers of two "sub-spheres"

• Pick radii large enough to enclose the points of each
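• A rough Python sketch of this subdivision step (my own illustration, not the book's code):

import numpy as np

def split_ball(points):
    # points: (m, k) array of the points inside one ball
    center = points.mean(axis=0)                                       # centroid of the ball
    p1 = points[np.argmax(np.linalg.norm(points - center, axis=1))]    # point farthest from the center
    p2 = points[np.argmax(np.linalg.norm(points - p1, axis=1))]        # point farthest from that point
    closer_to_p1 = (np.linalg.norm(points - p1, axis=1)
                    <= np.linalg.norm(points - p2, axis=1))
    sub_balls = []
    for cluster in (points[closer_to_p1], points[~closer_to_p1]):
        c = cluster.mean(axis=0)                                       # center of gravity of the sub-cluster
        r = np.linalg.norm(cluster - c, axis=1).max()                  # radius large enough to enclose its points
        sub_balls.append((c, r, cluster))
    return sub_balls

pts = np.array([[0.0, 0.0], [1.0, 0.5], [9.0, 9.0], [8.5, 9.5]])
for center, radius, members in split_ball(pts):
    print(center, radius, len(members))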


• Note that using this process, in effect you would quit partitioning when you reached spheres containing only two points

• You could also quit partitioning when you reached some arbitrary minimum


• Balls, including leaf nodes, may overlap

• By definition, superset-subset balls will contain the same points

• Leaf nodes may also share points

• This can happen at those points on the edges of the circles where they intersect, not in their interiors where they overlap


• The hypersphere partitioning of space can be represented as a tree data structure, the ball tree

• See Figure 4.14 on the following overhead

[Figure 4.14: a ball tree and the hyperspheres it represents]

Traversing the Ball Tree

• You use the ball tree like a kD-tree

• Given a new instance, traverse the tree until you reach the leaf node that contains it

• Find the point’s nearest neighbor in the leaf node


• Now consider this case:

• The distance from the point to its nearest neighbor in the ball is d

• The distance from the point to some other ball is < d

• This means that the point's nearest neighbor may be in the other ball


• Like with kD-trees, you have to backtrack through the tree finding any such balls if they exist

• If they do exist, you search for the point’s nearest neighbor(s) in those balls

• The algorithm concludes by picking the actual nearest neighbor out of all of the candidates found in various leaf balls


• Like with the trees, the balls are a representation of the data space

• They support a scheme for finding the nearest neighbor

• But backtracking means that new instances are not simply stored in the balls of the tree


• In other words, the balls partition the space, but the balls themselves may not represent the classification of the instances


Discussion

• In the schemes covered so far, kD-trees and ball trees, classification was ultimately done by finding the nearest neighbor

• This falls short if the data is noisy

• What this means is that classifications tend to be interspersed rather than clustered


• The term noisy suggests that the interspersion occurs because some of the classification values are wrong

• Interspersion can also result because your data is incomplete

• Add another attribute, another dimension, and it could be that there is a clean boundary between classes


• At any rate, a practical solution to the problem in the absence of any more knowledge is a k nearest neighbor approach

• In this phrase, k doesn’t mean the number of attributes (dimensions in space)

• It means the number of instances that the algorithm is based on


• You don’t classify a new instance by its single nearest neighbor

• You classify it by the majority vote (count of the classifications) of its k nearest neighbors, for some selected value of k

• For data sets conforming to certain desirable mathematical/statistical properties, the k nearest neighbor approach has a high level of success
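• A minimal Python sketch of the k-nearest-neighbor vote (my own illustration):

from collections import Counter

def knn_classify(train_points, train_labels, new_point, k=3):
    # Sort the training instances by (squared) Euclidean distance to the new point
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, new_point)), label)
        for p, label in zip(train_points, train_labels)
    )
    # Majority vote among the k nearest neighbors
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_classify(points, labels, (2, 2), k=3))   # "a"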


• A brute force implementation of a nearest neighbor algorithm is not computationally attractive

• kD-trees are much better, but above about dimension 10 they become unattractive

• More general structures like ball trees are amenable to algorithms that will effectively handle much greater numbers of attributes


• The nearest neighbor approach can be simplified

• The simplification isn’t very accurate, but it provides a tool for initial analysis of data

• The idea is to more or less do a statistical summary of the classifications in a data set

• I.e., find a location for the classification in space


• Thinking visually, the idea is that you more or less say that a given ball, with center X and radius R is the region associated with a classification

• Then when new instances arrive, you just check to see what ball they fall into

• In effect, you’re saying now that the balls of a tree do correspond to the classification


• We have come full circle

• This method is no longer an instance-based method

• It is a method where "rules" (spatial locations) are derived from the data

• Then you test instances against the rules (see whether they fall into those locations)


4.8 Clustering


• Clustering algorithms can lead to different results:

• Clusters may be mutually exclusive

• They may overlap

• Instances may be assigned probabilities that they are in certain clusters

• They may be hierarchical (recall dendrograms)


Iterative Distance-Based Clustering

• The classic clustering technique is called k-means

• You specify how many clusters you want to find, k

• Choose k of the data points at random

• Partition the space by assigning points to clusters based on which of the k points they're nearest to


• Now find the centers of gravity of these initial clusters

• Repeat the process of assigning points based on their nearness to the centers of gravity

• Iterate until no point changes cluster
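• A compact Python/NumPy sketch of the k-means loop just described (my own illustration; it assumes no cluster ever ends up empty):

import numpy as np

def k_means(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # Choose k of the data points at random as the initial cluster centers
    centers = points[rng.choice(len(points), size=k, replace=False)]
    while True:
        # Assign every point to the nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # Recompute each center as the center of gravity of its cluster
        new_centers = np.array([points[assignment == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):      # stop when no point changes cluster
            return centers, assignment
        centers = new_centers

pts = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 8.5], [9.0, 7.5]])
centers, assignment = k_means(pts, k=2)
print(centers)
print(assignment)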


• This is simple, effective, and highly unsatisfying

• The algorithm converges, achieving stability

• But the clusters you get are highly dependent on the choice of the initial k points

• And who said there were k clusters in the data set anyway?


• In effect, when clustering, you're trying to minimize the sum of the squared distances between the points in each cluster and the center point of that cluster

• These centers are dependent on the initial choice of k points

• The algorithm only finds a local optimum


• How would you find a global optimum?

• Consider all possible numbers of clusters and all possible assignments of points to those clusters

• Pick that clustering which minimizes the distances over all points and all cases


• The cost of actually trying all possibilities to find the global optimum is prohibitive

• To see why, just try coming up with an expression for the number of possible subsets of a data set

• Then contemplate calculating the distance for each point in them


• There is something else to consider here

• The algorithm is designed to find "compact" clusters

• Consider once again the yin and yang symbol, with some space separating the two halves


• Would k-means successfully group the points in the tails with their respective heads, or would it group them with the heads of the other half, for example?

• Without adjustments, most simple-minded approaches are prejudiced in favor of finding disjoint "spheres" as the clusters

• But the reality is that clusters may come in various shapes


• A simple approach to improving the results of the k-means algorithm is this:

• Perform the algorithm several times with different initial seed points

• Then simply pick the one that seems to give the best results (lowest distance calculation)


• A further refinement of k-means is known as k-means++

• Choose the first of the seed points at random

• Then choose successive seed points with a probability proportional to the square of their distance from the nearest seed already chosen

• This will tend to spread the seeds out through space in a desirable, random way
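• A short Python/NumPy sketch of k-means++ seeding (my own illustration); each new seed is drawn with probability proportional to its squared distance from the nearest seed chosen so far:

import numpy as np

def k_means_pp_seeds(points, k, seed=0):
    rng = np.random.default_rng(seed)
    # First seed: chosen uniformly at random from the data points
    seeds = [points[rng.integers(len(points))]]
    while len(seeds) < k:
        # Squared distance from every point to its nearest existing seed
        d2 = np.min(np.linalg.norm(points[:, None, :] - np.array(seeds)[None, :, :], axis=2) ** 2, axis=1)
        probs = d2 / d2.sum()
        seeds.append(points[rng.choice(len(points), p=probs)])
    return np.array(seeds)

pts = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8], [10.0, 0.0]])
print(k_means_pp_seeds(pts, k=3))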


Faster Distance Calculations


4.9 Multi-Instance Learning


The End

