
Date post: 18-Mar-2020

Copyright © SAS Institute Inc. All rights reserved.

Monte Carlo K-Means Clustering SAS Enterprise Miner

Donald K. Wedding, PhD

Director of Data Science

Sprint Corporation

[email protected]


What Is Clustering?


K-Means Clustering

• Technique can be used on other data such as CUSTOMER data

• K-Means clustering allows for grouping multiple variables simultaneously

• More sophisticated treatment of customers than is possible from simple segmentation

K-Means Clustering: Clusters based on AGE and INCOME

How many clusters do you see?

K-Means Clustering: Visual Inspection (“proc eyeball”)

There are FOUR clusters.


K-Means Clustering

A bank might use these clusters for “cross sell”:

• Recent Graduates: Overdraft Protection

• Peak Income: Mortgage, HELOC, Investment Account

• Retired: Trust Fund, Retirement Account, Estate Planning

• Unemployed: Unprofitable – “Choose to Lose”


What Affects Cluster Quality?


What Affects Cluster Results?

• How many clusters are there?

• Cluster Starting Points (“Seeds”)?


How Many Clusters?


How Many Clusters: Example

Given the Following Data Points

• Find the cluster centers for N=2 Clusters

• Find the cluster centers for N=3 Clusters

• Find the cluster centers for N=4 Clusters
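The exercise above can be sketched in a few lines of Python. This is a minimal illustration of plain (Lloyd's) k-means, not the SAS Enterprise Miner implementation; the eight data points and the seed order are hypothetical, since the slide's data points are only available as a figure.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, seeds, max_iter=100):
    """Plain Lloyd's k-means. Returns (centers, labels)."""
    centers = [tuple(float(v) for v in s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are final
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centers, labels

def sse(points, centers, labels):
    """Within-cluster sum of squared distances."""
    return sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels))

# Hypothetical data: four tight pairs of points (NOT the data from the slides).
points = [(1, 1), (1, 2), (5, 1), (5, 2), (1, 8), (1, 9), (5, 8), (5, 9)]
# Hypothetical seed order; the first k entries are used as starting points.
seed_order = [(1, 1), (1, 8), (5, 1), (5, 8)]

results = {}
for k in (2, 3, 4):
    centers, labels = kmeans(points, seed_order[:k])
    results[k] = sse(points, centers, labels)
    print(k, "clusters:", centers, "SSE =", results[k])
```

On this toy data the within-cluster sum of squares drops from 34.0 (2 clusters) to 18.0 (3 clusters) to 2.0 (4 clusters), which is one way to see that four groups fit these particular points best.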

How Many Clusters: Example [figure, slide 12]

How Many Clusters: Example 2 Clusters [figures, slides 13-19]

How Many Clusters: Example 3 Clusters [figures, slides 20-26]

How Many Clusters: Example 4 Clusters [figures, slides 27-34]

Summary

Given the Following Data Points

• Find the cluster centers for N=2 Clusters

• Find the cluster centers for N=3 Clusters

• Find the cluster centers for N=4 Clusters

Summary [figure, slide 36]

K-Means Clustering: Clusters based on AGE and INCOME

How many clusters do you see? [figures, slides 37-47]

What Affects Cluster Results?

• How many clusters are there?

• Cluster Starting Points (“Seeds”)?

What Are The Center Points?


Cluster Seeds: Example

Given the Following Data Points

• Find the cluster centers for N=3 Clusters

• Find the cluster centers using Starting Point “A”

• Find the cluster centers using Starting Point “B”
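The effect of the starting point can be reproduced with a small Python sketch. It uses plain (Lloyd's) k-means on six hypothetical points; the data and the two seed sets labeled "A" and "B" are assumptions made for illustration and are not the points shown on the slides.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, seeds, max_iter=100):
    """Plain Lloyd's k-means. Returns (centers, sse)."""
    centers = [tuple(float(v) for v in s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest current center.
        new_labels = [min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Move every center to the mean of the points assigned to it.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    sse = sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, sse

# Hypothetical data: three well-separated pairs of points.
points = [(0, 0), (0, 1), (4, 0), (4, 1), (10, 0), (10, 1)]

# Starting point "A": one seed inside each pair.
centers_a, sse_a = kmeans(points, [(0, 0), (4, 0), (10, 0)])
# Starting point "B": two seeds inside the same pair.
centers_b, sse_b = kmeans(points, [(0, 0), (0, 1), (4, 0)])

print("A:", centers_a, "SSE =", sse_a)
print("B:", centers_b, "SSE =", sse_b)
```

Starting point "A" recovers the three pairs (SSE 1.5), while starting point "B" converges to a poorer solution that splits one pair and merges the other two (SSE 37.0). Both runs converge; they simply stop at different answers.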

Cluster Seeds: Example [figure, slide 52]

Cluster Starting Points (“Seeds”), 3 Clusters: Starting Point “A” [figures, slides 53-60]

Cluster Starting Points (“Seeds”), 3 Clusters: Starting Point “B” [figures, slides 61-67]

Summary [figures, slides 68-69]

K-Means Clustering: Clusters based on AGE and INCOME

How many clusters do you see? [figures, slides 70-76]

Summary [figures, slides 77-78]

What Affects Cluster Results?

• How many clusters are there?

• Cluster Starting Points (“Seeds”)?

Approximate The Number of Clusters

Diagram 4300 [figure, slide 83]

How Many Clusters?

Set the number of clusters to “automatic”.

Set the following parameters:

• Preliminary Max = 50: assume that initially there might be as many as 50 clusters.

• Minimum = 2: when complete, there will be at least 2 clusters.

• Final Maximum = 20: when complete, there will be no more than 20 clusters.


How Many Clusters?

• SAS Enterprise Miner allows the user to “guess” at the number of clusters within a RANGE (example: the default is at least 2 and at most 20)

• SAS Enterprise Miner will estimate the optimal number of clusters

• The optimal number of clusters will vary depending upon the clustering parameters.

• STEP 1: Narrow the “Search Range” by clustering using multiple parameters


How Many Clusters?

Measurement of cluster distances:

• Average

• Centroid

• Ward (Default)

Cluster Selection Methods: SAS Enterprise Miner

• Average: calculate the average distance from every point in one cluster to every point in another cluster.

• Centroid: find the distance from one cluster center point to another cluster center point.

• Ward (Default Method): cluster measurement is based on the ANOVA sum of squares between the two clusters.
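The three measurements can be compared on a tiny example. The Python sketch below is an illustration under stated assumptions: the two clusters are hypothetical, plain Euclidean distance is used for the Average and Centroid measures (SAS may work with squared distances internally), and the Ward measure is computed as the increase in within-cluster sum of squares when the two clusters are merged, which equals the between-cluster ANOVA sum of squares.

```python
import math

def centroid(cluster):
    """Mean point of a cluster."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in cluster)

def average_distance(a, b):
    """Average distance over every pair (one point from each cluster)."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_distance(a, b):
    """Distance between the two cluster center points."""
    return math.dist(centroid(a), centroid(b))

def ward_distance(a, b):
    """Between-cluster sum of squares: SSE gained by merging a and b."""
    return sse(a + b) - sse(a) - sse(b)

# Hypothetical clusters, chosen so the results are easy to check by hand.
cluster_a = [(0, 0), (0, 2)]
cluster_b = [(4, 0), (4, 2)]

avg = average_distance(cluster_a, cluster_b)
cen = centroid_distance(cluster_a, cluster_b)
ward = ward_distance(cluster_a, cluster_b)
print("Average :", round(avg, 3))
print("Centroid:", cen)
print("Ward    :", ward)
```

For these two clusters the centroid distance is 4.0, the average pairwise distance is slightly larger (about 4.236), and the Ward measure is 16.0. The three methods rank cluster pairs differently in general, which is why the estimated number of clusters can change with the method.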


How Many Clusters?

How are Initial Cluster Centers Chosen?

• First “N” Records

• MacQueen Drifting

• Full Replacement

• Princomp

• Partial Replacement (Default)

Cluster Seed Selection: SAS Enterprise Miner

• First “N” Records Method: use the first “N” records in the list as seeds.

• Partial Replacement Method (Default): select “N” records that are far away from each other.

• Full Replacement Method: select “N” records that are very far away from each other by looking for outliers.

• Principal Component Method: select “N” evenly spaced records along the first Principal Component Vector.

• MacQueen “Drifting” Method: use the first “N” records in the list as seeds, then assign records to clusters one by one and recompute the cluster center after each record is assigned (“drifting”).
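Two of the methods above are simple enough to sketch directly. The Python below follows the slide's descriptions of the First “N” and MacQueen “drifting” methods; it is not the SAS implementation, the function names are hypothetical, and the five records are made up so the result can be checked by hand. The replacement and principal-component methods are not attempted here.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def first_n_seeds(records, n):
    """First-N method: the first n records become the seeds."""
    return [tuple(float(v) for v in r) for r in records[:n]]

def macqueen_drift(records, n):
    """MacQueen drifting: start from the first n records, then let each
    center drift to the running mean of the records assigned to it."""
    centers = [[float(v) for v in r] for r in records[:n]]
    counts = [1] * n
    for r in records[n:]:
        # Assign the record to its nearest current center.
        j = min(range(n), key=lambda i: sqdist(r, centers[i]))
        counts[j] += 1
        # Incremental mean update: the center moves toward the new record.
        centers[j] = [c + (x - c) / counts[j] for c, x in zip(centers[j], r)]
    return [tuple(c) for c in centers]

# Hypothetical records, in file order.
records = [(0, 0), (10, 0), (2, 0), (8, 0), (1, 0)]

first = first_n_seeds(records, 2)
drifted = macqueen_drift(records, 2)
print("First N :", first)
print("MacQueen:", drifted)
```

Both methods depend on the order of the records, which is why shuffling the input file changes the seeds and, potentially, the final clusters.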

Approximate Number Of Clusters: Diagram 4300 [figures, slides 90-94]

Example 1: Random Seeds – Synthetic Data

SAS program to generate synthetic data

• Program creates 1000 data points with two values: X, Y

• 200 points centered at (3,3)

• 200 points centered at (5,5)

• 200 points centered at (4,6)

• 200 points centered at (6,4)

• 200 points centered at (4,4)

• Each X and Y value has noise added to it: a normally distributed random number multiplied by a weight of 0.5


Example 1: Random Seeds – Synthetic Data

%let COUNT = 200;     /* points per cluster center */
%let WEIGHT = 0.5;    /* noise weight */
%let SEED = 1;
%let INFILE = INFILE;
%let OUTFILE = RANDOM_DATA;

/* Step 1: write each cluster center COUNT times with standard normal noise */
data &INFILE.;
    do I = 1 to &COUNT.;
        X = 3.0; Y = 3.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output;
        X = 5.0; Y = 5.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output;
        X = 4.0; Y = 6.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output;
        X = 6.0; Y = 4.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output;
        X = 4.0; Y = 4.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output;
    end;
    drop I;
run;

/* Step 2: add the weighted noise to each coordinate */
data &OUTFILE.;
    set &INFILE.;
    X = X + &WEIGHT.*NOISE_X;
    Y = Y + &WEIGHT.*NOISE_Y;
    drop NOISE_X;
    drop NOISE_Y;
run;
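For readers without SAS, the same recipe can be sketched in Python. This is an approximate equivalent, not a port: Python's random number generator does not reproduce the `rannor` stream, so the individual points will differ from the SAS output. The names `CENTERS` and `make_data` are hypothetical.

```python
import random

# The five cluster centers used in the SAS program.
CENTERS = [(3.0, 3.0), (5.0, 5.0), (4.0, 6.0), (6.0, 4.0), (4.0, 4.0)]

def make_data(count=200, weight=0.5, seed=1):
    """Return count points per center, each with weighted N(0, 1) noise added."""
    rng = random.Random(seed)
    data = []
    for _ in range(count):
        for cx, cy in CENTERS:
            x = cx + weight * rng.gauss(0, 1)
            y = cy + weight * rng.gauss(0, 1)
            data.append((x, y))
    return data

data = make_data()
mean_x = sum(x for x, _ in data) / len(data)
mean_y = sum(y for _, y in data) / len(data)
print(len(data), "points; overall mean =", (round(mean_x, 2), round(mean_y, 2)))
```

As in the SAS version, the points come out interleaved by center (one point per center on each pass through the loop), so the file order is not random until it is shuffled.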

Noise Level 0.5 [figures, slides 97-98]

Random Seeds – Shuffle Cards

%let SEED = 1;
%let INFILE = RANDOM_DATA;
%let TEMPFILE = TEMPFILE;
%let OUTFILE = SORTED_DATA;

/* Attach a uniform random sort key to every record */
data &TEMPFILE.;
    set &INFILE.;
    SORT = ranuni( &SEED. );
run;

/* Sorting by the random key shuffles the records */
proc sort data=&TEMPFILE.;
    by SORT;
run;

data &OUTFILE.;
    set &TEMPFILE.;
    drop SORT;
run;

proc print data=&OUTFILE.(obs=5);
run;

Random Number Seed: Changing this value will cause the list of data points to be put in a different order (“shuffled”).
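The same "sort by a random key" trick can be sketched in Python. It mirrors the structure of the SAS steps (attach a uniform key, sort by it, drop the key) but uses Python's own generator, so the resulting order will not match SAS. The function name `shuffle_records` and the stand-in records are hypothetical.

```python
import random

def shuffle_records(records, seed):
    """Shuffle by attaching a uniform random key to each record and sorting on it."""
    rng = random.Random(seed)
    keyed = [(rng.random(), r) for r in records]   # like SORT = ranuni(seed)
    keyed.sort(key=lambda kr: kr[0])               # like proc sort ... by SORT
    return [r for _, r in keyed]                   # like drop SORT

# Stand-in records: 100 numbered points.
records = [(i, i) for i in range(100)]

shuffled_1 = shuffle_records(records, seed=1)
shuffled_2 = shuffle_records(records, seed=2)
print("seed 1, first 5:", shuffled_1[:5])
print("seed 2, first 5:", shuffled_2[:5])
```

Because the First “N” and MacQueen methods take their seeds from the top of the file, a different shuffle seed means different starting points for the clustering.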

How Many Clusters? [figures, slides 101-102]

Ward / First = 7 clusters

Ward / MacQueen = 3 clusters

Ward / Full = 5 clusters

Ward / Princomp = 5 clusters

Ward / Partial = 3 clusters

Average / First = 5 clusters

Average / MacQueen = 5 clusters

Average / Full = 7 clusters

Average / Princomp = 5 clusters

Average / Partial = 5 clusters

Centroid / First = 4 clusters

Centroid / MacQueen = 5 clusters

Centroid / Full = 6 clusters

Centroid / Princomp = 5 clusters

Centroid / Partial = 6 clusters

How Many Clusters?

Seed Method   Ward   Average   Centroid
First         7      5         4
MacQueen      3      5         5
Full          5      7         6
Princomp      5      5         5
Partial       3      5         6

How Many Clusters?

Number of Clusters   Count
3 clusters           2
4 clusters           1
5 clusters           8
6 clusters           2
7 clusters           2

[figures, slides 123-126]

How Many Clusters?

The Number of Clusters Found Depends Upon

• Cluster Starting Points

• Clustering Method

Certain Numbers occur more frequently than others

• Trial and Error suggests 3 to 7 Clusters

• Probably 5 Clusters is optimal
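The tally behind this conclusion is easy to reproduce. The Python sketch below re-enters the 15 results reported on the slides (three distance methods by five seed methods) and counts how often each number of clusters was found; the variable names are hypothetical.

```python
from collections import Counter

# Estimated number of clusters for each seed method and distance method,
# as reported on the slides.
grid = {
    "First":    {"Ward": 7, "Average": 5, "Centroid": 4},
    "MacQueen": {"Ward": 3, "Average": 5, "Centroid": 5},
    "Full":     {"Ward": 5, "Average": 7, "Centroid": 6},
    "Princomp": {"Ward": 5, "Average": 5, "Centroid": 5},
    "Partial":  {"Ward": 3, "Average": 5, "Centroid": 6},
}

# Count how many of the 15 runs produced each number of clusters.
counts = Counter(k for row in grid.values() for k in row.values())
most_common_k, votes = counts.most_common(1)[0]

for k in sorted(counts):
    print(f"{k} clusters: {counts[k]}")
print("Most frequent:", most_common_k, "clusters in", votes, "of", sum(counts.values()), "runs")
```

Five clusters appear in 8 of the 15 runs, and every run falls between 3 and 7, which matches the search range suggested above.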


Starting Points Affect Clusters “Your Mileage May Vary”

Diagram 4400

Different Seed Selection Methods: Diagram 4400 [figure, slide 131]

Random Seeds – Synthetic Data

%let COUNT = 200;     /* points per cluster center */
%let WEIGHT = 0.2;    /* noise weight, reduced from 0.5 */
%let SEED = 1;
%let INFILE = INFILE;
%let OUTFILE = RANDOM_DATA;

/* Step 1: write each cluster center COUNT times with standard normal noise */
data &INFILE.;
    do I = 1 to &COUNT.;
        X = 3.0; Y = 3.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output;
        X = 5.0; Y = 5.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output;
        X = 4.0; Y = 6.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output;
        X = 6.0; Y = 4.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output;
        X = 4.0; Y = 4.0; NOISE_X = rannor(&SEED.); NOISE_Y = rannor(&SEED.); output;
    end;
    drop I;
run;

/* Step 2: add the weighted noise to each coordinate */
data &OUTFILE.;
    set &INFILE.;
    X = X + &WEIGHT.*NOISE_X;
    Y = Y + &WEIGHT.*NOISE_Y;
    drop NOISE_X;
    drop NOISE_Y;
run;

Changed “Noise Weight” from 0.5 to 0.2.

Less “noise” was introduced, so the clusters will be more “defined”.

Different Seed Selection Methods [figures, slides 134-135]

Random Seeds – Shuffle Cards

%let SEED = 2;
%let INFILE = RANDOM_DATA;
%let TEMPFILE = TEMPFILE;
%let OUTFILE = SORTED_DATA;

/* Attach a uniform random sort key to every record */
data &TEMPFILE.;
    set &INFILE.;
    SORT = ranuni( &SEED. );
run;

/* Sorting by the random key shuffles the records */
proc sort data=&TEMPFILE.;
    by SORT;
run;

data &OUTFILE.;
    set &TEMPFILE.;
    drop SORT;
run;

proc print data=&OUTFILE.(obs=5);
run;

Random Number Seed: Changing this value will cause the list of data points to be put in a different order (“shuffled”).

Different Seed Selection Methods

Set the number of clusters to exactly 5

Use the first 5 data points as cluster seeds.

• Repeat for “Partial”

• Repeat for “Full”

• Repeat for “MacQueen”

• Repeat for “Princomp”
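Outside Enterprise Miner, a similar comparison can be sketched with PROC FASTCLUS from SAS/STAT, whose `REPLACE=` option controls how initial seeds are replaced. This is an illustration under assumptions, not the Cluster node itself: whether each Enterprise Miner setting corresponds exactly to a FASTCLUS option is not confirmed by the slides, and the macro name `TRY_SEEDS` and the output names `CENTERS_...` are hypothetical.

```sas
/* Sketch only: PROC FASTCLUS stands in for the Enterprise Miner Cluster node */
%macro TRY_SEEDS( REPLACE );
   proc fastclus data=SORTED_DATA
                 maxclusters=5                /* exactly 5 clusters               */
                 replace=&REPLACE.            /* seed replacement rule            */
                 maxiter=100
                 outseed=CENTERS_&REPLACE.    /* hypothetical output data set     */
                 noprint;
      var X Y;
   run;
%mend;

%TRY_SEEDS( NONE );   /* no replacement: the first usable observations stay as seeds */
%TRY_SEEDS( PART );   /* partial replacement */
%TRY_SEEDS( FULL );   /* full replacement    */
```

Comparing the three `CENTERS_` data sets shows whether the seed rule changes the final centers for a given shuffle of the data.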

First “N” Selection Method

MacQueen Selection Method

Full Selection Method

Partial Selection Method

Princomp Selection Method

What Are The Cluster Centers?

Different Starting Points and Settings Can Yield Different Results

• Occasionally sub-optimal clusters are found

• Usually the same optimal clusters are found regardless of starting points and settings

Five different settings

• 2 of 5 have sub-optimal clusters

• 3 of 5 have optimal clusters

- Even sub-optimal clusters have some similarity to optimal clusters

Monte Carlo Clustering

Monte Carlo Macros

Monte Carlo Clustering

Cluster data repeatedly:

• Use different methods for determining starting points

• Use different clustering methods

After each clustering run finishes:

• Record the number of clusters

• Record the cluster centers

After numerous iterations:

• Determine the correct number of clusters

• Cluster the “Cluster Centers”
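The procedure above can be sketched in base SAS as a macro loop. This is a minimal sketch under stated assumptions, not the Enterprise Miner workflow shown in the slides: PROC FASTCLUS replaces the Cluster node, the number of clusters `K` is fixed instead of selected automatically (so only the centers are recorded), and the names `MONTE_CARLO`, `_SHUFFLED_`, `_CENTERS_`, `ALL_CENTERS`, and `ITERATION` are hypothetical.

```sas
%macro MONTE_CARLO( RUNS = 10, K = 5 );
   %local I;
   %do I = 1 %to &RUNS.;

      /* Shuffle: a seed of -1 ties ranuni to the clock, so every pass differs */
      data _SHUFFLED_;
         set RANDOM_DATA;
         SORT = ranuni( -1 );
      run;

      proc sort data=_SHUFFLED_;
         by SORT;
      run;

      /* Cluster the reshuffled data; record order influences the initial seeds */
      proc fastclus data=_SHUFFLED_(drop=SORT)
                    maxclusters=&K.
                    maxiter=100
                    outseed=_CENTERS_
                    noprint;
         var X Y;
      run;

      /* Tag this run's centers and append them to the running collection */
      data _CENTERS_;
         set _CENTERS_(keep=X Y _FREQ_);
         ITERATION = &I.;
      run;

      proc append base=ALL_CENTERS data=_CENTERS_ force;
      run;

   %end;
%mend;

%MONTE_CARLO( RUNS = 10, K = 5 );
```

Because PROC APPEND keeps adding to `ALL_CENTERS`, calling the macro again extends the collection rather than replacing it.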

SAS Macro: Sleep

• Macro will cause the SAS Program to “sleep” for a specified number of seconds.

• This gives the operating system time to write files to disk and prevents deadlocks.

Parameters

%SLEEP( HOWLONG );

- HOWLONG : How many seconds the program should “sleep”

%macro SLEEP( HOWLONG );
   data;
      time_slept=sleep(&HOWLONG.,1);
   run;
%mend;
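A possible variant of the macro, with a sample call, is sketched below. The only change is an assumption about tidiness: `data _null_` runs the step without creating a numbered WORK data set (DATA1, DATA2, and so on), which a bare `data;` statement produces on every call. The second argument of `sleep` sets the unit to 1 second, as in the original.

```sas
%macro SLEEP( HOWLONG );
   data _null_;                               /* run the step without an output data set */
      time_slept = sleep( &HOWLONG., 1 );     /* pause HOWLONG seconds */
   run;
%mend;

/* Pause for two seconds, for example between two PROC APPEND steps */
%SLEEP( 2 );
```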

SAS Macro: Save_Cluster_Info

• Stores the number of clusters and the cluster centers found by:

- SAS Enterprise Miner Cluster Node

- SAS Enterprise Miner SOM/Kohonen Node

• Results are collected from Enterprise Miner nodes and appended to SAS data files

• Clusters with rare membership are deleted

CENTERFILE : Cluster Centers from SAS Enterprise Miner

OUTFILE_CENTERS : File to store the Cluster Centers

OUTFILE_HOWMANY : File to store the number of Cluster Centers

TEMPFILE : Temporary File to hold data

HOWMANY : Name of the variable that will store the number of clusters

CUTOFFPCT : If a cluster has less than this percent of the records, delete it

HOWLONG : How many seconds to sleep between functions

%SAVE_CLUSTER_INFO( CENTERFILE, OUTFILE_CENTERS, OUTFILE_HOWMANY, TEMPFILE = TEMPFILE, HOWMANY = _HOWMANY_, CUTOFFPCT = 0.1, HOWLONG = 1 );

SAS Macro: Save_Cluster_Info Sample Run: First run found “7” Clusters

How Many Clusters File:

Obs  _HOWMANY_
  1      7

Cluster Center File:

Obs  _HOWMANY_     X         Y
  1      7      3.03363   2.74416
  2      7      4.16234   3.71581
  3      7      6.03689   3.95565
  4      7      4.00574   4.82218
  5      7      2.71785   3.49664
  6      7      3.96880   6.14890
  7      7      5.12917   5.06041

SAS Macro: Save_Cluster_Info Sample Run: Second Run found “5” Clusters

How Many Clusters File:

Obs  _HOWMANY_
  1      7
  2      5

Cluster Center File:

Obs  _HOWMANY_     X         Y
  1      7      3.03363   2.74416
  2      7      4.16234   3.71581
  3      7      6.03689   3.95565
  4      7      4.00574   4.82218
  5      7      2.71785   3.49664
  6      7      3.96880   6.14890
  7      7      5.12917   5.06041
  8      5      2.90993   3.03597
  9      5      4.06764   3.98039
 10      5      6.01687   3.95821
 11      5      5.02613   5.04958
 12      5      3.94580   6.00564

SAS Macro: Save_Cluster_Info Sample Run: Third Run found “5” Clusters

How Many Clusters File:

Obs  _HOWMANY_
  1      7
  2      5
  3      5

Cluster Center File:

Obs  _HOWMANY_     X         Y
  1      7      3.03363   2.74416
  2      7      4.16234   3.71581
  3      7      6.03689   3.95565
  4      7      4.00574   4.82218
  5      7      2.71785   3.49664
  6      7      3.96880   6.14890
  7      7      5.12917   5.06041
  8      5      2.90993   3.03597
  9      5      4.06764   3.98039
 10      5      6.01687   3.95821
 11      5      5.02613   5.04958
 12      5      3.94580   6.00564
 13      5      6.02094   3.95189
 14      5      3.95058   6.00808
 15      5      5.05562   5.05453
 16      5      4.07909   4.01951
 17      5      2.92715   3.03838

SAS Macro: Save_Cluster_Info (page 1 of 3)

%macro CLUSTER_SLEEP( HOWLONG );
   data;
      time_slept=sleep(&HOWLONG.,1);
   run;
%mend;

%macro SAVE_CLUSTER_INFO( CENTERFILE, OUTFILE_CENTERS, OUTFILE_HOWMANY,
                          TEMPFILE = TEMPFILE, HOWMANY = _HOWMANY_,
                          CUTOFFPCT = 0.1, HOWLONG = 1 );

   /* Drop diagnostic columns written by the Cluster node and the SOM/Kohonen node */
   data &TEMPFILE.;
      set &CENTERFILE.;
      drop _RADIUS_;
      drop _CRIT_ _XCONV_ _FCONV_ _RMSSTD_ _NEAR_ _GAP_ _SEGMENT_;
      drop _CRIT_ _XCONV_ _FCONV_ SOM_SEGMENT _RMSSTD_ _NEAR_ _GAP_ SOM_DIMENSION1 SOM_DIMENSION2 SOM_ID;
   run;

SAS Macro: Save_Cluster_Info (page 2 of 3)

   /* Total the record counts (_FREQ_) over all clusters */
   data;
      set &TEMPFILE.;
      retain &HOWMANY.;
      if _N_ = 1 then &HOWMANY. = 0;
      &HOWMANY. = &HOWMANY. + _FREQ_;
      call symput("HOWMANYCOUNT", &HOWMANY. );
   run;

   /* Delete clusters holding less than CUTOFFPCT percent of the records */
   data &TEMPFILE.;
      set &TEMPFILE.;
      if _FREQ_ / &HOWMANYCOUNT. * 100 < &CUTOFFPCT. then delete;
   run;

   /* Count the clusters that remain */
   data;
      set &TEMPFILE.;
      retain &HOWMANY.;
      if _N_ = 1 then &HOWMANY. = 0;
      &HOWMANY. = &HOWMANY. + 1;
      call symput("HOWMANYCOUNT", &HOWMANY. );
   run;

   /* Store the cluster count on every center record */
   data &TEMPFILE.;
      length &HOWMANY. 8.;
      set &TEMPFILE.;
      &HOWMANY. = &HOWMANYCOUNT.;
      drop _FREQ_;
   run;

SAS Macro: Save_Cluster_Info (page 3 of 3)

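   %cluster_sleep(&HOWLONG.);

   /* Append this run's centers to the accumulated center file */
   proc append data=&TEMPFILE. out=&OUTFILE_CENTERS. force;
   run;

   %cluster_sleep(&HOWLONG.);

   /* Reduce to a single record holding the cluster count */
   data &TEMPFILE.;
      set &TEMPFILE.(obs=1);
      keep &HOWMANY.;
   run;

   %cluster_sleep(&HOWLONG.);

   /* Append this run's cluster count to the accumulated count file */
   proc append data=&TEMPFILE. out=&OUTFILE_HOWMANY. force;
   run;

   %cluster_sleep(&HOWLONG.);
%mend;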
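The rare-cluster cutoff on page 2 of the macro can also be expressed with PROC SQL. The sketch below is an alternative formulation for illustration, not part of the author's macro: it assumes the center records sit in a data set named `TEMPFILE` with a `_FREQ_` column, and the names `TOTALFREQ` and `KEPT_CENTERS` are hypothetical.

```sas
%let CUTOFFPCT = 0.1;

proc sql noprint;
   /* Total number of records across all clusters */
   select sum(_FREQ_) into :TOTALFREQ trimmed
      from TEMPFILE;

   /* Keep clusters holding at least CUTOFFPCT percent of the records */
   create table KEPT_CENTERS as
      select *
      from TEMPFILE
      where _FREQ_ / &TOTALFREQ. * 100 >= &CUTOFFPCT.;

   /* Number of clusters that survived the cutoff */
   select count(*) into :HOWMANYCOUNT trimmed
      from KEPT_CENTERS;
quit;

%put NOTE: &HOWMANYCOUNT. clusters kept out of a total frequency of &TOTALFREQ.;
```

With the default `CUTOFFPCT` of 0.1 and 1,000 records, only a cluster containing fewer than one record would be dropped, so a larger cutoff may be needed before the filter has any effect on data of this size.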

%include the MACRO

SAS Enterprise Miner Project Start Code

EXAMPLE-Using the Macro: Diagram 5100

Cluster Node Data Collection

1. Use the same Synthetic Data Program as Example 3 of Lecture 4, with the Noise factor set to 0.5

• 200 points centered at (3,3)

• 200 points centered at (5,5)

• 200 points centered at (4,6)

• 200 points centered at (6,4)

• 200 points centered at (4,4)

2. Use the same “shuffle program”, with SEED = -1

• A value of -1 causes the computer clock to be used as a “seed”.

• This results in a different random seed being used every time the program is executed.

3. Cluster Node Settings

• Ward Clustering (but any method will do)

• Partial Replacement Cluster Seed (but any method will do)

• Automatic Cluster Selection

- Max 7: (Maximum Value from Example 3 of Lecture 4)

- Min 3: (Minimum Value from Example 3 of Lecture 4)

4. Save the Cluster Centers using the Save_Cluster_Info Macro.

5. Rerun Numerous Times

• Shuffle

• Cluster

• Save_Cluster_Info

6. Cluster the Cluster Centers

Cluster Node Data Collection Enterprise Miner Diagram

Create synthetic data with the noise factor of 0.5

Shuffle the data with a random seed of -1 (tied to clock)

Set the “Rerun” to “Yes” so that this code node will rerun every time and will reshuffle the data.

%let SEED     = -1;
%let INFILE   = &EM_IMPORT_DATA.;
%let TEMPFILE = TEMPFILE;
%let OUTFILE  = SORTED_DATA;

data &TEMPFILE.;
   set &INFILE.;
   SORT = ranuni( &SEED. );
run;

proc sort data=&TEMPFILE.;
   by SORT;
run;

data &OUTFILE.;
   set &TEMPFILE.;
   drop SORT;
run;

proc print data=&OUTFILE.(obs=5);
run;

Random Number Seed is set to -1. This ties the random number seed to the clock.

Every time this program is executed, the data will be in a different order.

Create clusters using the data points that were shuffled in the previous node.

Number of clusters is set to “Automatic”. The MAX is set to “7” and the MIN is set to “3” because that was the range found in Lecture 4 Example 3. The Clustering Method is set to “Ward”, but “Average” or “Centroid” could also be used.

Seed Initialization is set to “Partial Replacement”, but other methods could be used.

Call the SAS Macro “Save_Cluster_Info” in order to save the results from the cluster node.

%let INFILE      = &EM_IMPORT_CLUSMEAN.;
%let CENTERFILE  = SGFLIB.y5100_CENTERFILE;
%let HOWMANYFILE = SGFLIB.y5100_HOWMANYFILE;

proc print data=&INFILE.;
run;

%save_cluster_info( &INFILE., &CENTERFILE., &HOWMANYFILE. );

proc print data=&CENTERFILE.(obs=30);
run;

proc print data=&HOWMANYFILE.(obs=10);
run;

proc freq data=&HOWMANYFILE.;
   table _HOWMANY_ /missing;
run;

data &EM_EXPORT_TRAIN.;
   set &CENTERFILE.;
run;

The Macro “Save_Cluster_Info” is called. The number of clusters is stored in the file SGFLIB.y5100_HOWMANYFILE and the cluster centers are saved in the file SGFLIB.y5100_CENTERFILE.

This graphing box is not necessary. It is being used for illustration purposes to display the cluster center points.

Cluster Node Data Collection Results: Run 1

Cluster Node Data Collection Center Points After 1 Run

Cluster Node Data Collection Results: Run 2

Cluster Node Data Collection Center Points After 2 Runs

Cluster Node Data Collection Results: Run 4

Cluster Node Data Collection Center Points After 4 Runs

After running the code TEN times, the graph suggests that the 5-cluster solution (red boxes) places its centers in roughly the same places from run to run. The 7-cluster solution (blue boxes) also places its centers in roughly the same places from run to run, but these differ from the red boxes.

Cluster Node Data Collection Center Points After 10 Runs

Looping in SAS Enterprise Miner

Automated Data Collection

Manually executing the Data Collection Program is time consuming.

SAS Enterprise Miner has a “Looping” structure to automate Cluster Data Collection.

IMPORTANT: Occasionally, when SAS Enterprise Miner is “Looping”, an error might occur. This is usually the result of a file deadlock state. It does not matter. Just exit Enterprise Miner and start running it again if you wish. You might already have collected enough samples by that point, so rerunning may not be necessary.

Automated Data Collection Enterprise Miner Diagram

START GROUP: Start of an Enterprise Miner “Loop”

The nodes inside the “Start”/“End” Group will execute multiple times

END GROUP: End of an Enterprise Miner “Loop”

• Rerun = Yes

Mode:

• Index: informs SAS that it will loop “N” number of times.

Index Count:

• The number of times the loop will execute. In this case the number will be “3”, but it can be set to a much higher value if a person plans to be away from their computer for a while.

Same “shuffle” data node as in previous example. The seed is -1 which means it is tied to the clock.

Rerun is set to “YES” just as before.

Same “Cluster Node” as in previous example. All settings should be the same as before.

Same “Save Results” Code Node as in the previous example. Note: “PROC PRINT” and other output PROCS won’t display inside of a loop.

Print the Results

%let CENTERFILE  = SGFLIB.y5100_CENTERFILE;
%let HOWMANYFILE = SGFLIB.y5100_HOWMANYFILE;

proc print data=&CENTERFILE.(obs=100);
run;

proc print data=&HOWMANYFILE.(obs=100);
run;

proc freq data=&HOWMANYFILE.;
   table _HOWMANY_ /missing;
run;

After running 233 times, it is observed that:

• 70% of the time, 5 clusters are found

• 26% of the time, 7 clusters are found

Note: Because of the nature of the random number generator, rerunning this model might yield slightly different results.
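The most frequent cluster count can be pulled out of the "how many" file programmatically rather than read off the PROC FREQ listing. The following is a small sketch under assumptions: the libref and file name come from the slides, while `HOWMANY_COUNTS` and the macro variable `BEST_HOWMANY` are hypothetical names introduced here.

```sas
/* Tabulate how often each cluster count occurred */
proc freq data=SGFLIB.y5100_HOWMANYFILE noprint;
   tables _HOWMANY_ / out=HOWMANY_COUNTS;    /* one row per value, with COUNT and PERCENT */
run;

/* Put the most frequent value first */
proc sort data=HOWMANY_COUNTS;
   by descending COUNT;
run;

/* Store the winning cluster count in a macro variable */
data _null_;
   set HOWMANY_COUNTS(obs=1);
   call symputx( "BEST_HOWMANY", _HOWMANY_ );
run;

%put NOTE: Most frequent number of clusters = &BEST_HOWMANY.;
```

The resulting value could then feed the `HOWMANY` setting used when filtering the center points.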

Automated Data Collection Clusters = 5 Center Points

Automated Data Collection Clusters = 7 Center Points

Cluster the Centers

Cluster the Cluster Centers Enterprise Miner Diagram

*%let CENTERFILE  = SGFLIB.y5100_CENTERFILE;
*%let HOWMANYFILE = SGFLIB.y5100_HOWMANYFILE;

%let CENTERFILE  = SGFLIB.z5100_CENTERFILE;
%let HOWMANYFILE = SGFLIB.z5100_HOWMANYFILE;
%let OUTFILE     = &EM_EXPORT_TRAIN.;

proc print data=&CENTERFILE.(obs=100);
run;

proc print data=&HOWMANYFILE.(obs=100);
run;

proc freq data=&HOWMANYFILE.;
   table _HOWMANY_ /missing;
run;

data &OUTFILE.;
   set &CENTERFILE.;
run;

For convenience, the program was already run 200+ times and the results were stored in the files SGFLIB.z5100_CENTERFILE and SGFLIB.z5100_HOWMANYFILE.

%let INFILE  = &EM_IMPORT_DATA.;
%let OUTFILE = &EM_EXPORT_TRAIN.;
%let HOWMANY = 5;

proc print data=&INFILE.(obs=100);
run;

data &OUTFILE.;
   set &INFILE.;
   if _HOWMANY_ = &HOWMANY.;
   drop _HOWMANY_;
run;

Only keep the CENTER POINTS for the times when 5 clusters were found.
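Once only the 5-cluster center points remain, clustering them yields one consensus center per group. In the slides this is done with another Enterprise Miner Cluster node; the sketch below swaps in PROC FASTCLUS as a base SAS stand-in, and the data set names `CENTERS_5` and `FINAL_CENTERS` are hypothetical.

```sas
/* Sketch only: PROC FASTCLUS stands in for the Enterprise Miner Cluster node */
proc fastclus data=CENTERS_5          /* hypothetical: center points from the 5-cluster runs */
              maxclusters=5
              maxiter=100
              outseed=FINAL_CENTERS   /* hypothetical: one consensus center per cluster */
              noprint;
   var X Y;
run;

proc print data=FINAL_CENTERS;
   var CLUSTER _FREQ_ X Y;
run;
```

Here `_FREQ_` counts how many recorded center points fell into each consensus cluster, which gives a rough sense of how consistently that center was found across runs.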

Example 3: Cluster the Cluster Centers Cluster of Center Points = 5

%let INFILE  = &EM_IMPORT_DATA.;
%let OUTFILE = &EM_EXPORT_TRAIN.;
%let HOWMANY = 7;

proc print data=&INFILE.(obs=100);
run;

data &OUTFILE.;
   set &INFILE.;
   if _HOWMANY_ = &HOWMANY.;
   drop _HOWMANY_;
run;

Only keep the CENTER POINTS for the times when 7 clusters were found.

Example 3: Cluster the Cluster Centers Cluster of Center Points = 7

Cluster the Cluster Centers Applied to Original Data

Example 3: Clusters = 5 Applied to Original Data

Example 3: Clusters = 7 Applied to Original Data
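Applying the consensus centers to the original data amounts to assigning each record to its nearest center. One way to sketch this in base SAS is PROC FASTCLUS with a `SEED=` data set and `MAXITER=0`, which assigns observations without moving the centers. This is an illustrative assumption about how the step could be done outside Enterprise Miner; `FINAL_CENTERS` and `SCORED_DATA` are hypothetical names, and the 7-cluster case would use `maxclusters=7` with the corresponding center file.

```sas
proc fastclus data=RANDOM_DATA
              seed=FINAL_CENTERS     /* hypothetical data set of consensus centers  */
              maxclusters=5
              maxiter=0              /* assign only; the centers are not updated    */
              out=SCORED_DATA        /* original records plus CLUSTER and DISTANCE  */
              noprint;
   var X Y;
run;

/* With 200 points per center, roughly equal cluster sizes are expected */
proc freq data=SCORED_DATA;
   tables CLUSTER;
run;
```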

Kohonen/SOM Clusters

Automated Data Collection Enterprise Miner Diagram

%let INFILE      = &EM_LIB..&EM_METASOURCE_NODEID._OUTMEAN;
%let CENTERFILE  = SGFLIB.y6100_CENTERFILE;
%let HOWMANYFILE = SGFLIB.y6100_HOWMANYFILE;

proc print data=&INFILE.;
run;

%save_cluster_info( &INFILE., &CENTERFILE., &HOWMANYFILE. );

proc print data=&CENTERFILE.(obs=30);
run;

proc print data=&HOWMANYFILE.(obs=10);
run;

proc freq data=&HOWMANYFILE.;
   table _HOWMANY_ /missing;
run;

data &EM_EXPORT_TRAIN.;
   set &CENTERFILE.;
run;

SAS Enterprise Miner creates a file to hold the Kohonen/SOM center points. But they are not exported. Therefore, you need to go out and get them!

Questions?
