Date post: | 22-Jan-2018 |
Category: | Data & Analytics |
Upload: | asoka-korale |
MIXED NUMERIC AND CATEGORICAL
ATTRIBUTE CLUSTERING ALGORITHM MODELING
DR. ASOKA KORALE, C.ENG. MIET & MIESL
ADVANTAGES TO NUMERIC AND CATEGORICAL ATTRIBUTE CLUSTERING
Slide | 2
Improved Targeting in Campaigns & Insight into Segments
Currently clustering on numeric variables: Age, Net Stay, ARPU
PRIMARY ATTRIBUTES THAT CAN BE INCLUDED WITH MIXED ATTRIBUTE TYPE CLUSTERING: ACCOUNT TYPE, GENDER, GEO LOCATION, ……
Fuzzy C-Means algorithm currently used in clustering
DIGITAL ADVERTISING SEGMENTATIONS INCREASINGLY BASED ON CLUSTERING
Include other categorical attributes, depending on the interest segment, to create “Micro Segments”
WIDENING POTENTIAL INSIGHTS THROUGH CATEGORICAL CLUSTERING
Slide | 3
Improved Targeting in Campaigns & Insight into Segments
All attributes can be clustered, leading to a very specific and wider array of segments
Geographic attribute clustering to incorporate Income/ARPU hotspots at micro level
CONCEPT UNDERLYING THE MIXED K PROTOTYPES ALGORITHM [1]
Slide | 4
Points “d” and “c” may switch sides depending on how similar their numeric and categorical parts are to the numeric and categorical parts of each centroid (prototype)
The influence, or contribution, of the numeric and categorical attributes of a data point can be controlled via a parameter “gamma”
Point “a” may switch if its categorical part is closer to the categorical part of the centroid (prototype) than its numeric part is to the numeric part of the centroid
The numeric and categorical parts of a data point can be considered separately; two sets of centroids, one per attribute type, act as attractors in each cluster
Figure: data points plotted on axes Numeric Attribute 1 and Numeric Attribute 2; shapes represent two values of a single categorical variable
[1]. Huang, CSIRO, Australia
MIXED K PROTOTYPES ALGORITHM [1]
Slide | 5
Distance measure to a prototype (center) has two parts: numeric and categorical
Numeric attributes: Euclidean distance
Categorical attributes: dissimilarity measure
Centroid of numeric attributes: a simple average of the points in that cluster
Includes “Yij”, a fuzzy membership function, if we wish to go in that direction
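The two-part distance and the role of “gamma” can be sketched as below. This is a minimal illustration, not the exact formula used in the deck: it assumes squared Euclidean distance for the numeric part and a simple mismatch count as the categorical dissimilarity. The names mixed_distance and nearest, and the sample point and prototypes, are hypothetical.

```python
def mixed_distance(point, prototype, gamma):
    """Distance between a point and a prototype.

    Both arguments are (numeric_tuple, categorical_tuple) pairs.
    Assumption: squared Euclidean distance for the numeric part and a
    simple mismatch count for the categorical part, weighted by gamma.
    """
    num_p, cat_p = point
    num_q, cat_q = prototype
    numeric = sum((a - b) ** 2 for a, b in zip(num_p, num_q))
    categorical = sum(a != b for a, b in zip(cat_p, cat_q))
    return numeric + gamma * categorical


def nearest(point, prototypes, gamma):
    """Index of the prototype with the smallest mixed distance."""
    return min(range(len(prototypes)),
               key=lambda k: mixed_distance(point, prototypes[k], gamma))


# Hypothetical data: two prototypes and one point that is numerically
# closer to prototype 0 but shares its category with prototype 1.
prototypes = [((0.0, 0.0), ("circle",)), ((2.0, 0.0), ("square",))]
point = ((0.8, 0.0), ("square",))

for gamma in (0.0, 0.5, 2.0):
    print(gamma, nearest(point, prototypes, gamma))
```

With a small gamma the numeric part dominates and the point stays with prototype 0; once gamma is large enough the categorical match wins and the point switches to prototype 1.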
MIXED K PROTOTYPES ALGORITHM [1]
Slide | 6
Minimize the total cost “E”, the sum over all points of the distances to the numeric and categorical parts of the centroid (prototype) of their cluster
Centroid of categorical attributes: the most frequent value of each attribute in the cluster
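The prototype update and the cost “E” can be sketched as follows. This is an illustration under stated assumptions: hard (non-fuzzy) assignments, squared Euclidean distance for the numeric part and a mismatch count for the categorical part. The function names and the four sample points are hypothetical.

```python
from collections import Counter


def update_prototype(members):
    """Prototype of one cluster: mean of each numeric attribute and the
    most frequent value (mode) of each categorical attribute.

    members is a non-empty list of (numeric_tuple, categorical_tuple).
    Ties between equally frequent values are broken arbitrarily.
    """
    n = len(members)
    numeric = tuple(sum(m[0][j] for m in members) / n
                    for j in range(len(members[0][0])))
    categorical = tuple(
        Counter(m[1][j] for m in members).most_common(1)[0][0]
        for j in range(len(members[0][1])))
    return numeric, categorical


def total_cost(points, labels, prototypes, gamma):
    """Return (E, numeric part, categorical part) for a hard assignment."""
    cost_num = cost_cat = 0.0
    for (num, cat), k in zip(points, labels):
        q_num, q_cat = prototypes[k]
        cost_num += sum((a - b) ** 2 for a, b in zip(num, q_num))
        cost_cat += sum(a != b for a, b in zip(cat, q_cat))
    return cost_num + gamma * cost_cat, cost_num, cost_cat


# Hypothetical data: one numeric attribute and one categorical attribute.
points = [((1.0,), ("pre",)), ((3.0,), ("pre",)),
          ((2.0,), ("post",)), ((10.0,), ("post",))]
labels = [0, 0, 0, 1]
prototypes = [update_prototype([p for p, l in zip(points, labels) if l == k])
              for k in range(2)]
print(prototypes)
print(total_cost(points, labels, prototypes, gamma=0.5))
```

Reporting the numeric and categorical parts separately makes it possible to track each one per iteration, as well as the total.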
Slide | 7
CONVERGENCE PERFORMANCE
Figure: Total no of switches at each iteration (x-axis: Iteration Number, 0 to 40)
Figure: Total Distance at each iteration (x-axis: Iteration Number, 0 to 40; y-axis scale x 10^4; legend entries 1 to 8)
Figure: Total Categorical Distance at each iteration (x-axis: Iteration Number, 0 to 40; legend entries 1 to 8)
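The quantities tracked in the convergence plots, the number of points that switch cluster and the total distance at each iteration, can be recorded with a loop like the one below. This is a sketch of a batch variant (all points are reassigned, then all prototypes are updated), which is an assumption and may differ from the implementation behind the plots. The distance definition, the function names and the synthetic data are also assumptions.

```python
import random
from collections import Counter


def dist(p, q, gamma):
    # Assumed measure: squared Euclidean + gamma * number of mismatches.
    num = sum((a - b) ** 2 for a, b in zip(p[0], q[0]))
    cat = sum(a != b for a, b in zip(p[1], q[1]))
    return num + gamma * cat


def prototype(members):
    # Mean of the numeric attributes, mode of the categorical attributes.
    n = len(members)
    num = tuple(sum(m[0][j] for m in members) / n
                for j in range(len(members[0][0])))
    cat = tuple(Counter(m[1][j] for m in members).most_common(1)[0][0]
                for j in range(len(members[0][1])))
    return num, cat


def k_prototypes(points, k, gamma, max_iter=40, seed=0):
    """Return labels, prototypes and a per-iteration (switches, cost) history."""
    rng = random.Random(seed)
    protos = rng.sample(points, k)          # random initial prototypes
    labels = [None] * len(points)
    history = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(p, protos[c], gamma))
                      for p in points]
        switches = sum(a != b for a, b in zip(labels, new_labels))
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                     # keep the old prototype if empty
                protos[c] = prototype(members)
        cost = sum(dist(p, protos[l], gamma) for p, l in zip(points, labels))
        history.append((switches, cost))
        if switches == 0:                   # no point changed cluster
            break
    return labels, protos, history


# Synthetic data: two numeric attributes and one categorical attribute.
data_rng = random.Random(1)
points = ([((data_rng.gauss(0, 1), data_rng.gauss(0, 1)), ("pre",))
           for _ in range(50)] +
          [((data_rng.gauss(5, 1), data_rng.gauss(5, 1)), ("post",))
           for _ in range(50)])
labels, protos, history = k_prototypes(points, k=2, gamma=1.0)
for i, (switches, cost) in enumerate(history, 1):
    print(i, switches, round(cost, 2))
```

In the first iteration every point counts as a switch, because no point has a cluster yet. The total cost cannot increase from one iteration to the next, since both the reassignment step and the mean/mode update can only lower it or leave it unchanged.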
Slide | 8
CLUSTER & SEGMENT PROFILE
Figure: Number of Cx in each Cluster (x-axis: Cluster ID, 1 to 8)
Figure: Age by Cluster/Segment ID (1 to 8)
Figure: Net Stay by Cluster/Segment ID (1 to 8)
Figure: ARPU by Cluster/Segment ID (1 to 8; y-axis scale x 10^4)
Slide | 9
VALIDATION WITH DISTRIBUTION ANALYSIS
Cluster ID  Cx in Cluster  Avg. Age  Spread Age  Avg. Net-Stay  Spread Net-Stay  Avg. ARPU  Spread ARPU  Post Paid  Pre Paid  Female  Male
1           913            27        5           28             26               1231       1427         90         823       913     0
2           930            28        5           19             16               1407       1699         159        771       0       930
3           407            53        8           46             35               1095       1303         34         373       407     0
4           409            54        8           34             24               967        919          66         343       0       409
5           556            36        11          82             43               2601       2399         546        10        556     0
6           542            32        5           95             27               1031       927          0          542       67      475
7           1116           36        9           96             44               2917       2669         1116       0         0       1116
8           348            57        7           131            33               1205       853          147        201       33      315
Figure: Histogram Cx Age, Male (x-axis: Age (years); y-axis: Frequency)
Figure: Histogram Cx Age, Female (x-axis: Age (years); y-axis: Frequency)
The Age histograms are somewhat bi-modal, and the clustering is able to identify these modes
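The profile table can be cross-checked with a few lines of code. The sketch below copies the relevant columns from the table, confirms that the category counts in each cluster add up to the cluster size, and lists the average ages of the single-gender clusters, which is where the modes of the Age histograms would be looked for. The function names are hypothetical.

```python
# Rows copied from the cluster profile table:
# (cluster_id, size, avg_age, post_paid, pre_paid, female, male)
PROFILE = [
    (1, 913, 27, 90, 823, 913, 0),
    (2, 930, 28, 159, 771, 0, 930),
    (3, 407, 53, 34, 373, 407, 0),
    (4, 409, 54, 66, 343, 0, 409),
    (5, 556, 36, 546, 10, 556, 0),
    (6, 542, 32, 0, 542, 67, 475),
    (7, 1116, 36, 1116, 0, 0, 1116),
    (8, 348, 57, 147, 201, 33, 315),
]


def check_counts(profile):
    """True if account-type and gender counts each sum to the cluster size."""
    return all(post + pre == size and female + male == size
               for _, size, _, post, pre, female, male in profile)


def pure_gender_age_centres(profile):
    """Average ages of the clusters that contain a single gender only."""
    centres = {"Female": [], "Male": []}
    for _, size, avg_age, _, _, female, male in profile:
        if female == size:
            centres["Female"].append(avg_age)
        elif male == size:
            centres["Male"].append(avg_age)
    return {gender: sorted(ages) for gender, ages in centres.items()}


print(check_counts(PROFILE))
print(pure_gender_age_centres(PROFILE))
```

For both genders the single-gender clusters sit at average ages in the high 20s, mid 30s and mid 50s. Clusters 6 and 8 mix genders and are left out of this check.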
Slide | 10
VALIDATION WITH DISTRIBUTION ANALYSIS
Cluster Segment Profile
Cluster ID  Datapoints in Cluster  Avg. Age  Spread Age  Avg. Net-Stay  Spread Net-Stay  Avg. ARPU  Spread ARPU  Number Post Paid  Number Pre Paid  Number Female  Number Male
1           913                    27        5           28             26               1231       1427         90                823              913            0
2           930                    28        5           19             16               1407       1699         159               771              0              930
3           407                    53        8           46             35               1095       1303         34                373              407            0
4           409                    54        8           34             24               967        919          66                343              0              409
5           556                    36        11          82             43               2601       2399         546               10               556            0
6           542                    32        5           95             27               1031       927          0                 542              67             475
7           1116                   36        9           96             44               2917       2669         1116              0                0              1116
8           348                    57        7           131            33               1205       853          147               201              33             315
Figure: Histogram Cx Network Stay (x-axis: Net Stay (months), 0 to 240; y-axis: Frequency)
No identifiable structure in Net Stay distribution
Slide | 11
CLUSTERING NUMERIC PART OF SEGMENTS IN 3D
Figure: Segmental Analysis: Age, Net Stay and ARPU (3-D plot; axes: Age (normalized), Net-Stay (normalized), ARPU (normalized); legend: clusters 1 to 8)
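The axes in the plot are normalized, but the method is not stated. A common choice, assumed here, is a z-score per attribute, so that ARPU (values in the thousands) does not dominate Age and Net Stay in the Euclidean part of the distance. The function name is hypothetical; the sample rows reuse the average Age, Net-Stay and ARPU of clusters 1, 3 and 7 from the profile table purely as illustrative inputs.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) spread.

    Assumption: z-score normalization. A constant column maps to 0.0.
    """
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [tuple((v - m) / s if s else 0.0 for v, (m, s) in zip(row, stats))
            for row in rows]


# Illustrative inputs: (Age, Net Stay, ARPU)
rows = [(27, 28, 1231), (53, 46, 1095), (36, 96, 2917)]
for row in zscore_columns(rows):
    print(tuple(round(v, 2) for v in row))
```

After this step all three attributes are on a comparable scale, so the choice of gamma does not also have to compensate for the units of the numeric attributes.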
Slide | 12
NOTABLE POINTS
• Allows us to cluster most attributes (within reason)
• Particularly if the categorical attributes do not take many distinct values
• Reasonable convergence performance, both in terms of run time and number of iterations
• Different dissimilarity measures and distance criteria will give differing results
• The influence of the categorical part via gamma may also need to change with the method used
• Algorithm somewhat sensitive to initial conditions: initialization of centroids
• Explore the likelihood of falling into a local minimum and getting trapped there, leading to a sub-optimal final solution
• To do…..
• Each drop can result in a non-unique final result but will not impact the underlying trends and insights into each segment