Ch. Eick et al.: Using Clustering to Learn Distance Functions, MLDM 2005
Using Clustering to Learn Distance Functions for Supervised Similarity Assessment
Christoph F. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta Department of Computer Science
University of Houston
Organization of the Talk
1. Similarity Assessment
2. A Framework for Distance Function Learning
3. Inside Outside Weight Updating
4. Distance Function Learning Research at UH-DMML
5. Experimental Evaluation
6. Other Distance Function Learning Research
7. Summary
1. Similarity Assessment
Definition: Similarity assessment is the task of determining which objects are similar to each other and which are dissimilar to each other.
Goal of Similarity Assessment: Construct a distance function!
Applications of Similarity Assessment:
• Case-based reasoning
• Classification techniques that rely on distance functions
• Clustering
• …
Complications:
• Usually, there is no universal "good" distance function for a set of objects; the usefulness of a distance function depends on the task it is used for ("no free lunch in similarity assessment either").
• Defining the distance between objects is more an art than a science.
The following relation is given (with 10000 tuples):
Patient(ssn, weight, height, cancer-sev, eye-color, age, …)
• Attribute Domains
– ssn: 9 digits
– weight: between 30 and 650; μweight = 158, σweight = 24.20
– height: between 0.30 and 2.20 (in meters); μheight = 1.52, σheight = 19.2
– cancer-sev: 4=serious 3=quite_serious 2=medium 1=minor
– eye-color: {brown, blue, green, grey }
– age: between 3 and 100; μage = 45, σage = 13.2
Motivating Example: How To Find Similar Patients?
Task: Define Patient Similarity
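As a starting point, a hand-crafted distance function for this relation might look as follows. This is a minimal sketch: the attribute weights, the scaling by the σ values from the slide, and the 0/1 treatment of eye-color are illustrative assumptions, not the learned function developed later in the talk.

```python
# Sketch of a hand-crafted patient distance function (illustrative only).

def patient_distance(p, q, weights):
    """Weighted sum of per-attribute distances between two patients."""
    # Numeric attributes: absolute difference scaled by the domain's
    # standard deviation (sigma values taken from the slide).
    sigmas = {"weight": 24.20, "height": 19.2, "age": 13.2}
    d = 0.0
    for attr, sigma in sigmas.items():
        d += weights[attr] * abs(p[attr] - q[attr]) / sigma
    # Ordinal attribute cancer-sev: difference of severity codes,
    # scaled to [0, 1] by the domain width (4 - 1 = 3).
    d += weights["cancer_sev"] * abs(p["cancer_sev"] - q["cancer_sev"]) / 3
    # Nominal attribute eye-color: 0 if equal, 1 otherwise.
    d += weights["eye_color"] * (0 if p["eye_color"] == q["eye_color"] else 1)
    # ssn is an identifier and contributes nothing to similarity.
    return d

w = {"weight": 1.0, "height": 1.0, "age": 1.0, "cancer_sev": 2.0, "eye_color": 0.5}
a = {"weight": 158, "height": 1.52, "age": 45, "cancer_sev": 2, "eye_color": "blue"}
b = {"weight": 182, "height": 1.52, "age": 58, "cancer_sev": 3, "eye_color": "brown"}
```

Choosing the weights by hand is exactly the "art rather than science" problem; the rest of the talk is about learning them.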
[Architecture diagram: a Data Extraction Tool connects a DBMS (via an Object View) to a Clustering Tool and a Similarity Measure Tool, both driven through a User Interface. The Similarity Measure Tool draws on a library of similarity measures, default choices and domain information, and type and weight information; the Clustering Tool draws on a library of clustering algorithms and returns a set of clusters for a given similarity measure. A Learning Tool that derives similarity measures from training data is today's topic.]
CAL-FULL/UH Database Clustering & Similarity Assessment Environments
For more details: see [RE05]
2. A Framework for Distance Function Learning
• Assumption: the distance between two objects is computed as the weighted sum of the distances with respect to their attributes.
• Objective: learn a "good" distance function for classification tasks.
• Our approach: apply a clustering algorithm, with the object distance function to be evaluated, that returns k clusters.
• Our goal is to learn the weights of an object distance function such that pure clusters are obtained (or clusters that are as pure as possible); a pure cluster contains examples belonging to a single class.
d(o1, o2) = Σ_{i=1..p} w_i * d_i(o1, o2)
where d_i is the distance with respect to the i-th attribute and w_i is its weight.
Idea: Coevolving Clusters and Distance Functions
[Diagram: a clustering algorithm is applied to dataset X using the current distance function; the resulting clusters are evaluated by q(X) (clustering evaluation, yielding the goodness of the distance function), and a weight-updating scheme / search strategy revises the distance function.]
[Illustration: under a "bad" distance function, clusters mix x and o examples; under a "good" distance function, each cluster contains examples of a single class.]
3. Inside/Outside Weight Updating
Idea: move the examples of the majority class closer to each other.
o := examples belonging to the majority class; x := non-majority-class examples

Cluster 1, distances with respect to Att1:  x o o o o x  → Action: increase the weight of Att1
Cluster 1, distances with respect to Att2:  o o x x o o  → Action: decrease the weight of Att2
Inside/Outside Weight Updating Algorithm
1. Cluster the dataset with k-means, using the distance function with the current weight vector w = (w1, …, wp).
2. FOR EACH cluster-attribute pair DO: modify w using inside/outside weight updating.
3. IF NOT DONE, CONTINUE with Step 1; OTHERWISE, RETURN w.
Inside/Outside Weight Updating Heuristic
The weight of the i-th attribute, w_i, is updated as follows for a given cluster:

w_i' = w_i * (1 + α * (σ_i − μ_i))

where
α: learning rate (e.g. 0.3)
μ_i: average distance of the majority-class objects in the cluster with respect to the i-th attribute
σ_i: average distance of all objects in the cluster with respect to the i-th attribute

Example 1: o o x x o o    Example 2: x o o o o x
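The per-cluster update can be sketched in a few lines. This is a minimal illustration assuming absolute per-attribute differences as the attribute distances and the multiplicative update w_i' = w_i * (1 + α(σ_i − μ_i)); the data layout (cluster as a list of attribute vectors) is an assumption.

```python
from collections import Counter
from itertools import combinations

def iowu_update(cluster, labels, weights, alpha=0.3):
    """One inside/outside weight update for one cluster.

    cluster: list of example vectors (one value per attribute, >= 2 examples)
    labels:  class label of each example (majority class must have >= 2 examples)
    weights: current attribute weights, modified in place and returned
    """
    majority = Counter(labels).most_common(1)[0][0]
    maj = [x for x, y in zip(cluster, labels) if y == majority]
    for i in range(len(weights)):
        # sigma_i: average pairwise distance of ALL objects w.r.t. attribute i
        all_pairs = [abs(a[i] - b[i]) for a, b in combinations(cluster, 2)]
        sigma = sum(all_pairs) / len(all_pairs)
        # mu_i: average pairwise distance of MAJORITY-class objects w.r.t. attribute i
        maj_pairs = [abs(a[i] - b[i]) for a, b in combinations(maj, 2)]
        mu = sum(maj_pairs) / len(maj_pairs)
        # increase w_i when majority examples are closer than average (mu < sigma)
        weights[i] = weights[i] * (1 + alpha * (sigma - mu))
    return weights
```

Attributes along which the majority class is tightly packed get their weight increased; attributes along which it is spread out get their weight decreased, matching the two examples above.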
Example: Inside/Outside Weight Updating
[Illustration: cluster k containing objects 1-6, with per-attribute distances shown for Attribute1, Attribute2, and Attribute3]
Initial weights: w1 = w2 = w3 = 1; updated weights: w1 = 1.14, w2 = 1.32, w3 = 0.84
Illustration: Net Effect of Weight Adjustments
[Illustration: old vs. new object distances within cluster k (objects 1-6) after the weight adjustment]
A Slightly Enhanced Weight Update Formula
w_i' = w_i * (1 + α * η * (σ_i − μ_i))

where
η: the cluster's size divided by the average cluster size
α: learning rate (e.g. 0.3)
μ_i: average distance of the majority-class objects in the cluster with respect to the i-th attribute
σ_i: average distance of all objects in the cluster with respect to the i-th attribute
Sample Run of IOWU for the Diabetes Dataset
4. Distance Function Learning Research at UH-DMML
[Overview diagram: each approach pairs a weight-updating scheme / search strategy with a distance function evaluation method]
Weight-Updating Scheme / Search Strategy: Randomized Hill Climbing; Adaptive Clustering [BECV05]; Inside/Outside Weight Updating [ERBV04]; …
Distance Function Evaluation: K-Means; Supervised Clustering (current research [EZZ04]); NN-Classifier (work by Karypis); other research; …
5. Experimental Evaluation
• Used a benchmark consisting of 7/15 UCI datasets (7 in one experiment, 15 in another)
• Inside/outside weight updating was run for 200 iterations
• The learning rate α was set to 0.3
• Evaluation (10-fold cross-validation, repeated 10 times, was used to determine accuracy):
– Used a 1-NN classifier as the baseline classifier
– Used the learned distance function for a 1-NN classifier
– Used the learned distance function for an NCC classifier (new!)
NCC-Classifier
[Illustration: (a) dataset clustered by k-means; (b) dataset edited using cluster centroids A-F, each carrying the class label of its cluster's majority class (axes: Attribute1, Attribute2)]
Idea: the training set is replaced by k (centroid, majority class) pairs computed with k-means; the dataset so generated is then used to classify the examples in the test set.
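A minimal sketch of the NCC idea, assuming the cluster assignments have already been produced by k-means; the function names and the squared-Euclidean tie-breaking are illustrative assumptions.

```python
from collections import Counter

def ncc_model(points, labels, assignments):
    """Replace the training set by one (centroid, majority-class) pair per cluster."""
    model = []
    for c in set(assignments):
        members = [i for i, a in enumerate(assignments) if a == c]
        dim = len(points[0])
        # centroid: attribute-wise mean of the cluster's members
        centroid = tuple(sum(points[i][d] for i in members) / len(members)
                         for d in range(dim))
        # the centroid carries the label of the cluster's majority class
        majority = Counter(labels[i] for i in members).most_common(1)[0][0]
        model.append((centroid, majority))
    return model

def ncc_classify(model, x):
    """Assign x the class of the nearest centroid (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(model, key=lambda cm: sq_dist(cm[0]))[1]
```

With k much smaller than the training-set size, classification cost drops from one distance computation per training example to one per centroid.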
Experimental Evaluation
Dataset n k 1-NN LW1NN NCC C4.5
DIABETES 768 35 70.62 68.89 73.07 74.49
VEHICLE 846 64 69.59 69.86 65.94 72.28
HEART-STATLOG 270 10 76.15 77.52 81.07 78.15
GLASS 214 30 69.95 73.5 66.41 67.71
HEART-C 303 25 76.06 76.39 78.77 76.94
HEART-H 294 25 78.33 77.55 81.54 80.22
IONOSPHERE 351 10 87.1 91.73 86.73 89.74
Remark: statistically significant improvements were highlighted in red on the original slide.
DF-Learning With Randomized Hill Climbing
• Generate R solutions in the neighborhood of w and pick the best one to be the new weight vector w.
• Each weight is perturbed by

w_i' = w_i * (1 + α * Random)

where Random is a random number drawn uniformly from [−1, 1] and α is the rate of change (e.g. α = 0.3, so each weight changes by at most ±30%, i.e. within [−0.3, 0.3]).
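One step of this search could be sketched as follows; the sketch assumes the multiplicative perturbation w_i' = w_i * (1 + α * Random) and a fitness function to be maximized (both the function names and the greedy keep-the-best rule are illustrative).

```python
import random

def rhc_step(w, fitness, R=10, alpha=0.3, rng=random):
    """Generate R random neighbors of w and keep the best (highest fitness)."""
    best, best_q = list(w), fitness(w)
    for _ in range(R):
        # perturb every weight by up to +/- alpha * 100 percent
        cand = [wi * (1 + alpha * rng.uniform(-1, 1)) for wi in w]
        q = fitness(cand)
        if q > best_q:
            best, best_q = cand, q
    return best
```

Because the current w is kept when no neighbor improves on it, fitness never decreases across steps.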
Accuracy of IOWU and Randomized Hill Climbing
Dataset RHC(1c) RHC(2c) RHC(5c) IOWU(1c) IOWU(2c) IOWU(5c)
autos 48.21 46.66 38.32 40.94 45.70 41.39
breast-cancer 70.09 73.05 71.04 71.85 73.21 71.49
wisconsin-breast-cancer 94.47 96.24 95.06 94.41 96.67 94.03
credit-rating 53.17 47.17 44.59 53.28 49.14 45.88
pima_diabetes 71.56 73.91 73.24 72.11 73.80 74.22
german_credit 69.50 71.31 72.48 67.41 68.89 70.47
Glass 61.24 64.56 62.32 61.16 63.38 61.41
cleveland-14-heart-disease 77.89 74.87 71.20 77.33 73.39 67.30
hungarian-14-heart-disease 80.94 80.09 78.45 79.77 79.62 76.78
heart-statlog 82.33 81.67 76.37 82.15 81.78 77.52
ionosphere 82.74 85.75 86.17 85.25 89.72 89.57
sonar 70.70 71.97 73.68 71.70 72.67 73.43
vehicle 56.25 56.25 58.31 53.51 56.36 55.48
vote 94.67 90.54 88.84 93.68 94.21 89.05
zoo 78.97 67.19 56.11 79.20 68.75 53.80
Distance Function Learning With Adaptive Clustering
• Uses reinforcement learning to adapt distance functions for k-means clustering.
• Employs a search strategy that explores multiple paths in parallel. The algorithm maintains an open list of maximum size |L|; bad performers are dropped from the open list. Currently, beam search is used: it creates 2p successors (increasing and decreasing the weight of each attribute exactly once), evaluates those 2p*|L| successors, and keeps the best |L| of them.
• Discretizes the search space, in which states are (<weights>, <centroids>) tuples, into a grid, and memorizes and updates the fitness values of the grid; value iteration is limited to "interesting" states by employing prioritized sweeping.
• Weights are updated by increasing or decreasing the weight of an attribute by a randomly chosen percentage that falls within an interval [min-change, max-change]; our current implementation uses [25%, 50%].
• Employs entropy H(X) as the fitness function (low entropy = pure clusters).
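The beam-search successor generation can be sketched as follows. For simplicity the sketch uses a fixed change percentage (35%, inside the [25%, 50%] interval mentioned above) instead of a randomly chosen one, and a generic fitness function to be minimized (such as entropy); both simplifications are assumptions.

```python
def beam_step(open_list, fitness, delta=0.35, L=5):
    """One beam-search step over weight vectors.

    For each vector in the open list, create 2p successors (each attribute's
    weight increased and decreased by delta), then keep the best L successors.
    fitness: lower is better (e.g. entropy; low entropy = pure clusters).
    """
    successors = []
    for w in open_list:
        for i in range(len(w)):
            for sign in (+1, -1):
                s = list(w)
                s[i] = s[i] * (1 + sign * delta)
                successors.append(s)
    successors.sort(key=fitness)  # best performers first; the rest are dropped
    return successors[:L]
```

Iterating this step explores |L| paths in parallel while discarding bad performers, as described above.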
6. Related Distance Function Learning Research
• Interactive approaches that use user feedback and reinforcement learning to derive a good distance function.
• Other work uses randomized hill climbing and neural networks to learn distance functions for classification tasks; mostly, NN-queries are used to evaluate the quality of a clustering.
• Other work, mostly in the area of semi-supervised clustering, adapts object distances to cope with constraints.
7. Summary
• Described an approach that employs clustering for distance function evaluation.
• Introduced an attribute-weight updating heuristic called inside/outside weight updating and evaluated its performance.
• The inside/outside weight updating approach enhanced a 1-NN classifier significantly for some UCI datasets, but not for all datasets tested.
• The quality of the approach depends on the number of clusters k, which is an input parameter; our current research centers on determining k automatically with a supervised clustering algorithm [EZZ04].
• The general idea of replacing a dataset by cluster representatives to enhance NN classifiers shows a lot of promise, both here (as exemplified by the NCC classifier) and in other research we are currently conducting.
• Distance function learning is quite time-consuming: one run of 200 iterations of inside/outside weight updating takes between 5 seconds and 5 minutes, depending on dataset size and the value of k. Other techniques we are currently investigating are significantly slower; we are therefore moving to high-performance computing facilities for the empirical evaluation of our distance function learning approaches.
Links to 4 Papers
1. [EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version appeared in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf
2. [RE05] T. Ryu and C. Eick, A Clustering Methodology and Tool, Information Sciences 171(1-3): 29-59 (2005). http://www.cs.uh.edu/~ceick/kdd/RE05.doc
3. [ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, Proc. MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/~ceick/kdd/ERBV05.pdf
4. [BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen, R. Vilalta, Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience, submitted for publication. http://www.cs.uh.edu/~ceick/kdd/BECV05.pdf
Questions?
Randomized Hill Climbing
• Fast start: the algorithm starts with a small neighborhood size until it can no longer find a better solution; it then triples the neighborhood size, hoping that a better solution can be found by trying more points.
• Shoulder condition: when the algorithm has moved onto a shoulder or a flat hill, it keeps obtaining solutions with the same fitness value. Our algorithm terminates when it has tried 3 times and still gets the same result; this prevents it from being trapped on a shoulder forever.
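A sketch of a driver loop combining both heuristics. The exact interplay of neighborhood growth and termination is an assumption, since the slide does not fully specify it; `step` stands for any routine that searches a neighborhood of the given size.

```python
def rhc_with_faststart(w, fitness, step, size0=1, max_same=3):
    """Hill climbing with fast start and a shoulder-condition cutoff.

    step(w, size): return the best solution found in a neighborhood of w
                   of the given size.
    fitness:       higher is better.
    """
    size, same, best_q = size0, 0, fitness(w)
    while same < max_same:
        cand = step(w, size)
        q = fitness(cand)
        if q > best_q:
            w, best_q, same = cand, q, 0   # progress: keep climbing
        else:
            size *= 3                      # fast start: try a 3x larger neighborhood
            same += 1                      # shoulder: give up after max_same flat tries
    return w
```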
Randomized Hill Climbing
[Illustration: objective function plotted over the state space, showing a shoulder and a flat hill]
Purity in clusters obtained (internal)
Test 2.2 (β = 0.4); inside/outside weight updating (200 iterations); SCEC parameters: PS = 200, n = 30

Learning Rate (%) Diabetes Vehicle HeartStatlog Glass Heart-C Heart-H IONOSPHERE
10 0.23177 0.35142 0.13333 0.24230 0.33003 0.14286 0.11252
35 0.22135 0.33870 0.14074 0.23832 0.33003 0.14625 0.08717
50 0.21354 0.36213 0.14074 0.26099 0.33333 0.13265 0.08717
70 0.21745 0.35545 0.14074 0.23871 0.33333 0.13605 0.08717
Purity in clusters obtained (internal)
Test 2.2 (β = 0.4); randomized hill climbing (p = 30); SCEC parameters: PS = 200, n = 30

Learning Rate (%) Diabetes Vehicle HeartStatlog Glass Heart-C Heart-H IONOSPHERE
5 0.2174 0.3532 0.1407 0.2804 0.3399 0.1361 0.1196
15 0.2227 0.3550 0.1296 0.2407 0.3366 0.1020 0.1150
30 0.2174 0.3515 0.1148 0.2323 0.3333 0.1259 0.1207
50 0.2174 0.3320 0.1111 0.2330 0.3333 0.1259 0.1054
65 0.2214 0.3108 0.1148 0.2323 0.3135 0.1190 0.0957
80 0.2083 0.3092 0.1148 0.2196 0.3300 0.1361 0.1082
90 0.2057 0.3108 0.1296 0.2349 0.3201 0.1088 0.0872
Different Forms of Clustering
Objective of Supervised Clustering: minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).
A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β * Penalty(k)

Impurity(X) := (# of minority examples) / n

Penalty(k) := sqrt((k − c) / n) if k ≥ c, and 0 if k < c

where
k: number of clusters used
n: number of examples in the dataset
c: number of classes in the dataset
β: weight for Penalty(k), 0 < β ≤ 2.0
[Plot: Penalty(k) vs. k, for k up to 53]
Penalty(k) increases sub-linearly in k, because increasing the number of clusters from k to k+1 has a greater effect on the result when k is small than when it is large; hence the formula above.
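The fitness function can be computed directly from the per-cluster class labels. A minimal sketch, assuming Penalty(k) = sqrt((k − c)/n) for k ≥ c and 0 otherwise, as defined above; the function names are illustrative.

```python
import math
from collections import Counter

def impurity(labels_per_cluster, n):
    """Fraction of minority examples: examples not in their cluster's majority class."""
    minority = sum(len(c) - Counter(c).most_common(1)[0][1]
                   for c in labels_per_cluster)
    return minority / n

def penalty(k, c, n):
    """Penalty(k) = sqrt((k - c) / n) for k >= c, else 0."""
    return math.sqrt((k - c) / n) if k >= c else 0.0

def q(labels_per_cluster, c, beta=0.4):
    """q(X) = Impurity(X) + beta * Penalty(k)."""
    n = sum(len(cl) for cl in labels_per_cluster)
    k = len(labels_per_cluster)
    return impurity(labels_per_cluster, n) + beta * penalty(k, c, n)
```

For example, two clusters with labels [o, o, x] and [x, x] over c = 2 classes give Impurity = 1/5 = 0.2 and Penalty(2) = 0, so q(X) = 0.2.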