Learning Instance Specific Distance Using Metric Propagation
De-Chuan Zhan, Ming Li, Yu-Feng Li, Zhi-Hua Zhou
LAMDA Group, National Key Lab for Novel Software Technology
Nanjing University, China
{zhandc, lim, liyf, zhouzh}@lamda.nju.edu.cn
http://lamda.nju.edu.cn/zhandc
Distance-based classification
K-nearest neighbor classification
SVM with Gaussian kernels
Is the distance reliable?
Distance metric learners are introduced…
Are there any more natural measurements?
Any more natural measurements?
When sky is compared to other pictures…
Color, probably texture features
When Phelps II is compared to other athletes…
Speed of swimming, shape of feet…
… our work:
Can we assign a specific distance measurement for each instance, both labeled and unlabeled?
Outline
Introduction
Our Methods
Experiments
Conclusion
Introduction
Distance Metric Learning
Many machine learning algorithms rely on a distance metric over the input data patterns.
• Classification
• Clustering
• Retrieval
There are many metric learning algorithms developed [Yang, 2006]
Problem: they focus on learning a uniform Mahalanobis distance for ALL instances
Introduction
Other distance functions
• Instead of applying a uniform distance metric to every example, it is more natural to measure distances according to the specific properties of the data
• Some researchers define distances from each sample's own perspective:
  • QSim [Zhou and Dai, ICDM'06] [Athitsos et al., TDS'07]
  • Local distance functions [Frome et al., NIPS'06, ICCV'07]
Introduction
Query sensitive similarity
Instance-specific or query-specific similarities have actually been studied in other fields before:
In content-based image retrieval, there has been a study that computes query-sensitive similarities: the similarities among different images are decided only after a query image is received [Zhou and Dai, ICDM'06]
The problem: the query-sensitive similarity is based on pure heuristics
Introduction
Local distance functions
• [Frome et al., NIPS'06]
$D_{ji} > D_{jk}$: the distance from the j-th instance to the i-th instance is larger than that from the j-th instance to the k-th
1. Cannot generalize directly
2. The local distances defined are not directly comparable
• [Frome et al., ICCV'07]
All constraints can be tied together, but more heuristics are required for testing
The problem:
Local distance functions for unlabeled data are not available.
Introduction
Our Work
Can we assign a specific distance measurement for each instance,
both labeled and unlabeled?
Yes, we learn Instance Specific Distance via Metric Propagation
Outline
Introduction
Our Methods
Experiments
Conclusion
Our Methods
Intuition
• Focus on learning instance-specific distances for both labeled and unlabeled data (a minimal sketch of such a distance follows below)
• For labeled data: pairs of examples from the same class should be closer to each other
• For unlabeled data: metric propagation on a relationship graph
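To make the intuition concrete, here is a minimal sketch of one possible parametric form of an instance-specific distance: each instance owns a non-negative per-feature weight vector w_i and measures a w_i-weighted squared distance to other points. The exact form used in the paper is not reproduced here; this is an illustrative assumption.

```python
import numpy as np

def instance_specific_distance(x_i, x_j, w_i):
    """Distance from instance x_i to x_j under x_i's OWN metric:
    a w_i-weighted squared distance, so each instance can emphasize
    the features that matter for it (color/texture for a sky image,
    swimming speed for an athlete)."""
    return float(w_i @ (x_i - x_j) ** 2)

# Two instances with different per-feature weights.
x_a, x_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w_a = np.array([1.0, 0.1])   # instance a mostly cares about feature 0
w_b = np.array([0.5, 1.0])   # instance b weighs feature 1 more heavily
print(instance_specific_distance(x_a, x_b, w_a))  # 1.1, measured by a
print(instance_specific_distance(x_b, x_a, w_b))  # 1.5, measured by b
# Note: the two directions disagree -- instance-specific distances
# are generally asymmetric.
```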
Our Methods
The ISD Framework
• Instead of directly conducting metric propagation while learning the distances for labeled examples, we formulate the metric propagation within a regularized framework:
  • The loss function for labeled data is induced by the labels of the instances and provides the side information
  • A regularization term is responsible for the implicit metric propagation
  • ℓ(·) is a convex loss function, such as the hinge loss in classification or the least-squares loss in regression
  • A pair (i, j) contributes a loss term when the j-th instance belongs to a class other than the i-th, or the j-th instance is a neighbor of the i-th instance, i.e., all cannot-links and some of the must-links are considered
• Inspired by [Zhu, 2003], the regularization term can be defined as a graph-smoothness penalty, e.g., $\Omega(W) = \sum_{i,j} S_{ij} \|w_i - w_j\|^2 = 2\,\mathrm{tr}(W^\top L W)$ with graph Laplacian $L = D - S$
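A minimal numeric sketch of an objective of this shape follows. The hinge form, margin parameter rho, and trade-off lam are illustrative assumptions, not the paper's precise formulation; the regularizer is the [Zhu, 2003]-style smoothness term written via the graph Laplacian.

```python
import numpy as np

def isd_objective(X, W, pairs, S, lam=1.0, rho=1.0):
    """Sketch of the regularized ISD objective: label-induced loss
    plus a graph-smoothness term that implicitly propagates metrics.

    X     : (n, d) data matrix
    W     : (n, d) matrix whose i-th row is instance i's weight vector w_i
    pairs : list of (i, j, y) with y = +1 for a must-link (same class)
            and y = -1 for a cannot-link -- the side information
    S     : (n, n) symmetric non-negative graph weight matrix
    """
    # Hinge loss: must-link distances should fall below rho - 1,
    # cannot-link distances should exceed rho + 1.
    loss = 0.0
    for i, j, y in pairs:
        dist = W[i] @ (X[i] - X[j]) ** 2       # instance i's distance to x_j
        loss += max(0.0, 1.0 - y * (rho - dist))
    # Smoothness regularizer: sum_ij S_ij ||w_i - w_j||^2 = 2 tr(W^T L W),
    # with L = D - S -- instances that are close on the graph are pushed
    # toward similar metrics (the implicit metric propagation).
    L = np.diag(S.sum(axis=1)) - S
    reg = 2.0 * np.trace(W.T @ L @ W)
    return loss + lam * reg
```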
Our Methods
The ISD Framework – relationship to FSM
• If the pairwise side information is replaced with higher-order side information, such as triplets, and L is set to the identity matrix, then FSM [Frome et al., NIPS'06] is a special case of ISD
• Although only pairwise side information is investigated in our work, the ISD framework is a general framework (see the reduction below)
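Concretely, setting L = I in the tr(WᵀLW) form from the previous slide collapses the propagation term to independent per-instance norm regularizers:

$$\mathrm{tr}(W^\top L W)\Big|_{L=I} = \sum_i \|w_i\|^2,$$

so each w_i is regularized on its own with no coupling between instances, which is exactly the FSM-style setting.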
Our Methods
The ISD Framework – updating the graph
• Given structure: a predefined graph
• Initialize the graph weights on this graph
• Learn the ISD, then update the graph weights in the new ISD space
• Iterate to obtain the final ISD (see the sketch below)
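Read as pseudocode, the update scheme is a simple alternating loop. The sketch below is schematic: `learn_isd` stands in for the L1/L2 solvers on the following slides, and the Gaussian re-weighting of the graph is an assumed choice, not necessarily the paper's.

```python
import numpy as np

def update_graph_weights(X, W, edge_mask, sigma=1.0):
    """Re-compute the graph weights using the current instance-specific
    distances (Gaussian re-weighting is assumed here for illustration)."""
    n = X.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if edge_mask[i, j]:                       # predefined graph structure
                dist = W[i] @ (X[i] - X[j]) ** 2      # distance in the new ISD space
                S[i, j] = np.exp(-dist / sigma)
    return (S + S.T) / 2.0                            # keep the graph symmetric

def isd_with_graph_updates(X, edge_mask, learn_isd, rounds=3):
    """Alternate between learning the ISD and updating the graph weights."""
    n, d = X.shape
    W = np.ones((n, d))                               # start from the Euclidean metric
    S = update_graph_weights(X, W, edge_mask)         # initial graph weights
    for _ in range(rounds):
        W = learn_isd(X, S)                           # learn ISD on the current graph
        S = update_graph_weights(X, W, edge_mask)     # re-weight in the new ISD space
    return W                                          # final ISD
```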
Our Methods
ISD with L1-loss
Introducing slack variables
Solving for all the w's simultaneously is challenging: the computational cost is too expensive.
Since the problem is convex, we employ an alternating descent method: sequentially solve for one instance's w at a time while fixing the other w's, until convergence or the maximum number of iterations is reached (sketched below).
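A schematic of that alternating descent. The per-instance subproblem is solved here by projected subgradient steps, a stand-in for the exact primal/dual solve on the next slide; the hinge form, step size, and non-negativity of w are assumptions carried over from the earlier sketch.

```python
import numpy as np

def solve_one_w(i, X, W, S, pairs_i, lam=1.0, lr=0.01, steps=50, rho=1.0):
    """Inner step of the alternating descent: update w_i while all other
    w's are fixed, via projected subgradient descent."""
    w = W[i].copy()
    for _ in range(steps):
        g = np.zeros_like(w)
        for j, y in pairs_i:                          # side information for instance i
            d_ij = (X[i] - X[j]) ** 2
            if 1.0 - y * (rho - w @ d_ij) > 0.0:      # hinge is active
                g += y * d_ij                         # its subgradient in w
        # Coupling to the neighbours' (fixed) metrics via the smoothness
        # term lam * sum_j S_ij ||w - w_j||^2.
        g += 2.0 * lam * (S[i].sum() * w - S[i] @ W)
        w = np.maximum(w - lr * g, 0.0)               # project onto w >= 0 (assumed)
    return w

def alternating_descent(X, S, pairs_by_instance, max_iters=20, tol=1e-4):
    """Sequentially re-solve one w at a time, fixing the others,
    until convergence or the maximum number of iterations is reached."""
    n, d = X.shape
    W = np.ones((n, d))
    for _ in range(max_iters):
        W_prev = W.copy()
        for i in range(n):
            W[i] = solve_one_w(i, X, W, S, pairs_by_instance.get(i, []))
        if np.abs(W - W_prev).max() < tol:
            break
    return W
```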
Our Methods
ISD with L1-loss (cont'd)
Primal:
Dual:
Our Methods
Acceleration: ISD with L2-loss
The alternating descent method is again used to solve the problem; however, the number of inequality constraints may be large.
For acceleration, we reduce the number of constraints by considering only some of the must-links.
Inspired by nu-SVM, we can probably obtain a more efficient method:
Our Methods
Acceleration: ISD with L2-loss (cont'd)
Dual:
• The dual has a linear equality constraint, which we drop during the optimization
• We project the solution back onto the feasible region after we get the optimization results (a sketch of this projection follows)
• Thus, the dual variables can be efficiently solved using Sequential Minimal Optimization
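The "project back" step has a closed form when the dropped constraint is a single linear equality aᵀα = b: subtract the violation along a. A small sketch; the specific a and b of the actual nu-SVM-style dual are not reproduced here.

```python
import numpy as np

def project_onto_equality(alpha, a, b):
    """Euclidean projection of alpha onto the hyperplane {x : a @ x = b}.
    After solving the dual while ignoring the linear equality constraint
    (e.g., with an SMO-style solver), the solution is mapped back to the
    nearest point satisfying the constraint."""
    return alpha - ((a @ alpha - b) / (a @ a)) * a

alpha = np.array([0.2, 0.5, 0.9])   # hypothetical dual solution
a, b = np.ones(3), 1.0              # e.g. a constraint sum(alpha) = 1
print(project_onto_equality(alpha, a, b))   # [0.0, 0.3, 0.7], sums to 1
```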
Outline
Introduction
Our Methods
Experiments
Conclusion
Experiments
Configurations
• Data sets:
  – 15 UCI data sets
  – COREL image dataset (20 classes, 100 images/class)
• 2/3 labeled training set; 1/3 unlabeled for testing; 30 runs
• Compared methods:
  – ISD-L1/L2
  – FSM/FSSM [Frome et al., 2006 & 2007]
  – LMNN [Weinberger et al., 2005]
  – DNE [Zhang et al., 2007]
• Parameters are selected via cross-validation
Experiments
Classification Performance
Comparison of test error rates (mean±std.) [table not shown]
Win/tie/loss counts of ISD vs. the other methods (t-test at the 95% significance level)
Experiments
Influence of the number of iteration rounds
[Figure: error rates vs. number of updating rounds, starting from the Euclidean distance]
The error rates of ISD-L1 are reduced on most datasets as the number of updates increases.
The error rates of ISD-L2 are reduced on some datasets; however, on others the performance degenerates.
Overfitting – the L2-loss is more sensitive to noise.
Experiments
Influence of the amount of labeled data
ISD is less sensitive to the influence of the amount of labeled data.
When the number of labeled samples is limited, the superiority of ISD is more apparent.
Conclusion
Main contribution: A method for learning instance-specific distance
for labeled as well as unlabeled instances.
Future work: The construction of the initial graph
Label propagation, metric propagation, … any more properties to propagate?
Thanks!