Learning Instance Specific Distance Using Metric Propagation
De-Chuan Zhan, Ming Li, Yu-Feng Li, Zhi-Hua Zhou
LAMDA Group, National Key Lab for Novel Software Technology
Nanjing University, China
{zhandc, lim, liyf, zhouzh}@lamda.nju.edu.cn
http://lamda.nju.edu.cn/zhandc
Distance-based classification
K-nearest neighbor classification
SVM with Gaussian kernels
Is the distance reliable?
Distance metric learners are introduced…
Are there any more natural measurements?
Any more natural measurements?
When sky is compared to other pictures…
Color, probably texture features
When Phelps II is compared to other athletes…
Speed of swimming, shape of feet…
… our work:
Can we assign a specific distance measurement for each instance, both labeled and unlabeled?
Outline
Introduction
Our Methods
Experiments
Conclusion
Introduction
Distance Metric Learning
Many machine learning algorithms rely on a distance metric over the input data patterns.
• Classification
• Clustering
• Retrieval
There are many metric learning algorithms developed [Yang, 2006]
Problem: they focus on learning a uniform Mahalanobis distance for ALL instances
Introduction
Other distance functions
• Instead of applying a uniform distance metric to every example, it is more natural to measure distances according to the specific properties of the data
• Some researchers define distances from each sample's own perspective:
  • QSim [Zhou and Dai, ICDM'06] [Athitsos et al., TDS'07]
  • Local distance functions [Frome et al., NIPS'06, ICCV'07]
Introduction
Query sensitive similarity
Instance-specific or query-specific similarities have actually been studied in other fields before:
In content-based image retrieval, there has been a study that computes query-sensitive similarities: the similarities among different images are decided only after a query image is received [Zhou and Dai, ICDM'06]
The problem: the query-sensitive similarity is based on pure heuristics
Introduction
Local distance functions
• [Frome et al., NIPS'06]
$D_{ji} > D_{jk}$: the distance from the j-th instance to the i-th instance is larger than that from the j-th instance to the k-th
1. Cannot generalize directly
2. The local distances defined are not directly comparable
• [Frome et al., ICCV'07]
All constraints can be tied together, but more heuristics are required for testing
The problem:
Local distance functions for unlabeled data are not available.
Introduction
Our Work
Can we assign a specific distance measurement for each instance,
both labeled and unlabeled?
Yes, we learn Instance Specific Distance via Metric Propagation
Outline
Introduction
Our Methods
Experiments
Conclusion
Our Methods
Intuition
• Focus on learning instance-specific distances for both labeled and unlabeled data (a minimal sketch of such a distance follows below)
• For labeled data: pairs of examples from the same class should be closer to each other
• For unlabeled data: metric propagation on a relationship graph
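To make the intuition concrete, here is a minimal sketch of one possible parametric form of an instance-specific distance: each instance owns a non-negative per-feature weight vector w_i and measures a w_i-weighted squared distance to other points. The exact form used in the paper is not reproduced here; this is an illustrative assumption.

```python
import numpy as np

def instance_specific_distance(x_i, x_j, w_i):
    """Distance from instance x_i to x_j under x_i's OWN metric:
    a w_i-weighted squared distance, so each instance can emphasize
    the features that matter for it (color/texture for a sky image,
    swimming speed for an athlete)."""
    return float(w_i @ (x_i - x_j) ** 2)

# Two instances with different per-feature weights.
x_a, x_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
w_a = np.array([1.0, 0.1])   # instance a mostly cares about feature 0
w_b = np.array([0.5, 1.0])   # instance b weighs feature 1 more heavily
print(instance_specific_distance(x_a, x_b, w_a))  # 1.1, measured by a
print(instance_specific_distance(x_b, x_a, w_b))  # 1.5, measured by b
# Note: the two directions disagree -- instance-specific distances
# are generally asymmetric.
```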
Our Methods
The ISD Framework
• Instead of directly conducting metric propagation while learning the distances for labeled examples, we formulate the metric propagation within a regularized framework:
  • The loss function for labeled data is induced by the labels of the instances and provides the side information
  • A regularization term is responsible for the implicit metric propagation
  • ℓ(·) is a convex loss function, such as the hinge loss in classification or the least-squares loss in regression
  • A pair (i, j) contributes a loss term when the j-th instance belongs to a class other than the i-th, or the j-th instance is a neighbor of the i-th instance, i.e., all cannot-links and some of the must-links are considered
• Inspired by [Zhu, 2003], the regularization term can be defined as a graph-smoothness penalty, e.g., $\Omega(W) = \sum_{i,j} S_{ij} \|w_i - w_j\|^2 = 2\,\mathrm{tr}(W^\top L W)$ with graph Laplacian $L = D - S$
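A minimal numeric sketch of an objective of this shape follows. The hinge form, margin parameter rho, and trade-off lam are illustrative assumptions, not the paper's precise formulation; the regularizer is the [Zhu, 2003]-style smoothness term written via the graph Laplacian.

```python
import numpy as np

def isd_objective(X, W, pairs, S, lam=1.0, rho=1.0):
    """Sketch of the regularized ISD objective: label-induced loss
    plus a graph-smoothness term that implicitly propagates metrics.

    X     : (n, d) data matrix
    W     : (n, d) matrix whose i-th row is instance i's weight vector w_i
    pairs : list of (i, j, y) with y = +1 for a must-link (same class)
            and y = -1 for a cannot-link -- the side information
    S     : (n, n) symmetric non-negative graph weight matrix
    """
    # Hinge loss: must-link distances should fall below rho - 1,
    # cannot-link distances should exceed rho + 1.
    loss = 0.0
    for i, j, y in pairs:
        dist = W[i] @ (X[i] - X[j]) ** 2       # instance i's distance to x_j
        loss += max(0.0, 1.0 - y * (rho - dist))
    # Smoothness regularizer: sum_ij S_ij ||w_i - w_j||^2 = 2 tr(W^T L W),
    # with L = D - S -- instances that are close on the graph are pushed
    # toward similar metrics (the implicit metric propagation).
    L = np.diag(S.sum(axis=1)) - S
    reg = 2.0 * np.trace(W.T @ L @ W)
    return loss + lam * reg
```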
Our Methods
The ISD Framework – relationship to FSM
• If the pairwise side information is replaced with higher-order side information, such as triplets, and L is set to the identity matrix, then FSM [Frome et al., NIPS'06] is a special case of ISD
• Although only pairwise side information is investigated in our work, the ISD framework is a general framework (see the reduction below)
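Concretely, setting L = I in the tr(WᵀLW) form from the previous slide collapses the propagation term to independent per-instance norm regularizers:

$$\mathrm{tr}(W^\top L W)\Big|_{L=I} = \sum_i \|w_i\|^2,$$

so each w_i is regularized on its own with no coupling between instances, which is exactly the FSM-style setting.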
Our Methods
The ISD Framework – updating the graph
• Given structure: a predefined graph
• Initialize the graph weights on this graph
• Learn the ISD, then update the graph weights in the new ISD space
• Iterate to obtain the final ISD (see the sketch below)
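Read as pseudocode, the update scheme is a simple alternating loop. The sketch below is schematic: `learn_isd` stands in for the L1/L2 solvers on the following slides, and the Gaussian re-weighting of the graph is an assumed choice, not necessarily the paper's.

```python
import numpy as np

def update_graph_weights(X, W, edge_mask, sigma=1.0):
    """Re-compute the graph weights using the current instance-specific
    distances (Gaussian re-weighting is assumed here for illustration)."""
    n = X.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if edge_mask[i, j]:                       # predefined graph structure
                dist = W[i] @ (X[i] - X[j]) ** 2      # distance in the new ISD space
                S[i, j] = np.exp(-dist / sigma)
    return (S + S.T) / 2.0                            # keep the graph symmetric

def isd_with_graph_updates(X, edge_mask, learn_isd, rounds=3):
    """Alternate between learning the ISD and updating the graph weights."""
    n, d = X.shape
    W = np.ones((n, d))                               # start from the Euclidean metric
    S = update_graph_weights(X, W, edge_mask)         # initial graph weights
    for _ in range(rounds):
        W = learn_isd(X, S)                           # learn ISD on the current graph
        S = update_graph_weights(X, W, edge_mask)     # re-weight in the new ISD space
    return W                                          # final ISD
```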
Our Methods
ISD with L1-loss
Introducing slack variables
Solving for all the w's simultaneously is challenging: the computational cost is too expensive.
Since the problem is convex, we employ an alternating descent method: sequentially solve for one instance's w at a time while fixing the other w's, until convergence or the maximum number of iterations is reached (sketched below).
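A schematic of that alternating descent. The per-instance subproblem is solved here by projected subgradient steps, a stand-in for the exact primal/dual solve on the next slide; the hinge form, step size, and non-negativity of w are assumptions carried over from the earlier sketch.

```python
import numpy as np

def solve_one_w(i, X, W, S, pairs_i, lam=1.0, lr=0.01, steps=50, rho=1.0):
    """Inner step of the alternating descent: update w_i while all other
    w's are fixed, via projected subgradient descent."""
    w = W[i].copy()
    for _ in range(steps):
        g = np.zeros_like(w)
        for j, y in pairs_i:                          # side information for instance i
            d_ij = (X[i] - X[j]) ** 2
            if 1.0 - y * (rho - w @ d_ij) > 0.0:      # hinge is active
                g += y * d_ij                         # its subgradient in w
        # Coupling to the neighbours' (fixed) metrics via the smoothness
        # term lam * sum_j S_ij ||w - w_j||^2.
        g += 2.0 * lam * (S[i].sum() * w - S[i] @ W)
        w = np.maximum(w - lr * g, 0.0)               # project onto w >= 0 (assumed)
    return w

def alternating_descent(X, S, pairs_by_instance, max_iters=20, tol=1e-4):
    """Sequentially re-solve one w at a time, fixing the others,
    until convergence or the maximum number of iterations is reached."""
    n, d = X.shape
    W = np.ones((n, d))
    for _ in range(max_iters):
        W_prev = W.copy()
        for i in range(n):
            W[i] = solve_one_w(i, X, W, S, pairs_by_instance.get(i, []))
        if np.abs(W - W_prev).max() < tol:
            break
    return W
```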
Our Methods
ISD with L1-loss (cont'd)
Primal:
Dual:
Our Methods
Acceleration: ISD with L2-loss
The alternating descent method is again used to solve the problem; however, the number of inequality constraints may be large.
For acceleration, we reduce the number of constraints by considering only some of the must-links.
Inspired by nu-SVM, we can probably obtain a more efficient method:
Our Methods
Acceleration: ISD with L2-loss (cont'd)
Dual:
• The dual has a linear equality constraint, which we drop during the optimization
• We project the solution back onto the feasible region after we get the optimization results (a sketch of this projection follows)
• Thus, the dual variables can be efficiently solved using Sequential Minimal Optimization
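The "project back" step has a closed form when the dropped constraint is a single linear equality aᵀα = b: subtract the violation along a. A small sketch; the specific a and b of the actual nu-SVM-style dual are not reproduced here.

```python
import numpy as np

def project_onto_equality(alpha, a, b):
    """Euclidean projection of alpha onto the hyperplane {x : a @ x = b}.
    After solving the dual while ignoring the linear equality constraint
    (e.g., with an SMO-style solver), the solution is mapped back to the
    nearest point satisfying the constraint."""
    return alpha - ((a @ alpha - b) / (a @ a)) * a

alpha = np.array([0.2, 0.5, 0.9])   # hypothetical dual solution
a, b = np.ones(3), 1.0              # e.g. a constraint sum(alpha) = 1
print(project_onto_equality(alpha, a, b))   # [0.0, 0.3, 0.7], sums to 1
```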
Outline
Introduction
Our Methods
Experiments
Conclusion
Experiments
Configurations
• Data sets:
  – 15 UCI data sets
  – COREL image dataset (20 classes, 100 images/class)
• 2/3 labeled training set; 1/3 unlabeled for testing; 30 runs
• Compared methods:
  – ISD-L1/L2
  – FSM/FSSM [Frome et al., 2006 & 2007]
  – LMNN [Weinberger et al., 2005]
  – DNE [Zhang et al., 2007]
• Parameters are selected via cross-validation
Experiments
Classification Performance
Comparison of test error rates (mean±std.) [table not shown]
Win/tie/loss counts of ISD vs. the other methods (t-test at the 95% significance level)
Experiments
Influence of the number of iteration rounds
[Figure: error rates vs. number of updating rounds, starting from the Euclidean distance]
The error rates of ISD-L1 are reduced on most datasets as the number of updates increases.
The error rates of ISD-L2 are reduced on some datasets; however, on others the performance degenerates.
Overfitting – the L2-loss is more sensitive to noise.
Experiments
Influence of the amount of labeled data
ISD is less sensitive to the influence of the amount of labeled data.
When the number of labeled samples is limited, the superiority of ISD is more apparent.
Conclusion
Main contribution: A method for learning instance-specific distance
for labeled as well as unlabeled instances.
Future work: The construction of the initial graph
Label propagation, metric propagation, … any more properties to propagate?
Thanks!