Nearest Neighbor Retrieval Using Distance-Based Hashing

Vassilis Athitsos (University of Texas, Arlington)
Michalis Potamias+ (Boston University)
Panagiotis Papapetrou (Boston University)
George Kollios (Boston University)

04/19/23 Distance Based Hashing 2

nearest neighbor problem

Setting: a database of objects S and a distance function D

Given: a query Q (previously unseen)

Find and return: the object P* from S that is closest to Q

NNs appear in various applications, under many different distance functions, e.g. classification of handwritten digits and hand-pose estimation

A linear scan can always be performed, but it is costly when S is large and D is expensive
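
To make the linear-scan baseline concrete, here is a minimal sketch. The function name `linear_scan` and the toy absolute-difference distance are illustrative assumptions, not part of the slides. It computes D once per database object, which is the cost the rest of the talk tries to reduce.

```python
def linear_scan(database, query, distance):
    """Return (nearest object, its distance, number of distance computations)."""
    best, best_dist, n_computations = None, float("inf"), 0
    for obj in database:
        d = distance(query, obj)  # one call to the (possibly expensive) D
        n_computations += 1
        if d < best_dist:
            best, best_dist = obj, d
    return best, best_dist, n_computations


# Toy usage: 1-D points with absolute difference as a stand-in for D.
nearest, dist, cost = linear_scan([1.0, 4.0, 9.0, 16.0], 5.0, lambda a, b: abs(a - b))
```

The number of distance computations always equals the size of S, regardless of the query.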


cost model

Dominating cost: the distance function may be very “expensive”
  Time series (DTW)
  String alignment (edit distance)
  Computer vision

Cost model: minimize the number of distance computations

[Figure: dynamic programming for edit distance]
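
As an illustration of why a single distance computation can be expensive, the following sketch implements the standard dynamic-programming edit (Levenshtein) distance with unit costs. The unit-cost choice and the function name are assumptions for illustration. Each call takes time proportional to the product of the two string lengths.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (unit costs)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


d = edit_distance("kitten", "sitting")
```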


some existing solutions

If objects are low-dimensional, exact nearest neighbors can be found fast

If objects are high-dimensional, then for some distance functions (e.g. Hamming) approximate nearest neighbors can be found fast using LSH

However, in many interesting settings (high-dimensional, non-metric) a “linear scan” may be the only approach for exact NNs


dbh setting

No assumptions about the distance function; it is probably non-metric

Distance function computations dominate the cost

Trade perfect accuracy for faster results


dbh method overview

Preprocess: hash the database using appropriate functions

Query Q arrives:
  Hash it!
  Filter: retrieve colliding objects as “candidate NNs”
  Refine: compute the actual distance between the query and the candidates
  Return: the candidate that is closest to Q

Background


Background: hash – based indexing

[Figure: hash-based indexing. Panels: "Building the index" (the database hashed by h1, h2, …, hL; use L tables in parallel) and "Query Time" (the query hashed by h1, distances D computed, min taken)]


Background: locality sensitive hashing

The choice of hash functions is important: LSH families of functions [IM98]

Using an LSH family (LSHF) in a hash-based indexing scheme guarantees sublinear behavior for approximate NNs

Such families have been constructed for Hamming, L2…

What if there is no LSH family for the distance function used (edit distance, DTW, etc.)?

[Figure: a point x with radii r and cr, and two other points y and z]
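
For intuition about what an LSH family looks like, the sketch below uses the classic bit-sampling family for Hamming distance, where each hash function returns one coordinate of the input. The helper names are illustrative assumptions. For this family the collision probability of two vectors is one minus their normalized Hamming distance, so closer vectors collide more often.

```python
def hamming(x, y):
    """Number of coordinates in which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def bit_sampling_family(n_bits):
    """One hash function per coordinate: h_i(x) = x[i]."""
    return [lambda x, i=i: x[i] for i in range(n_bits)]


def collision_probability(x, y, family):
    """Fraction of functions h in the family with h(x) == h(y)."""
    return sum(h(x) == h(y) for h in family) / len(family)


x = (0, 1, 1, 0, 1, 0, 0, 1)
y = (0, 1, 0, 0, 1, 0, 1, 1)
family = bit_sampling_family(len(x))
p = collision_probability(x, y, family)
```

No comparably simple family is available for a black-box distance such as DTW, which is the gap DBH targets.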

Distance Based Hashing

Hash-based indexing scheme
Can be applied to any space and any D
Its hash functions treat D as a black box
Optimization


DBH: family of hash functions

Pseudo-line projection [FL95]
  Maps an object onto the real line
  y, z are pivot points from the database
  Project x onto the y-z pseudo-line
  Use a threshold to make it discrete-valued

(-) This family is not an LSHF
(+) The definition does not depend on the specific distance function, only on the 3 pairwise distances

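[Figure: x projected onto the pseudo-line through the pivots y and z, which are D(y,z) apart; the projection is F^{y,z}(x)]

$F^{y,z}(x) = \frac{D(x,y)^2 + D(y,z)^2 - D(x,z)^2}{2\,D(y,z)}$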
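
The pseudo-line projection can be sketched in a few lines. The code below treats D as a black-box callable. The single-threshold binarization in `binary_hash` is an assumption, since the slides only say that a threshold makes the value discrete. Euclidean distance is used purely as a check, because there the projection has a familiar geometric meaning.

```python
import math


def pseudoline_projection(D, x, y, z):
    """Project x onto the pseudo-line through pivots y and z, using only D."""
    d_yz = D(y, z)  # assumes the two pivots are at nonzero distance
    return (D(x, y) ** 2 + d_yz ** 2 - D(x, z) ** 2) / (2.0 * d_yz)


def binary_hash(D, x, y, z, threshold):
    """Discretize the projection with a single threshold (assumed scheme)."""
    return int(pseudoline_projection(D, x, y, z) >= threshold)


euclid = math.dist  # available in Python 3.8+
y, z, x = (0.0, 0.0), (4.0, 0.0), (1.0, 3.0)
value = pseudoline_projection(euclid, x, y, z)  # x lies above the point 1 unit from y
bit = binary_hash(euclid, x, y, z, threshold=2.0)
```

Since D(y,z) depends only on the pivots, it could be cached, leaving two distance computations per projection for a new object.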


DBH: method

Preprocessing:
1. Use a random choice of K of these pseudo-line projections to define a hash function
2. Build L such (K-bit) functions
3. Hash all objects of S to the L hash tables

At query time:
1. Apply the same L functions to Q
2. Filter: retrieve colliding objects (candidate set)
3. Refine: invoke D for the candidates
4. Return: Nearest*
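
The preprocessing and query steps above can be sketched as a small class. This is a hedged illustration, not the authors' implementation: the class name `DBHIndex`, the median-based threshold, and the use of the full database to pick thresholds are all assumptions made to keep the example short.

```python
import random
from collections import defaultdict


def projection(D, x, y, z, d_yz):
    """Pseudo-line projection of x onto the y-z line, given the cached D(y, z)."""
    return (D(x, y) ** 2 + d_yz ** 2 - D(x, z) ** 2) / (2.0 * d_yz)


class DBHIndex:
    def __init__(self, objects, D, K, L, seed=0):
        rng = random.Random(seed)
        self.objects, self.D = list(objects), D
        self.functions = []  # L functions, each a list of K (y, z, d_yz, threshold)
        self.tables = []     # L dicts: K-bit key -> positions of database objects
        for _ in range(L):
            bits = []
            for _ in range(K):
                y, z = rng.sample(self.objects, 2)  # random pivot pair
                d_yz = D(y, z)  # assumes distinct pivots are at nonzero distance
                values = sorted(projection(D, x, y, z, d_yz) for x in self.objects)
                # Assumption: threshold at the median, so each bit splits S roughly in half.
                bits.append((y, z, d_yz, values[len(values) // 2]))
            table = defaultdict(list)
            for pos, x in enumerate(self.objects):
                table[self._key(bits, x)].append(pos)
            self.functions.append(bits)
            self.tables.append(table)

    def _key(self, bits, x):
        """K-bit hash key of x under one hash function."""
        return tuple(int(projection(self.D, x, y, z, d) >= t) for y, z, d, t in bits)

    def query(self, q):
        """Return (nearest candidate, its distance, number of candidates)."""
        candidates = set()
        for bits, table in zip(self.functions, self.tables):  # filter step
            candidates.update(table.get(self._key(bits, q), ()))
        if not candidates:
            return None, float("inf"), 0
        best = min(candidates, key=lambda pos: self.D(q, self.objects[pos]))  # refine step
        return self.objects[best], self.D(q, self.objects[best]), len(candidates)


points = [float(i * i) for i in range(50)]  # distinct 1-D toy objects
index = DBHIndex(points, lambda a, b: abs(a - b), K=4, L=3)
nearest, dist, n_candidates = index.query(points[7])
```

A query that is itself in the database always collides with its own copy, which is why the usage example is guaranteed to find distance zero. For a previously unseen query the returned object is only the nearest among the candidates.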


DBH: accuracy vs cost

Accuracy: percentage of queries for which DBH returns the true NN

Cost: number of distance computations

Problem: given a desired accuracy, minimize the cost

The choice of K, L affects both the cost and the accuracy

Sampling: approximate the distributions of
  the probability of NNs colliding
  the probability of non-NNs colliding

Perform binary search for the best (K, L)

[Figure: training phase diagram with labels: distance matrix, desired accuracy, training phase, K, L, DBH index structure]
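
One plausible reading of the parameter search is sketched below. It is an assumption, since the slides do not spell out the procedure: for each K, binary-search the smallest L that reaches the target accuracy, then keep the pair with the lowest estimated cost. The callables `accuracy` and `cost` are hypothetical stand-ins for the sampled estimates, and the search assumes accuracy does not decrease as L grows.

```python
def smallest_L(accuracy, K, target, max_L):
    """Binary search for the smallest L with accuracy(K, L) >= target, or None."""
    if accuracy(K, max_L) < target:
        return None
    lo, hi = 1, max_L
    while lo < hi:
        mid = (lo + hi) // 2
        if accuracy(K, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


def choose_parameters(accuracy, cost, target, max_K=20, max_L=1000):
    """Return the (K, L) meeting the target accuracy at the lowest estimated cost."""
    best = None
    for K in range(1, max_K + 1):
        L = smallest_L(accuracy, K, target, max_L)
        if L is not None and (best is None or cost(K, L) < cost(*best)):
            best = (K, L)
    return best


# Toy stand-ins for the sampled estimates (purely illustrative numbers).
toy_accuracy = lambda K, L: 1 - (1 - 0.9 ** K) ** L
toy_cost = lambda K, L: 2 * K * L + 1000 * (1 - (1 - 0.3 ** K) ** L)
K, L = choose_parameters(toy_accuracy, toy_cost, target=0.95)
```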


DBH: accuracy

Probability of collision between any query Q and its nearest neighbor N(Q) for a single projection function:

$C(Q, N(Q)) = \Pr_{h \in H_{DBH}}\big(h(Q) = h(N(Q))\big)$

Employ sampling to estimate C(Q,N(Q))

Probability of collision in at least one of the L K-bit tables:

$C_{K,L}(Q, N(Q)) = 1 - \big(1 - C(Q, N(Q))^K\big)^L$

Use K and L to shift the distribution to the desired accuracy, and compute

$Accuracy_{K,L} = \int_{Q \in X} C_{K,L}(Q, N(Q)) \Pr(Q)\, dQ$

[Figure: plot with horizontal axis ticks from 0.5 to 1 and vertical axis ticks from 0 to 800]
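
The accuracy estimate can be sketched numerically. In the code below, the integral over queries is replaced by an average over a sample, and the single-function collision probability is estimated as the fraction of sampled hash functions on which Q and N(Q) agree. The function names and the toy inputs are assumptions for illustration.

```python
def estimate_c(sample_of_functions, q, nn):
    """Estimate C(Q, N(Q)) as the fraction of sampled functions with h(q) == h(nn)."""
    return sum(h(q) == h(nn) for h in sample_of_functions) / len(sample_of_functions)


def collision_probability_kl(c, K, L):
    """Probability of colliding in at least one of L tables with K-bit keys."""
    return 1.0 - (1.0 - c ** K) ** L


def estimate_accuracy(sampled_c, K, L):
    """Average C_{K,L}(Q, N(Q)) over a sample of queries."""
    return sum(collision_probability_kl(c, K, L) for c in sampled_c) / len(sampled_c)


# Toy usage: two threshold functions on numbers, and two hypothetical sampled values.
c = estimate_c([lambda v: v > 0, lambda v: v > 5], q=3, nn=7)
acc = estimate_accuracy([0.5, 1.0], K=2, L=1)
```

Raising K lowers the per-table collision probability, while raising L raises the chance of at least one collision, which is how the two parameters shift the distribution.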


DBH: cost

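Hash and lookup

HashCost: number of distance computations needed to evaluate the hash functions

$HashCost_{K,L}(Q) = 2KL$

LookupCost: number of objects that collide in at least one of the L hash tables

$LookupCost_{K,L}(Q) = \sum_{x \in U} C_{K,L}(Q, x)$

Query cost:

$Cost_{K,L}(Q) = LookupCost_{K,L}(Q) + HashCost_{K,L}(Q)$

Total cost (for all queries):

$Cost_{K,L} = \int_{Q \in X} Cost_{K,L}(Q) \Pr(Q)\, dQ$

[Figure: pseudo-line projection of x onto the y-z line, labeled with D(x,y), D(x,z), D(y,z) and F^{y,z}(x); distance computations D followed by min]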
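
The cost model can be sketched the same way. Here the expected lookup cost of a query is the sum, over database objects, of the probability of colliding in at least one table, and the total cost is averaged over a sample of queries instead of integrated. All names and the toy probabilities are illustrative assumptions.

```python
def collision_probability_kl(c, K, L):
    """Probability of colliding in at least one of L tables with K-bit keys."""
    return 1.0 - (1.0 - c ** K) ** L


def hash_cost(K, L):
    """Two distance computations per projection, K projections per table, L tables."""
    return 2 * K * L


def lookup_cost(c_query_to_db, K, L):
    """Expected number of database objects colliding with the query."""
    return sum(collision_probability_kl(c, K, L) for c in c_query_to_db)


def query_cost(c_query_to_db, K, L):
    return lookup_cost(c_query_to_db, K, L) + hash_cost(K, L)


def total_cost(per_query_c, K, L):
    """Average query cost over a sample of queries."""
    return sum(query_cost(cs, K, L) for cs in per_query_c) / len(per_query_c)


# Toy usage: single-function collision probabilities of one query with 3 objects.
cost_one = query_cost([1.0, 0.5, 0.0], K=1, L=1)
cost_all = total_cost([[1.0, 0.5, 0.0], [0.0, 0.0, 0.0]], K=1, L=1)
```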


DBH: further optimization

1. Hierarchical DBH
   Build M parallel DBH indices for different subsets of queries
   Partition according to the distribution of D(Q,N(Q))
   Queries that are close to their NN are “easier”

2. Reduce HashCost by restricting H_DBH (the family of hash functions) to a small subset of database pivot points for the projections

Experiments


experiments: datasets

We test DBH on 3 datasets:

Unipen (time series, ~30 – digits): Dynamic Time Warping, 10K (test: 5K)

MNIST (images, 28x28 – digits): Shape Context Matching, 60K (test: 10K)

Hands (images, 256x256 – hand-pose): Chamfer Distance, 80K (test: 1K)


experiments: results

Training set: used to optimize K, L

Test set: used for the experiment

Compared to a modified VP-tree that handles non-metric data

Accuracy vs. cost plot
  X-axis: accuracy
  Y-axis: distance computations


experiments: results


conclusion

Distance-Based Hashing is a hash-based indexing framework for NN retrieval
  Not sublinear, just a speedup
  General purpose: no properties assumed for the distance function (black box)
  May be further optimized for bigger speedups

Future: can we build a scheme for a “black box” distance function and provide a statistical argument for sublinear behavior with respect to the size of the database?


thank you!

Famous NNs : Castor (Κάστωρ) and Polydeuces (Πολυδεύκης)

