Nearest Neighbor Retrieval Using Distance-Based Hashing
Vassilis Athitsos (University of Texas, Arlington)
Michalis Potamias+ (Boston University)
Panagiotis Papapetrou (Boston University)
George Kollios (Boston University)
04/19/23 Distance Based Hashing 2
nearest neighbor problem
Setting: a database of objects S and a distance function D
Given: a query Q (previously unseen)
Find and return: the object P* in S that is closest to Q
NNs appear in various applications under many different distance functions, e.g. classification of handwritten digits and hand-pose estimation
A linear scan always works, but its cost is high when S is large and D is expensive
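As a baseline for the rest of the talk, a linear scan can be sketched as follows. This is a minimal illustration, not code from the authors: the function name `linear_scan_nn` and the toy data are assumptions, and the distance is passed in as a black-box callable so that the number of distance computations can be counted.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def linear_scan_nn(
    query: T, database: Sequence[T], dist: Callable[[T, T], float]
) -> Tuple[T, float, int]:
    """Return (nearest object, its distance, number of distance computations)."""
    best_obj, best_d = None, float("inf")
    n_dist = 0
    for obj in database:
        d = dist(query, obj)  # one call to the (possibly expensive) distance D
        n_dist += 1
        if d < best_d:
            best_obj, best_d = obj, d
    return best_obj, best_d, n_dist


# Toy example: numbers on the real line with absolute difference as D.
S = [1.0, 4.0, 9.0, 16.0]
nearest, d, cost = linear_scan_nn(5.0, S, lambda a, b: abs(a - b))
```

The scan always evaluates D once per database object, which is exactly the cost that the indexing scheme tries to reduce.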
cost model
Dominating cost: the distance function may be very “expensive”
  Time series (DTW)
  String alignment (edit distance)
  Computer vision
Cost model: minimize the number of distance computations
[Figure: dynamic programming for edit distance]
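To make the "expensive distance" point concrete, here is a minimal sketch of the dynamic program for edit distance. It assumes unit costs for insertion, deletion and substitution (the Levenshtein variant); the slides do not say which cost scheme is meant. A single call takes time proportional to the product of the two string lengths, which is why the number of calls is the quantity to minimize.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            substitution = 0 if ca == cb else 1
            curr.append(
                min(
                    prev[j] + 1,  # delete ca
                    curr[j - 1] + 1,  # insert cb
                    prev[j - 1] + substitution,  # match or substitute
                )
            )
        prev = curr
    return prev[-1]
```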
some existing solutions
If objects are low dimensional, exact nearest neighbor retrieval is fast
If objects are high dimensional, then for some distance functions (e.g. Hamming) approximate nearest neighbor retrieval is fast, using LSH
However, in many interesting settings (high dimensional, non-metric) a “linear scan” may be the only approach for exact NNs
dbh setting
No assumptions about the distance function; it is probably non-metric
Distance function computations dominate the cost
Trade perfect accuracy for faster results
dbh method overview
Preprocess: hash the database using appropriate functions
Query Q arrives: hash it!
Filter: retrieve colliding objects as “candidate NNs”
Refine: compute the actual distance between the query and the candidates
Return: the candidate that is closest to Q
Background
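Background: hash-based indexing
Building the index: the database is hashed with h1, h2, …, hL; use L tables in parallel
Query time: the query is hashed with the same functions, D is computed against the colliding objects, and the minimum is returned
[Figure: database and query hashed by h1, h2, …, hL into L tables]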
Background: locality sensitive hashing
Choice of hash functions is important!
LSH family of functions [IM98]
An LSHF in a hash-based indexing scheme guarantees sublinear behavior for approximate NNs!
Such families have been constructed for Hamming, L2…
What if there is no LSH family for the distance function used (edit distance, DTW, etc.)?
[Figure: points x, y, z with radii r and cr around x]
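For readers who have not seen an LSH family, the classic construction for Hamming distance is bit sampling: a hash function reads K randomly chosen coordinates of a binary vector. The sketch below is only an illustration of that idea; the helper name `make_bit_sampling_hash` and the toy vectors are assumptions, not part of the talk. Vectors that are close in Hamming distance agree on a sampled coordinate with high probability, which is the property that has no obvious analogue for distances such as DTW.

```python
import random


def make_bit_sampling_hash(dim: int, k: int, rng: random.Random):
    """Return a K-bit hash function that reads k random coordinates."""
    idx = [rng.randrange(dim) for _ in range(k)]

    def h(v):
        return tuple(v[i] for i in idx)

    return h


rng = random.Random(0)
h = make_bit_sampling_hash(dim=8, k=3, rng=rng)

a = (0, 1, 1, 0, 1, 0, 0, 1)
same_as_a = (0, 1, 1, 0, 1, 0, 0, 1)
zeros = (0,) * 8
ones = (1,) * 8  # differs from zeros in every coordinate
```

Identical vectors always collide, while two vectors that differ in every coordinate never do; everything in between collides with a probability that decreases with the Hamming distance.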
Distance Based Hashing
Hash-based indexing scheme
Can be applied to any space and any D
Its hash functions treat D as a black box
Optimization
DBH: family of hash functions
Pseudo-line projection [FL95] maps an object onto the real line
  y, z are pivot points from the database
  Project x onto the y-z pseudo-line
  Use a threshold to make it discrete valued
$$F^{y,z}(x) = \frac{D(x,y)^2 + D(y,z)^2 - D(x,z)^2}{2\,D(y,z)}$$
(-) This family is not an LSHF
(+) The definition does not depend on the specific distance function, only on the 3 pairwise distances
[Figure: x projected onto the line through the pivots y and z, which are D(y,z) apart, giving F^{y,z}(x)]
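The pseudo-line projection only needs the three pairwise distances D(x,y), D(y,z) and D(x,z), so it can be written against a black-box distance. The sketch below is a minimal illustration under assumptions: the function names are hypothetical, and the discretization uses a single threshold `t` with "greater or equal" mapping to bit 1, which is one possible reading of "use a threshold".

```python
def line_projection(x, y, z, dist):
    """Project x onto the pseudo-line defined by pivots y and z."""
    d_yz = dist(y, z)
    if d_yz == 0:
        raise ValueError("pivots y and z must be at non-zero distance")
    return (dist(x, y) ** 2 + d_yz ** 2 - dist(x, z) ** 2) / (2.0 * d_yz)


def binary_hash(x, y, z, t, dist):
    """One hash bit: 1 if the projection reaches threshold t (assumed rule)."""
    return 1 if line_projection(x, y, z, dist) >= t else 0


# Sanity check on the real line, where the projection recovers the coordinate.
abs_dist = lambda a, b: abs(a - b)
p = line_projection(3.0, 0.0, 10.0, abs_dist)
```

In a Euclidean space this is the ordinary projection onto the line through y and z; in a non-metric space it is still well defined, but nothing guarantees that nearby objects receive nearby values.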
DBH: method
Preprocessing:
1. Use a random choice of K of these pseudo-line projections to define a hash function
2. Build L such (K-bit) functions
3. Hash all objects of S to the L hash tables
At query time:
1. Apply the same L functions to Q
2. Filter: retrieve colliding objects (candidate set)
3. Refine: invoke D for the candidates
4. Return: nearest*
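The preprocessing and query steps can be sketched end to end as below. This is an illustrative sketch, not the authors' implementation: the class name `DBHIndex`, the choice of the median projection as threshold, and the toy data are all assumptions. The construction here also evaluates D freely while building the index, since the cost model only counts query-time distance computations.

```python
import random
from collections import defaultdict


def line_projection(x, y, z, dist):
    d_yz = dist(y, z)
    return (dist(x, y) ** 2 + d_yz ** 2 - dist(x, z) ** 2) / (2.0 * d_yz)


class DBHIndex:
    def __init__(self, database, dist, K, L, seed=0):
        rng = random.Random(seed)
        self.database = list(database)
        self.dist = dist
        self.functions = []  # L hash functions, each a list of K (y, z, t)
        self.tables = []  # L hash tables: key -> list of database indices
        for _ in range(L):
            bits = []
            while len(bits) < K:
                y, z = rng.sample(self.database, 2)
                if dist(y, z) == 0:
                    continue  # degenerate pivot pair, draw again
                proj = sorted(line_projection(x, y, z, dist) for x in self.database)
                t = proj[len(proj) // 2]  # assumed threshold: median projection
                bits.append((y, z, t))
            table = defaultdict(list)
            for i, x in enumerate(self.database):
                table[self._key(x, bits)].append(i)
            self.functions.append(bits)
            self.tables.append(table)

    def _key(self, x, bits):
        return tuple(
            1 if line_projection(x, y, z, self.dist) >= t else 0 for y, z, t in bits
        )

    def query(self, q):
        """Filter by collisions, refine with D; return (object, #candidates)."""
        candidates = set()
        for bits, table in zip(self.functions, self.tables):
            candidates.update(table.get(self._key(q, bits), ()))
        if not candidates:
            return None, 0
        best = min(candidates, key=lambda i: self.dist(q, self.database[i]))
        return self.database[best], len(candidates)


S = [float(i) for i in range(100)]
index = DBHIndex(S, lambda a, b: abs(a - b), K=4, L=3)
nearest, n_candidates = index.query(42.0)
```

The returned object is the nearest among the candidates only, so it may differ from the true nearest neighbor when the two never collide; that is the accuracy that K and L trade against cost.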
DBH: accuracy vs cost
Accuracy: percentage of queries for which DBH returns the true NN
Cost: number of distance computations
Problem: given a desired accuracy, minimize the cost
The choice of K, L affects both the cost and the accuracy
Sampling: approximate the distributions of
  the probability of NNs colliding
  the probability of non-NNs colliding
Perform binary search for the best (K, L)
[Figure: a distance matrix and the desired accuracy are the inputs of the training phase, which outputs K, L and the DBH index structure]
DBH: accuracy
Probability of collision between any query Q and its nearest neighbor N(Q) for a single projection function:
$$C(Q, N(Q)) = \Pr_{h \in H_{DBH}}\big(h(Q) = h(N(Q))\big)$$
Probability of collision in at least one of the L K-bit tables:
$$C_{K,L}(Q, N(Q)) = 1 - \big(1 - C(Q, N(Q))^K\big)^L$$
Employ sampling to estimate C(Q,N(Q)) and compute
$$\mathrm{Accuracy}_{K,L} = \int_{Q \in X} C_{K,L}(Q, N(Q)) \Pr(Q)\, dQ$$
Use K and L to shift the distribution to the desired accuracy
[Figure: histogram; x-axis from 0.5 to 1, y-axis from 0 to 800]
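A small numerical sketch may help with the accuracy estimate. It assumes that the single-function collision probabilities C(Q,N(Q)) have already been estimated for a sample of queries, and it replaces the integral over queries by an average over that sample; both function names are hypothetical. The combination rule is the one stated on the slide: a K-bit function collides with probability C^K, and at least one of L such tables collides with probability 1 - (1 - C^K)^L.

```python
def collision_prob_KL(c: float, K: int, L: int) -> float:
    """Probability of colliding in at least one of L K-bit tables."""
    return 1.0 - (1.0 - c ** K) ** L


def estimate_accuracy(sampled_c, K: int, L: int) -> float:
    """Average C_{K,L} over sampled single-function collision rates."""
    return sum(collision_prob_KL(c, K, L) for c in sampled_c) / len(sampled_c)


# Increasing K lowers the collision probability, increasing L raises it.
p_k2 = collision_prob_KL(0.5, K=2, L=1)
p_l2 = collision_prob_KL(0.5, K=1, L=2)
```

Larger K makes each table more selective, and larger L compensates by giving the true nearest neighbor more chances to collide.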
DBH: cost
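Hash and lookup
HashCost: number of distance computations needed to evaluate the hash functions (two per projection, D(Q,y) and D(Q,z))
$$\mathrm{HashCost}_{K,L}(Q) = 2KL$$
LookupCost: number of objects that collide in at least one of the L hash tables
$$\mathrm{LookupCost}_{K,L}(Q) = \sum_{x \in U} C_{K,L}(Q, x)$$
Query cost:
$$\mathrm{Cost}_{K,L}(Q) = \mathrm{LookupCost}_{K,L}(Q) + \mathrm{HashCost}_{K,L}(Q)$$
Total cost (for all queries):
$$\mathrm{Cost}_{K,L} = \int_{Q \in X} \mathrm{Cost}_{K,L}(Q) \Pr(Q)\, dQ$$
[Figure: projection of x onto the y-z pseudo-line, using D(x,y), D(x,z) and D(y,z); distances D to the candidates and their minimum]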
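The cost model and the search over parameters can be sketched together. The code below is an illustration under stated assumptions, not the authors' procedure: the set U is taken to be the database, the per-object collision rates are assumed to come from sampling, and the search strategy (for a fixed K, binary search for the smallest L that reaches the target accuracy) is one plausible reading of "binary search for the best (K, L)". It relies on the estimated accuracy being non-decreasing in L. All function names are hypothetical.

```python
def collision_prob_KL(c: float, K: int, L: int) -> float:
    """Probability of colliding in at least one of L K-bit tables."""
    return 1.0 - (1.0 - c ** K) ** L


def query_cost(c_to_database, K: int, L: int) -> float:
    """Expected lookup cost plus the 2*K*L distances needed for hashing."""
    lookup = sum(collision_prob_KL(c, K, L) for c in c_to_database)
    return lookup + 2 * K * L


def smallest_L(nn_probs, K: int, target: float, max_L: int = 1024):
    """Smallest L whose estimated accuracy reaches target, or None."""

    def accuracy(L: int) -> float:
        return sum(collision_prob_KL(c, K, L) for c in nn_probs) / len(nn_probs)

    if accuracy(max_L) < target:
        return None
    lo, hi = 1, max_L
    while lo < hi:  # invariant: accuracy(hi) >= target
        mid = (lo + hi) // 2
        if accuracy(mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


best_L = smallest_L([0.5], K=1, target=0.9)
```

Repeating the search for each K in a range and keeping the (K, L) pair with the lowest estimated cost would complete the parameter selection.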
DBH: further optimization
1. Hierarchical DBH
  Build M parallel DBH indices for different subsets of queries
  Partition according to the distribution of D(Q,N(Q))
  Queries that are close to their NN are “easier”
2. Reduce HashCost by restricting $H_{DBH}$ to a small subset of database pivot points for the projections
Experiments
experiments: datasets
We test DBH on 3 datasets:
Unipen (time series, ~30 – digits): Dynamic Time Warping, 10K (test: 5K)
MNIST (images, 28x28 – digits): Shape Context Matching, 60K (test: 10K)
Hands (images, 256x256 – hand-pose): Chamfer Distance, 80K (test: 1K)
experiments: results
Training set: used to optimize K, L
Test set: used for the experiment
Comparison against a modified VP-tree that handles non-metric data
Accuracy vs. cost plots: x-axis is accuracy, y-axis is the number of distance computations
conclusion
Distance Based Hashing is a hash-based indexing framework for NN retrieval
Not sublinear, just a speedup
General purpose: no properties are assumed for the distance function (black box)
May be further optimized for bigger speedups
Future: can we build a scheme for a “black box” distance function and provide a statistical argument for sublinear behavior with respect to the size of the database?
thank you!
Famous NNs : Castor (Κάστωρ) and Polydeuces (Πολυδεύκης)