Probabilistic Hashing for Efficient Search and Learning
Ping Li
Department of Statistical Science
Faculty of Computing and Information Science
Cornell University
Ithaca, NY 14853
Practice of Statistical Analysis and Data Mining
Key Components :
Models (Methods) + Variables (Features) + Observations (Samples)
Often a Good Practice :
Simple Models (Methods) + Lots of Features + Lots of Data
or simply
Simple Methods + BigData
BigData Everywhere
Conceptually, consider a dataset as a matrix of size n × D.
In modern applications, # examples n = 10^6 is common and n = 10^9 is not rare, for example, images, documents, spams, search click-through data.
High-dimensional (image, text, biological) data are common: D = 10^6 (million), D = 10^9 (billion), D = 10^12 (trillion), D = 2^64, or even higher. In a sense, D can be arbitrarily high by considering pairwise, 3-way, or higher interactions.
Examples of BigData Challenges: Linear Learning

Binary classification: Dataset {(x_i, y_i)}_{i=1}^n, x_i ∈ R^D, y_i ∈ {−1, 1}.

One can fit an L2-regularized linear logistic regression:

min_w (1/2) w^T w + C Σ_{i=1}^n log(1 + exp(−y_i w^T x_i)),

or the L2-regularized linear SVM:

min_w (1/2) w^T w + C Σ_{i=1}^n max(1 − y_i w^T x_i, 0),
where C > 0 is the penalty (regularization) parameter.
The weight vector w has length D, the same as the data dimensionality.
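For concreteness, a minimal sketch of fitting these two models on sparse data is below. It assumes scikit-learn, whose LinearSVC and LogisticRegression wrap LIBLINEAR-style solvers (the slides use LIBLINEAR directly); the toy data and sizes are made up.

```python
# Minimal sketch (assumes scikit-learn); toy random data, far smaller than the n, D above.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

n, D = 1000, 100000
X = sparse_random(n, D, density=0.0005, format="csr", random_state=0)
y = np.where(np.random.default_rng(0).random(n) < 0.5, -1, 1)

svm = LinearSVC(C=1.0)                                 # L2-regularized linear SVM
svm.fit(X, y)
logit = LogisticRegression(C=1.0, solver="liblinear")  # L2-regularized logistic regression
logit.fit(X, y)
print(svm.coef_.shape)                                 # the weight vector w has length D
```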
Challenges of Learning with Massive High-dimensional Data
• The data often cannot fit in memory (even when the data are sparse).
• Data loading (or transmission over the network) takes too long.
• Training can be expensive, even for simple linear models.
• Testing may be too slow to meet the demand, which is especially crucial for applications in search, high-speed trading, or interactive visual data analytics.
• The model itself can be too large to store; for example, we cannot really store a vector of weights for logistic regression on data of 2^64 dimensions.
• Near neighbor search, for example, finding the most similar document in
billions of Web pages without scanning them all.
Dimensionality Reduction and Data Reduction

Dimensionality reduction: Reducing D, for example, from 2^64 to 10^5.
Data reduction: Reducing # nonzeros is often more important. With modern
linear learning algorithms, the cost (for storage, transmission, computation) is
mainly determined by # nonzeros, not much by the dimensionality.
Is PCA enough? PCA is infeasible for big data, at least not in real time. Updating PCA is non-trivial. PCA does not provide a good indexing scheme. In extremely high-dimensional data, PCA may not be very useful even if it can be computed.
High-Dimensional Data Generated by Feature Expansions

• Certain datasets (e.g., genomics) are naturally high-dimensional.
• For some applications, one can choose either
– Low-dimensional representations + complex & slow methods
or
– High-dimensional representations + simple & fast methods
• Two popular feature expansion methods
– Global expansion: e.g., pairwise (or higher-order) interactions
– Local expansion: e.g., word/character n-grams using contiguous words/characters (see the sketch below)
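As a quick illustration of local expansion, here is a small sketch of word n-grams, using the example sentence from the shingling slide later in the talk:

```python
# Sketch of local expansion: contiguous word n-grams from a tokenized sentence.
def word_ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "today is a nice day".split()
for n in (1, 2, 3):
    print(n, word_ngrams(tokens, n))
# 1 ['today', 'is', 'a', 'nice', 'day']
# 2 ['today is', 'is a', 'a nice', 'nice day']
# 3 ['today is a', 'is a nice', 'a nice day']
```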
Webspam: Text Data and Local Expansion

Method          Dim.        Time     Accuracy
1-gram          254         20 sec   93.30%
Binary 1-gram   254         20 sec   87.78%
3-gram          16,609,143  200 sec  99.6%
Binary 3-gram   16,609,143  200 sec  99.6%
Kernel          16,609,143  Days     99.6%
Experiments were based on linear SVM (LIBLINEAR) unless marked as “kernel”
MNIST: Digit Image Data and Global Expansion

Method           Dim.     Time     Accuracy
Original         768      80 sec   92.11%
Binary original  768      30 sec   91.25%
Pairwise         295,296  400 sec  98.33%
Binary pairwise  295,296  400 sec  97.88%
Kernel           768      Hours    98.6%
(It is actually not easy to achieve the well-known 98.6% accuracy. Must use very
precise kernel parameters)
Observations from Feature Expansion Experiments
• 3-grams + linear SVM on webspam produces excellent classification result.
• All pairwise features + linear SVM on MNIST also produces excellent
classification result, close to the well-known result from Gaussian kernel.
• For low-dimensional data, binary quantization of the features may lead to
serious degradation of classification performance.
• Binary quantization on high-dimensional data either does not affect the results at all (e.g., webspam) or does not hurt the performance much (e.g., MNIST).
Major issues with feature expansions:
• high-dimensionality
• high storage cost
• relatively high training/testing cost
The search industry has adopted the practice of
high-dimensional data + linear algorithms + probabilistic hashing
——————
Random projection can be viewed as a hashing method.
A Popular Solution Based on Normal Random Projections
Random Projections : Replace original data matrix A by B = A × R
A R = B
R ∈ R^{D×k}: a random matrix with i.i.d. entries sampled from N(0, 1).
B ∈ R^{n×k}: the projected matrix, also random.
B approximately preserves the Euclidean distances and inner products between any two rows of A. In particular, E(BB^T) = E(A R R^T A^T) = AA^T.
Therefore, we can simply feed B into (e.g.,) SVM or logistic regression solvers.
Very Sparse Random Projections

The projection matrix: R = {r_ij} ∈ R^{D×k}. Instead of sampling from normals, we sample from a sparse distribution parameterized by s ≥ 1:

r_ij = −1 with prob. 1/(2s),   0 with prob. 1 − 1/s,   +1 with prob. 1/(2s)
If s = 100, then on average, 99% of the entries are zero.
If s = 10000, then on average, 99.99% of the entries are zero.
——————-
Ref: Li, Hastie, Church, Very Sparse Random Projections, KDD’06.
Ref: Li, Very Sparse Stable Random Projections, KDD’07.
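A minimal sketch of the sampling scheme is below. The √(s/k) scaling used there is the usual choice so that BB^T approximates AA^T in expectation; it is an assumption on my part, not something stated on this slide.

```python
import numpy as np

def very_sparse_projection(D, k, s, rng):
    # entries: -1 w.p. 1/(2s), 0 w.p. 1 - 1/s, +1 w.p. 1/(2s)
    return rng.choice([-1.0, 0.0, 1.0], size=(D, k),
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

rng = np.random.default_rng(0)
D, k, s = 10000, 512, 100
A = (rng.random((3, D)) < 0.01).astype(float)    # toy sparse binary rows
R = very_sparse_projection(D, k, s, rng)
B = (A @ R) * np.sqrt(s / k)                     # scaling so E[B B^T] = A A^T
print(np.round(A @ A.T))                         # exact inner products
print(np.round(B @ B.T))                         # approximate inner products
```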
A Running Example Using (Small) Webspam Data

Dataset: 350K text samples, 16 million dimensions, about 4000 nonzeros on average, 24GB disk space.
[Figure: histogram of the number of nonzeros per example in the webspam dataset.]
Task: Binary classification for spam vs. non-spam.
——————–
The dataset was generated using 3-grams.
Very Sparse Projections + Linear SVM on Webspam Data
Red dashed curves: results based on the original data
[Figure: classification accuracy (%) vs. C on webspam, linear SVM, for k = 32 to 4096; left panel: s = 1, right panel: s = 1000.]

Observations:
• We need a large number of projections (e.g., k ≥ 4096) for high accuracy.
• The sparsity parameter s matters little unless k is small.
Very Sparse Projections + Linear SVM on Webspam Data

[Figure: classification accuracy (%) vs. C on webspam, linear SVM, for k = 32 to 4096; left panel: s = 1, right panel: s = 10000.]
As long as k is large (necessary for high accuracy), the projection matrix can be
extremely sparse, even with s = 10000.
Many more experimental results (classification, clustering, and regression) on very sparse random projections as well as b-bit minwise hashing can be found at the 2012 Stanford Modern Massive Data Sets (MMDS) Workshop:
http://www.stanford.edu/group/mmds/slides2012/s-pli.pdf
Disadvantages of Random Projections (and Variants)
Inaccurate, especially on binary data.
Variance Analysis for Inner Product Estimates
A R = B
First two rows in A: u_1, u_2 ∈ R^D (D is very large):

u_1 = (u_{1,1}, u_{1,2}, ..., u_{1,i}, ..., u_{1,D}),   u_2 = (u_{2,1}, u_{2,2}, ..., u_{2,i}, ..., u_{2,D})

First two rows in B: v_1, v_2 ∈ R^k (k is small):

v_1 = (v_{1,1}, v_{1,2}, ..., v_{1,j}, ..., v_{1,k}),   v_2 = (v_{2,1}, v_{2,2}, ..., v_{2,j}, ..., v_{2,k})

E(v_1^T v_2) = u_1^T u_2 = a
â = (1/k) Σ_{j=1}^k v_{1,j} v_{2,j}   (which is also an inner product)

E(â) = a

Var(â) = (1/k)(m_1 m_2 + a^2),   where m_1 = Σ_{i=1}^D |u_{1,i}|^2,  m_2 = Σ_{i=1}^D |u_{2,i}|^2
———
Random projections may not be good for inner products because the variance is
dominated by marginal l2 norms m1m2, especially when a ≈ 0.
For real-world datasets, most pairs are often close to orthogonal (a ≈ 0).
——————-
Ref: Li, Hastie, and Church, Very Sparse Random Projections, KDD’06.
Ref: Li, Shrivastava, Moore, Konig, Hashing Algorithms for Large-Scale Learning, NIPS’11
b-Bit Minwise Hashing
• Simple algorithm designed for massive, high-dimensional, binary data.
• Much more accurate than random projections for estimating inner products.
• Much smaller space requirement than the original minwise hashing algorithm.
• Estimating 3-way similarity, while random projections can not.
• Large-scale linear learning (and kernel learning of course).
• Sub-linear time near neighbor search. (Slides available in the appendix)
——————–
Major drawback : very expensive preprocessing. (Now solved!)
Massive, High-dimensional, Sparse, and Binary Data
Binary sparse data are very common in the real-world :
• For many applications (such as text), binary sparse data are very natural.
• Many datasets can be quantized/thresholded to be binary without hurting the
prediction accuracy.
• In some cases, even when the “original” data are not too sparse, they often become sparse when considering pairwise and higher-order interactions.
Binary Data Vectors and Sets

A binary (0/1) vector in D dimensions can be viewed as a set S ⊆ Ω = {0, 1, ..., D − 1}.

Example: S1, S2, S3 ⊆ Ω = {0, 1, ..., 15} (i.e., D = 16).
[Figure: S1, S2, S3 shown as rows of a 0/1 data matrix over columns 0–15.]

S1 = {1, 4, 5, 8},  S2 = {8, 10, 12, 14},  S3 = {3, 6, 7, 14}
Binary (0/1) Massive Text Data by Shingling
Shingling : Each document (Web page) can be viewed as a set of w-shingles.
For example, after parsing, a sentence “today is a nice day” becomes
• w = 1: “today”, “is”, “a”, “nice”, “day”
• w = 2: “today is”, “is a”, “a nice”, “nice day”
• w = 3: “today is a”, “is a nice”, “a nice day”
Previous studies used w ≥ 5, as single-word (unit-gram) model is not sufficient.
Shingling generates extremely high-dimensional vectors, e.g., D = (10^5)^w.
(10^5)^5 = 10^25 ≈ 2^83, although in current practice, it seems D = 2^64 suffices.
Notation
A binary (0/1) vector can be equivalently viewed as a set (locations of nonzeros).
Consider two sets S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1} (e.g., D = 2^64).

[Figure: Venn diagram of S1 and S2 with sizes f1, f2 and intersection a.]

f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|.

The resemblance R is a popular measure of set similarity:

R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a).   (Is it more rational than a / √(f1 f2)?)
Minwise Hashing: Standard Algorithm in the Context of Search
The standard practice in the search industry:
Suppose a random permutation π is performed on Ω, i.e.,

π : Ω −→ Ω, where Ω = {0, 1, ..., D − 1}.

An elementary probability argument shows that

Pr(min(π(S1)) = min(π(S2))) = |S1 ∩ S2| / |S1 ∪ S2| = R.
An Example

D = 5. S1 = {0, 3, 4}, S2 = {1, 2, 3}, R = |S1 ∩ S2| / |S1 ∪ S2| = 1/5.
One realization of the permutation π can be
0 =⇒ 3
1 =⇒ 2
2 =⇒ 0
3 =⇒ 4
4 =⇒ 1
π(S1) = {3, 4, 1} = {1, 3, 4},   π(S2) = {2, 0, 4} = {0, 2, 4}

In this example, min(π(S1)) ≠ min(π(S2)).
Minwise Hashing In 0/1 Data Matrix
[Figure: the original 0/1 data matrix for S1, S2, S3 (top) and the permuted data matrix for π(S1), π(S2), π(S3) (bottom), over columns 0–15.]
min(π(S1)) = 2, min(π(S2)) = 0, min(π(S3)) = 0
Minwise Hashing Estimator
After k permutations, π1, π2, ..., πk , one can estimate R without bias:
R_M = (1/k) Σ_{j=1}^k 1{min(π_j(S1)) = min(π_j(S2))},

Var(R_M) = (1/k) R(1 − R).
—————————-
A major problem : Need to use 64 bits to store each hashed value.
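A tiny simulation of the estimator above is sketched below, with explicit permutations of a small Ω for clarity (production systems approximate π with universal hashing over Ω = {0, ..., 2^64 − 1}); all sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1000
S1 = set(rng.choice(D, size=80, replace=False).tolist())
S2 = set(list(S1)[:40]) | set(rng.choice(D, size=40, replace=False).tolist())
R = len(S1 & S2) / len(S1 | S2)                  # true resemblance

k, collisions = 500, 0
for _ in range(k):
    perm = rng.permutation(D)                    # one random permutation pi
    if min(perm[i] for i in S1) == min(perm[i] for i in S2):
        collisions += 1

R_M = collisions / k                             # unbiased; Var = R(1 - R)/k
print(round(R, 3), round(R_M, 3))
```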
b-Bit Minwise Hashing: the Intuition
Basic idea: Only store the lowest b bits of each hashed value, for small b.

Intuition:

• When two sets are identical, the lowest b bits of their hashed values are of course also equal. b = 1 only stores whether a number is even or odd.

• When two sets are similar, the lowest b bits of their hashed values “should be” similar too (True?? This needs a proof).
• Thus, hopefully we do not need many bits to obtain useful information, as real
applications often care about pairs with high similarities (e.g., R > 0.5).
The Basic Collision Probability Result
Consider two sets, S1 and S2,
S1, S2 ⊆ Ω = {0, 1, 2, ..., D − 1},   f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|

Define the minimum values under π : Ω → Ω to be z1 and z2:

z1 = min(π(S1)),   z2 = min(π(S2)),

and their b-bit versions:

z1^(b) = the lowest b bits of z1,   z2^(b) = the lowest b bits of z2.

Example: if z1 = 7 (= 111 in binary), then z1^(1) = 1 and z1^(2) = 3.
Decimal Binary (8 bits) Lowest b = 2 bits Corresp. Decimal
0 00000000 00 0
1 00000001 01 1
2 00000010 10 2
3 00000011 11 3
4 00000100 00 0
5 00000101 01 1
6 00000110 10 2
7 00000111 11 3
8 00001000 00 0
9 00001001 01 1
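Keeping the lowest b bits is a simple bitmask; the short sketch below reproduces the b = 2 column of the table above.

```python
def lowest_b_bits(z, b):
    return z & ((1 << b) - 1)        # mask off all but the lowest b bits

print(lowest_b_bits(7, 1), lowest_b_bits(7, 2))   # 1, 3 (as in the example above)
for z in range(10):
    print(z, format(z, "08b"), format(lowest_b_bits(z, 2), "02b"))
```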
Collision probability : Assume D is large (which is virtually always true)
P_b = Pr(z1^(b) = z2^(b)) = C_{1,b} + (1 − C_{2,b}) R
Note that the exact probability can be written as a complicated double summation.
——————–
Recall, (assuming infinite precision, or as many digits as needed), we have
Pr (z1 = z2) = R
Collision probability: Assume D is large (which is virtually always true).

P_b = Pr(z1^(b) = z2^(b)) = C_{1,b} + (1 − C_{2,b}) R,

where r1 = f1/D, r2 = f2/D, f1 = |S1|, f2 = |S2|,

C_{1,b} = A_{1,b} · r2/(r1 + r2) + A_{2,b} · r1/(r1 + r2),

C_{2,b} = A_{1,b} · r1/(r1 + r2) + A_{2,b} · r2/(r1 + r2),

A_{1,b} = r1 (1 − r1)^{2^b − 1} / (1 − (1 − r1)^{2^b}),   A_{2,b} = r2 (1 − r2)^{2^b − 1} / (1 − (1 − r2)^{2^b}).
——————–
Ref: Li and Konig, b-Bit Minwise Hashing, WWW’10.
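A small sketch that evaluates these formulas is below; the last line inverts P_b = C_{1,b} + (1 − C_{2,b}) R to recover R, which is (a hedged reading of) how the b-bit estimator un-biases the observed collision rate. The sizes f1, f2, a, D are made-up toy values.

```python
def A_coef(r, b):
    return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))

def C_coefs(f1, f2, D, b):
    r1, r2 = f1 / D, f2 / D
    A1, A2 = A_coef(r1, b), A_coef(r2, b)
    C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
    C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
    return C1, C2

f1, f2, a, D, b = 200, 150, 100, 10 ** 6, 2       # toy sizes (assumed)
R = a / (f1 + f2 - a)
C1, C2 = C_coefs(f1, f2, D, b)
P_b = C1 + (1 - C2) * R                           # collision prob. of the lowest b bits
print(round(P_b, 4), round((P_b - C1) / (1 - C2), 4))   # the second value equals R
```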
The Exact Collision Probability for b = 1
P_b = Pr(z1^(1) = z2^(1)) = Pr(both even) + Pr(both odd)

    = R + Σ_{i=0,2,4,...} Σ_{j≠i, j=0,2,4,...} Pr(z1 = i, z2 = j) + Σ_{i=1,3,5,...} Σ_{j≠i, j=1,3,5,...} Pr(z1 = i, z2 = j),

where, for i < j,

Pr(z1 = i, z2 = j) = (f2/D) · ((f1 − a)/(D − 1)) · Π_{t=0}^{j−i−2} (D − f2 − i − 1 − t)/(D − 2 − t) · Π_{t=0}^{i−1} (D − f1 − f2 + a − t)/(D + i − j − 1 − t)
The closed-form collision probability is remarkably accurate even for small D.
The absolute errors (approximate - exact) are very small even for D = 20.
[Figure: absolute approximation error vs. a/f2 for b = 1; left panel: D = 20, f1 = 4 (f2 = 2, 4); right panel: D = 500, f1 = 50 (f2 = 2, 25, 50).]
—————–
Ref: Li and Konig, Theory and Applications of b-Bit Minwise Hashing, CACM Research Highlights, 2011.
The Variance-Space Trade-off
Smaller b =⇒ smaller space for each sample. However,
smaller b =⇒ larger variance.
To characterize the variance-space trade-off, we use the ratio

[64 (i.e., space) × variance] / [b (i.e., space) × variance].

In the least-favorable situation, the relative improvement is

(64 × variance) / (b × variance) = 64 R / (R + 1).

If R = 0.5, then the improvement will be 64/3 ≈ 21.3-fold.

64 bits + k permutations ⇔ 1 bit + 3k permutations
Experiments: Duplicate detection on Microsoft news articles

The dataset was crawled as part of the BLEWS project at Microsoft. We computed pairwise resemblances for all documents and retrieved document pairs with resemblance R larger than a threshold R0.
[Figure: precision vs. sample size k for thresholds R0 = 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, comparing b = 1, 2, 4 bits with the original minwise hashing (M).]
1/2 Bit Suffices for High Similarity
Certain applications concern very high similarities. One can XOR two bits from
two permutations, into just one bit.
1 XOR 1 = 0, 0 XOR 0 = 0, 1 XOR 0 = 1, 0 XOR 1 = 1
The new estimator is denoted by R1/2. Compared to the 1-bit estimator R1:
• A good idea for highly similar data:

  lim_{R→1} Var(R_1) / Var(R_{1/2}) = 2.

• Not so good an idea when data are not very similar:

  Var(R_1) < Var(R_{1/2}), if R < 0.5774 (assuming sparse data).
—————
Ref: Li and Konig, b-Bit Minwise Hashing, WWW’10.
b-Bit Minwise Hashing for Estimating 3-Way Similarities

Notation for 3-way set intersections: R_{123} = |S1 ∩ S2 ∩ S3| / |S1 ∪ S2 ∪ S3|

[Figure: Venn diagram of S1, S2, S3 with pairwise overlaps a12, a13, a23 and normalized quantities r1, r2, r3, s12, s13, s23, s.]

3-Way collision probability: Assume D is large.

Pr(lowest b bits of the 3 hashed values are all equal) = R_{123} + Z/u,

where u = r1 + r2 + r3 − s12 − s13 − s23 + s123, r_i = |S_i|/D, s_{ij} = |S_i ∩ S_j|/D, and ...
Z = (s12 − s) A_{3,b} + (r3 − s13 − s23 + s)/(r1 + r2 − s12) · s12 · G_{12,b}
  + (s13 − s) A_{2,b} + (r2 − s12 − s23 + s)/(r1 + r3 − s13) · s13 · G_{13,b}
  + (s23 − s) A_{1,b} + (r1 − s12 − s13 + s)/(r2 + r3 − s23) · s23 · G_{23,b}
  + [(r2 − s23) A_{3,b} + (r3 − s23) A_{2,b}] · (r1 − s12 − s13 + s)/(r2 + r3 − s23) · G_{23,b}
  + [(r1 − s13) A_{3,b} + (r3 − s13) A_{1,b}] · (r2 − s12 − s23 + s)/(r1 + r3 − s13) · G_{13,b}
  + [(r1 − s12) A_{2,b} + (r2 − s12) A_{1,b}] · (r3 − s13 − s23 + s)/(r1 + r2 − s12) · G_{12,b},

A_{j,b} = r_j (1 − r_j)^{2^b − 1} / (1 − (1 − r_j)^{2^b}),

G_{ij,b} = (r_i + r_j − s_{ij}) (1 − r_i − r_j + s_{ij})^{2^b − 1} / (1 − (1 − r_i − r_j + s_{ij})^{2^b}),   i, j ∈ {1, 2, 3}, i ≠ j.
—————
Ref: Li, Konig, Gui, b-Bit Minwise Hashing for Estimating 3-Way Similarities,
NIPS’10.
Useful messages :
1. Substantial improvement over using 64 bits, just like in the 2-way case.
2. Must use b ≥ 2 bits for estimating 3-way similarities.
——————
Next question : How to use b-bit minwise hashing for linear learning?
Using b-Bit Minwise Hashing for Learning
Random projection for linear learning is straightforward :
A R = B
because the projected data are naturally in an inner product space (linear kernel).
Natural questions about minwise hashing
• Is the n × n matrix of resemblances positive definite (PD)?
• Is the matrix of estimated resemblances PD?
We look for linear kernels to take advantage of modern linear learning algorithms.
Kernels from Minwise Hashing and b-Bit Minwise Hashing

Definition: A symmetric n × n matrix K satisfying Σ_{ij} c_i c_j K_{ij} ≥ 0 for all real vectors c is called positive definite (PD).

Result: Apply a permutation π on n sets S1, ..., Sn ⊆ Ω = {0, 1, ..., D − 1}. Define z_i = min π(S_i). The following three matrices are all PD.

1. The resemblance matrix R ∈ R^{n×n}: R_{ij} = |S_i ∩ S_j| / |S_i ∪ S_j| = |S_i ∩ S_j| / (|S_i| + |S_j| − |S_i ∩ S_j|)

2. The minwise hashing matrix M ∈ R^{n×n}: M_{ij} = 1{z_i = z_j}

3. The b-bit minwise hashing matrix M^(b) ∈ R^{n×n}: M^(b)_{ij} = Π_{t=1}^b 1{e_{i,t} = e_{j,t}}, where e_{i,t} is the t-th lowest bit of z_i.

Consequently, for k independent permutations, the sum Σ_{s=1}^k M^(b)(s) is also PD.
b-Bit Minwise Hashing for Large-Scale linear Learning
Linear learning algorithms require the estimators to be inner products.
The estimator of minwise hashing

R_M = (1/k) Σ_{j=1}^k 1{min(π_j(S1)) = min(π_j(S2))}

is indeed an inner product of two (extremely sparse) vectors in D × k dimensions.

As z1 = min(π(S1)), z2 = min(π(S2)) ∈ Ω = {0, 1, ..., D − 1}, we write

1{min(π_j(S1)) = min(π_j(S2))} = Σ_{i=0}^{D−1} 1{z1 = i} × 1{z2 = i}
———–
Ref: Li, Shrivastava, Moore, Konig, Hashing Algorithms for Large-Scale Learning, NIPS’11.
Consider D = 5. We can expand numbers into vectors of length 5.
0 =⇒ [0, 0, 0, 0, 1], 1 =⇒ [0, 0, 0, 1, 0], 2 =⇒ [0, 0, 1, 0, 0]
3 =⇒ [0, 1, 0, 0, 0], 4 =⇒ [1, 0, 0, 0, 0].
———————
If z1 = 2, z2 = 3, then 0 = 1{z1 = z2} = the inner product between [0, 0, 1, 0, 0] and [0, 1, 0, 0, 0].

If z1 = 2, z2 = 2, then 1 = 1{z1 = z2} = the inner product between [0, 0, 1, 0, 0] and [0, 0, 1, 0, 0].
Linear Learning Algorithms

Binary classification: Dataset {(x_i, y_i)}_{i=1}^n, x_i ∈ R^D, y_i ∈ {−1, 1}.

One can fit an L2-regularized linear SVM:

min_w (1/2) w^T w + C Σ_{i=1}^n max(1 − y_i w^T x_i, 0),

or the L2-regularized logistic regression:

min_w (1/2) w^T w + C Σ_{i=1}^n log(1 + exp(−y_i w^T x_i)),
where C > 0 is the penalty (regularization) parameter.
Integrating b-Bit Minwise Hashing for (Linear) Learning
Very simple :
1. Apply k independent random permutations on each (binary) feature vector x_i and store the lowest b bits of each hashed value. The storage cost is nbk bits.

2. At run-time, expand a hashed data point into a (2^b × k)-length vector, i.e., concatenate k vectors of length 2^b. The new feature vector has exactly k 1's.
An example with k = 3 and b = 2
For set (vector) S1:
Original hashed values (k = 3) : 12013 25964 20191
Original binary representations :
010111011101101 110010101101100 100111011011111
Lowest b = 2 binary digits : 01 00 11
Corresponding decimal values : 1 0 3
Expanded 2^b = 4 binary digits : 0010 0001 1000
New feature vector fed to a solver : [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0] × 1√3
Same procedures on sets S2, S3, ...
——————
Note that (i) solvers require normalization; (ii) the expansion 0000 is never used.
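A sketch of this run-time expansion, reproducing the example above, is below. The placement of the 1 inside each 2^b block follows the slide's layout, where the decimal value v maps to the bit string with a 1 in position v counted from the right (an assumption inferred from the example).

```python
import numpy as np

def expand(hashed_values, b):
    k = len(hashed_values)
    x = np.zeros(k * 2 ** b)
    for j, z in enumerate(hashed_values):
        v = z & ((1 << b) - 1)                   # lowest b bits, a value in [0, 2^b)
        x[j * 2 ** b + (2 ** b - 1 - v)] = 1     # one-hot inside the j-th block
    return x / np.sqrt(k)                        # normalization expected by the solver

print(expand([12013, 25964, 20191], b=2))
# [0 0 1 0  0 0 0 1  1 0 0 0] / sqrt(3), matching the example above
```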
Datasets and Solver for Linear Learning
This talk presents the experiments on two text datasets.
Dataset # Examples (n) # Dims (D) Avg # Nonzeros Train / Test
Webspam (24 GB) 350,000 16,609,143 3728 80% / 20%
Rcv1 (200 GB) 781,265 1,010,017,424 12062 50% / 50%
To generate the Rcv1 dataset, we used the original features + all pairwise
features + some 3-way features.
We chose LIBLINEAR as the basic solver for linear learning. Note that our method
is purely statistical/probabilistic, independent of the underlying procedures.
Experimental Results on Linear SVM and Webspam

• We conducted extensive experiments for a wide range of regularization C values (from 10^−3 to 10^2) with fine spacings in [0.1, 10].
• We experimented with k = 30 to k = 500, and b = 1, 2, 4, 8, and 16.
Testing Accuracy

[Figure: SVM test accuracy (%) vs. C on webspam, k = 200, for b = 1, 2, 4, 6, 8, 10, 16.]
• Solid: b-bit hashing. Dashed (red): the original data.
• Using b ≥ 8 and k ≥ 200 achieves about the same test accuracies as using
the original data.
• The results are averaged over 50 runs.
[Figure: SVM test accuracy (%) vs. C on webspam for k = 50, 100, 150, 200, 300, 500 and b = 1 to 16.]
Stability of Testing Accuracy (Standard Deviation)
Our method produces very stable results, especially b ≥ 4
[Figure: standard deviation of SVM test accuracy (%) vs. C on webspam for k = 50 to 500 and b = 1 to 16.]
Training Time
[Figure: SVM training time (sec) vs. C on webspam for k = 50 to 500 and b = 1 to 16.]
• The timings do not include data loading time (which is small for b-bit hashing).
• The original training time is about 100 seconds.
• b-bit minwise hashing needs about 3 ∼ 7 seconds (3 seconds when b = 8).
Testing Time

[Figure: SVM testing time (sec) vs. C on webspam for k = 50 to 500.]
However, here we assume the test data have already been processed.
Experimental Results on L2-Regularized Logistic Regression
Testing Accuracy
[Figure: logistic regression test accuracy (%) vs. C on webspam for k = 50, 100, 150, 200, 300, 500 and b = 1 to 16.]
Training Time
[Figure: logistic regression training time (sec) vs. C on webspam for k = 50 to 500.]
Comparisons with Other Algorithms
We conducted extensive experiments with the VW algorithm (Weinberger et al., ICML'09, not the VW online learning platform), which has the same variance as random projections.

Consider two sets S1, S2 with f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|, and R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a). (a = 0 ⇔ R = 0.) Then, from their variances

Var(â_VW) ≈ (1/k)(f1 f2 + a^2),    Var(R_MINWISE) = (1/k) R(1 − R),

we can immediately see the significant advantages of minwise hashing, especially when a ≈ 0 (which is common in practice).
Comparing b-Bit Minwise Hashing with VW

8-bit minwise hashing (dashed, red) with k ≥ 200 (and C ≥ 1) achieves about the same test accuracy as VW with k = 10^4 to 10^6.

[Figure: test accuracy (%) vs. k on webspam, VW vs. b = 8 hashing, for SVM (left) and logistic regression (right), C = 0.01 to 100.]
8-bit hashing is substantially faster than VW (to achieve the same accuracy).
[Figure: training time (sec) vs. k on webspam, VW vs. b = 8 hashing, for SVM (left) and logistic regression (right).]
Experimental Results on Rcv1 (200GB)
Test accuracy using linear SVM (we cannot train on the original data)
[Figure: SVM test accuracy (%) vs. C on rcv1 for k = 50, 100, 150, 200, 300, 500 and b = 1 to 16.]
Training time using linear SVM
[Figure: SVM training time (sec) vs. C on rcv1 for k = 50 to 500.]
Test accuracy using logistic regression
[Figure: logistic regression test accuracy (%) vs. C on rcv1 for k = 50 to 500 and b = 1 to 16.]
Training time using logistic regression
[Figure: logistic regression training time (sec) vs. C on rcv1 for k = 50 to 500.]
Comparisons with VW on Rcv1
Test accuracy using linear SVM
[Figure: SVM test accuracy (%) vs. k on rcv1, comparing VW with 1-, 2-, 4-, 8-, 12-, and 16-bit hashing, C = 0.01 to 10.]
Test accuracy using logistic regression
[Figure: logistic regression test accuracy (%) vs. k on rcv1, comparing VW with 1- to 16-bit hashing.]
Training time
[Figure: training time (sec) vs. k on rcv1, VW vs. 8-bit hashing, for SVM (left) and logistic regression (right).]
k-Permutation Hashing: Conclusions
• BigData everywhere. For many operations including clustering, classification,
near neighbor search, etc. exact answers are often not necessary.
• Very sparse random projections (KDD 2006) work as well as dense random
projections. They however require many projections (e.g., k = 10000)
• Minwise hashing is a standard procedure in the context of search. b-bit
minwise hashing is a substantial improvement by using only b bits per hashed
value instead of 64 bits (e.g., a 20-fold reduction in space).
• b-bit minwise hashing provides a simple strategy for efficient linear learning,
by expanding the bits into vectors of length 2b × k (originally 264 × k)
• We can build hash tables for achieving sub-linear time near neighbor search.
• It can be substantially more accurate than random projections (and variants).
• Major drawbacks of b-bit minwise hashing:
– It was designed (mostly) for binary data.
– It needs a fairly high-dimensional representation (2^b × k) after expansion.
– Expensive preprocessing and expensive testing for unprocessed data.
Parallelization solutions based on (e.g.,) GPUs offer good speed but they
are not energy-efficient.
Why Is Minwise Hashing Wasteful?
[Figure: the original 0/1 data matrix for S1, S2, S3 and the permuted data matrix for π(S1), π(S2), π(S3), over columns 0–15.]

Only the minimums are kept, and the entire process is repeated k (e.g., 500) times.
One Permutation Hashing
S1, S2, S3 ⊆ Ω = {0, 1, ..., 15} (i.e., D = 16). The figure presents the permuted sets as three binary (0/1) vectors:

π(S1) = {2, 4, 7, 13},   π(S2) = {0, 6, 13},   π(S3) = {0, 1, 10, 12}

[Figure: the permuted 0/1 data matrix over columns 0–15, divided into k = 4 bins of 4 columns each.]
One permutation hashing: divide the space Ω evenly into k = 4 bins and
select the smallest nonzero in each bin.
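A sketch of the binning step is below (None marks an empty bin); it reproduces the k = 4, D = 16 example above.

```python
def one_perm_hash(permuted_set, D, k):
    width = D // k                        # assume k divides D, for simplicity
    bins = [None] * k
    for v in permuted_set:
        j = v // width                    # which bin this nonzero location falls in
        if bins[j] is None or v < bins[j]:
            bins[j] = v                   # keep the smallest location in each bin
    return bins

print(one_perm_hash({2, 4, 7, 13}, D=16, k=4))   # [2, 4, None, 13]
print(one_perm_hash({0, 6, 13}, D=16, k=4))      # [0, 6, None, 13]
print(one_perm_hash({0, 1, 10, 12}, D=16, k=4))  # [0, None, 10, 12]
```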
Two major tasks:
• Theoretical analysis and estimators
• Practical strategies to deal with empty bins, should they occur
Ref: P. Li, A. Owen, C-H Zhang, One Permutation Hashing, NIPS’12
Two Definitions

Recall: the space is divided evenly into k bins.

# Jointly empty bins: N_emp = Σ_{j=1}^k I_emp,j

# Matched bins: N_mat = Σ_{j=1}^k I_mat,j
where I_emp,j and I_mat,j are defined for the j-th bin as

I_emp,j = 1 if both π(S1) and π(S2) are empty in the j-th bin; 0 otherwise.

I_mat,j = 1 if both π(S1) and π(S2) are non-empty and the smallest element of π(S1) matches the smallest element of π(S2), in the j-th bin; 0 otherwise.

———————————

N_emp = Σ_{j=1}^k I_emp,j,    N_mat = Σ_{j=1}^k I_mat,j
Results for the Number of Jointly Empty Bins N_emp

Notation: f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|, f = |S1 ∪ S2| = f1 + f2 − a.

Distribution:

Pr(N_emp = j) = Σ_{s=0}^{k−j} (−1)^s [k! / (j! s! (k − j − s)!)] Π_{t=0}^{f−1} [D(1 − (j + s)/k) − t] / (D − t)

Expectation:

E(N_emp)/k = Π_{j=0}^{f−1} [D(1 − 1/k) − j] / (D − j) ≤ (1 − 1/k)^f
Approximation of N_emp in Sparse Data

f = |S1 ∪ S2| = f1 + f2 − a.

E(N_emp)/k ≈ (1 − 1/k)^f ≈ e^{−f/k}.

f/k = 5 ⇒ E(N_emp)/k ≈ 0.0067 (negligible).

f/k = 1 ⇒ E(N_emp)/k ≈ 0.3679 (noticeable, but then why hashing?).
An Unbiased Estimator

Estimator: R_mat = N_mat / (k − N_emp)

Expectation: E(R_mat) = R

Variance: Var(R_mat) < R(1 − R)/k
One-permutation hashing has smaller variance than k-permutation hashing.
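Below is a direct transcription of the estimator on the binned minima (None = empty bin), applied to the two permuted sets from the earlier k = 4 example; with such a tiny k the estimate is of course noisy.

```python
def estimate_R(bins1, bins2):
    n_emp = sum(a is None and b is None for a, b in zip(bins1, bins2))
    n_mat = sum(a is not None and b is not None and a == b
                for a, b in zip(bins1, bins2))
    return n_mat / (len(bins1) - n_emp)          # R_mat = N_mat / (k - N_emp)

# binned minima of pi(S1) = {2,4,7,13} and pi(S2) = {0,6,13}, k = 4
print(estimate_R([2, 4, None, 13], [0, 6, None, 13]))   # 1/3 (true R = 1/6)
```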
Empirical Validation

Based on many pairs of vectors, empirical MSEs (MSE = bias² + variance) verify:
[Figure: empirical and theoretical MSE(R_mat) vs. k, compared with minwise hashing, for three word pairs: RIGHTS–RESERVED, OF–AND, THIS–HAVE.]
Summary of Advantages of One Permutation Hashing
[Figure: the permuted 0/1 data matrix for π(S1), π(S2), π(S3), divided into k = 4 bins, as in the one permutation hashing example.]
• Computationally much more efficient and energy-efficient.
• Parallelization-based solution requires additional hardware & implementation.
• Testing unprocessed data is much faster, crucial for user-facing applications.
• Implementation is easier, from the perspective of random number generation, e.g., storing a permutation vector of length D = 10^9 (4GB) is not a big deal.
• One permutation hashing is a much better matrix sparsification scheme.
One Permutation Hashing for Linear Learning
Almost the same as k-permutation hashing, but we must deal with empty bins *.
Two ideas:
1. Zero coding : Encode empty bins as (e.g.,) 00000000 in the expanded space.
2. Random coding : Encode empty bins as a random number. Did not work as
well.
Statistics of # nonzeros in Webspam Data

[Figure: histogram of the number of nonzeros per example in the webspam dataset.]
The dataset has 350,000 samples. The average number of nonzeros is about 4000, which should be much larger than k (e.g., 200 to 500).
About 10% (or 2.8%) of the samples have < 500 (or < 200) nonzeros. We
must deal with empty bins if we do not want to exclude those data points.
Review the original example with k = 3 permutations and b = 2
For set (vector) S1:
Original hashed values (k = 3) : 12013 25964 20191
Original binary representations :
010111011101101 110010101101100 100111011011111
Lowest b = 2 binary digits : 01 00 11
Corresponding decimal values : 1 0 3
Expanded 2^b = 4 binary digits : 0010 0001 1000
New feature vector fed to a solver : [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0] × 1√3
Same procedures on sets S2, S3, ...
0000 was never used!
Zero Coding Example
One permutation hashing with k = 4 and b = 2
For set (vector) S1:
Original hashed values (k = 4) : 12013 25964 20191 ∗
Original binary representations : 010111011101101 110010101101100 100111011011111 ∗
Lowest b = 2 binary digits : 01 00 11 ∗
Corresponding decimal values : 1 0 3 ∗
Expanded 2^b = 4 binary digits : 0010 0001 1000 0000
New feature vector : [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0] × 1/√(4 − 1)
Same procedures on sets S2, S3, ...
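A sketch of zero coding on top of the expansion used earlier: empty bins expand to an all-zero 2^b block, and the normalization uses √(k − N_emp) instead of √k.

```python
import numpy as np

def zero_code(bin_values, b):            # bin_values: lowest-b-bit value, or None if empty
    k = len(bin_values)
    x = np.zeros(k * 2 ** b)
    n_emp = sum(v is None for v in bin_values)
    for j, v in enumerate(bin_values):
        if v is not None:
            x[j * 2 ** b + (2 ** b - 1 - v)] = 1
    return x / np.sqrt(k - n_emp)        # empty bins are left as all zeros

# k = 4, b = 2: lowest-2-bit values 1, 0, 3 and one empty bin (*), as in the example above
print(zero_code([1, 0, 3, None], b=2))
```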
Experimental Results on Webspam Data

[Figure: test accuracy (%) vs. C on webspam for linear SVM with k = 256 (left) and k = 512 (right), comparing the original data, one permutation hashing, and k-permutation hashing, b = 1 to 8.]
When k = 512 (or even 256) and b = 8, b-bit one permutation hashing achieves
similar test accuracies as using the original data.
One permutation hashing (zero coding) is even slightly more accurate than
k-permutation hashing (at merely 1/k of the original cost).
One Permutation vs. k-Permutation

[Figure: test accuracy (%) vs. C on webspam for linear SVM (top panels) and logistic regression (bottom panels), k = 32, 64, 256, 512, comparing the original data, one permutation hashing, and k-permutation hashing.]
————————
The influence of empty bins is really not that obvious, as expected.
One Permutation Hashing: Conclusion
One permutation hashing is at least as accurate as the standard k-permutation
minwise hashing at merely 1/k of the original processing cost.
Testing the Robustness of Zero Coding on 20Newsgroup Data
For webspam data, the # empty bins may be too small to make the algorithm fail.
The 20Newsgroups (News20) dataset has merely 20,000 samples in about one million dimensions, with on average about 500 nonzeros per sample.

Therefore, News20 is merely a toy example to test the robustness of our algorithm. In fact, we let k be as large as 4096, i.e., most of the bins are empty.
20Newsgroups: One Permutation vs. k-Permutation (SVM)

[Figure: SVM test accuracy (%) vs. C on News20 for k = 8, 16, 32, 64, 128, 256, comparing the original data, one permutation hashing, and k-permutation hashing, b = 1 to 8.]
[Figure: SVM test accuracy (%) vs. C on News20 for k = 512, 1024, 2048, 4096.]
One permutation with zero coding can reach the original accuracy 98%.
The original k-permutation can only reach 97.5% even with k = 4096.
20Newsgroups: One Permutation vs. k-Permutation (Logit)

[Figure: logistic regression test accuracy (%) vs. C on News20 for k = 32 to 4096, comparing the original data, one permutation hashing, and k-permutation hashing.]
b-Bit Minwise Hashing for Efficient Near Neighbor Search
Near neighbor search is a much more frequent operation than training an SVM.
The bits from b-bit minwise hashing can be directly used to build hash tables to
enable sub-linear time near neighbor search.
This is an instance of the general family of locality-sensitive hashing (LSH).
An Example of b-Bit Hashing for Near Neighbor Search

We use b = 2 bits and k = 2 permutations to build a hash table indexed from 0000 to 1111, i.e., the table size is 2^(2×2) = 16.

[Figure: two hash tables indexed by the 16 possible 4-bit codes, each bucket listing the IDs of the data points hashed to it.]
Then, the data points are placed in the buckets according to their hashed values.
Look for near neighbors in the bucket which matches the hash value of the query.
Replicate the hash table (twice in this case) so that good near neighbors missed by one table can still be retrieved from another.
Final retrieved data points are the union of the buckets.
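A minimal sketch of the table lookup is below (the point IDs and b-bit values are made up for illustration): concatenate the b-bit values from k permutations into a B = b·k bit bucket index, build L independent tables, and return the union of the query's buckets.

```python
from collections import defaultdict

def bucket_index(bbit_values, b):
    idx = 0
    for v in bbit_values:                 # each v is a b-bit integer from one permutation
        idx = (idx << b) | v
    return idx

def build_table(point_hashes, b):         # point_hashes: {point_id: [b-bit values]}
    table = defaultdict(set)
    for pid, h in point_hashes.items():
        table[bucket_index(h, b)].add(pid)
    return table

b = 2                                      # b = 2 bits, k = 2 permutations, L = 2 tables
tables = [build_table({8: [0, 2], 13: [0, 2], 5: [0, 3]}, b),
          build_table({8: [3, 1], 13: [2, 0], 5: [3, 1]}, b)]
query = [[0, 2], [3, 1]]                   # the query's b-bit values, one list per table
retrieved = set().union(*(t[bucket_index(q, b)] for t, q in zip(tables, query)))
print(retrieved)                           # union of the matching buckets: {5, 8, 13}
```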
Experiments on the Webspam Dataset

B = b × k: table size (in bits); L: number of tables.
Fractions of retrieved data points
[Figure: fraction of data points evaluated vs. B (bits) on webspam for L = 25, 50, 100 tables, comparing SRP with b-bit hashing, b = 1, 2, 4.]
SRP (sign random projections) and b-bit hashing (with b = 1, 2) retrieved similar
numbers of data points.
————————
Ref: Shrivastava and Li, Fast Near Neighbor Search in High-Dimensional Binary
Data, ECML’12
Precision-Recall curves (the higher the better) for retrieving top-T data points.
[Figure: precision-recall curves on webspam for retrieving the top T = 5, 10, 50 data points, with B = 24 (top row) and B = 16 (bottom row), L = 100, comparing SRP with b-bit hashing (b = 1, 4).]
Using L = 100 tables, b-bit hashing considerably outperformed SRP (sign
random projections), for table sizes B = 24 and B = 16.