CROification: Accurate Kernel Classification with the Efficiency of Sparse Linear SVM

Mehran Kafai, Member, IEEE, and Kave Eshghi

• M. Kafai is with Amazon A9, Visual Search, Palo Alto, CA 94301. E-mail: [email protected]
• K. Eshghi is with Box Inc., Redwood City, CA. E-mail: [email protected]
• The two authors contributed equally to this work.

Abstract—Kernel methods have been shown to be effective for many machine learning tasks such as classification and regression. In particular, support vector machines with the Gaussian kernel have proved to be powerful classification tools. The standard way to apply kernel methods is to use the kernel trick, where the inner product of the vectors in the feature space is computed via the kernel function. Using the kernel trick for SVMs, however, leads to training that is quadratic in the number of input vectors and classification that is linear with the number of support vectors. We introduce a new kernel, the CRO (Concomitant Rank Order) kernel, that approximates the Gaussian kernel on the unit sphere. We also introduce a randomized feature map, called the CRO feature map, that produces sparse, high-dimensional feature vectors whose inner product asymptotically equals the CRO kernel. Using the Discrete Cosine Transform for computing the CRO feature map ensures that the cost of computing feature vectors is low, allowing us to compute the feature map explicitly. Combining the CRO feature map with linear SVM, we introduce the CROification algorithm, which gives us the efficiency of a sparse high-dimensional linear SVM with the accuracy of the Gaussian kernel SVM.

Index Terms—Kernel method, SVM, concomitant rank order, classification, feature map.


1 INTRODUCTION

Kernel methods have been shown to be effective for many machine learning tasks such as clustering, regression, and classification [1], [2]. The theory behind kernel methods relies on a mapping between the input space and the feature space such that the inner product of the vectors in the feature space can be computed via the kernel function, aka the 'kernel trick'. The kernel trick is used because a direct mapping to the feature space is expensive or, in the case of the Gaussian kernel, impossible, since the feature space is infinite dimensional.

The canonical example is Support Vector Machine (SVM) classification with the Gaussian kernel. It has been shown that for many types of data, the classification accuracy of Gaussian kernel SVMs far surpasses that of linear SVMs. For example, on the MNIST [3] handwritten digit recognition dataset, SVM with the Gaussian kernel achieves an accuracy of 98.6%, whereas linear SVM can only achieve 92.7%.

For SVMs, the main drawback of the kernel trick is that both training and classification can be expensive. Training is expensive because the kernel function must be applied to each pair of training samples, making the training task at least quadratic in the number of training samples. Classification is expensive because, for each classification task, the kernel function must be applied to each of the support vectors, whose number may be large. As a result, kernel SVMs are rarely used when the number of training instances is large or for online applications where prediction must happen very fast [4]. There have been many approaches in the literature to overcoming these efficiency problems with non-linear kernel SVMs. Section 2 mentions some of the related work.

The situation is different with sparse, high-dimensional input vectors and linear kernels. For this class of problems, there are efficient algorithms for both training and prediction. One implementation is LIBLINEAR [5], from the same group that implemented LIBSVM [6]. When the data fits this model, e.g. for text classification, these algorithms are very effective. But when the data does not fit this model, e.g. for image classification [7], these algorithms are not particularly efficient and the classification accuracy is low.

We introduce a new kernel K_γ(A, B) defined as

K_γ(A, B) = Φ₂(γ, γ, cos(A, B))    (1)

where γ is a constant, and Φ₂(x, y, ρ) is the CDF of the standard bivariate normal distribution with correlation ρ.

We also introduce the notion of a CRO feature map, which is a randomized mapping from the input space to a sparse, high-dimensional, binary feature space. The feature map is defined in Definition 4. Briefly, for a constant integer κ that determines the sparsity of the map, a CRO feature map f is a function

f : R^D → {0, 1}^U    (2)

where U = τκ, for a positive integer τ, and the following is true:

• For any A ∈ R^D, the expected number of non-zero elements of f(A) is τ.
• In the limit, as U → ∞, for any A, B ∈ R^D,

E[ f(A) · f(B) / U ] = K_γ(A, B)    (3)

where γ = Φ^(−1)(1/κ) and Φ is the CDF of the standard normal distribution.
• The standard deviation of the lhs of eq. (3) decreases with the inverse of the square root of U, allowing us to make the feature map as accurate as desired by increasing U.
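As a concrete illustration of eqs. (1) and (3), the kernel can be evaluated directly from its definition. The following is a minimal sketch of ours, assuming NumPy and SciPy; the function name cro_kernel and the choice κ = 128 are illustrative, not part of the paper.

import numpy as np
from scipy.stats import multivariate_normal, norm

def cro_kernel(A, B, kappa=128):
    # K_gamma(A, B) = Phi_2(gamma, gamma, cos(A, B)), with gamma = Phi^{-1}(1/kappa)
    gamma = norm.ppf(1.0 / kappa)
    rho = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
    cov = [[1.0, rho], [rho, 1.0]]          # standard bivariate normal with correlation rho
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf([gamma, gamma])

rng = np.random.default_rng(0)
A, B = rng.standard_normal(50), rng.standard_normal(50)
print(cro_kernel(A, B))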

To perform the mapping from the input space to the feature space, we use a variant of the concomitant rank order hash function [8], [9]. Relying on a result first presented in [9], we use a random permutation followed by a Discrete Cosine Transform (DCT) to compute the random projection that is at the heart of this operation. The resulting algorithm for computing the feature map is highly efficient. The proposed kernel and feature map have interesting properties:

• The kernel approximates the Gaussian kernel on the unit sphere (Proposition 2).
• Computing the feature map is efficient.
• The feature space is
  – Binary (only zeros and ones in the feature vector)
  – High-dimensional (U is large)
  – Sparse (κ is relatively large)

We introduce the CROification algorithm by combining the CRO feature map with linear SVM, which results in a classification algorithm with the efficiency of a sparse high-dimensional linear SVM and the accuracy of the Gaussian kernel SVM. The CROification algorithm works as follows (a code sketch follows the list):

• For training, we apply the feature map to each input vector to compute the feature vector. Then we apply the linear SVM trainer (in this case LIBLINEAR) to the set of all the training feature vectors thus derived to train a model.
• To classify a test vector, we apply the feature map to the test vector, derive the corresponding feature vector, and use the linear model from the training phase to classify it.
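The two steps above can be sketched as follows. This is not the authors' implementation; it assumes scikit-learn's LinearSVC (a LIBLINEAR wrapper) and a hypothetical cro_hash_set(x) function that returns the 0-based indices of the τ non-zero coordinates of a feature vector (e.g. an adaptation of CROify from Table 2).

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

def feature_matrix(X, U, cro_hash_set):
    # stack the sparse binary CRO feature vectors of the rows of X into a CSR matrix
    rows, cols = [], []
    for i, x in enumerate(X):
        for j in cro_hash_set(x):
            rows.append(i)
            cols.append(j)
    return csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(len(X), U))

# training: map every input vector, then fit a linear SVM on the sparse features
#   F_train = feature_matrix(X_train, U, cro_hash_set)
#   model = LinearSVC().fit(F_train, y_train)
# prediction: map the test vectors with the same feature map and reuse the linear model
#   y_pred = model.predict(feature_matrix(X_test, U, cro_hash_set))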

Thus, for the class of problems where the CRO kernel is effective, we have the best of both worlds: the accuracy of the Gaussian kernel and the efficiency of sparse, high-dimensional linear models.

We show the efficacy of our approach, in terms of classification accuracy, training time, and prediction time, on a number of standard datasets. We also make a detailed comparison with alternative approaches for randomized mappings to the feature space presented in [10], [11], [12].

Along the way, we prove a new result in the theory of concomitant rank order statistics for bivariate normal distributions, given in Theorem 3.

2 RELATED WORK

Reducing the training and prediction cost of non-linear SVMs has attracted a great deal of attention in the literature. Joachims et al. [13] use basis vectors other than support vectors to find sparse solutions that speed up training and prediction. Segata et al. [14] use local SVMs on redundant neighborhoods and choose the appropriate model at query time. In this way, they divide the large SVM problem into many small local SVM problems.

Tsang et al. [15] re-formulate kernel methods as minimum enclosing ball (MEB) problems in computational geometry, and solve them via an efficient approximate MEB algorithm, leading to the idea of core sets. Nandan et al. [16] choose a subset of the training data, called the representative set, to reduce the training time. This subset is chosen using an algorithm based on convex hulls and extreme points.

A number of approaches compute approximations to the feature vectors and use linear SVM on them. Chang et al. [17] do an explicit mapping of the input vectors into a low-degree polynomial feature space, and then apply fast linear SVMs for classification. Vedaldi et al. [18] introduce explicit feature maps for the additive kernels, such as the intersection, Hellinger's, and χ². Weinberger et al. [19] use hashing to reduce the dimensionality of the input vectors. Litayem et al. [20] use hashing to reduce the size of the input vectors and speed up the prediction phase of linear SVM. Su et al. [21] use a sparse projection to reduce the dimensionality of the input vectors while preserving the kernel function.

Rahimi et al. [10] propose two randomized feature maps to map the input data to a randomized feature space. The first feature map uses sinusoids randomly drawn from the Fourier transform of the kernel function. It works as follows (a sketch is given after the list):

• A projection matrix P is created where each row of P is drawn randomly from the Fourier transform of the kernel function.
• To calculate the feature vector for the input vector A, A_p = PA is calculated.
• The feature vector Φ(A) is a vector whose length is twice the length of A_p, and for each coordinate i in A_p, Φ(A)[2i] = cos(A_p[i]) and Φ(A)[2i+1] = sin(A_p[i]).
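For concreteness, the first map of [10] specialized to the Gaussian kernel can be sketched as follows; this is our own illustration, not code from [10]. For that kernel the rows of P are drawn from a normal distribution.

import numpy as np

def rff_features(X, n_features, sigma, rng):
    # random Fourier features for the Gaussian kernel exp(-||a - b||^2 / (2 sigma^2))
    P = rng.normal(scale=1.0 / sigma, size=(n_features, X.shape[1]))  # rows drawn from the kernel's Fourier transform
    Ap = X @ P.T
    # cos/sin pairs; the normalization makes E[z(a) . z(b)] approximate the kernel value
    return np.hstack([np.cos(Ap), np.sin(Ap)]) / np.sqrt(n_features)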

The authors show that the expected inner product of the feature vectors is the kernel function applied to the input vectors. Our approach has the following two advantages over the work by Rahimi et al. [10]:

1) We can use the DCT to compute the feature vectors, whereas they need to do the projection by multiplying the input vector with the projection matrix. This can be more expensive if the input vector is relatively high-dimensional.
2) We generate a sparse feature vector (only τ entries out of U in the feature vector are non-zero, where τ ≪ U), whereas their feature vector is dense. Having sparse feature vectors can be advantageous in many circumstances; for example, many training algorithms converge much more quickly with sparse feature vectors.

The second randomized feature map proposed in [10] partitions the input space at randomly chosen resolutions by utilizing randomly shifted grids. Compared to the first approach, random binning features have received less attention due to their computational complexity. Wu et al. [22] explore the optimization of random binning and demonstrate that random binning features can be more efficient than other random features under the same accuracy requirement.

Pham et al. [12] introduce Tensor Sketching, a method for approximating polynomial kernels which relies on fast convolution of count sketches. Quoc et al. [11] introduce Fastfood, which replaces the random matrix proposed in [10] with an approximation that allows for fast multiplication. By doing so, Fastfood reduces the cost of evaluating and storing basis functions. Yen et al. [23] propose the sparse random features algorithm, which reduces the number of basis functions for better space and time complexity.

Li [24] improves upon the original random Fourier features procedure by adding a normalization step. The normalized random Fourier features have a smaller variance compared to the random Fourier features proposed in [10]. Li also proposes a nonlinear positive definite kernel for data similarity and "generalized consistent weighted sampling", a hashing method that maps the nonlinear kernel to a linear kernel.

Huang et al. [25] apply kernel SVMs to the problem of phoneme classification for the TIMIT [26] dataset. The problem they address is similar to ours: a full Gaussian kernel SVM is impractical to train on this type of dataset. To ameliorate this, they try the random Fourier feature based method of [10], but they find that to achieve acceptable accuracy, the dimensionality of the feature space needs to be very high, on the order of 200,000, which makes it impractical to store and process the vectors in memory. They propose two ways of overcoming this problem: the first is to have an ensemble of weak learners, each working on a smaller-dimensional feature space, and combine the results. The second approach is a scalable solver that does not require the storage of the entire matrix of feature space vectors in memory; instead, it computes the vectors on demand as the training process unfolds.

Raginsky et al. [27] compute locality sensitive hashes where the expected Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel. They use the results in [10] for this purpose.

3 THE CONCOMITANT RANK ORDER KERNEL

To better understand the rest of the paper, we start by describing the notation.

3.1 Notation

3.1.1 Φ(x)
We use Φ(x) to denote the CDF of the standard normal distribution N(0, 1), and φ(x) to denote its PDF.

φ(x) = (1/√(2π)) e^(−x²/2)    (4)

Φ(x) = ∫_{−∞}^{x} φ(u) du    (5)

3.1.2 Φ₂(x, y, ρ)
We use Φ₂(x, y, ρ) to denote the CDF of the standard bivariate normal distribution N( (0, 0), [[1, ρ], [ρ, 1]] ), and φ₂(x, y, ρ) to denote its PDF.

φ₂(x, y, ρ) = (1 / (2π √(1 − ρ²))) exp( −(x² + y² − 2ρxy) / (2(1 − ρ²)) )    (6)

Φ₂(x, y, ρ) = ∫_{−∞}^{x} ∫_{−∞}^{y} φ₂(u, v, ρ) du dv    (7)
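As a sanity check on this notation, φ₂ and Φ₂ can be evaluated numerically. The sketch below is ours and assumes SciPy; it integrates eq. (6) as in eq. (7) and compares the result with SciPy's bivariate normal CDF.

import numpy as np
from scipy.integrate import dblquad
from scipy.stats import multivariate_normal

def phi2_pdf(x, y, rho):
    # eq. (6)
    z = (x * x + y * y - 2 * rho * x * y) / (2 * (1 - rho * rho))
    return np.exp(-z) / (2 * np.pi * np.sqrt(1 - rho * rho))

def phi2_cdf(x, y, rho):
    # eq. (7): integrate the PDF over (-inf, x] x (-inf, y]
    val, _ = dblquad(lambda v, u: phi2_pdf(u, v, rho), -np.inf, x, -np.inf, y)
    return val

rho = 0.3
print(phi2_cdf(-1.0, -0.5, rho))
print(multivariate_normal(cov=[[1.0, rho], [rho, 1.0]]).cdf([-1.0, -0.5]))  # should agree closely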

3.1.3 Vector Indexing
We use square brackets to index into vectors. For example, A[5] refers to the fifth element of the vector A. Indices start from one.

3.1.4 →ᴰ
We use →ᴰ to denote "distributed as". For example,

X →ᴰ N(µ, σ²)

states that the random variable X is distributed as a normal distribution with mean µ and variance σ².

3.1.5 Rank function R(k, A)
For A ∈ R^D and 1 ≤ k ≤ D, we use R(k, A) to denote the rank of k in A, defined as:

Definition 1. R(k, A) is the number of elements of A which are less than or equal to A[k].

3.1.6 Indicator function I(P)

Definition 2. For the logical proposition P,

I(P) = 1 if P is true, and 0 if P is false.    (8)

3.2 Definition and Properties of the CRO Kernel

Definition 3. Let A, B ∈ R^D and γ ∈ R. Then the Concomitant Rank Order (CRO) kernel K_γ(A, B) is defined as:

K_γ(A, B) = Φ₂(γ, γ, cos(A, B))    (9)

In order for K_γ(A, B) to be admissible as a kernel for support vector machines, it needs to satisfy the Mercer conditions. Theorem 1 proves this.

Theorem 1. Let A, B ∈ R^D. Then K_γ(A, B) satisfies Mercer's conditions, and is an admissible support vector kernel.

Proof. According to the criteria in Theorem 8 in [28], to prove that K_γ(A, B) satisfies Mercer's conditions, it is sufficient to prove that there is a convergent series

K_γ(A, B) = Σ_{n=0}^{∞} α_n (A · B)^n    (10)

where α_n ≥ 0 for all n, and A · B is the inner product of A and B. By definition,

K_γ(A, B) = Φ₂(γ, γ, cos(A, B))    (11)

Thus, without loss of generality, we can assume that ‖A‖ = 1 and ‖B‖ = 1, since scaling A and B does not change the value of the kernel. Let

ρ = cos(A, B) = A · B    (12)

Using the tetrachoric series expansion for Φ₂ [29], which is convergent, we can show that

Φ₂(γ, γ, ρ) = Φ²(γ) + φ²(γ) Σ_{k=0}^{∞} (He_k(γ))² ρ^(k+1) / (k+1)!    (13)
            = Φ²(γ) + Σ_{k=1}^{∞} [φ(γ) He_{k−1}(γ)]² ρ^k / k!    (14)

where He_k(x) is the kth Hermite polynomial. From eqs. (11), (12) and (14) we get

K_γ(A, B) = Φ²(γ) + Σ_{k=1}^{∞} [φ(γ) He_{k−1}(γ)]² (A · B)^k / k!    (15)

which is sufficient to prove the theorem, since the series is convergent, Φ²(γ) ≥ 0, and for all k

[φ(γ) He_{k−1}(γ)]² / k! ≥ 0    (16)

In the following proposition we derive a formula for K_γ(A, B).

Proposition 1. Formula for the CRO kernel:

K_γ(A, B) = Φ²(γ) + ∫_0^{cos(A,B)} exp(−γ² / (1 + ρ)) / (2π √(1 − ρ²)) dρ    (17)
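Equation (17) can be checked numerically against Definition 3. The following sketch is ours, assumes SciPy, and uses the γ value that appears later in Figure 1.

import numpy as np
from scipy.integrate import quad
from scipy.stats import multivariate_normal, norm

gamma = -2.42617
alpha2 = norm.cdf(gamma) ** 2          # Phi^2(gamma)

def K_integral(c):
    # rhs of eq. (17) with cos(A, B) = c
    f = lambda rho: np.exp(-gamma**2 / (1 + rho)) / (2 * np.pi * np.sqrt(1 - rho**2))
    val, _ = quad(f, 0.0, c)
    return alpha2 + val

def K_phi2(c):
    # Definition 3: Phi_2(gamma, gamma, c)
    return multivariate_normal(cov=[[1.0, c], [c, 1.0]]).cdf([gamma, gamma])

for c in (0.1, 0.5, 0.9):
    print(c, K_integral(c), K_phi2(c))   # the two values should match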


Proof. Let

C(x, ρ) = Φ₂(x, x, ρ)    (18)

By definition,

K_γ(A, B) = C(γ, cos(A, B))    (19)

Now, as proved in [30],

d/dρ Φ₂(x, y, ρ) = φ₂(x, y, ρ)    (20)

i.e.

d/dρ Φ₂(x, y, ρ) = exp( −(x² + y² − 2ρxy) / (2(1 − ρ²)) ) / (2π √(1 − ρ²))    (21)

Thus

d/dρ C(x, ρ) = exp( −(x² + x² − 2ρx²) / (2(1 − ρ²)) ) / (2π √(1 − ρ²))    (22)
             = exp( −x²(1 − ρ) / (1 − ρ²) ) / (2π √(1 − ρ²))    (23)

Let

ρ ≠ 1    (24)

Then eq. (23) can be simplified as

d/dρ C(x, ρ) = exp( −x² / (1 + ρ) ) / (2π √(1 − ρ²))    (25)

We know that C(x, 0) = Φ²(x). Thus

C(x, r) = Φ²(x) + ∫_0^r exp( −x² / (1 + ρ) ) / (2π √(1 − ρ²)) dρ    (26)

From eq. (19) and eq. (26) we get

K_γ(A, B) = Φ²(γ) + ∫_0^{cos(A,B)} exp( −γ² / (1 + ρ) ) / (2π √(1 − ρ²)) dρ    (27)

Note that eq. (24) is satisfied inside the integral in eq. (27).

In Proposition 2 we derive the relationship between the CRO kernel and the Gaussian kernel.

Proposition 2. The CRO kernel approximates the Gaussian kernel on the unit sphere.

Proof. Let

α = Φ(γ)    (28)

From the definition of the bivariate normal distribution we can show that

Φ₂(γ, γ, 0) = α²    (29)
Φ₂(γ, γ, 1) = α    (30)

Thus

log(Φ₂(γ, γ, 0)) = 2 log(α)    (31)
log(Φ₂(γ, γ, 1)) = log(α)    (32)

If we linearly approximate log(Φ₂(γ, γ, ρ)) between ρ = 0 and ρ = 1, we get the following:

log(Φ₂(γ, γ, ρ)) ≈ log(α)(2 − ρ)    (33)

Figure 1 shows the two sides of eq. (33) as ρ goes from 0 to 1. The choice of γ = −2.42617 is for a typical use case of the kernel. α is related to γ through eq. (28).

Fig. 1. Comparison of the two sides of eq. (33) for γ = −2.42617

From eq. (33) we get

Φ₂(γ, γ, ρ) ≈ e^(log(α)(2−ρ))    (34)
            ≈ α e^(log(α)(1−ρ))    (35)

Let ρ = cos(A, B). Then from Definition 3 and eq. (35) we get

K_γ(A, B) ≈ α e^(log(α)(1−cos(A,B)))    (36)

Let A, B ∈ R^N be on the unit sphere, i.e.

‖A‖ = ‖B‖ = 1    (37)

Then

‖A − B‖² / 2 = (‖A‖² + ‖B‖² − 2‖A‖‖B‖ cos(A, B)) / 2    (38)
             = 1 − cos(A, B)    (39)

Replacing the rhs of eq. (39) in eq. (36) we get

K_γ(A, B) ≈ α e^((log(α)/2) ‖A−B‖²)    (40)

But the rhs of eq. (40) is the definition of a Gaussian kernel with parameter log(α).

Notice that the purpose of the comparison with the Gaussian kernel is to give us an intuition about K_γ(A, B), to enable us to make meaningful comparisons with implementations that use the Gaussian kernel.

To this end, how close is the approximation to the Gaussian kernel? Figure 2 shows the two sides of eq. (40) as cos(A, B) goes from 0 to 1. Notice that when cos(A, B) is negative the values of both kernels are very close to zero, so we ignore such cases.

Fig. 2. Comparison of the two sides of eq. (40) for γ = −2.42617
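The quality of the approximation in eq. (40) can also be probed numerically. The sketch below is ours, assumes SciPy, and evaluates both sides for the same γ = −2.42617.

import numpy as np
from scipy.stats import multivariate_normal, norm

gamma = -2.42617
alpha = norm.cdf(gamma)

def cro_kernel(rho):
    return multivariate_normal(cov=[[1.0, rho], [rho, 1.0]]).cdf([gamma, gamma])

def gaussian_approx(rho):
    # rhs of eq. (40), using ||A - B||^2 = 2(1 - rho) on the unit sphere
    return alpha * np.exp(np.log(alpha) * (1.0 - rho))

for rho in np.linspace(0.0, 0.95, 6):
    print(f"rho = {rho:.2f}   K = {cro_kernel(rho):.3e}   approx = {gaussian_approx(rho):.3e}")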

We will now describe the feature map that allows us to use the kernel K_γ(A, B) for tasks such as SVM training.

4 CRO FEATURE MAP FAMILIES

In this section we define the CRO feature map family G(D, U, γ).

Definition 4. Let D, U be positive integers, and γ a real number. A function family G(D, U, γ) is a CRO feature map family if it is a set of functions f where each f has the domain R^D and the range {0, 1}^U, i.e.,

f : R^D → {0, 1}^U

and for all A, B ∈ R^D, the conditions specified in eqs. (41) to (43) hold.

4.1 Conditions that a CRO Feature Map should Satisfy

CRO feature map families should satisfy 3 conditions: sparsity, convergence to the kernel, and bounded randomization error.

4.1.1 Sparsity
This condition requires that for all 1 ≤ k ≤ U,

E_{f∈G(D,U,γ)}[ f(A)[k] ] = Φ(γ)    (41)

Since the feature vectors are binary, the expected value of f(A)[k] is the probability that the coordinate k in f(A) has the value 1. Thus eq. (41) in effect defines the sparsity of the feature vectors, by specifying the probability that any coordinate would have the value 1.

4.1.2 Convergence to CRO Kernel
This condition states that asymptotically, the expected normalized inner product of the feature vectors should be equal to the CRO kernel.

lim_{U→∞} E_{f∈G(D,U,γ)}[ f(A) · f(B) / U ] = Φ₂(γ, γ, cos(A, B))    (42)

4.1.3 Bounded Randomization Error
This condition requires that the standard deviation of the normalized inner product of f(A) and f(B) should decrease with the inverse of the square root of U. Specifically, for some positive constant c ≤ 0.5,

SD_{f∈G(D,U,γ)}[ f(A) · f(B) / U ] ≤ c / √U    (43)

This requirement is necessary since the feature map is a randomized function, and we need to ensure that we can minimize the error in eq. (42) by choosing a large enough U.

In the next section we describe the Smallest Tau (ST) feature map family, and prove that it satisfies Definition 4, i.e., it is a CRO feature map family.

4.2 The Smallest Tau (ST) Feature Map Family

The ST feature map family is based on the Concomitant Rank Order (CRO) hash function introduced in [8]. It was further developed and its collision probabilities analyzed in [9] and [31]. Before we can define the ST family, we need to define the function H_{Q,τ}(A).

Definition 5. Let τ, U be positive integers, with τ < U. Let A ∈ R^D and Q ∈ R^(U×D). Let 1 ≤ j ≤ U. Let

S = QA    (44)

Then the function

H_{Q,τ}(A) : R^D → {0, 1}^U    (45)

is defined as follows:

H_{Q,τ}(A)[j] = I( R(j, S) ≤ τ )    (46)

where R() is the rank function defined in Definition 1 and I() is the indicator function defined in Definition 2.

One way of reading eq. (46) is that H_{Q,τ}(A)[j] = 1 iff S[j] is among the smallest τ elements in S.

Definition 6. Let κ, D be fixed positive integers. Let γ = Φ^(−1)(1/κ). Let τ be a positive integer, and let U = τκ. Let M_{U,D} be a U×D matrix of iid N(0, 1) random variables. Then ST(D, U, γ) is defined as follows:

ST(D, U, γ) = { H_{Q,τ} : Q is an instance of M_{U,D} }    (47)

A code sketch of one member of this family is given below.
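The following is a minimal NumPy sketch of ours; the dense Gaussian Q here is an instance of M_{U,D}, not the DCT-based construction of Section 5.

import numpy as np

def smallest_tau_map(A, Q, tau):
    # H_{Q,tau}(A): 1s at the coordinates j with R(j, S) <= tau, where S = QA
    S = Q @ A
    h = np.zeros(Q.shape[0], dtype=np.uint8)
    h[np.argpartition(S, tau)[:tau]] = 1
    return h

rng = np.random.default_rng(0)
D, kappa, tau = 30, 64, 50
U = tau * kappa
Q = rng.standard_normal((U, D))           # one instance of M_{U,D}
A = rng.standard_normal(D)
print(smallest_tau_map(A, Q, tau).sum())  # exactly tau non-zero entries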

4.3 Smallest Tau is a CRO Feature Map Family

In this section, we prove that ST is a CRO feature map family. To do this, we need to prove that the conditions specified in eqs. (41) to (43) hold. This is proved in Theorem 2. The proof of this theorem is long, so the reader can skip to Section 4.4 on a first reading.

Theorem 2. Let ST(D, U, γ) be defined as in Definition 6. Then ST is a CRO feature map family.

Proof. Let A, B ∈ R^D, and let 1 ≤ j ≤ U. Let Q be an instance of M_{U,D} chosen uniformly and at random. We need to prove the following:

• Sparsity (eq. (41)):

E[ H_{Q,τ}(A)[j] ] = Φ(γ)    (48)

• Convergence to the CRO kernel (eq. (42)):

lim_{U→∞} E[ H_{Q,τ}(A) · H_{Q,τ}(B) / U ] = Φ₂(γ, γ, cos(A, B))    (49)

• Bounded randomization error (eq. (43)):

SD[ H_{Q,τ}(A) · H_{Q,τ}(B) / U ] ≤ c / √U    (50)

for some positive constant c ≤ 0.5.

Eq. (48) is proved in Proposition 3, eq. (49) is proved in Proposition 4, and eq. (50) is proved by applying Proposition 7, and observing that, since p is a probability and κ is a positive integer, √(p(1 − p)/κ) ≤ 0.5.


Proposition 3. Let κ, D, γ, τ, U, M_{U,D} be defined as in Definition 6 and H_{Q,τ} be defined as in Definition 5. Let A ∈ R^D and 1 ≤ j ≤ U. Then

E[ H_{Q,τ}(A)[j] ] = Φ(γ)    (51)

Proof. Let S = QA. It is straightforward to prove that the elements of S are instances of iid N(0, ‖A‖) random variables. So, the probability of any of them being among the smallest τ elements in S is τ/U. Now, by definition, H_{Q,τ}(A)[j] = 1 iff R(j, S) ≤ τ, i.e. S[j] is among the smallest τ elements in S. It follows that

P[ H_{Q,τ}(A)[j] = 1 ] = τ/U = 1/κ = Φ(γ)    (52)

E[ H_{Q,τ}(A)[j] ] = Φ(γ)    (53)

Proposition 4. Let κ, D, γ, τ, U, M_{U,D} be defined as in Definition 6 and H_{Q,τ} be defined as in Definition 5. Let A, B ∈ R^D. Then

lim_{U→∞} E[ H_{Q,τ}(A) · H_{Q,τ}(B) / U ] = Φ₂(γ, γ, cos(A, B))    (54)

Proof. First, we note that for any scalar s > 0 and input vector A, H_{Q,τ}(sA) = H_{Q,τ}(A). This is because multiplying the input vector by a positive scalar simply scales the elements in QA, and does not change the order of the elements.

We also note that for any scalars s₁ ≠ 0, s₂ ≠ 0, cos(A, B) = cos(s₁A, s₂B).

Let Ā = (√U/‖A‖) A and B̄ = (√U/‖B‖) B. We will prove the following:

lim_{U→∞} E[ H_{Q,τ}(Ā) · H_{Q,τ}(B̄) / U ] = Φ₂(γ, γ, cos(A, B))    (55)

which, by the argument above, is sufficient to prove eq. (54). Now,

E[ H_{Q,τ}(Ā) · H_{Q,τ}(B̄) ] = Σ_{i=1}^{U} E[ H_{Q,τ}(Ā)[i] H_{Q,τ}(B̄)[i] ]    (56)
                              = Σ_{i=1}^{U} P[ H_{Q,τ}(Ā)[i] = 1 ∧ H_{Q,τ}(B̄)[i] = 1 ]    (57)

We will show that for all 1 ≤ i ≤ U,

lim_{U→∞} P[ H_{Q,τ}(Ā)[i] = 1 ∧ H_{Q,τ}(B̄)[i] = 1 ] = Φ₂(γ, γ, cos(A, B))    (58)

which together with eq. (57) is sufficient to prove the proposition. Consider the vectors

x = QĀ    (59)
y = QB̄    (60)

Let

ρ = cos(A, B)    (61)

It is possible to prove that (x₁, y₁), (x₂, y₂), ..., (x_U, y_U) are iid samples from a bivariate normal distribution, where for all 1 ≤ i ≤ U,

(x_i, y_i) →ᴰ N( (0, 0), [[1, ρ], [ρ, 1]] )    (62)

Now consider the following question: for any 1 ≤ i ≤ U, what is the probability that

H_{Q,τ}(Ā)[i] = 1 ∧ H_{Q,τ}(B̄)[i] = 1    (63)

From Definition 5 and the definition of x and y it follows that

H_{Q,τ}(Ā)[i] = 1 ⟺ R(i, x) ≤ τ    (64)
H_{Q,τ}(B̄)[i] = 1 ⟺ R(i, y) ≤ τ    (65)

where R is the rank function introduced in Definition 1. Thus

P[ H_{Q,τ}(Ā)[i] = 1 ∧ H_{Q,τ}(B̄)[i] = 1 ] = P[ (R(i, x) ≤ τ) ∧ (R(i, y) ≤ τ) ]    (66)

By Theorem 3,

lim_{U→∞} P[ R(i, x) ≤ τ ∧ R(i, y) ≤ τ ] = Φ₂(γ, γ, ρ)    (67)

Substituting in eq. (66),

lim_{U→∞} P[ H_{Q,τ}(Ā)[i] = 1 ∧ H_{Q,τ}(B̄)[i] = 1 ] = Φ₂(γ, γ, ρ)    (68)

which proves eq. (58) and the proposition.

Theorem 3. Let κ be a fixed positive integer. Let γ = Φ^(−1)(1/κ). Let τ be a positive integer, and let U = τκ. Let

S = (x₁, y₁), (x₂, y₂), ..., (x_U, y_U)

be a sample from Φ₂(x, y, ρ). Let x = (x₁, x₂, ..., x_U)ᵀ and y = (y₁, y₂, ..., y_U)ᵀ. Let 1 ≤ i ≤ U. Then

lim_{U→∞} P[ R(i, x) ≤ τ ∧ R(i, y) ≤ τ ] = Φ₂(γ, γ, ρ)    (69)

Proof. First, we note that since x and y are samples from a normal distribution, the probability that they have duplicate elements is zero. Thus, we will assume that they do not have duplicate elements. Consider the sample

S′ = (x₁, y₁), ..., (x_{i−1}, y_{i−1}), (x_{i+1}, y_{i+1}), ..., (x_U, y_U)

which is S with (x_i, y_i) taken out. Let

x′ = (x₁, ..., x_{i−1}, x_{i+1}, ..., x_U)ᵀ
y′ = (y₁, ..., y_{i−1}, y_{i+1}, ..., y_U)ᵀ

Let x′_(1:U−1), x′_(2:U−1), ..., x′_(U−1:U−1) be the order statistics on x′, and y′_(1:U−1), y′_(2:U−1), ..., y′_(U−1:U−1) the order statistics on y′. Given that there are no duplicates in x and y, it is not too difficult to show that

R(i, x) ≤ τ ⟺ x_i < x′_(τ:U−1)    (70)
R(i, y) ≤ τ ⟺ y_i < y′_(τ:U−1)    (71)

That is,

P[ R(i, x) ≤ τ ∧ R(i, y) ≤ τ ] = P[ x_i < x′_(τ:U−1) ∧ y_i < y′_(τ:U−1) ]    (72)

The next step is to find the distribution of (x′_(τ:U−1), y′_(τ:U−1)), from which we can use eq. (72) to prove the theorem. From the definitions of γ, U, κ and τ we can derive

Φ(γ) = 1/κ    (73)
Φ(γ)(U − 1) = U/κ − 1/κ    (74)
Φ(γ)(U − 1) = τ − 1/κ    (75)
⌈Φ(γ)(U − 1)⌉ = τ    (76)

This means that when we apply Proposition 6 with n = U − 1 to get the distribution of (x′_(τ:U−1), y′_(τ:U−1)), we get

(x′_(τ:U−1), y′_(τ:U−1)) →ᴰ N( (γ, γ), [[ s/√(U−1), r/(U−1) ], [ r/(U−1), s/√(U−1) ]] )  as U → ∞    (77)

with s and r constants solely dependent on γ. Now, by assumption,

(x_i, y_i) →ᴰ N( (0, 0), [[1, ρ], [ρ, 1]] )    (78)

Note that (x_i, y_i) and (x′_(τ:U−1), y′_(τ:U−1)) are instances of independent random variables with distributions eq. (78) and eq. (77). Using the standard procedure for the subtraction of two independent bivariate normal random variables, we can derive

(x_i − x′_(τ:U−1), y_i − y′_(τ:U−1)) →ᴰ N( (−γ, −γ), [[ √(1 + s²/(U−1)), ρ + r/(U−1) ], [ ρ + r/(U−1), √(1 + s²/(U−1)) ]] )  as U → ∞    (79)

Clearly,

lim_{U→∞} √(1 + s²/(U − 1)) = 1    (80)
lim_{U→∞} ( ρ + r/(U − 1) ) = ρ    (81)

Thus, from eq. (79), eq. (80) and eq. (81) we can derive

(x_i − x′_(τ:U−1), y_i − y′_(τ:U−1)) →ᴰ N( (−γ, −γ), [[1, ρ], [ρ, 1]] )  as U → ∞    (82)

Thus by Proposition 5

lim_{U→∞} P[ x_i − x′_(τ:U−1) < 0 ∧ y_i − y′_(τ:U−1) < 0 ] = Φ₂(γ, γ, ρ)    (83)

lim_{U→∞} P[ x_i < x′_(τ:U−1) ∧ y_i < y′_(τ:U−1) ] = Φ₂(γ, γ, ρ)    (84)

Substituting the lhs of eq. (84) in the rhs of eq. (72), we get

lim_{U→∞} P[ R(i, x) ≤ τ ∧ R(i, y) ≤ τ ] = Φ₂(γ, γ, ρ)    (85)

Proposition 5. Let

(x, y) →ᴰ N( (q, q), [[1, ρ], [ρ, 1]] )    (86)

Then

P(x < 0 ∧ y < 0) = Φ₂(−q, −q, ρ)

Proof. Let (u, v) = (x − q, y − q). Then from eq. (86) it easily follows that

(u, v) →ᴰ N( (0, 0), [[1, ρ], [ρ, 1]] )    (87)

Thus, by eq. (87) and the definition of Φ₂,

P[ u < −q ∧ v < −q ] = Φ₂(−q, −q, ρ)    (88)

By definition, u = x − q and v = y − q. Substituting in eq. (88) we get

P[ x − q < −q ∧ y − q < −q ] = Φ₂(−q, −q, ρ)    (89)
P[ x < 0 ∧ y < 0 ] = Φ₂(−q, −q, ρ)    (90)

Proposition 6. Let (x₁, y₁), (x₂, y₂), ..., (x_n, y_n) be a sample from Φ₂(x, y, ρ). Let γ ∈ R be a constant. Let

τ = ⌈Φ(γ) n⌉    (91)

s = √(Φ(γ)(1 − Φ(γ))) / φ(γ)    (92)

r = ( Φ₂(γ, γ, ρ) − Φ²(γ) ) / φ²(γ)    (93)

Let X_(1:n), X_(2:n), ..., X_(n:n) be the order statistics on x₁, x₂, ..., x_n, and Y_(1:n), Y_(2:n), ..., Y_(n:n) the order statistics on y₁, y₂, ..., y_n. Then, as n → ∞,

(X_(τ:n), Y_(τ:n)) →ᴰ N( (γ, γ), [[ s/√n, r/n ], [ r/n, s/√n ]] )    (94)

Proof. This follows from the theorem on page 148 in [32] with appropriate substitutions.

4.4 The Accuracy of the Feature Map

In Proposition 4, we established that

lim_{U→∞} E[ H_{Q,τ}(A) · H_{Q,τ}(B) / U ] = Φ₂(γ, γ, cos(A, B))    (95)
                                           = K_γ(A, B)    (96)

which links the feature map H_{Q,τ} with the CRO kernel. But this proposition only talks about the expected value of H_{Q,τ}(A) · H_{Q,τ}(B) / U. To get a sense of the error introduced by the randomization, we need a bound on the standard deviation of H_{Q,τ}(A) · H_{Q,τ}(B) / U.

Let p be defined as in eq. (98). Then by Proposition 7,

SD[ H_{Q,τ}(A) · H_{Q,τ}(B) / U ] < √(p(1 − p)/κ) / √U    (97)

which shows that the standard deviation of H_{Q,τ}(A) · H_{Q,τ}(B) / U decreases with the inverse of the square root of U. This assures us that by choosing a large enough U, we can decrease the error introduced by randomization to as small a quantity as desired. A Monte Carlo illustration of this concentration is sketched below.
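The following Monte Carlo sketch is ours (assuming NumPy and SciPy); it illustrates eqs. (95) to (97): the normalized inner product of two ST feature vectors concentrates around Φ₂(γ, γ, cos(A, B)) as U grows.

import numpy as np
from scipy.stats import multivariate_normal, norm

def st_inner(A, B, U, tau, rng):
    # one draw of H_{Q,tau}(A) . H_{Q,tau}(B) / U for a Gaussian Q
    Q = rng.standard_normal((U, A.size))
    sa = set(np.argpartition(Q @ A, tau)[:tau])
    sb = set(np.argpartition(Q @ B, tau)[:tau])
    return len(sa & sb) / U

rng = np.random.default_rng(0)
A, B = rng.standard_normal(20), rng.standard_normal(20)
kappa = 64
gamma = norm.ppf(1.0 / kappa)
rho = A @ B / (np.linalg.norm(A) * np.linalg.norm(B))
target = multivariate_normal(cov=[[1.0, rho], [rho, 1.0]]).cdf([gamma, gamma])
for U in (kappa * 16, kappa * 64, kappa * 256):
    draws = [st_inner(A, B, U, U // kappa, rng) for _ in range(20)]
    print(U, np.mean(draws), np.std(draws), target)  # the std shrinks roughly like 1/sqrt(U)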

Proposition 7. Let κ, D, γ, τ, U, ST be defined as in Definition 6. Let H_{Q,τ} be chosen from ST(D, U, γ) uniformly at random. Let F(A) = H_{Q,τ}(A) and F(B) = H_{Q,τ}(B). Let

p = P[ F(A)[1] = 1 | F(B)[1] = 1 ]    (98)

Then

SD[ F(A) · F(B) / U ] < √(p(1 − p)/κ) / √U    (99)

Proof. It is easy to show that for any 1 ≤ i ≤ U, the conditional probability that F(A)[i] = 1 given that F(B)[i] = 1 is equal to p, i.e.

P[ F(A)[i] = 1 | F(B)[i] = 1 ] = p    (100)

Let V(A, B) be the random variable that represents F(A) · F(B), i.e.

V(A, B) = F(A) · F(B)    (101)

We can re-write V(A, B) as a conditional random variable, as follows:

V(A, B) = ( Σ_{i∈z} F(A)[i] ) | z = {l : F(B)[l] = 1}    (102)

In eq. (102), z is the set of indices l such that F(B)[l] = 1. By definition of the feature map, the size of z is τ, i.e. |z| = τ.

The point of eq. (102) is that it formulates V(A, B) as a sum of conditional random variables, each one of which is a Bernoulli variable with success probability p.

Now consider the Binomial random variable W = B(τ, p). It is defined as follows:

W = Σ_{u=1}^{τ} t_u    (103)

where t₁, t₂, ..., t_τ are iid Bernoulli random variables with success probability p.

How does V(A, B), as formulated in eq. (102), differ from W in eq. (103)? The difference is that while the t_u in eq. (103) are independent, the F(A)[i] in eq. (102) are not independent.

Let's look at the moments of W and V(A, B). We can easily prove that the first moments are the same, i.e.

E[W] = E[V(A, B)]    (104)

But what about the second moments? From eq. (103),

E[W²] = E[ ( Σ_{u=1}^{τ} t_u )² ]    (105)
      = Σ_{u=1}^{τ} Σ_{v=1}^{τ} E[t_u t_v]    (106)

Conditioned on z = {l : F(B)[l] = 1}, by definition,

E[ (V(A, B))² ] = E[ ( Σ_{i∈z} F(A)[i] )² ]    (107)
                = E[ Σ_{i∈z} Σ_{j∈z} F(A)[i] F(A)[j] ]    (108)
                = Σ_{i∈z} Σ_{j∈z} E[ F(A)[i] F(A)[j] ]    (109)

Since |z| = τ, there is a one-to-one correspondence between the terms of eq. (109) and the terms of eq. (106). There are two kinds of terms in eq. (106): those where u = v, and those where u ≠ v. They correspond, respectively, to terms of eq. (109) where i = j and i ≠ j.

Since t₁, t₂, ..., t_τ are iid Bernoulli random variables with probability p,

u = v ⟹ E[t_u t_v] = E[t_u²] = p    (110)
u ≠ v ⟹ E[t_u t_v] = p²    (111)

To show that E[(V(A, B))²] < E[W²], we show that the terms in eq. (109) are either equal to or less than the corresponding terms in eq. (106). To do this, we show that for all i, j ∈ z,

i = j ⟹ E[ F(A)[i] F(A)[j] ] = E[ (F(A)[i])² ] = p    (112)
i ≠ j ⟹ E[ F(A)[i] F(A)[j] ] < p²    (113)

In the rest of this proof, all expectations and probabilities are conditional on i, j ∈ z.

Since F(A)[i] and F(A)[j] are Bernoulli random variables with probability p,

E[ (F(A)[i])² ] = P[ F(A)[i] = 1 ] = p    (114)

which proves eq. (112). On the other hand,

E[ F(A)[i] F(A)[j] ] = P[ F(A)[i] = 1 ∧ F(A)[j] = 1 ]    (115)

Therefore, to prove eq. (113), it is sufficient to prove that

P[ F(A)[i] = 1 ∧ F(A)[j] = 1 ] < p²    (116)

Since

P[ F(A)[i] = 1 ∧ F(A)[j] = 1 ]    (117)
  = P[ F(A)[i] = 1 | F(A)[j] = 1 ] P[ F(A)[j] = 1 ]    (118)
  = P[ F(A)[i] = 1 | F(A)[j] = 1 ] p    (119)

to prove eq. (113), it is sufficient to prove that

P[ F(A)[i] = 1 | F(A)[j] = 1 ] < p    (120)

i.e.

P[ F(A)[i] = 1 | F(A)[j] = 1 ] < P[ F(A)[i] = 1 ]    (121)

In other words, we need to show that the conditional probability that F(A)[i] = 1 given that F(A)[j] = 1 is less than the unconditional probability that F(A)[i] = 1. Recall the definition of S in eq. (44). According to eq. (46), F(A)[j] = 1 iff S[j] is among the smallest τ elements of S. Thus, given that F(A)[j] = 1, S[i] has to 'compete' with S[j] to be among the smallest τ elements of S, making P[F(A)[i] = 1 | F(A)[j] = 1] less than P[F(A)[i] = 1]. Thus we have proved that

E[(V(A, B))²] < E[W²]    (122)

which, given that the first moments are equal, implies that

Var[V(A, B)] < Var[W]    (123)

W is a B(τ, p) Binomial random variable, thus

Var[W] = τ p(1 − p)    (124)

which, together with eq. (123), leads to

Var[V(A, B)] < τ p(1 − p)    (125)

Thus

SD[V(A, B)] < √(τ p(1 − p))    (126)

SD[ V(A, B) / U ] < √(τ p(1 − p)) / U    (127)

By definition, τ = U/κ and V(A, B) = F(A) · F(B). Substituting in eq. (127) and simplifying, we get

SD[ F(A) · F(B) / U ] < √(p(1 − p)/κ) / √U    (128)

5 COMPUTING THE FEATURE MAP USING DCT

Computing S in eq. (44) using matrix multiplication takes O(UD) operations, which can be expensive when D is large. When Q is an instance of a matrix of iid normal random variables this cannot be avoided. In this section, we introduce the notion of CRO-admissible projection matrices that allow the computation of S in eq. (44) to be performed via the Discrete Cosine Transform (DCT), bringing down the cost of computing S to O(U log(U)). To simplify what follows, we introduce the following notation:

• U⁺: For the integer U, U⁺ = U + 1.
• crop(H): For the U⁺×D matrix H, crop(H) is a U×D matrix defined as:

crop(H)[i, j] = H[i + 1, j]    (129)

i.e., crop(H) is the matrix H with the top row 'cropped'.

Definition 7. Let κ, D be fixed positive integers. Let γ = Φ^(−1)(1/κ). Let τ be a positive integer, and let U = τκ. Let A, B ∈ R^D, and let 1 ≤ k ≤ U. Let Θ_{U,D} be a set of U×D matrices. Note that U ≫ D. Then Θ_{U,D} is CRO-admissible iff, as U → ∞,

E_{Q∈Θ_{U,D}}[ H_{Q,τ}(A)[k] ] = Φ(γ)    (130)

E_{Q∈Θ_{U,D}}[ H_{Q,τ}(A) · H_{Q,τ}(B) / U ] = Φ₂(γ, γ, cos(A, B))    (131)

SD_{Q∈Θ_{U,D}}[ H_{Q,τ}(A) · H_{Q,τ}(B) / U ] < 0.5 / √U    (132)

Notice that the three conditions above correspond to the conditions specified for CRO feature map families in eqs. (41) to (43).

Using the techniques explored in [9], we define a set of projection matrices Θ_{U,D} that are CRO-admissible, and for which computing the projection can be achieved through the fast DCT. Thus, we reduce the cost of computing the projection to O(U log(U)).

Definition 8. Let I_D denote the D×D identity matrix and 0_{r,D} the r×D zero matrix. Let r = U⁺ mod 2D and d = U⁺ ÷ 2D (÷ denotes integer division). Then the U⁺×D symmetric repetition matrix J_{U⁺,D} is defined as follows:

J_{U⁺,D} = [ I_D, −I_D, I_D, −I_D, ..., I_D, −I_D, 0_{r,D} ]    (133)

where the I_D, −I_D blocks are repeated d times and the blocks are stacked vertically, so that J_{U⁺,D} has U⁺ rows.

Definition 9. Let T be a U⁺×U⁺ DCT matrix, J a U⁺×D symmetric repetition matrix, and Π the set of all U⁺×U⁺ permutation matrices. Then

Θ_{U,D} = { crop(T π J) : π ∈ Π }    (134)

The reason we use U⁺ and then crop is that the first element of the DCT is the 'dc element', which is always 0 in our case. crop gets rid of this element. A sketch of this construction is given below.
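The construction in Definitions 8 and 9 can be sketched as follows. This is our own illustration; it assumes scipy.fftpack's orthonormal DCT-II for T and uses small sizes so that Q can also be materialized explicitly for comparison.

import numpy as np
from scipy.fftpack import dct

def theta_projection(A, U, perm):
    # S = crop(T pi J) A computed without materializing Q, as in Table 1
    D = A.size
    Uplus = U + 1
    d, r = divmod(Uplus, 2 * D)
    Aprime = np.concatenate([np.tile(np.concatenate([A, -A]), d), np.zeros(r)])  # J A
    V = Aprime[perm]                       # pi (J A)
    return dct(V, norm='ortho')[1:]        # apply T, then crop the dc element

def theta_matrix(U, D, perm):
    # the same member of Theta_{U,D}, materialized as a U x D matrix
    Uplus = U + 1
    d, r = divmod(Uplus, 2 * D)
    blocks = [np.eye(D) if i % 2 == 0 else -np.eye(D) for i in range(2 * d)]
    J = np.vstack(blocks + [np.zeros((r, D))])
    T = dct(np.eye(Uplus), axis=0, norm='ortho')
    P = np.eye(Uplus)[perm]                # permutation matrix pi
    return (T @ P @ J)[1:, :]

rng = np.random.default_rng(0)
D, U = 4, 31
perm = rng.permutation(U + 1)
A = rng.standard_normal(D)
print(np.allclose(theta_projection(A, U, perm), theta_matrix(U, D, perm) @ A))  # True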

Proposition 8. Θ_{U,D} is CRO-admissible.

Proof. Let A, B ∈ R^D. Let ρ = cos(A, B). Without loss of generality, assume ‖A‖ = ‖B‖ = √D. Let Q be chosen uniformly at random from Θ_{U,D}. Let

X = QA    (135)
Y = QB    (136)

Let

Z = (x₁, y₁), (x₂, y₂), ..., (x_U, y_U)    (137)

Then, using Theorem 1 in [31], we can prove that for a fixed k, as U → ∞, Z tends to a k-wise independent sequence of N( (0, 0), [[1, ρ], [ρ, 1]] ) random variables. It is also possible to prove that X and Y are stationary and ergodic.

To prove Proposition 8, we need to prove the conditions specified in eqs. (130) to (132). Eq. (130) can be proved using the arguments in Proposition 3. Since Z is k-wise independent with k > 2, condition (132) can be proved using the arguments in Proposition 7.

We can perform the proof of eq. (131) in a manner analogous to the proof of eq. (3). That proof boils down to proving eq. (83), i.e. proving that

lim_{U→∞} P[ x_i < X_(τ:U−1) ∧ y_i < Y_(τ:U−1) ] = Φ₂(γ, γ, ρ)    (138)

where X_(τ:U−1), Y_(τ:U−1) are order statistics on X and Y.

In order to prove eq. (138), we proceed as follows. We first derive the asymptotic value of X_(τ:U−1) and Y_(τ:U−1), and then we use the definition of Φ₂ to prove the condition.

Since asymptotically X and Y are stationary and ergodic sequences of standard normal random variables, by Theorem 2.1 in [33], as U → ∞,

X_(τ:U−1) →a.s. Φ^(−1)( τ / (U − 1) )    (139)
Y_(τ:U−1) →a.s. Φ^(−1)( τ / (U − 1) )    (140)

where →a.s. denotes almost sure convergence. Now, by definition, U = κτ. Thus, as U → ∞,

τ / (U − 1) → 1/κ    (141)
Φ^(−1)( τ / (U − 1) ) → γ    (142)
X_(τ:U−1) →a.s. γ    (143)
Y_(τ:U−1) →a.s. γ    (144)

Substituting γ for X_(τ:U−1) and Y_(τ:U−1) in eq. (138) we get

lim_{U→∞} P[ x_i < γ ∧ y_i < γ ] = Φ₂(γ, γ, ρ)    (145)

which can be proved by observing that as U → ∞, (x_i, y_i) tends to an N( (0, 0), [[1, ρ], [ρ, 1]] ) random variable.

The important point is that when Q ∈ Θ_{U,D}, rather than materializing Q and using matrix multiplication to compute QA, we use the procedure in Table 1, involving the fast DCT, to compute the projection. The result is that computing the projection takes O(U log(U)) operations.

Below, we describe an algorithm for computing the CRO feature map. The algorithm computes the coordinates of the non-zero elements of the feature map, which for historical reasons we call the hash set of the input vector. The hash set is computed using the CRO hash function, which maps input vectors to a set of hashes chosen from a universe 1 ... U, where U is a large integer. τ denotes the number of hashes that we require per input vector.

Let A ∈ R^D be the input vector. The transform-based hash function takes as a second input a random permutation π of 1 ... U. It should be emphasized that the random permutation π is chosen once and used for "hashing" all input vectors.

Table 1 shows the procedure for computing the CRO hash set using the fast DCT. Here we use −A to represent the vector A multiplied by −1. We use A, B, C, ... to represent the concatenation of vectors A, B, C, etc.

Table 2 presents an implementation of CROify, the transform-based CRO feature map function, in Python. In this implementation we don't bother with U⁺ and crop, since in practice the fact that the first element of the transform is 0 makes no difference to the results.

6 EXPERIMENTS

We present the experimental results in two sections. First, in Section 6.1, we evaluate the CRO feature map when combined with a linear SVM classifier and compare it in terms of accuracy, training time, and prediction time to LIBLINEAR [5], LIBSVM [6], Fastfood [11], Tensor Sketching [12], and deep learning classification via Torch7 [34] on multiple publicly available datasets.

TABLE 1
Procedure for computing the CRO hash set for a given input vector A

1) Let Ã = A, −A (the concatenation of A and −A).
2) Create a repeated input vector A′ as follows: A′ = Ã, Ã, ..., Ã (repeated d times), followed by r zeros, where d = U ÷ 2D and r = U mod 2D. Thus |A′| = 2dD + r = U.
3) Apply the random permutation π to A′ to get the permuted input vector V.
4) Compute the DCT of V to get S.
5) Find the indices of the smallest τ members of S. These indices are the hash set of the input vector A.

TABLE 2
Python code for the CROify function using the discrete cosine transform

import numpy as np
from math import floor
from scipy.fftpack import dct

def CROify(A, U, P, tau):
    # A is the input vector
    # U is the size of the hash universe
    # P is a random permutation of the indices 0 .. U-1
    #   (e.g. np.random.permutation(U)), chosen once and for all
    #   and used in all CROify calculations
    # tau is the desired number of non-zero elements

    # compute the input space dimensionality
    d = A.size

    # generate and permute the extended vector E
    p = floor(U / d / 2)
    E = np.tile(np.concatenate([A, -A]), p)
    E = np.append(E, np.zeros(U - E.size))
    E = E[P]

    # return the (1-based) indices of the smallest tau members
    cf = dct(E, norm='ortho')
    return np.argpartition(cf, tau)[:tau] + 1
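For example, a single input vector can be hashed with the MNIST parameter values from Table 8; this usage snippet is ours and assumes the CROify function above is in scope.

import numpy as np

U, tau = 2**17, 1000                 # log2(U) = 17 and tau = 1000, as in Table 8 for MNIST
P = np.random.permutation(U)         # fixed once and reused for all input vectors
A = np.random.randn(780)             # a 780-dimensional input vector (MNIST dimensionality, Table 3)
hash_set = CROify(A, U, P, tau)
print(len(hash_set))                 # tau indices, drawn from the universe 1 .. U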

In Section 6.2 we compare the CRO feature map with Fastfood [11] and Tensor Sketching [12] with respect to the transformation time. We chose Fastfood and Tensor Sketching because they improve upon random binning and random Fourier features by Rahimi et al. [10] in terms of time and storage complexity [35], and they are closely related to the work presented here.

All experiments were conducted on an HPE ProLiant DL580 with 4 sockets, 60 cores, and 1.5 TB of RAM. Each socket has 15 cores and 384 GB of RAM.

6.1 Evaluation of CROification

CROification is the combination of the CRO feature map and linear SVM. We evaluate CROification on four publicly available datasets. Table 3 shows, for each of the datasets, the training set size, test set size, feature vector dimensionality, and number of classes.

TABLE 3
Dataset information

dataset    training set size   test set size   dimensionality   classes
mnist      60,000              10,000          780              10
w8a        49,749              14,951          300              2
covtype    500,000             81,012          54               2
TIMIT      1,385,426           506,113         826              39

We chose these datasets because the Gaussian kernel shows significant improvement in accuracy over the linear kernel. Thus, for example, we did not include the adult dataset, since for this dataset linear SVMs are as good as non-linear ones.

6.1.1 Evaluation Results

Table 4 presents the classification results on the MNIST dataset [3] for LIBLINEAR, LIBSVM, CROification, Fastfood, and Tensor Sketching. LIBLINEAR is used as the linear classifier for CROification, Fastfood, and Tensor Sketching.

TABLE 4
Classification results on the MNIST dataset

method             trans. time   training time (sec)   prediction time (ms)   error rate
LIBLINEAR          0             9.1                   0.001                  8.4%
CROification       11            12.6                  0.01                   1.5%
LIBSVM Gaussian    0             465                   10.4                   1.4%
Fastfood           28            265                   1.6                    1.9%
Tensor Sketching   14            254                   0.9                    2.6%

In comparison with Fastfood and Tensor Sketching, CROification delivers the fastest transformation time; however, the main advantage of CROification relates to the training time and, more importantly, the prediction time. The output of the CRO feature map is a sparse binary vector, whereas Fastfood and Tensor Sketching generate dense vectors. As a result, the training and prediction time for CROification is significantly lower than for Fastfood and Tensor Sketching. CROification performs similarly to LIBSVM in terms of prediction accuracy but is approximately 1000 times faster with respect to prediction time. CROification achieves 98.54% prediction accuracy whereas LIBSVM achieves 98.57%; however, CROification takes 12.6 seconds for SVM training compared to 465 seconds for LIBSVM, and 0.01 ms for prediction per test instance vs 10.4 ms for LIBSVM. LIBLINEAR has the fastest training and prediction time; however, its error rate is higher compared to the other approaches.

Table 5 presents the classification accuracy, training time, and prediction time for LIBSVM, LIBLINEAR, and CROification on the w8a and covtype datasets. The time reported in Table 5 for CROification includes only the time required for training and prediction.

TABLE 5
Classification prediction accuracy and processing time comparison
(training time in seconds, prediction time in µs per test instance)

           LIBLINEAR                        LIBSVM (Gaussian)                  CROification
dataset    training  prediction  accuracy   training  prediction  accuracy    training  prediction  accuracy
w8a        0.3       0.6         98.34%     33        193         99.39%      7.1       .1          99.45%
covtype    6.2       .03         67.14%     10228     821         84.42%      1         .1          85.67%

Table 6 presents the processing time breakdown for the entire CROification process, which includes mapping the input vectors into the sparse, high-dimensional feature vectors and then performing linear SVM. The total time required for CROification is much less than for LIBSVM with the Gaussian kernel.

TABLE 6
Processing time breakdown for CROification

           training time (sec)          prediction time per instance (ms)
dataset    trans.   train   total       trans.   predict   total
w8a        3.7      7.1     10.9        .07      0.001     .071
covtype    55       610     665         0.1      0.001     0.1

On the covtype dataset, CROification achieves higher accuracy than LIBSVM while requiring less time for training and prediction.

The accuracy of CROification is not sensitive to the values of U and τ. Table 7 shows the cross-validation accuracy on the MNIST dataset for different values of log₂(U) and τ.

TABLE 7
MNIST cross-validation accuracy (%) with CROification

           log₂(U)
τ          15       16       17       18       19
400        98.03    98.16    98.12    98.14    98.11
500        98.10    98.20    98.25    98.25    98.24
600        98.17    98.25    98.31    98.32    98.30
700        98.12    98.32    98.44    98.44    98.43
800        98.21    98.32    98.50    98.49    98.46
900        98.21    98.32    98.50    98.48    98.46
1000       98.20    98.37    98.54    98.53    98.52

We used the grid function from the LIBSVM toolbox to find the best cost (c) and gamma (g) parameter values for the experiments with LIBSVM. For LIBLINEAR, we chose the value of the parameter s (solver type) which achieves the highest accuracy; s = 1 is L2-regularized L2-loss dual support vector classification and s = 2 is L2-regularized L2-loss primal support vector classification.

For CROification, we performed cross-validation for each dataset to find the optimal values of U and τ. Table 8 presents the parameter values used in the experiments for each dataset.

TABLE 8
Parameter values for each dataset

           LIBLINEAR   LIBSVM              CROification
dataset    s           c        g          log₂(U)   τ
mnist      2           64       0.03125    17        1000
w8a        1           8        8          13        500
covtype    2           32768    8          16        500
TIMIT      2           1        .001       22        8192

6.1.2 CROification vs Deep Learning on TIMIT

The TIMIT acoustic-phonetic continuous speech corpus is widely used for the development and performance evaluation of speech recognition systems. The data from TIMIT includes recordings of 6300 utterances (5.4 hours) of eight dialects of American English (pronounced by 630 speakers, 10 sentences per speaker). The standard train/test split is 4620 train and 1680 test utterances. The phonetic and word transcriptions are time-aligned and include a 16 kHz, 16-bit waveform file for each utterance. All TIMIT data has been manually verified so that the corresponding labels match the phonemes. Due to the aforementioned attributes, TIMIT is especially suitable for phoneme classification.

Each utterance is converted to a sequence of 826-dimensional feature vectors of 25 mel-frequency cepstral coefficients (MFCC) with cepstral mean subtraction (including the cepstral C0 coefficient) plus delta and acceleration coefficients. Standard frames of length 25 ms shifted by 10 ms are used. The original set of 61 phonemes is collapsed into a smaller set of 39 phonemes. The mapping from 61 to 39 phonemes is defined as:

• aa ← ao
• ah ← ax ax-h
• er ← axr
• hh ← hv
• ih ← ix
• l ← el
• m ← em
• n ← en nx
• ng ← eng
• sh ← zh
• uw ← ux
• sil ← pcl tcl kcl bcl dcl gcl h# pau epi

Thus, the final set of phonemes becomes: aa, ao, ah, ax, ax-h, er, axr, hh, hv, ih, ix, l, el, m, em, n, en, nx, ng, eng, sh, zh, uw, ux, sil, pcl, tcl, kcl, bcl, dcl, gcl, h#, pau, epi, iy, eh, ey, ae, aw, ay, oy, ow, uh, jh, ch, b, d, g, p, t, k, dx, s, z, f, th, v, dh, r, w, y. Table 9 shows the training and test sets for the TIMIT dataset.

TABLE 9
TIMIT corpus summary

dataset   #utterances   #frames     #phonemes
train     4620          1,385,426   39
test      1680          506,113     39

In this paper, we are concerned with phoneme classification, where each frame is classified independently of the frames preceding or following it. In fact, we permute the training inputs so that all temporal relationships between phonemes are lost. This is in contrast with the phoneme recognition task, where the relationship between adjacent frames is taken into account, for example through a Hidden Markov Model.

For the deep learning classification results we conducted six experiments with different network topologies. It should be noted that our purpose in doing these experiments was to compare our results against a baseline standard deep neural network as implemented in Torch7. Torch is a scientific computing framework with wide support for machine learning algorithms that puts GPUs first. It is easy to use and efficient, thanks to an easy and fast scripting language, LuaJIT, and an underlying C/CUDA implementation. An NVIDIA Tesla K20m graphics card was used for running the experiments.

Table 12 presents the network topologies used in the six experiments. Stochastic gradient descent without weight initialization was used as the optimization method for minimizing the objective function in all experiments. Experiment #1 is taken as the base case and the remaining experiments differ from the base experiment in at most one specification.

Table 10 presents the results of all six experiments. For comparison with CROification we use the first experiment, as it gives the best results compared to the other five experiments.

TABLE 10
Classification results on the TIMIT dataset for different network topologies

experiment   accuracy           training time    prediction time (ms)
1            72.48%             4.5 hrs          0.86
2            71.46%             11 hrs 20 min    0.86
3            71.15%             11 hrs 20 min    0.86
4            Did not converge   –                –
5            Did not converge   –                –
6            71.88%             11 hrs 23 min    0.86

Table 11 presents the classification results on the TIMITdataset for LIBLINEAR, LIBSVM, deep learning, and CROifi-cation. The highest reported accuracy of 74.0% belongs to LIB-SVM with the Gaussian kernel; however, the training time is52 hours and the prediction time is 142 ms per test instance.CROification reports an accuracy of 73.3% with 4.4 hours oftotal training time and 0.14 ms prediction time per test instance(0.12 ms transformation time plus 0.02 ms for prediction). In otherwords, CROification prediction time is more than 1000 timesfaster than LIBSVM with an accuracy loss of 0.7%. The deeplearning experiment reported a 72.4% accuracy with 4.5 hours oftraining time and 0.86 ms prediction time per test instance. Theaccuracy of deep learning is comparable with that of CROification;however, CROification is more than 6 times faster for prediction.LIBLINEAR reported the least accuracy of 50.7% but with thefastest prediction time.It is important to note that for the timing results in Table 11:

• The predict function in LIBLINEAR has been modified to support multi-threaded processing.

• A multi-threaded approach has been used for LIBLINEAR and LIBSVM training and for the CRO transformation.

• The GPU-based deep learning method is parallelized over all available GPU cores.

We have not included Fastfood and Tensor Sketching in the results on the TIMIT dataset because they have scalability issues, for two reasons. First, the prediction time for these approaches grows linearly with the number of training instances [36]. Second, they require a dense feature space with large dimensionality when the input space dimensionality is large [25].

TABLE 11
Processing time and classification results on TIMIT dataset

method               training time (h)    pred. time (ms)    accuracy
LIBLINEAR            1.6                  0.003              50.7%
LIBSVM (Gaussian)    52                   142                74.0%
CROification         4.4                  0.14               73.3%
Deep Learning        4.5                  0.86               72.4%

6.2 Transformation time comparison

Classification using random feature mapping approaches requires three main steps: transformation, training, and prediction.


TABLE 12
Neural network topologies used for TIMIT dataset experiments

Topology #    Input Layer    1st Hidden    2nd Hidden    3rd Hidden    4th Hidden    Output Layer    Learning Rate    Activation Function
1             826            1000          1000          1000          –             39              0.001            ReLU
2             826            1000          1000          1000          –             39              0.0001           ReLU
3             826            1000          1000          1000          –             39              0.001            Tanh
4             826            1000          1000          1000          –             39              0.01             ReLU
5             826            1000          1000          1000          –             39              0.1              ReLU
6             826            1000          750           750           1000          39              0.001            ReLU

In this section we compare the CRO feature map with Fastfood [11] and Tensor Sketching [12] with respect to the time required for performing the transformation from the input space to the feature space.
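These three steps map onto a standard sparse linear-SVM workflow. The sketch below uses a placeholder cro_transform function that only mimics the output shape and sparsity of the CRO feature map described earlier in the paper (the real map is DCT-based), and trains LIBLINEAR through scikit-learn's LinearSVC.

import numpy as np
from scipy import sparse
from sklearn.svm import LinearSVC

def cro_transform(X, D=4096, tau=1000):
    """Placeholder for the CRO feature map: returns an n x D sparse matrix
    with tau non-zero entries per row. Only the shape and sparsity pattern
    of the real (DCT-based) map are imitated here, for illustration."""
    n = X.shape[0]
    rng = np.random.default_rng(0)
    rows = np.repeat(np.arange(n), tau)
    cols = rng.integers(0, D, size=n * tau)
    vals = np.ones(n * tau)
    return sparse.csr_matrix((vals, (rows, cols)), shape=(n, D))

# 1) transformation, 2) training, 3) prediction
X_train = np.random.randn(1000, 826)            # stand-in for normalized MFCC frames
y_train = np.random.randint(0, 39, size=1000)   # stand-in frame-level phoneme labels
Z_train = cro_transform(X_train)                # step 1: explicit sparse feature map
clf = LinearSVC().fit(Z_train, y_train)         # step 2: sparse high-dimensional linear SVM
X_test = np.random.randn(10, 826)
y_pred = clf.predict(cro_transform(X_test))     # step 3: transform, then linear prediction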

Figure 3 compares the CRO feature map, Fastfood, and Tensor Sketching in terms of transformation time with increasing input space dimensionality d. The input size n is equal to 10,000 and the feature dimensionality D is set to 4096. The results show that Tensor Sketching transformation time increases linearly with d, whereas Fastfood and the CRO feature map show a logarithmic dependence on d. Transformation via the CRO feature map is faster than Fastfood for all experimented values of d under the given parameter settings.

Fig. 3. Average trans. time with increasing input space dimensionality d

Figure 4 illustrates the linear dependency between transformation time and input size n for all three approaches. For this experiment we used face image data with dimensionality d = 1024. Also, the feature space dimensionality D was set to 4096. The results show that the transformation time for Fastfood and Tensor Sketching increases at a higher linear rate compared to the CRO feature map.

Figure 5 presents the transformation time comparison between the three approaches with increasing values of feature space dimensionality D. All three approaches show a linear increase in transformation time when increasing D; however, transformation using the CRO feature map has a smaller average rate of change.

Fig. 4. Total transformation time with increasing input size n

Fig. 5. Average trans. time as a func. of feature space dimensionality D

The space complexity of our approach is O(nτ), whereas Fastfood and Tensor Sketching both have a space complexity of O(nD). In most cases where the results are comparable we have D ≫ τ; for example, the parameter settings are τ = 1000 and D = 4096 for the results reported in Table 4.
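As a back-of-the-envelope illustration of this gap (n chosen arbitrarily for the example; τ and D as above):

# Rough storage comparison between the sparse CRO feature map and dense maps.
n, tau, D = 10_000, 1_000, 4_096      # n is illustrative; tau = 1000 and D = 4096 as above

sparse_entries = n * tau              # O(n*tau) non-zeros for the CRO feature map
dense_entries = n * D                 # O(n*D) entries for Fastfood / Tensor Sketching
print(sparse_entries, dense_entries, dense_entries / sparse_entries)
# 10000000 40960000 4.096  -> roughly a 4x saving at these settings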

As mentioned in Sections 2 and 6.1.2, and as the transformation time results show, Fastfood and Tensor Sketching have scalability limitations when the number of data instances and/or the dimensionality is large.


7 CONCLUSION

We introduced a highly efficient feature map that transforms vectors in the input space to sparse, high-dimensional vectors in the feature space, where the inner product in the feature space is the CRO kernel. We showed that the CRO kernel approximates the Gaussian kernel for unit-length input vectors. This allows us to use very efficient linear SVM algorithms that have been optimized for high-dimensional sparse vectors. The results show that we can achieve the same accuracy as non-linear Gaussian SVMs with linear training time and constant prediction time. This approach can enable many new time-sensitive applications where accuracy does not need to be sacrificed for training and prediction speed.

REFERENCES

[1] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi, “Kernel methods on Riemannian manifolds with Gaussian RBF kernels,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 37, no. 12, pp. 2464–2477, Dec 2015.
[2] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M. M. Cheng, S. L. Hicks, and P. H. S. Torr, “Struck: Structured output tracking with kernels,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2096–2109, Oct 2016.
[3] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[4] Y. Kong and Y. Fu, “Max-margin action prediction machine,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1844–1858, Sept 2016.
[5] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, vol. 9, pp. 1871–1874, June 2008.
[6] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. on Intelligent Systems and Technology, 2011.
[7] M. Ristin, M. Guillaumin, J. Gall, and L. V. Gool, “Incremental learning of random forests for large-scale image classification,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 490–503, March 2016.
[8] K. Eshghi and S. Rajaram, “Locality sensitive hash functions based on concomitant rank order statistics,” in KDD, 2008, pp. 221–229.
[9] M. Kafai, K. Eshghi, and B. Bhanu, “Discrete cosine transform locality-sensitive hashes for face retrieval,” IEEE Trans. on Multimedia, no. 4, pp. 1090–1103, June 2014.
[10] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” in Advances in Neural Information Processing Systems, 2007, pp. 1177–1184.
[11] Q. Le, T. Sarlos, and A. Smola, “Fastfood: Approximate kernel expansions in loglinear time,” in Int. Conf. on Machine Learning, 2013.
[12] N. Pham and R. Pagh, “Fast and scalable polynomial kernels via explicit feature maps,” in 19th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD), 2013, pp. 239–247.
[13] T. Joachims and C.-N. J. Yu, “Sparse kernel SVMs via cutting-plane training,” Machine Learning, vol. 76, no. 2-3, pp. 179–193, Sep 2009.
[14] N. Segata and E. Blanzieri, “Fast and scalable local kernel machines,” Journal of Machine Learning Research, vol. 11, pp. 1883–1926, 2010.
[15] I. W. Tsang, J. T. Kwok, and P.-M. Cheung, “Core vector machines: Fast SVM training on very large data sets,” Journal of Machine Learning Research, 2005, pp. 363–392.
[16] M. Nandan, P. P. Khargonekar, and S. S. Talathi, “Fast SVM training using approximate extreme points,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 59–98, 2014.
[17] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, “Training and testing low-degree polynomial data mappings via linear SVM,” Journal of Machine Learning Research, vol. 11, Aug 2010.
[18] A. Vedaldi and A. Zisserman, “Efficient additive kernels via explicit feature maps,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 480–492, 2012.
[19] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg, “Feature hashing for large scale multitask learning,” in Int. Conf. on Machine Learning (ICML), 2009, pp. 1113–1120.
[20] S. Litayem, A. Joly, and N. Boujemaa, “Hash-based support vector machines approximation for large scale prediction,” in British Machine Vision Conference, 2012, pp. 1–11.
[21] Y.-C. Su, T.-H. Chiu, Y.-H. Kuo, C.-Y. Yeh, and W. Hsu, “Scalable mobile visual classification by kernel preserving projection over high-dimensional features,” IEEE Trans. on Multimedia, vol. 16, no. 6, pp. 1645–1653, Oct 2014.
[22] L. Wu, I. E. Yen, J. Chen, and R. Yan, “Revisiting random binning features: Fast convergence and strong parallelizability,” in KDD, 2016.
[23] I. E. H. Yen, T.-W. Lin, S.-D. Lin, P. Ravikumar, and I. S. Dhillon, “Sparse random features algorithm as coordinate descent in Hilbert space,” in NIPS, 2014, pp. 2456–2464.
[24] P. Li, “Linearized GMM kernels and normalized random Fourier features,” in KDD, 2017, pp. 315–324.
[25] P.-S. Huang, H. Avron, T. N. Sainath, V. Sindhwani, and B. Ramabhadran, “Kernel methods match deep neural networks on TIMIT,” in Int. Conf. on Acoustics, Speech and Signal Processing, 2014, pp. 205–209.
[26] J. Garofolo, “TIMIT acoustic-phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.
[27] M. Raginsky and S. Lazebnik, “Locality-sensitive binary codes from shift-invariant kernels,” in Advances in Neural Information Processing Systems, 2009, pp. 1509–1517.
[28] A. J. Smola and B. Scholkopf, “A tutorial on support vector regression,” Statistics and Computing, vol. 14, no. 3, pp. 199–222, Aug 2004.
[29] O. A. Vasicek, “A series expansion for the bivariate normal integral,” Journal of Computational Finance, Jul. 2000.
[30] M. Sibuya, “Bivariate extreme statistics, I,” Annals of the Institute of Statistical Mathematics, vol. 11, no. 2, pp. 195–210, 1959.
[31] K. Eshghi and M. Kafai, “The CRO kernel: Using concomitant rank order hashes for sparse high dimensional randomised feature maps,” in Int. Conf. on Data Engineering, 2016.
[32] M. M. Siddiqui, “Distribution of quantiles in samples from a bivariate population,” Journal of Research of the National Institute of Standards and Technology, 1960.
[33] A. Dembinska, “Asymptotic behavior of central order statistics from stationary processes,” Stochastic Processes and their Applications, vol. 124, no. 1, pp. 348–372, 2014.
[34] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A Matlab-like environment for machine learning,” in BigLearn, NIPS Workshop, no. EPFL-CONF-192376, 2011.
[35] J. von Tangen Sivertsen, “Scalable learning through linearithmic time kernel approximation techniques,” Master’s thesis, 2014.
[36] Z. Lu, A. May, K. Liu, A. B. Garakani, D. Guo, A. Bellet, L. Fan, M. Collins, B. Kingsbury, M. Picheny, and F. Sha, “How to scale up kernel methods to be as good as deep neural nets,” CoRR, vol. abs/1411.4000, 2014. [Online]. Available: http://arxiv.org/abs/1411.4000

