
In-Database Learning with Sparse Tensors

Hung Q. Ngo, LogicBlox, Inc.
[email protected]

XuanLong Nguyen, University of Michigan
[email protected]

Dan Olteanu, University of Oxford
[email protected]

Maximilian Schleich, University of Oxford
[email protected]

March 15, 2017

Abstract

We introduce a unified framework for a class of optimization-based statistical learning problems used by LogicBlox retail-planning and forecasting applications, where the input data is given by queries over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis.

The main challenge posed by computing these problems is the large number of records and of categorical features in the input data, which leads to very large compute times or failure to process the entire data. We address this challenge with two orthogonal contributions. First, we introduce a sparse tensor representation and computation framework that allows for space and time complexity reduction when dealing with feature extraction queries that have categorical variables. Second, we exploit functional dependencies present in the database to reduce the dimensionality of the optimization problems. For degree-2 regression models, the interplay of the two techniques is crucial for scalability, as for typical applications such models can have thousands of parameters and require the computation of tens of millions of aggregates for gradient-based training methods.

We implemented our solution as an in-memory prototype and as an extension of the LogicBlox runtime engine. We benchmarked it against R, MADlib, and libFM for training degree-1 and degree-2 regression models on a real dataset in the retail domain with 84M tuples and 3700 categorical features. Our solution is up to three orders of magnitude faster than its competitors when they do not exceed memory limitations, the 22-hour timeout, or internal design limitations.

1 Introduction

There is an increasing interest in industry for in-database analytics [16, 17, 21, 27]. This is motivated by the realization that data usually resides inside databases and bringing the analytics closer to the data saves non-trivial time usually spent on data import/export at the interface between database systems and statistical packages. A second, increasingly more prominent realization is that large chunks of ML code can be rewritten into multiple aggregates and computed purely relationally inside the database [7, 8, 19, 30].

At LogicBlox, we are developing an in-database analytics solution for a host of optimization problems encountered in retail-planning and forecasting applications. Unifying analytics and databases fits naturally within the ethos of LogicBlox, which provides a unified programming language for OLAP, OLTP, graph data processing, mathematical optimization, and analytics, as well as a computing platform for hybrid workloads [1]. The aim of our solution is to support descriptive analytics (backward-looking), predictive analytics (forward-looking, such as classification and regression), and prescriptive analytics (also forward-looking; it usually takes the output of a predictive model as input).


In this paper, we introduce a unified framework for a class of such optimization-based statistical learning methods used in practice, such as ridge linear regression, polynomial regression, factorization machines, classification, and principal component analysis. We put a particular emphasis on degree-2 regression models, which are used by LogicBlox data scientists to capture more accurately the correlations in the data and thereby improve their prediction models. The typical data sources of interest in LogicBlox analytics applications are weekly sales data, promotions, and product descriptions. The input to analytics is the natural join of those data sources stored in a database. A prediction a retailer would like to compute is the additional demand generated for a given product due to promotion.

The main challenge encountered in computing these problems is the large number of records and of categorical features¹ in the input data. This can lead to failure to use the entire data and to a huge number of model parameters that come with prohibitively expensive compute times. We address this challenge with two orthogonal techniques.

First, we factorize the computation of the natural join of the data sources and of the aggregates required by the optimization problems. Our factorization technique draws on earlier work on factorized databases [23], in particular on the FAQ framework for efficient computation of aggregates over joins [18] and on factorized learning with continuous features [30]. We go beyond prior work as we need to compute large sets of group-by aggregates that capture the interactions of categorical features in the learned model.

Second, we exploit existing functional dependencies (FDs) to reduce the dimensionality of the underlying optimization problem. Prior work has shown that FDs can be exploited for Naïve Bayes classification and feature selection [20], though our reparameterizations of regression models under FDs are new.

For degree-2 regression models, the interplay of the two techniques is crucial for scalability, as for typical applications such models can have thousands of parameters and require the computation of tens of millions of aggregates for gradient-based training methods.

The contributions of our paper are as follows:

• We introduce a unified framework for a host of in-database optimization problems for statistical learning tasks (Section 3).

• We introduce a sparse tensor representation and computation framework that allows for space and time complexity reduction when dealing with feature extraction queries that have categorical variables (Section 4.2).

• We show how to exploit simple FDs to perform dimensionality reduction in a couple of common regression models (Section 4.3). An FD is simple if its left side consists of a single query variable.

• We implemented our solution as an in-memory prototype and as an extension of the LogicBlox engine.

• We benchmarked our solution against the open-source state-of-the-art systems R, MADlib, and libFM for training degree-1 and degree-2 regression models on a real dataset in the retail domain with 84M tuples and 3700 categorical features. Our solution is up to three orders of magnitude faster than its competitors when they do not exceed memory limitations, the 22-hour timeout, or internal design limitations (Section 5).

Related work. Most related efforts in the database community are on designing systems to support machine learning libraries on top of large-scale architectures, with the goal of providing a unified architecture for machine learning and databases [8], e.g., MLlib [21] and DeepDist [22] on Spark [31], GLADE [25], MADlib [16] on PostgreSQL, SystemML [4, 17], system benchmarking [5], and a sample generator for cross-validation learning [29]. Our contribution is orthogonal to these efforts. It is on a specific class of optimization problems and shows how database concepts and techniques, such as functional dependencies and factorized computation of aggregates and joins, can drastically improve the performance of in-database solutions to these problems. We next briefly discuss the prior works that are closest to ours.

¹ Most of the features in our clients' datasets are categorical.


Prior work on factorized joins and aggregates in the FDB [2, 23] and the FAQ [18] frameworks only considers one (group-by) aggregate at a time, whereas we need to consider a large number of them as they compute feature interactions. Our system's predecessor is F, a linear regression learner over factorized joins [30]. F only considers continuous features, linear models, and no functional dependencies. While one-hot encoded categorical features can be treated as continuous, this comes with the significant overhead of processing non-existing feature interactions (Section 5 reports on the performance of the DC variant of our system that subsumes F). Our approach avoids the static one-hot encoding of categorical features in the input data and instead performs the encoding on the fly, thereby avoiding the creation of irrelevant (zero-valued) feature interactions altogether.

A limited form of factorized learning of generalized linear models, which partially pushes gradient aggregates past key-foreign key joins, was first proposed by Kumar et al. [19]. More recent work uses functional dependencies to avoid key-foreign key joins and reduce the number of features in Naïve Bayes classification and feature selection [20], where it is pointed out how generalization performance may be improved (due to the reduced variance associated with the model class). The question of how feature reduction can be done effectively in a given model class is far from being resolved, especially when the size of the model class is controlled via a regularization function, as in our work. Moreover, we also focus on the resolution of functional dependencies so as to significantly speed up both the precomputation phase and the convergence of the learning algorithm.

Rendle introduced factorization machines on relational data [28], a practically useful model that factorizes the parameter space to better capture data correlations. Our approach combines parameter factorization with data factorization and exploits functional dependencies. Section 5 reports on the libFM learner for factorization machines.

2 Preliminaries

Throughout this paper, boldface letters, e.g., x, θ, xi, θj, denote vectors or matrices, and normal-face letters, e.g., xi, θj, θi^(j), denote scalars. For any positive integer n, [n] denotes the set {1, . . . , n}. For any set S and positive integer k, $\binom{S}{k}$ denotes the collection of all k-subsets of S. We use the following matrix operations: ⊗ denotes the Kronecker/tensor product, ◦ the Hadamard product, and 〈·, ·〉 the Frobenius inner product of two matrices, which reduces to the familiar vector inner product when the two matrices have one column each. (Thus, all norms in this paper are Frobenius norms, as it is convenient to express model parameters as matrices. In particular, the ℓ2-penalty term is the square of the Frobenius norm of the parameter.) Given a feature f, If denotes the identity matrix of dimension equal to the active domain size of f.

We utilize the following notational convention for tuples (i.e., vectors of values over some domain with a specific index set). Let S be any finite set and D be any domain; then aS = (aj)j∈S ∈ D^S is a tuple indexed by S, each of whose components takes a value in D. If S and T are disjoint index sets, then, given tuples aS and aT, the tuple (aS, aT) is interpreted naturally as the tuple aS∪T.

The tuple eS is the all-1 tuple indexed by S, and 0S is the all-0 tuple indexed by S. If S ⊆ G, then the tuple 1S|G is the characteristic vector of the subset S, i.e., 1S|G = (1v∈S)v∈G.

This paper makes extensive use of basic concepts and results from matrix calculus, summarized in the following. We will also discuss a connection between tensor computation and the FAQ framework.

2.1 Basics

We list here common identities that we use often in the paper; for more details see the Matrix Cookbook [24]. We use denominator layout for differentiation, i.e., the gradient is a column vector. Let A be a matrix and u, v, x, b be vectors, where A and b are independent of x, and u and v are functions of x. Then

$\frac{\partial\,\langle b, x\rangle}{\partial x} = b$   (1)

$\frac{\partial\, x^\top A x}{\partial x} = (A + A^\top)x$   (2)

$\frac{\partial\, \|Ax - b\|^2}{\partial x} = 2A^\top(Ax - b)$   (3)

$\frac{\partial\, u^\top v}{\partial x} = \frac{\partial u^\top}{\partial x}v + \frac{\partial v^\top}{\partial x}u$   (4)

$\frac{\partial\, (Bx + b)^\top C(Dx + d)}{\partial x} = B^\top C(Dx + d) + D^\top C^\top(Bx + b).$   (5)
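As a quick sanity check (ours, not part of the paper), the following NumPy snippet compares the analytic gradient from identity (3) against a central finite-difference approximation on random data:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = rng.standard_normal(5)
x = rng.standard_normal(3)

def f(v):
    # f(v) = ||Av - b||^2
    r = A @ v - b
    return r @ r

# Identity (3): the gradient of f is 2 A^T (Ax - b)
grad_analytic = 2 * A.T @ (A @ x - b)

eps = 1e-6
grad_numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(3)])

assert np.allclose(grad_analytic, grad_numeric, atol=1e-5)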

The following matrix inversion formulas will be useful. See [14] and references thereof for details.

Proposition 2.1. We have

$(B + UCV)^{-1} = B^{-1} - B^{-1}U(C^{-1} + VB^{-1}U)^{-1}VB^{-1}$   (6)

whenever all dimensions match up and the inverses on the right-hand side exist. In particular, the following holds when C = (1), U = 1, V = 1⊤, and J is the all-1 matrix:

$(B + J)^{-1} = B^{-1} - B^{-1}\mathbf{1}(1 + \mathbf{1}^\top B^{-1}\mathbf{1})^{-1}\mathbf{1}^\top B^{-1}.$   (7)

Another special case is

$(A + U^\top U)^{-1} = A^{-1} - A^{-1}U^\top(I + UA^{-1}U^\top)^{-1}UA^{-1}.$   (8)

An even more special case is the Sherman-Morrison formula, where U⊤ is just a vector u. The matrix A + uu⊤ is typically called a rank-1 update of A:

$(A + uu^\top)^{-1} = A^{-1} - \frac{A^{-1}uu^\top A^{-1}}{1 + u^\top A^{-1}u}.$   (9)
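A small numerical check of the Sherman-Morrison formula (9), added for illustration (NumPy, not from the paper):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # keep A well conditioned
u = rng.standard_normal(4)

Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + np.outer(u, u))                            # direct inverse of the rank-1 update
rhs = Ainv - (Ainv @ np.outer(u, u) @ Ainv) / (1 + u @ Ainv @ u)   # formula (9)

assert np.allclose(lhs, rhs)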

2.2 Tensors, Kronecker product, Khatri-Rao product

Next, we discuss some identities regarding tensors. We use ⊗ to denote the tensor product, which is not the same as the outer product, even though the two are isomorphic maps. If A is an m×n matrix and B is a p×q matrix, then the tensor product A⊗B is an mp×nq matrix. In particular, if x is an m-dimensional vector and y is a p-dimensional vector, then x⊗y is an mp-dimensional vector, not an m×p matrix as in the case of the outer product. This layout is the correct layout from the definition of the tensor (Kronecker) product. Generally:

Definition 1. Let A be a tensor of order r (i.e., a multilinear function ψA(X1, . . . , Xr)) and B be a tensor of order s (i.e., a multilinear function ψB(Y1, . . . , Ys)). Then the tensor product A ⊗ B is the multilinear function

ψ(X1, . . . , Xr, Y1, . . . , Ys) = ψA(X1, . . . , Xr) ψB(Y1, . . . , Ys).

Definition 2. Let A and B be two matrices, each with n columns. We use A ⋆ B to denote the matrix with n columns whose jth column is the tensor product of the jth column of A with the jth column of B. The operator ⋆ is a (special case of) the Khatri-Rao product [?], where we partition the input matrices into blocks of one column each.


Proposition 2.2. We have (if the dimensionalities match up correctly):

$(AB \otimes CD) = (A \otimes C)(B \otimes D)$   (10)

$(A \otimes B)^\top = A^\top \otimes B^\top$   (11)

$\langle x, By\rangle = \langle B^\top x, y\rangle$   (12)

$(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$, if both are square matrices   (13)

$\langle A \otimes B, RX \otimes SY\rangle = \langle R^\top A \otimes S^\top B, X \otimes Y\rangle.$   (14)

If x is a standard n-dimensional unit vector, A and B are two matrices with n columns each, and a and b are two n-dimensional vectors, then

$(A \otimes B)(x \otimes x) = (A \star B)x$   (15)

$\langle a \otimes b, x \otimes x\rangle = \langle a \circ b, x\rangle.$   (16)

Let x be a standard n-dimensional unit vector and A1, . . . , Ak be k matrices with n columns each. Then

$\Big(\bigotimes_{i=1}^{k} A_i\Big)x^{\otimes k} = \Big(\mathop{\star}_{i=1}^{k} A_i\Big)x.$   (17)

We note in passing that the first five identities are very useful in our dimension reduction techniques exploiting functional dependencies, while (15), (16), and (17) are instrumental in achieving computational reduction in our handling of categorical features.

Proof. The identities (10), (11), (12), and (13) can be found in the Matrix Cookbook [24]. Identity (14) follows from (10) and (11). To see (15), note that

(A ⊗ B)(x ⊗ x) = Ax ⊗ Bx = (A ⋆ B)x,

where the last equality follows from the following reasoning. Suppose xj = 1 for some j; then Ax = aj and Bx = bj, where aj and bj are the jth columns of A and B, respectively. Thus,

Ax ⊗ Bx = aj ⊗ bj = (A ⋆ B)j = (A ⋆ B)x.

Identities (16) and (17) are proved similarly, where (17) is a trivial generalization of (15).
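The following NumPy snippet (ours) checks identity (15); x is a standard unit (one-hot) vector, matching the assumption of the proposition, and the column-wise Khatri-Rao product is built explicitly:

import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((3, n))
B = rng.standard_normal((5, n))

x = np.zeros(n)
x[2] = 1.0   # standard unit vector

# Column-wise Khatri-Rao product A ⋆ B: the jth column is kron(A[:, j], B[:, j]).
AkrB = np.einsum('ij,kj->ikj', A, B).reshape(A.shape[0] * B.shape[0], n)

lhs = np.kron(A, B) @ np.kron(x, x)   # (A ⊗ B)(x ⊗ x)
rhs = AkrB @ x                        # (A ⋆ B) x
assert np.allclose(lhs, rhs)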

2.3 Tensor computation, FAQ-expression, and the InsideOut algorithm

Quite often we need to compute a product of the form (A⊗B)C, where A, B, and C are tensors, provided that their dimensionalities match up. For example, suppose A is an m×n matrix, B a p×q matrix, and C an nq×1 matrix (i.e., a vector). The result is an mp×1 tensor. The brute-force way of computing (A⊗B)C is to compute A⊗B first, taking Θ(mnpq) time, and then to multiply the result with C, for an overall runtime of Θ(mnpq). The brute-force algorithm is horribly inefficient.

The better way to compute (A⊗B)C is to view it as an FAQ expression [18] (a sum-product form): we think of A as a function ψA(x, y), B as a function ψB(z, t), and C as a function ψC(y, t). What we want to compute is the function

$\varphi(x, z) = \sum_{y}\sum_{t}\psi_A(x, y)\,\psi_B(z, t)\,\psi_C(y, t).$

This is nothing but a 4-cycle FAQ query, and we should pick between the following two strategies:

• eliminate t first (i.e., compute $\varphi_1(y, z) := \sum_t \psi_B(z, t)\psi_C(y, t)$, with a runtime of O(npq)), and then eliminate y (i.e., compute $\varphi(x, z) = \sum_y \varphi_1(y, z)\psi_A(x, y)$, in O(mnp) time). The overall runtime is thus O(np(m + q));


• or the symmetric strategy of eliminating y first, and then t for an overall runtime of O(mq(n+ p)).

This is not surprising, since the problem is just matrix chain multiplication; or, in the FAQ language of the InsideOut algorithm, we want to pick the best tree decomposition and then compute a variable elimination order out of it [18].

[Figure: the 4-cycle FAQ query drawn on the variables x (domain size m), y (n), t (q), and z (p), with factors ψA(x, y), ψC(y, t), ψB(z, t) and output ϕ(x, z).]

We shall see later that a special case of the above that occurs often is when B = I, the identity matrix. In that case, ψB(z, t) is the same as the atom z = t, and thus it serves as a change of variables:

$\varphi(x, z) = \sum_y\sum_t \psi_A(x, y)\psi_B(z, t)\psi_C(y, t) = \sum_y \psi_A(x, y)\psi_C(y, z).$

In other words, we only have to marginalize out one variable instead of two. This situation arises, for example, in Eq. (53) and Eq. (54) below.
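To make the cost gap concrete, here is a NumPy sketch (ours, not from the paper) that evaluates (A ⊗ B)C by the first elimination strategy, i.e., as a matrix chain, and checks it against the brute-force Kronecker expansion:

import numpy as np

rng = np.random.default_rng(3)
m, n, p, q = 3, 4, 5, 6
A = rng.standard_normal((m, n))   # psi_A(x, y)
B = rng.standard_normal((p, q))   # psi_B(z, t)
C = rng.standard_normal(n * q)    # psi_C(y, t), flattened in (y, t) order

# Brute force: materialize A ⊗ B (Theta(mnpq) space and time), then multiply.
brute = np.kron(A, B) @ C

# Eliminate t, then y: phi_1(y, z) = sum_t C(y, t) B(z, t),
# then phi(x, z) = sum_y A(x, y) phi_1(y, z). Cost O(npq + mnp).
phi1 = C.reshape(n, q) @ B.T    # n x p
phi = (A @ phi1).reshape(-1)    # flattened in (x, z) order

assert np.allclose(brute, phi)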

3 Problem formulation

3.1 The basic formulation

We consider solving an optimization problem of a particular form inside a database. Suppose we have p parameters θ = (θ1, . . . , θp) ∈ R^p. Let n denote the number of numeric features. In our formulation, there is a positive integer m and two vector-valued functions g : R^p → R^m and h : R^n → R^m. Each component function gj of g = (gj)j∈[m] is a multivariate polynomial. Each component function hj of h = (hj)j∈[m] is a multivariate monomial. A typical machine learning task is to solve θ∗ := arg min_θ J(θ), where J(θ) := Σ_{(x,y)∈D} L(〈g(θ), h(x)〉, y) + Ω(θ). Here L is some loss function (e.g., square loss), Ω is the regularizer (such as the ℓ1- or ℓ2-norm of θ), and D is the training dataset with features (regressors) x and response (regressand) y. For concreteness, with square loss and ℓ2-regularization, we want to find a minimizer of the following function:

$J(\theta) := \frac{1}{2|D|}\sum_{(x,y)\in D}\big(\langle g(\theta), h(x)\rangle - y\big)^2 + \frac{\lambda}{2}\|\theta\|_2^2.$   (18)

The above problem is pervasive in machine learning:

Example 1. The ridge linear regression (LR) model with response y and regressors x1, . . . , xn has p = n + 1 parameters θ = (θ0, . . . , θn). For convenience, we set x0 = 1, corresponding to the bias parameter θ0. The data points (x, y) = (x0, x1, . . . , xn, y) are taken from a dataset D. We would like to minimize the following loss function J(θ):

$\frac{1}{2|D|}\sum_{(x,y)\in D}\Big(\sum_{i=0}^{n}\theta_i x_i - y\Big)^2 + \frac{\lambda}{2}\|\theta\|_2^2.$

This is exactly of the form (18) with m = n + 1, where both g and h are identity functions: g(θ) = θ and h(x) = x.


Example 2. The degree-d polynomial regression (PRd) model with response y and regressors x0 = 1, x1, . . . , xn has p = m = 1 + n + n² + · · · + n^d parameters θ = (θa), where a = (a1, . . . , an) is a tuple of non-negative integers such that Σ_{i=1}^n ai ≤ d. In this case, g(θ) = θ and h is defined by the component functions ha(x) = Π_{i=1}^n xi^{ai}.

Example 3. The degree-2 rank-r factorization machines (FaMa2r) model with regressors x0 = 1, x1, . . . , xn and regressand y has parameters θ consisting of θi for i ∈ {0, . . . , n} and θi^(j) for i ∈ [n] and j ∈ [r]. Training FaMa2r corresponds to minimizing the following function J(θ):

$\frac{1}{2|D|}\sum_{(x,y)\in D}\Bigg(\sum_{i=0}^{n}\theta_i x_i + \sum_{\{i,j\}\in\binom{[n]}{2}}\sum_{\ell\in[r]}\theta_i^{(\ell)}\theta_j^{(\ell)} x_i x_j - y\Bigg)^2 + \frac{\lambda}{2}\|\theta\|_2^2.$

The number of parameters is p = 1 + n + rn. Let m = 1 + n + $\binom{n}{2}$. Then training FaMa2r corresponds precisely to optimizing (18) with the following g and h functions:

$h_S(x) = \prod_{i\in S} x_i, \quad \text{for } S \subseteq [n], |S| \le 2,$

$g_S(\theta) = \begin{cases}\theta_0 & \text{when } |S| = 0\\ \theta_i & \text{when } S = \{i\}\\ \sum_{\ell=1}^{r}\theta_i^{(\ell)}\theta_j^{(\ell)} & \text{when } S = \{i, j\}.\end{cases}$
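For readers who prefer code, here is a short NumPy sketch (ours, not from the paper) of the FaMa2r prediction 〈g(θ), h(x)〉 for continuous features; the pairwise term uses the standard O(nr) rewriting of the double sum, which is not specific to this paper:

import numpy as np

def fama_predict(x, theta0, theta_lin, theta_fac):
    # Degree-2, rank-r factorization machine prediction.
    # x         : (n,) feature vector (x0 = 1 is handled via theta0)
    # theta0    : scalar bias
    # theta_lin : (n,) linear weights theta_i
    # theta_fac : (n, r) factor matrix; row i holds theta_i^(1..r)
    linear = theta0 + theta_lin @ x
    # sum_{i<j} sum_l theta_i^(l) theta_j^(l) x_i x_j, computed as
    # 0.5 * sum_l ((sum_i v_il x_i)^2 - sum_i v_il^2 x_i^2)
    z = theta_fac.T @ x
    pairwise = 0.5 * (z @ z - ((theta_fac ** 2).T @ (x ** 2)).sum())
    return linear + pairwise

rng = np.random.default_rng(4)
n, r = 6, 2
print(fama_predict(rng.standard_normal(n), 0.1,
                   rng.standard_normal(n), rng.standard_normal((n, r))))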

Example 4. Classification methods such as support vector machines (SVM), logistic regression, and Adaboost also fall under the same optimization framework for J, but with different choices of loss L and regularizer Ω. Typically, Ω(θ) = (λ/2)‖θ‖₂². Restricting to binary class labels y ∈ {±1}, the loss function L(f, y), where f := 〈g(θ), h(x)〉, takes the form L(f, y) = max{1 − yf, 0} for SVM, L(f, y) = log(1 + e^{−yf}) for logistic regression, and L(f, y) = e^{−yf} for Adaboost.

Example 5. Various unsupervised learning techniques can be expressed as iterative optimization procedures in which each iteration is reduced to an optimization problem of the generic form given above. For example, principal component analysis (PCA) requires solving the following optimization problem to obtain a principal component direction:

$\max_{\|\theta\|=1}\theta^\top\Sigma\theta = \max_{\theta\in\mathbb{R}^p}\min_{\lambda\in\mathbb{R}}\ \theta^\top\Sigma\theta + \lambda(\|\theta\|^2 - 1),$

where $\Sigma := \frac{1}{|D|}\sum_{x\in D} xx^\top$ is the (empirical) correlation matrix of the given data. Although there is no response/class label y, within each iteration of the above procedure, for a fixed λ, there is a loss function L acting on the feature vector h(x) and parameter vector g(θ), along with a regularizer Ω. Specifically, we have h(x) = Σ ∈ R^{p×p}, g(θ) = θ ⊗ θ ∈ R^{p×p}, and L = 〈g(θ), h(x)〉_F, where the Frobenius inner product is now employed. In addition, Ω(θ) = λ(‖θ‖² − 1).

3.2 Dealing with categorical features

Categorical features. In the basic formulation, all features are continuous. In many practical applications, users store data in a relational database D and extract features via a relational query Q. The training dataset is then the query result Q(D). Most of the features are obtained from categorical attributes of the database. For concreteness, consider the following query, which is a highly simplified version of a feature extraction query we typically use at LogicBlox:

Q(sku, store, day, color, quarter, city, country, unitsSold) ←
    R1(sku, store, date, unitsSold), R2(sku, color),   (19)
    R3(date, quarter), R4(store, city), R5(city, country).


Relation R1 records the number of units of a given sku (stock keeping unit) sold at a store on a particular date. The retailer is a global business, so it has stores in different cities and countries. One objective, for example, is to predict the number of blue units to be sold next year in the Fall quarter in Berlin. In this case, the response is y = unitsSold, but all of the regressors are categorical variables. Categorical features constitute the vast majority (99%) of features we see in LogicBlox's machine learning applications.²

One-hot encoding. As is common practice in machine learning, all categorical features are assumed to be one-hot encoded [15]. Note that under one-hot encoding, each feature such as country is in fact a 0/1-vector xcountry indicating which country occurs in a given data point (tuple in the query result). For example, suppose vietnam, england, and usa are the only three countries occurring in the query result. Then the component xcountry is in fact a 3-dimensional vector xcountry = [xvietnam, xengland, xusa] with exactly one of the three values 1 and the others 0; if a particular data point has country = "england", then xcountry = [0, 1, 0].

To model the one-hot encoding of categorical features, we need a few more notations. Let H = (𝒱, E) denote the query Q's hypergraph, i.e., 𝒱 is the set of attributes occurring in Q, and E has a hyperedge A for each relation whose attributes form the set A. For example, in the query (19), 𝒱 consists of sku, store, etc., and the hypergraph has 5 edges, where the edge corresponding to relation R4 is {store, city}. There is also a subset V ⊆ 𝒱 which specifies the actual (categorical or not) features one wants to extract from the query. For example, in order to predict sales for next year, it does not make sense to use this year's date as a feature. In that case the feature set V does not contain date while 𝒱 does.

The query Q extracts from the database data points which are vectors (x, y) ∈ Q(D), where x = (xc)c∈V is a vector each of whose components is itself a vector (xc is xcity, for instance). If a feature, say salary, is not categorical, then it is treated as a 1-dimensional vector denoted by xsalary. Similarly, each component of the parameter vector θ becomes a matrix (or a vector if the matrix has one column).

Robustness of formulation. The problem formulation (18) is robust; it works with categorical features as well, as long as we replace the arithmetic product in the component functions of g and h by the tensor product. To explain this point clearly, we write the monomials hk more concretely:

$h_k(x) = \bigotimes_{f\in V} x_f^{e_k(f)},$   (20)

where ek = (ek(f))f∈V ∈ N^n. For each k ∈ [m], the set {f | ek(f) > 0} is the set of features whose interaction is modeled with the (hyper-)monomial hk. Let C denote the set of categorical features and define the sets Ck = {f | ek(f) > 0, f ∈ C}. Then hk represents Π_{f∈Ck} |πf(Q)| many monomials, one for each combination of the categories.³ Due to one-hot encoding, each element in the vector xf for a categorical feature f is either 1 or 0, and thus xf^e = xf for e > 0. We thus can simplify hk as

$h_k(x) = \prod_{f\in V - C_k} x_f^{e_k(f)} \cdot \bigotimes_{f\in C_k} x_f.$   (21)

(Note that we intentionally write the scalar xf instead of the vector xf when f ∈ V − Ck, because those features are non-categorical, i.e., they are continuous variables.)

Example 6. We elaborate on the subtlety of the tensor product. Consider a query that extracts tuples (country, a, b, c, color) from the database, where country and color are categorical features. Also, in the output there are two countries, vietnam and england, and three colors, red, green, and blue. Consider three "interaction" functions:

h1(x) = xcountry ⊗ xa²xc   (22)
h2(x) = xcountry ⊗ xcolor ⊗ xb   (23)
h3(x) = xbxc.   (24)

² It does not make sense to join over float-type attributes.
³ πf(Q) denotes the projection of Q onto attribute f.


Since the feature space under one-hot encoding is actually

(vietnam, england, a, b, c, red, green, blue),

Equation (21) says that the functions h1 and h2 are actually encoding 8 functions:

h1,vietnam(x) = xvietnam xa² xc
h1,england(x) = xengland xa² xc
h2,vietnam,red(x) = xvietnam xred xb
h2,vietnam,green(x) = xvietnam xgreen xb
h2,vietnam,blue(x) = xvietnam xblue xb
h2,england,red(x) = xengland xred xb
h2,england,green(x) = xengland xgreen xb
h2,england,blue(x) = xengland xblue xb.

We next exemplify the tensor product for three models.

Example 7. In linear regression, the parameter θ is a vector of vectors: θ = [θ0, . . . , θn]. Since our inner product is the Frobenius inner product, when computing 〈θ, x〉 we would be multiplying, for example, θusa with xusa correspondingly.

Example 8. In polynomial regression, the parameter θ is a vector of tensors (i.e., high-dimensional matrices). For example, consider the second-order term θij xi xj. When both i and j are non-categorical, θij is just a scalar. Now suppose i is country and j is quarter. Then the model has terms of the form θvietnam,fall xvietnam xfall, θusa,winter xusa xwinter, and so on. All these terms are captured by the Frobenius inner product 〈θij, xi ⊗ xj〉. The component θij is a matrix whose number of entries is the number of pairs (country, quarter) that appear together in some tuple of the query result (which can be much smaller than the product of the number of countries and the number of quarters in the input database).
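A tiny NumPy illustration (ours) of why the Frobenius inner product with one-hot vectors touches a single entry of θij:

import numpy as np

n_country, n_quarter = 3, 4
theta_ij = np.arange(n_country * n_quarter, dtype=float).reshape(n_country, n_quarter)

x_country = np.zeros(n_country); x_country[1] = 1.0   # say, "usa"
x_quarter = np.zeros(n_quarter); x_quarter[3] = 1.0   # say, "winter"

# <theta_ij, x_country ⊗ x_quarter>_F picks out exactly the (1, 3) entry;
# viewing the tensor product as an outer-product matrix does not change the value.
frobenius = np.sum(theta_ij * np.outer(x_country, x_quarter))
assert frobenius == theta_ij[1, 3]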

Example 9. Consider the FaMa model from Example 3, but now with categorical variables. From the previous examples, we already know how to interpret the linear part Σ_{i=0}^n θi xi of the model when features are categorical. Consider a term in the quadratic part such as Σ_{ℓ=1}^r θi^(ℓ) θj^(ℓ) xi xj. When i and j are categorical, the term becomes

$\Big\langle \sum_{\ell=1}^{r}\theta_i^{(\ell)}\otimes\theta_j^{(\ell)},\ x_i\otimes x_j \Big\rangle.$

4 Algorithms

We apply batch gradient descent (BGD) to optimize the loss function J(θ) over the dataset defined by the query result of a query Q and database D. In a BGD algorithm, two computations are repeated a given number of runs or until convergence of the parameters θ: (1) Point evaluation: given θ, compute the scalar J(θ); and (2) Gradient computation: given θ, compute the vector ∇J(θ). The basic structure of the algorithm is shown in Algorithm 1.

We refer the reader to the excellent review article [12] for more details on fast implementations of the gradient descent method. In particular, our implementation used the "adaptive" Barzilai-Borwein step size adjustment [3] recommended in [9].

4.1 Precomputations

To illustrate the main idea, we first consider the simplest case, where there are no categorical features: we rewrite (18) to factor out the data-dependent part of these two computations.


Algorithm 1: BGD with Armijo line search.

θ ← a random point;
while not converged yet do
    α ← next starting step size   // Barzilai-Borwein
    d ← ∇J(θ);
    while ( J(θ − αd) ≥ J(θ) − (α/2)‖d‖₂² ) do
        α ← α/2   // line search
    end
    θ ← θ − αd;
end
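A direct Python transcription of Algorithm 1 (a sketch; J and grad_J are assumed to be callables, e.g., built from the precomputed Σ and c of Theorem 4.1 below, and the Barzilai-Borwein rule is simplified here to a fixed starting step):

import numpy as np

def bgd_armijo(J, grad_J, theta0, alpha0=1.0, max_iters=200, tol=1e-8):
    # Batch gradient descent with Armijo backtracking line search (Algorithm 1).
    theta = theta0.copy()
    for _ in range(max_iters):
        d = grad_J(theta)
        if np.linalg.norm(d) < tol:   # convergence test
            break
        Jt = J(theta)
        alpha = alpha0
        # Backtrack while the decrease is insufficient.
        while J(theta - alpha * d) >= Jt - 0.5 * alpha * (d @ d):
            alpha /= 2.0
        theta = theta - alpha * d
    return theta

# Usage on ridge linear regression, i.e., (18) with g and h the identity:
rng = np.random.default_rng(5)
X = rng.standard_normal((100, 3)); y = X @ np.array([1.0, -2.0, 0.5])
lam = 0.1
J = lambda t: 0.5 / len(y) * np.sum((X @ t - y) ** 2) + 0.5 * lam * (t @ t)
grad = lambda t: X.T @ (X @ t - y) / len(y) + lam * t
print(bgd_armijo(J, grad, np.zeros(3)))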

Theorem 4.1. Let J(θ) be the function defined in (18). Define the matrix Σ = (σij)i,j∈[m], the vector c = (ci)i∈[m], and the scalar sY by

$\Sigma = \frac{1}{|Q(D)|}\sum_{(x,y)\in Q(D)} h(x)h(x)^\top$   (25)

$c = \frac{1}{|Q(D)|}\sum_{(x,y)\in Q(D)} y\cdot h(x)$   (26)

$s_Y = \frac{1}{2|Q(D)|}\sum_{(x,y)\in Q(D)} y^2.$   (27)

Then,

$J(\theta) = \frac{1}{2}g(\theta)^\top\Sigma\,g(\theta) - \langle g(\theta), c\rangle + s_Y + \frac{\lambda}{2}\|\theta\|^2$   (28)

$\nabla J(\theta) = \frac{\partial g(\theta)^\top}{\partial\theta}\Sigma\,g(\theta) - \frac{\partial g(\theta)^\top}{\partial\theta}c + \lambda\theta.$   (29)

Proof. We start with point evaluation:

$\frac{1}{2|Q|}\sum_{(x,y)\in Q}(\langle g(\theta), h(x)\rangle - y)^2 = \frac{1}{2|Q|}\sum_{(x,y)\in Q}\big(\langle g(\theta), h(x)\rangle^2 - 2y\langle g(\theta), h(x)\rangle + y^2\big)$

$= \frac{1}{2|Q|}\sum_{(x,y)\in Q} g(\theta)^\top\big(h(x)h(x)^\top\big)g(\theta) - \Big\langle g(\theta),\ \frac{1}{|Q|}\sum_{(x,y)\in Q} y\,h(x)\Big\rangle + \frac{1}{2|Q|}\sum_{(x,y)\in Q} y^2$

$= \frac{1}{2}g(\theta)^\top\Big(\frac{1}{|Q|}\sum_{(x,y)\in Q} h(x)h(x)^\top\Big)g(\theta) - \langle g(\theta), c\rangle + s_Y$

$= \frac{1}{2}g(\theta)^\top\Sigma\,g(\theta) - \langle g(\theta), c\rangle + s_Y.$

The gradient formula follows straightforwardly from (28) and the chain rule.

Note that ∂g(θ)⊤/∂θ is a p×m matrix and Σ is an m×m matrix. Statistically, Σ is related to the covariance matrix, c to the correlation between the response and the regressors, and sY to the empirical second moment of the response variable. Theorem 4.1 allows us to compute the two key steps of BGD without scanning through the data again, because the quantities (Σ, c, sY) can be computed efficiently in a preprocessing step inside the database as aggregates over the query Q. We explain this point further below and in the next section.

When g is the identity function, i.e., the model is linear, as is the case for the PR (and thus LR) model, equations (28) and (29) become particularly simple:

Corollary 4.2. In a linear model (i.e., g(θ) = θ),

$J(\theta) = \frac{1}{2}\theta^\top\Sigma\theta - \langle\theta, c\rangle + s_Y + \frac{\lambda}{2}\|\theta\|_2^2$   (30)

$\nabla J(\theta) = \Sigma\theta + \lambda\theta - c.$   (31)

Let d = ∇J(θ). The Armijo condition J(θ − αd) ≥ J(θ) − (α/2)‖d‖₂² is equivalent to the following:

$\alpha\,\theta^\top\Sigma d - \frac{\alpha^2}{2}d^\top\Sigma d - \alpha\langle c, d\rangle + \lambda\alpha\langle\theta, d\rangle \le \frac{\alpha}{2}(\lambda\alpha + 1)\|d\|_2^2.$   (32)

Lastly,

$\nabla J(\theta - \alpha d) = (1 - \lambda\alpha)d - \alpha\Sigma d.$   (33)

Proof. From (30) we have

$J(\theta) - J(\theta - \alpha d) = \frac{1}{2}\theta^\top\Sigma\theta - \frac{1}{2}(\theta - \alpha d)^\top\Sigma(\theta - \alpha d) - \langle\theta, c\rangle + \langle\theta - \alpha d, c\rangle + \frac{\lambda}{2}\|\theta\|^2 - \frac{\lambda}{2}\|\theta - \alpha d\|^2$

$= \frac{1}{2}\theta^\top\Sigma\theta - \frac{1}{2}\big(\theta^\top\Sigma\theta - 2\alpha\theta^\top\Sigma d + \alpha^2 d^\top\Sigma d\big) - \alpha\langle d, c\rangle + \lambda\alpha\langle\theta, d\rangle - \frac{\lambda\alpha^2}{2}\|d\|^2$

$= \alpha\,\theta^\top\Sigma d - \frac{\alpha^2}{2}d^\top\Sigma d - \alpha\langle d, c\rangle + \lambda\alpha\langle\theta, d\rangle - \frac{\lambda\alpha^2}{2}\|d\|^2.$

The significance of (32) is as follows. In a typical iteration of BGD, we have to backtrack a few times (say t times) for each value of α. If we were to recompute J(θ − αd) using (30) each time, then the runtime of the Armijo backtracking search would be O(tm²), even after we have already computed d and J(θ). Now, using (32), we can compute in advance the following quantities (in this order): d, ‖d‖₂², Σd, 〈c, d〉, 〈θ, d〉, d⊤Σd, θ⊤Σd. After that, each check of inequality (32) can be done in O(1) time, for a total of O(m² + t) time. Once we have determined the step size α, equation (33) allows us to compute the next gradient (i.e., the next d) in linear time, because we have already computed Σd for the line search. As we shall see later, with categorical variables the time saving from these simple observations is very significant.
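A sketch (ours) of one BGD step for the linear model that follows this recipe: each halving of α costs O(1) thanks to (32), and the next gradient comes from (33) without another pass over Σ. Sigma, c, lam, and theta are assumed to be available from the precomputation step.

import numpy as np

def backtrack_step(Sigma, c, lam, theta, alpha0=1.0):
    # One BGD step with O(1)-per-check Armijo backtracking, per (31)-(33).
    d = Sigma @ theta + lam * theta - c   # gradient (31)
    dd = d @ d
    if dd == 0.0:                         # already at a stationary point
        return theta, d, 0.0
    # Precompute once the scalars appearing in (32).
    Sd = Sigma @ d
    dSd = d @ Sd
    tSd = theta @ Sd
    cd = c @ d
    td = theta @ d
    alpha = alpha0
    # Keep halving alpha while the Armijo condition (32) holds.
    while (alpha * tSd - 0.5 * alpha**2 * dSd - alpha * cd + lam * alpha * td
           <= 0.5 * alpha * (lam * alpha + 1.0) * dd):
        alpha /= 2.0
    theta_next = theta - alpha * d
    next_grad = (1.0 - lam * alpha) * d - alpha * Sd   # formula (33), reuses Sd
    return theta_next, next_grad, alpha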

To implement BGD, we need to precompute the covariance matrix Σ and the correlation vector c in order to evaluate (29) and (28). We do not need to compute the second moment sY, because optimizing J(θ) is the same as optimizing J(θ) − sY. If all attributes of Q(D) are continuous features, then Σ = (σij)_{i,j=1}^m is a matrix and c = (ci)_{i=1}^m is a vector defined more precisely by

$\sigma_{ij} = \frac{1}{|Q(D)|}\sum_{(x,y)\in Q(D)} h_i(x)h_j(x)$   (34)

$c_j = \frac{1}{|Q(D)|}\sum_{(x,y)\in Q(D)} y\cdot h_j(x).$   (35)

Here, hj denotes the jth component function of the vector-valued function h, and hj is a multivariate monomial in x. Hence, it is easy to see that both expressions (34) and (35) are Sum-Product expressions as explained in the Functional Aggregate Queries framework [18]. These aggregates can be computed using known algorithms [18, 30] in time O(N^fhtw), where fhtw denotes the fractional hypertree width [13] of the input query Q's hypergraph H, and N is the input database size, for an overall runtime of O(m² N^fhtw).
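For intuition only (and not the in-database computation advocated here), Σ, c, and sY from (34), (35), and (27) are plain averages over the feature-mapped data. With H denoting the |Q(D)| × m matrix whose rows are h(x), one NumPy line each suffices:

import numpy as np

rng = np.random.default_rng(6)
N, m = 1000, 5
H = rng.standard_normal((N, m))   # rows are h(x) for each (x, y) in Q(D)
y = rng.standard_normal(N)

Sigma = H.T @ H / N               # equation (34) / (25)
c = H.T @ y / N                   # equation (35) / (26)
s_Y = 0.5 * np.mean(y ** 2)       # equation (27)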


4.2 Tensor representation and computation

The more interesting, more common, and also considerably more challenging situation is when some attributes of Q(D) are categorical features. We next explain how we model and deal with categorical features in the precomputation of Σ and c. The main idea can be explained with our running Example 6.

Example 10. In Example 6, the matrix Σ is of size 8×8 instead of 3×3 after one-hot encoding. However, many of those entries are 0; for instance, for all x ∈ Q(D):

h1,vietnam(x) h1,england(x) = 0
h1,england(x) h2,vietnam,blue(x) = 0
h2,vietnam,blue(x) h2,england,blue(x) = 0
h2,vietnam,blue(x) h2,vietnam,red(x) = 0.

The reason is that the indicator variables xblue or xengland are like selection clauses xcolor = blue or xcountry = england. Thus, we can rewrite an entry σij as an aggregate over a more selective query:

$\sum_{x\in Q(D)} h_{1,\mathrm{vietnam}}(x)\,h_{2,\mathrm{vietnam,red}}(x) = \sum_{\phi} x_a^2 x_c x_b,$

where φ := (x ∈ Q(D) ∧ xcolor = red ∧ xcountry = vietnam).

Extrapolating straightforwardly, if we were to write Σ down in the one-hot encoded feature space, then each entry σij would explode into many entries. More concretely, σij is in fact a tensor σij of dimension Π_{f∈Ci}|πf(Q)| × Π_{f∈Cj}|πf(Q)|, because

$\sigma_{ij} = \frac{1}{|Q(D)|}\sum_{(x,y)\in Q(D)} h_i(x)\otimes h_j(x).$   (36)

Similarly, cj in (35) is a tensor cj of dimension Π_{f∈Cj}|πf(Q)|, since the monomial hj(x) is a matrix in the categorical case.

4.2.1 Sparse tensor representation

The previous example demonstrates how the dimensionalities of σij and cj can be extremely large. Fortunately, the tensor is very sparse, and a sparse representation of it can be computed with one aggregate query, as the following proposition shows. For any event E, let 1E denote the Kronecker delta, i.e., 1E = 1 if E holds, and 1E = 0 otherwise. Recall that the input query Q has hypergraph H = (𝒱, E), and there is an input relation RF for every hyperedge F ∈ E.

Proposition 4.3. The tensor σij can be sparsely represented by an aggregate query with group-by attributes Ci ∪ Cj. In the FAQ framework, the query is expressed as a Sum-Product query with free (i.e., group-by) variables:

$\varphi(x_{C_i\cup C_j}) = \sum_{x_f:\ f\notin C_i\cup C_j}\ \prod_{f\notin C_i\cup C_j} x_f^{e_i(f)+e_j(f)}\ \prod_{F\in E} 1_{\pi_F(x)\in R_F}.$   (37)

Similarly, the tensor cj can be sparsely represented by an aggregate query with group-by attributes Cj, which is expressed as the Sum-Product query

$\varphi(x_{C_j}) = \sum_{x_f:\ f\notin C_j\cup\{y\}} y\ \prod_{f\notin C_j} x_f^{e_j(f)}\ \cdot\ \prod_{F\in E} 1_{\pi_F(x)\in R_F}.$   (38)

Example 11. Consider the query Q in (19), where the set of features is {sku, store, day, color, quarter, city, country} and unitsSold is the response variable. In this query n = 7, and thus for a PR2 model we have m = 1 + 7 + 7² = 57 parameters. Consider two indices i and j of the component functions of g and h, where i = (store, city) and j = (city). Suppose the query result states that the retailer has Ns stores in Nc cities. Then the full dimensionality of the tensor σij is Ns × Nc², because by definition

$\sigma_{ij} := \frac{1}{|Q(D)|}\sum_{(x,y)\in Q(D)}\underbrace{x_{\mathrm{store}}\otimes x_{\mathrm{city}}}_{h_i(x)}\otimes\underbrace{x_{\mathrm{city}}}_{h_j(x)}.$   (39)

Recall that xstore and xcity are both indicator vectors. Consequently, the above tensor has the following straightforward interpretation: for every triple (s, c1, c2), where s is a store and c1 and c2 are cities, this entry of the tensor counts the number of data points (x, y) ∈ Q(D) with this particular combination of store and cities (divided by |Q(D)|). It is clear that most of these (s, c1, c2) entries are 0. For example, if c1 ≠ c2 then the count is zero. Thus, we can concentrate on computing entries of the form (s, c, c):

SELECT s, c, count(*) FROM Q GROUP BY s, c;

Better yet, since store functionally determines city, the number of entries in the output of the above query is bounded by Ns. Using database relations to represent sparse tensors results in a massive amount of space saving.
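The same sparse representation can be mimicked outside the database with a counter keyed by the group-by attributes; a small Python sketch (ours), where rows stands in for a toy Q(D):

from collections import Counter

# Each (hypothetical) result tuple carries a store and the city it belongs to.
rows = [
    {"store": "s1", "city": "berlin", "unitsSold": 3},
    {"store": "s1", "city": "berlin", "unitsSold": 5},
    {"store": "s2", "city": "paris",  "unitsSold": 2},
]

# Sparse representation of sigma_ij for i = (store, city), j = (city):
# only entries of the form (s, c, c) can be non-zero, so group by (store, city).
# Dividing the counts by len(rows) would give the 1/|Q(D)| normalization.
sigma_ij = Counter((r["store"], r["city"]) for r in rows)
print(sigma_ij)   # Counter({('s1', 'berlin'): 2, ('s2', 'paris'): 1})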

4.2.2 Computing the aggregates in Σ and c

We employ three orthogonal ideas to compute the queries (37) and (38) efficiently. First, our FAQ [18] and FDB [30] frameworks are designed to compute such aggregates over feature extraction queries, which are wider than traditional OLAP queries.

Proposition 4.4. Let faqw(i, j) denote the FAQ-width of the query corresponding to σij, and let S(i, j) denote the size of the sparse representation, i.e., the number of tuples used to represent the tensor σij. Then the worst-case runtime for computing Σ and c is

$O\Big(\sum_{i,j\in[m]}\big(N^{\mathrm{faqw}(i,j)} + S(i,j)\big)\log N\Big).$

The notion of FAQ-width is beyond the scope of this paper. The tensor (39) has faqw(i, j) = 1 and S(i, j) = O(N).

Second, we exploit the observation that in the computation of Σ many distinct tensors σij have identical sparse representations. Consider, for example, the tensor σij from Example 11 corresponding to i = (store, city) and j = (city). Following equation (37), all of the following tensors have the exact same sparse representation: (i, j) ∈ {((store, city), city), ((store, store), city), ((store, city), store), . . .}. There are 12 tensors sharing this particular sparse representation. Formula (37) tells us which tensors share the same representation: the output function ϕ is identified by the tuple (Ci ∪ Cj, (ei(f) + ej(f))_{f∈V−Ci∪Cj}). Tensors sharing the same tuple share the same sparse representation. In the tuple, Ci ∪ Cj identifies the set of group-by variables, and the exponents (ei(f) + ej(f))_{f∈V−Ci∪Cj} specify which function we want to compute for those group-by variables.

Third, we employ a sparse representation of tensors in the parameter space. We need to evaluate the component functions of g, which are polynomial. In the FaMa2r example, for instance, we evaluate expressions of the form

$g_{(\mathrm{store,\,city})}(\theta) = \sum_{\ell=1}^{r}\theta_{\mathrm{store}}^{(\ell)}\otimes\theta_{\mathrm{city}}^{(\ell)}.$   (40)

The result is a 2-way tensor whose CP-decomposition (a sum of rank-1 tensors) is already given! There is no point in materializing the result of g(store, city)(θ); we instead keep it as is. If we were to materialize it, we would end up with an Ω(N²)-sized result for absolutely no gain in computational and space complexity, while the space complexity of the CP-decomposition is only O(N). This is a prime example of factorization of the parameter space.


4.2.3 Point evaluation and gradient computation

Finally, we explain how to evaluate the expressions (28) and (29) with our sparse tensor representations. There are two aspects of our solution worth spelling out: (1) how to multiply two tensors, e.g., σij and θj, and (2) how to exploit the fact that some tensors have the same representation in order to speed up the point evaluation and gradient computation.

To answer question (1), we need to know the intrinsic dimension of the tensor σij. In order to compute Σg(θ) in Example 11, we need to multiply σij with gj(θ) for i = (store, city) and j = (city). In a linear model, gj(θ) = θj = θcity. In this case, when computing σij θcity we marginalize away one city dimension of the tensor, while keeping the other two dimensions (store, city). This is captured by the following query:

SELECT store, city, sum(σi,j.val ∗ θj.val)
FROM σi,j, θj WHERE σi,j.city = θj.city
GROUP BY store, city;

where the tensors σi,j and θj map (store, city) and, respectively, (city) to aggregate values. In words, σij gj(θ) is computed by a group-by aggregate query where the group-by variables are precisely the variables in Ci.

For the second question, we use the CP-decomposition of the parameter space as discussed earlier. Suppose now we are looking at the tensor σij where i = (city) and j = (store, city). Note that this tensor has a representation identical to that of the above tensor, but it is a different tensor. In a FaMa2r model, we would want to multiply this tensor with the component function gj(θ) defined in (40) above. We do so by multiplying it with each of the terms θstore^(ℓ) ⊗ θcity^(ℓ), one by one for ℓ = 1, . . . , r, and then adding up the results. Multiplying the tensor σij with the first term θstore^(1) ⊗ θcity^(1) corresponds precisely to the following query:

SELECT city, sum(σi,j.val ∗ θ(1)store.val ∗ θ(1)city.val)
FROM σi,j, θ(1)store, θ(1)city
WHERE σi,j.city = θ(1)city.city AND σi,j.store = θ(1)store.store
GROUP BY city;

where the tensors σi,j, θ(1)city, and θ(1)store map (store, city), (city), and, respectively, (store) to aggregate values.

Finally, to answer question (2), note that for the same column j (i.e., the same component function gj(θ)), there can be multiple tensors σij with identical sparse representations. (This is especially true in degree-2 models.) In such cases, we have queries with identical from-where blocks but different select-group-by clauses, because the tensors have different group-by variables. Nevertheless, all such queries can share computation, as we can compute the from-where clause once for all of them and then scan this result to compute each specific tensor.

4.3 FD-based dimensionality reduction

Consider a query Q in which country and city are two attributes representing features. For simplicity, assume that there are only two countries, "vietnam" and "england", and five cities, "saigon", "hanoi", "london", "leeds", and "bristol". Under one-hot encoding, the corresponding features are of the form xvietnam, xengland, xsaigon, xhanoi, xlondon, xleeds, xbristol. Since city → country is an FD, we know that, for a given tuple x in the query result, the following identities hold:

xvietnam = xsaigon + xhanoi   (41)
xengland = xlondon + xleeds + xbristol.   (42)

The first identity states that if a tuple in the query result has vietnam as the value for country (xvietnam = 1), then it can only have as its value for city either saigon (xsaigon = 1 and xhanoi = 0) or hanoi (xhanoi = 1 and xsaigon = 0). The second identity has a similar explanation.⁴

The linear relationship only holds if both variables are categorical. An FD on categorical and continuous features may not be linear in general, e.g., the FD SSN → salary, where SSN is categorical but salary is continuous.

Where do these identities come from? We can extract in a preprocessing step from the database a relation of the form R(city, country) with city as a primary key. How do we express identities such as (41) and (42) formally in terms of the input vectors xcity and xcountry? Suppose there are Ncity cities and Ncountry countries. The predicate R(city, country) can be modeled mathematically as an Ncountry × Ncity matrix R: if xcity is an indicator vector for saigon, then Rxcity is an indicator vector for vietnam.⁵ In this language, the above identities are written as xcountry = Rxcity. For example, in the particular example above, Ncity = 5, Ncountry = 2, and

              saigon  hanoi  london  leeds  bristol
R =  vietnam     1      1       0      0       0
     england     0      0       1      1       1

This relationship suggests a natural but powerful technique: we may replace any occurrence of the statistic xcountry by the quantity xcity that functionally determines it. Since these quantities appear in the loss L only via the inner products 〈g(θ), h(x)〉, such replacements result in a (typically) linear reparameterization of the loss. What happens next is less obvious, due to the presence of the nonlinear penalty function Ω. Depending on the specific structure of the FDs and the choice of Ω, it turns out that many parameters associated with redundant statistics, which do not affect the loss L, can be optimized out directly with respect to the transformed Ω penalty. Our reparameterization technique thus entails a number of benefits: (1) the number of required precomputations is reduced drastically; (2) the transformed parameter space is also reduced in dimension, which potentially helps to speed up the convergence in the optimization phase.

The remainder of this subsection is a gentle introduction to our idea in the presence of one simple FD. A general development of this technique will be presented in its full glory in the sequel, by accounting for more general FD structures in the context of PR and FaMa learning problems. Consider a query Q in which city and country are two of the categorical features, where city functionally determines country via a matrix R such that Rxcity = xcountry for all x = (· · · , xcity, xcountry, · · · ) ∈ Q. We use this linear correlation to eliminate xcountry from the model as follows:

$\langle g(\theta), h(x)\rangle = \langle\theta, x\rangle = \sum_{j\notin\{\mathrm{city,\,country}\}}\langle\theta_j, x_j\rangle + \langle\theta_{\mathrm{city}}, x_{\mathrm{city}}\rangle + \langle\theta_{\mathrm{country}}, x_{\mathrm{country}}\rangle$

$= \sum_{j\notin\{\mathrm{city,\,country}\}}\langle\theta_j, x_j\rangle + \langle\theta_{\mathrm{city}}, x_{\mathrm{city}}\rangle + \langle\theta_{\mathrm{country}}, Rx_{\mathrm{city}}\rangle$

$= \sum_{j\notin\{\mathrm{city,\,country}\}}\langle\theta_j, x_j\rangle + \big\langle\underbrace{\theta_{\mathrm{city}} + R^\top\theta_{\mathrm{country}}}_{\gamma_{\mathrm{city}}},\ x_{\mathrm{city}}\big\rangle.$

⁴ Composite FDs lead to more complex identities. For instance, the FD (guest, hotel, date) → room leads to the identity xroom = Σ xguest xhotel xdate. We do not explore such polynomial correlations in this paper.

⁵ The predicate is a sparse representation of the matrix!

Reparameterize the model by defining γ = (γj)j∈V−{country} and two new functions g : R^{n−1} → R^{n−1} and h : R^n → R^{n−1} by

$\gamma_j = \begin{cases}\theta_j & j\notin\{\mathrm{city,\,country}\}\\ \theta_{\mathrm{city}} + R^\top\theta_{\mathrm{country}} & j = \mathrm{city}\end{cases}$   (43)

$g(\gamma) = \gamma$   (44)

$h_j(x) = x_j,\quad j\ne\mathrm{country}.$   (45)

(Note again that there is no γcountry.) Then we can rewrite J(θ) in terms of γ:

$J(\theta) = \frac{1}{2|Q|}\sum_{(x,y)\in Q}\big(\langle g(\theta), h(x)\rangle - y\big)^2 + \frac{\lambda}{2}\|\theta\|_2^2$

$= \frac{1}{2|Q|}\sum_{(x,y)\in Q}\big(\langle g(\gamma), h(x)\rangle - y\big)^2 + \frac{\lambda}{2}\Big(\sum_{j\ne\mathrm{city}}\|\gamma_j\|^2 + \|\gamma_{\mathrm{city}} - R^\top\theta_{\mathrm{country}}\|^2 + \|\theta_{\mathrm{country}}\|^2\Big).$

Note how θcountry has disappeared from the loss term but still remains in the penalty term. Since the penalty is data-independent, θcountry can be optimized out easily. The partial derivative of J with respect to θcountry is

$\frac{2}{\lambda}\frac{\partial J}{\partial\theta_{\mathrm{country}}} = R(R^\top\theta_{\mathrm{country}} - \gamma_{\mathrm{city}}) + \theta_{\mathrm{country}}.$   (46)

By setting (46) to 0 we obtain θcountry in terms of γcity: θcountry = R(I + R⊤R)⁻¹γcity, where I is the identity matrix. Hence J can be expressed completely in terms of γ, and its gradient with respect to γ is also available:

$J(\theta) = \frac{1}{2|Q|}\sum_{(x,y)\in Q}\big(\langle g(\gamma), h(x)\rangle - y\big)^2 + \frac{\lambda}{2}\Big(\sum_{j\ne\mathrm{city}}\|\gamma_j\|^2 + \big\langle(I + R^\top R)^{-1}\gamma_{\mathrm{city}},\ \gamma_{\mathrm{city}}\big\rangle\Big),$

$\frac{1}{2}\frac{\partial\|\theta\|_2^2}{\partial\gamma_j} = \begin{cases}\gamma_j & j\ne\mathrm{city}\\ (I + R^\top R)^{-1}\gamma_{\mathrm{city}} & j = \mathrm{city}.\end{cases}$

The gradient of the loss term is computed using the matrix Σ and the vector c with respect to the dimensionality-reduced pair (g, h). The matrix (I + R⊤R) is a rank-Ncountry update of the identity matrix, strictly positive definite and thus invertible. (The inverse can be obtained using the Sherman-Morrison-Woodbury identity [14].) It is important to keep in mind the database interpretation of the product R⊤R: this is an Ncity × Ncity binary matrix whose rows and columns are indexed by cities, with a 1 at entry (c1, c2) iff both cities belong to the same country. It should be obvious that, using a ternary predicate with two keys and one value, a sparse representation of (I + R⊤R) can be computed using database queries.
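To make the reparameterization concrete, here is a NumPy sketch (ours, with a made-up city-to-country mapping) that builds R from the FD relation, checks xcountry = Rxcity, and recovers θcountry from γcity as derived above:

import numpy as np

cities = ["saigon", "hanoi", "london", "leeds", "bristol"]
countries = ["vietnam", "england"]
fd = {"saigon": "vietnam", "hanoi": "vietnam",
      "london": "england", "leeds": "england", "bristol": "england"}

# Build the N_country x N_city matrix R from the FD relation R(city, country).
R = np.zeros((len(countries), len(cities)))
for j, city in enumerate(cities):
    R[countries.index(fd[city]), j] = 1.0

# One-hot "leeds" maps to one-hot "england": x_country = R x_city.
x_city = np.zeros(len(cities)); x_city[cities.index("leeds")] = 1.0
assert np.allclose(R @ x_city, [0.0, 1.0])

# Optimize theta_country out: theta_country = R (I + R^T R)^{-1} gamma_city,
# then recover theta_city from gamma_city = theta_city + R^T theta_country.
gamma_city = np.arange(1.0, 6.0)
M = np.eye(len(cities)) + R.T @ R   # (c1, c2) entry of R^T R is 1 iff same country
theta_country = R @ np.linalg.solve(M, gamma_city)
theta_city = gamma_city - R.T @ theta_country
print(theta_country, theta_city)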

4.4 Polynomial regression under sets of simple FDs

The variable elimination and reparameterization techniques described above will now be developed in full generality; they help to simplify and substantially speed up the training of PR and FaMa under an arbitrary set of simple FDs. This section deals with PRd, the polynomial regression model of degree d. Towards this end, we need to establish a few notations and terminologies.

Groups of simple FDs. To formalize general FD structures, consider a query Q in which there are k groups of disjoint categorical features G1, . . . , Gk. In group i, there is a feature fi which functionally determines the remaining features in the group. We write Gi = {fi} ∪ Si, where fi → Si is an FD. For example, in a typical data modeling query we see at a LogicBlox retailer customer, we have k = 3 groups (among other attributes): in the first group we have day → week → month → quarter → year, and thus f1 = day and S1 = {week, month, quarter, year}. In the second group, f2 = sku and S2 = {type, color, size, . . .} (this is a rather large group). In the third group, f3 = store and S3 = {city, country, region, continent}. We shall refer to these as groups of simple FDs in the statements of the theorems below. Similarly to the setting above, for each feature c ∈ Si, let Rc denote the matrix for which xc = Rc xfi. For the sake of brevity, we also define a matrix Rfi = Ifi (the identity matrix of dimension equal to the active domain size of attribute fi); this way, the equality Rc xfi = xc holds for every c ∈ Gi.

[Figure 1: Groups of simple FDs G1 = {f1} ∪ S1, . . . , Gk = {fk} ∪ Sk inside the variable set V, with F = {f1, . . . , fk} the set of group determinants. G = G1 ∪ · · · ∪ Gk.]

The PRd model. Recall how we reduced the PRd model to our problem formulation in Example 2. We now make that reduction more formal. Consider the set AV of all tuples aV = (aw)w∈V ∈ N^V of non-negative integers such that ‖aV‖1 ≤ d. One can think of this set as a (database) |V|-ary relation over the non-negative integer domain. For any x ∈ Q and a ∈ AV, define

$x^{\otimes a} := \bigotimes_{v\in V} x_v^{\otimes a_v}.$

In the PRd model we have θ = (θa)_{‖a‖1≤d}, g(θ) = θ, and ha(x) = x^{⊗a}. If a feature, say v ∈ V, is non-categorical, then x_v^{⊗a_v} = x_v^{a_v}; we simply go back to the traditional multiplication in place of the tensor product.

If we knew xv ∈ {0, 1}, then x_v^{a_v} = xv, and thus there would be no need to have terms for which av > 1. A similar situation occurs when v is a categorical variable. To see this, let us consider a simple query where V = {b, c, w, t} and all four variables are categorical. Suppose the model PRd has a term corresponding to a = (ab, ac, aw, at) = (0, 2, 0, 1). The term of 〈θ, h(x)〉 indexed by the tuple a is of the form

$\langle\theta_a,\ x_c^{\otimes 2}\otimes x_t\rangle = \langle\theta_a,\ x_c\otimes x_c\otimes x_t\rangle.$

For the dimensionalities to match up, θa is a third-order tensor, say indexed by (i, j, k). The above expression can be simplified as

$\sum_i\sum_j\sum_k\theta_a(i, j, k)\,x_c(i)\,x_c(j)\,x_t(k) = \sum_j\sum_k\theta_a(j, j, k)\,x_c(j)\,x_t(k),$

where the equality holds due to the fact that xc(j) is idempotent (and, since xc is an indicator vector, xc(i)xc(j) = 0 for i ≠ j). In particular, we only need the entries of θa indexed by (j, j, k). Equivalently, we write

$\langle\theta_a,\ x_c\otimes x_c\otimes x_t\rangle = \big\langle\big((I_c\star I_c)^\top\otimes I_t\big)\theta_a,\ x_c\otimes x_t\big\rangle.$

Multiplying on the left by the matrix (Ic ⋆ Ic)⊤ ⊗ It has precisely the same effect as selecting out only the entries θa(j, j, k) from the tensor θa. More generally, in the PRd model we can assume that every index tuple aV = (av)v∈V satisfies the condition that av ∈ {0, 1} whenever v is categorical. (This is in addition to the degree requirement ‖aV‖1 ≤ d.)

17

Page 18: In-Database Learning with Sparse Tensors · Rendle introduced factorization machines on relational data [28], a practically useful model that factorizes the parameter space to better

FD-reduced pairs of functions

Definition 3. Given a pair of functions g and h in our problem setting (recall that the C_j were defined in Subsection 3.2, while the S_k were given in the previous section), define

T := {j ∈ [m] | C_j ∩ (S_1 ∪ ··· ∪ S_k) ≠ ∅}.

(In words, T is the set of component functions of h containing at least one functionally determined variable.) The groups of simple FDs induce an FD-reduced pair of functions ḡ : ℝ^{m−|T|} → ℝ^{m−|T|} and h̄ : ℝ^n → ℝ^{m−|T|}, specified as follows: the component functions of h̄ are obtained from the component functions of h by removing all component functions h_j with j ∈ T; similarly, ḡ is obtained from g by removing all component functions g_j with j ∈ T. Naturally, define the covariance matrix Σ̄ and the correlation vector c̄ as in (25) and (26), but with respect to h̄.

Given k groups of FDs represented by G_1, ..., G_k, let G = ⋃_{i=1}^k G_i, S = ⋃_{i=1}^k S_i, Ḡ = V − G, S̄ = V − S, and F = {f_1, ..., f_k}. For every non-empty subset T ⊆ [k], define F_T := {f_i | i ∈ T}. Given a natural number h < d and a non-empty set T ⊆ [k] with size |T| ≤ d − h, define the collection

U(T, h) := { U | U ⊆ G ∧ U ∩ G_i ≠ ∅ ∀i ∈ T ∧ U ∩ G_i = ∅ ∀i ∉ T ∧ |U| ≤ d − h }.   (47)

For every tuple a_Ḡ ∈ ℕ^Ḡ with ‖a_Ḡ‖_1 = h < d, every i ∈ T, and every U ∈ U(T, h), define the matrices

B_{T,h,i} := ∑_{U∈U(T,h)} ( [⋆_{c∈U∩G_i} R_c]^⊤ [⋆_{c∈U∩G_i} R_c] ),   (48)

R_{a_Ḡ,U} := ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} ⋆_{c∈U∩G_i} R_c.   (49)
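For intuition, the collection U(T, h) can be enumerated directly; the sketch below (a naive enumeration over toy groups, not how the engine represents these index sets) lists all U ⊆ G that intersect exactly the groups in T and have size at most d − h.

```python
from itertools import combinations

def U_of(groups, T, h, d):
    """Enumerate U(T, h): subsets U of G that hit G_i exactly for i in T
    and satisfy |U| <= d - h.  `groups` maps i -> set of attributes G_i."""
    ground = sorted(set().union(*(groups[i] for i in T)))   # U may only touch groups in T
    out = []
    for r in range(len(T), d - h + 1):                      # need >= 1 element per group in T
        for U in combinations(ground, r):
            if all(set(U) & groups[i] for i in T):
                out.append(set(U))
    return out

# Toy groups: G_1 = {day, week, month}, G_2 = {store, city}; degree d = 2, h = 0.
groups = {1: {"day", "week", "month"}, 2: {"store", "city"}}
print(U_of(groups, T={1}, h=0, d=2))       # singletons and 2-subsets of G_1
print(U_of(groups, T={1, 2}, h=0, d=2))    # one attribute from G_1 and one from G_2
```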

With all relevant quantities in place, we are ready to establish the following general theorem for in-database learning of ridge polynomial regression of arbitrary degree (PRd, d ≥ 1).

Theorem 4.5. Consider the PRd model with k groups of simple FDs G_i = {f_i} ∪ S_i, i ∈ [k]. Let θ = (θ_{a_V})_{‖a_V‖_1≤d} be the original parameters. Define the following reparameterization:

γ_{b_S̄} = θ_{(b_Ḡ, 0_G)}                                        if b_F = 0_F,
γ_{b_S̄} = ∑_{U∈U(T,h)} R_{b_Ḡ,U}^⊤ θ_{(b_Ḡ, 1_{U|G})}            where T = {j ∈ [k] | b_{f_j} = 1}, h = ‖b_Ḡ‖_1, otherwise.   (50)

Then, minimizing J(θ) is equivalent to minimizing the function

J̄(γ) = ½ γ^⊤ Σ̄ γ − ⟨γ, c̄⟩ + (λ/2) Ω(γ),   (51)

where

Ω(γ) = ∑_{‖b_S̄‖_1≤d, ‖b_F‖_1=0} ‖γ_{b_S̄}‖² + ∑_{‖b_Ḡ‖_1=h, h<d} ∑_{T⊆[k], 0<|T|≤d−h} ⟨ (⊗_{w∈Ḡ: b_w>0} I_w ⊗ ⊗_{i∈T} B_{T,h,i}^{-1}) γ_{(b_Ḡ, 1_{F_T|F})}, γ_{(b_Ḡ, 1_{F_T|F})} ⟩.   (52)

Remark 1. The proof of this theorem is deferred to Section 4.6. In the above theorem, note that J̄ is defined with respect to the FD-reduced pair of functions ḡ, h̄ and the reduced parameter space of γ. Its gradient is very simple to compute, due to the fact that

½ ∂Ω(γ)/∂γ_{b_S̄} = γ_{b_S̄}                                                                  if b_F = 0_F,
½ ∂Ω(γ)/∂γ_{b_S̄} = (⊗_{w∈Ḡ: b_w>0} I_w ⊗ ⊗_{i∈T} B_{T,h,i}^{-1}) γ_{(b_Ḡ, 1_{F_T|F})}          where T = {j ∈ [k] | b_{f_j} = 1}, h = ‖b_Ḡ‖_1, otherwise.   (53)


Moreover, once a minimizer γ of J̄ is obtained, we can compute a minimizer θ of J by setting

θ_{a_V} = γ_{a_S̄}                                                                                       if ‖a_G‖_1 = 0,
θ_{a_V} = (⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} [⋆_{c∈U∩G_i} R_c] B_{T,h,i}^{-1}) γ_{(a_Ḡ, 1_{F_T|F})}            if ‖a_G‖_1 > 0, where h = ‖a_Ḡ‖_1, T = {i | ∃c ∈ G_i: a_c > 0}, and U = {c ∈ G | a_c > 0}.   (54)

4.4.1 Special case 1: LR model

Theorem 4.5 might be a bit difficult to grasp at first glance due to its generality. This section explores the theorem in the special case of (ridge) linear regression (PR1), while the next subsection explores the special case of degree-2 polynomial regression (PR2).

Let us first specialize expressions (47), (48), and (49). We start with (47). Since d = 1, the only valid choice of h is 0, and |T| = 1. If T = {j}, then U ∈ U(T, h) iff U = {c} for some c ∈ G_j; in other words, we can replace U(T, h) by G_j itself. Next, consider (49): there is only one valid choice of a_Ḡ, namely the all-zero vector, and for U = {c} with c ∈ G_j the matrix R_{a_Ḡ,U} is exactly R_c. Lastly, when T = {j}, the sum (48) becomes ∑_{c∈G_j} R_c^⊤ R_c. We have the following corollary:

Corollary 4.6. Consider an LR model with k groups of simple FDs G_i = {f_i} ∪ S_i, i ∈ [k]. Let θ = (θ_w)_{w∈V} be the original parameters. Define the following reparameterization:

γ_w = θ_w                             if w ∈ V − G,
γ_w = ∑_{c∈G_i} R_c^⊤ θ_c              if w = f_i ∈ F.

Then, minimizing J(θ) is equivalent to minimizing the function J̄(γ) := ½ γ^⊤ Σ̄ γ − ⟨γ, c̄⟩ + (λ/2) Ω(γ), where Ω(γ) := ∑_{w∈V∖G} ‖γ_w‖² + ∑_{i=1}^k ⟨B_i^{-1} γ_{f_i}, γ_{f_i}⟩, and the matrix B_i for each i ∈ [k] is given by

B_i := ∑_{c∈G_i} R_c^⊤ R_c.   (55)

J̄ is defined with respect to the FD-reduced pair of functions ḡ, h̄ and the reduced parameter space of γ. Its gradient is very simple to compute, where we specialize (53):

½ ∂Ω(γ)/∂γ_w = γ_w                 if w ∈ V − G,
½ ∂Ω(γ)/∂γ_w = B_i^{-1} γ_{f_i}      if w = f_i ∈ F.   (56)

Moreover, once a minimizer γ of J̄ is obtained, following (54), we can compute a minimizer θ of J by setting

θ_w = γ_w                            if w ∈ V ∖ G,
θ_w = R_w B_i^{-1} γ_{f_i}             if w ∈ G_i, i ∈ [k].
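As a small sanity check of Corollary 4.6 (with made-up domain sizes and a single group; this is a toy in-memory sketch, not the in-database computation), the snippet below forms B_i, maps a consistent parameter vector θ to γ_{f_i}, and verifies that the recovery θ_w = R_w B_i^{-1} γ_{f_i} round-trips.

```python
import numpy as np

rng = np.random.default_rng(1)

# One group G_i = {f_i} ∪ S_i with f_i = store and S_i = {country} (toy FD).
n_store = 5
R_store = np.eye(n_store)                      # R_{f_i} = identity
R_country = np.zeros((2, n_store))
for a, j in enumerate([1, 1, 0, 0, 0]):        # hypothetical store -> country map
    R_country[j, a] = 1.0
Rs = {"store": R_store, "country": R_country}

# B_i = sum_{c in G_i} R_c^T R_c   (equation (55)).
B = sum(R.T @ R for R in Rs.values())

# A "consistent" parameter setting: theta_c = R_c theta_{f_i} for every c in G_i.
theta_f = rng.normal(size=n_store)
theta = {c: R @ theta_f for c, R in Rs.items()}

# Reparameterization of Corollary 4.6: gamma_{f_i} = sum_{c in G_i} R_c^T theta_c.
gamma_f = sum(R.T @ theta[c] for c, R in Rs.items())

# Recovery: theta_w = R_w B^{-1} gamma_{f_i}; for consistent theta this is a round trip.
theta_rec = {c: R @ np.linalg.solve(B, gamma_f) for c, R in Rs.items()}
for c in Rs:
    assert np.allclose(theta_rec[c], theta[c])
```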

4.4.2 Special case 2: PR2 model

In this section we explore Theorem 4.5 for the special case of degree-2 polynomial regression. This case is significant for three reasons. First, due to the explosion in the number of parameters, in practice one rarely runs polynomial regression of degree higher than 2; in fact, PR2 may be a sufficiently rich nonlinear regression model for many real-world applications. Second, this is technically already a highly non-trivial application of our general theorem. Third, this case shares some commonality with the FaMa²ᵣ model to be described in the next section.

As before, we first specialize expressions (47), (48), and (49). To do so, we change the indexing scheme of the model a bit. In the general model, we use a with ‖a‖_1 ≤ d to index parameters. When the model is of degree 2, we explicitly write down the two types of indices: we use θ_w, w ∈ V, instead of θ_a with ‖a‖_1 = 1, and we use θ_{cw}, c, w ∈ V, instead of θ_a with ‖a‖_1 = 2.

We start with (47). Since d = 2, the two valid choices of h are 0 and 1.

• When h = 1, T = {i} for some i ∈ [k]. The set U({i}, 1) is the collection of singleton subsets of G_i. Hence, this is similar to the linear regression situation.

• When h = 0, T is either {i} or {i, j}. The set U({i, j}, 0) consists of all 2-subsets U of G for which U contains one element from G_i and one from G_j. The set U({i}, 0) contains all singletons and 2-subsets of G_i.

From this analysis, we can write down (48) explicitly (also recall the definition of B_i in (55)):

B_{{i},1,i} = ∑_{c∈G_i} R_c^⊤ R_c = B_i,
B_{{i,j},0,i} = B_i,
B_{{i,j},0,j} = B_j,
B_{{i},0} = ∑_{c∈G_i} R_c^⊤ R_c + ∑_{{c,t}∈\binom{G_i}{2}} (R_c ⋆ R_t)^⊤ (R_c ⋆ R_t).

Next, consider (49): there are two valid shapes for the pair (a_Ḡ, U).

• When ‖a_Ḡ‖_1 = 0, either U ∈ U({i, j}, 0) or U ∈ U({i}, 0), and

R_{∅,{c,t}} = R_c ⊗ R_t        for (c, t) ∈ G_i × G_j,
R_{∅,{c}} = R_c               for c ∈ G_i,
R_{∅,{c,t}} = R_c ⋆ R_t        for {c, t} ∈ \binom{G_i}{2}.

• When ‖a_Ḡ‖_1 = 1, U ∈ U({i}, 0) for some i ∈ [k]; writing w ∈ Ḡ for the unique attribute with a_w > 0,

R_{w,{c}} = I_w ⊗ R_c.

Corollary 4.7. Consider the PR2 model with k groups of simple FDs G_i = {f_i} ∪ S_i, i ∈ [k]. Let θ = ((θ_w)_{w∈V}, (θ_{cw})_{c,w∈V}) be the original parameters, and G = ⋃_{i∈[k]} G_i. Define the following reparameterization:

γ_w = θ_w                                                                          if w ∈ V ∖ G,
γ_w = ∑_{c∈G_i} R_c^⊤ θ_c + ∑_{{c,t}∈\binom{G_i}{2}} (R_c ⋆ R_t)^⊤ θ_{ct}             if w = f_i, i ∈ [k];   (57)

γ_{tw} = θ_{tw}                                                                    if {t, w} ⊆ V ∖ G,
γ_{tw} = ∑_{c∈G_i} (I_w ⊗ R_c^⊤) θ_{wc}                                             if t = f_i, w ∉ G,
γ_{tw} = ∑_{(c,c')∈G_i×G_j} (R_c ⊗ R_{c'})^⊤ θ_{cc'}                                 if {t, w} = {f_i, f_j}, {i, j} ∈ \binom{[k]}{2}.   (58)


Then, minimizing J(θ) is equivalent to minimizing the function J̄(γ) := ½ γ^⊤ Σ̄ γ − ⟨γ, c̄⟩ + (λ/2) Ω(γ), where

Ω(γ) := ∑_{w∉G} ‖γ_w‖² + ∑_{c∉G, t∉G} ‖γ_{ct}‖² + ∑_{i=1}^k ⟨B_{{i},0}^{-1} γ_{f_i}, γ_{f_i}⟩
        + ∑_{i∈[k], w∉G} ⟨(I_w ⊗ B_i^{-1}) γ_{wf_i}, γ_{wf_i}⟩ + ∑_{{i,j}∈\binom{[k]}{2}} ⟨(B_i^{-1} ⊗ B_j^{-1}) γ_{f_i f_j}, γ_{f_i f_j}⟩.

The gradient of J̄ is very simple to compute, by noticing that J̄ is defined with respect to the FD-reduced pair of functions ḡ, h̄ and the reduced parameter space of γ. It can be obtained by specializing (53):

½ ∂Ω(γ)/∂γ_w = γ_w                            if w ∉ G,
½ ∂Ω(γ)/∂γ_w = B_{{i},0}^{-1} γ_{f_i}           if w = f_i;   (59)

½ ∂Ω(γ)/∂γ_{tw} = γ_{tw}                               if {t, w} ∩ {f_i}_{i=1}^k = ∅,
½ ∂Ω(γ)/∂γ_{tw} = (I_w ⊗ B_i^{-1}) γ_{wf_i}             if t = f_i, w ∉ G,
½ ∂Ω(γ)/∂γ_{tw} = (B_i^{-1} ⊗ B_j^{-1}) γ_{f_i f_j}      if {t, w} = {f_i, f_j}.   (60)

Moreover, once a minimizer γ of J̄ is obtained, following (54), we can compute a minimizer θ of J by setting

θ_w = γ_w                                   if w ∈ V ∖ G,
θ_w = R_w B_{{i},0}^{-1} γ_{f_i}              if w ∈ G_i, i ∈ [k];

θ_{ct} = (R_c ⋆ R_t) B_{{i},0}^{-1} γ_{f_i}    for all {c, t} ∈ \binom{G_i}{2};

θ_{cw} = γ_{cw}                              if {c, w} ⊆ V ∖ G,
θ_{cw} = (I_w ⊗ R_c B_i^{-1}) γ_{wf_i}         if c ∈ G_i, w ∉ G, i ∈ [k];

θ_{ct} = (R_c B_i^{-1} ⊗ R_t B_j^{-1}) γ_{f_i f_j}    for (c, t) ∈ G_i × G_j.
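The degree-2 recovery relies on the column-wise product R_c ⋆ R_t. A minimal sketch of that operation (toy matrices; khatri_rao is our own helper name, not an API of the system) is:

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise product A * B: column a of the result is A[:, a] (x) B[:, a]."""
    assert A.shape[1] == B.shape[1]
    return np.einsum("ia,ja->ija", A, B).reshape(A.shape[0] * B.shape[0], A.shape[1])

# Toy R_c, R_t over the same determinant f_i (5 f_i-values; 2 and 3 target values).
R_c = np.array([[1, 1, 0, 0, 0],
                [0, 0, 1, 1, 1]], dtype=float)
R_t = np.array([[1, 0, 0, 0, 1],
                [0, 1, 1, 0, 0],
                [0, 0, 0, 1, 0]], dtype=float)

K = khatri_rao(R_c, R_t)      # shape (2*3, 5); column a equals e_{c(a)} (x) e_{t(a)}

# For a one-hot x_{f_i}, K @ x_{f_i} equals x_c (x) x_t, which is the structure used
# in the recovery theta_{ct} = (R_c * R_t) B^{-1} gamma_{f_i} of Corollary 4.7.
x_f = np.zeros(5); x_f[3] = 1.0
assert np.allclose(K @ x_f, np.kron(R_c @ x_f, R_t @ x_f))
```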

4.4.3 Computational issues

To apply the above results, we need to solve a couple of computational primitives. The first primitive is to compute the matrix inverse B_{T,h,i}^{-1} (or rather, the product of this inverse with another vector). This task can be done in one of two ways: we either explicitly compute the inverse, or we compute a Cholesky decomposition of the matrix B_{T,h,i}. We explain below how both of these tasks can be carried out with, essentially, database queries.

Maintaining the matrix inverse with rank-1 updates  Using formula (9), we can incrementally compute the inverse of the matrix B_i = ∑_{c∈G_i} R_c^⊤ R_c = I + ∑_{c∈S_i} R_c^⊤ R_c as follows. Let S ⊂ S_i be some subset and suppose we have already computed the inverse of A_S = I + ∑_{s∈S} R_s^⊤ R_s. We now explain how to compute the inverse of A_{S∪{c}} = I + ∑_{s∈S∪{c}} R_s^⊤ R_s. For concreteness, say R_c maps city to country. For each country, let e_country denote the 0/1-vector with a 1 in the position of each city belonging to that country; for example,

e_germany = [1 1 0 0 0]^⊤.

Then,

R_c^⊤ R_c = ∑_country e_country e_country^⊤.

Thus, starting with A_S, we can keep applying the Sherman-Morrison-Woodbury formula (9), once per country:

(A + e_germany e_germany^⊤)^{-1} = A^{-1} − (A^{-1} e_germany e_germany^⊤ A^{-1}) / (1 + e_germany^⊤ A^{-1} e_germany).   (61)

This update can certainly be done with typical database aggregate queries:

• The scalar α = e_germany^⊤ A^{-1} e_germany is the sum of the entries (i, j) of A^{-1} where both i and j are cities in germany.

• The vector v = A^{-1} e_germany is the sum of the columns of A^{-1} corresponding to cities in germany.

• The numerator A^{-1} e_germany e_germany^⊤ A^{-1} is exactly vv^⊤.

Overall, each update (61) can be done in O(n²) time, where n is the number of cities, for an overall runtime of O(n²m), where m is the number of countries. When the FDs form a chain, the blocks are nested inside one another, and each update is even cheaper as we do not have to access all n² entries.
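A minimal in-memory sketch of this incremental inverse maintenance (dense numpy arrays stand in for the aggregate queries; the city-to-country map is made up) applies update (61) once per country and checks the result against a direct inverse:

```python
import numpy as np

n_cities = 5
country_of_city = [0, 0, 1, 1, 1]          # hypothetical city -> country map
A_inv = np.eye(n_cities)                   # A starts as the identity, so A^{-1} = I

for country in set(country_of_city):
    e = np.array([1.0 if c == country else 0.0 for c in country_of_city])
    v = A_inv @ e                          # sum of A^{-1}'s columns for this country's cities
    alpha = e @ v                          # sum of A^{-1}'s entries within the country block
    A_inv -= np.outer(v, v) / (1.0 + alpha)   # Sherman-Morrison update (61)

# Check against the direct inverse of I + R_c^T R_c.
R = np.zeros((2, n_cities))
for city, country in enumerate(country_of_city):
    R[country, city] = 1.0
assert np.allclose(A_inv, np.linalg.inv(np.eye(n_cities) + R.T @ R))
```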

Maintaining a Cholesky decomposition with rank-1 updates  Maintaining or computing a matrix inverse can be numerically unstable. It is preferable to maintain a Cholesky decomposition of the matrix instead: this strategy is more stable numerically. There are known rank-1 update algorithms [6, 11] that use strategies similar to the inverse rank-1 update above.
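For reference, a standard rank-1 Cholesky update in the spirit of the algorithms cited above looks as follows (a textbook sketch in dense numpy, not the factorized in-database variant):

```python
import numpy as np

def chol_update(L, x):
    """Given lower-triangular L with L L^T = A, overwrite L so that
    L L^T = A + x x^T (classic rank-one update via Givens-like rotations)."""
    x = x.astype(float).copy()
    n = x.size
    for k in range(n):
        r = np.hypot(L[k, k], x[k])
        c, s = r / L[k, k], x[k] / L[k, k]
        L[k, k] = r
        if k + 1 < n:
            L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
            x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L

# Example: update the factor of A = I when adding e_germany e_germany^T.
A = np.eye(5)
L = np.linalg.cholesky(A)
e_germany = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
L = chol_update(L, e_germany)
assert np.allclose(L @ L.T, A + np.outer(e_germany, e_germany))
```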

Another computational primitive we often have to perform is multiplying a tensor product with a vector, as in (B_i^{-1} ⊗ B_j^{-1}) γ_{f_i f_j}. We have already commented on this task in Section 2.3. Again, note the important fact that this task can be done with typical database aggregate queries.
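A minimal sketch of that primitive (the usual reshape identity, with made-up dimensions standing in for the B_i blocks) multiplies (B_i^{-1} ⊗ B_j^{-1}) with a vector without ever materializing the Kronecker product:

```python
import numpy as np

def kron_matvec(A, B, x):
    """Compute (A (x) B) x without forming the Kronecker product.
    With row-major flattening, (A (x) B) x = vec(A X B^T) where X = x.reshape(m, n)."""
    m, n = A.shape[1], B.shape[1]
    return (A @ x.reshape(m, n) @ B.T).reshape(-1)

rng = np.random.default_rng(2)
Mi = rng.normal(size=(4, 4)); Bi_inv = np.linalg.inv(Mi @ Mi.T + np.eye(4))  # stand-in for B_i^{-1}
Mj = rng.normal(size=(3, 3)); Bj_inv = np.linalg.inv(Mj @ Mj.T + np.eye(3))  # stand-in for B_j^{-1}
gamma = rng.normal(size=4 * 3)                                               # gamma_{f_i f_j}, flattened

fast = kron_matvec(Bi_inv, Bj_inv, gamma)
slow = np.kron(Bi_inv, Bj_inv) @ gamma
assert np.allclose(fast, slow)   # quadratic-size work instead of the full Kronecker matrix
```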

Alternative to Corollary 4.2  One big advantage of a linear model in terms of BGD is Corollary 4.2: we do not have to redo point evaluation for every backtracking step. After the reparameterization exploiting FD-based dimensionality reduction, Corollary 4.2 no longer applies as is, because we have changed the penalty term. However, it is easy to work out a similar result in terms of the new parameter space. Let d = ∇J̄(γ). Then,

J̄(γ) − J̄(γ − αd) = ½ γ^⊤ Σ̄ γ − ½ (γ − αd)^⊤ Σ̄ (γ − αd) − ⟨γ, c̄⟩ + ⟨γ − αd, c̄⟩ + (λ/2)(Ω(γ) − Ω(γ − αd))
                 = α γ^⊤ Σ̄ d − (α²/2) d^⊤ Σ̄ d − α ⟨d, c̄⟩ + (λ/2)(Ω(γ) − Ω(γ − αd)).

Hence, we have the following analog of Corollary 4.2:

Proposition 4.8. With respect to the new parameters (and the new objective J̄ defined in (51)), the Armijo condition J̄(γ) − J̄(γ − αd) ≤ (α/2)‖d‖² is equivalent to

α (2 γ^⊤ Σ̄ d − α d^⊤ Σ̄ d − 2 ⟨d, c̄⟩ − ‖d‖²) + λ Ω(γ) ≤ λ Ω(γ − αd),

where d = ∇J̄(γ). Furthermore, the next gradient of J̄ is also readily available:

∂J̄(γ − αd)/∂γ = d − α Σ̄ d + (λ/2) ( ∂Ω(γ − αd)/∂γ − ∂Ω(γ)/∂γ ).

The point here is that we only need to compute the intermediate results involving the covariance matrix Σ̄ once while backtracking. For each new value of α, we only need to recompute the penalty Ω(γ − αd), which is an inexpensive operation. If λ = 0, we can even solve for α directly.
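A sketch of the resulting backtracking loop (illustrative only: Sigma, c, and the callable Omega stand for Σ̄, c̄, and Ω above and are assumed to be available as in-memory objects, whereas the system computes them as database aggregates):

```python
import numpy as np

def backtrack(gamma, d, Sigma, c, Omega, lam, alpha=1.0, beta=0.5, max_iter=50):
    """Backtracking line search for J(gamma) = 1/2 gamma^T Sigma gamma - <gamma, c>
    + lam/2 * Omega(gamma).  All Sigma-dependent scalars are computed once; every
    trial step size only re-evaluates the (cheap) penalty Omega(gamma - alpha*d)."""
    Sigma_d = Sigma @ d                 # one matrix-vector product per line search
    gSd, dSd = gamma @ Sigma_d, d @ Sigma_d
    dc, dd = d @ c, d @ d
    om0 = Omega(gamma)
    for _ in range(max_iter):
        # Rearranged test of Proposition 4.8: shrink alpha while the decrease
        # J(gamma) - J(gamma - alpha*d) is at most (alpha/2)*||d||^2.
        insufficient = (alpha * (2 * gSd - alpha * dSd - 2 * dc - dd) + lam * om0
                        <= lam * Omega(gamma - alpha * d))
        if not insufficient:
            break
        alpha *= beta
    return alpha

# Tiny usage example with a plain ridge penalty Omega(g) = ||g||^2.
rng = np.random.default_rng(3)
M = rng.normal(size=(6, 6)); Sigma = M @ M.T
c = rng.normal(size=6); gamma = rng.normal(size=6); lam = 0.1
grad = Sigma @ gamma - c + lam * gamma        # gradient of J for this simple Omega
step = backtrack(gamma, grad, Sigma, c, lambda g: g @ g, lam)
```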


4.5 Factorization machines under sets of simple FDs

Next, we turn our attention to FaMa of degree 2 and rank r.

Theorem 4.9. Consider the FaMa model of degree 2 and rank r with k groups of simple FDs G_i = {f_i} ∪ S_i, i ∈ [k]. Let θ = (θ_i)_{i∈V} be the original parameters, and G = ⋃_{i∈[k]} G_i. For any i ∈ [k], define

β_{f_i} := ∑_{ℓ=1}^r ∑_{{c,t}∈\binom{G_i}{2}} R_c^⊤ θ_c^{(ℓ)} ∘ R_t^⊤ θ_t^{(ℓ)},   (62)

where ∘ denotes the element-wise product, and the following reparameterization (note that there is no γ_c for c ∈ S_i, although there is γ_c^{(ℓ)} for c ∈ S_i):

γ_w = θ_w                                                   if w ∉ G,
γ_w = θ_{f_i} + ∑_{c∈S_i} R_c^⊤ θ_c + β_{f_i}                  if w = f_i, i ∈ [k];

γ_w^{(ℓ)} = θ_w^{(ℓ)}                                        if w ∉ F,
γ_w^{(ℓ)} = θ_{f_i}^{(ℓ)} + ∑_{c∈S_i} R_c^⊤ θ_c^{(ℓ)}           if w = f_i, i ∈ [k].

Then, minimizing J(θ) is equivalent to minimizing the function J̄(γ) := ½ π(γ)^⊤ Σ̄ π(γ) − ⟨π(γ), c̄⟩ + (λ/2) Ω(γ), where

Ω(γ) := ∑_{w∉G} ‖γ_w‖² + ∑_{i=1}^k ⟨B_i^{-1}(γ_{f_i} − β_{f_i}), (γ_{f_i} − β_{f_i})⟩ + ∑_{ℓ∈[r], w∉F} ‖γ_w^{(ℓ)}‖² + ∑_{i∈[k], ℓ∈[r]} ‖ γ_{f_i}^{(ℓ)} − ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)} ‖²   (63)

and π(γ) denotes the vector obtained from γ by removing all components γ_c^{(ℓ)} associated with c ∈ S_i, i ∈ [k].

Proof. We begin with a similar derivation, where the "relevant terms" of ⟨g(θ), h(x)⟩ are the terms in which h contains a feature c ∈ G_i for some i ∈ [k]:

relevant terms of ⟨g(θ), h(x)⟩
= ∑_{i∈[k]} ∑_{c∈G_i} ⟨θ_c, x_c⟩ + ∑_{i∈[k], ℓ∈[r]} ∑_{{c,t}∈\binom{G_i}{2}} ⟨θ_c^{(ℓ)} ⊗ θ_t^{(ℓ)}, x_c ⊗ x_t⟩
  + ∑_{{i,j}∈\binom{[k]}{2}, ℓ∈[r]} ∑_{c∈G_i, t∈G_j} ⟨θ_c^{(ℓ)} ⊗ θ_t^{(ℓ)}, x_c ⊗ x_t⟩ + ∑_{i∈[k], ℓ∈[r]} ∑_{c∈G_i, w∉G} ⟨θ_c^{(ℓ)} ⊗ θ_w^{(ℓ)}, x_c ⊗ x_w⟩.

Substituting x_c = R_c x_{f_i} for every c ∈ G_i and moving each R_c onto the parameter side of the inner products gives

= ∑_{i∈[k]} ∑_{c∈G_i} ⟨R_c^⊤ θ_c, x_{f_i}⟩ + ∑_{i∈[k], ℓ∈[r]} ∑_{{c,t}∈\binom{G_i}{2}} ⟨R_c^⊤ θ_c^{(ℓ)} ⊗ R_t^⊤ θ_t^{(ℓ)}, x_{f_i} ⊗ x_{f_i}⟩
  + ∑_{{i,j}∈\binom{[k]}{2}, ℓ∈[r]} ∑_{c∈G_i, t∈G_j} ⟨R_c^⊤ θ_c^{(ℓ)} ⊗ R_t^⊤ θ_t^{(ℓ)}, x_{f_i} ⊗ x_{f_j}⟩ + ∑_{i∈[k], ℓ∈[r]} ∑_{c∈G_i, w∉G} ⟨R_c^⊤ θ_c^{(ℓ)} ⊗ θ_w^{(ℓ)}, x_{f_i} ⊗ x_w⟩.

Since x_{f_i} is a 0/1 indicator vector, ⟨u ⊗ v, x_{f_i} ⊗ x_{f_i}⟩ = ⟨u ∘ v, x_{f_i}⟩, so the same-group degree-2 terms collapse; pulling the sums over c and t inside the remaining inner products, we obtain

= ∑_{i=1}^k ⟨ ∑_{c∈G_i} R_c^⊤ θ_c + ∑_{ℓ=1}^r ∑_{{c,t}∈\binom{G_i}{2}} R_c^⊤ θ_c^{(ℓ)} ∘ R_t^⊤ θ_t^{(ℓ)}, x_{f_i} ⟩
  + ∑_{{i,j}∈\binom{[k]}{2}, ℓ∈[r]} ⟨ (∑_{c∈G_i} R_c^⊤ θ_c^{(ℓ)}) ⊗ (∑_{t∈G_j} R_t^⊤ θ_t^{(ℓ)}), x_{f_i} ⊗ x_{f_j} ⟩ + ∑_{i∈[k], w∉G, ℓ∈[r]} ⟨ (∑_{c∈G_i} R_c^⊤ θ_c^{(ℓ)}) ⊗ θ_w^{(ℓ)}, x_{f_i} ⊗ x_w ⟩
= ∑_{i=1}^k ⟨γ_{f_i}, x_{f_i}⟩ + ∑_{{i,j}∈\binom{[k]}{2}, ℓ∈[r]} ⟨γ_{f_i}^{(ℓ)} ⊗ γ_{f_j}^{(ℓ)}, x_{f_i} ⊗ x_{f_j}⟩ + ∑_{i∈[k], w∉G, ℓ∈[r]} ⟨γ_{f_i}^{(ℓ)} ⊗ θ_w^{(ℓ)}, x_{f_i} ⊗ x_w⟩,

where the bracketed sums are exactly the quantities γ_{f_i} and γ_{f_i}^{(ℓ)}, γ_{f_j}^{(ℓ)} (recall that ∑_{c∈G_i} R_c^⊤ θ_c = θ_{f_i} + ∑_{c∈S_i} R_c^⊤ θ_c since R_{f_i} = I_{f_i}).

The above derivation immediately yields the reparameterization given in the statement of the theorem, which we reproduce here for the sake of clarity:

γ_w = θ_w                                                   if w ∉ G,
γ_w = θ_{f_i} + ∑_{c∈S_i} R_c^⊤ θ_c + β_{f_i}                  if w = f_i, i ∈ [k];

γ_w^{(ℓ)} = θ_w^{(ℓ)}                                        if w ∉ F,
γ_w^{(ℓ)} = θ_{f_i}^{(ℓ)} + ∑_{c∈S_i} R_c^⊤ θ_c^{(ℓ)}           if w = f_i, i ∈ [k].

Note that we did not define γ_w for w ∈ S_i, i ∈ [k]. The reason we can do so is that we can optimize out θ_c, c ∈ S_i, by the same trick as in the proof of Theorem 4.5. First, we rewrite all the terms of ‖θ‖² in terms of γ and θ_c, c ∈ S_i, i ∈ [k]:

‖θ‖² = ∑_{w∉G} ‖θ_w‖² + ∑_{i=1}^k ∑_{t∈G_i} ‖θ_t‖² + ∑_{ℓ=1}^r ∑_{w∉F} ‖θ_w^{(ℓ)}‖² + ∑_{ℓ=1}^r ∑_{i=1}^k ‖θ_{f_i}^{(ℓ)}‖²
     = ∑_{w∉G} ‖γ_w‖² + ∑_{i=1}^k ∑_{t∈G_i} ‖θ_t‖² + ∑_{ℓ=1}^r ∑_{w∉F} ‖γ_w^{(ℓ)}‖² + ∑_{ℓ=1}^r ∑_{i=1}^k ‖θ_{f_i}^{(ℓ)}‖²
     = ∑_{w∉G} ‖γ_w‖² + ∑_{i=1}^k ‖ γ_{f_i} − ∑_{c∈S_i} R_c^⊤ θ_c − β_{f_i} ‖² + ∑_{i=1}^k ∑_{t∈S_i} ‖θ_t‖² + ∑_{ℓ=1}^r ∑_{w∉F} ‖γ_w^{(ℓ)}‖² + ∑_{ℓ=1}^r ∑_{i=1}^k ‖ γ_{f_i}^{(ℓ)} − ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)} ‖².

Since θ_t, t ∈ S_i, does not appear in the loss term, we have

½ ∂J/∂θ_t = θ_t − R_t ( γ_{f_i} − ∑_{c∈S_i} R_c^⊤ θ_c − β_{f_i} ) = θ_t − R_t θ_{f_i},   t ∈ S_i, i ∈ [k].   (64)

By setting (64) to 0, we have θ_t = R_t θ_{f_i} for all t ∈ G_i, and thus

θ_{f_i} = γ_{f_i} − ∑_{c∈S_i} R_c^⊤ θ_c − β_{f_i} = γ_{f_i} − ∑_{c∈S_i} R_c^⊤ R_c θ_{f_i} − β_{f_i},

which implies θ_{f_i} = B_i^{-1}(γ_{f_i} − β_{f_i}). Hence, the following always holds:

θ_t = R_t B_i^{-1}(γ_{f_i} − β_{f_i}),   ∀t ∈ G_i, i ∈ [k].

Note also that

∑_{t∈G_i} ‖θ_t‖² = ∑_{t∈G_i} ‖R_t B_i^{-1}(γ_{f_i} − β_{f_i})‖²
                = ∑_{t∈G_i} ⟨R_t^⊤ R_t B_i^{-1}(γ_{f_i} − β_{f_i}), B_i^{-1}(γ_{f_i} − β_{f_i})⟩
                = ⟨(∑_{t∈G_i} R_t^⊤ R_t) B_i^{-1}(γ_{f_i} − β_{f_i}), B_i^{-1}(γ_{f_i} − β_{f_i})⟩
                = ⟨B_i B_i^{-1}(γ_{f_i} − β_{f_i}), B_i^{-1}(γ_{f_i} − β_{f_i})⟩
                = ⟨(γ_{f_i} − β_{f_i}), B_i^{-1}(γ_{f_i} − β_{f_i})⟩.

Due to the fact that θ_{f_i}^{(ℓ)} = γ_{f_i}^{(ℓ)} − ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)}, we can now write the penalty term in terms of the new parameter γ:

‖θ‖² = ∑_{w∉G} ‖γ_w‖² + ∑_{i=1}^k ∑_{t∈G_i} ‖θ_t‖² + ∑_{ℓ=1}^r ∑_{w∉F} ‖γ_w^{(ℓ)}‖² + ∑_{ℓ=1}^r ∑_{i=1}^k ‖θ_{f_i}^{(ℓ)}‖²   (65)
     = ∑_{w∉G} ‖γ_w‖² + ∑_{i=1}^k ⟨(γ_{f_i} − β_{f_i}), B_i^{-1}(γ_{f_i} − β_{f_i})⟩ + ∑_{ℓ=1}^r ∑_{w∉F} ‖γ_w^{(ℓ)}‖²   (66)
       + ∑_{ℓ=1}^r ∑_{i=1}^k ‖ γ_{f_i}^{(ℓ)} − ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)} ‖²,   (67)

which is exactly Ω(γ) in (63).

In order to optimize J̄ with respect to γ, the following proposition provides closed-form formulae for the relevant gradient.

Proposition 4.10. The gradient of Ω(γ) defined in (63) can be computed by first computing δ_i^{(ℓ)} = ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)} and

β_{f_i} = ∑_{ℓ=1}^r [ (γ_{f_i}^{(ℓ)} − ½ δ_i^{(ℓ)}) ∘ δ_i^{(ℓ)} − ½ ∑_{t∈S_i} R_t^⊤ (γ_t^{(ℓ)} ∘ γ_t^{(ℓ)}) ].

Then,

½ ∂Ω(γ)/∂γ_w = γ_w                                 if w ∉ G,
½ ∂Ω(γ)/∂γ_w = B_i^{-1}(γ_{f_i} − β_{f_i})           if w = f_i, i ∈ [k];   (68)

½ ∂Ω(γ)/∂γ_w^{(ℓ)} = γ_w^{(ℓ)}                                                                         if w ∉ G, ℓ ∈ [r],
½ ∂Ω(γ)/∂γ_w^{(ℓ)} = γ_{f_i}^{(ℓ)} − δ_i^{(ℓ)} ∘ (1 + ½ ∂Ω(γ)/∂γ_{f_i})                                  if w = f_i, ℓ ∈ [r],
½ ∂Ω(γ)/∂γ_w^{(ℓ)} = γ_w^{(ℓ)} + R_w [ γ_{f_i}^{(ℓ)} ∘ (½ ∂Ω(γ)/∂γ_{f_i}) − (½ ∂Ω(γ)/∂γ_{f_i}^{(ℓ)}) ]     if w ∈ S_i, ℓ ∈ [r].   (69)
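Before the proof, here is a small sketch of the bookkeeping in Proposition 4.10 (toy dimensions and hypothetical FD matrices; only δ_i^{(ℓ)} and β_{f_i} are materialized, after which (68) and (69) reduce to plain vector arithmetic):

```python
import numpy as np

rng = np.random.default_rng(4)

# One group: f_i with a 5-value domain, S_i = {c, t} with 2- and 3-value domains.
R = {"c": np.array([[1, 1, 0, 0, 0],
                    [0, 0, 1, 1, 1]], dtype=float),
     "t": np.array([[1, 0, 0, 0, 1],
                    [0, 1, 1, 0, 0],
                    [0, 0, 0, 1, 0]], dtype=float)}
r = 2                                                     # FaMa rank
gamma_f = {l: rng.normal(size=5) for l in range(r)}       # gamma^{(l)}_{f_i}
gamma_S = {(l, c): rng.normal(size=R[c].shape[0]) for l in range(r) for c in R}

# delta_i^{(l)} = sum_{c in S_i} R_c^T gamma_c^{(l)}.
delta = {l: sum(R[c].T @ gamma_S[(l, c)] for c in R) for l in range(r)}

# beta_{f_i} = sum_l [ (gamma^{(l)}_{f_i} - delta^{(l)}/2) o delta^{(l)}
#                      - 1/2 * sum_{t in S_i} R_t^T (gamma_t^{(l)} o gamma_t^{(l)}) ].
beta_f = sum((gamma_f[l] - 0.5 * delta[l]) * delta[l]
             - 0.5 * sum(R[c].T @ (gamma_S[(l, c)] ** 2) for c in R)
             for l in range(r))
print(beta_f)   # feeds into (68): 1/2 dOmega/dgamma_{f_i} = B_i^{-1}(gamma_{f_i} - beta_{f_i})
```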

Proof. The goal is to derive the gradient of Ω(γ) with respect to the parameters γ. Since β_{f_i} is a function of γ_c^{(ℓ)}, ℓ ∈ [r], c ∈ G_i, the following is immediate:

½ ∂‖θ‖²/∂γ_w = γ_w                               if w ∉ G,
½ ∂‖θ‖²/∂γ_w = B_i^{-1}(γ_{f_i} − β_{f_i})         if w = f_i, i ∈ [k];
½ ∂‖θ‖²/∂γ_w^{(ℓ)} = γ_w^{(ℓ)}                     if w ∉ G, ℓ ∈ [r].

Next, we rewrite β_{f_i} in terms of the new parameters to facilitate fast computation:

β_{f_i} := ∑_{ℓ=1}^r ∑_{{c,t}∈\binom{G_i}{2}} R_c^⊤ θ_c^{(ℓ)} ∘ R_t^⊤ θ_t^{(ℓ)}
 = ∑_{ℓ=1}^r [ R_{f_i}^⊤ θ_{f_i}^{(ℓ)} ∘ ∑_{c∈S_i} R_c^⊤ θ_c^{(ℓ)} + ∑_{{c,t}∈\binom{S_i}{2}} R_c^⊤ θ_c^{(ℓ)} ∘ R_t^⊤ θ_t^{(ℓ)} ]
 = ∑_{ℓ=1}^r [ θ_{f_i}^{(ℓ)} ∘ ∑_{c∈S_i} R_c^⊤ θ_c^{(ℓ)} + ∑_{{c,t}∈\binom{S_i}{2}} R_c^⊤ θ_c^{(ℓ)} ∘ R_t^⊤ θ_t^{(ℓ)} ]
 = ∑_{ℓ=1}^r [ (γ_{f_i}^{(ℓ)} − ∑_{t∈S_i} R_t^⊤ θ_t^{(ℓ)}) ∘ ∑_{c∈S_i} R_c^⊤ θ_c^{(ℓ)} + ∑_{{c,t}∈\binom{S_i}{2}} R_c^⊤ θ_c^{(ℓ)} ∘ R_t^⊤ θ_t^{(ℓ)} ]
 = ∑_{ℓ=1}^r [ γ_{f_i}^{(ℓ)} ∘ ∑_{c∈S_i} R_c^⊤ θ_c^{(ℓ)} − ∑_{t∈S_i} ∑_{c∈S_i} R_t^⊤ θ_t^{(ℓ)} ∘ R_c^⊤ θ_c^{(ℓ)} + ∑_{{c,t}∈\binom{S_i}{2}} R_c^⊤ θ_c^{(ℓ)} ∘ R_t^⊤ θ_t^{(ℓ)} ]
 = ∑_{ℓ=1}^r [ γ_{f_i}^{(ℓ)} ∘ ∑_{c∈S_i} R_c^⊤ θ_c^{(ℓ)} − ∑_{t∈S_i} R_t^⊤ θ_t^{(ℓ)} ∘ R_t^⊤ θ_t^{(ℓ)} − ∑_{{c,t}∈\binom{S_i}{2}} R_c^⊤ θ_c^{(ℓ)} ∘ R_t^⊤ θ_t^{(ℓ)} ]
 = ∑_{ℓ=1}^r [ γ_{f_i}^{(ℓ)} ∘ ∑_{c∈S_i} R_c^⊤ θ_c^{(ℓ)} − ∑_{t∈S_i} R_t^⊤ (θ_t^{(ℓ)} ∘ θ_t^{(ℓ)}) − ∑_{{c,t}∈\binom{S_i}{2}} R_c^⊤ θ_c^{(ℓ)} ∘ R_t^⊤ θ_t^{(ℓ)} ]
 = ∑_{ℓ=1}^r [ γ_{f_i}^{(ℓ)} ∘ ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)} − ∑_{t∈S_i} R_t^⊤ (γ_t^{(ℓ)} ∘ γ_t^{(ℓ)}) − ∑_{{c,t}∈\binom{S_i}{2}} R_c^⊤ γ_c^{(ℓ)} ∘ R_t^⊤ γ_t^{(ℓ)} ].

Next, we derive the partial derivative with respect to γ_{f_i}^{(ℓ)} for fixed i ∈ [k] and ℓ ∈ [r]; this computation uses (5):

½ ∂‖θ‖²/∂γ_{f_i}^{(ℓ)} = ½ ∂⟨(γ_{f_i} − β_{f_i}), B_i^{-1}(γ_{f_i} − β_{f_i})⟩/∂γ_{f_i}^{(ℓ)} + ½ ∂‖ γ_{f_i}^{(ℓ)} − ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)} ‖²/∂γ_{f_i}^{(ℓ)}
 = (∑_{c∈S_i} DIAG(R_c^⊤ γ_c^{(ℓ)})) B_i^{-1}(β_{f_i} − γ_{f_i}) + γ_{f_i}^{(ℓ)} − ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)}
 = γ_{f_i}^{(ℓ)} − δ_i^{(ℓ)} − δ_i^{(ℓ)} ∘ B_i^{-1}(γ_{f_i} − β_{f_i})
 = γ_{f_i}^{(ℓ)} − δ_i^{(ℓ)} − δ_i^{(ℓ)} ∘ (½ ∂‖θ‖²/∂γ_{f_i}),

where δ_i^{(ℓ)} := ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)}.

Lastly, we move on to the partial derivative with respect to γ_w^{(ℓ)} for fixed i ∈ [k], w ∈ S_i, ℓ ∈ [r]:

½ ∂‖θ‖²/∂γ_w^{(ℓ)} = ½ ∂‖γ_w^{(ℓ)}‖²/∂γ_w^{(ℓ)} + ½ ∂⟨(γ_{f_i} − β_{f_i}), B_i^{-1}(γ_{f_i} − β_{f_i})⟩/∂γ_w^{(ℓ)} + ½ ∂‖ γ_{f_i}^{(ℓ)} − ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)} ‖²/∂γ_w^{(ℓ)}
 = γ_w^{(ℓ)} + R_w (∑_{c∈G_i} DIAG(R_c^⊤ γ_c^{(ℓ)})) B_i^{-1}(β_{f_i} − γ_{f_i}) + R_w ( ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)} − γ_{f_i}^{(ℓ)} )
 = γ_w^{(ℓ)} + R_w [ (γ_{f_i}^{(ℓ)} + δ_i^{(ℓ)}) ∘ (½ ∂‖θ‖²/∂γ_{f_i}) + δ_i^{(ℓ)} − γ_{f_i}^{(ℓ)} ]
 = γ_w^{(ℓ)} + R_w [ γ_{f_i}^{(ℓ)} ∘ (½ ∂‖θ‖²/∂γ_{f_i}) + ( δ_i^{(ℓ)} ∘ (½ ∂‖θ‖²/∂γ_{f_i}) + δ_i^{(ℓ)} − γ_{f_i}^{(ℓ)} ) ]
 = γ_w^{(ℓ)} + R_w [ γ_{f_i}^{(ℓ)} ∘ (½ ∂‖θ‖²/∂γ_{f_i}) − (½ ∂‖θ‖²/∂γ_{f_i}^{(ℓ)}) ].

In particular, we were able to reuse the computation of ½ ∂‖θ‖²/∂γ_{f_i}^{(ℓ)} and ½ ∂‖θ‖²/∂γ_{f_i} to compute ½ ∂‖θ‖²/∂γ_w^{(ℓ)}. There is, however, still one complicated term, β_{f_i}, left to compute. We simplify β_{f_i} to make its evaluation faster as follows:

β_{f_i} = ∑_{ℓ=1}^r [ γ_{f_i}^{(ℓ)} ∘ ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)} − ∑_{t∈S_i} R_t^⊤ (γ_t^{(ℓ)} ∘ γ_t^{(ℓ)}) − ∑_{{c,t}∈\binom{S_i}{2}} R_c^⊤ γ_c^{(ℓ)} ∘ R_t^⊤ γ_t^{(ℓ)} ]
 = ∑_{ℓ=1}^r [ γ_{f_i}^{(ℓ)} ∘ ∑_{c∈S_i} R_c^⊤ γ_c^{(ℓ)} − ½ ∑_{t∈S_i} R_t^⊤ (γ_t^{(ℓ)} ∘ γ_t^{(ℓ)}) − ½ ∑_{c∈S_i} ∑_{t∈S_i} R_c^⊤ γ_c^{(ℓ)} ∘ R_t^⊤ γ_t^{(ℓ)} ]
 = ∑_{ℓ=1}^r [ γ_{f_i}^{(ℓ)} ∘ δ_i^{(ℓ)} − ½ ∑_{t∈S_i} R_t^⊤ (γ_t^{(ℓ)} ∘ γ_t^{(ℓ)}) − ½ δ_i^{(ℓ)} ∘ δ_i^{(ℓ)} ]
 = ∑_{ℓ=1}^r [ (γ_{f_i}^{(ℓ)} − ½ δ_i^{(ℓ)}) ∘ δ_i^{(ℓ)} − ½ ∑_{t∈S_i} R_t^⊤ (γ_t^{(ℓ)} ∘ γ_t^{(ℓ)}) ].

This completes the proof.

Remarks  Suppose that the minimizer γ of J̄ has been obtained; then a minimizer θ of J is available in closed form:

θ_w = γ_w                                       if w ∈ V ∖ G,
θ_w = R_w B_i^{-1}(γ_{f_i} − β_{f_i})              if w ∈ G_i, i ∈ [k];

θ_w^{(ℓ)} = γ_w^{(ℓ)}                             for all w ∉ F, ℓ ∈ [r],
θ_w^{(ℓ)} = γ_w^{(ℓ)} − δ_i^{(ℓ)}                  if w = f_i, ℓ ∈ [r].

4.6 Proof of Theorem 4.5

Proof. We start by breaking the loss term into two parts,

⟨θ, h(x)⟩ = ∑_{‖a_V‖_1≤d} ⟨θ_a, h_a(x)⟩ = ∑_{‖a_V‖_1≤d, ‖a_G‖_1=0} ⟨θ_a, x^{⊗a}⟩ + ∑_{‖a_V‖_1≤d, ‖a_G‖_1>0} ⟨θ_a, x^{⊗a}⟩,

and rewrite the second part:

∑_{‖a_V‖_1≤d, ‖a_G‖_1>0} ⟨θ_a, x^{⊗a}⟩   (70)
= ∑_{‖a_V‖_1≤d, ‖a_G‖_1>0} ⟨θ_a, x_Ḡ^{⊗a_Ḡ} ⊗ x_G^{⊗a_G}⟩   (71)
= ∑_{‖a_V‖_1≤d, ‖a_G‖_1>0} ⟨θ_a, x_Ḡ^{⊗a_Ḡ} ⊗ ⊗_{i∈[k], c∈G_i: a_c>0} x_c⟩   (72)
= ∑_{‖a_V‖_1≤d, ‖a_G‖_1>0} ⟨θ_a, x_Ḡ^{⊗a_Ḡ} ⊗ ⊗_{i∈[k]: ‖a_{G_i}‖_1>0} ⊗_{c∈G_i: a_c>0} R_c x_{f_i}⟩   (73)
= ∑_{‖a_V‖_1≤d, ‖a_G‖_1>0} ⟨θ_a, x_Ḡ^{⊗a_Ḡ} ⊗ ⊗_{i∈[k]: ‖a_{G_i}‖_1>0} (⋆_{c∈G_i: a_c>0} R_c) x_{f_i}⟩   (74)
= ∑_{‖a_V‖_1≤d, ‖a_G‖_1>0} ⟨θ_a, ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈[k]: ‖a_{G_i}‖_1>0} ⋆_{c∈G_i: a_c>0} R_c ) ( x_Ḡ^{⊗a_Ḡ} ⊗ ⊗_{i∈[k]: ‖a_{G_i}‖_1>0} x_{f_i} )⟩   (75)
= ∑_{‖a_V‖_1≤d, ‖a_G‖_1>0} ⟨( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈[k]: ‖a_{G_i}‖_1>0} ⋆_{c∈G_i: a_c>0} R_c )^⊤ θ_a, x_Ḡ^{⊗a_Ḡ} ⊗ ⊗_{i∈[k]: ‖a_{G_i}‖_1>0} x_{f_i}⟩   (76)
= ∑_{‖a_Ḡ‖_1=h, h<d} ∑_{T⊆[k], 0<|T|≤d−h} ∑_{U∈U(T,h)} ⟨( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} ⋆_{c∈U∩G_i} R_c )^⊤ θ_{(a_Ḡ, 1_{U|G})}, x_Ḡ^{⊗a_Ḡ} ⊗ ⊗_{i∈T} x_{f_i}⟩,
  where the matrix in parentheses is R_{a_Ḡ,U} defined in (49),   (77)
= ∑_{‖a_Ḡ‖_1=h, h<d} ∑_{T⊆[k], 0<|T|≤d−h} ⟨ ∑_{U∈U(T,h)} R_{a_Ḡ,U}^⊤ θ_{(a_Ḡ, 1_{U|G})}, x_Ḡ^{⊗a_Ḡ} ⊗ ⊗_{i∈T} x_{f_i}⟩,
  where the inner sum is γ_{(a_Ḡ, 1_{F_T|F})},   (78)
= ∑_{‖a_Ḡ‖_1=h, h<d} ∑_{T⊆[k], 0<|T|≤d−h} ⟨γ_{(a_Ḡ, 1_{F_T|F})}, x_Ḡ^{⊗a_Ḡ} ⊗ ⊗_{i∈T} x_{f_i}⟩.   (79)

Adding back the first part (which corresponds to b_F = 0_F), we conclude

⟨θ, h(x)⟩ = ∑_{‖b_S̄‖_1≤d} ⟨γ_{b_S̄}, x_S̄^{⊗b_S̄}⟩.   (80)

The equality at (77) is a bit loaded. What goes on there is that we broke the sum over a_V with ‖a_V‖_1 ≤ d and ‖a_G‖_1 > 0 into three nested sums. First, in order for ‖a_G‖_1 > 0 to be possible, we must have ‖a_Ḡ‖_1 < d, so we group by those tuples first. The remaining mass ‖a_G‖_1 can then be at most d − ‖a_Ḡ‖_1 = d − h. Since all features in G are categorical, from the analysis above we have a_G = (a_g)_{g∈G} ∈ {0, 1}^G, i.e., a_G is the characteristic vector of a subset U ⊆ G. Let T = {i | U ∩ G_i ≠ ∅}. In the second summation we group U by T, and the third summation ranges over all choices of U ∩ G_i, i ∈ T, for which the total mass is at most d − h (recall the definition of U(T, h) in (47)).

Next, in (78) we perform the reparameterization. Recall that 1_{F_T|F} is the characteristic vector of the set F_T := {f_i}_{i∈T} within the collection F = {f_1, ..., f_k}. The new parameter γ_{(a_Ḡ, 1_{F_T|F})} is indexed by the tuple (a_Ḡ, 1_{F_T|F}) whose support is contained in Ḡ ∪ F = S̄, i.e., the set of all features except the ones functionally determined by features in F. After the reparameterization, the loss term is identical to the loss term of a PRd model whose features are S̄. This explains the collapsed pair (ḡ, h̄) used in the theorem.

Next, we explore the new parameters and how they affect the penalty term. Consider a fixed pair a_Ḡ and T ⊆ [k] such that T ≠ ∅ and ‖a_Ḡ‖_1 + |T| ≤ d. The last condition is implicit for a set U to exist with U ∩ G_i ≠ ∅ for all i ∈ T and ‖a_Ḡ‖_1 + |U| ≤ d. Among all choices of U, we single out U = F_T and write

γ_{(a_Ḡ, 1_{F_T|F})} = ∑_{U⊆G: U∩G_i≠∅ ∀i∈T, ‖a_Ḡ‖_1+|U|≤d} R_{a_Ḡ,U}^⊤ θ_{(a_Ḡ, 1_{U|G})}
                    = θ_{(a_Ḡ, 1_{F_T|G})} + ∑_{F_T≠U⊆G: U∩G_i≠∅ ∀i∈T, ‖a_Ḡ‖_1+|U|≤d} R_{a_Ḡ,U}^⊤ θ_{(a_Ḡ, 1_{U|G})}.

Now we are ready to write the penalty term ‖θ‖² in terms of the new parameters γ and some "left-over" components of θ:

‖θ‖² = ∑_{‖a_V‖_1≤d} ‖θ_a‖²
     = ∑_{‖a_V‖_1≤d, ‖a_G‖_1=0} ‖θ_{a_V}‖² + ∑_{‖a_V‖_1≤d, ‖a_G‖_1>0} ‖θ_{a_V}‖²
     = ∑_{‖a_V‖_1≤d, ‖a_G‖_1=0} ‖θ_{a_V}‖² + ∑_{‖a_Ḡ‖_1=h, h<d} ∑_{T⊆[k], 0<|T|≤d−h} ∑_{U∈U(T,h)} ‖θ_{(a_Ḡ, 1_{U|G})}‖²
     = ∑_{‖b_S̄‖_1≤d, ‖b_F‖_1=0} ‖γ_{b_S̄}‖² + ∑_{‖a_Ḡ‖_1=h, h<d} ∑_{T⊆[k], 0<|T|≤d−h} ( ‖θ_{(a_Ḡ, 1_{F_T|G})}‖² + ∑_{W∈U(T,h), W≠F_T} ‖θ_{(a_Ḡ, 1_{W|G})}‖² )
     = ∑_{‖b_S̄‖_1≤d, ‖b_F‖_1=0} ‖γ_{b_S̄}‖²
       + ∑_{‖a_Ḡ‖_1=h, h<d} ∑_{T⊆[k], 0<|T|≤d−h} ‖ γ_{(a_Ḡ, 1_{F_T|F})} − ∑_{U∈U(T,h), U≠F_T} R_{a_Ḡ,U}^⊤ θ_{(a_Ḡ, 1_{U|G})} ‖²
       + ∑_{‖a_Ḡ‖_1=h, h<d} ∑_{T⊆[k], 0<|T|≤d−h} ∑_{W∈U(T,h), W≠F_T} ‖θ_{(a_Ḡ, 1_{W|G})}‖².

Next, for every W ∈ U(T, h) − {F_T}, we optimize out the parameter θ_{(a_Ḡ, 1_{W|G})} by noting that the new loss term does not depend on these parameters. Thus:

½ ∂J/∂θ_{(a_Ḡ, 1_{W|G})} = θ_{(a_Ḡ, 1_{W|G})} − R_{a_Ḡ,W} ( γ_{(a_Ḡ, 1_{F_T|F})} − ∑_{U∈U(T,‖a_Ḡ‖_1), U≠F_T} R_{a_Ḡ,U}^⊤ θ_{(a_Ḡ, 1_{U|G})} )
                         = θ_{(a_Ḡ, 1_{W|G})} − R_{a_Ḡ,W} θ_{(a_Ḡ, 1_{F_T|G})}.

Setting this partial derivative to 0, we obtain θ_{(a_Ḡ, 1_{W|G})} = R_{a_Ḡ,W} θ_{(a_Ḡ, 1_{F_T|G})}, which leads to

θ_{(a_Ḡ, 1_{F_T|G})} = γ_{(a_Ḡ, 1_{F_T|F})} − ∑_{U∈U(T,‖a_Ḡ‖_1), U≠F_T} R_{a_Ḡ,U}^⊤ θ_{(a_Ḡ, 1_{U|G})}
                     = γ_{(a_Ḡ, 1_{F_T|F})} − ∑_{U∈U(T,‖a_Ḡ‖_1), U≠F_T} R_{a_Ḡ,U}^⊤ R_{a_Ḡ,U} θ_{(a_Ḡ, 1_{F_T|G})}.

Moving and grouping, we obtain

( ⊗_{g∈Ḡ: a_g>0} I_g ⊗ ⊗_{i∈T} I_{f_i} + ∑_{U∈U(T,‖a_Ḡ‖_1), U≠F_T} R_{a_Ḡ,U}^⊤ R_{a_Ḡ,U} ) θ_{(a_Ḡ, 1_{F_T|G})} = γ_{(a_Ḡ, 1_{F_T|F})}.

The matrix on the left can be completely factorized, as follows:

⊗_{g∈Ḡ: a_g>0} I_g ⊗ ⊗_{i∈T} I_{f_i} + ∑_{U∈U(T,‖a_Ḡ‖_1), U≠F_T} R_{a_Ḡ,U}^⊤ R_{a_Ḡ,U}
= ∑_{U∈U(T,‖a_Ḡ‖_1)} R_{a_Ḡ,U}^⊤ R_{a_Ḡ,U}
= ∑_{U∈U(T,‖a_Ḡ‖_1)} ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} ⋆_{c∈U∩G_i} R_c )^⊤ ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} ⋆_{c∈U∩G_i} R_c )
= ∑_{U∈U(T,‖a_Ḡ‖_1)} ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} [⋆_{c∈U∩G_i} R_c]^⊤ ) ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} ⋆_{c∈U∩G_i} R_c )
= ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ∑_{U∈U(T,‖a_Ḡ‖_1)} ( ⊗_{i∈T} [⋆_{c∈U∩G_i} R_c]^⊤ ) ( ⊗_{i∈T} ⋆_{c∈U∩G_i} R_c )
= ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ∑_{U∈U(T,‖a_Ḡ‖_1)} ⊗_{i∈T} ( [⋆_{c∈U∩G_i} R_c]^⊤ [⋆_{c∈U∩G_i} R_c] )
= ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} ∑_{U∈U(T,‖a_Ḡ‖_1)} ( [⋆_{c∈U∩G_i} R_c]^⊤ [⋆_{c∈U∩G_i} R_c] )
= ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} B_{T,‖a_Ḡ‖_1,i},

where the last two steps use the definition of B_{T,‖a_Ḡ‖_1,i} in (48).

Consequently, we can completely optimize out the remaining θ-components, solving for them in terms of the components of γ:

θ_{(a_Ḡ, 1_{F_T|G})} = ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} B_{T,‖a_Ḡ‖_1,i}^{-1} ) γ_{(a_Ḡ, 1_{F_T|F})},

θ_{(a_Ḡ, 1_{U|G})} = R_{a_Ḡ,U} θ_{(a_Ḡ, 1_{F_T|G})}
 = R_{a_Ḡ,U} ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} B_{T,‖a_Ḡ‖_1,i}^{-1} ) γ_{(a_Ḡ, 1_{F_T|F})}
 = ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} ⋆_{c∈U∩G_i} R_c ) ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} B_{T,‖a_Ḡ‖_1,i}^{-1} ) γ_{(a_Ḡ, 1_{F_T|F})}
 = ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} [⋆_{c∈U∩G_i} R_c] B_{T,‖a_Ḡ‖_1,i}^{-1} ) γ_{(a_Ḡ, 1_{F_T|F})}.

Since B_{T,‖a_Ḡ‖_1,i} is a symmetric matrix, so is its inverse. For every U ∈ U(T, ‖a_Ḡ‖_1), we simplify the norm (below we abbreviate B = B_{T,‖a_Ḡ‖_1,i}, ⋆R = ⋆_{c∈U∩G_i} R_c, γ = γ_{(a_Ḡ, 1_{F_T|F})}, and ⊗I = ⊗_{w∈Ḡ: a_w>0} I_w):

‖θ_{(a_Ḡ, 1_{U|G})}‖²
= ‖ ( ⊗I ⊗ ⊗_{i∈T} [⋆R] B^{-1} ) γ ‖²
= ⟨ ( ⊗I ⊗ ⊗_{i∈T} [⋆R] B^{-1} ) γ, ( ⊗I ⊗ ⊗_{i∈T} [⋆R] B^{-1} ) γ ⟩
= ⟨ ( ⊗I ⊗ ⊗_{i∈T} [⋆R] B^{-1} )^⊤ ( ⊗I ⊗ ⊗_{i∈T} [⋆R] B^{-1} ) γ, γ ⟩
= ⟨ ( ⊗I ⊗ ⊗_{i∈T} B^{-1} [⋆R]^⊤ ) ( ⊗I ⊗ ⊗_{i∈T} [⋆R] B^{-1} ) γ, γ ⟩
= ⟨ ( ⊗I ⊗ ⊗_{i∈T} B^{-1} [⋆R]^⊤ [⋆R] B^{-1} ) γ, γ ⟩.

Thus, for a fixed T and a_Ḡ with ‖a_Ḡ‖_1 = h, we have

∑_{U∈U(T,h)} ‖θ_{(a_Ḡ, 1_{U|G})}‖²
= ∑_{U∈U(T,h)} ⟨ ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} B_{T,h,i}^{-1} [⋆_{c∈U∩G_i} R_c]^⊤ [⋆_{c∈U∩G_i} R_c] B_{T,h,i}^{-1} ) γ_{(a_Ḡ, 1_{F_T|F})}, γ_{(a_Ḡ, 1_{F_T|F})} ⟩
= ⟨ ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} B_{T,h,i}^{-1} ( ∑_{U∈U(T,h)} [⋆_{c∈U∩G_i} R_c]^⊤ [⋆_{c∈U∩G_i} R_c] ) B_{T,h,i}^{-1} ) γ_{(a_Ḡ, 1_{F_T|F})}, γ_{(a_Ḡ, 1_{F_T|F})} ⟩
= ⟨ ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} B_{T,h,i}^{-1} B_{T,h,i} B_{T,h,i}^{-1} ) γ_{(a_Ḡ, 1_{F_T|F})}, γ_{(a_Ḡ, 1_{F_T|F})} ⟩
= ⟨ ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} B_{T,h,i}^{-1} ) γ_{(a_Ḡ, 1_{F_T|F})}, γ_{(a_Ḡ, 1_{F_T|F})} ⟩.

We next write ‖θ‖² entirely in terms of the new parameters γ:

‖θ‖² = ∑_{‖a_V‖_1≤d} ‖θ_a‖²
     = ∑_{‖a_V‖_1≤d, ‖a_G‖_1=0} ‖θ_{a_V}‖² + ∑_{‖a_V‖_1≤d, ‖a_G‖_1>0} ‖θ_{a_V}‖²
     = ∑_{‖a_V‖_1≤d, ‖a_G‖_1=0} ‖θ_{a_V}‖² + ∑_{‖a_Ḡ‖_1=h, h<d} ∑_{T⊆[k], 0<|T|≤d−h} ∑_{U∈U(T,h)} ‖θ_{(a_Ḡ, 1_{U|G})}‖²
     = ∑_{‖b_S̄‖_1≤d, ‖b_F‖_1=0} ‖γ_{b_S̄}‖² + ∑_{‖a_Ḡ‖_1=h, h<d} ∑_{T⊆[k], 0<|T|≤d−h} ⟨ ( ⊗_{w∈Ḡ: a_w>0} I_w ⊗ ⊗_{i∈T} B_{T,h,i}^{-1} ) γ_{(a_Ḡ, 1_{F_T|F})}, γ_{(a_Ḡ, 1_{F_T|F})} ⟩,

which is exactly the penalty Ω(γ) in (52). This completes the proof.

5 Experiments

We report on the performance of learning regression models over the natural join of a real dataset used by LogicBlox in client retail applications.

We benchmark three variants of our system: DC assumes the categorical (discrete) features are already one-hot encoded; AC takes the plain input dataset and one-hot encodes the categorical features on the fly; AC+FD is AC extended to exploit the existing functional dependencies. We do not report in detail the exact performance of the implementation inside the LogicBlox system; it is up to a 5× factor slower than AC+FD for the experiments considered here.

Competitors. We report the performance of three open-source systems: MADlib [16] 1.8 (M) uses ols to compute the closed-form solution of polynomial regression models (M also supports gradient descent for generalized linear models, glm, but this is consistently slower than ols in our experiments, so we do not report it here); R [26] 3.0.2 uses lm (linear model), which is based on QR decomposition [10]; and libFM [28] 1.4.2 supports factorization machines.

The competitors come with strong limitations. M inherits the limitation of at most 1600 columns per relation from its PostgreSQL host. The MADlib one-hot encoder transforms a categorical variable with n distinct values into n columns; therefore, the number of distinct values across all categorical variables plus the number of continuous variables in the input data cannot exceed 1600. R limits the number of values in its data frames to 2³¹ − 1. There exist R packages, e.g., ff, which work around this limitation by storing data structures on disk and mapping only chunks of data into main memory. The biglm package can compute the regression model by processing one ff-chunk at a time. Chunking the data, however, can lead to rank deficiencies within chunks (feature interactions missing from chunks), which causes biglm to fail. Biglm fails in all our experiments due to this limitation, so we are unable to benchmark against it. For libFM, we used its more stable MCMC variant with a fixed number of runs (300); its SGD implementation requires a fixed learning rate α and does not converge. For all variants of our system, we use the adaptive learning rate outlined in Algorithm 1 and run until the parameters have converged with high accuracy (for FaMa models the maximum number of runs is 300).

Summary of findings. AC+FD is the fastest variant in our experiments. DC subsumes our earlier prototype F for learning linear regression over datasets with continuous features only [30]. The variants of our system are orders of magnitude faster than the competitors or finish successfully where the others exceed memory, time, or internal design limits. The performance gap is attributed to several key optimizations: (1) our system is an end-to-end in-database solution, as it performs the join together with the regression aggregates and therefore avoids the costly data export/import step at the interface between database systems and statistical packages (almost 50% of the time cost for R); (2) it avoids the join materialization (6% of the time cost for R); (3) it factorizes the computation of the aggregates and the underlying join (20× compression factor); (4) it massively shares the computation of large (up to 65M) sets of aggregates; (5) it decouples the computation of the aggregates on the input data from the parameter convergence step and thus avoids scanning the join result per iteration (we need on average 400 iterations); (6) it avoids the upfront one-hot encoding that comes with higher asymptotic complexity and prohibitively large covariance matrices, and only computes non-identical, non-zero matrix entries (for PR2 and our dataset v4, this leads to a 259× reduction factor in the number of aggregates to compute!); (7) it exploits functional dependencies in the input data to reduce the number of features of the model (3.5× improvement factor). None of our competitors employ all of these optimizations. Our earlier prototype F, which is subsumed by DC, supports (1) to (5). M supports (1) and (2) and does not need (5) as it computes the closed-form solution. R does not support any of these optimizations. libFM requires as input a zero-suppressed materialization of the join of the one-hot encoded input dataset.

Experimental Setup. All experiments were performed on an Intel(R) Core(TM) i7-4770 3.40GHz/64bit/32GB machine with Linux 3.13.0 and g++ 4.8.4; we also used an EC2 r4.4xlarge/122GB RAM/500GB SSD machine to prepare the data. We report wall-clock times by running each system once and then reporting the average of four subsequent runs with warm cache. We do not report the times to load the database into memory for the join, as they can differ substantially between the systems and are orthogonal to this work. All relations are given sorted by their join attributes.

Dataset. We experimented with a real-world dataset in the retail domain for forecasting user demands and sales. It has five relations, which provide information on: products and stores of the retailer; inventory for each product and store at different dates; weather conditions for each store at different dates; and competitors and demographics for each store. The natural join of these five relations is acyclic and has 43 attributes. The following 8 attributes correspond to categorical features in our regression models: sku, zip, category, subcategory, categoryCluster, snow, rain, thunder. We use the functional dependency sku → {category, subcategory, categoryCluster}. We design 4 fragments of our dataset with an increasing number of input categorical features. v1 has the last 6 categorical attributes; it was specifically tailored to work within the limitations of R. v2 computes the same model as v1 but over all rows in the data (5× larger than v1). v3 adds the zip categorical attribute to v2; v2 and v3 were designed to work within the limitations of M. Finally, v4 has all the attributes but zip; it is the only dataset on which the FD holds.

Table 1 gives the performance of the systems for computing the LR, PR2, and FaMa²₈ models that predict the amount of inventory units based on all other features.

Categorical features. As we move from v1/v2 to v4, we increase the number of categorical features by approximately 65× for LR (from 55 to 3.7K) and for PR2 and FaMa²₈ (from 2.4K to 154K). For LR, this increase only led to a 7× decrease in the performance of AC and at least 9× for M (we stopped M after 22 hours). For PR2, it yields a 13.7× performance decrease for AC. This behavior remains the same for AC's aggregate computation step with or without the convergence step, since the latter is dominated by the former by up to three orders of magnitude. This sub-linear behavior is partly explained by the ability of our system to process large sets of aggregates much faster in bulk than individually, and by the same-order increase in the number of aggregates: 65 (respectively 51) times more distinct non-zero aggregates in v4 vs. v2 for LR (respectively for PR2 and FaMa²₈).

For degree-2 models on v3, we observe that we have more aggregates than for v4, yet AC's performance is a factor of 5 to 10 better than on v4. The explanation is that v3 has the categorical variable zip, which does not occur in v4, and the interaction of all zip-derived categorical features with the other features only happens once, towards the end of the aggregate computation step; we exploit the good factorization of the computation in that zip is independent of many other categorical variables coming from different relations, and we process in bulk all interactions of zip's features with the features of those independent variables.

Database size. A fivefold increase in database size (and join result) from v1 to v2 leads to a similar decrease factor in performance for DC and AC on all models, since the number of features and aggregates stays roughly the same and the join is acyclic and processed in linear time. M's performance follows the same trend for LR, but it runs out of time (22 hours) on both datasets for degree-2 models. R cannot cope with the size increase due to internal design limitations; for v1, its performance is 34× worse than AC.

One-hot encoding vs. sparse tensor representation. Our competitors require the data to be one-hot encoded before ingestion. This, however, leads to a large number of zero and/or redundant entries in the covariance matrix. For instance, for v4 and PR2, the number of features is m = 154,595, and the upper half of the covariance matrix would then have m(m + 1)/2 ≈ 1.19 × 10¹⁰ entries! Most of these are either zero or repeating. In contrast, AC's sparse tensor representation only considers 46M non-zero and distinct aggregates. The reduction in the number of aggregates is on the order of 259×!

The static one-hot encoding took (in seconds): 28.42 for R on v1; 9.41 for DC on v1 and v2; 2 for M on v1 to v3; and slightly more than an hour for libFM, due to the expensive zero-suppression step.

Functional dependencies. The FD in our dataset v4 has a twofold effect on AC (no other system exploits FDs): it effectively reduces the number of features and aggregates needed to compute the model, which leads to better performance of the in-database precomputation step; yet it requires a more elaborate convergence step, due to the more complex penalty term, even though this is over fewer parameters. For LR, the aggregate step becomes 2.3× faster, while the convergence step takes 13× longer. Nevertheless, the convergence step takes at most 2% of the overall compute time in this case. For degree-2 models, the FD brings an improvement by a factor of 3.5× for PR2 and 3.87× for FaMa²₈. This is due to a 10% decrease in the number of categorical features, which leads to a 20% decrease in the number of categorical aggregates.

6 Acknowledgments

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement 682588. Olteanu also acknowledges Amazon Cloud Credits for Research and a Google Research Award. Ngo's work was supported by DARPA under agreement FA8750-15-2-0009. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright thereon. Nguyen is partially supported by grants NSF CAREER DMS-1351362 and NSF CNS-1409303.

References

[1] M. Aref, B. ten Cate, T. J. Green, B. Kimelfeld, D. Olteanu, E. Pasalic, T. L. Veldhuizen, and G. Washburn. Design and implementation of the LogicBlox system. In SIGMOD, pages 1371–1382, 2015.

[2] N. Bakibayev, T. Kocisky, D. Olteanu, and J. Zavodny. Aggregation and ordering in factorised databases. PVLDB, 6(14):1990–2001, 2013.

[3] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA J. Numer. Anal., 8(1):141–148, 1988.

[4] M. Boehm, S. Tatikonda, B. Reinwald, P. Sen, Y. Tian, D. Burdick, and S. Vaithyanathan. Hybrid parallelization strategies for large-scale machine learning in SystemML. PVLDB, 7(7):553–564, 2014.

[5] Z. Cai, Z. J. Gao, S. Luo, L. L. Perez, Z. Vagena, and C. M. Jermaine. A comparison of platforms for implementing and running very large scale machine learning algorithms. In SIGMOD, pages 1371–1382, 2014.

[6] T. A. Davis and W. W. Hager. Multiple-rank modifications of a sparse Cholesky factorization. SIAM J. Matrix Anal. Appl., 22(4):997–1013, 2001.

[7] T. Elgamal, S. Luo, M. Boehm, A. V. Evfimievski, S. Tatikonda, B. Reinwald, and P. Sen. SPOOF: Sum-product optimization and operator fusion for large-scale machine learning. In CIDR, 2017.

[8] X. Feng, A. Kumar, B. Recht, and C. Re. Towards a unified architecture for in-RDBMS analytics. In SIGMOD, pages 325–336, 2012.

[9] R. Fletcher. On the Barzilai-Borwein method. In Optimization and Control with Applications, volume 96 of Appl. Optim., pages 235–256. 2005.


[10] J. G. F. Francis. The QR transformation: A unitary analogue to the LR transformation – Part 1. The Computer Journal, 4(3):265–271, 1961.

[11] P. E. Gill, G. H. Golub, W. Murray, and M. A. Saunders. Methods for modifying matrix factorizations. Math. Comp., 28:505–535, 1974.

[12] T. Goldstein, C. Studer, and R. G. Baraniuk. A field guide to forward-backward splitting with a FASTA implementation. CoRR, abs/1411.3406, 2014.

[13] M. Grohe and D. Marx. Constraint solving via fractional edge covers. ACM Trans. Alg., 11(1):4, 2014.

[14] W. W. Hager. Updating the inverse of a matrix. SIAM Rev., 31(2):221–239, 1989.

[15] D. Harris and S. Harris. Digital Design and Computer Architecture. 2nd edition, 2012.

[16] J. M. Hellerstein et al. The MADlib analytics library or MAD skills, the SQL. PVLDB, 5(12):1700–1711, 2012.

[17] B. Huang, M. Boehm, Y. Tian, B. Reinwald, S. Tatikonda, and F. R. Reiss. Resource elasticity for large-scale machine learning. In SIGMOD, pages 137–152, 2015.

[18] M. A. Khamis, H. Q. Ngo, and A. Rudra. FAQ: Questions asked frequently. In PODS, pages 13–28, 2016.

[19] A. Kumar, J. F. Naughton, and J. M. Patel. Learning generalized linear models over normalized data. In SIGMOD, pages 1969–1984, 2015.

[20] A. Kumar, J. F. Naughton, J. M. Patel, and X. Zhu. To join or not to join?: Thinking twice about joins before feature selection. In SIGMOD, pages 19–34, 2016.

[21] X. Meng et al. MLlib: Machine learning in Apache Spark. J. Mach. Learn. Res., 17(1):1235–1241, 2016.

[22] D. Neumann. Lightning-fast deep learning on Spark via parallel stochastic gradient updates, www.deepdist.com, 2015.

[23] D. Olteanu and J. Zavodny. Size bounds for factorised representations of query results. TODS, 40(1):2, 2015.

[24] K. B. Petersen and M. S. Pedersen. The matrix cookbook, Nov. 2012. Version 20121115.

[25] C. Qin and F. Rusu. Speculative approximations for terascale distributed gradient descent optimization. In DanaC, pages 1:1–1:10, 2015.

[26] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, www.r-project.org, 2013.

[27] C. Re et al. Machine learning and databases: The sound of things to come or a cacophony of hype? In SIGMOD, pages 283–284, 2015.

[28] S. Rendle. Scaling factorization machines to relational data. PVLDB, 6(5):337–348, 2013.

[29] S. Schelter, J. Soto, V. Markl, D. Burdick, B. Reinwald, and A. V. Evfimievski. Efficient sample generation for scalable meta learning. In ICDE, pages 1191–1202, 2015.

[30] M. Schleich, D. Olteanu, and R. Ciucanu. Learning linear regression models over factorized joins. In SIGMOD, pages 3–18, 2016.

[31] M. Zaharia et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 15–28, 2012.


                                         v1            v2             v3               v4

Linear regression LR
Features      without FDs            33 + 55       33 + 55        33 + 1,340       33 + 3,702
(cont.+categ.)   with FDs            33 + 55       33 + 55        33 + 1,340       33 + 3,653
Aggregates    without FDs            595 + 2,418   595 + 2,421    595 + 111,549    595 + 157,735
(cont.+categ.)   with FDs            595 + 2,418   595 + 2,421    595 + 111,549    595 + 144,589
M (ols)       Learn                  1,898.35      8,855.11       > 79,200.00      –
R (qr)        Join (PSQL)            50.63         –              –                –
              Export/Import          308.83        –              –                –
              Learn                  490.13        –              –                –
DC            Aggregate              93.31         424.81         OOM              OOM
              Converge (runs)        0.01 (359)    0.01 (359)
AC            Aggregate              25.51         116.64         117.94           895.22
              Converge (runs)        0.02 (343)    0.02 (367)     0.42 (337)       0.66 (365)
AC+FD         Aggregate              same as AC (there are no FDs in v1–v3)        380.31
              Converge (runs)                                                      8.82 (366)
Speedup       AC+FD/M                74.36×        75.91×         > 669.14×        ∞
              AC+FD/R                33.28×        ∞              ∞                ∞
              AC+FD/DC               3.65×         3.64×          ∞                ∞
              AC+FD/AC               same as AC (there are no FDs in v1–v3)        2.30×

Polynomial regression of degree 2 PR2
Features      without FDs            562 + 2,363   562 + 2,366    562 + 110,209    562 + 154,033
(cont.+categ.)   with FDs            same as above (there are no FDs in v1–v3)     562 + 140,936
Aggregates    without FDs            158k + 742k   158k + 746k    158k + 65,875k   158k + 46,113k
(cont.+categ.)   with FDs            same as above (there are no FDs in v1–v3)     158k + 36,712k
M (ols)       Learn                  > 79,200.00   > 79,200.00    > 79,200.00      –
AC            Aggregate              132.43        517.40         820.57           7,012.84
              Converge (runs)        3.27 (321)    3.62 (365)     349.15 (400)     115.65 (200)
AC+FD         Aggregate              same as AC (there are no FDs in v1–v3)        1,819.80
              Converge (runs)                                                      219.51 (180)
Speedup       AC+FD/M                > 583.64×     > 152.01×      > 67.71×         ∞
              AC+FD/AC               same as AC (there are no FDs in v1–v3)        3.50×

Factorization machine of degree 2 and rank 8 FaMa²₈
Features      without FDs            530 + 2,363   530 + 2,366    530 + 110,209    530 + 154,033
(cont.+categ.)   with FDs            same as above (there are no FDs in v1–v3)     562 + 140,936
Aggregates    without FDs            140k + 740k   140k + 744k    140k + 65,832k   140k + 45,995k
(cont.+categ.)   with FDs            same as above (there are no FDs in v1–v3)     140k + 36,595k
libFM         Join (PSQL)            50.63         216.56         216.56           216.56
(MCMC)        Export/Import          412.84        1,462.54       3,096.90         3,368.06
              Learn (300 runs)       19,692.90     103,225.50     79,839.13        87,873.75
AC            Aggregate              128.97        498.79         772.42           6,869.47
              Converge (runs)        3.03 (300)    3.05 (300)     262.54 (300)     166.60 (300)
AC+FD         Aggregate              same as AC (there are no FDs in v1–v3)        1,672.83
              Converge (runs)                                                      144.07 (300)
Speedup       AC+FD/libFM            152.70×       209.03×        80.34×           50.33×
              AC+FD/AC               same as AC (there are no FDs in v1–v3)        3.87×

Table 1: Time performance comparison (seconds) for learning regression models over increasingly larger fragments (v1 to v4) of Retailer. The join size stays the same for versions v2 to v4 at 3,614,400,131 values as flat vs. 169,231,200 values as factorized (21.4× compression); v1 is a fifth of v2, with 774M values as flat vs. 36,929,272 values as factorized (20.96× compression). (–) means that the system failed to compute due to design limitations. R can only compute the LR model for v1; the other versions and models exceed the size limit of R's data frames. M cannot compute any model on v4 since the one-hot encoding has more than 1600 columns; M takes over 22 hours for PR2. R and M do not support factorization machines. Since libFM requires a fixed number of runs, all FaMa²₈ experiments are computed with max 300 runs.
