
In-Database Learning with Sparse Tensors

Mahmoud Abo Khamis, Hung Ngo, XuanLong Nguyen,

Dan Olteanu, and Maximilian Schleich

Toronto, October 2017

RelationalAI

Talk Outline

Current Landscape for DB+ML

What We Did So Far

Factorized Learning over Normalized Data

Learning under Functional Dependencies

Our Current Focus


Brief Outlook at Current Landscape for DB+ML (1/2)

No integration

• ML & DB distinct tools on the technology stack

• DB exports data as one table, ML imports it in own format

• Spark/PostgreSQL + R supports virtually any ML task

• Most DB+ML solutions seem to operate in this space

Loose integration

• Each ML task implemented by a distinct UDF inside DB

• Same running process for DB and ML

• DB computes one table, ML works directly on it

• MADlib supports a comprehensive library of ML UDFs


Brief Outlook at Current Landscape for DB+ML (2/2)

Unified programming architecture

• One framework for many ML tasks instead of one UDF per task, with possible code reuse across UDFs

• DB computes one table, ML works directly on it

• Bismark supports incremental gradient descent for convex programming; up to 100% overhead over specialized UDFs

Tight integration ⇒ In-Database Analytics

• One evaluation plan for both DB and ML workload; opportunity to push parts of ML tasks past joins

• Morpheus + Hamlet support GLM and naïve Bayes

• Our approach supports PR/FM with continuous & categorical features, decision trees, ...


In-Database Analytics

• Move the analytics, not the data

• Avoid expensive data export/import

• Exploit database technologies

• Build better models using larger datasets

• Cast analytics code as join-aggregate queries

• Many similar queries that massively share computation

• Fixpoint computation needed for model convergence


In-database vs. Out-of-database Analytics

[Figure: the out-of-database pipeline runs a feature extraction query over the DB, materializes its output, and hands it to an ML tool that produces θ*; the in-database pipeline reformulates the model into optimized join-aggregate queries (FAQ/FDB) whose results drive a gradient-descent trainer directly.]

Does It Pay Off?

Retailer dataset (records)              excerpt (17M)    full (86M)

Linear regression
  Features (cont+categ)                 33 + 55          33 + 3,653
  Aggregates (cont+categ)               595 + 2,418      595 + 145k
  MADlib         Learn                  1,898.35         > 24h
  R              Join (PSQL)            50.63            –
                 Export/Import          308.83           –
                 Learn                  490.13           –
  Our approach   Aggregate+Join         25.51            380.31
                 Converge (runs)        0.02 (343)       8.82 (366)

Polynomial regression degree 2
  Features (cont+categ)                 562 + 2,363      562 + 141k
  Aggregates (cont+categ)               158k + 742k      158k + 37M
  MADlib         Learn                  > 24h            –
  Our approach   Aggregate+Join         132.43           1,819.80
                 Converge (runs)        3.27 (321)       219.51 (180)

Talk Outline

Current Landscape for DB+ML

What We Did So Far

Factorized Learning over Normalized Data

Learning under Functional Dependencies

Our Current Focus


Unified In-Database Analytics for Optimization Problems

Our target: retail-planning and forecasting applications

• Typical databases: weekly sales, promotions, and products

• Training dataset: Result of a feature extraction query

• Task: Train model to predict additional demand generated for a product due to promotion

• Training algorithm for regression: batch gradient descent

  • Convergence rates are dimension-free

• ML tasks: ridge linear regression, degree-d polynomial regression, degree-d factorization machines; logistic regression, SVM; PCA.


Typical Retail Example

• Database I = (R1,R2,R3,R4,R5)

• Feature selection query Q:

Q(sku, store, color, city, country, unitsSold) ←
    R1(sku, store, day, unitsSold), R2(sku, color), R3(day, quarter), R4(store, city), R5(city, country).

• Free variables

  • Categorical (qualitative): F = {sku, store, color, city, country}

  • Continuous (quantitative): unitsSold

• Bound variables

  • Categorical (qualitative): B = {day, quarter}
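For concreteness, a minimal pandas sketch (added here, not part of the original slides) of the join an out-of-database pipeline would have to materialize for Q; the relation and column names follow the schema above, and the empty DataFrames stand in for the actual tables:

```python
import pandas as pd

# Stand-ins for R1..R5; column names follow the schema above.
R1 = pd.DataFrame(columns=["sku", "store", "day", "unitsSold"])
R2 = pd.DataFrame(columns=["sku", "color"])
R3 = pd.DataFrame(columns=["day", "quarter"])
R4 = pd.DataFrame(columns=["store", "city"])
R5 = pd.DataFrame(columns=["city", "country"])

# Natural join R1 ⋈ R2 ⋈ R3 ⋈ R4 ⋈ R5, then projection onto the free variables of Q.
D = (R1.merge(R2, on="sku")
       .merge(R3, on="day")
       .merge(R4, on="store")
       .merge(R5, on="city"))
D = D[["sku", "store", "color", "city", "country", "unitsSold"]]
```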


Typical Retail Example

• We learn the ridge linear regression model

  $\langle \theta, x \rangle = \sum_{f \in F} \langle \theta_f, x_f \rangle$

• Input data: D = Q(I)

• Feature vector x and response y = unitsSold.

• The parameters θ are obtained by minimizing the objective function:

  $J(\theta) = \underbrace{\frac{1}{2|D|} \sum_{(x,y) \in D} (\langle \theta, x \rangle - y)^2}_{\text{least-squares loss}} + \underbrace{\frac{\lambda}{2}\, \|\theta\|_2^2}_{\ell_2\text{-regularizer}}$


Side Note: One-hot Encoding of Categorical Variables

• Continuous variables are mapped to scalars

  • $y_{\text{unitsSold}} \in \mathbb{R}$

• Categorical variables are mapped to indicator vectors

  • country has categories vietnam and england

  • country is then mapped to an indicator vector $x_{\text{country}} = [x_{\text{vietnam}}, x_{\text{england}}]^\top \in \{0,1\}^2$

  • $x_{\text{country}} = [0, 1]^\top$ for a tuple with country = "england"

This encoding leads to wide training datasets and many 0s
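As a concrete illustration (added here, with assumed category lists), a minimal NumPy sketch of this encoding:

```python
import numpy as np

def one_hot(value, categories):
    """Map a categorical value to its indicator vector over `categories`."""
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

countries = ["vietnam", "england"]         # assumed category order
x_country = one_hot("england", countries)  # -> array([0., 1.])
```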


Side Note: Least Square Loss Function

Goal: Describe a linear relationship $\mathrm{fun}(x) = \theta_1 x + \theta_0$ so we can estimate new y values given new x values.

• We are given n data points $(x_i, y_i)_{i \in [n]}$ (the black points in the original figure)

• We would like to find a regression line $\mathrm{fun}(x)$ (red) such that the error $\sum_{i \in [n]} (\mathrm{fun}(x_i) - y_i)^2$ (green) is minimized
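A minimal NumPy sketch of this side note (added for concreteness, on made-up points): fit $\mathrm{fun}(x) = \theta_1 x + \theta_0$ by least squares.

```python
import numpy as np

# Synthetic data points (x_i, y_i).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.9, 5.1, 7.2])

# Design matrix [x, 1] so that fun(x) = theta1 * x + theta0.
A = np.stack([x, np.ones_like(x)], axis=1)

# Least-squares solution minimizes sum_i (fun(x_i) - y_i)^2.
(theta1, theta0), *_ = np.linalg.lstsq(A, y, rcond=None)
print(theta1, theta0)
```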


From Optimization to SumProduct FAQ Queries

We can solve $\theta^* := \arg\min_\theta J(\theta)$ by repeatedly updating θ in the direction of the gradient until convergence:

  $\theta := \theta - \alpha \cdot \nabla J(\theta).$

Define the matrix $\Sigma = (\sigma_{ij})_{i,j \in [|F|]}$, the vector $c = (c_i)_{i \in [|F|]}$, and the scalar $s_Y$:

  $\sigma_{ij} = \frac{1}{|D|} \sum_{(x,y) \in D} x_i x_j^\top \qquad c_i = \frac{1}{|D|} \sum_{(x,y) \in D} y \cdot x_i \qquad s_Y = \frac{1}{|D|} \sum_{(x,y) \in D} y^2.$

Then,

  $J(\theta) = \frac{1}{2|D|} \sum_{(x,y) \in D} (\langle \theta, x \rangle - y)^2 + \frac{\lambda}{2} \|\theta\|_2^2 = \frac{1}{2} \theta^\top \Sigma\, \theta - \langle \theta, c \rangle + \frac{s_Y}{2} + \frac{\lambda}{2} \|\theta\|_2^2$

  $\nabla J(\theta) = \Sigma\, \theta - c + \lambda \theta$
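A minimal NumPy sketch (an illustration added here, not the authors' implementation) of this batch gradient descent loop, assuming the training data D is already materialized as a dense matrix X with response vector y:

```python
import numpy as np

def train_ridge_gd(X, y, lam=1e-3, alpha=0.1, eps=1e-8, max_iters=10_000):
    """Batch gradient descent on J(theta) via the aggregates Sigma and c."""
    n_rows, n_feats = X.shape
    Sigma = X.T @ X / n_rows               # the sigma_ij aggregates
    c = X.T @ y / n_rows                   # the c_i aggregates
    theta = np.zeros(n_feats)
    for _ in range(max_iters):
        grad = Sigma @ theta - c + lam * theta   # grad J = Sigma theta - c + lambda theta
        theta -= alpha * grad
        if np.linalg.norm(grad) < eps:           # converged
            break
    return theta
```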


Expressing Σ, c, sY as SumProduct FAQ Queries

FAQ queries for $\sigma_{ij} = \frac{1}{|D|} \sum_{(x,y) \in D} x_i x_j^\top$ (without the factor $\frac{1}{|D|}$):

• $x_i$, $x_j$ continuous ⇒ no free variable

  $\psi_{ij} = \sum_{f \in F: a_f \in \mathrm{Dom}(x_f)} \;\; \sum_{b \in B: a_b \in \mathrm{Dom}(x_b)} a_i \cdot a_j \cdot \prod_{k \in [5]} 1_{R_k}(a_{S(R_k)})$

• $x_i$ categorical, $x_j$ continuous ⇒ one free variable

  $\psi_{ij}[a_i] = \sum_{f \in F - \{i\}: a_f \in \mathrm{Dom}(x_f)} \;\; \sum_{b \in B: a_b \in \mathrm{Dom}(x_b)} a_j \cdot \prod_{k \in [5]} 1_{R_k}(a_{S(R_k)})$

• $x_i$, $x_j$ categorical ⇒ two free variables

  $\psi_{ij}[a_i, a_j] = \sum_{f \in F - \{i,j\}: a_f \in \mathrm{Dom}(x_f)} \;\; \sum_{b \in B: a_b \in \mathrm{Dom}(x_b)} \prod_{k \in [5]} 1_{R_k}(a_{S(R_k)})$


Expressing Σ, c, sY as SQL Queries

SQL queries for $\sigma_{ij} = \frac{1}{|D|} \sum_{(x,y) \in D} x_i x_j^\top$ (without the factor $\frac{1}{|D|}$):

• $x_i$, $x_j$ continuous ⇒ no group-by attribute

  SELECT SUM(xi*xj) FROM D;

• $x_i$ categorical, $x_j$ continuous ⇒ one group-by attribute

  SELECT xi, SUM(xj) FROM D GROUP BY xi;

• $x_i$, $x_j$ categorical ⇒ two group-by attributes

  SELECT xi, xj, SUM(1) FROM D GROUP BY xi, xj;

This query encoding avoids the drawbacks of one-hot encoding.
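For illustration (an added sketch, assuming the training dataset D has been materialized as a pandas DataFrame whose columns are named after the features), the same three aggregate shapes can be written as:

```python
import pandas as pd

def sigma_aggregate(D: pd.DataFrame, xi: str, xj: str, categorical: set):
    """Compute the sigma_ij aggregate for one pair of features, mirroring the SQL above."""
    if xi not in categorical and xj not in categorical:   # both continuous
        return (D[xi] * D[xj]).sum()
    if xi in categorical and xj not in categorical:       # one group-by attribute
        return D.groupby(xi)[xj].sum()
    return D.groupby([xi, xj]).size()                     # two group-by attributes
```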

Side Note: Factorized Learning over Normalized Data

Idea: Avoid Redundant Computation for DB Join and ML

Realized to varying degrees in the literature

• Rendle (libFM): Discover repeating blocks in the materialized join and then compute ML once for all

  • Same complexity as join materialization!

  • NP-hard to (re)discover join dependencies!

• Kumar (Morpheus): Push down ML aggregates to each input tuple, then join tables and combine aggregates

  • Same complexity as listing materialization of join results!

• Our approach: Morpheus + factorize the join to avoid expensive Cartesian products in join computation

  • Arbitrarily lower complexity than join materialization


Model Reparameterization using Functional Dependencies

Consider the functional dependency city → country and

• country categories: vietnam, england

• city categories: saigon, hanoi, oxford, leeds, bristol

The one-hot encoding enforces the following identities:

• $x_{\text{vietnam}} = x_{\text{saigon}} + x_{\text{hanoi}}$

  country is vietnam ⇒ city is either saigon or hanoi

  $x_{\text{vietnam}} = 1 \Rightarrow$ either $x_{\text{saigon}} = 1$ or $x_{\text{hanoi}} = 1$

• $x_{\text{england}} = x_{\text{oxford}} + x_{\text{leeds}} + x_{\text{bristol}}$

  country is england ⇒ city is either oxford, leeds, or bristol

  $x_{\text{england}} = 1 \Rightarrow$ either $x_{\text{oxford}} = 1$ or $x_{\text{leeds}} = 1$ or $x_{\text{bristol}} = 1$


Model Reparameterization using Functional Dependencies

• Identities due to one-hot encoding:

  $x_{\text{vietnam}} = x_{\text{saigon}} + x_{\text{hanoi}}$

  $x_{\text{england}} = x_{\text{oxford}} + x_{\text{leeds}} + x_{\text{bristol}}$

• Encode $x_{\text{country}}$ as $x_{\text{country}} = R\, x_{\text{city}}$, where

  $R = \begin{array}{c|ccccc} & \text{saigon} & \text{hanoi} & \text{oxford} & \text{leeds} & \text{bristol} \\ \hline \text{vietnam} & 1 & 1 & 0 & 0 & 0 \\ \text{england} & 0 & 0 & 1 & 1 & 1 \end{array}$

  For instance, if city is saigon, i.e., $x_{\text{city}} = [1, 0, 0, 0, 0]^\top$, then country is vietnam, i.e., $x_{\text{country}} = R\, x_{\text{city}} = [1, 0]^\top$:

  $\begin{bmatrix} 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$

Model Reparameterization using Functional Dependencies

• Functional dependency: city → country

• $x_{\text{country}} = R\, x_{\text{city}}$

• Replace all occurrences of $x_{\text{country}}$ by $R\, x_{\text{city}}$:

  $\sum_{f \in F - \{\text{city}, \text{country}\}} \langle \theta_f, x_f \rangle + \langle \theta_{\text{country}}, x_{\text{country}} \rangle + \langle \theta_{\text{city}}, x_{\text{city}} \rangle$

  $= \sum_{f \in F - \{\text{city}, \text{country}\}} \langle \theta_f, x_f \rangle + \langle \theta_{\text{country}}, R\, x_{\text{city}} \rangle + \langle \theta_{\text{city}}, x_{\text{city}} \rangle$

  $= \sum_{f \in F - \{\text{city}, \text{country}\}} \langle \theta_f, x_f \rangle + \big\langle \underbrace{R^\top \theta_{\text{country}} + \theta_{\text{city}}}_{\gamma_{\text{city}}},\; x_{\text{city}} \big\rangle$

• We avoid computing aggregates over $x_{\text{country}}$.

• We reparameterize and ignore the parameters $\theta_{\text{country}}$.

• What about the penalty term in the loss function?


Model Reparameterization using Functional Dependencies

• Functional dependency: city → country

• $x_{\text{country}} = R\, x_{\text{city}} \qquad \gamma_{\text{city}} = R^\top \theta_{\text{country}} + \theta_{\text{city}}$

• Rewrite the penalty term:

  $\|\theta\|_2^2 = \sum_{j \neq \text{city}} \|\theta_j\|_2^2 + \big\| \gamma_{\text{city}} - R^\top \theta_{\text{country}} \big\|_2^2 + \|\theta_{\text{country}}\|_2^2$

• Optimize out $\theta_{\text{country}}$ by expressing it in terms of $\gamma_{\text{city}}$:

  $\theta_{\text{country}} = (I_{\text{country}} + R R^\top)^{-1} R\, \gamma_{\text{city}} = R\, (I_{\text{city}} + R^\top R)^{-1} \gamma_{\text{city}}$

• The penalty term becomes

  $\|\theta\|_2^2 = \sum_{j \neq \text{city}} \|\theta_j\|_2^2 + \big\langle (I_{\text{city}} + R^\top R)^{-1} \gamma_{\text{city}},\; \gamma_{\text{city}} \big\rangle$
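A small NumPy sketch (added for illustration; the parameter values are made up) of this reparameterization, using the city/country matrix R from the earlier slide:

```python
import numpy as np

# R maps one-hot city vectors to one-hot country vectors (rows: vietnam, england).
R = np.array([[1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1]], dtype=float)

x_city = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # saigon
x_country = R @ x_city                          # -> [1., 0.], i.e. vietnam

# Suppose we learned the combined parameter gamma_city = R^T theta_country + theta_city.
gamma_city = np.array([0.3, 0.1, -0.2, 0.4, 0.0])

# Recover theta_country and theta_city as on this slide.
I_city = np.eye(R.shape[1])
theta_country = R @ np.linalg.solve(I_city + R.T @ R, gamma_city)
theta_city = gamma_city - R.T @ theta_country
```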

Side Note: Learning over Normalized Data with FDs

Hamlet & Hamlet++

• Linear classifiers (Naïve Bayes): model accuracy is unlikely to be affected if we drop a few functionally determined features

• Use a simple decision rule: fkeys/key > 20?

• Hamlet++ shows experimentally that this idea does not work for more interesting classifiers, e.g., decision trees

Our approach

• Given the model A to learn, we map it to a much smaller model B without the functionally determined features in A

• Learning B can be orders of magnitude faster than learning A

• Once B is learned, we map it back to A


General Problem Formulation

We want to solve $\theta^* := \arg\min_\theta J(\theta)$, where

  $J(\theta) := \sum_{(x,y) \in D} \mathcal{L}\big(\langle g(\theta), h(x) \rangle, y\big) + \Omega(\theta).$

• $\theta = (\theta_1, \ldots, \theta_p) \in \mathbb{R}^p$ are the parameters

• functions $g : \mathbb{R}^p \to \mathbb{R}^m$ and $h : \mathbb{R}^n \to \mathbb{R}^m$

  • $g = (g_j)_{j \in [m]}$ is a vector of multivariate polynomials

  • $h = (h_j)_{j \in [m]}$ is a vector of multivariate monomials

• $\mathcal{L}$ is a loss function, $\Omega$ is the regularizer

• $D$ is the training dataset with features $x$ and response $y$.

Problems: ridge linear regression, polynomial regression, factorization machines; logistic regression, SVM; PCA.
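For concreteness, a minimal sketch (my own illustration, not the authors' code) of this general formulation, with g, h, the loss, and the regularizer passed as plain Python callables:

```python
import numpy as np

def objective(theta, D, g, h, loss, regularizer):
    """J(theta) = sum over (x, y) in D of loss(<g(theta), h(x)>, y) + regularizer(theta)."""
    total = sum(loss(np.dot(g(theta), h(x)), y) for x, y in D)
    return total + regularizer(theta)

# Ridge linear regression (up to the 1/|D| normalization used on the next slide)
# is the special case g = h = identity, squared loss, and an l2 regularizer.
def ridge_objective(theta, D, lam):
    return objective(
        theta, D,
        g=lambda t: t,
        h=lambda x: x,
        loss=lambda gamma, y: 0.5 * (gamma - y) ** 2,
        regularizer=lambda t: 0.5 * lam * np.dot(t, t),
    )
```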


Special Case: Ridge Linear Regression

Under

• square loss $\mathcal{L}$, $\ell_2$-regularization,

• data points $x = (x_0, x_1, \ldots, x_n, y)$,

• $p = n + 1$ parameters $\theta = (\theta_0, \ldots, \theta_n)$,

• $x_0 = 1$ corresponds to the bias parameter $\theta_0$,

• identity functions $g$ and $h$,

we obtain the following formulation for ridge linear regression:

  $J(\theta) := \frac{1}{2|D|} \sum_{(x,y) \in D} (\langle \theta, x \rangle - y)^2 + \frac{\lambda}{2} \|\theta\|_2^2.$
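Since the aggregates Σ and c from the earlier slide fully determine this objective, a small added NumPy sketch can also solve the ridge problem directly from them instead of iterating:

```python
import numpy as np

def ridge_from_aggregates(Sigma, c, lam):
    """Minimize (1/2) theta^T Sigma theta - <theta, c> + (lam/2) ||theta||^2
    by solving the normal equations (Sigma + lam * I) theta = c."""
    d = Sigma.shape[0]
    return np.linalg.solve(Sigma + lam * np.eye(d), c)
```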


Special Case: Degree-d Polynomial Regression

Under

• square loss $\mathcal{L}$, $\ell_2$-regularization,

• data points $x = (x_0, x_1, \ldots, x_n, y)$,

• $p = m = 1 + n + n^2 + \cdots + n^d$ parameters $\theta = (\theta_{\mathbf{a}})$, where $\mathbf{a} = (a_1, \ldots, a_n)$ is a tuple of non-negative integers such that $\|\mathbf{a}\|_1 \le d$,

• the components of $h$ are given by $h_{\mathbf{a}}(x) = \prod_{i=1}^{n} x_i^{a_i}$,

• $g(\theta) = \theta$,

we obtain the following formulation for polynomial regression:

  $J(\theta) := \frac{1}{2|D|} \sum_{(x,y) \in D} (\langle g(\theta), h(x) \rangle - y)^2 + \frac{\lambda}{2} \|\theta\|_2^2.$
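An added sketch of the degree-d monomial feature map h above; it enumerates each distinct monomial once (the slide's count 1 + n + ⋯ + n^d counts ordered index tuples), which is enough for illustration:

```python
import itertools
import numpy as np

def monomial_features(x, d):
    """All monomials prod_i x_i^{a_i} with ||a||_1 <= d, including the constant 1."""
    n = len(x)
    feats = []
    for degree in range(d + 1):
        # each multiset of indices of size `degree` is one exponent vector a
        for idx in itertools.combinations_with_replacement(range(n), degree):
            feats.append(np.prod([x[i] for i in idx]) if idx else 1.0)
    return np.array(feats)

# degree-2 example on a 3-dimensional point: 1, x1, x2, x3, x1^2, x1*x2, ...
print(monomial_features(np.array([1.0, 2.0, 3.0]), d=2))
```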


Special Case: Factorization Machines

Under

• square loss $\mathcal{L}$, $\ell_2$-regularization,

• data points $x = (x_0, x_1, \ldots, x_n, y)$,

• $p = 1 + n + r \cdot n$ parameters,

• $m = 1 + n + \binom{n}{2}$ features,

we obtain the following formulation for degree-2 rank-$r$ factorization machines:

  $J(\theta) := \frac{1}{2|D|} \sum_{(x,y) \in D} \Bigg( \sum_{i=0}^{n} \theta_i x_i + \sum_{\{i,j\} \in \binom{[n]}{2}} \sum_{\ell \in [r]} \theta_i^{(\ell)} \theta_j^{(\ell)} x_i x_j - y \Bigg)^{2} + \frac{\lambda}{2} \|\theta\|_2^2.$
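An added NumPy sketch of the degree-2, rank-r factorization machine score inside the squared loss above (V stacks the factor vectors $\theta_i^{(\ell)}$; the names are mine):

```python
import numpy as np

def fm_score(x, theta_linear, V):
    """sum_{i=0..n} theta_i x_i + sum_{i<j in [n]} <V[i], V[j]> x_i x_j.

    x: length n+1 vector with x[0] = 1 (bias);
    theta_linear: length n+1 linear parameters;
    V: (n+1) x r factor matrix, row 0 unused by the pairwise term.
    """
    n = len(x) - 1
    score = theta_linear @ x
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            score += (V[i] @ V[j]) * x[i] * x[j]
    return score
```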


Special Case: Classifiers

• Typically, the regularizer is $\frac{\lambda}{2} \|\theta\|_2^2$

• The response is binary: $y \in \{\pm 1\}$

• The loss function $\mathcal{L}(\gamma, y)$, where $\gamma := \langle g(\theta), h(x) \rangle$, is

  • $\mathcal{L}(\gamma, y) = \max\{1 - y\gamma,\, 0\}$ for support vector machines,

  • $\mathcal{L}(\gamma, y) = \log(1 + e^{-y\gamma})$ for logistic regression,

  • $\mathcal{L}(\gamma, y) = e^{-y\gamma}$ for AdaBoost.
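The three losses written out as plain Python functions (added here; gamma is the score ⟨g(θ), h(x)⟩ and y ∈ {−1, +1}):

```python
import math

def hinge_loss(gamma, y):        # support vector machines
    return max(1.0 - y * gamma, 0.0)

def logistic_loss(gamma, y):     # logistic regression
    return math.log(1.0 + math.exp(-y * gamma))

def exponential_loss(gamma, y):  # AdaBoost
    return math.exp(-y * gamma)
```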


Zoom-in: In-database vs. Out-of-database Learning

[Figure: the out-of-database path materializes the feature extraction query over R1 ⋈ · · · ⋈ Rk as a dataset D of features x and response y, which an ML tool consumes to produce θ*. The in-database path reformulates the model into aggregate queries σ11, ..., σij, ..., c1, ..., passes them to the query optimizer, and evaluates them with factorized query evaluation (cost bounded by N^faqw rather than |D|); the resulting Σ and c feed a gradient-descent loop that evaluates J(θ) and ∇J(θ) via g and h until convergence, yielding θ*.]

Talk Outline

Current Landscape for DB+ML

What We Did So Far

Factorized Learning over Normalized Data

Learning under Functional Dependencies

Our Current Focus


Our Current Focus

• MultiFAQ: principled approach to computing many FAQs over the same hypertree decomposition

  • Asymptotically lower complexity than computing each FAQ independently

  • Applications: regression, decision trees, frequent itemsets

• SGD using sampling from factorized joins

  • Applications: regression, decision trees, frequent itemsets

• In-DB linear algebra

  • Generalization of the current effort; add support for efficient matrix operations, e.g., inversion


Thank you!
