
Accurate Intelligible Models with Pairwise Interactions

Yin Lou, Dept. of Computer Science, Cornell University

Rich Caruana, Microsoft Research, Microsoft Corporation

Johannes Gehrke, Dept. of Computer Science, Cornell University

Giles Hooker, Dept. of Statistical Science, Cornell University

ABSTRACT

Standard generalized additive models (GAMs) usually model the dependent variable as a sum of univariate models. Although previous studies have shown that standard GAMs can be interpreted by users, their accuracy is significantly less than more complex models that permit interactions.

In this paper, we suggest adding selected terms of interacting pairs of features to standard GAMs. The resulting models, which we call GA2M-models, for Generalized Additive Models plus Interactions, consist of univariate terms and a small number of pairwise interaction terms. Since these models only include one- and two-dimensional components, the components of GA2M-models can be visualized and interpreted by users. To explore the huge (quadratic) number of pairs of features, we develop a novel, computationally efficient method called FAST for ranking all possible pairs of features as candidates for inclusion into the model.

In a large-scale empirical study, we show the effectiveness of FAST in ranking candidate pairs of features. In addition, we show the surprising result that GA2M-models have almost the same performance as the best full-complexity models on a number of real datasets. Thus this paper postulates that for many problems, GA2M-models can yield models that are both intelligible and accurate.

Categories and Subject Descriptors

I.2.6 [Computing Methodologies]: Learning—Induction

Keywords

classification, regression, interaction detection

1. INTRODUCTION

Many machine learning techniques such as boosted or bagged trees, SVMs with RBF kernels, or deep neural nets are powerful classification and regression models for high-dimensional prediction problems. However, due to their complexity, the resulting models are hard to interpret for the user. But in many applications, intelligibility is as important as accuracy [19], and thus building models that users can understand is a crucial requirement.

Generalized additive models (GAMs) are the gold standard for intelligibility when only univariate terms are considered [13, 19]. Standard GAMs have the form

g(E[y]) = \sum_i f_i(x_i),    (1)

where g is the link function. Standard GAMs are easy to interpret since users can visualize the relationship between the univariate terms of the GAM and the dependent variable through a plot of f_i(x_i) vs. x_i. However, there is unfortunately a significant gap between the performance of the best standard GAMs and full-complexity models [19]. In particular, Equation 1 does not model any interactions between features, and it is this limitation that lies at the core of the lack of accuracy of standard GAMs as compared to full-complexity models.

Example 1. Consider the function F(x) = \log(x_1^2 x_3) + x_2 x_3. F has a pairwise interaction (x_2, x_3), but no interactions between (x_1, x_2) or (x_1, x_3), since \log(x_1^2 x_3) = 2\log(x_1) + \log(x_3), which is additive.

Our first contribution in this paper is to build models that are more powerful than GAMs, but are still intelligible. We observe that two-dimensional interactions can still be rendered as heatmaps of f_{ij}(x_i, x_j) on the two-dimensional (x_i, x_j)-plane, and thus a model that includes only one- and two-dimensional components is still intelligible. Therefore, in this paper we propose building models of the form

g(E[y]) = \sum_i f_i(x_i) + \sum_{i,j} f_{ij}(x_i, x_j);    (2)

we call the resulting model class Generalized Additive Models plus Interactions, or GA2Ms for short.
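To make the model form in Equation 2 concrete, here is a minimal sketch of evaluating a fitted GA2M on discretized (binned) features. The data layout (per-feature lookup arrays f_uni, per-pair lookup matrices f_pair) and the function name are illustrative assumptions, not the authors' implementation.

import numpy as np

# Evaluate g(E[y]) = sum_i f_i(x_i) + sum_ij f_ij(x_i, x_j) on binned features.
# f_uni[i]: 1-D array of shape-function values per bin of feature i (hypothetical layout).
# f_pair[(i, j)]: 2-D array of values per bin pair (hypothetical layout).
def ga2m_predict(X_binned, f_uni, f_pair, intercept=0.0):
    score = np.full(X_binned.shape[0], intercept, dtype=float)
    for i, shape in enumerate(f_uni):                     # univariate terms f_i(x_i)
        score += shape[X_binned[:, i]]
    for (i, j), shape in f_pair.items():                  # pairwise terms f_ij(x_i, x_j)
        score += shape[X_binned[:, i], X_binned[:, j]]
    return score                                          # apply the inverse link if needed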

The main challenge in building GA2Ms is the large number of pairs of features to consider. We thus only want to include “true” interactions that pass some statistical test. To this end, we focus on problems with up to thousands of features, since for truly high-dimensional problems (e.g., millions of features), it is almost intractable to test all possible pairwise interactions (e.g., trillions of feature pairs).

Existing approaches for detecting statistical interactions can be divided into two classes. One class of methods directly models and compares the interaction effects and additive effects [10, 11, 18, 25]. One drawback of these methods is that spurious interactions may be reported over low-density regions [15]. The second class of methods measures the performance drop in the model if a certain interaction is not included; they compare the performance between restricted and unrestricted models, where restricted models are not allowed to model the interaction in question [22]. Although this class of methods does not suffer from the problem of low-density regions, it is computationally extremely expensive even for pairwise interaction detection.

Our second contribution in this paper is to scale the construction of GA2Ms by proposing a novel, extremely efficient method called FAST to measure and rank the strength of the interaction of all pairs of variables. Our experiments show that FAST can efficiently rank all pairwise interactions close to a ground-truth ranking.

Our third contribution is an extensive empirical evaluation of GA2M-models. Surprisingly, on many of the datasets included in our study, the performance of GA2M-models is close to and sometimes better than the performance of full-complexity models. These results indicate that GA2M-models not only make a significant step in improving accuracy over standard GAMs, but in some cases actually come all the way up to the performance of full-complexity models. This performance may be due to the difficulty of estimating intrinsically high-dimensional functions from limited data, suggesting that the bias associated with the GA2M structure is outweighed by a drop in variance. We also demonstrate that the resulting models are intelligible through a case study.

In this paper we make the following contributions:

• We introduce the model class GA2M.

• We introduce our new method FAST for efficient interaction detection. (Section 4)

• We show through an extensive experimental evaluation that (1) GA2Ms have accuracy comparable to full-complexity models; (2) FAST accurately ranks interactions as compared to a gold standard; and (3) FAST is computationally efficient. (Section 5)

We start with a problem definition and a survey of related work in Sections 2 and 3.

2. PROBLEM DEFINITION

Let D = {(x_i, y_i)}_1^N denote a dataset of size N, where x_i = (x_{i1}, ..., x_{in}) is a feature vector with n features and y_i is the response. Let x = (x_1, ..., x_n) denote the variables or features in the dataset. For u ⊆ {1, ..., n}, we denote by x_u the subset of variables whose indices are in u. Similarly, x_{-u} will indicate the variables with indices not in u. To simplify notation, we denote U_1 = {{i} | 1 ≤ i ≤ n}, U_2 = {{i, j} | 1 ≤ i < j ≤ n}, and U = U_1 ∪ U_2, i.e., U contains all indices for all features and pairs of features.

For any u ∈ U, let H_u denote the Hilbert space of Lebesgue-measurable functions f_u(x_u) such that E[f_u] = 0 and E[f_u^2] < ∞, equipped with the inner product ⟨f_u, f'_u⟩ = E[f_u f'_u]. Let H_1 = \sum_{u ∈ U_1} H_u denote the Hilbert space of functions that have the additive form F(x) = \sum_{u ∈ U_1} f_u(x_u) on univariate components; we call those components shape functions [19]. Similarly, let H = \sum_{u ∈ U} H_u denote the Hilbert space of functions of x = (x_1, ..., x_n) that have the additive form F(x) = \sum_{u ∈ U} f_u(x_u) on both one- and two-dimensional shape functions. Models described by sums of low-order components are called generalized additive models (GAMs), and in the remainder of the paper, we use GAMs to denote models that only consist of univariate terms.

We want to find the best model F ∈ H that minimizes the following objective function:

\min_{F ∈ H} E[L(y, F(x))],    (3)

where L(·, ·) is a non-negative convex loss function. When L is the squared loss, our problem becomes a regression problem, and if L is the logistic loss function, we are dealing with a classification problem.
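As a minimal illustration of the two losses referenced in Equation 3 (the {-1, +1} label convention for the logistic loss is an assumption of this sketch, not stated in the paper):

import numpy as np

# The two instances of L(., .) mentioned above; illustrative only.
def squared_loss(y, f):                  # regression
    return 0.5 * (y - f) ** 2

def logistic_loss(y, f):                 # classification with labels y in {-1, +1}
    return np.log1p(np.exp(-y * f))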

3. EXISTING APPROACHES

3.1 Fitting Generalized Additive Models

Terms in GAMs can be represented by a variety of functions, including splines [24], regression trees, or tree ensembles [9]. There are two popular methods of fitting GAMs: backfitting [13] and gradient boosting [10]. When the shape functions are splines, fitting GAMs reduces to fitting generalized linear models with different bases, which can be solved by least squares or iteratively reweighted least squares [25].

Spline-based methods become inefficient when modeling higher-order interactions because the number of parameters to estimate grows exponentially; tree-based methods are more suitable in this case. Standard additive modeling only involves modeling individual features (also called feature shaping). Previous research showed that gradient boosting with ensembles of shallow regression trees is the most accurate method among a number of alternatives [19].
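As a minimal sketch of the feature-shaping approach just mentioned (gradient boosting of a shallow tree on one feature at a time), with hyperparameters, the function name, and the round-robin schedule being illustrative assumptions rather than the exact procedure of [19]:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Fit a GAM for squared loss by boosting a shallow tree on one feature at a time.
def fit_gam_boosted(X, y, n_rounds=100, learning_rate=0.1, max_leaves=4):
    n, d = X.shape
    shapes = [[] for _ in range(d)]              # per-feature list of (tree, learning_rate)
    pred = np.zeros(n)
    for _ in range(n_rounds):
        for j in range(d):                       # shape feature j on the current residual
            residual = y - pred
            tree = DecisionTreeRegressor(max_leaf_nodes=max_leaves)
            tree.fit(X[:, [j]], residual)
            pred += learning_rate * tree.predict(X[:, [j]])
            shapes[j].append((tree, learning_rate))
    return shapes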

3.2 Interaction Detection

In this section, we briefly review existing approaches to interaction detection.

ANOVA. An additive model is fit with all pairwise interaction terms [13], and the significance of the interaction terms is measured through an analysis of variance (ANOVA) test [25]. The corresponding p-value for each pair can then be computed; however, this requires the computation of the full model, which is prohibitively expensive.

Partial Dependence Function. Friedman and Popescu proposed the following statistic to measure the strength of pairwise interactions:

H_{ij}^2 = \frac{\sum_{k=1}^{N} [F_{ij}(x_{ki}, x_{kj}) - F_i(x_{ki}) - F_j(x_{kj})]^2}{\sum_{k=1}^{N} F_{ij}^2(x_{ki}, x_{kj})},    (4)

where F_u(x_u) = E_{x_{-u}}[F(x_u, x_{-u})] is the partial dependence function (PDF) [10, 11] and F is a complex multi-dimensional function learned on the dataset. Computing F_u(x_u) on the whole dataset is expensive, so one often specifies a subset of size m on which to compute F_u(x_u). The complexity is then O(m^2). However, since partial dependence functions are computed based on uniform sampling, they may detect spurious interactions over low-density regions [15].
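A minimal sketch of Equation 4 on a subsample of size m, assuming a fitted black-box `model` with a `predict` method that accepts a 2-D array; the centering of the partial dependence values and the sampling scheme are illustrative choices, not the authors' exact procedure.

import numpy as np

def partial_dependence(model, X, cols, grid_rows):
    # Average prediction with the features in `cols` clamped to each row of `grid_rows`.
    vals = np.empty(len(grid_rows))
    for t, row in enumerate(grid_rows):
        Xc = X.copy()
        Xc[:, cols] = row
        vals[t] = model.predict(Xc).mean()
    return vals - vals.mean()                     # center so the PD function has zero mean

def h_squared(model, X, i, j, m=100):
    S = X[np.random.choice(len(X), size=min(m, len(X)), replace=False)]
    F_ij = partial_dependence(model, S, [i, j], S[:, [i, j]])
    F_i = partial_dependence(model, S, [i], S[:, [i]])
    F_j = partial_dependence(model, S, [j], S[:, [j]])
    num = np.sum((F_ij - F_i - F_j) ** 2)
    den = np.sum(F_ij ** 2)
    return num / den if den > 0 else 0.0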

GUIDE. GUIDE tests pairwise interactions based on the χ² test [18]. An additive model F is fit in H_1 and residuals are obtained. To detect interactions for (x_i, x_j), GUIDE divides the (x_i, x_j)-space into four quadrants by splitting the range of each variable into two halves at the sample median. Then GUIDE constructs a 2×4 contingency table using the residual signs as rows and the quadrants as columns. The cell values in the table are the number of “+”s and “-”s in each quadrant. These counts permit the computation of a p-value to measure the interaction strength of a pair. While this might be more robust to outliers, in practice it is less powerful than the method we propose.
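A minimal sketch of the test just described, using scipy's chi-square test on the 2×4 table; it assumes both residual signs and all four quadrants actually occur, and is an illustration rather than the GUIDE implementation.

import numpy as np
from scipy.stats import chi2_contingency

def guide_pvalue(xi, xj, residual):
    qi = (xi > np.median(xi)).astype(int)         # which half of x_i
    qj = (xj > np.median(xj)).astype(int)         # which half of x_j
    quadrant = 2 * qi + qj                        # quadrant index 0..3 (columns)
    sign = (residual >= 0).astype(int)            # residual sign (rows)
    table = np.zeros((2, 4))
    for s, q in zip(sign, quadrant):
        table[s, q] += 1
    _, p_value, _, _ = chi2_contingency(table)    # small p-value = strong interaction
    return p_value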

Grove. Sorokina et al. proposed a grove-based method to detect statistical interactions [22]. To measure the strength of a pair (x_i, x_j), they build both the restricted model R_{ij}(x) and the unrestricted model F(x), where R_{ij}(x) is prevented from modeling an interaction (x_i, x_j):

R_{ij}(x) = f_{\setminus i}(x_1, ..., x_{i-1}, x_{i+1}, ..., x_n) + f_{\setminus j}(x_1, ..., x_{j-1}, x_{j+1}, ..., x_n).    (5)

To correctly estimate interaction strength, such a method requires the model to be highly predictive when a certain interaction is not allowed to appear, and therefore many learning algorithms are not applicable (e.g., bagged decision trees). To this end, they choose to use Additive Groves [21].

They measure the performance as standardized root mean squared error (RMSE) and quantify the interaction strength I_{ij} by the difference between R_{ij}(x) and F(x):

stRMSE(F(x)) = RMSE(F(x)) / StD(F*(x)),    (6)

I_{ij} = stRMSE(R_{ij}(x)) − stRMSE(F(x)),    (7)

where StD(F*(x)) is calculated as the standard deviation of the response values in the training set. The ranking of all pairs can be generated based on the strength I_{ij}.

To handle correlations among features, they use a variant of backward elimination [12] to do feature selection. Although Grove is accurate in practice, building restricted and unrestricted models is computationally expensive, and therefore this method is almost infeasible for large, high-dimensional datasets.
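A minimal sketch of Equations 6 and 7, assuming predictions from the restricted and unrestricted models have already been obtained (how those models are trained is outside this sketch):

import numpy as np

def strmse(y_true, y_pred, y_train_std):
    return np.sqrt(np.mean((y_true - y_pred) ** 2)) / y_train_std    # Equation (6)

def interaction_strength(y_true, pred_restricted, pred_full, y_train_std):
    return (strmse(y_true, pred_restricted, y_train_std)
            - strmse(y_true, pred_full, y_train_std))                 # Equation (7)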

4. OUR APPROACH

For simplicity and without loss of generality, we focus in this exposition on regression problems. Since there are O(n^2) pairwise interactions, it is very hard to detect pairwise interactions when n is large. Therefore we propose a framework using a greedy forward stagewise selection strategy to build the most accurate model in H.

Algorithm 1 summarizes our approach, called GA2M. We maintain two sets S and Z, where S contains the pairs selected so far and Z is the set of remaining pairs (Lines 1-2). We start with the best additive model F so far in the Hilbert space H_1 + \sum_{u ∈ S} H_u (Line 4) and detect interactions on the residual R (Line 5). Then for each pair in Z, we build an interaction model on the residual R (Lines 6-7). We select the best interaction pair and include it in S (Lines 9-10). We then repeat this process until there is no gain in accuracy.

Note that Algorithm 1 will find an overcomplete set S due to the greedy nature of the forward selection strategy. When features are correlated, it is also possible that the algorithm includes false pairs. For example, consider the function in Example 1. If x_1 is highly correlated with x_3, then (x_1, x_2) may look like an interaction pair, and it may be included in S before we select (x_2, x_3). But since we refit the model every time we include a new pair, it is expected that F will perfectly model (x_2, x_3) and therefore (x_1, x_2) will become a less important term in F.

Algorithm 1 GA2M Framework
1: S ← ∅
2: Z ← U_2
3: while not converged do
4:   F ← arg min_{F ∈ H_1 + \sum_{u ∈ S} H_u} (1/2) E[(y − F(x))^2]
5:   R ← y − F(x)
6:   for all u ∈ Z do
7:     F_u ← E[R | x_u]
8:   u* ← arg min_{u ∈ Z} (1/2) E[(R − F_u(x_u))^2]
9:   S ← S ∪ {u*}
10:  Z ← Z − {u*}

Figure 1: Illustration for searching cuts on the input space of x_i and x_j. On the left we show a heat map of the target for different values of x_i and x_j; c_i and c_j are cuts for x_i and x_j, respectively. On the right we show an extremely simple predictor for modeling the pairwise interaction.
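A minimal sketch of the greedy loop in Algorithm 1 on binned features. The helper `fit_additive_model(X, y, pairs)` (assumed to return in-sample predictions of the best additive model over single features plus the selected pairs), the 256-bin encoding, and the stopping rule are illustrative assumptions, not the authors' implementation.

import numpy as np

def ga2m_forward_selection(X_binned, y, fit_additive_model, max_pairs=10, tol=1e-4):
    n, d = X_binned.shape
    S = []                                                   # selected pairs
    Z = [(i, j) for i in range(d) for j in range(i + 1, d)]  # remaining pairs
    while Z and len(S) < max_pairs:
        pred = fit_additive_model(X_binned, y, S)            # Line 4: refit with selected pairs
        R = y - pred                                         # Line 5: residual
        best_u, best_rss = None, None
        for (i, j) in Z:                                     # Lines 6-8: F_u <- E[R | x_u]
            key = X_binned[:, i].astype(np.int64) * 256 + X_binned[:, j]  # assumes <= 256 bins
            fit = np.zeros_like(R, dtype=float)
            for k in np.unique(key):
                mask = key == k
                fit[mask] = R[mask].mean()
            rss = np.sum((R - fit) ** 2)
            if best_rss is None or rss < best_rss:
                best_u, best_rss = (i, j), rss
        if np.sum(R ** 2) - best_rss < tol * np.sum(R ** 2):
            break                                            # no appreciable gain in accuracy
        S.append(best_u)                                     # Lines 9-10
        Z.remove(best_u)
    return S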

For large high-dimensional datasets, however, Algorithm 1 is very expensive for two reasons. First, fitting interaction models for O(n^2) pairs in Z can be very expensive if the model is non-trivial. Second, every time we add a pair, we need to refit the whole model, which is also very expensive for large datasets. As we will see in Section 4.1 and Section 4.2, we will relax some of the constraints in Algorithm 1 to achieve better scalability while still staying accurate.

4.1 Fast Interaction Detection

Consider the conceptual additive model in Equation 2. Given a pair of variables (x_i, x_j), we wish to measure how much benefit we can get if we model f_{ij}(x_i, x_j) instead of f_i(x_i) + f_j(x_j). Since we start by shaping individual features and always detect interactions on the residual, f_i(x_i) + f_j(x_j) is presumably already modeled, and therefore we only need to look at the residual sum of squares (RSS) for the interaction model f_{ij}. The intuition is that when (x_i, x_j) is a strong interaction, modeling f_{ij} can significantly reduce the RSS. However, we do not wish to fully build f_{ij} since this is a very expensive operation; instead we are looking for a cheap substitute.

4.1.1 Overview

Our idea is to build an extremely simple model for f_{ij} using cuts on the input space of x_i and x_j, as illustrated in Figure 1. The simplest model we can build is to place one cut on each variable, i.e., we place one cut c_i on x_i and one cut c_j on x_j. Those cuts are parallel to the axes. The interaction predictor T_{ij} is constructed by taking the mean of all points in each quadrant. We search over all possible (c_i, c_j) and pick the best T_{ij}, the one with the lowest RSS, which is assigned as the weight for (x_i, x_j) to measure the strength of the interaction.

Figure 2: Illustration for computing the sum of targets for each quadrant: the value a of the red quadrant is pre-computed, and the values of the other quadrants b, c, d are then recovered from a and the marginal cumulative histograms CH^t_i(c_i) and CH^t_j(c_j) (and their complements).
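A minimal, brute-force sketch of the scoring idea just described: for one candidate pair, try every axis-parallel cut (c_i, c_j), predict each quadrant by its mean residual, and keep the lowest RSS. Binned integer features are assumed; this is the naive version, not the optimized bookkeeping described next.

import numpy as np

def fast_pair_score_naive(xi_bin, xj_bin, residual):
    best_rss = np.inf
    for ci in np.unique(xi_bin)[:-1]:                   # cut between bin ci and the next bin
        left = xi_bin <= ci
        for cj in np.unique(xj_bin)[:-1]:
            low = xj_bin <= cj
            rss = 0.0
            for quad in (left & low, left & ~low, ~left & low, ~left & ~low):
                if quad.any():
                    r = residual[quad]
                    rss += np.sum((r - r.mean()) ** 2)  # quadrant predicted by its mean
            best_rss = min(best_rss, rss)
    return best_rss                                     # lower RSS = stronger interaction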

4.1.2 Constructing Predictors

A naïve implementation of FAST is straightforward, but a careless implementation has very high complexity since we need to repeatedly build many T_{ij} for different cuts. The key insight for a faster version of FAST is that we do not need to scan through the dataset each time to compute T_{ij} and its RSS. We show that by using very simple bookkeeping data structures, we can greatly reduce the complexity.

Let dom(x_i) = {v_i^1, ..., v_i^{d_i}} be a sorted set of possible values for variable x_i, where d_i = |dom(x_i)|. Define H^t_i(v) as the sum of targets when x_i = v, and define H^w_i(v) as the sum of weights (or counts) when x_i = v. Intuitively, these are the standard histograms used when constructing regression trees. Similarly, we define CH^t_i(v) and CH^w_i(v) as the cumulative histograms for the sum of targets and the sum of weights, respectively, i.e., CH^t_i(v) = \sum_{u ≤ v} H^t_i(u) and CH^w_i(v) = \sum_{u ≤ v} H^w_i(u). Accordingly, define \overline{CH}^t_i(v) = \sum_{u > v} H^t_i(u) = CH^t_i(v_i^{d_i}) − CH^t_i(v) and \overline{CH}^w_i(v) = \sum_{u > v} H^w_i(u) = CH^w_i(v_i^{d_i}) − CH^w_i(v). Furthermore, define H^t_{ij}(u, v) and H^w_{ij}(u, v) as the sum of targets and the sum of weights, respectively, when (x_i, x_j) = (u, v).

Consider again the input space for (x_i, x_j); we need a quick way to compute the sum of targets and the sum of weights for each quadrant. Figure 2 shows an example of computing the sum of targets for each quadrant. Given the above notation, we already know the marginal cumulative histograms for x_i and x_j, but unfortunately these marginal values alone cannot recover the values of the four quadrants. Thus, we have to compute the value of one quadrant directly.

We show that it is very easy and efficient to compute all possible values for the red quadrant given any cuts (c_i, c_j) using dynamic programming. Once that quadrant is known, we can easily recover the values of the other quadrants using the marginal cumulative histograms. We store those values in lookup tables. Let L^t(c_i, c_j) = [a, b, c, d] be the lookup table for the sum of targets on cuts (c_i, c_j), and let L^w(c_i, c_j) = [a, b, c, d] be the lookup table for the sum of weights on cuts (c_i, c_j).

Algorithm 2 ConstructLookupTable
1: sum ← 0
2: for q = 1 to d_j do
3:   sum ← sum + H^t_{ij}(v_i^1, v_j^q)
4:   a[1][q] ← sum
5:   L(v_i^1, v_j^q) ← ComputeValues(CH^t_i, CH^t_j, a[1][q])
6: for p = 2 to d_i do
7:   sum ← 0
8:   for q = 1 to d_j do
9:     sum ← sum + H^t_{ij}(v_i^p, v_j^q)
10:    a[p][q] ← sum + a[p−1][q]
11:    L(v_i^p, v_j^q) ← ComputeValues(CH^t_i, CH^t_j, a[p][q])

Algorithm 2 describes how to compute the lookup table L^t. We focus on computing quadrant a; the other quadrants can be easily computed, which is handled by the subroutine ComputeValues. Given H^t_{ij}, we first compute the a values for the first row of L^t (Lines 3-5). Let a[p][q] denote the value for cuts (p, q). Note that a[p][q] = a[p−1][q] + \sum_{k ≤ q} H^t_{ij}(v_i^p, v_j^k). Thus we can efficiently compute the rest of the lookup table row by row (Lines 6-11).

Once we have L^t and L^w, given any cuts (c_i, c_j), we can easily construct T_{ij}. For example, we can set the leftmost leaf value in T_{ij} to L^t(c_i, c_j).a / L^w(c_i, c_j).a. It is easy to see that with these bookkeeping data structures, we can reduce the complexity of building predictors to O(1).

4.1.3 Calculating RSS

In this section, we show that calculating the RSS of T_{ij} can be done very efficiently. Consider the definition of RSS, and let T_{ij}.r denote the prediction value on region r, where r ∈ {a, b, c, d}:

RSS = \sum_{k=1}^{N} (y_k − T_{ij}(x_k))^2    (8)
    = \sum_{k=1}^{N} y_k^2 − 2 \sum_r T_{ij}.r · L^t.r + \sum_r (T_{ij}.r)^2 · L^w.r.    (9)

In a practical implementation, we only need to care about \sum_r (T_{ij}.r)^2 · L^w.r − 2 \sum_r T_{ij}.r · L^t.r, since we are only interested in the relative ordering of RSS values, and it is easy to see that the complexity of computing the RSS of T_{ij} is O(1).

4.1.4 Complexity Analysis

For each pair (x_i, x_j), computing the histograms and cumulative histograms requires a scan through the data and therefore has complexity O(N). Constructing the lookup tables takes O(d_i d_j + N) time. Thus, the time complexity of FAST is O(d_i d_j + N) for one pair (x_i, x_j). Since we need to store d_i-by-d_j matrices for each pair, the space complexity is O(d_i d_j).

For continuous features, d_i d_j can be quite large. However, we can discretize the features into b equi-frequency bins. Such feature discretization usually does not hurt the performance of regression trees [17]. As we will see in Section 5, FAST is not sensitive to a wide range of values of b. Therefore, the complexity can be reduced to O(b^2 + N) per pair when we discretize features into b bins. For small b (b ≤ 256), we can quickly process each pair.
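A minimal vectorized sketch of the bookkeeping just analyzed: build the per-pair histograms once (O(N)), turn them into 2-D cumulative sums (O(b^2)), and score every cut in O(1) each, using the RSS shortcut from Section 4.1.3. It mirrors the described complexity, not the authors' exact implementation; bins are assumed to be integers in 0..b-1.

import numpy as np

def fast_pair_score(xi_bin, xj_bin, residual, b):
    # 2-D histograms of residual sums (targets) and counts (weights).
    Ht = np.zeros((b, b))
    Hw = np.zeros((b, b))
    np.add.at(Ht, (xi_bin, xj_bin), residual)
    np.add.at(Hw, (xi_bin, xj_bin), 1.0)
    # Cumulative sums: CT[p, q] = sum of residuals with bin(x_i) <= p and bin(x_j) <= q.
    CT = Ht.cumsum(axis=0).cumsum(axis=1)
    CW = Hw.cumsum(axis=0).cumsum(axis=1)
    tot_t, tot_w = CT[-1, -1], CW[-1, -1]
    best = np.inf
    for p in range(b - 1):                        # cut on x_i after bin p
        for q in range(b - 1):                    # cut on x_j after bin q
            at, aw = CT[p, q], CW[p, q]           # quadrant sums recovered in O(1)
            bt, bw = CT[p, -1] - at, CW[p, -1] - aw
            ct, cw = CT[-1, q] - at, CW[-1, q] - aw
            dt, dw = tot_t - at - bt - ct, tot_w - aw - bw - cw
            score = 0.0                           # the cut-dependent part of the RSS (Eq. 9)
            for t, w in ((at, aw), (bt, bw), (ct, cw), (dt, dw)):
                if w > 0:
                    score -= t * t / w            # mean^2 * w - 2 * mean * t = -t^2 / w
            best = min(best, score)
    return best                                   # lower = stronger candidate interaction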

4.2 Two-stage Construction

With FAST, we can quickly rank all pairs in Z, the remaining pair set, and add the best interaction to the model. However, refitting the whole model after each pair is added can be very expensive for large high-dimensional datasets. Therefore, we propose a two-stage construction approach.

1. In Stage 1, build the best additive model F in H_1 using only one-dimensional components.

2. In Stage 2, fix the one-dimensional functions, and build models for pairwise interactions on the residuals.

4.2.1 Implementation Details

To scale up to large datasets and many features, we discretize continuous features into 256 equi-frequency bins.¹ We find that such feature discretization rarely hurts performance but substantially reduces the running time and memory footprint, since we can use one byte to store a feature value. Besides, discretizing the features removes the sorting requirement for continuous features when searching for the best cuts in the space.

Previous research showed that feature shaping using gradient boosting [10] with shallow regression tree ensembles can achieve the best accuracy [19]. We follow a similar approach (i.e., gradient boosting with shallow tree-like ensembles) in this work. However, a regression tree is not the ideal learning method for each component, for two reasons. First, while regression trees are good as generic shape functions for any x_u, shaping a single feature is equivalent to cutting on a line, and line cutting can be made more efficient than a regression tree. Second, using regression trees to shape pairwise functions can be problematic. Recall that in Stage 1, we obtain the best additive model after gradient boosting converges. This means adding more cuts to any one feature does not reduce the error, and equivalently, any cut on a single feature is random. Therefore, when we begin to shape pairwise interactions, the root test in a regression tree that is constructed greedily top-down is random.
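A minimal sketch of the 256-bin equal-frequency discretization described at the start of this subsection: bin edges are taken at quantiles so each bin holds roughly the same number of samples, and each binned value fits in one byte. Ties can effectively merge bins; the function name is illustrative.

import numpy as np

def equi_frequency_bin(x, n_bins=256):
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])    # interior quantile cut points
    return np.searchsorted(edges, x, side="right").astype(np.uint8)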

Similar to [19], to effectively shape pairwise interactions, we build shallow tree-like models on the residuals, as illustrated in Figure 3. We enumerate all possible cuts c_i on x_i. Given this cut, we greedily search for the best cut c_j^1 in the region above c_i and similarly greedily search for the best cut c_j^2 in the region below c_i. Note that we can reuse the lookup tables L^t and L^w developed for FAST to search quickly for these three cuts. Figure 3 shows an example of computing the leaf values given c_i, c_j^1, and c_j^2. Similarly, we can quickly compute the RSS for any combination of the three cuts once the leaf values are available, just as we did in Section 4.1.4, and therefore it is very fast to search for the best combination of cuts in this space. We also search for the best combination of three cuts with one cut on x_j and two cuts on x_i, and pick the better model with lower RSS. It is easy to see that the complexity is O(N + b^2), where b is the number of bins for each feature and b = 256 in our case.

¹ Note that this is not the number of bins used in FAST, the interaction detection process. Here we use 256 bins for feature/pair shaping.

Figure 3: Illustration for computing the shape function for a pairwise interaction. The leaf values are a = L^t(c_i, c_j^1).a / L^w(c_i, c_j^1).a, b = L^t(c_i, c_j^1).b / L^w(c_i, c_j^1).b, c = L^t(c_i, c_j^2).c / L^w(c_i, c_j^2).c, and d = L^t(c_i, c_j^2).d / L^w(c_i, c_j^2).d.

Dataset      Size      Attributes  %Pos
Delta        7192      6           -
CompAct      8192      22          -
Pole         15000     49          -
CalHousing   20640     9           -
MSLR10k      1200192   137         -
Spambase     4601      58          39.40
Gisette      6000      5001        50.00
Magic        19020     11          64.84
Letter       20000     17          49.70
Physics      50000     79          49.72

Table 1: Datasets.

4.2.2 Further Relaxation

For large datasets, even refitting the model on the selected pairs can be very expensive. Therefore, we propose to use the ranking produced by FAST right after Stage 1 to select the top-K pairs into S, and to fit a model on the residual R using the pairs in S, where K is chosen according to the available computing power.

4.2.3 Diagnostics

Models that combine both accuracy and intelligibility are important. Usually S will still be an overcomplete set. For intelligibility, once we have learned the best model in H, we would like to rank all terms (one- and two-dimensional components) so that we can focus on the most important features or pairwise interactions. Therefore, we need to assign a weight to each term. We use \sqrt{E[f_u^2]}, the standard deviation of f_u (since E[f_u] = 0), as the weight for term u. Note that this is a natural generalization of the weights in linear models; this is easy to see since if f_i(x_i) = w_i x_i, then \sqrt{E[f_i^2]} is equivalent to |w_i| when features are normalized so that E[x_i^2] = 1.
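A minimal sketch of the term-weight diagnostic just described: the weight of a term is the standard deviation of its (zero-mean) contribution over the training data.

import numpy as np

def term_weight(term_values):
    term_values = np.asarray(term_values, dtype=float)    # f_u(x_u) evaluated on the training set
    centered = term_values - term_values.mean()           # enforce E[f_u] = 0
    return np.sqrt(np.mean(centered ** 2))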

5. EXPERIMENTS

In this section we report experimental results on both synthetic and real datasets. The results in Section 5.1 show that GA2M learns models that are nearly as accurate as full-complexity random forest models while using terms that depend only on single features and pairwise interactions, and thus are intelligible. The results in Section 5.2 demonstrate that FAST finds the most important interactions among the O(n^2) feature pairs to include in the model. Section 5.3 compares the computational cost of FAST and GA2M to competing methods. Section 5.4 briefly discusses several important design choices made for FAST and GA2M. Finally, Section 5.5 concludes with a case study.

Model                Delta      CompAct    Pole        CalHousing  MSLR10k    Mean
Linear Regression    0.58±0.01  7.92±0.47  30.41±0.24  7.28±0.80   0.76±0.00  1.52±0.79
GAM                  0.57±0.02  2.74±0.04  21.62±0.38  5.76±0.55   0.75±0.00  1.00±0.00
GA2M Rand            -          -          11.37±0.38  -           0.73±0.00  -
GA2M Coef            -          -          11.61±0.43  -           0.73±0.00  -
GA2M Order           -          -          10.81±0.29  -           0.74±0.00  -
GA2M FAST            0.55±0.02  2.53±0.02  10.59±0.35  5.00±0.91   0.73±0.00  0.84±0.20
Random Forests       0.53±0.19  2.45±0.08  11.38±1.03  4.90±0.81   0.71±0.00  0.83±0.17

Table 2: RMSE for regression datasets. Each cell contains the mean RMSE ± one standard deviation. The average normalized score is shown in the last column, calculated as the relative improvement over GAM.

Model                Spambase   Gisette     Magic       Letter      Physics     Mean
Logistic Regression  6.22±0.93  15.78±3.28  17.11±0.08  27.54±0.27  30.02±0.37  1.79±1.25
GAM                  5.09±0.64  3.95±0.65   14.85±0.28  17.84±0.20  28.83±0.24  1.00±0.00
GA2M Rand            5.04±0.52  3.53±0.61   -           -           28.82±0.25  -
GA2M Coef            4.89±0.54  3.43±0.55   -           -           28.74±0.37  -
GA2M Order           4.93±0.65  3.08±0.55   -           -           28.76±0.34  -
GA2M FAST            4.78±0.70  2.91±0.38   13.88±0.32  8.62±0.31   28.20±0.18  0.81±0.21
Random Forests       4.76±0.70  3.25±0.47   12.45±0.64  6.16±0.22   28.48±0.40  0.79±0.26

Table 3: Error rate for classification datasets. Each cell contains the error rate ± one standard deviation. The average normalized score is shown in the last column, calculated as the relative improvement over GAM.

5.1 Model Accuracy on Real Datasets

We run experiments on ten real datasets to show the accuracy that GA2M can achieve with models that depend only on 1-d features and pairwise feature interactions.

5.1.1 Datasets

Table 1 summarizes the 10 datasets. Five are regression problems: “Delta” is the task of controlling the ailerons of an F16 aircraft [1]. “CompAct” is from the Delve repository and describes the state of multiuser computers [2]. “Pole” describes a telecommunication problem [23]. “CalHousing” describes how housing prices depend on census variables [16]. “MSLR10k” is a learning-to-rank dataset, but we treat relevance as the regression target [3]. The other five datasets are binary classification problems: the “Spambase”, “Magic”, and “Letter” datasets are from the UCI repository [4]. “Gisette” is from the NIPS feature selection challenge [5]. “Physics” is from the KDD Cup 2004 [6].

The features in all datasets are discretized into 256 equi-frequency bins. For each model we include at most 1000 feature pairs; we include all feature pairs in the six lowest-dimensional problems, and the top 1000 feature pairs found by FAST on the “Pole”, “MSLR10k”, “Spambase”, “Gisette”, and “Physics” datasets. Although it is possible that higher accuracy might be obtained by including more or fewer feature pairs, searching for the optimal number of pairs is expensive, and GA2M is reasonably robust to excess feature pairs. However, it is too expensive to include all feature pairs on problems with many features. We use 8 bins for FAST in all experiments.

5.1.2 Results

We compare GA2M to linear/logistic regression, feature shaping (GAMs) without interactions, and full-complexity random forests. For regression problems we report root mean squared error (RMSE), and for classification problems we report 0/1 loss. To compare results across different datasets, we normalize results by the error of GAMs on each dataset. For all experiments, we train on 80% of the data and hold aside 20% of the data as test sets.

In addition to FAST, we also consider three baseline methods on five high-dimensional datasets: GA2M Rand, GA2M Coef, and GA2M Order. GA2M Rand adds the same number of random pairs to the GAM. GA2M Order and GA2M Coef use the weights of the 1-d features in the GAM to propose pairs; GA2M Order generates pairs by the order of the 1-d features, and GA2M Coef generates pairs by the product of the weights of the 1-d features.

The regression and classification results are presented in Table 2 and Table 3. As expected, the improvement over linear models from shaping individual features (GAMs) is substantial: on average, feature shaping reduces RMSE by 34% on the regression problems and reduces 0/1 loss by 44% on the classification problems. What is surprising, however, is that by adding shaped pairwise interactions to the models, GA2M FAST substantially closes the accuracy gap between unintelligible full-complexity models such as random forests and GAMs. On some datasets, GA2M FAST even outperforms the best random forest model. Also, none of the baseline methods perform comparably to GA2M FAST.

5.2 Detecting Feature Interactions with FAST

In this section we evaluate how accurately FAST detects feature interactions on synthetic problems.

5.2.1 Sensitivity to the Number of Bins

To evaluate the sensitivity of FAST, we use the synthetic function generator in [10] to generate random functions. Because these are synthetic functions, we know the ground-truth interacting pairs and use average precision (area under the precision-recall curve evaluated at the true points) as the evaluation metric. We vary b = 2, 4, ..., 256 and the dataset size N = 10^2, 10^3, ..., 10^6. For each fixed N, we generate datasets with n features and k higher-order interactions x_u, where |u| = ⌊1.5 + r⌋ and r is drawn from an exponential distribution with mean λ = 1. We experiment with two cases: 10 features with 25 higher-order interactions and 100 features with 1000 higher-order interactions.

Figure 4: Sensitivity of FAST to the number of bins. (a) 10 features. (b) 100 features. Each panel plots average precision vs. the number of bins for dataset sizes 10^2 through 10^6.

Figure 5: Precision/cost on the synthetic function. (a) Average precision of Grove, ANOVA, FAST, GUIDE, and PDF with m = 100, 200, 400, 800. (b) Running time in seconds.

Figure 4 shows the mean average precision and variance over 100 trials at each setting. As expected, average precision increases as the dataset size increases, and decreases as the number of features increases from 10 (left graph) to 100 (right graph). When there are only 10 features and as many as 10^6 samples, FAST ranks all true interactions above all non-interacting pairs (average precision = 1) in most cases, but as the sample size decreases or the problem difficulty increases, average precision drops below 1. In the graph on the right with 100 features there are 4950 feature pairs, and FAST needs large sample sizes (10^6 or greater) to achieve average precision above 0.7; as expected, it performs poorly when there are fewer samples than pairs of features.

On these test problems the optimal number of bins appears to be about b = 8, with average precision falling slightly for numbers of bins larger and smaller than 8. This is a classic bias-variance tradeoff: smaller b reduces the chance of overfitting but risks failing to model some kinds of interactions, while larger b allows more complex interactions to be modeled but risks allowing some false interactions to be confused with weak true interactions.

5.2.2 Accuracy

The previous section showed that FAST accurately detects feature interactions when the number of samples is much larger than the number of feature pairs, but that accuracy drops as the number of feature pairs grows comparable to, and then larger than, the number of samples. In this section we compare the accuracy of FAST to the interaction detection methods discussed in Section 3.2. For ANOVA, we use the R package mgcv to compute p-values under a Wald test [25]. For PDF, we use the RuleFit package and choose m = 100, 200, 400, 800, where m is the sample size that trades off efficiency and accuracy [7]. Grove is available in the TreeExtra package [8].

Here we conduct experiments on synthetic data generated by the following function [14, 22]:

F(x) = \pi^{x_1 x_2} \sqrt{2 x_3} - \sin^{-1}(x_4) + \log(x_3 + x_5) - \frac{x_9}{x_{10}} \sqrt{\frac{x_7}{x_8}} - x_2 x_7.    (10)

Variables x_4, x_5, x_8, x_{10} are uniformly distributed in [0.6, 1], and the other variables are uniformly distributed in [0, 1].

We generate 10,000 points for these experiments. Figure 5(a) shows the average precision of the methods. On this problem, the Grove and ANOVA methods are accurate and rank all 11 true pairs at the top of the list. FAST is almost as good and correctly ranks the top ten pairs. The other methods are significantly less accurate than Grove, ANOVA, and FAST.

To understand why FAST does not pick up the 11th pair, we plot heat maps of the residuals of selected pairs in Figure 6. (x_1, x_2) and (x_2, x_7) are two of the correctly ranked true pairs; (x_1, x_7) is a false pair ranked below the true pairs FAST detects correctly but above the true pair it misses; and (x_8, x_{10}) is the true pair FAST misses and ranks below this false pair. The heat maps show that strong interactions are easy to distinguish, but some false interactions such as (x_1, x_7) can have signal as strong as that of weak true interactions such as (x_8, x_{10}). In fact, Sorokina et al. found that x_8 is a weak feature and did not consider pairs that use x_8 as interactions on 5,000 samples [22], so we are near the threshold of detectability of (x_8, x_{10}) in going from 5,000 to 10,000 samples.

5.2.3 Feature Correlation and Spurious Pairs

If features are correlated, spurious interactions may be detected because it is difficult to tell the difference between a true interaction between x_1 and x_2 and a spurious interaction between x_1 and x_3 when x_3 is strongly correlated with x_2; any interaction detection method such as FAST that examines pairs in isolation will have this problem. With GA2M, however, it is fine to include some false positive pairs because GA2M is able to post-filter false positive pairs by looking at the term weights of the shaped interactions in the final model.

To demonstrate this, we use the synthetic function in Equation 10, but make x_6 correlated with x_1. We generate two datasets, one with ρ(x_1, x_6) = 0.5 and the other with ρ(x_1, x_6) = 0.95, where ρ is the correlation coefficient. We run FAST on the residuals after feature shaping. We give the top 20 pairs found by FAST to GA2M, which then uses gradient boosting to shape those pairwise interactions. Figure 7 illustrates how the weights of the selected pairwise interactions evolve after each step of gradient boosting. Although the pair (x_2, x_6) can be incorrectly introduced by FAST because of the high correlation between x_1 and x_6, the weight on this false pair decreases quickly as boosting proceeds, indicating that this pair is spurious. This not only allows the model trained on the pairs to remain accurate in the face of spurious pairs, but also reduces the weight (and ranking) given to this shaped term so that intelligibility is not hurt by the spurious term.

Figure 6: True/spurious heat maps for the pairs (x_1, x_2), (x_2, x_7), (x_1, x_7), and (x_8, x_{10}). Features are discretized into 32 bins for visualization.

Figure 7: Weights for pairwise interaction terms in the model as boosting proceeds. (a) ρ(x_1, x_6) = 0.5. (b) ρ(x_1, x_6) = 0.95.

5.3 Scalability

Figure 5(b) illustrates the running time of the different methods on 10,000 samples from Equation 10. Model building time is included. FAST takes about 10 seconds to rank all possible pairs, while the two other accurate methods, ANOVA and Grove, are 3-4 orders of magnitude slower. Grove, which is probably the most accurate interaction detection method currently available, takes almost a week to run once on this data. This shows the advantage of FAST; it is very fast with high accuracy. On this problem FAST takes less than 1 second to rank all pairs, and the majority of the time is devoted to building the additive model.

Figure 8 shows the running time of FAST per pair on real datasets. It is clear that on real datasets, FAST is both accurate and efficient.

5.4 Design Choices

An alternative to interaction detection that we considered was to build ensembles of trees on the residuals after shaping the individual features and then look at tree statistics to find combinations of features that co-occur in paths more often than their independent rates warrant. By using 1-step look-ahead at the root we also hoped to partially mitigate the myopia of greedy feature installation, to make interactions more likely to be detected. Unfortunately, features with high “co-occurrence counts” did not correlate well with true interactions on synthetic test problems, and the best tree-based methods we could devise did not detect interactions as well as FAST and were considerably more expensive.

Figure 8: Computational cost on real datasets (time in seconds per pair vs. dataset size).

5.5 Case Study: Learning to Rank

Learning-to-rank is an important research topic in the data mining, machine learning, and information retrieval communities. In this section, we train intelligible models with shaped one-dimensional features and pairwise interactions on the “MSLR10k” dataset. A complete description of the features can be found in [3]. We show the top 10 most important individual features and their shape functions in the first two rows of Figure 9. The number above each plot is the weight of the corresponding term in the model. Interestingly, we found that BM25 [20], usually considered a powerful feature for ranking, ranked 70th (BM25 url) in the list after shaping. Other features such as IDF (inverse document frequency) enjoy much higher weight in the learned model.

The last two rows of Figure 9 show the 10 most important pairwise interactions and their term strengths. Each of them shows a clear interaction that could not be modeled by additive terms. The non-linear shaping of the individual features in the top plots and the pairwise interactions in the bottom plots are intelligible to experts and feature engineers, but would be well hidden in full-complexity models.

Figure 9: Shapes of features and pairwise interactions for the “MSLR10k” dataset, with weights. The top two rows show the 10 strongest individual features; the next two rows show the 10 strongest pairwise interactions.

6. CONCLUSIONS

We present a framework called GA2M for building intelligible models with pairwise interactions. Adding pairwise interactions to traditional GAMs retains intelligibility while substantially increasing model accuracy. To scale up pairwise interaction detection, we propose a novel method called FAST that efficiently measures the strength of all potential pairwise interactions.

Acknowledgements. We thank the anonymous reviewers for their valuable comments, and we thank Nick Craswell of Microsoft Bing for insightful discussions. This research has been supported by the NSF under Grants IIS-0911036 and IIS-1012593. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.

7. REFERENCES

[1] http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html.
[2] http://www.cs.toronto.edu/~delve/data/datasets.html.
[3] http://research.microsoft.com/en-us/projects/mslr/.


[4] http://archive.ics.uci.edu/ml/.

[5] http://www.nipsfsc.ecs.soton.ac.uk/.

[6] http://osmot.cs.cornell.edu/kddcup/.

[7] http://www-stat.stanford.edu/~jhf/R-RuleFit.html.

[8] http://additivegroves.net.

[9] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(1):105–139, 1999.
[10] J. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29:1189–1232, 2001.
[11] J. Friedman and B. Popescu. Predictive learning via rule ensembles. The Annals of Applied Statistics, pages 916–954, 2008.
[12] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.
[13] T. Hastie and R. Tibshirani. Generalized additive models. Chapman & Hall/CRC, 1990.
[14] G. Hooker. Discovering additive structure in black box functions. In KDD, 2004.
[15] G. Hooker. Generalized functional ANOVA diagnostics for high-dimensional functions of dependent variables. Journal of Computational and Graphical Statistics, 16(3):709–732, 2007.
[16] R. Kelley Pace and R. Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, 1997.
[17] P. Li, C. Burges, and Q. Wu. McRank: Learning to rank using multiple classification and gradient boosting. In NIPS, 2007.
[18] W. Loh. Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12(2):361–386, 2002.
[19] Y. Lou, R. Caruana, and J. Gehrke. Intelligible models for classification and regression. In KDD, 2012.
[20] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to information retrieval. Cambridge University Press, 2008.
[21] D. Sorokina, R. Caruana, and M. Riedewald. Additive groves of regression trees. In ECML, 2007.
[22] D. Sorokina, R. Caruana, M. Riedewald, and D. Fink. Detecting statistical interactions with additive groves of trees. In ICML, 2008.
[23] S. M. Weiss and N. Indurkhya. Rule-based machine learning methods for functional prediction. Journal of Artificial Intelligence Research, 3:383–403, 1995.
[24] S. Wood. Thin plate regression splines. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(1):95–114, 2003.
[25] S. Wood. Generalized additive models: an introduction with R. CRC Press, 2006.


