Ecological Modelling 313 (2015) 307–313
Contents lists available at ScienceDirect: Ecological Modelling
Journal homepage: www.elsevier.com/locate/ecolmodel
http://dx.doi.org/10.1016/j.ecolmodel.2015.06.034

Letter to the Editor

A new R2-based metric to shed greater insight on variable importance in artificial neural networks

Keywords: Connection weights; Garson's algorithm; Machine learning

Abstract

Artificial neural networks (ANNs) represent a powerful analytical tool designed for predictive modeling. However, the shortage of straightforward and reliable approaches for calculating variable importance and characterizing predictor–response relationships has likely hindered the broader use of ANNs in ecology. Two such metrics, product-of-connection-weights (PCW) and product-of-standardized-weights (PSW), have received much attention in the published literature. A recent paper (Fischer, in press, Ecological Modelling) found that PSW was comparable to PCW for retrieving variable importance values in linear models, seemingly overturning the conclusions of Olden et al. (2004, Ecological Modelling), and that PSW was superior to PCW in nonlinear models. In this paper we call into question the findings of Fischer (in press) and, more importantly, we explain why neither PCW nor PSW is a universally good measure of variable importance. Next, we advance the field by proposing a new permutational R2-based variable importance metric and show that it accurately estimates the proportion of the total variance in the response variable that is uniquely associated with each predictor variable in both linear and non-linear data contexts. By enabling ecologists to measure relative strengths of predictor variables in a transparent and straightforward way, this metric has the potential to help widen the use of ANNs in ecology.

Published by Elsevier B.V.

1. Introduction

Artificial neural networks (ANNs) have witnessed greater application in ecology in recent decades because they are perceived to overcome many of the difficulties commonly associated with ecological data (Olden et al., 2008). Included are the advantages of modeling non-linear associations, not requiring specific assumptions concerning the distributional characteristics of the independent variables (i.e., nonparametric), and accommodating variable interactions without a priori specification. ANNs perform complex pattern recognition and prediction tasks by adaptively learning and mapping input variables to output variables via a series of interconnected nodes akin to neurons in the brain (Rumelhart et al., 1986; Bishop, 1995; Ripley, 1996). However, adding model complexity to enhance predictive power often comes at the cost of reduced explanatory insight into the ecological relationships being explored (Low-Décarie et al., 2014).

With the aim of illuminating the "black box" perception of ANNs, a variety of methods have been developed to quantify predictor (input) variable importance in neural networks (reviewed in Olden and Jackson, 2002; Gevrey et al., 2003; Olden et al., 2004). In a study published in Ecological Modelling, Olden et al. (2004) provided a comprehensive evaluation of the accuracy of different approaches in retrieving true variable importance in a simulated dataset where predictor variables were linearly associated with the response variable. This study found that the product-of-connection-weights method (hereafter, PCW) was the best performing metric while Garson's (1991) product-of-standardized-weights (hereafter, PSW) method performed the worst. In a recent paper, also published in Ecological Modelling, Fischer (in press) reported that PSW was a better measure of variable importance in ANNs compared to PCW when applied to datasets exhibiting nonlinear relationships between predictor and response variables. Interestingly, in the simulation analysis by Fischer (in press), PSW correctly retrieved variable importance ranks in all simulated datasets (4 functional forms: 1 linear, 2 quadratic, 1 interaction-term; for each functional form, 100 datasets with sample size n = 500 were evaluated), whereas PCW was reported to be 100% accurate in the linear datasets but only 16–52% accurate in the other 3 nonlinear datasets.

Although we are encouraged that discussion of how explanatory insight is gleaned from ANNs continues in the literature, regrettably Fischer's (in press) analysis was problematic in a number of ways. First, it was simply impossible to replicate the simulation experiments because Fischer (in press) failed to provide information regarding the response variable's error structure. Second, it is unclear how variable importance values of PSW and PCW can be related to beta-weights of the correct linear regression model (see Fig. 1 in Fischer, in press); neither Garson's (1991) original formulation of PSW nor Olden et al.'s (2004) proposal of PCW asserted this relationship. Therefore, one cannot interpret the relative importance values of either PSW or PCW presented by Fischer (in press); or, put another way, there was no theoretical benchmark upon which to make a comparison. Third, even if we


assume beta-weights are a legitimate benchmark for PSW and PCW importance values, the relative importance of two variables x1 and x2 does not equal the ratio of beta-weights of x1^2 and x2, which was used by Fischer (in press) to evaluate the two quadratic datasets; the latter ratio measures the relative importance of x1^2 and x2 (not x1 and x2). Fourth, a 100% accuracy rate is exceedingly rare in simulation experiments owing to stochastic noise in the simulated dataset. Because the error structure is not known, we were not able to explain this result in the context of the simulation design. Moreover, a 100% accuracy in determining variable importance ranks in dataset 4 (functional form: y = x1 * x2 + 0.1 * x3, where x1 and x2 have the same theoretical ranks) required the implausible result of identical variable importance values for the two variables in all simulations.

The problems listed above and the high uncertainty around the variable importance values associated with PCW prompted us to investigate the legitimacy of both connection weight-based metrics, PCW and PSW. First, we performed a simulation study in which we assessed the utility of PCW and PSW in retrieving variable importance ranks from simulated datasets that follow the linear model structure used by Fischer (in press). However, unlike Fischer (in press), we performed a more robust investigation using a variety of sample sizes and overall effect sizes to more completely evaluate the generality of his findings (i.e., 100% rank accuracy). We did not repeat Fischer's experiment for nonlinear and interaction-term models because the beta-weight ratio used in his simulation experiment does not correctly measure variable importance. Second, we performed a more detailed dissection of a one-hidden-layer ANN to explain why neither PCW nor PSW is a universally good measure of variable importance rank or value. Third, in recognition of the weaknesses of both PCW and PSW, we propose a simple and generalizable R2-based variable importance measure and evaluate its performance against an appropriate benchmark.

2. Methods

2.1. Simulation experiment

We simulated datasets from the linear function used by Fischer (in press) (Eq. (1)):

yi = 1 * x1i + 0.5 * x2i + 0.1 * x3i + ei   (1)

where yi is the ith observation of the response (output) variable; x1i, x2i, x3i are the ith observations of the three predictor variables; and ei is the stochastic error term of the ith observation, which follows a Gaussian distribution with mean = 0 and standard deviation σ.

We simulated 9 sample size–effect size combinations (3 sample sizes × 3 effect sizes), each of which comprised 1000 datasets. In comparison, Fischer (in press) simulated a single sample size (n = 500) for a single effect size (unknown), examining just 100 datasets. In our experiment, we varied the overall effect size (i.e., strength of association between the predictor variables and response variable, or overall R2) by changing the standard deviation of the error term (σ = 0.3, 1, and 2, corresponding to overall model R2 ∼ 0.93, 0.56, and 0.24, respectively). Sample sizes in each dataset were n = 100, 500, and 1000. For each dataset, we first generated x1, x2, and x3 from a uniform distribution bounded between 0 and 1. We then centered and scaled each predictor variable so that mean = 0 and variance = 1 before simulating the response variable following Eq. (1).

We analyzed the 1000 datasets in each sample size–effect size combination separately using ANNs fitted by the backpropagation algorithm. We used the same network architecture as Fischer (in press): 3 input nodes, 3 hidden nodes in a single hidden layer, and 1 output node. The activation function in the output node was linear (as appropriate for the data) whereas that in the hidden nodes was sigmoidal (logistic), which is standard for ANNs. As the decay parameter was unknown in Fischer (in press), we chose to perform 10-fold cross validation to search for the optimal parameter value (from 5 candidate values, 0.0001–1) that minimized root-mean-square-error (RMSE) for each simulated dataset. The model with the optimal value was used to determine variable importance. Specifically, we wanted to evaluate whether we could replicate the 100% PSW rank accuracy in Fischer (in press).
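The data-generating scheme above can be sketched compactly. The following Python sketch is our illustration only (the paper's own code is in R, provided in Supporting Code A1); the function names and the seed are our choices, not the authors'.

```python
import numpy as np

def simulate_dataset(n, sigma, rng):
    """One dataset following Eq. (1):
    y_i = 1*x1_i + 0.5*x2_i + 0.1*x3_i + e_i,  e_i ~ N(0, sigma^2).
    Predictors are drawn from Uniform(0, 1), then centered and scaled
    to mean = 0 and variance = 1 before the response is simulated."""
    X = rng.uniform(0.0, 1.0, size=(n, 3))
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # center and scale each column
    y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0.0, sigma, n)
    return X, y

def simulate_design(n_datasets=1000, seed=1):
    """All 9 sample size-effect size combinations (3 x 3 grid)."""
    rng = np.random.default_rng(seed)
    return {(n, sigma): [simulate_dataset(n, sigma, rng) for _ in range(n_datasets)]
            for n in (100, 500, 1000)
            for sigma in (0.3, 1.0, 2.0)}
```

With σ = 0.3, 1, and 2 the expected overall model R2 is roughly 0.93, 0.56, and 0.24, matching the effect sizes described above.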

2.2. Functional decomposition of a simple ANN

Using a simple ANN with 2 input nodes, 2 hidden nodes, and 1 output node as an example, we explain how predictor (input) variables and neural connection weights combine to produce predictions of the response (output) variable. By comparing the functional relationship between the input variables and the output variable with the PSW and PCW methods, we evaluated the legitimacy of these two weight-based methods in determining relative variable importance.

2.3. A new R2-based measure of relative variable importance

After revealing methodological weaknesses of PSW and PCW, we propose a new measure of relative variable importance for ANNs based on the reduction of the variance explained by the model (R2) when each predictor variable is permuted in turn. Permuting a given predictor variable breaks the association between the predictor variable and the response (if present) and decreases the overall R2 of the model. The magnitude of the reduction in R2 when a given predictor variable is permuted reflects the strength of association between that predictor variable and the response. The rationale behind this metric follows the permutation accuracy importance metric used in random forest models (Strobl et al., 2008).

The permutational relative variable importance (pRVI_xi) of the ith predictor variable xi is (Eq. (2)):

pRVI_xi = R2_obs − R̄2_perm,xi   (2)

where R2_obs is the R2 of the ANN model fitted to the observed predictor and response variables, R2_perm,xi is the R2 of the ANN model fitted to a modified dataset where xi is permuted, and R̄2_perm,xi is the mean (or alternatively, median) value of R2_perm,xi over m permuted datasets. R2 was calculated as 1 − residual sum-of-squares/total sum-of-squares. This framework can be generalized to incorporate different measures of model fit in place of R2, such as RMSE for regression and classification accuracy for classification tasks. Because pRVI is based on the permutation of predictor variables, it can be applied to datasets comprising both continuous and categorical predictor variables.
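Eq. (2) translates directly into code. Below is an illustrative, model-agnostic Python sketch (the paper's own implementation is in R, in Supporting Code A1); following the definition above, the model is refitted to each permuted dataset. The `fit` argument, the default `m`, and the seed are our own assumptions.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - residual sum-of-squares / total sum-of-squares."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def pRVI(fit, X, y, m=50, seed=1):
    """Permutational relative variable importance, Eq. (2):
    pRVI_xi = R2_obs - mean(R2_perm,xi) over m permutations.
    `fit` takes (X, y) and returns a prediction function; in the
    paper this would be a backpropagation-trained ANN."""
    rng = np.random.default_rng(seed)
    r2_obs = r_squared(y, fit(X, y)(X))
    importance = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        r2_perm = []
        for _ in range(m):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the x_j association with y
            r2_perm.append(r_squared(y, fit(Xp, y)(Xp)))
        importance[j] = r2_obs - np.mean(r2_perm)  # Eq. (2)
    return importance
```

For a quick check, an ordinary least-squares fit can stand in for the ANN; on data simulated from Eq. (1) the resulting importances recover the ordering x1 > x2 > x3.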

We tested the utility of pRVI by conducting a simulation experiment that fitted ANNs to simulated datasets based on the three functional forms used by Fischer (in press) (Eqs. (3)–(5)):

Linear: yi = 1 * x1i + 0.5 * x2i + 0.1 * x3i + ei   (3)

Quadratic: yi = −1 * x1i^2 + 0.5 * x2i + 0.1 * x3i + ei   (4)

Interaction-term: yi = 1 * x1i * x2i + 0.1 * x3i + ei   (5)

where yi is the ith observation of the response (output) variable; x1i, x2i, x3i are the ith observations of the three predictor variables; and ei is the stochastic error term of the ith observation, which follows a Gaussian distribution with mean 0 and standard deviation σ.

This experiment had the same design as the one presented in Section 2.1. We simulated 9 sample size–effect size combinations, each comprising 500 datasets for each function. The effect sizes, dataset sample sizes, and variable generation procedure follow Section 2.1.
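The three response functions can be generated with one helper. This Python sketch is our illustration (the paper's code is in R); the form labels are our own names.

```python
import numpy as np

def simulate_response(X, form, sigma, rng):
    """Response under Eqs. (3)-(5); X is an (n, 3) matrix of
    centered and scaled predictors."""
    x1, x2, x3 = X[:, 0], X[:, 1], X[:, 2]
    e = rng.normal(0.0, sigma, len(X))
    if form == "linear":                          # Eq. (3)
        return 1.0 * x1 + 0.5 * x2 + 0.1 * x3 + e
    if form == "quadratic":                       # Eq. (4)
        return -1.0 * x1 ** 2 + 0.5 * x2 + 0.1 * x3 + e
    if form == "interaction":                     # Eq. (5)
        return 1.0 * x1 * x2 + 0.1 * x3 + e
    raise ValueError(f"unknown functional form: {form!r}")
```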
For each simulated dataset, we determined optimal values for the decay parameter (5 values: 0.0001–1) and the number of hidden nodes (3 values: 2, 3, 4) by performing 10-fold cross validation. To guard against the false convergence on local minima that is characteristic of machine learning approaches (Olden et al., 2008; preliminary analyses indicated that false convergence affected quadratic and interaction-term datasets), we added an algorithm to accept an ANN model only when the range of R2 obtained from four consecutive models was within a value of 0.01. With the optimal ANN model structure, we calculated pRVI and pRVI rankings for each dataset and compared them to a known benchmark: relative importance as measured by the reduction in R2 when fitting the correct model [via least-squares (LS) fitting] after permuting a given predictor variable versus fitting the correct model to the original dataset. For analyses involving the interaction-term datasets, a correct variable importance ranking was defined by pRVI of x1 and x2 > pRVI of x3.

Fig. 1. Distribution of product-of-connection-weights (PCW), product-of-standardized-weights (PSW), and least-squares beta-weights (LS) over 1000 simulated datasets with (a) n = 1000, σ = 0.3, (b) n = 1000, σ = 2, (c) n = 100, σ = 0.3, (d) n = 100, σ = 2. Lines, boxes, and whiskers are medians, interquartile range (IQR), and 1.5× IQR, respectively.

2.4. Software and code

All analyses were conducted in R 3.2.0 (R Core Team, 2015). Following Fischer (in press), ANNs were fitted in the nnet package (Ripley & Venables, 2015). PSW and PCW were calculated with functions provided in the NeuralNetTools package (Beck, 2015). We used the doParallel package (Revolution Analytics, 2014) to parallelize simulations. Due diligence necessitates that all code be made freely available when reporting on methodological comparisons using simulated data. For this reason we provide example R code for the simulations, the convergence algorithm, and pRVI in Supporting Code A1.
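The acceptance rule described above (accept a model only once the R2 range across four consecutive fits is within 0.01) can be sketched generically. This is a hypothetical Python sketch: `fit_once`, `window`, `tol`, and `max_fits` are our own names, and in the paper the repeated fits are backpropagation runs in R (Supporting Code A1).

```python
def fit_until_stable(fit_once, window=4, tol=0.01, max_fits=100):
    """Refit until the R^2 values of `window` consecutive fits agree
    to within `tol`, guarding against false convergence on local
    minima. `fit_once` returns a (model, r_squared) pair."""
    history = []
    for _ in range(max_fits):
        history.append(fit_once())
        if len(history) >= window:
            recent = [r2 for _, r2 in history[-window:]]
            if max(recent) - min(recent) <= tol:
                # accept the best-fitting model of the stable run
                return max(history[-window:], key=lambda pair: pair[1])
    raise RuntimeError("no stable run of fits within max_fits")
```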

3. Results

3.1. Simulation experiment

Contrary to Fischer (in press), neither PSW nor PCW produced 100% accurate variable importance ranks when applied to datasets with an underlying linear structure (Table 1). Notably, PCW was more accurate than PSW in all 9 sample size–effect size combinations we examined. The accuracy of PSW and PCW generally declined with decreasing sample size and increasing stochastic noise (i.e., error σ), as expected. However, the decline in PSW accuracy was much more pronounced than that of PCW. When σ = 2 (R2 ∼ 0.26) and n = 100, PSW gave correct variable rankings for only
Page 4: A new R2-based metric to shed greater insight on variable ......We used the same network architecture as Fischer (in press): 3 input nodes, 3 hidden nodes in a single hidden layer,

Table 1. Accuracy (%) of the product-of-standardized-weights (PSW; Garson, 1991) and product-of-connection-weights (PCW; Olden et al., 2004) approaches in ranking variable importance for simulated datasets of different sample sizes (n) and error standard deviation values (σ). Accuracy of beta-weights fitted by least-squares regression (LS) and R2 (and interquartile range; IQR) are presented for comparison.

n      σ    PSW    PCW    LS     R2 (IQR) of LS model
1000   0.3  81.4   93.4   100    0.93 (0.93–0.94)
1000   1    77.4   88.8   100    0.56 (0.55–0.57)
1000   2    50.1   82.7   100    0.24 (0.22–0.26)
500    0.3  83.9   93.2   100    0.93 (0.93–0.94)
500    1    76.4   92     100    0.56 (0.54–0.58)
500    2    39.6   88.4   99.8   0.24 (0.22–0.27)
100    0.3  80.3   92.5   100    0.94 (0.93–0.94)
100    1    81.7   96.5   99.7   0.57 (0.53–0.61)
100    2    34.4   75.5   86     0.26 (0.21–0.31)

34% of the datasets whereas PCW ranked 76% of these datasets correctly.

The suboptimality of PCW and PSW in ranking variables cannot be attributed to the effect of stochastic noise on linear regression beta-weights. In all but the lowest sample size–effect size combination, rankings derived from beta-weights were ≥99.7% accurate. In other words, although almost all simulated datasets (even with stochastic noise) reflected the correct variable importance structure, PCW and PSW were unable to retrieve it accurately.

The variation in PCW and PSW values was much higher than the benchmark beta-weight values in all sample size–effect size combinations (Fig. 1 and Fig. A1).

3.2. Functional decomposition of a simple ANN

Fig. 2. Schematic diagram of the example ANN with a 2:2:1 configuration (i.e., 2 input nodes, 2 hidden nodes, and 1 output node).

We used a simple ANN with 2 input nodes, 2 hidden nodes, and 1 output node to illustrate the relationship between input variables and output variables via neural connection weights (Fig. 2). The output values from the 2 input nodes (IN1 and IN2) are directly connected to the input (predictor) variables x1 and x2 via an identity link; therefore IN1 = 1 * x1 and IN2 = 1 * x2. These values are fed forward to the 2 hidden nodes; the net input (or activation) for each

node is the sum of its input values (i.e., input values are IN1, IN2, and BIASH = 1) multiplied by their associated connection weights (w and b; see Fig. 1). The activation values (AH1 and AH2) for hidden nodes 1 and 2 are therefore (Eqs. (6) and (7)):

AH1 = wIN1→H1 * IN1 + wIN2→H1 * IN2 + 1 * bH1   (6)

AH2 = wIN1→H2 * IN1 + wIN2→H2 * IN2 + 1 * bH2   (7)

The output values (H1 and H2) from the hidden nodes are obtained by applying a transfer function to AH1 and AH2. The transfer function in the hidden nodes is often sigmoidal; logistic or hyperbolic tangent functions are commonly used. We use a logistic transfer function in our example (Eqs. (8) and (9)):

H1 = 1 / (1 + exp(−AH1))   (8)

H2 = 1 / (1 + exp(−AH2))   (9)

The transfer function in the hidden layer is never linear because a network with such a transfer function would reduce to a single-layer network (without hidden layers) that can only deal with linearly separable problems (i.e., like linear regression) (Bishop, 1995).

Next, the output values from the hidden nodes are fed forward to produce a net activation value (AOUT) in the output node, which is in turn put through a transfer function to produce the overall output value (OUT; corresponding to response variable y) (Eqs. (10) and (11)):

AOUT = wH1→OUT * H1 + wH2→OUT * H2 + 1 * bOUT   (10)

OUT = 1 * AOUT   (11)

We use an identity (linear) output transfer function here, but a sigmoidal transfer function could also be used depending on the nature of the response variable in the dataset.

Combining Eqs. (6)–(11), the relationship between the input variables (xi) and the response variable (y) can be explicitly written down as (Eq. (12)):

y = wH1→OUT / (1 + exp[−(wIN1→H1 * x1 + wIN2→H1 * x2 + bH1)]) + wH2→OUT / (1 + exp[−(wIN1→H2 * x1 + wIN2→H2 * x2 + bH2)]) + bOUT   (12)

As seen from Eq. (12), the relative effect of the predictor variables xi on the output y cannot be calculated by comparing products of the connection weights across the network, which is the method used in both the PSW (Garson, 1991) and PCW (Olden et al., 2004) approaches. This is because the marginal change in the output of a given hidden node (H) resulting from one unit change in xi depends on the bias
Fig. 3. Permutational relative variable importance (pRVI) values versus actual R2 reduction from fitting the correct linear least-squares regression model after permuting each given variable in datasets with (a) n = 1000, σ = 0.3, (b) n = 1000, σ = 2, (c) n = 100, σ = 0.3, (d) n = 100, σ = 2. Points that fall on the 1:1 (black) line are pRVI values that accurately estimate actual R2 reduction.

term bH even when the connection weights w are the same. Consequently, the connection weights do not measure effect sizes in a consistent manner.
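Eq. (12) makes this easy to demonstrate numerically. The Python sketch below is our illustration only (the weight values are arbitrary): two 2:2:1 networks share identical connection weights, so PCW and PSW assign them identical importance products, yet the marginal effect of x1 on the output differs because the bias terms differ.

```python
import math

def ann_output(x1, x2, w, b):
    """Forward pass of the 2:2:1 ANN in Eq. (12); `w` holds the six
    connection weights and `b` the three bias weights."""
    H1 = 1.0 / (1.0 + math.exp(-(w["IN1_H1"] * x1 + w["IN2_H1"] * x2 + b["H1"])))
    H2 = 1.0 / (1.0 + math.exp(-(w["IN1_H2"] * x1 + w["IN2_H2"] * x2 + b["H2"])))
    return w["H1_OUT"] * H1 + w["H2_OUT"] * H2 + b["OUT"]  # identity output node

w = {"IN1_H1": 1.0, "IN2_H1": 0.5, "IN1_H2": 0.8, "IN2_H2": 0.3,
     "H1_OUT": 1.0, "H2_OUT": 1.0}
b_small = {"H1": 0.0, "H2": 0.0, "OUT": 0.0}  # sigmoids near their steepest point
b_large = {"H1": 5.0, "H2": 5.0, "OUT": 0.0}  # sigmoids saturated

def marginal_effect_x1(b, eps=1e-3):
    """Central-difference estimate of dOUT/dx1 at (x1, x2) = (0, 0)."""
    return (ann_output(eps, 0.0, w, b) - ann_output(-eps, 0.0, w, b)) / (2 * eps)
```

With these values marginal_effect_x1(b_small) is roughly 0.45 while marginal_effect_x1(b_large) is about 0.01, even though every connection weight, and hence every PCW/PSW product, is unchanged.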

3.3. A new R2-based measure of relative variable importance

In general, our new R2-based metric pRVI reflected the actual reduction in R2 when the correct least-squares (LS) model is fitted to permuted datasets (Figs. 3–5). The variable importance rankings retrieved by pRVI were ≥95% accurate in all but 2 functional model–sample size–effect size combinations (Table 2).

Although generally high, the accuracy of pRVI varied depending on the functional model underlying datasets, sample size, and effect size categories. The position of the points relative to the 1:1 (pRVI:actual R2 reduction) line indicated that pRVI was slightly more accurate in linear and interaction-term datasets (Figs. 3 and 5) than datasets with a quadratic term (Fig. 4). pRVI accuracy increased as sample size increased (Figs. 3–5, bottom to top) but, interestingly, decreased with increasing effect size (Figs. 3–5, right to left). We did not present figures from analyses involving intermediate sample (n = 500) and effect (σ = 1) sizes because they followed the trends described above.

When overall effect size was extremely high (σ = 0.3, corresponding to mean R2 > ∼0.90), pRVI tended to underestimate the importance of the most highly correlated variable(s) and overestimate the importance of subordinate variable(s). This bias was much reduced when the overall effect size was low (σ = 2, corresponding to mean R2 < ∼0.30).

4. Discussion

Despite performing extensive simulations comprising a wide range of sample and effect sizes, we could not replicate Fischer's (in press) finding that PSW (Garson, 1991) was 100% accurate in ranking variable importance in linear datasets. Our simulations confirmed Olden et al.'s (2004) finding that PCW outperformed PSW in such datasets; however, neither of these two approaches was consistently accurate.

Our theoretical analysis demonstrated that both PSW and PCW are not entirely faithful approaches to determining variable importance in ANNs, thus explaining the poor empirical performance of these connection weight-based metrics. The reason is that the effect exerted by 1 unit of connection weight is unequal across hidden nodes because the activation function is sigmoidal. When the bias weight is a large positive or negative value, a 1 unit increase in connection weight increases the output value by a smaller amount than when the bias weight is close to zero.

If there is no theoretical basis for either PSW or PCW, why were they effective in determining variable importance values and ranks in previous studies (e.g., Olden et al., 2004; Fischer, in press)? In Olden et al. (2004), the simulation datasets on which PCW was evaluated had a very high overall effect size. This might have caused bias weights to be similar across hidden nodes, resulting in comparable connection weights. The products of comparable connection weights were therefore able to approximate importance rankings (but probably not values). We could not speculate on the reasons for Fischer's (in press) findings owing to a lack of information on the error structure of the data generating model as well as various
Fig. 4. Permutational relative variable importance (pRVI) values versus actual R2 reduction from fitting the correct quadratic least-squares regression model after permuting each given variable in datasets with (a) n = 1000, σ = 0.3, (b) n = 1000, σ = 2, (c) n = 100, σ = 0.3, (d) n = 100, σ = 2. Points that fall on the 1:1 (black) line are pRVI values that accurately estimate actual R2 reduction.

Fig. 5. Permutational relative variable importance (pRVI) values versus actual R2 reduction from fitting the correct interaction-term least-squares regression model after permuting each given variable in datasets with (a) n = 1000, σ = 0.3, (b) n = 1000, σ = 2, (c) n = 100, σ = 0.3, (d) n = 100, σ = 2. Points that fall on the 1:1 (black) line are pRVI values that accurately estimate actual R2 reduction.

Page 7: A new R2-based metric to shed greater insight on variable ......We used the same network architecture as Fischer (in press): 3 input nodes, 3 hidden nodes in a single hidden layer,

Letter to the Editor / Ecological Modelling 313 (2015) 307–313 313

Table 2Accuracy of variable importance rankings based on the reduction in R2 by permuting each given variable and refittingthe correct least-squares model versus the optimal artificial neural network.

n � Accuracy (%)

Linear Quadratic Interaction-term

1000 0.3 100 100 1001 100 100 1002 100 100 100

500 0.3 100 100 1001 100 100 1002 99.8 99.2 100

opl

aietrtniasebh

isstiiwJi

A

mr

A

t0

R

B

E-mail addresses: [email protected] (X. Giam),

100 0.3 100

1 100

2 86.8

other aspects involving the ANN parameterization. However, it is possible that Fischer's (in press) simulated datasets exhibited very large effect sizes (R2 ∼ 1).

Our R2-based variable importance metric (pRVI) is a possible alternative to connection weight-based approaches. By measuring the proportion of variance in the response that is uniquely explained by each variable, it is analogous to the squared semipartial correlation in linear regression. Importantly, pRVI accurately retrieved variable importance values and rankings as defined by the unique proportion of variance explained in linear as well as nonlinear datasets. One pitfall of this method is that the relative importance of the strongest variables is slightly underestimated and that of the weakest variables is overestimated when sample sizes are low and effect sizes are extremely high. However, this bias is likely to be negligible for most ecological analyses because it is extremely rare for ecological datasets to exhibit such high overall effect sizes (model R2 > 0.90).

In conclusion, connection weight-based approaches to estimating variable importance are problematic. They lack theoretical support and do not perform well when applied to datasets with small-to-moderate effect and sample sizes. We propose a permutational R2-based metric (pRVI) which is general for different types of input variables and is accurate in measuring variable importance in linear as well as non-linear datasets. The use of pRVI together with association visualization tools (see Lek et al., 1996; Olden and Jackson, 2002) can help ecologists gain better explanatory insight into their study systems using ANNs.
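A minimal sketch of the permutation scheme behind pRVI is given below. It is written in Python for illustration (our analyses were run in R with nnet); the function names, the ordinary-least-squares stand-in for the trained ANN, and the default of 30 permutations are illustrative choices, not our exact implementation. The logic is the one described above: permute one predictor at a time, refit, and record the resulting drop in R2.

```python
import numpy as np

def r_squared(y, yhat):
    """Proportion of variance in y explained by the predictions yhat."""
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_predict_ls(X, y):
    """Fit ordinary least squares (with intercept) and return in-sample
    predictions. A stand-in for the trained ANN; any fit-then-predict
    routine can be substituted."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def prvi(X, y, fit_predict=fit_predict_ls, n_perm=30, seed=0):
    """Mean reduction in R2 after permuting each predictor and refitting."""
    rng = np.random.default_rng(seed)
    r2_full = r_squared(y, fit_predict(X, y))
    reductions = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_perm):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the X_j-y association
            drops.append(r2_full - r_squared(y, fit_predict(Xp, y)))
        reductions[j] = np.mean(drops)
    return reductions
```

Because the fitting routine is passed as a parameter, the same code applies unchanged to an ANN: supply a function that trains the network on (X, y) and returns its predictions.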

Acknowledgment

Financial support was provided by a H. Mason Keeler Endowment to the University of Washington. We thank one anonymous reviewer for commenting on the manuscript.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.ecolmodel.2015.06.034

References

Beck, M.W., 2015. NeuralNetTools: Visualization and Analysis Tools for Neural Networks. R Package Version 1.3.1, http://cran.r-project.org/web/packages/NeuralNetTools/index.html


Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, UK.

Fischer, A., 2015. How to determine the unique contributions of input-variables to the nonlinear regression function of a multilayer perceptron. Ecol. Model., http://dx.doi.org/10.1016/j.ecolmodel.2015.04.015, in press.

Garson, G.D., 1991. Interpreting neural-network connection weights. Artif. Intell. Expert 6, 47–51.

Gevrey, M., Dimopoulos, I., Lek, S., 2003. Review and comparison of methods to study the contribution of variables in artificial neural network models. Ecol. Model. 160, 249–264.

Lek, S., Delacoste, M., Baran, P., Dimopoulos, I., Lauga, J., Aulagnier, S., 1996. Application of neural networks to modelling nonlinear relationships in ecology. Ecol. Model. 90, 39–52.

Low-Décarie, E., Chivers, C., Granados, M., 2014. Rising complexity and falling explanatory power in ecology. Front. Ecol. Environ. 12, 412–418.

Olden, J.D., Jackson, D.A., 2002. Illuminating the black box: understanding variable contributions in artificial neural networks. Ecol. Model. 154, 135–150.

Olden, J.D., Joy, M.K., Death, R.G., 2004. An accurate comparison of methods for quantifying variable importance in artificial neural networks using simulated data. Ecol. Model. 178, 389–397.

Olden, J.D., Lawler, J.J., Poff, N.L., 2008. Machine learning methods without tears: a primer for ecologists. Quart. Rev. Biol. 83, 171–193.

R Core Team, 2015. R: A Language and Environment for Statistical Computing. Version 3.2.0. R Foundation for Statistical Computing, Vienna, Austria.

Revolution Analytics, Weston, S., 2014. doParallel: Foreach Parallel Adaptor for the Parallel Package. R Package Version 1.0.8, http://cran.r-project.org/web/packages/doParallel/index.html

Ripley, B., Venables, W., 2015. nnet: Feed-forward Neural Networks and Multinomial Log-linear Models. R Package Version 7.3-1, http://cran.r-project.org/web/packages/nnet/index.html

Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK.

Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning representations by back-propagating errors. Nature 323, 533–536.

Strobl, C., Boulesteix, A.-L., Kneib, T., Augustin, T., Zeileis, A., 2008. Conditional variable importance for random forests. BMC Bioinformatics 9, 307.

Xingli Giam ∗

Julian D. Olden ∗∗

School of Aquatic and Fishery Sciences, University of Washington, Seattle, WA 98105, USA

∗ Corresponding author.

∗∗ Corresponding author. Tel.: +1 206 616 3112.

[email protected] (J.D. Olden).

25 May 2015

