
A Comparative Analysis of Neural Networks and Statistical Methods for Predicting Consumer Choice
Author(s): Patricia M. West, Patrick L. Brockett, Linda L. Golden
Source: Marketing Science, Vol. 16, No. 4 (1997), pp. 370-391
Published by: INFORMS
Stable URL: http://www.jstor.org/stable/184232



A Comparative Analysis of Neural Networks and Statistical Methods for Predicting Consumer Choice

Patricia M. West * Patrick L. Brockett * Linda L. Golden
Department of Marketing Administration, University of Texas at Austin, Graduate School of Business, CBA 7.202, Austin, Texas 78712

Abstract

This paper presents a definitive description of neural network methodology and provides an evaluation of its advantages and disadvantages relative to statistical procedures. The development of this rich class of models was inspired by the neural architecture of the human brain. These models mathematically emulate the neurophysical structure and decision making of the human brain and, from a statistical perspective, are closely related to generalized linear models. Artificial neural networks are, however, nonlinear and use a different estimation procedure (feed forward and back propagation) than is used in traditional statistical models (least squares or maximum likelihood). Additionally, neural network models do not require the same restrictive assumptions about the relationship between the independent variables and dependent variable(s). Consequently, these models have already been applied very successfully in many diverse disciplines, including biology, psychology, statistics, mathematics, business, insurance, and computer science.

We propose that neural networks will prove to be a valuable tool for marketers concerned with predicting consumer choice. We will demonstrate that neural networks provide superior predictions regarding consumer decision processes. In the context of modeling consumer judgment and decision making, for example, neural network models can offer significant improvement over traditional statistical methods because of their ability to capture nonlinear relationships associated with the use of noncompensatory decision rules. Our analysis reveals that neural networks have great potential for improving model predictions in nonlinear decision contexts without sacrificing performance in linear decision contexts.

This paper provides a detailed introduction to neural networks that is understandable to both the academic researcher and the practitioner. This exposition is intended to provide both the intuition and the rigorous mathematical models needed for successful applications. In particular, a step-by-step outline of how to use the models is provided along with a discussion of the strengths and weaknesses of the model. We also address the robustness of the neural network models and discuss how far wrong you might go using neural network models versus traditional statistical methods.

Herein we report the results of two studies. The first is a numerical simulation comparing the ability of neural networks with discriminant analysis and logistic regression at predicting choices made by decision rules that vary in complexity. This includes simulations involving two noncompensatory decision rules and one compensatory decision rule that involves attribute thresholds. In particular, we test a variant of the satisficing rule used by Johnson et al. (1989) that sets a lower bound threshold on all attribute values and a "latitude of acceptance" model that sets both a lower threshold and an upper threshold on attribute values, mimicking an "ideal point" model (Coombs and Avrunin 1977). We also test a compensatory rule that equally weights attributes and judges the acceptability of an alternative based on the sum of its attribute values. Thus, the simulations include both a linear environment, in which traditional statistical models might be deemed appropriate, as well as a nonlinear environment where statistical models might not be appropriate. The complexity of the decision rules was varied to test for any potential degradation in model performance. For these simulated data it is shown that, in general, the neural network model outperforms the commonly used statistical procedures in terms of explained variance and out-of-sample predictive accuracy.

An empirical study bridging the behavioral and statistical lines of research was also conducted. Here we examine the predictive relationship between retail store image variables and consumer patronage behavior. A direct comparison between a neural network model and the more commonly encountered techniques of discriminant analysis and factor analysis followed by logistic regression is presented. Again the results reveal that the neural network model outperformed the statistical procedures in terms of explained variance and out-of-sample predictive accuracy. We conclude that neural network models offer superior predictive capabilities over traditional statistical methods in predicting consumer choice in nonlinear and linear settings.
(Consumer Decision Making; Neural Networks; Statistical Techniques)



1. Introduction

The study of artificial neural networks (ANN) has drawn considerable attention in many disciplines, including biology, psychology, statistics, mathematics, business, and computer science. The development of this rich class of nonlinear models was inspired by the neural architecture of the human brain, which consists of multiple levels of neurons and synaptic connections with information transfer between neurons across synaptic arcs (Rosenblatt 1961). Due to the empirical success of artificial neural networks at predicting the outcome of complex nonlinear processes, the methodology has become recognized for its superior forecasting ability and is receiving considerable attention from statisticians.

We propose that artificial neural networks will prove to be a valuable tool for marketers concerned with predicting the outcome of consumer decision processes, especially those that involve the use of noncompensatory choice heuristics. Figure 1 (to be discussed in more detail in §3.2) graphically illustrates three commonly used decision rules for consumer choice. The "Satisficing" and the "Latitude of Acceptance" rules are both nonlinear and noncompensatory, and pose severe prediction problems for standard linear statistical models. We show that neural network models have great potential for improving model predictions in these nonlinear decision contexts without sacrificing performance in linear decision contexts.

Research in the area of artificial neural networks has moved along two distinct lines (Hart 1992). Cognitive scientists have built "connectionist" models of brain behavior to further understand human cognition (see Rumelhart, McClelland, and the PDP Research Group 1986 for a review). Some of the specific skills examined have been speech perception (McClelland and Elman 1986) and decoding (Elman and Zipser 1988), reading (McClelland 1986), learning and memory (McClelland and Rumelhart 1986), recognizing handwritten characters (Fukushima and Miyake 1984), inference generation (Shastri and Ajjanagadde 1993), and recognition and cued-recall (Chappell and Humphreys 1994). The purpose of these studies has been to better understand and model the associative nature of human cognition.

Figure 1 Threshold-Based Decision Rules

[Three panels plot Attribute 2 against Attribute 1, each axis running from 0 to 1000, with the region of acceptable alternatives marked "Accept" and the remainder marked "Reject."]

(a) Satisficing Rule: A threshold is set for each attribute. Only alternatives that exceed each attribute threshold are deemed "acceptable."

(b) Latitude of Acceptance Rule: An acceptance range is set for each attribute. Only alternatives that fall within the shared acceptance region are deemed "acceptable."

(c) Weighted-Additive Rule: A threshold is set for the combined value of attributes. Only alternatives whose attribute sum exceeds the threshold are considered "acceptable."

On the other hand, information scientists, statisticians, and outcome-based researchers examine the mathematical properties of the neural network models, including their performance in data analysis tasks such as classification (Archer and Wang 1993), multiple criteria decision making (Malakooti and Zhou 1994, Wang 1994, Wang and Malakooti 1992, Hart 1992), regression (Dutta and Shekhar 1988, Surkan and Singleton 1990, Treigueiros and Berry 1991),


discriminant analysis (Curram and Mingers 1994, Yoon, Swales, and Margavio 1993), and even the financial problems of portfolio choice and financial insolvency monitoring (Brockett et al. 1994). In each of these applications the role of the neural network is to enhance mathematical approaches to decision making. Neural networks are shown in these studies to outperform frequently used multivariate statistical techniques for these classification and decision-making tasks. Our intent is to demonstrate that neural networks offer significant improvement over traditional statistical methods generally used to model consumer judgment and decision making because of their ability to capture nonlinear relationships associated with the use of noncompensatory decision rules (Ganzach 1995, Payne, Bettman, and Johnson 1993).

This paper reports the results of two studies. The first is a numerical simulation comparing the ability of neural networks with discriminant analysis and logistic regression for predicting choices made by noncompensatory decision rules that vary in complexity. The second is an empirical study bridging the behavioral and statistical lines of research: We examine the predictive relationship between retail store image variables and consumer patronage behavior. A direct comparison between a feed-forward back-propagation neural network model and the more commonly encountered techniques of discriminant analysis and logistic regression will be presented. Both the simulation results and the empirical analysis reveal that the neural network model performed as well as or better than commonly used statistical procedures in terms of forecasting ability (out-of-sample predictive accuracy).

In the context of consumer behavior, a strength of the neural network approach lies in its ability to mimic the functioning of the human brain and to estimate nonlinear and noncompensatory processes without first presupposing a parametric relationship between product attributes, perceptions, and behavior. We propose that neural networks hold great promise for enabling the prediction of behavior based on product attributes and/or image, particularly when the underlying choice process is noncompensatory in nature.

From a cognitive perspective, the architecture of these models is consistent with the widely accepted spreading activation model of human memory (Anderson 1976, Collins and Loftus 1975, Quillan 1968) and pattern recognition models of categorization and learning (Simon 1996). We propose that these models are also well suited for representing judgment and decision making, which often entails nonlinear and noncompensatory processing of information (Payne et al. 1993). From a statistical perspective, these models are closely related to generalized linear model techniques, but provide superior predictions regarding consumer attitudes and choice over traditional statistical methods because the neural network does not require the same restrictive assumptions about the relationship between the independent variables and dependent variable(s).¹

1.1. Challenging the Robustness of Simple Linear Models

Studies using compensatory linear models have abounded since the seminal writing of Martin Fishbein (1967) on attitudes. These models have been used to examine consumer preference (see Green and Srinivasan 1978, 1990 for reviews), attitudes (Wilkie and Pessemier 1973), and judgment and decision making (Dawes and Corrigan 1974), and have been touted as being normatively appropriate for multiattribute decision making (Keeney and Raiffa 1976). From the perspective of mathematical convenience, one of the strengths of compensatory linear models has been the ease with which the models could be estimated using ordinary least squares regression, analysis of variance procedures, logistic regression, and discriminant analysis. These models are frequently used because of the common belief in their ability to mimic consumer judgment and choice processes.

Challenges to the basic linear-additive structure of these models, however, have been levied almost since their inception. Anderson (1970, 1971, 1991) argued that in addition to the simple additive model, both

¹Statistical models require the imposition of a parametric formulation for the relationship under study (e.g., a linear relationship, logit linear relationship, or a multivariate normal distribution within groups with equal covariance matrices, etc.). If the assumed relationship is too restrictive, then the predictive ability of the model is compromised relative to the flexible (indeed, universal approximator) form of the neural network.


multiplicative and multilinear models are also legitimate combination rules for describing the information integration process (see Lynch 1985 for a discussion of testing alternative integration rules). In fact, Coombs and Avrunin (1977) demonstrated that preference is often not linear, but rather a single-peaked function. For example, most coffee drinkers who prefer cream and sugar would attest to the fact that there is an optimal level of each that is preferable. Moreover, unlike the compensatory structure of linear-additive models, Payne et al. (1993) demonstrated that much of consumer judgment and choice is the result of simple decision rules that are noncompensatory in nature (see Figures 1(a) and 1(b)) rather than compensatory (see Figure 1(c)). These noncompensatory rules often involve the use of attribute thresholds that allow for easy elimination or acceptance of an alternative. Johnson, Meyer, and Ghose (1989) demonstrated that compensatory statistical models, such as the linear-additive structure, fail to capture noncompensatory decision rules, particularly when attributes are negatively correlated.

In each of the above cases, the deviations from the basic linear-additive structure have been tested, and methods have been developed for modeling nonlinear rules by incorporating the appropriate exponential terms and multiplicative interactions into the regression models. Noncompensatory decision processes can be modeled when the nature of the rule is known a priori by estimating separate slopes for the regions above and below the attribute thresholds. Essentially, to model a nonlinear process or noncompensatory decision rule, a specific relationship is postulated and then examined for reduction in error variance relative to a more parsimonious linear model.

We shall show in this paper that neural network modeling may offer significant advantages over the commonly used estimation procedures described above for assessing consumers' attitudes, preferences, judgment, and choice behavior. Neural network models provide superior predictive capabilities when the nature of the decision rule being applied is unknown. This conclusion is consistent with that of White (1989), who states that neural network models provide a "rich and interesting class of new methods applicable to a regression problem requiring some sort of flexible functional form."

In the following section we describe the ANN model in some detail. Subsequently we examine the results of two studies. First, we present a simulation that tests the performance of the ANN against traditional linear models for capturing three commonly observed consumer choice rules. Second, we test how the ANN performs in practice using an empirical data set examining consumer patronage behavior. Finally, we provide a set of guidelines for using ANN models and discuss some practical implications for managers.

2. Background on Neural Network Methods

The neural network model can be represented as a massively parallel interconnection of many simple processing units connected structurally in much the same manner as individual neurons in the brain. Just as the individual neurons in the brain provide "intelligent learning" through their constantly evolving network of interconnections and reconnections, artificial neural networks function by constantly adjusting the values of the interconnections between neural units (see Figure 2). In building a mathematical model of this biological process, we create a "node" and "arc" structure analogous with their biological counterparts. The nodes or "neurons" in the model are connected to each other via interconnection weights representing arcs. The interconnection weights are analogous to the lack of resistance of the electrical gap or "synapse" between neurons in the brain. In the brain, neural pathways become strengthened if they are useful, and the resistance at the synapse changes (reduces) to reinforce the ease of transmission of energy along successful neural pathways. In artificial neural networks, pathways also become strengthened if they are useful; however, it is the weight between neurons that changes (increases) to reinforce a successful neural pathway. The process by which the network judges success, learns to improve its performance, recognizes patterns, and develops generalizations is called the "training rule."

The learning law proposed by Hebb (1949) served as the starting point for developing the training algorithms of neural networks. The subsequent development of the "back-propagation" training rule resolved


Figure 2 Neural Network Topology

[Panel (a) shows inputs x_i entering a single neural unit j: the inputs are aggregated through the weights w_i and an activation function F_j(Σᵢ xᵢwᵢ) produces the unit's output. Panel (b) shows an input layer connected through weights w⁽¹⁾ to a single hidden layer, connected in turn through weights w⁽²⁾ to the output of the neural net.]

(a) A single neural processing unit (neuron) j.

(b) Multiple neural processing units. Each circle in the hidden layer represents a single neural processing unit.

certain computational problems outstanding for two decades and significantly enhanced the performance of neural networks in practice (Smith 1996). ANNs are based on a "feed-forward" system whereby the flow of the network is from input toward output (as, for example, occurs in path models and structural equation or maximum likelihood factor analysis "causal" models). Via "back propagation" the network updates the interconnection weights by starting at the derived output value, determining the error produced with the current configuration, and then propagating the error backward through the network to determine, at an aggregate level, how best to update the interconnection weights between individual neurons to improve the overall predictive accuracy. The method of steepest gradient descent is used for updating the weights to minimize total aggregate error.

2.1. The General Neural Network Model

All neural networks possess certain fundamental features. For example, the basic building block of an ANN is the "neural processing unit" or "neuron," which takes a multitude of individual inputs x = (x₁, ..., x_I), determines (through the learning algorithm) the optimal connection weights w = (w₁, ..., w_I) that are appropriate to apply to each input, then aggregates these weighted values to concatenate the multiple inputs


into a single value Σᵢ wᵢxᵢ for the neuron. An activation function, F, is then applied to the aggregated weighted value to produce an individual output F(Σᵢ wᵢxᵢ) for the specific neural unit. The logistic activation function F(z) = 1/(1 + exp(-(z - η))) is commonly used, although other sigmoid functions such as the hyperbolic tangent function are possible.² The logistic function has its steepest slope near the threshold (intercept) point η, indicating that the relative impact of inputs on the corresponding output values is most pronounced near the threshold, while extreme or outlier values have a decreasingly dramatic effect upon predictions. Figure 2(a) graphically displays the configuration of the single neuron as described above.
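For concreteness, the computation performed by a single unit can be sketched in a few lines of Python (a minimal illustration; the input, weight, and threshold values below are made up, not taken from the paper):

```python
import numpy as np

def neuron_output(x, w, eta):
    """Single neural processing unit: aggregate the weighted inputs,
    then apply the logistic activation F(z) = 1 / (1 + exp(-(z - eta)))."""
    z = np.dot(w, x)                          # aggregation: sum_i w_i * x_i
    return 1.0 / (1.0 + np.exp(-(z - eta)))

# Illustrative values: two inputs, arbitrary weights and threshold
x = np.array([0.4, 0.9])
w = np.array([1.5, -0.7])
print(neuron_output(x, w, eta=0.2))           # a value in (0, 1)
```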

Until now, we have discussed the working of a single neuron; however, the topology of most neural networks generally involves a collection of neurons that are configured in two layers (i.e., an input layer and an output layer) or three layers (i.e., an input layer, hidden layer, and output layer).³ For each layer of the neural network the same construction process is used to create an array of neural processing units. The ultimate topology for the network is obtained by connecting the input layer units (via connection weights w⁽¹⁾) to the hidden layer units, and connecting the hidden layer units (via connection weights w⁽²⁾) to the output unit (see Figure 2(b)).

²The output range for the logistic sigmoid function is (0, 1), whereas the output range for the hyperbolic tangent sigmoid function is (-1, 1). The logistic sigmoid function has desirable mathematical properties (e.g., F′(z) = F(z)[1 - F(z)]), which can be used to simplify the computation of gradients and elasticities. These mathematical advantages, together with the insensitivity of the ultimate results to the precise choice of sigmoid function, have contributed to the dominance of the logistic function in practical applications. A discussion of the attributes of various sigmoid formulae is given in Menon et al. (1996). Note also that 2/(1 + exp(-x)) = 1 + tanh(x/2), so the logistic and hyperbolic tangent functions are essentially mathematically equivalent.

³There is a lack of consistency in the literature with regard to the counting of the number of layers in a network (cf., Abbott 1996, Hart 1992, Rumelhart et al. 1996). These differences in terminology can be resolved by referring to the structure of the network based on the number of "hidden layers" it contains. Thus, the model presented in Figure 2(b) would be referred to as a "single hidden layer" neural network.

2.2. Network Neural Processing Units and Layers

As indicated above, artificial neural networks are formed by modeling the interconnections between the individual neural processing units. It follows from the neural network topology given in Figure 2(b) that, when a logistic activation function is used, the mathematical structure of the neural network without any hidden layer is isomorphic to the standard logistic regression model. Thus, the neural network models considered here generalize the logistic regression models commonly used in marketing and social science research.⁴

Rosenblatt (1959) developed the simplest neural network without a hidden layer, provided a convergence algorithm for adjusting weights, and demonstrated that if the inputs presented to the network come from two linearly separable classes then the algorithm yields a hyperplane that distinguishes the two classes (see Figure 1(c)). However, networks that possess a hidden or intermediate processing layer between the input layer and output layer are able to represent more complex nonlinear rules, such as noncompensatory decision rules commonly used in consumer choice situations (see Figures 1(a) and 1(b)).

Further substantiation for using a single hidden layer topology for modeling complex behavioral problems is derived from the following mathematical results. Prior research proves that a single hidden layer neural network allows for universal approximation of any continuous functional relationship between inputs and outputs (Funahashi 1989, Hornik, Stinchcombe, and White 1990). Consequently, any discontinuous functional relationship that can be approximated by a continuous function can also be universally approximated by a single hidden layer network. The hidden layer of the network captures nonlinearities and interactions between variables (Lippmann 1987). Essentially, these results show that the class of neural network models having a single hidden layer is "dense" in the space of all continuous functions of n variables, so that no matter what is the "true" (but unknown)

⁴Other standard statistical models also arise as subsidiary models of the simplest neural networks. For example, the Probit model arises in the two-layer network with a summative aggregation function and an activation function that is the standard normal distribution function.


functional relationship between the input variables and the output variable, it can be well approximated by a single hidden layer neural network model. Indeed, such single hidden layer neural networks are able to predict the failure of savings and loan companies (cf., Salchenberger, Cinar, and Lash 1992) and the failure of property and casualty insurance companies (cf., Brockett et al. 1994) better than traditional statistical methods.

2.3. The Back-Propagation Algorithm

One of the differentiating characteristics between neural network models and traditional statistical procedures is the back-propagation estimation algorithm used by neural networks. Much like the techniques used for maximum likelihood estimation, the back-propagation algorithm can be viewed as a gradient search technique wherein the objective function is to minimize the squared error between an observed output for a given configuration of input values and the computed output of the network using the same input values. A primary difference is that the back-propagation algorithm can sequentially consider data records, readjusting the parameters after each observation in a gradient search manner, whereas estimation procedures such as maximum likelihood and least squares use an aggregated error across the entire sample in the estimation.⁵

For example, the neural network is trained by presenting an input pattern vector X to the network and computing forward through the network until an output vector Ô is obtained. The output error is computed by comparing the computed output Ô with the actual output O for the input X. The network learns by adjusting the weights of each neural processing unit in such a fashion as to reduce this observed prediction error. The prediction errors are swept backward through the network, layer by layer, to associate a "square error derivative" (delta) with each processing unit, to compute a gradient from each delta, and finally to update the weights of each processing unit based upon the corresponding gradient (see Appendix 1).

⁵The back-propagation method can be specified to update the weights after processing small or large "batches," as well as after each individual observation, by using the aggregate error for the batch in computing the error gradient.

This process is then repeated beginning with another input/output pattern in the training set. After all the patterns in the training set have been used, the algorithm examines the training set again one by one and readjusts the weights throughout the entire network structure until either the objective function (sum of squared prediction errors on the training sample) is sufficiently close to zero or the default number of iterations is reached. Eberhart and Dobbins (1990) present a computer algorithm implementing the back-propagation technique described above.
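To make the mechanics concrete, here is a minimal Python sketch of one feed-forward/back-propagation update for a single hidden layer network with logistic activations and squared-error loss. It illustrates the general technique rather than the paper's Appendix 1; the learning rate, the random initialization, and the use of bias terms in place of the paper's thresholds (b = -η) are implementation choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, target, W1, b1, W2, b2, lr=0.1):
    """One feed-forward pass and one gradient-descent weight update
    for a single observation, minimizing squared error."""
    # Feed forward: input -> hidden layer -> output
    h = sigmoid(W1 @ x + b1)                      # hidden-unit values H_j
    o = sigmoid(W2 @ h + b2)                      # network output
    # Back propagation: deltas use the identity F'(z) = F(z)(1 - F(z))
    delta_o = (o - target) * o * (1.0 - o)        # output-layer delta
    delta_h = (W2.T @ delta_o) * h * (1.0 - h)    # hidden-layer deltas
    # Steepest-descent updates of weights and thresholds (in place)
    W2 -= lr * np.outer(delta_o, h)
    b2 -= lr * delta_o
    W1 -= lr * np.outer(delta_h, x)
    b1 -= lr * delta_h
    return o

# Illustrative shapes: I = 2 inputs, J = 3 hidden units, one output
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
backprop_step(np.array([0.2, 0.8]), np.array([1.0]), W1, b1, W2, b2)
```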

3. Implementation Procedures

Depending on the sophistication of the software being used to estimate the neural network model, a researcher may be called on to make a number of decisions regarding the topology of the network, the activation function used for training, and the stopping rule used for terminating training. We examine these issues in detail and provide a set of guidelines for model implementation.

3.1. Variable Selection

The choice of which variables to include in the model is an important consideration for any practical statistical model building procedure, neural networks included. Various statistical devices can be used to select pertinent variables or reduce the number of input variables. For example, logistic regression, which represents a simplified version of a neural network, can be used to create a "super list" of potentially significant variables by examining the t statistics of the parameters. Once the neural network has been best trained on this "super list," examination of the variables' neural connection weights can be used to identify prospective variables for elimination (see Appendix 2 for sensitivity calculations). Variables with small connection weights (i.e., low sensitivities) are good candidates for elimination. A new network is then created without these variables and the performance assessed. Alternatively, an information theoretic technique can be used to order variables with respect to the information they provide on the outcome variable. Networks can then be built by adding variables one at a time and examining if there is improved performance. Once the


selection of variables is made, the model building can proceed.
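The weight-based screening step can be sketched as follows. This is a crude stand-in for the paper's Appendix 2 sensitivity calculations, ranking inputs by the total magnitude of their connections into the hidden layer (all names are illustrative):

```python
import numpy as np

def prune_candidates(input_names, W1, k=3):
    """Rank input variables by a simple sensitivity proxy (the summed
    absolute connection weights into the hidden layer) and return the
    k lowest-ranked variables as candidates for elimination."""
    sensitivity = np.abs(W1).sum(axis=0)   # one value per input variable
    order = np.argsort(sensitivity)        # smallest sensitivities first
    return [input_names[i] for i in order[:k]]
```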

3.2. Choosing an Activation Function

The logistic activation function is overwhelmingly the most common choice in neural network modeling. The choice of the activation function depends upon the desired network output range. Within the class of sigmoid functions having the same output range, it matters little which particular sigmoid function is used. However, the logistic sigmoid function has certain desirable mathematical and computational properties (see Footnote 2). These mathematical advantages, together with the insensitivity of the ultimate results to the precise choice of sigmoid function, have contributed to the dominance of the logistic function in practical applications.

3.3. Cross-Validation

If your objective is to build a model that is generalizable to new data (i.e., you are interested in forecasting as opposed to merely fitting an existing data set), then it is important to use a validation sample to prevent overfitting the model. Cross-validation procedures are used to determine how well a network captures the nature of a functional relationship. This entails splitting the data into subsamples in order to validate the network's performance on additional examples that were not used in training the model. The data set should be partitioned into a training sample (T1), a validation sample (T2), and a testing sample (T3). While there is not a precise figure for the size of these three samples, the "rule of thumb" we have used is 60 percent for T1 and 20 percent each for T2 and T3.

Subsample T1 is used to determine the network configuration and connection weights (i.e., for model building and parameter estimation). However, we cannot determine when to stop training simply by examining the error on the training sample T1, because the error continues to decline as the network learns the idiosyncrasies in the training sample (see Figure 3). Therefore, we spot overfitting by simultaneously observing the error on the validation sample, T2. The set T3 is used to test the network's out-of-sample predictive accuracy (generalizability).
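A minimal sketch of this 60/20/20 partitioning in Python (the function name and random seed are illustrative):

```python
import numpy as np

def partition(n_obs, seed=0):
    """Randomly split observation indices into training (60%),
    validation (20%), and testing (20%) subsamples T1, T2, T3."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_obs)
    n_train, n_val = int(0.6 * n_obs), int(0.2 * n_obs)
    return (idx[:n_train],                     # T1: fit the weights
            idx[n_train:n_train + n_val],      # T2: spot overfitting
            idx[n_train + n_val:])             # T3: out-of-sample accuracy
```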

In summary, at each iteration the interconnection weights of the network are estimated using T1. These

Figure 3 Stopping Rule for Training the Neural Network

This figure was adapted from Smith (1996, p. 127). The training sample (T1) is used to determine the network configuration and connection weights. Because the error continues to decline as the network learns the idiosyncrasies in the training sample, the validation sample (T2) is used to spot overfitting by simultaneously observing the error in the validation sample. Training is stopped when the error in the validation sample begins to increase.

[Plot of error (0 to .08) against training iterations (0 to 100): the training-sample error declines steadily, while the validation-sample error declines and then turns upward; "Stop training" marks the validation minimum.]

parameters are used to validate the developing model, using the data from T2, in order to determine when training should cease. The final weights are applied to T3 to determine the network's out-of-sample predictive accuracy.

3.4. Stopping Rule

An important part of training the neural network is knowing when it is most appropriate to stop. Extending the training process too long can result in "overfitting" the idiosyncrasies of the training data set at the expense of the network's ability to generalize to other data configurations. This is similar to the statistical issue of parsimonious model selection: better-fitting models can be obtained for any particular data set by merely increasing the predictor variable set, but at the expense of out-of-sample predictive accuracy. Conversely, in neural network training, too few iterations of the data will prevent the network from adequately learning to discern the pertinent relationships between the input patterns and the output variable(s).

The validation sample T2 is used to determine the appropriate number of iterations through the data to be used for adjusting the weights in the network, as is illustrated in Figure 3. When the error on the validation sample begins to increase we know that overfitting has occurred. At this point we stop training and go back to the earlier weights that produced less error


on T2. The inflection point in the curve for the validation sample, which is illustrated in Figure 3, represents the "peak of generalizability" to a holdout sample.
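The stopping rule can be sketched as a simple loop, assuming hypothetical callbacks train_one_pass (one back-propagation sweep over T1, returning the current weights) and validation_error (error on T2):

```python
import copy

def train_with_early_stopping(train_one_pass, validation_error, max_iters=100):
    """Iterate over the training sample T1; stop when the error on the
    validation sample T2 turns upward, and restore the weights from the
    iteration with the lowest validation error."""
    best_err, best_weights = float("inf"), None
    for _ in range(max_iters):
        weights = train_one_pass()           # one sweep over T1
        err = validation_error(weights)      # error on T2 (never used for fitting)
        if err < best_err:
            best_err, best_weights = err, copy.deepcopy(weights)
        else:                                # validation error has begun to rise
            break
    return best_weights
```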

Another criterion for deciding when to stop training is to continue iterative changes in the weights until the weight matrix reaches stationarity (Malakooti and Zhou 1994), or to continue training until the percentage misclassified in the training sample no longer continues to improve (Curram and Mingers 1994). If primary interest lies in out-of-sample predictive accuracy, then both of these methods run the risk of training the network to recognize idiosyncrasies of the training data set at the expense of generalizability. While these latter methods are common in statistical estimation (such as maximum likelihood) done in a "batch mode," they are not recommended for creating predictive neural network models.

3.5. Network Configuration

It has been proven that any continuous functional form can be well approximated by a single hidden layer neural network. Existing theory does not, however, provide a rule of thumb for determining the number of neurons (hidden units) to use in the hidden layer. Consequently, the appropriate number of hidden units to use will depend on the nature of the data and the problem. Each additional hidden unit adds capacity to the network because it increases the degrees of freedom with which the model can fit the data. Nevertheless, to prevent overfitting we also need to limit the number of hidden units.

In practice, we advise testing simple models first and successively increasing the number of hidden units until a peak of generalizability to the validation sample occurs. Thus, we start with a "zero hidden unit" model and see if generalizability can be improved by adding a single hidden unit. We then add a second hidden unit, retrain, and see if we improve performance. When performance in the validation sample T2 begins to deteriorate by adding a unit, we drop back to the previous number of hidden units and use that model structure. In essence, the error in the model estimated on T1 will continue to decrease as additional hidden units are added; however, the error in the validation sample T2 will decline up to a point and then increase. This inflection point is used to determine the appropriate number of hidden units. The pictorial representation of Figure 3, with the horizontal axis being the number of hidden units, is the pertinent conceptual framework.
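This search over hidden-layer sizes reduces to another simple loop, assuming a hypothetical train_and_validate(j) that trains a j-hidden-unit network and returns its error on the validation sample T2:

```python
def select_hidden_units(train_and_validate, max_units=10):
    """Start from the zero-hidden-unit model (logistic regression) and
    add hidden units one at a time; fall back to the previous size as
    soon as validation performance deteriorates."""
    best_err = train_and_validate(0)
    for j in range(1, max_units + 1):
        err = train_and_validate(j)   # retrain the network with j hidden units
        if err >= best_err:           # performance on T2 got worse
            return j - 1
        best_err = err
    return max_units
```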

3.6. Sample Reuse Procedure

The steepest descent gradient algorithm used ensures that the neural network has reached a local extremum; however, due to the nonconvex nonlinear nature of the neural network model, we may not be at a global extremum.⁶ A sample reuse procedure is recommended to minimize this risk. This procedure is analogous to using multiple starting points in a global maximum likelihood parameter search in statistical estimation, except that the unit of repetition in neural networks is the data set, so multiple data sets should be tested. In the interest of using the existing data to the maximum extent possible, we recommend a sample reuse procedure (similar to statistical bootstrapping techniques). This procedure entails creating multiple data partitions.⁷ For each partition, the data set should be randomly divided into sets T1 (training = 60 percent), T2 (validation = 20 percent), and T3 (testing = 20 percent), as described earlier. Multiple partitioning provides information about the distribution of outcomes (i.e., the mean, range, and variance) and the expected accuracy and reliability of the modeling procedure. The range of performance for the neural network also provides useful information about the potential for error in prediction (worst and best case performance). A researcher interested in developing a predictive model would only be interested in finding the "best" training session. This can be accomplished by identifying the data partitioning that produced the

⁶This problem of local extrema is not unique to neural networks. Maximum likelihood estimation suffers from the same problem unless the density function is convex in the parameters. In maximum likelihood estimation the use of multiple starting points is used to address the search for maxima in a nonconvex setting. Analogously, for neural network models wherein learning and convergence are affected by data ordering, a sample reuse procedure with multiple data partitionings serves the same purpose for finding optima in nonconvex neural network models.

⁷The appropriate number of data partitions needs to be determined. The more partitions used the better (as in bootstrap estimation), but practically speaking we have found that performing 10 separate random partitionings of the data into subsets T1, T2, and T3 is adequate.


largest percentage of correctly classified data when applied to its own testing subsample T3 (i.e., the best out-of-sample predictive accuracy). In essence, by taking the best performing model one achieves a better result than that obtained using a single sample partitioning.
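A sketch of the sample reuse procedure, reusing the partition() helper from §3.3 and assuming a hypothetical build_and_test callback that trains on T1, stops on T2, and returns classification accuracy on T3:

```python
def sample_reuse(n_obs, build_and_test, n_partitions=10):
    """Repeat the 60/20/20 split with different random partitions and keep
    the partition whose trained network classifies its own testing
    subsample T3 best (the 'best trained' network)."""
    results = []
    for seed in range(n_partitions):
        t1, t2, t3 = partition(n_obs, seed=seed)   # partition() as sketched in 3.3
        accuracy = build_and_test(t1, t2, t3)      # train on T1, stop on T2, score on T3
        results.append((accuracy, seed))
    return max(results)   # best out-of-sample accuracy, and its partition seed
```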

4. Study 1-A Numerical Simulation

4.1. Overview

Prior research suggests that linear models are robust and capable of capturing noncompensatory processes in judgment (Dawes and Corrigan 1974, Einhorn 1970, Ganzach 1995) and choice (Johnson and Meyer 1984). This belief rests upon the past performance of simple linear models at explaining reliable variance in consumer evaluations and predictive accuracy in consumer choice among alternatives, where the actual choice rule is inferred based on protocol analysis or process-tracing methods. As a result, the evidence for the sufficiency of linear models in predicting noncompensatory choice rules has been indirect. However, using numerical simulations, Johnson et al. (1989) demonstrate that noncompensatory choice models involving attribute thresholds such as "satisficing" (i.e., a conjunctive rule) are poorly fit using a linear-additive model. They examine the performance of linear models in contexts where the data are known to be generated by specific choice rules and find that even in an orthogonal attribute environment compensatory models fail to capture noncompensatory choice rules involving attribute thresholds.

We examine the comparative ability of neural networks and traditional linear-additive models at capturing three commonly used consumer decision rules. These include two noncompensatory decision rules and one compensatory decision rule that involve attribute thresholds. In particular, we test a variant of the satisficing rule (see Figure 1(a)) used by Johnson et al. (1989) that sets a lower bound threshold on all attribute values, and we test a "latitude of acceptance" rule (see Figure 1(b)) that sets both a lower threshold and an upper threshold on attribute values, mimicking the "ideal point" model described previously (Coombs and Avrunin 1977). Given the flexible nature of the neural network models, we expect to see better within-sample predictive accuracy and out-of-sample predictive accuracy from the neural network model than from discriminant analysis or logistic regression for both of the noncompensatory decision rules. We also test a compensatory rule that equally weights product attributes and judges the acceptability of an alternative based on the sum of its attribute values. In this case we expect to see little difference in performance between neural networks and the traditional methods, with all models predicting well, due to the compensatory nature of the rule.

The complexity of the decision rule was also varied to test for degradation in model performance. Each of the three decision rules was tested using either two or four product attributes. Unlike Johnson et al. (1989), we did not model choice among a set of alternatives. Rather, the rule was used to determine the "acceptability" of individual alternatives.

4.2. Method

We test the comparative ability of a neural network model, discriminant analysis, and logistic regression at capturing three riskless decision rules: satisficing (SAT), latitude of acceptance (LOA), and weighted-additive (WADD). The first two rules are selected because of prior evidence indicating the insufficiency of linear-additive models at representing these noncompensatory decision rules (Johnson et al. 1989). The weighted-additive rule implemented here also involves a threshold for classifying an alternative as either acceptable or unacceptable. However, unlike the satisficing and latitude of acceptance rules, it is compensatory in nature. This choice of decision rules allows for a clear demonstration of when the neural network model's performance will be comparable to traditional statistical methods, as well as when it will surpass them in performance (cf., Rumelhart et al. 1996).

The three decision rules are discussed as follows. Each of these rules is examined separately at two levels of decision complexity: two attributes and four attributes.

1. Satisficing (SAT). For an alternative to be classified as an acceptable choice based on the satisficing rule it must surpass the established threshold (250) on all


attributes. In the two-attribute environment, approximately 56 percent of the alternatives were deemed acceptable, and in the four-attribute environment, approximately 32 percent of the alternatives were acceptable, according to the satisficing rule.⁸

2. Latitude of Acceptance (LOA). For an alternative to be classified as an acceptable choice using a latitude of acceptance rule the value of each of its attributes must lie within the specified range (251 to 750). In the two-attribute environment, approximately 25 percent of the alternatives fit this criterion, and in the four-attribute environment, approximately 6 percent of the alternatives were deemed acceptable.

3. Weighted Additive (WADD). For an alternative to be classified as an acceptable choice using a weighted-additive rule the sum of its attribute values had to exceed a specified threshold (1,000 for two-attribute alternatives; 2,000 for four-attribute alternatives). Approximately 50 percent of the alternatives fit this criterion in both the two- and four-attribute environments.

Two data sets, each consisting of 500 choice alternatives, were generated using the random number generator in SAS. Separate data sets were generated for the two-attribute and four-attribute choice environments. Attribute values were drawn from a uniform distribution ranging from 0 to 1,000. For each of the 500 choice alternatives the three decision rules were applied to determine the "acceptability" according to the specific rule. This generated a binary output variable associated with the three decision rules for each of the 500 choice alternatives.
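This data-generating setup is straightforward to reproduce. A sketch in Python (the paper used SAS's random number generator, so individual draws differ, but the acceptance rates should match the percentages above in expectation):

```python
import numpy as np

rng = np.random.default_rng(1)
n_alts, n_attrs = 500, 2                            # 500 alternatives; 2 or 4 attributes
X = rng.uniform(0, 1000, size=(n_alts, n_attrs))    # attribute values on [0, 1000]

sat = (X > 250).all(axis=1)                   # SAT: every attribute exceeds 250
loa = ((X >= 251) & (X <= 750)).all(axis=1)   # LOA: every attribute within [251, 750]
wadd = X.sum(axis=1) > 500 * n_attrs          # WADD: sum exceeds 1,000 (or 2,000 for 4 attrs)
```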

4.3. The Neural Network Model

Two separate simulations of the decision rules were performed, varying in decision complexity (two or four attributes). Six separate neural networks were trained (one for each of the three decision rules and for each level of rule complexity). Symbolically, these inputs are denoted by x = (x₁, ..., x_I)′ where I = 2 or 4. An output variable, O, that the various models attempted to predict was associated with each decision rule. The

⁸This noncontext-dependent operationalization of a satisficing rule is similar to an elimination-by-aspects rule, which sets thresholds on individual attributes and sequentially eliminates alternatives that do not meet one or more of the attribute cut-offs (cf., Tversky 1972).

functional relationship between the input units and the neural output of the decision rule outcome was based on a logistic activation function

F(z) = 1 / (1 + exp(-(z - η))),    (1)

where z = x′w_j⁽¹⁾ and w_j⁽¹⁾ = (w_1j⁽¹⁾, ..., w_Ij⁽¹⁾)′ is the vector of weights (to be subsequently adjusted by learning) bridging from the original set of I input variables to the jth hidden unit. Thus, the calculated value of the jth hidden neural unit was:

H_j = 1 / (1 + exp(-(x′w_j⁽¹⁾ - η_j⁽¹⁾))),    j = 1, ..., J,    (2)

where η_j⁽¹⁾ is the threshold for the jth hidden unit (the intercept term in the logistic regression formulation for H_j), and H_j is the derived numerical value for the jth hidden neural unit that results from the vector of input variables x = (x₁, ..., x_I)′ and the current best estimate of the weight vector w_j⁽¹⁾. The final output was obtained by again applying the logistic activation to the weighted summation of the hidden layer neural values H = (H₁, ..., H_J)′, with the number of hidden neural units denoted as J. Thus,

O = 1 / (1 + exp(-(H′w⁽²⁾ - η⁽²⁾))),    (3)

where the weight vector is now given by w⁽²⁾ = (w₁⁽²⁾, ..., w_J⁽²⁾)′, which is applied to the hidden factors H = (H₁, ..., H_J)′. By the iterative feed-forward and back-propagation algorithm (see Appendix 1), the neural network learned to simultaneously select the weights w_j⁽¹⁾ and w⁽²⁾ and the threshold values η₁⁽¹⁾, ..., η_J⁽¹⁾ and η⁽²⁾ in such a manner as to minimize the prediction error in the training sample. Of course, a change in any single weight w_j⁽¹⁾ changes the value of H_j, so the weights between the hidden layer and the output also change.
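Read as code, Equations (2) and (3) amount to two nested applications of the logistic function. A minimal sketch (W1 holds the vectors w_j⁽¹⁾ as rows; all names are illustrative):

```python
import numpy as np

def network_output(x, W1, eta1, w2, eta2):
    """Forward pass mirroring Equations (2) and (3):
    H_j = F(x'w_j - eta_j) for each hidden unit, then O = F(H'w - eta)."""
    H = 1.0 / (1.0 + np.exp(-(W1 @ x - eta1)))       # hidden-layer values, Eq. (2)
    return 1.0 / (1.0 + np.exp(-(w2 @ H - eta2)))    # network output, Eq. (3)
```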

4.4. Comparative Analysis and Results

For each of the three decision rules and the two levels of model complexity the performance of the neural network was compared to the results obtained using discriminant analysis and logistic regression. The procedure described in the previous section was used to test


both the within- and out-of-sample predictive accuracy. For the neural networks, out-of-sample predictive accuracy was measured using the weights estimated from samples T1 (training) and T2 (validation) on T3 (testing). For logistic regression and discriminant analysis, out-of-sample predictive accuracy was measured using the weights estimated from the combined set T1 and T2 and applied to T3. Following the sample-reuse procedure described, 10 random partitionings of the set of choice alternatives were generated. All models were estimated using the first 400 observations and cross-validated on the last 100 observations. The same 10 partitionings were used for all six neural networks, as well as the logistic regressions and discriminant analyses.

The simulation results are summarized in Tables 1 and 2. The mean and range of predictive accuracy, as well as Type I and Type II error rates across all 10 random samples of the data, are reported. The performance of all noncompensatory models, with the exception of the four-attribute LOA model, is seen to be statistically different (p < .0001) and inferior to the neural network model, both in terms of explained variance (within-sample predictive accuracy) and, more importantly, in terms of forecasting (out-of-sample predictive accuracy).

The results for the four-attribute LOA rule were slightly different from expected. While the neural network exhibited far superior within-sample and out-of-sample accuracy to discriminant analysis (p < .0001), the performance of logistic regression was comparable to the neural network (p > .45).

A closer examination of the logistic regression's performance in predicting a four-attribute LOA rule, however, indicates that the model's strong predictive accuracy is due to the low base rate of acceptable alternatives (theoretically, within a given sample, only 6 percent of the alternatives would be deemed acceptable) rather than its ability to differentiate between acceptable and unacceptable alternatives. An examination of the out-of-sample Type I error rate (100 percent) indicates that the logistic regressions were unable to predict the few acceptable alternatives in the data set and categorized all of the alternatives as unacceptable.

Table 1 Comparative Results from Numerical Simulation

                                     Within-Sample Accuracy                    Out-of-Sample Accuracy
Analysis Method        Attributes    LOA         SAT         WADD             LOA         SAT         WADD
Discriminant analysis  2             55.2%**     87.8%**     99.2%*           51.8%**     87.7%**     97.4%*
                                     (53-57%)    (87-89%)    (99-100%)        (47-59%)    (83-90%)    (93-100%)
                       4             55.0%**     77.9%**     97.4%**          51.6%**     78.7%**     98.2%*
                                     (52-57%)    (77-79%)    (97-100%)        (45-56%)    (76-81%)    (96-100%)
Logistic regression    2             79.0%**     87.7%**     100%             77.0%**     87.8%**     100%
                                     (78-81%)    (87-89%)    (100%)           (71-82%)    (84-91%)    (100%)
                       4             92.4%**     82.0%**     100%*            91.6%       79.0%**     99.9%
                                     (92-93%)    (81-83%)    (100%)           (90-94%)    (74-85%)    (99-100%)
Neural network         2             97.2%       99.7%       99.6%            94.8%       99.5%       99.5%
                                     (86-100%)   (98-100%)   (98-100%)        (86-99%)    (98-100%)   (97-100%)
                       4             94.3%       94.8%       99.3%            90.7%       91.4%       99.3%
                                     (80-100%)   (92-100%)   (99-100%)        (87-96%)    (86-97%)    (98-100%)

LOA = Latitude of Acceptance; SAT = Satisficing; WADD = Weighted Additive.
For each of the three decision rules and methods of analysis the predictive accuracy was calculated across 10 random samples of the data. Cell values represent the mean and range of predictive accuracy for the 10 random samples. These averages were stable across the 10 samples. The standard deviations for the discriminant analysis models ranged from 0.004 to 0.036, with an average of 0.016. For the logistic regression models the standard deviation ranged from 0 to 0.034, with an average of 0.079. Similarly, the neural network models ranged from 0.01 to 0.05, with an average of 0.02. A series of paired t tests with df = 9 compared the performance of the neural network models across the 10 random samples with the performance of the traditional statistical methods. Significance of the t tests is represented as follows: *p < .01, **p < .0001.


Table 2 Comparison of Error Rates from Numerical Simulation

                                     Out-of-Sample Error Rates
                                     Latitude of Acceptance      Satisficing                 Weighted Additive
Analysis Method        Attributes    Type I (α)   Type II (β)    Type I (α)   Type II (β)    Type I (α)   Type II (β)
Discriminant analysis  2             44%          50%            10%          14%            3%           2%
                                     (31-50%)     (45-56%)       (4-18%)      (9-19%)        (0-9%)       (0-8%)
                       4             55%          48%            20%          22%            3%           1%
                                     (33-67%)     (42-54%)       (11-24%)     (17-25%)       (2-5%)       (0-4%)
Logistic regression    2             100%         0%             11%          14%            0%           0%
                                                                 (4-18%)      (7-18%)
                       4             100%         0%             50%          10%            0%           0%
                                                                 (36-65%)     (4-15%)
Neural network         2             10%          4%             1%           0%             0%           1%
                                     (0-24%)      (0-18%)        (0-4%)       (0-2%)                      (0-7%)
                       4             28%          8%             10%          8%             1%           1%
                                     (0-44%)      (2-21%)        (0-21%)      (4-14%)        (0-4%)       (0-4%)

For each of the three decision rules and methods of analysis the out-of-sample error rate was calculated across the 10 random samples. Cell values represent the mean and range of out-of-sample error. Type I error represents the percentage of acceptable alternatives categorized as unacceptable by the model, and Type II error represents the percentage of unacceptable alternatives categorized as acceptable by the model.

On the other hand, the neural networks were able to achieve the same aggregate level of out-of-sample predictive accuracy while balancing the Type I (28 percent) and Type II (8 percent) error rates.9 From a managerial perspective, the logistic regression's solution of rejecting all alternatives may not be feasible when attempting to find acceptable alternatives. Moreover, when assessing the performance of these models one needs to consider the cost associated with the two types of errors. For example, if the cost of a Type I error is $100 versus $1 for a Type II error, then the expected cost of using the logistic regression (per 100 alternatives evaluated, with error rates expressed in percent) would be ($100 × .06 × 100) + ($1 × .94 × 0) = $600, whereas the expected cost of using the neural network would be ($100 × .06 × 28) + ($1 × .94 × 8) = $175.52. As this result indicates, the high predictive accuracy rate of the logistic model in the four-attribute LOA decision rule context may be deceiving when both types of error are important.

9Because the logistic regression attempts to separate the space into two linearly separable regions, and it optimizes its parameters based on an examination of the aggregate sample, in situations such as that presented by the four-attribute LOA decision rule, where one group is much larger than the other, the model will achieve its highest accuracy by classifying all observations into the larger of the two groups (i.e., the logistic regression "gives up" trying to predict the acceptable alternatives and goes with the odds). By contrast, the neural network is able to learn the nonlinear structural relationships even with disparate group sizes.
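This expected-cost arithmetic is easy to script. Below is a minimal sketch (ours, not the authors' code); the function name and arguments are illustrative, and the $100/$1 costs, 6 percent base rate, and error rates come from the example above.

```python
def expected_cost(cost_t1, cost_t2, base_rate, t1_rate, t2_rate, n=100):
    """Expected misclassification cost for n alternatives.

    t1_rate: fraction of acceptable alternatives classified as unacceptable.
    t2_rate: fraction of unacceptable alternatives classified as acceptable.
    """
    return n * (cost_t1 * base_rate * t1_rate
                + cost_t2 * (1 - base_rate) * t2_rate)

# Four-attribute LOA scenario: 6% acceptable, $100 vs. $1 error costs.
print(expected_cost(100, 1, 0.06, 1.00, 0.00))  # logistic regression: 600.0
print(expected_cost(100, 1, 0.06, 0.28, 0.08))  # neural network: ~175.52
```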

Practically speaking, on average, the "best trained" neural network always outperformed both discriminant analysis and logistic regression in terms of both within- and out-of-sample predictive accuracy for the noncompensatory decision rules (see Table 1). As expected, there was little difference between the neural networks' performance and that of discriminant analysis or logistic regression when predicting data generated by a two-attribute or four-attribute weighted-additive decision rule. All three modeling procedures performed exceptionally well in capturing this compensatory decision rule. Thus, you cannot go wrong by using neural networks in linear settings and can gain substantially in nonlinear (or unknown) settings.


5. Study 2: An Empirical Examination of Consumer Patronage Behavior

An empirical study of the neural networks' ability to predict consumer choice (patronage behavior) was conducted to further substantiate the simulation results. Consumer perceptions of, and patronage behavior toward, three nationwide mass-merchandise retailers were examined (Kmart, Montgomery Ward, and Sears Roebuck). A sample of 800 members from a nationwide consumer mail panel was analyzed. Since panel members were predominantly female, half of the cover letters asked the panel member to fill out the questionnaire and the other half instructed the panel member to ask their spouse to respond, in order to balance for gender.

5.1. Survey Instrument
The "image" dimensions of products and services are of interest to consumer behavior researchers because of their important role in influencing patronage behavior and consumer choice. In this study, image perceptions of retail chain stores were assessed by having respondents rate each of three mass-merchandise retailers on 19 store characteristics, including price (low to high), cleanliness (dirty to clean), employee friendliness (unfriendly to friendly), etc. (see Table 3). These 19 store characteristics were selected based on a review of the literature on retail store image (Zimmer and Golden 1988). Each of the 19 store characteristics was assessed using a seven-point bipolar adjective scale (see Golden et al. 1992 for details on the questionnaire format).

Besides assessing consumers' attitudes and perceptions of the three retailers, a behavioral response measure of consumer patronage was elicited. Respondents were asked how frequently they shopped at each of the three stores: Kmart, Montgomery Ward, and Sears Roebuck. Response options ranged from 1 = never to 6 = at least once a week. The neural network technique described in this paper is capable of handling observations with missing data; however, the statistical methods against which the neural networks are being compared require complete data. The goal was to compare the neural network performance with that of other common statistical methods; therefore, only surveys having complete data were used in the analysis. After imposing this restriction, 371 of the 800 data records remained for Sears Roebuck, 294 of the 800 records for Kmart, and 235 of the 800 records for Montgomery Ward. These three data sets were used to build separate neural networks, and each was analyzed using the traditional statistical methods.

5.2. Comparative Analysis and Results
Each of the three ANN models used the chain-specific responses to the 19 store characteristic variables as inputs and patronage frequency as the output variable. For consistency with the simulation results, the scale of the behavioral response variable was converted from a 1-to-6 scale to a binary variable, with 0 representing an "infrequent shopper" and 1 representing a "frequent shopper," based on a median split of the data. Respondents who reported shopping frequencies of two or less on the scale were classified as "infrequent shoppers" and those who reported shopping frequencies greater than two were classified as "frequent shoppers."10

The same training and cross-validation procedures used in the numerical simulation were applied to the questionnaire data; that is, 60 percent of the sample was used to train the network, 20 percent was used to determine a stopping point for training, and 20 percent was used for out-of-sample testing of predictive accuracy. These sample sizes were 223, 74, and 74 for Sears Roebuck; 194, 50, and 50 for Kmart; and 141, 47, and 47 for Montgomery Ward.
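As a concrete illustration, here is a minimal sketch (ours, not the authors' code) of the median split and the 60/20/20 partition described above; the function and variable names are our own.

```python
# Illustrative data preparation: binary target via median split, then a
# 60/20/20 train/validation/test partition as described in the text.
import numpy as np

rng = np.random.default_rng(0)

def prepare(X, frequency):
    """X: (n, 19) store-characteristic ratings; frequency: 1-6 scale."""
    y = (frequency > 2).astype(int)         # 0 = infrequent, 1 = frequent
    idx = rng.permutation(len(y))
    n_train = int(0.6 * len(y))
    n_valid = int(0.2 * len(y))
    train = idx[:n_train]                   # fit the network weights
    valid = idx[n_train:n_train + n_valid]  # choose the stopping point
    test = idx[n_train + n_valid:]          # out-of-sample accuracy
    return (X[train], y[train]), (X[valid], y[valid]), (X[test], y[test])
```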

In addition to building neural network models to predict consumer patronage behavior, both discriminant analysis and logistic regression were tested as alternative statistical methods. Yet another statistical alternative was also tested: due to the large number of input variables, one might conceptualize consumer patronage as being based on a set of factors, or latent variables, that represent combinations of the 19 store characteristics.

10Neural networks are not limited to these types of binary classification problems. Continued efforts at testing the comparative performance of neural networks on classification tasks involving multilevel output variables are considered a fruitful area for future research.


Table 3    Retail Store Characteristics

| Store Characteristic | Kmart Mean | Kmart SD | Sears Mean | Sears SD | Montgomery Ward Mean | Montgomery Ward SD |
|---|---|---|---|---|---|---|
| Unfriendly/Friendly | 4.51 | 3.36 | 4.59 | 3.43 | 4.33 | 3.29 |
| Unhelpful/Helpful | 4.33 | 3.66 | 4.63 | 3.49 | 4.29 | 3.34 |
| Bad/Good Reputation | 4.90 | 3.08 | 5.54 | 2.37 | 4.72 | 3.23 |
| Low/High Caliber | 3.86 | 2.78 | 4.73 | 2.28 | 4.18 | 2.56 |
| Dislike/Like | 4.76 | 3.49 | 4.76 | 3.37 | 3.95 | 3.96 |
| Uncongested/Congested | 4.45 | 3.25 | 4.04 | 3.10 | 3.77 | 3.37 |
| Not Enjoyable/Enjoyable | 4.47 | 3.28 | 4.61 | 2.89 | 4.05 | 3.28 |
| Hard/Easy to Exchange | 5.31 | 3.18 | 5.43 | 2.94 | 4.87 | 3.34 |
| Bad/Good Value | 4.86 | 2.88 | 5.25 | 2.25 | 4.52 | 2.87 |
| Inconvenient/Convenient Location | 5.25 | 4.18 | 4.90 | 4.03 | 3.81 | 5.42 |
| Low/High Price | 2.90 | 2.90 | 4.50 | 2.38 | 4.23 | 2.53 |
| Dirty/Clean | 4.98 | 3.53 | 5.95 | 1.77 | 5.32 | 2.97 |
| Hard/Easy Credit | 4.77 | 4.32 | 5.48 | 3.89 | 5.31 | 3.42 |
| Cluttered/Spacious | 4.13 | 3.92 | 4.91 | 3.22 | 4.45 | 3.60 |
| Unpleasant/Pleasant Atmosphere | 4.68 | 3.50 | 5.23 | 2.80 | 4.70 | 3.39 |
| Low/High Quality Merchandise | 3.89 | 3.03 | 5.34 | 2.02 | 4.46 | 3.08 |
| Dull/Exciting | 3.88 | 2.74 | 4.08 | 2.46 | 3.60 | 2.81 |
| Bad/Good Selection | 4.69 | 2.80 | 5.03 | 2.53 | 4.24 | 3.29 |
| Unsophisticated/Sophisticated Customers | 3.29 | 2.79 | 4.47 | 2.33 | 4.00 | 2.59 |

Sample sizes: Kmart n = 294; Sears n = 371; Montgomery Ward n = 235. Each of the 19 store characteristics was assessed using a 7-point bipolar adjective scale. The values represent the mean and standard deviation of responses from a questionnaire administered to a nationwide consumer panel.

In this alternative approach, four latent variables were uncovered via factor analysis, and the resulting factor scores were then used as inputs for the analysis (see the sketch below). The results for all three retail chains are presented in Table 4. The same sample reuse procedure described earlier was used for all three models to reduce the effects of outliers and sample particularities.
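A minimal sketch of this factor-based alternative, assuming scikit-learn's FactorAnalysis and LogisticRegression as stand-ins for the (unspecified) factor-extraction and estimation routines the authors used:

```python
# Hypothetical sketch (not the authors' code): reduce the 19 image
# ratings to four latent factors, then use the factor scores as inputs
# to a logistic regression.
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LogisticRegression

def factor_logit(X_train, y_train, X_test, n_factors=4):
    fa = FactorAnalysis(n_components=n_factors)
    F_train = fa.fit_transform(X_train)   # (n, 4) factor scores
    F_test = fa.transform(X_test)
    clf = LogisticRegression().fit(F_train, y_train)
    return clf.predict(F_test)
```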

The results in Table 4 demonstrate that, once again, the neural network model exhibits a superior ability to learn the patterns relating consumer choice (in this case, patronage frequency) to product attributes (image dimensions). Consistent with the simulation results, the neural networks demonstrate significantly better (p < .05) out-of-sample predictive accuracy than the two traditional statistical methods for all three stores. Across all three store chains, the "best trained" neural network outperformed the best of all alternative statistical models tested (see Table 4).

It is well known that within-sample predictive accuracy represents a measure of explained variance and yields an upwardly biased estimate of future model performance. Hence, out-of-sample predictive accuracy is a more relevant measure of future performance. Nevertheless, in all but two instances (logistic regression for Kmart and Montgomery Ward), the average within-sample predictive accuracy of the neural networks exceeded that of the other models. Taken together, our results suggest that the neural network produced less within-sample "overfitting" than the logistic regression or discriminant analysis.

An informative byproduct of discriminant analysis and logistic regression is the β weights, which quantify the expected change in output (patronage frequency) that results from changes in the input variables (store characteristics).


Table 4    Comparative Results from Consumer Patronage Survey

| Analysis Method | Within: Kmart | Within: Montgomery Ward | Within: Sears Roebuck | Out: Kmart | Out: Montgomery Ward | Out: Sears Roebuck |
|---|---|---|---|---|---|---|
| Discriminant analysis | 78.04%** (74-81%) | 80.72%** (76-83%) | 71.26%** (67-75%) | 72.98%*** (69-78%) | 75.21%** (70-79%) | 64.17%** (58-68%) |
| Logistic regression | 80.78* (80-83) | 80.56* (76-85) | 72.13 (71-75) | 77.45* (74-83) | 68.45*** (65-75) | 64.39* (60-68) |
| Neural network | 82.08 (79-88) | 82.73 (80-86) | 73.74 (71-77) | 83.51 (77-95) | 84.58 (77-94) | 69.72 (58-75) |

"Within" denotes within-sample accuracy; "Out" denotes out-of-sample accuracy. For each retail chain and method of analysis, the predictive accuracy was calculated for 10 random samples of the data. Cell values represent the mean and range of predictive accuracy for the 10 random samples. These averages were stable across the 10 samples. The standard deviations for the discriminant analysis models ranged from 1.88 to 3.05, with an average of 2.53. For the logistic regression models the standard deviations ranged from 2.45 to 3.54, with an average of 2.88. Similarly, the neural network models ranged from 1.95 to 5.80, with an average of 3.87. A series of paired t-tests with df = 9 compared the performance of the neural network models across the 10 random samples with the performance of the traditional statistical methods. Significance of the t-tests is represented as follows: *p < .05, **p < .01, ***p < .0001.

For nonlinear techniques, such as neural networks, the change in output due to changes in input is not constant, as the relative weight assigned to each input can vary in both magnitude and sign across the range of an input variable. Changes in the output variable that result from changes in inputs can be estimated via the "sensitivity" or "elasticity" of output to input, defined by $\partial O/\partial x_i$, the partial derivative of the network output with respect to input $i$. For linear models this sensitivity measure, β, is constant; for the neural network models it is not. A closed-form solution for computing the sensitivity of each input $i$ for a neural network is presented in Appendix 2.

In our application, each input takes integer values that range from 1 to 7. Accordingly, the sensitivity of each input $i$ was measured by evaluating the partial derivative $\partial O/\partial x_i$ at each level of the 19 input variables while holding all other input variables fixed. The sensitivity for $x_i$ at response level $k$ is obtained by calculating the sensitivity for each respondent according to the formula given in Appendix 2, and then averaging across respondents.

A sampling of the results of the sensitivity calculations is presented in Table 5 for Sears Roebuck, Kmart, and Montgomery Ward. As can be seen from this table, there is consistency in algebraic sign for the sensitivity along each of the bipolar adjective scales for a given retailer, indicating that the networks learned the directionality from input to output that corresponds to the image variables. Moreover, the fact that some of the average sensitivity measures were relatively large while others were relatively small indicates that patronage frequency has differing sensitivity to changes in the perceived image variables under investigation. For example, patronage frequency is relatively sensitive to the different levels of some variables, such as "Dislike/Like" for all three stores (ranging from 0.03 to 15.16), but relatively insensitive to changes in other variables, such as "Hard/Easy Credit" (ranging from 0.06 to 1.32).

These findings have obvious managerial implications, since consumers' perceptions of the various store characteristics can be influenced by managerial actions. For some variables (e.g., Hard/Easy Credit), the financial costs needed to change consumers' perceptions might be large, and these results show that such a change may have little effect on consumer patronage; it should therefore not be a priority item for managerial action. Conversely, since ultimate shopping behavior is more strongly influenced by other variables, such as "Bad/Good Selection," managerial actions affecting the consumer's perception of the store's merchandise selection can better influence store patronage.


Table 5    Sensitivity of Output with Respect to a Selective Group of Image Dimensions

| Image Dimension | Store | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Level 6 | Level 7 |
|---|---|---|---|---|---|---|---|---|
| Low/High Price | Sears | 0.64 | -8.09 | -4.57 | -3.82 | -4.57 | -3.40 | -1.77 |
| | Kmart | -0.61 | -0.76 | -1.31 | -0.78 | -2.46 | -1.14 | -1.52 |
| | Montgomery Ward | -0.15 | -0.17 | -0.11 | -0.17 | -0.14 | -0.05 | -0.02 |
| Hard/Easy Credit | Sears | 0.16 | 0.70 | 0.53 | 0.80 | 0.98 | 0.56 | 0.54 |
| | Kmart | 0.44 | 0.66 | 0.68 | 0.55 | 0.35 | 0.49 | 0.25 |
| | Montgomery Ward | 1.11 | 0.06 | 0.32 | 1.28 | 0.65 | 1.32 | 0.85 |
| Cluttered/Spacious | Sears | -1.69 | -0.98 | -2.13 | -3.29 | -2.75 | -2.97 | -2.50 |
| | Kmart | -1.77 | -1.23 | -1.21 | -1.46 | -0.91 | -1.52 | -0.30 |
| | Montgomery Ward | 0.31 | 0.21 | 0.39 | 0.46 | 0.34 | 0.26 | 0.35 |
| Low/High Quality Merchandise | Sears | 0.00 | 3.29 | 2.45 | 2.72 | 2.52 | 3.46 | 2.13 |
| | Kmart | 0.08 | 0.83 | 0.89 | 0.71 | 0.59 | 0.28 | 0.37 |
| | Montgomery Ward | 0.83 | 0.81 | 0.68 | 1.12 | 1.03 | 1.06 | 1.14 |
| Bad/Good Selection | Sears | 0.03 | 1.15 | 1.89 | 1.97 | 2.68 | 3.28 | 1.32 |
| | Kmart | 1.60 | 3.48 | 2.96 | 1.39 | 2.26 | 1.24 | 1.05 |
| | Montgomery Ward | 2.96 | 1.18 | 1.62 | 2.66 | 2.30 | 2.04 | 4.41 |
| Low/High Caliber | Sears | 0.27 | 0.22 | 1.89 | 1.45 | 1.74 | 1.19 | 1.30 |
| | Kmart | 1.24 | 1.26 | 1.77 | 1.04 | 0.71 | 0.33 | 0.32 |
| | Montgomery Ward | 4.89 | 0.21 | 2.72 | 4.60 | 3.23 | 3.88 | 6.34 |
| Hard/Easy to Exchange | Sears | 0.23 | 0.32 | 0.31 | 0.61 | 0.48 | 0.61 | 0.47 |
| | Kmart | 11.99 | 2.88 | 10.48 | 11.89 | 8.19 | 5.13 | 2.65 |
| | Montgomery Ward | -0.17 | -0.16 | -0.18 | -0.23 | -0.10 | -0.16 | -0.15 |

The values above represent a sensitivity measure for a selection of store image characteristics across the seven-level bipolar scales. This measure is analogous to a parameter estimate in a statistical model; however, it is not averaged across the range of the input variable, so the relative weight can vary across the range of the input variable.


The nonconstancy of the sensitivity measure across individual attribute levels, together with knowledge of current consumer perceptions, allows the marketing manager to better estimate the impact of strategic decisions. For example, as indicated in Table 3, the average perception of the Kmart consumer on the Low/High Price variable is 2.9; if the manager could move this perception down to 1.9, then patronage frequency could be expected to increase by .76 (a first-difference illustration appears below). Other information is also available from Table 5. For example, we observed what appears to be a threshold model at work for Sears with respect to the variables "Low/High Quality Merchandise" and "Bad/Good Selection," with the threshold value at level 2. Once above the threshold, the sensitivity is fairly constant and positive; therefore, little is to be gained by wooing consumers whose perceptions are below the threshold. This information is not available from logistic regression or discriminant analysis because their constant sensitivity measure (β) is averaged across attribute levels.
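The .76 figure is simply a first-order (first-difference) approximation, the sensitivity times the change in the input. A two-line sketch (ours, not the authors' code), using the -0.76 level-2 value from Table 5 (Kmart, Low/High Price):

```python
# First-difference approximation: change in output ~ (dO/dx) * (change in x).
def predicted_change(sensitivity, delta_input):
    return sensitivity * delta_input

print(predicted_change(-0.76, 1.9 - 2.9))  # 0.76 increase in patronage output
```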

The sensitivity information from Table 5 can also have differential managerial value. For example, while


from Table 3 we see that the average consumer perceptions of Sears and Kmart shoppers on the "Hard/Easy to Exchange" variable are roughly equal (5.43 versus 5.31, respectively), from Table 5 we see that Sears has little to gain by changing consumer perceptions on exchange, while Kmart stands to gain significantly by focusing on changing this perception. In addition, from Table 5, on the variable "Low/High Quality Merchandise" there is less to be gained for Kmart by increasing the perception of quality merchandise than by increasing consumer perceptions of how easy it is to exchange merchandise. Instead of advertising its "new" high-quality merchandise, Kmart should attempt to convince consumers that its "new" exchange policy is an improvement. This will potentially pay off more in terms of patronage frequency than a quality-of-merchandise campaign. The same strategy is not optimal for Sears or Montgomery Ward, as witnessed by their sensitivity values on these variables.

6. Conclusions and Implications
Artificial neural network research has progressed along two distinct lines: Cognitive scientists use such models to further understand human cognition, while information scientists, statisticians, and other quantitative researchers examine the mathematical properties of neural network models to improve classification and prediction. This paper examines the mathematical properties of these models, which offer tremendous potential for predicting consumer decision processes when noncompensatory decision heuristics are used or product information is integrated in a nonlinear fashion. As outlined in this paper, the proper use of neural networks requires the researcher to make a series of important decisions in order to ensure accuracy in prediction. We present a set of implementation procedures for a researcher to follow when using neural networks to develop predictive models.

This paper demonstrates that neural network models consistently outperform statistical methods such as discriminant analysis and logistic regression when predicting the outcome of a known noncompensatory choice rule. Even when the underlying choice rule is unknown, the neural network exhibits better out-of-sample predictive accuracy than traditional statistical methods, due to the flexible nature of the model. This improvement in performance is accomplished through an iterative process whereby the model "learns" complex relationships between product attributes (image variables) and consumer choice (patronage frequency).

Neural network models differ from other statistical procedures (e.g., logistic regression, maximum likelihood factor analysis, discriminant analysis, etc.) in that the model does not presuppose any linear or causal relationship between the input variables and the output variable(s). Traditional statistical models are able to deal with nonlinear functional relationships by incorporating the appropriate exponential terms or multiplicative interactions, but the form of nonlinearity must be known a priori. Similarly, noncompensatory choice rules, such as satisficing and latitude of acceptance, can be modeled using traditional methods by estimating separate slopes for the regions above and below the attribute thresholds, but again the exact form of the nonlinearity must be known precisely. The benefit of the neural network is its robustness to model misspecification.

Tables 1 and 4 give guidance on two additional aspects of model robustness for the neural network models versus the statistical models: model overfitting and worst-case model performance. Model "overfitting" can be quantified by contrasting the within-sample accuracy (where overfitting to the training sample may be present) with the out-of-sample accuracy (which is an unbiased assessment of the model's performance). Model overfitting is illustrated graphically in Figure 4. In the absence of overfitting, the within- and out-of-sample accuracy would coincide (a point representing model performance in this two-dimensional space would fall on the 45° line). Distance below the 45° line is indicative of model overfitting. As illustrated in Figure 4, neural network models are on average more accurate and exhibit less overfitting than discriminant analysis and logistic regression. Overfitting is not a problem for the neural networks as implemented in this paper because of our use of the validation sample T2 to determine when to stop training (cf. Figure 3).

A second aspect of robustness concerns the question of "how far wrong can you go using a particular modeling procedure?" Once again, Tables 1 and 4 provide


Figure 4    Illustration of Model Overfitting

[Six scatterplots, one per simulated decision rule (Latitude of Acceptance Rule, Satisficing Rule, Weighted Additive Rule) and one per retail chain (K-Mart, Montgomery Ward, Sears Roebuck), each plotting out-of-sample accuracy (vertical axis, 0-100) against within-sample accuracy (horizontal axis, 0-100) for discriminant analysis, logistic regression, and the neural network.]

The graphs represent an illustration of model fit as well as overfitting. The closer a point lies to the upper right-hand corner of the graph, the better fitting the model. Distance below the 45° line represents model overfitting.

information that addresses this question for the neural networks as compared to discriminant analysis and logistic regression. In this case it is the range of out-of-sample predictive accuracy that provides information regarding the "worst-case accuracy." In summarizing the results from Tables 1 and 4, there are nine comparisons that can be made between discriminant analysis and neural networks. In all nine cases neural networks showed a less extreme "worst case" performance, on average having 14.8 percent better "worst case" accuracy. Similarly, examining logistic regression versus neural networks on the nine comparisons, neural networks showed less extreme "worst case" performance in two-thirds of the cases, with an average 10.8 percent better "worst case" accuracy. In the three instances where logistic regression showed less extreme "worst case" performance, it averaged only a 2.3 percent improvement. Thus, neural networks appear more robust, in general, with respect to "how far wrong you can go" using the procedures examined.

In addition, the neural network model has the potential for uncovering managerially relevant information. An examination of the input variable sensitivities, as presented in Appendix 2, may allow the practitioner to determine the relative influence of various input variables, as well as thresholds that may be used in determining consumer choice. However, interpretation of the interconnection weights in a neural network is not as simple as examining the parameters produced by a regression, which, under appropriate distributional assumptions, can be subjected to commonly used statistical significance tests. Because of the two-stage compositional character of the neural network, the interconnection weights are linked between the different


layers, creating interdependencies. While the sensitivity measures developed in this paper were designed to overcome this issue, if the focus of a marketing analysis requires the selection of "statistically significant" prediction variables, and if a parametric statistical model can be presupposed with confidence at the onset of modeling, then traditional statistical measures may be preferred. Otherwise, neural network models offer improved predictive accuracy when the nature of the consumer's decision rule is unknown. For neural network models, variable selection is discussed in § 3.

A potential drawback of the neural network approach is that its intrinsically nonlinear and nonconvex structure allows for the possibility that a given model has achieved a local rather than a global minimum in error rate. The sample reuse procedure described earlier was designed to address this shortcoming; however, while it increases the likelihood of finding a good model, it does not guarantee the discovery of the "best" possible model.11

Assuming the main goal of the marketing researcher is to predict behavior, the results presented here are very promising. When neural network performance is compared to the predictive results that would be obtained using traditional marketing models, without exception the neural network model exhibits equivalent or better out-of-sample predictive accuracy than any of the comparative statistical methods examined. This portends great usefulness for artificial neural network modeling in the prediction of consumer choice based on product attributes, and suggests potential applications of the neural network methodology to numerous other marketing problems.12

"1Another possible drawback concerning neural networks that is of- ten heard is that the model is a "black box" methodology. As per the discussion in ? 2, however, it is a well-defined adaptive gradient search procedure for parameter fitting in a complex nonlinear model, and not a "black box" at all.

12The authors express sincere appreciation to the University of Texas at Austin, Graduate School of Business, and to the University of Texas Research Institute, both of which provided financial support. We thank Maureen Carter, John Hansen, Jaeho Jang, Kishore Krshna, Utai Pitaktong, Scott Swan, Yuying Wang, Xiaohua Xia, and Li Zhou for their assistance in data analysis. Suggestions and feedback should be directed to Patricia M. West, Assistant Professor of Marketing, the University of Texas at Austin, Graduate School of Business, CBA 7.202, Austin, TX 78712.

Appendix 1

The Training Procedure and Algorithm
A summary of the feed-forward back-propagation algorithm for a neural network with a single output $O$ is presented below.

Let $w_{ij}^{(1)}$ be the interconnection weight between unit $j$ of the hidden layer and the input variable $x_i$, and let $w_j^{(2)}$ denote the interconnection weight between $H_j = F(\eta_j^{(1)} + x'w_j^{(1)})$, the $j$th neural unit in the hidden layer, and the output variable $O$. The threshold values $\eta_j^{(1)}$, $j = 1, \ldots, J$, and $\eta^{(2)}$ are defined for each layer. We designate the logistic function $1/(1 + \exp(-z))$ by $F(z)$. The back-propagation algorithm proceeds by taking the response pattern $(x_u, O_u)$ for choice alternative, or respondent, $u$ and incrementally updating the interconnection weights to reflect this pattern. This process is formalized as follows:

Step 1. Initialization: Set all weights $w_j^{(2)}$, $w_{ij}^{(1)}$, $\eta_j^{(1)}$, and $\eta^{(2)}$ equal to small random values.

Step 2. Feed forward: For each unit $j$ of the hidden layer, compute $H_j = F(\eta_j^{(1)} + x_u' w_j^{(1)})$, where $x_u$ is the vector of inputs corresponding to choice alternative, or respondent, $u$ from the input layer. Subsequently, compute the predicted output $\hat{O}_u = F(\eta^{(2)} + H'w^{(2)})$ at the output layer.

Step 3. Propagate backward: Using $O_u$ as the "target" output for the input pattern $x_u$, compute $\delta^{(2)} = F'(\eta^{(2)} + H'w^{(2)})(O_u - \hat{O}_u)$. At the hidden layer, calculate $\delta_j^{(1)} = F'(\eta_j^{(1)} + x_u' w_j^{(1)})\, w_j^{(2)} \delta^{(2)}$ for all $j$. (These computations are facilitated by the formula $F' = F\,(1 - F)$, which holds for the logistic activation function.)

Step 4. Update weights: First compute $\Delta w_j^{(2)} = \alpha\, \delta^{(2)} H_j$. Then, to update the connection weights between hidden unit $j$ and the output, use the formula $w_j^{(2),\mathrm{new}} = w_j^{(2),\mathrm{old}} + \Delta w_j^{(2)}$. Similarly, to update all the connection weights between the input layer and the hidden layer, compute $\Delta w_{ij}^{(1)} = \alpha\, \delta_j^{(1)} x_i$ and update the weights via the equation $w_{ij}^{(1),\mathrm{new}} = w_{ij}^{(1),\mathrm{old}} + \Delta w_{ij}^{(1)}$. The parameter $\alpha$ in the preceding expressions represents a learning rate.

Step 5. Repeat: Return to Step 2 and repeat the previously described process for the next respondent's data pattern. When all respondents have been analyzed in the above manner, start over again with the first respondent, again updating the weights as necessary to reduce the average disparity between $O_u$ and $\hat{O}_u$. The iterations cease according to the stopping rules outlined subsequently.
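The five steps above translate directly into code. Below is a minimal NumPy sketch (ours, not the authors' implementation); the learning rate, hidden-layer size, and epoch count are illustrative assumptions, and the threshold updates, which Appendix 1 leaves implicit, are treated as bias terms updated analogously to the weights.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, n_hidden=3, alpha=0.1, n_epochs=100, seed=0):
    """Per-pattern back-propagation as in Appendix 1.

    X: (n, I) inputs; y: (n,) binary targets O_u.
    Thresholds eta enter as additive bias terms (assumption).
    """
    rng = np.random.default_rng(seed)
    n, I = X.shape
    W1 = rng.normal(scale=0.1, size=(I, n_hidden))   # w_ij^(1)
    eta1 = rng.normal(scale=0.1, size=n_hidden)      # eta_j^(1)
    W2 = rng.normal(scale=0.1, size=n_hidden)        # w_j^(2)
    eta2 = rng.normal(scale=0.1)                     # eta^(2)

    for _ in range(n_epochs):
        for x_u, O_u in zip(X, y):
            # Step 2: feed forward.
            H = logistic(eta1 + x_u @ W1)
            O_hat = logistic(eta2 + H @ W2)
            # Step 3: propagate backward, using F' = F(1 - F).
            d2 = O_hat * (1 - O_hat) * (O_u - O_hat)   # delta^(2)
            d1 = H * (1 - H) * W2 * d2                 # delta_j^(1)
            # Step 4: update weights (and, by assumption, thresholds).
            W2 += alpha * d2 * H
            eta2 += alpha * d2
            W1 += alpha * np.outer(x_u, d1)
            eta1 += alpha * d1
    return W1, eta1, W2, eta2
```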

Appendix 2

The Sensitivity of a Single Network Output with Respect to an Input Variable
To calculate the sensitivity of the output $O$ with respect to a particular input variable $x_k$, we must determine the partial derivative $\partial O / \partial x_k$. This derivative is then evaluated at the observed value of $x_k$, holding the other variables $x_i$, $i \neq k$, fixed at their observed values. The results are averaged over the respondents in each response category for each variable to obtain the entries in Table 5.

According to Equation (2), we may represent the output $O$ in terms of the hidden-layer neurons $H_j$ as $O = F(\eta^{(2)} + H'w^{(2)})$, where $H = (H_1, \ldots, H_J)'$ with $H_j = F(\eta_j^{(1)} + x'w_j^{(1)})$ and the activation function given by $F(z) = 1/(1 + \exp(-z))$. Using the total derivative rule to take the partial derivatives and then applying the chain rule, we have

$$\frac{\partial O}{\partial x_i} = \sum_{j=1}^{J} \frac{\partial O}{\partial H_j}\,\frac{\partial H_j}{\partial x_i},$$

where

$$\frac{\partial O}{\partial H_j} = F'(\eta^{(2)} + H'w^{(2)})\, w_j^{(2)}$$

and

$$\frac{\partial H_j}{\partial x_i} = F'(\eta_j^{(1)} + x'w_j^{(1)})\, w_{ij}^{(1)}.$$

Thus,

$$\frac{\partial O}{\partial x_i} = \sum_{j=1}^{J} F'(\eta^{(2)} + H'w^{(2)})\, w_j^{(2)}\, F'(\eta_j^{(1)} + x'w_j^{(1)})\, w_{ij}^{(1)}.$$

Now, using $F'(z) = F(z)\,[1 - F(z)]$, the sensitivity can be succinctly written as

$$\frac{\partial O}{\partial x_i} = \sum_{j=1}^{J} w_j^{(2)} w_{ij}^{(1)}\, F(\eta^{(2)} + H'w^{(2)})\,[1 - F(\eta^{(2)} + H'w^{(2)})]\, F(\eta_j^{(1)} + x'w_j^{(1)})\,[1 - F(\eta_j^{(1)} + x'w_j^{(1)})].$$

After the training is terminated and the weights $w_j^{(2)}$, $w_{ij}^{(1)}$, $\eta_j^{(1)}$, and $\eta^{(2)}$ are obtained, the average sensitivity of the output $O$ with respect to input $x_i$, evaluated at input response category level $x_i = k$, is calculated via the formula

$$\frac{1}{N(k)} \sum_{u=1}^{N(k)} \frac{\partial O}{\partial x_i}\bigg|_{x = x_u},$$

where the summation is over the $N(k)$ respondents who answered response category $x_i = k$. These are the values appearing in Table 5.

References

Abbott, L.F. (1996), "Statistical Analysis of Neural Networks," in Paul Smolensky, Michael C. Mozer, and David Rumelhart (Eds.), Mathematical Perspectives on Neural Networks, Mahwah, NJ: Erlbaum.

Anderson, John R. (1976), Language, Memory, and Thought, Hillsdale, NJ: Erlbaum.

Anderson, Norman H. (1970), "Functional Measurement and Psychophysical Judgment," Psychological Review, 77, 153-170.

Anderson, Norman H. (1971), "Integration Theory and Attitude Change," Psychological Review, 78, 177-206.

Anderson, Norman H. (1991), Contributions to Information Integration Theory, Volume I: Cognition, Mahwah, NJ: Erlbaum.

Archer, Norman P. and Shouhong Wang (1993), "Application of the Back Propagation Neural Network Algorithm with Monotonicity Constraints for Two-Group Classification," Decision Sciences, 24, 60-75.

Brockett, Patrick L., William W. Cooper, Linda L. Golden, and Utai Pitaktong (1994), "A Neural Network Method for Obtaining an Early Warning of Insurer Insolvency," Journal of Risk and Insurance, 61, September, 402-424.

Chappell, Mark and Michael S. Humphreys (1994), "An Auto-Associative Neural Network for Sparse Representations: Analysis and Application of Models of Recognition and Cued Recall," Psychological Review, 101, 103-128.

Collins, A.M. and E.F. Loftus (1975), "A Spreading Activation Theory of Semantic Memory," Psychological Review, 82, 407-428.

Coombs, Clyde H. and George S. Avrunin (1977), "Single Peaked Functions and the Theory of Preference," Psychological Review, 84, 216-230.

Curram, Stephen P. and John Mingers (1994), "Neural Networks, Decision Tree Induction and Discriminant Analysis: An Empirical Comparison," Journal of the Operational Research Society, 45, 440-450.

Dawes, Robin and Bernard Corrigan (1974), "Linear Models in Decision-Making," Psychological Bulletin, 81, 95-106.

Dutta, S. and S. Shekhar (1988), "Bond Rating: A Non-Conservative Application of Neural Networks," Proceedings IEEE International Conference on Neural Networks, II, 443-450.

Eberhart, Russell C. and Roy W. Dobbins (1990), Neural Network PC Tools: A Practical Guide, New York: Academic Press.

Einhorn, Hillel J. (1970), "The Use of Nonlinear, Noncompensatory Models of Decision Making," Psychological Bulletin, 73, 3, 221-230.

Elman, J. and D. Zipser (1988), "Learning the Hidden Structure of Speech," Journal of the Acoustical Society of America, 83, 1615-1626.

Fishbein, Martin (1967), "A Behavior Theory Approach to the Relations Between Beliefs About an Object and the Attitude Toward That Object," in M. Fishbein (Ed.), Readings in Attitude Theory and Measurement, New York: Wiley, 389-399.

Fukushima, K. and S. Miyake (1984), "Neocognitron: A New Algorithm for Pattern Recognition Tolerant of Deformations and Shifts in Position," Pattern Recognition, 15, 455-469.

Funahashi, K. (1989), "On the Approximate Realization of Continuous Mappings by Neural Networks," Neural Networks, 2, 183-192.

Ganzach, Yoav (1995), "Nonlinear Models of Clinical Judgment: Meehl's Data Revisited," Psychological Bulletin, 118, 3, 422-429.

Golden, Linda L., Patrick L. Brockett, Gerald Albaum, and Juan Zatarain (1992), "The Golden Numerical Comparative Scale Format for Economical Multiobject/Multiattribute Comparison Questionnaires," Journal of Official Statistics, 8, 77-86.

Green, Paul and V. Srinivasan (1978), "Conjoint Measurement in Consumer Research: Issues and Outlook," Journal of Consumer Research, 5, September, 103-123.

Green, Paul and V. Srinivasan (1990), "Conjoint Analysis in Marketing: New Developments with Implications for Research and Practice," Journal of Marketing, 54, 3-19.

Hart, Anna (1992), "Using Neural Networks for Classification Tasks: Some Experiments on Datasets and Practical Advice," Journal of the Operational Research Society, 43, 215-226.

Hebb, D.O. (1949), The Organization of Behavior, New York: Wiley.

Hornik, K., M. Stinchcombe, and H. White (1990), "Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks," Neural Networks, 3, 551-560.

Johnson, Eric J. and Robert J. Meyer (1984), "Compensatory Choice Models of Noncompensatory Processes: The Effect of Varying Context," Journal of Consumer Research, 11, June, 528-541.

Johnson, Eric J., Robert J. Meyer, and Sanjoy Ghose (1989), "When Choice Models Fail: Compensatory Models in Negatively Correlated Environments," Journal of Marketing Research, 26, August, 255-270.

Keeney, Ralph L. and Howard Raiffa (1976), Decisions with Multiple Objectives: Preferences and Value Trade-Offs, New York: Wiley.

Lippmann, R.P. (1987), "An Introduction to Computing with Neural Nets," IEEE ASSP Magazine, 1, April, 4-22.

Lynch, John G. (1985), "Uniqueness Issues in the Decompositional Modeling of Multiattribute Overall Evaluations: An Information Integration Perspective," Journal of Marketing Research, 22, 1-19.

Malakooti, Behnam and Ying Q. Zhou (1994), "Feedforward Artificial Neural Networks for Solving Discrete Multiple Criteria Decision Making Problems," Management Science, 40, November, 1542-1561.

Menon, A., K. Mehrotra, C.K. Mohan, and S. Ranka (1996), "Characterization of a Class of Sigmoid Functions with Applications to Neural Networks," Neural Networks, 9, 5, 819-835.

McClelland, James L. (1986), "A Programmable Blackboard Model of Reading," in David Rumelhart and James McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Cambridge, MA: MIT Press.

McClelland, James L. and J.L. Elman (1986), "Interactive Processes in Speech Perception: The TRACE Model," in David Rumelhart and James McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Cambridge, MA: MIT Press.

McClelland, James L. and David E. Rumelhart (1986), "A Distributed Model of Human Learning and Memory," in David Rumelhart and James McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Cambridge, MA: MIT Press.

Payne, John W., James R. Bettman, and Eric J. Johnson (1993), The Adaptive Decision Maker, New York: Cambridge University Press.

Quillian, M.R. (1968), "Semantic Memory," in M. Minsky (Ed.), Semantic Information Processing, Cambridge, MA: MIT Press.

Rosenblatt, Frank (1959), "Two Theorems of Statistical Separability in the Perceptron," Proceedings of a Symposium on the Mechanization of Thought Processes, London: Her Majesty's Stationery Office, 421-456.

Rosenblatt, Frank (1961), Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Washington, DC: Spartan Books.

Rumelhart, David E., Richard Durbin, Richard Golden, and Yves Chauvin (1996), "Backpropagation: The Basic Theory," in Paul Smolensky, Michael C. Mozer, and David Rumelhart (Eds.), Mathematical Perspectives on Neural Networks, Mahwah, NJ: Erlbaum.

Rumelhart, David E., James L. McClelland, and the PDP Research Group (1986), Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Cambridge, MA: MIT Press.

Salchenberger, Linda M., E. Mine Cinar, and Nicholas A. Lash (1992), "Neural Networks: A New Tool for Predicting Thrift Failures," Decision Sciences, 23, July-August, 899-917.

Shastri, Lokendra and Venkat Ajjanagadde (1993), "From Simple Associations to Systematic Reasoning: A Connectionist Representation of Rules, Variables and Dynamic Bindings Using Temporal Synchrony," Behavioral and Brain Sciences, 16, 417-494.

Simon, Herbert A. (1996), The Sciences of the Artificial, 3rd ed., Cambridge, MA: MIT Press.

Smith, Murray (1996), Neural Networks for Statistical Modeling, Boston, MA: International Thomson Computer Press.

Surkan, A.J. and J.C. Singleton (1990), "Neural Networks for Bond Rating Improved by Multiple Hidden Layers," Proceedings IEEE International Conference on Neural Networks, II, 157-162.

Treigueiros, D. and R. Berry (1991), "The Application of Neural Networks Based Methods to the Extraction of Knowledge from Accounting Reports," Proceedings, 24th Annual Hawaii International Conference on System Sciences, IV, 137-146.

Tversky, Amos (1972), "Elimination by Aspects: A Theory of Choice," Psychological Review, 79, July, 281-299.

Wang, Jun (1994), "A Neural Network Approach to Modeling Fuzzy Preference Relations for Multiple Criteria Decision Making," Computers & Operations Research, 21, September, 991-1000.

Wang, Jun and Behnam Malakooti (1992), "A Feedforward Neural Network for Multiple Criteria Decision Making," Computers & Operations Research, 19, February, 151-167.

White, Halbert (1989), "Some Asymptotic Results for Learning in Single Hidden-Layer Feedforward Network Models," Journal of the American Statistical Association, 84, 1003-1013.

Wilkie, William L. and Edgar A. Pessemier (1973), "Issues in Marketing's Use of Multi-Attribute Attitude Models," Journal of Marketing Research, 10, 428-441.

Yoon, Youngohc, George Swales, and Thomas Margavio (1993), "A Comparison of Discriminant Analysis Versus Artificial Neural Networks," Journal of the Operational Research Society, 44, 51-60.

Zimmer, Mary R. and Linda L. Golden (1988), "Impressions of Retail Stores: A Content Analysis of Consumer Images," Journal of Retailing, 64, 3, 235-293.

This paper was received August 16, 1995, and has been with the authors 15 months for 2 revisions; processed by Gary L. Lilien.
