Appendix: Additional Thoughts and Technical Detail Concerning Data Priors and Model Selection by Prediction
Page 1:

Appendix: Additional Thoughts and Technical Detail Concerning Data Priors and Model Selection by Prediction

Page 2:

• There are several methods by which data priors might be incorporated into model selection for BMS and NML. We will describe an approach that leads to extensions to NML* and to prediction/cross-validation.

Page 3:

• Treat the prior probability of a given data outcome, D0i, as the probability that the outcome of a (virtual) prior study had been D0i.

• Bayesian analysis allows combination of successive results (posterior of first is prior for second).

• To a certain degree, our methods maintain consistency with this Bayesian property.

Either explicitly or implicitly, all the well-formed methods we have been able to formulate treat the prior probability of a given data outcome, D0i, as the probability that the outcome of a prior study had been D0i.

Page 4:

• To set the stage, consider first a case where we have a single prior data set with some assigned probability.

Page 5:

• Suppose we have current data D1 and prior data D2, but we believe there is only a probability of, say, 0.5 that this report of prior data is true.

• Bayesian analysis would say that with probability 0.5, p(Mi|D) = p(Mi|D1&D2), and with probability 0.5, p(Mi|D) = p(Mi|D1).

• So: P(Mi|D) = (0.5)p(Mi|D1&D2) + (0.5)p(Mi|D1)
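As a minimal numeric sketch of this mixture (the posterior values below are hypothetical placeholders, purely for illustration):

```python
# Minimal sketch of the 0.5/0.5 posterior mixture described above.
# The posterior values are hypothetical placeholders, not from any real analysis.

belief_prior_data_true = 0.5

posterior_given_D1_and_D2 = {"M1": 0.80, "M2": 0.20}   # p(Mi | D1 & D2)
posterior_given_D1_only   = {"M1": 0.60, "M2": 0.40}   # p(Mi | D1)

mixed_posterior = {
    m: belief_prior_data_true * posterior_given_D1_and_D2[m]
       + (1 - belief_prior_data_true) * posterior_given_D1_only[m]
    for m in posterior_given_D1_and_D2
}
print(mixed_posterior)  # {'M1': 0.7, 'M2': 0.3}
```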

Page 6:

• In this approach the posteriors are mixed in accord with our beliefs.

• There are two ways to generalize this idea:

• In one, the different possible examples of prior data are all independently possible, and each has a probability in (0,1)—the probabilities need not sum to 1.0.

Page 7:

• In this generalization, if we have a data prior p*(Dj), then every possible combination of these data possibilities will have an associated probability, e.g. p(Dk1, Dk2, Dk3) = p*(Dk1)p*(Dk2)p*(Dk3).

• We then calculate a posterior for each of these combinations with present data D1.

Page 8:

• The resultant posterior is the probabilistic sum of all the posteriors. Let each combination of prior data possibilities be denoted Ck with associated probability p(Ck). Then

• P(Mi|D*,D1) = Σk p(Ck) p(Mi|Ck&D1)

Page 9:

• This approach seems well formed in terms of traditional Bayesian analysis.

• It has one advantage in that it can handle the case of two actual prior studies, each having p = 1.0.

• Of course it is completely useless in practice given the number of possible prior data sets, and the explosion of all possible subsets of those.

Page 10:

• The second way to generalize is the one we pursue:

• Suppose we stipulate that the data prior represents examples of data outcomes, exactly one of which is true. Thus the data prior probabilities add to 1.0.

Page 11:

• If exactly one were true, then we need only consider each combination of D1 and D*k:

• P(Mi|D) = Σkp(D*k)p(Mi|D1&D*k)

• For model selection we sum these across the Mi in each model class:

• ΣiΣkp(D*k)p(Mi|D1&D*k)

• It is this approach and justification that we pursue in the following.
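A small sketch of this weighted mixture over prior-data alternatives, with hypothetical probabilities; the posterior table below stands in for whatever routine computes p(Mi|D1&D*k):

```python
# Sketch: P(Mi|D) = sum_k p(D*k) p(Mi | D1 & D*k), with the data-prior
# probabilities p(D*k) summing to 1. All numbers are hypothetical.

data_prior = {"D*1": 0.7, "D*2": 0.3}          # p(D*k), sums to 1

# p(Mi | D1 & D*k) for each prior-data alternative (placeholder values)
posterior_given = {
    "D*1": {"M1a": 0.5, "M1b": 0.2, "M2a": 0.3},
    "D*2": {"M1a": 0.1, "M1b": 0.3, "M2a": 0.6},
}

models = ["M1a", "M1b", "M2a"]
mixed = {m: sum(data_prior[k] * posterior_given[k][m] for k in data_prior)
         for m in models}

# For model selection, sum the mixed posteriors over the members of each class.
model_class = {"M1a": "M1", "M1b": "M1", "M2a": "M2"}
class_scores = {}
for m, p in mixed.items():
    class_scores[model_class[m]] = class_scores.get(model_class[m], 0.0) + p
print(class_scores)  # class M1 vs class M2
```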

Page 12:

• We use the matrix of all model classes.

• For Model Classes M = 1, 2, etc., index all their parameters by θM: (θ1, θ2, …, θi).

• Assume we have a parameter prior, P0(θM), in addition to a data prior, P0(Di).

[To maintain this consistency, and other desirable properties, it is critical that we start with the full matrix of all models, not just one model at a time. If we have Models M = J, K, etc., index all their parameters by θM: (θ1, θ2, …, θi). We will omit the M when referring to all the models together, and place J, K, etc. in the superscript when we want to restrict consideration to the parameters of a given class. Assume we have a parameter prior, P0(θ), in addition to a data prior, P0(Di).]

Page 13:

• The BMS* criterion for Model Class K (with parameters θK) is the joint probability of the observed data and all parameters θKi within Model K: p(Dobs, K) = Σi p(Dobs, θKi).

• Thus we are basing model selection on the prediction that Model K will hold and that the data to be observed will occur.

Page 14:

• Now we consider the data prior, P0(Di), as the probability that Di had already been observed, prior to the present study.

Page 15:

• Let D0i represent outcome Di of an identical (hypothetical, virtual) prior study. We note that:

p’(Dobs, K|D0i) = p’(Dobs, D0i, K)/p0(D0i)

Here p’ refers to the probability calculated on the basis of the original model with the specified parameter prior, but without the separately specified data prior.

[All the Bayesian methods we consider for incorporating data priors start by calculating a joint BMS score for each virtual prior data outcome, the present observed data outcome, and Model K: P(D0i, Dobs, K). This is a sum of the joint probabilities over the parameter values of Model K, within the vector of the 3-D matrix defined by D0i and Dobs.]

Page 16:

• This characterization contains the joint probability of the virtual prior study data outcome, the present study data outcome, and the parameters of all the models.

• Thus a graphic depiction of our extended method should use a 3-D joint matrix.

(not yet available as a picture)

Page 17:

• The 3-D matrix has one axis for the parameters of all models, another for the outcomes of the present study, and a third for the outcomes of the ‘virtual’ prior study:

• On one axis are the parameter values of the various models:
– θK1, θK2, θK3, …, θKn1, θJ1, θJ2, θJ3, …, θJn2

• On the second axis are the potential data outcomes of the present study:– D1, D2, D3, …..

• On the third axis are the (virtual) data outcomes of the prior study:– D01, D02, D03, ….

Page 18:

• A given entry in this joint matrix is obtained from the model and the parameter prior:

P(θi,D0j,Dk) = P0(θi)P(D0j|θi)P(Dk|θi)

This characterization assumes independence of the actual present study and the virtual prior study, an assumption that seems reasonable.
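One way to realize this 3-D joint matrix numerically is with a small discrete grid; the sketch below assumes discretized parameter values and discrete data outcomes, and all of the distributions are illustrative placeholders:

```python
import numpy as np

# Sketch of the 3-D joint matrix P(theta_i, D0_j, D_k) = P0(theta_i) P(D0_j|theta_i) P(D_k|theta_i),
# on a small discrete grid. All numbers below are illustrative placeholders.

n_theta, n_prior_outcomes, n_present_outcomes = 5, 4, 4

rng = np.random.default_rng(0)
p0_theta = rng.dirichlet(np.ones(n_theta))                              # P0(theta_i), sums to 1
p_D0_given_theta = rng.dirichlet(np.ones(n_prior_outcomes), n_theta)    # rows: P(D0_j | theta_i)
p_D_given_theta  = rng.dirichlet(np.ones(n_present_outcomes), n_theta)  # rows: P(D_k | theta_i)

# joint[i, j, k] = P0(theta_i) * P(D0_j | theta_i) * P(D_k | theta_i)
joint = np.einsum('i,ij,ik->ijk', p0_theta, p_D0_given_theta, p_D_given_theta)

assert np.isclose(joint.sum(), 1.0)  # a proper joint distribution
```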

Page 19:

• All the Bayesian methods we consider for incorporating data priors start by calculating a joint BMS score for each virtual prior data outcome, the present observed data outcome, and Model K:

P(D0i, Dobs, K)

• This is a sum of the joint probabilities over the parameter values of Model K, within the vector of the 3-D matrix defined by D0i and Dobs.

Page 20:

• We now want to weight this BMS joint score by the probabilities of the virtual prior data—i.e. the data prior—and sum across all the virtual prior outcomes. The justification for this probability mixing was given in the introduction to this section on data priors.

• The weights justified in the introduction lead to the following BMS criterion:

BMS*DP(K) = Σip’(Dobs, D0i, K)p0(D0i)

The upper-case K here indicates that two sums are being taken: one over the virtual prior data outcomes and one over the parameters within Model Class K.
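Continuing the discrete sketch of the 3-D joint matrix given earlier, the BMS*DP score for a class K could be computed roughly as follows; the class labels, observed-data index, and data prior are hypothetical inputs:

```python
import numpy as np

def bms_dp_score(joint, class_of_theta, K, obs_index, data_prior):
    """Sketch of BMS*_DP(K) = sum_i p'(Dobs, D0_i, K) p0(D0_i).

    joint[i, j, k]  : P(theta_i, D0_j, D_k) (3-D array as built earlier)
    class_of_theta  : class label for each theta index
    obs_index       : index of the observed present-study outcome
    data_prior[j]   : p0(D0_j), summing to 1
    """
    in_K = np.asarray(class_of_theta) == K
    # p'(Dobs, D0_j, K): sum over class-K parameters, for every virtual prior outcome j
    p_joint_K = joint[in_K][:, :, obs_index].sum(axis=0)
    return float(np.dot(data_prior, p_joint_K))

# Hypothetical usage with the arrays built above:
# class_of_theta = ["K", "K", "K", "J", "J"]
# data_prior = np.array([0.4, 0.3, 0.2, 0.1])
# score_K = bms_dp_score(joint, class_of_theta, "K", obs_index=2, data_prior=data_prior)
```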

Page 21:

In this formulation there are normalizing constants that could be altered to allow two other weight descriptions. Thus we have considered all three of the following weighting equations (the weights are in brackets):

Σi p’(Dobs, D0i, K) [p0(D0i)]                 (1)
Σi p’(Dobs, D0i, K) [p0(D0i)/pA(D0i)]         (2)
Σi p’(Dobs, D0i, K) [p0(D0i)/pA(Dobs, D0i)]   (3)

[Recall that A refers to the marginal data probability obtained using the original parameter prior.]

Page 22:

• Note that the weighting scheme is a matter of convenience: one can choose to assign the same weights whatever scheme is assumed. But mathematical equivalence is not conceptual equivalence. When we know something about the data (a prior), we have to know how to specify this knowledge; that is, the interpretation of the weights changes with the underlying assumptions, so when translating prior knowledge into weights, one must do so in a way consistent with those assumptions.

Page 23:

BMS*DP(K) = Σip’(Dobs, D0i, K)p0(D0i),

• For this weighting scheme, we can interpret the weight as the probability that, of all outcomes that could have occurred in our prior (virtual) study, this was the one that did occur.

Page 24:

• For this method, equal weights give a constant times the BMS* criterion and score that would have been obtained with normal BMS, without a data prior, so relative model selection remains unchanged.

• This formulation also has the nice property that adding additional models to the set under consideration will not change the relative BMS preference among the initial models.

Page 25:

• We must keep our justification clearly in mind. Suppose we had an actual prior replication. One could list the outcome as a data prior with p = 1.0. If we did so, then our favored formula would combine the two results in the usual Bayesian manner. But this is a special case.

• In general, given two or more actual prior studies, the results would have to be combined before using the formula.

Page 26:

• The method we propose is of course not aimed at actual prior replications. In general our prior data knowledge comes not from replications, but from relevant sources of all kinds, including vaguely similar studies, conceptual thinking, vetted theories, and so on.

Page 27:

• Our formulation applies to equal or unequal model class probabilities.

• To extend to NML, it is convenient to restate the criterion in terms of means.

Page 28:

• In equation form, the BMS data-prior criterion stated in terms of means is:

Σi [μK(Dobs,D0i,K)/ΣhΣjμK(Dj,D0h,K)]P0(K)P0(D0i)

Here i (and h) index the prior data outcomes and j indexes the present data outcomes. Hence the term in brackets assumes a particular prior data outcome D0i, and is the mean for the present observed data and Model K and prior data outcome D0i divided by the sum of means for Model K across all present data outcomes and prior data outcomes. This term is multiplied by the prior probability of Model K. Finally we weight such terms by the probability of the prior data outcome and sum.
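A sketch of this mean-based criterion, again working from a discrete 3-D joint array as above; the class labels, observed-data index, model prior P0(K), and data prior are placeholder inputs:

```python
import numpy as np

def mean_based_criterion(joint, class_of_theta, K, obs_index, data_prior, p0_K):
    """Sketch of: sum_i [ mu_K(Dobs, D0_i) / sum_h sum_j mu_K(D_j, D0_h) ] P0(K) p0(D0_i),
    where mu_K is the mean of joint[theta, D0, D] over the parameters of class K."""
    in_K = np.asarray(class_of_theta) == K
    mu_K = joint[in_K].mean(axis=0)    # shape: (n_prior_outcomes, n_present_outcomes)
    denom = mu_K.sum()                 # sum of means over all present and prior outcomes
    return float(np.sum(data_prior * mu_K[:, obs_index] / denom) * p0_K)
```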

Page 29:

NML with Data Priors

We previously characterized BMS and NML in terms of differences between means and maxima. Thus to obtain an NML data-prior model selection criterion, it is natural to replace the means in this data-prior BMS model selection equation with maxima.

Page 30:

• The NML result becomes:

Σi [MAXK(Dobs,D0i,K)/ΣjΣhMAXK(Dj,D0h,K)]w0(K)P0(D0i)

In words, one calculates an NML-like score for each Model K by taking the max within the vector defined by Dobs and D0i and then dividing by the sum of such maxima for Model K across all present and past data outcomes. This ratio is multiplied by a model weight w0(K), and the result is weighted by the data prior P0(D0i) and summed over prior data outcomes.

The weight w0(K) could reasonably be set to equal the sum of the prior probabilities, or weights, over the Model K class.
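In a discrete sketch, the NML-style variant amounts to replacing the mean over each class's parameters with a maximum and using a model weight w0(K); for instance:

```python
import numpy as np

def max_based_criterion(joint, class_of_theta, K, obs_index, data_prior, w0_K):
    """Sketch of: sum_i [ MAX_K(Dobs, D0_i) / sum_j sum_h MAX_K(D_j, D0_h) ] w0(K) p0(D0_i)."""
    in_K = np.asarray(class_of_theta) == K
    max_K = joint[in_K].max(axis=0)    # max over class-K parameters, per (D0, D) cell
    return float(np.sum(data_prior * max_K[:, obs_index] / max_K.sum()) * w0_K)
```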

Page 31:

• For the purposes of exposition we have up to this point been too restrictive in the types of data priors allowed. We now wish to generalize, in an entirely natural way that does not alter the approach.

Page 32:

We have characterized data priors (as in the equation below) in terms of alternative outcomes of the present study.

BMS*DP(K) = Σip’(Dobs, D0i, K)p0(D0i),

However, the data prior could be based on a different virtual study:

E.g. Prior study: Accuracy; Present study: Response Time

Simpler variants are of course also possible, such as choosing a data prior for a study with a smaller number of observations than the present study.

Page 33:

BMS*DP(K) = Σip’(Dobs, D0i, K)p0(D0i)

There is another important proviso, and generalization:

The prior (virtual) study and the present study could be based on the same model class, but with different parameters.

In general we want the parameters in the two cases to be similar, but not identical.

In effect we have to have a prior for the covariance of the parameters.

Page 34:

• Thus we need a joint prior on the parameters of the two (or more) studies:

P0(θobs,g,θ0,h)

For a given prior data outcome, the joint probability of D0i and Dobs becomes

Page 35:

• p’(Dobs, D0i, K) = Σg Σh P0(θobs,g, θ0,h) p(Dobs|θobs,g) p(D0i|θ0,h)

We then apply the previously specified equation for data priors, using this joint probability. [As usual these joint probabilities are calculated on the basis of all the model classes under consideration].

Page 36:

• How do we specify P0(θobs,g,θ0,h)?

• It is usual to do so using a hierarchical structure:

P0(θobs,g, θ0,h) = Ση P0(θobs,g|η) P0(θ0,h|η) P0(η)

• Here η specifies the joint distribution over θobs and θ0. For example, this might be a Gaussian distribution with some parameterized variance/covariance structure.
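A discretized sketch of this construction, assuming small illustrative grids for θobs, θ0, and η; the specific grids, priors, and likelihood values are placeholders:

```python
import numpy as np

# Sketch: P0(theta_obs_g, theta_0_h) = sum_eta P0(theta_obs_g|eta) P0(theta_0_h|eta) P0(eta),
# then p'(Dobs, D0_i, K) = sum_g sum_h P0(theta_obs_g, theta_0_h) p(Dobs|theta_obs_g) p(D0_i|theta_0_h).
# Grids and distributions below are illustrative placeholders.

rng = np.random.default_rng(1)
n_g, n_h, n_eta = 6, 6, 3

p0_eta = rng.dirichlet(np.ones(n_eta))                  # P0(eta)
p_obs_given_eta = rng.dirichlet(np.ones(n_g), n_eta)    # rows: P0(theta_obs_g | eta)
p_0_given_eta   = rng.dirichlet(np.ones(n_h), n_eta)    # rows: P0(theta_0_h | eta)

# Joint parameter prior, marginalizing over eta
joint_param_prior = np.einsum('e,eg,eh->gh', p0_eta, p_obs_given_eta, p_0_given_eta)

# Placeholder likelihoods for the observed present data and one virtual prior outcome D0_i
lik_Dobs_given_theta_obs = rng.random(n_g)
lik_D0i_given_theta_0    = rng.random(n_h)

# p'(Dobs, D0_i) under this joint parameter prior (restriction to class K omitted for brevity)
p_joint = np.einsum('gh,g,h->', joint_param_prior,
                    lik_Dobs_given_theta_obs, lik_D0i_given_theta_0)
```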

Page 37:

• Given that we are inventing a data prior from a whole set of partially relevant and often vague considerations, it is not entirely clear whether any of this new machinery is necessary. It may be sufficient to simply provide a best guess for the outcome of the present study.

Page 38:

• This approach for BMS can be carried over to NML, but details are omitted.

Page 39:

• Thus we have a data prior method for model selection.

• But what about computation? Is the method usable?

• Without data priors (or ‘flat’ ones) standard sampling methods suffice for BMS, and BMS scores could be considered an approximation to NML.

[Even without a separately specified data prior, computing an NML score is far more difficult than computing a BMS score, and typically not feasible computationally (because the max of distributions is much harder to obtain by sampling approaches than are means).

We therefore suggested it might be best to compute a BMS score (and treat the result as an approximation to NML, if we prefer).]

Page 40:

• The need to specify a data prior introduces an additional issue of computational complexity, for both BMS and NML: The data outcome space is typically extremely complex and high dimensional (much more so than the parameter space; after all, the models are aimed at simplifying the data space).

• Thus it may be very difficult to specify the data outcome prior, and almost impossible to sum weighted model selection scores across it.

Page 41:

• We believe that we may be able to approximate the computation by using a reasonably sized sample of data outcomes, each with an appropriately assigned probability.

• We will be exploring this hypothesis with simulations in the near future.

Page 42:

• In summary:

• Using the joint probability characterization outlined in Part I, and treating data priors in terms of probabilistic outcomes of a virtual prior study, we can develop a data prior approach to model selection, one that can be stated in both BMS and NML terms.

Page 43:

Research in Progress: Comments, criticisms, and suggestions are welcome.

Page 44:

Extensions to Cross Validation Methods

[This section is in a more preliminary phase of construction than the prior sections]

Page 45:

Extensions to Predictive Validation and Cross Validation (CV)

We want to start by saying that CV methods also need to take priors into account: what we know determines inference in all methods.

Page 46:

• The third major class of model selection methods involves prediction and cross validation. Such methods include many cross-validation variants, prequential prediction methods that include accumulated prediction error, and certain bootstrapped simulation versions of these methods.

Page 47:

• For example, one might prefer the model that, fit to one half of a data set, best predicts the other half (split-half cross-validation). Or one might predict a single observation from the fit to all the others, and do this over and over (leave-one-out cross validation). Or one might sequentially use a model fit to trials 1 to n to predict trial n+1, accumulating all the prediction error and favoring the model that has the lowest APE (prequential analysis).
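As a concrete illustration of the prequential idea, here is a minimal sketch of accumulated prediction error for coin-flip data, using a log-loss measure and two illustrative predictors (a fixed fair coin and a Laplace-smoothed bias estimate); these choices are assumptions for the sketch, not the specific models discussed here:

```python
import numpy as np

def accumulated_prediction_error(data, predict_next):
    """Prequential APE sketch: fit to trials 1..n, predict trial n+1, accumulate -log p."""
    ape = 0.0
    for n in range(len(data)):
        p_heads = predict_next(data[:n])             # prediction from the first n trials
        p_obs = p_heads if data[n] == 1 else 1.0 - p_heads
        ape += -np.log(p_obs)
    return ape

# Two illustrative "model classes" for coin data (1 = heads, 0 = tails):
fair_coin   = lambda history: 0.5                                          # fixed p = 0.5
biased_coin = lambda history: (np.sum(history) + 1) / (len(history) + 2)   # Laplace-smoothed estimate

data = np.array([1, 0, 1, 1, 1, 0, 1, 1])   # hypothetical observations
print(accumulated_prediction_error(data, fair_coin),
      accumulated_prediction_error(data, biased_coin))
```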

Page 48:

• We now want to consider the relation of these approaches to our proposed BMS and NML data-prior-based methods, and propose ways to incorporate data and parameter priors into these methods.

Page 49:

• One might think the prediction methods eliminate the need to introduce parameter priors and data priors, but this thinking represents a serious conceptual error.

• To take just one simple example among an infinite set of examples, if we know before a study that Model A is one hundred times as probable as Model B, but analysis of the study shows Model B cross-validates slightly better than A, it would not be sensible to prefer Model B.

Page 50:

• To take another simple example, suppose we are testing a fair coin vs a biased coin. We divide the trials in half and see how each model fit to one half predicts the other half. If the observed data shows a preponderance of tails, the bias model will do best at cross validation (the excess tails tend to occur in both halves of the split sample). But suppose I have prior knowledge that tells me that if a coin is biased it has a bias towards heads. Now the bias model should do less well, and the fair coin model should do much better.

Page 51:

• The characterization of our BMS and NML methods that specifies the criteria in terms of prediction, and treats the data prior as the outcome of a prior virtual study, suggests a connection to the prediction/cross validation approaches.

• There are many variants of prediction/cross-validation approaches. To take a concrete example and set the stage, consider just one: split-half cross validation (SHCV): One partitions the observed data into halves: a training set (D1) and a test or validation set (D2). The models are fit to D1, and the resultant fitted models are used to predict D2, the best predicting model being preferred.

Page 52:

• Before turning to issues such as the incorporation of priors into CV/Prediction, we want to make a few general remarks about SHCV and why it ‘works’ and ‘doesn’t work’.

• Typically, we fit D1, and the single best fitting model within each class is used to predict D2.

Page 53:

• Suppose there is one fixed parameter value in each model class (nothing to estimate from D1). Then CV methodology would prefer the model that better fits D2. This makes little sense; it surely would be better to base the model selection decision on the fit to all the data together.

• Suppose there are multiple models (parameter values) in each class, but that D1 happens to equal D2. The best fit to D1 will then be the best fit to D2. Again the decision will be based only on the best fit to D2, and this decision will favor the more complex model.

Page 54:

• In general, of course, D1 and D2 differ due to noise. The more complex model class will tend to predict the noise in D1 better, so its best fit model will not tend to be best for D2, since D2 has independent noise. This instantiates a form of Occam’s Razor.

• Does SHCV penalize complexity sufficiently? Even technically, the answer depends: If the parameter space is fairly flat, then the best fit to D1 may be far in parameter space from the best fit to D2, but because the space is fairly flat, the difference in fit might be modest. Conversely, if the parameter space is sharply tuned, the best fits to D1 and D2 might produce similar parameters, and again the difference in D2 fit might be modest.

Page 55:

• Consider the case of nested models. If the best fit to D1 happens to be the same for these two models, then the two models will be judged equivalent. E.g. Model A says p(success) is flat in (0.0, 0.1); Model B says p(s) is flat in (0.0, 1.0). D1 has, say, 1 success, 19 failures. The best fit model is the same for the two classes, so the two models are judged equal (BMS and NML both favor the simpler model in such a case).

Page 56:

• These possibilities do not rule out SHCV as a reasonable model selection system, but point out some potential limitations (there are others as well).

• Experts often argue that CV is designed to choose the model class that best predicts future data. Certainly there are cases where various forms of CV will outperform BMS and NML in terms of prediction (see, for a related example, the later discussion of ‘catching up’), but this is not guaranteed, and in many instances BMS or NML will do better.

Page 57:

• These various considerations aside, we consider the failure of CV/Prediction to take prior knowledge into account to be a serious deficiency, so now return to this issue.

• We will start by considering Bayesian versions of split-half cross validation, in which we use D1 to obtain a distribution of probabilities for the different parameters, and use that distribution as a prior for D2.

Page 58:

• Regular CV uses D1 to make inferences within a model class, but not between classes. If we do the same with a Bayesian approach, then the final model selection preference will differ depending on whether the data is considered successively, in two parts, or all at once.

• However, we note that the approach we have been proposing for data priors allows us to deal with a division of data into D1 and D2 in a way that maps onto the data-prior BMS and NML methods we have proposed.

Page 59:

• In our data-prior method, let us consider D1 the prior with probability 1.0. The BMS*DP criterion we proposed was:

BMS*DP(K) = Σip’(Dobs, D0i, K)p0(D0i)

If p0(D0i) becomes p(D1) = 1.0, then:

BMS*DP(K) = p’(D2, D1,K)(1.0) = p’(D, K)

• This is just the usual BMS* criterion for all the data together, without a data prior.

Page 60:

• Giving all prior weight to D1 similarly allows us to map directly onto our previous data-prior NML method.

• These mappings let us see that our data-prior BMS and NML methods apply to divisions of data into a training set and test set.

• However, it is critical to note that these methods differ conceptually and quantitatively from the cross-validation approach.

Page 61:

• The CV approach is based on a method termed ‘posterior predictive’ and treats the models separately. As we show below, in this method the answer differs depending on whether one treats the data as one group or two successive groups.

• In normal Bayesian analysis D1 causes a redistribution of beliefs about the parameters within class, but also changes the beliefs about the two classes. The posterior predictive method ignores the changes in beliefs about the classes (due to analysis of D1), and analyzes D2 using only the within-class adjustments for each model class. Essentially, the method does not care about our beliefs about the model classes because it evaluates models solely on their prediction performance after these adjustments. Thus, we can say that it is only partially Bayesian.

Page 62:

• Let us return to an earlier example to illustrate: Suppose M1 says the probability of heads p(h) is either .00, .5, or 1.0, with equal priors. M2 says p(h) is .5, and the models are assumed equally probable to start with. We observe two heads. The usual Bayesian analysis (prior predictive) gives the same answer whether we consider the heads one at a time or together: The degrees of belief at the end for M1 and p(h) = .00, .5, and 1.0 are respectively 0/8, 1/8, and 4/8, and for M2 and p(h) = .5 the degree of belief is 3/8. Hence M1 is favored 5/8 to 3/8, or 5:3.
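The prior predictive numbers in this example can be checked directly; a small sketch using exactly the priors stated above:

```python
# Check of the prior predictive calculation: M1 allows p(h) in {0, .5, 1} with equal
# within-class priors, M2 fixes p(h) = .5, model priors are equal, and we observe two heads.

model_prior = {"M1": 0.5, "M2": 0.5}
param_prior = {"M1": {0.0: 1/3, 0.5: 1/3, 1.0: 1/3}, "M2": {0.5: 1.0}}

def joint(model, n_heads=2):
    # joint probability of the model class and the observed data (n_heads heads in a row)
    return sum(model_prior[model] * w * p ** n_heads
               for p, w in param_prior[model].items())

j1, j2 = joint("M1"), joint("M2")
total = j1 + j2
print(j1 / total, j2 / total)   # 0.625 and 0.375, i.e. 5/8 vs 3/8, a 5:3 preference for M1
```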

Page 63:

• But now consider the posterior predictive method. We analyze D1 (one head) with M1, obtaining a posterior over p(h) = 0.0, 0.5, and 1.0 of 0, 1/3, and 2/3 respectively. The belief in p = 0.5 for M2 remains at 1.0 (it is the only choice).

• We use these posteriors as priors for D2, ignoring the probabilities of the two model classes, obtaining a joint probability of M1 and D2 of (1/2)(1/5 + 4/5) and of M2 of (1/2)(1/2). The result favors M1 over M2 by 2:1. This differs from normal Bayesian analysis which gave an answer favoring M1 over M2 of 5:3.

• Why the different answers? The prior predictive method took into account the changes in beliefs about the model classes caused by observation of D1. The posterior predictive method did not.

Page 64:

• One might ask why one would want to use the posterior predictive approach, given it ignores the seemingly obvious changes in model probabilities due to the training set of data.

• One answer may lie in a difference in the goals of model selection. The posterior predictive approach is intended to have a practical goal. By basing model evaluation solely on each model’s prediction performance after calibration, it tries to capture what matters in practice. This approach is common in engineering. Suppose we had alternative models for visual detection of tanks in satellite photos—we would probably want to validate the models by some form of standard CV, because good prediction is the goal. Of course having a desirable goal is not the same as succeeding in that goal. Whether CV does better in practice than BMS or NML is an empirical question.

Page 65:

• Even within Bayesian analysis (using the entire posterior distribution rather than a maximum parameter) there are many cases where prediction of future data is superior using forms of cross validation.

• Bayesians might be surprised by this result, since regular analysis uses all the observed data, and presumably assigns optimal degrees of belief to the various model classes and models within class.

Page 66:

• ‘Presumably’ is a key condition here: BMS is a reasonably consistent formal system based on well formed probability assumptions and derivations, but nonetheless is a system that requires validation.

• There are a few possibilities for validation. One can generate simulated data and assess the degree to which BMS (or other systems of model selection) chooses the model that generated the data set. Such methodology is represented by the PBCM methodology proposed by Wagenmakers, E.-J., Ratcliff, R., Gomez, P., & Iverson, G. J. (2004): Assessing model mimicry using the parametric bootstrap. Journal of Mathematical Psychology, 48, 28-50. This theoretical validation approach is not without problems, however. For example, sometimes the model that did not generate a data set does a better job at predicting future data. E.g. in Cohen, Sanborn, and Shiffrin (PB&R, 2008), we gave examples where small data samples caused errors and distortions in parameter estimation that were larger for the model generating a data set than for the alternative model. As a result, when predicting the distribution of data produced by the true generating model, using the estimated parameters, the alternative model predicts better in a significant number of cases.

Page 67:

• It is also possible to validate model selection systems empirically, by assessing the degree to which model selection based on a sample of real data predicts additional real data in the same situation.

• There are many cases, surprisingly, where Bayesian analysis prefers a simpler model but a more complex alternative model does a better job at predicting additional data.

Page 68:

• A relatively new paper by Peter Grunwald and colleagues deals with this issue:

Catching Up Faster by Switching Sooner: A Prequential Solution to the AIC-BIC Dilemma

Tim van Erven, Peter Grunwald, Steven de Rooij
Centrum voor Wiskunde en Informatica (CWI)
Kruislaan 413, P.O. Box 94079, 1090 GB Amsterdam, The Netherlands

{Tim.van.Erven,Peter.Grunwald,Steven.de.Rooij}@cwi.nl

Page 69:

• In ideal Bayesian prediction, one would take the posterior across all the parameters in all model classes being compared or considered, at a given moment in time, make a prediction based on each, weight these by the posterior of each, and (depending on what is being predicted) sum.

• This is usually called Bayesian Model Averaging (or BMA).
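A minimal sketch of BMA prediction over a discrete set of parameter values spanning the model classes; the posterior weights and per-parameter predictions below are placeholders:

```python
import numpy as np

# Sketch of Bayesian Model Averaging: weight each parameter value's prediction by its
# posterior probability (taken across all model classes) and sum. Values are placeholders.

posterior = np.array([0.05, 0.10, 0.35, 0.30, 0.20])     # posterior over all parameters, all classes
pred_next_heads = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # each parameter's prediction for the next trial

bma_prediction = float(np.dot(posterior, pred_next_heads))
print(bma_prediction)
```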

Page 70:

• There is a Bayesian preference for simpler models that is of course reflected in the degree of posterior belief in the simpler model. Grunwald and friends point out that in many situations the following happens:
– For small amounts of data, the simpler model has more posterior belief and predicts better.
– For large amounts of data, the more complex model has more posterior belief and predicts better.
– However, for intermediate amounts of data, the simpler model has more posterior belief but the more complex model predicts better.

Page 71:

• The result is that Bayesian Model Averaging in these intermediate regions of data amounts relies too much on the simpler model to make predictions, and a heavier weighting of the more complex model predicts future data better.

• (For technical exposition, and a nice solution to make BMA work better for prediction, see the article).

Page 72:

• Intuition suggests (and Grunwald confirms) that this problem would be even worse for NML, assuming that one chooses some sensible way to predict using NML (there are several choices).

Page 73:

• Suppose one is in a data region where the more complex model predicts better, but BMS prefers the simpler model. Which model should be preferred? One can try to skirt this question by saying the answer depends on one’s goals, but we believe the issue is complex and remains an open question, one deserving more thought and research.

Page 74:

• There are several lessons we can take from this:

• Bayesian prediction, NML prediction (if one comes up with a sensible prediction system for NML), and empirical predictive validation are mutually inconsistent—in general they measure different things and give different answers.

• However, when applied sensibly, with appropriate parameter and data priors, we believe the answers would generally be close (an assertion requiring testing and validation).

• Perhaps it would make sense to combine NML, BMS, and predictive validation: averaging the three answers could possibly combine the benefits of each.

Page 75:

• One might now want to ask the following question: How should one introduce data priors into posterior predictive approaches like cross-validation?

• Of several possibilities, we propose a method that seems to capture the spirit of our previous proposals.

ΣiP(D2,D0i | D1,K)P0(D0i)

• In this approach, we use the model calibrated on the training set D1 to see how high a joint predicted probability it gives to both the test/validation set D2 and a plausible data set D0i. We then take the expectation of this with respect to our data prior, P0(D0i). We can implement this measure using split-half or, preferably, leave-one-out cross validation.
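A sketch of this proposal for a discrete outcome space; the calibrated joint predictive distribution given D1 and Model K, and the data prior, are placeholder inputs to be supplied:

```python
import numpy as np

def cv_data_prior_score(pred_joint_given_D1_K, d2_index, data_prior):
    """Sketch of: sum_i P(D2, D0_i | D1, K) p0(D0_i).

    pred_joint_given_D1_K[j, i] : calibrated joint predicted probability of present outcome j
                                  and plausible prior outcome i, given training data D1 and
                                  Model K (placeholder input, however it is computed)
    d2_index                    : index of the test/validation outcome D2
    data_prior[i]               : p0(D0_i)
    """
    return float(np.dot(pred_joint_given_D1_K[d2_index, :], data_prior))

# Hypothetical usage:
# score_K = cv_data_prior_score(pred_joint, d2_index=3, data_prior=np.array([0.5, 0.3, 0.2]))
```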

