
Philosophy of the Quantitative Sciences Part 1: Unification, Curve Fitting, and Cross Validation

Malcolm Forster (November 17, 2003)

What Is Philosophy of Science About?

According to one definition, the philosophy of science seeks to describe and understand how science works in a way that applies to a variety of examples of science. This does not have to include every kind of science. But it had better not be confined to a single branch of a single science, for such an understanding would add nothing to what scientists working in that area already know.

Deductive logic is about the property of arguments called validity. An argument has this property when its conclusion follows deductively from its premises. Here's an example: If Alice is guilty then Bob is guilty, and Alice is guilty. Therefore, Bob is guilty. The important point is that the validity of this argument has nothing to do with the content of the argument. Any argument of the following form (called modus ponens) is valid: If P then Q, and P, therefore Q. Any claims substituted for P and Q lead to an argument that is valid. Probability theory is also content-free. This is why deductive logic and probability theory have traditionally been the main tools in philosophy of science.

If science worked by logic, and logic alone, then this would be a valuable thing to know and understand. For it would mean that someone familiar with one science would immediately understand many other sciences. It would be like having a universal grammar that applies to a disparate range of languages.

The question is: How deep is the understanding of science that logic and probability can provide? One of the principal goals of this essay is to argue that there is a tradeoff between universality and depth. More specifically, I shall argue that there is a significant depth of understanding to be gained by narrowing the focus of the philosophy of science to the quantitative sciences.

A Dilemma for Philosophy of Science

Every discipline begins with the simplest ideas, to see how far they can go. The philosophy of science is no exception. Here I sketch two simple answers to the question: How does science work? Neither answer is adequate, albeit for quite different reasons.

The first answer is that science accumulates knowledge by simple enumerative induction, which refers to the following pattern of reasoning: All billiard balls observed so far have moved when struck, therefore all billiard balls move when struck. The premise of this argument refers to something that we know by observation alone. It is a statement of observational or empirical evidence. The conclusion extends the observed regularity to unobserved instances. For example, it predicts that the next billiard ball will move when struck. Like any scientific theory or hypothesis, the conclusion goes beyond the evidence. Simple enumerative induction is an instance of what is called ampliative inference because the conclusion 'amplifies' the premises.


It is a simple fact of logic that any ampliative inference is invalid. Even if the billiard ball has been observed many times in a variety of circumstances, the next billiard ball may not move when struck (there is such a thing as superglue).

Since any ampliative inference is fallible, any methodology of science is also fallible. Our very best scientific theories may be false. But, it's not the fallibility of simple enumerative induction that limits its usefulness in philosophy of science. Rather, the main objection is that simple enumerative induction fails to capture the important role of new concepts in science. Newton's theory of gravitation introduced the concept of gravitation to explain the motion of the planets. Mendel introduced the concept of a gene to explain the inheritance of traits in pea plants. Atomic theory introduced the notion of atoms to explain thermodynamic behavior. Psychology introduced the notion of intelligence to explain the correlation of test scores.

In contrast, the pattern of simple enumerative induction is: All observed A's have been B's. Therefore, all A's are B's. The terms appearing in the conclusion are just the same as those appearing in the premise. Yet Newton did not see any instances of gravity pulling on planets, Mendel did not observe any genes, and Boltzmann did not observe molecules in motion.

Another very simple theory of scientific inference is called inference to the best explanation. The best explanation can postulate the existence of new, unobserved, entities. So, this picture of scientific inference cannot be faulted in this regard. Instead, it is limited by its vagueness. For if nothing more is said about what counts as an explanation, and what counts as 'best', then any example of science may be described as an inference to the best explanation. It does little more than replace the original question with new questions: What is explanation, what counts as 'best', and how do we tell which explanation is 'best'?

The dilemma for philosophy of science is to say something precise that fits a diverse range of scientific examples and also deepens our understanding of how science works. There are two well developed philosophies of science that do a fairly good job at making this tradeoff. The first uses deductive logic as a tool, and is called hypothetico-deductivism. The second is a probabilistic generalization of this, which is commonly referred to as Bayesianism. Both of these approaches achieve a great deal from very basic principles. Yet, I plan to show that these views are limited by their great generality. These views of science are so content-free that they make very little distinction between everyday and scientific reasoning. This raises the question: Is there anything very special about scientific reasoning? If the current philosophies of science are to be believed, then the answer is that there is nothing very special about science. At a broad level of description, this may be true to some extent. But at a slightly deeper level of description, I plan to show that it is not true.

In my view, there is a line between quantitative science and non-quantitative science, where by quantitative science I refer to science that uses mathematical equations to represent reality. In particular, I plan to argue that there is a reasonably universal methodology of unification and curve-fitting that applies to a wide variety of quantitative sciences, which deepens understanding of these sciences in a way that hypothetico-deductivism and Bayesianism do not.

Unification and Curve Fitting

My first thesis is that unification and curve fitting are inseparable parts of the same process: every instance of curve-fitting involves unification. Think about drawing a curve through a set of data points. There is a clear intuitive sense in which the curve connects the points not merely in the obvious sense of drawing a line between them, but also in a deeper sense of explaining many observations in terms of a single hypothesis. Moreover, there are other ways in which unification and curve-fitting go together.

One kind of unification is illustrated by Boyle's unified treatment of his experiment concerning the law connecting the pressure and the volume of a gas and Torricelli's experiment. The Torricellian experiment showed that the weight of the air at sea level is equivalent to the weight of a column of mercury approximately 29 inches high. When Boyle observed that the air trapped in a tube is compressed by the weight of 29 inches of mercury to half its volume, he saw an extraordinary coincidence between that number and the height of mercury in Torricelli's experiment. This coincidence needed to be explained. It occurred to him that Torricelli's experiment showed that there were, effectively, two atmospheres of pressure working to halve the volume in his experiment: the weight of the air and the weight of the mercury that he added. So, he saw the halving of the gas volume as being caused by a doubling of the pressure. This is one instance of the law PV = constant. It was only after his theoretical inference about the value of P that his empirical law took on this simple form.[1]

Another kind of unification is when parameters measured in one curve-fitting agree numerically with those measured in another curve-fitting. The paradigmatic example of this kind of unification is Newton's unification of celestial and terrestrial motion. In this example, the observed accelerations and distances of the earth's moon and terrestrial projectiles provide two independent measurements of the earth's gravitational mass. The unification is confirmed to be successful when the two measurements agree.[2]

Here is how the same kind of unification works in another example. A beam balance measures the ratio of two masses, for example the ratio of the masses of objects a and b. Denote this ratio by m(a)/m(b). We also measure the ratios m(b)/m(c) and m(a)/m(c) in two other experiments in the same way. Then the theory implies that the three mass ratios are connected in the following way:

m(a)/m(c) = m(a)/m(b) × m(b)/m(c). This means that m(a)/m(c) can be predicted from the other two mass ratios, which provides an independent measurement of m(a)/m(c). If the measurements agree, then the unified beam balance model is confirmed.
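
As a quick illustration of this consistency check, here is a minimal Python sketch. The three measured ratios are invented numbers used only for illustration, and the 5% tolerance is an arbitrary stand-in for a proper treatment of measurement error.

```python
# Hypothetical beam balance measurements: each experiment yields one mass ratio.
ratio_ab = 2.01   # measured m(a)/m(b) in the {a, b} experiment (invented value)
ratio_bc = 1.49   # measured m(b)/m(c) in the {b, c} experiment (invented value)
ratio_ac = 3.02   # measured m(a)/m(c) in the {a, c} experiment (invented value)

# The theory implies m(a)/m(c) = m(a)/m(b) x m(b)/m(c), so the first two
# experiments provide an independent measurement of the third ratio.
predicted_ac = ratio_ab * ratio_bc
print(f"predicted m(a)/m(c) = {predicted_ac:.2f}, directly measured = {ratio_ac:.2f}")

# Agreement (here within a crude 5% tolerance) confirms the unified model.
print("independent measurements agree:", abs(predicted_ac - ratio_ac) / ratio_ac < 0.05)
```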

The beam balance example is the toy example that I shall use to provide a generic description of curve-fitting as it commonly occurs in the quantitative sciences. The example is sufficient to show that the view of curve fitting assumed by prominent philosophers in the past, such as Reichenbach (1938) and Hempel (1966), is wrong. It is not only descriptively wrong, but also normatively incorrect. The method that scientists use is the only one that works. I shall begin by describing the right method, and then compare it to the wrong view.

Curve-Fitting in Science

Curve-fitting involves three steps:[3]

Step 1: The determination of the dependent and independent variable(s).
Step 2: Model selection: the selection of the model, or family of curves.
Step 3: Parameter estimation: the determination of the curve that best fits the data.

[1] See chapter 7 of Harré, Rom (1981): Great Scientific Experiments. Oxford: Phaidon Press.
[2] See Harper (2002), and back references, for other intriguing examples in Newtonian science.
[3] These are the three steps in the "colligation of facts" in the quantitative sciences identified by William Whewell 150 years ago. The relevant excerpts are reprinted in Butts, Robert E. (ed.) (1989). William Whewell: Theory of Scientific Method. Hackett Publishing Company, Indianapolis/Cambridge.

The first step in curve-fitting is determining the appropriate choice of x and y variables. In its simplest form the data consists of a set of (x, y) points, where y is the dependent variable (e.g., volume) and x is the independent variable (e.g., pressure).

Given that curve-fitting is the fitting of a curve to a set of data points, it is natural to think that one begins with the data. One may begin with some data, but it is not always in the form required to discover the appropriate curve. Boyle began with the task of discovering a law determining the volume of a gas, but first he had to understand how the weight of the air combined with the added mercury to produce the total pressure on the gas. A major step of Boyle's discovery was therefore in the determination of the independent variable (the total pressure).

To overlook the first step is to overlook how much prior knowledge and experience goes into preparing the ground for a successful curve-fitting. Famously, Kepler fitted an ellipse to a set of known positions of Mars in its orbit. But these positions were three-dimensional positions of Mars relative to the sun. They were not directly observed. They were inferred from the positions of Mars relative to the fixed stars, and the inference relied on centuries of intellectual labor. By overlooking this first step, automated discovery of laws can be made to look easy enough to be performed by a computer (Langley et al. 1987).

The second step is the selection of the model, or family of curves. This step is known as model selection in the statistics literature. Given that the scientists' concept of model differs from the sense of the term commonly used by philosophers of science, it is appropriate to define the term.

Definition: In the quantitative sciences, a model is an equation, or a set of equations, with at least one adjustable parameter. A model usually applies to a specific kind of situation. In contrast, a theory, such as Newton's theory of motion, is a general set of laws and principles of broad scope, from which different models are derived using different sets of auxiliary assumptions.

Overlooking the role of models in curve fitting is another way of making the process look too easy. One must formulate a set of rival models, which might be done by guessing the form of the equation, or beginning with a very simple model such as the family of circles or the family of straight lines, and complicating these models in various ways. In other cases, models are derived from a background theory using auxiliary assumptions. Again, scientists usually begin with the simplest possible models even though these are drastically unrealistic and idealized. For example, Newton proved that Kepler's laws follow from his theory of gravity under the plainly false idealization that each planet is a point mass, which has negligible mass compared to the mass of the sun, such that the presence of the other planets has no effect. He had to complicate this model in stages, in order to predict effects that are not predicted by Kepler's laws.

Likewise, in deriving the beam balance model from Newton's theory of motion, there are stronger and weaker choices of auxiliary assumptions that can be made. In all cases, we are left with many models. The selection of the 'best' model from among rival models is the concern of a theory of confirmation. This is the main topic of later sections.

In the meantime, suppose we have a set of data consisting of pairs of (x, y) values. One of the principal goals of curve-fitting is to make predictions of y for new values of x. It seems deceptively easy: determine a single curve that fits the data and make predictions from it. A family of curves makes many predictions, and therefore no precise predictions, in general. Hence one would not initially guess that models play an essential role in curve-fitting. In fact, prominent philosophers of science have assumed exactly that; the problem for them was how to choose from among the set of all curves, many of which will fit the data perfectly. Not only is this not how curve fitting works in real science, but the next three sections will explain how this false view of curve fitting has led to spurious philosophical puzzles about the role of simplicity in curve fitting.

The correction of this mistake is very important because the introduction of models does something that has philosophical significance. It introduces adjustable parameters into the problem, and these can play a key role in the confirmation of curve-fitting hypotheses.

As a concrete example of a model, consider the family of all straight lines in the x-y plane, which is represented by the equation y = a + b x, where a and b are adjustable parameters that can take on any value between −∞ and +∞. For any curve in the family, y = a when x = 0, so a determines the point of intersection between the line and the y-axis. The parameter b is the slope of the line. Clearly, any particular set of values assigned to the adjustable parameters will pick out a particular curve in the family. And conversely, any particular curve will determine a set of numerical values for all the adjustable parameters.

It is this second fact that explains how statistical estimation works. If one determines the curve that fits the data best, then one has found estimates of all the adjustable parameters. For that reason, step 3 is referred to as the estimation of the parameters in statistics. Without the mediation of models, the estimation of parameters plays no role.

The third step in curve-fitting uses the data to pick out, from the family of curves, the particular curve that best fits the data. The definition of 'best fit' can differ. The best known method of determining the curve that best fits the data is the method of least squares. Interestingly, the method of least squares was introduced by Gauss (1777-1855) and Legendre (1752-1833) in the early 1800s as a way of inferring planetary trajectories from noisy data. If we associate with each curve an error distribution for the y values, we may also define the best curve as the one that makes the data most probable. This is called the method of maximum likelihood. Gauss proved that these two methods are equivalent when the error distribution is bell-shaped, which is why the bell-shaped distribution now bears his name.
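
To make the third step concrete, here is a small Python sketch (using NumPy) with invented data. It fits a straight line by least squares and then recovers essentially the same line by a crude grid search over the Gaussian log-likelihood, which, up to constants, is just the negative of the sum of squared residues; the particular numbers and the grid range are arbitrary choices for illustration.

```python
import numpy as np

# Invented (x, y) data scattered about a straight line, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Method of least squares: choose a and b in y = a + b*x to minimize the sum
# of squared residues (SSR).  np.polyfit returns coefficients highest power first.
b_ls, a_ls = np.polyfit(x, y, deg=1)

# Method of maximum likelihood with a Gaussian (bell-shaped) error distribution:
# the log-likelihood is, up to constants, -SSR / (2 * sigma**2), so maximizing it
# over a and b is the same as minimizing SSR.  A coarse grid search shows this.
grid = np.linspace(-2.0, 2.0, 401)
a_ml, b_ml = max(((ai, bi) for ai in grid for bi in grid),
                 key=lambda p: -np.sum((y - (p[0] + p[1] * x)) ** 2))

print("least squares estimate:      a = %.2f, b = %.2f" % (a_ls, b_ls))
print("maximum likelihood estimate: a = %.2f, b = %.2f" % (a_ml, b_ml))
```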

Prediction in the Beam Balance Example

It is time to use a concrete example to illustrate the concepts introduced so far. The example is easy to understand, but also rich enough to illustrate many features of relevance.

Suppose we hang an object, labeled a, on one side of a beam balance, and find how far a second object b has to hang on the other side to balance the first. The distance that a is hung from the point at which the beam is supported (called the fulcrum) is labeled dist(a), while the distance to the right of the fulcrum at which b balances a is labeled dist(b). If b is moved to the left of this point then the beam tips until the object a rests on the ground, and if b is moved to the right the beam tips the other way. In the first instance, dist(a) is 1 centimeter (cm) and the beam balances when dist(b) is 1 cm. This pair of numbers is a datum, or a data point. We repeat this procedure 2 more times with different values of dist(a), and tabulate the resulting data in Table 1. We are now asked to make the following prediction: Given that a is hung at a distance of 4 cm to the left of the fulcrum, predict the distance at which b will balance; viz. predict dist(b). Once we notice that for all 3 data, dist(b) is equal to dist(a), it appears obvious that we should predict that b will be 4 cm from the fulcrum when a is hung at 4 cm. But what is the general method of prediction that applies to this and other examples?

Figure 1: A beam balance.

Table 1
dist(a)    dist(b)
1 cm       1 cm
2 cm       2 cm
3 cm       3 cm

The Wrong Answer

The naive answer to the question about how to make predictions is roughly this: A regularity is observed in past data, and we should infer that this regularity will apply to future instances. In the beam-balance example, this idea may be spelt out in terms of the following inference: In all the observed instances, object b balanced on the other side of the beam at a distance equal to that of object a. Therefore, b will balance at the same distance as a in every instance. We recognize the general pattern, or form, of the argument as that of simple enumerative induction:

In all the observed instances, system s conforms to regularity R.
------------------------------------------------------------------
System s will always conform to regularity R.

In this example, the system s is the beam balance with objects a and b hung at opposite sides of the fulcrum, and the regularity R is expressed by the mathematical equation R: dist(b) = dist(a).

Philosophers of science, from Reichenbach (1938), to Hempel (1966), Goodman (1965) and Priest (1976), have been quick to recognize what's wrong with this form of ampliative inference: It wrongly presupposes that regularity R is unique. A quick glance at Fig. 2 shows that there are infinitely many 'regularities' that fit the same set of data. There is no such thing as the observed regularity. So, any pattern of inference that tells us to extend the observed regularity from observed instances to unobserved instances makes a false presupposition. As previously mentioned, it's not the fallibility of the inference that is in question. It's the very coherence of the method that has been challenged.

Figure 2: The plot of the data points in Table 1, showing three curves (R, R1, and R2) that fit the data perfectly.


Does Simplicity Save the Day?

What Reichenbach, Hempel, Goodman and Quine all suppose is that this problem is solved by bringing in simplicity. That is, we should modify simple enumerative induction in the following way:

R is the simplest regularity to which system s is observed to conform.
------------------------------------------------------------------------
System s always conforms to regularity R.

Note that this form of ampliative inference does not conclude that the system conforms to the simplest possible regularity. The method does not imply that the inferred regularity is simple. It only says that at each stage, R is determined by the simplest curve that fits the data. Thus, in general, the regularity R will become more complex as more data are collected.

However, this solution will only work if the requisite notion of simplicity is coherent. There is a brilliant argument by Priest (1976) that shows, in my opinion, that it is not. Note that the requisite notion of simplicity is one that applies to single curves. There is no argument here against defining the simplicity of families of curves.

Priest's argument applies to the beam balance example in the following way. Instead of using dist(b) and dist(a), let's label the variables y and x, respectively. This makes it easier to write down the equation for the curve R1 in Fig. 2, which is:

y = f1(x) = −0.5x³ + 3x² − 4.5x + 3.

Intuitively, this equation is more complex than y = x. Now suppose we define a new variable y′ as:

y′ ≡ y / f1(x).

The original data set in x-y coordinates is {(1,1), (2,2), (3,3)}. Now the same data set is represented in x-y′ coordinates as {(1,1), (2,1), (3,1)}. The simplest curve through these three points is a horizontal line with the equation y′ = 1. We can now solve the prediction problem. When x = 4, y′ = 1. In fact, for any value of x, we predict that y′ = 1. In other words, given any value of x, we predict that y = y′ f1(x) = f1(x). Therefore, the new prediction is just given by the curve R1. This prediction disagrees with the previous prediction, but both are based on the same method, so there is a big problem. Therefore, simplicity does not save the day.

In summary, in the original coordinate system, the intuitively simplest curve predicts that y = 4 cm when x = 4 cm. But in the new coordinate system, the intuitively simplest curve predicts that y = 1 cm when x = 4 cm.[4] If our judgments of simplicity are based on intuition, then the problem is that these judgments depend on the way we describe the prediction problem. But a method of prediction should not depend on how we describe the problem any more than the truth should depend on whether we speak French or English.

Priest (1976) has shown that our intuitive judgments of simplicity are language variant, while truth is language invariant. Therefore, our intuitive judgments of simplicity are not reliable indicators of truth.

[4] From the graph, we see that it is a purely accidental fact that the prediction y = 1 cm looks the same as y′ = 1. For other values of x, this is not true.
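
The re-description argument can be checked numerically. The following Python sketch uses the cubic written above as f1 and simply recomputes the data and the two "simplest-curve" predictions in the two coordinate systems; it is an illustration of the argument, not part of Priest's own presentation.

```python
# The re-description used in Priest's argument, checked numerically.
def f1(x):
    # The cubic R1 above, which also passes through (1,1), (2,2), and (3,3).
    return -0.5 * x**3 + 3 * x**2 - 4.5 * x + 3

data_xy = [(1, 1), (2, 2), (3, 3)]

# The same data re-described in (x, y') coordinates, with y' = y / f1(x).
data_xy_prime = [(x, y / f1(x)) for (x, y) in data_xy]
print(data_xy_prime)                 # [(1, 1.0), (2, 1.0), (3, 1.0)]

# "Simplest curve" in each description, applied to the prediction problem x = 4:
prediction_original = 4              # y = x predicts y = 4 cm
prediction_redescribed = 1 * f1(4)   # y' = 1 predicts y = y' * f1(4) = f1(4)
print(prediction_original, prediction_redescribed)   # 4 versus 1.0
```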


Priest's argument applies only to the simplicity of single curves. There is no argument that the simplicity of families of curves is not language invariant. This will allow for a positive role for simplicity in curve fitting that includes models.

The Mediation of Models[5]

The conclusion of the previous two sections is that the mediation of models is an essential part of curve fitting. It should not be overlooked even though it raises questions of its own: Where do models come from, and given that there are many models, how do we choose one in favor of the others?

There are two answers to the question Where do models come from? The first answer is that in the less mature or more difficult branches of science, in which there is no background theory, the postulation of models is a matter of trial and error: guesswork, followed by some kind of testing procedure.

The second way of producing a model is by deducing it from a theory with the aid of auxiliary assumptions:

Theory
Auxiliary assumptions
------------------------
Model

This procedure will be illustrated by deriving the beam balance model from Newton�s theory of motion. However, we shall see that even in this second case, there are different choices of auxiliary assumptions and therefore many models that are logically compatible with a single theory. Hence, the question How do we choose from among competing models? arises in either case.

Derivation of the Beam Balance Model

The derivation of the beam balance model is important for two reasons. First, it increases the reader's understanding of the example. But more importantly, it establishes the general point that the deduction of a model from a theory does not imply that the theory determines a unique model. It does not obviate the need for model selection criteria.

We begin with Newton's theory of motion, which tells us about forces and how they produce motion. In the case of a beam, Newton's theory tells us that the beam will remain motionless (i.e., it will balance) when the forces at every point exactly balance. What are these forces? The idea of leverage is familiar to everyone. For example, everyone knows that it is possible to 'magnify' a force using a rigid body such as a crowbar. A long rigid beam may lever a large boulder if the point at which we push down is a long distance from the fulcrum compared with the distance from the fulcrum to the end of the beam applying a force to the boulder (Fig. 3). Of course, you have to apply the downward force through a longer distance than the distance that the boulder moves upwards, so the work you do is equal to the work done on the boulder. This is required by the conservation of energy.

Figure 3: The principle of leverage is most commonly about the magnification of forces.

[5] Numerous philosophers of science have argued that philosophy of science can ignore the role of models. Regrettably, the sense of 'model' assumed by many of these authors is akin to the notion of 'interpretations' in predicate logic. For this sense of 'model' does not fit the scientific use of the term. Other philosophers, most notably Sober (e.g., 1988), have emphasized the importance of models in science in the sense described here. The deeper examination of the logic of curve fitting is meant to provide a deeper explanation for why models play an indispensable role in science.

The same principle applies to beam balances. The forces applied to the beam arise from the gravitational forces of the two objects. If m(a) is the mass of a, m(b) is the mass of b, and g is the gravitational field strength, then a exerts a gravitational force of m(a).g on the beam at the point at which it is hung, and b exerts a force of m(b).g at the point at which it is hung. Now focus on the object a. If the beam is to balance then the forces acting on a must balance. That is, the upward leverage of b on a must balance the downward gravitational force m(a).g. By the principle of leverage, b is exerting an upward force on a equal to its downward force magnified by the ratio of the distance dist(b) to dist(a). The background theory, viz. Newton's theory of motion, tells us that these two forces must be equal:

m(b).g.[dist(b)/dist(a)] = m(a).g.

If we multiply both sides of this equation by dist(a) and divide both sides by m(b).g, and simplify, we derive the equation:

Model M: dist(b) = [m(a)/m(b)].dist(a).

This completes the first two steps of curve fitting, for not only have we selected a model (second step) but we have also determined the dependent and independent variables (first step). Note that the mass ratio is playing the role of an adjustable parameter.
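
For readers who like to see the algebra mechanized, here is a small Python sketch using SymPy that solves the force-balance equation above for dist(b); the symbol names are mine, and the snippet is only a check of the hand derivation, not part of the original presentation.

```python
import sympy as sp

m_a, m_b, g, dist_a, dist_b = sp.symbols('m_a m_b g dist_a dist_b', positive=True)

# Force balance at the point where a hangs: the upward leverage exerted by b,
# m(b).g magnified by the ratio dist(b)/dist(a), equals the downward force m(a).g.
balance = sp.Eq(m_b * g * dist_b / dist_a, m_a * g)

# Solving for dist(b) recovers Model M: dist(b) = [m(a)/m(b)].dist(a).
print(sp.solve(balance, dist_b)[0])   # -> dist_a*m_a/m_b
```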

The several sections to follow are devoted to the many consequences that the mediation of models has on the normal operation of science. Some are familiar, such as the role of auxiliary assumptions, and some are not-so-familiar, such as the role of simplicity and unification.

The Many-Models Problem

In the beam balance example, there were many auxiliary assumptions made in deriving the model. For instance, we ignored the leverage applied by the mass of the beam itself. This assumption would be true if the beam were massless, or if it were of uniform mass and supported by the fulcrum exactly at the center of the beam. As a second example, we ignored the presence of other forces. We tacitly assumed that there were no electrostatic forces, no puffs of wind, no magnetic forces, and so on. A third example is the tacit assumption that the gravitational field strength, g, was the same on both sides of the beam. We know that g is different at different places on the surface of the earth (e.g., near a large mountain). For a small beam balance the two masses will be at approximately the same place, so g will be approximately the same, but not exactly the same. All such simplifying assumptions are called auxiliary assumptions. In practice, it is impossible to list all auxiliary assumptions, and often scientists do not make them explicit.

Figure 4: The beam balances if and only if the leverage acting upward balances the gravitational force acting downward.

Auxiliary assumptions are commonly formulated in the vocabulary of the theory itself, such as the assumption that there were no other forces acting. Yet auxiliary assumptions are not derived from the theory itself. Newton's theory of motion does not tell us whether a beam has uniform density, or whether it has been properly centered. Nor are auxiliary assumptions determined from the data. For example, if a is subject to a gravitational field strength of g1 while b is subject to g2, then the model derived from the theory is:

dist(b) = [m(a).g1/m(b).g2].dist(a).

This model defines exactly the same family of curves (all straight lines passing through the origin). The only difference will be a difference in interpretation: In this model, the slope of the straight line passing through all the data points (equal to 1) is interpreted as the value of the ratio of the weights rather than the ratio of the masses.[6] Other data may help to decide between auxiliary assumptions. For example, we could test the assumption that the gravitational field strength is uniform by seeing whether a single mass stretches a spring different amounts when it is moved from place to place. But note that this test will introduce its own auxiliary assumptions. Eventually, we are going to find auxiliary assumptions that are not derivable from theory even with the aid of the total data. If this is correct, if there are auxiliary assumptions that are not wholly derivable from theory plus data, then there will be many models that are underdetermined by the theory plus data.

If this is true, then at any point of time, a theory has many unfalsified models. Each model will, in general, make different predictions. When we observe whether some of these predictions are true, we narrow down the set of unfalsified models. But the problem remains. There are still many unfalsified models that make incompatible predictions.

A New Role for Simplicity?

Perhaps simplicity has a role to play after all, in selecting from among many models that fit the data equally well?

The model that we derived for the beam balance experiment is simple in that it has one parameter (the mass ratio). Call this Model 1. It might be written as:

Model 1: d(b) = β d(a), where β is any positive real number.

Note that dist has been shortened to d and the single parameter β replaces the mass ratio. Now compare this with a more complicated model, Model 2, which does not assume that the beam is exactly centered. It is possible, in other words, that b must be placed a (small) non-zero distance from the fulcrum to balance the beam if a were absent. In other words, we must allow that d(b) is equal to some non-zero value, call it α, when d(a) = 0. This complication may be incorporated into the model by adding α as a new adjustable parameter, to obtain:

Model 2: d(b) = α + β d(a), where α is any real number and β is positive.

[6] Weight is the force of gravity, equal to the mass times the gravitational field strength. The weight of an object is different on earth than it would be on the moon because the gravitational field strength is different, but its mass would not change.


Model 2 has more adjustable parameters than Model 1, and is therefore more complex than Model 1. Also notice that Model 2 contains Model 1 as a special case. Think of Model 1 as the family of straight lines that pass through the origin. Then Model 2 is the family of all straight lines. Model 2 therefore includes Model 1 as a subfamily. Scientists and statisticians frequently say that Model 1 is nested in Model 2.

Here is an independent way of seeing the same thing: If we put α = 0 in the equation for Model 2 we obtain the equation for Model 1.

The relationship can also be described in terms of logical entailment, which is defined as follows. Definition: For any hypotheses P and Q, P entails Q (or, equivalently, P implies Q, or Q follows from P, or Q is deducible from P, all written P ⇒ Q) if and only if it is impossible for P to be true and Q to be false at the same time.[7] Even though entailment is a very strong relation, Model 1 logically entails Model 2.

Proof: Suppose that the equation d(b) = β d(a) is true, for some positive number β. Then d(b) = 0 + β d(a), and so d(b) = α + β d(a) for some real number α (namely 0). Therefore, Model 2 is true.

The entailment relation is not required for one model to be simpler than another, but it frequently holds.

Note that we add an auxiliary assumption (that the beam is supported exactly at its center) in order to derive Model 1. So, ironically, simpler models require a greater number of auxiliary assumptions in their derivation. The fewer auxiliary assumptions, the more complex the model. This is the opposite of what you might have thought.

Models whose equations are based on more independent variables are also more complex, given that the new terms introduce additional adjustable parameters. So, the relationship between simple and complex models is very frequently like the relationship between Model 1 and Model 2.

Simplicity Is Not Equal to Falsifiability

We have just noted that when one model is a special case of a second model, then the first is simpler than the second. Popper (1959) concluded that simplicity amounts to nothing more than falsifiability, and the reason why simplicity is important in science is that it is good for theories to stick their necks out, so long as they are not 'chopped off'.

Popper also wanted to equate simplicity with falsifiability more generally. He attempted to do this by defining falsifiability in terms of the number of data points needed to falsify a hypothesis. The comparison of Model 1 and Model 2 illustrates Popper's idea. Two data points will fit some curve in Model 2 perfectly. If the line joining the two points does not pass through the origin, then the two points will falsify every curve in Model 1. Therefore, Model 1 is more falsifiable than Model 2. Model 1 is also simpler than Model 2. So, it seems that we can equate simplicity and falsifiability.
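
A tiny Python sketch of Popper's counting idea, with invented data points: any two points pick out a unique straight line, and they refute Model 1 (lines through the origin) exactly when that line has a non-zero intercept, whereas some curve in Model 2 survives either way.

```python
# Two data points determine a unique straight line.  They falsify Model 1
# (lines through the origin) exactly when that line has a non-zero intercept;
# some curve in Model 2 (all straight lines) fits them perfectly either way.
def falsifies_model_1(p1, p2, tol=1e-9):
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)        # assumes x1 != x2
    intercept = y1 - slope * x1
    return abs(intercept) > tol

print(falsifies_model_1((1, 1), (2, 2)))    # False: the line y = x passes through the origin
print(falsifies_model_1((1, 1), (2, 2.5)))  # True: intercept -0.5, so every Model 1 curve fails
```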

Notice that Popper's definition of simplicity agrees with Priest's conclusion that it is impossible to justify the intuition that some single curves are simpler than others. For if we consider any single curve, then it is falsifiable by a single data point. Hence, all hypotheses represented by single curves are equally simple. None is simpler than any other.

[7] Example: Let P = Alice and Bob are guilty, and Q = Bob is guilty. Then P entails Q, but Q does not entail P. Entailment is a strong relation. For example, let P = George is a 100 year-old human, and Q = George cannot run a sub-four-minute mile. P does not entail Q even though knowing P would make Q highly probable.


What we have are two competing definitions of simplicity. The first is simplicity in the sense of the paucity of adjustable parameters and the second is Popper's notion of simplicity defined as falsifiability. It was Hempel (1966, p. 45) who saw that 'the desirable kind of simplification' achieved by a theory is not just a matter of increased content; for if two unrelated hypotheses (e.g., Hooke's law and Snell's laws) are conjoined, the resulting conjunction tells us more, yet is not simpler, than either component. This is Hempel's counterexample to Popper's definition of simplicity. The paucity of parameters definition of simplicity agrees with Hempel's intuition. For the conjunction of Hooke's law and Snell's laws has a greater number of adjustable parameters than either law alone. Therefore, the conjunction is more complex.

Hempel's counterexample does not show that logical entailment and falsifiability are unimportant in the philosophy of the quantitative sciences. It shows, in my view, that deductive (or probabilistic) relations don't do all the work.

How to Fit a Model to the Data

Exactly what role might simplicity play in choosing between Model 1 and Model 2? We began with the suggestion that simplicity might come into play when two models fit the data equally well. To explore this idea, first recall what it means for a model to fit a set of data. We need to describe the third step of curve fitting in greater detail.

Consider some particular model, which might be thought of as a family of curves. For example, Model 1 is the family of straight lines in Fig. 5 that pass through the origin (the point at which d(a) = 0 and d(b) = 0). Each of these lines has a specific slope which is equal to the mass ratio m(a)/m(b), which might be labeled β.

A new feature of curve fitting mediated by models is that there is no longer any guarantee that there is a curve in the family that fits the data perfectly. This immediately raises the question: If no curve fits perfectly, how do we define the best fitting curve? This question has a famous answer in terms of the method of least squares.

Consider the data points in Fig. 5 and an arbitrary curve in Model 1, such as the one labeled R3. What is the 'distance' of this curve from the data? Define this as the sum of squared residues (SSR), where the residues are defined as the y-distances between the curve and the data points. The residues are the lengths of the vertical lines drawn in Fig. 5. If a data point lies below the curve, then the residue is negative. But a sum of squares is always greater than or equal to zero, and equal to zero if and only if the curve passes through all the data points. Thus, the SSR is an intuitively good measure of the discrepancy between the curve and the data.

More importantly, we can define the curve that best fits the data as the curve that has the least SSR. Now, recall that there is a one-to-one correspondence between numerical values assigned to the parameters of a model and the members of the model. Any assignment of numbers to all the adjustable parameters determines a unique curve. And given any curve, its parameters have unique numerical values. So, in particular, the best fitting curve assigns numerical values to all the adjustable parameters. These values are called the estimated values of the parameters, and the method of estimation is called the method of least squares when 'best fitting' is defined in terms of the least SSR.

Figure 5: R fits the data better than R3.
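
Here is a minimal Python sketch of steps 2 and 3 for the beam balance: Model 1 is fitted to the Table 1 data by least squares, yielding the estimate of the single adjustable parameter and a precise prediction. The closed-form estimate for a line through the origin is standard; the code itself is only an illustration.

```python
import numpy as np

# Table 1 data: dist(a) and dist(b), in centimeters.
d_a = np.array([1.0, 2.0, 3.0])
d_b = np.array([1.0, 2.0, 3.0])

def ssr(beta):
    """Sum of squared residues of the Model 1 curve d(b) = beta * d(a)."""
    return float(np.sum((d_b - beta * d_a) ** 2))

# Least squares estimate for a line through the origin (standard closed form).
beta_hat = np.sum(d_a * d_b) / np.sum(d_a ** 2)
print(beta_hat, ssr(beta_hat))   # 1.0 0.0 -- the best fitting curve is d(b) = d(a)

# The fitted curve now makes precise predictions, e.g. for dist(a) = 4 cm:
print(beta_hat * 4.0)            # 4.0 cm
```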

In summary, the process of fitting a model to the data yields a unique curve in the family.[8] This is a curve in which the adjustable parameters have been adjusted. It can now make precise predictions, which can be used to test the model, and indirectly test the theory from which it was derived.

[8] There are exceptions to this, for example when the model contains more adjustable parameters than there are data. Scientists avoid such models.

The adjusted values, or the estimated values, of the parameters are often denoted by a hat. Thus, the final step in curve fitting yields a single curve denoted by, for example, d(b) = β̂ d(a), where the estimate β̂ in our example is approximately equal to 1. This curve allows us to complete the prediction task.

We note in passing that the third step in curve fitting provides us with a non-trivial account of how theoretical quantities are measured. If one ignores models, then the measurement of theoretically postulated quantities is a great mystery.

Recall that our main criticism of simple enumerative induction was that it did not allow for the introduction of new theoretical concepts. On the other hand, inference to the best explanation did provide for this, but did so in a vague way. We can now see how it is that curve fitting takes a middle road: it makes a reasonable tradeoff between precision and generality.

However, this is getting away from our immediate concern, which is: How might simplicity help solve the many-models problem? Our examination of how models fit data will lead to an unexpected conclusion: Complex models fit the data better than simpler models! So, if fit is everything, then simplicity is bad. Simplicity is not bad. Therefore...

Fit Is Not Everything!

Compare two nested models, such as Model 1 and Model 2. No matter how we define what it means to 'best fit the data', it is impossible for Model 1 to fit the data better than Model 2. The proof is remarkably simple. Consider what it would mean for Model 1 to fit the data better than Model 2. Model 2 must have a best fitting curve, C2, but Model 1 would have an even better fitting curve C1. Yet C1 is in Model 2, and C1 fits the data better than C2. But this is a contradiction, because C2 is, by definition, the best fitting curve in Model 2. Therefore, it is impossible for Model 1 to fit the data better than Model 2. And since the special case in which they fit the data equally well almost never happens, we can conclude that more complex models invariably fit the data better than simpler models nested in them. More intuitively, we can explain this fact in terms of the greater flexibility of complex models. They have more curves, and therefore, they have a better 'chance' of fitting the data.
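
This can be checked numerically. The sketch below fits Model 1 (lines through the origin) and Model 2 (all straight lines) to three invented, slightly noisy data points; the particular numbers do not matter, since the ordinary least squares line can never have a larger SSR than the origin-constrained line.

```python
import numpy as np

# Three invented, slightly noisy beam balance readings (d(a), d(b)), in cm.
d_a = np.array([1.0, 2.0, 3.0])
d_b = np.array([1.05, 1.98, 3.02])

# Model 1: d(b) = beta * d(a); least squares through the origin.
beta1 = np.sum(d_a * d_b) / np.sum(d_a ** 2)
ssr1 = np.sum((d_b - beta1 * d_a) ** 2)

# Model 2: d(b) = alpha + beta * d(a); ordinary least squares for a general line.
beta2, alpha2 = np.polyfit(d_a, d_b, deg=1)
ssr2 = np.sum((d_b - (alpha2 + beta2 * d_a)) ** 2)

print(ssr1, ssr2, ssr2 <= ssr1)   # the more complex model never fits worse
```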

This changes the idea under investigation. The original idea was that simplicity might break a tie between models that fit the data equally well, but we now see that this situation does not arise. Rather, what seems intuitive is this: We don't want to move from a simple model to a more complex model just because it fits the data better by some minute amount. Look at Fig. 5 again. The three data points are such that there is a curve in Model 2 that fits those data points better than R. The issue is not whether we want to believe that this effect is due to the beam being unbalanced, as opposed to some other cause. The issue is whether we believe that the way in which the data deviate from the curve R is anything more than a chance event. Think of datum 1 as fixed on curve R, and suppose that datum 2 and datum 3 have equal chances of being above, on, or below curve R. If one is below and the other is above R then there is a curve in Model 2 that fits better. If the fluctuations are random, then the probability of this happening is twice 1/3 times 1/3, which comes out to be 2 chances in 9. This is approximately one chance in 4. It's about the same as the probability of a coin landing heads twice in a row. It's not the kind of evidence that shows that Model 2 is going to provide more accurate predictions than Model 1.

To make such a judgment on the basis of three small random fluctuations would be to take the data too seriously. This general phenomenon is known as overfitting. For example, it is well known that an n-degree polynomial (an equation with a final term proportional to x^n, and hence n + 1 adjustable parameters) can fit n + 1 data points exactly. So, imagine that there are 10 data points that are randomly scattered above and below a straight line. If our goal were to predict new data, then it is generally agreed that it is not wise to use the complex polynomial, especially for extrapolation beyond the range of the known data points (see Fig. 6).

Figure 6: A nine-degree polynomial fitted to 10 data points generated by the function y = x + u, where u is a Gaussian error term of mean 0 and standard deviation ½.
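
The following Python sketch reproduces the kind of comparison shown in Figure 6 (the random seed and the exact numbers printed are arbitrary, so treat the output as illustrative only): a straight line and a nine-degree polynomial are both fitted to 10 noisy points generated from y = x, and then both are asked to extrapolate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Data generated as in the Figure 6 setup: y = x + u, with Gaussian u, sd = 0.5.
x = np.arange(1.0, 11.0)                    # 10 data points
y = x + rng.normal(0.0, 0.5, size=x.size)

line = np.polyfit(x, y, deg=1)              # 2 adjustable parameters
poly9 = np.polyfit(x, y, deg=9)             # 10 adjustable parameters: enough to fit all 10 points

x_new = 12.0                                # extrapolation beyond the known data
print("underlying trend at x = 12:  ", x_new)
print("straight line predicts:      ", np.polyval(line, x_new))
print("9-degree polynomial predicts:", np.polyval(poly9, x_new))
```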

It's plausible that one sticks with Model 1 unless there is a significant gain in fit in moving to Model 2. There is widespread agreement amongst statisticians and scientists that this is qualitatively correct. What is controversial is exactly how this tradeoff should be made in quantitative terms, and what goal is achieved by making the tradeoff. For example, does the right tradeoff maximize the probability that the selected model is true, or does it maximize the expected accuracy of predictions? Or do these things amount to the same thing? These are issues that are discussed elsewhere.[9]

[9] See Forster and Sober (1994) for more details about how a tradeoff between simplicity and fit can minimize overfitting errors. The distinction between extrapolation and interpolation is discussed in Forster (2000) and Forster (2002).

The claim of this section is: If fit with data were everything, then complex models would be better. Therefore fit with data is not everything. The point is amplified in the next section.

Unified Beam Balance Models

With respect to a single application of the beam balance model to a pair of objects {a, b}, the following three equations differ only in how the adjustable parameter is represented:

d(b) = [m(a)/m(b)] d(a),   d(b) = β d(a),   and   d(b) = m(a,b) d(a).

The notation m(a,b) is the same as the β notation except that it records the fact that the model is applied to the set of objects {a, b}. The reason that the equations are equivalent in this context is that they parameterize exactly the same family of curves; namely, all the straight lines with positive slope that pass through the origin.

If we think of the models more broadly, for example, as applying to three pairs of objects, {a, b}, {b, c}, and {a, c}, then there are three equations in each model.

Model 1: d(b) = [m(a)/m(b)] d(a),   d(c) = [m(b)/m(c)] d(b),   and   d(c) = [m(a)/m(c)] d(a).

In contrast, Model 2 consists of the equations:

Model 2: d(b) = m(a,b) d(a),   d(c) = m(b,c) d(b),   and   d(c) = m(a,c) d(a).

On the one hand, one might say that the models have the same number of adjustable parameters. The adjustable parameters of Model 1 are m(a), m(b), and m(c), while those of Model 2 are m(a,b), m(b,c), and m(a,c). However, there is another way of counting adjustable parameters such that Model 1 has fewer adjustable parameters, and hence is simpler. This is because the third mass ratio, m(a)/m(c), is equal to the product of the other two mass ratios according to the mathematical identity:

m(a)/m(c) = [m(a)/m(b)] × [m(b)/m(c)].

Another way of making the same point would be to rewrite Model 1 as:

Model 1: d(b) = β1 d(a),   d(c) = β2 d(b),   and   d(c) = β1β2 d(a).

Now we see that Model 1 has only two independently adjustable parameters.

There is a third way of seeing the same thing: Suppose we give one mass the status of being the unit mass. Set m(c) = 1. Then Model 1 is written as:

Model 1: d(b) = [m(a)/m(b)] d(a),   d(c) = m(b) d(b),   and   d(c) = m(a) d(a).

Again, the number of adjustable parameters is 2, not 3. No matter how one looks at it, Model 1 is simpler than Model 2. This is a species of simplicity that I shall call unification. It is not the only kind of unification that might be relevant to the philosophy of science, but it is a major kind.

Because unification is a species of simplicity, the point of the previous section applies: Complex models fit the data better. The proof extends to this case because Model 1 is nested in Model 2. To see this, first note that these models are not families of curves, but families of curve triples. For example, Model 2 is the family of all curve triples (c1, c2, c3), where c1 is a curve described by the equation d(b) = m(a,b) d(a) for some numerical value of m(a,b), c2 is a curve described by the equation d(c) = m(b,c) d(b) for some numerical value of m(b,c), and c3 is a curve described by the equation d(c) = m(a,c) d(a) for some numerical value of m(a,c). Since each curve is defined by its slope, we could alternatively represent Model 2 as the family of all triples of numbers (β1, β2, β3). These include triples of the form (β1, β2, β1β2) as special cases. Therefore, all the curve triples in Model 1 are contained in Model 2. Hence, Model 2 invariably fits the data better than Model 1.

If we were to collect data pertaining to all three beam balance experiments, Model 2 would fit the data better than Model 1, at least by a small amount. If the amount is only small, we would not want to conclude that Model 2 is better confirmed than Model 1. Again, we want simplicity to count for something, and so we would favor Model 1 over Model 2 in at least some situations.

Or to put the point another way, we wouldn't want to fault Newton's theory of motion for entailing Model 1 rather than the better fitting Model 2. On the contrary, it is a virtue of Newton's theory that its models are unified in this way.

The shocking conclusion of this section is that less unified models fit better. So, fit had better not be everything.
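
The claim can again be checked with a small simulation (a sketch with invented "true" ratios and noise levels; the grid search over Model 1's two free parameters is crude but sufficient to make the point): the disunified Model 2, which fits each experiment separately, can never end up with a larger total SSR than the unified Model 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data for the three experiments {a,b}, {b,c}, {a,c}.  Invented true
# ratios: m(a)/m(b) = 2.0 and m(b)/m(c) = 1.5, hence m(a)/m(c) = 3.0.
d_in = np.array([1.0, 2.0, 3.0, 4.0])
noisy = lambda slope: slope * d_in + rng.normal(0.0, 0.05, d_in.size)
d_ab, d_bc, d_ac = noisy(2.0), noisy(1.5), noisy(3.0)   # observed dependent distances

def total_ssr(s_ab, s_bc, s_ac):
    return sum(float(np.sum((obs - s * d_in) ** 2))
               for obs, s in [(d_ab, s_ab), (d_bc, s_bc), (d_ac, s_ac)])

# Model 2 (disunified): three independently adjustable slopes, each fit separately.
fit = lambda obs: np.sum(d_in * obs) / np.sum(d_in ** 2)
ssr_model2 = total_ssr(fit(d_ab), fit(d_bc), fit(d_ac))

# Model 1 (unified): two adjustable parameters; the third slope is their product.
grid = np.linspace(1.0, 4.0, 151)
ssr_model1 = min(total_ssr(b1, b2, b1 * b2) for b1 in grid for b2 in grid)

print(ssr_model1 >= ssr_model2)   # True: the less unified model never fits worse
```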

Prediction versus Accommodation

There is an interesting way in which the greater unification of Model 1 manifests itself in this example. Suppose that we apply the model to the pair of objects {a, b}, and then to the pair {b, c}. From the first application we obtain a measurement of the mass ratio m(a)/m(b), and from the second we obtain a measurement of the mass ratio m(b)/m(c). If we multiply these two numbers together, we measure the mass ratio m(a)/m(c) without looking at any of the data in the experiment with objects {a, c}. That is, we can predict what the direct measurement will be. Or equivalently, we can compare the predicted value of m(a)/m(c) with the measured value and check whether the two values agree. If they do agree then the prediction has proved to be successful.

In contrast, Model 2 makes no such prediction. It yields measurements of m(a,b) and m(b,c) in the first two experiments, but entails nothing about the value of m(a,c) from these. There is no agreement of independent measurements because there is only one measurement. On the other hand, Model 2 is not refuted by the {a, c} experiment either. It fits the data just the same as Model 1. In fact, it even fits the total data better than Model 1 (this follows rigorously from the result of the previous section). We may express this difference by saying that Model 2 merely accommodates the data in the {a, c} experiment, while Model 1 predicts it. Therefore, Model 1 is better confirmed than Model 2 because successful prediction is better than mere accommodation.

The use of the terms 'prediction' and 'accommodation' is clear enough in this example. But the philosophy of science aims to define such distinctions in the most general way possible. First, recall that prediction is achieved via the three steps of curve fitting. One chooses dependent and independent variables, then one selects or derives a model, and finally one fits the model to data in order to pick out a best fitting curve, from which predictions are inferred. If we carry out this curve fitting procedure using all the data known at a particular time, then the predictions produced are untested at that time. Yet, for the purposes of confirmation, untested predictions are irrelevant. Therefore, the contrast between prediction and accommodation is a contrast between tested prediction and accommodation. So we need to explicate the notion of a tested prediction.

Tested predictions can be made by dividing the total data into two sets. The first data set is called the calibration set or the training set. The calibration set is used to determine a specific equation or set of equations using the curve fitting method already described. The remaining data is then used as the test set. The predictions obtained from the calibration data are 'validated' (or 'invalidated') by the test data. This procedure is referred to as cross validation in the statistics literature. In science, it is better known as plain old hypothesis testing (which is different from what statisticians call hypothesis testing).

In science, historical circumstances often determine what is used as the calibration data and what is used as the test data. The calibration set is commonly the set of data known at one particular point in time, while the test data consist of data collected at a later time. After that, the augmented data set is used as the calibration set, and still later data is used to test it. There is a natural historical sequence of cross validations in science.

Note that the notion of cross validation defined here is purely logical. It is a procedure that can be carried out with an arbitrary division of the known data into two sets. For example, we can use any two of the three experiments {a, b}, {b, c}, and {a, c} to predict the curve that applies to the third experiment. All three of these cross validation tests should be relevant to the confirmation of the hypothesis, even though at most one of them will arise historically.

Any cross validation test can lead to one of three possible outcomes: (A) The calibration data leads to no predictions that can be checked by the test data. (B) The calibration data leads to predictions, and the predictions are validated. (C) The calibration data leads to predictions, and the predictions are invalidated.

Case (A) can be divided further into two cases: (A1) The model makes no predictions from the calibration set, but is "validated" by the test set itself. That is, it would pass all cross validation tests that can be constructed within the test set. (A2) The model fails at least one cross validation test that can be constructed within the test set itself.

Case (A2) will be rare if the test set is very small. But in our logical conception of cross validation, the test set can be of any size.10 We think of case (A1) as being a case of accommodation. Clearly, we don't want to say that the test data is accommodated in case (A2). After all, "accommodation" is a success term, even if the success it denotes is weaker than predictive success.

It is also necessary to make a distinction between mere accommodation and accommodation. For even a strongly unified model fails to predict every aspect of the test data. At best it predicts the parameters of the curve that will best fit the test data. It does not, and should not, predict the individual data points.11 So, we are forced to say that it predicts some aspect of the test data and accommodates the rest.

More generally, if some evidence E is equivalent to the conjunction P&X, where P is predicted, but X is accommodated, then we are forced to say that E is accommodated, since it is not predicted. But it is not merely accommodated, because a part of it is predicted.

On the other hand, if X is evidence that is entirely unrelated to the hypothesis, then we don't want to say that the hypothesis accommodates X merely because it is not invalidated by it in any way. For example, we don't want to say that Galileo's law of free fall accommodates data about swarming bees simply because it doesn't entail it. The term "accommodation" applies only to data that falls within the domain of a theory.

It is very important to note that the distinction between prediction and accommodation is a logical distinction. It has nothing to do with the historical order of events or the intentions of the inventors or constructors of the theory.12

10 The calibration set can't be too small, for then it might not be large enough to estimate the parameters of the model accurately.
11 Logically speaking, this is related to the fact that "All swans are white" does not entail that a is a white swan. Rather, it entails that if a is a swan then a is white.
12 Unfortunately, White (2003) defines the distinction in psychological terms. I understand the distinction in purely logical terms because (a) it fits the examples, and (b) philosophers need to clearly separate logical from non-logical factors in confirmation. For otherwise all discussion of whether non-logical factors are relevant to confirmation becomes irrevocably muddled.


The conclusion of this section is that the relevance of the traditional distinction between prediction and accommodation to confirmation is adequately captured in terms of the notion of cross validation.

Why Cross Validation is Fundamental

In comparing models such as y = βx and y = α + βx, we have seen that the criterion of fit alone will invariably select the more complex model because it has greater flexibility in fitting any noise in the data. In curve fitting, the likelihood is the most basic measure of how well the hypothesis fits the data. Think of the equation for a curve like y = x and suppose that we add a bell-shaped error distribution for y given any particular value of x. The probabilistic hypothesis is written as y = x + u, where u represents the difference between the value of y given by the curve and the observed value of y. Because the error distribution is centered on the curve, observed values of y on or near the curve are more probable than values far from the curve. In fact, if the error distribution is a bell-shaped (Gaussian) distribution then the likelihood measure of fit is equivalent to the measure of fit assumed by the method of least squares. More exactly, it is the logarithm of the likelihood that recovers the least squares criterion. For that reason, the log-likelihood is widely accepted as the most versatile and universal measure of fit.
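To spell out the familiar calculation (with the common error variance σ² treated as known, an assumption made only to keep the expression simple), the Gaussian log-likelihood of a curve f given data (x_1, y_1), ..., (x_n, y_n) is

```latex
\log L \;=\; -\frac{n}{2}\,\log\!\left(2\pi\sigma^{2}\right)
\;-\; \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^{2}.
```

The first term does not depend on the curve, so maximizing the log-likelihood over curves is the same as minimizing the sum of squared residuals, which is exactly the least squares criterion.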

Forster and Sober (1994) describe one of many ways of trading off simplicity and fit in such a way that the simpler of the two models will be favored when the complex model's increase in fit is too small to be judged sufficient. The method of model selection described there is called Akaike's Information Criterion (AIC). The AIC criterion works in the following way. Let L(M) be the member of model M that best fits the data. This curve generally overfits the data, so that its fit tends to be too optimistic as an estimate of how well new data will fit the same curve. Akaike's theorem states that under a wide variety of circumstances, a good estimate of how much a model overfits the data is given by k, where k is the number of adjustable parameters. This estimate is plausible in that it says that more complex models have a tendency to overfit the data more than simpler models. But more than that, it gives a quantitative estimate of the effect (in the currency of log-likelihoods). Therefore, a better estimate of how well new data will fit the curve is obtained by subtracting k from the observed log-likelihood score. The model selection criterion is to select the model with the highest AIC score.
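The following sketch compares the two models from the beginning of this section by their AIC scores, computed as the maximized log-likelihood minus k (the data are simulated and the error variance is assumed known, so this only illustrates the arithmetic):

```python
import numpy as np

def max_log_likelihood(design, y, sigma=1.0):
    """Fit a linear model by least squares and return its maximized Gaussian
    log-likelihood together with its number of adjustable parameters k."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    n = len(y)
    logL = -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum(resid**2) / (2 * sigma**2)
    return logL, design.shape[1]

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, 50)
y = 2.0 * x + rng.normal(0.0, 1.0, 50)                   # data generated by y = 2x + noise

models = {
    "y = bx":     x[:, None],                            # k = 1
    "y = a + bx": np.column_stack([np.ones_like(x), x]), # k = 2
}
for name, design in models.items():
    logL, k = max_log_likelihood(design, y)
    print(name, "AIC score =", logL - k)                 # select the model with the highest score
```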

Classical Neyman-Pearson hypothesis testing in statistics exploits a qualitatively similar idea, in that it will accept the simpler null hypothesis (α = 0) over the alternative (α ≠ 0) unless the difference in fit is sufficiently great to favor the alternative. While statistical hypothesis testing is not designed explicitly to take simplicity into account, it does so implicitly, which explains why it works well in many situations.

There is a third method of model selection that Stone (1977) proves is asymptotically equivalent to AIC. It is a special kind of cross validation criterion that works in the following way. For a data set of n data, fit the models to the n subsets obtained from the data by removing a single datum; the omitted datum serves as the test set. In each case, record the log-likelihood of the omitted datum to obtain a cross validation score. Then add the n scores together, and select the model with the highest score. This is called the "leave-one-out" cross validation criterion.
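A sketch of the leave-one-out score for the same pair of models (again with simulated data and an assumed error variance):

```python
import numpy as np

def loo_cv_score(x, y, design_fn, sigma=1.0):
    """Leave-one-out cross validation: fit the model to all data but one point,
    record the log-likelihood of the omitted point, and sum over all choices."""
    n = len(y)
    score = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(design_fn(x[keep]), y[keep], rcond=None)
        resid = y[i] - (design_fn(x[i:i + 1]) @ coef)[0]
        score += -0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2)
    return score

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 10.0, 50)
y = 2.0 * x + rng.normal(0.0, 1.0, 50)

print("y = bx     :", loo_cv_score(x, y, lambda v: v[:, None]))
print("y = a + bx :", loo_cv_score(x, y, lambda v: np.column_stack([np.ones_like(v), v])))
```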

The interesting thing about "leave-one-out" cross validation, or any cross validation criterion, is that it does not use any explicit measure of simplicity. It trades off simplicity and fit in approximately the same way that AIC trades off simplicity and fit. Yet it does so without making any mention of simplicity.


In the same way, I contend that the cross validation of unified models takes account of their unification without the need to define or quantify it in any way. I see no particular advantage in constructing a single cross validation score by averaging the scores of separate tests. For I believe that confirmation is fundamentally a multifaceted relationship between a hypothesis and its evidence. For example, the directionality of some cross validation tests, for instance in predicting future data from past data, provides specific information about the hypothesis that would be lost if all scores were averaged together (Forster 2000).

The nexus of cross validation tests provides a rich picture of theory and evidence that ought to be exploited by philosophers of science. Subsequent sections show how cross validation relates to many controversial topics in the philosophy of science. I have already discussed the distinction between prediction and accommodation as one example. Topics coming up include counterfactuals and the nature of laws, common cause explanation as an argument for realism, the value of diversified evidence, historical versus logical theories of confirmation, and positive heuristics in scientific research programs.

Predictions from a Curve

The basic notion of prediction from a particular curve appears to be straightforward, but it is not. Suppose that we have completed the three steps of curve fitting, and have arrived at a best fitting curve, say y = β̂x, where β̂ is the value of β estimated from the data. For the sake of simplicity, suppose that β̂ = 1. What does the equation predict? The simplest answer is that it predicts whatever follows logically from the equation. First note that it doesn't predict data points. It does not predict, for example, that x = 4 and y = 4. This is analogous to the logical fact that the proposition "all swans are white" does not predict that a is a white swan, for some arbitrary object a. Rather, "all swans are white" predicts of an arbitrary object a that if it is a swan, then it is white. The question is: What is the meaning of the if-then statement?

I argue that it can be interpreted as a subjunctive conditional of the form: If it were the case that x = 4 then it would be the case that y = 4. Denote any such conditional as P → Q, where P is a statement like x = 4 and Q is a statement like y = 4. The problem with this view is that it is not clear how such predictions are observed to be true. I now argue that this is a surmountable problem.

For the sake of clarity, first assume that propositions like x = 4 and y = 4 can be observed to be true or false. This is merely an explicit statement of what has already been assumed: that data points are observational in character.13 Let us denote such directly observed facts by P and Q.

Certain deductions are now justified because the meaning of the conditional → is partially defined by the partial truth table below. In particular, suppose P and Q are observed to be true. Then the conditional P → Q is true. On the other hand, when P is true and Q is false, we can conclude that P → Q is false. Therefore the predictions of an equation like y = x can be observed to be true or false, provided that the antecedent P is true.14

        P     Q     P → Q
Row 1   T     T       T
Row 2   T     F       F
Row 3   F     T       ?
Row 4   F     F       ?

13 The view that predictions are observational is controversial in some examples, for it often appears that these statements assume the truth of other models or theories. I argue that if they make essential reference to other models or theories, they do not presuppose their truth. For example, Kepler's data about the three-dimensional positions of Mars relative to the sun were inferred from the two-dimensional positions of celestial bodies relative to the fixed stars using Copernican theory. But the data did not presuppose that Copernicus's theory was true.

The distinction being made here is between the case in which the subjunctive conditional has a true antecedent (P is true) and the case in which the conditional is counterfactual (P is false). The problem is that counterfactual conditionals are not observed to be true or false. The remaining question is whether counterfactuals can be confirmed in any sense, given that they cannot be directly confirmed.
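The partial truth table can be expressed as a small decision procedure, which makes the cases explicit (a sketch; the function name is mine, and "None" marks the counterfactual cases that the observations leave undecided):

```python
def evaluate_prediction(P_observed: bool, Q_observed: bool):
    """Evaluate the conditional P -> Q against observations of P and Q.
    Returns True (observed to be true), False (observed to be false), or
    None when the antecedent is false and the conditional is counterfactual."""
    if P_observed:
        return Q_observed     # rows 1 and 2 of the truth table
    return None               # rows 3 and 4: the data leave the conditional undecided

# e.g. the curve y = x predicts: if x = 4 then y = 4
print(evaluate_prediction(True, True))    # True: prediction validated
print(evaluate_prediction(True, False))   # False: prediction invalidated
print(evaluate_prediction(False, True))   # None: counterfactual case
```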

The Law-Likeness of Equations

In the traditional philosophical literature, a distinction is made between scientific laws, such as "All copper conducts electricity", and accidentally true generalizations, such as "All pieces of gold weigh less than 10,000 kg." The intuitive difference between them is that the law implies of any object a that if a were made of copper then it would conduct electricity. For example, if the moon were made of copper, then it would conduct electricity. Accidentally true generalizations are different. Assuming that it is true that all pieces of gold weigh less than 10,000 kg, it is true because there is never enough gold accumulated in one place for it to form, or be formed into, a single lump. This intuition is coupled with an equally strong intuition that the generalization may be true without the following counterfactual conditional being true: If the moon were made of gold, then it would weigh less than 10,000 kg. To the contrary, our intuition is that if the moon were made of gold, it would weigh more than 10,000 kg. This is because other facts about the moon, such as its current size, would not change if we were to learn that the moon has a gold core. It follows that accidentally true generalizations of the form "All F's are G's" do not imply counterfactuals of the form: if object a were an F, then a would be a G.

In the beam balance example, our intuition is that the equation d(b) = d(a) is law-like in the sense that it entails, for any (positive) number x, the counterfactual conditional: If d(a) were equal to x, then d(b) would be equal to x. Does this intuition have any justification?

On its face, the law-likeness of the equation has no justification. For counterfactual conditionals cannot be observed to be true or false. If this is correct, then it is puzzling why such an equation, if it is true at all, should be viewed as anything more than an accidentally true generalization. At least it's strange if we think only in terms of the direct confirmation of counterfactuals.

The Cross Validation of Equations

Since the concept of confirmation is well established in philosophy of science, I shall refer to the particular theory of confirmation suggested here as a theory of cross validation. By saying that observations validate a proposition, I mean only that they afford some degree of evidential support. Validation comes in degrees, and is weaker than conclusive proof. What follows is not a theory of validation, but an account of some intuitions that a theory of validation might explain, or explain away.

14 We could reverse the role of P and Q, so that the truth of Q → P can also be observed. However, it is not necessarily the case that this proposition is entailed by the equation. This issue arises within the context of causal modeling.


The first issue is whether the law-like content of an equation can be supported, or validated, by any observational evidence. If we look at evidence showing that some of the direct predictions of an equation are true, then we might argue as follows. Our choice of values for the independent variables is entirely free. The experimenter chooses, in each instance, some particular distance to hang object a from the fulcrum. But she could have chosen differently, and in other instances, she did choose differently. Since the choice was freely made, within some range of values at least, the confirmation we obtained applies to all values of the independent variable within that range. This includes counterfactual values. Therefore, the counterfactual import of the equation is confirmed by direct evidence.

We saw in Fig. 6 that curve fitting tends to be less reliable for values outside the range of x-values actually observed. So, our conclusions need to be qualified in terms of the actual range of x-values observed.

The case becomes stronger when one considers the indirect evidence for a particular equation in the broader context of a unified set of models. The key point is that one can predict which particular curve will best fit the {a, b} experimental data independently of which particular set of x-values is chosen. The indirect evidence, obtained from observations in the {b, c} and {a, c} experiments, is evidence for all instances of the equation d(b) = d(a), rather than any particular instance of the equation.

Therefore, the sum of the direct and indirect evidence for the equation does appear to provide good justification for believing in its counterfactual predictions. This explains our interpretation of such equations as law-like.

In contrast, there is no strong indirect evidence supporting the generalization All pieces of gold weigh less than 10,000 kg. To the contrary, we believe that, without exception, we can join two pieces of gold to make a bigger piece. We know nothing that would exclude ten 1,000 kg pieces of gold from being joined together. We have good reason to believe that it will not happen, but no good reason to believe that it could not happen. Therefore, if we were told suddenly that the moon is made of gold, we would not accommodate the new evidence by revising our beliefs about the size of the moon. We would, more simply, revise our belief that all pieces of gold weigh less than 10,000 kg.

It is somewhat controversial whether this difference is merely a difference in how we would revise our beliefs in the face of new evidence, or whether it is, in addition, good evidence for believing that the world behind the appearances is a certain way. The latter view is referred to as scientific realism. Let me remark that the realist view has an added plausibility in the beam balance example. For in the beam balance example, the indirect evidence for the equation d(b) = d(a) is exactly the same evidence that supplies an independent measurement of a mass ratio. The empirical agreement of two independent measurements of mass is good evidence that mass is a real quantity. Realism about counterfactuals and realism about theoretical entities fit together like hand and glove.

Realism and Common Cause Explanations

The agreement of independent measurements is a correlation observed between two quantities. Like many observed correlations, it has a common cause explanation; roughly, the two quantities have the same value because they are measurements of the same physical quantity.


Common cause explanations are commonplace. The ancients observed a correlation between the phases of the moon and the occurrence of extreme tides. Newton explained it as a gravitational effect: when the moon is full or when it is new, the sun and moon are lined up with the earth, which means that their gravitational pulls reinforce each other's effect on the tides, producing extra high high tides and extra low low tides. Or we may observe a correlation between barometer readings and storms, and explain it in terms of a common cause: low atmospheric pressure.

However, not all correlations are explained in terms of a common cause, even those between events that are space-like separated ("space-like separated" means that no signal traveling at subluminal speeds can travel from one event to the other). The beam balance example provides one illustration of this (Forster 1986). The equation d(b) = d(a) represents a correlation between d(a) and d(b). When one quantity is low, so is the other, and when one quantity is high, so is the other. The explanation is not that there is a third quantity that is correlated with each of the two correlated variables. There is no such quantity postulated in the model. Rather, the explanation is that only cases in which the beam balances are entered into the data, and Newton's laws ensure that the correlation holds in those cases.

There are many more examples of correlations that are not explained in terms of a common cause (see Arntzenius 1993 for a survey). Most famously, some spin correlations predicted by quantum mechanics are of this kind. Bell (1964) was the first to prove that a common cause theory that assumes that electrons really have a spin state that causes the outcome of a spin measurement (such that this spin state is not influenced by any non-local events) leads to a false prediction. By logic alone, this implies that the common cause explanation is false.

As van Fraassen (1980) was quick to point out, the observation of a correlation does not automatically justify the introduction of a common cause. This means that there is no quick realist argument to the effect: every correlation has a common cause, so if we observe a correlation that has no observed common cause, then we are justified in inferring the existence of an unobserved common cause. This would have been a coup for realism. While there is no common cause principle of this kind, it would be short-sighted to conclude that there are no common cause explanations that provide strong intuitive support for realism.

The bottom line is that cross validation in curve fitting involves a kind of common cause explanation in which an observed correlation is explained by the existence of a quantity that is independently measured. The quantity is represented in terms of the adjustable parameters of the theory. It is never expressed as a fixed function of observed quantities. In such cases, the theory is presenting an explanation of the observed correlation that invites us to believe in the existence of a postulated theoretical quantity. Whether this leads to a convincing argument for realism, or not, is controversial among philosophers. My purpose here is not to resolve the controversy. Rather, my only claim is that common cause explanations of this kind are correctly viewed as playing a central role in any realist interpretation of the quantitative sciences.

Variety of Evidence

The intuitive idea that a variety of evidence is confirmationally valuable is usually illustrated by examples such as "all birds are warm-blooded". It does seem that "all birds are warm-blooded" is better confirmed by sampling birds in a variety of different geographical locations than by repeatedly examining New Zealand birds. However, this intuition might be explained by the elementary logical fact that "all birds are warm-blooded" is equivalent to the conjunction of two hypotheses: all New Zealand birds are warm-blooded and all other birds are warm-blooded. If only New Zealand birds are sampled, then we have confirmed the first part of the hypothesis, but not the second.

In the beam balance example, the variety of evidence has a deeper significance.15 For the sake of simplicity, suppose that the objects a, b, and c have equal masses. The proposition being confirmed is the conjunction [d(a) = d(b)] & [d(b) = d(c)] & [d(a) = d(c)], where it is understood that each equation is restricted to the appropriate experimental context. (The notation is misleading because it appears that the third equation is a logical consequence of the other two, but this is not true in Model 2.)

Now compare three sets of data. Data 1 consists in 90 trials of the {a, b} experiment with exactly the same value of d(a) chosen in every case. This is the least varied of the three data sets. Data 2 consists in 90 trials of the same experiment with a variety of different values of d(a). Data 2 is more varied than Data 1. The most varied data set is Data 3, which consists of 30 trials of each of the three experiments, with the 30 values being varied within each experiment.

It seems that Data 2 does a better job of validating the first equation, d(a) = d(b), than Data 1 because Data 2 is more varied. The intuition is clear enough. But can we provide some insight into why this is so? It's not that the unvaried data fails to pick out a single equation in the model. Since all lines pass through the origin, only one point is needed to determine a unique line.

One problem with Data 1 is that it fails to give us any evidence for or against more complex versions of the model, such as the model with the extra parameter α. More complex models include lines that do not pass through the origin. In these models, a single data point does not determine a unique curve. Data 2 allows us to falsify such models, whereas Data 1 does not.

Nevertheless, there is another way of looking at the difference. Suppose that the 90 data are divided into two subsets, where the model is fitted to one subset, and then tested against the other. Equivalently, we could fit the model independently to both subsets of data, and see whether the same curve results in each case. If there is good agreement, the model is confirmed by the data.

In the case of the unvaried data, Data 1, the agreement appears to be guaranteed a priori. So long as there is not much noise in the data, and no matter how we divide the data, the test seems trivial. Any model will pass the test as well as any other. Equivalently, we might say that the two subsets of data fail to provide independent measurements of the mass ratio m(a)/m(b). The agreement of independent measurements of parameters is the sign of a successful cross validation test. Data 1 does not provide such a test.

In the case of the more varied data set, Data 2, it can be divided into subsets that provide some non-trivial cross validation tests. In particular, we could collect all the data such that d(a) is less than a certain value into one subset, and all the data such that d(a) is greater than that value into the remaining subset. Then the lines that best fit the two data sets can have different slopes. In other words, the measurements of the mass ratio are independent, and therefore their agreement is not guaranteed. And the more varied the evidence, the stronger the cross validation.16 The cross validation story still appeals to the idea of potential falsification, but it is more specific about the source of the falsifiability.

15 Hempel (1966, p. 34) discusses the diversity of evidence within the context of three applications of Snell's law, which has the same unified structure as the beam balance model. The ratios of indices of refraction play the role of mass ratios. Mendel's genetic model of the inherited traits of pea plants also has a unified structure. An intriguing discussion of this example is found in Arntzenius, Frank (1995): "A Heuristic for Conceptual Change," Philosophy of Science 62: 357-369. None of these authors highlights the role of unification in these examples.
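Here is a sketch of the contrast between Data 1 and Data 2 (the masses, noise level, and dividing value are invented): the varied data can be split at a threshold value of d(a) into two subsets that yield genuinely independent slope estimates, whereas any division of the unvaried data repeats essentially the same measurement.

```python
import numpy as np

rng = np.random.default_rng(4)
mass_ratio = 1.0                           # a and b are assumed to have equal masses

def slope_through_origin(d_a, d_b):
    """Least squares estimate of the slope in d(b) = slope * d(a)."""
    return np.sum(d_a * d_b) / np.sum(d_a * d_a)

# Data 2: 90 trials of the {a, b} experiment with varied values of d(a)
d_a = rng.uniform(1.0, 10.0, 90)
d_b = mass_ratio * d_a + rng.normal(0.0, 0.05, 90)

low, high = d_a < 5.0, d_a >= 5.0          # divide the data at an arbitrary threshold
print(slope_through_origin(d_a[low], d_b[low]),
      slope_through_origin(d_a[high], d_b[high]))   # independent estimates; agreement not guaranteed

# Data 1: 90 trials with the same value of d(a) in every case
d_a1 = np.full(90, 4.0)
d_b1 = mass_ratio * d_a1 + rng.normal(0.0, 0.05, 90)
# Any division of Data 1 repeats essentially the same measurement, so the
# "agreement" of its two halves is guaranteed and the test is trivial.
```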

The point is strengthened when considering the most varied data set, Data 3. Under Model 2, there is a tradeoff. Data 2 uses 90 data points to directly confirm the equation d(a) = d(b). When 60 of those data points are moved to other experiments, the strength of the evidence for the first equation is reduced. There is some loss in the confirmation of the first equation, which pays off in the gain of confirmation of the second and third equations.17

In contrast, there may be no loss at all in the unified model. For in moving the 60 data points to the other experiments, there is an additional payoff in the cross validation of the first equation, which may outweigh the loss in direct confirmation. Even if the loss is not outweighed by the gain in cross validation, the loss is certainly less. In addition, the same advantage is gained in the validation of the other equations. This adds up to an impressive increase in the confirmational power of diversified evidence.

Historical Versus Logical Theories of Confirmation

Hempel (1966, 37) claims that "it is highly desirable for a scientific hypothesis to be confirmed" by "new" evidence, that is, by facts that were not known or not taken into account when the hypothesis was formulated. "Many hypotheses and theories in natural science have indeed received support from such 'new' phenomena, with the result that their confirmation was considerably strengthened."

Hempel illustrates the point in terms of a hypothesis formulated by J. J. Balmer in 1885. From the first four wavelengths in the emission spectrum of hydrogen, Balmer constructed a formula that reproduced the values of λ for n = 3, 4, 5, and 6 as follows:

    λ = b n² / (n² − 2²).

The constant b is an adjustable parameter in Balmer's model, which he found to be approximately 3645.6 Å by fitting the model to the 4 data points.

Balmer's formula now predicts the value of λ for higher values of n. Balmer was unaware that 35 consecutive lines in the so-called Balmer series had already been measured, and that his predicted values agreed well with the measured values.18
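The logic of the calibration can be reproduced in a few lines (the four calibration wavelengths below are standard modern values in Ångströms, inserted for illustration; they are not quoted from Hempel or Balmer):

```python
import numpy as np

# Calibration set: the first four lines of the hydrogen spectrum, in Angstroms
n_calib = np.array([3, 4, 5, 6])
lam_calib = np.array([6562.8, 4861.3, 4340.5, 4101.7])

def balmer(n, b):
    """Balmer's formula: lambda = b * n^2 / (n^2 - 2^2)."""
    return b * n**2 / (n**2 - 4)

# Least squares estimate of the single adjustable parameter b
g = n_calib**2 / (n_calib**2 - 4)
b_hat = np.sum(g * lam_calib) / np.sum(g * g)
print("b =", b_hat)                        # roughly 3645.6 Angstroms

# Test set: the predicted wavelengths for lines not used in the calibration
for n in (7, 8, 9):
    print(n, balmer(n, b_hat))             # to be compared with the measured wavelengths
```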

It is uncontroversial to say that the agreement of Balmer's predictions with the unseen data confirmed his formula. Yet, as Hempel notes, a puzzling question arises in this context.19 What if Balmer's model had been constructed with full knowledge of all 35 lines of the Balmer spectrum? In this fictitious example, the model and total data are exactly the same. The only difference is the historical order of events. Should this difference make any difference to the confirmation of Balmer's formula? If confirmation is a logical relationship between theory and evidence, then historical circumstances should make no difference.

16 Kruse (1997) argues for varied evidence in terms of the reliability of estimates of predictive accuracy.
17 There seems to be a law of diminishing returns involved here in the sense that "the increase in confirmation effected by one new favorable instance will generally become smaller as the number of previously established favorable instances grows" (Hempel 1966, 33).
18 For more detail, see Chapter 4 of Holton, G. and D. H. D. Roller (1958), Foundations of Modern Physical Science, Reading, MA: Addison-Wesley Publishing Co.
19 For additional discussion of this question, see Musgrave (1974).

There is good reason to mistrust a model when the model is "fudged". We have already seen that a model with as many adjustable parameters as data can yield perfect fit. Good fit can sometimes be bad. The key point to raise here is that confirmation is not based on fit with the data. It is based on the outcomes of all possible cross validation tests, and these are independent of the actual historical circumstances.

Balmer's model has only one adjustable parameter. If it had had 35 adjustable parameters, then it would have been unacceptable in either historical circumstance. As it was, there was no question that it would have passed all cross validation tests in which the formula is fitted to any four wavelengths and tested against the remaining 31 members of the series. The success of the historically induced cross validation test is sufficient to show that it would have passed all cross validation tests in which the calibration data contained sufficient data to fix the value of b.

Prediction in advance has some value to those who are ignorant of the logic of the example. Here we have a valuable public relations tool, for the general public knows that science can be fudged. If a model is fudged on four data points, by using four adjustable parameters, it will not predict the other 31 data points correctly. And even for the experts, the derivation of the model is sometimes very complex, or based on weakly justified approximating assumptions. In such cases, prediction in advance is also a valuable check.

History is valuable if the logic of the situation is not fully known. This is fully compatible with a logical theory of confirmation provided that historical circumstances are irrelevant once the logic of the situation is fully specified. This is a fortunate fact about confirmation, for there are numerous important scientific predictions that have not been made in advance, but this has not prevented them from impressing the scientific community.

One famous example is Einstein's prediction of the precession of the perihelion of Mercury, which the Newtonians had failed to explain for decades.20 The planet Mercury has the largest observed precession, of 574 seconds of arc per century. In Newtonian mechanics any precession of a planet's perihelion requires that the effective radial dependence of the net force on the planet be slightly different from 1/r², where r is the distance from the sun. This is what happens when the gravitational influence of the other planets is added to that of the sun. However, detailed Newtonian calculations of that effect predict it to be approximately 531 seconds of arc per century, which fails to account for the difference of 43 seconds of arc.

Einstein's general theory of relativity correctly predicted the residual precession. The Einsteinian model was constructed with full knowledge of the correct value. Nevertheless, the logic of the derivation was open to inspection by anyone who could follow the mathematics. The model was the simplest one possible, treating Mercury as a "test" particle moving in a spherically symmetric gravitational field. Nobody has downplayed the significance of this test of relativity simply because it was not predicted in advance, or because the calculations were done with the explicit intention of showing that the answer is 43 seconds of arc.

20 My account is taken from Marion (1965).


Various Examples

Consider Planck's black-body radiation formula (Fig. 7), which succeeded in fitting the known data for both high and low radiation frequencies simultaneously, with only one new adjustable parameter, which we now know as Planck's constant, h. Prior to Planck's law, there was no single formula that achieved this. The achievement was that Planck's formula could be cross validated by dividing the data into low and high wavelengths. Prior to Planck's discovery, this could not be done.
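For reference, a standard modern statement of the law (not quoted from the text) for the spectral energy density is

```latex
u(\lambda, T) = \frac{8\pi h c}{\lambda^{5}}\,\frac{1}{e^{hc/\lambda k_B T} - 1},
\qquad
u \approx \frac{8\pi k_B T}{\lambda^{4}} \ \text{(long wavelengths, the Rayleigh-Jeans regime)},
\qquad
u \approx \frac{8\pi h c}{\lambda^{5}}\, e^{-hc/\lambda k_B T} \ \text{(short wavelengths, the Wien regime)}.
```

A single value of h fitted at one end of the spectrum thus commits the formula to definite predictions at the other end, which is what makes the cross validation between the low- and high-wavelength data possible.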

Spurred by his success, Planck looked for a deeper understanding of his law. He derived his law from his quantum hypothesis a few months later.21 Nevertheless, Planck's quantum hypothesis was rightly viewed with a healthy skepticism. It was not widely accepted until Einstein used it to explain the photoelectric effect in 1905.22 In 1902 Lenard had already documented some qualitative features of the photoelectric effect. Unfortunately, Lenard's data did not allow for an independent determination of Planck's constant. It was Millikan who collected the necessary data in 1914, for which he won the Nobel Prize in 1923. After Millikan's work demonstrated the agreement of the two measurements of Planck's constant, Einstein was awarded the prize for his paper on the photoelectric effect. The eventual agreement of the two independent measurements of Planck's constant is another example of the importance of cross validation.

Two frequently cited examples of confirmation in philosophy of science are examples of prediction in advance. There is the famous prediction of the return of Halley's comet in 1759, and also the discovery of the planet Neptune by Le Verrier and Adams. I want to show that these examples are well understood in terms of the logic of cross validation.

In the case of Halley's comet, it was not the mere return of the comet that impressed people. It is not difficult to predict that a comet observed in 1531, 1607, and 1682 will return in 1759, assuming that the period of motion is constant. The simple-minded prediction from the observed correlation was that Halley's comet would pass its perihelion (the closest point to the sun) in the middle of 1759. The extraordinary fact was that Clairaut predicted that Halley's comet would not return in the middle of 1759, but nearer the beginning of 1759. His prediction was based on perturbations due to the gravitational influence of Jupiter and Saturn. This is an example of successful cross validation between data pertaining to Jupiter and Saturn and the path of Halley's comet. Perturbation theory is not logically airtight, so the fact that the prediction was made in advance is undeniably an important feature of the example. On the other hand, predictions can be true by luck alone, so the logic of the cross validation is important as well.

21 Annalen der Physik, vol. 4, 1901. 22 Translations of the original papers by Planck and Einstein are reprinted, with commentary, in Shamos, Morris H. (ed.) (1959): Great Experiments in Physics: Firsthand Accounts from Galileo to Einstein. New York, Dover Publications Inc.

Figure 7: The graph is a plot of Planck's law (solid line), showing energy versus wavelength at T = 1595°K. The experimental data (circles) are fashioned after those shown in a diagram on page 122 of Resnick (1972).


Here is one more example. Prior to the discovery of Neptune in 1846, wiggles in the observed motion of Uranus could not be explained by the gravitational interactions of the known planets. Le Verrier and Adams therefore investigated a model in which an eighth, as-yet-unknown planet caused those wiggles. When this model was combined with the known positions of Uranus, the model predicted the orbit of the postulated planet. When telescopes were pointed at the predicted location of the planet, Neptune was sighted, and confirmed to be a planet.23 The disconfirmation of one Newtonian model was followed by the cross validation of a more complex Newtonian model with one additional adjustable parameter (the mass of Neptune). Here, the loss of simplicity is amply compensated by the increased fit, not only with previously unexplained features of Uranus's motion, but also with the new test data pertaining to the observations of the newly discovered planet.

Conclusions and Comparisons

Theories are families of models and models are families of curves. However, an omniscient being would have no need for theories or models, for the complete truth could be formulated in terms of a set of interconnected equations. In contrast, scientists make informed choices between rival models on the basis of available empirical information. The patchwork of interconnected equations that results from the best confirmed model represents the state of scientific knowledge.

The story I have told about what properties distinguish winning and losing models is a story about the "logic" of science, or at least it is a story that applies to a wide range of quantitative sciences. It is necessarily simplified in many respects, and deliberately ignores the social and psychological dimensions of science. In this regard it is similar to the traditional solutions to the problem of scientific confirmation.

Until now, I have tried to minimize negative commentaries on received views in philosophy of science. Now is the time to correct that fault.

The most influential theories of confirmation have been the hypothetico-deductive view of confirmation, Hempel's theory, Popper's theory of corroboration, likelihood theories, and the Bayesian theory of confirmation. One striking fact about this list is that it is not short. A second striking fact is that all these theories are similar in their use of fit as the fundamental component in their definitions of confirmation. This broad characteristic will provide sufficient grist for my mill.

The traditional theories all claim that confirmation is a relation between a hypothesis H and evidence E, where E is a statement of observed fact. For example, in the case of the hypothetico-deductive theory the confirmation depends on which of three possible logical relationships holds between H and E: (i) H entails E, in which case E is said to confirm H, (ii) H entails the negation of E, in which case E is said to disconfirm, refute, or falsify H, or (iii) neither (i) nor (ii) holds, in which case E is confirmationally neutral for H. In likelihood theories of confirmation, the entailment relation is replaced by the probabilistic relation Pr(E|H), which is read "the probability of E given H". Pr(E|H) is called the likelihood of H relative to E, where "likelihood" is a technical term. The likelihood of H relative to E should not be confused with the probability of H given E, which is written Pr(H|E). Popper's theory of corroboration and Bayesianism are built on the likelihood relation. Moreover, the likelihood relation is a probabilistic generalization of the entailment relation, as is shown by the following fact: If H entails E, then Pr(E|H) = 1.

23 It had been seen before and mistaken for a fixed star.


So, the likelihood measure of fit with data is the foundation stone of all traditional theories of confirmation.

The reason for replacing a single measure of fit with the nexus of cross validations is not merely that fit without simplicity is a poor indicator of predictive accuracy (Forster and Sober 1994), but mainly that the multifaceted relationship of a hypothesis with its evidence is important. A single measure of fit is a rather blunt instrument for assessing the confirmation of a hypothesis. When a model goes wrong, a multifaceted measure of confirmation can tell us not only that something is wrong, but it can tell us specifically what is wrong, and this may provide clues about how to correct the problem. This is what Lakatos (1970) refers to as a positive heuristic.

For example, Kepler's third law says that the cubes of the planets' distances from the sun are proportional to the squares of their periodic times. In Newton's theory, the constant of proportionality is a measure of the sun's mass. Or more precisely, if one models each planet in the simplest way, as interacting only with the sun, and assumes that the mass of the planet is much smaller than the mass of the sun, then Newton's theory implies that the ratio of the cube of the distance to the square of the period is proportional to the sun's mass. If this model is applied to all the planets, then the simple fact that the mass of the sun is the same in each case yields Kepler's third law. The planets from Mercury to Mars agreed with Kepler's third law well, but Newton knew that Jupiter and the planets outside of Jupiter did not provide independent measurements of the sun's mass that agreed with those of the inner planets. The nature of the disagreement provided a clue about how to remove the discrepancy. Given that the mass of Jupiter is close to one thousandth of the mass of the sun, which is not negligible, Newton showed that Jupiter and the sun will revolve around their common center of mass as if they were each test particles revolving around a common body with a mass equal to the sum of their masses. The ratio of the cube of the distance to the square of the period for Jupiter and for the planets outside of Jupiter therefore provided measurements of the sum of the masses of the sun and Jupiter. The mass of Jupiter was also independently determined by the motion of its moons. So, the more sophisticated model not only removed the discrepancy, but also strengthened the agreement of independent measurements of the masses of the sun and Jupiter.
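In symbols (a standard textbook statement of the two models, added for illustration):

```latex
\frac{a^{3}}{T^{2}} = \frac{G\,M_{\odot}}{4\pi^{2}} \quad \text{(planet treated as a test particle)},
\qquad
\frac{a^{3}}{T^{2}} = \frac{G\,(M_{\odot} + M_{p})}{4\pi^{2}} \quad \text{(two-body correction for a planet of mass } M_{p}\text{)},
```

where a is the planet's mean distance from the sun and T its period. Each planet's value of a³/T² therefore measures the sun's mass in the one-body model, and the sum of the sun's and the planet's masses in the two-body model; for Jupiter the correction is about one part in a thousand, in line with the remark above that Jupiter's mass is not negligible.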

It seems to me that a theory of confirmation based on cross validation testing would account for these examples in a far more direct and natural way than any formula that compares models one-dimensionally, although it is equally true that such a formula would work in this example. The extra mass parameter introduced in the shift from a one-body model to a two-body model is more than justified by the improvement in fit.

Traditional theories of confirmation have been slow to recognize that fit must be traded off against simplicity and unification. They have been even slower in figuring out how to do it.24 The alternative, which I am recommending, is a conceptually radical move. For it takes particular kinds of correlations to be the fundamental units of evidence, most commonly correlations between independent measurements of theoretical quantities.25

These ideas may not look like much when they are taken one at a time. But they add up to something stronger when they are seen as part of a single systematic philosophy of the quantitative sciences.

24 Myrvold (2003) recently made progress in understanding unification within a Bayesian framework.
25 I have argued elsewhere (http://philosophy.wisc.edu/forster/papers/QM%20Consilience.htm), in the case of the quantum mechanics of spin ½ particles, that the relevant cross validation is between independent determinations of the anti-commutation property of the Pauli spin operators.


References

Arntzenius, Frank (1993): "The Common Cause Principle." PSA 1992, Volume 2: 227-237. East Lansing, Michigan: Philosophy of Science Association.
Arntzenius, Frank (1995): "A Heuristic for Conceptual Change," Philosophy of Science 62: 357-369.
Bell, John S. (1964): "On the Einstein-Podolsky-Rosen Paradox," Physics 1: 195-200.
Butts, Robert E. (ed.) (1989): William Whewell: Theory of Scientific Method. Indianapolis/Cambridge: Hackett Publishing Company.
Forster, Malcolm R. (1986): "Unification and Scientific Realism Revisited." In Arthur Fine and Peter Machamer (eds.), PSA 1986. E. Lansing, Michigan: Philosophy of Science Association. 1: 394-405.
Forster, Malcolm R. (2000): "Key Concepts in Model Selection: Performance and Generalizability," Journal of Mathematical Psychology 44: 205-231.
Forster, Malcolm R. (2002): "Predictive Accuracy as an Achievable Goal of Science," Philosophy of Science 69: S124-S134.
Forster, Malcolm R. and Elliott Sober (1994): "How to Tell when Simpler, More Unified, or Less Ad Hoc Theories will Provide More Accurate Predictions," British Journal for the Philosophy of Science 45: 1-35.
Forster, Malcolm R. and Elliott Sober (2004a): "Why Likelihood?," in Mark Taper and Subhash Lele (eds.), Likelihood and Evidence. Chicago and London: University of Chicago Press.
Forster, Malcolm R. and Elliott Sober (2004b): "Reply to Boik and Kruse," in Mark Taper and Subhash Lele (eds.), Likelihood and Evidence. Chicago and London: University of Chicago Press.
Goodman, Nelson (1965): Fact, Fiction and Forecast, Second Edition. Cambridge, Mass.: Harvard University Press.
Harper, William L. (2002): "Howard Stein on Isaac Newton: Beyond Hypotheses." In David B. Malament (ed.), Reading Natural Philosophy: Essays in the History and Philosophy of Science and Mathematics. Chicago and La Salle, Illinois: Open Court, 71-112.
Harré, Rom (1981): Great Scientific Experiments. Oxford: Phaidon Press.
Hempel, Carl G. (1966): Philosophy of Natural Science. Englewood Cliffs, NJ: Prentice-Hall, Inc.
Holton, G. and D. H. D. Roller (1958): Foundations of Modern Physical Science. Reading, MA: Addison-Wesley Publishing Co.
Kruse, Michael (1997): "Variation and the Accuracy of Predictions," British Journal for the Philosophy of Science 48: 181-193.
Lakatos, Imre (1970): "Falsification and the Methodology of Scientific Research Programmes," in I. Lakatos and A. Musgrave (eds.), Criticism and the Growth of Knowledge. Cambridge: Cambridge University Press, pp. 91-196.
Langley, P., H. A. Simon, G. L. Bradshaw, and J. M. Zytkow (1987): Scientific Discovery: Computational Explorations of the Creative Processes. Cambridge, Mass.: MIT Press.
Marion, Jerry B. (1965): Classical Dynamics of Particles and Systems. New York and London: Academic Press.
Musgrave, Alan (1974): "Logical Versus Historical Theories of Confirmation," The British Journal for the Philosophy of Science 25: 1-23.
Myrvold, Wayne (2003): "A Bayesian Account of the Virtue of Unification," Philosophy of Science 70: 399-423.
Popper, Karl (1959): The Logic of Scientific Discovery. London: Hutchinson.
Priest, Graham (1976): "Gruesome Simplicity," Philosophy of Science 43: 432-437.
Reichenbach, Hans (1938): Experience and Prediction. Chicago: University of Chicago Press.
Reichenbach, Hans (1956): The Direction of Time. Berkeley: University of California Press.
Resnick, R. (1972): Basic Concepts in Relativity and Early Quantum Theory. New York: John Wiley & Sons.
Shamos, Morris H. (ed.) (1959): Great Experiments in Physics: Firsthand Accounts from Galileo to Einstein. New York: Dover Publications Inc.
Sober, Elliott (1988): "The Principle of Common Cause." In J. Fetzer (ed.), Probability and Causality, pp. 211-228. Dordrecht: Kluwer Academic Publishers.
Stone, M. (1977): "An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike's Criterion," Journal of the Royal Statistical Society B 39: 44-47.
van Fraassen, Bas (1980): The Scientific Image. Oxford: Oxford University Press.
White, Roger (2003): "The Epistemic Advantage of Prediction over Accommodation," Mind 112: 653-683.

