A CRITICAL APPRAISAL OF CAUSAL DISCOVERY ALGORITHMS
Paul Humphreys
Department of Philosophy
University of Virginia
The research program described in detail in Spirtes,
Glymour and Scheines 1993 and outlined by Clark Glymour in the
paper published in this volume (Glymour 1995) has more than one
goal. One of these goals is to provide a connection between
causal structure and probabilities by using an explicit
axiomatization that links directed graphs with probability
distributions over the vertices of those graphs. In using an
axiomatic framework, these authors approach the project of
connecting probability and causation from a different direction
than most other authors, for in the philosophical literature
connections between probability and causation have usually taken
the form of explicit definitions. Sometimes these definitions
attempt to reduce causation to probabilistic concepts (e.g.
Suppes 1970); in other approaches the procedure is to build up
complex causal relations from simple causal relations taken as
primitive, using probabilistic invariance as the defining
characteristic of causal relations (e.g. Humphreys 1989).
The axiomatic approach has, in contrast to these
philosophical efforts, been used to good effect in Judea Pearl's
earlier work on probabilistic inference in artificial
intelligence (Pearl 1988). Spirtes, Glymour, and Scheines'
(hereafter referred to as SGS) interest is also in making
inferences, the idea being that on the basis of statistical
independence and dependence relations between variables, perhaps
supplemented by auxiliary information, we can infer the presence
or absence of causal relations between those variables and
represent the causal relations by edges in a directed graph. I
shall address later the important question of the exact sense in
which these graphs are actually representations of causal
relations, rather than of probabilistic dependency relations,
but for now it bears emphasizing that Glymour's account is not a
probabilistic theory of causation in the sense usually
understood by philosophers -- the relations between variables
are deterministic for Glymour, and increases in probability play
no role for him in identifying causal relations.1
A second goal is to provide computationally feasible
algorithms based on those axioms. These algorithms are designed
to discover at least some of the causal relations between
specified variables
of scientific interest and, importantly, SGS claim to be able to
sometimes make these discoveries without using theoretical
background knowledge about which variables can and cannot be
causally connected. In doing this, these algorithms are said to
have an advantage over existing methods of inferring causal
structure, such as structural equation models, which do
explicitly have to rely on non-statistical assumptions about
what are plausible and implausible candidates for causal
relations between variables. Such assumptions are often called,
very misleadingly in my view, `a priori' assumptions.2 It is
better to view them as essentially using background knowledge
about causal relations, where this background knowledge does not
come from the data. This background knowledge may be based on
explicit theory or it may be tacit practitioner's knowledge or
it may be just `common sense' but it will rarely be a priori in
the traditional philosophical sense of knowable independently of
any particular empirical experience.
The extent to which the SGS algorithms require background
knowledge varies, and their presentation of their case sometimes
makes the data-driven nature of their enterprise seem purer than
it actually is. But as far as I can tell, they never make the
extreme claim that data analysis alone can give a unique causal
graph. Sometimes, a time ordering on the vertices, together
with other assumptions, will do (SGS 1993, p. 133: more on this
in section II below) and indeed all of their algorithms employ
that information (ibid, p.127); in other examples background
causal knowledge is incorporated into the programs (ibid,
p.127). Yet they also claim that their programs recovered
almost all of a 37 variable model of an emergency medical system
without any prior information about the ordering of the
variables (ibid., pp.11, 146). It would be more accurate, I
think, to portray their primary interest as an investigation of
the extent to which nonexperimental methods can arrive at causal
conclusions, and under what assumptions (e.g. ibid, p.vii, 3,
10, 22).
A third goal is to develop methods for predicting the
results of interventions in existing causal structures. I shall
not have much to say here about this third goal, for it has been
discussed in admirable detail by James Woodward and David
Freedman in their contributions to this volume (Woodward 1995,
Freedman 1995). Fourthly, the research program suggests that
most existing methods of representing causal structure,
including regression, factor analysis, structural equation
models and so on--essentially the whole spectrum of contemporary
econometric and sociometric causal methods--are seriously
inadequate to their tasks. Finally, and perhaps most
surprisingly, SGS's approach does all this in a way that is
supposedly free from the need to say explicitly what causation
is or to provide a definition of causation. To a lesser extent
SGS are inexplicit about what is the appropriate interpretation
of probability to use in their approach. This ability to work
with (explicitly) undefined terms within an axiomatic framework
is, of course, one of the great methodological advantages of the
axiomatic approach but it also has its well known drawbacks, the
worst being that it results in too broad a class of structures
that satisfy the axioms. We shall see that this is a defect
that is clearly present in SGS's framework, and it undermines
their claim to have given algorithms that discover causal
structure; the supplementary information used in the discovery
process plays a crucial role in eliminating noncausal epistemic
relations.
By building on the results of Judea Pearl and his research
group, SGS have provided a powerful constructive approach to
some extremely complex and difficult problems. There is no
doubt that their contributions have moved discussions of causal
structures a giant step forward. And I agree with the spirit of
their remarks that an unhealthy strain of scepticism on the part
of philosophers is often an excuse for avoiding hard
constructive thinking about what can be done in this area. It
must be said, though, that this scepticism is by no means
confined to philosophers, for statisticians have tended to be at
least as sceptical about these matters. (See e.g. Freedman 1983,
1995). Yet I find their belief that an intuitive understanding
of causation can carry us a long way in using their methods
quite wrong. In fact, when we come to closely examine the SGS
methods, we shall find that not only is it hard to put a
consistent interpretation on the axioms, but it is by no means
obvious that they have any real causal content at all. As I
shall argue, the most plausible construal of the graphs is that
they are not representations of causal relations understood in
some robustly realistic sense of causation, but are devices for
representing epistemic dependency relations.
I
Let us begin with the relation between graphs and
probability, which will be the principal focus of my remarks.
Some straightforward and obvious comments about the assumptions
that SGS use are in order because it is easy for elementary
points to get lost once one becomes immersed in the complex
details of the SGS programme. In essence, SGS's idea is to get
at the notoriously elusive concept of causation by showing that
certain causal relations, represented in graph-theoretic form,
entail certain probabilistic relations, primarily stochastic
independence and dependence. Because there is a sense in which
direct contact between probabilistic relations and empirical
data is possible through relative frequency values or subjective
probability values in a way that direct contact between causal
relations and empirical data is not, the SGS methodology
hypothesizes that certain causal relations entail specific
probabilistic dependency relations. Using those entailment
relations, one can at the very least eliminate the possibility
of a causal connection between two variables by virtue of
certain observed statistical independencies. There are many
questions that arise from this rich apparatus. Here I shall
focus on the following three:
1. Do the assumptions on which the representational apparatus is
based contain or presuppose specifically causal content?
2. What is the correct interpretation of causation to use with
these methods?
3. What is the status of the representational apparatus that SGS
use to capture causal relations?
I begin with the first of these but I want to stress that
these three questions are intimately linked and one cannot
properly answer the first without a clear answer to the second
and third. Some of the assumptions on which the algorithms are
based are explicitly cited in Glymour (1995), others are to be
found in SGS (1993).
1. The Markov Assumption. This has a number of different
formulations but one of the most general is:
Given a directed acyclic graph (DAG) G over a set of
vertices V representing random variables, and a joint
probability distribution P(V) over V, <G,P> satisfies the
Markov condition iff for every vertex W in V, given Parents(W),
W is probabilistically independent of every other vertex X ∈
V \ (Descendants(W) ∪ Parents(W)).
2. The Faithfulness Assumption. The probability distribution P
that is associated with a DAG G representing the causal
structure of a system is such that every conditional
probabilistic independence relation in P follows from the Markov
condition applied to G. Roughly, there are no "accidental"
independencies.
3. Working Assumption. The distribution of observed variables in
a population is the marginal of a distribution satisfying the
Markov and faithfulness conditions.
4. Causal Sufficiency. The set of vertices of the DAG used to
represent causal structure is closed under common causes; i.e.
every common cause of measured variables is explicitly included
in the model.
5. No Side Effects Assumption. An ideal intervention that
directly manipulates T may alter the distribution of T
conditional on the parents of T, but the distribution of every
other variable conditional on its parents is unchanged.
I have here noted only those assumptions used in
determining the causal structure when the model is correctly
specified--i.e. when all the relevant variables have been
identified and included in the model and the goal is to discover
which variables are causes of which others. One exception here
is that error variables are not explicitly included in the
model, but are always assumed to be present for every variable
in the system. The reason for this is that if one variable is a
(deterministic) function of some others then the discovery
procedure can give the wrong results.3 So to avoid this, some
source of purely exogenous stochastic variation is attached to
every variable in the model, and these error variables are
mutually independent. When this condition is satisfied the
graph is called pseudo-indeterministic. We add this as an
explicit assumption.
6. Pseudo-Indeterminism. Each vertex of the DAG has an
unrepresented single parent (an `error variable') which is
independent of every other such error variable and which has a
non-degenerate distribution. All DAGs considered are pseudo-
indeterministic.
(I note here that the independence of the error terms
follows from the Causal Sufficiency assumption, and that the
Causal Sufficiency assumption is dropped for partially oriented
inducing path graphs.)
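To fix ideas, here is a minimal numerical sketch of assumptions 1 and 2 (my own illustration, not part of the SGS apparatus): a three-variable chain X → Z → Y over binary variables with an invented joint distribution. The Markov condition requires that Y be independent of its non-descendant X given its parent Z, and faithfulness requires that no independencies hold beyond those the graph entails.

```python
# Minimal illustration (not SGS code): the Markov and Faithfulness
# conditions for a three-variable chain X -> Z -> Y with binary variables.
from itertools import product

# Joint distribution generated by the chain: P(x, z, y) = P(x) P(z|x) P(y|z).
p_x = {0: 0.5, 1: 0.5}
p_z_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
p_y_z = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}
joint = {(x, z, y): p_x[x] * p_z_x[x][z] * p_y_z[z][y]
         for x, z, y in product((0, 1), repeat=3)}

def marginal(assignment):
    """Probability of a partial assignment, e.g. {0: 1, 2: 0} fixes X=1, Y=0."""
    return sum(p for v, p in joint.items()
               if all(v[i] == val for i, val in assignment.items()))

def independent(a, b, cond=()):
    """Check A _||_ B | cond by comparing P(a,b,c)P(c) with P(a,c)P(b,c)."""
    for vals in product((0, 1), repeat=2 + len(cond)):
        av, bv, cv = vals[0], vals[1], vals[2:]
        c = dict(zip(cond, cv))
        lhs = marginal({a: av, b: bv, **c}) * marginal(c)
        rhs = marginal({a: av, **c}) * marginal({b: bv, **c})
        if abs(lhs - rhs) > 1e-9:
            return False
    return True

X, Z, Y = 0, 1, 2
# Markov: Y is independent of its non-descendant X given its parent Z.
print("Y _||_ X | Z :", independent(Y, X, (Z,)))   # True
# Faithfulness: no independencies beyond those the graph entails, so X and Y
# should be dependent when nothing is conditioned on.
print("X _||_ Y     :", independent(X, Y))         # False (they are dependent)
```

The statistical independence tests on which the algorithms rely are, in effect, sample-based estimates of this same comparison.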
In addition to these assumptions, there is one definition
that lies at the core of the SGS algorithms. This is the
definition of d-connectedness, and it is here that I find SGS's
position on causation difficult to understand.
Definition. Two vertices X, Y in a DAG G, where X ≠ Y and X,
Y do not belong to Z, are d-connected by a set of vertices Z if
and only if there is an undirected path U between X and Y such
that (1) every collider on U has a descendant in Z (including
the null descendant) and (2) every other vertex on U is outside
Z. X, Y are d-separated by Z just in case they are not d-
connected by Z.
The concept of d-connectedness is not easy to grasp, but
what it means is this. Suppose we have the case shown in the
accompanying figure (omitted here).
Then X, Y are d-connected by {V} but d-separated by {V, A}. Thus,
if we knew V, then from a knowledge of X, we could reliably
infer what value Y would have.
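For readers who want the definition in operational form, the following sketch applies clauses (1) and (2) directly by enumerating undirected paths. The code and the example graph are my own illustration; the graph X → A → V ← Y is simply one structure consistent with the description above (the original figure being omitted), not a reconstruction of it, and this is not code from SGS or Pearl.

```python
# Illustrative sketch (not SGS code): d-connection by direct application of
# clauses (1) and (2) of the definition, for small directed acyclic graphs.

def descendants(graph, node):
    """Vertices reachable from `node` by directed edges, including `node` itself
    (this covers the 'null descendant' clause: a collider in Z counts)."""
    seen, stack = {node}, [node]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def undirected_paths(graph, x, y):
    """All simple paths between x and y, ignoring edge direction."""
    nbrs = {}
    for a, children in graph.items():
        for b in children:
            nbrs.setdefault(a, set()).add(b)
            nbrs.setdefault(b, set()).add(a)
    def extend(path):
        if path[-1] == y:
            yield path
            return
        for n in nbrs.get(path[-1], ()):
            if n not in path:
                yield from extend(path + [n])
    yield from extend([x])

def d_connected(graph, x, y, z):
    """True iff some undirected path between x and y is active given the set z."""
    for path in undirected_paths(graph, x, y):
        active = True
        for i in range(1, len(path) - 1):
            a, b, c = path[i - 1], path[i], path[i + 1]
            collider = b in graph.get(a, ()) and b in graph.get(c, ())  # a -> b <- c
            if collider:
                # Clause (1): the collider, or one of its descendants, must be in z.
                if not descendants(graph, b) & set(z):
                    active = False
                    break
            else:
                # Clause (2): a non-collider on the path must lie outside z.
                if b in z:
                    active = False
                    break
        if active:
            return True
    return False

# Hypothetical graph: X -> A -> V <- Y.
g = {"X": ["A"], "A": ["V"], "Y": ["V"]}
print(d_connected(g, "X", "Y", {"V"}))        # True: conditioning on the collider V activates the path
print(d_connected(g, "X", "Y", {"V", "A"}))   # False: the non-collider A in the conditioning set blocks it
print(d_connected(g, "X", "Y", set()))        # False: an unconditioned collider blocks the path
```

Conditioning on the collider activates the path; adding the intermediate vertex to the conditioning set blocks it again, and this asymmetry is what the discussion below turns on.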
This concept of d-separation plays a key role in both the
SGS algorithm and the PC algorithm used by SGS because it is
the principal device by means of which directions are given to
edges in a DAG on the basis of statistical dependency relations.
As SGS put it "d-separation in fact characterizes all and only
the conditional independence relations that follow from
satisfying the Markov condition for a directed acyclic graph"
(1993, p.72).4 The definition itself is in purely graph
theoretic terms--there are no causal concepts involved. The
first question I want to raise here is this: Is there a
consistent causal interpretation of d-separation that can
motivate its use in the algorithms used by SGS to arrive at
causal structures?
Before I discuss this, let's be clear about what the
perceived difficulty is. At the start of his (1995) paper,
Glymour asserts "While I will rely on some intuitions about
relevant features of the notion of causation, I avoid throughout
any jejune attempt at definition." That approach is fine, and
fits well with the axiomatically based apparatus he has
developed. The method seems to be this: We begin with a
preformal set of intuitions about causation, and about the
relations between causal connectedness and probabilistic
dependence. Then the Working Assumption (number 3 above) is used
as the basis of justifying the claim that every observed
statistical independence relation is a result of the causal
relations embodied in the DAG and the Markov condition. That is,
the claim is that whenever one has a purported example of
statistical independence that does not result from a causal
connection, it falls into one of a set of recognizable
misapplications of the method. These include, for example,
nonhomogeneity of the population to which the model is applied
or underspecification of node variables.
But SGS also claim (1993, p. 41) in their chapter on axioms
and their interpretations: "We advocate no definition of
causation, but in this chapter we try to make our usage
systematic, and to make explicit our assumptions connecting
causal structure with probability, counterfactuals, and
manipulations. With suitable metaphysical gyrations the
assumptions could be endorsed from any of these points of view,
perhaps including even the [view that prefers not to talk of
causation at all]." I do not think that SGS have been successful
in keeping their use systematic, and I shall show that only an
epistemic interpretation fits their apparatus as presented.
When Judea Pearl introduced the concept of d-
connectedness, it was within a framework where inferential
connections were the primary focus. In fact, one of the
principal reasons for using directed graphs rather than
undirected graphs is the latter's inability to represent
correctly cases of common effects. Pearl uses an
example in which two coins are independently flipped and a bell
rings if and only if both coins are the same. When we know that
the bell rang (and when we know it did not) we can infer with
certainty from the outcome of one coin toss what the other coin
outcome was. This induced dependency between the coin outcomes
is clearly epistemic -- it is certainly not causal. As Pearl
says (1988, p.93) "This weakness in the expressive power of
undirected graphs severely limits their ability to represent
informational dependencies". Within what Pearl calls a Bayesian
network, a distinction is made between a joint effect of two
causes (a collider) on the one hand, and a common cause or an
intermediate cause between X and Y on the other. If you examine
the definition of d-connectedness, clause (1) requires that on a
d-connecting path all such colliders be in the conditioning set
or a descendant of a collider be in it. On an epistemic
interpretation, this makes sense. As we saw in the bell
example, learning about the common effect gives us information
about the causes, and by an inference that goes against the
causal direction, a descendant of the common effect can do the
same thing (suppose a dog salivates if and only if the bell
rings, for example). Clause (2) requires that one not
condition on a common cause or an intermediate cause on the path. So with
d-connectedness, by conditioning on Z, we can learn something
about events further back in the causal network. It is this
asymmetry between common effects and common causes or
intermediate causes that allows the SGS and PC algorithms to
discover colliders in the case where conditioning on the middle
element Z in a sequence X - Z - Y fails to render X and Y
independent. Thus far, we have discussed d-connectedness in
terms of information. But as we have seen, SGS want a much
wider class of interpretations for causal networks than that.
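Returning for a moment to Pearl's bell example, the purely epistemic character of the induced dependency can be checked by direct calculation. The following sketch is my own illustration of the arithmetic, not code from Pearl or SGS:

```python
# Illustrative sketch: two independent fair coins and a bell that rings iff
# the outcomes match. Conditioning on the bell (the common effect) induces a
# dependency between the otherwise independent coins.
from itertools import product
from fractions import Fraction

half = Fraction(1, 2)
# Joint distribution over (coin1, coin2, bell).
joint = {(c1, c2, int(c1 == c2)): half * half for c1, c2 in product((0, 1), repeat=2)}

def p(event):
    return sum(pr for outcome, pr in joint.items() if event(*outcome))

# Unconditionally, the coins are independent:
p_c1 = p(lambda c1, c2, b: c1 == 1)
p_c2 = p(lambda c1, c2, b: c2 == 1)
p_both = p(lambda c1, c2, b: c1 == 1 and c2 == 1)
print(p_both == p_c1 * p_c2)          # True: 1/4 == 1/2 * 1/2

# Given that the bell rang, knowledge of one coin fixes the other:
p_bell = p(lambda c1, c2, b: b == 1)
p_c2_given_bell = p(lambda c1, c2, b: b == 1 and c2 == 1) / p_bell
p_c2_given_bell_and_c1 = (p(lambda c1, c2, b: b == 1 and c1 == 1 and c2 == 1)
                          / p(lambda c1, c2, b: b == 1 and c1 == 1))
print(p_c2_given_bell, p_c2_given_bell_and_c1)   # 1/2 versus 1: now dependent
```

Nothing causal passes between the coins; what changes is only what one outcome licenses us to infer about the other once the common effect is known.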
In his paper included in this volume, Glymour is not very
forthcoming about what interpretation to give to d-
separatedness. He says "There is even an intuition to [d-
connectedness] which I will not detail here". (1995, p. ) An
intuition is in fact detailed in the SGS book on pp. 72-73, but
it is not consistent. We are first given an analogy to causal
flow in a pipe, where a collider corresponds to a closed valve
but a common cause and an intermediate cause correspond to an
open valve. This causal flow analogy makes sense along the
lines of an interpretation of causation involving transfer of
energy or transmission of a mark or manipulability, for example.
Next, we are told that conditioning on an open (active) mode
converts it to a closed (inactive) mode because we know that
keeping a variable of that type fixed renders X and Y
independent. This makes sense on a manipulability criterion,
and we know that this conversion from one status to another is a
familiar fact about common causes and "screening off". It makes
sense on an inferential view too: once we know the common cause,
knowledge of one effect does not provide further information
about the other, and knowledge of proximate causes renders
knowledge of remote causes irrelevant. But when it comes to
common effects (colliders), SGS (1993, p.72) say merely "that
conditioning on a collider makes it active was noted in section
3.5.2 above".
When we turn to section 3.5.2 we find two types of cases
where this can happen. One is a "Bayesian example" drawn
from Pearl, involving inference from the known facts that
your car will not start and that your battery is not dead to the
fact that the fuel tank is empty. The other involves a case
related to Simpson's paradox whereby factors independent in the
whole population become dependent when conditioned upon a
joint effect factor that subdivides the population frequencies.
Yet the second of these gives us no insight into how causal flow
can be directed through a collider, and since SGS are opposed to
specifically probabilistic accounts of causation, either of the
type that relies solely on statistics to characterize causal
relations or of the type that uses probability increases and decreases as
the key to causation, the onus must lie on some other
interpretation of the graph. Certainly a manipulability view of
causation will not account for why a collider becomes active
when it or a descendant is conditionalized upon; nor does it
make sense on a causal flow view. The only view on which it
makes sense is an inferential view where it is as legitimate to
infer from an effect to a cause as vice versa, and in certain
circumstances from one joint cause to another, conditional on
their common effects.
This has two consequences. First, if the only consistent
interpretation that can be laid on the DAGs is one of causal
inference and not of causation itself, then the liberality of
interpretation suggested by SGS is misplaced. Given the
centrality of d-separation in their discovery procedure, there
is one and only one thoroughgoing interpretation of these
"causal" graphs and that is an epistemic one. Second, when SGS
turn to prediction, they focus almost exclusively on
manipulation as the basis of the No Side Effects Assumption and
its consequences. This must mean that we have to switch causal
horses when moving from one part of the research program to
another.
II
It might be said that there is no need to have an
interpretation of d-separation. It could be viewed as an
instrumental device that in some hard to understand way gets us
the right causal graphs in conjunction with the other
assumptions. All that is important is arriving at the correct
causal graphs and the evidence SGS provide in terms of
comparative performance against existing studies does that.
This is a reasonable response, and one that revolves around two
different uses of the axiomatic method: one where all the axioms
are interpreted and truth flows from those axioms to the
theorems and the structures that result from applying the
theorems, and the other where all the truth lies in the
consequences and the axioms pick up whatever interpretation, if
any, is consistent with those consequences. Suppose we
adopted this second view (which I am not personally in favour
of). An immediate question then arises: how do we know that the
causal structures which result from applying these axioms are
true? Now sometimes SGS declare a successful application
because their result almost conforms to the structure of an
existing model arrived at by other means (e.g. the Rodgers and
Maranto example (Glymour 1995, p. ; SGS 1993, pp. 134 ff)).
But on other occasions, such as with the Rindfuss, Bumpass, St.
John example (Glymour 1995, p. ; SGS 1993, pp. 139-40) the PC
algorithm is declared to have superior results to existing
methods, in this case because Rindfuss et al. initially included
a causal connection from the variable representing the age at
which a first child is born to the variable representing the
education level of women at time of marriage, but they later
found the regression coefficient to be zero, whereas the PC
algorithm deleted that edge in the course of the discovery
process and it was not included in the final model. Now since
causal connections are not directly observable, if we are
employing the second axiomatic strategy where content flows from
the consequences to the axioms, we cannot rely on directly
knowing the truth of the axioms but instead we must rely on our
intuitive or theoretical judgements about causation in
particular cases to validate specific models. (The same point
holds with other cases, such as the Blau and Duncan example, but
I have not discussed those because they employ partially
oriented inducing path graphs rather than the basic apparatus
discussed here). This, of course, is one of the original
questions we raised about these methods: What kind of causal
knowledge, if any, must we have in order to use the SGS methods?
There are (at least) two kinds of causal knowledge that might be
involved. The first is knowledge required to understand the
assumptions--this will in general be knowledge about what kind
of causation and probability is involved. The other is
knowledge required to apply the assumptions in particular cases.
Consider in this regard the Markov assumption, upon which much
of the SGS method is built. Within the SGS approach, the Markov
assumption is taken as an axiom, and there are two versions of
it. The first Markov assumption (SGS 1993, p.33), which we
earlier listed as the first assumption, is a purely formal
criterion linking graph structures with probabilistic relations,
and has no causal content. The Causal Markov Condition (ibid,
p.54) is slightly less formal, but not much. It simply asserts:
Let G be a causal graph with vertex set V and P be a probability
distribution over the vertices in V generated by the causal
structure represented by G. G and P satisfy the Causal Markov
Condition if and only if for every W in V, W is independent of
V \ (Descendants(W) ∪ Parents(W)) given Parents(W).
But this reference to the causal interpretation of the
graph is unnecessary, because the (formal) Markov condition is a
consequence of a completely acausal condition. To see this,
compare the SGS approach with the one taken by Pearl, within
which the axiomatic formulation was given in terms of qualitative
probabilistic independence and independency maps. Take the following
definition:
Definition. A DAG G is a directed independence graph of P(V) for
an ordering > of the vertices of G if and only if A -> B occurs
in G if and only if ¬(A ∐ B │ K(B)), where K(B) is the set of
all vertices C such that C ≠ A and C > B, and where A ∐ B │ Z means
that the joint probability density of A and B, given Z, factors
into the conditional densities of A given Z and of B given Z.
Pearl then proved this result:
If P(V) is a positive distribution, then for any ordering
of the variables in V, P satisfies the Markov and Minimality
conditions for the directed independence graph of P(V) for that
ordering.6
So, my point is this: given 1) that, for any ordering of the
vertices (which could be merely a temporal ordering), a strictly
positive distribution over the vertices guarantees that the
formal Markov condition will be satisfied, 2) that we
can check whether there are any null probabilistic dependencies
in a purely numerical way, and 3) that SGS can often
eschew prior causal knowledge in favour of temporal ordering or
no ordering at all (SGS 1993, p.112) to arrive at their graphs,
what role does the causal interpretation of the Causal Markov
Condition play in their algorithmic procedures? This question
is particularly pressing in the case of DAGs that are also
directed independence graphs. To put the point another way, we
could eliminate the formal Markov condition for those DAGs that
are also directed independence graphs if we had a prior ordering
on the vertices and a strictly positive distribution over those
vertices. SGS state (1993, p. 111) that, aside from
computational inefficiencies of algorithms based on directed
independence graphs (which is not a causal issue), they want to
eliminate the need for a prior ordering on the variables,
presumably because at least in some cases, a time ordering is
inappropriate or unavailable. But then if the causal
interpretation of the graphs in the Causal Markov Condition is
not being used to impose a prior ordering on the vertices, what
role does it play?
III
I turn now to the question of whether the assumptions of
the SGS approach require specific causal information to be
correctly applied. SGS are quite explicit that in general the
SGS and PC algorithms give an equivalence class of causal graphs
rather than a unique causal graph. The algorithms first give an
undirected graph that is usually less than complete (i.e. it is
derived from a completely connected graph by deleting edges) and
then they identify unshielded colliders. (Unshielded colliders
have the form of two vertices A and B that are both parents of
a third vertex, where A is not directly connected to B and vice
versa.) This partially orients the undirected graph, but leaves
open a number of different ways to complete the DAG. For example,
the graph could be completed in more than one way [figure omitted].
Again, in the Rodgers and Maranto example, the graph in
Figure 6 (of Glymour 1995) is but one of a number consistent
with the initial output of the PC algorithm. In the case of
this example, an appeal to time order of the variables is used
to get the final graph and to eliminate other members of the
statistically indistinguishable equivalence class. But this
strategy is not always possible. To see this, consider another
case that SGS describe.
Because the assumptions used are graph theoretic and
probabilistic, we can ask whether these methods can model
relationships that are not causal in kind. And so they can. In
the example involving the AFQT test (Glymour 1995, p. ), the
relations involved are classificational rather than causal. The
variables AR (arithmetical reasoning), NO (Numerical
Operations), and WK (Word Knowledge) are, as SGS note,
components of the whole AFQT test and the PC algorithm picks
them out as variables adjacent to AFQT. But AR, NO, WK no more
cause AFQT than 2,3,4 cause 9. They are constitutive components
of it. (Note that if there is at least one other component of
AFQT that is not included in the graph, and this is independent
of AR, NO, WK, then the system can be pseudo-indeterministic.)
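A quick simulation makes the worry concrete. In the sketch below the variable names echo the AFQT example, but the data-generating assumptions and the simple correlation test are invented for illustration; in particular, treating the composite score as a sum of its components is my simplification, not SGS's model.

```python
# Illustrative simulation (invented data, not SGS code): a composite score
# built from its components shows the same statistical dependence on each
# component that a causal search would report as adjacency, even though the
# relation is constitutive rather than causal.
import random

random.seed(0)

def corr(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

n = 10_000
ar = [random.gauss(0, 1) for _ in range(n)]     # arithmetical reasoning
no = [random.gauss(0, 1) for _ in range(n)]     # numerical operations
wk = [random.gauss(0, 1) for _ in range(n)]     # word knowledge
other = [random.gauss(0, 1) for _ in range(n)]  # unmeasured component, keeping
                                                # the system pseudo-indeterministic

# The composite is constituted by its components (a deliberate simplification).
afqt = [a + b + c + d for a, b, c, d in zip(ar, no, wk, other)]

print("corr(AR, AFQT) =", round(corr(ar, afqt), 2))   # clearly dependent
print("corr(AR, NO)   =", round(corr(ar, no), 2))     # approximately zero
```

The dependence pattern is exactly what a constraint-based search reads as adjacency between each component and AFQT, yet nothing in the statistics marks the relation as constitutive rather than causal.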
It will not do to merely extend our concept of causation to
include constitutive relations, for a standard objection to
counterfactual analyses of causation carries over to the
manipulability view too. It is true that if X's score on the
arithmetical reasoning test had been different, X's AFQT score
would have been different too (assuming the No Side Effects
principle and no fortuitous cancelling by the other components)
but the AR score is not a cause of X's AFQT, any more than an
atom having two electrons is a cause of its being a helium atom
-- it is simply part of the defining conditions for what it is
to be a helium atom.8
Here we can perhaps draw a sharp distinction between the
use of these methods as prediction and manipulation devices, and
their use for causal purposes. There is no doubt that from the
AR scores we can partially predict the AFQT score and that by
manipulating the AR score we can change the AFQT score. But
let's not confuse that with discovering what the causes of the
AFQT score are. So if we wish to isolate the causal, as opposed
to the classificational, connections in a graph, we will have to
bring in additional information, and that information will not
always be common sense or temporal order. For example, if the
individual took the AR test first and the NO and WK tests later,
the AFQT score would only be available later than the AR score.
And we might need a significant amount of theoretical knowledge
rather than mere common sense to know which variables are
components of others. (I note here that the Working Assumption
explicitly allows that classificational relations can be
included in a DAG.)
IV
Finally, there is an important issue, not discussed by Glymour
in his contribution to this volume, that bears
bringing out explicitly. The discovery procedure is designed to
find independence relations and to eliminate the corresponding
edges in the graph, leaving directed edges as representations of
causal connections. But we ought to remember that there are two
kinds of dependencies, positive and negative--in the linear case
these will be represented by positive and negative correlation
coefficients. This difference is obviously important when one
is interested in interventions, as SGS are. Unless we know
whether Y is positively or negatively connected with X we do not
know whether increasing X will increase Y or decrease it, and
without that knowledge, interventions can be counterproductive,
decreasing SAT scores by paying school administrators higher
salaries for example. Of course, the SGS algorithms are
supplemented by estimation procedures to arrive at numerical
values for parameters in the graph, and this will provide the
sign of the coefficients. The point I want to bring out here is
this: SGS discuss the kind of situations within which the
Faithfulness Condition can fail (1993, pp. 64-69), focussing on
contexts within which Simpson's Paradox occurs. Correctly
noting that Simpson's original example involved statistics for
which positive associations within two subpopulations give
independence when the subpopulations are combined, SGS prove a
powerful theorem9 (for the continuous case) showing that the set
of parameter values in linear models for which the Faithfulness
Condition fails has measure zero, and hence for pseudo-
indeterministic linear models, we can proceed comfortably with
the knowledge that the Faithfulness Condition will fail only
under probabilistically extraordinary conditions. But the
original Simpson's Paradox is not all SGS need to worry about.
It will be as bad for them if their algorithms, supplemented by
standard statistical methods, erroneously tell us that a pair of
variables are negatively causally connected when they are also
positively connected in subpopulations. Moreover, these kinds
of problems are not limited to Simpson's Paradox. There is a
related and well-known problem in sociological sampling known as
the ecological fallacy10, wherein associations at the group
level, e.g. between mean educational level and racial
composition within counties, are transferred to associations at
the level of individuals. Positive associations at the group
level can arise from negative associations at the individual
level. So it is not enough for SGS to avoid causal structure
ambiguities caused by independence relations in the data not
being properly represented in the graph--they must also, like
every other causal modeller, be concerned with the right
population on which to impose their model. That this problem is
not peculiar to their method does not mean it does not need to
be addressed. And then the question is: can we apply the graph
to the correct population without using prior causal knowledge
of which type of association, positive or negative, is the
correct one? Or can we, without such knowledge, rule out as
meaningless variables attached to populations that properly
belong only to individuals? It is implausible that this could be done using
only common sense and hence the problem of signs is one that
seems to require prior causal knowledge to resolve.
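To see how the sign problem can arise, here is a small sketch with invented counts (my own illustration): the association between X and Y is positive within each of two subpopulations but negative in the combined population, precisely the kind of reversal described above.

```python
# Illustrative sketch with invented counts (not data from SGS): within each
# subpopulation the association between X and Y is positive, but in the
# combined population it is negative.

# counts[group][x] = (number of individuals with X = x, number of those with Y = 1)
counts = {
    "A": {1: (10, 9),  0: (90, 72)},   # P(Y|X=1) = 0.90 > P(Y|X=0) = 0.80
    "B": {1: (90, 27), 0: (10, 2)},    # P(Y|X=1) = 0.30 > P(Y|X=0) = 0.20
}

def rate(pair):
    n, y = pair
    return y / n

for g, c in counts.items():
    print(f"group {g}: P(Y|X=1)={rate(c[1]):.2f}  P(Y|X=0)={rate(c[0]):.2f}")

pooled = {x: (sum(counts[g][x][0] for g in counts),
              sum(counts[g][x][1] for g in counts)) for x in (1, 0)}
print(f"pooled : P(Y|X=1)={rate(pooled[1]):.2f}  P(Y|X=0)={rate(pooled[0]):.2f}")
# Output: positive association inside each group, negative association pooled.
```

Which of the two signs is the correct one depends on which population the model is meant to describe, and that, as argued above, is not something the data settle by themselves.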
V
Let us summarize the project we have been discussing. It
has two aims. One is to provide a computationally feasible
algorithm for identifying the correct causal relations between
variables in models of social phenomena. The other is to
provide an explicit set of assumptions on which correct causal
inference can (in principle) be based. This second part of the
project is of more philosophical interest than the first, but
because that applied aspect is an essential part of the project,
it is perfectly reasonable to ask whether the assumptions on
which these algorithms are based are applicable in real cases
rather than in principle only. It seems clear that some of them
are not. The frequency version of the Markov condition requires
that the probability distribution arises from a homogeneous
population, i.e. one in which all the units share the same causal
relations (i.e. they all have the same graph). But in the
nonexperimental contexts within which these algorithms are
supposed to be applied, this will rarely if ever be satisfied.
Social variation is far too great for that. Second, there is
extensive use of "prior information" that some variables are not
causes of others. Thus, for example, in the use of the PC
algorithm (Glymour 1995, p. ), it is said that ED (the level
of education at the time of marriage) does not cause the other
variables. One of these other variables is REL (the
individual's religious affiliation). Why is it supposed to be
obvious that the level of education at marriage is not a cause
of one's religious affiliation? One knows many women whose
agnosticism is a direct causal effect of their high level of
education. Similarly, one's level of education is frequently a
cause of where in the United States one chooses to live (REGN),
another link supposedly excluded by `prior information'. The
same point can be made with YCIG (cigarette smoking). It might
well be that the associations between these variables and
educational level are due to some relationship other than a
direct causal link in the direction I have suggested, but to
rule such links out a priori is implausible and methodologically
suspect.
VI
So, to conclude: the apparatus that Spirtes, Glymour, and
Scheines have developed is one of impressive power and detail.
It moves discussion of causal inference in nonexperimental
contexts to a richly rewarding new terrain and it has the
surpassing virtue of being remarkably explicit about what it is
up to. It is thus a pity that the genuine achievements in their
research are accompanied by such an inexplicit account of the
exact sense in which their graphs are causal. Until this is
remedied, I remain sceptical that their algorithms are genuine
causal discovery engines. Moreover, movement onto new ground
inevitably brings with it some old questions as well as new
ones. So here again are the questions that future work on their
approach needs to answer:
1. In what sense are the DAGs representations of causal
relationships, rather than of conditional probabilistic
dependency relations?
2. Is there a consistent non-epistemic interpretation to the d-
separation condition?
3. (a) How much causal knowledge is actually needed to apply the
algorithms?
(b) In particular, will aggregation problems require
background causal knowledge to select the correct level of
causal analysis?
4. Can these methods separate causal graphs from classificatory
graphs?11
NOTES
1. Glymour 1995, footnote 8.
2. An explicit example of this can be found in section 3 of
Simon 1954.
3. See Glymour 1995, p. .
4. See also clauses B) and C) of the SGS algorithm (SGS 1993,
p.114), and clauses B) and C) of the PC algorithm (ibid, p.
117).
5. See the reference in note 1 above.
6. This result is cited on p.35 of SGS 1993.
7. Glymour 1995 (p.12-13 ms); SGS 1993, p.96.
8. A similar error can be found in Sosa 1980, sections 1-4.
9. SGS 1993, p.68-9.
10. See e.g. Robinson 1950.
11. I should like to thank David Freedman, Richard Scheines,
Glenn Shafer, Peter Spirtes, and James Woodward for helpful
conversations and correspondence. Not all of them agree
with the positions taken in this paper.
REFERENCES
Freedman, D. 1983. "Structural Equation Models: A Case Study". Technical Report, Department of Statistics, University of California at Berkeley.
Freedman, D. 1995. "From Association to Causation Via Regression". <add citation for this volume>
Glymour, C. 1995. "A Review of Recent Work on the Foundations of Causal Inference". <add citation material for this volume>
Humphreys, P. 1989. The Chances of Explanation. Princeton, Princeton University Press.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems. San Mateo, Morgan Kaufmann.
Robinson, W.S. 1950. "Ecological Correlations and the Behavior of Individuals". American Sociological Review 15, pp. 351-357.
Simon, H. 1954. "Spurious Correlation: A Causal Interpretation". Journal of the American Statistical Association 49, pp. 467-479.
Sosa, E. 1980. "Varieties of Causation". Grazer Philosophische Studien 11, pp. 93-103. Reprinted in Causation, M. Tooley and E. Sosa (eds.), Oxford, Oxford University Press, 1993, pp. 234-242.
Spirtes, P., C. Glymour, and R. Scheines 1993. Causation, Prediction, and Search. Lecture Notes in Statistics 81. New York, Springer-Verlag.
Suppes, P. 1970. A Probabilistic Theory of Causality. Acta Philosophica Fennica XXIV. Amsterdam, North-Holland Publishing Company.
Woodward, James 1995. "Causal Models, Probabilities and Invariance". <add citation for this volume>