Learning causal networks from data: a survey and a new algorithm for recovering possibilistic causal networks*

Ramon Sangüesa and Ulises Cortés
Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, c/Pau Gargallo, 5, 28028 Barcelona, Spain
Tel.: +34-3-401 56 40, Fax: +34-3-401 7014
E-mail: [email protected], [email protected]

Causal concepts play a crucial role in many reasoning tasks. Organised as a model revealing the causal structure of a domain, they can guide inference through relevant knowledge. This is an especially difficult kind of knowledge to acquire, so some methods for automating the induction of causal models from data have been put forth. Here we review those that have a graph representation. Most work has been done on the problem of recovering belief nets from data, but some extensions are appearing that claim to exhibit a true causal semantics. We will review the analogies between belief networks and “true” causal networks and to what extent methods for learning belief networks can be used in learning causal representations. Some new results in recovering possibilistic causal networks will also be presented.

1. Introduction

Reasoning in terms of cause and effect is a strategy that arises in many tasks. For example, diagnosis is usually defined as the task of finding the causes (illnesses) from the observed effects (symptoms). Similarly, prediction can be understood as the description of a future plausible situation where observed effects will be in accordance with the known causal structure of the phenomenon being studied. Causal models are a summary of the knowledge about a phenomenon expressed in terms of causation. Many areas of the applied sciences (econometrics, biomedicine, engineering, etc.) have used the term to refer to models that yield explanations, allow for prediction and facilitate planning and decision making.

*This work has been partially supported by the Spanish Comisión Interministerial de Ciencia y Tecnología, project CICYT-TIC-96-0878.

Causal reasoning can be viewed as inference guided by a causation theory. That kind of inference can be further specialised into inductive, deductive or abductive causal reasoning. Inductive causal reasoning aims at building a causal model of the phenomenon being observed from data. It is a widely used strategy in statistics, econometrics and the biomedical sciences. Deductive causal reasoning provides causal explanations given a causal model and a description (data) of the phenomena to be explained. Prediction, too, can be seen as a kind of deduction from a given model and a presently known situation in order to reach a future situation causally consistent with what is presently known. Abductive causal reasoning amounts to reasoning with a causal model in order to find the possible causes of a given phenomenon, the causal model being known. This can be seen as a crude approximation to diagnosis.

Causal concepts are, in fact, central to accepting explanations, predictions, etc. as plausible. It has been argued that causation is a basic concept in common sense reasoning, as fundamental as time or space. We will not discuss such a claim here because our aim is a more modest one: describing and evaluating several methods for building causal models through the recovery of causal schemas from data. We will also give some hints on how to build such models.

Causal models have been seen as meta-models by advocates of second-generation expert systems [9]. The importance of causal models seems to lie in that they allow for focusing inference on the concepts or phenomena that are really relevant to the case; this is why having a causal model aids in guiding inference and gives a higher-level schema of the reasoning task for the domain at hand.


Usually such higher-level schemas are given by experience, but they are difficult to build. Consequently, much effort has been devoted to devising methods for automatically building causal models.

2. Causation and the discovery process

For the purpose of this overview, causal discovery is equated to a learning process. This identification does not enjoy complete agreement within the Knowledge Discovery in Databases and Data Mining communities, and there are some discrepant views on it [50]. In any case, the level of development of the current methods does not allow for more sophisticated approaches. Casting discovery in terms of learning, we will have to take as a point of departure the data about the phenomenon being studied, a causation theory, and a learning method. The following, then, are the components of a causal discovery system.

– Data about the objects of interest involved in the phenomenon whose causal structure is to be discovered. Different kinds of objects can be engaged in different causal relations: events, episodes, processes, states, etc. Data are just the syntactical description of those objects; data can be subjective (i.e., a summarisation of an expert's opinion) or objective (coming from data files or measurement records).

– The causation theory is a description of the conditions that should be met by the objects represented by data in order to state that a causal relation exists between the objects the data come from.

– Taking the causation theory as background knowledge or as bias, the learning method identifies potential causal relations that form the basis of the model being built. We will view the learning process as a search procedure and classify different learning methods in terms of the heuristics and evaluation functions used and the number of models they give as a result. Information about the complexity of the methods will also be considered.

– The result of the learning process is a causal model of the phenomenon under study. Such models are built by composing the previously identified causal relations. The causal model can be seen as a theory of the phenomenon being modelled. This theory can later be used deductively or abductively to fulfil predictive or explanatory tasks. The kind of tasks that can be performed with the resulting model depends on the properties of the causation theory used as background knowledge during the learning process. This implies that, although some causation theories are more general than others, none is completely adequate for all reasoning tasks in all domains, so when choosing a discovery method it will be important to ascertain first what kind of causation theory is more adequate to guide it [83].

We will review discovery methods taking the preceding aspects as discriminating criteria. This will allow us to answer the following questions:

– What kind of phenomena can be interpreted by the method? That amounts to the question of which causal relations can be identified with which causation theory. Such a property of a causal theory will give us an idea of the area of interest of the method, in the sense of what kind of generic tasks can be used with what kind of objects (engineered devices, general processes, etc.).

– What is the resulting model like? What kind of knowledge representation does it use? As we will see, there is a tendency to favour graphical or mixed models (such as causal networks). This will allow us to discuss what inference methods the model can support and how they are implemented.

– What are the properties of the search method? This will allow us to pinpoint possible improvements for each specific method.

– What are the properties of the data? This will allow us to discriminate how well the discovery methods adapt to data that are not ideal (i.e., missing data, noise, etc.).

In order to review current discovery methods in the terms just discussed, it is necessary to make some concepts about causation quite precise. To be more specific, we will have to know which parameters distinguish the different proposals about causation, the different causation theories. This is the aim of the next section.

2.1. Causation theories

In this section some important issues for distinguishing theories that try to characterise the causal relation will be stated. In doing so, our goal is twofold.


Firstly, we will clarify the traits that allow for distinguishing causal associations from other types of association and, secondly, we will be able to compare how the different causation theories formalise the common concepts underlying causation, and so be in a position to ascertain their respective merits.

In the most abstract way, causation is understood as a relation between two phenomena or classes of phenomena: one, the cause, is responsible for the occurrence of the other one, the effect. In a sense, the occurrence of causes “produces” the occurrence of effects.

In order to classify the different causation theories, it is important to know which concepts form the basis of a causation theory. Causation theories differ in the following aspects [18,85].

– The way in which causation is considered to be produced (deterministically/non-deterministically). The first consideration implies a characterisation of causation in terms of logical conditions; the second, in terms of valid statistical associations between events, as distinct from spurious associations.

– The agent producing causation: uniquely by the intervention of an agent external to the experimenter, or as a process independent of the experimenter that implies a certain kind of regularity in nature (manipulative account of causation/non-manipulative account of causation).

– The way in which causes and effects are distinguished. This is the problem of causal ordering. Usually causes are assumed to precede their effects, so time can be used in order to establish precedence.

– The acceptance or not of the Principle of the Common Cause [12]: this is a principle due to Hans Reichenbach stating that, between two related objects of interest A and B, either A causes B, or B causes A, or there exists some common cause C of both A and B.

In a nutshell, a causation theory can be understood as a triplet 〈P, M, I〉, with

– P: a language for describing the phenomena of interest; more often than not this will be variables and constraints on variables;

– M: a language for describing valid causal models. This involves criteria for establishing causal ordering and criteria for deciding on valid causal association (probabilistic or otherwise);

– I: rules for inference: how to build correct explanations, correct predictions and correct deductions using the model.

3. Causation in AI

As we have already said, there is a growing interest in causal discovery in AI, in automating the identification of causes and effects. The most fundamental motivation is guiding inference in accordance with the known causal structure of the world.

There are many references to “causal models”, “causal association”, etc. in the AI literature. Interest in causation arises, for example, in common sense reasoning [51] and automated diagnosis [6,13,41]. There are also references in qualitative reasoning and modelling [28]. Later developments such as second-generation expert systems also posit the use of a causal model of the domain as a meta-level for expert systems [10]. The need for diagnosis appears also in engineered devices, which resulted in the notion of “mythical causality” [17] and theories of causal order [45,46,82]. Several other attempts at defining the causality principle and causal reasoning have been contributed by other workers related to AI, most notably those dealing with default and nonmonotonic reasoning [78,79,82].

All these methods have different semantics for the causality relation. Presently, however, the most agreed-upon concept of causation used in AI stems from the work of Judea Pearl on belief networks [63,67,68], which has been taken as a reference for the interpretation of causal relations. The underlying formalism has correlates in decision theory and in planning [40]. It can be understood as a hybrid model (involving qualitative and quantitative aspects) of causality inspired by several sources, mainly statistical ideas on causality as correlation but also ideas about probabilistic causation [75,91]. In Pearl's formulation, causal order is established atemporally in terms of direction of association; causal association is non-deterministic, and the principle of common cause is used (see [84] for a discussion of this point); objects of interest are variables and the representation language is mainly graphical.

It is important to remark that this is the research area where most work has been done on learning causal schemas.

Other graphical representations tied to causality and having some degree of equivalence with Pearl's networks are: statistical association graphs [71], path analysis graphs [9], Heckerman's modification of influence diagrams [36,37] and Spirtes' causal schemas [87,88].

Non-graphical representations of causation have also received some attention from the point of view of learning. Let us just mention the work by Pazzani [62], which is centred around the idea of using temporal frame representations to induce causal associations, and also the system developed by Pandurang [60], who uses criteria taken from Simon and Iwasaki's work [44–46] on causal ordering to build a logical causal model.

4. Graphical representations of causality: causal networks

The network representation for causality has some precedents in AI. For example, Peng and Reggia [72] developed a representation for causal links in diagnosis domains, the causal abductive network, and developed algorithms for reasoning with them. Similar work can be found in statistics: causal accounts are the centre of a whole area devoted to graphical models [4,7,55,56,95].

Definition (causal network). A causal network is a graph where nodes represent variables and links stand for causal associations. Links can be directed or undirected and may be weighted by a factor or combination of factors expressing the strength of causal association.

This is the most general definition possible. Table 1 expresses the possible combinations and the actual formalisms.

When the association between variables receives a given direction and the strength of association corresponds to conditional probability distributions, the resulting representation is called a Bayesian belief network. We will describe them in some detail further on. In these representations, causal association is understood as a non-deterministic relationship, more precisely a probabilistic one. It is worth noting that uncertainty about any kind of relationship between variables can be due to reasons different from those that make the use of probability reasonable. Other formalisms can be used in representing uncertainty, for example, possibility theory [21] or belief functions. Accordingly, one can think of causal networks that resort to possibility distributions, belief functions, etc. in order to express the non-determinism of the causal association among the variables in the model. We will review some developments in these directions.

Decomposable graphical models express relationships between variables by means of undirected links. Strength of association is represented by conditional probability distributions. There is no clear criterion for establishing causal precedence, as directionality may or may not be present in the model. We will not review here methods for learning such models, which are more typical of statistical techniques. The interested reader is referred to [4] for an excellent review.

Note that these two families of models do not support a manipulative view of causation. As such, they can be applied for the recovery of causal information from observational data, i.e., data where no information is available regarding which variables are amenable to manipulation and which ones are not. Let us remark, however, that, as Pearl points out [67], there are some patterns of association in conditional probability distributions that suggest quite intuitively the notion of causal association.

Path models [96] are special representations for multiple linear regression models. Given the regression model

r_YX1 = β1 + β2·r_X2X1 + β3·r_X3X1,
r_YX2 = β1·r_X2X1 + β2 + β3·r_X3X2,
r_YX3 = β1·r_X3X1 + β2·r_X3X2 + β3,

where the βi are standardised partial regression coefficients, each βi can be interpreted as how much Y changes when Xi is changed by one unit. Causal association is expressed by means of regression coefficients, i.e., by the strength of correlation between variables. There are several ways of establishing causal order. We will explain the way causality is represented in these models in Section 4.6.

Pearl's causal theories [64–66] use a Bayesian belief network to represent the relationships between variables in a linear structural model. For this reason we will describe them in the section devoted to belief networks.

These three families of models support a manipulative view of causation and, as such, they cannot be applied in learning from observational data; they are used for learning from experimental data, that is, data where effect and response variables are known in advance. Let us point out, however, that the main merit of Pearl's causal theories is that they can be used to establish conditions on how causal effects can be ascertained from observational data.


Table 1
Possible graph causal models

Type of graph                   Expression of causal association                Type of link
Bayesian belief network         Conditional probability                         Directed
Decomposable graphical models   Conditional probability                         Undirected
Path models                     Regression coefficients                         Directed
Causal theories                 Conditional probability and functional links    Directed


In reviewing learning methods for causal discovery we will rely heavily on concepts tied to Bayesian networks. This will help us in understanding the problems of inferring causal structures from data, which share most of the problems found in learning Bayesian networks.

4.1. Bayesian networks

In a general sense, a Bayesian network can be seen as a graphical representation of a joint probability distribution on a set of variables, the domain variables. Per se, this information is not enough to represent causal knowledge. It has to be augmented with several other statements that may be explicitly represented. These statements are:

– independence statements: they represent that some variables have an influence on the behaviour of other variables in the domain (dependency relation) or that some other ones have no mutual influence;

– causal statements: some [36,37,64] have argued that the previous requirement is not sufficient for wholly representing the (probabilistic) causation relationships existing among variables, and that they have to be augmented with stronger assumptions.

With this aim in mind, other conditions have been put forth in order to establish causal links in accordance with an intervention model or a decision-theoretic account of causality. We will review them briefly later on.

Given the variables of a problem domain U = {x1, . . . , xn}, a Bayesian network is a Directed Acyclic Graph (DAG) where nodes represent variables in U and links stand for direct associations between variables, usually interpreted as direct causal influences. The strength of association between variables is expressed in terms of conditional probabilities in groups (or clusters) of parent and child nodes in the network.

Fig. 1. A simple Bayesian network.

It is important to realise that there exist two different components in a Bayesian network: a quantitative one (the conditional probability values on the links) and a qualitative one (the topology of the DAG). Among the properties of Bayesian networks to be remarked are their ability to factorise joint probability distributions and their graphical criterion for establishing independence by taking into account only the topology of the graph (the d-separation criterion). We will discuss them in the following. There exist several algorithms for propagating evidence and updating belief in Bayesian networks [86].

In Fig. 1 we give an example of what kind of information a simple Bayesian network can convey. The corresponding functional decomposition is

p(Battery, Fuel, Motor, Start, Move) = p(Battery)·p(Fuel)·p(Motor)·p(Start | Battery, Fuel, Motor)·p(Move | Start).


Table 2
A marginal probability distribution

Start (yes)  0.78
Start (no)   0.22

Table 3
A conditional probability distribution

             Start (yes)  Start (no)
Move (yes)   0.75         0.05
Move (no)    0.25         0.95

Tables 2 and 3 specify the strength of association.
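As a quick check of how these tables combine under the factorisation above, here is a minimal sketch (in Python; the numbers are exactly those of Tables 2 and 3, the variable names are ours) that marginalises Start away to obtain the prior on Move:

```python
# Marginal of Start (Table 2) and conditional of Move given Start (Table 3).
p_start = {"yes": 0.78, "no": 0.22}
p_move_given_start = {("yes", "yes"): 0.75, ("yes", "no"): 0.05,
                      ("no", "yes"): 0.25, ("no", "no"): 0.95}

# P(Move) = sum_s P(Move | Start = s) * P(Start = s), as licensed by the
# recursive decomposition of the joint distribution.
p_move = {m: sum(p_move_given_start[(m, s)] * p_start[s] for s in p_start)
          for m in ("yes", "no")}
print(p_move)  # {'yes': 0.596, 'no': 0.404}
```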

In general, given a DAG D and a joint distribution P over a set U = {x1, . . . , xn}, D represents P if there exists a one-to-one correspondence between the variables in U and the nodes in D such that P can be decomposed recursively as the product

P(x1, . . . , xn) = ∏_i P(xi | pa_i(xi)),

where pa_i(xi) are the direct predecessors (parents or direct causes) of xi in D. This means that each variable xi is conditionally independent of all its other predecessors {x1, . . . , x_{i−1}} \ pa_i(xi).

This can be expressed in the preceding example as conditional independence statements:

I(Battery | ∅ | Fuel)
I(Motor | ∅ | Battery, Fuel)
I(Move | Start | Battery, Fuel, Motor)

Each statement of the form I(X | Y | Z) is read as “X is independent of Z, given Y”. This expression is an extension of the classical concept of independence among variables, where X, Y, Z are interpreted as simple variables with some given values. Note that here I(X | Y | Z) is to be understood as “for all instantiations of all variables in X, Y and Z”.

The notion of independence, however, can be defined in such a way as to remove any relationship with probability. Criteria for independence have been proposed for other uncertainty formalisms [24,26,27] as well as in other areas of interest, such as databases [90]. From such studies, a possible axiomatic view of independence relations has been agreed upon. The following axiomatisation summarises the desired properties for a relation to qualify as a relation of independence.

(1) Trivial independence: I(X | Z | ∅). Null information in no way modifies the information one already has on X.

(2) Symmetry: I(X | Z | Y) ⇒ I(Y | Z | X). Given a state of knowledge Z, if knowing Y gives no information on the value that X may take, then knowing X will give no information on the value that Y could take.

(3) Decomposition: I(X | Z | Y ∪ W) ⇒ I(X | Z | Y). If both Y and W are irrelevant to the value of X, then each of them, taken separately, should also be irrelevant to the value of X.

(4) Weak union: I(X | Z | Y ∪ W) ⇒ I(X | Z ∪ Y | W). Learning a piece of information Y that was irrelevant for X cannot make other irrelevant information W become relevant for knowing X.

(5) Contraction: I(X | Z | Y) & I(X | Z ∪ Y | W) ⇒ I(X | Z | Y ∪ W). If W is taken as an irrelevant piece of information for X after knowing the irrelevant information Y, then W was also irrelevant for the value of X before knowing Y.

(6) Intersection: I(X | Z ∪ W | Y) & I(X | Z ∪ Y | W) ⇒ I(X | Z | Y ∪ W). If two combined pieces of information Y and W are relevant for X, then at least one of them should be relevant for X when the other one is joined with the previous information Z.

Any set of independence assertions about a collection of data that reflects the independence implicit in the data (any dependency model of the data) and satisfies axioms (2)–(5) is called a semi-graphoid. If it also satisfies axiom (6), it is called a graphoid [69].
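Statements of the form I(X | Z | Y) can be checked numerically on any explicit joint distribution. The following minimal sketch (Python; the three-bit joint is our own illustrative construction, factorised as P(A)·P(B|A)·P(C|B) so that A and C are independent given B) compares P(X, Y | Z) against P(X | Z)·P(Y | Z):

```python
from itertools import product

# Build an explicit joint distribution over three bits (A, B, C) that
# factorises as P(A) P(B|A) P(C|B); hence I(A | B | C) holds by construction.
P = {}
for a, b, c in product((0, 1), repeat=3):
    p_a = 0.6 if a else 0.4
    p_b = (0.9 if b else 0.1) if a else (0.3 if b else 0.7)
    p_c = (0.8 if c else 0.2) if b else (0.5 if c else 0.5)
    P[(a, b, c)] = p_a * p_b * p_c

def independent(P, i, k, j, tol=1e-12):
    """I(x_i | x_k | x_j): check P(i, j, k) P(k) == P(i, k) P(j, k) always."""
    for vi, vk, vj in product((0, 1), repeat=3):
        pk   = sum(p for v, p in P.items() if v[k] == vk)
        pik  = sum(p for v, p in P.items() if v[i] == vi and v[k] == vk)
        pjk  = sum(p for v, p in P.items() if v[j] == vj and v[k] == vk)
        pijk = sum(p for v, p in P.items()
                   if v[i] == vi and v[j] == vj and v[k] == vk)
        if abs(pijk * pk - pik * pjk) > tol:
            return False
    return True

print(independent(P, 0, 1, 2))  # True: A is independent of C given B
```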

The interesting thing about Bayesian networks and, in general, about belief networks, is that they can be taken as a representation of a dependency model. If this is so, it is important to know which mappings can be established between the topology of the network and its associated dependency properties. The notion of d-separation is central to that task.

Definition 1 (d-separation [68]). If X, Y and Z are three disjoint subsets of nodes in a directed acyclic graph D, then Z is said to d-separate X from Y iff there is no path from a node in X to a node in Y along which the following conditions hold: (1) every node with converging arrows either is in Z or has a descendant in Z, and (2) every other node is outside Z. A path satisfying these two conditions is said to be active; otherwise it is said to be blocked by Z.


Fig. 2. An example for the d-separation criterion.

Example [56]. Given the Bayesian network of Fig. 2, X1 takes values in the set {winter, spring, summer, fall} and the other variables are binary-valued. The sets X = {X2} and Y = {X3} are d-separated by Z = {X1}: the path X2 ← X1 → X3 is blocked by X1, which belongs to Z, and the path X2 → X4 ← X3 is blocked because X4, as well as all its descendants, lies outside Z. On the other hand, X and Y are not d-separated by Z′ = {X1, X5}, because the path X2 → X4 ← X3 is made active by X5, which is a descendant of X4 and belongs to Z′ (see Fig. 2).
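A d-separation test can be implemented directly through the equivalent ancestral moral-graph criterion: restrict the DAG to the ancestors of X ∪ Y ∪ Z, "marry" co-parents, drop directions, delete Z and check connectivity. A minimal sketch (Python; the graph encoding and function names are ours), reproducing the two checks of the example:

```python
from itertools import combinations

def ancestors(dag, nodes):
    """All nodes with a directed path into `nodes`, plus `nodes` themselves."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        n = stack.pop()
        for p in (p for p, children in dag.items() if n in children):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, X, Y, Z):
    """Test whether Z d-separates X from Y in `dag` (node -> set of children)
    via the ancestral moral graph, equivalent to the path-blocking definition."""
    keep = ancestors(dag, set(X) | set(Y) | set(Z))
    adj = {n: set() for n in keep}
    for n in keep:
        parents = [p for p in keep if n in dag.get(p, ())]
        for p in parents:                      # undirected parent-child edges
            adj[p].add(n); adj[n].add(p)
        for p, q in combinations(parents, 2):  # marry co-parents
            adj[p].add(q); adj[q].add(p)
    reached = set(X) - set(Z)                  # undirected search avoiding Z
    stack = list(reached)
    while stack:
        n = stack.pop()
        for m in adj[n] - set(Z):
            if m not in reached:
                reached.add(m)
                stack.append(m)
    return not (reached & set(Y))

# The network of Fig. 2: X1 -> X2, X1 -> X3, X2 -> X4, X3 -> X4, X4 -> X5.
dag = {"X1": {"X2", "X3"}, "X2": {"X4"}, "X3": {"X4"}, "X4": {"X5"}, "X5": set()}
print(d_separated(dag, {"X2"}, {"X3"}, {"X1"}))        # True
print(d_separated(dag, {"X2"}, {"X3"}, {"X1", "X5"}))  # False
```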

If we assume that behind a collection of data there exists a dependency model M, then the following definitions express the possible relations between the dependency model M and its graphical representation, the DAG D.

Definition 2 (I-map). A DAG D is said to be an I-map [68] of a dependency model M if every d-separation relation in D corresponds to an independence relation in M. That is, given X, Y, Z three disjoint sets of nodes in D:

d-sep(X | Z | Y)_D ⇒ I(X | Z | Y)_M.

Example: a trivial example is when D is a complete graph.

Definition 3 (minimal I-map). A DAG D that is an I-map for a given dependency model M is minimal if no other DAG D′ with fewer links than D is an I-map for M.

Definition 4 (D-map). A DAG D is a D-map [69] for a dependency model M if every independence relation in M corresponds to a d-separation relation in D. That is, given X, Y, Z three disjoint node sets:

d-sep(X | Z | Y)_D ⇐ I(X | Z | Y)_M.

Example: when D is a completely disconnected graph.

Definition 5 (perfect map). A DAG is a perfect map of a model M if it is both an I-map and a D-map of the model M.

Given a dependency model, there can exist several different graphical representations for the same independence relations in the model. These representations are isomorphic. A typical example is the following one. Knowing that x and z are marginally dependent but, when y is known, conditionally independent, the following structures are isomorphic:

x ← y ← z ≈ x → y → z ≈ x ← y → z.

This property has important implications for learning.

For a DAG to be isomorphic to a dependency model M, the following conditions are to be met [69]:

(1) Symmetry: I(X | Z | Y)_M ⇔ I(Y | Z | X)_M.

(2) Composition/Decomposition: I(X | Z | Y ∪ W)_M ⇔ I(X | Z | Y)_M & I(X | Z | W)_M.

(3) Weak union: I(X | Z | Y ∪ W)_M ⇒ I(X | Z ∪ Y | W)_M.

(4) Contraction: I(X | Z | Y)_M & I(X | Z ∪ Y | W)_M ⇒ I(X | Z | Y ∪ W)_M.

(5) Intersection: I(X | Z ∪ W | Y)_M & I(X | Z ∪ Y | W)_M ⇒ I(X | Z | Y ∪ W)_M.

(6) Weak transitivity: I(X | Z | Y)_M & I(X | Z ∪ w | Y)_M ⇒ I(X | Z | w)_M or I(w | Z | Y)_M.

(7) Chordality: I(x | y ∪ z | w)_M & I(y | x ∪ w | z)_M ⇒ I(x | y | w)_M or I(x | z | w)_M.

Letters in lower case represent individual variables.

The d-separation criterion has been proved to be a necessary and sufficient condition in relation to the set of distributions represented by a given DAG. There is a one-to-one correspondence between the set of independences implied by the recursive decomposition of probability distributions and d-separation on a DAG.


4.2. Other approaches for non-probabilistic belief network models

4.2.1. Possibilistic networks

The conditional independence properties just mentioned allow for the characterisation of independence in several uncertainty calculi. Possibility [21] is a way of dealing with uncertainty and imprecision. Fonck [23,25,27] devised possibilistic networks, a specialisation of possibilistic hypergraphs [21].

In these networks, uncertainty is assumed to be represented by a possibility distribution [20]. Fonck later described inference algorithms for such networks [19,23]. These inference mechanisms are analogous to the ones proposed by Pearl in his original work. Fonck proved that the d-separation criterion is also valid for possibilistic networks. However, she also detected some important properties of conditioning operators in possibility theory that are of special importance for learning possibilistic causal networks.

Possibilistic conditional independence is defined in terms of the conditioning operator. Contrary to probability, there are several conditioning operators and several ways to combine possibility distributions.

Let us mention the Dempster–Shafer conditioning operator [49]:

π(X | Y) = π(X, Y)/π(Y)

and Hisdal's [39] conditioning operator:

π(X | Y) = π(X, Y) if π(X, Y) < π(Y), and 1 otherwise,

where π is a possibility distribution. Based on the results of conditioning on possibility distributions, conditional independence can be defined in several ways. The traditional way to understand conditional independence in possibilistic settings was to equate independence to the equality between the joint possibility distribution and a combination of the marginal possibility distributions. This is what is known as the non-interactivity property.
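Both operators are easy to state on a finite joint possibility distribution. The sketch below (Python; the numbers and names are illustrative assumptions, not taken from the paper) makes the difference concrete; note that the possibilistic marginal is a maximum, not a sum:

```python
# A finite joint possibility distribution over pairs (x, y).
joint_poss = {("a", "u"): 1.0, ("a", "v"): 0.4,
              ("b", "u"): 0.7, ("b", "v"): 0.2}

def marginal_y(pi, y):
    """Possibilistic marginal: a maximum, not a sum."""
    return max(p for (_, yy), p in pi.items() if yy == y)

def dempster_cond(pi, x, y):
    """Dempster-Shafer conditioning: pi(x | y) = pi(x, y) / pi(y)."""
    return pi[(x, y)] / marginal_y(pi, y)

def hisdal_cond(pi, x, y):
    """Hisdal conditioning: pi(x, y) where it stays below pi(y), else 1."""
    pxy = pi[(x, y)]
    return pxy if pxy < marginal_y(pi, y) else 1.0

print(dempster_cond(joint_poss, "b", "u"))  # 0.7
print(hisdal_cond(joint_poss, "b", "u"))    # 0.7; ("a", "u") would give 1.0
```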

Definition (non-interactivity [24]). Two possibility distributions on the sets X and Y are said to be non-interactive with respect to a third one, Z, if they can be factored:

π(X, Y | Z) = c(π_c(X | Z), π_c(Y | Z)),

where c is a possibility distribution combination operation (usually the minimum operator) and π_c represents the distribution resulting from applying the c operator in the conditioning operation.

Other possible ways of defining independence in a possibilistic setting are the following ones.

Definition (strong possibilistic conditional independence [24]). Given the variables X, Y and Z and the corresponding possibility distributions, we say that X is possibilistically conditionally independent of Y given Z if the following equalities hold:

π_c(X | Y, Z) = π(X | Z) and π_c(Y | X, Z) = π(Y | Z).

Definition (similarity-based possibilistic conditional independence [16]). Two variables X and Y are said to be possibilistically conditionally independent with respect to a third variable Z when

π(X | Y, Z) =_sim π(X | Z)

for any values of X, Y and Z, where the symbol =_sim denotes that both distributions are similar.

The idea behind similarity-based definitions is that if two variables are conditionally independent given a third one, then the conditioned distribution cannot be very different from the original one. The more different it is, the more dependent the variables are [16].

Fonck [23] proved that, depending on the combination operator used (minimum, product or Łukasiewicz-like T-norm), the resulting independence relationships may or may not obey the graphoid axioms. In particular, she showed that for non-interactivity the semi-graphoid axioms are valid but the graphoid axioms are not. This means that such an independence definition could not be used in defining a possibilistic network, and even less to learn one from data. Huete [42] also studied the properties of several similarity-based conditional independence definitions depending on the type of similarity used, proving that most of them do not fulfil the symmetry property.

Other similar network proposals are Kruse and Gebhardt's [33], who defined a similar construct based on their characterisation of possibility in terms of their “context model” [32]. Analogously, Parsons [61] has proposed a characterisation of possibilistic networks that draws on Fonck's previous work but refers to qualitative concepts of influence between variables. These approaches, with the exception of Parsons', stress that independence relations (whatever the underlying uncertainty formalism may be) can be characterised by means of the d-separation criterion. This is important, because it gives a level of abstraction


above the details owed to the nature of the uncertainty formalism used. A further development in the direction of higher abstraction is Shenoy's work on valuation systems [77], which has been given an operational aspect and so establishes the conditions for propagating uncertainty values in DAGs [7].

In any case, these methods represent an advance in the direction of providing inference mechanisms based on uncertainty formalisms other than probability. They are changes not in learning methods but in representation. Moreover, even if there is a clear sense of unity in the way that independence properties carry over to different formalisms thanks to a structural criterion, these characterisations still miss some of the characteristics of causal relations.

Several other assumptions have been introduced in order to derive truly causal networks. As we have stressed before, this is equated to the proposal of several new characterisations of causality, i.e., several new criteria for the identification of causation. The novelty in relation to other criteria previously used, for example in statistics [9] or even in AI [89], where causality is characterised in terms of constraints on correlations, is that the new formalisations are based on an extension of the independence model.

Two interesting departures from the basic graphical belief model are to be remarked. One is Heckerman's characterisation of probabilistic causal networks in terms of decision theory [36] and the other is Pearl's new account of causality in terms of a probabilistic calculus of intervention [66]. Finally, Cooper [11] has put forth conditions to graphically identify certain causal relations in terms of their independence properties.

4.3. Pearl's intervention view of causality: causal theories

Pearl has developed a new interpretation for causal networks based on the idea of intervention. This is in accordance with other interpretations of causality widely used among the experimental disciplines [85].

The key idea in Pearl's new work lies in finding structural equivalences to causal influence and in defining how a change in a variable's value due to external intervention affects the structure of the related probability distributions.

4.3.1. The probabilistic action calculus

The first change in the Bayesian network defined by Pearl is the explicit representation of a causal mechanism [64]. A causal DAG is a DAG where each parent–child subgraph is a deterministic function. A child Xi with parents pa_i represents a deterministic function

Xi = f_i(pa_i, ε_i)

for i = 1, . . . , n, n being the cardinality of the set of domain variables and pa_i the set of parents of the variable Xi in a given DAG. The ε_i, 1 ≤ i ≤ n, are mutually independent disturbances.

The functions allow for calculating the precise effects of interventions. The simplest intervention (i.e., an external action) is the setting of a single variable, that is, forcing a variable, say Xi, to take a given value xi. This atomic intervention, set(Xi = xi), according to Pearl, amounts to isolating Xi from the influence of the previous functional mechanism Xi = f_i(pa_i, ε_i) and setting it under the influence of a new mechanism that makes xi constant while all other mechanisms are left unperturbed. That is, the new corresponding graph is a subgraph of the original one where all arrows entering Xi are wiped out.

Pearl suggests entering a new variable into the system in order to represent the operation of an external intervention. So a new variable Fi is created and the following convention is made:

Xi = I(Fi, pa_i, ε_i),

where I is a function defined as

I(a, b, c) = f_i(a, c) when b = f_i.

In this way the action of any external intervention that may alter Xi is represented by another parent node of Xi. The effect is analysed through Bayesian conditionalisation.

The effect of an atomic intervention of the type set(Xi = xi) is encoded by adding to the graph a new node Fi and a link connecting it to Xi. Fi represents the deterministic function but is treated like a variable that can take values in {set(xi), idle}¹, where xi has the same domain as Xi and “idle” means no intervention.

The new parent set pa′_i of Xi is its previous parent set pa_i plus the Fi node. It fulfils the following condition:

¹See [34] for a similar construct on decisions.


Fig. 3. The corresponding manipulated graph.

Fig. 4. An extended DAG with a functional mechanism.

P(xi | pa′_i) = P(xi | pa_i)   if Fi = idle,

but

P(xi | pa′_i) = 0 if Fi = set(x′_i) and x′_i ≠ xi,
P(xi | pa′_i) = 1 if Fi = set(x′_i) and x′_i = xi.

Graphically, then, we have in Fig. 3 a graph with no external intervention. The extended graph corresponding to an external intervention Fi is given in Fig. 4.

So the effect of the intervention set(x′_i) is to transform the original probability distribution P(x1, . . . , xn) into the new distribution P(x1, . . . , xn | Fi = set(x′_i)).

The relation between pre- and post-intervention joint distributions can be expressed, thanks to the decomposability property of Bayesian networks, as:

P(x1, . . . , xn | Fi = set(x′_i)) = P(x1, . . . , xn)/P(xi | pa_i) = ∏_{j≠i} P(xj | pa_j)   if xi = x′_i,

and P(x1, . . . , xn | Fi = set(x′_i)) = 0 otherwise.

Graphically this is equivalent to removing the links between Xi and pa_i and leaving the rest of the network as it was before.
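The truncated factorisation is mechanical to compute. Below is a minimal sketch (Python) on the car network of Fig. 1; the Move | Start factor has the shape of Table 3, but the root priors and the Start mechanism are our own illustrative assumptions (the paper gives no numbers for them), as are the helper names:

```python
from itertools import product

# Root priors and mechanisms (assumed numbers, for illustration only).
P_B = {1: 0.9, 0: 0.1}                  # Battery ok
P_F = {1: 0.95, 0: 0.05}                # Fuel present
P_M = {1: 0.97, 0: 0.03}                # Motor ok
def P_S(s, b, f, m):                    # Start given its three parents
    p_yes = 0.99 if (b and f and m) else 0.0
    return p_yes if s else 1.0 - p_yes
P_MV = {(1, 1): 0.75, (0, 1): 0.25,     # Move | Start, shaped like Table 3
        (1, 0): 0.05, (0, 0): 0.95}

def joint(b, f, m, s, mv, do_start=None):
    """Pre-intervention joint, or the truncated factorisation under
    F_Start = set(do_start): the P(Start | parents) factor is replaced
    by an indicator while every other mechanism is left unperturbed."""
    start = P_S(s, b, f, m) if do_start is None else float(s == do_start)
    return P_B[b] * P_F[f] * P_M[m] * start * P_MV[(mv, s)]

# P(Move = yes | F_Start = set(yes)): sum the truncated joint over the rest.
p = sum(joint(b, f, m, 1, 1, do_start=1) for b, f, m in product((0, 1), repeat=3))
print(round(p, 6))  # 0.75 -- the intervention cut Start's incoming links
```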

4.3.2. Identifiability of causal effects in observational data

The interest of Pearl's action calculus is that it allows for the identifiability of causal effects by means of graphical criteria. The criterion for the identification of a causal effect is that, upon the execution of an action by an external agent setting a variable (a do(x) action), the related probability distributions should be altered. If no alteration appears, then no truly causal effect can be said to have taken place. So, if a change in a given variable has no effect on the other variables linked to it in the DAG reflecting the dependence relations, no causal relation can be said to exist among those variables.

The important twist in Pearl's work lies in setting graphical conditions for determining which graphs can be subjected to such a test, that is, stating which graphical conditions are to be met by a dependency graph in order to test it for the existence of a causal association. If a DAG does not meet such conditions, one cannot infer causal effects by manipulating it. Consequently, it has to be rejected as a representation of causality in the domain.

Pearl's conditions are the following ones [29]. A necessary and sufficient condition for the identifiability of the causal effect of a set of variables X on Y is that the DAG G containing X and Y satisfies one of the following conditions:

1. There is no directed path from X to Y in G;
2. There is no back-door path from X to Y in G (i.e., there is no link into X);
3. There exists a set of nodes B that blocks all back-door paths from X to Y;
4. There exist sets of nodes Z1 and Z2 such that:

– no element of Z2 is a descendant of X;
– Z2 blocks every directed path from X to Y in G−X;
– Z2 blocks all paths between X and Z1 in G−X;

where G−X is the graph obtained by deleting from G all arrows pointing to X.

Definition (back-door criterion [64]). A set of variables Z satisfies the back-door criterion with respect to an ordered pair of variables (Xi, Xj) in a DAG G if:

(i) no node in Z is a descendant of Xi; and
(ii) Z blocks every path between Xi and Xj which contains an arrow into Xi.


Fig. 5. Pearl’s example for testing causal identifiability.

For example, in Fig. 5 the sets Z1 = {X1, X2} and Z2 = {X4, X5} obey the back-door criterion, but Z3 = {X4} does not, because there remains the unblocked path (Xi, X3, X1, X4, X2, X5, Xj).

So if a graph G has one of these properties, it can have a causal interpretation, in the sense given in the previous section.

Up to this point we have reviewed most of the aspects of causality related to belief networks. Let us review two more methods: one establishing a different set of criteria for finding causal associations on a Bayesian belief network, and a last one that maps a definition of causality in terms of decision theory onto the already known concept of a Bayesian belief network.

4.4. Cooper’s partial conditions on causal structures

Cooper [11] has tried to devise some new criteria to derive causality from observational data that, interestingly enough, make use of independence criteria. We will comment briefly on his characterisations (see Fig. 6).

Cooper takes a Bayesian network as the representation of the causal relations among the variables in a domain. Then he lists a set of several relations that he qualifies as truly causal and studies what kinds of independence relations are satisfied by such relations. He identified the following seven relations of independence, in terms of d-separation, as uniquely identifying the structures depicted in Fig. 6.

Fig. 6. Cooper's characterisation of causality in terms of independence relations.

R1: I(x, y)
R2: I(x, z | y)
R3: I(w, y | x)
R4: I(w, z | x)
R5: I(w, z | y)
R6: I(x, z)
R7: I(w, y)

Relations R1 to R7 are tests of independence in terms of the d-separation criterion. The important result is that Cooper proved that these seven relations are sufficient to distinguish among the four network structures.

4.5. Heckerman’s decision-based view of causality

The main concept behind this approach is the idea of unresponsiveness, which allows Heckerman to define the causal relation. In order to understand it, one has to resort to the transformation of a Bayesian network into an influence diagram.

An influence diagram is a representation of decisions and their consequences. Its structure is a DAG where nodes are of different types. Variables have a possibly infinite set of states. Decision nodes represent the possibility of taking a decision (i.e., selecting an alternative); chance nodes represent variables in the domain that may affect decisions but for which information is uncertain. So, for example, the variable smoking with values yes or no may be a decision variable, while a variable indicating that a person may develop lung cancer is a chance variable. Arcs are also of two types: information arcs and relevance arcs. Information arcs represent what is known at the time of the decision. Relevance arcs represent probabilistic dependence. Now, apart from this information, an influence diagram has the following components:

(a) a set of probability distributions associated with each chance node;

(b) a utility node, which indicates the expected utility of the final decision, and a set of utilities.


Fig. 7. Heckerman’s causal extension. Example 1.

Deterministic nodes are those nodes that are deterministic functions of their parents. Arcs pointing to chance nodes represent conditional dependence relations. Bayesian networks can be seen as influence diagrams with no decision nodes.

In Heckerman's proposal, causality is a relation determined by unresponsiveness. A variable x is said to be unresponsive to a decision d if, no matter which alternative is chosen for decision d, the outcome of x is the same. Note that if the result of a chance variable x is not affected by decision d, then x and d must be probabilistically independent.

In Fig. 7, an influence diagram has been built in order to represent information about the decision of smoking or not and about changing diet or not. The possible states for the decision variables smoke and diet are shown. Chance nodes are represented by ovals, decision nodes by squares and utility nodes by diamonds. Note that the absence of relevance arcs expresses the fact that lung cancer and cardiovascular status are conditionally independent given diet, smoke and genotype. Although there is no certitude about genotype, it is possible to assert that, whatever the genotype of a given person is, it will not be affected by whether this person smokes or not: genotype is unresponsive to the decision smoke. Those two variables can be taken as independent in the probabilistic sense.

Definition (mapping variables). Given uncertain variables X and Y, the mapping variable X(Y) is the chance variable that represents all mappings from Y to X.

Fig. 8. Heckerman's causal extension. Example 2: the four states of the mapping variable cancer(smoke), one for each deterministic mapping from smoke (no/yes) to lung cancer (no/yes).

For example, a mapping variable can be defined by means of the decision variable smoke and the chance variable lung cancer. The mapping variable cancer(smoke) represents all possible deterministic mappings from smoke to cancer. This mapping variable has four possible states, one for each assignment of the two-valued settings of smoke and lung cancer (see Fig. 8). Each of these mappings has an associated uncertainty.

Heckerman and Shachter [36,37] state that a set of variables C is a set of causes for x with respect to a set of decisions D if the following conditions hold.

Definition (cause). (1) x does not belong to C; (2) x is responsive to D; (3) C is a minimal set of variables such that X(C) is unresponsive to D. X(C) is then said to be a causal mechanism.

Definition (causal mechanism). Given a set of decisions D and a chance variable X that is responsive to D, a causal mechanism for X with respect to D is a mapping variable X(C) where C are causes for X.

Once all these concepts are presented, Heckerman and Shachter establish a correspondence with a special form of influence diagram, the Howard Canonical Form diagram.

First they define a blocking relation and try to determine when a group of blocking variables embodies the notion of unresponsiveness.

Definition (block). Given an influence diagram with decision nodes D and chance nodes U, a set C ⊂ U is said to block D from x if every directed path from a node in D to x contains at least one node in C.


In some cases, whenever there is a path from D to a variable x, this variable is responsive to D. So it seems that the block concept allows for a graphical test of causal association. However, this is not generally the case in ordinary influence diagrams. The authors show that influence diagrams in Howard Canonical Form do ensure that the blocking condition faithfully represents their concept of causality.

Definition (Howard Canonical Form influence diagram). An influence diagram for chance variables U and decision variables D is said to be in Howard Canonical Form if (1) every chance node that is not a descendant of a decision node is unresponsive to D, and (2) every chance node that is a descendant of a decision node is a deterministic node.

Heckerman and Shachter argue that their formulation is identical to Pearl's concept of causation (see below), with the exception that, in their view, Pearl requires mechanisms to be independent, whereas their proposal allows for dependent mechanisms (see [37] for their argumentation).

4.6. Path models

As we said before, path models are based on the manipulation of multiple regression models. In a regression model, a system of equations is built and used for prediction. In such a model there exist several variables X1, . . . , Xn that can be manipulated to change the value of a given response variable Y. The aim of regression models is to assess the value of certain coefficients in order to obtain a least-squares system of equations, i.e., a model that minimises the least-squares distance.

Given a prediction model

r_YX1 = β1 + β2·r_X2X1 + β3·r_X3X1,
r_YX2 = β1·r_X2X1 + β2 + β3·r_X3X2,
r_YX3 = β1·r_X3X1 + β2·r_X3X2 + β3,

the idea is to solve for the beta coefficients in order to obtain a least-squares rule.

Wright [96] proposed an interpretation of such models in graphical terms. In Fig. 9 we have a three-equation model.

Fig. 9. A path model corresponding to the three-equation model.

The first equation allows us to link X1 and Y through the β1 coefficient; the relation between X2 and Y is represented by the path that goes from X1 to Y via X2; the relation with X1 is given by the indirect path through X3. The correlation r_YX1 is given by the sum of the factors of the paths, where the weight of a path is the product of the coefficients on its links. The convention in this representation is that undirected arrows represent correlations and directed arcs represent causes.

Note that directed links are weighted by betas that represent the partial regression coefficients, that is, the effects of a variable on Y when all other variables remain fixed.

To derive a path model from a predictive multiple regression model, four steps have to be taken:

(1) Create a prediction model and the corresponding path diagram.

(2) Decide which factors are correlations and which are beta coefficients. This can be done by applying the following rules:

– if several variables X, Y, . . . point to another variable Z and they are independent causes of Z, then the path coefficients are just the correlations ρZX, ρZY, . . .;

– if they are dependent, the coefficients are the standardised partial regression coefficients βZX, βZY.

(3) Solve for the coefficients. If they are correlations, they can be calculated from data; if they are not, a multiple regression has to be run on the corresponding variables.

(4) Estimate correlations between variables by summing up weights along the paths. There are some rules for calculating such values. Cohen [9] gave a graphical expression to these rules by taking into account the way an arrowhead enters a node. The proposed rules are:



Fig. 10. Possible networks corresponding to the database in Table 4.

(a) a path cannot go through the same node twice;

(b) once a node has been entered by an arrowhead, no node can be left by an arrowhead.

By means of these rules, causal associations can be established.
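Steps (3) and (4) reduce to linear algebra on the correlation matrix: with standardised variables, the equations given at the start of this section read r_Y = R_XX·β. A minimal sketch (Python with NumPy; the correlation values are illustrative, not from the paper):

```python
import numpy as np

R_XX = np.array([[1.0, 0.3, 0.2],   # correlations among X1, X2, X3
                 [0.3, 1.0, 0.4],
                 [0.2, 0.4, 1.0]])
r_Y = np.array([0.5, 0.6, 0.4])     # correlations of X1, X2, X3 with Y

# Step (3): standardised partial regression coefficients.
beta = np.linalg.solve(R_XX, r_Y)
print(beta.round(3))

# Step (4): r_YX1 decomposes into the direct path plus the indirect paths,
# each path weighted by the product of the coefficients along its links.
print(beta[0] + beta[1] * R_XX[0, 1] + beta[2] * R_XX[0, 2])  # 0.5
```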

5. Learning belief networks

The learning problem for this kind of network can be stated as follows:

“Given a set of data, infer the topology of the causal network that may have generated them, together with the corresponding uncertainty distribution.”

We use here the term “uncertainty distribution” in order to allow for uncertainty formalisms other than probability to be used. However, in reviewing current methods we will make use of probabilistic examples.

As we said in Section 1, the process of causal discovery, as a learning problem, differs basically in the search procedure and in the function used to rank tentative resulting models. The problems in any approach to this kind of discovery centre around the high complexity of learning the topology (structure) of the DAG [19]. Once the topology is known, finding the conditional probability tables is straightforward, although some efforts have been made to improve the efficiency of learning them. For instance, Musick [59] uses neural networks to learn the conditional probability tables.

Table 4
A simple database

Case #  A  B  C
1       1  0  1
2       1  1  1
3       0  1  1
4       0  1  1

Additional problems that some methods are able to tackle are: unmeasured variables [76], missing values [34] and instrumental unobserved variables [29].

It is possible to classify the methods into those that start with an assumption about the structure and then infer the distribution and those that, conversely, start with an assumption about an uncertainty distribution and try to recover the corresponding structure.

To make things clearer, let us examine the dimensions of learning such structures.

The search space

Given a database as simple as that in Table 4 [3], the structures shown in Fig. 10 are the possible Bayesian networks compatible with such data.

To get an idea of how large the search space can become, let us take, for example, the first structure, the one with three independent variables. Here three probability distributions have to be assessed: p(A), p(B) and p(C). All variables being binary, each probability is specified by a single real number in [0, 1]; the parameters to be learned can be defined as a family φ. In order to learn the fourth structure, probabilities p(a), p(b) and p(a | b) are to be assessed; this means a family of parameters whose values are in R^4. p(c | b) has to be ascertained for two values, p(c = t | b = t) and p(c = t | b = f). For any given Bayesian network structure, the probability table p(X | Y) for any two variables X, Y is a subset of the real space of (|X| − 1)·|Y| dimensions (where |X| is the number of different values of a variable X). A network on |U| = k binary variables needs between k values (for a completely disconnected network, such as the upper left one in the figure) and 2^k − 1 values (for a fully connected one) to specify its probability tables. In the case that continuous variables are used and they follow a Gaussian distribution, a node with k parents will need k(k + 1)/2 values to specify the mean and covariance matrices.
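These counts can be summarised in a small helper of our own that returns the number of free parameters of a candidate discrete structure:

def num_free_parameters(arity, parents):
    """Free parameters of a discrete Bayesian network.
    arity: dict variable -> number of values;
    parents: dict variable -> list of parent variables."""
    total = 0
    for x, r in arity.items():
        q = 1
        for p in parents.get(x, []):
            q *= arity[p]          # number of parent configurations
        total += (r - 1) * q       # (|X| - 1) free values per configuration
    return total

# Three binary variables: disconnected network vs. the chain A -> B -> C.
arity = {"A": 2, "B": 2, "C": 2}
print(num_free_parameters(arity, {}))                        # 3
print(num_free_parameters(arity, {"B": ["A"], "C": ["B"]}))  # 5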

Fortunately, equivalence properties among Bayesian networks have been identified [94] that allow for a reduction in the number of different networks, since there exist networks that represent the same independence statements. For example, for networks created with just three variables, there are 25 possible networks (varying arrow orientation and connectivity), but only eleven different equivalence classes (where each class contains the networks reflecting the same dependency model). Let us remark, however, that probabilistic equivalence does not mean causal equivalence: networks in the same class are not equivalent when interpreted causally. See [34] for a discussion of this point.

Finding and selecting a network

Over this search space, some way has to be found to select the “best” network reflecting the dependences existing in the data. Many of the methods that will be reviewed make use of standard statistical sampling methods to derive the needed parameters (structure plus probability distribution).

Here one assumes that a given structure has generated the data, and then some measure of compatibility between that assumption and the probability of obtaining the data has to be devised.

Given a collection of data, during the learning process different networks may be possible alternatives, even after considering the dependency equivalences that may exist among them. In general, methods that resort to quality measures have derived some form of establishing the overall quality of a network in terms of its constituents, reducing quality measures to the sum of the qualities of all child–parent configurations. This is possible thanks to the factorisation property of the distribution, which is inherent to Bayesian networks:

Quality(Network \mid data) = \sum_{x} quality(x \mid pa_x, data),

where pa_x is the set of parents of variable x.

There are other possibilities for choosing among alternative structures. Some methods based on conditional independence tests choose a variable to become a new node according to a previously given order and, when several variables are still eligible, make a random choice.

In general, one can distinguish two large groups of methods [42]. The first are based on the application of conditional independence tests between variables and the construction of the structure of the DAG based on the results of such tests; the conditional probability tables (the quantitative part of the network) are then calculated from the data. The second are methods based on goodness-of-fit tests between the probability distribution of a tentative DAG and the true joint distribution implied by the data. We will review them in that order.

There exists, too, a mixed method, the CB algorithm [80], which first derives a structure by means of CI tests between variables and then generates an order that is fed to the K2 algorithm. One of our current proposals is also a hybrid method but, as we will see, it exploits the relation between CI tests and goodness-of-fit in a different way, incrementally as the DAG is built.

5.1. Conditional independence test methods

Algorithms in this class resort to the qualitative properties of the networks in order to build the corresponding belief network. They usually take as input a set of dependence assertions among variables or sets of variables in the domain. The output is a belief network that reflects those relationships. Let us remember that, given a dependency model, several different networks may reflect the same dependencies up to isomorphism. All of these algorithms return a structure that has to be completed by calculating the associated conditional probability tables.
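When only raw data are available, the input list of dependence assertions is approximated by statistical tests; below is a sketch of such an oracle based on chi-squared tests (scipy assumed; the decision rule and names are ours).

import pandas as pd
from scipy.stats import chi2_contingency

def independent(data, x, y, given=(), alpha=0.05):
    """Approximate the oracle I(x | given | y): declare independence
    when the chi-squared test is not rejected in any configuration
    of the conditioning variables."""
    groups = data.groupby(list(given)) if given else [(None, data)]
    for _, grp in groups:
        table = pd.crosstab(grp[x], grp[y])
        if table.shape[0] > 1 and table.shape[1] > 1:
            _, p_value, _, _ = chi2_contingency(table)
            if p_value <= alpha:
                return False
    return True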

The different structures are, in ascending order of generality: trees, polytrees, singly connected graphs and general DAGs. A polytree is a kind of DAG in which nodes with common ancestors do not share common descendants. The name “polytree” stems from the fact that these structures can be seen as a collection of several causal trees merged together where arrows converge head to head (→ x ←). Singly connected graphs are those graphs that allow a certain kind of cycle: simple cycles.

Definition (simple DAG [30]). A directed acyclic graph is said to be simple if every pair of nodes with a common direct child have no common ancestor, nor is one of them an ancestor of the other.

Definition (simple cycle). A cycle in a graph is simple if any pair of nodes in the cycle that share a direct descendant neither have a common ancestor nor is one of them an ancestor of the other.

It is important to remember that all these methods assume a certain structure for the underlying distribution. It is worth noting that if the true distribution has a different structure, the graphs recovered by the learning methods are still useful as approximations [15]. De Campos carried out a study in which the resulting approximate graphs were compared against the true known structure corresponding to a known database, the ALARM database. He studied the difference in inference accuracy between the approximate graphs and the original one, finding only small variations in the probabilities of the queried nodes.

Each of the following algorithms makes use of one or another of the characteristics of the above-mentioned structures in terms of conditional independences between variables (i.e., nodes). We will comment on them when describing each algorithm in turn.

Simple structures

The following method was devised by Geiger, Paz and Pearl [30,31] in order to build singly connected structures. They assume that the underlying distribution has a polytree form.

The Geiger, Paz and Pearl algorithm (I)
Input: a list of dependences between the variables in a domain U.
Output: a polytree or an error message.
1. Build a complete undirected graph.
2. Build the Markov network G0 by erasing every arc x−y such that I(x | U\{x, y} | y)M.
3. Build GR by erasing every arc such that I(x | ∅ | y)M.
4. Turn each arc x−y in GR into x → y if y has a neighbour z such that I(x | ∅ | z)M and x−z does not belong to GR.
5. Direct the rest of the links without creating new head-to-head connections. If this is not possible, return an error.
6. If the resulting polytree is not an I-map, return an error.

The algorithm makes use of a well-known property of simple graphs: for every chain a−b−c, if b is a head-to-head node (a → b ← c), then one can guarantee that a is marginally independent of c. Note that the number of independence tests needed is at most two for each pair of variables (steps 2 and 3). However, these tests are of a high order, because for every pair of variables in a domain with n variables an independence test of order n − 2 must be made, and such tests are exponential in n. So the algorithm is polynomial, O(n^2), in the number of independence tests but, unfortunately, exponential in the cost of each test.

The same authors put forth another method that is able to recover simple graphs, that is, structures that allow the presence of simple cycles. It is based on a property asserting that a graph represents a dependency model well if, whenever a pair of variables x and y is connected by a path without head-to-head nodes, those variables are marginally dependent.

The Geiger, Paz and Pearl algorithm (II)
Input: a list of dependences between the variables in a domain U.
Output: a simple graph or an error message.
1. Build a complete undirected graph.
2. Erase every arc x−y such that I(x | U\{x, y} | y)M.
3. Erase every arc x−y such that I(x | ∅ | y)M.
4. Turn the arcs x−y and z−y into x → y and z → y whenever x−y−z is in the graph and I(x | ∅ | z)M.
5. Direct the rest of the links without creating new head-to-head connections. If this is not possible, return an error.
6. If the resulting graph does not represent the dependency model well, return an error.

The main difference with respect to the previous algorithm lies in the kind of structure it recovers. It can be seen that the complexity is once more quadratic in the number of tests needed, and that each test requires exponential time.


CH algorithm [42]

This algorithm is devised to recover a special case of network, the causal polytree. Causal polytrees can be seen as simple DAGs in which only a single path exists between any two nodes.

For each node x a set Λx is defined: Λx contains the variables y belonging to U such that x and y are marginally dependent. In a polytree structure, two variables are dependent if there is no head-to-head node in the path connecting them. The idea is to take a variable x and test, for any other two variables y, z in Λx, whether y lies in the path between x and z.

To do that, a new concept, the sheaf of a node, is defined. The sheaf is made up of just the direct parents and descendants of a given node (Huete calls them “direct causes and effects”).

The algorithm tries to create a growing partial structure T, restricting the search for new variables to be included in it to only those variables in Λx that can affect the sheaf of x. De Campos and Huete [14] proved that if x is a variable in a dependency model M represented by a node in a structure T, and z is a variable not in T but belonging to Λx, then the sheaf of x has to be modified only if one of the following conditions holds:

(1) I(x | z | y) is true in M for some y belonging to the sheaf of x;

(2) I(x | y | z) is false in M for all y belonging to the sheaf of x.

Moreover, if the previous conditions do not hold, then there exists one and only one node y belonging to the sheaf of x such that I(x | y | z) is true in M. This last property helps in directing the search for the variables whose sheaf structure has to be modified.

The algorithm starts with an empty structure and then selects a variable x. Next, the variables dependent on x are found (Λx), and a structure T is built with x and the variables in Λx. Then a new variable y in T is repeatedly selected, together with its corresponding set Λy of marginally dependent variables, and for any such z not in T an attempt is made at inserting it in T. This continues until all variables are in T; the sheaves built in this way are illustrated in Fig. 12.

Fig. 12. The variable sheaves built by the CH algorithm.

A polytree structure is constructed in O(n^2) steps, n being the number of variables in U. Only marginal and first-order conditional independence tests are needed. However, no directionality for the arcs is recovered. This was improved in another version of the algorithm [42] by using an independence test on any three variables x, y, z: if x and y were marginally independent but became dependent given z, they were oriented as x → z ← y.

Fig. 11. An example DAG.

5.1.1. Algorithms for recovering DAGs
In order to recover structures more complex than polytrees or singly connected DAGs by means of conditional independence test methods, some other properties relating the structure of general DAGs to independence have to be taken into account.

The following properties are the basis for the next algorithm. A dependency model M is isomorphic to a DAG G iff [89]:

(a) for each pair of vertices x and y in G, x and y are adjacent iff x and y are conditionally dependent given every subset of vertices of G (excluding x and y);

(b) for each triplet of vertices x, y, z such that x and y are adjacent and y and z are adjacent but x and z are not, x → y ← z is a subgraph of G iff x and z are conditionally dependent given every subset of the variables of G that contains y but excludes x and z.

Spirtes, Glymour and Scheines algorithm [89]
Input: a list of dependences between the variables in a domain U.
Output: a directed graph.
1. Build a complete undirected graph H.
2. For every arc x−y, if there exists a subset S of U\{x, y} such that I(x | S | y), erase the arc x−y.




3. Let K be the graph resulting from step 2. For every triplet x−y−z in K such that x−z is not in K, if no subset S of U\{x, z} exists such that I(x | S ∪ {y} | z), then create the orientation x → y ← z.

4. Repeat until no more arcs can be oriented:

4.1. If x → y−z is in K with x and z being non-adjacent nodes, orient y−z as y → z.

4.2. If there exists a directed path from x to y and the connection x−y also exists, orient it as x → y.

Step 2 is critical because it needs to search among all possible subsets of U\{x, y}, which results in an exponential time cost. Note that the independence tests themselves also take exponential time to compute.

An improvement is given by the same authors that carries out the fewest comparisons. It starts with a complete graph and at each step i removes those edges x−y for which there exists an independence relation of order i. In the following, Ad(x) is the set of nodes adjacent to node x.

The PC algorithm
Input: a list of dependences between the variables in a domain U.
Output: a directed graph.
1. Create a complete graph G on the variables in U.
2. n := 0.
3. Repeat until |Ad(x)\{y}| < n for every ordered pair (x, y):

3.1. Repeat until all ordered pairs of adjacent variables (x, y) with |Ad(x)\{y}| ≥ n, and every subset S of Ad(x)\{y} of cardinality n, have been tested for independence:
select an ordered pair of variables x, y adjacent in G such that |Ad(x)\{y}| ≥ n, and select a subset S of Ad(x)\{y} with cardinality n; if I(x | S | y), then erase x−y from G and store S in the sets Separating(x, y) and Separating(y, x).

3.2. n := n + 1.

4. For each triplet of nodes x, y, z where x and y are adjacent and y and z are adjacent but x and z are not adjacent, orient x → y ← z if and only if y does not belong to Separating(x, z).

5. Repeat until no more arcs can be oriented:

5.1. If the structure x → y−z belongs to G, where x and z are not adjacent and no head-to-head arcs point at y, orient y−z as y → z.

5.2. If there exists a directed path from x to y and the arc x−y exists, turn it into x → y.

The complexity of the algorithm depends on the number of nodes adjacent to each node in the graph. If k is the highest number of adjacent nodes for any node in the graph G, then the number of independence tests is bounded by:

\frac{n^2 (n-1)^{k-1}}{(k-1)!}.
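The edge-removal phase of PC (steps 1–3) is compact enough to sketch, assuming an independence oracle such as the chi-squared test outlined earlier in this section; the orientation steps are omitted and the names are ours.

from itertools import combinations

def pc_skeleton(variables, indep):
    """Skeleton phase of PC: remove edges ordered by the size n of the
    conditioning set, storing the separating sets for later orientation.
    indep(x, y, S) is the conditional independence oracle."""
    adj = {v: set(variables) - {v} for v in variables}
    sepset = {}
    n = 0
    while any(len(adj[x] - {y}) >= n for x in variables for y in adj[x]):
        for x in variables:
            for y in list(adj[x]):
                if len(adj[x] - {y}) < n:
                    continue
                for S in combinations(sorted(adj[x] - {y}), n):
                    if indep(x, y, S):
                        adj[x].discard(y)
                        adj[y].discard(x)
                        sepset[(x, y)] = sepset[(y, x)] = set(S)
                        break
        n += 1
    return adj, sepset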

5.1.2. Comments on conditional independence test-based methods

As can be seen, the main difficulty with this type of algorithm is the growing number of higher-order independence tests needed to recover complex structures. The more constrained the structure the underlying distribution is supposed to have, the fewer independence tests are needed. It is worth noting that the CH algorithm allows a great reduction in complexity when recovering simple graphs: the only tests needed are zero- and first-order independence tests.

Let us note here that, although recovering complex structures may seem a better result than recovering simpler ones, the approximation of full DAGs by polytrees or singly connected structures is still interesting. In an ideal situation, the best network that can be recovered from data is the one whose structure is exactly the one that generated the data. When working with data about which very little is known, it is not reasonable to expect to recover the exact network that generated it, so an approximation may be the best one can hope for. Acid and De Campos [1,15] studied the differences in probability values when inferring in a recovered simple DAG and showed that they were not significant. Let us remark, however, that a different DAG structure implies a different causal structure.

5.2. Goodness-of-fit methods and measure-of-quality methods

In the following methods, some assumptions are made for recovering the structure of the network, as was the case in the previous group of conditional independence test methods. The rationale is to assume that a graph exists whose nodes correspond to the variables in a database. Due to the factorisation property of belief networks, it is easy to make an assumption about the probability distribution it induces, PE, as the product of the distributions of the nodes conditioned on their parents. On the other hand, the database allows for the estimation of a joint probability distribution over its variables, PD. What these methods try to attain is a graph that exhibits the minimum distance between PE and PD. Differences between methods centre around the type of graph they are able to recover, the measures used in assessing the distance between distributions or the quality of the distributions, and the way the graph is built applying such measures.

Normally the Kullback–Leibler cross-entropy measure is used as a distance between distributions. The distance D between two distributions is defined as:

D(P_D, P_E) = \sum P_D(x_1, \ldots, x_n) \log \frac{P_D(x_1, \ldots, x_n)}{P_E(x_1, \ldots, x_n)},

where the summation is taken over all instantiations of the xi, 1 ≤ i ≤ n.

5.2.1. Singly connected networks
In order to understand some of the following algorithms, one has to go back to an algorithm by Chow and Liu [8] used for recovering trees from data. To find a tree from data, the maximum weight spanning tree is constructed, using as the weight of each link the following measure of information between variables:

I(x_i, x_j) = \sum P_D(x_i, x_j) \log \frac{P_D(x_i, x_j)}{P_D(x_i) P_D(x_j)},

where the summation is over all instantiations of xi and xj.

Note that this measure is minimum (zero) when the two variables are independent. A theorem by Chow and Liu proved that, given a set of variables (x1, . . . , xn), if the mutual information I(xi, xj) is assigned as the weight of every arc (xi, xj), then the cross entropy over all tree-structured distributions is minimised when the structure is a maximum weight spanning tree.

We give the next algorithm for the sake of our presentation, as it is the basis for Rebane and Pearl's algorithm [74].

Chow and Liu tree recovery algorithm
Input: a database on variables x1, . . . , xn.
Output: a tree structure reflecting the dependences in the database.
0. T := ∅, the empty tree.
1. Calculate for every pair (xi, xj) the bidimensional marginal distribution P(xi, xj).
2. Calculate the weight of the link between every pair (xi, xj) by means of the I(xi, xj) measure.
3. Select the maximum weight pair (xi, xj) and add it to the tree: T := T ∪ {(xi, xj)}.
4. Select the next maximum weight pair (xi, xk). If it does not create a cycle in T, add it to T; else erase (xi, xk) from the set of pairs.
5. Repeat step 4 until n − 1 links have been included.

Note that the algorithm was initially designed for recovering trees. It is important to realise that, for a given distribution, it can recover different trees depending on the order in which pairs with the same weight are selected. No arcs are oriented: a skeleton of the tree is recovered. The cost of the algorithm is O(n^2 log n).
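Under the same assumptions, the whole procedure can be sketched with empirical mutual information and a Kruskal-style tree construction (pandas and numpy assumed; names ours).

import numpy as np
import pandas as pd

def mutual_information(data, xi, xj):
    """Empirical mutual information I(xi, xj) from frequency counts."""
    joint = pd.crosstab(data[xi], data[xj], normalize=True).to_numpy()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def chow_liu_skeleton(data):
    """Maximum weight spanning tree over pairwise mutual information."""
    cols = list(data.columns)
    edges = sorted(((mutual_information(data, a, b), a, b)
                    for i, a in enumerate(cols) for b in cols[i + 1:]),
                   reverse=True)
    component = {v: v for v in cols}          # naive union-find
    def find(v):
        while component[v] != v:
            v = component[v]
        return v
    tree = []
    for w, a, b in edges:                     # steps 3-5
        ra, rb = find(a), find(b)
        if ra != rb:                          # skip cycle-creating pairs
            component[ra] = rb
            tree.append((a, b, w))
    return tree                               # n - 1 undirected links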

Rebane and Pearl algorithm

In contrast with the previous algorithm, this one is able to give an orientation to the links. In a first step it creates the skeleton of the graph (i.e., the result of the Chow and Liu algorithm) and then tries to orient as many links as possible. Note that several probabilistic dependence relations are indistinguishable in terms of orientation. As we said before, given three variables x, y, z, one can test whether they are related by the structure x → y ← z by checking if x is marginally independent of z. However, x ← y → z cannot be distinguished from x → y → z or x ← y ← z.

Rebane and Pearl polytree recovery algorithm
Input: a database on variables x1, . . . , xn.
Output: a partially oriented graph.
1. T := the maximum weight spanning tree resulting from the Chow and Liu algorithm.
2. Select x, y, z with x−y−z in T such that I(x | ∅ | z). Give them the orientation x → y ← z.
3. If a subgraph with more than one parent is found, apply the test of marginal independence to the adjacent nodes.
4. For each node with at least one incoming arc, study the orientation of the rest of the adjacent links by means of the marginal independence test.
5. Repeat steps 2 to 4 until no new orientations can be found.
6. If some links still have no orientation, label them as “undetermined”.

Note that Chow and Liu's algorithm can be used with several measures of dependence. Acid [1] showed that any measure satisfying the following property can be used as a dependency degree instead of the mutual information function put forth by Chow and Liu.

Dependency measure property
Given three variables x, y and z such that x and z are conditionally independent given y, any dependency measure Dep(·, ·) such that

min(Dep(x, y), Dep(y, z)) ≥ Dep(x, z)

can be used with the Chow and Liu algorithm.

This was used in the CASTLE system [2] for learning Bayesian belief networks based on different dependence degrees. In CASTLE, two variables are taken to be independent if their dependency degree is less than a threshold ε fixed by the user. Some of the measures used by the system are the following:

Dep(X, Y) = \sum_{X,Y} p(X, Y) \log \frac{p(X, Y)}{p(X) p(Y)},

Dep(X, Y) = \frac{\sum_{X,Y} p(X, Y) \log \frac{p(X, Y)}{p(X) p(Y)}}{-\sum_{X,Y} p(X, Y) \log p(X, Y)},

Dep(X, Y) = \sum_X \sum_Y |p(X, Y) - p(X) p(Y)|,

Dep(X, Y) = \sum_X \sum_Y p(X, Y) |p(X, Y) - p(X) p(Y)|,

Dep(X, Y) = \sum_X \sum_Y p(X, Y) |p(X, Y) - p(X) p(Y)|^2,

Dep(X, Y) = \max_X \max_Y |p(X, Y) - p(X) p(Y)|.
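For instance, the third measure in the list can be computed directly from a joint frequency table, as in the following sketch (numpy assumed; the helper name is ours):

import numpy as np

def dep_l1(joint):
    """Dep(X, Y) = sum over x, y of |p(x, y) - p(x) p(y)|; it is zero
    exactly when the empirical joint distribution factorises."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    return float(np.abs(joint - px * py).sum())

print(dep_l1([[2, 1], [1, 2]]))  # mildly dependent: 1/3
print(dep_l1([[1, 1], [1, 1]]))  # independent: 0.0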

Let us remark that these measures assess the difference between the joint probability distribution of two variables and the product of their marginal distributions, which would coincide if the variables were independent. This is in accord with the known relationship that holds between joint and marginal distributions in probability theory but, as we will see, the same idea has a different expression in other uncertainty formalisms.

5.2.2. DAG algorithms
Cooper and Herskovitz entropy-based method
Cooper and Herskovitz [38] devised a method which takes as a quality criterion a measure of the entropy of the distribution implied by the structure being built.

The Kutato algorithm
Input: a database on variables x1, . . . , xn; a fixed entropy threshold α; an order between the variables.
Output: an oriented DAG or an error code.
1. Build a DAG on x1, . . . , xn, assuming all variables to be marginally independent.
2. Calculate the DAG entropy.
3. Select a link such that
(1) it creates no cycle;
(2) it creates a new graph G′ with minimum entropy;
(3) it links variables x, y such that x comes first in the order.
4. Give the link the orientation x → y.
5. Repeat steps 2 to 4 until the entropy level α is reached.

The complexity of the method can be estimated as follows: given a DAG with n nodes, O(n^2) comparisons have to be made in order to select the best link. If all associations are significant, the process is repeated O(n^2) times, so the complexity (entropy calculations aside) is O(n^4).

The entropy of a DAG G is calculated from the local entropy of each node's instantiations given its parents, weighted by the probability that the parents take each value instantiation:

H(G) = -\sum_{x_i \in U} \sum_{pa_i(x_i)} P(pa_i(x_i)) \sum_{x_i} P(x_i \mid pa_i(x_i)) \log P(x_i \mid pa_i(x_i)),

where the second and third summations run over the instantiations of pai(xi) and of xi, respectively.
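A sketch of this entropy, with all probabilities estimated by counting in a complete discrete database (pandas assumed; the helper name is ours):

import pandas as pd
from math import log2

def dag_entropy(data, parents):
    """DAG entropy: for each node, the conditional entropy of the node
    given its parents, weighted by the empirical probability of each
    parent configuration. parents: dict variable -> list of parents."""
    n = len(data)
    total = 0.0
    for x, pa in parents.items():
        groups = data.groupby(list(pa)) if pa else [(None, data)]
        for _, grp in groups:
            w = len(grp) / n                 # P(pa(xi) = configuration)
            for cnt in grp[x].value_counts():
                p = cnt / len(grp)           # P(xi | pa(xi))
                total -= w * p * log2(p)
    return total

# Kutato greedily adds the link whose insertion yields the
# minimum-entropy graph among the candidates.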

A Bayesian-based method: the K2 algorithm
Cooper and Herskovitz [12] devised an improved version of the previous algorithm which resorts to Bayesian criteria. Given a database D with information on n variables and a Bayesian network Bs with n nodes (one for each variable in D), one has to find the Bayesian network that maximises the probability P(Bs | D). They approximate P(Bs | D) by P(Bs, D).

Cooper and Herskovitz' merit lies in finding a sound way of calculating the whole probability of a DAG in terms of local parent–children subgraphs.

Their measure is:

P(B_s, D) = P(B_s) \prod_{i=1}^{n} g(x_i, pa_i(x_i)),

where pai(xi) is the set of parents of the variable xi and g(xi, pai(xi)) is:

g(x_i, pa_i(x_i)) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!,

where, for each variable xi: ri is the number of its possible instantiations; N is the number of cases in the database; wij is the j-th instantiation of pai(xi) in the database; qi is the number of possible instantiations of pai(xi); Nijk is the number of cases in D in which xi takes the value xik with pai(xi) instantiated to wij; and Nij is the sum of the Nijk over all values of k.

A prior assumption of this method is that all structures are equally probable. An order between the variables is given: if xi precedes xj in that order, all structures with an arc from xj to xi are removed, further reducing the possible alternatives. A further restriction is that the number of parents a given node can take, u, is low.

The K2 algorithm proceeds by starting with a single node (the first variable in the order) and then takes the node that most increments the probability of the given structure, calculated by means of the g function. When adding a new parent does not increment the probability, no more nodes are added to the parent set.

The K2 algorithm
Input: a set of variables x1, . . . , xn; a given order among them; an upper limit u on the number of parents of a node; a database on x1, . . . , xn.
Output: a DAG with oriented arcs.
For i := 1 to n do
1. pai(xi) := ∅; Ok := true;
2. Pold := g(xi, pai(xi));
3. While Ok and |pai(xi)| < u do
3.1. let z be the node in the set of predecessors of xi not belonging to pai(xi) which maximises g(xi, pai(xi) ∪ {z});
3.2. Pnew := g(xi, pai(xi) ∪ {z});
3.3. if Pnew > Pold then Pold := Pnew; pai(xi) := pai(xi) ∪ {z};
else Ok := false.

Execution time is of the order O(N u^2 n^2 r), with r being the maximum of the ri. There exists an extension for dealing with continuous variables and missing values.
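Because the factorials in g overflow even for small databases, implementations usually work with log g through the log-gamma function; a sketch under the same counting conventions (pandas assumed; the helper name is ours):

from math import lgamma
import pandas as pd

def log_g(data, xi, parents):
    """log of the Cooper-Herskovitz score g(xi, pai), computed with
    log-gamma: lgamma(m + 1) = log m!. Parent configurations absent
    from the data contribute a factor of 1 and can be skipped."""
    r = data[xi].nunique()
    groups = data.groupby(list(parents)) if parents else [(None, data)]
    total = 0.0
    for _, grp in groups:
        nij = len(grp)
        total += lgamma(r) - lgamma(nij + r)   # (ri-1)! / (Nij+ri-1)!
        for nijk in grp[xi].value_counts():
            total += lgamma(nijk + 1)          # product of the Nijk!
    return total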

Other methods have been devised that make use of Bayesian metrics. Let us mention the BDe metric, proposed by Heckerman, Geiger and Chickering [35]. The importance of this metric lies in the kind of assumptions it is based upon, which are especially significant for the recovery of networks with a causal interpretation. They devised a method that can admit a Bayesian belief network as a priori knowledge, thus opening the possibility of creating a more cognitively acceptable final network (as it is guided by previous knowledge). Being a method based on a Bayesian metric, its target is to find the belief network with maximum probability given the data. In doing so, they make explicit several important assumptions of most Bayesian learning methods.

These assumptions are the following. The database is a multinomial sample from some belief network; this implies that the variables are discrete. Secondly, the user may not be sure about the belief network that is generating the data. Thirdly, the user may be uncertain about the conditional probabilities in the network, the parameters of the model to be found. Fourthly, the process is constant over time (an assumption called “stability” by Pearl [70]). Parameters are assumed to be independent. Databases are complete, i.e., all variables in a database are observed. Finally, if a variable xi has the same parents in any two belief networks, its probability density is the same: it depends only on the parents of xi. Heckerman et al. also show that the parameters to be learned follow a Dirichlet distribution.

Taking all these assumptions into account, they derived a metric that relates the joint probability of a given database D and a Bayesian network Bs, and called it the BD metric (Bayesian metric with Dirichlet priors):

P(D, B_s) = P(B_s) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})},

where ri is the number of states of variable xi; Πi is the set of parents of xi (denoted in this same paper as pai); qi is the number of instantiations of Πi; Γ is the gamma function;

N_{ij} = \sum_{k=1}^{r_i} N_{ijk}, \qquad N'_{ij} = \sum_{k=1}^{r_i} N'_{ijk};

Nijk is the number of cases in database D where xi = k and pai = j; and N′ijk is an exponent of the Dirichlet distribution such that N′ijk > 0, that is, the theoretical number of cases where xi = k and pai = j in the population.

The same authors then adapted their metric so that isomorphic Bayesian networks receive the same score. The resulting metric is called the BDe metric (score-equivalent metric for Bayesian belief networks). The user has to specify the a priori probability distributions over Bayesian networks as well as the densities of the parameters.

Using the minimum description length principle
The idea behind the following methods is the use of the Minimum Description Length principle as a measure of fit. The best representation for a database is the model that minimises its description length, that is, the representation that minimises, given a coding schema, the sum of the lengths of encoding:

– the model;
– the data, given the model.

Lam and Bacchus [52–54] have given coding schemas for networks as binary strings.

Network codification
For each node (variable), a list of its parents is needed, together with the list of the conditional probabilities of the variable given the parents; then, for a graph with n nodes, the total model description length is:

\sum_i \big[ |pa_i(x_i)| \log_2 n + d (r_i - 1) q_i \big],

where |pai(xi)| is the number of parents of a given node xi; d is the number of bits needed to represent a numerical value; ri is the number of different values xi can take; and qi is the number of possible instantiations of the set of parents of xi. Consequently, more connected networks have longer descriptions.
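A sketch of this model-length term (the number d of bits per numerical value is left as a caller-supplied assumption; names ours):

from math import log2

def network_description_length(arity, parents, d=32):
    """Model part of the MDL score: |pa(xi)| * log2(n) bits to list the
    parents, plus d * (ri - 1) * qi bits for the conditional tables."""
    n = len(arity)
    total = 0.0
    for x, r in arity.items():
        pa = parents.get(x, [])
        q = 1
        for p in pa:
            q *= arity[p]
        total += len(pa) * log2(n) + d * (r - 1) * q
    return total

# Denser structures pay a longer model code:
arity = {"A": 2, "B": 2, "C": 2}
print(network_description_length(arity, {"C": ["A", "B"]}))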


Database codification
Data are encoded by representing all the values that appear in the database as a single binary string, using a Huffman code of length:

-N \sum p(x_i) \log_2 p^*(x_i),

where N is the number of cases in the database, p(xi) is the probability of occurrence of the atomic event xi and p∗(xi) is the probability of the same event as calculated from the network representing the model.

Such an encoding requires an exponential number of bits, so they take advantage of the factorisation property of Bayesian networks and calculate instead the following number:

-N \sum_i H(x_i, pa_i(x_i)) + N \sum_i \Big[ -\sum p(x_i) \log_2 p(x_i) \Big],

where H(xi, pai(xi)) is:

H(x_i, pa_i(x_i)) = \sum p(x_i, pa_i(x_i)) \log_2 \frac{p(x_i, pa_i(x_i))}{p(x_i)\, p(pa_i(x_i))},

the summation being over all instantiations of xi and of its parents.

The total description length for a given node is then:

DL_i = |pa_i(x_i)| \log_2 n + d (r_i - 1) q_i - N H(x_i, pa_i(x_i)).

The total description length of a DAG is calculated by summing DLi over all variables.

In order to search the space of possible networks, a best-first search method is used. Separate sets of candidate graphs are maintained; these sets differ in the number of arcs their graphs have. For a database with n variables, DAGs can have between 0 and n(n − 1)/2 arcs, so n(n − 1)/2 + 1 separate sets are maintained. Within each set a best-first search is performed. Each element of a set has two components: a candidate network with the number of arcs corresponding to the set, and a pair of nodes between which an arc could be added without causing a cycle. The search takes the total description length of the graph as its heuristic.

Before starting the search, mutual information (as in the Chow and Liu algorithm) is calculated for each pair of nodes and the links are sorted accordingly. Let us remark that Chow and Liu's measure was extended by Lam and Bacchus in order to be able to recover more general graphs (remember that Chow and Liu's original work was aimed at the recovery of tree-structured distributions).

The mutual information measure defined by Lam and Bacchus relates a variable xi and its parent set Fxi:

W(x_i, F_{x_i}) = \sum_{F_{x_i}} P(x_i, F_{x_i}) \log \frac{P(x_i, F_{x_i})}{P(x_i) P(F_{x_i})}.

Lam and Bacchus prove that the cross entropy between the joint distribution implied by the database and the factored joint distribution of the DAG is minimised when the W(xi, Fxi) measure is maximised.

5.2.3. Comments on goodness-of-fit methods
As we mentioned above, goodness-of-fit methods are aimed at finding a structure that minimises the distance between the real distribution implied by the data and the factored distribution corresponding to a network, or that maximises a certain quality criterion. It is important to notice, however, that such an approach tends to favour networks that are too dense and in which the interpretation of links is somewhat counterintuitive.

Lam and Bacchus favour a method that recovers less dense networks. The authors try to find networks that, although being close in distribution terms, are as simple as possible within the DAG model. They have found experimentally that the recovered networks differ very little in belief updating results from the correct ones. From the causal point of view, however, there are differences in structure that can imply a great difference in causation.

Cognitively, goodness-of-fit methods tend to give qualitative structure a secondary role. This family of methods tends to add to the DAG arcs that correspond to very weak dependencies, which results in very entangled graphs that are difficult for humans to understand. Probably a good alternative would be to allow some previous knowledge to be added in an understandable form, such as a causal network. Actually, only some algorithms exist in which it is possible to specify prior probabilities on arcs, but not on complete networks or on a tentative partial DAG structure. Lam and Bacchus offered a way of refining existing DAGs with new knowledge that could be a possible development in this direction [53].

It seems that combining CI-test based methods and goodness-of-fit methods could bring a balance, in the sense that the recovered DAG would exhibit a correct structure with respect to conditional independence properties while at the same time having the joint uncertainty distribution closest to the data.


5.3. Hybrid algorithms

Singh and Valtorta [80,81] have devised an algorithm that follows a two-step procedure. In the first step, it performs a series of conditional independence tests and obtains an ordering among the variables; then it starts the K2 algorithm. The first part is based on work by Verma and Pearl [93] and by Spirtes, Glymour and Scheines [89].

The CB algorithm
Input: u, a limit on the number of parents a node may have; a set Z of n variables x1, . . . , xn.
Output: a DAG.

1. Start with the complete graph G1 on the set of variables Z;
ord := 0;
oldpai(xi) := ∅ for each i, 1 ≤ i ≤ n;
oldprob := 0.

2. Modify G1 as follows:
for each pair of vertices a, b adjacent in G1, if Ad_G1(a)\{b} has cardinality greater than or equal to ord and I(a | Sab | b) holds for some Sab contained in Ad_G1(a)\{b} of cardinality ord, remove the edge a−b and store Sab.
If |Ad_G1(a)\{b}| < ord for all pairs of adjacent vertices a, b in G1, go to step 10.
If the degree of G1 is greater than u, then ord := ord + 1 and go to the beginning of step 2.

3. Let G be a copy of G1.
For each pair of non-adjacent variables a, b in G, if there is a node c that is not in Sab and is adjacent to both a and b, then orient the edges as a → c and b → c, unless this creates a cycle. If an edge has already been oriented in the reverse direction, make it bidirected.

4. Try to assign directions to the yet undirected edges in G by applying the following rules:
R1: if a → b and b−c, and a and c are not adjacent, then direct b → c;
R2: if a → b, b → c and a−c, then direct a → c;
R3: if a−b, b−c, a−c, c−d and d → a, then direct a → b and c → b.
Moreover, if a → b, b → c and a ↔ c, then direct a → c.

5. Let pai(xi) := ∅ for every i, 1 ≤ i ≤ n.
For each node xi, add to pai(xi) the vertices xj such that there is an edge xj → xi in the partially directed graph G.

6. For each undirected or bidirected arc in the partially directed graph G, choose an orientation as follows. If xi−xj is an undirected edge and pai(xi) and paj(xj) are the corresponding parent sets in G, calculate the following products:

ival = g(x_i, pa_i(x_i)) \times g(x_j, pa_j(x_j) \cup \{x_i\}),
jval = g(x_j, pa_j(x_j)) \times g(x_i, pa_i(x_i) \cup \{x_j\}),

where g is the measure defined by Cooper and Herskovitz for K2. If ival > jval, then add xi to paj(xj) (that is, orient xi → xj) unless the addition of xi → xj creates a cycle; in that case, choose the reverse orientation and update pai(xi) instead. Proceed analogously, exchanging the roles of xi and xj, if jval > ival.

7. The sets pai(xi) obtained in step 6 define a DAG. Generate an order by performing a topological sort on it.

8. Apply the K2 algorithm to find the set of parents of each node using the order obtained in step 7. Let pai(xi) be the set of parents found by K2 for node xi, and let

newprob := \prod_i g(x_i, pa_i(x_i)).

9. If newprob > oldprob then
oldprob := newprob;
ord := ord + 1;
oldpai(xi) := pai(xi) for every xi, 1 ≤ i ≤ n;
discard G and go to step 2.
Else go to step 10.

10. Output oldpai(xi) for every xi, 1 ≤ i ≤ n, and output oldprob.


The fact that the authors use a chi-square test for testing dependence at a fixed α level may induce dependences that are a product of chance. The quality of the network hinges critically on the order extracted in the first phase of the algorithm. However, according to their experimental results, the quality of the recovered networks is high in terms of structure and closeness of distributions.

5.4. Algorithms for non-probabilistic formalisms

Gebhardt and Kruse [33] have developed several algorithms to retrieve possibilistic DAGs. They base their search methods on a heuristic expressed in terms of non-specificity, the counterpart of entropy in possibility theory, and try to find the network that minimises the expected non-specificity. They define non-specificity in terms of the Hartley information measure [49].

Given a set A, the Hartley information measure is defined as:

H(A) = \log_2 |A|.

Then, given a DAG D on a set of variables x1, . . . , xn, its total non-specificity is:

Nonspec(D) = \sum_i H(x_i \mid pa_i(x_i)),

where pai(xi) is the set of parents of xi.

The authors developed a greedy algorithm that looks for the network with minimum expected non-specificity among all possible networks. It starts with a single-node graph and at each step adds the link with minimum non-specificity. A node ordering is explicitly used in selecting the next node to be considered.

5.4.1. HCS: a hybrid algorithm for recovering possibilistic networks
In our recent work we have developed a new hybrid algorithm for recovering possibilistic networks. It is based on De Campos and Huete's CH algorithm, and it uses a measure of non-specificity to choose among possible subgraphs.

Non-specificity is currently modelled according to Klir's [49] definition of the U-uncertainty information function, which is a measure of non-specificity.

Definition (U-uncertainty). Given a variable X with domain {x1, . . . , xn} and an associated possibility distribution πX(xi), the U-uncertainty of the distribution is:

\int_0^1 \log_2 |X_\rho| \, d\rho,

where Xρ is the ρ-cut set of X, that is, Xρ = {xi | πX(xi) ≥ ρ}.

Definition (joint U-uncertainty). Given a set of variables with associated possibility distributions πx1, . . . , πxn, their joint non-specificity is:

\int_0^1 \log_2 |X_{1\rho} \times \cdots \times X_{n\rho}| \, d\rho.

Definition (conditional U-uncertainty). Given two variables X and Y with associated possibility distributions πX, πY, their conditional U-uncertainty is:

\int_0^1 \log_2 \frac{|X_\rho \times Y_\rho|}{|Y_\rho|} \, d\rho.

Definition (DAG parent–children U-uncertainty). Given a DAG on the domain {x1, . . . , xn}, for any given variable xi with parent set pai, the parent–children U-uncertainty is:

U(x_i \mid pa_i) = U(x_i, pa_i) - U(pa_i).

Definition (DAG non-specificity). Given a DAG D on a domain U = {x1, . . . , xn}, the DAG non-specificity is defined as:

U(D) = \sum_{x_i \in U} U(x_i \mid pa_i).
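For finite domains the integral reduces to a finite sum over the distinct possibility levels, as in the following sketch (helper name ours; the distribution is assumed normalised, i.e., its maximum value is 1):

from math import log2

def u_uncertainty(pi):
    """U-uncertainty of a finite possibility distribution pi (a dict
    value -> possibility in [0, 1]). The integrand log2 |X_rho| is
    piecewise constant between consecutive possibility levels."""
    levels = sorted(set(pi.values()) | {0.0})
    total = 0.0
    for lo, hi in zip(levels, levels[1:]):
        cut = sum(1 for v in pi.values() if v >= hi)  # |X_rho| on (lo, hi]
        total += (hi - lo) * log2(cut)
    return total

# A crisp set of four equally possible values carries log2(4) = 2 bits:
print(u_uncertainty({"a": 1, "b": 1, "c": 1, "d": 1}))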

Now, our hybrid algorithm is also based on a dependency measure between variables. Because the information is extracted directly from data and is not supplied by an expert in the form of a dependency list, a graded measure of dependence is proposed and used. This dependency measure is based on the similarity between distributions before and after conditioning: the more similar the two distributions are, the less dependent the variables are. The similarity measure is quite simple. It measures how much each value of a given variable influences the modification of the distribution of the conditioned variable. A threshold is set in order to allow for some imprecision in the similarity. That is, for a given threshold α and two variables X and Y, only differences in possibility greater than α will be taken into account when measuring the similarity between π(X) and π(X | Y). Then the differences greater than α are summed, and the resulting value is averaged over the number of values of Y. A second limit γ is fixed in order to decide when two variables are to be taken as dependent.


Definition (conditional dependency degree). Given two variables X and Y with joint possibility distribution π(X, Y) and a real threshold α, the dependency between X and Y at level α, Dep(X, Y, α), is defined as:

Dep(X, Y, \alpha) = \frac{1}{|Y|} \sum_{y_i \in Y} \pi(y_i) \sum_{\substack{x_i \in X \\ |\pi(x_i) - \pi(x_i \mid y_i)| > \alpha}} |\pi(x_i) - \pi(x_i \mid y_i)|.
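A sketch of this degree, following our reading of the reconstructed formula above (distributions are plain dictionaries; all names are ours):

def dep_degree(pi_x, pi_y, pi_x_given_y, alpha):
    """Graded possibilistic dependency Dep(X, Y, alpha): differences
    between prior and conditioned possibilities above alpha, weighted
    by pi(y) and averaged over the number of values of Y."""
    total = 0.0
    for y, py in pi_y.items():
        shift = sum(abs(pi_x[x] - pi_x_given_y[y][x])
                    for x in pi_x
                    if abs(pi_x[x] - pi_x_given_y[y][x]) > alpha)
        total += py * shift
    return total / len(pi_y)

# X and Y are declared dependent when dep_degree(...) exceeds the
# second threshold gamma fixed by the user.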

Now, this measure of similarity between π(X) and π(X | Y) before and after conditioning is applied to the data in order to derive a model of dependence between the variables in the domain. With this model of dependences, a variation of the CH algorithm is used. In this version, several decisions on the orientation of subgraphs are made by resorting to the parent–children non-specificity, in order to decide which orientation is best for a given pair of variables x, y that are known to be dependent: x → y or y → x.

The algorithm creates the sheaves corresponding to each variable in the domain, orients them by using the U-uncertainty measure and then merges the resulting subgraphs to obtain the final DAG, which is a singly connected graph.

In using a non-specificity measure with a CI-test based algorithm, we replicate some of the advantages of the hybrid algorithms, but in a possibilistic setting: the structure of the graph adheres to the properties of conditional independence, and the U-uncertainty measure ensures that we recover the most specific graph given the data. Actual experiments show that the possibilistic version of HCS is more robust to imprecision in the data than the probabilistic version; this may be due to the low reliability of the chi-square test when data are scarce.

Several improvements are on their way to extend the same idea to general DAGs. We are also creating a version based on a cross non-specificity measure.

Note that for possibility theory there exists an equivalent of cross entropy, defined by Ramer [73], which is appropriately called “cross non-specificity”.

Definition (cross non-specificity). Given two possibility distributions π(X) and π(Y), their cross non-specificity is:

\sum_i |\pi(x_i) - \pi(y_i)| \, \frac{\log(n + i)}{\log(n + 1)},

where i indexes all the values of the sets X and Y.

To the best of our knowledge, no algorithm devised for the recovery of possibilistic networks makes use of this property. We are currently modifying our own work to incorporate such a measure of distance between distributions. An extension for incomplete data sets is also planned.

5.4.2. Recovering networks based on probability intervals
Huete [42] has developed a method based on CI tests that is applicable to uncertainty formalisms other than probability. It is clear from his work that his method can be used, at least, with probability intervals. He and De Campos defined a measure of dependence between probability interval distributions [14] and then modified Chow and Liu's algorithm in order to use it with this uncertainty formalism. They also extended Rebane and Pearl's algorithm to this setting. With respect to complexity, both algorithms exhibit the same behaviour as their probabilistic counterparts.

5.5. Discussion on belief network learning algorithms

In general, methods relying on goodness-of-fit heuristics need previous knowledge in the form of an order between the variables. Buntine [3] has devised a method that needs no order but, in exchange, requires an external expert to specify priors on probability distributions. As we have seen, Heckerman et al. [34] have also devised a method of this kind, where some previous knowledge in the form of priors on distributions has to be fed to the algorithm.

Goodness-of-fit methods tend to give more than one resulting network, ranked in terms of probability. There can be several networks with the same probability, given the equivalence properties of belief networks.

Methods based on conditional independence test criteria are more abstract in the sense that they only recover structure, to which conditional distributions are to be added in order to get a belief network. Many different formalisms can be used to represent the uncertainty, so, in principle, such methods are more abstract than the other ones. However, they depend heavily on the dependence list provided at the beginning, and some of them resort to probabilistic notions in order to decide on arc orientation. They also need a substantial amount of data to deliver reliable CI tests when using probability theory.


The need for using different uncertainty formalisms arises on many occasions, for example when data are scarce or fraught with imprecision, as is the case with data coming from sensors. Probability intervals and possibility distributions are good alternatives for representing the uncertainty in these systems; these uncertainty calculi are also more robust to incomplete data.

New methods for measuring possibilistic information in data have been developed recently by Josslyn [47]. Josslyn also defined the terms “possibilistic process” and “possibilistic model”, in contrast to “stochastic process”, and stressed the need to develop a network representation for them. The connection between these methods and the work by Fonck, Huete, Gebhardt and others is still to be made.

We will now examine briefly other methods used in recovering models that have a causal network representation.

5.6. Learning path models

The automatic construction of path models is the purpose of Cohen et al. [9]. To the best of our knowledge, no other algorithm has been created for recovering path models.

Cohen's algorithm follows a best-first strategy. The search space is the set of all path models satisfying the constraints expressed in the corresponding section above. Path models being graphs, they are represented by means of adjacency matrices. With n variables, the size of the required matrix is n × n and the search space is of size 2^{n^2}.

The algorithm begins with a matrix which only contains the dependent variables of the model. At each step, a single arc to a variable in the graph is tested for inclusion. A list of all possible models together with all possible modifications is kept. The best modification is selected and applied to one of the possible models; once applied, it is evaluated and inserted into the list of models. The process continues until an acceptable model is obtained or no more significant improvements can be made.

The evaluation function is based on the R^2 statistic, which measures the percentage of variance in the dependent variable due to the independent variables. If the value of R^2 is low, it is because there exist some other variables influencing Y and, so, the resulting model is not very good in explanatory terms. Theoretically, the best-scoring model is the full regression model, because in it all independent variables are correlated and point to the dependent variable.
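A sketch of this evaluation step (numpy assumed; the fit itself is ordinary least squares and the helper name is ours):

import numpy as np

def r_squared(X, y):
    """Share of the variance of y explained by a least-squares fit on X."""
    X = np.column_stack([np.ones(len(y)), X])      # intercept term
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = ((y - X @ coef) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Candidate path models are ranked by this value; a low R^2 signals
# missing influences on the dependent variable.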

6. Learning “true” causal networks

Recently there has been a move towards devising methods to learn networks that embody some kind of causal concepts. The general proposals of Heckerman and Pearl, as well as Cooper's list of relations for characterising “true” causal relations, are an improvement over previous hasty identifications of belief networks with causal networks, in the sense that they are semantically more correct. All of them rely, to some extent, on a concept of causation linked to the idea of external intervention. However, they have different implications for learning algorithms.

The construction of a causal network, in Heckerman's work, is equated to the construction of an influence diagram in Howard Canonical Form [40]. However, during this process, the predecessors of every utility node are known with certainty by the decision maker, and so is the structure of the arcs. The sole task that remains, then, is to assess the physical probabilities associated with the chance nodes. Moreover, all states of the decision nodes are known in advance by the decision maker, so little room is left for a learning method. The problem reduces to learning a Bayesian network where decision variables are interpreted as chance variables.

In brief, what Heckerman et al. [37] posit is the problem of learning Howard Canonical Form influence diagrams given that the structure is known; that is, only the parameters of the structure are to be learnt. What they finally propose is to start learning such parameters from an a priori network given by the decision maker.

In Heckerman's proposal, it is important to start with clear knowledge about which variables in the data are decision variables and which are chance variables. This amounts to knowing some structure of the domain beforehand. Moreover, there is a need to specify a priori knowledge in the form of priors on distributions. More often than not, what experts do know well is the existence of qualitative relations between variables, or constraints that they have to reflect; in contrast, experts are very poor estimators of probability distributions [48]. So such a method could very well be applied to data coming not from passive observation but from experiments, where one knows a priori which variables are controlled.

In Pearl's calculus-of-intervention proposal, the distinction between decision and chance variables does not exist and, in principle, it could be applied to observational data, as Pearl has suggested and argued convincingly elsewhere [64,66]. His idea is that causal relations can be deduced from DAG structure and observational data by several graphical criteria. To learn structures reflecting the whole formalisation of Pearl (i.e., the one where each child–parents cluster has a deterministic function interpretation), some hurdles have to be removed. The difficult part of the question lies in assessing the form of the functional ties now replacing the links in the parent–children subgraphs. Learning functions from observational data is no easy task: only some kinds of linear combinations of simple functions have been derived from data by systems like BACON. An improvement on those results are methods that learn partitioned sets of equations, as the FAHRENHEIT system [97] does.

If one sets oneself the task of learning causal networks with just observational data, it is clear that the correct approach must follow the lines of Pearl's proposal. This could be done, in principle, by using Pearl's conditions for causal effect identification as a test for true causal associations in the recovered DAG. Such knowledge could be incorporated in a learning system as a critiquing module that could prune spurious associations from the DAG.

Referring to Cooper's criteria for distinguishing causal relations, he has not provided so far any application of them to learning causal networks. However, it may be as simple as building a belief network by any of the above-mentioned methods and then applying a test to check which of the seven relations hold in the learned network.

No study has been carried out to identify whether any of these characterisations of causality can be transferred to other uncertainty formalisms. Let us remark, however, that both Pearl's and Cooper's definitions rely on graphical conditions, so they may be used to identify causality under other formalisms.

7. Summary and conclusions

We have reviewed several representations of causalmodels used in AI settings in order to identify thecommon characteristics of all of them and to explorethe ways in which known learning algorithms can beapplied to several formalisms.

It has been seen that graphical models and pathmodels, constructs that have their roots in statisticaltechniques, have a limited ability for discovery if they

do not use previous knowledge. They seem to beuseful only with experimental data.

It is important to note the central role of Bayesiannetworks and the basic notion of conditional indepen-dence in all these representations. In effect, the no-tion of conditional independence appears in modelswhere uncertainty is not represented by probability.Algorithms for recovering belief networks based onnon-probabilistic representations of uncertainty existand the corresponding structures obey the indepen-dence axioms. Many other formalisms have been putinto correspondence with Bayesian belief networksin causal domains, most notably, Simon’s ideas oncausal order, which Simon himself cast in terms ofthe Bayesian belief network representation.

We have not mentioned several aspects of learningthat are of interest when dealing with real data. Mostalgorithms only work well with discrete variables andcomplete data. There are many cases in the real worldwhere observations are noisy or missing. Some tech-niques do exist in the case of probabilistic formalismsfor solving these practical problems. In our experi-ence, possibilistic representations perform much bet-ter in the presence of noisy or imprecise data, in thesense that the learning algorithms tend to be morerobust than their probabilistic counterparts.

To the best of our knowledge, no other algorithm has been developed for extracting non-probabilistic structures (such as belief function representations) from data. A remarkable exception is de Campos and Huete's work on probability interval-based belief networks.

There is a great deal of research to be done in the direction of finding which of the conditions of causal identifiability put forth by Pearl can be translated into formalisms other than probability. This is important if the critiquing approach to causal network pruning is to be applied in all possible uncertainty representations and formalisms.

Acknowledgements

The authors wish to thank the anonymous reviewersfor their helpful comments and suggestions.

References

[1] S. Acid and L.M. de Campos, Approximation of causal networks by polytrees, in: Proceedings of Information Processing and Management of Uncertainty in Knowledge-Based Systems, 1994, pp. 972–977.

[2] S. Acid, L.M. de Campos, A. Gonzalez, R. Molina and N. Perez de la Blanca, Learning with CASTLE, in: Symbolic and Quantitative Approaches to Uncertainty, Lecture Notes in Computer Science, 548, Springer-Verlag, Berlin, 1991, pp. 99–106.

[3] W. Buntine, Theory refinement on Bayesian networks, in: Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, Los Angeles, CA, 1991, pp. 52–60.

[4] W. Buntine, Operations for learning with graphical models, Journal of Artificial Intelligence Research 2 (1994), 159–225.

[5] W. Buntine, Graphical models for discovering knowledge, in: Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, eds, 1995, pp. 59–82.

[6] T. Bylander, Some causal models are deeper than others, Artificial Intelligence in Medicine 2 (1990), 123–128.

[7] J.E. Cano, M. Delgado and S. Moral, An axiomatic framework for the propagation of uncertainty in directed acyclic graphs, International Journal of Approximate Reasoning 8 (1993), 253–280.

[8] C.K. Chow and C.N. Liu, Approximating discrete probability distributions with dependence trees, IEEE Transactions on Information Theory 14(3) (1968), 462–467.

[9] P.R. Cohen, A. Carlsson, L. Ballesteros and R. St. Amant, Automating path analysis for building causal models from data, in: Proceedings of the International Workshop on Machine Learning, 1993, pp. 57–64.

[10] L. Console and P. Torasso, Hypothetical reasoning in causal models, International Journal of Intelligent Systems 5(1) (1990), 83–124.

[11] G. Cooper, Causal discovery from data in the presence of selection bias, in: Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, FL, 1995, pp. 140–150.

[12] G. Cooper and E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Machine Learning 9 (1992), 309–347.

[13] R. Davis, Diagnosis via causal reasoning: paths of interaction and the locality principle, in: Proceedings of AAAI-83, Morgan Kaufmann, San Mateo, CA, 1983, pp. 88–94.

[14] L.M. de Campos and J.F. Huete, Learning non-probabilistic belief networks, in: Proceedings of the 2nd European Conference on Quantitative and Symbolic Approaches to Uncertainty, Granada, 1993, pp. 56–64.

[15] L.M. de Campos, Independence relationships and learning algorithms for singly connected networks, Technical Report DECSAI-960204, Department of Computer Science and Artificial Intelligence, Universidad de Granada, Spain, 1996.

[16] L.M. de Campos, J. Gebhardt and R. Kruse, Axiomatic treatment of possibilistic independence, in: Proceedings of the Symbolic and Quantitative Approaches to Reasoning and Uncertainty European Conference, ECSQARU-95, Fribourg, Switzerland, 1995, pp. 77–88.

[17] J. de Kleer and J.S. Brown, Theories of causal ordering, Artificial Intelligence 29(1) (1986), 33–61.

[18] E. Sosa and M. Tooley, eds, Causation, Oxford Readings in Philosophy, Oxford University Press, Oxford, 1993.

[19] R. Dechter and J. Pearl, Structure identification in relational data, Artificial Intelligence 58 (1992), 237–270.

[20] D. Dubois and H. Prade, Inference in possibilistic hypergraphs, in: Proceedings of the 3rd IPMU Conference, B. Bouchon-Meunier, R.R. Yager and L.A. Zadeh, eds, Lecture Notes in Computer Science, 521, Springer-Verlag, Berlin, 1990, pp. 250–259.

[21] D. Dubois and H. Prade, Théorie des Possibilités. Application à la Représentation des Connaissances en Informatique, Masson, Paris, 1986.

[22] M.J. Druzdzel and H.A. Simon, Causality in Bayesian belief networks, in: Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Mateo, CA, 1993, pp. 3–11.

[23] P. Fonck, Réseaux d'inférence pour le raisonnement possibiliste, PhD Thesis, Université de Liège, 1994.

[24] P. Fonck, Conditional independence in possibility theory, in: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 221–226.

[25] P. Fonck, Propagating uncertainty in directed acyclic graphs, in: Proceedings of the 4th IPMU Conference, Mallorca, 1992.

[26] P. Fonck and E. Straszecka, Building influence networks in the framework of possibility theory, Annales Univ. Sci. Budapest, Sect. Comp. 12 (1991), 101–106.

[27] P. Fonck, Influence networks in possibility theory, in: Proceedings of the 2nd DRUMS R.P. 2 Group Workshop, Albi, 1991.

[28] K.D. Forbus and D. Gentner, Causal reasoning about quantities, in: Proceedings of the Fifth Annual Conference of the Cognitive Science Society, Lawrence Erlbaum Associates, NJ, 1983, pp. 196–206.

[29] D. Galles and J. Pearl, Testing identifiability of causal effects, in: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Francisco, CA, 1995, pp. 185–195.

[30] D. Geiger, A. Paz and J. Pearl, Learning simple causal structures, International Journal of Intelligent Systems 8 (1993), 231–247.

[31] D. Geiger, A. Paz and J. Pearl, Learning causal trees from dependence information, in: Proceedings of the Eighth National Conference on Artificial Intelligence, 1990, pp. 770–776.

[32] J. Gebhardt and R. Kruse, The context model – an integrating view of vagueness and uncertainty, International Journal of Approximate Reasoning 9 (1993), 283–314.

[33] J. Gebhardt and R. Kruse, Learning possibilistic networks from data, in: Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, Fort Lauderdale, FL, 1995, pp. 233–244.

[34] D. Heckerman, A Bayesian approach to learning causal networks, in: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI-95, pp. 285–295.

[35] D. Heckerman, D. Geiger and D. Chickering, Learning Bayesian networks: the combination of knowledge and statistical data, in: Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, Seattle, 1994, pp. 293–301.

[36] D. Heckerman and R. Shachter, A decision-based view of causality, in: Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 1994, pp. 302–310.

[37] D. Heckerman and R. Shachter, A definition and graphical representation of causality, in: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, Montreal, 1995.

[38] E.H. Herskovits and G.F. Cooper, Kutato: an entropy-driven system for the construction of probabilistic expert systems from data, in: Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, 1990, pp. 54–62.

[39] E. Hisdal, Conditional possibilities, independence and non-interaction, Fuzzy Sets and Systems 1 (1978), 283–297.

[40] R.A. Howard and J.E. Matheson, Influence diagrams, in: Readings on the Principles and Applications of Decision Analysis, R. Oliver and J. Smith, eds, Vol. II, Strategic Decisions Group, Menlo Park, CA, 1981, pp. 721–762.

[41] E. Hudlicka, Construction and use of a causal model for diagnosis, International Journal of Intelligent Systems 3 (1988), 315–349.

[42] J.F. Huete, Aprendizaje de redes de creencia mediante la detección de independencias: modelos no probabilísticos, PhD Thesis, Universidad de Granada, 1995.

[43] J.F. Huete and L.M. de Campos, Learning causal polytrees, in: Symbolic and Quantitative Approaches to Reasoning and Uncertainty, M. Clarke and R. Kruse, eds, Lecture Notes in Computer Science, 747, Springer-Verlag, Berlin, 1993, pp. 180–185.

[44] Y. Iwasaki, Causal ordering in a mixed structure, in: Proceedings of AAAI-88, St. Paul, MN, 1988, pp. 313–318.

[45] Y. Iwasaki and H.A. Simon, Causality and device behaviour, Artificial Intelligence 28 (1986), 3–32.

[46] Y. Iwasaki and H.A. Simon, Theories of causal ordering: reply to de Kleer and Brown, Artificial Intelligence 29 (1986), 63–67.

[47] C.A. Josslyn, Possibilistic processes for complex systems modelling, PhD Thesis, State University of New York at Binghamton, 1994.

[48] D. Kahneman, P. Slovic and A. Tversky, eds, Judgment under Uncertainty: Heuristics and Biases, Cambridge University Press, New York, 1982.

[49] G. Klir and T. Folger, Fuzzy Sets, Uncertainty, and Information, Prentice-Hall, Englewood Cliffs, NJ, 1988.

[50] Y. Kodratoff, ed., Workshop on Knowledge Discovery in Databases and Machine Learning, European Conference on Machine Learning.

[51] B. Kuipers, Commonsense reasoning about causality: deriving behaviour from structure, Artificial Intelligence 24 (1984), 169–203.

[52] W. Lam and F. Bacchus, Learning belief networks: an approach based on the MDL principle, Computational Intelligence 10(4) (1994).

[53] W. Lam and F. Bacchus, Using new data to refine a Bayesian network, in: Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, 1994, pp. 383–390.

[54] W. Lam and F. Bacchus, Using causal information and local measures to learn Bayesian belief networks, in: Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence, 1993, pp. 243–250.

[55] D. Madigan, Strategies for graphical model selection, in: Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence, 1992, pp. 331–336.

[56] D. Madigan and A. Raftery, Model selection and accounting for model uncertainty in graphical models using Occam's window, Journal of the American Statistical Association 89 (1994), 1535–1546.

[57] R. Mechling and M. Valtorta, A parallel constructor of Markov networks, in: Selecting Models from Data: AI and Statistics IV, P. Cheeseman and R.W. Oldford, eds, Springer-Verlag, Berlin, 1994, pp. 202–215.

[58] C. Meek, Causal inference and causal explanation with background knowledge, in: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995.

[59] C.R. Musick, Belief network induction, PhD Thesis, University of California at Berkeley, 1994.

[60] P. Pandurang Nayak, Causal approximations, Artificial Intelligence 70 (1994), 277–334.

[61] S. Parsons, Qualitative possibilistic networks, in: Proceedings of the 4th IPMU Conference, Mallorca, 1992.

[62] M. Pazzani, M. Dyer and M. Flowers, Using prior learning to facilitate the learning of new causal theories, in: Proceedings of AAAI-86, Morgan Kaufmann, San Mateo, CA, 1986, pp. 277–279.

[63] J. Pearl, Bayesian Networks, Technical Report R-216, Computer Science Department, University of California, Los Angeles, 1995.

[64] J. Pearl, Causal diagrams for empirical research, Technical Report R-218-B, Computer Science Department, University of California, Los Angeles, 1995.

[65] J. Pearl, On the identification of nonparametric structural equations, Technical Report R-207, Cognitive Systems Laboratory, University of California, Los Angeles, 1994.

[66] J. Pearl, A probabilistic calculus of actions, in: Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Mateo, CA, 1994, pp. 454–462.

[67] J. Pearl, Belief networks revisited, Artificial Intelligence 59 (1993), 49–56.

[68] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, CA, 1988.

[69] J. Pearl and A. Paz, Graphoids: a graph-based logic for reasoning about relevance relations, Technical Report CSD-850038, Cognitive Science Laboratory, Computer Science Department, University of California, Los Angeles, 1985.

[70] J. Pearl and T. Verma, A theory of inferred causation, in: Proceedings of the Second International Conference on Knowledge Representation and Reasoning, Morgan Kaufmann, San Mateo, CA, 1991.

[71] J. Pearl and N. Wermuth, When can association graphs admit a causal interpretation?, in: Selecting Models from Data: AI and Statistics IV, P. Cheeseman and R.W. Oldford, eds, Springer-Verlag, Berlin, 1994, pp. 205–214.

[72] Y. Peng and J.A. Reggia, Abductive Inference Models for Diagnostic Problem-Solving, Symbolic Computation Series, Springer-Verlag, Berlin, 1987.

[73] A. Ramer, Conditional possibility measures, Cybernetics and Systems 20, 185–196.

[74] G. Rebane and J. Pearl, The recovery of causal poly-trees from statistical data, in: Uncertainty in Artificial Intelligence, 3, L.N. Kanal, T.S. Levitt and J.F. Lemmer, eds, North-Holland, Amsterdam, 1989.

[75] W.C. Salmon, Probabilistic causation, Pacific Philosophical Quarterly 61 (1980), 50–74.

[76] R. Scheines, Inferring causal structure among unmeasured variables, in: Selecting Models from Data: AI and Statistics IV, P. Cheeseman and R.W. Oldford, eds, Springer-Verlag, Berlin, 1994, pp. 262–273.

[77] P.P. Shenoy, Independence in valuation-based systems, Working Paper No. 236, University of Kansas, 1991.

[78] Y. Shoham, Reasoning about Change, MIT Press, Cambridge, MA, 1988.

[79] Y. Shoham, Nonmonotonic reasoning and causation, Cognitive Science 14 (1991), 213–252.

[80] M. Singh and M. Valtorta, Construction of Bayesian network structures from data: a survey and an efficient algorithm, International Journal of Approximate Reasoning 12 (1995), 111–131.

[81] M. Singh and M. Valtorta, An algorithm for the construction of Bayesian network structures from data, in: Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, 1993, pp. 259–265.

[82] H.A. Simon, Nonmonotonic reasoning and causation: comment, Cognitive Science 16 (1991), 293–297.

[83] R.G. Simmons, The roles of associational and causal reasoning in problem solving, Artificial Intelligence 53 (1992), 159–207.

[84] M.E. Sobel, Causal inference in artificial intelligence, in: Selecting Models from Data: AI and Statistics IV, P. Cheeseman and R.W. Oldford, eds, Springer-Verlag, Berlin, 1994.

[85] M.E. Sobel, Causal inference in the social and behavioural sciences, in: A Handbook for Statistical Modelling in the Social and Behavioural Sciences, G. Arminger, C.C. Clogg and M.E. Sobel, eds, Plenum Press, New York, 1994.

[86] D. Spiegelhalter and S. Lauritzen, Sequential updating of conditional probabilities on directed graphical structures, Networks 20 (1990), 579–605.

[87] P. Spirtes, Detecting causal relations in the presence of unmeasured variables, in: Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, 1991, pp. 392–397.

[88] P. Spirtes and C. Glymour, Inference, intervention and prediction, in: Selecting Models from Data: AI and Statistics IV, P. Cheeseman and R.W. Oldford, eds, Springer-Verlag, Berlin, 1994, pp. 233–242.

[89] P. Spirtes, C. Glymour and R. Scheines, Causation, Prediction, and Search, Springer-Verlag, Berlin, 1993.

[90] M. Studeny, Formal properties of conditional independence in different calculi of AI, in: Symbolic and Quantitative Approaches to Reasoning and Uncertainty, M. Clarke, R. Kruse and S. Moral, eds, Lecture Notes in Computer Science, 747, Springer-Verlag, Berlin, 1993, pp. 341–348.

[91] P. Suppes, A Probabilistic Theory of Causality, North-Holland, Amsterdam, 1970.

[92] T. Verma, Causal networks: semantics and expressiveness, in: Uncertainty in Artificial Intelligence, 4, R.D. Shachter, T.S. Levitt, L.N. Kanal and J.F. Lemmer, eds, Elsevier Science Publishers, Amsterdam, 1989.

[93] T. Verma and J. Pearl, An algorithm for deciding if a set of observed independencies has a causal explanation, in: Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Mateo, CA, 1992, pp. 323–330.

[94] T. Verma and J. Pearl, Equivalence and synthesis of causal models, in: Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, Cambridge, MA, 1990, pp. 220–227.

[95] J. Whittaker, Graphical Models in Applied Multivariate Statistics, Wiley, 1990.

[96] S. Wright, Correlation and causation, Journal of Agricultural Research 20 (1921), 557–585.

[97] J.M. Zytkow, Combining many searches in the FAHRENHEIT discovery system, in: Proceedings of the Fourth International Workshop on Machine Learning, 1987, pp. 281–287.

