Tetrad Overview

What is Tetrad?

Tetrad is a program for creating, simulating data from, estimating, testing, predicting with, and searching for causal/statistical models.

The aim of the program is to provide sophisticated methods in a friendly interface requiring very little statistical sophistication of the user and no programming knowledge. It is not intended to replace flexible statistical programming systems such as Matlab, Splus or R. Tetrad is freeware that performs many of the functions in commercial programs such as Netica, Hugin, LISREL, EQS and other programs, and many discovery functions these commercial programs do not perform.

Tetrad is unique in the suite of principled search ("exploration," "discovery") algorithms it provides--for example, its ability to search when there may be unobserved confounders of measured variables, to search for models of latent structure, and to search for linear feedback models--and in the ability to calculate predictions of the effects of interventions or experiments based on a model. All of its search procedures are "pointwise consistent"--they are guaranteed to converge almost certainly to correct information about the true structure in the large sample limit, provided that structure and the sample data satisfy various commonly made (but not always true!) assumptions.

Tetrad is limited to models of categorical data (which can also be used for ordinal data), to linear models ("structural equation models") with a Normal probability distribution, and to a very limited class of time series models. The Tetrad programs describe causal models in three distinct parts or stages: a picture, representing a directed graph specifying hypothetical causal relations among the variables; a specification of the family of probability distributions and kinds of parameters associated with the graphical model; and a specification of the numerical values of those parameters.

The program and its search algorithms have been developed over several years with support from the National Aeronautics and Space Administration and the Office of Naval Research. Joseph Ramsey has implemented most of the program, with substantial assistance from Frank Wimberly. Executable and source code for all versions of Tetrad IV, and this manual, are copyrighted, 2004, by Clark Glymour, Richard Scheines, Peter Spirtes and Joseph Ramsey. The program may be freely downloaded and used without permission of copyright holders, who reserve the right to alter the program at any time without notification.

The Tetrad suite of programs permits the user to do any of the following:

1. Generate a graphical statistical/causal model of any of the following kinds:
   1. Models for categorical data (Bayes networks);
   2. Models for continuous data with variables having a Gaussian (Normal) joint probability distribution;
   3. Models for a limited class of time-series representing genetic regulatory networks.
2. Estimate parameters of models of the following kinds:
   1. Models for categorical data in which all variables are recorded in the data (no "latent" variables);
   2. Models for continuous data with or without latent variables.
3. Test the fit of models of any of the kinds listed in 2. above.
4. Simulate data from a model of any of the kinds listed in 1. above.
5. Update models of categorical data; i.e., compute the probability of any variable in the model conditional on any set of values for other variables in the model.
6. Predict the probability of a variable in a model (without latent variables) from interventions that fix or randomize values for any set of other variables in the model.
7. Search for models:
   1. Of categorical data with or without latent variables;
   2. Of continuous, Gaussian data with or without latent variables.
8. Compare graphical features of two models.
9. Find alternative models statistically equivalent to any given model without latent variables.
10. Select variables within a dataset for classifying values of cases of another variable in the dataset.
11. Classify new (or old) cases using the variables selected in 10. above.
12. Assess the accuracy of classification.

Manual

Why Doesn’t Tetrad...?

For Further Help

References

Tetrad Manual

Tetrad is organized as a main workspace in which one or more sessions can be constructed or edited. Each session can contain a number of boxes, which can themselves contain modules (statistical models, data sets, search algorithms, etc.). There are also several functions that appear in more than one box or are used to manage sessions in general.

For information on the main workspace, see Main Workspace Explained.

For information on each box, see Each Box Explained. You will find there explanations of the modules that can be placed into each box.

For information on common tasks, see Some Common Tasks. These items are linked to from module explanations, but they are all made available here in case you want to peruse them.

There are also some definitions of terms. To look at these, see Some Definitions.

The Main Workspace Explained

The main workspace in Tetrad consists of a workbench for building sessions, a toolbar for selecting types of boxes to add to the session, and a menu bar for performing operations like loading and saving sessions, and so on.

Tetrad 4 works with drag and drop objects, dialog boxes, drop down menus, and clicking. Occasionally you will need to type something brief--numerals and names.

Most Tetrad operations are performed with a single left click or a double left click. Right clicks are used for additional information or less common options.

The main workspace looks like this:

For an explanation of each part of this workspace, follow the link:

For an explanation of the title bar ("Tetrad version"), see Tetrad Versioning.
For an explanation of the menubar and the various functions it makes available, see Tetrad Menubar.
For an explanation of the left-hand toolbar ("Graph," "Parametric Model," etc.), see Tetrad Toolbar.
For an explanation of the main workspace area for a session (with the sample boxes "Graph1," "PM1," etc.), please see How To Build a Session.

Tetrad Versioning

Beginning with version 4.3.2-1, saved sessions (".tet" files) are intended to be backwards compatible. That is, sessions saved out using one version are intended to be loadable using versions of Tetrad with equal or higher version numbers. There may be a few problems; hopefully these will get ironed out quickly.

You might want to know the exact version of Tetrad you are running for two reasons:

1. New features are regularly added to Tetrad, and bugs in old features are removed. You may want to use a version of Tetrad with a specific version of some algorithm, or you may simply want to sync your version with a version that someone else is using, to avoid any discrepancies.

2. You may experience some vestigial problems with loading, which we can fix.

If you do experience a problem loading a session that you’ve saved, please let us know by emailing the session itself (the ".tet" file) to Joe Ramsey at [email protected]. We will attempt to fix the problem and either post a new file for that version or else let you know which later version you can load the session in. We sincerely appreciate your help on this.

To find out which exact version of Tetrad you are using, you may use one of three methods.

Method 1

In most operating systems, the version number is displayed in the title bar above the application. In the example below, it’s "4.3.2-3". The "4" in this case is the major version, the first "3" the minor version, the "2" the minor subversion, and the last "3" the incremental release number. This appears in the title bar like this:

Method 2

You can also find the version number by selecting "Session Version" from the main File menu:

A dialog will appear telling you the version number and date the last time the current session was saved, and the current version of Tetrad:

Each saved ".tet" file is stamped with a version and a date in such a way that, even if the file itself cannot be loaded, at least this meta-information can be loaded. So if you have a file that won’t load, you can still see the version it was saved under and the date it was saved. This allows you to go to the Tetrad website and launch the version of Tetrad that was used to save out the file. You can then load the file under that version.

Method 3

The current version number is also displayed in the "About Tetrad" menu item in the "Help" menu.
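
The version scheme described under Method 1, together with the compatibility rule stated at the start of this section, can be sketched in a few lines of Python. This is only an illustration: the names TetradVersion, parse_version and should_load are hypothetical and are not part of Tetrad, and the sketch assumes every version string has the form major.minor.subversion-release.

```python
import re
from typing import NamedTuple


class TetradVersion(NamedTuple):
    """Hypothetical container for the four parts of a version string."""
    major: int
    minor: int
    subversion: int
    release: int


def parse_version(text: str) -> TetradVersion:
    """Split a string such as "4.3.2-3" into its four numeric parts."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)", text.strip())
    if match is None:
        raise ValueError(f"not a version of the form M.m.s-r: {text!r}")
    return TetradVersion(*(int(part) for part in match.groups()))


def should_load(saved_with: str, running: str) -> bool:
    """Apply the stated rule: sessions saved with 4.3.2-1 or later are
    intended to load in any version with an equal or higher number."""
    saved = parse_version(saved_with)
    first_compatible = parse_version("4.3.2-1")
    return saved >= first_compatible and parse_version(running) >= saved
```

Because a NamedTuple compares field by field, "equal or higher" reduces to an ordinary tuple comparison; for example, should_load("4.3.2-1", "4.3.2-3") is true while should_load("4.3.2-3", "4.3.2-1") is false.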

Tetrad Menubar

The main menu bar in Tetrad lets you manage Tetrad sessions, lets you perform some editing operations on the current session, and gives you access to the help functionality (this), among other things. It looks like this:

The function of each item in each menu is described below. If you were looking for help for popup menus for boxes, see Popup Menus.

The File Menu

Here’s what each item does in the File menu.

New Session - Creates a new, empty Tetrad session. Your previous session is still available--just go to the Session menu (see below).
Open Session - Prompts you for a saved Tetrad session file (with suffix ".tet") and opens it.
Close Session - Closes your main workbench and leaves you in Tetrad or in a stored session in memory. If there are no sessions you get a blank Tetrad screen with no workbench. You can’t do anything in the program until another session is opened or a new one is created.
Save Session - Saves all of your work, data, the whole thing, to a file (suffix ".tet") so that you can call it back up later. You are asked to give the session a name if you haven’t done so already.
Save Session As - Does the same thing as Save Session, but always asks you for a file name.
Session Version - Gives version information for the current session. See Tetrad Versioning.
Save Screenshot - Saves a screenshot (PNG format) of the entire Tetrad session, in case you don’t want to fool around with Photoshop or GIMP.
Save Session Graph Image - Saves an image of just the session graph in the white workspace area with the boxes and arrows in it. Leaves out the menubar and toolbar. This is useful if your session flowchart is larger than your screen.
Exit - Gets you out of Tetrad the polite way.

The Edit Menu

Here’s what each item does in the Edit Menu.

Cut - Cuts out any selected boxes from the workbench (together with any edges between them) and allows you to paste them. For advice on how to select groups of boxes, see Selecting Groups of Nodes.
Copy - Same as Cut, but leaves the original nodes in the session.
Paste - Pastes the cut or copied boxes slightly down and to the right of the original ones, either in the current session or in some other session. Multiple pastes are supported; if you paste multiple times, new copies appear down and to the right of the originally selected boxes.

The Session Menu

Tetrad can keep multiple sessions open at once, but only one workbench is visible at a time. "Sessions" lists all of your open sessions and lets you switch to the workbench of whichever session you want. Your sessions are automatically given a name, e.g., "Untitled1.tet", unless you have saved them with a name.

The Template Menu

In using Tetrad you will put together a sequence of boxes connected by flowchart arrows. (See How to Build a Session.) Some sequences are so commonly used that Tetrad will insert the entire sequence for you--boxes and arrows--in the workbench all at once. For details, see Using Templates.

The Help Menu

The items listed do the following.

Tetrad Manual - That’s this. You already know about it.
About Tetrad [version-number] - Shows information about the project in general.
Warranty - Warranty information displayed as per requirements of the GNU General Public License.
License - License information displayed as per requirements of the GNU General Public License.

Popup Menus

If you right click on any box in the session workbench (a Graph box, a Data box, etc.), a popup menu willbe displayed with a number of options. The options are as follows.

1. Create Model.
2. Edit Model.
3. Destroy Model.
4. (Re)create Descendant Models.
5. Rename Box.
6. Clone Box.
7. Delete Box.
8. Set Repetitions for Simulation.
9. Run Simulation.
10. Info for this Box.

Create Model.

"Create" does the same thing as double clicking on the Graph box the first time--if you have not yet created a graph. Otherwise it does nothing.

Edit Model.

"Edit" does the same thing as double clicking on the Graph box after you have already created a graph--it opens a graph editing window.

Destroy Model.

Removes the model that the box contains and lets you create another one from scratch. Any models downstream will be destroyed as well, since they depend on the model being destroyed.
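
The downstream effect of Destroy Model can be pictured as a walk along the flowchart arrows. The following Python sketch is an illustration only, not Tetrad's implementation; the functions downstream and destroy_model, and the representation of a session as a list of (tail, head) arrows plus a dictionary of box contents, are assumptions made for the example.

```python
from collections import deque


def downstream(arrows, start):
    """Return every box reachable from `start` by following flowchart arrows."""
    children = {}
    for tail, head in arrows:
        children.setdefault(tail, []).append(head)
    seen = set()
    queue = deque([start])
    while queue:
        box = queue.popleft()
        for child in children.get(box, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def destroy_model(models, arrows, box):
    """Empty `box` and every box downstream of it; return the emptied boxes."""
    emptied = {box} | downstream(arrows, box)
    for name in emptied:
        models[name] = None  # None stands for "no model in this box"
    return emptied


# Hypothetical session: Graph1 -> PM1 -> IM1 -> Data1.
arrows = [("Graph1", "PM1"), ("PM1", "IM1"), ("IM1", "Data1")]
models = {"Graph1": "DAG", "PM1": "Bayes PM", "IM1": "Bayes IM", "Data1": "data set"}
destroy_model(models, arrows, "PM1")
```

After the call, Graph1 still holds its DAG, while PM1, IM1 and Data1 are empty, because each of them depended directly or indirectly on the model that was destroyed.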

(Re)create Descendant Models.

Allows you to quickly create (or recreate if they already exist) a model and all of the models downstream of it. This is helpful if you’re just playing around with random models but can be frustrating if you accidentally overwrite models you’ve spent a long time creating. Therefore, a warning is displayed before any changes are made to make sure you really want to make the changes.

Rename Box.

This lets you rename the session box you right-clicked on. This is useful if you have a number of similar session boxes on the workbench and would like to keep track of which is which.

Clone Box.

Clones the box you’ve right-clicked on so you can edit the copy separately.

Delete Box.

Deletes the box you’ve right-clicked on, destroying its model and removing it from the session altogether. Cannot be undone.

Set Repetitions for Simulation

Sets the number of times a node is "executed" (i.e., destroyed and randomly recreated) in simulation. Each time a node is destroyed and recreated, all of the children of the node are executed as well.

Run Simulation.

Executes (i.e., destroys and recreates) the node right-clicked on. This starts a cascade of nodes being destroyed and recreated downstream. This can be useful if you’d like to know how well a particular search performs on graphs with 10 nodes in them, with 20 edges selected randomly, for instance.

Info for this Box.

Lets you choose which of the models for the box being right-clicked on you’d like help for, and which of the possible parent combinations of that box you’re interested in.

Three things are important:

1. The "Unoriented" and "Half-Oriented" buttons are for creating theoretical graphical objects of various kinds. Do not use these buttons for ordinary modeling. Currently, none of the statistical procedures in the program work for graphs with such edges.
2. If you create a model with a latent variable and later use the model to generate data, values for the latent variables will not be shown.
3. If you have introduced any boxes that depend on a Graph box, changing the graph will alter the contents of all boxes downstream in the flowchart from the Graph box. Often it is easier to simply create a new Graph box in the same main workspace window--there is no limit to how many graph or other boxes you can have at the same time.

The menu at the top of the Graph window has two options, "File" and "Edit". "Edit" currently does nothing. "File" gives three options: You can save a graph in a file--but there is no point because we have not yet implemented a facility to paste the saved graph into a new window. You can introduce the ALARM network, a fairly complex graph standardly used as a test for search algorithms, and you can save any graph you create as an image file that can be introduced into text documents, e.g., into Microsoft Word.

Selecting Groups of Nodes

If you would like to move a whole section of nodes, to copy them or move them to another location, first draw a "rubberband" around them and either drag one of the nodes to drag the group or select the copy function from the Edit menu to copy the group.

To draw a rubberband around a group of nodes, first click in the white background to the upper left of the group of nodes, then drag the mouse down to the lower right of the group of nodes. The rubberband will be shown as a dotted box around the group, and all of the nodes in the group will be highlighted. It will look like this:

Once selected, this group of nodes may, for example, be dragged to another location by clicking on X3 and dragging it.

Using Templates

In using Tetrad you will put together a sequence of boxes connected by flowchart arrows. (See How to Build a Session.) Some sequences are so commonly used that Tetrad will insert the entire sequence for you--boxes and arrows--in the workbench all at once.

Templates are added to the active session using the Templates menu in the main workspace. The Templates menu looks like this:

An image of each template along with a short description of it follows.

Search from Loaded Data

This template can be used if you simply want to load in a data set and do a search on it. The data set can be either continuous or discrete; the options for search algorithms will depend on which type of data set you load.

Estimate from Loaded Data (Bayes)

This template is useful if you want to estimate a Bayes instantiated model (Bayes IM) from a given data set. A Bayes estimation requires a data set and a Bayes parameterized model (Bayes PM) as input. There are two difficulties in getting such an estimation to work:

1. All of the measured variables in the Bayes PM must occur in the data set. The maximum likelihood (ML) Bayes estimator and the Dirichlet estimator both require that all of the variables in the Bayes PM be measured, although the Structural EM search allows for latent variables.

2. For each variable V in the Bayes PM with categories Ci, i = 1,...,ci for some ci > 0, the variable by the same name in the data set must have the same categories.

These conditions can be difficult to ensure when building a Bayes PM from scratch. Adding the edge from Data1 to Graph1 in the template creates an edgeless graph in Graph1 that can then be used to construct a specific DAG to use to build a Bayes PM. Adding the edge from Data1 to PM1 ensures that the categories for each relevant variable in the data set are used when building the Bayes PM. The two arrows out of Data together make it easier to ensure that the Bayes estimation will work.
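
The two conditions above amount to a simple consistency check between the Bayes PM and the data set. The Python sketch below shows one way to express that check; it is not how Tetrad performs it. The function check_bayes_inputs and the dictionaries mapping variable names to category lists are hypothetical, and the sketch assumes that "the same categories" means the same set of category names, regardless of order.

```python
def check_bayes_inputs(pm_categories, data_categories):
    """Return a list of problems; an empty list means both conditions hold.

    Both arguments map a variable name to its list of category names.
    """
    problems = []
    for variable, categories in pm_categories.items():
        if variable not in data_categories:
            # Condition 1: every PM variable must occur in the data set.
            problems.append(f"{variable}: not in the data set")
        elif set(categories) != set(data_categories[variable]):
            # Condition 2: same-named variables must share their categories.
            problems.append(f"{variable}: categories differ")
    return problems


# Hypothetical example: the data set codes Cancer as "0"/"1".
pm = {"Smoking": ["no", "yes"], "Cancer": ["absent", "present"]}
data = {"Smoking": ["no", "yes"], "Cancer": ["0", "1"], "Age": ["young", "old"]}
problems = check_bayes_inputs(pm, data)
```

Here the check reports a mismatch for Cancer only. Extra variables in the data set, such as Age, are not a problem, since the conditions constrain the PM's variables rather than the data set's.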

Estimate from Loaded Data (SEM)

Like the Bayes version of Estimate from Loaded Data, in order to estimate a SEM IM, a continuous data set and a SEM PM are required that have the same variables. In this case, however, the variables are always continuous, and continuous variables always have the same range (the real numbers), so there is no need to add the edge from Data1 to PM1.

Simulate Data

This is a very useful template for simulating continuous or discrete data sets. Continuous data sets can be simulated by constructing a SEM Graph (or DAG), using that to construct a SEM PM, then a SEM IM, and then finally a data set. Discrete data sets can be simulated by constructing a DAG, using that to construct a Bayes PM, then a Bayes IM, and finally a data set. For information on any one of these steps, see the help files for the corresponding box or module.
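
The discrete chain described above (DAG, then Bayes PM, then Bayes IM, then data set) can be illustrated outside Tetrad with a small Python sketch. Everything in it is an assumption made for the example: the two-variable graph, the category names, the probabilities, and the function simulate, which draws each variable after its parents. It does not reproduce Tetrad's own simulation code.

```python
import random

# The graph: a hypothetical DAG with one arrow, Rain -> WetGrass.
parents = {"Rain": [], "WetGrass": ["Rain"]}

# The Bayes PM: the categories of each variable, with no probabilities yet.
categories = {"Rain": ["no", "yes"], "WetGrass": ["dry", "wet"]}

# The Bayes IM: one row of probabilities per combination of parent values.
cpt = {
    "Rain": {(): [0.8, 0.2]},
    "WetGrass": {("no",): [0.9, 0.1], ("yes",): [0.1, 0.9]},
}


def simulate(n_cases, order, rng):
    """Draw cases by sampling each variable after all of its parents."""
    cases = []
    for _ in range(n_cases):
        case = {}
        for variable in order:
            key = tuple(case[parent] for parent in parents[variable])
            weights = cpt[variable][key]
            case[variable] = rng.choices(categories[variable], weights=weights)[0]
        cases.append(case)
    return cases


# The data set: 1000 simulated cases, with a fixed seed for repeatability.
data = simulate(1000, ["Rain", "WetGrass"], random.Random(0))
```

The order passed to simulate must list parents before children, which is always possible for a DAG.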

Search from Simulated Data

This template can be used to try out search algorithms on simulated data. Data can be simulated as with the Simulate Data template, and then an appropriate search procedure can be run on this data. Search procedure options are different depending on the type of data simulated.

Search from Simulated Data with Edge Comparisons

This template adds to the Search from Simulated Data template a Compare node, which counts the number of extra edges and missing edges in the Search graph vis-a-vis the reference graph in Graph1. This is useful if you want to get a sense of how well a given search procedure performs on data with particular characteristics.
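
The idea of counting extra and missing edges can be sketched as a set comparison. The Python below is an illustration under a stated assumption: it compares adjacencies only and ignores edge orientation, which may differ from what the Compare node actually reports. The function names and the example graphs are hypothetical.

```python
def adjacencies(edges):
    """Treat each edge as an unordered pair, ignoring its orientation."""
    return {frozenset(edge) for edge in edges}


def compare_edges(reference_edges, search_edges):
    """Count edges found but absent from the reference, and the reverse."""
    reference = adjacencies(reference_edges)
    found = adjacencies(search_edges)
    return {"extra": len(found - reference), "missing": len(reference - found)}


# Hypothetical reference graph (Graph1) and search result (Search1).
reference = [("X1", "X2"), ("X2", "X3"), ("X3", "X4")]
search = [("X2", "X1"), ("X2", "X3"), ("X1", "X4")]
result = compare_edges(reference, search)
```

In this example the search result has one extra adjacency (X1 and X4) and one missing adjacency (X3 and X4); the reversed X2, X1 edge is not counted as an error because orientation is ignored.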

Estimate from Simulated Data

This template can be used to estimate data with respect to the parametric model that generated it. It is useful if you would like to see how well an estimator does on data with particular characteristics, simulated from an instantiated model with particular characteristics, when you know the parametric model used to generate it.

Estimate using Results of Search (Bayes)

This template shows how to hook up boxes to estimate data using a model that was generated by a search algorithm on that same data. Usually, the graph coming out of Search1 is an equivalence class graph such as a Pattern or a PAG, and some work might be required to turn this into a DAG or SEM Graph in Graph1 that can be used to build an appropriate parametric model in PM1. The edge from Data1 to PM1 is added in the discrete case to ensure that the variables in PM1 use the same categories as the variables in Data1.

Estimate using Results of Search (SEM)

This template shows how to hook up boxes to estimate data using a model that was generated by a search algorithm on that same data. Usually, the graph coming out of Search1 is an equivalence class graph such as a Pattern or a PAG, and some work might be required to turn this into a DAG or SEM Graph in Graph1 that can be used to build an appropriate parametric model in PM1.

Update Bayes IM

This template can be used to do updating operations on a Bayes instantiated model that you’ve built in IM1.

Tetrad Toolbar

The main toolbar allows you to select box types to place in the main workspace (the white area). It also allows you to select tools for selecting and moving boxes and for drawing arrows between them. Each button in the toolbar is explained below.

Select and Move Button

When the movement button is highlighted, the objects in the workspace can be moved around by clicking over each object and dragging it elsewhere in the workspace.

Once you have created a box, its contents can be opened by double clicking on it. The contents may be another workbench for creating an object, or may be the object itself once it has been created:

The button at the bottom left of the toolbar column--the one with a red and a green arrow--permits you to make a flow chart connecting boxes you have placed in the workspace.

Flow Chart Button

To make a flow chart, simply click on the flow chart tool button, and then click on the box you want at the tail of a flowchart arrow and drag the arrow to the box you want at the head of the flowchart arrow. You can do this repeatedly without having to click on the flow chart tool button in between. Only one flowchart arrow can connect any two boxes, but a box can have any number of flowchart arrows out of it.

The flow chart you create provides the input to each Tetrad operation. Some boxes require no input (e.g., the Graph box), some require one input (e.g., the PM box requires a Graph box as input), and some boxes require several inputs (e.g., the Estimate box requires a Data box and a PM box). Not all connections are allowed, and if you attempt to connect two boxes that cannot be related (e.g., two Graph boxes), the flowchart tool will simply refuse to make the connecting arrow.

If you put the cursor over a box and let it rest for a moment, a "tip" appears that describes the inputs required for the operations in that box.
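
The behavior of the flowchart tool (refusing illegal arrows and allowing at most one arrow between any two boxes) can be sketched as a lookup against a table of permitted parent types. The Python below is only an illustration. The table ALLOWED_PARENTS is assembled from connections mentioned in this manual, is probably incomplete, and is not Tetrad's actual rule set; the function add_arrow is hypothetical.

```python
# Parent box types mentioned in this manual for each type of box.
# Illustrative and incomplete; NOT Tetrad's actual rule set.
ALLOWED_PARENTS = {
    "Graph": {"Data"},
    "PM": {"Graph", "Data"},
    "IM": {"PM"},
    "Data": {"IM"},
    "Estimate": {"Data", "PM"},
    "Update": {"IM"},
    "Classify": {"Data", "IM"},
    "Search": {"Data"},
    "Compare": {"Search", "Graph"},
}


def add_arrow(arrows, box_types, tail, head):
    """Draw tail -> head if it is legal; return True when the arrow was drawn."""
    if tail == head or (tail, head) in arrows or (head, tail) in arrows:
        return False  # only one arrow may connect any two boxes
    if box_types[tail] not in ALLOWED_PARENTS.get(box_types[head], set()):
        return False  # e.g. Graph -> Graph is refused
    arrows.add((tail, head))
    return True


# Hypothetical session with two Graph boxes and one PM box.
box_types = {"Graph1": "Graph", "Graph2": "Graph", "PM1": "PM"}
arrows = set()
first = add_arrow(arrows, box_types, "Graph1", "PM1")      # legal
repeat = add_arrow(arrows, box_types, "Graph1", "PM1")     # already connected
illegal = add_arrow(arrows, box_types, "Graph1", "Graph2")  # two Graph boxes
```

As in the flowchart tool, an illegal request is not an error; the arrow is simply not drawn.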

The Tool Buttons

Each tool button, when clicked, allows the creation of a corresponding box inside the workspace. Various operations can be carried out by opening a box, provided it has appropriate inputs. The results of the operations are contained in, and remain accessible inside of, the box in which they are created. Running an operation or program inside a box never creates a new box. We will describe each of the other tool buttons and how to use them for a variety of tasks. Clicking in this file on the tool buttons illustrated below will provide much more information about each of their functions and operation.

Graph

Creates an instance of a graph. Options are:

Regular Graph - A set of variables over which a set of edges has been defined, where the edges can be of any of the four standard Tetrad edge types.
Lag Graph - A set of variables, each at a series of time lags. Directed edges may extend from previous time lags into the current time step. The time series graph is interpreted as a repeating update graph.

Parametric Model

Creates a PM box in which a parametric model can be created. A parametric model specifies the family of probability functions connecting cause and effect, but does NOT specify values for its parameters. For example, if you open the PM box a dialog box will come up giving simple alternatives. One alternative, for example, is "Bayes net." If you choose that, the graph you have specified as input to the PM box will be parametrized as a categorical model in which the parameters are the (unspecified) conditional probabilities of values of each variable on the values of its parent variables in the graph. If you specify "SEM," the graph will be parametrized as a linear Gaussian model, with variances and linear coefficients. The values for the parameters in the parametric model selected are NOT in PM. They must be specified in an IM box, which must have a flowchart arrow from a PM box directed into it.
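
The difference between a parametric model (which parameters exist) and an instantiated model (what values they take) can be made concrete with a small Python sketch for the SEM case. It is an illustration only: the graph, the function names sem_parameters and instantiate, and the ranges used for the random values are all assumptions, not Tetrad's defaults.

```python
import random

# Hypothetical graph X1 -> X2 -> X3, given as child -> list of parents.
graph = {"X1": [], "X2": ["X1"], "X3": ["X2"]}


def sem_parameters(graph):
    """Parametric model: list the free parameters, with no values attached."""
    names = []
    for child, parent_list in graph.items():
        names.append(("variance", child))
        names.extend(("coefficient", parent, child) for parent in parent_list)
    return names


def instantiate(parameter_names, rng):
    """Instantiated model: attach an arbitrary random value to each parameter."""
    values = {}
    for name in parameter_names:
        if name[0] == "variance":
            values[name] = rng.uniform(1.0, 3.0)   # assumed range, kept positive
        else:
            values[name] = rng.uniform(-1.5, 1.5)  # assumed range
    return values


pm = sem_parameters(graph)            # three variances and two coefficients
im = instantiate(pm, random.Random(1))
```

The same list pm could be instantiated many times with different values, which mirrors the way several IM boxes can be built from one PM box.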

Instantiated Model

Creates an IM box, which can be used to create an instantiated model. An instantiated model specifies particular numerical values for the parameters of a parametric model. Arbitrary parameter values are entered randomly and can be edited in a window created from the IM box.

Data

Creates a Data box, which can be used to create a data set for an IM and allows the importation of data files from outside the program. Only the last data set generated is stored for any one Data box.

Manipulated Data

Takes a Data box with data as input and creates a new data file, with missing values marked or interpolated (for discrete variables), and with multiple copies of user-selected cases.

Estimator Button

Creates an Estimator box. Given a PM and Data, the procedures in the statistical estimator allow estimation of the parameters--that is, creation of an instantiated model, based on the Data input to the Estimator box. Estimators include maximum likelihood and Dirichlet types. There are also procedures for handling missing values.
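
For categorical data, the two estimator types named above can be illustrated on a single row of a conditional probability table. The Python sketch below uses the textbook formulas, relative frequencies for maximum likelihood and added pseudocounts for a symmetric Dirichlet prior; the manual does not give the details of Tetrad's estimators, so the function estimate_row and its prior_count parameter are assumptions made for the example.

```python
from collections import Counter


def estimate_row(values, categories, prior_count=0.0):
    """Estimate one row of a conditional probability table.

    prior_count = 0 gives relative frequencies (maximum likelihood);
    a positive prior_count adds that many pseudo-cases to every category,
    which is the posterior mean under a symmetric Dirichlet prior.
    """
    counts = Counter(values)
    total = sum(counts[c] for c in categories) + prior_count * len(categories)
    if total == 0:
        raise ValueError("no cases and no prior counts for this row")
    return {c: (counts[c] + prior_count) / total for c in categories}


# Hypothetical cases of WetGrass observed when Rain = "yes".
observed = ["wet", "wet", "wet", "dry"]
ml = estimate_row(observed, ["dry", "wet"])
smoothed = estimate_row(observed, ["dry", "wet"], prior_count=1.0)
```

With these four cases the maximum likelihood estimate is 0.25 for "dry" and 0.75 for "wet", while one pseudocount per category pulls the estimate toward uniform: 2/6 and 4/6.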

Updater Button

Creates an Update box. The Update box requires input from an IM box that is a Bayes net--i.e., is for discrete variables. It will compute the conditional probability of any variable in the Bayes net given values for any other variables in the model. It will also compute such probabilities conditional on an intervention that fixes or randomizes other variables.
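
The difference between conditioning on an observed value and conditioning on an intervention can be seen in a two-variable Bayes net. The Python sketch below is a hand-worked illustration, not the Update box's algorithm: the net Rain -> WetGrass, its probabilities, and both function names are assumptions, and the intervention is modeled in the standard way by removing the arrow into the variable that is fixed.

```python
# Hypothetical Bayes IM for the graph Rain -> WetGrass.
P_RAIN = {"yes": 0.2, "no": 0.8}
P_GRASS_GIVEN_RAIN = {
    "yes": {"wet": 0.9, "dry": 0.1},
    "no": {"wet": 0.1, "dry": 0.9},
}


def rain_given_observed(grass):
    """P(Rain | WetGrass = grass), by Bayes' rule over the joint distribution."""
    joint = {rain: P_RAIN[rain] * P_GRASS_GIVEN_RAIN[rain][grass] for rain in P_RAIN}
    total = sum(joint.values())
    return {rain: p / total for rain, p in joint.items()}


def rain_given_intervention(grass):
    """P(Rain) when WetGrass is fixed to `grass` by intervention.

    Fixing WetGrass cuts the arrow from Rain into it, so the value chosen
    carries no information about Rain, which keeps its prior distribution.
    """
    return dict(P_RAIN)


observed = rain_given_observed("wet")
intervened = rain_given_intervention("wet")
```

Observing wet grass raises the probability of rain from 0.2 to 0.18 / 0.26 (about 0.69), whereas wetting the grass by intervention leaves it at 0.2.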

Classify Button

Classify creates a Classifier box, which requires input from Data and from an IM box. It is used to classify new cases with the Bayes net in the IM box. The variables in the IM box must match some of the variables in the Data. The user specifies a target variable in the IM and the classifier uses the Bayes net structure of the IM to predict the values of the target in the data set. Statistics on classification accuracy are provided (as ROC curves and confusion matrices).
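
A confusion matrix of the kind mentioned above simply tabulates actual against predicted values of the target variable. The following Python sketch shows the bookkeeping on made-up labels; the function names and data are hypothetical, and the sketch does not reproduce the Classifier box's output format.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair of categories occurs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of cases whose predicted category equals the actual one."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical target values for five cases.
actual = ["yes", "yes", "no", "no", "no"]
predicted = ["yes", "no", "no", "no", "yes"]
matrix = confusion_matrix(actual, predicted)
score = accuracy(actual, predicted)
```

Here three of the five cases lie on the diagonal of the matrix (one "yes" and two "no"), giving an accuracy of 0.6.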

Search Button

Creates a Search box. The Search box requires Data as input. The user can choose from among a variety of search algorithms, consistent under different assumptions, and can specify background knowledge to be used in the search.

Regression Button

Compare Button

Creates a Compare box. The Compare box requires input from a Search box result and input from a Graph box, or input from two Graph boxes. It compares the edges in the structure from Search with the structure in Graph, or the edges in the second graph connected to it with those in the first graph connected to it, and returns counts of how well the Search graph (or the second graph) agrees with the Graph box structure (or with the first graph).

How to Build a Session

Sessions in Tetrad are constructed by placing boxes into the white session workspace, connecting them up with arrows in legal ways that represent their dependencies, and constructing modules in each box using modules in parent boxes.

The session window allows you to create boxes in which all of the Tetrad objects are created and stored and all of the Tetrad statistical operations are applied and their results are stored. The contents of each box can be viewed by clicking on the box. To create a box of any kind, for example a Graph box, simply left click on the corresponding tool on the left of the workbench, move the mouse over the workspace, and click again. As soon as you do that, the button for the movement tool at the top of the tool column lights up.

Here, step by step, is an example of how to use the tools in Tetrad to build a random Bayes net model and simulate data from it. (We show how to build a random Bayes net only because it makes the example shorter; this is not necessarily something you would need to do ordinarily. For detailed explanations of how each module works, see the help files for those modules.) First, we place a Graph box into the workspace, then a PM box, then an IM box, then a Data box. In each case, we do this by first clicking in the toolbar on the left for the type of box we want and then clicking in the workspace area where we want the box to appear. The result after this step is as follows:

Next, we draw flowchart edges from the Graph box to the PM box, from the PM box to the IM box, and from the IM box to the Data box. To start this process, we first click the flowchart tool in the toolbar, which looks like this:

Then we hold the mouse down over the Graph box and drag to the PM box, then release the mouse. Same for PM to IM and IM to Data. (Notice that only legal edges will be drawn; if an edge is not legal, it simply will not be drawn.) The result after this step looks like this:

At this point, we have four boxes in the workspace, with dependencies between them specified, but there are no modules in them. To put a module in the Graph box, double click it, select "Directed Acyclic Graph" from the dropdown, select "A random DAG" from the dialog that appears (accepting the defaults), and click "OK." The result will look something like this:

Click "Save." The workspace will now show the Graph box in a different color, indicating that it now contains a module. (The other three are still empty.) To place a module into the PM box, now that the parent it depends on has a module in it, double click the PM box. Select "Bayes Parametric Model" from the dropdown. Click "OK." Select "Automatically Assigned" from the dialog, accepting the defaults. Click "OK." The result looks something like this:

When you click "Save," now two boxes are filled in. To fill in the IM box, double click it, select "Bayes Instantiated Model," select "Randomly, overwriting previous values," and click "OK." The result looks something like this (with perhaps a different graph and different conditional probabilities showing):

Clicking "Save," you see now three boxes are filled in. Now double click the "Data" box, accepting the defaults, and click "OK." You now have a data set with 1000 cases, simulated from the Bayes net that was randomly generated in the last three steps. It looks something like this:

If you click "Save," you see now that all four boxes are filled in with modules. Notice that along the way, most of the boxes could have stored a variety of different modules. You made a choice as to which modules to put in which boxes. Having made those choices, your choices downstream were constrained. The final workspace looks like this:

If you right-click on any of these boxes, you will get a popup menu with a variety of actions you can take. See Popup Menus for more details. Also, it is important to understand the implications of some boxes being dependent on others. When you destroy the module in a box, modules downstream will be destroyed also. See Flowchart Dependencies for more details.

Each Box Explained

Sessions in Tetrad are built up by placing boxes on the main workspace area, connecting them up with arrows, and building modules in each box that depend on parent modules that have already been built. (See How to Build a Session for an example. For a discussion of how arrows create dependencies between boxes, see Flowchart Dependencies.)

Boxes are things that look like this:

Each box can contain one of a specific list of modules, and which modules are available depends on what modules are in the parent boxes for that box. Here we discuss each box in particular, list the types of modules it can contain, and provide links to explanations for those modules. If you’d prefer to go directly to explanations of modules, see Each Module Explained.

Graph Box

Parametric Model (PM) Box

Instantiated Model (IM) Box

Data Box

Manipulated Data Box

Estimate Box

Update Box

Search Box

Regression Box

Classify Box

Compare Box

Destroying Contents of Boxes

When you create a flowchart by placing boxes on the session workbench and drawing edges between them, you set up dependencies between the models that are in the boxes. For instance, if you place a Graph box and a PM box on the workbench, with an edge from the Graph box to the PM box, whatever model you put in the PM box will be dependent on the model you put in the Graph box. Let’s say you put a DAG in the Graph box and a Bayes PM in the PM box, as follows:

Then if you change the DAG by adding a node, the Bayes PM is no longer valid, since it doesn’t have the same variables as the graph. In such a case, Tetrad will offer you a choice: either have the Bayes PM automatically updated to reflect the new variable, or have the edge between the Graph box and PM box removed.

If you select "Execute," the Bayes PM will be replaced by a new Bayes PM that adds the new variable and copies over as much of the information from the old Bayes PM as possible.

It is possible that information may be lost in this process. For example, if you add an IM box to the above session, place a Bayes IM in this box, and then change the number of categories for some variable, some of the conditional probability tables for the Bayes IM will inevitably lose information.

Note that all models in boxes downstream will be revised when you click "Execute." This is because of dependencies created by arrows between boxes downstream.
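The downstream effect described above can be pictured with a small sketch. It assumes a session is just a list of (parent, child) arrows between boxes; the function name and the box labels are hypothetical and are not taken from Tetrad's own code.

```python
from collections import deque

def downstream(edges, start):
    """Return every box reachable from `start` by following arrows."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    reached = []
    queue = deque([start])
    while queue:
        box = queue.popleft()
        for child in children.get(box, []):
            if child not in reached:
                reached.append(child)
                queue.append(child)
    return reached

# Hypothetical session: Graph --> PM --> IM --> Data
session_edges = [("Graph", "PM"), ("PM", "IM"), ("IM", "Data")]
print(downstream(session_edges, "Graph"))
```

Under this picture, changing the module in the Graph box touches every box the function returns, while changing the Data box touches nothing else.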

Each Module Explained

The interface for Tetrad lets you put boxes of different types into the main workspace and connect them up with arrows. Inside each of these boxes a specific module can be built, which depends on whatever modules are in the parent boxes. In this section, we discuss each module in particular, describe the types of parents it can take, describe how to navigate the dialogs for constructing and editing those modules, and refer to books or articles describing background theory for modules whose background theory requires more explanation.

Inside the Graph Box

A Graph box in the main workspace looks like this:

When you double left click on the graph box a dialog box opens:

If you choose "Directed Acyclic Graph," you will only be able to construct directed acyclic graphs. That is, you will only be allowed to construct directed (-->) edges, and you will not be permitted to construct cycles in your graph (X-->...-->X for some variable X in your graph). See Directed Acyclic Graphs.
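To make the "no cycles" condition concrete, here is a minimal sketch of an acyclicity check. It assumes the graph is given as a list of node names and a list of (from, to) pairs for directed edges; it uses the standard technique of repeatedly removing nodes with no incoming edges (Kahn's algorithm) and is not a description of how Tetrad itself enforces the rule.

```python
def is_acyclic(nodes, directed_edges):
    """Return True if there is no path X-->...-->X for any node X."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in directed_edges:
        children[src].append(dst)
        indegree[dst] += 1
    # Nodes with no incoming edges can be removed safely.
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    # If some nodes could never be removed, they lie on a cycle.
    return removed == len(nodes)

print(is_acyclic(["X1", "X2", "X3"], [("X1", "X2"), ("X2", "X3")]))
print(is_acyclic(["X1", "X2", "X3"], [("X1", "X2"), ("X2", "X3"), ("X3", "X1")]))
```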

If you choose "(General) Graph," you will be permitted to construct edges between variables with endpoints of the following three types: segment (-), arrow (->), and circle (-o). See General Graphs.

If you choose "Time Series Graph," you will be permitted to construct directed graphs over time-lagged variables for a given set of variables X1,..., Xn. See Time Series Graphs.

Choosing a graph type

Here is some general advice for picking graph types.

If you are constructing a Bayes model, choose "Directed Acyclic Graph." This will ensure that you use only directed edges and don’t create cycles.

If you are constructing a SEM model, choose "SEM Graph." This will ensure that you construct a graph with only directed edges (-->, showing causal relationships) and bidirected edges (<->, showing correlated errors) and that cycles in the graph will be permitted.

If you want to construct or edit other types of graphs used in Tetrad, choose "General Graph." This will allow you to construct directed graphs, patterns, PAGs, POIPGs, MAGs, and so on. See Tetrad Graph Types for more details. In most cases, you do not need to construct these types of graphs yourself, but they are output by Tetrad search algorithms, and you may need to edit them to turn them into DAGs. In that case, a "General Graph" editor will be displayed to help you edit these graphs.

See also SEM Graph.

SEM Graph

This is a specialized type of graph used for specifying the graphical structure of structural equation models (SEMs). The causal structure of the graph is indicated using directed edges (-->), and correlated errors are indicated using bidirected edges (<->). Cycles are permitted.

Structurally, between any two variables in the graph X and Y, up to three different edges may be added to the graph: X-->Y, X<--Y, and X<->Y.

To construct a SEM graph, place a Graph box on the workbench (see Graphs), double click the Graph box, choose "General Graph" from the menu, and click "OK." You will see the following dialog.

If you select "Created manually" and click "OK," a blank graph editor window is opened. If you select "A random DAG," you will need to fill in parameters to generate a random DAG. See Generating Random DAGs for more information. This DAG will be treated like a general graph, so that if you edit it you will be able to add edges to it that aren’t directed edges (-->) and you will be able to construct cycles.

You may at this point add variables and edges to the graph, or remove them if they’re already there. To add a measured variable to the graph, click "Add Variable" and then click in the white workbench area where you want the measured variable to be located. To add a latent variable to the graph, click "Add Latent" and then click in the white workbench area where you want the latent variable to be located. The name of an added variable will be the first name in the sequence X1, X2, ..., that’s not already in the graph. These names may be changed; see Editing Node Properties for details.
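The naming rule for new variables can be sketched in a few lines. This is only an illustration of the rule as stated above; the function name is hypothetical.

```python
def next_variable_name(existing_names):
    """Return the first name in X1, X2, ... that is not already in use."""
    i = 1
    while f"X{i}" in existing_names:
        i += 1
    return f"X{i}"

# X3 is the first unused name here, even though X4 already exists.
print(next_variable_name({"X1", "X2", "X4"}))
```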

To remove a variable from the graph, click on the variable you want to delete to select it and then press the delete key. If you remove a node from the graph, all of the edges attached to it will be removed as well.

To add an edge to the graph, click the type of edge you want to add, click and hold the mouse button down over the variable you want the edge to be from, and then drag the mouse to the variable you want the edge to be to. There are four types of edges you may add: directed (-->), undirected (---), unoriented (o-o), and bidirected (<->). Cycles are permitted.

To remove an edge from the graph, click on the edge you want to remove to select it and then press the delete key.

If all you want to do is turn edges that aren’t directed into directed edges or change the directions of directed edges, there is a shortcut way to do this. Simply click on the endpoint of the edge you want the arrow to be on, and the edge will change direction for you. Other edge orientation shortcuts are also available. See Edge Orientation Shortcuts.

A sample graph might look like this:

The interpretation is that X5 causes X1, X1 causes X3, X2 causes X3, there is a feedback loop between X3 and X4, and the error terms for X5 and X1 are correlated. Each variable in a SEM Graph is associated implicitly with an error term (see Structural Equation Models). To show the error terms for the endogenous variables, select "Show Error Terms" from the Tools menu.

Notice that any bidirected edges are adjusted so that they attach only to exogenous variables when error terms are shown. To hide the error terms, select "Hide Error Terms" from the Tools menu.

Note that using SEM graphs to construct SEMs is merely a convenience. Any SEM model you can build, you can build from a SEM graph, and all SEM graphs can be used to construct SEM models. However, DAGs can also be used to construct SEM models, and general graphs (see General Graphs) can be used to construct SEM models, provided only directed and bidirected edges are used.

Once you have made a graph, you may rearrange the nodes by clicking and dragging; the edges will follow along so that the structure of the graph remains the same. If you would like to move a whole section of nodes to another location, draw a "rubberband" around them and click on any one to move them. See Selecting Groups of Nodes for details. If you would like to change the name of a variable, or change whether the variable is latent or measured, double click the variable, edit its properties, and click "OK." See Editing Node Properties for details.

When you click "Save," the graph editor window will close, and the contents will be saved in memory. If you click "Cancel," the changes you made while editing will be discarded, and the state of the graph before editing will remain unchanged. You may change the graph you made at any time by reopening the Graph box and adding or removing variables or edges or rearranging variables.

If you right click on the Graph box, a popup menu will appear with several options. See Popup Menus for more detail.

Important Points

1. If you create a model with a latent variable and later use the model to generate data, values for the latent variables will not be shown. See Measured Vs. Latent Variables.

2. If you have introduced any boxes that depend on a Graph box, changing the graph will alter the contents of all boxes downstream in the flowchart from the Graph box. Often it is easier to simply create a new Graph box in the same main workspace window--there is no limit to how many graph or other boxes you can have at the same time. See Flowchart Dependencies.

Menus

1. The menu at the top of the Graph window has two options, "File" and "Edit". "Edit" currently does nothing. "File" gives three options. You can save a graph in a file, but there is no point because we have not yet implemented a facility to paste the saved graph into a new window. You can introduce the ALARM network, a fairly complex graph standardly used as a test for search algorithms. And you can save any graph you create as an image file that can be introduced into text documents, e.g., into Microsoft Word.

Possible Parents for "SEM Graph"

A "SEM Graph" model can be self-standing, as described above. However, it can also be made a child of a number of other models of a variety of different box types. Usually what this does is to make a copy in the "SEM Graph" box of a preexisting graph in the parent model, which is often a convenient thing to be able to do. The following models all have graphs that, if they happen to be interpretable as SEM graphs, can be copied into a "SEM Graph" model:

1. All graph models.

2. All search models.

3. All parametric models.

4. All instantiated models.

5. All updater models.

In certain special cases, making a "SEM Graph" model a child of another model has a specialized behavior. If you make a "SEM Graph" model a child of a Data box model, the effect is to create a graph with all of the variables in the data set but no edges, as illustrated below.

Inside the Parametric Model Box

A Parametric Model box in the main workspace looks like this:

In the standard setup the PM Box requires a directed arrow from a Graph Box. The type of graph one uses depends on the type of parametric model one wishes to construct. See Bayes Parametric Model or SEM Parametric Model for details.

Bayes Parametric Model

Description of Model

Bayes Parametric Model (Bayes PM) takes a DAG and adds to it two bits of information:

1. For each named node in the graph, the number of categories for the variable by that name.

2. For each variable, with a given number of categories, the list of category names for that variable.

Given the graph and the additional information in (1) and (2), a Bayes net can be formally specified; it is determined what all the parameters of the Bayes net are, although no values for parameters are yet known. To specify values for these parameters, a Bayes Instantiated Model must be constructed, based on a Bayes PM. For details on the parameters of a Bayes IM, see Bayes Instantiated Model.
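As an illustration of how the graph and the category counts fix the parameters, the sketch below computes the size of each variable's conditional probability table: one row per combination of parent categories and one column per category of the variable itself. The dictionaries and the function name are hypothetical; the category counts are made up for the example.

```python
from math import prod

def cpt_shape(variable, parents, num_categories):
    """Return (rows, columns) of the conditional probability table for `variable`."""
    # One row per combination of parent categories (a single row if no parents).
    rows = prod(num_categories[p] for p in parents[variable])
    # One column per category of the variable itself.
    columns = num_categories[variable]
    return rows, columns

# Hypothetical DAG: X2 --> X1 <-- X5, with made-up category counts.
parents = {"X1": ["X2", "X5"], "X2": [], "X5": []}
num_categories = {"X1": 3, "X2": 2, "X5": 3}
for variable in parents:
    print(variable, cpt_shape(variable, parents, num_categories))
```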

It is assumed in the current version of Tetrad that all discrete variables are nominal--that is, that the order of their categories is not important. See Defining Discrete Variables for more details.

How to Construct a Bayes PM

For example, say you put the following boxes on the session, connected as follows:

Say you start with this DAG. (It need not be, specifically, in a Directed Acyclic Graph box; all that matters is that it contain only directed edges with no cycles.)

If you click "Save" and double click the PM1 box, you are given a choice of which model type you would like to construct. Choose "Bayes Parametric Model."

Once you click OK, the following dialog appears:

In this dialog, you can click on a variable and edit its number of categories and category names. For instance, we can change the number of categories for X1 to 3 and set its categories to <Low, Medium, High>.

When you’re finished editing categories for variables, click "Save."

Potential Parents for Bayes Parametric Model

The Bayes PM can take any graph as parent that contains a DAG--that is, a graph that contains only directed edges (-->) with no cycles (i.e. there is no X such that X-->...-->X in the graph). The simplest option is to construct a Directed Acyclic Graph in the Graph box. (See Directed Acyclic Graph for more details.) If the parent is not a DAG, an error message will be displayed when the Bayes PM is constructed.

SEM Parametric Model

Description of the Model

A SEM Parametric Model (SEM PM) is a structural equation model (SEM) up to specification of what the parameters of the model are, without giving values for those parameters.

The implementation of structural equation models in Tetrad essentially follows Bollen (???). A structural equation model is a set of linear equations expressing each variable as a linear sum of its parents plus an exogenous error term--e.g.,

X1 = a1 * X2 + a2 * X3 + e1,
X2 = a3 * X3 + e2,

and so on, where the distribution of each error term has a specified variance, and correlations among error terms are specified.
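A minimal sketch of what these equations mean in practice is to draw cases from them. The sketch assumes that X3 is exogenous (X3 = e3), that the error terms are uncorrelated and Normal with mean zero, and that the coefficient and standard deviation values are arbitrary; none of these choices come from Tetrad itself, and the function name is hypothetical.

```python
import random

def simulate_case(rng, a1, a2, a3, sd1, sd2, sd3):
    """Draw one case from X1 = a1*X2 + a2*X3 + e1, X2 = a3*X3 + e2, X3 = e3."""
    # Independent, zero-mean Normal error terms (an assumption of this sketch).
    e1 = rng.gauss(0.0, sd1)
    e2 = rng.gauss(0.0, sd2)
    e3 = rng.gauss(0.0, sd3)
    # Evaluate the exogenous variable first, then its children.
    x3 = e3
    x2 = a3 * x3 + e2
    x1 = a1 * x2 + a2 * x3 + e1
    return {"X1": x1, "X2": x2, "X3": x3}

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [simulate_case(rng, a1=0.5, a2=0.2, a3=2.0, sd1=1.0, sd2=1.0, sd3=1.0)
        for _ in range(1000)]
print(len(data), data[0])
```

Note that the sketch takes standard deviations, whereas the model is described in terms of variances; the variance of each error term is the square of the value passed in.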

The graph for such a system consists of one node for each variable, one node for each error term (which may be hidden, or at least the error terms for exogenous variables may be hidden), a directed edge from each variable on the right hand side of each such equation above to the variable on the left hand side of the equation, and bidirected edges between each pair of variables whose error terms are correlated. (If the error term for a variable is being shown, the bidirected edge attaches to the error term instead of the variable itself.) Cyclical dependencies among variables are permitted. See SEM Graph for details.

The parameters in this model consist of:

1. Each linear coefficient in the structural equations for the model (e.g., a1, a2, and a3, above), plus

2. The variances of each error term in the model (e.g., var(e1), var(e2), above), plus

3. The covariances of each pair of error terms that is specified to be correlated.

The number of parameters, therefore, is equal to the number of edges in the graph of the model with error terms hidden (directed plus bidirected) plus the number of variables in the model. (When error terms are shown, extra directed edges are added to the graph from error terms to their variables; these do not add parameters to the model.)
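The counting rule can be checked with a short sketch. The graph below is a hypothetical five-variable SEM graph with a feedback loop and one pair of correlated errors; the function name is an assumption made for illustration.

```python
def count_sem_parameters(variables, directed_edges, bidirected_edges):
    """Count SEM parameters for a graph with error terms hidden."""
    coefficients = len(directed_edges)   # one linear coefficient per directed edge
    variances = len(variables)           # one error variance per variable
    covariances = len(bidirected_edges)  # one error covariance per bidirected edge
    return coefficients + variances + covariances

variables = ["X1", "X2", "X3", "X4", "X5"]
directed_edges = [("X5", "X1"), ("X1", "X3"), ("X2", "X3"), ("X3", "X4"), ("X4", "X3")]
bidirected_edges = [("X5", "X1")]
print(count_sem_parameters(variables, directed_edges, bidirected_edges))
```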

The SEM Parametric Model specifies only this list of parameters and allows this list to be edited. To give specific values for each parameter in the model, one should use the SEM Instantiated Model.

How to Construct a SEM PM

For example, say you put the following boxes on the session, connected as follows:

Say you start by creating a SEM Graph in the Graph box. (See SEM Graph for details.) To make it interesting, we create a SEM Graph that uses a couple of bidirected edges and has a cycle.

If you click "Save" and double click the PM1 box, you are given a choice of which model type you would like to construct. Choose "SEM Parametric Model."

Once you click OK, the following dialog appears:

In this dialog, error terms for endogenous variables are shown explicitly, and all of the parameters are labeled. Parameters B1, B2, B3, B4, and B5 (shown in black) are linear coefficients in the underlying structural equation model; parameters T1, T2, T3, T4, and T5 (shown in blue) are error variance terms; parameters T6 and T7 (shown in red) are error covariance terms.

In the dialog, you can double click on any parameter and change its name. For instance, you can double click on the variance term T3, above, and change its name to "var_x3". You can also set here whether the parameter should be held fixed for estimation and control its starting value for estimation, which becomes important in SEM estimation. (In SEM estimation, parameters are in general initialized randomly and then adjusted by an optimization algorithm to optimize, e.g., the maximum likelihood function for the model. See SEM Estimator for details. You can control here how these values are initialized for this process.)

Clicking OK, you see that the name of the parameter has been changed.

It is important to notice what you cannot do in this editor. You cannot change the list of variables or the names of variables, and you cannot add edges to or remove edges from the graph. To do these things, simply edit the graph that was used to construct the SEM PM model.

Potential Parents for a SEM PM

The SEM PM must be constructed using an object that has a graph in it of a type that can be used to construct a structural equation model. The obvious choice is a SEM Graph, since with this graph, you can add bidirected edges and cycles. You can, however, construct a SEM PM using a Directed Acyclic Graph, if you don’t care that the graph cannot contain bidirected edges or cycles, or a General Graph, if you don’t mind making sure on your own that the graph contains only directed and bidirected edges.

Potential Children for a SEM PM

There are two natural children for a SEM PM.

1. SEM Instantiated Model (SEM IM). The SEM PM is in a sense an incomplete SEM model, since it doesn’t specify values for its parameters. To specify these values, make a SEM IM a child of the SEM PM.

2. SEM Estimator. A SEM Estimator takes a SEM PM and a continuous data set and generates a fully estimated SEM IM.

Inside the Instantiated Model Box

An Instantiated Model box in the main workspace looks like this:

In the standard setup, an IM box takes as parent a PM box, but instantiated models can be generated in other ways as well, such as from estimators or updaters. See Bayes Instantiated Model, Dirichlet Bayes Instantiated Model, or SEM Instantiated Model for details.

Bayes Instantiated Model


A Bayes Instantiated Model (Bayes IM) extends a Bayes Parametric Model, specifying values for all of the parameters in the Bayes net. The parameters for a Bayes net (in the form that they’re used in Tetrad) are conditional probabilities stored in conditional probability tables, one for each variable in the Bayes net. A variable X has a (possibly empty) list of parents P1, ..., Pn--i.e., variables Pi such that Pi-->X in the Bayes PM. The variable itself and each of its parents has a list of categories. A conditional probability table for X is a specification of the probability P(X=x’ | P1=p1’, ..., Pn=pn’) for each category x’ of X and each combination of categories <p1’, ..., pn’> for parents P1, ..., Pn of X. For any particular combination of parent values <p1’, ..., pn’>, the sum of the conditional probabilities P(X=xj | P1=p1’,...,Pn=pn’) for all categories xj of X is equal to 1.0.
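The constraint on each row can be expressed as a small check. The sketch assumes a table is stored as a dictionary from a tuple of parent categories to a list of probabilities, one per category of X; this representation and the function name are hypothetical and chosen only for illustration.

```python
import math

def check_cpt(table):
    """Raise ValueError unless every row is non-negative and sums to 1.0."""
    for parent_values, row in table.items():
        if any(p < 0.0 for p in row):
            raise ValueError(f"negative probability in row {parent_values}")
        if not math.isclose(sum(row), 1.0, abs_tol=1e-9):
            raise ValueError(f"row {parent_values} sums to {sum(row)}, not 1.0")
    return True

# Hypothetical table for X with one two-category parent and three categories.
table = {
    ("low",): [0.2, 0.5, 0.3],
    ("high",): [0.6, 0.3, 0.1],
}
print(check_cpt(table))
```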

How to Construct a Bayes IM

To construct a Bayes IM, first construct a DAG, then a Bayes PM, and add an IM box to the workspace, with an arrow from the Bayes PM to the IM.

Fill in the Graph box and the PM box, as explained in Bayes Parametric Model. For instance, you might end up with a graph that looks like this (the categories for X1 are shown).

Now, double click the IM box. You get a choice of models; choose Bayes Instantiated Model:

When you click OK, you are offered a choice. You may either initialize the parameters of your Bayes net manually (i.e., fill them in one by one, by hand), or fill them in randomly.

We choose "Manually." We now get a dialog that looks like the following:

X1 here has two parents, X2 and X5. Each combination of parent values for X2 and X5 is listed as a row in the conditional probability table for X1. Each category for X1 is listed as a column in the conditional probability table. We can now fill in these probability values however we like, provided we choose non-negative real numbers that sum to 1.0 in each row. The interface helps out a little by filling in table cells whose values are implied. If you fill in the 0.2000 and 0.5000 in the table below, the table will fill in the 0.3000 for you. Also, if you simply want to fill in table cells randomly, right click on any nonselected table cell. You get a popup menu like the one below.
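The two conveniences just described, filling in an implied cell and randomizing a row, can be sketched as follows. Empty cells are represented by None, and the randomization simply normalizes uniform random weights; both choices are assumptions made for illustration, since the method Tetrad uses is not described here. The function names are hypothetical.

```python
import random

def fill_implied(row):
    """If exactly one cell is empty (None), fill it so the row sums to 1.0."""
    missing = [i for i, value in enumerate(row) if value is None]
    if len(missing) != 1:
        return list(row)
    remainder = 1.0 - sum(value for value in row if value is not None)
    if remainder < 0.0:
        raise ValueError("filled-in cells already sum to more than 1.0")
    filled = list(row)
    filled[missing[0]] = remainder
    return filled

def randomize_row(num_cells, rng):
    """Return random non-negative values that sum to 1.0 (one possible scheme)."""
    weights = [rng.random() + 1e-12 for _ in range(num_cells)]  # keep weights positive
    total = sum(weights)
    return [w / total for w in weights]

print(fill_implied([0.2, 0.5, None]))
print(randomize_row(3, random.Random(0)))
```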

If you select "Randomize this row," the row is filled in with random values. For example:

Similarly for the other popup menu functions shown.

Once all of the table cells have been filled in, the Bayes IM is ready to be used as input to other boxes. You may, for example, simulate data using the Bayes IM, or you may perform updating operations on it. See Simulating Data (Bayes) for more information on how to simulate data and Update Box for more information on updating.
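To give a rough idea of what simulating data from a filled-in Bayes net involves, the sketch below samples each variable in turn from the table row selected by its already-sampled parents. The two-variable net, its probabilities, and all names are hypothetical; this is a generic forward-sampling sketch and not a description of Tetrad's simulator.

```python
import random

def sample_case(order, parents, categories, cpts, rng):
    """Sample one case; `order` must list every parent before its children."""
    case = {}
    for var in order:
        parent_values = tuple(case[p] for p in parents[var])
        row = cpts[var][parent_values]  # probabilities over categories[var]
        case[var] = rng.choices(categories[var], weights=row, k=1)[0]
    return case

# Hypothetical two-variable net X2 --> X1 with made-up probabilities.
parents = {"X2": [], "X1": ["X2"]}
categories = {"X2": ["low", "high"], "X1": ["no", "yes"]}
cpts = {
    "X2": {(): [0.3, 0.7]},
    "X1": {("low",): [0.9, 0.1], ("high",): [0.4, 0.6]},
}
rng = random.Random(0)  # fixed seed so the run is repeatable
data = [sample_case(["X2", "X1"], parents, categories, cpts, rng) for _ in range(1000)]
print(len(data), data[0])
```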

Potential Parents for a Bayes IM

A Bayes IM can be constructed as indicated above (as a child of a Bayes PM). Bayes IM’s are also, however, output by other processes. In particular, a Bayes IM can be made a child of the following:

1. ML Bayes Estimator. Bayes estimators take Bayes PM’s and discrete data sets and produce new Bayes IM’s.

2. Dirichlet Estimator. Dirichlet estimators also take Bayes PM’s and discrete data sets (possibly with Dirichlet priors in the form of Dirichlet Bayes IM’s) and output Dirichlet Bayes IM’s. They may alternatively output Bayes IM’s.

3. Any Bayes updater. All of the Bayes updaters (Row Summing Updater, CPT Invariant Updater, and Approximate Updater) output new, updated Bayes IM’s.

Old text:

If you choose an ML Instantiated Bayes Model, the program will either randomly specify the conditional probability of each value of each variable given the values of its parents, or you can specify them manually through the dialog box:

Choose a variable either using the "Next" button on the right side of the window or by clicking on the variable in the graph on the left side of the window. In either case, you will see one or more rows of entry spaces, with each entry space labeled by the name of the value of the selected variable. Each row corresponds to an allowed assignment of values to the variables that are parents of the selected variable. For each row--each assignment of allowed values to the variables that are parents of the variable you have selected--you must enter a numerical value between 0 and 1 for each value of the variable you have selected. These are the respective probabilities of the values of your selected variable, conditional on the values in that row of its parent variables. The numbers you put into any row must add up exactly to 1. If they don’t, the program will simply erase the values you have entered in that row. If you start entering numbers in a row from the left, which is highly recommended, the program will automatically fill in the next to last entry space with the number (if one exists) needed to make the row numbers sum to 1.

Entering all of these conditional probabilities in even a medium sized model is very tedious, but there is no help for it other than to estimate the conditional probabilities from a data set, or to randomize. If you choose "Random" instead of "Manual" you will get a window very much like the one above, except that randomly chosen values will be entered in each row. You can edit these values by clicking on a row and typing. As in other windows showing graphs, you can select a variable by clicking on it.

Potential Parents for Bayes Instantiated Model

The Bayes IM can accept the following potential parents.

1. Bayes Parametric Model. To build a Bayes IM from scratch, one should first build a Bayes PM and then construct a Bayes IM as a child of the Bayes PM.

2. ML Bayes Estimator. An ML Bayes Estimator estimates a Bayes IM using a given Bayes PM and discrete data set.

3. Dirichlet Bayes Estimator. The normal output of a Dirichlet Bayes Estimator is a Dirichlet Bayes IM, but a Bayes Instantiated Model may be substituted. This Bayes IM will contain as parameter values the maximum likelihood probability values from the estimated Dirichlet Bayes IM.

Potential Children for Bayes Instantiated Model

1. Data Box--This is the way to simulate data from a discrete Bayes model. See Simulating Data (Bayes) for more details.

2. Any graph--Directed Acyclic Graph, SEM Graph, General Graph--simply copies the graph from the Bayes IM into a new graph box.

3. Bayes PM--copies the Bayes PM from the Bayes IM into a new PM box.

4. Bayes IM--copies the Bayes IM itself into a new IM box.

5. Any of the updating algorithms--Row Summing Exact Updater, CPT Invariant Updater, Approximate Updater.

6. ML Bayes Estimator (together with a discrete Data Set)--estimates the associated Bayes PM, producing a new Bayes IM.

Dirichlet Bayes Instantiated Model


A Dirichlet Bayes Instantiated Model (Dirichlet Bayes IM) is an alternative to a Bayes Instantiated Model that represents the distribution over each row in each conditional probability table as a Dirichlet distribution. A Dirichlet distribution is the probability distribution of the parameters of a multinomial distribution. That is, it is the probability distribution of the parameters of a list of "cells" that are (a) mutually exclusive and (b) have probabilities that sum to 1.0. Each row in a conditional probability table satisfies this criterion, so we build an alternative Bayes net representation row by row out of distributions defined in this way.

A specific Dirichlet model for a list of cells is given by specifying a Dirichlet parameter for each cell. The maximum likelihood probability for a cell is then given by the ratio of the Dirichlet parameter for that cell to the sum of the Dirichlet parameters for all of the cells in the list. In the simple case, these Dirichlet parameters will just be cell counts, considered as real numbers. We will usually, therefore, refer to Dirichlet parameters as pseudocounts. In the more general case, these pseudocounts can in fact be any positive real numbers.

A Dirichlet Bayes Instantiated Model is constructed using a Bayes Parametric Model, just like a Bayes Instantiated Model. The main differences are:

Instead of conditional probability tables, the Dirichlet Bayes Instantiated Model contains tables of pseudocounts, as described above.

Dirichlet Bayes Instantiated Models are used as inputs to a Dirichlet Estimator, rather than as inputs to an ML Bayes Estimator.

How to Construct a Dirichlet Bayes IM

To construct a Dirichlet Bayes IM, first construct a DAG, then a Bayes PM, and add an IM box to the workspace, with an arrow from the Bayes PM to the IM.

Fill in the Graph box and the PM box, as explained in Bayes Parametric Model. For instance, you might end up with a graph that looks like this (the categories for X1 are shown).

Now, double click the IM box. You get a choice of models; choose Dirichlet Bayes Instantiated Model:

When you click OK, you are offered a choice. You may either initialize the parameters of your Dirichlet Bayes net manually (i.e., fill them in one by one, by hand), or fill them in randomly.

We choose "Manually." We now get a dialog that looks like the following:

There are two tabs in the dialog that comes up next, "Probabilities" and "Pseudocounts." Let us consider "Pseudocounts" first. Pseudocounts are displayed in tables, one for each variable, with the same structure as conditional probability tables in Bayes IM’s. Each pseudocount is a positive real number; in this case they are all initialized to 1.0. The sum of the pseudocounts in each row is shown in the rightmost column.

Turning now to the "Probabilities" tab, we have a table in the form of a conditional probability table that displays maximum likelihood probabilities for each cell of each Dirichlet distribution (row) in the model. These probabilities are calculated by dividing each pseudocount value in the previous display by the sum of pseudocounts in that row. In order not to lose information, the total count for each row is displayed in the "Probabilities" tab as well. To recover pseudocounts, simply multiply the probability of a cell by the "total count" in the rightmost column.
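The relationship between the two tabs can be sketched for a single row as follows. The function names are hypothetical, and the pseudocount values are made up for the example.

```python
def to_probabilities(pseudocounts):
    """Divide each pseudocount by the row total; also return that total."""
    if any(count <= 0 for count in pseudocounts):
        raise ValueError("pseudocounts must be positive")
    total = sum(pseudocounts)
    return [count / total for count in pseudocounts], total

def to_pseudocounts(probabilities, total_count):
    """Recover pseudocounts by multiplying each probability by the total count."""
    return [p * total_count for p in probabilities]

probabilities, total_count = to_probabilities([1.0, 1.0, 2.0])
print(probabilities, total_count)
print(to_pseudocounts(probabilities, total_count))
```

Keeping the total count alongside the probabilities is what makes the round trip possible; the probabilities alone would not distinguish pseudocounts of 1, 1, 2 from 10, 10, 20.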

[Note: there is some funny business going on with the right-click popup menus for doing randomization. Need to make this work.]

Old text:

If you choose a Dirichlet Instantiated Bayes Model, you will be putting an initial (or prior) Dirichlet probability distribution over the conditional probability of each value of each variable conditional on values of its parent variables: a probability distribution over conditional probability distributions. The probability distribution over the conditional probability distributions implies an "all probabilities" considered probability for each value of each variable conditional on its parents’ values. Such Dirichlet distributions can be specified by pseudocounts, essentially a kind of fictional database. The program will automatically create a uniform and symmetric Dirichlet prior distribution for you in which all counts have the same value--you can pick the value. A Dirichlet Bayes IM may be set up manually (all values set by hand) or set up automatically as a symmetric prior in which all pseudocounts for all cells are set to a given, specified, value. Such Dirichlet distributions are called "symmetric," because the distribution function itself with such a choice of pseudocounts is symmetric with respect to variable permutation. (If all pseudocounts are set to 1.0, the distribution function is completely flat and therefore uninformative. If all pseudocounts are set to 0.5, the resulting distribution is known as a Jeffreys prior and has connections to information theory.)

SEM Instantiated Model


For a description of the sort of structural equation model (SEM) that is implemented in Tetrad, see SEM Parametric Model. The parameters of the structural equation model are:

1. The linear coefficients of the model,

2. The variances of each error term in the model, and

3. The covariances of each pair of error terms that is specified to be correlated in the model.

The purpose of the SEM IM is to allow values for these parameters to be specified. The SEM IM also implements the functions (such as the maximum likelihood function of the SEM) that are optimized by the SEM Estimator and calculates statistics for the SEM.

How to Construct a SEM IM

To construct a SEM IM, first construct a SEM Graph, then a SEM PM, as explained in SEM Parametric Model, and then add an IM box to the workspace, with an arrow from the SEM PM to the IM:

For instance, you might end up with a SEM PM that looks like this.

When you double click the IM box now, you get a SEM IM model that’s been filled in with randomly chosen values. (Notice you’re not given a choice of models; this is because there is only one IM model in Tetrad that can serve as a child of a SEM PM model.)

Notice that in the SEM IM model, parameter values appear where parameter names appeared in the SEM PM. For instance, the linear coefficient for the edge X6-->X5 is labeled as "B7" in the SEM PM, but the actual real value for the parameter, "-0.7867," is shown in the SEM IM. These real values may be edited in two ways. The first way is to click on the numbers themselves. If you click on the "-0.7867," above, a small box appears to let you edit the value of this parameter.

You may, for instance, type ".25" and hit return; your new value for the parameter is recorded.

The other way to view and edit parameter values is using the Tabular Editor. If you click on the Tabular Editor tab, you get this display.

Notice that the parameter value for the edge coefficient of the edge X6-->X5 is "0.2500." We can edit this value again from this view by clicking on the box containing the value "0.2500" and changing it to, say, "-.5."

Whichever view we edit values in, the other view will reflect the updated values.

Notice that in the tabular view there are some columns to show statistics for each parameter. These columns are used by the SEM Estimator to show how robust the estimation of each parameter is. These are ordinary statistics calculated for SEM estimations; SE stands for "standard error," T for "t statistic," and P for "p value."

If you click on the Implied Matrices tab, an implied matrix of some type that you choose will be displayed.

You may display either the implied covariance matrix of the model for all variables (shown), or the implied covariance matrix of the model for the measured variables only, or one of the corresponding implied correlation matrices.
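The implied matrices follow from the parameter values by a standard identity for linear SEMs. As a sketch, assume notation that does not appear in the original: the model is written X = BX + e, where entry B(i, j) is the linear coefficient of Xj in the equation for Xi, Omega is the covariance matrix of the error terms, and I - B is invertible. Then the implied covariance matrix Sigma and the implied correlation matrix R are:

```latex
\Sigma = (I - B)^{-1}\,\Omega\,\bigl((I - B)^{-1}\bigr)^{\top},
\qquad
R_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\,\Sigma_{jj}}}
```

The matrix for the measured variables only is the submatrix of Sigma obtained by keeping the rows and columns of the measured variables.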

If you click on the Model Statistics tab after having done a SEM estimation, you will be shown some goodness of fit statistics for the model as a whole. See SEM Estimator for more details.

Potential Parents for SEM Instantiated Model

A SEM IM may be made the child of the following modules:

1. SEM Parametric Model. The normal way to construct a SEM IM from scratch, as described above, is to first make a SEM PM and then make a SEM IM that’s a child of the SEM PM.

2. SEM Estimator. A SEM Estimator estimates a SEM IM from a given SEM PM and continuous data set.

Potential Children for SEM Instantiated Model

A SEM IM may be made a parent of the following objects:

1. Data Box. This is the way to simulate data from a SEM model. See Simulating Data (SEM) for more details.

2. SEM Graph. Copies the contents of the SEM IM graph into a new Graph box.

3. SEM PM. Makes a copy of the SEM PM used to create this SEM IM.

4. SEM IM. Makes a copy of the parent SEM IM.

5. SEM Estimator (with a continuous Data Set). See SEM Estimator.

Notably, there is currently no updater that takes as input a SEM IM, although there should be.

Inside the Data Box

A Data Box in the main workspace looks like this:

if it’s data that’s being simulated, or this:

if it’s data that was loaded in from an external file.

Most of the time that you interact with the data box, you will be interacting with a Data Set List, which is a list of data sets, one of which (the one you see) is designated as "selected." For more information, see Data Set Lists.

Data Set List

A data set list stores a list of one or more data sets, possibly of different types. One of the data sets is designated as "active," in the sense that (a) it’s the one you see when you double click the Data Box, and (b) it’s the one that’s used downstream, e.g., by search and estimation algorithms. The types of data sets that can currently be stored in a data set list are:

1. Tabular Data Set, 2. Covariance Matrix, and 3. Correlation Matrix.

These types of data sets each has a distinctive appearance when being edited, as shown below.

For information on how to load data files, see the help file for Data Loader.

Tabular Data Sets

Tabular data sets are rectangular data sets with data for a (possibly mixed) list of continuous and discrete variables. (For detailed information on tabular data sets, see Tabular Data Sets. For information on continuous and discrete variables, see Continuous and Discrete Variables.) A tabular data set containing all discrete data (that is, a discrete data set) looks like this:

A tabular data set containing all continuous data (that is, a continuous data set) looks like this:

These data sets can be edited directly. For information on how to edit tabular data sets in the Data Editor, see Editing Tabular Data Set.

Covariance Matrices

A covariance matrix in Tetrad is a symmetric, positive definite matrix M with dimension equal to the number of variables in the data set, associated with a sample size. If the list of variables is <X1, X2, X3, X4, X5>, then var(Xi) = m(i, i) and cov(Xi, Xj) = m(i, j). The sample size may be any number greater than zero. Here is what a covariance matrix looks like in the data editor:

Only the lower triangle is shown, since the matrix is symmetric. For information on how to edit a covariance matrix in the Data Editor, see Editing Covariance/Correlation Matrices.

Correlation Matrices

A correlation matrix in Tetrad behaves just like a covariance matrix (it is one!), except that it is labelled as "Correlation Matrix," has a diagonal consisting entirely of 1.0’s, and shows correlations instead of covariances. That is, if the list of variables is <X1, X2, X3, X4, X5>, as in the above example, then a symmetric, positive definite matrix M is shown such that m(i, j) is the correlation of Xi and Xj and in particular m(i, i) = 1.0 for all i. The sample size may be any number greater than zero. Here is what a correlation matrix looks like in the data editor (notice that only the lower triangle is shown since it is symmetric):

Notice that these last images show tabs that let you switch back and forth between three different data sets. All three are stored in the Data Box; if you want to, say, search over a different one, simply click the tab for that data set and your next search will be over that one instead.
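The relation between a covariance matrix and the corresponding correlation matrix described above can be sketched as follows. This is a minimal illustration in plain Python, not Tetrad code; the function names and the example matrix are made up for the purpose.

```python
import math


def covariance_to_correlation(cov):
    """Rescale a covariance matrix so that every diagonal entry becomes 1.0."""
    size = len(cov)
    std = [math.sqrt(cov[i][i]) for i in range(size)]
    return [[cov[i][j] / (std[i] * std[j]) for j in range(size)]
            for i in range(size)]


def lower_triangle(matrix):
    """Keep only the entries on or below the diagonal, as the editor displays them."""
    return [row[: i + 1] for i, row in enumerate(matrix)]


# A made-up symmetric, positive definite covariance matrix for <X1, X2, X3>.
cov = [[4.0, 2.0, 0.0],
       [2.0, 9.0, 3.0],
       [0.0, 3.0, 16.0]]
corr = covariance_to_correlation(cov)
shown = lower_triangle(corr)
```

Here m(i, j) of the correlation matrix is cov(Xi, Xj) divided by the product of the two standard deviations, so the diagonal comes out as 1.0.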

Missing Data

Missing data for all data types is represented using asterisks ("*"). See Handling Missing Data for details.

Creating Data Sets

You can make a data set in a number of different ways:

1. You may create a data set from scratch, by typing it into the data set. See Creating Data from Scratch for details.

2. You may load a data set from a file. See Loading Data for details.

3. You may simulate data from a model. See Simulating Data for details.

4. You may manipulate one data set to create another. See Manipulating Data for details.

Knowledge

Every data set may be associated with background knowledge. The reason for this is that one often wants to run more than one search from the same data set, using the same knowledge, and associating knowledge with a data set is an easy way to accomplish that. To see how to set up background knowledge, see Editing Knowledge.

To use knowledge associated with a data set in a search, simply (a) construct the data set, (b) associate the knowledge, (c) add a search box to the main workspace, (d) draw an edge from the data box to the search box, and (e) execute the search.

Inside the Manipulated Data Box


The manipulated data box is used to alter a data file. Inside the box you may:

Interpolate a Missing Value marker (an asterisk) wherever values are missing in the data file.
Insert the modal value of a discrete variable wherever values of that variable are missing in the data file. (Better interpolations are coming.)
Produce a data set with multiplications of cases. Each case is multiplied by the number in the leftmost column of the data file input to the Manipulated Data Box.

Once it has been run, the Manipulated Data box works just like any other Data Box. Note one programming oddity: if you wish to discretize a continuous variable, you must do so inside the Data Box that holds the data file, not inside the Manipulated Data box.
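Two of the manipulations listed above, modal-value insertion and multiplication of cases, can be sketched in Python. This is an illustration only, not Tetrad's implementation. The function names are hypothetical, ties for the mode are broken by first occurrence, and the multiplier column is assumed to be dropped from the output; the manual does not specify either detail.

```python
from collections import Counter

MISSING = "*"  # missing-value marker, as described in the Missing Data section


def fill_with_mode(column):
    """Replace each missing entry of a discrete column with the modal value."""
    observed = [value for value in column if value != MISSING]
    if not observed:
        return list(column)  # nothing to base a mode on; leave unchanged
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if value == MISSING else value for value in column]


def expand_cases(rows):
    """Repeat each case by the multiplier stored in its leftmost column.

    Assumption: the multiplier column itself is not copied to the output.
    """
    expanded = []
    for multiplier, *values in rows:
        expanded.extend(list(values) for _ in range(int(multiplier)))
    return expanded


filled = fill_with_mode(["a", "b", "*", "b", "*"])
cases = expand_cases([[2, "x", 0], [1, "y", 1]])
```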

Inside the Estimate Box

An Estimate box in the main workspace looks like this:

The Estimate program takes a parametrized model (a PM) and a data set for the variables in that model, and returns an Instantiated Model, an IM. It will also take data and an (ML) IM as input. Once a model is estimated, the contents of the Estimate box can be transferred to an empty IM box and then used to generate data, to classify, or to update (in the last two cases, only if the model is a Bayes net, not a SEM).

If a Maximum Likelihood Bayes Net and data are directly connected to Estimate, the estimation procedures will ignore all cases in the data set with missing values for any variables. Missing data values can be interpolated by connecting the data to a Manipulated Data box, and connecting that box to the Estimator box.

There are several varieties of estimation, depending on the graphical input (the PM or IM):

1. If the input PM or IM is for a SEM, the Estimate program immediately produces a full information maximum likelihood estimate of the parameters, provided the model in the PM or IM is identifiable. Latent variables are allowed. The procedure also gives model statistics, including the implied covariance and correlation matrices, and the chi square likelihood ratio statistic and its p value for the model.

2. If the input is a PM for a Bayes Net, the Estimate program produces a maximum likelihood estimate of the model parameters, provided the model has no latent variables.

3. If the input is a Maximum Likelihood Instantiated Bayes Net (an IM), the Estimate program produces a maximum likelihood estimate of the model parameters.

4. If the input is a Dirichlet Instantiated Bayes Model, the Dirichlet Bayes estimator estimates a posterior Dirichlet Bayes instantiated model given a prior Dirichlet Bayes instantiated model and a discrete data set. The data set must contain all of the same variables as the prior instantiated model. Latent variables are not allowed.

The Dirichlet estimation algorithm is simple. First, a new (blank) posterior Dirichlet Bayes IM is created. Then, for each cell in the posterior, the value (a) from the corresponding cell in the prior is retrieved, and the number of cases in the data satisfying the condition of that cell (n) is counted. The value of the cell in the posterior is set to a + n. Estimated conditional probabilities and the total pseudocount in each row are calculated from these cell values.

As a shortcut, it is possible in the interface to use a Bayes PM and a discrete data set as parents to the Dirichlet Bayes Estimator. If you do this, a symmetric Dirichlet Bayes IM will be generated in the background and used as the prior for the Dirichlet estimation algorithm. The symmetric pseudocount that should be used here may be specified at time of construction.

In its present implementation, Bayes nets with latent variables cannot be estimated.
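The Dirichlet estimation step described above (posterior cell = a + n, then row-wise normalization) can be sketched in Python for a single variable with one parent. This is a minimal illustration under assumed data structures, not Tetrad's code: the table is a hypothetical dict keyed by (parent value, child value), and the function names are made up.

```python
def dirichlet_update(prior, cases):
    """Return posterior pseudocounts: prior value a plus observed count n per cell.

    prior: dict mapping (parent_value, child_value) -> pseudocount a
    cases: list of (parent_value, child_value) pairs observed in the data
    """
    posterior = dict(prior)      # start each cell at its prior value a
    for cell in cases:
        posterior[cell] += 1     # each matching case adds one to n
    return posterior


def conditional_probabilities(table):
    """Divide each cell by the total pseudocount of its row (same parent value)."""
    row_totals = {}
    for (parent, _child), count in table.items():
        row_totals[parent] = row_totals.get(parent, 0.0) + count
    return {cell: count / row_totals[cell[0]] for cell, count in table.items()}


# Symmetric prior with pseudocount 1.0 for binary parent and binary child.
prior = {(p, c): 1.0 for p in (0, 1) for c in (0, 1)}
cases = [(0, 0), (0, 0), (0, 1), (1, 1)]
posterior = dirichlet_update(prior, cases)
probs = conditional_probabilities(posterior)
```

In this example the row for parent value 0 has posterior pseudocounts 3 and 2, giving estimated conditional probabilities 0.6 and 0.4.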

Types of estimators:

ML Bayes Estimator
SEM Estimator
Dirichlet Estimator

ML Bayes Estimator

Data and an ML Instantiated Bayes Model can be input to Estimate. Estimate will erase the parameter values (the conditional probability of each variable value given values of the parent variables) in the IM model and replace them with a maximum likelihood estimate of parameter values from the data.

SEM Estimator

Data and a SEM PM or SEM IM can be input to Estimate. Estimate will produce a full information maximum likelihood estimate of the SEM parameters, provided the model is identifiable.

Dirichlet Estimator

When Data together with a Dirichlet Instantiated Bayes Model are input to Estimate, new parameters--new probabilities for each variable conditional on its parents’ values--are calculated by Bayes’ Theorem. Unlike an Estimate for an ML instantiated model, which simply ignores the previous parameter values, with the Dirichlet Instantiated Bayes Model the Estimate function combines new Data with the prior probabilities in the model to produce its parameter estimates.

Inside the Update Box


The functions inside the Update Box enable you to use an Instantiated Bayes Net model to compute the conditional probability of any variable in the model from values you specify for any other variables in the model. Tetrad has three programs for updating: (1) Approximate Updater, (2) Row Summing Exact Updater, and (3) CPT Invariant Updater.

1. Approximate Updater

Calculates updated marginals for a Bayes net by simulating data and calculating likelihood ratios. The method is as follows. For P(A | E) (where E is the evidence), enough sample points are simulated from the underlying Bayes IM so that 1000 satisfy the condition E, keeping track of the number n of these that also satisfy condition A. Then the maximum likelihood estimate of P(A | E) is calculated as n / 1000.

The approximate updater runs quite quickly, even for large numbers of variables, so long as the number of variables in the evidence is small. The more variables there are in the evidence, the more sample points need to be generated to achieve 1000 sample points satisfying E.
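The sampling scheme just described can be sketched in Python. This is an illustration of the idea, not Tetrad's implementation: the two-variable network X-->Y, its probabilities, and the function names are all hypothetical.

```python
import random


def sample_case(rng):
    """Draw one case from a hypothetical Bayes net X --> Y with binary variables."""
    x = 1 if rng.random() < 0.3 else 0           # P(X = 1) = 0.3
    p_y1 = 0.9 if x == 1 else 0.2                # P(Y = 1 | X)
    y = 1 if rng.random() < p_y1 else 0
    return {"X": x, "Y": y}


def matches(case, condition):
    return all(case[var] == val for var, val in condition.items())


def approximate_update(query, evidence, rng, target=1000):
    """Simulate until `target` cases satisfy the evidence; return n / target."""
    accepted = 0   # cases satisfying E
    n = 0          # cases satisfying both E and A
    while accepted < target:
        case = sample_case(rng)
        if matches(case, evidence):
            accepted += 1
            if matches(case, query):
                n += 1
    return n / target


# Estimate P(X = 1 | Y = 1); the exact value is 0.27 / 0.41.
estimate = approximate_update({"X": 1}, {"Y": 1}, random.Random(0))
```

Cases that do not satisfy E are simply discarded, which is why evidence with low probability makes the loop run longer.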

2. Row Summing Exact Updater

Calculates updated marginals P(A | E) for a Bayes net (where E is the evidence) by summing probabilities for rows in the joint probability table that satisfy condition E, summing probabilities for rows in the joint probability table that satisfy condition A & E, and dividing the second sum by the first. A row in the joint probability table in this sense is a combination of values for the variables of the Bayes net mapped to the probability of that combination of values occurring in a sample. This probability is calculated for each row from the conditional probability tables of the Bayes net using the standard factorization of the Bayes net.

The row summing updater can be extremely expensive if the number of variables in the Bayes net is large and the number of variables in the evidence is small. However, the row summing updater can be extremely useful (and fast) if almost all of the variables in the Bayes net are in the evidence. For instance, if all but one variable (say, X) is in the evidence, then the number of rows in the joint probability table that have to be examined in order to calculate marginals for X is just the number of categories of X.
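The row summing calculation can be sketched in Python for a small, hypothetical Bayes net X-->Y with binary variables. The network, its numbers, and the function names are assumptions made for illustration; this naive version enumerates every row of the joint table rather than only the rows consistent with the evidence.

```python
from itertools import product

# Hypothetical conditional probability tables for X --> Y.
P_X = {0: 0.7, 1: 0.3}
P_Y_GIVEN_X = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}}


def joint_rows():
    """Yield (row, probability) pairs using the factorization P(X) P(Y | X)."""
    for x, y in product((0, 1), repeat=2):
        yield {"X": x, "Y": y}, P_X[x] * P_Y_GIVEN_X[x][y]


def satisfies(row, condition):
    return all(row[var] == val for var, val in condition.items())


def row_sum_update(query, evidence):
    """P(A | E) = (sum over rows satisfying A & E) / (sum over rows satisfying E)."""
    p_e = sum(p for row, p in joint_rows() if satisfies(row, evidence))
    p_ae = sum(p for row, p in joint_rows()
               if satisfies(row, evidence) and satisfies(row, query))
    return p_ae / p_e


posterior = row_sum_update({"X": 1}, {"Y": 1})
```

Here the evidence Y = 1 matches two rows with total probability 0.41, of which the row with X = 1 contributes 0.27.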

3. CPT Invariant Updater

Calculates updated marginals P(A | E) for a Bayes net (where E is the evidence) by breaking the problem down into two parts: first, calculating an "updated Bayes net" (in a sense to be defined), and second, calculating marginals recursively from this updated Bayes net.

Probabilities for a Bayes net are specified in terms of conditional probability tables for its variables. These are tables of the probability for each category of a variable conditional on each combination of parent values of that variable, P(V = v’ | P1 = p1’ & ... & Pn = pn’). Define an "updated Bayes net" as the Bayes net in which each of these probabilities has been replaced by P(V = v’ | P1 = p1’ & ... & Pn = pn’ & E). (These replacement values will not be defined if the conjunction P1 = p1’ & ... & Pn = pn’ & E is impossible.) It is straightforward to show that marginals for such an updated Bayes net just are the updated marginals for the original Bayes net.

In updating a Bayes net, in the sense defined above, only the conditional probability tables for ancestors of the evidence variables are altered. This suggests an algorithm for updating a Bayes net given evidence E. For each variable that’s an ancestor of the evidence variables, use the row summing method to calculate updated conditional probabilities in that variable’s conditional probability table. Otherwise, just keep the conditional probabilities from the original Bayes net. This is the algorithm that is implemented.

For calculating single-variable marginals from a Bayes net, Bayes’ Theorem is used recursively. For example, if X-->Y, where X and Y both have categories {0, 1}, P(Y = 0) = P(Y = 0 | X = 0) P(X = 0) + P(Y = 0 | X = 1) P(X = 1). Since all of the probabilities on the right side of this equation are stored in the conditional probability tables for the Bayes net, this value can be calculated directly. For longer chains, more recursion would have to be done to calculate marginals for intervening variables. These intervening marginals can, however, once calculated, be stored for later use, and they are.

The algorithm slows down (as do most updating algorithms) when dealing with graphs where parents of a node are d-connected (see Spirtes, et al 2000 for the exact definition). For instance, in this graph:

X-->Y X-->Z Y-->W Z-->W W-->R

calculating R requires much more extensive calculation than for this graph:

X1-->Y X2-->Z Y-->W Z-->W W-->R.

In this case, in order to calculate marginals for W, one needs to know probabilities of W given particular combinations of parent values of W, which are given in the conditional probability tables of the Bayes net, and one also needs to know the probabilities of the various combinations of parent values occurring. For example, say that Y, Z have categories {0, 1} in the first graph, above, and one wants to know the probability P(W = 0). One can calculate this probability as P(W = 0 | Y = 0, Z = 0) P(Y = 0, Z = 0) + P(W = 0 | Y = 0, Z = 1) P(Y = 0, Z = 1) + P(W = 0 | Y = 1, Z = 0) P(Y = 1, Z = 0) + P(W = 0 | Y = 1, Z = 1) P(Y = 1, Z = 1). The problem with d-connected parents is calculating, e.g., P(Y = 0, Z = 0). The CPT invariant updater calculates this probability in a standard way, as P(Y = 0) P(Z = 0 | Y = 0). This requires a recursive application of the marginal calculating procedure and is expensive. However, the problem of d-connected parents of variables is a standard problem (even if not always phrased that way) for updating procedures.

In general, the CPT invariant updater is quite fast, but can be slowed down for two reasons: (a) the subgraph of the Bayes net restricted to ancestors of the evidence variables is complicated, forcing more updated conditional probabilities to be calculated, and (b) there are a lot of variables in the Bayes net whose parents are moderately or strongly d-connected.
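The recursive calculation of single-variable marginals, with intermediate marginals stored for later use, can be sketched in Python for the simple chain case only. The chain X-->Y-->W, its probabilities, and the names below are hypothetical, and the sketch deliberately does not handle nodes with several d-connected parents, which is the expensive case discussed above.

```python
from functools import lru_cache

# Hypothetical chain X --> Y --> W with binary variables.
PARENT = {"X": None, "Y": "X", "W": "Y"}
# CPT[variable][parent_value] is the list [P(variable = 0), P(variable = 1)].
CPT = {
    "X": {None: [0.6, 0.4]},
    "Y": {0: [0.9, 0.1], 1: [0.3, 0.7]},
    "W": {0: [0.5, 0.5], 1: [0.2, 0.8]},
}


@lru_cache(maxsize=None)  # stores each marginal once it has been calculated
def marginal(variable, value):
    """P(variable = value), summing over the values of the single parent."""
    parent = PARENT[variable]
    if parent is None:
        return CPT[variable][None][value]
    return sum(CPT[variable][pv][value] * marginal(parent, pv) for pv in (0, 1))


p_y0 = marginal("Y", 0)   # 0.9 * 0.6 + 0.3 * 0.4
p_w0 = marginal("W", 0)   # reuses the stored marginals for Y
```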

Types of updaters:

Row Summing Updater
CPT Invariant Updater
Approximate Updater


Inside the Search Box

The search box in the main workspace area looks like this:

if the search is being conducted over a data set, or this:

if the search is being conducted directly over a graph (as a source of true conditional independence facts, to see how the algorithm behaves ideally).

Tetrad has a variety of search algorithms to assist in searching for causal explanations of a body of data.

It should be noted that the Tetrad search procedures are exponential in the worst case (when all pairs of variables are dependent conditional on every other set of variables). The search procedures may take a good bit of time, and there is no guarantee beforehand as to how long that will be.

These search algorithms are different from those conventionally used in statistics.

1. There are several search algorithms, differing in the assumptions they make.

2. Many of the search algorithms allow the user to specify background information that will be used in the search procedure. In many cases the search results will be uninformative unless such background assumptions are explicitly made. This design not only provides for more flexibility, it also encourages the user to be conscious of the additional assumptions imposed in deriving a model from data.

3. Even with background assumptions, data often do not determine a unique best or robust explanation. The search algorithms take in data and return information about a collection of alternative causal graphs that can explain features of the data. They do not usually return a unique graph, although they sometimes will if sufficient prior knowledge is specified. In contrast, if one searches for a regression model of the influences of a set of variables on a selected variable, a regression model will certainly be found (provided there are sufficient data points), specifying which variables influence the target and which do not.

4. The algorithms are in some respects cautious. Search algorithms such as FCI and PC, described below, will often say, correctly, that it cannot be determined whether or not a particular variable influences another.

5. The algorithms are not just useful guesses. Under explicit assumptions (which often hold at best only approximately), the algorithms are "pointwise consistent"--they converge almost surely to the correct answer. The conditions for this sort of consistency of the search procedures are described in the references. Conventional model search algorithms--stepwise regression, for example--have such guarantees only under very strong prior assumptions about the causal structure.

6. The output of the search algorithms provides a variety of indirect information about how much the conclusions of the algorithm can be trusted. They can, for example, be run repeatedly for a variety of specifications of alpha values in their statistical tests, to gain insight about robustness. For search algorithms such as PC, CCD, GES and MimBuild, described below, if the search algorithms return "spaghetti"--a highly connected graph--that indicates the search cannot determine whether all of the connected variables may be influenced by a common unmeasured cause. If the PC algorithm returns an edge with two arrowheads, that indicates a latent variable may be acting; if searches other than CCD return graphs with cycles, that indicates the assumptions of the search algorithm are violated.

7. Some of the search procedures are robust against common difficulties in sampling designs--they give correct, but reduced information in such cases. For example, the FCI algorithm allows that there may be unobserved latent common causes of measured variables--or not--and that the sample may have been formed by a process in which the values of measured variables influence whether or not a unit is included in the sample (sample selection bias). The CCD algorithm allows that the correct causal structure may be "non-recursive"--essentially a cyclic graphical model, a folded up time series.

8. The output of the algorithms is not an estimated model with parameter values, but a description of a class of causal graphs that can explain statistical features of the data considered by the search procedures. That information can be converted by hand into particular graphical models in the form of directed graphs, which can then be estimated by the program and tested.

The search procedures available are named:

PC -- Searches for Bayes net or SEM models when it is assumed there is no latent (unrecorded) variable that contributes to the association of two or more measured variables.
CPC -- Variant of PC that improves arrow orientation accuracy.
PCD -- Variant of PC that can be applied to deterministic data.
FCI -- Performs a search similar to PC but allowing that there may be latent variables.
CCD -- Searches for non-recursive SEM models (models of feedback systems using cyclic graphs) without latent variables.
GES -- Scoring search for Bayes net or SEM models when it is assumed there is no latent (unrecorded) variable that contributes to the association of two or more measured variables.
MBF -- Searches for the Markov blanket DAGs for a given target T over a list of variables <v1,...,vn,T>.
CEF -- Variant of MBF that searches for the causal environment of T (i.e., parents and children of T).
Structural EM
MimBuild -- Searches for latent structure from the output of Build Pure Clusters or Purify Clusters.
BPC -- Searches for sets of variables that share a single latent common cause.
Purify Clusters -- Searches for sets of variables that share a single latent common cause.

Inputs to the Search Box

There are two possible inputs for a search algorithm: a data set or a graph. If a graph is input, the program computes the implied independence and conditional independence relations and allows you to conduct any search that uses only such constraints--the PC, FCI and CCD algorithms.

Why would you apply a Search procedure to a model you already know? For a very important reason: the Search procedures will find the graphical representation of alternative models to your model that imply the same constraints.
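To make the idea of alternative models implying the same constraints concrete, here is a small Python sketch. It is not a Tetrad search procedure. It uses a standard graphical criterion for acyclic graphs without latent variables: two such graphs imply the same conditional independence relations exactly when they have the same adjacencies and the same unshielded colliders (triples a-->c<--b with a and b not adjacent). The edge representation and the function names are assumptions made for the example.

```python
from itertools import combinations


def skeleton(edges):
    """Adjacencies of a directed graph, ignoring edge direction."""
    return {frozenset(edge) for edge in edges}


def unshielded_colliders(edges):
    """Triples (a, child, b) with a --> child <-- b and a, b not adjacent."""
    adjacent = skeleton(edges)
    parents = {}
    for parent, child in edges:
        parents.setdefault(child, set()).add(parent)
    colliders = set()
    for child, pars in parents.items():
        for a, b in combinations(sorted(pars), 2):
            if frozenset((a, b)) not in adjacent:
                colliders.add((a, child, b))
    return colliders


def same_constraints(edges1, edges2):
    """True if two acyclic graphs share adjacencies and unshielded colliders."""
    return (skeleton(edges1) == skeleton(edges2)
            and unshielded_colliders(edges1) == unshielded_colliders(edges2))


# Edges are (parent, child) pairs.
chain = [("X", "Y"), ("Y", "Z")]        # X --> Y --> Z
reversed_chain = [("Z", "Y"), ("Y", "X")]  # X <-- Y <-- Z
collider = [("X", "Y"), ("Z", "Y")]     # X --> Y <-- Z
```

The chain and its reversal cannot be told apart by independence constraints alone, while the collider can; a constraint-based search reports this kind of ambiguity rather than picking one graph.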

The more usual use of the search algorithms requires a data set as input. Here is an example.

Select the Search button.

Click in the workbench to create a Search icon. Use the Flow Charter button to connect the Data icon to the Search icon.

Double-click the Search icon to choose a search procedure.

Selecting a Search procedure

Tetrad offers the following choices of search algorithms. For more details about the assumptions and parameters needed for each algorithm, click on the respective links.

There are two main classes of algorithms. The first one is designed for general graphs with or without assuming the possibility of hidden common causes:

PC algorithm: this method assumes that there are no hidden common causes between observed variables in the input (i.e., variables from the data set, or observed variables in the input graph) and that the graphical structure sought has no cycles.

FCI algorithm: this method does not assume that there are no hidden common causes between observed variables in the input (i.e., variables from the data set, or observed variables in the input graph); it does assume that the graphical structure sought has no cycles.

CCD algorithm: this method assumes there are no hidden common causes; it allows cycles; it is only correct for discrete variables under restrictive assumptions.

GES algorithm: same assumptions as the PC algorithm, except that this one performs search by scoring a graph by its asymptotic posterior probability.

The second class concerns algorithms to search for latent variable structural equation models from data and background knowledge.

MIM Build algorithm: learns the causal relationships among latent variables, when the true (unknown) data generation process is believed to be a pure measurement/structural model.

Build Pure Clusters algorithm: a complement to MIM Build and Purify, this algorithm learns the causal relationships from latent variables to observed variables, when the true (unknown) data generation process is believed to contain a pure measurement/structural submodel--i.e. a model in which every measured variable is influenced by one and only one latent variable.

Purify algorithm: given a measurement model, this method searches for a submodel that is pure in this sense.

Select the desired algorithm that meets your assumptions from the Search list. An initial dialog box showing the search parameters you can set is displayed. The following figure illustrates the one that is displayed when PC Algorithm is selected.

After the proper parameters are set, if the user checks the box "Execute searches automatically", the automated search procedure will start when the OK button is clicked. The respective Help button can be used to get instructions about that specific algorithm. The next window displays the result of the procedure, and can also be used to start new searches. The following figure illustrates an output for the PC algorithm.

Inserting background knowledge

Besides the assumptions underlying each algorithm, the search procedures can use background knowledge provided by the user as a further source of constraints, narrowing the search and returning a more informative output. To see how to specify background knowledge for a search algorithm, see Editing Knowledge.

Assumptions

A search procedure is pointwise consistent if, as the sample size increases without bound, the output of the algorithm converges with probability 1 to true information about the data generating structure. For all of the Tetrad search algorithms, available proofs of pointwise consistency assume at least the following:

1. The sample is i.i.d.--the probability of any unit in a population being sampled is the same as any other, and the joint probability distribution of the variables is the same for all units.

2. The joint probability distribution is locally Markov. In acyclic cases, this is equivalent to a simpler "global" Markov condition: that a variable is independent of all variables that are not its effects conditional on the direct causes of the variable in the causal graph (its "parents"). In cyclic cases, the local Markov condition has a related but more technical definition. (See Spirtes, et al., 2000).

3. All of the independence and conditional independence relations in the joint probability distribution are consequences of the local Markov condition for the true causal graph.

In addition, various specific search algorithms impose other assumptions. Of course, the search algorithms may give correct information when these assumptions do not strictly hold, and in some cases will do so when they are grossly violated--the PC algorithm, for example, will sometimes correctly identify the presence of unrecorded common causes of recorded variables.

Types of Searches:

PC CPC PCD FCI CFCI CCD GES MBF MIMBuild BPC Purity

Search Algorithms: PC

The PC algorithm is designed to search for causal explanations of observational or mixed observational and experimental data in which it may be assumed that the true causal hypothesis is acyclic and there is no hidden common cause between any two variables in the dataset. (It is also assumed that no relationship between variables in the data is deterministic--see PCD.)

The algorithm operates by asking a conditional independence oracle to make judgements about the independence of pairs of variables (e.g., X, Z) conditional on sets of variables (e.g., {Y}). Conditional independence tests are available for datasets that consist either entirely of continuous variables or entirely of discrete variables; hence, datasets of these types can be used as input to the algorithm. As a way of understanding how the algorithm should behave in the ideal case, when independence tests always give correct answers, one may also use a DAG as an input to the algorithm, in which case graphical d-separation will be substituted for an actual independence test.

In the case where a continuous dataset is used as input, the available conditional independence tests assume that the direct causal influence of any variable on any other is linear and that the distribution of each variable is Normal.
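As an illustration of what a conditional independence test under linearity and Normality can look like, the sketch below uses the Fisher z transform of a partial correlation. This is a common textbook choice and an assumption here: the manual does not say which test Tetrad uses, and the function names and the simulated chain X --> Y --> Z are hypothetical.

```python
import math
import numpy as np

def partial_correlation(data, i, j, cond):
    """Partial correlation of columns i and j given the columns in cond."""
    idx = [i, j] + list(cond)
    corr = np.corrcoef(data[:, idx], rowvar=False)
    prec = np.linalg.inv(corr)  # precision matrix of the selected columns
    return -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])

def fisher_z_independent(data, i, j, cond=(), alpha=0.05):
    """True if independence of i and j given cond is NOT rejected at level alpha."""
    n = data.shape[0]
    r = partial_correlation(data, i, j, cond)
    r = max(min(r, 0.999999), -0.999999)        # keep the log finite
    z = 0.5 * math.log((1 + r) / (1 - r))       # Fisher z transform
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    p_value = math.erfc(stat / math.sqrt(2))    # two-sided Normal p-value
    return p_value > alpha

# Simulated linear Gaussian chain X --> Y --> Z (columns 0, 1, 2).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)
data = np.column_stack([x, y, z])

print("X _||_ Z      :", fisher_z_independent(data, 0, 2))
print("X _||_ Z | {Y}:", fisher_z_independent(data, 0, 2, [1]))
```

The alpha argument plays the role of the significance level set in the search dialog: a smaller alpha makes the test slower to reject independence, so more edges are removed.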

Some of the above assumptions are not testable using observational data. They should come from prior knowledge or partial experiments.

Pseudocode for the version of PC implemented in Tetrad IV is given below. As shown in the pseudocode, the algorithm can be broken into two phases: an adjacency phase and an orientation phase. In the adjacency phase, a complete undirected graph over the variables is initially constructed and then edges X---Y are removed if some set S among either the adjacents of X or the adjacents of Y can be found (of a certain size, or "depth") such that I(X, Y | S). Once the adjacency structure over V has been estimated by this procedure, an orientation phase is begun. The first step of the orientation phase is to examine unshielded triples and consider whether to orient them as colliders. An unshielded triple is a triple <X, Y, Z> where X is adjacent to Y, Y is adjacent to Z, but X is not adjacent to Z. Since X is not adjacent to Z, the edge X---Z must have been removed during the adjacency search by conditioning on some set Sxz; <X, Y, Z> is oriented as a collider X-->Y<--Z just in case Y is not in this Sxz. Once all unshielded triples that can be oriented as colliders by this rule have been so oriented, a series of orientation rules is applied (in this case, the complete orientation rule set from Meek 1995) to orient any edges whose orientations are implied by previous orientations. The log of particular decisions the algorithm makes, as described above, when searching on an actual dataset is available through the Logging menu in the interface.
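The two phases just described can be sketched in a few lines of Python. This is a simplified illustration, not Tetrad's implementation: it tests conditioning sets of exactly the current depth drawn from the adjacents of X (the reverse pair covers the adjacents of Y), it uses a d-separation oracle in place of a statistical test, and the five-variable DAG is a hypothetical example. All function names are assumptions.

```python
from itertools import combinations

def d_separated(parents, x, y, cond):
    """Oracle: True if x and y are d-separated given cond in the DAG `parents`."""
    # 1. Restrict to x, y, cond and their ancestors.
    keep, stack = set(), [x, y, *cond]
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents[v])
    # 2. Moralize: link each node to its parents and link co-parents.
    nbrs = {v: set() for v in keep}
    for v in keep:
        for p in parents[v]:
            nbrs[v].add(p); nbrs[p].add(v)
        for p, q in combinations(parents[v], 2):
            nbrs[p].add(q); nbrs[q].add(p)
    # 3. d-separated iff cond blocks every path from x to y in the moral graph.
    seen, stack = {x}, [x]
    while stack:
        v = stack.pop()
        for w in nbrs[v]:
            if w == y:
                return False
            if w not in seen and w not in cond:
                seen.add(w); stack.append(w)
    return True

def adjacency_search(nodes, indep):
    """Adjacency phase: remove x---y once some separating set S is found."""
    adj = {v: set(nodes) - {v} for v in nodes}
    sepset, depth = {}, 0
    while any(len(adj[x]) - 1 >= depth for x in nodes):
        for x in nodes:
            for y in list(adj[x]):
                for s in combinations(sorted(adj[x] - {y}), depth):
                    if indep(x, y, set(s)):
                        adj[x].discard(y); adj[y].discard(x)
                        sepset[frozenset((x, y))] = set(s)
                        break
        depth += 1
    return adj, sepset

def orient_colliders(adj, sepset):
    """Orient x-->y<--z for unshielded triples where y is not in Sxz."""
    arrows = set()  # (a, b) means a --> b
    for y in adj:
        for x, z in combinations(sorted(adj[y]), 2):
            if z not in adj[x] and y not in sepset[frozenset((x, z))]:
                arrows.update({(x, y), (z, y)})
    return arrows

# Hypothetical true DAG, written as node -> list of parents.
parents = {"X1": [], "X2": ["X1"], "X3": ["X1"], "X4": ["X2", "X3"], "X5": ["X4"]}
oracle = lambda x, y, s: d_separated(parents, x, y, s)
adj, sepset = adjacency_search(list(parents), oracle)
arrows = orient_colliders(adj, sepset)
print(sorted(arrows))  # [('X2', 'X4'), ('X3', 'X4')]
```

With this oracle the only collider found is X2-->X4<--X3; the remaining edges stay undirected until the orientation rules are applied.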

Entering PC parameters

Consider the following "true" causal hypothesis (a DAG):

When the PC algorithm is chosen from the Search dropdown, a window appears in which one may enter an alpha value and edit knowledge. The alpha value is the significance level of the statistical test used as a conditional independence oracle for the algorithm. The default value is 0.05, although it is useful to experiment with different alpha levels to test the sensitivity of the analysis to this parameter. (Typical values for experimenting are 0.01, 0.05, and 0.10.)

PC is sensitive to background knowledge--that is, sensitive to specifications that certain edges are either required in the model or forbidden to be in the model. To edit this information, click the edit button for background knowledge and enter the information in that interface.

When parameters are set to their desired values, click "Execute" to run the algorithm. The output will be a pattern like the following:

Interpreting the output

There are basically two types of edges that can appear in PC output:

a directed edge:

In this case, the PC algorithm deduced that A is a direct cause of B, i.e., the causal effect goes from A to B and is not mediated by any of the other observed variables.

an undirected edge:

In this case, the PC algorithm cannot tell if A causes B or if B causes A.

The absence of an edge between any pair of nodes means they are independent, or that the causal effect of one node on the other is mediated by other observed variables.

Sometimes a double directed edge appears in a PC search output. Such edges are the result of a partial failure of the PC search. They may appear due to failure of assumptions (e.g., relationships are non-linear, the population graph is cyclic, etc.) or because the sample is not large enough and some statistical decisions are inconsistent. In such a situation, the user may introduce prior knowledge to constrain the direction such an edge may assume, collect more data, or use a different algorithm. Knowledge of the domain will be essential.

Finally, a triplet of nodes may assume the following pattern:

In other words, in such patterns, A and B are connected by an undirected edge, A and C are connected by an undirected edge, and B and C are not connected by an edge. By the PC search assumptions, this means that B and C cannot both be causes of A. The three possible scenarios are:

A is a common cause of B and C

B is a direct cause of A, and A is a direct cause of C

C is a direct cause of A, and A is a direct cause of B

In our example, some edges were compelled to be directed: X2 and X3 are causes of X4, and X4 is a cause of X5. However, we cannot tell much about the triplet (X1, X2, X3), except that X2 and X3 cannot both be causes of X1.
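The three scenarios for an undirected triple can be checked by brute force: enumerate every way of directing the two undirected edges and discard the one that would make A a collider. The sketch below is only an illustration of that reasoning; the variable names are hypothetical.

```python
from itertools import product

# Undirected edges of the pattern B --- A --- C (B and C are not adjacent).
edges = [("A", "B"), ("A", "C")]

members = []
for flips in product([False, True], repeat=len(edges)):
    # Each (u, v) pair in `dag` means u --> v.
    dag = [(v, u) if flip else (u, v) for (u, v), flip in zip(edges, flips)]
    parents_of_a = [u for u, v in dag if v == "A"]
    # B --> A <-- C would be a collider, which the pattern rules out.
    if len(parents_of_a) < 2:
        members.append(sorted(dag))

for dag in members:
    print(", ".join(f"{u}-->{v}" for u, v in dag))
```

Three orientations survive, matching the three scenarios listed above.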

Pseudocode for PC

The following is pseudocode representing the way PC is implemented in Tetrad.

Step A:

Form the complete undirected graph G over v1,...,vn.

Step B (Fast Adjacency Search):

For each depth d = 0, 1, ...:
    For each variable x:
        "next_y": For each node y adjacent to x:
            Let adjX = adj(x) - {y}
            Let adjY = adj(y) - {x}

            For each subset Sx of adjX up to size d:
                If x _||_ y | Sx, remove x---y from G. Continue "next_y."

            For each subset Sy of adjY up to size d:
                If x _||_ y | Sy, remove x---y from G. Continue "next_y."

Step C:

Orient colliders in G, as follows:

For each node x:
    For each pair of nodes y, z adjacent to x:
        If y and z are not adjacent:
            If ~(y _||_ z | x):
                Orient y-->x<--z as a collider.

Step D:

Apply orientation rules until no more orientations are possible.
Rules to use: away from collider, away from cycle, kite1, kite2.
(These are Meek’s rules R1, R2, R3, and R4.)

Away from collider:

For each node a:
    For each b, c in adj(a):
        If b-->a---c: Orient b-->a-->c.
        Else if c-->a---b: Orient c-->a-->b.

Away from cycle:

For each node a:
    For b, c in adj(a):
        If a-->b-->c and a---c: Orient a-->c.
        Else if c-->b-->a and c---a: Orient c-->a.

Kite 1:

For each node a:
    For each b, c, d in adj(a) such that a---b, a---c, a---d, and !(c---d):
        If c-->b and d-->b: Orient a-->b.

Kite 2:

For each node a:
    For each b, c, d in adj(a) such that a---b, a---d, b is not adjacent to d, and either a---c, a-->c, or c-->a:
        If b-->c and c-->d: Orient a-->d.
        Else if d-->c and c-->b: Orient a-->b.
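To show how Step D propagates orientations, here is a minimal Python sketch of the first two rules only (away from collider and away from cycle), applied until nothing changes. It is an illustration under assumptions: the graph representation and function name are hypothetical, the away-from-collider rule includes the non-adjacency condition from the usual statement of Meek's R1, and the input is the hypothetical five-variable example after collider orientation.

```python
def apply_meek_r1_r2(directed, undirected):
    """directed: set of (a, b) meaning a --> b; undirected: set of frozenset({a, b})."""
    directed, undirected = set(directed), set(undirected)

    def adjacent(a, b):
        return ((a, b) in directed or (b, a) in directed
                or frozenset((a, b)) in undirected)

    changed = True
    while changed:
        changed = False
        for edge in list(undirected):
            u, v = tuple(edge)
            for a, c in ((u, v), (v, u)):  # try to orient a --> c
                # Away from collider: some b --> a with b not adjacent to c.
                r1 = any(t == a and not adjacent(s, c) for s, t in directed)
                # Away from cycle: some b with a --> b --> c.
                r2 = any(s == a and (t, c) in directed for s, t in directed)
                if r1 or r2:
                    undirected.discard(edge)
                    directed.add((a, c))
                    changed = True
                    break
    return directed, undirected

# State after collider orientation in the hypothetical example.
directed = {("X2", "X4"), ("X3", "X4")}
undirected = {frozenset(("X1", "X2")), frozenset(("X1", "X3")), frozenset(("X4", "X5"))}
directed_out, undirected_out = apply_meek_r1_r2(directed, undirected)
print(sorted(directed_out))
print(sorted(sorted(e) for e in undirected_out))
```

Here only X4---X5 is oriented (as X4-->X5); the edges X1---X2 and X1---X3 remain undirected, in line with the interpretation of the example output given earlier.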

References:

Spirtes, Glymour, and Scheines (2000). Causation, Prediction, and Search.

Chris Meek (1995), "Causal inference and causal explanation with background knowledge."

Search Algorithms: FCI

The FCI algorithm is designed to search for causal explanations of observational or mixed observational and experimental data in which it may be assumed that the true causal graph is acyclic, but there may be unrecorded (hidden, latent) common causes of variables in the data set, or in which there may be sample selection bias. Sample selection bias occurs when the values of two or more recorded variables influence the probability that a unit is sampled. (It is also assumed that no relationship between variables in the data is deterministic--see PCD.)

The algorithm operates by asking a conditional independence oracle to make judgements about the independence of pairs of variables (e.g., X, Z) conditional on sets of variables (e.g., {Y}). Conditional independence tests are available for datasets that consist either entirely of continuous variables or entirely of discrete variables; hence, datasets of these types can be used as input to the algorithm. As a way of understanding how the algorithm should behave in the ideal case, when independence tests always give correct answers, one may also use a DAG as an input to the algorithm, in which case graphical d-separation will be substituted for an actual independence test.


Some of the above assumptions are not testable using observational data. They should come from prior knowledge or partial experiments.

FCI is operated by the user exactly as is PC. The differences are in the interpretation of the output. The output of FCI is a partial ancestral graph (PAG). It gives partial information about which variables are or are not direct or indirect causes and effects of other variables.

An edge between two variables in the output, however the ends of the edge are marked, indicates that there is a causal pathway--a direct cause in one direction or the other or a common cause--connecting the two variables, that does not contain any other observed variable. It does not necessarily mean that in the true causal graph, the connected variables have a direct causal connection. An edge of any kind between two measured variables implies that the variables are not independent conditional on any set of measured variables.

If there is an edge from X to Y that is unmarked at X--a tail of an arrow--then X is a cause of Y. X may not, however, be a direct cause of Y.

If there is an edge from X to Y that has an arrowhead directed into Y, then Y is not a cause--not an ancestor--of X.

If there is an edge with two arrowheads connecting X and Y, then there is an unrecorded common cause of X and Y.

If an edge end is marked with an "o", the algorithm cannot determine whether there should or should not be an arrowhead at that edge end.

Here is pseudocode for the implementation of the FCI algorithm used in Tetrad:

Given: Independence test I over variables v1,...,vn.

Step A:

Form new empty PAG G with variables from I. Fully connect G using unoriented (o-o) edges.

Step B:

Run a Fast Adjacency Search on G using I.

Step C:


For all nodes B:
    For each pair of nodes A, C adjacent to B:
        If A and C are not adjacent:
            If A and C are d-connected conditional on B:
                Orient A-->B<--C as a collider.

Step D:

Form a Sepset matrix using a possible d-sep search. Then reorient all edges as unoriented.

Step CI C:

Orient unshielded triples, as follows:

For all nodes B:
    For each pair of nodes A, C adjacent to B:
        If A and C are not adjacent:
            If A and C are d-connected conditional on B:
                Orient A-->B<--C as a collider.
            Else:
                Do nothing (effectively marking A---B---C as a noncollider)

Step CI D:

Apply orientation rules until no more orientations are possible.

Rules to use: double triangle, discriminating paths, away from collider, away from ancestor, away from cycle.

Definitions of Orientation Rules:

Double triangle rule:

If D*-oB, A*->B<-*C and A*-*D*-*C is a noncollider, then D*->B.

For all nodes B:
    possible A: nodes into B with arrow
    possible C: nodes into B with arrow
    possible D: nodes into B with circle

    For all possible D:
        For all possible A:
            For all possible C:
                If A != C and required conditions hold:
                    Orient D*->B.

Discriminating paths rule:

The triangles that must be oriented this way (won’t be done by another rule) all look like the ones below, where the dots are a collider path from L to A with each node on the path (except L) a parent of C.

         B
        xo           x is either an arrowhead or a circle
       /  \
      v    v
 L....A --> C

For all nodes B:
    possible A: nodes out from B with arrow and into B with arrow or circle.
    possible C: nodes out from B with arrow and into B with circle.

    For all possible A:
        For all possible C:
            If A is a parent of C:
                Find a collider path back from A.
                If path found and if path endpoint is d-sep from C conditional on B:
                    Set C<--B.
                Else:
                    Set A<->B and B<->C.

Away from collider rule:

If A*->Bo-oC and not A*-*C, then orient A*->B-->C. (Orient either circle if present, don’t need both.)

Away from ancestor rule:

If A*-oC and either A-->B*->C or A*->B-->C, then orient A*->C.

Away from cycle rule:

If Ao->C and A-->B-->C, then orient A-->C.

Note: Zhang (2006) supplies an orientation rule set for FCI that is both arrow-complete and tail-complete; this is not currently implemented.
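The reading rules for FCI output given earlier in this section can be collected into a small lookup. The sketch below is only a restatement of those rules in Python; the function name and the mark labels "tail", "arrow" and "circle" are assumptions made for the illustration, not Tetrad identifiers.

```python
def interpret_pag_edge(x, mark_at_x, mark_at_y, y):
    """Return the claims a PAG edge makes, given the mark at each end.

    Marks are "tail" (unmarked end), "arrow" (arrowhead) or "circle" ("o").
    """
    claims = [f"{x} and {y} are not independent conditional on any set "
              f"of measured variables"]
    if mark_at_x == "tail" and mark_at_y == "arrow":
        claims.append(f"{x} is a cause of {y} (possibly indirect)")
    if mark_at_y == "arrow":
        claims.append(f"{y} is not a cause of {x}")
    if mark_at_x == "arrow":
        claims.append(f"{x} is not a cause of {y}")
    if mark_at_x == "arrow" and mark_at_y == "arrow":
        claims.append(f"{x} and {y} have an unrecorded common cause")
    for node, mark in ((x, mark_at_x), (y, mark_at_y)):
        if mark == "circle":
            claims.append(f"the mark at {node} is undetermined")
    return claims

for line in interpret_pag_edge("X", "circle", "arrow", "Y"):  # X o-> Y
    print(line)
```

For an edge X o-> Y, the lookup reports that Y is not a cause of X and that the mark at X is undetermined.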

Search Algorithms: CCD

The CCD algorithm (due to Thomas Richardson) is operated in the same way as the PC algorithm. The output is interpreted subject to the same restrictions as the PC algorithm, with the exception that CCD will return PAGs that can only be consistently interpreted as cyclic graphs.

The algorithm is pointwise consistent for linear systems with Normal distributions, no latent variables and no correlated errors. It is similarly consistent for categorical variables with a Multinomial distribution if the true causal graph and the distribution satisfy the local Markov condition--which is quite restrictive for categorical variables in cyclic systems.

Search Algorithms: GES

GES (Greedy Equivalence Search) is a Bayesian algorithm that searches over Markov equivalence classes, represented by patterns, for a data set D over a set of variables V.

A pattern is an acyclic graph whose edges are either directed (-->) or undirected (---) and that represents an equivalence class of DAGs, as follows: each directed edge in the pattern is so directed in every DAG in the equivalence class, and for each undirected edge X---Y in the pattern, a DAG exists in the equivalence class with that edge directed as X<--Y and a DAG exists in the equivalence class with that edge directed as X-->Y. To put it differently, a pattern represents the set of edges that can be determined by the search, with as many of these edges oriented as can be, using the available information.

It is assumed (as with PC) that the true causal graph is acyclic and that no hidden common causes exist between pairs of variables in the graph. GES can be run on datasets that are either entirely continuous or entirely discrete (but not directly on graphs using d-separation). In the continuous case, it is assumed that the direct causal influence of any variable on any other is linear, with the distribution of each variable being Normal. Under these assumptions, the algorithm is pointwise consistent.

GES searches over patterns by scoring the patterns themselves. There is a forward sweep in the algorithm and a backward sweep. In the forward sweep, at each step, GES tries to find the edge which, once added, increases the score the most over not adding any edge at all. (After adding each such edge, the pattern is rebuilt by orienting any edge as --- that does not participate in a collider and then applying Meek’s PC orientation rules to add any implied orientations.) Once the algorithm gets to the point where there is no edge that can be added that would increase the score, the backward sweep begins. In the backward sweep, GES tries at each step to find the one edge whose removal will increase the score of the resulting pattern the most over the previous pattern. Once it gets to the point where there is no edge whose removal increases the score, the algorithm stops.
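The idea of scoring candidate edges in the forward sweep can be illustrated with a linear-Gaussian BIC score, a standard large-sample approximation to a posterior score. The sketch below is an assumption-laden simplification: it scores DAGs rather than patterns, evaluates only single-edge additions to the empty graph, and uses hypothetical function names and simulated data. It is not Tetrad's GES implementation.

```python
import numpy as np

def bic_score(data, parents):
    """Sum of per-node linear-Gaussian BIC terms (higher is better)."""
    n = data.shape[0]
    total = 0.0
    for child, pa in parents.items():
        y = data[:, child]
        X = np.column_stack([np.ones(n)] + [data[:, p] for p in pa])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid_var = np.mean((y - X @ beta) ** 2)
        # Log-likelihood (up to a constant) minus the BIC complexity penalty.
        total += -0.5 * n * np.log(resid_var) - 0.5 * X.shape[1] * np.log(n)
    return total

def gain_of_adding(data, parents, a, b):
    """Score change from adding the edge a --> b to the graph `parents`."""
    trial = {v: list(ps) for v, ps in parents.items()}
    trial[b].append(a)
    return bic_score(data, trial) - bic_score(data, parents)

# Simulated data: X0 --> X1, with X2 independent of both.
rng = np.random.default_rng(0)
n = 2000
x0 = rng.normal(size=n)
x1 = 0.8 * x0 + rng.normal(size=n)
x2 = rng.normal(size=n)
data = np.column_stack([x0, x1, x2])

empty = {0: [], 1: [], 2: []}
gains = {(a, b): gain_of_adding(data, empty, a, b)
         for a in empty for b in empty if a != b}
best = max(gains, key=gains.get)
print("best first edge:", best)
print("gain 0-->1:", gains[(0, 1)], " gain 1-->0:", gains[(1, 0)])
```

The gains for X0-->X1 and X1-->X0 coincide up to rounding, because the two DAGs are Markov equivalent; this is one way to see why the search is carried out over patterns rather than individual DAGs.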

There are some differences in assumptions and expected behavior between this algorithm and the PC algorithm. When, contrary to assumptions, there is actually a latent common cause of two measured variables, the PC algorithm will sometimes discover that fact; GES will not.

Information about how precisely GES makes decisions about adding or removing edges can be found in the logs, which can be accessed using the Logging menu.

Entering GES parameters

Consider the following example:

When the GES algorithm is chosen from the Search Object combo box, the following window appears:

The parameters that are used by the GES algorithm can be specified in this window. The parameters are as follows:

view background knowledge: this button gives access to a background knowledge editor that is analogous to the one used in most search algorithms.

Execute the search.


The GES algorithm returns a partially oriented graph where the nodes represent the variables given as input. In our example, the outcome should be as follows if the sample is representative of the population:

There are basically two types of edges that can appear in GES output:

a directed edge:

In this case, the GES algorithm deduced that A is a direct cause of B, i.e., the causal effect goes from A to B and is not mediated by any of the other observed variables.

an undirected edge:

In this case, the GES algorithm cannot tell if A causes B or if B causes A.

The absence of an edge between any pair of nodes means they are independent, or that the causal effect of one node on the other is mediated by other observed variables. Unlike the PC algorithm, no accidental double-directed edges can appear. This does not mean that GES is immune to the sample variation that caused the unexpected behavior of the PC search. It is a good idea to run both searches and compare the results.

Finally, a triplet of nodes may assume the following pattern:

In other words, in such patterns, A and B are connected by an undirected edge, A and C are connected by an undirected edge, and B and C are not connected by an edge. By the PC search assumptions, this means that B and C cannot both be causes of A. The three possible scenarios are:

A is a common cause of B and C

B is a direct cause of A, and A is a direct cause of C

C is a direct cause of A, and A is a direct cause of B

In our example, some edges were compelled to be directed: X2 and X3 are causes of X4, and X4 is a cause of X5. However, we cannot tell much about the triplet (X1, X2, X3), except that X2 and X3 cannot both be causes of X1.

References:

Chickering (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research.

Search Algorithms: MBF

The MBF search (Markov blanket fan search) is designed to search for Markov blanket DAGs of target variables in datasets, under the assumptions of the PC algorithm--i.e., that the true causal graph over the variables in the dataset does not contain any cycles, that there are no hidden common causes between variables in the dataset, and that no relationship between variables in the dataset is deterministic. The Markov blanket of a variable t is the smallest set S of variables out of a set of variables V such that t _||_ y | S for every y in V \ (S U {t}). The Markov blanket DAG of t is the restriction of the entire causal graph over V to the variables in {t} U S.
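When the true DAG is known, the Markov blanket of a target can be read off directly as its parents, its children, and the other parents of its children. The sketch below illustrates that reading on a hypothetical five-variable DAG; the function names are assumptions, and this is not the MBF search itself, which must recover the blanket from independence tests.

```python
def markov_blanket(parents, t):
    """Parents, children, and parents of children of t in a DAG.

    `parents` maps each node to the list of its parents.
    """
    children = {v for v, ps in parents.items() if t in ps}
    spouses = {p for c in children for p in parents[c]} - {t}
    return set(parents[t]) | children | spouses

def markov_blanket_dag(parents, t):
    """Restriction of the DAG to t together with its Markov blanket."""
    keep = markov_blanket(parents, t) | {t}
    return {v: [p for p in parents[v] if p in keep] for v in keep}

# Hypothetical true DAG, written as node -> list of parents.
parents = {"X1": [], "X2": ["X1"], "X3": ["X1"], "X4": ["X2", "X3"], "X5": ["X4"]}

print(sorted(markov_blanket(parents, "X2")))  # ['X1', 'X3', 'X4']
print(markov_blanket_dag(parents, "X2"))
```

For target X2, the blanket contains the parent X1, the child X4, and X3 as the other parent of X4; X5 is excluded.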

MBF operates by asking a conditional independence oracle to make judgements about the independence of pairs of variables (e.g., X, Z) conditional on sets of variables (e.g., {Y}). Conditional independence tests are available for datasets that consist either entirely of continuous variables or entirely of discrete variables; hence, datasets of these types can be used as input to the algorithm. (As a way of understanding how the algorithm should behave in the ideal case, when independence tests always give correct answers, one may also use a DAG as an input to the algorithm, in which case graphical d-separation will be substituted for an actual independence test.)

Some of the above assumptions are not testable using observational data. They should come from prior knowledge or partial experiments.

Like PC, MBF returns a pattern, in this case containing:

(a) The target T, the true parents and children of T, and the true parents of the children of T.

(b) All edges among T, true parents of T, true children of T, and true parents of children of T. Some of these edges may not be oriented as -->.

(c) Possibly some extra nodes and edges to account for the possibility that if some edges T---v were actually oriented as T-->v, these nodes and adjacencies would be required in the MB DAG of T.

(d) No nodes or adjacencies or --> edges that do not belong in some MB DAG consistent with independence facts supplied by (2).

There may also be some bidirected <-> edges in G_out if the independence information from (2), above, is inconsistent. These <-> edges may either be left in the final graph or oriented as if they were directed edges.

Search Algorithms: MIMBuild

Introduction

MIM Build stands for Multiple Indicator Model Build. It is one of the three algorithms in Tetrad designed to build pure measurement/structural models (the others are the Build Pure Clusters algorithm and the Purify algorithm).

MIM Build should be used to learn causal relationships among latent variables when the measurement model is given in advance but the structural model is unknown.

The MIM Build algorithm also assumes that the underlying (unknown) data generating process is a linear graph. If the user strongly suspects that the latents or indicators may be non-linearly related, MIM Build should not be used. We are also assuming that latents here do not have other hidden common causes.

All observed variables are assumed to be continuous, and therefore the current implementation of the algorithm accepts only continuous data sets as input. For general information about model building algorithms, consult the Search Algorithms page.

Entering MIM Build parameters

Create a new Search object as described in the Search Algorithms page, but in order to follow this tutorial, use the following graph to generate a simulated continuous data set:

When the MIM Build algorithm is chosen from the Search box, a window appears for specifying search parameters.

The parameters that are used by MIM Build can be specified in this window. The parameters are as follows:

alpha value: if you choose the PC search in the combo box "Choice of algorithm", MIM Build uses statistical hypothesis tests in order to generate models automatically. The alpha value parameter represents the level by which such tests are used to accept or reject constraints that compose the final output. The default value is 0.05, but the user may want to experiment with different alpha values in order to test the sensitivity of her data within this algorithm.

number of clusters: MIM Build needs a pure measurement model specified in advance. The measurement model is defined by a set of clusters of variables, where each cluster represents a set of pure indicators of a single latent. In this box, the user specifies how many latents there are in the measurement model based on prior knowledge. In our example, let’s use three clusters.

edit cluster assignments: once the number of latents is specified, the user should determine which variables in the data set should be clustered together. When this button is clicked, the following dialog box appears:

In this example, we want to enter the measurement model that we know is the correct one by assumption. In other words, variables X1, X2 and X3 should be clustered together, since they are pure indicators of the same latent. Variables X4, X5 and X6 form another cluster, and the same holds for X7, X8 and X9. In order to perform cluster assignment, click the respective combo box and choose the cluster that shows up in the list. For example, click the X4 combo box and choose Cluster 1. Do the same for X5 and X6. For variables X7, X8 and X9, choose Cluster 2. The final outcome should be as follows:

algorithm: MIM Build is actually a family of algorithms for the problem of learning structural models. Currently, we offer two alternatives, both corresponding to the case where we have no latent variables: the GES and PC search algorithms. The PC version can be slower and less robust than GES, but might be useful to indicate if the assumption of no extra hidden common causes among the latents holds (the appearance of double directed edges is an indication of that possibility).

view background knowledge: this button gives access to a background knowledge editor that is analogous to the one used in most search algorithms, but with one difference: instead of entering background knowledge about observed variables (in the MIM Build case, all background knowledge about observed variables boils down to the specification of a measurement model), the user here enters prior knowledge about causal relations of latent variables. Latents are denoted by the label _Lx, where x is the number of the respective cluster. In our example, the latent parent of X7, X8 and X9 is referred to as _L2. Note: use of background knowledge is not implemented for GES yet.
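The measurement model entered in this dialog can be pictured as a mapping from cluster numbers to indicators, with one latent per cluster. The sketch below is only an illustration of that structure: the dictionary layout and helper name are hypothetical, and numbering the first cluster 0 is an inference from the example (X4 to X6 in Cluster 1, X7 to X9 in Cluster 2 with latent _L2).

```python
# Cluster assignments from the example; cluster 0 for X1..X3 is an assumption.
clusters = {
    0: ["X1", "X2", "X3"],
    1: ["X4", "X5", "X6"],
    2: ["X7", "X8", "X9"],
}

def latent_label(cluster_number):
    """Label of the latent for a cluster, following the _Lx convention."""
    return f"_L{cluster_number}"

# Edges of the measurement model: each latent points to its indicators.
measurement_edges = [(latent_label(c), x) for c, xs in clusters.items() for x in xs]

# In a pure measurement model every indicator belongs to exactly one cluster.
all_indicators = [x for xs in clusters.values() for x in xs]
is_disjoint = len(all_indicators) == len(set(all_indicators))

print(measurement_edges)
print("clusters are disjoint:", is_disjoint)
```

Background knowledge for MIM Build would then be stated over the latent labels (for instance between _L0 and _L2) rather than over the observed variables.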

Execute the search as explained in the Search Algorithms page.


MIM Build returns a pattern over latent variables that is completely analogous to the one produced by a PC Search, or GES Search. The same interpretation used in such algorithms can be applied to MIMBuild output.

Search Algorithms: BPC

Introduction

Build Pure Clusters (BPC) is one of the three algorithms in Tetrad designed to build pure measurement/structural models (the others are the MIM Build algorithm and the Purify algorithm).

The goal of Build Pure Clusters is to build a pure measurement model using observed variables from a data set. Observed variables are clustered into disjoint groups, each group representing indicators of a single hidden variable. Variables in one group are not indicators of the hidden variables associated with the other groups. Also, some variables given as input will not be used because they do not fit into a pure measurement model along with the chosen ones.

The Build Pure Clusters algorithm assumes that the population can be described as a measurement/structural model where observed variables are linear indicators of the unknown latents. Notice that linearity among latents is not necessary (although it will be necessary for the MIM Build algorithm) and latents do not need to be continuous. It is also assumed that the unknown population graph contains a pure subgraph where each latent has at least three indicators. This assumption is not testable and should be evaluated by the plausibility of the final model.

The current implementation of the algorithm accepts only continuous data sets as input. For general information about model building algorithms, consult the Search Algorithms page.

Entering Build Pure Clusters parameters

For example, consider a model with this true graph:

If data is generated using this model and a search is constructed from the data, selecting BPC, the following parameters will be requested:

alpha value: Build Pure Clusters uses statistical hypothesis tests in order to generate models automatically. The alpha value parameter represents the level by which such tests are used to accept or reject constraints that compose the final output. The default value is 0.05, but the user may want to experiment with different alpha values in order to test the sensitivity of her data within this algorithm.

number of iterations: Build Pure Clusters uses a randomized procedure in order to generate a model, since in general there are different pure measurement submodels of a given general measurement/structural model. This option allows the user to specify a given number of runs of the algorithm, where the outputs given for each run are combined together into a single model. This usually provides models that are more robust against statistical fluctuations and slight deviations from the assumptions.

statistical test: as stated before, automated model building is done by testing statistical hypotheses. Build Pure Clusters provides two basic statistical tests that can be used. Wishart’s Tetrad test assumes that the given variables follow a multivariate normal distribution. Bollen’s Tetrad test does not make this assumption. However, it needs to compute a matrix of fourth moments, which can be time consuming. It is also less robust against sampling variability when compared to Wishart’s test if the data actually follows a multivariate normal distribution.
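The quantity these tests examine is the tetrad difference: for four pure linear indicators of a single latent, products of covariances such as cov(X1,X2)cov(X3,X4) and cov(X1,X3)cov(X2,X4) are equal in the population. The manual does not spell this out, so the sketch below is offered as background under stated assumptions: it works with population covariance matrices built from hypothetical loadings and error variances, and it does not implement Wishart's or Bollen's test.

```python
import numpy as np

def tetrad_differences(S, i, j, k, l):
    """The three tetrad differences among variables i, j, k, l of covariance S."""
    return (S[i, j] * S[k, l] - S[i, k] * S[j, l],
            S[i, j] * S[k, l] - S[i, l] * S[j, k],
            S[i, k] * S[j, l] - S[i, l] * S[j, k])

# Hypothetical one-latent model: X_m = loading_m * L + error_m, with var(L) = 1.
loadings = np.array([0.9, 0.7, 1.2, 0.5])
error_var = np.array([1.0, 0.5, 0.8, 1.5])
pure_cov = np.outer(loadings, loadings) + np.diag(error_var)

# Impure variant: the errors of X0 and X1 are correlated (extra covariance 0.3).
impure_cov = pure_cov.copy()
impure_cov[0, 1] += 0.3
impure_cov[1, 0] += 0.3

print("pure:  ", tetrad_differences(pure_cov, 0, 1, 2, 3))
print("impure:", tetrad_differences(impure_cov, 0, 1, 2, 3))
```

In the pure model all three differences vanish up to rounding; in the impure variant the differences that involve cov(X0, X1) do not, which is the kind of departure a tetrad test is meant to detect in sample data.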

Interpreting the Output.

Upon executing the search, BPC returns a pure measurement model. Because of the internal randomization, outputs may vary from run to run, but one should not expect large differences (and this can actually be used to evaluate whether the assumptions are reasonable for a given set of input variables). In our example, the outcome should be as follows if the sample is representative of the population:

Edges with circles at the endpoints are added only to distinguish latent variables from the indicators. BPC does not make any claims about the causal relationships among latent variables (this is the role of the MIM Build algorithm). The labels given to the latent variables are arbitrary. As part of the analysis, a domain expert should evaluate whether such latents indeed have a physical or abstract meaning, or whether they should be discarded as meaningless. Such reification is domain dependent.

Note: If the output is not arranged helpfully, use the Fruchterman-Reingold layout in the Layout menu to arrange it more readably.

Search Algorithms: Purify

Introduction
Entering Purify parameters


Introduction

Purify is one of the three algorithms in Tetrad designed to build pure measurement/structural models (the others are the MIM Build algorithm and the Build Pure Clusters algorithm).

Purify should be used to select indicators of a given measurement model such that the selected indicators form a pure measurement model. In other words, the user specifies a set of clusters of indicators, where each cluster contains indicators of an assumed latent variable. The task of Purify is to discard any indicator that is impure, i.e., that may have other common causes with other indicators, or that is a direct cause of other indicators.

The Purify algorithm assumes that the population can be described as a measurement/structural model where observed variables are linear indicators of the unknown latents, and that the given measurement model is correct, but perhaps impure. Notice that linearity among latents is not necessary (although it will be necessary for the MIM Build algorithm) and latents do not need to be continuous.

All variables are assumed to be continuous, and therefore the current implementation of the algorithm accepts only continuous data sets as input. For general information about model building algorithms, consult the Search Algorithms page.

Entering Purify parameters

Create a new Search object as described in the Search Algorithms page, but in order to follow this tutorial, use the following graph to generate a simulated continuous data set:

Notice that, in this example, X4, X5 and X7 are in impure relations. Notice also that X4 is no longer an impurity when X7 is removed, but X5 and X7 cannot be made pure, since they are indicators of two latents.

When the Purify algorithm is chosen from the Search Object combo box, the following window appears:

The parameters that are used by Purify can be specified in this window. The parameters are as follows:

alpha value: Purify uses statistical hypothesis tests in order to generate models automatically. The alpha value parameter is the significance level at which such tests accept or reject the constraints that compose the final output. The default value is 0.05, but the user may want to experiment with different alpha values in order to test the sensitivity of her data within this algorithm.

number of clusters: Purify needs a measurement model specified in advance. The measurement model is defined by a set of clusters of variables, where each cluster represents a set of pure indicators of a single latent. In this box, the user specifies how many latents there are in the measurement model based on prior knowledge. In our example, assuming we know the true measurement model, let’s use three clusters.

edit cluster assignments: this is identical to the cluster editor of the MIM Build algorithm. Consult its documentation for details. In our example, we should create the following clustering:

statistical test: as stated before, automated model building is done by testing statistical hypotheses. Purify provides two basic statistical tests that can be used. Wishart’s Tetrad test assumes that the given variables follow a multivariate normal distribution. Bollen’s Tetrad test does not make this assumption. However, it needs to compute a matrix of fourth moments, which can be time consuming. It is also less robust against sampling variability when compared to Wishart’s test if the data actually follows a multivariate normal distribution.

default mode: there are basically two different strategies used by Purify. In the Impure by default mode, the algorithm does not assume that the user believes the measurement model is pure, and therefore will try to find constraints that guarantee that an indicator is pure with respect to other indicators. If it fails to find a condition by which indicator A is pure with respect to indicator B, then A will be marked as impure with respect to B. In the Pure by default mode, the algorithm assumes that the given measurement model is pure. It will try to find constraints that guarantee that an indicator is impure with respect to other indicators. If it fails to find a condition by which indicator A is impure with respect to indicator B, then A will be marked as pure with respect to B.

Execute the search as explained in the Search Algorithms page.


Although a given measurement model may have many different pure submodels, the Purify algorithm has a deterministic output: it will basically throw away indicators that violate constraints, following an order determined by the number of constraints that are violated by each indicator. It returns a pure measurement model. In our example, the outcome should be as follows if the sample is representative of the population:

Edges with circles at the endpoints are added only to distinguish latent variables from the indicators. Purify does not make any claims about the causal relationships among latent variables (this is the role of the MIM Build algorithm). The labels given to the latent variables are arbitrary.

Sometimes some latents will not have any indicator. As an important side note, if some cluster has only two variables, Purify cannot find any condition by which the two indicators in this cluster can be considered pure. If the Impure by default method is chosen, such indicators will always be removed.

Editing Knowledge

Background knowledge (or "knowledge" for short) is a set of specifiable constraints used in a variety of searches that can be associated either with search objects or, for convenience's sake, with data objects. Search procedures can use background knowledge provided by the user to narrow down the search and return a more informative output. There are three main types of background knowledge that can be used by Tetrad:

forbidden edges: a given pair of variables that cannot be directly connected in some direction, or in either direction, independently of what the data say.

required edges: a given pair of variables that has to be connected in some direction or another, independently of what the data say.

temporal tiers: informs which variables precede others in a temporal order, as a way to decide the direction of a causal arrow where the algorithm is not able to infer it.

Tools for manipulating knowledge are located in the Knowledge menu of components that are associated with background knowledge. Consider a PC search over variables X1, X2, X3, X4, and X5. The search editor will be initially blank, like so, and will have a Knowledge menu with several tools in it:

If you select "Edit Knowledge," you will see an editor that looks like this:


There are three tabs--the "Tiers" tab (showing), the "Edges" tab, and the "Text" tab. The "Tiers" tab lets you specify temporal tiers; you simply drag and drop the variables into the tiers you want, increasing or decreasing the number of tiers as needed. If V1 is in tier m and V2 is in tier n, where m < n, then the edge V2-->V1 will be forbidden. Let’s say you drag X1 and X2 to Tier 1, drag X3 and X4 to Tier 2, and drag X5 to Tier 3:

To see the specific edges that are forbidden by this specification of tiers, click on "Edges":

The edges shown in gray are forbidden. You may add required edges to this view by clicking "Add Required," clicking on the "from" node for the edge you want to add, and dragging to the "to" node. Here is the same knowledge with two required edges added (shown in green):

Finally, you may view the edited knowledge in a format consistent with Tetrad 3 by clicking on Text:

If you click "Save," this knowledge will be saved and used in the next search.

If you select "Save Knowledge" from the Knowledge menu, you will be able to save knowledge out to a file in the form shown in the "Text" tab, above. If you select "Load Knowledge," you will be able to load knowledge from a file in the form shown in the "Text" tab, above.


The remaining items in the Knowledge menu are used to help move knowledge around from component to component. If you select "Copy Knowledge," the current knowledge will be copied to the system clipboard. If you select "Paste Knowledge," the knowledge stored on the clipboard will be copied into the current box.

How Knowledge is Used

Knowledge is used when searches are done to forbid or require edges. Forbidden edges are not permitted to appear in the final graph; required edges must appear in the final graph. How this is accomplished varies from algorithm to algorithm; to see how it’s done in a specific algorithm, see the manual page for that algorithm.

Temporal tiers provide a mechanism to forbid edges systematically in layered groups. Edges from any later tier to any earlier tier are forbidden. This provides a convenient way to give knowledge about temporal ordering to a search algorithm. The above knowledge would be sensible to provide if, for example, we knew that X1 and X2 preceded X3 and X4, and X3 and X4 preceded X5. By simply placing the variables in these tiers, all of the necessary forbidden edges are generated automatically. Variables in the box marked "Not in tier" do not carry any temporal constraint with respect to any other variable.
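The rule that tiers generate forbidden edges can be illustrated with a short Python sketch. This is not Tetrad's code; the function name forbidden_edges_from_tiers and the list-of-lists representation of tiers are assumptions made only for illustration.

```python
from itertools import product


def forbidden_edges_from_tiers(tiers):
    """Return the directed edges (source, target) ruled out by temporal tiers.

    `tiers` is a list of lists of variable names, earliest tier first.
    Every edge from a later tier into an earlier tier is forbidden.
    (Hypothetical helper, not part of Tetrad.)
    """
    forbidden = set()
    for early_index, early_tier in enumerate(tiers):
        for later_tier in tiers[early_index + 1:]:
            for source, target in product(later_tier, early_tier):
                forbidden.add((source, target))
    return forbidden


# The tiers from the example: X1, X2 first; X3, X4 next; X5 last.
tiers = [["X1", "X2"], ["X3", "X4"], ["X5"]]
edges = forbidden_edges_from_tiers(tiers)
for source, target in sorted(edges):
    print(f"{source} --> {target} is forbidden")
```

Edges within a tier, such as X1-->X2, are left unconstrained by this rule, as are variables that are not placed in any tier.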

Search Algorithms: CPC

CPC (Conservative PC) is a variant of the PC algorithm designed to improve arrowpoint orientation accuracy. The types of input data permitted and the assumptions for CPC are the same as for PC. That is, input data may be used that is either entirely discrete or entirely continuous. A DAG may be used as input, in which case graphical d-separation will be used in place of conditional independence for purposes of the search. It is assumed that the true causal DAG over which the search is being done does not contain cycles, that there are no hidden common causes between any pair of observed variables, and that with continuous variables the direct causal effect of any variable on any other is linear, with the distribution of each variable being Normal.

To know how to interpret the output of CPC, it helps to know how CPC differs from PC. In the collider orientation step, PC orients an unshielded triple <X, Y, Z> as a collider just in case Y is not in the set Sxz that screened off X from Z and hence led to the removal of X---Z during the adjacency phase of the search. In CPC, <X, Y, Z> is instead first labeled as either a collider, a noncollider, or an unfaithful triple, depending, respectively, on whether Y is in none of the sets S among adjacents of X or adjacents of Z such that X _||_ Z | S, or all of those sets, or some but not all of those sets. If <X, Y, Z> is labeled as a collider then it is oriented as X-->Y<--Z; if it is labeled as a noncollider or unfaithful, these facts are noted for later use. (It is intended for unfaithful triples to be underlined in some way, but this is not implemented in the interface currently. However, by inspecting the logs, the classification of unshielded triples into colliders, noncolliders, and unfaithful triples may be obtained.)

Once all colliders have been marked and all other unshielded triples sorted properly, the Meek orientation rules are then applied as usual, with the exception that where these orientation rules require noncolliders, the fact that a triple is a noncollider is checked against the previously compiled classification of unshielded triples.

Whereas the output graph of PC is a pattern (allowing for bidirected edges in some cases where independence information conflicts with the assumptions of PC), the output graph of CPC is an e-pattern (with the same allowance). The e-pattern represents an equivalence class of DAGs that have the same adjacencies as the e-pattern, with every edge A-->B in the e-pattern also in the DAG and every unshielded collider in the DAG either an unshielded collider in the e-pattern or marked as unfaithful in the e-pattern. (Currently, the interface in Tetrad does not show which triples are marked as unfaithful, but the logs, accessed through the Logging menu, provide this information.)
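The three-way labeling of an unshielded triple can be sketched in a few lines of Python. This is a minimal illustration, not Tetrad's implementation: the function name classify_triple is hypothetical, and the sketch assumes the separating sets for the two endpoints of the triple have already been found by the independence tests.

```python
def classify_triple(middle, separating_sets):
    """Label an unshielded triple by how often its middle node separates the ends.

    `separating_sets` holds every conditioning set, drawn from the adjacents
    of either endpoint, that was found to make the two endpoints independent.
    (Hypothetical helper, not part of Tetrad.)
    """
    sets = [set(s) for s in separating_sets]
    if not sets:
        # The endpoints of an unshielded triple are nonadjacent, so at least
        # one separating set is expected to exist.
        raise ValueError("expected at least one separating set")
    containing = sum(1 for s in sets if middle in s)
    if containing == 0:
        return "collider"      # middle node is in none of the sets
    if containing == len(sets):
        return "noncollider"   # middle node is in all of the sets
    return "unfaithful"        # middle node is in some but not all sets


print(classify_triple("Y", [set(), {"W"}]))        # collider
print(classify_triple("Y", [{"Y"}, {"Y", "W"}]))   # noncollider
print(classify_triple("Y", [{"Y"}, {"W"}]))        # unfaithful
```

Only triples labeled "collider" would then be oriented; the other two labels are kept for the later orientation rules.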

References:

Spirtes, Glymour, and Scheines (2000). Causation, Prediction, and Search.

Meek (1995). Causal inference and causal explanation with background knowledge.

Ramsey, Zhang, and Spirtes (2006). Adjacency-Faithfulness and Conservative Causal Inference. Uncertainty in Artificial Intelligence, forthcoming.

Search Algorithms: PCD

PCD is a modification of the PC algorithm that allows it to search over datasets in which some of the variables stand in deterministic relationships. A deterministic relationship between variables is one that is purely functional, with no random variation involved. For instance, if X = 2Y, then X and Y in the graph X<--Y-->Z stand in a deterministic relationship. The PC algorithm does not deal well with examples of this sort. For instance, the PC algorithm might do a statistical test to determine whether X _||_ Z | Y, in order to see whether the edge X---Z should be removed during the adjacency phase. But since X and Y stand in a deterministic relationship, it is always the case that X _||_ Z | Y, since this is informationally equivalent to asking whether X _||_ Z | X, which is always true. But establishing that X _||_ Z | X is never a good reason for removing the edge between X and Z; if such a reason were permitted, the adjacency phase of the PC search would always return an empty graph! In the face of deterministic relationships, the PC algorithm needs to be made aware of when effective conditioning on an endpoint of an edge happens, and adjustments need to be made to prevent edges from being eliminated for such reasons.

To correct the problem, PCD checks, before performing an independence check of the form X _||_ Z | S, whether S determines either X or Z, and if it does, refuses to do the independence check. This corrects the problem for the adjacency search. There is an additional problem in the step where colliders are oriented. In this case, PCD, when considering whether to orient an unshielded triple <X, Y, Z> as a collider, based on a consideration of the set Sxz that was used to remove the edge X---Z from the graph during the adjacency search, first asks whether Sxz determines X or Y. If it does not, then <X, Y, Z> is oriented as a collider X-->Y<--Z if Y is not contained in Sxz.

Otherwise, the assumptions and types of data permitted for PCD are identical to those of PC.
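The guard against conditioning sets that determine an endpoint of the edge under test can be sketched as follows. This manual does not say how Tetrad decides that one set of variables determines another, so the sketch makes a loud assumption: the data are categorical, and a conditioning set "determines" a variable when every observed configuration of the set co-occurs with exactly one value of that variable. The names determines and may_test_independence are hypothetical.

```python
def determines(rows, cond_vars, target):
    """True if each observed configuration of cond_vars fixes the target value.

    `rows` is a list of dicts mapping variable names to categorical values.
    (Assumed definition of determinism, for illustration only.)
    """
    seen = {}
    for row in rows:
        key = tuple(row[v] for v in cond_vars)
        if seen.setdefault(key, row[target]) != row[target]:
            return False
    return True


def may_test_independence(rows, first, second, cond_vars):
    """Refuse a test of first _||_ second | cond_vars if the set fixes an endpoint."""
    return not (determines(rows, cond_vars, first)
                or determines(rows, cond_vars, second))


# X = 2Y exactly, while Z varies freely.
rows = [
    {"X": 0, "Y": 0, "Z": 0},
    {"X": 0, "Y": 0, "Z": 1},
    {"X": 2, "Y": 1, "Z": 0},
    {"X": 2, "Y": 1, "Z": 1},
]
print(may_test_independence(rows, "X", "Z", ["Y"]))  # False: Y determines X
print(may_test_independence(rows, "X", "Z", []))     # True
```

With this guard in place, the degenerate test of X and Z given Y is skipped, so it cannot be used as a reason to remove the edge between X and Z.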

Search Algorithms: CFCI

CFCI is an experimental algorithm that modifies FCI in the same way that CPC modifies PC.

Inside the Regression Box

The regression box in the main workspace area looks like this:

Tetrad currently offers linear regression as an option for continuous data sets. Possibly other types of regression (e.g., logistic regression) will be added in the future.

Inside the Classify Box

A Classify box in the main workspace looks like this:

The operations in the Classify box permit you to use an instantiated model to estimate values for a variable from a data set--the variable estimated must be in the Instantiated Model, but it need not be in the data set. A Classify Box requires input from a Data box containing data and input from an Instantiated Model in an IM box. (Remember that you can copy an estimated model in an Estimate box into an IM box simply by putting a flowchart arrow from the Estimate box to a new IM box.) Here are some things to note:

The Instantiated Model can contain variables that are not in the Data.

The Data can contain variables that are not in the Instantiated Model.

The variable values must all be categorical--a data file with decimal numbers will not be accepted by Classify.

The IM must be a Bayes net--either Maximum Likelihood or Dirichlet.

If the target variable to be classified has multiple values, the Classifier will assign the target variable its most probable value for each case.

If the target variable to be classified has two values, you can specify the cut-off probability for classification.

The Classifier box will show the graph of the IM used for classification.

The original data can be viewed by clicking the "Test Data" tab. Tabs above the graph give choices for how the Classifier will work. You can choose:

The variable in the IM that is the target--to be classified (you can also choose this variable by clicking on it in the graph).

If the target is binary, with only two possible values, you can choose the cut-off value below which the variable will be classified one way, and above which, the other.

When you hit the "Classify" button at the top of the graph window inside the Classify box, the Classifier does its work and gives you some viewing choices.
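The two assignment rules in the notes above can be sketched in Python. This is an illustration under stated assumptions, not Tetrad's code: the function classify is hypothetical, the probabilities are taken as already computed from the Bayes net for one case, and which of the two values counts as "positive" and how a probability exactly at the cut-off is treated are choices made here only for the example.

```python
def classify(probabilities, cutoff=0.5, positive=None):
    """Pick a value for the target variable in one case.

    `probabilities` maps each target value to its probability for this case.
    For a two-valued target with a designated `positive` value, the value is
    chosen by comparing its probability with `cutoff`; otherwise the most
    probable value is returned. (Hypothetical helper, not part of Tetrad.)
    """
    if len(probabilities) == 2 and positive is not None:
        negative = next(v for v in probabilities if v != positive)
        # Assumption: a probability exactly at the cut-off counts as positive.
        return positive if probabilities[positive] >= cutoff else negative
    return max(probabilities, key=probabilities.get)


print(classify({"low": 0.2, "medium": 0.5, "high": 0.3}))
print(classify({"yes": 0.4, "no": 0.6}, cutoff=0.3, positive="yes"))
print(classify({"yes": 0.4, "no": 0.6}, cutoff=0.5, positive="yes"))
```

Lowering the cut-off makes the positive value easier to assign, which is what traces out the different points of an ROC curve.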

According to which tab you then click, you can see:

The original data.

The original data plus, in the first column, for each case the value of the target variable the classifier predicts.

If the target variable is included in the data, the tabs at the top of the Classifier window after the classification program has run will also give you:

A Receiver Operating Characteristic, or ROC curve for short. There is a distinct ROC curve for each value of the target variable, plotting the true positive rate against the false positive rate as the cutoff probability for positive classification varies. ROC curves are traditionally used only with binary variables but the program allows them for multiple valued variables.

The Area Under the ROC curve, or AUC.

A Confusion Matrix, showing, for each value of the target variable, the number of cases having that true value that were predicted (by the classifier) to have each of the possible values of the target variable.

You do not need to leave the Classify Box, or destroy its contents, to view the ROC curves or Confusion matrices for alternative values of the target variable. You do need to do so (or create a new Classify Box) if you want to classify a different target.
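For readers who want to see how these three summaries relate to the classifier's output, here is a minimal Python sketch. It is not how Tetrad computes them; the function names, the toy data, and the use of the trapezoid rule for the area are all assumptions made for illustration.

```python
from collections import Counter


def confusion_matrix(true_values, predicted_values):
    """Count the cases for each (true value, predicted value) pair."""
    return Counter(zip(true_values, predicted_values))


def roc_points(true_values, scores, positive):
    """Return (false positive rate, true positive rate) pairs, one per cutoff.

    `scores` are the probabilities assigned to the `positive` value.
    Assumes the data contain both positive and non-positive cases.
    """
    n_pos = sum(1 for t in true_values if t == positive)
    n_neg = len(true_values) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(true_values, scores)
                 if s >= cutoff and t == positive)
        fp = sum(1 for t, s in zip(true_values, scores)
                 if s >= cutoff and t != positive)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


truth = ["yes", "yes", "no", "yes", "no"]
scores = [0.9, 0.8, 0.7, 0.4, 0.2]          # probability of "yes" per case
predicted = ["yes" if s >= 0.5 else "no" for s in scores]

matrix = confusion_matrix(truth, predicted)
points = roc_points(truth, scores, positive="yes")
auc = area_under_curve(points)
print(matrix[("yes", "yes")], matrix[("yes", "no")])
print(round(auc, 3))
```

For a target with more than two values, the same functions can be applied once per value by treating that value as "positive" and all others as negative.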

Inside the Compare Box

A Compare box in the main workspace looks like this:

The use of the Compare Box is very simple and best understood with an example. Suppose we have generated data from an IM with a graph

and run a PC search with the result:

Then the Compare box returns the agreements and differences between the edges of the two graphs (not their orientations).
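A comparison of edges that ignores orientation can be sketched in Python as follows. The function names and the output labels are hypothetical, and the two small edge lists are made up for illustration; the Compare box's actual report format is not reproduced here.

```python
def adjacencies(edges):
    """Reduce directed or undirected edges to unordered pairs of nodes."""
    return {frozenset(edge) for edge in edges}


def compare_adjacencies(first_edges, second_edges):
    """Report which adjacencies the two graphs share and which they do not."""
    first, second = adjacencies(first_edges), adjacencies(second_edges)
    return {
        "in both": first & second,
        "only in first": first - second,
        "only in second": second - first,
    }


true_graph = [("X1", "X2"), ("X2", "X3"), ("X3", "X4")]
search_result = [("X2", "X1"), ("X2", "X3"), ("X1", "X4")]
report = compare_adjacencies(true_graph, search_result)
for label, pairs in report.items():
    print(label, sorted(sorted(pair) for pair in pairs))
```

Because each edge is stored as an unordered pair, X1-->X2 in one graph and X2-->X1 in the other count as the same adjacency.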

Common Tasks

There are several tasks in Tetrad that are common to one or more different dialogs or are tasks that need to be performed on sessions in general. We list these tasks here, with explanations. These explanations are linked to from other files in an attempt to say things once where possible.

Tasks:

Copying, Cutting, Pasting
Defining Discrete Variables
Destroying Models
Discretizing Data
Editing Knowledge
Editing Node Properties
Exiting TETRAD
Generating Random DAGs
Handling Missing Data
Opening Sessions
Saving Screenshots
Saving Sessions
Selecting Groups of Nodes
Using Popup Menus
Using Templates

Copying, Cutting, Pasting

Nodes and subgraphs of nodes, either in the main session workspace or in any graph display, can be copied, cut, and pasted, either into the same workspace or graph or into some other workspace or graph.

To copy a section of a flowchart graph, first select the nodes you want to copy. The edges connecting those nodes will be copied as well.

Now select "Copy" from the Edit menu, or type Control-C. If you now select "Paste" from the Edit menu, or type Control-V, a copy of the selected nodes and edges will be pasted into the workspace slightly down and to the right of the original.

These selected nodes can now be dragged to the desired location.

If you select "Cut" (or type Control-X) instead of "Copy" (Control-C), the original nodes will be deleted, but you will still have the option of pasting copies of them into the workspace.

You may also paste multiple copies of the originally selected nodes into the workspace. These are pasted down and to the right in each case--e.g.:

All of these procedures work for nodes in graphs as well.

Defining Discrete Variables

Discrete variables in Tetrad may be described as follows.

1. They are assumed to be nominal--that is, the order of categories doesn’t matter for searches and estimations.

2. When trying to decide whether two variables with the same name are equal, their categories are identified by name.

3. When sending data to algorithms, categories are identified by index only.

Some comments. For point (1), it is clearly a simplification to assume that all discrete variables are nominal, and it clearly in some cases leads to a loss of information, since if you knew the categories for some variable carried ordinal information you might be able to use tests of conditional independence that took advantage of this information. For reasons of speed and flexibility, we’ve stayed with the nominal independence tests.

For point (2), the problem is that a variable "X" can be defined in two different boxes--say, two different Bayes Parametric Model boxes or a Bayes Parametric Model box and a Data box. It’s possible that the two variables have the same number of categories (in fact, when doing estimations, this is desirable) but that in the one case the categories are <High, Medium, Low> while in the other case the categories are <Low, Medium, High>. In this case, the mapping of categories should be High-->High, Medium-->Medium, Low-->Low and not High-->Low, Medium-->Medium, Low-->High. That is, the categories should be identified by name.

However, as regards point (3), it is extremely inefficient, especially in Java, to force algorithms over discrete variables to deal with names of categories; algorithms need to deal with indices of categories. So when sending a column of data with variable X, with categories <High, Medium, Low> to an estimator, the estimator only knows that there are three categories for X, at indices 0, 1, and 2, respectively. It doesn’t know about the names of the categories.

Points (2) and (3) are reconciled in Tetrad using a "bulletin board" system. The first time a list of categories is encountered, it is posted on a "bulletin board." After that, if that same list of categories is encountered again, but in some permuted order, the version from the "bulletin board" is retrieved and used instead. So any particular list of categories can only appear in one order in Tetrad. (This does not imply that the variables are ordinal; algorithms still interpret these variables as nominal, in that they employ statistical tests that do not take advantage of ordinality.)

You can see the effects of the "bulletin board" system, for example, in the following situations:

If you’ve specified a Bayes Parametric Model and then read data in from a file for the same variables, the order of the categories for the data will be the same as the order of categories in your Bayes PM. Estimations taking Bayes PMs and discrete data sets as parents will work smoothly.

If you create a Bayes PM with variable X, with categories <Low, High>, and later create another Bayes PM with variable X with categories <High, Low>, the order of categories in the second case will be adjusted to <Low, High>.
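The "bulletin board" idea can be sketched in Python. This is a minimal illustration, not Tetrad's Java implementation; the class name CategoryBulletinBoard and the helper to_indices are hypothetical, and the sketch assumes category names within a list are unique.

```python
class CategoryBulletinBoard:
    """Remember the first order in which each list of categories was seen."""

    def __init__(self):
        self._posted = {}

    def canonical(self, categories):
        """Return the posted order for this list, posting it if it is new."""
        key = frozenset(categories)  # same names in any order give the same key
        return self._posted.setdefault(key, tuple(categories))


def to_indices(column, categories):
    """Translate category names into the indices that algorithms work with."""
    index_of = {name: i for i, name in enumerate(categories)}
    return [index_of[value] for value in column]


board = CategoryBulletinBoard()
first = board.canonical(["Low", "High"])     # posted as (Low, High)
second = board.canonical(["High", "Low"])    # permuted list, same posting
print(first, second)
print(to_indices(["High", "Low", "High"], second))
```

Since both variables end up with the same category order, index 0 means "Low" and index 1 means "High" in both, so identifying categories by index is consistent with identifying them by name.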

Discretizing Data

Data can be discretized column by column in Tetrad by selecting "Discretize Selected Columns..." from the Tools menu of the data editor, which you can launch by double clicking on a Data box.

Both continuous and discrete data can be discretized. Continuous data is discretized by selecting the number of categories one wants the data to have, giving the categories names, and selecting cut points. For categories C1, C2, and C3, cut points c1 and c2 will be needed. Real values in the column < c1 will be mapped to C1; real values in [c1, c2) will be mapped to C2, and real values >= c2 will be mapped to C3. Discrete columns are discretized, by contrast, by simply mapping old categories to new ones, by name.

Consider this data set, simulated from a SEM instantiated model. There are five variables: X1, X2, X3, X4 and X5. Three of the columns (X3, X4 and X5) are selected, and the "Discretize Selected Columns..." item is shown:
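The mapping from real values to categories, and from old categories to new ones, can be sketched in Python as follows. The function names discretize and remap and the sample values are assumptions for illustration; only the interval rule itself comes from the description above.

```python
from bisect import bisect_right


def discretize(values, cut_points, names):
    """Map each real value to a category using ascending cut points.

    With cut points [c1, c2] and names [C1, C2, C3]: values < c1 map to C1,
    values in [c1, c2) map to C2, and values >= c2 map to C3.
    """
    if len(names) != len(cut_points) + 1:
        raise ValueError("need exactly one more category than cut points")
    if list(cut_points) != sorted(cut_points):
        raise ValueError("cut points must be in ascending order")
    # bisect_right counts the cut points that are <= the value, which is
    # exactly the index of the category the value falls into.
    return [names[bisect_right(cut_points, v)] for v in values]


def remap(values, mapping):
    """Discretize a discrete column by mapping old categories to new, by name."""
    return [mapping[v] for v in values]


column = [-1.0, 0.0, 0.5, 1.0, 2.5]
print(discretize(column, [0.0, 1.0], ["C1", "C2", "C3"]))
print(remap(["0", "1", "2"], {"0": "Low", "1": "Low", "2": "High"}))
```

A value that falls exactly on a cut point goes to the category above it, matching the half-open intervals in the rule.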

After selecting the "Discretize Selected Columns..." item, the following dialog appears:

The "Next" and "Previous" buttons at the bottom allow one to navigate through the selected columns. For each column, one must select the number of discretized categories, the names for those categories, and the cut points for those categories. To be helpful, the minimum and maximum value for the column are displayed, default category names in the sequence "0", "1", ... are chosen, and cut points are chosen that evenly divide up the range [Min, Max]. At the bottom of the dialog is a checkbox labeled "Copy unselected columns into new data set." If you check this, the new data set created by the discretizer will contain all of the variables of the old one, with discretized columns changed. Let’s leave this unchecked for now. If you accept all of the defaults, with the checkbox unchecked, a new data set is created comprising discretized versions of X3, X4, and X5, and this new data set is added as a new tab to the Data Editor:

Since this tab is selected, it becomes immediately available to searches, estimations, etc. To see how discretization of discrete columns works, we can further discretize X5 in this data set by selecting it and choosing "Discretize Selected Columns..." from the Tools menu again. The following dialog appears:

We can then specify the category name each category in the column should be mapped to, this time copying over unselected columns:

If you now click "Discretize," a new data set will be added to the Data Editor in a new tab:

OLD TEXT: Sometimes the values of two variables in a data set are strongly correlated. Climate data, for example, may have many essentially redundant variables. Such "multicollinearities" make data analysis difficult, and make model search especially difficult. There are various heuristic techniques for dealing with the problem, but Tetrad offers a simple device. If you click on "Split by collinear columns," the program will prompt you for a correlation value. If you enter a value, say 0.95, the program will create a separate data set for every pair of variables whose correlation is as large or larger than that value. If, for example, variables X2 and X4 are so correlated, and variables X1 and X5 are also so correlated, you will obtain 4 distinct data sets, one with X1, X2, X3, one with X1, X4, X3, one with X2, X3, X5 and one with X4, X3, X5. Be careful with this function: in a large data set, if the correlation is set too low, a huge number of data files might be created.

The first column shows the number of multiples of each case in small lettering. Changing that number, e.g., from 1 to 5, will add four more cases with the same values to the data set. A data set with each case repeated according to the multiplier is created when you connect the Data box to a Manipulate Data box.

Editing Node Properties

Nodes have two properties: a name, and whether or not they’re latent. To edit these properties for a particular node, double click that node, set the properties the way you want, and click "OK."

Consider the following example.

Here we have a graph using default node names X1, X2, and X3. We would like this to read "Heat, a latent variable, causes volume and pressure," so we need to change the name "X3" to "Heat" and make it latent. Double clicking X3, the following dialog comes up:

Clicking "OK," the result looks like this:

Exiting Tetrad

There are two ways to exit the program. One is to choose "Exit" from the File menu. (See Tetrad Menu bar.) The other is to click the red "X" in the top title bar for the application. This looks different depending on what operating system you’re in. In Windows XP, it looks like this:

When you exit the program, Tetrad asks you, for each session you have open, whether you’d like to save that session and, if so, where you would like to save it. Saved sessions are given the ".tet" suffix and are loadable in later versions of Tetrad.

Generating Random DAGs

In various contexts, you will be given the opportunity to generate a random DAG. The algorithm for generating random DAGs in Tetrad is a Markov chain algorithm due to Melançon, Dutour, and Philippe, which selects a random DAG uniformly from the set of DAGs that satisfy a given list of parameters, which you may tweak.

For instance, here is the dialog that comes up when you construct a new DAG in a Graph box:

If you select "A random DAG" in this dialog, you will need to specify seven parameters (or accept the defaults). The parameters are as follows:

1. Number of nodes - that is, the total number of nodes (measured plus latent) in the graph.
2. Number of latent nodes - the number of nodes that should be made latent (see).
3. Maximum number of edges - The number of edges in the randomly generated graph will not exceed this. In most cases, this will be the number of edges in the graph, but if other parameters are too constraining, the number of edges may be less than this.
4. Maximum indegree - The number of edges into any given node will not exceed this.
5. Maximum outdegree - The number of edges out of any given node will not exceed this.
6. Maximum degree - The total number of edges connected to any given node will not exceed this.
7. Connected - "Yes" or "No," depending on whether the entire DAG should be connected or not. A graph is connected if for every pair of nodes X, Y in the graph, there is an undirected path from X to Y.
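To give a feel for how a Markov chain can produce a random DAG under such limits, here is a simplified Python sketch. It is not Tetrad's implementation and makes no claim to reproduce the published algorithm or its uniformity guarantee: it repeatedly picks an ordered pair of nodes, deletes the edge if it is present, and otherwise adds it when that keeps the graph acyclic and within the limits. The function names, the steps parameter, and the omission of the latent-node and connectedness parameters are all simplifications made for illustration.

```python
import random


def has_directed_path(children, start, goal):
    """Depth-first search for a directed path from start to goal."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return False


def random_dag(num_nodes, max_edges, max_indegree, max_outdegree, max_degree,
               steps=2000, seed=None):
    """Random walk over DAGs on nodes 0..num_nodes-1 (needs num_nodes >= 2)."""
    rng = random.Random(seed)
    children = {n: set() for n in range(num_nodes)}
    parents = {n: set() for n in range(num_nodes)}
    num_edges = 0
    for _ in range(steps):
        a, b = rng.sample(range(num_nodes), 2)
        if b in children[a]:
            # The edge a --> b exists: remove it.
            children[a].remove(b)
            parents[b].remove(a)
            num_edges -= 1
            continue
        allowed = (
            num_edges < max_edges
            and len(children[a]) < max_outdegree
            and len(parents[b]) < max_indegree
            and len(children[a]) + len(parents[a]) < max_degree
            and len(children[b]) + len(parents[b]) < max_degree
            # Adding a --> b creates a cycle exactly when b already reaches a.
            and not has_directed_path(children, b, a)
        )
        if allowed:
            children[a].add(b)
            parents[b].add(a)
            num_edges += 1
    return sorted((a, b) for a in children for b in children[a])


edges = random_dag(num_nodes=6, max_edges=8, max_indegree=2,
                   max_outdegree=2, max_degree=3, seed=1)
print(edges)
```

Because an edge is only ever added when the limits allow it, the result can have fewer edges than the maximum, which mirrors the remark under parameter 3.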

For the combination of parameters in the above dialog, the following random DAG might be generated:

Handling Missing Data

Missing data is represented in both the tabular and covariance data editors by an asterisk ("*"). There arethree ways to create missing data values in a data set:

In the Data Editor, replace the relevant entries with asterisks, by hand. When loading data, load data that contains missing values. From the Tools menu in the Data Editor, select "Inject Missing Values Randomly."

The first way explicitly declares that a particular datum is missing. The second way reflects missing dataas represented in a data file. The third way automatically adds missing data at random in a data file,usually as a test of how an algorithm performs under conditions of missing data.

Usually it is a good idea to remove or impute missing data before handing it over to an algorithm. Withtabular data, one has the option of removing cases containing missing values from a data file, by selectingTools-->Missing Values-->Remove Cases with Missing Values. Selecting insteadTools-->MissingValues-->Replace Missing Values with Column Mode will replace missing values withthe mode of the column. This will work for either continuous or discrete data. Similarly forTools-->MissingValues-->Replace Missing Values with Column Mean, although this operates only oncontinuous columns. Also, Tools->Missing values-->Replace Missing Values with Extra Category workson discrete variables only and addes an extra category to each indicating where the missing values were.
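The simpler of these options can be sketched in Python. This is an illustration only, not Tetrad's code: the function names are hypothetical, rows are plain lists, and the label "Missing" for the extra category is an assumption.

```python
from collections import Counter
from statistics import mean

MISSING = "*"  # the marker used for missing values in the data editors


def remove_cases(rows):
    """Drop every case (row) that contains a missing value."""
    return [row for row in rows if MISSING not in row]


def column_mode(rows, col):
    """Most frequent observed value in a column."""
    observed = [row[col] for row in rows if row[col] != MISSING]
    return Counter(observed).most_common(1)[0][0]


def column_mean(rows, col):
    """Mean of the observed values in a continuous column."""
    return mean(row[col] for row in rows if row[col] != MISSING)


def replace_in_column(rows, col, fill):
    """Return a copy of the data with missing entries in one column filled."""
    return [[fill if (i == col and v == MISSING) else v
             for i, v in enumerate(row)]
            for row in rows]


rows = [[1.0, "a"], ["*", "b"], [3.0, "*"], [2.0, "a"]]
print(remove_cases(rows))
print(replace_in_column(rows, 0, column_mean(rows, 0)))   # column mean
print(replace_in_column(rows, 1, column_mode(rows, 1)))   # column mode
print(replace_in_column(rows, 1, "Missing"))              # extra category
```

Removing cases shrinks the sample, while the replacement options keep every case at the cost of introducing values that were never observed.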

Tools-->Missing Values-->Replace Missing Values with Regression Predictions imputes missing values using a regression model, as follows. First, in a copy of the data set, missing values are imputed using means. Then for each missing value, the column of that missing value is regressed onto the remaining columns in the data set, and the predicted value for the case of the missing value using values from the data set copy replaces the missing value.
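The two-step procedure just described can be sketched in Python as follows. This is an illustration under stated assumptions, not Tetrad's implementation: all names are hypothetical, the regression is ordinary least squares with an intercept, and the regression for a column is assumed to be fit on the cases where that column was actually observed (the description above does not say which cases are used).

```python
MISSING = "*"

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    Raises ZeroDivisionError if A is singular."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    Z = [[1.0] + row for row in X]
    p = len(Z[0])
    ztz = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    zty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(ztz, zty)

def impute_by_regression(rows):
    ncols = len(rows[0])
    # Step 1: in a copy of the data, impute missing values with column means.
    means = []
    for c in range(ncols):
        observed = [r[c] for r in rows if r[c] != MISSING]
        means.append(sum(observed) / len(observed))
    filled = [[means[c] if r[c] == MISSING else float(r[c]) for c in range(ncols)]
              for r in rows]
    result = [row[:] for row in filled]
    # Step 2: regress each column with missing values on the remaining columns
    # and replace each missing value with the prediction for its case.
    for c in range(ncols):
        missing_rows = [i for i, r in enumerate(rows) if r[c] == MISSING]
        if not missing_rows:
            continue
        others = [k for k in range(ncols) if k != c]
        train = [i for i, r in enumerate(rows) if r[c] != MISSING]  # assumption
        beta = fit_ols([[filled[i][k] for k in others] for i in train],
                       [filled[i][c] for i in train])
        for i in missing_rows:
            result[i][c] = beta[0] + sum(b * filled[i][k]
                                         for b, k in zip(beta[1:], others))
    return result
```

Note that the predictors for a case come from the mean-imputed copy, so a case with several missing values is still predicted from complete inputs.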

If an algorithm is handed data with missing values and does not have a means to deal with it sensibly, a message is posted asking the user to remove or impute the missing data first. Some algorithm configurations do have sensible means to deal with missing data, in which case the data is used. For instance, any algorithm taking a Chi Square or G Square independence test as oracle can sensibly ignore rows of variables being compared that contain missing values. For fine control over handling of missing data, it is best, however, to impute missing data first.

Algorithms that can handle missing data include:

Any search algorithm using Chi Square or G Square as a conditional independence oracle.
ML Bayes Updater
Dirichlet Bayes Updater
EM Bayes Updater

Any other algorithm that takes data as an input will display an error message if the data contains missing values.

Opening Sessions

Sessions that have been previously saved out as ".tet" files can be loaded back using the "Open Session..." menu item in the main File menu.

You will be asked to locate the file on your hard drive. Once you do, click "Open."

Note that sessions saved using versions published since March 2005 are stamped with the version number of Tetrad they were saved with and the date they were saved. (See Tetrad Versioning.) If you need a specific version of the program for some reason, all versions since March 2005 can be launched using Java Web Start from the Tetrad download site, and versions previous to March 2005 are stored in the archive directory inside the Tetrad download directory on that site and can be launched as executable jars.

Saving Screenshots

Most editors in Tetrad, along with the main workspace area, have a menu item in their File menu called "Save Screenshot..." Editors that display graphs also have a second menu item called "Save Graph Image..." These save PNG images of the dialog being displayed and the graph being displayed, respectively.

To show how these menu items function, consider the editor for Directed Acyclic Graphs. If you select Save Screenshot from the File menu of this editor, you will see a Save dialog asking you for a filename. If you type a filename (or accept the default) and click "Save," a PNG image will be saved for you that looks something like this:

Notice that part of the graph is obscured. To save the entire image of just the graph, select "Save Graph Image." If you do this, you will get an image something like this:

Saving and Loading Sessions.

Sessions can be saved by using either the "Save Session" or the "Save Session As..." menu item in the main File menu:

If you use "Save Session As...," you will be asked to supply a name for the session and to locate the directory you want to save it in. The name must end with ".tet"; this will be added to the end of your name if you don’t type it.

If you use "Save Session," you will only be asked for a name and directory for your session if you have not already supplied one. Otherwise, the session will be saved in the same place and under the same name as the last time you saved it.

If you exit Tetrad, you will be given the option of saving your sessions. See Exiting Tetrad.

Definitions

Although a decent understanding of causal search theory requires more than a few comments in a manual, it is helpful to define at least some of the terms used in this program, or, where it seems more appropriate, to make references to the literature where the terms are defined and discussed in appropriate detail. We will try to define enough terms to make it clear at least how to use the program.

Terms:

Measure-Structural Graph
Measured and Latent Variables
Tetrad Graph Types

Definition: Measurement/structural model

A specific graphical model where all observed variables are indicators of some latent variable (i.e., each observed node has at least one latent parent), and no observed variable is a cause of any latent variable (but observed variables may be causes of other observed variables). The model describing which latents are causes of which observed variables, and how the observed variables are related, is called the measurement model. The model describing how latents are causally related is called the structural model.

Also, we say that a measurement/structural model is pure if every observed variable has only one latent parent, and no pair of observed variables is connected by a path that does not include a latent variable of the model. The following figure illustrates a pure measurement/structural model.

Measured Vs. Latent Variables

Variables in Tetrad are of two types: measured and latent. Measured variables (often called "observed" variables) are variables for which data have been measured. Latent variables are variables for which data have not been measured but which you believe might be required to explain the causal relationships between measured variables adequately.

We represent measured variables in graphs using rectangular boxes around their variable names and latent variables using oval shapes around their variable names. In the example below, Grind, CoffeeTaste, and Temperature are measured variables, while Freshness is a latent variable.

We would expect data sets for these variables to contain columns only for Grind, Temperature, and CoffeeTaste, although causal models for such data might include extra latent variables such as Freshness, in this example.

A latent variable that is a parent of two or more measured variables is referred to as a latent common cause (or unmeasured common cause), as for example CoffeeType in the picture below:

Tetrad Graph Types

The theory of causal search that the Tetrad program implements uses graphs of a variety of different types, some simple and some fairly sophisticated. A brief description of each of the main types is given below. For more details, consult the Bibliography, especially Spirtes, Glymour and Scheines (2002), Causation, Prediction, and Search.

Directed Graphs.

A directed graph is a set of variables V together with a set of directed edges Vi-->Vj for Vi, Vj in V, Vi not equal to Vj. A directed graph may contain cycles--that is, paths of the form X-->...-->X, for some X in V.

Directed Acyclic Graphs (DAG).

A directed acyclic graph is a directed graph that does not contain cycles. This type of graph can be used to construct a Bayes or SEM parametric model. A Bayes parametric model requires a DAG.
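Whether a directed graph is acyclic can be checked mechanically. The sketch below, in Python, repeatedly removes nodes that have no remaining incoming edges (Kahn's algorithm); the graph is acyclic exactly when every node can be removed this way. The function name and the edge-list representation are assumptions for illustration and are not taken from Tetrad.

```python
def is_acyclic(nodes, edges):
    """Return True if the directed graph has no path X --> ... --> X."""
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for a, b in edges:  # each edge is a pair (a, b) meaning a --> b
        children[a].append(b)
        indegree[b] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    # Nodes on a cycle never reach indegree 0, so they are never removed.
    return removed == len(nodes)
```

For example, X-->Y-->Z is acyclic, but adding the edge Z-->X creates the cycle X-->Y-->Z-->X.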

SEM Graph.

A SEM graph is a directed graph over a set of variables V which has been embellished by additional variables E representing error terms for endogenous variables in V, edges from each e in E to its corresponding variable in V, and a set of bidirected edges over this embellished graph. SEM graphs are used to represent the causal structure of SEM models.

Pattern.

From Causation, Prediction and Search (2002), p. 61: "A pattern Pi is a mixed graph with directed and undirected edges. A graph G is in the set of graphs represented by Pi if and only if:

(i) G has the same adjacency relations as Pi;
(ii) if the edge between A and B is oriented A-->B in Pi, then it is oriented A-->B in G;
(iii) if Y is an unshielded collider on the path <X, Y, Z> in G then Y is an unshielded collider on <X, Y, Z> in Pi."

Y is an unshielded collider on path <X, Y, Z> iff X-->Y<--Z and X and Z are not adjacent.
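The three conditions, together with the definition of an unshielded collider, can be checked directly on small graphs. The following Python sketch is an illustration only: the function names and the representation of edges as pairs are assumptions, and it is not the code Tetrad uses.

```python
from itertools import combinations

def unshielded_colliders(directed, adjacent):
    """Triples (X, Y, Z) with X --> Y <-- Z where X and Z are not adjacent."""
    parents = {}
    for a, b in directed:  # each pair (a, b) means a --> b
        parents.setdefault(b, set()).add(a)
    found = set()
    for y, ps in parents.items():
        for x, z in combinations(sorted(ps), 2):
            if frozenset((x, z)) not in adjacent:
                found.add((x, y, z))
    return found

def represented_by(dag_edges, pattern_directed, pattern_undirected):
    """Check conditions (i)-(iii) for a DAG G against a pattern Pi."""
    dag_adj = {frozenset(e) for e in dag_edges}
    pat_adj = ({frozenset(e) for e in pattern_directed}
               | {frozenset(e) for e in pattern_undirected})
    same_adjacencies = dag_adj == pat_adj                          # (i)
    orientations_kept = set(pattern_directed) <= set(dag_edges)    # (ii)
    colliders_kept = (unshielded_colliders(dag_edges, dag_adj)
                      <= unshielded_colliders(pattern_directed, pat_adj))  # (iii)
    return same_adjacencies and orientations_kept and colliders_kept
```

For example, the DAG X-->Y-->Z is represented by the pattern with undirected edges X---Y and Y---Z, whereas the DAG X-->Y<--Z is not, because its unshielded collider at Y does not appear in that pattern.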

Patterns are theoretically output by the PC and GES algorithms. Sometimes the PC algorithm includes bidirected edges in its output. When this happens, it is because there is conflicting independence test information. This usually means that the assumption of causal sufficiency has been violated and that FCI should be run as well for comparison.

POIPG.

POIPGs (pronounced "poip-G") were output by earlier versions of FCI; due to theoretical advances since then, FCI now outputs PAGs (see below).

Partial Ancestral Graph (PAG).

A PAG is ...

PAGs are output by FCI and CCD.

Mixed Ancestral Graphs (MAG).

Definition.

Why Doesn’t Tetrad...?

Provide descriptive statistics?

- Because those statistics can easily be obtained in Excel or Matlab.

Transform variables, e.g., by taking logs?

- Because this can be done in common commercial packages.

Do linear regression analysis?

- Same answer.

Use logistic regression or log-linear models?

- These regression procedures could be valuable in estimating parameters in Bayes nets, but they require a sound search procedure for selecting interaction terms, and we haven’t solved that yet.

Do factor analysis?

- Probably it should. For many problems, however, the latent variable search procedure in Tetrad is preferable.

Deal with non-Normal distributions for continuous variables?

- Relevant statistics are available only for Normal, Multinomial and Conditional Gaussian distribution families; the last should be included.

Provide all of the models consistent with search output, instead of "patterns," "PAGs," etc.?

- The models corresponding to a pattern could and perhaps should be provided, but there are an infinite number of models consistent with a PAG, and your computer is finite.

Provide Bayesian search procedures when there may be latent variables?

- No consistent, computationally feasible, general algorithms are known.

Provide search procedures for cyclic (non-recursive) models with latent variables?

- No consistent search procedures are known.

Provide search procedures for time series?

- Bayes net search procedures can in principle be used for time series, but no practical, consistent, general search procedures are known. The search algorithms in Tetrad can be used to search for "simultaneous" causal relations after regression on lags.

Provide search procedures to find a common model or models for two or more separate data sets?

- If the variables in one data set are a subset of those in the other data set, a common model can be sought with the present version of Tetrad. If the data sets have entirely distinct variable sets, no principled search procedure for a common model is possible. If the data sets contain some common and some distinct variables, a sometimes useful principled search is possible, but adequate algorithms have not yet been developed.

Further Help

These help files are currently incomplete. If you have a question that they do not answer, or wish to report a bug, please send email to [email protected]. Someone working on the Tetrad project will answer.

References

The following books and articles are referred to in the manual pages.

Bollen.

Spirtes, Glymour, and Scheines (2002). Causation, Prediction, and Search.
