Post on 26-Aug-2020

Fast, Accurate Causal Search Algorithms from the Center for Causal Discovery (CCD)

The CCD Algorithms Group

University of Pittsburgh
Carnegie Mellon University
Pittsburgh Supercomputing Center
Yale University

BD2K All Hands Meeting 11/29/2016

Causal Discovery in Biomedicine

Science is centrally concerned with the discovery of causal relationships in nature

• Understanding
• Control

Examples:

• Determine the genes and cell signaling pathways that cause breast cancer
• Discover the clinical effects of a new drug
• Uncover the mechanisms of pathogenicity of a recently mutated virus that is spreading rapidly in the population

Why Establish a Center for Causal Discovery Now?

Algorithmic Advances + Availability of Big Biomedical Data

Algorithmic Advances

• In the past 25 years, there has been tremendous progress in the development of computational methods for representing and discovering causal networks from a combination of observational data, experimental data, and knowledge.

• These methods are generally applicable to biomedical data.

Availability of Big Biomedical Data

• The variety, richness, and quantity of biomedical data have been increasing very rapidly.

• The appropriate analysis of these data has great potential to advance biomedical science.

http://aldousvoice.files.wordpress.com/2014/06/database.jpg

Primary Goals of the CCD

• Goal 1. Develop and implement state-of-the-art methods for discovering causal knowledge from biomedical big data using causal graphical models
  – Make some of the best existing causal discovery methods available as free, open source software
  – Develop new methods and make them available

• Goal 2. Investigate three biomedical projects (cancer, lung disease, brain functional connectivity) to evaluate methods and drive their further development

• Goal 3. Disseminate causal discovery software and knowledge widely to biomedical researchers and data scientists

Typical Causal Analysis Workflow

[Figure, built up over five slides: a workflow diagram with the components Prior Knowledge, Data, Causal Analysis, Causal Network, Causal Hypothesis Generation by Biomedical Scientists, and Experiments.]

Basic Components Needed to Learn Causal Networks from Data

• Model representation
• Model evaluation
• Model search

Model Representation

• Causal Bayesian network (CBN)
  – Directed acyclic graph
  – Nodes represent variables
  – Arcs represent causal influence
  – Specify P(X | parents(X)) for each X
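As an illustration of this representation, the sketch below encodes a small CBN in Python and evaluates the joint distribution as the product of P(X | parents(X)) over the nodes. The three binary variables (X1, X2, X3), the structure X1 → X3 ← X2, and every probability are invented for the example; none of them comes from the talk.

```python
from itertools import product

# Hypothetical structure X1 -> X3 <- X2: each node maps to its tuple of parents.
PARENTS = {"X1": (), "X2": (), "X3": ("X1", "X2")}

# P(node = True | parent values); all numbers are made up for illustration.
CPT = {
    "X1": {(): 0.3},
    "X2": {(): 0.6},
    "X3": {
        (True, True): 0.9,
        (True, False): 0.5,
        (False, True): 0.4,
        (False, False): 0.1,
    },
}


def joint_probability(assignment):
    """Return P(assignment) as the product of P(X | parents(X)) over all nodes."""
    prob = 1.0
    for node, parents in PARENTS.items():
        p_true = CPT[node][tuple(assignment[p] for p in parents)]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob


# Enumerate all 8 joint states; a valid factorization sums to 1 over them.
states = [dict(zip(PARENTS, values)) for values in product([True, False], repeat=3)]
total = sum(joint_probability(s) for s in states)
all_true = joint_probability({"X1": True, "X2": True, "X3": True})
```

Storing one conditional table per node keeps the number of parameters tied to the number of parents of each node rather than to the full joint state space.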

This figure is adapted from: Sachs K, et al. Protein-signaling networks learned from multi-parameter single-cell data of human T cells. Science 308 (2005) 523-529.

Model Representation with CBNs

Model Representation Issues

Model Evaluation

• Constraint based (e.g., tests of conditional independence)

• Score based (e.g., Bayesian scores)
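The two styles of evaluation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CCD implementation: it assumes linear-Gaussian data, uses a Fisher z test of (partial) correlation as the conditional independence test, and uses BIC as a stand-in for a Bayesian score. The function names, the simulated chain X → Z → Y, and the coefficients are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z_pvalue(x, y, z=None):
    """Constraint-based check: two-sided p-value for X independent of Y (given Z)."""
    r = pearson(x, y)
    k = 0  # size of the conditioning set
    if z is not None:
        rxz, ryz = pearson(x, z), pearson(y, z)
        r = (r - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
        k = 1
    stat = math.sqrt(len(x) - k - 3) * abs(math.atanh(r))
    return math.erfc(stat / math.sqrt(2))


def bic_gain_single_parent(child, parent):
    """Score-based check: BIC change from giving a parentless child one parent.

    With BIC = 2 * log-likelihood - (number of parameters) * log(n), the
    residual variance shrinks by (1 - r^2) and one extra coefficient is paid for.
    """
    n = len(child)
    r = pearson(child, parent)
    return -n * math.log(1 - r ** 2) - math.log(n)


# Simulated chain X -> Z -> Y with made-up coefficients.
random.seed(0)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
z = [0.8 * a + random.gauss(0, 1) for a in x]
y = [0.8 * b + random.gauss(0, 1) for b in z]

p_marginal = fisher_z_pvalue(x, y)        # X and Y are dependent
p_conditional = fisher_z_pvalue(x, y, z)  # X and Y are independent given Z
gain = bic_gain_single_parent(y, z)       # positive: the arc Z -> Y is supported
```

A constraint-based method would use tests like the first to remove or orient edges, while a score-based method would compare candidate structures by quantities like the last.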

What is the Big Data Problem on which the CCD is Primarily Focused?

The Number of Causal Model Structures as a Function of the Number of Measured Variables*

Number of variables (nodes)    Number of causal model structures
1                              1
2                              3
3                              25
4                              543
5                              29,281
6                              3,781,503
7                              1.1 x 10^9
8                              7.8 x 10^11
9                              1.2 x 10^15
10                             4.2 x 10^18

* Assumes there are no latent variables and no directed cycles.
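The counts in this table match the number of labeled directed acyclic graphs on n nodes, which can be reproduced with a standard recurrence due to Robinson. The talk does not say how its table was computed, so treating it as this recurrence is an assumption; the function name `count_dags` is hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes (no latent variables, no directed cycles).

    Recurrence: a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k),
    with a(0) = 1. The index k counts nodes that have no incoming arcs.
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


# Exact counts for 1 to 10 variables.
table = {n: count_dags(n) for n in range(1, 11)}
```

The growth is super-exponential, which is why exhaustive enumeration of structures is not feasible beyond a handful of variables and a search strategy is needed.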

Our Main Big Data Problem

Analyze biomedical datasets containing a large number of variables in order to generate plausible hypotheses of the causal relationships that hold among those variables

An Example Algorithm for Causal Discovery with Many Variables: FGES

• GES: A popular CBN learning algorithm that uses greedy search and Bayesian scoring*

• We developed a fast version of GES, called FGES
  – Optimized the single processor version of GES
  – Parallelized GES

* Chickering DM. Optimal structure identification with greedy search. Journal of Machine Learning Research 3 (2002) 507-554.

Evaluation of FGES

• Generated 10 random CBNs
  – 30,000 nodes and 60,000 edges
  – Continuous variables with linear relationships and Gaussian noise
• Sampled each CBN to generate 1,000 cases
• Provided those cases to FGES and measured its ability to learn the data-generating CBN

Average Directed Arc Precision    Average Directed Arc Recall    # Processors    Average Learning Time
99%                               84%                            128             2.3 minutes
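For readers unfamiliar with these metrics, the sketch below shows one common way to compute directed arc precision and recall from a true and a learned arc set. The talk does not spell out its exact definitions (for example, how undirected edges in the output are counted), so the definitions here are an assumption, and the function name and the toy graphs are hypothetical.

```python
def directed_arc_precision_recall(true_arcs, learned_arcs):
    """Compare arcs given as (cause, effect) pairs.

    precision = correctly oriented learned arcs / all learned arcs
    recall    = correctly oriented learned arcs / all true arcs
    An empty denominator yields 0.0 by convention in this sketch.
    """
    true_set, learned_set = set(true_arcs), set(learned_arcs)
    correct = true_set & learned_set  # same endpoints and same orientation
    precision = len(correct) / len(learned_set) if learned_set else 0.0
    recall = len(correct) / len(true_set) if true_set else 0.0
    return precision, recall


# Toy example: the learned graph reverses B -> C and misses A -> C.
true_arcs = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
learned_arcs = [("A", "B"), ("C", "B"), ("C", "D")]
precision, recall = directed_arc_precision_recall(true_arcs, learned_arcs)
```

In this toy case two of the three learned arcs are correct and two of the four true arcs are recovered, so a reversed arc hurts precision and a missing arc hurts recall.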

For more information:

• http://arxiv.org/ftp/arxiv/papers/1507/1507.07749.pdf

• Ramsey J, Glymour C. A Million Variables and More: The Fast Greedy Search (FGS) Algorithm for Learning High Dimensional Graphical Causal Models (to appear).

Another Example of an Algorithm for Causal Discovery with Many Variables: GFCI

• FGES assumes there are no latent confounders, that is, there are no latent variables that cause two or more measured variables

• Biomedical data often contain latent confounders

• GFCI* allows for the possibility of latent confounders

* Ogarrio JM, Spirtes P, Ramsey J (2016). A hybrid causal search algorithm for latent variable models. JMLR Workshop and Conference Proceedings, 52, 368-379.
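To see why latent confounders matter, the simulation below generates two measured variables that share an unmeasured common cause but have no causal arc between them. It is only an illustration of the problem, not of GFCI itself; the variable names, unit coefficients, and sample size are assumptions made for the example.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y given z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Hypothetical model: latent -> x and latent -> y, with no arc between x and y.
random.seed(1)
n = 5000
latent = [random.gauss(0, 1) for _ in range(n)]
x = [l + random.gauss(0, 1) for l in latent]
y = [l + random.gauss(0, 1) for l in latent]

# The measured variables are clearly correlated (about 0.5 in expectation) ...
corr_xy = pearson(x, y)
# ... but the dependence vanishes once the latent cause is conditioned on,
# which is impossible in practice because the latent variable is unmeasured.
corr_xy_given_latent = partial_corr(x, y, latent)
```

An algorithm that assumes no latent confounders has to explain the correlation between x and y with an arc between them, which is the kind of error that methods allowing for latent confounders aim to avoid.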

Evaluation of GFCI

• Generated more than 100 random CBNs
  – 1,000 nodes and 2,000 edges
  – Continuous variables with linear Gaussian relationships
• Sampled each CBN to generate 2,000 cases
• Provided cases to GFCI and measured its performance

% Latent Nodes    Average Directed Arc Precision    Average Directed Arc Recall    # Processors    Average Learning Time
5%                92%                               93%                            1               15 seconds

For more information: Ogarrio JM, Spirtes P, Ramsey J (2016). A hybrid causal search algorithm for latent variable models. JMLR Workshop and Conference Proceedings, 52, 368-379.

Ongoing Algorithm Work Includes …

• Modeling non-linear relationships

• Modeling causal feedback

• Handling a mixture of continuous and discrete variables

• Outputting uncertainty in edge relationships

• Learning the causal relationships among latent variables

Summary

• Causal discovery is central to biomedical science

• The variety, richness, and quantity of biomedical data are increasing rapidly

• The CCD is providing software now for analyzing big biomedical data to discover causal relationships

• Causal discovery algorithms with additional capabilities will soon be available as well

Acknowledgements

• Thanks to the members of the Algorithms Group of the Center for Causal Discovery for their contributions to the activities described in this talk.

• The Center for Causal Discovery is supported by grant U54HG008540 awarded by the National Human Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The content of this presentation is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Thank you

gfc@pitt.edu

CCD software is available at: www.ccd.pitt.edu

Extra Slides

Association Versus Causation

• Association:
  – Represents statistical relationships
  – Predicts outcomes from passive observations
  – Example uses: classification and regression

• Causation:
  – Represents mechanisms
  – Predicts outcomes of active intervention
  – Example uses: decision making and planning

Example

• Association: Smoking – lung cancer – coughing
  – Both smoking and coughing predict lung cancer

• Causation: Smoking → lung cancer → coughing
  – Smoking influences lung cancer
  – Coughing does not influence lung cancer
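The smoking example can be made concrete with a small exact calculation. The sketch below assumes the chain Smoking → lung cancer → coughing with binary variables; all probabilities are invented for illustration, and the function names are hypothetical. It contrasts observing a cough (association) with forcing a cough by intervention (causation).

```python
from itertools import product

# Made-up parameters for the chain Smoking -> LungCancer -> Coughing.
P_SMOKE = 0.3
P_CANCER = {True: 0.10, False: 0.01}  # P(cancer | smoking)
P_COUGH = {True: 0.70, False: 0.10}   # P(cough | cancer)


def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given boolean value."""
    return p if value else 1.0 - p


def joint(smoke, cancer, cough, do_cough=None):
    """Joint probability, optionally under the intervention do(cough = do_cough)."""
    p = bern(P_SMOKE, smoke) * bern(P_CANCER[smoke], cancer)
    if do_cough is None:
        return p * bern(P_COUGH[cancer], cough)
    # Intervention: coughing is set from outside, so its arc from cancer is cut.
    return p if cough == do_cough else 0.0


def prob_cancer(given_cough=None, do_cough=None):
    """P(cancer), optionally conditioned on an observed or an enforced cough."""
    num = den = 0.0
    for smoke, cancer, cough in product([True, False], repeat=3):
        if given_cough is not None and cough != given_cough:
            continue
        p = joint(smoke, cancer, cough, do_cough)
        den += p
        if cancer:
            num += p
    return num / den


baseline = prob_cancer()                  # P(cancer)
observed = prob_cancer(given_cough=True)  # P(cancer | cough observed)
intervened = prob_cancer(do_cough=True)   # P(cancer | do(cough))
```

With these numbers, observing a cough raises the probability of lung cancer well above the baseline, while making someone cough leaves it unchanged: coughing predicts lung cancer but does not influence it.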

Recent Examples of the Use of Graphical Causal Discovery Methods

Anticipation-related brain connectivity in bipolar and unipolar depression: A graph theory approach. Anna Manelis, Jorge R. C. Almeida, Richelle Stiffler, Jeanette C. Lockovich, Haris A. Aslam, Mary L. Phillips. Brain 139 (2016) 2554-2566.

Dobryakova, E., Costa, S. L., Wylie, G. R., DeLuca, J., & Genova, H. M. (2016). Altered effective connectivity during a processing speed task in individuals with multiple sclerosis. Journal of the International Neuropsychological Society: JINS, 22(2), 216-224.

Otsuka, J. (2016). Discovering phenotypic causal structure from nonexperimental data. Journal of Evolutionary Biology, 29(6), 1268-1277.

Attur, M., Statnikov, A., Samuels, J., Li, Z., Alekseyenko, A. V., Greenberg, J. D., et al. (2015). Plasma levels of interleukin-1 receptor antagonist (IL1Ra) predict radiographic progression of symptomatic knee osteoarthritis. Osteoarthritis and Cartilage, 23(11), 1915-1924.