Post on 26-Aug-2020
Fast, Accurate Causal Search Algorithms from the Center for Causal Discovery (CCD)
The CCD Algorithms Group
University of Pittsburgh
Carnegie Mellon University
Pittsburgh Supercomputing Center
Yale University
BD2K All Hands Meeting 11/29/2016
Causal Discovery in Biomedicine
Science is centrally concerned with the discovery of causal relationships in nature
• Understanding
• Control
Examples:
• Determine the genes and cell signaling pathways that cause breast cancer
• Discover the clinical effects of a new drug
• Uncover the mechanisms of pathogenicity of a recently mutated virus that is spreading rapidly in the population
Why Establish a Center for Causal Discovery Now?
Algorithmic Advances + Availability of Big Biomedical Data
Algorithmic Advances
• In the past 25 years, there has been tremendous progress in the development of computational methods for representing and discovering causal networks from a combination of observational data, experimental data, and knowledge.
• These methods are generally applicable to biomedical data.
Availability of Big Biomedical Data
• The variety, richness, and quantity of biomedical data have been increasing very rapidly.
• The appropriate analysis of these data has great potential to advance biomedical science.
Primary Goals of the CCD
• Goal 1. Develop and implement state-of-the-art methods for discovering causal knowledge from biomedical big data using causal graphical models
  – Make some of the best existing causal discovery methods available as free, open source software
  – Develop new methods and make them available
• Goal 2. Investigate three biomedical projects (cancer, lung disease, brain functional connectivity) to evaluate methods and drive their further development
• Goal 3. Disseminate causal discovery software and knowledge widely to biomedical researchers and data scientists
Typical Causal Analysis Workflow
[Figure: workflow cycle. Prior Knowledge and Data feed into Causal Analysis, which produces a Causal Network. The network supports Causal Hypothesis Generation by Biomedical Scientists, which leads to Experiments, which in turn yield new Data.]
Basic Components Needed to Learn Causal Networks from Data
• Model representation
• Model evaluation
• Model search
Model Representation
• Causal Bayesian network (CBN)
  – Directed acyclic graph
  – Nodes represent variables
  – Arcs represent causal influence
  – Specify P(X | parents(X)) for each X
This figure is adapted from: Sachs K, et al. Protein-signaling networks learned from multi-parameter single-cell data of human T cells. Science 308 (2005) 523-529.
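As a minimal sketch of this representation, the snippet below encodes a hypothetical three-variable binary CBN (Smoking → LungCancer → Coughing). The variable names and all probabilities are made up for illustration and do not come from the talk. Each node stores its parents and P(X | parents(X)), and the joint probability is the product of those terms.

```python
from itertools import product

# Hypothetical CBN: each node maps to (parents, CPT).
# A CPT maps a tuple of parent values to P(node = True | parents).
# All probabilities are invented for illustration only.
cbn = {
    "Smoking": ((), {(): 0.3}),
    "LungCancer": (("Smoking",), {(True,): 0.10, (False,): 0.01}),
    "Coughing": (("LungCancer",), {(True,): 0.80, (False,): 0.20}),
}


def joint_probability(network, assignment):
    """P(assignment) as the product over nodes of P(X | parents(X))."""
    prob = 1.0
    for node, (parents, cpt) in network.items():
        p_true = cpt[tuple(assignment[p] for p in parents)]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob


# Sanity check: the joint distribution sums to 1 over all assignments.
names = list(cbn)
total = sum(
    joint_probability(cbn, dict(zip(names, values)))
    for values in product([True, False], repeat=len(names))
)
print(f"sum over all assignments = {total:.6f}")
```

Storing one conditional distribution per node is what keeps the representation compact: the number of parameters grows with the size of each parent set rather than with the size of the full joint table.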
Model Representation with CBNs
Model Representation Issues
Model Evaluation
• Constraint based (e.g., tests of conditional independence)
• Score based (e.g., Bayesian scores)
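The two evaluation styles can be sketched for linear Gaussian data as follows. This is an illustration under stated assumptions, not the CCD implementation: the constraint-based side uses a Fisher z test on partial correlations, and the score-based side uses BIC as a common stand-in for a Bayesian score. The function names and the simulated chain X → Y → Z are hypothetical.

```python
import math

import numpy as np


def partial_correlation(data, i, j, cond=()):
    """Partial correlation of columns i and j given the columns in cond."""
    idx = [i, j] + list(cond)
    precision = np.linalg.inv(np.corrcoef(data[:, idx], rowvar=False))
    return -precision[0, 1] / math.sqrt(precision[0, 0] * precision[1, 1])


def fisher_z_pvalue(data, i, j, cond=()):
    """Two-sided p-value for H0: column i is independent of j given cond."""
    n = data.shape[0]
    r = partial_correlation(data, i, j, cond)
    r = min(max(r, -0.999999), 0.999999)  # keep the log finite
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def bic_node_score(data, child, parents):
    """BIC term for one node, up to additive constants (higher is better)."""
    n = data.shape[0]
    y = data[:, child]
    design = np.column_stack([np.ones(n)] + [data[:, p] for p in parents])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid_var = np.mean((y - design @ coef) ** 2)
    return -n * math.log(resid_var) - (len(parents) + 1) * math.log(n)


# Simulated chain X -> Y -> Z (columns 0, 1, 2); coefficients are arbitrary.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)
data = np.column_stack([x, y, z])

p_marginal = fisher_z_pvalue(data, 0, 2)           # X vs Z: dependent
p_given_y = fisher_z_pvalue(data, 0, 2, cond=[1])  # X vs Z given Y
print(f"p(X indep Z) = {p_marginal:.3g}, p(X indep Z | Y) = {p_given_y:.3g}")
print("BIC(Z | Y) > BIC(Z | no parents):",
      bic_node_score(data, 2, [1]) > bic_node_score(data, 2, []))
```

A constraint-based method would use results like these to remove the X-Z adjacency once Y is conditioned on, while a score-based method would compare the summed node scores of candidate graphs.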
What is the Big Data Problem on which the CCD is Primarily Focused?
The Number of Causal Model Structures as a Function of the Number of Measured Variables*
Number of variables (nodes)    Number of causal model structures
1                              1
2                              3
3                              25
4                              543
5                              29,281
6                              3,781,503
7                              1.1 x 10^9
8                              7.8 x 10^11
9                              1.2 x 10^15
10                             4.2 x 10^18
* Assumes there are no latent variables and no directed cycles.
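The counts in this table are the numbers of labeled directed acyclic graphs on n nodes. They can be reproduced with Robinson's recurrence, a(n) = Σ_{k=1..n} (-1)^(k+1) C(n,k) 2^(k(n-k)) a(n-k) with a(0) = 1. The short sketch below is an addition for illustration; the function name is hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(1, 11):
    print(n, count_dags(n))
```

The growth is super-exponential, which is why exhaustive enumeration is infeasible beyond a handful of variables and search heuristics are needed.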
Our Main Big Data Problem
Analyze biomedical datasets containing a large number of variables in order to generate plausible hypotheses of the causal relationships that hold among those variables
An Example Algorithm for Causal Discovery with Many Variables: FGES
• GES: A popular CBN learning algorithm that uses greedy search and Bayesian scoring*
• We developed a fast version of GES, called FGES
  – Optimized the single processor version of GES
  – Parallelized GES
* Chickering DM. Optimal structure identification with greedy search. Journal of Machine Learning Research 3 (2002) 507-554.
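To give a feel for greedy, score-driven structure search, here is a deliberately simplified sketch. It is not GES or FGES: GES searches over equivalence classes of DAGs with a forward and a backward phase, whereas this toy version only adds single edges to a DAG while the score improves. It also assumes linear Gaussian data and uses BIC in place of a Bayesian score. All function names are hypothetical.

```python
import math

import numpy as np


def bic_node_score(data, child, parents):
    """BIC term for one node in a linear Gaussian model (higher is better)."""
    n = data.shape[0]
    design = np.column_stack([np.ones(n)] + [data[:, p] for p in sorted(parents)])
    coef, *_ = np.linalg.lstsq(design, data[:, child], rcond=None)
    resid_var = np.mean((data[:, child] - design @ coef) ** 2)
    return -n * math.log(resid_var) - (len(parents) + 1) * math.log(n)


def creates_cycle(parents, src, dst):
    """True if adding src -> dst would close a directed cycle."""
    # A cycle appears exactly when dst is already an ancestor of src.
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


def greedy_forward_search(data):
    """Repeatedly add the single edge with the largest positive BIC gain."""
    d = data.shape[1]
    parents = {v: set() for v in range(d)}
    scores = {v: bic_node_score(data, v, parents[v]) for v in range(d)}
    while True:
        best_gain, best_edge = 0.0, None
        for src in range(d):
            for dst in range(d):
                if src == dst or src in parents[dst]:
                    continue
                if creates_cycle(parents, src, dst):
                    continue
                gain = bic_node_score(data, dst, parents[dst] | {src}) - scores[dst]
                if gain > best_gain:
                    best_gain, best_edge = gain, (src, dst)
        if best_edge is None:
            return parents  # no edge addition improves the score
        src, dst = best_edge
        parents[dst].add(src)
        scores[dst] += best_gain


# Simulated chain X -> Y -> Z (columns 0, 1, 2); coefficients are arbitrary.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)
learned = greedy_forward_search(np.column_stack([x, y, z]))
adjacencies = {frozenset((p, c)) for c, ps in learned.items() for p in ps}
print(sorted(tuple(sorted(a)) for a in adjacencies))
```

Because the score decomposes into one term per node, each candidate edge only requires rescoring its child node. The orientation of the learned edges should not be over-interpreted here: in a linear Gaussian model, graphs in the same equivalence class score the same, which is one reason GES works with equivalence classes directly.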
Evaluation of FGES
• Generated 10 random CBNs
  – 30,000 nodes and 60,000 edges
  – Continuous variables with linear relationships and Gaussian noise
• Sampled each CBN to generate 1,000 cases
• Provided those cases to FGES and measured its ability to learn the data-generating CBN
Average Directed Arc Precision    Average Directed Arc Recall    # Processors    Average Learning Time
99%                               84%                            128             2.3 minutes
For more information:
• http://arxiv.org/ftp/arxiv/papers/1507/1507.07749.pdf
• Ramsey J, Glymour C. A Million Variables and More: The Fast Greedy Search (FGS) Algorithm for Learning High Dimensional Graphical Causal Models (to appear).
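The slides do not spell out how directed arc precision and recall are computed. A plausible reading, assumed here, is that precision is the fraction of learned directed arcs that appear with the same orientation in the data-generating CBN, and recall is the fraction of true arcs that were recovered with the correct orientation. The toy arcs below are hypothetical.

```python
def directed_arc_precision_recall(true_arcs, learned_arcs):
    """Precision and recall over (source, target) pairs, orientation included."""
    true_arcs, learned_arcs = set(true_arcs), set(learned_arcs)
    correct = true_arcs & learned_arcs
    precision = len(correct) / len(learned_arcs) if learned_arcs else 0.0
    recall = len(correct) / len(true_arcs) if true_arcs else 0.0
    return precision, recall


# Hypothetical example: one arc is reversed and one is missing.
true_arcs = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")}
learned_arcs = {("A", "B"), ("B", "C"), ("D", "C")}
precision, recall = directed_arc_precision_recall(true_arcs, learned_arcs)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Under this definition a reversed arc counts against both metrics, since it is a wrong learned arc and a missed true arc at once.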
Another Example of an Algorithm for Causal Discovery with Many Variables: GFCI
• FGES assumes there are no latent confounders, that is, there are no latent variables that cause two or more measured variables
• Biomedical data often contain latent confounders
• GFCI* allows for the possibility of latent confounders
* Ogarrio JM, Spirtes P, Ramsey J (2016). A hybrid causal search algorithm for latent variable models. JMLR Workshop and Conference Proceedings, 52, 368-379.
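A small simulation illustrates why latent confounders matter. In the hypothetical setup below, an unmeasured variable causes both X and Y, and X has no effect on Y. The two measured variables are still strongly correlated, and the correlation disappears only when the latent variable is accounted for, which is not possible when it is unmeasured. The coefficients and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
latent = rng.normal(size=n)      # unmeasured common cause
x = latent + rng.normal(size=n)  # X has no causal effect on Y
y = latent + rng.normal(size=n)


def residual(target, regressor):
    """Part of target not linearly explained by regressor."""
    slope = np.cov(target, regressor)[0, 1] / np.var(regressor, ddof=1)
    return target - slope * regressor


corr_xy = np.corrcoef(x, y)[0, 1]
corr_xy_given_latent = np.corrcoef(residual(x, latent), residual(y, latent))[0, 1]
print(f"corr(X, Y) = {corr_xy:.3f}")
print(f"corr(X, Y | latent) = {corr_xy_given_latent:.3f}")
```

An algorithm that assumes no latent confounders would tend to place a direct edge between X and Y in this situation, which is the kind of error that latent-aware methods are designed to avoid.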
Evaluation of GFCI
• Generated more than 100 random CBNs
  – 1,000 nodes and 2,000 edges
  – Continuous variables with linear Gaussian relationships
• Sampled each CBN to generate 2,000 cases
• Provided cases to GFCI and measured its performance
% Latent Nodes    Average Directed Arc Precision    Average Directed Arc Recall    # Processors    Average Learning Time
5%                92%                               93%                            1               15 seconds
For more information: Ogarrio JM, Spirtes P, Ramsey J (2016). A hybrid causal search algorithm for latent variable models. JMLR Workshop and Conference Proceedings, 52, 368-379.
Ongoing Algorithm Work Includes …
• Modeling non-linear relationships
• Modeling causal feedback
• Handling a mixture of continuous and discrete variables
• Outputting uncertainty in edge relationships
• Learning the causal relationships among latent variables
Summary
• Causal discovery is central to biomedical science
• The variety, richness, and quantity of biomedical data are increasing rapidly
• The CCD is providing software now for analyzing big biomedical data to discover causal relationships
• Causal discovery algorithms with additional capabilities will soon be available as well
Acknowledgements
• Thanks to the members of the Algorithms Group of the Center for Causal Discovery for their contributions to the activities described in this talk.
• The Center for Causal Discovery is supported by grant U54HG008540 awarded by the National Human Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The content of this presentation is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Thank you
gfc@pitt.edu
CCD software is available at: www.ccd.pitt.edu
Extra Slides
Association Versus Causation
• Association
  – Represents statistical relationships
  – Predicts outcomes from passive observations
  – Example uses: classification and regression
• Causation
  – Represents mechanisms
  – Predicts outcomes of active intervention
  – Example uses: decision making and planning
Example
• Association: Smoking – lung cancer – coughing
  – Both smoking and coughing predict lung cancer
• Causation: Smoking → lung cancer → coughing
  – Smoking influences lung cancer
  – Coughing does not influence lung cancer
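The distinction can be made concrete with a small simulation of the chain Smoking → lung cancer → coughing. All probabilities below are invented for illustration. Observationally, coughing predicts lung cancer; when coughing is set by intervention, the lung cancer rate does not change.

```python
import random


def simulate(n, force_coughing=None, seed=0):
    """Sample (smoking, cancer, coughing) rows; optionally intervene on coughing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        smoking = rng.random() < 0.3
        cancer = rng.random() < (0.2 if smoking else 0.02)
        coughing = rng.random() < (0.8 if cancer else 0.2)
        if force_coughing is not None:
            coughing = force_coughing  # intervention overrides the mechanism
        rows.append((smoking, cancer, coughing))
    return rows


def cancer_rate(rows, keep):
    """Fraction with cancer among rows where keep(smoking, coughing) is true."""
    selected = [cancer for smoking, cancer, coughing in rows if keep(smoking, coughing)]
    return sum(selected) / len(selected)


# Passive observation: condition on whether coughing was seen.
observed = simulate(200_000)
rate_if_coughing = cancer_rate(observed, lambda s, c: c)
rate_if_not_coughing = cancer_rate(observed, lambda s, c: not c)

# Active intervention: force coughing on or off for everyone.
rate_do_coughing = cancer_rate(simulate(200_000, True, seed=1), lambda s, c: True)
rate_do_no_coughing = cancer_rate(simulate(200_000, False, seed=2), lambda s, c: True)

print(f"observed: P(cancer | cough) = {rate_if_coughing:.3f}, "
      f"P(cancer | no cough) = {rate_if_not_coughing:.3f}")
print(f"intervened: P(cancer | do(cough)) = {rate_do_coughing:.3f}, "
      f"P(cancer | do(no cough)) = {rate_do_no_coughing:.3f}")
```

The observational gap reflects association that runs against the causal arrow, while the matching interventional rates reflect that coughing is an effect of lung cancer and not a cause.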
Recent Examples of the Use of Graphical Causal Discovery Methods
Manelis, A., Almeida, J. R. C., Stiffler, R., Lockovich, J. C., Aslam, H. A., & Phillips, M. L. (2016). Anticipation-related brain connectivity in bipolar and unipolar depression: A graph theory approach. Brain, 139, 2554-2566.
Dobryakova, E., Costa, S. L., Wylie, G. R., DeLuca, J., & Genova, H. M. (2016). Altered effective connectivity during a processing speed task in individuals with multiple sclerosis. Journal of the International Neuropsychological Society: JINS, 22(2), 216-224.
Otsuka, J. (2016). Discovering phenotypic causal structure from nonexperimental data. Journal of evolutionary biology, 29(6), 1268-1277.
Attur, M., Statnikov, A., Samuels, J., Li, Z., Alekseyenko, A. V., Greenberg, J. D., et al. (2015). Plasma levels of interleukin-1 receptor antagonist (IL1Ra) predict radiographic progression of symptomatic knee osteoarthritis. Osteoarthritis and Cartilage, 23(11), 1915-1924.