Research paper
A new method for estimating the probability of causal relationships from observational data: Application to the study of the short-term effects of air pollution on cardiovascular and respiratory disease
Introduction
Identifying causal relationships in the medical domain, such as those between a disease and one or more previously unknown disease factors, can profoundly impact human health. However, randomized controlled trials (RCTs), the gold standard for discovering these relationships, require careful experimental design, making their implementation infeasible or unethical for many interesting causal questions. In contrast, the past decade has seen a dramatic increase in the amount of available observational data. Causal discovery algorithms seek to uncover causal relationships from these observational datasets in order to supplement RCTs.
This paper describes a method for causal discovery that combines two ideas from the literature: local causal discovery (LCD) and maximal ancestral graph (MAG) models. LCD algorithms generally search for causal relationships on subsets of three or four variables at a time. Accordingly, they are more computationally efficient than approaches that consider all variables at once [1], [2], [3], [4], [5]. MAG models, similar to directed acyclic graph (DAG) models, define families of probability distributions containing all conditional independence relations described by a graph. In particular, the independence relations of these models are defined by a MAG. Moreover, MAG models characterize all possible marginalizations of DAG models, which enables them to account for latent confounding; conceptually, latent confounders are variables that have been marginalized out of the model [6]. Intuitively, these ideas work well together because LCD marginalizes out a large portion of the variables for computational efficiency, while MAG models provide a framework for modeling these marginalizations.
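The link between marginalization and latent confounding can be illustrated with a small worked example. The sketch below is not taken from the paper: it assumes a hypothetical three-variable DAG X <- L -> Y with binary variables and made-up conditional probabilities, and uses exact arithmetic to check that X and Y are independent given L but become dependent once L is marginalized out. In a MAG, that residual dependence would be represented by a bidirected edge between X and Y.

```python
from fractions import Fraction
from itertools import product

# Hypothetical DAG X <- L -> Y over binary variables; L plays the latent confounder.
# All probabilities below are illustrative assumptions, not values from the paper.
P_L = {0: Fraction(1, 2), 1: Fraction(1, 2)}
P_X1_GIVEN_L = {0: Fraction(1, 5), 1: Fraction(4, 5)}    # P(X=1 | L=l)
P_Y1_GIVEN_L = {0: Fraction(3, 10), 1: Fraction(9, 10)}  # P(Y=1 | L=l)


def bernoulli(p_one, value):
    """Probability that a binary variable takes `value` when P(value=1) = p_one."""
    return p_one if value == 1 else 1 - p_one


def joint(l, x, y):
    """Joint probability P(L=l, X=x, Y=y) implied by the DAG factorization."""
    return P_L[l] * bernoulli(P_X1_GIVEN_L[l], x) * bernoulli(P_Y1_GIVEN_L[l], y)


def marginal_xy(x, y):
    """P(X=x, Y=y) after marginalizing out the latent variable L."""
    return sum(joint(l, x, y) for l in (0, 1))


def independent_marginally():
    """Check whether P(x, y) = P(x) P(y) for every pair of values."""
    for x, y in product((0, 1), repeat=2):
        p_x = sum(marginal_xy(x, b) for b in (0, 1))
        p_y = sum(marginal_xy(a, y) for a in (0, 1))
        if marginal_xy(x, y) != p_x * p_y:
            return False
    return True


def independent_given_l():
    """Check whether P(x, y | l) = P(x | l) P(y | l) for every value combination."""
    for l, x, y in product((0, 1), repeat=3):
        p_xy_given_l = joint(l, x, y) / P_L[l]
        p_x_given_l = bernoulli(P_X1_GIVEN_L[l], x)
        p_y_given_l = bernoulli(P_Y1_GIVEN_L[l], y)
        if p_xy_given_l != p_x_given_l * p_y_given_l:
            return False
    return True


if __name__ == "__main__":
    print("X independent of Y given L:", independent_given_l())
    print("X independent of Y after marginalizing L:", independent_marginally())
```

Exact fractions are used here so that the independence checks are equalities rather than tolerance-based comparisons.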
The contributions of this paper are as follows:
- We introduce the Ancestral Probabilities (AP) procedure for discovering causal relationships from observational data. The procedure uses a Bayesian approach to derive the probabilities of causal relationships by combining ideas from LCD with the framework of MAG models. Accordingly, AP is able to efficiently identify causal relationships while accounting for latent confounding.
- We investigate the effectiveness of AP in terms of discrimination and calibration on synthetically generated data, as well as the benefits of providing background knowledge.
- We use AP to investigate which airborne pollutants have a short-term causal effect on cardiovascular and respiratory disease. The results are largely consistent with EPA assessments of causality.
- The code for AP is publicly available on GitHub: https://github.com/bja43/anc-prob.
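Discrimination and calibration, as mentioned in the contributions above, can be quantified in several ways. The following is a minimal sketch of three common choices (area under the ROC curve, Brier score, and a binned calibration table); it is an assumption that these resemble the metrics used in the paper, and the function names and example values are hypothetical.

```python
def auroc(probs, labels):
    """Discrimination: probability that a true relationship outranks a false one."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count pairwise wins, scoring ties as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def calibration_table(probs, labels, n_bins=5):
    """Calibration: per bin, (mean predicted probability, observed frequency, count)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in the last bin
        bins[index].append((p, y))
    rows = []
    for members in bins:
        if members:  # skip empty bins
            mean_p = sum(p for p, _ in members) / len(members)
            frequency = sum(y for _, y in members) / len(members)
            rows.append((mean_p, frequency, len(members)))
    return rows


if __name__ == "__main__":
    # Hypothetical predicted probabilities and ground-truth indicators.
    predicted = [0.9, 0.8, 0.3, 0.1]
    truth = [1, 1, 0, 0]
    print("AUROC:", auroc(predicted, truth))
    print("Brier score:", brier_score(predicted, truth))
    print("Calibration rows:", calibration_table(predicted, truth, n_bins=2))
```

A well-calibrated method yields rows in which the mean predicted probability is close to the observed frequency in each bin.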
The remainder of this paper is organized as follows. Section 2 describes related LCD algorithms and reviews MAG models. Section 3 introduces the AP procedure and evaluates its effectiveness on synthetically generated data. Section 4 provides an analysis of which airborne pollutants have a short-term causal effect on cardiovascular and respiratory disease using the AP procedure. Section 5 states our conclusions.
Section snippets
Background
This section reviews prior LCD algorithms and MAG models. Throughout this paper, denotes a non-empty finite set of variables that act as vertices in the graphical context. More generally, we use the following notation:
- •
- •
where lowercase letters denote variables or singletons and uppercase letters denote sets of variables. We consider a collection of random variables indexed by and drawn from a probability measure . Probabilistic conditional independence between the members
The Ancestral Probabilities procedure
In this section, we formulate and analyze a Bayesian local causal discovery algorithm called the Ancestral Probabilities (AP) procedure. Assuming causal Markov and causal faithfulness, AP computes the probabilities of ancestral relationships between pairs of variables with respect to a local subset of variables. Suppose we have a dataset containing variables and . AP is motivated by the following question: “what is the probability that causally influences ?”, denoted . The AP
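One generic way to obtain the probability of an ancestral relationship in a Bayesian setting is to average over candidate graphs, weighting each graph by its posterior probability. The sketch below illustrates that idea only; it is not the AP procedure itself. It assumes, for simplicity, that the candidates are DAGs given as sets of directed edges (the paper works with MAGs), and that each graph comes with a hypothetical log score standing in for its log prior plus log marginal likelihood.

```python
import math


def is_ancestor(edges, source, target):
    """Return True if a directed path leads from `source` to `target`."""
    stack, seen = [source], set()
    while stack:
        node = stack.pop()
        for parent, child in edges:
            if parent == node and child not in seen:
                if child == target:
                    return True
                seen.add(child)
                stack.append(child)
    return False


def ancestral_probability(scored_graphs, source, target):
    """Posterior mass of the graphs in which `source` is an ancestor of `target`.

    `scored_graphs` is a list of (edges, log_score) pairs; the log scores are
    treated as unnormalized log posteriors (an assumption of this sketch).
    """
    log_scores = [score for _, score in scored_graphs]
    offset = max(log_scores)  # subtract the maximum for numerical stability
    weights = [math.exp(score - offset) for score in log_scores]
    total = sum(weights)
    supporting = sum(
        weight
        for (edges, _), weight in zip(scored_graphs, weights)
        if is_ancestor(edges, source, target)
    )
    return supporting / total


if __name__ == "__main__":
    # Hypothetical candidate graphs over {x, y, z} with made-up posterior weights.
    candidates = [
        ({("x", "y")}, math.log(0.5)),
        ({("x", "z"), ("z", "y")}, math.log(0.2)),
        ({("y", "x")}, math.log(0.2)),
        (set(), math.log(0.1)),
    ]
    print("P(x is an ancestor of y):", ancestral_probability(candidates, "x", "y"))
```

Restricting attention to a small local subset of variables keeps the number of candidate graphs manageable, which is what makes this kind of averaging tractable.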
Airborne pollutants’ short-term effect on health
In this section, we apply the AP procedure to a dataset measuring airborne pollutants, cardiovascular health, and respiratory health. We joined local air composition data from the Environmental Protection Agency (EPA) with clinical data from the University of Pittsburgh Medical Center (UPMC) at the ZIP code-month level. Measurements of the airborne concentration of 160 pollutants were collected from Pennsylvania air-monitoring stations in 2015 and used to construct the pollution variables. In
Conclusions
We designed a local causal discovery algorithm called the ancestral probabilities (AP) procedure, which estimates the posterior probabilities of causal relationships. Limitations of this method include:
- AP only scales to five variables (LCD methods seldom consider more than four variables);
- AP does not currently model selection variables [21].
Analyses on synthetically generated data and on a real dataset measuring airborne pollutants, cardiovascular health, and respiratory health suggest that the
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The research reported in this paper was supported by grant U54HG008540 awarded by the National Human Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative, grant #4100070287 from the Pennsylvania Department of Health (DOH), and grant IIS-1636786 from the National Science Foundation. Author BA also received support from training grant T15LM007059 from the National Library of Medicine. Additionally, we thank our anonymous reviewers for their
References (21)
- A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Min Knowl Discov (1997)
- et al. Causal discovery using a Bayesian local causal discovery algorithm. Stud Health Technol Inform (2004)
- et al. A theoretical study of Y structures for causal discovery
- et al. Data-driven covariate selection for nonparametric estimation of causal effects
- et al. Local constraint-based causal discovery under selection bias
- et al. Ancestral graph Markov models. Ann Statist (2002)
- Markov properties for acyclic directed mixed graphs. Scand J Stat (2003)
- et al. Markov properties for mixed graphs. Bernoulli (2014)
- et al. On the completeness of causal discovery in the presence of latent confounding with tiered background knowledge
- et al. The comparison and evaluation of forecasters. J R Stat Soc Ser D (Stat) (1983)
1 Contributed equally to this work.