Research paper
A new method for estimating the probability of causal relationships from observational data: Application to the study of the short-term effects of air pollution on cardiovascular and respiratory disease

https://doi.org/10.1016/j.artmed.2023.102546

Highlights

  • We introduce the AP procedure for estimating the probability of causal relationships.

  • AP takes observational data as input and accounts for possible latent confounding.

  • We evaluate AP on simulated data with and without background knowledge.

  • We investigate what airborne pollutants cause cardiovascular and respiratory disease.

  • The results are largely consistent with the EPA’s assessments of causality.

Abstract

In this paper we investigate which airborne pollutants have a short-term causal effect on cardiovascular and respiratory disease using the Ancestral Probabilities (AP) procedure, a novel Bayesian approach for deriving the probabilities of causal relationships from observational data. The results are largely consistent with EPA assessments of causality; however, in a few cases AP suggests that some pollutants thought to cause cardiovascular or respiratory disease are associated with it purely because of confounding.

The AP procedure utilizes maximal ancestral graph (MAG) models to represent and assign probabilities to causal relationships while accounting for latent confounding. The algorithm does so locally by marginalizing over models with and without causal features of interest. Before applying AP to real data, we evaluate it in a simulation study and investigate the benefits of providing background knowledge. Overall, the results suggest that AP is an effective tool for causal discovery.

Introduction

Identifying causal relationships in the medical domain, such as those between a disease and one or more previously unknown disease factors, can profoundly impact human health. However, randomized controlled trials (RCTs), the gold standard for discovering these relationships, require careful experimental design, making their implementation infeasible or unethical for many interesting causal questions. In contrast, the past decade has seen a dramatic increase in the amount of available observational data. Causal discovery algorithms seek to uncover causal relationships from these observational datasets in order to supplement RCTs.

This paper describes a method for causal discovery that combines two ideas from the literature: local causal discovery (LCD) and maximal ancestral graph (MAG) models. LCD algorithms generally search for causal relationships on subsets of three or four variables at a time. Accordingly, they are more computationally efficient than approaches that consider all variables at once [1], [2], [3], [4], [5]. MAG models, similar to DAG models, define families of probability distributions containing all conditional independence relations described by a graph. In particular, the independence relations of these models are defined by a MAG. Moreover, MAG models characterize all possible marginalizations of DAG models, which enables MAG models to account for latent confounding; conceptually, latent confounders are variables that have been marginalized out of the model [6]. Intuitively, these ideas work well together because LCD marginalizes out a large portion of the variables for computational efficiency, while MAG models provide a framework for modeling these marginalizations.
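To make the role of latent confounding concrete, the following minimal Python sketch simulates two observed variables that share an unobserved common cause. It is an illustration only and not part of the AP procedure: the linear-Gaussian data-generating model, the sample size, and the helper names `pearson` and `residuals` are all assumptions introduced here.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """Residuals of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 5000

# Hypothetical model: a latent variable drives both a and b,
# and neither a nor b causes the other.
latent = [rng.gauss(0, 1) for _ in range(n)]
a = [l + rng.gauss(0, 1) for l in latent]
b = [l + rng.gauss(0, 1) for l in latent]

# With the latent variable marginalized out, a and b look associated.
marginal_corr = pearson(a, b)

# If the latent variable were observed, adjusting for it would remove
# the association (partial correlation computed via residuals).
partial_corr = pearson(residuals(a, latent), residuals(b, latent))

print(f"correlation of a and b:        {marginal_corr:.3f}")
print(f"partial correlation given the latent: {partial_corr:.3f}")
```

In this toy setting the marginal association between a and b arises entirely from the variable that was marginalized out, which is the situation a model over the observed variables alone needs to be able to represent.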

The contributions of this paper are as follows:

  • We introduce the Ancestral Probabilities (AP) procedure for discovering causal relationships from observational data. The procedure uses a Bayesian approach for deriving the probabilities of causal relationships by combining ideas from LCD with the framework of MAG models. Accordingly, AP is able to efficiently identify causal relationships while accounting for latent confounding.

  • We investigate the effectiveness of AP in terms of discrimination and calibration on synthetically generated data and the benefits of providing background knowledge.

  • We investigate which airborne pollutants have a short-term causal effect on cardiovascular and respiratory disease using AP. The results are largely consistent with EPA assessments of causality.

  • The code for AP is publicly available on GitHub: https://github.com/bja43/anc-prob.
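The Bayesian idea behind the first contribution, assigning a probability to a causal feature by averaging over models with and without it, can be sketched generically as follows. This is a minimal illustration under stated assumptions and not the authors' implementation (the repository above is the reference for that): the three candidate models, their log marginal likelihoods, the uniform prior, and the helper `feature_posterior` are hypothetical placeholders.

```python
import math

def feature_posterior(models):
    """Posterior probability of a feature, averaged over candidate models.

    Each model is a tuple (log_marginal_likelihood, log_prior, has_feature).
    The result is the normalized posterior mass of models with the feature.
    """
    log_joint = [ll + lp for ll, lp, _ in models]
    shift = max(log_joint)  # subtract the max for numerical stability
    weights = [math.exp(v - shift) for v in log_joint]
    with_feature = sum(w for w, (_, _, has) in zip(weights, models) if has)
    return with_feature / sum(weights)

# Placeholder scores: in practice these would come from scoring each
# candidate model against the data; here they are made-up numbers.
log_prior = math.log(1.0 / 3.0)  # assumed uniform prior over three models
candidates = [
    (-100.0, log_prior, True),   # model containing the feature of interest
    (-102.0, log_prior, False),  # model without the feature
    (-101.0, log_prior, False),  # another model without the feature
]

p_feature = feature_posterior(candidates)
print(f"posterior probability of the feature: {p_feature:.4f}")
```

Working in log space and subtracting the maximum before exponentiating avoids underflow when marginal likelihoods are very small, which is typical for datasets of realistic size.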

The remainder of this paper is organized as follows. Section 2 describes related LCD algorithms and reviews MAG models. Section 3 introduces the AP procedure and evaluates its effectiveness on synthetically generated data. Section 4 provides an analysis of which airborne pollutants have a short-term causal effect on cardiovascular and respiratory disease using the AP procedure. Section 5 states our conclusions.

Background

This section reviews prior LCD algorithms and MAG models. Throughout this paper, V denotes a non-empty finite set of variables that act as vertices in the graphical context. More generally, we use the following notation:

  • a, b, c ∈ V

  • A, B, C ⊆ V

where lowercase letters denote variables or singletons and uppercase letters denote sets of variables. We consider a collection of random variables X indexed by V and drawn from a probability measure P. Probabilistic conditional independence between the members

The Ancestral Probabilities procedure

In this section, we formulate and analyze a Bayesian local causal discovery algorithm called the Ancestral Probabilities (AP) procedure. Assuming causal Markov and causal faithfulness, AP computes the probabilities of ancestral relationships between pairs of variables with respect to a local subset of variables. Suppose we have a dataset x1, …, xn containing variables a and b. AP is motivated by the following question: “what is the probability that a causally influences b?”, denoted ab. The AP

Airborne pollutants’ short-term effect on health

In this section we apply the AP procedure to a dataset measuring airborne pollutants, cardiovascular health, and respiratory health. We joined local air composition data from the Environmental Protection Agency (EPA) with clinical data from the University of Pittsburgh Medical Center (UPMC) at the ZIP code-month level. Measurements for the airborne concentration of 160 pollutants were collected from Pennsylvania air-monitoring stations in 2015 and used to construct the pollution variables. In

Conclusions

We designed a local causal discovery algorithm called the Ancestral Probabilities (AP) procedure, which estimates the posterior probabilities of causal relationships. Limitations of this method include:

  • AP only scales to five variables (LCD methods seldom consider more than four variables);

  • AP does not currently model selection variables [21].

Analyses on synthetically generated data and on a real dataset measuring airborne pollutants, cardiovascular health, and respiratory health suggest that the

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The research reported in this paper was supported by grant U54HG008540 awarded by the National Human Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative, grant #4100070287 from the Pennsylvania Department of Health (DOH), and grant IIS-1636786 from the National Science Foundation. Author BA also received support from training grant T15LM007059 from the National Library of Medicine. Additionally, we thank our anonymous reviewers for their

References (21)

  • Cooper G.F.

    A simple constraint-based algorithm for efficiently mining observational databases for causal relationships

    Data Min Knowl Discov

    (1997)

  • Mani S. et al.

    Causal discovery using a Bayesian local causal discovery algorithm

    Stud Health Technol Inform

    (2004)

  • Mani S. et al.

    A theoretical study of Y structures for causal discovery

  • Entner D. et al.

    Data-driven covariate selection for nonparametric estimation of causal effects

  • Versteeg P. et al.

    Local constraint-based causal discovery under selection bias

  • Richardson T.S. et al.

    Ancestral graph Markov models

    Ann Statist

    (2002)

  • Richardson T.S.

    Markov properties for acyclic directed mixed graphs

    Scand J Stat

    (2003)

  • Sadeghi K. et al.

    Markov properties for mixed graphs

    Bernoulli

    (2014)

  • Andrews B. et al.

    On the completeness of causal discovery in the presence of latent confounding with tiered background knowledge

  • DeGroot M.H. et al.

    The comparison and evaluation of forecasters

    J R Stat Soc Ser D (Stat)

    (1983)

There are more references available in the full text version of this article.

1 Contributed equally to this work.
