Copula PC Algorithm for Causal Discovery from Mixed Data

Cui, Ruifei; Groot, Perry; Heskes, Tom

doi:10.1007/978-3-319-46227-1_24

Ruifei Cui¹⁷,
Perry Groot¹⁷ &
Tom Heskes¹⁷

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 9852))

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

4737 Accesses
11 Citations

Abstract

We propose the ‘Copula PC’ algorithm for causal discovery from a combination of continuous and discrete data, assumed to be drawn from a Gaussian copula model. It is based on a two-step approach. The first step applies Gibbs sampling on rank-based data to obtain samples of correlation matrices. These are then translated into an average correlation matrix and an effective number of data points, which in the second step are input to the standard PC algorithm for causal discovery. A stable version naturally arises when rerunning the PC algorithm on different Gibbs samples. Our ‘Copula PC’ algorithm extends the ‘Rank PC’ algorithm, which has been designed for Gaussian copula models for purely continuous data. In simulations, ‘Copula PC’ indeed outperforms ‘Rank PC’ in cases with mixed variables, in particular for larger numbers of data points, at the expense of a slight increase in computation time.

You have full access to this open access chapter, Download conference paper PDF

Discovery of Causal Models that Contain Latent Variables Through Bayesian Scoring of Independence Constraints

A KNN-Based Non-Parametric Conditional Independence Test for Mixed Data and Application in Causal Discovery

Constraint-based causal discovery with mixed data

Article Open access 02 February 2018

Michail Tsagris, Giorgos Borboudakis, … Ioannis Tsamardinos

Keywords

1 Introduction

Causal discovery, or causal structure learning [23], aims to find an underlying directed acyclic graph (DAG), which represents direct causal relations between variables. It is a very popular approach for multivariate data analysis and therefore is widely studied in the past few years, resulting in lots of algorithms. The PC [27, 28] algorithm can be considered the reference causal discovery algorithm. It makes use of conditional independence tests to build the underlying DAG from observations. Starting from a complete undirected graph, the PC algorithm removes edges recursively according to the outcome of the conditional independence tests. This procedure yields an undirected graph, also called the skeleton. After applying various edge orientation rules, it finally gives back a partially directed graph to represent the underlying DAGs.

One advantage of the PC algorithm is that it is computationally feasible for sparse graphs even with thousands of variables. Therefore, it is widely used in high-dimensional settings, generating a variety of applications [20, 29]. Also, open-source software is available like pcalg [17] and the Tetrad project [25].

When applied to Gaussian models, the PC algorithm tests conditional independence using partial correlation based on Pearson correlations between variables: when the joint distribution is a multivariate Gaussian, pairwise conditional independence is equivalent to the vanishing of the corresponding partial correlation [18]. Following [12], we will refer to the PC algorithm for Gaussian models as the ‘Pearson PC’ algorithm. As input it takes the correlation matrix of the observed data and the number of data points. The number of data points is needed for the conditional independence tests: the higher the number of data points, the more reliable the observed correlation matrix as an estimate of the (unknown) true correlation matrix, and the more easily the null hypothesis of conditional independence (given the same value for the partial correlation and the significance level) gets rejected. Under relatively mild assumptions regarding the sparseness of the true underlying DAG, the ‘Pearson PC’ algorithm shows uniform consistency [16].

Harris and Drton [12] extend the PC algorithm to non-parametric Gaussian (nonparanormal) models, i.e., continuous data assumed to be generated from a Gaussian copula model. They propose to apply the standard PC algorithm, but then replacing the Pearson correlation matrix with rank-based measures of correlation. The so-called ‘Rank PC’ (RPC) algorithm works as well as the ‘Pearson PC’ algorithm on normal data and much better on non-normal data, and is shown to be uniformly consistent in high-dimensional settings.

In this paper, we aim to generalize the ‘Pearson PC’ and ‘Rank PC’ algorithm to Gaussian copula models that can also handle binary and ordinal variables. The ‘Rank PC’ algorithm is explicitly limited to the continuous situation, where ties appear with probability zero, making ranks well-defined. In the presence of binary and ordinal variables, ties make the rank correlations between observed variables different from those between the corresponding latent variables in the Gaussian copula setting. Ignorance of this difference typically leads to underestimates of the (absolute) correlations [13].

It is tempting to follow a similar two-step approach as for ‘Rank PC’: first estimate the correlation matrix in the latent space and then use this as input to the standard PC algorithm. This, however, is not as straightforward as it may seem, for two reasons. First, because of the ties, estimating the correlation matrix of Gaussian copula models for mixed data is considerably more complicated. Second, the ties imply a loss of information, which makes that our estimate of the correlation matrix will tend to be less reliable than in the fully continuous case, which should be accounted for when applying the conditional independence tests in the PC algorithm.

To solve both issues, we propose to make use of a Gibbs sampling procedure, specifically the one derived by Hoff [13] based on the so-called extended rank likelihood. This procedure is relatively straightforward and easy to implement (see the code in the Appendix of [13]). For purely Gaussian data, the correlation matrix samples follow a specific kind of inverse-Wishart distribution [3], which we refer to as the projected inverse-Wishart distribution. Projected inverse-Wishart distributions are characterized by two parameters: the scale matrix and the degrees of freedom; the former relates to the average correlation matrix and the latter to the number of data points. As we will show, under the projected inverse-Wishart, the variance of each off-diagonal element of the correlation matrix is an approximate function of its expectation and the degrees of freedom: the more degrees of freedom, the smaller the variance. The idea is now to estimate the scale matrix and degrees of freedom from the Gibbs samples of more general Gaussian copula models on mixed data, as if they were also drawn from a projected inverse-Wishart distribution. The scale matrix is translated into a correlation matrix and the degrees of freedom into a so-called ‘effective number of data points’, to take into account the reliability of our estimate of the correlation matrix. These are then input to the standard PC algorithm for causal discovery.

We refer to our two-step procedure as the ‘Copula PC’ (CoPC) algorithm. We also derive a stable version, referred to as ‘Stable Copula PC’ (SCPC), which runs PC repeatedly on a number of Gibbs samples. Experimental results show that both CoPC and SCPC outperform the current ‘Rank PC’ algorithm in mixed databases with discrete and continuous variables.

The rest of this paper is organized as follows. Section 2 reviews some relevant background information and analyzes issues of existing algorithms in more detail. Section 3 proposes an approximate inference method for the correlation matrix and the effective number of data points based on the projected inverse-Wishart distribution, and then derives the resulting algorithms CoPC and SCPC. Section 4 compares CoPC and SCPC with the ‘Rank PC’ algorithm on simulated data and provides an illustration on real-world data of ADHD patients. Section 5 gives conclusions and future work.

2 Preliminaries and Problem Analysis

In this section, we first review some necessary background information on causal discovery, then briefly introduce the PC and ‘Rank PC’ algorithm, and finally analyze the challenges current PC algorithms face for mixed data.

2.1 Causal Structure Learning

A graph $G = (\varvec{V},\varvec{E})$ consists of a set of vertices $\varvec{V} = \{X_1,\ldots ,X_p\}$, representing random variables, and a set of edges $\varvec{E}$, representing relations between pairs of variables. A graph is directed if it only contains directed edges while it is undirected if it only contains undirected edges. Graphs containing both directed and undirected edges are called partially directed graphs. A graph with no directed cycles, e.g., $X_i\rightarrow X_j \rightarrow X_i$ is acyclic. A graph which is both directed and acyclic is a Directed Acyclic Graph (DAG). If there is an edge from $X_i$ to $X_j$, $X_i$ is the parent of $X_j$. The set of parents of $X_j$ in graph G is denoted by $pa(G,X_j)$.

A multivariate probability distribution P over variables $\varvec{V}=\{X_1,\ldots ,X_p\}$ is said to factorize according to a DAG $G = (\varvec{V},\varvec{E})$, if the joint probability density function of P can be written as the product of the conditional densities of each variable given its parents in G, i.e., $f(X_1,\ldots ,X_p) = \prod \nolimits _{i=1}^p {f(X_i|pa(G,X_i))}$. If this condition holds, we can read off conditional independence relationships in distribution P from the DAG via a graphical criterion called d-separation [24]. D-separation implies that each variable is independent of its non-descendants given its parents.

Several DAGs may, via d-separation, correspond to the same set of conditional independencies. Such DAGs form a Markov equivalence class, which can be uniquely represented by a completed partially directed acyclic graph (CPDAG) [6]. Arcs in a CPDAG indicate a cause-effect relation between variables since the same arc occurs in all members of the CPDAG. Undirected edges $X_i-X_j$ in a CPDAG indicate that some of its members contain an arc $X_i\rightarrow X_j$ whereas other members contain an arc $X_j\rightarrow X_i$. The aim of causal discovery is to learn the Markov equivalence class of a DAG $G=(\varvec{V},\varvec{E})$ from n i.i.d. observations of $\varvec{V}$.

2.2 PC Algorithm and Rank PC Algorithm

The PC algorithm starts from a complete undirected graph, and then removes edges recursively according to conditional independencies yielding a partially connected undirected graph called the skeleton, after which some orientation rules are applied to direct as many edges as possible, resulting in a completed partially directed acyclic graph, i.e. the underlying CPDAG.

During the process, testing conditional independence plays the most important role. The PC algorithm uses partial correlation, denoted by ${\rho _{uv|S}}$, to do it. The correlation matrix from independent observations of a random vector $\varvec{Z}$ can be used to estimate ${\rho _{uv|S}}$ [2]. Then, classical decision theory is applied to judge conditional independencies using significance level $\alpha $,

(1)

where $u \ne v$ and $S\subseteq \{1,\ldots ,p\}\backslash \{u,v\} $. Thus in order to run the PC algorithm we need the correlation matrix corresponding to the data to estimate $\hat{\rho }_{uv|S}$ and the number of observations n.

The PC algorithm has been extended to a broader class of Gaussian copula by using rank correlations to replace Pearson correlations, resulting in the ‘Rank PC’ algorithm [12]. Rank correlations, typically Spearman’s $\rho $ and kendall’s $\tau $, only consider the ranks among observations, ignoring the actual variables.

Definition 1 (Gaussian Copula Model)

Consider two random vectors $\mathbf Z =(Z_1,\ldots ,Z_p)$ and $\mathbf Y =(Y_1,\ldots ,Y_p)$, satisfying the conditions $\varvec{Z}\sim \mathcal {N}(0,C)$ and $Y_i = F_i^{-1}(\varPhi (Z_i))$ for $i=1,\ldots ,p$ where C denotes the correlation matrix of $\varvec{Z}$ and ${F_{i}}^{-1}(t) = \inf \{ y: F_{i}(y) \ge t\}$ is the pseudo-inverse of a cumulative distribution function $F_i$. Then this model is called Gaussian copula model with correlation matrix C and univariate margins $F_i$.

In the Gaussian copula model, when all margins are continuous, ties occur with zero probability making ranks well-defined. For such so-called nonparanormal models, the sample correlations among ranks can naturally be used as an estimator for the Pearson correlation in the latent space. In this nonparanormal setting, RPC has been shown to perform well [12, 19].

2.3 Challenges for Mixed Data

RPC works well on continuous data, because tied observations occur with probability zero. In the presence of discrete margins, however, the estimator used in RPC is no longer consistent because of the tied observations. In this case, standard rank-based correlation will be different from the true correlation in latent space [13], typically underestimating it. Hence, our first challenge is to estimate the underlying C efficiently and consistently from mixed data.

A second challenge concerns the information loss incurred by discrete variables. Specifically, simply setting n in Eq. (1) to the number of data points can lead to an underestimate of the p-values provided by the conditional independence tests. To solve this problem, we introduce the notion of an effective number of data points.

3 Approximate Inference and Copula PC Algorithm

In this section, we introduce an approximate inference approach for the underlying correlation matrix and the effective number of data points from mixed data. Subsection 3.1 introduces the projected inverse-Wishart distribution and its application to Gaussian models. Subsection 3.2 discusses how to obtain correlation matrix samples from mixed data using a Gibbs sampling procedure. Subsection 3.3 shows how to use these samples to estimate the two parameters of the projected inverse-Wishart distribution: the scale matrix (as the underlying correlation matrix) and the degrees of freedom (as the effective number of data points). Subsection 3.4 gives the resulting Copula PC algorithm and the Stable Copula PC algorithm.

3.1 Projected Inverse-Wishart Distribution

Priors on correlation matrices are typically derived by choosing the inverse-Wishart distribution, denoted by $\mathcal{W}^{-1}(\varSigma ;\varPsi _0,\nu )$, as a prior on covariance matrices and then turning the covariance matrices into a correlation matrix to end up with an implied distribution on the correlation matrix. We choose $\varSigma $ from $\mathcal{W}^{-1}(\varSigma ;\varPsi _0,\nu )$ and write

$$\begin{aligned} P(C) = \mathcal{PW}^{-1}(C;\varPsi _0,\nu ) \end{aligned}$$

(2)

where $C_{ij} = \frac{\varSigma _{ij}}{\sqrt{\varSigma _{ii}\varSigma _{jj}}}$ for $\forall $ i, j. Since many covariance matrices possibly correspond to the same correlation matrix, the above process can be considered as a projection from covariance matrices to a correlation matrix. Therefore, we refer to this distribution on correlation matrix C as a projected inverse-Wishart distribution.

For Gaussian models, the projected inverse-Wishart distribution gives exact inference [21]. Specifically, given data $\varvec{Z} = (\varvec{z_1},\ldots ,\varvec{z_n})$, the posterior reads

$$ P(\varSigma |\varvec{Z}) = \mathcal{W}^{-1}(\varSigma ;\varPsi _0 + \varPsi ,\nu + n)\, \text {and}\, P(C|\varvec{Z}) = \mathcal{PW}^{-1}(C;\varPsi _0 + \varPsi ,\nu + n), $$

with $\varPsi =\varvec{Z}^T\varvec{Z}$. Also, the projected inverse-Wishart is scale invariant [3, 14], in the sense that we can make the posterior distribution on correlation matrices independent of the scale of the data by choosing $\varPsi _0 = \mathbb {0}$, or perhaps better, $\varPsi _0 = \epsilon \mathbb {1}$ in the limit $\epsilon \downarrow 0$.

Summarizing, we consider the prior distribution

$$ P(\varSigma ) = \mathcal{W}^{-1}(\varSigma ; \epsilon \mathbb {1}, p+1)\, \text {in the limit}\, \epsilon \downarrow 0, $$

which in fact boils down to the well-known improper Jeffreys prior [32]:

$$ P(\varSigma ) \propto \Vert \varSigma \Vert ^{-(p+1)}. $$

For Gaussian copula models, although there is no analytical expression, we still expect that the posterior $P(C|\varvec{Y})$ can be approximated through a projected inverse-Wishart distribution, i.e., $P(C|\varvec{Y}) \approx \mathcal{PW}^{-1}(C;\varPsi ,\nu )$ for some $\varPsi $ and $\nu $.

3.2 Gibbs Sampler Based on Extended Rank Likelihood

Hoff [13] describes an elegant procedure to obtain samples from $P(C|\varvec{Y})$ for a Gaussian copula model. The essence is that we only consider the ranks among observations, hence the name extended rank likelihood, ignoring the actual variables. Since the cumulative distribution functions $F_i(Y_i)$ are non-decreasing, observing $y_{i_1,j} < y_{i_2,j}$ implies that $z_{i_1,j} < z_{i_2,j}$, where $y_{i_1,j}$ denotes the $i_1^{th}$ observation of the $j^{th}$ component of random vector $\varvec{Y}$, To be precise, observing $\varvec{Y}=(\varvec{y}_1,\ldots ,\varvec{y}_n)$ tells us that $\varvec{Z}=(\varvec{z}_1,\ldots ,\varvec{z}_n)$ must lie in the set

$$ \left\{ \varvec{Z}\in {\mathbb {R}}^{n\times p}:\max \left\{ z_{k,j}:y_{k,j}<y_{i,j}\right\}<z_{i,j}<\min \left\{ z_{k,j}:y_{i,j}<y_{k,j}\right\} \right\} . $$

Strong posterior consistency for C under the extended rank likelihood has been proved in the situation with both discrete and continuous marginal distribution functions [22].

An off-the-shelf sampling algorithm based on the extended rank likelihood is full Gibbs sampling [13]. The code of this sampling algorithm is provided in the Appendix of [13]. In this algorithm, each component of $\varvec{Z}$ is initialized according to the rank information of the corresponding component of $\varvec{Y}$, after which each component is resampled alternatively. Here we propose a slight modification by just resampling the discrete components instead of all of them. Experimental tests reveal that the results of this faster sampling approach are indistinguishable from Hoff’s original Gibbs sampler. Although this modification is quite straightforward, it significantly reduces computation time because sampling continuous variables is far more time-consuming than sampling discrete ones in Hoff’s Gibbs sampler. We will refer to this modified sampling algorithm as SamplingAlgo.

So, given the observed data $\varvec{Y}$, samples on the underlying correlation matrix, denoted by $\{C^{(1)},\ldots {},C^{(m)}\}$, can be obtained using SamplingAlgo.

3.3 Estimation of the Correlation Matrix and the Effective Number of Data Points

This subsection aims to estimate the underlying correlation matrix and the effective number of data points from the obtained samples.

Theorem 1 suggests a procedure to estimate the parameters $\varPsi $ and $\nu $ from samples of a projected inverse-Wishart distribution $\mathcal{PW}^{-1}(C;\varPsi ,\nu )$.

Theorem 1

If the correlation matrix C follows a projected inverse-Wishart distribution with parameters $\varPsi $ ($\varPsi _{ii}=1$) and $\nu $, i.e.,

$$ P(C) = \mathcal{PW}^{-1}(C;\varPsi ,\nu ), $$

then for each off-diagonal element $C_{ij} (i \ne j)$ and large $\nu $, we have

$$ \mathrm{E}\,[C_{ij}] \approx \varPsi _{ij}\, \text {and}\, \mathrm{Var}\,[C_{ij}] \approx {(1 - (\varPsi _{ij})^2)^2 \over \nu }. $$

The proof is given in the Appendix.

According to Theorem 1, the mean over samples of C is an excellent approximation of $\varPsi $. As for $\nu $, we have,

$$\begin{aligned} \nu \approx {(1 - (\mathrm{E}\,[C_{ij}])^2)^2 \over \mathrm{Var}\,[C_{ij}]}. \end{aligned}$$

(3)

The idea now is to apply the same estimates, as if the samples obtained by Gibbs sampling the Gaussian copula model on mixed data also (approximately) follow a projected inverse-Wishart distribution. Specifically, for the effective number of data points $\hat{n}$, we propose to take the average over all $p(p-1)/2$ estimates on $\nu $ that can be computed by applying (3) to each upper triangular element of a p-dimensional correlation matrix C.

3.4 Copula PC Algorithm and Stable Copula PC Algorithm

Now, we turn the previous results into a working algorithm. The two key input arguments of the ‘Pearson PC’ algorithm are the correlation matrix and the number of data points. In the general Gaussian copula model, we take the mean over $\{C^{(1)},\ldots {},C^{(m)}\}$ and the mean over $p(p-1)/2$ estimates on $\nu $ as the two arguments respectively, resulting in the Copula PC algorithm.

Next, we introduce a stable version of the Copula PC algorithm. We take l instances from all the m samples. For each instance, a corresponding graph can be obtained via the ‘Pearson PC’ algorithm using the earlier estimated effective number of data points, by which a collection of l graphs can be generated, denoted by $\{\widetilde{G}_1,\ldots ,\widetilde{G}_l\}$. We keep those edge marks that emerge with a probability higher than a pre-defined threshold $\beta $ and remove the others, leading to a resulting graph. Since this resulting graph seemingly contains only ‘stable’ edge marks, we call this method stable Copula PC algorithm (SCPC). The size of l has a linear influence on running time because choosing l means the ‘Pearson PC’ algorithm would run l times. As for $\beta $, a small value means keeping more edge marks and vice versa. The Copula PC algorithm and its stable version are summarized in Algorithm 1.

4 Experiments

In this section, we first verify the property of the projected inverse-Wishart distribution described by Eq. (3) and check whether it still holds in the presence of discrete variables. Then, we compare the proposed CoPC and SCPC with the ‘Rank PC’ algorithm on simulated data and give an illustration on real-world data of ADHD patients.

Following Kalisch and Bühlmann [16], we simulate random DAGs and draw samples from the distributions faithful to them. Firstly, we generate an adjacency matrix A, whose entries are zero or in the interval $\left[ 0.1,1\right] $. There exists a directed edge from i to j in the corresponding DAG, if $i < j$ and $A_{ji} \ne 0$. The DAGs generated in this way have the property $\mathrm{E}\,(N_i) = s(p-1)$, where $N_i$ is the number of neighbors of node i, and s is the probability that there is an edge between any two nodes, called the sparseness parameter. Then, the samples of a random vector $\varvec{Z}$ are drawn through

$$\begin{aligned} \varvec{Z} = A\varvec{Z} + \epsilon , \end{aligned}$$

(4)

where $\epsilon = (\epsilon _1,\ldots ,\epsilon _p)$ is a vector of independent standard normal random variables. The data generated in this way follow a multivariate Gaussian distribution.

4.1 Estimation for the Effective Number of Data Points

As argued in Subsect. 3.3, the expectation and variance of the elements of correlation matrices drawn from a projected inverse-Wishart distribution are strongly related. To check this relationship, we proceed as follows: (1) we generate a random p-dimensional correlation matrix $\varPsi $; (2) we draw 500 samples from a projected inverse-Wishart distribution with parameters $\varPsi $ and $\nu $; (3) for each upper triangular element, we plot its variance against its expectation.

The left panel in Fig. 1 shows a typical result for $p=20$ and $\nu =1000$. We see that almost all pairs are distributed around the theoretical curve (solid line) especially when the expectation is far from zero, which indicates that it is indeed possible to infer $\nu $ of a projected inverse-Wishart distribution via the expectation and variance of off-diagonal elements.

Next, we study how our inference method works for estimating $\hat{n}$ in different cases. We first generate n samples of $\varvec{Z}$ using Eq. (4) and discretize some of the variables to obtain the simulated samples of the observed random vector $\varvec{Y}$. Then, we run SamplingAlgo to get samples of the underlying C. The results for $p=20$ and $n=1000$ for different cases are shown in Fig. 1 (right panel), where ‘bins = 2’ means that all variables are binary, ‘bins = 4’ means that all variables are ordinal with 4 levels and ‘continuous’ means that all variables are kept continuous. We take ${(1 - (\mathrm{E}\,[C_{ij}])^2)^2}$ for the x-axis and $n\times \mathrm{Var}\,[C_{ij}]$ for the y-axis, so that all data points are expected to be distributed around a straight line with slope $n/\hat{n}$. For purely continuous variables, a straight line with slope 1 gives an almost perfect fit, as expected. For ordinal and binary variables, we still find a clear trend, but mild deviations from a perfect straight line, indicating that the projected inverse-Wishart distribution is a fine, but not perfect approximation of the exact posterior. The stronger the discretization, the larger the slope $n/\hat{n}$ and thus the lower our estimated effective number of data points.

More extensive experiments (not shown) done with different numbers of variables, data points, Gibbs samples and sparseness parameters, reveal that these hardly influence the general picture, as long as the number of data points and the number of Gibbs samples are both at least 100.

4.2 Causal Discovery on Simulations

In this subsection, we compare CoPC and SCPC with the ‘Rank PC’ [12] algorithm. All computations are implemented in the R-package pcalg.

We first generate multivariate normal data (p variables) via Eq. (4). After that, $25\,\%$ of all p variables are discretized into binary variables, and another $25\,\%$ of them are discretized into ordinal variables with 5 levels. In this way, we simulate the observations of $\varvec{Y}$ which are generated from a Gaussian copula model with both discrete and continuous margins.

Three measures are used to test the performance: (1) percentage of correct edges in the resulting skeleton, usually called true positive rate (TPR); (2) percentage of spurious edges, usually called false positive rate (FPR); (3) Structural Hamming Distance (SHD), counting the number of edge insertions, deletions, and flips in order to transfer the estimated CPDAG into the correct CPDAG [30]. The first two measures are for the skeleton while SHD is for the CPDAG. A smaller SHD indicates better performance.

Next, we compare the performance of three versions of the PC algorithm, RPC, CoPC, and SCPC in terms of TPR, FPR, and SHD. We restrict the significance level to $\alpha = 0.01$, which has been shown to yield the best overall SHD [16]. For CoPC, we drop the first 20 Gibbs samples and save the next 100 samples ($m=100$). For SCPC, we take $l=20$ equidistant samples, so $\{C^{(1)},C^{(6)},\ldots ,C^{(96)}\}$, and choose $\beta $ such that the TPR for SCPC is more or less equal to that of RPC, which amounts to $\beta =0.4$ for sparse graphs with 10 nodes, $\beta =0.45$ for sparse graphs with 50 nodes, and $\beta =0.3$ for dense graphs. The remaining parameters are set as follows: $p\in \{10, 50\}$, $n\in \{500, 1000, 2000, 5000\}$, and $\mathrm{E}\,[N]\in \{2$ (Sparse), 5 $(Dense)\}$.

The comparative results in Fig. 2 (10 nodes) and Fig. 3 (50 nodes) provide the mean over 100 repeated experiments and errorbars representing 95 % confidence intervals. First, for sparse graphs (both small and large graphs), the three algorithms get nearly the same results w.r.t. TPR, but CoPC and SCPC show a large advantage over RPC w.r.t. FPR and SHD except SCPC with large graphs, which becomes more prominent with increasing sample size. Second, for dense graphs, the advantage of CoPC and SCPC over RPC still exists w.r.t. FPR, although seemingly CoPC performs a little worse than SCPC and RPC w.r.t. TPR. Third, we note that the performance of RPC deteriorates seriously w.r.t. FPR with the increase in sample size, while CoPC and SCPC are very stable. Apparently, using sample size as the effective number of data points, RPC incurs more false positives especially for larger sample sizes. Overall, CoPC and SCPC clearly outperform RPC, especially in the sparse cases with larger sample sizes.

4.3 Application to Real-World Data

In this subsection, we give an illustration on a real-world dataset on phenotypic information about children with Attention Deficit Hyperactivity Disorder (ADHD) [5]. It contains 23 variables for 245 subjects. We focus on nine variables as in [26], but keep all subjects with missing values since these are easily handled by the Gibbs sampler. The nine variables considered are: gender (G), attention deficit level (AD), hyperactivity/impulsivity level (HI), verbal IQ (VIQ), performance IQ (PIQ), full IQ (FIQ), aggressive behavior (Agg), medication status (Med), handedness (HN), where four of them (G, Agg, Med, HN) are binary.

We run CoPC and SCPC ($l=30$, $\beta =0.4$) on the dataset and consider prior knowledge that no variable can cause gender. The resulting graphs are shown in Figure 4. The graphs suggest that gender has an effect on attention deficit level, which then causes hyperactivity/impulsivity level. This point has been confirmed by many studies [4], [31]. It is common that AD and Agg cause patients to take medicine. Also, VIQ, PIQ, and FIQ are connected to each other by bi-directed edges. This indicates that the causal sufficiency assumption is violated, i.e., that there should be a latent common cause related to IQ, as also suggested in [26].

5 Conclusions and Future Work

In this paper we introduced a novel two-step approach for estimating the causal structure underlying a Gaussian copula model on mixed data. The essence is to estimate the correlation matrix in the latent space, which can then be given to any causal discovery algorithm to search for its underlying structure. Ties between the discretized observations incur information loss, making the estimate of correlation matrix less reliable than in fully continuous cases. For this, we introduced the notion of ‘effective number of data points’ that can be estimated from the expectation and variance of the correlation matrix elements. Our approach, based on ranks and correlation matrices, is fully scale invariant and has a natural uninformative setting when choosing a uniform distribution over pairwise correlations, which can be adjusted to account for different assumptions.

We like to think of our two-step approach as a general principle, where for each of the two steps one could plug in one’s favorite choice: e.g., a different MCMC method [15] or a MAP approach along the lines of [1] for estimating the correlation matrix and its reliability, and another method, like FCI [28] or BCCD [7], for causal structure learning. Having generated samples, running the PC algorithm several times to gain an insight into the reliability of structure estimates is an obvious thing to do. Similar procedures have been proposed, e.g., by bootstrapping the original dataset [8, 10]. In our simulations, the Gibbs sampler appears to converge quite fast, which makes Gibbs sampling cheap compared to running the PC algorithm, in particular for models with many variables. Our choice to only resample the discrete random variables and not the continuous ones, here also helps. Being fully Bayesian about structure learning as well may be very nice in theory [11], but is computationally infeasible in practice for any reasonable number of variables. Altogether, our Bayesian approach to sample correlation matrices in combination with a more frequentist approach towards structure learning attempts to combine the best of both worlds.

Our methods require the setting of just a few parameters: the significance level $\alpha $ to be used in the PC algorithm (typically 0.01 or 0.05), the number of Gibbs samples and burn-in (the more, the better), and for SCPC, the number of instances l in the ensemble (the more, the better), and the threshold $\beta $ (the higher, the more conservative).

Our estimate of the ‘effective number of data points’ appears to work nicely in practice, but can and perhaps should be further improved. Instead of considering the variance of the elements of the correlation matrix, one may come up with another, more direct estimate, for example the entropy of the distribution and translate that into an effect number of data points. Preliminary attempts in that direction failed by being typically much less robust than the one described in this paper. Our current estimate gives a single, global value for the effective number of data points. Future work may consider estimating a different value for each conditional independence test, since each test only relies on a local structure, involving only part of the variables. Such estimates then can be integrated into the causal discovery algorithm itself. Another line of future research concerns the theoretical analysis of CoPC and SCPC, where it can be studied to what extent and under which conditions consistency can be proven. Our conjecture here is that consistency of our two-step procedure follows from the proven consistency of the two separate steps: Gibbs sampling to estimate the correct correlation matrix C [22] and the PC algorithm to arrive at the correct causal structure [16].

References

Abegaz, F., Wit, E.: Penalized EM algorithm and copula skeptic graphical models for inferring networks for mixed variables, arXiv preprint (2014). arXiv:1401.5264
Anderson, T.W.: An Introduction to Multivariate Statistical Analysis. Wiley, Hoboken (2003)
MATH Google Scholar
Barnard, J., McCulloch, R., Meng, X.L.: Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica, pp. 1281–1311 (2000)
Google Scholar
Bauermeister, J.J., Shrout, P.E., Chávez, L., Rubio-Stipec, M., Ramírez, R., Padilla, L., Anderson, A., García, P., Canino, G.: ADHD and gender: are risks and sequela of ADHD the same for boys and girls? J. Child Psychol. Psychiatry 48(8), 831–839 (2007)
Article Google Scholar
Cao, Q., Zang, Y., Sun, L., Sui, M., Long, X., Zou, Q., Wang, Y.: Abnormal neural activity in children with attention deficit hyperactivity disorder: a resting-state functional magnetic resonance imaging study. Neuroreport 17(10), 1033–1036 (2006)
Article Google Scholar
Chickering, D.M.: Learning equivalence classes of Bayesian-network structures. J. Mach. Learn. Res. 2, 445–498 (2002)
MathSciNet MATH Google Scholar
Claassen, T., Heskes, T.: A Bayesian approach to constraint based causal inference, arXiv preprint (2012). arXiv:1210.4866
Dai, H., Li, G., Zhou, Z.: Ensembling MML causal discovery. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056. Springer, Heidelberg (2004)
Chapter Google Scholar
Eaton, M.L.: The Wishart distribution. In: Multivariate Statistics. Lecture Notes-Monograph Series, vol. 53, pp. 302–333. Institute of Mathematical Statistics, Beachwood, Ohio, USA (2007)
Google Scholar
Entner, D., Hoyer, P.O.: On causal discovery from time series data using FCI. Probabilistic graphical models, pp. 121–128 (2010)
Google Scholar
Friedman, N., Koller, D.: Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Mach. Learn. 50(1–2), 95–125 (2003)
Article MATH Google Scholar
Harris, N., Drton, M.: PC algorithm for nonparanormal graphical models. J. Mach. Learn. Res. 14(1), 3365–3383 (2013)
MathSciNet MATH Google Scholar
Hoff, P.D.: Extending the rank likelihood for semiparametric copula estimation. Ann. Appl. Stat. 1(1), 265–283 (2007)
Article MathSciNet MATH Google Scholar
Huang, A., Wand, M.P., et al.: Simple marginally noninformative prior distributions for covariance matrices. Bayesian Anal. 8(2), 439–452 (2013)
Article MathSciNet MATH Google Scholar
Kalaitzis, A., Silva, R.: Flexible sampling of discrete data correlations without the marginal distributions. In: Advances in Neural Information Processing Systems, pp. 2517–2525 (2013)
Google Scholar
Kalisch, M., Bühlmann, P.: Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J. Mach. Learn. Res. 8, 613–636 (2007)
MATH Google Scholar
Kalisch, M., Mächler, M., Colombo, D., Maathuis, M.H., Bühlmann, P.: Causal inference using graphical models with the R package pcalg. J. Stat. Softw. 47(11), 1–26 (2012)
Article Google Scholar
Lauritzen, S.L.: Graphical Models. Clarendon Press, Oxford (1996)
MATH Google Scholar
Liu, H., Han, F., Yuan, M., Lafferty, J., Wasserman, L.: High-dimensional semiparametric Gaussian copula graphical models. Ann. Stat. 40(4), 2293–2326 (2012)
Article MathSciNet MATH Google Scholar
Maathuis, M.H., Colombo, D., Kalisch, M., Bühlmann, P.: Predicting causal effects in large-scale systems from observational data. Nat. Methods 7(4), 247–248 (2010)
Article Google Scholar
Murphy, K.P.: Conjugate Bayesian analysis of the Gaussian distribution. Def 1(2), 16 (2007)
Google Scholar
Murray, J.S., Dunson, D.B., Carin, L., Lucas, J.E.: Bayesian Gaussian copula factor models for mixed data. J. Am. Stat. Assoc. 108(502), 656–665 (2013)
Article MathSciNet MATH Google Scholar
Pearl, J.: Causality. Cambridge University Press, New York (2009)
Book MATH Google Scholar
Pearl, J., et al.: Causal inference in statistics: an overview. Stat. Surv. 3, 96–146 (2009)
Article MathSciNet MATH Google Scholar
Scheines, R., Spirtes, P., Glymour, C., Meek, C., Richardson, T.: The TETRAD project: constraint based aids to causal model specification. Multivar. Behav. Res. 33(1), 65–117 (1998)
Article Google Scholar
Sokolova, E., Groot, P., Claassen, T., Heskes, T.: Causal discovery from databases with discrete and continuous variables. In: Gaag, L.C., Feelders, A.J. (eds.) PGM 2014. LNCS (LNAI), vol. 8754, pp. 442–457. Springer, Heidelberg (2014). doi:10.1007/978-3-319-11433-0_29
Google Scholar
Spirtes, P., Glymour, C.N., Scheines, R.: Causation, Prediction and Search. Lecture Notes in Statist, vol. 81. Springer, Heidelberg (1993)
Book MATH Google Scholar
Spirtes, P., Glymour, C.N., Scheines, R.: Causation, Prediction, and Search. MIT press, Cambridge (2000)
MATH Google Scholar
Stekhoven, D.J., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M.H., Bühlmann, P.: Causal stability ranking. Bioinformatics 28(21), 2819–2823 (2012)
Article Google Scholar
Tsamardinos, I., Brown, L.E., Aliferis, C.F.: The max-min hill-climbing Bayesian network structure learning algorithm. Mach. Learn. 65(1), 31–78 (2006)
Article Google Scholar
Willcutt, E.G., Pennington, B.F., DeFries, J.C.: Etiology of inattention and hyperactivity/impulsivity in a community sample of twins with learning difficulties. J. Abnorm. Child Psychol. 28(2), 149–159 (2000)
Article Google Scholar
Yang, R., Berger, J.O.: Estimation of a covariance matrix using the reference prior. Ann. Stat. 22(3), 1195–1211 (1994)
Article MathSciNet MATH Google Scholar

Download references

Author information

Authors and Affiliations

Institute for Computing and Information Sciences, Radboud University, Nijmegen, The Netherlands
Ruifei Cui, Perry Groot & Tom Heskes

Authors

Ruifei Cui
View author publications
You can also search for this author in PubMed Google Scholar
Perry Groot
View author publications
You can also search for this author in PubMed Google Scholar
Tom Heskes
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ruifei Cui .

Editor information

Editors and Affiliations

Università degli Studi di Firenze, Firenze, Italy
Paolo Frasconi
Computer Science, University of Potsdam, Potsdam, Germany
Niels Landwehr
High Performance Computing and Networks, Rende, Italy
Giuseppe Manco
MPI for Informatics, Saarland University, Saarbrücken, Saarland, Germany
Jilles Vreeken

Appendix: Proof of Theorem 1

Consider partitioning the matrix $\varSigma $ and $\varPsi $ as

$$ \varSigma = \left[ \begin{array}{cc} \varSigma _{aa} &{} \varSigma _{ab} \\ \varSigma _{ba} &{} \varSigma _{ bb} \end{array} \right] \,\text {and}\, \varPsi = \left[ \begin{array}{cc} \varPsi _{aa} &{} \varPsi _{ab} \\ \varPsi _{ba} &{} \varPsi _{bb} \end{array} \right] . $$

Then, if $P(\varSigma ) = \mathcal{W}^{-1}(\varSigma ; \varPsi ,\nu )$, we have

(5)

where $\mathop {\text {dim}}(b)$ is the dimension of $\varSigma _{bb}$ and $\varSigma _{bb|a} = \varSigma _{bb} - \varSigma _{ba} \varSigma _{aa}^{-1} \varSigma _{ab} $ [9].

Without loss of generality, we restrict our analysis to a two-dimensional system and suppose that we draw

$$ \varSigma \sim \mathcal{W}^{-1}\left( \left( \begin{array}{cc} 1 &{} \rho \\ \rho &{} 1 \end{array}\right) , \nu \right) . $$

Then, according to (5), we have

$$ \varSigma _{11} \sim \mathcal{W}^{-1}(1,\nu -1),~~ \varSigma _{22|1} \sim \mathcal{W}^{-1}(1-\rho ^2,\nu ),~~\varSigma _{11}^{-1} \varSigma _{12} | \varSigma _{22|1} \sim \mathcal{N}(\rho ,\varSigma _{22|1}). $$

Rewriting the resulting $\hat{\rho }$ in terms of these variables, we obtain

$$\begin{aligned} \hat{\rho } = {\varSigma _{12} \over \sqrt{\varSigma _{11}} \sqrt{\varSigma _{22}}} = {(\varSigma _{11}^{-1} \varSigma _{12}) \sqrt{\varSigma _{11}} \over \sqrt{\varSigma _{22|1} + \varSigma _{11} (\varSigma _{11}^{-1} \varSigma _{12})^2}}. \end{aligned}$$

(6)

Since for large $\nu $,

$$ \mathrm{E}\,[\varSigma _{11}] = \frac{1}{\nu -3} \approx \frac{1}{\nu },~~ \mathrm{E}\,[\varSigma _{22|1}] = \frac{1 - \rho ^2}{(\nu -2)} \approx \frac{1 - \rho ^2}{\nu }, $$

$$ \mathrm{Var}\,[\varSigma _{11}] = {2 \over (\nu -3)^2 (\nu -5)} \approx {2 \over \nu ^3}, \mathrm{Var}\,[\varSigma _{22|1}] = {2 (1-\rho ^2)^2 \over (\nu -2)^2 (\nu -4)} \approx {2 (1-\rho ^2)^2 \over \nu ^3}. $$

we can approximate,

(7)

where x, y, and z are independent random variables, all with mean zero and unit variance. Indeed, for large $\nu $, all noise terms scale with $\sqrt{1/\nu }$ relative to the mean, and can hence be ignored when computing the expectation, to yield, as expected,

$$\begin{aligned} \mathrm{E}\,[\hat{\rho }] \approx \rho . \end{aligned}$$

(8)

To estimate the variance, we substitute (7) into (6), and compute (in leading order, and evaluated for $x = y = z = 0$),

$$\begin{aligned} {\partial \hat{\rho } \over \partial x} \approx \rho (1 - \rho ^2) \sqrt{1 \over 2 \nu },~~ {\partial \hat{\rho } \over \partial y} \approx \rho (1 - \rho ^2) \sqrt{1 \over 2 \nu },~~ {\partial \hat{\rho } \over \partial z} \approx (1 - \rho ^2)^{3/2} \sqrt{1 \over \nu }, \end{aligned}$$

yielding the variance

$$\begin{aligned} \mathrm{Var}\,[\hat{\rho }] = \left( {\partial \hat{\rho } \over \partial x}\right) ^2 + \left( {\partial \hat{\rho } \over \partial y}\right) ^2 + \left( {\partial \hat{\rho } \over \partial z}\right) ^2 \approx {(1 - \rho ^2)^2 \over \nu }. \end{aligned}$$

(9)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Cui, R., Groot, P., Heskes, T. (2016). Copula PC Algorithm for Causal Discovery from Mixed Data. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science(), vol 9852. Springer, Cham. https://doi.org/10.1007/978-3-319-46227-1_24

Download citation

DOI: https://doi.org/10.1007/978-3-319-46227-1_24
Published: 04 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46226-4
Online ISBN: 978-3-319-46227-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics