
Information Sciences

Volume 560, June 2021, Pages 152-167

Mutual-information-inspired heuristics for constraint-based causal structure learning

https://doi.org/10.1016/j.ins.2020.12.009

Abstract

In constraint-based approaches to Bayesian network structure learning, when the assumption of orientation-faithfulness is violated, not only can the correctness of edge orientation be greatly degraded, but the soaring cost of conditional independence testing also limits their applicability to learning very large causal networks. Inspired by the strong connection between the degree of mutual information shared by two variables and their conditional independence, we extend the PC-MI algorithm in two ways: (a) the Weakest Edge-First (WEF) strategy implemented in PC-MI is further integrated with Markov-chain consistency to reduce the number of independence tests while keeping the number of false-positive edges in check during skeleton learning; (b) the Smaller Adjacency-Set (SAS) strategy is proposed, and we prove that the smaller adjacency set captures sufficient information for determining whether an unshielded triple forms a v-structure. We have conducted experiments on both low-dimensional and high-dimensional data sets, and the results indicate that our MIIPC approach outperforms state-of-the-art approaches in both the quality of learning and the execution time.

Introduction

A Bayesian network is a probabilistic graphical model. Owing to its compact representation and powerful reasoning ability under uncertainty, it has attracted increasing attention from researchers and has been widely used in neuroscience [1], computer science [2], industrial applications [3], and dependability and risk analysis [4].

A Bayesian network is mainly composed of two parts: a directed acyclic graph and a set of conditional probability tables, where the nodes represent random variables and the edges represent direct dependencies between pairs of variables. Each probability table describes the dependence of a variable on its parent nodes. The Bayesian network structure learning techniques in the literature can be roughly grouped into three categories: approaches based on structural equation models, score-based approaches, and constraint-based approaches.

Utilizing an appropriately defined structural equation model to learn causal structures is essentially a “data-driven” approach in which the observed data distribution is used to discover the underlying causal relations [5]. Representative models include the classic Linear Non-Gaussian Acyclic Model, the nonlinear additive noise model, and the post-nonlinear causal model [6], where independence testing is used to identify the causal direction between variables. Others have formalized causal structure learning as a continuous optimization problem; such methods include the DAG-GNN method based on graph neural networks [7] and the RL-BIC2 method based on reinforcement learning [8].

In score-based approaches, a scoring function evaluates how well each candidate structure fits the observed data, and the final output is one or more directed acyclic graphs with the best score. Since the space of candidate structures grows exponentially with the number of variables, researchers have investigated various search strategies, such as searching the structure space, the equivalence-class space, and the variable-order space. There are exact approaches that attempt to find the globally optimal solution, such as integer linear programming, dynamic programming, and branch-and-bound methods [9], as well as approximate approaches such as evolutionary methods [10], heuristic algorithms [11], local-global learning methods [12], [13], surrogate models [14], and bounded tree-width structure optimization [15].

The constraint-based approaches use variable independence testing and edge orientation rules to learn the Bayesian network structure that can best explain the observed dependencies. As shown in Fig. 1, a constraint-based approach typically has three steps.

  • Step 1: Skeleton learning (adjacency search). A conditional independence test, such as the G2 test, mutual information, or Fisher’s Z-test, is used to decide whether there exists an edge between each pair of variables.

  • Step 2: V-structure recognition. The separating sets generated in Step 1 are used to identify potential v-structures.

  • Step 3: Edge orientation. The v-structures identified in Step 2, Meek’s orientation rules [16], together with the observational data, are used to transform the undirected graph produced in Step 1 into a directed acyclic graph.
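As a concrete illustration of Step 1, the following is a minimal sketch of a PC-style adjacency search. It uses an empirical conditional mutual information estimate with a fixed threshold in place of a calibrated G2 or Fisher’s Z test; the function names and the threshold value are our own, not from the paper:

```python
import math
import random
from collections import Counter
from itertools import combinations

def cmi(data, x, y, s=()):
    """Empirical conditional mutual information I(X;Y|S) from discrete samples.
    data is a list of dicts mapping variable names to values."""
    n = len(data)
    nxy = Counter((d[x], d[y], tuple(d[v] for v in s)) for d in data)
    nx = Counter((d[x], tuple(d[v] for v in s)) for d in data)
    ny = Counter((d[y], tuple(d[v] for v in s)) for d in data)
    ns = Counter(tuple(d[v] for v in s) for d in data)
    return sum((c / n) * math.log(c * ns[vs] / (nx[(vx, vs)] * ny[(vy, vs)]))
               for (vx, vy, vs), c in nxy.items())

def pc_skeleton(variables, data, threshold=0.01, max_cond=2):
    """Step 1 (adjacency search): start from a complete undirected graph and
    remove the edge X - Y whenever some conditioning set S with I(X;Y|S)
    below the threshold is found; record S as the separating set of (X, Y)."""
    adj = {v: {w for w in variables if w != v} for v in variables}
    sepset = {}
    for size in range(max_cond + 1):
        for x, y in combinations(variables, 2):
            if y not in adj[x]:
                continue
            for s in combinations(sorted(adj[x] - {y}), size):
                if cmi(data, x, y, s) < threshold:  # deemed independent
                    adj[x].discard(y)
                    adj[y].discard(x)
                    sepset[frozenset((x, y))] = set(s)
                    break
    return adj, sepset

# Demo: samples from a chain A -> B -> C, so A and C are separated by {B}.
random.seed(0)
data = []
for _ in range(2000):
    a = random.randint(0, 1)
    b = a if random.random() < 0.9 else 1 - a
    c = b if random.random() < 0.9 else 1 - b
    data.append({'A': a, 'B': b, 'C': c})
adj, sepset = pc_skeleton(['A', 'B', 'C'], data, threshold=0.01, max_cond=1)
```

On the chain above, the edge A - C survives the marginal (size-0) test because A and C are dependent, but is removed once the search conditions on B, leaving the true skeleton A - B - C.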

The classic constraint-based learning approach is the PC algorithm introduced by Spirtes et al. [17]. Researchers have recently tried to extend the constraint-based algorithm to learn from time series data [18] and nonstationary data [19]. Because the PC algorithm is order-dependent, researchers have successively proposed order-independent methods such as PC-stable algorithm [20] and PC-MI algorithm [21].

Most constraint-based approaches rely on the orientation-faithfulness assumption, which, however, is hard to satisfy in real-world situations. Researchers have proposed the CPC algorithm [22], the MPC algorithm, the CPC-stable algorithm, and the MPC-stable algorithm [20] for learning Bayesian network structures when the orientation-faithfulness assumption is violated. Although these methods can improve the quality of structure learning, they often suffer from high computational cost and tend to produce graphs with many unoriented edges.

In this study, we mainly focus on steps 1 and 2, investigating effective strategies for learning high-quality Bayesian network structures when the orientation-faithfulness assumption is partially violated. Our contribution is threefold:

  • We propose an algorithm named MIIPC for learning Bayesian network structures. The adjacency search step of MIIPC extends our PC-MI algorithm [21] by integrating the WEF heuristic with the notion of Markov-chain consistency, in the hope of effectively reducing the number of conditional independence tests while keeping the number of false-positive edges in check.

  • We propose the Smaller Adjacency-Set heuristic for v-structure recognition. We also prove that the smaller adjacency set is surprisingly powerful in the sense that it captures sufficient information for determining whether an unshielded triple forms a v-structure.

  • We experimentally show that the proposed algorithm MIIPC, empowered by the WEF and SAS strategies, outperforms the state-of-the-art approaches in both the quality of causal structure learning and the execution time.
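To make the v-structure step concrete, here is a minimal sketch of the standard collider rule that the SAS heuristic refines: an unshielded triple X - Z - Y is oriented as X → Z ← Y exactly when Z is absent from the separating set of (X, Y). The function name and data layout are ours; the paper’s contribution is showing that the separating-set search can be restricted to the smaller of the two adjacency sets, which this sketch does not implement:

```python
from itertools import combinations

def find_v_structures(adj, sepset):
    """Step 2 sketch: scan every unshielded triple X - Z - Y (X and Y both
    adjacent to Z but not to each other) and mark it as a collider
    X -> Z <- Y when Z does not appear in the separating set of (X, Y)."""
    vstructs = set()
    for z in adj:
        for x, y in combinations(sorted(adj[z]), 2):
            if y in adj[x]:
                continue  # shielded: X and Y are adjacent
            if z not in sepset.get(frozenset((x, y)), set()):
                vstructs.add((x, z, y))
    return vstructs

# Demo: skeleton A - B - C where A and C are separated by the empty set,
# i.e. the generating structure is the collider A -> B <- C.
adj = {'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B'}}
sepset = {frozenset(('A', 'C')): set()}
```

Since B is not in the separating set of (A, C), the triple is recognized as the v-structure A → B ← C.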

The remainder of the paper is organized as follows. Related work and terminology are given in Section 2. In Section 3, we propose the notion of Markov-chain consistency and the Smaller Adjacency-Set (SAS) heuristic. Markov-chain consistency is implemented in the adjacency search step of the MIIPC algorithm to learn skeletons, and the SAS heuristic is utilized in the orientation step of MIIPC to determine v-structures. Experiments and comparisons are covered in Section 4, and Section 5 concludes the paper.


Preliminaries and related work

This section introduces terminology and relevant constraint-based approaches.

Two random variables X and Y are independent conditional on S ⊆ V \ {X, Y}, denoted by Ind(X; Y | S), iff p(x, y | s) = p(x | s) p(y | s) for all values x of X, y of Y, and s of S such that p(s) > 0.
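The factorization in this definition can be checked numerically. The sketch below (the distribution is ours, constructed so that conditional independence holds by design) builds a joint p(x, y, s) = p(s) p(x | s) p(y | s) over binary variables and verifies p(x, y | s) = p(x | s) p(y | s) for every value combination:

```python
import itertools

# A joint distribution over binary (X, Y, S) built so that Ind(X; Y | S) holds.
p_s = {0: 0.5, 1: 0.5}
p_x_given_s = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.7, 1: 0.3}}
p_y_given_s = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}
joint = {(x, y, s): p_s[s] * p_x_given_s[s][x] * p_y_given_s[s][y]
         for x, y, s in itertools.product((0, 1), repeat=3)}

def conditional_xy(joint, s):
    """Return p(x, y | s) as a dict keyed by (x, y)."""
    z = sum(v for (x, y, ss), v in joint.items() if ss == s)
    return {(x, y): v / z for (x, y, ss), v in joint.items() if ss == s}

# Verify Ind(X; Y | S): p(x, y | s) equals p(x | s) * p(y | s) everywhere.
for s in (0, 1):
    pxy = conditional_xy(joint, s)
    for x, y in itertools.product((0, 1), repeat=2):
        px = pxy[(x, 0)] + pxy[(x, 1)]  # marginal p(x | s)
        py = pxy[(0, y)] + pxy[(1, y)]  # marginal p(y | s)
        assert abs(pxy[(x, y)] - px * py) < 1e-12
```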

Let G = ⟨V, E⟩ be a graph, where V = {X1, …, Xn} is the set of random variables (vertices) in the problem domain under concern, and E is a set of edges between vertices. A directed edge from Xi to Xj is represented by Xi → Xj, and an undirected edge

M-order-based causal structure learning

We introduce Markov-chain consistency in Section 3.1 and present our MIIPC algorithm in Sections 3.2 (M-order-based skeleton construction) and 3.3 (M-order-based v-structure determination).

Empirical evaluation

We evaluate the proposed MIIPC algorithm in two aspects: computing time and the quality of learned network structures. For the latter, we use the edge-related measurements [26], [27].

  • Extra edges (false positive): the number of edges that are found in the learned structure but are not present in the original “gold-standard” structure.

  • Missing edges (false negative): the number of edges that are present in the original structure but are missing in the learned structure.

  • Reverse edges (orientation errors): the number of edges that appear in both structures but whose direction in the learned structure is the reverse of that in the original structure.
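These edge-difference counts can be computed with a small sketch like the following (the function name and edge representation are ours, not from the paper):

```python
def edge_errors(true_edges, learned_edges):
    """Count extra, missing, and reverse edges of a learned DAG against a
    gold-standard DAG; edges are (parent, child) tuples."""
    true_und = {frozenset(e) for e in true_edges}
    learned_und = {frozenset(e) for e in learned_edges}
    extra = len(learned_und - true_und)      # false positives
    missing = len(true_und - learned_und)    # false negatives
    reverse = sum(1 for (u, v) in learned_edges
                  if (v, u) in true_edges and (u, v) not in true_edges)
    return extra, missing, reverse

# Demo: the learned graph adds a spurious edge A -> C and reverses B -> C.
gold = {('A', 'B'), ('B', 'C')}
learned = {('A', 'B'), ('C', 'B'), ('A', 'C')}
```

On this pair of graphs, `edge_errors(gold, learned)` reports one extra edge, no missing edges, and one reversed edge.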

Conclusion and future work

The state-of-the-art constraint-based approaches tend to yield unstable results which can be greatly affected by the order of choosing variable pairs and decisions about v-structures. In addition, the number of conditional independence tests in PC-like algorithms increases exponentially as the number of variables increases.

Inspired by the strong connection between the degree of mutual information shared by two variables and their conditional independence, we have introduced a causal structure

CRediT authorship contribution statement

Xiaolong Qi: Conceptualization, Methodology, Software, Writing - original draft. Xiaocong Fan: Methodology, Writing - review & editing. Huiling Wang: Writing - review & editing. Ling Lin: Writing - review & editing. Yang Gao: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key R&D Program of China (2017YFB0702600, 2017YFB0702601); the National Natural Science Foundation of China (61432008, 61663045, 61503178); SRPH in Xinjiang (XJEDU2020Y036); the Yili Normal University Ph.D. Startup Fund (2020YSB007); and a scientific research project of Yili Normal University (2020YSZD004).

References (30)

  • S. Zhu, I. Ng, Z. Chen, Causal discovery with reinforcement learning, in: International Conference on Learning...
  • M. Scanagatta et al., A survey on Bayesian network structure learning from data, Prog. Artif. Intell. (2019).
  • B. O’Gorman et al., Bayesian network structure learning using quantum annealing, Eur. Phys. J. Special Top. (2015).
  • T. Gao, K. Fadnis, M. Campbell, Local-to-global Bayesian network structure learning, in: International Conference on...
  • J.I. Alonso-Barba, L. de la Ossa, O. Regnier-Coudert, J. McCall, J.A. Gámez, J.M. Puerta, Ant colony and surrogate...