Elsevier

Neural Networks

Volume 124, April 2020, Pages 130-145

A causal discovery algorithm based on the prior selection of leaf nodes

https://doi.org/10.1016/j.neunet.2019.12.020

Abstract

In recent years, the Linear Non-Gaussian Acyclic Model (LiNGAM) has been widely used for the discovery of causal networks. However, solutions based on LiNGAM usually suffer from high computational complexity and unsatisfactory accuracy when the data is high-dimensional or the sample size is too small. These complexity and accuracy problems often originate from the prior selection of root nodes when estimating a causal ordering. Thus, this paper proposes a causal discovery algorithm, termed the GPL algorithm (the LiNGAM algorithm of Giving Priority to Leaf-nodes), under a mild assumption. It assigns priority to leaf nodes rather than root nodes. Since leaf nodes do not affect other nodes in a structure, a causal ordering can be estimated directly in a bottom-up way without additional operations such as the data updating process. Proofs of both feasibility and superiority are offered based on the properties of leaf nodes. Aside from theoretical analyses, experiments on both synthetic and real-world data confirm that the GPL algorithm outperforms two other state-of-the-art algorithms in computational complexity and accuracy, especially when dealing with high-dimensional data (up to 200 dimensions) or small sample sizes (down to 100 for the dimension of 70).

Introduction

The study of causality has long thrived and attracted special attention in various fields, including statistics, artificial intelligence, and cognitive science (Janzing et al., 2017; Pearl, 2000; Spirtes et al., 2000). By deriving causal relationships from factual knowledge, causal discovery uncovers connections between variables and seeks possible explanations for observed events. One popular branch of causality focuses on continuous variables under the frameworks of structural equation models (Bollen, 1989) and Bayesian networks (Pearl, 2000; Spirtes et al., 2000). Since recent studies have shown that the non-Gaussianity of noise benefits causal discovery (Shimizu, 2014; Shimizu et al., 2006), in this work we focus on the framework of the Linear Non-Gaussian Acyclic Model (LiNGAM), which is also a typical model for estimating causal networks from observational data.

The problem of estimating LiNGAM has been extensively studied in the literature, and there are three main frameworks: the Independent Component Analysis (ICA)-based framework (Hyvärinen, 1999; Hyvärinen et al., 2004; Shimizu et al., 2006; Zhang and Chan, 2006) (e.g., the ICA-LiNGAM algorithm), the Score-based framework (Cai et al., 2018; Hoyer and Hyttinen, 2009) (e.g., the BayesLiNGAM algorithm), and the Root-based framework (Hyvärinen and Smith, 2013; Shimizu et al., 2011; Sogawa et al., 2011) (e.g., the DirectLiNGAM algorithm and its improved version, the Pairwise-LiNGAM algorithm). The first two frameworks convert LiNGAM into a function optimization problem, whose approaches are sensitive to initial values (Hyvärinen et al., 2004) and often converge to a local optimum (Himberg, Hyvärinen, & Esposito, 2004). In contrast, the Root-based framework is a nonparametric estimation that directly estimates causal networks in a finite number of steps. It contains two main phases: (1) identify a root node; (2) perform the data updating process. The data updating process is needed because the selected root node affects other nodes, and its influence on them must be removed before the next root node is selected. However, with a limited number of samples, errors may occur during the data updating process, and using the erroneous updated data can cause the next root node to be misidentified. Moreover, when the data is high-dimensional, both high complexity and cascading errors arise.
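The two phases of the Root-based framework can be sketched as follows. This is a minimal illustration, not the DirectLiNGAM implementation: the dependence score (absolute correlation between squared, centred values) is a hypothetical stand-in for the kernel-based or likelihood-ratio measures used in the cited work, and the helper names, the uniform noise, and the unit coefficients of the toy chain are all assumptions made for this example.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def residual(xi, xj):
    """Residual of the least-squares regression of xi on xj."""
    mi, mj = mean(xi), mean(xj)
    cov = sum((a - mi) * (b - mj) for a, b in zip(xi, xj))
    var = sum((b - mj) ** 2 for b in xj)
    coef = cov / var
    return [a - coef * b for a, b in zip(xi, xj)]


def dependence(u, v):
    """Assumed proxy for dependence: |corr(u^2, v^2)| on centred data."""
    mu, mv = mean(u), mean(v)
    p = [(a - mu) ** 2 for a in u]
    q = [(b - mv) ** 2 for b in v]
    mp, mq = mean(p), mean(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return abs(cov / (sp * sq))


def root_based_order(columns):
    """Return variable indices ordered from root to leaf."""
    data = [list(c) for c in columns]  # copy: the data is overwritten below
    remaining = list(range(len(data)))
    order = []
    while len(remaining) > 1:
        # Phase 1: the root is the variable least dependent on the
        # residuals of the other variables regressed on it.
        def worst_case(j):
            return max(dependence(data[j], residual(data[i], data[j]))
                       for i in remaining if i != j)
        root = min(remaining, key=worst_case)
        order.append(root)
        remaining.remove(root)
        # Phase 2: data updating, i.e. remove the root's influence.
        for i in remaining:
            data[i] = residual(data[i], data[root])
    order.append(remaining[0])
    return order


# Toy chain x1 -> x2 -> x3 with uniform (non-Gaussian) noise.
rng = random.Random(0)
n = 2000
x1 = [rng.uniform(-1, 1) for _ in range(n)]
x2 = [a + rng.uniform(-1, 1) for a in x1]
x3 = [a + rng.uniform(-1, 1) for a in x2]
order = root_based_order([x2, x3, x1])  # columns deliberately shuffled
print(order)
```

Note that every later root is chosen from data that has already been overwritten with estimated residuals, which is where estimation errors can propagate.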

The Root-based framework is sensitive to the sample size, especially with high-dimensional data, for the following reasons. When the sample is inadequate, it does not provide enough information for the data updating process; that is, it cannot ensure that the estimated coefficients are correct, or at least approximately equal to the true coefficients. See a simple example in Fig. 1. As depicted in Fig. 1(a-1) and (a-2), coefficients such as −5.08, 3.04, −9.85 and 5.89 should be at least approximately equal to −5, 3, −10 and 6 to avoid errors. When the data has higher dimensionality, methods under the Root-based framework must carry out more data updating processes iteratively, which increases complexity.
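The effect of sample size on the estimated coefficients can be illustrated with a small simulation. This is only a sketch under stated assumptions: a single cause-effect pair, a true coefficient of 3 (chosen to echo the example values above), uniform noise, and hypothetical helper names; it is not the experiment behind Fig. 1.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var


def mean_abs_error(n_samples, true_coef=3.0, trials=100, seed=0):
    """Average |estimated - true| coefficient over repeated draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        cause = [rng.uniform(-1, 1) for _ in range(n_samples)]
        effect = [true_coef * c + rng.uniform(-1, 1) for c in cause]
        total += abs(ols_slope(cause, effect) - true_coef)
    return total / trials


err_small = mean_abs_error(n_samples=20)
err_large = mean_abs_error(n_samples=1000)
print(f"n=20:   mean |error| = {err_small:.3f}")
print(f"n=1000: mean |error| = {err_large:.3f}")
```

Each data updating step subtracts the estimated coefficient times the root variable, so an error of this kind is carried into every subsequent step.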

To address the above problems, we propose a Leaf-based framework to solve LiNGAM under the causal tree assumption (described in Section 4). Specifically, owing to the property of leaves, removing one of them from the structure does not affect the causal structure of the other nodes. Hence, without performing the data updating process, we can identify leaf nodes iteratively and estimate a complete causal network in a bottom-up way within a finite number of steps. We also propose an algorithm under the Leaf-based framework that gives priority to leaf nodes, namely the GPL algorithm. To identify a leaf node from the observed data, we measure the independence between a variable and its residuals, and select as the leaf node the variable that is dependent on all its residuals. In this way, GPL avoids cascading errors and has significant advantages in accuracy and especially in computational complexity, with corresponding proofs provided to verify its effectiveness.
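The bottom-up idea can be sketched as follows. This is a minimal illustration under several assumptions and should not be read as the GPL algorithm itself: "its residuals" is taken to mean the residuals of the other variables regressed on the candidate, the dependence score (absolute correlation between squared, centred values) is a hypothetical proxy for whatever independence measure the paper uses, and the toy chain, noise distribution, and function names are made up for the example.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def residual(xi, xj):
    """Residual of the least-squares regression of xi on xj."""
    mi, mj = mean(xi), mean(xj)
    cov = sum((a - mi) * (b - mj) for a, b in zip(xi, xj))
    var = sum((b - mj) ** 2 for b in xj)
    coef = cov / var
    return [a - coef * b for a, b in zip(xi, xj)]


def dependence(u, v):
    """Assumed proxy for dependence: |corr(u^2, v^2)| on centred data."""
    mu, mv = mean(u), mean(v)
    p = [(a - mu) ** 2 for a in u]
    q = [(b - mv) ** 2 for b in v]
    mp, mq = mean(p), mean(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return abs(cov / (sp * sq))


def leaf_based_order(columns):
    """Return variable indices ordered from root to leaf."""
    remaining = list(range(len(columns)))
    bottom_up = []
    while len(remaining) > 1:
        # A leaf should be dependent on ALL of its residuals, so score
        # each candidate by its weakest dependence and take the largest.
        def weakest(j):
            return min(dependence(columns[j],
                                  residual(columns[i], columns[j]))
                       for i in remaining if i != j)
        leaf = max(remaining, key=weakest)
        bottom_up.append(leaf)
        # Drop the leaf; the original data of the others is left untouched.
        remaining.remove(leaf)
    bottom_up.append(remaining[0])
    return bottom_up[::-1]


# Toy chain x1 -> x2 -> x3 with uniform (non-Gaussian) noise.
rng = random.Random(0)
n = 2000
x1 = [rng.uniform(-1, 1) for _ in range(n)]
x2 = [a + rng.uniform(-1, 1) for a in x1]
x3 = [a + rng.uniform(-1, 1) for a in x2]
order = leaf_based_order([x3, x1, x2])  # columns deliberately shuffled
print(order)
```

Unlike a root-first procedure, the loop never overwrites `columns`: every score is computed from the observed data, so an estimation error in one round is not fed into the next.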

The major contributions of this paper are listed as follows:

(1) A new framework to estimate LiNGAM, based on the prior selection of leaf nodes (Leaf-based framework), is proposed.

(2) An effective algorithm to identify leaf nodes is proposed and detailed theoretical proofs are given under the causal tree assumption, from the aspects of sufficiency and necessity.

(3) Theoretical analyses demonstrate that our framework can estimate LiNGAM quickly and efficiently under the causal tree assumption, especially when dealing with high-dimensional data (up to 200) or small-sample data (down to 100).

(4) Experiments on both synthetic and real-world data, including data that violates the causal tree assumption, are conducted and verify the effectiveness of our algorithm.

The remainder of this paper is organized as follows. Section 2 introduces related work on LiNGAM and its variant algorithms. Section 3 gives a brief review of LiNGAM. Section 4 studies the properties of leaf nodes with corresponding theoretical proofs and presents in detail the GPL algorithm, which gives priority to leaf nodes. Section 5 assesses GPL’s performance with experiments, and Section 6 draws a brief conclusion.

Section snippets

Related work

Linear non-Gaussian acyclic model (LiNGAM), as a functional equation model (Bollen, 1989), is one of the most well-known models for causal discovery. The observed data under LiNGAM needs to be generated from a process represented by a directed acyclic graph (DAG) with the following properties: (1) There exist linear relationships between observed variables; (2) There are no hidden variables (Unobserved Confounders), or equivalently all variables are observable; (3) Disturbance variables (noise

A linear non-Gaussian acyclic model: LiNGAM

A concrete description of LiNGAM (Shimizu et al., 2006) is given as follows. The observed data is generated by a process represented by a directed acyclic graph (DAG). Denote by B={bij} an n×n adjacency matrix, where bij represents the connection strength from a variable xj to another variable xi in the DAG. Denote by k(i) a causal order of the variables xi in the DAG such that no later variable causes any earlier one, which guarantees acyclicity. A directed path from xi to xj is a
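The generating process can be sketched in code, assuming the standard LiNGAM structural equation xi = Σj bij xj + ei of Shimizu et al. (2006), where the sum runs over variables earlier in the causal order. The function name, the uniform noise, and the example coefficients below are illustrative assumptions, not values taken from the paper.

```python
import random


def sample_lingam(B, order, n_samples, seed=0):
    """Draw samples from x_i = sum_j B[i][j] * x_j + e_i.

    B[i][j] is the connection strength from x_j to x_i, and `order`
    lists the variable indices with causes before effects.
    Returns (samples, noises), each a list of per-sample lists.
    """
    n = len(B)
    position = {v: k for k, v in enumerate(order)}
    # Acyclicity: a nonzero B[i][j] requires x_j to precede x_i.
    for i in range(n):
        for j in range(n):
            if B[i][j] != 0 and position[j] >= position[i]:
                raise ValueError(f"edge x{j} -> x{i} violates the order")
    rng = random.Random(seed)
    samples, noises = [], []
    for _ in range(n_samples):
        x = [0.0] * n
        e = [0.0] * n
        for i in order:  # generate causes before their effects
            e[i] = rng.uniform(-1, 1)  # non-Gaussian noise (assumed)
            x[i] = sum(B[i][j] * x[j] for j in range(n)) + e[i]
        samples.append(x)
        noises.append(e)
    return samples, noises


# Hypothetical chain x0 -> x1 -> x2 with strengths 3 and -5.
B = [
    [0.0, 0.0, 0.0],
    [3.0, 0.0, 0.0],
    [0.0, -5.0, 0.0],
]
X, E = sample_lingam(B, order=[0, 1, 2], n_samples=5)
for row in X:
    print([round(v, 3) for v in row])
```

When the variables are listed in causal order, B is strictly lower triangular, which is another way of stating the acyclicity condition checked at the top of the function.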

An algorithm giving priority to leaf nodes

To alleviate the high computational cost and low accuracy of the Root-based method, this section proposes a Leaf-based framework that avoids iteratively updating the data. We begin with the causal tree assumption on the data, whose formal description is given below.

Causal Tree Assumption. The data is represented by a causal tree, a directed acyclic graph with the following property: any two nodes in a causal tree are connected by at most one path.

This Assumption means that if
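A structural check of the Causal Tree Assumption on a known adjacency matrix might look like the sketch below. It reads "path" as an undirected path, which is an assumption on our part and the stricter reading: a graph that passes also has at most one directed path between any two nodes. The function name and the example graphs are hypothetical.

```python
def satisfies_causal_tree(B):
    """True if the skeleton of the graph encoded by B has no cycle.

    B[i][j] != 0 encodes an edge x_j -> x_i. Edge directions are
    ignored (assumed reading of "path"), so the test is whether any
    two nodes are joined by at most one undirected path.
    """
    n = len(B)
    parent = list(range(n))  # union-find forest over the nodes

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i in range(n):
        for j in range(n):
            if B[i][j] != 0:
                ri, rj = find(i), find(j)
                if ri == rj:
                    # i and j were already connected: a second path exists.
                    return False
                parent[ri] = rj
    return True


# Chain x0 -> x1 -> x2: a causal tree.
chain = [[0, 0, 0],
         [1, 0, 0],
         [0, 1, 0]]

# Diamond x0 -> x1 -> x3 and x0 -> x2 -> x3: two paths from x0 to x3.
diamond = [[0, 0, 0, 0],
           [1, 0, 0, 0],
           [1, 0, 0, 0],
           [0, 1, 1, 0]]

print(satisfies_causal_tree(chain))
print(satisfies_causal_tree(diamond))
```

Such a check only applies when the graph is known, for example when generating synthetic data; it does not test the assumption on observed samples.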

Synthetic data

To evaluate the GPL algorithm, simulations compare GPL with two representative existing algorithms, the Pairwise-LiNGAM algorithm and the DirectLiNGAM algorithm, both of which assign priority to root nodes.

  • DirectLiNGAM, specifically the kernel-based version proposed by Shimizu et al. (2011)

  • Pairwise-LiNGAM algorithm, with pairwise likelihood ratios estimated using a maximum entropy approximation (Hyvärinen & Smith, 2013).

The three algorithms are evaluated in accuracy and

Conclusions

By assigning priority to leaf nodes when estimating a causal ordering, we propose a method, namely the GPL algorithm, to discover the causal network structure for LiNGAM under the causal tree assumption. The GPL algorithm is capable of detecting the correct causal ordering without iteratively updating the data, which vastly improves performance in both computational complexity and accuracy, even with high-dimensional data (up to 200) or small sample sizes (down to 100 for the dimension of 70).

Acknowledgments

We would like to thank the Editors-in-Chief, the Action Editor, and two anonymous reviewers for their helpful comments and suggestions that greatly improved the quality of this work. This work was supported by the NSFC-Guangdong Joint Fund (U1501254), Natural Science Foundation of China (61876043, 61472089), Natural Science Foundation of Guangdong (2014A030306004, 2014A030308008), Science and Technology Planning Project of Guangdong (2013B051000076, 2015B010108006, 2015B010131015), Guangdong

References (30)

  • Bollen, Kenneth A. (1989). Structural equations with latent variables.
  • Cai, Ruichu, Qiao, Jie, Zhang, Zhenjie, & Hao, Zhifeng (2018). SELF: Structural equational likelihood framework for...
  • Cai, Ruichu, et al. Triad constraints for learning causal structure of latent variables.
  • Cai, Ruichu, Zhang, Zhenjie, & Hao, Zhifeng (2013). SADA: A general framework to support robust causation discovery. In...
  • Chang, Tianhorng, et al. (1993). Texture analysis and classification with tree-structured wavelet transform. IEEE Transactions on Image Processing.
  • Darmois, George (1953). Analyse générale des liaisons stochastiques: etude particulière de l’analyse factorielle linéaire. Revue de l’Institut International de Statistique.
  • Durbin, Richard, et al. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids.
  • Henao, Ricardo, et al. Bayesian sparse factor models and DAGs inference and comparison.
  • Himberg, Johan, et al. (2004). Validating the independent components of neuroimaging time series via clustering and visualization. Neuroimage.
  • Hoyer, Patrik O., et al. Bayesian discovery of linear acyclic causal models.
  • Hyvärinen, Aapo. New approximations of differential entropy for independent component analysis and projection pursuit.
  • Hyvärinen, Aapo (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks.
  • Hyvärinen, Aapo (2010). Pairwise measures of causal direction in linear non-Gaussian acyclic models. In ACML...
  • Hyvärinen, Aapo, et al. (2004). Independent component analysis, Vol. 46.
  • Hyvärinen, Aapo, et al. (2013). Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. Journal of Machine Learning Research (JMLR).