Abstract
We provide theoretical and empirical evidence for a type of asymmetry between causes and effects that is present when these are related via linear models contaminated with additive non-Gaussian noise. Assuming that the causes and the effects have the same distribution, we show that the distribution of the residuals of a linear fit in the anti-causal direction is closer to a Gaussian than the distribution of the residuals in the causal direction. This Gaussianization effect is characterized by a reduction in the magnitude of the high-order cumulants and by an increase in the differential entropy of the residuals. The problem of non-linear causal inference is addressed by performing an embedding in an expanded feature space, in which the relation between causes and effects can be assumed to be linear. The effectiveness of a method that discriminates between causes and effects based on this type of asymmetry is illustrated in a variety of experiments using different measures of Gaussianity. The proposed method is shown to be competitive with state-of-the-art techniques for causal inference.
The vast majority of this work was done while at the Max Planck Institute for Intelligent Systems and at Cambridge University.
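To make the procedure concrete, here is a minimal, hypothetical Python sketch of the decision rule for the bivariate linear case. It is not the authors' implementation: infer_direction and residuals are illustrative helpers, and the absolute excess kurtosis stands in for the Gaussianity measures studied in the chapter (high-order cumulants, differential entropy, energy distance); the kernel feature expansion used for the non-linear case is omitted.

```python
import numpy as np
from scipy.stats import kurtosis

# Sketch of the decision rule: fit a linear model in each direction and report
# as causal the direction whose residuals look *less* Gaussian. Absolute excess
# kurtosis is used here as a simple non-Gaussianity proxy (an assumption of
# this sketch, not the chapter's exact choice).

def residuals(a, b):
    """Residuals of the least-squares linear fit of b on a."""
    slope, intercept = np.polyfit(a, b, 1)
    return b - (slope * a + intercept)

def infer_direction(x, y):
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    g_xy = abs(kurtosis(residuals(x, y)))  # non-Gaussianity of residuals of y ~ x
    g_yx = abs(kurtosis(residuals(y, x)))  # non-Gaussianity of residuals of x ~ y
    # Residuals are expected to be more Gaussian in the anti-causal direction.
    return "x -> y" if g_xy > g_yx else "y -> x"

# Toy data with ground truth x -> y and non-Gaussian (uniform) cause and noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 5000)
y = 0.8 * x + 0.5 * rng.uniform(-1.0, 1.0, 5000)
print(infer_direction(x, y))  # expected to print "x -> y"
```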
Notes
1. We assume that such a matrix exists and that it is positive definite.
2. See https://www.codalab.org/competitions/1381 for more information.
3. See http://webdav.tuebingen.mpg.de/cause-effect/ for more details.
References
J. Beirlant, E. J. Dudewicz, L. Györfi, and E. C. Van Der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6 (1): 17–39, 1997.
Z. Chen, K. Zhang, and L. Chan. Nonlinear causal discovery for high dimensional data: A kernelized trace method. In IEEE 13th International Conference on Data Mining, pages 1003–1008, 2013.
Z. Chen, K. Zhang, L. Chan, and B. Schölkopf. Causal discovery via reproducing kernel Hilbert space embeddings. Neural Computation, 26 (7): 1484–1517, 2014.
E. A. Cornish and R. A. Fisher. Moments and cumulants in the specification of distributions. Revue de l’Institut International de Statistique / Review of the International Statistical Institute, 5 (4): 307–320, 1938.
H. Cramér. Mathematical Methods of Statistics. Princeton University Press, 1946.
D. Entner and P. O. Hoyer. Estimating a causal order among groups of variables in linear models. In International Conference on Artificial Neural Networks, pages 84–91, 2012.
A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592. 2008.
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13: 723–773, 2012.
J. M. Hernández-Lobato, P. Morales-Mombiela, and A. Suárez. Gaussianity measures for detecting the direction of causal time series. In International Joint Conference on Artificial Intelligence, pages 1318–1323, 2011.
P. O. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf. Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21, pages 689–696, 2009.
A. Hyvärinen. New approximations of differential entropy for independent component analysis and projection pursuit. In Advances in Neural Information Processing Systems 10, pages 273–279. MIT Press, 1998.
A. Hyvärinen and S. M. Smith. Pairwise likelihood ratios for estimation of non-Gaussian structural equation models. Journal of Machine Learning Research, 14 (1): 111–152, 2013.
A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2004.
D. Janzing, P. O. Hoyer, and B. Schölkopf. Telling cause from effect based on high-dimensional observations. In International Conference on Machine Learning, pages 479–486, 2010.
D. Janzing, J. M. Mooij, K. Zhang, J. Lemeire, J. Zscheischler, P. Daniušis, B. Steudel, and B. Schölkopf. Information-geometric approach to inferring causal directions. Artificial Intelligence, 182–183: 1–31, 2012.
Y. Kawahara, K. Bollen, S. Shimizu, and T. Washio. GroupLiNGAM: Linear non-Gaussian acyclic models for sets of variables. 2012. arXiv:1006.5041.
S. Kpotufe, E. Sgouritsa, D. Janzing, and B. Schölkopf. Consistency of causal inference under the additive noise model. In International Conference on Machine Learning, pages 478–486, 2014.
A. J. Laub. Matrix Analysis For Scientists And Engineers. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2004. ISBN 0898715768.
J. T. Marcinkiewicz. Sur une propriété de la loi de Gauss. Mathematische Zeitschrift, 44: 612–618, 1938.
P. McCullagh. Tensor methods in statistics. Chapman and Hall, 1987.
J. M. Mooij, O. Stegle, D. Janzing, K. Zhang, and B. Schölkopf. Probabilistic latent variable models for distinguishing between cause and effect. In Advances in Neural Information Processing Systems 23, pages 1687–1695. 2010.
P. Morales-Mombiela, D. Hernández-Lobato, and A. Suárez. Statistical tests for the detection of the arrow of time in vector autoregressive models. In International Joint Conference on Artificial Intelligence, 2013.
K. Murphy. Machine Learning: a Probabilistic Perspective. The MIT Press, 2012.
J. K. Patel and C. B. Read. Handbook of the normal distribution, volume 150. CRC Press, 1996.
J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, NY, USA, 2000. ISBN 0-521-77362-8.
B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2002. ISBN 0262194759.
B. Schölkopf, A. Smola, and K.-R. Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583–588. Springer, 1997.
S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen. A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7: 2003–2030, 2006.
H. Singh, N. Misra, V. Hnizdo, A. Fedorowicz, and E. Demchuk. Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences, 23 (3–4): 301–321, 2003.
L. Song, K. Fukumizu, and A. Gretton. Kernel embeddings of conditional distributions: A unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30: 98–111, 2013.
G. J. Székely and M. L. Rizzo. A new test for multivariate normality. Journal of Multivariate Analysis, 93 (1): 58–80, 2005.
G. J. Székely and M. L. Rizzo. Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143 (8): 1249–1272, 2013.
K. Zhang and A. Hyvärinen. On the identifiability of the post-nonlinear causal model. In International Conference on Uncertainty in Artificial Intelligence, pages 647–655, 2009.
Acknowledgements
Daniel Hernández-Lobato and Alberto Suárez gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at Universidad Autónoma de Madrid. These authors also acknowledge financial support from the Spanish Plan Nacional I+D+i, Grants TIN2013-42351-P and TIN2015-70308-REDT, and from Comunidad de Madrid, Grant S2013/ICE-2845 CASI-CAM-CM. David Lopez-Paz acknowledges support from Fundación la Caixa.
Appendices
Appendix 1
In this appendix we show that if \(\mathscr{X}\) and \(\mathscr{Y}\) follow the same distribution and have been centered, then the determinant of the covariance matrix of the random variable corresponding to \(\boldsymbol{\epsilon}_i\), denoted by \(\text{Cov}(\boldsymbol{\epsilon}_i)\), coincides with the determinant of the covariance matrix corresponding to the random variable \(\tilde{\boldsymbol{\epsilon}}_i\), denoted by \(\text{Cov}(\tilde{\boldsymbol{\epsilon}}_i)\).
From the causal model, i.e., \(\mathbf{y}_i = \mathbf{A}\mathbf{x}_i + \boldsymbol{\epsilon}_i\), we have that:
\[\text{Cov}(\boldsymbol{\epsilon}_i) = \text{Cov}(\mathscr{Y}) - \mathbf{A}\,\text{Cov}(\mathscr{X},\mathscr{Y}) - \text{Cov}(\mathscr{Y},\mathscr{X})\,\mathbf{A}^{\text{T}} + \mathbf{A}\,\text{Cov}(\mathscr{X})\,\mathbf{A}^{\text{T}}\,.\]
Since \(\mathscr{X}\) and \(\mathscr{Y}\) follow the same distribution, we have that \(\text{Cov}(\mathscr{Y})=\text{Cov}(\mathscr{X})\). Furthermore, we know from the causal model that \(\mathbf{A}=\text{Cov}(\mathscr{Y},\mathscr{X})\,\text{Cov}(\mathscr{X})^{-1}\). Then,
\[\text{Cov}(\boldsymbol{\epsilon}_i) = \text{Cov}(\mathscr{X}) - \text{Cov}(\mathscr{Y},\mathscr{X})\,\text{Cov}(\mathscr{X})^{-1}\,\text{Cov}(\mathscr{X},\mathscr{Y})\,.\]
In the case of \(\tilde{\boldsymbol{\epsilon}}_i\), the relation \(\tilde{\boldsymbol{\epsilon}}_i = (\mathbf{I}-\tilde{\mathbf{A}} \mathbf{A})\mathbf{x}_i - \tilde{\mathbf{A}} \boldsymbol{\epsilon}_i\) must be satisfied, where \(\tilde{\mathbf{A}}=\text{Cov}(\mathscr{X},\mathscr{Y})\,\text{Cov}(\mathscr{Y})^{-1}= \text{Cov}(\mathscr{X},\mathscr{Y})\,\text{Cov}(\mathscr{X})^{-1}\). Thus, we have that:
\[\text{Cov}(\tilde{\boldsymbol{\epsilon}}_i) = \text{Cov}(\mathscr{X}) - \text{Cov}(\mathscr{X},\mathscr{Y})\,\text{Cov}(\mathscr{X})^{-1}\,\text{Cov}(\mathscr{Y},\mathscr{X})\,.\]
By the matrix determinant theorem we have that \(\det \text{Cov}(\tilde{\boldsymbol{\epsilon}}_i)= \det \text{Cov}(\boldsymbol{\epsilon}_i)\). See [23, p. 117] for further details.
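The determinant identity can be checked numerically. The numpy sketch below is not part of the chapter: S is a random symmetric positive-definite matrix playing the role of Cov(X) = Cov(Y), and B an arbitrary matrix playing the role of Cov(Y, X); since the identity is purely algebraic, no further constraints on B are needed for the check.

```python
import numpy as np

# Numerical check of Appendix 1: with S = Cov(X) = Cov(Y) and B = Cov(Y, X),
#   Cov(eps_i)       = S - B   S^{-1} B^T   (causal direction)
#   Cov(eps_tilde_i) = S - B^T S^{-1} B     (anti-causal direction)
# have the same determinant by the matrix determinant theorem.

rng = np.random.default_rng(0)
d = 5

M = rng.standard_normal((d, d))
S = M @ M.T + d * np.eye(d)        # random symmetric positive-definite S
B = rng.standard_normal((d, d))    # arbitrary cross-covariance block

S_inv = np.linalg.inv(S)
cov_causal = S - B @ S_inv @ B.T       # Cov(eps_i)
cov_anticausal = S - B.T @ S_inv @ B   # Cov(eps_tilde_i)

# Both determinants agree up to floating-point error.
print(np.linalg.det(cov_causal), np.linalg.det(cov_anticausal))
```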
Appendix 2
In this appendix we motivate that, if the distribution of the residuals is not Gaussian but close to Gaussian, one should also expect more Gaussian residuals in the anti-causal direction in terms of the energy distance described in Sect. 8.3.4. For simplicity, we consider the univariate case. We use the fact that, in one dimension, the energy distance is the squared distance between the cumulative distribution function of the residuals and that of a Gaussian distribution [32]. Thus,
\[\tilde{D}^2 = \int_{-\infty}^{\infty} \big(\tilde{F}(x) - \Phi(x)\big)^2\, dx\,, \qquad D^2 = \int_{-\infty}^{\infty} \big(F(x) - \Phi(x)\big)^2\, dx\,,\]
where \(\tilde{D}^2\) and \(D^2\) are the energy distances to the Gaussian distribution in the anti-causal and the causal direction, respectively; \(\tilde{F}(x)\) and \(F(x)\) are the c.d.f.s of the residuals in the anti-causal and the causal direction, respectively; and \(\Phi(x)\) is the c.d.f. of a standard Gaussian.
One should expect that \(\tilde{D}^2 \leq D^2\). To motivate this, we use the Gram-Charlier series to expand \(\tilde{F}(x)\) and \(F(x)\) around the standard Gaussian distribution [24]. Such an expansion only converges for distributions that are close to being Gaussian (see Sect. 17.6.6a of [5] for further details). Namely,
\[\tilde{F}(x) = \Phi(x) - \phi(x)\sum_{n=3}^{\infty} \frac{\tilde{a}_n}{n!}\, H_{n-1}(x)\,, \qquad F(x) = \Phi(x) - \phi(x)\sum_{n=3}^{\infty} \frac{a_n}{n!}\, H_{n-1}(x)\,,\]
where \(\phi(x)\) is the p.d.f. of a standard Gaussian, \(H_n(x)\) are Hermite polynomials, and \(\tilde{a}_n\) and \(a_n\) are coefficients that depend on the cumulants, e.g., \(a_3 = \kappa_3\), \(a_4 = \kappa_4\), \(\tilde{a}_3=\tilde{\kappa}_3\), \(\tilde{a}_4=\tilde{\kappa}_4\). Note, however, that the coefficients \(a_n\) and \(\tilde{a}_n\) for \(n > 5\) depend on combinations of the cumulants. Using such an expansion we find:
\[\tilde{D}^2 \approx \frac{\tilde{\kappa}_3^2}{36}\, \mathbb{E}\big[H_2(x)^2 \phi(x)\big] + \frac{\tilde{\kappa}_4^2}{576}\, \mathbb{E}\big[H_3(x)^2 \phi(x)\big]\,,\]
where \(\mathbb{E}[\cdot]\) denotes expectation with respect to a standard Gaussian and the Gram-Charlier expansion has been truncated after \(n = 4\). Truncating the expansion after \(n = 4\) is a standard procedure in the ICA literature for approximating the entropy; see, for example, Sect. 5.5.1 of [13]. We have also used the fact that \(\mathbb{E}[H_3(x)H_2(x)\phi(x)] = 0\). The same approach can be followed for \(D^2\), the energy distance in the causal direction, which gives \(D^2\approx \kappa_3^2/36 \cdot \mathbb{E}[H_2(x)^2\phi(x)] + \kappa_4^2/576 \cdot \mathbb{E}[H_3(x)^2\phi(x)]\). Finally, the expectation that \(\tilde{D}^2\leq D^2\) follows by noting that \(\tilde{\kappa}_n = c_n \kappa_n\), where \(c_n\) is a constant in the interval \((-1, 1)\), as indicated in Sect. 8.2.1. We expect this result to extend to the multivariate case. The sketch below illustrates this argument numerically.
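The truncated approximation above is easy to evaluate. The following Python sketch is illustrative and not from the chapter: gc_energy_distance is a hypothetical helper and the cumulant values are made up. It computes the two Gaussian expectations with scipy and shows that scaling the causal cumulants by constants in \((-1, 1)\), as happens in the anti-causal direction, shrinks the approximate energy distance.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# Truncated Gram-Charlier approximation of the 1-D energy distance to a
# standard Gaussian:
#   D^2 ~ k3^2/36 * E[H2(x)^2 phi(x)] + k4^2/576 * E[H3(x)^2 phi(x)],
# with the expectation taken under a standard Gaussian.

H2 = lambda x: x**2 - 1          # probabilists' Hermite polynomials
H3 = lambda x: x**3 - 3 * x
phi = norm.pdf

# E[H_n(x)^2 phi(x)] under a standard Gaussian = int H_n(x)^2 phi(x)^2 dx.
e_h2 = quad(lambda x: H2(x)**2 * phi(x)**2, -np.inf, np.inf)[0]
e_h3 = quad(lambda x: H3(x)**2 * phi(x)**2, -np.inf, np.inf)[0]

def gc_energy_distance(k3, k4):
    """Truncated approximation of D^2 for residual cumulants k3, k4."""
    return k3**2 / 36.0 * e_h2 + k4**2 / 576.0 * e_h3

k3, k4 = 0.8, 1.2   # illustrative cumulants of the causal residuals
c3, c4 = 0.6, 0.5   # anti-causal cumulants are scaled by constants in (-1, 1)
print(gc_energy_distance(k3, k4))            # D^2, causal direction
print(gc_energy_distance(c3 * k3, c4 * k4))  # D~^2, anti-causal direction (smaller)
```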
Appendix 3
In this appendix we motivate that one should also expect more Gaussian residuals in the anti-causal direction, based on a reduction of the cumulants, when the residuals in feature space are projected onto the first principal component, that is, when they are multiplied by the first eigenvector of the covariance matrix of the residuals and scaled by the corresponding eigenvalue. Recall from Sect. 8.2.2 that these covariance matrices are \(\mathbf{C} = \mathbf{I} - \mathbf{A}\mathbf{A}^{\text{T}}\) and \(\tilde{\mathbf{C}}=\mathbf{I} - \mathbf{A}^{\text{T}} \mathbf{A}\) in the causal and the anti-causal direction, respectively. Note that both matrices have the same eigenvalues.
If \(\mathbf{A}\) is symmetric, both \(\mathbf{C}\) and \(\tilde{\mathbf{C}}\) have the same matrix of eigenvectors \(\mathbf{P}\). Let \(\mathbf{p}_1^n\) denote the Kronecker product of the first eigenvector with itself \(n\) times. The cumulants in the anti-causal and the causal direction, after projecting the data onto the first eigenvector, are \(\tilde{\kappa}_n^{\text{proj}} = (\mathbf{p}_1^{n})^{\text{T}} \mathbf{M}_n \text{vect}(\kappa_n) = c\,(\mathbf{p}_1^{n})^{\text{T}} \text{vect}(\kappa_n)\) and \(\kappa_n^{\text{proj}} = (\mathbf{p}_1^{n})^{\text{T}} \text{vect}(\kappa_n)\), respectively, where \(\mathbf{M}_n\) is the matrix that relates the cumulants in the causal and the anti-causal direction (see Sect. 8.2.2) and \(c\) is one of the eigenvalues of \(\mathbf{M}_n\). In particular, if \(\mathbf{A}\) is symmetric, it is not difficult to show that \(\mathbf{p}_1^{n}\) is one of the eigenvectors of \(\mathbf{M}_n\). Furthermore, we also showed that, in that case, \(||\mathbf{M}_n||_{\text{op}} < 1\) for \(n \geq 3\) (see Sect. 8.2.2). The consequence is that \(c \in (-1, 1)\), which, combined with the fact that \(||\mathbf{p}_1^n||=1\), leads to projected residual cumulants of smaller magnitude in the anti-causal direction.
If \(\mathbf{A}\) is not symmetric, we motivate that one should still expect more Gaussian residuals in the anti-causal direction, due to a reduction in the magnitude of the cumulants, by deriving a smaller upper bound on their magnitude. The bound is based on the operator norm of vectors.
Definition 8.2
The operator norm of a vector \(\mathbf{w}\) induced by the \(\ell_p\) norm is \(||\mathbf{w}||_{\text{op}} = \min\{c \geq 0 : ||\mathbf{w}^{\text{T}}\mathbf{v}||_p \leq c\,||\mathbf{v}||_p, \forall \mathbf{v}\}\).
The consequence is that \(||\mathbf{w}||_{\text{op}} \geq ||\mathbf{w}^{\text{T}}\mathbf{v}||_p / ||\mathbf{v}||_p\), \(\forall \mathbf{v}\). Thus, the smaller the operator norm of \(\mathbf{w}\), the smaller the value of \(||\mathbf{w}^{\text{T}}\mathbf{v}||_p\) that can be obtained for any given vector \(\mathbf{v}\). Furthermore, in the case of the \(\ell_2\)-norm it is clear that \(||\mathbf{w}||_{\text{op}} = ||\mathbf{w}||_2\). From the previous paragraph, in the anti-causal direction we have \(||\tilde{\kappa}_n^{\text{proj}}||_2= ||(\tilde{\mathbf{p}}_1^n)^{\text{T}} \mathbf{M}_n \text{vect}(\kappa_n)||_2\), where \(\tilde{\mathbf{p}}_1\) is the first eigenvector of \(\tilde{\mathbf{C}}\), while in the causal direction we have \(||\kappa_n^{\text{proj}}||_2= ||(\mathbf{p}_1^n)^{\text{T}} \text{vect}(\kappa_n)||_2\), where \(\mathbf{p}_1\) is the first eigenvector of \(\mathbf{C}\). Because the norm of each vector \(\tilde{\mathbf{p}}_1^n\) and \(\mathbf{p}_1^n\) is one, we have that \(||\mathbf{p}_1^n||_{\text{op}}=1\). However, because we expect \(\mathbf{M}_n\) to reduce the norm of \((\tilde{\mathbf{p}}_1^n)^{\text{T}}\), as motivated in Sect. 8.2.2, \(||(\tilde{\mathbf{p}}_1^n)^{\text{T}}\mathbf{M}_n||_{\text{op}} < 1\) should follow. This is expected to lead to smaller cumulants in magnitude in the anti-causal direction. A numerical illustration of this operator-norm argument is sketched below.
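As an illustration (not part of the chapter), the following numpy sketch checks the operator-norm argument under stated stand-ins: a random matrix M with spectral norm below one plays the role of \(\mathbf{M}_n\), a random unit vector the role of \(\mathbf{p}_1^n\) (or \(\tilde{\mathbf{p}}_1^n\)), and a random vector v the role of \(\text{vect}(\kappa_n)\). The bound on the projected anti-causal cumulant is strictly smaller than the bound on the projected causal one.

```python
import numpy as np

# Sketch of the operator-norm argument in Appendix 3 (illustrative only).

rng = np.random.default_rng(1)
d = 6

M = rng.standard_normal((d, d))
M *= 0.9 / np.linalg.norm(M, 2)        # rescale so that ||M||_op = 0.9 < 1

p = rng.standard_normal(d)
p /= np.linalg.norm(p)                 # unit vector: ||p||_op = ||p||_2 = 1

v = rng.standard_normal(d)             # stands in for vect(kappa_n)

bound_causal = np.linalg.norm(p) * np.linalg.norm(v)          # ||p||_op ||v||_2
bound_anticausal = np.linalg.norm(p @ M) * np.linalg.norm(v)  # ||p^T M||_op ||v||_2

print(bound_causal, bound_anticausal)  # the anti-causal bound is strictly smaller
print(abs(p @ v), abs(p @ M @ v))      # projected cumulants respect these bounds
```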
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Hernández-Lobato, D., Morales-Mombiela, P., Lopez-Paz, D., Suárez, A. (2019). Non-linear Causal Inference Using Gaussianity Measures. In: Guyon, I., Statnikov, A., Batu, B. (eds) Cause Effect Pairs in Machine Learning. The Springer Series on Challenges in Machine Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-21810-2_8
DOI: https://doi.org/10.1007/978-3-030-21810-2_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-21809-6
Online ISBN: 978-3-030-21810-2