Abstract
Discovering causal relationships is a fundamental task in investigating the dynamics of complex systems (Pearl in Stat Surv 3:96–146, 2009). Traditional approaches such as Granger causality or transfer entropy fail to capture the full interdependence of the statistical moments, which may lead to wrong causal conclusions. In previous work (Chen et al. in 25th international conference, ICONIP 2018, Siem Reap, Cambodia, proceedings, Part II, 2018), the authors proposed a novel definition of Wiener causality for measuring the causal intervention between time series based on relative entropy, providing an integrated description of statistical causal intervention. In this work, we show that relative entropy is a special case of a more general class of divergences. We argue that any Bregman divergence can be used to detect causal relations and, in theory, remedies the information-dropout problem. We discuss the benefits of various choices of divergence function for causal inference and the quality of the obtained causal models. As a byproduct, we also carry out a robustness analysis and elucidate that RE causality achieves the fastest convergence rate among BD causalities. To substantiate our claims, we provide experimental evidence on how BD causalities improve detection accuracy.
Notes
In the high-dimensional case, similar results can be derived in the same fashion.
Also, \(GC_{y\rightarrow x}\) can be written as \(\ln \{\mathrm{tr}[\varSigma (x|x^{p})]/\mathrm{tr}[\varSigma (x|x^{p}\oplus y^{q})]\}\).
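This formula can be illustrated numerically in the scalar, one-lag case. The sketch below (not part of the paper; variable names are ours) simulates a pair of series in which y drives x, estimates the two residual variances by least squares, and compares the Granger causality in both directions:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
x = np.zeros(T)
y = np.zeros(T)
# y is an AR(1) process; x is driven by its own past and by y's past.
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + rng.standard_normal()
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + rng.standard_normal()

def residual_var(target, design):
    """Residual variance of a least-squares regression of target on design."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.var(target - design @ beta)

# One-lag predictors (p = q = 1); compare Sigma(x|x^p) with Sigma(x|x^p (+) y^q).
xp, yp, xt, yt = x[:-1, None], y[:-1, None], x[1:], y[1:]
gc_y_to_x = np.log(residual_var(xt, xp) / residual_var(xt, np.hstack([xp, yp])))
gc_x_to_y = np.log(residual_var(yt, yp) / residual_var(yt, np.hstack([yp, xp])))
print(gc_y_to_x, gc_x_to_y)  # the y -> x direction should dominate
```

Since y genuinely drives x, \(GC_{y\rightarrow x}\) is bounded away from zero while \(GC_{x\rightarrow y}\) shrinks toward zero with the sample size.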
References
Pearl J (2009) Causal inference in statistics: an overview. Stat Surv 3:96–146
Chen JY, Feng JF, Lu WL (2018) A Wiener causality defined by relative entropy. In: 25th International conference, ICONIP 2018, Siem Reap, Cambodia, proceedings, Part II
Aristotle (2018) Metaphysics: book iota. Clarendon Press, UK
Brock W (1991) Causality, chaos, explanation and prediction in economics and finance. In: Beyond belief: randomness, prediction and explanation in science, pp 230–279
Anscombe E (2018) Causality and determination. Agency and responsibility. Routledge, pp 57–73
Pearl J (1999) Causality: models, reasoning, and inference. Cambridge University Press, Cambridge
Wiener N (1956) The theory of prediction. In: Beckenbach EF (ed) Modern mathematics for engineers. McGraw-Hill, New York
Granger C (1969) Investigating causal relations by econometric models and cross-spectral methods. Econ J Econ Soc 37(3):424–438
Schreiber T (2000) Measuring information transfer. Phys Rev Lett 85(2):461
Valenza G, Faes L, Citi L, Orini M, Barbieri R (2018) Instantaneous transfer entropy for the study of cardiovascular and cardiorespiratory nonstationary dynamics. IEEE Trans Biomed Eng 65(5):1077–1085
Liang XS, Kleeman R (2005) Information transfer between dynamical system components. Phys Rev Lett 95(24):244101
Barnett L, Barrett AB, Seth AK (2009) Granger causality and transfer entropy are equivalent for Gaussian variables. Phys Rev Lett 103(23):238701
Ding M, Chen Y, Bressler S (2006) Handbook of time series analysis: recent theoretical developments and applications. Wiley, Weinheim
Seth AK, Barrett AB, Barnett L (2015) Granger causality analysis in neuroscience and neuroimaging. J Neurosci 35(8):3293–3297
Bregman LM (1967) The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput Math Math Phys 7(3):200–217
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
Cliff OM, Prokopenko M, Fitch R (2018) Minimising the Kullback–Leibler divergence for model selection in distributed nonlinear systems. Entropy 20(2):51
Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York
Si S, Tao D, Geng B (2010) Bregman divergence-based regularization for transfer subspace learning. IEEE Trans Knowl Data Eng 22(7):929–942
Amari SI (2009) Divergence, optimization and geometry. In: International conference on neural information processing, pp 185–193
Oseledec VI (1968) A multiplicative ergodic theorem: Lyapunov characteristic numbers for dynamical systems. Trans Mosc Math Soc 19:197–231
Hosoya Y (2001) Elimination of third-series effect and defining partial measures of causality. J Time Ser Anal 22(5):537–554
Rissanen JJ (1996) Fisher information and stochastic complexity. IEEE Trans Inf Theory 42(1):40–47
He SY (1998) Parameter estimation of hidden periodic model in random fields. Sci China A 42(3):238
Ma HF, Leng SY, Tao CY, Ying X, Kurths J, Lai YC, Lin W (2017) Detection of time delays and directional interactions based on time series from complex dynamical systems. Phys Rev E 96(1):012221
Reynolds D (2015) Gaussian mixture models. Encyclopedia of biometrics. Springer, Berlin, pp 827–832
Goldberger J, Gordon S, Greenspan H (2003) An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In: Proceedings ninth IEEE international conference on computer vision, Nice, France, pp 487–493
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B Methodol 39(1):1–22
Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13:281–305
Geweke J (1989) Bayesian inference in econometric models using Monte Carlo integration. Econometrica 57:1317–1339
Burda Y, Grosse R, Salakhutdinov R (2015) Importance weighted autoencoders. arXiv preprint arXiv:1509.00519
Chen LQ, Tao CY, Zhang RY, Henao R, Carin L (2018) Variational inference and model selection with generalized evidence bounds. In: International conference on machine learning, pp 892–901
Tao CY, Chen LQ, Henao R, Feng JF, Carin L (2018) Chi-square generative adversarial network. In: International conference on machine learning, pp 4894–4903
Tao CY, Dai SY, Chen LQ, Bai K, Chen JY, Liu C, Zhang RY, Bobashev G, Carin L (2019) Variational annealing of GANs: a Langevin perspective. In: International conference on machine learning, pp 6176–6185
Liu P, Zeng Z, Wang J (2016) Multistability analysis of a general class of recurrent neural networks with non-monotonic activation functions and time-varying delays. Neural Netw 79:117–127
Liu P, Zeng Z, Wang J (2017) Multiple Mittag–Leffler stability of fractional-order recurrent neural networks. IEEE Trans Syst Man Cybern Syst 47(8):2279–2288
Wu A, Zeng Z (2013) Lagrange stability of memristive neural networks with discrete and distributed delays. IEEE Trans Neural Netw Learn Syst 25(4):690–703
Fahrmeir L, Kaufmann H (1985) Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann Stat 13(1):342–368
This work was jointly supported by the National Natural Science Foundation of China under Grant No. 61673119, the Key Program of the National Science Foundation of China under Grant No. 91630314, the 111 Project (No. B18015), the Key Project of Shanghai Science and Technology under Grant No. 16JC1420402, Shanghai Municipal Science and Technology Major Project No. 2018SHZDZX01, and ZJ LAB.
Appendices
1.1 Proof of Theorem 1
Proof
By the LMS approach, we solve \(b_T = \arg \min _b \Vert X - Yb\Vert ^2 = (Y^\top Y)^{-1}Y^\top X\), which converges to \(b = {\mathbb {E}}\left[ y^\top y\right] ^{-1} {\mathbb {E}}\left[ y^\top x\right] \) as \(T\) goes to infinity.
Then, let \(\epsilon = x-y\cdot b_T\), where \(y\cdot b_T\) works as a predictor. Let p(x, y) be the joint distribution of (x, y), which is Gaussian as we assumed. Then, the joint density function of \((\epsilon , y)\) should be \(p(\epsilon +y\cdot b_T, y)\), which can be written as:
and the density function of y should be
Note
of which the inverse becomes:
with
Then, the logarithm of the conditional density of \(\epsilon \) w.r.t. y is:
by neglecting the terms without \(\epsilon \). Given the sample data X and Y, let \(\Psi = [\epsilon _1, \ldots , \epsilon _T]\) with \(\epsilon _i = x_i-b_T y_i\). By maximum likelihood, the conditional density of \(\epsilon \) w.r.t. Y becomes
which converges to
\(\square \)
1.2 Proof of Theorem 2
We start this section with a straightforward proposition:
Proposition 1
The expected value of the score function is zero.
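For completeness, a one-line verification (standard, assuming the regularity conditions needed to differentiate under the integral sign):

```latex
{\mathbb {E}}\left[ s(\theta |X)\right]
= \int p(x;\theta )\,\partial _\theta \log p(x;\theta )\,\mathrm{d}x
= \int \partial _\theta p(x;\theta )\,\mathrm{d}x
= \partial _\theta \int p(x;\theta )\,\mathrm{d}x
= \partial _\theta 1 = 0.
```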
From an information perspective, MLE can be justified as asymptotically minimizing the KL divergence between the density p(x) actually describing the data and the parameterized (approximating) density \(p_\theta (x)\).
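This equivalence can be sketched in one line: the first expectation below does not depend on \(\theta \), so minimizing the KL divergence is the same as maximizing the expected log-likelihood, which the sample average approximates:

```latex
\mathrm{KL}\left( p\,\Vert \,p_\theta \right)
= {\mathbb {E}}_{p}\left[ \log p(X)\right] - {\mathbb {E}}_{p}\left[ \log p_\theta (X)\right] ,
\qquad
\arg \min _\theta \mathrm{KL}\left( p\,\Vert \,p_\theta \right)
= \arg \max _\theta {\mathbb {E}}_{p}\left[ \log p_\theta (X)\right]
\approx \arg \max _\theta \frac{1}{n}\sum _{i=1}^{n}\log p_\theta (x_i).
```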
The MLE satisfies the following two appealing properties, called consistency and asymptotic normality [38]:
- Consistency: the estimator \(\theta _n\rightarrow \theta _0\) in probability as \(n\rightarrow \infty \), where \(\theta _0\) is the unknown target parameter of the sample distribution.
- Asymptotic normality: the estimator \(\theta _n\) and its target parameter \(\theta _0\) enjoy the following relation:
$$\begin{aligned} \sqrt{n}(\theta _n - \theta _0)\xrightarrow {D}N(0, I^{-1}(\theta _0)) \end{aligned}$$(18)
\(\theta _0\) maximizes the expected log-likelihood; namely, \(\theta _0 = \arg \max _\theta {\mathbb {E}}\left[ \log p(X;\theta )\right] \).
Let \(\theta _n\) be the MLE. Note that the MLE solves the score equation \(\frac{1}{n}\sum _{i=1}^{n} s(\theta _n|X_i) = 0\).
Since \(\theta _0\) is the maximizer of \({\mathbb {E}}\left[ \log p(X;\theta )\right] \), it also satisfies \({\mathbb {E}}\left[ s(\theta _0|X)\right] = 0\). Recall that \({\tilde{\theta }}_n\) is the MLE of the perturbed data; consider the following expansion:
For the first term,
as \({\tilde{\theta }}_n\rightarrow \theta _0\).
By the central limit theorem, the second term
where
Therefore, rearranging the quantities in Eq. (19),
Now we have already derived the asymptotic normality. The next step is to simplify the asymptotic variance.
First we focus on \(s_\theta (\theta _0|X_i)\):
For the first quantity,
Exchanging the positions of the derivative and the integral, we have:
Plugging (21) into Eq. (20) gives
Together with (18), \({\tilde{\theta }}_n\) is a consistent estimator of \(\theta _0\).
1.3 Proof of Theorem 3
Proof
Considering the Taylor expansion of \(p(\theta _n)\), we arrive at:
with \(\varDelta \theta _n = \theta _n - \theta _0\).
We first derive the asymptotic normality under the KL-divergence framework. Since the parameters \(\theta _n\) are obtained via MLE, which is equivalent to minimizing the KL divergence, the first derivatives of the KL divergence vanish:
By Taylor expansion we obtain up to second order:
where H stands for the Hessian matrix of the KL divergence.
Note that \(H(\theta _0)\) is equivalent to \(I(\theta _0)\):
where the equality in the last line arises from Proposition 1. Together with the asymptotic normality of \(\varDelta \theta _n\) in (18), we have:
as \(n\) goes to infinity.
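Collecting the pieces, this second-order behavior can be written out explicitly (a hedged reconstruction consistent with the vanishing first-order term and the identity \(H(\theta _0)=I(\theta _0)\) above, with \(\varDelta \theta _n = O_p(1/\sqrt{n})\) by (18) and \(d = \dim \theta \)):

```latex
\mathrm{KL}\bigl(p(\theta _0)\,\Vert \,p(\theta _n)\bigr)
\approx \frac{1}{2}\,\varDelta \theta _n^\top I(\theta _0)\,\varDelta \theta _n
= O_p(1/n),
\qquad
n\,\varDelta \theta _n^\top I(\theta _0)\,\varDelta \theta _n
\xrightarrow {D} \chi ^2_{d},
```

so the relative-entropy causality decays at the rate \(1/n\).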
Under any other (non-logarithmic) Bregman-divergence framework, in contrast, Taylor expansion yields:
as \(n\) goes to infinity.
By (22) and (23), the BD causality between \(p(\theta _n)\) and \(p(\theta _0)\) converges to zero asymptotically. However, the convergence rates differ: 1/n for relative-entropy causality and \(1/\sqrt{n}\) for a general Bregman-divergence causality. \(\square \)
Chen, J., Feng, J. & Lu, W. A Wiener Causality Defined by Divergence. Neural Process Lett 53, 1773–1794 (2021). https://doi.org/10.1007/s11063-019-10187-6