Abstract
Discovering causal relationships is a fundamental task in investigating the dynamics of complex systems (Pearl in Stat Surv 3:96–146, 2009). Traditional approaches such as Granger causality or transfer entropy fail to capture the full interdependence of the statistical moments, which may lead to wrong causal conclusions. In previous work (Chen et al. in 25th international conference, ICONIP 2018, Siem Reap, Cambodia, proceedings, Part II, 2018), the authors proposed a novel definition of Wiener causality for measuring the causal intervention between time series based on relative entropy, providing an integrated description of statistical causal intervention. In this work, we show that relative entropy is a special case of a more general class of divergences. We argue that any Bregman divergence can be used to detect causal relations and, in theory, remedies the information-dropout problem. We discuss the benefits of various choices of divergence function for causal inference and the quality of the obtained causal models. As a byproduct, we also carry out a robustness analysis and elucidate that RE causality achieves the fastest convergence rate among BD causalities. To substantiate our claims, we provide experimental evidence on how BD causalities improve detection accuracy.
Notes
In the high-dimensional case, similar results can be derived in the same fashion.
Also, \(GC_{y\rightarrow x}\) can be written as \(\ln \{\mathrm{tr}[\varSigma (x|x^{p})]/\mathrm{tr}[\varSigma (x|x^{p}\oplus y^{q})]\}\).
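This formula can be illustrated numerically in the scalar, one-lag case. The sketch below (not part of the paper; variable names are ours) simulates a pair of series in which y drives x, estimates the two residual variances by least squares, and compares the Granger causality in both directions:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
x = np.zeros(T)
y = np.zeros(T)
# y is an AR(1) process; x is driven by its own past and by y's past.
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + rng.standard_normal()
    x[t] = 0.5 * x[t - 1] + 0.8 * y[t - 1] + rng.standard_normal()

def residual_var(target, design):
    """Residual variance of a least-squares regression of target on design."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.var(target - design @ beta)

# One-lag predictors (p = q = 1); compare Sigma(x|x^p) with Sigma(x|x^p (+) y^q).
xp, yp, xt, yt = x[:-1, None], y[:-1, None], x[1:], y[1:]
gc_y_to_x = np.log(residual_var(xt, xp) / residual_var(xt, np.hstack([xp, yp])))
gc_x_to_y = np.log(residual_var(yt, yp) / residual_var(yt, np.hstack([yp, xp])))
print(gc_y_to_x, gc_x_to_y)  # the y -> x direction should dominate
```

Since y genuinely drives x, \(GC_{y\rightarrow x}\) is bounded away from zero while \(GC_{x\rightarrow y}\) shrinks toward zero with the sample size.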
References
Pearl J (2009) Causal inference in statistics: an overview. Stat Surv 3:96–146
Chen JY, Feng JF, Lu WL (2018) A Wiener causality defined by relative entropy. In: 25th International conference, ICONIP 2018, Siem Reap, Cambodia, proceedings, Part II
Aristotle (2018) Metaphysics: book iota. Clarendon Press, UK
Brock W (1991) Causality, chaos, explanation and prediction in economics and finance. In: Beyond belief: randomness, prediction and explanation in science, pp 230–279
Anscombe E (2018) Causality and determination. Agency and responsibility. Routledge, pp 57–73
Pearl J (1999) Causality: models, reasoning, and inference. Cambridge University Press, Cambridge
Wiener N (1956) The theory of prediction. In: Beckenbach EF (ed) Modern mathematics for engineers. McGraw-Hill, New York
Granger C (1969) Investigating causal relations by econometric models and cross-spectral methods. Econ J Econ Soc 37(3):424–438
Schreiber T (2000) Measuring information transfer. Phys Rev Lett 85(2):461
Valenza G, Faes L, Citi L, Orini M, Barbieri R (2018) Instantaneous transfer entropy for the study of cardiovascular and cardiorespiratory nonstationary dynamics. IEEE Trans Biomed Eng 65(5):1077–1085
Liang XS, Kleeman R (2005) Information transfer between dynamical system components. Phys Rev Lett 95(24):244101
Barnett L, Barrett AB, Seth AK (2009) Granger causality and transfer entropy are equivalent for Gaussian variables. Phys Rev Lett 103(23):238701
Ding M, Chen Y, Bressler S (2006) Handbook of time series analysis: recent theoretical developments and applications. Wiley, Weinheim
Seth AK, Barrett AB, Barnett L (2015) Granger causality analysis in neuroscience and neuroimaging. J Neurosci 35(8):3293–3297
Bregman LM (1967) The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput Math Math Phys 7(3):200–217
Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86
Cliff OM, Prokopenko M, Fitch R (2018) Minimising the Kullback–Leibler divergence for model selection in distributed nonlinear systems. Entropy 20(2):51
Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York
Si S, Tao D, Geng B (2010) Bregman divergence-based regularization for transfer subspace learning. IEEE Trans Knowl Data Eng 22(7):929–942
Amari SI (2009) Divergence, optimization and geometry. In: International conference on neural information processing, pp 185–193
Oseledec VI (1968) A multiplicative ergodic theorem: Lyapunov characteristic numbers for dynamical systems. Trans Mosc Math Soc 19:197–231
Hosoya Y (2001) Elimination of third-series effect and defining partial measures of causality. J Time Ser Anal 22(5):537–554
Rissanen JJ (1996) Fisher information and stochastic complexity. IEEE Trans Inf Theory 42(1):40–47
He SY (1998) Parameter estimation of hidden periodic model in random fields. Sci China A 42(3):238
Ma HF, Leng SY, Tao CY, Ying X, Kurths J, Lai YC, Lin W (2017) Detection of time delays and directional interactions based on time series from complex dynamical systems. Phys Rev E 96(1):012221
Reynolds D (2015) Gaussian mixture models. Encyclopedia of biometrics. Springer, Berlin, pp 827–832
Goldberger J, Gordon S, Greenspan H (2003) An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In: Proceedings ninth IEEE international conference on computer vision, Nice, France, pp 487–493
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B Methodol 39(1):1–22
Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13:281–305
Geweke J (1989) Bayesian inference in econometric models using Monte Carlo integration. Econometrica 57:1317–1339
Burda Y, Grosse R, Salakhutdinov R (2015) Importance weighted autoencoders. arXiv preprint arXiv:1509.00519
Chen LQ, Tao CY, Zhang RY, Henao R, Carin L (2018) Variational inference and model selection with generalized evidence bounds. In: International conference on machine learning, pp 892–901
Tao CY, Chen LQ, Henao R, Feng JF, Carin L (2018) Chi-square generative adversarial network. In: International conference on machine learning, pp 4894–4903
Tao CY, Dai SY, Chen LQ, Bai K, Chen JY, Liu C, Zhang RY, Bobashev G, Carin L (2019) Variational annealing of GANs: a Langevin perspective. In: International conference on machine learning, pp 6176–6185
Liu P, Zeng Z, Wang J (2016) Multistability analysis of a general class of recurrent neural networks with non-monotonic activation functions and time-varying delays. Neural Netw 79:117–127
Liu P, Zeng Z, Wang J (2017) Multiple Mittag–Leffler stability of fractional-order recurrent neural networks. IEEE Trans Syst Man Cybern Syst 47(8):2279–2288
Wu A, Zeng Z (2013) Lagrange stability of memristive neural networks with discrete and distributed delays. IEEE Trans Neural Netw Learn Syst 25(4):690–703
Fahrmeir L, Kaufmann H (1985) Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. Ann Stat 13(1):342–368
This work was jointly supported by the National Natural Science Foundation of China under Grant No. 61673119, the Key Program of the National Science Foundation of China under Grant No. 91630314, the 111 Project (No. B18015), the Key Project of Shanghai Science and Technology under Grant No. 16JC1420402, Shanghai Municipal Science and Technology Major Project No. 2018SHZDZX01, and ZJ LAB.
Appendices
1.1 Proof of Theorem 1
Proof
By the LMS approach, we solve \(b_T = \arg \min _b \Vert X - Yb\Vert ^2 = (Y^\top Y)^{-1}Y^\top X\), which converges to \(b = {\mathbb {E}}\left[ y^\top y\right] ^{-1} {\mathbb {E}}\left[ y^\top x\right] \) as \(T\) goes to infinity.
Then, let \(\epsilon = x-y\cdot b_T\), where \(y\cdot b_T\) works as a predictor. Let p(x, y) be the joint distribution of (x, y), which is Gaussian as we assumed. Then, the joint density function of \((\epsilon , y)\) should be \(p(\epsilon +y\cdot b_T, y)\), which can be written as:
and the density function of y should be
Note
of which the inverse becomes:
with
Then, the logarithm of the conditional density of \(\epsilon \) w.r.t. y is:
by neglecting the terms without \(\epsilon \). Given the sample data X and Y, let \(\Psi = [\epsilon _1, \ldots , \epsilon _T]\) with \(\epsilon _i = x_i-b_T y_i\). By maximum likelihood, the conditional density of \(\epsilon \) w.r.t. Y becomes
which converges to
\(\square \)
1.2 Proof of Theorem 2
We start this section with a straightforward proposition:
Proposition 1
The expected value of the score function is zero.
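For completeness, a one-line verification (standard, assuming the regularity conditions needed to differentiate under the integral sign):

```latex
{\mathbb {E}}\left[ s(\theta |X)\right]
= \int p(x;\theta )\,\partial _\theta \log p(x;\theta )\,\mathrm{d}x
= \int \partial _\theta p(x;\theta )\,\mathrm{d}x
= \partial _\theta \int p(x;\theta )\,\mathrm{d}x
= \partial _\theta 1 = 0.
```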
From an information perspective, MLE can be justified as asymptotically minimizing the KL divergence between the density p(x) actually describing the data and the parameterized (approximating) density \(p_\theta (x)\).
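This equivalence can be sketched in one line: the first expectation below does not depend on \(\theta \), so minimizing the KL divergence is the same as maximizing the expected log-likelihood, which the sample average approximates:

```latex
\mathrm{KL}\left( p\,\Vert \,p_\theta \right)
= {\mathbb {E}}_{p}\left[ \log p(X)\right] - {\mathbb {E}}_{p}\left[ \log p_\theta (X)\right] ,
\qquad
\arg \min _\theta \mathrm{KL}\left( p\,\Vert \,p_\theta \right)
= \arg \max _\theta {\mathbb {E}}_{p}\left[ \log p_\theta (X)\right]
\approx \arg \max _\theta \frac{1}{n}\sum _{i=1}^{n}\log p_\theta (x_i).
```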
The MLE satisfies the following two appealing properties, called consistency and asymptotic normality [38]:
- Consistency: the estimator \(\theta _n\rightarrow \theta _0\) in probability as \(n\rightarrow \infty \), where \(\theta _0\) is the unknown target parameter of the sample distribution.
- Asymptotic normality: the estimator \(\theta _n\) and its target parameter \(\theta _0\) enjoy the following relation:
$$\begin{aligned} \sqrt{n}(\theta _n - \theta _0)\xrightarrow {D}N(0, I^{-1}(\theta _0)) \end{aligned}$$(18)
\(\theta _0\) maximizes the expected log-likelihood; namely, \(\theta _0 = \arg \max _\theta {\mathbb {E}}\left[ \log p(X;\theta )\right] \).
Let \(\theta _n\) be the MLE. Note that the MLE solves the score equation \(\frac{1}{n}\sum _{i=1}^{n} s(\theta _n|X_i) = 0\).
Since \(\theta _0\) is the maximizer of \({\mathbb {E}}\left[ \log p(X;\theta )\right] \), it also satisfies \({\mathbb {E}}\left[ s(\theta _0|X)\right] = 0\). Recall that \({\tilde{\theta }}_n\) is the MLE of the perturbed data; consider the following expansion:
For the first term,
as \({\tilde{\theta }}_n\rightarrow \theta _0\).
By the central limit theorem, the second term
where
Therefore, rearranging the quantities in Eq. (19),
Now we have already derived the asymptotic normality. The next step is to simplify the asymptotic variance.
First we focus on \(s_\theta (\theta _0|X_i)\):
For the first quantity,
Exchanging the positions of the derivative and the integral, we have:
Plugging (21) into Eq. (20) gives
Together with (18), \({\tilde{\theta }}_n\) is a consistent estimator of \(\theta _0\).
1.3 Proof of Theorem 3
Proof
Considering the Taylor expansion of \(p(\theta _n)\), we arrive at:
with \(\varDelta \theta _n = \theta _n - \theta _0\).
We first derive the asymptotic normality under the KL-divergence framework. Since the parameters \(\theta _n\) are obtained via MLE, which is equivalent to minimizing the KL divergence, the first derivatives of the KL divergence vanish:
By Taylor expansion we obtain up to second order:
where H stands for the Hessian matrix of the KL divergence.
Note that \(H(\theta _0)\) is equivalent to \(I(\theta _0)\):
where the equality in the last line arises from Proposition 1. Together with the asymptotic normality of \(\varDelta \theta _n\) in (18), we have:
as \(n\) goes to infinity.
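Collecting the pieces, this second-order behavior can be written out explicitly (a hedged reconstruction consistent with the vanishing first-order term and the identity \(H(\theta _0)=I(\theta _0)\) above, with \(\varDelta \theta _n = O_p(1/\sqrt{n})\) by (18) and \(d = \dim \theta \)):

```latex
\mathrm{KL}\bigl(p(\theta _0)\,\Vert \,p(\theta _n)\bigr)
\approx \frac{1}{2}\,\varDelta \theta _n^\top I(\theta _0)\,\varDelta \theta _n
= O_p(1/n),
\qquad
n\,\varDelta \theta _n^\top I(\theta _0)\,\varDelta \theta _n
\xrightarrow {D} \chi ^2_{d},
```

so the relative-entropy causality decays at the rate \(1/n\).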
Under any other (non-logarithmic) Bregman-divergence framework, in contrast, Taylor expansion yields:
as \(n\) goes to infinity.
By (22) and (23), the BD causality between \(p(\theta _n)\) and \(p(\theta _0)\) converges to zero asymptotically. However, the convergence rates differ: 1/n for relative-entropy causality and \(1/\sqrt{n}\) for a general Bregman-divergence causality. \(\square \)
Chen, J., Feng, J. & Lu, W. A Wiener Causality Defined by Divergence. Neural Process Lett 53, 1773–1794 (2021). https://doi.org/10.1007/s11063-019-10187-6