1 Introduction

The concept of dependence among random observations plays a central role in many fields, including statistics, medicine, biology and engineering, among others. Given the complexity inherent in fully understanding dependencies, the strength of these relationships is often distilled into a single metric: the correlation coefficient. Numerous types of correlation coefficients exist, but perhaps the most widely known is Pearson’s correlation coefficient. Alternative measures of correlation also exist. For instance, rank correlation assesses the relationship between the rankings of two variables or the rankings of the same variable across different conditions. Examples of rank correlation coefficients include Spearman’s rank correlation coefficient, Kendall’s Tau correlation coefficient, and Goodman and Kruskal’s gamma.

Pearson’s correlation coefficient has some disadvantages. For example, a Pearson’s correlation coefficient of zero does not imply independence between two variables, since the coefficient only captures linear dependence and the variables may have a non-linear relationship. In recent years, a novel measure of dependence between random vectors, called distance correlation, was introduced by [1]. According to these authors, distance covariance and distance correlation share a parallel with product-moment covariance and correlation. However, unlike the classical definition of correlation, distance correlation is zero solely when the random vectors are independent. Moreover, distance correlation can be used to evaluate both linear and non-linear relationships between variables. Essentially, for all distributions with finite first moments, distance correlation (\(\mathcal {R}\)) extends the notion of correlation in two key ways:

  1. \(\mathcal {R}(X,Y)\) is defined for X and Y in arbitrary dimensions;

  2. \(\mathcal {R}(X,Y) = 0\) characterizes the independence of X and Y.

Distance correlation satisfies \(0 \le \mathcal {R}\le 1\), and \(\mathcal {R}= 0\) if and only if X and Y are independent.

Distance correlation has been applied and extended to a wide variety of fields, such as variable selection [7,8,9,10], as well as in disciplines like biology [11, 12] and medicine [13, 14], among others. It has also been explored within high-dimensional contexts [15, 16]. Furthermore, the applicability of distance correlation was extended to address the challenge of testing the independence of high-dimensional random vectors in [6]. Similarly, the concept of partial distance correlation was introduced by [2]. Distance correlation has been extended to encompass conditional distance correlation [16,17,18,19], and it has also been examined in the realm of survival data [20,21,22,23,24]. Lastly, [25] have demonstrated that for any fixed Pearson correlation coefficient strictly between -1 and 1, the distance correlation coefficient can attain any value within the open unit interval (0,1).

The estimation of distance correlation relies on the estimation of distance covariance. [5] demonstrated the uniqueness of distance covariance. [1] proposed a sample distance covariance estimator, and they showed that it is a V-statistic. This estimator of the distance covariance gives rise to the so-called V-estimator of the distance correlation. Moreover, intermediate results provided by [2] led to an unbiased estimator of the squared distance covariance. This unbiased estimator was further identified as a U-statistic in [3]. This distance covariance estimator results in the U-estimator of the distance correlation.

The two estimators of the distance correlation, the U-estimator and the V-estimator, have different properties. The direct implementation of the V-estimator results in computational complexity that scales as \(\mathcal {O}(n^2)\), whereas the U-estimator can be computed with complexity \(\mathcal {O}(n \log n)\). In fact, both the U-statistic and the V-statistic of distance correlation can be calculated in \(\mathcal {O}(n \log n)\) steps, since the algorithm described by [3] can be straightforwardly extended to the V-statistic version. Both estimators exhibit good asymptotic properties, although the U-estimator allows for easier derivation of these properties.

One scenario of interest when working with distance correlation is that of independence, when both distance correlation and distance covariance are zero. It is under independence, or in situations close to it, where the two estimators exhibit the most significant differences. Firstly, the U-estimator of the squared distance covariance is unbiased, leading to occurrences of negative squared distance covariance values, which preclude the computation of the U-estimator of distance correlation. Conversely, the V-estimator is biased, as it can only take positive values. However, to the best of our knowledge, the choice between the two distance correlation estimators does not seem to rest on any specific criterion. For example, while [23] employ the U-estimator to propose an extension for right-censored data, [17] suggest an extension for conditional distance correlation using both estimators.

This paper describes a simulation study that was conducted to assess the practical behavior and efficiency of each estimator under different dependence models. The experimental results show that neither is consistently better, making the choice of estimator in practice a challenging task. To address this issue, a new approach, specifically a convex linear combination of the aforementioned estimators, is introduced and studied.

The remainder of this paper is organized as follows. Section 2 introduces the preliminaries, defining distance covariance and distance correlation. It also presents the estimators of distance covariance and distance correlation described in the literature, and introduces some of the existing packages developed in R and Python. The occurrence of negative values of the squared distance covariance estimator, which precludes calculation of the U-estimator of distance correlation, is studied, and two modifications of the U-estimator are proposed to handle that problem. Section 3 introduces a convex linear combination approach to address the estimator choice problem. Section 4 presents the results of the simulation study for three models: Farlie-Gumbel-Morgenstern (FGM), a bivariate normal model, and a non-linear model. The results are compared in terms of efficiency, mean squared error (MSE), bias, variance, and computational time for each estimator. Additionally, a comprehensive comparison between the original estimators and the proposed alternatives is presented. Section 5 analyzes and discusses the results of an application to a real dataset. Finally, Section 6 provides the concluding remarks.

2 Distance correlation estimation

Let \(X \in \mathbb {R}^p\) and \(Y \in \mathbb {R}^q\) be random vectors, where p and q are positive integers. The characteristic functions of X and Y are denoted as \(\phi _X\) and \(\phi _Y\), respectively, and the joint characteristic function of X and Y is denoted as \(\phi _{X,Y}\).

Definition 1

The distance covariance between random vectors X and Y with finite first moments is the non-negative square root of the number \(\mathcal {V}^2(X, Y)\), defined by:

$$\begin{aligned} \mathcal {V}^2(X,Y)= & \Vert \phi _{X,Y}(t,s)-\phi _X(t)\phi _Y(s) \Vert ^2 \\= & \frac{1}{c_pc_q}\int _{\mathbb {R}^{p+q}} \frac{|\phi _{X,Y}(t,s)-\phi _X(t)\phi _Y(s)|^2}{|t|^{1+p}_p|s|^{1+q}_q}dt\,ds, \end{aligned}$$

where \(c_d = \frac{\pi ^{(1+d)/2}}{\Gamma ((1+d)/2)}\), \(|t|^{1+p}_p = ||t||^{1+p}, |s|^{1+q}_q = ||s||^{1+q}\). Similarly, distance variance is defined as the square root of

$$\begin{aligned} \mathcal {V}^2(X) = \mathcal {V}^2(X,X) = \Vert \phi _{X,X}(t,s)-\phi _X(t)\phi _X(s) \Vert ^2. \end{aligned}$$

Definition 2

The distance correlation between two random vectors X and Y with finite first moments is the positive square root of the non-negative number \(\mathcal {R}^2(X,Y)\) defined by

$$\begin{aligned} \mathcal {R}^2(X,Y)= & {\left\{ \begin{array}{ll} \frac{\mathcal {V}^2(X,Y)}{\sqrt{\mathcal {V}^2(X)\mathcal {V}^2(Y)}}, & \mathcal {V}^2(X)\mathcal {V}^2(Y) > 0 \\ 0, & \mathcal {V}^2(X)\mathcal {V}^2(Y) = 0. \end{array}\right. } \end{aligned}$$
(1)

An equivalent form for computing distance covariance through expectations is given by [1]; that is, if \(E|X|^2_p < \infty \) and \(E|Y|^2_q < \infty \), then \(E[|X|_p|Y|_q] < \infty \), and

$$\begin{aligned} \mathcal {V}^2(X,Y)= & E[|X_1-X_2|_p|Y_1-Y_2|_q] + E[|X_1-X_2|_p]E[|Y_1-Y_2|_q] \nonumber \\ & \, - 2E[|X_1-X_2|_p|Y_1-Y_3|_q], \end{aligned}$$
(2)

where \((X_1,Y_1),(X_2,Y_2)\) and \((X_3,Y_3)\) are independent and identically distributed as \((X,Y)\). Note that (2) can be used to compute distance covariance using only the density function, without needing to know the characteristic function. Similarly, it is possible to compute the distance variance of X as the square root of

$$\begin{aligned} \mathcal {V}^2(X,X) \,\,\,=\,\,\, \mathcal {V}^2(X)= & E\left[ |X_1-X_2|_p^2\right] + E\left[ |X_1-X_2|_p\right] ^2 \nonumber \\ & \, - 2E\left[ |X_1-X_2|_p|X_1-X_3|_p\right] , \end{aligned}$$
(3)

and similarly for \(\mathcal {V}^2(Y)\). As a result, the exact distance correlation \(\mathcal {R}(X,Y)\) can be determined using (2), (3) and (1).

For an observed random sample \((\varvec{\textrm{X},\textrm{Y}}) = \{(X_k,Y_k): k = 1, \dots ,n\}\) from the joint distribution of random vectors \(X \in \mathbb {R}^p\) and \(Y \in \mathbb {R}^q\), [1] proposed the following empirical estimator of distance covariance.

Definition 3

Empirical distance covariance \(\mathcal {V}_n(\varvec{\textrm{X}}, \varvec{\textrm{Y}})\) is defined by the non-negative square root of:

$$\begin{aligned} \mathcal {V}_n^2(\varvec{\textrm{X}}, \varvec{\textrm{Y}})= & \frac{1}{n^2} \sum _{k,l = 1}^n A_{kl}B_{kl}, \end{aligned}$$
(4)

where \(A_{kl}\) and \(B_{kl}\) denote the corresponding double-centered distance matrices defined as

$$\begin{aligned} A_{kl}= & {\left\{ \begin{array}{ll} a_{kl} - \frac{1}{n} \sum _{j = 1}^n a_{kj} - \frac{1}{n} \sum _{i = 1}^n a_{il} + \frac{1}{n^2} \sum _{i,j=1}^n a_{ij}, & k \ne l \\ 0, & k = l, \end{array}\right. } \end{aligned}$$

where \(a_{kl} = |X_k - X_l|\) are the pairwise distances of the X observations. The terms \(B_{kl}\) are defined in a similar manner but using \(b_{kl} = |Y_k-Y_l|\) instead of \(a_{kl}\). Similarly, the sample distance variance \(\mathcal {V}_n(\varvec{\textrm{X}})\) is defined as the square root of:

$$\begin{aligned} \mathcal {V}_n^2(\varvec{\textrm{X}})= & \mathcal {V}_n^2(\varvec{\textrm{X},\textrm{X}}) \,\,\, = \,\,\, \frac{1}{n^2}\sum _{k,l = 1}^n A_{kl}^2. \end{aligned}$$

Theorem 1 in [1] proved that \(\mathcal {V}_n^2(\varvec{\textrm{X},\textrm{Y}}) \ge 0\). Moreover, it has been proven that under independence, \(\mathcal {V}_n^2(\varvec{\textrm{X},\textrm{Y}})\) is a degenerate kernel V-statistic.

The first estimator of distance correlation in [1], dCorV, is based on empirical distance variance and covariance, giving the empirical distance correlation.

Definition 4

The empirical distance correlation \(\mathcal {R}_n(\varvec{\textrm{X}}, \varvec{\textrm{Y}})\) is the square root of

$$\begin{aligned} \textrm{dCorV}^2(\varvec{\textrm{X},\textrm{Y}}) \,\,\, = \,\,\, \mathcal {R}_{n}^2(\varvec{\textrm{X}}, \varvec{\textrm{Y}})= & {\left\{ \begin{array}{ll} \frac{\mathcal {V}_n^2(\varvec{\textrm{X},\textrm{Y}})}{\sqrt{\mathcal {V}_n^2(\varvec{\textrm{X}})\mathcal {V}_n^2(\varvec{\textrm{Y}})}}, & \mathcal {V}_n^2(\varvec{\textrm{X}})\mathcal {V}_n^2(\varvec{\textrm{Y}}) > 0 \\ 0, & \mathcal {V}_n^2(\varvec{\textrm{X}})\mathcal {V}_n^2(\varvec{\textrm{Y}}) = 0. \end{array}\right. } \end{aligned}$$
(5)

Likewise, [2] proposed an alternative estimator for the distance covariance \(\mathcal {V}(X,Y)\) based on a \(\mathcal {U}\)-centered matrix.

Definition 5

Let \(A = (a_{kl})\) be a symmetric, real-valued \(n \times n\) matrix with zero diagonal, \(n > 2\). Define the \(\mathcal {U}\)-centered matrix \(\tilde{A}\) as follows. Let the (k, l)th entry of \(\tilde{A}\) be defined by

$$\begin{aligned} \tilde{A}_{kl} = {\left\{ \begin{array}{ll} a_{kl} - \frac{1}{n-2} \sum _{j=1}^n a_{kj}-\frac{1}{n-2} \sum _{i = 1}^n a_{il} + \frac{1}{(n-1)(n-2)}\sum _{i,j=1}^n a_{ij}, & k \ne l; \\ 0, & k = l. \end{array}\right. } \end{aligned}$$

Here “\(\mathcal {U}\)-centered” is so named because the inner product,

$$\begin{aligned} \mathcal {U}_n^2(\varvec{\textrm{X},\textrm{Y}}) \,\,\, = \,\,\, \left( \tilde{A}\cdot \tilde{B}\right)= & \frac{1}{n-3} \sum _{k \ne l} \tilde{A}_{kl}\tilde{B}_{kl}, \end{aligned}$$
(6)

defines an unbiased estimator of the squared distance covariance \(\mathcal {V}^2(X,Y)\). [3] established that the estimator in (6) can be expressed as a U-statistic. Thus, it becomes feasible to define an alternative estimator of the empirical distance correlation through \(\mathcal {U}_n(\varvec{\textrm{X},\textrm{Y}})\) and the distance variances of \(\varvec{\textrm{X}}\) and \(\varvec{\textrm{Y}}\), \(\mathcal {U}_n(\varvec{\textrm{X}})\) and \(\mathcal {U}_n(\varvec{\textrm{Y}})\), respectively, in the following manner.

Definition 6

The estimator dCorU of the distance correlation \(\mathcal {R}(X,Y)\), based on U-statistics, is the square root of

$$\begin{aligned} \textrm{dCorU}^2(\varvec{\textrm{X},\textrm{Y}})= & {\left\{ \begin{array}{ll} \frac{\mathcal {U}_n^2(\varvec{\textrm{X},\textrm{Y}})}{\sqrt{\mathcal {U}_n^2(\varvec{\textrm{X}})\mathcal {U}_n^2(\varvec{\textrm{Y}})}}, & \mathcal {U}_n^2(\varvec{\textrm{X}})\mathcal {U}_n^2(\varvec{\textrm{Y}}) > 0 \\ 0, & \mathcal {U}_n^2(\varvec{\textrm{X}})\mathcal {U}_n^2(\varvec{\textrm{Y}}) = 0, \end{array}\right. } \end{aligned}$$
(7)

This reformulation provides a fast algorithm for estimating distance covariance, which can be implemented with a computational complexity of \(\mathcal {O}(n \log n)\), while the original estimator (4) has a computational complexity of \(\mathcal {O}(n^2)\). Alternatively, [26] proposed an algorithm primarily composed of two sorting steps for computing the distance correlation estimator in (7). This design renders it simple to implement and also results in a computational complexity of \(\mathcal {O}(n \log n)\).

These results have prompted the implementation and development of many software packages, available in R [30] and Python [31]. A comprehensive comparison of the performance of these packages in both languages is presented by [32]. In Python, the open-source dcor package [33] implements distance correlation and other related statistics; the other Python libraries studied are statsmodels [34], hyppo [35], and pingouin [36]. In R, the energy [27] package implements the dcov and dcor functions, which return \(\mathcal {V}_n(\varvec{\textrm{X},\textrm{Y}})\) and dCorV(\(\varvec{\textrm{X},\textrm{Y}}\)), respectively. For the U-estimator, the dcovU and bcdcor functions are implemented; these return the \(\mathcal {U}_n^2\)(\(\varvec{\textrm{X},\textrm{Y}}\)) and dCorU\(^2\)(\(\varvec{\textrm{X},\textrm{Y}}\)) values, respectively, that is, they do not take the square root. In contrast, dcortools implements the distcov and distcor functions, whose bias.corr argument selects the V-estimator (bias.corr = FALSE) or the U-estimator (bias.corr = TRUE). Additionally, the Rfast [28] package includes the functions dvar, dcov, dcor, and bcdcor, using the fast method proposed by [3]; its bcdcor function computes the bias-corrected distance correlation of two matrices.
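To fix ideas, the following minimal sketch shows how the two estimators can be obtained with the packages mentioned above (assuming they are installed); the simulated data x and y are purely illustrative.

library(energy)
library(dcortools)

set.seed(1)
x <- rnorm(100); y <- rnorm(100)

energy::dcor(x, y)                           # dCorV(X, Y)
energy::bcdcor(x, y)                         # dCorU^2(X, Y); no square root taken
dcortools::distcor(x, y, bias.corr = FALSE)  # V-estimator, dCorV
dcortools::distcor(x, y, bias.corr = TRUE)   # U-estimator version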

A notable aspect to consider, as previously mentioned, is that the computation of \(\mathcal {U}_n^2(\varvec{\textrm{X},\textrm{Y}})\) in (6) can yield negative values in cases of independence or very low levels of dependence and small sample sizes. Consequently, it becomes impossible to calculate dCorU as the square root of dCorU\(^2\) in (7). This issue arises from the fact that \(\mathcal {U}_n^2(\varvec{\textrm{X},\textrm{Y}})\) is an unbiased estimator of the squared distance covariance \(\mathcal {V}^2(X,Y)\), which is 0 under independence. To the best of our knowledge, this problem has not been discussed in the literature. However, some authors have implemented this estimator in software such as R and Python, adopting the following alternatives to compute dCorU. Authors such as [27, 28] return dCorU\(^2\) without computing the square root, while others, such as [29], return for dCorU the square root of the absolute value of dCorU\(^2\), with the sign corresponding to that of dCorU\(^2\). However, as shown in Table 2 in Section 4.2, in scenarios under independence approximately 60% of the estimates are negative. For this reason, this study explores two proposals to address this challenge, resulting in the following alternative estimators of the distance correlation:

  • Replace the values of \(\mathcal {U}_n^2(X,Y)\) with their absolute value. This U-estimator of the distance correlation is denoted by dCorU(A).

  • Consider \(\max (\mathcal {U}_n^2(X,Y), 0)\), so that negative values of \(\mathcal {U}_n^2(X,Y)\) are truncated to zero. Denote this U-estimator of \(\mathcal {R}\) as dCorU(T).

Under an independence scenario, both dCorU and dCorU(A) yield identical MSE. This is because the square of the difference between the estimated value and the real value (\((\text {dCorU} - \mathcal {R})^2\)) does not deviate from the difference obtained when using the absolute value (\((\text {dCorU(A)} - \mathcal {R})^2\)) when \(\mathcal {R}= 0\).
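For concreteness, both alternatives can be sketched in R on top of the unbiased squared statistics. The helper dcorU_variants() below is a hypothetical name, and energy::dcovU is used under the assumption that applying it to a sample paired with itself yields the corresponding \(\mathcal {U}_n^2(\varvec{\textrm{X}})\).

library(energy)

dcorU_variants <- function(x, y) {
  u_xy <- dcovU(x, y)                       # U_n^2(X, Y); may be negative
  denom <- sqrt(dcovU(x, x) * dcovU(y, y))  # assumed to give sqrt(U_n^2(X) U_n^2(Y))
  r2 <- if (denom > 0) u_xy / denom else 0  # dCorU^2 as in (7)
  c(dCorU   = sign(r2) * sqrt(abs(r2)),     # signed root, as in [29]
    dCorU_A = sqrt(abs(r2)),                # absolute-value alternative
    dCorU_T = sqrt(max(r2, 0)))             # truncated alternative
}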

3 New proposal for estimating the distance correlation

Under independence or near-independence conditions (\(\mathcal {R}\approx 0\)), the V-estimator is biased, as it always yields positive results. The U-estimator, in turn, often cannot be computed because of negative values of dCorU\(^2(X,Y)\). The preference between the two estimators also seems to depend on the nature (linear, non-linear) of the relationship between X and Y (see the results in Section 4.1). While simulations allow the behavior of the estimators to be studied and provide insights into when each should be used, in real-world scenarios determining whether the relationship between X and Y is linear or non-linear, and assessing the level of dependence or independence, can be challenging. Consequently, selecting the appropriate estimator becomes difficult. To overcome these limitations, this paper introduces a new estimator of distance correlation, consisting of a convex linear combination of both estimators (dCorU, dCorV), denoted as

$$\text {dCor}_{\lambda } = \lambda \text {dCorU} + (1 - \lambda )\text {dCorV},$$

where \(\lambda \in [0,1]\) serves as a weighting parameter determining the balance between the two estimators. Note that the convex linear combination inherits many of the asymptotic properties of the original estimators. Specifically, it has less bias than dCorV (because the U-estimator is built on an unbiased estimator of the squared distance covariance).

An alternative to this estimator is the square root of

$$\begin{aligned} \text {dCor}^2_{\lambda } = \lambda \text {dCorU}^2 + (1 - \lambda )\text {dCorV}^2, \end{aligned}$$

which results in a lower percentage of negative values (see Appendix A). An issue to be addressed in this case is obtaining the square root of \(\text {dCor}^2_{\lambda }\) when it is negative. However, this problem can be solved using the options already mentioned for dCorU. It should be noted that neither combination ensures that the estimate is always greater than zero; however, both options reduce the percentage of negative values.
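As a minimal sketch, and assuming the dcortools package, the first combination can be written as follows; dcor_lambda() is a hypothetical helper name.

library(dcortools)

dcor_lambda <- function(x, y, lambda) {
  stopifnot(lambda >= 0, lambda <= 1)
  dcorU <- distcor(x, y, bias.corr = TRUE)   # U-estimator (signed root)
  dcorV <- distcor(x, y, bias.corr = FALSE)  # V-estimator
  lambda * dcorU + (1 - lambda) * dcorV      # convex linear combination
}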

For the parameter \(\lambda \), the optimal value \(\lambda _0\), which minimizes the mean squared error (MSE) of this new estimator, is proposed. This optimal value \(\lambda _0\), as provided in Lemma 1, relies on various unknown quantities, namely the covariance, variances, and biases of dCorU and dCorV.

Lemma 1

Let \(X \in \mathbb {R}^p\) and \(Y \in \mathbb {R}^q\) be random vectors, where p and q are positive integers and define \(\hat{\theta }^U = \)dCorU(X,Y) and \(\hat{\theta }^V = \)dCorV(X,Y). Given the convex linear combination \(\lambda \hat{\theta }^U + (1-\lambda )\hat{\theta }^V\), then, with a view to minimizing the MSE, the optimal value of \(\lambda \) is given by \(\lambda _0\) defined as:

$$\begin{aligned} \lambda _0&= \frac{-\text {Cov}\left( \hat{\theta }^U, \hat{\theta }^V\right) + \text {Var}\left( \hat{\theta }^V\right) + \text {Bias}\left( \hat{\theta }^V\right) \left( \text {Bias}\left( \hat{\theta }^V\right) -\text {Bias}\left( \hat{\theta }^U\right) \right) }{\text {Var}\left( \hat{\theta }^U\right) + \text {Var}\left( \hat{\theta }^V\right) - 2 \text {Cov}\left( \hat{\theta }^U, \hat{\theta }^V\right) + \left( \text {Bias}\left( \hat{\theta }^U\right) - \text {Bias}\left( \hat{\theta }^V\right) \right) ^2}. \end{aligned}$$
(8)

Proof

Let \(\hat{\theta }^C_{\lambda _0} = \lambda _0 \hat{\theta }^U + (1-\lambda _0)\hat{\theta }^V\) be the convex linear combination, then

$$\begin{aligned} \text {MSE}\left( \hat{\theta }^C_{\lambda _0}\right) = \text {Var}\left( \lambda _0 \hat{\theta }^U + (1-\lambda _0)\hat{\theta }^V\right) + \text {Bias}\left( \lambda _0 \hat{\theta }^U + (1-\lambda _0)\hat{\theta }^V\right) ^2, \end{aligned}$$

where

$$\begin{aligned} \text {Var}\left( \hat{\theta }^C_{\lambda _0}\right) = \lambda _0^2\text {Var}\left( \hat{\theta }^U\right) + (1-\lambda _0)^2\text {Var}\left( \hat{\theta }^V\right) + 2\lambda _0(1-\lambda _0)\text {Cov}\left( \hat{\theta }^U, \hat{\theta }^V\right) \end{aligned}$$

and

$$\begin{aligned} \text {Bias}\left( \hat{\theta }^C_{\lambda _0}\right) = \lambda _0\text {Bias}\left( \hat{\theta }^U\right) + (1-\lambda _0)\text {Bias}\left( \hat{\theta }^V\right) . \end{aligned}$$

Therefore,

$$\begin{aligned} \text {MSE}\left( \hat{\theta }^C_{\lambda _0}\right)= & \lambda _0^2\text {Var}\left( \hat{\theta }^U\right) \!+\! (1-\lambda _0)^2\text {Var}\left( \hat{\theta }^V\right) \!+\! 2\lambda _0(1-\lambda _0)\text {Cov}\left( \hat{\theta }^U, \hat{\theta }^V\right) \\ & \,\, + \left[ \lambda _0\text {Bias}\left( \hat{\theta }^U\right) + (1-\lambda _0)\text {Bias}\left( \hat{\theta }^V\right) \right] ^2, \end{aligned}$$

where the optimal value of \(\lambda \) is given by the solution to the following equation:

$$\begin{aligned} \frac{\partial }{\partial \lambda _0}\text {MSE}\left( \hat{\theta }^C_{\lambda _0}\right)= & \lambda _0\left[ \text {Var}\left( \hat{\theta }^U\right) + \text {Var}\left( \hat{\theta }^V \right) - 2\text {Cov}\left( \hat{\theta }^U, \hat{\theta }^V \right) \right. \\ & \, \, + \left. \left( \text {Bias}\left( \hat{\theta }^U \right) - \text {Bias}\left( \hat{\theta }^V \right) \right) ^2\right] - \text {Var}\left( \hat{\theta }^V \right) \\ & \, \, + \,\text {Cov}\left( \hat{\theta }^U, \hat{\theta }^V \right) + \text {Bias}\left( \hat{\theta }^V \right) \left[ \text {Bias}\left( \hat{\theta }^U\right) - \text {Bias}\left( \hat{\theta }^V \right) \right] \\ = & 0. \end{aligned}$$

Solving the previous equation yields the expression of \(\lambda _0\) in (8). \(\square \)

Due to the unavailability of the exact values of each term in the expression of \(\lambda _0\) in (8), the decision was taken to estimate \(\lambda _0\) using the bootstrap, specifically the smoothed bootstrap; see Algorithm 1.

Algorithm 1

Bootstrap procedure for computing \(\hat{\lambda }_0\).

The smoothed bootstrap relies on K, a kernel function (typically a symmetric density around zero), and \(h > 0\), a smoothing parameter referred to as the bandwidth; in this case, \(h_1, h_2\). The bandwidth regulates the size of the neighborhood used for estimation. It is customary to stipulate that the kernel function K must be non-negative and integrate to one, and K is often also expected to be symmetric. While the selection of the function K does not significantly influence the properties of the estimator (aside from its regularity conditions such as continuity, differentiability, etc.), the choice of the smoothing parameter is crucial for accurate estimation. In essence, the size of the neighborhood utilized for nonparametric estimation must be appropriate: neither excessively large nor too small.
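Since Algorithm 1 is only available as a figure, the following R sketch illustrates one way the smoothed-bootstrap estimate of \(\lambda _0\) could be implemented. The Gaussian kernel, the default bandwidths, and the convention of measuring bias against the original-sample estimates are assumptions made for illustration, not necessarily the authors’ exact steps.

library(dcortools)

estimate_lambda0 <- function(x, y, B = 1000,
                             h1 = bw.nrd0(x), h2 = bw.nrd0(y)) {
  n <- length(x)
  thU <- distcor(x, y, bias.corr = TRUE)
  thV <- distcor(x, y, bias.corr = FALSE)
  boot <- replicate(B, {
    idx <- sample.int(n, replace = TRUE)
    xb <- x[idx] + rnorm(n, sd = h1)   # smoothed (kernel) resample of X
    yb <- y[idx] + rnorm(n, sd = h2)   # smoothed (kernel) resample of Y
    c(U = distcor(xb, yb, bias.corr = TRUE),
      V = distcor(xb, yb, bias.corr = FALSE))
  })
  varU <- var(boot["U", ]); varV <- var(boot["V", ])
  covUV <- cov(boot["U", ], boot["V", ])
  biasU <- mean(boot["U", ]) - thU     # bootstrap bias estimates
  biasV <- mean(boot["V", ]) - thV
  num <- -covUV + varV + biasV * (biasV - biasU)
  den <- varU + varV - 2 * covUV + (biasU - biasV)^2
  min(max(num / den, 0), 1)            # plug into (8), clamped to [0, 1]
}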

In R, the density() function (in the stats package that ships with base R) can be used to obtain a kernel estimate of the density, with the bandwidth determined by the bw parameter. Some of the methods implemented include Silverman’s [37] and Scott’s [38] rules of thumb, unbiased and biased cross-validation methods and direct plug-in methods, among others, although implementations from other packages can also be used.
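For reference, the bandwidth selectors just mentioned are available in base R as follows (z is a hypothetical univariate sample):

z <- rnorm(100)            # hypothetical sample
bw.nrd0(z)                 # Silverman's rule of thumb [37]
bw.nrd(z)                  # Scott's rule of thumb [38]
bw.ucv(z)                  # unbiased cross-validation
bw.bcv(z)                  # biased cross-validation
bw.SJ(z)                   # Sheather-Jones direct plug-in
density(z, bw = "SJ")      # kernel density estimate with a plug-in bandwidth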

4 Simulation study

The simulation study employs the dcortools package, specifically utilizing the distcor function. To estimate the distance correlation through \(\textrm{dCorU}\), the code used is distcor(X,Y,bias.corr = T). For computing \(\textrm{dCorV}\), there are two options: distcor(X,Y,bias.corr = F) or simply distcor(X,Y). The study is divided into three main parts: first, a comparison between the original estimators; then, a comparison between the dCorU estimator and the proposed alternatives to mitigate negative values when dealing with independence or very low dependence and small sample sizes; and finally, a comprehensive comparison encompassing all the scenarios, including the original estimators, the proposed alternatives for dCorU, and estimates for the proposed convex linear combination.

For each scenario, the mean, bias, variance and mean squared error (MSE) are computed (see Appendix A). Each simulation is repeated for 1000 Monte Carlo iterations, with three sample sizes: 100, 1000, and 10000. To compare the efficiency (MSE) and computational time of each estimator, three models (FGM, bivariate normal and a non-linear model) with varying levels of dependence are considered, defined as follows.

4.1 Models

FGM model. The first model corresponds to a copula. One of the most popular parametric families of copulas is the Farlie-Gumbel-Morgenstern (FGM) family, which is defined by

$$\begin{aligned} C^{FGM}(x,y)= & xy[1+\theta (1-x)(1-y)], \quad \quad \theta \in [-1,1] \end{aligned}$$

with copula density given by

$$\begin{aligned} c^{FGM}(x,y)= & 1+\theta (2x-1)(2y-1), \quad \quad \theta \in [-1,1], \end{aligned}$$
(9)

where the marginals of X and Y are U(0,1). A well-known limitation of this family is that it does not allow the modeling of strong dependences, since Pearson’s correlation coefficient is limited to \(\rho = \frac{\theta }{3}\in \left[ -\frac{1}{3},\frac{1}{3}\right] \). Exact calculation of distance covariance is achieved using (2), where the density function corresponds to (9). Similarly, calculations were performed for \(\mathcal {V}(X)\) and \(\mathcal {V}(Y)\). As a result, \(\mathcal {R}(X,Y) = \frac{|\theta |}{\sqrt{10}}\) is obtained. It is important to note that the dependence is not strong, specifically, \(0 \le \mathcal {R}\le 0.31622\) (see Fig. 1).
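Samples from the FGM copula can be drawn by conditional inversion, a standard construction for this family (not taken from the paper): given \(U = u\), the conditional distribution function of V is the quadratic \((1+a)v - av^2\) with \(a = \theta (1-2u)\), which can be inverted in closed form. The helper rfgm() below is a hypothetical name.

rfgm <- function(n, theta) {
  stopifnot(abs(theta) <= 1)
  u <- runif(n); t <- runif(n)
  a <- theta * (1 - 2 * u)
  d <- sqrt((1 + a)^2 - 4 * a * t)
  v <- 2 * t / ((1 + a) + d)   # root of a*v^2 - (1+a)*v + t = 0; stable as a -> 0
  cbind(x = u, y = v)
}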

Bivariate normal model. The exact result for the distance correlation in this case is provided by [1] and is expressed as a function of the Pearson correlation coefficient. If \((X,Y)\) has a bivariate normal distribution with unit variances, then the squared distance correlation is given by:

$$\begin{aligned} \mathcal {R}^2(X,Y) = \frac{\rho \arcsin {\rho }+\sqrt{1-\rho ^2}-\rho \arcsin {\rho /2}-\sqrt{4-\rho ^2}+1}{1+\pi /3-\sqrt{3}}. \end{aligned}$$

In contrast to the previous model, the bivariate normal model allows complete dependence, resulting in \(\mathcal {R}= 1\) when \(\rho =1\) or \(\rho =-1\). This feature allows the performance of the estimators to be evaluated under high levels of dependence.
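This closed form is straightforward to transcribe; the sketch below does so, with dcor_bvn() as a hypothetical helper name.

dcor_bvn <- function(rho) {
  num <- rho * asin(rho) + sqrt(1 - rho^2) -
         rho * asin(rho / 2) - sqrt(4 - rho^2) + 1
  sqrt(num / (1 + pi / 3 - sqrt(3)))  # R(X, Y) as the positive square root
}
dcor_bvn(1)   # returns 1 under complete dependence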

Nonlinear model. Let \((X,Y)\) be a bivariate random vector with density

$$\begin{aligned} f_{X,Y}(x,y)=c\left[ 1-\left( y-4\left( x-\frac{1}{2}\right) ^2\right) ^2\right] ^kI_{[0,1]}(x)I_{[0,1]}(y), \end{aligned}$$

where \(k \in \mathbb {Z}\) controls the degree of dependence (which increases with k) and c is a normalizing constant that depends on the value of k. For this model, the values of k used are 0 (independence), 2, 4, 8 and 16 (strong dependence). An initial observation is that even at a relatively low level of dependence, such as \(k = 4\), the model’s behavior becomes discernible (see Fig. 1). It is important to highlight that, in this model, the value of the distance correlation when X and Y are totally dependent (i.e. when \(k \rightarrow \infty \)) is not 1, but \(\mathcal {R}\approx 0.41\). So, unlike the linear models (FGM and bivariate normal), in this model low values of \(\mathcal {R}\) (\(0.22 \le \mathcal {R}\le 0.41\)) could indicate a strong relationship between X and Y. Figure 1 shows four samples drawn along with their respective distance correlation. The lines represent the conditional mean \(E[Y|X = x]\).
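Because the density is bounded by c on the unit square, samples can be drawn by accept-reject with a uniform envelope; the sketch below (rnonlin() is a hypothetical helper name) accepts each uniform candidate with probability \(f_{X,Y}(x,y)/c \le 1\).

rnonlin <- function(n, k) {
  out <- matrix(numeric(0), ncol = 2)
  while (nrow(out) < n) {
    x <- runif(n); y <- runif(n); u <- runif(n)
    g <- (1 - (y - 4 * (x - 0.5)^2)^2)^k   # f(x, y) / c, bounded by 1
    out <- rbind(out, cbind(x, y)[u <= g, , drop = FALSE])
  }
  out[seq_len(n), ]
}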

Fig. 1

Samples (\(n = 300\)) generated for each model: FGM in the first row, bivariate normal in the second row, and non-linear in the last row, across different values of \(\theta , \rho ,\) and k, respectively, along with their corresponding distance correlation

4.2 Comparison between dCorU and dCorV estimators

This part of the simulation study compares the original estimators, dCorU and dCorV. Firstly, the comparison of the MSE of dCorU and dCorV across the different sample sizes and distance correlation values for the three models is shown in Fig. 2. Each simulation is repeated for 1000 Monte Carlo iterations, with three sample sizes: 100, 1000, and 10000. Since the dcortools package is used, dCorU is computed as dCorU = sign(dCorU\(^2\)) \(\sqrt{|\text {dCorU}^2|}\).

Fig. 2

MSE of dCorU and dCorV under the different models with sample sizes \(n = 100\) (left), \(n = 1000\) (center) and \(n = 10000\) (right) for different distance correlation values

For the FGM model, the differences between the two estimators are noticeable across all values of \(\theta \). Under independence (\(\mathcal {R}= 0\)), dCorU outperforms dCorV in terms of MSE for all three sample sizes. However, when there is dependence, dCorV shows better results, even in the presence of a small degree of dependence. As the level of dependence increases, the MSE of both estimators decreases and the differences disappear.

The MSE obtained for the bivariate normal model is shown in Fig. 2b. The conclusions are similar to those for the previous model. Under independence (\(\mathcal {R}= 0\)), dCorU emerges as the superior estimator. However, as soon as there is no independence, the opposite conclusion holds true, namely that dCorV is the best option. As dependence increases (\(\mathcal {R}\ge 0.454\)), sample size becomes less influential, and both estimators tend to converge and provide the same value. Furthermore, for larger sample sizes, a moderate dependence (\(\mathcal {R}> 0.224\)) is sufficient to observe similar behavior between the estimators. It is worth noting that under complete dependence (\(\mathcal {R}= 1\)), the estimates were the same for each sample size.

The results for the non-linear model are shown in Fig. 2c. As with the previous models, under independence the optimal estimate is provided by dCorU. However, the dCorV estimator does not systematically emerge as the superior choice at any specific level of dependence. Both estimators exhibit similar behavior with large sample sizes and, under dependence, the same holds true for this model as well.

The results for the bias and variance of both estimators are presented in Appendix A. They provide information that is consistent with the conclusions drawn from the MSE analysis. In particular, the most substantial difference between the two estimators can be observed in the case of a small sample size (\(n = 100\)) and weak dependence or independence, with dCorU exhibiting a negative bias that approaches zero with increasing levels of dependence. Similarly, dCorV also shows a decreasing, though always positive, bias. The variance of dCorV is comparatively smaller than the variance of dCorU. In contrast, for larger sample sizes (\(n = 1000, 10000\)), the results show insignificant differences between the two estimators.

The computational times for each estimator are shown in Table 1. The characteristics of the computer equipment used are as follows: Intel(R) Core(TM) i7-1280P 12th Gen CPU with 2.00 GHz and 16 GB RAM. There are no significant differences in the computation times between the estimators. It is important to highlight that these times were obtained with the dcortools package, with which both the U-statistic and the V-statistic of distance correlation can be calculated in \(\mathcal {O}(n \log n)\) steps. In fact, the algorithm in [3] can be straightforwardly extended to the V-statistic version. If another package were used, for example the energy package, the computation times of both methods would be different.

Table 1 Computational time in secs for 1000 samples

4.3 Alternatives to dCorU

As mentioned above, when working under independence or with very low dependence, the dCorU estimator often cannot be computed due to negative estimates of the squared distance covariance. Table 2 presents the percentage of negative values obtained for different sample sizes across the three models under varying levels of dependence. This section compares the dCorU estimator with the alternatives proposed in Section 2, specifically dCorU(A) and dCorU(T). Figure 3 shows the comparison of the MSE between dCorU and the proposed methods for each model.

Table 2 Percentage of negative values for \(\mathcal {U}_n^2(X,Y)\) obtained when computing dCorU\(^2\) in (7) for each of the models across different scenarios

The FGM model presents negative values across all the studied values of \(\theta \) when \(n = 100\). As the sample size increases, the problem of negative values only appears when the dependence is very weak (\(n = 1000\)) or just under independence (\(n = 10000\)); see Table 2. Despite the notably high percentage of negative values in the case of independence, the differences between the estimators are not substantial. However, the most accurate estimation is achieved using dCorU(T), given that \(\mathcal {R}= 0\). Conversely, under dependence, dCorU(A) provides a more accurate estimation. Nevertheless, as the level of dependence strengthens, the MSE values for all three estimators tend to the same value.

In the case of the bivariate normal model (Fig. 3), negative values are encountered only under independence (\(\mathcal {R}= 0\)) or low levels of dependence (\(\mathcal {R}= 0.22\)), both with \(n=100\). Under strong dependence or with large sample sizes, the computation of dCorU does not exhibit the issue of negative values (at least with \(n \ge 100\)). The conclusions are aligned with those of the previous model: in the case of independence, dCorU(T) emerges as the optimal estimator, while under low dependence it is dCorU(A). In the remaining cases, all three estimators yield comparable MSE. Furthermore, for \(n=10000\), the issue of negative values arises in the independence scenario (Table 2), but the differences remain insignificant.

Finally, in the case of the non-linear model, negative values are observed for weak dependence scenarios (\(k = 0,2,4\)) with \(n = 100\). However, as the sample size increases (\(n = 1000, 10000\)), the problem only arises under independence (\(k = 0\)). It is important to highlight that the conclusions are similar to those in the two previous cases: under independence, the optimal estimator is dCorU(T), whereas under weak dependence it is dCorU(A). For the remaining cases, all three estimators yield the same MSE (Fig. 3).

Fig. 3

MSE values for dCorU, dCorU with absolute value (dCorU(A)), and truncated value (dCorU(T)) across three sample sizes (n) and three different models: the FGM model in the first row, the bivariate normal model in the second row, and the nonlinear model in the last row. The corresponding distance correlation values are also presented

4.4 New estimator for distance correlation

This section explores and compares the original estimators (dCorU and dCorV) with proposed variations, including dCorU(A) and dCorU(T), together with their corresponding convex combinations computed with the optimal value of the parameter, \(\lambda _0\) (8), as follows:

$$\begin{aligned} \text {dCor}_{\lambda _0}= & \lambda _0\text {dCorU} + (1 - \lambda _0) \text {dCorV}, \\ \text {dCor(A)}_{\lambda _0}= & \lambda _0\text {dCorU(A)} + (1 - \lambda _0) \text {dCorV}, \\ \text {dCor(T)}_{\lambda _0}= & \lambda _0\text {dCorU(T)} + (1 - \lambda _0) \text {dCorV}. \end{aligned}$$

The comparisons are conducted across two scenarios for the convex linear combination. The first scenario involves estimating the optimal \(\lambda _0\) (8) using Algorithm 1, while the second scenario utilizes the optimal value of \(\lambda _0\) in Lemma 1, where the bias, variance and covariance terms in (8) are approximated using 1000 Monte Carlo repetitions. The study focuses on the behavior of the MSE in each model under different levels of dependence and with a sample size \(n = 100\) (see Fig. 4).

The value of \(\hat{\lambda }_0\) was obtained using Algorithm 1 with 1000 bootstrap iterations. The bandwidths, taken as \(h_1 = h_2\) for simplicity, were chosen as the value on a grid between 0.0025 and 0.32 with the lowest MSE. A simulation study showing the effect of the bandwidths is included in Appendix A. It is evident that the bandwidth has an effect only under independence conditions, with lower bandwidth values corresponding to lower MSE. When the level of dependence is moderate, the impact of the bandwidth becomes less significant, and under high dependence, the results for different bandwidth values tend to converge.

Fig. 4

Comparison of the MSE between the different estimators and their respective convex linear combination and sample size \(n = 100\). Specifically, \(\text {dCor}_{\hat{\lambda }_0}\) denotes the combination \(\hat{\lambda }_0\)dCorU + \(\left( 1 - \hat{\lambda }_0\right) \)dCorV with \(\hat{\lambda }_0\) estimated using 1000 bootstrap replications. Similarly, \(\text {dCor}_{\lambda _0}\) denotes the combination \(\lambda _0\)dCorU + \((1 - \lambda _0)\)dCorV using the real optimal value of \(\lambda _0\) (8). The same naming convention applies to dCorU(A) and dCorU(T), each representing their respective combinations. The \(\mathcal {R}\) values shown are rounded values of the corresponding values from each model

In Fig. 4, the maximum value of the distance correlation considered is \(\mathcal {R}= 0.31\), which corresponds to the maximum level achievable in the FGM model. It is notable that, even though \(\mathcal {R}\) is approximately the same for all three models, the associated degree of dependence for each value of \(\mathcal {R}\) is different for each scenario, especially for the non-linear model. For this reason, the behavior of the estimators is not comparable over the models for a fixed value of \(\mathcal {R}\). The main observations are the following: under independence (\(\mathcal {R}= 0\)), the proposed estimator \(\text {dCor}_{\lambda _0}\) outperforms dCorU and dCorV when using the optimal \(\lambda _0\). The performance of the proposed estimator worsens slightly when \(\lambda _0\) is estimated, but it still gives much better results than dCorV. As soon as there is no independence (\(\mathcal {R}> 0\)), the proposed estimator with estimated \(\lambda _0\) gives comparable results to the one with the optimal \(\lambda _0\), and it improves on both dCorU and dCorV.

These conclusions can be observed for each of the models in Fig. 13, where it is noticeable that the most significant differences with each of the estimators occur under conditions of independence. Conversely, as dependence increases, the estimates tend to become more comparable. Specifically, the nonlinear model exhibits a decrease in MSE as \(\mathcal {R}\) increases, while the linear models do not show such a significant decrease.

Figure 5 shows the MSE for each model under varying levels of dependence, from independence (\(\mathcal {R}= 0\)) up to the maximum dependence permitted by each model. It is worth noting that the nonlinear model reaches a maximum \(\mathcal {R}\) value of around 0.4134, indicating complete dependence between X and Y. The behavior observed in the non-linear model is also evident in the bivariate normal model, wherein the MSE remains quite similar across all cases, starting at \(\mathcal {R}= 0.75\). Moreover, the reduction in MSE becomes evident in each case as \(\mathcal {R}\) increases, eventually reaching 0 in the case of complete dependence (\(\mathcal {R}= 1\)). However, this is not the case for the nonlinear model, where the MSE does not reach zero in the full dependence scenario (\(\mathcal {R}= 0.41\)).

It is important to mention that in cases of independence or very low dependence, dCorU(T) and its combination with \(\lambda _0\) exhibit the lowest MSE. Additionally, it is worth noting that the bootstrap approximation of \(\lambda _0\) shows larger discrepancies from the actual value in scenarios of independence or low dependence (see Appendix A). However, when dependence is low (approximately \(\mathcal {R}= 0.1\)), dCorU(A) and its convex combinations \(\left( \text {dCor(A)}_{\lambda _0} \text {and } \text {dCor(A)}_{\hat{\lambda }_0}\right) \) result in a lower MSE. Under moderate or high dependence (\(\mathcal {R}\ge 0.24\)), convex linear combinations tend to converge to the same MSE, consistently approaching the best estimate provided by the individual estimators (dCorU, dCorV, dCorU(A), dCorU(T)).

It becomes evident that the convex linear combination using the value \(\hat{\lambda }_0\) obtained through 1000 bootstrap repetitions consistently approximates the best estimates provided by the original estimators (dCorU, dCorV) across all the scenarios examined. While it is true that under specific conditions, some alternative combinations appear to yield slightly smaller errors, it is important to note that distinguishing between working under independence or with very low dependence may not always be feasible.

Fig. 5

Comparison of the MSE among the different estimators and their respective convex linear combinations performed for the provided models across various levels of dependence, from independence (\(\mathcal {R}= 0\)) to the strongest dependence supported by each model for \(n = 100\)

5 Application to real data

Rheumatoid arthritis (RA) is a chronic systemic inflammatory disease leading to joint inflammation, functional disability, deformities, and a shortened life expectancy. The goal of treatment in RA is remission in the early stage of the disease before patients develop permanent deformities. Therefore, it is important to identify predictors of relapse, so that physicians can individualize the treatment plan [39].

Obesity is now thought to be related to flare-ups in rheumatoid arthritis patients who are in remission. In order to study this statement, a database obtained at the A Coruña University Hospital Complex (CHUAC, Complejo Hospitalario Universitario de A Coruña, Spain) was used. A clinical trial was conducted on 196 patients in remission, who were randomly divided into two groups, a control group and an experimental group, in order to study the patients’ responses to a reduction in treatment. The control group consisted of 99 patients who were given the treatment in the usual manner. In the experimental group, however, the treatment was gradually reduced until it was completely withdrawn.

Fig. 6

Relationship between the difference in DAS28 and its variants and BMI for the two groups (control and experimental)

Fig. 7

Relationship between the difference in SDAI and BMI for the two groups (control and experimental)

There were 375 covariates, including age, sex, weight, and genetic information. To study the effect of overweight and obesity on flare-ups, the following information was used:

  • Body mass index (BMI): a measure of a person’s leanness or corpulence based on their height and weight, intended to quantify tissue mass. This information is available at two points: at the beginning of the study and at the patient’s last visit. For the dependency study, the information from the beginning of the study was used.

  • Disease Activity Score in 28 joints (DAS28): a continuous measure of RA disease activity that combines information from swollen joints, tender joints, the acute phase response and general health. As with BMI, DAS28 information was obtained at the start of treatment and at the patient’s last visit. In this case, the difference between the DAS28 scores at the last visit and at the beginning was used to observe whether there was a significant change between the two data points. Similarly, the variants DAS28V.CRP, DAS28V.ESR, and DAS28CRP were used; they differ in the way they are calculated. DAS28-CRP uses the C-reactive protein count, while DAS28-ESR uses the erythrocyte sedimentation rate. On the other hand, DAS28.V indicates that patient assessment is not included in the calculation of DAS28.

  • Simplified Disease Activity Index (SDAI): A composite index used in the assessment of disease activity in rheumatoid arthritis (RA). It combines several clinical and laboratory measures to provide a comprehensive evaluation of disease severity. It is used similarly to the DAS28, comparing the values between the last visit and the start of the study.

Figure 6 shows the relationship between BMI and the difference in DAS28 (last visit and start of the study) and its respective variants. All patients are depicted in two colors, corresponding to the group to which they belonged (control and experimental). Similarly, Fig. 7 shows the relationship between the SDAI and the BMI at the start of the study.

Table 3 Percentage of patients with normal weight, overweight, or obesity according to BMI for all data and the corresponding groups
Table 4 Distance correlation between BMI and change in each indicator of activity for RA (DAS28, DAS283V.CRP, DAS283V.ESR, DAS28.CRP) and the distance correlation between BMI and all the indicators

Table 3 shows the proportion of patients with a normal weight (BMI < 25), overweight (25 \(\le \) BMI < 30) or obesity (BMI \(\ge \) 30) according to the BMI.

In order to study whether overweight and/or obesity had any influence on the increase in DAS28, its variants and SDAI, which could indicate that a flare-up was experienced, the distance correlation was computed using both estimators and the proposed convex combination, and it was also compared with Pearson’s correlation coefficient (see Table 4). Additionally, the effect of BMI at the beginning of treatment was studied with all the possible indicators of a flare-up taken jointly. In this case, it was not possible to compute Pearson’s coefficient.

It is noticeable that the estimates from the two estimators (dCorU and dCorV) differed in most cases. The most noteworthy difference was observed in the control group, where dCorU was consistently negative. However, when the convex combination was computed, in the majority of cases it was equal to dCorV. As mentioned earlier, small values of distance correlation do not necessarily imply a weak relationship, as total dependence does not always correspond to a distance correlation of 1. On the other hand, a Pearson correlation coefficient close to zero does not necessarily indicate independence. Moreover, it was always lower than the distance correlation, suggesting that the dependency relationship is better captured by dCor. From the information available, it can be concluded that the change in DAS28 was not entirely independent of BMI, although the dependence did not appear to be very strong. There seemed to be a stronger dependence in the experimental group.

On the other hand, it can be observed that the proportions of people with a normal weight or overweight were very similar between both groups (control and experimental). However, the dependence between BMI and the difference in DAS28/SDAI was greater in the experimental group. This implies that, from a medical perspective, there was a higher risk of changes in DAS28/SDAI when the BMI was out of the normal range and the person was part of the experimental group. This result might suggest that there was a dependence between the group and the change in DAS28/SDAI, indicating a difference between belonging to the control group and the experimental group. Table 5 shows the distance correlation between the group (control, experimental) and the change in each indicator of activity for rheumatoid arthritis. It can be observed that there is a certain dependence between group membership and the occurrence of a change, particularly in DAS28.

Table 5 Distance correlation between group and change in each indicator of activity for RA (DAS28, DAS283V.PCR, DAS283V.VSG, DAS28.PCR)

Finally, one of the advantages of using distance correlation is that it can measure the dependence between vectors of different dimensions. For this study, it was observed that BMI had a certain dependence on the change in the different variables that describe the behavior of RA (see Table 4). This dependence was observed with all the data as well as within the divided groups (control and experimental). Whenever distance correlation does not clearly indicate independence or strong dependence, it is advisable to conduct a more exhaustive study with hypothesis tests. For example, [1] proposed a distance covariance test.

6 Conclusions

This research addressed the problem of choosing the best method to estimate the distance correlation between two vectors X and Y. Under independence, dCorU seemed to be preferable over dCorV for all the models and sample sizes analyzed. However, \(\mathcal {U}_n^2(\varvec{\textrm{X},\textrm{Y}})\) might be negative in scenarios of independence or low dependence, particularly with small sample sizes, making it impossible to calculate dCorU. This observation prompted the development of two alternative proposals, based on truncating \(\mathcal {U}_n^2(\varvec{\textrm{X},\textrm{Y}})\) or taking its absolute value. Under independence, truncation generally gives the superior estimation.

In contrast, under dependence, the conclusions differ. The dCorV estimator yields the best results in terms of MSE for the linear models (FGM and bivariate normal). In the case of the non-linear model, the dCorU estimator provides better estimates, although the differences are not relevant. Moreover, when considering the two alternative estimators, dCorU(A) and dCorU(T), the optimal estimator appeared to be dCorU(A), especially in cases of weak dependence. In terms of computational time, the estimators were similar.

In practice, it is difficult to choose the best estimator for the distance correlation, as the choice depends on whether the relationship between X and Y is linear or not, and whether there is independence or not, questions that remain unanswered in real-life studies. To address these complexities, a new estimator was proposed, involving the convex linear combination of the two estimators, dCorU and dCorV, as well as their respective extensions, dCorU(A) and dCorU(T). In the majority of cases, the proposed estimator with the parameter \(\lambda _0\) estimated using the bootstrap, \(\text {dCor}_{\hat{\lambda }_0}\), yielded better results than those obtained using only dCorU or dCorV. However, it should be remembered that the computation time will increase as the number of bootstrap iterations increases.