A simple nonparametric index of bivariate association for environmental data exploration
Software availability
Software is free MATLAB code written by Varvara Vetrova.
Available at https://github.com/vetrovav/P-index.
Index of bivariate association
As noted by Murrell et al. (2016), randomization has the effect of inducing independence between variables. This means that for a given bivariate data set {X,Y} there can be no association between the two variables if the X values have been rearranged in random order (which implies the associated Y values will also have been rearranged in random order).
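The effect described above can be checked numerically. The following is a minimal sketch, not the authors' MATLAB code: it assumes NumPy, uses synthetic data with a noisy linear relationship chosen purely for illustration, and uses the Pearson correlation only as a familiar check on association.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic bivariate data with a strong linear association (illustrative only).
n = 500
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

# Randomize X: the same values, rearranged in random order.
# Y is left in place, so the original (x_i, y_i) pairing is broken.
x_perm = rng.permutation(x)

r_before = np.corrcoef(x, y)[0, 1]
r_after = np.corrcoef(x_perm, y)[0, 1]

# The marginal distribution of X is untouched by the rearrangement.
same_marginal = np.array_equal(np.sort(x), np.sort(x_perm))

print(f"correlation before randomization: {r_before:.3f}")
print(f"correlation after randomization:  {r_after:.3f}")
print(f"X values preserved: {same_marginal}")
```

Randomization changes only the pairing of X with Y, not the set of values of either variable, which is why any remaining pattern in the randomized scatter plot reflects the marginal distributions alone.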
We use the scatter plot mean nearest neighbour distance as a familiar measure to form a data-based index of association, with randomization
Synthetic examples
The effect of X-randomization is illustrated in Figs. 1 and 2, which show how a single randomization of X disrupts an initial {X,Y} association to varying degrees. The well-defined circular pattern of Fig. 1a is randomized to a disorganised spatial scatter, whereas the minimal {X,Y} association of the two clusters in Fig. 1c is largely unchanged in appearance after a randomization of X (Fig. 1d). Similarly, the random scatter of points in Fig. 2a remains a random scatter after a
Data application
The environmental variable Y of interest in this case is the percentage of days per month on the west coast of the South Island of New Zealand when the upper air wind direction is between 216° and 365°, coupled with wind speed exceeding 5 m s⁻¹. Such wind conditions at this location are often associated with rain, so the monthly frequency of days of this type is likely to be related to regional monthly precipitation.
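As an illustration of how such a monthly percentage could be derived from daily records, here is a small sketch. The function name `monthly_percentage`, its arguments, and the toy input values are hypothetical; the direction bounds and the speed threshold are taken as stated in the text and exposed as parameters.

```python
import numpy as np

def monthly_percentage(month_id, direction_deg, speed_ms,
                       dir_lo=216.0, dir_hi=365.0, min_speed=5.0):
    """Percentage of days per month meeting the wind criteria.

    month_id      : integer label of the month each day belongs to
    direction_deg : daily upper air wind direction in degrees
    speed_ms      : daily upper air wind speed in metres per second
    The default bounds follow the values quoted in the text.
    """
    month_id = np.asarray(month_id)
    direction_deg = np.asarray(direction_deg, dtype=float)
    speed_ms = np.asarray(speed_ms, dtype=float)

    # A day qualifies when direction is inside the sector and speed exceeds the threshold.
    hit = (direction_deg >= dir_lo) & (direction_deg <= dir_hi) & (speed_ms > min_speed)

    months = np.unique(month_id)
    pct = np.array([100.0 * hit[month_id == m].mean() for m in months])
    return months, pct

# Toy daily records (made-up values, for demonstration only).
months, pct = monthly_percentage(
    month_id=[1, 1, 1, 1, 2, 2],
    direction_deg=[220, 100, 300, 250, 230, 10],
    speed_ms=[6.0, 10.0, 8.0, 7.0, 5.5, 9.0],
)
print(dict(zip(months.tolist(), pct.tolist())))
```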
The X variables are the monthly mean 700 hPa geopotential heights (metres) as
Conclusion
An index of general bivariate association is proposed, based on the degree to which data randomization induces an increase in scatterplot mean nearest neighbour distance.
The use of p here is proposed simply as an index rather than as a significance test. The method might be investigated further by equating p to the p-value of a randomization significance test and then comparing its power against that of other bivariate significance tests.
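The idea can be sketched in a few lines. This excerpt does not give the exact definition of p, so the form used below is an assumption: p is taken as the proportion of X-randomizations whose scatter plot mean nearest neighbour distance is no larger than that of the original data. Standardizing both variables first, the number of randomizations, and the function names `mean_nn_distance` and `p_index` are also assumptions made for illustration; this is not the published MATLAB implementation.

```python
import numpy as np

def mean_nn_distance(x, y):
    """Mean Euclidean distance from each scatter plot point to its nearest neighbour."""
    pts = np.column_stack([x, y])
    diff = pts[:, None, :] - pts[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    return d.min(axis=1).mean()

def p_index(x, y, n_rand=200, seed=0):
    """Assumed form of the index: share of randomizations with mean
    nearest neighbour distance <= that of the original data."""
    rng = np.random.default_rng(seed)
    # Standardize so both axes contribute comparably (an assumption).
    xs = (x - x.mean()) / x.std()
    ys = (y - y.mean()) / y.std()
    d0 = mean_nn_distance(xs, ys)
    d_rand = np.array([mean_nn_distance(rng.permutation(xs), ys)
                       for _ in range(n_rand)])
    return float(np.mean(d_rand <= d0))

rng = np.random.default_rng(1)

# Noisy circle: a strong but non-monotonic association.
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
x_circ = np.cos(theta) + rng.normal(0.0, 0.02, 200)
y_circ = np.sin(theta) + rng.normal(0.0, 0.02, 200)

# Independent uniform scatter: no association.
x_rand = rng.uniform(size=200)
y_rand = rng.uniform(size=200)

p_circle = p_index(x_circ, y_circ)
p_scatter = p_index(x_rand, y_rand)
print(f"p (circle):  {p_circle:.3f}")
print(f"p (scatter): {p_scatter:.3f}")
```

Under this assumed definition, a small p indicates that randomization almost always spreads the points out relative to the original scatter plot, which is the behaviour expected when X and Y are associated.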
Further investigation is needed but the p index approach
References (21)
- et al. Input variable selection for water resources systems using a modified minimum redundancy maximum relevance (mMRMR) algorithm. Adv. Water Resour. (2009)
- et al. Model-free sure screening via maximum correlation. J. Multivar. Analysis (2016)
- An informational measure of correlation. Inf. Control (1957)
- et al. Non-linear variable selection for artificial neural networks using partial mutual information. Environ. Model. Softw. (2008)
- Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: Part 1 — a strategy for system predictor identification. J. Hydrol. (2000)
- et al. Selection of significant input variables for time series forecasting. Environ. Model. Softw. (2015)
- et al. Efficient test for nonlinear dependence of two continuous variables. BMC Bioinforma. (2015)
- et al. Distribution free tests of independence based on the sample distribution function. Ann. Math. Stat. (1961)
- et al. Measuring non-linear dependence for two random variables distributed along a curve. Stat. Comput. (2009)
- et al. A consistent multivariate test of association based on ranks of distances. Biometrika (2013)