Elsevier

Environmental Modelling & Software

Volume 96, October 2017, Pages 283-290
A simple nonparametric index of bivariate association for environmental data exploration

https://doi.org/10.1016/j.envsoft.2017.07.006

Highlights

  • A purely nonparametric index is proposed for general bivariate association.

  • The index reflects the increase in nearest neighbour distances that occurs when an existing association is disrupted by randomization.

  • No estimation of marginal distributions or other continuous functions is required.

  • No assumption is made concerning physical or stochastic processes giving rise to the data.

Abstract

A purely data-based index for detecting bivariate association is proposed for preliminary data exploration when seeking to model a dependent variable associated with a possibly large number of independent variables. No particular form of association between the dependent and independent variables is assumed. The proposed bivariate association index is the value p, the probability that a scatter plot created by an X-randomization will have a smaller mean nearest neighbour distance than the original scatter plot. The rationale is that randomizing an existing X-Y association will usually result in a scatter plot with a greater mean nearest neighbour distance. The process is repeated for each of the other independent variables to give a specific p for each one. A subset of potentially informative independent variables is then obtained by noting all those with low p values; just how small p should be is left to the user.

Section snippets

Software availability

Software is free MATLAB code written by Varvara Vetrova.

Available at https://github.com/vetrovav/P-index.

Index of bivariate association

As noted by Murrell et al. (2016), randomization has the effect of inducing independence between variables. This means that for a given bivariate data set {X,Y} there can be no association between the two variables if the X values have been rearranged in random order (which implies the associated Y values will also have been rearranged in random order).

We use the scatter plot mean nearest neighbour distance D̄ as a familiar measure to form a data-based index of association, with randomization
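
The index described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' MATLAB code: the names mean_nn_distance, p_index, n_random and seed are hypothetical, distances are assumed to be Euclidean, X and Y are assumed to be on comparable scales (no rescaling is applied), and p is estimated as the fraction of X-randomizations that give a strictly smaller mean nearest neighbour distance than the original scatter plot.

```python
import math
import random


def mean_nn_distance(xs, ys):
    """Mean Euclidean distance from each point to its nearest neighbour."""
    pts = list(zip(xs, ys))
    total = 0.0
    for i, (xi, yi) in enumerate(pts):
        # Brute-force search over all other points (fine for small samples).
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(pts)
            if j != i
        )
    return total / len(pts)


def p_index(xs, ys, n_random=200, seed=0):
    """Estimate p: the fraction of X-randomizations whose scatter plot has a
    smaller mean nearest neighbour distance than the original plot.

    n_random and seed are illustrative parameters, not taken from the paper.
    """
    rng = random.Random(seed)
    d_obs = mean_nn_distance(xs, ys)
    shuffled = list(xs)
    smaller = 0
    for _ in range(n_random):
        rng.shuffle(shuffled)  # rearrange the X values in random order
        if mean_nn_distance(shuffled, ys) < d_obs:
            smaller += 1
    return smaller / n_random


# Usage: a perfect straight-line association between X and Y.
x = [i / 39 for i in range(40)]
y = list(x)
print(p_index(x, y, n_random=50))
```

A low value indicates that randomization rarely produces a tighter scatter plot than the one observed, which is the behaviour expected when an X-Y association exists. The brute-force neighbour search costs O(n²) per randomization; a spatial index would be a natural substitute for larger samples.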

Synthetic examples

The effect of X-randomization is illustrated in Figs. 1 and 2, which show how a single randomization of X disrupts an initial {X,Y} association to varying degrees. The well-defined circular pattern of Fig. 1a is randomized to a disorganised spatial scatter, but the minimal {X,Y} association for the two clusters of Fig. 1c is largely unchanged in appearance after a randomization of X (Fig. 1d). Similarly, the random scatter of points in Fig. 2a remains a random scatter after a
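
The contrast between a circular pattern and a random scatter can be reproduced with synthetic data. The sketch below is an assumption-laden illustration, not a reconstruction of the paper's figures: the sample size, the noise-free unit circle, the uniform scatter on a square, the number of randomizations and the helper names are all hypothetical choices.

```python
import math
import random


def mean_nn_distance(xs, ys):
    """Mean Euclidean nearest neighbour distance of the scatter plot."""
    pts = list(zip(xs, ys))
    return sum(
        min(math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(pts) if j != i)
        for i, (xi, yi) in enumerate(pts)
    ) / len(pts)


def p_index(xs, ys, n_random, rng):
    """Fraction of X-randomizations with a smaller mean NN distance."""
    d_obs = mean_nn_distance(xs, ys)
    shuffled = list(xs)
    smaller = 0
    for _ in range(n_random):
        rng.shuffle(shuffled)
        smaller += mean_nn_distance(shuffled, ys) < d_obs
    return smaller / n_random


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 100                 # hypothetical sample size

# Circular pattern: points at random angles on the unit circle.
angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
circle_x = [math.cos(a) for a in angles]
circle_y = [math.sin(a) for a in angles]

# Unstructured pattern: X and Y drawn independently on [-1, 1].
scatter_x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
scatter_y = [rng.uniform(-1.0, 1.0) for _ in range(n)]

p_circle = p_index(circle_x, circle_y, 100, rng)
p_scatter = p_index(scatter_x, scatter_y, 100, rng)
print("circle:", p_circle, "random scatter:", p_scatter)
```

For the circle, randomizing X spreads the points from a one-dimensional curve over a two-dimensional region, so the mean nearest neighbour distance grows and p should be close to zero. For the independent scatter, the original plot is statistically indistinguishable from its randomizations, so p can fall anywhere between 0 and 1.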

Data application

The environmental variable Y of interest in this case is the percentage of days per month on the west coast of the South Island of New Zealand when the upper air wind direction is between 216° and 365°, coupled with wind speed exceeding 5 m s⁻¹. Such wind conditions at this location are often associated with rain so the monthly frequency of days of this type is likely to be related to regional monthly precipitation.

The X variables are the monthly mean 700 hPa geopotential heights (metres) as

Conclusion

An index of general bivariate association is proposed, based on the degree to which data randomization induces an increase in scatterplot mean nearest neighbour distance.

The use of p here is proposed simply as an index rather than a significance test. The method might be further investigated where p is equated to the p level of a randomization significance test, with power comparisons then made against other bivariate significance tests.

Further investigation is needed but the p index approach
