Measuring information-based complexity across scales using cluster analysis

doi:10.1016/j.ecoinf.2007.03.011

Ecological Informatics

Volume 2, Issue 2, 1 June 2007, Pages 121-127

https://doi.org/10.1016/j.ecoinf.2007.03.011

Abstract

Scaling of ecological data can present a challenge firstly because of the large amount of information contained in an ecological data set, and secondly because of the problem of fitting data to models that we want to use to capture structure. We present a measure of similarity between data collected at several scales using the same set of attributes. The measure is based on the concept of Kolmogorov complexity and implemented through minimal message length estimates of information content and cluster analysis (the models). The similarity represents common patterns across scales, within the model class. We thus provide a novel solution to the problem of simultaneously considering data structure, model fit and scale. The methods are illustrated in application to an ecological data set.

Introduction

Scale is a concept central to all ecological studies, whether relating to space or time. Sayama et al. (2003) demonstrated that there are powerful linkages between scales, contradicting the erroneous, though commonly held, assumption that it is possible to neatly partition evolutionary effects at different spatial scales, to study a molecule, individual, population, metapopulation, species or ecosystem. It is these dynamic linkages among the levels, rather than the number of levels themselves, that should probably be the focus of attention. Hogeweg (2002) argues that ‘processes do not, in biotic systems, operate in isolation and the existence of entanglement at different time and space scales does not need explanation, being there by default’. Ignoring it by segregating time and space scales is simply a modelling artefact.

There are several ways in which scale appears as a feature in ecological studies. An omnipresent problem is the modifiable areal unit problem (MAUP; see Openshaw, 1984; Fotheringham and Wong, 1991; Nakaya, 2000; Brunsdon, 2002; Holt et al., 1996; Jelinski and Wu, 1996). Ecological units do not come in convenient packets, and the size, shape and distribution of samples will all affect any study; this aspect of scale has already received considerable research attention (Brunsdon, 2002; Pavlov et al., 2001; see also methods developed by Juhász-Nagy and Podani, 1983). The effects of scale can, however, be mitigated by employing fuzzy concepts, which allow any individual sample to partake of several component structures and lead to consistent estimates of cluster parameters. Bar-Yam (2002) proposed that agglomerative clustering indicates the mechanism by which information is lost as the level of uncertainty increases across scales, but a quantitative measure will depend on the similarity coefficient and the particular algorithm used for clustering. Pavlov et al. (2001) and Puzicha and Buhman (1998) use similar ideas to obtain segmentation of images based on texture variation and using fractal concepts; specifically, Pavlov et al. (2001) suggest using wavelet decompositions. The number and distribution of samples also link to the part-whole problem (cf. Szabo, 1996) and to the relationship between habitat heterogeneity and spontaneous pattern production (Sayama et al., 2003).

Here we shall consider a different problem that concerns the estimation of common structure between levels. We ask, does common structure exist and, if so, how strong is it? And how does it decay as differences in scale increase? Some argue that scale invariance, or the presence of similar structure across different spatial or temporal scales, should be expected for complex systems (e.g., Brown et al., 2002). However, Wolpert and Macready (2000) recently put forward another view: that over different space and time scales, the patterns exhibited by such a complex system should vary greatly, and in ways that are unexpected given the patterns on the other scales. The degree of dissimilarity plotted against scale would therefore provide a profile, which can be used as a system descriptor and compared with other system profiles irrespective of the subject matter. Binder and Plazas (2001) recommend a similar procedure. This obviously requires a suitable measure of dissimilarity or similarity between data at two or more scales. Several authors have suggested ordination methods for multiscale analyses (Noy-Meir and Anderson, 1971; Borcard and Legendre, 2002); however, these do not in general provide a similarity measure between scales. Another approach makes use of fractal and multifractal analysis (e.g., based on Rényi's generalized entropy functions; Borda-de-Agua et al., 2002); however, again the degree of self-similarity cannot easily be determined. In this paper, we present a clustering approach to determine similarity between scales. The problems caused by the MAUP can be overcome by using fuzzy clustering. This allows us to identify common structure at different scales using the minimum message length principle. The method was applied to ecological data to test its efficacy at detecting changes in community structure in terms of the composition and relative abundances of species in the community.

A minimum message length similarity measure

Dale (2002; see also Dale and Anand, 2004) proposed using the minimum message length (MML) principle to estimate the Kolmogorov complexity as a sum of two components: model (structure) description and model fit. Kolmogorov complexity is a measure of the difficulty of describing a pattern or algorithm (Li and Vitányi, 1997); however, the measure has seldom been used in ecological informatics (but see Anand and Orlóci, 1996, 2000). In the present work, it is
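The two-part decomposition described above (a cost for stating the model plus a cost for stating the data given that model) can be illustrated with a minimal Python sketch. It is not the MML estimator used by the authors. The model cost is a crude BIC-style stand-in of (p/2) log2 n bits for p parameters, the classes are one-dimensional Gaussians with a hard assignment, and the function name `two_part_length` and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def two_part_length(x, labels):
    """Return (model_bits, data_bits) for a hard clustering of 1-D data.

    Assumption: a BIC-style model cost stands in for a full MML calculation.
    The data part uses densities, so lengths are relative to an unstated
    measurement precision and are only meaningful when compared.
    """
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    n = x.size
    classes = np.unique(labels)
    k = classes.size
    # Parameters: k means, k variances, k - 1 free mixing weights.
    n_params = 3 * k - 1
    model_bits = 0.5 * n_params * np.log2(n)

    data_bits = 0.0
    for c in classes:
        xc = x[labels == c]
        weight = xc.size / n
        var = max(xc.var(), 1e-12)  # floor avoids log(0) for degenerate classes
        # Cost of naming the class of each member, in bits.
        data_bits += -xc.size * np.log2(weight)
        # Negative log2 Gaussian density at the maximum-likelihood fit.
        data_bits += 0.5 * xc.size * np.log2(2 * np.pi * np.e * var)
    return model_bits, data_bits

# Toy data (hypothetical): two well-separated groups of 50 values each.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(10.0, 1.0, 50)])

one_class = two_part_length(x, np.zeros(100, dtype=int))
two_class = two_part_length(x, np.repeat([0, 1], 50))

print("1-class: model %.1f + fit %.1f bits" % one_class)
print("2-class: model %.1f + fit %.1f bits" % two_class)
```

In this toy case the two-class description pays more for the model but far less for the fit, so its total is shorter. That trade-off between structure and fit is what the two-part sum is meant to capture.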

Data and methods

The data were modified as follows to examine changes in community structure, in terms of the composition and relative abundance of species, at different scales. The primary data consist of records of the cover abundance of 119 species of understorey plants. These were collected from line transects at six sites located along a historic pollution gradient (Anand et al., 2003; Tucker and Anand, 2003; Desrochers and Anand, 2005). This gradient reflects decreasing historic sulphur

Results

The results from the independent analysis of the several scales are shown in Table 1a. The number of classes and the associated n-class MML show a close relationship with the size of the population employed and at all scales the clustering provides a markedly better n-class result compared with that for a single class; however, the MML per thing values are not as closely related. Turning to the pairwise analyses (Table 1b), we obtain similar results except that the MML per individual values are
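The contrast between independent and pairwise analyses, and the per-thing normalisation, can be sketched as follows. This is a hedged illustration and not the authors' procedure, whose formula is not reproduced in this excerpt. It uses a BIC-style two-part length as a stand-in for MML, fixes the class labels instead of searching for them, and works on deterministic one-attribute toy data. The names `message_bits`, `toy_scale`, `joint_saving_bits`, `fine`, `coarse` and `shifted` are all hypothetical.

```python
import numpy as np

def message_bits(x, labels):
    """Total two-part length in bits (BIC-style stand-in for MML)."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    n = x.size
    classes = np.unique(labels)
    # Model part: k means, k variances, k - 1 free weights.
    total = 0.5 * (3 * classes.size - 1) * np.log2(n)
    for c in classes:
        xc = x[labels == c]
        var = max(xc.var(), 1e-12)  # floor avoids log(0)
        total += -xc.size * np.log2(xc.size / n)  # naming the class
        total += 0.5 * xc.size * np.log2(2 * np.pi * np.e * var)  # fit
    return float(total)

def toy_scale(n_per_class, centres):
    """Hypothetical one-attribute sample: one regular grid per class."""
    x = np.concatenate([np.linspace(c - 2, c + 2, n_per_class) for c in centres])
    labels = np.repeat(np.arange(len(centres)), n_per_class)
    return x, labels

def joint_saving_bits(a, b):
    """Bits saved by encoding two samples together rather than apart."""
    (xa, la), (xb, lb) = a, b
    pooled = message_bits(np.concatenate([xa, xb]), np.concatenate([la, lb]))
    return message_bits(xa, la) + message_bits(xb, lb) - pooled

fine = toy_scale(40, (0.0, 10.0))     # many small units
coarse = toy_scale(10, (0.0, 10.0))   # fewer units, same class structure
shifted = toy_scale(10, (4.0, 14.0))  # fewer units, displaced classes

for name, (x, labels) in [("fine", fine), ("coarse", coarse)]:
    bits = message_bits(x, labels)
    print(f"{name}: {bits:.1f} bits, {bits / x.size:.2f} bits per thing")

print("saving, shared structure:", round(joint_saving_bits(fine, coarse), 1))
print("saving, shifted structure:", round(joint_saving_bits(fine, shifted), 1))
```

Under these assumptions, pooling two samples that share class structure shortens the message because the model is stated only once, whereas pooling samples with displaced classes lengthens it because the shared classes fit both samples poorly. Dividing a total length by the number of things makes samples of different sizes comparable.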

Discussion

We introduce a new measure for cross-scale analysis of ecological data and the structures it defines. On the basis of a single analysis, it is not possible to decide if the results are an inherent feature of ecological systems. There is certainly some common structure as well as idiosyncratic variation, and the methods used here can separate these components: cross-scale similarity measures based on Kolmogorov complexity provide needed information. The exception is the Bush analysis where we

Acknowledgments

MA acknowledges funding from the Natural Sciences and Engineering Research Council of Canada, the Canada Research Chairs Program and the Ontario Ministry of Science and Technology for infrastructure and salary support for RD and MD. We thank B.C. Tucker and K. Lemire for assistance with field data collection and Steve Kaufman for technical assistance and comments on a previous version of the manuscript. An anonymous reviewer provided helpful comments.

References (47)

M. Anand et al. Complexity in plant communities: the notion and quantification. Journal of Theoretical Biology (1996)
M. Anand et al. On hierarchical partitioning of an ecological complexity function. Ecological Modelling (2000)
M. Anand et al. Characterizing biocomplexity and soil microbial dynamics along a smelter-damaged landscape gradient. The Science of the Total Environment (2003)
D. Borcard et al. All-scale spatial analysis of ecological data by means of principal coordinates of neighbour matrices. Ecological Modelling (2002)
M.B. Dale et al. Markov models for incorporating temporal dependence. Acta Oecologica (2002)
P. Hogeweg. Computing an organism: on the interface between informatic and dynamic processes. BioSystems (2002)
A.N. Pavlov et al. Scaling features of texts, images and time series. Physica A (2001)
C. Ricotta et al. Spatial complexity of ecological communities: Bridging the gap between probabilistic and non-probabilistic uncertainty measures. Ecological Modelling (2006)
Y. Agusta et al. MML clustering of continuous-valued data using Gaussian and t distributions
Y. Agusta et al. Clustering of Gaussian and t distributions using minimum message length
Y. Agusta et al. Unsupervised learning of correlated multivariate Gaussian mixture models using MML
Y. Agusta et al. Unsupervised learning of gamma mixture models using minimum message length
Y. Bar-Yam. Sum rule for multiscale representations of kinematically described systems. Advances in Complex Systems (2002)
P.M. Binder et al. Multiscale analysis of complex systems. Physical Review E (2001)
L. Borda-de-Agua et al. Species-area curves, diversity indices, and species abundance distributions: a multifractal analysis. American Naturalist (2002)
J.H. Brown et al. The fractal nature of nature: power laws, ecological complexity and biodiversity. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences (2002)
C. Brunsdon. A Bayesian perspective on the modifiable areal unit problem using data augmentation
M.B. Dale. Models, measures and messages: an essay on the role for induction. Community Ecology (2002)
M.B. Dale et al. Domain knowledge, evidence, complexity and convergence. International Journal of Ecology and Environmental Sciences (2004)
M.B. Dale et al. Minimum message length clustering: an explication and some applications to vegetation data. Community Ecology (2001)
R.E. Desrochers et al. Quantifying the components of biocomplexity along ecological perturbation gradients. Biodiversity and Conservation (2005)
T. Edgoose et al. MML Markov classification of sequential data. Statistics and Computing (1999)
A.S. Fotheringham et al. The modifiable areal unit problem in statistical analysis. Environment and Planning A (1991)
