Neural pattern recognition and multivariate data: water typology of the Paraíba do Sul River, Brazil

https://doi.org/10.1016/j.envsoft.2004.03.018

Abstract

Modelling environmental processes is a complicated task because the number of variables involved is usually high. This paper considers the use of neural pattern recognition to analyze structures within large data sets related to the study of ecological phenomena. The purpose is to use the information obtained from an unsupervised clustering step in a pattern recognition algorithm to gain insight into the processes occurring over the period of observation. Once the processes are identified, more reliable models can be derived. The method proved helpful in highlighting the major fluctuations in river water chemistry and in identifying complementary characteristics relevant to understanding the processes involved in the transport of dissolved nutrients at the Paraíba do Sul River basin outlet.

Introduction

The continuous development of computing facilities has made the analysis of multivariate data far more routine. This is especially relevant to research in environmental sciences, where the number of variables involved in analyses is generally high and modelling is a complex process.

Recently, alternative methods of data analysis, such as neural networks and fuzzy-logic-based algorithms, have emerged as interesting tools for time series analysis (Clair and Ehrman, 1998; Nunnari et al., 1998; Maier and Dandy, 2000; Maier and Dandy, 1996; Dawson and Wilby, 1999; Maier and Dandy, 1998; Maier and Dandy, 1999). The approach used up to now, however, is based on supervised training, which means that both the input data (measured variables) and the desired output (quantity to be predicted) are presented simultaneously to a back-propagation-like neural network. The training process consists of, for instance, minimizing the error between the actual output and the desired one by adjusting the interconnection weights through a gradient descent method (Hertz et al., 1991).
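The supervised scheme described above can be illustrated with a minimal sketch. It trains a single linear unit, the simplest case of error minimization by gradient descent, rather than a full multilayer back-propagation network. The data are synthetic, and the function name, learning rate and number of epochs are assumptions made only for this illustration.

```python
import numpy as np


def train_linear_unit(inputs, targets, learning_rate=0.1, epochs=500):
    """Fit weights and bias by gradient descent on the mean squared error."""
    n_samples, n_features = inputs.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(epochs):
        outputs = inputs @ weights + bias
        errors = outputs - targets
        # Gradient of 0.5 * mean(errors**2) with respect to weights and bias.
        weights -= learning_rate * (inputs.T @ errors) / n_samples
        bias -= learning_rate * errors.mean()
    return weights, bias


rng = np.random.default_rng(0)
inputs = rng.normal(size=(200, 3))          # synthetic "measured variables"
true_weights = np.array([1.5, -2.0, 0.5])   # hypothetical relationship
targets = inputs @ true_weights + 0.3       # synthetic "desired output"

weights, bias = train_linear_unit(inputs, targets)
mse = float(np.mean((inputs @ weights + bias - targets) ** 2))
print("weights:", weights, "bias:", bias, "mse:", mse)
```

The fitted weights reproduce the mapping from input to output, but they are only a transfer function: nothing in them identifies which processes generated the data.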

Although some interesting results can be obtained in forecasting, the simple use of a back-propagation-like neural network provides no direct insight into the relevant ecological processes. Back-propagation algorithms are vulnerable to the argument that they are merely a way to compute a transfer matrix linking the experimental data to the desired results.

In this context, traditional methods of multivariate analysis, such as principal components analysis (PCA), are better suited to providing reliable information for further modelling of ecological processes. These methods, however, present a well-known and extremely serious drawback: interpreting the components and relating them to actual events is very laborious, and relevant information may be ignored. Establishing correlations between variables in a multidimensional space through PCA may be regarded as an eigenvalue problem.

Eigenvalue problems carry an intrinsic indeterminacy: the result of any similarity transformation applied to a set of eigenvectors associated with a solution is also a solution. Conventional analytical tools, such as PCA, are based on algorithms which find a generic solution and then apply transformation matrices, so that the basis vectors of the space with reduced dimensionality are rotated towards the axes explaining the major part of the variance within the data set. This usually leads to eigenvectors which cannot immediately be translated into the kind of information the analyst is used to dealing with.
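The indeterminacy can be checked numerically. The sketch below uses synthetic data, not the river data set, and an orthogonal rotation as one particular case of such a transformation. The choice of three retained components and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=(50, 6))    # synthetic: 50 snapshots of 6 variables
data -= data.mean(axis=0)          # centre each variable

# PCA posed as an eigenvalue problem on the covariance matrix.
covariance = data.T @ data / (len(data) - 1)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)   # ascending order
basis = eigenvectors[:, -3:]       # columns: the three leading components

# An arbitrary orthogonal matrix, obtained from a QR factorization.
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
rotated_basis = basis @ rotation

# Both bases span the same subspace, so they reconstruct the data equally well.
reconstruction = data @ basis @ basis.T
rotated_reconstruction = data @ rotated_basis @ rotated_basis.T
print(np.allclose(reconstruction, rotated_reconstruction))
```

Because both bases fit the data equally well, the fit alone cannot indicate which set of vectors is the physically meaningful one.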

In the absence of noise, the dimensionality of the space would simply be the number of non-zero eigenvalues. In the case of real measurements, however, a threshold must be established to indicate when an eigenvalue can be considered statistically significant. As both transient and episodic events are related to small eigenvalues, the chances of cutting off important information through a bad choice of the significance threshold are considerable.
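A short sketch shows how the number of retained components depends on the threshold. The data are synthetic: two persistent processes, one short-lived event and measurement noise, all hypothetical. The helper names and the two threshold values are assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np


def explained_variance_percent(data):
    """Percentage of total variance carried by each eigenvalue, descending."""
    centred = data - data.mean(axis=0)
    covariance = centred.T @ centred / (len(centred) - 1)
    eigenvalues = np.linalg.eigvalsh(covariance)[::-1]
    return 100.0 * eigenvalues / eigenvalues.sum()


def components_above(percentages, threshold_percent):
    """Number of components whose share of variance reaches the threshold."""
    return int(np.sum(percentages >= threshold_percent))


rng = np.random.default_rng(2)
n_samples, n_variables = 300, 8
persistent = rng.normal(size=(n_samples, 2)) @ rng.normal(size=(2, n_variables))
episodic = np.zeros((n_samples, n_variables))
episodic[100:105, 5] = 4.0         # brief event affecting a single variable
noise = 0.1 * rng.normal(size=(n_samples, n_variables))
data = persistent + episodic + noise

percentages = explained_variance_percent(data)
for threshold in (5.0, 0.5):       # two hypothetical significance thresholds
    kept = components_above(percentages, threshold)
    print(f"threshold {threshold}%: {kept} components kept")
```

Comparing the two printed counts shows how sensitive the retained dimensionality is to the cut-off. A component that carries a brief event can fall on either side of it.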

The purpose of this paper is to introduce a slightly different computational sequence, successfully employed in Materials Science (Gatts et al., 1995a; Gatts et al., 1995b). The idea is to consider each series of measurements at a given time as a snapshot of the system under consideration and to impose a constraint ensuring that the solution has the same general characteristics as the data set. This is accomplished by submitting the experimental data to an unsupervised cluster analysis algorithm, so that the n snapshots (data vectors) are distributed into P (<n) classes of similarity. The set composed of the P vectors representing each category of data is presumed to be a solution, and an orthonormal basis set is derived from it. The projection of the original data set onto the directions defined by these basis vectors generates a profile indicating the contribution of each factor (data cluster, class or category) at a given time. From the calculated profile and the basis set it is possible to reconstruct the data set, and comparing the experimental data set with the reconstructed one makes it possible to verify the adequacy of the model.
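The sequence described above (cluster, derive an orthonormal basis, project, reconstruct, compare) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. Plain k-means with a farthest-point start stands in for the ART-like network, and QR factorization is assumed as the orthonormalization step. The data are synthetic mixtures of P = 3 hypothetical signatures, and P is fixed in advance, whereas the paper determines it from the clustering step. All function names are invented for this example.

```python
import numpy as np


def farthest_point_init(snapshots, n_classes):
    """Deterministic start: each new prototype is the snapshot farthest from the others."""
    prototypes = [snapshots[0]]
    for _ in range(n_classes - 1):
        distances = np.min(
            [np.linalg.norm(snapshots - p, axis=1) for p in prototypes], axis=0
        )
        prototypes.append(snapshots[distances.argmax()])
    return np.array(prototypes)


def cluster_prototypes(snapshots, n_classes, n_iterations=50):
    """Stand-in clustering step: plain k-means, NOT the ART-like network."""
    prototypes = farthest_point_init(snapshots, n_classes)
    for _ in range(n_iterations):
        distances = np.linalg.norm(
            snapshots[:, None, :] - prototypes[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        for k in range(n_classes):
            members = snapshots[labels == k]
            if len(members) > 0:   # keep the old prototype if a class empties
                prototypes[k] = members.mean(axis=0)
    return prototypes


def orthonormal_basis(prototypes):
    """Orthonormalize the class prototypes (rows) via QR factorization."""
    q, _ = np.linalg.qr(prototypes.T)
    return q.T                     # rows are orthonormal basis vectors


rng = np.random.default_rng(3)
signatures = rng.uniform(0.0, 1.0, size=(3, 10))         # hypothetical P = 3
mixing = rng.dirichlet(alpha=[0.3, 0.3, 0.3], size=120)  # n = 120 snapshots
snapshots = mixing @ signatures + 0.01 * rng.normal(size=(120, 10))

prototypes = cluster_prototypes(snapshots, n_classes=3)
basis = orthonormal_basis(prototypes)        # shape (P, variables)
profile = snapshots @ basis.T                # contribution of each factor per snapshot
reconstructed = profile @ basis
relative_error = np.linalg.norm(snapshots - reconstructed) / np.linalg.norm(snapshots)
print("relative reconstruction error:", relative_error)
```

A small reconstruction error suggests that P classes are enough to describe the data. A large one would suggest that a relevant category of snapshots is missing from the basis.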

The method

The procedure proposed here is developed in two steps. Initially, the dimensionality of the problem is determined using an unsupervised learning algorithm, a modified ART-like network (Hertz et al., 1991; Gatts et al., 1995a; Gatts et al., 1995b; Souza et al., 1993). The representation of the categories of data found in the first stage then feeds a competitive network (Hertz et al., 1991) in order to refine the model.
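The refinement stage can be illustrated with a generic winner-take-all competitive update. This is a textbook-style sketch and is not claimed to match the authors' network. The rough starting prototypes, which would come from the first stage, are supplied by hand here. The function name, learning rate, number of epochs and two-dimensional synthetic data are all assumptions made for illustration.

```python
import numpy as np


def competitive_refine(samples, prototypes, learning_rate=0.05, epochs=20, seed=0):
    """Winner-take-all learning: only the closest prototype moves towards each sample."""
    rng = np.random.default_rng(seed)
    refined = prototypes.astype(float).copy()
    for _ in range(epochs):
        for index in rng.permutation(len(samples)):
            sample = samples[index]
            winner = np.linalg.norm(refined - sample, axis=1).argmin()
            refined[winner] += learning_rate * (sample - refined[winner])
    return refined


rng = np.random.default_rng(4)
centres = np.array([[0.0, 0.0], [5.0, 5.0]])   # hypothetical class centres
samples = np.vstack([c + 0.3 * rng.normal(size=(100, 2)) for c in centres])
initial = np.array([[1.0, 1.0], [4.0, 4.0]])   # rough first-stage prototypes
refined = competitive_refine(samples, initial)
print(refined)
```

Each prototype drifts towards the mean of the samples it wins, so the first-stage categories are sharpened without changing their number.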

The ART-like network produces what can be considered a

Software implementation

The routines to perform the pattern recognition tasks were implemented in MATLAB®. In order to provide free access and use of the code, a version in Python is under development.

Results and discussion

The first problem related to the application of Principal Component Analysis to the experimental data is the determination of how many components are significant. Fig. 1 shows the percentage of the total variance explained by each eigenvalue. From this picture it is straightforward to conclude that at least six factors must be considered. Defining the threshold for distinguishing relevant information from noise, however, is not an easy task. The consideration of seven components may be

Conclusion

This study describes a method of data processing that can be used as an alternative to conventional methods of multivariate analysis. The main and obvious advantage is that data interpretation is simplified, because the output of the computational algorithm has the format the analyst is used to dealing with. In addition, the method prevents relevant information from being discarded. The method proved helpful in highlighting the major fluctuations in river water chemistry, as well as to point to

Acknowledgements

The authors gratefully acknowledge the financial support from FAPERJ and CNPq.

References (11)

  • C. Gatts et al., Ultramicroscopy (1995)
  • H.R. Maier et al., Environmental Modelling and Software (1998)
  • H.R. Maier et al., Environmental Modelling and Software (2000)
  • G. Nunnari et al., Ecological Modelling (1998)
  • T.A. Clair et al., Water Resources Research (1998)
There are more references available in the full text version of this article.

