Elsevier

Ecological Informatics

Volume 42, November 2017, Pages 46-54

Consensus methods based on machine learning techniques for marine phytoplankton presence–absence prediction

https://doi.org/10.1016/j.ecoinf.2017.09.004

Highlights

  • We present six non-homogeneous consensus models to predict the presence–absence of marine phytoplankton species.

  • In most cases, the consensus models performed better than the single models that were used to construct them.

  • The single models considered were generalized linear models, random forests, boosting and support vector machines.

  • Our results suggest that attention must be given to consensus methods when dealing with ecological prediction.

Abstract

We built several consensus methods by combining binary classifiers, mostly machine learning classifiers, with the aim of testing their capability as predictive tools for the presence–absence of marine phytoplankton species. The consensus methods were constructed from a combination of four single methods (i.e., generalized linear models, random forests, boosting and support vector machines). Six consensus methods were analyzed, each corresponding to a different way of combining single-model predictions. Some of these methods are presented here for the first time. To evaluate the performance of the models, we considered eight phytoplankton species presence–absence data sets together with data on environmental variables. Some of the analyzed species are toxic, whereas others provoke water discoloration, which can cause alarm in the population. Besides the phytoplankton data sets, we tested the models on 10 well-known open access data sets. We evaluated the models' performances over a test sample. For most (72%) of the data sets, a consensus method was the method with the lowest classification error. In particular, a consensus method that weighted single-model predictions in accordance with single-model performances (the weighted average prediction error model, WA-PE) was the one that presented the lowest classification error most often. For the phytoplankton species, the errors of the WA-PE model ranged from 10% for the species Akashiwo sanguinea to 38% for Dinophysis acuminata. This study provides novel approaches to improve the prediction accuracy in species distribution studies and, in particular, in those concerning marine phytoplankton species.

Introduction

In the classification framework of machine learning (ML), ensemble methods or aggregating methods consist of combining the predictions of several classifiers (also called hypotheses or base classifiers) that are built on the same data set. The predictions are combined with the main goal of reducing variance and constructing a more stable and accurate predictor (James et al., 2014; Hastie et al., 2001; Bourel, 2012, 2013). Ensemble methods have had great success not only in the ML community, but also among researchers from other fields with statistical modeling interests, because their accuracy is generally higher than that of individual classifiers (Polikar, 2006). Despite the merits of these methods, it is often a challenge to understand completely the theoretical framework behind them.

The strategy of combining the outputs of different classifiers assumes that individual classifiers make errors on different instances. The logic is that, if each classifier makes different errors, then a good combination of these classifiers can reduce the total error by compensating for the errors of the weaker classifiers. To this end, it is desirable to make each classifier as unique as possible with respect to misclassified instances. In particular, it is necessary to find classifiers whose decision boundaries are adequately different from those of the others. Such a set of classifiers is said to be diverse (Polikar, 2006; Brown et al., 2005 and references therein). In general, however, ensemble algorithms do not attempt to maximize a specific diversity measure. Rather, increased diversity is usually sought somewhat heuristically through various resampling procedures, such as the selection (randomly or not) of different training parameters, models, or subsets of features.
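The argument above can be illustrated with a toy example. The sketch below is not taken from the paper: the labels and the predictions of the three hypothetical classifiers are made up so that each classifier errs on a different instance, and the combination rule is a plain majority vote.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists by majority vote, instance by instance."""
    combined = []
    for labels in zip(*predictions):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


def error_rate(y_true, y_pred):
    """Fraction of misclassified instances."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (illustrative only): true labels and three hypothetical classifiers
y_true = [1, 0, 1, 1, 0, 0]
preds = [
    [0, 0, 1, 1, 0, 0],  # wrong on instance 0 only
    [1, 0, 0, 1, 0, 0],  # wrong on instance 2 only
    [1, 0, 1, 1, 1, 0],  # wrong on instance 4 only
]

individual_errors = [error_rate(y_true, p) for p in preds]
consensus_error = error_rate(y_true, majority_vote(preds))
```

Each classifier misclassifies one instance out of six, yet the vote recovers every label because no two classifiers fail on the same instance. If all three erred on the same instance, the vote would reproduce that error, which is why diversity matters.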

Ensemble methods can be classified into two categories: homogeneous and non-homogeneous. Homogeneous methods combine classifiers of the same nature; examples of this type of method are bagging (Breiman, 1996a), random forests (RF) (Breiman, 2001), and boosting (Freund and Schapire, 1997; Schapire and Freund, 1998). In this paper, we will pay attention to non-homogeneous methods and we will refer to them as consensus methods. Consensus methods consist of a combination of various methods of a different nature. Examples of this type of method are stacking (Wolpert, 1992; Ting and Witten, 1999; Breiman, 1996b) and mixture of experts (Masoudnia and Ebrahimpour, 2014). The different predictors are combined in some way; for instance, in the case of mixture of experts, this is generally done by averaging (with or without weights) or by voting over the models' predictions. In the case of stacking, the outputs of the different classifiers are used to train another classifier, which provides the final decision rule of the method.
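The stacking idea can be made concrete with a small sketch. It is illustrative only and not the authors' implementation: the base classifiers `h1` and `h2`, the toy sample, and the lookup-table meta-classifier (`fit_meta`, `predict_meta`) are all assumptions chosen for brevity. In practice the level-one outputs are usually obtained by cross-validation (Wolpert, 1992), a step this sketch skips because the base rules are fixed rather than fitted.

```python
from collections import Counter, defaultdict


def fit_meta(base_outputs, y):
    """Train a lookup-table meta-classifier on the base classifiers' outputs.

    base_outputs : list of tuples, one tuple of base predictions per instance
    y            : list of true labels
    """
    votes = defaultdict(Counter)
    for outputs, label in zip(base_outputs, y):
        votes[outputs][label] += 1
    # For each observed pattern of base predictions, keep the most frequent true label
    return {pattern: counts.most_common(1)[0][0] for pattern, counts in votes.items()}


def predict_meta(table, outputs, default=0):
    """Final decision: look up the pattern, falling back to `default` if unseen."""
    return table.get(outputs, default)


# Hypothetical base classifiers: fixed threshold rules on two features
def h1(x):
    return 1 if x[0] > 0.5 else 0


def h2(x):
    return 1 if x[1] > 0.5 else 0


# Toy training sample (illustrative only): label is 1 only when both features are high
X_train = [(0.9, 0.8), (0.7, 0.2), (0.1, 0.9), (0.2, 0.1), (0.8, 0.9), (0.6, 0.3)]
y_train = [1, 0, 0, 0, 1, 0]

# Level-one data: the base classifiers' outputs become the meta-classifier's inputs
level_one = [(h1(x), h2(x)) for x in X_train]
table = fit_meta(level_one, y_train)

x_new = (0.95, 0.4)
prediction = predict_meta(table, (h1(x_new), h2(x_new)))
```

Unlike averaging or voting, the combination rule here is learned from data: the meta-classifier discovers that the class is 1 only when both base classifiers agree on 1.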

A way of doing a mixture of experts is inspired, to some extent, by Bayesian voting, and it consists in assigning a weight to each hypothesis (Kuncheva, 2014). A classifier $h$ generally calculates the posterior probability that a given observation belongs to a class. To fix the notation, we can think that $h$ computes a vector $(p_0^{h}(x), p_1^{h}(x))$, where $p_0^{h}(x)$ and $p_1^{h}(x)$ are the posterior probabilities that observation $x$ belongs to class 0 or to class 1, respectively. The consensus of different intermediate classifiers $h_1,\dots,h_M$ is to generate a classifier $F$ of the form $F(x)=\operatorname*{Argmax}_{k\in\{0,1\}}\sum_{m=1}^{M} w_{h_m,L}\,p_k^{h_m}(x)$.

This type of combination is called a weighted averaging combining rule. In this paper, we will compare it empirically to other mixture-of-expert rules and to two versions of stacking.
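A minimal sketch of the weighted averaging rule for a single observation follows. The function name, the toy posterior probabilities, and the weighting scheme (weights proportional to one minus a hypothetical validation error, then normalized) are assumptions for illustration; the exact weights used by the paper's WA-PE and WA-AUC models are not reproduced here.

```python
def weighted_average_consensus(probas, weights):
    """Return the class k in {0, 1} maximizing sum_m w_m * p_k^{h_m}(x).

    probas  : list of (p0, p1) pairs, one per classifier h_m
    weights : list of non-negative weights w_m, one per classifier
    """
    scores = [0.0, 0.0]
    for (p0, p1), w in zip(probas, weights):
        scores[0] += w * p0
        scores[1] += w * p1
    # Ties are resolved in favor of class 0 (an arbitrary choice)
    return 1 if scores[1] > scores[0] else 0


# Toy posteriors (p0, p1) for one observation from three hypothetical classifiers
probas = [(0.70, 0.30), (0.65, 0.35), (0.20, 0.80)]

# Hypothetical validation errors; assumed weighting: w_m proportional to 1 - error_m
errors = [0.40, 0.45, 0.10]
raw = [1.0 - e for e in errors]
weights = [r / sum(raw) for r in raw]

weighted_decision = weighted_average_consensus(probas, weights)
unweighted_decision = weighted_average_consensus(probas, [1.0, 1.0, 1.0])
```

With equal weights, the two weaker classifiers carry the decision toward class 0. Once the weights reflect performance, the most accurate classifier dominates and the consensus predicts class 1.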

Concerning the ecological modeling of species presence–absence, the performance of different statistical techniques could vary significantly from a particular case study to another, and it is not very clear sometimes which model is the most suitable. There are two possible strategies to reduce the models' uncertainty: (1) by acquiring an understanding via extensive model comparisons as to which method will generally provide the best predictive performance and in what conditions (Marmion et al., 2009b) and (2) by using consensus methods (i.e., non-homogeneous ensemble methods) (Thuiller, 2004; Thuiller et al., 2005; Araújo and New, 2007; Marmion et al., 2009b). As mentioned earlier, consensus methods overcome the problem of variability in the predictions of different single models since they are based on the combination of their predictions. Hence, a relevant combination of several unbiased (i.e., with good accuracy) model outputs will result in a more accurate prediction.

The matter rests in choosing adequate single models and finding a relevant algorithm to combine them. When dealing with ecological problems, ML techniques seem to be good candidates for single models because of their predictive capacity (Olden and Jackson, 2002). These techniques are frequently and increasingly considered in ecological studies, in particular in modeling species presence–absence or abundance from environmental variables (De’ath and Fabricius, 2000; Guisan et al., 2002; Drake et al., 2006; Cutler et al., 2007; Kampichler et al., 2010; Olden and Jackson, 2002). ML methods have advantages over traditional statistical methods (e.g., linear models and generalized linear models) since they can deal with some characteristics typical of ecological data such as unusual distributions, non-linearity, multiple missing values, complex data interactions, and dependence on the observations (Guisan et al., 2002; Cutler et al., 2007; Crisci et al., 2012). Besides their flexibility, they typically outperform traditional approaches, making them ideal for modeling ecological systems (Olden et al., 2008). In fact, concerning ecological studies, ML methods are always considered when performing consensus models (Marmion et al., 2009a, 2009b; Lauzeral et al., 2015; Comte and Grenouillet, 2013; Thuiller et al., 2009). Besides ML techniques, more classical techniques such as generalized linear modeling or linear discriminant analysis are usually considered in the consensus construction (Thuiller et al., 2009; Marmion et al., 2009a, 2009b; Lauzeral et al., 2015; Comte and Grenouillet, 2013) since, in some cases (e.g., linear relations between the predictors and the response variable), these methods may outperform ML techniques.

It must be noted that, although the consensus approach clearly has a number of attractive characteristics, the understanding of its merits for ecological prediction is still limited (Marmion et al., 2009b); hence, further studies comparing the predictive capacity of consensus methods with that of single methods are needed. It must be noted also that most of the applications of consensus methods in ecological studies are related to the study of species distribution models (SDMs) (Guisan and Thuiller, 2005).

In this paper, we explore the performance of six different consensus methods for predicting the presence–absence of eight marine phytoplankton species from the Atlantic coast of Uruguay. Four of the methods are mixtures of experts, and the other two are stacking applications. Moreover, we analyze the performance of the consensus models by considering 10 well-known open access data sets. To generate the consensus, we combined four individual models with very different structures. Three of them have been documented as some of the most accurate ML techniques: boosting, RF, and support vector machine (SVM). The fourth is a generalized linear model (GLM), which could better capture the linear relationships in the data. For a more detailed description of these models, we refer the reader to the Supplementary material.

Section snippets

Methods

In this section, we present i) the data sets used to evaluate the performance of the models; ii) the principal concepts of supervised classification; iii) a description of the consensus models analyzed in this work; iv) the way in which we calculated the prediction error of the models; and v) the model tuning and optimization, and the use of software and functions.

Models' performance

With all the data sets considered together, the WA-PE consensus method was the model that presented the lowest generalization error in most of the cases (9 cases out of 18; Fig. 3, Table 2, Table 3). MV and StackRF among the consensus methods, and RF among the single methods, were next to WA-PE in “number of wins” (two wins each) (Fig. 3a). Finally, the remaining methods presented the lowest generalization error only once (GLM, boosting, and SVM) or on no occasion at all (MeanProb, WA-AUC, and

Consensus models' performance

In this study, we applied six different consensus methods to predict the presence–absence of marine phytoplankton species. Furthermore, we evaluated the performance of the models using open access data sets.

To construct the consensus, we decided to combine three ML techniques that are well known in the ML community, present very good performance generally, and, at the same time, are well known and broadly used in ecological studies (e.g., Cutler et al., 2007; De’ath, 2007; Guo et al., 2005). It

Conclusions

Consensus methods present an interesting alternative for developing predictive tools to create sound monitoring and management tools. They have been shown to produce favorable results compared with those of single methods (Polikar, 2006), although further applications in the ecology area must be addressed to determine the potential of these methods. In particular, further knowledge in the context of marine phytoplankton, and especially on species that represent challenges for water managers and

Acknowledgments

This work was supported by ECOS-Sud Aprendizaje Automático para la Modelización y el Análisis de Recursos Naturales (project n° U14E02) and by ANII-Uruguay.

References (74)

  • Jeong, K.-S. et al. (2001). Prediction and elucidation of phytoplankton dynamics in the Nakdong River (Korea) by means of a recurrent artificial neural network. Ecol. Model.
  • Kampichler, C. et al. (2010). Classification in conservation biology: a comparison of five machine-learning methods. Eco. Inform.
  • Kim, J.H. et al. (2016). Killing potential protist predators as a survival strategy of the newly described dinoflagellate Alexandrium pohangense. Harmful Algae
  • Lee, J.H. et al. (2003). Neural network modelling of coastal algal blooms. Ecol. Model.
  • Marmion, M. et al. (2009). Statistical consensus methods for improving predictive geomorphology maps. Comput. Geosci.
  • McGillicuddy, D. (2010). Models of harmful algal blooms: conceptual, empirical, and numerical approaches. J. Mar. Syst.
  • Moisen, G.G. et al. (2006). Predicting tree species presence and basal area in Utah: a comparison of stochastic gradient boosting, generalized additive models, and tree-based methods. Ecol. Model.
  • Odebrecht, C. et al. (2014). Surf zone diatoms: a review of the drivers, patterns and role in sandy beaches food chains. Estuar. Coast. Shelf Sci.
  • Richardson, A. et al. (2003). A dynamic quantitative approach for predicting the shape of phytoplankton profiles in the ocean. Prog. Oceanogr.
  • Scardi, M. et al. (1999). Developing an empirical model of phytoplankton primary production: a neural network case study. Ecol. Model.
  • Wilson, H. et al. (2001). Towards a generic artificial neural network model for dynamic predictions of algal abundance in freshwater lakes. Ecol. Model.
  • Wolpert, D. (1992). Stacked generalization. Neural Netw.
  • Alexandre, L.A. et al. Combining independent and unbiased classifiers using weighted average.
  • Andersen, P. et al. (2003). Estimating Cell Numbers. Manual on Harmful Marine Microalgae.
  • Bourel, M. (2012). Model aggregation methods and applications. Mem. Trab. difusión Cient. Tec.
  • Bourel, M. (2013). Apprentissage statistique par aggregation de modeles. Ph.D Thesis, Université Aix-Marseille, France.
  • Breiman, L. (1996a). Bagging predictors. Mach. Learn.
  • Breiman, L. (1996b). Stacked regression. Mach. Learn.
  • Breiman, L. (1998). Arcing classifiers. Ann. Stat.
  • Breiman, L. (2001). Random forests. Mach. Learn.
  • Brotons, L. et al. (2004). Presence–absence versus presence-only modelling methods for predicting bird habitat suitability. Ecography
  • Campbell, E.E. (1996). The global distribution of surf diatom accumulations. Rev. Chil. Hist. Nat.
  • Comte, L. et al. (2013). Species distribution modelling and imperfect detection: comparing occupancy versus consensus methods. Divers. Distrib.
  • Cutler, D.R. et al. (2007). Random forests for classification in ecology. Ecology
  • De’ath, G. (2007). Boosted trees for ecological modeling and prediction. Ecology
  • De’ath, G. et al. (2000). Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology
  • Devroye, L. et al. (1997). A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics.