Consensus methods based on machine learning techniques for marine phytoplankton presence–absence prediction
Introduction
In the classification framework of machine learning (ML), ensemble methods or aggregating methods consist in combining the predictions of several classifiers (also called hypotheses or base classifiers) that are trained on the same data set. The predictions are combined with the main goal of reducing variance and constructing a more stable and accurate predictor (James et al., 2014; Hastie et al., 2001; Bourel, 2012, 2013). Ensemble methods have had great success not only in the ML community, but also among researchers from different fields and with statistical modeling interests, because of their accuracy, which is generally higher than that of individual classifiers (Polikar, 2006). Despite the merits of these methods, it is often a challenge to understand completely the theoretical framework behind them.
The strategy of combining the outputs of different classifiers relies on the individual classifiers making errors on different instances. The logic is that, if each classifier makes different errors, then a good combination of these classifiers can reduce the total error by compensating for the errors of the weaker classifiers. To this end, it is desirable to make each classifier as unique as possible with respect to misclassified instances. In particular, it is necessary to find classifiers whose decision boundaries are adequately different from those of others. Such a set of classifiers is said to be diverse (Polikar, 2006; Brown et al., 2005, and references therein). In general, however, ensemble algorithms do not attempt to maximize a specific diversity measure. Rather, increased diversity is usually sought somewhat heuristically through various resampling procedures, such as the selection (randomly or not) of different training parameters, models, or subsets of features.
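The benefit of combining classifiers that err on different instances can be illustrated with a short calculation. The following Python sketch assumes an idealized setting that is not part of this study: an odd number of base classifiers whose errors are independent and share a common error rate, combined by majority vote. The function name `majority_vote_error` is hypothetical.

```python
from math import comb


def majority_vote_error(n_classifiers, error_rate):
    """Error probability of a majority vote over independent classifiers.

    Idealized assumption: every classifier errs independently with the
    same probability `error_rate`, and `n_classifiers` is odd (no ties).
    """
    if n_classifiers % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    # Smallest number of wrong votes that makes the majority wrong.
    k_min = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * error_rate ** k
        * (1.0 - error_rate) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, round(majority_vote_error(n, 0.3), 5))
```

Under these assumptions, three classifiers with an individual error rate of 0.3 give a majority-vote error of 0.216, and five give about 0.163. When the individual error rate exceeds 0.5, the vote is worse than a single classifier, which is why the base classifiers need to be both diverse and reasonably accurate. Real classifiers trained on the same data are rarely independent, so the gain in practice is typically smaller.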
Ensemble methods can be classified into two categories: homogeneous and non-homogeneous. Homogeneous methods combine classifiers of the same nature; examples of this type of method are bagging (Breiman, 1996a), random forests (RF) (Breiman, 2001), and boosting (Freund and Schapire, 1997; Schapire and Freund, 1998). In this paper, we focus on non-homogeneous methods and refer to them as consensus methods. Consensus methods combine various methods of a different nature. Examples of this type of method are stacking (Wolpert, 1992; Ting and Witten, 1999; Breiman, 1996b) and mixture of experts (Masoudnia and Ebrahimpour, 2014). The different predictors are combined in some way; for instance, in the case of mixture of experts, this is generally done by averaging (with or without weights) or by voting over the models' predictions. In the case of stacking, the outputs of the different classifiers are used to train another classifier, which provides the final decision rule of the method.
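The two-level structure of stacking can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two base classifiers (`base_a`, `base_b`) are hypothetical fixed functions rather than fitted models, the toy data are made up, and the meta-classifier is a simple nearest-centroid rule chosen for brevity, not one of the stacking variants evaluated in this work. In practice, the base outputs used to train the meta-classifier are usually obtained by cross-validation; that step is omitted here because the base classifiers are not fitted.

```python
import math


def base_a(x):
    """Hypothetical base classifier: P(class 1) driven by the first feature."""
    return 1.0 / (1.0 + math.exp(-6.0 * (x[0] - 0.5)))


def base_b(x):
    """Hypothetical base classifier: P(class 1) driven by the second feature."""
    return 1.0 / (1.0 + math.exp(-6.0 * (x[1] - 0.5)))


def level_one_features(x, bases):
    """Collect the base classifiers' outputs into one feature vector."""
    return [h(x) for h in bases]


def fit_meta(Z, y):
    """Train the meta-classifier: one centroid of base outputs per class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [z for z, t in zip(Z, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def stacked_predict(x, bases, centroids):
    """Final decision: class whose centroid is closest to the base outputs."""
    z = level_one_features(x, bases)
    return min(
        centroids,
        key=lambda label: sum(
            (zi - ci) ** 2 for zi, ci in zip(z, centroids[label])
        ),
    )


if __name__ == "__main__":
    # Toy data (made up): presence (1) depends on the first feature only.
    X = [(0.1, 0.9), (0.2, 0.1), (0.3, 0.8), (0.4, 0.2),
         (0.6, 0.9), (0.7, 0.1), (0.8, 0.7), (0.9, 0.3)]
    y = [0, 0, 0, 0, 1, 1, 1, 1]
    bases = [base_a, base_b]
    Z = [level_one_features(x, bases) for x in X]
    centroids = fit_meta(Z, y)
    print([stacked_predict(x, bases, centroids) for x in X])
```

On this toy data the two class centroids differ only along the output of `base_a`, so the meta-classifier effectively learns to rely on the informative base classifier and to ignore the uninformative one.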
A way of doing a mixture of experts is inspired, to some extent, by Bayesian voting, and it consists in assigning a weight to each hypothesis (Kuncheva, 2014). A classifier $h$ generally calculates the posterior probability that a given observation belongs to a class. To fix the notation, we can think that $h$ computes a vector $h(x) = (p_0(x), p_1(x))$, where $p_0(x)$ and $p_1(x)$ are the posterior probabilities that observation $x$ belongs to class 0 or to class 1, respectively. The consensus of $M$ different intermediate classifiers $h_1, \dots, h_M$ is to generate a classifier $F$ of the form
$$F(x) = \sum_{m=1}^{M} w_m \, h_m(x),$$
where $w_m$ is the weight assigned to $h_m$.
This type of combination is called a weighted averaging combining rule. In this paper, we will compare it empirically to other mixture-of-expert rules and to two versions of stacking.
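A weighted averaging combining rule of this kind can be sketched as follows. The sketch is illustrative only: the function name `weighted_average_rule`, the example posteriors, and the weights are all made up, and normalizing the weights so that they sum to 1 is an assumption of the sketch rather than something stated above.

```python
def weighted_average_rule(posteriors, weights):
    """Combine class posteriors from several classifiers by weighted averaging.

    posteriors: one (p0, p1) pair per base classifier for a single observation.
    weights:    one non-negative weight per base classifier.
    Returns the combined (f0, f1) pair and the predicted class (0 or 1).
    Normalizing the weights to sum to 1 is an assumption of this sketch.
    """
    if len(posteriors) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = float(sum(weights))
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and not all zero")
    f0 = sum(w * p[0] for w, p in zip(weights, posteriors)) / total
    f1 = sum(w * p[1] for w, p in zip(weights, posteriors)) / total
    # Predict the class with the larger combined posterior.
    return (f0, f1), (1 if f1 > f0 else 0)


if __name__ == "__main__":
    # Made-up posteriors from four base classifiers for one observation.
    posteriors = [(0.70, 0.30), (0.60, 0.40), (0.20, 0.80), (0.55, 0.45)]
    print(weighted_average_rule(posteriors, [1, 1, 1, 1]))  # equal weights
    print(weighted_average_rule(posteriors, [1, 1, 3, 1]))  # third classifier trusted more
```

With equal weights the combined posterior for class 1 is 0.4875 and the prediction is class 0; giving the third classifier a weight of 3 raises it to about 0.592 and the prediction becomes class 1. Equal weights reduce the rule to a plain average of the posterior probabilities.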
Concerning the ecological modeling of species presence–absence, the performance of different statistical techniques can vary significantly from one case study to another, and it is sometimes unclear which model is the most suitable. There are two possible strategies to reduce the models' uncertainty: (1) acquiring an understanding, via extensive model comparisons, of which method will generally provide the best predictive performance and under what conditions (Marmion et al., 2009b) and (2) using consensus methods (i.e., non-homogeneous ensemble methods) (Thuiller, 2004; Thuiller et al., 2005; Araújo and New, 2007; Marmion et al., 2009b). As mentioned earlier, consensus methods overcome the problem of variability in the predictions of different single models since they are based on the combination of their predictions. Hence, a relevant combination of several unbiased (i.e., with good accuracy) model outputs will result in a more accurate prediction.
The matter rests in choosing adequate single models and finding a relevant algorithm to combine them. When dealing with ecological problems, ML techniques seem to be good candidates for single models because of their predictive capacity (Olden and Jackson, 2002). These techniques are frequently and increasingly considered in ecological studies, in particular in modeling species presence–absence or abundance from environmental variables (De’ath and Fabricius, 2000; Guisan et al., 2002; Drake et al., 2006; Cutler et al., 2007; Kampichler et al., 2010; Olden and Jackson, 2002). ML methods have advantages over traditional statistical methods (e.g., linear models and generalized linear models) since they can deal with some characteristics typical of ecological data such as unusual distributions, non-linearity, multiple missing values, complex data interactions, and dependence among the observations (Guisan et al., 2002; Cutler et al., 2007; Crisci et al., 2012). Besides their flexibility, they typically outperform traditional approaches, which makes them well suited for modeling ecological systems (Olden et al., 2008). In fact, in ecological studies, ML methods are always considered when building consensus models (Marmion et al., 2009a, 2009b; Lauzeral et al., 2015; Comte and Grenouillet, 2013; Thuiller et al., 2009). Besides ML techniques, more classical techniques such as generalized linear modeling or linear discriminant analysis are usually considered in the consensus construction (Thuiller et al., 2009; Marmion et al., 2009a, 2009b; Lauzeral et al., 2015; Comte and Grenouillet, 2013) since, in some cases (e.g., linear relations between the predictors and the response variable), these methods may outperform ML techniques.
It must be noted that, although the consensus approach clearly has a number of attractive characteristics, the understanding of its merits for ecological prediction is still limited (Marmion et al., 2009b); hence, further studies comparing the predictive capacity of consensus methods with that of single methods are needed. It must be noted also that most of the applications of consensus methods in ecological studies are related to the study of species distribution models (SDMs) (Guisan and Thuiller, 2005).
In this paper, we explore the performance of six different consensus methods for predicting the presence–absence of eight marine phytoplankton species from the Atlantic coast of Uruguay. Four of the methods are mixtures of experts, and the other two are stacking applications. Moreover, we analyze the performance of the consensus models on 10 well-known open access data sets. To generate the consensus, we combined four individual models with very different structures. Three of them have been documented as some of the most accurate ML techniques: boosting, RF, and support vector machine (SVM). The fourth is a generalized linear model (GLM), which could better capture the linear relationships in the data. For a more detailed description of these models, we refer the reader to the Supplementary material.
Methods
In this section, we present i) the data sets used to evaluate the performance of the models; ii) the principal concepts of supervised classification; iii) a description of the consensus models analyzed in this work; iv) the way in which we calculated the prediction error of the models; and v) the model tuning and optimization, and the software and functions used.
Models' performance
With all the data sets considered together, the WA-PE consensus method was the model that most often presented the lowest generalization error (9 cases out of 18; Fig. 3, Table 2, Table 3). MV and StackRF among the consensus methods, and RF among the single methods, came next after WA-PE in “number of wins” (two wins each) (Fig. 3a). Finally, the remaining methods presented the lowest generalization error only once (GLM, boosting, and SVM) or on no occasion at all (MeanProb, WA-AUC, and
Consensus models' performance
In this study, we applied six different consensus methods to predict the presence–absence of marine phytoplankton species. Furthermore, we evaluated the performance of the models using open access data sets.
To construct the consensus, we decided to combine three ML techniques that are well known in the ML community, generally perform very well, and, at the same time, are broadly used in ecological studies (e.g., Cutler et al., 2007; De’ath, 2007; Guo et al., 2005). It
Conclusions
Consensus methods present an interesting alternative for developing predictive tools that support sound monitoring and management. They have been shown to produce favorable results compared with single methods (Polikar, 2006), although further applications in ecology are needed to determine the potential of these methods. In particular, further knowledge in the context of marine phytoplankton, and especially on species that represent challenges for water managers and
Acknowledgments
This work was supported by ECOS-Sud Aprendizaje Automático para la Modelización y el Análisis de Recursos Naturales (project n° U14E02) and by ANII-Uruguay.
References (74)
- et al. Predicting potentially toxigenic Pseudo-nitzschia blooms in the Chesapeake Bay. J. Mar. Syst. (2010)
- et al. Ensemble forecasting of species distributions. Trends Ecol. Evol. (2007)
- et al. Trophic niche shifts driven by phytoplankton in sandy beach ecosystems. Estuar. Coast. Shelf Sci. (2016)
- et al. Diversity creation methods: a survey and categorisation. Inf. Fusion (2005)
- et al. A review of supervised machine learning algorithms and their applications to ecological data. Ecol. Model. (2012)
- et al. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. (1997)
- et al. Generalized linear and generalized additive models in studies of species distributions: setting the scene. Ecol. Model. (2002)
- et al. Support vector machines for predicting distribution of sudden oak death in California. Ecol. Model. (2005)
- et al. Red tides in Masan Bay, Korea in 2004–2005: I. Daily variations in the abundance of red-tide organisms and environmental factors. Harmful Algae (2013)
- et al. A hierarchy of conceptual models of red-tide generation: nutrition, behavior, and biological interactions. Harmful Algae (2015)
- Prediction and elucidation of phytoplankton dynamics in the Nakdong River (Korea) by means of a recurrent artificial neural network. Ecol. Model.
- Classification in conservation biology: a comparison of five machine-learning methods. Eco. Inform.
- Killing potential protist predators as a survival strategy of the newly described dinoflagellate Alexandrium pohangense. Harmful Algae
- Neural network modelling of coastal algal blooms. Ecol. Model.
- Statistical consensus methods for improving predictive geomorphology maps. Comput. Geosci.
- Models of harmful algal blooms: conceptual, empirical, and numerical approaches. J. Mar. Syst.
- Predicting tree species presence and basal area in Utah: a comparison of stochastic gradient boosting, generalized additive models, and tree-based methods. Ecol. Model.
- Surf zone diatoms: a review of the drivers, patterns and role in sandy beaches food chains. Estuar. Coast. Shelf Sci.
- A dynamic quantitative approach for predicting the shape of phytoplankton profiles in the ocean. Prog. Oceanogr.
- Developing an empirical model of phytoplankton primary production: a neural network case study. Ecol. Model.
- Towards a generic artificial neural network model for dynamic predictions of algal abundance in freshwater lakes. Ecol. Model.
- Stacked generalization. Neural Netw.
- Combining independent and unbiased classifiers using weighted average.
- Estimating Cell Numbers. In: Manual on Harmful Marine Microalgae
- Model aggregation methods and applications. Mem. Trab. difusión Cient. Tec.
- Apprentissage statistique par aggregation de modeles. Ph.D. Thesis, Université Aix-Marseille, France
- Bagging predictors. Mach. Learn.
- Stacked regression. Mach. Learn.
- Arcing classifiers. Ann. Stat.
- Random forests. Mach. Learn.
- Presence–absence versus presence-only modelling methods for predicting bird habitat suitability. Ecography
- The global distribution of surf diatom accumulations. Rev. Chil. Hist. Nat.
- Species distribution modelling and imperfect detection: comparing occupancy versus consensus methods. Divers. Distrib.
- Random forests for classification in ecology. Ecology
- Boosted trees for ecological modeling and prediction. Ecology
- Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology
- A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics