Input variable selection using machine learning and global sensitivity methods for the control of sludge bulking in a wastewater treatment plant
Introduction
Nowadays, wastewater treatment before discharge into the environment is an essential step to protect nature and human health. Most often, wastewater is treated in biological wastewater treatment plants (WWTPs), where a mixture of bacteria degrades the contaminants in the water in an activated sludge process. Maintaining the coexistence of different types of bacteria and the proper balance between them is amongst the most important operational challenges; if this balance is lost, operational problems and degraded treatment performance can follow.
One of the most serious operational problems in biological WWTPs is sludge bulking, which causes problems in solid-liquid separation. Bulking occurs when the suspended solids in the activated sludge process, i.e. biomass flocs, do not separate from the treated water by gravity settling in the settling tank. This is a common problem in modern biological nutrient removal (BNR) plants, where long sludge retention times (SRTs) are used. Long SRTs favour the growth of filamentous bacteria. An appropriate balance between floc-forming and filamentous bacteria improves sludge settling, but an excess of filamentous bacteria can lead to poor settling, foaming on the reactor surface, and dewatering problems in the sludge treatment process.
Bulking can be related to the characteristics of the wastewater and/or the operating conditions of the plant. However, the various causes have not been fully explored, and there are no first-principles or generally applicable models linking sludge bulking to process conditions. Also, recommended operational adjustments to limit the occurrence of bulking are inconsistent and often contradictory, e.g., increasing or decreasing various operational parameters such as sludge age, return sludge flow, waste sludge flow, oxygen concentration, etc. Due to the knowledge gaps and lack of formal theoretical descriptions of sludge bulking, data-driven models have been developed.
Data-driven models relate sludge bulking and process conditions based on empirical relationships derived from process data. Most commonly, models are designed as bulking prediction models that aim to forecast the occurrence of bulking in advance (Liu et al., 2020). Bulking prediction models are designed as time-series models (Liu et al., 2016a), multivariate models derived from other process data (Lou and Zhao, 2012; Bagheri et al., 2015; Deepnarain et al., 2019; Szeląg et al., 2020) or a combination of both (Liu et al., 2016b). Various methods have already been proposed, e.g. Artificial Neural Networks (ANN) (Capodaglio et al., 1991; Bagheri et al., 2015; Han et al., 2016), Principal Component Regression (PCR) (Lou and Zhao, 2012), Gaussian Process Regression (GPR) (Liu et al., 2016b), Principal Component Analysis (PCA) and decision trees (Deepnarain et al., 2019). In most cases, the focus is on the quality of model prediction, i.e. whether the model can predict the occurrence of sludge bulking with high accuracy.
The issues of sludge bulking diagnosis have also been considered (Cheng et al., 2019; Han et al., 2021). On-line diagnosis consists of two steps. First, a data-driven model is used to detect the occurrence of bulking, and in the second step, the causes are identified. Liu et al. (2020) propose two further steps, i.e. remaining useful life prediction and maintenance strategy. In the preventive maintenance stage, the operating parameters should be adjusted to compensate for the incipient fault and keep sludge bulking below the control limit.
As presented above, on-line monitoring and diagnosis of sludge bulking involve data-driven prediction and identification of the causes of sludge bulking based on the temporal dynamics of process variables. However, prior knowledge of the key process variables associated with the conditions for sludge bulking is required. In addition, once the cause variable is identified, subsequent knowledge of the most promising control actions and adjustments of process operating parameters is also required (Nittami et al., 2020). Since this knowledge is very specific to each case and difficult to obtain directly from plant operations, it is expected to be obtained through data-driven model development. The model should include both potential causes and control variables that have been discovered to be related to sludge bulking conditions. Such a model will also provide information on the regions of bulking and non-bulking conditions in the space of plant operating parameters, which is important for control purposes.
For this purpose, the design of a data-driven model is considered in this paper. Modelling is intended for knowledge discovery, i.e. finding variables that are highly related to sludge bulking conditions. Therefore, the selection of model input variables is considered as one of the most important outcomes of modelling.
Different methods can be used for input variable selection. They can be divided into model-based and model-free (filter) methods. Model-based methods are further divided into wrapper methods and embedded methods (Guyon and Elisseeff, 2003). Filters are used to select a subset of variables as a pre-processing step, regardless of the modelling approach. Examples of filters are statistical analysis methods based on the Pearson correlation coefficient, the coefficient of determination R², the F-test, or other similar criteria. Wrappers use a selected model to evaluate subsets of variables according to their predictive power. Some well-known wrapper methods in classical statistical approaches for variable selection in regression are forward selection and backward elimination (Andersen and Bro, 2010). In these cases, regressors in the selected model are systematically added or removed one by one until cross-validation results confirm the minimal set of regressors that provides the best model accuracy. A common feature of wrapper methods is that they are computationally intensive and can become intractable when the number of input variables is large. Embedded methods perform variable selection in the process of model training and are usually specific to a selected modelling approach. In this case, the task of variable selection is delegated to the model learning phase. An example of an embedded method is ANN training based on a pruning strategy where the irrelevant and/or redundant weights of a network are gradually removed.
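The wrapper idea described above can be illustrated with a minimal sketch of backward elimination. Everything in it is an assumption made for illustration: the synthetic data set (one informative variable and two noise variables), the nearest-centroid classifier, the 5-fold split, and the names `cv_accuracy` and `backward_elimination`. It is not the model or the data used in this paper.

```python
import random


def cv_accuracy(X, y, cols, folds=5):
    """k-fold cross-validated accuracy of a nearest-centroid classifier
    that only sees the input columns listed in `cols`."""
    n = len(X)
    correct = 0
    for k in range(folds):
        train = [i for i in range(n) if i % folds != k]
        test = [i for i in range(n) if i % folds == k]
        # Class centroids are computed on the training fold only.
        centroids = {}
        for label in set(y[i] for i in train):
            rows = [X[i] for i in train if y[i] == label]
            centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
        for i in test:
            dist = {
                label: sum((X[i][c] - m) ** 2 for c, m in zip(cols, centre))
                for label, centre in centroids.items()
            }
            correct += min(dist, key=dist.get) == y[i]
    return correct / n


def backward_elimination(X, y, tolerance=0.0):
    """Remove one input at a time while cross-validated accuracy
    does not drop by more than `tolerance`."""
    cols = list(range(len(X[0])))
    current = cv_accuracy(X, y, cols)
    while len(cols) > 1:
        # Score every subset obtained by dropping a single variable.
        candidates = [
            (cv_accuracy(X, y, [c for c in cols if c != d]), d) for d in cols
        ]
        score, drop = max(candidates)
        if score < current - tolerance:
            break  # every removal hurts accuracy, so stop here
        cols.remove(drop)
        current = score
    return cols, current


# Synthetic illustration: column 0 separates the classes, columns 1-2 are noise.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = [
    [(2 * label - 1) + rng.uniform(-0.5, 0.5), rng.uniform(-1, 1), rng.uniform(-1, 1)]
    for label in y
]
selected, accuracy = backward_elimination(X, y)
print("selected columns:", selected, "cv accuracy:", accuracy)
```

Each pass of the loop retrains the model once per remaining variable, which is why the cost of wrapper methods grows quickly with the number of candidate inputs.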
When developing data-driven sludge bulking models, pre-existing knowledge is usually used to constrain the initial set of candidate variables. Variable selection is then based on statistical tests of candidate variables and/or a model-based search for the most appropriate combinations of input variables. Methods used include correlation analysis (Lou and Zhao, 2012), the Chi-squared test (Deepnarain et al., 2019), the variable importance in projection (VIP) method (Liu et al., 2016b; Chmielowski et al., 2019), PCA and forward selection (Bagheri et al., 2015), and the Fisher-Snedecor test followed by a search for different combinations of independent variables (Szeląg et al., 2020). In many of these cases, linear methods are used for pre-processing the input variables, e.g., correlation analysis and PCA, which may not discover the significant input variables in the case of nonlinear relations (Šindelář and Babuška, 2004). On the other hand, when using wrapper methods, even if the combinatorial problem of input variable selection is not extreme, the choice of input variables can be difficult when the differences in the performance of models with different sets of variables are small. This problem occurred in the development of binary classification models (Szeląg et al., 2020).
This paper proposes a model-based approach for the selection of input variables of sludge bulking models. The procedure follows the general scheme of wrapper methods (May et al., 2011) with the addition of variable ranking. Variable ranking allows the identification of the most influential model variables and is performed in our case by applying Global Sensitivity Analysis (GSA). GSA is a set of statistical techniques used to investigate the extent to which variation in model output can be attributed to variation in model inputs. Many GSA methods have been proposed in the literature, usually calculating a set of sensitivity indices for the various factors of the model. These indices can be used to estimate the impact of individual variables or groups of variables on model output. In this paper, we use the Variance-Based Sensitivity Analysis (VBSA or Sobol's method) (Sobol’, 2001), which is one of the most popular methods in many disciplines (Wei et al., 2015; Makrygiorgos et al., 2020). Its advantages are that it provides global sensitivity over the entire input space, as opposed to local sensitivity at a particular model solution, and that it can be used for nonlinear and nonadditive systems. It is applied, amongst other things, to understand the dominant controls of a system (model) (Pianosi et al., 2015), which is the subject of this work. Performing a sensitivity analysis within the operating region of the process variables allows us to evaluate the impact of potential control variables on the process performance and thus estimate their ability to control sludge bulking. The approach is useful in cases where many combinations of process variables result in a similar performance, making it difficult to reduce input variables based on model performance alone. A similar machine learning framework for identifying relationships between operational variables and effluent parameters in WWTPs was proposed by Wang et al. (2021). In their case, permutation importance (PI) was used as a measure of variable importance.
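To make the notion of variance-based sensitivity indices concrete, the following sketch estimates first-order and total-order indices by Monte Carlo sampling. It is an illustration under stated assumptions, not the implementation used in this work: the toy model, the uniform input ranges, the sample size, and the function name `sobol_indices` are hypothetical, and the estimators are the commonly used Saltelli (first-order) and Jansen (total-order) forms.

```python
import random


def sobol_indices(model, n_inputs, n_samples=20000, seed=0):
    """Monte Carlo estimates of first-order and total-order Sobol indices
    for a model whose inputs are independent and uniform on [0, 1]."""
    rng = random.Random(seed)
    # Two independent sample matrices A and B.
    A = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    B = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    fA = [model(a) for a in A]
    fB = [model(b) for b in B]
    both = fA + fB
    mean = sum(both) / len(both)
    var = sum((v - mean) ** 2 for v in both) / (len(both) - 1)
    first, total = [], []
    for i in range(n_inputs):
        # Matrix A with column i replaced by column i of B.
        fAB = [model(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        # First-order index: share of output variance due to input i alone.
        s_i = sum(yb * (yab - ya) for ya, yb, yab in zip(fA, fB, fAB))
        first.append(s_i / n_samples / var)
        # Total-order index: input i including all of its interactions.
        st_i = sum((ya - yab) ** 2 for ya, yab in zip(fA, fAB))
        total.append(st_i / (2 * n_samples) / var)
    return first, total


def toy_model(x):
    """Hypothetical additive model; the third input has no effect."""
    return x[0] + 2.0 * x[1]


first, total = sobol_indices(toy_model, n_inputs=3)
for i, (s, st) in enumerate(zip(first, total)):
    print(f"x{i}: first-order = {s:.2f}, total-order = {st:.2f}")
```

For this toy model the analytical first-order indices are 0.2, 0.8 and 0, so the inert third input is ranked last; in an input selection procedure such a variable would be the first candidate for removal.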
The approach is presented for a full-scale WWTP where a severe sludge bulking problem is encountered throughout the year. Microscopic analysis revealed that the filamentous bacterium Microthrix parvicella was present in the biological reactors of the WWTP. Its presence can, in theory, be associated with certain operating conditions. These conditions and the associated process variables were considered as potential model regressors in the data-driven model design. The model was designed using various classification methods in the Matlab classification toolbox. As a pre-processing step for classification, the model output, i.e., the measured sludge settleability, was clustered into bulking and non-bulking states.
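The clustering step mentioned above can be sketched as a one-dimensional two-cluster k-means. This is only an illustration: the clustering method actually used is not specified in this section, the use of the sludge volume index (SVI) as the settleability measure is an assumption, and the values below are invented.

```python
def two_means_1d(values, max_iter=100):
    """Split scalar measurements into a low (0) and a high (1) cluster
    with Lloyd's algorithm, starting from the minimum and maximum."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; nothing to cluster")
    labels = []
    for _ in range(max_iter):
        # Assign each value to the nearer of the two centres.
        labels = [int(abs(v - hi) < abs(v - lo)) for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centres no longer move
        lo, hi = new_lo, new_hi
    return labels, (lo, hi)


# Hypothetical SVI values in mL/g (illustrative only, not plant data).
svi = [95, 110, 120, 105, 180, 210, 195, 130, 220, 100]
labels, centres = two_means_1d(svi)
print("labels (1 = bulking cluster):", labels)
print("cluster centres:", centres)
```

The resulting binary labels can then serve as the class variable for the classification models.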
The original contributions are as follows:
- The procedure for selecting input variables based on global sensitivity analysis as a variable importance measure.
- The application of various machine learning methods to design data-driven models of sludge bulking.
- The demonstration of the proposed method on a full-scale WWTP case study.
This paper is organised as follows. In the next section, we first present the WWTP case study, followed by the description of the proposed procedure for input variable selection as well as the clustering and classification methods. In Section 3, we demonstrate the proposed method on a full-scale WWTP and discuss the results. The paper ends with the conclusions describing the main results and perspectives for future work.
WWTP case study
The case study under consideration is a WWTP for 95,000 PE (population equivalent) treating municipal and industrial wastewater. The plant was upgraded in 2016 for complete nitrogen and phosphorus removal with biological treatment and chemical precipitation, respectively. The treatment facilities consist of mechanical treatment (screens, grit and grease chamber, primary clarifier), a biological stage with suspended biomass activated sludge process (nine mixed and/or aerated reactors) and sludge
Results and discussion
For the design of the sludge bulking model, on-line and laboratory measurements from the considered full-scale WWTP were collected in the period from January 2019 to July 2020. Continuously measured on-line signals were sampled at a one-day interval as daily average values.
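The reduction of continuously measured signals to daily averages can be sketched as follows. The function name `daily_averages`, the signal (dissolved oxygen) and the sample values are hypothetical and serve only to illustrate the resampling step.

```python
from collections import defaultdict
from datetime import datetime


def daily_averages(samples):
    """Average (timestamp, value) samples over each calendar day."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp.date()].append(value)
    # One mean value per day, in chronological order.
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


# Hypothetical dissolved-oxygen readings in mg/L (not plant data).
samples = [
    (datetime(2019, 1, 1, 0, 0), 1.0),
    (datetime(2019, 1, 1, 12, 0), 2.0),
    (datetime(2019, 1, 1, 23, 30), 3.0),
    (datetime(2019, 1, 2, 6, 0), 2.5),
    (datetime(2019, 1, 2, 18, 0), 1.5),
]
daily = daily_averages(samples)
for day, value in daily.items():
    print(day, value)
```

Averaging to a one-day interval puts the on-line signals on the same time base as measurements that are available at most once per day.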
Conclusions
We have proposed a procedure that is important for the prevention of sludge bulking in WWTPs. The procedure identifies process operating variables that can be used in the control of sludge bulking. It relies on data-driven clustering and classification, using global sensitivity analysis for input variable ranking. The proposed input variable selection method can be classified as a wrapper method with backward elimination of input variables. It allows keeping a large number of input variables
CRediT authorship contribution statement
Nadja Hvala: Conceptualization, Methodology, Software, Validation, Investigation, Formal analysis, Data curation, Writing – review & editing, Visualization. Juš Kocijan: Conceptualization, Methodology, Software, Investigation, Formal analysis, Writing – review & editing, Visualization.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was financially supported by the Slovenian Research Agency, program P2–0001, and Public Water Utility JP Komunala Kranj, d.o.o. The authors would like to thank the WWTP personnel, Blaž Bajželj, Lucija Janeš and Marko Margetič, for their assistance with the collection of plant data and information on plant operation. B. Bajželj is the author of Fig. 2a and Fig. 2b, L. Janeš is the author of Fig. 2c.
References (38)
- et al. Hyperspectral remote sensing for assessment of chlorophyll sufficiency levels in mature oil palm (Elaeis guineensis) based on frond numbers: analysis of decision tree and random forest. Comput. Electron. Agric. (2020)
- et al. Modeling and optimization of activated sludge bulking for a real wastewater treatment plant using hybrid artificial neural networks-genetic algorithm approach. Process Saf. Environ. Prot. (2015)
- et al. Sludge bulking analysis and forecasting: application of system identification and artificial neural computing technologies. Water Res. (1991)
- et al. A novel fault identification and root-causality analysis of incipient faults with applications to wastewater treatment processes. Chemom. Intell. Lab. Syst. (2019)
- et al. Risk assessment modelling of microbiology-related solids separation problems in activated sludge systems. Environ. Model. Softw. (2008)
- et al. Decision tree for identification and prediction of filamentous bulking at full-scale activated sludge wastewater treatment plant. Process Saf. Environ. Prot. (2019)
- et al. Stable limited filamentous bulking through keeping the competition between floc-formers and filaments in balance. Bioresour. Technol. (2012)
- et al. A soft computing method to predict sludge volume index based on a recurrent self-organizing neural network. Appl. Soft Comput. (2016)
- et al. Data-knowledge-driven diagnosis method for sludge bulking of wastewater treatment process. J. Process Control (2021)
- et al. A comprehensive insight into floc characteristics and their impact on compressibility and settleability of activated sludge. Chem. Eng. J. (2003)
- Development of multi-step soft-sensors using a Gaussian process model with application for fault prognosis. Chemom. Intell. Lab. Syst.
- Integrated design of monitoring, analysis and maintenance for filamentous sludge bulking in wastewater treatment. Measurement
- Surrogate modeling for fast uncertainty quantification: application to 2D population balance models. Comput. Chem. Eng.
- A Matlab toolbox for global sensitivity analysis. Environ. Model. Softw.
- "Microthrix parvicella", a filamentous bacterium causing bulking and foaming in activated sludge systems: a review of current knowledge. FEMS Microbiol. Rev.
- Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Math. Comput. Simul.
- The effect of residual ammonia concentration under aerobic conditions on the growth of Microthrix parvicella in biological nutrient removal plants. Water Res.
- A machine learning framework to improve effluent quality control in wastewater treatment plants. Sci. Total Environ.
- Variable importance analysis: a comprehensive review. Reliab. Eng. Syst. Safety