Input variable selection using machine learning and global sensitivity methods for the control of sludge bulking in a wastewater treatment plant

doi:10.1016/j.compchemeng.2021.107493

Computers & Chemical Engineering

Volume 154, November 2021, 107493

https://doi.org/10.1016/j.compchemeng.2021.107493

Highlights

• Sludge bulking is a severe condition in wastewater-treatment-plant operation.
• Machine learning helps to find relationships in the process data for bulking control.
• Important variables are identified using global sensitivity analysis as a variable-importance measure.
• Increased aeration intensity and limited N-sources help in the control of bulking.

Abstract

Sludge bulking is a common and undesired phenomenon in wastewater treatment plants that negatively affects biomass settling characteristics, deteriorates treatment efficiency and causes severe operational problems. First-principles models for this phenomenon are not yet available. Therefore, data-driven models have been developed to predict sludge bulking. In this paper, the bulking phenomenon is studied from the control point of view and operating variables that can be used to control sludge bulking are identified. Identification is performed by designing a data-driven model using available process data as well as clustering and various classification methods. A global sensitivity analysis is applied to select the operating variables with the highest impact on sludge bulking. Application of the proposed approach to full-scale data has shown that increasing aeration intensity and limiting nitrogen sources are the most promising control actions for bulking control.

Introduction

Wastewater treatment before discharge into the environment is an essential step in protecting nature and human health. Most often, wastewater is treated in biological wastewater treatment plants (WWTPs), where a mixture of bacteria degrades the contaminants in the water in an activated sludge process. Maintaining the coexistence of different types of bacteria and the proper balance between them is amongst the most important operational challenges; failure to do so can lead to operational problems and degraded treatment performance.

One of the most serious operational problems in biological WWTPs is sludge bulking, which causes problems in solid-liquid separation. Bulking occurs when the suspended solids in the activated sludge process, i.e. biomass flocs, do not separate from the treated water by gravity settling in the settling tank. This is a common problem in modern biological nutrient removal (BNR) plants where long sludge retention times (SRTs) are used. Long SRTs favour the growth of filamentous bacteria. An appropriate balance between floc-forming and filamentous bacteria improves sludge settling, but an excess of filamentous bacteria can lead to poor settling, foaming on the reactor surface, and dewatering problems in the sludge treatment process.

Bulking can be related to the characteristics of the wastewater and/or the operating conditions of the plant. However, the various causes have not been fully explored, and there are no first-principles or generally applicable models linking sludge bulking to process conditions. Also, recommended operational adjustments to limit the occurrence of bulking are inconsistent and often contradictory, e.g., increasing or decreasing various operational parameters such as sludge age, return sludge flow, waste sludge flow, oxygen concentration, etc. Due to the knowledge gaps and lack of formal theoretical descriptions of sludge bulking, data-driven models have been developed.

Data-driven models relate sludge bulking and process conditions based on empirical relationships derived from process data. Most commonly, models are designed as bulking prediction models that aim to forecast the occurrence of bulking in advance (Liu et al., 2020). Bulking prediction models are designed as time-series models (Liu et al., 2016a), multivariate models derived from other process data (Lou and Zhao, 2012; Bagheri et al., 2015; Deepnarain et al., 2019; Szeląg et al., 2020) or a combination of both (Liu et al., 2016b). Various methods have already been proposed, e.g. Artificial Neural Networks (ANN) (Capodaglio et al., 1991; Bagheri et al., 2015; Han et al., 2016), Principal Component Regression (PCR) (Lou and Zhao, 2012), Gaussian Process Regression (GPR) (Liu et al., 2016b), Principal Component Analysis (PCA) and decision trees (Deepnarain et al., 2019). In most cases, the focus is on the quality of model prediction, i.e. whether the model can predict the occurrence of sludge bulking with high accuracy.

The issues of sludge bulking diagnosis have also been considered (Cheng et al., 2019; Han et al., 2021). On-line diagnosis consists of two steps. First, a data-driven model is used to detect the occurrence of bulking, and in the second step, the causes are identified. Liu et al. (2020) propose two further steps, i.e. remaining useful life prediction and maintenance strategy. In the preventive maintenance stage, the operating parameters should be adjusted to compensate for the incipient fault and keep sludge bulking below the control limit.

As presented above, on-line monitoring and diagnosis of sludge bulking involve data-driven prediction and identification of the causes of sludge bulking based on the temporal dynamics of process variables. However, prior knowledge of the key process variables associated with the conditions for sludge bulking is required. In addition, once the cause variable is identified, subsequent knowledge of the most promising control actions and adjustments of process operating parameters is also required (Nittami et al., 2020). Since this knowledge is very specific to each case and difficult to obtain directly from plant operations, it is expected to be obtained through data-driven model development. The model should include both potential causes and control variables that have been discovered to be related to sludge bulking conditions. Such a model will also provide information on the regions of bulking and non-bulking conditions in the space of plant operating parameters, which is important for control purposes.

For this purpose, the design of a data-driven model is considered in this paper. Modelling is intended for knowledge discovery, i.e. finding variables that are highly related to sludge bulking conditions. Therefore, the selection of model input variables is considered as one of the most important outcomes of modelling.

Different methods can be used for input variable selection. They can be divided into model-based and model-free (filter) methods. Model-based methods are further divided into wrapper methods and embedded methods (Guyon and Elisseeff, 2003). Filters are used to select a subset of variables as a pre-processing step, regardless of the modelling approach. Examples of filters are statistical analysis methods based on Pearson correlation coefficient, coefficient of determination R², F-test, or other similar criteria. Wrappers use a selected model to evaluate subsets of variables according to their predictive power. Some well-known wrapper methods in classical statistical approaches for variable selection in regression are forward selection and backward elimination (Andersen and Bro, 2010). In these cases, regressors in the selected model are systematically added or removed one by one until cross-validation results confirm the minimal set of regressors that provides the best model accuracy. A common feature of wrapper methods is that they are computationally intensive and can become intractable when the number of input variables is large. Embedded methods perform variable selection in the process of model training and are usually specific to a selected modelling approach. In this case, the task of variable selection is delegated to the model learning phase. An example of an embedded method is ANN training based on a pruning strategy where the irrelevant and/or redundant weights of a network are gradually removed.
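The backward-elimination wrapper described above can be sketched as follows. This is a minimal illustration only, not the procedure used in the paper: the nearest-centroid classifier, the leave-one-out scoring, the function names `loo_accuracy` and `backward_elimination`, and the synthetic three-variable data set are all assumptions made for the example.

```python
def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen features."""
    labels = sorted(set(y))
    correct = 0
    for i in range(len(X)):
        # Class centroids are computed without the held-out sample i.
        centroids = {}
        for label in labels:
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in features]
        point = [X[i][f] for f in features]
        predicted = min(
            labels,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += predicted == y[i]
    return correct / len(X)


def backward_elimination(X, y):
    """Drop one variable at a time while the cross-validated accuracy does not decrease."""
    current = list(range(len(X[0])))
    score = loo_accuracy(X, y, current)
    while len(current) > 1:
        candidates = [[g for g in current if g != f] for f in current]
        best = max(candidates, key=lambda c: loo_accuracy(X, y, c))
        best_score = loo_accuracy(X, y, best)
        if best_score < score:
            break  # every removal hurts the model, so stop
        current, score = best, best_score
    return current, score


# Synthetic data (hypothetical): variable 0 separates the two classes,
# variables 1 and 2 are patterns unrelated to the class label.
y = [i % 2 for i in range(20)]
X = [
    [5.0 * (i % 2) + 0.1 * (i % 3), (2 * i) % 5 - 2.0, (i // 2) % 4 - 1.5]
    for i in range(20)
]

selected, score = backward_elimination(X, y)
print("selected variables:", selected, "LOO accuracy:", score)
```

Each pass re-trains and re-validates one model per remaining variable, which is why the cost of wrapper methods grows quickly with the number of candidate inputs.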

When developing data-driven sludge bulking models, pre-existing knowledge is usually used to constrain the initial set of candidate variables. Variable selection is then based on statistical tests of candidate variables and/or a model-based search for the most appropriate combinations of input variables. Methods used include correlation analysis (Lou and Zhao, 2012), the Chi-squared test (Deepnarain et al., 2019), the variable importance in projection (VIP) method (Liu et al., 2016b; Chmielowski et al., 2019), PCA and forward selection (Bagheri et al., 2015), and the Fisher-Snedecor test followed by a search for different combinations of independent variables (Szeląg et al., 2020). In many of these cases, linear methods such as correlation analysis and PCA are used for pre-processing the input variables, and these may fail to identify significant input variables when the relations are nonlinear (Šindelář and Babuška, 2004). On the other hand, when using wrapper methods, even if the combinatorial problem of input variable selection is not extreme, the choice of input variables can be difficult when the differences in the performance of models with different sets of variables are small. This problem occurred in the development of binary classification models (Szeląg et al., 2020).
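The limitation of linear filters noted above can be illustrated with a short numerical sketch. The data are synthetic and the helper name `pearson` is an assumption for this example; the point is only that an output fully determined by an input can still show a Pearson correlation of essentially zero when the dependence is symmetric and nonlinear.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


# Symmetric grid of input values on [-2, 2] (hypothetical input variable).
x = [i / 10 for i in range(-20, 21)]

y_linear = [2.0 * v + 1.0 for v in x]  # linear dependence on x
y_quadratic = [v ** 2 for v in x]      # nonlinear, yet fully determined by x

r_linear = pearson(x, y_linear)
r_quadratic = pearson(x, y_quadratic)
print("linear:", round(r_linear, 6), "quadratic:", round(r_quadratic, 6))
```

A correlation-based filter would keep the input in the linear case and discard it in the quadratic case, even though the input is equally informative in both.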

This paper proposes a model-based approach for the selection of input variables of sludge bulking models. The procedure follows the general scheme of wrapper methods (May et al., 2011) with the addition of variable ranking. Variable ranking allows the identification of the most influential model variables and is performed in our case by applying Global Sensitivity Analysis (GSA). GSA is a set of statistical techniques used to investigate the extent to which variation in model output can be attributed to variation in model inputs. Many GSA methods have been proposed in the literature, usually calculating a set of sensitivity indices for the various factors of the model. These indices can be used to estimate the impact of individual variables or groups of variables on model output. In this paper, we use the Variance-Based Sensitivity Analysis (VBSA or Sobol's method) (Sobol’, 2001), which is one of the most popular methods in many disciplines (Wei et al., 2015; Makrygiorgos et al., 2020). Its advantages are that it provides global sensitivity over the entire input space, as opposed to local sensitivity at a particular model solution, and that it can be used for nonlinear and nonadditive systems. It is applied, amongst other things, to understand the dominant controls of a system (model) (Pianosi et al., 2015), which is the subject of this work. Performing a sensitivity analysis within the operating region of the process variables allows us to evaluate the impact of potential control variables on the process performance and thus estimate their ability to control sludge bulking. The approach is useful in cases where many combinations of process variables result in a similar performance, making it difficult to reduce input variables based on model performance alone. A similar machine learning framework for identifying relationships between operational variables and effluent parameters in WWTPs was proposed by Wang et al. (2021). In their case, permutation importance (PI) was used as a measure of variable importance.
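To make the idea of variance-based sensitivity indices concrete, the sketch below estimates first-order and total-order Sobol' indices by Monte Carlo sampling. It is a hedged illustration rather than the implementation used in the paper: the toy function `model` stands in for a trained data-driven model, the inputs are assumed independent and uniform on [-1, 1], and the estimators chosen here (a Saltelli-type first-order estimator and the Jansen total-order estimator) are one common option among several.

```python
import random


def model(x):
    # Toy stand-in for a trained model (assumption): additive in x[0],
    # pure interaction between x[1] and x[2].
    return x[0] + x[1] * x[2]


def sobol_indices(f, dim, n, seed=0):
    """Monte Carlo estimates of first-order and total-order Sobol' indices."""
    rng = random.Random(seed)
    # Two independent sample matrices A and B over the input space.
    A = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    B = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    fA = [f(a) for a in A]
    fB = [f(b) for b in B]
    outputs = fA + fB
    mean = sum(outputs) / len(outputs)
    variance = sum((v - mean) ** 2 for v in outputs) / len(outputs)

    first, total = [], []
    for i in range(dim):
        # Evaluate the model on A with column i replaced by column i of B.
        fAB = [f(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        # First-order index: share of output variance explained by input i alone.
        v_first = sum(fb * (fab - fa) for fa, fb, fab in zip(fA, fB, fAB)) / n
        # Total-order index: input i including all its interactions.
        v_total = sum((fa - fab) ** 2 for fa, fab in zip(fA, fAB)) / (2 * n)
        first.append(v_first / variance)
        total.append(v_total / variance)
    return first, total


first_order, total_order = sobol_indices(model, dim=3, n=20000)
for i, (s, st) in enumerate(zip(first_order, total_order)):
    print(f"x{i}: first-order {s:.2f}, total-order {st:.2f}")
```

For this toy function the analytical values are 0.75, 0 and 0 for the first-order indices and 0.75, 0.25 and 0.25 for the total-order indices. The second and third inputs act only through their interaction, so they would be missed by a first-order ranking but are picked up by the total-order index, which is the kind of nonadditive effect that a global method can expose.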

The approach is presented for a full-scale WWTP where severe sludge bulking is encountered throughout the year. Microscopic analysis revealed that the filamentous bacterium Microthrix parvicella was present in the biological reactors of the WWTP. Its presence can, in theory, be associated with certain operating conditions. These conditions and the associated process variables were considered as potential model regressors in the data-driven model design. The model was designed using various classification methods in the Matlab classification toolbox. As a pre-processing step for classification, the model output, i.e., measured sludge settleability, was clustered into bulking and non-bulking states.
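The pre-processing step of splitting a measured settleability signal into two states can be sketched as a one-dimensional two-cluster problem. The excerpt does not state which clustering algorithm was used, so the two-means procedure below, the function name `two_means_1d`, and the settleability values are all assumptions for illustration.

```python
def two_means_1d(values, iterations=100):
    """Split 1-D data into a low and a high cluster with a two-means iteration.

    Assumes the data contain at least two distinct values.
    """
    low, high = min(values), max(values)  # initial cluster centres
    for _ in range(iterations):
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        new_low = sum(low_group) / len(low_group)
        new_high = sum(high_group) / len(high_group)
        if (new_low, new_high) == (low, high):
            break  # centres stopped moving
        low, high = new_low, new_high
    # Label 1 marks the high-value cluster, label 0 the low-value cluster.
    labels = [int(abs(v - low) > abs(v - high)) for v in values]
    return labels, (low, high)


# Hypothetical settleability measurements (higher value = poorer settling).
settleability = [95, 110, 120, 105, 210, 250, 230, 115, 240, 100]

labels, centres = two_means_1d(settleability)
print("labels:", labels)
print("cluster centres:", centres)
```

The resulting 0/1 labels can then serve as the class variable for a classifier trained on the candidate process variables.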

The original contributions are as follows:

- The procedure for selecting input variables based on global sensitivity analysis as a variable importance measure.
- The application of various machine learning methods to design data-driven models of sludge bulking.
- The demonstration of the proposed method on a full-scale WWTP case study.

This paper is organised as follows. In the next section, we first present the WWTP case study, followed by the description of the proposed procedure for input variable selection as well as the clustering and classification methods. In Section 3, we demonstrate the proposed method on a full-scale WWTP and discuss the results. The paper ends with the conclusions describing the main results and perspectives for future work.

Section snippets

WWTP case study

The case study under consideration is a WWTP for 95,000 PE (population equivalent) treating municipal and industrial wastewater. The plant was upgraded in 2016 for complete nitrogen and phosphorus removal with biological treatment and chemical precipitation, respectively. The treatment facilities consist of mechanical treatment (screens, grit and grease chamber, primary clarifier), a biological stage with suspended biomass activated sludge process (nine mixed and/or aerated reactors) and sludge

Results and discussion

For the design of the sludge bulking model, on-line and laboratory measurements from the considered full-scale WWTP were collected in the period from January 2019 to July 2020. Continuously measured on-line signals were sampled at a one-day interval as daily average values.
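Reducing continuously measured signals to daily average values, as described above, can be sketched as follows. The function name `daily_averages`, the timestamp format, and the sample readings are assumptions for illustration and do not come from the plant data.

```python
from collections import defaultdict
from datetime import datetime


def daily_averages(samples):
    """Average (ISO timestamp, value) pairs per calendar day.

    Returns a dict mapping 'YYYY-MM-DD' to the mean of that day's values.
    """
    buckets = defaultdict(list)
    for stamp, value in samples:
        day = datetime.fromisoformat(stamp).date().isoformat()
        buckets[day].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


# Hypothetical on-line readings of one signal.
readings = [
    ("2019-01-01T06:00:00", 1.5),
    ("2019-01-01T18:00:00", 2.5),
    ("2019-01-02T06:00:00", 3.0),
]

daily = daily_averages(readings)
print(daily)
```

Daily averaging puts the on-line signals on the same time base as measurements that are available at most once per day.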

Conclusions

We have proposed a procedure that is important for the prevention of sludge bulking in WWTPs. The procedure identifies process operating variables that can be used in the control of sludge bulking. It relies on data-driven clustering and classification using global sensitivity analysis for input variables ranking. The proposed input variable selection method can be classified as a wrapper method with backward elimination of input variables. It allows keeping a large number of input variables

CRediT authorship contribution statement

Nadja Hvala: Conceptualization, Methodology, Software, Validation, Investigation, Formal analysis, Data curation, Writing – review & editing, Visualization. Juš Kocijan: Conceptualization, Methodology, Software, Investigation, Formal analysis, Writing – review & editing, Visualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was financially supported by the Slovenian Research Agency, program P2-0001, and Public Water Utility JP Komunala Kranj, d.o.o. The authors would like to thank the WWTP personnel, Blaž Bajželj, Lucija Janeš and Marko Margetič, for their assistance with the collection of plant data and information on plant operation. B. Bajželj is the author of Fig. 2a and Fig. 2b, L. Janeš is the author of Fig. 2c.

References (38)

A.D. Amirruddin et al.
Hyperspectral remote sensing for assessment of chlorophyll sufficiency levels in mature oil palm (Elaeis guineensis) based on frond numbers: analysis of decision tree and random forest
Comput. Electron. Agric.
(2020)
M. Bagheri et al.
Modeling and optimization of activated sludge bulking for a real wastewater treatment plant using hybrid artificial neural networks-genetic algorithm approach
Process Saf. Environ. Prot.
(2015)
A.G. Capodaglio et al.
Sludge bulking analysis and forecasting: application of system identification and artificial neural computing technologies
Water Res.
(1991)
H. Cheng et al.
A novel fault identification and root-causality analysis of incipient faults with applications to wastewater treatment processes
Chemom. Intell. Lab. Syst.
(2019)
J. Comas et al.
Risk assessment modelling of microbiology-related solids separation problems in activated sludge systems
Environ. Model. Softw.
(2008)
N. Deepnarain et al.
Decision tree for identification and prediction of filamentous bulking at full-scale activated sludge wastewater treatment plant
Process Saf. Environ. Prot.
(2019)
J. Guo et al.
Stable limited filamentous bulking through keeping the competition between floc-formers and filaments in balance
Bioresour. Technol.
(2012)
H.G. Han et al.
A soft computing method to predict sludge volume index based on a recurrent self-organizing neural network
Appl. Soft Comput.
(2016)
H.G. Han et al.
Data-knowledge-driven diagnosis method for sludge bulking of wastewater treatment process
J. Process Control
(2021)
B. Jin et al.
A comprehensive insight into floc characteristics and their impact on compressibility and settleability of activated sludge
Chem. Eng. J.
(2003)

Y. Liu et al.
Development of multi-step soft-sensors using a Gaussian process model with application for fault prognosis
Chemom. Intell. Lab. Syst.
(2016)
Y. Liu et al.
Integrated design of monitoring, analysis and maintenance for filamentous sludge bulking in wastewater treatment
Measurement
(2020)
G. Makrygiorgos et al.
Surrogate modeling for fast uncertainty quantification: application to 2D population balance models
Comput. Chem. Eng.
(2020)
F. Pianosi et al.
A Matlab toolbox for global sensitivity analysis
Environ. Model. Softw.
(2015)
S. Rossetti et al.
“Microthrix parvicella”, a filamentous bacterium causing bulking and foaming in activated sludge systems: a review of current knowledge
FEMS Microbiol. Rev.
(2005)
I.M. Sobol’
Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates
Math. Comput. Simul.
(2001)
M.W. Tsai et al.
The effect of residual ammonia concentration under aerobic conditions on the growth of Microthrix parvicella in biological nutrient removal plants
Water Res.
(2003)
D. Wang et al.
A machine learning framework to improve effluent quality control in wastewater treatment plants
Sci. Total Environ.
(2021)
P. Wei et al.
Variable importance analysis: a comprehensive review
Reliab. Eng. Syst. Safety
(2015)
