Input variable selection using machine learning and global sensitivity methods for the control of sludge bulking in a wastewater treatment plant

https://doi.org/10.1016/j.compchemeng.2021.107493

Highlights

  • Sludge bulking is a severe condition in wastewater-treatment-plant operation.

  • Machine learning helps to find relationships in the process data for bulking control.

  • Important variables are identified using global sensitivity analysis as a variable-importance measure.

  • Increased aeration intensity and limited N-sources help in the control of bulking.

Abstract

Sludge bulking is a common and undesired phenomenon in wastewater treatment plants that negatively affects biomass settling characteristics, deteriorates treatment efficiency and causes severe operational problems. First-principles models for this phenomenon are not yet available. Therefore, data-driven models have been developed to predict sludge bulking. In this paper, the bulking phenomenon is studied from the control point of view and operating variables that can be used to control sludge bulking are identified. Identification is performed by designing a data-driven model using available process data as well as clustering and various classification methods. A global sensitivity analysis is applied to select the operating variables with the highest impact on sludge bulking. Application of the proposed approach to full-scale data has shown that increasing aeration intensity and limiting nitrogen sources are the most promising control actions for bulking control.

Introduction

Nowadays, wastewater treatment before discharge into the environment is an essential step to protect nature and human health. Most often, wastewater is treated in biological wastewater treatment plants (WWTPs), where a mixture of bacteria degrades the contaminating water components in an activated sludge process. Maintaining the coexistence of different types of bacteria and the proper balance between them is amongst the most important operational challenges; if that balance is lost, operational problems and degraded treatment performance can follow.

One of the most serious operational problems in biological WWTPs is sludge bulking, which impairs solid-liquid separation. Bulking occurs when the suspended solids in the activated sludge process, i.e. biomass flocs, do not separate from the treated water by gravity settling in the settling tank. This is a common problem in modern biological nutrient removal (BNR) plants where long sludge retention times (SRTs) are used. Long SRTs favour the growth of filamentous bacteria. An appropriate balance between floc-forming and filamentous bacteria improves sludge settling, but an excess of filamentous bacteria can lead to poor settling, foaming on the reactor surface, and dewatering problems in the sludge treatment process.

Bulking can be related to the characteristics of the wastewater and/or the operating conditions of the plant. However, the various causes have not been fully explored, and there are no first-principles or generally applicable models linking sludge bulking to process conditions. Also, recommended operational adjustments to limit the occurrence of bulking are inconsistent and often contradictory, e.g., increasing or decreasing various operational parameters such as sludge age, return sludge flow, waste sludge flow, oxygen concentration, etc. Due to the knowledge gaps and lack of formal theoretical descriptions of sludge bulking, data-driven models have been developed.

Data-driven models relate sludge bulking and process conditions based on empirical relationships derived from process data. Most commonly, models are designed as bulking prediction models aiming at forecasting the occurrence of bulking in advance (Liu et al., 2020). Bulking prediction models are designed as time-series models (Liu et al., 2016a), multivariate models derived from other process data (Lou and Zhao, 2012; Bagheri et al., 2015; Deepnarain et al., 2019; Szeląg et al., 2020) or a combination of both (Liu et al., 2016b). Various methods have already been proposed, e.g. Artificial Neural Networks (ANN) (Capodaglio et al., 1991; Bagheri et al., 2015; Han et al., 2016), Principal Component Regression (PCR) (Lou and Zhao, 2012), Gaussian Process Regression (GPR) (Liu et al., 2016b), Principal Component Analysis (PCA) and decision trees (Deepnarain et al., 2019). In most cases, the focus is on the quality of model prediction, i.e. whether the model can predict the occurrence of sludge bulking with high accuracy.

The issues of sludge bulking diagnosis have also been considered (Cheng et al., 2019; Han et al., 2021). On-line diagnosis consists of two steps. First, a data-driven model is used to detect the occurrence of bulking, and in the second step, the causes are identified. Liu et al. (2020) propose two further steps, i.e. remaining useful life prediction and maintenance strategy. In the preventive maintenance stage, the operating parameters should be adjusted to compensate for the incipient fault and keep sludge bulking below the control limit.

As presented above, on-line monitoring and diagnosis of sludge bulking involve data-driven prediction and identification of the causes of sludge bulking based on the temporal dynamics of process variables. However, prior knowledge of the key process variables associated with the conditions for sludge bulking is required. In addition, once the cause variable is identified, subsequent knowledge of the most promising control actions and adjustments of process operating parameters is also required (Nittami et al., 2020). Since this knowledge is very specific to each case and difficult to obtain directly from plant operations, it is expected to be obtained through data-driven model development. The model should include both potential causes and control variables that have been discovered to be related to sludge bulking conditions. Such a model will also provide information on the regions of bulking and non-bulking conditions in the space of plant operating parameters, which is important for control purposes.

For this purpose, the design of a data-driven model is considered in this paper. Modelling is intended for knowledge discovery, i.e. finding variables that are highly related to sludge bulking conditions. Therefore, the selection of model input variables is considered as one of the most important outcomes of modelling.

Different methods can be used for input variable selection. They can be divided into model-based and model-free (filter) methods. Model-based methods are further divided into wrapper methods and embedded methods (Guyon and Elisseeff, 2003). Filters are used to select a subset of variables as a pre-processing step, regardless of the modelling approach. Examples of filters are statistical analysis methods based on the Pearson correlation coefficient, the coefficient of determination R², the F-test, or other similar criteria. Wrappers use a selected model to evaluate subsets of variables according to their predictive power. Some well-known wrapper methods in classical statistical approaches for variable selection in regression are forward selection and backward elimination (Andersen and Bro, 2010). In these cases, regressors in the selected model are systematically added or removed one by one until cross-validation results confirm the minimal set of regressors that provides the best model accuracy. A common feature of wrapper methods is that they are computationally intensive and can become intractable when the number of input variables is large. Embedded methods perform variable selection in the process of model training and are usually specific to a selected modelling approach. In this case, the task of variable selection is delegated to the model learning phase. An example of an embedded method is ANN training based on a pruning strategy where the irrelevant and/or redundant weights of a network are gradually removed.
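To make the wrapper idea concrete, the following Python sketch performs backward elimination against an arbitrary scoring function. It is only an illustration under stated assumptions, not the procedure used in this paper: the function name `backward_elimination`, the `tolerance` parameter, the toy score and the variable names (`DO`, `NH4`, `T`, `pH`) are all hypothetical. In a real application, `score` would return a cross-validated accuracy of the chosen model.

```python
def backward_elimination(variables, score, tolerance=0.0):
    """Remove variables one by one while the score does not get worse.

    `score` is any callable that maps a list of variable names to a number
    (higher is better), e.g. a cross-validated model accuracy.
    """
    selected = list(variables)
    best = score(selected)
    while len(selected) > 1:
        # Evaluate every subset obtained by dropping exactly one variable.
        candidates = [
            (score([v for v in selected if v != removed]), removed)
            for removed in selected
        ]
        candidate_score, removed = max(candidates, key=lambda item: item[0])
        # Stop as soon as the best possible removal degrades the score.
        if candidate_score < best - tolerance:
            break
        selected.remove(removed)
        best = candidate_score
    return selected, best


# Hypothetical toy score: only "DO" and "NH4" carry information,
# and every additional variable costs a small penalty.
def toy_score(subset):
    informative = {"DO", "NH4"}
    return len(informative & set(subset)) - 0.01 * len(subset)


selected, best = backward_elimination(["DO", "NH4", "T", "pH"], toy_score)
```

With this toy score the uninformative variables are dropped first and the search stops once removing a further variable would lower the score. The number of score evaluations grows quadratically with the number of candidate variables, which reflects the computational cost of wrapper methods noted above.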

When developing data-driven sludge bulking models, pre-existing knowledge is usually used to constrain the initial set of candidate variables. Variable selection is then based on statistical tests of candidate variables and/or a model-based search for the most appropriate combinations of input variables. Methods used include correlation analysis (Lou and Zhao, 2012), the Chi-squared test (Deepnarain et al., 2019), the variable importance in projection (VIP) method (Liu et al., 2016b; Chmielowski et al., 2019), PCA and forward selection (Bagheri et al., 2015), and the Fisher-Snedecor test followed by a search for different combinations of independent variables (Szeląg et al., 2020). In many of these cases, linear methods are used for pre-processing the input variables, e.g., correlation analysis and PCA, which may not discover the significant input variables in the case of nonlinear relations (Šindelář and Babuška, 2004). On the other hand, when using wrapper methods, even if the combinatorial problem of input variable selection is not extreme, the choice of input variables can be difficult when the differences in the performance of models with different sets of variables are small. This problem occurred in the development of binary classification models (Szeląg et al., 2020).

This paper proposes a model-based approach for the selection of input variables of sludge bulking models. The procedure follows the general scheme of wrapper methods (May et al., 2011) with the addition of variable ranking. Variable ranking allows the identification of the most influential model variables and is performed in our case by applying Global Sensitivity Analysis (GSA). GSA is a set of statistical techniques used to investigate the extent to which variation in model output can be attributed to variation in model inputs. Many GSA methods have been proposed in the literature, usually calculating a set of sensitivity indices for the various factors of the model. These indices can be used to estimate the impact of individual variables or groups of variables on model output. In this paper, we use the Variance-Based Sensitivity Analysis (VBSA or Sobol's method) (Sobol’, 2001), which is one of the most popular methods in many disciplines (Wei et al., 2015; Makrygiorgos et al., 2020). Its advantages are that it provides global sensitivity over the entire input space, as opposed to local sensitivity at a particular model solution, and that it can be used for nonlinear and nonadditive systems. It is applied, amongst other things, to understand the dominant controls of a system (model) (Pianosi et al., 2015), which is the subject of this work. Performing a sensitivity analysis within the operating region of the process variables allows us to evaluate the impact of potential control variables on the process performance and thus estimate their ability to control sludge bulking. The approach is useful in cases where many combinations of process variables result in a similar performance, making it difficult to reduce input variables based on model performance alone. A similar machine learning framework for identifying relationships between operational variables and effluent parameters in WWTPs was proposed by Wang et al. (2021). In their case, permutation importance (PI) was used as a measure of variable importance.
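For readers unfamiliar with variance-based sensitivity analysis, the sketch below estimates two standard quantities by Monte Carlo sampling: the first-order index S_i = V[E(Y|X_i)] / V(Y) and the total-effect index S_Ti = 1 - V[E(Y|X_~i)] / V(Y). It is a minimal illustration, not the implementation used in this paper. The function name `sobol_indices`, the toy model, the sample size and the assumption of independent inputs uniformly distributed on [0, 1] are all choices made for the example. The estimators used are common textbook forms (a first-order estimator built from two sample matrices and the Jansen form for the total effect).

```python
import random


def sobol_indices(model, n_inputs, n_samples=20000, seed=0):
    """Estimate first-order and total-effect Sobol' indices by Monte Carlo.

    Assumes independent inputs, each uniformly distributed on [0, 1].
    """
    rng = random.Random(seed)
    # Two independent sample matrices A and B (n_samples rows, n_inputs columns).
    A = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    B = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    y_a = [model(row) for row in A]
    y_b = [model(row) for row in B]

    # Output mean and variance estimated from both sample sets.
    y_all = y_a + y_b
    mean = sum(y_all) / len(y_all)
    variance = sum((y - mean) ** 2 for y in y_all) / (len(y_all) - 1)

    first_order, total_effect = [], []
    for i in range(n_inputs):
        # Rows of A with column i replaced by the corresponding column of B.
        y_ab = [model(a[:i] + [b[i]] + a[i + 1:]) for a, b in zip(A, B)]
        # First-order estimator; outputs are centred to reduce sampling noise.
        s_i = sum((yb - mean) * (yab - ya)
                  for yb, yab, ya in zip(y_b, y_ab, y_a)) / n_samples
        # Total-effect estimator (Jansen form).
        st_i = sum((ya - yab) ** 2
                   for ya, yab in zip(y_a, y_ab)) / (2 * n_samples)
        first_order.append(s_i / variance)
        total_effect.append(st_i / variance)
    return first_order, total_effect


# Toy additive model: the analytical first-order indices are 0.8 and 0.2.
def toy_model(x):
    return 2.0 * x[0] + 1.0 * x[1]


first, total = sobol_indices(toy_model, n_inputs=2)
```

For an additive model such as this toy example, the first-order and total-effect indices coincide; a gap between the two for a given input would indicate that the input acts through interactions with other inputs. Ranking inputs by these indices is one way to use GSA as a variable importance measure.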

The approach is presented for a full-scale WWTP where severe sludge bulking is encountered throughout the year. Microscopic analysis revealed that the filamentous bacterium Microthrix parvicella was present in the biological reactors of the WWTP. Its presence can, in theory, be associated with certain operating conditions. These conditions and the associated process variables were considered as potential model regressors in the data-driven model design. The model was designed using various classification methods in the Matlab classification toolbox. As a pre-processing step for classification, the model output, i.e., the measured sludge settleability, was clustered into bulking and non-bulking states.
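The idea of turning a continuous settleability measurement into two class labels can be sketched as follows. This preview does not state which clustering algorithm was used, so the one-dimensional two-cluster k-means below is an assumption chosen for simplicity, and the function name `two_means_1d` and the numerical values are invented for illustration only.

```python
def two_means_1d(values, max_iterations=100):
    """Split one-dimensional data into two clusters with k-means (k = 2).

    Returns a list of labels (0 = lower cluster, 1 = upper cluster)
    and the two cluster centres.
    """
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("at least two distinct values are required")
    for _ in range(max_iterations):
        # Assign each value to the nearer centre (ties go to the lower cluster).
        labels = [1 if abs(v - high) < abs(v - low) else 0 for v in values]
        lower = [v for v, label in zip(values, labels) if label == 0]
        upper = [v for v, label in zip(values, labels) if label == 1]
        new_low = sum(lower) / len(lower)
        new_high = sum(upper) / len(upper)
        # Converged when the centres no longer move.
        if (new_low, new_high) == (low, high):
            break
        low, high = new_low, new_high
    # Final assignment with the final centres.
    labels = [1 if abs(v - high) < abs(v - low) else 0 for v in values]
    return labels, (low, high)


# Invented settleability readings: three low and three high values.
readings = [90.0, 100.0, 110.0, 240.0, 260.0, 250.0]
labels, centres = two_means_1d(readings)
```

The resulting labels can then serve as the binary target (for example, 1 for bulking and 0 for non-bulking, if higher readings indicate poorer settling) when training a classifier on the process variables.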

The original contributions are as follows:

  • The procedure for selecting input variables based on global sensitivity analysis as a variable importance measure.

  • The application of various machine learning methods to design data-driven models of sludge bulking.

  • The demonstration of the proposed method on a full-scale WWTP case study.

This paper is organised as follows. In the next section, we first present the WWTP case study, followed by the description of the proposed procedure for input variable selection as well as the clustering and classification methods. In Section 3, we demonstrate the proposed method on a full-scale WWTP and discuss the results. The paper ends with the conclusions describing the main results and perspectives for future work.

Section snippets

WWTP case study

The case study under consideration is a WWTP for 95,000 PE (population equivalent) treating municipal and industrial wastewater. The plant was upgraded in 2016 for complete nitrogen and phosphorus removal with biological treatment and chemical precipitation, respectively. The treatment facilities consist of mechanical treatment (screens, grit and grease chamber, primary clarifier), a biological stage with suspended biomass activated sludge process (nine mixed and/or aerated reactors) and sludge

Results and discussion

For the design of the sludge bulking model, on-line and laboratory measurements from the considered full-scale WWTP were collected in the period from January 2019 to July 2020. Continuously measured on-line signals were sampled at a one-day interval as daily average values.
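The reduction of on-line signals to daily average values can be sketched as below. This is a hypothetical helper, not the authors' pre-processing code: the function name `daily_averages`, the timestamp format and the sample values are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def daily_averages(samples):
    """Average (timestamp, value) pairs per calendar day.

    `samples` is an iterable of (datetime, float) tuples; the result maps
    each date to the mean of all values recorded on that date.
    """
    totals = defaultdict(lambda: [0.0, 0])  # date -> [sum, count]
    for timestamp, value in samples:
        entry = totals[timestamp.date()]
        entry[0] += value
        entry[1] += 1
    return {day: total / count for day, (total, count) in sorted(totals.items())}


# Invented readings of one on-line signal on two consecutive days.
samples = [
    (datetime(2019, 1, 1, 6, 0), 2.0),
    (datetime(2019, 1, 1, 18, 0), 4.0),
    (datetime(2019, 1, 2, 12, 0), 5.0),
]
averages = daily_averages(samples)
```

Aggregating to one value per day puts the on-line signals on a common daily time base, so that each day can be treated as one row of model inputs.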

Conclusions

We have proposed a procedure that supports the prevention of sludge bulking in WWTPs. The procedure identifies process operating variables that can be used in the control of sludge bulking. It relies on data-driven clustering and classification, using global sensitivity analysis for input variable ranking. The proposed input variable selection method can be classified as a wrapper method with backward elimination of input variables. It allows keeping a large number of input variables

CRediT authorship contribution statement

Nadja Hvala: Conceptualization, Methodology, Software, Validation, Investigation, Formal analysis, Data curation, Writing – review & editing, Visualization. Juš Kocijan: Conceptualization, Methodology, Software, Investigation, Formal analysis, Writing – review & editing, Visualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was financially supported by the Slovenian Research Agency, program P2-0001, and Public Water Utility JP Komunala Kranj, d.o.o. The authors would like to thank the WWTP personnel, Blaž Bajželj, Lucija Janeš and Marko Margetič, for their assistance with the collection of plant data and information on plant operation. B. Bajželj is the author of Fig. 2a and Fig. 2b, L. Janeš is the author of Fig. 2c.
