Abstract
Complete wastewater quality data are essential for managing and monitoring wastewater treatment processes. Most imputation methods used for this purpose require large volumes of training data, but the sensors in wastewater treatment plants (WWTPs) collect only a limited amount of data, and this shortage can diminish the accuracy of traditional imputation techniques. To address this problem, this study developed a novel approach called Miss-GBRT (imputing missing values with gradient boosting regression trees), which can impute missing values in wastewater quality data even with minimal training data. The proposed approach consists of a preprocessing stage and an imputation stage. In the preprocessing stage, multiple masked copies of the raw data are produced at various levels of missingness, after which pre-imputation is conducted to ensure the completeness of the training data. In the imputation stage, Miss-GBRT combines shallow regression trees to regress the residuals of time and imputes each missing value in a masked dataset in a stepwise manner. We carried out extensive experiments on the WWTP datasets of the University of California, Irvine and the Beijing Drainage Group to compare Miss-GBRT with baseline imputation methods. The results demonstrated that the proposed approach improves the accuracy with which missing wastewater quality data are imputed under limited training data. It also outperforms the other methods on datasets with considerable proportions of missing values.
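The two stages described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names (`mask_dataset`, `pre_impute`, `fit_gbrt`, `impute`), the column-mean pre-imputation, the depth-1 trees, the learning rate, and the tree count are all assumptions made for illustration, since the abstract does not specify them. The sketch masks cells at a chosen rate, pre-imputes so that every row is usable, then fits boosted shallow trees on residuals and fills one column after another.

```python
import random

def mask_dataset(rows, rate, seed=0):
    """Return a copy of rows with roughly a fraction `rate` of cells set to None."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]

def pre_impute(rows):
    """Assumed pre-imputation: replace each None with its column mean."""
    means = []
    for j in range(len(rows[0])):
        seen = [r[j] for r in rows if r[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def fit_stump(X, residuals):
    """Fit a depth-1 regression tree (stump) to the residuals by least squares."""
    best = None
    for j in range(len(X[0])):
        # Dropping the largest value keeps both sides of every split non-empty.
        for t in sorted(set(x[j] for x in X))[:-1]:
            left = [r for x, r in zip(X, residuals) if x[j] <= t]
            right = [r for x, r in zip(X, residuals) if x[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:  # no usable split: fall back to a constant prediction
        m = sum(residuals) / len(residuals)
        return (None, 0.0, m, m)
    return best[1:]

def predict_stump(stump, x):
    j, t, lm, rm = stump
    return lm if j is None or x[j] <= t else rm

def fit_gbrt(X, y, n_trees=50, lr=0.1):
    """Boosting for squared error: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + lr * predict_stump(stump, x) for p, x in zip(pred, X)]
    return base, lr, stumps

def predict_gbrt(model, x):
    base, lr, stumps = model
    return base + lr * sum(predict_stump(s, x) for s in stumps)

def impute(masked, n_trees=50, lr=0.1):
    """Fill missing cells column by column; later columns see earlier fills."""
    filled = pre_impute(masked)
    for j in range(len(masked[0])):
        missing = [i for i, r in enumerate(masked) if r[j] is None]
        observed = [i for i, r in enumerate(masked) if r[j] is not None]
        if not missing or not observed:
            continue
        # Features are all other columns of the pre-imputed (or already filled) data.
        X = [filled[i][:j] + filled[i][j + 1:] for i in observed]
        y = [masked[i][j] for i in observed]
        model = fit_gbrt(X, y, n_trees, lr)
        for i in missing:
            filled[i][j] = predict_gbrt(model, filled[i][:j] + filled[i][j + 1:])
    return filled
```

To evaluate such a sketch in the spirit of the masking procedure, one could call `mask_dataset` on complete rows at several rates, pass each masked copy to `impute`, and compare the filled cells with the held-out originals. Whether the published method orders columns, tunes tree depth, or pre-imputes in the same way is not stated in the abstract.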
Data availability
The data that support the findings of this study are openly available in the University of California, Irvine (UCI) Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Water+Treatment+Plant.
Acknowledgments
This research was supported in part by the Beijing Natural Science Fund under Grant No. 9222001, the National Natural Science Foundation of China under Grant Nos. 72174018 and 71932002, and the Philosophy and Sociology Science Fund from the Beijing Municipal Education Commission (SZ2021110005001).
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest, financial or otherwise. This article does not contain any studies with human participants or animals performed by any of the authors. All data were obtained from publicly available Internet sources, for which appropriate references are given in the sections above.
About this article
Cite this article
Zhang, W., Li, R., Zhao, J. et al. Miss-gradient boosting regression tree: a novel approach to imputing water treatment data. Appl Intell 53, 22917–22937 (2023). https://doi.org/10.1007/s10489-023-04828-6
DOI: https://doi.org/10.1007/s10489-023-04828-6