Miss-gradient boosting regression tree: a novel approach to imputing water treatment data

Applied Intelligence

Abstract

Complete data on wastewater quality are essential for managing and monitoring wastewater treatment processes. Most management and monitoring methods rely on large volumes of training data for imputation, but the sensors used in wastewater treatment plants (WWTPs) collect only a limited amount of data. This lack of sufficient training data can diminish the accuracy of traditional imputation techniques. To address this problem, this study developed a novel approach called Miss-GBRT (imputing missing values with gradient boosting regression trees), which can impute missing values in wastewater quality data even with minimal training data. The proposed approach consists of a preprocessing stage and an imputation stage. In the preprocessing stage, multiple masked copies of the raw data are produced at various levels of missingness, after which pre-imputation is conducted to ensure the completeness of the training data. In the imputation stage, Miss-GBRT combines shallow regression trees to regress the residuals of time and imputes each missing value in a masked dataset in a stepwise manner. We carried out extensive experiments on the WWTP datasets of the University of California, Irvine and the Beijing Drainage Group to compare Miss-GBRT with baseline imputation methods. The results demonstrated that the proposed approach improves imputation accuracy for missing wastewater quality data when training data are limited. It also performs better than the other methods on datasets with considerable proportions of missing values.
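The two stages described above can be illustrated with a minimal sketch. It is not the authors' Miss-GBRT implementation: the function names (`mask_values`, `mean_fill`, `fit_gbrt`, `impute`), the use of column means for pre-imputation, the depth-1 trees, and the default settings are all assumptions made for illustration, and the time-residual component mentioned in the abstract is not modelled here.

```python
import random

def mask_values(rows, rate, seed=0):
    """Preprocessing (assumed form): copy rows, hiding about `rate` of the entries as None."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]

def mean_fill(rows):
    """Pre-imputation (assumed form): fill each None with its column's observed mean."""
    means = []
    for j in range(len(rows[0])):
        seen = [r[j] for r in rows if r[j] is not None]
        means.append(sum(seen) / len(seen) if seen else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def fit_stump(X, residuals):
    """Fit a depth-1 regression tree (one split) to the residuals by least squares."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            left = [r for x, r in zip(X, residuals) if x[j] <= t]
            right = [r for x, r in zip(X, residuals) if x[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    if best is None:  # no usable split: fall back to a constant prediction
        m = sum(residuals) / len(residuals)
        return (0, float("inf"), m, m)
    return best[1:]

def fit_gbrt(X, y, n_trees=50, lr=0.1):
    """Gradient boosting with squared loss: each stump regresses the current residuals."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        j, t, lm, rm = fit_stump(X, residuals)
        stumps.append((j, t, lm, rm))
        pred = [p + lr * (lm if x[j] <= t else rm) for x, p in zip(X, pred)]
    return base, lr, stumps

def predict(model, x):
    """Sum the shrunken stump outputs on top of the base value."""
    base, lr, stumps = model
    return base + lr * sum(lm if x[j] <= t else rm for j, t, lm, rm in stumps)

def without(row, j):
    """Return the row with column j removed (the features used to predict column j)."""
    return row[:j] + row[j + 1:]

def impute(masked, n_trees=50, lr=0.1):
    """Fill missing entries column by column; later columns see earlier imputations."""
    filled = mean_fill(masked)
    for j in range(len(masked[0])):
        miss = [i for i, r in enumerate(masked) if r[j] is None]
        obs = [i for i, r in enumerate(masked) if r[j] is not None]
        if not miss or not obs:
            continue
        model = fit_gbrt([without(filled[i], j) for i in obs],
                         [masked[i][j] for i in obs], n_trees, lr)
        for i in miss:
            filled[i][j] = predict(model, without(filled[i], j))
    return filled
```

In this sketch, calling `mask_values` on a complete table at several rates yields the masked copies, and comparing `impute(masked)` with the hidden entries gives an error estimate for each level of missingness.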

Data availability

The data that support the findings of this study are openly available in the University of California, Irvine (UCI) Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Water+Treatment+Plant.

Acknowledgments

This research was supported in part by the Beijing Natural Science Fund under Grant No. 9222001; the National Natural Science Foundation of China under Grant Nos. 72174018 and 71932002; and the Philosophy and Sociology Science Fund of the Beijing Municipal Education Commission under Grant No. SZ2021110005001.

Author information

Corresponding author

Correspondence to Wen Zhang.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest, financial or otherwise. This article does not contain any studies with human participants or animals performed by any of the authors. All data were acquired from public Internet sources, which are referenced in the sections above.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, W., Li, R., Zhao, J. et al. Miss-gradient boosting regression tree: a novel approach to imputing water treatment data. Appl Intell 53, 22917–22937 (2023). https://doi.org/10.1007/s10489-023-04828-6

  • DOI: https://doi.org/10.1007/s10489-023-04828-6
