Abstract
Accurate streamflow estimation and identification of the most significant parameters are crucial for effective water resource management. In this research, the Soil and Water Assessment Tool (SWAT) was used to simulate streamflow in the Ponnaiyar River Basin, achieving satisfactory accuracy during calibration, with a Nash-Sutcliffe efficiency (NSE) and R² of 0.67, a Kling-Gupta efficiency (KGE) of 0.73, and a root mean square error (RMSE) of 9.257. Correlated parameters were identified by applying Pearson correlation analysis to the calibrated SWAT-generated parameters. Streamflow was then predicted with Principal Component Analysis-Multiple Linear Regression (PCA-MLR) using these correlated parameters, yielding NSE and R² = 0.67, KGE = 0.69, and RMSE = 9.577 during training, and NSE and R² = 0.47, KGE = 0.49, and RMSE = 13.624 during testing. Because PCA-MLR lost accuracy during testing, this study proposes a combined SWAT-eXtreme Gradient Boosting (SWAT-XGBoost) model, which outperformed the leading-edge SWAT-Categorical Boosting (SWAT-CatBoost) and SWAT-Light Gradient Boosting Machine (SWAT-LightGBM) models when given the same correlated parameters. The SWAT-XGBoost model achieved NSE and R² = 0.83, KGE = 0.85, and RMSE = 2.226 during training, and NSE and R² = 0.67, KGE = 0.69, and RMSE = 9.805 during testing. The parameters most influential for accurate streamflow prediction were then determined using XGBoost’s built-in feature importance. An XGBoost model built with only these influential parameters, selected from among the correlated ones, maintained the same training accuracy and improved testing accuracy to NSE and R² = 0.71, KGE = 0.72, and RMSE = 8.516. Additionally, SHapley Additive exPlanations (SHAP) impact analysis was conducted on the SWAT-XGBoost model to explain the interactions between these influential parameters.
Based on the results of the SHAP impact analysis, XGBoost models were constructed with positive-impact features, with negative-impact features, and with a combination of both. The XGBoost model built with the combined positive- and negative-impact features exhibited superior accuracy during both training and testing compared to SWAT-XGBoost, which focused primarily on the most influential parameters. This study offers guidance for researchers and policymakers who work with limited data on using integrated model development techniques to enhance streamflow prediction.
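The goodness-of-fit measures reported above (NSE, KGE, RMSE, and R²) can be computed from paired observed and simulated streamflow series. The following is a minimal sketch, not the authors' implementation: it assumes the original 2009 Kling-Gupta formulation of KGE and takes R² as the squared Pearson correlation, neither of which is stated in the abstract. The function names and the sample values are hypothetical.

```python
import math

# Illustrative sketch only. Assumptions (not stated in the article):
#   - KGE follows the original 2009 Kling-Gupta formulation
#   - R2 is the squared Pearson correlation coefficient
#   - population (divide-by-n) statistics are used throughout


def _mean(xs):
    return sum(xs) / len(xs)


def _std(xs):
    m = _mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson_r(obs, sim):
    """Pearson correlation between observed and simulated series."""
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / len(obs)
    return cov / (_std(obs) * _std(sim))


def rmse(obs, sim):
    """Root mean square error."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus error variance over observed variance."""
    mo = _mean(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mo) ** 2 for o in obs)
    return 1.0 - num / den


def kge(obs, sim):
    """Kling-Gupta efficiency from correlation, variability ratio, and bias ratio."""
    r = pearson_r(obs, sim)
    alpha = _std(sim) / _std(obs)    # variability ratio
    beta = _mean(sim) / _mean(obs)   # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def r2(obs, sim):
    """Coefficient of determination, taken here as squared Pearson r."""
    return pearson_r(obs, sim) ** 2


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]    # hypothetical flows
    simulated = [2.0, 3.0, 4.0, 5.0]   # same shape, constant bias of +1
    print(nse(observed, simulated), kge(observed, simulated),
          rmse(observed, simulated), r2(observed, simulated))
```

In the hypothetical series, the simulation tracks the observations perfectly in shape but carries a constant bias, so R² stays at 1 while NSE, KGE, and RMSE all register the error. This is one reason the study reports several measures side by side.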
Data availability
Data and information are available from the authors on reasonable request.
Funding
The authors did not receive specific funds, grants or other support for the submitted work.
Author information
Contributions
R. Y. P. was responsible for the methodology, data collection, data analysis, figure preparation, and initial draft preparation. R. M. provided supervision and formal investigation and edited the manuscript. Both authors reviewed and provided feedback on earlier versions of the manuscript, and both have read and approved the final version. The authors take full responsibility for the integrity and accuracy of the entire work and are committed to resolving any issues that may arise. Both authors mutually agree to the submission of this manuscript, and the names and order of the authors are correctly presented.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Communicated by: H. Babaie
About this article
Cite this article
R, Y.P., R, M. Enhanced streamflow prediction using SWAT’s influential parameters: a comparative analysis of PCA-MLR and XGBoost models. Earth Sci Inform 16, 4053–4076 (2023). https://doi.org/10.1007/s12145-023-01139-9
DOI: https://doi.org/10.1007/s12145-023-01139-9