Development of PCA-based cluster quantile regression (PCA-CQR) framework for streamflow prediction: Application to the Xiangxi river watershed, China
Introduction
Accurate streamflow prediction is crucial for many water applications such as flood control, hydropower generation, irrigation, and rural and urban water supply [8], [9], [11], [23], [34], [12], [13]. Consequently, considerable effort has been devoted to developing effective forecasting techniques for improving hydrologic prediction [10], [49]. These forecasting approaches can be classified into two categories: process-driven and data-driven approaches. The process-driven approaches, mainly involving various forms of rainfall-runoff models (e.g. lumped and distributed hydrologic models), are developed on the basis of the inherent mass and energy conservation laws in the water cycle system. However, the application of process-driven models requires a large quantity of hydro-meteorological data and robust optimization techniques for calibration. In comparison, the data-driven techniques can capture the mapping between input (e.g. rainfall, evaporation, temperature) and output (i.e. streamflow) variables without considering the physical laws that underlie the rainfall-runoff process [47]. Such approaches mainly include statistical regression [38], [2], [1], [37], [3], [22], artificial intelligence [5], [43], [45] and machine learning methods [30], [38], [41], [4].
Stepwise cluster analysis (SCA) is a nonparametric statistical method based on multivariate analysis of variance. It was first proposed by Huang [20] and has been used for a number of environmental prediction problems [21], [44], [46], [9], [14], [25]. Specifically, for streamflow prediction, Li et al. [25] developed a stepwise-clustered hydrological inference model based on stepwise cluster analysis; Fan et al. [10] conducted monthly streamflow prediction for the Xiangxi River based on climate teleconnections. In SCA, the relationships between explanatory variables and responses are reflected through non-functional cluster trees. These cluster trees are obtained through a series of cutting and merging operations, which divide the original response samples into new, mutually unrelated sets according to given criteria. Moreover, SCA can establish the projections between multiple explanatory variables and multiple responses through the non-functional cluster tree, which means that one set of inputs may correspond to multiple responses. Such projections can further reveal the inherent uncertainty in hydrologic prediction problems. The SCA-based approaches are effective for streamflow prediction because of their capability to reflect nonlinearity and uncertainty between explanatory and response variables.
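The cutting step described above can be sketched for the simplest case of one predictor and one response. The excerpt does not spell out the cutting criterion, so this sketch assumes the Wilks' Lambda statistic commonly used with SCA, which for a single response reduces to the ratio of within-group to total sum of squares, with F = (1 - Λ)/Λ · (n - 2) on (1, n - 2) degrees of freedom. The function names and the sample values are hypothetical.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_cut(x, y, min_size=2):
    """Find the binary cut on predictor x that minimizes Wilks' Lambda for y.

    Returns (threshold, wilks_lambda, f_stat). Assumes y is not constant and
    that the best cut leaves some within-group variation (Lambda > 0).
    """
    pairs = sorted(zip(x, y))          # order samples by the predictor
    ys = [p[1] for p in pairs]
    n = len(ys)
    total = sum_sq(ys)
    best = None
    for k in range(min_size, n - min_size + 1):
        if pairs[k - 1][0] == pairs[k][0]:
            continue                   # cannot cut between equal predictor values
        within = sum_sq(ys[:k]) + sum_sq(ys[k:])
        lam = within / total           # Wilks' Lambda for a single response
        if best is None or lam < best[1]:
            best = ((pairs[k - 1][0] + pairs[k][0]) / 2, lam)
    if best is None:
        raise ValueError("no admissible cut")
    threshold, lam = best
    f_stat = (1 - lam) / lam * (n - 2)  # compare with F(1, n - 2) critical value
    return threshold, lam, f_stat


# Hypothetical monthly rainfall (mm) and streamflow (m^3/s) samples
rainfall = [10, 20, 30, 40, 110, 120, 130, 140]
flow = [5, 6, 5, 7, 20, 22, 21, 23]
threshold, lam, f_stat = best_cut(rainfall, flow)
print(threshold, round(lam, 4), round(f_stat, 1))
```

In a full implementation the cut would be accepted only if `f_stat` exceeds the critical F value at the chosen significance level, and the resulting subsets would then be tested pairwise for merging. The `min_size` argument here plays the role of the minimum sample size in a tip cluster.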
However, several issues still require further exploration: (i) SCA-based methods are able to reflect nonlinearity between explanatory and response variables, but such nonlinearity is seldom identified before SCA is applied; (ii) previous methods used the original explanatory variables directly as predictors, so multicollinearity among the explanatory variables may degrade the performance of SCA; (iii) the mean, minimum and maximum values in each tip cluster are usually used in the prediction process of SCA, so there is a need to incorporate other regression approaches into SCA to provide probabilistic predictions; (iv) two parameters must be predefined when implementing SCA, namely the significance level for the cutting and merging process and the minimum sample size in a tip cluster, but few studies have characterized the impacts of these two parameters on model performance.
Therefore, this study improves upon previous SCA approaches by integrating principal component analysis (PCA), stepwise cluster analysis (SCA) and quantile regression (QR) techniques, leading to a PCA-based cluster quantile regression (PCA-CQR) approach. Compared with previous SCA approaches, the proposed PCA-CQR approach can eliminate the multicollinearity among explanatory variables and provide probabilistic predictions for responses. Moreover, the nonlinearity between explanatory variables and responses is characterized through the maximal information coefficient (MIC) before PCA-CQR is applied. To demonstrate its applicability, the PCA-CQR approach is applied to monthly streamflow forecasting for the Xiangxi River in the Three Gorges Reservoir area, China. A sensitivity analysis is also conducted to characterize the impacts of the control parameters on model performance.
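How PCA removes multicollinearity can be illustrated with a minimal sketch for two standardized predictors, where the principal axes of the 2 × 2 correlation matrix are known in closed form: (1, 1)/√2 and (1, -1)/√2, with eigenvalues 1 + r and 1 - r. The temperature and evaporation values below are hypothetical and are not the Xingshan records; the helper names are likewise illustrative.

```python
import math
import statistics


def standardize(values):
    """Center to zero mean and scale to unit sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]


def pearson(a, b):
    """Sample Pearson correlation coefficient."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return num / den


def pca_two_predictors(x1, x2):
    """Principal component scores of two standardized predictors."""
    z1, z2 = standardize(x1), standardize(x2)
    pc1 = [(a + b) / math.sqrt(2) for a, b in zip(z1, z2)]  # variance 1 + r
    pc2 = [(a - b) / math.sqrt(2) for a, b in zip(z1, z2)]  # variance 1 - r
    return pc1, pc2


# Hypothetical monthly temperature (deg C) and evaporation (mm)
temperature = [4.1, 6.0, 10.2, 16.5, 20.8, 24.9, 27.3, 26.8, 22.0, 16.4, 10.9, 5.7]
evaporation = [28.0, 35.5, 60.1, 88.4, 110.2, 121.7, 140.3, 133.9, 98.6, 70.2, 45.8, 30.4]

r = pearson(temperature, evaporation)
pc1, pc2 = pca_two_predictors(temperature, evaporation)
print("correlation of raw predictors:", round(r, 3))
print("correlation of PC scores:", round(pearson(pc1, pc2), 12))
```

The raw predictors are strongly correlated, whereas the two score series are uncorrelated up to floating-point error. With more than two predictors the axes would be obtained from an eigendecomposition of the correlation matrix rather than from this closed form.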
Section snippets
Methodology
Fig. 1 presents the framework of PCA-CQR, which involves five main procedures: (i) characterizing the relationships between explanatory and response variables through MIC, (ii) converting the original explanatory variables into uncorrelated variables through PCA, (iii) constructing the cluster tree through SCA, (iv) establishing the probabilistic prediction function in each tip cluster by quantile regression, and (v) performing sensitivity analysis. Among these five procedures, the core part is
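Step (iv) can be sketched in its simplest form. Quantile regression minimizes the pinball (check) loss; the sketch below assumes an intercept-only model inside a single tip cluster, so each quantile prediction is a constant. This is a simplification for illustration and not necessarily the regression form used in the paper. The flow values and function names are hypothetical.

```python
def pinball_loss(y, pred, tau):
    """Average pinball (check) loss of a constant prediction at level tau."""
    total = 0.0
    for v in y:
        diff = v - pred
        # under-prediction is weighted by tau, over-prediction by (1 - tau)
        total += tau * diff if diff >= 0 else (tau - 1) * diff
    return total / len(y)


def fit_constant_quantile(y, tau):
    """Intercept-only quantile regression.

    The pinball loss is convex and piecewise linear in the constant, so a
    minimizer always exists among the sample values themselves.
    """
    return min(sorted(y), key=lambda c: pinball_loss(y, c, tau))


# Hypothetical streamflow samples (m^3/s) falling in one tip cluster
tip_cluster_flows = [12.0, 15.5, 9.8, 30.2, 18.4, 22.1, 11.3, 16.7, 25.0, 14.2]

band = {tau: fit_constant_quantile(tip_cluster_flows, tau) for tau in (0.05, 0.5, 0.95)}
print(band)
```

Fitting several quantile levels per tip cluster yields a prediction interval rather than only the mean, minimum and maximum of the cluster.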
Site description and data collection
The proposed PCA-CQR approach is applied to monthly streamflow prediction to demonstrate its applicability. The Xiangxi River basin is located between 30.96–31.67°N and 110.47–111.13°E in the Hubei part of the China Three Gorges Reservoir (TGR) region, draining an area of about 3200 km², as shown in Fig. 3. Originating in the Shennongjia Nature Reserve, the main stream of the Xiangxi River has a length of 94 km and a catchment area of 3099 km². It is one of the main tributaries of the Yangtze River [18]. The
Deterministic prediction
In this study, the monthly streamflow records from the Xingshan gauging station and the meteorological data from the Xingshan meteorological station from 1991 to 2010 are used to demonstrate the applicability of the developed PCA-CQR model. The first 80% of the samples (from February 1991 to January 2007) are employed for calibration and the remaining 20% (from February 2007 to December 2010) are used for verification. In all SCA-based approaches, including the PCA-CQR, the
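The calibration and verification periods stated above can be checked with a short chronological split. The sketch only enumerates calendar months and assumes one sample per month with no gaps; the helper name is hypothetical.

```python
from datetime import date


def month_range(start, end):
    """List the first day of every month from start to end, inclusive."""
    months = []
    y, m = start.year, start.month
    while (y, m) <= (end.year, end.month):
        months.append(date(y, m, 1))
        m += 1
        if m > 12:
            y, m = y + 1, 1
    return months


months = month_range(date(1991, 2, 1), date(2010, 12, 1))
split = date(2007, 1, 1)  # last calibration month

# Chronological split: no shuffling, so verification data follow calibration data
calibration = [d for d in months if d <= split]
verification = [d for d in months if d > split]

share = len(calibration) / len(months)
print(len(calibration), len(verification), round(share, 3))
```

Under these assumptions the periods give 192 calibration months and 47 verification months, which is consistent with the approximate 80%/20% split.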
Discussion
The PCA-CQR approach can capture the relationship between meteorological data and streamflow through a cluster tree, which can deal with discrete and nonlinear relationships between explanatory and response variables. To further demonstrate the capability of the proposed PCA-CQR approach, a comparison with other data-driven methods is conducted. In this study, the performance of PCA-CQR is compared with multiple linear regression (MLR) and generalized regression neural
Conclusions
In this paper, a PCA-based cluster quantile regression (PCA-CQR) approach was proposed by integrating principal component analysis (PCA), stepwise cluster analysis (SCA) and quantile regression (QR) into a single framework. The proposed PCA-CQR method can effectively capture discrete and nonlinear relationships between explanatory and response variables. In PCA-CQR, PCA is used to reduce the dimensionality of the explanatory variables and to deal with the multicollinearity among them;
Acknowledgments
This research was supported by the Natural Science Foundation of China (51520105013 and 51679087), the National Key Research and Development Plan (2016YFC0502800), the Xiamen University of Technology Foreign Science and Technology Cooperation and Communication Foundation (E201400200), the Xiamen University of Technology High-level Personnel Foundation (YKJ14038), and the Fujian Class A Foundation (JA14242).
References (54)
- et al. A coupled ensemble filtering and probabilistic collocation method for uncertainty quantification of hydrologic models. J. Hydrol. (2015)
- et al. Hydrologic risk analysis in the Yangtze River basin through coupling Gaussian mixtures into copulas. Adv. Water Resour. (2016)
- A stepwise cluster analysis method for predicting air quality in an urban environment. Atmos. Environ. Part B (1992)
- et al. River flow forecasting through conceptual models: part 1. A discussion of principles. J. Hydrol. (1970)
- et al. Statistical downscaling of precipitation using quantile regression. J. Hydrol. (2013)
- et al. Development of a clusterwise-linear-regression-based forecasting system for characterizing DNAPL dissolution behaviors in porous media. Sci. Total Environ. (2012)
- et al. A comparison of performance of several artificial intelligence methods for forecasting monthly discharge time series. J. Hydrol. (2009)
- et al. A stepwise cluster analysis approach for downscaled climate projection - A Canadian case study. Environ. Modell. Software (2013)
- et al. Hydrological modeling of River Xiangxi using SWAT2005: A comparison of model parameterizations using station and gridded meteorological observations. Quat. Int. (2010)
- et al. Simultaneous estimation of relative permeability and capillary pressure for tight formations using ensemble-based history matching method. Comput. Fluids (2013)
- Comparison of multiple linear and nonlinear regression, autoregressive integrated moving average, artificial neural network, and wavelet artificial neural network methods for urban water demand forecasting in Montreal, Canada. Water Resour. Res.
- Comparison of multivariate regression and artificial neural networks for peak urban water demand forecasting: the evaluation of different ANN learning algorithms. J. Hydrol. Eng. (ASCE)
- Uncertainty assessment in environmental risk through Bayesian networks. J. Environ. Inf.
- Modelling sediment trapping by non-submerged grass buffer strips using nonparametric supervised learning technique. J. Environ. Inf.
- Generalized regression neural network in monthly flow forecasting. Civil Eng. Environ. Syst.
- Multivariate Data Analysis
- A scoring system for probability forecasts of ranked categories. J. Appl. Meteorol.
- Planning water resources allocation under multiple uncertainties: a generalized fuzzy two-stage stochastic programming method. IEEE Trans. Fuzzy Syst.
- A PCM-based stochastic hydrological model for uncertainty quantification in watershed systems. Stoch. Environ. Res. Risk Assess.
- A stepwise-cluster forecasting approach for monthly streamflows based on climate teleconnections. Stoch. Environ. Res. Risk Assess.
- Bivariate hydrologic risk analysis based on a coupled entropy-copula method for the Xiangxi River in the Three Gorges Reservoir area, China. Theor. Appl. Climatol.
- Probabilistic prediction for monthly streamflow through coupling stepwise cluster analysis and quantile regression methods. Water Resour. Manage.
- Statistical downscaling of extreme precipitation events using censored quantile regression. Mon. Weather Rev.
- Principal component analysis of the watershed hydrochemical response to forest clearance and its usefulness for chloride mass balance applications. Water Resour. Res.
- Status of automatic calibration for hydrologic models. J. Hydrol. Eng.
- Bayesian uncertainty analysis in hydrological modeling associated with watershed subdivision level: a case study of SLURP model applied to the Xiangxi River watershed, China. Stoch. Environ. Res. Risk Assess.
- Annual fractions of high-speed streams from principal component analysis of local geomagnetic activity? J. Geophys. Res.-Space Phys.
Cited by (28)
A combined optimization prediction model for earth-rock dam seepage pressure using multi-machine learning fusion with decomposition data-driven
2024, Expert Systems with Applications
Bagged stepwise cluster analysis for probabilistic river flow prediction
2023, Journal of Hydrology
Predicting river dissolved oxygen time series based on stand-alone models and hybrid wavelet-based models
2021, Journal of Environmental Management
Citation Excerpt: Therefore, the previous DO was used as the antecedent variable. Similarly, previous studies examining DO prediction and streamflow prediction on hourly or daily timescales have also shown that the inclusion of antecedent variables is critically important (Fan et al., 2017; Li et al., 2020). There is no fixed criterion for the selection of exogenous variables because exogenous variables that affect DO patterns change over space and time.
A new asymmetric ϵ-insensitive pinball loss function based support vector quantile regression model
2020, Applied Soft Computing Journal