Elsevier

Applied Soft Computing

Volume 51, February 2017, Pages 280-293

Development of PCA-based cluster quantile regression (PCA-CQR) framework for streamflow prediction: Application to the Xiangxi river watershed, China

https://doi.org/10.1016/j.asoc.2016.11.039

Highlights

  • PCA and quantile regression are integrated into SCA to improve its efficiency.

  • MIC is used to reveal nonlinearity between explanatory and response variables.

  • Sensitivity analysis is performed to identify the impacts of control parameters.

Abstract

In this study, a PCA-based cluster quantile regression (PCA-CQR) method was proposed by integrating principal component analysis (PCA) and quantile regression into a stepwise cluster analysis (SCA) framework. Specifically, principal component analysis was adopted to overcome the multicollinearity among the explanatory variables, while quantile regression was used to provide probabilistic information in prediction. The proposed PCA-CQR method can effectively capture discrete and nonlinear relationships between explanatory and response variables. The applicability of PCA-CQR was demonstrated by a case study of monthly streamflow prediction in the Xiangxi River, China. The nonlinearity between the hydro-meteorological variables and the streamflow measurements was characterized through the maximal information coefficient (MIC), which demonstrated the need for the proposed PCA-CQR method. The results showed that the streamflow and precipitation of the previous month, as well as the potential evapotranspiration of the current month, had significant nonlinear impacts on the streamflow of the current month. Three principal components could well reflect the total variance of the input variables. Comparison between traditional SCA and PCA-CQR showed that the proposed approach could provide more accurate predictions than traditional SCA methods. Moreover, probabilistic forecasts could be provided by PCA-CQR, and the 90% predictive intervals could well bracket the observations in both calibration and validation periods. In addition, sensitivity analysis was performed to identify the impacts of the control parameters in PCA-CQR on the performance of the proposed model. The results showed that the proposed PCA-CQR improved the robustness of traditional SCA. Finally, comparison among PCA-CQR, generalized regression neural network (GRNN) and multiple linear regression (MLR) also showed the effectiveness of the proposed method.

Introduction

Accurate streamflow prediction is crucial for many water applications such as flood control, hydropower generation, irrigation, and rural and urban water supply [8], [9], [11], [23], [34], [12], [13]. Consequently, considerable effort has been devoted to developing effective forecasting techniques for improving hydrologic prediction [10], [49]. These forecasting approaches can be classified into two categories: process-driven and data-driven approaches. The process-driven approaches, mainly involving various forms of rainfall-runoff models (e.g. lumped and distributed hydrologic models), are developed on the basis of the inherent mass and energy conservation laws in the water cycle system. However, the application of process-driven models requires a large quantity of hydro-meteorological data and robust optimization techniques for calibration. In comparison, the data-driven techniques can capture the mapping between input (e.g. rainfall, evaporation, temperature) and output (i.e. streamflow) variables without considering the physical laws that underlie the rainfall-runoff process [47]. Such approaches mainly include statistical regression [38], [2], [1], [37], [3], [22], artificial intelligence [5], [43], [45] and machine learning methods [30], [38], [41], [4].

Stepwise cluster analysis (SCA) is a nonparametric statistical method based on multivariate analysis of variance. It was first proposed by Huang [20] and has been used for a number of environmental prediction problems [21], [44], [46], [9], [14], [25]. Specifically, for streamflow prediction, Li et al. [25] developed a stepwise-clustered hydrological inference model based on stepwise cluster analysis; Fan et al. [10] conducted monthly streamflow prediction for the Xiangxi River based on climate teleconnections. In SCA, the relationships between explanatory variables and responses are reflected through non-functional cluster trees. These cluster trees are obtained by a series of cutting and merging operations, which divide the original response samples into new, mutually unrelated sets according to given criteria. Moreover, SCA can establish the projections between multiple explanatory variables and multiple responses through the non-functional cluster tree, which means that one set of inputs may correspond to multiple responses. Such projections can further reveal the inherent uncertain features in hydrologic prediction problems. The SCA-based approaches are effective for streamflow prediction owing to their capability of reflecting nonlinearity and uncertainty between explanatory and response variables.
However, some issues still need further exploration: (i) SCA-based methods are able to reflect nonlinearity between explanatory and response variables, but such nonlinearity is seldom identified before the application of SCA; (ii) previous methods used the original explanatory variables directly as predictors, and thus the multicollinearity among the explanatory variables may influence the performance of SCA; (iii) the mean, minimum and maximum values in each tip cluster are usually used in the prediction process of SCA, so there is a need to incorporate other regression approaches into SCA to provide probabilistic predictions; (iv) two parameters should be predefined in the implementation of SCA, namely the significance level for the cutting and merging process and the minimum sample size in a tip cluster, but few studies have characterized the impacts of these two parameters on model performance.
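
To make issue (iii) concrete, the following minimal sketch contrasts the mean/minimum/maximum summary of a tip cluster with quantiles obtained by minimising the pinball (check) loss, which is the loss function underlying quantile regression. It is an illustration only, not the authors' implementation: the function names and the sample values are hypothetical, and the regression is reduced to its intercept-only form (a constant quantile per cluster).

```python
def pinball_loss(observed, predicted, tau):
    """Mean pinball (check) loss for quantile level tau in (0, 1)."""
    total = 0.0
    for obs, pred in zip(observed, predicted):
        diff = obs - pred
        # Under-prediction is weighted by tau, over-prediction by (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(observed)


def constant_quantile(values, tau):
    """Intercept-only quantile regression: the sample value with the lowest loss."""
    candidates = sorted(set(values))
    return min(candidates, key=lambda c: pinball_loss(values, [c] * len(values), tau))


# Hypothetical responses (e.g. monthly flows) that fell into one tip cluster.
tip_cluster = [12.0, 15.0, 9.0, 30.0, 22.0, 18.0, 11.0, 40.0, 25.0, 14.0]

# Summary used in conventional SCA prediction, as described in issue (iii).
sca_summary = {
    "mean": sum(tip_cluster) / len(tip_cluster),
    "min": min(tip_cluster),
    "max": max(tip_cluster),
}

# Quantile-based alternative: bounds of a 90% interval plus the median.
quantiles = {tau: constant_quantile(tip_cluster, tau) for tau in (0.05, 0.5, 0.95)}
```

With only ten samples the 5% and 95% levels fall on the sample extremes; a larger cluster, or a regression on the predictors rather than a constant, would give intervals that differ from the plain minimum and maximum.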

Therefore, this study improves upon previous SCA approaches by integrating principal component analysis (PCA), stepwise cluster analysis (SCA) and quantile regression (QR) techniques, leading to a PCA-based cluster quantile regression (PCA-CQR) approach. Compared with previous SCA approaches, the proposed PCA-CQR approach can eliminate the multicollinearity among explanatory variables and provide probabilistic predictions for responses. Moreover, the nonlinearity between explanatory variables and responses is characterized through the maximal information coefficient (MIC) before the application of PCA-CQR. To demonstrate its applicability, the PCA-CQR approach is applied to monthly streamflow forecasting for the Xiangxi River in the Three Gorges Reservoir area, China. Sensitivity analysis is also conducted to characterize the impacts of control parameters on the model performance.
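
The way PCA removes multicollinearity can be illustrated with a small two-variable sketch. It assumes covariance-based PCA on centred data and uses the closed-form rotation angle for a 2x2 covariance matrix; the function names and the two made-up predictor series are hypothetical and are not taken from the study.

```python
import math


def centre(values):
    """Subtract the sample mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def covariance(a, b):
    """Sample covariance of two equally long, already centred series."""
    return sum(x * y for x, y in zip(a, b)) / (len(a) - 1)


def pca_two_variables(x, y):
    """Rotate two series onto their principal axes (covariance-based PCA)."""
    xc, yc = centre(x), centre(y)
    sxx, syy, sxy = covariance(xc, xc), covariance(yc, yc), covariance(xc, yc)
    # Closed-form angle of the first principal axis of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * a + s * b for a, b in zip(xc, yc)]
    pc2 = [-s * a + c * b for a, b in zip(xc, yc)]
    return pc1, pc2


# Made-up, strongly correlated predictors (e.g. two meteorological series).
precip = [80.0, 120.0, 60.0, 150.0, 95.0, 40.0, 110.0, 135.0]
evap = [62.0, 75.0, 50.0, 90.0, 70.0, 45.0, 72.0, 88.0]

pc1, pc2 = pca_two_variables(precip, evap)

# Share of the total variance carried by the first component.
explained = covariance(pc1, pc1) / (covariance(pc1, pc1) + covariance(pc2, pc2))
```

The two component series are uncorrelated by construction, so they could be passed to a clustering step in place of the original, correlated predictors. In practice a library routine on standardised inputs would normally be used for more than two variables.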

Section snippets

Methodology

Fig. 1 presents the framework of PCA-CQR, which involves five main procedures: (i) characterizing the relationships between explanatory and response variables through MIC, (ii) converting the original explanatory variables into uncorrelated variables through PCA, (iii) constructing the cluster tree through SCA, (iv) establishing the probabilistic prediction function in each tip cluster by quantile regression, and (v) performing sensitivity analysis. Among these five procedures, the core part is

Site description and data collection

The proposed PCA-CQR approach is applied to monthly streamflow prediction to demonstrate its applicability. The Xiangxi River basin is located between 30.96–31.67°N and 110.47–111.13°E in the Hubei part of the Three Gorges Reservoir (TGR) region of China, draining an area of about 3200 km², as shown in Fig. 3. Originating in the Shennongjia Nature Reserve, the main stream of the Xiangxi River has a length of 94 km and a catchment area of 3099 km². It is one of the main tributaries of the Yangtze River [18]. The

Deterministic prediction

In this study, the monthly streamflow records from the Xingshan gauging station and the meteorological data from the Xingshan meteorological station from 1991 to 2010 are used to demonstrate the applicability of the developed PCA-CQR model. The first 80% of the samples (from February 1991 to January 2007) are employed for calibration and the remaining 20% (from February 2007 to December 2010) are used for verification. In all SCA-based approaches, including the PCA-CQR, the
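
The calibration and verification periods stated above can be checked with a short sketch that counts the months in each period and splits a series in time order. The helper names are hypothetical, and the series is a placeholder of indices rather than the actual records.

```python
def months_inclusive(start, end):
    """Count months from start to end, both given as (year, month), inclusive."""
    (start_year, start_month), (end_year, end_month) = start, end
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


def split_chronologically(samples, n_calibration):
    """Keep time order: the first n_calibration samples calibrate, the rest verify."""
    return samples[:n_calibration], samples[n_calibration:]


n_cal = months_inclusive((1991, 2), (2007, 1))   # 192 months
n_val = months_inclusive((2007, 2), (2010, 12))  # 47 months

# Placeholder series standing in for the monthly records (indices only).
series = list(range(n_cal + n_val))
calibration, validation = split_chronologically(series, n_cal)
calibration_share = len(calibration) / len(series)  # roughly 0.80
```

Splitting by position rather than at random keeps every verification month later than every calibration month, which matters when lagged streamflow is among the predictors.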

Discussion

The PCA-CQR approach can capture the relationship between meteorological data and streamflow through a cluster tree, which can deal with discrete and nonlinear relationships between explanatory and response variables. To further demonstrate the capability of the proposed PCA-CQR approach, it is compared with other data-driven methods. In this study, the performance of PCA-CQR is compared with multiple linear regression (MLR) and generalized regression neural

Conclusions

In this paper, a PCA-based cluster quantile regression approach (PCA-CQR) was proposed by integrating principal component analysis (PCA), stepwise cluster analysis (SCA) and quantile regression (QR) into a single framework. The proposed PCA-CQR method can effectively capture discrete and nonlinear relationships between explanatory and response variables. In PCA-CQR, PCA is used to reduce the dimension of the explanatory variables and to deal with the multicollinearity among them;

Acknowledgments

This research was supported by the Natural Science Foundation of China (51520105013 and 51679087), the National Key Research and Development Plan (2016YFC0502800), Xiamen University of Technology Foreign Science and Technology Cooperation and Communication Foundation (E201400200), Xiamen University of Technology High-level personnel Foundation (YKJ14038), and the Fujian Class A Foundation (JA14242).

References (54)

  • J. Adamowski et al. Comparison of multiple linear and nonlinear regression, autoregressive integrated moving average, artificial neural network, and wavelet artificial neural network methods for urban water demand forecasting in Montreal, Canada. Water Resour. Res. (2012)

  • J. Adamowski et al. Comparison of multivariate regression and artificial neural networks for peak urban water demand forecasting: the evaluation of different ANN learning algorithms. J. Hydrol. Eng. (ASCE) (2008)

  • A. Ahmadi et al. Uncertainty assessment in environmental risk through Bayesian networks. J. Environ. Inf. (2015)

  • S. Akram et al. Modelling sediment trapping by non-submerged grass buffer strips using nonparametric supervised learning technique. J. Environ. Inf. (2015)

  • H.K. Cigizoglu. Generalized regression neural network in monthly flow forecasting. Civil Eng. Environ. Syst. (2005)

  • W.W. Cooley et al. Multivariate Data Analysis (1971)

  • E.S. Epstein. A scoring system for probability forecasts of ranked categories. J. Appl. Meteorol. (1969)

  • Y.R. Fan et al. Planning water resources allocation under multiple uncertainties: a generalized fuzzy two-stage stochastic programming method. IEEE Trans. Fuzzy Syst. (2015)

  • Y.R. Fan et al. A PCM-based stochastic hydrological model for uncertainty quantification in watershed systems. Stoch. Environ. Res. Risk Assess. (2015)

  • Y.R. Fan et al. A stepwise-cluster forecasting approach for monthly streamflows based on climate teleconnections. Stoch. Environ. Res. Risk Assess. (2015)

  • Y.R. Fan et al. Bivariate hydrologic risk analysis based on a coupled entropy-copula method for the Xiangxi River in the Three Gorges Reservoir area, China. Theor. Appl. Climatol. (2016)

  • Y.R. Fan et al. Probabilistic prediction for monthly streamflow through coupling stepwise cluster analysis and quantile regression methods. Water Resour. Manage. (2016)

  • P. Friederichs et al. Statistical downscaling of extreme precipitation events using censored quantile regression. Mon. Weather Rev. (2007)

  • H. Guan et al. Principal component analysis of the watershed hydrochemical response to forest clearance and its usefulness for chloride mass balance applications. Water Resour. Res. (2013)

  • H.V. Gupta et al. Status of automatic calibration for hydrologic models. J. Hydrol. Eng. (1999)

  • J.C. Han et al. Bayesian uncertainty analysis in hydrological modeling associated with watershed subdivision level: a case study of SLURP model applied to the Xiangxi River watershed, China. Stoch. Environ. Res. Risk Assess. (2014)

  • L. Holappa et al. Annual fractions of high-speed streams from principal component analysis of local geomagnetic activity? J. Geophys. Res.-Space Phys. (2014)