Keywords

1 Introduction

The simulation of hydrological systems and their interactions needs an advanced modeling of water-, energy- and mass cycles in high spatio-temporal resolution [20]. This kind of modeling is used to support water-related predictions and decision making. Such a high-resolution, distributed and physically based modeling demands high performance computing (HPC) and parallel processing of the model units to function fast and efficiently [10, 13, 14]. However, parallel running of such models is challenging for domain scientists, since the interactions among the model units are not strictly independent. Either one can run the processes parallel e.g by using a Message Passing Interface (MPI) for communication and exchange of data between processes, or one can run processes of independent model units in parallel and the processes of dependent model units sequentially. Furthermore, development, testing, execution and update of such a model on HPC Clusters involve potentially a large configuration overhead and require advanced programming expertise of domain scientists. The main aim of this work is to reduce the computational effort of the model, and in addition, to discover underlying patterns of the hydrological systems [5]. The remainder of this paper is structured as follows: Sect. 2 provides further information about the study background, Sect. 3 is a survey of related work, the proposed approach is explained in Sect. 4. In Sect. 5, the processing results are presented, Sect. 6 is about the implementation environment and the conclusions are drawn in Sect. 7.

Fig. 1.
figure 1

Simplified hierarchy of the CAOS model units (modified after [20]). (Color figure online)

2 Background

2.1 Hydrological Model

In this paper we apply our methods on the CAOS (Catchment as Organized Systems) model proposed by Zehe et al. [20]. This model simulates water related dynamics in the lower mesoscale catchments (few tens to few hundreds of square kilometers). The CAOS model provides a high-resolution and distributed process based simulation of water- and energy fluxes in the near surface atmosphere, the earth’s surface and subsurface. These simulations are generally applicable in the field of hydrological research, agricultural water demand estimation and erosion protection or flood forecasting. The landscape is represented by model elements organized in three major hierarchy levels (Fig. 1). The smallest model elements are soil columns referred to as Elementary Functional Units (EFUs). Each EFU is composed of Soilsurface, Soillayers, Macropores (vertical cracks) and Vegetation. In an EFU, all vertical water movements (infiltration, vertical soil water flow, and evapotranspiration) are modelled. On the second hierarchy level, Hillslope model elements contain and connect all EFUs along the downhill path from a ridge line to a river. In a Hillslope, all lateral, downhill flow processes (surface flow and groundwater flow) are modelled in network-like flow structures called rills on the surface and pipes in the subsurface (blue lines in Fig. 1, middle and right sketch). A catchment model element finally contains all Hillslopes, i.e. the drainage area up to a point of interest at a river. In a catchment, all processes of lateral water transport in a river are modelled. EFUs within the same Hillslope may interact due to backwater effects. Hillslopes act completely independent of each other. Before executing the hydrologic simulation, the catchment is divided into Hillslopes based on the flow network derived from a Digital Elevation Model (DEM). Hillslopes are then subdivided in laterally connected EFUs (Fig. 1). The hierarchy of model elements can be abstracted into a network model [5] to profit the advantages of such a representation of objects and their relationships.

2.2 Study Case

The study area used to develop and test the hydrological model is the Attert catchment in the Grand Duchy of Luxembourg. Since the computation of the hydrological model is time consuming, a representative subset of the Attert catchment, the Wollefsbach catchment, is used for the initial development (Fig. 2). To give an insight into the required simulation time, we executed the CAOS model of the Wollefsbach catchment for January 2014 in 5-min resolution on a single core system. The properties, main structure statistics, and execution time are presented in Table 1. The simulation execution time of the whole Attert catchment has not been determined yet.

Table 1. Case study properties.
Fig. 2.
figure 2

Digital Elevation Model of the Attert catchment (brown line) and the Wollefsbach catchment (current study case, red dashed line). (Color figure online)

3 Related Work

Environmental scientists mostly use classification and clustering methods in order to detect patterns in data sets, make decisions and extract the required information by using similarity measurements [3, 15, 18, 19]. [19] studied K-means, Clara, HClust and Fuzzy clustering algorithms to analyze the uncertainty of weather situations. The proposed method reduced the RMSE of point forecasts by up to \( 10\% \). In order to predict the minimum and maximum of weather parameters like temperature and humidity, [17] compared the application of K-means and Hierarchical clustering using internal validation measures. [18] used spectral clustering to determine regions of coherent precipitation regime. They obtained spatial patterns of the precipitation regions that provide a new hydro-climatological insight to understand the hydrological systems. Furthermore, time series is one of the main input data types in environmental science [6] and dealing with these data requires additional preprocessing like dimensionality reduction and distance measurement [1].

4 Methodology

In order to speed up the simulation, we introduce a two-step approach. The first step is to apply an elementary parallelization on the independent model elements level (Hillslopes) and execute the simulation on multi-core processors [5]. The simulation has been run for the Wollefsbach catchment for the duration of one month (January, 2014). The model time resolution of the Hillslope outputs (flux drainage to River) is set to five minutes. Since the simulation code is being developed in MathWorks Matlab, for the elementary parallelization, we have used its parfor functionality. The average execution time of the parallel simulation with a 16-cores processor is 5.4 h, which is a 9.4 times speedup in comparison to the sequential run. The second step of our method can be categorized into Model Order Reduction techniques [8] which is the focus of this paper. Such techniques aim to reduce the computational costs by dimensionality reduction and by computing an approximation of the original model. We exploit the hydrological similarity [9] to reduce the model complexity and computation efforts as a result. The underlying idea of our approach is that similar model units function similarly if departing from similar initial states and being exposed to similar forcing (rainfall or radiation). To realize that we apply clustering algorithms to cluster functionally similar model units. The studied model units are Hillslopes (the model units without exchange between individuals). Then we run the simulation only on the representative of each cluster and map the output to the other members of every cluster. The uncertainty of the approximation can be controlled by the number of clusters and the corresponding computation time.

5 Processing Results

In this work, the initial clustering which defines the initial state of the simulation model units is introduced. We use the time series of discharge from Hillslope model elements obtained from a drainage test (Fig. 3). In the drainage test, the simulation model is executed for all of the Hillslopes initiated with full storage of water. In other words, at the beginning of the test, the Hillslopes are full of water up to their maximum capacity and drain over time. The test is applied on the Wollefsbach catchment and starts from an arbitrary time (in this case, January; Fig. 3) and lasts until the drainage of Hillslopes reaches a predefined boundary (Equilibrium). This time duration is called Time to Equilibrium (TE) of the Hillslopes. During the test no forcing factor is being applied. These time series are integral signatures of Hillslope size, slope, soil (Hillslope structure) and drainage properties, which we then express by two key features that are TE and Active Storage (AS). The second feature, AS extracted from drainage test is the accumulated volume of water flowing out of a Hillslope at each time step normalized to the initial Total Storage of that Hillslope. The time series are used as the input data of our approach (Fig. 4).

Fig. 3.
figure 3

Time series of Total Storage (total water volume) of Hillslopes at each time step; Each line represents a single Hillslope.

5.1 Dimensionality Reduction

Considering the input time series shown in Fig. 4, we extract the features describing their characteristics. The hydrologically meaningful features are AS of each Hillslope at the TE and the gradient of the first time step of the time series called 1st-Gradient, because the speed of drainage especially at the first steps of the test, characterizes the Hillslopes. Other features describing the time series are mathematical moments that express the shape of the distribution. We extracted the four moments Mean, Variance, Skewness and Kurtosis. Thus initially, we have overall a seven-dimensional feature set. Then, each feature set of all Hillslopes are normalized to standard deviation (\( \sigma \)) of that feature set. Dimensionality reduction is an important method to reduce computation complexity. For further dimensionality reduction, we filter highly correlated, i.e. redundant features. Principally, highly correlated features carry similar information so we can reduce them to only one feature. Here, the Pearson correlation coefficient was calculated for each pair of the extracted features (Fig. 5). The Pearson value ranges from \( -1 \) to 1 where 1 defines the total positive linear correlation, 0 is the no linear correlation and \( -1 \) is for total negative linear correlation [16]. Our extracted features are mostly non or positively correlated. Since AS and TE are our hydrological key features, we filter the features that are highly correlated with these two. Regarding Fig. 5, Mean, Variance and AS fit to our exclusion criteria. Therefore, we filter Mean and Variance from our feature set and keep AS. There is no pair of features with highly negative correlation. Finally, we have a five-dimensional feature set consisting of Skewness, Kurtosis, 1st-Gradient, AS and TE.

Fig. 4.
figure 4

Time series of AS; Each line represents a single Hillslope.

5.2 Clustering

Having a feature set as input data, we continue our approach with the application of popular conventional clustering methods namely K-means, K-medoid and DBSCAN and present their efficiency in our use case.

Fig. 5.
figure 5

Pearson correlation coefficient of the features from AS time series.

Application of K-Means Clustering. The only required parameter for the K-means algorithm is the number of clusters (K). In order to determine K, there exists the conventional elbow method to define the optimal number of K using the total within-cluster sum of squares (WSS) or the average distance to centroid [11]. This method is useful in cases where K should be determined only based on the location of the points to be clustered. However, there might be additional constraints suggesting K. In this work, we propose an approach that considers the uncertainty of the simulation introduced by the clustering approach. There is a balance between the number of clusters, K, and the hydrological model uncertainty, based on the RMSE and the simulation computation time of Hillslopes. We determine K with a small yet representative catchment (Wollefsbach) to apply it to the bigger catchment (Attert). Thus, the K parameter can be selected according to the criteria of the hydrologist. Initially, we apply the K-means clustering with varying number of K. Then the RMSE is calculated within each cluster between the cluster members and the representative of that cluster. We define the cluster representative as the Medoid data point whose average dissimilarity to other points in the cluster is minimal. Formally, the Medoid of \( x_{1},x_{2},\cdots ,x_{n} \) as members of each cluster is defined as [12]:

Fig. 6.
figure 6

RMSE and representative Hillslopes computation time for varying Ks using K-means clustering. The gray markers show the original values and the curves in red and green represent their smoothed trend. (Color figure online)

$$\begin{aligned} x_{\text {medoid}}={\text {argmin}}_{y\in \{x_{1},x_{2},\cdots ,x_{n}\}}\sum _{i=1}^{n}d(y,x_{i}), \end{aligned}$$
(1)

where \( d(y,x_{i}) \) is the distance function between y and the ist x. RMSE is the standard deviation of the prediction errors. Formally, RMSE is [7]:

$$\begin{aligned} \text {RMSE}=\sqrt{\sum _{i=1}^{N}(z_{f_{i}}-z_{o_{i}} )^{2}/N}, \end{aligned}$$
(2)

where N is the sample size, \( z_{f_{i}} \) are the predicted values and \( z_{o_{i}} \) are the observed values. According to this, the RMSE measure was calculated between the AS time series of the cluster members and the representative of that cluster. Thus, there is one RMSE measurement per Hillslope for each K variation. Finally, the total RMSE measure of all Hillslopse is calculated and plotted in Fig. 6 using the following equation:

$$\begin{aligned} \sigma _\mathrm{total RMSE} = \sqrt{\sum _{i=1}^{P} (\mathrm RMSE_{i})^{2}}, \end{aligned}$$
(3)

where P is the number of data points in the feature set. In order to find the optimal number of K, we use the trade-off between the RMSE measurement and sum of the computation time of representative Hillslopes of each cluster. According to our methodology, the simulation is applied only on the representative Hillslopes and sum of their computation time is calculated for each number of K. The results of this experiment are shown in Fig. 6, where the horizontal axis represents the number of clusters and the vertical axis, the RMSE measurement and sum of the computation time of representative Hillslopes of each cluster normalized by Min-Max normalization. Evidently, as the number of clusters raises, the corresponding RMSE decreases while the computation time increases (Fig. 6). The main goal of our approach is to achieve the best trade-off between computation time and simulation uncertainty. In Fig. 6, a range of the intended compromise between RMSE and computation time is recognizable where the curves intersect. As K-means places the initial centeroids randomly, the output of its executions with the same number of K differs slightly. Thus, the intended compromise occurs where \( 32< K < 42 \), \( 11.8 \%< RMSE < 14.2 \% \) of the maximum \( RMSE = 39.2 \) and the computation time ranges from \( 10.3 \% \) to \( 16.2 \% \) of total computation time (31.8 days). As an example, the spatial distribution of the K-means clustering at \( K = 37 \) which corresponds to the best compromise between RMSE and computation time in Wollefsbach catchment is shown in Fig. 7. Each color indicates a cluster and the number of its members can be found in the legend of map. All the single member clusters are shown in blue, which are single Hillslopes that do not fit into the other clusters. The map shows a valid Hillslopes clustering, considering the hydrological parameters like the structure, size and location of the Hillslopes. Generally, the overhead of running such a clustering during the simulation is negligible.

Fig. 7.
figure 7

Spatial distribution of K-means clusters at \( K = 37 \) applied on Wollefsbach catchment. All single member clusters are shown in dark blue. (Color figure online)

Application of K-Medoids Clustering. Another variant of K-means is the K-medoids algorithm that uses the actual data points as cluster centers. It receives the number of clusters (K) and the distance matrix of points as input parameters. We have used the K-medoid source code available at [2]. The algorithm was run for variable number of K and the results are shown in Fig. 8. The plot indicates that the intended compromise range between RMSE and computation time occurs where \( 58< K < 78 \), \( 16.8 \%< RMSE < 34.7 \% \) of the maximum \( RMSE = 31.8 \) and the related computation time is between \( 22.7 \% \) and \( 33.8 \% \) of the maximum computation time (31.8 days).

Fig. 8.
figure 8

RMSE and representative Hillslopes computation time for varying Ks using K-medoids clustering. The gray markers show the original values and the curves in red and green represent the smoothed trend. (Color figure online)

Application of DBSCAN Clustering. DBSCAN clustering requires two main parameters as input, namely Eps and MinPts. In order to find a set of optimal parameters, DBSCAN clustering is applied on a different range of Eps and MinPts. The same method of determining and visualizing RMSE with the computation time described in Sect. 5.2 is used with DBSCAN clustering. For each set of parameters, the number of clusters is calculated. Noise clusters are considered as one cluster in the whole number of clusters. The results shown in Fig. 9 indicate that the intended compromise range between RMSE and computation time is achieved where the number of clusters ranges between \( 51 {-} 62 \), \( 0.3< Eps < 0.7 \), \( 1< MinPts < 21 \), the RMSE is between \( 14.5 \% \) and \( 31.4 \%\) of maximum RMSE (38.6) and the computation time is in range of \( 17.9\% \) and \( 23\% \) of the maximum computation time (31.8 days). The direct comparison of the three applied methods is illustrated in Fig. 10, which clearly shows that the K-means clustering performs better for the studied case and features the lowest RMSE for up to 18 days of computation. A summary of all results are available in Table 2.

Fig. 9.
figure 9

RMSE and representative Hillslopes computation time for variable Eps and MinPts using DBSCAN clustering. Some of the DBSCAN parameters’ combination generate the same number of clusters.

Fig. 10.
figure 10

Comparison of the RMSE and computation time of all analyses.

Table 2. Parameters and achievements of different clustering methods

6 Implementation Environment

All the analysis methods are implemented in Python and executed on a computer with Ubuntu 16.04.4 LTS operating system running the Linux kernel 4.4.0-127-generic and a four-core 64-bit Intel(R) Core(TM) i5-6300U CPU @ 2.40 GHz processor. The benchmarking of simulation model parallelization has been done on a computer with Red Hat Enterprise Linux Server release 7.4 running the linux kernel 3.10.0-693.11.6.el7.x86_64 and a 16-core Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00 GHz processor. All scripts, data files and requirements of the analyses are available as a gitlab repository named “hyda” [4].

7 Conclusions and Future Work

In this work we introduced an approach to make use of landscape properties to reduce computational redundancies in hydrological model simulations. We applied three different clustering methods namely, K-Means, K-Medoids and DBSCAN on the time series data from a study case in hydrology. According to the results, the K-means clustering functions better than the other applied clustering methods. It achieves the intended compromise between RMSE and Hillslopes computation time in a range of \( 11.8 \%< RMSE < 14.2 \% \) and \( 10.3 \%< computation time < 16.2 \% \). The K-means clustering requires a smaller number of clusters and consequently lower representative Hillslopes computation time in comparison to the other studied clustering methods. Considering the \( 16.8 \%< RMSE < 34.7 \% \) and \( 22.7 \%< computation time < 33.8 \% \), K-medoids clustering shows worse performance than the other two methods. DBSCAN clustering has promising results also not pleasing as the K-means method. The main challenge of applying DBSCAN is to find an intended balance of both Eps and MinPts parameters. As a future work, the methods will be applied on the whole Attert catchment simulations and as a forward step the clustering approach will be extended to consider also forcing in the simulation model.