
1 Introduction

Efficient diagnosis of leakages in water distribution networks (WDNs) has been a topic of interest in recent years [8, 15]. In particular, the accurate location of leakages is of significant economic and environmental importance. Research on leak location in WDNs follows two main lines: hardware-based and software-based methods, the latter being the main interest of this paper.

Software-based approaches comprise data analysis [8], model-based [15], and hybrid strategies [16], most of which try to identify a node in the network as the leak location. However, these exact-node strategies suffer accuracy limitations due to the uncertainty in the data, the variability in consumer demands, and the number of sensors installed, among other factors. Therefore, several software-based strategies in recent years have approached the leak location task as a leak zone identification problem, aiming to narrow the possible leak location down to a group of nodes [2, 4, 16, 20]. Once a leak zone has been proposed, the exact location of the leak can be pinpointed using hardware-based methods. This approach has increased leak location reliability while sacrificing leak location resolution. Leak location reliability measures the performance of the method in terms of identifying the correct zone where the leak is located, while leak location resolution is associated with the size of the zone where the leak is located, i.e. the higher the resolution, the smaller the size of the zone.

Quiñones-Grueiro [4] proposes partitioning the network into zones by means of a k-medoids algorithm with topological variables. Data for each zone are artificially generated to form classes. The leak zone location problem is then solved using random forest (RF) and support vector machine (SVM) classifiers trained with the generated data. A similar strategy is followed by Zhang [20], who performs network partitioning using k-means clustering with hydraulic variables.

Iterative zone divisions have also been proposed by Shekofteh [16] and Chen [2], using the Girvan-Newman algorithm and hydraulics-based k-means, respectively. Both propose dividing the network into two zones and then identifying one as the leak location. The identified zone is then divided again into two subzones, repeating the process iteratively until a previously determined division iteration is reached.

Despite the increasing number of leak zone location approaches presented recently, none of them has analyzed the influence of the network partitioning method on the leak location reliability. Therefore, the main objective of this paper is to analyze the effect of different zone division strategies on the overall performance of the leak zone location task. Four clustering algorithms are studied: DBSCAN, k-medoids, the Girvan-Newman algorithm and agglomerative clustering. A secondary objective is to analyze the effect of the type of variable used when performing the clustering. The clustering methods are tested considering both topological and hydraulic variables, and the leak location task is solved using an SVM classifier. The paper is structured as follows. In Sect. 2, the theoretical basis of the methods used for clustering the nodes is given. Section 3 presents a summary of the methodology followed to locate a leakage. Section 4 introduces the Modena WDN, which is used in this paper as the case study. The classification results corresponding to each clustering strategy are presented and compared in Sect. 5. Finally, conclusions are drawn and recommendations for future studies are made.

2 Materials

2.1 WDN Partitioning Strategies

The partitioning of the network has been approached through clustering methods. The task of clustering a data set \(X = \{x_1, x_2, \dots x_n\}\) of n objects consists of identifying a number of disjoint subsets of X formed by objects with similar characteristics, while the similarities among subsets are minimal [9]. These subsets or groups are called clusters. Several approaches have been proposed for solving the clustering task [3, 10, 12], differing in the way the clusters are shaped, the type of objects being clustered and the measures of object similarity, among other characteristics. Four clustering strategies are used in this paper, selected for their simplicity, their effectiveness and the fact that they are widely known.

K-medoids Clustering. K-medoids is a partitioning clustering method [10] in which a set \(M = \{m_1, m_2, \dots m_k\}\) of k representative objects from a data set \(X = \{x_1, x_2, \dots x_n\}\) is selected in order to build k subsets (clusters) \(X_c \subset X, c = 1, 2, \dots k\) of X. Each representative object \(m_c\) is referred to as the medoid of cluster \(X_c\). Once the medoids have been determined, the clusters are built by assigning each object from data set X to the cluster formed by the nearest medoid [9]. In order to do so, a similarity or distance measure \(d_o(x, x')\) must be defined for every two objects \(x, x' \in X\). The k-medoids clustering problem is then reduced to finding a set of medoids M such that the total distance from every object in X to its corresponding medoid is minimal. This can be formulated as the following optimization problem:

$$\begin{aligned} \min _{M}\sum ^{n}_{i = 1}\sum ^{n}_{j = 1} d_o(x_i, x_j)\, z_{ij} \quad \text {s.t.} \quad z_{ij} = {\left\{ \begin{array}{ll} 1 &{} \text {if } x_i = m_c,\ m_c \in M \text { and } x_j \in X_c\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

where \(m_c\) is the medoid representing cluster \(X_c\) and \(c = 1, 2, \dots k\).
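The optimization in Eq. (1) can be illustrated with a small brute-force sketch (an exhaustive search over candidate medoid sets, feasible only for tiny n; real implementations use PAM-style iterative swapping, and the toy distance matrix below is illustrative only):

```python
import itertools

def k_medoids(dist, k):
    """Solve Eq. (1) by exhaustive search over candidate medoid sets.

    dist: symmetric n x n matrix with entries d_o(x_i, x_j).
    Returns (medoids, labels), labels[i] being the cluster index of x_i.
    """
    n = len(dist)
    best_cost, best_medoids = float("inf"), None
    for medoids in itertools.combinations(range(n), k):
        # total distance of every object to its nearest medoid
        cost = sum(min(dist[i][m] for m in medoids) for i in range(n))
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    labels = [min(range(k), key=lambda c: dist[i][best_medoids[c]])
              for i in range(n)]
    return list(best_medoids), labels

# Toy example: four objects on a line at positions 0, 1, 10, 11
pts = [0, 1, 10, 11]
dist = [[abs(a - b) for b in pts] for a in pts]
medoids, labels = k_medoids(dist, 2)  # two well-separated clusters
```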

Agglomerative Clustering. Agglomerative clustering is a hierarchical clustering method that, given a data set \(X = \{x_1, x_2, \dots x_n\}\), yields a hierarchical tree or dendrogram of clusters. Each junction in the dendrogram represents the merging of two clusters and the bottom of the tree is formed by n one-object clusters, while the top has only one cluster with all n objects [9].

In order to build a cluster dendrogram, the following iterative methodology is defined:

  1. Create a starting set \(C^0 = \{C_1^0, C_2^0, \dots , C_n^0\}\) of one-object clusters, which forms the bottom of the dendrogram.

  2. Merge the two nearest clusters to form a new cluster set \(C^1 = \{C_1^1, C_2^1, \dots , C_{n-1}^1\}\). In order to do so, a similarity or distance measure \(d_c(C, C')\) between any two clusters \(C, C'\) must be defined, which in turn requires a similarity or distance measure \(d_o(x, x')\) for every two objects \(x, x' \in X\).

  3. Repeat step 2 for every cluster set \(C^i = \{C_1^i, C_2^i, \dots , C_{n-i}^i\}\) until a single-cluster set \(C^{n-1} = \{C^{n-1}_1\}\) is generated.

The agglomerative clustering problem is reduced to finding a level in the hierarchy which renders a cluster set most suitable for the task at hand. Several suitability criteria might be followed, such as a previously determined optimal number of clusters or an analysis of the inconsistency coefficient, among others [7].
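The iterative merging described above can be sketched as follows (a minimal variant using single linkage as the cluster distance \(d_c\), which is only one possible choice, and stopped once a chosen number of clusters k is reached rather than building the full dendrogram):

```python
def agglomerative(dist, k):
    """Single-linkage agglomerative clustering, stopped at k clusters.

    dist: n x n matrix of object distances d_o(x_i, x_j).
    Returns the dendrogram level containing k clusters (as sets of indices).
    """
    clusters = [{i} for i in range(len(dist))]

    def d_c(a, b):
        # single linkage: distance between the closest members of a and b
        return min(dist[i][j] for i in a for j in b)

    while len(clusters) > k:
        # merge the two nearest clusters (one dendrogram junction)
        p, q = min(
            ((p, q) for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda pq: d_c(clusters[pq[0]], clusters[pq[1]]),
        )
        clusters[p] |= clusters.pop(q)
    return clusters

pts = [0, 1, 10, 11]
dist = [[abs(a - b) for b in pts] for a in pts]
clusters = agglomerative(dist, 2)
```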

Girvan-Newman Community Algorithm. The Girvan-Newman clustering algorithm is based on the detection of communities within a graph network. A community is a group of network nodes within which connections are dense while connections with other groups are sparse [12].

This is also a hierarchical clustering algorithm; in this case, however, it is divisive, meaning it starts with a single cluster \(C^0\) containing all network nodes and iteratively splits it until n one-node clusters \(C^{n-1} = \{C_1^{n-1}, C_2^{n-1}, \dots , C_n^{n-1}\}\) are generated.

A set of clusters \(C^i = \{C_1^i, C_2^i, \dots , C_{i+1}^i\}\) at any level of the hierarchy (\(i = 0, 1, \dots n-2\)) is divided to form a cluster set \(C^{i+1} = \{C_1^{i+1}, C_2^{i+1}, \dots , C_{i + 2}^{i + 1}\}\) by removing the edges with the highest edge betweenness score until one of the clusters in \(C^i\) is no longer connected [1]. Edge betweenness is a measure of how many paths in the network go through a given edge, and it can be calculated in several ways. However, the particular edge betweenness calculation method does not have much impact on the overall performance of the clustering task.
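One divisive step can be sketched as follows (a minimal sketch using shortest-path edge betweenness computed with one BFS per source, in the spirit of [1]; production code would use Brandes' algorithm or a library implementation):

```python
from collections import deque

def edge_betweenness(adj):
    """Shortest-path betweenness score for every edge of an undirected graph.

    adj: dict mapping node -> set of neighbour nodes.
    One BFS per source counts shortest paths; credit is then propagated
    back from the leaves of each BFS tree (Girvan-Newman style counting).
    """
    score = {}
    for s in adj:
        dist, paths, order = {s: 0}, {s: 1.0}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], paths[v] = dist[u] + 1, 0.0
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    paths[v] += paths[u]
        credit = {u: 1.0 for u in order}
        for v in reversed(order):
            for u in adj[v]:
                if dist.get(u) == dist[v] - 1:  # u precedes v on a shortest path
                    c = credit[v] * paths[u] / paths[v]
                    edge = tuple(sorted((u, v)))
                    score[edge] = score.get(edge, 0.0) + c
                    credit[u] += c
    return score

def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph splits in two."""
    adj = {u: set(vs) for u, vs in adj.items()}

    def components():
        seen, comps = set(), []
        for s in adj:
            if s in seen:
                continue
            comp, queue = {s}, deque([s])
            seen.add(s)
            while queue:
                u = queue.popleft()
                for v in adj[u] - seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
            comps.append(comp)
        return comps

    while len(components()) < 2:
        betweenness = edge_betweenness(adj)
        u, v = max(betweenness, key=betweenness.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components()

# Two triangles bridged by edge (2, 3): the bridge carries every cross
# pair's shortest path, so it is removed first and the network splits.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
communities = girvan_newman_split(adj)
```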

Density-Based Spatial Clustering of Applications with Noise (DBSCAN). DBSCAN is a density-based clustering algorithm which produces a set of clusters from a data set X while identifying objects in the data set that do not belong to any cluster and are, therefore, considered noise. It is particularly useful when the clusters to be formed may have arbitrary shapes [13].

Cluster formation in DBSCAN takes place by selecting a set of core objects and their neighborhoods. A core object \(x_p\) is defined as an object in X for which there are at least MinP objects of X within a neighborhood of radius Eps around \(x_p\) [3]. In order to construct an Eps-neighborhood for each object, a similarity or distance measure \(d_o(x, x')\) must be defined for every two objects \(x, x' \in X\). In other words, an object \(x_p\) is considered a core object when the following holds:

$$\begin{aligned} \mid N_{eps}(x_p)\mid \ge MinP \quad s.t. \quad N_{eps}(x_p) = \{x_q \in X \mid d_o(x_p, x_q) \le Eps\} \end{aligned}$$
(2)

where \(N_{eps}(x_p)\) is \(x_p\)’s Eps-neighborhood and \(\mid N_{eps}(x_p)\mid \) is the neighborhood cardinality (i.e. number of objects).

Once all core objects have been identified, clusters are formed by merging sets of core objects which are in each other's Eps-neighborhoods, together with their respective neighborhoods. All objects that have no core objects within their Eps-neighborhoods are considered noise and are not assigned to any cluster. The MinP parameter defines a minimum cluster size, while the combination of MinP and Eps determines a minimum cluster density (or a maximum noise density). These parameters can be adjusted for a known cluster density or in search of a desired number of clusters. They can also be determined with the help of the k-dist graph defined in [3].
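The cluster-formation process just described can be sketched as follows (a minimal DBSCAN over a precomputed distance matrix; note that, following Eq. (2), each object's Eps-neighborhood includes the object itself):

```python
from collections import deque

def dbscan(dist, eps, min_pts):
    """Minimal DBSCAN over a precomputed distance matrix.

    dist: n x n matrix of distances d_o; eps and min_pts as in Eq. (2).
    Returns (clusters, noise): a list of clusters (sets of indices) and
    the set of noise objects assigned to no cluster.
    """
    n = len(dist)
    # Eps-neighborhood of each object (includes the object itself, Eq. (2))
    neigh = [{q for q in range(n) if dist[p][q] <= eps} for p in range(n)]
    core = {p for p in range(n) if len(neigh[p]) >= min_pts}

    # merge core objects lying in each other's Eps-neighborhoods
    clusters, pending = [], set(core)
    while pending:
        seed = pending.pop()
        comp, queue = {seed}, deque([seed])
        while queue:
            p = queue.popleft()
            for q in neigh[p] & pending:
                pending.discard(q)
                comp.add(q)
                queue.append(q)
        clusters.append(comp)

    # attach border objects; everything left over is noise
    noise = set()
    for p in set(range(n)) - core:
        nearby_cores = neigh[p] & core
        if nearby_cores:
            c = nearby_cores.pop()
            next(cl for cl in clusters if c in cl).add(p)
        else:
            noise.add(p)
    return clusters, noise

# Two dense groups on a line plus one isolated object (index 6)
pts = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0, 50.0]
dist = [[abs(a - b) for b in pts] for a in pts]
clusters, noise = dbscan(dist, eps=1.0, min_pts=3)
```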

3 Methodology

The leak localization task is addressed as a classification problem using SVM classifiers. Given a sample vector of WDN operational variables, each possible leak zone location is represented by a class. The construction of these classes consists of clustering the nodes into k zones within the network. In this section, the methodologies used for network partitioning are presented, as well as the methodology used in the classification step.

3.1 Class Formation

In order to form the classes, the four clustering methods presented in Sect. 2.1 were implemented. The Girvan-Newman method is graph-based and, therefore, topological in nature, since it works directly with the network's nodes and connections (pipes). The other three methods (k-medoids, agglomerative clustering and DBSCAN), however, are defined for any set of objects. With this in mind, two types of variables are considered for each of these three methods: one based on the network's structure (topology-based) and the other based on the patterns generated by leaks at each node (hydraulics-based).

On the one hand, for topology-based clustering, the set of objects X is defined as the node tags of the WDN. The similarity measure between objects \(d_o(x, x')\) is defined as the topological distance between two nodes, i.e. the total pipe length traversed on the shortest path between them. Both the topological object set X and the distances between all its elements can be easily determined from the structure of the WDN, making this an easily accessible strategy. Also, with topological distance as the similarity measure, the formation of connected zones [1] is guaranteed, which should make the subsequent narrowing down of the leak location more effective.

On the other hand, the use of a sensitivity matrix for zone generation is often found in leak diagnosis applications [2, 20]. In this case, for hydraulics-based clustering, the set of objects X is defined as the rows of the sensitivity matrix generated by a set of leaks simulated across the network. Since the objects in X are now feature vectors, several similarity measures \(d_o(x, x')\) can be defined between objects. A downside of this approach is the need for a hydraulic model of the network in order to generate the sensitivity matrix, as well as the fact that the matrix is conditioned by the simulation parameters. Also, since neither the set of objects nor the similarity measure is directly related to the network structure, the resulting zones can be disconnected.

3.2 A DBSCAN Variation

As presented in Sect. 2.1, DBSCAN generates a number of clusters and a noise set with all the objects that do not belong to any cluster. In this application, however, a leak can occur in any node of the network; therefore, every object in the data set X must be assigned to a cluster. To solve this problem, an extra step was added to the DBSCAN clustering algorithm in which every object in the noise set is assigned to the nearest cluster, using single linkage [9] as the similarity measure.
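This extra step can be sketched as follows (assuming a DBSCAN output of clusters plus a noise set and a precomputed distance matrix; the function name and toy data are illustrative):

```python
def assign_noise_to_clusters(clusters, noise, dist):
    """Extra step over plain DBSCAN: attach every noise object to the
    nearest cluster, with single linkage (distance to the cluster's
    closest member) as the similarity measure."""
    for p in noise:
        nearest = min(clusters, key=lambda cl: min(dist[p][q] for q in cl))
        nearest.add(p)
    return clusters

# Object 2 was left as noise; single linkage attaches it to {0, 1}
pts = [0.0, 1.0, 3.0, 10.0, 11.0]
dist = [[abs(a - b) for b in pts] for a in pts]
clusters = assign_noise_to_clusters([{0, 1}, {3, 4}], {2}, dist)
```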

4 Case Study

4.1 Modena WDN

The WDN of the city of Modena, Italy, was selected as a case study to test the performance of the proposed strategies. This network (Fig. 1) has 268 junctions, 317 pipes, and 4 reservoirs, which make it a large-scale network. The network is completely gravity-fed [19] and therefore has no pumps.

Fig. 1. Modena WDN topological representation

4.2 Data Generation

The hydraulic model of the Modena WDN was used to generate synthetic leakage data by means of the EPANET 2.0 [14] hydraulic simulation software. The following considerations were taken into account during simulation:

  • A regime of minimum night flow (from 2 AM to 6 AM) is considered, where the variations caused by consumer demands are minimal, making the pattern recognition task simpler.

  • A sample time of 15 min is set, for a total of 4 samples per hour and 4 h per day (due to the minimum night flow regime). Samples are filtered by averaging the four samples in an hour. The 4 filtered samples in a day are considered a scenario.

  • Leaks of random sizes within the interval [2.7, 6.2] lps were generated in every node of the network. This represents 1.6 to 3.6% of the network's total demand and 1 to 3 times a node's nominal demand.

  • Aiming to simulate the variations caused in node demands by the consumers' water usage patterns and, therefore, generate more realistic data samples, a certain level of uncertainty was considered when simulating the consumer demands by sampling from a Gaussian distribution \(\mathcal {N}(d_n, 0.05d_n)\), where \(d_n\) is the nominal demand of the node.

  • Also aiming towards realistic simulation, measurement noise (with mean 0 and standard deviation \(0.025\,\mathrm {mH_2O}\)) is added to the simulated pressure values.
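The demand uncertainty and measurement noise described above can be sketched as follows (function names and the seed are illustrative; the actual simulations were run in EPANET):

```python
import random

def noisy_demands(nominal, rng):
    """Sample one demand per node from a Gaussian with mean d_n and
    standard deviation 0.05 * d_n (5% relative uncertainty)."""
    return [rng.gauss(d_n, 0.05 * d_n) for d_n in nominal]

def noisy_pressures(pressures, rng):
    """Add zero-mean measurement noise (std 0.025 mH2O) to simulated
    pressure values."""
    return [p + rng.gauss(0.0, 0.025) for p in pressures]

rng = random.Random(42)
demands = noisy_demands([10.0] * 1000, rng)      # 1000 draws around 10 lps
pressures = noisy_pressures([25.0, 30.0, 28.5], rng)
```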

4.3 Sensor Configuration

Since this is such a large network, monitoring pressure and flow in all of its junctions and pipes would be economically strenuous, not to mention unnecessary, since several works [2, 18] have achieved satisfactory results by selecting only a subset of locations within the network to position sensors. In this paper, a set of pressure sensors, which are more economically accessible, is considered. The optimal sensor positions were selected aiming to maximize leak detection performance by means of a genetic optimization algorithm [6]. Three pressure sensor configurations were proposed:

Table 1. Pressure sensor configurations

5 Results and Discussion

A comparison was first considered among clustering methods based on topological variables and, afterwards, methods based on hydraulic variables were compared. Finally, an overall comparison between all methods was developed.

In order to test each method, the following methodology was developed:

  1. The WDN was partitioned into k clusters, each representing a class.

  2. A balanced training set was generated, with 400 leak scenarios per class.

  3. An SVM classifier was trained to identify the leak location class. Optimal SVM hyperparameters were determined using grid search [4].

  4. The classifier performance was evaluated by means of a completely new test set with 50 scenarios per node.

  5. Bayesian temporal reasoning was applied over all 4 samples in each scenario in order to improve classification performance [17].

  6. Steps 3 to 5 were repeated 10 times with different training sets to assess the variability in performance caused by the uncertainty.
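One way to fuse the per-sample classifier outputs over the 4 samples of a scenario is recursive Bayesian updating; the sketch below illustrates the idea behind step 5, though the exact scheme of [17] may differ and the zone labels are illustrative:

```python
def fuse_posteriors(sample_posteriors):
    """Fuse per-sample class posteriors by recursive Bayesian updating.

    sample_posteriors: list of dicts, one per sample, mapping each leak
    zone class to p(class | sample). Starting from a uniform prior, each
    sample's posterior acts as a likelihood and the belief is renormalized.
    """
    classes = list(sample_posteriors[0])
    belief = {c: 1.0 / len(classes) for c in classes}  # uniform prior
    for posterior in sample_posteriors:
        belief = {c: belief[c] * posterior[c] for c in classes}
        total = sum(belief.values())
        belief = {c: b / total for c, b in belief.items()}
    return belief

# Two mildly confident samples reinforce each other
belief = fuse_posteriors([{"z1": 0.6, "z2": 0.4}, {"z1": 0.6, "z2": 0.4}])
```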

The effect of each clustering method on the classification performance was tested for all three sensor configurations. A fixed number of classes \(k = 25\) was selected for all experiments. A performance measure was defined in order to carry out the comparison. Leak zone location performance was defined as \(LZP = 100\frac{CL}{TS}\), where TS is the total number of scenarios in the test set and CL is the number of scenarios for which the leak zone was correctly estimated.
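The LZP measure can be computed directly from the true and predicted zone labels of the test scenarios (a trivial sketch with illustrative labels):

```python
def leak_zone_performance(true_zones, predicted_zones):
    """LZP = 100 * CL / TS: the percentage of test scenarios whose leak
    zone was correctly identified."""
    correct = sum(t == p for t, p in zip(true_zones, predicted_zones))
    return 100.0 * correct / len(true_zones)

lzp = leak_zone_performance([1, 2, 3, 1], [1, 2, 1, 1])  # 3 of 4 correct
```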

5.1 Topology-Based Clustering Methods

As expected, topology-based clustering algorithms produced connected zones (classes) in every case. Figure 2 presents a performance comparison of the four clustering methods tested.

The obtained results can be interpreted as a two-factor full factorial experiment design with sensor configuration and clustering method as the two factors, with three and four levels respectively [11]. Therefore, Friedman's nonparametric statistical test [5] was executed in order to find significant differences among clustering methods. No significant difference was found between DBSCAN and k-medoids. However, the Girvan-Newman algorithm and agglomerative clustering present statistically significant differences. In general, the lowest performance is attained with the 10 sensor configuration, while the 15 sensor configuration shows a slightly higher performance than its 20 sensor peer. Regarding clustering methods, the Girvan-Newman algorithm achieves the worst overall performance and agglomerative clustering the highest.

Fig. 2. Comparing topology-based clustering methods

5.2 Hydraulics-Based Clustering Methods

Hydraulics-based clustering algorithms, however, did not generate connected zones (classes) in some cases, which depends mostly on the number of classes and the distance measures used. Using hydraulic variables, k-medoids clustering, agglomerative clustering and DBSCAN were tested; a performance comparison is presented in Fig. 3.

Fig. 3. Comparing hydraulics-based clustering methods

This can also be interpreted as a full factorial experiment in the same way. Friedman's test showed similar results, with no method presenting significant statistical differences in performance. K-medoids showed the best performance for the 10 and 15 sensor configurations but was surpassed by the DBSCAN algorithm in the 20 sensor configuration and in mean overall performance.

5.3 Effect of the Variable Used

The effect of the type of variable used on the leak zone location performance is analyzed separately for each sensor configuration. Two factors are defined for this analysis: clustering method with three levels (k-medoids, agglomerative clustering and DBSCAN) and type of variable with two levels (topological or hydraulic). Friedman’s statistical test was used to analyze the effect of the type of variable on the performance.

For the 10 sensor configuration, all three hydraulics-based methods show significantly better classification performance, with hydraulics-based k-medoids being the best clustering method with a mean performance of 97.56%.

For the 15 sensor configuration, both topology-based and hydraulics-based methods show similar performances, with hydraulics-based k-medoids again yielding the best results with a mean performance of 99%, closely followed by topology-based agglomerative clustering and DBSCAN with mean performances of 98.68% and 98.64%, respectively. Finally, for the 20 sensor case, there is a clear superiority shown by the hydraulics-based DBSCAN algorithm, with a mean performance of 98.91%.

6 Conclusions

A comparison among 7 zone partitioning procedures for WDNs was developed for three different (10, 15 and 20) sensor configurations, aiming to study the effect of the clustering method used on the zone-based classification performance, as well as the effect of the type of variable used when performing the clustering.

Agglomerative clustering presented the best results among topology-based clustering methods, while the Girvan-Newman algorithm showed a poor effect on classification performance for all sensor configurations. However, no significant difference was found among hydraulics-based methods. For the 10 sensor configuration, hydraulics-based methods considerably outperform topology-based ones; for the 15 and 20 sensor cases, however, both groups present strategies with similarly commendable results. In future studies, the effect of combining different sensor positioning strategies with these clustering methodologies will be investigated.