1 Introduction

In data mining, clustering is one of the most commonly used methods for dividing a set of unlabeled data into related clusters. Because clustering requires no prior knowledge about the data, it can uncover hidden information in the data. Among the many clustering algorithms, K-means is one of the most popular for its high efficiency and simplicity, but it is prone to getting trapped in local optima when initialized with poor centroids [1].

Nature-inspired heuristic algorithms, such as the Genetic Algorithm (GA) [2,3,4], Particle Swarm Optimization (PSO) [5,6,7], and Ant Colony Optimization (ACO) [8, 9], have attracted scholars to apply them to clustering problems and perform well in data clustering. In this context, the Water Cycle Algorithm (WCA) was proposed by Eskandar et al. [10], inspired by the water cycle and the way streams and rivers flow to the sea.

In WCA, besides the main flow step, evaporation and raining are also important components, which help WCA escape local optima. Many improvements to WCA have been proposed to enhance its performance. Chen et al. [11] presented the Hierarchical Learning WCA (HLWCA), which divides the solutions into collections with hierarchical differences to improve WCA’s global search ability. Al-Rawashdeh et al. [12] applied a hybrid of the Water Cycle Algorithm and Simulated Annealing to improve the accuracy of feature selection and to evaluate their proposed spam detection method. Bahreininejad [13] studied the impact of the Augmented Lagrange Method (ALM) on WCA and presented the WCA-ALM algorithm to enhance convergence and solution quality. In 2019, an Inter-Peer Communication Mechanism Based Water Cycle Algorithm (IPCWCA) was presented by Niu et al. [14], which utilizes information communication between inter-peer individuals to enhance the performance of the whole WCA. In IPCWCA, each stream and river learns information from one of its peers on some dimensions before the flow step, which also helps improve population diversity.

In this paper, we combine IPCWCA with K-means and apply it to clustering analysis, including data clustering and customer segmentation. The method consists of an IPCWCA module and a K-means module: the IPCWCA module is executed first to obtain a globally best individual, and the K-means module then inherits this individual to continue the clustering process. The sum of squared errors (SSE) is adopted as the fitness function to judge clustering performance: the smaller the SSE, the better the clustering. In addition, the Friedman test is used to compare the performance of the algorithms from a statistical viewpoint.

The rest of the paper is organized as follows: Sects. 2, 3 and 4 introduce the Water Cycle Algorithm, the Inter-Peer Communication Mechanism Based Water Cycle Algorithm (IPCWCA), and the K-means algorithm, respectively. Section 5 presents the series of WCA + K-means based methods in detail. Section 6 discusses the experiments and results. Section 7 presents the conclusions of the work.

2 Water Cycle Algorithm

The Water Cycle Algorithm (WCA), which simulates the natural phenomenon of the water cycle, was originally proposed to address engineering optimization problems. WCA mainly consists of three steps: flow, evaporation and raining.

Specifically, WCA focuses on the flow among streams, rivers and the sea. The sea is the best individual in the whole population, rivers are good individuals that are inferior only to the sea, and the remaining individuals are considered streams.

During the flow step, a stream updates its position using Eq. (1) if it flows directly to the sea, or Eq. (2) if it flows to a river:

$$ X_{Stream} \left( {t + 1} \right) = X_{Stream} \left( t \right) + rand \times C \times \left( {X_{Sea} \left( t \right) - X_{Stream} \left( t \right)} \right) $$
(1)
$$ X_{Stream} \left( {t + 1} \right) = X_{Stream} \left( t \right) + rand \times C \times \left( {X_{River} \left( t \right) - X_{Stream} \left( t \right)} \right) $$
(2)

where rand is a uniformly distributed random number in [0, 1] and C is a constant between 1 and 2 (typically C = 2) [10]. After flowing, if a stream’s fitness value becomes better than that of its river or the sea, the two exchange roles.

A river, in turn, flows toward the sea and updates its position using

$$ X_{River} \left( {t + 1} \right) = X_{River} \left( t \right) + rand \times C \times \left( {X_{Sea} \left( t \right) - X_{River} \left( t \right)} \right) $$
(3)

Similarly, if a river attains a better fitness value than the sea, they exchange roles.
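To make the flow step concrete, the following is a minimal Python sketch of Eqs. (1)–(3), not the original implementation: it assumes the population is a NumPy array whose row 0 is the sea, rows 1..Nsr−1 are rivers, and the remaining rows are streams, and that a hypothetical `stream_targets` array records which river (or the sea) each stream flows to.

```python
import numpy as np

def flow_step(pop, stream_targets, nsr, C=2.0):
    """One WCA flow step over a population of shape (S, D).
    Row 0 is the sea, rows 1..nsr-1 are rivers, rows nsr..S-1 are streams;
    stream_targets[i] is the row index of the river (or sea) stream i flows to."""
    S, _ = pop.shape
    for i in range(nsr, S):                  # Eqs. (1)-(2): streams move toward their target
        pop[i] += np.random.rand() * C * (pop[stream_targets[i]] - pop[i])
    for j in range(1, nsr):                  # Eq. (3): rivers move toward the sea
        pop[j] += np.random.rand() * C * (pop[0] - pop[j])
    return pop
```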

3 Inter-Peer Communication Mechanism Based Water Cycle Algorithm

In order to decrease information loss and enhance communication efficiency among individuals, the Inter-Peer Communication Mechanism Based Water Cycle Algorithm (IPCWCA) was proposed [14].

Unlike the original WCA, IPCWCA considers the relationship between inter-peer individuals, i.e. streams to streams and rivers to rivers. Besides learning from a higher-level individual, a stream/river can acquire information from another stream/river before the flow step.

The peer of a stream or river is selected randomly using Eqs. (4)–(5), which helps improve population diversity:

$$ I_{Stream} = fix\left( {rand \times \left( {S - Nsr} \right)} \right) + 1,\quad I_{Stream} \ne i $$
(4)
$$ I_{River} = fix\left( {rand \times \left( {Nsr - 1} \right)} \right) + 1,\quad I_{River} \ne j $$
(5)

where S is the number of individuals, Nsr is the total number of rivers plus the sea, and i (j) is the index of the current stream (river).

$$ Position_{Stream} \left( {1,d} \right) = Position_{Stream} \left( {1,d} \right) \times gauss,\quad gauss \sim N\left( {0,\left| {Position_{I_{Stream}} \left( {1,d} \right)} \right|} \right) $$
(6)
$$ Position_{River} \left( {1,d} \right) = Position_{River} \left( {1,d} \right) \times gauss,\quad gauss \sim N\left( {0,\left| {Position_{I_{River}} \left( {1,d} \right)} \right|} \right) $$
(7)

where gauss is drawn from a normal distribution with mean 0 and variance equal to the absolute value of the d-th dimension of the peer I_Stream or I_River. Note that the dimensions for inter-peer learning are selected randomly rather than learning over all dimensions.
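A hedged sketch of the inter-peer communication step follows; the `peer_pool` argument (stream indices for a stream, river indices for a river) is an assumption of this sketch, and since the paper specifies the variance of the normal distribution, the standard deviation passed to the generator is its square root.

```python
import numpy as np

def peer_communicate(pop, i, peer_pool, rng):
    """Inter-peer communication (Eqs. 4-7): individual i rescales one randomly
    chosen dimension by a Gaussian factor whose variance equals the absolute
    value of the peer's position on that dimension."""
    peers = [p for p in peer_pool if p != i]     # Eqs. (4)-(5): the peer differs from i
    peer = peers[rng.integers(len(peers))]
    d = rng.integers(pop.shape[1])               # one randomly selected learning dimension
    sigma = np.sqrt(abs(pop[peer, d]))           # variance |pos| -> std dev sqrt(|pos|)
    pop[i, d] *= rng.normal(0.0, sigma)          # Eqs. (6)-(7)
    return pop
```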

4 K-Means Algorithm

K-means is a well-known clustering method that divides data vectors into K groups, usually adopting the Euclidean metric to calculate the distance between data vectors and cluster centers.

First, K-means selects K initial centroids M = (M1, M2,…, Mj,…, MK) and assigns each data vector to a cluster Cj (j = 1,…, K) by the Euclidean metric:

$$ d\left( {X_{p} ,M_{j} } \right) = \sqrt {\sum\limits_{n = 1}^{{N_{d} }} {\left( {X_{pn} - M_{jn} } \right)^{2} } } $$
(8)

where Xp is the p-th data vector, Mj is the j-th centroid, and Nd is the dimensionality of the data vectors.

The cluster centroids are then recalculated using:

$$ M_{j} = \frac{1}{{n_{j} }}\sum\limits_{{X_{p} \in C_{j} }} {X_{p} } $$
(9)

where nj is the number of data vectors in cluster Cj.
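The two steps above can be written compactly; the following Python sketch performs one K-means iteration for data X of shape (n, Nd) and centroids M of shape (K, Nd).

```python
import numpy as np

def kmeans_step(X, M):
    """One K-means iteration: assign by Eq. (8), recompute centroids by Eq. (9)."""
    # Eq. (8): Euclidean distance from every data vector to every centroid
    dist = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    # Eq. (9): each centroid becomes the mean of its assigned data vectors
    for j in range(M.shape[0]):
        members = X[labels == j]
        if len(members):                 # keep the old centroid if a cluster is empty
            M[j] = members.mean(axis=0)
    return labels, M
```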

5 WCA + K-Means Based Methods

5.1 The IPCWCA/WCA Module

In the WCA or IPCWCA module, each individual is encoded as follows:

$$ P_{i} = \left( {M_{i1} ,M_{i2} , \ldots M_{ij} , \ldots ,M_{iK} } \right) $$

where K represents the number of clusters and Mij is the j-th cluster centroid vector of the i-th individual, i.e. the centroid of cluster Cij. The fitness function calculates the fitness value of each individual with respect to the data vectors:

$$ SSE = \sum\limits_{j = 1}^{K} {\sum\limits_{{\forall X_{p} \in C_{ij} }} {d\left( {X_{p} ,M_{ij} } \right)^{2} } } $$
(10)

where d is defined in Eq. (8) and Cij is the j-th cluster of the i-th individual.
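As a sketch, the fitness of one encoded individual can be evaluated as follows: the flat encoding P_i is reshaped into its K centroid vectors, and each data vector contributes its squared distance to the nearest centroid.

```python
import numpy as np

def sse_fitness(X, individual, K):
    """SSE fitness of Eq. (10) for one encoded individual (M_i1, ..., M_iK)."""
    M = individual.reshape(K, -1)            # K centroid vectors of length Nd
    dist = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
    return float((dist.min(axis=1) ** 2).sum())
```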

For clustering, the inter-peer communication process differs from the flow step in its learning dimensions, and three versions are derived: IPCWCA-1, IPCWCA-A and IPCWCA-R. IPCWCA-1 only gets information from the first category of a peer, IPCWCA-A learns from all of the peer’s categories, and IPCWCA-R learns from a random selection of categories. Additionally, within each category the learning dimension is chosen randomly.

As an example, suppose thousands of four-dimensional data vectors need to be divided into three categories. Each individual in the population is then a 3 × 4 matrix. The three potential ways of learning from a peer are illustrated in Fig. 1, and a sketch follows the figure.

Fig. 1. Three potential learning methods from peer X
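A hypothetical helper below illustrates how the three variants could select the peer’s categories (centroid rows) to learn from; the exact random-subset rule used by IPCWCA-R is an assumption of this sketch. Within each selected category, one dimension is then picked at random, as in Eqs. (6)–(7).

```python
import numpy as np

def rows_to_learn(K, variant, rng):
    """Which of the peer's K categories (centroid rows) to learn from."""
    if variant == "IPCWCA-1":            # only the peer's first category
        return np.array([0])
    if variant == "IPCWCA-A":            # all of the peer's categories
        return np.arange(K)
    n = rng.integers(1, K + 1)           # IPCWCA-R: a random subset of categories
    return rng.choice(K, size=n, replace=False)
```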

5.2 The K-Means Module

The K-means module runs after the WCA or IPCWCA module: it acquires its initial cluster centroids from the best individual of the previous module and then searches for the final solution. Figure 2 shows the flowchart of IPCWCA + K-means, and a simplified driver is sketched after the figure.

Fig. 2. Flowchart of IPCWCA + K-means
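A simplified driver tying the two modules together is sketched below, reusing kmeans_step and sse_fitness from the earlier sketches; evaporation, raining and the river hierarchy are omitted for brevity, so this is an outline of the pipeline rather than the full algorithm.

```python
import numpy as np

def hybrid_cluster(X, K, pop_size=50, iters=50, seed=0):
    """IPCWCA/WCA module (simplified) followed by a K-means module."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # initialize each individual as K randomly chosen data points, flattened
    pop = X[rng.choice(n, size=(pop_size, K))].reshape(pop_size, K * d)
    for _ in range(iters):
        fit = np.array([sse_fitness(X, p, K) for p in pop])
        pop = pop[np.argsort(fit)]       # row 0 becomes the sea (best individual)
        for i in range(1, pop_size):     # simplified flow: everyone moves toward the sea
            pop[i] += rng.random() * 2.0 * (pop[0] - pop[i])
    M = pop[0].reshape(K, d)             # the best individual seeds K-means
    for _ in range(iters):
        labels, M = kmeans_step(X, M)
    return labels, M
```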

6 Experiments and Results

6.1 Datasets and Experiment Settings

In this section, eight datasets from the UCI repository are selected to test the performance of the proposed algorithms: six standard datasets for data clustering and two business datasets (Australian Credit and German Credit) for customer segmentation. These datasets are described in Table 1. To reduce the negative effect of abnormal data points, all datasets are preprocessed by min-max normalization (a sketch follows Table 1). Besides SSE, accuracy is also used to assess clustering performance.

Table 1. The chosen eight datasets
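The normalization step mentioned above admits a short sketch: each feature is rescaled to [0, 1] by its minimum and maximum, with constant features mapped to 0 to avoid division by zero.

```python
import numpy as np

def min_max_normalize(X):
    """Rescale every feature of X to [0, 1]; constant features become 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)
```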

In the experiments, K-means converges quickly, within 50 iterations, while WCA and IPCWCA need more iterations to find stable solutions. For a fair comparison, the total number of iterations is set to 100: the standalone K-means algorithm runs 100 iterations, while in the hybrid methods the IPCWCA/WCA module and the K-means module run 50 iterations each. Other parameters of the WCA/IPCWCA module are set according to [14]: the number of individuals is 50, Nsr = 4 and dmax = 1e−16.

6.2 Results and Analyses

In the experiments, each algorithm is executed 30 times on each dataset. The numerical results, reported as the mean value and standard deviation of SSE and accuracy (%), are shown in Table 2. In addition, Fig. 3 shows the convergence of SSE for the WCA-based + K-means methods on the eight datasets.

Table 2. Numerical results on eight datasets
Fig. 3. Convergence of SSE for WCA-based + K-means methods on eight datasets

In general, as shown in Table 2 and Fig. 3, IPCWCA-R + K-means obtains the best SSE on seven datasets and the best accuracy on five, the best overall results among all compared algorithms. Although the other hybrid methods do not perform as well as IPCWCA-R + K-means in SSE and accuracy, they still outperform the original K-means in most cases. On the Banknote dataset, K-means and the other methods perform similarly in SSE and accuracy, possibly because Banknote is a simple, low-dimensional dataset that K-means alone can cluster well.

As for the customer segmentation datasets, they have more instances and higher dimensionality. On the Australian Credit dataset, the three proposed methods achieve better SSE and accuracy than K-means and WCA + K-means, which indicates that the three hybrid methods are suitable for this clustering problem. On the German Credit dataset, the three proposed methods still obtain better SSE but fail to achieve the best accuracy. Interestingly, on the Australian Credit dataset IPCWCA-A + K-means obtains the best result, whereas on the German Credit dataset IPCWCA-R + K-means obtains the best SSE value, which indicates that different scenarios may require different approaches and no single algorithm finds the best solution for all problems.

To compare the performances of the above algorithms from a statistical viewpoint, the Friedman test is adopted. The Friedman test is a nonparametric statistical test over multiple group measures that can determine whether a set of algorithms differ in performance. The null hypothesis H0 is: there is no difference in performance among these algorithms. The significance level is α = 0.05, and H0 is rejected when TF > Fα, where the TF-value is given by

$$ T_{F} = \frac{{\left( {N - 1} \right)T_{\chi^{2}} }}{{N\left( {k - 1} \right) - T_{\chi^{2}} }} $$
(11)
$$ T_{\chi^{2}} = \frac{12N}{{k\left( {k + 1} \right)}}\left( {\sum\nolimits_{i = 1}^{k} {R_{i}^{2} } - \frac{{k\left( {k + 1} \right)^{2} }}{4}} \right) $$
(12)

TF follows the F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom, where k and N are the numbers of algorithms and datasets respectively, i.e. k = 5 and N = 8. \( T_{\chi^{2}} \) is defined in Eq. (12), and Ri is the i-th algorithm’s average rank. Since clustering is an unsupervised method without label guidance, performance is evaluated here by SSE: the smaller the SSE, the better the clustering effect. Therefore, the mean SSE obtained by each algorithm on each dataset is used as the evaluation indicator in this Friedman test. Table 3 shows the aligned ranks of the algorithms, and a computational sketch follows the table. By Eqs. (11)–(12), the TF-value is 4.852. Because TF > F0.05(4, 28) = 2.714, H0 is rejected, i.e. the algorithms differ in performance.

Table 3. Aligned ranks of algorithms
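A sketch of the computation follows; it takes the algorithms’ average ranks (as in Table 3, whose values are not reproduced here) and obtains the critical value from SciPy’s F distribution.

```python
import numpy as np
from scipy.stats import f

def friedman_test(avg_ranks, N, alpha=0.05):
    """Friedman statistic of Eqs. (11)-(12) from the algorithms' average ranks."""
    k = len(avg_ranks)
    # Eq. (12): chi-square-based statistic over the average ranks
    t_chi2 = 12 * N / (k * (k + 1)) * (np.sum(np.square(avg_ranks)) - k * (k + 1) ** 2 / 4)
    t_f = (N - 1) * t_chi2 / (N * (k - 1) - t_chi2)      # Eq. (11)
    crit = f.ppf(1 - alpha, k - 1, (k - 1) * (N - 1))    # e.g. F_0.05(4, 28) = 2.714
    return t_f, crit
```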

To further explore how the algorithms differ, the Nemenyi post-hoc test is applied. CD is the critical difference for the algorithms’ average rank values, defined as

$$ CD = q_{\alpha } \sqrt {\frac{{k\left( {k + 1} \right)}}{6N}} $$
(13)

qα = 2.728 for α = 0.05 and k = 5. By Eq. (13), CD = 2.157. The Friedman test pattern based on the CD-value is shown in Fig. 4.

Fig. 4. Friedman test pattern
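For reference, the critical difference of Eq. (13) is a one-line computation; with k = 5, N = 8 and qα = 2.728 it reproduces the CD = 2.157 reported above.

```python
import math

def nemenyi_cd(k=5, N=8, q_alpha=2.728):
    """Nemenyi critical difference of Eq. (13); about 2.157 for k=5, N=8."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))
```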

From the results of the Friedman test, first, the null hypothesis H0 is rejected, which means the compared algorithms differ in performance. Second, according to the rank values in Table 3, IPCWCA-R + K-means has the best average rank, followed by IPCWCA-A + K-means, IPCWCA-1 + K-means, WCA + K-means and K-means. Compared with the original K-means, the three proposed methods achieve better average ranks, which means they perform better.

Finally, as illustrated in Fig. 4, the performance of IPCWCA-R + K-means is significantly different from that of K-means and WCA + K-means, which shows that the results obtained by IPCWCA-R + K-means on the datasets differ from those of the original K-means and WCA + K-means. In addition, the three proposed methods overlap in Fig. 4, so there is no significant difference in performance among them.

7 Conclusions and Further Work

Inspired by the good global search ability of IPCWCA, three hybrid clustering methods based on IPCWCA and K-means are presented and compared in this paper. According to SSE, accuracy and the statistical analyses, IPCWCA-R + K-means performed best on most of the datasets. However, on the Australian Credit customer segmentation dataset, IPCWCA-R + K-means did not perform best, which indicates that different datasets may require different approaches. Nevertheless, compared with the original K-means and WCA + K-means, the IPCWCA + K-means based methods achieve better SSE and accuracy in most cases and perform better in the Friedman test.

In future research, we will continue to improve the proposed methods to solve different kinds of clustering problems, especially on high-dimensional data. In addition, customer segmentation will be studied more comprehensively, including customer segmentation models and evaluation criteria.