Abstract
K-means is a classical clustering method, but poor initial centers make it prone to falling into local optima. Inspired by the strong global search ability of the Inter-Peer Communication Mechanism Based Water Cycle Algorithm (IPCWCA), this paper presents three hybrid methods combining IPCWCA and K-means, which are used to address this shortcoming of K-means and to explore better clustering approaches. Each hybrid method consists of two modules run in sequence, an IPCWCA module and a K-means module, meaning that the K-means module inherits the best individual from the IPCWCA module to start its clustering process. Compared with the original K-means and a WCA + K-means method on eight datasets (including two customer segmentation datasets) in terms of SSE, accuracy and the Friedman test, the proposed methods show greater potential for solving clustering problems on both simple and customer segmentation datasets.
1 Introduction
In data mining, clustering is one of the most commonly used methods for dividing a set of unlabeled data into related clusters. Since clustering requires no prior knowledge about the data, it can uncover hidden information in the data. Among the many clustering algorithms, K-means is one of the most popular for its efficiency and simplicity, but it is prone to getting trapped in local optima when the initial centroids are poor [1].
Nature-inspired heuristic algorithms such as the Genetic Algorithm (GA) [2,3,4], Particle Swarm Optimization (PSO) [5,6,7] and Ant Colony Optimization (ACO) [8, 9] perform well in data clustering and have attracted scholars to apply them to clustering problems. In this context, the Water Cycle Algorithm (WCA) was proposed by Eskandar et al. [10]; it models the processes of the water cycle and how streams and rivers flow to the sea.
In WCA, besides the main flow step, evaporation and raining are also important components, which help WCA escape from local optima. Many improvements of WCA have been proposed to enhance its performance. Chen et al. [11] presented the Hierarchical Learning WCA (HLWCA), which divides solutions into collections with hierarchy differences to improve WCA's global search ability. Al-Rawashdeh et al. [12] applied a hybrid of Water Cycle and Simulated Annealing to improve the accuracy of feature selection and to evaluate their proposed spam detection method. Bahreininejad [13] studied the impact of the Augmented Lagrange Method (ALM) on WCA and presented the WCA-ALM algorithm to enhance convergence and solution quality. In 2019, the Inter-Peer Communication Mechanism Based Water Cycle Algorithm (IPCWCA) was presented by Niu et al. [14], which utilizes information communication between inter-peer individuals to enhance the performance of the whole WCA. In IPCWCA, each stream and river learns and obtains information from one of its peers on some dimensions before the flow step, which also helps improve population diversity.
In this paper, we combine IPCWCA with K-means and apply it to clustering analysis, including data clustering and customer segmentation. This kind of method can be divided into an IPCWCA module and a K-means module: the IPCWCA module is executed first to obtain a global best individual, and the K-means module then inherits this individual to continue the clustering process. SSE (sum of squared errors) is adopted as the fitness function to judge clustering performance: the smaller the SSE, the better the clustering. In addition, the Friedman test is used to compare the performances of the algorithms from a statistical viewpoint.
The rest of the paper is organized as follows: Sects. 2, 3 and 4 introduce the Water Cycle Algorithm, the Inter-Peer Communication Mechanism Based Water Cycle Algorithm (IPCWCA) and the K-means algorithm, respectively. Section 5 presents the series of WCA + K-means based methods in detail. Section 6 discusses the experiments and results, and Sect. 7 presents the conclusions of the work.
2 Water Cycle Algorithm
Water Cycle Algorithm (WCA), simulating natural phenomenon of water cycle, is originally presented to address engineering optimization problems. WCA mainly consists of three steps: flow, evaporation and raining.
Specifically, WCA pays attention to the flow among streams, rivers and the sea. The sea is the best individual in the whole population, while rivers are good individuals that are inferior to the sea; the remaining individuals are considered streams.
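The role assignment above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `split_population` and the array layout are assumptions, and Nsr counts the sea plus the rivers (as stated later in the experiment settings).

```python
import numpy as np

def split_population(population, fitness_values, Nsr):
    """Partition a WCA population by fitness (minimisation): the best
    individual becomes the sea, the next Nsr - 1 become rivers and the
    remaining individuals are streams."""
    order = np.argsort(fitness_values)      # ascending: best first
    sea = population[order[0]]
    rivers = population[order[1:Nsr]]
    streams = population[order[Nsr:]]
    return sea, rivers, streams
```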
After the flow step, a stream's position is updated using

$$X_{stream}^{t+1} = X_{stream}^{t} + rand \times C \times \left(X_{river}^{t} - X_{stream}^{t}\right), \quad rand \in [0, 1],\ C \in (1, 2]$$
Then, if the fitness value of a stream is better than that of its river or the sea, the two exchange roles.
A river's position is updated after flowing to the sea, using

$$X_{river}^{t+1} = X_{river}^{t} + rand \times C \times \left(X_{sea}^{t} - X_{river}^{t}\right)$$
Similarly, if a river has a better fitness value than the sea, the two exchange roles.
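The flow-and-exchange steps can be sketched as below, assuming the standard WCA update rule X(t+1) = X(t) + rand × C × (target − X(t)) from [10]; the names `flow` and `flow_and_swap` are illustrative, not from the paper.

```python
import numpy as np

def flow(position, target, C=2.0, rng=None):
    """Standard WCA flow update: X(t+1) = X(t) + rand * C * (target - X(t)),
    where rand ~ U(0, 1); with C near 2 the mover can overshoot its target."""
    rng = rng or np.random.default_rng()
    return position + rng.random() * C * (target - position)

def flow_and_swap(mover, target, fitness):
    """A stream flows toward its river (or a river toward the sea); if the
    moved individual obtains a better (smaller) fitness than its target,
    the two exchange roles.  Returns (lower-role, higher-role) positions."""
    moved = flow(mover, target)
    if fitness(moved) < fitness(target):    # minimisation: smaller is better
        return target, moved                # roles exchanged
    return moved, target
```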
3 Inter-Peer Communication Mechanism Based Water Cycle Algorithm
In order to reduce information loss and enhance communication efficiency among individuals, the Inter-Peer Communication Mechanism Based Water Cycle Algorithm (IPCWCA) was presented.
Unlike the original WCA, IPCWCA considers the relationships between inter-peer individuals, i.e. streams to streams and rivers to rivers. Besides learning from a higher-level individual, in IPCWCA a stream/river can acquire information from another stream/river before the flow step.
The peer of a stream or river is determined randomly, which helps improve population diversity, using Eqs. (4)-(5):
where S is the number of individuals and Nsr is the total number of rivers plus the sea.
where "gauss" is a normal distribution with mean 0 and a variance equal to the absolute value of the peer stream's or river's d-th dimension. Note that the dimensions on which inter-peer learning takes place are selected randomly, rather than learning on all dimensions.
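One plausible reading of this inter-peer step is sketched below; since the original equations are not reproduced here, the exact update form is an assumption: a random non-empty subset of dimensions is redrawn around the peer's value with a Gaussian whose variance is the absolute value of the peer's component. The name `peer_communicate` is illustrative.

```python
import numpy as np

def peer_communicate(individual, peer, rng=None):
    """Sketch of inter-peer communication: on a randomly chosen subset of
    dimensions, the individual takes the peer's value perturbed by
    gauss(0, |peer_d|), absorbing peer information before the flow step."""
    rng = rng or np.random.default_rng()
    out = individual.copy()
    n_dims = individual.shape[0]
    # learn on a random, non-empty subset of dimensions
    dims = rng.choice(n_dims, size=rng.integers(1, n_dims + 1), replace=False)
    for d in dims:
        sigma = np.sqrt(abs(peer[d]))       # variance = |peer_d|
        out[d] = peer[d] + rng.normal(0.0, sigma)
    return out
```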
4 K-Means Algorithm
K-means is a well-known clustering method that divides data vectors into K groups, usually adopting the Euclidean metric to calculate the distance between data vectors and cluster centers.
First, K-means selects K initial centroids (M = (M1, M2, …, Mj, …, MK)) and assigns each data vector to a cluster Cj (j = 1, …, K) by the Euclidean metric:

$$d(X_p, M_j) = \sqrt{\sum_{k=1}^{N_d} \left(X_{pk} - M_{jk}\right)^2} \quad (8)$$

where Xp is the p-th data vector, Mj is the j-th centroid and Nd is the dimension of a data vector.
In K-means, the cluster centroids are then recalculated using:

$$M_j = \frac{1}{n_j} \sum_{X_p \in C_j} X_p \quad (9)$$
where nj is the number of data vectors in cluster Cj.
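The two steps above, assignment by Euclidean distance and centroid recomputation by the cluster mean, can be sketched in one iteration as follows; the function name `kmeans_step` and the empty-cluster handling are illustrative assumptions.

```python
import numpy as np

def kmeans_step(X, M):
    """One K-means iteration: assign each data vector to its nearest
    centroid by Euclidean distance, then recompute each centroid as the
    mean of its cluster.  Empty clusters keep their old centroid."""
    # pairwise distances, shape (n_points, K)
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    new_M = M.copy()
    for j in range(M.shape[0]):
        members = X[labels == j]
        if len(members):
            new_M[j] = members.mean(axis=0)
    return labels, new_M
```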
5 WCA + K-Means Based Methods
5.1 The IPCWCA/WCA Module
In the WCA or IPCWCA module, each individual is encoded as follows:

$$Ind_i = \left(M_{i1}, M_{i2}, \ldots, M_{ij}, \ldots, M_{iK}\right)$$
where K represents the number of clusters and Mij is the j-th cluster centroid vector of the i-th individual, i.e. the centroid of cluster Cij. The fitness function calculates the fitness value of each individual over the data vectors, which can be described as:
where d is defined in Eq. (8); nij is the number of data vectors in cluster Cij.
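Since SSE is the evaluation criterion used throughout the paper, the fitness evaluation can be sketched as plain SSE over an individual's centroid matrix. This is an assumption: the paper's exact fitness also involves the cluster sizes n_ij, which a plain SSE does not use, and the name `sse_fitness` is illustrative.

```python
import numpy as np

def sse_fitness(individual, X):
    """SSE of one individual: `individual` is a (K, Nd) matrix of cluster
    centroids; each data vector contributes its squared Euclidean distance
    to the nearest centroid.  Smaller is better."""
    d = np.linalg.norm(X[:, None, :] - individual[None, :, :], axis=2)
    return float(np.sum(d.min(axis=1) ** 2))
```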
For clustering, the inter-peer communication process differs from the flow step in its learning dimensions, which leads to three versions: IPCWCA-1, IPCWCA-A and IPCWCA-R. IPCWCA-1 only obtains information from the first category of a peer, IPCWCA-A learns from all of the peer's categories, and IPCWCA-R learns from a random subset of them. Additionally, the dimension learned within each category is chosen at random.
For example, suppose thousands of four-dimensional data vectors need to be divided into three categories; each individual in the population is then a 3 × 4 matrix. The three possible ways of learning from a peer are illustrated in Fig. 1.
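The difference between the three variants reduces to which centroid rows of a (K × Nd) individual learn from the peer. A minimal sketch, with the function name `peer_rows` assumed for illustration:

```python
import numpy as np

def peer_rows(K, variant, rng):
    """Which cluster-centroid rows of a (K x Nd) individual learn from the
    peer: '1' -> only the first category, 'A' -> all categories,
    'R' -> a random non-empty subset of categories."""
    if variant == "1":
        return np.array([0])
    if variant == "A":
        return np.arange(K)
    # variant 'R': random non-empty subset
    return rng.choice(K, size=rng.integers(1, K + 1), replace=False)
```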
5.2 The K-Means Module
The K-means module runs after the WCA or IPCWCA module, acquiring its initial cluster centroids from the best individual of the previous module and then searching for the final solution. Figure 2 shows the flowchart of IPCWCA + K-means.
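The two-module pipeline can be sketched as a simple driver; both module callables stand in for the full algorithms and are assumptions, as is the even split of the iteration budget (the paper uses 50 + 50 iterations, as described in Sect. 6.1).

```python
def hybrid_clustering(X, K, ipcwca_module, kmeans_module, split=50):
    """Hybrid pipeline sketch: run the IPCWCA/WCA module for the first
    `split` iterations, take its best individual (a K x Nd centroid
    matrix), and hand it to K-means as initial centroids for the rest."""
    best_individual = ipcwca_module(X, K, iterations=split)
    return kmeans_module(X, init_centroids=best_individual, iterations=split)
```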
6 Experiments and Results
6.1 Datasets and Experiment Settings
In this section, eight datasets from UCI are selected to test the performance of the proposed algorithms, including six simple datasets for data clustering and two business datasets (Australian Credit and German Credit) for customer segmentation. The datasets are described in Table 1. To reduce the negative effects of abnormal data points, all datasets are preprocessed by min-max normalization. Besides SSE, accuracy is also used to evaluate clustering performance.
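The min-max preprocessing can be sketched per feature as below; the constant-feature guard and the function name are illustrative assumptions.

```python
import numpy as np

def min_max_normalize(X):
    """Min-max normalisation: rescale every feature (column) of X to
    [0, 1].  Constant features are mapped to 0 to avoid division by zero."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span
```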
In the experiments, K-means converges quickly, within 50 iterations, while WCA and IPCWCA need more iterations to find stable solutions. For a fair comparison, the total number of iterations is set to 100: the K-means algorithm runs for 100 iterations, while in the hybrid methods the IPCWCA/WCA module and the K-means module run for 50 iterations each. The other parameters of the WCA/IPCWCA module are set according to [14]: the number of individuals is 50, Nsr = 4 and dmax = 1e−16.
6.2 Results and Analyses
In the experiments, each algorithm is executed 30 times on each dataset. The mean value and standard deviation of SSE and accuracy (%) are reported in Table 2. In addition, Fig. 3 shows the convergence of SSE for the WCA-based + K-means methods on the eight datasets.
In general, as shown in Table 2 and Fig. 3, IPCWCA-R + K-means obtains the best SSE on seven datasets and the best accuracy on five, the best overall results among all algorithms on the eight datasets. Although the other hybrid methods do not perform as well as IPCWCA-R + K-means in SSE and accuracy, they still outperform the original K-means in most cases. On the Banknote dataset, K-means and the other methods perform similarly in SSE and accuracy, possibly because Banknote is a simple, low-dimensional dataset that K-means can already solve well.
The customer segmentation datasets have more instances and higher dimensionality. On the Australian Credit dataset, the three proposed methods obtain better SSE and accuracy than K-means and WCA + K-means, which indicates that the three hybrid methods are applicable to this clustering problem. On the German Credit dataset, the three proposed methods still obtain better SSE, but fail to achieve the best accuracy. Interestingly, on the Australian Credit dataset IPCWCA-A + K-means obtains the optimal result, while on the German Credit dataset IPCWCA-R + K-means obtains the optimal SSE value, which indicates that different scenarios may require different approaches and no single algorithm finds the best solution for all problems.
In order to compare the performances of the above algorithms from a statistical viewpoint, the Friedman test is adopted. The Friedman test is a nonparametric statistical test over multiple group measures which can determine whether a set of algorithms differ in performance. The null hypothesis H0 is: there is no difference in performance among these algorithms. The significance level is α = 0.05. We reject H0 when TF > Fα, where the TF-value is given by

$$T_F = \frac{(N-1)\, T_{\chi^2}}{N(k-1) - T_{\chi^2}} \quad (11)$$
TF follows the F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom, where k and N are the numbers of algorithms and datasets respectively, i.e. k = 5, N = 8. \( T_{\chi^2} \) is defined in Eq. (12):

$$T_{\chi^2} = \frac{12N}{k(k+1)} \left( \sum_{i=1}^{k} R_i^2 - \frac{k(k+1)^2}{4} \right) \quad (12)$$

where Ri is the i-th algorithm's average rank value. As an unsupervised method without label information to guide it, clustering performance is evaluated by SSE in this paper, i.e. the smaller the SSE, the better the clustering effect. Therefore, in this Friedman test the mean SSE obtained by the compared algorithms on each dataset is used as the evaluation indicator. Table 3 shows the aligned ranks of the algorithms. By Eqs. (11)-(12), the TF-value is 4.852. Because TF > F0.05(4, 28) = 2.714, H0 is rejected, i.e. these algorithms differ in performance.
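The TF computation can be sketched directly from the average ranks; the function name `friedman_tf` is assumed for illustration (the statistic is the standard Iman-Davenport F-form of the Friedman test).

```python
def friedman_tf(avg_ranks, N):
    """Friedman statistic in its F-distributed (Iman-Davenport) form:
    chi2 = 12N/(k(k+1)) * (sum(R_i^2) - k(k+1)^2/4)
    TF   = (N-1)*chi2 / (N*(k-1) - chi2)
    where avg_ranks are the k algorithms' average ranks over N datasets."""
    k = len(avg_ranks)
    chi2 = 12.0 * N / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    return (N - 1) * chi2 / (N * (k - 1) - chi2)
```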
To further explore how these algorithms differ, the Nemenyi post-hoc test is used. CD is the critical difference for the algorithms' average rank values, defined as

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}} \quad (13)$$
With qα = 2.728 (α = 0.05), Eq. (13) gives CD = 2.157. The Friedman test pattern based on the CD-value is shown in Fig. 4.
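The CD value follows directly from Eq. (13), and the paper's numbers (qα = 2.728, k = 5, N = 8) reproduce CD ≈ 2.157; the function name `nemenyi_cd` is assumed for illustration.

```python
import math

def nemenyi_cd(q_alpha, k, N):
    """Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1) / (6N)).
    Two algorithms differ significantly when their average ranks differ
    by more than CD."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * N))
```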
From the results of the Friedman test, first, the null hypothesis H0 is rejected, meaning the compared algorithms differ in performance. Second, according to the rank values in Table 3, IPCWCA-R + K-means has the best average rank, followed by IPCWCA-A + K-means, IPCWCA-1 + K-means, WCA + K-means and K-means. Compared with the original K-means, the three proposed methods obtain better average ranks, i.e. better performance.
Finally, as illustrated in Fig. 4, the performance of IPCWCA-R + K-means is significantly different from that of K-means and WCA + K-means, which shows that the results obtained with IPCWCA-R + K-means on the datasets differ from the final results of the original K-means and WCA + K-means. In addition, the three proposed methods overlap in Fig. 4, so there is no significant difference in performance among them.
7 Conclusions and Further Work
Inspired by the good global search ability of IPCWCA, three hybrid clustering methods based on IPCWCA and K-means are presented and compared in this paper. According to SSE, accuracy and the statistical analyses, IPCWCA-R + K-means behaves best on most of the datasets. However, on the customer segmentation Australian Credit dataset, IPCWCA-R + K-means does not perform well, which indicates that different datasets may require different approaches. Fortunately, compared with the original K-means and WCA + K-means, the IPCWCA + K-means based methods behave better in SSE and accuracy on most occasions and perform better in the Friedman test.
In future research, we will continue to improve the proposed methods to solve different kinds of clustering problems, especially on high-dimensional data. In addition, customer segmentation problems will be studied more comprehensively, including customer segmentation models and evaluation criteria.
References
Pollard, D.: A central limit theorem for K-means clustering. Ann. Probab. 10(4), 919–926 (1982)
Dutta, D., Sil, J., Dutta, P.: Automatic clustering by multi-objective genetic algorithm with numeric and categorical features. Expert Syst. Appl. 137, 357–379 (2019)
Mustafi, D., Sahoo, G.: A hybrid approach using genetic algorithm and the differential evolution heuristic for enhanced initialization of the K-means algorithm with applications in text clustering. Soft. Comput. 23(15), 6361–6378 (2019)
Gribel, D., Vidal, T.: HG-means: a scalable hybrid genetic algorithm for minimum sum-of-squares clustering. Pattern Recogn. 88, 569–583 (2019)
Lai, D.T.C., Miyakawa, M., Sato, Y.: Semi-supervised data clustering using particle swarm optimisation. Soft. Comput. 24(5), 3499–3510 (2020)
Janani, R., Vijayarani, S.: Text document clustering using spectral clustering algorithm with particle swarm optimization. Expert Syst. Appl. 134, 192–200 (2019)
Liu, W.B., Wang, Z.D., Liu, X.H., Zeng, N.Y., Bell, D.: A novel particle swarm optimization approach for patient clustering from emergency departments. IEEE Trans. Evol. Comput. 23(4), 632–644 (2019)
Menendez, H.D., Otero, F.E.B., Camacho, D.: Medoid-based clustering using ant colony optimization. Swarm Intell. 10(2), 123–145 (2016)
Inkaya, T., Kayaligil, S., Ozdemirel, N.E.: Ant colony optimization based clustering methodology. Appl. Soft Comput. 28, 301–311 (2015)
Eskandar, H., Sadollah, A., Bahreininejad, A., Hamdi, M.: Water cycle algorithm–A novel metaheuristic optimization method for solving constrained engineering optimization problems. Comput. Struct. 110–111(10), 151–166 (2012)
Chen, C.H., Wang, P., Dong, H.C., Wang, X.J.: Hierarchical learning water cycle algorithm. Appl. Soft Comput. 86, 105935 (2020). https://doi.org/10.1016/j.asoc.2019
Al-Rawashdeh, G., Mamat, R., Abd Rahim, N.H.B.: Hybrid water cycle optimization algorithm with simulated annealing for spam E-mail detection. IEEE Access. 7, 143721–143734 (2019)
Bahreininejad, A.: Improving the performance of water cycle algorithm using augmented lagrangian method. Adv. Eng. Softw. 132, 55–64 (2019)
Niu, B., Liu, H., Song, X.: An inter-peer communication mechanism based water cycle algorithm. In: Tan, Y., Shi, Y.H., Niu, B. (eds.) Advances in Swarm Intelligence. LNCS, vol. 11655, pp. 50–59. Springer, Chiang Mai (2019). https://doi.org/10.1007/978-3-030-26369-0_5
Acknowledgement
The work described in this paper was supported by Innovating and Upgrading Institute Project from Department of Education of Guangdong Province (2017GWTSCX038), Innovative Talent Projects in Guangdong Universities (2018GWQNCX143), Guangdong Province Soft Science Project (2019A101002075), Guangdong Province Educational Science Plan 2019 (2019JKCY010) and Guangdong Province Postgraduate Education Innovation Research Project (2019SFKC46).
Liu, H., Tan, L., Jin, L., Niu, B. (2020). Improved Water Cycle Algorithm and K-Means Based Method for Data Clustering. In: Huang, DS., Premaratne, P. (eds) Intelligent Computing Methodologies. ICIC 2020. Lecture Notes in Computer Science, vol. 12465. Springer, Cham. https://doi.org/10.1007/978-3-030-60796-8_50