1 Introduction

In data mining, clustering is one of the most commonly used methods for dividing a set of unlabeled data into related clusters. Because clustering requires no prior knowledge about the data, it can uncover hidden information in the data. Among the many clustering algorithms, K-means is one of the most popular for its high efficiency and simplicity, but it is prone to getting trapped in local optima when initialized with poor centroids [1].

Nature-inspired heuristic algorithms, such as the Genetic Algorithm (GA) [2,3,4], Particle Swarm Optimization (PSO) [5,6,7], and Ant Colony Optimization (ACO) [8, 9], have attracted scholars to apply them to clustering problems and perform well in data clustering. In this context, the Water Cycle Algorithm (WCA) was proposed by Eskandar et al. [10], inspired by the water cycle and the way streams and rivers flow to the sea.

In WCA, besides the main flow step, evaporation and raining are also important components, which help WCA escape local optima. Many improvements to WCA have been proposed to enhance its performance. Chen et al. [11] presented the Hierarchical Learning WCA (HLWCA), which divides the solutions into collections with hierarchical differences to improve WCA’s global search ability. Al-Rawashdeh et al. [12] applied a hybrid of the Water Cycle Algorithm and Simulated Annealing to improve the accuracy of feature selection and to evaluate their proposed spam detection method. Bahreininejad [13] studied the impact of the Augmented Lagrange Method (ALM) on WCA and presented the WCA-ALM algorithm to enhance convergence and solution quality. In 2019, an Inter-Peer Communication Mechanism Based Water Cycle Algorithm (IPCWCA) was presented by Niu et al. [14], which utilizes information communication between inter-peer individuals to enhance the performance of the whole WCA. In IPCWCA, each stream and river learns information from one of its peers on some dimensions before the flow step, which also helps improve population diversity.

In this paper, we combine IPCWCA with K-means and apply it to clustering analysis, including data clustering and customer segmentation. The method consists of an IPCWCA module and a K-means module: the IPCWCA module is executed first to obtain a globally best individual, and the K-means module then inherits this individual to continue the clustering process. The sum of squared errors (SSE) is adopted as the fitness function to judge clustering performance: the smaller the SSE, the better the clustering. In addition, the Friedman test is used to compare the performance of the algorithms from a statistical viewpoint.

The rest of the paper is organized as follows: Sects. 2, 3 and 4 introduce the Water Cycle Algorithm, the Inter-Peer Communication Mechanism Based Water Cycle Algorithm (IPCWCA), and the K-means algorithm, respectively. Section 5 presents the series of WCA + K-means based methods in detail. Section 6 discusses the experiments and results. Section 7 presents the conclusions of the work.

2 Water Cycle Algorithm

The Water Cycle Algorithm (WCA), which simulates the natural phenomenon of the water cycle, was originally proposed to address engineering optimization problems. WCA mainly consists of three steps: flow, evaporation and raining.

Specifically, WCA focuses on the flow among streams, rivers and the sea. The sea is the best individual in the whole population, rivers are good individuals that are inferior only to the sea, and the remaining individuals are considered streams.

During the flow step, a stream updates its position using Eq. (1) if it flows directly to the sea, or Eq. (2) if it flows to a river:

$$ X_{Stream} \left( {t + 1} \right) = X_{Stream} \left( t \right) + rand \times C \times \left( {X_{Sea} \left( t \right) - X_{Stream} \left( t \right)} \right) $$
(1)
$$ X_{Stream} \left( {t + 1} \right) = X_{Stream} \left( t \right) + rand \times C \times \left( {X_{River} \left( t \right) - X_{Stream} \left( t \right)} \right) $$
(2)

where rand is a uniformly distributed random number in [0, 1] and C is a constant between 1 and 2 (typically C = 2) [10]. After flowing, if a stream’s fitness value becomes better than that of its river or the sea, the two exchange roles.

A river, in turn, flows toward the sea and updates its position using

$$ X_{River} \left( {t + 1} \right) = X_{River} \left( t \right) + rand \times C \times \left( {X_{Sea} \left( t \right) - X_{River} \left( t \right)} \right) $$
(3)

Similarly, if a river attains a better fitness value than the sea, they exchange roles.
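To make the flow step concrete, the following is a minimal Python sketch of Eqs. (1)–(3), not the original implementation: it assumes the population is a NumPy array whose row 0 is the sea, rows 1..Nsr−1 are rivers, and the remaining rows are streams, and that a hypothetical `stream_targets` array records which river (or the sea) each stream flows to.

```python
import numpy as np

def flow_step(pop, stream_targets, nsr, C=2.0):
    """One WCA flow step over a population of shape (S, D).
    Row 0 is the sea, rows 1..nsr-1 are rivers, rows nsr..S-1 are streams;
    stream_targets[i] is the row index of the river (or sea) stream i flows to."""
    S, _ = pop.shape
    for i in range(nsr, S):                  # Eqs. (1)-(2): streams move toward their target
        pop[i] += np.random.rand() * C * (pop[stream_targets[i]] - pop[i])
    for j in range(1, nsr):                  # Eq. (3): rivers move toward the sea
        pop[j] += np.random.rand() * C * (pop[0] - pop[j])
    return pop
```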

3 Inter-Peer Communication Mechanism Based Water Cycle Algorithm

In order to decrease information loss and enhance communication efficiency among individuals, the Inter-Peer Communication Mechanism Based Water Cycle Algorithm (IPCWCA) was proposed [14].

Unlike the original WCA, IPCWCA considers the relationship between inter-peer individuals, i.e. streams to streams and rivers to rivers. Besides learning from a higher-level individual, a stream/river can acquire information from another stream/river before the flow step.

The peer of a stream or river is selected randomly using Eqs. (4)–(5), which helps improve population diversity:

$$ I_{Stream} = fix\left( {rand \times \left( {S - Nsr} \right)} \right) + 1,\quad I_{Stream} \ne i $$
(4)
$$ I_{River} = fix\left( {rand \times \left( {Nsr - 1} \right)} \right) + 1,\quad I_{River} \ne j $$
(5)

where S is the number of individuals, Nsr is the total number of rivers plus the sea, and i (j) is the index of the current stream (river).

$$ Position_{Stream} \left( {1,d} \right) = Position_{Stream} \left( {1,d} \right) \times gauss,\quad gauss \sim N\left( {0,\left| {Position_{I_{Stream}} \left( {1,d} \right)} \right|} \right) $$
(6)
$$ Position_{River} \left( {1,d} \right) = Position_{River} \left( {1,d} \right) \times gauss,\quad gauss \sim N\left( {0,\left| {Position_{I_{River}} \left( {1,d} \right)} \right|} \right) $$
(7)

where gauss is drawn from a normal distribution with mean 0 and variance equal to the absolute value of the d-th dimension of the peer I_Stream or I_River. Note that the dimensions for inter-peer learning are selected randomly rather than learning over all dimensions.
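A hedged sketch of the inter-peer communication step follows; the `peer_pool` argument (stream indices for a stream, river indices for a river) is an assumption of this sketch, and since the paper specifies the variance of the normal distribution, the standard deviation passed to the generator is its square root.

```python
import numpy as np

def peer_communicate(pop, i, peer_pool, rng):
    """Inter-peer communication (Eqs. 4-7): individual i rescales one randomly
    chosen dimension by a Gaussian factor whose variance equals the absolute
    value of the peer's position on that dimension."""
    peers = [p for p in peer_pool if p != i]     # Eqs. (4)-(5): the peer differs from i
    peer = peers[rng.integers(len(peers))]
    d = rng.integers(pop.shape[1])               # one randomly selected learning dimension
    sigma = np.sqrt(abs(pop[peer, d]))           # variance |pos| -> std dev sqrt(|pos|)
    pop[i, d] *= rng.normal(0.0, sigma)          # Eqs. (6)-(7)
    return pop
```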

4 K-Means Algorithm

K-means is a well-known clustering method that divides data vectors into K groups, usually adopting the Euclidean metric to calculate the distance between data vectors and cluster centers.

First, K-means selects K initial centroids M = (M1, M2,…, Mj,…, MK) and assigns each data vector to a cluster Cj (j = 1,…, K) by the Euclidean metric:

$$ d\left( {X_{p} ,M_{j} } \right) = \sqrt {\sum\limits_{n = 1}^{{N_{d} }} {\left( {X_{pn} - M_{jn} } \right)^{2} } } $$
(8)

where Xp is the p-th data vector, Mj is the j-th centroid, and Nd is the dimensionality of the data vectors.

The cluster centroids are then recalculated using:

$$ M_{j} = \frac{1}{{n_{j} }}\sum\limits_{{X_{p} \in C_{j} }} {X_{p} } $$
(9)

where nj is the number of data vectors in cluster Cj.
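The two steps above can be written compactly; the following Python sketch performs one K-means iteration for data X of shape (n, Nd) and centroids M of shape (K, Nd).

```python
import numpy as np

def kmeans_step(X, M):
    """One K-means iteration: assign by Eq. (8), recompute centroids by Eq. (9)."""
    # Eq. (8): Euclidean distance from every data vector to every centroid
    dist = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    # Eq. (9): each centroid becomes the mean of its assigned data vectors
    for j in range(M.shape[0]):
        members = X[labels == j]
        if len(members):                 # keep the old centroid if a cluster is empty
            M[j] = members.mean(axis=0)
    return labels, M
```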

5 WCA + K-Means Based Methods

5.1 The IPCWCA/WCA Module

In the WCA or IPCWCA module, each individual is encoded as follows:

$$ P_{i} = \left( {M_{i1} ,M_{i2} , \ldots M_{ij} , \ldots ,M_{iK} } \right) $$

where K represents the number of clusters and Mij is the j-th cluster centroid vector of the i-th individual, i.e. the centroid of cluster Cij. The fitness function calculates the fitness value of each individual with respect to the data vectors:

$$ SSE = \sum\limits_{j = 1}^{K} {\sum\limits_{{\forall X_{p} \in C_{ij} }} {d\left( {X_{p} ,M_{ij} } \right)^{2} } } $$
(10)

where d is defined in Eq. (8) and Cij is the j-th cluster of the i-th individual.
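As a sketch, the fitness of one encoded individual can be evaluated as follows: the flat encoding P_i is reshaped into its K centroid vectors, and each data vector contributes its squared distance to the nearest centroid.

```python
import numpy as np

def sse_fitness(X, individual, K):
    """SSE fitness of Eq. (10) for one encoded individual (M_i1, ..., M_iK)."""
    M = individual.reshape(K, -1)            # K centroid vectors of length Nd
    dist = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)
    return float((dist.min(axis=1) ** 2).sum())
```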

For clustering, the inter-peer communication process differs from the flow step in its learning dimensions, and three versions are derived: IPCWCA-1, IPCWCA-A and IPCWCA-R. IPCWCA-1 only gets information from the first category of a peer, IPCWCA-A learns from all of the peer’s categories, and IPCWCA-R learns from a random selection of categories. Additionally, within each category the learning dimension is chosen randomly.

As an example, suppose thousands of four-dimensional data vectors need to be divided into three categories. Each individual in the population is then a 3 × 4 matrix. The three potential ways of learning from a peer are illustrated in Fig. 1, and a sketch follows the figure.

Fig. 1. Three potential learning methods from peer X
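A hypothetical helper below illustrates how the three variants could select the peer’s categories (centroid rows) to learn from; the exact random-subset rule used by IPCWCA-R is an assumption of this sketch. Within each selected category, one dimension is then picked at random, as in Eqs. (6)–(7).

```python
import numpy as np

def rows_to_learn(K, variant, rng):
    """Which of the peer's K categories (centroid rows) to learn from."""
    if variant == "IPCWCA-1":            # only the peer's first category
        return np.array([0])
    if variant == "IPCWCA-A":            # all of the peer's categories
        return np.arange(K)
    n = rng.integers(1, K + 1)           # IPCWCA-R: a random subset of categories
    return rng.choice(K, size=n, replace=False)
```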

5.2 The K-Means Module

The K-means module runs after the WCA or IPCWCA module: it acquires its initial cluster centroids from the best individual of the previous module and then searches for the final solution. Figure 2 shows the flowchart of IPCWCA + K-means, and a simplified driver is sketched after the figure.

Fig. 2. Flowchart of IPCWCA + K-means
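A simplified driver tying the two modules together is sketched below, reusing kmeans_step and sse_fitness from the earlier sketches; evaporation, raining and the river hierarchy are omitted for brevity, so this is an outline of the pipeline rather than the full algorithm.

```python
import numpy as np

def hybrid_cluster(X, K, pop_size=50, iters=50, seed=0):
    """IPCWCA/WCA module (simplified) followed by a K-means module."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # initialize each individual as K randomly chosen data points, flattened
    pop = X[rng.choice(n, size=(pop_size, K))].reshape(pop_size, K * d)
    for _ in range(iters):
        fit = np.array([sse_fitness(X, p, K) for p in pop])
        pop = pop[np.argsort(fit)]       # row 0 becomes the sea (best individual)
        for i in range(1, pop_size):     # simplified flow: everyone moves toward the sea
            pop[i] += rng.random() * 2.0 * (pop[0] - pop[i])
    M = pop[0].reshape(K, d)             # the best individual seeds K-means
    for _ in range(iters):
        labels, M = kmeans_step(X, M)
    return labels, M
```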

6 Experiments and Results

6.1 Datasets and Experiment Settings

In this section, eight datasets from the UCI repository are selected to test the performance of the proposed algorithms: six standard datasets for data clustering and two business datasets (Australian Credit and German Credit) for customer segmentation. These datasets are described in Table 1. To reduce the negative effect of abnormal data points, all datasets are preprocessed by min-max normalization (a sketch follows Table 1). Besides SSE, accuracy is also used to assess clustering performance.

Table 1. The chosen eight datasets
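The normalization step mentioned above admits a short sketch: each feature is rescaled to [0, 1] by its minimum and maximum, with constant features mapped to 0 to avoid division by zero.

```python
import numpy as np

def min_max_normalize(X):
    """Rescale every feature of X to [0, 1]; constant features become 0."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)
```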

In the experiments, K-means converges quickly, within 50 iterations, while WCA and IPCWCA need more iterations to find stable solutions. For a fair comparison, the total number of iterations is set to 100: the standalone K-means algorithm runs 100 iterations, while in the hybrid methods the IPCWCA/WCA module and the K-means module run 50 iterations each. Other parameters of the WCA/IPCWCA module are set according to [14]: the number of individuals is 50, Nsr = 4 and dmax = 1e−16.

6.2 Results and Analyses

In the experiments, each algorithm is executed 30 times on each dataset. The numerical results, reported as the mean value and standard deviation of SSE and accuracy (%), are shown in Table 2. In addition, Fig. 3 shows the convergence of SSE for the WCA-based + K-means methods on the eight datasets.

Table 2. Numerical results on eight datasets
Fig. 3. Convergence of SSE for WCA-based + K-means methods on eight datasets

In general, as shown in Table 2 and Fig. 3, IPCWCA-R + K-means obtains the best SSE on seven datasets and the best accuracy on five, the best overall results among all compared algorithms. Although the other hybrid methods do not perform as well as IPCWCA-R + K-means in SSE and accuracy, they still outperform the original K-means in most cases. On the Banknote dataset, K-means and the other methods perform similarly in SSE and accuracy, possibly because Banknote is a simple, low-dimensional dataset that K-means alone can cluster well.

As for the customer segmentation datasets, they have more instances and higher dimensionality. On the Australian Credit dataset, the three proposed methods achieve better SSE and accuracy than K-means and WCA + K-means, which indicates that the three hybrid methods are suitable for this clustering problem. On the German Credit dataset, the three proposed methods still obtain better SSE but fail to achieve the best accuracy. Interestingly, on the Australian Credit dataset IPCWCA-A + K-means obtains the best result, whereas on the German Credit dataset IPCWCA-R + K-means obtains the best SSE value, which indicates that different scenarios may require different approaches and no single algorithm finds the best solution for all problems.

To compare the performances of the above algorithms from a statistical viewpoint, the Friedman test is adopted. The Friedman test is a nonparametric statistical test over multiple group measures that can determine whether a set of algorithms differ in performance. The null hypothesis H0 is: there is no difference in performance among these algorithms. The significance level is α = 0.05, and H0 is rejected when TF > Fα, where the TF-value is given by

$$ T_{F} = \frac{{\left( {N - 1} \right)T_{\chi^{2}} }}{{N\left( {k - 1} \right) - T_{\chi^{2}} }} $$
(11)
$$ T_{\chi^{2}} = \frac{12N}{{k\left( {k + 1} \right)}}\left( {\sum\nolimits_{i = 1}^{k} {R_{i}^{2} } - \frac{{k\left( {k + 1} \right)^{2} }}{4}} \right) $$
(12)

TF follows the F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom, where k and N are the numbers of algorithms and datasets respectively, i.e. k = 5 and N = 8. \( T_{\chi^{2}} \) is defined in Eq. (12), and Ri is the i-th algorithm’s average rank. Since clustering is an unsupervised method without label guidance, performance is evaluated here by SSE: the smaller the SSE, the better the clustering effect. Therefore, the mean SSE obtained by each algorithm on each dataset is used as the evaluation indicator in this Friedman test. Table 3 shows the aligned ranks of the algorithms, and a computational sketch follows the table. By Eqs. (11)–(12), the TF-value is 4.852. Because TF > F0.05(4, 28) = 2.714, H0 is rejected, i.e. the algorithms differ in performance.

Table 3. Aligned ranks of algorithms
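A sketch of the computation follows; it takes the algorithms’ average ranks (as in Table 3, whose values are not reproduced here) and obtains the critical value from SciPy’s F distribution.

```python
import numpy as np
from scipy.stats import f

def friedman_test(avg_ranks, N, alpha=0.05):
    """Friedman statistic of Eqs. (11)-(12) from the algorithms' average ranks."""
    k = len(avg_ranks)
    # Eq. (12): chi-square-based statistic over the average ranks
    t_chi2 = 12 * N / (k * (k + 1)) * (np.sum(np.square(avg_ranks)) - k * (k + 1) ** 2 / 4)
    t_f = (N - 1) * t_chi2 / (N * (k - 1) - t_chi2)      # Eq. (11)
    crit = f.ppf(1 - alpha, k - 1, (k - 1) * (N - 1))    # e.g. F_0.05(4, 28) = 2.714
    return t_f, crit
```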

To further explore how the algorithms differ, the Nemenyi post-hoc test is applied. CD is the critical difference for the algorithms’ average rank values, defined as

$$ CD = q_{\alpha } \sqrt {\frac{{k\left( {k + 1} \right)}}{6N}} $$
(13)

qα = 2.728 for α = 0.05 and k = 5. By Eq. (13), CD = 2.157. The Friedman test pattern based on the CD-value is shown in Fig. 4.

Fig. 4. Friedman test pattern
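For reference, the critical difference of Eq. (13) is a one-line computation; with k = 5, N = 8 and qα = 2.728 it reproduces the CD = 2.157 reported above.

```python
import math

def nemenyi_cd(k=5, N=8, q_alpha=2.728):
    """Nemenyi critical difference of Eq. (13); about 2.157 for k=5, N=8."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))
```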

From the results of the Friedman test, first, the null hypothesis H0 is rejected, which means the compared algorithms differ in performance. Second, according to the rank values in Table 3, IPCWCA-R + K-means has the best average rank, followed by IPCWCA-A + K-means, IPCWCA-1 + K-means, WCA + K-means and K-means. Compared with the original K-means, the three proposed methods achieve better average ranks, which means they perform better.

Finally, as illustrated in Fig. 4, the performance of IPCWCA-R + K-means is significantly different from that of K-means and WCA + K-means, which shows that the results obtained by IPCWCA-R + K-means on the datasets differ from those of the original K-means and WCA + K-means. In addition, the three proposed methods overlap in Fig. 4, so there is no significant difference in performance among them.

7 Conclusions and Further Work

Inspired by the good global search ability of IPCWCA, three hybrid clustering methods based on IPCWCA and K-means are presented and compared in this paper. According to SSE, accuracy and the statistical analyses, IPCWCA-R + K-means performed best on most of the datasets. However, on the Australian Credit customer segmentation dataset, IPCWCA-R + K-means did not perform best, which indicates that different datasets may require different approaches. Nevertheless, compared with the original K-means and WCA + K-means, the IPCWCA + K-means based methods achieve better SSE and accuracy in most cases and perform better in the Friedman test.

In future research, we will continue to improve the proposed methods to solve different kinds of clustering problems, especially on high-dimensional data. In addition, customer segmentation will be studied more comprehensively, including customer segmentation models and evaluation criteria.