Elsevier

Knowledge-Based Systems

Volume 145, 1 April 2018, Pages 289-297

Improved K-means algorithm based on density Canopy

https://doi.org/10.1016/j.knosys.2018.01.031

Abstract

To improve the accuracy and stability of the K-means algorithm and to address the problem of determining the most appropriate number of clusters K and the best initial seeds, an improved K-means algorithm based on density Canopy is proposed. First, the density of the samples in the data set, the average sample distance within clusters and the distance between clusters are calculated; the sample with maximum density is chosen as the first cluster center, and its density cluster is removed from the data set. The weight product is defined as the product of the sample density, the reciprocal of the average distance between the samples in the cluster, and the distance between the clusters. The remaining initial seeds are then determined by the maximum weight product in the remaining data, until the data set is empty. The density Canopy is used as a preprocessing procedure for K-means, and its result supplies the cluster number and the initial cluster centers of the K-means algorithm. Finally, the new algorithm is tested on some well-known data sets from the UCI machine learning repository and on simulated data sets with different proportions of noise samples. The simulation results show that the improved K-means algorithm based on density Canopy achieves better clustering results and is less sensitive to noisy data than the traditional K-means algorithm, the Canopy-based K-means algorithm, the Semi-supervised K-means++ algorithm and the K-means-u* algorithm. Compared with these four algorithms, the clustering accuracy of the proposed algorithm is improved by 30.7%, 6.1%, 5.3% and 3.7% on average on the UCI data sets, and by 44.3%, 3.6%, 9.6% and 8.9% on the simulated data sets with noise, respectively. As the noise ratio increases, the noise immunity of the new algorithm becomes more evident: when the noise ratio reaches 30%, the accuracy is improved by 50% and 6% compared with the traditional K-means algorithm and the Canopy-based K-means algorithm, respectively.

Introduction

Clustering, one of the most classical families of algorithms in data mining, has been studied by many scholars and is widely used in many fields. In the commercial field, it can be used to analyze customers’ behavior, providing an important basis for developing marketing strategies [1], [2]. In internet e-commerce, it can be used to analyze the characteristics of similar customers from users' browsing logs, helping online merchants provide better customer service [3]. In addition, cluster analysis has important applications in mining big data on the user side of the smart grid [4], [5], [6]. By mining the useful information in users' electricity data and analyzing their electricity usage behavior, power consumption can be forecast, which is of great significance for grid companies in carrying out electric power dispatching [7], [8]. According to the clustering method used, clustering algorithms can be divided into partition-based, hierarchical, density-based, grid-based and model-based methods [9].

The K-means algorithm is a commonly used partition-based clustering algorithm [10], [11], [12]. Its procedure is simple and efficient, making it suitable for cluster analysis of big data sets. It uses distance as the similarity measure to divide the samples into several clusters: samples within the same cluster are highly similar, while samples in different clusters are highly dissimilar. At present, the K-means algorithm has two main problems: determining the number of clusters (the value K) and selecting the initial cluster centers. Research on the K-means algorithm has therefore focused mainly on these two aspects. In [13], the authors give a partition clustering method that divides the original data set into several optimal subsets and selects an initial cluster center in each subset. Although the method improves clustering accuracy, it increases the complexity of the algorithm and is therefore not suitable for cluster analysis of big data sets. A data sampling and K-means pre-clustering method is proposed in [14]: the data are sampled multiple times, each sample is clustered by the K-means algorithm, and the intersections of the clustering results are used to construct a weighted connected graph from which the cluster centers are obtained. However, the method does not consider the overall sample distribution of the data set, which leads to some limitations and instability. In addition, a method for determining the upper limit of the cluster number K by the AP algorithm [15] is proposed in [16], but no specific method for determining the optimal K is given. In [17], Mao Dianhui proposes combining the Canopy algorithm with the K-means algorithm to determine the clustering input parameters, using the maximum and minimum distance method [18], [19] to determine the thresholds T1 and T2 in Canopy clustering. However, the noise immunity of that algorithm is weak.

In addition, an improved method named the Semi-supervised K-means++ algorithm was proposed in 2016 [20]. Some of the data are labeled first, the rest are labeled according to the minimum cost, and the expected result is obtained by taking the labels into account. However, choosing suitable labeled data, which has a certain impact on the final clustering results, is not easy, so the method has some limitations. Moreover, in 2017 Fritzke proposed the K-means-u* algorithm to overcome the limitations of the K-means++ algorithm [21]; however, it greatly increases the complexity of the algorithm and is not suitable for scenarios with large amounts of data.

Therefore, a new Canopy clustering method based on the density of samples is proposed in this paper. The optimal value of K for the data set and the initial cluster centers are obtained by the density Canopy algorithm and used as the input parameters of the K-means algorithm, solving the two difficult problems of determining the value K and selecting the initial cluster centers [22], [23]. Simulation tests on data sets from the UCI website [24] and on simulated data sets with noise show that the K-means clustering method based on the new density Canopy obtains better clustering results and is also more robust to noise.
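
The hand-off described above, in which a pre-clustering stage supplies both K and the initial centers to K-means, can be sketched as follows. This is a minimal illustration and not the implementation used in this paper: the function name `kmeans_from_seeds` and the parameters `max_iter` and `tol` are hypothetical, and the seed centers are assumed to come from some pre-clustering step. K is implied by the number of seeds passed in.

```python
import numpy as np

def kmeans_from_seeds(points, init_centers, max_iter=100, tol=1e-8):
    """Standard K-means iterations started from caller-supplied centers.

    Hypothetical helper for illustration; K equals len(init_centers).
    """
    if max_iter < 1:
        raise ValueError("max_iter must be at least 1")
    points = np.asarray(points, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: each sample goes to its nearest center (Euclidean).
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned samples.
        # An empty cluster keeps its previous center (an assumption of this sketch).
        new_centers = centers.copy()
        for k in range(len(centers)):
            mask = labels == k
            if mask.any():
                new_centers[k] = points[mask].mean(axis=0)
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:  # centers have stopped moving
            break
    return centers, labels

# Usage: two seeds imply K = 2.
data = [[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [10, 10.1]]
centers, labels = kmeans_from_seeds(data, init_centers=[[0, 0], [10, 10]])
```

Because the seeds fix both the number of clusters and the starting point, the quality of the final clustering in this sketch depends directly on the pre-clustering stage that produces them.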

This paper is organized as follows. In Section II, the improved K-means algorithm based on density Canopy is presented. In Section III, the simulations and their results are presented and discussed. Finally, in Section IV, the conclusions are drawn.

Section snippets

Canopy algorithm principle

The Canopy algorithm is an unsupervised pre-clustering algorithm introduced by McCallum et al. [25]. It is often used as a preprocessing step for the K-means algorithm or for hierarchical clustering algorithms. As shown in Fig. 1, the Canopy algorithm sets two distance thresholds T1 and T2, selects the initial cluster center randomly, and calculates the Euclidean distance between each sample and the initial center. Each sample is then assigned to the corresponding cluster according to the thresholds.
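
A minimal sketch of the classic two-threshold Canopy procedure described above is given below, assuming T1 > T2 > 0 and Euclidean distance. The function name `canopy` and the `seed` parameter are illustrative choices, and the sketch follows the commonly described form of the algorithm rather than any specific implementation referenced in this paper.

```python
import numpy as np

def canopy(points, t1, t2, seed=0):
    """Classic Canopy pre-clustering (illustrative sketch).

    t1 is the loose threshold, t2 the tight threshold, with t1 > t2 > 0.
    Returns a list of (center_index, member_indices) pairs.
    """
    if not (t1 > t2 > 0):
        raise ValueError("thresholds must satisfy t1 > t2 > 0")
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    remaining = list(range(len(points)))  # samples still eligible as centers
    canopies = []
    while remaining:
        # Pick a random remaining sample as the next canopy center.
        center = remaining[rng.integers(len(remaining))]
        dist = np.linalg.norm(points - points[center], axis=1)
        # Every sample within the loose threshold joins this canopy,
        # so a sample may belong to more than one canopy.
        members = [i for i in range(len(points)) if dist[i] < t1]
        canopies.append((center, members))
        # Samples within the tight threshold can no longer become centers.
        remaining = [i for i in remaining if dist[i] >= t2]
    return canopies

# Usage: two well-separated groups give two canopies.
data = [[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [10, 10.1]]
result = canopy(data, t1=3.0, t2=1.0)
```

The number of canopies returned can serve as a candidate value of K and the canopy centers as initial seeds. Because the centers are drawn at random, the result can vary with the seed and with the choice of T1 and T2.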

UCI data sets simulation experiment

The experimental data in this section are taken from the UCI website. The following seven test data sets are selected: Soybean-small, Iris, Wine, Segmentation, Ionosphere, Pima Indians Diabetes and Segmentation-T.

As shown in Table 1, the data sets differ in the number of samples and in the number of attributes per sample, which allows the effectiveness of the improved algorithm to be tested. Segmentation-T is the data set obtained by adding a certain amount of analog values on the basis of the

Conclusions and future work

The K-means algorithm is one of the most typical methods of data mining. To address the two disadvantages of the traditional K-means algorithm, namely determining the value K and the initial cluster centers, an improved K-means algorithm based on density Canopy is proposed in this paper. In the improved algorithm, a density parameter is added. By defining the density of the samples in the data set, the average distance between the samples in the cluster and the distance between the clusters, the

Acknowledgment

This work was supported in part by the Scientific Research Project of State Grid Corporation of China (JS71-16-001), the National Natural Science Foundation of China (61001089), the Scientific and Technological Research Program of Chongqing Municipal Education Commission (KJ1500407), and the Youth Science Foundation of Chongqing University of Posts and Telecommunications (A2014-107).

References (25)

  • B.U. Yuan-Yuan et al. Research of clustering algorithm based on K-means. J. Southwest Univ. Nationalities (2009)
  • Jun Wang et al. A novel clustering algorithm based on feature weighting distance and soft subspace learning. Chin. J. Comput. (2012)