Improved K-means algorithm based on density Canopy
Introduction
Clustering, one of the most classical techniques in data mining, has been studied by many scholars and is widely used in many fields. In the commercial field, it can be used to analyze customers' behavior, providing an important basis for developing marketing strategies [1], [2]. In internet e-commerce, it can be used to analyze the characteristics of similar customers from users' browsing logs, helping online merchants provide better customer service [3]. Clustering analysis also has important applications in mining big data on the user side of the smart grid [4], [5], [6]. By mining the effective information in users' electricity data and analyzing their electricity usage behavior, power consumption can be forecast, which is of great significance to grid companies for electric power dispatching [7], [8]. According to the clustering approach used, clustering algorithms can be divided into partition-based, hierarchical, density-based, grid-based and model-based methods [9].
The K-means algorithm is a commonly used partition-based clustering algorithm [10], [11], [12]. Its procedure is simple and efficient, making it suitable for clustering analysis of big data sets. It uses distance as the similarity measure to divide the samples into several clusters, so that samples within the same cluster are highly similar and samples in different clusters are highly dissimilar. At present, the K-means algorithm has two main problems: determining the number of clusters (the value K) and selecting the initial clustering centers. Research on the K-means algorithm has therefore focused mainly on these two aspects. The authors of [13] give a partition clustering method that divides the original data set into several optimal subsets and selects an initial clustering center in each subset. Although the method improves clustering accuracy, it increases the complexity of the algorithm, which makes it unsuitable for clustering analysis of big data sets. A data sampling and K-means pre-clustering method is proposed in [14]: the data are sampled multiple times, a clustering result is generated by K-means for each sample, and the results are intersected and used to construct a weighted connected graph from which the clustering centers are obtained. However, the method does not consider the overall sample distribution of the data set, which leads to some limitations and instability. A method of determining the upper limit of the cluster number K by the AP algorithm [15] is proposed in [16], but no specific method of determining the optimal K value is given. In [17], Mao Dianhui proposes combining the Canopy algorithm with the K-means algorithm to determine the clustering input parameters, using the maximum and minimum distance method [18], [19] to determine the thresholds T1 and T2 in Canopy clustering. However, the algorithm's immunity to noise is weak.
In addition, an improved method named the semi-supervised K-means++ algorithm was proposed in 2016 [20]. Some of the data are labeled first, the rest are labeled according to the minimum cost, and the expected result is obtained by taking the labels into account. However, choosing suitable labeled data, which has a certain impact on the final clustering results, is not easy, so the method has some limitations. Moreover, Fritzke proposed the K-means-u* algorithm in 2017 to overcome the limitations of the K-means++ algorithm [21]. However, it greatly increases the complexity of the algorithm and is not suitable for scenarios with large amounts of data.
Therefore, a new Canopy clustering method based on the density of samples is proposed in this paper. The optimal value K for the data set and the initial clustering centers are obtained by the density Canopy algorithm and used as the input parameters of the K-means algorithm, which solves the two difficult problems of determining the value K and selecting the initial clustering centers [22], [23]. Simulation tests on data sets from the UCI website [24] and on simulated data sets with noise show that the K-means clustering method based on the new density Canopy obtains better clustering results and is more robust to noise.
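To make the role of the two input parameters concrete, the following is a minimal Python sketch of standard (Lloyd-style) K-means that takes the initial centers as an argument, so that K is simply the number of centers supplied. It is not the paper's implementation: the function name `kmeans`, the parameters `max_iter` and `tol`, and the NumPy-based layout are assumptions made for illustration, and the density Canopy step that would produce `init_centers` is not shown.

```python
import numpy as np

def kmeans(X, init_centers, max_iter=100, tol=1e-6):
    """Standard K-means with caller-supplied initial centers (illustrative sketch).

    K is implied by the number of rows in init_centers.
    Returns the final centers and the cluster label of every sample.
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()

    for _ in range(max_iter):
        # Euclidean distance from every sample to every center, shape (n, K).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)

        # Move each center to the mean of its members; keep empty clusters in place.
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)

        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break

    # Final assignment against the converged centers.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return centers, dist.argmin(axis=1)


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    centers, labels = kmeans(data, init_centers=[[0, 0], [10, 10]])
    print(centers)
    print(labels)
```

Because both K and the starting centers come from the caller, any pre-clustering step that yields a set of candidate centers can be plugged in without changing the K-means loop itself.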
This paper is organized as follows. In Section II, the improved K-means algorithm based on density Canopy is presented. In Section III, the simulation and the results are presented and discussed. Finally, in Section IV, the relevant conclusions are drawn.
Section snippets
Canopy algorithm principle
The Canopy algorithm is an unsupervised pre-clustering algorithm introduced by McCallum et al. [25]. It is often used as a preprocessing step for the K-means algorithm or hierarchical clustering algorithms. As shown in Fig. 1, the Canopy algorithm sets two distance thresholds, T1 and T2, selects the initial cluster center randomly, and calculates the Euclidean distance between each sample and the initial center. Each sample is then assigned to the corresponding cluster according to the thresholds.
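The procedure can be sketched in Python as follows. This is an illustrative sketch of the traditional Canopy algorithm only, not of the density-based variant proposed in the paper. It assumes the common convention that T1 > T2, that samples closer than T1 to a center join its canopy, and that samples closer than T2 are removed from the candidate list; the excerpt above does not spell out these rules, and the function name `canopy` and the `seed` parameter are hypothetical.

```python
import numpy as np

def canopy(X, t1, t2, seed=0):
    """Traditional Canopy pre-clustering (illustrative sketch, assumes t1 > t2).

    Returns a list of (center, member_indices) pairs. Canopies may overlap:
    a sample within t1 of several centers belongs to each of those canopies.
    """
    if t1 <= t2:
        raise ValueError("this sketch assumes t1 > t2")

    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    remaining = list(range(len(X)))   # indices still eligible to become centers
    canopies = []

    while remaining:
        # Pick a random remaining sample as the next canopy center.
        c = remaining[rng.integers(len(remaining))]
        dist = np.linalg.norm(X[remaining] - X[c], axis=1)

        # Loose threshold t1: the sample joins this canopy.
        members = [i for i, d in zip(remaining, dist) if d < t1]
        canopies.append((X[c].copy(), members))

        # Tight threshold t2: the sample (including the center itself)
        # is removed and can no longer start a canopy of its own.
        remaining = [i for i, d in zip(remaining, dist) if d >= t2]

    return canopies


if __name__ == "__main__":
    data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    for center, members in canopy(data, t1=5.0, t2=2.0):
        print(center, members)
```

The number of canopies returned can serve as a candidate K and the canopy centers as candidate initial centers for K-means, although both depend on the random choice of centers and on how T1 and T2 are set.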
UCI data sets simulation experiment
The experimental data in this section are taken from the UCI website. The following seven test data sets are selected: Soybean-small, Iris, Wine, Segmentation, Ionosphere, Pima Indians Diabetes and Segmentation-T.
As shown in Table 1, the data sets differ in the number of samples and in the number of attributes per sample, which allows the effectiveness of the improved algorithm to be tested under different conditions. Segmentation-T is a data set obtained by adding a certain amount of simulated values on the basis of the
Conclusions and future work
The K-means algorithm is one of the most typical methods in data mining. To address the two disadvantages of the traditional K-means algorithm, namely determining the value K and selecting the initial clustering centers, an improved K-means algorithm based on density Canopy is proposed in this paper. In the improved algorithm, a density parameter is added. By defining the density of the samples in the data set, the average distance between the samples in a cluster and the distance between the clusters, the
Acknowledgment
This work was supported in part by the Scientific Research Project of State Grid Corporation of China (JS71-16-001), the National Natural Science Foundation of China (61001089), the Scientific and Technological Research Program of Chongqing Municipal Education Commission (KJ1500407), and the Youth Science Foundation of Chongqing University of Posts and Telecommunications (A2014-107).
References (25)
- et al., Modeling of fuzzy-based voice of customer for business decision analytics, Knowl.-Based Syst. (2017)
- et al., A sampling based sentiment mining approach for e-commerce applications, Inf. Process. Manage. (2017)
- et al., A clustering approach to domestic electricity load profile characterization using smart metering data, Appl. Energy (2015)
- et al., K-means based cluster analysis of residential smart meter measurements, Energy Procedia (2016)
- et al., The MinMax k-Means clustering algorithm, Pattern Recognit. (2014)
- et al., Intelligent data analysis approaches to churn as a business problem: a survey, Knowl. Inf. Syst. (2017)
- et al., K-means based load estimation of domestic smart meter measurements, Appl. Energy (2016)
- et al., Application of associated clustering and classification method in electric power load forecasting, Chin. J. Comput. (2012)
- et al., Clustering of electricity consumption behavior dynamics toward big data applications, IEEE Trans. Smart Grid (2016)
- et al., Clustering algorithms research, J. Softw. (2008)
- Research of clustering algorithm based on K-means, J. Southwest Univ. Nationalities
- A novel clustering algorithm based on feature weighting distance and soft subspace learning, Chin. J. Comput.