Abstract
In recent years, the concept of clustering stability has been widely used to determine the number of clusters in a given dataset. This paper proposes an improvement of stability methods based on the bootstrap technique. The improvement is achieved by combining the instability property with an evaluation criterion and by using a clustering algorithm based on DCA (Difference of Convex functions Algorithm). DCA is an innovative approach in nonconvex programming that has been successfully applied to many (smooth or nonsmooth) large-scale nonconvex programs in various domains. Experimental results on both synthetic and real datasets are promising and demonstrate the effectiveness of our approach.
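The bootstrap-based stability idea mentioned in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: plain Lloyd k-means stands in for the DCA-based clustering algorithm, the evaluation criterion the authors combine with instability is omitted, and all function names (`kmeans`, `disagreement`, `instability`, `select_k`) are hypothetical. For each candidate number of clusters, two bootstrap samples are clustered, both clusterings are extended to the original points, and the fraction of point pairs on which they disagree is averaged; the candidate with the lowest average is selected.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to `point`."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd k-means (assumption: a stand-in for the DCA-based step)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Recompute each center as the mean of its group; keep the old
        # center when a group is empty.
        new_centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def disagreement(labels_a, labels_b):
    """Fraction of point pairs grouped together by one labeling only."""
    pairs = list(combinations(range(len(labels_a)), 2))
    differ = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return differ / len(pairs)


def instability(points, k, n_pairs=10, seed=0):
    """Average disagreement between clusterings of two bootstrap samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        labelings = []
        for _ in range(2):
            # Bootstrap sample: draw len(points) points with replacement.
            boot = [rng.choice(points) for _ in points]
            centers = kmeans(boot, k, rng)
            # Extend the clustering to every original point.
            labelings.append([nearest(p, centers) for p in points])
        total += disagreement(*labelings)
    return total / n_pairs


def select_k(points, candidates, **kwargs):
    """Return the candidate k with the lowest instability, plus all scores."""
    scores = {k: instability(points, k, **kwargs) for k in candidates}
    return min(scores, key=scores.get), scores
```

Calling `select_k(points, range(2, 8))` on a list of coordinate tuples returns the chosen number of clusters together with the instability score of each candidate. Because a single random initialization can leave k-means in a poor local minimum, the scores are noisy for small `n_pairs`; a stronger clustering routine would reduce that noise.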
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Thuy, T.M., Thi Hoai An, L. (2015). An Improvement of Stability Based Method to Clustering. In: Le Thi, H., Nguyen, N., Do, T. (eds) Advanced Computational Methods for Knowledge Engineering. Advances in Intelligent Systems and Computing, vol 358. Springer, Cham. https://doi.org/10.1007/978-3-319-17996-4_12
DOI: https://doi.org/10.1007/978-3-319-17996-4_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-17995-7
Online ISBN: 978-3-319-17996-4
eBook Packages: Engineering, Engineering (R0)