Abstract
Survival of the Fittest is a principle that selects the superior and eliminates the inferior in nature. This principle has been applied in many fields, especially in solving optimization problems. Clustering, as studied in the data mining community, aims to discover unknown representations or patterns hidden in datasets. The hierarchical clustering algorithm (HCA) is a method of cluster analysis that searches for the optimal distribution of clusters through a hierarchical structure. Strategies for hierarchical clustering are generally of two types: agglomerative, with a bottom-up procedure, and divisive, with a top-down procedure. However, most clustering approaches have two disadvantages: their reliance on distance-based measurement and the difficulty of integrating clusters. In this paper, we propose an optimal probabilistic estimation (OPE) approach that exploits the Survival of the Fittest principle. We devise a hierarchical clustering algorithm based on OPE, called OPE-HCA, which combines optimization with probability and agglomerative HCA. Experimental results show that OPE-HCA can search for and discover patterns at different description levels and can also achieve better performance than many clustering algorithms according to the normalized mutual information (NMI) and clustering accuracy measures.
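As an illustration of the NMI measure mentioned above, the following is a minimal sketch of how normalized mutual information between a reference labeling and a clustering result might be computed. It is not taken from the paper: the function names `entropy` and `nmi` are hypothetical, and the normalization by the geometric mean of the two entropies is an assumption, since the abstract does not state which NMI variant is used.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels, pred_labels):
    """Normalized mutual information between two labelings.

    Assumption: normalization by sqrt(H(true) * H(pred)); other
    variants (e.g. arithmetic mean of the entropies) also exist.
    """
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    count_t = Counter(true_labels)
    count_p = Counter(pred_labels)

    # Mutual information: sum over co-occurring (class, cluster) pairs.
    mi = 0.0
    for (t, p), c in joint.items():
        mi += (c / n) * math.log(c * n / (count_t[t] * count_p[p]))

    h_t, h_p = entropy(true_labels), entropy(pred_labels)
    if h_t == 0.0 or h_p == 0.0:
        # Degenerate case: at least one labeling has a single group.
        return 1.0 if h_t == h_p else 0.0
    return mi / math.sqrt(h_t * h_p)


if __name__ == "__main__":
    # Same partition under renamed cluster ids.
    print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
    # Clusters that are independent of the reference classes.
    print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because NMI depends only on how the points are partitioned, relabeling the clusters does not change the score, which is why it is a common choice for comparing clustering output against ground-truth classes.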
Acknowledgments
The author wishes to thank the anonymous reviewers and the JEO Assistant for their constructive comments and suggestions. The author also thanks the laboratory students Zheng Feng, Wenhua Liu, Yuhao Cai, and Tianyi Liang for participating in the experiment. This work was supported by the National Natural Science Foundation of China under Grant 61203305 and the Shandong Provincial Natural Science Foundation of China under Grant ZR2012FM003.
About this article
Cite this article
Fan, J. OPE-HCA: an optimal probabilistic estimation approach for hierarchical clustering algorithm. Neural Comput & Applic 31, 2095–2105 (2019). https://doi.org/10.1007/s00521-015-1998-5
DOI: https://doi.org/10.1007/s00521-015-1998-5