Abstract
In big-data settings, detecting clusters of different sizes and shapes is a challenging and important task. Over the last several years, density-based clustering approaches have been widely used in many areas of science because of their simplicity and their ability to detect clusters of different sizes and shapes. A modified version of the DBSCAN algorithm that applies diverse conversions to categorical data is proposed for clustering mixed data; it is referred to as the density-based clustering algorithm for mixed data with integration of entropy and probability distribution (EPDCA). Several optional conversions are provided so that the clustering process can adapt to the data. Benchmark data sets from the UCI repository were selected to test the capability and validity of EPDCA. The results show that EPDCA considerably improves clustering, especially in automatically determining the number of clusters, in discovering noise, and in the time taken to form clusters.






Acknowledgements
The authors are very grateful to the editors and reviewers for their valuable comments and suggestions. This work was supported in part by the Major Projects of the National Social Science Fund of China (Grant Nos. 16ZDA045 and 15ZDB168), the National Natural Science Foundation of China (Grant Nos. 71603197, 71371148, and 91024020), the Junior Fellowships for CAST Advanced Innovation Think-tank Program (DXB-ZKQN-2016-013), and the China Postdoctoral Science Foundation Funded Project (2016M592403).
Cite this article
Liu, X., Yang, Q. & He, L. A novel DBSCAN with entropy and probability for mixed data. Cluster Comput 20, 1313–1323 (2017). https://doi.org/10.1007/s10586-017-0818-3
DOI: https://doi.org/10.1007/s10586-017-0818-3