Abstract
Owing to the enormous growth in both the volume and variety of data, clustering a very large database is time-consuming. To speed up clustering, sampling has been recognized as a very useful way to reduce the dataset size: a collection of data points is taken as a sample, and a clustering algorithm is then applied to partition the data points in that sample into clusters. In this approach, the data points that are not sampled do not receive cluster labels. The process of allocating unlabeled data points to proper clusters has been well explored in purely numerical or purely categorical domains, but not in domains that mix both. In this paper, we propose a hybrid similarity coefficient that measures the resemblance between an unlabeled data point and a cluster, based on the importance of categorical attribute values and the mean values of numerical attributes. Furthermore, we propose a Hybrid Data Labeling Algorithm (HDLA), based on this similarity coefficient, to assign an appropriate cluster label to each unlabeled data point. We analyze its time complexity and perform experiments on synthetic and real-world datasets to demonstrate the efficacy of HDLA.
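The labeling step described above can be illustrated with a small Python sketch. It is not the HDLA coefficient itself, whose exact definition the abstract does not give. As a stand-in assumption, the categorical part scores a point by the relative frequency of its attribute values within a cluster, and the numerical part scores it by its closeness to the cluster's attribute means. The names `summarize_cluster`, `hybrid_similarity`, `label_point`, and the `scales` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter


def summarize_cluster(points, cat_idx, num_idx):
    """Summarize a sampled cluster: value counts per categorical attribute
    and the mean of each numerical attribute."""
    n = len(points)
    freqs = {j: Counter(p[j] for p in points) for j in cat_idx}
    means = {j: sum(p[j] for p in points) / n for j in num_idx}
    return {"size": n, "freqs": freqs, "means": means}


def hybrid_similarity(point, summary, cat_idx, num_idx, scales):
    """Illustrative similarity in [0, 1] (an assumption, not the paper's formula).

    Categorical term: relative frequency of the point's value in the cluster.
    Numerical term: 1 / (1 + scaled distance to the cluster mean).
    """
    cat = sum(summary["freqs"][j][point[j]] / summary["size"] for j in cat_idx)
    num = sum(
        1.0 / (1.0 + abs(point[j] - summary["means"][j]) / scales[j])
        for j in num_idx
    )
    return (cat + num) / (len(cat_idx) + len(num_idx))


def label_point(point, summaries, cat_idx, num_idx, scales):
    """Assign the label of the cluster with the highest similarity."""
    return max(
        summaries,
        key=lambda c: hybrid_similarity(point, summaries[c], cat_idx, num_idx, scales),
    )


# Toy sample that has already been clustered; attribute 0 is categorical,
# attribute 1 is numerical. The data are made up for this example.
CAT_IDX, NUM_IDX = [0], [1]
SCALES = {1: 1.0}  # hypothetical per-attribute scale; must be positive
clusters = {
    "A": [("red", 1.0), ("red", 1.2), ("blue", 0.8)],
    "B": [("green", 5.0), ("green", 5.5), ("blue", 4.5)],
}
summaries = {c: summarize_cluster(pts, CAT_IDX, NUM_IDX) for c, pts in clusters.items()}

# Label two points that were left out of the sample.
label_red = label_point(("red", 1.1), summaries, CAT_IDX, NUM_IDX, SCALES)
label_blue = label_point(("blue", 4.9), summaries, CAT_IDX, NUM_IDX, SCALES)
```

In this sketch each unlabeled point is compared only against per-cluster summaries, not against every sampled point, so labeling one point costs time proportional to the number of clusters times the number of attributes. Whether HDLA uses the same summaries is not stated in the abstract.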
Cite this article
Sangam, R.S., Om, H. Hybrid data labeling algorithm for clustering large mixed type data. J Intell Inf Syst 45, 273–293 (2015). https://doi.org/10.1007/s10844-014-0348-x
DOI: https://doi.org/10.1007/s10844-014-0348-x