Abstract
Streaming large volumes of data has a wide range of real-world applications, such as video flows, internet calls, and online games. Fast, real-time processing of data streams is therefore important. Traditional data clustering algorithms mine information from large data sets efficiently and effectively, but most of them are not suitable for online data stream clustering. In this work, we propose a novel fast and grid based clustering algorithm for hybrid data stream (FGCH). Our main contributions are as follows: 1) we develop a non-uniform attenuation model to enhance resistance to noise; 2) we propose a similarity calculation method for hybrid data that computes similarity more efficiently and accurately; and 3) we present a novel clustering center fast determination algorithm (CCFD), which automatically determines the number, centers, and radii of clusters. We compare our technique with several state-of-the-art clustering algorithms. The experimental results show that it achieves better clustering accuracy on average, while its running time is shorter than that of the closest algorithm.
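To make the idea of a similarity measure for hybrid data concrete, the following is a minimal sketch of a generic Gower-style similarity between two records that mix numeric and categorical attributes. It is not the similarity method proposed in this paper, whose definition the abstract does not give. The function name `hybrid_similarity` and the `numeric_ranges` parameter are hypothetical and chosen for illustration only.

```python
from typing import Optional, Sequence, Union

Value = Union[float, str]


def hybrid_similarity(
    a: Sequence[Value],
    b: Sequence[Value],
    numeric_ranges: Sequence[Optional[float]],
) -> float:
    """Gower-style similarity in [0, 1] between two mixed records.

    Assumption (illustrative, not from the paper): numeric_ranges[i] holds
    the value range (max - min) of attribute i when it is numeric, and
    None when attribute i is categorical.
    """
    if not (len(a) == len(b) == len(numeric_ranges)):
        raise ValueError("records and numeric_ranges must have equal length")
    if len(a) == 0:
        raise ValueError("records must have at least one attribute")

    total = 0.0
    for x, y, value_range in zip(a, b, numeric_ranges):
        if value_range is None:
            # Categorical attribute: simple matching (1 if equal, else 0).
            total += 1.0 if x == y else 0.0
        elif value_range == 0:
            # Constant numeric attribute: every pair counts as identical.
            total += 1.0
        else:
            # Numeric attribute: range-normalized distance, clipped to 1.
            total += 1.0 - min(abs(x - y) / value_range, 1.0)

    # Unweighted mean over all attributes.
    return total / len(a)


# One numeric attribute with range 10 and one categorical attribute.
print(hybrid_similarity((0.0, "tcp"), (5.0, "udp"), (10.0, None)))
```

In this sketch every attribute contributes equally. A streaming variant would also need to maintain the numeric ranges incrementally as new records arrive, which is omitted here.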
Cite this article
Chen, J., Lin, X., Xuan, Q. et al. FGCH: a fast and grid based clustering algorithm for hybrid data stream. Appl Intell 49, 1228–1244 (2019). https://doi.org/10.1007/s10489-018-1324-x
DOI: https://doi.org/10.1007/s10489-018-1324-x