FGCH: a fast and grid based clustering algorithm for hybrid data stream

Applied Intelligence

Abstract

Streaming large volumes of data has a wide range of real-world applications, e.g., video flows, internet calls, and online games. Fast, real-time data stream processing is therefore important. Traditional data clustering algorithms mine information from large data sets efficiently and effectively, but most of them are not suitable for online data stream clustering. In this work, we therefore propose a novel fast and grid based clustering algorithm for hybrid data stream (FGCH). Specifically, we make the following main contributions: (1) we develop a non-uniform attenuation model to enhance the resistance to noise; (2) we propose a similarity calculation method for hybrid data, which calculates similarity more efficiently and accurately; and (3) we present a novel clustering center fast determination algorithm (CCFD), which automatically determines the number, centers, and radii of clusters. We compare our technique with several state-of-the-art clustering algorithms. The experimental results show that it achieves better clustering accuracy on average, with a shorter running time than the closest competing algorithm.
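To make the notion of similarity on hybrid data concrete, the following is a minimal sketch of a generic Gower-style similarity, assuming that "hybrid data" means records that mix numeric and categorical attributes. It is not the similarity measure proposed for FGCH, whose definition is not given in the abstract; the function name `hybrid_similarity` and the `numeric_ranges` parameter are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence, Union

Value = Union[float, str]


def hybrid_similarity(
    a: Sequence[Value],
    b: Sequence[Value],
    numeric_ranges: Dict[int, float],
) -> float:
    """Gower-style similarity in [0, 1] for mixed numeric/categorical records.

    numeric_ranges maps the index of each numeric field to its value range
    (assumed to be known in advance); all other fields are treated as
    categorical. This is a generic illustration, not the FGCH measure.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of fields")
    if len(a) == 0:
        raise ValueError("records must have at least one field")

    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            span = numeric_ranges[i]
            # Numeric field: 1 minus the range-normalized absolute difference,
            # clipped so values outside the assumed range cannot go negative.
            diff = abs(float(x) - float(y)) / span if span > 0 else 0.0
            total += 1.0 - min(diff, 1.0)
        else:
            # Categorical field: simple match (1) or mismatch (0).
            total += 1.0 if x == y else 0.0

    # Average the per-field scores so every field has equal weight.
    return total / len(a)


# Two records with fields (numeric, categorical, numeric).
record_a = (1.0, "tcp", 10.0)
record_b = (3.0, "udp", 10.0)
ranges = {0: 4.0, 2: 20.0}
score = hybrid_similarity(record_a, record_b, ranges)
```

In a streaming setting the value ranges are usually not known in advance, so an implementation would have to estimate or update them online; the sketch sidesteps that by taking them as input.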



Corresponding author

Correspondence to Yun Xiang.


Cite this article

Chen, J., Lin, X., Xuan, Q. et al. FGCH: a fast and grid based clustering algorithm for hybrid data stream. Appl Intell 49, 1228–1244 (2019). https://doi.org/10.1007/s10489-018-1324-x


  • DOI: https://doi.org/10.1007/s10489-018-1324-x
