Abstract
High-dimensional data streams arise in many real-world applications, such as network monitoring and forest cover type analysis. Clustering such streams differs from traditional clustering, where the dataset is generally static and can be read and processed repeatedly; it is more challenging because it must satisfy constraints such as bounded memory, single-pass processing, real-time response and concept drift detection. Many stream clustering methods have been proposed recently. However, on high-dimensional data they often incur high computational cost and perform poorly because of the curse of dimensionality. To address this problem, we present a new clustering algorithm for data streams, called RPGStream, which combines the random projection method with the growing neural gas (GNG) model, an incremental self-organizing approach belonging to the family of topological maps such as SOM and neural gas. To gain insight into the performance improvement obtained by our algorithm, we analyze and identify the major influence of random projection on GNG. Although our method is embarrassingly simple, merely incorporating random projection into an exponential fading function of GNG, experimental results on a variety of benchmark datasets indicate that it achieves comparable or even better performance than the G-Stream algorithm, even when the dimension is compressed to as little as 10% of the original (e.g., for the CoverType dataset, the dimension is reduced from 54 to 5).
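To make the central idea concrete, the following is a minimal sketch of how a stream record might be randomly projected before being matched against GNG prototypes. It is an illustration under stated assumptions, not the RPGStream implementation: the abstract does not say how the projection matrix is built, so a Gaussian matrix scaled by 1/sqrt(k) is assumed here, and the names `make_projection`, `project` and `nearest_node` are hypothetical. The GNG node updates and the exponential fading function are not reproduced.

```python
import math
import random


def make_projection(d_in, d_out, seed=0):
    """Return a d_out x d_in random matrix (ASSUMED Gaussian, scaled by 1/sqrt(d_out))."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)]
            for _ in range(d_out)]


def project(R, x):
    """Map one record x (length d_in) to its low-dimensional image R x."""
    return [sum(r * v for r, v in zip(row, x)) for row in R]


def nearest_node(nodes, y):
    """Index of the prototype closest to y in squared Euclidean distance."""
    return min(range(len(nodes)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(nodes[i], y)))


# Dimensions taken from the abstract's CoverType example: 54 -> 5.
R = make_projection(54, 5, seed=42)

# Synthetic records stand in for a real stream (placeholder data only).
rng = random.Random(1)
stream = [[rng.random() for _ in range(54)] for _ in range(3)]

# Two placeholder prototypes living in the projected 5-dimensional space.
nodes = [project(R, x) for x in stream[:2]]

# Project an incoming record once, then search for its winner node in 5-D.
y = project(R, stream[2])
winner = nearest_node(nodes, y)
```

The matrix is drawn once and reused for every record, so each arrival costs one 5 x 54 matrix-vector product, after which all distance computations against the prototypes run in 5 dimensions rather than 54.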
Acknowledgements
This work is supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61672281 and 61472186, the Key Program of NSFC under Grant No. 61732006, and the funding of the Jiangsu Innovation Program for Graduate Education under Grant No. KYLX15_0322.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Communicated by V. Loia.
About this article
Cite this article
Zhu, Y., Chen, S. Growing neural gas with random projection method for high-dimensional data stream clustering. Soft Comput 24, 9789–9807 (2020). https://doi.org/10.1007/s00500-019-04492-4
DOI: https://doi.org/10.1007/s00500-019-04492-4