Growing neural gas with random projection method for high-dimensional data stream clustering

  • Methodologies and Application
Soft Computing

Abstract

High-dimensional data streams arise in many real-world applications, such as network monitoring and forest cover type prediction. Unlike traditional clustering, where the dataset is generally static and can be read and processed repeatedly, data stream clustering must satisfy constraints such as bounded memory, single-pass processing, real-time response and concept drift detection, and therefore faces more challenges. Many such methods have been proposed recently. However, on high-dimensional data they often incur high computational cost and perform poorly because of the curse of dimensionality. To address this problem, we present a new clustering algorithm for data streams, called RPGStream, which combines the random projection method with the growing neural gas (GNG) model, an incremental self-organizing approach that belongs to the family of topological maps such as SOM and neural gas. To gain insight into the performance improvement obtained by our algorithm, we analyze and identify the major influence of random projection on GNG. Although our method is embarrassingly simple, merely incorporating random projection into an exponential fading function of GNG, experimental results on a variety of benchmark datasets indicate that it achieves comparable or even better performance than the G-Stream algorithm, even when the dimension is compressed to as little as 10% of the original (e.g., for the CoverType dataset, the dimension is reduced from 54 to 5).
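
To make the dimensionality reduction step concrete, the following is a minimal Python sketch of a Gaussian random projection from 54 to 5 dimensions, followed by a nearest-unit search in the projected space. It is an illustration under stated assumptions, not the RPGStream implementation: the function names (`make_projection`, `project`, `fading_weight`, `nearest_unit`), the Gaussian choice of projection matrix, the decay rate `lam`, and the form `2^(-lam * age)` of the fading weight are all hypothetical choices made here, since the abstract does not specify them.

```python
import math
import random


def make_projection(d, k, seed=0):
    """Return a k x d matrix with N(0, 1) entries scaled by 1/sqrt(k).

    Assumption: a dense Gaussian matrix; sparse or binary variants
    could be substituted without changing the rest of the sketch.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(R, x):
    """Map a d-dimensional point x to k dimensions: y = R x."""
    return [sum(r_ij * x_j for r_ij, x_j in zip(row, x)) for row in R]


def fading_weight(age, lam=0.01):
    """Assumed exponential fading form 2^(-lam * age); not taken from the paper."""
    return 2.0 ** (-lam * age)


def nearest_unit(units, y):
    """Index of the unit (prototype) closest to y in the projected space."""
    return min(range(len(units)), key=lambda i: math.dist(units[i], y))


if __name__ == "__main__":
    d, k = 54, 5  # CoverType-sized input reduced to 5 dimensions
    R = make_projection(d, k, seed=42)
    rng = random.Random(1)
    x = [rng.random() for _ in range(d)]
    y = project(R, x)
    print(len(x), "->", len(y))
```

In this sketch each arriving point is projected once, so every later distance computation against the units costs time proportional to the reduced dimension rather than the original one, which is the saving the abstract alludes to.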

Notes

  1. http://www.cse.fau.edu/~xqzhu/stream.html.

  2. http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data.

  3. https://archive.ics.uci.edu/ml/datasets/Covertype.

  4. http://archive.ics.uci.edu/ml/datasets/Daily+and+Sports+Activities.

  5. http://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (NSFC) under Grant Nos. 61672281 and 61472186, the Key Program of NSFC under Grant No. 61732006, and the funding of the Jiangsu Innovation Program for Graduate Education under Grant No. KYLX15_0322.

Author information

Corresponding author

Correspondence to Songcan Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhu, Y., Chen, S. Growing neural gas with random projection method for high-dimensional data stream clustering. Soft Comput 24, 9789–9807 (2020). https://doi.org/10.1007/s00500-019-04492-4

  • DOI: https://doi.org/10.1007/s00500-019-04492-4
