On clustering massive text and categorical data streams

  • Regular Paper
Knowledge and Information Systems

Abstract

In this paper, we study the data stream clustering problem for text and categorical data domains. While clustering has recently been studied for numeric data streams, text and categorical data present different challenges because of the large, unordered nature of the corresponding attributes. We therefore propose algorithms for clustering text and categorical data streams. Our condensation-based approach summarizes the stream into a number of fine-grained cluster droplets. These summarized droplets can be used in conjunction with a variety of user queries to construct the clusters for different input parameters, which provides an online analytical processing approach to stream clustering. We also study the problem of detecting noisy and outlier records in real time. We test the approach on a number of real and synthetic data sets and show its effectiveness relative to the baseline OSKM algorithm for stream clustering.
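The abstract does not specify what a droplet stores or how records are assigned, so the following Python fragment is only a rough sketch of the general idea of condensing a stream into many small summaries. Everything in it is an illustrative assumption rather than the authors' method: the names `Droplet` and `DropletSummarizer`, the use of additive word weights, cosine similarity as the assignment criterion, the `threshold` and `max_droplets` parameters, and the rule of evicting the smallest droplet when the budget is exceeded.

```python
import math
from collections import defaultdict


class Droplet:
    """Hypothetical summary of one fine-grained cluster (not from the paper)."""

    def __init__(self):
        self.weights = defaultdict(float)  # word or category value -> summed weight
        self.count = 0                     # number of records absorbed so far

    def add(self, record):
        # Statistics are additive, so a record is absorbed in one pass.
        for word, freq in record.items():
            self.weights[word] += freq
        self.count += 1

    def similarity(self, record):
        # Cosine similarity between the droplet summary and the record.
        dot = sum(self.weights.get(w, 0.0) * f for w, f in record.items())
        norm_d = math.sqrt(sum(v * v for v in self.weights.values()))
        norm_r = math.sqrt(sum(f * f for f in record.values()))
        if norm_d == 0.0 or norm_r == 0.0:
            return 0.0
        return dot / (norm_d * norm_r)


class DropletSummarizer:
    """Maintains at most max_droplets summaries over a stream (assumed design)."""

    def __init__(self, max_droplets=100, threshold=0.3):
        self.max_droplets = max_droplets
        self.threshold = threshold
        self.droplets = []

    def process(self, record):
        """Absorb one record; return (droplet index, flagged_as_outlier)."""
        best, best_sim = None, -1.0
        for i, droplet in enumerate(self.droplets):
            sim = droplet.similarity(record)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.droplets[best].add(record)
            return best, False
        # No droplet is close enough: flag the record and start a new droplet.
        if len(self.droplets) >= self.max_droplets:
            # Assumed eviction rule: drop the droplet with the fewest records.
            smallest = min(range(len(self.droplets)),
                           key=lambda i: self.droplets[i].count)
            self.droplets.pop(smallest)
        fresh = Droplet()
        fresh.add(record)
        self.droplets.append(fresh)
        return len(self.droplets) - 1, True


summarizer = DropletSummarizer(max_droplets=2, threshold=0.3)
first = summarizer.process({"stream": 2, "cluster": 1})
second = summarizer.process({"stream": 1, "cluster": 1})
third = summarizer.process({"soccer": 3})
```

In this sketch a record that matches no existing droplet is both flagged and used to seed a new droplet, which is one simple way to connect summarization with real-time outlier detection. Because the stored statistics are additive, droplets could later be merged into coarser clusters for a user-chosen number of clusters without revisiting the stream.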

Author information

Corresponding author

Correspondence to Charu C. Aggarwal.

About this article

Cite this article

Aggarwal, C.C., Yu, P.S. On clustering massive text and categorical data streams. Knowl Inf Syst 24, 171–196 (2010). https://doi.org/10.1007/s10115-009-0241-z

  • DOI: https://doi.org/10.1007/s10115-009-0241-z
