On clustering massive text and categorical data streams

Aggarwal, Charu C.; Yu, Philip S.

doi:10.1007/s10115-009-0241-z

Regular Paper
Published: 06 August 2009

Volume 24, pages 171–196, (2010)

Knowledge and Information Systems

Abstract

In this paper, we study the data stream clustering problem in the context of text and categorical data domains. While the clustering problem has been studied recently for numeric data streams, text and categorical data present different challenges because of the large and unordered nature of the corresponding attributes. We therefore propose algorithms for text and categorical data stream clustering. We propose a condensation-based approach that summarizes the stream into a number of fine-grained cluster droplets. These summarized droplets can be used in conjunction with a variety of user queries to construct the clusters for different input parameters, which provides an online analytical processing approach to stream clustering. We also study the problem of detecting noisy and outlier records in real time. We test the approach on a number of real and synthetic data sets, and show the effectiveness of the method over the baseline OSKM algorithm for stream clustering.
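
The abstract gives no algorithmic detail, so the following is only a rough illustration of the general idea of condensing a stream into fine-grained cluster summaries, not the authors' algorithm. It assumes each record is a bag of tokens or categorical values, that a droplet keeps only additive value counts, and that assignment uses cosine similarity against a fixed threshold. The names `Droplet`, `DropletStream`, `max_droplets`, and `threshold`, as well as the smallest-droplet eviction rule, are hypothetical choices made for this sketch.

```python
from collections import Counter
from dataclasses import dataclass, field
import math


@dataclass
class Droplet:
    """Hypothetical fine-grained cluster summary: additive value counts."""
    counts: Counter = field(default_factory=Counter)
    n: int = 0  # number of records absorbed so far

    def add(self, record):
        # Counts are additive, so absorbing a record is a cheap update.
        self.counts.update(record)
        self.n += 1

    def similarity(self, record):
        # Cosine similarity between the droplet's counts and the record's counts.
        rec = Counter(record)
        dot = sum(self.counts[t] * c for t, c in rec.items())
        norm_d = math.sqrt(sum(v * v for v in self.counts.values()))
        norm_r = math.sqrt(sum(c * c for c in rec.values()))
        return dot / (norm_d * norm_r) if norm_d and norm_r else 0.0


class DropletStream:
    """One-pass summarizer (assumed design, not taken from the paper)."""

    def __init__(self, max_droplets=100, threshold=0.3):
        self.max_droplets = max_droplets  # memory budget (assumption)
        self.threshold = threshold        # minimum similarity to join a droplet
        self.droplets = []

    def process(self, record):
        """Absorb one record; return (droplet index, is_novel)."""
        best, best_sim = None, -1.0
        for i, d in enumerate(self.droplets):
            s = d.similarity(record)
            if s > best_sim:
                best, best_sim = i, s
        if best is not None and best_sim >= self.threshold:
            self.droplets[best].add(record)
            return best, False
        # The record fits no droplet well: flag it as a possible outlier
        # and start a new droplet, evicting the smallest one if full.
        if len(self.droplets) >= self.max_droplets:
            smallest = min(range(len(self.droplets)),
                           key=lambda i: self.droplets[i].n)
            self.droplets.pop(smallest)
        d = Droplet()
        d.add(record)
        self.droplets.append(d)
        return len(self.droplets) - 1, True
```

Because the summaries are small and additive, a later offline step could in principle group the droplets themselves under different user-supplied parameters without revisiting the raw stream, which is the kind of query-time flexibility the abstract describes.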

Author information

Authors and Affiliations

IBM T. J. Watson Research Center, 19 Skyline Drive, Hawthorne, NY, 10532, USA
Charu C. Aggarwal
University of Illinois at Chicago, Chicago, IL, USA
Philip S. Yu

Corresponding author

Correspondence to Charu C. Aggarwal.

About this article

Cite this article

Aggarwal, C.C., Yu, P.S. On clustering massive text and categorical data streams. Knowl Inf Syst 24, 171–196 (2010). https://doi.org/10.1007/s10115-009-0241-z

Received: 08 September 2008
Revised: 31 May 2009
Accepted: 20 June 2009
Published: 06 August 2009
Issue Date: August 2010
DOI: https://doi.org/10.1007/s10115-009-0241-z
