Abstract
Data mining on uncertain data streams has attracted considerable attention in recent years because imprecise data is generated by a wide variety of streaming applications. The main challenge of mining uncertain data streams stems from the strict space and time requirements of processing tuples that arrive at high speed. As new tuples arrive, the number of possible world instances grows exponentially with the volume of the data stream. Clustering, one of the most important mining tasks, has been studied intensively on deterministic data streams, whereas work on uncertain data streams remains rare. This paper proposes a novel solution for clustering uncertain data streams under the point probability model, in which the existence of each tuple is uncertain. Detailed analysis and thorough experiments on both synthetic and real data sets illustrate the advantages of our new method in terms of effectiveness and efficiency.
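To make the exponential growth of possible worlds concrete, the following minimal sketch enumerates every possible world of a tiny stream in which each tuple carries an existence probability. It is an illustration only, not the method proposed in the paper: the function name `possible_worlds`, the sample values, and the assumption that tuples exist independently of one another are all hypothetical choices made for this example.

```python
from itertools import product


def possible_worlds(tuples):
    """Yield (world, probability) for every possible world.

    `tuples` is a list of (value, existence_probability) pairs.
    Assumption for this sketch: tuples exist independently.
    """
    for mask in product([False, True], repeat=len(tuples)):
        prob = 1.0
        world = []
        for present, (value, p) in zip(mask, tuples):
            if present:
                world.append(value)
                prob *= p          # tuple is present in this world
            else:
                prob *= 1.0 - p    # tuple is absent from this world
        yield world, prob


# Hypothetical three-tuple stream: (value, existence probability).
stream = [(1.0, 0.9), (2.5, 0.5), (7.0, 0.2)]
worlds = list(possible_worlds(stream))

# n tuples give 2**n possible worlds, so the count doubles per arrival.
num_worlds = len(worlds)

# The world probabilities form a distribution and sum to 1.
total_probability = sum(p for _, p in worlds)

# Expected number of existing tuples, computed over all worlds.
expected_count = sum(len(w) * p for w, p in worlds)
```

Under the independence assumption, the expected count also equals the sum of the existence probabilities, which can be maintained in constant time per tuple. Shortcuts of this kind are what make stream processing feasible without enumerating worlds.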
This work is supported by the Shanghai Leading Academic Discipline Project (Project Number: B412) and the National Natural Science Foundation of China (NSFC) under grant No. 60803020.
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhang, C., Jin, C., Zhou, A. (2009). Efficiently Clustering Probabilistic Data Streams. In: Li, Q., Feng, L., Pei, J., Wang, S.X., Zhou, X., Zhu, QM. (eds) Advances in Data and Web Management. APWeb/WAIM 2009. Lecture Notes in Computer Science, vol 5446. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00672-2_25
DOI: https://doi.org/10.1007/978-3-642-00672-2_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00671-5
Online ISBN: 978-3-642-00672-2
eBook Packages: Computer Science (R0)