Abstract
Clustering uncertain data streams has recently become one of the most challenging tasks in data management because of the strict space and time requirements of processing tuples that arrive at high speed and the difficulty of handling uncertain data. Prior work on clustering data streams focuses on devising complicated synopsis data structures that summarize a data stream into a small number of micro-clusters so that important statistics can be computed conveniently, such as the Clustering Feature (CF) (Zhang et al. in Proceedings of ACM SIGMOD, pp 103–114, 1996) for deterministic data and the Error-based Clustering Feature (ECF) (Aggarwal and Yu in Proceedings of ICDE, 2008) for uncertain data. However, ECF can only handle attribute-level uncertainty; existential uncertainty, the other kind of uncertainty, has not been addressed yet. In this paper, we propose a novel data structure, the Uncertain Feature (UF), to summarize data streams with both kinds of uncertainty: UF is space-efficient, has additive and subtractive properties, and supports convenient computation of complicated statistics. Our first attempt enhances previous streaming approaches, including CluStream (Aggarwal et al. in Proceedings of VLDB, 2003) and UMicro (Aggarwal and Yu in Proceedings of ICDE, 2008), to handle the sliding-window model by replacing their old synopses with UF. We show that such methods cannot achieve high efficiency. Our second attempt devises a novel algorithm, cluUS, that handles the sliding-window model by using the UF structure. Detailed analysis and thorough experimental reports on synthetic and real data sets confirm the advantages of our proposed method.
Notes
We say a tuple is absorbed by a micro-cluster only if it joins that micro-cluster.
References
Aggarwal CC (2009) Managing and mining uncertain data. Springer, Berlin
Aggarwal CC (2009) On high dimensional projected clustering of uncertain data streams. In: Proceedings of ICDE, pp 1152–1154
Alex N, Hasenfuss A, Hammer B (2009) Patch clustering for massive data sets. Neurocomputing 72:1455–1469
Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of VLDB
Aggarwal CC, Yu PS (2008) A framework for clustering uncertain data streams. In: Proceedings of ICDE
Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of ACM SIGACT-SIGMOD symposium on principles of database systems
Burdick D, Deshpande PM, Jayram T, Ramakrishnan R, Vaithyanathan S (2005) OLAP over uncertain and imprecise data. In: Proceedings of VLDB
Babcock B, Datar M, Motwani R, O’Callaghan L (2003) Maintaining variance and k-medians over data stream windows. In: Proceedings of ACM PODS
Benjelloun O, Sarma AD, Halevy AY, Widom J (2006) ULDBs: databases with uncertainty and lineage. In: Proceedings of VLDB
Chau M, Cheng R, Kao B, Ng J (2006) Uncertain data mining: An example in clustering location data. In: Proceedings of PAKDD
Cormode G, Garofalakis M (2007) Sketching probabilistic data streams. In: Proceedings of ACM SIGMOD
Cormode G, McGregor A (2008) Approximation algorithms for clustering uncertain data. In: Proceedings of PODS
Dunn JC (1973) A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. J Cybern 3:32–57
Dalvi NN, Suciu D (2004) Efficient query evaluation on probabilistic databases. In: Proceedings of ICDE
Dalvi NN, Suciu D (2007) Management of probabilistic data foundations and challenges. In: Proceedings of ACM PODS
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD
Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Proceedings of SIGMOD, pp 73–84
Hochbaum D, Shmoys D (1985) A best possible heuristic for the k-center problem. Math Oper Res 10(2):180–184
Jain A, Dubes R (1988) Algorithms for clustering data. Prentice Hall, New Jersey
Jayram T, Kale S, Vee E (2007) Efficient aggregation algorithms for probabilistic data. In: Proceedings of SODA
Jin C, Yi K, Chen L, Yu JX, Lin X (2008) Sliding-window top-k queries on uncertain streams. Proc VLDB Endow 1(1):301–312
Kao B, Lee SD, Cheung DW, Ho W-S, Chan KF (2008) Clustering uncertain data using Voronoi diagrams. In: Proceedings of ICDM, pp 333–342
Kriegel H-P, Pfeifle M (2005) Density-based clustering of uncertain data. In: Proceedings of KDD
Kriegel H-P, Pfeifle M (2005) Hierarchical density-based clustering of uncertain data. In: Proceedings of ICDM
Lee SD, Kao B, Cheng R (2007) Reducing UK-means to K-means. In: Proceedings of ICDM workshops, pp 483–488
Ngai WK, Kao B, Chui CK, Cheng R, Chau M, Yip KY (2006) Efficient clustering of uncertain data. In: Proceedings of ICDM
O’Callaghan L, Mishra N, Meyerson A, Guha S (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of ICDE
Pelekis N, Kopanakis I, Kotsifakos EK, Frentzos E, Theodoridis Y (2011) Clustering uncertain trajectories. Knowl Inf Syst 28(1):117–147
Tao Y, Cheng R, Xiao X, Ngai WK, Kao B, Prabhakar S (2005) Indexing multi-dimensional uncertain data with arbitrary probability density functions. In: Proceedings of VLDB
Xin D, Halevy AY, Yu C (2007) Data integration with uncertainty. In: Proceedings of VLDB
Zhang M, Chen S, Jensen CS, Ooi BC, Zhang Z (2009) Effectively indexing uncertain moving objects for predictive queries. In: Proceedings of VLDB
Zhang Q, Li F, Yi K (2008) Finding frequent items in probabilistic data. In: Proceedings of SIGMOD
Zhang W, Lin X, Zhang Y, Wang W, Yu JX (2009) Probabilistic skyline operator over sliding windows. In: Proceedings of ICDE
Zhang Y, Lin X, Zhu G, Zhang W, Lin Q (2010) Efficient rank-based kNN query processing over uncertain data. In: Proceedings of ICDE
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of ACM SIGMOD, pp 103–114
Acknowledgments
Cheqing Jin is supported by the 973 program of China (No. 2012CB316203) and NSFC (No. 60933001 and 61070052). Aoying Zhou is supported by NSFC (No. 60925008), the 973 program of China (No. 2010CB731402) and the National High Technology Research and Development Program (863) (No. 2012AA011003). We thank Mr. Xiaofeng Xu for his efforts on the experiments.
Appendix
This appendix provides supplementary details omitted from the main text.
1.1 Five kinds of distances between two deterministic clusters
Let \(\{{\overrightarrow{x}}_i\}\) and \(\{{\overrightarrow{x}}_j\}\) be two clusters of sizes \(N_1\) and \(N_2\), respectively, where \(i = 1, 2, \ldots , N_1\) and \(j = N_1+1, N_1+2, \ldots , N_1 + N_2\). Let \(\overrightarrow{C}_1\) and \(\overrightarrow{C}_2\) denote the centroids of two clusters. The centroid Euclidean distance (\(D_0\)), the centroid Manhattan distance (\(D_1\)), the average inter-cluster distance (\(D_2\)), the average intra-cluster distance (\(D_3\)), and the variance increase distance (\(D_4\)), are defined below.
Here, \(\overrightarrow{C}_k^{(t)}\) indicates the \(t\)th element of the \(d\)-dimensional vector \(\overrightarrow{C}_k\), and \(D_3\) is the root mean square distance of the cluster merged from \(\{{\overrightarrow{x}}_i\}\) and \(\{{\overrightarrow{x}}_j\}\).
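The five distances can be computed directly from the clustering-feature summary of each cluster rather than from the raw tuples. The sketch below assumes the BIRCH-style CF triple \((N, \overrightarrow{LS}, SS)\) (tuple count, linear sum, and sum of squared norms); since the display equations are not reproduced in this excerpt, the closed forms follow the standard BIRCH definitions and should be checked against the paper's Eqs. for \(D_0\)–\(D_4\).

```python
import numpy as np

def distances(N1, LS1, SS1, N2, LS2, SS2):
    """D0-D4 between two deterministic clusters, from CF triples (N, LS, SS).

    LS is the linear sum (a d-dim vector); SS is the scalar sum of squared
    norms. Definitions follow BIRCH (Zhang et al. 1996).
    """
    C1, C2 = LS1 / N1, LS2 / N2
    D0 = np.linalg.norm(C1 - C2)            # centroid Euclidean distance
    D1 = np.abs(C1 - C2).sum()              # centroid Manhattan distance
    # average inter-cluster distance (RMS over all cross-cluster pairs)
    D2 = np.sqrt(SS1 / N1 + SS2 / N2 - 2 * LS1.dot(LS2) / (N1 * N2))
    # average intra-cluster distance (RMS over all pairs in the merged cluster)
    N, LS, SS = N1 + N2, LS1 + LS2, SS1 + SS2
    D3 = np.sqrt((2 * N * SS - 2 * LS.dot(LS)) / (N * (N - 1)))
    # variance increase: SSQ(merged) - SSQ(cluster1) - SSQ(cluster2)
    ssq = lambda n, ls, ss: ss - ls.dot(ls) / n
    D4 = ssq(N, LS, SS) - ssq(N1, LS1, SS1) - ssq(N2, LS2, SS2)
    return D0, D1, D2, D3, D4
```

All five values cost \(O(d)\) time, independent of the cluster sizes, which is what makes CF-style synopses attractive for streams.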
1.2 How to minimize PSSQ value
Theorem 3.2
For a cluster \(\{{\overrightarrow{x}}_1\ldots {\overrightarrow{x}}_N\}\), the minimal value of PSSQ is \(\sum _{i=1}^N E_{i,2}-\frac{(\sum _{i=1}^N{\overrightarrow{E}}_{i,1})^2}{\sum _{i=1}^NPr_i}\), where centroid \({\overrightarrow{\mathcal{C }}}\) is \(\frac{\sum _{i=1}^N{\overrightarrow{E}}_{i,1}}{\sum _{i=1}^NPr_i}\).
Proof
In Eq. (25), \({\overrightarrow{\mathcal{C }}}^{(j)}\) and \({\overrightarrow{E}}_{i,1}^{(j)}\) denote the \(j\)th entries in the vectors \({\overrightarrow{\mathcal{C }}}\) and \({\overrightarrow{E}}_{i,1}\), respectively. Moreover, \({\overrightarrow{\mathcal{C }}}^{(1)}\ldots {\overrightarrow{\mathcal{C }}}^{(d)}\) can be treated as \(d\) independent variables. For each variable \({\overrightarrow{\mathcal{C }}}^{(j)}\) (also denoted as \(x\)), let \(a=\sum _{i=1}^N Pr_i\) and \(b=\sum _{i=1}^N{\overrightarrow{E}}_{i,1}^{(j)}\); it is easy to verify that \(\min (ax^2-2bx)=-\frac{b^2}{a}\), attained at \(x=\frac{b}{a}\). Hence, the value of PSSQ is minimized when \(\forall j, 1\le j\le d\), \({\overrightarrow{\mathcal{C }}}^{(j)}=\frac{\sum _{i=1}^N{\overrightarrow{E}}_{i,1}^{(j)}}{\sum _{i=1}^N Pr_i}\).
Putting all of the above together, we have:
subject to
\(\square \)
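Theorem 3.2 gives a closed form for both the optimal centroid and the minimal PSSQ from three aggregates. A minimal sketch, assuming each tuple \(i\) contributes its existential probability \(Pr_i\), first-moment vector \({\overrightarrow{E}}_{i,1}\), and second moment \(E_{i,2}\) (the array names are illustrative):

```python
import numpy as np

def min_pssq(Pr, E1, E2):
    """Centroid and minimal PSSQ per Theorem 3.2.

    Pr: (N,) existential probabilities
    E1: (N, d) first-moment vectors E_{i,1}
    E2: (N,) second moments E_{i,2}
    """
    P = Pr.sum()                 # sum of existential probabilities
    L1 = E1.sum(axis=0)          # sum of first-moment vectors
    centroid = L1 / P            # optimal centroid  sum(E1) / sum(Pr)
    pssq = E2.sum() - L1.dot(L1) / P   # minimal PSSQ
    return centroid, pssq
```

For deterministic tuples (\(Pr_i=1\), \({\overrightarrow{E}}_{i,1}={\overrightarrow{x}}_i\), \(E_{i,2}=\Vert {\overrightarrow{x}}_i\Vert ^2\)) this reduces to the ordinary mean and SSQ, which is a useful sanity check.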
1.3 Computing sophisticated statistics by using UFs
The statistics for a pair of clusters (i.e., \(\mathcal{D }_0 - \mathcal{D }_4\)) are more sophisticated. Given two UFs (\(UF_1\) and \(UF_2\)), we can still compute them efficiently.
\(\mathcal{D }_0\) and \(\mathcal{D }_1\) are the Euclidean and Manhattan distances between the centroids of a pair of clusters, respectively. Note that \({\overrightarrow{x}}_i^{(t)}\) denotes the \(t\)th dimension of the vector \({\overrightarrow{x}}_i\). So,
\(\mathcal{D }_2\) and \(\mathcal{D }_3\) represent the average inter-cluster and intra-cluster distances respectively. We consider the co-existing confidence, \(Pr_{i,j}\), of a pair of tuples, \({\overrightarrow{x}}_i\) and \({\overrightarrow{x}}_j\). Thus,
In deterministic data environments, \(D_4\) represents the increase of SSQ when two clusters merge. Analogously, we redefine \(\mathcal{D }_4\) as the increase of PSSQ when two clusters merge in uncertain data environments. Let \({\overrightarrow{\mathcal{C }}}_0\) denote the centroid of the merged cluster. According to Theorem 3.2, \({\overrightarrow{\mathcal{C }}}_0=\frac{{\overrightarrow{L}}_1+{\overrightarrow{L}}_2}{P_1+P_2}\).
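Because UF is additive, \(\mathcal{D }_4\) needs only the component-wise sums of the two synopses. A sketch under the assumption that each UF is summarized by \((P, {\overrightarrow{L}}, S)\): the sum of existential probabilities, the sum of the first-moment vectors \({\overrightarrow{E}}_{i,1}\), and the sum of the second moments \(E_{i,2}\) (the paper's UF may carry further fields):

```python
import numpy as np

def d4_increase(P1, L1, S1, P2, L2, S2):
    """Increase of PSSQ when two uncertain clusters merge (the D4 analogue).

    Each cluster is summarized by (P, L, S) as described in the lead-in;
    by Theorem 3.2, PSSQ at the optimal centroid is S - ||L||^2 / P.
    """
    pssq = lambda P, L, S: S - L.dot(L) / P
    P0, L0, S0 = P1 + P2, L1 + L2, S1 + S2   # additive property of UF
    return pssq(P0, L0, S0) - pssq(P1, L1, S1) - pssq(P2, L2, S2)
```

As with \(D_4\) on CF triples, the cost is \(O(d)\) regardless of how many tuples each synopsis absorbed.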
1.4 The support of the probabilistic discrete distribution
This paper also studies the case where each tuple is described by a discrete probability distribution function. Assume a tuple \({\overrightarrow{x}}_i\) has \(s_i\) candidate values, denoted as \({\overrightarrow{x}}_{i,1}, \ldots , {\overrightarrow{x}}_{i,s_i}\), with \(Pr[{\overrightarrow{x}}_i = {\overrightarrow{x}}_{i,j}] = P_{i,j}\) for all \(1\le j\le s_i\). Let \(Pr_i\) denote the sum of the existential probabilities of \({\overrightarrow{x}}_i\), i.e., \(Pr_i=\sum _{j=1}^{s_i}P_{i,j}\). Note that the value of \(Pr_i\) can be smaller than 1, which means the tuple \({\overrightarrow{x}}_i\) still has probability \(1-Pr_i\) of taking a value outside the domain, i.e., \(Pr[{\overrightarrow{x}}_i =\bot ]=1-Pr_i\), where \(\bot \) is a virtual value outside the domain.
Now, \({\overrightarrow{E}}_{i, 1}\) and \(E_{i,2}\) are rewritten below without modifying the semantics: (i) \({\overrightarrow{E}}_{i,1}=\sum _{j=1}^{s_i}P_{i,j}\cdot {\overrightarrow{x}}_{i,j}\), (ii) \(E_{i,2}=\sum _{j=1}^{s_i}P_{i,j}\cdot {\overrightarrow{x}}_{i,j}^2\).
When both tuples, \({\overrightarrow{x}}_i\) and \({\overrightarrow{x}}_j\), are described by discrete probability distribution functions, the expected squared distance can also be computed in a similar way.
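For two tuples with discrete pdfs, the expected squared distance is a probability-weighted sum over all candidate pairs. The sketch below assumes the two tuples are independent, so the co-existing confidence of a candidate pair factors as \(P_{i,a}\cdot P_{j,b}\):

```python
import numpy as np

def expected_sq_dist(vals_i, probs_i, vals_j, probs_j):
    """Expected squared distance between two independent discrete tuples.

    Sums P_{i,a} * P_{j,b} * ||x_{i,a} - x_{j,b}||^2 over all candidate
    pairs (a, b); pairs where either tuple is absent contribute nothing.
    """
    vi, vj = np.asarray(vals_i, float), np.asarray(vals_j, float)
    pi, pj = np.asarray(probs_i, float), np.asarray(probs_j, float)
    diff = vi[:, None, :] - vj[None, :, :]        # (s_i, s_j, d) differences
    sq = (diff ** 2).sum(axis=2)                  # pairwise squared distances
    return (pi[:, None] * pj[None, :] * sq).sum()
```

The double sum costs \(O(s_i s_j d)\), which is why summarizing tuples into moment-based synopses first is preferable on a stream.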
Jin, C., Yu, J.X., Zhou, A. et al. Efficient clustering of uncertain data streams. Knowl Inf Syst 40, 509–539 (2014). https://doi.org/10.1007/s10115-013-0657-3