
Efficient clustering of uncertain data streams


Abstract

Clustering uncertain data streams has recently become one of the most challenging tasks in data management because of the strict space and time requirements of processing tuples that arrive at high speed and the difficulty of handling uncertain data. Prior work on clustering data streams focuses on devising complicated synopsis data structures that summarize a stream into a small number of micro-clusters from which important statistics can be computed conveniently, such as the Clustering Feature (CF) (Zhang et al. in Proceedings of ACM SIGMOD, pp 103–114, 1996) for deterministic data and the Error-based Clustering Feature (ECF) (Aggarwal and Yu in Proceedings of ICDE, 2008) for uncertain data. However, ECF can only handle attribute-level uncertainty; existential uncertainty, the other kind of uncertainty, has not been addressed yet. In this paper, we propose a novel data structure, the Uncertain Feature (UF), to summarize data streams with both kinds of uncertainty: UF is space-efficient, has additive and subtractive properties, and supports the easy computation of complicated statistics. Our first attempt enhances previous streaming approaches, including CluStream (Aggarwal et al. in Proceedings of VLDB, 2003) and UMicro (Aggarwal and Yu in Proceedings of ICDE, 2008), to handle the sliding-window model by replacing their old synopses with UF. We show that such methods cannot achieve high efficiency. Our second attempt devises a novel algorithm, cluUS, to handle the sliding-window model by using the UF structure. Detailed analysis and thorough experimental reports on synthetic and real data sets confirm the advantages of our proposed method.



Notes

  1. http://nsidc.org/data/g00807.html.

  2. We use \({\overrightarrow{x}}_i\) to denote a deterministic tuple in Sect. 2.1 and an uncertain tuple from Sect. 2.2 onward.

  3. We say that a tuple is absorbed by a micro-cluster only if it joins that micro-cluster.

  4. http://nsidc.org/data/g00807.html.

  5. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.

  6. http://kdd.ics.uci.edu/databases/covertype/covertype.html.

  7. http://mathworld.wolfram.com/NormalDistribution.html.

References

  1. Aggarwal CC (2009) Managing and mining uncertain data. Springer, Berlin

  2. Aggarwal CC (2009) On high dimensional projected clustering of uncertain data streams. In: Proceedings of ICDE, pp 1152–1154

  3. Alex N, Hasenfuss A, Hammer B (2009) Patch clustering for massive data sets. Neurocomputing 72:1455–1469


  4. Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of VLDB

  5. Aggarwal CC, Yu PS (2008) A framework for clustering uncertain data streams. In: Proceedings of ICDE

  6. Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of ACM SIGACT-SIGMOD symposium on principles of database systems

  7. Burdick D, Deshpande PM, Jayram T, Ramakrishnan R, Vaithyanathan S (2005) OLAP over uncertain and imprecise data. In: Proceedings of VLDB

  8. Babcock B, Datar M, Motwani R, O’Callaghan L (2003) Maintaining variance and k-medians over data stream windows. In: Proceedings of ACM PODS

  9. Benjelloun O, Sarma AD, Halevy AY, Widom J (2006) ULDBs: databases with uncertainty and lineage. In: Proceedings of VLDB

  10. Chau M, Cheng R, Kao B, Ng J (2006) Uncertain data mining: an example in clustering location data. In: Proceedings of PAKDD

  11. Cormode G, Garofalakis M (2007) Sketching probabilistic data streams. In: Proceedings of ACM SIGMOD

  12. Cormode G, McGregor A (2008) Approximation algorithms for clustering uncertain data. In: Proceedings of PODS

  13. Dunn JC (1973) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3:32–57


  14. Dalvi NN, Suciu D (2004) Efficient query evaluation on probabilistic databases. In: Proceedings of ICDE

  15. Dalvi NN, Suciu D (2007) Management of probabilistic data foundations and challenges. In: Proceedings of ACM PODS

  16. Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD

  17. Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Proceedings of SIGMOD, pp 73–84

  18. Hochbaum D, Shmoys D (1985) A best possible heuristic for the k-center problem. Math Oper Res 10(2):180–184


  19. Jain A, Dubes R (1988) Algorithms for clustering data. Prentice Hall, New Jersey

  20. Jayram T, Kale S, Vee E (2007) Efficient aggregation algorithms for probabilistic data. In: Proceedings of SODA

  21. Jin C, Yi K, Chen L, Yu JX, Lin X (2008) Sliding-window top-k queries on uncertain streams. Proc VLDB Endow 1(1):301–312


  22. Kao B, Lee SD, Cheung DW, Ho W-S, Chan KF (2008) Clustering uncertain data using Voronoi diagrams. In: Proceedings of ICDM, pp 333–342

  23. Kriegel H-P, Pfeifle M (2005) Density-based clustering of uncertain data. In: Proceedings of KDD

  24. Kriegel H-P, Pfeifle M (2005) Hierarchical density-based clustering of uncertain data. In: Proceedings of ICDM

  25. Lee SD, Kao B, Cheng R (2007) Reducing UK-means to K-means. In: Proceedings of ICDM workshops, pp 483–488

  26. Ngai WK, Kao B, Chui CK, Cheng R, Chau M, Yip KY (2006) Efficient clustering of uncertain data. In: Proceedings of ICDM

  27. O’Callaghan L, Mishra N, Meyerson A, Guha S (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of ICDE

  28. Pelekis N, Kopanakis I, Kotsifakos EK, Frentzos E, Theodoridis Y (2011) Clustering uncertain trajectories. Knowl Inf Syst 28(1):117–147


  29. Tao Y, Cheng R, Xiao X, Ngai WK, Kao B, Prabhakar S (2005) Indexing multi-dimensional uncertain data with arbitrary probability density functions. In: Proceedings of VLDB

  30. Xin D, Halevy AY, Yu C (2007) Data integration with uncertainty. In: Proceedings of VLDB

  31. Zhang M, Chen S, Jensen CS, Ooi BC, Zhang Z (2009) Effectively indexing uncertain moving objects for predictive queries. In: Proceedings of VLDB

  32. Zhang Q, Li F, Yi K (2008) Finding frequent items in probabilistic data. In: Proceedings of SIGMOD

  33. Zhang W, Lin X, Zhang Y, Wang W, Yu JX (2009) Probabilistic skyline operator over sliding windows. In: Proceedings of ICDE

  34. Zhang Y, Lin X, Zhu G, Zhang W, Lin Q (2010) Efficient rank-based kNN query processing over uncertain data. In: Proceedings of ICDE

  35. Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of ACM SIGMOD, pp 103–114


Acknowledgments

Cheqing Jin is supported by the 973 program of China (No. 2012CB316203) and NSFC (Nos. 60933001 and 61070052). Aoying Zhou is supported by NSFC (No. 60925008), the 973 program of China (No. 2010CB731402), and the National High Technology Research and Development Program of China (863 Program) (No. 2012AA011003). We thank Mr. Xiaofeng Xu for his efforts on the experiments.

Author information


Corresponding author

Correspondence to Aoying Zhou.

Appendix

This appendix provides supplementary details: the five distances between two deterministic clusters, the minimization of the PSSQ value, the computation of sophisticated statistics from UFs, and the support for discrete probability distributions.

1.1 Five kinds of distances between two deterministic clusters

Let \(\{{\overrightarrow{x}}_i\}\) and \(\{{\overrightarrow{x}}_j\}\) be two clusters of sizes \(N_1\) and \(N_2\), respectively, where \(i = 1, 2, \ldots , N_1\) and \(j = N_1+1, N_1+2, \ldots , N_1 + N_2\). Let \(\overrightarrow{C}_1\) and \(\overrightarrow{C}_2\) denote the centroids of the two clusters. The centroid Euclidean distance (\(D_0\)), the centroid Manhattan distance (\(D_1\)), the average inter-cluster distance (\(D_2\)), the average intra-cluster distance (\(D_3\)), and the variance increase distance (\(D_4\)) are defined below.

$$\begin{aligned} D_0&= \left( (\overrightarrow{C}_1-\overrightarrow{C}_2)^2\right) ^{\frac{1}{2}}\end{aligned}$$
(20)
$$\begin{aligned} D_1&= \left| \overrightarrow{C}_1-\overrightarrow{C}_2\right| =\sum _{t=1}^d\left| \overrightarrow{C}_1^{(t)}-\overrightarrow{C}_2^{(t)}\right| \end{aligned}$$
(21)
$$\begin{aligned} D_2&= \left( \frac{\sum _{i=1}^{N_1} \sum _{j=N_1+1}^{N_1+N_2}({\overrightarrow{x}}_i-{\overrightarrow{x}}_j)^2}{N_1N_2}\right) ^{\frac{1}{2}} \end{aligned}$$
(22)
$$\begin{aligned} D_3&= \left( \frac{\sum _{i=1}^{N_1+N_2} \sum _{j=1}^{N_1+N_2}({\overrightarrow{x}}_i-{\overrightarrow{x}}_j)^2}{(N_1+N_2)(N_1+N_2-1)}\right) ^{\frac{1}{2}} \end{aligned}$$
(23)
$$\begin{aligned} D_4&= \sum _{k=1}^{N_1+N_2} \left( {\overrightarrow{x}}_k-\frac{\sum _{k=1}^{N_1+N_2}{\overrightarrow{x}}_k}{N_1+N_2}\right) ^2-\sum _{i=1}^{N_1} \left( {\overrightarrow{x}}_i-\frac{\sum _{i=1}^{N_1}{\overrightarrow{x}}_i}{N_1}\right) ^2\nonumber \\&-\sum _{j=N_1+1}^{N_1+N_2}\left( {\overrightarrow{x}}_j-\frac{\sum _{j=N_1+1}^{N_1+N_2}{\overrightarrow{x}}_j}{N_2}\right) ^2 \end{aligned}$$
(24)

Here, \(\overrightarrow{C}_k^{(t)}\) indicates the \(t\)th element of the \(d\)-dimensional vector \(\overrightarrow{C}_k\), and \(D_3\) is the root mean square distance of the cluster merged from \(\{{\overrightarrow{x}}_i\}\) and \(\{{\overrightarrow{x}}_j\}\).
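
To make Eqs. (20)–(24) concrete, here is a minimal Python sketch (illustrative only, not the paper's implementation; the function name and NumPy layout are ours) that evaluates \(D_0\)–\(D_4\) by brute force from the raw points of two clusters.

```python
# Illustrative only: evaluates Eqs. (20)-(24) by brute force from raw points.
import numpy as np

def distances(X1: np.ndarray, X2: np.ndarray) -> dict:
    """X1: (N1, d) points of cluster 1; X2: (N2, d) points of cluster 2."""
    N1, N2 = len(X1), len(X2)
    C1, C2 = X1.mean(axis=0), X2.mean(axis=0)
    D0 = np.sqrt(((C1 - C2) ** 2).sum())          # centroid Euclidean, Eq. (20)
    D1 = np.abs(C1 - C2).sum()                    # centroid Manhattan, Eq. (21)
    # average inter-cluster distance over all cross pairs, Eq. (22)
    cross = X1[:, None, :] - X2[None, :, :]
    D2 = np.sqrt((cross ** 2).sum() / (N1 * N2))
    # average intra-cluster distance of the merged cluster, Eq. (23);
    # i = j pairs contribute 0, hence the (N1+N2)(N1+N2-1) denominator
    X = np.vstack([X1, X2])
    intra = X[:, None, :] - X[None, :, :]
    D3 = np.sqrt((intra ** 2).sum() / ((N1 + N2) * (N1 + N2 - 1)))
    # variance increase caused by merging, Eq. (24)
    sq = lambda Y: ((Y - Y.mean(axis=0)) ** 2).sum()
    D4 = sq(X) - sq(X1) - sq(X2)
    return {"D0": D0, "D1": D1, "D2": D2, "D3": D3, "D4": D4}
```

For example, `distances(np.zeros((3, 2)), np.ones((4, 2)))` yields \(D_0=\sqrt{2}\) and \(D_1=2\).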

1.2 How to minimize the PSSQ value

Theorem 3.2

For a cluster \(\{{\overrightarrow{x}}_1\ldots {\overrightarrow{x}}_N\}\), the minimal value of PSSQ is \(\sum _{i=1}^N E_{i,2}-\frac{(\sum _{i=1}^N{\overrightarrow{E}}_{i,1})^2}{\sum _{i=1}^NPr_i}\), attained when the centroid \({\overrightarrow{\mathcal{C }}}\) is \(\frac{\sum _{i=1}^N{\overrightarrow{E}}_{i,1}}{\sum _{i=1}^NPr_i}\).

Proof

$$\begin{aligned} \text{ PSSQ }&= \sum _{i=1}^NPr_i\cdot ES({\overrightarrow{x}}_i,{\overrightarrow{\mathcal{C }}}) \nonumber \\&= \sum _{i=1}^N E_{i,2} -2\left( \sum _{i=1}^N{\overrightarrow{E}}_{i,1}\right) {\overrightarrow{\mathcal{C }}} +\left( \sum _{i=1}^N Pr_i\right) {\overrightarrow{\mathcal{C }}}^2\nonumber \\&= \sum _{i=1}^N E_{i,2} + \sum _{j=1}^d\left( \left( \sum _{i=1}^N Pr_i\right) {\overrightarrow{\mathcal{C }}}^{(j)}{\overrightarrow{\mathcal{C }}}^{(j)}-2\left( \sum _{i=1}^N{\overrightarrow{E}}_{i,1}^{(j)}\right) {\overrightarrow{\mathcal{C }}}^{(j)}\right) \end{aligned}$$
(25)

In Eq. (25), \({\overrightarrow{\mathcal{C }}}^{(j)}\) and \({\overrightarrow{E}}_{i,1}^{(j)}\) denote the \(j\)th entries in vectors \({\overrightarrow{\mathcal{C }}}\) and \({\overrightarrow{E}}_{i,1}\), respectively. Moreover, \({\overrightarrow{\mathcal{C }}}^{(1)}\ldots {\overrightarrow{\mathcal{C }}}^{(d)}\) can be treated as \(d\) independent variables. For each variable \({\overrightarrow{\mathcal{C }}}^{(j)}\) (also denoted as \(x\)), let \(a=\sum _{i=1}^N Pr_i\) and \(b=\sum _{i=1}^N{\overrightarrow{E}}_{i,1}^{(j)}\); it is easy to verify that \(\min (ax^2-2bx)=-\frac{b^2}{a}\), attained at \(x=\frac{b}{a}\). Hence, the value of PSSQ is minimized when \(\forall j, 1\le j\le d\): \({\overrightarrow{\mathcal{C }}}^{(j)}=\frac{\sum _{i=1}^N{\overrightarrow{E}}_{i,1}^{(j)}}{\sum _{i=1}^N Pr_i}\).

Putting all of the above together, we have:

$$\begin{aligned} \min (\text{ PSSQ })=\sum _{i=1}^N E_{i,2}-\frac{(\sum _{i=1}^N{\overrightarrow{E}}_{i,1})^2}{\sum _{i=1}^NPr_i} \end{aligned}$$
(26)

subject to

$$\begin{aligned} {\overrightarrow{\mathcal{C }}} = \frac{\sum _{i=1}^N{\overrightarrow{E}}_{i,1}}{\sum _{i=1}^NPr_i} \end{aligned}$$
(27)

\(\square \)
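
The argument can be checked numerically. The sketch below is ours (variable names and data are illustrative); it assumes the discrete model of Sect. 1.4, computing \({\overrightarrow{E}}_{i,1}\), \(E_{i,2}\), and \(Pr_i\) from candidate values and probabilities, and confirms that PSSQ at the claimed centroid matches Eq. (26) while any perturbation of the centroid increases PSSQ.

```python
# Numeric sanity check of Theorem 3.2 (illustrative sketch). Tuples follow
# the discrete model of Sect. 1.4: tuple i takes value x[i, j] w.p. p[i, j].
import numpy as np

rng = np.random.default_rng(0)
N, s, d = 5, 3, 2                        # tuples, candidates per tuple, dims
x = rng.normal(size=(N, s, d))           # candidate values
p = rng.uniform(0.1, 0.3, size=(N, s))   # candidate probabilities (may sum to < 1)

Pr = p.sum(axis=1)                       # Pr_i: existential probability
E1 = np.einsum('ns,nsd->nd', p, x)       # E_{i,1} = sum_j P_{i,j} * x_{i,j}
E2 = np.einsum('ns,nsd->n', p, x ** 2)   # E_{i,2} = sum_j P_{i,j} * x_{i,j}^2
L, P = E1.sum(axis=0), Pr.sum()

def pssq(C):
    # PSSQ(C) = sum_i Pr_i * ES(x_i, C), expanded as in Eq. (25)
    return E2.sum() - 2 * L @ C + P * (C @ C)

C_star = L / P                           # centroid claimed by Theorem 3.2
assert np.isclose(pssq(C_star), E2.sum() - (L @ L) / P)   # Eq. (26)
assert pssq(C_star + 0.1) > pssq(C_star)                  # perturbation only hurts
```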

1.3 Computing sophisticated statistics by using UFs 

The statistics for a pair of clusters (i.e., \(\mathcal{D }_0\)–\(\mathcal{D }_4\)) are more sophisticated. Given two UFs (\(UF_1\) and \(UF_2\)), we can still compute them efficiently.

\(\mathcal{D }_0\) and \(\mathcal{D }_1\) are the Euclidean and Manhattan distances between the centroids of a pair of clusters, respectively. Note that \({\overrightarrow{x}}_i^{(t)}\) denotes the \(t\)th dimension of the vector \({\overrightarrow{x}}_i\). So,

$$\begin{aligned} \mathcal{D }_0&= (({\overrightarrow{\mathcal{C }}}_1-{\overrightarrow{\mathcal{C }}}_2)^2)^{\frac{1}{2}}=\left( \left( \frac{{\overrightarrow{L}}_1}{P_1}-\frac{{\overrightarrow{L}}_2}{P_2}\right) ^2\right) ^{\frac{1}{2}} \end{aligned}$$
(28)
$$\begin{aligned} \mathcal{D }_1&= \left| {\overrightarrow{\mathcal{C }}}_1 - {\overrightarrow{\mathcal{C }}}_2\right| = \sum _{t=1}^d\left| \left( \frac{{\overrightarrow{L}}_1}{P_1}\right) ^{(t)}-\left( \frac{{\overrightarrow{L}}_2}{P_2}\right) ^{(t)}\right| \end{aligned}$$
(29)

\(\mathcal{D }_2\) and \(\mathcal{D }_3\) represent the average inter-cluster and intra-cluster distances, respectively. We consider the co-existing confidence, \(Pr_{i,j}\), of a pair of tuples, \({\overrightarrow{x}}_i\) and \({\overrightarrow{x}}_j\). Thus,

$$\begin{aligned} \mathcal{D }_2^2&= \frac{\sum _{i=1}^{N_1}\sum _{j=N_1+1}^{N_1+N_2}Pr_{i,j}\cdot ES({\overrightarrow{x}}_i,{\overrightarrow{x}}_j)}{\sum _{i=1}^{N_1}\sum _{j=N_1+1}^{N_1+N_2}Pr_{i,j}} =\frac{P_2\cdot {S}_1+P_1\cdot {S}_2-2\cdot {\overrightarrow{L}}_1\cdot {\overrightarrow{L}}_2}{P_1P_2}\quad \quad \end{aligned}$$
(30)
$$\begin{aligned} \mathcal{D }_3^2&= \frac{\sum _{i=1}^{N_1+N_2}\sum _{j=1}^{N_1+N_2,j\ne {i}}Pr_{i,j}\cdot ES({\overrightarrow{x}}_i,{\overrightarrow{x}}_j)}{\sum _{i=1}^{N_1+N_2}\sum _{j=1}^{N_1+N_2,j\ne {i}}Pr_{i,j}}\nonumber \\&= \frac{2(P_1+P_2)({S}_1+{S}_2)-2({\overrightarrow{L}}_1+{\overrightarrow{L}}_2)^2- 2({Z}_1+{Z}_2)}{(P_1+P_2)^2-(P2_1+P2_2)} \end{aligned}$$
(31)

In deterministic data environments, \(D_4\) represents the increase in SSQ when two clusters merge. Analogously, we redefine \(\mathcal{D }_4\) as the increase in PSSQ when two clusters merge in uncertain data environments. Let \({\overrightarrow{\mathcal{C }}}_0\) denote the centroid of the merged cluster. According to Theorem 3.2, \({\overrightarrow{\mathcal{C }}}_0=\frac{{\overrightarrow{L}}_1+{\overrightarrow{L}}_2}{P_1+P_2}\).

$$\begin{aligned} \mathcal{D }_4&= \sum _{h=1}^{N_1+N_2}Pr_h\cdot ES({\overrightarrow{x}}_h,{\overrightarrow{\mathcal{C }}}_0) -\sum _{i=1}^{N_1}Pr_i\cdot ES({\overrightarrow{x}}_i,{\overrightarrow{\mathcal{C }}}_1) \nonumber \\&-\sum _{j=N_1+1}^{N_1+N_2}Pr_j\cdot ES({\overrightarrow{x}}_j,{\overrightarrow{\mathcal{C }}}_2)\nonumber \\&= \frac{({\overrightarrow{L}}_1P_2-{\overrightarrow{L}}_2P_1)^2}{P_1P_2(P_1+P_2)} \end{aligned}$$
(32)
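
These identities imply that all five distances can be read off two constant-size summaries. The sketch below illustrates this; the field names \(P\), \(P2\), \({\overrightarrow{L}}\), \(S\), \(Z\) mirror the symbols in Eqs. (28)–(32), with \(Z=\sum _i (Pr_i\cdot E_{i,2}-{\overrightarrow{E}}_{i,1}^2)\) inferred from Eq. (31). This is an illustrative layout; the exact UF structure in the paper may differ.

```python
# Illustrative summary structure sufficient for Eqs. (28)-(32). Field names
# mirror the symbols above; the paper's exact UF layout may differ.
from dataclasses import dataclass
import numpy as np

@dataclass
class UF:
    P: float        # sum_i Pr_i
    P2: float       # sum_i Pr_i^2
    L: np.ndarray   # sum_i E_{i,1}  (d-dimensional vector)
    S: float        # sum_i E_{i,2}
    Z: float        # sum_i (Pr_i * E_{i,2} - E_{i,1}^2), as implied by Eq. (31)

    def __add__(self, o):   # additive: merge two micro-clusters
        return UF(self.P + o.P, self.P2 + o.P2, self.L + o.L,
                  self.S + o.S, self.Z + o.Z)

    def __sub__(self, o):   # subtractive: expire tuples from a sliding window
        return UF(self.P - o.P, self.P2 - o.P2, self.L - o.L,
                  self.S - o.S, self.Z - o.Z)

def d0(u, v):  # Eq. (28): centroid Euclidean distance
    return float(np.linalg.norm(u.L / u.P - v.L / v.P))

def d1(u, v):  # Eq. (29): centroid Manhattan distance
    return float(np.abs(u.L / u.P - v.L / v.P).sum())

def d2(u, v):  # Eq. (30): average inter-cluster distance
    return float(np.sqrt((v.P * u.S + u.P * v.S - 2 * u.L @ v.L) / (u.P * v.P)))

def d3(u, v):  # Eq. (31): average intra-cluster distance of the merged cluster
    m = u + v
    return float(np.sqrt((2 * m.P * m.S - 2 * m.L @ m.L - 2 * m.Z)
                         / (m.P ** 2 - m.P2)))

def d4(u, v):  # Eq. (32): increase in PSSQ caused by merging
    w = u.L * v.P - v.L * u.P
    return float(w @ w / (u.P * v.P * (u.P + v.P)))
```

The `__add__`/`__sub__` methods reflect the additive and subtractive properties claimed for UF: merging micro-clusters, or expiring old tuples under the sliding-window model, reduces to component-wise addition or subtraction.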

1.4 The support of the probabilistic discrete distribution

This paper also studies the case where each tuple is described by a discrete probability distribution function. Assume a tuple \({\overrightarrow{x}}_i\) has \(s_i\) candidate values, denoted as \({\overrightarrow{x}}_{i,1}, \ldots , {\overrightarrow{x}}_{i,s_i}\), where \(\forall 1\le j\le s_i\), \(Pr[{\overrightarrow{x}}_i = {\overrightarrow{x}}_{i,j}] = P_{i,j}\). Let \(Pr_i\) denote the sum of the existential probabilities of \({\overrightarrow{x}}_i\), i.e., \(Pr_i=\sum _{j=1}^{s_i}P_{i,j}\). Note that \(Pr_i\) can be smaller than 1, which means tuple \({\overrightarrow{x}}_i\) has probability \(1-Pr_i\) of taking a value outside the domain, i.e., \(Pr[{\overrightarrow{x}}_i =\bot ]=1-Pr_i\), where \(\bot \) is a virtual value outside the domain.

Now, \({\overrightarrow{E}}_{i, 1}\) and \(E_{i,2}\) can be rewritten as follows without changing their semantics: (i) \({\overrightarrow{E}}_{i,1}=\sum _{j=1}^{s_i}P_{i,j}\cdot {\overrightarrow{x}}_{i,j}\), (ii) \(E_{i,2}=\sum _{j=1}^{s_i}P_{i,j}\cdot {\overrightarrow{x}}_{i,j}^2\).

When both tuples, \({\overrightarrow{x}}_i\) and \({\overrightarrow{x}}_j\), are described by discrete probability distribution functions, the expected squared distance can be computed in a similar way:

$$\begin{aligned} ES({\overrightarrow{x}}_i,{\overrightarrow{x}}_j)&= \frac{1}{Pr_i\cdot Pr_j}\sum _{l=1}^{s_i}\sum _{h=1}^{s_j} P_{i,l}\cdot P_{j,h}\cdot ({\overrightarrow{x}}_{i,l}-{\overrightarrow{x}}_{j,h})^2 \nonumber \\&= \frac{E_{i,2}}{Pr_i}+\frac{E_{j,2}}{Pr_j}-2\cdot \frac{{\overrightarrow{E}}_{i,1}{\overrightarrow{E}}_{j,1}}{Pr_iPr_j} \end{aligned}$$
(33)
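
A quick numeric check of Eq. (33) (our sketch, under the notation above): the brute-force double sum over candidate pairs matches the moment-based closed form.

```python
# Numeric check of Eq. (33) (illustrative): the brute-force double sum over
# candidate pairs equals the moment-based closed form.
import numpy as np

rng = np.random.default_rng(1)
d = 3
xi, xj = rng.normal(size=(4, d)), rng.normal(size=(5, d))  # candidate values
pi = rng.uniform(0.05, 0.2, size=4)                        # P_{i,l}
pj = rng.uniform(0.05, 0.2, size=5)                        # P_{j,h}
Pri, Prj = pi.sum(), pj.sum()

# brute force: (1 / (Pr_i Pr_j)) sum_{l,h} P_{i,l} P_{j,h} ||x_{i,l} - x_{j,h}||^2
diff = xi[:, None, :] - xj[None, :, :]
brute = np.einsum('l,h,lhd->', pi, pj, diff ** 2) / (Pri * Prj)

# closed form via the moments E_{i,1} and E_{i,2}
E1i, E2i = pi @ xi, (pi[:, None] * xi ** 2).sum()
E1j, E2j = pj @ xj, (pj[:, None] * xj ** 2).sum()
closed = E2i / Pri + E2j / Prj - 2 * (E1i @ E1j) / (Pri * Prj)

assert np.isclose(brute, closed)
```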


About this article

Cite this article

Jin, C., Yu, J.X., Zhou, A. et al. Efficient clustering of uncertain data streams. Knowl Inf Syst 40, 509–539 (2014). https://doi.org/10.1007/s10115-013-0657-3
