Abstract
Clustering uncertain data streams has recently become one of the most challenging tasks in data management because of the strict space and time requirements of processing tuples that arrive at high speed and the difficulty of handling uncertain data. Prior work on clustering data streams focuses on devising complicated synopsis data structures that summarize a data stream into a small number of micro-clusters so that important statistics can be computed conveniently, such as the Clustering Feature (CF) (Zhang et al. in Proceedings of ACM SIGMOD, pp 103–114, 1996) for deterministic data and the Error-based Clustering Feature (ECF) (Aggarwal and Yu in Proceedings of ICDE, 2008) for uncertain data. However, ECF can only handle attribute-level uncertainty; existential uncertainty, the other kind of uncertainty, has not been addressed yet. In this paper, we propose a novel data structure, the Uncertain Feature (UF), to summarize data streams with both kinds of uncertainty: UF is space-efficient, has additive and subtractive properties, and supports convenient computation of complicated statistics. Our first attempt enhances previous streaming approaches, including CluStream (Aggarwal et al. in Proceedings of VLDB, 2003) and UMicro (Aggarwal and Yu in Proceedings of ICDE, 2008), to handle the sliding-window model by replacing their old synopses with UF. We show that such methods cannot achieve high efficiency. Our second attempt devises a novel algorithm, cluUS, that handles the sliding-window model by using the UF structure. Detailed analysis and thorough experimental reports on synthetic and real data sets confirm the advantages of our proposed method.
Notes
We say a tuple is absorbed by a micro-cluster only if it joins that micro-cluster.
References
Aggarwal CC (2009) Managing and mining uncertain data. Springer, Berlin
Aggarwal CC (2009) On high dimensional projected clustering of uncertain data streams. In: Proceedings of ICDE, pp 1152–1154
Alex N, Hasenfuss A, Hammer B (2009) Patch clustering for massive data sets. Neurocomputing 72:1455–1469
Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of VLDB
Aggarwal CC, Yu PS (2008) A framework for clustering uncertain data streams. In: Proceedings of ICDE
Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of ACM SIGACT-SIGMOD symposium on principles of database systems
Burdick D, Deshpande PM, Jayram T, Ramakrishnan R, Vaithyanathan S (2005) OLAP over uncertain and imprecise data. In: Proceedings of VLDB
Babcock B, Datar M, Motwani R, O’Callaghan L (2003) Maintaining variance and k-medians over data stream windows. In: Proceedings of ACM PODS
Benjelloun O, Sarma AD, Halevy AY, Widom J (2006) ULDBs: databases with uncertainty and lineage. In: Proceedings of VLDB
Chau M, Cheng R, Kao B, Ng J (2006) Uncertain data mining: An example in clustering location data. In: Proceedings of PAKDD
Cormode G, Garofalakis M (2007) Sketching probabilistic data streams. In: Proceedings of ACM SIGMOD
Cormode G, McGregor A (2008) Approximation algorithms for clustering uncertain data. In: Proceedings of PODS
Dunn JC (1973) A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. J Cybern 3:32–57
Dalvi NN, Suciu D (2004) Efficient query evaluation on probabilistic databases. In: Proceedings of ICDE
Dalvi NN, Suciu D (2007) Management of probabilistic data foundations and challenges. In: Proceedings of ACM PODS
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD
Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Proceedings of SIGMOD, pp 73–84
Hochbaum D, Shmoys D (1985) A best possible heuristic for the k-center problem. Math Oper Res 10(2):180–184
Jain A, Dubes R (1988) Algorithms for clustering data. Prentice Hall, New Jersey
Jayram T, Kale S, Vee E (2007) Efficient aggregation algorithms for probabilistic data. In: Proceedings of SODA
Jin C, Yi K, Chen L, Yu JX, Lin X (2008) Sliding-window top-k queries on uncertain streams. Proc VLDB Endow 1(1):301–312
Kao B, Lee SD, Cheung DW, Ho W-S, Chan KF (2008) Clustering uncertain data using Voronoi diagrams. In: Proceedings of ICDM, pp 333–342
Kriegel H-P, Pfeifle M (2005) Density-based clustering of uncertain data. In: Proceedings of KDD
Kriegel H-P, Pfeifle M (2005) Hierarchical density-based clustering of uncertain data. In: Proceedings of ICDM
Lee SD, Kao B, Cheng R (2007) Reducing UK-means to K-means. In: Proceedings of ICDM workshops, pp 483–488
Ngai WK, Kao B, Chui CK, Cheng R, Chau M, Yip KY (2006) Efficient clustering of uncertain data. In: Proceedings of ICDM
O’Callaghan L, Mishra N, Meyerson A, Guha S (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of ICDE
Pelekis N, Kopanakis I, Kotsifakos EK, Frentzos E, Theodoridis Y (2011) Clustering uncertain trajectories. Knowl Inf Syst 28(1):117–147
Tao Y, Cheng R, Xiao X, Ngai WK, Kao B, Prabhakar S (2005) Indexing multi-dimensional uncertain data with arbitrary probability density functions. In: Proceedings of VLDB
Xin D, Halevy AY, Yu C (2007) Data integration with uncertainty. In: Proceedings of VLDB
Zhang M, Chen S, Jensen CS, Ooi BC, Zhang Z (2009) Effectively indexing uncertain moving objects for predictive queries. In: Proceedings of VLDB
Zhang Q, Li F, Yi K (2008) Finding frequent items in probabilistic data. In: Proceedings of SIGMOD
Zhang W, Lin X, Zhang Y, Wang W, Yu JX (2009) Probabilistic skyline operator over sliding windows. In: Proceedings of ICDE
Zhang Y, Lin X, Zhu G, Zhang W, Lin Q (2010) Efficient rank-based kNN query processing over uncertain data. In: Proceedings of ICDE
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of ACM SIGMOD, pp 103–114
Acknowledgments
Cheqing Jin is supported by the 973 program of China (No. 2012CB316203) and NSFC (No. 60933001 and 61070052). Aoying Zhou is supported by NSFC (No. 60925008), the 973 program of China (No. 2010CB731402) and the National High Technology Research and Development Program (863) (No. 2012AA011003). We thank Mr. Xiaofeng Xu for his efforts on the experiments.
Appendix
This appendix provides supplementary details omitted from the main text.
1.1 Five kinds of distances between two deterministic clusters
Let \(\{{\overrightarrow{x}}_i\}\) and \(\{{\overrightarrow{x}}_j\}\) be two clusters of sizes \(N_1\) and \(N_2\), respectively, where \(i = 1, 2, \ldots , N_1\) and \(j = N_1+1, N_1+2, \ldots , N_1 + N_2\). Let \(\overrightarrow{C}_1\) and \(\overrightarrow{C}_2\) denote the centroids of two clusters. The centroid Euclidean distance (\(D_0\)), the centroid Manhattan distance (\(D_1\)), the average inter-cluster distance (\(D_2\)), the average intra-cluster distance (\(D_3\)), and the variance increase distance (\(D_4\)), are defined below.
Here, \(\overrightarrow{C}_k^{(t)}\) indicates the \(t\)th element of the \(d\)-dimensional vector \(\overrightarrow{C}_k\), and \(D_3\) is the root mean square distance of the cluster merged from \(\{{\overrightarrow{x}}_i\}\) and \(\{{\overrightarrow{x}}_j\}\).
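The five distances can be computed directly from the clustering-feature summary of each cluster rather than from the raw tuples. The sketch below assumes the BIRCH-style CF triple \((N, \overrightarrow{LS}, SS)\) (tuple count, linear sum, and sum of squared norms); since the display equations are not reproduced in this excerpt, the closed forms follow the standard BIRCH definitions and should be checked against the paper's Eqs. for \(D_0\)–\(D_4\).

```python
import numpy as np

def distances(N1, LS1, SS1, N2, LS2, SS2):
    """D0-D4 between two deterministic clusters, from CF triples (N, LS, SS).

    LS is the linear sum (a d-dim vector); SS is the scalar sum of squared
    norms. Definitions follow BIRCH (Zhang et al. 1996).
    """
    C1, C2 = LS1 / N1, LS2 / N2
    D0 = np.linalg.norm(C1 - C2)            # centroid Euclidean distance
    D1 = np.abs(C1 - C2).sum()              # centroid Manhattan distance
    # average inter-cluster distance (RMS over all cross-cluster pairs)
    D2 = np.sqrt(SS1 / N1 + SS2 / N2 - 2 * LS1.dot(LS2) / (N1 * N2))
    # average intra-cluster distance (RMS over all pairs in the merged cluster)
    N, LS, SS = N1 + N2, LS1 + LS2, SS1 + SS2
    D3 = np.sqrt((2 * N * SS - 2 * LS.dot(LS)) / (N * (N - 1)))
    # variance increase: SSQ(merged) - SSQ(cluster1) - SSQ(cluster2)
    ssq = lambda n, ls, ss: ss - ls.dot(ls) / n
    D4 = ssq(N, LS, SS) - ssq(N1, LS1, SS1) - ssq(N2, LS2, SS2)
    return D0, D1, D2, D3, D4
```

All five values cost \(O(d)\) time, independent of the cluster sizes, which is what makes CF-style synopses attractive for streams.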
1.2 How to minimize PSSQ value
Theorem 3.2
For a cluster \(\{{\overrightarrow{x}}_1\ldots {\overrightarrow{x}}_N\}\), the minimal value of PSSQ is \(\sum _{i=1}^N E_{i,2}-\frac{(\sum _{i=1}^N{\overrightarrow{E}}_{i,1})^2}{\sum _{i=1}^NPr_i}\), where centroid \({\overrightarrow{\mathcal{C }}}\) is \(\frac{\sum _{i=1}^N{\overrightarrow{E}}_{i,1}}{\sum _{i=1}^NPr_i}\).
Proof
In Eq. (25), \({\overrightarrow{\mathcal{C }}}^{(j)}\) and \({\overrightarrow{E}}_{i,1}^{(j)}\) denote the \(j\)th entries in the vectors \({\overrightarrow{\mathcal{C }}}\) and \({\overrightarrow{E}}_{i,1}\), respectively. Moreover, \({\overrightarrow{\mathcal{C }}}^{(1)}\ldots {\overrightarrow{\mathcal{C }}}^{(d)}\) can be treated as \(d\) independent variables. For each variable \({\overrightarrow{\mathcal{C }}}^{(j)}\) (also denoted as \(x\)), let \(a=\sum _{i=1}^N Pr_i\) and \(b=\sum _{i=1}^N{\overrightarrow{E}}_{i,1}^{(j)}\); it is easy to verify that \(\min (ax^2-2bx)=-\frac{b^2}{a}\), attained at \(x=\frac{b}{a}\). Hence, the value of PSSQ is minimized when \(\forall j, 1\le j\le d\), \({\overrightarrow{\mathcal{C }}}^{(j)}=\frac{\sum _{i=1}^N{\overrightarrow{E}}_{i,1}^{(j)}}{\sum _{i=1}^N Pr_i}\).
Putting all of the above together, we have:
subject to
\(\square \)
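Theorem 3.2 gives a closed form for both the optimal centroid and the minimal PSSQ from three aggregates. A minimal sketch, assuming each tuple \(i\) contributes its existential probability \(Pr_i\), first-moment vector \({\overrightarrow{E}}_{i,1}\), and second moment \(E_{i,2}\) (the array names are illustrative):

```python
import numpy as np

def min_pssq(Pr, E1, E2):
    """Centroid and minimal PSSQ per Theorem 3.2.

    Pr: (N,) existential probabilities
    E1: (N, d) first-moment vectors E_{i,1}
    E2: (N,) second moments E_{i,2}
    """
    P = Pr.sum()                 # sum of existential probabilities
    L1 = E1.sum(axis=0)          # sum of first-moment vectors
    centroid = L1 / P            # optimal centroid  sum(E1) / sum(Pr)
    pssq = E2.sum() - L1.dot(L1) / P   # minimal PSSQ
    return centroid, pssq
```

For deterministic tuples (\(Pr_i=1\), \({\overrightarrow{E}}_{i,1}={\overrightarrow{x}}_i\), \(E_{i,2}=\Vert {\overrightarrow{x}}_i\Vert ^2\)) this reduces to the ordinary mean and SSQ, which is a useful sanity check.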
1.3 Computing sophisticated statistics by using UFs
The statistics for a pair of clusters (i.e., \(\mathcal{D }_0 - \mathcal{D }_4\)) are more sophisticated. Given two UFs (\(UF_1\) and \(UF_2\)), we can still compute them efficiently.
\(\mathcal{D }_0\) and \(\mathcal{D }_1\) are the Euclidean and Manhattan distances between the centroids of a pair of clusters, respectively. Note that \({\overrightarrow{x}}_i^{(t)}\) denotes the \(t\)th dimension of the vector \({\overrightarrow{x}}_i\). So,
\(\mathcal{D }_2\) and \(\mathcal{D }_3\) represent the average inter-cluster and intra-cluster distances respectively. We consider the co-existing confidence, \(Pr_{i,j}\), of a pair of tuples, \({\overrightarrow{x}}_i\) and \({\overrightarrow{x}}_j\). Thus,
In deterministic data environments, \(D_4\) represents the increase of SSQ when two clusters merge. Analogously, we redefine \(\mathcal{D }_4\) as the increase of PSSQ when two clusters merge in uncertain data environments. Let \({\overrightarrow{\mathcal{C }}}_0\) denote the centroid of the merged cluster. According to Theorem 3.2, \({\overrightarrow{\mathcal{C }}}_0=\frac{{\overrightarrow{L}}_1+{\overrightarrow{L}}_2}{P_1+P_2}\).
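Because UF is additive, \(\mathcal{D }_4\) needs only the component-wise sums of the two synopses. A sketch under the assumption that each UF is summarized by \((P, {\overrightarrow{L}}, S)\): the sum of existential probabilities, the sum of the first-moment vectors \({\overrightarrow{E}}_{i,1}\), and the sum of the second moments \(E_{i,2}\) (the paper's UF may carry further fields):

```python
import numpy as np

def d4_increase(P1, L1, S1, P2, L2, S2):
    """Increase of PSSQ when two uncertain clusters merge (the D4 analogue).

    Each cluster is summarized by (P, L, S) as described in the lead-in;
    by Theorem 3.2, PSSQ at the optimal centroid is S - ||L||^2 / P.
    """
    pssq = lambda P, L, S: S - L.dot(L) / P
    P0, L0, S0 = P1 + P2, L1 + L2, S1 + S2   # additive property of UF
    return pssq(P0, L0, S0) - pssq(P1, L1, S1) - pssq(P2, L2, S2)
```

As with \(D_4\) on CF triples, the cost is \(O(d)\) regardless of how many tuples each synopsis absorbed.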
1.4 The support of the probabilistic discrete distribution
This paper also studies the case where each tuple is described by a discrete probability distribution function. Assume a tuple \({\overrightarrow{x}}_i\) has \(s_i\) candidate values, denoted as \({\overrightarrow{x}}_{i,1}, \ldots , {\overrightarrow{x}}_{i,s_i}\), with \(Pr[{\overrightarrow{x}}_i = {\overrightarrow{x}}_{i,j}] = P_{i,j}\) for all \(1\le j\le s_i\). Let \(Pr_i\) denote the sum of the existential probabilities of \({\overrightarrow{x}}_i\), i.e., \(Pr_i=\sum _{j=1}^{s_i}P_{i,j}\). Note that the value of \(Pr_i\) can be smaller than 1, which means the tuple \({\overrightarrow{x}}_i\) still has probability \(1-Pr_i\) of taking a value outside the domain, i.e., \(Pr[{\overrightarrow{x}}_i =\bot ]=1-Pr_i\), where \(\bot \) is a virtual value outside the domain.
Now, \({\overrightarrow{E}}_{i, 1}\) and \(E_{i,2}\) are rewritten below without modifying the semantics: (i) \({\overrightarrow{E}}_{i,1}=\sum _{j=1}^{s_i}P_{i,j}\cdot {\overrightarrow{x}}_{i,j}\), (ii) \(E_{i,2}=\sum _{j=1}^{s_i}P_{i,j}\cdot {\overrightarrow{x}}_{i,j}^2\).
When both tuples, \({\overrightarrow{x}}_i\) and \({\overrightarrow{x}}_j\), are described by discrete probability distribution functions, the expected squared distance can also be computed in a similar way.
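For two tuples with discrete pdfs, the expected squared distance is a probability-weighted sum over all candidate pairs. The sketch below assumes the two tuples are independent, so the co-existing confidence of a candidate pair factors as \(P_{i,a}\cdot P_{j,b}\):

```python
import numpy as np

def expected_sq_dist(vals_i, probs_i, vals_j, probs_j):
    """Expected squared distance between two independent discrete tuples.

    Sums P_{i,a} * P_{j,b} * ||x_{i,a} - x_{j,b}||^2 over all candidate
    pairs (a, b); pairs where either tuple is absent contribute nothing.
    """
    vi, vj = np.asarray(vals_i, float), np.asarray(vals_j, float)
    pi, pj = np.asarray(probs_i, float), np.asarray(probs_j, float)
    diff = vi[:, None, :] - vj[None, :, :]        # (s_i, s_j, d) differences
    sq = (diff ** 2).sum(axis=2)                  # pairwise squared distances
    return (pi[:, None] * pj[None, :] * sq).sum()
```

The double sum costs \(O(s_i s_j d)\), which is why summarizing tuples into moment-based synopses first is preferable on a stream.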
Jin, C., Yu, J.X., Zhou, A. et al. Efficient clustering of uncertain data streams. Knowl Inf Syst 40, 509–539 (2014). https://doi.org/10.1007/s10115-013-0657-3