Abstract
Clustering streaming data is gaining importance as automatic data acquisition technologies are deployed in diverse applications. We propose a fully incremental projected divisive clustering method for high-dimensional data streams that is motivated by high density clustering. The method is capable of identifying clusters in arbitrary subspaces, estimating the number of clusters, and detecting changes in the data distribution which necessitate a revision of the model. The empirical evaluation of the proposed method on numerous real and simulated datasets shows that it is scalable in dimension and number of clusters, is robust to noisy and irrelevant features, and is capable of handling a variety of types of non-stationarity.
Acknowledgments
David Hofmeyr gratefully acknowledges funding from both the Engineering and Physical Sciences Research Council (EPSRC) and the Oppenheimer Memorial Trust.
Appendix 1: Proofs
Before we can prove Lemma 2, we require the following preliminaries.
The algorithm for computing the dip of a distribution function F constructs a unimodal distribution function G with the following properties: (i) The modal interval of G, [m, M], is equal to the modal interval of the closest unimodal distribution function to F with respect to the supremum norm, which we denote by \(F^U\); (ii) \(\Vert F - G\Vert _\infty = 2\Vert F - F^U\Vert _\infty \); (iii) G is the greatest convex minorant of F on \((-\infty , m]\); (iv) G is the least concave majorant of F on \([M, \infty )\). By construction, the function G is linear between its nodes. A node \(n \le m\) of G satisfies \(G(n) = \liminf _{x \rightarrow n}F(x)\), while a node \(n \ge M\) of G satisfies \(G(n) = \limsup _{x \rightarrow n}F(x)\). If F is the distribution function of a discrete random variable, then G is continuous.
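To make the greatest convex minorant construction concrete, the following minimal sketch computes the nodes of the greatest convex minorant of an empirical distribution function on its increasing side, using a standard lower-convex-hull (monotone chain) scan. The cut-off standing in for m is chosen by hand purely for illustration (the dip algorithm determines the modal interval itself), and all function and variable names are ours, not part of the original algorithm.

```python
import numpy as np

def greatest_convex_minorant(xs, ys):
    """Nodes (as indices) of the greatest convex minorant of the piecewise
    constant function with lower corners (xs[i], ys[i]), xs sorted increasing.
    Standard monotone-chain lower-hull scan."""
    hull = []
    for i in range(len(xs)):
        while len(hull) >= 2:
            j, k = hull[-2], hull[-1]
            # drop k if it lies on or above the chord from j to i
            if (ys[k] - ys[j]) * (xs[i] - xs[j]) >= (ys[i] - ys[j]) * (xs[k] - xs[j]):
                hull.pop()
            else:
                break
        hull.append(i)
    return hull

# A node n <= m of G satisfies G(n) = liminf_{x -> n} F(x) = (rank(n) - 1)/N,
# so the hull is taken over the lower corners of the empirical distribution function.
rng = np.random.default_rng(0)
sample = np.sort(rng.normal(size=200))
N = len(sample)
lower_corners = np.arange(N) / N                      # (i - 1)/N with 1-based ranks
cutoff = np.searchsorted(sample, np.median(sample))   # hand-picked stand-in for m
nodes = greatest_convex_minorant(sample[:cutoff], lower_corners[:cutoff])
print(sample[nodes])                                  # x-locations of the nodes of G below the cutoff
```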
The function \(F^U\) can be constructed by finding appropriate values \(b<m\), \(B>M\) s.t. \(F^U\) is equal to \(G+\hbox {Dip}(F)\) on [b, m], equal to \(G-\hbox {Dip}(F)\) on [M, B], linearly interpolating between G(m) and G(M), and given any appropriate tails, which we choose to decrease and increase linearly to 0 and 1, respectively.
The proof of Lemma 2 relies on the following preliminary result, which makes use of the notion of a step linear function.
Definition 3
(Step Linear) A function f is step linear on a non-empty, compact interval \(I = [a, b]\) if
$$f(x) = \alpha + \beta \left\lfloor \frac{n(x - a)}{b - a} \right\rfloor \quad \forall x \in I,$$
for some \(\alpha , \beta \in \mathbb {R}\) and \(n \in {\mathbb {N}}\).
A step linear function is piecewise constant, and has n equally sized jumps of size \(\beta \) spaced equally on I with the final jump occurring at b. The approximate empirical distribution function \(\tilde{F}\) (Sect. 4.2.2) is therefore step linear over the approximating intervals.
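As a concrete illustration, the snippet below evaluates a step linear function in the floor form consistent with the description above and confirms that the empirical distribution function of n equally spaced atoms on an interval, the last of them at b, is step linear there. The atom placement and all names are our own illustrative conventions, not taken from the paper.

```python
import numpy as np

def step_linear(x, a, b, alpha, beta, n):
    """Piecewise constant on [a, b] with n equal jumps of size beta, equally
    spaced and with the final jump at b (floor form of Definition 3)."""
    return alpha + beta * np.floor(n * (np.asarray(x, dtype=float) - a) / (b - a))

# An approximating interval [a, b] that summarises n_pts observations by n_pts
# equally spaced atoms (assumed here to end at b) contributes a step linear
# piece to the approximate empirical distribution function.
a, b, n_pts, N = 0.0, 1.0, 4, 20                       # interval, atoms in it, total sample size
atoms = a + (b - a) * np.arange(1, n_pts + 1) / n_pts  # 0.25, 0.5, 0.75, 1.0
grid = np.linspace(a, b, 101)
edf = np.array([(atoms <= t).sum() for t in grid]) / N
print(np.allclose(edf, step_linear(grid, a, b, alpha=0.0, beta=1.0 / N, n=n_pts)))  # True
```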
Proposition 4
Let f be step linear on an interval \(I = [a, b]\), and satisfy \(\lim _{x \rightarrow a^-}f(x) = \alpha - \beta \), where \(\alpha , \beta \) are as in the above definition for f. Let g be linear on I and continuous on a neighbourhood of I. Then
$$\sup _{x \in I}\vert f(x) - g(x)\vert \le \max \left\{ \lim _{x \rightarrow a^-}\vert f(x) - g(x)\vert ,\ \vert f(a) - g(a)\vert ,\ \lim _{x \rightarrow b^-}\vert f(x) - g(x)\vert ,\ \vert f(b) - g(b)\vert \right\} .$$
Proof
Let \(f_m\) and \(f^M\) be linear on a neighbourhood of I s.t. they form the closest lower and upper bounding functions of f on I, respectively. Since f is step linear, we have,
We therefore have, by the above and the fact that \(g, f_m\), and \(f^M\) are linear on I,
\(\square \)
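As a quick numerical sanity check of the endpoint bound in Proposition 4 as stated above, the sketch below generates random step linear f and linear g on random intervals and verifies that the supremum of |f − g| over I never exceeds the largest of the one-sided values of |f − g| at the endpoints. The sampling scheme, tolerance, and all names are ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def check_endpoint_bound(trials=1000):
    """Random step linear f and linear g: the supremum of |f - g| over I = [a, b]
    should never exceed the largest one-sided value of |f - g| at a and b,
    which is the property Proposition 4 provides."""
    for _ in range(trials):
        a = rng.normal()
        width = rng.uniform(0.5, 2.0)
        b = a + width
        alpha, beta = rng.normal(), rng.uniform(0.01, 1.0)
        n = int(rng.integers(1, 20))
        c0, c1 = rng.normal(size=2)
        g = lambda x: c0 + c1 * x                                     # linear on a neighbourhood of I
        f = lambda x: alpha + beta * np.floor(n * (x - a) / (b - a))  # step linear on I
        jumps = a + (b - a) * np.arange(1, n + 1) / n
        xs = np.concatenate([np.linspace(a, b, 2001), jumps - 1e-9])
        xs = xs[(xs >= a) & (xs <= b)]
        sup_on_I = np.abs(f(xs) - g(xs)).max()
        endpoints = max(abs((alpha - beta) - g(a)),             # limit of f - g from the left of a
                        abs(alpha - g(a)),                      # f(a) - g(a)
                        abs((alpha + (n - 1) * beta) - g(b)),   # limit of f - g from the left of b
                        abs((alpha + n * beta) - g(b)))         # f(b) - g(b)
        assert sup_on_I <= endpoints + 1e-8
    print("endpoint bound held in", trials, "random instances")

check_endpoint_bound()
```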
We are now in a position to prove Lemma 2, which states that the dip of a compactly approximated sample, as described in Sect. 4.2.2, provides a lower bound on the dip of the true sample.
Proof of Lemma 2
Let \(I = [a, b]\) be any compact interval and \(F_I\) the empirical distribution function of \((\mathcal {X}\cap I^c) \cup \hbox {Unif}(\mathcal {X}, I)\). Assume \(\vert \mathcal {X}\cap I \vert >1\), since otherwise \(F_I = F_\mathcal {X}\) and we are done. We can assume that the endpoints of I are elements of \(\mathcal {X}\), since this defines the same uniform set. \(F_\mathcal {X}\) and \(F_I\) are therefore equal on \(\hbox {Int}(I)^c\). In fact, since \(\mathcal {X}\) consists of unique points, \(\exists \epsilon > 0\) s.t. \(F_I(x) = F_\mathcal {X}(x) \ \forall x \not \in (a+\epsilon , b-\epsilon )\). Define \(F^\prime _I\) to be equal to \(F_\mathcal {X}^U\) for \(x \not \in \hbox {Int}(I)\) and to interpolate linearly between \(F_\mathcal {X}^U(a)\) and \(F_\mathcal {X}^U(b)\). By construction \(F^\prime _I\) is a continuous unimodal distribution function.
We now show \(\Vert F_I - F_I^\prime \Vert _\infty \le \Vert F_\mathcal {X}- F_\mathcal {X}^U\Vert _\infty \). To see this, suppose that it is not true, i.e., \(\exists x\) s.t. \(\vert F_I(x) - F_I^\prime (x)\vert > \sup _y \vert F_\mathcal {X}(y) - F_\mathcal {X}^U(y)\vert \). Clearly \(x \in \hbox {Int}(I)\) due to the equalities discussed above and the construction of \(F^\prime _I\). Because of the continuity of \(F_\mathcal {X}^U\) and \(F_I^\prime \) and the equality of \(F_\mathcal {X}\) and \(F_I\) on \((a, a+\epsilon ) \cup (b-\epsilon , b)\), we have
and
But by Proposition 4 one of these left hand sides is at least as large as \(\vert F_I(x) - F_I^\prime (x) \vert \), leading to a contradiction.
We have shown that the addition of a single interval cannot increase the dip. We can apply the same logic to the now modified sample \((\mathcal {X}\cap I^c) \cup \hbox {Unif}(\mathcal {X}, I)\), iterating the addition of disjoint intervals to obtain a non-increasing sequence of dips. \(\square \)
In the above proof, we do not show that \(F_I^\prime \) is the closest unimodal distribution function to \(F_I\); however, its existence ensures that the closest one is at least as close. Now, the sample approximations we employ still contain all t atoms after t observations; however, they can be stored in \({\mathcal {O}}(k)\) space for k intervals. We can easily show that the dip of such a sample approximation can be computed in \({\mathcal {O}}(k)\) time.
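To illustrate the storage claim, here is a sketch of one possible compacted representation: each interval keeps only its endpoints and the number of observations it absorbed, so k intervals occupy \({\mathcal {O}}(k)\) memory however many of the t atoms they summarise, and the approximate distribution function is evaluated by visiting each interval once. The triple format and the placement of the implied atoms (equally spaced, ending at the right endpoint) are our own illustrative conventions, not taken from the paper.

```python
import math

def approx_edf(uniform_sets, n_total, x):
    """Approximate empirical distribution function of the compacted sample at x.
    uniform_sets is a list of (a, b, count) triples with disjoint ranges; each
    triple stands in for `count` atoms treated as equally spaced on (a, b],
    so the evaluation costs O(k) for k uniform sets."""
    atoms_leq_x = 0
    for a, b, count in uniform_sets:
        if x >= b:
            atoms_leq_x += count                                  # whole uniform set lies at or below x
        elif x > a:
            atoms_leq_x += math.floor(count * (x - a) / (b - a))  # partial contribution, step linear in x
    return atoms_leq_x / n_total

# 1000 observations summarised by three triples: O(3) storage instead of O(1000).
sets_ = [(0.0, 1.0, 500), (2.0, 2.5, 300), (4.0, 5.0, 200)]
print(approx_edf(sets_, 1000, 2.25))   # 0.65: all of the first set plus half of the second
```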
Proposition 5
The dip of a sample consisting of k uniform sets with disjoint ranges can be computed in \({\mathcal {O}}(k)\) time.
Proof
We begin by showing that there exists a unimodal distribution function which is linear on the ranges of the uniform sets and which achieves the minimal distance to the empirical distribution function of the sample.
Let F be a continuous unimodal distribution function s.t. \(\Vert F - \tilde{F}\Vert _\infty = \hbox {Dip}(\tilde{F})\). Define \(F^\prime \), as in the proof above, to be the continuous distribution function which is equal to F outside and at the boundaries of the intervals defining the uniform sets, and which interpolates linearly on them. Using the same logic, we know that \(\sup _{x}\vert F^\prime (x) - \tilde{F}(x) \vert \le \sup _x\vert F(x) - \tilde{F}(x)\vert \), hence \(\Vert F^\prime - \tilde{F}\Vert _\infty = \hbox {Dip}(\tilde{F})\).
Proposition 4 ensures that points in the interior of the intervals will not be chosen by the dip algorithm as end points of the modal interval of G, nor as points at which the difference between the functions is supremal. The number of possible choices for these locations is therefore \({\mathcal {O}}(k)\), and the algorithm need not evaluate the functions except at the endpoints of the intervals. \(\square \)
Finally, we provide a proof of Proposition 3.
Proof of Proposition 3
For \(s > 1\) we have \(\Vert v_s - v_{s-1}\Vert = \Vert v_s\Vert \Vert v_s-v_{s-1}\Vert \ge \vert v_s \cdot (v_s - v_{s-1})\vert = \vert v_s \cdot v_{s-1}-1\vert \), since \(\Vert v_t\Vert = 1 \ \forall t\). Therefore, since \(\{v_t\}_{t=1}^\infty \) is almost surely convergent, and hence almost surely Cauchy, we have \(v_s \cdot v_{s-1} \xrightarrow {a.s.} 1 \Rightarrow \arccos (v_s \cdot v_{s-1}) \xrightarrow {a.s.}0\). Now, we can easily show that,
Take \(\epsilon > 0\) and t large enough that \(\gamma ^{t-1}\lambda _1 < \gamma \epsilon \), and \(t>k+2\), where \(k = \lfloor \log (\epsilon (1-\gamma )/2\pi )/\log (\gamma )-1\rfloor \). Consider,
and \(\frac{\pi \gamma ^{k+1}}{1-\gamma } \le \frac{\epsilon }{2}\). In all,
Notice that k does not depend on t. With probability 1, for any given \(\epsilon >0\) there is a \({\mathcal {T}}\) s.t. \(T>{\mathcal {T}}\) implies \(\sum _{i=0}^k\arccos (v_{T-i}\cdot v_{T-i-1})\le \epsilon /2\), implying \(\lambda _T \le \epsilon \) for all \(T > {\mathcal {T}}\), and the result follows. \(\square \)
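For intuition about the quantity controlled here, the following sketch tracks the angles \(\arccos (v_t \cdot v_{t-1})\) between successive leading-eigenvector estimates on a stationary stream and aggregates them with an exponential discount factor \(\gamma \), assuming, as the splitting of the sum in the proof suggests, that \(\lambda _T\) is such a discounted sum; the aggregate decays towards zero as the estimates stabilise. A batch eigendecomposition of the running scatter matrix stands in for the incremental PCA updates used in the paper, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, gamma = 5, 0.9
A = rng.normal(size=(d, d))
cov = A @ A.T                                   # covariance of the simulated stationary stream

def leading_eigvec(S):
    """Leading eigenvector of a symmetric matrix with a fixed sign convention,
    so that successive estimates do not flip sign arbitrarily."""
    vecs = np.linalg.eigh(S)[1]
    v = vecs[:, -1]
    return v if v[np.argmax(np.abs(v))] > 0 else -v

S = np.zeros((d, d))
lam, v_prev = 0.0, None
for t in range(1, 5001):
    x = rng.multivariate_normal(np.zeros(d), cov)
    S += np.outer(x, x)                         # running (unnormalised) scatter matrix
    v = leading_eigvec(S / t)
    if v_prev is not None:
        angle = np.arccos(np.clip(v @ v_prev, -1.0, 1.0))
        lam = gamma * lam + angle               # discounted sum of successive angles
    v_prev = v
print(lam)                                      # small: the aggregate decays as the estimates converge
```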
Cite this article
Hofmeyr, D.P., Pavlidis, N.G. & Eckley, I.A. Divisive clustering of high dimensional data streams. Stat Comput 26, 1101–1120 (2016). https://doi.org/10.1007/s11222-015-9597-y