
Scalable density-based clustering with quality guarantees using random projections


Abstract

Clustering offers significant insights in data analysis. Density-based algorithms have emerged as flexible and efficient techniques, able to discover high-quality and potentially irregularly shaped clusters. Here, we present scalable density-based clustering algorithms using random projections. Our clustering methodology achieves a speedup of two orders of magnitude compared with equivalent state-of-art density-based techniques, while offering analytical guarantees on the clustering quality in Euclidean space. Moreover, it does not introduce difficult to set parameters. We provide a comprehensive analysis of our algorithms and comparison with existing density-based algorithms.


Notes

  1. SOPTICS differs only slightly from the Fast OPTICS (FOPTICS) algorithm presented in Schneider and Vlachos (2013).

  2. This condition could be removed for low-dimensional spaces, i.e., assuming d is constant.

  3. Java is a registered trademark of Oracle and/or its affiliates.

  4. https://elki.dbs.ifi.lmu.de/.

  5. Intel is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. Other product or service names may be trademarks or service marks of other companies.

  6. http://alumni.cs.ucr.edu/~mvlachos/erc/projects/density-based/.

  7. The latest optimizations are not included in version 0.7.0.

References

  • Achtert E, Böhm C, Kröger P (2006) DeLi-Clu: boosting robustness, completeness, usability, and efficiency of hierarchical clustering by a closest pair ranking. In: Proceedings of the Pacific-Asia conference knowledge discovery and data mining (PAKDD), pp 119–128

  • Andrade G, Ramos G, Madeira D, Sachetto R, Ferreira R, Rocha L (2013) G-DBSCAN: a GPU accelerated algorithm for density-based clustering. Procedia Comput Sci 18:369–378

  • Ankerst M, Breunig MM, Kriegel H-P, Sander J (1999) OPTICS: ordering points to identify the clustering structure. In: Proceedings of the ACM international conference on management of data (SIGMOD), pp 49–60

  • Asuncion A, Newman D (2007) UCI machine learning repository. http://archive.ics.uci.edu/ml/datasets.html

  • Böhm C, Noll R, Plant C, Wackersreuther B (2009) Density-based clustering using graphics processors. In: Proceedings of the international conference on information and knowledge management (CIKM), pp 661–670

  • Chitta R, Murty MN (2010) Two-level k-means clustering algorithm for k-tau relationship establishment and linear-time classification. Pattern Recognit 43(3):796–804

  • Dasgupta S, Freund Y (2008) Random projection trees and low dimensional manifolds. In: Proceedings of the symposium on theory of computing (STOC), pp 537–546

  • Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the annual symposium on computational geometry, pp 253–262

  • de Vries T, Chawla S, Houle ME (2012) Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst 32(1):25–52

  • Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the ACM conference knowledge discovery and data mining (KDD), pp 226–231

  • Gionis A, Mannila H, Tsaparas P (2007) Clustering aggregation. ACM Trans Knowl Discov Data 1(1):341–352

  • Hinneburg A, Gabriel H-H (2007) DENCLUE 2.0: fast clustering based on kernel density estimation. In: Advances in intelligent data analysis (IDA), pp 70–80

  • Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of the ACM conference knowledge discovery and data mining (KDD), pp 58–65

  • Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218

  • Jain AK, Law MHC (2005) Data clustering: a user’s dilemma. In: Proceedings of the pattern recognition and machine intelligence, pp 1–10

  • Johnson WB, Lindenstrauss J (1984) Extensions of Lipschitz maps into a Hilbert space. Contemp Math 26:189–206

  • Koyutürk M, Grama A, Ramakrishnan N (2005) Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets. IEEE Trans Knowl Data Eng 17(4):447–461

  • Lin C-J (2011) LibSVM datasets. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

  • Loh W-K, Yu H (2015) Fast density-based clustering through dataset partition using graphics processing units. Inf Sci 308:94–112

  • Schneider J, Vlachos M (2013) Fast parameterless density-based clustering via random projections. In: Proceedings of the international conference on information and knowledge management (CIKM), pp 861–866

  • Schneider J, Vlachos M (2014) On randomly projected hierarchical clustering with guarantees. In: Proceedings of the SIAM international conference on data mining (SDM), pp 407–415

  • Schneider J, Wattenhofer R (2011) Distributed coloring depending on the chromatic number or the neighborhood growth. In: International colloquium structural information and communication complexity (SIROCCO), pp 246–257

  • Schneider J, Bogojeska J, Vlachos M (2014) Solving Linear SVMs with multiple 1D projections. In: Proceedings of the international conference on information and knowledge management (CIKM), pp 221–230

  • Schubert E, Koos A, Emrich T, Züfle A, Schmid KA, Zimek A (2015) A framework for clustering uncertain data. PVLDB 8(12):1976–1987

  • Urruty T, Djeraba C, Simovici DA (2007) Clustering by random projections. In: Industrial conference on data mining, pp 107–119

  • Veenman CJ, Reinders MJT, Backer E (2002) A maximum variance cluster algorithm. IEEE Trans Pattern Anal Mach Intell 24(9):1273–1280

  • Whang JJ, Sui X, Dhillon IS (2012) Scalable and memory-efficient clustering of large-scale social networks. In: Proceedings of the IEEE conference on data mining (ICDM), pp 705–714

  • Yu Y, Zhao J, Wang X, Wang Q, Zhang Y (2015) Cludoop: an efficient distributed density-based clustering for big data using hadoop. Int J Distrib Sens Netw 2015:579391. doi:10.1155/2015/579391


Acknowledgements

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007–2013)/ERC Grant Agreement No. 259569.

Author information

Corresponding author

Correspondence to Johannes Schneider.

Additional information

Responsible editor: G. Karypis.

Appendix

Proof of Theorem 8

Assume for now that the random line used for partitioning is chosen uniformly at random. By definition and using linearity of expectation, the expectation of \(X^{short}_A\) is \({\mathbb {E}}[X^{short}_A]:= \sum _{C\in \mathcal {S}_A\setminus \mathcal {N}(A,c_dr)} p(E^{short}_A(C))\). Using Theorem 7 to bound \(p(E^{short}_A(C))\),

$$\begin{aligned} {\mathbb {E}}[X^{short}_A]\le 3\log (c_d)/c_d\cdot |\mathcal {S}_A\setminus \mathcal {N}(A,c_dr)|. \end{aligned}$$

The probability that the random variable \(X^{short}_A\) exceeds the expectation \({\mathbb {E}}[X^{short}_A]\) by a factor \(c_d/\log (c_d)^2\) or more is at most \(\log (c_d)^2/c_d\) by Markov’s inequality. Thus, for the probability of event \(E_0\) as defined below we have

$$\begin{aligned} p(E_0):=p\left( X^{short}_A \le 3|\mathcal {S}_A\setminus \mathcal {N}(A,c_dr)|/\log (c_d)\right) \ge 1-\log (c_d)^2/c_d. \end{aligned}$$

Analogously, let us bound the probability of the event \(E^{long}_A(C)\) that the projection of two points C and A results in a distance \(L\cdot (C-A)\) far beyond the expectation. Again, we use Theorem 7 to bound \(p(E^{long}_A(C))\). By definition, the expectation of \(X^{long}_A\) is \({\mathbb {E}}[X^{long}_A]:=\sum _{C\in \mathcal {S}_A \cap \mathcal {N}(A,r)} p(E^{long}_A(C))\), which is at most \(|\mathcal {S}_A \cap \mathcal {N}(A,r)|/(c_d)^{\log (c_d)/2}\) by Theorem 7. Consider \(c_d\) times this upper bound, i.e., \(c_d/(c_d)^{\log (c_d)/2}\cdot |\mathcal {S}_A \cap \mathcal {N}(A,r)|\le |\mathcal {S}_A \cap \mathcal {N}(A,r)|/(c_d)^{\log (c_d)/3}\) (for \(c_d\) sufficiently large). Thus, we define the probability of event \(E_1\) and bound it as before using Markov’s inequality:

$$\begin{aligned} p(E_1):=p\left( X^{long}_A \le |\mathcal {S}_A \cap \mathcal {N}(A,r)|/(c_d)^{\log (c_d)/3}\right) \ge 1-1/c_d. \end{aligned}$$

Assume \(E_0\) occurs. This excludes at most a fraction \(\log (c_d)^2/{c_d}\in [0,1]\) (i.e., we require \(c_d>80\)) of all possible projections for event \(E_1\), leaving

$$\begin{aligned} (1-1/c_d-\log (c_d)^2/{c_d}) > 1- 2\log (c_d)^2/c_d. \end{aligned}$$

Thus, the probability of \(E_1\) given \(E_0\) becomes \(p(E_1|E_0)\ge 1-2\log (c_d)^2/c_d\).

The probability of event \(E':=E_0 \cap E_1\) is

$$\begin{aligned} \begin{aligned} p(E')=p(E_1|E_0)\cdot p(E_0)&\ge (1-2\log (c_d)^2/{c_d})\cdot (1-\log (c_d)^2/{c_d}) \\&\ge (1-2\log (c_d)^2/{c_d})^2. \end{aligned} \end{aligned}$$

Let us deal with dependencies among the chosen lines. We choose \(c_L\cdot \log N\) random lines independently of each other. Define \(c_{E'}:=1-p(E')=2\log (c_d)^2/{c_d}\). Then, using a Chernoff bound (see Theorem 1), the probability of the event \(E_f(S_A)\) that the number of lines for which \(E'\) does not hold for \(\mathcal {S}_A\) exceeds the expectation \(c_L\cdot \log N\cdot c_{E'}\) by at most a factor \(1+\sqrt{3}/(c_L)^{1/4}\) is at least \(1-1/N^{\sqrt{c_L} c_{E'}}\). Now assume that we “reuse” the projections for a total of \(N^{c_2}\) sets of points across all partitionings, i.e., \(c_2<2\). Given that the projections have been reused \(N^{c_2}\) times, the probability \(p(E_f(S_C))\) for a set \(S_C\) can be lower bounded by \(1-1/N^{\sqrt{c_L} c_{E'}-c_2}\) using the bound for dependent events from Theorem 2. Thus, the probability of a bad event increases at most by a factor of \(1+\sqrt{3}/(c_L)^{1/4}\), yielding, for \(c_L\) sufficiently large,

$$\begin{aligned} p(E')\ge 1-2\log (c_d)^2/{c_d} \cdot \left( 1+\sqrt{3}/(c_L)^{1/4}\right) \ge 1-4\log (c_d)^2/{c_d}. \end{aligned}$$

\(\square \)
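For intuition about the events bounded above, the following is a small simulation sketch; it is not part of the paper's implementation, and the dimension, distances, and thresholds are illustrative assumptions. It estimates how often a distant point pair projects onto a random unit vector with an unusually small gap (an \(E^{short}\)-type event) and how often a close pair projects with an unusually large gap (an \(E^{long}\)-type event):

```python
import numpy as np

rng = np.random.default_rng(0)

def projected_gap(a, b, line):
    """Absolute difference of the projections of a and b onto a unit vector."""
    return abs(np.dot(a - b, line))

d, trials = 50, 20000
a = np.zeros(d)
far = np.full(d, 1.0)                 # a "far" point at distance sqrt(d) from a
near = np.zeros(d); near[0] = 0.1     # a "near" point at distance 0.1 from a

dist_far = np.linalg.norm(far - a)
dist_near = np.linalg.norm(near - a)
short_frac, long_factor = 0.05, 5.0   # illustrative thresholds, not the paper's constants

short_events = long_events = 0
for _ in range(trials):
    line = rng.normal(size=d)
    line /= np.linalg.norm(line)      # random projection line (unit vector)
    if projected_gap(a, far, line) < short_frac * dist_far / np.sqrt(d):
        short_events += 1             # far pair projected unusually close together
    if projected_gap(a, near, line) > long_factor * dist_near / np.sqrt(d):
        long_events += 1              # near pair projected unusually far apart

print("empirical p(short-type event):", short_events / trials)
print("empirical p(long-type event):", long_events / trials)
```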

Proof of Theorem 9

The idea of the proof is to look at a point A and remove “very” far away points until only relatively few of them are left. Then, we consider somewhat closer points (but still quite far away) and remove them until we are left with only some very close points and some potentially further away points. Consider a partitioning of the set \(\mathcal {S}_A\) into two sets \(\mathcal {S}_0\) and \(\mathcal {S}_{1,A}\) with \(A \in \mathcal {S}_{1,A}\), using algorithm Partition and a random projection line L. Assume that the following condition holds for the set \(\mathcal {S}_A\): there are many more points “very far away” from A than not so distant points, using some factor \(f_d\ge c_d\):

$$\begin{aligned} c_r |\mathcal {S}_A \cap \mathcal {N}(A,f_d\cdot r)| \le |\mathcal {S}_A \setminus \mathcal {N}(A,f_d\cdot r)| \end{aligned}$$
(2)
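For concreteness, the split step that the proof reasons about — projecting a point set onto a random line and splitting it at the projection of a uniformly chosen point — can be sketched as follows. This is only an illustrative Python sketch, not the authors' implementation; the function and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def partition(points):
    """Split `points` (an (n, d) array) into two parts using a random projection.

    A random unit vector serves as the projection line; a splitting point is
    chosen uniformly at random and its projected value is used as the threshold.
    """
    n, d = points.shape
    line = rng.normal(size=d)
    line /= np.linalg.norm(line)            # random projection line
    proj = points @ line                    # one-dimensional projections of all points
    threshold = proj[rng.integers(n)]       # projection of a uniformly chosen splitting point
    return points[proj <= threshold], points[proj > threshold]

# Example: one split of 1000 random points in R^10.
pts = rng.normal(size=(1000, 10))
s0, s1 = partition(pts)
print(len(s0), len(s1))
```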

The value \(c_r\) is defined later; we require \(c_r\ge f_d\). We prove that even in this case, after a sequence of splittings of the point set, only a few very far away points end up in the set \(\mathcal {S}_{1,A}\). (If there are fewer faraway points than somewhat close points, the probability that many of them end up in the same set is even smaller.) Define event \(E_1\) as follows: a splitting point is picked such that most very close points from \(\mathcal {N}(A,r) \cap \mathcal {S}_A\) remain in the subset \(\mathcal {S}_{1,A}\), i.e.,

$$\begin{aligned} |\mathcal {S}_{1,A} \cap \mathcal {N}(A,r)| \ge |\mathcal {S}_{A} \cap \mathcal {N}(A,r)| \cdot (1-1/c_r). \end{aligned}$$

The probability of event \(E_1\) can be bounded as follows. Assume that \(E'\) as defined in Theorem 8 occurs (using \(f_d>c_d\) instead of \(c_d\)), i.e., most distances from a point A to other points are scaled by roughly the same factor. To bound \(p(E_1|E')\) from below, we assume the worst case, namely that all projected distances from faraway points to A are minimized and those of close points are maximized. This means that at most a fraction \(1/\log {f_d}\) of all very far away points \(\mathcal {S}_{A} \setminus \mathcal {N}(A,f_d\cdot r)\) are below a factor \(3\log (f_d)/f_d\) of their expected length, and that the distances to all other points in \(\mathcal {S}_{A} \setminus \mathcal {N}(A,f_d\cdot r)\) are shortened by exactly that factor. We assume the worst possible scenario, i.e., those far away points are split such that they end up in the same set as A, i.e., they become part of \(S_{1,A}\). At most a fraction \(1/(f_d)^{\log (f_d)/3}\) of all very close points \(\mathcal {S}_{A} \cap \mathcal {N}(A,r)\) are above a factor \(\log (f_d)\) of the expectation. We assume that those points behave in the worst possible manner, i.e., the close points exceeding the expectation are split such that they end up in a different set than A, i.e., in \(S_{0}\) rather than \(S_{1,A}\). Next, we bound the probability that no other points from \(\mathcal {S}_{A} \cap \mathcal {N}(A,r)\) are split. If we pick a splitting point among the fraction of \(1-1/\log f_d\) points from \(\mathcal {S}_{A} \setminus \mathcal {N}(A,f_dr)\) that are not shortened by more than a factor \(3\log (f_d)/f_d\), then \(E_1\) occurs. By the initial assumption we have \((1-1/f_d^{\log (f_d)/3})|\mathcal {S}_{A} \cap \mathcal {N}(A,f_d\cdot r)|\le (1-1/\log {f_d})\cdot c_r |\mathcal {S}_A \setminus \mathcal {N}(A,f_dr)|\) and thus \(|\mathcal {S}_{A} \setminus \mathcal {N}(A,f_dr)|/|\mathcal {S}_{A}|\le 2/c_r\) for \(1-1/\log f_d > 1/2\), i.e., \(f_d\) sufficiently large, and because \(|\mathcal {S}_{A}| \ge |\mathcal {S}_A \setminus \mathcal {N}(A,f_dr)|\). Put differently, the probability to pick a bad splitting point is at most \(2/c_r\). The occurrence of event \(E'\) reduces the probability of \(E_1\) at most by \(1-p(E')\), i.e., \(p(E_1|E')\ge p(E_1)-(1-p(E'))\).

Therefore,

$$\begin{aligned} p(E_1)&\ge p(E')p(E_1|E') \\&\ge p(E')\cdot \left( 1- |\mathcal {S}_{A} \setminus \mathcal {N}(A,f_d\cdot r)|/|\mathcal {S}_{A}|-(1-p(E'))\right) \\&\ge p(E')\cdot (1-2/c_r - 4\log (f_d)^2/f_d) \\&\ge (1-4\log (f_d)^2/f_d)^2\cdot (1-6\log (f_d)^2/\min ({f_d},c_r)) \qquad \text {(substituting } p(E')) \\&\ge (1-6\log (f_d)^2/f_d)^3 \qquad \text {(since by definition } c_r\ge f_d) \end{aligned}$$

Define event \(E_2\) as follows: at least a fraction \(1/3-1/c_r\) of all far away points \(\mathcal {S}_A\setminus \mathcal {N}(A,f_dr)\) are not contained in \(\mathcal {S}_{1,A}\), i.e.,

$$\begin{aligned} |\mathcal {S}_{1,A} \setminus \mathcal {N}(A,f_dr)| \le 2/3\cdot |\mathcal {S}_A\setminus \mathcal {N}(A,f_dr)|. \end{aligned}$$

The probability that the size of the set \(\mathcal {S}_{1,A}\) resulting from the split is at most 2/3 of the original set \(\mathcal {S}_{A}\) is 1/3, because the splitting point is chosen uniformly at random. When restricting our choice to far away points \(\mathcal {S}_A\setminus \mathcal {N}(A,f_dr)\), we can use that, owing to Condition (2), at most a fraction \(1/c_r\) of all points are not far away. The probability of \(E_2\) given \(E_1\) can be bounded by assuming that all events, i.e., choices of random lines and splitting points, that are excluded owing to the occurrence of \(E_1\) actually would have caused \(E_2\). More precisely, we can subtract the probability of the complementary event of \(E_1\), i.e., \(p(E_2|E_1) \ge 2/3-1/c_r - (1-p(E_1)) \ge 2/3 - 1/c_r - (1-(1-6\log (f_d)^2/{f_d})^3) \ge 1/4\) for a sufficiently large constant \(f_d\). The initial set \(\mathcal {S}:=\mathcal {P}\) has to be split at most \(c_L\log N\) times until the final set \(\mathcal {S}_A\) containing A (which is not split any further) is computed (see proof of Theorem 3). A trial T consists of up to \(\log f_d\) splits of a set \(\mathcal {S}\) into two sets. A trial T is successful if after at most \(\log f_d\) splits of a set \(\mathcal {S}_A\) the final set \(\mathcal {S}'_A\subset \mathcal {S}_A\) is of size at most \(|\mathcal {S}_A|/2\) and \(E_1\) occurred for every split. The probability p(T) of a successful trial is equal to the probability that \(E_1\) always occurs and \(E_2\) occurs at least once. This gives:

$$\begin{aligned} p(T)&= p(E_1)^{\log f_d}\cdot \left( 1-p(E_2|E_1)^{\log f_d}\right) \\&\ge (1-6\log (f_d)^2/f_d)^{3\log f_d} \cdot (1-1/4^{\log f_d}) \\&\ge (1-6\log (f_d)^2/f_d)^{4\log f_d} \end{aligned}$$
(3)

Starting from the entire point set, we need \(\log (N/\textit{minSize})+1\) consecutive successful trials until a point A is in a set of size less than \(\textit{minSize}\) and the splitting stops. Next, we prove that the probability to have that many successful trials is constant, given that the required upper bound on the neighborhood growth (1) holds. Assume there are \(n_i\) points at distance within \([i^{3/2+c_s} \cdot c_d\cdot r, (i+1)^{3/2+c_s}\cdot c_d\cdot r]\) from A, for a positive integer i. In particular, note that the statement holds for arbitrarily positioned points. We do not even require them to be fixed across several trials.
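To make the counts \(n_i\) concrete, the following small sketch tallies them for a sample point set. It is purely illustrative: the constants \(c_d\) and \(c_s\), the radius r, and the data are assumed values, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def annulus_counts(points, a, r, c_d=4.0, c_s=0.5, max_i=20):
    """n_i = number of points at distance in [i^(3/2+c_s)*c_d*r, (i+1)^(3/2+c_s)*c_d*r) from a."""
    dist = np.linalg.norm(points - a, axis=1)
    counts = []
    for i in range(1, max_i + 1):
        lo = i ** (1.5 + c_s) * c_d * r
        hi = (i + 1) ** (1.5 + c_s) * c_d * r
        counts.append(int(np.sum((dist >= lo) & (dist < hi))))
    return counts

pts = rng.normal(size=(5000, 10))       # sample point set
print(annulus_counts(pts, a=np.zeros(10), r=0.1))
```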

The upper bound on the neighborhood growth (1) yields that \(n_i\le 2^{i^{1/2}}\cdot |\mathcal {N}(A,c_dr)|\). Furthermore, we have that \(\sum _{i=1}^{\infty } n_i \le N\). Next, we analyze how many trials we need to remove the \(n_i\) points until only the close points \(\mathcal {N}(A,c_dr)\) remain. We proceed from large i to small i, i.e., we remove distant points first. For each \(n_i\) we need at most \(\log n_i - \log |\mathcal {N}(A,c_dr)| \le i^{1/2}\) successes. Let \(E_{n_{i}}\) be the event that this happens, i.e., that we have that many consecutive successes.

$$\begin{aligned} p(E_{n_{i}})&:= \prod _{j=1}^{\log n_i - \log |\mathcal {N}(A,c_dr)|} p(T) \\&= \prod _{j=1}^{\log n_i - \log |\mathcal {N}(A,c_dr)|} \left( 1-6\log (x)^2/x\right) ^{4\log x} \qquad \text {(defining } x:= c_d\cdot i^{3/2+c_s} ) \\&= \prod _{j=1}^{\sqrt{i}} \left( 1-1/2^{\log (x)- 2\log (6\log x)}\right) ^{4\log (x)} \\&= 2^{4 \sqrt{i} \log (x)\log \left( 1-1/2^{\log (x)-2\log (6\log x)}\right) } \\&= 2^{-4 \sqrt{i} \log (x)\cdot 1/2^{\log (x)-2\log (6\log x)}}\qquad \text {(using }\log (1-x)\le -x)\\&\ge 2^{ -4 \sqrt{i}\cdot \log (x)\cdot \log (x)^4 /x} \qquad \text {(using } 2^{2\log (6\log x)}= (6\log (x))^2,\ 2^{\log (x)}=x) \\&= 2^{\frac{-24\sqrt{i}\log (c_d\cdot i^{3/2+c_s})^3}{c_d\cdot i^{3/2+c_s}}} \\&\ge 2^{-\frac{1}{c_d\cdot i}} \qquad \text {(for } c_s \text { and } c_d \text { sufficiently large)} \end{aligned}$$
(4)

As the number of points N is finite, the number of indices i with \(n_i>0\) is also finite. Let \(m_A\) be the largest value such that \(n_{m_A}>0\). Let \(p_A:=p(\wedge _{i\in [1,m_A]} {E_{n_{i}}})\) be the probability that all trials for all \(n_i\) with \(i \in [1,m_A]\) and \(n_i>0\) are successful. Note that the events \(E_{n_i}\) are not independent for a fixed point set \(\mathcal {P}\). However, the bound (4) on \(p(E_{n_{i}})\) holds as long as condition (2) is fulfilled, i.e., for an arbitrary point set. Put differently, the bound (4) holds even for the “worst” distribution of points. Therefore, we have that \(p_A\ge \prod _{i\in [1,m_A]} 2^{-1/(i\cdot c_d)}\) using stochastic domination. Note that our choice of maximizing \(n_i\), i.e., the number of required successful trials for \(E_{n_i}\), minimizes the probability \(p(E_{n_i})\). This is quite intuitive, since it says that we should maximize the number of points closest to A that should not be placed in the same set as A (i.e., they are just a bit too far to yield the claimed approximation guarantee). We must also consider which distribution of the \(n_i\), under the constraint \(\sum _{i=1}^{m_A} n_i \le N\), minimizes the bound for \(p_A\). It is apparent from the derivation of (4) that this happens when we maximize \(n_i\); the bound for \(p_A\) decreases more if we maximize \(n_i\) for small i. Essentially, this follows from line 2 in (4): the number of trials \(n_T\) is less than \(\sqrt{i}\) and each trial is successful with probability \((1-1/i^{3/2})\) (focusing on dominating terms), yielding an overall success probability of \((1-1/i^{3/2})^{n_T}\). Thus, \((1-1/i^{3/2})^{\sqrt{i}}> (1-1/l^{3/2})^{\sqrt{l}}\) for \(1<i<l\). Put differently, choosing \(n_i\) large for a large i is not a problem for our algorithm, because it is unlikely that these points will be projected in between the nearest points to A.

Therefore, when maximizing the number of points close to A, we have that \(m_A=(\log N)^2\), i.e., all \(n_i\) for \(i>(\log N)^2\) are 0, because \(2^{\sqrt{(\log N)^2}}=n_{(\log N)^2}=N\). Additionally, note that we need at most \(c_8 \log N\) trials in total. As each trial halves the number of points, we only need to take into account the subset \(X\subseteq [1,m_A]\) for which the number of points doubles, i.e., \(n_j=2\cdot n_{i}\), for \(n_i=2^{i^{1/2}}\). This happens whenever \(i^{1/2}\) is an integer, i.e., for \(i=1,4,9,16,\ldots \) we get \(n_i=2,4,8,16,\ldots \). Thus, we only need to consider \(i^2 \in [1,m_A]\):

$$\begin{aligned} p_A&\ge \prod _{i^2\in [1,m_A]} 2^{- 1/(c_d\cdot i)}\\&\ge \prod _{i^2\in \left[ 1,\log ^2 N \right] } 2^{-1/(c_d\cdot i)}\\&\ge 2^{-1/c_d \sum _{i^2\in \left[ 1,\log ^2 N \right] } 1/i}\\&= 2^{-1/c_d \sum _{i\in \left[ 1,\log N\right] } 1/i^2}\\&\ge 2^{-2/c_d} \qquad \text {(using } \textstyle \sum _{i\ge 1} 1/i^2=\pi ^2/6<2) \\&= 1/2^{2/c_d} \end{aligned}$$

Thus, when doing \( c_p (\log N)\) partitionings, we have at least \(c_p/16 \log N\) successes for point A whp using Theorem 1 and \(c_d\ge 1\). This also holds for all points whp using Theorem 2.

Finally, let us bound the number of nearby points that remain. We need at most \(c_L \log N\) projections (see Theorem 3) until a point set is not split further. Each projection retains at least a fraction \(1-1/c_r\) of the points in \(\mathcal {N}(A,r)\). We give the bound in two steps, i.e., for \(c_r\ge \log ^3 N\) and for \(c_r \in [ c_d, \log ^3 N]\).

$$\begin{aligned} \prod _{i=1}^{c_L\log N} (1-1/c_r)&\ge \left( 1-1/\log ^3 N\right) ^{c_L\log N} \qquad \text {(assuming } c_r\ge \log ^3 N) \\&\ge 1-1/\log N \end{aligned}$$

To reduce the number of points by a factor of \(\log ^3 N\) requires \(3\cdot \log \log N \) trials, each reducing the set by a factor 1/2. Thus, trial i is conducted with \(c_r = \log ^3 N / 2^i\) or, equivalently, trial \(3\cdot \log \log N-i\) is conducted with \(c_r= 2^i\). Thus, in total the fraction of remaining points in \(\mathcal {N}(A,r)\) is

$$\begin{aligned} (1-1/\log N) \prod _{i=1}^{3 \log \log N} (1-1/2^i)^{\log c_d}&= (1-1/\log N) \cdot \left( \prod _{i=1}^{3 \log \log N} (1-1/2^i)\right) ^{\log c_d}\\&= (1-1/\log N) \cdot \left( 2^{\sum _{i=\log c_d}^{3 \log \log N} \log (1-1/2^i)}\right) ^{\log c_d} \\&\ge (2^{1-2/c_d})^{\log c_d} \ge 1/(2c_d) \end{aligned}$$

\(\square \)
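To summarize the process analyzed in this proof — repeated splitting until sets fall below \(\textit{minSize}\), repeated over several independent partitionings — the following is a small driver sketch. It is only an illustrative Python sketch under assumed names and parameters (e.g., min_size and num_partitionings); it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def split_until_small(points, min_size):
    """Recursively split `points` via random projections until every part is small."""
    if len(points) <= min_size:
        return [points]
    line = rng.normal(size=points.shape[1])
    line /= np.linalg.norm(line)                 # random projection line
    proj = points @ line
    threshold = proj[rng.integers(len(points))]  # projection of a random splitting point
    left, right = points[proj <= threshold], points[proj > threshold]
    if len(left) == 0 or len(right) == 0:        # degenerate split: stop recursing
        return [points]
    return split_until_small(left, min_size) + split_until_small(right, min_size)

def partitionings(points, min_size, num_partitionings):
    """Collect the small sets from several independent random-projection partitionings."""
    sets = []
    for _ in range(num_partitionings):
        sets.extend(split_until_small(points, min_size))
    return sets

pts = rng.normal(size=(2000, 10))
small_sets = partitionings(pts, min_size=32, num_partitionings=4 * int(np.log2(len(pts))))
print(len(small_sets), "small sets collected")
```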

Proof of Theorem 10

First, we bound the number of neighbors. Using Theorem 9, we obtain \(c_p/16\cdot (\log N)\) sets \(\mathcal {S}_A\) containing A. Define \(\mathfrak {S}_A\) to be the union of all sets \(\mathcal {S}_A \in \mathfrak {S}\) containing A. Before the last split of a set \(\mathcal {S}_A\), resulting in the sets \(\mathcal {S}_{1,A}\) and \(\mathcal {S}_2\), the set \(\mathcal {S}_A\) must be of size at least \(c_m\cdot \textit{minPts}\); the probability that splitting it at a random point results in a set \(\mathcal {S}_{1,A}\) with \(|\mathcal {S}_{1,A}|<c_m/2\cdot \textit{minPts}\) is at most 1/2. Thus, using a Chernoff bound (Theorem 1), at least \(c_p/128\cdot \log N\) sets \(\mathcal {S}_A \in \mathfrak {S}_A\) are of size at least \(c_m/2\cdot \textit{minPts}\) whp.

Let \(\mathcal {S}_A\) be a set of size at least \(c_m/2\cdot \textit{minPts}\). Consider the process in which the neighborhood \(\mathcal {N}(A)\) is built by inspecting one set \(\mathcal {S}_A\) after the other. Assume that the number of neighbors satisfies \(|\mathcal {N}(A)|< c_m/2\cdot \textit{minPts}/(2{c_d})\). Then the probability \(p(\text {Choose new close neighbor } B)=p(B \not \in \mathcal {N}(A) \wedge B \in \mathcal {N}(A,r))\) that a point \(B \in \mathcal {S}_A\) not already in \(\mathcal {N}(A)\) is chosen from \(\mathcal {N}(A,r) \cap \mathcal {S}_A\) is at least \(c_m/(4{c_d})\):

$$\begin{aligned}&p\left( \text {Choose new close neighbor } B \mid |\mathcal {N}(A)|< c_m/2\cdot \textit{minPts}/(2c_d)\right) \\&\quad := p(B \not \in \mathcal {N}(A) \wedge B \in \mathcal {N}(A,r)) \ge c_m/(4{c_d}) \end{aligned}$$

As by assumption \(\textit{minPts} < c_m \log N\), there are at least \(c_p/128\cdot \log N\) sets \(\mathcal {S}_A\) with \(|\mathcal {S}_A|\ge c_m/2\cdot \textit{minPts}\), and \(c_p \ge c_m\cdot 128\), the Chernoff bound in Theorem 1 yields that there are at least \(c_m/(4{c_d})\cdot \textit{minPts}\) points within distance \(D_{c_m \textit{minPts}}(A)\) in \(\mathcal {N}(A)\) whp, for every point A. Setting \(c_m\ge 8 c_d\) completes the proof. \(\square \)
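For intuition, the neighborhood \(\mathcal {N}(A)\) analyzed here is simply the union of all small sets that contain A across the partitionings. A minimal sketch, with points identified by index and illustrative toy sets; this is an assumption-laden illustration rather than the paper's code:

```python
from collections import defaultdict

def neighborhoods(index_sets):
    """For every point index, take the union of all small sets containing it."""
    neigh = defaultdict(set)
    for s in index_sets:
        for a in s:
            neigh[a].update(s)
            neigh[a].discard(a)   # a point is not counted as its own neighbor
    return neigh

# Example with toy index sets; in practice these come from the partitionings.
index_sets = [{0, 1, 2, 5}, {0, 2, 3}, {1, 4, 5}]
print(neighborhoods(index_sets)[0])   # -> {1, 2, 3, 5}
```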

Proof of Theorem 12

To compute Davg(A) with \(f=1\), we consider the \((1+f)\cdot \textit{minPts}= 2\cdot \textit{minPts}\) closest points to A from \(\mathcal {N}(A)\). Using Theorem 10, \(2\cdot \textit{minPts}\) points are contained in \(\mathcal {N}(A)\) with distance at most \(D_{c_m \textit{minPts}}(A)\). This yields \(Davg(A)\le D_{c_m \textit{minPts}}(A)\); thus, the upper bound follows. To compute Davg(A), we average the distances to the \(2\cdot \textit{minPts}\) closest points to A. The smallest value of Davg(A) is reached when \(\mathcal {N}(A)\) contains all \(2\cdot \textit{minPts}\) closest points to A, which implies \(Davg(A)\ge D_{2\cdot \textit{minPts}}(A)\ge D_{ \textit{minPts}}(A)\) for any set of neighbors \(\mathcal {N}(A)\). \(\square \)
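The quantity Davg(A) used above is just the average distance from A to the \((1+f)\cdot \textit{minPts}\) closest points found in \(\mathcal {N}(A)\). A minimal sketch, with illustrative names and assuming the neighborhood is given as an array of Euclidean points:

```python
import numpy as np

def davg(a, neighborhood, min_pts, f=1.0):
    """Average distance from `a` to its (1+f)*min_pts closest points in `neighborhood`."""
    k = int((1 + f) * min_pts)
    dists = np.sort(np.linalg.norm(neighborhood - a, axis=1))
    return dists[:k].mean()

rng = np.random.default_rng(2)
a = np.zeros(3)
neighborhood = rng.normal(size=(100, 3))   # candidate neighborhood N(A) as an array
print(davg(a, neighborhood, min_pts=10))   # with f=1 this averages the 20 smallest distances
```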


Cite this article

Schneider, J., Vlachos, M. Scalable density-based clustering with quality guarantees using random projections. Data Min Knowl Disc 31, 972–1005 (2017). https://doi.org/10.1007/s10618-017-0498-x
