
Finding Meaningful Cluster Structure Amidst Background Noise

Conference paper

Algorithmic Learning Theory (ALT 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9925)

Abstract

We consider efficient clustering algorithms under data clusterability assumptions in the presence of added noise. In contrast with most of the literature on this topic, which considers either an adversarial noise setting or some generative noise model, we examine a realistically motivated setting in which the only restriction on the noisy part of the data is that it does not create significantly large “clusters”. Another aspect in which our model deviates from common approaches is that we stipulate the goal of clustering as discovering meaningful cluster structure in the data, rather than optimizing some objective (clustering cost).

We introduce efficient algorithms that discover and cluster every subset of the data that has meaningful structure and whose complement lacks structure (under a formal definition of such “structure”). Notably, the success of our algorithms does not depend on any upper bound on the fraction of noisy data.

We complement our results by showing that when either the notion of structure or the noise requirement is relaxed, no such results are possible.


Notes

  1. The assignment to clusters can sometimes be probabilistic, and clusters may be allowed to intersect, but these aspects are orthogonal to the discussion in this paper.

References

  1. Ackerman, M., Ben-David, S.: Clusterability: a theoretical study. In: International Conference on Artificial Intelligence and Statistics, pp. 1–8 (2009)

  2. Awasthi, P., Blum, A., Sheffet, O.: Center-based clustering under perturbation stability. Inf. Process. Lett. 112(1), 49–54 (2012)

  3. Balcan, M.-F., Blum, A., Fine, S., Mansour, Y.: Distributed learning, communication complexity and privacy. arXiv preprint arXiv:1204.3514 (2012)

  4. Balcan, M.-F., Blum, A., Vempala, S.: A discriminative framework for clustering via similarity functions. In: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 671–680. ACM (2008)

  5. Balcan, M.-F., Liang, Y.: Clustering under perturbation resilience. In: Mehlhorn, K., Pitts, A., Wattenhofer, R., Czumaj, A. (eds.) ICALP 2012, Part I. LNCS, vol. 7391, pp. 63–74. Springer, Heidelberg (2012)

  6. Ben-David, S.: Computational feasibility of clustering under clusterability assumptions. arXiv preprint arXiv:1501.00437 (2015)

  7. Ben-David, S., Haghtalab, N.: Clustering in the presence of background noise. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 280–288 (2014)

  8. Ben-David, S., Reyzin, L.: Data stability in clustering: a closer look. Theor. Comput. Sci. 558, 51–61 (2014)

  9. Bilu, Y., Linial, N.: Are stable instances easy? Comb. Probab. Comput. 21(5), 643–660 (2012)

  10. Cuesta-Albertos, J.A., Gordaliza, A., Matrán, C.: Trimmed \( k \)-means: an attempt to robustify quantizers. Ann. Stat. 25(2), 553–576 (1997)

  11. Dave, R.N.: Robust fuzzy clustering algorithms. In: Second IEEE International Conference on Fuzzy Systems, pp. 1281–1286. IEEE (1993)

  12. García-Escudero, L.A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: A general trimming approach to robust cluster analysis. Ann. Stat., 1324–1345 (2008)

  13. Reyzin, L.: Data stability in clustering: a closer look. In: Stoltz, G., Vayatis, N., Zeugmann, T., Bshouty, N.H. (eds.) ALT 2012. LNCS, vol. 7568, pp. 184–198. Springer, Heidelberg (2012)

  14. Vapnik, V.N., Chervonenkis, A.Ya.: On the uniform convergence of relative frequencies of events to their probabilities. In: Vovk, V., Papadopoulos, H., Gammerman, A. (eds.) Measures of Complexity, pp. 11–30. Springer, Heidelberg (2015)


Author information

Correspondence to Shrinu Kushagra.

A Proofs of Missing Lemmas and Theorems

Proof of Theorem 2. Fix any \(\mathcal S \subseteq \mathcal X\). Let \(\mathcal C_{\mathcal S}^* = \{S_1^*, \ldots , S_k^*\}\) be a clustering of \(\mathcal S\) such that \(m(\mathcal C_{\mathcal S}^*) = t\) and \(\mathcal C_{\mathcal S}^*\) has \((\alpha , \eta )\)-center proximity. Denote \(r_i := r(S_i^*)\) and \(r := \max_i r_i\). Define \(Y_B^{\mathcal C} := \{C_i \in \mathcal C : C_i \subseteq B \text { or } |B \cap C_i| \ge t/2\}\). Note that whenever a ball B satisfies the sparse-distance condition, all the clusters in \(Y_{B}^{{\mathcal C}^{(l)}}\) are merged together, yielding the updated clustering \(\mathcal C^{(l+1)}\). We will prove the theorem by proving two key facts.

  • F.1 If the algorithm merges points from a good cluster \(S_i^*\) with points from some other good cluster, then at this step the distance being considered satisfies \(d = d(p,q) > r_i\).

  • F.2 When the algorithm considers the distance \(d = r_i\), it merges all points from \(S_i^*\) (and possibly points from \(\mathcal X\setminus \mathcal S\)) into a single cluster \(C_i\). Hence, there exists a node \(N_i\) in the tree which contains all the points from \(S_i^*\) and no points from any other good cluster \(S_j^*\).

Note that the theorem follows from these two facts. Similar reasoning was also used in the proof of Lemma 3 in [5]. We now prove both of these facts formally.
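For concreteness, here is a minimal Python sketch of the merge rule analyzed in this proof, under stated assumptions: it is not the authors' exact pseudocode, the sparse-distance condition is abstracted as a caller-supplied predicate, and the tree of merges (whose nodes the proof calls \(N_i\)) is not recorded. Only the \(Y_B^{\mathcal C}\) merge rule is taken directly from the definition above.

```python
# Hypothetical sketch of the linkage-style merge pass analyzed in the proof
# of Theorem 2. Pairs (p, q) are scanned in increasing order of d(p, q);
# whenever the ball B = B(p, d(p, q)) satisfies the (abstracted)
# sparse-distance condition, every current cluster contained in B or
# overlapping it in at least t/2 points -- the set Y_B^C -- is merged.

def merge_pass(points, dist, t, satisfies_sparse_distance):
    """points: list of point ids; dist(p, q) -> float; t: size threshold."""
    clusters = [{p} for p in points]  # start from singleton clusters
    pairs = [(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    pairs.sort(key=lambda pq: dist(*pq))  # consider distances in increasing order
    for p, q in pairs:
        d = dist(p, q)
        ball = {x for x in points if dist(p, x) <= d}  # B(p, d(p, q))
        if not satisfies_sparse_distance(ball, clusters, t):
            continue
        # Y_B^C: clusters contained in B or meeting it in >= t/2 points.
        to_merge = [C for C in clusters if C <= ball or len(ball & C) >= t / 2]
        if len(to_merge) > 1:
            clusters = [C for C in clusters if C not in to_merge]
            clusters.append(set().union(*to_merge))
    return clusters
```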

Proof of Fact F.1. Let \(\mathcal C^{(l)} = \{C_1, \ldots , C_{k'}\}\) be the current clustering of \(\mathcal X\). Let \(l+1\) be the first merge step which merges points from the good cluster \(S_i^*\) with points from some other good cluster. Let \(p, q \in \mathcal X\) be the pair of points being considered at this step and \(B = B(p, d(p, q))\) the ball that satisfies the sparse-distance condition at this merge step. Denote \(Y = Y_{B}^{\mathcal C^{(l)}}\). We need to show that \(d(p, q) > r_i\). To prove this, we need Claim 1 below.

Claim 1

Let \(p, q \in \mathcal X\) and B, Y, \(S_i^*\) and \(\mathcal C^{(l)}\) be as defined above. If \(d(p, q) \le r\), then \(B \cap S_i^* \ne \emptyset \) and there exists \(n \ne i\) such that \(B \cap S_n^* \ne \emptyset \).

Since \(l+1\) is the first step which merges points from \(S_i^*\) with some other good cluster, there exists \(C_i \in Y\) such that \(C_i\cap S_i^* \ne \emptyset \) and, for all \(n \ne i\), \(C_i \cap S_n^* = \emptyset \). Also, there exists \(C_j \in Y\) such that \(C_j \cap S_j^* \ne \emptyset \) for some \(S_j^*\) and \(C_j \cap S_i^* = \emptyset \).

Since \(C_i \in Y\), either \(C_i \subseteq B\) or \(|C_i \cap B| \ge t/2\). In the former case, \(B \cap S_i^* \supseteq C_i \cap S_i^* \ne \emptyset \) trivially. In the latter case, assume for the sake of contradiction that B contains no points from \(S_i^*\). This implies that \(B \cap C_i \subseteq B \cap \{\mathcal X \setminus \mathcal S\}\) and hence \(|B\cap \{\mathcal X \setminus \mathcal S\}| \ge t/2\), contradicting the sparseness condition on \(\mathcal X \setminus \mathcal S\). The case of \(C_j\) is identical.    \(\blacksquare \)

Claim 2

Let the framework be as given in Claim 1. Then, \(d(p, q) > r_i\).

If \(d(p, q) > r\), then the claim follows trivially. In the other case, by Claim 1, B contains some \(p_i \in S_i^*\) and some \(p_j \in S_j^*\). Let \(r_i = d(c_i, q_i)\) for some \(q_i \in S_i^*\). Then

$$d(c_i, q_i) < \frac{1}{\alpha } d(q_i, c_j) < \frac{1}{\alpha } \Big[ \frac{1}{\alpha }d(p_i, p_j) + \frac{1}{\alpha }d(c_i, q_i) + d(p_i, p_j) + 2d(c_i, q_i)\Big].$$

Multiplying through by \(\alpha ^2\) and rearranging gives \((\alpha ^2 - 2\alpha - 1)d(c_i, q_i) < (\alpha + 1) d(p_i, p_j)\). For \(\alpha \ge 2 + \sqrt{7}\), we have \(\alpha ^2 - 2\alpha - 1 \ge 2(\alpha + 1)\), so \(d(c_i, q_i) < d(p_i, p_j)/2\). Finally, since \(p_i, p_j \in B(p, d(p, q))\), we have \(d(p_i, p_j) \le 2d(p, q)\), which implies \(d(c_i, q_i) < d(p, q)\). This result was also stated in [5].   \(\blacksquare \)

Proof of Fact F.2. Let \(\mathcal C^{(l)} = \{C_1, \ldots , C_{k'}\}\) be the current clustering of \(\mathcal X\). Let \(l+1\) be the merge step when \(p = s_i\) and \(q = q_i\) such that \(d(s_i, q_i) = r_i\). We will prove that the ball \(B = B(s_i, r_i)\) satisfies the sparse-distance condition.

Claim 3

Let \(s_i\), \(q_i\), \(r_i\), B and Y be as defined above. Then, B satisfies the sparse-distance condition, and for all \(C \in Y\) and all \(j \ne i\), \(C \cap S_j^* = \emptyset \).

First, \(|B| = |S_i^*| \ge t\). Observe that, for all \(C \in \mathcal C^{(l)}\), either \(|C| = 1\) or \(|C| \ge t\).

  • Case 1. \(|C| = 1\). If \(C \cap B \ne \emptyset \), then \(C \subseteq B = S_i^*\).

  • Case 2. \(|C|\ge t\) and \(C \cap B \ne \emptyset \). Let h(C) denote the height of the cluster C in the tree T.

    • Case 2.1. \(h(C) = 1\). In this case, there exists a ball \(B'\) such that \(B' = C\). We know that \(r(B') \le r_i \le r\). Hence using Claim 2, we get that for all \(j \ne i\), \(B' \cap S_j^* = \emptyset \). Thus, \(|B'\setminus S_i^*| \le t/2 \implies |B\cap C| = |C| - |C\setminus B| = |C| - |B'\setminus S_i^*| \ge t/2\). Hence, \(C \in Y\).

    • Case 2.2. \(h(C) > 1\). Then there exists some \(C'\) such that \(h(C') = 1\) and \(C' \subset C\). Now, using set inclusion and the result from Case 2.1, we get that \(|B\cap C| \ge |B\cap C'| \ge t/2\). Hence, \(C \in Y\). Using Claim 2, we get that for all \(j \ne i\), \(C \cap S_j^* = \emptyset \).   \(\blacksquare \)

Fig. 1. \(\mathcal X \subseteq \mathbb {R}\) such that no tree can capture all the \((\alpha , \eta )\)-proximal clusterings.

Proof of Theorem 4. Let \(\mathcal X, B_1, B_2, B_1', B_2'\) be as shown in Fig. 1. Let \(t_1 = \frac{t}{2}+1\) and \(t_2 = \frac{t}{2}-2\). For \(\alpha \le 2+\sqrt{3}\), the clusterings \(\mathcal C_{\mathcal S} = \{B_1, B_2, B_3, \ldots , B_k\}\) and \(\mathcal C_{\mathcal S'} = \{B_1', B_2', B_3, \ldots , B_k\}\) satisfy \((\alpha , 1)\)-center proximity and \(m(\mathcal C_{\mathcal S}) = m(\mathcal C_{\mathcal S'}) = t\). Now, a simple proof by contradiction shows that there does not exist a tree T with prunings P and \(P'\) such that P respects \(\mathcal C_{\mathcal S}\) and \(P'\) respects \(\mathcal C_{\mathcal S'}\).    \(\blacksquare \)

Proof of Theorem 5. The clustering instance \(\mathcal X\) is an extension of Fig. 1. Let \(G_1 = \{B_1, B_1', B_2, B_2'\}\) be the balls as in Fig. 1. Now, construct \(G_2 = \{B_3, B_3', B_4, B_4'\}\) identical to \(G_1\) but placed far away. In this way, we construct \(k/2\) copies of \(G_1\).    \(\blacksquare \)

Proof of Theorem 6. Let \(\mathcal X \subseteq \mathbb {R}\) be as shown in Fig. 2. Let \(t' = \frac{t}{2}-1\) and let \(B_1, B_2, B_3, B_1'\), \(B_2', B_3', B_1'', B_2''\) and \(B_3''\) be as shown in Fig. 2. For \(\alpha \le 2\sqrt{2}+3\), the clusterings \(\mathcal C_{\mathcal S} = \{B_1, B_2, B_3, \ldots , B_k\}\), \(\mathcal C_{\mathcal S'} = \{B_1', B_2', B_3, \ldots , B_k\}\) and \(\mathcal C_{\mathcal S''} = \{B_1'', B_2'', B_3, \ldots , B_k\}\) satisfy \((\alpha , 1)\)-center proximity. Also, \(m(\mathcal C_{\mathcal S}) = m(\mathcal C_{\mathcal S'}) = m(\mathcal C_{\mathcal S''}) = t\). Arguing as in the proof of Theorem 4 completes the proof.    \(\blacksquare \)

Fig. 2. \(\mathcal X \subseteq \mathbb {R}\) such that no algorithm can capture all the \(\alpha \)-proximal clusterings.

The proofs of Theorems 7, 15 and 13 follow the same ideas as the proof of Theorem 5. To prove the lower bound in the list model, the instance constructed in Theorem 5 is a simple extension of the instance in Theorem 4. The instances for the proofs of Theorems 7, 15 and 13 are similarly constructed as extensions of their respective tree lower-bound instances (Theorems 6, 14 and 12).

Proof of Theorem 8. We will show that \(\mathcal C_{\mathcal X}^*\) has strong stability (in the sense of [4]), which completes the proof by Theorem 8 in [4]. Let \(A \subset C_i^*\) and \(B \subseteq C_j^*\). Let \(p \in A\) and \(q \in C_i^* \setminus A\) be points which achieve the minimum distance between A and \(C_i^*\setminus A\). If \(c_i \in A\), then \(d(p, q) \le d(c_i, q) \le r\). If \(c_i \in C_i^* \setminus A\), then \(d(p, q) \le d(p, c_i) \le r\). Hence, \(d_{min} (A, C_i^*\setminus A) \le r\). Similarly, we get that \(d_{min}(A, B) > r\).   \(\blacksquare \)

The proofs of Theorems 14 and 12 are identical to the proofs of Theorems 6 and 4, respectively.

Proof of Theorem 10. Fix \(\mathcal S \subseteq \mathcal X\). Denote \(r_i := r(S_i^*)\). Let \(\mathcal C\) be the clustering output by the algorithm. Let \(\mathcal L = \{B_1, \ldots , B_l\}\) be the list of balls output by Phase 1 of Algorithm 3. Let G be the graph constructed in Phase 2 of the algorithm. Observe that \(B = B(s_i, r_i) = S_i^* \in \mathcal L\). WLOG, denote this ball by \(B^{(i)}\) and the corresponding vertex in the graph G by \(v^{(i)}\). We will prove the theorem by proving two key facts.

  • F.1 If \(B_{i1}\) and \(B_{i2}\) intersect \(S_i^*\), then the vertices \(v_{i1}\) and \(v_{i2}\) are connected in G.

  • F.2 If \(B_{i1}\) intersects \(S_i^*\) and \(B_{j1}\) intersects \(S_j^*\) then \(v_{i1}\) and \(v_{j1}\) are disconnected in G.

Claim 4

Let \(\mathcal L, G, B^{(i)}\) and \(v^{(i)}\) be as defined above. Let balls \(B_{i1}, B_{i2} \in \mathcal L\) be such that \(B_{i1} \cap S_i^* \ne \emptyset \) and \(B_{i2} \cap S_i^* \ne \emptyset \). Then there exists a path between \(v_{i1}\) and \(v_{i2}\).

Assume that \(v_{i1}\) and \(v^{(i)}\) are not connected by an edge. Hence, \(|B_{i1} \setminus B^{(i)}| \ge t/2\). Since \(\lambda > 4\), for all \(j \ne i\), \(B_{i1} \cap S_j^* = \emptyset \). Thus, \(B_{i1} \setminus B^{(i)} \subseteq \mathcal X \setminus \mathcal S\), which contradicts \(|B_{i1} \cap \{\mathcal X \setminus \mathcal S\}| < t/2\). Hence, every ball that intersects \(S_i^*\) is adjacent to \(v^{(i)}\); in particular, \(v_{i1}\) and \(v_{i2}\) are both adjacent to \(v^{(i)}\), which gives a path between them.   \(\blacksquare \)

Claim 5

Let the framework be as in Claim 4. Let \(B_{i1} \in \mathcal L\) be such that \(B_{i1} \cap S_i^* \ne \emptyset \) and \(B_{j1} \in \mathcal L\) be such that \(B_{j1} \cap S_j^* \ne \emptyset \) for some \(j \ne i\). Then \(v_{i1}\) and \(v_{j1}\) are disconnected in G.

Assume for contradiction that \(v_{i1}\) and \(v_{j1}\) are connected. Then there exist vertices \(v_{i}\) and \(v_{n}\) such that \(v_i\) and \(v_n\) are connected by an edge in G, \(B_i \cap S_i^* \ne \emptyset \) and \(B_n \cap S_n^* \ne \emptyset \) for some \(n \ne i\). Since \(v_i\) and \(v_n\) are adjacent, \(|B_i \cap B_n| \ge t/2\). Now, \(\lambda \ge 4\), thus \(B_i \cap \{\mathcal S \setminus S_i^*\} = \emptyset \) and \(B_n \cap \{\mathcal S\setminus S_n^*\} = \emptyset \). Thus, \(B_i \cap B_n \subseteq \mathcal X \setminus \mathcal S\), which contradicts the sparseness assumption.    \(\blacksquare \)
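Together, the two facts imply that the connected components of G correspond exactly to the good clusters. For concreteness, here is a minimal Python sketch of this Phase-2 step under stated assumptions: the adjacency rule (an edge between two balls that share at least \(t/2\) points) is inferred from how edges are used in Claims 4 and 5, and the function names are hypothetical, not the authors' pseudocode.

```python
# Hypothetical sketch of Phase 2 as analyzed in Claims 4 and 5: build a graph
# on the balls output by Phase 1, join two balls by an edge when they share at
# least t/2 points (the assumed adjacency rule), and output one cluster per
# connected component.

from itertools import combinations

def phase_two(balls, t):
    """balls: list of frozensets of point ids output by Phase 1."""
    n = len(balls)
    adj = {u: set() for u in range(n)}
    for u, v in combinations(range(n), 2):
        if len(balls[u] & balls[v]) >= t / 2:  # overlap threshold -> edge
            adj[u].add(v)
            adj[v].add(u)
    # Group vertices into connected components with a DFS.
    seen, components = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.add(u)
            stack.extend(adj[u] - seen)
        components.append(comp)
    # Each component yields one cluster: the union of its balls.
    return [frozenset().union(*(balls[u] for u in comp)) for comp in components]
```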

Theorem 16

(Vapnik and Chervonenkis [14]). Let X be a domain set and D a probability distribution over X. Let H be a class of subsets of X of finite VC-dimension d. Let \(\epsilon , \delta \in (0,1)\). Let \(S \subseteq X\) of size m be picked i.i.d. according to D. If \(m > \frac{c}{\epsilon ^2}(d\log \frac{d}{\epsilon }+\log \frac{1}{\delta })\) for a universal constant c, then with probability \(1-\delta \) over the choice of S, we have that for all \(h \in H\),

$$\bigg |\frac{|h\cap S|}{|S|} - P(h)\bigg | < \epsilon .$$
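As a quick numeric illustration of this sample-size bound, a short Python snippet follows; the universal constant c is left unspecified by the theorem, so it appears here as a parameter rather than a value taken from the paper.

```python
import math

def vc_sample_bound(d, eps, delta, c=1.0):
    """Smallest integer m with m > (c/eps^2) * (d*log(d/eps) + log(1/delta))."""
    # c is the unspecified universal constant from Theorem 16.
    return math.floor((c / eps ** 2) * (d * math.log(d / eps) + math.log(1 / delta))) + 1

# Example: VC-dimension d = 3, accuracy eps = 0.1, confidence 1 - delta = 0.95.
# With c = 1 this gives m = 1320 (an order-of-magnitude guide only).
print(vc_sample_bound(d=3, eps=0.1, delta=0.05))
```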


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Kushagra, S., Samadi, S., Ben-David, S. (2016). Finding Meaningful Cluster Structure Amidst Background Noise. In: Ortner, R., Simon, H., Zilles, S. (eds) Algorithmic Learning Theory. ALT 2016. Lecture Notes in Computer Science(), vol 9925. Springer, Cham. https://doi.org/10.1007/978-3-319-46379-7_23

  • DOI: https://doi.org/10.1007/978-3-319-46379-7_23

  • Print ISBN: 978-3-319-46378-0

  • Online ISBN: 978-3-319-46379-7