
Finding Meaningful Cluster Structure Amidst Background Noise

Conference paper

Algorithmic Learning Theory (ALT 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9925)

Abstract

We consider efficient clustering algorithms under data clusterability assumptions in the presence of added noise. In contrast with most of the literature on this topic, which considers either an adversarial noise setting or some generative noise model, we examine a realistically motivated setting in which the only restriction on the noisy part of the data is that it does not create significantly large “clusters”. Another aspect in which our model deviates from common approaches is that we stipulate the goal of clustering as discovering meaningful cluster structure in the data, rather than optimizing some objective (clustering cost).

We introduce efficient algorithms that discover and cluster every subset of the data that has meaningful structure and whose complement lacks structure (under a formal definition of such “structure”). Notably, the success of our algorithms does not depend on any upper bound on the fraction of noisy data.

We complement our results by showing that when either the notion of structure or the noise requirement is relaxed, no such results are possible.


Notes

  1. The assignment to clusters can sometimes be probabilistic, and clusters may be allowed to intersect, but these aspects are orthogonal to the discussion in this paper.

References

  1. Ackerman, M., Ben-David, S.: Clusterability: a theoretical study. In: International Conference on Artificial Intelligence and Statistics, pp. 1–8 (2009)

  2. Awasthi, P., Blum, A., Sheffet, O.: Center-based clustering under perturbation stability. Inf. Process. Lett. 112(1), 49–54 (2012)

  3. Balcan, M.-F., Blum, A., Fine, S., Mansour, Y.: Distributed learning, communication complexity and privacy. arXiv preprint arXiv:1204.3514 (2012)

  4. Balcan, M.-F., Blum, A., Vempala, S.: A discriminative framework for clustering via similarity functions. In: Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pp. 671–680. ACM (2008)

  5. Balcan, M.-F., Liang, Y.: Clustering under perturbation resilience. In: Mehlhorn, K., Pitts, A., Wattenhofer, R., Czumaj, A. (eds.) ICALP 2012, Part I. LNCS, vol. 7391, pp. 63–74. Springer, Heidelberg (2012)

  6. Ben-David, S.: Computational feasibility of clustering under clusterability assumptions. arXiv preprint arXiv:1501.00437 (2015)

  7. Ben-David, S., Haghtalab, N.: Clustering in the presence of background noise. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 280–288 (2014)

  8. Ben-David, S., Reyzin, L.: Data stability in clustering: a closer look. Theor. Comput. Sci. 558, 51–61 (2014)

  9. Bilu, Y., Linial, N.: Are stable instances easy? Comb. Probab. Comput. 21(5), 643–660 (2012)

  10. Cuesta-Albertos, J.A., Gordaliza, A., Matrán, C.: Trimmed \( k \)-means: an attempt to robustify quantizers. Ann. Stat. 25(2), 553–576 (1997)

  11. Dave, R.N.: Robust fuzzy clustering algorithms. In: Second IEEE International Conference on Fuzzy Systems, pp. 1281–1286. IEEE (1993)

  12. García-Escudero, L.A., Gordaliza, A., Matrán, C., Mayo-Iscar, A.: A general trimming approach to robust cluster analysis. Ann. Stat., 1324–1345 (2008)

  13. Reyzin, L.: Data stability in clustering: a closer look. In: Stoltz, G., Vayatis, N., Zeugmann, T., Bshouty, N.H. (eds.) ALT 2012. LNCS, vol. 7568, pp. 184–198. Springer, Heidelberg (2012)

  14. Vapnik, V.N., Chervonenkis, A.Ya.: On the uniform convergence of relative frequencies of events to their probabilities. In: Vovk, V., Papadopoulos, H., Gammerman, A. (eds.) Measures of Complexity, pp. 11–30. Springer, Heidelberg (2015)


Author information

Correspondence to Shrinu Kushagra.

A Proofs of Missing Lemmas and Theorems

Proof of Theorem 2. Fix any \(\mathcal S \subseteq \mathcal X\). Let \(\mathcal C_{\mathcal S}^* = \{S_1^*, \ldots , S_k^*\}\) be a clustering of \(\mathcal S\) such that \(m(\mathcal C_{\mathcal S}^*) = t\) and \(\mathcal C_{\mathcal S}^*\) has \((\alpha , \eta )\)-center proximity. Denote \(r_i := r(S_i^*)\) and \(r := \max_i r_i\). Define \(Y_B^{\mathcal C} := \{C_i \in \mathcal C : C_i \subseteq B \text { or } |B \cap C_i| \ge t/2\}\). Note that whenever a ball B satisfies the sparse-distance condition, all the clusters in \(Y_{B}^{{\mathcal C}^{(l)}}\) are merged together, yielding the updated clustering \(\mathcal C^{(l+1)}\). We will prove the theorem by proving two key facts.

  • F.1 If the algorithm merges points from a good cluster \(S_i^*\) with points from some other good cluster, then at this step the distance being considered satisfies \(d = d(p,q) > r_i\).

  • F.2 When the algorithm considers the distance \(d = r_i\), it merges all points from \(S_i^*\) (and possibly points from \(\mathcal X\setminus \mathcal S\)) into a single cluster \(C_i\). Hence, there exists a node \(N_i\) in the tree which contains all the points from \(S_i^*\) and no points from any other good cluster \(S_j^*\).

Note that the theorem follows from these two facts. Similar reasoning was also used in the proof of Lemma 3 in [5]. We now prove both of these facts formally.
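For concreteness, here is a minimal Python sketch of the merge rule analyzed in this proof, under stated assumptions: it is not the authors' exact pseudocode, the sparse-distance condition is abstracted as a caller-supplied predicate, and the tree of merges (whose nodes the proof calls \(N_i\)) is not recorded. Only the \(Y_B^{\mathcal C}\) merge rule is taken directly from the definition above.

```python
# Hypothetical sketch of the linkage-style merge pass analyzed in the proof
# of Theorem 2. Pairs (p, q) are scanned in increasing order of d(p, q);
# whenever the ball B = B(p, d(p, q)) satisfies the (abstracted)
# sparse-distance condition, every current cluster contained in B or
# overlapping it in at least t/2 points -- the set Y_B^C -- is merged.

def merge_pass(points, dist, t, satisfies_sparse_distance):
    """points: list of point ids; dist(p, q) -> float; t: size threshold."""
    clusters = [{p} for p in points]  # start from singleton clusters
    pairs = [(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    pairs.sort(key=lambda pq: dist(*pq))  # consider distances in increasing order
    for p, q in pairs:
        d = dist(p, q)
        ball = {x for x in points if dist(p, x) <= d}  # B(p, d(p, q))
        if not satisfies_sparse_distance(ball, clusters, t):
            continue
        # Y_B^C: clusters contained in B or meeting it in >= t/2 points.
        to_merge = [C for C in clusters if C <= ball or len(ball & C) >= t / 2]
        if len(to_merge) > 1:
            clusters = [C for C in clusters if C not in to_merge]
            clusters.append(set().union(*to_merge))
    return clusters
```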

Proof of Fact F.1. Let \(\mathcal C^{(l)} = \{C_1, \ldots , C_{k'}\}\) be the current clustering of \(\mathcal X\). Let \(l+1\) be the first merge step which merges points from the good cluster \(S_i^*\) with points from some other good cluster. Let \(p, q \in \mathcal X\) be the pair of points being considered at this step and \(B = B(p, d(p, q))\) the ball that satisfies the sparse-distance condition at this merge step. Denote \(Y = Y_{B}^{\mathcal C^{(l)}}\). We need to show that \(d(p, q) > r_i\). To prove this, we need Claim 1 below.

Claim 1

Let \(p, q \in \mathcal X\) and B, Y, \(S_i^*\) and \(\mathcal C^{(l)}\) be as defined above. If \(d(p, q) \le r\), then \(B \cap S_i^* \ne \emptyset \) and there exists \(n \ne i\) such that \(B \cap S_n^* \ne \emptyset \).

Since \(l+1\) is the first step which merges points from \(S_i^*\) with some other good cluster, there exists \(C_i \in Y\) such that \(C_i\cap S_i^* \ne \emptyset \) and, for all \(n \ne i\), \(C_i \cap S_n^* = \emptyset \). Also, there exists \(C_j \in Y\) such that \(C_j \cap S_j^* \ne \emptyset \) for some \(S_j^*\) and \(C_j \cap S_i^* = \emptyset \).

Since \(C_i \in Y\), either \(C_i \subseteq B\) or \(|C_i \cap B| \ge t/2\). In the former case, \(B \cap S_i^* \supseteq C_i \cap S_i^* \ne \emptyset \) trivially. In the latter case, assume for the sake of contradiction that B contains no points from \(S_i^*\). This implies that \(B \cap C_i \subseteq B \cap \{\mathcal X \setminus \mathcal S\}\) and hence \(|B\cap \{\mathcal X \setminus \mathcal S\}| \ge t/2\), contradicting the sparseness condition on \(\mathcal X \setminus \mathcal S\). The case of \(C_j\) is identical.    \(\blacksquare \)

Claim 2

Let the framework be as given in Claim 1. Then, \(d(p, q) > r_i\).

If \(d(p, q) > r\), then the claim follows trivially. In the other case, by Claim 1, B contains some \(p_i \in S_i^*\) and some \(p_j \in S_j^*\). Let \(r_i = d(c_i, q_i)\) for some \(q_i \in S_i^*\). Then

$$d(c_i, q_i) < \frac{1}{\alpha } d(q_i, c_j) < \frac{1}{\alpha } \Big[ \frac{1}{\alpha }d(p_i, p_j) + \frac{1}{\alpha }d(c_i, q_i) + d(p_i, p_j) + 2d(c_i, q_i)\Big].$$

Multiplying through by \(\alpha ^2\) and rearranging gives \((\alpha ^2 - 2\alpha - 1)d(c_i, q_i) < (\alpha + 1) d(p_i, p_j)\). For \(\alpha \ge 2 + \sqrt{7}\), we have \(\alpha ^2 - 2\alpha - 1 \ge 2(\alpha + 1)\), so \(d(c_i, q_i) < d(p_i, p_j)/2\). Finally, since \(p_i, p_j \in B(p, d(p, q))\), we have \(d(p_i, p_j) \le 2d(p, q)\), which implies \(d(c_i, q_i) < d(p, q)\). This result was also stated in [5].   \(\blacksquare \)

Proof of Fact F.2. Let \(\mathcal C^{(l)} = \{C_1, \ldots , C_{k'}\}\) be the current clustering of \(\mathcal X\). Let \(l+1\) be the merge step when \(p = s_i\) and \(q = q_i\) such that \(d(s_i, q_i) = r_i\). We will prove that the ball \(B = B(s_i, r_i)\) satisfies the sparse-distance condition.

Claim 3

Let \(s_i\), \(q_i\), \(r_i\), B and Y be as defined above. Then, B satisfies the sparse-distance condition, and for all \(C \in Y\) and all \(j \ne i\), \(C \cap S_j^* = \emptyset \).

First, \(|B| = |S_i^*| \ge t\). Observe that, for all \(C \in \mathcal C^{(l)}\), either \(|C| = 1\) or \(|C| \ge t\).

  • Case 1. \(|C| = 1\). If \(C \cap B \ne \emptyset \), then \(C \subseteq B = S_i^*\).

  • Case 2. \(|C|\ge t\) and \(C \cap B \ne \emptyset \). Let h(C) denote the height of the cluster C in the tree T.

    • Case 2.1. \(h(C) = 1\). In this case, there exists a ball \(B'\) such that \(B' = C\). We know that \(r(B') \le r_i \le r\). Hence using Claim 2, we get that for all \(j \ne i\), \(B' \cap S_j^* = \emptyset \). Thus, \(|B'\setminus S_i^*| \le t/2 \implies |B\cap C| = |C| - |C\setminus B| = |C| - |B'\setminus S_i^*| \ge t/2\). Hence, \(C \in Y\).

    • Case 2.2. \(h(C) > 1\). Then there exists some \(C'\) such that \(h(C') = 1\) and \(C' \subset C\). Now, using set inclusion and the result from Case 2.1, we get that \(|B\cap C| \ge |B\cap C'| \ge t/2\). Hence, \(C \in Y\). Using Claim 2, we get that for all \(j \ne i\), \(C \cap S_j^* = \emptyset \).   \(\blacksquare \)

Fig. 1. \(\mathcal X \subseteq \mathbb {R}\) such that no tree can capture all the \((\alpha , \eta )\)-proximal clusterings.

Proof of Theorem 4. Let \(\mathcal X, B_1, B_2, B_1', B_2'\) be as shown in Fig. 1. Let \(t_1 = \frac{t}{2}+1\) and \(t_2 = \frac{t}{2}-2\). For \(\alpha \le 2+\sqrt{3}\), the clusterings \(\mathcal C_{\mathcal S} = \{B_1, B_2, B_3, \ldots , B_k\}\) and \(\mathcal C_{\mathcal S'} = \{B_1', B_2', B_3, \ldots , B_k\}\) satisfy \((\alpha , 1)\)-center proximity and \(m(\mathcal C_{\mathcal S}) = m(\mathcal C_{\mathcal S'}) = t\). Now, a simple proof by contradiction shows that there does not exist a tree T with prunings P and \(P'\) such that P respects \(\mathcal C_{\mathcal S}\) and \(P'\) respects \(\mathcal C_{\mathcal S'}\).    \(\blacksquare \)

Proof of Theorem 5. The clustering instance \(\mathcal X\) is an extension of Fig. 1. Let \(G_1 = \{B_1, B_1', B_2, B_2'\}\) be the balls as in Fig. 1. Now, construct \(G_2 = \{B_3, B_3', B_4, B_4'\}\) identical to \(G_1\) but placed far away. In this way, we construct \(k/2\) copies of \(G_1\).    \(\blacksquare \)

Proof of Theorem 6. Let \(\mathcal X \subseteq \mathbb {R}\) be as shown in Fig. 2. Let \(t' = \frac{t}{2}-1\) and let \(B_1, B_2, B_3, B_1'\), \(B_2', B_3', B_1'', B_2''\) and \(B_3''\) be as shown in Fig. 2. For \(\alpha \le 2\sqrt{2}+3\), the clusterings \(\mathcal C_{\mathcal S} = \{B_1, B_2, B_3, \ldots , B_k\}\), \(\mathcal C_{\mathcal S'} = \{B_1', B_2', B_3, \ldots , B_k\}\) and \(\mathcal C_{\mathcal S''} = \{B_1'', B_2'', B_3, \ldots , B_k\}\) satisfy \((\alpha , 1)\)-center proximity. Also, \(m(\mathcal C_{\mathcal S}) = m(\mathcal C_{\mathcal S'}) = m(\mathcal C_{\mathcal S''}) = t\). Arguing as in the proof of Theorem 4 completes the proof.    \(\blacksquare \)

Fig. 2. \(\mathcal X \subseteq \mathbb {R}\) such that no algorithm can capture all the \(\alpha \)-proximal clusterings.

The proofs of Theorems 7, 15 and 13 follow the same ideas as the proof of Theorem 5. To prove the lower bound in the list model, the instance constructed in Theorem 5 is a simple extension of the instance in Theorem 4. The instances for the proofs of Theorems 7, 15 and 13 are similarly constructed as extensions of their respective tree lower-bound instances (Theorems 6, 14 and 12).

Proof of Theorem 8. We will show that \(\mathcal C_{\mathcal X}^*\) has strong stability (in the sense of [4]), which completes the proof by Theorem 8 in [4]. Let \(A \subset C_i^*\) and \(B \subseteq C_j^*\). Let \(p \in A\) and \(q \in C_i^* \setminus A\) be points which achieve the minimum distance between A and \(C_i^*\setminus A\). If \(c_i \in A\), then \(d(p, q) \le d(c_i, q) \le r\). If \(c_i \in C_i^* \setminus A\), then \(d(p, q) \le d(p, c_i) \le r\). Hence, \(d_{min} (A, C_i^*\setminus A) \le r\). Similarly, we get that \(d_{min}(A, B) > r\).   \(\blacksquare \)

The proofs of Theorems 14 and 12 are identical to the proofs of Theorems 6 and 4, respectively.

Proof of Theorem 10. Fix \(\mathcal S \subseteq \mathcal X\). Denote \(r_i := r(S_i^*)\). Let \(\mathcal C\) be the clustering output by the algorithm. Let \(\mathcal L = \{B_1, \ldots , B_l\}\) be the list of balls output by Phase 1 of Algorithm 3. Let G be the graph constructed in Phase 2 of the algorithm. Observe that \(B = B(s_i, r_i) = S_i^* \in \mathcal L\). WLOG, denote this ball by \(B^{(i)}\) and the corresponding vertex in the graph G by \(v^{(i)}\). We will prove the theorem by proving two key facts.

  • F.1 If \(B_{i1}\) and \(B_{i2}\) intersect \(S_i^*\), then the vertices \(v_{i1}\) and \(v_{i2}\) are connected in G.

  • F.2 If \(B_{i1}\) intersects \(S_i^*\) and \(B_{j1}\) intersects \(S_j^*\) then \(v_{i1}\) and \(v_{j1}\) are disconnected in G.

Claim 4

Let \(\mathcal L, G, B^{(i)}\) and \(v^{(i)}\) be as defined above. Let balls \(B_{i1}, B_{i2} \in \mathcal L\) be such that \(B_{i1} \cap S_i^* \ne \emptyset \) and \(B_{i2} \cap S_i^* \ne \emptyset \). Then there exists a path between \(v_{i1}\) and \(v_{i2}\).

Assume that \(v_{i1}\) and \(v^{(i)}\) are not connected by an edge. Hence, \(|B_{i1} \setminus B^{(i)}| \ge t/2\). Since \(\lambda > 4\), for all \(j \ne i\), \(B_{i1} \cap S_j^* = \emptyset \). Thus, \(B_{i1} \setminus B^{(i)} \subseteq \mathcal X \setminus \mathcal S\), which contradicts \(|B_{i1} \cap \{\mathcal X \setminus \mathcal S\}| < t/2\). Hence, every ball that intersects \(S_i^*\) is adjacent to \(v^{(i)}\); in particular, \(v_{i1}\) and \(v_{i2}\) are both adjacent to \(v^{(i)}\), which gives a path between them.   \(\blacksquare \)

Claim 5

Let the framework be as in Claim 4. Let \(B_{i1} \in \mathcal L\) be such that \(B_{i1} \cap S_i^* \ne \emptyset \) and \(B_{j1} \in \mathcal L\) be such that \(B_{j1} \cap S_j^* \ne \emptyset \) for some \(j \ne i\). Then \(v_{i1}\) and \(v_{j1}\) are disconnected in G.

Assume for contradiction that \(v_{i1}\) and \(v_{j1}\) are connected. Then there exist vertices \(v_{i}\) and \(v_{n}\) such that \(v_i\) and \(v_n\) are connected by an edge in G, \(B_i \cap S_i^* \ne \emptyset \) and \(B_n \cap S_n^* \ne \emptyset \) for some \(n \ne i\). Since \(v_i\) and \(v_n\) are adjacent, \(|B_i \cap B_n| \ge t/2\). Now, \(\lambda \ge 4\), thus \(B_i \cap \{\mathcal S \setminus S_i^*\} = \emptyset \) and \(B_n \cap \{\mathcal S\setminus S_n^*\} = \emptyset \). Thus, \(B_i \cap B_n \subseteq \mathcal X \setminus \mathcal S\), which contradicts the sparseness assumption.    \(\blacksquare \)
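Together, the two facts imply that the connected components of G correspond exactly to the good clusters. For concreteness, here is a minimal Python sketch of this Phase-2 step under stated assumptions: the adjacency rule (an edge between two balls that share at least \(t/2\) points) is inferred from how edges are used in Claims 4 and 5, and the function names are hypothetical, not the authors' pseudocode.

```python
# Hypothetical sketch of Phase 2 as analyzed in Claims 4 and 5: build a graph
# on the balls output by Phase 1, join two balls by an edge when they share at
# least t/2 points (the assumed adjacency rule), and output one cluster per
# connected component.

from itertools import combinations

def phase_two(balls, t):
    """balls: list of frozensets of point ids output by Phase 1."""
    n = len(balls)
    adj = {u: set() for u in range(n)}
    for u, v in combinations(range(n), 2):
        if len(balls[u] & balls[v]) >= t / 2:  # overlap threshold -> edge
            adj[u].add(v)
            adj[v].add(u)
    # Group vertices into connected components with a DFS.
    seen, components = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.add(u)
            stack.extend(adj[u] - seen)
        components.append(comp)
    # Each component yields one cluster: the union of its balls.
    return [frozenset().union(*(balls[u] for u in comp)) for comp in components]
```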

Theorem 16

(Vapnik and Chervonenkis [14]). Let X be a domain set and D a probability distribution over X. Let H be a class of subsets of X of finite VC-dimension d. Let \(\epsilon , \delta \in (0,1)\). Let \(S \subseteq X\) of size m be picked i.i.d. according to D. If \(m > \frac{c}{\epsilon ^2}(d\log \frac{d}{\epsilon }+\log \frac{1}{\delta })\) for a universal constant c, then with probability \(1-\delta \) over the choice of S, we have that for all \(h \in H\),

$$\bigg |\frac{|h\cap S|}{|S|} - P(h)\bigg | < \epsilon .$$
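As a quick numeric illustration of this sample-size bound, a short Python snippet follows; the universal constant c is left unspecified by the theorem, so it appears here as a parameter rather than a value taken from the paper.

```python
import math

def vc_sample_bound(d, eps, delta, c=1.0):
    """Smallest integer m with m > (c/eps^2) * (d*log(d/eps) + log(1/delta))."""
    # c is the unspecified universal constant from Theorem 16.
    return math.floor((c / eps ** 2) * (d * math.log(d / eps) + math.log(1 / delta))) + 1

# Example: VC-dimension d = 3, accuracy eps = 0.1, confidence 1 - delta = 0.95.
# With c = 1 this gives m = 1320 (an order-of-magnitude guide only).
print(vc_sample_bound(d=3, eps=0.1, delta=0.05))
```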


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Kushagra, S., Samadi, S., Ben-David, S. (2016). Finding Meaningful Cluster Structure Amidst Background Noise. In: Ortner, R., Simon, H., Zilles, S. (eds) Algorithmic Learning Theory. ALT 2016. Lecture Notes in Computer Science(), vol 9925. Springer, Cham. https://doi.org/10.1007/978-3-319-46379-7_23

  • DOI: https://doi.org/10.1007/978-3-319-46379-7_23

  • Print ISBN: 978-3-319-46378-0

  • Online ISBN: 978-3-319-46379-7