Subexponential algorithm for d-cluster edge deletion: Exception or rule?

https://doi.org/10.1016/j.jcss.2020.05.008

Abstract

We study the problem of finding a set of at most $k$ edges whose removal turns the input $n$-vertex graph into a disjoint union of $s$-clubs (graphs of diameter at most $s$). Komusiewicz and Uhlmann [DAM 2012] showed that Cluster Edge Deletion (the case of 1-clubs, that is, cliques) cannot be solved in time $2^{o(k)} n^{O(1)}$ unless the Exponential Time Hypothesis (ETH) fails. However, Fomin et al. [JCSS 2014] showed that if the number of cliques in the output graph is restricted to $d$, then the problem ($d$-Cluster Edge Deletion) can be solved in time $O(2^{O(\sqrt{dk})} + m + n)$. We show that, assuming ETH, there is no algorithm solving 2-Club Cluster Edge Deletion in time $2^{o(k)} n^{O(1)}$. Further, we show that the same lower bound holds for $s$-Club $d$-Cluster Edge Deletion for any $s \ge 2$ and $d \ge 2$.

Introduction

The correlation clustering problem involves identifying clusters of objects in a data set based on their similarity. A traditional way of posing this as a graph theoretic question involves associating vertices with data points and indicating similarity by adjacency. In this setting, the natural notion of a cluster corresponds to a clique, a set of mutually adjacent vertices. Thus, we call a graph $G = (V, E)$ a cluster graph if every connected component of $G$ is a complete graph. The task of identifying clusters can now be viewed as an optimization problem. In particular, a subset $F \subseteq E$ is called a cluster edge deletion set if $G - F = (V, E \setminus F)$ is a cluster graph. On the other hand, if for some $F \subseteq \binom{V}{2}$, $G \triangle F = (V, E \triangle F)$ is a cluster graph, then $F$ is called a cluster editing set. (Here $E \triangle F$ is the symmetric difference between $E$ and $F$.) In the Cluster Edge Deletion (Cluster Editing) problem, we are given a graph $G$ and a positive integer $k$. Our task is to check whether there exists a cluster edge deletion set (cluster editing set) $F$ of size at most $k$.
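The definitions above can be checked mechanically on small graphs. The following is a minimal sketch, not taken from the paper: the function names `components_and_adjacency`, `is_cluster_graph` and `is_cluster_edge_deletion_set` are hypothetical, and the graph is assumed to be simple and given as a vertex list plus a list of 2-element edges.

```python
def components_and_adjacency(vertices, edges):
    """Return (list of connected components as sets, adjacency dict)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in vertices:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:  # depth-first search from `start`
            x = stack.pop()
            for y in adj[x] - comp:
                comp.add(y)
                stack.append(y)
        seen |= comp
        comps.append(comp)
    return comps, adj


def is_cluster_graph(vertices, edges):
    """True if every connected component is a complete graph."""
    comps, adj = components_and_adjacency(vertices, edges)
    # In a clique, each vertex is adjacent to all other vertices of its component.
    return all(adj[v] >= comp - {v} for comp in comps for v in comp)


def is_cluster_edge_deletion_set(vertices, edges, F):
    """True if F is a subset of E and G - F = (V, E \\ F) is a cluster graph."""
    E = {frozenset(e) for e in edges}
    F = {frozenset(e) for e in F}
    return F <= E and is_cluster_graph(vertices, [tuple(e) for e in E - F])
```

For the path on vertices 1, 2, 3, the single edge {2, 3} is a cluster edge deletion set, while the non-edge {1, 3} is not, since a deletion set must be a subset of $E$.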

The complexity of Cluster Edge Deletion and Cluster Editing is well understood. The problems are NP-complete and admit constant-factor approximation algorithms [2], [3], [4]. On the other hand, they are also known to be APX-hard [5]. Further, it has been shown that Cluster Edge Deletion and Cluster Editing cannot be solved in time $2^{o(k)} n^{O(1)}$ unless the Exponential Time Hypothesis (ETH) fails [6], [7]. This led the authors of [7] to consider the question of editing at most $k$ edges to obtain a graph with at most $d$ clusters. This variant continues to be well motivated in several practical settings, where the number of clusters corresponds to an external constraint. With the restriction on the number of clusters in place, there is good news, as [7] describes an algorithm that solves the problem in time $O(2^{O(\sqrt{dk})} + m + n)$. The result in [7] also works for a weighted variant of the problem which includes $d$-Cluster Edge Deletion.

So far, we have considered the clustering problem in the graph theoretic context using cliques as a natural means for modeling the notion of a cluster. This effectively restricts us to a binary notion of similarity, in that a pair of data points is either similar or not, and we would like to maximize similarities within a cluster and minimize non-similarities across clusters. In many situations, however, this translation can be somewhat severe. A more flexible notion would be that of a structure where the vertices are mutually “not too far apart”, without necessarily being adjacent. Cliques are also a popular choice for modeling highly correlated or connected substructures in applications. Because cliques impose a very strict connectivity requirement, this modeling is often overly restrictive.

A natural generalization of the notion of cliques would be along the lines of small-diameter graphs. These structures are called clubs and have been proposed as a more reasonable measure of connectivity and correlation. An $s$-club is a graph of diameter at most $s$; since complete graphs are exactly the graphs of diameter one, cliques are exactly 1-clubs. We say that a graph is an $s$-club cluster if every connected component of it is an $s$-club. The $s$-club concept was defined in [8], [9], and it has recently been used in the analysis of social and biological networks [10]. Parameterized studies of finding $s$-clubs were undertaken in [11], [12], [13]. It is worth mentioning that several other generalizations of cliques, such as $s$-cliques and $s$-plexes [14], and the related notion of clustering into these graphs have been studied in the literature before.
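As an illustration of the definition, a graph can be tested for being an $s$-club cluster by running a breadth-first search from every vertex and checking that no reachable vertex is farther than $s$ away. This is a minimal sketch under that reading of the definition; the names `bfs_distances` and `is_s_club_cluster` are hypothetical and do not come from the paper.

```python
from collections import deque


def bfs_distances(adj, source):
    """Distances from `source` to every vertex in its connected component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def is_s_club_cluster(vertices, edges, s):
    """True if every connected component has diameter at most s."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # BFS from v reaches exactly the component of v, so the largest distance
    # found is the eccentricity of v inside its own component.
    return all(max(bfs_distances(adj, v).values()) <= s for v in vertices)
```

With $s = 1$ this coincides with the cluster graph test, matching the remark that cliques are exactly 1-clubs.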

Immediate questions that arise in the context of clustering are $s$-Club Cluster Edge Deletion and $s$-Club Cluster Edge Editing. In both problems, the input is a graph $G$ and a positive integer $k$. In $s$-Club Cluster Edge Deletion, our task is to test whether there is a set of at most $k$ edges whose removal leaves us with a graph whose components are $s$-clubs. In $s$-Club Cluster Edge Editing, our task is to test whether there is a subset $F \subseteq \binom{V(G)}{2}$ of size at most $k$ such that the graph $(V(G), E(G) \triangle F)$ is an $s$-club cluster. It is known that both problems are NP-complete even for $s = 2$ [15]. To explain the FPT algorithms for these problems, we define the notion of a restricted $P_r$ in a graph $G$. A path $P = v_1 \ldots v_r$ on $r$ vertices in a graph $G$ is called a restricted $P_r$ if $P$ is a shortest $v_1$-$v_r$ path in $G$. One can show that a graph $G$ is an $s$-club cluster if and only if $G$ does not contain any restricted $P_{s+2}$. This observation leads to a simple branching algorithm with running time $(s+1)^k n^{O(s)}$ for $s$-Club Cluster Edge Deletion. For $s = 2$, there is a faster algorithm that solves the problem in time $2.74^k n^{O(1)}$ [15]. Unfortunately, the above characterization does not lead to an immediate FPT algorithm for $s$-Club Cluster Edge Editing. Figiel showed that 2-Club Cluster Edge Editing is W[2]-hard [16].
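The simple branching idea can be sketched as follows. This is an illustrative reading of the characterization above, not the authors' implementation: the names `find_restricted_path` and `s_club_cluster_edge_deletion` are hypothetical. The sketch relies on the fact that any solution must delete at least one of the $s + 1$ edges of a restricted $P_{s+2}$, since otherwise its endpoints would stay in one component at distance greater than $s$.

```python
from collections import deque


def find_restricted_path(adj, s):
    """Return a shortest path on s + 2 vertices (a restricted P_{s+2}), or None."""
    for source in adj:
        parent, dist = {source: None}, {source: 0}
        queue = deque([source])
        while queue:
            x = queue.popleft()
            if dist[x] == s + 1:
                # Walk back along BFS parents; the BFS tree path is a shortest path.
                path = []
                while x is not None:
                    path.append(x)
                    x = parent[x]
                return path[::-1]
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    parent[y] = x
                    queue.append(y)
    return None


def s_club_cluster_edge_deletion(vertices, edges, s, k):
    """Return a deletion set of size at most k, or None if no such set exists."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    def solve(budget):
        path = find_restricted_path(adj, s)
        if path is None:
            return []          # already an s-club cluster
        if budget == 0:
            return None        # an obstruction remains but no deletions are left
        for u, v in zip(path, path[1:]):   # branch on the s + 1 path edges
            adj[u].discard(v)
            adj[v].discard(u)
            rest = solve(budget - 1)
            adj[u].add(v)                  # undo the deletion before the next branch
            adj[v].add(u)
            if rest is not None:
                return [(u, v)] + rest
        return None

    return solve(k)
```

The search tree has depth at most $k$ and at most $s + 1$ children per node, which is where the $(s+1)^k$ factor in the stated running time comes from.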

For s-Club Cluster Edge Deletion, it is natural to ask if the problem admits a parameterized subexponential time algorithm. Our first result shows that assuming the ETH, the answer is negative.

Theorem 1

2-Club Cluster Edge Deletion cannot be solved in time $2^{o(n+m+k)} n^{O(1)}$, unless ETH fails, where $n$ and $m$ are the number of vertices and edges of the input graph, respectively.

In the setting of cliques, it was useful to consider the question with the additional dimension of the number of clusters: if we demanded deletion into at most d clusters, then the problem turned out to admit a sub-exponential algorithm. It is therefore natural to consider the corresponding question in the s-club setting: can we identify at most k edges whose removal leaves us with at most d s-clubs? It turns out that the slightest generalization of the cluster editing problem makes the problem significantly harder in the context of subexponential time algorithms. In particular, we show the following theorem.

Theorem 2

$s$-Club $d$-Cluster Edge Deletion for $s \ge 2$ and $d \ge 2$ cannot be solved in time $2^{o(k)} n^{O(1)}$, unless ETH fails.

Theorem 2 shows that the sub-exponential algorithm in the case of 1-Club $d$-Cluster Edge Deletion is rather an exception. All our results are obtained by reductions from 3-SAT. The Exponential Time Hypothesis (ETH) states that there is no algorithm that solves 3-SAT in time $2^{o(n)} (m+n)^{O(1)}$ [17], where $n$ is the number of variables and $m$ is the number of clauses in the input 3-SAT formula. Due to the sparsification lemma of Impagliazzo, Paturi and Zane [18], assuming ETH there is no algorithm for 3-SAT running in time $2^{o(m+n)}$. Our reductions produce instances where the size of the solution depends linearly on $(m+n)$. We refer to the book [19, Chapters 1 and 14] for a detailed discussion of parameterized algorithms and lower bounds under the assumption of ETH.
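The way a linear bound on the solution size transfers the lower bound can be spelled out as a short calculation. This is a sketch of the standard argument rather than a quotation from the paper; the constant $c$ and the polynomial bound on $|V(G)|$ are assumptions, standing for a polynomial-time reduction that maps a formula with $n$ variables and $m$ clauses to an instance $(G, k)$. Here $|V(G)|$ denotes the number of vertices of the constructed graph, to avoid a clash with the number of variables $n$.

```latex
\begin{align*}
k &\le c\,(m+n), \qquad |V(G)| \le (m+n)^{O(1)}, \\
2^{o(k)}\,|V(G)|^{O(1)}
  &\le 2^{o(c\,(m+n))}\,(m+n)^{O(1)}
   = 2^{o(m+n)}.
\end{align*}
```

So an algorithm with running time $2^{o(k)}$ times a polynomial in the graph size, composed with such a reduction, would decide 3-SAT in time $2^{o(m+n)}$, which contradicts ETH combined with the sparsification lemma.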

Another related study about the clustering problem was done by Drange et al. [20]. They studied Starforest Editing and Bicluster Editing, where the objective is to make the input graph a disjoint union of stars and bicliques, respectively, by doing at most k edge edit operations. They showed that these problems are NP-hard and cannot be solved in subexponential time unless ETH fails. However, upon bounding the number of stars or bicliques in the solution, they obtain subexponential time algorithms for these problems.

Organization of the paper. In Section 2 we establish the notation and state the problems formally. In Sections 3 and 4 we prove Theorem 1 and Theorem 2, respectively. The proof of Theorem 2 is split into three cases, namely $s = 2$, $s = 3$, and $s \ge 4$.

Preliminaries

For any positive integer $t$, we use $[t]$ as a shorthand for the set $\{1, 2, \ldots, t\}$. For a set $X$, we use $\binom{X}{2}$ to denote the set $\{Y : Y \subseteq X, |Y| = 2\}$.

Graphs. For a finite set $V$, a pair $G = (V, E)$ such that $E \subseteq \binom{V}{2}$ is a graph on $V$. The elements of $V$ are called vertices, while pairs of vertices $\{u, v\}$ such that $\{u, v\} \in E$ are called edges. For a graph $G$, we use $V(G)$ and $E(G)$ to denote the set of vertices and edges of $G$, respectively. In the following, let $G = (V, E)$ and $G' = (V', E')$ be graphs, and $U \subseteq V$ some subset of vertices

2-Club Cluster Edge Deletion

In this section we show that 2-Club Cluster Edge Deletion cannot be solved in time $2^{o(k)} n^{O(1)}$ unless ETH fails, where $n$ is the number of vertices in the input graph. To show this result we give a reduction from 3-SAT to 2-Club Cluster Edge Deletion. More precisely, from an instance $\phi$ of 3-SAT with $m$ clauses and $n$ variables, we construct an instance $(G, k)$ of 2-Club Cluster Edge Deletion with the property that $\phi$ is satisfiable if and only if $(G, k)$ is a Yes instance, where $k = O(m+n)$. First we

s-Club d-Cluster Edge Deletion

In this section, we show the hardness of $s$-Club $d$-Cluster Edge Deletion for all $s \ge 2$. The results are divided into three parts. First, we demonstrate a reduction from 3-SAT to 2-Club 2-Cluster Edge Deletion. With minor modifications, we show that this reduction works for the problem of edge deletion into two 3-clubs. For $s \ge 4$, we show a general reduction from 3-SAT to $s$-Club 2-Cluster Edge Deletion. The construction in the first reduction serves as a basis for the general reduction, but we note

Conclusions

In this work, we established that assuming the ETH, there is no algorithm solving $s$-Club Cluster Edge Deletion in time $2^{o(n+m+k)} n^{O(1)}$. We also showed that even the problem of deleting at most $k$ edges to obtain an $s$-club cluster with at most $d$ components cannot be solved in time $2^{o(k)} n^{O(1)}$ for any $s \ge 2$. We would like to mention that our reduction will not rule out an algorithm of running time $2^{o(n+m)} n^{O(1)}$ for $s$-Club $d$-Cluster Edge Deletion and this is an interesting open problem. We would also

Declaration of Competing Interest

The authors certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers' bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed

A preliminary version of this paper appeared in the proceedings of MFCS 2013 [1].

1

Present address: Department of Computer Science and Engineering, IIT Hyderabad, India.
