Subexponential algorithm for d-cluster edge deletion: Exception or rule?
Introduction
The correlation clustering problem involves identifying clusters of objects in a data set based on their similarity. A traditional way of posing this as a graph theoretic question involves associating vertices with data points and indicating similarity by adjacency. In this setting, the natural notion of a cluster corresponds to a clique, a set of mutually adjacent vertices. Thus, we call a graph G = (V, E) a cluster graph if every connected component of G is a complete graph. The task of identifying clusters can now be viewed as an optimization problem. In particular, a subset F ⊆ E is called a cluster edge deletion set if G − F is a cluster graph. On the other hand, if for some set F of pairs of vertices the graph (V, EΔF) is a cluster graph, then F is called a cluster editing set. (Here EΔF is the symmetric difference between E and F.) In the Cluster Edge Deletion (Cluster Editing) problem, we are given a graph G and a positive integer k. Our task is to check whether there exists a cluster edge deletion set (cluster editing set) F of size at most k.
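The definitions above can be made concrete with a small sketch. The function names and the representation of a graph as a vertex list plus a list of vertex pairs are assumptions made for illustration only; they do not come from the paper.

```python
from itertools import combinations


def connected_components(vertices, edges):
    """Return (list of components, adjacency map) of an undirected graph."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            x = stack.pop()
            component.add(x)
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        components.append(component)
    return components, adj


def is_cluster_graph(vertices, edges):
    """True if every connected component is a complete graph (a clique)."""
    components, adj = connected_components(vertices, edges)
    return all(v in adj[u]
               for component in components
               for u, v in combinations(component, 2))


def is_cluster_edge_deletion_set(vertices, edges, deleted):
    """True if `deleted` is a subset of the edges whose removal
    leaves a cluster graph. Edges are treated as unordered pairs."""
    present = {frozenset(e) for e in edges}
    removed = {frozenset(e) for e in deleted}
    if not removed <= present:
        return False
    remaining = [tuple(e) for e in present - removed]
    return is_cluster_graph(vertices, remaining)
```

For example, on the path a, b, c the single edge {b, c} is a cluster edge deletion set, since removing it leaves the clique {a, b} and the isolated vertex c.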
The complexity of Cluster Edge Deletion and Cluster Editing is well understood. The problems are NP-complete and admit constant-factor approximation algorithms [2], [3], [4]. On the other hand, they are also known to be APX-hard [5]. Further, it has been shown that Cluster Edge Deletion and Cluster Editing cannot be solved in time $2^{o(k)} n^{O(1)}$ unless the Exponential Time Hypothesis (ETH) fails [6], [7]. This led the authors of [7] to consider the question of editing at most k edges to obtain a graph with at most d clusters. This variant continues to be well motivated in several practical settings, where the number of clusters corresponds to an external constraint. With the restriction on the number of clusters in place, there is good news, as [7] describes an algorithm that solves the problem in time $O(2^{O(\sqrt{dk})} + m + n)$. The result in [7] also works for a weighted variant of the problem which includes d-Cluster Edge Deletion.
So far, we have considered the clustering problem in the graph theoretic context using cliques as a natural means for modeling the notion of a cluster. This effectively restricts us to a binary notion of similarity, in that a pair of data points are either similar or not, and we would like to maximize similarities within a cluster and minimize non-similarities across clusters. In many situations, however, this translation can be somewhat severe. A more flexible notion would be that of a structure where the vertices are mutually “not too far apart”, without necessarily being adjacent. Cliques are also a popular choice for modeling highly correlated or connected substructures in applications. Given that cliques impose a very strict connectivity requirement, this modeling suffers from being overly restrictive.
A natural generalization of the notion of cliques would be along the lines of small-diameter graphs. These structures are called clubs and have been proposed as a more reasonable measure of connectivity and correlation. An s-club is a graph of diameter at most s. Since the complete graphs are exactly the graphs of diameter at most one, cliques are exactly 1-clubs. We say that a graph is an s-club cluster if every connected component of it is an s-club. The s-club concept was defined in [8], [9], and it has recently been used in the analysis of social and biological networks [10]. In [11], [12], [13] parameterized studies of finding s-clubs were undertaken. It is worth mentioning that several other generalizations of cliques, such as s-cliques and s-plexes [14], and the related notion of clustering into these graphs have been studied in the literature before.
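As an illustration of the definition, the following sketch tests whether a graph is an s-club cluster by running a breadth-first search from every vertex and checking that no vertex in the same component is farther than s away. The function names and the input representation are assumptions made for this example.

```python
from collections import deque


def bfs_distances(adj, source):
    """Distances from `source` to every vertex in its connected component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def is_s_club_cluster(vertices, edges, s):
    """True if every connected component has diameter at most s."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # A BFS from v only reaches v's own component, so the largest distance
    # found is the eccentricity of v inside that component.
    return all(max(bfs_distances(adj, v).values()) <= s for v in vertices)
```

With s = 1 this coincides with the cluster graph test: a component has diameter at most one exactly when it is a clique.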
Immediate questions that arise in the context of clustering are s-Club Cluster Edge Deletion and s-Club Cluster Edge Editing. In both problems the input is a graph G and a positive integer k. In s-Club Cluster Edge Deletion, our task is to test whether there is a set of at most k edges whose removal leaves us with a graph whose components are s-clubs. In s-Club Cluster Edge Editing, our task is to test whether there is a set F of at most k pairs of vertices such that the graph (V(G), E(G)ΔF) is an s-club cluster. It is known that both problems are NP-complete even for s = 2 [15]. To explain the FPT algorithms for these problems, we define the notion of a restricted path in a graph G. A path P on r vertices in a graph G, with endpoints u and v, is called a restricted $P_r$ if P is a shortest u-v path in G. One can show that a graph G is an s-club cluster if and only if G does not contain any restricted $P_{s+2}$. Since at least one of the s + 1 edges of every restricted $P_{s+2}$ must be deleted, this observation leads to a simple branching algorithm of running time $(s+1)^k n^{O(1)}$ for s-Club Cluster Edge Deletion. For s = 2, a faster algorithm is known [15]. Unfortunately, the above characterization does not lead to an immediate FPT algorithm for s-Club Cluster Edge Editing. Figiel showed that 2-Club Cluster Edge Editing is W[2]-hard [16].
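The simple branching idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the faster algorithm of [15]: it looks for a shortest path with s + 1 edges (that is, on s + 2 vertices), and if one exists it tries deleting each of its edges in turn with budget k − 1. All function names are hypothetical.

```python
from collections import deque


def find_restricted_path(adj, s):
    """Return a shortest path on s + 2 vertices as a vertex list, or None."""
    for source in adj:
        parent = {source: None}
        dist = {source: 0}
        queue = deque([source])
        while queue:
            x = queue.popleft()
            if dist[x] == s + 1:
                # Walk the BFS tree back to the source to recover the path.
                path = []
                while x is not None:
                    path.append(x)
                    x = parent[x]
                return path
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    parent[y] = x
                    queue.append(y)
    return None


def _branch(adj, s, k):
    path = find_restricted_path(adj, s)
    if path is None:
        return []          # already an s-club cluster
    if k == 0:
        return None        # an obstruction remains but the budget is spent
    # Every solution must delete at least one of the s + 1 path edges.
    for u, v in zip(path, path[1:]):
        adj[u].remove(v)
        adj[v].remove(u)
        rest = _branch(adj, s, k - 1)
        adj[u].add(v)
        adj[v].add(u)
        if rest is not None:
            return [(u, v)] + rest
    return None


def s_club_cluster_edge_deletion(vertices, edges, s, k):
    """Return a list of at most k edges whose deletion yields an s-club
    cluster, or None if no such set exists."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return _branch(adj, s, k)
```

The recursion has depth at most k and branches into at most s + 1 subproblems at each level, which is where the $(s+1)^k$ factor in the running time comes from.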
For s-Club Cluster Edge Deletion, it is natural to ask if the problem admits a parameterized subexponential time algorithm. Our first result shows that assuming the ETH, the answer is negative.
Theorem 1 2-Club Cluster Edge Deletion cannot be solved in time , unless ETH fails, where n and m are the number of vertices and edges of the input graph, respectively.
In the setting of cliques, it was useful to consider the question with the additional dimension of the number of clusters: if we demand deletion into at most d clusters, then the problem admits a subexponential algorithm. It is therefore natural to consider the corresponding question in the s-club setting: can we identify at most k edges whose removal leaves us with at most d s-clubs? It turns out that even the slightest generalization of the cluster editing problem makes the problem significantly harder in the context of subexponential time algorithms. In particular, we show the following theorem.
Theorem 2 s-Club d-Cluster Edge Deletion for and cannot be solved in time , unless ETH fails.
Theorem 2 shows that the subexponential algorithm in the case of 1-Club d-Cluster Edge Deletion is rather an exception. All our results are obtained by reductions from 3-SAT. The Exponential Time Hypothesis (ETH) states that there is no algorithm that solves 3-SAT in time $2^{o(n)}$ [17], where n is the number of variables and m is the number of clauses in the input 3-SAT formula. Due to the sparsification lemma of Impagliazzo, Paturi and Zane [18], assuming ETH there is no algorithm for 3-SAT running in time $2^{o(n+m)}$. Our reductions produce instances where the size of the solution depends linearly on n + m. We refer to the book [19, Chapters 1 and 14] for a detailed discussion of parameterized algorithms and lower bounds under the assumption of ETH.
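The standard way such a reduction yields a lower bound can be written out as a short calculation. The constant c and the symbol N for the number of vertices of the constructed graph are assumptions introduced here for illustration; the sketch supposes a polynomial-time reduction from a 3-SAT formula with n variables and m clauses to an equivalent instance with solution size k.

```latex
\begin{align*}
  &\text{Assume } k \le c\,(n+m) \text{ and } N = (n+m)^{O(1)}. \\
  &\text{A hypothetical algorithm with running time } 2^{o(k)} \cdot N^{O(1)}
   \text{ would then decide 3-SAT in time} \\
  &\qquad (n+m)^{O(1)} + 2^{o(c\,(n+m))} \cdot (n+m)^{O(1)} \;=\; 2^{o(n+m)},
\end{align*}
```

which contradicts the consequence of ETH and the sparsification lemma that 3-SAT has no algorithm running in time $2^{o(n+m)}$.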
Another related study about the clustering problem was done by Drange et al. [20]. They studied Starforest Editing and Bicluster Editing, where the objective is to make the input graph a disjoint union of stars and bicliques, respectively, by doing at most k edge edit operations. They showed that these problems are NP-hard and cannot be solved in subexponential time unless ETH fails. However, upon bounding the number of stars or bicliques in the solution, they obtain subexponential time algorithms for these problems.
Organization of the paper. In Section 2 we establish the notation and state the problems formally. In Sections 3 and 4 we prove Theorem 1 and Theorem 2, respectively. The proof of Theorem 2 is split into three cases, namely s = 2, s = 3, and s ≥ 4.
Preliminaries
For any positive integer t, we use $[t]$ as a shorthand for the set $\{1, \ldots, t\}$. For a set X, we use $\binom{X}{2}$ to denote the set $\{\{x, y\} : x, y \in X,\ x \neq y\}$.
Graphs. For a finite set V, a pair $G = (V, E)$ such that $E \subseteq \binom{V}{2}$ is a graph on V. The elements of V are called vertices, while pairs of vertices $\{u, v\}$ such that $\{u, v\} \in E$ are called edges. For a graph G, we use $V(G)$ and $E(G)$ to denote the set of vertices and edges of G, respectively. In the following, let and be graphs, and some subset of vertices
2-Club Cluster Edge Deletion
In this section we show that 2-Club Cluster Edge Deletion cannot be solved in time unless ETH fails, where n is the number of vertices in the input graph. To show this result we give a reduction from 3-SAT to 2-Club Cluster Edge Deletion. More precisely, from an instance ϕ with m clauses and n variables, of 3-SAT, we construct an instance of 2-Club Cluster Edge Deletion with the property that ϕ is satisfiable if and only if is a Yes instance, where . First we
s-Club d-Cluster Edge Deletion
In this section, we show the hardness of s-Club d-Cluster Edge Deletion for all s ≥ 2. The results are divided into three parts. First, we demonstrate a reduction from 3-SAT to 2-Club 2-Cluster Edge Deletion. With minor modifications, we show that this reduction works for the problem of edge deletion into two 3-clubs. For s ≥ 4, we show a general reduction from 3-SAT to s-Club 2-Cluster Edge Deletion. The construction in the first reduction serves as a basis for the general reduction, but we note
Conclusions
In this work, we established that assuming the ETH, there is no algorithm solving s-Club Cluster Edge Deletion in time . We also showed that even the problem of deleting at most k edges to obtain an s-club cluster with at most d components cannot be solved in time for any . We would like to mention that our reduction will not rule out an algorithm of running time for s-Club d-Cluster Edge Deletion and this is an interesting open problem. We would also
Declaration of Competing Interest
The authors certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers' bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed
References
- et al. Cluster graph modification problems. Discrete Appl. Math. (2004)
- et al. Clustering with qualitative information. J. Comput. Syst. Sci. (2005)
- et al. Cluster editing with locally bounded modifications. Discrete Appl. Math. (2012)
- et al. Tight bounds for parameterized complexity of cluster editing with a small number of clusters. J. Comput. Syst. Sci. (2014)
- et al. On structural parameterizations for the 2-club problem. Discrete Appl. Math. (2015)
- et al. Which problems have strongly exponential complexity? J. Comput. Syst. Sci. (2001)
- et al. Subexponential algorithm for d-cluster edge deletion: exception or rule?
- et al. Aggregating inconsistent information: ranking and clustering. J. ACM (2008)
- et al. Correlation clustering. Mach. Learn. (2004)
- A graph-theoretic definition of a sociometric clique. J. Math. Sociol. (1973)
1 Present address: Department of Computer Science and Engineering, IIT Hyderabad, India.