Enumerating maximal bicliques in bipartite graphs with favorable degree sequences
Introduction
A bipartite graph with edge set E has edges only between two vertex sets X and Y, but not inside these sets. A bipartite graph is complete, or a biclique, if E consists of all possible edges between X and Y. A biclique within another bipartite graph is called a maximal biclique if it is not contained in a larger biclique. Enumeration of the maximal bicliques of a given bipartite graph has important applications in data analysis, which have been reported in many places. As the number β of maximal bicliques can be exponential in the graph size, one important type of enumeration algorithm is the output-sensitive algorithm, whose time bound is polynomial in the graph size and in the output size β. A stronger demand is that a new item, i.e., a maximal biclique, is always output after some polynomial delay (polynomial in the size of the graph only). However, in the present paper we are only concerned with the total running time. Unless stated otherwise, let n and m denote the number of vertices and edges, respectively, of the input graph, and let Δ be the maximum vertex degree.
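The definitions above can be checked directly on a tiny instance. The following is a minimal brute-force sketch, not the algorithm of this paper: it takes exponential time and only illustrates what "maximal biclique" means. The function names and the example graph are hypothetical, and we assume that both sides of a biclique are nonempty.

```python
from itertools import combinations


def nonempty_subsets(items):
    """Yield every nonempty subset of `items` as a frozenset."""
    items = sorted(items)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_biclique(edges, A, B):
    """True if every pair (x, y) with x in A and y in B is an edge."""
    return all((x, y) in edges for x in A for y in B)


def maximal_bicliques_bruteforce(X, Y, edges):
    """Return all maximal bicliques (A, B) with nonempty A in X, B in Y.

    Exponential in |X| + |Y|; meant for illustration on toy graphs only.
    """
    bicliques = [
        (A, B)
        for A in nonempty_subsets(X)
        for B in nonempty_subsets(Y)
        if is_biclique(edges, A, B)
    ]
    # Keep a biclique only if no different biclique contains it on both sides.
    return {
        (A, B)
        for (A, B) in bicliques
        if not any((A, B) != (C, D) and A <= C and B <= D for (C, D) in bicliques)
    }


# Hypothetical example: words a, b, c on side X and texts 1, 2, 3 on side Y.
X = {"a", "b", "c"}
Y = {1, 2, 3}
edges = {("a", 1), ("a", 2), ("b", 1), ("b", 2), ("b", 3), ("c", 3)}
result = maximal_bicliques_bruteforce(X, Y, edges)
```

In this example the biclique ({a}, {1, 2}) is not maximal, because it is contained in ({a, b}, {1, 2}).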
Let us review known total running times from the literature about the problem. For general (not only bipartite) graphs, several incomparable time bounds are derived in [1], in particular, or alternatively . The latter bound also appears in [2] and is later refined to in [3]. This bound, in turn, also comes out in [6] from a different angle. A weighted and thresholded version of the problem is considered in [9] where an time bound is reported. A unifying view is presented in [4], however no better time bounds in our direction follow there. Finally, one of the results in [8] is an time bound for the case of bipartite graphs. The extended abstract [5] deals with the bipartite case, too, but gives no explicit time bound.
In the present paper we take advantage of skewed degree distributions in the bipartite graph , resulting in time bounds that can beat the aforementioned under some circumstances. More technically, consider the following special case of a sorted degree sequence in one partite set X, which goes as , that is, the jth highest degree in X is about a fraction of the highest degree. Here s is any constant with . Then we achieve time, where . If, furthermore, , then this is faster than . (We remark that the algorithm in [8] also outputs a new biclique after a delay of time, whereas we do not aim for a polynomial delay, and apparently it would be hard to achieve combined with our result.)
This type of time bound is our main theoretical contribution. It seems relevant because just such non-uniform degree distributions appear in practice. We are mainly interested in applications where the vertices in Y represent many short texts, the vertices in X represent the words in these text snippets, and (x, y) is an edge if word x occurs (at least once) in text y. The texts can, e.g., be tweets, short news about events, comments in a forum, or reviews of hotels, restaurants, products, or artistic works. Combinations of words that occur frequently indicate topics and can serve as a basis for, e.g., clustering, opinion mining, or automatic summarization.
Now, the point is that rather few words appear very frequently (and these are not only stop words but also characteristic terms from the discussed domain, or evaluating phrases), whereas others are more occasional. There is empirical evidence [7] that word frequencies in random texts follow Zipf's law, i.e., they are proportional to with s close to 1. In text corpora focused on one theme one can expect more “hyper-Zipfian” distributions with , since now only a limited set of words is very frequent. Also is easily fulfilled, as Δ is high (frequent words in many texts), whereas the number k of different words comes with an exponent below 1. In a preprocessing phase we can even omit rare words that are of no interest, and thus reduce k right from the beginning.
We remark that the degree sequence does not have to obey exactly some power law, and the algorithm itself does not depend on that. We only discussed this function for its mathematical simplicity, in order to get some “crisp” specific worst-case bound. Rather, the more general, somewhat informal conclusion is that we can enumerate the maximal bicliques faster than what earlier time bounds indicate, whenever the degree sequence is “more skewed” than a Zipf's law sequence with .
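To make the shape of such a degree sequence concrete, the following sketch generates a sequence in which the jth highest degree is about Δ/j^s, and estimates the exponent of a given sequence by a least-squares fit on a log-log scale. The form Δ/j^s is our reading of the power-law sequence discussed above, and the function names are hypothetical. The fit is only a rough diagnostic for how skewed a sequence is; it is not part of the enumeration algorithm.

```python
import math


def power_law_degrees(max_degree, k, s):
    """Degrees of k vertices where the j-th highest is about max_degree / j**s."""
    return [max(1, round(max_degree / j**s)) for j in range(1, k + 1)]


def fit_exponent(degrees):
    """Estimate s from positive degrees via a log-log least-squares line.

    Sorts the degrees in descending order, regresses log(degree) on
    log(rank), and returns the negated slope. Needs at least two degrees.
    """
    ordered = sorted(degrees, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(d) for d in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Hypothetical usage: 50 words, most frequent word occurring in 1000 texts.
degrees = power_law_degrees(1000, 50, 1.5)
estimated_s = fit_exponent(degrees)
```

Because the generated degrees are rounded to integers, the estimate for such a sequence is close to, but not exactly, the exponent used to generate it.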
Our algorithm, while taking the degree sequence into account, still follows natural ideas and should also be easy to implement. Despite earlier works we present the algorithm from scratch because, of course, the details are important for the analysis. We also add some more tricks that do not further help the worst-case bound but are beneficial for certain instances.
Prefix maxima in a sequence of sets
The enumeration algorithm in Section 3 will have to deal with a certain sequence of sets of vertices and, very roughly speaking, recognize which of them are already subsets of other sets early in the sequence. (See Definition 1 below for the precise statement.) This will be needed to avoid returning non-maximal bicliques. In this section we solve this task separately by a routine called PrefMax, such that we can later use this routine and focus on the main algorithm.
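As a point of reference, the task can be sketched with a naive quadratic baseline. Since Definition 1 is not reproduced here, the sketch assumes the simplest reading of the rough description above: a set is discarded if it is a subset of (or equal to) some set that occurs earlier in the sequence. This is not the PrefMax routine itself; the function name is hypothetical.

```python
def naive_prefix_maxima(sets):
    """Return the indices of sets not contained in any earlier set.

    Assumed reading: sets[i] is kept iff there is no j < i with
    sets[i] <= sets[j]. Uses up to len(sets)**2 / 2 subset tests.
    """
    kept = []
    for i, current in enumerate(sets):
        if not any(current <= sets[j] for j in range(i)):
            kept.append(i)
    return kept


# Hypothetical sequence of vertex sets.
sequence = [{1, 2}, {1}, {2, 3}, {1, 2}, {1, 2, 3}]
kept_indices = naive_prefix_maxima(sequence)
```

Here the sets at positions 1 and 3 are dropped because both are contained in the set at position 0, whereas the last set is kept even though it contains all earlier ones.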
As a notational remark, ⊂
The maximal bicliques generated through a hull operator
Let G = (X, Y, E) be our given bipartite graph. As usual, let N(v) denote the set of all neighbors of a vertex v.
Definition 3 For A ⊆ X we define N(A) to be the set of all vertices in Y which are adjacent to all vertices of A. Equivalently, N(A) = ⋂_{a ∈ A} N(a). We define N(B) similarly for B ⊆ Y, and we let ϕ(A) = N(N(A)).
The following lemmas are straightforward to prove.
Lemma 4 ϕ is a hull operator, that is, ϕ is extensive, increasing, and idempotent. In detail: A ⊆ ϕ(A), A ⊆ A′ implies ϕ(A) ⊆ ϕ(A′), and ϕ(ϕ(A)) = ϕ(A).
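The hull operator is easy to compute with set intersections. The following minimal sketch writes N(A) for the common neighborhood of a vertex set A (an assumed notation) and obtains ϕ(A) by applying the common neighborhood twice. The function names and the example graph are hypothetical, and we assume the convention that the common neighborhood of the empty set is the whole opposite side.

```python
def common_neighbors(adjacency, vertices, other_side):
    """Vertices of `other_side` adjacent to every vertex in `vertices`.

    `adjacency` maps each vertex to the set of its neighbors.
    For an empty `vertices` the result is all of `other_side`.
    """
    result = set(other_side)
    for v in vertices:
        result &= adjacency[v]
    return result


def hull(adj_x, adj_y, X, Y, A):
    """phi(A): all vertices of X adjacent to every common neighbor of A."""
    neighbors_of_A = common_neighbors(adj_x, A, Y)
    return common_neighbors(adj_y, neighbors_of_A, X)


# Hypothetical example graph, given by adjacency sets for both sides.
X = {"a", "b", "c"}
Y = {1, 2, 3}
adj_x = {"a": {1, 2}, "b": {1, 2, 3}, "c": {3}}
adj_y = {1: {"a", "b"}, 2: {"a", "b"}, 3: {"b", "c"}}

hull_of_a = hull(adj_x, adj_y, X, Y, {"a"})
```

In this example ϕ({a}) = {a, b}, since the common neighbors 1 and 2 of a are also both adjacent to b. Applying ϕ again to {a, b} returns the same set, in line with idempotence.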
Lemma 5 Consider any
Fine-tuning and analysis
We presume a uniform-cost model where dictionary operations need time. (We deal with subsets of vertices that can be stored as a table of size n, or as a hashtable, along with information about the cardinality.) In particular, a test whether can be done in time, and the intersection of two sets can be computed in time. (In a logarithmic cost model our analysis works similarly, just with logarithmic factors attached.) Since we aim at output-sensitive time
Acknowledgement
This work has been supported by the Swedish Foundation for Strategic Research (SSF) through Grant IIS11-0089 for a data mining project entitled “Data-driven secure business intelligence”.
References (9)
- et al. Consensus algorithms for the generation of all maximal bicliques. Discrete Appl. Math. (2004)
- et al. Generating bicliques of a graph in lexicographic order. Theor. Comput. Sci. (2005)
- et al. On the generation of bicliques of a graph. Discrete Appl. Math. (2007)
- et al. Enumeration aspects of maximal cliques and bicliques. Discrete Appl. Math. (2009)