Enumerating maximal bicliques in bipartite graphs with favorable degree sequences

https://doi.org/10.1016/j.ipl.2014.02.001

Highlights

  • We propose a biclique enumeration algorithm for bipartite graphs.

  • The algorithm is tailored to very non-uniform degree distributions.

  • This case is motivated by automatic multiple-document summarization.

Abstract

We propose an output-sensitive algorithm for the enumeration of all maximal bicliques in a bipartite graph, tailored to the case when the degree distribution in one partite set is very skewed. We accomplish a worst-case bound better than previously known general bounds if, e.g., the degree sequence follows a power law.

Introduction

A bipartite graph H=(X,Y,E) with edge set E has its edges only between the two vertex sets X and Y, but not inside these sets. A bipartite graph is complete, or a biclique, if E consists of all |X||Y| possible edges. A biclique within another bipartite graph is called a maximal biclique if it is not contained in a larger biclique. Enumeration of the maximal bicliques of a given bipartite graph has important applications in data analysis, which have been reported in many places. As the number β of maximal bicliques can be exponential in the graph size, one important type of enumeration algorithm is the output-sensitive algorithm, whose time bound is polynomial in the graph size and in the output size β. A stronger demand is that a new item, i.e., a maximal biclique, is always output after some polynomial delay (polynomial in the size of the graph only). However, in the present paper we are only concerned with the total running time. Unless stated otherwise, let n and m denote the number of vertices and edges, respectively, of the input graph, and let Δ be the maximum vertex degree.
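The definitions above can be made concrete with a small check. The following is a minimal sketch, not part of the paper: the function names, the adjacency-dictionary representation, and the toy graph are illustrative assumptions, and we additionally assume that both sides of a biclique must be non-empty. It relies on the observation that a biclique is contained in a larger one exactly when some single vertex can be added to it.

```python
from typing import Dict, Hashable, Set

Adjacency = Dict[Hashable, Set[Hashable]]  # maps every x in X to its neighbors in Y


def is_biclique(adj: Adjacency, A: Set[Hashable], B: Set[Hashable]) -> bool:
    """True if every vertex of A (in X) is adjacent to every vertex of B (in Y)."""
    return all(B <= adj[x] for x in A)


def is_maximal_biclique(adj: Adjacency, Y: Set[Hashable],
                        A: Set[Hashable], B: Set[Hashable]) -> bool:
    """True if (A, B) is a biclique that no single vertex can extend.

    Assumption (not from the paper): both sides must be non-empty.
    """
    if not A or not B or not is_biclique(adj, A, B):
        return False
    # A vertex x outside A extends the biclique if it is adjacent to all of B.
    if any(x not in A and B <= adj[x] for x in adj):
        return False
    # A vertex y outside B extends the biclique if all of A is adjacent to it.
    if any(y not in B and all(y in adj[x] for x in A) for y in Y):
        return False
    return True


# Hypothetical toy graph: X = {a, b, c}, Y = {1, 2, 3}.
adj = {"a": {1, 2}, "b": {1, 2, 3}, "c": {3}}
Y = {1, 2, 3}
```

In this toy graph, ({a}, {1, 2}) is a biclique but not a maximal one, because b is adjacent to both 1 and 2; ({a, b}, {1, 2}) cannot be extended further.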

Let us review known total running times from the literature about the problem. For general (not only bipartite) graphs, several incomparable time bounds are derived in [1], in particular, O(n^2 β^2) or alternatively O(n^3 β). The latter bound also appears in [2] and is later refined to O(nmβ) in [3]. This bound, in turn, also comes out in [6] from a different angle. A weighted and thresholded version of the problem is considered in [9] where an O(n^2 β) time bound is reported. A unifying view is presented in [4]; however, no better time bounds in our direction follow there. Finally, one of the results in [8] is an O(Δ^2 β) time bound for the case of bipartite graphs. The extended abstract [5] deals with the bipartite case, too, but gives no explicit time bound.

In the present paper we take advantage of skewed degree distributions in the bipartite graph H=(X,Y,E), resulting in time bounds that can beat the aforementioned O(Δ^2 β) under some circumstances. More technically, consider the following special case of a sorted degree sequence in one partite set X, which goes as 1/j^s, that is, the jth highest degree in X is about a 1/j^s fraction of the highest degree. Here s is any constant with 1 ≤ s < 2. Then we achieve O(Δ k^{2−s} β) time, where k=|X|. If, furthermore, k^{2−s} < Δ, then this is faster than O(Δ^2 β). (We remark that the algorithm in [8] also outputs a new biclique after a delay of O(Δ^2) time, whereas we do not aim for a polynomial delay, and apparently it would be hard to achieve combined with our result.)
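To get a feeling for when the new bound wins, one can compare the two expressions numerically. The sketch below is only an illustration under stated assumptions: it reads the bound as Δ·k^(2−s)·β with 1 ≤ s < 2, ignores the constants hidden in the O-notation, and uses hypothetical function names and parameter values that do not come from the paper.

```python
from typing import List


def power_law_degrees(delta: float, k: int, s: float) -> List[float]:
    """Idealized sorted degree sequence: the j-th highest degree is delta / j**s."""
    return [delta / j ** s for j in range(1, k + 1)]


def bound_ratio(delta: float, k: int, s: float) -> float:
    """Ratio (delta * k**(2 - s) * beta) / (delta**2 * beta) = k**(2 - s) / delta.

    A value below 1 means the skew-aware expression is the smaller one;
    beta cancels, and hidden constants are ignored (assumption).
    """
    if not 1 <= s < 2:
        raise ValueError("the text assumes 1 <= s < 2")
    return k ** (2 - s) / delta


# Hypothetical instance: 10,000 distinct words, most frequent word in 10,000 texts.
ratio = bound_ratio(delta=10_000, k=10_000, s=1.5)
```

For these illustrative values, k^(2−s) = 100, so the ratio is 0.01.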

This type of time bound is our main theoretical contribution. It seems relevant because just such non-uniform degree distributions appear in practice. We are mainly interested in applications where the vertices in Y represent many short texts, the vertices in X represent the words in these text snippets, and xy ∈ E is an edge if word x occurs (at least once) in text y. The texts can, e.g., be tweets, short news about events, comments in a forum, or reviews of hotels, restaurants, products, or artistic works. Combinations of words that occur frequently indicate topics and can serve as a basis for, e.g., clustering, opinion mining, or automatic summarization.
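The word/text graph described above is easy to build. The following is a minimal sketch in which the tokenization (lower-casing and splitting on whitespace), the function name, and the sample texts are all assumptions made for illustration; a real pipeline would likely use a proper tokenizer and drop rare words.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_word_text_graph(texts: List[str]) -> Dict[str, Set[int]]:
    """Return adjacency of the X side: word -> set of indices of texts containing it.

    Repeated occurrences of a word within one text yield a single edge.
    """
    adj: Dict[str, Set[int]] = defaultdict(set)
    for y, text in enumerate(texts):
        # Assumption: naive whitespace tokenization, case-insensitive.
        for word in set(text.lower().split()):
            adj[word].add(y)
    return dict(adj)


# Hypothetical review snippets.
texts = ["great hotel great pool", "great food", "hotel pool"]
graph = build_word_text_graph(texts)

# Degrees of the word vertices, sorted from highest to lowest.
sorted_degrees = sorted((len(ys) for ys in graph.values()), reverse=True)
```

The sorted degree list is the sequence whose skew the time bound depends on.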

Now, the point is that rather few words appear very frequently (and these are not only stop words but also characteristic terms from the discussed domain, or evaluating phrases), whereas others are more occasional. There is empirical evidence [7] that word frequencies in random texts follow Zipf's law, i.e., they are proportional to 1/j^s with s close to 1. In text corpora focused on one theme one can expect more “hyper-Zipfian” distributions with s>1, since now only a limited set of words is very frequent. Also k^{2−s} < Δ is easily fulfilled, as Δ is high (frequent words in many texts), whereas the number k of different words comes with an exponent below 1. In a preprocessing phase we can even omit rare words that are of no interest, and thus reduce k right from the beginning.

We remark that the degree sequence does not have to obey exactly some power law, and the algorithm itself does not depend on that. We only discussed this function for its mathematical simplicity, in order to get some “crisp” specific worst-case bound. Rather, the more general, somewhat informal conclusion is that we can enumerate the maximal bicliques faster than what earlier time bounds indicate, whenever the degree sequence is “more skewed” than a Zipf's law sequence with s=1.

Our algorithm, while taking the degree sequence into account, still follows natural ideas and should also be easy to implement. Despite earlier works we present the algorithm from scratch because, of course, the details are important for the analysis. We also add some more tricks that do not further help the worst-case bound but are beneficial for certain instances.

Section snippets

Prefix maxima in a sequence of sets

The enumeration algorithm in Section 3 will have to deal with a certain sequence of sets of vertices and, very roughly speaking, recognize which of them are already subsets of other sets early in the sequence. (See Definition 1 below for the precise statement.) This will be needed to avoid returning non-maximal bicliques. In this section we solve this task separately by a routine called PrefMax, such that we can later use this routine and focus on the main algorithm.
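As a rough illustration of the task only, the following naive routine marks every set in a sequence that is a subset of some set appearing earlier. This is a quadratic-time sketch of the informal description above; it is not the PrefMax routine, whose precise specification (Definition 1) is not part of this excerpt, and the function name is hypothetical.

```python
from typing import List, Sequence, Set


def subset_of_earlier(sets: Sequence[Set[int]]) -> List[bool]:
    """flags[i] is True if sets[i] is a subset of some sets[j] with j < i.

    Naive illustration: compares each set against all earlier ones.
    """
    flags = []
    for i, current in enumerate(sets):
        flags.append(any(current <= sets[j] for j in range(i)))
    return flags


# Hypothetical input sequence.
sequence = [{1, 2, 3}, {2, 3}, {4}, {1, 2, 3, 4}, {4}]
flags = subset_of_earlier(sequence)
```

Here the second set is contained in the first, and the last set repeats the third, so exactly those two are flagged.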

As a notational remark, ⊂

The maximal bicliques generated through a hull operator

Let H=(X,Y,E) be our given bipartite graph. As usual, let N(v) denote the set of all neighbors of a vertex v.

Definition 3

For A ⊆ X we define σ(A) to be the set of all vertices in Y which are adjacent to all vertices of A. Equivalently, σ(A) = ⋂_{v∈A} N(v). We define σ(B) similarly for B ⊆ Y, and we let ϕ(A) := σ(σ(A)).
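Definition 3 translates almost literally into set operations. The sketch below is illustrative: the function names and the toy graph are assumptions, vertex names in X and Y are assumed to be distinct, and σ of the empty set is taken to be the whole opposite side (the usual convention for an empty intersection, which the excerpt does not state explicitly).

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Vertex = Hashable


def build_adjacency(edges: Iterable[Tuple[Vertex, Vertex]]) -> Dict[Vertex, Set[Vertex]]:
    """Symmetric neighbor sets N(v) from a list of edges (x, y)."""
    adj: Dict[Vertex, Set[Vertex]] = {}
    for x, y in edges:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj


def sigma(adj: Dict[Vertex, Set[Vertex]], S: Set[Vertex], opposite: Set[Vertex]) -> Set[Vertex]:
    """Vertices of the opposite side adjacent to all of S: the intersection of N(v), v in S.

    Assumed convention: sigma(empty set) is the entire opposite side.
    """
    common = set(opposite)
    for v in S:
        common &= adj.get(v, set())
    return common


def phi(adj: Dict[Vertex, Set[Vertex]], A: Set[Vertex], X: Set[Vertex], Y: Set[Vertex]) -> Set[Vertex]:
    """phi(A) = sigma(sigma(A)) for a subset A of X."""
    return sigma(adj, sigma(adj, A, Y), X)


# Hypothetical toy graph.
X, Y = {"a", "b", "c"}, {1, 2, 3}
adj = build_adjacency([("a", 1), ("a", 2), ("b", 1), ("b", 2), ("b", 3), ("c", 3)])
```

In this graph σ({a}) = {1, 2} and σ({1, 2}) = {a, b}, hence ϕ({a}) = {a, b}: every vertex adjacent to all common neighbors of a is pulled into the hull.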

The following lemmas are straightforward to prove.

Lemma 4

ϕ is a hull operator, that is, ϕ is extensive, increasing, and idempotent. In detail: A ⊆ ϕ(A), A ⊆ A′ implies ϕ(A) ⊆ ϕ(A′), and ϕ(ϕ(A)) = ϕ(A).
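The three properties in Lemma 4 can be sanity-checked by brute force on a small graph. This is only an exhaustive test over all subsets of X, not a proof; the function names and the toy graph are assumptions, and σ of the empty set is again taken to be the whole opposite side.

```python
from itertools import combinations
from typing import Dict, Hashable, Iterator, Set

Vertex = Hashable


def sigma(adj: Dict[Vertex, Set[Vertex]], S, opposite: Set[Vertex]) -> Set[Vertex]:
    """Common neighbors of S (assumed convention: sigma(empty) = opposite side)."""
    common = set(opposite)
    for v in S:
        common &= adj.get(v, set())
    return common


def phi(adj, A, X: Set[Vertex], Y: Set[Vertex]) -> Set[Vertex]:
    """phi(A) = sigma(sigma(A)) for a subset A of X."""
    return sigma(adj, sigma(adj, A, Y), X)


def subsets(X: Set[Vertex]) -> Iterator[frozenset]:
    """All subsets of X (exponentially many, so only for tiny examples)."""
    items = sorted(X)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def check_hull_properties(adj, X: Set[Vertex], Y: Set[Vertex]) -> bool:
    """Exhaustively test that phi is extensive, increasing, and idempotent."""
    hull = {A: phi(adj, A, X, Y) for A in subsets(X)}
    for A, hull_A in hull.items():
        if not A <= hull_A:                      # extensive
            return False
        if phi(adj, hull_A, X, Y) != hull_A:     # idempotent
            return False
        for A2, hull_A2 in hull.items():         # increasing
            if A <= A2 and not hull_A <= hull_A2:
                return False
    return True


# Hypothetical toy graph with symmetric neighbor sets.
X, Y = {"a", "b", "c"}, {1, 2, 3}
adj = {"a": {1, 2}, "b": {1, 2, 3}, "c": {3},
       1: {"a", "b"}, 2: {"a", "b"}, 3: {"b", "c"}}
ok = check_hull_properties(adj, X, Y)
```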

Lemma 5

Consider any

Fine-tuning and analysis

We presume a uniform-cost model where dictionary operations need O(1) time. (We deal with subsets of vertices that can be stored as a table of size n, or as a hashtable, along with information about the cardinality.) In particular, a test whether S ⊆ T can be done in O(|S|) time, and the intersection S ∩ T of two sets can be computed in O(min{|S|,|T|}) time. (In a logarithmic cost model our analysis works similarly, just with logarithmic factors attached.) Since we aim at output-sensitive time
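The two set operations mentioned in this cost model can be sketched with hash sets as follows. This is an illustration under the assumption that membership tests take O(1) time, which for hash sets holds in expectation; the function names are hypothetical.

```python
from typing import Hashable, Set


def is_subset(S: Set[Hashable], T: Set[Hashable]) -> bool:
    """Test S ⊆ T with O(|S|) membership queries into T."""
    if len(S) > len(T):  # cardinalities are stored, so this check is O(1)
        return False
    return all(v in T for v in S)


def intersect(S: Set[Hashable], T: Set[Hashable]) -> Set[Hashable]:
    """Compute S ∩ T by scanning only the smaller set: O(min{|S|, |T|}) queries."""
    small, large = (S, T) if len(S) <= len(T) else (T, S)
    return {v for v in small if v in large}


# Hypothetical vertex sets.
S, T = {2, 3}, {1, 2, 3, 4, 5}
```

Scanning the smaller set is what makes the intersection cost independent of the larger set's size, which matters when a few vertices have very high degree.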

Acknowledgement

This work has been supported by the Swedish Foundation for Strategic Research (SSF) through Grant IIS11-0089 for a data mining project entitled “Data-driven secure business intelligence”.

