Parallel mining of large maximal quasi-cliques

Khalil, Jalal; Yan, Da; Guo, Guimu; Yuan, Lyuheng

doi:10.1007/s00778-021-00712-2

Regular Paper
Published: 26 November 2021

Volume 31, pages 649–674 (2022)
The VLDB Journal

Jalal Khalil¹, Da Yan¹ (ORCID: orcid.org/0000-0002-4653-0408), Guimu Guo¹ & Lyuheng Yuan¹

Abstract

Given a user-specified minimum degree threshold \(\gamma \), a \(\gamma \)-quasi-clique is a subgraph in which each vertex is connected to at least a \(\gamma \) fraction of the other vertices. Quasi-cliques are a natural definition of dense structures, so finding large, and hence statistically significant, quasi-cliques is useful in applications such as community detection in social networks and the discovery of significant biomolecule structures and pathways. However, mining maximal quasi-cliques is notoriously expensive, and even a recent algorithm for mining large maximal quasi-cliques is flawed and can lead to many repeated searches. This paper proposes a parallel solution for mining maximal quasi-cliques that is able to fully utilize CPU cores. Our solution uses divide and conquer to decompose the workload into independent tasks for parallel mining. It addresses two problems, (i) drastic load imbalance among tasks and (ii) the difficulty of predicting a task's running time and how that time grows with task-subgraph size, by (a) using a timeout-based task decomposition strategy and (b) using a priority task queue that schedules long-running tasks earlier for mining and decomposition, which avoids stragglers. Unlike our conference version in PVLDB 2020, where the solution was built on a distributed graph mining framework called G-thinker, this paper targets a single-machine multi-core environment, which is more accessible to an average end user. A general framework called T-thinker is developed to facilitate writing parallel programs for algorithms that adopt divide and conquer, including but not limited to our quasi-clique mining algorithm. Additionally, we consider the problem of directly mining large quasi-cliques from dense parts of a graph, where we identify the repeated-search issue of a recent method and address it using a carefully designed concurrent trie data structure. Extensive experiments verify that our parallel solution scales well with the number of CPU cores, achieving a 26.68\(\times \) runtime speedup with 32 mining threads when mining a graph with 3.77M vertices and 16.5M edges. Additionally, mining large quasi-cliques from dense parts can provide an additional speedup of up to 89.46\(\times \).
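
To make the definition at the start of the abstract concrete, the following is a minimal sketch of a \(\gamma \)-quasi-clique membership test. It follows only the degree condition as worded in the abstract; the function name `is_quasi_clique`, the adjacency-set representation, and the choice to reject vertex sets with fewer than two vertices are assumptions made for illustration, and any further conditions the full paper may impose (for example, connectivity or maximality) are not checked. It is not the authors' mining algorithm.

```python
def is_quasi_clique(adj, vertices, gamma):
    """Check the degree condition from the abstract: every vertex must be
    adjacent to at least a gamma fraction of the OTHER vertices in the set.

    adj      -- dict mapping each vertex to the set of its neighbours
                (hypothetical representation, undirected graph assumed)
    vertices -- iterable of vertices forming the candidate subgraph
    gamma    -- minimum degree threshold, 0 <= gamma <= 1
    """
    members = set(vertices)
    if len(members) < 2:
        # Assumption: trivial vertex sets are not treated as quasi-cliques.
        return False
    required = gamma * (len(members) - 1)
    for v in members:
        # Count only neighbours that lie inside the candidate set.
        inside_degree = len(adj[v] & (members - {v}))
        if inside_degree < required:
            return False
    return True


# Example graph: a 4-clique on {0, 1, 2, 3} with the edge 2-3 removed.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}}

print(is_quasi_clique(adj, [0, 1, 2, 3], 0.6))  # vertices 2 and 3 reach 2 of 3 others
print(is_quasi_clique(adj, [0, 1, 2, 3], 1.0))  # gamma = 1 demands a full clique
print(is_quasi_clique(adj, [0, 1, 2], 1.0))     # the triangle {0, 1, 2} is a clique
```

With \(\gamma = 1\) the test reduces to an ordinary clique check, which is why lowering \(\gamma \) admits the four-vertex set that is missing a single edge.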

Notes

https://github.com/beginner1010/topk-quasi-clique-enumeration

References

Abello, J., Resende, M.G.C., Sudarsky, S.: Massive quasi-clique detection. In: LATIN, volume 2286 of Lecture Notes in Computer Science, pp. 598–612. Springer (2002)
Bader, G.D., Hogue, C.W.: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform. 4(1), 2 (2003)
Batagelj, V., Zaversnik, M.: An O(m) algorithm for cores decomposition of networks. CoRR, cs.DS/0310049 (2003)
Bayardo Jr, R.J.: Efficiently mining long patterns from databases. In: SIGMOD Conference, pp. 85–93. ACM Press (1998)
Berlowitz, D., Cohen, S., Kimelfeld, B.: Efficient enumeration of maximal k-plexes. In: SIGMOD Conference, pp. 431–444. ACM (2015)
Bhattacharyya, M., Bandyopadhyay, S.: Mining the largest quasi-clique in human protein interactome. In: 2009 International Conference on Adaptive and Intelligent Systems, pp. 194–199. IEEE (2009)
Brunato, M., Hoos, H.H., Battiti, R.: On effectively finding maximal quasi-cliques in graphs. In: International Conference on Learning and Intelligent Optimization, pp. 41–55. Springer, Berlin (2007)
Bu, D., Zhao, Y., Cai, L., Xue, H., Zhu, X., Lu, H., Zhang, J., Sun, S., Ling, L., Zhang, N., et al.: Topological structure analysis of the protein-protein interaction network in budding yeast. Nucleic Acids Res. 31(9), 2443–2450 (2003)
COST in the Land of Databases. https://github.com/frankmcsherry/blog/blob/master/posts/2017-09-23.md
Chang, L., Yu, J.X., Qin, L., Lin, X., Liu, C., Liang, W.: Efficiently computing k-edge connected components via graph decomposition. In: SIGMOD Conference, pp. 205–216. ACM (2013)
Chen, H., Liu, M., Zhao, Y., Yan, X., Yan, D., Cheng, J.: G-miner: an efficient task-oriented graph mining system. In: EuroSys, pp. 32:1–32:12. ACM (2018)
Chou, Y.H., Wang, E.T., Chen, A.L.P.: Finding maximal quasi-cliques containing a target vertex in a graph. In: DATA, pp. 5–15. SciTePress (2015)
Chu, S., Cheng, J.: Triangle listing in massive networks. TKDD 6(4), 17:1–17:32 (2012)
Conde-Cespedes, P., Ngonmang, B., Viennet, E.: An efficient method for mining the maximal \(\alpha \)-quasi-clique-community of a given node in complex networks. Soc. Netw. Anal. Min. 8(1), 20 (2018)
Conte, A., Firmani, D., Mordente, C., Patrignani, M., Torlone, R.: Fast enumeration of large k-plexes. In: SIGKDD, pp. 115–124. ACM (2017)
Conte, A., Matteis, T.D., Sensi, D.D., Grossi, R., Marino, A., Versari, L.: D2K: scalable community detection in massive networks via small-diameter k-plexes. In: SIGKDD, pp. 1272–1281. ACM (2018)
Cui, W., Xiao, Y., Wang, H., Lu, Y., Wang, W.: Online search of overlapping communities. In: SIGMOD Conference, pp. 277–288. ACM (2013)
Fan, W., Jin, R., Liu, M., Lu, P., Luo, X., Xu, R., Yin, Q., Yu, W., Zhou, J.: Application driven graph partitioning. In: SIGMOD Conference, pp. 1765–1779. ACM (2020)
Guo, G., Yan, D., Özsu, M.T., Jiang, Z., Khalil, J.: Scalable mining of maximal quasi-cliques: an algorithm-system codesign approach. Proc. VLDB Endow. 14(4), 573–585 (2020)
Guo, G., Yan, D., Özsu, M.T., Jiang, Z., Khalil, J.: Scalable mining of maximal quasi-cliques: an algorithm-system codesign approach. CoRR, arXiv:2005.00081 (2020)
Hopcroft, J., Khan, O., Kulis, B., Selman, B.: Tracking evolving communities in large linked networks. Proc. Natl. Acad. Sci. 101(suppl 1), 5249–5253 (2004)
Jiang, D., Pei, J.: Mining frequent cross-graph quasi-cliques. ACM Trans. Knowl. Discov. Data 2(4), 16:1–16:42 (2009)
Joshi, A., Zhang, Y., Bogdanov, P., Hwang, J.: An efficient system for subgraph discovery. In: IEEE Big Data, pp. 703–712 (2018)
Lee, P., Lakshmanan, L.V.S.: Query-driven maximum quasi-clique search. In: SDM, pp. 522–530. SIAM (2016)
Li, J., Wang, X., Cui, Y.: Uncovering the overlapping community structure of complex networks by maximal cliques. Physica A Stat. Mech. Appl. 415, 398–406 (2014)
Liu, G., Wong, L.: Effective pruning techniques for mining quasi-cliques. In: ECML/PKDD, volume 5212 of Lecture Notes in Computer Science, pp. 33–49. Springer, Berlin (2008)
Lu, C., Yu, J.X., Wei, H., Zhang, Y.: Finding the maximum clique in massive graphs. Proc. VLDB Endow. 10(11), 1538–1549 (2017)
Lyu, B., Qin, L., Lin, X., Zhang, Y., Qian, Z., Zhou, J.: Maximum biclique search at billion scale. Proc. VLDB Endow. 13(9), 1359–1372 (2020)
Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N., Czajkowski, G.: Pregel: a system for large-scale graph processing. In: SIGMOD Conference, pp. 135–146 (2010)
Matsuda, H., Ishihara, T., Hashimoto, A.: Classifying molecular sequences using a linkage graph with their pairwise similarities. Theor. Comput. Sci. 210(2), 305–325 (1999)
McSherry, F., Isard, M., Murray, D.G.: Scalability! but at what cost? In: HotOS (2015)
Pattillo, J., Veremyev, A., Butenko, S., Boginski, V.: On the maximum quasi-clique problem. Discrete Appl. Math. 161(1–2), 244–257 (2013)
Pei, J., Jiang, D., Zhang, A.: On mining cross-graph quasi-cliques. In: SIGKDD, pp. 228–238. ACM (2005)
Qin, L., Yu, J.X., Chang, L., Cheng, H., Zhang, C., Lin, X.: Scalable big graph processing in mapreduce. In: SIGMOD Conference, pp. 827–838. ACM (2014)
Quamar, A., Deshpande, A., Lin, J.: Nscale: neighborhood-centric large-scale graph analytics in the cloud. VLDB J. 1–26 (2014)
Sanei-Mehri, S., Das, A., Tirthapura, S.: Enumerating top-k quasi-cliques. In: IEEE BigData, pp. 1107–1112. IEEE (2018)
Tanner, B.K., Warner, G., Stern, H., Olechowski, S.: Koobface: The evolution of the social botnet. In: eCrime, pp. 1–10. IEEE (2010)
Teixeira, C.H.C., Fonseca, A.J., Serafini, M., Siganos, G., Zaki, M.J., Aboulnaga, A.: Arabesque: a system for distributed graph mining. In: SOSP, pp. 425–440 (2015)
Wang, K., Zuo, Z., Thorpe, J., Nguyen, T.Q., Xu, G.H.: Rstream: Marrying relational algebra with streaming for efficient graph mining on a single machine. In: OSDI, pp. 763–782 (2018)
Weiss, D., Warner, G.: Tracking criminals on facebook: a case study from a digital forensics reu program. In: Proceedings of Annual ADFSL Conference on Digital Forensics, Security and Law (2015)
Yan, D., Bu, Y., Tian, Y., Deshpande, A.: Big graph analytics platforms. Found. Trends Databases 7(1–2), 1–195 (2017)
Yan, D., Bu, Y., Tian, Y., Deshpande, A., Cheng, J.: Big graph analytics systems. In: SIGMOD Conference, pp. 2241–2243. ACM (2016)
Yan, D., Cheng, J., Chen, H., Long, C., Bangalore, P.: Lightweight fault tolerance in pregel-like systems. In: ICPP, pp. 69:1–69:10. ACM (2019)
Yan, D., Cheng, J., Lu, Y., Ng, W.: Blogel: a block-centric framework for distributed computation on real-world graphs. Proc. VLDB Endow. 7(14), 1981–1992 (2014)
Yan, D., Cheng, J., Lu, Y., Ng, W.: Effective techniques for message reduction and load balancing in distributed graph computation. In: WWW, pp. 1307–1317 (2015)
Yan, D., Cheng, J., Özsu, M.T., Yang, F., Lu, Y., Lui, J.C.S., Zhang, Q., Ng, W.: A general-purpose query-centric framework for querying big graphs. Proc. VLDB Endow. 9(7), 564–575 (2016)
Yan, D., Cheng, J., Xing, K., Lu, Y., Ng, W., Bu, Y.: Pregel algorithms for graph connectivity problems with performance guarantees. Proc. VLDB Endow. 7(14), 1821–1832 (2014)
Yan, D., Guo, G.: Systems and algorithms for massively parallel graph mining. In: BigData. IEEE (2020)
Yan, D., Guo, G., Chowdhury, M.M.R., Özsu, M.T., Ku, W., Lui, J.C.S.: G-thinker: a distributed framework for mining subgraphs in a big graph. In: ICDE, pp. 1369–1380. IEEE (2020)
Yan, D., Guo, G., Khalil, J. et al.: G-thinker: a general distributed framework for finding qualified subgraphs in a big graph with load balancing. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00688-z
Yan, D., Guo, G., Chowdhury, M.M.R., Özsu, M.T., Lui, J.C.S., Tan, W.: T-thinker: a task-centric distributed framework for compute-intensive divide-and-conquer algorithms. In: PPoPP, pp. 411–412. ACM (2019)
Yan, D., Huang, Y., Liu, M., Chen, H., Cheng, J., Wu, H., Zhang, C.: Graphd: Distributed vertex-centric graph processing beyond the memory limit. IEEE Trans. Parallel Distrib. Syst. 29(1), 99–114 (2018)
Yan, D., Liu, H.: Parallel graph processing. In: Encyclopedia of Big Data Technologies. Springer (2019)
Yan, D., Qu, W., Guo, G., Wang, X.: Prefixfpm: A parallel framework for general-purpose frequent pattern mining. In: ICDE, pp. 1938–1941. IEEE (2020)
Yan, D., Qu, W., Guo, G. et al.: PrefixFPM: a parallel framework for general-purpose mining of frequent and closed patterns. The VLDB Journal (2021). https://doi.org/10.1007/s00778-021-00687-0
Yan, D., Tian, Y., Cheng, J.: Systems for Big Graph Analytics. Springer Briefs in Computer Science. Springer (2017)
Yang, Y., Yan, D., Wu, H., Cheng, J., Zhou, S., Lui, J.C.S.: Diversified temporal subgraph pattern mining. In: SIGKDD, pp. 1965–1974. ACM (2016)
Zeng, Z., Wang, J., Zhou, L., Karypis, G.: Coherent closed quasi-clique discovery from large dense graph databases. In: SIGKDD, pp. 797–802. ACM (2006)
Zhang, Q., Yan, D., Cheng, J.: Quegel: A general-purpose system for querying big graphs. In: SIGMOD Conference, pp. 2189–2192. ACM (2016)
Zhou, Y., Xu, J., Guo, Z., Xiao, M., Jin, Y.: Enumerating maximal k-plexes with worst-case time guarantee. In: AAAI, pp. 2442–2449. AAAI Press (2020)

Acknowledgements

This work was supported by NSF OAC-1755464 (CRII) and DGE-1723250 (SaTC). Guimu Guo acknowledges financial support from the Alabama Graduate Research Scholars Program (GRSP) funded through the Alabama Commission for Higher Education and administered by the Alabama EPSCoR.

Author information

Jalal Khalil, Da Yan and Guimu Guo have contributed equally to this work.

Authors and Affiliations

Department of Computer Science, University of Alabama at Birmingham, Birmingham, AL, 35233, USA
Jalal Khalil, Da Yan, Guimu Guo & Lyuheng Yuan

Corresponding author

Correspondence to Da Yan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khalil, J., Yan, D., Guo, G. et al. Parallel mining of large maximal quasi-cliques. The VLDB Journal 31, 649–674 (2022). https://doi.org/10.1007/s00778-021-00712-2

Received: 04 March 2021
Revised: 13 August 2021
Accepted: 19 October 2021
Published: 26 November 2021
Issue Date: July 2022
DOI: https://doi.org/10.1007/s00778-021-00712-2
